Red Storm experiments

This is the list of experiment candidates for the unrestricted jumbo mode allocation.

Microbenchmarks

POC: Oldfield/Kordenbrock

Performance measurements:

  1. Max throughput to OST (Portals/LWFS/LWFS w/buffers)
  2. optimal xfer size (LWFS test)
  3. optimal stripe count (OSTs/stripe) – (Lustre)
  4. optimal stripe size (bytes/chunk) – (Lustre)
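The measurements above amount to parameter sweeps. A minimal sketch of how the sweep points could be enumerated before writing the test scripts follows; the size ranges and stripe counts are placeholder assumptions, not values from this plan, and the helper names are hypothetical.

```python
from itertools import product

KIB = 1024
MIB = 1024 * KIB


def power_of_two_range(lo, hi):
    """Return lo, 2*lo, 4*lo, ... up to and including hi."""
    values = []
    v = lo
    while v <= hi:
        values.append(v)
        v *= 2
    return values


# Transfer sizes for the LWFS test (placeholder range: 64 KiB to 4 MiB).
xfer_sizes = power_of_two_range(64 * KIB, 4 * MIB)

# Lustre striping: every (stripe count, stripe size) pair (placeholder values).
stripe_counts = [1, 2, 4, 8]
stripe_sizes = power_of_two_range(1 * MIB, 4 * MIB)
stripe_grid = list(product(stripe_counts, stripe_sizes))

for count, size in stripe_grid:
    print(f"stripe_count={count} stripe_size={size // MIB}MiB")
```

Keeping the transfer-size sweep separate from the striping grid mirrors the list above, where the former is an LWFS test and the latter two are Lustre tests.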

Development reqs:

  1. just configuration and test scripts

Application: IOR

POC: Oldfield/Kordenbrock

Performance comparisons:

  1. LWFS-direct
  2. LWFS-sysio
  3. LWFS-overlay
  4. ADIOS?
  5. Lustre (we can run these later)

Metrics:

  1. Execution time/throughput
  2. Operation counts (total, histogram, throughput) – requires SDDF instrumentation
    1. requests
    2. overhead of security model with respect to opcounts
      1. get_caps
      2. verify_caps
      3. cache hits
    3. pending queue sizes over time (how well do we manage bursts)
  3. Parameters (to think about)
    1. number of I/O servers
    2. Total Bytes or Bytes/node for each experiment
    3. IOR params (variables)
      1. blocksize (bytes per task)
      2. transferSize (bytes per transfer)
      3. file-per-proc vs shared file
      4. …
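A rough sketch of how the metrics above could be derived once a run has been reduced to a list of records. The `(timestamp, op_name)` tuples are a hypothetical stand-in for parsed instrumentation output, not the SDDF format itself; the function names and the sample trace are made up for illustration. The aggregate-size helper assumes the usual IOR relationship of blocksize times segment count times number of tasks.

```python
from collections import Counter

MIB = 1024 * 1024


def throughput_mib_s(total_bytes, elapsed_s):
    """Aggregate throughput in MiB/s for one run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_bytes / MIB / elapsed_s


def op_counts(events):
    """Count operations by name; events is an iterable of (timestamp_s, op_name)."""
    return Counter(name for _, name in events)


def security_op_share(counts, security_ops=("get_caps", "verify_caps")):
    """Fraction of all counted operations that are capability operations."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[name] for name in security_ops) / total


def peak_queue_depth(samples):
    """Largest pending-queue size seen; samples is an iterable of (timestamp_s, depth)."""
    return max(depth for _, depth in samples)


def ior_aggregate_bytes(block_size, num_tasks, segment_count=1):
    """Total bytes moved by one IOR run: blocksize * segments * tasks."""
    return block_size * segment_count * num_tasks


# Made-up trace, for illustration only.
events = [
    (0.0, "write"), (0.1, "get_caps"), (0.2, "verify_caps"), (0.3, "write"),
    (0.4, "write"), (0.5, "read"), (0.6, "verify_caps"), (0.7, "write"),
]
counts = op_counts(events)
share = security_op_share(counts)
total = ior_aggregate_bytes(block_size=8 * MIB, num_tasks=64)
rate = throughput_mib_s(total, elapsed_s=4.0)
peak = peak_queue_depth([(0.0, 2), (0.5, 9), (1.0, 4)])
```

Reporting the capability operations as a share of all operations is only one way to read "overhead of security model with respect to opcounts"; a time-based overhead would need per-operation latencies as well.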

Development/Prep Reqs:

  1. IO server buffers (Todd's on it)
  2. startup-kill scripts (Ron's on it)
    1. look at lwfs-ping for timeout
  3. IOR-LWFS naming service integration (Ron)
  4. IOR-libsysio/LWFS get shared file case working correctly (Todd)
  5. lwfs_config processing (Ron)
  6. IOR-LWFS-overlay (Ron)

Application: Sage

POC: Kordenbrock

Performance comparisons:

  1. LWFS-sysio
  2. Lustre

Development/Prep Reqs:

  1. Build and link with latest LWFS/libsysio

Application: CTH

POC: Kordenbrock

Performance comparisons:

  1. LWFS-sysio
  2. Lustre

Development/Prep Reqs:

  1. Build and link with latest LWFS/libsysio

Application: GTC (with ADIOS)

POC: Widener

We may be able to get the GTC code that has been tested at ORNL and run it on Red Storm, replicating the experimental setups that we used for the SC07/ccgrid08 papers (neither of which was accepted): datatap to IOgraph. This is mostly an opportunity to test SSDS components at scale. We would measure throughput at the datatap, perturbation of GTC caused by the datatap, and throughput in the IOgraph. I'm not optimistic about getting this done unless more people get involved. I'm going to ping Wolf/Lofstead/Abbasi about the degree of difficulty here.

Application: GA Tech code (IOGraph, EVPath, ADIOS, ...)

POC: Widener

Metabot test: IOR writes a large output data set as separate smaller files (32?). Then we have a metabot come along afterward and combine those files into a single large HDF file.
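To make the metabot step concrete, here is a minimal sketch of the combine phase under simplifying assumptions: it appends the chunk files byte for byte into one output file, whereas the real metabot would write HDF. The function name, file names, and chunk contents are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def combine_chunks(chunk_paths, output_path):
    """Append each chunk file, in the given order, to output_path.

    Returns the total number of bytes taken from the chunk files.
    """
    total = 0
    with open(output_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            total += Path(path).stat().st_size
    return total


# Small demonstration with four 8-byte chunks in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    chunks = []
    for i in range(4):
        p = tmp / f"chunk_{i:02d}.dat"
        p.write_bytes(bytes([i]) * 8)
        chunks.append(p)
    written = combine_chunks(chunks, tmp / "combined.dat")
    merged = (tmp / "combined.dat").read_bytes()
```

The chunk order is taken from the caller, so a test harness would need to decide how chunks map to offsets in the combined file.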

IOgraph test: What the GT folks would like to do here is capture the smaller chunks as streaming output into an IOgraph. The IOgraph would then be responsible for streaming all the data back to a central point which would do the large HDF output. Note: this would require modifying IOR either to install a datatap or to produce output using EVPath.

Metrics:

  1. compare execution time & throughput for the normal IOR case vs. the metabot or IOgraph case
  2. parameters
    1. number of separate file chunks
    2. number of IOgraph branches
    3. number of metabots operating concurrently (I'm thinking one per output “location”, i.e. an LWFS storage server or parallel filesystem, but we may choose to write more than one chunk to a location)
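The comparison itself is simple arithmetic. A sketch follows, assuming the variant time covers the whole pipeline (including the metabot merge or the IOgraph drain) and that both cases move the same number of bytes; the function names and numbers are illustrative only.

```python
MIB = 1024 * 1024


def relative_overhead(baseline_s, variant_s):
    """Extra execution time of the variant as a fraction of the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (variant_s - baseline_s) / baseline_s


def compare_runs(baseline_s, variant_s, total_bytes):
    """Summarize the normal IOR case against a metabot or IOgraph case."""
    return {
        "overhead": relative_overhead(baseline_s, variant_s),
        "baseline_mib_s": total_bytes / MIB / baseline_s,
        "variant_mib_s": total_bytes / MIB / variant_s,
    }


# Illustrative numbers, not measurements.
result = compare_runs(baseline_s=100.0, variant_s=125.0, total_bytes=1000 * MIB)
```

One such summary per combination of chunk count, IOgraph branches, and concurrent metabots would cover the parameters listed above.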

Implementation:

  1. Mary has this metabot nearly complete, at last update.
  2. Build the test harness. We wanted to try this at small scale here at UNM before trying it on rsqual; I am hoping she'll be in a position to do this in the next few days.

If we have free time, we might also try to port IOR to the ADIOS framework. I'm not sure what that would buy us directly in terms of experimental value; it's more of a benefit to Jay. I do know that the GT guys would like any opportunity to run at scale with ADIOS, however.

 
/var/www/ssl/data/pages/lwfs/redstorm_experiments.txt · Last modified: 2008/02/19 16:25 by pmw