====== IO benchmarks using ProMC format ======
(E.May)
A number of IO tests are being performed on BlueGene/Q
===== Fig 1 =====
{{:
The first plot shows the (inverse) rate as a function
===== Fig 2 =====
Plot 2 shows the data arranged as a speed-up presentation. Note the use of log scales, which minimises the nonlinearity.
{{:
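
For reference, the speed-up plotted here is presumably the usual ratio of the 1-core time per event to the Nc-core time per event (in the R_1c, R_Nc notation defined under Fig 3 below); this convention is my assumption and is not stated explicitly on this page:

<code>
Speedup(Nc) == R_1c / R_Nc
</code>

With perfect scaling this equals Nc, i.e. a straight line of slope 1 on the log-log axes.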
===== Fig 3 =====
A more appropriate measure is shown in Plot 3, which is the effective utilization of
the multiple cores. Above 100 cores the fraction begins to drop, reaching only 20% for 1024 cores (and above).

{{:

This plot shows the efficiency (R_1c / Nc / R_Nc) vs. Nc, where

<code>
R  == sec/event
Nc == number of cores used in the job
</code>

For a perfect speed-up this would always be 1. My experience with clusters of
smaller size is that 80% is usually achievable, while 20% is quite low and the
usual interpretation is that the code has a high fraction of serialization. For this
case it would be more efficient to run 8 jobs of 512 cores than 1 job of 4096.
This of course is speculation on my part, as I have not identified the cause of
the inefficiency!
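
As a concrete illustration of how this efficiency number is formed, here is a minimal sketch; the (Nc, sec/event) values in it are hypothetical placeholders, not the measured Vesta numbers:

<code cpp>
#include <cstdio>

// Hypothetical (Nc, sec/event) pairs -- placeholders only, not measured data.
struct Point { int nc; double sec_per_event; };

int main() {
  const Point pts[] = { {1, 100.0}, {128, 0.90}, {512, 0.25}, {4096, 0.12} };
  const double r_1c = pts[0].sec_per_event;          // R_1c: 1-core sec/event

  for (const Point& p : pts) {
    // Perfect scaling would give R_Nc = R_1c / Nc, so the ratio below is 1.
    const double eff = r_1c / (p.nc * p.sec_per_event);   // R_1c / Nc / R_Nc
    std::printf("Nc = %5d   efficiency = %.2f\n", p.nc, eff);
  }
  return 0;
}
</code>

With these placeholder numbers the 512-core point comes out near 0.8 and the 4096-core point near 0.2, i.e. the two regimes discussed above.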
===== Fig 4 =====
{{:
The ALCF experts suggested that the I/O model of 1 directory, with many files in that 1 directory, would perform badly due to lock contention
on the directory! Thus the example code was modified to use a model
of 1 output ProMC data file per directory (a sketch of such a layout is given below). Running the modified code
produced the following figures:
http://
Focusing on the 'Efficiency' plot, we see
improvements both at low core numbers and at high core numbers: 80% rising to 90% and 20% rising to 40% respectively.

The large step between 512 and 1024 cores is still present!
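
The sketch below shows roughly what such a "1 file per directory" layout could look like for a per-rank writer. The directory naming scheme and the hand-off to the ProMC writer are my assumptions for illustration, not the exact modification made to the example code:

<code cpp>
#include <mpi.h>
#include <sys/stat.h>
#include <cstdio>
#include <string>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Parent directory is created once, by rank 0 only.
  if (rank == 0) mkdir("out", 0755);
  MPI_Barrier(MPI_COMM_WORLD);               // make sure "out/" exists everywhere

  // Each rank gets its own private directory and writes exactly one file
  // into it, so file creation and metadata updates never contend on a
  // directory shared with other ranks.
  char dir[64];
  std::snprintf(dir, sizeof(dir), "out/rank_%05d", rank);
  mkdir(dir, 0755);

  const std::string filename = std::string(dir) + "/events.promc";
  std::printf("rank %d writing %s\n", rank, filename.c_str());
  // ... open the ProMC writer on this filename and fill the events here ...

  MPI_Finalize();
  return 0;
}
</code>

The one-time creation of the per-rank sub-directories still touches the shared parent, but after that each rank's I/O stays entirely within its own directory.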

As part of the bootcamp for Mira the code was moved to the
BG/Q Mira and a subset of the benchmarks were run in the new
IO model. The results are shown in
http://
Again focusing on the Efficiency plot, the results are very
similar. This suggests that this naive IO model, in which each
MPI rank writes its own output ProMC file, should be limited to
jobs of 512 cores or less for good utilization of the machine
and IO resources. This is OK for Vesta, where the minimum charging
is 32 nodes (i.e. 512 cores). On Mira, however, the minimum is 512
nodes (i.e. 8192 cores), so there is not a good match!
[[hpc:
--- //
--- //
--- //