(written by S.Chekanov, ANL)
ProMC is a package for file input and output of structured event records (such as Monte Carlo or experimental data events). Its main features are summarized below.
ProMC (“Protocol Buffers” MC) is based on Google's Protocol Buffers, a language-neutral, platform-neutral and extensible mechanism for serializing structured data. It uses “varints” to store and compress integers using one or more bytes: smaller numbers take fewer bytes. This means that low-energy particles can be represented by fewer bytes, since the values that store their 4-momenta are smaller than those of high-energy particles. This is an important concept for storing events with many soft particles (“pileup”) in the same event record, since they use less disk space.
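As a simple illustration of this point, the Python sketch below (not part of ProMC itself) encodes integers with the base-128 “varint” scheme used by Protocol Buffers and prints how many bytes each value needs. It counts only the varint payload; a real ProMC record also stores Protocol Buffers field tags, so the on-disk sizes are slightly larger.

def varint_bytes(n):
    # base-128 ("varint") encoding: 7 payload bits per byte,
    # the highest bit of each byte flags whether another byte follows
    out = bytearray()
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

for value in (3, 270, 70000, 10**9):
    print("%10d -> %d byte(s)" % (value, len(varint_bytes(value))))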
This project is also tailored to the HEP ANL BlueGene/Q project, since it provides a simple and efficient way to stream data from/to BlueGene/P.
The main idea behind ProMC is to use “content-dependent” compression that stores particles according to their importance. A 14 TeV pp collision event with 140 soft pileup events can contain more than 10k particles, most of which have low pT (“soft”). If 4-momenta are encoded as integer values, soft particles can be represented by smaller numbers than the most interesting (“hard”) particles. If this information is encoded using Protocol Buffers “varints”, fewer bytes are needed to store the soft pileup particles. Read Protocol-buffers Encoding.
However, Protocol Buffers alone is not sufficient, since it can only write and read separate “messages” (i.e. single events). ProMC is designed to store multiple “messages” (or “events” in HEP terminology) in a single file in a platform-neutral way. It also constructs a header for the events and organizes the “messages” in a form suitable for Monte Carlo event records.
Example: a typical HEPMC file with 100 ttbar events, each with 140 pileup events (14 TeV), has a size of 1,230 MB. Gzipping this file reduces the size to 445 MB, but then the file can no longer be read directly. The main objective is to store such events in a platform-independent file of about 300 MB that can still be read (even with random access). As shown below, this goal has been achieved with the ProMC package.
When we say “compression”, we typically mean some compression algorithm. In ROOT, up to 50% of CPU time is spent on compression/decompression of data. ProMC does not use any algorithm to compress or decompress files: it simply streams data into a compact binary format without this CPU overhead.
Each event in a ProMC file is a “Protocol Buffers” message. Float values (pX, pY, pZ, M) are encoded as “int64” integers. In such a representation, 0.01 MeV is the minimum allowed energy, while 24 TeV is the maximum allowed energy.
Here is a mapping table:
Energy | Representation | Bytes in encoding |
---|---|---|
0.01 MeV | 1 | 1 byte |
0.1 MeV | 10 | 1 byte |
1 MeV | 100 | 2 bytes |
1 GeV | 100 000 | 4 bytes |
1 TeV | 100 000 000 | 8 bytes |
20 TeV | 2 000 000 000 | 8 bytes |
Thus, the 4-momentum of a soft particle (~MeV) can be represented by fewer bytes than with fixed-length encoding. For a typical pT spectrum (a falling distribution), this means that the bulk of the particle spectrum at low pT is compressed more effectively than the particles in the high-pT tail. The achieved compression therefore depends on the pT spectrum.
The table shows the mapping between energy values (in GeV) and their C++/Java integer representation using the int64 varint type of the Protocol Buffers library and the multiplicative factor 100000 of the default ProMC varint conversion. The last column indicates the approximate number of bytes used in the encoding.
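A minimal sketch of this conversion, assuming only the multiplicative factor 100000 quoted above (i.e. one integer unit corresponds to 0.01 MeV); the helper names are illustrative and not part of the ProMC API:

UNITS_PER_GEV = 100000   # integer units per GeV, i.e. 1 unit = 0.01 MeV

def to_units(e_gev):
    # energy in GeV -> integer representation stored as a varint
    return int(round(e_gev * UNITS_PER_GEV))

def to_gev(units):
    # integer representation -> energy in GeV
    return float(units) / UNITS_PER_GEV

for e in (0.00000001, 0.001, 1.0, 1000.0, 0.123456789):
    u = to_units(e)
    print("E = %.9f GeV -> %d -> %.9f GeV (rounding error %.1e GeV)"
          % (e, u, to_gev(u), abs(e - to_gev(u))))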
There are other places where Google's varint compression is efficient for MC event records. For example, partons typically have small integer values of PDG_ID, thus they are compressed more effectively using “varints” than final-state or exotic particles with large PDG_ID numbers. Also, light particles (partons) are compressed more effectively due to their small masses.
Another place where ProMC optimizes storage is the particle masses: the masses of the most common particles are kept as a map in the header message of the record. For example, the masses of pions and kaons can be set to 0 in the event record (stored as a minimal varint). During reading, the masses are restored using the map stored in the header.
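The idea can be sketched in a few lines (the dictionary below stands in for the particle-data map kept in the ProMC header; the function is illustrative, not part of the ProMC API):

# PID -> mass (GeV) map, stored once in the header message
pdg_mass = {211: 0.13957, 321: 0.49368, 2212: 0.93827}

def restore_mass(pid, stored_mass):
    # a mass written as 0 in the event record is looked up in the header map
    if stored_mass == 0 and pid in pdg_mass:
        return pdg_mass[pid]
    return stored_mass

print(restore_mass(211, 0))        # pion mass restored from the header map
print(restore_mass(25, 125.0))     # a non-zero stored mass is kept as is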
A typical ProMC file has four major Protocol Buffers messages: a file description, a header, the event records, and a statistics record.
ProMC supports random access, i.e. you can read the “Header”, “Statistics”, and any event record from any place in your program.
The data layouts inside ProMC files are implemented using Google's Protocol Buffers template files; see the Protocol Buffers language guide for the syntax used in such files. These files can be used to generate analysis code in any programming language (C++, Java, Python, etc.). A few template (.proto) files are used to create and read ProMC files.
These files are shipped with the default installation and are suitable for keeping truth-level MC information. More complicated data layouts are given in the examples/proto directory (to keep jets, leptons, jets with constituents, etc.).
The proto files (ProMCHeader.proto, ProMC.proto, etc.) can be embedded in the ProMC file record, making the file “self-describing”. It is recommended to embed these files, since one can later use them to generate analysis code in any programming language. This is ideal for preserving data and making future modifications without knowing the analysis code used to create the original data. See the tutorials for examples.
To embed the layout templates inside a ProMC file, simply make a directory “proto” and copy (or link) these files into it. If such files are embedded, you can later retrieve the proto files and generate C++/Java/Python code that reads the data using the same data structures used to create the file.
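For instance, the embedded templates can be pulled out with a few lines of Python and then passed to the Protocol Buffers compiler; this only sketches what the promc_proto and promc_code commands described below do, and assumes a file named out/output.promc as in the other examples:

import os, zipfile

z = zipfile.ZipFile("out/output.promc", "r")
if not os.path.exists("proto"):
    os.mkdir("proto")
for name in ("ProMC.proto", "ProMCHeader.proto"):   # embedded layout templates
    if name in z.namelist():
        open(os.path.join("proto", name), "wb").write(z.read(name))
# the extracted templates can then be compiled, e.g.:
#   protoc --python_out=. proto/*.proto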
Optionally, you can also include a log file in the ProMC files. If a file “logfile.txt” exists in the same directory where you write ProMC files, it will be included in the ProMC record (and compressed).
ProMC Commands | Description |
---|---|
promc_info <file> | analyzes the file and shows its description |
promc_browser <file> | starts a Java browser to look at the records |
promc_browser <file URL> | as above, but reading the file over http/ftp |
promc_dump <file> | dumps all the information |
promc_extract <file> <out> N | extracts N events and saves them to out.promc |
promc_proto <file> | extracts the self-description of a ProMC file and generates a “proto” directory describing the data inside the file |
promc_code | generates source files using the “proto” directory. C++ code is generated in the directory src/, while Java code goes to java/src |
promc_log <file> | extracts the log file “logfile.txt” (if attached to the ProMC file) |
promc_split <ProMC file> N | splits a ProMC file into N files in the directory “out” |
Certain programs need to be recompiled if the original event structure has changed. For example, the promc_split program is not installed during the installation step and needs to be compiled manually:
cp -r $PROMC/examples/promc_split .
cd promc_split
promc_proto file.promc    # extracts the data layout
promc_code                # generates header files in src/
make                      # compiles
promc_split file.promc 7  # splits the original file into 7 files in the directory out/, with the same number of events in each
There are also other ways to work with ProMC files. Here is a small Python script which reads a ProMC file and extracts its self-description (including the embedded proto files and the log file):
import zipfile
z = zipfile.ZipFile("out/output.promc", "r")
print z.read("promc_nevents")        # number of events in the file
print z.read("promc_description")    # description
print z.read("ProMCHeader.proto")    # embedded Protocol Buffers templates used to describe the messages
print z.read("ProMC.proto")
print z.read("logfile.txt")          # embedded log file
for filename in z.namelist():        # loop over all entries
    print filename
    # bytes = z.read(filename)
    # print len(bytes)
This is a Python script. You can also read this information in Java and PHP, as long as you can read an entry inside a zip file.
Since a ProMC file is a ZIP archive with binary Protocol Buffers messages, you can work with it as with any ZIP file.
Here is an example of how to list all entries inside a ProMC file:
wget http://atlaswww.hep.anl.gov/asc/promc/download/Pythia8.promc
unzip -l Pythia8.promc
You can uncompress the events into files with “unzip Pythia8.promc”. You can also extract any single event, say event “100”, as:
unzip -p Pythia8.promc 100 > 100.event
unzip -p Pythia8.promc ProMC.proto > ProMC.proto
This example saves event “100” in the file “100.event”. The second line extracts the Protocol Buffers template that was used to pack the events into the file.
Similarly, one can look at the attached logfile and print the number of stored events as:
unzip -p <ProMC file> logfile.txt
unzip -p <ProMC file> promc_nevents
In these examples, the contents of the entries “logfile.txt” and “promc_nevents” are sent via a pipe to the shell console.
ProMC Commands | Description |
---|---|
hepmc2promc <HEPMC input> <ProMC output> “description” | converts HepMC file to ProMC file |
promc2hepmc <ProMC input> <HepMC output> | converts ProMC file to HEPMC file |
stdhep2promc <StdHEP input> <ProMC output> | converts StdHEP file to ProMC file |
promc2root <ProMC input> <ROOT output> | converts ProMC file to ROOT |
promc2stdhep <ProMC input> <STDHEP output> | converts ProMC file to STDHEP |
promc2lcio <ProMC input> <LCIO output> | converts ProMC file to LCIO |
lhe2promc <LHE input> <ProMC output> | converts LHEF (TXT) file with MC events to ProMC |
txt2promc <TXT input> <ProMC output> | converts TXT file with MC events to ProMC |
java -cp .:browser_promc.jar hepsim.MixPileup pN signal.promc minbias.promc output.promc | pileup mixer: mixes N random minbias events, drawn from a Poisson distribution, with the signal events |
Note that the conversion tools (hepmc2promc, promc2hepmc) are built during the installation (see the installation instructions). The tool that converts STDHEP to ProMC (stdhep2promc) needs to be compiled inside the examples/stdhep2promc directory. To build stdhep2promc, do this:
cd examples/stdhep2promc
cd stdhep/
make
cd ..
make
Then convert STDHEP to ProMC as:
stdhep2promc file.stdhep file.promc "description" "Nr events" "cross section in pb" "error on cross section"
As shown before, “promc2root” converts ProMC files to ROOT. Let us give an example:
cp -rf $PROMC/examples/promc2root . cd promc2root make prepare make # compile converter wget http://mc.hep.anl.gov/asc/hepsim/events/pp/100tev/qcd_pythia8/pythia100qcd_001.promc ./promc2root pythia100qcd_001.promc pythia100qcd_001.root
This will generate the file pythia100qcd_001.root with the needed branches. The ROOT files are typically 30-50% bigger than the ProMC files, and the processing time of ROOT files is longer. See the benchmarks later.
An example which shows how to read such a ProMC file can be found in “examples/random/reader.cc”. Starting from Pythia8 version 8.180, ProMC is included in the Pythia8 package; see the example “main46.cc” in the “examples” directory of Pythia8. There are two possible options to initialize ProMC for writing:
ProMCBook* epbook = new ProMCBook("file.promc","w");
Events will be written to disk after each “write()” call. The total number of events cannot be larger than 65k; after this limit, ProMC will attempt to close the file. Use this option to write large events while limiting their number to 65k.
ProMCBook* epbook = new ProMCBook("file.promc","w",true);
With this option, all events are accumulated in memory and written to disk after the “close()” statement. Use this option to write “small” events, such as those from NLO and parton-level generators. There is no limit on the number of stored events, since this option uses “zip64”.
ProMCBook* epbook = new ProMCBook("file.promc","r");
For reading, ProMC determines the number of events and uses the appropriate library. If the number of events is larger than 65k, it will use the zip64 library (slightly slower for reading).
Kyle Strand (summer 2013), Ed May (ANL)
ProMC files can be filled using C++ (see the examples using PYTHIA8), Java, Python and FORTRAN. FORTRAN is still a popular language for some Monte Carlo models, thus an effort was made to develop a program that fills ProMC files from FORTRAN-based models.
A program called “FortranProMC” was developed by Kyle Strand (summer 2013) to help fill ProMC files from FORTRAN. It uses PYTHIA6 as an example. The program can be downloaded from the ProMC download area on HEPFORGE.
To test it, untar the package and run:
cd FortranProMC
ln -s $PROMC/proto/promc proto   # make sure the files are self-describing
make
./pyt.exe > logfile.txt 2>&1     # the log file will be attached to the ProMC file
You will see “Pythia6.promc” with events. Use the browser to look inside.
There are two versions of FortranProMC. FortranProMC-1.1 is the most recent version, developed by E. May, and you are encouraged to use it. The download links are HEPFORGE, or you can copy it from FortranProMC-1.1.tgz.
You can extract a given record/event using the random-access capabilities of this format. This is trivial to do in Java. For a C++ example, check the code in “examples/random_access”. Type make to compile it and run the code. You will see that the needed event can be extracted using the method “event(index)”.
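The same random access can be sketched in Python, using the fact that each event is stored as a zip entry named by its event number (see the unzip examples above). The ProMC_pb2 module mentioned in the comments is what protoc would generate from the embedded ProMC.proto; it is not part of this snippet:

import zipfile

z = zipfile.ZipFile("Pythia8.promc", "r")
raw = z.read("100")                       # raw Protocol Buffers message for event number 100
print("%d bytes in event 100" % len(raw))
# to decode it, use the classes generated from the embedded .proto files, e.g.
#   import ProMC_pb2
#   event = ProMC_pb2.ProMCEvent()
#   event.ParseFromString(raw)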
You can stream data from a remote server without downloading the ProMC files. The easiest way is to use the Python reader (see the example in examples/python). Below we show how to read one single event (event=100) remotely using Python:
# Shows how to read a single event from a remote file. S.Chekanov
import urllib2, cStringIO, zipfile
url = "http://mc.hep.anl.gov/asc/snowmass2013/delphes36/TruthRecords/higgs14tev/pythia8/pythia8_higgs_1.promc"
try:
    remotezip = urllib2.urlopen(url)
    zipinmemory = cStringIO.StringIO(remotezip.read())
    zip = zipfile.ZipFile(zipinmemory)
    for fn in zip.namelist():
        # print fn
        if fn == "100":
            data = zip.read(fn)
            print "Read event=100"
except urllib2.HTTPError:
    print "no file"
In this example, “data” represents a ProMC event record. Look at the example in examples/python to see how to print such information.
You can look at events and other information stored in the ProMC files using a browser implemented in Java. It runs on Linux/Windows/Mac without any external libraries. First, get the browser:
wget http://atlaswww.hep.anl.gov/asc/promc/download/browser_promc.jar
And run it as follows (it assumes Java 7 or above; check with “java -version”, which should show a 1.7.X version):
java -jar browser_promc.jar
Now we can open a ProMC file. Let's get an example ProMC file which keeps 1,000 events generated by Pythia8:
wget http://atlaswww.hep.anl.gov/asc/promc/download/Pythia8.promc
On Windows/Linux, one can open this file in the browser as: [File]→[Open file]. Or you can open it using the prompt:
java -jar browser_promc.jar Pythia8.promc
Reading data over the network is also possible (http/ftp):
java -jar browser_promc.jar http://atlaswww.hep.anl.gov/asc/promc/download/Pythia8.promc
This opens the file and shows the metadata (i.e. information stored in the header and statistics records):
On the left, you will see event numbers. Double-click on any number. The browser will display the event record with all stored particles for this event (PID, Status, Px, Py, Pz, etc.).
You can access metadata on particle data, such as information on particle types, PID and masses using the [Metadata]→[Particle data] menu. This record is common for all events (ProMC does not store particle names and masses for each event).
If the ProMC file was made “self-describing” and stores templates for proto layouts used to generate analysis code, you can open the “Data layout” menu:
This information can be used to generate analysis code and make future modifications to the existing file. Use the “promc_proto” command to extract such files, and “promc_code” to generate the analysis code. See the tutorial section.
You can look at event information (process ID, PDF, alphaS, weight) by navigating with the mouse to an event number on the left and clicking the right mouse button. A pop-up menu appears; select “Event information”.
If the ProMC package is installed, you can use a shorter command to launch the browser:
promc_browser Pythia8.promc
To view the ProMC-encoded entries, you can also use a generic ZIP browser which helps to see what is inside ProMC files:
wget http://atlaswww.hep.anl.gov/asc/promc/download/zipany.jar
java -jar zipany.jar Pythia8.promc
It will pop up a GUI with the entries inside the file. This tool is called ZipAnywhere and is licensed for non-commercial usage.
This tool is useful for a quick look at the raw entries. Note that you cannot decode the events this way, since that requires the external Java library. Of course, you can also simply unzip the ProMC file (but this may create a very large number of files!).
ProMC files can be read using Java without native C++ libraries; see the tutorials. You only need a single jar file (browser_promc.jar) to access all information inside a ProMC file. Look at the tutorials which show how to use it inside a Java program. In addition, you can browse the events with the same jar file:
wget http://atlaswww.hep.anl.gov/asc/promc/download/browser_promc.jar
java -jar browser_promc.jar file.promc
The Java API of the part that accesses information inside ProMC files is here. You only need a few classes if you use the default data layout shipped with ProMC.
If you create a different data layout, you can generate Java API yourself.
Since a ProMC file is a zip archive, you can check files as:
unzip -l file.promc
which will list all ProtocolBuffers entries. You can also view information about the files as:
promc_info file.promc
(this command assumes that ProMC library is installed). Another command is:
promc_dump file.promc
which, similar to “unzip”, lists all entries.
You can set a number of environment variables to simplify input and output. Note that the “PROMC” variable should always be set, while the other variables are optional.
You can plot histograms for the desired variables using several approaches.
Use the Delphes fast detector simulation program to process the MC events. Delphes can read ProMC files directly using “readers/DelphesProMC.cxx” inside the Delphes package.
Here are the steps to do fast detector simulations using ProMC files from the HepSim repository:
wget http://mc.hep.anl.gov/asc/hepsim/events/pp/100tev/higgs_pythia8/pythia100higgs_001.promc
./DelphesProMC examples/delphes_card_ATLAS.tcl pythia100higgs_001.root pythia100higgs_001.promc
ProMC is a convenient data format for saving unweighted events from NLO programs. See the data layout example in the examples/proto/mcfm directory. ProMC is easy to deploy on BlueGene/Q and other supercomputers where, often, CERNLIB (for NTUPLEs) and ROOT (for ROOT trees) are not available. For NLO, you can reduce the information stored for output particles.
See the NLO data description here: ProMC.proto - description file for NLO.
NLO programs typically produce only a few particles per event, with little information about them. Instead, substantial information about uncertainties (for example, PDF uncertainties) is needed. You can encode uncertainties as varints using this algorithm: error[i] = int((PDF[i]/PDF[0]) * 10000). The integer array “error” keeps the deviations from the central PDF value. Such values are small, so the varint encoding is very effective. For CT10, for example, the array will have 52 integer values per event. Use the array “idata” in ProMC.proto to fill such values.
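A small sketch of this packing, following the formula quoted above (the PDF values themselves are made up for illustration):

# PDF[0] is the central value, the remaining entries are the error-set members
PDF = [0.8251, 0.8312, 0.8190, 0.8423, 0.8117]

error = [int((PDF[i] / PDF[0]) * 10000) for i in range(len(PDF))]
print(error)
# the resulting integers are small and vary little from member to member,
# so they are stored compactly as varints in the "idata" array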
You can view these files as
wget http://atlaswww.hep.anl.gov/asc/promc/download/browser_promc.jar
java -cp browser_promc.jar probrowser.NLO http://mc.hep.anl.gov/asc/hepsim/events/pp/100tev/higgs_ttbar/httbar_0000000.promc
On the left panel, click on an event and then look at “Event info”. It shows the integer values (“idata”) that encode the PDF uncertainties, while the float array (“fdata”) shows other information. The first element of the float array is the weight of the event. The particle information is shown as usual (but without mother ID, etc.).
ProMC is used to store truth MC event records (about a factor of two more compact than compressed HEPMC files). ProMC was also used for Snowmass 2012-2013 to keep Delphes fast-simulation files (including reconstructed jets and other objects). See the Snowmass web page. Also look at the MC fast simulation repository for Snowmass 2013.
Use this Doxygen description to work with C++:
ProMC Java API that is used to store data:
This is documentation for CPython:
This table shows file sizes and access speed for 10,000 ttbar events at a 14 TeV collider. The same information is stored in the different file formats. The files for these benchmarks, generated using PYTHIA8, are located in the HepSim repository. Please take a look at the C++ code which shows how to fill such files. HEPMC and LHEF files were tested after GZIP, BZIP2 and LZMA compression. It should be noted that the ProMC file includes a complete log file, a particle data table, and the ProMC file layouts; this information is not included in the other file formats.
The ProMC file is about 38% smaller than the ROOT file based on the Double32 type for float values (such as 4-momenta of particles) and the INT type for integer values (id, status, mothers, etc.), with the default ROOT compression. ProMC files are 60% smaller than HEPMC and LHE files compressed with LZMA and BZIP2.
File format | File Size (MB) | C++ (sec) | CPython (sec) | Java (sec) | Jython (sec) |
---|---|---|---|---|---|
ProMC | 307 | 15.8 | 980 | 11.7 (12.1 +JVM startup) | 33.3 (35 +JVM startup) |
ROOT | 423 | 20.4 | 66.7 (PyROOT) | - | - |
LHEF | 2472 | 84.7 | 30.4 | 9.0 (9.6 +JVM startup) | - |
HEPMC | 2740 | 175.1 | - | - | - |
LHEF (gzip) | 712 | - | - | - | - |
LHEF(bzip2) | 552 | - | - | - | - |
LHEF (lzma) | 513 | - | - | - | - |
HEPMC (gzip) | 1021 | - | - | - | - |
HEPMC (bzip2) | 837 | - | - | - | - |
HEPMC (lzma) | 802 | - | - | - | - |
Table 1. Benchmark tests for reading files with 10,000 ttbar events stored in different file formats. For each test, the memory cache on Linux was cleared. In the case of C++, the benchmark program reads complete event records using the appropriate libraries. The CPython code for ProMC files is implemented in pure CPython and does not use a C++ binding (unlike PyROOT, which uses C++ libraries). In the case of LHEF files, the Java and CPython benchmarks only parse lines and tokenize the strings, without attempting to build an event record; therefore, such benchmarks may not be accurate when comparing with ProMC and ROOT.
The benchmark for the read speed was performed using C++ code compiled on an Intel(R) Xeon(R) CPU X5660 @ 2.80GHz. For these benchmark tests, the file is opened and all entries with 4-momenta of particles are extracted; no calculations are performed. As expected, reading the ProMC file is 30% faster than reading the ROOT files, and substantially faster than the other formats. This difference in the read speed is roughly proportional to the file sizes. Files after compression were not tested; a typical time for file decompression is 2-3 minutes. The Java Virtual Machine (JVM) processes the ProMC files faster, which indicates that the JVM creates more optimized code and delivers better performance.
No significant difference in the write speed was detected (this test was dominated by event generation).
Files that contain pile-up events are about 50-60% smaller than ROOT files, since soft particles (small momentum values) are stored as varints using a smaller number of bytes.
ProMC is a rewrite of an older package (CBook) for the community-supported jHepWork (this program is currently named DataMelt). The current ProMC version is based on the HEvent record format examples and the zipios++ library, which has been publicly available since 2008. (S.C.)
ProMC is licensed under the GNU General Public License v3 or later; please refer to GPL-v3.0. It should be noted that the project uses the Protocol Buffers and ZIPIOS++ libraries, which come with their own licenses.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
The text of this manual cannot be freely redistributed and is subject to the Creative Commons Attribution-Share Alike License; either version 3.0 of the License, or any later version. See By-SA. You are free to copy, distribute, transmit, and adapt ProMC under the following conditions:
No limitations have been detected so far. With zip64 support (starting from ProMC v1.4), the number of entries can be arbitrarily large. For zip64, you should open the ProMC file with the call:
ProMCBook* epbook = new ProMCBook("file.promc","w",true);
With this option, events are collected in memory and written after the “close()” statement. Without the last option, the number of entries is limited to 65k (but events are written after each call).
ProMC can read files created with zip64 from Java, Python and C++ (as shown above).
Bibtex entry:
@article{Chekanov20142629,
  title    = "ProMC: Input–output data format for {HEP} applications using varint encoding",
  journal  = "Computer Physics Communications",
  volume   = "185",
  number   = "10",
  pages    = "2629 - 2635",
  year     = "2014",
  issn     = "0010-4655",
  doi      = "http://dx.doi.org/10.1016/j.cpc.2014.06.016",
  url      = "http://www.sciencedirect.com/science/article/pii/S0010465514002215",
  author   = "S.V. Chekanov and E. May and K. Strand and P. Van Gemmeren",
  keywords = "Data, Format, {IO}, Input–output, {LHC}",
  abstract = "A new data format for Monte Carlo (MC) events, or any structural data, including experimental data, is discussed. The format is designed to store data in a compact binary form using variable-size integer encoding as implemented in the Google's Protocol Buffers package. This approach is implemented in the ProMC library which produces smaller file sizes for {MC} records compared to the existing input–output libraries used in high-energy physics (HEP). Other important features of the proposed format are a separation of abstract data layouts from concrete programming implementations, self-description and random access. Data stored in ProMC files can be written, read and manipulated in a number of programming languages, such C++, JAVA, {FORTRAN} and PYTHON."
}
— Sergei Chekanov 2013/03/11 21:45 — Sergei Chekanov 2013/05/03 12:58