Summary of issues encountered in the Condor system

Features of the Condor system used

 * SRM

The script uses SRM commands to access the storage so that different Condor jobs can share data with one another.


 * Dagman

The script uses Dagman to establish a parent-child relationship between the collision simulations and the post-event data analysis.

Outline of the script
The following is the scheme of the script. JOB events calculates, for example, 5000 instances of events. It is defined as the parent JOB of all other JOBs. In the following example, JOB postevents-histo evaluates average multiplicities and writes the output as histograms.
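The parent-child scheme described above can be sketched as a small Python helper that writes out Dagman input text. This is only an illustration: the job names events and postevents-histo come from the description, while the helper build_dag and the submit-file names (events.sub, postevents-histo.sub) are assumptions made for this example.

```python
from typing import List


def build_dag(parent: str, children: List[str]) -> str:
    """Return Dagman input text with one parent JOB and its child JOBs."""
    lines = []
    for name in [parent] + children:
        # Assumption: each JOB has a submit file named "<job name>.sub".
        lines.append(f"JOB {name} {name}.sub")
    if children:
        # A child starts only after the parent has finished successfully.
        lines.append(f"PARENT {parent} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


dag_text = build_dag("events", ["postevents-histo"])
print(dag_text)
```

Because every child waits for the parent JOB to finish, a single event that never completes is enough to keep the analysis JOB from starting.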

Each instance of JOB events generates data files, which are transferred to the storage so that JOB postevents-histo can access them at a later time. Among other commands, I have used the SRM command srm-copy to copy to and from the storage. Compared with the alternative command srmcp, srm-copy offers a recursive copy feature. I used it in the following way.

Each event involves transferring around 200-300 MB of data for a typical central (0-5%) collision. One needs 5000+ events for the calculation of collective flow and correlations.
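To give a feeling for the total volume implied by these numbers, here is a rough estimate in Python. It assumes exactly 5000 events and decimal units (1 TB = 10^6 MB); the function name total_terabytes is made up for this example.

```python
def total_terabytes(n_events: int, mb_per_event: float) -> float:
    """Total transferred data in TB, using decimal units (1 TB = 10**6 MB)."""
    return n_events * mb_per_event / 1e6


N_EVENTS = 5000  # lower bound quoted in the text
low = total_terabytes(N_EVENTS, 200)   # 200 MB per event
high = total_terabytes(N_EVENTS, 300)  # 300 MB per event
print(f"{low:.1f}-{high:.1f} TB per centrality window")
```

So one centrality window moves roughly 1.0-1.5 TB through the storage.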

The problem
The script works well when one submits 100-200 events. However, as the number of events goes up, each event has around a 2-4% chance of failing. (This is in fact more than what happens when I run the same code on a normal Linux machine.) When an event fails, it goes into the "H" (held) state and stays there forever; one can use "condor_release", but this does not solve the problem. So for a total of 5000+ events, 15+ events will fail, and when this happens Condor holds all other events to wait for the failed event to finish (due to the parent-child relation established by Dagman), which never happens. It is quite frustrating because the Condor system does not report why the event failed, so if the failure was due to a real programming mistake, the unlucky user has to figure this out by themselves.

After debugging, I modified the script to skip those events that have already been calculated successfully. If a few events fail during the first run, I only need to run the same script again, and in principle the script will only deal with the failed events, which it identifies by checking for the existence of certain files on the storage. Since the SRM commands do not have an "if -s" feature as in bash, I did the following.
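The skip-if-done idea can be sketched in Python as below. This is an illustration only, not the exact commands used in the script. It assumes that the storage listing command (supplied by the caller, since the exact SRM invocation is not fixed here) exits with status 0 when the file exists and with a non-zero status otherwise. The names remote_file_exists and events_to_run are hypothetical. Unlike "if -s" in bash, this only tests for existence, not for a non-empty file.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence


def remote_file_exists(url: str, list_cmd: Sequence[str],
                       run: Callable = subprocess.run) -> bool:
    """Return True if the listing command succeeds for the given URL.

    Assumption: list_cmd (e.g. an SRM listing client and its options)
    returns exit status 0 only when the remote file exists.
    """
    result = run(list(list_cmd) + [url],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def events_to_run(event_ids: Iterable[int],
                  is_done: Callable[[int], bool]) -> List[int]:
    """Keep only the events whose output is not yet on the storage."""
    return [i for i in event_ids if not is_done(i)]


# Demonstration with a fake "storage": events 0, 1 and 3 already finished.
finished = {0, 1, 3}
pending = events_to_run(range(5), lambda i: i in finished)
print(pending)  # [2, 4]
```

In a real run, is_done would call remote_file_exists with the URL of a marker file for that event, so that a second submission only recalculates the events that failed the first time.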

Unfortunately, I found that Condor still has a good chance of failing those events even when the script only performs some checks and eventually skips the calculation. I think the problem might be caused by the SRM commands, but I cannot prove my guess, nor do I know where to start, since I do not see anything in the log file. As a result, the calculation sometimes never moves forward. The current version of the script works well if one calculates 100+ events each time, since one can choose to manually download the data files to a local computer for further analysis at any time. Unfortunately, this is too expensive when one has to calculate flow using 5000+ events for each centrality window.

Please contact me if you need any more details.