User Tools

Site Tools


hpc:argobalsam

ARGO (A Rapid Generator Omnibus) & Balsam

Authors:
J Taylor Childers (ANL HEP)
Tom Uram (ANL ALCF)

Description & Versions

Balsam is an interface to a batch system's local scheduler. The each scheduler is abstracted such that Balsam remains scheduler independent.

ARGO is a workflow manager. ARGO can submit jobs to any system on which Balsam is running.

Both ARGO and Balsam are implemented in Python as django apps. This was done because django provides some services by default that are needed such as database handling and web interfaces for monitoring job statuses. django version 1.6 has been used and Python 2.6.6 (with GCC 4.4.7 20120313 (Red Hat 4.4.7-3)).

The communication layer is handled using RabbitMQ and pika. RabbitMQ is a message queue system. Message queues were an easy alternative to writing a custom TCP/IP interface. This requires installing and running a RabbitMQ server. RabbitMQ 3.3.1 is used with Erlang R16B02.

The data transport is handled using GridFTP. This requires installing and running a GridFTP server. Globus version 5.2.0 is used.

 Diagram of the ARGO - Balsam interactions.

Job Submission

Jobs are submitted to ARGO via a RabbitMQ message queue. The messages use the python json serialization format. An example submission is:

example_msg.txt
'''
{
   "preprocess": null,
   "preprocess_args": null,
   "postprocess": null,
   "postprocess_args": null,
   "input_url":"gsiftp://www.gridftpserver.com/path/to/input/files",
   "output_url":"gsiftp://www.gridftpserver.com/path/to/output/files",
   "username": "bob",
   "email_address": "[email protected]",
   "jobs":[
      {
       "executable": "zjetgen90_mpi",
       "executable_args": "alpout.input.0",
       "input_files": ["alpout.input.0","cteq6l1.tbl"],
       "nodes": 1,
       "num_evts": -1,
       "output_files": ["alpout.grid1","alpout.grid2"],
       "postprocess": null,
       "postprocess_args": null,
       "preprocess": null,
       "preprocess_args": null,
       "processes_per_node": 1,
       "scheduler_args": null,
       "wall_minutes": 60,
       "target_site": "argo_cluster"
       },
 
      {
       "executable": "alpgenCombo.sh",
       "executable_args": "zjetgen90_mpi alpout.input.1 alpout.input.2 32",
       "input_files": ["alpout.input.1","alpout.input.2","cteq6l1.tbl","alpout.grid1","alpout.grid2"],
       "nodes": 2,
       "num_evts": -1,
       "output_files": ["alpout.unw","alpout_unw.par","directoryList_before.txt","directoryList_after.txt","alpgen_postsubmit.err","alpgen_postsubmit.out"],
       "postprocess": "alpgen_postsubmit.sh",
       "postprocess_args": "alpout",
       "preprocess": "alpgen_presubmit.sh",
       "preprocess_args": null,
       "processes_per_node": 32,
       "scheduler_args": "--mode=script",
       "wall_minutes": 60,
       "target_site": "vesta"
      }
   ]
}
'''

Installation

  1. On Mira: soft add +python (to get python 2.7)
  2. Install virtualenv
  3. Create install directory: mkdir /path/to/installation/argobalsam
  4. export INST_PATH=/path/to/installation/argobalsam
  5. cd $INST_PATH
  6. Create virtual environment: virtualenv argobalsam_env
    1. On Edison:
      1. module load virtualenv
      2. module load python/2.7
  7. Activate virtual environment: . argobalsam_env/bin/activate
    1. On Edison:
      1. To use pip you need a certificate so run mk_pip_cabundle.sh, then include --cert ~/.pip/cabundle in all your pip commands.
  8. Install needed software:
    1. pip install django
      1. if you have less than python-2.7 you need pip install django==1.6.2
    2. pip install south
    3. pip install pika
    4. pip install MySql (only if needed)
      1. For this I had to install on SLC6 yum install mysql mysql-devel mysql-server
  9. Create django project: django-admin.py startproject argobalsam
  10. cd argobalsam
  11. git clone [email protected]:balsam.git argobalsam_git
  12. mv argobalsam_git/* ./
  13. rm -rf argobalsam_git/
  14. Update the following lines of argobalsam/settings.py:
    1. At the Top:
      from site_settings.mira_settings import *
    2. INSTALLED_APPS = (
          'django.contrib.admin',
          'django.contrib.auth',
          'django.contrib.contenttypes',
          'django.contrib.sessions',
          'django.contrib.messages',
          'django.contrib.staticfiles',
          'south',
          'balsam_core',
          'argo_core',
      )
    3. If you are using MySQL:
      DATABASES = {
          'default': {
              'ENGINE': 'django.db.backends.mysql',
              'NAME': 'your_table_name_goes_here',
              'USER': 'your_login',
              'PASSWORD': 'your_password',
              'HOST': '127.0.0.1',
              'PORT': '',
              'CONN_MAX_AGE': 2000000,
          }
      }
  15. Setup your database:
    1. For Older Django using South (pre-1.7):
      1.  python manage.py syncdb 
      2. Create the first migration
         python manage.py schemamigration balsam_core --initial 
      3. Apply the first migration
         python manage.py migrate balsam_core --fake 
      4. Create the first migration
         python manage.py schemamigration argo_core --initial 
      5. Apply the first migration
         python manage.py migrate argo_core --fake 
    2. For newer Django (1.7,1.8):
      1.  python manage.py syncdb 
    3. For Django (1.9+):
      1.  python manage.py migrate --fake-initial 

Git Tag Notes

Git Browser

5.0

  • updated to MySQL database
  • create filter website
  • Added group_identifier to ArgoDbEntry so that jobs can be grouped together and searched for easier.

4.1

  • updated vesta balsam delays to 5 min
  • making all settings files consistent.
  • added code to deal with jobs after a crash, and added print statements
  • added restart try of subprocesses in the service loop
  • updated vesta settings
  • Updated Balsam code to support tukey submit script with manual mpirun call. Mira handles the mpirun call for the user whereas tukey does not.
  • added tukey balsam site to argo settings
  • updated argo settings to reflect new install directory balsam_production

4.0

  • adding condor command files which should have already been added. Fixed bugs with adding alpgen data to finished emails.
  • adding logger.exception in the proper places. Fixed small bugs and added JobHold error
  • Adding files for the condor scheduler changes to include dagman jobs and condor job files.
  • Many updates to accomodate a Condor Scheduler ability to accept condor job files and condor dagman job files. Also includes updates to include more information on emails upon completion.
  • fixing missing str() conversion
  • fixed error messages where integers needed to be converted to strings
  • updating only specific db fields instead of full object

3.2

  • re-added JobStatusReceiver

3.1

  • adding QueueMessage class to argo
  • merging old and new?
  • adding longer wait times to balsam on ascinode settings, and an error checking to the job status receiver.

3.0

  • major change to ARGO so that each transition takes place in a subprocess and does not block argo from processing other jobs in parallel
  • added error catching for job id mismatch
hpc/argobalsam.txt · Last modified: 2016/02/24 21:31 by jchilders