Introduction

The ATLAS Tier3g environment is set up so that most analysis instructions available on the general ATLAS Twikis will work. This documentation will orient you to some specific features of a standard Tier3g (T3g) and point to other relevant documentation. The model Tier3g at ANL ASC is described here; places where your own T3g will likely differ from ANL ASC are noted.

An ATLAS Tier3g consists of the following elements:

  • Interactive nodes: this is where users log in and do all of their work, including submitting jobs to the Grid and to the local batch cluster. At ANL ASC, they are called
    ascint0y.hep.anl.gov
    ascint1y.hep.anl.gov
    At ANL, these nodes are currently only accessible from inside the ANL firewall.
  • Worker nodes: this is where your batch jobs will run; normally the associated data storage is also located here. Users usually do not need to know the details of the worker nodes. At ANL ASC, there are a total of 42 batch slots available for use.
  • Data gateway: your T3g has a way to get large amounts (multiple TeraBytes) of data from the Grid and load it into your batch cluster for you to run over. This is normally controlled by the ATLAS administrators at your site. You can copy small data sets on your own to the interactive nodes for test processing.

Setting Up Your Account

Basics

Your T3g account will have the bash shell as default. It is recommended that you stick with this. Given the limited manpower, we did not install the rebuilt special C-shell needed for ATLAS software, nor did we test any of the functionality from C-type shells.

Your home login area will normally be /export/home/your_user_name. In the case of ANL ASC, it is /users/your_user_name, because the home area at ANL ASC is shared with another cluster. You may find a similar arrangement at your T3g.

Before you get to work, you will probably want to do the following for convenience. In .bash_profile, put in:

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

And in .bashrc, put in:

# add ~/bin/ to path
PATH=$PATH:$HOME/bin:./
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

This gives you the default bashrc behavior (a prompt showing user and node) and adds the current directory and your own ~/bin to your PATH; you are now also ready to put aliases and functions into .bashrc.
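For example (purely illustrative; adjust to your own taste), a couple of aliases and a simple prompt in .bashrc might look like this:

# convenience aliases (examples only)
alias ll='ls -ltr'
alias ta='cd ~/testarea'
# prompt showing user, host, and current directory
export PS1='[\u@\h \W]\$ '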

Your ATLAS environment

The ATLAS environment in a T3g is based on the ATLAS Local Root Base package (https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ATLASLocalRootBase) developed at ATLAS Canada. The original documentation resides on Canadian ATLAS pages; it will move to CERN (as will these pages) in the future and be maintained centrally.

The other part of your environment comes from CVMFS, a web-based file system (part of the CERNVM project, http://cernvm.cern.ch/cernvm/, although what is used here has nothing to do with virtual machines) that maintains the Athena versions as well as conditions data centrally at CERN.

The two environments are designed to work together.

To start your environment, you need to do the following:

export ATLAS_LOCAL_ROOT_BASE=/export/share/atlas/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'

You can put this into .bashrc or, if you prefer, make a separate shell script that you can execute.
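For instance, a minimal sketch of such a separate script (the file name atlas_env.sh is hypothetical; source it rather than execute it, so that the alias is defined in your current shell):

# ~/atlas_env.sh -- hypothetical helper; use with:  source ~/atlas_env.sh
export ATLAS_LOCAL_ROOT_BASE=/export/share/atlas/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'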

Now you can do:

setupATLAS

You should see the following output on your screen:

...Type localSetupDQ2Client to use DQ2 Client
...Type localSetupGanga to use Ganga
...Type localSetupGcc to use alternate gcc
...Type localSetupGLite to use GLite
...Type localSetupPacman to use Pacman
...Type localSetupPandaClient to use Panda Client
...Type localSetupROOT to setup (standalone) ROOT
...Type localSetupWlcgClientLite to use wlcg-client-lite
...Type saveSnapshot [--help] to save your settings
...Type showVersions to show versions of installed software
...Type createRequirements [--help] to create requirements/setup files
...Type changeASetup [--help] to change asetup configuration
...Type setupDBRelease to use an alternate DBRelease
...Type diagnostics for diagnostic tools

Getting ready to run Athena interactively

Running on CVMFS athena versions

This is the generally recommended way to run Athena at a Tier3. The Athena versions suitable for a Scientific Linux 5 (SL5) installation such as ANL ASC are under

/opt/atlas/software/i686_slc5_gcc43_opt/

Note: the /opt/atlas/ area is remotely mounted and cached locally; this means you should not run a recursive command (like ls -R) on these directories, or you could be waiting for a very long time.

You can do a simple “ls”, for example to find installed versions on CVMFS:

[test_user@ascwrk2 ~]$ ls /opt/atlas/software/i686_slc5_gcc43_opt/
15.6.3  15.6.4  15.6.5  15.6.6  gcc432_i686_slc4  gcc432_i686_slc5  gcc432_x86_64_slc5

You can look for patched versions in the following way:

[ryoshida@ascint1y ~]$ ls /opt/atlas/software/i686_slc5_gcc43_opt/15.6.6/AtlasProduction/
15.6.6  15.6.6.1  15.6.6.2  15.6.6.3  15.6.6.4

You can set up your test area as usual (the example here sets up 16.0.0):

mkdir ~/testarea
mkdir ~/testarea/16.0.0
export ATLAS_TEST_AREA=~/testarea/16.0.0

Now you need to set up the correct version of the C++ compiler for Athena and your platform (at ANL ASC it is 64-bit SL5) using the environment created by the ATLASLocalRootBase package. (This version of gcc will become the default in the future.)

localSetupGcc --gccVersion=gcc432_x86_64_slc5

Now you need to set up the version you want. (An alternate setup procedure using a cmthome directory is described in HowToCreateRequirements.)

source /opt/atlas/software/i686_slc5_gcc43_opt/16.0.0/cmtsite/setup.sh -tag=16.0.0,AtlasOffline,32,opt,oneTest,setup

For patched versions, an example is

source /opt/atlas/software/i686_slc5_gcc43_opt/16.0.0/cmtsite/setup.sh -tag=16.0.0.1,AtlasProduction,32,opt,oneTest,setup

(Note that the setup.sh used is in the directory of the main release version. Also note that the “tag” options are somewhat different for base and patched versions.)
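Putting the steps above together, a full interactive setup for 16.0.0 at ANL ASC looks roughly like this (a sketch only, assuming the ANL ASC paths used above):

setupATLAS
mkdir -p ~/testarea/16.0.0
export ATLAS_TEST_AREA=~/testarea/16.0.0
localSetupGcc --gccVersion=gcc432_x86_64_slc5
source /opt/atlas/software/i686_slc5_gcc43_opt/16.0.0/cmtsite/setup.sh -tag=16.0.0,AtlasOffline,32,opt,oneTest,setup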

Database access needed for Athena jobs

You may also need the following definition to run some types of jobs; it defines how to access the conditions files and database.

export FRONTIER_SERVER="(proxyurl=http://vmsquid.hep.anl.gov:3128)(serverurl=http://squid-frontier.usatlas.bnl.gov:23128/frontieratbnl)"

In this, “vmsquid.hep.anl.gov” is specific to the ANL ASC cluster. Ask your administrator for the name of your local squid server.

(For recent versions of Atlas Local Root Base, it is no longer necessary to define the following: “export ATLAS_POOLCOND_PATH=/opt/atlas/conditions/poolcond/catalogue”)

Sometimes a job will require a specific recent database release that is not shipped with the Athena version. In this case, it is possible to use the database releases installed on CVMFS. To see which versions are available:

[ryoshida@ascint1y ~]$ ls /opt/atlas/database/DBRelease
9.6.1  9.7.1  9.8.1  9.9.1  current

If you want to use one of these (9.6.1 in this example) instead of the one built into the Athena version, give the following commands:

export DBRELEASE_INSTALLDIR="/opt/atlas/database"
export DBRELEASE_VERSION="9.6.1"
export ATLAS_DB_AREA=${DBRELEASE_INSTALLDIR}
export DBRELEASE_OVERRIDE=${DBRELEASE_VERSION}

More information on database releases is HERE.

Accessing SVN code repository at CERN

In order to check out packages from CERN SVN using commands like “cmt co”, you need to do Kerberos authentication. If your local user name is not the same as at CERN (your lxplus account), you will need to create a file called:

~/.ssh/config

This file should contain the following:

Host svn.cern.ch
 User your_cern_username
 GSSAPIAuthentication yes
 GSSAPIDelegateCredentials yes
 Protocol 2
 ForwardX11 no

Then give the following commands (after setting up an Athena version):

kinit your_cern_username@CERN.CH   # give your lxplus password when prompted
export SVNROOT=svn+ssh://svn.cern.ch/reps/atlasoff

and you will have access to the svn repository at CERN.

If your usernames are the same, you only need to do the kinit command.
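As a quick check (the package and tag are only illustrative, taken from the HelloWorld example below), you should now be able to check out code directly from the atlasoff repository:

svn co $SVNROOT/PhysicsAnalysis/AnalysisCommon/UserAnalysis/tags/UserAnalysis-00-14-03 UserAnalysis

or equivalently use “cmt co” as shown in the HelloWorld example below.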

Running Athena

At this stage you are set up so that the examples in the ATLAS Computing Workbook should work (but skip the “setting up your account” section; you have done the equivalent already). The examples from the Physics Analysis Workbook should also work. The following is a small example to get you started.

(Almost) Athena-version-independent HelloWorld example

  • Set up the Athena environment as above.
  • Go to your test area, e.g. ~/testarea/RELEASE (e.g. 15.6.6), and run
    cmt show versions PhysicsAnalysis/AnalysisCommon/UserAnalysis
    This returns a string containing a “tag collector” number of the form UserAnalysis-nn-nn-nn. (For 15.6.6, it is UserAnalysis-00-14-03.)
  • Check out the package (this can take a minute or two to complete):
    cmt co -r UserAnalysis-nn-nn-nn PhysicsAnalysis/AnalysisCommon/UserAnalysis
  • Go to the run directory:
    cd PhysicsAnalysis/AnalysisCommon/UserAnalysis/run
  • Get the runtime job options:
    get_files -jo HelloWorldOptions.py
  • Run Athena:
    athena.py HelloWorldOptions.py

The algorithm will first initialize and then execute ten times (on each execution it prints various messages and echoes the values given in the job options file). It will then finalize and stop. You should see output that includes this:

HelloWorld INFO initialize()
HelloWorld INFO MyInt = 42
HelloWorld INFO MyBool = 1
HelloWorld INFO MyDouble = 3.14159
HelloWorld INFO MyStringVec[0] = Welcome
HelloWorld INFO MyStringVec[1] = to
HelloWorld INFO MyStringVec[2] = Athena
HelloWorld INFO MyStringVec[3] = Framework
HelloWorld INFO MyStringVec[4] = Tutorial

If so, you have successfully run the Athena HelloWorld example.

Getting sample data and MC files with DQ2

After doing

 setupATLAS

give the command:

localSetupDQ2Client

You'll get a banner

************************************************************************
It is strongly recommended that you run DQ2 in a new session
  It may use a different version of python from Athena.
************************************************************************
Continue ? (yes[no]) : 

Say “yes”. It is safest to dedicate a window to DQ2, or to log out and back in after using DQ2 if you want to use Athena.

Usage instructions and documentation for the DQ2 tools are HERE.
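As a brief, hedged illustration (the dataset name is only an example, and the exact options may vary with the DQ2 client version), a typical session lists a dataset and then fetches a single file for local testing:

dq2-ls mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808*
dq2-get -n 1 mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808_tid091860/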

Submitting to the Grid using pathena

Your Grid Certificates

As usual, you need to copy your certificates userkey.pem and usercert.pem into the ~/.globus area. Instructions on obtaining certificates (for US users) are HERE.

Setting up for Pathena

After setting up for Athena as described above, give the following command:

 localSetupPandaClient  

Using Pathena to submit to the Grid

After the above setup, you can follow the general Distributed Analysis on Panda instructions HERE. Skip the Setup section of that document, since you have already done the equivalent for the T3g.
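For illustration only (the input and output dataset names are hypothetical), a Grid submission then looks like any standard pathena command, preceded by creation of a Grid proxy:

voms-proxy-init -voms atlas
pathena --inDS=mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808 --outDS=user10.YourGridName.test.v1 AnalysisSkeleton_topOptions.py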

Local Batch Cluster

Your Tier3 will have local batch queues you can use to run over larger amounts of data. In general, one batch queue is more or less equivalent to one analysis slot at a Tier2 or Tier1. As an example, the ANL ASC cluster has 42 batch queues; this means that a job which runs in a Tier1/2 analysis queue in an hour, when split into 42 jobs, will also run in about an hour at ANL ASC (assuming you are using all of the queues).

Using pathena to submit to your local batch cluster

This is still under preliminary tests.

The batch nodes of your cluster may be configured to accept Pathena jobs in a similar way to Tier2 and Tier1 analysis queues. The description of Tier3 Panda is HERE.

The name of the Panda site at ANL ASC is ANALY_ANLASC. A pathena submission with the option --site=ANALY_ANLASC will submit jobs to the Condor queues described below.

A T3g is not part of the Grid, and the data storage you have locally is not visible to Panda. This has several consequences:

  • You are still communicating with the Panda server at CERN. This means you need to set up exactly as you would for any pathena submission.
  • You need to list the files you want to run on in a text file and send it with your job.
  • Panda will not retrieve your output and register it with DQ2; your output will reside locally.
  • Normally, a T3g will be set up so that only local T3g users can submit to its batch queues.

The following is an example of a command submitting a job to ANALY_ANLASC. It submits the AnalysisSkeleton example from the Computing Workbook.

 pathena --site=ANALY_ANLASC --pfnList=my_filelist --outDS=user10.RikutaroYoshida.t3test.26Mar.v0 AnalysisSkeleton_topOptions.py

Note that everything is the same as usual except that ANALY_ANLASC is chosen as the site and the input is specified by a file called my_filelist.

[ryoshida@ascint1y run]$ cat my_filelist
root://ascvmxrdr.hep.anl.gov//xrootd/mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808_tid091860/AOD.091860._000003.pool.root.1

This file specifies the files you want to use as input, in a format that the local system understands. The local data storage for your batch jobs is explained below under Data Storage at your site.

The output of the jobs cannot be registered with DQ2 as at a T2 or T1; it is kept locally in the area

/export/share/data/users/atlasadmin/2010/YourGridId

For example:

[ryoshida@ascint1y run]$ ls /export/share/data/users/atlasadmin/2010/RikutaroYoshida/
user10.RikutaroYoshida.t3test.22Mar.v0              user10.RikutaroYoshida.t3test.25Mar.v0
user10.RikutaroYoshida.t3test.22Mar.v1              user10.RikutaroYoshida.t3test.25Mar.v0_sub06230727
user10.RikutaroYoshida.t3test.22Mar.v1_sub06183810  user10.RikutaroYoshida.t3test.26Mar.v0
user10.RikutaroYoshida.t3test.22Mar.v2              user10.RikutaroYoshida.t3test.26Mar.v0_sub06253711
user10.RikutaroYoshida.t3test.22Mar.v2_sub06183821

Looking at the output of the job submitted by the command above:

ls /export/share/data/users/atlasadmin/2010/RikutaroYoshida/user10.RikutaroYoshida.t3test.26Mar.v0_sub06253711
user10.RikutaroYoshida.t3test.26Mar.v0.AANT._00001.root
user10.RikutaroYoshida.t3test.26Mar.v0._1055967261.log.tgz

The output root file and the zipped log files can be seen.
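To take a quick look at the output ntuple, you can set up standalone ROOT through ATLASLocalRootBase and open the file (a sketch; the path is the one from the listing above):

localSetupROOT
root -l /export/share/data/users/atlasadmin/2010/RikutaroYoshida/user10.RikutaroYoshida.t3test.26Mar.v0_sub06253711/user10.RikutaroYoshida.t3test.26Mar.v0.AANT._00001.root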

Local parallel processing on your batch cluster: ArCond

ArCond (Argonne Condor) is a wrapper around Condor which automatically parallelizes your jobs (Athena and non-Athena) over the input data files and helps you concatenate your output at the end. Unlike pathena, this is a completely local submission system.

To get started with ArCond, do the following:

mkdir arctest
cd arctest
source /export/home/atlasadmin/condor/Arcond/etc/arcond/arcond_setup.sh
arc_setup

This will set up your arctest directory:

[ryoshida@ascint1y arctest]$ arc_setup
Current directory=/users/ryoshida/asc/arctest
--- initialization  is done ---
[ryoshida@ascint1y arctest]$ ls
DataCollector  Job  arcond.conf  example.sh  patterns  user

Now do

arc_ls /xrootd/

to see the files loaded into the batch area. You are now set up to run a test job from the arctest directory with the “arcond” command (see the sketch below). We are working on a more complete description; in the meantime, these pages (https://atlaswww.hep.anl.gov/twiki/bin/view/Workbook/UsingPCF), although they describe an installation on a different cluster, contain most of the needed information.
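A minimal workflow sketch using only the commands mentioned above (the details of editing arcond.conf and the scripts under Job/ depend on your analysis and are not shown here):

cd ~/arctest
arc_ls /xrootd/    # browse the data available on the worker nodes
# edit arcond.conf and the scripts under Job/ to point to your data and job options
arcond             # submit the parallelized jobs to Condor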

Condor

The batch system used by ANL ASC is Condor. There is detailed documentation at that link. The information here is meant to let you look at the system and do simple submissions. We expect you will do most of your job submission using the provided interfaces (pathena and ArCond; see above), so that parallel processing of the data is automated.

There is no user setup necessary for using Condor.

Looking at Condor queues

To see what queues there are, give the following command:

[test_user@ascint1y ~]$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

[email protected] LINUX      X86_64 Unclaimed Idle     0.000  2411  1+19:38:32
[email protected] LINUX      X86_64 Unclaimed Idle     0.000  2411  1+19:42:55
...(abbreviated)
[email protected] LINUX      X86_64 Claimed   Busy     0.000  2411  0+00:00:05
[email protected] LINUX      X86_64 Claimed   Busy     0.010  2411  0+00:00:05
[email protected] LINUX      X86_64 Claimed   Busy     0.010  2411  0+00:00:06
[email protected] LINUX      X86_64 Claimed   Busy     0.010  2411  0+00:00:07
[email protected] LINUX      X86_64 Claimed   Busy     0.010  2411  0+00:00:08
[email protected] LINUX      X86_64 Claimed   Busy     0.010  2411  0+00:00:09
...(abbreviated)
[email protected]. LINUX      X86_64 Unclaimed Idle     0.000  2411  0+14:10:17
[email protected]. LINUX      X86_64 Unclaimed Idle     0.000  2411  0+14:10:18
[email protected]. LINUX      X86_64 Unclaimed Idle     0.000  2411  0+14:10:11
[email protected]. LINUX      X86_64 Unclaimed Idle     0.000  2411  0+14:10:12
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX    45     3      14        28       0          0        0

               Total    45     3      14        28       0          0        0

This tells you that there are 45 queues (3 reserved for service jobs) and the status of each queue. Note that the queues (slots) on the ascwrk1 node are running jobs (Busy). To see the jobs in the queues:

[test_user@ascint1y ~]$ condor_q -global


-- Submitter: ascint1y.hep.anl.gov : <146.139.33.41:9779> : ascint1y.hep.anl.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
76139.0   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.1   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.2   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.3   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.4   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.5   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.6   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.7   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.8   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.9   test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.10  test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.11  test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.12  test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh
76139.13  test_user       3/18 10:53   0+00:00:00 I  20  0.0  run_athena_v2_1.sh

14 jobs; 14 idle, 0 running, 0 held

In this case, 14 jobs from the user “test_user” are in the idle state (just before they begin to run).

Submitting a job to Condor

Prepare your submission file. An example is the following:

[test_user@ascint1y xrd_rdr_access_local]$ less run_athena_v2.sub
# Some incantation..
universe = vanilla
# This is the actual shell script that runs
executable = /export/home/test_user/condor/athena_test/xrd_rdr_access_local/run_athena_v2.sh
# The job keeps the environmental variables of the shell from which you submit
getenv = True
#  Setting the priority high
Priority        = +20
#  Specifies the type of machine.
Requirements    = ( (Arch == "INTEL" || Arch == "X86_64"))
#  You can also specify the node on which it runs, if you want
#Requirements    = ( (Arch == "INTEL" || Arch == "X86_64") &&  Machine == "ascwrk2.hep.anl.gov")
#  The following files will be written out in the directory from which you submit the job
log = test_v2.$(Cluster).$(Process).log
#  The next two will be written out at the end of the job; they are stdout and stderr
output = test_v2.$(Cluster).$(Process).out
error = test_v2.$(Cluster).$(Process).err
#  Ask that you transfer any file that you create in the "top directory" of the job
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
#  queue the job.
queue 1
#  more than once if you want
# queue 14

The actual shell script that executes looks like this:

[test_user@ascint1y xrd_rdr_access_local]$ less run_athena_v2.sh

#!/bin/bash
##  You will need this bit for every Athena job
# non-interactive shell doesn't do aliases by default. Set it so it does
shopt -s expand_aliases
# set up the aliases.
# note that ATLAS_LOCAL_ROOT_BASE (unlike the aliases) is passed from the shell from which you submit (via getenv = True).
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
# now proceed normally to set up the other aliases
setupATLAS
# Condor works in a "sandbox" directory.
# We now want to create our Athena environment in this sandbox area.
mkdir testarea
mkdir testarea/15.6.3
export ATLAS_TEST_AREA=${PWD}/testarea/15.6.3
localSetupGcc --gccVersion=gcc432_x86_64_slc5
cd testarea/15.6.3
# Set up the Athena version
source /export/home/atlasadmin/temp/setupScripts/setupAtlasProduction_15.6.3.sh
# For this example, just copy the code from my interactive work area where I have the code running.
cp -r ~/cvmfs2test/15.6.3/NtupleMaker .
# compile the code 
cd NtupleMaker/cmt
cmt config
gmake
source setup.sh
# cd to the run area and start running.
cd ../share
athena Analysis_data900GeV.py
#  Just to see what we have at the end do an ls.  This will end up in the *.out file
echo "ls -ltr"
ls -ltr
# copy the output file back up to the top directory so Condor returns it to your submission directory.
cp Analysis.root ../../../../Analysis.root 

Now, to submit the job, do:

 condor_submit run_athena_v2.sub
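Once submitted, you can monitor or remove jobs with standard Condor commands (a brief illustration; the job ID 76139.0 is just the example from the condor_q output above):

condor_q test_user     # list jobs belonging to test_user
condor_rm 76139.0      # remove a single job
condor_rm test_user    # remove all of test_user's jobs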

Data Storage at your site

The baseline Tier3g configuration has several data storage options. The interactive nodes can be configured to have some local space; this should be considered shared scratch space, and local site policies define how it is used. There is also space on the standalone file server (also known as the NFS node). Due to limitations of NFS within Scientific Linux, XRootD is used to access the data on this node. In the baseline Tier3g setup, the majority of the storage is located on the worker nodes; this space is managed and accessed through XRootD.
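For example (a hedged sketch; the redirector name ascvmxrdr.hep.anl.gov and the file path are taken from the pathena example above and will differ at your site), files in the XRootD-managed space can be copied with xrdcp or opened directly in ROOT:

xrdcp root://ascvmxrdr.hep.anl.gov//xrootd/mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808_tid091860/AOD.091860._000003.pool.root.1 .
root -l root://ascvmxrdr.hep.anl.gov//xrootd/mc08.105597.Pythia_Zprime_tt3000.recon.AOD.e435_s462_s520_r808_tid091860/AOD.091860._000003.pool.root.1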

Jump to the page explaining how to use your storage system.
