Espresso is the system developed for managing correlation at Curtin University. It is a lightweight system for managing data on your cluster, automating the correlation process, and providing simple archiving of the outputs. It is designed for correlation from standard Linux disks (not directly from Mark5 units). Espresso also provides a number of auxiliary scripts which may come in handy during correlation. A typical espresso session, as it is used at Curtin, is available here.
All the scripts will give help if invoked with the -h switch.
The scripts come with your DiFX installation (2.0.2 and later), in
$DIFXROOT/applications/espresso. The included install.py script should
install them in your DiFX bin directory. To work, they need a
correlator definition file. See the corr_hosts.txt file in
$DIFXROOT/applications/espresso for an example. For every node in your cluster
you need to enter the hostname, the maximum number of compute processes to be
run simultaneously on that node (i.e. the number of MPI threads), and a
space-separated list of any data areas (directories) on that node where
baseband data may be stored.
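To illustrate the information each entry carries, here is a minimal Python sketch that parses one line per node into a hostname, a thread count and a list of data areas. The comma-separated layout, the '#' comment handling, the host names and the names HostEntry and parse_corr_hosts are all assumptions made for this example; the corr_hosts.txt shipped with espresso is the authority on the real syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HostEntry:
    hostname: str
    threads: int                      # max simultaneous compute processes
    data_areas: List[str] = field(default_factory=list)


def parse_corr_hosts(text: str) -> List[HostEntry]:
    """Parse an ASSUMED 'hostname, threads, area1 area2 ...' layout."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        parts = [p.strip() for p in line.split(",")]
        hostname = parts[0]
        threads = int(parts[1]) if len(parts) > 1 and parts[1] else 0
        areas = parts[2].split() if len(parts) > 2 else []
        entries.append(HostEntry(hostname, threads, areas))
    return entries


# Hypothetical cluster description, not a real espresso file.
example = """
# hostname, threads, data areas
node01, 8, /data1 /data2
node02, 0, /data1
"""
hosts = parse_corr_hosts(example)
```

In this sketch node02 offers a data area but no compute threads, which is the kind of entry you would use for a node that only serves baseband data.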
The environment variable $CORR_HOSTS should point at your copy of the
correlator definition file (the corr_hosts.txt described above). As this file
is unrelated to the particular version of DiFX you are using, you probably want
to store it in your home directory, or similar.
Espresso allows you to write the output data to a directory other than the one
in which the correlation files are stored (this is useful for installations
where the NFS disks are too small to store the output data). You should set the
environment variable $CORR_DATA to point to the directory where you want
the output data to be stored. The output data will be stored in a subdirectory
of $CORR_DATA with the experiment name.
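The relationship between $CORR_DATA and the per-experiment output directory can be sketched in Python as follows. This is only an illustration of the rule stated above, not espresso's own code; the function name output_dir and the example paths are hypothetical.

```python
import os


def output_dir(expname: str, env=os.environ) -> str:
    """Return <$CORR_DATA>/<expname>; raise if CORR_DATA is unset."""
    corr_data = env.get("CORR_DATA")
    if not corr_data:
        raise RuntimeError("CORR_DATA is not set")
    return os.path.join(corr_data, expname)


# Hypothetical values, passed in explicitly rather than read from the shell.
demo_env = {"CORR_HOSTS": "/home/user/corr_hosts.txt", "CORR_DATA": "/data/corr"}
demo_out = output_dir("v389b", demo_env)
```

Checking that the variable is set before starting is a cheap way to avoid output landing in an unexpected place.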
Espresso will automatically sniff the data areas given in the $CORR_HOSTS
file for baseband data. The baseband data should be stored in subdirectories of
the given data areas with the following naming convention:
<expname>-<tel>
where <expname> is the name of the experiment, and <tel> is the telescope station code, as used in the .v2d file.
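As a rough illustration of the naming convention, the Python sketch below splits a directory name into its experiment and telescope parts. The helper name split_data_dir is hypothetical, and splitting at the last hyphen is an assumption made here so that an experiment name containing a hyphen would still parse; espresso itself may do this differently.

```python
import os
from typing import Optional, Tuple


def split_data_dir(path: str) -> Optional[Tuple[str, str]]:
    """Split a '<expname>-<tel>' directory name into (expname, tel)."""
    name = os.path.basename(path.rstrip("/"))
    expname, sep, tel = name.rpartition("-")   # assumption: split at last '-'
    if not sep or not expname or not tel:
        return None                            # does not follow the convention
    return expname, tel


match = split_data_dir("/data1/v389b-At/")     # hypothetical data area
no_match = split_data_dir("/data1/scratch")
```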
disk_report.py > ~/disk.txt
This script scans all the data areas given in $CORR_HOSTS and
summarises the baseband data distributed across your cluster. You should save
the output to a convenient place (e.g. ~/disk.txt).
disk_exper.py <expname> ~/disk.txt
This script extracts the telescope baseband data locations from the output of
disk_report.py. It takes 2 arguments: the experiment for which you want a data
summary (<expname>), and the file where you saved the output of
disk_report.py. It writes a summary of the baseband data locations for
each telescope to a file named <expname>.datafiles.
lbafilecheck.py <expname>.datafiles
This script does a parallel search of the baseband data locations in
<expname>.datafiles to extract the full file list for correlation. These
are written as a series of .filelist files (one per telescope). In addition, it
creates the machines, threads and run files for MPI.
(Advanced users might wish to note that it is possible to restrict the files that are selected by use of the pattern match specified at the top of the file, as described here.)
espresso.py -a <expname>
Running espresso.py with the -a switch runs the correlation for every job generated by vex2difx <expname>.v2d.
It modifies the machines and threads files for each job, automatically taking
care of telescopes that are not present in some jobs. Output is written to
a subdirectory of $CORR_DATA.
In turn it will run:
vex2difx
calcif2
errlog2
mpifxcorr
All the auxiliary files (.calc, .im, .input, etc.) required for converting the output data to IDI fits are copied to the output directory (with modified internal paths). The log file will also be copied to the output directory when the job finishes. If there are any files already in the output directory which need to be overwritten, they will first be copied to a subdirectory (whose name matches the time that the new correlation started).
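The backup behaviour for files that would be overwritten can be pictured with the following Python sketch. It is an assumption-laden illustration rather than espresso's implementation: the function name backup_existing, the timestamp format used for the subdirectory name, and the example file names are all hypothetical.

```python
import os
import shutil
import tempfile
from datetime import datetime


def backup_existing(outdir, filenames, started):
    """Copy files about to be overwritten into a timestamped subdirectory."""
    # Assumed naming scheme: the real subdirectory name format may differ.
    backup_dir = os.path.join(outdir, started.strftime("%Y%m%dT%H%M%S"))
    for name in filenames:
        src = os.path.join(outdir, name)
        if os.path.isfile(src):                    # only back up what exists
            os.makedirs(backup_dir, exist_ok=True)
            shutil.copy2(src, os.path.join(backup_dir, name))
    return backup_dir


# Demonstration in a throwaway directory with one pre-existing file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "v389b_1.input"), "w") as f:
        f.write("old contents")
    saved = backup_existing(tmp, ["v389b_1.input", "v389b_1.calc"],
                            datetime(2020, 1, 2, 3, 4, 5))
    backed_up = sorted(os.listdir(saved))
    backup_name = os.path.basename(saved)
```

Only the file that already existed is copied; the missing .calc file is simply skipped.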
At the end of the correlation, the script will pause to force the operator to
enter a summary message on how the correlation went. By default that message
will be entered using the vim editor, but you may set the $EDITOR
environment variable to another editor if you prefer.
The behaviour of the script can be modified with a number of command line switches. Information on these can be obtained with:
espresso.py -h
In the case where you do not wish to run all the jobs created by vex2difx, you may select a subset by giving those jobs as arguments (and dropping the -a switch), e.g.:
espresso.py v389b_1 v389b_2
would run the first 2 jobs created by: vex2difx v389b.v2d
You may also use a Python regular expression to match the part of the job name after the '_', e.g.
espresso.py 'v278b_1[1-3]'
would run jobs v278b_11, v278b_12, v278b_13. (Note you will need to quote regular expressions to prevent the shell from expanding them.)
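The job selection described above can be approximated with Python's re module. The sketch below is one plausible reading of the rule (the text before the first '_' is compared literally, the rest is treated as a regular expression that must match the whole job suffix); the function name select_jobs and the list of job names are hypothetical, and espresso's own matching may differ in detail.

```python
import re
from typing import List


def select_jobs(pattern: str, available: List[str]) -> List[str]:
    """Pick jobs whose suffix (after '_') fully matches the regex in pattern."""
    expname, _, suffix_re = pattern.partition("_")
    selected = []
    for job in available:
        job_exp, _, job_suffix = job.partition("_")
        if job_exp == expname and re.fullmatch(suffix_re, job_suffix):
            selected.append(job)
    return selected


jobs = [f"v278b_{n}" for n in range(1, 21)]   # hypothetical job names 1..20
chosen = select_jobs("v278b_1[1-3]", jobs)
single = select_jobs("v278b_1", jobs)
```

Using a full match means that a plain job name such as v278b_1 selects only that job, and not v278b_10 through v278b_19 as well.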
Espresso comes with a number of auxiliary tools to assist the weary correlator operator:
getEOP.py <date> # returns 5 days of EOPs around <date> in .v2d format. <date> can either be MJD or a VEX format date.
mjd2vex.py <date> # converts the given <date> from MJD to VEX format, or vice versa.
updateclock.py # update the clock entry in the .v2d file (given residual clock offset and rate).
updatepos.py # update a site position in the .vex file (requires that $STADB points to a sched locations.dat file)
The following is deprecated but may be useful on occasion:
mk5scans.py <vexfile> <filelist>
This script appends start and end times to each filename entry in the .filelist file, by comparing the filename to the scan names in the given vex file. This only works if your Mark5 filenames include the vex scan name (this is very often the case). vex2difx uses these start and end times to include only those files which actually appear in the given job. This can speed up processing of subjobs.
Espresso automatically creates a machines and threads file for MPI. It assumes that the head node for correlation is the node on which you start the correlation (i.e. where you invoke espresso.py). The output data directory must be accessible from the head node.
By default, it assumes that the head node and datastream nodes should not be used as compute nodes. You can override this, and force all nodes to be used as compute nodes, with the -H switch to espresso.
If any of the nodes in $CORR_HOSTS are to be used only as
datastream nodes, and never as compute nodes, then set the number of available
compute threads in $CORR_HOSTS to 0 for that host.
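The node selection rules in the last three paragraphs can be summarised in a short Python sketch. It is an illustration under stated assumptions, not espresso's code: the function name build_machines, the host names, and the ordering of the resulting list (head node, then datastream nodes, then compute nodes) are assumptions made for the example.

```python
from typing import Dict, List


def build_machines(threads: Dict[str, int], head: str,
                   datastreams: List[str], use_all: bool = False) -> List[str]:
    """Return head node, datastream nodes, then the eligible compute nodes."""
    busy = set(datastreams) | {head}
    compute = [host for host, n in threads.items()
               if n > 0                               # 0 threads: never compute
               and (use_all or host not in busy)]     # use_all mimics -H
    return [head] + list(datastreams) + compute


# Hypothetical cluster: ds1 is datastream-only (0 compute threads).
cluster = {"head": 4, "ds1": 0, "ds2": 8, "c1": 8, "c2": 8}
default_machines = build_machines(cluster, "head", ["ds1", "ds2"])
all_machines = build_machines(cluster, "head", ["ds1", "ds2"], use_all=True)
```

With use_all set, the head node and ds2 appear a second time as compute nodes, while ds1 is still excluded because its thread count is 0.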
Espresso assumes that sorting your baseband data alphanumerically by file name will result in a file list that is ordered by time. This is usually the case for reasonable naming conventions, but you should ensure that it is so.
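One way to check this assumption on your own data is sketched below in Python. The function name and the file names are hypothetical, and the start times are assumed to be known already (for example from the file headers); the sketch only tests whether sorting by name preserves time order.

```python
from typing import List, Tuple


def name_sort_is_time_ordered(files: List[Tuple[str, float]]) -> bool:
    """True if sorting (filename, start_time) pairs by name is also time order."""
    by_name = sorted(files, key=lambda item: item[0])
    times = [start for _, start in by_name]
    return all(a <= b for a, b in zip(times, times[1:]))


# Zero-padded time stamps in the name sort correctly...
good = [("At_001_120500.lba", 300.0), ("At_001_120000.lba", 0.0)]
# ...but unpadded counters do not: "scan10" sorts before "scan9".
bad = [("scan9.lba", 540.0), ("scan10.lba", 600.0)]

good_ok = name_sort_is_time_ordered(good)
bad_ok = name_sort_is_time_ordered(bad)
```

The second case shows the usual way the assumption fails: numeric fields that are not zero-padded to a fixed width.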