Globus GridFTP

Transfer of data between CUPPA and cortex.ivec.org is most efficiently done using GridFTP (part of the Globus Toolkit). GridFTP can be used without grid certificates via the sshftp URL mechanism described below. Note that ssh is used only to initiate the transfer between the two hosts; the data themselves are transferred by GridFTP and are not encrypted or otherwise handled by ssh, so there is no performance penalty.

pbstore is special

The load put on cortex by people transferring data has on many occasions caused it to misbehave rather badly. There are a number of things you can do to minimise the load. The first and most important is to avoid encrypted data transfers (i.e. ssh-based transfers, such as scp or rsync over ssh). The GridFTP methods outlined below are preferred. It is also possible to use high-performance ssh (hpn-ssh) with the “none” cipher enabled. hpn-ssh is installed on cortex as the ssh server/client, but you will also have to compile it on your own machine to take advantage of it. It is not yet available on CUPPA, so don't use ssh for data transfers from there.

The iVEC help pages, these slides and this reference document by Rob Mollard at iVEC contain a number of useful tips for interacting more intelligently with the pbstore. Several of your favourite UNIX commands have special versions for the SAMQFS file system that runs on pbstore. These SAMQFS versions (with slightly different names, such as sfind for find) will often work much better than the standard UNIX equivalents. Please read the iVEC help pages before doing any significant transfers on pbstore.

File sizes

One of the pbstore's major problems is inadequate metadata I/O. For this reason great care should be taken to avoid having large numbers of small (or empty) files on the pbstore. Use tar to create larger archives of small files. Individual files on the pbstore should be larger than 10 MB. File sizes of up to 0.5 TB are comfortably handled by the pbstore's tape archives (larger files should be avoided). You can find your empty files (which should be deleted!) with:

sfind /pbstore/<my_project> -size 0
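
The tar advice above can be sketched as follows, for data that is still on local disk before it is copied to the pbstore. This is only a sketch: the helper name bundle_small_files and its arguments are illustrative assumptions, and it relies on GNU find and GNU tar. It bundles non-empty files smaller than 10 MB (taken here as 10000000 bytes) into one archive and leaves the originals in place.

```shell
# Sketch: collect small, non-empty files into one tar archive.
# bundle_small_files is a hypothetical helper name, not an iVEC tool.
# Assumes GNU find and GNU tar (-print0, --null, -T -).
bundle_small_files() {
    src_dir=$1      # directory containing the small files
    archive=$2      # full path of the archive to create (outside src_dir)
    # -size +0c skips empty files; -size -10000000c keeps files under 10 MB.
    # Names are NUL-separated so spaces in file names are handled safely.
    ( cd "$src_dir" && find . -type f -size +0c -size -10000000c -print0 \
        | tar --null -cf "$archive" -T - )
}
```

For example, bundle_small_files /data/myproject/logs /data/myproject/logs_small.tar (both paths hypothetical) would create the archive; check its contents with tar -tf before removing any originals.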

Staging data

To maximise transfer efficiency from the pbstore it is necessary to stage files to disk before transfer (the gloPut script below will do this for you and is recommended for large transfers).

To stage all files in a directory from tape to disk on pbstore, the following is most efficient according to man stage:

stage -r . ; stage -r -w .

To stage only the offline files matching a pattern (useful for staging a smaller subset of files, and for checking the progress or success of staging):

sfind . -name "<filename_pattern>" -offline -exec stage {} \; -exec echo {} \;
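
To check progress without issuing further stage requests, the same sfind test can simply be counted. The sketch below assumes that sfind prints matching names when given -print, as GNU find does; count_offline is a hypothetical helper name.

```shell
# Sketch: count the files matching a pattern that are still offline (on tape).
# count_offline is a hypothetical helper; sfind and its -offline test are the
# SAMQFS tools used in the command above.
count_offline() {
    pattern=$1
    # tr strips the padding that some wc implementations add to the count
    sfind . -name "$pattern" -offline -print | wc -l | tr -d ' '
}
```

Running count_offline "<filename_pattern>" from the directory being staged prints 0 once everything matching the pattern is on disk. Each call walks the directory tree, so run it sparingly.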

Finally, don't try to stage very large amounts of data at one time. The disk cache is only 100 TB and has many other users. If you use the gloPut script described below you do not need to stage the data as it will do it in optimised chunks for you. This minimises the impact on the disk cache while still ensuring the transfer is not interrupted by waiting for files to stage.
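
Before staging by hand it can help to know how much data is involved. The following sketch sums the apparent sizes of the files under a directory; total_bytes is a hypothetical helper name and it assumes GNU find (for -printf). Whether sfind can be substituted for find on pbstore is not verified here.

```shell
# Sketch: total size in bytes of the files under a directory.
# total_bytes is a hypothetical helper; assumes GNU find for -printf.
total_bytes() {
    dir=$1
    pattern=${2:-*}     # optional file name pattern, default: all files
    # %s prints each file's size in bytes; awk adds them up
    find "$dir" -type f -name "$pattern" -printf '%s\n' \
        | awk '{ s += $1 } END { printf "%.0f\n", s }'
}
```

If the total is a large fraction of the 100 TB disk cache, stage in smaller subsets or use the gloPut script.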

Transfer data

Data transferred from the telescopes resides in cortex.ivec.org:/pbstore/groupfs/astrotmp/astro_transfer/

To use the globus-url-copy commands below you must set the environment variable $GLOBUS_LOCATION and add the Globus binaries to your path. On CUPPA this should happen automatically for bash users (see /etc/bash.bashrc).
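
On a machine where this is not done automatically, something like the following sketch could go in your shell start-up file. The install prefix /opt/globus and the function name setup_globus_env are assumptions; use the location of your own installation.

```shell
# Sketch: set up the Globus environment for sh/bash users.
# /opt/globus is an assumed install prefix; setup_globus_env is a
# hypothetical helper name.
setup_globus_env() {
    prefix=${1:-/opt/globus}
    if [ ! -d "$prefix/bin" ]; then
        echo "no Globus installation found under $prefix" >&2
        return 1
    fi
    GLOBUS_LOCATION=$prefix
    PATH=$GLOBUS_LOCATION/bin:$PATH
    export GLOBUS_LOCATION PATH
    # Globus Toolkit installs usually ship this helper, which also sets the
    # library search paths; source it only if it is present.
    if [ -r "$GLOBUS_LOCATION/etc/globus-user-env.sh" ]; then
        . "$GLOBUS_LOCATION/etc/globus-user-env.sh"
    fi
}
```

After calling setup_globus_env with your install prefix, command -v globus-url-copy should print the path of the binary.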

Small Transfers

You can transfer small amounts of data, such as single files, to/from the pbstore using the following syntax:

globus-url-copy -vb sshftp://user@pbstore.ivec.org/<from_path> file://<to_path>

where <from_path> is the path on pbstore and <to_path> is the destination path on the CUPPA node. If either path is a directory you must terminate the URL with a forward slash (/).
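
The trailing-slash rule is easy to forget, so here is a small sketch that builds the command and appends the slash when the local destination is an existing directory. pbstore_fetch_cmd is a hypothetical helper; it assumes absolute paths without spaces and only prints the command so that it can be inspected before being run.

```shell
# Sketch: print a globus-url-copy command for a pbstore -> local copy.
# pbstore_fetch_cmd is a hypothetical helper; paths must be absolute.
pbstore_fetch_cmd() {
    user=$1
    from_path=$2    # absolute path on pbstore
    to_path=$3      # absolute local destination
    # A local directory destination must end with "/" in the URL.
    if [ -d "$to_path" ]; then
        case $to_path in
            */) ;;
            *)  to_path=$to_path/ ;;
        esac
    fi
    printf 'globus-url-copy -vb sshftp://%s@pbstore.ivec.org%s file://%s\n' \
        "$user" "$from_path" "$to_path"
}
```

The helper cannot tell whether the remote path is a directory, so add the trailing slash to <from_path> yourself in that case.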

Large Transfers

For transferring large numbers of files, there is a wrapper script for globus-url-copy that works around the problems globus-url-copy has with large numbers of files. The script will stage the data for you (in optimised chunks) and will also ensure that all the data requested have been transferred completely (by comparing file sizes at each end of the transfer and re-transferring incomplete files). To use the script you must be able to connect over ssh without a password. You are strongly encouraged to use this script for large transfers (greater than 1 TB), rather than the manual stage and globus-url-copy described above, or ssh-based methods (e.g. scp, rsync).
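
A quick way to confirm the password-less ssh requirement before starting a long transfer is sketched below. check_passwordless_ssh is a hypothetical helper name; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Sketch: confirm that key-based (non-interactive) ssh login works.
# check_passwordless_ssh is a hypothetical helper name.
check_passwordless_ssh() {
    remote=$1   # e.g. user@host
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$remote" true; then
        echo "ok: $remote accepts non-interactive ssh"
    else
        echo "failed: $remote needs a password or is unreachable" >&2
        return 1
    fi
}
```

If the check fails, set up ssh keys for the remote account before running the script.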

Note: gloPut7T.sh will not attempt to transfer a file which is already at the destination with the same size as at the source. This is useful for restarting transfers, as already completed files will be skipped. However, files which change but keep the same size will not be re-transferred on a restart.

The script lives on cuppa under ~corr/bin/gloPut7T.sh. On pbstore it lives in ~cormac/bin/gloPut7T.sh. See the notes below about transferring it to other machines.

gloPut7T.sh <directory> <remote-userid> <remote-directory>

e.g.

gloPut7T.sh /data/ASTRO-TRANSFERS/February09/v252l/Mopra corr@cuppa04.cira.curtin.edu.au /data/corr/v252l

The script requires full pathnames.

If you do not wish to transfer all the files in a directory, use the -m switch to specify a regular expression which will be used to match against the available files (using grep). E.g.

gloPut7T.sh -m v252l_Mp_184_03 /data/ASTRO-TRANSFERS/February09/v252l/Mopra corr@cuppa04.cira.curtin.edu.au /data/corr/v252l

would transfer only files with 'v252l_Mp_184_03' in the filename.
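
To preview what an expression would select before starting a transfer, the same kind of grep match can be run by hand. This is a sketch under assumptions: how the script itself lists files (for example, whether it descends into subdirectories) is not stated here, and preview_match is a hypothetical helper.

```shell
# Sketch: list the files under a directory whose path matches an expression.
# preview_match is a hypothetical helper, not part of gloPut7T.sh; the use
# of find to list files is an assumption about what counts as "available".
preview_match() {
    dir=$1
    pattern=$2
    # grep -e keeps patterns that start with "-" from being read as options
    find "$dir" -type f | grep -e "$pattern"
}
```

For instance, preview_match /data/ASTRO-TRANSFERS/February09/v252l/Mopra v252l_Mp_184_03 lists the candidate files one per line.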

The -u switch (use UDT transfer) cannot currently be used on cortex (though between CUPPA and cortex this should not be a great loss).

About the gloPut script

The original script was written by Graham Jenkins at VPAC and the latest copy is available at: http://projects.arcs.org.au/svn/systems/trunk/dataFabricScripts/BulkDataTransfer/gloPut7T.sh

The versions at the cuppa/pbstore locations above have been significantly modified from Graham's original, principally to allow for the data staging described above. Please compare with the version already installed on pbstore (also available from Cormac) before re-installing there. Also note that $GLOBUS_LOCATION is (semi) hard-coded in the script. This may have to be amended before the script can be used on another machine.

gloPut transfers the data in small batches. If running on pbstore, it will also stage the next several batches of files in advance. In most circumstances this is sufficient to ensure that the data are already staged before they are due to be transferred (though of course you will notice slight delays at startup while the first set of files is staged). The advantages are that you do not have to stage all your files before you begin the transfer, and that the space required on the disk cache at any time is only enough for the next several batches of files. When doing very large transfers this is hugely beneficial.
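
The idea of staging ahead of the transfer can be illustrated with a toy loop. This is not the code of gloPut7T.sh: stage_batch and transfer_batch are placeholders that only print what they would do, the look-ahead is a single batch rather than several, and file names containing spaces are not handled.

```shell
# Illustration only: this is not the code of gloPut7T.sh.
# stage_batch and transfer_batch are placeholders; a real version would call
# stage and globus-url-copy instead of echo.
stage_batch()    { echo "stage: $*"; }
transfer_batch() { echo "transfer: $*"; }

# Read file names on stdin and process them batch_size at a time,
# requesting each batch from tape one step before it is transferred.
pipeline_batches() {
    batch_size=$1
    prev=""
    # xargs -n groups the names into lines of at most batch_size entries
    xargs -n "$batch_size" | {
        while read -r batch; do
            stage_batch $batch                      # stage the upcoming batch
            [ -n "$prev" ] && transfer_batch $prev  # send the batch staged earlier
            prev=$batch
        done
        [ -n "$prev" ] && transfer_batch $prev      # send the final batch
    }
}
```

Feeding it five names with a batch size of two (printf '%s\n' a b c d e | pipeline_batches 2) shows each stage request being issued one step before the matching transfer.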

Installing GridFTP

Follow the procedures at: http://www.globus.org/toolkit/data/gridftp/quickstart.html

You may also need to carry out the additional steps mentioned for Debian systems at http://projects.arcs.org.au/trac/systems/wiki/DataServices/GridFTPWithUdt

On Ubuntu 10.04 you may also need libssl-dev, i.e. the following packages need to be installed:

sudo apt-get install build-essential bzip2 autoconf libxml-parser-perl xinetd openssl telnet libssl-dev

Ubuntu 10.04 has GridFTP available in the standard repository. I haven't tested this version yet, but it can be installed simply with:

sudo apt-get install globus-gass-copy-progs

Testing the Installation

Once you have set up ssh keys to the destination machine, a nice test that everything is working for outgoing connections (and a way to get a rough idea of transfer speed) is the following command:

globus-url-copy -vb -p 4 /dev/zero sshftp://username@pbstore.ivec.org/dev/null

If you get the error "globus_ftp_client: an invalid value for url was used", this probably means that the sshftp URL wasn't understood. Double-check that the /path/to/install/to/setup/globus/setup-globus-gridftp-sshftp command worked.

correlator/globus.txt · Last modified: 2012/01/11 11:57 by cormac
 