Getting started with JASMIN

Lesson 6: Data transfer to and from JASMIN

Transcript

  • This lesson starts from a data transfer node on the UK Research Data Facility, with a live ssh-agent connection (for authentication) and the following entry in .ssh/config:
    Host jasmin-xfer1
        HostName jasmin-xfer1.ceda.ac.uk
        User <username>
    Note that while this lesson focusses on transferring data to JASMIN, retrieving data from JASMIN can be done by switching the order of arguments.
  • Small individual files can be transferred using scp;
    scp 10MB_file.dat jasmin-xfer1:/group_workspaces/jasmin2/primavera1/cache/msmizielinski/
    or rsync (note the options to show progress and a summary of how much data was transferred);
    rsync -av --progress --stats 10MB_file.dat jasmin-xfer1:/group_workspaces/jasmin2/primavera1/cache/msmizielinski/
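As noted at the start of the lesson, retrieving data from JASMIN only requires swapping the source and destination arguments. The sketch below prints the commands for each direction rather than running them, so that the argument order can be compared side by side. The function names upload_cmd and download_cmd and the variables REMOTE_HOST and REMOTE_DIR are illustrative and not part of the lesson.

```shell
#!/bin/sh
# Dry run: print the transfer commands instead of executing them.
REMOTE_HOST=jasmin-xfer1   # host alias from .ssh/config
REMOTE_DIR=/group_workspaces/jasmin2/primavera1/cache/msmizielinski

# upload_cmd TOOL FILE: the local file is the source, JASMIN the destination.
upload_cmd() {
    printf '%s %s %s:%s/\n' "$1" "$2" "$REMOTE_HOST" "$REMOTE_DIR"
}

# download_cmd TOOL FILE LOCAL_DIR: the same arguments in the opposite order.
download_cmd() {
    printf '%s %s:%s/%s %s\n' "$1" "$REMOTE_HOST" "$REMOTE_DIR" "$2" "$3"
}

upload_cmd scp 10MB_file.dat
download_cmd scp 10MB_file.dat .
download_cmd "rsync -av --progress --stats" 10MB_file.dat .
```

Copying a printed line into the terminal runs the corresponding transfer; the trailing "." in the download commands means the current local directory.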
  • Repeating the rsync command should transfer only a few bytes, because the primary function of rsync is to synchronise files and directory structures and the file already exists unchanged at the destination.
  • To transfer large volumes of data to/from JASMIN, two tools are available: bbcp and globus-url-copy. The latter is the GridFTP client from the Globus Toolkit and is often referred to simply as gridftp. Here we are only going to look at globus-url-copy.
  • Once globus-url-copy is installed, you may need to set the GLOBUS_TCP_PORT_RANGE environment variable; its value should be two comma-separated numbers giving the range of TCP ports that globus-url-copy is allowed to use (e.g. 50000,52000). To check whether this variable is set, run env | grep GLOBUS.
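The check described above can be taken one step further by testing that the value has the expected min,max form. The following is a minimal sketch; the function name valid_port_range is hypothetical, and the range 50000,52000 is only the example value from the text, since the ports actually permitted depend on the firewall configuration at your site.

```shell
#!/bin/sh
# valid_port_range VALUE: succeeds if VALUE is "min,max" with
# 1 <= min <= max <= 65535 (hypothetical helper, not part of Globus).
valid_port_range() {
    case $1 in
        ''|*[!0-9,]*|*,*,*|,*|*,) return 1 ;;   # empty, stray characters, wrong comma count
        *,*) ;;                                  # exactly one comma between two numbers
        *) return 1 ;;                           # no comma at all
    esac
    min=${1%,*}
    max=${1#*,}
    [ "$min" -ge 1 ] && [ "$max" -le 65535 ] && [ "$min" -le "$max" ]
}

if [ -z "${GLOBUS_TCP_PORT_RANGE:-}" ]; then
    echo "GLOBUS_TCP_PORT_RANGE is not set"
elif valid_port_range "$GLOBUS_TCP_PORT_RANGE"; then
    echo "GLOBUS_TCP_PORT_RANGE=$GLOBUS_TCP_PORT_RANGE looks well formed"
else
    echo "GLOBUS_TCP_PORT_RANGE=$GLOBUS_TCP_PORT_RANGE is not of the form min,max"
fi

# To set the variable for the current session (example value from the text):
#   export GLOBUS_TCP_PORT_RANGE=50000,52000
```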
  • To copy a file use commands such as;
    globus-url-copy -vb file:///nerc/n02/n02/mattmiz/10MB_file.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/10MB_file.dat
    This will copy the local file at /nerc/n02/n02/mattmiz/10MB_file.dat to /group_workspaces/jasmin2/primavera1/cache/msmizielinski/10MB_file.dat on jasmin-xfer1 (user name and full host name as specified in .ssh/config).
  • To perform the transfer over multiple parallel data connections, specify the number of connections using the -p option;
    globus-url-copy -vb -p 10 file:///nerc/n02/n02/mattmiz/10MB_file.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/10MB_file.dat
    In many cases this can yield significant performance improvements.
  • To transfer multiple files, place the source and destination URLs in a text file (here gridftp_url_list.txt), one source/destination pair per line, e.g.
    file:///nerc/n02/n02/mattmiz/50MB_file_1.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/50MB_file_1.dat
    file:///nerc/n02/n02/mattmiz/50MB_file_2.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/50MB_file_2.dat
    file:///nerc/n02/n02/mattmiz/50MB_file_3.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/50MB_file_3.dat
    file:///nerc/n02/n02/mattmiz/50MB_file_4.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/50MB_file_4.dat
    file:///nerc/n02/n02/mattmiz/50MB_file_5.dat sshftp://jasmin-xfer1//group_workspaces/jasmin2/primavera1/cache/msmizielinski/50MB_file_5.dat
    and the transfers specified in the file are then run in sequence using
    globus-url-copy -vb -p 10 -f gridftp_url_list.txt 
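Writing the URL list by hand becomes tedious for more than a handful of files. The sketch below generates one source/destination pair per line for every regular file in a local directory, in the same format as the list shown above. The function name make_url_list is hypothetical, and the sketch assumes file names without spaces, since a space separates the two URLs on each line.

```shell
#!/bin/sh
# make_url_list SRC_DIR DEST_HOST DEST_DIR
# Print "file://... sshftp://..." pairs for each regular file in SRC_DIR.
# SRC_DIR and DEST_DIR are assumed to be absolute paths.
make_url_list() {
    src_dir=$1      # absolute local directory
    dest_host=$2    # host alias from .ssh/config
    dest_dir=$3     # absolute directory on the remote host
    for path in "$src_dir"/*; do
        [ -f "$path" ] || continue      # skip subdirectories and unmatched globs
        name=${path##*/}                # strip the directory part
        printf 'file://%s sshftp://%s/%s/%s\n' "$path" "$dest_host" "$dest_dir" "$name"
    done
}

# Example (not run here):
#   make_url_list /nerc/n02/n02/mattmiz jasmin-xfer1 \
#       /group_workspaces/jasmin2/primavera1/cache/msmizielinski > gridftp_url_list.txt
```

Because both directories start with a slash, the output contains file:/// and a double slash after the host name, matching the URLs used in this lesson.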
    
  • When running large batches of file transfers, you may need to disconnect the globus-url-copy process from your login session so that it keeps running after you log out. One way to do this is with nohup, e.g.
    nohup globus-url-copy -vb -p 10 -f gridftp_url_list.txt > copy_files.out &
    Alternatively, the Unix utility screen can be used. To start a screen session for managing a data transfer process, run screen, start a new ssh-agent session and load your private key into it;
    screen
    eval ssh-agent $SHELL
    ssh-add .ssh/<path to private ssh key>
    Then run the transfer command;
    globus-url-copy -vb -p 10 -f gridftp_url_list.txt
    and disconnect from the screen session using Ctrl-a followed by d (see the screen man page for a full description of screen commands). The processes within your screen session will continue to run as if you were still connected to that session. To list the screen sessions you currently have, run
    screen -ls
    And you can reconnect to a screen session using
    screen -r <PID of session>
    or screen -r if there is only a single screen session running.
    Should you have trouble re-attaching a screen session, it may be necessary to add the -d option (screen -d -r) to force the session to detach first.
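For the nohup approach described earlier in this bullet, it can be useful to record the process ID of the detached transfer so that its status can be checked after logging back in. The following is a minimal sketch; the function names start_detached and is_running and the file name copy_files.pid are hypothetical. Note that kill -0 only tests whether a process with that ID exists, so a stale PID file could in principle match an unrelated process.

```shell
#!/bin/sh
# start_detached LOGFILE PIDFILE COMMAND [ARGS...]
# Run COMMAND under nohup in the background, sending its output to LOGFILE
# and writing its process ID to PIDFILE.
start_detached() {
    logfile=$1
    pidfile=$2
    shift 2
    nohup "$@" > "$logfile" 2>&1 &
    echo $! > "$pidfile"
}

# is_running PIDFILE: succeeds while the recorded process still exists.
is_running() {
    kill -0 "$(cat "$1")" 2>/dev/null
}

# Example (not run here):
#   start_detached copy_files.out copy_files.pid \
#       globus-url-copy -vb -p 10 -f gridftp_url_list.txt
#   is_running copy_files.pid && echo "transfer still running"
```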