Getting started with JASMIN

Lesson 3: Understanding JASMIN storage

Transcript

  • The pan_df command reports the use of space on a workspace, much as the standard Unix command df does on many other systems. Note that on JASMIN df itself gives misleading information for most file systems. As on many systems, care is needed when interpreting the results. Compare the commands:
    
      $ pan_df -h ~/workspaces/primavera1
      $ pan_df -H ~/workspaces/primavera1 
    The first gives results in binary units, the second in decimal units. For individual files the difference is not dramatic, but a tebibyte (2^40 bytes) is almost 10% larger than a terabyte (10^12 bytes). By default most Unix utilities work in binary units such as the tebibyte, unless an argument such as --si is included.
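
The "almost 10%" figure can be checked with a little arithmetic. The sketch below is only an illustration: the variable names are made up for this example, and awk is used because the shell itself has no floating-point arithmetic.

```shell
# Compare one tebibyte (2^40 bytes) with one terabyte (10^12 bytes).
# tib, tb and pct are illustrative names, not part of any JASMIN tool.
tib=$((1 << 40))        # 1099511627776 bytes
tb=1000000000000        # 10^12 bytes

# Percentage by which the tebibyte exceeds the terabyte.
pct=$(awk -v tib="$tib" -v tb="$tb" 'BEGIN { printf "%.2f%%", (tib / tb - 1) * 100 }')
echo "$pct"             # prints 9.95%
```

The gap grows with each prefix: it is about 2.4% at the kilo level and about 7.4% at the giga level, which is why it only becomes significant for large workspaces.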
  • One quirk of the Panasas storage used on JASMIN is the relationship between the size of the files being stored and the amount of storage they take up. This can be demonstrated using the HighResMIP directory within /group_workspaces/jasmin2/primavera1/open/:
    
      $ du -hs --si HighResMIP
      $ du -hs --si --apparent-size HighResMIP
    The first command (at time of writing) reports that 254 GB (SI units) of storage is taken up by the HighResMIP directory. The second command shows that the files within the HighResMIP directory hold 203 GB of data. The remaining 51 GB, about 20% of the storage used, is an overhead of working with the Panasas system. The size of this overhead depends on the size of the individual files. The plot below shows the ratio of storage used to file size for files of different sizes:

    Storage to file size ratios for JASMIN storage

    One implication of this is that storing a hundred 1 MB files takes up 14% more storage than storing a single 100 MB file.

    The size of this overhead settles down for files above 100 MB, so multi-terabyte data sets should not be stored in files smaller than this.
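
The comparison of the two du commands above can be turned into a small check for any directory. What follows is a minimal sketch, not a JASMIN-provided tool: the function names overhead_pct, storage_overhead and count_small_files are invented for this example, and the --apparent-size and --block-size options assume GNU du.

```shell
# overhead_pct USED APPARENT: share of the storage used that is not file data.
overhead_pct() {
    awk -v u="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (u - a) / u * 100 }'
}

# storage_overhead DIR: measure a directory with du (GNU du assumed) and
# report storage used, apparent file size and the overhead percentage.
storage_overhead() {
    used=$(du -s --block-size=1 "$1" | cut -f1)
    apparent=$(du -s --apparent-size --block-size=1 "$1" | cut -f1)
    awk -v u="$used" -v a="$apparent" 'BEGIN {
        if (u > 0)
            printf "used=%.0f apparent=%.0f overhead=%.1f%%\n", u, a, (u - a) / u * 100
        else
            printf "used=0 apparent=%.0f\n", a
    }'
}

# count_small_files DIR: number of regular files below 100 MB (SI units).
# The c suffix makes find compare exact byte counts without rounding.
count_small_files() {
    find "$1" -type f -size -100000000c | wc -l | tr -d ' '
}

# Figures quoted in the text, in GB.
overhead_pct 254 203    # prints 20.1
```

On JASMIN the functions could then be run against a real directory, for example storage_overhead HighResMIP followed by count_small_files HighResMIP. A large count of small files alongside a high overhead suggests the data would be stored more efficiently in fewer, larger files.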