Data Management¶
This section gives you information about PDC’s storage solutions. Working with PDC can involve transferring data back and forth between your local machine and PDC resources, or between different systems at PDC.
If you have a SNIC Swestore allocation, please see the File transfer section for how to transfer files to and from Swestore.
Where to store my data¶
As the speed of CPU computations keeps increasing, the relatively slow rate of input/output (I/O) and data access operations can create bottlenecks and slow programs down significantly. It is therefore very important to pay attention to how your programs perform I/O and access data, as this can have a huge impact on the run time of your jobs. Here you will find a quick guide to storing data, ideal if you have just started to use PDC resources.
What is Lustre?
The Lustre system is a parallel file system optimized for handling data from many clients at the same time.
Things to remember when using all types of files
Minimize I/O operations: larger input/output (I/O) operations are more efficient than small ones – if possible aggregate reads/writes into larger blocks.
Avoid creating too many files – post-processing a large number of files can be very hard on the file system.
Avoid creating directories with very large numbers of files – instead create directory hierarchies, which also improves interactiveness.
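As a rough illustration of the points above, the following Python sketch batches many small records into a few large writes and spreads output files over a two-level directory hierarchy instead of one flat directory. The function names, the 1 MiB buffer size, and the hash-based layout are assumptions chosen for illustration, not PDC recommendations.

```python
import hashlib
import os


def write_records(path, records, buffer_size=1024 * 1024):
    """Write many small records through one large buffer.

    The file is opened once; Python flushes the buffer to the file system
    in chunks of roughly `buffer_size` bytes instead of one write per record.
    """
    with open(path, "w", buffering=buffer_size) as handle:
        for record in records:
            handle.write(record + "\n")


def sharded_path(root, filename, levels=2):
    """Map a file name to root/ab/cd/filename using a hash of the name.

    This keeps the number of entries per directory small
    (hypothetical layout, chosen only for this sketch).
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

For example, sharded_path("results", "run_000123.dat") returns a path of the form results/xx/yy/run_000123.dat, where xx and yy are hex digits derived from the file name. Create the parent directory with os.makedirs(..., exist_ok=True) before writing to it.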
Things to remember when using Lustre
Avoid all unnecessary metadata operations – once a file is opened, do as much as possible before closing it again. Do not check the existence of files or stat() files too often.
Open files as read-only if possible – read-only files require less locking and therefore put less load on the file system.
Avoid using ls with flags like -l, -F, or --color, as this requires ls to stat() every file to determine its type, which puts an unnecessary load on the file system. Use such flags only when the extra information is really needed and do not have them as default.
Summary of Lustre¶
File system | Lustre |
---|---|
Suggested usage | large files; program code; files accessed for computation |
Location | /cfs/klemming |
Storage size | On Dardel 12 PB shared. |
File access speed | Fast |
File access | supports standard POSIX ACLs |
Backup | files are not backed up. On Dardel home directories are backed up |
Contents | 1. HOME area, suitable for analysis results: /cfs/klemming/home/[u]/[username]; 2. scratch, suitable for temporary storage: /cfs/klemming/scratch/[u]/[username]; 3. project data: /cfs/klemming/snic/[projectname] |
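The Lustre advice above (few metadata operations, read-only opens, no per-file stat()) might look like this in a Python program. This is a minimal sketch: the function names and the one-number-per-line file format are assumptions made for the example.

```python
import os


def list_names(directory):
    """Return sorted entry names from the directory listing only.

    Only entry.name is used and entry.stat() is never called, so no
    per-file stat() is issued by this function.
    """
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def sum_values(path):
    """Open the file once, read-only, and do all the work in a single pass."""
    total = 0.0
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if line:
                total += float(line)
    return total
```

Opening the file directly and handling a possible FileNotFoundError is usually preferable to checking for existence first, since the check is an extra metadata operation.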
After running your processes
After performing computations at PDC, please move important data files to your own departmental storage system or to a national storage system provided by SNIC (Swestore). Remember, space on Lustre is currently limited, and NOT backed up. However, home directories on Dardel (residing in Lustre) are backed up.
SNIC environmental variables¶
To make it easier to find the right folders, SNIC provides a number of environment variables that indicate where data should be stored. On Dardel, the module snic_env is loaded by default.
Table of the environmental variables
Name | Function | Location on Dardel |
---|---|---|
SNIC_BACKUP | Where important data are backed up. | Your klemming home directory |
SNIC_NOBACKUP | Not backed up folder for large data | /cfs/klemming/projects or /cfs/klemming/nobackup |
SNIC_RESOURCE | Name of the cluster you are logged into | Dardel |
SNIC_SITE | Name of the site | PDC |
SNIC_TMP | Scratch folder for storing temporary data | /cfs/klemming/projects or /cfs/klemming/projects |
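A program can read these variables instead of hard-coding paths. The sketch below assumes the variables from the table are present in the environment (for example because snic_env is loaded); the helper names snic_environment and scratch_file are hypothetical.

```python
import os

# Variable names taken from the table above; their values depend on the system.
SNIC_VARIABLES = (
    "SNIC_BACKUP",
    "SNIC_NOBACKUP",
    "SNIC_RESOURCE",
    "SNIC_SITE",
    "SNIC_TMP",
)


def snic_environment(env=None):
    """Return a dict of SNIC variable names to their values (None if unset)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in SNIC_VARIABLES}


def scratch_file(filename, env=None):
    """Build a path under SNIC_TMP, failing clearly if the variable is unset."""
    tmp = snic_environment(env)["SNIC_TMP"]
    if not tmp:
        raise RuntimeError("SNIC_TMP is not set; is the snic_env module loaded?")
    return os.path.join(tmp, filename)
```

Passing a plain dict as env makes the helpers easy to try out on a machine where the SNIC variables are not defined.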
Swestore¶
Swestore is a large scale storage system for live research data provided by SNIC. It requires a separate allocation to use. Part of the Swestore system is hosted at PDC.
File transfer¶
We recommend the following methods for transferring files to and from PDC:
scp/rsync: With Secure Copy (SCP) and rsync you can copy files between your local machine and PDC systems. They use SSH for data transfer, and thus the same authentication as for logging in.
Swestore (dCache): If you have a SNIC Swestore (dCache) allocation, please see the Swestore instructions for how to transfer files to and from it.
KTH OneDrive (rclone): Use rclone to transfer data between PDC and KTH OneDrive cloud storage.
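As an illustration of the rsync method listed above, the following Python sketch builds an rsync command line and optionally runs it. It assumes rsync is installed locally and that SSH login to PDC already works; the function names, the user name, and the paths in the usage note are hypothetical.

```python
import subprocess


def rsync_command(source, username, remote_dir, host="dardel.pdc.kth.se"):
    """Build an rsync command line that copies `source` to `remote_dir` on `host`."""
    destination = f"{username}@{host}:{remote_dir}"
    # -a: archive mode (recursive, preserves times and permissions)
    # -v: verbose output
    # --partial: keep partially transferred files so a rerun can resume
    return ["rsync", "-av", "--partial", source, destination]


def upload(source, username, remote_dir, dry_run=True):
    """Run the transfer; with dry_run=True rsync only reports what it would do."""
    command = rsync_command(source, username, remote_dir)
    if dry_run:
        command.insert(1, "--dry-run")
    return subprocess.run(command, check=True)
```

A call such as upload("results/", "alice", "/cfs/klemming/home/a/alice/results/") would first show what rsync intends to copy; passing dry_run=False performs the actual transfer.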
Nodes for file operations¶
At PDC we have a number of transfer nodes set up. These nodes are dedicated to large file transfers, but also to extensive file operations involving large amounts of data or many files. It is important that you use these nodes for extensive file operations so as not to overload the login node.
Dedicated transfer nodes for large file transfers will be set up on Dardel. In the meantime, please use the dardel.pdc.kth.se login node for file transfers.
Name | Type | Usage |
---|---|---|
dardel.pdc.kth.se | Login node (Dardel) | Submitting jobs and small file transfers |
dardel.pdc.kth.se | Login node (Dardel) | Large transfers and operations on the file system |