Cluster Support

 
Who can use the cluster?
  • every DIUF staff user
  • DIUF students (after discussion)
What do I need to do?
How can I log in?
To log in, use ssh in one of these two forms:
  • ssh -l USERNAME diuf-cluster
  • ssh USERNAME@diuf-cluster
How can I use it?
  • First go to the home on the master node (the master node is the node to which you logged in)
    • the home on the master node is located in /home/USERNAME
    • this space totals about 8.8 TB, shared by all users, so usage is monitored.
    • important: data on this space is not backed up
  • Your 'normal' home from diuf-file is located at /diufhome/USERNAME
  • Execute jobs from the master node
  • We use Sun Grid Engine. To execute a job named run.sh, do the following:
    • add the following lines to each job file (run.sh in our example). Keep the leading # characters exactly as printed below; the grid engine reads the #$ lines as options:
#!/bin/bash
#$ -N JOBNAME
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -m eas
  • copy the job and its related files into /home/USERNAME and make the job executable (chmod a+x run.sh)
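
Put together, a complete job file might look like the sketch below. The header lines are the ones listed above; the job name example_job and the payload (a small sum) are placeholders chosen for illustration, not something the cluster requires.

```shell
#!/bin/bash
#$ -N example_job
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -m eas

# Everything below is the actual work of the job (hypothetical payload).
# Bash treats the "#$" lines above as comments; only the grid engine reads them.
total=0
for i in 1 2 3 4 5; do
    total=$((total + i))
done

# -cwd runs the job in the directory you submitted from, and -j y merges
# error messages into the normal output.
echo "sum=$total on host $(hostname)"
```

Because the header lines are ordinary comments to bash, the same file can be run directly on the master node (./run.sh) as a quick check before it is submitted.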

Very important: the -m line controls which mails you receive. The line #$ -m n results in NO mail being sent to you. The letters below can be combined, for example:

-m beas
‘b’ Mail is sent at the beginning of the job
‘e’ Mail is sent at the end of the job
‘a’ Mail is sent when the job is aborted or  rescheduled
‘s’ Mail is sent when the job is suspended
‘n’ No mail is sent

Please note that you will get a mail for each job, so if you execute 2500 jobs, you will receive at least 2500 mails. 
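
If you want to switch the mail setting of an existing job file before a large batch, a small helper along these lines may be convenient. This is only a sketch: the function name set_mail_option is made up for this example, and it assumes the job file contains exactly one "#$ -m" line and that GNU sed is available.

```shell
#!/bin/bash
# set_mail_option FILE FLAGS   (hypothetical helper, not part of the grid engine)
# Rewrites the "#$ -m ..." line of a job file in place.
set_mail_option() {
    local script="$1" flags="$2"
    sed -i 's/^#\$ -m .*/#$ -m '"$flags"'/' "$script"
}

# Demonstration on a throwaway file, so no real job file is touched.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#$ -N JOBNAME' '#$ -m eas' 'echo hello' > "$demo"
set_mail_option "$demo" n
grep -- '-m' "$demo"    # prints: #$ -m n
rm -f "$demo"
```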

 
Submit the job on the master node using qsub
  • enter qsub run.sh

To query your jobs:
  • enter qstat -u USERNAME
States
r       running
qw      waiting for a free node
Eqw     error (this job will never run)
E       error (this job will never run)
t       transferring to a node (a job stuck in this state will never run)
dt      being deleted while transferring (this job will never run)
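
With many jobs in the queue, a per-state count is easier to read than the full listing. The sketch below assumes the usual qstat layout (two header lines, then one job per line with the state in the fifth column); the function name summarize_states is invented for this example.

```shell
#!/bin/bash
# summarize_states: reads qstat output on stdin and prints "<state> <count>" lines.
# Assumption: two header lines, then one job per line with the state in column 5.
summarize_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the master node:
#   qstat -u USERNAME | summarize_states
```

If the column layout on the cluster differs, only the field number in the awk expression needs to change.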
 
Note: A user once submitted over 1 million jobs at once, which crashed the cluster. Our cluster is powerful, but it has its limits. 
 
Dealing with the most frequent errors
Eqw error:
  • use "qmod -c <jobid>" to clear the job error state
  • use "qdel -f <jobid>" to force deletion of the job
  • you can also try "qstat -explain E"
  • Read your mail on the master to figure out the error: cat /var/spool/mail/USERNAME
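
When several jobs are stuck in Eqw, the commands above can be applied to all of them in one go. The following is a sketch under the same assumption about the qstat layout (job ID in column 1, state in column 5); eqw_job_ids and clear_eqw_jobs are hypothetical helper names.

```shell
#!/bin/bash
# eqw_job_ids: reads qstat output on stdin, prints the IDs of jobs in state Eqw.
# Assumption: job ID in column 1 and state in column 5, after two header lines.
eqw_job_ids() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}

# clear_eqw_jobs USERNAME: clears the error state of each Eqw job of that user.
clear_eqw_jobs() {
    local user="$1" id
    qstat -u "$user" | eqw_job_ids | while read -r id; do
        qmod -c "$id"    # or: qdel -f "$id" to force deletion instead
    done
}
```

Running qstat -u USERNAME | eqw_job_ids on its own first shows which jobs would be affected before anything is cleared or deleted.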

Mail Transfer Error When Using Windows
 
It seems that the mail server adds a \r (carriage return) to the mails. You can use the following line in your script to avoid this error:
	sed -i 's/\r//g' a_file.sh

To add the \r again, you could use

	sed -i 's/$/\r/g' a_file.sh
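
To check whether a file is affected before changing it, something like the following sketch can be used. The helper names has_crlf and strip_crlf are invented here. Unlike the command above, strip_crlf only removes a \r at the end of a line, which is an assumption about where the stray characters appear. GNU sed and grep are assumed.

```shell
#!/bin/bash
# has_crlf FILE: succeeds if any line of FILE ends with a carriage return.
has_crlf() {
    grep -q $'\r$' "$1"
}

# strip_crlf FILE: removes a trailing \r from every line, in place (GNU sed).
strip_crlf() {
    sed -i 's/\r$//' "$1"
}

# Demonstration on a temporary file with Windows line endings.
demo=$(mktemp)
printf 'echo one\r\necho two\r\n' > "$demo"
has_crlf "$demo" && strip_crlf "$demo"
has_crlf "$demo" || echo "line endings are clean"
rm -f "$demo"
```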
 
User Guide
If you need more information, you can download the User Guide
 
Technical Information
  • Total number of nodes (including master): 9
  • CPUs: 134
  • Storage: 8.8T
  • Memory: 32827356 kB total (master)
  • Interconnect: InfiniBand