The cluster uses Slurm for job scheduling and resource management. Slurm controls how jobs are queued, scheduled, and executed across compute nodes.
All compute jobs must be submitted through Slurm. Do not run compute-intensive applications directly on the login node.
1. Partitions
A partition (also called a queue) is a logical grouping of compute nodes. Each partition may have different hardware, limits, and runtime policies.
To see available partitions and their status:
sinfo
This command shows:
- Partition name
- Partition availability (up or down)
- Node state (idle, allocated, etc.)
- Maximum runtime limits
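The default sinfo layout can be narrowed to just the fields listed above. The sketch below uses the standard sinfo format fields %P (partition), %a (availability), %l (time limit), %D (node count) and %t (compact state). The helper names partition_summary and idle_partitions are hypothetical and chosen only for illustration.

```shell
# Print one line per partition and state: name, availability, time limit, node count, state.
partition_summary() {
    sinfo --noheader --format="%P %a %l %D %t"
}

# Keep only the partitions that currently report idle nodes.
# Reads partition_summary-style lines on stdin; the trailing "*" that marks
# the default partition is removed from the name.
idle_partitions() {
    awk '$5 == "idle" { sub(/\*$/, "", $1); print $1 }' | sort -u
}

# Usage on the cluster:
#   partition_summary | idle_partitions
```

Splitting the query from the filter keeps the awk step usable on saved sinfo output as well.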
2. Submitting a Job
Jobs are submitted using the sbatch command with a job script.
Basic Submission
sbatch your_job_script.sh
Requesting Specific Resources
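Options given on the sbatch command line take precedence over the matching #SBATCH directives in the job script, so you can override resource requests without editing the script: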
sbatch --nodes=2 --ntasks-per-node=32 your_job_script.sh
Submitting to a Specific Partition
sbatch --partition=cpucluster your_job_script.sh
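When a submission is scripted, it is often useful to capture the job ID for later status checks. The following is a minimal sketch, assuming the sbatch --parsable option, which prints only the job ID (optionally followed by ";" and a cluster name). The submit_job helper is a hypothetical name, not part of Slurm.

```shell
# Hypothetical helper: submit a script to a partition and print only the job ID.
# Usage: submit_job <partition> <script> [extra sbatch options...]
submit_job() {
    partition="$1"
    script="$2"
    shift 2
    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence
    out=$(sbatch --parsable --partition="$partition" "$@" "$script") || return 1
    # Drop the optional ";cluster" suffix
    echo "${out%%;*}"
}

# Example on the cluster (partition and script names taken from the commands above):
#   job_id=$(submit_job cpucluster your_job_script.sh --nodes=2 --ntasks-per-node=32)
```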
3. Example Slurm Job Script
#!/bin/bash
#SBATCH --partition=cputest
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
# This is the command to run the program
echo "Hello, World!"
sleep 30
Save the script as my_job_script.sh and submit it:
sbatch my_job_script.sh
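Inside a running job, Slurm exports environment variables such as SLURM_JOB_ID and SLURM_CPUS_PER_TASK. The sketch below is a variant of the example script that reads them. The job name, the output file pattern, and the fallback defaults are illustrative assumptions, not site settings.

```shell
#!/bin/bash
#SBATCH --job-name=env_report
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --output=env_report_%j.out

# Slurm sets these variables inside a job; the defaults after ":-" only
# apply when the script is run by hand outside Slurm (an assumption for testing).
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"

# Match the thread count of OpenMP programs to the CPUs that were requested
export OMP_NUM_THREADS="$cpus"

echo "job=$job_id cpus=$cpus"
```

In the output file name, %j is replaced by the job ID, so repeated submissions do not overwrite each other.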
4. Difference Between sbatch and srun
- sbatch: Submits a batch job script to be scheduled and executed later.
- srun: Launches parallel tasks within a job allocation (used inside job scripts or interactive sessions). If run outside an allocation, srun first requests one and waits until the resources are granted.
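The two commands are typically combined: sbatch creates the allocation and runs the script once, and srun starts the tasks inside it. The following is a minimal sketch. The run_tasks helper and its local fallback are assumptions added so that the script can also be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:05:00

# Hypothetical helper: use srun when it is available, otherwise run the command once locally.
run_tasks() {
    if command -v srun >/dev/null 2>&1; then
        srun "$@"    # one copy of the command per task (4 with the request above)
    else
        "$@"         # fallback for trying the script outside Slurm
    fi
}

# Plain commands in the script body run once, on the first allocated node
echo "batch script running"

# Commands started through srun run once per task; uname -n prints the node name
run_tasks uname -n
```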
5. Interactive Jobs
To request an interactive session:
salloc --nodes=1 --ntasks=8 --time=01:00:00
Once the allocation is granted, run your program inside it with:
srun ./my_program
6. Checking Job Status
View your running and pending jobs:
squeue -u your_username
View completed jobs (by default, sacct lists only jobs from the current day; add --starttime to look further back):
sacct -u your_username
Cancel a job:
scancel jobID
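The status commands above can be combined to wait for a job and then print its accounting record. This is a sketch under stated assumptions: wait_for_job is a hypothetical helper, the 30-second default interval is arbitrary, and the job is treated as finished once squeue no longer lists it.

```shell
# Hypothetical helper: block until a job leaves the queue, then show its accounting record.
# Usage: wait_for_job <job_id> [poll_interval_seconds]
wait_for_job() {
    job_id="$1"
    interval="${2:-30}"
    # squeue -h suppresses the header, so empty output means the job is no longer queued or running
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed,ExitCode
}

# Example on the cluster:
#   wait_for_job 123456 60
```

A generous polling interval is kinder to the scheduler than a tight loop.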