GPU jobs
Single-GPU
All GPU nodes in Deucalion (gnx[501-533]) are exclusive, meaning that one cannot allocate (and consequently be billed for) fewer than 4 GPUs, which is the number of NVIDIA A100 GPUs per node. Since any job will be billed for the whole node, if your code can only use one GPU at a time you can start four simultaneous simulations:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --time=4:00:00
#SBATCH --partition normal-a100-40
#SBATCH --mem=0
#SBATCH --account=<slurm_account>
ml OpenMPI/5.0.3-GCC-13.3.0 CUDA/12.4.0 NCCL/2.20.5-GCCcore-13.3.0-CUDA-12.4.0
# pin each copy of the code to one GPU and launch it in the background
CUDA_VISIBLE_DEVICES=0 srun -n1 code input0 &
CUDA_VISIBLE_DEVICES=1 srun -n1 code input1 &
CUDA_VISIBLE_DEVICES=2 srun -n1 code input2 &
CUDA_VISIBLE_DEVICES=3 srun -n1 code input3 &
# block until all four background runs have finished
wait
Because of the final wait, the job only finishes after every background process has ended.
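The four launches above can also be written as a loop (a minimal sketch, using the same placeholder executable code and inputs input0 to input3 as in the script):

for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu srun -n1 code input$gpu &   # one run per GPU
done
wait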
Multi-GPU in a single node
Even though some Python codes automatically grab every available GPU (without requiring the explicit use of srun), it is useful to look at the output of nvidia-smi to guarantee that you are running at least one process per GPU; a sketch of how to do this is shown after the script. The following jobscript runs the HPL benchmark on 4 GPUs using srun (single node).
#!/bin/bash
#SBATCH -A <slurm_account>
#SBATCH -t 00:30:00
#SBATCH -p normal-a100-40
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --output=results/%j.out
ml OpenMPI/5.0.3-GCC-13.3.0 CUDA/12.4.0 NCCL/2.20.5-GCCcore-13.3.0-CUDA-12.4.0
export CUDA_VISIBLE_DEVICES=0,1,2,3
srun hpl.sh --dat sample-dat/HPL-4GPUs-40.dat
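While the job is running, you can attach an extra step to its allocation and inspect the GPUs directly (a minimal sketch; <jobid> is a placeholder for the ID printed by sbatch, and --overlap lets the extra step share the resources already in use):

srun --jobid=<jobid> --overlap -N1 -n1 nvidia-smi

All four HPL processes should appear in the nvidia-smi process list, one per GPU.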
Multi-node
For multi-node jobs, you should check that every GPU in every node is being used (otherwise you are wasting the allocated resources); a sketch of such a check is shown after the script. The following jobscript runs the HPL benchmark on 16 GPUs (4 nodes with 4 GPUs each):
#!/bin/bash
#SBATCH -A <slurm_account>
#SBATCH -t 00:30:00
#SBATCH -p normal-a100-40
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
#SBATCH --output=results/%j.out
ml OpenMPI/5.0.3-GCC-13.3.0 CUDA/12.4.0 NCCL/2.20.5-GCCcore-13.3.0-CUDA-12.4.0
export CUDA_VISIBLE_DEVICES=0,1,2,3
srun hpl.sh --dat sample-dat/HPL-16GPUs-40.dat
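One way to confirm that all 16 GPUs are busy is to query utilization from one task per node while the benchmark runs (again a sketch; <jobid> is a placeholder, and the query flags are standard nvidia-smi options):

srun --jobid=<jobid> --overlap --ntasks-per-node=1 -l nvidia-smi --query-gpu=index,utilization.gpu --format=csv

Every node should report non-zero utilization on GPUs 0 to 3.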