Submitting Batch Jobs (Slurm Examples)
For any task that is computationally intensive or expected to run for more than a few minutes, you should not run it directly on a login node. Instead, submit it to the Slurm workload manager as a batch job.
Basic Batch Job
Create a script (e.g., train.sh):
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --partition=general
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
# Activate your environment
source activate myenv
# Run your code
python train.py
Submit it:
sbatch train.sh
Choosing a QoS Tier
Every job runs under a QoS tier that determines its scheduling priority, preemption behavior, and fairshare cost. Add --qos=<tier> to your submission.
| Tier | Flag | Fairshare Cost | Preemptable? | Max Jobs | Best For |
|---|---|---|---|---|---|
general |
--qos=general (or omit) |
1x | Yes | 24 | Sweeps, training, batch work |
protected |
--qos=protected |
4x | No | 1 | Learning, short guaranteed jobs |
interactive |
--qos=interactive |
8x | No | 1 | Live development, debugging |
If you do not specify --qos, your job defaults to general. For the full tier specification — including fairshare math, wall-time limits, and the preemption hierarchy — see the Job Scheduling Policy.
DenyOnLimit on protected and interactive: If you try to submit a second protected job (or a second interactive session) while one is already running, the request is immediately rejected with an error — it will not queue silently. The rejection is normal behavior, not a bug. Wait for the first job to finish, or submit to general instead.
Examples
# Standard batch job (general tier, preemptable, cheap)
sbatch --qos=general train.sh
# Non-preemptable job (protected tier, 4x fairshare cost, 2h max)
sbatch --qos=protected --time=02:00:00 train.sh
You can also set the QoS inside the script:
#SBATCH --qos=protected
Job Arrays for Hyperparameter Sweeps
If you are running many similar jobs (parameter sweeps, cross-validation folds, etc.), use a job array instead of submitting individual jobs. Arrays are more efficient for Slurm to manage and respect the per-user job limits cleanly.
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --array=0-99%20
#SBATCH --output=logs/sweep_%A_%a.out
# Each task gets a unique SLURM_ARRAY_TASK_ID (0-99)
python train.py --config configs/sweep_${SLURM_ARRAY_TASK_ID}.yaml
Submit:
sbatch sweep.sh
The %20 in --array=0-99%20 limits the sweep to 20 concurrent tasks. This is important — without a throttle, a 100-task array could consume your entire job limit and flood the queue.
Recommended throttle values:
- Small, short jobs:
%20to%30 - GPU-intensive jobs:
%5to%10 - Start conservatively and increase if the cluster has capacity
Handling Preemption
Jobs on the general tier can be preempted — interrupted and requeued — when an interactive job needs the resources. This will only happen to jobs that have already been running for at least 5 minutes. Preemption works as follows:
- Slurm sends your job a SIGUSR1 signal
- Your job has 5 minutes (grace period) to save a checkpoint
- After the grace period, the job is killed and automatically requeued
- When resources become available, the job restarts from the beginning of the script
The Checkpoint Contract
To handle preemption gracefully, submit your job with these flags:
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
--signal=B:USR1@300— sends SIGUSR1 to the batch script 300 seconds (5 minutes) before the wall-time limit (and also at preemption time)--requeue— allows the job to be requeued after preemption
--signal=B:USR1@N is a per-job SBATCH directive and is QoS-agnostic: it fires N seconds before the wall-time limit on every QoS (general, protected, interactive) when you request it. Preemption signaling — also delivered via SIGUSR1 — is separate and only applies on general, since protected and interactive are not preemptable. If you want a checkpoint warning before the wall clock runs out on a protected job, add --signal=B:USR1@N to your submission; it will work even though protected is never preempted.
Wrapper Script
Here is a complete wrapper script that handles preemption with checkpoint/restart:
#!/bin/bash
#SBATCH --job-name=preemptable_train
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --output=logs/%x_%j.out
set -euo pipefail
child_pid=""
CKPT_DIR="/project/${SLURM_JOB_ACCOUNT}/checkpoints/${USER}/${SLURM_JOB_NAME}/${SLURM_JOB_ID}"
mkdir -p "$CKPT_DIR"
# Send a signal to the training process group, falling back to the direct child PID.
# This relies on launching Python with `setsid` below: the child PID is then also
# the process-group ID, so `-$child_pid` targets the whole training process group.
# This matters for parallelized workflows using things like multiprocessing, joblib, etc.
signal_child_group() {
local signal="$1"
if [[ -n "${child_pid:-}" ]] && kill -0 "$child_pid" 2>/dev/null; then
echo "[$(date)] Sending ${signal} to training process group..."
kill -s "${signal}" -- "-$child_pid" 2>/dev/null || \
kill -s "${signal}" "$child_pid" 2>/dev/null || \
true
fi
}
# Handle Slurm's preemption warning signal.
# Python should checkpoint and exit. In the usual Slurm path, this wrapper exits 99
# to request requeue; if Python exits 99 directly, the defensive path below propagates it.
on_preempt() {
echo "[$(date)] USR1 received: checkpointing before preemption or time-limit warning."
signal_child_group USR1
# Let Python checkpoint and exit. Slurm's grace window is the hard deadline.
wait "$child_pid" || true
# DSI cluster policy: exit code 99 tells Slurm to requeue this job.
echo "[$(date)] Exiting with code 99 for Slurm requeue policy."
exit 99
}
# Handle TERM if it reaches the batch shell, e.g. scancel --batch or scheduler cleanup.
# Plain `scancel <jobid>` may terminate the Python process directly instead and will then
# be handled by the `wait "$child_pid"` below.
on_term() {
echo "[$(date)] TERM received: terminating without intentional requeue."
signal_child_group TERM
wait "$child_pid" || true
echo "[$(date)] Exiting due to TERM."
exit 143
}
trap on_preempt USR1
trap on_term TERM
# Check for existing checkpoint
latest_ckpt=$(ls -1t "$CKPT_DIR"/*.pt 2>/dev/null | head -n1 || true)
if [[ -n "${latest_ckpt:-}" ]]; then
echo "[$(date)] Resuming from checkpoint: $latest_ckpt (restart #${SLURM_RESTART_COUNT:-0})"
setsid python train.py --resume "$latest_ckpt" --ckpt-dir "$CKPT_DIR" &
else
echo "[$(date)] Starting fresh training run"
setsid python train.py --ckpt-dir "$CKPT_DIR" &
fi
child_pid=$!
return_code=0
wait "$child_pid" || return_code=$?
echo "[$(date)] Training exited with code ${return_code}"
# Defensive path: if Python itself exits 99 after checkpointing, propagate that
# as the Slurm requeue request. The normal Slurm path is still the USR1 trap above.
if [[ "$return_code" -eq 99 ]]; then
echo "[$(date)] Training requested requeue via exit code 99."
exit 99
fi
exit "$return_code"
About the checkpoint directory. The wrapper writes to
/project/$SLURM_JOB_ACCOUNT/checkpoints/$USER/$SLURM_JOB_NAME/$SLURM_JOB_ID. The$SLURM_JOB_IDsegment prevents two same-named jobs from clobbering or cross-resuming each other’s checkpoints;$SLURM_JOB_IDis preserved across requeues, so resume after preemption still works. Confirm the parent path/project/$SLURM_JOB_ACCOUNT/checkpoints/$USER/is writable by your account before relying on it — if/project/<your-account>/does not exist or is read-only for you, themkdir -pwill fail and the job will abort immediately. Substitute a path under/project/or/scratch/that you own. Runningtouch /project/$SLURM_JOB_ACCOUNT/checkpoints/test && rm /project/$SLURM_JOB_ACCOUNT/checkpoints/teston a login node is a quick way to check.
PyTorch Checkpoint Example
In your training script, handle SIGUSR1 to save a checkpoint:
import os
import signal
import sys
from pathlib import Path
import torch
should_checkpoint = False
def handle_preempt(signum, frame):
global should_checkpoint
print(f"Received signal {signum}; will checkpoint at next opportunity...", flush=True)
should_checkpoint = True
def atomic_torch_save(state, path):
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = path.with_suffix(path.suffix + ".tmp")
torch.save(state, tmp_path)
os.replace(tmp_path, path)
signal.signal(signal.SIGUSR1, handle_preempt)
for epoch in range(start_epoch, max_epochs):
for step, batch in enumerate(dataloader):
loss = train_step(model, batch)
if should_checkpoint:
ckpt_path = Path(ckpt_dir) / f"checkpoint_epoch{epoch}_step{step}.pt"
atomic_torch_save(
{
"epoch": epoch,
"step": step,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"loss": float(loss.detach().cpu()) if torch.is_tensor(loss) else float(loss),
},
ckpt_path,
)
print("Checkpoint saved. Exiting with code 99 for requeue.", flush=True)
sys.exit(99)
To resume from a checkpoint:
if args.resume:
checkpoint = torch.load(args.resume)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
What Preemption Looks Like for an Overnight Job
The 5-minute warning is a signal to your job, not to you — you do not need to be awake for it. Here is what actually happens if you submit a 12-hour training run at 6pm and go home:
- 6:00pm —
sbatch train.shsubmitted. Job ID 12345 starts on a GPU node. - 11:47pm — an
interactivejob requests a GPU; the scheduler picks your job for preemption. - 11:47pm — Slurm sends
SIGUSR1to your batch script. Your trap handler (see wrapper script above) forwards it to the training process. PyTorch catches it, setsshould_checkpoint = True, finishes the current step, writes a checkpoint to$CKPT_DIR, and exits with code 99. - 11:52pm — 5 minutes elapse. If the job has already exited cleanly it is requeued immediately. If it is still running, Slurm sends
SIGKILLand requeues it. - Sometime overnight — a GPU frees up. Slurm restarts job 12345 (same job ID,
SLURM_RESTART_COUNTincrements). The wrapper script finds the checkpoint in$CKPT_DIRand resumes from it. - Morning — you check
sacct -j 12345and see the restart history inlogs/my_training_12345.out.
The 5-minute warning is a deadline for your code, not for you. As long as your job implements the checkpoint contract, preemption is automatic and requires no human intervention. A job that does not implement checkpointing will simply restart from epoch 0 — correct, but wasteful.
Key recommendations for overnight and long-running jobs:
- Always set
--signal=B:USR1@300and--requeue. Without both, a preempted job has no warning and will not be requeued. - Checkpoint at a cadence your code can actually meet in 5 minutes. If a single training step takes 4 minutes, you need to checkpoint per-step. For typical step times (seconds), checkpointing every N steps or at each
should_checkpointflag check is sufficient. - Write checkpoints atomically — save to
checkpoint.pt.tmpandmvinto place — so aSIGKILLmid-write cannot corrupt the previous checkpoint. - Log to
logs/%x_%j.out(the%jis the job ID, which is preserved across requeues), so a single log file captures all restart attempts. - If preemption would be catastrophic (e.g., a rare reproducibility run, or you need guaranteed wall-clock completion), use
--qos=protectedinstead. The 2-hour limit and 4x fairshare cost are the tradeoff. - Test your signal handler during the day before relying on it overnight. You can simulate preemption with
scancel --signal=USR1 --batch <jobid>, which sends SIGUSR1 to the batch step without killing the job. Confirm that a checkpoint appears in$CKPT_DIRand that the job exits with code 99.
Parallel Python: multiprocessing & joblib
If your job uses multiprocessing.Pool, joblib.Parallel, or anything else that calls fork() to spawn workers, the parent process’s signal handler does not propagate to the workers automatically. The wrapper above signals the entire process group via setsid, but the Python side still needs to (1) make workers ignore SIGUSR1 so only the parent coordinates checkpointing, (2) record completed work between batches, and (3) write a single atomic checkpoint from the parent before exiting with code 99.
The two reference scripts below implement this pattern. Both use --ckpt-dir and --resume arguments matching the wrapper’s invocation.
multiprocessing example
import argparse
import os
import pickle
import signal
import sys
from pathlib import Path
from multiprocessing import Pool
should_checkpoint = False
def handle_preempt(signum, frame):
"""Record that Slurm requested checkpointing; keep the signal handler minimal."""
global should_checkpoint
print(f"Received signal {signum}; will checkpoint soon.", flush=True)
should_checkpoint = True
def ignore_preempt_in_worker():
"""Make workers ignore SIGUSR1 so only the parent coordinates checkpointing."""
signal.signal(signal.SIGUSR1, signal.SIG_IGN)
def atomic_pickle_save(state, path):
"""Write a checkpoint safely by saving to a temp file, then renaming atomically."""
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = path.with_suffix(path.suffix + ".tmp")
with open(tmp_path, "wb") as f:
pickle.dump(state, f)
os.replace(tmp_path, path)
def load_checkpoint(path):
"""Load a previously saved checkpoint so the job can resume after requeue."""
with open(path, "rb") as f:
return pickle.load(f)
def work(x):
"""Run one unit of parallel work; replace this with your actual computation."""
return x, x * x
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--ckpt-dir", required=True)
parser.add_argument("--resume", default=None)
args = parser.parse_args()
signal.signal(signal.SIGUSR1, handle_preempt)
ckpt_dir = Path(args.ckpt_dir)
ckpt_path = ckpt_dir / "mp_checkpoint.pkl"
items = list(range(1000))
results = []
start_index = 0
if args.resume:
state = load_checkpoint(args.resume)
start_index = state["next_item_index"]
results = state["results"]
print(f"Resuming from item index {start_index}", flush=True)
remaining_items = items[start_index:]
with Pool(processes=4, initializer=ignore_preempt_in_worker) as pool:
for absolute_index, result in enumerate(pool.imap(work, remaining_items), start=start_index):
results.append(result)
next_item_index = absolute_index + 1
if should_checkpoint:
# This checkpoint records only completed/yielded results.
# Any in-flight worker tasks may be discarded on exit and recomputed after resume.
atomic_pickle_save(
{
"next_item_index": next_item_index,
"results": results,
},
ckpt_path,
)
print(f"Checkpoint saved to {ckpt_path}. Exiting.", flush=True)
sys.exit(99)
atomic_pickle_save(
{
"next_item_index": len(items),
"results": results,
"complete": True,
},
ckpt_path,
)
if __name__ == "__main__":
main()
joblib example
import argparse
import os
import pickle
import signal
import sys
from pathlib import Path
from joblib import Parallel, delayed
should_checkpoint = False
def handle_preempt(signum, frame):
"""Record that Slurm requested checkpointing; joblib checks between batches."""
global should_checkpoint
print(f"Received signal {signum}; will checkpoint after current batch.", flush=True)
should_checkpoint = True
def atomic_pickle_save(state, path):
"""Write a checkpoint safely by saving to a temp file, then renaming atomically."""
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = path.with_suffix(path.suffix + ".tmp")
with open(tmp_path, "wb") as f:
pickle.dump(state, f)
os.replace(tmp_path, path)
def load_checkpoint(path):
"""Load a previously saved checkpoint so the job can resume after requeue."""
with open(path, "rb") as f:
return pickle.load(f)
def work(x):
"""Run one unit of parallel work; replace this with your actual computation."""
return x, x * x
def batched(items, batch_size):
"""Yield small batches so checkpointing can happen between joblib Parallel calls."""
for start in range(0, len(items), batch_size):
yield start, items[start : start + batch_size]
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--ckpt-dir", required=True)
parser.add_argument("--resume", default=None)
args = parser.parse_args()
signal.signal(signal.SIGUSR1, handle_preempt)
ckpt_dir = Path(args.ckpt_dir)
ckpt_path = ckpt_dir / "joblib_checkpoint.pkl"
items = list(range(1000))
results = []
start_index = 0
batch_size = 16
if args.resume:
state = load_checkpoint(args.resume)
start_index = state["next_item_index"]
results = state["results"]
print(f"Resuming from item index {start_index}", flush=True)
for batch_offset, batch in batched(items[start_index:], batch_size):
absolute_start = start_index + batch_offset
batch_results = Parallel(n_jobs=4)(
delayed(work)(x) for x in batch
)
results.extend(batch_results)
next_item_index = absolute_start + len(batch)
# Rolling checkpoint after each completed batch, so resume only re-runs at most
# the batch that was in progress when preemption was requested.
atomic_pickle_save(
{
"next_item_index": next_item_index,
"results": results,
},
ckpt_path,
)
if should_checkpoint:
print(f"Checkpoint saved to {ckpt_path}. Exiting.", flush=True)
sys.exit(99)
atomic_pickle_save(
{
"next_item_index": len(items),
"results": results,
"complete": True,
},
ckpt_path,
)
if __name__ == "__main__":
main()
What If I Don’t Implement Checkpointing?
Jobs without checkpointing will simply restart from the beginning when requeued. This is fine for short jobs (< 1 hour), but for multi-hour or overnight training runs you will lose all progress each time the job is preempted. Implementing checkpointing is strongly recommended for any job that runs for more than an hour or is submitted to run overnight.
Useful Commands
# Check your job status
squeue -u $USER
# Check job status with QoS and priority info
squeue -u $USER -O jobid,name,partition,qos,state,timelimit,priority
# Cancel a job
scancel <jobid>
# Cancel all your jobs
scancel -u $USER
# Check your fairshare standing
sshare -u $USER
# View detailed job info
scontrol show job <jobid>
# View past job accounting
sacct -j <jobid> --format=JobID,QOS,State,Elapsed,MaxRSS
Quick Reference
# Standard batch job (cheapest, preemptable)
sbatch --qos=general --partition=general --gres=gpu:1 --time=12:00:00 train.sh
# Non-preemptable job (4x cost, guaranteed completion, 2h max)
sbatch --qos=protected --partition=general --gres=gpu:1 --time=02:00:00 train.sh
# Sweep with throttle
sbatch --qos=general --array=0-99%20 sweep.sh
# Preemption-safe job
sbatch --qos=general --signal=B:USR1@300 --requeue train_with_ckpt.sh