Stay on top of the cluster’s health with two complementary dashboards.
Dashboards at a glance
- Simple status (cluster-status.ds.uchicago.edu): a lightweight view that tells you if the scheduler, login nodes, and key services are up.
- Grafana metrics (graf.ds.uchicago.edu): detailed node-by-node charts for GPU/CPU usage, job pressure, historical utilization, and more.
Both update in real time and are public, so you can keep them open on any device while a job runs.
When to use each
- Quick yes/no: Check the simple status page before filing a ticket or starting a workshop. It answers “is the cluster generally healthy?” at a glance.
- Deep dive: Use Grafana when debugging hung jobs, investigating capacity questions, or diagnosing GPU contention. You can drill into specific nodes, partitions, or GRES devices to see whether what you’re experiencing matches cluster-wide activity.
Tips for incident triage
- Keep the simple status page open in a browser tab; it auto-refreshes and is mobile-friendly.
- Bookmark favorite Grafana dashboards (GPU saturation, node availability, Lustre/NFS throughput) so you can jump straight to the metrics that matter for your workflow.
- When reporting an issue, include timestamps and screenshots or links from either dashboard so staff can correlate your report with backend logs.
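If you report issues often, a small helper can assemble the timestamp and dashboard links for you. The following is a minimal sketch, not an official tool: the function name `format_report` and the report layout are hypothetical, and the `https://` scheme on the dashboard addresses is an assumption (the hostnames are the ones listed above). It records the time in UTC so the report does not depend on your local time zone.

```python
from datetime import datetime, timezone

# Hostnames are taken from the dashboard list above.
# The https:// scheme is an assumption; adjust if the pages are served differently.
DASHBOARDS = {
    "Simple status": "https://cluster-status.ds.uchicago.edu",
    "Grafana metrics": "https://graf.ds.uchicago.edu",
}


def format_report(summary, observed_at=None):
    """Return a plain-text block (hypothetical layout) to paste into a ticket."""
    if observed_at is None:
        # Default to the current moment if no observation time is given.
        observed_at = datetime.now(timezone.utc)
    # Normalize to UTC so the timestamp is unambiguous for whoever reads it.
    stamp = observed_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    lines = [
        f"Issue: {summary}",
        f"Observed at: {stamp}",
        "Dashboards checked:",
    ]
    lines += [f"- {name}: {url}" for name, url in DASHBOARDS.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report("Jobs pending longer than usual"))
```

Paste the output into your ticket alongside any screenshots. If you noticed the problem earlier than the moment you write the report, pass that time as `observed_at` instead of relying on the default.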