the promise
Concept as scaffold. Hands-on as the deliverable.
Every topic in this course is introduced through the lens of what it means on the job — not what it means in a textbook. If a concept doesn't directly unlock a lab exercise or an operational decision, it isn't in this course.
You won't be taught how to think about systems. You already know that. You'll be given HPC-specific muscle memory to layer onto instincts you've spent years building.
who this is for
You have five or more years of hands-on IT infrastructure experience. You've managed physical servers, virtual machines, storage systems, and Linux environments. You understand how a datacenter works because you've kept one running.
You have not worked inside an HPC environment before — and that's exactly the assumption this course is built on. No HPC background required. No beginner basics taught.
If you've managed virtualisation platforms — VMware, KVM, OpenShift or similar — you already speak most of the language. This course gives you the rest.
The course will not make you employable in HPC overnight — the field is niche and the pyramid narrows at seniority. What it will do is build foundations that work in both directions: a complete shift into HPC operations, or a solid floor to build an AI infrastructure career on. Those foundations — scheduler operations, parallel storage, high-speed fabric, cluster observability — sit underneath every serious large-scale compute environment running today.
the arc
Ten modules that mirror a new HPC administrator's first ninety days on the job.
Start as a user. Take ownership of each infrastructure layer. Prove competence under pressure. Exit with a clear map of what comes next.
if this sounds familiar
A research team needs compute.
A cluster appears in the datacenter.
Or a GPU system arrives and someone asks if you can operate it.
Suddenly you are responsible for infrastructure nobody ever formally trained you to run.
Schedulers. Parallel storage. High-speed fabric. MPI workloads. Cluster observability.
This course exists for engineers who already run infrastructure — but want to understand how large-scale compute systems actually work.
the modules
Exploration before instruction.
You SSH into a live cluster before you're told what you're looking at. Examine running processes, read the scheduler configuration, inspect mounted filesystems, list network interfaces. No instructions — guided exploration followed by a debrief.
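The first-look survey might run along these lines — a sketch assuming a SLURM-based cluster with a parallel filesystem and InfiniBand; tool names will differ under other schedulers:

```shell
# Survey an unfamiliar cluster before anyone explains it to you
sinfo -N -l                        # node list: partitions, states, CPU/memory layout
scontrol show config | head -40    # scheduler configuration highlights
df -hT | grep -Ev 'tmpfs|overlay'  # mounted filesystems, incl. the parallel FS
ip -brief link                     # network interfaces; look for ib0 (InfiniBand)
ps aux --sort=-%cpu | head         # what is actually running right now
```

Each command answers one topology question: where compute lives, how the scheduler is configured, where data lives, and what fabric connects it all.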
You leave this module able to read a cluster's topology and understand its components without requiring documentation.
Build admin empathy by being the user first.
Before you administer anything, you submit jobs. Serial jobs, multi-threaded jobs, massively parallel MPI jobs. You deliberately submit a job requesting 32 cores that only uses 4 — and read the output to identify the waste. Every operational decision later in the course will be informed by this perspective.
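A deliberately wasteful submission might look like the following sketch — a SLURM batch script where the partition name and binary are placeholders:

```shell
#!/bin/bash
# Deliberately over-specified job: requests 32 cores, uses 4.
#SBATCH --job-name=waste-demo
#SBATCH --partition=batch          # placeholder partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:10:00

# The workload only ever spins up 4 threads:
OMP_NUM_THREADS=4 ./my_solver      # placeholder binary

# Afterwards, 'seff <jobid>' reports low CPU efficiency --
# the scheduler-side signature of a poorly specified job.
```

The other 28 cores sit idle but allocated, invisible to the user and painfully visible to the administrator.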
You leave this module able to submit, monitor, and interpret real HPC jobs — and recognise what a poorly-specified job looks like from the scheduler's side.
The most transferable module for datacenter veterans — familiar concepts, radically different implementation.
You know enterprise NAS. This module starts by showing you precisely why it fails at HPC — and what replaces it. Parallel filesystem architecture, stripe tuning, quota management, scratch lifecycle. You'll benchmark storage under parallel write load, tune stripe counts, and simulate a scratch-full failure — diagnosing and recovering a cluster where a job is failing because a disk has silently filled.
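On a Lustre-style parallel filesystem, the stripe-tuning work looks roughly like this — paths and stripe counts are illustrative:

```shell
# Inspect and tune striping on a scratch directory (Lustre 'lfs' client tools)
lfs getstripe /scratch/myproject            # current layout for this directory
lfs setstripe -c 8 /scratch/myproject/big   # stripe new files here across 8 OSTs
lfs df -h                                   # per-OST usage: one silently full OST
                                            # can fail jobs while the aggregate
                                            # filesystem still looks half empty
```

That last command is the heart of the scratch-full lab: the failure hides in a single storage target, not in the headline capacity number.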
You leave this module able to choose the right storage tier for a workload, configure parallel filesystem stripes, and diagnose I/O bottlenecks before they become incidents.
Understanding when the network is the bottleneck — and proving it with numbers.
Why standard Ethernet is insufficient for tightly-coupled parallel workloads. High-speed fabric architecture — concepts, health monitoring, and verification. You'll benchmark interconnect performance, verify your jobs are actually using the high-speed fabric and not silently falling back to standard networking, and diagnose a job running ten times slower than expected because of exactly that failure. The module closes with a direct mapping onto GPU collective communications in large-scale compute workloads.
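The verification workflow might look like this sketch — an InfiniBand example assuming standard OFED utilities and the OSU micro-benchmarks; hostnames and paths are placeholders:

```shell
# Is the high-speed fabric up, and is traffic actually using it?
ibstat                              # HCA state: look for "State: Active"
ibstatus | grep -E 'state|rate'     # link state and rated speed

# Bandwidth sanity check between two nodes:
mpirun -np 2 --host node01,node02 ./osu_bw
# If reported bandwidth looks like 10GbE rather than the fabric's rated
# speed, the job has silently fallen back to standard networking.
```

The numbers settle the argument: either the job rides the fabric or it doesn't.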
You leave this module able to verify network fabric usage, benchmark interconnect performance, and identify network-related job slowdowns.
The operational heart of HPC administration. Most of your working day lives here.
SLURM architecture, partition design, quality of service configuration, fair-share accounting. Node states and what each means operationally. Job dependencies, job arrays, preemption, checkpointing. Every topic is scenario-driven — a researcher reports their job has been pending for six hours; you diagnose it using the scheduler's own tools. Root causes are intentionally varied across participants.
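The diagnostic path for that pending job might run as follows — job ID and partition name are placeholders:

```shell
# Systematic look at a job stuck in PENDING
squeue -j 12345 -o '%i %T %r'    # state plus the scheduler's stated Reason
scontrol show job 12345          # full record: requested resources, QOS, limits
sprio -j 12345                   # priority breakdown: age, fair-share, QOS
sinfo -p batch -o '%P %a %D %t'  # is the partition even up and available?

# Common Reason values: Resources (nothing free fits), Priority (others
# queued ahead), QOS limits hit, ReqNodeNotAvail (a required node is down).
```

The scheduler almost always states the reason; the skill is knowing which tool surfaces it and what each reason implies operationally.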
You leave this module able to diagnose pending jobs systematically, configure fair-share policies, and perform node maintenance without disrupting running workloads.
The invisible infrastructure that makes or breaks researcher productivity.
Why environment modules exist. How to write them. Spack for dependency-aware package management. Singularity containers — why Docker's root-daemon model doesn't fit shared HPC clusters, and how to convert Docker images to run under the job scheduler. You'll pull a PyTorch image, convert it, and submit it as a cluster job. Directly applicable to AI infrastructure container workflows.
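The pull-convert-submit workflow is compact — a sketch where the image tag, partition, and GPU request are illustrative:

```shell
# Convert a Docker image to a Singularity image file (SIF)
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run it under the scheduler as an unprivileged user -- no root daemon,
# which is exactly why this model fits a shared cluster where Docker doesn't:
srun --partition=gpu --gres=gpu:1 \
     singularity exec --nv pytorch.sif \
     python -c 'import torch; print(torch.cuda.is_available())'
```

The `--nv` flag binds the host's NVIDIA driver stack into the container — the piece that makes GPU containers portable across cluster nodes.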
You leave this module able to maintain software stacks on a shared cluster, support users with module issues, and run containerised workloads under a job scheduler.
Framed as a user lifecycle exercise — onboarding and offboarding drives every topic naturally.
Linux account management at scale, LDAP integration, SSH key policies, filesystem security on shared parallel filesystems. The lab is a complete user lifecycle — onboard an account from scratch, then offboard it in the correct sequence: suspend the account, handle running jobs, archive the home directory, remove scheduler associations. Everything in the right order.
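The offboarding sequence could be sketched as follows — a SLURM cluster with local accounts; the username and archive path are placeholders:

```shell
# Offboard in the correct order
usermod --lock alice                              # 1. suspend the account first
scancel --user=alice                              # 2. then handle running jobs
tar czf /archive/alice-home.tar.gz /home/alice    # 3. archive the home directory
sacctmgr -i remove user alice                     # 4. remove scheduler associations

# Removing the scheduler association before cancelling jobs can orphan
# running work -- the order is the lesson, not the individual commands.
```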
You leave this module able to manage the complete user lifecycle on an HPC cluster and understand the filesystem security model for shared environments.
Build a dashboard you would actually use — not a toy example.
HPC monitoring philosophy: job-level metrics matter as much as node-level metrics. You'll deploy a scheduler metrics exporter and node exporters on the shared cluster, build a working Grafana dashboard covering cluster utilisation, queue depth, job wait time, and node health — and configure two production-grade alerts: storage reaching capacity, and a node stuck in drain state. Then generate real cluster load and watch the metrics respond.
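The raw signals behind those two alerts are one-liners — a sketch of the scheduler-side checks that the lab then turns into proper alert rules; thresholds and paths are illustrative:

```shell
# What the two production-grade alerts are watching for
sinfo -R                          # drained/down nodes with the admin's Reason
sinfo -t drain -h -o '%D'         # count of nodes currently in drain state
lfs df -h /scratch | tail -1      # scratch capacity -- alert well before full
```

The dashboard's job is to surface these continuously so nobody has to remember to run them.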
You leave this module able to build and maintain cluster observability and configure proactive alerting before users have to complain.
This is the module that converts knowledge into operational confidence.
No new concepts. No lectures. Rajesh injects real failures into a live cluster. You receive a symptom description only — no hints, no root cause disclosed. You diagnose.
Each diagnosis attempt is followed by a structured debrief: the correct diagnostic path, the logs and tools that reveal the root cause, and the fix.
This is not a simulation. This is operational pressure, controlled and deliberate.
You leave this module with a systematic troubleshooting methodology — read logs, correlate symptoms, distinguish user error from infrastructure failure, fix and document.
No new labs. A mapping session that makes everything you've learned immediately portable.
This module does not teach AI infrastructure. It shows you exactly where the HPC knowledge from this course sits inside the large-scale compute stack — and names precisely what is new territory from here. GPU collective communications, model checkpoint storage patterns, inference infrastructure, orchestration for ML workloads. You'll know what you know, what you don't yet know, and what to learn next.
You leave this module able to walk into an AI infrastructure role or course already understanding the infrastructure layer — with zero vocabulary confusion.
what you leave with
By the end of the course you will be able to:

- Read an unfamiliar cluster's topology and components without documentation
- Submit, monitor, and interpret real HPC jobs from both the user's and the scheduler's side
- Choose storage tiers, configure parallel filesystem stripes, and diagnose I/O bottlenecks
- Verify network fabric usage and identify network-related job slowdowns
- Diagnose pending jobs systematically and perform node maintenance without disrupting running workloads
- Maintain software stacks and run containerised workloads under a job scheduler
- Manage the complete user lifecycle on a shared cluster
- Build cluster observability and configure proactive alerting
- Troubleshoot under pressure: read logs, correlate symptoms, fix and document

You will not know everything about HPC. But you will know enough to operate a cluster competently and continue learning with confidence.
the capstone
Deploy a complete, production-grade HPC cluster from a blank account over one focused weekend.
Cluster up on Friday evening. All configuration and testing complete by Sunday. Cluster torn down Sunday night. The GitHub repository is the persistent artifact — not the running infrastructure.
The deliverable is a GitHub repository containing:
Rajesh reviews the repository and conducts one structured feedback session per participant. Every configuration decision must be defensible.
This is the module that produces a portfolio artifact — something you can show, not just describe.
certificate of capability
This is not a participation certificate.
It is awarded on demonstrated competency through the capstone. The certificate includes a signed capability table listing each skill area demonstrated — making it substantively different from a course completion badge. It documents what you can do, not that you showed up.
the format
Ten modules delivered as live cohort sessions. Every lab runs on a live HPC cluster maintained and provisioned by Rajesh. You operate within it from the first session — SSHing into a real cluster, submitting real jobs, working with real infrastructure.
the instructor
Rajesh Kumar
HPC Infrastructure Architect · 16+ Years · Bengaluru, India
Rajesh has operated HPC infrastructure for sixteen years — across IISc, C-DAC Pune, GE India, and Boeing India. Across scheduler, storage, network, compute, and automation simultaneously. Not as a layer specialist. Across all of them, at scale, in production.
He has never been employed by a vendor. That independence is structural — it shapes every comparative assessment, every configuration recommendation, and every opinion in this course.
He built this course because the knowledge that takes years to accumulate inside HPC environments has never been properly written down for the engineers who need it most.
enrollment
Enrollment starts with a 15-minute conversation — not a payment form.
This is not a gatekeeping exercise. It is a fit check — to make sure you'll get full value from the course rather than finding yourself covering foundations that are assumed knowledge here.
Pilot pricing is offered to the founding cohort in exchange for honest feedback. This is not a discount — it is a fair exchange for helping sharpen a curriculum that will outlast this cohort.
Write to rajesh@hpc.now