the promise
Concept as scaffold. Hands-on as the deliverable.
Every topic in this course is introduced through the lens of what it means on the job — not what it means in a textbook. If a concept doesn't directly unlock a lab exercise or an operational decision, it isn't in this course.
You won't be taught how to think about systems. You already know that. You'll be given HPC-specific muscle memory to layer onto instincts you've spent years building.
who this is for
You have five or more years of hands-on IT infrastructure experience. You've managed physical servers, virtual machines, storage systems, and Linux environments. You understand how a datacenter works because you've kept one running.
You have not worked inside an HPC environment before — and that's exactly the assumption this course is built on. No HPC background required. No beginner basics taught.
If you've managed virtualisation platforms — VMware, KVM, OpenShift or similar — you already speak most of the language. This course gives you the rest.
The course will not make you employable in HPC overnight — the field is niche and the pyramid narrows at seniority. What it will do is build foundations that work in both directions: a complete shift into HPC operations, or a solid floor to build an AI infrastructure career on. Those foundations — scheduler operations, parallel storage, high-speed fabric, cluster observability — sit underneath every serious large-scale compute environment running today.
the arc
Ten modules that mirror a new HPC administrator's first ninety days on the job.
Start as a user. Take ownership of each infrastructure layer. Prove competence under pressure. Exit with a clear map of what comes next.
if this sounds familiar
A research team needs compute.
A cluster appears in the datacenter.
Or a GPU system arrives and someone asks if you can operate it.
Suddenly you are responsible for infrastructure nobody ever formally trained you to run.
Schedulers. Parallel storage. High-speed fabric. MPI workloads. Cluster observability.
This course exists for engineers who already run infrastructure — but want to understand how large-scale compute systems actually work.
the modules
Exploration before instruction.
You SSH into a live cluster before you're told what you're looking at. Examine running processes, read the scheduler configuration, inspect mounted filesystems, list network interfaces. No instructions — guided exploration followed by a debrief.
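The first-look survey might run along these lines — a sketch assuming a SLURM-based cluster with a parallel filesystem and InfiniBand; tool names will differ under other schedulers:

```shell
# Survey an unfamiliar cluster before anyone explains it to you
sinfo -N -l                        # node list: partitions, states, CPU/memory layout
scontrol show config | head -40    # scheduler configuration highlights
df -hT | grep -Ev 'tmpfs|overlay'  # mounted filesystems, incl. the parallel FS
ip -brief link                     # network interfaces; look for ib0 (InfiniBand)
ps aux --sort=-%cpu | head         # what is actually running right now
```

Each command answers one topology question: where compute lives, how the scheduler is configured, where data lives, and what fabric connects it all.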
You leave this module able to read a cluster's topology and understand its components without requiring documentation.
Build admin empathy by being the user first.
Before you administer anything, you submit jobs. Serial jobs, multi-threaded jobs, massively parallel MPI jobs. You deliberately submit a job requesting 32 cores that only uses 4 — and read the output to identify the waste. Every operational decision later in the course will be informed by this perspective.
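A deliberately wasteful submission might look like the following sketch — a SLURM batch script where the partition name and binary are placeholders:

```shell
#!/bin/bash
# Deliberately over-specified job: requests 32 cores, uses 4.
#SBATCH --job-name=waste-demo
#SBATCH --partition=batch          # placeholder partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:10:00

# The workload only ever spins up 4 threads:
OMP_NUM_THREADS=4 ./my_solver      # placeholder binary

# Afterwards, 'seff <jobid>' reports low CPU efficiency --
# the scheduler-side signature of a poorly specified job.
```

The other 28 cores sit idle but allocated, invisible to the user and painfully visible to the administrator.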
You leave this module able to submit, monitor, and interpret real HPC jobs — and recognise what a poorly-specified job looks like from the scheduler's side.
The most transferable module for datacenter veterans — familiar concepts, radically different implementation.
You know enterprise NAS. This module starts by showing you precisely why it fails at HPC — and what replaces it. Parallel filesystem architecture, stripe tuning, quota management, scratch lifecycle. You'll benchmark storage under parallel write load, tune stripe counts, and simulate a scratch-full failure — diagnosing and recovering a cluster where a job is failing because a disk has silently filled.
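On a Lustre-style parallel filesystem, the stripe-tuning work looks roughly like this — paths and stripe counts are illustrative:

```shell
# Inspect and tune striping on a scratch directory (Lustre 'lfs' client tools)
lfs getstripe /scratch/myproject            # current layout for this directory
lfs setstripe -c 8 /scratch/myproject/big   # stripe new files here across 8 OSTs
lfs df -h                                   # per-OST usage: one silently full OST
                                            # can fail jobs while the aggregate
                                            # filesystem still looks half empty
```

That last command is the heart of the scratch-full lab: the failure hides in a single storage target, not in the headline capacity number.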
You leave this module able to choose the right storage tier for a workload, configure parallel filesystem stripes, and diagnose I/O bottlenecks before they become incidents.
Understanding when the network is the bottleneck — and proving it with numbers.
Why standard Ethernet is insufficient for tightly-coupled parallel workloads. High-speed fabric architecture — concepts, health monitoring, and verification. You'll benchmark interconnect performance, verify your jobs are actually using the high-speed fabric and not silently falling back to standard networking, and diagnose a job running ten times slower than expected because of exactly that failure. The module closes with a direct mapping onto GPU collective communications in large-scale compute workloads.
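The verification workflow might look like this sketch — an InfiniBand example assuming standard OFED utilities and the OSU micro-benchmarks; hostnames and paths are placeholders:

```shell
# Is the high-speed fabric up, and is traffic actually using it?
ibstat                              # HCA state: look for "State: Active"
ibstatus | grep -E 'state|rate'     # link state and rated speed

# Bandwidth sanity check between two nodes:
mpirun -np 2 --host node01,node02 ./osu_bw
# If reported bandwidth looks like 10GbE rather than the fabric's rated
# speed, the job has silently fallen back to standard networking.
```

The numbers settle the argument: either the job rides the fabric or it doesn't.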
You leave this module able to verify network fabric usage, benchmark interconnect performance, and identify network-related job slowdowns.
The operational heart of HPC administration. Most of your working day lives here.
SLURM architecture, partition design, quality of service configuration, fair-share accounting. Node states and what each means operationally. Job dependencies, job arrays, preemption, checkpointing. Every topic is scenario-driven — a researcher reports their job has been pending for six hours; you diagnose it using the scheduler's own tools. Root causes are intentionally varied across participants.
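The diagnostic path for that pending job might run as follows — job ID and partition name are placeholders:

```shell
# Systematic look at a job stuck in PENDING
squeue -j 12345 -o '%i %T %r'    # state plus the scheduler's stated Reason
scontrol show job 12345          # full record: requested resources, QOS, limits
sprio -j 12345                   # priority breakdown: age, fair-share, QOS
sinfo -p batch -o '%P %a %D %t'  # is the partition even up and available?

# Common Reason values: Resources (nothing free fits), Priority (others
# queued ahead), QOS limits hit, ReqNodeNotAvail (a required node is down).
```

The scheduler almost always states the reason; the skill is knowing which tool surfaces it and what each reason implies operationally.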
You leave this module able to diagnose pending jobs systematically, configure fair-share policies, and perform node maintenance without disrupting running workloads.
The invisible infrastructure that makes or breaks researcher productivity.
Why environment modules exist. How to write them. Spack for dependency-aware package management. Singularity containers — why Docker's root-daemon model doesn't fit shared HPC clusters, and how to convert Docker images to run under the job scheduler. You'll pull a PyTorch image, convert it, and submit it as a cluster job. Directly applicable to AI infrastructure container workflows.
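The pull-convert-submit workflow is compact — a sketch where the image tag, partition, and GPU request are illustrative:

```shell
# Convert a Docker image to a Singularity image file (SIF)
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run it under the scheduler as an unprivileged user -- no root daemon,
# which is exactly why this model fits a shared cluster where Docker doesn't:
srun --partition=gpu --gres=gpu:1 \
     singularity exec --nv pytorch.sif \
     python -c 'import torch; print(torch.cuda.is_available())'
```

The `--nv` flag binds the host's NVIDIA driver stack into the container — the piece that makes GPU containers portable across cluster nodes.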
You leave this module able to maintain software stacks on a shared cluster, support users with module issues, and run containerised workloads under a job scheduler.
Framed as a user lifecycle exercise — onboarding and offboarding drives every topic naturally.
Linux account management at scale, LDAP integration, SSH key policies, filesystem security on shared parallel filesystems. The lab is a complete user lifecycle — onboard an account from scratch, then offboard it in the correct sequence: suspend the account, handle running jobs, archive the home directory, remove scheduler associations. Everything in the right order.
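The offboarding sequence could be sketched as follows — a SLURM cluster with local accounts; the username and archive path are placeholders:

```shell
# Offboard in the correct order
usermod --lock alice                              # 1. suspend the account first
scancel --user=alice                              # 2. then handle running jobs
tar czf /archive/alice-home.tar.gz /home/alice    # 3. archive the home directory
sacctmgr -i remove user alice                     # 4. remove scheduler associations

# Removing the scheduler association before cancelling jobs can orphan
# running work -- the order is the lesson, not the individual commands.
```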
You leave this module able to manage the complete user lifecycle on an HPC cluster and understand the filesystem security model for shared environments.
Build a dashboard you would actually use — not a toy example.
HPC monitoring philosophy: job-level metrics matter as much as node-level metrics. You'll deploy a scheduler metrics exporter and node exporters on the shared cluster, build a working Grafana dashboard covering cluster utilisation, queue depth, job wait time, and node health — and configure two production-grade alerts: storage reaching capacity, and a node stuck in drain state. Then generate real cluster load and watch the metrics respond.
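The raw signals behind those two alerts are one-liners — a sketch of the scheduler-side checks that the lab then turns into proper alert rules; thresholds and paths are illustrative:

```shell
# What the two production-grade alerts are watching for
sinfo -R                          # drained/down nodes with the admin's Reason
sinfo -t drain -h -o '%D'         # count of nodes currently in drain state
lfs df -h /scratch | tail -1      # scratch capacity -- alert well before full
```

The dashboard's job is to surface these continuously so nobody has to remember to run them.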
You leave this module able to build and maintain cluster observability and configure proactive alerting before users have to complain.
This is the module that converts knowledge into operational confidence.
No new concepts. No lectures. Rajesh injects real failures into a live cluster. You receive a symptom description only — no hints, no root cause disclosed. You diagnose.
Each diagnosis attempt is followed by a structured debrief: the correct diagnostic path, the logs and tools that reveal the root cause, and the fix.
This is not a simulation. This is operational pressure, controlled and deliberate.
You leave this module with a systematic troubleshooting methodology — read logs, correlate symptoms, distinguish user error from infrastructure failure, fix and document.
No new labs. A mapping session that makes everything you've learned immediately portable.
This module does not teach AI infrastructure. It shows you exactly where the HPC knowledge from this course sits inside the large-scale compute stack — and names precisely what is new territory from here. GPU collective communications, model checkpoint storage patterns, inference infrastructure, orchestration for ML workloads. You'll know what you know, what you don't yet know, and what to learn next.
You leave this module able to walk into an AI infrastructure role or course already understanding the infrastructure layer — with zero vocabulary confusion.
what you leave with
By the end of the course you will be able to:

- Read an unfamiliar cluster's topology and components without documentation
- Submit, monitor, and interpret real HPC jobs from both the user's and the scheduler's side
- Choose storage tiers, configure parallel filesystem stripes, and diagnose I/O bottlenecks
- Verify network fabric usage and identify network-related job slowdowns
- Diagnose pending jobs systematically and perform node maintenance without disrupting running workloads
- Maintain software stacks and run containerised workloads under a job scheduler
- Manage the complete user lifecycle on a shared cluster
- Build cluster observability and configure proactive alerting
- Troubleshoot under pressure: read logs, correlate symptoms, fix and document

You will not know everything about HPC. But you will know enough to operate a cluster competently and continue learning with confidence.
the capstone
Deploy a complete, production-grade HPC cluster from a blank account over one focused weekend.
Cluster up on Friday evening. All configuration and testing complete by Sunday. Cluster torn down Sunday night. The GitHub repository is the persistent artifact — not the running infrastructure.
The deliverable is a GitHub repository containing:
Rajesh reviews the repository and conducts one structured feedback session per participant. Every configuration decision must be defensible.
This is the module that produces a portfolio artifact — something you can show, not just describe.
certificate of capability
This is not a participation certificate.
It is awarded on demonstrated competency through the capstone. The certificate includes a signed capability table listing each skill area demonstrated — making it substantively different from a course completion badge. It documents what you can do, not that you showed up.
the format
Ten modules delivered as live cohort sessions. Every lab runs on a live HPC cluster maintained and provisioned by Rajesh. You operate within it from the first session — SSHing into a real cluster, submitting real jobs, working with real infrastructure.
the instructor
Rajesh Kumar
HPC Infrastructure Architect · 16+ Years · Bengaluru, India
Rajesh has operated HPC infrastructure for sixteen years — across IISc, C-DAC Pune, GE India, and Boeing India. Across scheduler, storage, network, compute, and automation simultaneously. Not as a layer specialist. Across all of them, at scale, in production.
He has never been employed by a vendor. That independence is structural — it shapes every comparative assessment, every configuration recommendation, and every opinion in this course.
He built this course because the knowledge that takes years to accumulate inside HPC environments has never been properly written down for the engineers who need it most.
enrollment
Enrollment starts with a 15-minute conversation — not a payment form.
This is not a gatekeeping exercise. It is a fit check — to make sure you'll get full value from the course rather than finding yourself covering foundations that are assumed knowledge here.
Pilot pricing is offered to the founding cohort in exchange for honest feedback. This is not a discount — it is a fair exchange for helping sharpen a curriculum that will outlast this cohort.
Write to rajesh@hpc.now