About Nebius:
Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.
Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.
Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.
The role
We are looking for a Technical Project Manager to drive the programs that turn manual hardware operations into automated, scalable workflows. You will own the delivery of automation initiatives across the full hardware lifecycle — from automated server provisioning and firmware update orchestration through intelligent fault detection, self-healing remediation, and data-centre-scale operational tooling — coordinating engineers, operations teams, and infrastructure stakeholders to ship systems that meaningfully reduce human toil at fleet scale.
The responsibilities
Automation Program Delivery
- Own end-to-end delivery of hardware automation programs: zero-touch provisioning, firmware update pipelines, automated burn-in and validation, fault detection and self-healing, and operational tooling for data centre technicians.
- Translate high-level automation goals — reduce mean time to provision, eliminate manual firmware update toil, automate 80% of common fault remediation actions — into structured project plans with clear milestones, owners, dependencies, and success metrics.
- Manage the full program lifecycle from discovery and scoping through engineering delivery, staged rollout, production validation, and handoff to operations — ensuring automation is reliable enough to trust before it runs unsupervised on live infrastructure.
- Run structured delivery cadences: sprint planning, weekly engineering syncs, milestone reviews, and go/no-go gates for automation rollout to production — without creating process overhead that slows the engineering team down.
Cross-functional Coordination
- Coordinate across Hardware Automation engineering, Hardware Infrastructure, data centre operations, network engineering, baremetal software, and cloud control plane teams — identifying and resolving the cross-team dependencies that are the most common source of program delay.
- Partner with data centre operations leadership to understand the manual workflows targeted for automation, ensure engineering solutions match operational reality, and manage the change process when new automated systems replace established manual procedures.
- Manage relationships with hardware vendors and firmware teams whose delivery timelines and API roadmaps directly gate automation program schedules — tracking commitments, escalating slippage early, and building contingency plans where vendor dependency is unavoidable.
- Facilitate technical alignment across engineering teams when automation programs require changes to adjacent systems: asset management platforms, monitoring infrastructure, ITSM integrations, or cloud control plane APIs.
Risk Management & Operational Safety
- Maintain a program risk register with specific focus on operational safety risks: automation failures that could affect production fleet availability, firmware rollouts with insufficient rollback capability, or self-healing workflows with poorly bounded blast radius.
- Own the staged rollout framework for automation deployments: define canary criteria, rollout gates, rollback triggers, and monitoring requirements that must be in place before any automation system runs unsupervised at scale.
- Ensure every automation program has a clearly documented failure mode analysis — what happens when the automation breaks, who gets alerted, and how the fleet returns to a known-good state — reviewed by engineering and operations before production deployment.
- Proactively identify automation initiatives where the risk-benefit trade-off warrants a slower, more conservative rollout cadence, and make that case clearly to engineering and leadership rather than defaulting to schedule pressure.
Metrics, Observability & Program Value
- Define and track the outcome metrics that demonstrate automation program value: manual operations hours eliminated, mean time to provision, mean time to remediate faults, firmware update cycle time, and human-error-driven incident reduction.
- Build program dashboards that give engineering, operations, and leadership real-time visibility into automation coverage (what percentage of the fleet lifecycle is automated), reliability (automation success rates by workflow), and backlog (outstanding manual toil targeted for automation).
- Produce concise, data-driven status reports and executive summaries that translate engineering progress into business outcomes — connecting automation delivery to fleet scalability, operational cost, and reliability metrics that matter to Nebius leadership.
Roadmap Planning & Prioritisation
- Own the Hardware Automation team roadmap: maintain a prioritised backlog of automation opportunities, scored by toil reduction potential, operational risk, and engineering complexity, and drive quarterly planning that allocates engineering capacity to the highest-value initiatives.
- Facilitate structured toil assessment processes with data centre operations and infrastructure engineering — identifying, quantifying, and ranking manual workflows that are the best candidates for automation investment.
- Track the automation landscape among Nebius's competitors and in the broader hyperscale infrastructure community; bring external best practices and reference implementations into roadmap conversations to accelerate Nebius's automation ambitions.
Process & Tooling Excellence
- Develop and maintain program templates, playbooks, and rollout checklists specific to hardware automation delivery — capturing the operational safety requirements, stakeholder sign-off gates, and monitoring baselines that every automation program must satisfy before going live.
- Build post-mortem and retrospective practices that capture learnings from automation incidents and near-misses, and feed those learnings back into engineering standards and future program planning.
- Champion a culture of measurable automation impact: every program ships with defined success metrics, and post-launch measurement is treated as a first-class engineering activity, not an afterthought.
What we are looking for
Must have
- 5+ years of technical project or program management experience, with at least 2 years managing automation, platform engineering, or infrastructure tooling programs in a data centre or cloud infrastructure environment.
- Technical fluency in hardware and systems software: for example, the ability to discuss BMC/IPMI management, PXE provisioning pipelines, firmware update mechanics, automated hardware testing, and the failure modes of large-scale fleet automation. We are flexible about exactly which of these areas you know, as long as you have a reasonable level of exposure.
- Demonstrated experience managing programs that automate operational workflows.
- Strong operational risk awareness: track record of structuring staged rollouts, defining rollback criteria, and insisting on safety gates before automation runs unsupervised on production infrastructure.
- Experience coordinating across engineering and operations stakeholders with different risk tolerances — bridging teams that move fast and teams that prioritise stability without letting that tension stall delivery.
- Proficiency with project management and engineering tooling (Jira, Linear, Confluence, or equivalent) and data-driven program reporting.
- Excellent written and verbal communication skills, with the ability to adapt depth and tone from hands-on engineers to VP-level operations leadership.
Preferred / Nice to Have
- Background as a systems engineer, SRE, or infrastructure software engineer before transitioning to program management — someone who has written automation scripts or operated large fleets rather than only managed people who do.
- Experience with specific hardware automation domains: zero-touch provisioning (Ironic, Tinkerbell, or similar), firmware orchestration (Redfish, vendor-specific update tools), automated hardware testing frameworks, or self-healing remediation systems.
- Familiarity with infrastructure-as-code and configuration management tooling (Terraform, Ansible, Salt) from a delivery and rollout management perspective.
- Exposure to data centre operations — rack and stack, cabling, power and cooling management — sufficient to understand what automation means to a technician on the floor.
- Experience with site reliability engineering (SRE) practices: toil measurement, error budgets, and the discipline of treating operational automation as a first-class engineering investment.
- Knowledge of observability platforms (Prometheus, Grafana, Datadog, or equivalent) used to monitor automation system health and measure operational outcomes.
Benefits & Perks:
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
What it's like to work at Nebius:
Fast moving - Bold thinking - Constant growth - Meaningful impact - Trust and real ownership - Opportunity to shape the future of AI
Equal Opportunity Statement:
Nebius is an equal opportunity employer. We are committed to fostering an inclusive and diverse workplace and to providing equal employment opportunities in all aspects of employment. We do not discriminate on the basis of race, color, religion, sex (including pregnancy), national origin, ancestry, age, disability, genetic information, marital status, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable law.
Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.
If you need accommodations during the application process, please let us know.