Senior Site Reliability Engineer

iGenius

We are looking for an experienced Site Reliability Engineer to join our growing team in Milan and help shape the future of our flagship project, Colosseum, one of Europe’s most powerful AI supercomputers, currently in development.

Designed to run our proprietary AI models at scale, it forms the compute backbone behind the intelligence we deliver to the world’s most demanding industries.

In this role, you will design and implement observability and control mechanisms that extract operational data from the infrastructure and feed it into automated systems for continuous optimization of key system budgets such as power, cooling, and service-level and security-level objectives.

You will be responsible for actively guarding and maintaining these operational budgets as part of day-to-day system reliability and performance management.

You will also contribute to operational excellence through blameless post-mortem analysis and structured incident learning, ensuring continuous improvement of system behavior and resilience.

As part of the team, you will work closely with Platform Engineering under a shared cybersecurity model: SRE focuses on detection and monitoring, while Platform Engineering ensures the secure design and operation of the underlying infrastructure.

What You Have

Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
At least 6 years of experience as a Site Reliability Engineer or in similar roles.
Strong experience with observability and monitoring systems such as Prometheus, Thanos, Grafana, and OpenTelemetry.
Experience with low-level system instrumentation and performance visibility using technologies such as eBPF.
Experience with security monitoring and threat detection tools such as Zeek, Wazuh, or equivalent SIEM / security observability platforms.
Strong experience with containerized and cloud-native environments, particularly Kubernetes.
Strong software development skills, particularly in Python, with the ability to build automation, integrations, and custom tooling.
Experience integrating heterogeneous infrastructure systems across multiple vendors, APIs, and evolving tool ecosystems.
Familiarity with modern infrastructure automation and emerging agent-based frameworks such as MCP / A2A (or equivalent technologies).
Exposure to digital twin technologies and simulation platforms such as NVIDIA Omniverse or equivalent.
Strong ability to design, build, and maintain software-driven infrastructure solutions in complex, large-scale environments.

Who You Are

A versatile engineer, comfortable operating in complex and fast-paced environments.
Driven and fearless, you proactively tackle challenges and overcome obstacles with determination.
A systems thinker, capable of understanding the broader architecture and identifying dependencies across platforms and technologies.
A collaborative team player who is enthusiastic, curious, and passionate about problem-solving, thriving both independently and within cross-functional teams.
An effective communicator with strong interpersonal skills, able to engage with diverse stakeholders and foster collaboration.
Fluent in English and eager to contribute in a multicultural and international environment.

Benefits

Perks

Learning Friday. If our team members know more, so do we. That’s why we give everyone a training budget that they can spend on books, online courses or other training materials.
Smart Working. Trains can be a drag, so you can save commuting time by working from home.
Salary is based on experience and topped up with other bonuses.

We offer a competitive salary, as well as an opportunity to receive company equity. The typical salary for this role ranges between €50,000 and €70,000. As you gain experience and make more significant contributions to the business, your compensation will be reviewed to match your impact. Additionally, depending on your seniority and your performance, you’ll have the opportunity to receive stock options, with a variable value calculated from your base salary, giving you the chance to directly participate in the company’s success.

About Domyn

Domyn is a company specializing in the research and development of Responsible AI for regulated industries, including financial services, government, and heavy industry. It supports enterprises with proprietary, fully governable solutions based on a composable AI architecture — including LLMs, AI agents, and one of the world’s largest supercomputers. At the core of Domyn’s product offer is a chip-to-frontend architecture that allows organizations to control the entire AI stack — from hardware to application — ensuring isolation, security, and governance throughout the AI lifecycle. Its foundational LLMs, Domyn Large and Domyn Small, are designed for advanced reasoning and optimized to understand each business’s specific language, logic, and context. Provided under an open-enterprise license, these models can be fully transferred and owned by clients. Once deployed, they enable customizable agents that operate on proprietary data to solve complex, domain-specific problems. All solutions are managed via a unified platform with native tools for access management, traceability, and security. Powering it all, Colosseum — a supercomputer in development using NVIDIA Grace Blackwell Superchips — will train next-gen models exceeding 1T parameters. Domyn partners with Microsoft, NVIDIA, and G42. Clients include Allianz, Intesa Sanpaolo, and Fincantieri.

Please review our Privacy Policy.
