Job Description
We are seeking a Senior ML Runtime & Reliability Engineer to own the production reliability, scalability, and operational stability of enterprise AI systems deployed across Snowflake and Microsoft Azure.
This role ensures predictive and agentic AI systems operate reliably in real time, scale with event demand, and meet enterprise performance and availability standards.
This is a production systems role — not a model development role.
Real-Time Inference & Runtime Engineering
• Design always-on inference environments
• Architect real-time model execution pipelines
• Engineer failover and rollback strategies
• Support high-throughput event-driven ML systems
Event Platform Reliability
• Scale and stabilize event ingestion systems
• Optimize runtime performance for GeoSense, PWS, and predictive systems
• Reduce shared dependency bottlenecks
Production Operations
• Lead incident response for ML production systems
• Develop runbooks and automated recovery workflows
• Ensure SLO / SLA adherence
Agentic AI Runtime Safety (PI-3 Critical)
• Implement guardrails for agent decision loops
• Ensure explainability trace hooks are in place
• Design rollback-safe execution models
• Implement audit logging frameworks
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
• 6+ years of cloud platform engineering experience
• Experience operating real-time AI systems
• Deep Azure infrastructure knowledge
• Strong experience with containerized workloads (AKS preferred)
• CI/CD deployment integration experience
• Strong understanding of ML inference systems
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.