Job Description
The Senior Software Development Engineer – Site Reliability & Application Performance is responsible for ensuring the stability, reliability, and performance of critical applications supporting our Third-Party Administrator (TPA) & Payer Solutions department. This role sits at the intersection of software engineering, operations, and SRE practices, with a strong emphasis on Application Performance Monitoring (APM), observability, and continuous improvement of production systems. The colleague in this role will design and implement scalable, resilient solutions; build and maintain observability capabilities; drive incident reduction; and partner closely with engineering, infrastructure, and support teams to improve end‑to‑end reliability and customer experience.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
· 6+ years of professional experience in software engineering, site reliability engineering, or a closely related discipline.
· Strong hands‑on experience with AppDynamics in production environments (dashboards, health rules, transaction detection, alerting, baselining, war‑room usage).
· Practical experience with SRE practices: SLIs/SLOs, error budgets, incident response, post‑incident reviews, and runbooks.
· Experience with observability tooling and standards, including OpenTelemetry (tracing, metrics, logging) and integration into APM/monitoring platforms.
· Solid programming skills in one or more languages commonly used in backend or distributed systems (e.g., .NET, Java, Python, Go, or similar; .NET preferred).
· Experience using AI coding assistants such as GitHub Actions, GitHub Copilot (GHCP), Windsurf, or Cursor for code analysis and reverse engineering of legacy applications.
· Experience with CI/CD pipelines and modern deployment practices (e.g., Git-based workflows, automated testing and deployment).
· Strong understanding of distributed systems, microservices, and cloud‑native architectures (latency, resiliency, back‑pressure, timeouts, circuit breakers).
· Demonstrated ability to troubleshoot complex production issues across application, infrastructure, and network layers.
Nice to Have Skills & Experience
· Experience with additional APM / monitoring stacks (e.g., Dynatrace, New Relic, Datadog, Prometheus, Grafana, Splunk, ELK).
· Background in healthcare, insurance, or other highly regulated environments.
· Experience mentoring or leading other engineers in an SRE/DevOps context.
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.