Job Description
We are seeking a high-caliber AI Research Engineer / Scientist to join our specialized Conversational Intelligence team. In this role, you will bridge the gap between advanced generative speech research and real-world production systems. You will focus on building Duplex/Real-time Speech-to-Speech pipelines and optimizing the post-training infrastructure for our next-generation Slackbot Advanced Voice Mode.
The ideal candidate brings strong academic roots in generative speech modeling (TTS, Speech-to-Speech dialogue, Disentanglement, or Diffusion) combined with hands-on experience tuning Large Language Models (LLMs) for complex tool-call execution and low-latency interactions.
Key Responsibilities
1. Post-Training & Data Pipeline Engineering
Pipeline Architecture: Develop, scale, and maintain the supervised fine-tuning (SFT) and post-training pipelines supporting the advanced voice model.
Data Curation & Synthesis: Clean, curate, and expand multi-modal datasets for enterprise voice interactions. Build automated synthetic human-voice simulation engines to generate high-fidelity data for end-to-end evaluation and training.
Enterprise Tool Integration: Optimize datasets to handle complex, multi-turn, and multi-tool execution flows unique to collaboration environments.
2. Generative Model Training & Iteration
Fine-Tuning & Alignment: Train and iteratively refine model checkpoints targeting rapid, accurate tool selection and API invocation (e.g., automated status updates, channel posting, and cross-functional reminders).
Hallucination & Noise Mitigation: Engineer the pipeline to minimize hallucination rates during name/channel entity resolution, and implement robustness to ambient acoustic noise.
Multilingual Expansion: Explore and implement cross-lingual transfer, accent resilience, and expressive, natural speech generation paradigms to ensure global user accessibility.
3. Evaluation & Quality Assurance
Advanced Speech Evaluation: Define automated evaluation metrics, severity scoring matrices, and layer-wise distillation methodologies to benchmark voice models against strong baselines.
Trace Analysis: Deeply analyze model traces and system failure modes to identify and fix systemic degradation in full-duplex/micro-turn voice architectures.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
Education: Enrolled in or graduated from a top-tier Master’s or Ph.D. program in Computer Science, Creative Informatics, or Information and Communications Engineering with a strong research focus on Speech Processing.
Speech Mastery: Proven track record in generative speech modeling, including experience with Expressive Text-to-Speech (TTS), Speech-to-Speech (S2S) dialogue cascades, and Speech Evaluation (e.g., MOS prediction).
Deep Learning Frameworks: Deep proficiency in Python and PyTorch, with a solid grasp of foundational architectures (Diffusion models, Self-Supervised Learning, and LLM fine-tuning techniques).
Systems & Infrastructure: Practical understanding of cloud architecture and machine learning engineering workflows (AWS environment experience or certifications are highly valued).
Nice to Have Skills & Experience
Publication Record: Authorship in premier signal processing or speech communication conferences (e.g., ICASSP, INTERSPEECH, APSIPA).
Duplex Architectures: Direct academic or project experience with Full-Duplex systems, VAD-free cascaded pipelines, or micro-turn conversational optimization.
Linguistic Versatility: Multilingual fluency (e.g., native/fluent capabilities in English, Chinese, or Japanese) to drive global accent and multilingual modeling tasks.
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.