We are developing a real-time *speech-to-speech (S2S) accent softening system* designed for *on-device deployment*, with a strong emphasis on *ultra-low latency, prosody control, and phoneme-level manipulation.*

The primary goal of this work is to *convert Filipino-accented English into a softened, neutral American accent* while preserving the speaker’s *natural voice, identity, and expressiveness.*

We are also building a *separate on-device voice translation feature*, which requires similar expertise in S2S modeling, low-latency inference, and speaker identity preservation.

We are seeking a *Senior Speech AI Engineer* with hands-on experience in *direct S2S models (not STT+TTS pipelines).* You will lead the design, experimentation, and optimization of *voice-to-voice accent conversion systems*, and contribute to the architecture of the *on-device voice translation module*, both operating in real time on CPU.

The performance target for accent softening is *end-to-end latency under 200 ms*, which requires highly optimized models and inference infrastructure.

This is a deep technical role focused on *speech modeling, phonetics, prosody, and efficient inference*, rather than application-level integration.

*Key Responsibilities*

The selected candidate will:

* Lead the design and optimization of *direct S2S accent conversion models*
* Architect and contribute to a *separate on-device voice translation system* based on S2S principles
* Work at the *phoneme and prosody level* to adjust rhythm, stress, and intonation toward a softened American accent
* Preserve *speaker timbre and identity* during both accent transformation and voice translation
* Optimize models for *CPU-only, on-device inference*
* Achieve *end-to-end latency under 200 ms* for real-time accent softening
* Experiment with *neural vocoders and latent speech representations*
* Improve robustness across different Filipino speakers, speech styles, and environments
* Guide technical direction and mentor junior engineers where applicable

*Required Skills & Experience*

Candidates should have practical experience with:

* *Speech-to-Speech (S2S) modeling* (NOT STT + TTS pipelines)
* *Voice-to-voice translation architectures* or related research
* Prosody modeling and *phoneme-level speech manipulation*
* Latent speech models such as *HuBERT, EnCodec, or similar architectures*
* *PyTorch* for model development and experimentation
* Model optimization, export, and quantization (*ONNX, FP16, INT8*)
* *Edge / on-device inference on CPU*
* Performance optimization for real-time audio systems
* *C++ and Python* programming

*Preferred Background*

Ideal candidates will have:

* Research or industry experience in *voice conversion, accent modification, or speech synthesis*
* Experience with *speech-to-speech translation or zero-shot voice translation*
* Experience working with *real-time audio or streaming speech systems*
* Demonstrated work on *low-latency AI systems in production*
* Familiarity with the constraints of on-device deployment

Job Type: Full-time

Pay: $600.00 - $1,200.00 per month

Work Location: Remote

Senior Speech AI Engineer - S2S (Accent Softening & Voice Translation, On-Device)

Job Description