Natural Conversation With PersonaPlex

Nvidia has unveiled PersonaPlex, a new conversational AI system designed to hold natural, real-time spoken conversations while allowing users to define both the voice and the role of the assistant. The company says the technology removes a long-standing trade-off in voice AI, where systems have either sounded natural or been customisable, but rarely both.

Traditional voice assistants typically rely on a cascade of separate components. One model converts speech to text, another generates a response, and a third turns that response back into speech. While this approach allows developers to adjust tone and persona, it often introduces delays. Those delays can result in awkward pauses, missed interruptions and unnatural turn-taking that make conversations feel mechanical.

Why latency has been so hard to fix

Latency in traditional systems is difficult to eliminate because each stage waits for the previous one to finish. Speech must first be fully transcribed before a language model begins processing. Only then can a text-to-speech system generate audio. Even small delays at each step add up, especially when users interrupt or change direction mid-sentence.
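The way these delays accumulate can be shown with a minimal Python sketch. The stage latencies below are illustrative assumptions, not measurements from Nvidia or any real system, and the interruption model (discard the work in progress and restart the pipeline) is a deliberate simplification.

```python
# Illustrative per-stage latencies in milliseconds (assumed values, not measured).
CASCADE_STAGES_MS = {
    "speech_to_text": 300,
    "language_model": 400,
    "text_to_speech": 250,
}

def cascade_latency_ms(stages):
    """Each stage waits for the previous one to finish, so the delays add up."""
    return sum(stages.values())

def latency_after_interruption_ms(stages, interrupted_at_ms):
    """Simplified model: work done before the interruption is thrown away
    and the whole pipeline runs again on the new utterance."""
    return interrupted_at_ms + cascade_latency_ms(stages)

total = cascade_latency_ms(CASCADE_STAGES_MS)
with_interruption = latency_after_interruption_ms(CASCADE_STAGES_MS, interrupted_at_ms=500)
print(f"uninterrupted: {total} ms, interrupted at 500 ms: {with_interruption} ms")
```

With these assumed numbers, a single turn takes close to a second, and one interruption pushes the wait well past that.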

This structure also makes it harder to handle backchannel cues such as “uh huh” or “okay”, which are common in human dialogue. Because the system processes speech in chunks rather than continuously, it struggles to respond in real time.

How PersonaPlex gets around the problem

PersonaPlex takes a different approach. Built on the Moshi architecture developed by Kyutai, it uses a single full-duplex model that can listen and speak at the same time. Rather than passing audio between separate systems, it continuously updates its internal state as the user talks and streams responses back immediately.
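The idea of listening and speaking in the same time step can be illustrated with a toy loop. This is a sketch only: `ToyDuplexModel` and its `step` method are hypothetical names invented for illustration, not part of PersonaPlex or Moshi, and string "frames" stand in for real audio.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDuplexModel:
    """Hypothetical stand-in for a full-duplex model; not the PersonaPlex API."""
    heard: list = field(default_factory=list)  # running internal state

    def step(self, user_frame: str) -> str:
        # Update the state with the incoming frame, then emit an output frame
        # in the same time step, so input and output streams stay interleaved.
        self.heard.append(user_frame)
        if user_frame == "<silence>":
            return f"reply-{len(self.heard)}"  # user is quiet: keep speaking
        return "<silence>"                     # user is talking: yield the floor

def run(model, user_frames):
    """Feed frames one at a time and collect one output frame per input frame."""
    return [model.step(frame) for frame in user_frames]

user_frames = ["hello", "there", "<silence>", "<silence>", "wait", "<silence>"]
outputs = run(ToyDuplexModel(), user_frames)
print(outputs)
```

In this toy version the model stops talking on the very frame where the user says "wait", which is the behaviour a cascaded pipeline struggles to match because it only reacts once a whole utterance has been processed.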

The system combines two prompts to define behaviour. A voice prompt captures vocal style and prosody, while a text prompt sets the role and context. Together they create a coherent persona. Nvidia says this hybrid prompting architecture allows natural conversation while maintaining character consistency across longer exchanges.

Under the surface, a speech encoder called Mimi converts audio into tokens, which are processed by temporal and depth transformers. A speech decoder then generates audio at a 24 kHz sample rate. The underlying language model, Helium, provides semantic understanding and helps the system generalise to scenarios beyond its training data.
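A short calculation shows what tokenising 24 kHz audio implies for timing. Only the sample rate comes from this article. The 12.5 Hz frame rate and the eight codebooks per frame are assumptions taken from Kyutai's published description of Mimi and may not match PersonaPlex's exact configuration.

```python
SAMPLE_RATE_HZ = 24_000  # from the article
FRAME_RATE_HZ = 12.5     # assumption: Mimi frame rate as described by Kyutai
NUM_CODEBOOKS = 8        # assumption: codebook tokens emitted per frame

# How many raw audio samples each token frame summarises.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

# How much audio time one frame covers; this bounds how finely the model
# can react to the user.
frame_duration_ms = 1000 / FRAME_RATE_HZ

# Audio tokens per second for one stream under these assumptions.
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS

print(samples_per_frame, frame_duration_ms, tokens_per_second)
```

Under these assumptions each frame covers 1,920 samples, or 80 ms of audio, so the model has a chance to revise its output roughly a dozen times per second.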

Training for realism and flexibility

To teach the model natural rhythm and emotion, Nvidia used more than 1,200 hours of real conversations from the Fisher English corpus. These recordings contain interruptions, pauses and overlapping speech. They were combined with thousands of synthetic assistant and customer service dialogues generated using large language models.

The result, according to Nvidia, is a system that blends realistic speech patterns with strong task adherence. In benchmark tests, PersonaPlex outperformed several open-source and commercial systems on measures of turn-taking, interruption handling and response speed.

Beyond telling jokes

Although early demonstrations include light-hearted exchanges, Nvidia sees broader applications. PersonaPlex can act as a banking agent verifying suspicious transactions, a medical receptionist recording patient details, or a technical expert responding to an emergency scenario such as a failing spacecraft reactor.

The company suggests the technology could be used in customer service centres, healthcare administration, education and immersive entertainment. By combining custom roles with low-latency interaction, Nvidia believes PersonaPlex could make voice AI feel less like issuing commands and more like speaking to another person.

Code and model weights have been released under open licences, with further evaluation benchmarks due to follow.