Microsoft’s AI tool VALL-E imitates anyone’s voice: "Text to Speech"

AI has been exploding in recent times, with dozens of platforms emerging as AI tools for text-to-image, video and 3D creation, and now for text-to-voice generation. Around the same time as its reported $10 billion investment in OpenAI, the company behind ChatGPT, Microsoft has unveiled its own new AI tool, VALL-E, which converts text to speech in any voice from a three-second audio sample.

Just as DALL-E transforms text into images, the similarly named VALL-E transforms text into voice.

Microsoft’s VALL-E

VALL-E builds on Meta’s EnCodec audio compression technology, which helps it preserve a speaker’s emotional tone and acoustic environment when imitating the voice in the audio sample. Trained on roughly 60,000 hours of English speech, the AI tool can simulate a voice after hearing just three seconds of it, a notable advance for text-to-speech systems.
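To get a feel for how little data a three-second sample is once it has been compressed, here is a rough back-of-the-envelope sketch in Python. The codec settings below (75 frames per second, 8 codebooks of 1,024 entries) are assumptions taken from EnCodec's published 24 kHz, 6 kbps configuration and are not stated in this article; the function names are made up for illustration.

```python
# Assumed EnCodec settings (not stated in this article): 24 kHz model,
# 75 frames per second, 8 codebooks of 1,024 entries each.
FRAMES_PER_SECOND = 75
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024


def prompt_token_count(seconds: float) -> int:
    """Number of discrete codec tokens representing `seconds` of audio."""
    return round(seconds * FRAMES_PER_SECOND) * NUM_CODEBOOKS


def bitrate_bps() -> int:
    """Bits per second implied by the assumed codec settings."""
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()  # 10 bits for 1,024 entries
    return FRAMES_PER_SECOND * NUM_CODEBOOKS * bits_per_code


if __name__ == "__main__":
    print(prompt_token_count(3))  # 1800
    print(bitrate_bps())          # 6000
```

Under these assumptions, the three-second voice sample boils down to about 1,800 discrete codes, which gives a sense of how compact the "voice print" the model works from may be.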

The research paper on VALL-E, available on arXiv (hosted by Cornell University), states, “Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system [AI that recreates voices it’s never heard] in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

With VALL-E, users could synthesize personalized speech, edit recordings and create audio content, and the technology could be widely useful for virtual learning, customer service automation, speech editing and more. The AI tool can not only simulate voices from audio samples but also deliver speech with the required emotion, making the audio more natural and convincing.

In essence, VALL-E matches the three-second audio sample against the voices of the roughly 7,000 speakers in its training data and then delivers the text in a similar voice, producing an accurate imitation.

Spoofing Concern

VALL-E also raises potential risks and privacy issues if it is misused, such as spoofing voice identification or impersonating a speaker without their consent.

Microsoft has said that it will work to address these issues in future upgrades and developments.

Would you use the AI tool if it becomes available?

(For more interesting technology and innovation stories, keep reading The Inner Detail.)