VALL-E is a text-to-speech AI system that can imitate a human voice from a recording of only 3 seconds. It uses machine learning techniques to analyze an individual’s voice characteristics and reproduce them to generate new phrases. Developed by Microsoft, it is considered a major advancement in speech synthesis.
After OpenAI’s creations such as DALL-E, capable of generating images, and ChatGPT, which can write all kinds of text, Microsoft adds a new member to this AI family with VALL-E, a particularly effective voice synthesis model. VALL-E can imitate a voice from a sample of only 3 seconds, retaining the tone and timbre and even reproducing the acoustic environment of the original audio.
VALL-E was trained on Meta’s sound library, LibriLight, which contains 60,000 hours of English speech from 7,000 different speakers, mostly taken from LibriVox public-domain audiobooks. Researchers are currently working to improve the model’s performance in terms of prosody and style of expression.
For the most curious, VALL-E’s demo, published on GitHub, lets you observe how the AI works through various examples. And we must admit it is quite impressive, even if the AI reportedly struggles with certain accents, not all of which are represented in the LibriLight library.
Like ChatGPT, VALL-E has caused a wave of concern: its enormous potential could be very useful for people who have lost the ability to speak, but it could also easily be used for identity theft. Microsoft’s developers assure that they will include a protocol to verify that speakers approve the use of their voice.
It is important to remember that VALL-E is more iterative than revolutionary, and its capabilities are not as new as one might think. Voice imitation has been the subject of intensive research for several years, and some of that work is mature enough to fuel numerous start-ups.