Google just published new information about its latest advancements in voice AI. The new system, Tacotron 2, sounds remarkably close to a human.
As the years have gone by, the Google voice has started to sound less robotic and more like a human. At this point, the new Tacotron 2 voice AI is almost indistinguishable from a human speaker.
In a recently published research paper, the team at Google details its impressive speech system, Tacotron 2. In the paper, Google highlights the system's ability to speak almost identically to its human creators. The team describes the second-generation speech system as “a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.”
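For readers unfamiliar with two of the terms in that quote, here is a rough Python sketch. It is not code from the paper. The character set, function names, and band settings are all hypothetical, and the Hz-to-mel conversion uses one widely used formula that the paper may or may not use. It shows how text can become a list of integer IDs (the step before character embeddings are looked up) and how a mel scale spaces frequency bands more tightly at low frequencies than at high ones.

```python
import math

# Hypothetical character set for illustration; not the paper's symbol table.
ALPHABET = "abcdefghijklmnopqrstuvwxyz '.,?!"


def text_to_ids(text):
    """Map each known character to an integer ID; unknown characters are skipped."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    return [index[ch] for ch in text.lower() if ch in index]


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (a common formula; assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(f_min_hz, f_max_hz, n_bands):
    """Return band center frequencies (Hz) spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    print(text_to_ids("She sells sea-shells."))
    # Illustrative settings only: 8 bands between 0 Hz and 8000 Hz.
    print([round(f) for f in mel_band_centers(0.0, 8000.0, 8)])
```

Running the script prints the ID list for a sample sentence (the hyphen is dropped because it is not in the toy character set) followed by eight band centers. The gaps between the centers grow as frequency rises, which roughly mirrors how human hearing resolves pitch.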
As stated in the report, the technology comprises two deep neural networks. The first network translates the text into a spectrogram, then passes it to WaveNet, the system created by DeepMind, which turns the spectrogram into audio. What do you get when you combine these systems? A voice that sounds like its human counterparts. Listen to the voice recordings presented below. One of the recordings is Tacotron 2 while the other is a paid actress. Can you tell the difference?
In these recordings, the voice says “That girl did a video about Star Wars lipstick.”
Or how about this one? “She earned a doctorate in sociology at Columbia University.”
If you want to hear the power of Tacotron 2, listen to it attempt these tongue twisters.
“Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?”
“She sells sea-shells on the sea-shore. The shells she sells are sea-shells I’m sure.”
The AI also does a fantastic job of parsing context and understanding where stress is supposed to lie. Listen to the perfect inflection it uses in the statement “He thought it was time to present the present.”
It can also distinguish between words that are spelled the same but pronounced differently, such as the past tense “read” and the infinitive “to read.” Even some (human) native English speakers can struggle with those while reading aloud!
While the samples sound great, there are still some difficult problems to be tackled. For example, the system has difficulty pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, the system cannot yet generate audio in real time, nor can it be directed to sound happy or sad.
Though the system does occasionally struggle with the pronunciation of multi-syllable words, Tacotron 2 does deliver some impressive vocal acoustics. Once the system is finalized for production, Tacotron 2 is sure to be a powerful voice across Google’s ecosystem.