Google researchers have developed a deep learning system that can isolate a specific person’s voice in a noisy recording by watching that person’s face as they speak.
Google’s team first trained the neural network on clips of individuals speaking on their own. It then created virtual “cocktail parties” by mixing those clips together with background noise, teaching the network to separate the overlapping voices into their own distinct audio tracks.
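The mixing step described above can be sketched in a few lines. This is a hypothetical illustration, not Google’s actual pipeline: it sums clean single-speaker tracks and adds noise scaled to a chosen signal-to-noise ratio, so the original clean tracks serve as ground-truth separation targets during training.

```python
import math

def make_synthetic_mixture(clean_tracks, noise, snr_db=10.0):
    """Build a synthetic 'cocktail party' clip from clean recordings.

    clean_tracks: list of equal-length sample lists, one per speaker.
    noise: background-noise samples of the same length.
    snr_db: target speech-to-noise ratio in decibels (assumed parameter).
    """
    # Sum the clean single-speaker tracks sample by sample.
    mixture = [sum(samples) for samples in zip(*clean_tracks)]

    # Scale the noise so the mixture sits at the requested SNR.
    sig_power = sum(s * s for s in mixture) / len(mixture)
    noise_power = sum(n * n for n in noise) / len(noise) or 1e-12
    scale = math.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))

    return [m + scale * n for m, n in zip(mixture, noise)]
```

Because the mixture is built from known clean tracks, the network can be scored on how closely its separated outputs match each original recording.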
The system can produce a clean audio track by focusing on a single person’s face. It even works when the face is partially covered by hand gestures or a microphone.
It can also differentiate between two people deliberately talking over one another, as seen in the Team Coco clip above.
Google is exploring different ways of using the technology. Building it into video messaging services like Google Duo could be helpful, especially when video chats take place in a crowded room. However, as the video below suggests, the same capability could also be used for eavesdropping.
The audio-visual speech separation work was described on Google’s research blog by software engineers Oran Lang and Inbar Mosseri.
Source: Google Research Blog