
Real-Time Voice Translation: You Speak English and I Hear Chinese

Pie chan667 26-Jul-2019

A Chinese beta release of Skype Translator, with real-time voice translation technology embedded, was launched in China today. The product draws on Microsoft's work in speech recognition, automatic translation and machine learning, and will finally enable real-time voice conversations between English and Mandarin Chinese.

The real-time speech translation feature is built on a powerful machine learning platform. Machine learning refers to the ability of software to learn from data: to analyze it and use what it finds to improve its own behavior. The training data for speech recognition and machine translation include translated web pages, subtitled videos, and transcribed, pre-translated one-on-one conversations. Skype Translator records these conversations, analyzes the spoken text, and uses it to train the system to do a better job of "learning" languages.

Unlike text that is recited or read aloud, spontaneous speech is often disfluent. When people speak, they pause, repeat themselves, and say things like "um", "er" and "ah". Our machine learning model is designed to handle these disfluencies. In the preview, users can see that filler sounds such as "er" are removed from the transcript, while the rest of the text can be further refined through user feedback.
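To make the idea of removing disfluencies concrete, here is a minimal rule-based sketch in Python. It is not Skype Translator's method, which the article describes as learned from data; the filler list and the rule "drop a word that repeats the previous kept word" are assumptions chosen only for illustration.

```python
import re

# Assumed, hand-written filler list; a learned model would not rely on a fixed set.
FILLERS = {"um", "uh", "er", "ah"}

def remove_disfluencies(transcript: str) -> str:
    """Drop filler words and immediate word repetitions from a raw transcript."""
    # Lowercase and keep only word tokens, discarding punctuation.
    tokens = re.findall(r"[a-zA-Z']+", transcript.lower())
    cleaned: list[str] = []
    for token in tokens:
        if token in FILLERS:
            continue  # skip filler sounds such as "um" and "er"
        if cleaned and cleaned[-1] == token:
            continue  # skip a word that repeats the previous kept word ("I I think")
        cleaned.append(token)
    return " ".join(cleaned)

print(remove_disfluencies("Um, I I think er we should, ah, go"))  # i think we should go
```

A fixed rule like this also deletes legitimate repetitions (for example "had had"), which is one reason production systems learn disfluency detection from annotated conversational data instead.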


Skype Translator pioneered a combination of syntactic and statistical models, together with training targeted specifically at conversational language.
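As a rough illustration of what a statistical model of conversational language can look like, the sketch below trains a textbook bigram language model with add-one (Laplace) smoothing on a few sentences. This is a standard technique picked for illustration, not the model Skype Translator uses, and the training sentences are made up.

```python
import math
from collections import Counter

class BigramModel:
    """Toy bigram language model with add-one (Laplace) smoothing (illustrative only)."""

    def __init__(self, sentences: list[str]) -> None:
        self.unigrams: Counter[str] = Counter()
        self.bigrams: Counter[tuple[str, str]] = Counter()
        for sentence in sentences:
            # Pad each sentence with start and end markers.
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: str) -> float:
        """Return the sum of log P(word | previous word) over the padded sentence."""
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(words, words[1:]):
            numerator = self.bigrams[(prev, word)] + 1          # add-one smoothing
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

# Made-up "conversational" training data.
model = BigramModel(["how are you", "how is it going", "are you there"])
print(model.log_prob("how are you") > model.log_prob("you are how"))  # True
```

The model scores a natural word order higher than a scrambled one, which is the kind of signal a recognizer or translator can use to choose between competing candidate outputs.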

Recognition and translation improve when disfluent words are removed, the text is split into sentences, and punctuation and capitalization are restored. Using the training data gathered during the preview, the software can learn the topics, accents and language usage of real users.
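The sentence splitting and punctuation step can be sketched as follows. This is a deliberately simple rule-based stand-in, assuming the recognizer returns words with start and end timestamps: a silence longer than a threshold ends a sentence, and a sentence that begins with a question word gets a question mark. The `Word` type, the 0.5-second threshold and the question-word list are all hypothetical; the article says the real system learns these decisions from data.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """A recognized word with assumed timestamps in seconds."""
    text: str
    start: float
    end: float

# Assumed values, for illustration only.
PAUSE_THRESHOLD = 0.5
QUESTION_STARTERS = {"what", "where", "when", "why", "how", "who",
                     "do", "does", "did", "is", "are", "can", "could"}

def segment(words: list[Word], pause: float = PAUSE_THRESHOLD) -> list[list[str]]:
    """Split words into sentences wherever the silence between two words exceeds `pause`."""
    sentences: list[list[str]] = []
    current: list[str] = []
    for i, word in enumerate(words):
        if current and word.start - words[i - 1].end > pause:
            sentences.append(current)
            current = []
        current.append(word.text.lower())
    if current:
        sentences.append(current)
    return sentences

def punctuate(sentence: list[str]) -> str:
    """Capitalize the first word and the pronoun 'I', then add '.' or '?'."""
    words = ["I" if w == "i" else w for w in sentence]
    words[0] = words[0].capitalize()
    end = "?" if sentence[0] in QUESTION_STARTERS else "."
    return " ".join(words) + end

sample = [
    Word("how", 0.00, 0.20), Word("are", 0.25, 0.40), Word("you", 0.45, 0.60),
    Word("i", 1.50, 1.60), Word("am", 1.65, 1.80), Word("fine", 1.85, 2.10),
]
print(" ".join(punctuate(s) for s in segment(sample)))  # How are you? I am fine.
```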

Once the prepared data is fed into the machine learning system, the software builds a statistical model of the words used in these conversations and contexts. When a user speaks, the software looks for similar content in that statistical model and passes the result to pre-trained components that convert the audio into text and then translate the text into the other language. In addition, the team built a custom bot to coordinate the entire product experience. The bot establishes the call, sends the audio streams to the speech engine to retrieve the translated text, and delivers the translation of what each party says to the other end of the conversation.
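The coordinating bot described above can be pictured as a small relay loop: audio goes in, recognized text is translated, and the translation is delivered to the other party. The sketch below is hypothetical throughout: `TranslationBot`, its fields, and the stub recognizer and translator are invented for illustration, and call setup is omitted. Real recognition and translation services would replace the stubs.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass

# Hypothetical stage signatures; not a real Skype or Microsoft API.
Recognizer = Callable[[bytes], str]   # audio chunk -> source-language text
Translator = Callable[[str], str]     # source-language text -> target-language text

@dataclass
class TranslationBot:
    """Toy coordinator that relays one speaker's audio through recognition and translation."""
    recognize: Recognizer
    translate: Translator
    deliver: Callable[[str], None]    # sends translated text to the other party

    def relay(self, audio_chunks: Iterable[bytes]) -> None:
        for chunk in audio_chunks:
            text = self.recognize(chunk)
            if text:                  # skip chunks in which nothing was recognized
                self.deliver(self.translate(text))

# Stubs standing in for real speech and translation engines.
PHRASES = {"hello": "你好", "thank you": "谢谢"}
received: list[str] = []
bot = TranslationBot(
    recognize=lambda chunk: chunk.decode("utf-8"),  # pretend the "audio" is already text
    translate=lambda text: PHRASES.get(text, text),
    deliver=received.append,
)
bot.relay([b"hello", b"", b"thank you"])
print(received)  # ['你好', '谢谢']
```

For a two-way call, one such relay per direction (English to Chinese and Chinese to English) would run side by side.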

After years of dedicated effort, Microsoft teams in Beijing and Redmond, USA, completed the Mandarin language model. Thanks to the combination of deep neural networks and statistical machine translation, speech recognition has improved further and translation has become steadily more accurate, making one-on-one voice conversations a reality. For decades, progress in speech recognition was held back by high error rates and by differences in microphone sensitivity and noise conditions. Microsoft Research took the lead in introducing deep neural networks (DNNs) into speech recognition, which greatly reduced error rates, improved reliability, and finally made speech translation practical for wide use.
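The "error rate" of a speech recognizer is commonly reported as the word error rate (WER): the minimum number of word substitutions, deletions and insertions needed to turn the recognizer's output into a reference transcript, divided by the number of reference words. The article does not name a metric, so this is a standard definition added for illustration, computed here with edit-distance dynamic programming.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because written Chinese does not separate words with spaces, Mandarin recognizers are often evaluated with the analogous character error rate, which applies the same computation to characters.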

Today, the curtain rises on real-time voice translation between Chinese and English: the world's most spoken language and the world's most widely spoken second language. We believe that as the technology and products continue to improve, the era of barrier-free communication across languages will eventually arrive.



Updated 07-Sep-2019
I am a Freelancer.
