Machine-Learning System Tackles Speech and Object Recognition, All at Once

MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image. Given an image and an audio caption, the model will highlight in real-time the relevant regions of the image being described.

Unlike current speech-recognition technologies, the model doesn’t require manual transcriptions and annotations of the examples it’s trained on. Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another.

该模型当前只能识别几百个不同的单词和对象类型。但是研究人员希望有一天他们的语音对象识别技术可以节省无数小时的手动劳动，并在语音和图像识别中打开新的门。

例如，语音识别系统（例如Siri和Google语音搜索）需要数千小时的语音记录。使用这些数据，系统学会使用特定单词来映射语音信号。当说新术语输入我们的词典，必须重新训练，这种方法尤其有问题。

“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing,” says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. Harwath co-authored a paper describing the model that was presented at the recent European Conference on Computer Vision.

在论文中,研究人员展示他们的del on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background. The model learned to associate which pixels in the image corresponded with the words “girl,” “blonde hair,” “blue eyes,” “blue dress,” “white light house,” and “red roof.” When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.

One promising application is learning translations between different languages, without need of a bilingual annotator. Of the estimated 7,000 languages spoken worldwide, only 100 or so have enough transcription data for speech recognition. Consider, however, a situation where two different-language speakers describe the same image. If the model learns speech signals from language A that correspond to objects in the image, and learns the signals in language B that correspond to those same objects, it could assume those two signals — and matching words — are translations of one another.

哈瓦斯说：“有可能有一种叫bab的机制。”他指的是“银河系的搭便车指南”中的虚拟活着的耳机，该小说将不同的语言转化为佩戴者。

The CSAIL co-authors are: graduate student Adria Recasens; visiting student Didac Suris; former researcher Galen Chuang; Antonio Torralba, a professor of electrical engineering and computer science who also heads the MIT-IBM Watson AI Lab; and Senior Research Scientist James Glass, who leads the Spoken Language Systems Group at CSAIL.

Audio-visual associations

这项工作扩展了Harwath，Glass和Torralba开发的早期模型，该模型将语音与主题相关图像组相关。在较早的研究中，他们放置了分类的场景图像database在众包机械土耳其人平台上。然后，他们让人们描述了这些图像，就好像他们在向孩子叙述大约10秒钟。他们汇编了数百种不同类别的图像和音频字幕，例如海滩，购物中心，城市街道和卧室。

They then designed a model consisting of two separate convolutional neural networks (CNNs). One processes images, and one processes spectrograms, a visual representation of audio signals as they vary over time. The highest layer of the model computes outputs of the two networks and maps the speech patterns with image data.

例如，研究人员将为正确的模型标题A和图像A提供正确的内容。然后，他们会将其送给一个随机字幕B，然后将图像A喂入不正确的配对。在将数千个错误的字幕与图像A进行比较之后，该模型了解了与图像A相对应的语音信号，并将这些信号与字幕中的单词相关联。如2016年所述学习, the model learned, for instance, to pick out the signal corresponding to the word “water,” and to retrieve images with bodies of water.

“But it didn’t provide a way to say, ‘This is exact point in time that somebody said a specific word that refers to that specific patch of pixels,’” Harwath says.

制作对接

In the new paper, the researchers modified the model to associate specific words with specific patches of pixels. The researchers trained the model on the same database, but with a new total of 400,000 image-captions pairs. They held out 1,000 random pairs for testing.

In training, the model is similarly given correct and incorrect images and captions. But this time, the image-analyzing CNN divides the image into a grid of cells consisting of patches of pixels. The audio-analyzing CNN divides the spectrogram into segments of, say, one second to capture a word or two.

使用正确的图像和字幕对，该模型将网格的第一个单元格与音频的第一个段匹配，然后将同一单元格与第二段的音频匹配，依此类推，依此类推时间段。对于每个单元格和音频段，它提供相似性得分，具体取决于信号与对象的对应方式。

面临的挑战是，在培训期间，模型无法访问语音和图像之间的任何真实对齐信息。哈瓦斯说：“论文的最大贡献是证明这些跨模式对齐方式可以通过简单地教授网络来自动推断出图像和标题属于哪些图像和标题在一起，而哪个对不属于哪些图像和标题。”

The authors dub this automatic-learning association between a spoken caption’s waveform with the image pixels a “matchmap.” After training on thousands of image-caption pairs, the network narrows down those alignments to specific words representing specific objects in that matchmap.

“It’s kind of like the Big Bang, where matter was really dispersed, but then coalesced into planets and stars,” Harwath says. “Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects.”