Design World


Machine-Learning System Tackles Speech and Object Recognition, All at Once

By Rob Matheson, MIT News Office | September 18, 2018

MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image. Given an image and an audio caption, the model will highlight in real time the relevant regions of the image being described.

Unlike current speech-recognition technologies, the model doesn’t require manual transcriptions and annotations of the examples it’s trained on. Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another.

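The model can currently recognize only several hundred different words and object types. But the researchers hope that one day their combined speech-object recognition technique could save countless hours of manual labor and open new doors in speech and image recognition.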

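Speech-recognition systems such as Siri and Google Voice Search, for instance, require transcriptions of thousands of hours of speech recordings. Using these data, the systems learn to map speech signals to specific words. Such an approach becomes especially problematic when new terms enter our lexicon and the systems must be retrained.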

“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing,” says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. Harwath co-authored a paper describing the model that was presented at the recent European Conference on Computer Vision.

In the paper, the researchers demonstrate their model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background. The model learned to associate which pixels in the image corresponded with the words “girl,” “blonde hair,” “blue eyes,” “blue dress,” “white lighthouse,” and “red roof.” When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.

One promising application is learning translations between different languages, without need of a bilingual annotator. Of the estimated 7,000 languages spoken worldwide, only 100 or so have enough transcription data for speech recognition. Consider, however, a situation where two different-language speakers describe the same image. If the model learns speech signals from language A that correspond to objects in the image, and learns the signals in language B that correspond to those same objects, it could assume those two signals — and matching words — are translations of one another.

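“There’s potential there for a Babel Fish-type of mechanism,” Harwath says, referring to the fictitious living earpiece in “The Hitchhiker’s Guide to the Galaxy” novels that translates different languages for the wearer.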

The CSAIL co-authors are: graduate student Adria Recasens; visiting student Didac Suris; former researcher Galen Chuang; Antonio Torralba, a professor of electrical engineering and computer science who also heads the MIT-IBM Watson AI Lab; and Senior Research Scientist James Glass, who leads the Spoken Language Systems Group at CSAIL.

Audio-visual associations

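This work expands on an earlier model developed by Harwath, Glass, and Torralba that correlates speech with groups of thematically related images. In the earlier research, they put images of scenes from a classification database on the crowdsourcing Mechanical Turk platform. They then had people describe the images as if they were narrating to a child, for about 10 seconds. They compiled images and audio captions in hundreds of different categories, such as beaches, shopping malls, city streets, and bedrooms.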

They then designed a model consisting of two separate convolutional neural networks (CNNs). One processes images, and one processes spectrograms, a visual representation of audio signals as they vary over time. The highest layer of the model computes outputs of the two networks and maps the speech patterns with image data.
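To make the term "spectrogram" concrete, here is a minimal sketch, assuming a plain short-time discrete Fourier transform over non-overlapping frames. The function name `spectrogram`, the frame length, and the test tone are illustrative choices; the article does not specify the audio front end the researchers used.

```python
import cmath
import math


def spectrogram(signal, frame_len):
    """Return a list of frames; each frame is a list of DFT magnitudes.

    Assumption: non-overlapping rectangular frames and a naive O(n^2) DFT,
    chosen for readability rather than speed.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        magnitudes = []
        # For a real-valued signal only the first half of the bins is unique.
        for k in range(frame_len // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Hypothetical input: a pure tone completing 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spec = spectrogram(tone, frame_len=32)

# The strongest frequency bin in each time frame.
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each row of `spec` is one time slice and each column one frequency bin, which is what lets an image-style CNN treat audio as a two-dimensional picture.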

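The researchers would, for instance, feed the model caption A and image A, which is a correct pairing. Then they would feed it a random caption B with image A, which is an incorrect pairing. After comparing thousands of wrong captions with image A, the model learns the speech signals corresponding with image A and associates those signals with words in the captions. As described in a 2016 study, the model learned, for instance, to pick out the signal corresponding to the word “water,” and to retrieve images with bodies of water.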
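The correct-versus-incorrect pairing described above can be sketched as a margin ranking objective. This is an assumption made for illustration: the article does not name the loss function, and `ranking_loss`, the margin value, and the toy vectors are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(image_vec, caption_vec, wrong_caption_vec, margin=1.0):
    """Hinge loss that is zero once the matched pair outscores the
    mismatched pair by at least `margin` (an assumed hyperparameter)."""
    matched = dot(image_vec, caption_vec)
    mismatched = dot(image_vec, wrong_caption_vec)
    return max(0.0, margin - matched + mismatched)


# Toy embeddings standing in for the outputs of the two networks.
image_a = [1.0, 0.0, 2.0]
caption_a = [0.9, 0.1, 1.8]   # describes image A
caption_b = [-0.5, 1.0, 0.2]  # random caption for a different image

# Correct pairing already wins by more than the margin: no penalty.
good = ranking_loss(image_a, caption_a, caption_b)

# Roles swapped: the "matched" pair scores lower, so the loss is large.
bad = ranking_loss(image_a, caption_b, caption_a)
print(good, bad)
```

Minimizing a loss of this shape over many random mismatches pushes matched image-caption pairs to score higher than mismatched ones, without any transcription of the audio.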

“But it didn’t provide a way to say, ‘This is exact point in time that somebody said a specific word that refers to that specific patch of pixels,’” Harwath says.

Making a matchmap

In the new paper, the researchers modified the model to associate specific words with specific patches of pixels. The researchers trained the model on the same database, but with a new total of 400,000 image-caption pairs. They held out 1,000 random pairs for testing.

In training, the model is similarly given correct and incorrect images and captions. But this time, the image-analyzing CNN divides the image into a grid of cells consisting of patches of pixels. The audio-analyzing CNN divides the spectrogram into segments of, say, one second to capture a word or two.

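With the correct image and caption pair, the model matches the first cell of the grid to the first segment of audio, then matches that same cell with the second segment of audio, and so on, through every grid cell and across all time segments. For each cell and audio segment, it provides a similarity score, depending on how closely the signal corresponds to the object.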
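The cell-by-segment scoring can be sketched as follows. The dot product as the similarity measure, the function name `similarity_grid`, and the tiny two-dimensional embeddings are assumptions for illustration; the article says only that each cell and segment pair receives a similarity score.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_grid(cell_vecs, segment_vecs):
    """Score every image cell against every audio segment.

    cell_vecs[row][col] is the embedding of one grid cell;
    segment_vecs[t] is the embedding of one audio segment.
    Returns scores[row][col][t].
    """
    return [
        [[dot(cell, seg) for seg in segment_vecs] for cell in row]
        for row in cell_vecs
    ]


# Hypothetical 2x2 image grid and three audio segments.
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
segments = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

scores = similarity_grid(grid, segments)
print(scores[0][1])  # scores of the top-right cell against each segment
```

The result has one score per (row, column, time segment) triple, so its size grows with both the grid resolution and the caption length.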

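The challenge is that, during training, the model doesn’t have access to any true alignment information between the speech and the image. “The biggest contribution of the paper,” Harwath says, “is demonstrating that these cross-modal alignments can be inferred automatically, by simply teaching the network which images and captions belong together and which pairs don’t.”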

The authors dub this automatically learned association between a spoken caption’s waveform and the image pixels a “matchmap.” After training on thousands of image-caption pairs, the network narrows down those alignments to specific words representing specific objects in that matchmap.
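One way to read a matchmap is to ask, for each audio segment, which image cell scores highest; that cell is the region to highlight while the segment is spoken. The sketch below assumes the matchmap is stored as `scores[row][col][segment]` and uses made-up values; `best_cell_per_segment` is a hypothetical helper, not a function from the paper.

```python
def best_cell_per_segment(scores):
    """For each audio segment, return the (row, col) of the top-scoring cell."""
    n_rows, n_cols = len(scores), len(scores[0])
    n_segments = len(scores[0][0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    highlights = []
    for t in range(n_segments):
        # Pick the cell whose similarity to segment t is largest.
        highlights.append(max(cells, key=lambda rc: scores[rc[0]][rc[1]][t]))
    return highlights


# Made-up matchmap for a 2x2 grid and three audio segments.
example_scores = [
    [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]],
    [[0.0, 0.2, 0.7], [0.3, 0.3, 0.3]],
]

highlights = best_cell_per_segment(example_scores)
print(highlights)
```

Taking a single maximum is the simplest choice; a real system might instead threshold the scores so that an object spanning several cells is highlighted as a whole.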

“It’s kind of like the Big Bang, where matter was really dispersed, but then coalesced into planets and stars,” Harwath says. “Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects.”

