Title:
《Audio-Visual Scene Analysis with Self-Supervised Multisensory Features》
---
Authors:
Andrew Owens, Alexei A. Efros
---
Latest submission year:
2018
---
Classification:
Primary category: Computer Science
Secondary category: Computer Vision and Pattern Recognition
Category description: Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.
--
Primary category: Computer Science
Secondary category: Sound
Category description: Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.
--
Primary category: Electrical Engineering and Systems Science
Secondary category: Audio and Speech Processing
Category description: Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.
---
Abstract:
The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory
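The self-supervised pretext task described in the abstract — training a network to predict whether video frames and their audio track are temporally aligned — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function `make_alignment_examples`, the `samples_per_frame` parameter, and the shift policy are hypothetical, and the actual model consumes raw video and waveform tensors with a fused multisensory network.

```python
import random

def make_alignment_examples(video_frames, audio_samples, samples_per_frame,
                            min_shift_frames=10, seed=0):
    """Build (video, audio, label) training pairs for the alignment
    pretext task: label 1 means the audio is in sync with the frames,
    label 0 means the audio has been shifted in time.

    Hypothetical helper for illustration only; `audio_samples` must be
    long enough to accommodate the largest shift.
    """
    rng = random.Random(seed)
    n_frames = len(video_frames)
    clip_len = n_frames * samples_per_frame

    # Positive example: audio aligned with the video clip.
    aligned = (video_frames, audio_samples[:clip_len], 1)

    # Negative example: audio shifted by at least `min_shift_frames` frames,
    # so the network must detect the audio-visual misalignment.
    shift = rng.randrange(min_shift_frames, n_frames)
    start = shift * samples_per_frame
    misaligned = (video_frames, audio_samples[start:start + clip_len], 0)

    return [aligned, misaligned]
```

Training on such pairs gives the network a free supervisory signal — no human labels are needed, since synchronization is a property of the raw video itself.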
---
PDF link:
https://arxiv.org/pdf/1804.03641