GLORIA

GEOMAR Library Ocean Research Information Access

  • 1
    Online Resource
    MDPI AG ; 2018
    In: Applied Sciences, MDPI AG, Vol. 8, No. 9 (2018-08-23), p. 1436-
    Abstract: Enhancing speech captured by distant microphones is a challenging task. In this study, we investigate the multichannel signal properties of the single acoustic vector sensor (AVS) to obtain the inter-sensor data ratio (ISDR) model in the time-frequency (TF) domain. Then, the monotone functions describing the relationship between the ISDRs and the direction of arrival (DOA) of the target speaker are derived. For the target speech enhancement (SE) task, the DOA of the target speaker is given, and the ISDRs are calculated. Hence, the TF components dominated by the target speech are extracted with high probability using the established monotone functions, and then a nonlinear soft mask of the target speech is generated. As a result, a masking-based speech enhancement method is developed, which is termed the AVS-SMASK method. Extensive experiments with simulated and recorded data have been carried out to validate the effectiveness of our proposed AVS-SMASK method in terms of suppressing spatial speech interference and reducing the adverse impact of additive background noise while introducing little speech distortion. Moreover, our AVS-SMASK method is computationally inexpensive, and the AVS is of a small physical size. These merits are favorable for many applications, such as robot auditory systems. (An illustrative sketch of the soft-masking step follows this record.)
    Type of Medium: Online Resource
    ISSN: 2076-3417
    Language: English
    Publisher: MDPI AG
    Publication Date: 2018
    ZDB ID: 2704225-X
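The record above (no. 1) describes a masking-based enhancement pipeline: recover per-bin direction cues from the AVS channel ratios, compare them with the known target DOA, and build a nonlinear soft mask. Below is a minimal Python sketch of that idea, assuming an idealized 2-D AVS model and a Gaussian mask shape; it is not the published AVS-SMASK algorithm, and the names (soft_mask, width) are illustrative.

# Hedged sketch: DOA-driven soft time-frequency masking with an acoustic
# vector sensor (AVS). The ideal 2-D sensor model, the ratio used to recover
# the apparent DOA, and the Gaussian mask are illustrative assumptions.
import numpy as np

def soft_mask(vx, vy, p, target_doa_rad, width=0.2):
    """Keep TF bins whose apparent DOA (recovered from the velocity/pressure
    channel ratios) is close to the known target DOA."""
    eps = 1e-8
    apparent = np.arctan2(np.real(vy * np.conj(p)),
                          np.real(vx * np.conj(p)) + eps)
    diff = np.angle(np.exp(1j * (apparent - target_doa_rad)))  # wrap to [-pi, pi]
    return np.exp(-(diff / width) ** 2)  # nonlinear (Gaussian) soft mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f, t = 257, 100  # toy STFT grid
    s = rng.normal(size=(f, t)) + 1j * rng.normal(size=(f, t))  # target speech
    n = rng.normal(size=(f, t)) + 1j * rng.normal(size=(f, t))  # interference
    doa_s, doa_n = np.deg2rad(30.0), np.deg2rad(100.0)
    p = s + n                                    # pressure channel
    vx = s * np.cos(doa_s) + n * np.cos(doa_n)   # velocity channel (x)
    vy = s * np.sin(doa_s) + n * np.sin(doa_n)   # velocity channel (y)
    mask = soft_mask(vx, vy, p, doa_s)
    enhanced = mask * p
    print("mean mask value:", float(mask.mean()))

In this toy mixture, bins dominated by the 30-degree source receive mask values near one, while bins dominated by the 100-degree interferer are attenuated.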
  • 2
    Online Resource
    Elsevier BV ; 2015
    In: Pattern Recognition, Elsevier BV, Vol. 48, No. 10 (2015-10), p. 3076-3092
    Type of Medium: Online Resource
    ISSN: 0031-3203
    Language: English
    Publisher: Elsevier BV
    Publication Date: 2015
    ZDB ID: 1466343-0
  • 3
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2021
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 35, No. 14 (2021-05-18), p. 13098-13106
    Abstract: While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension, which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, many MC tasks accept audio input in addition to the textual input, e.g., an English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks, including Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension, and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy. Second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models, which are trained and tested with only audio or textual data, by up to 18.87%. (An illustrative sketch of inter-/intra-modality attention fusion follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2021
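Record 3 above describes fusing audio and text through dynamic inter- and intra-modality attention. The sketch below illustrates that general pattern with off-the-shelf multi-head attention; the layer sizes, the mean-pooling, and the concatenation-based fusion are my assumptions, not the published DIIA design.

# Hedged sketch: intra-modality self-attention plus inter-modality
# cross-attention, fused into one joint representation.
import torch
import torch.nn as nn

class InterIntraFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, text, audio):
        # Intra-modality attention: each modality attends to itself.
        t, _ = self.intra_text(text, text, text)
        a, _ = self.intra_audio(audio, audio, audio)
        # Inter-modality attention: text queries audio and vice versa.
        t_cross, _ = self.audio_to_text(t, a, a)
        a_cross, _ = self.text_to_audio(a, t, t)
        # Fuse by pooling each stream and projecting to a joint representation.
        fused = torch.cat([t_cross.mean(dim=1), a_cross.mean(dim=1)], dim=-1)
        return self.out(fused)

if __name__ == "__main__":
    model = InterIntraFusion()
    text = torch.randn(2, 20, 256)   # (batch, text tokens, dim)
    audio = torch.randn(2, 50, 256)  # (batch, audio frames, dim)
    print(model(text, audio).shape)  # torch.Size([2, 256])

Because the fusion module only consumes per-modality token sequences, a component like this can in principle be bolted onto an existing reading-comprehension encoder, which is the integration property the abstract emphasizes.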
  • 4
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2021
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 35, No. 4 (2021-05-18), p. 3119-3127
    Abstract: It is encouraging to see that progress has been made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding based model with a coarse-to-fine captioning procedure to alleviate these defects. In our implementation, we employ a bi-directional self-attention based network as our language model to achieve inference speedup, based on which we decompose the captioning procedure into two stages, where the model has different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism for generating visual words that not only promotes the training of scene-related words but also captures relevant details from videos to construct a coarse-grained sentence "template". Thereafter, we devise dedicated decoding algorithms that fill in the "template" with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency. (An illustrative sketch of iterative-refinement decoding follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2021
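Record 4 above decodes captions non-autoregressively and then repairs them by iterative refinement. The sketch below shows a generic mask-predict style refinement loop: all positions are predicted in parallel, then the least confident positions are re-masked and re-predicted. The toy model, the hypothetical MASK_ID, and the linear re-masking schedule are assumptions; this is not the paper's dedicated decoding algorithm and omits the visual-word "template" stage.

# Hedged sketch of non-autoregressive decoding with iterative refinement.
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def refine_decode(model, length, vocab_size, steps=4, device="cpu"):
    """model(tokens) -> logits of shape (length, vocab_size); it predicts all
    positions in parallel given the current (partially masked) sequence."""
    tokens = torch.full((length,), MASK_ID, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)
        tokens = preds
        # Re-mask the least confident positions; keep fewer masks each step.
        n_mask = int(length * (1 - (step + 1) / steps))
        if n_mask > 0:
            worst = probs.argsort()[:n_mask]
            tokens[worst] = MASK_ID
    return tokens

if __name__ == "__main__":
    vocab, length = 100, 12
    toy = torch.nn.Linear(length, length * vocab)  # stand-in "language model"
    def model(tokens):
        return toy(tokens.float()).view(length, vocab)
    print(refine_decode(model, length, vocab))

Because every refinement step predicts all positions at once, the number of model calls is fixed by the step count rather than by the caption length, which is the source of the inference speedup the abstract claims.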
  • 5
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2019
    In: IEEE Access, Institute of Electrical and Electronics Engineers (IEEE), Vol. 7 (2019), p. 62805-62816
    Type of Medium: Online Resource
    ISSN: 2169-3536
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2019
    ZDB ID: 2687964-5
  • 6
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2019
    In: IEEE Transactions on Instrumentation and Measurement, Institute of Electrical and Electronics Engineers (IEEE), Vol. 68, No. 1 (2019-1), p. 73-86
    Type of Medium: Online Resource
    ISSN: 0018-9456, 1557-9662
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2019
    ZDB ID: 160442-9
    ZDB ID: 2027532-8
  • 7
    Online Resource
    Association for Computing Machinery (ACM) ; 2022
    In: ACM Transactions on Knowledge Discovery from Data, Association for Computing Machinery (ACM), Vol. 16, No. 1 (2022-02-28), p. 1-19
    Abstract: Vision-and-language (V-L) tasks require the system to understand both vision content and natural language; thus, learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces for vision and language, so that the representations of the two modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image–sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts. (An illustrative sketch of disentangled per-modality attention follows this record.)
    Type of Medium: Online Resource
    ISSN: 1556-4681, 1556-472X
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2022
    ZDB ID: 2257358-6
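Record 7 above argues for separate attention spaces per modality instead of one shared set of attention matrices. The sketch below illustrates that idea with two independent multi-head attention modules, each letting its own modality query the full multimodal sequence; the dimensions and overall layout are illustrative assumptions, not the DiMBERT architecture.

# Hedged sketch: "disentangled" multimodal attention, i.e. per-modality
# attention parameters over a shared (concatenated) token sequence.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Separate attention spaces per modality (vs. one shared set).
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang_tokens, vis_tokens):
        full = torch.cat([lang_tokens, vis_tokens], dim=1)
        # Each modality queries the full sequence, but through its own
        # attention matrices, so the two latent spaces stay separated.
        lang_out, _ = self.lang_attn(lang_tokens, full, full)
        vis_out, _ = self.vis_attn(vis_tokens, full, full)
        return lang_out, vis_out

if __name__ == "__main__":
    layer = DisentangledAttention()
    lang = torch.randn(2, 16, 256)  # e.g. sentence tokens
    vis = torch.randn(2, 36, 256)   # e.g. region features / visual concepts
    l_out, v_out = layer(lang, vis)
    print(l_out.shape, v_out.shape)

Keeping the query/key/value projections separate per modality is what lets the two output streams live in distinct latent spaces while still attending across modalities.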
  • 8
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2020
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 34, No. 07 (2020-04-03), p. 11572-11579
    Abstract: Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both academia and industry. However, given the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems and are thus much more powerful than the original representations alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated in three federated learning settings: horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate consistent improvements on all metrics over the baselines. In image captioning, we obtain 14% and 13% relative gains on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%. (An illustrative sketch of fusing task-specific image representations follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2020
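Record 8 above fuses task-specific image representations (e.g. from a captioning model and a VQA model) into one fine-grained representation via aligning, integrating and mapping steps. The sketch below is a loose, hedged reading of those three steps: project each representation into a joint space, weight and sum them, then project once more. The linear projections and softmax weighting are my choices and do not reproduce the published aimNet or its federated training setup.

# Hedged sketch: fusing task-specific image representations.
import torch
import torch.nn as nn

class FuseRepresentations(nn.Module):
    def __init__(self, dims=(2048, 1024), joint_dim=512):
        super().__init__()
        # Align: project each task-specific representation to a joint space.
        self.align = nn.ModuleList(nn.Linear(d, joint_dim) for d in dims)
        # Integrate: learn a scalar weight per aligned representation.
        self.score = nn.Linear(joint_dim, 1)
        # Map: final projection of the integrated representation.
        self.map = nn.Linear(joint_dim, joint_dim)

    def forward(self, feats):
        aligned = torch.stack(
            [proj(f) for proj, f in zip(self.align, feats)], dim=1)
        weights = self.score(aligned).softmax(dim=1)   # (batch, n_tasks, 1)
        integrated = (weights * aligned).sum(dim=1)    # weighted sum
        return self.map(integrated)

if __name__ == "__main__":
    captioning_feat = torch.randn(4, 2048)  # e.g. from a captioning encoder
    vqa_feat = torch.randn(4, 1024)         # e.g. from a VQA encoder
    fused = FuseRepresentations()([captioning_feat, vqa_feat])
    print(fused.shape)  # torch.Size([4, 512])

In a federated setting, the per-task encoders would stay with their owners and only the resulting representations (or the fusion module) would be shared, which matches the motivation given in the abstract.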
  • 9
    Online Resource
    Elsevier BV ; 2009
    In: Pattern Recognition Letters, Elsevier BV, Vol. 30, No. 9 (2009-7), p. 827-837
    Type of Medium: Online Resource
    ISSN: 0167-8655
    Language: English
    Publisher: Elsevier BV
    Publication Date: 2009
    ZDB ID: 1466342-9
  • 10
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2022
    In: IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), Vol. 44, No. 12 (2022-12-1), p. 9255-9268
    Type of Medium: Online Resource
    ISSN: 0162-8828, 2160-9292, 1939-3539
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2022
    ZDB ID: 2027336-8