GLORIA

GEOMAR Library Ocean Research Information Access

  • 1
    Online Resource
    MDPI AG ; 2018
    In: Applied Sciences, MDPI AG, Vol. 8, No. 9 (2018-08-23), p. 1436-
    Abstract: Enhancing speech captured by distant microphones is a challenging task. In this study, we investigate the multichannel signal properties of the single acoustic vector sensor (AVS) to obtain the inter-sensor data ratio (ISDR) model in the time-frequency (TF) domain. Then, the monotone functions describing the relationship between the ISDRs and the direction of arrival (DOA) of the target speaker are derived. For the target speech enhancement (SE) task, the DOA of the target speaker is given, and the ISDRs are calculated. Hence, the TF components dominated by the target speech are extracted with high probability using the established monotone functions, and then a nonlinear soft mask of the target speech is generated. As a result, a masking-based speech enhancement method is developed, which is termed the AVS-SMASK method. Extensive experiments with simulated and recorded data have been carried out to validate the effectiveness of our proposed AVS-SMASK method in terms of suppressing spatial speech interference and reducing the adverse impact of additive background noise while introducing little speech distortion. Moreover, our AVS-SMASK method is computationally inexpensive, and the AVS is of a small physical size. These merits are favorable for many applications, such as robot auditory systems. (An illustrative sketch of the soft-masking step follows this record.)
    Type of Medium: Online Resource
    ISSN: 2076-3417
    Language: English
    Publisher: MDPI AG
    Publication Date: 2018
    ZDB ID: 2704225-X
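The record above (no. 1) describes a masking-based enhancement pipeline: recover per-bin direction cues from the AVS channel ratios, compare them with the known target DOA, and build a nonlinear soft mask. Below is a minimal Python sketch of that idea, assuming an idealized 2-D AVS model and a Gaussian mask shape; it is not the published AVS-SMASK algorithm, and the names (soft_mask, width) are illustrative.

# Hedged sketch: DOA-driven soft time-frequency masking with an acoustic
# vector sensor (AVS). The ideal 2-D sensor model, the ratio used to recover
# the apparent DOA, and the Gaussian mask are illustrative assumptions.
import numpy as np

def soft_mask(vx, vy, p, target_doa_rad, width=0.2):
    """Keep TF bins whose apparent DOA (recovered from the velocity/pressure
    channel ratios) is close to the known target DOA."""
    eps = 1e-8
    apparent = np.arctan2(np.real(vy * np.conj(p)),
                          np.real(vx * np.conj(p)) + eps)
    diff = np.angle(np.exp(1j * (apparent - target_doa_rad)))  # wrap to [-pi, pi]
    return np.exp(-(diff / width) ** 2)  # nonlinear (Gaussian) soft mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f, t = 257, 100  # toy STFT grid
    s = rng.normal(size=(f, t)) + 1j * rng.normal(size=(f, t))  # target speech
    n = rng.normal(size=(f, t)) + 1j * rng.normal(size=(f, t))  # interference
    doa_s, doa_n = np.deg2rad(30.0), np.deg2rad(100.0)
    p = s + n                                    # pressure channel
    vx = s * np.cos(doa_s) + n * np.cos(doa_n)   # velocity channel (x)
    vy = s * np.sin(doa_s) + n * np.sin(doa_n)   # velocity channel (y)
    mask = soft_mask(vx, vy, p, doa_s)
    enhanced = mask * p
    print("mean mask value:", float(mask.mean()))

In this toy mixture, bins dominated by the 30-degree source receive mask values near one, while bins dominated by the 100-degree interferer are attenuated.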
  • 2
    Online Resource
    Elsevier BV ; 2015
    In: Pattern Recognition, Elsevier BV, Vol. 48, No. 10 (2015-10), p. 3076-3092
    Type of Medium: Online Resource
    ISSN: 0031-3203
    Language: English
    Publisher: Elsevier BV
    Publication Date: 2015
    ZDB ID: 1466343-0
  • 3
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2021
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 35, No. 14 (2021-05-18), p. 13098-13106
    Abstract: While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension, which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, many MC tasks accept audio input in addition to the textual input, e.g., an English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks, including Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension, and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy. Second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models, which are trained and tested with only audio or textual data, by up to 18.87%. (An illustrative sketch of inter-/intra-modality attention fusion follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2021
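Record 3 above describes fusing audio and text through dynamic inter- and intra-modality attention. The sketch below illustrates that general pattern with off-the-shelf multi-head attention; the layer sizes, the mean-pooling, and the concatenation-based fusion are my assumptions, not the published DIIA design.

# Hedged sketch: intra-modality self-attention plus inter-modality
# cross-attention, fused into one joint representation.
import torch
import torch.nn as nn

class InterIntraFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, text, audio):
        # Intra-modality attention: each modality attends to itself.
        t, _ = self.intra_text(text, text, text)
        a, _ = self.intra_audio(audio, audio, audio)
        # Inter-modality attention: text queries audio and vice versa.
        t_cross, _ = self.audio_to_text(t, a, a)
        a_cross, _ = self.text_to_audio(a, t, t)
        # Fuse by pooling each stream and projecting to a joint representation.
        fused = torch.cat([t_cross.mean(dim=1), a_cross.mean(dim=1)], dim=-1)
        return self.out(fused)

if __name__ == "__main__":
    model = InterIntraFusion()
    text = torch.randn(2, 20, 256)   # (batch, text tokens, dim)
    audio = torch.randn(2, 50, 256)  # (batch, audio frames, dim)
    print(model(text, audio).shape)  # torch.Size([2, 256])

Because the fusion module only consumes per-modality token sequences, a component like this can in principle be bolted onto an existing reading-comprehension encoder, which is the integration property the abstract emphasizes.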
  • 4
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2021
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 35, No. 4 (2021-05-18), p. 3119-3127
    Abstract: It is encouraging to see that progress has been made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding based model with a coarse-to-fine captioning procedure to alleviate these defects. In our implementation, we employ a bi-directional self-attention based network as our language model to achieve inference speedup, based on which we decompose the captioning procedure into two stages, where the model has different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism for generating visual words that not only promotes the training of scene-related words but also captures relevant details from videos to construct a coarse-grained sentence "template". Thereafter, we devise dedicated decoding algorithms that fill in the "template" with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency. (An illustrative sketch of iterative-refinement decoding follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2021
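Record 4 above decodes captions non-autoregressively and then repairs them by iterative refinement. The sketch below shows a generic mask-predict style refinement loop: all positions are predicted in parallel, then the least confident positions are re-masked and re-predicted. The toy model, the hypothetical MASK_ID, and the linear re-masking schedule are assumptions; this is not the paper's dedicated decoding algorithm and omits the visual-word "template" stage.

# Hedged sketch of non-autoregressive decoding with iterative refinement.
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def refine_decode(model, length, vocab_size, steps=4, device="cpu"):
    """model(tokens) -> logits of shape (length, vocab_size); it predicts all
    positions in parallel given the current (partially masked) sequence."""
    tokens = torch.full((length,), MASK_ID, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)
        tokens = preds
        # Re-mask the least confident positions; keep fewer masks each step.
        n_mask = int(length * (1 - (step + 1) / steps))
        if n_mask > 0:
            worst = probs.argsort()[:n_mask]
            tokens[worst] = MASK_ID
    return tokens

if __name__ == "__main__":
    vocab, length = 100, 12
    toy = torch.nn.Linear(length, length * vocab)  # stand-in "language model"
    def model(tokens):
        return toy(tokens.float()).view(length, vocab)
    print(refine_decode(model, length, vocab))

Because every refinement step predicts all positions at once, the number of model calls is fixed by the step count rather than by the caption length, which is the source of the inference speedup the abstract claims.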
  • 5
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2019
    In: IEEE Access, Institute of Electrical and Electronics Engineers (IEEE), Vol. 7 (2019), p. 62805-62816
    Type of Medium: Online Resource
    ISSN: 2169-3536
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2019
    ZDB ID: 2687964-5
  • 6
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2019
    In: IEEE Transactions on Instrumentation and Measurement, Institute of Electrical and Electronics Engineers (IEEE), Vol. 68, No. 1 (2019-1), p. 73-86
    Type of Medium: Online Resource
    ISSN: 0018-9456, 1557-9662
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2019
    ZDB ID: 160442-9
    ZDB ID: 2027532-8
  • 7
    Online Resource
    Association for Computing Machinery (ACM) ; 2022
    In: ACM Transactions on Knowledge Discovery from Data, Association for Computing Machinery (ACM), Vol. 16, No. 1 (2022-02-28), p. 1-19
    Abstract: Vision-and-language (V-L) tasks require the system to understand both vision content and natural language; thus, learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces for vision and language, so that the representations of the two modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image–sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts. (An illustrative sketch of disentangled per-modality attention follows this record.)
    Type of Medium: Online Resource
    ISSN: 1556-4681, 1556-472X
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2022
    ZDB ID: 2257358-6
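Record 7 above argues for separate attention spaces per modality instead of one shared set of attention matrices. The sketch below illustrates that idea with two independent multi-head attention modules, each letting its own modality query the full multimodal sequence; the dimensions and overall layout are illustrative assumptions, not the DiMBERT architecture.

# Hedged sketch: "disentangled" multimodal attention, i.e. per-modality
# attention parameters over a shared (concatenated) token sequence.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Separate attention spaces per modality (vs. one shared set).
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang_tokens, vis_tokens):
        full = torch.cat([lang_tokens, vis_tokens], dim=1)
        # Each modality queries the full sequence, but through its own
        # attention matrices, so the two latent spaces stay separated.
        lang_out, _ = self.lang_attn(lang_tokens, full, full)
        vis_out, _ = self.vis_attn(vis_tokens, full, full)
        return lang_out, vis_out

if __name__ == "__main__":
    layer = DisentangledAttention()
    lang = torch.randn(2, 16, 256)  # e.g. sentence tokens
    vis = torch.randn(2, 36, 256)   # e.g. region features / visual concepts
    l_out, v_out = layer(lang, vis)
    print(l_out.shape, v_out.shape)

Keeping the query/key/value projections separate per modality is what lets the two output streams live in distinct latent spaces while still attending across modalities.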
  • 8
    Online Resource
    Association for the Advancement of Artificial Intelligence (AAAI) ; 2020
    In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), Vol. 34, No. 07 (2020-04-03), p. 11572-11579
    Abstract: Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both academia and industry. However, given the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems and are thus much more powerful than the original representations alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated in three federated learning settings: horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate consistent improvements on all metrics over the baselines. In image captioning, we obtain 14% and 13% relative gains on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%. (An illustrative sketch of fusing task-specific image representations follows this record.)
    Type of Medium: Online Resource
    ISSN: 2374-3468, 2159-5399
    Language: Unknown
    Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
    Publication Date: 2020
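Record 8 above fuses task-specific image representations (e.g. from a captioning model and a VQA model) into one fine-grained representation via aligning, integrating and mapping steps. The sketch below is a loose, hedged reading of those three steps: project each representation into a joint space, weight and sum them, then project once more. The linear projections and softmax weighting are my choices and do not reproduce the published aimNet or its federated training setup.

# Hedged sketch: fusing task-specific image representations.
import torch
import torch.nn as nn

class FuseRepresentations(nn.Module):
    def __init__(self, dims=(2048, 1024), joint_dim=512):
        super().__init__()
        # Align: project each task-specific representation to a joint space.
        self.align = nn.ModuleList(nn.Linear(d, joint_dim) for d in dims)
        # Integrate: learn a scalar weight per aligned representation.
        self.score = nn.Linear(joint_dim, 1)
        # Map: final projection of the integrated representation.
        self.map = nn.Linear(joint_dim, joint_dim)

    def forward(self, feats):
        aligned = torch.stack(
            [proj(f) for proj, f in zip(self.align, feats)], dim=1)
        weights = self.score(aligned).softmax(dim=1)   # (batch, n_tasks, 1)
        integrated = (weights * aligned).sum(dim=1)    # weighted sum
        return self.map(integrated)

if __name__ == "__main__":
    captioning_feat = torch.randn(4, 2048)  # e.g. from a captioning encoder
    vqa_feat = torch.randn(4, 1024)         # e.g. from a VQA encoder
    fused = FuseRepresentations()([captioning_feat, vqa_feat])
    print(fused.shape)  # torch.Size([4, 512])

In a federated setting, the per-task encoders would stay with their owners and only the resulting representations (or the fusion module) would be shared, which matches the motivation given in the abstract.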
  • 9
    Online Resource
    Elsevier BV ; 2009
    In: Pattern Recognition Letters, Elsevier BV, Vol. 30, No. 9 (2009-7), p. 827-837
    Type of Medium: Online Resource
    ISSN: 0167-8655
    Language: English
    Publisher: Elsevier BV
    Publication Date: 2009
    ZDB ID: 1466342-9
  • 10
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE) ; 2022
    In: IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), Vol. 44, No. 12 (2022-12-1), p. 9255-9268
    Type of Medium: Online Resource
    ISSN: 0162-8828, 2160-9292, 1939-3539
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2022
    ZDB ID: 2027336-8