GLORIA — GEOMAR Library Ocean Research Information Access

1

Online Resource

Response Timing Detection Using Prosodic and Linguistic Information for Human-friendly Spoken Dialog Systems

Kitaoka, Norihide ; Takeuchi, Masashi ; Nishimura, Ryota ; [et al.]

Japanese Society for Artificial Intelligence ; 2005

In: Transactions of the Japanese Society for Artificial Intelligence Vol. 20 ( 2005), p. 220-228

add to mindlist on the mindlist

Details

In: Transactions of the Japanese Society for Artificial Intelligence, Japanese Society for Artificial Intelligence, Vol. 20 ( 2005), p. 220-228

Type of Medium: Online Resource

ISSN: 1346-0714 , 1346-8030

URL: Article

DOI: 10.1527/tjsai.20.220

Language: English

Publisher: Japanese Society for Artificial Intelligence

Publication Date: 2005

detail.hit.zdb_id: 2045823-X

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

2

Online Resource

音声対話システムのための自由発話に対応した照応解析による入力発話への話題補完手法

Nishimura, Ryota ; Mori, Raita ; Ohta, Kengo ; [et al.]

Japanese Society for Artificial Intelligence ; 2022

In: Transactions of the Japanese Society for Artificial Intelligence Vol. 37, No. 3 ( 2022-5-1), p. IDS-F_1-13

add to mindlist on the mindlist

Details

In: Transactions of the Japanese Society for Artificial Intelligence, Japanese Society for Artificial Intelligence, Vol. 37, No. 3 ( 2022-5-1), p. IDS-F_1-13

Type of Medium: Online Resource

ISSN: 1346-0714 , 1346-8030

URL: Article

DOI: 10.1527/tjsai.37-3_IDS-F

Language: English

Publisher: Japanese Society for Artificial Intelligence

Publication Date: 2022

detail.hit.zdb_id: 2045823-X

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

3

Online Resource

End-to-end recognition of streaming Japanese speech using CTC and local attention

Chen, Jiahao ; Nishimura, Ryota ; Kitaoka, Norihide

Now Publishers ; 2020

In: APSIPA Transactions on Signal and Information Processing Vol. 9, No. 1 ( 2020)

add to mindlist on the mindlist

Details

In: APSIPA Transactions on Signal and Information Processing, Now Publishers, Vol. 9, No. 1 ( 2020)

Type of Medium: Online Resource

ISSN: 2048-7703 , 2048-7703

URL: Article

DOI: 10.1017/ATSIP.2020.23

Language: English

Publisher: Now Publishers

Publication Date: 2020

detail.hit.zdb_id: 2689862-7

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

4

Online Resource

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Byambadorj, Zolzaya ; Nishimura, Ryota ; Ayush, Altangerel ; [et al.]

Springer Science and Business Media LLC ; 2021

In: EURASIP Journal on Audio, Speech, and Music Processing Vol. 2021, No. 1 ( 2021-12)

add to mindlist on the mindlist

Details

In: EURASIP Journal on Audio, Speech, and Music Processing, Springer Science and Business Media LLC, Vol. 2021, No. 1 ( 2021-12)

Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.

Type of Medium: Online Resource

ISSN: 1687-4722

URL: Article

DOI: 10.1186/s13636-021-00225-4

Language: English

Publisher: Springer Science and Business Media LLC

Publication Date: 2021

detail.hit.zdb_id: 2252877-5

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

5

Online Resource

Recognition of target domain Japanese speech using language model replacement

Mori, Daiki ; Ohta, Kengo ; Nishimura, Ryota ; [et al.]

Springer Science and Business Media LLC ; 2024

In: EURASIP Journal on Audio, Speech, and Music Processing Vol. 2024, No. 1 ( 2024-07-20)

add to mindlist on the mindlist

Details

In: EURASIP Journal on Audio, Speech, and Music Processing, Springer Science and Business Media LLC, Vol. 2024, No. 1 ( 2024-07-20)

Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for specialized target domain which are able to achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model by subtracting the source-domain language model trained with a transcription of the ASR’s training data in a logarithmic domain. We then integrate a target domain language model through addition in the logarithmic domain. This subtraction and addition to replace of the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then we evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domain, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.

Type of Medium: Online Resource

ISSN: 1687-4722

URL: Article

DOI: 10.1186/s13636-024-00360-8

Language: English

Publisher: Springer Science and Business Media LLC

Publication Date: 2024

detail.hit.zdb_id: 2252877-5

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

6

Online Resource

Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

Ohta, Kengo ; Nishimura, Ryota ; Kitaoka, Norihide

Elsevier BV ; 2021

In: Speech Communication Vol. 133 ( 2021-10), p. 23-30

add to mindlist on the mindlist

Details

In: Speech Communication, Elsevier BV, Vol. 133 ( 2021-10), p. 23-30

Type of Medium: Online Resource

ISSN: 0167-6393

URL: Article

DOI: 10.1016/j.specom.2021.07.003

Language: English

Publisher: Elsevier BV

Publication Date: 2021

detail.hit.zdb_id: 1460279-9

SSG: 7,11

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

7

Online Resource

Recognizing emotions from speech using a physical model

Kitaoka, Norihide ; Segawa, Shuhei ; Nishimura, Ryota ; [et al.]

Acoustical Society of Japan ; 2018

In: Acoustical Science and Technology Vol. 39, No. 2 ( 2018), p. 167-170

add to mindlist on the mindlist

Details

In: Acoustical Science and Technology, Acoustical Society of Japan, Vol. 39, No. 2 ( 2018), p. 167-170

Type of Medium: Online Resource

ISSN: 1346-3969 , 1347-5177

URL: Article

DOI: 10.1250/ast.39.167

Language: English

Publisher: Acoustical Society of Japan

Publication Date: 2018

detail.hit.zdb_id: 2043164-8

detail.hit.zdb_id: 2039148-1

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

8

Online Resource

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

Byambadorj, Zolzaya ; Nishimura, Ryota ; Ayush, Altangerel ; [et al.]

Association for Computing Machinery (ACM) ; 2021

In: ACM Transactions on Asian and Low-Resource Language Information Processing Vol. 20, No. 6 ( 2021-11-30), p. 1-19

add to mindlist on the mindlist

Details

In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 20, No. 6 ( 2021-11-30), p. 1-19

Abstract: The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Due to increasing contact between people from different cultures as a result of globalization, there has also been an increase in the use of the Latin alphabet, and as a result a large amount of transliterated text is being used on social media. In this study, we propose a variety of character level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which there is a limited amount of training data available. We applied performance enhancement methods, which included various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. When using test data during our text normalization experiment, our proposed method which included checking each hypothesis during the inference period achieved the lowest word error rate (WER = 13.41%), which was 4.51% fewer errors than when using the conventional SMT method.

Type of Medium: Online Resource

ISSN: 2375-4699 , 2375-4702

URL: Article

DOI: 10.1145/3464361

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 2021

detail.hit.zdb_id: 2820615-0

detail.hit.zdb_id: 2820619-8

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

9

Online Resource

A new speech corpus of super-elderly Japanese for acoustic modeling

Fukuda, Meiko ; Nishimura, Ryota ; Nishizaki, Hiromitsu ; [et al.]

Elsevier BV ; 2023

In: Computer Speech & Language Vol. 77 ( 2023-01), p. 101424-

add to mindlist on the mindlist

Details

In: Computer Speech & Language, Elsevier BV, Vol. 77 ( 2023-01), p. 101424-

Type of Medium: Online Resource

ISSN: 0885-2308

URL: Article

DOI: 10.1016/j.csl.2022.101424

Language: English

Publisher: Elsevier BV

Publication Date: 2023

detail.hit.zdb_id: 56461-8

SSG: 7,11

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher