GLORIA

GEOMAR Library Ocean Research Information Access

    Cambridge University Press (CUP) ; 2019
    In: Natural Language Engineering, Cambridge University Press (CUP), Vol. 25, No. 2 (2019-03), p. 239-255
    Abstract: Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be treated as a part-of-speech (POS) tagging problem: a corpus is segmented by assigning each character a label that indicates its position in a word (e.g., “B” for the beginning of a word and “E” for its end). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random fields (CRF) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which the availability of high-quality corpora is limited. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND); in the proposed framework, each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
    Type of Medium: Online Resource
    ISSN: 1351-3249 , 1469-8110
    Language: English
    Publisher: Cambridge University Press (CUP)
    Publication Date: 2019
    ZDB ID: 1481165-0
    SSG: 7,11
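
The abstract describes segmentation as labeling each character with its position in a word. The following is a minimal sketch of how such labels might be turned back into words. It is not the authors' EL or DICND method. The abstract names only the “B” and “E” labels; the “M” (middle) and “S” (single-character word) labels and the function name `tags_to_words` are assumptions taken from the commonly used BMES scheme.

```python
def tags_to_words(chars, tags):
    """Group characters into words from per-character position tags.

    Assumed tag set (only "B" and "E" appear in the abstract):
      B = beginning of a word, M = middle, E = end, S = single-character word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        # A new word starts here, so flush any unfinished word first
        # (this only happens when the tag sequence is malformed).
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # The word ends on this character.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Trailing characters without a closing "E" are kept as one word.
        words.append(current)
    return words


if __name__ == "__main__":
    print(tags_to_words("我爱北京", ["S", "S", "B", "E"]))
```

A tagger such as a CRF or a bi-directional LSTM would supply the `tags` sequence; this sketch covers only the final decoding step.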