GLORIA

GEOMAR Library Ocean Research Information Access

  • 1
    Online Resource
    Association for Computing Machinery (ACM) ; 2021
    In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 20, No. 1 (2021-01-31), p. 1-19
    Abstract: Hate speech is a specific type of controversial content that is widely legislated as a crime that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, which leaves millions of social media users vulnerable. In particular, to the best of our knowledge, no study has been conducted on hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we have scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we have employed an iterative approach to develop guidelines and used them to generate the Hate Speech Roman Urdu 2020 corpus. The tweets in this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we have used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including deep learning techniques, for the two levels of classification, achieving an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets and 0.756 for distinguishing between Offensive-Hate speech tweets.
    Type of Medium: Online Resource
    ISSN: 2375-4699 , 2375-4702
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2021
    ZDB ID: 2820615-0
  • 2
    Online Resource
    Association for Computing Machinery (ACM) ; 2020
    In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 19, No. 1 (2020-01-31), p. 1-13
    Abstract: Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite there being over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus contains at least twice as many manually tagged NEs as any of the existing Urdu NER corpora. Second, we have generated six new word embeddings using three different techniques, fastText, Word2vec, and GloVe, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have performed 10-folds of 32 different experiments using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on the analysis of the results, several valuable insights are provided about the effectiveness of deep learning techniques, the impact of word embeddings, and variations of datasets.
    Type of Medium: Online Resource
    ISSN: 2375-4699 , 2375-4702
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2020
    ZDB ID: 2820615-0
  • 3
    Online Resource
    Association for Computing Machinery (ACM) ; 2022
    In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 21, No. 3 (2022-05-31), p. 1-23
    Abstract: Authorship attribution refers to examining the writing style of authors to determine the likelihood of the original author of a document from a given set of potential authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions: First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute for real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) the Convolutional Neural Network is the most effective technique, as it achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.
    Type of Medium: Online Resource
    ISSN: 2375-4699 , 2375-4702
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2022
    ZDB ID: 2820615-0
  • 4
    Online Resource
    Association for Computing Machinery (ACM) ; 2023
    In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 22, No. 2 (2023-02-28), p. 1-28
    Abstract: Emotion detection is a widely studied topic in natural language processing due to its significance in a number of application areas. A plethora of studies have been conducted on emotion detection in European as well as Asian languages. However, a large majority of these studies have been conducted in monolingual settings, whereas little attention has been paid to emotion detection in code-mixed text. Specifically, only one study has been conducted on emotion detection in Roman Urdu (RU) and English (EN) code-mixed text, despite the fact that such text is widely used on social media platforms. A careful examination of the existing study has revealed several issues that show this area requires researchers' attention. For instance, more than 37% of the messages in the contemporary corpus are monolingual sentences, indicating that a purely code-mixed emotion analysis corpus is non-existent. To that end, this study has scraped 400,000 sentences from three social media platforms to identify 20,000 RU-EN code-mixed sentences. Subsequently, an iterative approach is employed to develop emotion detection guidelines. These guidelines have been used to develop a large RU-EN emotion detection (RU-EN-Emotion) corpus in which 20,000 sentences are annotated as Neutral or Emotion-sentence. The sentences containing emotions are further annotated with the respective emotions. Subsequently, 102 experiments are performed to evaluate the effectiveness of six classical machine learning techniques and six deep learning techniques. The results show that (a) CNN is the most effective technique when used with GloVe embeddings, and (b) our developed RU-EN-Emotion corpus is more useful than the contemporary corpus, as it employs a two-level classification approach.
    Type of Medium: Online Resource
    ISSN: 2375-4699 , 2375-4702
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2023
    ZDB ID: 2820615-0
  • 5
    Online Resource
    Association for Computing Machinery (ACM) ; 2020
    In: ACM Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery (ACM), Vol. 19, No. 4 (2020-07-31), p. 1-13
    Abstract: Named entity recognition (NER) refers to the identification of proper nouns in natural language text and their classification into named entity types, such as person, location, and organization. Due to the widespread applications of NER, numerous NER techniques and benchmark datasets have been developed for both Western and Asian languages. Even though the Shahmukhi script of the Punjabi language is used by nearly three-fourths of Punjabi speakers worldwide, Gurmukhi has been the main focus of research activities. Specifically, a benchmark NER corpus for Shahmukhi is non-existent, which has thwarted the commencement of NER research for the Shahmukhi script. To this end, this article presents the development and specifications of the first-ever NER corpus for Shahmukhi. The newly developed corpus is composed of 318,275 tokens and 16,300 named entities, including 11,147 persons, 3,140 locations, and 2,013 organizations. To establish the strength of our corpus, we have compared its specifications with those of its Gurmukhi counterparts. Furthermore, we have demonstrated the usability of our corpus using five supervised learning techniques, including two state-of-the-art deep learning techniques. The results are compared, and valuable insights about the behavior of the most effective technique are discussed.
    Type of Medium: Online Resource
    ISSN: 2375-4699 , 2375-4702
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2020
    ZDB ID: 2820615-0