Automated sentence boundary detection in Modern Standard Arabic transcripts using deep neural networks

Carlos-Emiliano González-Gallardo, Elvys Linhares Pontes, Fatiha Sadat and Juan-Manuel Torres-Moreno

Article (2018)

Open Access document in PolyPublie and at the official publisher
Open Access to the full text of this document
Published Version
Terms of Use: Creative Commons Attribution Non-commercial No Derivatives

Abstract

The increased volume of Arabic data sources available on the Web has boosted the development of Natural Language Processing (NLP) tools for different tasks and applications. However, to take advantage of many of these applications, a prior segmentation task called Sentence Boundary Detection (SBD) is needed. In this paper we focus on SBD over Modern Standard Arabic (MSA) by comparing two approaches based on Deep Neural Networks (DNN): a Convolutional Neural Network and a Recurrent Neural Network with an attention mechanism. Both are trained on out-of-domain and in-domain data using only lexical features, represented as character embeddings. Tuning a model trained on a large out-of-domain dataset with a smaller in-domain dataset improves performance in general. Our evaluations were based on IWSLT 2017 TED talk transcripts and showed similarities and differences depending on the SBD method. MSA carries certain complications given its rich and complex morphology. However, using only lexical features for Arabic SBD is an acceptable option when the source audio signal is not available and a certain level of language independence needs to be reached.
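As an illustration of what "only lexical features" at the character level can mean in practice, the sketch below frames SBD as a binary decision at every gap between two words of an unpunctuated transcript. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `boundary_examples`, the `window` parameter, and the `|` gap marker are all hypothetical, and the abstract does not describe how the paper actually prepares its inputs.

```python
from typing import List, Tuple


def boundary_examples(sentences: List[str], window: int = 5) -> List[Tuple[str, int]]:
    """Turn segmented sentences into (character window, label) pairs.

    Hypothetical helper for illustration only. Each gap between two words
    becomes one example: label 1 if the gap is a sentence boundary,
    label 0 if it is an ordinary word gap.
    """
    # Rebuild the unpunctuated word stream and remember where sentences end.
    words: List[str] = []
    boundary_after = set()  # indices of words that close a sentence
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue
        words.extend(tokens)
        boundary_after.add(len(words) - 1)

    text = " ".join(words)
    examples: List[Tuple[str, int]] = []
    pos = 0  # character offset of the current word inside `text`
    for i, word in enumerate(words[:-1]):  # there is no gap after the last word
        gap = pos + len(word)  # index of the space that follows this word
        left = text[max(0, gap - window):gap]
        right = text[gap + 1:gap + 1 + window]
        # "|" marks the candidate boundary position (an assumed convention).
        examples.append((left + "|" + right, int(i in boundary_after)))
        pos = gap + 1
    return examples


# Two short MSA sentences with punctuation already stripped.
pairs = boundary_examples(["مرحبا بكم", "شكرا لكم جميعا"])
labels = [label for _, label in pairs]  # labels → [0, 1, 0, 0]
```

Each character window could then be mapped to character embeddings and fed to a classifier such as a CNN or an RNN; that modelling step is outside the scope of this sketch.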

Uncontrolled Keywords

Sentence Boundary Detection; Speech-to-Text; Transcription; Modern Standard Arabic; Deep Neural Networks

Subjects: 2800 Artificial intelligence > 2801 Natural language and speech understanding
Department: Department of Computer Engineering and Software Engineering
Funders: Access Multilingual Information opinionS (AMIS) project.
Grant number: CHISTERA-AMIS ANR-15-CHR2-0001
PolyPublie URL: https://publications.polymtl.ca/4899/
Journal Title: Procedia Computer Science (vol. 142)
Publisher: Elsevier
DOI: 10.1016/j.procs.2018.10.485
Official URL: https://doi.org/10.1016/j.procs.2018.10.485
Date Deposited: 19 Dec 2022 14:08
Last Modified: 28 Sep 2024 18:28
Cite in APA 7: González-Gallardo, C.-E., Pontes, E. L., Sadat, F., & Torres-Moreno, J.-M. (2018). Automated sentence boundary detection in Modern Standard Arabic transcripts using deep neural networks. Procedia Computer Science, 142, 339-346. https://doi.org/10.1016/j.procs.2018.10.485
