Automated sentence boundary detection in modern standard arabic transcripts using deep neural networks

Carlos-Emiliano González-Gallardo, Elvys Linhares Pontes, Fatiha Sadat and Juan-Manuel Torres-Moreno

Journal article (2018)

Open access document in PolyPublie and at the official publisher
Open access to the full text of this document
Official publisher's version
Terms of use: Creative Commons: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)

Abstract

The increased volume of Arabic data sources available on the Web has boosted the development of Natural Language Processing (NLP) tools for different tasks and applications. However, to take advantage of many of these applications, a prior segmentation task called Sentence Boundary Detection (SBD) is needed. In this paper we focus on SBD over Modern Standard Arabic (MSA) by comparing two different approaches based on Deep Neural Networks (DNN), using out-of-domain and in-domain training data with only lexical features (represented as character embeddings). We conduct two scenarios, one based on a Convolutional Neural Network architecture and one based on a Recurrent Neural Network architecture with an attention mechanism. Tuning a model trained on a large out-of-domain dataset with a smaller in-domain dataset improves performance in general. Our evaluations were based on IWSLT 2017 TED talk transcripts and showed similarities and differences depending on the SBD method. MSA carries certain complications given its rich and complex morphology. However, using only lexical features for Arabic SBD is an acceptable option when the source audio signal is not available and a certain level of language independence needs to be reached.

Keywords

Sentence Boundary Detection; Speech-to-Text; Transcription; Modern Standard Arabic; Deep Neural Networks

Subject(s): 2800 Artificial intelligence > 2801 Natural language and speech recognition
Department: Department of Computer Engineering and Software Engineering
Funders: Access Multilingual Information opinionS (AMIS) project.
Grant number: CHISTERA-AMIS ANR-15-CHR2-0001
PolyPublie URL: https://publications.polymtl.ca/4899/
Journal title: Procedia Computer Science (vol. 142)
Publisher: Elsevier
DOI: 10.1016/j.procs.2018.10.485
Official URL: https://doi.org/10.1016/j.procs.2018.10.485
Deposit date: 19 Dec. 2022 14:08
Last modified: 28 Sept. 2024 18:28
Cite in APA 7: González-Gallardo, C.-E., Pontes, E. L., Sadat, F., & Torres-Moreno, J.-M. (2018). Automated sentence boundary detection in modern standard arabic transcripts using deep neural networks. Procedia Computer Science, 142, 339-346. https://doi.org/10.1016/j.procs.2018.10.485
