
Benchmarking Framework and Performance Modeling for Evaluating the Performance of Spark-Based Data Science Projects

Soude Ghari

Ph.D. thesis (2023)

Restricted to: Repository staff only until 11 March 2025
Terms of Use: All rights reserved


Machine learning and big data analytics are of great importance to enterprise business operations, aiming to uncover data-driven insights. Considerable effort is invested in developing and deploying machine learning models, which are used to estimate and predict economic factors, and even to interact with clients. However, this planning remains largely dependent on human acumen and is expensive to carry out systematically with automated tools. Benchmarking is the process of efficiently running experiments to determine, among other things, a system's performance requirements, in order to aid planning and resource allocation. Benchmarking of intelligent, data-intensive systems remains in its infancy and does not yet cover fully realistic or specialized case studies. Benchmarking holds significant practical importance in various fields and industries: it provides organizations with a valuable tool to measure their performance and make informed decisions. However, these systems face two significant challenges: volatility and variability. Variability concerns the numerous conditions that must be taken into account when evaluating the performance of analytics systems; volatility denotes that any of these conditions may change at any time. The challenge is to accurately estimate the performance and effectiveness of a data science platform in the presence of such uncertainty and variability. Such an estimate can enable informed planning and potentially dynamic resource allocation for these projects. We present a study of multiple machine learning models, from simple linear regression to more complex ones such as LSTM and MLP, to estimate the performance of data science projects deployed on Apache Spark, a popular and flexible distributed analytics platform.
We demonstrate the process of training such a model, from data collection through training and testing, and we systematically compare the various alternatives to help decision-makers choose the best configuration. A key scientific contribution of this thesis is the observation that machine learning testing tools and benchmarks primarily assess model performance rather than the software performance of the implementation. This highlights the significance of longitudinal workloads: through saturation, they play a pivotal role in resource and time performance over the long term, unlike model accuracy, which may remain relatively stable. In summary, while batch workloads, predominantly employed by most benchmark tools, are appropriate for evaluating model accuracy, they are less suitable for evaluating software performance. Concerning the model itself, a substantial scientific contribution is the incorporation of statistical testing within our pipeline for model comparison, a feature notably absent from many tools such as AutoML.
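The statistical model comparison mentioned above can be sketched with a paired t-test over per-fold prediction errors of two candidate models. This is a minimal, stdlib-only illustration; the error values, the significance threshold, and the function name are hypothetical and not taken from the thesis.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic over matched per-fold errors of two models."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical cross-validation errors (e.g., RMSE of predicted response time)
# for a linear regression model vs. an LSTM model -- illustrative values only.
lr_errors   = [0.42, 0.39, 0.45, 0.41, 0.44]
lstm_errors = [0.31, 0.28, 0.35, 0.30, 0.33]

t = paired_t_statistic(lr_errors, lstm_errors)
# Compare |t| against the two-sided critical value for n - 1 = 4 degrees
# of freedom at alpha = 0.05, which is about 2.776.
significant = abs(t) > 2.776
```

In practice a library routine such as `scipy.stats.ttest_rel` would also report the p-value; the point here is only that a paired test on matched folds controls for fold-to-fold workload variation when ranking models.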


The recent increase in the adoption of data science and big data has significantly changed how we make data-driven decisions. As these technologies have spread, the demand for modeling and evaluating the performance of data science applications has grown. To meet this need, a benchmarking platform was developed to evaluate the performance of these applications. This platform includes a data collection phase in which relevant metrics are gathered; these metrics are essential for understanding and quantifying system performance. The collected data are then used in various classical regression and deep learning models, which are trained to predict two key performance indicators. Throughput indicates the amount of work a system can handle in a given period, while response time refers to how quickly the system can react to and process incoming requests. Throughput and response time predictions are based on various workload characteristics, Spark topology configurations, and infrastructure configurations. By taking these factors into account, the benchmarking platform enables a comprehensive evaluation of the performance of data science applications.
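As a minimal illustration of the classical regression step described above, the following sketch fits a one-feature least-squares model that predicts response time from workload size. The feature, the observation values, and the helper name are hypothetical, not taken from the thesis.

```python
from statistics import mean

def fit_simple_ols(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical benchmark observations: records processed per Spark job (x)
# and measured response time in seconds (y).
workload  = [100, 200, 300, 400, 500]
resp_time = [1.1, 1.9, 3.2, 3.9, 5.1]

intercept, slope = fit_simple_ols(workload, resp_time)
predicted = intercept + slope * 600  # extrapolated response time for a larger job
```

A real pipeline would use many features (Spark topology, infrastructure, workload mix) and compare this baseline against the deep learning models; the closed-form fit above is only the simplest end of that spectrum.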

Department: Department of Computer Engineering and Software Engineering
Program: Génie informatique
Academic/Research Directors: Heng Li and Marios-Eleftherios Fokaefs
PolyPublie URL: https://publications.polymtl.ca/55740/
Institution: Polytechnique Montréal
Date Deposited: 11 Mar 2024 14:09
Last Modified: 13 Apr 2024 06:16
Cite in APA 7: Ghari, S. (2023). Benchmarking Framework and Performance Modeling for Evaluating the Performance of Spark-Based Data Science Projects [Ph.D. thesis, Polytechnique Montréal]. PolyPublie. https://publications.polymtl.ca/55740/

