Data Quality Assessment and Preprocessing Techniques for Enhancing Machine Learning Model Performance.
Authors: Olakunle Ebenezer Aribisala
DOI: https://doi.org/10.37082/IJIRMPS.v12.i3.232799
Short DOI: https://doi.org/g993mj
Country: Nigeria
Abstract:
Model selection matters for machine learning performance, but that performance depends strongly on the quality of the data used for training and testing. Poorly managed data quality problems, such as missing data, noise, imbalance, redundancy, and variability, can lead to inaccurate predictions, reduced generalizability, and biased learning. As applications of machine learning continue to grow in diverse domains such as healthcare, manufacturing, climate modeling, finance, and natural resources management, the need for systematic data quality evaluation and robust preprocessing strategies is rising. This article offers an in-depth analysis of the major dimensions of data quality (accuracy, completeness, consistency, validity, timeliness, and integrity) and assesses how these dimensions affect model performance. Furthermore, the paper covers major data preprocessing techniques, including data cleaning, data normalization, data transformation, feature selection, dimensionality reduction, outlier detection, handling imbalanced data, and data augmentation.
In addition, the article addresses automated and semi-automated frameworks developed to support data quality evaluation, and discusses recent advances that address data challenges in specific domains. The review also highlights the need to align preprocessing choices with model characteristics, data structure, and application. Experimental analyses and comparative evaluations illustrate how suitable preprocessing pipelines can improve machine learning results through increased model robustness, effectiveness, and credibility.
The results indicate that optimized preprocessing strategies, grounded in systematic evaluation of data quality, are an important part of optimizing machine learning model performance. The article closes by identifying existing research gaps, including the lack of standardised data quality indicators, the need for more sophisticated automation tools, and scalable preprocessing for large and complex datasets. Recommendations for future research directions and for sound systems for practical implementation are offered to support the development of high-quality, reliable machine learning systems.
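As a rough illustration of the kind of workflow the abstract describes (assess completeness, clean, detect outliers, then normalize), the following minimal sketch uses only the Python standard library. It is not the paper's method: the function names, the use of `None` as the missing-value marker, median imputation, the 1.5 x IQR outlier rule, and the sample `ages` column are all assumptions chosen for illustration.

```python
from statistics import mean, median, pstdev, quantiles


def completeness(column):
    """Fraction of entries that are present (None marks a missing value)."""
    return sum(v is not None for v in column) / len(column)


def impute_median(column):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def iqr_outliers(column, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(column, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(column) if v < low or v > high]


def standardize(column):
    """Z-score scaling; a constant column maps to all zeros."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


# Hypothetical column with two missing entries and one implausible value.
ages = [23, 25, None, 31, 29, 120, 27, None]

score = completeness(ages)            # 0.75: six of eight values present
filled = impute_median(ages)          # missing entries become 28.0
flagged = iqr_outliers(filled)        # [5]: the value 120
kept = [v for i, v in enumerate(filled) if i not in flagged]
scaled = standardize(kept)            # zero mean, unit variance
```

The order of steps is itself a design choice: here outliers are removed before scaling so that a single extreme value does not distort the mean and standard deviation. Whether to drop, cap, or keep flagged values depends on the model and the application, which is the alignment question the abstract raises.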
Keywords: Data quality; Machine learning; Preprocessing techniques; Feature engineering; Data cleaning; Dimensionality reduction; Imbalanced data handling; Data augmentation; Model performance optimization; Data governance
Paper Id: 232799
Published On: 2024-06-14
Published In: Volume 12, Issue 3, May-June 2024
All research papers published in this journal/on this website are openly accessible and licensed under