Advanced Machine Learning Strategies for Effective Rare Event Classification: A Comparative Study
Abstract
Binary event classification is a core task in data science and machine learning. Logistic Regression (LR) is the standard baseline, yet it can underestimate event probabilities in rare-event settings. This study combines a 16-scenario simulation (event rarity of 5% and 10%, n ∈ {1000, 5000}, p ∈ {3, 5, 7, 10}, 100 repeats per scenario) with a real-world application to assess Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (GB) as alternatives to LR. Training data were balanced using the SMOTETomek hybrid resampling method. In the simulation, LR attained the highest balanced performance (G_mean) across scenarios, with GB the closest competitor; SVM lagged behind, and RF yielded the lowest G_mean even though it often led in test accuracy and precision. On a wine dataset adjusted to 5% and 10% rarity, RF and GB achieved the top test accuracy and recall (RF/GB: ACC = 0.998/0.997 with REC = 0.958/0.989 at 5%, and ACC = 0.998/0.994 with REC = 0.989/0.977 at 10%), mirroring their strong aggregate accuracy. Overall, the "best" model depends on the target metric: LR or GB when balanced minority–majority performance is critical, and RF when overall accuracy or precision is the priority.
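The contrast the abstract draws between accuracy and G_mean can be illustrated with a small sketch. It assumes the usual definition of G_mean as the square root of sensitivity times specificity; the abstract does not spell out the formula, so this is an assumption. The toy labels and both hypothetical classifiers are invented for illustration and are not the study's data or results.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 marks the rare event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    """Share of all cases classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy data: 100 cases, 5 of them rare events (5% rarity).
y_true = [1] * 5 + [0] * 95

# Hypothetical classifier A: detects 1 of 5 events, no false alarms.
pred_a = [1] + [0] * 4 + [0] * 95

# Hypothetical classifier B: detects 4 of 5 events, 10 false alarms.
pred_b = [1] * 4 + [0] + [1] * 10 + [0] * 85

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, round(accuracy(y_true, pred), 3), round(g_mean(y_true, pred), 3))
```

In this toy case classifier A has the higher accuracy while classifier B has the higher G_mean, which is the kind of disagreement between metrics that makes the choice of "best" model depend on the target metric.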
Details
Primary Language
English
Subjects
Statistical Data Science
Journal Section
Research Article
Authors
Olcay Alpay
*
0000-0003-1446-0801
Türkiye
Publication Date
December 30, 2025
Submission Date
December 22, 2024
Acceptance Date
December 24, 2025
Published in Issue
Year 2025 Volume: 46 Number: 4