Advanced Machine Learning Strategies for Effective Rare Event Classification: A Comparative Study

Olcay Alpay

doi:10.17776/csj.1605507

Research Article

Year 2025, Volume: 46 Issue: 4, 990 - 1002, 30.12.2025

Abstract

As data science and machine learning continue to evolve, binary event classification has become increasingly important. Logistic Regression (LR) is a standard baseline, yet it can underestimate event probabilities in rare-event settings. This study combines a 16-scenario simulation (rarity 5% and 10%, n ∈ {1000, 5000}, p ∈ {3, 5, 7, 10}, 100 repeats per scenario) with a real-world application to assess Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (GB) as alternatives to LR. Training data were balanced using the SMOTETomek hybrid method. In simulation, LR attained the highest balanced performance (G_mean) across cases, with GB the closest competitor; SVM lagged, and RF yielded the lowest G_mean despite often leading in test accuracy and precision. On a wine dataset adjusted to 5% and 10% rarity, RF/GB achieved the top test accuracy and recall (e.g., ACC=0.998/0.997 with REC=0.958/0.989 at 5% and ACC=0.998/0.994 with REC=0.989/0.977 at 10%), mirroring their strong aggregate accuracy in simulation. Overall, the “best” model depends on the target metric: LR/GB when balanced minority–majority performance is critical, and RF when overall accuracy/precision is prioritized.
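The dependence of the ranking on the target metric can be illustrated with a minimal sketch. It assumes the common definition of G_mean as the geometric mean of recall (sensitivity) and specificity; the abstract does not state the formula, so this definition is an assumption. The function name `rare_event_metrics` and the confusion-matrix counts are hypothetical, chosen only to mimic a 1000-case test set at 5% rarity, and are not results from the study.

```python
from math import sqrt


def rare_event_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, specificity and G-mean
    from the four confusion-matrix counts (the rare event is the positive class)."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity on the rare class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # accuracy on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        # Assumed definition: geometric mean of recall and specificity.
        "g_mean": sqrt(recall * specificity),
    }


# Hypothetical counts: 50 rare events and 950 non-events in the test set.
# "conservative" rarely predicts the event; "balanced" trades false alarms for recall.
conservative = rare_event_metrics(tp=20, fn=30, fp=2, tn=948)
balanced = rare_event_metrics(tp=45, fn=5, fp=95, tn=855)

for name, metrics in (("conservative", conservative), ("balanced", balanced)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

With these illustrative counts, the conservative classifier has the higher accuracy and precision, while the balanced classifier has the higher G_mean. This is the same kind of trade-off the abstract reports between RF and LR/GB.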

Keywords

Machine learning, Rare event, Performance metrics, SMOTETomek

There are 53 citations in total.

Details

Primary Language	English
Subjects	Statistical Data Science
Journal Section	Research Article
Authors	Olcay Alpay 0000-0003-1446-0801
Submission Date	December 22, 2024
Acceptance Date	December 24, 2025
Publication Date	December 30, 2025
Published in Issue	Year 2025 Volume: 46 Issue: 4

Cite

APA	Alpay, O. (2025). Advanced Machine Learning Strategies for Effective Rare Event Classification: A Comparative Study. Cumhuriyet Science Journal, 46(4), 990-1002. https://doi.org/10.17776/csj.1605507
