Binary classification of rare events has become increasingly important in data science and machine learning. Logistic Regression (LR) is a standard baseline, yet it can underestimate probabilities in rare-event settings. This study combines a 16-scenario simulation (rarity 5–10%, n ∈ {1000, 5000}, p ∈ {3, 5, 7, 10}, 100 repeats) with a real-world application to assess Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (GB) as alternatives to LR. Training data were balanced using the SMOTETomek hybrid method. In simulation, LR attained the highest balanced performance (G_mean) across cases, with GB the closest competitor; SVM lagged, and RF yielded the lowest G_mean despite often leading in test accuracy and precision. On a wine dataset adjusted to 5% and 10% rarity, RF/GB achieved the top test accuracy and recall (e.g., ACC = 0.998/0.997 with REC = 0.958/0.989 at 5%, and ACC = 0.998/0.994 with REC = 0.989/0.977 at 10%), mirroring their strong aggregate accuracy. Overall, the "best" model depends on the target metric: LR and GB when balanced minority–majority performance is critical, and RF when overall accuracy and precision are prioritized.
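For context, the sketch below illustrates one simulated scenario in the spirit of the design described above: generate rare-event data, balance only the training split with SMOTETomek, and compare LR, SVM, RF, and GB by G_mean. It is a minimal illustration, not the authors' simulation code; the sample size, rarity level, feature count, split ratio, and model settings are assumed values chosen for the example.

```python
# Illustrative sketch only: one rare-event scenario with SMOTETomek balancing
# of the training set and G_mean comparison of LR, SVM, RF, and GB.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.combine import SMOTETomek
from imblearn.metrics import geometric_mean_score

# One assumed scenario: n = 1000 observations, p = 5 predictors, ~5% minority class.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Balance only the training data with the SMOTETomek hybrid resampler.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    y_pred = model.fit(X_bal, y_bal).predict(X_te)
    # G_mean = sqrt(sensitivity * specificity): balanced minority/majority performance.
    print(name, round(geometric_mean_score(y_te, y_pred), 3))
```

In the study this comparison is repeated 100 times per scenario across the 16 combinations of rarity, n, and p; the snippet shows only a single run.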
| Primary Language | English |
|---|---|
| Subjects | Statistical Data Science |
| Journal Section | Research Article |
| Submission Date | December 22, 2024 |
| Acceptance Date | December 24, 2025 |
| Publication Date | December 30, 2025 |
| Published in Issue | Year 2025 Volume: 46 Issue: 4 |