Cross Validation Machine Learning Model Predicts More Accurate: A Comparative Study of Heart Disease Using Linear Regression, Support Vector Machine, K Neighbors and Random Forest Models
Abstract
This primary research paper examines cross-validation, in which each iteration uses a differently structured set of test data, with the aim of improving final model accuracy by combining weak learners. In machine learning, data is typically evaluated in one of two ways: a single training/test split of 70% and 30%, or cross-validation, in which training and evaluation are repeated across folds. The study transforms the original datasets and compares cross-validated LR, SVM, KNN, and RF models on heart disease classification datasets. The objective is to identify the average accuracy of model predictions and then recommend a model based on both the cross-validated results, reported as 15% higher, and the non-cross-validated results. Comparing the accuracy scores of the models shows that the logistic regression and k-nearest neighbor models achieved the highest accuracy of 81% among the four models. Similarly, the random forest model attained an F1 score of 95%, which is reported as the highest accuracy score. These findings can be further corroborated using learning curve validation. Conversely, the linear regression model exhibited the lowest accuracy of 84% among the four machine learning models.
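The difference between a single 70/30 split and cross-validation can be illustrated with a minimal sketch. It is not the study's pipeline: the data is synthetic, a small hand-written k-nearest-neighbor classifier stands in for the four models, and all function names (`make_synthetic_data`, `holdout_accuracy`, `k_fold_accuracies`) are hypothetical. It shows only how a holdout score comes from one partition, while a cross-validated score is the average over several partitions in which every sample is tested exactly once.

```python
import random
from statistics import mean


def make_synthetic_data(n=200, seed=0):
    """Hypothetical two-class data: two noisy features per sample (an assumption)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def knn_predict(train, point, k=5):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, test, k=5):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, p, k) == y for p, y in test)
    return correct / len(test)


def holdout_accuracy(data, train_fraction=0.7):
    """Single split: the first 70% trains the model, the remaining 30% tests it."""
    cut = round(len(data) * train_fraction)
    return accuracy(data[:cut], data[cut:])


def k_fold_accuracies(data, n_folds=5):
    """Each fold serves as the test set once; the other folds form the training set."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [s for i, s in enumerate(data) if i % n_folds != fold]
        scores.append(accuracy(train, test))
    return scores


data = make_synthetic_data()
holdout = holdout_accuracy(data)
fold_scores = k_fold_accuracies(data)
print(f"70/30 holdout accuracy: {holdout:.3f}")
print(f"5-fold accuracies: {[round(s, 3) for s in fold_scores]}")
print(f"5-fold mean accuracy: {mean(fold_scores):.3f}")
```

The spread of the per-fold scores gives a rough sense of how much a single split could mislead. In practice, a library routine such as scikit-learn's `cross_val_score` would typically replace the hand-written loop.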