Throughout this research, our data source is a Kaggle dataset (https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset/data) derived from the Behavioral Risk Factor Surveillance System (BRFSS), the USA's premier system of health-related telephone surveys, which collects state-level data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.
Our research has two objectives:
1. Identify Key Risk Factors: Analyze the dataset to identify the most significant risk factors contributing to cardiovascular disease, using feature-importance techniques such as correlation analysis, feature selection, or machine learning models like Random Forest or Logistic Regression.
2. Develop a Predictive Model: Build and evaluate a machine learning model (e.g., Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, or Neural Networks) to predict the likelihood of cardiovascular disease from patient demographic, lifestyle, and clinical features.
Before we proceed to the data preprocessing phase, we need to understand the dataset we are using and the information it contains.
First, we install all required packages and load them into the environment before we start.
install.packages("randomForest")
install.packages("caret")
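# zoo, xts, quantmod, abind and ROCR are installed to support DMwR,
# which has been archived on CRAN, so DMwR itself is installed below from a local source tarball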
install.packages(c("zoo","xts","quantmod"))
install.packages(c("abind", "ROCR"))
install.packages("xgboost")
install.packages("~/Downloads/DMwR_0.4.1.tar", repos = NULL, type = "source")
library(dplyr)
library(tidyverse)
library(ggplot2)
library(DMwR)
library(zoo)
library(xts)
library(quantmod)
library(abind)
library(ROCR)
library(xgboost)
library(randomForest)
library(caret)
To begin the analysis, the Kaggle dataset "CVD.csv", whose source was described previously, is loaded into our R script.
df <- read.csv("~/Downloads/CVD.csv")
head(df)
## General_Health Checkup Exercise Heart_Disease Skin_Cancer
## 1 Poor Within the past 2 years No No No
## 2 Very Good Within the past year No Yes No
## 3 Very Good Within the past year Yes No No
## 4 Poor Within the past year Yes Yes No
## 5 Good Within the past year No No No
## 6 Good Within the past year No No No
## Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_.cm.
## 1 No No No Yes Female 70-74 150
## 2 No No Yes No Female 70-74 165
## 3 No No Yes No Female 60-64 163
## 4 No No Yes No Male 75-79 180
## 5 No No No No Male 80+ 191
## 6 No Yes No Yes Male 60-64 183
## Weight_.kg. BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1 32.66 14.54 Yes 0 30
## 2 77.11 28.29 No 0 30
## 3 88.45 33.47 No 4 12
## 4 93.44 28.73 No 0 30
## 5 88.45 24.37 Yes 0 8
## 6 154.22 46.11 No 0 12
## Green_Vegetables_Consumption FriedPotato_Consumption
## 1 16 12
## 2 0 4
## 3 3 16
## 4 30 8
## 5 4 0
## 6 12 12
Then, we check the number of rows and columns of the dataset.
#Numbers of rows and columns
dim(df)
## [1] 308854 19
cat("Number of rows:", nrow(df), "\n")
## Number of rows: 308854
cat("Number of columns:", ncol(df), "\n")
## Number of columns: 19
The column names are listed to clarify what information is stored in the dataset, and the unique values of each column are inspected to understand its type and range.
#column name
colnames(df)
## [1] "General_Health" "Checkup"
## [3] "Exercise" "Heart_Disease"
## [5] "Skin_Cancer" "Other_Cancer"
## [7] "Depression" "Diabetes"
## [9] "Arthritis" "Sex"
## [11] "Age_Category" "Height_.cm."
## [13] "Weight_.kg." "BMI"
## [15] "Smoking_History" "Alcohol_Consumption"
## [17] "Fruit_Consumption" "Green_Vegetables_Consumption"
## [19] "FriedPotato_Consumption"
#Unique values in columns
sapply(df, unique)
## $General_Health
## [1] "Poor" "Very Good" "Good" "Fair" "Excellent"
##
## $Checkup
## [1] "Within the past 2 years" "Within the past year"
## [3] "5 or more years ago" "Within the past 5 years"
## [5] "Never"
##
## $Exercise
## [1] "No" "Yes"
##
## $Heart_Disease
## [1] "No" "Yes"
##
## $Skin_Cancer
## [1] "No" "Yes"
##
## $Other_Cancer
## [1] "No" "Yes"
##
## $Depression
## [1] "No" "Yes"
##
## $Diabetes
## [1] "No"
## [2] "Yes"
## [3] "No, pre-diabetes or borderline diabetes"
## [4] "Yes, but female told only during pregnancy"
##
## $Arthritis
## [1] "Yes" "No"
##
## $Sex
## [1] "Female" "Male"
##
## $Age_Category
## [1] "70-74" "60-64" "75-79" "80+" "65-69" "50-54" "45-49" "18-24" "30-34"
## [10] "55-59" "35-39" "40-44" "25-29"
##
## $Height_.cm.
## [1] 150 165 163 180 191 183 175 160 168 178 152 157 188 185 170 173 155 193 196
## ... (99 distinct values; output truncated)
##
## $Weight_.kg.
## [1] 32.66 77.11 88.45 93.44 154.22 69.85 108.86 72.57 91.63 74.84
## ... (525 distinct values; output truncated)
##
## $BMI
## [1] 14.54 28.29 33.47 28.73 24.37 46.11 22.74 39.94 27.46 34.67 29.23 23.92
## ... (3654 distinct values; output truncated)
##
## $Smoking_History
## [1] "Yes" "No"
##
## $Alcohol_Consumption
## [1] 0 4 3 8 30 2 12 1 5 10 20 17 16 6 25 28 15 7 9 24 11 29 27 14 21
## [26] 23 18 26 22 13 19
##
## $Fruit_Consumption
## [1] 30 12 8 16 2 1 60 0 7 5 3 6 90 28 20 4 80 24 15
## [20] 10 25 14 120 32 40 17 45 100 9 99 96 35 50 56 48 27 72 36
## [39] 84 26 23 18 21 42 22 11 112 29 64 70 33 76 44 39 75 31 92
## [58] 104 88 65 55 13 38 63 97 108 19 52 98 37 68 34 41 116 54 62
## [77] 85
##
## $Green_Vegetables_Consumption
## [1] 16 0 3 30 4 12 8 20 1 10 5 2 6 60 28 25 14 40 7
## [20] 22 24 15 120 90 19 13 11 80 27 17 56 18 9 21 99 29 31 45
## [39] 23 100 104 32 48 75 36 35 112 26 50 33 96 52 76 84 34 97 88
## [58] 98 68 92 55 95 64 124 61 65 77 85 44 39 70 93 128 37 53
##
## $FriedPotato_Consumption
## [1] 12 4 16 8 0 1 2 30 20 15 10 3 7 28 5 9 6 120 32
## [20] 14 60 33 48 25 24 21 90 13 99 17 18 40 56 34 36 44 100 11
## [39] 64 45 80 29 68 26 50 22 95 23 27 112 35 31 98 96 88 92 19
## [58] 76 49 97 128 41 37 42 52 72 46 124 84
We also summarize the dataset with descriptive statistics.
#Summary statistics
summary(df)
## General_Health Checkup Exercise Heart_Disease
## Length:308854 Length:308854 Length:308854 Length:308854
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Skin_Cancer Other_Cancer Depression Diabetes
## Length:308854 Length:308854 Length:308854 Length:308854
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Arthritis Sex Age_Category Height_.cm.
## Length:308854 Length:308854 Length:308854 Min. : 91.0
## Class :character Class :character Class :character 1st Qu.:163.0
## Mode :character Mode :character Mode :character Median :170.0
## Mean :170.6
## 3rd Qu.:178.0
## Max. :241.0
## Weight_.kg. BMI Smoking_History Alcohol_Consumption
## Min. : 24.95 Min. :12.02 Length:308854 Min. : 0.000
## 1st Qu.: 68.04 1st Qu.:24.21 Class :character 1st Qu.: 0.000
## Median : 81.65 Median :27.44 Mode :character Median : 1.000
## Mean : 83.59 Mean :28.63 Mean : 5.096
## 3rd Qu.: 95.25 3rd Qu.:31.85 3rd Qu.: 6.000
## Max. :293.02 Max. :99.33 Max. :30.000
## Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 12.00 1st Qu.: 4.00 1st Qu.: 2.000
## Median : 30.00 Median : 12.00 Median : 4.000
## Mean : 29.84 Mean : 15.11 Mean : 6.297
## 3rd Qu.: 30.00 3rd Qu.: 20.00 3rd Qu.: 8.000
## Max. :120.00 Max. :128.00 Max. :128.000
For categorical variables, we use bar plots to visualize their frequencies.
#Bar plot
ggplot(df, aes(x = General_Health)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of General Health",
       x = "General Health",
       y = "Count") +
  theme_minimal()
ggplot(df, aes(x = Checkup)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Checkup",
       x = "Checkup",
       y = "Count") +
  theme_minimal()
ggplot(df, aes(x = Age_Category)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age_Category",
       x = "Age_Category",
       y = "Count") +
  theme_minimal()
ggplot(df, aes(x = Heart_Disease)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Heart_Disease",
       x = "Heart_Disease",
       y = "Count") +
  theme_minimal()
For numerical variables, we use histograms and box plots to show their distributions.
# Histogram for BMI
ggplot(df, aes(x = BMI)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count") +
  theme_minimal()
# Boxplot for Height and Weight
ggplot(df, aes(y = Height_.cm.)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Height", y = "Height (cm)") +
  theme_minimal()
ggplot(df, aes(y = Weight_.kg.)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Weight", y = "Weight (kg)") +
  theme_minimal()
To relate predictors to the binary target, we plot grouped bar charts by heart disease status.
# Bar plot
ggplot(df, aes(x = General_Health, fill = Heart_Disease)) +
  geom_bar(position = "dodge") +
  labs(title = "General Health by Heart Disease Status",
       x = "General Health",
       y = "Count") +
  theme_minimal()
ggplot(df, aes(x = Exercise, fill = Heart_Disease)) +
  geom_bar(position = "dodge") +
  labs(title = "Exercise Participation by Heart Disease Status",
       x = "Exercise",
       y = "Count") +
  theme_minimal()
After this initial exploration, we need to clean the dataset so that it can be used to train models.
This section focuses on handling missing values and duplicates in the dataset.
First, we check for rows (observations) that contain missing values.
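The chunk that produced this summary is not shown in the knitted output; a minimal base-R equivalent (our reconstruction, not necessarily the original chunk) is:
# Count missing (NA) values in each column (assumed reconstruction of the hidden chunk)
cat("Missing values summary:\n")
print(colSums(is.na(df)))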
## Missing values summary:
## General_Health Checkup
## 0 0
## Exercise Heart_Disease
## 0 0
## Skin_Cancer Other_Cancer
## 0 0
## Depression Diabetes
## 0 0
## Arthritis Sex
## 0 0
## Age_Category Height_.cm.
## 0 0
## Weight_.kg. BMI
## 0 0
## Smoking_History Alcohol_Consumption
## 0 0
## Fruit_Consumption Green_Vegetables_Consumption
## 0 0
## FriedPotato_Consumption
## 0
Since there are no null values in the dataset, we do not need to drop any rows.
Then, we check for rows (observations) that are not distinct, i.e., duplicates of other observations.
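As above, the chunk is not shown; a minimal reconstruction (assumed, base R) is:
# Count fully duplicated rows and preview the first few (assumed reconstruction)
cat("Number of duplicate rows:", sum(duplicated(df)), "\n")
cat("Duplicate rows:\n")
print(head(df[duplicated(df), ]))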
## Number of duplicate rows: 80
## Duplicate rows:
## General_Health Checkup Exercise Heart_Disease Skin_Cancer
## 46403 Good Within the past year Yes No No
## 49288 Very Good Within the past year Yes No No
## 75449 Excellent Within the past year Yes No No
## 76858 Excellent Within the past year Yes No No
## 78872 Good Within the past year Yes No No
## 81703 Excellent Within the past year Yes No No
## Other_Cancer Depression Diabetes Arthritis Sex Age_Category
## 46403 No Yes No No Female 18-24
## 49288 No No No No Female 35-39
## 75449 No No No No Female 65-69
## 76858 No No No No Male 40-44
## 78872 No No No No Female 75-79
## 81703 No No No Yes Male 55-59
## Height_.cm. Weight_.kg. BMI Smoking_History Alcohol_Consumption
## 46403 163 81.65 30.90 No 0
## 49288 160 72.57 28.34 Yes 0
## 75449 163 61.23 23.17 Yes 0
## 76858 173 81.65 27.37 No 0
## 78872 163 58.97 22.31 No 0
## 81703 178 81.65 25.83 Yes 0
## Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 46403 60 4 4
## 49288 60 30 4
## 75449 30 16 0
## 76858 30 8 1
## 78872 60 30 0
## 81703 30 30 4
After detecting those duplicated rows, we remove them.
# Remove duplicate rows from the dataset
df <- unique(df)
# Check the number of records in the cleaned dataset
cat("Number of rows after duplicate removal:", nrow(df), "\n")
## Number of rows after duplicate removal: 308774
As most machine learning models require numerical input, categorical variables are converted into numerical values using the following strategies, each applied in turn below:
1. Label Encoding (for binary variables)
2. Ordinal Encoding (for ordinal variables)
3. One-Hot Encoding (for nominal variables)
Columns encoded: Exercise, Heart_Disease, Skin_Cancer, Other_Cancer, Depression, Arthritis, Smoking_History
# Function for label encoding
label_encode <- function(df, columns) {
  for (col in columns) {
    df[[col]] <- ifelse(df[[col]] == "Yes", 1, 0)
  }
  return(df)
}
# List of columns to encode
columns_to_encode <- c("Exercise", "Heart_Disease", "Skin_Cancer",
                       "Other_Cancer", "Depression", "Arthritis", "Smoking_History")
# Apply label encoding to selected columns
df <- label_encode(df, columns_to_encode)
head(df)
## General_Health Checkup Exercise Heart_Disease Skin_Cancer
## 1 Poor Within the past 2 years 0 0 0
## 2 Very Good Within the past year 0 1 0
## 3 Very Good Within the past year 1 0 0
## 4 Poor Within the past year 1 1 0
## 5 Good Within the past year 0 0 0
## 6 Good Within the past year 0 0 0
## Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_.cm.
## 1 0 0 No 1 Female 70-74 150
## 2 0 0 Yes 0 Female 70-74 165
## 3 0 0 Yes 0 Female 60-64 163
## 4 0 0 Yes 0 Male 75-79 180
## 5 0 0 No 0 Male 80+ 191
## 6 0 1 No 1 Male 60-64 183
## Weight_.kg. BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1 32.66 14.54 1 0 30
## 2 77.11 28.29 0 0 30
## 3 88.45 33.47 0 4 12
## 4 93.44 28.73 0 0 30
## 5 88.45 24.37 1 0 8
## 6 154.22 46.11 0 0 12
## Green_Vegetables_Consumption FriedPotato_Consumption
## 1 16 12
## 2 0 4
## 3 3 16
## 4 30 8
## 5 4 0
## 6 12 12
Columns encoded: General_Health (ordinal: Poor = 1, Fair = 2, Good = 3, Very Good = 4, Excellent = 5)
df$General_Health <- factor(df$General_Health,
                            levels = c('Poor', 'Fair', 'Good', 'Very Good', 'Excellent'),
                            ordered = TRUE)
df$General_Health <- as.integer(df$General_Health)
head(df)
## General_Health Checkup Exercise Heart_Disease Skin_Cancer
## 1 1 Within the past 2 years 0 0 0
## 2 4 Within the past year 0 1 0
## 3 4 Within the past year 1 0 0
## 4 1 Within the past year 1 1 0
## 5 3 Within the past year 0 0 0
## 6 3 Within the past year 0 0 0
## Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_.cm.
## 1 0 0 No 1 Female 70-74 150
## 2 0 0 Yes 0 Female 70-74 165
## 3 0 0 Yes 0 Female 60-64 163
## 4 0 0 Yes 0 Male 75-79 180
## 5 0 0 No 0 Male 80+ 191
## 6 0 1 No 1 Male 60-64 183
## Weight_.kg. BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1 32.66 14.54 1 0 30
## 2 77.11 28.29 0 0 30
## 3 88.45 33.47 0 4 12
## 4 93.44 28.73 0 0 30
## 5 88.45 24.37 1 0 8
## 6 154.22 46.11 0 0 12
## Green_Vegetables_Consumption FriedPotato_Consumption
## 1 16 12
## 2 0 4
## 3 3 16
## 4 30 8
## 5 4 0
## 6 12 12
Columns encoded: Diabetes, Checkup, Sex, Age_Category
# Refined Grouping of Diabetes Categories
df$Diabetes <- recode(df$Diabetes,
                      'No' = 'No Diabetes',
                      'No, pre-diabetes or borderline diabetes' = 'Pre-diabetes',
                      'Yes' = 'Diabetes',
                      'Yes, but female told only during pregnancy' = 'Gestational Diabetes')
# Function for one-hot encoding
one_hot_encode <- function(df, columns) {
  # Create a copy of the original dataframe
  result_df <- df
  # Process each column
  for (col in columns) {
    # Create dummy variables
    dummy_matrix <- model.matrix(as.formula(paste0("~", col, " - 1")), data = result_df)
    # Add dummy columns to the result dataframe
    result_df <- cbind(result_df, dummy_matrix)
    # Remove the original column
    result_df[[col]] <- NULL
  }
  return(result_df)
}
# List of columns to encode
columns_to_encode <- c("Diabetes", "Checkup", "Sex", "Age_Category")
# Apply one-hot encoding to selected columns
df <- one_hot_encode(df, columns_to_encode)
head(df)
## General_Health Exercise Heart_Disease Skin_Cancer Other_Cancer Depression
## 1 1 0 0 0 0 0
## 2 4 0 1 0 0 0
## 3 4 1 0 0 0 0
## 4 1 1 1 0 0 0
## 5 3 0 0 0 0 0
## 6 3 0 0 0 0 1
## Arthritis Height_.cm. Weight_.kg. BMI Smoking_History Alcohol_Consumption
## 1 1 150 32.66 14.54 1 0
## 2 0 165 77.11 28.29 0 0
## 3 0 163 88.45 33.47 0 4
## 4 0 180 93.44 28.73 0 0
## 5 0 191 88.45 24.37 1 0
## 6 1 183 154.22 46.11 0 0
## Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 1 30 16 12
## 2 30 0 4
## 3 12 3 16
## 4 30 30 8
## 5 8 4 0
## 6 12 12 12
## DiabetesDiabetes DiabetesGestational Diabetes DiabetesNo Diabetes
## 1 0 0 1
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 0 1
## 6 0 0 1
## DiabetesPre-diabetes Checkup5 or more years ago CheckupNever
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## CheckupWithin the past 2 years CheckupWithin the past 5 years
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## CheckupWithin the past year SexFemale SexMale Age_Category18-24
## 1 0 1 0 0
## 2 1 1 0 0
## 3 1 1 0 0
## 4 1 0 1 0
## 5 1 0 1 0
## 6 1 0 1 0
## Age_Category25-29 Age_Category30-34 Age_Category35-39 Age_Category40-44
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Age_Category45-49 Age_Category50-54 Age_Category55-59 Age_Category60-64
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 1
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 1
## Age_Category65-69 Age_Category70-74 Age_Category75-79 Age_Category80+
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 0 0 0
## 4 0 0 1 0
## 5 0 0 0 1
## 6 0 0 0 0
To avoid a model that is biased towards predicting only one class, we need to check the distribution of our target variable (Y), Heart_Disease.
# Check the number of imbalanced classes
df %>% count(Heart_Disease)
## Heart_Disease n
## 1 0 283803
## 2 1 24971
The class imbalance is substantial, which may bias the model towards predicting the majority class ('No Heart Disease').
Hence, the SMOTE (Synthetic Minority Oversampling Technique) algorithm is used to balance the dataset, oversampling the minority class and undersampling the majority class to address the imbalance in the target variable.
# SMOTE approach
df$Heart_Disease <- as.factor(df$Heart_Disease)
balanced_data <- SMOTE(Heart_Disease ~ ., data = df, perc.over = 200, perc.under = 200, k = 3)
Then, we check whether the dataset is balanced after applying the SMOTE algorithm.
# Check the class distribution in the balanced dataset
table(balanced_data$Heart_Disease)
##
## 0 1
## 99884 74913
With perc.over = 200, each of the 24,971 minority cases generates two synthetic cases, giving 24,971 × 3 = 74,913 positives; perc.under = 200 then retains twice as many majority cases as newly generated minority cases, i.e., 2 × 49,942 = 99,884 negatives. At 99,884 negative to 74,913 positive cases, the dataset is far more balanced than before SMOTE was applied, so we proceed with this balanced dataset.
The data is split into training (80%) and testing (20%) sets.
# Split the original data into features (X) and target (y)
X <- df[, -which(names(df) == "Heart_Disease")] # Features
y <- df$Heart_Disease # Target
# Split the oversampled data into features (X) and target (y)
X_oversampled <- balanced_data[, -which(names(balanced_data) == "Heart_Disease")] # Features
y_oversampled <- balanced_data$Heart_Disease # Target
# Split the original data into training and testing sets (80% training)
set.seed(42)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
# Split the oversampled data into training and testing sets (80% training)
set.seed(42) # Ensure reproducibility
train_index_oversampled <- sample(1:nrow(balanced_data), 0.8 * nrow(balanced_data))
X_train_oversampled <- X_oversampled[train_index_oversampled, ]
y_train_oversampled <- y_oversampled[train_index_oversampled]
X_test_oversampled <- X_oversampled[-train_index_oversampled, ]
y_test_oversampled <- y_oversampled[-train_index_oversampled]
We also check the dimensions of the training and test sets.
# Check the dimensions
cat("X_train dimensions:", dim(X_train), "\n")
## X_train dimensions: 247019 38
cat("y_train length:", length(y_train), "\n\n")
## y_train length: 247019
cat("X_train_oversampled dimensions:", dim(X_train_oversampled), "\n")
## X_train_oversampled dimensions: 139837 38
cat("y_train_oversampled length:", length(y_train_oversampled), "\n\n")
## y_train_oversampled length: 139837
cat("X_test dimensions:", dim(X_test), "\n")
## X_test dimensions: 61755 38
cat("y_test length:", length(y_test), "\n\n")
## y_test length: 61755
cat("X_test_oversampled dimensions:", dim(X_test_oversampled), "\n")
## X_test_oversampled dimensions: 34960 38
cat("y_test_oversampled length:", length(y_test_oversampled), "\n")
## y_test_oversampled length: 34960
The z-score standardization method transforms the data to zero mean and unit variance, i.e., each feature x is transformed to z = (x - mean(x)) / sd(x), where the mean and standard deviation are computed on the training set and reused to scale the test set, avoiding data leakage. This process ensures that all features are on the same scale, which can improve the performance of many machine learning algorithms.
# Standardize continuous variables in the training and test sets
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test, center = attr(X_train_scaled, "scaled:center"),
                       scale = attr(X_train_scaled, "scaled:scale"))
X_train_oversampled_scaled <- scale(X_train_oversampled)
X_test_oversampled_scaled <- scale(X_test_oversampled,
                                   center = attr(X_train_oversampled_scaled, "scaled:center"),
                                   scale = attr(X_train_oversampled_scaled, "scaled:scale"))
head(X_train_scaled)
## General_Health Exercise Skin_Cancer Other_Cancer Depression Arthritis
## 61415 1.4241926 0.5393822 -0.3277424 -0.3270679 -0.5012777 1.4333171
## 54427 -0.5158081 -1.8539653 -0.3277424 -0.3270679 1.9948943 1.4333171
## 99566 1.4241926 0.5393822 -0.3277424 -0.3270679 -0.5012777 1.4333171
## 74364 -0.5158081 -1.8539653 -0.3277424 -0.3270679 1.9948943 -0.6976795
## 46208 0.4541923 0.5393822 -0.3277424 -0.3270679 -0.5012777 -0.6976795
## 61607 -0.5158081 -1.8539653 -0.3277424 -0.3270679 1.9948943 1.4333171
## Height_.cm. Weight_.kg. BMI Smoking_History Alcohol_Consumption
## 61415 -1.74527467 -1.6215758 -1.1551946 1.2093797 -0.4994683
## 54427 -0.71385652 0.2918438 0.8221726 1.2093797 -0.6214907
## 99566 -1.74527467 -1.3665782 -0.7949766 -0.8268668 -0.6214907
## 74364 -0.52632595 2.5671349 3.3912171 -0.8268668 -0.6214907
## 46208 -0.52632595 0.3344996 0.7133407 -0.8268668 -0.6214907
## 61607 -0.05749951 1.5040199 1.7342141 -0.8268668 -0.6214907
## Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 61415 1.2150823 0.9993139 -0.7359452
## 54427 -0.7156781 -0.2076617 -0.3857589
## 99566 -0.5950056 -1.0123122 -0.6192165
## 74364 -0.8765748 0.3287719 -0.6192165
## 46208 -0.3938847 0.3287719 1.5986302
## 61607 -0.5547814 -0.8782038 -0.2690301
## DiabetesDiabetes DiabetesGestational Diabetes DiabetesNo Diabetes
## 61415 -0.3858388 -0.09339431 0.4373676
## 54427 -0.3858388 -0.09339431 0.4373676
## 99566 -0.3858388 -0.09339431 0.4373676
## 74364 2.5917458 -0.09339431 -2.2863968
## 46208 -0.3858388 -0.09339431 0.4373676
## 61607 -0.3858388 -0.09339431 0.4373676
## DiabetesPre-diabetes Checkup5 or more years ago CheckupNever
## 61415 -0.1511858 -0.2130658 -0.06850995
## 54427 -0.1511858 -0.2130658 -0.06850995
## 99566 -0.1511858 -0.2130658 -0.06850995
## 74364 -0.1511858 -0.2130658 -0.06850995
## 46208 -0.1511858 -0.2130658 -0.06850995
## 61607 -0.1511858 -0.2130658 -0.06850995
## CheckupWithin the past 2 years CheckupWithin the past 5 years
## 61415 -0.3702832 -0.2447143
## 54427 2.7006246 -0.2447143
## 99566 -0.3702832 -0.2447143
## 74364 -0.3702832 -0.2447143
## 46208 -0.3702832 -0.2447143
## 61607 -0.3702832 -0.2447143
## CheckupWithin the past year SexFemale SexMale Age_Category18-24
## 61415 0.5390883 0.9626064 -0.9626064 -0.2539108
## 54427 -1.8549762 0.9626064 -0.9626064 -0.2539108
## 99566 0.5390883 0.9626064 -0.9626064 -0.2539108
## 74364 0.5390883 0.9626064 -0.9626064 -0.2539108
## 46208 0.5390883 0.9626064 -0.9626064 -0.2539108
## 61607 0.5390883 0.9626064 -0.9626064 -0.2539108
## Age_Category25-29 Age_Category30-34 Age_Category35-39 Age_Category40-44
## 61415 -0.2295526 -0.2516358 -0.266983 -0.2740325
## 54427 -0.2295526 -0.2516358 -0.266983 -0.2740325
## 99566 4.3562826 -0.2516358 -0.266983 -0.2740325
## 74364 -0.2295526 -0.2516358 -0.266983 -0.2740325
## 46208 -0.2295526 -0.2516358 -0.266983 3.6491874
## 61607 -0.2295526 -0.2516358 -0.266983 -0.2740325
## Age_Category45-49 Age_Category50-54 Age_Category55-59 Age_Category60-64
## 61415 -0.2701623 -0.2973456 -0.3171072 -0.342436
## 54427 -0.2701623 -0.2973456 -0.3171072 2.920242
## 99566 -0.2701623 -0.2973456 -0.3171072 -0.342436
## 74364 -0.2701623 -0.2973456 3.1534952 -0.342436
## 46208 -0.2701623 -0.2973456 -0.3171072 -0.342436
## 61607 -0.2701623 3.3630760 -0.3171072 -0.342436
## Age_Category65-69 Age_Category70-74 Age_Category75-79 Age_Category80+
## 61415 -0.3478932 -0.3347261 -0.2683204 3.5905778
## 54427 -0.3478932 -0.3347261 -0.2683204 -0.2785056
## 99566 -0.3478932 -0.3347261 -0.2683204 -0.2785056
## 74364 -0.3478932 -0.3347261 -0.2683204 -0.2785056
## 46208 -0.3478932 -0.3347261 -0.2683204 -0.2785056
## 61607 -0.3478932 -0.3347261 -0.2683204 -0.2785056
After the data is cleaned, we can train machine learning models on it and evaluate their performance in predicting True Positive (TP) and True Negative (TN) cases.
In our research, several models are trained and compared to determine the best-performing model for the task:
1. Random Forest - an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy.
2. Logistic Regression - a linear model typically used for classification tasks.
3. XGBoost - a powerful gradient boosting technique that builds an ensemble of decision trees sequentially to minimize prediction errors.
For Random Forest, we set the number of trees (ntree) to 100 and mtry to the square root of the number of features; other parameters are left at their defaults.
# Train the Random Forest model
set.seed(42)
rf_model_oversampled <- randomForest(x = X_train_oversampled_scaled,
                                     y = as.factor(y_train_oversampled),
                                     ntree = 100,
                                     mtry = sqrt(ncol(X_train_oversampled_scaled)))
# Evaluate the model on the test set
rf_predictions_oversampled <- predict(rf_model_oversampled, X_test_oversampled_scaled)
# Create confusion matrix using caret
rf_confusion_oversampled <- confusionMatrix(rf_predictions_oversampled, as.factor(y_test_oversampled))
# Print confusion matrix and statistics
print(rf_confusion_oversampled)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17819 2112
## 1 2250 12779
##
## Accuracy : 0.8752
## 95% CI : (0.8717, 0.8787)
## No Information Rate : 0.5741
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7452
##
## Mcnemar's Test P-Value : 0.03805
##
## Sensitivity : 0.8879
## Specificity : 0.8582
## Pos Pred Value : 0.8940
## Neg Pred Value : 0.8503
## Prevalence : 0.5741
## Detection Rate : 0.5097
## Detection Prevalence : 0.5701
## Balanced Accuracy : 0.8730
##
## 'Positive' Class : 0
##
For Logistic Regression, we leave all parameters at their defaults.
# Convert X_train_oversampled_scaled and y_train_oversampled to a data.frame for training
set.seed(42)
train_data_oversampled <- data.frame(X_train_oversampled_scaled) # Convert to data.frame
train_data_oversampled$Heart_Disease <- y_train_oversampled # Add target variable as a column
# Train a Logistic Regression model
logistic_model_oversampled <- glm(Heart_Disease ~ ., data = train_data_oversampled, family = binomial)
# Convert X_test_scaled to a data.frame and make predictions
X_test_oversampled_scaled_df <- data.frame(X_test_oversampled_scaled) # Convert matrix to data.frame
# Make predictions on the test set
logistic_predictions_oversampled <- predict(logistic_model_oversampled, newdata = X_test_oversampled_scaled_df, type = "response")
# Convert probabilities to class labels
logistic_class_predictions_oversampled <- ifelse(logistic_predictions_oversampled > 0.5, 1, 0)
# Confusion matrix
logistic_confusion_oversampled <- table(Predicted = logistic_class_predictions_oversampled, Actual = y_test_oversampled)
# Calculate precision, recall, F1-score
lr_confusion_oversampled <- confusionMatrix(logistic_confusion_oversampled)
# Print the metrics
print(lr_confusion_oversampled)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 15741 4009
## 1 4328 10882
##
## Accuracy : 0.7615
## 95% CI : (0.757, 0.766)
## No Information Rate : 0.5741
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5137
##
## Mcnemar's Test P-Value : 0.0004963
##
## Sensitivity : 0.7843
## Specificity : 0.7308
## Pos Pred Value : 0.7970
## Neg Pred Value : 0.7155
## Prevalence : 0.5741
## Detection Rate : 0.4503
## Detection Prevalence : 0.5649
## Balanced Accuracy : 0.7576
##
## 'Positive' Class : 0
##
For XGBoost, we use nrounds = 100, objective = “binary:logistic”, eval_metric = “error”, max_depth = 6, eta = 0.1, and verbose = 0 as the model hyperparameters.
set.seed(42)
# Convert the data to matrix format for xgboost
dtrain_oversampled <- xgb.DMatrix(data = X_train_oversampled_scaled, label = as.numeric(as.character(y_train_oversampled)))
dtest_oversampled <- xgb.DMatrix(data = X_test_oversampled_scaled, label = as.numeric(as.character(y_test_oversampled)))
# Train the XGBoost model
xgb_model_oversampled <- xgboost(data = dtrain_oversampled,
nrounds = 100,
objective = "binary:logistic",
eval_metric = "error",
max_depth = 6,
eta = 0.1,
verbose = 0)
# Predict on the test set
xgb_predictions_oversampled <- predict(xgb_model_oversampled, dtest_oversampled)
# Convert probabilities to class labels
xgb_class_predictions_oversampled <- ifelse(xgb_predictions_oversampled > 0.5, 1, 0)
# Confusion matrix
xgboost_confusion_oversampled <- table(Predicted = xgb_class_predictions_oversampled, Actual = y_test_oversampled)
# Calculate precision, recall, F1-score
xgb_confusion_oversampled <- confusionMatrix(xgboost_confusion_oversampled)
# Print the metrics
print(xgb_confusion_oversampled)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 18533 3065
## 1 1536 11826
##
## Accuracy : 0.8684
## 95% CI : (0.8648, 0.8719)
## No Information Rate : 0.5741
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7273
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9235
## Specificity : 0.7942
## Pos Pred Value : 0.8581
## Neg Pred Value : 0.8850
## Prevalence : 0.5741
## Detection Rate : 0.5301
## Detection Prevalence : 0.6178
## Balanced Accuracy : 0.8588
##
## 'Positive' Class : 0
##
set.seed(42)
# Extract Random Forest Metrics (Oversampled)
rf_accuracy_oversampled <- rf_confusion_oversampled$overall["Accuracy"]
rf_kappa_oversampled <- rf_confusion_oversampled$overall["Kappa"]
rf_sensitivity_oversampled <- rf_confusion_oversampled$byClass["Sensitivity"]
rf_specificity_oversampled <- rf_confusion_oversampled$byClass["Specificity"]
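# Note: the "F1" computed below is the harmonic mean of Sensitivity and
# Specificity (a balance measure across the two classes), not the classical
# precision-recall F1 score; we keep this name throughout the report.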
rf_f1_score_oversampled <- 2 * (rf_sensitivity_oversampled * rf_specificity_oversampled) /
(rf_sensitivity_oversampled + rf_specificity_oversampled)
# Extract Logistic Regression Metrics (Oversampled)
lr_accuracy_oversampled <- lr_confusion_oversampled$overall["Accuracy"]
lr_kappa_oversampled <- lr_confusion_oversampled$overall["Kappa"]
lr_sensitivity_oversampled <- lr_confusion_oversampled$byClass["Sensitivity"]
lr_specificity_oversampled <- lr_confusion_oversampled$byClass["Specificity"]
lr_f1_score_oversampled <- 2 * (lr_sensitivity_oversampled * lr_specificity_oversampled) /
(lr_sensitivity_oversampled + lr_specificity_oversampled)
# Extract XGBoost Metrics (Oversampled)
xgb_accuracy_oversampled <- xgb_confusion_oversampled$overall["Accuracy"]
xgb_kappa_oversampled <- xgb_confusion_oversampled$overall["Kappa"]
xgb_sensitivity_oversampled <- xgb_confusion_oversampled$byClass["Sensitivity"]
xgb_specificity_oversampled <- xgb_confusion_oversampled$byClass["Specificity"]
xgb_f1_score_oversampled <- 2 * (xgb_sensitivity_oversampled * xgb_specificity_oversampled) /
(xgb_sensitivity_oversampled + xgb_specificity_oversampled)
# Combine Results into a Data Frame
results <- data.frame(
Model = c("Random Forest (Oversampled)", "Logistic Regression (Oversampled)", "XGBoost (Oversampled)"),
Accuracy = c(rf_accuracy_oversampled, lr_accuracy_oversampled, xgb_accuracy_oversampled),
Kappa = c(rf_kappa_oversampled, lr_kappa_oversampled, xgb_kappa_oversampled),
Sensitivity = c(rf_sensitivity_oversampled, lr_sensitivity_oversampled, xgb_sensitivity_oversampled),
Specificity = c(rf_specificity_oversampled, lr_specificity_oversampled, xgb_specificity_oversampled),
F1_Score = c(rf_f1_score_oversampled, lr_f1_score_oversampled, xgb_f1_score_oversampled)
)
# Print the results table
options(width = 200)
print(results)
## Model Accuracy Kappa Sensitivity Specificity F1_Score
## 1 Random Forest (Oversampled) 0.8752288 0.7451653 0.8878868 0.8581694 0.8727752
## 2 Logistic Regression (Oversampled) 0.7615275 0.5137013 0.7843440 0.7307770 0.7566136
## 3 XGBoost (Oversampled) 0.8683924 0.7272685 0.9234640 0.7941710 0.8539513
Two models, XGBoost and Random Forest, are very close in terms of performance. To decide which of them is better, we make some further adjustments.
Since the dataset includes many features (37 in total), model accuracy might be affected by taking unimportant features into account, even though their individual weight is low. We expect that removing them will let us compare the two models' performance in more detail.
Feature Importance of Random Forest is visualized and listed.
set.seed(42)
# Get the feature importance
feature_importance <- importance(rf_model_oversampled)
# View the importance of each feature
print(feature_importance)
## MeanDecreaseGini
## General_Health 10000.82371
## Exercise 834.20764
## Skin_Cancer 765.67324
## Other_Cancer 759.03369
## Depression 699.52976
## Arthritis 2627.52361
## Height_.cm. 4446.37616
## Weight_.kg. 4063.74859
## BMI 4232.48413
## Smoking_History 1350.71912
## Alcohol_Consumption 3339.85824
## Fruit_Consumption 3513.00921
## Green_Vegetables_Consumption 3873.89278
## FriedPotato_Consumption 4074.29542
## DiabetesDiabetes 2064.75859
## DiabetesGestational Diabetes 64.70883
## DiabetesNo Diabetes 1254.27544
## DiabetesPre-diabetes 156.03778
## Checkup5 or more years ago 136.92034
## CheckupNever 26.82495
## CheckupWithin the past 2 years 253.48442
## CheckupWithin the past 5 years 134.13701
## CheckupWithin the past year 1002.24777
## SexFemale 706.45074
## SexMale 687.33703
## Age_Category18-24 446.26700
## Age_Category25-29 309.34607
## Age_Category30-34 344.17282
## Age_Category35-39 470.05636
## Age_Category40-44 423.44436
## Age_Category45-49 326.58788
## Age_Category50-54 371.50814
## Age_Category55-59 425.36021
## Age_Category60-64 551.60090
## Age_Category65-69 747.05878
## Age_Category70-74 987.85560
## Age_Category75-79 1174.60183
## Age_Category80+ 2018.22466
library(ggplot2)
set.seed(42)
# Convert the importance to a data frame for plotting
importance_df <- data.frame(Feature = rownames(feature_importance),
Importance = feature_importance[, 1])
# Sort the features by importance
importance_df <- importance_df[order(-importance_df$Importance), ]
# Plot the feature importance with colors
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip the axes to make the labels readable
theme_minimal() +
labs(title = "Feature Importance - Random Forest Model",
x = "Features",
y = "Importance") +
scale_fill_gradient(low = "skyblue", high = "darkblue") # Color gradient from light to dark blue
Feature Importance of XGBoost is visualized and listed.
set.seed(42)
# Get the feature importance
importance_matrix <- xgb.importance(feature_names = colnames(X_train_oversampled_scaled), model = xgb_model_oversampled)
print(importance_matrix)
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: General_Health 3.503962e-01 0.1392817414 0.109920720
## 2: FriedPotato_Consumption 1.477095e-01 0.1606081417 0.119348618
## 3: Green_Vegetables_Consumption 9.632695e-02 0.0985948607 0.104349689
## 4: Height_.cm. 6.674948e-02 0.0797638667 0.091279194
## 5: Alcohol_Consumption 5.289063e-02 0.0696119751 0.065781016
## 6: Age_Category80+ 3.768819e-02 0.0331598396 0.031283480
## 7: SexFemale 2.986418e-02 0.0329964936 0.029140776
## 8: DiabetesDiabetes 2.369779e-02 0.0236185311 0.016284551
## 9: CheckupWithin the past year 2.187385e-02 0.0152691382 0.009856439
## 10: Arthritis 2.104288e-02 0.0176480080 0.025069638
## 11: Age_Category75-79 1.976007e-02 0.0235143768 0.016927362
## 12: Age_Category70-74 1.714262e-02 0.0180759294 0.014141847
## 13: Fruit_Consumption 1.450815e-02 0.0201968192 0.068780801
## 14: Smoking_History 1.399932e-02 0.0294421316 0.017998714
## 15: Age_Category35-39 1.135584e-02 0.0250020562 0.007928005
## 16: Age_Category18-24 1.068694e-02 0.0328013585 0.007499464
## 17: Age_Category40-44 9.631848e-03 0.0217373842 0.007713735
## 18: Age_Category30-34 9.155879e-03 0.0256828673 0.006428112
## 19: Age_Category25-29 8.766203e-03 0.0291999476 0.005142490
## 20: Age_Category65-69 8.092437e-03 0.0109299274 0.008142276
## 21: BMI 6.274980e-03 0.0218891030 0.080137133
## 22: Age_Category45-49 5.230579e-03 0.0187599988 0.005142490
## 23: Weight_.kg. 4.439254e-03 0.0134317213 0.055281766
## 24: Age_Category60-64 3.416319e-03 0.0053892028 0.005142490
## 25: Age_Category50-54 2.460800e-03 0.0126838332 0.006213842
## 26: Skin_Cancer 1.881751e-03 0.0043539841 0.016284551
## 27: Other_Cancer 1.705327e-03 0.0038374697 0.020569959
## 28: Depression 1.324213e-03 0.0044163783 0.017141633
## 29: DiabetesNo Diabetes 8.184238e-04 0.0021255539 0.009856439
## 30: Exercise 3.982381e-04 0.0007600845 0.007499464
## 31: Age_Category55-59 2.801692e-04 0.0029220981 0.002785515
## 32: Checkup5 or more years ago 1.471372e-04 0.0013825412 0.003428327
## 33: CheckupWithin the past 2 years 1.142739e-04 0.0001911379 0.003642597
## 34: CheckupWithin the past 5 years 9.581601e-05 0.0004265850 0.002571245
## 35: DiabetesPre-diabetes 7.382090e-05 0.0002949140 0.001285622
## Feature Gain Cover Frequency
xgb.plot.importance(importance_matrix)
Based on these results, the top 25 features are selected and the remaining 12 features are dropped.
set.seed(42)
# Select top 25 features based on importance for Random Forest
top_25_features <- rownames(importance_df)[1:25]
top_25_features
## [1] "General_Health" "Height_.cm." "BMI" "FriedPotato_Consumption" "Weight_.kg." "Green_Vegetables_Consumption"
## [7] "Fruit_Consumption" "Alcohol_Consumption" "Arthritis" "DiabetesDiabetes" "Age_Category80+" "Smoking_History"
## [13] "DiabetesNo Diabetes" "Age_Category75-79" "CheckupWithin the past year" "Age_Category70-74" "Exercise" "Skin_Cancer"
## [19] "Other_Cancer" "Age_Category65-69" "SexFemale" "Depression" "SexMale" "Age_Category60-64"
## [25] "Age_Category35-39"
# Select top 25 features based on importance for XGBoost
top_25_features_xgb <- head(importance_matrix[order(-importance_matrix$Gain), ], 25)
top_25_features_xgb <- top_25_features_xgb$Feature
top_25_features_xgb
## [1] "General_Health" "FriedPotato_Consumption" "Green_Vegetables_Consumption" "Height_.cm." "Alcohol_Consumption" "Age_Category80+"
## [7] "SexFemale" "DiabetesDiabetes" "CheckupWithin the past year" "Arthritis" "Age_Category75-79" "Age_Category70-74"
## [13] "Fruit_Consumption" "Smoking_History" "Age_Category35-39" "Age_Category18-24" "Age_Category40-44" "Age_Category30-34"
## [19] "Age_Category25-29" "Age_Category65-69" "BMI" "Age_Category45-49" "Weight_.kg." "Age_Category60-64"
## [25] "Age_Category50-54"
set.seed(42)
# Reuse the top 25 feature lists selected above for each model
# Subset the training data to only include the top 25 features for Random Forest
X_train_top25 <- X_train_oversampled_scaled[, top_25_features]
X_test_top25 <- X_test_oversampled_scaled[, top_25_features]
# Subset the training data to only include the top 25 features for XGBoost
X_train_top25_xgb <- X_train_oversampled_scaled[, top_25_features_xgb]
X_test_top25_xgb <- X_test_oversampled_scaled[, top_25_features_xgb]
Then, we retrain these two models, each using only the 25 most important features identified for that model.
set.seed(42)
# Train the Random Forest model on the top 25 features
rf_model_top25 <- randomForest(x = X_train_top25, y = as.factor(y_train_oversampled), ntree = 100, mtry = sqrt(ncol(X_train_top25)))
# Evaluate the model on the test set with top 25 features
rf_predictions_top25 <- predict(rf_model_top25, X_test_top25)
# Create confusion matrix using caret for the Random Forest model with top 25 features
rf_confusion_top25 <- confusionMatrix(rf_predictions_top25, as.factor(y_test_oversampled))
# Print confusion matrix and statistics
print(rf_confusion_top25)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18086 2323
## 1 1983 12568
##
## Accuracy : 0.8768
## 95% CI : (0.8733, 0.8803)
## No Information Rate : 0.5741
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7474
##
## Mcnemar's Test P-Value : 2.39e-07
##
## Sensitivity : 0.9012
## Specificity : 0.8440
## Pos Pred Value : 0.8862
## Neg Pred Value : 0.8637
## Prevalence : 0.5741
## Detection Rate : 0.5173
## Detection Prevalence : 0.5838
## Balanced Accuracy : 0.8726
##
## 'Positive' Class : 0
##
set.seed(42)
# Convert the data to matrix format for xgboost
dtrain_top25 <- xgb.DMatrix(data = X_train_top25_xgb, label = as.numeric(as.character(y_train_oversampled)))
dtest_top25 <- xgb.DMatrix(data = X_test_top25_xgb, label = as.numeric(as.character(y_test_oversampled)))
# Train the XGBoost model on the top 25 features
xgb_model_top25 <- xgboost(data = dtrain_top25,
nrounds = 100,
objective = "binary:logistic",
eval_metric = "error",
max_depth = 6,
eta = 0.1,
verbose = 0)
# Predict on the test set using the top 25 features
xgb_predictions_top25 <- predict(xgb_model_top25, dtest_top25)
# Convert probabilities to class labels
xgb_class_predictions_top25 <- ifelse(xgb_predictions_top25 > 0.5, 1, 0)
# Confusion matrix
xgboost_confusion_top25 <- table(Predicted = xgb_class_predictions_top25, Actual = y_test_oversampled)
# Calculate precision, recall, F1-score
xgb_confusion_top25 <- confusionMatrix(xgboost_confusion_top25)
# Print the metrics
print(xgb_confusion_top25)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 18495 3037
## 1 1574 11854
##
## Accuracy : 0.8681
## 95% CI : (0.8645, 0.8716)
## No Information Rate : 0.5741
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7268
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9216
## Specificity : 0.7961
## Pos Pred Value : 0.8590
## Neg Pred Value : 0.8828
## Prevalence : 0.5741
## Detection Rate : 0.5290
## Detection Prevalence : 0.6159
## Balanced Accuracy : 0.8588
##
## 'Positive' Class : 0
##
The results are tabulated as follows.
set.seed(42)
# Extract Random Forest Metrics (Oversampled)
rf_accuracy_oversampled <- rf_confusion_oversampled$overall["Accuracy"]
rf_kappa_oversampled <- rf_confusion_oversampled$overall["Kappa"]
rf_sensitivity_oversampled <- rf_confusion_oversampled$byClass["Sensitivity"]
rf_specificity_oversampled <- rf_confusion_oversampled$byClass["Specificity"]
rf_f1_score_oversampled <- 2 * (rf_sensitivity_oversampled * rf_specificity_oversampled) /
(rf_sensitivity_oversampled + rf_specificity_oversampled)
# Extract XGBoost Metrics (Oversampled)
xgb_accuracy_oversampled <- xgb_confusion_oversampled$overall["Accuracy"]
xgb_kappa_oversampled <- xgb_confusion_oversampled$overall["Kappa"]
xgb_sensitivity_oversampled <- xgb_confusion_oversampled$byClass["Sensitivity"]
xgb_specificity_oversampled <- xgb_confusion_oversampled$byClass["Specificity"]
xgb_f1_score_oversampled <- 2 * (xgb_sensitivity_oversampled * xgb_specificity_oversampled) /
(xgb_sensitivity_oversampled + xgb_specificity_oversampled)
# Extract Random Forest Metrics
rf_accuracy_top25 <- rf_confusion_top25$overall["Accuracy"]
rf_kappa_top25 <- rf_confusion_top25$overall["Kappa"]
rf_sensitivity_top25 <- rf_confusion_top25$byClass["Sensitivity"]
rf_specificity_top25 <- rf_confusion_top25$byClass["Specificity"]
rf_f1_score_top25 <- 2 * (rf_sensitivity_top25 * rf_specificity_top25) / (rf_sensitivity_top25 + rf_specificity_top25)
# Extract XGBoost Metrics
xgb_accuracy_top25 <- xgb_confusion_top25$overall["Accuracy"]
xgb_kappa_top25 <- xgb_confusion_top25$overall["Kappa"]
xgb_sensitivity_top25 <- xgb_confusion_top25$byClass["Sensitivity"]
xgb_specificity_top25 <- xgb_confusion_top25$byClass["Specificity"]
xgb_f1_score_top25 <- 2 * (xgb_sensitivity_top25 * xgb_specificity_top25) / (xgb_sensitivity_top25 + xgb_specificity_top25)
# Combine Results into a Data Frame
results <- data.frame(
Model = c("Random Forest (Oversampled)", "XGBoost (Oversampled)","Random Forest (Top 25)", "XGBoost (Top 25)"),
Accuracy = c(rf_accuracy_oversampled, xgb_accuracy_oversampled, rf_accuracy_top25, xgb_accuracy_top25),
Kappa = c(rf_kappa_oversampled, xgb_kappa_oversampled, rf_kappa_top25, xgb_kappa_top25),
Sensitivity = c(rf_sensitivity_oversampled, xgb_sensitivity_oversampled, rf_sensitivity_top25, xgb_sensitivity_top25),
Specificity = c(rf_specificity_oversampled, xgb_specificity_oversampled, rf_specificity_top25, xgb_specificity_top25),
F1_Score = c(rf_f1_score_oversampled, xgb_f1_score_oversampled, rf_f1_score_top25, xgb_f1_score_top25)
)
# Print the results table
options(width = 200)
print(results)
## Model Accuracy Kappa Sensitivity Specificity F1_Score
## 1 Random Forest (Oversampled) 0.8752288 0.7451653 0.8878868 0.8581694 0.8727752
## 2 XGBoost (Oversampled) 0.8683924 0.7272685 0.9234640 0.7941710 0.8539513
## 3 Random Forest (Top 25) 0.8768307 0.7473921 0.9011909 0.8439997 0.8716582
## 4 XGBoost (Top 25) 0.8681064 0.7268341 0.9215706 0.7960513 0.8542246
After restricting each model to its 25 most important features, we expected performance to drop somewhat, since 12 features were removed. However, Random Forest actually improved in accuracy and sensitivity, while XGBoost declined slightly on every metric. Overall, Random Forest still performs slightly better than XGBoost, with higher accuracy, specificity, and F1 score.
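Because the headline metrics of the two models are so close, a paired test on their per-observation errors can indicate whether the gap is statistically meaningful. A minimal sketch using McNemar's test, assuming rf_predictions_top25, xgb_class_predictions_top25 and y_test_oversampled from the code above are still in scope:
# Sketch: paired comparison of the two classifiers' errors on the same test set
y_true <- as.numeric(as.character(y_test_oversampled))
rf_correct <- as.numeric(as.character(rf_predictions_top25)) == y_true
xgb_correct <- xgb_class_predictions_top25 == y_true
# 2x2 table of which observations each model classified correctly
disagreement <- table(RF_correct = rf_correct, XGB_correct = xgb_correct)
print(disagreement)
mcnemar.test(disagreement)  # small p-value: accuracy gap unlikely due to chance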
The first evaluation makes it clear that Logistic Regression is the weakest of the three models; the performance gap between it and both XGBoost and Random Forest is large. Hence, we restrict the remaining interpretation to the XGBoost and Random Forest results.
Based on our analysis of feature importance for XGBoost and Random Forest, several features rank highly in both models: General Health, Height, Fried Potato Consumption, Green Vegetables Consumption, and Alcohol Consumption appear in both top-10 lists, and Age Category (80+) and BMI also rank highly. This indicates that age, BMI, general health status, and fried potato consumption are associated with higher odds of cardiovascular disease. Both models also yield an interesting dietary insight: consumption of fruits, green vegetables, and fried potatoes is important to both XGBoost and Random Forest, and a history of diabetes is an important feature for both models. This points to a potential medical research opportunity on whether overconsumption of carbohydrates contributes to cardiovascular disease.
As expected, even after keeping only the top 25 features, Random Forest still performs slightly better than XGBoost, with higher accuracy and F1 score. By our definition, the model that performs better on accuracy and F1 score is selected as the best model. Hence, based on our results, Random Forest is the best model for predicting the likelihood of cardiovascular disease from patient demographic, lifestyle, and clinical features.
However, XGBoost performs better than Random Forest in terms of the reported Sensitivity (Recall). One caveat applies: caret treats class 0 (no heart disease) as the positive class in these outputs, so the reported Sensitivity is the recall of the no-disease class, while recall on actual disease cases corresponds to the reported Specificity, where Random Forest is stronger. In the medical field, recall on the disease class is a primary consideration alongside F1 score, so further study is required to assess whether XGBoost can be made the more accurate detector, especially through feature engineering, hyperparameter tuning, and decision-threshold selection.
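As a concrete starting point for that further study, here is a minimal sketch of two follow-ups, assuming dtrain_top25, xgb_predictions_top25 and y_test_oversampled from the code above are still in scope: (a) cross-validation to re-tune the number of boosting rounds, and (b) a decision-threshold sweep, since lowering the cutoff below 0.5 raises recall on the disease class (label 1) at the cost of more false alarms.
set.seed(42)
# (a) 5-fold cross-validation to choose nrounds under early stopping
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           max_depth = 6,
                           eta = 0.1),
             data = dtrain_top25, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)
cv$best_iteration  # suggested number of boosting rounds
# (b) decision-threshold sweep: recall on the disease class at each cutoff
y_true <- as.numeric(as.character(y_test_oversampled))
for (t in seq(0.3, 0.7, by = 0.1)) {
  preds <- ifelse(xgb_predictions_top25 > t, 1, 0)
  disease_recall <- sum(preds == 1 & y_true == 1) / sum(y_true == 1)
  cat(sprintf("threshold = %.1f  disease recall = %.3f\n", t, disease_recall))
}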