WQD7004 Programming of Data Science

Data source

Throughout this research, our data source is from Kaggle (https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset/data). The dataset records data from the Behavioral Risk Factor Surveillance System (BRFSS), the USA’s premier system of health-related telephone surveys, which collects state-level data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.

Objective

Our research has two objectives:

  1. Identify Key Risk Factors: Analyze the dataset to identify the most significant risk factors contributing to cardiovascular diseases using feature importance techniques such as correlation analysis, feature selection, or machine learning models like Random Forest or Logistic Regression.

  2. Develop a Predictive Model: Build and evaluate a machine learning model (e.g., Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, or Neural Networks) to predict the likelihood of cardiovascular disease based on patient demographic, lifestyle, and clinical features.

Data Understanding

Before proceeding to the data preprocessing phase, we need to understand which dataset we are using and what information it contains.

First, we install all the packages that will be used and load them into the environment.

install.packages("randomForest")
## 
## The downloaded binary packages are in
##  /var/folders/my/ktmpxyd908v22d1dd12crydc0000gn/T//RtmpsheeET/downloaded_packages
install.packages("caret")
## 
## The downloaded binary packages are in
##  /var/folders/my/ktmpxyd908v22d1dd12crydc0000gn/T//RtmpsheeET/downloaded_packages
install.packages(c("zoo","xts","quantmod"))
## 
## The downloaded binary packages are in
##  /var/folders/my/ktmpxyd908v22d1dd12crydc0000gn/T//RtmpsheeET/downloaded_packages
install.packages( c("abind", "ROCR"))
## 
## The downloaded binary packages are in
##  /var/folders/my/ktmpxyd908v22d1dd12crydc0000gn/T//RtmpsheeET/downloaded_packages
install.packages("xgboost")
## 
## The downloaded binary packages are in
##  /var/folders/my/ktmpxyd908v22d1dd12crydc0000gn/T//RtmpsheeET/downloaded_packages
install.packages("~/Downloads/DMwR_0.4.1.tar", repos = NULL, type = "source")
library(dplyr)
library(tidyverse)
library(ggplot2)
library(DMwR)
library(zoo)
library(xts)
library(quantmod)
library(abind)
library(ROCR)
library(xgboost)
library(randomForest)
library(caret)

To analyze the dataset, the Kaggle dataset entitled “CVD.csv” (whose source was described previously) is loaded into our R script.

df <- read.csv("~/Downloads/CVD.csv")

head(df)
##   General_Health                 Checkup Exercise Heart_Disease Skin_Cancer
## 1           Poor Within the past 2 years       No            No          No
## 2      Very Good    Within the past year       No           Yes          No
## 3      Very Good    Within the past year      Yes            No          No
## 4           Poor    Within the past year      Yes           Yes          No
## 5           Good    Within the past year       No            No          No
## 6           Good    Within the past year       No            No          No
##   Other_Cancer Depression Diabetes Arthritis    Sex Age_Category Height_.cm.
## 1           No         No       No       Yes Female        70-74         150
## 2           No         No      Yes        No Female        70-74         165
## 3           No         No      Yes        No Female        60-64         163
## 4           No         No      Yes        No   Male        75-79         180
## 5           No         No       No        No   Male          80+         191
## 6           No        Yes       No       Yes   Male        60-64         183
##   Weight_.kg.   BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1       32.66 14.54             Yes                   0                30
## 2       77.11 28.29              No                   0                30
## 3       88.45 33.47              No                   4                12
## 4       93.44 28.73              No                   0                30
## 5       88.45 24.37             Yes                   0                 8
## 6      154.22 46.11              No                   0                12
##   Green_Vegetables_Consumption FriedPotato_Consumption
## 1                           16                      12
## 2                            0                       4
## 3                            3                      16
## 4                           30                       8
## 5                            4                       0
## 6                           12                      12

Then, we check the number of rows and columns of the dataset.

# Number of rows and columns
dim(df)
## [1] 308854     19
cat("Number of rows:", nrow(df), "\n")
## Number of rows: 308854
cat("Number of columns:", ncol(df), "\n")
## Number of columns: 19

The column names are listed to show what information is stored in this dataset.

# Column names
colnames(df)
##  [1] "General_Health"               "Checkup"                     
##  [3] "Exercise"                     "Heart_Disease"               
##  [5] "Skin_Cancer"                  "Other_Cancer"                
##  [7] "Depression"                   "Diabetes"                    
##  [9] "Arthritis"                    "Sex"                         
## [11] "Age_Category"                 "Height_.cm."                 
## [13] "Weight_.kg."                  "BMI"                         
## [15] "Smoking_History"              "Alcohol_Consumption"         
## [17] "Fruit_Consumption"            "Green_Vegetables_Consumption"
## [19] "FriedPotato_Consumption"
#Unique values in columns
sapply(df, unique)
## $General_Health
## [1] "Poor"      "Very Good" "Good"      "Fair"      "Excellent"
## 
## $Checkup
## [1] "Within the past 2 years" "Within the past year"   
## [3] "5 or more years ago"     "Within the past 5 years"
## [5] "Never"                  
## 
## $Exercise
## [1] "No"  "Yes"
## 
## $Heart_Disease
## [1] "No"  "Yes"
## 
## $Skin_Cancer
## [1] "No"  "Yes"
## 
## $Other_Cancer
## [1] "No"  "Yes"
## 
## $Depression
## [1] "No"  "Yes"
## 
## $Diabetes
## [1] "No"                                        
## [2] "Yes"                                       
## [3] "No, pre-diabetes or borderline diabetes"   
## [4] "Yes, but female told only during pregnancy"
## 
## $Arthritis
## [1] "Yes" "No" 
## 
## $Sex
## [1] "Female" "Male"  
## 
## $Age_Category
##  [1] "70-74" "60-64" "75-79" "80+"   "65-69" "50-54" "45-49" "18-24" "30-34"
## [10] "55-59" "35-39" "40-44" "25-29"
## 
## $Height_.cm.
##  [1] 150 165 163 180 191 183 175 160 168 178 152 157 188 185 170 173 155 193 196
## [20] 206 198 140 135 145 147 142 201 218 124 203 137 122 216 224 229 151 177 164
## [39] 162 156 153 169 167 172 106 190 143 171 154 176 200 146 148 158 159 187 104
## [58] 120 107 211 226 182 213  97 184 125 127 234 130 119 132 105 166 181 186  91
## [77] 174 208 149  96 197 161  94 103 221 134 144 189 100 179 117  99 102 110 241
## [96] 115 205 195 108
## 
## $Weight_.kg.
##   [1]  32.66  77.11  88.45  93.44 154.22  69.85 108.86  72.57  91.63  74.84
##  [ output truncated: 525 unique weight values, ranging from 24.95 to 293.02 ]
## 
## $BMI
##    [1] 14.54 28.29 33.47 28.73 24.37 46.11 22.74 39.94 27.46 34.67 29.23 23.92
##  [ output truncated: 3654 unique BMI values, ranging from 12.02 to 99.33 ]
## 
## $Smoking_History
## [1] "Yes" "No" 
## 
## $Alcohol_Consumption
##  [1]  0  4  3  8 30  2 12  1  5 10 20 17 16  6 25 28 15  7  9 24 11 29 27 14 21
## [26] 23 18 26 22 13 19
## 
## $Fruit_Consumption
##  [1]  30  12   8  16   2   1  60   0   7   5   3   6  90  28  20   4  80  24  15
## [20]  10  25  14 120  32  40  17  45 100   9  99  96  35  50  56  48  27  72  36
## [39]  84  26  23  18  21  42  22  11 112  29  64  70  33  76  44  39  75  31  92
## [58] 104  88  65  55  13  38  63  97 108  19  52  98  37  68  34  41 116  54  62
## [77]  85
## 
## $Green_Vegetables_Consumption
##  [1]  16   0   3  30   4  12   8  20   1  10   5   2   6  60  28  25  14  40   7
## [20]  22  24  15 120  90  19  13  11  80  27  17  56  18   9  21  99  29  31  45
## [39]  23 100 104  32  48  75  36  35 112  26  50  33  96  52  76  84  34  97  88
## [58]  98  68  92  55  95  64 124  61  65  77  85  44  39  70  93 128  37  53
## 
## $FriedPotato_Consumption
##  [1]  12   4  16   8   0   1   2  30  20  15  10   3   7  28   5   9   6 120  32
## [20]  14  60  33  48  25  24  21  90  13  99  17  18  40  56  34  36  44 100  11
## [39]  64  45  80  29  68  26  50  22  95  23  27 112  35  31  98  96  88  92  19
## [58]  76  49  97 128  41  37  42  52  72  46 124  84
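
Since the numeric columns have thousands of distinct values, a less noisy alternative (a small sketch) is to count the unique values per column rather than printing them all:

# Count the number of distinct values in each column
sapply(df, function(x) length(unique(x)))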

We also summarize the dataset with descriptive statistics.

#Summary statistics
summary(df)
##  General_Health       Checkup            Exercise         Heart_Disease     
##  Length:308854      Length:308854      Length:308854      Length:308854     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Skin_Cancer        Other_Cancer        Depression          Diabetes        
##  Length:308854      Length:308854      Length:308854      Length:308854     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Arthritis             Sex            Age_Category        Height_.cm.   
##  Length:308854      Length:308854      Length:308854      Min.   : 91.0  
##  Class :character   Class :character   Class :character   1st Qu.:163.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :170.0  
##                                                           Mean   :170.6  
##                                                           3rd Qu.:178.0  
##                                                           Max.   :241.0  
##   Weight_.kg.          BMI        Smoking_History    Alcohol_Consumption
##  Min.   : 24.95   Min.   :12.02   Length:308854      Min.   : 0.000     
##  1st Qu.: 68.04   1st Qu.:24.21   Class :character   1st Qu.: 0.000     
##  Median : 81.65   Median :27.44   Mode  :character   Median : 1.000     
##  Mean   : 83.59   Mean   :28.63                      Mean   : 5.096     
##  3rd Qu.: 95.25   3rd Qu.:31.85                      3rd Qu.: 6.000     
##  Max.   :293.02   Max.   :99.33                      Max.   :30.000     
##  Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
##  Min.   :  0.00    Min.   :  0.00               Min.   :  0.000        
##  1st Qu.: 12.00    1st Qu.:  4.00               1st Qu.:  2.000        
##  Median : 30.00    Median : 12.00               Median :  4.000        
##  Mean   : 29.84    Mean   : 15.11               Mean   :  6.297        
##  3rd Qu.: 30.00    3rd Qu.: 20.00               3rd Qu.:  8.000        
##  Max.   :120.00    Max.   :128.00               Max.   :128.000

Univariate Analysis

For categorical variables, we use bar plots to visualize their frequencies.

#Bar plot
ggplot(df, aes(x = General_Health)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of General Health",
       x = "General Health",
       y = "Count") +
  theme_minimal()

ggplot(df, aes(x = Checkup)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Checkup",
       x = "Checkup",
       y = "Count") +
  theme_minimal()

ggplot(df, aes(x = Age_Category)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age_Category",
       x = "Age_Category",
       y = "Count") +
  theme_minimal()

ggplot(df, aes(x = Heart_Disease)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Heart_Disease",
       x = "Heart_Disease",
       y = "Count") +
  theme_minimal()

For numerical variables, we use histograms and box plots to show their distributions.

# Histogram for BMI
ggplot(df, aes(x = BMI)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count") +
  theme_minimal()

# Boxplot for Height and Weight
ggplot(df, aes(y = Height_.cm.)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Height", y = "Height(cm)") +
  theme_minimal()

ggplot(df, aes(y = Weight_.kg.)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Weight", y = "Weight (kg)") +
  theme_minimal()

To relate these variables to the binary target, grouped bar plots compare each category by heart disease status.

# Bar plot
ggplot(df, aes(x = General_Health, fill = Heart_Disease)) +
  geom_bar(position = "dodge") +
  labs(title = "General Health by Heart Disease Status",
       x = "General Health",
       y = "Count") +
  theme_minimal()

ggplot(df, aes(x = Exercise, fill = Heart_Disease)) +
  geom_bar(position = "dodge") +
  labs(title = "Exercise Participation by Heart Disease Status",
       x = "Exercise",
       y = "Count") +
  theme_minimal()

Data Preprocessing

After this preliminary analysis, we need to clean the dataset so that it can be used to train models.

1. Data Cleaning

This section focuses on handling missing values and duplicates in the dataset.

1.1 Missing Values

First, we check for rows (observations) that contain missing values.
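
The following chunk counts the missing values in each column (a minimal sketch that reproduces the summary below):

# Count missing values in each column
cat("Missing values summary:\n")
colSums(is.na(df))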

## Missing values summary:
##               General_Health                      Checkup 
##                            0                            0 
##                     Exercise                Heart_Disease 
##                            0                            0 
##                  Skin_Cancer                 Other_Cancer 
##                            0                            0 
##                   Depression                     Diabetes 
##                            0                            0 
##                    Arthritis                          Sex 
##                            0                            0 
##                 Age_Category                  Height_.cm. 
##                            0                            0 
##                  Weight_.kg.                          BMI 
##                            0                            0 
##              Smoking_History          Alcohol_Consumption 
##                            0                            0 
##            Fruit_Consumption Green_Vegetables_Consumption 
##                            0                            0 
##      FriedPotato_Consumption 
##                            0

Since there are no null values in the dataset, we do not need to drop any rows.

1.2 Duplicates

Then, we check for rows (observations) that are not distinct, i.e., duplicates of other observations.
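
The following chunk flags duplicated rows (a minimal sketch consistent with the output below):

# Rows that exactly duplicate an earlier observation
dup_rows <- df[duplicated(df), ]
cat("Number of duplicate rows:", nrow(dup_rows), "\n")
cat("Duplicate rows:\n")
head(dup_rows)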

## Number of duplicate rows: 80
## Duplicate rows:
##       General_Health              Checkup Exercise Heart_Disease Skin_Cancer
## 46403           Good Within the past year      Yes            No          No
## 49288      Very Good Within the past year      Yes            No          No
## 75449      Excellent Within the past year      Yes            No          No
## 76858      Excellent Within the past year      Yes            No          No
## 78872           Good Within the past year      Yes            No          No
## 81703      Excellent Within the past year      Yes            No          No
##       Other_Cancer Depression Diabetes Arthritis    Sex Age_Category
## 46403           No        Yes       No        No Female        18-24
## 49288           No         No       No        No Female        35-39
## 75449           No         No       No        No Female        65-69
## 76858           No         No       No        No   Male        40-44
## 78872           No         No       No        No Female        75-79
## 81703           No         No       No       Yes   Male        55-59
##       Height_.cm. Weight_.kg.   BMI Smoking_History Alcohol_Consumption
## 46403         163       81.65 30.90              No                   0
## 49288         160       72.57 28.34             Yes                   0
## 75449         163       61.23 23.17             Yes                   0
## 76858         173       81.65 27.37              No                   0
## 78872         163       58.97 22.31              No                   0
## 81703         178       81.65 25.83             Yes                   0
##       Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 46403                60                            4                       4
## 49288                60                           30                       4
## 75449                30                           16                       0
## 76858                30                            8                       1
## 78872                60                           30                       0
## 81703                30                           30                       4

After detecting those duplicated rows, we remove them.

# Remove duplicate rows from the dataset
df <- unique(df)

# Check the number of records in the cleaned dataset
cat("Number of rows after duplicate removal:", nrow(df), "\n")
## Number of rows after duplicate removal: 308774

2. Data Formatting

As most machine learning models require numerical input, categorical variables will be converted into numerical values using the following strategies:

  1. Label Encoding (for binary variables)
  2. Ordinal Encoding (for ordinal variables)
  3. One-Hot Encoding (for nominal variables)

2.1 Label Encoding

Columns encoded: Exercise, Heart_Disease, Skin_Cancer, Other_Cancer, Depression, Arthritis, Smoking_History

# Function for label encoding
label_encode <- function(df, columns) {
  for (col in columns) {
    df[[col]] <- ifelse(df[[col]] == "Yes", 1, 0)
  }
  return(df)
}

# List of columns to encode
columns_to_encode <- c("Exercise", "Heart_Disease", "Skin_Cancer",
                       "Other_Cancer", "Depression", "Arthritis", "Smoking_History")

# Apply label encoding to selected columns
df <- label_encode(df, columns_to_encode)

head(df)
##   General_Health                 Checkup Exercise Heart_Disease Skin_Cancer
## 1           Poor Within the past 2 years        0             0           0
## 2      Very Good    Within the past year        0             1           0
## 3      Very Good    Within the past year        1             0           0
## 4           Poor    Within the past year        1             1           0
## 5           Good    Within the past year        0             0           0
## 6           Good    Within the past year        0             0           0
##   Other_Cancer Depression Diabetes Arthritis    Sex Age_Category Height_.cm.
## 1            0          0       No         1 Female        70-74         150
## 2            0          0      Yes         0 Female        70-74         165
## 3            0          0      Yes         0 Female        60-64         163
## 4            0          0      Yes         0   Male        75-79         180
## 5            0          0       No         0   Male          80+         191
## 6            0          1       No         1   Male        60-64         183
##   Weight_.kg.   BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1       32.66 14.54               1                   0                30
## 2       77.11 28.29               0                   0                30
## 3       88.45 33.47               0                   4                12
## 4       93.44 28.73               0                   0                30
## 5       88.45 24.37               1                   0                 8
## 6      154.22 46.11               0                   0                12
##   Green_Vegetables_Consumption FriedPotato_Consumption
## 1                           16                      12
## 2                            0                       4
## 3                            3                      16
## 4                           30                       8
## 5                            4                       0
## 6                           12                      12

2.2 Ordinal Encoding

Columns encoded: General_Health

df$General_Health <- factor(df$General_Health,
                              levels = c('Poor', 'Fair', 'Good', 'Very Good', 'Excellent'),
                              ordered = TRUE)
df$General_Health <- as.integer(df$General_Health)

head(df)
##   General_Health                 Checkup Exercise Heart_Disease Skin_Cancer
## 1              1 Within the past 2 years        0             0           0
## 2              4    Within the past year        0             1           0
## 3              4    Within the past year        1             0           0
## 4              1    Within the past year        1             1           0
## 5              3    Within the past year        0             0           0
## 6              3    Within the past year        0             0           0
##   Other_Cancer Depression Diabetes Arthritis    Sex Age_Category Height_.cm.
## 1            0          0       No         1 Female        70-74         150
## 2            0          0      Yes         0 Female        70-74         165
## 3            0          0      Yes         0 Female        60-64         163
## 4            0          0      Yes         0   Male        75-79         180
## 5            0          0       No         0   Male          80+         191
## 6            0          1       No         1   Male        60-64         183
##   Weight_.kg.   BMI Smoking_History Alcohol_Consumption Fruit_Consumption
## 1       32.66 14.54               1                   0                30
## 2       77.11 28.29               0                   0                30
## 3       88.45 33.47               0                   4                12
## 4       93.44 28.73               0                   0                30
## 5       88.45 24.37               1                   0                 8
## 6      154.22 46.11               0                   0                12
##   Green_Vegetables_Consumption FriedPotato_Consumption
## 1                           16                      12
## 2                            0                       4
## 3                            3                      16
## 4                           30                       8
## 5                            4                       0
## 6                           12                      12

2.3 One-Hot Encoding

Columns encoded: Diabetes, Checkup, Sex, Age_Category

# Refined Grouping of Diabetes Categories
df$Diabetes <- recode(df$Diabetes,
                        'No' = 'No Diabetes',
                        'No, pre-diabetes or borderline diabetes' = 'Pre-diabetes',
                        'Yes' = 'Diabetes',
                        'Yes, but female told only during pregnancy' = 'Gestational Diabetes')

# Function for one-hot encoding
one_hot_encode <- function(df, columns) {
  # Create a copy of the original dataframe
  result_df <- df

  # Process each column
  for (col in columns) {
    # Create dummy variables
    dummy_matrix <- model.matrix(as.formula(paste0("~", col, " - 1")), data = result_df)

    # Add dummy columns to the result dataframe
    result_df <- cbind(result_df, dummy_matrix)

    # Remove the original column
    result_df[[col]] <- NULL
  }

  return(result_df)
}

# List of columns to encode
columns_to_encode <- c("Diabetes", "Checkup", "Sex", "Age_Category")

# Apply one-hot encoding to selected columns
df <- one_hot_encode(df, columns_to_encode)

head(df)
##   General_Health Exercise Heart_Disease Skin_Cancer Other_Cancer Depression
## 1              1        0             0           0            0          0
## 2              4        0             1           0            0          0
## 3              4        1             0           0            0          0
## 4              1        1             1           0            0          0
## 5              3        0             0           0            0          0
## 6              3        0             0           0            0          1
##   Arthritis Height_.cm. Weight_.kg.   BMI Smoking_History Alcohol_Consumption
## 1         1         150       32.66 14.54               1                   0
## 2         0         165       77.11 28.29               0                   0
## 3         0         163       88.45 33.47               0                   4
## 4         0         180       93.44 28.73               0                   0
## 5         0         191       88.45 24.37               1                   0
## 6         1         183      154.22 46.11               0                   0
##   Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 1                30                           16                      12
## 2                30                            0                       4
## 3                12                            3                      16
## 4                30                           30                       8
## 5                 8                            4                       0
## 6                12                           12                      12
##   DiabetesDiabetes DiabetesGestational Diabetes DiabetesNo Diabetes
## 1                0                            0                   1
## 2                1                            0                   0
## 3                1                            0                   0
## 4                1                            0                   0
## 5                0                            0                   1
## 6                0                            0                   1
##   DiabetesPre-diabetes Checkup5 or more years ago CheckupNever
## 1                    0                          0            0
## 2                    0                          0            0
## 3                    0                          0            0
## 4                    0                          0            0
## 5                    0                          0            0
## 6                    0                          0            0
##   CheckupWithin the past 2 years CheckupWithin the past 5 years
## 1                              1                              0
## 2                              0                              0
## 3                              0                              0
## 4                              0                              0
## 5                              0                              0
## 6                              0                              0
##   CheckupWithin the past year SexFemale SexMale Age_Category18-24
## 1                           0         1       0                 0
## 2                           1         1       0                 0
## 3                           1         1       0                 0
## 4                           1         0       1                 0
## 5                           1         0       1                 0
## 6                           1         0       1                 0
##   Age_Category25-29 Age_Category30-34 Age_Category35-39 Age_Category40-44
## 1                 0                 0                 0                 0
## 2                 0                 0                 0                 0
## 3                 0                 0                 0                 0
## 4                 0                 0                 0                 0
## 5                 0                 0                 0                 0
## 6                 0                 0                 0                 0
##   Age_Category45-49 Age_Category50-54 Age_Category55-59 Age_Category60-64
## 1                 0                 0                 0                 0
## 2                 0                 0                 0                 0
## 3                 0                 0                 0                 1
## 4                 0                 0                 0                 0
## 5                 0                 0                 0                 0
## 6                 0                 0                 0                 1
##   Age_Category65-69 Age_Category70-74 Age_Category75-79 Age_Category80+
## 1                 0                 1                 0               0
## 2                 0                 1                 0               0
## 3                 0                 0                 0               0
## 4                 0                 0                 1               0
## 5                 0                 0                 0               1
## 6                 0                 0                 0               0

3. Handling Imbalanced Classes

To avoid training a model that is biased toward predicting only one class, we first check the distribution of our target variable (Y), Heart_Disease.

# Check the class counts of the target variable

df %>% count(Heart_Disease)
##   Heart_Disease      n
## 1             0 283803
## 2             1  24971

The class imbalance is significant and may bias the model towards predicting the majority class, 'No Heart Disease'.

Hence, the SMOTE (Synthetic Minority Oversampling Technique) algorithm is used to balance the dataset, oversampling the minority class and undersampling the majority class.

# SMOTE approach
df$Heart_Disease <- as.factor(df$Heart_Disease)
balanced_data <- SMOTE(Heart_Disease ~ ., data = df, perc.over = 200, perc.under = 200, k=3)
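As a sanity check, the class counts reported below follow directly from these parameters: in DMwR's SMOTE, perc.over = 200 generates two synthetic cases per original minority case, and perc.under = 200 keeps two majority cases per synthetic case generated (k = 3 only sets the number of neighbours used for synthesis). A quick back-of-the-envelope check:

# Back-of-the-envelope check of DMwR::SMOTE's output sizes
n_minority  <- 24971                   # original 'Heart Disease' cases
n_synthetic <- n_minority * 200 / 100  # perc.over = 200 -> 49942 synthetic cases
n_minority + n_synthetic               # 74913 positive cases after SMOTE
n_synthetic * 200 / 100                # perc.under = 200 -> 99884 negatives kept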

We then check whether the dataset is balanced after applying SMOTE.

# Check the class distribution in the balanced dataset
table(balanced_data$Heart_Disease)
## 
##     0     1 
## 99884 74913

With 99884 negative cases and 74913 positive cases, the dataset is much more balanced than before SMOTE was applied, so we proceed with this balanced dataset.

4. Data Splitting

The data is split into training (80%) and testing (20%) sets.

# Split the original data into features (X) and target (y)
X <- df[, -which(names(df) == "Heart_Disease")]  # Features
y <- df$Heart_Disease  # Target

# Split the oversampled data into features (X) and target (y)
X_oversampled <- balanced_data[, -which(names(balanced_data) == "Heart_Disease")]  # Features
y_oversampled <- balanced_data$Heart_Disease  # Target

# Split the original data into training and testing sets (80% training)
set.seed(42)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]

# Split the oversampled data into training and testing sets (80% training)
set.seed(42)  # Ensure reproducibility
train_index_oversampled <- sample(1:nrow(balanced_data), 0.8 * nrow(balanced_data))
X_train_oversampled <- X_oversampled[train_index_oversampled, ]
y_train_oversampled <- y_oversampled[train_index_oversampled]
X_test_oversampled <- X_oversampled[-train_index_oversampled, ]
y_test_oversampled <- y_oversampled[-train_index_oversampled]
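One caveat: sample() draws rows uniformly at random, so the class ratio in each split can drift slightly from the full dataset. If exact stratification is desired, caret (already loaded) provides createDataPartition(); the following is a minimal sketch of that alternative, not the split used for the results below:

# Stratified 80/20 split that preserves the Heart_Disease class ratio
set.seed(42)
strat_index <- createDataPartition(y_oversampled, p = 0.8, list = FALSE)
X_train_strat <- X_oversampled[strat_index, ]
y_train_strat <- y_oversampled[strat_index]
X_test_strat  <- X_oversampled[-strat_index, ]
y_test_strat  <- y_oversampled[-strat_index]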

We also check the dimensions of both the training and test sets.

# Check the dimensions
cat("X_train dimensions:", dim(X_train), "\n")
## X_train dimensions: 247019 38
cat("y_train length:", length(y_train), "\n\n")
## y_train length: 247019
cat("X_train_oversampled dimensions:", dim(X_train_oversampled), "\n")
## X_train_oversampled dimensions: 139837 38
cat("y_train_oversampled length:", length(y_train_oversampled), "\n\n")
## y_train_oversampled length: 139837
cat("X_test dimensions:", dim(X_test), "\n")
## X_test dimensions: 61755 38
cat("y_test length:", length(y_test), "\n\n")
## y_test length: 61755
cat("X_test_oversampled dimensions:", dim(X_test_oversampled), "\n")
## X_test_oversampled dimensions: 34960 38
cat("y_test_oversampled length:", length(y_test_oversampled), "\n")
## y_test_oversampled length: 34960

5. Data Standardization

The z-score standardization method is applied to transform the data so that it has zero mean and unit variance. This process ensures that all features are on the same scale, which can improve the performance of many machine learning algorithms.
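Concretely, each column is transformed as z = (x - mean(x)) / sd(x), and the test set reuses the training set's means and standard deviations rather than computing its own. A minimal illustration of what scale() computes:

# scale() standardizes column-wise: z = (x - mean(x)) / sd(x)
x <- c(150, 165, 163, 180)       # e.g. a small column of heights in cm
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.vector(scale(x))
all.equal(z_manual, z_scale)     # TRUE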

# Standardize all feature columns (including dummies); the test sets reuse the training means and SDs
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test, center = attr(X_train_scaled, "scaled:center"),
                       scale = attr(X_train_scaled, "scaled:scale"))

X_train_oversampled_scaled <- scale(X_train_oversampled)
X_test_oversampled_scaled <- scale(X_test_oversampled, center = attr(X_train_oversampled_scaled, "scaled:center"),
                       scale = attr(X_train_oversampled_scaled, "scaled:scale"))
head(X_train_scaled)
##       General_Health   Exercise Skin_Cancer Other_Cancer Depression  Arthritis
## 61415      1.4241926  0.5393822  -0.3277424   -0.3270679 -0.5012777  1.4333171
## 54427     -0.5158081 -1.8539653  -0.3277424   -0.3270679  1.9948943  1.4333171
## 99566      1.4241926  0.5393822  -0.3277424   -0.3270679 -0.5012777  1.4333171
## 74364     -0.5158081 -1.8539653  -0.3277424   -0.3270679  1.9948943 -0.6976795
## 46208      0.4541923  0.5393822  -0.3277424   -0.3270679 -0.5012777 -0.6976795
## 61607     -0.5158081 -1.8539653  -0.3277424   -0.3270679  1.9948943  1.4333171
##       Height_.cm. Weight_.kg.        BMI Smoking_History Alcohol_Consumption
## 61415 -1.74527467  -1.6215758 -1.1551946       1.2093797          -0.4994683
## 54427 -0.71385652   0.2918438  0.8221726       1.2093797          -0.6214907
## 99566 -1.74527467  -1.3665782 -0.7949766      -0.8268668          -0.6214907
## 74364 -0.52632595   2.5671349  3.3912171      -0.8268668          -0.6214907
## 46208 -0.52632595   0.3344996  0.7133407      -0.8268668          -0.6214907
## 61607 -0.05749951   1.5040199  1.7342141      -0.8268668          -0.6214907
##       Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
## 61415         1.2150823                    0.9993139              -0.7359452
## 54427        -0.7156781                   -0.2076617              -0.3857589
## 99566        -0.5950056                   -1.0123122              -0.6192165
## 74364        -0.8765748                    0.3287719              -0.6192165
## 46208        -0.3938847                    0.3287719               1.5986302
## 61607        -0.5547814                   -0.8782038              -0.2690301
##       DiabetesDiabetes DiabetesGestational Diabetes DiabetesNo Diabetes
## 61415       -0.3858388                  -0.09339431           0.4373676
## 54427       -0.3858388                  -0.09339431           0.4373676
## 99566       -0.3858388                  -0.09339431           0.4373676
## 74364        2.5917458                  -0.09339431          -2.2863968
## 46208       -0.3858388                  -0.09339431           0.4373676
## 61607       -0.3858388                  -0.09339431           0.4373676
##       DiabetesPre-diabetes Checkup5 or more years ago CheckupNever
## 61415           -0.1511858                 -0.2130658  -0.06850995
## 54427           -0.1511858                 -0.2130658  -0.06850995
## 99566           -0.1511858                 -0.2130658  -0.06850995
## 74364           -0.1511858                 -0.2130658  -0.06850995
## 46208           -0.1511858                 -0.2130658  -0.06850995
## 61607           -0.1511858                 -0.2130658  -0.06850995
##       CheckupWithin the past 2 years CheckupWithin the past 5 years
## 61415                     -0.3702832                     -0.2447143
## 54427                      2.7006246                     -0.2447143
## 99566                     -0.3702832                     -0.2447143
## 74364                     -0.3702832                     -0.2447143
## 46208                     -0.3702832                     -0.2447143
## 61607                     -0.3702832                     -0.2447143
##       CheckupWithin the past year SexFemale    SexMale Age_Category18-24
## 61415                   0.5390883 0.9626064 -0.9626064        -0.2539108
## 54427                  -1.8549762 0.9626064 -0.9626064        -0.2539108
## 99566                   0.5390883 0.9626064 -0.9626064        -0.2539108
## 74364                   0.5390883 0.9626064 -0.9626064        -0.2539108
## 46208                   0.5390883 0.9626064 -0.9626064        -0.2539108
## 61607                   0.5390883 0.9626064 -0.9626064        -0.2539108
##       Age_Category25-29 Age_Category30-34 Age_Category35-39 Age_Category40-44
## 61415        -0.2295526        -0.2516358         -0.266983        -0.2740325
## 54427        -0.2295526        -0.2516358         -0.266983        -0.2740325
## 99566         4.3562826        -0.2516358         -0.266983        -0.2740325
## 74364        -0.2295526        -0.2516358         -0.266983        -0.2740325
## 46208        -0.2295526        -0.2516358         -0.266983         3.6491874
## 61607        -0.2295526        -0.2516358         -0.266983        -0.2740325
##       Age_Category45-49 Age_Category50-54 Age_Category55-59 Age_Category60-64
## 61415        -0.2701623        -0.2973456        -0.3171072         -0.342436
## 54427        -0.2701623        -0.2973456        -0.3171072          2.920242
## 99566        -0.2701623        -0.2973456        -0.3171072         -0.342436
## 74364        -0.2701623        -0.2973456         3.1534952         -0.342436
## 46208        -0.2701623        -0.2973456        -0.3171072         -0.342436
## 61607        -0.2701623         3.3630760        -0.3171072         -0.342436
##       Age_Category65-69 Age_Category70-74 Age_Category75-79 Age_Category80+
## 61415        -0.3478932        -0.3347261        -0.2683204       3.5905778
## 54427        -0.3478932        -0.3347261        -0.2683204      -0.2785056
## 99566        -0.3478932        -0.3347261        -0.2683204      -0.2785056
## 74364        -0.3478932        -0.3347261        -0.2683204      -0.2785056
## 46208        -0.3478932        -0.3347261        -0.2683204      -0.2785056
## 61607        -0.3478932        -0.3347261        -0.2683204      -0.2785056

Modeling

After the data is cleaned, we can train machine learning models on it and evaluate their performance in predicting true positive (TP) and true negative (TN) cases.

In our research, several models are trained and compared to determine the best-performing model for the task:

  1. Random Forest - an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy.

  2. Logistic Regression - a linear model typically used for classification tasks.

  3. XGBoost - a powerful gradient boosting technique that builds an ensemble of decision trees sequentially to minimize prediction errors.

1. Random Forest

For Random Forest, we set the number of trees to 100 (ntree = 100) and mtry to the square root of the number of features, the usual default for classification; all other parameters are left at their defaults.

# Train the Random Forest model
set.seed(42)

rf_model_oversampled <- randomForest(x = X_train_oversampled_scaled, y = as.factor(y_train_oversampled), ntree = 100, mtry = sqrt(ncol(X_train_oversampled_scaled)))

# Evaluate the model on the test set
rf_predictions_oversampled <- predict(rf_model_oversampled, X_test_oversampled_scaled)

# Create confusion matrix using caret
rf_confusion_oversampled <- confusionMatrix(rf_predictions_oversampled, as.factor(y_test_oversampled))

# Print confusion matrix and statistics
print(rf_confusion_oversampled)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17819  2112
##          1  2250 12779
##                                           
##                Accuracy : 0.8752          
##                  95% CI : (0.8717, 0.8787)
##     No Information Rate : 0.5741          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7452          
##                                           
##  Mcnemar's Test P-Value : 0.03805         
##                                           
##             Sensitivity : 0.8879          
##             Specificity : 0.8582          
##          Pos Pred Value : 0.8940          
##          Neg Pred Value : 0.8503          
##              Prevalence : 0.5741          
##          Detection Rate : 0.5097          
##    Detection Prevalence : 0.5701          
##       Balanced Accuracy : 0.8730          
##                                           
##        'Positive' Class : 0               
## 
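Incidentally, Random Forest also provides an out-of-bag (OOB) error estimate at no extra cost, since each tree can be scored on the bootstrap rows it never sampled. A quick way to inspect it on the model above (a sketch; the figures are not reproduced here):

# OOB estimate: each tree is evaluated on the ~1/3 of rows it did not sample
print(rf_model_oversampled)             # includes the OOB estimate of error rate
tail(rf_model_oversampled$err.rate, 1)  # OOB and per-class error at ntree = 100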

2. Logistic Regression

For Logistic Regression, we keep all parameters at their defaults.

# Combine X_train_oversampled_scaled and y_train_oversampled into a data.frame for training
set.seed(42)

train_data_oversampled <- data.frame(X_train_oversampled_scaled)  # Convert to data.frame
train_data_oversampled$Heart_Disease <- y_train_oversampled  # Add target variable as a column

# Train a Logistic Regression model
logistic_model_oversampled <- glm(Heart_Disease ~ ., data = train_data_oversampled, family = binomial)

# Convert X_test_oversampled_scaled to a data.frame for prediction
X_test_oversampled_scaled_df <- data.frame(X_test_oversampled_scaled)  # Convert matrix to data.frame

# Make predictions on the test set
logistic_predictions_oversampled <- predict(logistic_model_oversampled, newdata = X_test_oversampled_scaled_df, type = "response")

# Convert probabilities to class labels
logistic_class_predictions_oversampled <- ifelse(logistic_predictions_oversampled > 0.5, 1, 0)

# Confusion matrix
logistic_confusion_oversampled <- table(Predicted = logistic_class_predictions_oversampled, Actual = y_test_oversampled)

# Calculate precision, recall, F1-score
lr_confusion_oversampled <- confusionMatrix(logistic_confusion_oversampled)

# Print the metrics
print(lr_confusion_oversampled)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted     0     1
##         0 15741  4009
##         1  4328 10882
##                                         
##                Accuracy : 0.7615        
##                  95% CI : (0.757, 0.766)
##     No Information Rate : 0.5741        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.5137        
##                                         
##  Mcnemar's Test P-Value : 0.0004963     
##                                         
##             Sensitivity : 0.7843        
##             Specificity : 0.7308        
##          Pos Pred Value : 0.7970        
##          Neg Pred Value : 0.7155        
##              Prevalence : 0.5741        
##          Detection Rate : 0.4503        
##    Detection Prevalence : 0.5649        
##       Balanced Accuracy : 0.7576        
##                                         
##        'Positive' Class : 0             
## 
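The 0.5 cutoff used above is only a default choice; since ROCR was loaded earlier, a threshold-free summary such as the AUC can be computed from the same predicted probabilities. A minimal sketch:

# AUC of the logistic model from its predicted probabilities (via ROCR)
pred_obj <- prediction(logistic_predictions_oversampled,
                       as.numeric(as.character(y_test_oversampled)))
performance(pred_obj, measure = "auc")@y.values[[1]]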

3. XGBoost

For XGBoost, we use nrounds = 100, objective = "binary:logistic", eval_metric = "error", max_depth = 6, and eta = 0.1 as the model hyperparameters (verbose = 0 merely suppresses the training log).

set.seed(42)

# Convert the data to matrix format for xgboost
dtrain_oversampled <- xgb.DMatrix(data = X_train_oversampled_scaled, label = as.numeric(as.character(y_train_oversampled)))
dtest_oversampled <- xgb.DMatrix(data = X_test_oversampled_scaled, label = as.numeric(as.character(y_test_oversampled)))

# Train the XGBoost model
xgb_model_oversampled <- xgboost(data = dtrain_oversampled,
                     nrounds = 100,
                     objective = "binary:logistic",
                     eval_metric = "error",
                     max_depth = 6,
                     eta = 0.1,
                     verbose = 0)

# Predict on the test set
xgb_predictions_oversampled <- predict(xgb_model_oversampled, dtest_oversampled)

# Convert probabilities to class labels
xgb_class_predictions_oversampled <- ifelse(xgb_predictions_oversampled > 0.5, 1, 0)

# Confusion matrix
xgboost_confusion_oversampled <- table(Predicted = xgb_class_predictions_oversampled, Actual = y_test_oversampled)

# Calculate precision, recall, F1-score
xgb_confusion_oversampled <- confusionMatrix(xgboost_confusion_oversampled)

# Print the metrics
print(xgb_confusion_oversampled)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted     0     1
##         0 18533  3065
##         1  1536 11826
##                                           
##                Accuracy : 0.8684          
##                  95% CI : (0.8648, 0.8719)
##     No Information Rate : 0.5741          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7273          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9235          
##             Specificity : 0.7942          
##          Pos Pred Value : 0.8581          
##          Neg Pred Value : 0.8850          
##              Prevalence : 0.5741          
##          Detection Rate : 0.5301          
##    Detection Prevalence : 0.6178          
##       Balanced Accuracy : 0.8588          
##                                           
##        'Positive' Class : 0               
## 
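Note that nrounds = 100 was fixed in advance; if desired, xgb.cv with early stopping offers a data-driven way to choose it under the same hyperparameters. A sketch (not run for the results above):

# 5-fold CV with early stopping to pick the number of boosting rounds
set.seed(42)
cv <- xgb.cv(data = dtrain_oversampled,
             nrounds = 500, nfold = 5,
             objective = "binary:logistic", eval_metric = "error",
             max_depth = 6, eta = 0.1,
             early_stopping_rounds = 20, verbose = 0)
cv$best_iteration  # suggested value for nrounds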
set.seed(42)

# Extract Random Forest Metrics (Oversampled)
rf_accuracy_oversampled <- rf_confusion_oversampled$overall["Accuracy"]
rf_kappa_oversampled <- rf_confusion_oversampled$overall["Kappa"]
rf_sensitivity_oversampled <- rf_confusion_oversampled$byClass["Sensitivity"]
rf_specificity_oversampled <- rf_confusion_oversampled$byClass["Specificity"]
rf_f1_score_oversampled <- 2 * (rf_sensitivity_oversampled * rf_specificity_oversampled) /
                           (rf_sensitivity_oversampled + rf_specificity_oversampled)

# Extract Logistic Regression Metrics (Oversampled)
lr_accuracy_oversampled <- lr_confusion_oversampled$overall["Accuracy"]
lr_kappa_oversampled <- lr_confusion_oversampled$overall["Kappa"]
lr_sensitivity_oversampled <- lr_confusion_oversampled$byClass["Sensitivity"]
lr_specificity_oversampled <- lr_confusion_oversampled$byClass["Specificity"]
lr_f1_score_oversampled <- 2 * (lr_sensitivity_oversampled * lr_specificity_oversampled) /
                                 (lr_sensitivity_oversampled + lr_specificity_oversampled)

# Extract XGBoost Metrics (Oversampled)
xgb_accuracy_oversampled <- xgb_confusion_oversampled$overall["Accuracy"]
xgb_kappa_oversampled <- xgb_confusion_oversampled$overall["Kappa"]
xgb_sensitivity_oversampled <- xgb_confusion_oversampled$byClass["Sensitivity"]
xgb_specificity_oversampled <- xgb_confusion_oversampled$byClass["Specificity"]
xgb_f1_score_oversampled <- 2 * (xgb_sensitivity_oversampled * xgb_specificity_oversampled) /
                            (xgb_sensitivity_oversampled + xgb_specificity_oversampled)

# Combine Results into a Data Frame
results <- data.frame(
  Model = c("Random Forest (Oversampled)", "Logistic Regression (Oversampled)", "XGBoost (Oversampled)"),
  Accuracy = c(rf_accuracy_oversampled, lr_accuracy_oversampled, xgb_accuracy_oversampled),
  Kappa = c(rf_kappa_oversampled, lr_kappa_oversampled, xgb_kappa_oversampled),
  Sensitivity = c(rf_sensitivity_oversampled, lr_sensitivity_oversampled, xgb_sensitivity_oversampled),
  Specificity = c(rf_specificity_oversampled, lr_specificity_oversampled, xgb_specificity_oversampled),
  F1_Score = c(rf_f1_score_oversampled, lr_f1_score_oversampled, xgb_f1_score_oversampled)
)

# Print the results table
options(width = 200)
print(results)
##                               Model  Accuracy     Kappa Sensitivity Specificity  F1_Score
## 1       Random Forest (Oversampled) 0.8752288 0.7451653   0.8878868   0.8581694 0.8727752
## 2 Logistic Regression (Oversampled) 0.7615275 0.5137013   0.7843440   0.7307770 0.7566136
## 3             XGBoost (Oversampled) 0.8683924 0.7272685   0.9234640   0.7941710 0.8539513
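One caveat on the F1_Score column: the formula used above is the harmonic mean of sensitivity and specificity, which serves as a balanced score. The classical F1 is the harmonic mean of precision and recall; in recent versions of caret it can be read directly from the confusion-matrix object, e.g. for Random Forest:

# Classical F1 = 2 * precision * recall / (precision + recall)
precision_rf <- rf_confusion_oversampled$byClass["Precision"]
recall_rf    <- rf_confusion_oversampled$byClass["Recall"]
2 * precision_rf * recall_rf / (precision_rf + recall_rf)
# equivalently: rf_confusion_oversampled$byClass["F1"]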

Feature Engineering

Feature Selection

Two models, XGBoost and Random Forest, are very close in terms of performance. To decide which of them is better, we make some adjustments.

Since a large number of features (38 in total) is included, model accuracy might be affected by unimportant features being taken into account, even though their weight is not high. We expect this approach to let us evaluate both models' performance in more detail.

1. Random Forest

The feature importance of the Random Forest model is listed and visualized below.

set.seed(42)

# Get the feature importance
feature_importance <- importance(rf_model_oversampled)

# View the importance of each feature
print(feature_importance)
##                                MeanDecreaseGini
## General_Health                      10000.82371
## Exercise                              834.20764
## Skin_Cancer                           765.67324
## Other_Cancer                          759.03369
## Depression                            699.52976
## Arthritis                            2627.52361
## Height_.cm.                          4446.37616
## Weight_.kg.                          4063.74859
## BMI                                  4232.48413
## Smoking_History                      1350.71912
## Alcohol_Consumption                  3339.85824
## Fruit_Consumption                    3513.00921
## Green_Vegetables_Consumption         3873.89278
## FriedPotato_Consumption              4074.29542
## DiabetesDiabetes                     2064.75859
## DiabetesGestational Diabetes           64.70883
## DiabetesNo Diabetes                  1254.27544
## DiabetesPre-diabetes                  156.03778
## Checkup5 or more years ago            136.92034
## CheckupNever                           26.82495
## CheckupWithin the past 2 years        253.48442
## CheckupWithin the past 5 years        134.13701
## CheckupWithin the past year          1002.24777
## SexFemale                             706.45074
## SexMale                               687.33703
## Age_Category18-24                     446.26700
## Age_Category25-29                     309.34607
## Age_Category30-34                     344.17282
## Age_Category35-39                     470.05636
## Age_Category40-44                     423.44436
## Age_Category45-49                     326.58788
## Age_Category50-54                     371.50814
## Age_Category55-59                     425.36021
## Age_Category60-64                     551.60090
## Age_Category65-69                     747.05878
## Age_Category70-74                     987.85560
## Age_Category75-79                    1174.60183
## Age_Category80+                      2018.22466
library(ggplot2)

set.seed(42)

# Convert the importance to a data frame for plotting
importance_df <- data.frame(Feature = rownames(feature_importance),
                            Importance = feature_importance[, 1])

# Sort the features by importance
importance_df <- importance_df[order(-importance_df$Importance), ]

# Plot the feature importance with colors
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip the axes to make the labels readable
  theme_minimal() +
  labs(title = "Feature Importance - Random Forest Model",
       x = "Features",
       y = "Importance") +
  scale_fill_gradient(low = "skyblue", high = "darkblue")  # Color gradient from light to dark blue

2. XGBoost

The feature importance of the XGBoost model is listed and visualized below.

set.seed(42)

# Get the feature importance
importance_matrix <- xgb.importance(feature_names = colnames(X_train_oversampled_scaled), model = xgb_model_oversampled)
print(importance_matrix)
##                            Feature         Gain        Cover   Frequency
##                             <char>        <num>        <num>       <num>
##  1:                 General_Health 3.503962e-01 0.1392817414 0.109920720
##  2:        FriedPotato_Consumption 1.477095e-01 0.1606081417 0.119348618
##  3:   Green_Vegetables_Consumption 9.632695e-02 0.0985948607 0.104349689
##  4:                    Height_.cm. 6.674948e-02 0.0797638667 0.091279194
##  5:            Alcohol_Consumption 5.289063e-02 0.0696119751 0.065781016
##  6:                Age_Category80+ 3.768819e-02 0.0331598396 0.031283480
##  7:                      SexFemale 2.986418e-02 0.0329964936 0.029140776
##  8:               DiabetesDiabetes 2.369779e-02 0.0236185311 0.016284551
##  9:    CheckupWithin the past year 2.187385e-02 0.0152691382 0.009856439
## 10:                      Arthritis 2.104288e-02 0.0176480080 0.025069638
## 11:              Age_Category75-79 1.976007e-02 0.0235143768 0.016927362
## 12:              Age_Category70-74 1.714262e-02 0.0180759294 0.014141847
## 13:              Fruit_Consumption 1.450815e-02 0.0201968192 0.068780801
## 14:                Smoking_History 1.399932e-02 0.0294421316 0.017998714
## 15:              Age_Category35-39 1.135584e-02 0.0250020562 0.007928005
## 16:              Age_Category18-24 1.068694e-02 0.0328013585 0.007499464
## 17:              Age_Category40-44 9.631848e-03 0.0217373842 0.007713735
## 18:              Age_Category30-34 9.155879e-03 0.0256828673 0.006428112
## 19:              Age_Category25-29 8.766203e-03 0.0291999476 0.005142490
## 20:              Age_Category65-69 8.092437e-03 0.0109299274 0.008142276
## 21:                            BMI 6.274980e-03 0.0218891030 0.080137133
## 22:              Age_Category45-49 5.230579e-03 0.0187599988 0.005142490
## 23:                    Weight_.kg. 4.439254e-03 0.0134317213 0.055281766
## 24:              Age_Category60-64 3.416319e-03 0.0053892028 0.005142490
## 25:              Age_Category50-54 2.460800e-03 0.0126838332 0.006213842
## 26:                    Skin_Cancer 1.881751e-03 0.0043539841 0.016284551
## 27:                   Other_Cancer 1.705327e-03 0.0038374697 0.020569959
## 28:                     Depression 1.324213e-03 0.0044163783 0.017141633
## 29:            DiabetesNo Diabetes 8.184238e-04 0.0021255539 0.009856439
## 30:                       Exercise 3.982381e-04 0.0007600845 0.007499464
## 31:              Age_Category55-59 2.801692e-04 0.0029220981 0.002785515
## 32:     Checkup5 or more years ago 1.471372e-04 0.0013825412 0.003428327
## 33: CheckupWithin the past 2 years 1.142739e-04 0.0001911379 0.003642597
## 34: CheckupWithin the past 5 years 9.581601e-05 0.0004265850 0.002571245
## 35:           DiabetesPre-diabetes 7.382090e-05 0.0002949140 0.001285622
##                            Feature         Gain        Cover   Frequency
xgb.plot.importance(importance_matrix)

Based on these results, the top 25 features are chosen for each model and the remaining 13 features are dropped.

set.seed(42)

# Select top 25 features based on importance for Random Forest
top_25_features <- rownames(importance_df)[1:25]
top_25_features
##  [1] "General_Health"               "Height_.cm."                  "BMI"                          "FriedPotato_Consumption"      "Weight_.kg."                  "Green_Vegetables_Consumption"
##  [7] "Fruit_Consumption"            "Alcohol_Consumption"          "Arthritis"                    "DiabetesDiabetes"             "Age_Category80+"              "Smoking_History"             
## [13] "DiabetesNo Diabetes"          "Age_Category75-79"            "CheckupWithin the past year"  "Age_Category70-74"            "Exercise"                     "Skin_Cancer"                 
## [19] "Other_Cancer"                 "Age_Category65-69"            "SexFemale"                    "Depression"                   "SexMale"                      "Age_Category60-64"           
## [25] "Age_Category35-39"
# Select top 25 features based on importance for XGBoost
top_25_features_xgb <- head(importance_matrix[order(-importance_matrix$Gain), ], 25)
top_25_features_xgb <- top_25_features_xgb$Feature
top_25_features_xgb
##  [1] "General_Health"               "FriedPotato_Consumption"      "Green_Vegetables_Consumption" "Height_.cm."                  "Alcohol_Consumption"          "Age_Category80+"             
##  [7] "SexFemale"                    "DiabetesDiabetes"             "CheckupWithin the past year"  "Arthritis"                    "Age_Category75-79"            "Age_Category70-74"           
## [13] "Fruit_Consumption"            "Smoking_History"              "Age_Category35-39"            "Age_Category18-24"            "Age_Category40-44"            "Age_Category30-34"           
## [19] "Age_Category25-29"            "Age_Category65-69"            "BMI"                          "Age_Category45-49"            "Weight_.kg."                  "Age_Category60-64"           
## [25] "Age_Category50-54"
set.seed(42)

# The top 25 feature lists were already selected above

# Subset the training data to only include the top 25 features for Random Forest
X_train_top25 <- X_train_oversampled_scaled[, top_25_features]
X_test_top25 <- X_test_oversampled_scaled[, top_25_features]

# Subset the training data to only include the top 25 features for XGBoost
X_train_top25_xgb <- X_train_oversampled_scaled[, top_25_features_xgb]
X_test_top25_xgb <- X_test_oversampled_scaled[, top_25_features_xgb]

Then, we retrain both models, each using only the top 25 most important features identified for that model.

set.seed(42)

# Train the Random Forest model on the top 25 features
rf_model_top25 <- randomForest(x = X_train_top25, y = as.factor(y_train_oversampled), ntree = 100, mtry = sqrt(ncol(X_train_top25)))

# Evaluate the model on the test set with top 25 features
rf_predictions_top25 <- predict(rf_model_top25, X_test_top25)

# Create confusion matrix using caret for the Random Forest model with top 25 features
rf_confusion_top25 <- confusionMatrix(rf_predictions_top25, as.factor(y_test_oversampled))

# Print confusion matrix and statistics
print(rf_confusion_top25)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 18086  2323
##          1  1983 12568
##                                           
##                Accuracy : 0.8768          
##                  95% CI : (0.8733, 0.8803)
##     No Information Rate : 0.5741          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7474          
##                                           
##  Mcnemar's Test P-Value : 2.39e-07        
##                                           
##             Sensitivity : 0.9012          
##             Specificity : 0.8440          
##          Pos Pred Value : 0.8862          
##          Neg Pred Value : 0.8637          
##              Prevalence : 0.5741          
##          Detection Rate : 0.5173          
##    Detection Prevalence : 0.5838          
##       Balanced Accuracy : 0.8726          
##                                           
##        'Positive' Class : 0               
## 
set.seed(42)

# Convert the data to matrix format for xgboost
dtrain_top25 <- xgb.DMatrix(data = X_train_top25_xgb, label = as.numeric(as.character(y_train_oversampled)))
dtest_top25 <- xgb.DMatrix(data = X_test_top25_xgb, label = as.numeric(as.character(y_test_oversampled)))

# Train the XGBoost model on the top 25 features
xgb_model_top25 <- xgboost(data = dtrain_top25,
                     nrounds = 100,
                     objective = "binary:logistic",
                     eval_metric = "error",
                     max_depth = 6,
                     eta = 0.1,
                     verbose = 0)

# Predict on the test set using the top 25 features
xgb_predictions_top25 <- predict(xgb_model_top25, dtest_top25)

# Convert probabilities to class labels
xgb_class_predictions_top25 <- ifelse(xgb_predictions_top25 > 0.5, 1, 0)

# Confusion matrix
xgboost_confusion_top25 <- table(Predicted = xgb_class_predictions_top25, Actual = y_test_oversampled)

# Calculate precision, recall, F1-score
xgb_confusion_top25 <- confusionMatrix(xgboost_confusion_top25)

# Print the metrics
print(xgb_confusion_top25)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted     0     1
##         0 18495  3037
##         1  1574 11854
##                                           
##                Accuracy : 0.8681          
##                  95% CI : (0.8645, 0.8716)
##     No Information Rate : 0.5741          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7268          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9216          
##             Specificity : 0.7961          
##          Pos Pred Value : 0.8590          
##          Neg Pred Value : 0.8828          
##              Prevalence : 0.5741          
##          Detection Rate : 0.5290          
##    Detection Prevalence : 0.6159          
##       Balanced Accuracy : 0.8588          
##                                           
##        'Positive' Class : 0               
## 

The results are tabulated as follows.

set.seed(42)

# Oversampled metrics for Random Forest and XGBoost were already computed above and are reused here

# Extract Random Forest Metrics (Top 25)
rf_accuracy_top25 <- rf_confusion_top25$overall["Accuracy"]
rf_kappa_top25 <- rf_confusion_top25$overall["Kappa"]
rf_sensitivity_top25 <- rf_confusion_top25$byClass["Sensitivity"]
rf_specificity_top25 <- rf_confusion_top25$byClass["Specificity"]
rf_f1_score_top25 <- 2 * (rf_sensitivity_top25 * rf_specificity_top25) / (rf_sensitivity_top25 + rf_specificity_top25)

# Extract XGBoost Metrics (Top 25)
xgb_accuracy_top25 <- xgb_confusion_top25$overall["Accuracy"]
xgb_kappa_top25 <- xgb_confusion_top25$overall["Kappa"]
xgb_sensitivity_top25 <- xgb_confusion_top25$byClass["Sensitivity"]
xgb_specificity_top25 <- xgb_confusion_top25$byClass["Specificity"]
xgb_f1_score_top25 <- 2 * (xgb_sensitivity_top25 * xgb_specificity_top25) / (xgb_sensitivity_top25 + xgb_specificity_top25)

# Combine Results into a Data Frame
results <- data.frame(
  Model = c("Random Forest (Oversampled)", "XGBoost (Oversampled)","Random Forest (Top 25)", "XGBoost (Top 25)"),
  Accuracy = c(rf_accuracy_oversampled, xgb_accuracy_oversampled, rf_accuracy_top25, xgb_accuracy_top25),
  Kappa = c(rf_kappa_oversampled, xgb_kappa_oversampled, rf_kappa_top25, xgb_kappa_top25),
  Sensitivity = c(rf_sensitivity_oversampled, xgb_sensitivity_oversampled, rf_sensitivity_top25, xgb_sensitivity_top25),
  Specificity = c(rf_specificity_oversampled, xgb_specificity_oversampled, rf_specificity_top25, xgb_specificity_top25),
  F1_Score = c(rf_f1_score_oversampled, xgb_f1_score_oversampled, rf_f1_score_top25, xgb_f1_score_top25)
)

# Print the results table
options(width = 200)
print(results)
##                         Model  Accuracy     Kappa Sensitivity Specificity  F1_Score
## 1 Random Forest (Oversampled) 0.8752288 0.7451653   0.8878868   0.8581694 0.8727752
## 2       XGBoost (Oversampled) 0.8683924 0.7272685   0.9234640   0.7941710 0.8539513
## 3      Random Forest (Top 25) 0.8768307 0.7473921   0.9011909   0.8439997 0.8716582
## 4            XGBoost (Top 25) 0.8681064 0.7268341   0.9215706   0.7960513 0.8542246

After keeping only the 25 most important features, some drop in model performance would be expected, since 13 features are removed. However, Random Forest actually improved in accuracy, Kappa, and sensitivity, while XGBoost declined slightly in accuracy, Kappa, and sensitivity, with marginal gains in specificity and F1 score. Overall, Random Forest still performs slightly better than XGBoost, as its accuracy, specificity, and F1 score are all higher.
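Because the gap between the two Top-25 models is small, a paired test on the same test set would be a natural follow-up, for example McNemar's test on the cases where the two models disagree. A sketch using the predictions above:

# Paired comparison of RF (Top 25) and XGBoost (Top 25): McNemar's test
# considers only the test cases on which exactly one model is correct
rf_correct  <- rf_predictions_top25 == y_test_oversampled
xgb_correct <- xgb_class_predictions_top25 ==
               as.numeric(as.character(y_test_oversampled))
mcnemar.test(table(rf_correct, xgb_correct))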

Conclusion

From the first evaluation results, it is clear that Logistic Regression is the worst of the three models; the performance gap between it and both XGBoost and Random Forest is substantial. Hence, we restrict the remaining interpretation to the XGBoost and Random Forest results.

Based on our analysis of feature importance for both XGBoost and Random Forest, several features appear in the top 10 of both rankings: General Health, Height, Fried Potato Consumption, Green Vegetables Consumption, Alcohol Consumption, Arthritis, and Diabetes; Age Category (80+) also ranks in XGBoost's top 10 and BMI in Random Forest's. This indicates that age, BMI, general health status, and how much fried potato a person consumes are associated with higher odds of cardiovascular disease. Both models also offer an interesting insight: consumption of fruits, vegetables, and fried potatoes is highly important to both XGBoost and Random Forest, and a history of diabetes is likewise an important feature for both. This points to a potential medical research opportunity on whether overconsumption of carbohydrates can lead to cardiovascular disease.

As expected, even after including only the top 25 important features, Random Forest still performs slightly better than XGBoost, with a higher accuracy and F1 score. By our definition, the model that performs better in accuracy and F1 score is chosen as the best model. Hence, based on our results, Random Forest is the best model for predicting the likelihood of cardiovascular disease from patient demographic, lifestyle, and clinical features.

However, XGBoost performs slightly better than Random Forest in terms of the reported sensitivity (recall). One caveat: caret treats class 0 ('No Heart Disease') as the positive class in these confusion matrices, so the recall for detecting heart disease itself corresponds to the reported specificity. In the medical field, a model's recall on the disease class is a primary consideration alongside the F1 score. Hence, further study of XGBoost, especially feature engineering and hyperparameter tuning, is required to assess whether it can detect cardiovascular disease cases more accurately.