# Importing the 'modeest' package for estimating statistical modes or working with statistical distributions
library(modeest)
# Importing the 'ggplot2' package for creating visually appealing and customizable plots
library(ggplot2)
# Importing the 'lattice' package for creating conditioned plots
library(lattice)
# Importing the 'caret' package for machine learning tasks such as data preprocessing, model training, and performance evaluation
library(caret)
# Importing the 'mlbench' package for benchmark datasets commonly used in machine learning research
library(mlbench)
# Importing the 'dlookr' package for data exploration and outlier detection
library(dlookr)
## Registered S3 methods overwritten by 'dlookr':
## method from
## plot.transform scales
## print.transform scales
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:modeest':
##
## skewness
## The following object is masked from 'package:base':
##
## transform
# Importing the 'ROSE' package for dealing with imbalance data
library(ROSE)
## Loaded ROSE 0.0-4
# Reading the csv data file "Main data.csv" into the 'CHD' dataframe
CHD <- read.csv("Main data.csv", stringsAsFactors = TRUE, na.strings = TRUE)
# Generating a summary of the 'CHD' dataframe
summary(CHD)
## id age education sex is_smoking
## Min. : 0.0 Min. :32.00 Min. :1.000 F:1923 NO :1703
## 1st Qu.: 847.2 1st Qu.:42.00 1st Qu.:1.000 M:1467 YES:1687
## Median :1694.5 Median :49.00 Median :2.000
## Mean :1694.5 Mean :49.54 Mean :1.971
## 3rd Qu.:2541.8 3rd Qu.:56.00 3rd Qu.:3.000
## Max. :3389.0 Max. :70.00 Max. :4.000
## NA's :87
## cigsPerDay BPMeds prevalentStroke prevalentHyp
## Min. : 0.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean : 9.069 Mean :0.02989 Mean :0.00649 Mean :0.3153
## 3rd Qu.:20.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :70.000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## NA's :22 NA's :44
## diabetes totChol sysBP diaBP
## Min. :0.00000 Min. :107.0 Min. : 83.5 Min. : 48.00
## 1st Qu.:0.00000 1st Qu.:206.0 1st Qu.:117.0 1st Qu.: 74.50
## Median :0.00000 Median :234.0 Median :128.5 Median : 82.00
## Mean :0.02566 Mean :237.1 Mean :132.6 Mean : 82.88
## 3rd Qu.:0.00000 3rd Qu.:264.0 3rd Qu.:144.0 3rd Qu.: 90.00
## Max. :1.00000 Max. :696.0 Max. :295.0 Max. :142.50
## NA's :38
## BMI heartRate glucose TenYearCHD
## Min. :15.96 Min. : 45.00 Min. : 40.00 Min. :0.0000
## 1st Qu.:23.02 1st Qu.: 68.00 1st Qu.: 71.00 1st Qu.:0.0000
## Median :25.38 Median : 75.00 Median : 78.00 Median :0.0000
## Mean :25.79 Mean : 75.98 Mean : 82.09 Mean :0.1507
## 3rd Qu.:28.04 3rd Qu.: 83.00 3rd Qu.: 87.00 3rd Qu.:0.0000
## Max. :56.80 Max. :143.00 Max. :394.00 Max. :1.0000
## NA's :14 NA's :1 NA's :304
Here’s a brief description of the summary result:
id: The identifier variable with a range of values
from 0 to 3389.
age: The age of the individuals in the dataset,
ranging from 32 to 70 years.
education: The level of education, represented by
numeric values ranging from 1 to 4.
sex: The gender of the individuals, with “F”
representing females and “M” representing males.
is_smoking: Indicates whether the individual is a
smoker, with “NO” indicating non-smokers and “YES” indicating
smokers.
cigsPerDay: The average number of cigarettes smoked
per day, ranging from 0 to 70.
BPMeds: Indicates whether the individual is taking
blood pressure medication, with values of 0 or 1.
prevalentStroke: Indicates whether the individual
had a prevalent stroke, with values of 0 or 1.
prevalentHyp: Indicates whether the individual has
prevalent hypertension, with values of 0 or 1.
diabetes: Indicates whether the individual has
diabetes, with values of 0 or 1.
totChol: Total cholesterol level, ranging from 107
to 696.
sysBP: Systolic blood pressure, ranging from 83.5 to
295.
diaBP: Diastolic blood pressure, ranging from 48 to
142.5.
BMI: Body Mass Index, ranging from 15.96 to
56.8.
heartRate: Heart rate in beats per minute, ranging
from 45 to 143.
glucose: Blood glucose level, ranging from 40 to
394.
TenYearCHD: Indicates whether the individual had a
coronary heart disease event within ten years, with values of 0 or
1.
.The summary result also indicates the presence of missing values (NA) in certain variables, as denoted by the number of NA entries mentioned under each variable.
# Displaying the structure of the 'CHD' dataframe
str(CHD)
## 'data.frame': 3390 obs. of 17 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ age : int 64 36 46 50 64 61 61 36 41 55 ...
## $ education : num 2 4 1 1 1 3 1 4 2 2 ...
## $ sex : Factor w/ 2 levels "F","M": 1 2 1 2 1 1 2 2 1 1 ...
## $ is_smoking : Factor w/ 2 levels "NO","YES": 2 1 2 2 2 1 1 2 2 1 ...
## $ cigsPerDay : num 3 0 10 20 30 0 0 35 20 0 ...
## $ BPMeds : num 0 0 0 0 0 0 0 0 NA 0 ...
## $ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentHyp : int 0 1 0 1 0 1 1 0 0 1 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ totChol : num 221 212 250 233 241 272 238 295 220 326 ...
## $ sysBP : num 148 168 116 158 136 ...
## $ diaBP : num 85 98 71 88 85 121 136 68 78 81 ...
## $ BMI : num NA 29.8 20.4 28.3 26.4 ...
## $ heartRate : num 90 72 88 68 70 85 75 60 86 85 ...
## $ glucose : num 80 75 94 94 77 65 79 63 79 NA ...
## $ TenYearCHD : int 1 0 0 1 0 1 0 0 0 0 ...
Here’s a brief description of the displayed information:
The description below provides details about each variable/column:
id: An integer variable representing the identifier,
ranging from 0 to higher values.age: An integer variable representing the age of
individuals.education: A numeric variable representing the level of
education.sex: A factor variable indicating the gender, with two
levels “F” and “M”.is_smoking: A factor variable indicating smoking
status, with two levels “NO” and “YES”.cigsPerDay: A numeric variable representing the average
number of cigarettes smoked per day.BPMeds: A numeric variable indicating whether the
individual is taking blood pressure medication.prevalentStroke: An integer variable indicating whether
the individual had a prevalent stroke.prevalentHyp: An integer variable indicating whether
the individual has prevalent hypertension.diabetes: An integer variable indicating whether the
individual has diabetes.totChol: A numeric variable representing total
cholesterol level.sysBP: A numeric variable representing systolic blood
pressure.diaBP: A numeric variable representing diastolic blood
pressure.BMI: A numeric variable representing Body Mass
Index.heartRate: A numeric variable representing heart rate
in beats per minute.glucose: A numeric variable representing blood glucose
level.TenYearCHD: An integer variable indicating whether the
individual had a coronary heart disease event within ten years.# Creating a new dataframe 'CHD_NEW' by excluding the 'id' and 'education' columns from the 'CHD' dataframe
CHD_NEW <- CHD[, -c(1, 3)]
# Converting selected variables in the 'CHD_NEW' dataframe to factor type
CHD_NEW$BPMeds <- as.factor(CHD_NEW$BPMeds)
CHD_NEW$prevalentStroke <- as.factor(CHD_NEW$prevalentStroke)
CHD_NEW$prevalentHyp <- as.factor(CHD_NEW$prevalentHyp)
CHD_NEW$diabetes <- as.factor(CHD_NEW$diabetes)
CHD_NEW$TenYearCHD <- as.factor(CHD_NEW$TenYearCHD)
# Calculating the column-wise sum of missing values in the 'CHD_NEW' dataframe
colSums(is.na(CHD_NEW))
## age sex is_smoking cigsPerDay BPMeds
## 0 0 0 22 44
## prevalentStroke prevalentHyp diabetes totChol sysBP
## 0 0 0 38 0
## diaBP BMI heartRate glucose TenYearCHD
## 0 14 1 304 0
# Imputing missing values with the corresponding median value for numerical variables
CHD_NEW$cigsPerDay[is.na(CHD_NEW$cigsPerDay)] <- median(CHD_NEW$cigsPerDay, na.rm = TRUE)
CHD_NEW$totChol[is.na(CHD_NEW$totChol)] <- median(CHD_NEW$totChol, na.rm = TRUE)
CHD_NEW$BMI[is.na(CHD_NEW$BMI)] <- median(CHD_NEW$BMI, na.rm = TRUE)
CHD_NEW$heartRate[is.na(CHD_NEW$heartRate)] <- median(CHD_NEW$heartRate, na.rm = TRUE)
CHD_NEW$glucose[is.na(CHD_NEW$glucose)] <- median(CHD_NEW$glucose, na.rm = TRUE)
#Imputing missing values in 'BPMeds' column with the most frequent value(mode)
CHD_NEW$BPMeds[is.na(CHD_NEW$BPMeds)] <- mfv(CHD_NEW$BPMeds, na_rm = TRUE)
From the output:
# Removing duplicate rows
CHD_NEW <- unique(CHD_NEW)
# Converting 'is_smoking' column to binary indicator
CHD_NEW$is_smoking <- ifelse(CHD_NEW$is_smoking == "YES", 1, 0)
# Converting 'sex' column to binary indicator
CHD_NEW$sex <- ifelse(CHD_NEW$sex == "M", 1, 0)
# Converting 'is_smoking' and 'sex' columns to factors
CHD_NEW$is_smoking <- as.factor(CHD_NEW$is_smoking)
CHD_NEW$sex <- as.factor(CHD_NEW$sex)
# Selecting numerical variables from CHD_NEW dataframe
numerical_vars <- CHD_NEW[, sapply(CHD_NEW, is.numeric)]
# Plot outliers in numerical variables
plot_outlier(numerical_vars)
# Identify nature of outliers using boxplot statistics
boxplot.stats(CHD_NEW$age)$out
## integer(0)
boxplot.stats(CHD_NEW$cigsPerDay)$out
## [1] 60 60 60 60 60 60 60 70 60
boxplot.stats(CHD_NEW$totChol)$out
## [1] 382 391 439 600 356 696 361 382 464 391 362 390 360 352 366 366 370 372 365
## [20] 113 392 353 367 390 398 453 358 363 107 352 354 373 410 380 366 432 410 358
## [39] 355 364 352 410 352
boxplot.stats(CHD_NEW$sysBP)$out
## [1] 232.0 185.0 186.5 193.0 192.0 207.0 202.0 244.0 186.0 191.0 191.5 193.0
## [13] 195.0 185.5 190.0 212.0 210.0 202.5 215.0 196.0 248.0 215.0 195.0 208.0
## [25] 194.0 204.0 186.0 188.0 214.0 188.5 187.0 192.5 197.0 185.0 197.0 190.0
## [37] 186.5 190.0 196.0 230.0 187.0 185.0 205.0 205.0 200.0 199.0 185.0 210.0
## [49] 189.0 200.0 191.0 198.0 200.0 193.0 200.0 209.0 195.0 199.0 189.0 295.0
## [61] 235.0 204.0 188.0 195.0 198.0 197.5 190.0 199.0 188.0 206.0 215.0 210.0
## [73] 243.0 205.5 192.5 193.0 189.0 192.0 189.0 187.0 187.5 206.0 191.0 206.0
## [85] 196.0 185.0 195.0 196.0 195.0 190.0 220.0 217.0 189.0 185.0 213.0 188.0
## [97] 190.0 185.0 220.0 192.5 213.0 210.0 207.5 197.5 187.0
boxplot.stats(CHD_NEW$diaBP)$out
## [1] 121.0 136.0 121.0 120.0 122.5 114.0 124.0 124.0 115.5 130.0 129.0 48.0
## [13] 116.0 130.0 118.0 136.0 115.0 114.0 130.0 117.0 50.0 122.5 135.0 120.0
## [25] 125.0 133.0 135.0 118.0 130.0 123.0 128.0 117.5 125.0 115.0 114.0 120.0
## [37] 123.0 120.0 142.5 115.0 118.0 115.0 121.0 124.5 116.0 120.0 119.0 114.0
## [49] 121.0 130.0 118.0 125.0 133.0 127.5 118.0 118.0 51.0 119.0
boxplot.stats(CHD_NEW$BMI)$out
## [1] 35.58 43.69 36.54 35.96 35.62 38.42 36.11 38.14 44.09 35.99 39.94 39.82
## [13] 40.21 38.96 38.53 37.38 39.69 35.85 38.54 37.10 39.94 36.07 35.78 38.43
## [25] 51.28 41.29 36.91 38.31 56.80 39.21 38.11 36.65 39.04 35.68 44.27 39.17
## [37] 38.75 45.79 36.81 38.46 36.21 37.48 36.52 40.38 36.62 37.15 38.82 35.53
## [49] 42.00 43.48 37.04 37.62 40.58 39.22 37.02 36.12 42.53 38.17 40.08 37.41
## [61] 37.30 40.11 40.81 39.53 40.52 36.18 39.08 44.55 39.91 38.39 36.01 41.61
## [73] 44.71 43.67 45.80 38.61 36.29 38.06
boxplot.stats(CHD_NEW$heartRate)$out
## [1] 110 110 110 112 110 120 110 110 106 110 110 115 110 107 110 143 110 110 112
## [20] 110 107 45 110 120 108 115 110 108 110 110 110 125 125 120 115 125 110 122
## [39] 110 110 122 110 106 110 120 110 140 108 115 120 110 110 108 110 108 110 108
## [58] 110 110 112 108 110 110 110
boxplot.stats(CHD_NEW$glucose)$out
## [1] 132 105 170 107 177 118 120 173 163 205 115 120 268 137 106 348 110 120
## [19] 115 106 122 108 255 140 105 124 244 106 115 294 112 113 254 132 116 52
## [37] 50 120 110 114 127 140 113 135 45 206 140 105 105 105 117 112 107 126
## [55] 113 45 131 113 114 116 394 116 120 107 123 114 118 270 107 320 129 112
## [73] 107 115 142 106 112 107 120 52 235 116 115 40 126 117 118 150 120 136
## [91] 113 107 118 112 191 117 44 394 124 173 127 50 119 117 109 113 105 108
## [109] 110 112 121 43 117 108 256 113 145 113 105 117 47 297 118 166 172 48
## [127] 132 111 106 107 113 115 126 110 110 127 122 215 113 216 368 118 123 155
## [145] 105 148 147 121 115 116 117 108 115 115 105 135 105 123 123 223 106 115
## [163] 113 225 332 105 106 115 117 118 210 112 110 115 107 186 110 136 110 112
## [181] 117 109 115 386 150 137 248 107 108 127 170 193 125 167 47 183 112 110
## [199] 108 120 215 137 113 108 45 260 44 115 274 108 107 112 206 113
# Selecting numerical variables from CHD_NEW dataframe
numerical_vars <- CHD_NEW[, sapply(CHD_NEW, is.numeric)]
# Calculating correlation matrix and rounding to 2 decimal places
round(cor(numerical_vars), 2)
## age cigsPerDay totChol sysBP diaBP BMI heartRate glucose
## age 1.00 -0.19 0.27 0.40 0.22 0.14 0.00 0.11
## cigsPerDay -0.19 1.00 -0.03 -0.10 -0.07 -0.10 0.07 -0.07
## totChol 0.27 -0.03 1.00 0.20 0.15 0.11 0.09 0.06
## sysBP 0.40 -0.10 0.20 1.00 0.78 0.33 0.18 0.14
## diaBP 0.22 -0.07 0.15 0.78 1.00 0.38 0.17 0.07
## BMI 0.14 -0.10 0.11 0.33 0.38 1.00 0.07 0.09
## heartRate 0.00 0.07 0.09 0.18 0.17 0.07 1.00 0.08
## glucose 0.11 -0.07 0.06 0.14 0.07 0.09 0.08 1.00
Here is a brief description of the correlations:
# Attaching the CHD_NEW data frame
attach(CHD_NEW)
# Building the logistic regression model
logit_model <- glm(TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose,
family = binomial)
# Summarizing the logistic regression model
summary(logit_model)
##
## Call:
## glm(formula = TenYearCHD ~ age + sex + is_smoking + cigsPerDay +
## BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol +
## sysBP + diaBP + BMI + heartRate + glucose, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5517 -0.5936 -0.4273 -0.2902 2.8086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.360407 0.716824 -11.663 < 2e-16 ***
## age 0.062776 0.006917 9.075 < 2e-16 ***
## sex1 0.481424 0.113386 4.246 2.18e-05 ***
## is_smoking1 0.136236 0.159861 0.852 0.394094
## cigsPerDay 0.021421 0.006308 3.396 0.000684 ***
## BPMeds1 0.183600 0.249342 0.736 0.461524
## prevalentStroke1 1.179526 0.472153 2.498 0.012483 *
## prevalentHyp1 0.187982 0.144761 1.299 0.194092
## diabetes1 0.050892 0.333957 0.152 0.878878
## totChol 0.002585 0.001126 2.295 0.021722 *
## sysBP 0.015899 0.003942 4.034 5.49e-05 ***
## diaBP -0.006661 0.006552 -1.017 0.309294
## BMI 0.001157 0.013184 0.088 0.930098
## heartRate -0.002666 0.004354 -0.612 0.540277
## glucose 0.008748 0.002424 3.609 0.000307 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2874.6 on 3389 degrees of freedom
## Residual deviance: 2541.1 on 3375 degrees of freedom
## AIC: 2571.1
##
## Number of Fisher Scoring iterations: 5
.Age, Sex, Cigarettes per Day, Prevalent Stroke, Total Cholesterol, Systolic Blood Pressure, and Glucose demonstrated a statistically significant relationship with the TenYearCHD outcome. These variables have shown a significant impact on predicting the likelihood of TenYearCHD when accounting for other variables in the model.
# Perform stepwise selection on the logistic regression model
logit_modelB <- step(logit_model)
## Start: AIC=2571.15
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI +
## heartRate + glucose
##
## Df Deviance AIC
## - BMI 1 2541.2 2569.2
## - diabetes 1 2541.2 2569.2
## - heartRate 1 2541.5 2569.5
## - BPMeds 1 2541.7 2569.7
## - is_smoking 1 2541.9 2569.9
## - diaBP 1 2542.2 2570.2
## - prevalentHyp 1 2542.8 2570.8
## <none> 2541.2 2571.2
## - totChol 1 2546.3 2574.3
## - prevalentStroke 1 2547.1 2575.1
## - cigsPerDay 1 2552.6 2580.6
## - glucose 1 2554.9 2582.9
## - sysBP 1 2557.4 2585.4
## - sex 1 2559.2 2587.2
## - age 1 2626.4 2654.4
##
## Step: AIC=2569.16
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + diabetes + totChol + sysBP + diaBP + heartRate +
## glucose
##
## Df Deviance AIC
## - diabetes 1 2541.2 2567.2
## - heartRate 1 2541.5 2567.5
## - BPMeds 1 2541.7 2567.7
## - is_smoking 1 2541.9 2567.9
## - diaBP 1 2542.2 2568.2
## - prevalentHyp 1 2542.8 2568.8
## <none> 2541.2 2569.2
## - totChol 1 2546.4 2572.4
## - prevalentStroke 1 2547.1 2573.1
## - cigsPerDay 1 2552.6 2578.6
## - glucose 1 2554.9 2580.9
## - sysBP 1 2557.4 2583.4
## - sex 1 2559.3 2585.3
## - age 1 2626.5 2652.5
##
## Step: AIC=2567.18
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + totChol + sysBP + diaBP + heartRate + glucose
##
## Df Deviance AIC
## - heartRate 1 2541.6 2565.6
## - BPMeds 1 2541.7 2565.7
## - is_smoking 1 2541.9 2565.9
## - diaBP 1 2542.2 2566.2
## - prevalentHyp 1 2542.9 2566.9
## <none> 2541.2 2567.2
## - totChol 1 2546.4 2570.4
## - prevalentStroke 1 2547.1 2571.1
## - cigsPerDay 1 2552.6 2576.6
## - sysBP 1 2557.5 2581.5
## - sex 1 2559.4 2583.4
## - glucose 1 2565.6 2589.6
## - age 1 2626.6 2650.6
##
## Step: AIC=2565.56
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + totChol + sysBP + diaBP + glucose
##
## Df Deviance AIC
## - BPMeds 1 2542.1 2564.1
## - is_smoking 1 2542.2 2564.2
## - diaBP 1 2542.7 2564.7
## - prevalentHyp 1 2543.2 2565.2
## <none> 2541.6 2565.6
## - totChol 1 2546.6 2568.6
## - prevalentStroke 1 2547.6 2569.6
## - cigsPerDay 1 2552.8 2574.8
## - sysBP 1 2557.7 2579.7
## - sex 1 2560.8 2582.8
## - glucose 1 2565.6 2587.6
## - age 1 2628.1 2650.1
##
## Step: AIC=2564.14
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + prevalentStroke +
## prevalentHyp + totChol + sysBP + diaBP + glucose
##
## Df Deviance AIC
## - is_smoking 1 2542.8 2562.8
## - diaBP 1 2543.3 2563.3
## - prevalentHyp 1 2543.9 2563.9
## <none> 2542.1 2564.1
## - totChol 1 2547.3 2567.3
## - prevalentStroke 1 2548.7 2568.7
## - cigsPerDay 1 2553.3 2573.3
## - sysBP 1 2559.2 2579.2
## - sex 1 2561.2 2581.2
## - glucose 1 2566.4 2586.4
## - age 1 2628.7 2648.7
##
## Step: AIC=2562.83
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + prevalentHyp +
## totChol + sysBP + diaBP + glucose
##
## Df Deviance AIC
## - diaBP 1 2544.0 2562.0
## - prevalentHyp 1 2544.6 2562.6
## <none> 2542.8 2562.8
## - totChol 1 2548.0 2566.0
## - prevalentStroke 1 2549.3 2567.3
## - sysBP 1 2559.8 2577.8
## - sex 1 2561.8 2579.8
## - glucose 1 2567.0 2585.0
## - cigsPerDay 1 2574.8 2592.8
## - age 1 2628.8 2646.8
##
## Step: AIC=2562.03
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + prevalentHyp +
## totChol + sysBP + glucose
##
## Df Deviance AIC
## - prevalentHyp 1 2545.4 2561.4
## <none> 2544.0 2562.0
## - totChol 1 2549.2 2565.2
## - prevalentStroke 1 2550.4 2566.4
## - sex 1 2562.1 2578.1
## - sysBP 1 2564.1 2580.1
## - glucose 1 2568.8 2584.8
## - cigsPerDay 1 2576.6 2592.6
## - age 1 2637.9 2653.9
##
## Step: AIC=2561.43
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + totChol +
## sysBP + glucose
##
## Df Deviance AIC
## <none> 2545.4 2561.4
## - totChol 1 2550.7 2564.7
## - prevalentStroke 1 2552.1 2566.1
## - sex 1 2563.9 2577.9
## - glucose 1 2569.9 2583.9
## - cigsPerDay 1 2577.7 2591.7
## - sysBP 1 2594.5 2608.5
## - age 1 2640.0 2654.0
# Generate a summary of the updated model's results
summary(logit_modelB)
##
## Call:
## glm(formula = TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke +
## totChol + sysBP + glucose, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6384 -0.5904 -0.4308 -0.2878 2.8699
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.052122 0.492851 -18.367 < 2e-16 ***
## age 0.064384 0.006741 9.552 < 2e-16 ***
## sex1 0.476141 0.111004 4.289 1.79e-05 ***
## cigsPerDay 0.025005 0.004358 5.737 9.63e-09 ***
## prevalentStroke1 1.238352 0.466334 2.656 0.00792 **
## totChol 0.002593 0.001123 2.308 0.02101 *
## sysBP 0.015711 0.002237 7.022 2.18e-12 ***
## glucose 0.008946 0.001828 4.895 9.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2874.6 on 3389 degrees of freedom
## Residual deviance: 2545.4 on 3382 degrees of freedom
## AIC: 2561.4
##
## Number of Fisher Scoring iterations: 5
# Calculate the odds ratios for the variables in the logistic regression model
# using the logit_modelB object
exp(coef(logit_modelB))
## (Intercept) age sex1 cigsPerDay
## 0.0001171422 1.0665022189 1.6098499377 1.0253197684
## prevalentStroke1 totChol sysBP glucose
## 3.4499223834 1.0025961748 1.0158348165 1.0089861372
Based on odds ratios, here is the interpretation of the effects of the independent variables on the dependent variable:
Intercept: The odds of the dependent variable occurring when all the independent variables are zero is extremely low (approximately 0.000117). This can be interpreted as the baseline odds of the dependent variable.
Age: For every one-unit increase in age, the odds of the dependent variable increase by approximately 6.7%. This suggests that older age is associated with a slightly higher likelihood of the dependent variable occurring.
Sex: Being male (sex1) increases the odds of the dependent variable occurring by approximately 61% compared to being female (sex0). This indicates that males have a higher likelihood of the dependent variable occurring compared to females.
cigsPerDay: For every one-unit increase in the number of cigarettes smoked per day, the odds of the dependent variable occurring increase by approximately 2.5%. This suggests that higher cigarette consumption is associated with a slightly higher likelihood of the dependent variable occurring.
prevalentStroke: Individuals with a prevalent stroke (prevalentStroke1) have approximately 244% higher odds of the dependent variable occurring compared to those without a prevalent stroke (prevalentStroke0). This suggests that a previous stroke is strongly associated with an increased likelihood of the dependent variable occurring.
totChol: For every one-unit increase in total cholesterol levels, the odds of the dependent variable occurring increase by approximately 0.3%. This suggests that higher total cholesterol levels are associated with a slightly higher likelihood of the dependent variable occurring.
sysBP: For every one-unit increase in systolic blood pressure, the odds of the dependent variable occurring increase by approximately 1.6%. This indicates that higher systolic blood pressure is associated with a slightly higher likelihood of the dependent variable occurring.
glucose: For every one-unit increase in glucose levels, the odds of the dependent variable occurring increase by approximately 0.9%. This suggests that higher glucose levels are associated with a slightly higher likelihood of the dependent variable occurring.
# Set the random seed for reproducibility
set.seed(1994)
# Determine the number of rows in the 'CHD_NEW' data frame
n_rows <- nrow(CHD_NEW)
# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx <- sample(n_rows, n_rows*0.7)
# Create the training and testing data sets using the sampled indices
trainData <- CHD_NEW[idx,]
testData <- CHD_NEW[-idx,]
# Counting the number of occurrences of each class in the original trainData
table(trainData$TenYearCHD)
##
## 0 1
## 2031 342
# Applying the ROSE (Random Over-Sampling Examples) algorithm to balance the data
balanced_data <- ROSE(TenYearCHD ~ . , data = trainData)
balanced_data <- balanced_data$data
# Counting the number of occurrences of each class in the balanced_data
table(balanced_data$TenYearCHD)
##
## 0 1
## 1193 1180
# Define the formula for the random forest model using the relevant features
rf_formula <- TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose
# Set up cross-validation parameters for the model (100 iterations)
parameters_cv <- trainControl(method = 'CV', number = 100, savePredictions = "final")
# Train the random forest model
rf_model_CV <- train(rf_formula, data = balanced_data, method='rf', trControl=parameters_cv)
# Display the cross-validation model
rf_model_CV
## Random Forest
##
## 2373 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 2349, 2349, 2350, 2350, 2349, 2349, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7440761 0.4876104
## 8 0.7403986 0.4804163
## 14 0.7371920 0.4740123
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Make predictions
predictions_CV <- predict(rf_model_CV, testData[,-15], type = 'raw')
# Computing the confusion matrix for predictions_CV and actual labels in testData
co_matrix_cv <- confusionMatrix(predictions_CV, testData$TenYearCHD)
co_matrix_cv
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 710 100
## 1 138 69
##
## Accuracy : 0.766
## 95% CI : (0.7387, 0.7917)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.2253
##
## Mcnemar's Test P-Value : 0.01647
##
## Sensitivity : 0.8373
## Specificity : 0.4083
## Pos Pred Value : 0.8765
## Neg Pred Value : 0.3333
## Prevalence : 0.8338
## Detection Rate : 0.6981
## Detection Prevalence : 0.7965
## Balanced Accuracy : 0.6228
##
## 'Positive' Class : 0
##
. The performance metrics scores offer invaluable insights into the model’s effectiveness. With an accuracy rate of 76.79%, the model demonstrates a notable proportion of correct predictions. Notably, a sensitivity of 83.73% highlights the model’s exceptional capability in accurately identifying positive cases (CHD events), while a specificity of 40.83% indicates its moderate proficiency in correctly pinpointing negative cases (non-CHD events). These findings strongly advocate for the RF algorithm as the preferred choice for predicting the likelihood of CHD events, leveraging a comprehensive set of independent variables.
#dropping insignificant variables
CHD_NEW_NEW <- CHD_NEW[, -c(3,5,7,8,11,12,13)]
# Set the random seed for reproducibility
set.seed(1994)
# Determine the number of rows in the 'tree_data' data frame
n_rows <- nrow(CHD_NEW_NEW)
# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx2 <- sample(n_rows, n_rows*0.7)
# Create the training and testing data sets using the sampled indices
trainDataa <- CHD_NEW_NEW[idx,]
testDataa <- CHD_NEW_NEW[-idx,]
# Counting the number of occurrences of each class in the original trainData
table(trainDataa$TenYearCHD)
##
## 0 1
## 2031 342
# Applying the ROSE (Random Over-Sampling Examples) algorithm to balance the data
balanced_data2 <- ROSE(TenYearCHD ~ . , data = trainDataa)
balanced_data2 <- balanced_data$data
# Counting the number of occurrences of each class in the balanced_data
table(balanced_data2$TenYearCHD)
## < table of extent 0 >
# Define the formula for the random forest model using the relevant features
rf_formula2 <- TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + totChol + sysBP + glucose
# Set up cross-validation parameters for the model (1000 iterations)
parameters_cv2 <- trainControl(method = 'CV', number = 100, savePredictions = "final")
# Train the decision tree model using cross-validation
rf_model_CV2 <- train(rf_formula2, data = balanced_data2, method='rf', trControl=parameters_cv2)
# Display the cross-validation
rf_model_CV2
## Random Forest
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 3357, 3356, 3356, 3356, 3356, 3356, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8475475 0.06441145
## 4 0.8449267 0.11216217
## 7 0.8425286 0.11481967
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Make predictions using the cross-validation
predictions_CV2 <- predict(rf_model_CV2, testData[,-15], type = 'raw')
# Computing the confusion matrix for predictions_CV2 and actual labels in testData
co_table <- confusionMatrix(predictions_CV2, testData$TenYearCHD)
co_table
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 848 38
## 1 0 131
##
## Accuracy : 0.9626
## 95% CI : (0.9491, 0.9734)
## No Information Rate : 0.8338
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8518
##
## Mcnemar's Test P-Value : 1.947e-09
##
## Sensitivity : 1.0000
## Specificity : 0.7751
## Pos Pred Value : 0.9571
## Neg Pred Value : 1.0000
## Prevalence : 0.8338
## Detection Rate : 0.8338
## Detection Prevalence : 0.8712
## Balanced Accuracy : 0.8876
##
## 'Positive' Class : 0
##
. The model achieves an accuracy of 96.26% indicating high level of precision in predicting the 10-year risk of CHD. The model shows perfect sensitivity (100%) in correctly identifying positive cases (CHD events). The specificity is 77.51%, indicating the model’s ability to accurately identify negative cases (non-CHD events). Overall, the RF algorithm,when trained with only the most significant factors, demonstrates excellent performance and predictive power in determining the 10-year risk of CHD.
# Extract variable importance using varImp function for rf_model_CV
var_importance1 <- varImp(rf_model_CV)
var_importance1
## rf variable importance
##
## Overall
## glucose 100.000
## age 70.841
## sysBP 60.902
## diaBP 48.908
## totChol 46.461
## BMI 45.859
## heartRate 42.615
## cigsPerDay 41.615
## prevalentHyp1 16.967
## sex1 8.953
## is_smoking1 4.964
## diabetes1 4.770
## BPMeds1 1.690
## prevalentStroke1 0.000
# Extract variable importance using varImp function for rf_model_CV2
var_importance2 <- varImp(rf_model_CV2)
var_importance2
## rf variable importance
##
## Overall
## sysBP 100.000
## totChol 90.928
## glucose 83.427
## age 77.481
## cigsPerDay 35.855
## sex1 8.124
## prevalentStroke1 0.000
. In both models, variable importance measures underscore the significance of age, systolic blood pressure, total cholesterol levels, glucose, and cigarettes smoked per day in predicting the 10-year risk of CHD. However, variables such as sex and history of stroke demonstrate comparatively lower importance scores when contrasted with the aforementioned factors.