Importing Essential R Packages

# Importing the 'modeest' package for estimating statistical modes or working with statistical distributions
library(modeest)

# Importing the 'ggplot2' package for creating visually appealing and customizable plots
library(ggplot2)

# Importing the 'lattice' package for creating conditioned plots
library(lattice)

# Importing the 'caret' package for machine learning tasks such as data preprocessing, model training, and performance evaluation
library(caret)

# Importing the 'mlbench' package for benchmark datasets commonly used in machine learning research
library(mlbench)

# Importing the 'dlookr' package for data exploration and outlier detection
library(dlookr)

## Registered S3 methods overwritten by 'dlookr':
##   method          from  
##   plot.transform  scales
##   print.transform scales

## 
## Attaching package: 'dlookr'

## The following object is masked from 'package:modeest':
## 
##     skewness

## The following object is masked from 'package:base':
## 
##     transform

# Importing the 'ROSE' package for dealing with imbalance data
library(ROSE)

## Loaded ROSE 0.0-4

Reading the dataset

# Reading the csv data file "Main data.csv" into the 'CHD' dataframe
CHD <- read.csv("Main data.csv", stringsAsFactors = TRUE, na.strings = TRUE)

Data pre-preparation/pre-processing

Summary Statistics

# Generating a summary of the 'CHD' dataframe
summary(CHD)

##        id              age          education     sex      is_smoking
##  Min.   :   0.0   Min.   :32.00   Min.   :1.000   F:1923   NO :1703  
##  1st Qu.: 847.2   1st Qu.:42.00   1st Qu.:1.000   M:1467   YES:1687  
##  Median :1694.5   Median :49.00   Median :2.000                      
##  Mean   :1694.5   Mean   :49.54   Mean   :1.971                      
##  3rd Qu.:2541.8   3rd Qu.:56.00   3rd Qu.:3.000                      
##  Max.   :3389.0   Max.   :70.00   Max.   :4.000                      
##                                   NA's   :87                         
##    cigsPerDay         BPMeds        prevalentStroke    prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.00000   Median :0.0000  
##  Mean   : 9.069   Mean   :0.02989   Mean   :0.00649   Mean   :0.3153  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##  NA's   :22       NA's   :44                                          
##     diabetes          totChol          sysBP           diaBP       
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.00  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 74.50  
##  Median :0.00000   Median :234.0   Median :128.5   Median : 82.00  
##  Mean   :0.02566   Mean   :237.1   Mean   :132.6   Mean   : 82.88  
##  3rd Qu.:0.00000   3rd Qu.:264.0   3rd Qu.:144.0   3rd Qu.: 90.00  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.50  
##                    NA's   :38                                      
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.96   Min.   : 45.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.02   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.38   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.79   Mean   : 75.98   Mean   : 82.09   Mean   :0.1507  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :14      NA's   :1        NA's   :304

Here’s a brief description of the summary result:

id: The identifier variable with a range of values from 0 to 3389.
age: The age of the individuals in the dataset, ranging from 32 to 70 years.
education: The level of education, represented by numeric values ranging from 1 to 4.
sex: The gender of the individuals, with “F” representing females and “M” representing males.
is_smoking: Indicates whether the individual is a smoker, with “NO” indicating non-smokers and “YES” indicating smokers.
cigsPerDay: The average number of cigarettes smoked per day, ranging from 0 to 70.
BPMeds: Indicates whether the individual is taking blood pressure medication, with values of 0 or 1.
prevalentStroke: Indicates whether the individual had a prevalent stroke, with values of 0 or 1.
prevalentHyp: Indicates whether the individual has prevalent hypertension, with values of 0 or 1.
diabetes: Indicates whether the individual has diabetes, with values of 0 or 1.
totChol: Total cholesterol level, ranging from 107 to 696.
sysBP: Systolic blood pressure, ranging from 83.5 to 295.
diaBP: Diastolic blood pressure, ranging from 48 to 142.5.
BMI: Body Mass Index, ranging from 15.96 to 56.8.
heartRate: Heart rate in beats per minute, ranging from 45 to 143.
glucose: Blood glucose level, ranging from 40 to 394.
TenYearCHD: Indicates whether the individual had a coronary heart disease event within ten years, with values of 0 or 1.

.The summary result also indicates the presence of missing values (NA) in certain variables, as denoted by the number of NA entries mentioned under each variable.

Examining the Structure of the Dataset

# Displaying the structure of the 'CHD' dataframe
str(CHD)

## 'data.frame':    3390 obs. of  17 variables:
##  $ id             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ age            : int  64 36 46 50 64 61 61 36 41 55 ...
##  $ education      : num  2 4 1 1 1 3 1 4 2 2 ...
##  $ sex            : Factor w/ 2 levels "F","M": 1 2 1 2 1 1 2 2 1 1 ...
##  $ is_smoking     : Factor w/ 2 levels "NO","YES": 2 1 2 2 2 1 1 2 2 1 ...
##  $ cigsPerDay     : num  3 0 10 20 30 0 0 35 20 0 ...
##  $ BPMeds         : num  0 0 0 0 0 0 0 0 NA 0 ...
##  $ prevalentStroke: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ prevalentHyp   : int  0 1 0 1 0 1 1 0 0 1 ...
##  $ diabetes       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ totChol        : num  221 212 250 233 241 272 238 295 220 326 ...
##  $ sysBP          : num  148 168 116 158 136 ...
##  $ diaBP          : num  85 98 71 88 85 121 136 68 78 81 ...
##  $ BMI            : num  NA 29.8 20.4 28.3 26.4 ...
##  $ heartRate      : num  90 72 88 68 70 85 75 60 86 85 ...
##  $ glucose        : num  80 75 94 94 77 65 79 63 79 NA ...
##  $ TenYearCHD     : int  1 0 0 1 0 1 0 0 0 0 ...

Here’s a brief description of the displayed information:

The dataframe contains 3390 observations or rows and has 17 variables or columns.

The description below provides details about each variable/column:

id: An integer variable representing the identifier, ranging from 0 to higher values.
age: An integer variable representing the age of individuals.
education: A numeric variable representing the level of education.
sex: A factor variable indicating the gender, with two levels “F” and “M”.
is_smoking: A factor variable indicating smoking status, with two levels “NO” and “YES”.
cigsPerDay: A numeric variable representing the average number of cigarettes smoked per day.
BPMeds: A numeric variable indicating whether the individual is taking blood pressure medication.
prevalentStroke: An integer variable indicating whether the individual had a prevalent stroke.
prevalentHyp: An integer variable indicating whether the individual has prevalent hypertension.
diabetes: An integer variable indicating whether the individual has diabetes.
totChol: A numeric variable representing total cholesterol level.
sysBP: A numeric variable representing systolic blood pressure.
diaBP: A numeric variable representing diastolic blood pressure.
BMI: A numeric variable representing Body Mass Index.
heartRate: A numeric variable representing heart rate in beats per minute.
glucose: A numeric variable representing blood glucose level.
TenYearCHD: An integer variable indicating whether the individual had a coronary heart disease event within ten years.

Droping irrelevant Variables(Id and Education)

# Creating a new dataframe 'CHD_NEW' by excluding the 'id' and 'education' columns from the 'CHD' dataframe
CHD_NEW <- CHD[, -c(1, 3)]

Converting categorical to factor

# Converting selected variables in the 'CHD_NEW' dataframe to factor type
CHD_NEW$BPMeds <- as.factor(CHD_NEW$BPMeds)
CHD_NEW$prevalentStroke <- as.factor(CHD_NEW$prevalentStroke)
CHD_NEW$prevalentHyp <- as.factor(CHD_NEW$prevalentHyp)
CHD_NEW$diabetes <- as.factor(CHD_NEW$diabetes)
CHD_NEW$TenYearCHD <- as.factor(CHD_NEW$TenYearCHD)

Missing values handling

# Calculating the column-wise sum of missing values in the 'CHD_NEW' dataframe
colSums(is.na(CHD_NEW))

##             age             sex      is_smoking      cigsPerDay          BPMeds 
##               0               0               0              22              44 
## prevalentStroke    prevalentHyp        diabetes         totChol           sysBP 
##               0               0               0              38               0 
##           diaBP             BMI       heartRate         glucose      TenYearCHD 
##               0              14               1             304               0

# Imputing missing values with the corresponding median value for numerical variables
CHD_NEW$cigsPerDay[is.na(CHD_NEW$cigsPerDay)] <- median(CHD_NEW$cigsPerDay, na.rm = TRUE)
CHD_NEW$totChol[is.na(CHD_NEW$totChol)] <- median(CHD_NEW$totChol, na.rm = TRUE)
CHD_NEW$BMI[is.na(CHD_NEW$BMI)] <- median(CHD_NEW$BMI, na.rm = TRUE)
CHD_NEW$heartRate[is.na(CHD_NEW$heartRate)] <- median(CHD_NEW$heartRate, na.rm = TRUE)
CHD_NEW$glucose[is.na(CHD_NEW$glucose)] <- median(CHD_NEW$glucose, na.rm = TRUE)

#Imputing missing values in 'BPMeds' column with the most frequent value(mode)
CHD_NEW$BPMeds[is.na(CHD_NEW$BPMeds)] <- mfv(CHD_NEW$BPMeds, na_rm = TRUE)

From the output:

There are no missing values in the ‘age’ column.
There are no missing values in the ‘sex’ column.
There are no missing values in the ‘is_smoking’ column.
There are 22 missing values in the ‘cigsPerDay’ column.
There are 44 missing values in the ‘BPMeds’ column.
There are no missing values in the ‘prevalentStroke’ column.
There are no missing values in the ‘prevalentHyp’ column.
There are no missing values in the ‘diabetes’ column.
There are 38 missing values in the ‘totChol’ column.
There are no missing values in the ‘sysBP’ column.
There are no missing values in the ‘diaBP’ column.
There are 14 missing values in the ‘BMI’ column.
There is 1 missing value in the ‘heartRate’ column.
There are 304 missing values in the ‘glucose’ column.
There are no missing values in the ‘TenYearCHD’ column.

Duplicates checking

# Removing duplicate rows
CHD_NEW <- unique(CHD_NEW)

Variable Encoding

# Converting 'is_smoking' column to binary indicator
CHD_NEW$is_smoking <- ifelse(CHD_NEW$is_smoking == "YES", 1, 0)

# Converting 'sex' column to binary indicator
CHD_NEW$sex <- ifelse(CHD_NEW$sex == "M", 1, 0)

# Converting 'is_smoking' and 'sex' columns to factors
CHD_NEW$is_smoking <- as.factor(CHD_NEW$is_smoking)
CHD_NEW$sex <- as.factor(CHD_NEW$sex)

Outliers handling

# Selecting numerical variables from CHD_NEW dataframe
numerical_vars <- CHD_NEW[, sapply(CHD_NEW, is.numeric)]

# Plot outliers in numerical variables
plot_outlier(numerical_vars)

# Identify nature of outliers using boxplot statistics
boxplot.stats(CHD_NEW$age)$out

## integer(0)

boxplot.stats(CHD_NEW$cigsPerDay)$out

## [1] 60 60 60 60 60 60 60 70 60

boxplot.stats(CHD_NEW$totChol)$out

##  [1] 382 391 439 600 356 696 361 382 464 391 362 390 360 352 366 366 370 372 365
## [20] 113 392 353 367 390 398 453 358 363 107 352 354 373 410 380 366 432 410 358
## [39] 355 364 352 410 352

boxplot.stats(CHD_NEW$sysBP)$out

##   [1] 232.0 185.0 186.5 193.0 192.0 207.0 202.0 244.0 186.0 191.0 191.5 193.0
##  [13] 195.0 185.5 190.0 212.0 210.0 202.5 215.0 196.0 248.0 215.0 195.0 208.0
##  [25] 194.0 204.0 186.0 188.0 214.0 188.5 187.0 192.5 197.0 185.0 197.0 190.0
##  [37] 186.5 190.0 196.0 230.0 187.0 185.0 205.0 205.0 200.0 199.0 185.0 210.0
##  [49] 189.0 200.0 191.0 198.0 200.0 193.0 200.0 209.0 195.0 199.0 189.0 295.0
##  [61] 235.0 204.0 188.0 195.0 198.0 197.5 190.0 199.0 188.0 206.0 215.0 210.0
##  [73] 243.0 205.5 192.5 193.0 189.0 192.0 189.0 187.0 187.5 206.0 191.0 206.0
##  [85] 196.0 185.0 195.0 196.0 195.0 190.0 220.0 217.0 189.0 185.0 213.0 188.0
##  [97] 190.0 185.0 220.0 192.5 213.0 210.0 207.5 197.5 187.0

boxplot.stats(CHD_NEW$diaBP)$out

##  [1] 121.0 136.0 121.0 120.0 122.5 114.0 124.0 124.0 115.5 130.0 129.0  48.0
## [13] 116.0 130.0 118.0 136.0 115.0 114.0 130.0 117.0  50.0 122.5 135.0 120.0
## [25] 125.0 133.0 135.0 118.0 130.0 123.0 128.0 117.5 125.0 115.0 114.0 120.0
## [37] 123.0 120.0 142.5 115.0 118.0 115.0 121.0 124.5 116.0 120.0 119.0 114.0
## [49] 121.0 130.0 118.0 125.0 133.0 127.5 118.0 118.0  51.0 119.0

boxplot.stats(CHD_NEW$BMI)$out

##  [1] 35.58 43.69 36.54 35.96 35.62 38.42 36.11 38.14 44.09 35.99 39.94 39.82
## [13] 40.21 38.96 38.53 37.38 39.69 35.85 38.54 37.10 39.94 36.07 35.78 38.43
## [25] 51.28 41.29 36.91 38.31 56.80 39.21 38.11 36.65 39.04 35.68 44.27 39.17
## [37] 38.75 45.79 36.81 38.46 36.21 37.48 36.52 40.38 36.62 37.15 38.82 35.53
## [49] 42.00 43.48 37.04 37.62 40.58 39.22 37.02 36.12 42.53 38.17 40.08 37.41
## [61] 37.30 40.11 40.81 39.53 40.52 36.18 39.08 44.55 39.91 38.39 36.01 41.61
## [73] 44.71 43.67 45.80 38.61 36.29 38.06

boxplot.stats(CHD_NEW$heartRate)$out

##  [1] 110 110 110 112 110 120 110 110 106 110 110 115 110 107 110 143 110 110 112
## [20] 110 107  45 110 120 108 115 110 108 110 110 110 125 125 120 115 125 110 122
## [39] 110 110 122 110 106 110 120 110 140 108 115 120 110 110 108 110 108 110 108
## [58] 110 110 112 108 110 110 110

boxplot.stats(CHD_NEW$glucose)$out

##   [1] 132 105 170 107 177 118 120 173 163 205 115 120 268 137 106 348 110 120
##  [19] 115 106 122 108 255 140 105 124 244 106 115 294 112 113 254 132 116  52
##  [37]  50 120 110 114 127 140 113 135  45 206 140 105 105 105 117 112 107 126
##  [55] 113  45 131 113 114 116 394 116 120 107 123 114 118 270 107 320 129 112
##  [73] 107 115 142 106 112 107 120  52 235 116 115  40 126 117 118 150 120 136
##  [91] 113 107 118 112 191 117  44 394 124 173 127  50 119 117 109 113 105 108
## [109] 110 112 121  43 117 108 256 113 145 113 105 117  47 297 118 166 172  48
## [127] 132 111 106 107 113 115 126 110 110 127 122 215 113 216 368 118 123 155
## [145] 105 148 147 121 115 116 117 108 115 115 105 135 105 123 123 223 106 115
## [163] 113 225 332 105 106 115 117 118 210 112 110 115 107 186 110 136 110 112
## [181] 117 109 115 386 150 137 248 107 108 127 170 193 125 167  47 183 112 110
## [199] 108 120 215 137 113 108  45 260  44 115 274 108 107 112 206 113

Data analysis

Correlation matrix

# Selecting numerical variables from CHD_NEW dataframe
numerical_vars <- CHD_NEW[, sapply(CHD_NEW, is.numeric)]

# Calculating correlation matrix and rounding to 2 decimal places
round(cor(numerical_vars), 2)

##              age cigsPerDay totChol sysBP diaBP   BMI heartRate glucose
## age         1.00      -0.19    0.27  0.40  0.22  0.14      0.00    0.11
## cigsPerDay -0.19       1.00   -0.03 -0.10 -0.07 -0.10      0.07   -0.07
## totChol     0.27      -0.03    1.00  0.20  0.15  0.11      0.09    0.06
## sysBP       0.40      -0.10    0.20  1.00  0.78  0.33      0.18    0.14
## diaBP       0.22      -0.07    0.15  0.78  1.00  0.38      0.17    0.07
## BMI         0.14      -0.10    0.11  0.33  0.38  1.00      0.07    0.09
## heartRate   0.00       0.07    0.09  0.18  0.17  0.07      1.00    0.08
## glucose     0.11      -0.07    0.06  0.14  0.07  0.09      0.08    1.00

Here is a brief description of the correlations:

Age has a positive correlation with totChol (0.27), sysBP (0.40), and diaBP (0.22), indicating that older individuals tend to have higher cholesterol levels and blood pressure.
There is a weak negative correlation between age and cigsPerDay (-0.19), suggesting that older individuals tend to smoke fewer cigarettes per day.
There are no significant correlations between age and heartRate or glucose.
CigsPerDay shows a weak negative correlation with heartRate (-0.10) and a weak positive correlation with glucose (0.07). This indicates that higher cigarette consumption is associated with slightly lower heart rate and slightly higher glucose levels.
TotChol has a weak positive correlation with sysBP (0.20) and diaBP (0.15), suggesting a potential association between cholesterol levels and blood pressure.
SysBP and diaBP exhibit a strong positive correlation of 0.78, indicating a strong relationship between systolic and diastolic blood pressure.
BMI shows weak positive correlations with sysBP (0.33) and diaBP (0.38), indicating a potential association between body mass index and blood pressure.
HeartRate has a weak positive correlation with sysBP (0.18) and a weak positive correlation with glucose (0.08).
Glucose has a weak positive correlation with sysBP (0.14), diaBP (0.07), BMI (0.09), and heartRate (0.08), suggesting a potential association between glucose levels and these variables.

Logistic regression

# Attaching the CHD_NEW data frame
attach(CHD_NEW)

# Building the logistic regression model
logit_model <- glm(TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke +
                   prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose,
                   family = binomial)

# Summarizing the logistic regression model
summary(logit_model)

## 
## Call:
## glm(formula = TenYearCHD ~ age + sex + is_smoking + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5517  -0.5936  -0.4273  -0.2902   2.8086  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -8.360407   0.716824 -11.663  < 2e-16 ***
## age               0.062776   0.006917   9.075  < 2e-16 ***
## sex1              0.481424   0.113386   4.246 2.18e-05 ***
## is_smoking1       0.136236   0.159861   0.852 0.394094    
## cigsPerDay        0.021421   0.006308   3.396 0.000684 ***
## BPMeds1           0.183600   0.249342   0.736 0.461524    
## prevalentStroke1  1.179526   0.472153   2.498 0.012483 *  
## prevalentHyp1     0.187982   0.144761   1.299 0.194092    
## diabetes1         0.050892   0.333957   0.152 0.878878    
## totChol           0.002585   0.001126   2.295 0.021722 *  
## sysBP             0.015899   0.003942   4.034 5.49e-05 ***
## diaBP            -0.006661   0.006552  -1.017 0.309294    
## BMI               0.001157   0.013184   0.088 0.930098    
## heartRate        -0.002666   0.004354  -0.612 0.540277    
## glucose           0.008748   0.002424   3.609 0.000307 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2874.6  on 3389  degrees of freedom
## Residual deviance: 2541.1  on 3375  degrees of freedom
## AIC: 2571.1
## 
## Number of Fisher Scoring iterations: 5

.Age, Sex, Cigarettes per Day, Prevalent Stroke, Total Cholesterol, Systolic Blood Pressure, and Glucose demonstrated a statistically significant relationship with the TenYearCHD outcome. These variables have shown a significant impact on predicting the likelihood of TenYearCHD when accounting for other variables in the model.

Removing non-significant variables using step

# Perform stepwise selection on the logistic regression model
logit_modelB <- step(logit_model)

## Start:  AIC=2571.15
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + 
##     heartRate + glucose
## 
##                   Df Deviance    AIC
## - BMI              1   2541.2 2569.2
## - diabetes         1   2541.2 2569.2
## - heartRate        1   2541.5 2569.5
## - BPMeds           1   2541.7 2569.7
## - is_smoking       1   2541.9 2569.9
## - diaBP            1   2542.2 2570.2
## - prevalentHyp     1   2542.8 2570.8
## <none>                 2541.2 2571.2
## - totChol          1   2546.3 2574.3
## - prevalentStroke  1   2547.1 2575.1
## - cigsPerDay       1   2552.6 2580.6
## - glucose          1   2554.9 2582.9
## - sysBP            1   2557.4 2585.4
## - sex              1   2559.2 2587.2
## - age              1   2626.4 2654.4
## 
## Step:  AIC=2569.16
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + diabetes + totChol + sysBP + diaBP + heartRate + 
##     glucose
## 
##                   Df Deviance    AIC
## - diabetes         1   2541.2 2567.2
## - heartRate        1   2541.5 2567.5
## - BPMeds           1   2541.7 2567.7
## - is_smoking       1   2541.9 2567.9
## - diaBP            1   2542.2 2568.2
## - prevalentHyp     1   2542.8 2568.8
## <none>                 2541.2 2569.2
## - totChol          1   2546.4 2572.4
## - prevalentStroke  1   2547.1 2573.1
## - cigsPerDay       1   2552.6 2578.6
## - glucose          1   2554.9 2580.9
## - sysBP            1   2557.4 2583.4
## - sex              1   2559.3 2585.3
## - age              1   2626.5 2652.5
## 
## Step:  AIC=2567.18
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - heartRate        1   2541.6 2565.6
## - BPMeds           1   2541.7 2565.7
## - is_smoking       1   2541.9 2565.9
## - diaBP            1   2542.2 2566.2
## - prevalentHyp     1   2542.9 2566.9
## <none>                 2541.2 2567.2
## - totChol          1   2546.4 2570.4
## - prevalentStroke  1   2547.1 2571.1
## - cigsPerDay       1   2552.6 2576.6
## - sysBP            1   2557.5 2581.5
## - sex              1   2559.4 2583.4
## - glucose          1   2565.6 2589.6
## - age              1   2626.6 2650.6
## 
## Step:  AIC=2565.56
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + glucose
## 
##                   Df Deviance    AIC
## - BPMeds           1   2542.1 2564.1
## - is_smoking       1   2542.2 2564.2
## - diaBP            1   2542.7 2564.7
## - prevalentHyp     1   2543.2 2565.2
## <none>                 2541.6 2565.6
## - totChol          1   2546.6 2568.6
## - prevalentStroke  1   2547.6 2569.6
## - cigsPerDay       1   2552.8 2574.8
## - sysBP            1   2557.7 2579.7
## - sex              1   2560.8 2582.8
## - glucose          1   2565.6 2587.6
## - age              1   2628.1 2650.1
## 
## Step:  AIC=2564.14
## TenYearCHD ~ age + sex + is_smoking + cigsPerDay + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + glucose
## 
##                   Df Deviance    AIC
## - is_smoking       1   2542.8 2562.8
## - diaBP            1   2543.3 2563.3
## - prevalentHyp     1   2543.9 2563.9
## <none>                 2542.1 2564.1
## - totChol          1   2547.3 2567.3
## - prevalentStroke  1   2548.7 2568.7
## - cigsPerDay       1   2553.3 2573.3
## - sysBP            1   2559.2 2579.2
## - sex              1   2561.2 2581.2
## - glucose          1   2566.4 2586.4
## - age              1   2628.7 2648.7
## 
## Step:  AIC=2562.83
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + diaBP + glucose
## 
##                   Df Deviance    AIC
## - diaBP            1   2544.0 2562.0
## - prevalentHyp     1   2544.6 2562.6
## <none>                 2542.8 2562.8
## - totChol          1   2548.0 2566.0
## - prevalentStroke  1   2549.3 2567.3
## - sysBP            1   2559.8 2577.8
## - sex              1   2561.8 2579.8
## - glucose          1   2567.0 2585.0
## - cigsPerDay       1   2574.8 2592.8
## - age              1   2628.8 2646.8
## 
## Step:  AIC=2562.03
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + glucose
## 
##                   Df Deviance    AIC
## - prevalentHyp     1   2545.4 2561.4
## <none>                 2544.0 2562.0
## - totChol          1   2549.2 2565.2
## - prevalentStroke  1   2550.4 2566.4
## - sex              1   2562.1 2578.1
## - sysBP            1   2564.1 2580.1
## - glucose          1   2568.8 2584.8
## - cigsPerDay       1   2576.6 2592.6
## - age              1   2637.9 2653.9
## 
## Step:  AIC=2561.43
## TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + totChol + 
##     sysBP + glucose
## 
##                   Df Deviance    AIC
## <none>                 2545.4 2561.4
## - totChol          1   2550.7 2564.7
## - prevalentStroke  1   2552.1 2566.1
## - sex              1   2563.9 2577.9
## - glucose          1   2569.9 2583.9
## - cigsPerDay       1   2577.7 2591.7
## - sysBP            1   2594.5 2608.5
## - age              1   2640.0 2654.0

# Generate a summary of the updated model's results
summary(logit_modelB)

## 
## Call:
## glm(formula = TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + 
##     totChol + sysBP + glucose, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6384  -0.5904  -0.4308  -0.2878   2.8699  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -9.052122   0.492851 -18.367  < 2e-16 ***
## age               0.064384   0.006741   9.552  < 2e-16 ***
## sex1              0.476141   0.111004   4.289 1.79e-05 ***
## cigsPerDay        0.025005   0.004358   5.737 9.63e-09 ***
## prevalentStroke1  1.238352   0.466334   2.656  0.00792 ** 
## totChol           0.002593   0.001123   2.308  0.02101 *  
## sysBP             0.015711   0.002237   7.022 2.18e-12 ***
## glucose           0.008946   0.001828   4.895 9.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2874.6  on 3389  degrees of freedom
## Residual deviance: 2545.4  on 3382  degrees of freedom
## AIC: 2561.4
## 
## Number of Fisher Scoring iterations: 5

Odd ratios and Odds

# Calculate the odds ratios for the variables in the logistic regression model
# using the logit_modelB object

exp(coef(logit_modelB))

##      (Intercept)              age             sex1       cigsPerDay 
##     0.0001171422     1.0665022189     1.6098499377     1.0253197684 
## prevalentStroke1          totChol            sysBP          glucose 
##     3.4499223834     1.0025961748     1.0158348165     1.0089861372

Based on odds ratios, here is the interpretation of the effects of the independent variables on the dependent variable:

Intercept: The odds of the dependent variable occurring when all the independent variables are zero is extremely low (approximately 0.000117). This can be interpreted as the baseline odds of the dependent variable.
Age: For every one-unit increase in age, the odds of the dependent variable increase by approximately 6.7%. This suggests that older age is associated with a slightly higher likelihood of the dependent variable occurring.
Sex: Being male (sex1) increases the odds of the dependent variable occurring by approximately 61% compared to being female (sex0). This indicates that males have a higher likelihood of the dependent variable occurring compared to females.
cigsPerDay: For every one-unit increase in the number of cigarettes smoked per day, the odds of the dependent variable occurring increase by approximately 2.5%. This suggests that higher cigarette consumption is associated with a slightly higher likelihood of the dependent variable occurring.
prevalentStroke: Individuals with a prevalent stroke (prevalentStroke1) have approximately 244% higher odds of the dependent variable occurring compared to those without a prevalent stroke (prevalentStroke0). This suggests that a previous stroke is strongly associated with an increased likelihood of the dependent variable occurring.
totChol: For every one-unit increase in total cholesterol levels, the odds of the dependent variable occurring increase by approximately 0.3%. This suggests that higher total cholesterol levels are associated with a slightly higher likelihood of the dependent variable occurring.
sysBP: For every one-unit increase in systolic blood pressure, the odds of the dependent variable occurring increase by approximately 1.6%. This indicates that higher systolic blood pressure is associated with a slightly higher likelihood of the dependent variable occurring.
glucose: For every one-unit increase in glucose levels, the odds of the dependent variable occurring increase by approximately 0.9%. This suggests that higher glucose levels are associated with a slightly higher likelihood of the dependent variable occurring.

Random Forest Model for Ten-Year CHD Risk Prediction

Using full set of variables

# Set the random seed for reproducibility
set.seed(1994)

# Determine the number of rows in the 'CHD_NEW' data frame
n_rows <- nrow(CHD_NEW)

# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx <- sample(n_rows, n_rows*0.7)

# Create the training and testing data sets using the sampled indices
trainData <- CHD_NEW[idx,]
testData <- CHD_NEW[-idx,]

# Counting the number of occurrences of each class in the original trainData
table(trainData$TenYearCHD)

## 
##    0    1 
## 2031  342

# Applying the ROSE (Random Over-Sampling Examples) algorithm to balance the data
balanced_data <- ROSE(TenYearCHD ~ . , data = trainData)
balanced_data <- balanced_data$data

# Counting the number of occurrences of each class in the balanced_data
table(balanced_data$TenYearCHD)

## 
##    0    1 
## 1193 1180

# Define the formula for the random forest model using the relevant features
rf_formula <- TenYearCHD ~ age + sex + is_smoking + cigsPerDay + BPMeds + prevalentStroke + prevalentHyp +               diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose

# Set up cross-validation parameters for the model (100 iterations)
parameters_cv <- trainControl(method = 'CV', number = 100, savePredictions = "final")

# Train the random forest model
rf_model_CV <- train(rf_formula, data = balanced_data, method='rf', trControl=parameters_cv)

# Display the cross-validation model
rf_model_CV

## Random Forest 
## 
## 2373 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 2349, 2349, 2350, 2350, 2349, 2349, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7440761  0.4876104
##    8    0.7403986  0.4804163
##   14    0.7371920  0.4740123
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

# Make predictions
predictions_CV <- predict(rf_model_CV, testData[,-15], type = 'raw')

# Computing the confusion matrix for predictions_CV and actual labels in testData
co_matrix_cv <- confusionMatrix(predictions_CV, testData$TenYearCHD)
co_matrix_cv

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 710 100
##          1 138  69
##                                           
##                Accuracy : 0.766           
##                  95% CI : (0.7387, 0.7917)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : 1.00000         
##                                           
##                   Kappa : 0.2253          
##                                           
##  Mcnemar's Test P-Value : 0.01647         
##                                           
##             Sensitivity : 0.8373          
##             Specificity : 0.4083          
##          Pos Pred Value : 0.8765          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.8338          
##          Detection Rate : 0.6981          
##    Detection Prevalence : 0.7965          
##       Balanced Accuracy : 0.6228          
##                                           
##        'Positive' Class : 0               
##

. The performance metrics scores offer invaluable insights into the model’s effectiveness. With an accuracy rate of 76.79%, the model demonstrates a notable proportion of correct predictions. Notably, a sensitivity of 83.73% highlights the model’s exceptional capability in accurately identifying positive cases (CHD events), while a specificity of 40.83% indicates its moderate proficiency in correctly pinpointing negative cases (non-CHD events). These findings strongly advocate for the RF algorithm as the preferred choice for predicting the likelihood of CHD events, leveraging a comprehensive set of independent variables.

Using only significant factors

#dropping insignificant variables
CHD_NEW_NEW <- CHD_NEW[, -c(3,5,7,8,11,12,13)]

# Set the random seed for reproducibility
set.seed(1994)

# Determine the number of rows in the 'tree_data' data frame
n_rows <- nrow(CHD_NEW_NEW)

# Create a random sample of indices for splitting the data into training and testing sets (70% for training)
idx2 <- sample(n_rows, n_rows*0.7)

# Create the training and testing data sets using the sampled indices
trainDataa <- CHD_NEW_NEW[idx,]
testDataa <- CHD_NEW_NEW[-idx,]

# Counting the number of occurrences of each class in the original trainData
table(trainDataa$TenYearCHD)

## 
##    0    1 
## 2031  342

# Applying the ROSE (Random Over-Sampling Examples) algorithm to balance the data
balanced_data2 <- ROSE(TenYearCHD ~ . , data = trainDataa)
balanced_data2 <- balanced_data$data

# Counting the number of occurrences of each class in the balanced_data
table(balanced_data2$TenYearCHD)

## < table of extent 0 >

# Define the formula for the random forest model using the relevant features
rf_formula2 <- TenYearCHD ~ age + sex + cigsPerDay + prevalentStroke + totChol + sysBP + glucose

# Set up cross-validation parameters for the model (1000 iterations)
parameters_cv2 <- trainControl(method = 'CV', number = 100, savePredictions = "final")

# Train the decision tree model using cross-validation
rf_model_CV2 <- train(rf_formula2, data = balanced_data2, method='rf', trControl=parameters_cv2)

# Display the cross-validation
rf_model_CV2

## Random Forest 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 3357, 3356, 3356, 3356, 3356, 3356, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##   2     0.8475475  0.06441145
##   4     0.8449267  0.11216217
##   7     0.8425286  0.11481967
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

# Make predictions using the cross-validation
predictions_CV2 <- predict(rf_model_CV2, testData[,-15], type = 'raw')

# Computing the confusion matrix for predictions_CV2 and actual labels in testData
co_table <- confusionMatrix(predictions_CV2, testData$TenYearCHD)
co_table

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 848  38
##          1   0 131
##                                           
##                Accuracy : 0.9626          
##                  95% CI : (0.9491, 0.9734)
##     No Information Rate : 0.8338          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8518          
##                                           
##  Mcnemar's Test P-Value : 1.947e-09       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.7751          
##          Pos Pred Value : 0.9571          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8338          
##          Detection Rate : 0.8338          
##    Detection Prevalence : 0.8712          
##       Balanced Accuracy : 0.8876          
##                                           
##        'Positive' Class : 0               
##

. The model achieves an accuracy of 96.26% indicating high level of precision in predicting the 10-year risk of CHD. The model shows perfect sensitivity (100%) in correctly identifying positive cases (CHD events). The specificity is 77.51%, indicating the model’s ability to accurately identify negative cases (non-CHD events). Overall, the RF algorithm,when trained with only the most significant factors, demonstrates excellent performance and predictive power in determining the 10-year risk of CHD.

Variable Importance Comparison for Random Forest Models

# Extract variable importance using varImp function for rf_model_CV
var_importance1 <- varImp(rf_model_CV)
var_importance1

## rf variable importance
## 
##                  Overall
## glucose          100.000
## age               70.841
## sysBP             60.902
## diaBP             48.908
## totChol           46.461
## BMI               45.859
## heartRate         42.615
## cigsPerDay        41.615
## prevalentHyp1     16.967
## sex1               8.953
## is_smoking1        4.964
## diabetes1          4.770
## BPMeds1            1.690
## prevalentStroke1   0.000

# Extract variable importance using varImp function for rf_model_CV2
var_importance2 <- varImp(rf_model_CV2)
var_importance2

## rf variable importance
## 
##                  Overall
## sysBP            100.000
## totChol           90.928
## glucose           83.427
## age               77.481
## cigsPerDay        35.855
## sex1               8.124
## prevalentStroke1   0.000

. In both models, variable importance measures underscore the significance of age, systolic blood pressure, total cholesterol levels, glucose, and cigarettes smoked per day in predicting the 10-year risk of CHD. However, variables such as sex and history of stroke demonstrate comparatively lower importance scores when contrasted with the aforementioned factors.

Assessing 10-Year Risk of Coronary Heart Disease

Fredrick George Mbuya

2023-06-23