Clear Environment

rm(list = ls())

Importing Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(polycor)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(e1071)
library(caret)
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Introduction

In this assignment, two machine learning algorithms were fit to two datasets of different sizes. The smaller of the two datasets involves cardiovascular disease data collected from several sources and combined into one dataset. Cardiovascular disease is the leading cause of death globally according to the WHO (World Health Organization), which estimates that 17.9 million deaths occurred in 2019 as a result of cardiovascular disease. According to the CDC, heart disease and stroke medical costs are estimated to be nearly $1 billion a day. It is therefore important to identify risk factors, which appear as features within the dataset, so that individuals and health care professionals can work towards reducing these risk factors and preventing heart disease.

This dataset includes the following variables:

The response variable for this dataset is HeartDisease and in total, the dataset consists of 918 observations. More information about the dataset itself can be found [here](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset).

The larger of the two datasets involves indicators for diabetes. This dataset was compiled from a telephone survey conducted by the CDC in 2015. Questions asked in the survey involved health-related risk behaviors, chronic health conditions, and the use of preventative services. Also included are age, education, income, location, and race, to name a few. There are 3 .csv files that can be used for analysis. The one used in this homework was the diabetes_012_health_indicators_BRFSS2015.csv file. This .csv file contains 253,680 survey responses (observations) and 21 features. The response variable is multiclass, in that it contains 3 different classes: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes. The author of this dataset points out that there is a class imbalance.

This dataset includes the following variables:

Importing Data

heart_failure_data <- read_csv(
  file = "heart_failure_prediction.csv", 
  col_types = "nffnnffnfnff"
  )

diabetes_data <- read_csv(
  file = "diabetes_012_health_indicators_BRFSS2015.csv",
  col_types = "ffffnfffffffffffffffff")

Exploratory Data Analysis - Heart Failure Prediction

A summary of the heart failure prediction dataset is provided below:

summary(heart_failure_data)
##       Age        Sex     ChestPainType   RestingBP      Cholesterol   
##  Min.   :28.00   M:725   ATA:173       Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:47.00   F:193   NAP:203       1st Qu.:120.0   1st Qu.:173.2  
##  Median :54.00           ASY:496       Median :130.0   Median :223.0  
##  Mean   :53.51           TA : 46       Mean   :132.4   Mean   :198.8  
##  3rd Qu.:60.00                         3rd Qu.:140.0   3rd Qu.:267.0  
##  Max.   :77.00                         Max.   :200.0   Max.   :603.0  
##  FastingBS  RestingECG      MaxHR       ExerciseAngina    Oldpeak       
##  0:704     Normal:552   Min.   : 60.0   N:547          Min.   :-2.6000  
##  1:214     ST    :178   1st Qu.:120.0   Y:371          1st Qu.: 0.0000  
##            LVH   :188   Median :138.0                  Median : 0.6000  
##                         Mean   :136.8                  Mean   : 0.8874  
##                         3rd Qu.:156.0                  3rd Qu.: 1.5000  
##                         Max.   :202.0                  Max.   : 6.2000  
##  ST_Slope   HeartDisease
##  Up  :395   0:410       
##  Flat:460   1:508       
##  Down: 63               
##                         
##                         
## 

The minimum, maximum, median, and mean values for Age are within normal expectations. There are many more male respondents than female when looking at Sex. Many of the other features shown above fall within reasonable expectations except for Cholesterol and RestingBP: a minimum value of 0 for these variables is not physically possible. Other than that, there were no missing values within the dataset. A plot showing the distributions of the continuous variables is shown below.

Figure 1: Histograms for the continuous features in the Heart Failure Prediction dataset

heart_failure_data %>%
  summarise(
    zeroes_Cholesterol = sum(Cholesterol == 0),
    zeroes_RestingBP = sum(RestingBP == 0)
  )

The count above, along with the histograms, confirms that there is a significant number of 0 values for Cholesterol, while there is only one value of 0 for RestingBP. Age and MaxHR look somewhat normally distributed, while Oldpeak displays signs of right-skewness.

Boxplot - Heart Failure Prediction Data

Figure 2: Boxplots for the Heart Failure Prediction dataset

The boxplots in Figure 2 support the theoretical effects of some of the variables. Based on the Age boxplot, older people are, theoretically, more likely to have heart disease. The MaxHR boxplot suggests that, on average, people with a lower maximum heart rate are more likely to have heart disease. Finally, on average, people with a higher Oldpeak are more likely to have heart disease, which makes sense given that an Oldpeak of ±1 is indicative of a serious health condition. Cholesterol and RestingBP are dealt with later in the homework because of the 0 values present in these variables.

Examining Feature Multicollinearity for Continuous Variables

Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. By using a correlation plot, we can visualize the relationships between certain features. Note that because this dataset uses a mixture of both continuous and categorical variables, the hetcor() function from the polycor package was used to generate the correlation matrix. This function computes “a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables.”
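
The chunk that computes heart_failure_correlations is not echoed in this report, so the following is a minimal sketch of how it may have been produced, assuming hetcor() is applied directly to the data frame (the exact options used are an assumption):

# Sketch (assumed, not the original chunk): hetcor() picks Pearson,
# polyserial, or polychoric correlations based on each pair's types.
heart_failure_correlations <- polycor::hetcor(
  as.data.frame(heart_failure_data),
  std.err = FALSE  # correlations only; skip standard errors for speed
)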

corrplot(heart_failure_correlations$correlations, 
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 1,
         tl.cex = 1)

Figure 3: Correlation plot for the Heart Failure Prediction dataset

Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. Calkins also goes on to point out that magnitudes between 0.5 and 0.7 indicate moderate correlation, and that anything above 0.7 indicates high correlation. The correlation plot above reveals that ST_Slope, Oldpeak, and ExerciseAngina have moderately high correlations with one another.
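
To make these thresholds easier to apply, the pairs exceeding a given cutoff can also be listed programmatically. A small sketch (this helper is not part of the original analysis):

# List feature pairs whose correlation magnitude exceeds 0.5
corr_mat <- heart_failure_correlations$correlations
high_idx <- which(abs(corr_mat) > 0.5 & upper.tri(corr_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr_mat)[high_idx[, 1]],
  var2 = colnames(corr_mat)[high_idx[, 2]],
  corr = corr_mat[high_idx]
)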

Exploratory Data Analysis - Diabetes Health Indicators Dataset

A summary of the Diabetes Health Indicators Dataset is provided below:

summary(diabetes_data)
##  Diabetes_012 HighBP       HighChol     CholCheck         BMI       
##  0.0:213703   1.0:108829   1.0:107591   1.0:244210   Min.   :12.00  
##  2.0: 35346   0.0:144851   0.0:146089   0.0:  9470   1st Qu.:24.00  
##  1.0:  4631                                          Median :27.00  
##                                                      Mean   :28.38  
##                                                      3rd Qu.:31.00  
##                                                      Max.   :98.00  
##                                                                     
##  Smoker       Stroke       HeartDiseaseorAttack PhysActivity Fruits      
##  1.0:112423   0.0:243388   0.0:229787           0.0: 61760   0.0: 92782  
##  0.0:141257   1.0: 10292   1.0: 23893           1.0:191920   1.0:160898  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Veggies      HvyAlcoholConsump AnyHealthcare NoDocbcCost  GenHlth    
##  1.0:205841   0.0:239424        1.0:241263    0.0:232326   5.0:12081  
##  0.0: 47839   1.0: 14256        0.0: 12417    1.0: 21354   3.0:75646  
##                                                            2.0:89084  
##                                                            4.0:31570  
##                                                            1.0:45299  
##                                                                       
##                                                                       
##     MentHlth         PhysHlth      DiffWalk      Sex              Age       
##  0.0    :175680   0.0    :160052   1.0: 42675   0.0:141974   9.0    :33244  
##  2.0    : 13054   30.0   : 19400   0.0:211005   1.0:111706   10.0   :32194  
##  30.0   : 12088   2.0    : 14764                             8.0    :30832  
##  5.0    :  9030   1.0    : 11388                             7.0    :26314  
##  1.0    :  8538   3.0    :  8495                             11.0   :23533  
##  3.0    :  7381   5.0    :  7622                             6.0    :19819  
##  (Other): 27909   (Other): 31959                             (Other):87744  
##  Education        Income     
##  4.0: 62750   8.0    :90385  
##  6.0:107325   7.0    :43219  
##  3.0:  9478   6.0    :36470  
##  5.0: 69910   5.0    :25883  
##  2.0:  4043   4.0    :20135  
##  1.0:   174   3.0    :15994  
##               (Other):21594

The factors above are recoded below for readability:

diabetes_data <- diabetes_data %>%
  mutate(
    Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
    CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
    AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
    GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
    Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
                 '6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
                 '11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
    Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
                       '4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
    Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
                    '5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
  )

summary(diabetes_data)
##       Diabetes_012    HighBP       HighChol    
##  No Diabetes:213703   1.0:108829   1.0:107591  
##  Diabetes   : 35346   0.0:144851   0.0:146089  
##  Prediabetes:  4631                            
##                                                
##                                                
##                                                
##                                                
##                      CholCheck           BMI        Smoker       Stroke      
##  Yes Chol Check in 5 years:244210   Min.   :12.00   1.0:112423   0.0:243388  
##  No Chol Check in 5 Years :  9470   1st Qu.:24.00   0.0:141257   1.0: 10292  
##                                     Median :27.00                            
##                                     Mean   :28.38                            
##                                     3rd Qu.:31.00                            
##                                     Max.   :98.00                            
##                                                                              
##  HeartDiseaseorAttack PhysActivity Fruits       Veggies      HvyAlcoholConsump
##  0.0:229787           0.0: 61760   0.0: 92782   1.0:205841   0.0:239424       
##  1.0: 23893           1.0:191920   1.0:160898   0.0: 47839   1.0: 14256       
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##        AnyHealthcare    NoDocbcCost       GenHlth         MentHlth     
##  Has Insurance:241263   0.0:232326   Poor     :12081   0.0    :175680  
##  No Insurance : 12417   1.0: 21354   Good     :75646   2.0    : 13054  
##                                      Very Good:89084   30.0   : 12088  
##                                      Fair     :31570   5.0    :  9030  
##                                      Excellent:45299   1.0    :  8538  
##                                                        3.0    :  7381  
##                                                        (Other): 27909  
##     PhysHlth      DiffWalk      Sex              Age       
##  0.0    :160052   1.0: 42675   0.0:141974   60-64  :33244  
##  30.0   : 19400   0.0:211005   1.0:111706   65-69  :32194  
##  2.0    : 14764                             55-59  :30832  
##  1.0    : 11388                             50-54  :26314  
##  3.0    :  8495                             70-74  :23533  
##  5.0    :  7622                             45-49  :19819  
##  (Other): 31959                             (Other):87744  
##                   Education                  Income     
##  Grade 12/GED          : 62750   Income>=75K    :90385  
##  >= 4 Yrs College      :107325   50K<=Income<75K:43219  
##  Grades 9 - 11         :  9478   35K<=Income<50K:36470  
##  1-3 Yrs College       : 69910   25K<=Income<35K:25883  
##  Grades 1-8            :  4043   20K<=Income<25K:20135  
##  No School/Kindergarten:   174   15K<=Income<20K:15994  
##                                  (Other)        :21594

Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.

Figure 4: Histogram for BMI (the only continuous feature) in the Diabetes Health Indicators Dataset

Figure 4 shows that BMI is right-skewed. This right skewness could also have been deduced from the summary: for the BMI variable, the maximum is 98, while the mean is 28 and the minimum is 12.
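
Since e1071 is already loaded, the degree of skew can also be quantified rather than judged visually. A one-line sketch (not part of the original analysis):

# Sample skewness of BMI; a value greater than 0 indicates right skew
e1071::skewness(diabetes_data$BMI)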

Boxplot - Diabetes Health Indicators Dataset

Figure 5: Boxplots for the Diabetes Health Indicators Dataset

Since BMI is the only continuous feature in this dataset, the boxplots in Figure 5 compare its distribution across the response classes. On average, respondents with diabetes appear to have a higher BMI than respondents without diabetes, which supports the theoretical effect of BMI as a diabetes risk factor (and is consistent with the positive, highly significant BMI coefficient in the logistic regression model fit later in this homework).

Examining Feature Multicollinearity for Continuous and Categorical Variables - Diabetes Health Indicators Dataset

Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. By using a correlation plot, we can visualize the relationships between certain features. As with the heart failure dataset, this dataset contains a mixture of continuous and categorical variables, so a heterogeneous correlation matrix was computed rather than one restricted to the continuous variables.
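
As before, the chunk that computes diabetes_correlations is not echoed; a minimal sketch under the same assumption:

# Sketch (assumed): heterogeneous correlation matrix for the diabetes data
# (hetcor() can be slow on 250,000+ rows)
diabetes_correlations <- polycor::hetcor(
  as.data.frame(diabetes_data),
  std.err = FALSE
)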

corrplot(diabetes_correlations$correlations, 
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 1,
         tl.cex = 1)

Figure 6: Correlation plot for the predictor variables in the Diabetes Health Indicators Dataset

Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The correlation with the largest magnitude has a value of 0.52; while this is just above the upper bound of what would be considered a “low correlation”, it exceeds that bound by only 0.02. Therefore, it is reasonable to say that the features in this dataset exhibit, at most, low correlation with one another.

Model Selection

Two models will be used to generate predictions: the multiple logistic regression model and the naive Bayes classifier. Both models were chosen because the response for both datasets is binary (in the diabetes dataset, after the prediabetes class is removed later in the data preparation). Both datasets also contain a mixture of categorical and continuous features, which these two machine learning algorithms can handle.

Pros and Cons - Multiple Logistic Regression Model

One of the strengths of a multiple logistic regression model lies in its interpretability. Interpretability is important for the datasets analyzed in this homework because it offers transparency: patients and doctors can easily understand why a particular prediction was made, which gives patients trust in the healthcare system. Model interpretability is also important when presenting a model to a stakeholder. Since both of these datasets involve healthcare, a healthcare organization might prioritize interpretability to grasp the reasoning behind predictions and incorporate them into clinical decision-making processes. A multiple logistic regression model is also computationally inexpensive, which means that larger datasets, in this case the diabetes dataset, can be fit in less time.
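
The interpretability claim is concrete: exponentiating the coefficients of a fitted logistic model yields odds ratios. A sketch, where fit stands for a generic fitted glm object such as the models fit later in this homework:

# Odds ratios with 95% profile-likelihood confidence intervals
# (`fit` is a placeholder for a fitted binomial glm)
exp(cbind(OR = coef(fit), confint(fit)))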

Conversely, multiple logistic regression models tend to underperform when there are multiple or nonlinear decision boundaries. Healthcare datasets in general are complex and may contain non-linear relationships, which explains the underperformance. Other weaknesses include the inability to handle missing data, susceptibility to multicollinearity, and sensitivity to outliers.

Pros and Cons - Naive Bayes Classifier

One of the strengths of a naive Bayes classifier is that it is simple and fast to implement. Like multiple logistic regression models, naive Bayes classifiers are computationally inexpensive; patient-monitoring systems generally operate in real time, which would warrant the use of such a model. Naive Bayes models are also able to handle noisy and missing data, which is important in the medical field because some patients will purposely withhold information from doctors out of fear, or their medical records may be incomplete. Just like a multiple logistic regression model, a naive Bayes classifier is easily interpretable when viewing the probabilities generated by the model, which helps with transparency in doctor-patient interactions. Finally, there is no need for normalization, unlike a k-nearest neighbors model.

Conversely, because naive Bayes classifiers make the assumption of “…class conditional independence, computed probabilities are not reliable when considered in isolation. The computed probability of an instance belonging to a particular class has to be evaluated relative to the computed probability of the same instance belonging to other classes.” (Practical Machine Learning in R, p. 269). In a healthcare setting this is problematic because some features in a healthcare dataset could carry more importance than others. Also, these models perform better with larger datasets, which means that the model generated for the heart failure prediction dataset will most likely underperform compared to the model generated for the diabetes health indicators dataset.
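
For reference, the class-conditional independence assumption quoted above means the classifier scores each class y as P(y | x1, ..., xp) ∝ P(y) · P(x1 | y) · ... · P(xp | y). Because the product of per-feature likelihoods is only proportional to the true posterior, a computed probability is meaningful only relative to the scores of the other classes, exactly as the quote states.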

Data Preparation

Note that because the SMOTE function is only applied to the training data, the class imbalance is dealt with after the training data for both of the datasets has been generated. Also note that different machine learning algorithms require different types of data transformations; some of these transformations are applied later in the homework.

Heart Failure Prediction Dataset

Dealing With Zero Values

There are 172 observations where Cholesterol is zero and one observation where RestingBP is equal to zero. Since these values are not physically achievable, all of the observations where Cholesterol or RestingBP were zero were dropped from the dataset.

heart_failure_data <- heart_failure_data %>%
  filter(Cholesterol > 0, RestingBP > 0)
summary(heart_failure_data)
##       Age        Sex     ChestPainType   RestingBP    Cholesterol    FastingBS
##  Min.   :28.00   M:564   ATA:166       Min.   : 92   Min.   : 85.0   0:621    
##  1st Qu.:46.00   F:182   NAP:169       1st Qu.:120   1st Qu.:207.2   1:125    
##  Median :54.00           ASY:370       Median :130   Median :237.0            
##  Mean   :52.88           TA : 41       Mean   :133   Mean   :244.6            
##  3rd Qu.:59.00                         3rd Qu.:140   3rd Qu.:275.0            
##  Max.   :77.00                         Max.   :200   Max.   :603.0            
##   RestingECG      MaxHR       ExerciseAngina    Oldpeak        ST_Slope  
##  Normal:445   Min.   : 69.0   N:459          Min.   :-0.1000   Up  :349  
##  ST    :125   1st Qu.:122.0   Y:287          1st Qu.: 0.0000   Flat:354  
##  LVH   :176   Median :140.0                  Median : 0.5000   Down: 43  
##               Mean   :140.2                  Mean   : 0.9016             
##               3rd Qu.:160.0                  3rd Qu.: 1.5000             
##               Max.   :202.0                  Max.   : 6.2000             
##  HeartDisease
##  0:390       
##  1:356       
##              
##              
##              
## 

The summary above reveals that there are now 746 observations. The minimum for RestingBP is now 92, while for Cholesterol it is 85, both of which fall within reasonable expectations. All of the other variables also fall within reasonable expectations.

Class Imbalance

prop.table(table(select(heart_failure_data, HeartDisease)))
## HeartDisease
##         0         1 
## 0.5227882 0.4772118

The output above shows the percentage of respondents that do not have heart disease (52.3%) and the percentage that do (47.7%). When the training dataset is generated later in this homework, the slight class imbalance in the training dataset is dealt with using the SMOTE function from the DMwR package.

Figure 7: Histograms for the continuous features in the Heart Failure Prediction dataset after removing 0 values from Cholesterol and RestingBP

After the removal of 0 values, Cholesterol displays slight right-skewness, but nothing extreme. All of the other variables have roughly normal distributions, which is ideal. The only continuous variable with significant right-skewness now is Oldpeak. This right-skewness is also reflected in the summary generated earlier: the minimum value is -0.1, the mean is 0.9, and the maximum is 6.2. I believe that this skewness exists in the first place because Oldpeak is a real-time measurement of a serious heart condition that requires immediate medical attention. This means that the majority of respondents were not at risk of a serious heart condition at the time the survey was conducted, which would make sense. If they did have an Oldpeak above or below zero, which some respondents do, then they should be at the hospital for immediate medical attention.

Figure 8: Boxplots for the Heart Failure Prediction dataset after removing 0 values from Cholesterol and RestingBP

After the transformation, the boxplot for the Cholesterol variable now indicates that, theoretically, people with higher Cholesterol levels are more likely to have HeartDisease.

corrplot(heart_failure_correlations$correlations, 
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 1,
         tl.cex = 1)

Figure 9: Correlation plot for the Heart Failure Prediction dataset after removing 0 values from Cholesterol and RestingBP

Many of the correlations between Cholesterol and the other features decreased in magnitude. Furthermore, notice that the peak correlation increased from 0.55 to 0.67, and that Oldpeak, ExerciseAngina, and ST_Slope now have higher correlation values.

Binning Continuous Features - Naive Bayes Model

As pointed out in Practical Machine Learning in R, continuous features should be discretized prior to being used in a naive Bayes model. Therefore, the binning is done in the code chunk below; bins are created based on quantiles.

# Discretize a numeric vector into (up to) 7 quantile-based bins;
# unique() drops duplicate breakpoints, which can reduce the bin count.
divide_equal_bins_func <- function(x, na.rm = FALSE) {
  cut(
    x,
    breaks = unique(quantile(x, probs = seq.int(0, 1, by = 1/7))),
    include.lowest = TRUE
  )
}


heart_failure_data_binned <- heart_failure_data %>%
  mutate_at(c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"), divide_equal_bins_func)
summary(heart_failure_data_binned)
##       Age      Sex     ChestPainType     RestingBP      Cholesterol  FastingBS
##  [28,42]:119   M:564   ATA:166       [92,118] :115   [85,192] :109   0:621    
##  (42,48]:122   F:182   NAP:169       (118,120]:110   (192,212]:109   1:125    
##  (48,52]: 97           ASY:370       (120,130]:166   (212,228]:106            
##  (52,55]:105           TA : 41       (130,135]: 45   (228,248]:107            
##  (55,58]: 95                         (135,140]:131   (248,270]:105            
##  (58,63]:111                         (140,150]: 89   (270,299]:103            
##  (63,77]: 97                         (150,200]: 90   (299,603]:107            
##   RestingECG        MaxHR     ExerciseAngina     Oldpeak    ST_Slope  
##  Normal:445   [69,112] :111   N:459          [-0.1,0]:318   Up  :349  
##  ST    :125   (112,125]:108   Y:287          (0,0.1] : 10   Flat:354  
##  LVH   :176   (125,137]:105                  (0.1,1] :151   Down: 43  
##               (137,146]:102                  (1,1.5] : 85             
##               (146,156]:110                  (1.5,2] : 98             
##               (156,169]:106                  (2,6.2] : 84             
##               (169,202]:104                                           
##  HeartDisease
##  0:390       
##  1:356       
##              
##              
##              
##              
## 

The summary above shows the Heart Failure Prediction Dataset after the numeric features were binned into 7 roughly equal-frequency (quantile) bins. Oldpeak ended up with only 6 bins because several of its quantile breakpoints coincide and unique() collapses them.
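
A quick check (a sketch, not part of the original analysis) illustrates why Oldpeak loses a bin: a large share of its observations equal 0, so neighboring septile breakpoints coincide.

# Septile breakpoints for Oldpeak; duplicate values collapse into one bin
quantile(heart_failure_data$Oldpeak, probs = seq.int(0, 1, by = 1/7))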

Diabetes Health Indicators Dataset

Class Imbalance

prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes Prediabetes 
##  0.84241170  0.13933302  0.01825528

The output above shows the percentage of respondents that do not have diabetes (84.24%), the percentage that have diabetes (13.9%), and the percentage that are prediabetic (1.8%). Note that the SMOTE-based approach used later to deal with class imbalance requires that only 2 classes exist within the response variable. Since people with prediabetes make up only 1.8% of the total number of observations, all of the observations where Diabetes_012 == Prediabetes were removed from the dataset. The nature of this study changed slightly with the removal of a class: instead of determining whether someone has no diabetes, is at risk of being prediabetic, or is at risk of being diabetic, the model will now just determine whether a person is at risk of diabetes or not. The modeling and prediction of the remaining classes still yielded valuable insights.

diabetes_data <- diabetes_data %>%
  mutate(Diabetes_012 = as.numeric(Diabetes_012)) %>%  # factor levels -> 1, 2, 3
  subset(Diabetes_012 != 3) %>%                        # level 3 is Prediabetes
  mutate(Diabetes_012 = as.factor(Diabetes_012)) %>%
  mutate(Diabetes_012 = dplyr::recode(Diabetes_012, '1' = 'No Diabetes', '2' = 'Diabetes'))

prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580761   0.1419239

Binning Continuous Features - Naive Bayes Model

As pointed out in Practical Machine Learning in R, continuous features should be discretized prior to being used in a naive Bayes model. The binning is done in the code chunk below using the same quantile-based function as before.

diabetes_data_binned <- diabetes_data %>%
  mutate_at(c("BMI"), divide_equal_bins_func)
summary(diabetes_data_binned)
##       Diabetes_012    HighBP       HighChol    
##  No Diabetes:213703   1.0:105916   1.0:104716  
##  Diabetes   : 35346   0.0:143133   0.0:144333  
##                                                
##                                                
##                                                
##                                                
##                                                
##                      CholCheck           BMI        Smoker       Stroke      
##  Yes Chol Check in 5 years:239641   [12,22]:36591   1.0:110141   0.0:239022  
##  No Chol Check in 5 Years :  9408   (22,24]:34772   0.0:138908   1.0: 10027  
##                                     (24,26]:37188                            
##                                     (26,28]:40433                            
##                                     (28,31]:40836                            
##                                     (31,34]:25902                            
##                                     (34,98]:33327                            
##  HeartDiseaseorAttack PhysActivity Fruits       Veggies      HvyAlcoholConsump
##  0.0:225820           0.0: 60271   0.0: 90940   1.0:202280   0.0:235001       
##  1.0: 23229           1.0:188778   1.0:158109   0.0: 46769   1.0: 14048       
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##        AnyHealthcare    NoDocbcCost       GenHlth         MentHlth     
##  Has Insurance:236886   0.0:228294   Poor     :11730   0.0    :172724  
##  No Insurance : 12163   1.0: 20755   Good     :73918   2.0    : 12823  
##                                      Very Good:87870   30.0   : 11727  
##                                      Fair     :30545   5.0    :  8849  
##                                      Excellent:44986   1.0    :  8418  
##                                                        3.0    :  7256  
##                                                        (Other): 27252  
##     PhysHlth      DiffWalk      Sex              Age       
##  0.0    :157581   1.0: 41390   0.0:139370   60-64  :32542  
##  30.0   : 18842   0.0:207659   1.0:109679   65-69  :31497  
##  2.0    : 14516                             55-59  :30282  
##  1.0    : 11214                             50-54  :25896  
##  3.0    :  8322                             70-74  :22931  
##  5.0    :  7454                             45-49  :19507  
##  (Other): 31120                             (Other):86394  
##                   Education                  Income     
##  Grade 12/GED          : 61400   Income>=75K    :89374  
##  >= 4 Yrs College      :105854   50K<=Income<75K:42484  
##  Grades 9 - 11         :  9164   35K<=Income<50K:35722  
##  1-3 Yrs College       : 68577   25K<=Income<35K:25296  
##  Grades 1-8            :  3882   20K<=Income<25K:19676  
##  No School/Kindergarten:   172   15K<=Income<20K:15573  
##                                  (Other)        :20924

The summary above shows the Diabetes Health Indicators dataset after BMI was binned into 7 quantile-based bins.

Dealing with Skewed Variables

diabetes_data %>%
  mutate(BMI_transformed = log10(BMI)) %>% 
  dplyr::select(-Diabetes_012) %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_density(col = 'red') +
    geom_histogram(aes(y = stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 10: The BMI variable plotted on a histogram alongside BMI_transformed, which was generated by applying a log transformation to BMI.

The output above shows us that the BMI_transformed variable displays slight bimodality, which is problematic. In order to preserve interpretability and to account for the inherent bimodality in the transformed variable, the original BMI variable is used instead of the transformed version.

Training and Testing Datasets

Both of the datasets were split such that 75% of the observations are used for training and 25% for testing.

Training and Testing Datasets - Heart Failure Prediction Dataset

set.seed(123)
original_split <- caTools::sample.split(heart_failure_data$HeartDisease, SplitRatio = 0.75)
heart_failure_data_train <-  subset(heart_failure_data, original_split == TRUE)
heart_failure_data_test <- subset(heart_failure_data, original_split == FALSE)
prop.table(table(select(heart_failure_data, HeartDisease)))
## HeartDisease
##         0         1 
## 0.5227882 0.4772118
prop.table(table(select(heart_failure_data_train, HeartDisease)))
## HeartDisease
##         0         1 
## 0.5223614 0.4776386
prop.table(table(select(heart_failure_data_test, HeartDisease)))
## HeartDisease
##         0         1 
## 0.5240642 0.4759358

The output above shows us that there is a slight class imbalance. SMOTE from the DMwR package is applied only to the training dataset.

heart_failure_data_train <- SMOTE(HeartDisease ~ ., data.frame(heart_failure_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(heart_failure_data_train, HeartDisease)))
## HeartDisease
##   0   1 
## 0.5 0.5

The output above is the class distribution after SMOTE has been applied; each class in the training dataset is now balanced. With perc.over = 100, SMOTE generates one synthetic minority case for each existing minority case (doubling the minority class), and with perc.under = 200, it samples twice as many majority cases as newly generated minority cases, which is why the result is exactly a 50/50 split. The same methodology to deal with the class imbalance will be applied to the heart_failure_data_binned dataset for the naive Bayes model.

heart_failure_data_train_binned <-  subset(heart_failure_data_binned, original_split == TRUE)
heart_failure_data_test_binned <- subset(heart_failure_data_binned, original_split == FALSE)

heart_failure_data_train_binned <- SMOTE(HeartDisease ~ ., data.frame(heart_failure_data_train_binned), perc.over = 100, perc.under = 200)
prop.table(table(select(heart_failure_data_train_binned, HeartDisease)))
## HeartDisease
##   0   1 
## 0.5 0.5

The output above shows us that the binned Heart Failure Prediction training dataset is balanced.

Training and Testing Datasets - Diabetes Health Indicators Dataset

set.seed(123)
original_split <- caTools::sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.75)
diabetes_data_train <-  subset(diabetes_data, original_split == TRUE)
diabetes_data_test <- subset(diabetes_data, original_split == FALSE)
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580761   0.1419239
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580736   0.1419264
prop.table(table(select(diabetes_data_test, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580836   0.1419164

The output above shows us that there is a significant class imbalance. SMOTE from the DMwR package is applied only to the training dataset.

diabetes_data_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##         0.5         0.5

The output above is the class distribution after SMOTE has been applied. Each class in the training dataset is balanced. The same methodology to deal with the class imbalance will be applied to the diabetes_data_train_binned dataset for the naive Bayes model.

diabetes_data_train_binned <-  subset(diabetes_data_binned, original_split == TRUE)
diabetes_data_test_binned <- subset(diabetes_data_binned, original_split == FALSE)

diabetes_data_train_binned <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train_binned), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train_binned, Diabetes_012)))
## Diabetes_012
## No Diabetes    Diabetes 
##         0.5         0.5

The output above shows us that the binned Diabetes Health Indicators training dataset is balanced after applying SMOTE from the DMwR package.

Model Fitting

Heart Failure Prediction Dataset - Multiple Logistic Regression

The glm function in R allowed for the generation of a multiple logistic regression model that uses all of the features in the training set to build a model that predicts HeartDisease:

heart_failure_data_logistic <- glm(data = heart_failure_data_train,
                                   family = binomial,
                                   formula = HeartDisease ~ .)

summary(heart_failure_data_logistic)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = heart_failure_data_train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -6.457650   1.290518  -5.004 5.62e-07 ***
## Age               0.032086   0.012211   2.628 0.008599 ** 
## SexF             -1.183395   0.208441  -5.677 1.37e-08 ***
## ChestPainTypeNAP  0.444573   0.277156   1.604 0.108702    
## ChestPainTypeASY  1.068242   0.258156   4.138 3.50e-05 ***
## ChestPainTypeTA  -0.015030   0.369229  -0.041 0.967529    
## RestingBP         0.015012   0.005754   2.609 0.009087 ** 
## Cholesterol       0.001675   0.001667   1.004 0.315224    
## FastingBS1        0.997053   0.237935   4.190 2.78e-05 ***
## RestingECGST      0.551774   0.261008   2.114 0.034514 *  
## RestingECGLVH     0.344996   0.216565   1.593 0.111152    
## MaxHR            -0.001413   0.004478  -0.315 0.752420    
## ExerciseAnginaY   1.208831   0.201299   6.005 1.91e-09 ***
## Oldpeak           0.414688   0.115616   3.587 0.000335 ***
## ST_SlopeFlat      2.262564   0.208116  10.872  < 2e-16 ***
## ST_SlopeDown      0.861049   0.417350   2.063 0.039100 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1480.56  on 1067  degrees of freedom
## Residual deviance:  795.24  on 1052  degrees of freedom
## AIC: 827.24
## 
## Number of Fisher Scoring iterations: 5

The output above indicates that MaxHR and Cholesterol are not significant. Therefore, the model is refit with these variables removed.

heart_failure_data_logistic <- glm(data = heart_failure_data_train %>% select(-MaxHR, -Cholesterol),
                                   family = binomial,
                                   formula = HeartDisease ~ .)

summary(heart_failure_data_logistic)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = heart_failure_data_train %>% 
##     select(-MaxHR, -Cholesterol))
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -6.476453   0.921383  -7.029 2.08e-12 ***
## Age               0.034252   0.011333   3.022 0.002507 ** 
## SexF             -1.157966   0.206410  -5.610 2.02e-08 ***
## ChestPainTypeNAP  0.438983   0.276947   1.585 0.112948    
## ChestPainTypeASY  1.100638   0.254371   4.327 1.51e-05 ***
## ChestPainTypeTA  -0.018212   0.369018  -0.049 0.960638    
## RestingBP         0.015647   0.005716   2.738 0.006190 ** 
## FastingBS1        1.017873   0.236755   4.299 1.71e-05 ***
## RestingECGST      0.549653   0.258048   2.130 0.033168 *  
## RestingECGLVH     0.361398   0.215100   1.680 0.092930 .  
## ExerciseAnginaY   1.210874   0.200650   6.035 1.59e-09 ***
## Oldpeak           0.417098   0.115966   3.597 0.000322 ***
## ST_SlopeFlat      2.267745   0.205569  11.032  < 2e-16 ***
## ST_SlopeDown      0.859721   0.415565   2.069 0.038565 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1480.56  on 1067  degrees of freedom
## Residual deviance:  796.32  on 1054  degrees of freedom
## AIC: 824.32
## 
## Number of Fisher Scoring iterations: 5

The model summary above shows that all of the features are significant or have at least one level that is significant. The AIC value has decreased slightly, which is ideal.

heart_failure_data_logistic_pred <- predict(heart_failure_data_logistic, heart_failure_data_test %>% select(-MaxHR, -Cholesterol), type = 'response')
heart_failure_data_logistic_pred <- ifelse(heart_failure_data_logistic_pred >= 0.5, 1, 0)

caret::confusionMatrix(
    as.factor(as.vector(heart_failure_data_logistic_pred)), heart_failure_data_test$HeartDisease, positive = "1"
  )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81  7
##          1 17 82
##                                          
##                Accuracy : 0.8717         
##                  95% CI : (0.8151, 0.916)
##     No Information Rate : 0.5241         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.744          
##                                          
##  Mcnemar's Test P-Value : 0.06619        
##                                          
##             Sensitivity : 0.9213         
##             Specificity : 0.8265         
##          Pos Pred Value : 0.8283         
##          Neg Pred Value : 0.9205         
##              Prevalence : 0.4759         
##          Detection Rate : 0.4385         
##    Detection Prevalence : 0.5294         
##       Balanced Accuracy : 0.8739         
##                                          
##        'Positive' Class : 1              
## 

The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 87.17%, which is a relatively high accuracy score.

test_roc = roc(heart_failure_data_test$HeartDisease ~ predict(heart_failure_data_logistic, heart_failure_data_test, type = 'response'), plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Figure 12: The ROC curve for the heart failure prediction dataset using the multiple logistic regression model.

The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.931.
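
As a side note, the AUC printed on the plot can also be read directly off the pROC object returned by roc(). A one-line sketch:

as.numeric(test_roc$auc)  # area under the ROC curve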

Heart Failure Prediction Dataset - Naive Bayes Model

heart_failure_data_bayes <- e1071::naiveBayes(
  HeartDisease ~ ., data = heart_failure_data_train_binned, laplace = 1
  )
heart_failure_data_bayes_pred <- predict(heart_failure_data_bayes, heart_failure_data_test_binned)

caret::confusionMatrix(
    heart_failure_data_bayes_pred, heart_failure_data_test$HeartDisease, positive = "1"
  )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 80 10
##          1 18 79
##                                           
##                Accuracy : 0.8503          
##                  95% CI : (0.7909, 0.8981)
##     No Information Rate : 0.5241          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7011          
##                                           
##  Mcnemar's Test P-Value : 0.1859          
##                                           
##             Sensitivity : 0.8876          
##             Specificity : 0.8163          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.4759          
##          Detection Rate : 0.4225          
##    Detection Prevalence : 0.5187          
##       Balanced Accuracy : 0.8520          
##                                           
##        'Positive' Class : 1               
## 

The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 85.03%, which is a relatively high accuracy score, but also slightly lower than the accuracy score that was generated for the multiple logistic regression model.

heart_failure_data_bayes_pred_prob <- predict(heart_failure_data_bayes, heart_failure_data_test_binned, type = "raw")

roc_pred <- prediction(
  predictions = heart_failure_data_bayes_pred_prob[, "1"],
  labels = heart_failure_data_test$HeartDisease
)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

unlist(slot(performance(roc_pred, measure = "auc"),"y.values"))
## [1] 0.9228388

Figure 13: The ROC curve for the heart failure prediction dataset using the naive Bayes model.

The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.923.

Diabetes Health Indicators Dataset - Multiple Logistic Regression Model

The glm function in R allowed for the generation of a multiple logistic regression model that uses all of the features in the training set to build a model that predicts Diabetes_012:

diabetes_data_logistic <- glm(data = diabetes_data_train,
                              family = binomial,
                              formula = Diabetes_012 ~ .)

summary(diabetes_data_logistic)
## 
## Call:
## glm(formula = Diabetes_012 ~ ., family = binomial, data = diabetes_data_train)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -0.488330   0.378110  -1.292 0.196530    
## HighBP0.0                         -0.524529   0.016903 -31.031  < 2e-16 ***
## HighChol0.0                       -0.467302   0.016369 -28.547  < 2e-16 ***
## CholCheckNo Chol Check in 5 Years -0.048148   0.042593  -1.130 0.258309    
## BMI                                0.070229   0.001359  51.662  < 2e-16 ***
## Smoker0.0                          0.092005   0.016345   5.629 1.82e-08 ***
## Stroke1.0                          1.053223   0.030535  34.493  < 2e-16 ***
## HeartDiseaseorAttack1.0            0.811062   0.022571  35.935  < 2e-16 ***
## PhysActivity1.0                   -0.219496   0.017674 -12.419  < 2e-16 ***
## Fruits1.0                         -0.086241   0.016712  -5.160 2.47e-07 ***
## Veggies0.0                         0.336853   0.019015  17.715  < 2e-16 ***
## HvyAlcoholConsump1.0              -0.104429   0.035002  -2.983 0.002850 ** 
## AnyHealthcareNo Insurance          0.957329   0.030239  31.659  < 2e-16 ***
## NoDocbcCost1.0                     0.757384   0.024217  31.275  < 2e-16 ***
## GenHlthGood                       -0.721380   0.033537 -21.510  < 2e-16 ***
## GenHlthVery Good                  -1.285552   0.035463 -36.251  < 2e-16 ***
## GenHlthFair                       -0.340501   0.034288  -9.931  < 2e-16 ***
## GenHlthExcellent                  -1.967598   0.044953 -43.770  < 2e-16 ***
## MentHlth0.0                        0.150376   0.370457   0.406 0.684802    
## MentHlth30.0                       0.439684   0.371401   1.184 0.236472    
## MentHlth3.0                        0.210148   0.373123   0.563 0.573289    
## MentHlth5.0                        0.228070   0.372541   0.612 0.540405    
## MentHlth15.0                       0.060519   0.373332   0.162 0.871224    
## MentHlth10.0                       0.398566   0.372837   1.069 0.285066    
## MentHlth6.0                        0.720633   0.387780   1.858 0.063119 .  
## MentHlth20.0                       0.320272   0.375337   0.853 0.393498    
## MentHlth2.0                        0.172520   0.372038   0.464 0.642851    
## MentHlth25.0                       0.252323   0.382305   0.660 0.509250    
## MentHlth1.0                        0.086987   0.373367   0.233 0.815778    
## MentHlth4.0                        0.117357   0.375968   0.312 0.754929    
## MentHlth7.0                        0.212479   0.377001   0.564 0.573024    
## MentHlth8.0                        0.151863   0.400680   0.379 0.704677    
## MentHlth21.0                       0.796661   0.436635   1.825 0.068069 .  
## MentHlth14.0                       0.340085   0.384948   0.883 0.376989    
## MentHlth26.0                      -1.061173   0.614762  -1.726 0.084320 .  
## MentHlth29.0                       0.464101   0.478452   0.970 0.332044    
## MentHlth16.0                       0.301377   0.606605   0.497 0.619312    
## MentHlth28.0                      -0.269857   0.428409  -0.630 0.528757    
## MentHlth11.0                      -2.154004   0.823525  -2.616 0.008907 ** 
## MentHlth12.0                       0.613102   0.413731   1.482 0.138370    
## MentHlth24.0                       0.555237   0.686383   0.809 0.418555    
## MentHlth17.0                       1.351435   0.619022   2.183 0.029023 *  
## MentHlth13.0                      -0.078290   0.672947  -0.116 0.907384    
## MentHlth27.0                      -0.905133   0.559929  -1.617 0.105983    
## MentHlth19.0                       0.847598   0.937301   0.904 0.365838    
## MentHlth22.0                       1.034849   0.676059   1.531 0.125842    
## MentHlth9.0                        0.981151   0.563585   1.741 0.081700 .  
## MentHlth23.0                       0.631030   0.629775   1.002 0.316347    
## PhysHlth0.0                       -0.422158   0.048379  -8.726  < 2e-16 ***
## PhysHlth30.0                      -0.250079   0.051765  -4.831 1.36e-06 ***
## PhysHlth2.0                       -0.264645   0.056902  -4.651 3.31e-06 ***
## PhysHlth14.0                      -0.426269   0.084011  -5.074 3.90e-07 ***
## PhysHlth28.0                      -0.235768   0.149143  -1.581 0.113919    
## PhysHlth7.0                       -0.412229   0.072634  -5.675 1.38e-08 ***
## PhysHlth20.0                      -0.297240   0.073755  -4.030 5.57e-05 ***
## PhysHlth3.0                       -0.416083   0.062609  -6.646 3.02e-11 ***
## PhysHlth10.0                      -0.283765   0.064813  -4.378 1.20e-05 ***
## PhysHlth1.0                       -0.325770   0.061920  -5.261 1.43e-07 ***
## PhysHlth5.0                       -0.352182   0.062483  -5.636 1.74e-08 ***
## PhysHlth17.0                       0.209474   0.268756   0.779 0.435732    
## PhysHlth4.0                       -0.196521   0.071678  -2.742 0.006112 ** 
## PhysHlth19.0                       8.570479  45.855885   0.187 0.851739    
## PhysHlth6.0                        0.215560   0.100014   2.155 0.031139 *  
## PhysHlth12.0                      -0.167998   0.144654  -1.161 0.245489    
## PhysHlth25.0                      -0.249285   0.099561  -2.504 0.012285 *  
## PhysHlth27.0                      -1.260024   0.336671  -3.743 0.000182 ***
## PhysHlth21.0                      -0.649979   0.144066  -4.512 6.43e-06 ***
## PhysHlth22.0                       1.349212   0.553284   2.439 0.014746 *  
## PhysHlth8.0                       -0.247913   0.124817  -1.986 0.047010 *  
## PhysHlth29.0                       0.387322   0.220716   1.755 0.079286 .  
## PhysHlth24.0                      -0.591053   0.375186  -1.575 0.115173    
## PhysHlth9.0                        0.017766   0.258017   0.069 0.945104    
## PhysHlth16.0                      -0.090589   0.320502  -0.283 0.777448    
## PhysHlth18.0                      -0.460250   0.273019  -1.686 0.091838 .  
## PhysHlth23.0                      -0.520245   0.437157  -1.190 0.234020    
## PhysHlth13.0                       0.456226   0.471775   0.967 0.333523    
## PhysHlth26.0                      -0.455110   0.371313  -1.226 0.220319    
## PhysHlth11.0                      -0.394249   0.527303  -0.748 0.454659    
## DiffWalk0.0                       -0.336119   0.020080 -16.739  < 2e-16 ***
## Sex1.0                             0.273631   0.016485  16.599  < 2e-16 ***
## Age50-54                          -0.333666   0.032394 -10.300  < 2e-16 ***
## Age70-74                           0.206061   0.031631   6.515 7.29e-11 ***
## Age65-69                           0.173289   0.029134   5.948 2.71e-09 ***
## Age55-59                          -0.235598   0.030544  -7.713 1.23e-14 ***
## Age>=80                           -0.069994   0.035263  -1.985 0.047153 *  
## Age35-39                          -1.043576   0.049481 -21.091  < 2e-16 ***
## Age45-49                          -0.464748   0.036625 -12.689  < 2e-16 ***
## Age25-29                          -1.220957   0.068460 -17.835  < 2e-16 ***
## Age75-79                           0.162624   0.035513   4.579 4.67e-06 ***
## Age40-44                          -0.822764   0.042904 -19.177  < 2e-16 ***
## Age18-24                          -1.856152   0.107473 -17.271  < 2e-16 ***
## Age30-34                          -1.490660   0.062274 -23.937  < 2e-16 ***
## Education>= 4 Yrs College         -0.036676   0.021516  -1.705 0.088271 .  
## EducationGrades 9 - 11             0.161958   0.037633   4.304 1.68e-05 ***
## Education1-3 Yrs College           0.010221   0.021164   0.483 0.629132    
## EducationGrades 1-8                0.530049   0.054134   9.791  < 2e-16 ***
## EducationNo School/Kindergarten    0.352758   0.228937   1.541 0.123353    
## Income<10K                         0.125423   0.043630   2.875 0.004044 ** 
## IncomeIncome>=75K                 -0.437775   0.033348 -13.128  < 2e-16 ***
## Income35K<=Income<50K             -0.307460   0.034433  -8.929  < 2e-16 ***
## Income20K<=Income<25K             -0.123112   0.036986  -3.329 0.000873 ***
## Income50K<=Income<75K             -0.236641   0.034446  -6.870 6.43e-12 ***
## Income10K<=Income<15K              0.036996   0.040717   0.909 0.363543    
## Income25K<=Income<35K             -0.159237   0.035568  -4.477 7.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 147003  on 106039  degrees of freedom
## Residual deviance:  96890  on 105936  degrees of freedom
## AIC: 97098
## 
## Number of Fisher Scoring iterations: 8

The model summary above shows that, with the exception of CholCheck, all of the features are significant or have at least one level that is significant.

diabetes_data_logistic_pred <- predict(diabetes_data_logistic, diabetes_data_test, type = 'response')
diabetes_data_logistic_pred <- ifelse(diabetes_data_logistic_pred >= 0.5, "Diabetes", "No Diabetes")

caret::confusionMatrix(
  # Note: confusionMatrix() treats its first argument as the predictions
  # and the second as the reference (truth); here the truth is passed
  # first, so the "Prediction"/"Reference" labels below are transposed.
  ordered(diabetes_data_test$Diabetes_012, levels = c("No Diabetes", "Diabetes")), 
  ordered(as.vector(as.factor(diabetes_data_logistic_pred)), levels = c("No Diabetes", "Diabetes")),
  positive = "Diabetes"
  )
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes       42312    11114
##   Diabetes           2977     5859
##                                          
##                Accuracy : 0.7737         
##                  95% CI : (0.7704, 0.777)
##     No Information Rate : 0.7274         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3287         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.3452         
##             Specificity : 0.9343         
##          Pos Pred Value : 0.6631         
##          Neg Pred Value : 0.7920         
##              Prevalence : 0.2726         
##          Detection Rate : 0.0941         
##    Detection Prevalence : 0.1419         
##       Balanced Accuracy : 0.6397         
##                                          
##        'Positive' Class : Diabetes       
## 

The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 77.37%, modestly above the no-information rate of 72.74%.
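As a sanity check, the accuracy can be recomputed directly from the cell counts printed above; the diagonal of the confusion matrix holds the correctly classified observations.

# Correct classifications divided by the total number of test observations
(42312 + 5859) / (42312 + 11114 + 2977 + 5859)
## [1] 0.7736822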

test_roc <- roc(
  diabetes_data_test$Diabetes_012 ~ predict(diabetes_data_logistic, diabetes_data_test, type = 'response'), 
  plot = TRUE, 
  print.auc = TRUE
  )
## Setting levels: control = No Diabetes, case = Diabetes
## Setting direction: controls < cases

Figure 14: The ROC curve for the diabetes health indicators dataset using the multiple logistic regression model.

The ROC curve is generated in the code chunk above, and the calculated AUC is 0.831.
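The AUC can also be extracted from the pROC object directly rather than read off the plot; a minimal sketch, assuming the test_roc object created above:

# auc() returns the area under the curve stored in the roc object
pROC::auc(test_roc)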

Diabetes Health Indicators Dataset - Naive Bayes Model

# Fit a naive Bayes classifier to the binned training data, with Laplace
# smoothing so no conditional probability is estimated as exactly zero
diabetes_data_bayes <- e1071::naiveBayes(
  Diabetes_012 ~ ., data = diabetes_data_train_binned, laplace = 1
  )

Note that laplace = 1 applies Laplace smoothing: a feature level that never co-occurs with a class in the training data receives a small positive conditional probability instead of zero, which would otherwise zero out the entire posterior for that class.
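The smoothed conditional probability tables the classifier learned can be inspected directly on the fitted object. The sketch below assumes the HighBP feature from the binned training data; e1071 stores one table per feature, with one row per class and one column per feature level.

# Conditional probabilities P(HighBP level | class), after Laplace smoothing
diabetes_data_bayes$tables$HighBP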

# Class predictions on the binned test set
diabetes_data_bayes_pred <- predict(diabetes_data_bayes, diabetes_data_test_binned)

# As above, the observed classes are passed first, so the printed
# "Prediction"/"Reference" labels are transposed. Note also that the
# positive class here is "No Diabetes", unlike the logistic regression run.
caret::confusionMatrix(
  ordered(diabetes_data_test$Diabetes_012, levels = c("No Diabetes", "Diabetes")), 
  ordered(as.vector(as.factor(diabetes_data_bayes_pred)), levels = c("No Diabetes", "Diabetes")),
  positive = "No Diabetes"
  )
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes       42252    11174
##   Diabetes           3283     5553
##                                           
##                Accuracy : 0.7678          
##                  95% CI : (0.7645, 0.7711)
##     No Information Rate : 0.7313          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3055          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9279          
##             Specificity : 0.3320          
##          Pos Pred Value : 0.7909          
##          Neg Pred Value : 0.6285          
##              Prevalence : 0.7313          
##          Detection Rate : 0.6786          
##    Detection Prevalence : 0.8581          
##       Balanced Accuracy : 0.6299          
##                                           
##        'Positive' Class : No Diabetes     
## 

The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 76.78%, which is relatively high, although slightly lower than the 77.37% achieved by the multiple logistic regression model. Note also that because the positive class here is "No Diabetes" rather than "Diabetes", the sensitivity and specificity reported above are not directly comparable with those from the logistic regression output.

diabetes_data_bayes_pred_prob <- predict(diabetes_data_bayes, diabetes_data_test_binned, type = "raw")

# Build an ROCR prediction object from the posterior probability of the
# "No Diabetes" class; with the default alphabetical label ordering, ROCR
# treats "No Diabetes" (the second level) as the positive class
roc_pred <- prediction(
  predictions = diabetes_data_bayes_pred_prob[, "No Diabetes"],
  labels = diabetes_data_test$Diabetes_012
)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

# Extract the AUC from the ROCR performance object
unlist(slot(performance(roc_pred, measure = "auc"), "y.values"))
## [1] 0.7978325

Figure 15: The ROC curve for the diabetes health indicators dataset using the naive Bayes model.

The ROC curve is generated in the code chunk above, and the calculated AUC is 0.798.

Essay

Introduction

In this assignment, two machine learning algorithms were fit to two datasets of different sizes. The smaller of the two datasets involves cardiovascular disease data combined from several sources into a single dataset. Cardiovascular disease is the leading cause of death globally according to the World Health Organization (WHO), which estimates that 17.9 million deaths occurred in 2019 as a result of cardiovascular disease. According to the CDC, heart disease and stroke medical costs are estimated to be nearly $1 billion per day. Therefore, it is important to identify risk factors, which appear as features within the dataset, so that individuals and health care professionals can work towards reducing these risk factors and preventing heart disease.

This dataset includes the following variables:

  • Age: age of the patient [years]
  • Sex: sex of the patient [M: Male, F: Female]
  • ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  • RestingBP: resting blood pressure [mm Hg]
  • Cholesterol: serum cholesterol [mg/dl]
  • FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  • RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
  • MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  • ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  • Oldpeak: oldpeak = ST [Numeric value measured in depression]
    • Represents the amount of ST depression measured during an exercise test relative to rest. Wikipedia points out that if this value is above 1 or below -1, then that is an indication of a significant heart problem.
  • ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  • HeartDisease: output class [1: heart disease, 0: Normal]

The response variable for this dataset is HeartDisease and, in total, the dataset consists of 918 observations. More information about the dataset itself can be found [here](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

The larger of the datasets involves indicators for diabetes. This dataset was compiled from a telephone survey conducted by the CDC in 2015. Questions asked in the survey involved health-related risk behaviors, chronic health conditions, and the use of preventative services. Also included are age, education, income, location, and race, to name a few. There are 3 .csv files that can be used for analysis; the one used in this homework was the diabetes_012_health_indicators_BRFSS2015.csv file. This .csv file contains 253,680 survey responses (observations) and 21 features. The response variable is multiclass, in that it contains 3 different classes: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes. The author of this dataset points out that there is a class imbalance.

This dataset includes the following variables:

  • Diabetes_012: 0 = no diabetes 1 = prediabetes 2 = diabetes
  • HighBP: 0 = no high BP 1 = high BP
  • HighChol: 0 = no high cholesterol 1 = high cholesterol
  • CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years
  • BMI: Body Mass Index
  • Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes
  • Stroke: (Ever told) you had a stroke. 0 = no 1 = yes
  • HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes
  • PhysActivity: physical activity in past 30 days - not including job 0 = no 1 = yes
  • Fruits: Consume Fruit 1 or more times per day 0 = no 1 = yes
  • Veggies: Consume Vegetables 1 or more times per day 0 = no 1 = yes
  • HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes
  • AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes
  • NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes
  • GenHlth: Would you say that in general your health is: scale 1-5:
    • 1 = excellent
    • 2 = very good
    • 3 = good
    • 4 = fair
    • 5 = poor
  • MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
    • 1 - 30: number of days
    • 88: None
    • 77: Don’t know/Not sure
    • 99: Refused
  • PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
    • 1 - 30: number of days
    • 88: None
    • 77: Don’t know/Not sure
    • 99: Refused
    • BLANK: Not asked or Missing
  • DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes
  • Sex: 0 = female 1 = male
  • Age: 13-level age category:
    • 1 = 18-24
    • 2 = 25-29
    • 3 = 30-34
    • 4 = 35-39
    • 5 = 40-44
    • 6 = 45-49
    • 7 = 50-54
    • 8 = 55-59
    • 9 = 60-64
    • 10 = 65-69
    • 11 = 70-74
    • 12 = 75-79
    • 13 = 80 or older
  • Education: Education level; scale 1-6:
    • 1 = Never attended school or only kindergarten
    • 2 = Grades 1 through 8 (Elementary)
    • 3 = Grades 9 through 11 (Some high school)
    • 4 = Grade 12 or GED (High school graduate)
    • 5 = College 1 year to 3 years (Some college or technical school)
    • 6 = College 4 years or more (College graduate)
  • Income: Income scale; scale 1-8:
    • 1 = less than $10,000
    • 2 = less than $15,000
    • 3 = less than $20,000
    • 4 = less than $25,000
    • 5 = less than $35,000
    • 6 = less than $50,000
    • 7 = less than $75,000
    • 8 = $75,000 or more

Model Selection

The two models used to generate predictions are the multiple logistic regression model and the naive Bayes classifier. Both were chosen because, for both datasets, the response is a binary class (the three-level diabetes response was collapsed to two classes). Both datasets also contain a mixture of categorical and continuous features, which both algorithms can handle.

Pros and Cons - Multiple Logistic Regression Model

One of the strengths of a multiple logistic regression model lies in its interpretability, which matters for the datasets analyzed in this homework because interpretable models offer transparency: patients and doctors can easily understand why a particular prediction was made, which builds trust in the healthcare system. Model interpretability is also important when presenting a model to a stakeholder; since both of these datasets involve healthcare, a healthcare organization might prioritize interpretability to grasp the reasoning behind predictions and incorporate them into clinical decision-making. A multiple logistic regression model is also computationally inexpensive, meaning that larger datasets, in this case the diabetes dataset, can be fit in less time.

Conversely, multiple logistic regression models tend to underperform when the decision boundary is nonlinear or when multiple decision boundaries are needed. Healthcare datasets in general are complex and may contain non-linear relationships, which explains the potential underperformance. Other weaknesses include the inability to handle missing data, sensitivity to multicollinearity, and sensitivity to outliers.

Pros and Cons - Naive Bayes Classifier

One of the strengths of a naive Bayes classifier is that it is simple and fast to implement. Like multiple logistic regression models, naive Bayes classifiers are also computationally inexpensive; patient-monitoring systems generally operate in real time, which would warrant the use of such a model. Naive Bayes models are also able to handle missing and noisy data, which is important in the medical field because some patients will purposely withhold information from doctors out of fear, or their medical records may be incomplete. Just like a multiple logistic regression model, a naive Bayes classifier is easily interpretable when viewing the probabilities generated by the model, which helps with transparency in doctor-patient interactions. Finally, unlike a k-nearest neighbors model, there is no need to normalize the features.

Conversely, naive Bayes classifiers make the assumption of “…class conditional independence, computed probabilities are not reliable when considered in isolation. The computed probability of an instance belonging to a particular class has to be evaluated relative to the computed probability of the same instance belonging to other classes.” (Practical Machine Learning in R, p. 269). In a healthcare setting this is problematic because clinical features are often correlated (for example, blood pressure and BMI), violating the independence assumption, and some features carry more importance than others. Also, these models perform better with larger datasets, which means that the model generated for the heart failure prediction dataset will most likely underperform compared to the model generated for the diabetes health indicators dataset.
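To make the quoted caveat concrete, the class scores computed under the independence assumption are only meaningful relative to one another, and must be normalized across classes before they can be read as probabilities. The toy sketch below uses invented numbers purely for illustration:

# Hypothetical priors and class-conditional likelihoods for a single instance
# (all values are made up for illustration)
prior      <- c(diabetes = 0.27, no_diabetes = 0.73)
likelihood <- c(diabetes = 0.04 * 0.60, no_diabetes = 0.02 * 0.30)  # product over features
score      <- prior * likelihood   # unnormalized scores; not probabilities on their own
score / sum(score)                 # comparable posteriors only after normalizing across classes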

Correlation Between Variables

The correlation plot revealed that for the Heart Failure Prediction dataset, ST_Slope, Oldpeak, and ExerciseAngina have moderately high pairwise correlations. After the data preparation stage, where the data was transformed, the maximum correlation increased from 0.55 to 0.67. The correlation plots for the Diabetes Health Indicators dataset revealed that the correlation with the largest magnitude is 0.52, only 0.02 above the conventional upper bound for what would be considered a “low” correlation.
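The maximum off-diagonal correlation quoted above can be recovered programmatically. The following is a sketch, assuming the factor features are first expanded into numeric dummy columns via model.matrix():

# Expand factors to dummy columns, compute the correlation matrix,
# zero out the diagonal of 1s, and take the largest remaining magnitude
cor_mat <- cor(model.matrix(~ . - 1, data = heart_failure_data))
diag(cor_mat) <- 0
max(abs(cor_mat))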

Conclusion

In terms of making a final business decision, it would be best to use the multiple logistic regression model for the heart failure prediction dataset, while for the diabetes indicators dataset the naive Bayes model would probably be best. Because both of these models are easily interpretable, the more accurate the algorithm, the more useful it would be in generating business-driven decisions. The selected algorithms also ran relatively quickly on the larger dataset; I had originally tried k-nearest neighbors, but after an hour of waiting for the model to fit, I moved to a different algorithm.

I believe that an analysis is more prone to error when very little data is used than when a giant dataset is used. Fitting a giant dataset to a model could potentially lead to overfitting, but there are sampling techniques that can be used to account for this; having very little data, by contrast, can lead to underfitting. Even when sampling a small dataset, the sample may not capture the underlying distribution as well as a sample drawn from a much larger dataset, which is more likely to share the distribution of the original data.

The heart failure dataset had better accuracy scores for both machine learning algorithms than the diabetes dataset. It could be that the diabetes indicators dataset had more noise, or that overfitting was present in the larger dataset. It would have been nice to see the predictive accuracy of the k-nearest neighbors model, but that is left for a future endeavor.