rm(list = ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(tidyverse)
library(polycor)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(e1071)
library(caret)
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
In this assignment, two machine learning algorithms were fit to two datasets of different sizes. The smaller of the two datasets contains cardiovascular disease data that were collected from several sources and combined into one dataset. Cardiovascular disease is the leading cause of death globally according to the WHO (World Health Organization), which estimates that 17.9 million deaths occurred in 2019 as a result of cardiovascular disease. According to the CDC, heart disease and stroke medical costs are estimated to be nearly $1 billion a day. It is therefore important to identify risk factors, which appear as features within the dataset, so that individuals and health care professionals can work towards reducing these risk factors and preventing heart disease.
This dataset includes the following variables:
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [Numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
The response variable for this dataset is HeartDisease and, in total, the dataset consists of 918 observations. More information about the dataset itself can be found [here](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset).
The larger of the datasets involves indicators for diabetes. This
dataset was compiled from a telephone survey conducted by the CDC in
2015. Questions asked in the survey covered health-related risk
behaviors, chronic health conditions, and the use of preventative
services. Also included are age, education, income, location, and race, to
name a few. There are 3 .csv files that can be used for analysis. The
one that was used in this homework was the
diabetes_012_health_indicators_BRFSS2015.csv file.
This .csv file contains 253,680 survey responses (observations) and 21
features. The response variable is multiclass, in that it contains 3
different classes: 0 for no diabetes, 1 for prediabetes, and 2 for
diabetes. The author of this dataset points out that there is a class
imbalance.
This dataset includes the following variables:
Diabetes_012: 0 = no diabetes, 1 = prediabetes, 2 = diabetes
HighBP: 0 = no high BP, 1 = high BP
HighChol: 0 = no high cholesterol, 1 = high cholesterol
CholCheck: 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years
BMI: Body Mass Index
Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
Stroke: (Ever told) you had a stroke. 0 = no, 1 = yes
HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI) 0 = no, 1 = yes
PhysActivity: physical activity in past 30 days - not including job 0 = no, 1 = yes
Fruits: Consume Fruit 1 or more times per day 0 = no, 1 = yes
Veggies: Consume Vegetables 1 or more times per day 0 = no, 1 = yes
HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no, 1 = yes
AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes
NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes
GenHlth: Would you say that in general your health is: scale 1-5
MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes
Sex: 0 = female, 1 = male
Age: 13-level age category
Education: Education level; scale 1-6
Income: Income scale; scale 1-8
heart_failure_data <- read_csv(
file = "heart_failure_prediction.csv",
col_types = "nffnnffnfnff"
)
diabetes_data <- read_csv(
file = "diabetes_012_health_indicators_BRFSS2015.csv",
col_types = "ffffnfffffffffffffffff")
A summary of the heart failure prediction dataset is provided below:
summary(heart_failure_data)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 M:725 ATA:173 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 F:193 NAP:203 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 ASY:496 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0:704 Normal:552 Min. : 60.0 N:547 Min. :-2.6000
## 1:214 ST :178 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## LVH :188 Median :138.0 Median : 0.6000
## Mean :136.8 Mean : 0.8874
## 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Up :395 0:410
## Flat:460 1:508
## Down: 63
##
##
##
The minimum, maximum, median, and mean values for Age
are within normal expectations. Looking at Sex, there seem to be many
more male respondents than female respondents. Most of the
other features shown above fall within reasonable expectations, except
for Cholesterol and RestingBP: a minimum value
of 0 for either measurement is not physically possible. Other
than that, there were no missing values within the dataset. A plot
showing the distributions of the continuous variables is shown
below.
Figure 1: Histograms for the continuous features in the Heart Failure Prediction dataset
heart_failure_data %>%
summarise(
zeroes_Cholesterol = sum(Cholesterol == 0),
zeroes_RestingBP = sum(RestingBP == 0)
)
The count above, along with the histograms, confirms that there is
a large number of 0 values for Cholesterol, while
there is only one value of 0 for
RestingBP. Age and MaxHR look
approximately normally distributed, while Oldpeak displays signs
of right-skewness.
Figure 2: Boxplots for the Heart Failure Prediction dataset
The boxplots in Figure 2 support the expected effects of several of
the variables. Based on the Age boxplot, older people are more likely to
have heart disease. Based on the MaxHR boxplot, people with a lower
maximum heart rate are, on average, more likely to have heart disease.
Finally, people with a higher Oldpeak are, on average, more likely to have
heart disease, which makes sense given that an Oldpeak
equal to ± 1 is indicative of a serious health condition.
Cholesterol and RestingBP are dealt with later
in the Homework because of the 0 values present in these
variables.
Finally, it is imperative to understand which features are correlated
with each other in order to address and avoid multicollinearity within
our models. A correlation plot makes it possible to visualize the
relationships between features. Note that because this dataset
uses a mixture of both continuous and categorical variables, the
hetcor function from the polycor package in R was used to generate the
correlation matrix behind the plot. This function allows one to compute “a
heterogenous correlation matrix, consisting of Pearson product-moment
correlations between numeric variables, polyserial correlations between
numeric and ordinal variables, and polychoric correlations between
ordinal variables.”
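The chunk that creates a correlation object is not shown in this report. As a minimal sketch of how such an object could be produced with hetcor, the example below uses a small synthetic data frame; the column names, the toy data, and the object name toy_correlations are all assumptions for illustration and are not the homework's actual code.

```r
library(polycor)

set.seed(1)
n <- 200

# Hypothetical toy data: two numeric columns and one two-level factor
toy <- data.frame(
  age    = rnorm(n, mean = 54, sd = 9),
  max_hr = rnorm(n, mean = 137, sd = 25)
)
toy$angina <- factor(ifelse(toy$age + rnorm(n, sd = 9) > 58, "Y", "N"))

# hetcor() picks Pearson, polyserial, or polychoric for each pair of columns
toy_correlations <- hetcor(toy)

toy_correlations$correlations  # the matrix that a function like corrplot() can draw
toy_correlations$type          # which kind of correlation was used for each pair
```

Inspecting the type element is a quick way to confirm that factor columns were treated as intended before plotting the matrix.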
corrplot(heart_failure_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 3: Correlation plot for the Heart Failure Prediction dataset
Calkins
indicates that “…correlation coefficients whose magnitude are between
0.3 and 0.5 indicate variables which have a low correlation”. Calkins
also points out that magnitudes between 0.5 and 0.7 indicate
moderate correlation, and anything above 0.7 indicates high correlation.
The correlation plot above reveals that ST_Slope,
Oldpeak, and ExerciseAngina are moderately correlated
with one another.
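The bands quoted from Calkins can be applied mechanically to any correlation value. The helper below is a small sketch in base R; the function name classify_correlation, the label used for magnitudes under 0.3, and the choice to assign a boundary value such as exactly 0.5 to the higher band are assumptions, since the quoted text does not specify them.

```r
# Map correlation magnitudes to the bands quoted from Calkins.
# Assumptions: magnitudes below 0.3 get the label "below low", and a value
# sitting exactly on a boundary is assigned to the higher band.
classify_correlation <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.5, 0.7, 1),
      labels = c("below low", "low", "moderate", "high"),
      right = FALSE,
      include.lowest = TRUE)
}

# Works on signed values because only the magnitude is classified
as.character(classify_correlation(c(0.25, 0.52, 0.55, 0.67, -0.8)))
```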
A summary of the Diabetes Health Indicators Dataset is provided below:
summary(diabetes_data)
## Diabetes_012 HighBP HighChol CholCheck BMI
## 0.0:213703 1.0:108829 1.0:107591 1.0:244210 Min. :12.00
## 2.0: 35346 0.0:144851 0.0:146089 0.0: 9470 1st Qu.:24.00
## 1.0: 4631 Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits
## 1.0:112423 0.0:243388 0.0:229787 0.0: 61760 0.0: 92782
## 0.0:141257 1.0: 10292 1.0: 23893 1.0:191920 1.0:160898
##
##
##
##
##
## Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## 1.0:205841 0.0:239424 1.0:241263 0.0:232326 5.0:12081
## 0.0: 47839 1.0: 14256 0.0: 12417 1.0: 21354 3.0:75646
## 2.0:89084
## 4.0:31570
## 1.0:45299
##
##
## MentHlth PhysHlth DiffWalk Sex Age
## 0.0 :175680 0.0 :160052 1.0: 42675 0.0:141974 9.0 :33244
## 2.0 : 13054 30.0 : 19400 0.0:211005 1.0:111706 10.0 :32194
## 30.0 : 12088 2.0 : 14764 8.0 :30832
## 5.0 : 9030 1.0 : 11388 7.0 :26314
## 1.0 : 8538 3.0 : 8495 11.0 :23533
## 3.0 : 7381 5.0 : 7622 6.0 :19819
## (Other): 27909 (Other): 31959 (Other):87744
## Education Income
## 4.0: 62750 8.0 :90385
## 6.0:107325 7.0 :43219
## 3.0: 9478 6.0 :36470
## 5.0: 69910 5.0 :25883
## 2.0: 4043 4.0 :20135
## 1.0: 174 3.0 :15994
## (Other):21594
The factors above are recoded below for readability.
diabetes_data <- diabetes_data %>%
mutate(
Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
'6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
'11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
'4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
'5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
)
summary(diabetes_data)
## Diabetes_012 HighBP HighChol
## No Diabetes:213703 1.0:108829 1.0:107591
## Diabetes : 35346 0.0:144851 0.0:146089
## Prediabetes: 4631
##
##
##
##
## CholCheck BMI Smoker Stroke
## Yes Chol Check in 5 years:244210 Min. :12.00 1.0:112423 0.0:243388
## No Chol Check in 5 Years : 9470 1st Qu.:24.00 0.0:141257 1.0: 10292
## Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 0.0:229787 0.0: 61760 0.0: 92782 1.0:205841 0.0:239424
## 1.0: 23893 1.0:191920 1.0:160898 0.0: 47839 1.0: 14256
##
##
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Has Insurance:241263 0.0:232326 Poor :12081 0.0 :175680
## No Insurance : 12417 1.0: 21354 Good :75646 2.0 : 13054
## Very Good:89084 30.0 : 12088
## Fair :31570 5.0 : 9030
## Excellent:45299 1.0 : 8538
## 3.0 : 7381
## (Other): 27909
## PhysHlth DiffWalk Sex Age
## 0.0 :160052 1.0: 42675 0.0:141974 60-64 :33244
## 30.0 : 19400 0.0:211005 1.0:111706 65-69 :32194
## 2.0 : 14764 55-59 :30832
## 1.0 : 11388 50-54 :26314
## 3.0 : 8495 70-74 :23533
## 5.0 : 7622 45-49 :19819
## (Other): 31959 (Other):87744
## Education Income
## Grade 12/GED : 62750 Income>=75K :90385
## >= 4 Yrs College :107325 50K<=Income<75K:43219
## Grades 9 - 11 : 9478 35K<=Income<50K:36470
## 1-3 Yrs College : 69910 25K<=Income<35K:25883
## Grades 1-8 : 4043 20K<=Income<25K:20135
## No School/Kindergarten: 174 15K<=Income<20K:15994
## (Other) :21594
Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.
Figure 4: Histogram of BMI (the only
continuous feature) in the Diabetes Health Indicators Dataset
Figure 4 shows that BMI is right-skewed.
This right skewness could also have been deduced from the summary:
for the BMI variable, the
maximum is 98, while the mean is 28 and the minimum is 12.
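The same reasoning can be checked numerically: in the summary the mean of BMI (28.38) sits above the median (27.00), and a positive sample skewness would point the same way. The sketch below uses base R only; sample_skewness is a hypothetical helper, and toy_bmi is simulated log-normal data standing in for the real column, not the survey values.

```r
# Sample skewness (third standardized moment); positive values mean a longer right tail
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(42)
# Hypothetical right-skewed values standing in for BMI (log-normal draws)
toy_bmi <- exp(rnorm(5000, mean = log(27), sd = 0.22))

c(mean     = mean(toy_bmi),
  median   = median(toy_bmi),
  skewness = sample_skewness(toy_bmi))
```

On the real data the same helper could be called on the BMI column directly.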
Figure 5: Boxplots for the Diabetes Health Indicators Dataset
Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. By using a correlation plot, we can visualize the relationships between certain features. The correlation plot is only able to determine the correlation for continuous variables.
corrplot(diabetes_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 6: Multicollinearity plot for continuous predictor variables
Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The correlation with the largest magnitude has a value of 0.52. While this value is above the upper end of the range considered a “low correlation”, it exceeds that limit by only 0.02. Therefore, it is reasonable to say that the features in this dataset have low correlation overall.
The two models used to generate predictions are the multiple logistic regression model and the naive Bayes classifier. Both models were chosen because, for both of the datasets, the response is a binary class. Both datasets also contain a mixture of categorical and continuous features, and these 2 machine learning algorithms can handle this mixture of variable types.
One of the strengths of a multiple logistic regression model lies in its interpretability. Interpretability is important for the datasets analyzed in this Homework because it offers transparency. It allows patients and doctors to easily understand why a particular prediction was made, and this gives patients trust in the healthcare system. Model interpretability is also important when presenting a model to a stakeholder. Since both of these datasets involve healthcare, a healthcare organization might prioritize interpretability to grasp the reasoning behind predictions and incorporate them into clinical decision-making processes. A multiple logistic regression model is also computationally inexpensive. This means that even a larger dataset, which in this case is the diabetes dataset, can be fit in relatively little time.
Conversely, multiple logistic regression models tend to underperform when there are multiple or nonlinear decision boundaries. Healthcare datasets in general are complex and may contain nonlinear relationships, which can lead to this underperformance. Other weaknesses include an inability to handle missing data, susceptibility to multicollinearity, and sensitivity to outliers.
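To make the point about decision boundaries concrete, the standard form of the model can be written out. The notation here is an assumption introduced for illustration: p(x) is the predicted probability of the positive class, x_1, …, x_m are the features, and β_0, …, β_m are the fitted coefficients.

```latex
\log\frac{p(\mathbf{x})}{1 - p(\mathbf{x})} = \beta_0 + \sum_{j=1}^{m} \beta_j x_j
\qquad\Longleftrightarrow\qquad
p(\mathbf{x}) = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{m} \beta_j x_j\right)}
```

Classifying at p(x) = 0.5 corresponds to the set where β_0 + Σ β_j x_j = 0, which is a single flat boundary in feature space. A relationship that curves or needs several separate boundaries cannot be captured unless extra terms are added to the model.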
One of the strengths of a naive Bayes classifier is that it is simple and fast to implement. Like a multiple logistic regression model, it is also computationally inexpensive. Patient-monitoring systems generally operate in real time, which would warrant the use of such a model. Naive Bayes models also handle noisy and missing data well, which is important in the medical field because some patients will purposely withhold information from doctors out of fear, or their medical records may be incomplete. Just like a multiple logistic regression model, a naive Bayes classifier is easily interpretable when viewing the probabilities generated by the model, which helps with transparency in doctor-patient interactions. Finally, unlike a k-nearest neighbors model, it does not require normalization.
Conversely, naive Bayes classifiers assume class conditional independence, and because of this assumption, “…computed probabilities are not reliable when considered in isolation. The computed probability of an instance belonging to a particular class has to be evaluated relative to the computed probability of the same instance belonging to other classes.” (Practical Machine Learning in R, p.269). In a healthcare setting, this is problematic because features in a healthcare dataset are often correlated with one another, and some could have more importance than others. Also, these models perform better with larger datasets, which means that the model generated for the Heart Failure Prediction dataset will most likely underperform compared to the model generated for the Diabetes Health Indicators dataset.
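The independence assumption and the need to compare classes against each other can both be seen in the usual naive Bayes formula. The symbols are assumptions introduced for illustration: C_k is a class, x_1, …, x_m are the (binned) feature values of one observation, and the sum in the denominator runs over all classes.

```latex
P(C_k \mid x_1, \dots, x_m)
  = \frac{P(C_k)\,\prod_{j=1}^{m} P(x_j \mid C_k)}
         {\sum_{l} P(C_l)\,\prod_{j=1}^{m} P(x_j \mid C_l)}
```

The product over features is where class conditional independence enters: each feature contributes its own factor, as if it carried information unrelated to the others. When features are correlated the numerator is distorted, so it is the comparison across classes through the denominator, rather than any single value, that drives the prediction.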
Note that because the SMOTE function is only applied to
the training data, the class imbalance is dealt with after the training
data for both of the datasets has been generated. Also note that
different machine learning algorithms require different types of data
transformations. Some of these transformations are applied later in the
homework.
There are 172 observations where Cholesterol is zero.
There is one observation where RestingBP is equal to zero.
Since these values are not physically achievable, all of the
observations where Cholesterol or RestingBP
was zero were dropped from the dataset.
heart_failure_data <- heart_failure_data %>%
filter(Cholesterol > 0, RestingBP > 0)
summary(heart_failure_data)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS
## Min. :28.00 M:564 ATA:166 Min. : 92 Min. : 85.0 0:621
## 1st Qu.:46.00 F:182 NAP:169 1st Qu.:120 1st Qu.:207.2 1:125
## Median :54.00 ASY:370 Median :130 Median :237.0
## Mean :52.88 TA : 41 Mean :133 Mean :244.6
## 3rd Qu.:59.00 3rd Qu.:140 3rd Qu.:275.0
## Max. :77.00 Max. :200 Max. :603.0
## RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope
## Normal:445 Min. : 69.0 N:459 Min. :-0.1000 Up :349
## ST :125 1st Qu.:122.0 Y:287 1st Qu.: 0.0000 Flat:354
## LVH :176 Median :140.0 Median : 0.5000 Down: 43
## Mean :140.2 Mean : 0.9016
## 3rd Qu.:160.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## HeartDisease
## 0:390
## 1:356
##
##
##
##
The summary above reveals that there are now 746 observations. The
minimum for RestingBP is now 92, while for
Cholesterol it is 85, both of which fall within reasonable
expectations. All of the other variables also fall within reasonable
expectations.
prop.table(table(select(heart_failure_data, HeartDisease)))
## HeartDisease
## 0 1
## 0.5227882 0.4772118
The output above shows the percentage of respondents that do not have
heart disease (52.3%) and the percentage of respondents that do have
heart disease (47.7%). When the training dataset is generated later in
this Homework, the class imbalance in the training dataset is dealt with
using the SMOTE function from the DMwR
package.
Figure 7: Histograms for the continuous features in the Heart
Failure Prediction dataset after removing 0 values from
Cholesterol and RestingBP
After the removal of 0 values, Cholesterol displays
slight right-skewness, but it does not seem to be extreme. The
other variables, apart from Oldpeak, look approximately normally
distributed, which is ideal. The only
continuous variable that has significant right-skewness now is
Oldpeak. This right-skewness is also reflected in the
summary that was generated earlier: the minimum value is
-0.1, the mean is 0.9, and the maximum is 6.2. I believe that
this skewness exists because
Oldpeak is a real-time measurement of a serious heart
condition that requires immediate medical attention. This means that the
majority of respondents were not at risk of having a serious heart
condition at the time the survey was conducted, which would make sense.
If they did have an Oldpeak that was above or below zero,
which some respondents do, then they should be at the hospital for
immediate medical attention.
Figure 8: Boxplots for the Heart Failure Prediction dataset after
removing 0 values from Cholesterol and
RestingBP
After the transformation, the boxplot for the
Cholesterol variable now indicates that, theoretically,
people with higher Cholesterol levels are more likely to
have HeartDisease.
corrplot(heart_failure_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 9: Correlation plot for the Heart Failure Prediction
dataset after removing 0 values from
Cholesterol and RestingBP
Many of the correlations between Cholesterol
and the other features decreased in magnitude. Furthermore, notice that
the peak correlation increased from 0.55 to 0.67, and
Oldpeak, ExerciseAngina, and
ST_Slope now have higher correlation values.
As pointed out in Practical Machine Learning in R, continuous features should be discretized prior to being used in a naive Bayes model. Therefore, the binning is done in the code chunk below. Bins will be created based on quantiles.
# Cut x into up to 7 quantile-based bins; unique() merges duplicate breakpoints
divide_equal_bins_func <- function(x, na.rm = FALSE) {
  cut(x, breaks = unique(quantile(x, probs = seq.int(0, 1, by = 1/7))), include.lowest = TRUE)
}
heart_failure_data_binned <- heart_failure_data %>%
mutate_at(c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"), divide_equal_bins_func)
summary(heart_failure_data_binned)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS
## [28,42]:119 M:564 ATA:166 [92,118] :115 [85,192] :109 0:621
## (42,48]:122 F:182 NAP:169 (118,120]:110 (192,212]:109 1:125
## (48,52]: 97 ASY:370 (120,130]:166 (212,228]:106
## (52,55]:105 TA : 41 (130,135]: 45 (228,248]:107
## (55,58]: 95 (135,140]:131 (248,270]:105
## (58,63]:111 (140,150]: 89 (270,299]:103
## (63,77]: 97 (150,200]: 90 (299,603]:107
## RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope
## Normal:445 [69,112] :111 N:459 [-0.1,0]:318 Up :349
## ST :125 (112,125]:108 Y:287 (0,0.1] : 10 Flat:354
## LVH :176 (125,137]:105 (0.1,1] :151 Down: 43
## (137,146]:102 (1,1.5] : 85
## (146,156]:110 (1.5,2] : 98
## (156,169]:106 (2,6.2] : 84
## (169,202]:104
## HeartDisease
## 0:390
## 1:356
##
##
##
##
##
The summary above shows the Heart Failure Prediction Dataset after
the numeric features were binned into 7 quantile-based bins (Oldpeak
ended up with only 6 bins because duplicate quantile breakpoints were merged).
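Why a variable can end up with fewer bins than requested is easiest to see on a small example. The sketch below mirrors the quantile-plus-unique idea in base R; bin_by_quantiles and the toy_oldpeak values are hypothetical, chosen only so that many observations share the same value, as happens with Oldpeak readings of 0.

```r
# Quantile-based binning; duplicate breakpoints are merged with unique()
bin_by_quantiles <- function(x, n_bins = 7) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Hypothetical data: half of the values are exactly 0
toy_oldpeak <- c(rep(0, 50), seq(0.1, 5, length.out = 50))

binned <- bin_by_quantiles(toy_oldpeak)
nlevels(binned)  # fewer than 7, because several quantiles coincide at 0
table(binned)
```

Because several of the lower quantiles are all equal to 0, they collapse into one breakpoint, and the first bin absorbs every tied observation. The bins are therefore equal in count only approximately, which is also visible in the uneven Oldpeak counts in the summary above.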
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes Prediabetes
## 0.84241170 0.13933302 0.01825528
The output above shows the percentage of respondents that do not have
diabetes (84.2%), the percentage of respondents that do have diabetes
(13.9%), and the percentage that are prediabetic (1.8%). Note that in
order to deal with class imbalance, only 2 classes can exist within the
response variable. Since people with prediabetes make up only 1.8% of
the total number of observations, all of the observations where
Diabetes_012 == Prediabetes were removed from the dataset.
The nature of this study changed slightly with the removal of a class.
Instead of determining whether someone is not at risk
of having diabetes, is at risk of being prediabetic, or is at risk of
being diabetic, the model now only determines whether a person is
at risk or not at risk of getting diabetes. The modeling and prediction
of the remaining classes still yielded valuable insights.
diabetes_data <- diabetes_data %>%
mutate(Diabetes_012 = as.numeric(Diabetes_012)) %>%
subset(Diabetes_012 != 3) %>%
mutate(Diabetes_012 = as.factor(Diabetes_012)) %>%
mutate(Diabetes_012 = dplyr::recode(Diabetes_012, '1' = 'No Diabetes', '2' = 'Diabetes'))
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
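The chunk above removes the prediabetes class by converting the factor to its integer codes, which works because Prediabetes happens to be the third level. A sketch of an alternative that filters on the label itself and then drops the unused level is shown below in base R; the status vector is a hypothetical stand-in for the Diabetes_012 column.

```r
# Hypothetical three-class factor with the same level order as in the summary
status <- factor(
  c("No Diabetes", "Diabetes", "Prediabetes", "No Diabetes", "Diabetes"),
  levels = c("No Diabetes", "Diabetes", "Prediabetes")
)

# Filter on the label, so the result does not depend on level order
keep <- status != "Prediabetes"

# droplevels() removes the now-empty "Prediabetes" level
status_binary <- droplevels(status[keep])

levels(status_binary)
table(status_binary)
```

Filtering by label gives the same result here, but it would keep working if the level order changed, for example after a different recoding step.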
As with the Heart Failure Prediction dataset, and following Practical Machine Learning in R, the continuous feature BMI is discretized prior to being used in a naive Bayes model. The binning is done in the code chunk below, with bins based on quantiles.
diabetes_data_binned <- diabetes_data %>%
mutate_at(c("BMI"), divide_equal_bins_func)
summary(diabetes_data_binned)
## Diabetes_012 HighBP HighChol
## No Diabetes:213703 1.0:105916 1.0:104716
## Diabetes : 35346 0.0:143133 0.0:144333
##
##
##
##
##
## CholCheck BMI Smoker Stroke
## Yes Chol Check in 5 years:239641 [12,22]:36591 1.0:110141 0.0:239022
## No Chol Check in 5 Years : 9408 (22,24]:34772 0.0:138908 1.0: 10027
## (24,26]:37188
## (26,28]:40433
## (28,31]:40836
## (31,34]:25902
## (34,98]:33327
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 0.0:225820 0.0: 60271 0.0: 90940 1.0:202280 0.0:235001
## 1.0: 23229 1.0:188778 1.0:158109 0.0: 46769 1.0: 14048
##
##
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Has Insurance:236886 0.0:228294 Poor :11730 0.0 :172724
## No Insurance : 12163 1.0: 20755 Good :73918 2.0 : 12823
## Very Good:87870 30.0 : 11727
## Fair :30545 5.0 : 8849
## Excellent:44986 1.0 : 8418
## 3.0 : 7256
## (Other): 27252
## PhysHlth DiffWalk Sex Age
## 0.0 :157581 1.0: 41390 0.0:139370 60-64 :32542
## 30.0 : 18842 0.0:207659 1.0:109679 65-69 :31497
## 2.0 : 14516 55-59 :30282
## 1.0 : 11214 50-54 :25896
## 3.0 : 8322 70-74 :22931
## 5.0 : 7454 45-49 :19507
## (Other): 31120 (Other):86394
## Education Income
## Grade 12/GED : 61400 Income>=75K :89374
## >= 4 Yrs College :105854 50K<=Income<75K:42484
## Grades 9 - 11 : 9164 35K<=Income<50K:35722
## 1-3 Yrs College : 68577 25K<=Income<35K:25296
## Grades 1-8 : 3882 20K<=Income<25K:19676
## No School/Kindergarten: 172 15K<=Income<20K:15573
## (Other) :20924
The summary above shows the Diabetes Health Indicators dataset after
BMI was binned into 7 quantile-based bins.
diabetes_data %>%
mutate(BMI_transformed = log10(BMI)) %>%
dplyr::select(-Diabetes_012) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density(col = 'red') +
geom_histogram(aes(y = after_stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 10: The BMI variable plotted on a histogram
and the BMI_transformed variable which was generated by
applying a log transformation to the BMI variable.
The output above shows us that the BMI_transformed
variable displays slight bimodality, which is problematic. In order to
preserve interpretability and to account for the inherent bimodality in
the transformed variable, the original BMI variable is used
instead of the transformed version.
Both of the datasets were split such that 75% of each is used for training and 25% for testing.
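A split of this kind keeps the class proportions nearly identical in the training and test sets when the sampling is done within each class. The sketch below illustrates that idea in base R; stratified_split is a hypothetical helper written for this illustration, not the caTools implementation, and the class counts 390 and 356 are taken from the summary of the cleaned heart failure data.

```r
# Stratified split: sample the same fraction of rows within each class
stratified_split <- function(y, ratio = 0.75) {
  in_train <- logical(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

set.seed(123)
# Labels with the class counts from the cleaned heart failure data
y <- factor(rep(c("0", "1"), times = c(390, 356)))
in_train <- stratified_split(y)

prop.table(table(y))             # full data
prop.table(table(y[in_train]))   # training portion
prop.table(table(y[!in_train]))  # test portion
```

The three tables should agree to within rounding, which is the behavior a stratified split is meant to provide.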
set.seed(123)
original_split <- caTools::sample.split(heart_failure_data$HeartDisease, SplitRatio = 0.75)
heart_failure_data_train <- subset(heart_failure_data, original_split == TRUE)
heart_failure_data_test <- subset(heart_failure_data, original_split == FALSE)
prop.table(table(select(heart_failure_data, HeartDisease)))
## HeartDisease
## 0 1
## 0.5227882 0.4772118
prop.table(table(select(heart_failure_data_train, HeartDisease)))
## HeartDisease
## 0 1
## 0.5223614 0.4776386
prop.table(table(select(heart_failure_data_test, HeartDisease)))
## HeartDisease
## 0 1
## 0.5240642 0.4759358
The output above shows us that there is a slight class imbalance.
SMOTE from the DMwR package is applied only
to the training dataset.
heart_failure_data_train <- SMOTE(HeartDisease ~ ., data.frame(heart_failure_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(heart_failure_data_train, HeartDisease)))
## HeartDisease
## 0 1
## 0.5 0.5
The output above is the class distribution after SMOTE
has been applied. Each class in the training dataset is balanced. The
same methodology to deal with the class imbalance will be applied to the
heart_failure_data_binned dataset for the naive Bayes
model.
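The exact 50/50 balance follows from the two percentages passed to SMOTE. As I understand the DMwR documentation, perc.over = 100 creates one synthetic minority case per original minority case, and perc.under = 200 then samples two majority cases for every synthetic case. The sketch below works through that arithmetic in base R; smote_class_counts is a hypothetical helper, and the minority count of 267 is an assumed value chosen because the resulting total of 1068 rows is consistent with the 1067 degrees of freedom reported for the null deviance of the logistic model later in this report.

```r
# Expected class counts after DMwR-style SMOTE, given the two percentages
smote_class_counts <- function(n_minority, perc.over = 100, perc.under = 200) {
  n_synthetic     <- n_minority * perc.over / 100    # new minority cases created
  n_majority_kept <- n_synthetic * perc.under / 100  # majority cases sampled
  c(minority = n_minority + n_synthetic, majority = n_majority_kept)
}

counts <- smote_class_counts(n_minority = 267)  # 267 is an assumed count
counts
prop.table(counts)
```

With these settings the two classes always come out equal in size, regardless of how imbalanced the training data were to begin with.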
heart_failure_data_train_binned <- subset(heart_failure_data_binned, original_split == TRUE)
heart_failure_data_test_binned <- subset(heart_failure_data_binned, original_split == FALSE)
heart_failure_data_train_binned <- SMOTE(HeartDisease ~ ., data.frame(heart_failure_data_train_binned), perc.over = 100, perc.under = 200)
prop.table(table(select(heart_failure_data_train_binned, HeartDisease)))
## HeartDisease
## 0 1
## 0.5 0.5
The output above shows us that the binned Heart Failure Prediction training dataset is balanced.
set.seed(123)
original_split <- caTools::sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.75)
diabetes_data_train <- subset(diabetes_data, original_split == TRUE)
diabetes_data_test <- subset(diabetes_data, original_split == FALSE)
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580736 0.1419264
prop.table(table(select(diabetes_data_test, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580836 0.1419164
The output above shows us that there is a significant class
imbalance. SMOTE from the DMwR package is applied only
to the training dataset.
diabetes_data_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.5 0.5
The output above is the class distribution after SMOTE
has been applied. Each class in the training dataset is balanced. The
same methodology to deal with the class imbalance will be applied to the
diabetes_data_train_binned dataset for the naive Bayes
model.
diabetes_data_train_binned <- subset(diabetes_data_binned, original_split == TRUE)
diabetes_data_test_binned <- subset(diabetes_data_binned, original_split == FALSE)
diabetes_data_train_binned <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train_binned), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train_binned, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.5 0.5
The output above shows that the binned Diabetes Health Indicators
training dataset is balanced after applying SMOTE from the DMwR
package.
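The balanced 0.5 / 0.5 split follows from the perc.over = 100 and perc.under = 200 settings used in the SMOTE calls above. The following is a minimal sketch of that arithmetic in base R, not a call into DMwR itself; the function name smote_counts and the minority count of 1000 are hypothetical and chosen only for illustration.

```r
# Sketch of the class counts that result from SMOTE-style resampling.
# perc_over and perc_under mirror the perc.over and perc.under arguments.
smote_counts <- function(n_minority, perc_over = 100, perc_under = 200) {
  # Synthetic minority cases: perc_over / 100 new cases per original case
  n_synthetic <- n_minority * perc_over / 100
  # Majority cases kept: perc_under / 100 cases per synthetic case
  n_majority_kept <- n_synthetic * perc_under / 100
  c(minority = n_minority + n_synthetic, majority = n_majority_kept)
}

counts <- smote_counts(n_minority = 1000)  # hypothetical minority count
print(counts)
print(prop.table(counts))
```

With these settings the minority class doubles and the majority class is undersampled to the same size, which is why both training sets end up evenly split.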
The glm function in R allowed for the generation of a
multiple logistic regression model that uses all of the features in the
training set to build a model that predicts
HeartDisease:
heart_failure_data_logistic <- glm(data = heart_failure_data_train,
family = binomial,
formula = HeartDisease ~ .)
summary(heart_failure_data_logistic)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = heart_failure_data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.457650 1.290518 -5.004 5.62e-07 ***
## Age 0.032086 0.012211 2.628 0.008599 **
## SexF -1.183395 0.208441 -5.677 1.37e-08 ***
## ChestPainTypeNAP 0.444573 0.277156 1.604 0.108702
## ChestPainTypeASY 1.068242 0.258156 4.138 3.50e-05 ***
## ChestPainTypeTA -0.015030 0.369229 -0.041 0.967529
## RestingBP 0.015012 0.005754 2.609 0.009087 **
## Cholesterol 0.001675 0.001667 1.004 0.315224
## FastingBS1 0.997053 0.237935 4.190 2.78e-05 ***
## RestingECGST 0.551774 0.261008 2.114 0.034514 *
## RestingECGLVH 0.344996 0.216565 1.593 0.111152
## MaxHR -0.001413 0.004478 -0.315 0.752420
## ExerciseAnginaY 1.208831 0.201299 6.005 1.91e-09 ***
## Oldpeak 0.414688 0.115616 3.587 0.000335 ***
## ST_SlopeFlat 2.262564 0.208116 10.872 < 2e-16 ***
## ST_SlopeDown 0.861049 0.417350 2.063 0.039100 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1480.56 on 1067 degrees of freedom
## Residual deviance: 795.24 on 1052 degrees of freedom
## AIC: 827.24
##
## Number of Fisher Scoring iterations: 5
The output above indicates that MaxHR and
Cholesterol are not significant. Therefore, the model is
refit with these variables removed.
heart_failure_data_logistic <- glm(data = heart_failure_data_train %>% select(-MaxHR, -Cholesterol),
family = binomial,
formula = HeartDisease ~ .)
summary(heart_failure_data_logistic)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = heart_failure_data_train %>%
## select(-MaxHR, -Cholesterol))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.476453 0.921383 -7.029 2.08e-12 ***
## Age 0.034252 0.011333 3.022 0.002507 **
## SexF -1.157966 0.206410 -5.610 2.02e-08 ***
## ChestPainTypeNAP 0.438983 0.276947 1.585 0.112948
## ChestPainTypeASY 1.100638 0.254371 4.327 1.51e-05 ***
## ChestPainTypeTA -0.018212 0.369018 -0.049 0.960638
## RestingBP 0.015647 0.005716 2.738 0.006190 **
## FastingBS1 1.017873 0.236755 4.299 1.71e-05 ***
## RestingECGST 0.549653 0.258048 2.130 0.033168 *
## RestingECGLVH 0.361398 0.215100 1.680 0.092930 .
## ExerciseAnginaY 1.210874 0.200650 6.035 1.59e-09 ***
## Oldpeak 0.417098 0.115966 3.597 0.000322 ***
## ST_SlopeFlat 2.267745 0.205569 11.032 < 2e-16 ***
## ST_SlopeDown 0.859721 0.415565 2.069 0.038565 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1480.56 on 1067 degrees of freedom
## Residual deviance: 796.32 on 1054 degrees of freedom
## AIC: 824.32
##
## Number of Fisher Scoring iterations: 5
The model summary shown above shows that all of the remaining features are significant or have at least one level that is significant. The AIC value has decreased slightly, from 827.24 to 824.32, which is ideal.
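The AIC values in the two summaries can be reproduced from the residual deviances. For a binomial model fit to 0/1 responses, the AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below uses only numbers copied from the printed summaries; the helper name aic_from_deviance is hypothetical.

```r
# AIC for a binomial GLM on 0/1 data: residual deviance + 2 * (coefficients).
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

# Full model: 16 coefficients (null df 1067 - residual df 1052, plus intercept)
full_aic <- aic_from_deviance(795.24, 16)
# Refit model without MaxHR and Cholesterol: 14 coefficients
reduced_aic <- aic_from_deviance(796.32, 14)

print(c(full = full_aic, reduced = reduced_aic))
```

Dropping the two features raises the deviance by about 1.08 but removes a penalty of 4, so the AIC falls.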
heart_failure_data_logistic_pred <- predict(heart_failure_data_logistic, heart_failure_data_test %>% select(-MaxHR, -Cholesterol), type = 'response')
heart_failure_data_logistic_pred <- ifelse(heart_failure_data_logistic_pred >= 0.5, 1, 0)
caret::confusionMatrix(
as.factor(as.vector(heart_failure_data_logistic_pred)), heart_failure_data_test$HeartDisease, positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 7
## 1 17 82
##
## Accuracy : 0.8717
## 95% CI : (0.8151, 0.916)
## No Information Rate : 0.5241
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.744
##
## Mcnemar's Test P-Value : 0.06619
##
## Sensitivity : 0.9213
## Specificity : 0.8265
## Pos Pred Value : 0.8283
## Neg Pred Value : 0.9205
## Prevalence : 0.4759
## Detection Rate : 0.4385
## Detection Prevalence : 0.5294
## Balanced Accuracy : 0.8739
##
## 'Positive' Class : 1
##
The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 87.17%, which is a relatively high accuracy score.
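The statistics reported by caret can be checked directly against the four counts in the table. The following is a minimal sketch in base R, with the counts copied from the output above; the object names are illustrative.

```r
# Rows are predictions, columns are the reference (actual) classes.
cm <- matrix(c(81, 17, 7, 82), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct / all
sensitivity <- cm["1", "1"] / sum(cm[, "1"])    # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])    # true negatives / actual negatives

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```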
test_roc = roc(heart_failure_data_test$HeartDisease ~ predict(heart_failure_data_logistic, heart_failure_data_test, type = 'response'), plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Figure 12: The ROC curve for the heart failure prediction dataset using the multiple logistic regression model.
The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.931.
heart_failure_data_bayes <- e1071::naiveBayes(
HeartDisease ~ ., data = heart_failure_data_train_binned, laplace = 1
)
heart_failure_data_bayes_pred <- predict(heart_failure_data_bayes, heart_failure_data_test_binned)
caret::confusionMatrix(
heart_failure_data_bayes_pred, heart_failure_data_test$HeartDisease, positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 10
## 1 18 79
##
## Accuracy : 0.8503
## 95% CI : (0.7909, 0.8981)
## No Information Rate : 0.5241
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7011
##
## Mcnemar's Test P-Value : 0.1859
##
## Sensitivity : 0.8876
## Specificity : 0.8163
## Pos Pred Value : 0.8144
## Neg Pred Value : 0.8889
## Prevalence : 0.4759
## Detection Rate : 0.4225
## Detection Prevalence : 0.5187
## Balanced Accuracy : 0.8520
##
## 'Positive' Class : 1
##
The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 85.03%, which is a relatively high accuracy score, but also slightly lower than the accuracy score that was generated for the multiple logistic regression model.
heart_failure_data_bayes_pred_prob <- predict(heart_failure_data_bayes, heart_failure_data_test_binned, type = "raw")
roc_pred <- prediction(
predictions = heart_failure_data_bayes_pred_prob[, "1"],
labels = heart_failure_data_test$HeartDisease
)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
unlist(slot(performance(roc_pred, measure = "auc"),"y.values"))
## [1] 0.9228388
Figure 13: The ROC curve for the heart failure prediction dataset using the naive Bayes model.
The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.923.
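To make the meaning of the AUC concrete, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes this with the rank-based (Mann-Whitney) formula on a small set of hypothetical scores; it is not how ROCR or pROC compute the value internally, and the function name auc_rank is an assumption made for illustration.

```r
# Rank-based AUC: labels are 0/1, with 1 marking the positive class.
auc_rank <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks are used for ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1, 1, 1, 0, 0, 0)

auc_toy <- auc_rank(scores, labels)
print(auc_toy)
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.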
The glm function in R allowed for the generation of a
multiple logistic regression model that uses all of the features in the
training set to build a model that predicts
Diabetes_012:
diabetes_data_logistic <- glm(data = diabetes_data_train,
family = binomial,
formula = Diabetes_012 ~ .)
summary(diabetes_data_logistic)
##
## Call:
## glm(formula = Diabetes_012 ~ ., family = binomial, data = diabetes_data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.488330 0.378110 -1.292 0.196530
## HighBP0.0 -0.524529 0.016903 -31.031 < 2e-16 ***
## HighChol0.0 -0.467302 0.016369 -28.547 < 2e-16 ***
## CholCheckNo Chol Check in 5 Years -0.048148 0.042593 -1.130 0.258309
## BMI 0.070229 0.001359 51.662 < 2e-16 ***
## Smoker0.0 0.092005 0.016345 5.629 1.82e-08 ***
## Stroke1.0 1.053223 0.030535 34.493 < 2e-16 ***
## HeartDiseaseorAttack1.0 0.811062 0.022571 35.935 < 2e-16 ***
## PhysActivity1.0 -0.219496 0.017674 -12.419 < 2e-16 ***
## Fruits1.0 -0.086241 0.016712 -5.160 2.47e-07 ***
## Veggies0.0 0.336853 0.019015 17.715 < 2e-16 ***
## HvyAlcoholConsump1.0 -0.104429 0.035002 -2.983 0.002850 **
## AnyHealthcareNo Insurance 0.957329 0.030239 31.659 < 2e-16 ***
## NoDocbcCost1.0 0.757384 0.024217 31.275 < 2e-16 ***
## GenHlthGood -0.721380 0.033537 -21.510 < 2e-16 ***
## GenHlthVery Good -1.285552 0.035463 -36.251 < 2e-16 ***
## GenHlthFair -0.340501 0.034288 -9.931 < 2e-16 ***
## GenHlthExcellent -1.967598 0.044953 -43.770 < 2e-16 ***
## MentHlth0.0 0.150376 0.370457 0.406 0.684802
## MentHlth30.0 0.439684 0.371401 1.184 0.236472
## MentHlth3.0 0.210148 0.373123 0.563 0.573289
## MentHlth5.0 0.228070 0.372541 0.612 0.540405
## MentHlth15.0 0.060519 0.373332 0.162 0.871224
## MentHlth10.0 0.398566 0.372837 1.069 0.285066
## MentHlth6.0 0.720633 0.387780 1.858 0.063119 .
## MentHlth20.0 0.320272 0.375337 0.853 0.393498
## MentHlth2.0 0.172520 0.372038 0.464 0.642851
## MentHlth25.0 0.252323 0.382305 0.660 0.509250
## MentHlth1.0 0.086987 0.373367 0.233 0.815778
## MentHlth4.0 0.117357 0.375968 0.312 0.754929
## MentHlth7.0 0.212479 0.377001 0.564 0.573024
## MentHlth8.0 0.151863 0.400680 0.379 0.704677
## MentHlth21.0 0.796661 0.436635 1.825 0.068069 .
## MentHlth14.0 0.340085 0.384948 0.883 0.376989
## MentHlth26.0 -1.061173 0.614762 -1.726 0.084320 .
## MentHlth29.0 0.464101 0.478452 0.970 0.332044
## MentHlth16.0 0.301377 0.606605 0.497 0.619312
## MentHlth28.0 -0.269857 0.428409 -0.630 0.528757
## MentHlth11.0 -2.154004 0.823525 -2.616 0.008907 **
## MentHlth12.0 0.613102 0.413731 1.482 0.138370
## MentHlth24.0 0.555237 0.686383 0.809 0.418555
## MentHlth17.0 1.351435 0.619022 2.183 0.029023 *
## MentHlth13.0 -0.078290 0.672947 -0.116 0.907384
## MentHlth27.0 -0.905133 0.559929 -1.617 0.105983
## MentHlth19.0 0.847598 0.937301 0.904 0.365838
## MentHlth22.0 1.034849 0.676059 1.531 0.125842
## MentHlth9.0 0.981151 0.563585 1.741 0.081700 .
## MentHlth23.0 0.631030 0.629775 1.002 0.316347
## PhysHlth0.0 -0.422158 0.048379 -8.726 < 2e-16 ***
## PhysHlth30.0 -0.250079 0.051765 -4.831 1.36e-06 ***
## PhysHlth2.0 -0.264645 0.056902 -4.651 3.31e-06 ***
## PhysHlth14.0 -0.426269 0.084011 -5.074 3.90e-07 ***
## PhysHlth28.0 -0.235768 0.149143 -1.581 0.113919
## PhysHlth7.0 -0.412229 0.072634 -5.675 1.38e-08 ***
## PhysHlth20.0 -0.297240 0.073755 -4.030 5.57e-05 ***
## PhysHlth3.0 -0.416083 0.062609 -6.646 3.02e-11 ***
## PhysHlth10.0 -0.283765 0.064813 -4.378 1.20e-05 ***
## PhysHlth1.0 -0.325770 0.061920 -5.261 1.43e-07 ***
## PhysHlth5.0 -0.352182 0.062483 -5.636 1.74e-08 ***
## PhysHlth17.0 0.209474 0.268756 0.779 0.435732
## PhysHlth4.0 -0.196521 0.071678 -2.742 0.006112 **
## PhysHlth19.0 8.570479 45.855885 0.187 0.851739
## PhysHlth6.0 0.215560 0.100014 2.155 0.031139 *
## PhysHlth12.0 -0.167998 0.144654 -1.161 0.245489
## PhysHlth25.0 -0.249285 0.099561 -2.504 0.012285 *
## PhysHlth27.0 -1.260024 0.336671 -3.743 0.000182 ***
## PhysHlth21.0 -0.649979 0.144066 -4.512 6.43e-06 ***
## PhysHlth22.0 1.349212 0.553284 2.439 0.014746 *
## PhysHlth8.0 -0.247913 0.124817 -1.986 0.047010 *
## PhysHlth29.0 0.387322 0.220716 1.755 0.079286 .
## PhysHlth24.0 -0.591053 0.375186 -1.575 0.115173
## PhysHlth9.0 0.017766 0.258017 0.069 0.945104
## PhysHlth16.0 -0.090589 0.320502 -0.283 0.777448
## PhysHlth18.0 -0.460250 0.273019 -1.686 0.091838 .
## PhysHlth23.0 -0.520245 0.437157 -1.190 0.234020
## PhysHlth13.0 0.456226 0.471775 0.967 0.333523
## PhysHlth26.0 -0.455110 0.371313 -1.226 0.220319
## PhysHlth11.0 -0.394249 0.527303 -0.748 0.454659
## DiffWalk0.0 -0.336119 0.020080 -16.739 < 2e-16 ***
## Sex1.0 0.273631 0.016485 16.599 < 2e-16 ***
## Age50-54 -0.333666 0.032394 -10.300 < 2e-16 ***
## Age70-74 0.206061 0.031631 6.515 7.29e-11 ***
## Age65-69 0.173289 0.029134 5.948 2.71e-09 ***
## Age55-59 -0.235598 0.030544 -7.713 1.23e-14 ***
## Age>=80 -0.069994 0.035263 -1.985 0.047153 *
## Age35-39 -1.043576 0.049481 -21.091 < 2e-16 ***
## Age45-49 -0.464748 0.036625 -12.689 < 2e-16 ***
## Age25-29 -1.220957 0.068460 -17.835 < 2e-16 ***
## Age75-79 0.162624 0.035513 4.579 4.67e-06 ***
## Age40-44 -0.822764 0.042904 -19.177 < 2e-16 ***
## Age18-24 -1.856152 0.107473 -17.271 < 2e-16 ***
## Age30-34 -1.490660 0.062274 -23.937 < 2e-16 ***
## Education>= 4 Yrs College -0.036676 0.021516 -1.705 0.088271 .
## EducationGrades 9 - 11 0.161958 0.037633 4.304 1.68e-05 ***
## Education1-3 Yrs College 0.010221 0.021164 0.483 0.629132
## EducationGrades 1-8 0.530049 0.054134 9.791 < 2e-16 ***
## EducationNo School/Kindergarten 0.352758 0.228937 1.541 0.123353
## Income<10K 0.125423 0.043630 2.875 0.004044 **
## IncomeIncome>=75K -0.437775 0.033348 -13.128 < 2e-16 ***
## Income35K<=Income<50K -0.307460 0.034433 -8.929 < 2e-16 ***
## Income20K<=Income<25K -0.123112 0.036986 -3.329 0.000873 ***
## Income50K<=Income<75K -0.236641 0.034446 -6.870 6.43e-12 ***
## Income10K<=Income<15K 0.036996 0.040717 0.909 0.363543
## Income25K<=Income<35K -0.159237 0.035568 -4.477 7.57e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 147003 on 106039 degrees of freedom
## Residual deviance: 96890 on 105936 degrees of freedom
## AIC: 97098
##
## Number of Fisher Scoring iterations: 8
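The model summary shown above shows that most of the features are significant or have at least one level that is significant. The exception is CholCheck, whose single coefficient is not significant (p = 0.258). The very large standard error for PhysHlth19.0 (45.86) may also indicate that this level has too few observations in the training dataset to be estimated reliably.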
diabetes_data_logistic_pred <- predict(diabetes_data_logistic, diabetes_data_test, type = 'response')
diabetes_data_logistic_pred <- ifelse(diabetes_data_logistic_pred >= 0.5, "Diabetes", "No Diabetes")
caret::confusionMatrix(
ordered(diabetes_data_test$Diabetes_012, levels = c("No Diabetes", "Diabetes")),
ordered(as.vector(as.factor(diabetes_data_logistic_pred)), levels = c("No Diabetes", "Diabetes")),
positive = "Diabetes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 42312 11114
## Diabetes 2977 5859
##
## Accuracy : 0.7737
## 95% CI : (0.7704, 0.777)
## No Information Rate : 0.7274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3287
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3452
## Specificity : 0.9343
## Pos Pred Value : 0.6631
## Neg Pred Value : 0.7920
## Prevalence : 0.2726
## Detection Rate : 0.0941
## Detection Prevalence : 0.1419
## Balanced Accuracy : 0.6397
##
## 'Positive' Class : Diabetes
##
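The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 77.37%. Note that the test labels were passed to confusionMatrix as the first argument and the predictions as the second, so the rows labeled Prediction are the actual classes and the columns labeled Reference are the predicted classes. Accuracy is unaffected by this, but the reported Sensitivity (0.3452) is really the model’s precision for the Diabetes class, while the reported Pos Pred Value (0.6631) is its sensitivity. The accuracy is also below the 85.81% share of No Diabetes observations in the testing dataset, so it should be read alongside the sensitivity and the AUC rather than on its own.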
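Reading the rows of the printed table as the actual classes (the test labels were passed first in the call above), the per-class metrics can be recomputed directly from the counts. This is a minimal sketch in base R using only the numbers printed above; the object names are illustrative.

```r
# Rows are the actual classes, columns are the model's predictions.
tab <- matrix(c(42312, 2977, 11114, 5859), nrow = 2,
              dimnames = list(Actual = c("No Diabetes", "Diabetes"),
                              Predicted = c("No Diabetes", "Diabetes")))

accuracy  <- sum(diag(tab)) / sum(tab)
# Share of actual Diabetes cases that the model flags (sensitivity / recall)
recall    <- tab["Diabetes", "Diabetes"] / sum(tab["Diabetes", ])
# Share of predicted Diabetes cases that are actual Diabetes cases (precision)
precision <- tab["Diabetes", "Diabetes"] / sum(tab[, "Diabetes"])
# Accuracy of always predicting the majority class
baseline  <- sum(tab["No Diabetes", ]) / sum(tab)

print(round(c(accuracy = accuracy, recall = recall,
              precision = precision, baseline = baseline), 4))
```

Passing the predictions as the first argument and the test labels as the second, as was done for the heart failure models, would make caret report these values under their usual names.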
test_roc = roc(diabetes_data_test$Diabetes_012 ~ predict(diabetes_data_logistic, diabetes_data_test, type = 'response'), plot = TRUE, print.auc = TRUE)
## Setting levels: control = No Diabetes, case = Diabetes
## Setting direction: controls < cases
Figure 14: The ROC curve for the diabetes health indicators dataset using the multiple logistic regression model.
The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.831.
diabetes_data_bayes <- e1071::naiveBayes(
Diabetes_012 ~ ., data = diabetes_data_train_binned, laplace = 1
)
Note that laplace has been set equal to 1, so that Laplace (add-one)
smoothing is applied to the conditional probability tables. This
prevents a feature level that never occurs with a class in the training
dataset from producing a conditional probability of zero.
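The effect of the laplace = 1 argument in the naiveBayes call above can be illustrated on a small table. The following is a sketch of add-one smoothing in base R on hypothetical toy data (the bmi_bin feature and its levels are made up for illustration); it shows the general technique rather than the exact internals of e1071.

```r
# Hypothetical toy data, not taken from the diabetes dataset.
toy <- data.frame(
  class = factor(c("No Diabetes", "No Diabetes", "No Diabetes",
                   "Diabetes", "Diabetes")),
  bmi_bin = factor(c("low", "low", "mid", "high", "high"),
                   levels = c("low", "mid", "high"))
)

laplace <- 1
counts <- table(toy$class, toy$bmi_bin)

# Unsmoothed conditional probabilities P(bmi_bin | class): some are zero
raw <- counts / rowSums(counts)
# Add-one smoothing: add laplace to every cell, renormalize each row
smoothed <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

print(round(raw, 3))
print(round(smoothed, 3))
```

Without smoothing, a single zero in any conditional probability would force the whole product for that class to zero, regardless of the other features.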
diabetes_data_bayes_pred <- predict(diabetes_data_bayes, diabetes_data_test_binned)
caret::confusionMatrix(
ordered(diabetes_data_test$Diabetes_012, levels = c("No Diabetes", "Diabetes")),
ordered(as.vector(as.factor(diabetes_data_bayes_pred)), levels = c("No Diabetes", "Diabetes")),
positive = "No Diabetes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 42252 11174
## Diabetes 3283 5553
##
## Accuracy : 0.7678
## 95% CI : (0.7645, 0.7711)
## No Information Rate : 0.7313
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3055
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9279
## Specificity : 0.3320
## Pos Pred Value : 0.7909
## Neg Pred Value : 0.6285
## Prevalence : 0.7313
## Detection Rate : 0.6786
## Detection Prevalence : 0.8581
## Balanced Accuracy : 0.6299
##
## 'Positive' Class : No Diabetes
##
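The output above shows a confusion matrix generated from the testing dataset. From this confusion matrix, the model’s predictive accuracy was calculated to be 76.78%, which is slightly lower than the 77.37% accuracy that was generated for the multiple logistic regression model. As with the logistic regression output, the test labels were passed as the first argument, and here the positive class is set to No Diabetes, so the class-specific statistics are not directly comparable to those reported for the multiple logistic regression model.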
diabetes_data_bayes_pred_prob <- predict(diabetes_data_bayes, diabetes_data_test_binned, type = "raw")
roc_pred <- prediction(
predictions = diabetes_data_bayes_pred_prob[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
unlist(slot(performance(roc_pred, measure = "auc"),"y.values"))
## [1] 0.7978325
Figure 15: The ROC curve for the diabetes health indicators dataset using the naive Bayes model.
The ROC curve is generated in the code chunk above. Also, the calculated AUC is 0.798.
In this assignment, two machine learning algorithms were fit to two datasets of different sizes. The smaller of the two datasets involves cardiovascular disease data collected from different datasets and combined into one dataset. Cardiovascular disease is the leading cause of death globally according to the WHO (World Health Organization). The WHO estimates that 17.9 million deaths occurred in 2019 as a result of cardiovascular disease. According to the CDC, heart disease and stroke medical costs are estimated to be nearly $1 billion a day. Therefore, it was important to identify risk factors, which are shown as features within the dataset, so individuals and health care professionals can work towards reducing these risk factors and preventing heart disease.
This dataset includes the following variables:
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [Numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
The response variable for this dataset is HeartDisease,
and in total the dataset consists of 918 observations. More information
about the dataset itself can be found here: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
The larger of the datasets involves indicators for diabetes. This
dataset was compiled from a telephone survey conducted by the CDC in
2015. Questions asked in the survey involved health-related risk
behaviors, chronic health conditions, and the use of preventative
services. Also included are age, education, income, location, and race,
to name a few. There are 3 .csv files that can be used for analysis. The
one that was used in this homework was the
diabetes_012_health_indicators_BRFSS2015.csv file.
This .csv file contains 253,680 survey responses (observations) and 21
features. The response variable is multiclass, in that it contains 3
different classes: 0 for no diabetes, 1 for prediabetes, and 2 for
diabetes. The author of this dataset points out that there is a class
imbalance.
This dataset includes the following variables:
Diabetes_012: 0 = no diabetes, 1 = prediabetes, 2 = diabetes
HighBP: 0 = no high BP, 1 = high BP
HighChol: 0 = no high cholesterol, 1 = high cholesterol
CholCheck: 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years
BMI: Body Mass Index
Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
Stroke: (Ever told) you had a stroke. 0 = no, 1 = yes
HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI) 0 = no, 1 = yes
PhysActivity: physical activity in past 30 days - not including job 0 = no, 1 = yes
Fruits: Consume Fruit 1 or more times per day 0 = no, 1 = yes
Veggies: Consume Vegetables 1 or more times per day 0 = no, 1 = yes
HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no, 1 = yes
AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes
NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes
GenHlth: Would you say that in general your health is: scale 1-5:
MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes
Sex: 0 = female, 1 = male
Age: 13-level age category:
Education: Education level; scale 1-6:
Income: Income scale; scale 1-8:
The two models used to generate predictions are the multiple logistic regression model and the naive Bayes model. Both of these models were chosen because, for both of the datasets, the response is a binary class. Both of these datasets also contain a mixture of categorical and continuous features, which is another reason why these 2 machine learning algorithms were selected, as they can handle this mixture of different variables.
One of the strengths of a multiple logistic regression model lies in its interpretability. Interpretability is important for the datasets that are being analyzed in this homework because it offers transparency. It allows patients and doctors to easily understand why a particular prediction was made, and this gives patients trust in the healthcare system. Model interpretability is also important when presenting a model to a stakeholder. Since both of these datasets involve healthcare, a healthcare organization might prioritize interpretability to grasp the reasoning behind predictions and incorporate them into clinical decision-making processes. A multiple logistic regression model is also computationally inexpensive. This means that larger datasets, which in this case would be the diabetes dataset, can be fit in less time.
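One concrete form this interpretability takes is the odds ratio: exponentiating a logistic regression coefficient gives the multiplicative change in the odds of the outcome for a one-unit change in that feature (or relative to the reference level for a factor), holding the other features fixed. The sketch below uses three estimates copied from the refit heart failure model summary; with the fitted model object available, the same values could be obtained from exp(coef(heart_failure_data_logistic)).

```r
# Estimates copied from the refit heart failure model summary.
coefs <- c(SexF = -1.157966,
           ExerciseAnginaY = 1.210874,
           ST_SlopeFlat = 2.267745)

# Odds ratios: values below 1 lower the odds of HeartDisease, above 1 raise them
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))
```

Read this way, being female is associated with lower odds of heart disease relative to the reference level, while exercise-induced angina and a flat ST slope are associated with higher odds.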
Conversely, multiple logistic regression models tend to underperform when there are multiple or nonlinear decision boundaries. Healthcare datasets in general are complex and may have nonlinear relationships, which could lead to underperformance. Other problems include the inability to handle missing data, multicollinearity, and sensitivity to outliers.
One of the strengths of a naive Bayes classifier is that it is simple and fast to implement. Like multiple logistic regression models, naive Bayes models are also computationally inexpensive. Patient-monitoring systems generally operate in real time, which would warrant the use of such a model. Naive Bayes models are also able to handle missing data, which is important in the medical field because some patients will purposely withhold information from doctors out of fear, or their medical records may be incomplete. Just like a multiple logistic regression model, a naive Bayes classifier is easily interpretable when viewing the probabilities generated by the model, which helps with transparency in doctor-patient interactions. Finally, such models handle noisy data well, and unlike a k-nearest neighbors model, they do not require normalization.
Conversely, naive Bayes classifier models make the assumption of “…class conditional independence, computed probabilities are not reliable when considered in isolation. The computed probability of an instance belonging to a particular class has to be evaluated relative to the computed probability of the same instance belonging to other classes.” (Practical Machine Learning in R, p.269). In a healthcare setting, this is problematic because some features in a healthcare dataset could have more importance than others. Also, these models perform better with larger datasets, which means that the model generated for the heart failure prediction dataset will most likely underperform compared to the model generated for the diabetes health indicators dataset.
The correlation plot revealed that for the Heart Failure Prediction
dataset, ST_Slope, Oldpeak, and
ExerciseAngina have a moderately high correlation. After
the data preparation stage, where the data was transformed, the
maximum correlation increased from 0.55 to 0.67. The correlation plots
revealed that for the Diabetes Indicators dataset, the correlation with
the largest magnitude has a value of 0.52. While this value is above
the upper bound of what would be considered a “low correlation”, it is
only 0.02 above that bound.
In terms of making a final business decision, the multiple logistic regression model would be the best choice for the heart failure prediction dataset, since it had the higher accuracy (87.17% versus 85.03%) and AUC (0.931 versus 0.923). For the diabetes indicators dataset, the two models performed similarly, with the multiple logistic regression model slightly ahead in both accuracy (77.37% versus 76.78%) and AUC (0.831 versus 0.798), so it would probably be the better choice there as well. Because both of these models are easily interpretable, the more accurate the algorithm, the more useful it would be in generating business-driven decisions. The algorithms that were selected also ran relatively quickly on the larger dataset. I had originally tried to use k-nearest neighbors, but after 1 hour of waiting for the model to fit, I had to move to a different algorithm.
I believe that an analysis could be prone to errors if the least amount of data is used compared to using a giant dataset. Fitting a model to a giant dataset could potentially lead to overfitting, but there are sampling techniques that can be used to account for this. Having the least amount of data possible could lead to underfitting. Even when you sample a small dataset, you might not capture the distribution as well as you would with a much larger dataset, where the sampled data will more likely have the same distribution as the original dataset. The heart failure dataset had better accuracy scores for both machine learning algorithms than the diabetes dataset. It could be that the diabetes indicators dataset had more noise, or that there was overfitting present in the larger dataset. It would have been nice to see the predictive accuracy of the k-nearest neighbors model, but that is for a future endeavor.