This analyzes whether an individual will have heart disease
Load required libraries:
library(tidyverse)
library(corrplot)
library(ggcorrplot)
library(Amelia)
library(caTools)
library(cowplot)
Load the data from the GitHub repo
url <- "https://raw.githubusercontent.com/chinedu2301/data622-machine-learning-and-big-data/main/data/heart_disease.csv"
heart_disease <- read_csv(url)
Look at the head of the data
head(heart_disease, n = 10)
## # A tibble: 10 × 12
## Age Sex ChestPain…¹ Resti…² Chole…³ Fasti…⁴ Resti…⁵ MaxHR Exerc…⁶ Oldpeak
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 40 M ATA 140 289 0 Normal 172 N 0
## 2 49 F NAP 160 180 0 Normal 156 N 1
## 3 37 M ATA 130 283 0 ST 98 N 0
## 4 48 F ASY 138 214 0 Normal 108 Y 1.5
## 5 54 M NAP 150 195 0 Normal 122 N 0
## 6 39 M NAP 120 339 0 Normal 170 N 0
## 7 45 F ATA 130 237 0 Normal 170 N 0
## 8 54 M ATA 110 208 0 Normal 142 N 0
## 9 37 M ASY 140 207 0 Normal 130 Y 1.5
## 10 48 F ATA 120 284 0 Normal 120 N 0
## # … with 2 more variables: ST_Slope <chr>, HeartDisease <dbl>, and abbreviated
## # variable names ¹ChestPainType, ²RestingBP, ³Cholesterol, ⁴FastingBS,
## # ⁵RestingECG, ⁶ExerciseAngina
Get a glimpse of the variables in the dataset.
# get a glimpse of the variables
glimpse(heart_disease)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
The dataset has 918 observations (rows) and 12 variables (columns). Eleven (11) of the 12 variables are potential predictors of the twelfth (12th) variable, HeartDisease. The data is labelled.
Each observation represents the characteristics of an individual, such as Age, Sex, RestingBP, and Cholesterol level, and whether that individual has heart disease.
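There are eleven (11) explanatory variables: six are stored as numeric and five are categorical. The numeric variables are Age, RestingBP, Cholesterol, FastingBS, MaxHR, and Oldpeak. The categorical variables are Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope.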
Relevant statistics are:
Summary statistics of all variables
summary(heart_disease)
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
From the summary statistics, we can see that the mean age of individuals in the dataset is about 53.5, while the median age is 54. The mean RestingBP is 132.4, the mean Cholesterol level is 198.8, and the mean MaxHR is 136.8.
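The summary output also reports a minimum of 0 for both RestingBP and Cholesterol, which is not a plausible clinical reading. The sketch below counts such zeros, assuming (the data source does not state this) that 0 was recorded in place of a missing measurement. The helper `count_zeros` and the toy data frame are hypothetical and only illustrate the idea.

```r
# Count how many values in each chosen column are exactly zero.
# Assumption: a zero in these columns stands in for a missing measurement.
count_zeros <- function(df, cols) {
  vapply(cols, function(col) sum(df[[col]] == 0, na.rm = TRUE), integer(1))
}

# Toy stand-in for heart_disease (made-up values, same column names)
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

zero_counts <- count_zeros(toy, c("RestingBP", "Cholesterol"))
print(zero_counts)
```

On the real data the equivalent call would be `count_zeros(heart_disease, c("RestingBP", "Cholesterol"))`.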
Check for Correlation
model.matrix(~0+., data=heart_disease) %>%
  cor(use="pairwise.complete.obs") %>%
  ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)
The predictors are largely not correlated with one another, as can be seen from the correlation plot.
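The correlation code relies on `model.matrix(~0+., ...)` to turn the character columns into 0/1 indicator columns before calling `cor()`. The sketch below shows that expansion on a tiny made-up data frame (the values and the name `toy` are illustrative only). With the intercept dropped, the first categorical column gets one indicator per level; under R's default treatment contrasts, later categorical columns drop their first level.

```r
# Tiny made-up data frame mixing a numeric and a character column
toy <- data.frame(
  Age = c(40, 49, 37, 48),
  Sex = c("M", "F", "M", "F")
)

# ~ 0 + . drops the intercept, so the first categorical column is
# expanded into one 0/1 indicator per level (SexF and SexM here)
design <- model.matrix(~ 0 + ., data = toy)
print(colnames(design))

# Correlation matrix computed on the all-numeric design matrix
toy_cor <- cor(design, use = "pairwise.complete.obs")
print(round(toy_cor, 2))
```

Because SexF and SexM are complements of each other, their correlation is exactly -1, so such indicator pairs in the plot reflect the encoding rather than a relationship between predictors.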
Check for missing (NA) values
# Check for NA values
any(is.na(heart_disease))
## [1] FALSE
Visualize the NA values
# use missmap function from the Amelia package to check for NA values
missmap(heart_disease, main = "Heart Data - Missing Values", col = c("yellow", "black"), legend = FALSE)
## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
There are no NA values in the dataset.
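`any(is.na(...))` only reports whether at least one NA exists. When it returns TRUE, a per-column count shows where the gaps are. The sketch below uses base R on a made-up data frame (`toy` and its values are illustrative).

```r
# Made-up data frame with a few missing entries
toy <- data.frame(
  Age         = c(40, NA, 37),
  Cholesterol = c(289, 180, NA),
  Sex         = c("M", "F", "M")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

Applied to the report's data, `colSums(is.na(heart_disease))` would return zero for every column, matching the `FALSE` result above.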
Bar Graph by Gender
#filter the dataset for only those with heart disease
heart_1 <- heart_disease %>% filter(HeartDisease == 1)
#filter the dataset for only those without heart disease
heart_0 <- heart_disease %>% filter(HeartDisease == 0)
# Bar Chart by Sex for the entire data set
p1 <- ggplot(heart_disease, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - All") + ylab(NULL)
# Bar plot by Sex for only those with Heart Disease
p2 <- ggplot(heart_1, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - Heart Disease") + ylab(NULL)
# Bar plot by Sex for only those with no heart disease
p3 <- ggplot(heart_0, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - No Heart Disease") + ylab(NULL)
# Bar plot of individuals who have heart disease by Sex
p4 <- heart_disease %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + ylab(NULL) +
labs(title = "HeartDisease vs No HeartDisease")
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)
Histograms to show the distributions of Age, Cholesterol, and RestingBP
# Histogram to show age distribution in the dataset
p5 <- heart_disease |> ggplot(aes(x = Age)) + geom_histogram(fill = "brown", binwidth = 2) + theme_bw() +
labs(title = "Distribution by Age") + ylab(NULL)
# Histogram of Cholesterol level
p6 <- ggplot(heart_disease, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
labs(title = "Distribution of Cholesterol level") + ylab(NULL) + theme_bw()
# Histogram of RestingBP
p7 <- heart_disease |> ggplot(aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
labs(title = "Distribution of RestingBP") + ylab(NULL) + theme_bw()
# Plot all the histograms in a grid
plot_grid(p5, p6, p7)
Scatter plot of RestingBP vs Cholesterol
# RestingBP vs Cholesterol
heart_disease |> ggplot(aes(x = Cholesterol, y = RestingBP, color = RestingECG)) + geom_point() +
labs(title = "RestingBP vs Cholesterol") + theme_bw()
Box Plot of RestingBP for each ChestPainType
# Boxplot by ChestPainType
heart_disease |> ggplot() + geom_boxplot(aes(x = ChestPainType, y = RestingBP)) +
labs(title = "Box Plot of Resting BP vs ChestPainType") + theme_bw()
Use the caTools library to split the dataset into training and testing datasets
# Set a seed
set.seed(1994)
#Split the sample
sample <- sample.split(heart_disease$HeartDisease, SplitRatio = 0.8)
# Training Data
heart_train <- subset(heart_disease, sample == TRUE)
# Testing Data
heart_test <- subset(heart_disease, sample == FALSE)
Train the model using a logistic model
# Train the model
heart_logistic_model <- glm(formula = HeartDisease ~ . , family = binomial(link = 'logit'),
data = heart_train)
Get the summary of the model
# Get the summary of the logistic model
summary(heart_logistic_model)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial(link = "logit"),
## data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5574 -0.3803 0.1880 0.4854 2.4617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6741258 1.5794027 -1.060 0.289156
## Age 0.0165346 0.0145153 1.139 0.254657
## SexM 1.2972248 0.3093051 4.194 2.74e-05 ***
## ChestPainTypeATA -1.7950352 0.3586640 -5.005 5.59e-07 ***
## ChestPainTypeNAP -1.7578665 0.2888355 -6.086 1.16e-09 ***
## ChestPainTypeTA -1.2334065 0.4718782 -2.614 0.008954 **
## RestingBP 0.0033623 0.0064796 0.519 0.603821
## Cholesterol -0.0035538 0.0011489 -3.093 0.001980 **
## FastingBS 1.2876688 0.3020131 4.264 2.01e-05 ***
## RestingECGNormal 0.0221109 0.3018480 0.073 0.941606
## RestingECGST -0.0787364 0.3897392 -0.202 0.839898
## MaxHR -0.0003464 0.0055610 -0.062 0.950334
## ExerciseAnginaY 1.0607645 0.2745094 3.864 0.000111 ***
## Oldpeak 0.3001046 0.1344795 2.232 0.025641 *
## ST_SlopeFlat 1.3126680 0.4668419 2.812 0.004926 **
## ST_SlopeUp -1.0289223 0.4971077 -2.070 0.038469 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1009.24 on 733 degrees of freedom
## Residual deviance: 495.17 on 718 degrees of freedom
## AIC: 527.17
##
## Number of Fisher Scoring iterations: 5
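The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below copies four estimates and their standard errors from the summary above and adds approximate Wald 95% intervals (estimate ± 1.96 × standard error). The interval method is a choice made for this illustration, not something the report computes.

```r
# Log-odds estimates and standard errors copied from the model summary
log_odds <- c(
  SexM            = 1.2972248,
  FastingBS       = 1.2876688,
  ExerciseAnginaY = 1.0607645,
  Cholesterol     = -0.0035538
)
std_err <- c(
  SexM            = 0.3093051,
  FastingBS       = 0.3020131,
  ExerciseAnginaY = 0.2745094,
  Cholesterol     = 0.0011489
)

# Odds ratio = exp(log-odds); values above 1 raise the odds, below 1 lower them
odds_ratio <- exp(log_odds)

# Approximate Wald 95% interval, computed on the log-odds scale first
lower <- exp(log_odds - 1.96 * std_err)
upper <- exp(log_odds + 1.96 * std_err)

print(round(cbind(odds_ratio, lower, upper), 3))
```

With the fitted object available, `exp(coef(heart_logistic_model))` gives the same odds ratios for every term. The SexM odds ratio is roughly 3.66, meaning the model estimates the odds of heart disease for males at about 3.7 times those for females, holding the other predictors fixed.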
Predict values using the model
fit_heart_probabilities <- predict(heart_logistic_model, newdata = heart_test, type = "response")
Properly group the probabilities
# Make probabilities greater than 0.5 to be 1
fit_heart_results <- ifelse(fit_heart_probabilities > 0.5, 1, 0)
Accuracy
# Mis-classification Error
misclassError <- mean(fit_heart_results != heart_test$HeartDisease)
accuracy <- round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")
## [1] "The accuracy of the logistic regression model is 90.22%"
Confusion Matrix
print("-CONFUSION MATRIX-")
## [1] "-CONFUSION MATRIX-"
table(heart_test$HeartDisease, fit_heart_results > 0.5)
##
## FALSE TRUE
## 0 73 9
## 1 9 93
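Accuracy alone hides how the errors split between the two classes. The sketch below re-enters the four counts from the confusion matrix above (rows are the actual class, columns the predicted class) and derives sensitivity, specificity, and precision. The variable names are illustrative.

```r
# Counts copied from the confusion matrix: rows = actual, columns = predicted
cm <- matrix(
  c(73,  9,
     9, 93),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

tn <- cm["0", "FALSE"]  # no disease, predicted no disease
fp <- cm["0", "TRUE"]   # no disease, predicted disease
fn <- cm["1", "FALSE"]  # disease, predicted no disease
tp <- cm["1", "TRUE"]   # disease, predicted disease

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual cases that were caught
specificity <- tn / (tn + fp)  # share of non-cases correctly cleared
precision   <- tp / (tp + fp)  # share of positive predictions that were right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

The accuracy works out to 166/184, which matches the 90.22% reported above.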
From exploratory data analysis, we see that males are more likely to have HeartDisease than females. The summary of the logistic model agrees: SexM (male sex) is a significant predictor of HeartDisease. Furthermore, hypothesis testing of the difference in mean cholesterol level between those with and without heart disease shows a significant difference in Cholesterol but no significant difference in RestingBP. The model summary is consistent with this: Cholesterol is a significant predictor of HeartDisease (p ≈ 0.002), while RestingBP is not (p ≈ 0.60). Other significant predictors in the model summary include ChestPainType, ExerciseAngina (whether the individual has exercise-induced angina), and FastingBS (fasting blood sugar).
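The hypothesis test mentioned in the paragraph above is not shown in this report. One common way to run such a comparison is Welch's two-sample t-test, sketched below on made-up values. Both the choice of test and the toy numbers are assumptions for illustration, not the author's actual procedure or data.

```r
# Made-up values; the real comparison would use the heart_disease data
toy <- data.frame(
  HeartDisease = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Cholesterol  = c(289, 283, 195, 339, 237, 180, 214, 207, 164, 150)
)

# t.test() with a formula compares the two groups; by default it does not
# assume equal variances (Welch's test)
chol_test <- t.test(Cholesterol ~ HeartDisease, data = toy)

print(chol_test$method)
print(chol_test$p.value)
```

On the real data the call would be `t.test(Cholesterol ~ HeartDisease, data = heart_disease)`, and likewise with RestingBP on the left-hand side.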
This analyzes whether an individual will have a death event (DEATH_EVENT) from a heart attack.
Load the data from github repo
url <- "https://raw.githubusercontent.com/chinedu2301/data622-machine-learning-and-big-data/main/data/heart_failure_clinical_records_dataset.csv"
heart_failure <- read_csv(url)
Look at the head of the data
head(heart_failure, n = 10)
## # A tibble: 10 × 13
## age anaemia creatin…¹ diabe…² eject…³ high_…⁴ plate…⁵ serum…⁶ serum…⁷ sex
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75 0 582 0 20 1 265000 1.9 130 1
## 2 55 0 7861 0 38 0 263358. 1.1 136 1
## 3 65 0 146 0 20 0 162000 1.3 129 1
## 4 50 1 111 0 20 0 210000 1.9 137 1
## 5 65 1 160 1 20 0 327000 2.7 116 0
## 6 90 1 47 0 40 1 204000 2.1 132 1
## 7 75 1 246 0 15 0 127000 1.2 137 1
## 8 60 1 315 1 60 0 454000 1.1 131 1
## 9 65 0 157 0 65 0 263358. 1.5 138 0
## 10 80 1 123 0 35 1 388000 9.4 133 1
## # … with 3 more variables: smoking <dbl>, time <dbl>, DEATH_EVENT <dbl>, and
## # abbreviated variable names ¹creatinine_phosphokinase, ²diabetes,
## # ³ejection_fraction, ⁴high_blood_pressure, ⁵platelets, ⁶serum_creatinine,
## # ⁷serum_sodium
Get a glimpse of the variables in the dataset.
# get a glimpse of the variables
glimpse(heart_failure)
## Rows: 299
## Columns: 13
## $ age <dbl> 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 6…
## $ anaemia <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, …
## $ creatinine_phosphokinase <dbl> 582, 7861, 146, 111, 160, 47, 246, 315, 157, …
## $ diabetes <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ ejection_fraction <dbl> 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 2…
## $ high_blood_pressure <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, …
## $ platelets <dbl> 265000, 263358, 162000, 210000, 327000, 20400…
## $ serum_creatinine <dbl> 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.1…
## $ serum_sodium <dbl> 130, 136, 129, 137, 116, 132, 137, 131, 138, …
## $ sex <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, …
## $ smoking <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, …
## $ time <dbl> 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11,…
## $ DEATH_EVENT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
The dataset has 299 observations (rows) and 13 variables (columns). Twelve (12) of the 13 variables are potential predictors of the thirteenth (13th) variable, DEATH_EVENT. The data is labelled.
Each observation represents the characteristics of an individual, such as age, anaemia, and diabetes, and whether that individual had a DEATH_EVENT.
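There are twelve (12) explanatory variables, all stored as numeric; anaemia, diabetes, high_blood_pressure, sex, and smoking take only the values 0 and 1. The explanatory variables are: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, and time.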
Relevant statistics are:
Summary statistics of all variables
summary(heart_failure)
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## DEATH_EVENT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
From the summary statistics, we can see that the mean age of individuals in the dataset is about 60.8 and the median age is 60.
Check for Correlation
heart_failure_correlation <- cor(heart_failure, method = "spearman")
corrplot(heart_failure_correlation)
The predictors are largely not correlated with one another, as can be seen from the correlation plot.
Check for missing (NA) values
# Check for NA values
any(is.na(heart_failure))
## [1] FALSE
Visualize the NA values
# use missmap function from the Amelia package to check for NA values
missmap(heart_failure, main = "Heart Failure/Attack - Missing Values", col = c("yellow", "black"), legend = FALSE)
## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
There are no NA values in the dataset.
# filter the dataset for only those with a death event
deathEvent_1 <- heart_failure %>% filter(DEATH_EVENT == 1)
# filter the dataset for only those without a death event
deathEvent_0 <- heart_failure %>% filter(DEATH_EVENT == 0)
# Bar chart by sex for the entire data set
p1 <- ggplot(heart_failure, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - All") + ylab(NULL)
# Bar plot by sex for only those with a death event
p2 <- ggplot(deathEvent_1, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - DEATH_EVENT") + ylab(NULL)
# Bar plot by sex for only those with no death event
p3 <- ggplot(deathEvent_0, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - No DEATH_EVENT") + ylab(NULL)
# Bar plot of death events vs no death events, filled by sex
p4 <- heart_failure %>% mutate(heart_prob = ifelse(DEATH_EVENT == 1, "Yes", "No")) %>%
ggplot(aes(x = heart_prob, fill = sex)) + geom_bar() + theme_bw() + ylab(NULL) +
labs(title = "DEATH_EVENT vs No DEATH_EVENT")
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
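The warning arises because `sex` is stored as a number, so ggplot2 treats `fill = sex` as a continuous scale and drops it when counting bars. As the warning itself suggests, converting the variable to a factor before plotting is one way to avoid this. The sketch below uses a made-up data frame, and the column name `sex_group` is hypothetical. The levels are left as 0 and 1 because the report does not state which value corresponds to which sex.

```r
library(ggplot2)

# Made-up stand-in for heart_failure with the same numeric 0/1 coding
toy <- data.frame(
  sex         = c(1, 1, 0, 1, 0, 1),
  DEATH_EVENT = c(1, 0, 0, 1, 1, 0)
)
toy$heart_prob <- ifelse(toy$DEATH_EVENT == 1, "Yes", "No")

# A factor gives ggplot2 discrete groups, so the fill colours are kept
toy$sex_group <- factor(toy$sex)

p <- ggplot(toy, aes(x = heart_prob, fill = sex_group)) +
  geom_bar() + theme_bw() + ylab(NULL) +
  labs(title = "DEATH_EVENT vs No DEATH_EVENT", fill = "sex")
print(p)
```

The same effect can be had inline with `aes(x = heart_prob, fill = factor(sex))`.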
# Bar Charts
p1 <- ggplot(deathEvent_1, aes(x =anaemia)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph - Anaemia") + ylab(NULL)
p2 <- ggplot(deathEvent_1, aes(x =high_blood_pressure)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph - HBP") + ylab(NULL)
p3 <- ggplot(deathEvent_1, aes(x =diabetes)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph - Diabetes") + ylab(NULL)
p4 <- ggplot(deathEvent_1, aes(x =smoking)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph - Smoking") + ylab(NULL)
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)
Histogram to show the distribution of ejection_fraction
# Histogram of ejection_fraction
ejection_fraction <- ggplot(heart_failure, aes(x = ejection_fraction)) + geom_histogram(binwidth = 12, fill = "brown") +
labs(title = "Distribution of ejection_fraction") + ylab(NULL) + theme_bw()
ejection_fraction
Use the caTools library to split the dataset into training and testing datasets
# Set a seed
set.seed(1994)
#Split the sample
sample <- sample.split(heart_failure$DEATH_EVENT, SplitRatio = 0.8)
# Training Data
heart_failure_train <- subset(heart_failure, sample == TRUE)
# Testing Data
heart_failure_test <- subset(heart_failure, sample == FALSE)
Train the model using a logistic model
# Train the model
deathEvent_logistic_model <- glm(formula = DEATH_EVENT ~ . , family = binomial(link = 'logit'),
data = heart_failure_train)
Get the summary of the model
# Get the summary of the logistic model
summary(deathEvent_logistic_model)
##
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"),
## data = heart_failure_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2584 -0.5229 -0.2296 0.4473 2.3042
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.355e+00 6.123e+00 1.201 0.229684
## age 5.523e-02 1.804e-02 3.062 0.002200 **
## anaemia 3.646e-01 4.088e-01 0.892 0.372524
## creatinine_phosphokinase 2.726e-04 1.952e-04 1.396 0.162567
## diabetes 1.631e-01 4.108e-01 0.397 0.691433
## ejection_fraction -7.595e-02 1.804e-02 -4.210 2.55e-05 ***
## high_blood_pressure -8.921e-02 4.227e-01 -0.211 0.832845
## platelets -1.016e-06 2.200e-06 -0.462 0.644112
## serum_creatinine 8.009e-01 2.084e-01 3.843 0.000121 ***
## serum_sodium -5.531e-02 4.313e-02 -1.282 0.199689
## sex -1.444e-01 4.706e-01 -0.307 0.758945
## smoking -3.494e-01 4.619e-01 -0.756 0.449355
## time -2.061e-02 3.333e-03 -6.183 6.28e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 300.42 on 238 degrees of freedom
## Residual deviance: 168.39 on 226 degrees of freedom
## AIC: 194.39
##
## Number of Fisher Scoring iterations: 6
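The null and residual deviances at the bottom of the summary indicate how much the predictors improve on an intercept-only model. The sketch below computes two quantities from them. The first is a deviance-based pseudo R², which for a 0/1 response coincides with McFadden's R². The second is a likelihood-ratio chi-square test. Neither appears in the original report; both are added here as one possible reading of the printed numbers.

```r
# Values copied from the model summary above
null_dev  <- 300.42; null_df  <- 238
resid_dev <- 168.39; resid_df <- 226

# Share of the null deviance explained by the predictors
pseudo_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test: drop in deviance against a chi-square distribution
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(c(pseudo_r2 = pseudo_r2, lr_stat = lr_stat, lr_df = lr_df), 4))
print(lr_p)
```

With the fitted model in memory, the same inputs are available as `deathEvent_logistic_model$null.deviance` and `deathEvent_logistic_model$deviance`.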
Predict values using the model
fit_deathEvent_probabilities <- predict(deathEvent_logistic_model, newdata = heart_failure_test, type = "response")
Properly group the probabilities
# Make probabilities greater than 0.5 to be 1
fit_deathEvent_results <- ifelse(fit_deathEvent_probabilities > 0.5, 1, 0)
Accuracy
# Mis-classification Error
misclassError <- mean(fit_deathEvent_results != heart_failure_test$DEATH_EVENT)
accuracy <- round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")
## [1] "The accuracy of the logistic regression model is 83.33%"
Confusion Matrix
print("-CONFUSION MATRIX-")
## [1] "-CONFUSION MATRIX-"
table(heart_failure_test$DEATH_EVENT, fit_deathEvent_results > 0.5)
##
## FALSE TRUE
## 0 39 2
## 1 8 11
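Because death events are the minority class in this test set (19 of 60), accuracy is best read against a baseline that always predicts "no death event". The sketch below re-enters the confusion matrix counts from above and compares the two, and also reports how many actual death events the model caught. The variable names are illustrative.

```r
# Counts copied from the confusion matrix: rows = actual, columns = predicted
cm <- matrix(
  c(39,  2,
     8, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Accuracy of always predicting the more common actual class
baseline_accuracy <- max(rowSums(cm)) / sum(cm)

# Accuracy of the logistic model (correct predictions sit on the diagonal)
model_accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual death events that the model flagged
death_recall <- cm["1", "TRUE"] / sum(cm["1", ])

print(round(c(baseline = baseline_accuracy, model = model_accuracy,
              death_recall = death_recall), 4))
```

The model beats the baseline (50/60 against 41/60), but it flags only 11 of the 19 actual death events at the 0.5 threshold.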
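From exploratory data analysis, we see that males are more likely to have death events from a heart attack. In the logistic model, however, sex is not a significant predictor of DEATH_EVENT (p ≈ 0.76); the significant predictors are age, ejection_fraction, serum_creatinine, and time.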