This RMarkdown file contains the report of the data analysis done for the project on building and deploying a stroke prediction model in R. It contains analysis such as data exploration, summary statistics and building the prediction models. The final report was completed on Wed Oct 22 13:41:01 2025.
Data Description:
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This data set is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
The data set can be found on Kaggle at the following link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset.
First, the libraries expected to be used in the analysis is loaded into R.
library(tidyr)
library(dplyr) #Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ROSE) #ovun sampling
## Loaded ROSE 0.0-4
library(randomForest) #Random Forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(MASS) #step AIC
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(caret) #Classification and Regression & Confusion Matrix
## Loading required package: lattice
library(e1071) #SVM
library(pROC) #ROC Curve
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(pander)
Next, the data set is loaded into R.
#Load the data frame.
health <- read.csv('~/R datasets/stroke_data.csv')
Before beginning the analysis process, the data is viewed to better understand the data types and variables.
#Inspect the data frame.
head(health)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
#View summary of the data frame.
summary(health)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
#View the column names.
colnames(health)
## [1] "id" "gender" "age"
## [4] "hypertension" "heart_disease" "ever_married"
## [7] "work_type" "Residence_type" "avg_glucose_level"
## [10] "bmi" "smoking_status" "stroke"
#View data types in the data frame.
str(health)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
I removed the ID variable since it is not needed. In addition, data tables are created to have an understanding of each variable’s distribution. It is clear that the stroke variable is highly imbalanced (much more observations did not have a stroke). This is a key detail to keep in mind when creating the prediction models.
#Removes id column from data frame.
health <- health[,-c(1)]
table(health$gender)
##
## Female Male Other
## 2994 2115 1
table(health$stroke)
##
## 0 1
## 4861 249
table(health$smoking_status)
##
## formerly smoked never smoked smokes Unknown
## 885 1892 789 1544
table(health$work_type)
##
## children Govt_job Never_worked Private Self-employed
## 687 657 22 2925 819
table(health$ever_married)
##
## No Yes
## 1757 3353
table(health$Residence_type)
##
## Rural Urban
## 2514 2596
Before proceeding, the one observation with gender marked as ‘Other’ is removed since it is an outlier, and BMI is converted to a numeric value.
#Removes value with gender = 'Other'
health <- health[-(which(health$gender %in% 'Other')),]
#Double checks if correct value was removed.
table(health$gender)
##
## Female Male
## 2994 2115
#Convert BMI data types from character to numeric.
health$bmi <- as.numeric(as.character(health$bmi))
## Warning: NAs introduced by coercion
#Double checks if BMI data type converted to numeric.
str(health)
## 'data.frame': 5109 obs. of 11 variables:
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
Duplicate values are checked in the data set. The only duplicate found is BMI, though this is not uncommon since BMI can have common values.
colSums(is.na(health))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
duplicates <- health%>%duplicated()
duplicates_count <- duplicates%>%table()
duplicates_count
## .
## FALSE
## 5109
Since the data is pretty clean, it is time to explore distributions and any noticeable patterns.
First, the appropriate variables are factored since they are categorical. The factored variables are saved in a new data frame, which will be used going forward.
health_data <- health
health_data$stroke <- factor(health$stroke, levels = c(0,1))
health_data$gender <- factor(health_data$gender)
health_data$hypertension <- factor(health_data$hypertension)
health_data$heart_disease <- factor(health_data$heart_disease)
health_data$ever_married <- factor(health_data$ever_married)
health_data$work_type <- factor(health_data$work_type)
health_data$Residence_type <- factor(health_data$Residence_type)
health_data$smoking_status <- factor(health_data$smoking_status)
attach(health_data)
A couple bar plots are created to break down gender distribution, including the break down between the two genders and the break down of stroke variable by gender (if a subject has had a stroke or not).
barplot(table(health$gender), col = c('Cyan3', 'goldenrod'))
table.stroke_gender <- table(health$stroke, health$gender)
barplot(table.stroke_gender, col = c("sienna1", "royalblue"), beside = T,
names.arg = c("Female", "Male"))
legend("topright", legend = c("No Stroke", "Stroke"), fill = c("sienna1", "royalblue"))
A couple bar plots are created to compare the two continuous variables
between the two stroke groups. There seems to be a difference in the
glucose levels between the stroke groups based on the IQR and mean
values, while there does not appear to be a noticeable difference in the
BMI values between the two groups. This can also be hard to discern due
to the large imbalance between the groups.
par(mfrow = c(1,2))
plot(stroke, avg_glucose_level, col = "blue", varwidth = T, ylab = "Glucose Level", xlab = "Stroke")
plot(stroke, bmi, col = "blue", varwidth = T, ylab = "BMI", xlab = "Stroke")
Finally, a couple bar charts are completed to break down the stroke
variable between the work groups and heart disease. From viewing the
stacked bar charts, there does not appear to be a discernable pattern or
trend related to strokes and work type or heart disease.
ggplot(data = health_data, aes(x = work_type, fill = stroke))+geom_bar()
ggplot(data = health_data, aes(x = heart_disease, fill = stroke))+geom_bar()
Now that there is a brief understanding of the distribution of the variables and their relationships, the prediction models can be created.
Before creating prediction models, training and testing data sets are created. A training data set is a subset of examples used to train the model, while the testing data set is a subset used to test the training model.
set.seed(123)
#New sample created for the training and testing data sets. The data is split with 75% in training and 25% in testing.
sample <- sample(c(TRUE, FALSE), nrow(health_data), replace = TRUE, prob = c(0.75, 0.25))
train_set <- health_data[sample, ]
test_set <- health_data[-sample, ]
#Imputates NA values in bmi column with the average bmi value
mean_val <- mean(train_set$bmi, na.rm = TRUE)
train_set$bmi[is.na(train_set$bmi)] <- mean_val
test_set$bmi[is.na(test_set$bmi)] <- mean_val
From the bar plots, it is clear there is an imbalance between those who had a stroke and those who did not in the data. This could cause issues in creating a prediction model, which would most likely skew towards predicting much more “No” answers since there are significantly more within the sampled data. To solve this issue, oversampling and undersampling the training set data can be done. Oversampling duplicates random samples from the minority class, while undersampling randomly reduces samples from the majority class. Doing both helps to “even out” the bias and possibly improve the model’s overall performance, which was done in this case do to the incredibly high imbalance in the data.
health_balance <- ovun.sample(stroke ~., data = train_set, N = nrow(train_set),
method = "both", seed = 123)$data
table(health_balance$stroke)
##
## 0 1
## 1938 1914
The table shows the training data set is now balanced, which will help with the prediction models.
The first prediction model created was a logistic regression model since stroke is a binary categorical variable (either 0 or 1). The first logistic regression model was run with all variables to determine their significance to the model. The second logistic regression model keeps age, hypertension, work type, average glucose level, and BMI as variables, which are all statistically significant to stroke prediction. After running the predictive probabilities on the testing set and viewing the confusion matrix, the logistic regression model did relatively well, with an overall accuracy of 73.3% and an F1 score of 83.8%. The recall score (false negatives) did an excellent job, while the precision (false positives) did a little worse. This is fine in this scenario since its better to have a false positive for a stroke prediction than vice versa.
#Logistic Regression model
fit_glm <- glm(stroke ~ ., data = health_balance, family = binomial())
summary(fit_glm)
##
## Call:
## glm(formula = stroke ~ ., family = binomial(), data = health_balance)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.828e+00 3.188e-01 -12.008 < 2e-16 ***
## genderMale -1.544e-01 8.372e-02 -1.844 0.065198 .
## age 7.885e-02 3.319e-03 23.760 < 2e-16 ***
## hypertension1 4.136e-01 1.064e-01 3.887 0.000101 ***
## heart_disease1 1.832e-01 1.375e-01 1.332 0.182834
## ever_marriedYes 5.149e-04 1.312e-01 0.004 0.996868
## work_typeGovt_job -1.609e+00 3.370e-01 -4.773 1.81e-06 ***
## work_typeNever_worked -1.261e+01 2.648e+02 -0.048 0.962029
## work_typePrivate -1.612e+00 3.271e-01 -4.930 8.24e-07 ***
## work_typeSelf-employed -1.782e+00 3.500e-01 -5.091 3.56e-07 ***
## Residence_typeUrban 2.340e-01 8.153e-02 2.871 0.004097 **
## avg_glucose_level 3.411e-03 7.847e-04 4.347 1.38e-05 ***
## bmi 2.006e-02 6.837e-03 2.933 0.003352 **
## smoking_statusnever smoked -4.641e-01 1.067e-01 -4.348 1.37e-05 ***
## smoking_statussmokes -1.655e-01 1.300e-01 -1.273 0.203001
## smoking_statusUnknown -1.997e-01 1.250e-01 -1.597 0.110163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5339.9 on 3851 degrees of freedom
## Residual deviance: 3720.3 on 3836 degrees of freedom
## AIC: 3752.3
##
## Number of Fisher Scoring iterations: 13
#2nd logistic regression model
fit_glm2 <- glm(stroke ~ age+hypertension+work_type+avg_glucose_level+bmi,
data = health_balance, family = binomial())
summary(fit_glm2)
##
## Call:
## glm(formula = stroke ~ age + hypertension + work_type + avg_glucose_level +
## bmi, family = binomial(), data = health_balance)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.093e+00 2.889e-01 -14.167 < 2e-16 ***
## age 7.941e-02 3.158e-03 25.144 < 2e-16 ***
## hypertension1 4.111e-01 1.051e-01 3.912 9.15e-05 ***
## work_typeGovt_job -1.610e+00 3.186e-01 -5.053 4.36e-07 ***
## work_typeNever_worked -1.276e+01 2.650e+02 -0.048 0.96160
## work_typePrivate -1.622e+00 3.066e-01 -5.291 1.22e-07 ***
## work_typeSelf-employed -1.814e+00 3.289e-01 -5.516 3.46e-08 ***
## avg_glucose_level 3.569e-03 7.665e-04 4.656 3.23e-06 ***
## bmi 2.228e-02 6.791e-03 3.280 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5339.9 on 3851 degrees of freedom
## Residual deviance: 3753.7 on 3843 degrees of freedom
## AIC: 3771.7
##
## Number of Fisher Scoring iterations: 13
stepAIC(fit_glm2)
## Start: AIC=3771.73
## stroke ~ age + hypertension + work_type + avg_glucose_level +
## bmi
##
## Df Deviance AIC
## <none> 3753.7 3771.7
## - bmi 1 3764.5 3780.5
## - hypertension 1 3769.4 3785.4
## - work_type 4 3780.4 3790.4
## - avg_glucose_level 1 3775.8 3791.8
## - age 1 4687.4 4703.4
##
## Call: glm(formula = stroke ~ age + hypertension + work_type + avg_glucose_level +
## bmi, family = binomial(), data = health_balance)
##
## Coefficients:
## (Intercept) age hypertension1
## -4.092553 0.079411 0.411071
## work_typeGovt_job work_typeNever_worked work_typePrivate
## -1.609639 -12.757574 -1.622120
## work_typeSelf-employed avg_glucose_level bmi
## -1.814498 0.003569 0.022276
##
## Degrees of Freedom: 3851 Total (i.e. Null); 3843 Residual
## Null Deviance: 5340
## Residual Deviance: 3754 AIC: 3772
#VIF values - all very low
vif(fit_glm2)
## GVIF Df GVIF^(1/(2*Df))
## age 1.607159 1 1.267738
## hypertension 1.045751 1 1.022620
## work_type 1.665629 4 1.065853
## avg_glucose_level 1.109784 1 1.053463
## bmi 1.168355 1 1.080905
#Predictive probability on testing set
pred_probs <- predict.glm(fit_glm2, newdata = test_set, type = "response")
#Predicts if the patient will have a stroke, placing them in the "No" category if the prediction is less than 0.5 - otherwise, in the "Yes" category.
pred_log <- ifelse(pred_probs<0.5, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table <- table(test_set$stroke, pred_log)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke <- confusionMatrix(confusion_table, mode = "everything")
cm_log_stroke
## Confusion Matrix and Statistics
##
## pred_log
## 0 1
## 0 3543 1317
## 1 49 199
##
## Accuracy : 0.7326
## 95% CI : (0.7202, 0.7447)
## No Information Rate : 0.7032
## P-Value [Acc > NIR] : 1.893e-06
##
## Kappa : 0.1551
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9864
## Specificity : 0.1313
## Pos Pred Value : 0.7290
## Neg Pred Value : 0.8024
## Precision : 0.7290
## Recall : 0.9864
## F1 : 0.8384
## Prevalence : 0.7032
## Detection Rate : 0.6936
## Detection Prevalence : 0.9514
## Balanced Accuracy : 0.5588
##
## 'Positive' Class : 0
##
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision = cm_log_stroke$byClass['Precision']
log_recall = cm_log_stroke$byClass['Recall']
log_f1_score = cm_log_stroke$byClass['F1']
log_stroke_score = cm_log_stroke$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: 0.733"))
## [1] "Accuracy: 0.733"
print(paste("Total Precision: ", round(log_precision,3)))
## [1] "Total Precision: 0.729"
print(paste("Recall: ", round(log_recall,3)))
## [1] "Recall: 0.986"
print(paste("F1 Score: ", round(log_f1_score,3)))
## [1] "F1 Score: 0.838"
print(paste("Stroke Pred Score: ", round(log_stroke_score,3)))
## [1] "Stroke Pred Score: 0.802"
An ROC curve is created to determine the model’s performance as well as determine a more ideal cutoff point. Adjusting the cutoff point has a trade off - while it can improve precision, it can also affect the recall of the model. The ROC curve has an AUC (area under the curve) value of 0.843, which means the model has very good performance.
The optimal cutoffs calculated gave different values depending on the goal of the model. For this session, I chose the threshold value of 0.688 in hopes of improving precision and recall. Re-running the logistic regression model with the new value, the accuracy, precision, recall, and F1 scores all improved. However, this should be done with caution. When looking at the confusion matrix, the new cutoff point did a worse job predicting a stroke, even though it did much better predicting a patient not having a stroke. As such, 0.688 may not be the best alternative cutoff point. In comparison, using the threshold optimal cutoff of 0.447 yields a lower value for accuracy, precision, and F1 score, but improves the probability of correctly predicting a stroke. In the end, the choice of cutoff point is up to the user.
Since I will show the overall performance of all three prediction model types, I will use the original cutoff point of 0.50 to demonstrate the performance of the logistic regression model.
# Predicted probabilities for the positive class (stroke = 1)
pred_probs_roc <- predict(fit_glm2, newdata = test_set, type = "response")
# Create ROC object
roc_obj <- roc(test_set$stroke, pred_probs_roc)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_obj,
main = "ROC Curve for Stroke Prediction (fit_glm2)",
col = "steelblue", lwd = 2)
#Plot ROC curve with AUC value
plot(roc_obj, col = "blue", print.auc = TRUE, main = "ROC Curve for Capsule Prediction")
opt <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity", "fpr", "tpr"))
print(opt)
## threshold sensitivity specificity fpr tpr
## threshold 0.446711 0.8629032 0.6884774 0.3115226 0.8629032
#Optimal cutoff points:
#threshold: 0.447 sensitivity: 0.863 specificity: 0.688
pred_log_roc <- ifelse(pred_probs<0.688, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table_roc <- table(test_set$stroke, pred_log_roc)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke_roc <- confusionMatrix(confusion_table_roc, mode = "everything")
cm_log_stroke_roc
## Confusion Matrix and Statistics
##
## pred_log_roc
## 0 1
## 0 4132 728
## 1 98 150
##
## Accuracy : 0.8383
## 95% CI : (0.8279, 0.8483)
## No Information Rate : 0.8281
## P-Value [Acc > NIR] : 0.02737
##
## Kappa : 0.2063
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9768
## Specificity : 0.1708
## Pos Pred Value : 0.8502
## Neg Pred Value : 0.6048
## Precision : 0.8502
## Recall : 0.9768
## F1 : 0.9091
## Prevalence : 0.8281
## Detection Rate : 0.8089
## Detection Prevalence : 0.9514
## Balanced Accuracy : 0.5738
##
## 'Positive' Class : 0
##
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision_roc = cm_log_stroke_roc$byClass['Precision']
log_recall_roc = cm_log_stroke_roc$byClass['Recall']
log_f1_score_roc = cm_log_stroke_roc$byClass['F1']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: ", '0.842'))
## [1] "Accuracy: 0.842"
print(paste("Total Precision: ", round(log_precision_roc,3)))
## [1] "Total Precision: 0.85"
print(paste("Recall: ", round(log_recall_roc,3)))
## [1] "Recall: 0.977"
print(paste("F1 Score: ", round(log_f1_score_roc,3)))
## [1] "F1 Score: 0.909"
pred_log_roc2 <- ifelse(pred_probs<0.447, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table_roc2 <- table(test_set$stroke, pred_log_roc2)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke_roc2 <- confusionMatrix(confusion_table_roc2, mode = "everything")
cm_log_stroke_roc2
## Confusion Matrix and Statistics
##
## pred_log_roc2
## 0 1
## 0 3347 1513
## 1 35 213
##
## Accuracy : 0.6969
## 95% CI : (0.6841, 0.7095)
## No Information Rate : 0.6621
## P-Value [Acc > NIR] : 5.902e-08
##
## Kappa : 0.143
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9897
## Specificity : 0.1234
## Pos Pred Value : 0.6887
## Neg Pred Value : 0.8589
## Precision : 0.6887
## Recall : 0.9897
## F1 : 0.8122
## Prevalence : 0.6621
## Detection Rate : 0.6552
## Detection Prevalence : 0.9514
## Balanced Accuracy : 0.5565
##
## 'Positive' Class : 0
##
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision_roc2 = cm_log_stroke_roc2$byClass['Precision']
log_recall_roc2 = cm_log_stroke_roc2$byClass['Recall']
log_f1_score_roc2 = cm_log_stroke_roc2$byClass['F1']
log_stroke_score_roc2 = cm_log_stroke_roc2$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: ", '0.707'))
## [1] "Accuracy: 0.707"
print(paste("Total Precision: ", round(log_precision_roc2,3)))
## [1] "Total Precision: 0.689"
print(paste("Recall: ", round(log_recall_roc2,3)))
## [1] "Recall: 0.99"
print(paste("F1 Score: ", round(log_f1_score_roc2,3)))
## [1] "F1 Score: 0.812"
print(paste("Stroke Pred Score: ", round(log_stroke_score_roc2,3)))
## [1] "Stroke Pred Score: 0.859"
Another prediction model that can be used is random forest modeling. Random forest is a classifying method consisting of many decision trees. By creating a “forest” of decision trees, the classifying model hopes to select it’s best model by running many different decision trees and “takes the majority” to determine classification. To do so, random forest uses out-of-bag sampling.
A random forest is created to determine the probability of stroke. The advantage of random forest is the Variable Importance Plot, showing the overall importance of each variable to the prediction model. In this case, age, average glucose level, and BMI scores seem to be the most important indicators. Unlike the logistic regression model, the random forest model does not see the significance of work type or hypertension when predicting stroke.
From the confusion matrix, the random forest model was slightly worse in accuracy, precision, and F1 score compared to the logistic regression model. When comparing the random forest confusion matrix to the original logistic regression confusion matrix, the random forest model did much better predicting a stroke (92.74% vs. 80.24%, respectively). From this perspective, a random forest model may be better in predicting a stroke compared to the logistic regression model, though it does come with much more false positives (~29% false positive rate).
set.seed(123)
#Random Forest for variables. mtry = 3 since there are 11 variables (square root of 12 is close to 3).
fit_rf <- randomForest(factor(stroke) ~., mtry = 3, ntree = 500, nodesize = 3,
maxnodes = 50, data = health_balance)
#Sorts predictions into their respective class (0 or 1) depending on their value.
predict_rf <- predict(fit_rf, test_set, type = "response")
#Variable Importance Plot
varImpPlot(fit_rf)
#Creates and displays the confusion matrix table based on the actual and predicted values.
confusion_table_rf <- table(test_set$stroke, predict_rf)
#Displays confusion matrix and the statistics associated with the confusion matrix.
cm_rf_stroke <- confusionMatrix(confusion_table_rf, mode = "everything")
cm_rf_stroke
## Confusion Matrix and Statistics
##
## predict_rf
## 0 1
## 0 3481 1379
## 1 18 230
##
## Accuracy : 0.7265
## 95% CI : (0.7141, 0.7387)
## No Information Rate : 0.685
## P-Value [Acc > NIR] : 5.557e-11
##
## Kappa : 0.1786
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9949
## Specificity : 0.1429
## Pos Pred Value : 0.7163
## Neg Pred Value : 0.9274
## Precision : 0.7163
## Recall : 0.9949
## F1 : 0.8329
## Prevalence : 0.6850
## Detection Rate : 0.6815
## Detection Prevalence : 0.9514
## Balanced Accuracy : 0.5689
##
## 'Positive' Class : 0
##
#Calculates the accuracy, precision, and recall in the random forest model.
rf_precision = cm_rf_stroke$byClass['Precision']
rf_recall = cm_rf_stroke$byClass['Recall']
rf_f1_score = cm_rf_stroke$byClass['F1']
rf_stroke_score = cm_rf_stroke$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the random forest model.
print(paste("Accuracy: ", '0.727'))
## [1] "Accuracy: 0.727"
print(paste("Precision: ", round(rf_precision,3)))
## [1] "Precision: 0.716"
print(paste("Recall: ", round(rf_recall,3)))
## [1] "Recall: 0.995"
print(paste("F1 Score: ", round(rf_f1_score,3)))
## [1] "F1 Score: 0.833"
print(paste("Stroke Pred Score: ", round(rf_stroke_score,3)))
## [1] "Stroke Pred Score: 0.927"
The random forest model created above was in large part due to checking the model fit. This is done by viewing the ROC curve for the random forest model, as well as the out-of-bag probabilities for the testing and training sets. From the ROC, the area under the curve is 0.822, which indicates good model performance. The training AUC is 0.941 and the test AUC is 0.919, which means the model is not overfitting and has a good generalization.
#Creates the ROC for the random forest model.
roc_rf <- roc(test_set$stroke, as.ordered(predict_rf))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#Plots the ROC curve for the random forest model.
plot(roc_rf, print.auc = TRUE)
#Gives area under the curve. The closer to 1, the more accurate.
auc(roc_rf)
## Area under the curve: 0.8218
#Gives optimal cutoff points for threshold, sensitivity, and specificity.
coords(roc_rf, "best", input = "sensitivity",
ret = c("threshold")) #threshold = 0.5 (standard)
## threshold
## 1 0.5
# OOB ROC (approximation using predicted OOB probabilities)
roc_train <- roc(health_balance$stroke, as.numeric(predict(fit_rf, type="vote")[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_train <- auc(roc_train)
# Test ROC (what you already have)
roc_test <- roc(test_set$stroke, as.numeric(predict(fit_rf, newdata=test_set, type="vote")[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_test <- auc(roc_test)
print(paste("Training (OOB) AUC:", round(auc_train, 3)))
## [1] "Training (OOB) AUC: 0.941"
print(paste("Test AUC:", round(auc_test, 3)))
## [1] "Test AUC: 0.919"
The next prediction method used is Support Vector Machines (SVMs). SVMs classify a variable using hyperplanes to best divide data into their respective classes. They can be useful with imbalanced data and works well with high-dimensional data.
Prior to creating the prediction models, the train and test sets are scaled (SVMs are more sensitive to scaling the data). Once complete, an initial prediction model is run with a ROC/AUC curve and initial predictive threshold of 0.50. From the ROC curve, it is determined to change the prediction threshold to 0.0588 for better results. The prediction model is re-run with this threshold, leading to a more accurate confusion matrix and table.
From the confusion table, the SVM prediction model had a lower accuracy, precision, recall, and F1 score compared to the prior two prediction models. However, the confusion matrix shows the SVM model does better predicting stroke than the logistic regression model (83.06% vs. 80.24%, respectively). Despite this, the random forest model still fared better predicting a stroke compared to all models.
set.seed(123)
#Create a copy of testing and training sets for SVM
svm_train_balance <- train_set
svm_test_balance <- test_set
num_cols <- c("age", "avg_glucose_level", "bmi")
#Impute numeric NAs on train, then apply same to test
for (col in num_cols) {
med <- median(svm_train_balance[[col]], na.rm = TRUE)
svm_train_balance[[col]][is.na(svm_train_balance[[col]])] <- med
svm_test_balance[[col]][is.na(svm_test_balance[[col]])] <- med
}
#Compute scaling params on train only
mu <- sapply(svm_train_balance[, num_cols, drop = FALSE], mean, na.rm = TRUE)
sdv <- sapply(svm_train_balance[, num_cols, drop = FALSE], sd, na.rm = TRUE)
sdv[sdv == 0] <- 1 # guard against zero-variance
#Apply scaling using train stats to both sets
svm_train_balance[, num_cols] <- scale(svm_train_balance[, num_cols], center = mu, scale = sdv)
svm_test_balance[, num_cols] <- scale(svm_test_balance[, num_cols], center = mu, scale = sdv)
#SVM prediction model
svm_model <- svm(stroke~., kernel = "radial", cost = 1,
probability = TRUE, data =svm_train_balance)
pred_svm <- predict(svm_model, svm_test_balance, probability = TRUE)
probs <- attr(pred_svm, "probabilities")[, "1"]
#ROC/AUC curve
roc_svm <- roc(svm_test_balance$stroke, probs, levels = c("0","1"), positive = "1")
## Setting direction: controls < cases
plot(roc_svm, print.auc = TRUE, col = "blue")
auc(roc_svm)
## Area under the curve: 0.8491
best <- coords(roc_svm, "best", ret = c("threshold","sensitivity","specificity"))
best #0.0588
## threshold sensitivity specificity
## 1 0.05920642 0.8306452 0.687037
#Re-do SVM prediction model based on new threshold
pred_svm_class <- ifelse(probs > 0.0588, "1", "0")
#Creates and displays the confusion matrix table based on the actual and predicted values.
confusion_table_svm2 <- table(svm_test_balance$stroke, pred_svm_class)
#Displays confusion matrix and the statistics associated with the confusion matrix.
cm_svm_stroke2 <- confusionMatrix(confusion_table_svm2, mode = "everything")
cm_svm_stroke2
## Confusion Matrix and Statistics
##
## pred_svm_class
## 0 1
## 0 3256 1604
## 1 41 207
##
## Accuracy : 0.678
## 95% CI : (0.6649, 0.6908)
## No Information Rate : 0.6455
## P-Value [Acc > NIR] : 5.443e-07
##
## Kappa : 0.1265
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.1143
## Pos Pred Value : 0.6700
## Neg Pred Value : 0.8347
## Precision : 0.6700
## Recall : 0.9876
## F1 : 0.7983
## Prevalence : 0.6455
## Detection Rate : 0.6374
## Detection Prevalence : 0.9514
## Balanced Accuracy : 0.5509
##
## 'Positive' Class : 0
##
#Calculates the accuracy, precision, and recall in the random forest model.
svm_precision = cm_svm_stroke2$byClass['Precision']
svm_recall = cm_svm_stroke2$byClass['Recall']
svm_f1_score = cm_svm_stroke2$byClass['F1']
svm_stroke_score = cm_svm_stroke2$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the random forest model.
print(paste("Accuracy: ", "0.689"))
## [1] "Accuracy: 0.689"
print(paste("Total Precision: ", round(svm_precision,3)))
## [1] "Total Precision: 0.67"
print(paste("Recall: ", round(svm_recall,3)))
## [1] "Recall: 0.988"
print(paste("F1 Score: ", round(svm_f1_score,3)))
## [1] "F1 Score: 0.798"
print(paste("Stroke Pred Value: ", round(svm_stroke_score,3)))
## [1] "Stroke Pred Value: 0.835"
After creating all three prediction models, the table below shows the best performance for each model:
set.caption("Performance for Stroke Prediction Models")
data.table = rbind(c('0.733', round(log_precision,3), round(log_recall,3), round(log_f1_score,3),round(log_stroke_score,3)), c('0.727', round(rf_precision,3), round(rf_recall,3), round(rf_f1_score,3), round(rf_stroke_score,3)), c('0.689', round(svm_precision,3), round(svm_recall,3), round(svm_f1_score,3), round(svm_stroke_score,3)))
colnames(data.table) = c("Accuracy", "Precision", "Recall", "F1 Score", "Stroke Pred Value")
rownames(data.table) = c("Logistic Regression", "Random Forest", "SVM")
pander(data.table)
| Accuracy | Precision | Recall | F1 Score | |
|---|---|---|---|---|
| Logistic Regression | 0.733 | 0.729 | 0.986 | 0.838 |
| Random Forest | 0.727 | 0.716 | 0.995 | 0.833 |
| SVM | 0.689 | 0.67 | 0.988 | 0.798 |
| Stroke Pred Value | |
|---|---|
| Logistic Regression | 0.802 |
| Random Forest | 0.927 |
| SVM | 0.835 |
From the table, the logistic regression model had a higher accuracy, precision, and F1 score while the Random forest model performed the best with recall and the stroke prediction value. SVM was the worst prediction model of the three, though it fared well in recall and the stroke prediction value.
This session demonstrated how to use logistic regression, random forest, and support vector machines to create a prediction model for stroke. These three predictive models work well with a binary classification outcome, as they classify an outcome based on probabilities. For this session, random forest seemed to be the best model for this data set classification. Additional variables or different model setups may change these results and are user-dependent.
This session demonstrated only three potential prediction modeling, though there are several more methods that can be done. Other popular methods include neural networks, decision trees, k-nearest neighbors, and naive bayes. All prediction models are only as effective as the understanding by the user of the model, as it is important to know model assumptions and diagnostics.
Thank you for viewing my project.