About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on building and deploying a stroke prediction model in R. It contains analysis such as data exploration, summary statistics and building the prediction models. The final report was completed on Wed Oct 22 13:41:01 2025.

Data Description:

According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

This data set is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

The data set can be found on Kaggle at the following link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset.

Task One: Import data and data preprocessing

Load data and install packages

First, the libraries expected to be used in the analysis is loaded into R.

library(tidyr)
library(dplyr) #Data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ROSE) #ovun sampling
## Loaded ROSE 0.0-4
library(randomForest) #Random Forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(MASS) #step AIC
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(caret) #Classification and Regression & Confusion Matrix
## Loading required package: lattice
library(e1071) #SVM
library(pROC) #ROC Curve
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(pander)

Next, the data set is loaded into R.

#Load the data frame.
health <- read.csv('~/R datasets/stroke_data.csv')

Describe and explore the data

Before beginning the analysis process, the data is viewed to better understand the data types and variables.

#Inspect the data frame.
head(health)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1
#View summary of the data frame.
summary(health)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level     bmi            smoking_status         stroke       
##  Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :106.15                                          Mean   :0.04873  
##  3rd Qu.:114.09                                          3rd Qu.:0.00000  
##  Max.   :271.74                                          Max.   :1.00000
#View the column names.
colnames(health)
##  [1] "id"                "gender"            "age"              
##  [4] "hypertension"      "heart_disease"     "ever_married"     
##  [7] "work_type"         "Residence_type"    "avg_glucose_level"
## [10] "bmi"               "smoking_status"    "stroke"
#View data types in the data frame.
str(health)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : chr  "36.6" "N/A" "32.5" "34.4" ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

I removed the ID variable since it is not needed. In addition, data tables are created to have an understanding of each variable’s distribution. It is clear that the stroke variable is highly imbalanced (much more observations did not have a stroke). This is a key detail to keep in mind when creating the prediction models.

#Removes id column from data frame.
health <- health[,-c(1)]
table(health$gender)
## 
## Female   Male  Other 
##   2994   2115      1
table(health$stroke)
## 
##    0    1 
## 4861  249
table(health$smoking_status)
## 
## formerly smoked    never smoked          smokes         Unknown 
##             885            1892             789            1544
table(health$work_type)
## 
##      children      Govt_job  Never_worked       Private Self-employed 
##           687           657            22          2925           819
table(health$ever_married)
## 
##   No  Yes 
## 1757 3353
table(health$Residence_type)
## 
## Rural Urban 
##  2514  2596

Before proceeding, the one observation with gender marked as ‘Other’ is removed since it is an outlier, and BMI is converted to a numeric value.

#Removes value with gender = 'Other'
health <- health[-(which(health$gender %in% 'Other')),]
#Double checks if correct value was removed.
table(health$gender)
## 
## Female   Male 
##   2994   2115
#Convert BMI data types from character to numeric.
health$bmi <- as.numeric(as.character(health$bmi))
## Warning: NAs introduced by coercion
#Double checks if BMI data type converted to numeric.
str(health)
## 'data.frame':    5109 obs. of  11 variables:
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

Duplicate values are checked in the data set. The only duplicate found is BMI, though this is not uncommon since BMI can have common values.

colSums(is.na(health))
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0
duplicates <- health%>%duplicated()

duplicates_count <- duplicates%>%table()
duplicates_count
## .
## FALSE 
##  5109

Since the data is pretty clean, it is time to explore distributions and any noticeable patterns.

Exploratory Data Analysis

First, the appropriate variables are factored since they are categorical. The factored variables are saved in a new data frame, which will be used going forward.

health_data <- health
health_data$stroke <- factor(health$stroke, levels = c(0,1))
health_data$gender <- factor(health_data$gender)
health_data$hypertension <- factor(health_data$hypertension)
health_data$heart_disease <- factor(health_data$heart_disease)
health_data$ever_married <- factor(health_data$ever_married)
health_data$work_type <- factor(health_data$work_type)
health_data$Residence_type <- factor(health_data$Residence_type)
health_data$smoking_status <- factor(health_data$smoking_status)
attach(health_data)

A couple bar plots are created to break down gender distribution, including the break down between the two genders and the break down of stroke variable by gender (if a subject has had a stroke or not).

barplot(table(health$gender), col = c('Cyan3', 'goldenrod'))

table.stroke_gender <- table(health$stroke, health$gender)

barplot(table.stroke_gender, col = c("sienna1", "royalblue"), beside = T,
        names.arg = c("Female", "Male"))
legend("topright", legend = c("No Stroke", "Stroke"), fill = c("sienna1", "royalblue"))

A couple bar plots are created to compare the two continuous variables between the two stroke groups. There seems to be a difference in the glucose levels between the stroke groups based on the IQR and mean values, while there does not appear to be a noticeable difference in the BMI values between the two groups. This can also be hard to discern due to the large imbalance between the groups.

par(mfrow = c(1,2))
plot(stroke, avg_glucose_level, col = "blue", varwidth = T, ylab = "Glucose Level", xlab = "Stroke")
plot(stroke, bmi, col = "blue", varwidth = T, ylab = "BMI", xlab = "Stroke")

Finally, a couple bar charts are completed to break down the stroke variable between the work groups and heart disease. From viewing the stacked bar charts, there does not appear to be a discernable pattern or trend related to strokes and work type or heart disease.

ggplot(data = health_data, aes(x = work_type, fill = stroke))+geom_bar()

ggplot(data = health_data, aes(x = heart_disease, fill = stroke))+geom_bar()

Now that there is a brief understanding of the distribution of the variables and their relationships, the prediction models can be created.

Prediction models

Before creating prediction models, training and testing data sets are created. A training data set is a subset of examples used to train the model, while the testing data set is a subset used to test the training model.

set.seed(123)
#New sample created for the training and testing data sets. The data is split with 75% in training and 25% in testing.
sample <- sample(c(TRUE, FALSE), nrow(health_data), replace = TRUE, prob = c(0.75, 0.25))
train_set <- health_data[sample, ]
test_set <- health_data[-sample, ]
#Imputates NA values in bmi column with the average bmi value
mean_val <- mean(train_set$bmi, na.rm = TRUE)
train_set$bmi[is.na(train_set$bmi)] <- mean_val
test_set$bmi[is.na(test_set$bmi)] <- mean_val

From the bar plots, it is clear there is an imbalance between those who had a stroke and those who did not in the data. This could cause issues in creating a prediction model, which would most likely skew towards predicting much more “No” answers since there are significantly more within the sampled data. To solve this issue, oversampling and undersampling the training set data can be done. Oversampling duplicates random samples from the minority class, while undersampling randomly reduces samples from the majority class. Doing both helps to “even out” the bias and possibly improve the model’s overall performance, which was done in this case do to the incredibly high imbalance in the data.

health_balance <- ovun.sample(stroke ~., data = train_set, N = nrow(train_set),
                              method = "both", seed = 123)$data

table(health_balance$stroke)
## 
##    0    1 
## 1938 1914

The table shows the training data set is now balanced, which will help with the prediction models.

Logistic Regression Model

The first prediction model created was a logistic regression model since stroke is a binary categorical variable (either 0 or 1). The first logistic regression model was run with all variables to determine their significance to the model. The second logistic regression model keeps age, hypertension, work type, average glucose level, and BMI as variables, which are all statistically significant to stroke prediction. After running the predictive probabilities on the testing set and viewing the confusion matrix, the logistic regression model did relatively well, with an overall accuracy of 73.3% and an F1 score of 83.8%. The recall score (false negatives) did an excellent job, while the precision (false positives) did a little worse. This is fine in this scenario since its better to have a false positive for a stroke prediction than vice versa.

#Logistic Regression model
fit_glm <- glm(stroke ~ ., data = health_balance, family = binomial())
summary(fit_glm)
## 
## Call:
## glm(formula = stroke ~ ., family = binomial(), data = health_balance)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -3.828e+00  3.188e-01 -12.008  < 2e-16 ***
## genderMale                 -1.544e-01  8.372e-02  -1.844 0.065198 .  
## age                         7.885e-02  3.319e-03  23.760  < 2e-16 ***
## hypertension1               4.136e-01  1.064e-01   3.887 0.000101 ***
## heart_disease1              1.832e-01  1.375e-01   1.332 0.182834    
## ever_marriedYes             5.149e-04  1.312e-01   0.004 0.996868    
## work_typeGovt_job          -1.609e+00  3.370e-01  -4.773 1.81e-06 ***
## work_typeNever_worked      -1.261e+01  2.648e+02  -0.048 0.962029    
## work_typePrivate           -1.612e+00  3.271e-01  -4.930 8.24e-07 ***
## work_typeSelf-employed     -1.782e+00  3.500e-01  -5.091 3.56e-07 ***
## Residence_typeUrban         2.340e-01  8.153e-02   2.871 0.004097 ** 
## avg_glucose_level           3.411e-03  7.847e-04   4.347 1.38e-05 ***
## bmi                         2.006e-02  6.837e-03   2.933 0.003352 ** 
## smoking_statusnever smoked -4.641e-01  1.067e-01  -4.348 1.37e-05 ***
## smoking_statussmokes       -1.655e-01  1.300e-01  -1.273 0.203001    
## smoking_statusUnknown      -1.997e-01  1.250e-01  -1.597 0.110163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5339.9  on 3851  degrees of freedom
## Residual deviance: 3720.3  on 3836  degrees of freedom
## AIC: 3752.3
## 
## Number of Fisher Scoring iterations: 13
#2nd logistic regression model
fit_glm2 <- glm(stroke ~ age+hypertension+work_type+avg_glucose_level+bmi,
                data = health_balance, family = binomial())
summary(fit_glm2)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + work_type + avg_glucose_level + 
##     bmi, family = binomial(), data = health_balance)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.093e+00  2.889e-01 -14.167  < 2e-16 ***
## age                     7.941e-02  3.158e-03  25.144  < 2e-16 ***
## hypertension1           4.111e-01  1.051e-01   3.912 9.15e-05 ***
## work_typeGovt_job      -1.610e+00  3.186e-01  -5.053 4.36e-07 ***
## work_typeNever_worked  -1.276e+01  2.650e+02  -0.048  0.96160    
## work_typePrivate       -1.622e+00  3.066e-01  -5.291 1.22e-07 ***
## work_typeSelf-employed -1.814e+00  3.289e-01  -5.516 3.46e-08 ***
## avg_glucose_level       3.569e-03  7.665e-04   4.656 3.23e-06 ***
## bmi                     2.228e-02  6.791e-03   3.280  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5339.9  on 3851  degrees of freedom
## Residual deviance: 3753.7  on 3843  degrees of freedom
## AIC: 3771.7
## 
## Number of Fisher Scoring iterations: 13
stepAIC(fit_glm2)
## Start:  AIC=3771.73
## stroke ~ age + hypertension + work_type + avg_glucose_level + 
##     bmi
## 
##                     Df Deviance    AIC
## <none>                   3753.7 3771.7
## - bmi                1   3764.5 3780.5
## - hypertension       1   3769.4 3785.4
## - work_type          4   3780.4 3790.4
## - avg_glucose_level  1   3775.8 3791.8
## - age                1   4687.4 4703.4
## 
## Call:  glm(formula = stroke ~ age + hypertension + work_type + avg_glucose_level + 
##     bmi, family = binomial(), data = health_balance)
## 
## Coefficients:
##            (Intercept)                     age           hypertension1  
##              -4.092553                0.079411                0.411071  
##      work_typeGovt_job   work_typeNever_worked        work_typePrivate  
##              -1.609639              -12.757574               -1.622120  
## work_typeSelf-employed       avg_glucose_level                     bmi  
##              -1.814498                0.003569                0.022276  
## 
## Degrees of Freedom: 3851 Total (i.e. Null);  3843 Residual
## Null Deviance:       5340 
## Residual Deviance: 3754  AIC: 3772
#VIF values - all very low
vif(fit_glm2)
##                       GVIF Df GVIF^(1/(2*Df))
## age               1.607159  1        1.267738
## hypertension      1.045751  1        1.022620
## work_type         1.665629  4        1.065853
## avg_glucose_level 1.109784  1        1.053463
## bmi               1.168355  1        1.080905
#Predictive probability on testing set
pred_probs <- predict.glm(fit_glm2, newdata = test_set, type = "response")
#Predicts if the patient will have a stroke, placing them in the "No" category if the prediction is less than 0.5 - otherwise, in the "Yes" category. 
pred_log <- ifelse(pred_probs<0.5, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table <- table(test_set$stroke, pred_log)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke <- confusionMatrix(confusion_table, mode = "everything")
cm_log_stroke
## Confusion Matrix and Statistics
## 
##    pred_log
##        0    1
##   0 3543 1317
##   1   49  199
##                                           
##                Accuracy : 0.7326          
##                  95% CI : (0.7202, 0.7447)
##     No Information Rate : 0.7032          
##     P-Value [Acc > NIR] : 1.893e-06       
##                                           
##                   Kappa : 0.1551          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9864          
##             Specificity : 0.1313          
##          Pos Pred Value : 0.7290          
##          Neg Pred Value : 0.8024          
##               Precision : 0.7290          
##                  Recall : 0.9864          
##                      F1 : 0.8384          
##              Prevalence : 0.7032          
##          Detection Rate : 0.6936          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5588          
##                                           
##        'Positive' Class : 0               
## 
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision = cm_log_stroke$byClass['Precision']
log_recall = cm_log_stroke$byClass['Recall']
log_f1_score = cm_log_stroke$byClass['F1']
log_stroke_score = cm_log_stroke$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: 0.733"))
## [1] "Accuracy: 0.733"
print(paste("Total Precision: ", round(log_precision,3)))
## [1] "Total Precision:  0.729"
print(paste("Recall: ", round(log_recall,3)))
## [1] "Recall:  0.986"
print(paste("F1 Score: ", round(log_f1_score,3)))
## [1] "F1 Score:  0.838"
print(paste("Stroke Pred Score: ", round(log_stroke_score,3)))
## [1] "Stroke Pred Score:  0.802"

An ROC curve is created to determine the model’s performance as well as determine a more ideal cutoff point. Adjusting the cutoff point has a trade off - while it can improve precision, it can also affect the recall of the model. The ROC curve has an AUC (area under the curve) value of 0.843, which means the model has very good performance.

The optimal cutoffs calculated gave different values depending on the goal of the model. For this session, I chose the threshold value of 0.688 in hopes of improving precision and recall. Re-running the logistic regression model with the new value, the accuracy, precision, recall, and F1 scores all improved. However, this should be done with caution. When looking at the confusion matrix, the new cutoff point did a worse job predicting a stroke, even though it did much better predicting a patient not having a stroke. As such, 0.688 may not be the best alternative cutoff point. In comparison, using the threshold optimal cutoff of 0.447 yields a lower value for accuracy, precision, and F1 score, but improves the probability of correctly predicting a stroke. In the end, the choice of cutoff point is up to the user.

Since I will show the overall performance of all three prediction model types, I will use the original cutoff point of 0.50 to demonstrate the performance of the logistic regression model.

# Predicted probabilities for the positive class (stroke = 1)
pred_probs_roc <- predict(fit_glm2, newdata = test_set, type = "response")

# Create ROC object
roc_obj <- roc(test_set$stroke, pred_probs_roc)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_obj,
     main = "ROC Curve for Stroke Prediction (fit_glm2)",
     col = "steelblue", lwd = 2)

#Plot ROC curve with AUC value
plot(roc_obj, col = "blue", print.auc = TRUE, main = "ROC Curve for Capsule Prediction")

opt <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity", "fpr", "tpr"))
print(opt)
##           threshold sensitivity specificity       fpr       tpr
## threshold  0.446711   0.8629032   0.6884774 0.3115226 0.8629032
#Optimal cutoff points:
#threshold: 0.447 sensitivity: 0.863 specificity: 0.688

pred_log_roc <- ifelse(pred_probs<0.688, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table_roc <- table(test_set$stroke, pred_log_roc)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke_roc <- confusionMatrix(confusion_table_roc, mode = "everything")
cm_log_stroke_roc
## Confusion Matrix and Statistics
## 
##    pred_log_roc
##        0    1
##   0 4132  728
##   1   98  150
##                                           
##                Accuracy : 0.8383          
##                  95% CI : (0.8279, 0.8483)
##     No Information Rate : 0.8281          
##     P-Value [Acc > NIR] : 0.02737         
##                                           
##                   Kappa : 0.2063          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9768          
##             Specificity : 0.1708          
##          Pos Pred Value : 0.8502          
##          Neg Pred Value : 0.6048          
##               Precision : 0.8502          
##                  Recall : 0.9768          
##                      F1 : 0.9091          
##              Prevalence : 0.8281          
##          Detection Rate : 0.8089          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5738          
##                                           
##        'Positive' Class : 0               
## 
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision_roc = cm_log_stroke_roc$byClass['Precision']
log_recall_roc = cm_log_stroke_roc$byClass['Recall']
log_f1_score_roc = cm_log_stroke_roc$byClass['F1']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: ", '0.842'))
## [1] "Accuracy:  0.842"
print(paste("Total Precision: ", round(log_precision_roc,3)))
## [1] "Total Precision:  0.85"
print(paste("Recall: ", round(log_recall_roc,3)))
## [1] "Recall:  0.977"
print(paste("F1 Score: ", round(log_f1_score_roc,3)))
## [1] "F1 Score:  0.909"
pred_log_roc2 <- ifelse(pred_probs<0.447, 0, 1)
#Creates confusion table displaying where each patient was placed and if they were placed in the right group
confusion_table_roc2 <- table(test_set$stroke, pred_log_roc2)
#Displays confusion matrix and the statistics associated with the confusion matrix
cm_log_stroke_roc2 <- confusionMatrix(confusion_table_roc2, mode = "everything")
cm_log_stroke_roc2
## Confusion Matrix and Statistics
## 
##    pred_log_roc2
##        0    1
##   0 3347 1513
##   1   35  213
##                                           
##                Accuracy : 0.6969          
##                  95% CI : (0.6841, 0.7095)
##     No Information Rate : 0.6621          
##     P-Value [Acc > NIR] : 5.902e-08       
##                                           
##                   Kappa : 0.143           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9897          
##             Specificity : 0.1234          
##          Pos Pred Value : 0.6887          
##          Neg Pred Value : 0.8589          
##               Precision : 0.6887          
##                  Recall : 0.9897          
##                      F1 : 0.8122          
##              Prevalence : 0.6621          
##          Detection Rate : 0.6552          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5565          
##                                           
##        'Positive' Class : 0               
## 
#Calculates the accuracy, precision, and recall in the logistic regression model.
log_precision_roc2 = cm_log_stroke_roc2$byClass['Precision']
log_recall_roc2 = cm_log_stroke_roc2$byClass['Recall']
log_f1_score_roc2 = cm_log_stroke_roc2$byClass['F1']
log_stroke_score_roc2 = cm_log_stroke_roc2$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the logistic regression model.
print(paste("Accuracy: ", '0.707'))
## [1] "Accuracy:  0.707"
print(paste("Total Precision: ", round(log_precision_roc2,3)))
## [1] "Total Precision:  0.689"
print(paste("Recall: ", round(log_recall_roc2,3)))
## [1] "Recall:  0.99"
print(paste("F1 Score: ", round(log_f1_score_roc2,3)))
## [1] "F1 Score:  0.812"
print(paste("Stroke Pred Score: ", round(log_stroke_score_roc2,3)))
## [1] "Stroke Pred Score:  0.859"

Random Forest

Another prediction model that can be used is random forest modeling. Random forest is a classifying method consisting of many decision trees. By creating a “forest” of decision trees, the classifying model hopes to select it’s best model by running many different decision trees and “takes the majority” to determine classification. To do so, random forest uses out-of-bag sampling.

A random forest is created to determine the probability of stroke. The advantage of random forest is the Variable Importance Plot, showing the overall importance of each variable to the prediction model. In this case, age, average glucose level, and BMI scores seem to be the most important indicators. Unlike the logistic regression model, the random forest model does not see the significance of work type or hypertension when predicting stroke.

From the confusion matrix, the random forest model was slightly worse in accuracy, precision, and F1 score compared to the logistic regression model. When comparing the random forest confusion matrix to the original logistic regression confusion matrix, the random forest model did much better predicting a stroke (92.74% vs. 80.24%, respectively). From this perspective, a random forest model may be better in predicting a stroke compared to the logistic regression model, though it does come with much more false positives (~29% false positive rate).

set.seed(123)
#Random Forest for variables. mtry = 3 since there are 11 variables (square root of 12 is close to 3).
fit_rf <- randomForest(factor(stroke) ~., mtry = 3, ntree = 500, nodesize = 3,
                       maxnodes = 50, data = health_balance)
#Sorts predictions into their respective class (0 or 1) depending on their value.
predict_rf <- predict(fit_rf, test_set, type = "response")
#Variable Importance Plot
varImpPlot(fit_rf)

#Creates and displays the confusion matrix table based on the actual and predicted values.
confusion_table_rf <- table(test_set$stroke, predict_rf)
#Displays confusion matrix and the statistics associated with the confusion matrix.
cm_rf_stroke <- confusionMatrix(confusion_table_rf, mode = "everything")
cm_rf_stroke
## Confusion Matrix and Statistics
## 
##    predict_rf
##        0    1
##   0 3481 1379
##   1   18  230
##                                           
##                Accuracy : 0.7265          
##                  95% CI : (0.7141, 0.7387)
##     No Information Rate : 0.685           
##     P-Value [Acc > NIR] : 5.557e-11       
##                                           
##                   Kappa : 0.1786          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9949          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.7163          
##          Neg Pred Value : 0.9274          
##               Precision : 0.7163          
##                  Recall : 0.9949          
##                      F1 : 0.8329          
##              Prevalence : 0.6850          
##          Detection Rate : 0.6815          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5689          
##                                           
##        'Positive' Class : 0               
## 
#Calculates the accuracy, precision, and recall in the random forest model.
rf_precision = cm_rf_stroke$byClass['Precision']
rf_recall = cm_rf_stroke$byClass['Recall']
rf_f1_score = cm_rf_stroke$byClass['F1']
rf_stroke_score = cm_rf_stroke$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the random forest model.
print(paste("Accuracy: ", '0.727'))
## [1] "Accuracy:  0.727"
print(paste("Precision: ", round(rf_precision,3)))
## [1] "Precision:  0.716"
print(paste("Recall: ", round(rf_recall,3)))
## [1] "Recall:  0.995"
print(paste("F1 Score: ", round(rf_f1_score,3)))
## [1] "F1 Score:  0.833"
print(paste("Stroke Pred Score: ", round(rf_stroke_score,3)))
## [1] "Stroke Pred Score:  0.927"

The random forest model created above was in large part due to checking the model fit. This is done by viewing the ROC curve for the random forest model, as well as the out-of-bag probabilities for the testing and training sets. From the ROC, the area under the curve is 0.822, which indicates good model performance. The training AUC is 0.941 and the test AUC is 0.919, which means the model is not overfitting and has a good generalization.

#Creates the ROC for the random forest model.
roc_rf <- roc(test_set$stroke, as.ordered(predict_rf))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#Plots the ROC curve for the random forest model.
plot(roc_rf, print.auc = TRUE)

#Gives area under the curve. The closer to 1, the more accurate.
auc(roc_rf)
## Area under the curve: 0.8218
#Gives optimal cutoff points for threshold, sensitivity, and specificity.
coords(roc_rf, "best", input = "sensitivity", 
       ret = c("threshold")) #threshold = 0.5 (standard)
##   threshold
## 1       0.5
# OOB ROC (approximation using predicted OOB probabilities)
roc_train <- roc(health_balance$stroke, as.numeric(predict(fit_rf, type="vote")[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_train <- auc(roc_train)

# Test ROC (what you already have)
roc_test <- roc(test_set$stroke, as.numeric(predict(fit_rf, newdata=test_set, type="vote")[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_test <- auc(roc_test)

print(paste("Training (OOB) AUC:", round(auc_train, 3)))
## [1] "Training (OOB) AUC: 0.941"
print(paste("Test AUC:", round(auc_test, 3)))
## [1] "Test AUC: 0.919"

Support Vector Machines

The next prediction method used is Support Vector Machines (SVMs). SVMs classify a variable using hyperplanes to best divide data into their respective classes. They can be useful with imbalanced data and works well with high-dimensional data.

Prior to creating the prediction models, the train and test sets are scaled (SVMs are more sensitive to scaling the data). Once complete, an initial prediction model is run with a ROC/AUC curve and initial predictive threshold of 0.50. From the ROC curve, it is determined to change the prediction threshold to 0.0588 for better results. The prediction model is re-run with this threshold, leading to a more accurate confusion matrix and table.

From the confusion table, the SVM prediction model had a lower accuracy, precision, recall, and F1 score compared to the prior two prediction models. However, the confusion matrix shows the SVM model does better predicting stroke than the logistic regression model (83.06% vs. 80.24%, respectively). Despite this, the random forest model still fared better predicting a stroke compared to all models.

set.seed(123)
#Create a copy of testing and training sets for SVM
svm_train_balance <- train_set
svm_test_balance <- test_set

num_cols <- c("age", "avg_glucose_level", "bmi")

#Impute numeric NAs on train, then apply same to test
for (col in num_cols) {
  med <- median(svm_train_balance[[col]], na.rm = TRUE)
  svm_train_balance[[col]][is.na(svm_train_balance[[col]])] <- med
  svm_test_balance[[col]][is.na(svm_test_balance[[col]])]   <- med
}

#Compute scaling params on train only
mu  <- sapply(svm_train_balance[, num_cols, drop = FALSE], mean, na.rm = TRUE)
sdv <- sapply(svm_train_balance[, num_cols, drop = FALSE], sd,   na.rm = TRUE)
sdv[sdv == 0] <- 1  # guard against zero-variance

#Apply scaling using train stats to both sets
svm_train_balance[, num_cols] <- scale(svm_train_balance[, num_cols], center = mu, scale = sdv)
svm_test_balance[,  num_cols] <- scale(svm_test_balance[,  num_cols], center = mu, scale = sdv)
#SVM prediction model
svm_model <- svm(stroke~., kernel = "radial", cost = 1,
                 probability = TRUE, data =svm_train_balance)

pred_svm <- predict(svm_model, svm_test_balance, probability = TRUE)
probs <- attr(pred_svm, "probabilities")[, "1"]
#ROC/AUC curve
roc_svm <- roc(svm_test_balance$stroke, probs, levels = c("0","1"), positive = "1")
## Setting direction: controls < cases
plot(roc_svm, print.auc = TRUE, col = "blue")

auc(roc_svm)
## Area under the curve: 0.8491
best <- coords(roc_svm, "best", ret = c("threshold","sensitivity","specificity"))
best #0.0588
##    threshold sensitivity specificity
## 1 0.05920642   0.8306452    0.687037
#Re-do SVM prediction model based on new threshold
pred_svm_class <- ifelse(probs > 0.0588, "1", "0")
#Creates and displays the confusion matrix table based on the actual and predicted values.
confusion_table_svm2 <- table(svm_test_balance$stroke, pred_svm_class)
#Displays confusion matrix and the statistics associated with the confusion matrix.
cm_svm_stroke2 <- confusionMatrix(confusion_table_svm2, mode = "everything")
cm_svm_stroke2
## Confusion Matrix and Statistics
## 
##    pred_svm_class
##        0    1
##   0 3256 1604
##   1   41  207
##                                           
##                Accuracy : 0.678           
##                  95% CI : (0.6649, 0.6908)
##     No Information Rate : 0.6455          
##     P-Value [Acc > NIR] : 5.443e-07       
##                                           
##                   Kappa : 0.1265          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9876          
##             Specificity : 0.1143          
##          Pos Pred Value : 0.6700          
##          Neg Pred Value : 0.8347          
##               Precision : 0.6700          
##                  Recall : 0.9876          
##                      F1 : 0.7983          
##              Prevalence : 0.6455          
##          Detection Rate : 0.6374          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5509          
##                                           
##        'Positive' Class : 0               
## 
#Calculates the accuracy, precision, and recall in the random forest model.
svm_precision = cm_svm_stroke2$byClass['Precision']
svm_recall = cm_svm_stroke2$byClass['Recall']
svm_f1_score = cm_svm_stroke2$byClass['F1']
svm_stroke_score = cm_svm_stroke2$byClass['Neg Pred Value']
#Prints the accuracy, precision, and recall values for the random forest model.
print(paste("Accuracy: ", "0.689"))
## [1] "Accuracy:  0.689"
print(paste("Total Precision: ", round(svm_precision,3)))
## [1] "Total Precision:  0.67"
print(paste("Recall: ", round(svm_recall,3)))
## [1] "Recall:  0.988"
print(paste("F1 Score: ", round(svm_f1_score,3)))
## [1] "F1 Score:  0.798"
print(paste("Stroke Pred Value: ", round(svm_stroke_score,3)))
## [1] "Stroke Pred Value:  0.835"

Summary Table

After creating all three prediction models, the table below shows the best performance for each model:

set.caption("Performance for Stroke Prediction Models")
data.table = rbind(c('0.733', round(log_precision,3), round(log_recall,3), round(log_f1_score,3),round(log_stroke_score,3)), c('0.727', round(rf_precision,3), round(rf_recall,3), round(rf_f1_score,3), round(rf_stroke_score,3)), c('0.689', round(svm_precision,3), round(svm_recall,3), round(svm_f1_score,3), round(svm_stroke_score,3)))
colnames(data.table) = c("Accuracy", "Precision", "Recall", "F1 Score", "Stroke Pred Value")
rownames(data.table) = c("Logistic Regression", "Random Forest", "SVM")

pander(data.table)
Performance for Stroke Prediction Models (continued below)
  Accuracy Precision Recall F1 Score
Logistic Regression 0.733 0.729 0.986 0.838
Random Forest 0.727 0.716 0.995 0.833
SVM 0.689 0.67 0.988 0.798
  Stroke Pred Value
Logistic Regression 0.802
Random Forest 0.927
SVM 0.835

From the table, the logistic regression model had a higher accuracy, precision, and F1 score while the Random forest model performed the best with recall and the stroke prediction value. SVM was the worst prediction model of the three, though it fared well in recall and the stroke prediction value.

Findings and Conclusions

This session demonstrated how to use logistic regression, random forest, and support vector machines to create a prediction model for stroke. These three predictive models work well with a binary classification outcome, as they classify an outcome based on probabilities. For this session, random forest seemed to be the best model for this data set classification. Additional variables or different model setups may change these results and are user-dependent.

This session demonstrated only three potential prediction modeling, though there are several more methods that can be done. Other popular methods include neural networks, decision trees, k-nearest neighbors, and naive bayes. All prediction models are only as effective as the understanding by the user of the model, as it is important to know model assumptions and diagnostics.

Thank you for viewing my project.

END