Introduction

Stroke, according to the World Health Organization (WHO), is the second leading cause of death worldwide, accounting for 11% (6 million people) of total deaths. This statistic underscores the need for effective predictive models that assess the likelihood of stroke in patients. The American Stroke Association indicates that most strokes can be prevented through better awareness and lifestyle changes: being more active, eating healthy foods, keeping blood pressure in check, getting enough sleep, and avoiding smoking and vaping. Mastrigt and Heugten point out that, according to predictions from the American Heart Association, nearly 4% of adults in the United States will have experienced a stroke by 2030. They also calculated that total healthcare expenses for strokes amounted to USD 30.8 billion each year during 2016 and 2017. In response to these facts, a dataset has been curated that offers a foundation for predictive analytics in stroke risk assessment. This dataset consists of patient data for 5110 respondents.

This dataset, compiled and made available for research purposes, comprises a diverse array of parameters with varying levels of importance in predicting the onset of stroke in individuals. The features for the dataset are listed below:

Ultimately, the overarching goal of using this dataset was to leverage machine learning and predictive analytics to identify individuals at heightened risk of stroke. This collection of information about people’s age, health, and habits is akin to a puzzle: by putting the different pieces together, one can better understand who might be at risk of having a stroke. The unprocessed and preprocessed data were fit to a decision tree model, 4 different support vector machine models, a random forest model, and a neural network model. Among the 4 types of models, the one that most accurately determined whether a person was at high risk of having a stroke was selected. Given the importance of early stroke detection and prevention in healthcare, the objective was to evaluate and compare different models to determine which one offers the highest accuracy and reliability. High accuracy is crucial in applications where the cost of misclassification (false positives or false negatives) is high, such as medical diagnosis. The analysis matters for enhancing patient outcomes through early intervention and for optimizing the use of healthcare resources. By identifying the best predictive model, healthcare providers can better allocate their efforts toward individuals at greatest risk of stroke, ultimately improving clinical decision-making and patient care.

Importing Data

library(dplyr)  # supplies the %>% pipe and the select/mutate verbs used in this report
stroke_data <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE) %>% subset(select = -id)

Exploratory Data Analysis

A summary of the stroke dataset is provided below:

summary(stroke_data)
##     gender               age         hypertension     heart_disease    
##  Length:5110        Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
##  Class :character   1st Qu.:25.00   1st Qu.:0.00000   1st Qu.:0.00000  
##  Mode  :character   Median :45.00   Median :0.00000   Median :0.00000  
##                     Mean   :43.23   Mean   :0.09746   Mean   :0.05401  
##                     3rd Qu.:61.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##                     Max.   :82.00   Max.   :1.00000   Max.   :1.00000  
##  ever_married        work_type         Residence_type     avg_glucose_level
##  Length:5110        Length:5110        Length:5110        Min.   : 55.12   
##  Class :character   Class :character   Class :character   1st Qu.: 77.25   
##  Mode  :character   Mode  :character   Mode  :character   Median : 91.89   
##                                                           Mean   :106.15   
##                                                           3rd Qu.:114.09   
##                                                           Max.   :271.74   
##      bmi            smoking_status         stroke       
##  Length:5110        Length:5110        Min.   :0.00000  
##  Class :character   Class :character   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Median :0.00000  
##                                        Mean   :0.04873  
##                                        3rd Qu.:0.00000  
##                                        Max.   :1.00000

The id variable was omitted from the analysis because it is only a unique identifier for each observation. The categorical variables were recoded as factors for readability, and bmi was converted to a numeric variable. After recoding, the summary below revealed 201 missing values for the bmi variable.

stroke_data <- stroke_data %>%
  mutate(
    gender = as.factor(gender),
    hypertension = as.factor(hypertension),
    heart_disease = as.factor(heart_disease),
    ever_married = as.factor(ever_married),
    work_type = as.factor(work_type),
    Residence_type = as.factor(Residence_type),
    smoking_status = as.factor(smoking_status),
    stroke = as.factor(stroke),
    bmi = as.numeric(bmi)
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `bmi = as.numeric(bmi)`.
## Caused by warning:
## ! NAs introduced by coercion
stroke_data_unprocessed <- stroke_data

summary(stroke_data)
##     gender          age        hypertension heart_disease ever_married
##  Female:2994   Min.   : 0.08   0:4612       0:4834        No :1757    
##  Male  :2115   1st Qu.:25.00   1: 498       1: 276        Yes:3353    
##  Other :   1   Median :45.00                                          
##                Mean   :43.23                                          
##                3rd Qu.:61.00                                          
##                Max.   :82.00                                          
##                                                                       
##          work_type    Residence_type avg_glucose_level      bmi       
##  children     : 687   Rural:2514     Min.   : 55.12    Min.   :10.30  
##  Govt_job     : 657   Urban:2596     1st Qu.: 77.25    1st Qu.:23.50  
##  Never_worked :  22                  Median : 91.89    Median :28.10  
##  Private      :2925                  Mean   :106.15    Mean   :28.89  
##  Self-employed: 819                  3rd Qu.:114.09    3rd Qu.:33.10  
##                                      Max.   :271.74    Max.   :97.60  
##                                                        NA's   :201    
##          smoking_status stroke  
##  formerly smoked: 885   0:4861  
##  never smoked   :1892   1: 249  
##  smokes         : 789           
##  Unknown        :1544           
##                                 
##                                 
## 

Figure 1: Density plots for age, avg_glucose_level, and bmi.

The age variable exhibits a roughly normal distribution; however, near the 80 age bracket, there is a spike in observation count. The avg_glucose_level variable exhibits bimodality, which is also reflected in the summary statistics: the minimum and mean are 55 and 106, respectively, but the maximum is 272. The bmi variable exhibits right skewness, which is also reflected in the summary statistics: the minimum and mean are 10 and 29, respectively, while the maximum is 98.

Boxplot

Figure 2: Boxplots for the stroke dataset.

The boxplots in Figure 2 support the expected effects of several variables. Based on the age boxplot, older patients were more likely to have had a stroke. On average, patients in the dataset with a higher avg_glucose_level were more likely to have had a stroke. The boxplots also reveal that patients with a higher bmi were more likely to have had a stroke.

Examining Feature Multicollinearity for Continuous and Categorical Variables

Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. By using a correlation plot, we can visualize the relationships between certain features. The correlation plot is only able to determine the correlation for continuous variables.

corrplot(stroke_correlations$correlations, 
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 1,
         tl.cex = 1)

Figure 3: Multicollinearity plot for continuous predictor variables.

Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The article goes on to explain that correlations between 0.5 and 0.7 indicate a “moderate” correlation, with anything above 0.7 indicating a “strong” correlation. The correlation with the largest magnitude is 0.77, between the age and ever_married variables, which indicates a strong correlation between these 2 variables. It is followed by 0.65 for the age and heart_disease variables, a moderate correlation.
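The thresholds quoted from Calkins can be turned into a small lookup. The sketch below is illustrative only: the function name `classify_correlation`, the label for magnitudes under 0.3, and the handling of values that fall exactly on a boundary are assumptions, not part of the original analysis.

```r
# Label a correlation by its magnitude using the thresholds quoted above.
# Assumptions: the "below low" label for |r| < 0.3, and boundary values
# (0.3, 0.5, 0.7) being assigned to the higher band.
classify_correlation <- function(r) {
  cut(
    abs(r),
    breaks = c(0, 0.3, 0.5, 0.7, 1),
    labels = c("below low", "low", "moderate", "strong"),
    include.lowest = TRUE,
    right = FALSE
  )
}

# The two largest correlations reported for Figure 3.
correlation_labels <- classify_correlation(c(0.77, 0.65))
print(as.character(correlation_labels))
```

Applied to the two values reported for Figure 3, the helper labels 0.77 as strong and 0.65 as moderate.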

Class Imbalance

prop.table(table(select(stroke_data, stroke)))
## stroke
##          0          1 
## 0.95127202 0.04872798

The output above shows that the stroke dataset is imbalanced: 95.13% of the patients who responded to the study had never experienced a stroke, while 4.87% had. Class imbalance can affect the performance of predictive models. Imbalanced datasets can lead to biased models that favor the majority class and perform poorly on the minority class, which is why the training dataset was resampled using SMOTE.
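To see why imbalance makes accuracy alone misleading, consider a hypothetical classifier that always predicts "no stroke". The short calculation below uses only the class counts from the summary output above; the variable names are illustrative.

```r
# Class counts taken from the summary output above.
n_no_stroke <- 4861
n_stroke <- 249
n_total <- n_no_stroke + n_stroke

# A hypothetical "model" that always predicts the majority class (no stroke).
baseline_accuracy <- n_no_stroke / n_total
baseline_sensitivity <- 0 / n_stroke   # no true stroke case is ever flagged

round(c(accuracy = baseline_accuracy, sensitivity = baseline_sensitivity), 4)
```

Such a baseline reaches roughly 95% accuracy while detecting no stroke cases at all, which is why sensitivity and related metrics deserve attention alongside accuracy.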

Data Preprocessing

Dealing with Multicollinearity

Age is a well-known risk factor for stroke, and it is widely acknowledged in the medical literature. These are a few of the many web articles that indicate this:

Marital status, potentially correlated with age, may not have a direct causal relationship with stroke risk. Therefore, the ever_married variable could have been taken out of the dataset in favor of the age variable. In addition, heart disease is also known to increase the risk of stroke but may not have as direct and universally acknowledged a relationship with stroke as age. Moderate collinearity suggests that there is some relationship between age and heart disease, but it may not be strong enough to significantly impact the stability or interpretability of the model. That said, decision trees do not require or assume a specific relationship between the independent variables, unlike linear regression models. Consequently, decision trees can produce accurate predictions even when there is a high level of correlation among some variables. Neural networks also generally do not suffer from multicollinearity because they are often overparameterized. The additional weights learned during training introduce redundancies, making issues that affect a small subset of features, such as multicollinearity, less significant. Therefore, it was decided to retain all of the variables within the dataset.

Dealing with Missing Values

In general, imputation by the mean or median is acceptable if the missing values account for only about 5% of the sample (Peng et al., 2006). However, should the share of missing values exceed 20%, these simple imputation approaches will result in an artificial reduction in variability, because values are being imputed at the center of the variable’s distribution.
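The variance-shrinking effect of mean imputation can be seen in a small simulation. The sketch below uses simulated right-skewed values as a stand-in for bmi (it does not use the study data); only the counts 5110 and 201 come from this report, and the distribution parameters are arbitrary assumptions.

```r
set.seed(1845)

# Simulated right-skewed variable standing in for bmi (not the study data).
x <- rlnorm(5110, meanlog = 3.3, sdlog = 0.25)

# Remove 201 values at random, matching the count of missing bmi values.
x_missing <- x
x_missing[sample(length(x), 201)] <- NA
share_missing <- mean(is.na(x_missing))

# Mean imputation: every gap receives the same central value.
x_mean_imputed <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

sd_observed <- sd(x_missing, na.rm = TRUE)
sd_mean_imputed <- sd(x_mean_imputed)

round(c(share_missing = share_missing,
        sd_observed = sd_observed,
        sd_mean_imputed = sd_mean_imputed), 4)
```

With 201 of 5110 values missing (about 3.9%), the shrinkage in the standard deviation is small, but it is always in the same direction because every imputed value adds nothing to the spread.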

It was decided to employ another technique to handle the missing values: Multiple Regression Imputation using the MICE package.

The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing datapoint. Many imputation methods can be used within the package. The one that was selected for the data being analyzed in this report is PMM (Predictive Mean Matching), which is used for quantitative data.

Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the variable in the observation with the missing value. The advantage is that every imputed value already exists in the observed data, so no implausible values (such as a negative bmi) will be used to impute missing data. In addition, it avoids understating the error by using multiple regression models. The variability between the different imputed values gives a wider, but more correct, standard error. Uncertainty is inherent in imputation, which is why having multiple imputed values is important. Marshall et al. (2010) also point out that:

“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
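The donor-matching idea behind PMM can be sketched in base R. This is a deliberately simplified, single-variable illustration on toy data, not the MICE implementation: the function name `pmm_impute_once` and the choice of 5 donors are assumptions, and the step in which the regression parameters are themselves drawn at random is omitted.

```r
set.seed(1845)

# Toy data (not the study data): y has gaps, x is fully observed.
n <- 200
x <- rnorm(n)
y <- 25 + 3 * x + rnorm(n)
y[sample(n, 20)] <- NA

# Simplified predictive mean matching for a single variable:
# fit on observed rows, then borrow an observed value from a "donor"
# whose predicted mean is close to that of the incomplete row.
pmm_impute_once <- function(y, x, donors = 5) {
  observed <- !is.na(y)
  fit <- lm(y ~ x, subset = observed)
  predicted <- predict(fit, newdata = data.frame(x = x))
  y_imputed <- y
  for (i in which(!observed)) {
    distance <- abs(predicted[observed] - predicted[i])
    nearest <- order(distance)[seq_len(donors)]
    donor <- nearest[sample.int(donors, 1)]
    y_imputed[i] <- y[observed][donor]
  }
  y_imputed
}

y_completed <- pmm_impute_once(y, x)
```

Because each gap is filled with a value copied from an observed row, the completed variable cannot contain values outside the observed data. Repeating the procedure several times with different random donor draws gives the multiple imputations whose spread reflects imputation uncertainty.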

Note that the neural network model requires that there be no missing values. Therefore, a new dataset was created consisting of the unprocessed dataset with the bmi variable imputed.

Figure 4: Density plots for the bmi variable. The number of multiple imputations was set to 4. Each red line represents the distribution for one imputation.

The blue line in each graph of Figure 4 represents the distribution of the non-missing bmi data, while the red lines represent the distributions of the imputed data. Note that the distributions of the imputed data for each iteration closely match the distribution of the non-missing data, which is ideal. If the distributions did not match so well, then another imputation method would have had to be used.

Skewed Variables

A Modern Approach to Regression with R explains the following:

“When conducting a binary regression with a skewed predictor, it is often easiest to assess the need for x and log(x) by including them both in the model so that their relative contributions can be assessed directly.”

The variable bmi exhibits skewness. Therefore, the log of this variable was added into the dataset.

target_variables <- c("bmi")

for (target_var in target_variables){
  stroke_data[,paste(target_var, "log", sep = "_")] <- log(stroke_data[target_var])
}

Figure 5: bmi after the log transformation. The transformed variable is stored in the dataset as bmi_log.
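The effect of a log transformation on right skewness can be quantified with a moment-based skewness statistic. The sketch below uses simulated values as a stand-in for bmi; the helper name `sample_skewness` and the distribution parameters are assumptions made for illustration.

```r
set.seed(1845)

# Moment-based sample skewness; positive values indicate a long right tail.
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / (mean((v - mean(v))^2))^1.5
}

# Simulated right-skewed values standing in for bmi (not the study data).
bmi_like <- rlnorm(5000, meanlog = 3.3, sdlog = 0.3)

skew_raw <- sample_skewness(bmi_like)
skew_log <- sample_skewness(log(bmi_like))
round(c(raw = skew_raw, log = skew_log), 3)
```

For data of this shape the raw values show clearly positive skewness, while the logged values have skewness close to zero.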

Normalization/Standardization

Neural networks usually learn by adjusting weights to reduce errors using methods like gradient descent. If the input features have different scales, these adjustments can be uneven and slow down the learning process. Normalization makes the feature scales similar, which helps the learning process to be smoother and faster. This is why normalization was applied to the dataset. Z-score normalization ensures that the continuous features have a mean of zero and a standard deviation of 1, which should help the neural network learn more effectively.

scaled_numeric_stroke_data <- stroke_data %>%
  select_if(is.numeric) %>%
  scale()

colnames(scaled_numeric_stroke_data) <- paste0(colnames(scaled_numeric_stroke_data), "_scaled")

stroke_data <- cbind(stroke_data %>% select_if(is.factor), scaled_numeric_stroke_data)
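As a quick check of what scale() does, the sketch below compares it with the z-score formula written out by hand. The feature here is a toy vector with an arbitrary range, not a column from the stroke dataset.

```r
set.seed(1845)

# Toy numeric feature (not the study data).
glucose_like <- runif(100, min = 55, max = 272)

# scale() centers by the mean and divides by the standard deviation.
z_by_hand <- (glucose_like - mean(glucose_like)) / sd(glucose_like)
z_by_scale <- as.numeric(scale(glucose_like))

all.equal(z_by_hand, z_by_scale)
round(c(mean = mean(z_by_scale), sd = sd(z_by_scale)), 6)
```

The two results agree, and the scaled values have mean 0 and standard deviation 1 up to floating-point error.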

Encode Categorical Variables

Neural networks operate on numerical data. Categorical variables, which represent categories or labels, need to be converted into numerical format for the model to process them effectively. Encoding converts categorical variables into a numerical representation that can be fed into the neural network. Therefore, one hot encoding was applied to the categorical variables in the original dataset. Note that only the features that had more than 2 levels were one hot encoded, while the binary categorical features were not, as doing so would result in redundant variables.

# Identify categorical variables with more than two levels
categorical_variables <- sapply(stroke_data, function(x) is.factor(x) && length(levels(x)) > 2)

# Create a formula for dummyVars to encode only the identified variables
formula <- as.formula(paste("~", paste(names(stroke_data)[categorical_variables], collapse = " + ")))

dummy_object <- dummyVars(formula, data = stroke_data)
encoded_data <- lapply(data.frame(predict(dummy_object, newdata = stroke_data)), as.factor)

stroke_data <- cbind(stroke_data[, !categorical_variables], encoded_data)

summary(stroke_data)
##  hypertension heart_disease ever_married Residence_type stroke  
##  0:4612       0:4834        No :1757     Rural:2514     0:4861  
##  1: 498       1: 276        Yes:3353     Urban:2596     1: 249  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##    age_scaled       avg_glucose_level_scaled   bmi_scaled     
##  Min.   :-1.90807   Min.   :-1.1268          Min.   :-2.3732  
##  1st Qu.:-0.80604   1st Qu.:-0.6383          1st Qu.:-0.6803  
##  Median : 0.07842   Median :-0.3150          Median :-0.1076  
##  Mean   : 0.00000   Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.78599   3rd Qu.: 0.1754          3rd Qu.: 0.5289  
##  Max.   : 1.71468   Max.   : 3.6568          Max.   : 8.7385  
##  bmi_log_scaled     gender.Female gender.Male gender.Other work_type.children
##  Min.   :-3.76935   0:2116        0:2995      0:5109       0:4423            
##  1st Qu.:-0.63835   1:2994        1:2115      1:   1       1: 687            
##  Median : 0.02071                                                            
##  Mean   : 0.00000                                                            
##  3rd Qu.: 0.63914                                                            
##  Max.   : 4.72270                                                            
##  work_type.Govt_job work_type.Never_worked work_type.Private
##  0:4453             0:5088                 0:2185           
##  1: 657             1:  22                 1:2925           
##                                                             
##                                                             
##                                                             
##                                                             
##  work_type.Self.employed smoking_status.formerly.smoked
##  0:4291                  0:4225                        
##  1: 819                  1: 885                        
##                                                        
##                                                        
##                                                        
##                                                        
##  smoking_status.never.smoked smoking_status.smokes smoking_status.Unknown
##  0:3218                      0:4321                0:3566                
##  1:1892                      1: 789                1:1544                
##                                                                          
##                                                                          
##                                                                          
## 

The summary above shows all of the factor and numeric variables, along with the factor variables that had more than 2 levels that were one hot encoded.
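For readers unfamiliar with one hot encoding, the same idea can be shown in base R on a toy factor. This is a minimal sketch using model.matrix rather than the dummyVars approach used above; the data frame `toy` is invented for illustration.

```r
# Toy factor with three levels (values chosen only for illustration).
toy <- data.frame(
  smoking_status = factor(c("smokes", "never smoked", "Unknown", "smokes"))
)

# Dropping the intercept (- 1) yields one indicator column per level.
one_hot <- model.matrix(~ smoking_status - 1, data = toy)
colnames(one_hot) <- levels(toy$smoking_status)
one_hot
```

Each row contains exactly one 1, marking the level that row belongs to. For a factor with only two levels, a single 0/1 column carries the same information, which is why the binary features were left as they were.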

Splitting the Data into Testing and Training

To properly test how well the machine learning models worked, the dataset was divided into two parts: a training set and a testing set. The training set was used to fit all of the models, and the testing set was used to see how well the models performed on data they had not been exposed to before. The same splitting methodology was also applied to the unprocessed dataset and to the dataset where the only preprocessing was the imputation of the bmi variable.
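With a rare outcome, it helps if the split keeps the class proportions the same in both parts. The sketch below shows one way to do a stratified 75/25 split in base R on toy labels; it is an illustration of the idea, not the caTools implementation, and the label counts are invented so that the arithmetic comes out exactly.

```r
set.seed(1845)

# Toy labels with an imbalance similar to the stroke variable (4% positives).
labels <- factor(c(rep(0, 960), rep(1, 40)))

# Draw 75% of the row indices within each class separately.
train_index <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
in_train <- seq_along(labels) %in% train_index

prop.table(table(labels[in_train]))
prop.table(table(labels[!in_train]))
```

Because the sampling is done per class, both parts keep the 96/4 class ratio of the toy labels.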

set.seed(1845)
original_split <- caTools::sample.split(stroke_data$stroke, SplitRatio = 0.75)
stroke_data_train <-  subset(stroke_data, original_split == TRUE)
stroke_data_test <- subset(stroke_data, original_split == FALSE)

set.seed(1845)
original_split_unprocessed <- caTools::sample.split(stroke_data_unprocessed$stroke, SplitRatio = 0.75)
stroke_data_train_unprocessed <-  subset(stroke_data_unprocessed, original_split_unprocessed == TRUE)
stroke_data_test_unprocessed <- subset(stroke_data_unprocessed, original_split_unprocessed == FALSE)

set.seed(1845)
original_split_unprocessed_bmi_imputed <- caTools::sample.split(stroke_data_unprocessed_bmi_imputed$stroke, SplitRatio = 0.75)
stroke_data_train_unprocessed_bmi_imputed <-  subset(stroke_data_unprocessed_bmi_imputed, original_split_unprocessed_bmi_imputed == TRUE)
stroke_data_test_unprocessed_bmi_imputed <- subset(stroke_data_unprocessed_bmi_imputed, original_split_unprocessed_bmi_imputed == FALSE)
prop.table(table(select(stroke_data, stroke)))
## stroke
##          0          1 
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train, stroke)))
## stroke
##          0          1 
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test, stroke)))
## stroke
##          0          1 
## 0.95144871 0.04855129
prop.table(table(select(stroke_data_unprocessed, stroke)))
## stroke
##          0          1 
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train_unprocessed, stroke)))
## stroke
##          0          1 
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test_unprocessed, stroke)))
## stroke
##          0          1 
## 0.95144871 0.04855129
prop.table(table(select(stroke_data_unprocessed_bmi_imputed, stroke)))
## stroke
##          0          1 
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train_unprocessed_bmi_imputed, stroke)))
## stroke
##          0          1 
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test_unprocessed_bmi_imputed, stroke)))
## stroke
##          0          1 
## 0.95144871 0.04855129

For the output above, stroke_data represents the data after it has been preprocessed, stroke_data_unprocessed represents the data with no preprocessing, and stroke_data_unprocessed_bmi_imputed represents the data where the only preprocessing was the imputation of the bmi variable. The class proportions in the output above reveal a significant class imbalance in all of the datasets used in this report. SMOTE from the DMwR package was applied only to the training datasets, so the testing datasets keep the original class proportions.

print("Balanced training data for `stroke_data`")
## [1] "Balanced training data for `stroke_data`"
set.seed(1845)
stroke_data_train <- SMOTE(stroke ~ ., data.frame(stroke_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train, stroke)))
## stroke
##   0   1 
## 0.5 0.5
print("Balanced training data for `stroke_data_unprocessed`")
## [1] "Balanced training data for `stroke_data_unprocessed`"
set.seed(1845)
stroke_data_train_unprocessed <- SMOTE(stroke ~ ., data.frame(stroke_data_train_unprocessed), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train_unprocessed, stroke)))
## stroke
##   0   1 
## 0.5 0.5
print("Balanced training data for `stroke_data_unprocessed_bmi_imputed`")
## [1] "Balanced training data for `stroke_data_unprocessed_bmi_imputed`"
set.seed(1845)
stroke_data_train_unprocessed_bmi_imputed <- SMOTE(stroke ~ ., data.frame(stroke_data_train_unprocessed_bmi_imputed), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train_unprocessed_bmi_imputed, stroke)))
## stroke
##   0   1 
## 0.5 0.5
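The core of SMOTE is interpolation between a minority-class point and one of its nearest minority-class neighbours. The sketch below illustrates that step for numeric features only, on toy data; the function name `smote_like` is hypothetical and this is not the DMwR implementation, which also handles nominal features. The last lines work through why perc.over = 100 with perc.under = 200 produces the 50/50 split seen above.

```r
set.seed(1845)

# Toy numeric minority-class points (not the study data).
minority <- matrix(rnorm(40), ncol = 2)

# Create one synthetic point per minority row by interpolating toward
# a randomly chosen one of its k nearest minority neighbours.
smote_like <- function(points, k = 5) {
  distances <- as.matrix(dist(points))
  t(sapply(seq_len(nrow(points)), function(i) {
    neighbours <- order(distances[i, ])[2:(k + 1)]  # position 1 is the point itself
    partner <- points[neighbours[sample.int(k, 1)], ]
    points[i, ] + runif(1) * (partner - points[i, ])
  }))
}

synthetic <- smote_like(minority)

# Bookkeeping for perc.over = 100 and perc.under = 200 with n minority rows:
n_minority <- nrow(minority)
n_synthetic <- n_minority * 100 / 100          # one new case per original
n_majority_kept <- n_synthetic * 200 / 100     # two majority cases per new case
c(minority = n_minority + n_synthetic, majority = n_majority_kept)
```

With these settings the minority class doubles (originals plus synthetic cases) and the majority class is sampled down to the same size, so the two classes end up equal.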

Fitting the Decision Tree Model (Unprocessed Data)

Practical Machine Learning in R states the following for decision tree models:

“…they are able to robustly handle outliers and noisy data. As you can start to see, decision trees require rather little of us in terms of data preparation.”

Therefore, it was decided to fit the unprocessed data to the decision tree model in order to compare results between preprocessed data and untouched data.

stroke_decision_tree_unprocessed <- rpart(
  stroke ~ .,
  method = "class",
  data = stroke_data_train_unprocessed
)

rpart.plot(stroke_decision_tree_unprocessed)

Figure 6: Decision tree for the unprocessed stroke dataset using all of the available features.

varImp(stroke_decision_tree_unprocessed) %>%
  tibble::rownames_to_column() %>%
  dplyr::rename("variable" = rowname) %>%
  dplyr::arrange(Overall) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
  filter(Overall > 0) %>%
  ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()

Figure 7: Variable importance plot for the decision tree model fit to the unprocessed data which uses all of the available features.

stroke_decision_tree_pred_unprocessed <- predict(stroke_decision_tree_unprocessed, stroke_data_test_unprocessed, type = "class")
confusionMatrix(stroke_decision_tree_pred_unprocessed, stroke_data_test_unprocessed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 959  14
##          1 256  48
##                                           
##                Accuracy : 0.7886          
##                  95% CI : (0.7651, 0.8107)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1976          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.77419         
##             Specificity : 0.78930         
##          Pos Pred Value : 0.15789         
##          Neg Pred Value : 0.98561         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03759         
##    Detection Prevalence : 0.23806         
##       Balanced Accuracy : 0.78175         
##                                           
##        'Positive' Class : 1               
## 
roc_decision_tree_unprocessed <- ROCR::prediction(
  predictions = as.numeric(stroke_decision_tree_pred_unprocessed),
  labels = stroke_data_test_unprocessed$stroke
)
roc_perf_decision_tree_unprocessed <- performance(roc_decision_tree_unprocessed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree_unprocessed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 8: ROC curve for the decision tree model fit to the unprocessed data using all of the available features.

auc_decision_tree_unprocessed <- performance(roc_decision_tree_unprocessed, measure = "auc")
stroke_decision_tree_auc_unprocessed <- unlist(slot(auc_decision_tree_unprocessed,"y.values"))
paste("Calculated AUC: ", stroke_decision_tree_auc_unprocessed)
## [1] "Calculated AUC:  0.781746979954865"
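Note that the AUC reported here matches the balanced accuracy in the confusion matrix. This is expected when hard class labels, rather than probabilities, are passed to the ROC calculation: the curve then has a single interior point. The short calculation below reproduces the value from the confusion matrix counts above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (positive class = 1).
true_negative <- 959
false_negative <- 14
false_positive <- 256
true_positive <- 48

sensitivity <- true_positive / (true_positive + false_negative)
specificity <- true_negative / (true_negative + false_positive)

# With hard 0/1 predictions the ROC curve has a single interior point,
# (1 - specificity, sensitivity), joined to (0, 0) and (1, 1).
# The area under those two segments equals the balanced accuracy.
fpr <- c(0, 1 - specificity, 1)
tpr <- c(0, sensitivity, 1)
auc_single_point <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

c(balanced_accuracy = (sensitivity + specificity) / 2, auc = auc_single_point)
```

Both quantities come out at about 0.7817, in agreement with the output above.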

Fitting the Decision Tree Model (Preprocessed Data)

Here, the decision tree model was fit to the processed data.

stroke_decision_tree <- rpart(
  stroke ~ .,
  method = "class",
  data = stroke_data_train
)

rpart.plot(stroke_decision_tree)

Figure 9: Decision tree for the preprocessed stroke dataset using all of the available features.

varImp(stroke_decision_tree) %>%
  tibble::rownames_to_column() %>%
  dplyr::rename("variable" = rowname) %>%
  dplyr::arrange(Overall) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
  filter(Overall > 0) %>%
  ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()

Figure 10: Variable importance plot for the decision tree model fit to the preprocessed data which uses all of the available features.

stroke_decision_tree_pred <- predict(stroke_decision_tree, stroke_data_test, type = "class")
confusionMatrix(stroke_decision_tree_pred, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 891  10
##          1 324  52
##                                           
##                Accuracy : 0.7384          
##                  95% CI : (0.7134, 0.7624)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1681          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.83871         
##             Specificity : 0.73333         
##          Pos Pred Value : 0.13830         
##          Neg Pred Value : 0.98890         
##              Prevalence : 0.04855         
##          Detection Rate : 0.04072         
##    Detection Prevalence : 0.29444         
##       Balanced Accuracy : 0.78602         
##                                           
##        'Positive' Class : 1               
## 
roc_decision_tree <- ROCR::prediction(
  predictions = predict(stroke_decision_tree, stroke_data_test, type = "prob")[, "1"],
  labels = stroke_data_test$stroke
)
roc_perf_decision_tree <- performance(roc_decision_tree, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 11: ROC curve for the decision tree model fit to the preprocessed data using all of the available features.

auc_decision_tree <- performance(roc_decision_tree, measure = "auc")
stroke_decision_tree_auc <- unlist(slot(auc_decision_tree,"y.values"))
paste("Calculated AUC: ", stroke_decision_tree_auc)
## [1] "Calculated AUC:  0.781162883313421"
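When predicted probabilities are used, as in this fit, the AUC can be read as the probability that a randomly chosen stroke case receives a higher score than a randomly chosen non-stroke case. The sketch below computes it from ranks on toy data; the helper name `rank_auc` and the toy values are assumptions for illustration, not part of the ROCR workflow used above.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count as half).
rank_auc <- function(scores, labels) {
  positive <- labels == 1
  n_pos <- sum(positive)
  n_neg <- sum(!positive)
  (sum(rank(scores)[positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and labels, chosen only for illustration.
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c(0, 0, 1, 1)
rank_auc(toy_scores, toy_labels)
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the result is 0.75.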

Fitting the Support Vector Machine Model

The svm function in R allowed for the generation of an SVM model that uses all of the features in the training set to build a model that predicts stroke. Several different kernels were used in order to compare the performance of each kernel against the others and against the other models in this report. These kernels include:

These are all of the kernels available in the svm function in R. Note that the svm function requires that datasets do not have any missing data. Therefore, for the SVM fits labeled “Unprocessed”, the bmi variable was imputed using the MICE algorithm.
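Assuming the svm function used here is the one from the e1071 package, its documentation lists four kernels: linear, polynomial, radial basis, and sigmoid. The sketch below writes each kernel out by hand for two numeric vectors to show what is being computed. The function names are illustrative, and the gamma, coef0, and degree values in the usage lines are arbitrary, not the settings used in this report.

```r
# Kernel functions written out by hand for two numeric vectors u and v.
# gamma, coef0, and degree are passed explicitly here; the values used in
# the usage lines are arbitrary and not the settings used in this report.
linear_kernel <- function(u, v) sum(u * v)
polynomial_kernel <- function(u, v, gamma, coef0, degree) (gamma * sum(u * v) + coef0)^degree
radial_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
sigmoid_kernel <- function(u, v, gamma, coef0) tanh(gamma * sum(u * v) + coef0)

u <- c(1, 2)
v <- c(3, 4)
c(
  linear = linear_kernel(u, v),
  polynomial = polynomial_kernel(u, v, gamma = 0.5, coef0 = 1, degree = 3),
  radial = radial_kernel(u, v, gamma = 0.5),
  sigmoid = sigmoid_kernel(u, v, gamma = 0.5, coef0 = 0)
)
```

The linear kernel is a plain dot product, so it can only separate the classes with a flat boundary; the other three allow curved boundaries in the original feature space.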

Linear Kernel SVM (Unprocessed Data)

stroke_svm_linear_unprocessed_bmi_imputed <- svm(
  stroke ~ .,
  kernel = "linear",
  type = "C-classification",
  data = stroke_data_train_unprocessed_bmi_imputed
)

summary(stroke_svm_linear_unprocessed_bmi_imputed)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed, 
##     kernel = "linear", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  349
## 
##  ( 176 173 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_linear_unprocessed_bmi_imputed <- predict(stroke_svm_linear_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed %>% subset(select = -stroke), type = "class")
confusionMatrix(stroke_svm_pred_linear_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 938  16
##          1 277  46
##                                           
##                Accuracy : 0.7706          
##                  95% CI : (0.7465, 0.7934)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1715          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.74194         
##             Specificity : 0.77202         
##          Pos Pred Value : 0.14241         
##          Neg Pred Value : 0.98323         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03602         
##    Detection Prevalence : 0.25294         
##       Balanced Accuracy : 0.75698         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_linear_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_linear_unprocessed_bmi_imputed),
  labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_linear_unprocessed_bmi_imputed <- performance(roc_pred_linear_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 12: ROC curve for the linear kernel SVM model fit to the unprocessed data using all of the available features.

auc_perf_linear_unprocessed_bmi_imputed <- performance(roc_pred_linear_unprocessed_bmi_imputed, measure = "auc")
stroke_svm_auc_linear_unprocessed_bmi_imputed <- unlist(slot(auc_perf_linear_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_svm_auc_linear_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.756975972388159"
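
The calculated AUC matches the balanced accuracy reported in the confusion matrix output. This is expected whenever the ROC curve is built from hard class labels rather than continuous scores: the curve then has a single interior point, and the area under it reduces to the mean of sensitivity and specificity. The base R sketch below checks this with the counts from the confusion matrix above; the variable names are illustrative.

```r
# Confusion-matrix counts for the linear kernel SVM (unprocessed data).
tp <- 46; fn <- 16; fp <- 277; tn <- 938

tpr <- tp / (tp + fn)   # sensitivity
fpr <- fp / (fp + tn)   # 1 - specificity

# With hard class labels the ROC curve has one interior point.
roc_x <- c(0, fpr, 1)
roc_y <- c(0, tpr, 1)

# Trapezoidal rule over the three ROC points.
auc_trapezoid <- sum(diff(roc_x) * (head(roc_y, -1) + tail(roc_y, -1)) / 2)

# Mean of sensitivity and specificity.
balanced_accuracy <- (tpr + (1 - fpr)) / 2

print(c(auc = auc_trapezoid, balanced_accuracy = balanced_accuracy))
```

A curve with more than one threshold would require continuous scores, such as decision values or class probabilities, in place of the predicted classes.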

Linear Kernel SVM (Preprocessed Data)

stroke_svm_linear <- svm(
  stroke ~ .,
  kernel = "linear",
  type = "C-classification",
  data = stroke_data_train
)

summary(stroke_svm_linear)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "linear", 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  369
## 
##  ( 181 188 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_linear <- predict(stroke_svm_linear, stroke_data_test %>% subset(select = -stroke), type = "class")
confusionMatrix(stroke_svm_pred_linear, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 906  14
##          1 309  48
##                                           
##                Accuracy : 0.7471          
##                  95% CI : (0.7223, 0.7707)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1596          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.77419         
##             Specificity : 0.74568         
##          Pos Pred Value : 0.13445         
##          Neg Pred Value : 0.98478         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03759         
##    Detection Prevalence : 0.27956         
##       Balanced Accuracy : 0.75994         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_linear <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_linear),
  labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_linear <- performance(roc_pred_linear, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 13: ROC curve for the linear kernel SVM model fit to the preprocessed data using all of the available features.

auc_perf_linear <- performance(roc_pred_linear, measure = "auc")
stroke_svm_auc_linear <- unlist(slot(auc_perf_linear,"y.values"))
paste("Calculated AUC: ", stroke_svm_auc_linear)
## [1] "Calculated AUC:  0.759936280366388"

Polynomial Kernel SVM (Unprocessed Data)

stroke_svm_polynomial_unprocessed_bmi_imputed <- svm(
  stroke ~ .,
  kernel = "polynomial",
  type = "C-classification",
  data = stroke_data_train_unprocessed_bmi_imputed
)

summary(stroke_svm_polynomial_unprocessed_bmi_imputed)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed, 
##     kernel = "polynomial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  571
## 
##  ( 282 289 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_polynomial_unprocessed_bmi_imputed <- predict(stroke_svm_polynomial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_polynomial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 789   8
##          1 426  54
##                                           
##                Accuracy : 0.6601          
##                  95% CI : (0.6334, 0.6861)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1239          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.87097         
##             Specificity : 0.64938         
##          Pos Pred Value : 0.11250         
##          Neg Pred Value : 0.98996         
##              Prevalence : 0.04855         
##          Detection Rate : 0.04229         
##    Detection Prevalence : 0.37588         
##       Balanced Accuracy : 0.76018         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_polynomial_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_polynomial_unprocessed_bmi_imputed),
  labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_polynomial_unprocessed_bmi_imputed <- performance(roc_pred_polynomial_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 14: ROC curve for the polynomial kernel SVM model fit to the unprocessed data using all of the available features.

auc_perf_polynomial_unprocessed_bmi_imputed <- performance(roc_pred_polynomial_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_polynomial_unprocessed_bmi_imputed <- unlist(slot(auc_perf_polynomial_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_polynomial_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.760175228992433"

Polynomial Kernel SVM (Preprocessed Data)

stroke_svm_polynomial <- svm(
  stroke ~ .,
  kernel = "polynomial",
  type = "C-classification",
  data = stroke_data_train
)

summary(stroke_svm_polynomial)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "polynomial", 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  551
## 
##  ( 274 277 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_polynomial <- predict(stroke_svm_polynomial, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_polynomial, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 739   8
##          1 476  54
##                                           
##                Accuracy : 0.621           
##                  95% CI : (0.5937, 0.6477)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1046          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.87097         
##             Specificity : 0.60823         
##          Pos Pred Value : 0.10189         
##          Neg Pred Value : 0.98929         
##              Prevalence : 0.04855         
##          Detection Rate : 0.04229         
##    Detection Prevalence : 0.41504         
##       Balanced Accuracy : 0.73960         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_polynomial <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_polynomial),
  labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_polynomial <- performance(roc_pred_polynomial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 15: ROC curve for the polynomial kernel SVM model fit to the preprocessed data using all of the available features.

auc_perf_polynomial <- performance(roc_pred_polynomial, measure = "auc")
stroke_auc_polynomial <- unlist(slot(auc_perf_polynomial,"y.values"))
paste("Calculated AUC: ", stroke_auc_polynomial)
## [1] "Calculated AUC:  0.739599097305191"

Radial Basis Kernel SVM (Unprocessed Data)

stroke_svm_radial_unprocessed_bmi_imputed <- svm(
  stroke ~ .,
  kernel = "radial",
  type = "C-classification",
  data = stroke_data_train_unprocessed_bmi_imputed
)

summary(stroke_svm_radial_unprocessed_bmi_imputed)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed, 
##     kernel = "radial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  380
## 
##  ( 191 189 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_radial_unprocessed_bmi_imputed <- predict(stroke_svm_radial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_radial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 905  12
##          1 310  50
##                                           
##                Accuracy : 0.7478          
##                  95% CI : (0.7231, 0.7715)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1681          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.80645         
##             Specificity : 0.74486         
##          Pos Pred Value : 0.13889         
##          Neg Pred Value : 0.98691         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03915         
##    Detection Prevalence : 0.28191         
##       Balanced Accuracy : 0.77565         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_radial_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_radial_unprocessed_bmi_imputed),
  labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_radial_unprocessed_bmi_imputed <- performance(roc_pred_radial_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 16: ROC curve for the radial kernel SVM model fit to the unprocessed data using all of the available features.

auc_perf_radial_unprocessed_bmi_imputed <- performance(roc_pred_radial_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_radial_unprocessed_bmi_imputed <- unlist(slot(auc_perf_radial_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_radial_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.775653789990708"

Radial Basis Kernel SVM (Preprocessed Data)

stroke_svm_radial <- svm(
  stroke ~ .,
  kernel = "radial",
  type = "C-classification",
  data = stroke_data_train
)

summary(stroke_svm_radial)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "radial", 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  387
## 
##  ( 189 198 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_radial <- predict(stroke_svm_radial, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_radial, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 888  14
##          1 327  48
##                                           
##                Accuracy : 0.733           
##                  95% CI : (0.7078, 0.7571)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1487          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.77419         
##             Specificity : 0.73086         
##          Pos Pred Value : 0.12800         
##          Neg Pred Value : 0.98448         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03759         
##    Detection Prevalence : 0.29366         
##       Balanced Accuracy : 0.75253         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_radial <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_radial),
  labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_radial <- performance(roc_pred_radial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 17: ROC curve for the radial kernel SVM model fit to the preprocessed data using all of the available features.

auc_perf_radial <- performance(roc_pred_radial, measure = "auc")
stroke_auc_radial <- unlist(slot(auc_perf_radial,"y.values"))
paste("Calculated AUC: ", stroke_auc_radial)
## [1] "Calculated AUC:  0.752528872958981"

Sigmoid Kernel SVM (Unprocessed Data)

stroke_svm_sigmoid_unprocessed_bmi_imputed <- svm(
  stroke ~ .,
  kernel = "sigmoid",
  type = "C-classification",
  data = stroke_data_train_unprocessed_bmi_imputed
)

summary(stroke_svm_sigmoid_unprocessed_bmi_imputed)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed, 
##     kernel = "sigmoid", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  1 
##      coef.0:  0 
## 
## Number of Support Vectors:  382
## 
##  ( 192 190 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_sigmoid_unprocessed_bmi_imputed <- predict(stroke_svm_sigmoid_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_sigmoid_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 912  13
##          1 303  49
##                                          
##                Accuracy : 0.7525         
##                  95% CI : (0.7279, 0.776)
##     No Information Rate : 0.9514         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.168          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.79032        
##             Specificity : 0.75062        
##          Pos Pred Value : 0.13920        
##          Neg Pred Value : 0.98595        
##              Prevalence : 0.04855        
##          Detection Rate : 0.03837        
##    Detection Prevalence : 0.27565        
##       Balanced Accuracy : 0.77047        
##                                          
##        'Positive' Class : 1              
## 
roc_pred_sigmoid_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_sigmoid_unprocessed_bmi_imputed),
  labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_sigmoid_unprocessed_bmi_imputed <- performance(roc_pred_sigmoid_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 18: ROC curve for the sigmoid kernel SVM model fit to the unprocessed data using all of the available features.

auc_perf_sigmoid_unprocessed_bmi_imputed <- performance(roc_pred_sigmoid_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_sigmoid_unprocessed_bmi_imputed <- unlist(slot(auc_perf_sigmoid_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_sigmoid_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.770469932297889"

Sigmoid Kernel SVM (Preprocessed Data)

stroke_svm_sigmoid <- svm(
  stroke ~ .,
  kernel = "sigmoid",
  type = "C-classification",
  data = stroke_data_train
)

summary(stroke_svm_sigmoid)
## 
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "sigmoid", 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  1 
##      coef.0:  0 
## 
## Number of Support Vectors:  397
## 
##  ( 198 199 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
stroke_svm_pred_sigmoid <- predict(stroke_svm_sigmoid, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_sigmoid, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 887  10
##          1 328  52
##                                           
##                Accuracy : 0.7353          
##                  95% CI : (0.7102, 0.7593)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1656          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.83871         
##             Specificity : 0.73004         
##          Pos Pred Value : 0.13684         
##          Neg Pred Value : 0.98885         
##              Prevalence : 0.04855         
##          Detection Rate : 0.04072         
##    Detection Prevalence : 0.29757         
##       Balanced Accuracy : 0.78438         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_sigmoid <- ROCR::prediction(
  predictions = as.numeric(stroke_svm_pred_sigmoid),
  labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 19: ROC curve for the sigmoid kernel SVM model fit to the preprocessed data using all of the available features.

auc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "auc")
stroke_auc_sigmoid <- unlist(slot(auc_perf_sigmoid,"y.values"))
paste("Calculated AUC: ", stroke_auc_sigmoid)
## [1] "Calculated AUC:  0.784375414841365"

Training the Random Forest Model (Unprocessed Data)

The randomForest package was used, through the rf method of the train function in the caret package, to generate a random forest model. This model requires a value for mtry, which is the number of features randomly sampled as candidate split variables at each node.

Practical Machine Learning in R explains the following:

“Based on the documentation provided by the randomForest package, the default value for mtry is the square root of the number of features in the dataset when working on a classification problem.”

Therefore, mtry was set to 3 for the stroke dataset. Note that the rf method in the train function from the caret package requires that datasets do not have any missing data. Therefore, for this random forest model, the bmi variable was imputed using the MICE algorithm.
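
The square-root rule quoted above can be sketched in a few lines of base R. The helper name default_mtry is hypothetical, and p = 10 is an assumption about the number of predictors in the stroke dataset; the rule rounds the square root down.

```r
# Default mtry heuristic for classification: floor(sqrt(p)), at least 1.
default_mtry <- function(p) max(floor(sqrt(p)), 1)

# Assumed number of predictors in the stroke dataset (not stated here).
p <- 10
mtry <- default_mtry(p)
print(mtry)
```

Under this rule, any predictor count from 9 to 15 gives the same value of mtry.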

rf_mod_unprocessed_bmi_imputed <- train(
  stroke ~ .,
  data = stroke_data_train_unprocessed_bmi_imputed,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "none"),
  tuneGrid = expand.grid(.mtry = 3)
  )
plot(varImp(rf_mod_unprocessed_bmi_imputed), top = 10)

Figure 20: Variable importance plot for the random forest model fit to the unprocessed data generated using all of the available features.

rf_pred_unprocessed_bmi_imputed <- predict(rf_mod_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed)
confusionMatrix(rf_pred_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 929  17
##          1 286  45
##                                           
##                Accuracy : 0.7627          
##                  95% CI : (0.7384, 0.7858)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1603          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72581         
##             Specificity : 0.76461         
##          Pos Pred Value : 0.13595         
##          Neg Pred Value : 0.98203         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03524         
##    Detection Prevalence : 0.25920         
##       Balanced Accuracy : 0.74521         
##                                           
##        'Positive' Class : 1               
## 
rf_pred_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(rf_pred_unprocessed_bmi_imputed),
  labels = stroke_data_test_unprocessed_bmi_imputed$stroke
)
rf_perf_unprocessed_bmi_imputed <- performance(rf_pred_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(rf_perf_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 21: ROC curve for the random forest model fit to the unprocessed data.

auc_perf_rf_unprocessed_bmi_imputed <- performance(rf_pred_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_rf_unprocessed_bmi_imputed <- unlist(slot(auc_perf_rf_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_rf_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.745207752555423"

Training the Random Forest Model (Preprocessed Data)

rf_mod <- train(
  stroke ~ .,
  data = stroke_data_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "none"),
  tuneGrid = expand.grid(.mtry = 3)
  )
plot(varImp(rf_mod), top = 10)

Figure 22: Variable importance plot for the random forest model fit to the preprocessed data generated using all of the available features.

rf_pred <- predict(rf_mod, stroke_data_test)
confusionMatrix(rf_pred, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 906  17
##          1 309  45
##                                           
##                Accuracy : 0.7447          
##                  95% CI : (0.7199, 0.7684)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1458          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72581         
##             Specificity : 0.74568         
##          Pos Pred Value : 0.12712         
##          Neg Pred Value : 0.98158         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03524         
##    Detection Prevalence : 0.27721         
##       Balanced Accuracy : 0.73574         
##                                           
##        'Positive' Class : 1               
## 
rf_pred <- ROCR::prediction(
  predictions = as.numeric(rf_pred),
  labels = stroke_data_test$stroke
)
rf_perf <- performance(rf_pred, measure = "tpr", x.measure = "fpr")
plot(rf_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 23: ROC curve for the random forest model fit to the preprocessed data.

auc_perf_rf <- performance(rf_pred, measure = "auc")
stroke_auc_rf <- unlist(slot(auc_perf_rf,"y.values"))
paste("Calculated AUC: ", stroke_auc_rf)
## [1] "Calculated AUC:  0.735742731979291"

Fitting the Neural Network Model (Unprocessed Data)

Here, the neural network model was fit to the unprocessed dataset with the bmi variable imputed. The avNNet method in the caret package fits several neural networks and averages their predictions, which is what was done in this report for both the unprocessed and preprocessed data. This code sets up a grid of tuning parameters (weight decay and number of hidden units), then trains the model with 5-fold cross-validation using the specified dataset and preprocessing steps. The resulting model (nnet_unprocessed_bmi_imputed) can be used to predict stroke risk from the input data.
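
As a rough guide to the size of this search, the sketch below rebuilds the same tuning grid with base R and counts the candidate settings and the resampled fits implied by 5-fold cross-validation. The names nnet_grid, n_candidates, n_folds and n_resampled_fits are illustrative.

```r
# Same grid as in the report, built with base R only.
nnet_grid <- expand.grid(.decay = c(0, 0.01, 0.1),
                         .size  = 1:10,
                         .bag   = FALSE)

n_candidates     <- nrow(nnet_grid)        # decay x size combinations
n_folds          <- 5                      # matches trainControl(number = 5)
n_resampled_fits <- n_candidates * n_folds

print(head(nnet_grid))
print(c(candidates = n_candidates, resampled_fits = n_resampled_fits))
```

Each of these fits is itself an average over several networks, so the grid is considerably more expensive to evaluate than the single SVM and random forest fits.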

set.seed(123)
nnetGrid_unprocessed_bmi_imputed <- expand.grid(.decay = c(0, 0.01, .1),
                                                .size = c(1:10),
                                                .bag = FALSE)

ctrl <- trainControl(method = "cv", number = 5)

nnet_unprocessed_bmi_imputed <- train(stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
                                      method = "avNNet",
                                      tuneGrid = nnetGrid_unprocessed_bmi_imputed,
                                      trControl = ctrl,
                                      preProc = c("YeoJohnson", "center", "scale"),
                                      trace = FALSE,
                                      linout = TRUE)
nnet_unprocessed_bmi_imputed_pred <- predict(nnet_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed)
confusionMatrix(nnet_unprocessed_bmi_imputed_pred, as.factor(stroke_data_test_unprocessed_bmi_imputed$stroke), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 933  15
##          1 282  47
##                                           
##                Accuracy : 0.7674          
##                  95% CI : (0.7433, 0.7903)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1728          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.75806         
##             Specificity : 0.76790         
##          Pos Pred Value : 0.14286         
##          Neg Pred Value : 0.98418         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03681         
##    Detection Prevalence : 0.25764         
##       Balanced Accuracy : 0.76298         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_nnet_unprocessed_bmi_imputed <- ROCR::prediction(
  predictions = as.numeric(nnet_unprocessed_bmi_imputed_pred),
  labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_nnet_unprocessed_bmi_imputed <- performance(roc_pred_nnet_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_nnet_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 24: ROC curve for the neural network model fit to the unprocessed data using all of the available features.

auc_perf_nnet_unprocessed_bmi_imputed <- performance(roc_pred_nnet_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_nnet_unprocessed_bmi_imputed <- unlist(slot(auc_perf_nnet_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_nnet_unprocessed_bmi_imputed)
## [1] "Calculated AUC:  0.762982875348467"

Fitting the Neural Network Model (Preprocessed Data)

Here, the neural network model was fit to the preprocessed data. The code sets up a grid of tuning parameters for a neural network model, then trains the model using the specified parameters, dataset, and preprocessing steps. The resulting model (nnet) can be used to predict stroke risk from the input data.
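
The centering and scaling part of these preprocessing steps can be illustrated with a simplified base R sketch. It omits the Yeo-Johnson transformation, uses hypothetical toy values in place of a real predictor, and only shows the general idea that the parameters are estimated on the training data and then reused on the test data.

```r
# Hypothetical toy values standing in for one numeric predictor.
train_x <- c(60, 75, 90, 110, 165)
test_x  <- c(80, 200)

# Estimate the centering and scaling parameters on the training data only.
train_mean <- mean(train_x)
train_sd   <- sd(train_x)

# Apply the same parameters to both sets.
train_scaled <- (train_x - train_mean) / train_sd
test_scaled  <- (test_x  - train_mean) / train_sd

print(round(train_scaled, 3))
print(round(test_scaled, 3))
```

Reusing the training parameters keeps information from the test set out of the fitted model.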

set.seed(123)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)

nnet <- train(stroke ~ ., data = stroke_data_train,
              method = "avNNet",
              tuneGrid = nnetGrid,
              trControl = ctrl,
              preProc = c("YeoJohnson", "center", "scale"),
              trace = FALSE,
              linout = TRUE)
nnet_pred <- predict(nnet, stroke_data_test)
confusionMatrix(nnet_pred, as.factor(stroke_data_test$stroke), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 918  17
##          1 297  45
##                                           
##                Accuracy : 0.7541          
##                  95% CI : (0.7295, 0.7775)
##     No Information Rate : 0.9514          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1532          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72581         
##             Specificity : 0.75556         
##          Pos Pred Value : 0.13158         
##          Neg Pred Value : 0.98182         
##              Prevalence : 0.04855         
##          Detection Rate : 0.03524         
##    Detection Prevalence : 0.26782         
##       Balanced Accuracy : 0.74068         
##                                           
##        'Positive' Class : 1               
## 
roc_pred_nnet <- ROCR::prediction(
  predictions = as.numeric(nnet_pred),
  labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_nnet <- performance(roc_pred_nnet, measure = "tpr", x.measure = "fpr")
plot(roc_perf_nnet, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 25: ROC curve for the neural network model fit to the preprocessed data using all of the available features.

auc_perf_nnet <- performance(roc_pred_nnet, measure = "auc")
stroke_auc_nnet <- unlist(slot(auc_perf_nnet,"y.values"))
paste("Calculated AUC: ", stroke_auc_nnet)
## [1] "Calculated AUC:  0.740681003584229"

Table of Metrics

data1 <- tribble(
  ~"",~"Accuracy",~"Kappa",~"Sensitivity",~"Specificity",~"AUC",
  "Decision Tree (Unprocessed)", "0.7886", "0.1976", "0.7742", "0.7893", "0.7817",
  "Decision Tree (Preprocessed)", "0.7384","0.1681","0.8387","0.7333","0.7812",
  "Linear Kernel SVM (Unprocessed)","0.7706","0.1715","0.7419","0.7729","0.7570",
  "Linear Kernel SVM (Preprocessed)","0.7471","0.1596","0.7742","0.7457","0.7600",
  "Polynomial Kernel SVM (Unprocessed)", "0.6601","0.1239","0.8710","0.6494","0.7602",
  "Polynomial Kernel SVM (Preprocessed)", "0.6210","0.1046","0.8710","0.6028","0.7396",
  "Radial Kernel SVM (Unprocessed)", "0.7478","0.1681","0.8064","0.7449","0.7756",
  "Radial Kernel SVM (Preprocessed)", "0.7330","0.1487","0.7742","0.7309","0.7525",
  "Sigmoid Kernel SVM (Unprocessed)", "0.7525","0.1680","0.7903","0.7506","0.7705",
  "Sigmoid Kernel SVM (Preprocessed)", "0.7353","0.1656","0.8387","0.7300","0.7844",
  "Random Forest (Unprocessed)", "0.7627","0.1603","0.7258","0.7646","0.7357",
  "Random Forest (Preprocessed)", "0.7447","0.1458","0.7258","0.7459","0.7375",
  "Neural Network (Unprocessed)", "0.7674","0.1728","0.7581","0.7679","0.7630",
  "Neural Network (Preprocessed)", "0.7541","0.1532","0.7258","0.7556","0.7406"
)
knitr::kable((data1), booktabs = TRUE)
| Model | Accuracy | Kappa | Sensitivity | Specificity | AUC |
|:------|:---------|:------|:------------|:------------|:----|
| Decision Tree (Unprocessed) | 0.7886 | 0.1976 | 0.7742 | 0.7893 | 0.7817 |
| Decision Tree (Preprocessed) | 0.7384 | 0.1681 | 0.8387 | 0.7333 | 0.7812 |
| Linear Kernel SVM (Unprocessed) | 0.7706 | 0.1715 | 0.7419 | 0.7729 | 0.7570 |
| Linear Kernel SVM (Preprocessed) | 0.7471 | 0.1596 | 0.7742 | 0.7457 | 0.7600 |
| Polynomial Kernel SVM (Unprocessed) | 0.6601 | 0.1239 | 0.8710 | 0.6494 | 0.7602 |
| Polynomial Kernel SVM (Preprocessed) | 0.6210 | 0.1046 | 0.8710 | 0.6028 | 0.7396 |
| Radial Kernel SVM (Unprocessed) | 0.7478 | 0.1681 | 0.8064 | 0.7449 | 0.7756 |
| Radial Kernel SVM (Preprocessed) | 0.7330 | 0.1487 | 0.7742 | 0.7309 | 0.7525 |
| Sigmoid Kernel SVM (Unprocessed) | 0.7525 | 0.1680 | 0.7903 | 0.7506 | 0.7705 |
| Sigmoid Kernel SVM (Preprocessed) | 0.7353 | 0.1656 | 0.8387 | 0.7300 | 0.7844 |
| Random Forest (Unprocessed) | 0.7627 | 0.1603 | 0.7258 | 0.7646 | 0.7357 |
| Random Forest (Preprocessed) | 0.7447 | 0.1458 | 0.7258 | 0.7459 | 0.7375 |
| Neural Network (Unprocessed) | 0.7674 | 0.1728 | 0.7581 | 0.7679 | 0.7630 |
| Neural Network (Preprocessed) | 0.7541 | 0.1532 | 0.7258 | 0.7556 | 0.7406 |

Table 1: Metrics for different model types

Comparison Between Different Model Types and the Effects of Preprocessing

Table 1 reports performance metrics for each machine learning model applied to the stroke prediction dataset, with and without preprocessing. This section discusses how the models compare and how preprocessing affected each one.

For the decision tree model, preprocessing decreased accuracy (0.7886 to 0.7384) and specificity (0.7893 to 0.7333) but increased sensitivity (0.7742 to 0.8387), a trade-off in which the model identified more true positives (stroke cases) at the expense of more false positives. Kappa, which measures agreement between predicted and actual classes beyond what chance would produce, decreased from 0.1976 to 0.1681, suggesting a small reduction in overall predictive power. AUC remained almost the same (0.7817 versus 0.7812), indicating that the model’s ability to discriminate between classes did not change meaningfully.

For both decision tree models, age was the most important predictor and the main splitting variable, meaning that splitting on age produced the largest reduction in uncertainty about stroke risk. Average glucose level ranked second: it contributes substantially to prediction but is less dominant than age, and its role may be linked to diabetes and metabolic health. Because these two variables ranked first and second in both the unprocessed and preprocessed models, their importance does not appear to depend on preprocessing. The dominance of age aligns with medical knowledge that age is a major risk factor for stroke due to the cumulative effects of other risk factors over time.

For the linear kernel SVM model, preprocessing slightly decreased accuracy and specificity but increased sensitivity. Kappa decreased, indicating a slight decline in agreement between predictions and actual outcomes, while AUC increased marginally (0.7570 to 0.7600), suggesting a slight improvement in the model’s ability to separate high-risk patients from the rest. For the polynomial kernel SVM, preprocessing decreased accuracy and specificity, while sensitivity remained constant at 0.8710. Kappa and AUC both decreased, indicating a decline in overall model performance. For the radial kernel SVM and neural network models, every metric in Table 1 declined with preprocessing. For the sigmoid kernel SVM, preprocessing decreased accuracy, kappa, and specificity but increased sensitivity (0.7903 to 0.8387) and AUC (0.7705 to 0.7844). For the random forest model, preprocessing also decreased accuracy, kappa, and specificity, but sensitivity was unchanged at 0.7258 and AUC increased only slightly (0.7357 to 0.7375).

Age and average glucose level consistently emerged as the most important predictors in the decision tree and random forest models, for both preprocessed and unprocessed data. These two variables are therefore the primary predictors of stroke risk in the tree-based models, regardless of preprocessing.

Overall, preprocessing decreased accuracy, specificity, and kappa for every model type in Table 1, indicating a small reduction in overall agreement between predictions and actual outcomes. Its effect on sensitivity was mixed: sensitivity increased for the decision tree, linear kernel SVM, and sigmoid kernel SVM, was unchanged for the polynomial kernel SVM and random forest, and decreased for the radial kernel SVM and neural network. AUC, which reflects the overall ability of the model to distinguish between positive and negative classes, was also mixed, with slight improvements for the linear kernel SVM, sigmoid kernel SVM, and random forest and declines for the other models. The impact of preprocessing therefore needs to be evaluated case by case, considering the specific context and requirements of the model’s application. For tasks prioritizing sensitivity (identifying true stroke risks), the polynomial kernel SVM might be a good choice despite its lower overall accuracy and specificity. For tasks requiring a balance between accuracy and specificity, the unprocessed decision tree might be more suitable.
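
The comparison above can be reproduced by subtracting the unprocessed metrics from the preprocessed ones. The following base R sketch does this using the values copied from Table 1. The object names (`unprocessed`, `preprocessed`, `delta`, `direction`) are illustrative and do not appear in the original analysis.

```r
# Metrics copied from Table 1, one row per model type
models <- c("Decision Tree", "Linear SVM", "Polynomial SVM", "Radial SVM",
            "Sigmoid SVM", "Random Forest", "Neural Network")
metric_names <- c("Accuracy", "Kappa", "Sensitivity", "Specificity", "AUC")

unprocessed <- matrix(c(
  0.7886, 0.1976, 0.7742, 0.7893, 0.7817,
  0.7706, 0.1715, 0.7419, 0.7729, 0.7570,
  0.6601, 0.1239, 0.8710, 0.6494, 0.7602,
  0.7478, 0.1681, 0.8064, 0.7449, 0.7756,
  0.7525, 0.1680, 0.7903, 0.7506, 0.7705,
  0.7627, 0.1603, 0.7258, 0.7646, 0.7357,
  0.7674, 0.1728, 0.7581, 0.7679, 0.7630
), ncol = 5, byrow = TRUE, dimnames = list(models, metric_names))

preprocessed <- matrix(c(
  0.7384, 0.1681, 0.8387, 0.7333, 0.7812,
  0.7471, 0.1596, 0.7742, 0.7457, 0.7600,
  0.6210, 0.1046, 0.8710, 0.6028, 0.7396,
  0.7330, 0.1487, 0.7742, 0.7309, 0.7525,
  0.7353, 0.1656, 0.8387, 0.7300, 0.7844,
  0.7447, 0.1458, 0.7258, 0.7459, 0.7375,
  0.7541, 0.1532, 0.7258, 0.7556, 0.7406
), ncol = 5, byrow = TRUE, dimnames = list(models, metric_names))

# Positive values mean the metric was higher after preprocessing
delta <- round(preprocessed - unprocessed, 4)
print(delta)

# For each metric, count the model types where it rose, stayed equal, or fell
direction <- apply(sign(delta), 2, function(s) {
  c(up = sum(s > 0), same = sum(s == 0), down = sum(s < 0))
})
print(direction)
```

Rounding the differences to four decimals keeps floating-point noise from masking metrics that were identical before and after preprocessing.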

Final Model Selection

In a healthcare setting where accuracy is most important, the model with the highest accuracy metric should be preferred. Based on Table 1, the decision tree model with unprocessed data has the highest accuracy (0.7886) among all the models tested. High accuracy ensures that a majority of predictions (both positive and negative) are correct, which is critical for reliable decision-making. This model also demonstrates a good balance between sensitivity (0.7742) and specificity (0.7893). While sensitivity is important to identify patients at high risk of stroke (true positives), specificity is equally important to avoid false alarms (false positives). Because the two values are close, the selected model can detect at-risk patients without trading away much specificity, which limits the burden that false positives place on the healthcare system. The AUC of 0.7817 indicates a strong ability to distinguish between patients at high risk of stroke and those not at risk. Although this is not the highest AUC in Table 1 (the preprocessed sigmoid kernel SVM reached 0.7844), it is competitive and, when combined with the highest accuracy, suggests robust overall performance. Decision trees are also inherently simple and interpretable models. In a healthcare setting, where decisions must be transparent and justifiable, this interpretability is an advantage: clinicians can understand and explain the decision-making process, which builds trust in the model’s predictions.
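
The selection rule described above (prefer the highest accuracy, then check that AUC remains competitive) can be sketched in base R as follows. The accuracy and AUC values are copied from Table 1. The data frame `metrics` and the other object names are illustrative and not part of the original analysis.

```r
# Accuracy and AUC copied from Table 1
metrics <- data.frame(
  model = c(
    "Decision Tree (Unprocessed)", "Decision Tree (Preprocessed)",
    "Linear Kernel SVM (Unprocessed)", "Linear Kernel SVM (Preprocessed)",
    "Polynomial Kernel SVM (Unprocessed)", "Polynomial Kernel SVM (Preprocessed)",
    "Radial Kernel SVM (Unprocessed)", "Radial Kernel SVM (Preprocessed)",
    "Sigmoid Kernel SVM (Unprocessed)", "Sigmoid Kernel SVM (Preprocessed)",
    "Random Forest (Unprocessed)", "Random Forest (Preprocessed)",
    "Neural Network (Unprocessed)", "Neural Network (Preprocessed)"
  ),
  accuracy = c(0.7886, 0.7384, 0.7706, 0.7471, 0.6601, 0.6210, 0.7478,
               0.7330, 0.7525, 0.7353, 0.7627, 0.7447, 0.7674, 0.7541),
  auc = c(0.7817, 0.7812, 0.7570, 0.7600, 0.7602, 0.7396, 0.7756,
          0.7525, 0.7705, 0.7844, 0.7357, 0.7375, 0.7630, 0.7406),
  stringsAsFactors = FALSE
)

# Model with the highest accuracy and model with the highest AUC
best_accuracy <- metrics$model[which.max(metrics$accuracy)]
best_auc <- metrics$model[which.max(metrics$auc)]

# Position of the accuracy-selected model when ordered by AUC (1 = highest)
auc_rank_of_selected <- rank(-metrics$auc)[metrics$model == best_accuracy]

print(best_accuracy)
print(best_auc)
print(auc_rank_of_selected)
```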

Summary/Conclusions

Accurate prediction of stroke risk is crucial in healthcare. High accuracy ensures that patients at genuine risk are correctly identified and can receive timely interventions, potentially saving lives and improving patient outcomes. It also limits false positives, thereby reducing unnecessary stress and medical interventions for patients misclassified as high-risk, and it helps ensure that healthcare resources are focused on those who truly need them. The selected decision tree model’s interpretability allows healthcare providers to understand and trust the model’s decision-making process. This transparency is vital for clinical adoption and patient acceptance.

By accurately identifying individuals at high risk of stroke, preventive measures can be taken, such as lifestyle modifications, medications, and regular monitoring. This can significantly reduce the incidence of strokes, which are often costly to treat. The average cost of treating a stroke can range from 30,000 to 120,000 USD, considering acute treatment, rehabilitation, and long-term care. Preventing strokes can therefore result in substantial cost savings. Implementing preventive strategies is generally far less expensive than treating a stroke. For example, the cost of managing risk factors (e.g., controlling hypertension, managing diabetes) is significantly lower than the costs associated with stroke treatment and rehabilitation.

If a healthcare system can reduce the incidence of strokes by 10% through early identification and intervention, the potential savings could be enormous. For instance, if the average cost per stroke is 50,000 USD and the system prevents 1,000 strokes annually, this translates to 50 million USD in savings per year. Using the decision tree model in hospitals can also support clinicians by providing relevant information at the point of care, enabling faster decision-making and reducing the burden of risk assessment.
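
The savings example above is a simple product of strokes prevented and cost per stroke. The following base R sketch reproduces it and applies the same calculation to the 30,000 to 120,000 USD cost range quoted in the text. The function name `annual_savings_usd` is hypothetical, and the figures are the illustrative ones from the text rather than estimates derived from the dataset.

```r
# Hypothetical helper: annual savings = strokes prevented x average cost per stroke
annual_savings_usd <- function(strokes_prevented, cost_per_stroke_usd) {
  strokes_prevented * cost_per_stroke_usd
}

# Example from the text: 1,000 strokes prevented at 50,000 USD each
example_savings <- annual_savings_usd(1000, 50000)

# Same number of strokes prevented across the quoted cost range
range_savings <- annual_savings_usd(1000, c(30000, 120000))

print(format(example_savings, big.mark = ",", scientific = FALSE))
print(format(range_savings, big.mark = ",", scientific = FALSE))
```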

In summary, the decision tree model could have a significant impact on healthcare businesses. For stakeholders, the financial implications are substantial, as early identification and intervention can lead to significant cost savings by preventing expensive stroke treatments. Implementing such a model enhances patient care and supports the sustainability and efficiency of healthcare systems.