According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide, responsible for approximately 11% of total deaths (about 6 million people). This alarming statistic underscores the critical need for effective predictive models to assess the likelihood of stroke in patients. The American Stroke Association indicates that most strokes are preventable through greater awareness and changes to daily habits: being more physically active, eating healthy foods, keeping blood pressure in check, getting enough sleep, and avoiding smoking and vaping. Mastrigt and Heugten point out that, according to projections from the American Heart Association, by 2030 nearly 4% of adults in the United States will have experienced a stroke. They also calculated that total healthcare expenses for stroke amounted to USD 30.8 billion per year during 2016 and 2017. In response to these facts, a dataset has been curated that offers a robust foundation for predictive analytics in stroke risk assessment. This dataset consists of patient data for 5110 respondents.
This dataset, compiled and made available for research purposes, comprises a diverse array of parameters with varying levels of importance for predicting the onset of stroke. The features in the dataset are listed below:
- id: A unique identifier
- gender: Separated into 3 different categories: Male, Female, Other
- age: The age of the patient
- hypertension: 0 if the patient does not have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient does not have any heart diseases, 1 if the patient has a heart disease
- ever_married: No or Yes
- work_type: Separated into 5 different categories: children, Govt_job, Never_worked, Private, Self-employed
- Residence_type: Separated into 2 different categories: Rural, Urban
- avg_glucose_level: The average glucose level in blood in mg/dL
- bmi: The body mass index
- smoking_status: Separated into 4 different categories: formerly smoked, never smoked, smokes, Unknown
- stroke: Takes on 2 different values: 1 if the patient had a stroke, 0 if not. This is the response variable.

Ultimately, the overarching goal of using this dataset was to leverage machine learning and predictive analytics to identify individuals at heightened risk of stroke. This collection of information about people's age, health, and habits is akin to a puzzle: by putting the different pieces together, one can better understand who might be at risk of having a stroke. The unprocessed and preprocessed data were fit to a decision tree model, 4 different support vector machine models, a random forest model, and a neural network model. Among these model types, the one most accurate at determining whether a person was at high risk of having a stroke was selected. Given the importance of early stroke detection and prevention in healthcare, the objective was to evaluate and compare the models to determine which one offers the highest accuracy and reliability. High accuracy is crucial in applications where the cost of misclassification (false positives or false negatives) is high, such as medical diagnosis. The analysis was important both for enhancing patient outcomes through early intervention and for optimizing the use of healthcare resources. By identifying the best predictive model, healthcare providers can better allocate their efforts toward individuals at greatest risk of stroke, ultimately improving clinical decision-making and patient care.
stroke_data <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE) %>% subset(select = -id)
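The bmi column is read in as character because, in this CSV, missing entries appear to be coded as the literal string "N/A" (an assumption consistent with the coercion warning shown further below). Declaring that at read time would import bmi as numeric directly; a minimal sketch, not the import used for the outputs in this report:
# Treat the "N/A" strings as missing at read time (assumed coding in the CSV)
stroke_data <- read.csv("healthcare-dataset-stroke-data.csv", header = TRUE,
                        na.strings = "N/A") %>% subset(select = -id)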
A summary of the stroke dataset is provided below:
summary(stroke_data)
## gender age hypertension heart_disease
## Length:5110 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:25.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :45.00 Median :0.00000 Median :0.00000
## Mean :43.23 Mean :0.09746 Mean :0.05401
## 3rd Qu.:61.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :82.00 Max. :1.00000 Max. :1.00000
## ever_married work_type Residence_type avg_glucose_level
## Length:5110 Length:5110 Length:5110 Min. : 55.12
## Class :character Class :character Class :character 1st Qu.: 77.25
## Mode :character Mode :character Mode :character Median : 91.89
## Mean :106.15
## 3rd Qu.:114.09
## Max. :271.74
## bmi smoking_status stroke
## Length:5110 Length:5110 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.04873
## 3rd Qu.:0.00000
## Max. :1.00000
The id variable was omitted from the analysis as this
variable just offers a unique identifier for each observation. The
factors above have been recoded for readability. After recoding, the
summary below revealed that for the bmi variable, there
were 201 missing values.
stroke_data <- stroke_data %>%
mutate(
gender = as.factor(gender),
hypertension = as.factor(hypertension),
heart_disease = as.factor(heart_disease),
ever_married = as.factor(ever_married),
work_type = as.factor(work_type),
Residence_type = as.factor(Residence_type),
smoking_status = as.factor(smoking_status),
stroke = as.factor(stroke),
bmi = as.numeric(bmi)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `bmi = as.numeric(bmi)`.
## Caused by warning:
## ! NAs introduced by coercion
stroke_data_unprocessed <- stroke_data
summary(stroke_data)
## gender age hypertension heart_disease ever_married
## Female:2994 Min. : 0.08 0:4612 0:4834 No :1757
## Male :2115 1st Qu.:25.00 1: 498 1: 276 Yes:3353
## Other : 1 Median :45.00
## Mean :43.23
## 3rd Qu.:61.00
## Max. :82.00
##
## work_type Residence_type avg_glucose_level bmi
## children : 687 Rural:2514 Min. : 55.12 Min. :10.30
## Govt_job : 657 Urban:2596 1st Qu.: 77.25 1st Qu.:23.50
## Never_worked : 22 Median : 91.89 Median :28.10
## Private :2925 Mean :106.15 Mean :28.89
## Self-employed: 819 3rd Qu.:114.09 3rd Qu.:33.10
## Max. :271.74 Max. :97.60
## NA's :201
## smoking_status stroke
## formerly smoked: 885 0:4861
## never smoked :1892 1: 249
## smokes : 789
## Unknown :1544
##
##
##
Figure 1: Density plots for age,
avg_glucose_level, and bmi.
The age variable exhibits a roughly normal distribution; however, near the 80-year bracket there is a spike in observation count. The avg_glucose_level variable exhibits bimodality, which is also reflected in the summary statistics: the minimum and mean are 55 and 106, respectively, but the maximum is 271. The bmi variable exhibits right skewness, which is likewise reflected in the summary statistics: the minimum and mean are 10 and 29, respectively, while the maximum is 98.
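As a quick numeric check of the skewness claim (a sketch, not part of the original analysis), the sample skewness of bmi can be computed with the e1071 package, which is loaded later for the SVM fits; a clearly positive value confirms the long right tail seen in Figure 1:
# Sample skewness of bmi; a positive value indicates right skew
e1071::skewness(stroke_data$bmi, na.rm = TRUE)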
Figure 2: Boxplots for the stroke dataset.
The boxplots in Figure 2 support the theoretical effects of several variables. Based on the age boxplot, older patients were more likely to have a stroke. On average, patients in the dataset with a higher avg_glucose_level were more likely to have a stroke. The boxplots also reveal that patients with a higher bmi were more likely to develop a stroke.
Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. A correlation plot visualizes the relationships between features. Note that the correlation plot operates on numeric inputs, so factor variables must be represented as numeric codes to be included alongside the continuous variables.
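The stroke_correlations object is not constructed in the code shown. A minimal sketch of how such a correlation matrix could be assembled is given below; coercing the factors to integer codes is an assumption made here so that binary variables such as ever_married can enter the matrix alongside the continuous ones:
# Hypothetical construction of the object passed to corrplot() below:
# coerce factors to integer codes, keep numeric columns, and correlate
numeric_stroke <- stroke_data %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  select(where(is.numeric))
stroke_correlations <- list(
  correlations = cor(numeric_stroke, use = "pairwise.complete.obs")
)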
corrplot(stroke_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 3: Multicollinearity plot for continuous predictor variables.
Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The article goes on to explain that correlations between 0.5 and 0.7 indicate a “moderate” correlation, with anything above 0.7 indicating a “strong” correlation. The correlation with the largest magnitude is 0.77, between the age and ever_married variables, indicating a strong correlation between these 2 variables, followed by 0.65 between the age and heart_disease variables.
prop.table(table(select(stroke_data, stroke)))
## stroke
## 0 1
## 0.95127202 0.04872798
The output above shows that the stroke dataset is imbalanced: 95.13% of the patients in the study never experienced a stroke, while 4.87% did. Class imbalance can affect the performance of predictive models; imbalanced datasets can lead to biased models that favor the majority class and perform poorly on the minority class, which is why the training dataset was resampled using SMOTE.
Age is a well-known risk factor for stroke and is widely acknowledged as such in the medical literature.
Marital status, though potentially correlated with age, may not have a direct causal relationship with stroke risk. Therefore, the ever_married variable could have been removed from the dataset in favor of the age variable. In addition, heart disease is also known to increase the risk of stroke but may not have as direct and universally acknowledged a relationship with stroke as age does. Moderate collinearity suggests that there is some relationship between age and heart disease, but it may not be so strong that it significantly impacts the stability or interpretability of the model. That said, decision trees do not require or assume a specific relationship between the independent variables, unlike linear regression models. Consequently, decision trees can produce accurate predictions even when there is a high level of correlation among some variables. Neural networks likewise generally do not suffer from multicollinearity because they are often overparameterized: the additional weights learned during training introduce redundancies, making issues that affect a small subset of features, such as multicollinearity, less significant. Therefore, it was decided to retain all of the variables within the dataset.
In general, imputation by the mean or median is acceptable if missing values account for only about 5% of the sample (Peng et al., 2006). However, should the degree of missingness exceed 20%, these simple imputation approaches result in an artificial reduction in variability, because values are imputed at the center of the variable's distribution.
It was decided to employ another technique to handle the missing values: Multiple Regression Imputation using the MICE package.
The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing datapoint. Many imputation methods can be used within the package. The one that was selected for the data being analyzed in this report is PMM (Predictive Mean Matching), which is used for quantitative data.
Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the observation with the missing value. An advantage of this is that only values that actually occur in the observed data are used, so, for example, no negative values will be imputed. PMM also circumvents the shrinking of standard errors by using multiple regression models: the variability between the different imputed values yields a wider, but more correct, standard error. Uncertainty is inherent in imputation, which is why having multiple imputed values is important. Furthermore, Marshall et al. (2010) point out that:
“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
Note that the neural network model requires that there be no missing
values. Therefore, a new dataset was created consisting of the
unprocessed dataset with the bmi variable imputed.
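The imputation call itself is not shown above; a minimal sketch of how the imputed dataset could be produced with the mice package, under the assumptions that PMM is used and that m = 4 matches the four imputations shown in Figure 4 (the seed is assumed, for reproducibility only):
library(mice)
set.seed(1845)  # assumed seed
# Impute bmi (the only variable with NAs) by predictive mean matching, 4 times
mice_fit <- mice(stroke_data_unprocessed, m = 4, method = "pmm", printFlag = FALSE)
# Take the first completed dataset for the models that cannot handle NAs
stroke_data_unprocessed_bmi_imputed <- complete(mice_fit, 1)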
Figure 4: Density plots for the bmi variable. The number of multiple imputations was set to 4; each red line represents the distribution for one imputation.
The blue line in each graph in Figure 4 represents the distribution of the non-missing bmi data, while the red lines represent the distributions of the imputed data. Note that the distribution of the imputed data for each iteration closely matches the distribution of the non-missing data, which is ideal. If the distributions did not match so well, then another imputation method would have had to be used.
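For reference, Figure 4 can be reproduced from a mids object such as the mice_fit sketched earlier, using the lattice method that ships with mice:
# Observed bmi density (blue) overlaid with each of the 4 imputations (red)
mice::densityplot(mice_fit, ~ bmi)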
A Modern Approach to Regression with R explains the following:
“When conducting a binary regression with a skewed predictor, it is often easiest to assess the need for x and log(x) by including them both in the model so that their relative contributions can be assessed directly.”
The variable bmi exhibits skewness. Therefore, the log
of this variable was added into the dataset.
target_variables <- c("bmi")
for (target_var in target_variables){
stroke_data[,paste(target_var, "log", sep = "_")] <- log(stroke_data[target_var])
}
Figure 5: bmi after the log transformation. The transformed variable is stored in the dataset as bmi_log.
Neural networks typically learn by adjusting weights to reduce error using methods like gradient descent. If the input features have very different scales, these adjustments can be uneven and slow down learning. Normalization puts the features on similar scales, which makes the learning process smoother and faster; this is why normalization was applied to the dataset. Z-score normalization ensures that the continuous features have a mean of zero and a standard deviation of 1, which should help the neural network learn more effectively.
scaled_numeric_stroke_data <- stroke_data %>%
select_if(is.numeric) %>%
scale()
colnames(scaled_numeric_stroke_data) <- paste0(colnames(scaled_numeric_stroke_data), "_scaled")
stroke_data <- cbind(stroke_data %>% select_if(is.factor), scaled_numeric_stroke_data)
Neural networks operate on numerical data, so categorical variables, which represent categories or labels, need to be converted into a numerical format for the model to process them. Encoding converts categorical variables into a numerical representation that can be fed into the neural network; therefore, one-hot encoding was applied to the categorical variables in the original dataset. Note that only features with more than 2 levels were one-hot encoded; the binary categorical features were not, as encoding them would produce redundant variables.
# Identify categorical variables with more than two levels
categorical_variables <- sapply(stroke_data, function(x) is.factor(x) && length(levels(x)) > 2)
# Create a formula for dummyVars to encode only the identified variables
formula <- as.formula(paste("~", paste(names(stroke_data)[categorical_variables], collapse = " + ")))
dummy_object <- dummyVars(formula, data = stroke_data)
encoded_data <- lapply(data.frame(predict(dummy_object, newdata = stroke_data)), as.factor)
stroke_data <- cbind(stroke_data[, !categorical_variables], encoded_data)
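One design note: dummyVars() defaults to fullRank = FALSE, so every factor level receives its own indicator column, which is why gender.Female, gender.Male, and gender.Other all appear in the summary below. For models sensitive to perfect collinearity, a reference level could be dropped instead:
# Alternative (not used here): drop one reference level per factor to avoid
# the perfectly collinear "dummy trap"
# dummy_object <- dummyVars(formula, data = stroke_data, fullRank = TRUE)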
summary(stroke_data)
## hypertension heart_disease ever_married Residence_type stroke
## 0:4612 0:4834 No :1757 Rural:2514 0:4861
## 1: 498 1: 276 Yes:3353 Urban:2596 1: 249
##
##
##
##
## age_scaled avg_glucose_level_scaled bmi_scaled
## Min. :-1.90807 Min. :-1.1268 Min. :-2.3732
## 1st Qu.:-0.80604 1st Qu.:-0.6383 1st Qu.:-0.6803
## Median : 0.07842 Median :-0.3150 Median :-0.1076
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.78599 3rd Qu.: 0.1754 3rd Qu.: 0.5289
## Max. : 1.71468 Max. : 3.6568 Max. : 8.7385
## bmi_log_scaled gender.Female gender.Male gender.Other work_type.children
## Min. :-3.76935 0:2116 0:2995 0:5109 0:4423
## 1st Qu.:-0.63835 1:2994 1:2115 1: 1 1: 687
## Median : 0.02071
## Mean : 0.00000
## 3rd Qu.: 0.63914
## Max. : 4.72270
## work_type.Govt_job work_type.Never_worked work_type.Private
## 0:4453 0:5088 0:2185
## 1: 657 1: 22 1:2925
##
##
##
##
## work_type.Self.employed smoking_status.formerly.smoked
## 0:4291 0:4225
## 1: 819 1: 885
##
##
##
##
## smoking_status.never.smoked smoking_status.smokes smoking_status.Unknown
## 0:3218 0:4321 0:3566
## 1:1892 1: 789 1:1544
##
##
##
##
The summary above shows all of the factor and numeric variables, including the one-hot encoded indicators for the factor variables that had more than 2 levels.
To properly test how well the machine learning models worked, the dataset was divided into two parts: a training set and a testing set. The training set was used to teach all of the models, and the testing set was used to see how well the models performed on data they had not been exposed to before. The same splitting methodology was also applied to the unprocessed dataset and to the dataset where the only preprocessing was the imputation of the bmi variable.
set.seed(1845)
original_split <- caTools::sample.split(stroke_data$stroke, SplitRatio = 0.75)
stroke_data_train <- subset(stroke_data, original_split == TRUE)
stroke_data_test <- subset(stroke_data, original_split == FALSE)
set.seed(1845)
original_split_unprocessed <- caTools::sample.split(stroke_data_unprocessed$stroke, SplitRatio = 0.75)
stroke_data_train_unprocessed <- subset(stroke_data_unprocessed, original_split_unprocessed == TRUE)
stroke_data_test_unprocessed <- subset(stroke_data_unprocessed, original_split_unprocessed == FALSE)
set.seed(1845)
original_split_unprocessed_bmi_imputed <- caTools::sample.split(stroke_data_unprocessed_bmi_imputed$stroke, SplitRatio = 0.75)
stroke_data_train_unprocessed_bmi_imputed <- subset(stroke_data_unprocessed_bmi_imputed, original_split_unprocessed_bmi_imputed == TRUE)
stroke_data_test_unprocessed_bmi_imputed <- subset(stroke_data_unprocessed_bmi_imputed, original_split_unprocessed_bmi_imputed == FALSE)
prop.table(table(select(stroke_data, stroke)))
## stroke
## 0 1
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train, stroke)))
## stroke
## 0 1
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test, stroke)))
## stroke
## 0 1
## 0.95144871 0.04855129
prop.table(table(select(stroke_data_unprocessed, stroke)))
## stroke
## 0 1
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train_unprocessed, stroke)))
## stroke
## 0 1
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test_unprocessed, stroke)))
## stroke
## 0 1
## 0.95144871 0.04855129
prop.table(table(select(stroke_data_unprocessed_bmi_imputed, stroke)))
## stroke
## 0 1
## 0.95127202 0.04872798
prop.table(table(select(stroke_data_train_unprocessed_bmi_imputed, stroke)))
## stroke
## 0 1
## 0.95121315 0.04878685
prop.table(table(select(stroke_data_test_unprocessed_bmi_imputed, stroke)))
## stroke
## 0 1
## 0.95144871 0.04855129
For the output above, stroke_data represents the data after it has been preprocessed, stroke_data_unprocessed represents the data with no preprocessing, and stroke_data_unprocessed_bmi_imputed represents the data where the only preprocessing was the imputation of the bmi variable. The class proportions shown above reveal a significant class imbalance in all of the datasets used in this report. SMOTE from the DMwR package was applied only to the training datasets; resampling the test sets would leak synthetic information into the evaluation and inflate the performance estimates. A short note on the SMOTE parameter settings follows the three balancing blocks below.
print("Balanced training data for `stroke_data`")
## [1] "Balanced training data for `stroke_data`"
set.seed(1845)
stroke_data_train <- SMOTE(stroke ~ ., data.frame(stroke_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train, stroke)))
## stroke
## 0 1
## 0.5 0.5
print("Balanced training data for `stroke_data_unprocessed`")
## [1] "Balanced training data for `stroke_data_unprocessed`"
set.seed(1845)
stroke_data_train_unprocessed <- SMOTE(stroke ~ ., data.frame(stroke_data_train_unprocessed), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train_unprocessed, stroke)))
## stroke
## 0 1
## 0.5 0.5
print("Balanced training data for `stroke_data_unprocessed_bmi_imputed`")
## [1] "Balanced training data for `stroke_data_unprocessed_bmi_imputed`"
set.seed(1845)
stroke_data_train_unprocessed_bmi_imputed <- SMOTE(stroke ~ ., data.frame(stroke_data_train_unprocessed_bmi_imputed), perc.over = 100, perc.under = 200)
prop.table(table(select(stroke_data_train_unprocessed_bmi_imputed, stroke)))
## stroke
## 0 1
## 0.5 0.5
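The parameter arithmetic behind the exact 50/50 splits above is worth spelling out. In DMwR's SMOTE, perc.over = 100 creates one synthetic minority case per original minority case, and perc.under = 200 then keeps majority cases amounting to 200% of the newly created synthetic cases. Using the approximate counts from the 75% training split (about 187 stroke cases out of roughly 3,833 rows):
# perc.over = 100  -> 187 synthetic stroke rows, so 187 + 187 = 374 minority rows
# perc.under = 200 -> 2 x 187 = 374 majority rows are kept
# Balanced training set: 374 + 374 = 748 rows, i.e. 50% in each class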
Practical Machine Learning in R states the following for decision tree models:
“…they are able to robustly handle outliers and noisy data. As you can start to see, decision trees require rather little of us in terms of data preparation.”
Therefore, it was decided to fit the unprocessed data to the decision tree model in order to compare results between preprocessed data and untouched data.
stroke_decision_tree_unprocessed <- rpart(
stroke ~ .,
method = "class",
data = stroke_data_train_unprocessed
)
rpart.plot(stroke_decision_tree_unprocessed)
Figure 6: Decision tree for the unprocessed stroke dataset using all of the available features.
varImp(stroke_decision_tree_unprocessed) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 7: Variable importance plot for the decision tree model fit to the unprocessed data which uses all of the available features.
stroke_decision_tree_pred_unprocessed <- predict(stroke_decision_tree_unprocessed, stroke_data_test_unprocessed, type = "class")
confusionMatrix(stroke_decision_tree_pred_unprocessed, stroke_data_test_unprocessed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 959 14
## 1 256 48
##
## Accuracy : 0.7886
## 95% CI : (0.7651, 0.8107)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1976
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.77419
## Specificity : 0.78930
## Pos Pred Value : 0.15789
## Neg Pred Value : 0.98561
## Prevalence : 0.04855
## Detection Rate : 0.03759
## Detection Prevalence : 0.23806
## Balanced Accuracy : 0.78175
##
## 'Positive' Class : 1
##
roc_decision_tree_unprocessed <- ROCR::prediction(
predictions = as.numeric(stroke_decision_tree_pred_unprocessed),
labels = stroke_data_test_unprocessed$stroke
)
roc_perf_decision_tree_unprocessed <- performance(roc_decision_tree_unprocessed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree_unprocessed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 8: ROC curve for the decision tree model fit to the unprocessed data using all of the available features.
auc_decision_tree_unprocessed <- performance(roc_decision_tree_unprocessed, measure = "auc")
stroke_decision_tree_auc_unprocessed <- unlist(slot(auc_decision_tree_unprocessed,"y.values"))
paste("Calculated AUC: ", stroke_decision_tree_auc_unprocessed)
## [1] "Calculated AUC: 0.781746979954865"
Here, the decision tree model was fit to the preprocessed data.
stroke_decision_tree <- rpart(
stroke ~ .,
method = "class",
data = stroke_data_train
)
rpart.plot(stroke_decision_tree)
Figure 9: Decision tree for the preprocessed stroke dataset using all of the available features.
varImp(stroke_decision_tree) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 10: Variable importance plot for the decision tree model fit to the preprocessed data which uses all of the available features.
stroke_decision_tree_pred <- predict(stroke_decision_tree, stroke_data_test, type = "class")
confusionMatrix(stroke_decision_tree_pred, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 891 10
## 1 324 52
##
## Accuracy : 0.7384
## 95% CI : (0.7134, 0.7624)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1681
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.83871
## Specificity : 0.73333
## Pos Pred Value : 0.13830
## Neg Pred Value : 0.98890
## Prevalence : 0.04855
## Detection Rate : 0.04072
## Detection Prevalence : 0.29444
## Balanced Accuracy : 0.78602
##
## 'Positive' Class : 1
##
roc_decision_tree <- ROCR::prediction(
predictions = predict(stroke_decision_tree, stroke_data_test, type = "prob")[, "1"],
labels = stroke_data_test$stroke
)
roc_perf_decision_tree <- performance(roc_decision_tree, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 11: ROC curve for the decision tree model fit to the preprocessed data using all of the available features.
auc_decision_tree <- performance(roc_decision_tree, measure = "auc")
stroke_decision_tree_auc <- unlist(slot(auc_decision_tree,"y.values"))
paste("Calculated AUC: ", stroke_decision_tree_auc)
## [1] "Calculated AUC: 0.781162883313421"
The svm function in R allowed for the generation of an SVM model that uses all of the features in the training set to predict stroke. Several different kernels were used in order to compare performance across kernels and against the other models in this report. These kernels include:

- linear
- polynomial
- radial basis
- sigmoid

These are all of the kernels available in the svm function in R; their functional forms are sketched below. Note that the svm function requires that datasets contain no missing data. Therefore, for the SVM fits labeled "Unprocessed", the bmi variable was imputed using the MICE algorithm.
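For reference, the kernel functions implemented by the svm function in the e1071 package (see ?svm) are listed below; the default hyperparameters shown are the ones reported in the model summaries that follow.
# linear:     u'v
# polynomial: (gamma * u'v + coef0)^degree
# radial:     exp(-gamma * |u - v|^2)
# sigmoid:    tanh(gamma * u'v + coef0)
# Defaults: degree = 3, coef0 = 0, gamma = 1 / (number of predictors)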
stroke_svm_linear_unprocessed_bmi_imputed <- svm(
stroke ~ .,
kernel = "linear",
type = "C-classification",
data = stroke_data_train_unprocessed_bmi_imputed
)
summary(stroke_svm_linear_unprocessed_bmi_imputed)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
## kernel = "linear", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 349
##
## ( 176 173 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_linear_unprocessed_bmi_imputed <- predict(stroke_svm_linear_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed %>% subset(select = -stroke), type = "class")
confusionMatrix(stroke_svm_pred_linear_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 938 16
## 1 277 46
##
## Accuracy : 0.7706
## 95% CI : (0.7465, 0.7934)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1715
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.74194
## Specificity : 0.77202
## Pos Pred Value : 0.14241
## Neg Pred Value : 0.98323
## Prevalence : 0.04855
## Detection Rate : 0.03602
## Detection Prevalence : 0.25294
## Balanced Accuracy : 0.75698
##
## 'Positive' Class : 1
##
roc_pred_linear_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_linear_unprocessed_bmi_imputed),
labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_linear_unprocessed_bmi_imputed <- performance(roc_pred_linear_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 12: ROC curve for the linear kernel SVM model fit to the unprocessed data using all of the available features.
auc_perf_linear_unprocessed_bmi_imputed <- performance(roc_pred_linear_unprocessed_bmi_imputed, measure = "auc")
stroke_svm_auc_linear_unprocessed_bmi_imputed <- unlist(slot(auc_perf_linear_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_svm_auc_linear_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.756975972388159"
stroke_svm_linear <- svm(
stroke ~ .,
kernel = "linear",
type = "C-classification",
data = stroke_data_train
)
summary(stroke_svm_linear)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "linear",
## type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 369
##
## ( 181 188 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_linear <- predict(stroke_svm_linear, stroke_data_test %>% subset(select = -stroke), type = "class")
confusionMatrix(stroke_svm_pred_linear, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 906 14
## 1 309 48
##
## Accuracy : 0.7471
## 95% CI : (0.7223, 0.7707)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1596
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.77419
## Specificity : 0.74568
## Pos Pred Value : 0.13445
## Neg Pred Value : 0.98478
## Prevalence : 0.04855
## Detection Rate : 0.03759
## Detection Prevalence : 0.27956
## Balanced Accuracy : 0.75994
##
## 'Positive' Class : 1
##
roc_pred_linear <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_linear),
labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_linear <- performance(roc_pred_linear, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 13: ROC curve for the linear kernel SVM model fit to the preprocessed data using all of the available features.
auc_perf_linear <- performance(roc_pred_linear, measure = "auc")
stroke_svm_auc_linear <- unlist(slot(auc_perf_linear,"y.values"))
paste("Calculated AUC: ", stroke_svm_auc_linear)
## [1] "Calculated AUC: 0.759936280366388"
stroke_svm_polynomial_unprocessed_bmi_imputed <- svm(
stroke ~ .,
kernel = "polynomial",
type = "C-classification",
data = stroke_data_train_unprocessed_bmi_imputed
)
summary(stroke_svm_polynomial_unprocessed_bmi_imputed)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
## kernel = "polynomial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 571
##
## ( 282 289 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_polynomial_unprocessed_bmi_imputed <- predict(stroke_svm_polynomial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_polynomial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 789 8
## 1 426 54
##
## Accuracy : 0.6601
## 95% CI : (0.6334, 0.6861)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1239
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.87097
## Specificity : 0.64938
## Pos Pred Value : 0.11250
## Neg Pred Value : 0.98996
## Prevalence : 0.04855
## Detection Rate : 0.04229
## Detection Prevalence : 0.37588
## Balanced Accuracy : 0.76018
##
## 'Positive' Class : 1
##
roc_pred_polynomial_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_polynomial_unprocessed_bmi_imputed),
labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_polynomial_unprocessed_bmi_imputed <- performance(roc_pred_polynomial_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 14: ROC curve for the polynomial kernel SVM model fit to the unprocessed data using all of the available features.
auc_perf_polynomial_unprocessed_bmi_imputed <- performance(roc_pred_polynomial_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_polynomial_unprocessed_bmi_imputed <- unlist(slot(auc_perf_polynomial_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_polynomial_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.760175228992433"
stroke_svm_polynomial <- svm(
stroke ~ .,
kernel = "polynomial",
type = "C-classification",
data = stroke_data_train
)
summary(stroke_svm_polynomial)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "polynomial",
## type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 551
##
## ( 274 277 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_polynomial <- predict(stroke_svm_polynomial, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_polynomial, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 739 8
## 1 476 54
##
## Accuracy : 0.621
## 95% CI : (0.5937, 0.6477)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1046
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.87097
## Specificity : 0.60823
## Pos Pred Value : 0.10189
## Neg Pred Value : 0.98929
## Prevalence : 0.04855
## Detection Rate : 0.04229
## Detection Prevalence : 0.41504
## Balanced Accuracy : 0.73960
##
## 'Positive' Class : 1
##
roc_pred_polynomial <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_polynomial),
labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_polynomial <- performance(roc_pred_polynomial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 15: ROC curve for the polynomial kernel SVM model fit to the preprocessed data using all of the available features.
auc_perf_polynomial <- performance(roc_pred_polynomial, measure = "auc")
stroke_auc_polynomial <- unlist(slot(auc_perf_polynomial,"y.values"))
paste("Calculated AUC: ", stroke_auc_polynomial)
## [1] "Calculated AUC: 0.739599097305191"
stroke_svm_radial_unprocessed_bmi_imputed <- svm(
stroke ~ .,
kernel = "radial",
type = "C-classification",
data = stroke_data_train_unprocessed_bmi_imputed
)
summary(stroke_svm_radial_unprocessed_bmi_imputed)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
## kernel = "radial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 380
##
## ( 191 189 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_radial_unprocessed_bmi_imputed <- predict(stroke_svm_radial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_radial_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 905 12
## 1 310 50
##
## Accuracy : 0.7478
## 95% CI : (0.7231, 0.7715)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1681
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80645
## Specificity : 0.74486
## Pos Pred Value : 0.13889
## Neg Pred Value : 0.98691
## Prevalence : 0.04855
## Detection Rate : 0.03915
## Detection Prevalence : 0.28191
## Balanced Accuracy : 0.77565
##
## 'Positive' Class : 1
##
roc_pred_radial_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_radial_unprocessed_bmi_imputed),
labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_radial_unprocessed_bmi_imputed <- performance(roc_pred_radial_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 16: ROC curve for the radial kernel SVM model fit to the unprocessed data using all of the available features.
auc_perf_radial_unprocessed_bmi_imputed <- performance(roc_pred_radial_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_radial_unprocessed_bmi_imputed <- unlist(slot(auc_perf_radial_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_radial_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.775653789990708"
stroke_svm_radial <- svm(
stroke ~ .,
kernel = "radial",
type = "C-classification",
data = stroke_data_train
)
summary(stroke_svm_radial)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "radial",
## type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 387
##
## ( 189 198 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_radial <- predict(stroke_svm_radial, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_radial, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 888 14
## 1 327 48
##
## Accuracy : 0.733
## 95% CI : (0.7078, 0.7571)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1487
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.77419
## Specificity : 0.73086
## Pos Pred Value : 0.12800
## Neg Pred Value : 0.98448
## Prevalence : 0.04855
## Detection Rate : 0.03759
## Detection Prevalence : 0.29366
## Balanced Accuracy : 0.75253
##
## 'Positive' Class : 1
##
roc_pred_radial <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_radial),
labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_radial <- performance(roc_pred_radial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 17: ROC curve for the radial kernel SVM model fit to the preprocessed data using all of the available features.
auc_perf_radial <- performance(roc_pred_radial, measure = "auc")
stroke_auc_radial <- unlist(slot(auc_perf_radial,"y.values"))
paste("Calculated AUC: ", stroke_auc_radial)
## [1] "Calculated AUC: 0.752528872958981"
stroke_svm_sigmoid_unprocessed_bmi_imputed <- svm(
stroke ~ .,
kernel = "sigmoid",
type = "C-classification",
data = stroke_data_train_unprocessed_bmi_imputed
)
summary(stroke_svm_sigmoid_unprocessed_bmi_imputed)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
## kernel = "sigmoid", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 382
##
## ( 192 190 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_sigmoid_unprocessed_bmi_imputed <- predict(stroke_svm_sigmoid_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed, type = "class")
confusionMatrix(stroke_svm_pred_sigmoid_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 912 13
## 1 303 49
##
## Accuracy : 0.7525
## 95% CI : (0.7279, 0.776)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.168
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79032
## Specificity : 0.75062
## Pos Pred Value : 0.13920
## Neg Pred Value : 0.98595
## Prevalence : 0.04855
## Detection Rate : 0.03837
## Detection Prevalence : 0.27565
## Balanced Accuracy : 0.77047
##
## 'Positive' Class : 1
##
roc_pred_sigmoid_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_sigmoid_unprocessed_bmi_imputed),
labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_sigmoid_unprocessed_bmi_imputed <- performance(roc_pred_sigmoid_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 18: ROC curve for the sigmoid kernel SVM model fit to the unprocessed data using all of the available features.
auc_perf_sigmoid_unprocessed_bmi_imputed <- performance(roc_pred_sigmoid_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_sigmoid_unprocessed_bmi_imputed <- unlist(slot(auc_perf_sigmoid_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_sigmoid_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.770469932297889"
stroke_svm_sigmoid <- svm(
stroke ~ .,
kernel = "sigmoid",
type = "C-classification",
data = stroke_data_train
)
summary(stroke_svm_sigmoid)
##
## Call:
## svm(formula = stroke ~ ., data = stroke_data_train, kernel = "sigmoid",
## type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 397
##
## ( 198 199 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
stroke_svm_pred_sigmoid <- predict(stroke_svm_sigmoid, stroke_data_test, type = "class")
confusionMatrix(stroke_svm_pred_sigmoid, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 887 10
## 1 328 52
##
## Accuracy : 0.7353
## 95% CI : (0.7102, 0.7593)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1656
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.83871
## Specificity : 0.73004
## Pos Pred Value : 0.13684
## Neg Pred Value : 0.98885
## Prevalence : 0.04855
## Detection Rate : 0.04072
## Detection Prevalence : 0.29757
## Balanced Accuracy : 0.78438
##
## 'Positive' Class : 1
##
roc_pred_sigmoid <- ROCR::prediction(
predictions = as.numeric(stroke_svm_pred_sigmoid),
labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 19: ROC curve for the sigmoid kernel SVM model fit to the preprocessed data using all of the available features.
auc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "auc")
stroke_auc_sigmoid <- unlist(slot(auc_perf_sigmoid,"y.values"))
paste("Calculated AUC: ", stroke_auc_sigmoid)
## [1] "Calculated AUC: 0.784375414841365"
The randomForest package was used to generate the random forest model. This model requires the user to supply a value for mtry, the number of features randomly selected as split candidates at each node.
Practical Machine Learning in R explains the following:
“Based on the documentation provided by the randomForest
package, the default value for mtry is the square root of
the number of features in the dataset when working on a classification
problem.”
Therefore, mtry was set to 3 for the stroke dataset.
Note that the rf method in the train function
from the caret package requires that datasets do not have
any missing data. Therefore, for this random forest model, the
bmi variable was imputed using the MICE algorithm.
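As a quick check of that heuristic for the unprocessed data: dropping id leaves 10 predictors, so the default works out to 3.
# Classification default: mtry = floor(sqrt(p)), with p = 10 predictors here
floor(sqrt(10))
## [1] 3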
rf_mod_unprocessed_bmi_imputed <- train(
stroke ~ .,
data = stroke_data_train_unprocessed_bmi_imputed,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 3)
)
plot(varImp(rf_mod_unprocessed_bmi_imputed), top = 10)
Figure 20: Variable importance plot for the random forest model fit to the unprocessed data generated using all of the available features.
rf_pred_unprocessed_bmi_imputed <- predict(rf_mod_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed)
confusionMatrix(rf_pred_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 929 17
## 1 286 45
##
## Accuracy : 0.7627
## 95% CI : (0.7384, 0.7858)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1603
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72581
## Specificity : 0.76461
## Pos Pred Value : 0.13595
## Neg Pred Value : 0.98203
## Prevalence : 0.04855
## Detection Rate : 0.03524
## Detection Prevalence : 0.25920
## Balanced Accuracy : 0.74521
##
## 'Positive' Class : 1
##
rf_pred_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(rf_pred_unprocessed_bmi_imputed),
labels = stroke_data_test_unprocessed_bmi_imputed$stroke
)
rf_perf_unprocessed_bmi_imputed <- performance(rf_pred_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(rf_perf_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 21: ROC curve for the random forest model fit to the unprocessed data.
auc_perf_rf_unprocessed_bmi_imputed <- performance(rf_pred_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_rf_unprocessed_bmi_imputed <- unlist(slot(auc_perf_rf_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_rf_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.745207752555423"
rf_mod <- train(
stroke ~ .,
data = stroke_data_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 3)
)
plot(varImp(rf_mod), top = 10)
Figure 22: Variable importance plot for the random forest model fit to the preprocessed data generated using all of the available features.
rf_pred <- predict(rf_mod, stroke_data_test)
confusionMatrix(rf_pred, stroke_data_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 906 17
## 1 309 45
##
## Accuracy : 0.7447
## 95% CI : (0.7199, 0.7684)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1458
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72581
## Specificity : 0.74568
## Pos Pred Value : 0.12712
## Neg Pred Value : 0.98158
## Prevalence : 0.04855
## Detection Rate : 0.03524
## Detection Prevalence : 0.27721
## Balanced Accuracy : 0.73574
##
## 'Positive' Class : 1
##
rf_pred <- ROCR::prediction(
predictions = as.numeric(rf_pred),
labels = stroke_data_test$stroke
)
rf_perf <- performance(rf_pred, measure = "tpr", x.measure = "fpr")
plot(rf_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 23: ROC curve for the random forest model fit to the preprocessed data.
auc_perf_rf <- performance(rf_pred, measure = "auc")
stroke_auc_rf <- unlist(slot(auc_perf_rf,"y.values"))
paste("Calculated AUC: ", stroke_auc_rf)
## [1] "Calculated AUC: 0.735742731979291"
Here, the neural network model was fit to the unprocessed dataset with the bmi variable imputed. The caret package in R allows one to fit multiple neural networks and then aggregate them, which is what was done in this report for both the unprocessed and preprocessed data. The code below sets up a grid of tuning parameters for the neural network, then trains the model using the specified parameters, dataset, and preprocessing steps. The resulting model (nnet_unprocessed_bmi_imputed) can be used to predict stroke risk from the input data.
set.seed(123)
nnetGrid_unprocessed_bmi_imputed <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
ctrl <- trainControl(method = "cv", number = 5)
nnet_unprocessed_bmi_imputed <- train(stroke ~ ., data = stroke_data_train_unprocessed_bmi_imputed,
method = "avNNet",
tuneGrid = nnetGrid_unprocessed_bmi_imputed,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"),
trace = FALSE,
linout = TRUE)
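For context on the training call above: avNNet fits several single-hidden-layer nnet models from different random starts and averages their predictions (caret's default is 5 repeats; repeats is not set explicitly here, so that default is an assumption). The grid being searched covers 3 decay values and 10 hidden-layer sizes, each candidate evaluated with 5-fold cross-validation:
# Size of the tuning grid searched by train(): 3 decay x 10 size x 1 bag
nrow(nnetGrid_unprocessed_bmi_imputed)
## [1] 30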
nnet_unprocessed_bmi_imputed_pred <- predict(nnet_unprocessed_bmi_imputed, stroke_data_test_unprocessed_bmi_imputed)
confusionMatrix(nnet_unprocessed_bmi_imputed_pred, as.factor(stroke_data_test_unprocessed_bmi_imputed$stroke), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 933 15
## 1 282 47
##
## Accuracy : 0.7674
## 95% CI : (0.7433, 0.7903)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1728
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75806
## Specificity : 0.76790
## Pos Pred Value : 0.14286
## Neg Pred Value : 0.98418
## Prevalence : 0.04855
## Detection Rate : 0.03681
## Detection Prevalence : 0.25764
## Balanced Accuracy : 0.76298
##
## 'Positive' Class : 1
##
roc_pred_nnet_unprocessed_bmi_imputed <- ROCR::prediction(
predictions = as.numeric(nnet_unprocessed_bmi_imputed_pred),
labels = as.numeric(stroke_data_test_unprocessed_bmi_imputed$stroke)
)
roc_perf_nnet_unprocessed_bmi_imputed <- performance(roc_pred_nnet_unprocessed_bmi_imputed, measure = "tpr", x.measure = "fpr")
plot(roc_perf_nnet_unprocessed_bmi_imputed, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 24: ROC curve for the neural network model fit to the unprocessed data using all of the available features
auc_perf_nnet_unprocessed_bmi_imputed <- performance(roc_pred_nnet_unprocessed_bmi_imputed, measure = "auc")
stroke_auc_nnet_unprocessed_bmi_imputed <- unlist(slot(auc_perf_nnet_unprocessed_bmi_imputed,"y.values"))
paste("Calculated AUC: ", stroke_auc_nnet_unprocessed_bmi_imputed)
## [1] "Calculated AUC: 0.762982875348467"
Here, the neural network model was fit to the preprocessed data. The code below sets up the same grid of tuning parameters, then trains the model using the specified parameters, dataset, and preprocessing steps. The resulting model (nnet) can be used to predict stroke risk from the input data.
set.seed(123)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
nnet <- train(stroke ~ ., data = stroke_data_train,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"),
trace = FALSE,
linout = TRUE)
nnet_pred <- predict(nnet, stroke_data_test)
confusionMatrix(nnet_pred, as.factor(stroke_data_test$stroke), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 918 17
## 1 297 45
##
## Accuracy : 0.7541
## 95% CI : (0.7295, 0.7775)
## No Information Rate : 0.9514
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1532
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72581
## Specificity : 0.75556
## Pos Pred Value : 0.13158
## Neg Pred Value : 0.98182
## Prevalence : 0.04855
## Detection Rate : 0.03524
## Detection Prevalence : 0.26782
## Balanced Accuracy : 0.74068
##
## 'Positive' Class : 1
##
roc_pred_nnet <- ROCR::prediction(
predictions = as.numeric(nnet_pred),
labels = as.numeric(stroke_data_test$stroke)
)
roc_perf_nnet <- performance(roc_pred_nnet, measure = "tpr", x.measure = "fpr")
plot(roc_perf_nnet, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 25: ROC curve for the neural network model fit to the preprocessed data using all of the available features
auc_perf_nnet <- performance(roc_pred_nnet, measure = "auc")
stroke_auc_nnet <- unlist(slot(auc_perf_nnet,"y.values"))
paste("Calculated AUC: ", stroke_auc_nnet)
## [1] "Calculated AUC: 0.740681003584229"
data1 <- tribble(
~"",~"Accuracy",~"Kappa",~"Sensitivity",~"Specificity",~"AUC",
"Decision Tree (Unprocessed)", "0.7886", "0.1976", "0.7742", "0.7893", "0.7817",
"Decision Tree (Preprocessed)", "0.7384","0.1681","0.8387","0.7333","0.7812",
"Linear Kernel SVM (Unprocessed)","0.7706","0.1715","0.7419","0.7720","0.7570",
"Linear Kernel SVM (Preprocessed)","0.7471","0.1596","0.7742","0.7457","0.7599",
"Polynomial Kernel SVM (Unprocessed)", "0.6601","0.1239","0.8710","0.6494","0.7602",
"Polynomial Kernel SVM (Preprocessed)", "0.6210","0.1046","0.8710","0.6082","0.7396",
"Radial Kernel SVM (Unprocessed)", "0.7478","0.1681","0.8064","0.7449","0.7756",
"Radial Kernel SVM (Preprocessed)", "0.7330","0.1487","0.7742","0.7309","0.7525",
"Sigmoid Kernel SVM (Unprocessed)", "0.7525","0.1680","0.7903","0.7506","0.7705",
"Sigmoid Kernel SVM (Preprocessed)", "0.7353","0.1656","0.8387","0.7300","0.7844",
"Random Forest (Unprocessed)", "0.7627","0.1603","0.7258","0.7646","0.7452",
"Random Forest (Preprocessed)", "0.7447","0.1458","0.7258","0.7459","0.7357",
"Neural Network (Unprocessed)", "0.7674","0.1728","0.7581","0.7679","0.7630",
"Neural Network (Preprocessed)", "0.7541","0.1532","0.7258","0.7556","0.7406"
)
knitr::kable((data1), booktabs = TRUE)
| | Accuracy | Kappa | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| Decision Tree (Unprocessed) | 0.7886 | 0.1976 | 0.7742 | 0.7893 | 0.7817 |
| Decision Tree (Preprocessed) | 0.7384 | 0.1681 | 0.8387 | 0.7333 | 0.7812 |
| Linear Kernel SVM (Unprocessed) | 0.7706 | 0.1715 | 0.7419 | 0.7720 | 0.7570 |
| Linear Kernel SVM (Preprocessed) | 0.7471 | 0.1596 | 0.7742 | 0.7457 | 0.7599 |
| Polynomial Kernel SVM (Unprocessed) | 0.6601 | 0.1239 | 0.8710 | 0.6494 | 0.7602 |
| Polynomial Kernel SVM (Preprocessed) | 0.6210 | 0.1046 | 0.8710 | 0.6082 | 0.7396 |
| Radial Kernel SVM (Unprocessed) | 0.7478 | 0.1681 | 0.8064 | 0.7449 | 0.7756 |
| Radial Kernel SVM (Preprocessed) | 0.7330 | 0.1487 | 0.7742 | 0.7309 | 0.7525 |
| Sigmoid Kernel SVM (Unprocessed) | 0.7525 | 0.1680 | 0.7903 | 0.7506 | 0.7705 |
| Sigmoid Kernel SVM (Preprocessed) | 0.7353 | 0.1656 | 0.8387 | 0.7300 | 0.7844 |
| Random Forest (Unprocessed) | 0.7627 | 0.1603 | 0.7258 | 0.7646 | 0.7452 |
| Random Forest (Preprocessed) | 0.7447 | 0.1458 | 0.7258 | 0.7459 | 0.7357 |
| Neural Network (Unprocessed) | 0.7674 | 0.1728 | 0.7581 | 0.7679 | 0.7630 |
| Neural Network (Preprocessed) | 0.7541 | 0.1532 | 0.7258 | 0.7556 | 0.7406 |
Table 1: Metrics for different model types
Based on Table 1, which contains performance metrics for various machine learning models applied to the stroke prediction dataset, several conclusions regarding the impact of preprocessing and the relative performance of different models can be drawn. The effects of each model and the impact of preprocessing are discussed in detail in this section.
For the decision tree model, preprocessing decreased accuracy and specificity but increased sensitivity, indicating a trade-off where the model became better at identifying true positives (stroke risks) at the expense of more false positives. Kappa, which measures agreement between predicted and actual classes, slightly decreased with preprocessing, suggesting a small reduction in overall predictive power. AUC remained almost the same, indicating that the model’s ability to discriminate between classes did not change significantly. For both decision tree models, age was the most critical predictor, indicating that it has the highest impact on predicting stroke risk. Being the main splitting variable signifies that age has the strongest association with stroke risk, and differentiating based on age results in the largest reduction in uncertainty about stroke risk. While average glucose level significantly contributes to prediction, it is less dominant compared to age. This highlights the role of glucose levels in stroke risk, potentially also linked to diabetes and metabolic health. Age and average glucose level are consistently the top two most important variables in both unprocessed and preprocessed models, suggesting that these variables are crucial predictors of stroke risk regardless of preprocessing. The dominance of age suggests that older individuals are at a significantly higher risk of stroke, which aligns with medical knowledge that age is a major risk factor due to the cumulative effects of other risk factors over time.
For the linear kernel SVM model, preprocessing slightly decreased accuracy and specificity but increased sensitivity. Kappa decreased, indicating a slight decline in predictive performance, while AUC increased marginally, suggesting a slight improvement in the model's ability to identify high-risk stroke patients. For the polynomial kernel, preprocessing decreased accuracy and specificity while sensitivity remained constant; kappa decreased, showing a reduction in agreement between predictions and actual outcomes, and AUC also decreased, indicating a decline in overall model performance. For the radial kernel SVM and neural network models, overall performance declined with preprocessing. For the sigmoid kernel SVM and random forest models, preprocessing decreased accuracy, kappa, and specificity; sensitivity increased for the sigmoid kernel SVM and was unchanged for the random forest. AUC increased slightly for the sigmoid kernel SVM but decreased for the random forest.
Both age and average glucose level consistently emerged as the most important predictors in the decision tree and random forest models (both preprocessed and unprocessed), indicating that these two variables are crucial determinants in predicting stroke risk regardless of preprocessing. The analysis reveals that age and average glucose level are the primary predictors of stroke risk across different models.
Generally, preprocessing led to a decrease in accuracy and specificity across most models, with mixed effects on sensitivity. Kappa values tended to decrease with preprocessing, indicating a small reduction in overall agreement between predictions and actual outcomes. The AUC metric, which reflects the overall ability of the model to distinguish between positive and negative classes, showed mixed results, with some models experiencing slight improvements and others a decline. Preprocessing had mixed effects on model performance. The impact of preprocessing needs to be evaluated case-by-case, considering the specific context and requirements of the model’s application. For tasks prioritizing sensitivity (identifying true stroke risks), the polynomial kernel SVM might be a good choice despite its lower overall accuracy and specificity. For tasks requiring a balance between accuracy and specificity, the unprocessed decision tree might be more suitable.
In a healthcare setting where accuracy is most important, the model with the highest accuracy metric should be preferred. Based on Table 1, the decision tree model with unprocessed data has the highest accuracy (0.7886) among all the models tested. High accuracy ensures that a majority of predictions (both positive and negative) are correct, which is critical for reliable decision-making. This model also demonstrates a good balance between sensitivity (0.7742) and specificity (0.7893). While sensitivity is important to identify patients at high risk of stroke (true positives), specificity is equally important to avoid false alarms (false positives). The selected model maintains a strong balance, meaning it can reliably detect at-risk patients without overburdening the healthcare system with false positives. The AUC of 0.7817 indicates a strong ability to distinguish between patients at high risk of stroke and those not at risk. Although not the highest AUC in the table, it is competitive and, when combined with high accuracy, suggests robust overall performance. Decision trees are inherently simple and interpretable models. In a healthcare setting, where decisions must be transparent and justifiable, the interpretability of a decision tree can be advantageous. Clinicians can understand and explain the decision-making process, which builds trust in the model’s predictions.
Accurate prediction of stroke risk is crucial in healthcare. High accuracy ensures that patients at genuine risk are correctly identified and can receive timely interventions, potentially saving lives and improving patient outcomes. High accuracy also minimizes false positives, reducing unnecessary stress and medical interventions for patients misclassified as high-risk, and it ensures that healthcare resources are focused on those who truly need them. The selected decision tree model's interpretability allows healthcare providers to understand and trust the model's decision-making process; this transparency is vital for clinical adoption and patient acceptance. By accurately identifying individuals at high risk of stroke, preventive measures can be taken, such as lifestyle modifications, medications, and regular monitoring. This can significantly reduce the incidence of strokes, which are often costly to treat: the average cost of treating a stroke can range from USD 30,000 to USD 120,000, considering acute treatment, rehabilitation, and long-term care. Preventing strokes can therefore result in substantial cost savings, since implementing preventive strategies (e.g., controlling hypertension, managing diabetes) is generally far less expensive than stroke treatment and rehabilitation. If a healthcare system can reduce the incidence of strokes by 10% through early identification and intervention, the potential savings could be enormous: if the average cost per stroke is USD 50,000 and the system prevents 1,000 strokes annually, this translates to USD 50 million in savings per year. Using the decision tree model in hospitals can also make things easier for doctors by providing helpful information right when they need it, enabling faster decision-making and reducing the burden of risk assessment.
In summary, the decision tree model has a significant impact on healthcare businesses. For stakeholders, the financial implications are substantial, as early identification and intervention can lead to significant cost savings by preventing expensive stroke treatments. Implementing such a model enhances patient care and supports the sustainability and efficiency of healthcare systems.