A stroke, also known as a brain attack, occurs when a blood vessel in the brain bursts or when the blood supply to part of the brain is blocked. According to the World Health Organization, stroke is a leading cause of death and disability globally. Every year more than 795,000 people in the United States have a stroke, and around 610,000 of these are first strokes. People who survive a stroke can experience different levels of disability, such as loss of vision or speech, paralysis, and confusion, and a stroke also places a financial and psychological burden on the patient's family.
Risk factors for stroke include high blood pressure, high cholesterol, heart disease, diabetes, obesity, cigarette smoking, alcohol consumption, an unbalanced diet, age, and gender. Some of these factors, such as high blood pressure, smoking, and diet, are modifiable and can be controlled to reduce the risk of stroke. As prevention is better than cure, it is important to identify patients at high risk of stroke and take preventive measures accordingly.
With the high availability of medical data today, this can be done with machine learning models, which can predict whether a person will have a stroke based on their lifestyle habits and physiological measurements.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v forcats 0.5.1
## v tidyr 1.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
filePath = "C:/Users/Teng/Desktop/R/healthcare-dataset-stroke-data.csv"
healthcare_dataset_stroke_data <- read_csv(filePath)
## Rows: 5110 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
## dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(healthcare_dataset_stroke_data)
## Rows: 5,110
## Columns: 12
## $ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434~
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"~
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, ~
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0~
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No~
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S~
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"~
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0~
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "~
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "~
## $ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
dim(healthcare_dataset_stroke_data)
## [1] 5110 12
head(healthcare_dataset_stroke_data)
## # A tibble: 6 x 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ... with 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Remove the id column, since it is only an identifier with no predictive value.
new_healthcare_df <- subset(healthcare_dataset_stroke_data, select = -c(id))
head(new_healthcare_df)
## # A tibble: 6 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Female 61 0 0 Yes Self-empl~ Rural
## 3 Male 80 0 1 Yes Private Rural
## 4 Female 49 0 0 Yes Private Urban
## 5 Female 79 1 0 Yes Self-empl~ Rural
## 6 Male 81 0 0 Yes Private Urban
## # ... with 4 more variables: avg_glucose_level <dbl>, bmi <chr>,
## # smoking_status <chr>, stroke <dbl>
# Check whether any cell contains an empty string
new_healthcare_df[new_healthcare_df == ""]
## <unspecified> [0]
# Count the "N/A" strings in each column and convert them to real NA values
cname <- names(new_healthcare_df)
for (i in cname){
  print(paste(i, sum(new_healthcare_df[i] == "N/A")))
  new_healthcare_df[!is.na(new_healthcare_df[i]) & new_healthcare_df[i] == "N/A", i] <- NA
}
## [1] "gender 0"
## [1] "age 0"
## [1] "hypertension 0"
## [1] "heart_disease 0"
## [1] "ever_married 0"
## [1] "work_type 0"
## [1] "Residence_type 0"
## [1] "avg_glucose_level 0"
## [1] "bmi 201"
## [1] "smoking_status 0"
## [1] "stroke 0"
Thus, we observe that the bmi column has 201 "N/A" values.
bmi_sorted = new_healthcare_df[which(is.na(new_healthcare_df$bmi)),]
move_bmi = bmi_sorted %>% dplyr::select("bmi", everything())
move_bmi
## # A tibble: 201 x 11
## bmi gender age hypertension heart_disease ever_married work_type
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 <NA> Female 61 0 0 Yes Self-employed
## 2 <NA> Female 59 0 0 Yes Private
## 3 <NA> Male 78 0 1 Yes Private
## 4 <NA> Male 57 0 1 No Govt_job
## 5 <NA> Male 58 0 0 Yes Private
## 6 <NA> Male 59 0 0 Yes Private
## 7 <NA> Female 63 0 0 Yes Private
## 8 <NA> Female 75 0 1 No Self-employed
## 9 <NA> Female 76 0 0 No Private
## 10 <NA> Male 78 1 0 Yes Private
## # ... with 191 more rows, and 4 more variables: Residence_type <chr>,
## # avg_glucose_level <dbl>, smoking_status <chr>, stroke <dbl>
Users on Kaggle have noted that the bmi attribute is not a very clear indicator of stroke. Hence, we remove the 201 rows with missing bmi values, which is only a small portion of the data.
clean_data = na.omit(new_healthcare_df)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Male 80 0 1 Yes Private Rural
## 3 Female 49 0 0 Yes Private Urban
## 4 Female 79 1 0 Yes Self-emp~ Rural
## 5 Male 81 0 0 Yes Private Urban
## 6 Male 74 1 1 Yes Private Rural
## 7 Female 69 0 0 No Private Urban
## 8 Female 78 0 0 Yes Private Urban
## 9 Female 81 1 0 Yes Private Rural
## 10 Female 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The gender, ever_married, and Residence_type columns will be transformed into binary data for the modelling step. We check the gender, Residence_type, ever_married, and work_type columns to see whether any of them have more than two distinct values.
unique_gender = unique(clean_data$gender)
unique_ever_married = unique(clean_data$ever_married)
unique_residence_type = unique(clean_data$Residence_type)
unique_work_type = unique(clean_data$work_type)
unique_gender
## [1] "Male" "Female" "Other"
unique_ever_married
## [1] "Yes" "No"
unique_residence_type
## [1] "Urban" "Rural"
unique_work_type
## [1] "Private" "Self-employed" "Govt_job" "children"
## [5] "Never_worked"
We notice that the gender column has three values: "Male", "Female", and "Other". We recode it as Male = 0, Female = 1, Other = 2.
clean_data = clean_data %>%
mutate(gender = recode(
gender,
"Male" = "0",
"Female" = "1",
"Other" = "2"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes Private Urban
## 2 0 80 0 1 Yes Private Rural
## 3 1 49 0 0 Yes Private Urban
## 4 1 79 1 0 Yes Self-emp~ Rural
## 5 0 81 0 0 Yes Private Urban
## 6 0 74 1 1 Yes Private Rural
## 7 1 69 0 0 No Private Urban
## 8 1 78 0 0 Yes Private Urban
## 9 1 81 1 0 Yes Private Rural
## 10 1 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The work_type column has five values: "Private", "Self-employed", "Govt_job", "children", and "Never_worked". We recode it as Private = 0, Self-employed = 1, Govt_job = 2, children = 3, Never_worked = 4.
clean_data = clean_data %>%
mutate(work_type = recode(
work_type,
"Private" = "0",
"Self-employed" = "1",
"Govt_job" = "2",
"children" = "3",
"Never_worked" = "4"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes 0 Urban
## 2 0 80 0 1 Yes 0 Rural
## 3 1 49 0 0 Yes 0 Urban
## 4 1 79 1 0 Yes 1 Rural
## 5 0 81 0 0 Yes 0 Urban
## 6 0 74 1 1 Yes 0 Rural
## 7 1 69 0 0 No 0 Urban
## 8 1 78 0 0 Yes 0 Urban
## 9 1 81 1 0 Yes 0 Rural
## 10 1 61 0 1 Yes 2 Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Transform the Residence_type and ever_married columns into binary data.
clean_data$Residence_type <- ifelse(clean_data$Residence_type == "Urban", 1, 0)
clean_data$ever_married <- ifelse(clean_data$ever_married == "Yes", 1, 0)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 0 67 0 1 1 0 1
## 2 0 80 0 1 1 0 0
## 3 1 49 0 0 1 0 1
## 4 1 79 1 0 1 1 0
## 5 0 81 0 0 1 0 1
## 6 0 74 1 1 1 0 0
## 7 1 69 0 0 0 0 1
## 8 1 78 0 0 1 0 1
## 9 1 81 1 0 1 0 0
## 10 1 61 0 1 1 2 0
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
To start the data modelling, let us assume that the data is not labelled and see what a clustering algorithm can tell us if we ask for 2 clusters in the dataset.
Ideally, the 2 clusters should be one with stroke and one without stroke. Clustering algorithms normally work best with continuous variables, but this dataset also contains categorical variables, so we use the Gower distance to measure the distance between two data points.
CRAN has detailed documentation about the Gower distance, available at https://cran.r-project.org/web/packages/gower/vignettes/intro.pdf
k-means clustering works by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre or centroid). All data points are treated as vectors, and the distance between data points can be measured in various ways (Euclidean, Manhattan, etc.).
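As a small illustration of the Gower dissimilarity described above (a hypothetical toy data frame, not part of the analysis pipeline), cluster::daisy() handles a mix of numeric and categorical columns by rescaling each variable to [0, 1] and averaging the per-variable dissimilarities; the same daisy() call is applied to the real data below.
library(cluster)
# Hypothetical mixed-type data: one numeric and one categorical column
toy <- data.frame(
  age    = c(30, 70, 31),
  smoker = factor(c("yes", "no", "yes"))
)
# Pairwise Gower dissimilarities between the three rows
daisy(toy, metric = "gower")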
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(compareGroups)
# install.packages('cluster')
# install.packages('factoextra')
# install.packages('compareGroups')
set.seed(666)
clean_data$gender <- as.factor(clean_data$gender)
clean_data$ever_married <- as.factor(clean_data$ever_married)
clean_data$Residence_type <- as.factor(clean_data$Residence_type)
clean_data$smoking_status <- as.factor(clean_data$smoking_status)
clean_data$stroke <- as.factor(clean_data$stroke)
clean_data$heart_disease <- as.factor(clean_data$heart_disease)
clean_data$hypertension <- as.factor(clean_data$hypertension)
clean_data$bmi <- as.numeric(clean_data$bmi)
df = subset(clean_data, select = c(bmi,avg_glucose_level, age,
hypertension, smoking_status,
stroke) )
df <- na.omit(df)
# Gower dissimilarity handles the mix of numeric and factor columns
distMat <- daisy(df, metric = "gower")
# Partition into 2 clusters, treating each row of the dissimilarity matrix as a feature vector
k2 <- kmeans(distMat, centers = 2, nstart = 25)
df$cluster<-k2$cluster
group<-compareGroups(cluster~.,data=df)
clustab<-createTable(group)
clustab
##
## --------Summary descriptives table by 'cluster'---------
##
## ______________________________________________________
## 1 2 p.overall
## N=691 N=4218
## ˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉ
## bmi 32.6 (8.27) 28.3 (7.62) <0.001
## avg_glucose_level 143 (65.3) 99.1 (36.4) <0.001
## age 64.4 (13.5) 39.3 (21.8) <0.001
## hypertension: 0.000
## 0 240 (34.7%) 4218 (100%)
## 1 451 (65.3%) 0 (0.00%)
## smoking_status: <0.001
## formerly smoked 203 (29.4%) 634 (15.0%)
## never smoked 286 (41.4%) 1566 (37.1%)
## smokes 132 (19.1%) 605 (14.3%)
## Unknown 70 (10.1%) 1413 (33.5%)
## stroke: <0.001
## 0 482 (69.8%) 4218 (100%)
## 1 209 (30.2%) 0 (0.00%)
## ˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉ
After that, we use a supervised learning method, classification, to predict whether an individual will have a stroke or not.
Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For classification, the output is the class selected by the most trees.
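As a minimal sketch of this majority vote (using the built-in iris data rather than the stroke data, and not part of the analysis pipeline), predict() with predict.all = TRUE exposes the votes of the individual trees:
library(randomForest)
set.seed(1)
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 25)
# Votes cast by each of the 25 trees for the first observation
votes <- predict(rf_demo, newdata = iris[1, ], predict.all = TRUE)
table(votes$individual)
# The forest returns the class chosen by the most trees
votes$aggregate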
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
# The stroke classes are highly imbalanced, so oversample the minority class to about 30% with ROSE
oversampled_data <- ovun.sample(as.factor(stroke)~.,data = clean_data, method = 'over',p = 0.3)$data
sample_index <- createDataPartition(oversampled_data$stroke, p = 0.7,
list = FALSE,
times = 1)
data_train <- oversampled_data[sample_index,]
data_test <- oversampled_data[-sample_index,]
rf_model <- randomForest(stroke~.,data = data_train,ntree = 1000,mtry = 5)
## Model performance on training data (out-of-bag predictions, since no newdata is supplied)
pred_train <- predict(rf_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.9849608
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.9796353
(result_train$byClass)['Specificity']
## Specificity
## 0.9972048
(result_train$byClass)['F1']
## F1
## 0.9891054
(result_train$byClass)['Recall']
## Recall
## 0.9796353
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(rf_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.9841819
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.9787234
(result_test$byClass)['Specificity']
## Specificity
## 0.9967374
(result_test$byClass)['F1']
## F1
## 0.9885387
(result_test$byClass)['Recall']
## Recall
## 0.9787234
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Test")
A Support Vector Machine works by constructing a hyperplane (or a set of hyperplanes) that separates the data points of the different classes with the largest possible margin.
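As a rough illustration (using a two-class subset of iris, not the stroke data, and not part of the analysis pipeline), a linear-kernel ksvm() finds such a maximum-margin hyperplane, which is determined by a small number of support vectors:
library(kernlab)
two_class <- droplevels(subset(iris, Species != "virginica"))
svm_demo <- ksvm(Species ~ Petal.Length + Petal.Width,
                 data = two_class, kernel = "vanilladot", C = 1)
svm_demo   # prints the training error and the number of support vectors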
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071)
svm_model <- ksvm(stroke~.,data = data_train, kernel="vanilladot")
## Setting default kernel parameters
## Model performance using training data
pred_train <- predict(svm_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.798136
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.8604863
(result_train$byClass)['Specificity']
## Specificity
## 0.6547869
(result_train$byClass)['F1']
## F1
## 0.8559335
(result_train$byClass)['Recall']
## Recall
## 0.8604863
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(svm_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7874444
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.8595745
(result_test$byClass)['Specificity']
## Specificity
## 0.6215334
(result_test$byClass)['F1']
## F1
## 0.8493343
(result_test$byClass)['Recall']
## Recall
## 0.8595745
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Test")
The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem, with the "naive" assumption that the features are conditionally independent given the class.
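As a minimal sketch of the Bayes' rule behind this classifier (a hypothetical toy data frame, not the stroke data), the posterior returned by naiveBayes() can be reproduced by multiplying the class prior by the per-feature conditional probabilities and normalising:
library(e1071)
toy <- data.frame(
  smoker  = factor(c("yes", "no", "no", "yes", "no", "yes")),
  married = factor(c("yes", "no", "yes", "no", "yes", "yes")),
  stroke  = factor(c(1, 1, 0, 0, 0, 1))
)
nb <- naiveBayes(stroke ~ ., data = toy)
new_obs <- data.frame(smoker = factor("yes", levels = levels(toy$smoker)),
                      married = factor("no", levels = levels(toy$married)))
# Posterior probabilities computed by the fitted model
predict(nb, new_obs, type = "raw")
# The same posterior by hand: prior * product of per-feature conditionals, normalised
prior  <- nb$apriori / sum(nb$apriori)
unnorm <- prior * nb$tables$smoker[, "yes"] * nb$tables$married[, "no"]
unnorm / sum(unnorm)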
nb_model <- naiveBayes(stroke~.,data = data_train)
## Model performance using training data
pred_train <- predict(nb_model, data_train)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.7665749
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.7829787
(result_train$byClass)['Specificity']
## Specificity
## 0.7288609
(result_train$byClass)['F1']
## F1
## 0.8237928
(result_train$byClass)['Recall']
## Recall
## 0.7829787
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(nb_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7572912
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.7815603
(result_test$byClass)['Specificity']
## Specificity
## 0.7014682
(result_test$byClass)['F1']
## F1
## 0.8178108
(result_test$byClass)['Recall']
## Recall
## 0.7815603
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Test")
With this dataset we can also perform regression on the average glucose level.
Linear regression works by fitting a linear equation relating the input variables to the output. Hence, linear regression may not work well on a dataset with a non-linear nature.
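As a small illustration of that limitation (simulated data, not the stroke dataset), fitting a straight line to a clearly non-linear relationship leaves a systematic pattern in the residuals:
set.seed(42)
x <- runif(200, 0, 10)
y <- x^2 + rnorm(200, sd = 5)   # quadratic, i.e. non-linear, relationship
fit <- lm(y ~ x)
# A curved band in this residual plot signals that a straight-line fit is inadequate
plot(fit$fitted.values, resid(fit), xlab = "Fitted values", ylab = "Residuals")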
# Linear regression with all variables
linear_regression_str <- lm(avg_glucose_level~.,data = data_train)
summary(linear_regression_str)
##
## Call:
## lm(formula = avg_glucose_level ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.72 -31.73 -10.31 25.23 167.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.78289 4.46810 10.694 < 2e-16 ***
## gender1 -8.62641 1.42474 -6.055 1.52e-09 ***
## gender2 55.74107 47.42137 1.175 0.239877
## age 0.42468 0.05193 8.178 3.69e-16 ***
## hypertension1 13.50414 2.07378 6.512 8.20e-11 ***
## heart_disease1 28.44731 2.60689 10.912 < 2e-16 ***
## ever_married1 5.22690 2.04181 2.560 0.010500 *
## work_type1 -6.69649 1.94008 -3.452 0.000562 ***
## work_type2 -3.27198 2.09254 -1.564 0.117969
## work_type3 22.14089 3.37243 6.565 5.76e-11 ***
## work_type4 11.64373 14.42166 0.807 0.419490
## Residence_type1 -0.51483 1.38203 -0.373 0.709525
## bmi 1.28412 0.10201 12.589 < 2e-16 ***
## smoking_statusnever smoked 0.82478 1.94800 0.423 0.672022
## smoking_statussmokes -0.12377 2.38445 -0.052 0.958605
## smoking_statusUnknown -2.71304 2.32437 -1.167 0.243184
## stroke1 14.13256 1.81960 7.767 9.81e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.34 on 4704 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1926
## F-statistic: 71.36 on 16 and 4704 DF, p-value: < 2.2e-16
# Linear regression keeping only the predictors with significant p-values
linear_regression_model <- lm(avg_glucose_level~gender+age+hypertension+heart_disease+bmi+stroke,data = data_train)
summary(linear_regression_model)
##
## Call:
## lm(formula = avg_glucose_level ~ gender + age + hypertension +
## heart_disease + bmi + stroke, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.60 -32.26 -10.98 25.06 167.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.19731 3.03674 20.811 < 2e-16 ***
## gender1 -9.11825 1.41950 -6.424 1.46e-10 ***
## gender2 47.85114 47.61372 1.005 0.315
## age 0.27898 0.03671 7.600 3.56e-14 ***
## hypertension1 13.90457 2.05859 6.754 1.61e-11 ***
## heart_disease1 29.77206 2.58662 11.510 < 2e-16 ***
## bmi 1.11732 0.09531 11.723 < 2e-16 ***
## stroke1 15.98919 1.79695 8.898 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.59 on 4713 degrees of freedom
## Multiple R-squared: 0.1852, Adjusted R-squared: 0.184
## F-statistic: 153 on 7 and 4713 DF, p-value: < 2.2e-16
pred_test <- predict(linear_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 137.08189
## 3 110.89 75.97577
## 10 205.84 149.63411
## 11 77.08 115.22496
## 12 57.08 139.59010
## 15 95.04 112.61279
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7416867
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.3498599
From the results, we can observe which variables are most important by looking at their p-values. The important features identified are:
- gender
- age
- hypertension
- heart_disease
- bmi
- stroke
Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For regression, the output is the mean (average) of the individual trees' predictions.
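As a minimal sketch of that averaging (built-in mtcars data, not the stroke data, and not part of the analysis pipeline), the forest's regression prediction equals the mean of the individual trees' predictions:
library(randomForest)
set.seed(1)
rf_reg_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 25)
p <- predict(rf_reg_demo, newdata = mtcars[1:3, ], predict.all = TRUE)
rowMeans(p$individual)   # average over the 25 trees...
p$aggregate              # ...matches the prediction returned by the forest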
rf_regression_model <- randomForest(avg_glucose_level~.,data = data_train,ntree = 1000,mtry = 5)
pred_test <- predict(rf_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 151.15280
## 3 110.89 89.30541
## 10 205.84 166.39514
## 11 77.08 90.70945
## 12 57.08 104.59448
## 15 95.04 101.60528
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.8337048
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2168683
# Support Vector Regression on the same training data
svm_regression_model <- svm(avg_glucose_level~.,data = data_train)
pred_test <- predict(svm_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 135.16215
## 3 110.89 83.44734
## 10 205.84 184.80977
## 11 77.08 104.01751
## 12 57.08 102.74096
## 15 95.04 105.42116
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7793447
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2720404
The k-means clustering algorithm identified two clusters: one without stroke and one with around 30% occurrence of stroke.
We can observe that the cluster without stroke has
- a lower average blood glucose level
- a lower rate of hypertension
- a lower smoking rate
- a lower bmi
- a lower average age
We experimented with 3 classification algorithms (Random Forest, Support Vector Machine, and Naive Bayes) to predict whether a patient will have a stroke.
The Random Forest classifier performed the best, with accuracy above 95% and very few false positives and false negatives.
The performance of the Support Vector and Naive Bayes classifiers is similar.
However, the models' parameters were not tuned, so we can expect a performance improvement for the Support Vector classifier after hyperparameter tuning.
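One possible tuning sketch (illustrative only and not run as part of this report; the grid size and resampling scheme are assumptions) is a cross-validated search with caret, which wraps kernlab's ksvm:
library(caret)
set.seed(666)
# 5-fold cross-validation to compare candidate SVM settings
ctrl <- trainControl(method = "cv", number = 5)
svm_tuned <- train(stroke ~ ., data = data_train,
                   method = "svmRadial",   # RBF kernel; "svmLinear" would tune the cost only
                   tuneLength = 5,         # let caret try 5 candidate cost values
                   trControl = ctrl)
svm_tuned$bestTune                         # best (sigma, C) combination found by CV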
We experimented with 3 regression algorithms (Linear Regression, Support Vector Regression, and Random Forest Regression) to predict the average glucose level of a patient.
Random Forest Regression performed the best, with the lowest mean absolute percentage error (below 0.25).
Linear Regression performed relatively poorly, which might be due to the non-linear nature of the dataset.
Support Vector Regression also performed relatively poorly, but since it is able to model non-linear data, further hyperparameter tuning might improve its regression performance.