## packages used in this analysis
library(MASS)    # stepAIC()
library(psych)   # pairs.panels()
library(dplyr)   # mutate(), case_when()
library(knitr)   # kable()
stroke <- read.csv("C:/Users/qinfa/Desktop/school/STA 321/stroke.csv")
## recode the literal "N/A" strings in the raw file as true NA values
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) is.na(x) <- x == "N/A"
  x
}
stroke[] <- lapply(stroke, make.true.NA)
## bmi is read in as character because of the "N/A" entries; coerce it to numeric
stroke$bmi <- as.numeric(stroke$bmi)
## remove rows with missing values
stroke <- na.omit(stroke)
I chose a dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/) that contains information about hospital patients and whether or not each patient had a stroke. The author of the dataset chose not to disclose its source. The dataset includes the following variables:
id (integer) - the unique identifying number given to each patient
gender (categorical) - the gender of the patient, male, female, or other
age (continuous) - the age of the patient
hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does
heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does
ever_married (categorical) - Yes if the patient has been married before, No if not
work_type (categorical) - the type of work done by the patient
Residence_type (categorical) - the area in which the patient lived, Urban or Rural
avg_glucose_level (continuous) - the average blood glucose level of the patient
bmi (continuous) - the body mass index (BMI) of the patient
smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”
stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has
I chose to eliminate all missing values from the dataset.
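Before modeling, a quick structural check of the cleaned data confirms the variable types and the class balance of the response (a minimal sketch; it assumes the cleaning chunk above has been run):
str(stroke)          # variable names and types after cleaning
dim(stroke)          # number of patients and variables remaining after na.omit()
table(stroke$stroke) # class balance of the response: strokes are relatively rare
The imbalance in the response is worth keeping in mind when a classification cutoff is chosen later.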
The objective of this assignment is to build a logistic regression model to predict stroke in a patient according to various risk factors.
To begin, we examine our predictor variables for possible issues:
## pairs.panels() is from the psych package; column 12 (the response, stroke) is excluded
pairs.panels(stroke[, -12],
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE  # show correlation ellipses
             )
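As a numeric complement to the pairs panel, the pairwise Pearson correlations among the continuous predictors can also be printed directly (a small sketch using the original, unstandardized columns):
## pairwise correlations among the continuous predictors
round(cor(stroke[, c("age", "avg_glucose_level", "bmi")]), 2)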
Based on the pairs panel, multicollinearity does not appear to be a major issue for this dataset. Next, we take a closer look at our continuous variables.
par(mfrow=c(1,3))
hist(stroke$age, xlab="age", main = "")
hist(stroke$bmi, xlab="bmi", main = "")
hist(stroke$avg_glucose_level, xlab = "average glucose level", main = "")
Looking at these histograms, we find that age and bmi are unimodal (with bmi having some right skew), but avg_glucose_level is bimodal. Therefore, we make the decision to discretize this variable.
## discretize average glucose level into groups
glucose = stroke$avg_glucose_level
stroke <- stroke %>% mutate(
  grp.glucose = case_when(
    glucose > 0   & glucose <= 50  ~ "0-50",
    glucose > 50  & glucose <= 100 ~ "50-100",
    glucose > 100 & glucose <= 150 ~ "100-150",
    glucose > 150 & glucose <= 200 ~ "150-200",
    glucose > 200 & glucose <= 300 ~ "200-300"
  )
)
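A quick tabulation of the new variable verifies that every patient falls into one of the defined groups (a sanity check, not part of the original analysis):
## counts per glucose group; useNA = "ifany" would expose any values that fell outside the defined ranges
table(stroke$grp.glucose, useNA = "ifany")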
We proceed by standardizing the numerical predictor variables.
## standardizing the numerical variables (z-scores)
stroke$sd.age = (stroke$age - mean(stroke$age))/sd(stroke$age)
stroke$sd.bmi = (stroke$bmi - mean(stroke$bmi))/sd(stroke$bmi)
## drop the original numerical variables: age (col 3), avg_glucose_level (col 9), bmi (col 10)
sd.stroke = stroke[, -c(3, 9, 10)]
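The same z-scores can also be obtained with base R's scale(); the check below is only an illustration of that equivalence and is not part of the analysis:
## scale() centers each column and divides by its sample standard deviation
all.equal(as.numeric(scale(stroke$age)), stroke$sd.age) # should return TRUE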
We randomly split the data into training and testing sets: 70% of the dataset will be used to identify our model with the cross-validation method, and the reserved testing data will then be used to assess our final model.
## splitting data: 70% training and 30% testing
n <- dim(sd.stroke)[1]
train.n <- round(0.7*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- sd.stroke[train.id, ]
test <- sd.stroke[-train.id, ]
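Because sample() draws at random, the exact split (and therefore the numbers reported below) will vary from run to run. For a reproducible report, a seed can be fixed immediately before the sample() call above; the value used here is arbitrary and was not part of the original code:
set.seed(321) # arbitrary seed for reproducibility; place before the sample() call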
Based on the full, reduced, and stepwise models that were created in last week's assignment, we will use 5-fold cross-validation on the training data to search for our final model. For classification, we will use a cutoff of -2.0 on the log-odds (logit) scale for each model, which corresponds to a predicted probability of roughly 0.12 (see the quick check below).
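To see what the -2.0 log-odds cutoff means on the probability scale, it can be converted with the inverse logit function (a quick check, separate from the model-fitting code):
## inverse logit of the cutoff: exp(-2)/(1 + exp(-2)) is about 0.119
plogis(-2)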
## 5-fold cross-validation
k = 5
fold.size = round(dim(train)[1]/k)
## prediction error (PE) vectors for the three candidate models
PE1 = rep(0, k)
PE2 = rep(0, k)
PE3 = rep(0, k)
for(i in 1:k){
  ## validation fold and training folds for this iteration
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id, ]
  ## full model, fitted on the training folds only
  candidate01 = glm(stroke ~ gender + sd.age + hypertension + heart_disease + ever_married + work_type + Residence_type + sd.bmi + smoking_status + grp.glucose,
                    family = binomial(link = "logit"), data = train.dat)
  ## reduced model
  candidate03 = glm(stroke ~ hypertension + sd.bmi + smoking_status + grp.glucose,
                    family = binomial(link = "logit"), data = train.dat)
  ## stepwise model searched between the reduced and full formulas
  candidate02 = stepAIC(candidate01,
                        scope = list(lower = formula(candidate03), upper = formula(candidate01)),
                        direction = "forward", # forward selection
                        trace = 0              # do not show the details
                        )
  ## predicted log-odds for the validation fold (the -2.0 cutoff is on this scale)
  pred01 = predict(candidate01, newdata = valid, type = "link")
  pred02 = predict(candidate02, newdata = valid, type = "link")
  pred03 = predict(candidate03, newdata = valid, type = "link")
  ## confusion matrices: ftable() cross-tabulates predicted class (rows) against true stroke status (columns)
  confusion.matrix01 = ftable(as.vector(pred01 > -2), valid$stroke == 1)
  confusion.matrix02 = ftable(as.vector(pred02 > -2), valid$stroke == 1)
  confusion.matrix03 = ftable(as.vector(pred03 > -2), valid$stroke == 1)
  ## prediction error = (false negatives + false positives) / size of the validation fold
  PE1[i] = (confusion.matrix01[1,2] + confusion.matrix01[2,1])/length(pred01)
  PE2[i] = (confusion.matrix02[1,2] + confusion.matrix02[2,1])/length(pred02)
  PE3[i] = (confusion.matrix03[1,2] + confusion.matrix03[2,1])/length(pred03)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.1126638 | 0.1126638 | 0.0649199 |
We find that the average prediction error is the same for models 1 and 2. Because the second model has fewer predictors than model 1, we choose it as our final model.
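Before assessing it on the test data, the selected model's coefficients can be inspected on the odds-ratio scale. This is a sketch using the candidate02 object left from the last cross-validation iteration, which is also the object used for the test-set predictions below:
## estimated odds ratios with Wald 95% confidence intervals for the selected model
round(exp(cbind(OR = coef(candidate02), confint.default(candidate02))), 3)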
To assess the final model, we compute its accuracy on the reserved test data:
## predicted log-odds for the reserved test data, classified with the same -2.0 cutoff
pred02 = predict(candidate02, newdata = test, type = "link")
confusion.matrix02 = ftable(as.vector(pred02 > -2.0), test$stroke == 1)
## accuracy = (true negatives + true positives) / size of the test set
accuracy = (confusion.matrix02[1,1] + confusion.matrix02[2,2])/length(pred02)
kable(data.frame(accuracy = accuracy), caption = "The actual accuracy of the final model")
| accuracy |
|---|
| 0.8879837 |
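Because strokes are rare, overall accuracy alone can be misleading, so the sensitivity and specificity can also be read from the same confusion matrix. This sketch assumes both predicted classes occur in the test set, so that ftable() returns a 2x2 table with predicted class in the rows and true stroke status in the columns:
## sensitivity = true positives / actual strokes; specificity = true negatives / actual non-strokes
sensitivity = confusion.matrix02[2,2]/(confusion.matrix02[2,2] + confusion.matrix02[1,2])
specificity = confusion.matrix02[1,1]/(confusion.matrix02[1,1] + confusion.matrix02[2,1])
c(sensitivity = sensitivity, specificity = specificity)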
In summary, we used the cross-validation method to choose our final model from the three candidate models generated last week. This procedure selected the second model, which was created through automatic variable selection, and we then assessed its predictive accuracy on the reserved testing data.