1 Description of the Dataset

## packages used below: psych (pairs.panels), dplyr (mutate, case_when),
## MASS (stepAIC), and knitr (kable)
library(psych)
library(dplyr)
library(MASS)
library(knitr)

stroke <- read.csv("C:/Users/qinfa/Desktop/school/STA 321/stroke.csv")

## recode the literal "N/A" strings in the dataset as true NA values
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) is.na(x) <- x == "N/A"
  x
}
stroke[] <- lapply(stroke, make.true.NA)

## coerce bmi from character to numeric
stroke$bmi <- as.numeric(stroke$bmi)

## remove records with missing values
stroke <- na.omit(stroke)

I chose a dataset, available at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/, that includes identifying information about hospital patients and whether or not they had strokes. The author of the dataset chose not to disclose its source. The dataset includes the following variables:

id (integer) - the unique identifying number given to each patient

gender (categorical) - the gender of the patient, male, female, or other

age (continuous) - the age of the patient

hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does

heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does

ever_married (categorical) - Yes if the patient has been married before, No if not

work_type (categorical) - the type of work done by the patient

Residence_type (categorical) - the area in which the patient lived, urban or rural

avg_glucose_level (continuous) - the average blood glucose level of the patient

bmi (continuous) - the body mass index (BMI) of the patient

smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”

stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has

I chose to remove all records containing missing values from the dataset.
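As a quick check (a minimal sketch using the stroke data frame created above), we can confirm that no missing values remain and see how many records are left after cleaning:

## verify that the cleaned data contain no remaining NAs and note the final dimensions
colSums(is.na(stroke))
dim(stroke)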

2 Research Question

The objective of this assignment is to build a logistic regression model that predicts stroke in a patient based on various risk factors.

3 Exploratory Analysis

To begin, we examine our predictor variables for possible issues:

pairs.panels(stroke[,-12],  # exclude column 12, the response variable (stroke)
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

We find that multicollinearity is not a major issue for this dataset. We will take a closer look at our continuous variables.
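As a numeric complement to the scatterplot matrix (a minimal sketch reporting only pairwise Pearson correlations), we can confirm that the continuous predictors are not strongly correlated with one another:

## pairwise correlations among the continuous predictors
round(cor(stroke[, c("age", "avg_glucose_level", "bmi")]), 2)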

par(mfrow=c(1,3))
hist(stroke$age, xlab="age", main = "")
hist(stroke$bmi, xlab="bmi", main = "")
hist(stroke$avg_glucose_level, xlab = "average glucose level", main = "")

Looking at these histograms, we find that age and bmi are unimodal (with bmi showing some right skew), but avg_glucose_level is bimodal. Therefore, we decide to discretize this variable into intervals.

stroke <- stroke %>% mutate(
  grp.glucose = case_when(
    avg_glucose_level > 0   & avg_glucose_level <= 50  ~ "0-50",
    avg_glucose_level > 50  & avg_glucose_level <= 100 ~ "50-100",
    avg_glucose_level > 100 & avg_glucose_level <= 150 ~ "100-150",
    avg_glucose_level > 150 & avg_glucose_level <= 200 ~ "150-200",
    avg_glucose_level > 200 & avg_glucose_level <= 300 ~ "200-300"
  )
)
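As a check on the discretization (a minimal sketch), we can tabulate the new grp.glucose variable; any observation falling outside the defined intervals would appear as NA:

## counts per glucose group; useNA flags values outside the defined intervals
table(stroke$grp.glucose, useNA = "ifany")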

We proceed by standardizing the numerical predictor variables.

4 Standardizing Numerical Variables

## standardizing the numerical variables age and bmi
stroke$sd.age <- (stroke$age - mean(stroke$age)) / sd(stroke$age)
stroke$sd.bmi <- (stroke$bmi - mean(stroke$bmi)) / sd(stroke$bmi)

## drop the original age, avg_glucose_level, and bmi columns (columns 3, 9, and 10)
sd.stroke <- stroke[, -c(3, 9, 10)]
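A quick check (a minimal sketch) confirms that the standardized variables have mean 0 and standard deviation 1:

## standardized variables should have mean 0 and standard deviation 1
round(c(mean(sd.stroke$sd.age), sd(sd.stroke$sd.age),
        mean(sd.stroke$sd.bmi), sd(sd.stroke$sd.bmi)), 3)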

5 Training and Testing the Data

We randomly split the data into training and testing sets. 70% of the dataset will be used to identify our model with the cross-validation method, and the reserved 30% will then be used to assess our final model.

## splitting data: 70% training and 30% testing
set.seed(321)   # fix the random split for reproducibility (the seed value is arbitrary)
n <- dim(sd.stroke)[1]
train.n <- round(0.7*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- sd.stroke[train.id, ]
test <- sd.stroke[-train.id, ]
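Because strokes are rare in this dataset, it is worth confirming (a minimal sketch) that the random split leaves a similar proportion of stroke cases in both subsets:

## proportion of stroke cases in the training and testing sets
prop.table(table(train$stroke))
prop.table(table(test$stroke))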

6 Candidate Models

Based on the full, reduced, and stepwise models that were created in last week’s assignment, we will use 5-fold cross-validation to select our final model. We will use a cutoff of -2.0 for each of the models.

## 5-fold cross-validation
k=5
fold.size = round(dim(train)[1]/k)
## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){
  ## Training and testing folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id,]
  ## full model, fit on the training folds only
  candidate01 = glm(stroke ~ gender + sd.age + hypertension + heart_disease + 
                      ever_married + work_type + Residence_type + sd.bmi + 
                      smoking_status + grp.glucose, 
                    family = binomial(link = "logit"), data = train.dat)
  ## reduced model, fit on the training folds only
  candidate03 = glm(stroke ~ hypertension + sd.bmi + smoking_status + grp.glucose, 
                    family = binomial(link = "logit"), data = train.dat)
  ## stepwise model: forward selection from the reduced model toward the full model
  candidate02 = stepAIC(candidate03, 
                      scope = list(lower=formula(candidate03),upper=formula(candidate01)),
                      direction = "forward",   # forward selection
                      trace = 0                # do not show the details
                      )
  ##  predicted probabilities of each candidate model
   pred01 = predict(candidate01, newdata = valid, type="response")
   pred02 = predict(candidate02, newdata = valid, type="response")
   pred03 = predict(candidate03, newdata = valid, type="response")
  ## confusion matrices: ftable() cross-tabulates the predicted class against the observed stroke status
   confusion.matrix01 = ftable(as.vector((pred01>-2)),(valid$stroke==1))
   confusion.matrix02 = ftable(as.vector((pred02>-2)),(valid$stroke==1))
   confusion.matrix03 = ftable(as.vector((pred03>-2)),(valid$stroke==1))
   PE1[i] = (confusion.matrix01[1,2] + confusion.matrix01[2,1])/length(pred01)
   PE2[i] = (confusion.matrix02[1,2] + confusion.matrix02[2,1])/length(pred02)
   PE3[i] = (confusion.matrix03[1,2] + confusion.matrix03[2,1])/length(pred03)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
Average of prediction errors of candidate models

      PE1         PE2         PE3
0.1126638   0.1126638   0.0649199

We find that the average prediction error is the same for models 1 and 2. Because the second model has fewer predictors than model 1, we choose it as our final model.
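To see which predictors the selected model retains, we can inspect its formula and coefficient estimates (a minimal sketch; candidate02 here is the model object left over from the final iteration of the loop above):

## predictors retained by the selected model and their estimated coefficients
formula(candidate02)
round(summary(candidate02)$coefficients, 4)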

Using the reserved testing data, we assess the final accuracy of the model:

pred02 = predict(candidate02, newdata = test, type="response")
confusion.matrix02 = ftable(as.vector((pred02>-2.0)),(test$stroke== 1))
accuracy = (confusion.matrix02[1,1] + confusion.matrix02[2,2])/length(pred02)
kable(data.frame(accuracy = accuracy), caption="The actual accuracy of the final model")
The actual accuracy of the final model

 accuracy
0.8879837
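Overall accuracy can be misleading when strokes are rare, so it may also be worth reporting sensitivity and specificity. The sketch below is illustrative only: it uses a conventional probability cutoff of 0.5 rather than the cutoff used above.

## illustrative only: sensitivity and specificity at a 0.5 probability cutoff
pred.class <- factor(as.numeric(pred02 > 0.5), levels = c(0, 1))
obs.class  <- factor(test$stroke, levels = c(0, 1))
tab <- table(Predicted = pred.class, Observed = obs.class)
sensitivity <- tab["1", "1"] / sum(tab[, "1"])   # true positive rate among actual strokes
specificity <- tab["0", "0"] / sum(tab[, "0"])   # true negative rate among non-stroke patients
c(sensitivity = sensitivity, specificity = specificity)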

7 Summary

In summary, we used the cross-validation method to choose our final model from among the three candidate models generated last week. This procedure selected the second model, the one created through automatic variable selection, and we then assessed its predictive accuracy on the reserved testing data.