This is meant to be a tutorial on how to check the prediction power of your logistic regresion/classification model. The goal is to predict if a patient will visit the dentist based on a few other variables. I will leave the variable selection up to the reader. I chose the independent variables: “sex”, “insured”, “meds”, “children”, and “employ”.
Here’s a peek of the data set:
## dental.visit sex insured meds children employ
## 1 Yes female Yes Yes 0 Employed_for_wages
## 2 Yes male Yes Yes 0 Retired
## 3 No female Yes Yes 0 Retired
Using the package caret, we create a training set (“Train”) and a testing set (“Test”) out of our data set. The training set will be 2/3’s of the full data set, and the testing set will be 1/3. The function we us is createDataPartition, and we generate the partitions randomly conditioning on the variable dental.visit, the dependent variable.
set.seed(3456)
trainIndex <- createDataPartition(data$dental.visit, p = .67,
list = FALSE,
times = 1)
Train <- data[ trainIndex,]
Test <- data[-trainIndex,]
Let’s take a quick peak at the training set:
## dental.visit sex insured meds children employ
## 1 Yes female Yes Yes 0 Employed_for_wages
## 2 Yes male Yes Yes 0 Retired
## 3 No female Yes Yes 0 Retired
And also at the testing set:
## dental.visit sex insured meds children employ
## 11 Yes female Yes No 0 Employed_for_wages
## 13 Yes female Yes No 0 Employed_for_wages
## 15 Yes female Yes Yes 0 Retired
Next, we carry out the logistic regression on the training set. For that we use the glm function, where glm stands for general linear model. We have to specify the name of the family = binomial, since the distribution of the dependent variable is binomial.
# Logistic Regression Model
model <- glm(dental.visit ~ sex + insured + meds + children + employ,
data = Train, family=binomial)
#summary(model)
The idea is to find how well the logistic regression model will do when compared to the testing set. Somehow we have to quantify the results of the model. The function predict, applied to the model and the training set will give us a probability for every observation. We save those probabilities to the testing set under the variable “model_pred”.
Test$model_prob <- predict(model, Test, type = "response")
Once we choose the threshold for your decision, we can transform those probabilities into successes and failures (1’s and 0’s), and we save them under the variable “model_pred”.
Similarly, we tranform the “Yes” and “No” in the variable dental.visit to 1’s and 0’s, we call the variable “visit_binary”.
Test <- Test %>% mutate(model_pred = 1*(model_prob > .53) + 0,
visit_binary = 1*(dental.visit == "Yes") + 0)
Take a peak at the important columns of our training set after being updated:
## dental.visit model_prob model_pred visit_binary
## 1 Yes 0.7406852 1 1
## 2 Yes 0.7406852 1 1
## 3 Yes 0.7614214 1 1
Finally we compare the newly created columns “model_pred” and “visit_binary” to calculate the accuracy of our model.
Test <- Test %>% mutate(accurate = 1*(model_pred == visit_binary))
sum(Test$accurate)/nrow(Test)
## [1] 0.7749775
That’s it! Hope you enjoyed it. Questions at juanmurillopacheco@gmail.com