This is meant to be a tutorial on how to check the prediction power of your logistic regresion/classification model. The goal is to predict if a patient will visit the dentist based on a few other variables. I will leave the variable selection up to the reader. I chose the independent variables: “sex”, “insured”, “meds”, “children”, and “employ”.

Here’s a peek of the data set:

##   dental.visit    sex insured meds children             employ
## 1          Yes female     Yes  Yes        0 Employed_for_wages
## 2          Yes   male     Yes  Yes        0            Retired
## 3           No female     Yes  Yes        0            Retired

Using the package caret, we create a training set (“Train”) and a testing set (“Test”) out of our data set. The training set will be 2/3’s of the full data set, and the testing set will be 1/3. The function we us is createDataPartition, and we generate the partitions randomly conditioning on the variable dental.visit, the dependent variable.

set.seed(3456)
trainIndex <- createDataPartition(data$dental.visit, p = .67,
                                  list = FALSE,
                                  times = 1)

Train <- data[ trainIndex,]
Test  <- data[-trainIndex,]

Let’s take a quick peak at the training set:

##   dental.visit    sex insured meds children             employ
## 1          Yes female     Yes  Yes        0 Employed_for_wages
## 2          Yes   male     Yes  Yes        0            Retired
## 3           No female     Yes  Yes        0            Retired

And also at the testing set:

##    dental.visit    sex insured meds children             employ
## 11          Yes female     Yes   No        0 Employed_for_wages
## 13          Yes female     Yes   No        0 Employed_for_wages
## 15          Yes female     Yes  Yes        0            Retired

Next, we carry out the logistic regression on the training set. For that we use the glm function, where glm stands for general linear model. We have to specify the name of the family = binomial, since the distribution of the dependent variable is binomial.

# Logistic Regression Model
model <- glm(dental.visit ~ sex + insured + meds + children + employ, 
            data = Train, family=binomial)

#summary(model)

The idea is to find how well the logistic regression model will do when compared to the testing set. Somehow we have to quantify the results of the model. The function predict, applied to the model and the training set will give us a probability for every observation. We save those probabilities to the testing set under the variable “model_pred”.

Test$model_prob <- predict(model, Test, type = "response")

Once we choose the threshold for your decision, we can transform those probabilities into successes and failures (1’s and 0’s), and we save them under the variable “model_pred”.

Similarly, we tranform the “Yes” and “No” in the variable dental.visit to 1’s and 0’s, we call the variable “visit_binary”.

Test <- Test  %>% mutate(model_pred = 1*(model_prob > .53) + 0,
                                 visit_binary = 1*(dental.visit == "Yes") + 0)

Take a peak at the important columns of our training set after being updated:

##   dental.visit model_prob model_pred visit_binary
## 1          Yes  0.7406852          1            1
## 2          Yes  0.7406852          1            1
## 3          Yes  0.7614214          1            1

Finally we compare the newly created columns “model_pred” and “visit_binary” to calculate the accuracy of our model.

Test <- Test %>% mutate(accurate = 1*(model_pred == visit_binary))
sum(Test$accurate)/nrow(Test)
## [1] 0.7749775

That’s it! Hope you enjoyed it. Questions at juanmurillopacheco@gmail.com