Introduction

We will walk through a set of exercises using logistic regression to classify the gender of individuals in the cdc dataframe based on the variables smoke100, genhlth and exerany. This follows the process presented in the chapter “Classification models: fitting them and evaluating their performance” in the DataCamp course “Machine Learning Toolbox.”

library(tidyverse)
library(caret)
# Be sure to change this path to wherever cdc.Rdata lives on your machine
load("~/Dropbox/RProjects/Module 8/cdc.Rdata")

Exercise 1

Produce train and test subsets of the randomized cdc dataframe.

Answer

# Randomize the order of the rows of the dataframe
set.seed(999)
rows <- sample(nrow(cdc))
cdc <- cdc[rows, ]

# Split into train (80%) and test (20%) dataframes
split <- round(nrow(cdc) * 0.8)

train <- cdc[1:split, ]
test <- cdc[(split + 1):nrow(cdc), ]

str(test)
## 'data.frame':    4000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 2 1 4 3 2 1 2 2 1 ...
##  $ exerany : num  1 1 1 0 1 0 1 1 1 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 0 1 1 1 ...
##  $ smoke100: num  1 1 0 0 0 1 0 1 0 0 ...
##  $ height  : num  67 68 72 66 70 68 71 67 70 73 ...
##  $ weight  : int  126 175 180 158 225 160 175 206 187 190 ...
##  $ wtdesire: int  125 175 175 158 195 160 175 189 175 184 ...
##  $ age     : int  48 63 34 67 53 63 43 53 54 68 ...
##  $ gender  : Factor w/ 2 levels "m","f": 2 1 1 1 1 1 1 2 1 1 ...
str(train)
## 'data.frame':    16000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 3 2 3 3 2 2 2 1 2 ...
##  $ exerany : num  1 0 1 1 0 0 1 1 1 1 ...
##  $ hlthplan: num  0 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  1 1 0 1 1 0 0 0 0 1 ...
##  $ height  : num  74 66 61 64 75 63 69 67 70 67 ...
##  $ weight  : int  187 185 135 128 320 155 200 170 170 175 ...
##  $ wtdesire: int  200 145 110 128 210 145 185 155 170 150 ...
##  $ age     : int  24 44 47 96 29 44 22 57 38 48 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 1 2 1 1 1 2 ...
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 3 2 3 3 2 2 2 1 2 ...
##  $ exerany : num  1 0 1 1 0 0 1 1 1 1 ...
##  $ hlthplan: num  0 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  1 1 0 1 1 0 0 0 0 1 ...
##  $ height  : num  74 66 61 64 75 63 69 67 70 67 ...
##  $ weight  : int  187 185 135 128 320 155 200 170 170 175 ...
##  $ wtdesire: int  200 145 110 128 210 145 185 155 170 150 ...
##  $ age     : int  24 44 47 96 29 44 22 57 38 48 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 1 2 1 1 1 2 ...
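
As an aside, caret offers createDataPartition(), which produces a stratified split that preserves the proportion of m and f in both subsets. A minimal sketch of that alternative (train_strat and test_strat are illustrative names, not used in the rest of the exercises):

# Stratified 80/20 split on gender with caret.
set.seed(999)
in_train <- createDataPartition(cdc$gender, p = 0.8, list = FALSE)
train_strat <- cdc[in_train, ]
test_strat <- cdc[-in_train, ]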

Exercise 2

Use glm to fit a logistic regression model to the train dataframe using the three listed variables as independent variables and gender as the dependent variable. Call this model mod_g. Get a summary of mod_g.

Answer

mod_g <- glm(gender ~ smoke100 + genhlth + exerany, data = train, family = "binomial")
summary(mod_g)
## 
## Call:
## glm(formula = gender ~ smoke100 + genhlth + exerany, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5070  -1.2008   0.9737   1.1175   1.3084  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.42225    0.04770   8.852  < 2e-16 ***
## smoke100         -0.44528    0.03227 -13.798  < 2e-16 ***
## genhlthvery good  0.07785    0.04271   1.823 0.068349 .  
## genhlthgood       0.07606    0.04515   1.685 0.092072 .  
## genhlthfair       0.22402    0.06185   3.622 0.000293 ***
## genhlthpoor       0.32586    0.09523   3.422 0.000622 ***
## exerany          -0.27971    0.03793  -7.374 1.65e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22152  on 15999  degrees of freedom
## Residual deviance: 21886  on 15993  degrees of freedom
## AIC: 21900
## 
## Number of Fisher Scoring iterations: 4
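
Since logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A quick sketch:

# Convert the log-odds coefficients to odds ratios.
exp(coef(mod_g))
# For example, exp(-0.44528) is roughly 0.64, so having smoked 100
# cigarettes multiplies the odds of the modeled outcome by about 0.64.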

Exercise 3

Now create two new boolean variables, isM and isF, in the train dataframe. The meanings should be obvious.

Answer

train$isM <- train$gender == "m"
train$isF <- train$gender == "f"
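
A quick cross-tabulation confirms the new booleans line up with gender; this is just a sanity check, not part of the exercise:

# Every TRUE in isM should fall in the m column, and similarly for isF.
table(train$isM, train$gender)
table(train$isF, train$gender)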

Exercise 4

Use the new boolean variables as alternative dependent variables to produce models mod_m and mod_f. Produce summaries of each of these and decide which one is equivalent to mod_g.

Answer

Here is mod_m.

mod_m <- glm(isM ~ smoke100 + genhlth + exerany, data = train, family = "binomial")
summary(mod_m)
## 
## Call:
## glm(formula = isM ~ smoke100 + genhlth + exerany, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3084  -1.1175  -0.9737   1.2008   1.5070  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.42225    0.04770  -8.852  < 2e-16 ***
## smoke100          0.44528    0.03227  13.798  < 2e-16 ***
## genhlthvery good -0.07785    0.04271  -1.823 0.068349 .  
## genhlthgood      -0.07606    0.04515  -1.685 0.092072 .  
## genhlthfair      -0.22402    0.06185  -3.622 0.000293 ***
## genhlthpoor      -0.32586    0.09523  -3.422 0.000622 ***
## exerany           0.27971    0.03793   7.374 1.65e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22152  on 15999  degrees of freedom
## Residual deviance: 21886  on 15993  degrees of freedom
## AIC: 21900
## 
## Number of Fisher Scoring iterations: 4

Here is mod_f.

mod_f <- glm(isF ~ smoke100 + genhlth + exerany, data = train, family = "binomial")
summary(mod_f)
## 
## Call:
## glm(formula = isF ~ smoke100 + genhlth + exerany, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5070  -1.2008   0.9737   1.1175   1.3084  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.42225    0.04770   8.852  < 2e-16 ***
## smoke100         -0.44528    0.03227 -13.798  < 2e-16 ***
## genhlthvery good  0.07785    0.04271   1.823 0.068349 .  
## genhlthgood       0.07606    0.04515   1.685 0.092072 .  
## genhlthfair       0.22402    0.06185   3.622 0.000293 ***
## genhlthpoor       0.32586    0.09523   3.422 0.000622 ***
## exerany          -0.27971    0.03793  -7.374 1.65e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22152  on 15999  degrees of freedom
## Residual deviance: 21886  on 15993  degrees of freedom
## AIC: 21900
## 
## Number of Fisher Scoring iterations: 4
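
Rather than eyeballing the two summaries, we can ask R directly how the coefficient vectors relate (a small check, output omitted):

# mod_f's coefficients should match mod_g's exactly,
# while mod_m's should be their negatives.
all.equal(coef(mod_g), coef(mod_f))
all.equal(coef(mod_g), -coef(mod_m))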

Exercise 5

Look at the levels of the gender variable. What can you say about how glm used these values?

Answer

levels(train$gender)
## [1] "m" "f"

Comparing the three summaries makes it clear that mod_g and mod_f are equivalent: their coefficients are identical, while mod_m’s are the negatives. The probability mod_g produces is therefore the probability that the individual is female.

Here’s what’s happening. We ran the logistic regression model with a binary dependent variable, gender, which was not boolean but was a factor. Normally you think of the probability produced by the logistic model as the probability of TRUE. If the dependent variable is not boolean, the probability is the probability that the variable takes the value of the second level of the factor. Unless levels have been specifically assigned when the factor was created, they are in alphabetical order. Here the levels are "m" then "f", which is not alphabetical, so the levels must have been assigned when the cdc data was built.
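
If you want glm to model the other level, you can reorder the factor yourself with relevel(). A minimal sketch (gender2 and mod_g2 are illustrative names):

# Make "f" the first (reference) level, so glm models the probability of "m".
train$gender2 <- relevel(train$gender, ref = "f")
mod_g2 <- glm(gender2 ~ smoke100 + genhlth + exerany,
              data = train, family = "binomial")
# mod_g2's coefficients should now match mod_m's.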

Exercise 6

Use mod_g to produce the predicted probabilities for the test dataframe. Put these values in a vector pred. Do a summary of pred.

Answer

pred <- predict(mod_g, test, type = "response")
summary(pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4249  0.4440  0.5356  0.5208  0.5549  0.6788
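
Under the hood, type = "response" simply applies the inverse logit to the linear predictor. We can verify that ourselves with plogis() (a sketch using only base R):

# predict() with the default type = "link" returns log-odds;
# plogis() converts log-odds to probabilities.
log_odds <- predict(mod_g, test)
all.equal(pred, plogis(log_odds))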

Exercise 7

Create the confusion matrix.

Answer

# Select a threshold probability
th <- 0.5
# Calculate predicted class values as a factor: p_class
p_class <- ifelse(pred > th, "f", "m")
# Align the levels of the two factors so they match test$gender.
p_factor <- factor(p_class, levels = c("m", "f"))
levels(test$gender)
## [1] "m" "f"
levels(p_factor)
## [1] "m" "f"
# Create confusion matrix
confusionMatrix(p_factor, test$gender, positive = "f")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m  768  653
##          f 1139 1440
##                                           
##                Accuracy : 0.552           
##                  95% CI : (0.5364, 0.5675)
##     No Information Rate : 0.5232          
##     P-Value [Acc > NIR] : 0.0001425       
##                                           
##                   Kappa : 0.0918          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6880          
##             Specificity : 0.4027          
##          Pos Pred Value : 0.5584          
##          Neg Pred Value : 0.5405          
##              Prevalence : 0.5232          
##          Detection Rate : 0.3600          
##    Detection Prevalence : 0.6448          
##       Balanced Accuracy : 0.5454          
##                                           
##        'Positive' Class : f               
## 
# Note: if we don't specify positive = "f", confusionMatrix treats the
# first factor level ("m") as positive, which swaps sensitivity and specificity.

Exercise 8

Find the following values in the confusion matrix output and write proper sentences to explain what they mean.

  1. Accuracy

  2. Sensitivity

  3. Specificity

Answer

The accuracy is 0.552. This means that the model classified 55.2% of all individuals in the test set correctly.

The sensitivity is 0.688. This means that the model correctly identified 68.8% of the actual positives, i.e. the females.

The specificity is 0.4027. This means that the model correctly identified 40.27% of the actual negatives, i.e. the males.
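
All three figures can be recomputed by hand from the confusion matrix, which makes the definitions concrete. A sketch; the results should match the confusionMatrix output above:

# Rows are predictions, columns are the true classes.
tab <- table(predicted = p_factor, actual = test$gender)
sum(diag(tab)) / sum(tab)        # accuracy: all correct / all cases
tab["f", "f"] / sum(tab[, "f"])  # sensitivity: predicted f / actual f
tab["m", "m"] / sum(tab[, "m"])  # specificity: predicted m / actual m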

Exercise 9

Repeat the process using the same pred vector, but with a reduced threshold, and note how the key figures derived from the confusion matrix change.

Answer

# Select a lower threshold probability
th <- 0.45
# Calculate predicted class values as a factor: p_class
p_class <- ifelse(pred > th, "f", "m")
# Align the levels of the two factors so they match test$gender.
p_factor <- factor(p_class, levels = c("m", "f"))
levels(test$gender)
## [1] "m" "f"
levels(p_factor)
## [1] "m" "f"
# Create confusion matrix
confusionMatrix(p_factor, test$gender, positive = "f")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m  671  554
##          f 1236 1539
##                                          
##                Accuracy : 0.5525         
##                  95% CI : (0.5369, 0.568)
##     No Information Rate : 0.5232         
##     P-Value [Acc > NIR] : 0.0001112      
##                                          
##                   Kappa : 0.0886         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7353         
##             Specificity : 0.3519         
##          Pos Pred Value : 0.5546         
##          Neg Pred Value : 0.5478         
##              Prevalence : 0.5232         
##          Detection Rate : 0.3847         
##    Detection Prevalence : 0.6937         
##       Balanced Accuracy : 0.5436         
##                                          
##        'Positive' Class : f              
## 

Note that reducing the threshold increased the number of true positives (from 1440 to 1539), but also increased the number of false positives (from 1139 to 1236). The decision on the best threshold value depends on the relative costs of false positives and false negatives.
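
To see the tradeoff across a whole range of cutoffs at once, we can sweep the threshold and track both rates. A sketch using only base R; the helper name metrics_at is invented for illustration:

# For each cutoff, classify the test set and compute sensitivity and specificity.
metrics_at <- function(th) {
  cls <- factor(ifelse(pred > th, "f", "m"), levels = c("m", "f"))
  tab <- table(cls, test$gender)
  c(threshold = th,
    sensitivity = tab["f", "f"] / sum(tab[, "f"]),
    specificity = tab["m", "m"] / sum(tab[, "m"]))
}
t(sapply(seq(0.40, 0.60, by = 0.05), metrics_at))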

Exercise 10

You should think of a classification problem in two stages.

  1. Get the model with the best tradeoff between the two possible types of mistakes. This involves selecting which variables to include and possibly creating some new ones. To do this we construct an ROC curve for every model we try and use the AUC to measure performance. We carry the selected model into the second stage.

  2. We consider the consequences of the possible errors and select a threshold appropriate for the consequences.

Produce an ROC curve and an AUC value for the model we’ve been working with.

Answer

# We need the caTools package for colAUC.
library(caTools)

# We already created the probability vector as pred.

colAUC(pred, test$gender, plotROC = TRUE)

##              [,1]
## m vs. f 0.5667713

So the AUC is about 0.567: above the 0.5 of random guessing, but certainly not great.
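
As a cross-check, the pROC package computes the same quantity (a sketch assuming pROC is installed; roc() chooses the curve's direction automatically, so it should agree with colAUC here):

library(pROC)
# roc() takes the true classes and the predicted probabilities.
roc_g <- roc(response = test$gender, predictor = pred)
auc(roc_g)  # should be approximately 0.567, matching colAUC
plot(roc_g)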