Load the packages for the current session

library(dplyr)  # for manipulating data frames 
library(ggplot2)  # for data viz
library(here)  # for simplifying folder references 
library(readr)  # reading and writing data files 
library(GGally)  # extension to ggplot2
library(ISLR)  # package from the book Intro to Stat Learning 
library(broom)  # for cleaning up model results 
library(tidyr)  # for tidying data
library(knitr)  # for kable() tables
library(kableExtra)  # for kable_styling() table formatting

Examining the Default dataset

The Default dataset is built into the ISLR package. Much of this example is drawn from the UC Business Analytics logistic regression tutorial² - it’s a good tutorial, and I don’t have too much to add to it, to be honest.

Our goal is to find the probability that a certain credit card holder will default, given their income, balance, student status, etc.

df1.default.dataset <- Default  # rename for convenience 

# str(df1.default.dataset)

# levels(df1.default.dataset$student)  # "No" is the first level (coded as 1)

Exploratory data analysis

p1.pairs <- df1.default.dataset %>% 
    ggpairs(); p1.pairs

# split by default category: Yes vs No
df2.summary <- 
    df1.default.dataset %>% 
    group_by(default) %>% 
    summarize(avg.bal = mean(balance), 
              avg.inc = mean(income), 
              
              # dplyr::n() is used to count rows 
              num.cases = n(), 
              num.student = sum(student=="Yes"), 
              perc.student = round(100*sum(student=="Yes")/n(), 2))  
    

# print result: 
df2.summary %>%     
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
default      avg.bal    avg.inc   num.cases   num.student   perc.student
No          803.9438   33566.17        9667          2817          29.14
Yes        1747.8217   32089.15         333           127          38.14

# split by Student category: Yes vs No
df1.default.dataset %>% 
    group_by(student) %>% 
    summarize(avg.bal = mean(balance), 
              avg.inc = mean(income), 
              num = n(), 
              perc.default = round(100*sum(default=="Yes")/n(), 2)) %>% 
    
    # don't use these lines; only for Rmarkdown
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
student     avg.bal    avg.inc    num   perc.default
No         771.7704   40011.95   7056           2.92
Yes        987.8182   17950.23   2944           4.31

There are a lot of interesting things to see here.

  1. Defaults are a small proportion of total accounts, for both students and non-students.
  2. Non-students have much higher income than students (no surprise). They also have slightly higher credit balances.
  3. Overall, the proportion of students in the sample is 0.2944 (2944 of 10000 customers). Among the defaults, the proportion of students is 0.3814 (127 of 333). Clearly, this shows that students are riskier than non-students - right?

Let’s create some models to examine the probability of default, given the characteristics of each customer.


What if we just try linear regression?

Let’s start by examining the relationship between default and credit card balance.

# m1.linear.regression <- lm(as.numeric(default) ~ balance, 
#                            data = df1.default.dataset)
# note: as.numeric() on the "default" factor codes No as 1 and Yes as 2, 
# so the plot below recodes the outcome to 0/1 instead 

# summary(m1.linear.regression)

# plot result of model: 
df1.default.dataset %>% 
    mutate(is.default.numeric = ifelse(default == "Yes", 
                                1, 
                                0)) %>% 
    
    ggplot(aes(x = balance, 
               y = is.default.numeric)) + 
    geom_point(alpha = 0.2) + 
    geom_smooth(method = "lm") + 
    
    labs(title = "Ordinary linear regression doesn't really make sense when you have \na binary response", 
         subtitle = "We want to interpret the regression line as 'Probability of default', but notice that we're seeing negative \nvalues at the far left.")
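
To confirm the plot's message numerically, here is a sketch that fits the linear model on the recoded 0/1 outcome and looks at its smallest fitted value (which should come out negative, matching the left edge of the plot):

# fit the linear model on the 0/1 outcome and inspect the fitted values: 
m1.linear <- df1.default.dataset %>% 
    mutate(is.default.numeric = ifelse(default == "Yes", 1, 0)) %>% 
    lm(is.default.numeric ~ balance, data = .)

min(fitted(m1.linear))  # negative: an impossible "probability" 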

Fitting a logistic regression model with glm¹

m2.logistic <- glm(default ~ balance, 
                   # specify which specific GLM you want using "family": 
                   family = "binomial", 
                   data = df1.default.dataset)

summary(m2.logistic)
## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = df1.default.dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8

Unlike in linear regression, there is no R-squared stat here. Instead, we use residual deviance. Lower deviance is better.
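
If you miss having an R-squared-style number, one common substitute is McFadden's pseudo-R-squared, which can be computed from the two deviances stored on the model object. A minimal sketch:

# McFadden's pseudo-R-squared: 1 - (residual deviance / null deviance) 
1 - m2.logistic$deviance / m2.logistic$null.deviance
# here: 1 - 1596.5/2920.6, i.e. roughly 0.45 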

Interpreting coefficients

Let’s extract coefficients using broom::tidy. Note that the coefficients are given in terms of log odds, which are hard to interpret, so we convert them to odds factors, which are slightly easier to interpret.

tidy(m2.logistic) %>% 
    mutate(estimate.odds.factor = exp(estimate)) %>% 
    
    # rearrange column order: 
    select(term, 
           estimate, 
           estimate.odds.factor, 
           everything()) %>% 
    
    
    # don't use these lines; only for Rmarkdown
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
term            estimate   estimate.odds.factor   std.error   statistic   p.value
(Intercept)  -10.6513306              0.0000237   0.3611574   -29.49221         0
balance        0.0054989              1.0055141   0.0002204    24.95309         0

Every unit increase in balance multiplies the odds of default by 1.0055141 (the odds factor); equivalently, each additional dollar of balance is associated with an increase of about 0.55% in the odds of default. The raw estimate, 0.0054989, is the change in log odds. If the odds factor is greater than 1, then the variable is associated with an increase in the probability of default.
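
Because the odds factor applies per dollar, it compounds multiplicatively over larger increases. A quick sketch:

# odds factor for a $1 increase in balance: 
exp(coef(m2.logistic)["balance"])

# odds factor for a $100 increase (compounds multiplicatively): 
exp(coef(m2.logistic)["balance"] * 100)  # roughly 1.73 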

Interpreting the model with fake data

Let’s create a dataset with five rows, using the 25th, 50th, 75th, 90th and 99th percentile values of balance.

# use quantile( ) to get percentiles 
balance.quantiles <- df1.default.dataset %>% 
    pull(balance) %>%
    quantile(probs = c(.25, .50, .75, .90, .99))

df4.fake.data <- data.frame(balance = balance.quantiles)


# view result: 
df4.fake.data %>% 
    
    # don't use these lines; only for Rmarkdown
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
        balance
25%    481.7311
50%    823.6370
75%   1166.3084
90%   1471.6253
99%   2008.4709

Now let’s use predict to see what probability our logistic model predicts for customers at these five percentiles of balance.

# to get output as probabilities (instead of log odds), use the type = "response" argument 

(predict(m2.logistic, 
        newdata = df4.fake.data, 
        type = "response")) * 100  # convert probabilities to percentages 
##         25%         50%         75%         90%         99% 
##  0.03345695  0.21887810  1.42324265  7.18251470 59.70248510


Conclusions:

  • Below the 50th percentile of balance, there is a very small probability of default.
  • The increase in probability from the 25th to the 75th percentile is very small, but the increase from the 90th to the 99th is quite large. The effect of a single unit increase in balance on the probability of default is therefore not uniform (see the sketch below).
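
To make that second point concrete, here is a sketch that applies the fitted coefficients by hand, using plogis (base R's logistic function) to convert log odds into probabilities. The same $100 increase in balance moves the predicted probability very little at the 50th percentile, and a lot at the 90th:

# predicted probability of default at a given balance: 
prob.default <- function(bal) {
    plogis(coef(m2.logistic)["(Intercept)"] + 
           coef(m2.logistic)["balance"] * bal)
}

# same $100 increase, very different changes in probability: 
prob.default(823.637 + 100) - prob.default(823.637)      # from the 50th percentile 
prob.default(1471.625 + 100) - prob.default(1471.625)    # from the 90th percentile 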

Plotting the model

augment(m2.logistic) %>% 
    # create an ID column, change fitted value to probability, etc: 
    mutate(id = row_number(), 
           default.numeric = ifelse(default == "Yes", 
                                    1, 
                                    0), 
           predicted.prob = m2.logistic$family$linkinv(.fitted)) %>% 
    # ^^ linkinv is used to invert the logistic function to get probabilities instead of log odds 
    
    # select columns for plotting: 
    select(id, 
           default.numeric, 
           balance, 
           predicted.prob) %>% 
    
    # reshape the data: 
    gather(key = "variable", 
           value = "value", 
           -c(id, balance)) %>% 
    
    
    # now plot it: 
    ggplot(aes(x = balance, 
               y = value, 
               group = variable, 
               colour = variable)) +
    geom_point(alpha = 0.2)
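
Incidentally, for the binomial family with its default logit link, linkinv is just base R's plogis, so predicted.prob = plogis(.fitted) would give the same numbers:

# for the logit link, linkinv is the logistic CDF: 
all.equal(m2.logistic$family$linkinv(c(-2, 0, 2)), 
          plogis(c(-2, 0, 2)))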

Alternate way of plotting

It’s kind of fun to be able to see the predictions for your actual data points, so that’s the approach I showed above. In practice, you’ll usually just let ggplot plot a smooth curve, like this:

df1.default.dataset %>% 
    mutate(is.default.numeric = ifelse(default == "Yes", 
                                1, 
                                0)) %>% 
    
    ggplot(aes(x = balance, 
               y = is.default.numeric)) + 
    geom_point(alpha = 0.2) + 
    geom_smooth(method = "glm", 
                method.args = list(family = "binomial"))

Adding student as a predictor

First we’ll fit a new model. Then we’ll check whether the new model really is better. Finally we’ll plot the new model’s predictions.

# fit logistic model with 2 predictors: 
m3.logistic.with.student <- 
    glm(default ~ balance + student, 
        data = df1.default.dataset, 
        family = "binomial")

summary(m3.logistic.with.student)
## 
## Call:
## glm(formula = default ~ balance + student, family = "binomial", 
##     data = df1.default.dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8
# examine coefficients: 
tidy(m3.logistic.with.student) %>% 
    mutate(estimate.odds.factor = exp(estimate)) %>% 
    
    # rearrange column order: 
    select(term, 
           estimate, 
           estimate.odds.factor, 
           everything()) %>% 
    
    
    # don't use these lines; only for Rmarkdown
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
term            estimate   estimate.odds.factor   std.error    statistic   p.value
(Intercept)  -10.7494959              0.0000215   0.3691914   -29.116326   0.0e+00
balance        0.0057381              1.0057546   0.0002318    24.749526   0.0e+00
studentYes    -0.7148776              0.4892520   0.1475190    -4.846003   1.3e-06

Huh. The odds factor for student, after controlling for balance, is less than one: for the same balance, a student is less likely to default than a non-student. This is the opposite of what we concluded from the naive analysis above.

Explanation for this change in conclusion²:

The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we saw above, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit.
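
Both halves of that story are easy to verify directly. A sketch (the balance of 1500 is just an arbitrary illustrative value):

# at the SAME balance, the model predicts students are LESS likely to default: 
predict(m3.logistic.with.student, 
        newdata = data.frame(balance = 1500, 
                             student = c("No", "Yes")), 
        type = "response")

# but unconditionally, students default MORE often - because they carry higher balances: 
df1.default.dataset %>% 
    group_by(student) %>% 
    summarize(default.rate = mean(default == "Yes"), 
              avg.balance = mean(balance))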

Note that deviance is lower for this model. But is it significantly lower?

anova(m2.logistic, 
      m3.logistic.with.student, 
      test = "Chisq")   
## Analysis of Deviance Table
## 
## Model 1: default ~ balance
## Model 2: default ~ balance + student
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      9998     1596.5                          
## 2      9997     1571.7  1    24.77 6.459e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, it is significantly lower.
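
AIC agrees (both values are printed in the summaries above; AIC() lines the models up side by side):

# compare AIC directly: lower is better 
AIC(m2.logistic, m3.logistic.with.student)  # 1600.5 vs 1577.7 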

Plotting the model with balance and student as predictors

df1.default.dataset %>% 
    mutate(is.default.numeric = ifelse(default == "Yes", 
                                1, 
                                0)) %>% 
    
    ggplot(aes(x = balance, 
               y = is.default.numeric, 
               group = student, 
               colour = student)) + 
    geom_jitter(alpha = 0.1, 
                height = .1) + 
    geom_smooth(method = "glm", 
                method.args = list(family = "binomial"))

Mini-exercise

Add income to the model as well and check whether the model improves.
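
If you want to check your work, here is a sketch of one approach (mild spoiler: in ISLR's version of this model, income turns out not to add much once balance and student are included):

# fit a 3-predictor model: 
m4.logistic.with.income <- 
    glm(default ~ balance + student + income, 
        data = df1.default.dataset, 
        family = "binomial")

# compare against the 2-predictor model: 
anova(m3.logistic.with.student, 
      m4.logistic.with.income, 
      test = "Chisq")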


Footnotes


  1. Actually, ordinary linear regression is also a specific case of a GLM, where the link function is just the identity function, \(f(x) = x\).

  2. http://uc-r.github.io/logistic_regression#req