Load packages for today’s session

library(dplyr)  # for manipulating data frames 
library(ggplot2)  # for data viz
library(here)  # for simplifying folder references 
library(readr)  # reading and writing data files 
library(GGally)  # extension to ggplot2
library(ISLR)  # from the book Intro to Stat Learning with R 
library(broom)  # for saving model outputs as dataframes 
library(janitor)  # optional; for cleaning up dataframes 

Motivation for regression models

Have you ever been asked to …

If you answered yes to all, regression models may be right for you.

A regression model is a quantitative explanation of how the mean/expected value of one variable (Y) changes as the values of other variables (X1, x2, …) change.

Remember, all models are wrong but some are useful.1 Calculating an average was considered cutting-edge modelling in the 16th century.2 In a sense, the average is “wrong” because no single data point might actually take the average value.3

Similarly, regression models simplify away a lot of detail from the raw data. This is a feature, not a bug.

Steps in regression modelling

  1. Fit a model
  2. Diagnostics: any problems with the model? Is it suitable for the data?
  3. Check overall significance and significance of individual predictors
  4. Check goodness of fit
  5. Fit another model; compare new model with old one, then go with the one that’s better (there are algorithmic approaches for making these decisions)

mtcars example

Remember this graph?

p1.group.means <- mtcars %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using gpplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = mpg)) + 
    
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # examine three different ways of summarizing behaviour within
    # each level of cyl: 
    
    # mean by group: 
    stat_smooth(aes(group = cyl,
                    colour = cyl), 
                method = "lm", 
                formula = y ~ 1) + 
    
    theme_classic()
    
# print: 
p1.group.means

Two ways to get those horizontal lines:

Averages by group:

mtcars %>% 
    group_by(cyl) %>% 
    summarize(mean.mpg = mean(mpg)) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cyl mean.mpg
4 26.66364
6 19.74286
8 15.10000

     

Regression on categorical variable:

We use the lm function to fit a regression model. summary can be used to examine the model.

df1.mtcars <- mtcars %>% 
    mutate(cyl = as.factor(cyl))

m1.mpg.vs.cyl <- lm(mpg ~ cyl - 1, data = df1.mtcars)

summary(m1.mpg.vs.cyl)
## 
## Call:
## lm(formula = mpg ~ cyl - 1, data = df1.mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## cyl4  26.6636     0.9718   27.44  < 2e-16 ***
## cyl6  19.7429     1.2182   16.21 4.49e-16 ***
## cyl8  15.1000     0.8614   17.53  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.9785, Adjusted R-squared:  0.9763 
## F-statistic: 440.9 on 3 and 29 DF,  p-value: < 2.2e-16

Functions from the package broom

The default output is hard to save, so we’ll use the broom package to clean up and create a nice dataframe.

     

Using broom::tidy to get model coefficients

#1 row per coefficient 
broom::tidy(m1.mpg.vs.cyl) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
term estimate std.error statistic p.value
cyl4 26.66364 0.9718008 27.43735 0
cyl6 19.74286 1.2182168 16.20636 0
cyl8 15.10000 0.8614094 17.52941 0

     

Using broom::glance to get one-line model summary

# 1 row per model; very useful when comparing lots of models
broom::glance(m1.mpg.vs.cyl) %>% 
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.9785461 0.9763267 3.223099 440.9114 0 3 -81.28198 170.564 176.4269 301.2626 29

     

Using broom::augment to get predicted values, residuals, etc.

# 1 row per observation
broom::augment(m1.mpg.vs.cyl) %>% 
    head %>% 
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
mpg cyl .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
21.0 6 19.74286 1.2182168 1.257143 0.1428571 3.270096 0.0098604 0.4212932
21.0 6 19.74286 1.2182168 1.257143 0.1428571 3.270096 0.0098604 0.4212932
22.8 4 26.66364 0.9718008 -3.863636 0.0909091 3.189504 0.0526886 -1.2572423
21.4 6 19.74286 1.2182168 1.657143 0.1428571 3.262661 0.0171335 0.5553410
18.7 8 15.10000 0.8614094 3.600000 0.0714286 3.203267 0.0344491 1.1591009
18.1 6 19.74286 1.2182168 -1.642857 0.1428571 3.262962 0.0168394 -0.5505536
write_csv(broom::tidy(m1.mpg.vs.cyl), 
          here("results", 
               "output from src", 
               "mtcars-regression.csv"))

     

Interpreting the change in the mean mpg

In both cases above, we see that there’s this negative trend: as num cylinders increases, mpg decreases. So what do we gain by doing the regression instead of just calculating averages?

  1. Are these “trends” real or are they the result of chance? E.g. is the difference in sample means of 6- and 8- cylinder cars representative of a real difference in the population?

  2. Will these “trends” still exist after we account for other variables? We shouldn’t just filter out rows based on other variables - we should account for them systematically.

  3. What is the size of the trend (for continuous variables)? Can we compare the trends of two different sites? Can we detect changes in the size of the trend? Can we use the trend to fill in gaps in our knowledge (interpolation and extrapolation)?

Averages by group, after accounting for weight

Let’s fit another regression, this time incuding weight as a predictor variable.

m2.include.wt <- lm(mpg ~ cyl + wt, data = df1.mtcars)

summary(m2.include.wt)
## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = df1.mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
## cyl8         -6.0709     1.6523  -3.674 0.000999 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

Note that the expected difference in mpg between 4-cylinder and 6-cylinder cars is 4.2556 mpg, while it was 6.9207 mpg previously.

Similarly, after accounting for weight, the difference between 6-cylinder cars and 8-cylinder cars is 1.8153 instead of 4.6429

This is important. Our conclusions may be biased or irrelevant if we don’t systematically control for confounding variables. E.g. an observed difference in ALOS between two nursing units is almost certainly not due the characteristics of the two nursing units alone - there will be multiple confounding variables such as age, diagnoses, etc.

Diagnostic plots

Residuals from regression should be white noise with mean zero. This tells you that you have captured all the systematic structure in the relationship between the response and the predictor variables, leaving only randomness/noise. See here for more details on interpreting these plots.

# display 4 plots in a 2 by 2 grid: 
par(mfrow = c(2, 2))

plot(m2.include.wt)

     

Using predict to interpret models

Once you fit a model, it exists independently of the data that you started with. This is a powerful abstraction - instead of carrying around raw data to describe a system, you can have a simple mathematical model that is easier to interpret and work with.

Although you will usually start by looking at regression coefficients, one of the best ways to interpet your model is to kick it and see what it does. In this context, kicking it refers to feeding it an artificial dataset that you create.

First, just try calling predict:

predict(m2.include.wt)
##        1        2        3        4        5        6        7        8 
## 21.33650 20.51907 26.55377 19.42916 16.89262 18.64379 16.47590 23.76489 
##        9       10       11       12       13       14       15       16 
## 23.89311 18.70790 18.70790 14.87309 15.96300 15.80272 11.09046 10.53269 
##       17       18       19       20       21       22       23       24 
## 10.78593 26.93844 28.81373 28.10849 26.08896 16.63618 16.90865 15.61038 
##       25       26       27       28       29       30       31       32 
## 15.59435 27.78793 27.13078 29.14070 17.75814 20.85566 16.47590 25.07919

The result is a vector of numbers that gives you the predicted value for each row of the mtcars dataset. The broom::augment function is a good way to get these predicted values along with the input data. In the output below, .fitted gives the value that the regression model predicts, given values of cyl and wt.

augment(m2.include.wt) %>% 
    select(mpg, cyl, wt, .fitted) %>% 
    head(15) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
mpg cyl wt .fitted
21.0 6 2.620 21.33650
21.0 6 2.875 20.51907
22.8 4 2.320 26.55377
21.4 6 3.215 19.42917
18.7 8 3.440 16.89262
18.1 6 3.460 18.64379
14.3 8 3.570 16.47590
24.4 4 3.190 23.76489
22.8 4 3.150 23.89311
19.2 6 3.440 18.70790
17.8 6 3.440 18.70790
16.4 8 4.070 14.87309
17.3 8 3.730 15.96300
15.2 8 3.780 15.80272
10.4 8 5.250 11.09046

Let’s see how the model uses wt and cyl values to predict mpg. We expect three parallel lines, one for each value of cyl, with slope equal to the coefficient of wt.

df1.mtcars %>% 
    mutate(pred.mpg = predict(m2.include.wt)) %>% 
    
    ggplot(aes(x = wt, 
               y = pred.mpg)) + 
    
    geom_point(aes(col = cyl))

Yup, just what we expected. Now let’s say you have some new data and you want to see what the model predicts for this new data.

# create new fake data: 

df2.mtcars.fake <- data.frame(cyl = as.factor(c(4, 6, 8)), 
                              wt = c(4, 2, 2))

df2.mtcars.fake %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
cyl wt
4 4
6 2
8 2

Now pass the fake data to predict:

# note the "newdata" argument in predict() below

df2.mtcars.fake <- 
    df2.mtcars.fake %>% 
    mutate(pred.mpg = predict(m2.include.wt, 
                              newdata = df2.mtcars.fake)) 

# first plot the points that were originally in the data:  
df1.mtcars %>% 
    mutate(pred.mpg = predict(m2.include.wt)) %>% 
    
    ggplot(aes(x = wt, 
               y = pred.mpg)) + 
    
    geom_point(aes(col = cyl)) + 
    
    # then add the new points. Increase size to make them distinguishable: 
    geom_point(data = df2.mtcars.fake, 
               aes(x = wt, 
                   y = pred.mpg, 
                   col = cyl), 
               size = 5)

Plotting actuals vs predicted values to assess model performance

Plotting the actual values of your response variable vs the predicted values from the model is a great way to see how your model is doing and to compare different models. In the best case scenario, all the points below should fall on the blue line with slope 1 and intercept 0.

augment(m2.include.wt) %>% 
    
    ggplot(aes(x = .fitted, 
               y = mpg)) + 
    
    geom_point(aes(col = cyl)) + 
    
    # add 45 degree line: 
    geom_abline(slope = 1, 
                intercept = 0, 
                col = "blue") + 
    
    # make sure axes have same scale: 
    scale_x_continuous(limits = c(10, 35)) + 
    scale_y_continuous(limits = c(10, 35)) + 
    
    labs(title = "Actuals vs predictions", 
         subtitle = "Points above the line are being under-estimated \nPoints below the line are being over-estimated")

By building effective models of the way our operations work, we can encapsulate our knowledge succinctly in a form that can immediately answer a lot of important questions. That way, we don’t have to keep spending time on surface-level ad hoc analyses that always start from scratch with the raw data (“Let’s pull last 2 years of data, make a couple of graphs and send it back to the requester”). That work has already been done during the modelling process, and does not need to be repeated every time.


     

Mini-Exercise: GDP of US Metro areas


     

Transformations of variables and interactions between variables

We worked with a log-transformation in the mini-exercise. Now let’s look at interactions between variables. These examples are from ISLR.4

Interaction between 2 continuous variables.

First read in the data:

df2.advert <- read_csv(here::here("data", 
                                  "Advertising.csv")) %>% 
    select(-X1) %>%  # drop unnecessary column 
    clean_names  # from "janitor"" package

# str(df2.advert)
# summary(df2.advert)
# head(df2.advert)

First we’ll fit a model without an interaction:

m3.advert.no.interaction <- lm(sales ~ tv + radio, 
                               data = df2.advert)

# summary(m3.advert.no.interaction)


# look at model coefficients 
tidy(m3.advert.no.interaction) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
term estimate std.error statistic p.value
(Intercept) 2.9210999 0.2944897 9.919193 0
tv 0.0457548 0.0013904 32.908708 0
radio 0.1879942 0.0080400 23.382446 0
# model summary: 
glance(m3.advert.no.interaction) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.8971943 0.8961505 1.681361 859.6177 0 3 -386.197 780.3941 793.5874 556.914 197
# look at residual plots:
par(mfrow = c(2, 2))  # plot 4 graphs on 1 screen
plot(m3.advert.no.interaction)

Those curved shapes on the top-left and bottom-left plots are a little worrying. Looks like we’re under-predicting sales both at the lower end of the range and the higher end.

What if we add an interaction term? We see that the effect of tv advertising on sales is dependent on the level of radio advertising.

m4.advert.with.interact <- lm(sales ~ tv + 
                                  radio + 
                                  tv*radio,  # interaction term
                              data = df2.advert)

# summary(m4.advert.with.interact)


# look at model coefficients 
tidy(m4.advert.with.interact) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
term estimate std.error statistic p.value
(Intercept) 6.7502202 0.2478714 27.232755 0.0000000
tv 0.0191011 0.0015041 12.698954 0.0000000
radio 0.0288603 0.0089053 3.240815 0.0014005
tv:radio 0.0010865 0.0000524 20.726564 0.0000000
# model summary: 
glance(m4.advert.with.interact) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.9677905 0.9672975 0.9435154 1963.057 0 4 -270.1389 550.2778 566.7694 174.4834 196
# residual plots 
par(mfrow = c(2, 2))  # plot 4 graphs on 1 screen
plot(m4.advert.with.interact)

Mini-exercise: explore interaction with predict

Set up a fake dataset to explore how the interaction between tv and radio advertising works. For example, try creating a dataframe with four rows:

  • Rows 1 and 2: keep tv at a fixed value, but create two values of radio that differ by 1 unit.
  • Rows 3 and 4: Change the value of tv, but keep it constant across Rows 3 and 4. Leave the values of radio as in the previous two rows.

Then pass this fake data to the predict function. K

     

Interaction between continuous and categorical variables

We’ll use the Credit dataset. The goal is to predict the balance on a customer’s credit card.

df3.credit <- Credit %>% 
    clean_names()

# str(df3.credit)
# summary(df3.credit)
head(df3.credit) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
id income limit rating cards age education gender student married ethnicity balance
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
3 104.593 7075 514 4 71 11 Male No No Asian 580
4 148.924 9504 681 3 36 11 Female No No Asian 964
5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331
6 80.180 8047 569 4 77 10 Male No No Caucasian 1151

First fit a model with only income and student.

m5.credit <- lm(balance ~ income + student, 
                data = df3.credit)

# summary(m5.credit)

df3.credit %>%  
    mutate(pred.balance = predict(m5.credit)) %>% 
    
    ggplot(aes(x = income, 
               y = pred.balance)) + 
    
    geom_point(aes(col = student))

Does it make sense that the effect of additional income is the same whether or not you’re a student? Possibly not. Let’s fit a more general model that does allow for the slope of the 2 lines to differ.

m6.credit.interact <- lm(balance ~ income +
                             student + 
                             income*student,  # interaction term
                         data = df3.credit)

summary(m6.credit.interact)
## 
## Call:
## lm(formula = balance ~ income + student + income * student, data = df3.credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.39 -325.70  -41.13  321.65  814.04 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       200.6232    33.6984   5.953 5.79e-09 ***
## income              6.2182     0.5921  10.502  < 2e-16 ***
## studentYes        476.6758   104.3512   4.568 6.59e-06 ***
## income:studentYes  -1.9992     1.7313  -1.155    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared:  0.2799, Adjusted R-squared:  0.2744 
## F-statistic:  51.3 on 3 and 396 DF,  p-value: < 2.2e-16
df3.credit %>%  
    mutate(pred.balance = predict(m6.credit.interact)) %>% 
    
    ggplot(aes(x = income, 
               y = pred.balance)) + 
    
    geom_point(aes(col = student))

Note that adjusted R-squared and Residual standard error are pretty much exactly the same after adding the interaction, and the interaction is not significant. In this case, we should exclude the interaction from our final model.


     

Boston house prices dataset

Use the Boston dataset from the ISLR package. Try to find the best model to predict median house value (medv) using the 13 other variables in the dataset.


     

Footnotes


  1. https://en.wikipedia.org/wiki/All_models_are_wrong

  2. https://en.wikipedia.org/wiki/Average#Origin

  3. https://www.gse.harvard.edu/news/ed/15/08/beyond-average

  4. https://www-bcf.usc.edu/~gareth/ISL/