library(dplyr) # for manipulating data frames
library(ggplot2) # for data viz
library(here) # for simplifying folder references
library(readr) # reading and writing data files
library(GGally) # extension to ggplot2
library(ISLR) # from the book Intro to Stat Learning with R
library(broom) # for saving model outputs as dataframes
library(janitor) # optional; for cleaning up dataframes
library(knitr) # for kable()
library(kableExtra) # for kable_styling()
Have you ever been asked to …
If you answered yes to all, regression models may be right for you.
A regression model is a quantitative explanation of how the mean/expected value of one variable (Y) changes as the values of other variables (X1, X2, …) change.
Remember, all models are wrong but some are useful.1 Calculating an average was considered cutting-edge modelling in the 16th century.2 In a sense, the average is “wrong” because no single data point might actually take the average value.3
Similarly, regression models simplify away a lot of detail from the raw data. This is a feature, not a bug.
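In fact, the humble average is itself a regression model: an intercept-only lm returns the sample mean. A quick sketch (the object name is just for illustration):
# an intercept-only model: the single coefficient is the mean of mpg
m0.mean.only <- lm(mpg ~ 1, data = mtcars)
coef(m0.mean.only) # same value as...
mean(mtcars$mpg)   # ...the plain average (about 20.09)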
Remember this graph?
p1.group.means <- mtcars %>%
  # let's recode cyl as a discrete variable:
  mutate(cyl = factor(cyl,
                      levels = c(4, 6, 8))) %>%
  # now start using ggplot functions:
  ggplot(aes(x = disp,  # specify x and y axis
             y = mpg)) +
  geom_point(aes(colour = cyl,
                 size = hp)) +
  # summarize behaviour within each level of cyl
  # by drawing the mean of each group as a horizontal line:
  stat_smooth(aes(group = cyl,
                  colour = cyl),
              method = "lm",
              formula = y ~ 1) +
  theme_classic()
# print:
p1.group.means
Two ways to get those horizontal lines:
mtcars %>%
group_by(cyl) %>%
summarize(mean.mpg = mean(mpg)) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| cyl | mean.mpg |
|---|---|
| 4 | 26.66364 |
| 6 | 19.74286 |
| 8 | 15.10000 |
We use the lm function to fit a regression model, and summary to examine it. The - 1 in the formula below removes the intercept, so each level of cyl gets its own coefficient: its group mean.
df1.mtcars <- mtcars %>%
mutate(cyl = as.factor(cyl))
m1.mpg.vs.cyl <- lm(mpg ~ cyl - 1, data = df1.mtcars)
summary(m1.mpg.vs.cyl)
##
## Call:
## lm(formula = mpg ~ cyl - 1, data = df1.mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cyl4 26.6636 0.9718 27.44 < 2e-16 ***
## cyl6 19.7429 1.2182 16.21 4.49e-16 ***
## cyl8 15.1000 0.8614 17.53 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.9785, Adjusted R-squared: 0.9763
## F-statistic: 440.9 on 3 and 29 DF, p-value: < 2.2e-16
broom
The default output is hard to save, so we'll use the broom package to clean it up and create a nice dataframe.
broom::tidy to get model coefficients
# 1 row per coefficient
broom::tidy(m1.mpg.vs.cyl) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| cyl4 | 26.66364 | 0.9718008 | 27.43735 | 0 |
| cyl6 | 19.74286 | 1.2182168 | 16.20636 | 0 |
| cyl8 | 15.10000 | 0.8614094 | 17.52941 | 0 |
broom::glance to get one-line model summary
# 1 row per model; very useful when comparing lots of models
broom::glance(m1.mpg.vs.cyl) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.9785461 | 0.9763267 | 3.223099 | 440.9114 | 0 | 3 | -81.28198 | 170.564 | 176.4269 | 301.2626 | 29 |
broom::augment to get predicted values, residuals, etc.
# 1 row per observation
broom::augment(m1.mpg.vs.cyl) %>%
head %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| mpg | cyl | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|
| 21.0 | 6 | 19.74286 | 1.2182168 | 1.257143 | 0.1428571 | 3.270096 | 0.0098604 | 0.4212932 |
| 21.0 | 6 | 19.74286 | 1.2182168 | 1.257143 | 0.1428571 | 3.270096 | 0.0098604 | 0.4212932 |
| 22.8 | 4 | 26.66364 | 0.9718008 | -3.863636 | 0.0909091 | 3.189504 | 0.0526886 | -1.2572423 |
| 21.4 | 6 | 19.74286 | 1.2182168 | 1.657143 | 0.1428571 | 3.262661 | 0.0171335 | 0.5553410 |
| 18.7 | 8 | 15.10000 | 0.8614094 | 3.600000 | 0.0714286 | 3.203267 | 0.0344491 | 1.1591009 |
| 18.1 | 6 | 19.74286 | 1.2182168 | -1.642857 | 0.1428571 | 3.262962 | 0.0168394 | -0.5505536 |
write_csv(broom::tidy(m1.mpg.vs.cyl),
here("results",
"output from src",
"mtcars-regression.csv"))
In both cases above, we see a negative trend: as the number of cylinders increases, mpg decreases. So what do we gain by doing the regression instead of just calculating averages?
Are these “trends” real or are they the result of chance? E.g. is the difference in sample means of 6- and 8-cylinder cars representative of a real difference in the population?
Will these “trends” still exist after we account for other variables? We shouldn’t just filter out rows based on other variables - we should account for them systematically.
What is the size of the trend (for continuous variables)? Can we compare the trends of two different sites? Can we detect changes in the size of the trend? Can we use the trend to fill in gaps in our knowledge (interpolation and extrapolation)?
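For instance, the first question above can be read straight off a regression. A quick sketch (the object name m1b is just illustrative): relevel cyl so that 6 cylinders is the reference group, and the coefficient for the 8-cylinder level becomes the estimated 8-versus-6 difference, complete with a standard error, p-value, and confidence interval.
m1b.mpg.vs.cyl <- lm(mpg ~ relevel(cyl, ref = "6"), data = df1.mtcars)
broom::tidy(m1b.mpg.vs.cyl) # the row ending in "8" is the 8-vs-6 difference
confint(m1b.mpg.vs.cyl)     # does the interval for that difference exclude zero?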
Let’s fit another regression, this time including weight as a predictor variable.
m2.include.wt <- lm(mpg ~ cyl + wt, data = df1.mtcars)
summary(m2.include.wt)
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = df1.mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## cyl6 -4.2556 1.3861 -3.070 0.004718 **
## cyl8 -6.0709 1.6523 -3.674 0.000999 ***
## wt -3.2056 0.7539 -4.252 0.000213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
Note that the expected difference in mpg between 4-cylinder and 6-cylinder cars is 4.2556 mpg, while it was 6.9207 mpg previously.
Similarly, after accounting for weight, the difference between 6-cylinder and 8-cylinder cars is 1.8153 mpg instead of 4.6429 mpg.
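If you'd rather pull those differences out of the fitted models than read them off the summaries, here's one way (a sketch using the two models fitted above):
# group-means model: differences between group means
coef(m1.mpg.vs.cyl)["cyl4"] - coef(m1.mpg.vs.cyl)["cyl6"] # about 6.92
coef(m1.mpg.vs.cyl)["cyl6"] - coef(m1.mpg.vs.cyl)["cyl8"] # about 4.64
# model with wt: cyl coefficients are already differences relative to cyl4
coef(m2.include.wt)["cyl6"] # about -4.26 (6 vs 4 cylinders)
coef(m2.include.wt)["cyl8"] - coef(m2.include.wt)["cyl6"] # about -1.82 (8 vs 6 cylinders)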
This is important. Our conclusions may be biased or irrelevant if we don’t systematically control for confounding variables. E.g. an observed difference in ALOS (average length of stay) between two nursing units is almost certainly not due to the characteristics of the two nursing units alone - there will be multiple confounding variables such as age, diagnoses, etc.
Residuals from regression should be white noise with mean zero. This tells you that you have captured all the systematic structure in the relationship between the response and the predictor variables, leaving only randomness/noise. See here for more details on interpreting these plots.
# display 4 plots in a 2 by 2 grid:
par(mfrow = c(2, 2))
plot(m2.include.wt)
predict to interpret models
Once you fit a model, it exists independently of the data that you started with. This is a powerful abstraction - instead of carrying around raw data to describe a system, you can have a simple mathematical model that is easier to interpret and work with.
Although you will usually start by looking at regression coefficients, one of the best ways to interpret your model is to kick it and see what it does. In this context, kicking it refers to feeding it an artificial dataset that you create.
First, just try calling predict:
predict(m2.include.wt)
## 1 2 3 4 5 6 7 8
## 21.33650 20.51907 26.55377 19.42916 16.89262 18.64379 16.47590 23.76489
## 9 10 11 12 13 14 15 16
## 23.89311 18.70790 18.70790 14.87309 15.96300 15.80272 11.09046 10.53269
## 17 18 19 20 21 22 23 24
## 10.78593 26.93844 28.81373 28.10849 26.08896 16.63618 16.90865 15.61038
## 25 26 27 28 29 30 31 32
## 15.59435 27.78793 27.13078 29.14070 17.75814 20.85566 16.47590 25.07919
The result is a vector of numbers that gives you the predicted value for each row of the mtcars dataset. The broom::augment function is a good way to get these predicted values along with the input data. In the output below, .fitted gives the value that the regression model predicts, given values of cyl and wt.
augment(m2.include.wt) %>%
select(mpg, cyl, wt, .fitted) %>%
head(15) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| mpg | cyl | wt | .fitted |
|---|---|---|---|
| 21.0 | 6 | 2.620 | 21.33650 |
| 21.0 | 6 | 2.875 | 20.51907 |
| 22.8 | 4 | 2.320 | 26.55377 |
| 21.4 | 6 | 3.215 | 19.42917 |
| 18.7 | 8 | 3.440 | 16.89262 |
| 18.1 | 6 | 3.460 | 18.64379 |
| 14.3 | 8 | 3.570 | 16.47590 |
| 24.4 | 4 | 3.190 | 23.76489 |
| 22.8 | 4 | 3.150 | 23.89311 |
| 19.2 | 6 | 3.440 | 18.70790 |
| 17.8 | 6 | 3.440 | 18.70790 |
| 16.4 | 8 | 4.070 | 14.87309 |
| 17.3 | 8 | 3.730 | 15.96300 |
| 15.2 | 8 | 3.780 | 15.80272 |
| 10.4 | 8 | 5.250 | 11.09046 |
Let’s see how the model uses wt and cyl values to predict mpg. We expect three parallel lines, one for each value of cyl, with slope equal to the coefficient of wt.
df1.mtcars %>%
mutate(pred.mpg = predict(m2.include.wt)) %>%
ggplot(aes(x = wt,
y = pred.mpg)) +
geom_point(aes(col = cyl))
Yup, just what we expected. Now let’s say you have some new data and you want to see what the model predicts for this new data.
# create new fake data:
df2.mtcars.fake <- data.frame(cyl = as.factor(c(4, 6, 8)),
wt = c(4, 2, 2))
df2.mtcars.fake %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| cyl | wt |
|---|---|
| 4 | 4 |
| 6 | 2 |
| 8 | 2 |
Now pass the fake data to predict:
# note the "newdata" argument in predict() below
df2.mtcars.fake <-
df2.mtcars.fake %>%
mutate(pred.mpg = predict(m2.include.wt,
newdata = df2.mtcars.fake))
# first plot the points that were originally in the data:
df1.mtcars %>%
mutate(pred.mpg = predict(m2.include.wt)) %>%
ggplot(aes(x = wt,
y = pred.mpg)) +
geom_point(aes(col = cyl)) +
# then add the new points. Increase size to make them distinguishable:
geom_point(data = df2.mtcars.fake,
aes(x = wt,
y = pred.mpg,
col = cyl),
size = 5)
Plotting the actual values of your response variable vs the predicted values from the model is a great way to see how your model is doing and to compare different models. In the best case scenario, all the points below should fall on the blue line with slope 1 and intercept 0.
augment(m2.include.wt) %>%
ggplot(aes(x = .fitted,
y = mpg)) +
geom_point(aes(col = cyl)) +
# add 45 degree line:
geom_abline(slope = 1,
intercept = 0,
col = "blue") +
# make sure axes have same scale:
scale_x_continuous(limits = c(10, 35)) +
scale_y_continuous(limits = c(10, 35)) +
labs(title = "Actuals vs predictions",
subtitle = "Points above the line are being under-estimated \nPoints below the line are being over-estimated")
By building effective models of the way our operations work, we can encapsulate our knowledge succinctly in a form that can immediately answer a lot of important questions. That way, we don’t have to keep spending time on surface-level ad hoc analyses that always start from scratch with the raw data (“Let’s pull last 2 years of data, make a couple of graphs and send it back to the requester”). That work has already been done during the modelling process, and does not need to be repeated every time.
Mini-exercise: use the us-msa-gdp-and-population.csv dataset to find the relationship between GDP (in millions of USD) and population for US metropolitan statistical areas (MSAs).
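If you want a starting point for that mini-exercise, here is a sketch; the file location and the column names gdp and population are assumptions, so adjust them to whatever the file actually contains:
# read the MSA data (path and column names are assumptions):
df.msa <- read_csv(here::here("data",
                              "us-msa-gdp-and-population.csv")) %>%
  clean_names()
# a log-log model: the slope can be read as an elasticity
m.msa <- lm(log(gdp) ~ log(population), data = df.msa)
summary(m.msa)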
We worked with a log-transformation in the mini-exercise. Now let’s look at interactions between variables. These examples are from ISLR.4
First read in the data:
df2.advert <- read_csv(here::here("data",
                                  "Advertising.csv")) %>%
  select(-X1) %>% # drop unnecessary column
  clean_names() # from the "janitor" package
# str(df2.advert)
# summary(df2.advert)
# head(df2.advert)
First we’ll fit a model without an interaction:
m3.advert.no.interaction <- lm(sales ~ tv + radio,
data = df2.advert)
# summary(m3.advert.no.interaction)
# look at model coefficients
tidy(m3.advert.no.interaction) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.9210999 | 0.2944897 | 9.919193 | 0 |
| tv | 0.0457548 | 0.0013904 | 32.908708 | 0 |
| radio | 0.1879942 | 0.0080400 | 23.382446 | 0 |
# model summary:
glance(m3.advert.no.interaction) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.8971943 | 0.8961505 | 1.681361 | 859.6177 | 0 | 3 | -386.197 | 780.3941 | 793.5874 | 556.914 | 197 |
# look at residual plots:
par(mfrow = c(2, 2)) # plot 4 graphs on 1 screen
plot(m3.advert.no.interaction)
Those curved shapes on the top-left and bottom-left plots are a little worrying. Looks like we’re under-predicting sales both at the lower end of the range and the higher end.
What if we add an interaction term? This lets the effect of tv advertising on sales depend on the level of radio advertising.
m4.advert.with.interact <- lm(sales ~ tv +
radio +
tv*radio, # interaction term
data = df2.advert)
# summary(m4.advert.with.interact)
# look at model coefficients
tidy(m4.advert.with.interact) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6.7502202 | 0.2478714 | 27.232755 | 0.0000000 |
| tv | 0.0191011 | 0.0015041 | 12.698954 | 0.0000000 |
| radio | 0.0288603 | 0.0089053 | 3.240815 | 0.0014005 |
| tv:radio | 0.0010865 | 0.0000524 | 20.726564 | 0.0000000 |
# model summary:
glance(m4.advert.with.interact) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.9677905 | 0.9672975 | 0.9435154 | 1963.057 | 0 | 4 | -270.1389 | 550.2778 | 566.7694 | 174.4834 | 196 |
# residual plots
par(mfrow = c(2, 2)) # plot 4 graphs on 1 screen
plot(m4.advert.with.interact)
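One way to read the interaction: the marginal effect of an extra unit of tv spend is no longer a single number, it depends on radio. A quick sketch that works this out from the fitted coefficients:
# expected change in sales per extra unit of tv spend, at a given radio spend:
#   d(sales)/d(tv) = coef("tv") + coef("tv:radio") * radio
cf <- coef(m4.advert.with.interact)
cf["tv"] + cf["tv:radio"] * c(0, 25, 50) # marginal effect of tv at radio = 0, 25, 50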
predict
Set up a fake dataset to explore how the interaction between tv and radio advertising works. For example, try creating a dataframe with four rows, then pass this fake data to the predict function.
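Here is one possible way to build that fake data (a sketch; the spend values are arbitrary stand-ins for low and high budgets on each channel):
# four rows: every combination of low/high tv and low/high radio spend
df.advert.fake <- expand.grid(tv = c(25, 250),
                              radio = c(5, 40))
df.advert.fake %>%
  mutate(pred.sales = predict(m4.advert.with.interact,
                              newdata = df.advert.fake))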
We’ll use the Credit dataset. The goal is to predict the balance on a customer’s credit card.
df3.credit <- Credit %>%
clean_names()
# str(df3.credit)
# summary(df3.credit)
head(df3.credit) %>%
# don't use these lines; they're only for RMarkdown docs:
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| id | income | limit | rating | cards | age | education | gender | student | married | ethnicity | balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
| 6 | 80.180 | 8047 | 569 | 4 | 77 | 10 | Male | No | No | Caucasian | 1151 |
First fit a model with only income and student.
m5.credit <- lm(balance ~ income + student,
data = df3.credit)
# summary(m5.credit)
df3.credit %>%
mutate(pred.balance = predict(m5.credit)) %>%
ggplot(aes(x = income,
y = pred.balance)) +
geom_point(aes(col = student))
Does it make sense that the effect of additional income is the same whether or not you’re a student? Possibly not. Let’s fit a more general model that allows the slopes of the two lines to differ.
m6.credit.interact <- lm(balance ~ income +
student +
income*student, # interaction term
data = df3.credit)
summary(m6.credit.interact)
##
## Call:
## lm(formula = balance ~ income + student + income * student, data = df3.credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## income 6.2182 0.5921 10.502 < 2e-16 ***
## studentYes 476.6758 104.3512 4.568 6.59e-06 ***
## income:studentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
df3.credit %>%
mutate(pred.balance = predict(m6.credit.interact)) %>%
ggplot(aes(x = income,
y = pred.balance)) +
geom_point(aes(col = student))
Note that the adjusted R-squared and residual standard error are essentially unchanged after adding the interaction, and the interaction term is not statistically significant. In this case, we should exclude the interaction from our final model.
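One way to check that more formally is to compare the two nested models directly; a quick sketch using the models fitted above:
# F-test for the interaction term (m5.credit is nested in m6.credit.interact):
anova(m5.credit, m6.credit.interact)
# or put the one-line summaries side by side:
bind_rows(glance(m5.credit),
          glance(m6.credit.interact))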
Use the Boston dataset (from the MASS package; ISLR uses it but does not include it). Try to find the best model to predict median house value (medv) using the 13 other variables in the dataset.
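A possible starting point (a sketch, not a full answer; the object name just continues the numbering used above): fit a model with all 13 predictors, then check the diagnostics and simplify from there.
# Boston lives in the MASS package (it is not bundled with ISLR):
data("Boston", package = "MASS")
m7.boston <- lm(medv ~ ., data = Boston) # "." means: use all other columns as predictors
summary(m7.boston)
# residual diagnostics, as above:
par(mfrow = c(2, 2))
plot(m7.boston)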