Day 3 - Linear and nonlinear regression

Load packages for today’s session
Motivation for regression models
Steps in regression modelling
mtcars example
Mini-Exercise: GDP of US Metro areas
Transformations of variables and interactions between variables
Boston house prices dataset
Footnotes

Load packages for today’s session

library(dplyr)  # for manipulating data frames 
library(ggplot2)  # for data viz
library(here)  # for simplifying folder references 
library(readr)  # reading and writing data files 
library(GGally)  # extension to ggplot2
library(ISLR)  # from the book Intro to Stat Learning with R 
library(broom)  # for saving model outputs as dataframes 
library(janitor)  # optional; for cleaning up dataframes

Motivation for regression models

Have you ever been asked to …

Calculate an average of a variable? (e.g. ALOS? Avg daily ED visits?)
Investigate whether the average of one variable is related to one or more other variables? (e.g. ALOS by nursing unit? ED visits by time of day?)
Use historical/sample data to make inferences about a population or what might happen in the future?

If you answered yes to all, regression models may be right for you.

A regression model is a quantitative explanation of how the mean/expected value of one variable (Y) changes as the values of other variables (X1, x2, …) change.

Remember, all models are wrong but some are useful.¹ Calculating an average was considered cutting-edge modelling in the 16th century.² In a sense, the average is “wrong” because no single data point might actually take the average value.³

Similarly, regression models simplify away a lot of detail from the raw data. This is a feature, not a bug.

Steps in regression modelling

Fit a model
Diagnostics: any problems with the model? Is it suitable for the data?
Check overall significance and significance of individual predictors
Check goodness of fit
Fit another model; compare new model with old one, then go with the one that’s better (there are algorithmic approaches for making these decisions)

mtcars example

Remember this graph?

p1.group.means <- mtcars %>%
    
    # let's recode cyl as a discrete variable: 
    mutate(cyl = factor(cyl, 
                        levels = c(4, 6, 8))) %>% 
    
    # now start using gpplot functions: 
    ggplot(aes(x = disp,  # specify x and y axis
               y = mpg)) + 
    
    geom_point(aes(colour = cyl, 
                   size = hp)) + 
    
    # examine three different ways of summarizing behaviour within
    # each level of cyl: 
    
    # mean by group: 
    stat_smooth(aes(group = cyl,
                    colour = cyl), 
                method = "lm", 
                formula = y ~ 1) + 
    
    theme_classic()
    
# print: 
p1.group.means

Two ways to get those horizontal lines:

Averages by group:

mtcars %>% 
    group_by(cyl) %>% 
    summarize(mean.mpg = mean(mpg)) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

cyl	mean.mpg
4	26.66364
6	19.74286
8	15.10000

Regression on categorical variable:

We use the lm function to fit a regression model. summary can be used to examine the model.

df1.mtcars <- mtcars %>% 
    mutate(cyl = as.factor(cyl))

m1.mpg.vs.cyl <- lm(mpg ~ cyl - 1, data = df1.mtcars)

summary(m1.mpg.vs.cyl)

## 
## Call:
## lm(formula = mpg ~ cyl - 1, data = df1.mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## cyl4  26.6636     0.9718   27.44  < 2e-16 ***
## cyl6  19.7429     1.2182   16.21 4.49e-16 ***
## cyl8  15.1000     0.8614   17.53  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.9785, Adjusted R-squared:  0.9763 
## F-statistic: 440.9 on 3 and 29 DF,  p-value: < 2.2e-16

Functions from the package `broom`

The default output is hard to save, so we’ll use the broom package to clean up and create a nice dataframe.

Using `broom::tidy` to get model coefficients

#1 row per coefficient 
broom::tidy(m1.mpg.vs.cyl) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

term	estimate	std.error	statistic
cyl4	26.66364	0.9718008	27.43735
cyl6	19.74286	1.2182168	16.20636
cyl8	15.10000	0.8614094	17.52941

Using `broom::glance` to get one-line model summary

# 1 row per model; very useful when comparing lots of models
broom::glance(m1.mpg.vs.cyl) %>% 
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
0.9785461	0.9763267	3.223099	440.9114	0	3	-81.28198	170.564	176.4269	301.2626	29

Using `broom::augment` to get predicted values, residuals, etc.

# 1 row per observation
broom::augment(m1.mpg.vs.cyl) %>% 
    head %>% 
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

mpg	cyl	.fitted	.se.fit	.resid	.hat	.sigma	.cooksd	.std.resid
21.0	6	19.74286	1.2182168	1.257143	0.1428571	3.270096	0.0098604	0.4212932
21.0	6	19.74286	1.2182168	1.257143	0.1428571	3.270096	0.0098604	0.4212932
22.8	4	26.66364	0.9718008	-3.863636	0.0909091	3.189504	0.0526886	-1.2572423
21.4	6	19.74286	1.2182168	1.657143	0.1428571	3.262661	0.0171335	0.5553410
18.7	8	15.10000	0.8614094	3.600000	0.0714286	3.203267	0.0344491	1.1591009
18.1	6	19.74286	1.2182168	-1.642857	0.1428571	3.262962	0.0168394	-0.5505536

write_csv(broom::tidy(m1.mpg.vs.cyl), 
          here("results", 
               "output from src", 
               "mtcars-regression.csv"))

Interpreting the change in the mean mpg

In both cases above, we see that there’s this negative trend: as num cylinders increases, mpg decreases. So what do we gain by doing the regression instead of just calculating averages?

Are these “trends” real or are they the result of chance? E.g. is the difference in sample means of 6- and 8- cylinder cars representative of a real difference in the population?
Will these “trends” still exist after we account for other variables? We shouldn’t just filter out rows based on other variables - we should account for them systematically.
What is the size of the trend (for continuous variables)? Can we compare the trends of two different sites? Can we detect changes in the size of the trend? Can we use the trend to fill in gaps in our knowledge (interpolation and extrapolation)?

Averages by group, after accounting for weight

Let’s fit another regression, this time incuding weight as a predictor variable.

m2.include.wt <- lm(mpg ~ cyl + wt, data = df1.mtcars)

summary(m2.include.wt)

## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = df1.mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
## cyl8         -6.0709     1.6523  -3.674 0.000999 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

Note that the expected difference in mpg between 4-cylinder and 6-cylinder cars is 4.2556 mpg, while it was 6.9207 mpg previously.

Similarly, after accounting for weight, the difference between 6-cylinder cars and 8-cylinder cars is 1.8153 instead of 4.6429

This is important. Our conclusions may be biased or irrelevant if we don’t systematically control for confounding variables. E.g. an observed difference in ALOS between two nursing units is almost certainly not due the characteristics of the two nursing units alone - there will be multiple confounding variables such as age, diagnoses, etc.

Diagnostic plots

Residuals from regression should be white noise with mean zero. This tells you that you have captured all the systematic structure in the relationship between the response and the predictor variables, leaving only randomness/noise. See here for more details on interpreting these plots.

# display 4 plots in a 2 by 2 grid: 
par(mfrow = c(2, 2))

plot(m2.include.wt)

Using `predict` to interpret models

Once you fit a model, it exists independently of the data that you started with. This is a powerful abstraction - instead of carrying around raw data to describe a system, you can have a simple mathematical model that is easier to interpret and work with.

Although you will usually start by looking at regression coefficients, one of the best ways to interpet your model is to kick it and see what it does. In this context, kicking it refers to feeding it an artificial dataset that you create.

First, just try calling predict:

predict(m2.include.wt)

##        1        2        3        4        5        6        7        8 
## 21.33650 20.51907 26.55377 19.42916 16.89262 18.64379 16.47590 23.76489 
##        9       10       11       12       13       14       15       16 
## 23.89311 18.70790 18.70790 14.87309 15.96300 15.80272 11.09046 10.53269 
##       17       18       19       20       21       22       23       24 
## 10.78593 26.93844 28.81373 28.10849 26.08896 16.63618 16.90865 15.61038 
##       25       26       27       28       29       30       31       32 
## 15.59435 27.78793 27.13078 29.14070 17.75814 20.85566 16.47590 25.07919

The result is a vector of numbers that gives you the predicted value for each row of the mtcars dataset. The broom::augment function is a good way to get these predicted values along with the input data. In the output below, .fitted gives the value that the regression model predicts, given values of cyl and wt.

augment(m2.include.wt) %>% 
    select(mpg, cyl, wt, .fitted) %>% 
    head(15) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

mpg	cyl	wt	.fitted
21.0	6	2.620	21.33650
21.0	6	2.875	20.51907
22.8	4	2.320	26.55377
21.4	6	3.215	19.42917
18.7	8	3.440	16.89262
18.1	6	3.460	18.64379
14.3	8	3.570	16.47590
24.4	4	3.190	23.76489
22.8	4	3.150	23.89311
19.2	6	3.440	18.70790
17.8	6	3.440	18.70790
16.4	8	4.070	14.87309
17.3	8	3.730	15.96300
15.2	8	3.780	15.80272
10.4	8	5.250	11.09046

Let’s see how the model uses wt and cyl values to predict mpg. We expect three parallel lines, one for each value of cyl, with slope equal to the coefficient of wt.

df1.mtcars %>% 
    mutate(pred.mpg = predict(m2.include.wt)) %>% 
    
    ggplot(aes(x = wt, 
               y = pred.mpg)) + 
    
    geom_point(aes(col = cyl))

Yup, just what we expected. Now let’s say you have some new data and you want to see what the model predicts for this new data.

# create new fake data: 

df2.mtcars.fake <- data.frame(cyl = as.factor(c(4, 6, 8)), 
                              wt = c(4, 2, 2))

df2.mtcars.fake %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

cyl	wt
4	4
6	2
8	2

Now pass the fake data to predict:

# note the "newdata" argument in predict() below

df2.mtcars.fake <- 
    df2.mtcars.fake %>% 
    mutate(pred.mpg = predict(m2.include.wt, 
                              newdata = df2.mtcars.fake)) 

# first plot the points that were originally in the data:  
df1.mtcars %>% 
    mutate(pred.mpg = predict(m2.include.wt)) %>% 
    
    ggplot(aes(x = wt, 
               y = pred.mpg)) + 
    
    geom_point(aes(col = cyl)) + 
    
    # then add the new points. Increase size to make them distinguishable: 
    geom_point(data = df2.mtcars.fake, 
               aes(x = wt, 
                   y = pred.mpg, 
                   col = cyl), 
               size = 5)

Plotting actuals vs predicted values to assess model performance

Plotting the actual values of your response variable vs the predicted values from the model is a great way to see how your model is doing and to compare different models. In the best case scenario, all the points below should fall on the blue line with slope 1 and intercept 0.

augment(m2.include.wt) %>% 
    
    ggplot(aes(x = .fitted, 
               y = mpg)) + 
    
    geom_point(aes(col = cyl)) + 
    
    # add 45 degree line: 
    geom_abline(slope = 1, 
                intercept = 0, 
                col = "blue") + 
    
    # make sure axes have same scale: 
    scale_x_continuous(limits = c(10, 35)) + 
    scale_y_continuous(limits = c(10, 35)) + 
    
    labs(title = "Actuals vs predictions", 
         subtitle = "Points above the line are being under-estimated \nPoints below the line are being over-estimated")

By building effective models of the way our operations work, we can encapsulate our knowledge succinctly in a form that can immediately answer a lot of important questions. That way, we don’t have to keep spending time on surface-level ad hoc analyses that always start from scratch with the raw data (“Let’s pull last 2 years of data, make a couple of graphs and send it back to the requester”). That work has already been done during the modelling process, and does not need to be repeated every time.

Mini-Exercise: GDP of US Metro areas

Use the us-msa-gdp-and-population.csv dataset to find the relationship between GDP (in millions of USD) and population for US metropolitan statistical areas (MSAs).
How can you tell whether a particular MSA is under-performing or over-performing, relative to its population size?
Identify the top 3 over-performing MSAs.

Transformations of variables and interactions between variables

We worked with a log-transformation in the mini-exercise. Now let’s look at interactions between variables. These examples are from ISLR.⁴

Interaction between 2 continuous variables.

First read in the data:

df2.advert <- read_csv(here::here("data", 
                                  "Advertising.csv")) %>% 
    select(-X1) %>%  # drop unnecessary column 
    clean_names  # from "janitor"" package

# str(df2.advert)
# summary(df2.advert)
# head(df2.advert)

First we’ll fit a model without an interaction:

m3.advert.no.interaction <- lm(sales ~ tv + radio, 
                               data = df2.advert)

# summary(m3.advert.no.interaction)


# look at model coefficients 
tidy(m3.advert.no.interaction) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

term	estimate	std.error	statistic
(Intercept)	2.9210999	0.2944897	9.919193
tv	0.0457548	0.0013904	32.908708
radio	0.1879942	0.0080400	23.382446

# model summary: 
glance(m3.advert.no.interaction) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
0.8971943	0.8961505	1.681361	859.6177	0	3	-386.197	780.3941	793.5874	556.914	197

# look at residual plots:
par(mfrow = c(2, 2))  # plot 4 graphs on 1 screen
plot(m3.advert.no.interaction)

Those curved shapes on the top-left and bottom-left plots are a little worrying. Looks like we’re under-predicting sales both at the lower end of the range and the higher end.

What if we add an interaction term? We see that the effect of tv advertising on sales is dependent on the level of radio advertising.

m4.advert.with.interact <- lm(sales ~ tv + 
                                  radio + 
                                  tv*radio,  # interaction term
                              data = df2.advert)

# summary(m4.advert.with.interact)


# look at model coefficients 
tidy(m4.advert.with.interact) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

term	estimate	std.error	statistic	p.value
(Intercept)	6.7502202	0.2478714	27.232755	0.0000000
tv	0.0191011	0.0015041	12.698954	0.0000000
radio	0.0288603	0.0089053	3.240815	0.0014005
tv:radio	0.0010865	0.0000524	20.726564	0.0000000

# model summary: 
glance(m4.advert.with.interact) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual
0.9677905	0.9672975	0.9435154	1963.057	0	4	-270.1389	550.2778	566.7694	174.4834	196

# residual plots 
par(mfrow = c(2, 2))  # plot 4 graphs on 1 screen
plot(m4.advert.with.interact)

Mini-exercise: explore interaction with `predict`

Set up a fake dataset to explore how the interaction between tv and radio advertising works. For example, try creating a dataframe with four rows:

Rows 1 and 2: keep tv at a fixed value, but create two values of radio that differ by 1 unit.
Rows 3 and 4: Change the value of tv, but keep it constant across Rows 3 and 4. Leave the values of radio as in the previous two rows.

Then pass this fake data to the predict function. K

Interaction between continuous and categorical variables

We’ll use the Credit dataset. The goal is to predict the balance on a customer’s credit card.

df3.credit <- Credit %>% 
    clean_names()

# str(df3.credit)
# summary(df3.credit)
head(df3.credit) %>% 
    
    # don't use these lines; they're only for RMarkdown docs: 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")

id	income	limit	rating	cards	age	education	gender	student	married	ethnicity	balance
1	14.891	3606	283	2	34	11	Male	No	Yes	Caucasian	333
2	106.025	6645	483	3	82	15	Female	Yes	Yes	Asian	903
3	104.593	7075	514	4	71	11	Male	No	No	Asian	580
4	148.924	9504	681	3	36	11	Female	No	No	Asian	964
5	55.882	4897	357	2	68	16	Male	No	Yes	Caucasian	331
6	80.180	8047	569	4	77	10	Male	No	No	Caucasian	1151

First fit a model with only income and student.

m5.credit <- lm(balance ~ income + student, 
                data = df3.credit)

# summary(m5.credit)

df3.credit %>%  
    mutate(pred.balance = predict(m5.credit)) %>% 
    
    ggplot(aes(x = income, 
               y = pred.balance)) + 
    
    geom_point(aes(col = student))

Does it make sense that the effect of additional income is the same whether or not you’re a student? Possibly not. Let’s fit a more general model that does allow for the slope of the 2 lines to differ.

m6.credit.interact <- lm(balance ~ income +
                             student + 
                             income*student,  # interaction term
                         data = df3.credit)

summary(m6.credit.interact)

## 
## Call:
## lm(formula = balance ~ income + student + income * student, data = df3.credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.39 -325.70  -41.13  321.65  814.04 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       200.6232    33.6984   5.953 5.79e-09 ***
## income              6.2182     0.5921  10.502  < 2e-16 ***
## studentYes        476.6758   104.3512   4.568 6.59e-06 ***
## income:studentYes  -1.9992     1.7313  -1.155    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared:  0.2799, Adjusted R-squared:  0.2744 
## F-statistic:  51.3 on 3 and 396 DF,  p-value: < 2.2e-16

df3.credit %>%  
    mutate(pred.balance = predict(m6.credit.interact)) %>% 
    
    ggplot(aes(x = income, 
               y = pred.balance)) + 
    
    geom_point(aes(col = student))

Note that adjusted R-squared and Residual standard error are pretty much exactly the same after adding the interaction, and the interaction is not significant. In this case, we should exclude the interaction from our final model.

Boston house prices dataset

Use the Boston dataset from the ISLR package. Try to find the best model to predict median house value (medv) using the 13 other variables in the dataset.

Day 3 - Linear and nonlinear regression

Nayef Ahmad - VCH Decision Support

November 16, 2018

Load packages for today’s session

Motivation for regression models

Steps in regression modelling

mtcars example

Averages by group:

Regression on categorical variable:

Functions from the package `broom`

Using `broom::tidy` to get model coefficients

Using `broom::glance` to get one-line model summary

Using `broom::augment` to get predicted values, residuals, etc.

Interpreting the change in the mean mpg

Averages by group, after accounting for weight

Diagnostic plots

Using `predict` to interpret models

Plotting actuals vs predicted values to assess model performance

Mini-Exercise: GDP of US Metro areas

Transformations of variables and interactions between variables

Interaction between 2 continuous variables.

Mini-exercise: explore interaction with `predict`

Interaction between continuous and categorical variables

Boston house prices dataset

Footnotes

Day 3 - Linear and nonlinear regression

Nayef Ahmad - VCH Decision Support

November 16, 2018

Load packages for today’s session

Motivation for regression models

Steps in regression modelling

mtcars example

Averages by group:

Regression on categorical variable:

Functions from the package broom

Using broom::tidy to get model coefficients

Using broom::glance to get one-line model summary

Using broom::augment to get predicted values, residuals, etc.

Interpreting the change in the mean mpg

Averages by group, after accounting for weight

Diagnostic plots

Using predict to interpret models

Plotting actuals vs predicted values to assess model performance

Mini-Exercise: GDP of US Metro areas

Transformations of variables and interactions between variables

Interaction between 2 continuous variables.

Mini-exercise: explore interaction with predict

Interaction between continuous and categorical variables

Boston house prices dataset

Footnotes

Functions from the package `broom`

Using `broom::tidy` to get model coefficients

Using `broom::glance` to get one-line model summary

Using `broom::augment` to get predicted values, residuals, etc.

Using `predict` to interpret models

Mini-exercise: explore interaction with `predict`