Regression: Not Only a Mystery Thriller

Olesya Volchenko and Anna Shirokanova

April 2018

What do we already know?

Why do we need regression?

Omitted variable effect (1)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         ?
The taller the person, the shorter their hair                              ?
The larger the school class, the better the average grades                 ?
People who use the Internet daily in Africa are happier                    ?
Ice-cream sales are positively related to the number of people drowning    ?

Omitted variable effect (2)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         Age
The taller the person, the shorter their hair                              Gender
The larger the school class, the better the average grades                 School size / equipment
People who use the Internet daily in Africa are happier                    Income
Ice-cream sales are positively related to the number of people drowning    Season

Regression modelling

Toy data example

There are two normally distributed variables, x and y, with n = 50 observations:

##           y          x
## -2.2769689 -0.9030778
##  0.2932664 -0.1464296
## -2.1536341 -1.2766415
##  3.3128843  0.9340912
##  6.2678134  2.3101434
##  2.9741792  1.2217376
##        y                  x           
##  Min.   :-8.43173   Min.   :-1.92319  
##  1st Qu.:-2.07282   1st Qu.:-0.68045  
##  Median :-0.03748   Median :-0.11262  
##  Mean   : 0.17194   Mean   : 0.06454  
##  3rd Qu.: 2.36191   3rd Qu.: 0.91913  
##  Max.   : 6.26781   Max.   : 2.31014
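Toy data like this can be simulated in a couple of lines. A minimal sketch, where the seed, slope, and noise level are our assumptions, chosen only to roughly match the summaries above:

set.seed(42)                        # assumption: any seed works
x <- rnorm(50)                      # predictor: standard normal, n = 50
y <- 2.8 * x + rnorm(50, sd = 1.1)  # outcome: linear in x plus random noise
head(cbind(y, x))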

Visualise their relationship

plot(x, y, pch = 16)

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 18.604, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8912640 0.9640148
## sample estimates:
##       cor 
## 0.9371278

Estimate the model

model1 <- lm(y ~ x)
summary(model1)

R Output

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9636 -0.4744 -0.0366  0.4992  3.8284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01119    0.15885   -0.07    0.944    
## x            2.83743    0.15252   18.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.121 on 48 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8757 
## F-statistic: 346.1 on 1 and 48 DF,  p-value: < 2.2e-16

What are the conclusions?

Plot

plot(x, y, pch = 16)
abline(model1, col = "red")
text(0, 2, paste("y = ", round(model1$coefficients[1], 1), "+", round(model1$coefficients[2], 
    1), "* x"))
abline(v = 0, lty = 2, col = "gray")

Coefficient of determination R2
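R² is the share of variance in y that the model explains. As a sanity check, it can be computed by hand from model1:

rss <- sum(residuals(model1)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
1 - rss / tss                    # reproduces Multiple R-squared: 0.8782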

Other ways to evaluate model fit:
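For example, information criteria and the residual standard error, shown here for model1:

AIC(model1)    # Akaike information criterion: lower means better fit
BIC(model1)    # Bayesian information criterion: stricter penalty for extra parameters
sigma(model1)  # residual standard error, 1.121 in the output above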

Several predictors

Dummy-variables

Example: salary of males and females

200 observations: 100 females and 100 males

##       age         salary       sex    
##  Min.   :14   Min.   : 40.60   F:100  
##  1st Qu.:26   1st Qu.: 85.44   M:100  
##  Median :30   Median : 94.70          
##  Mean   :30   Mean   : 94.98          
##  3rd Qu.:33   3rd Qu.:105.08          
##  Max.   :43   Max.   :133.30
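A sketch of how such data could be simulated; the seed and the exact coefficients are assumptions, picked to mirror the estimates of model2 below:

set.seed(1)                                  # assumption: any seed works
age <- round(rnorm(200, mean = 30, sd = 5))  # ages centred on 30
sex <- factor(rep(c("F", "M"), each = 100))  # 100 females, 100 males
salary <- 1 + 3 * age + 10 * (sex == "M") + rnorm(200, sd = 1)
genderdata <- data.frame(age, salary, sex)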

Example: exploratory plots
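A sketch of typical exploratory plots for these data, using base graphics as elsewhere in these slides:

par(mfrow = c(1, 2))
boxplot(salary ~ sex, data = genderdata)           # do salaries differ by sex?
plot(genderdata$age, genderdata$salary, pch = 16)  # is salary roughly linear in age?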

Example: the model

model2 <- lm(salary ~ age + sex, data = genderdata)
sjt.lm(model2, show.ci = F)
                 salary
                 B             p
(Intercept)      1.00          .080
age              2.97          <.001
sex (M)          10.08         <.001
Observations     200
R2 / adj. R2     .993 / .993

Example: plot
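A sketch of the model plot: points coloured by sex, plus the two parallel lines implied by model2 (the colours are our choice):

b <- coef(model2)
plot(genderdata$age, genderdata$salary, pch = 16,
     col = ifelse(genderdata$sex == "M", "blue", "red"))
abline(a = b["(Intercept)"], b = b["age"], col = "red")              # females (reference group)
abline(a = b["(Intercept)"] + b["sexM"], b = b["age"], col = "blue") # males: same slope, shifted up by 10.08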

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
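In R you rarely build these dummies by hand: passing a factor to lm() creates them automatically. A sketch with model.matrix(), releveling so that tertiary is the reference category, as in the table above:

educ <- relevel(factor(c("tertiary", "secondary", "primary")), ref = "tertiary")
model.matrix(~ educ)  # one 0/1 column per non-reference level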

Interaction effect

Model Comparison

model3 <- lm(salary ~ age + sex, data = genderdata2)
model4 <- lm(salary ~ age * sex, data = genderdata2)
sjt.lm(model3, model4, show.ci = F)
                 salary                     salary
                 B           p              B            p
(Intercept)      4.11        .565           99.07        <.001
age              4.19        <.001          1.07         <.001
sexM             77.86       <.001          -103.23      <.001
age:sexM                                    6.04         <.001
Observations     200                        200
R2 / adj. R2     .884 / .883                .986 / .986

What the Interaction Effect Model Looks Like

What happens if we ignore the interaction effect?

Can we identify a possible interaction effect visually?

Sometimes yes. Heteroskedasticity may indicate a moderation/interaction effect. Let's recall the assumptions first.

Linear Regression Assumptions

- Linearity: the outcome is a linear function of the predictors
- Independence of the observations
- Homoskedasticity: the residuals have constant variance
- Normality of the residuals

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)
y2 <- rnorm(n = 100, mean = x, sd = 0.4*x)
par(mfrow = c(1, 2))
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

The right-hand panel, where the data points 'fan out', suggests that a third variable comes into play at high values of x

Anscombe’s Quartet

Feature                        Value
Mean of x                      9.0
Variance of x                  10.0
Mean of y                      7.5
Variance of y                  3.75
Correlation between x and y    0.816
Fitted regression line         y = 3 + 0.5x
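R ships with these data as anscombe (columns x1-x4 and y1-y4), so the quartet is easy to verify yourself:

data(anscombe)
sapply(anscombe, mean)         # every x averages 9, every y 7.5
cor(anscombe$x1, anscombe$y1)  # 0.816; (almost) identical for the other three pairs
par(mfrow = c(2, 2))           # yet the four scatterplots look nothing alike
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]], pch = 16)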

Another example

This example originates from the book by Jeremy Miles and Mark Shevlin (2001)1.

Outcome distribution
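The Shapiro-Wilk normality test is a single call; this is the command behind the output below:

shapiro.test(df$grade)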

## 
##  Shapiro-Wilk normality test
## 
## data:  df$grade
## W = 0.93009, p-value = 0.01622

Models: Standardised Coefficients

model3 = lm(grade ~ books + attend, data = df)
summary(model3)
## 
## Call:
## lm(formula = grade ~ books + attend, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.379      7.745   4.827 2.41e-05 ***
## books          4.037      1.753   2.303   0.0270 *  
## attend         1.284      0.587   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
model3scaled <- lm(grade ~ scale(as.numeric(books)) + 
                     scale(as.numeric(attend)), data = df)
summary(model3scaled)
## 
## Call:
## lm(formula = grade ~ scale(as.numeric(books)) + scale(as.numeric(attend)), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 63.550      2.222  28.602   <2e-16 ***
## scale(as.numeric(books))     5.782      2.511   2.303   0.0270 *  
## scale(as.numeric(attend))    5.490      2.511   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278

The standardised effect of books (5.78) is slightly larger than that of attend (5.49).
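Only the predictors were standardised here, not the outcome, so each standardised slope is simply the raw slope multiplied by the predictor's standard deviation. A quick check (a sketch, assuming books and attend are coercible to numeric as in the model call):

coef(model3)["books"] * sd(as.numeric(df$books))    # reproduces 5.782
coef(model3)["attend"] * sd(as.numeric(df$attend))  # reproduces 5.490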

Interaction effect model (1)

model5 = lm(grade ~ attend * books, data = df)
summary(model5)
## 
## Call:
## lm(formula = grade ~ attend * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.844 -11.850   1.466   8.163  34.198 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.2213    11.2621   4.903 2.02e-05 ***
## attend        -0.1368     0.8782  -0.156   0.8771    
## books         -6.2078     5.1508  -1.205   0.2360    
## attend:books   0.7349     0.3494   2.104   0.0425 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 36 degrees of freedom
## Multiple R-squared:  0.4022, Adjusted R-squared:  0.3524 
## F-statistic: 8.073 on 3 and 36 DF,  p-value: 0.0003058
anova(model3, model5)
## Analysis of Variance Table
## 
## Model 1: grade ~ books + attend
## Model 2: grade ~ attend * books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     37 7306.2                              
## 2     36 6506.4  1    799.78 4.4252 0.04246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction (multiplicative) model explains significantly more variance than the additive model (F = 4.43, p = .042).
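With the interaction, the effect of books is no longer one number: it equals b_books + b_attend:books * attend. A sketch evaluating it at a few attendance levels:

b <- coef(model5)
effect_books <- function(attend) b["books"] + b["attend:books"] * attend
effect_books(c(5, 15, 25))  # about -2.5, 4.8, and 12.2: negative at low attendance, positive at high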

Interaction effect model (2)

model6 = lm(grade ~ lucky * books, data = df)
summary(model6)
## 
## Call:
## lm(formula = grade ~ lucky * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.417 -10.952  -1.455  10.156  37.300 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.92308    5.76906   8.827 2.57e-10 ***
## lucky1        3.76257    8.78658   0.428   0.6712    
## lucky2        0.07692   11.83961   0.006   0.9949    
## books         6.15385    2.42349   2.539   0.0159 *  
## lucky1:books -3.64642    3.71632  -0.981   0.3334    
## lucky2:books  2.55449    4.55189   0.561   0.5783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.62 on 34 degrees of freedom
## Multiple R-squared:  0.3322, Adjusted R-squared:  0.2339 
## F-statistic: 3.382 on 5 and 34 DF,  p-value: 0.01381

Luck has no effect on the final grade whatsoever: neither the lucky dummies nor their interactions with books reach significance

Interaction effect visualisation (1)

library(sjPlot)
sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

Interaction effect visualisation (2)

library(rockchalk)
x = plotSlopes(model5, plotx = "books", 
           modx = "attend", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of attend OUTSIDE this interval:
##         lo         hi 
## -141.07291   13.13872 
## cause the slope of (b1 + b2*attend)books to be statistically significant

x = plotSlopes(model5, plotx = "attend", 
           modx = "books", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of books OUTSIDE this interval:
##         lo         hi 
## -47.419893   1.746103 
## cause the slope of (b1 + b2*books)attend to be statistically significant

The simple slope of attend becomes significant once the number of books exceeds about 2, and the slope of books becomes significant once attendance exceeds about 13 classes.

Visualisation (3)

# sjp.int is being replaced by plot_model, so please learn to use the new function:
sjPlot::plot_model(model5, type = "eff", terms = "books")

sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "all")

# by default, plot_model takes the variable with the smaller range as the moderator, here books

sjPlot::plot_model(model5, type = "int", show.ci = T, terms = "books", mdrt.values = "all")

sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "meansd")

All models in one table

sjt.lm(model3, model5, model6, show.ci = F, p.numeric = F)
                 grade           grade           grade
                 B               B               B
(Intercept)      37.38 ***       55.22 ***       50.92 ***
books            4.04 *          -6.21           6.15 *
attend           1.28 *          -0.14
attend:books                     0.73 *
lucky1                                           3.76
lucky2                                           0.08
lucky1:books                                     -3.65
lucky2:books                                     2.55
Observations     40              40              40
R2 / adj. R2     .329 / .292     .402 / .352     .332 / .234
Notes            * p<.05   ** p<.01   *** p<.001

Checklist for interaction effect regressions

(Brambor et al. 2005)2, in brief:

- Include all constitutive terms of the interaction in the model
- Do not interpret the constitutive terms as unconditional marginal effects
- Calculate substantively meaningful marginal effects and their standard errors across the range of the moderator

For interpretation, see the link

Summary: nuts and bolts of regression modelling

What’s next?

Reading for the seminar: Chapter 14

Monday’s lab is here


  1. Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. Sage.

  2. Brambor, T., Clark, W. R., & Golder, M. (2005). Understanding interaction models: Improving empirical analyses. Political Analysis, 14(1), 63-82.