Regression: Not Only a Mystery Thriller

Olesya Volchenko and Anna Shirokanova

April 2018

What do we already know?

Why do we need regression?

Omitted variable effect (1)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         ?
The taller the person, the shorter their hair                              ?
The larger the school class, the better the average grades                 ?
People who use the Internet daily in Africa are happier                    ?
Ice-cream sales are positively related to the number of people drowning    ?

Omitted variable effect (2)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         Age
The taller the person, the shorter their hair                              Gender
The larger the school class, the better the average grades                 School size / equipment
People who use the Internet daily in Africa are happier                    Income
Ice-cream sales are positively related to the number of people drowning    Season
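
A quick way to see the omitted variable effect is to simulate it. Below is a minimal sketch (all variable names and parameters are hypothetical): season drives both ice-cream sales and drownings, producing a spurious association that shrinks once the confounder enters the model.

# hypothetical simulation of a spurious relationship driven by a confounder
set.seed(42)
season    <- runif(200, 0, 30)                       # e.g. daily temperature
icecream  <- 10 + 2.0 * season + rnorm(200, sd = 5)  # sales rise with temperature
drownings <-  1 + 0.3 * season + rnorm(200, sd = 2)  # drownings rise with temperature
summary(lm(drownings ~ icecream))           # spurious positive slope
summary(lm(drownings ~ icecream + season))  # slope shrinks once season is controlled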

Regression modelling

Toy data example

Suppose we have two variables, x and y:

y x
2.3434072 1.1062056
-5.7028911 -1.6623645
-1.2161804 -0.2888668
1.0362797 0.5752666
1.3464117 0.3875749
2.0395967 0.2835436
1.2407077 0.1530659
4.1694978 1.0044248
-0.8050663 -0.1653561
2.7972354 1.1869595
1.2171334 0.2319040
0.2003237 0.4613343
1.4274962 0.6209180
6.7268963 2.1472300
0.4021215 0.1328819
0.3360385 0.0873094
2.6072062 0.6217123
-4.0830110 -0.8438386
5.9026377 1.3107213
2.9125798 0.7172064
4.9443642 1.6982076
2.9856271 1.1205246
1.3026918 0.3585168
1.9968218 0.3650547
4.3063373 1.7451252
-4.0624447 -1.3103822
3.3363657 1.3952343
-4.5741076 -1.2653576
4.8788024 1.2802166
1.5956168 0.5503804
-5.5108671 -1.9861473
1.0963555 0.3808488
4.0262914 1.0942326
-0.9364985 -0.6841675
-2.9122435 -0.9642431
-5.3560686 -1.6591004
-5.0853931 -1.3529958
-2.7138662 -0.6265333
0.7735975 0.4267002
-0.0738835 -0.4384743
4.4310860 1.0516549
-3.1855930 -0.9060857
0.8373001 0.2054775
2.8863445 1.1665390
3.6380946 0.6640382
-4.3165562 -0.7684066
2.0484183 0.2688752
-4.3024800 -1.1724599
6.1449382 1.7689950
-3.1971329 -0.7483182
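
The slides do not show how x and y were generated; a sketch consistent with the model fitted below (slope ≈ 3.26, near-zero intercept, residual SD ≈ 0.85) might be:

# hypothetical data-generating process; the seed and exact parameters are assumptions
set.seed(1)
x <- rnorm(50)                        # 50 observations of the predictor
y <- 3.26 * x + rnorm(50, sd = 0.85)  # linear signal plus noise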

Visualise their relationship

plot(x, y, pch = 16)

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 26.972, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9449707 0.9821281
## sample estimates:
##       cor 
## 0.9685582
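
Note that in a one-predictor model, R² is simply the squared Pearson correlation, which links this output to the regression below:

cor(x, y)^2  # 0.9685582^2 = 0.9381, the Multiple R-squared reported below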

Estimate the model

model1 <- lm(y ~ x)
summary(model1)

R Output

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77859 -0.61624 -0.05182  0.61653  1.66946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03548    0.12302  -0.288    0.774    
## x            3.25672    0.12074  26.972   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8539 on 48 degrees of freedom
## Multiple R-squared:  0.9381, Adjusted R-squared:  0.9368 
## F-statistic: 727.5 on 1 and 48 DF,  p-value: < 2.2e-16

What are the conclusions?

Plot

plot(x, y, pch = 16)
abline(model1, col = "red")                       # fitted regression line
text(0, 2, paste("y = ", round(model1$coefficients[1], 1), "+",
    round(model1$coefficients[2], 1), "* x"))     # equation label
abline(v = 0, lty = 2, col = "gray")              # dashed reference line at x = 0

Coefficient of determination R²
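
R² = 1 − SS_res / SS_tot: the share of the variance of y that the model explains. It can be computed by hand and checked against summary():

# R-squared from first principles
ss_res <- sum(residuals(model1)^2)  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares
1 - ss_res / ss_tot                 # matches summary(model1)$r.squared, 0.9381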

Several predictors

Dummy variables

Example: salary of males and females

20 males and females (too few for a real regression)

##   age    salary sex
## 1  30  98.88262   M
## 2  36 118.76878   M
## 3  38 126.83304   M
## 4  35 114.72223   M
## 5  35 114.31441   M
## 6  26  90.22501   M

Example: exploratory plots

Example: the model

model2 <- lm(salary ~ age + sex, data = genderdata)
sjt.lm(model2, show.ci = F)
                        salary
                   B            p
(Intercept)    -1.72         .545
age             3.07        <.001
sex (M)         9.88        <.001
Observations      20
R² / adj. R²   .992 / .991
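
One way to read the dummy coefficient: at any fixed age, the predicted salary of men is 9.88 units higher than that of women. A small sketch, assuming the female level is coded "F" in genderdata:

# predicted salaries at age 40 for both sexes;
# the difference equals the sex (M) coefficient
newdata <- data.frame(age = c(40, 40), sex = c("F", "M"))
predict(model2, newdata)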

Example: plot (10 observations)

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
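
In R there is no need to build such dummies manually: a factor entering a formula is expanded automatically, with the first level as the reference category. A sketch with a hypothetical educ factor:

# model.matrix() shows the dummy coding R creates behind the scenes
educ <- factor(c("tertiary", "secondary", "primary"),
               levels = c("tertiary", "secondary", "primary"))
model.matrix(~ educ)  # intercept plus dummies for secondary and primary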

Interaction effect

##   age   salary sex
## 1  35 240.6612   M
## 2  27 194.1506   M
## 3  40 281.2867   M
## 4  22 157.2166   M
## 5  24 166.1852   M
## 6  36 257.1027   M

Model Comparison

model3 <- lm(salary ~ age + sex, data = genderdata2)
model4 <- lm(salary ~ age * sex, data = genderdata2)
sjt.lm(model3, model4, show.ci = F)
                 salary (model3)           salary (model4)
                 B            p            B            p
(Intercept)    -17.75      .470          111.38      <.001
age              4.80     <.001            0.51       .093
sexM            83.72     <.001         -112.88      <.001
age:sexM                                   6.57      <.001
Observations      20                        20
R² / adj. R²   .894 / .882              .995 / .994
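
With the interaction term, the age slope differs by sex: about 0.51 for women (the reference group) and 0.51 + 6.57 ≈ 7.08 for men. The same numbers can be recovered from the fitted object:

coef(model4)["age"]                             # age slope for women
coef(model4)["age"] + coef(model4)["age:sexM"]  # age slope for men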

What the Interaction Effect Model Looks Like

What happens if we ignore the interaction effect?

Standardised Coefficients
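
Standardised (beta) coefficients express effects in standard-deviation units, which makes predictors measured on different scales comparable. Two equivalent routes, sketched here for the toy model1:

lm(scale(y) ~ scale(x))            # route 1: z-score the variables before fitting
coef(model1)["x"] * sd(x) / sd(y)  # route 2: rescale the raw slope by sd(x)/sd(y)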

Linear Regression Assumptions
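
The standard assumptions are linearity, independent errors, constant error variance (homoskedasticity) and, for inference, approximately normal residuals. A quick way to eyeball them in base R is the diagnostic plot method for lm objects, sketched here for model1:

par(mfrow = c(2, 2))  # residuals vs fitted, QQ-plot, scale-location, leverage
plot(model1)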

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)       # constant error variance: homoskedastic
y2 <- rnorm(n = 100, mean = x, sd = 0.4 * x)  # variance grows with x: heteroskedastic
par(mfrow = c(1, 2))
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

Anscombe’s Quartet

Feature                          Value
Mean of x                        9.0
Variance of x                    11.0
Mean of y                        7.50
Variance of y                    ~4.12
Correlation between x and y      0.816
Fitted regression line           y = 3.00 + 0.50x
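
The quartet ships with base R as the anscombe data frame, so the (near-)identical summary statistics are easy to verify:

# near-identical correlations across four very different point clouds
data(anscombe)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                    pch = 16, xlab = paste0("x", i), ylab = paste0("y", i))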

Another example

This example originates from the book by Jeremy Miles and Mark Shevlin (2000).

Outcome distribution

## 
##  Shapiro-Wilk normality test
## 
## data:  df$grade
## W = 0.93009, p-value = 0.01622

Models

model3 = lm(grade ~ books + attend, data = df)
summary(model3)
## 
## Call:
## lm(formula = grade ~ books + attend, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.379      7.745   4.827 2.41e-05 ***
## books          4.037      1.753   2.303   0.0270 *  
## attend         1.284      0.587   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
model3scaled <- lm(grade ~ scale(as.numeric(books)) + 
                     scale(as.numeric(attend)), data = df)
summary(model3scaled)
## 
## Call:
## lm(formula = grade ~ scale(as.numeric(books)) + scale(as.numeric(attend)), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 63.550      2.222  28.602   <2e-16 ***
## scale(as.numeric(books))     5.782      2.511   2.303   0.0270 *  
## scale(as.numeric(attend))    5.490      2.511   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
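
Note that standardising the predictors changes the coefficients (they are now in standard-deviation units) but leaves the t-values, p-values, R² and F-statistic untouched; only the metric changes:

# fit statistics are invariant to linear rescaling of predictors
all.equal(summary(model3)$r.squared, summary(model3scaled)$r.squared)  # TRUE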

Interaction effect model (1)

model5 = lm(grade ~ attend * books, data = df)
summary(model5)
## 
## Call:
## lm(formula = grade ~ attend * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.844 -11.850   1.466   8.163  34.198 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.2213    11.2621   4.903 2.02e-05 ***
## attend        -0.1368     0.8782  -0.156   0.8771    
## books         -6.2078     5.1508  -1.205   0.2360    
## attend:books   0.7349     0.3494   2.104   0.0425 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 36 degrees of freedom
## Multiple R-squared:  0.4022, Adjusted R-squared:  0.3524 
## F-statistic: 8.073 on 3 and 36 DF,  p-value: 0.0003058
anova(model3, model5)
## Analysis of Variance Table
## 
## Model 1: grade ~ books + attend
## Model 2: grade ~ attend * books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     37 7306.2                              
## 2     36 6506.4  1    799.78 4.4252 0.04246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
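
For a single added term, the F-statistic of the anova comparison equals the squared t-statistic of that term, so the two tests agree up to rounding:

coef(summary(model5))["attend:books", "t value"]^2  # 2.104^2 ≈ 4.43, the F above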

Interaction effect model (2)

model6 = lm(grade ~ lucky * books, data = df)
summary(model6)
## 
## Call:
## lm(formula = grade ~ lucky * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.481  -9.802  -2.542  11.454  29.109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   47.542      4.980   9.547 2.11e-11 ***
## lucky         11.759      8.227   1.429  0.16152    
## books          6.004      2.104   2.853  0.00713 ** 
## lucky:books   -1.709      3.336  -0.512  0.61149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 36 degrees of freedom
## Multiple R-squared:  0.3086, Adjusted R-squared:  0.251 
## F-statistic: 5.356 on 3 and 36 DF,  p-value: 0.003737

Interaction effect visualisation (1)

sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

Interaction effect visualisation (2)

x = plotSlopes(model5, plotx = "books", 
           modx = "attend", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of attend OUTSIDE this interval:
##         lo         hi 
## -141.07291   13.13872 
## cause the slope of (b1 + b2*attend)books to be statistically significant

This is the marginal effect of books at different values of attend

Visualisation (3)

sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

sjp.int(model5, type = "eff", show.ci = T, mdrt.values = "all")

All models in one table

sjt.lm(model3, model5, model6)
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encounter cluttered labelling.
                grade (model3)              grade (model5)               grade (model6)
                B      CI             p     B       CI             p     B      CI             p
(Intercept)   37.38  21.69 – 53.07  <.001   55.22  32.38 – 78.06  <.001  47.54  37.44 – 57.64  <.001
books          4.04   0.48 – 7.59    .027   -6.21  -16.65 – 4.24   .236   6.00   1.74 – 10.27   .007
attend         1.28   0.09 – 2.47    .035   -0.14   -1.92 – 1.64   .877
attend:books                                 0.73    0.03 – 1.44   .042
lucky                                                                    11.76  -4.93 – 28.44   .162
lucky:books                                                              -1.71  -8.47 – 5.06    .611
Observations     40                            40                          40
R² / adj. R²  .329 / .292                   .402 / .352                  .309 / .251

Checklist for interaction-effect regressions (Brambor et al. 2005):

Include all constitutive terms of the interaction in the model.
Do not interpret the constitutive terms as unconditional average effects.
Calculate substantively meaningful marginal effects and their standard errors.

For interpretation, see: https://www.theanalysisfactor.com/interpreting-interactions-in-regression/

Summary: nuts and bolts of regression modelling

What’s next?

Tomorrow there is a seminar for some of you. Mind Chapter 14.

Monday’s lab is here:

rpubs.com/shirokaner/lablm1