Regression: Not Only a Mystery Thriller

Olesya Volchenko and Anna Shirokanova

April 2018

What do we already know?

Why do we need regression?

Omitted variable effect (1)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         ?
The taller the person, the shorter their hair                              ?
The larger the school class, the better the average grades                 ?
People who use the Internet daily in Africa are happier                    ?
Ice-cream sales are positively related to the number of people drowning    ?

Omitted variable effect (2)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         Age
The taller the person, the shorter their hair                              Gender
The larger the school class, the better the average grades                 School size / equipment
People who use the Internet daily in Africa are happier                    Income
Ice-cream sales are positively related to the number of people drowning    Season
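
A quick way to see the omitted variable effect is to simulate it. Below is a minimal sketch (all variable names and parameters are hypothetical): season drives both ice-cream sales and drownings, producing a spurious association that shrinks once the confounder enters the model.

# hypothetical simulation of a spurious relationship driven by a confounder
set.seed(42)
season    <- runif(200, 0, 30)                       # e.g. daily temperature
icecream  <- 10 + 2.0 * season + rnorm(200, sd = 5)  # sales rise with temperature
drownings <-  1 + 0.3 * season + rnorm(200, sd = 2)  # drownings rise with temperature
summary(lm(drownings ~ icecream))           # spurious positive slope
summary(lm(drownings ~ icecream + season))  # slope shrinks once season is controlled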

Regression modelling

Toy data example

Suppose we have two variables, x and y:

y x
2.3434072 1.1062056
-5.7028911 -1.6623645
-1.2161804 -0.2888668
1.0362797 0.5752666
1.3464117 0.3875749
2.0395967 0.2835436
1.2407077 0.1530659
4.1694978 1.0044248
-0.8050663 -0.1653561
2.7972354 1.1869595
1.2171334 0.2319040
0.2003237 0.4613343
1.4274962 0.6209180
6.7268963 2.1472300
0.4021215 0.1328819
0.3360385 0.0873094
2.6072062 0.6217123
-4.0830110 -0.8438386
5.9026377 1.3107213
2.9125798 0.7172064
4.9443642 1.6982076
2.9856271 1.1205246
1.3026918 0.3585168
1.9968218 0.3650547
4.3063373 1.7451252
-4.0624447 -1.3103822
3.3363657 1.3952343
-4.5741076 -1.2653576
4.8788024 1.2802166
1.5956168 0.5503804
-5.5108671 -1.9861473
1.0963555 0.3808488
4.0262914 1.0942326
-0.9364985 -0.6841675
-2.9122435 -0.9642431
-5.3560686 -1.6591004
-5.0853931 -1.3529958
-2.7138662 -0.6265333
0.7735975 0.4267002
-0.0738835 -0.4384743
4.4310860 1.0516549
-3.1855930 -0.9060857
0.8373001 0.2054775
2.8863445 1.1665390
3.6380946 0.6640382
-4.3165562 -0.7684066
2.0484183 0.2688752
-4.3024800 -1.1724599
6.1449382 1.7689950
-3.1971329 -0.7483182
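
The slides do not show how x and y were generated; a sketch consistent with the model fitted below (slope ≈ 3.26, near-zero intercept, residual SD ≈ 0.85) might be:

# hypothetical data-generating process; the seed and exact parameters are assumptions
set.seed(1)
x <- rnorm(50)                        # 50 observations of the predictor
y <- 3.26 * x + rnorm(50, sd = 0.85)  # linear signal plus noise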

Visualise their relationship

plot(x, y, pch = 16)

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 26.972, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9449707 0.9821281
## sample estimates:
##       cor 
## 0.9685582
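
Note that in a one-predictor model, R² is simply the squared Pearson correlation, which links this output to the regression below:

cor(x, y)^2  # 0.9685582^2 = 0.9381, the Multiple R-squared reported below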

Estimate the model

model1 <- lm(y ~ x)
summary(model1)

R Output

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77859 -0.61624 -0.05182  0.61653  1.66946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03548    0.12302  -0.288    0.774    
## x            3.25672    0.12074  26.972   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8539 on 48 degrees of freedom
## Multiple R-squared:  0.9381, Adjusted R-squared:  0.9368 
## F-statistic: 727.5 on 1 and 48 DF,  p-value: < 2.2e-16

What are the conclusions?

Plot

plot(x, y, pch = 16)
abline(model1, col = "red")                       # fitted regression line
text(0, 2, paste("y = ", round(model1$coefficients[1], 1), "+",
    round(model1$coefficients[2], 1), "* x"))     # equation label
abline(v = 0, lty = 2, col = "gray")              # dashed reference line at x = 0

Coefficient of determination R²
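
R² = 1 − SS_res / SS_tot: the share of the variance of y that the model explains. It can be computed by hand and checked against summary():

# R-squared from first principles
ss_res <- sum(residuals(model1)^2)  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares
1 - ss_res / ss_tot                 # matches summary(model1)$r.squared, 0.9381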

Several predictors

Dummy variables

Example: salary of males and females

20 males and females (too few for a real regression)

##   age    salary sex
## 1  30  98.88262   M
## 2  36 118.76878   M
## 3  38 126.83304   M
## 4  35 114.72223   M
## 5  35 114.31441   M
## 6  26  90.22501   M

Example: exploratory plots

Example: the model

model2 <- lm(salary ~ age + sex, data = genderdata)
sjt.lm(model2, show.ci = F)
                        salary
                   B            p
(Intercept)    -1.72         .545
age             3.07        <.001
sex (M)         9.88        <.001
Observations      20
R² / adj. R²   .992 / .991
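
One way to read the dummy coefficient: at any fixed age, the predicted salary of men is 9.88 units higher than that of women. A small sketch, assuming the female level is coded "F" in genderdata:

# predicted salaries at age 40 for both sexes;
# the difference equals the sex (M) coefficient
newdata <- data.frame(age = c(40, 40), sex = c("F", "M"))
predict(model2, newdata)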

Example: plot (10 observations)

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
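
In R there is no need to build such dummies manually: a factor entering a formula is expanded automatically, with the first level as the reference category. A sketch with a hypothetical educ factor:

# model.matrix() shows the dummy coding R creates behind the scenes
educ <- factor(c("tertiary", "secondary", "primary"),
               levels = c("tertiary", "secondary", "primary"))
model.matrix(~ educ)  # intercept plus dummies for secondary and primary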

Interaction effect

##   age   salary sex
## 1  35 240.6612   M
## 2  27 194.1506   M
## 3  40 281.2867   M
## 4  22 157.2166   M
## 5  24 166.1852   M
## 6  36 257.1027   M

Model Comparison

model3 <- lm(salary ~ age + sex, data = genderdata2)
model4 <- lm(salary ~ age * sex, data = genderdata2)
sjt.lm(model3, model4, show.ci = F)
                 salary (model3)           salary (model4)
                 B            p            B            p
(Intercept)    -17.75      .470          111.38      <.001
age              4.80     <.001            0.51       .093
sexM            83.72     <.001         -112.88      <.001
age:sexM                                   6.57      <.001
Observations      20                        20
R² / adj. R²   .894 / .882              .995 / .994
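
With the interaction term, the age slope differs by sex: about 0.51 for women (the reference group) and 0.51 + 6.57 ≈ 7.08 for men. The same numbers can be recovered from the fitted object:

coef(model4)["age"]                             # age slope for women
coef(model4)["age"] + coef(model4)["age:sexM"]  # age slope for men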

What the Interaction Effect Model Looks Like

What happens if we ignore the interaction effect?

Standardised Coefficients
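
Standardised (beta) coefficients express effects in standard-deviation units, which makes predictors measured on different scales comparable. Two equivalent routes, sketched here for the toy model1:

lm(scale(y) ~ scale(x))            # route 1: z-score the variables before fitting
coef(model1)["x"] * sd(x) / sd(y)  # route 2: rescale the raw slope by sd(x)/sd(y)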

Linear Regression Assumptions
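
The standard assumptions are linearity, independent errors, constant error variance (homoskedasticity) and, for inference, approximately normal residuals. A quick way to eyeball them in base R is the diagnostic plot method for lm objects, sketched here for model1:

par(mfrow = c(2, 2))  # residuals vs fitted, QQ-plot, scale-location, leverage
plot(model1)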

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)       # constant error variance: homoskedastic
y2 <- rnorm(n = 100, mean = x, sd = 0.4 * x)  # variance grows with x: heteroskedastic
par(mfrow = c(1, 2))
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

Anscombe’s Quartet

Feature                          Value
Mean of x                        9.0
Variance of x                    11.0
Mean of y                        7.50
Variance of y                    ~4.12
Correlation between x and y      0.816
Fitted regression line           y = 3.00 + 0.50x
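
The quartet ships with base R as the anscombe data frame, so the (near-)identical summary statistics are easy to verify:

# near-identical correlations across four very different point clouds
data(anscombe)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                    pch = 16, xlab = paste0("x", i), ylab = paste0("y", i))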

Another example

This example originates from the book by Jeremy Miles and Mark Shevlin (2000).

Outcome distribution

## 
##  Shapiro-Wilk normality test
## 
## data:  df$grade
## W = 0.93009, p-value = 0.01622

Models

model3 = lm(grade ~ books + attend, data = df)
summary(model3)
## 
## Call:
## lm(formula = grade ~ books + attend, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.379      7.745   4.827 2.41e-05 ***
## books          4.037      1.753   2.303   0.0270 *  
## attend         1.284      0.587   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
model3scaled <- lm(grade ~ scale(as.numeric(books)) + 
                     scale(as.numeric(attend)), data = df)
summary(model3scaled)
## 
## Call:
## lm(formula = grade ~ scale(as.numeric(books)) + scale(as.numeric(attend)), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 63.550      2.222  28.602   <2e-16 ***
## scale(as.numeric(books))     5.782      2.511   2.303   0.0270 *  
## scale(as.numeric(attend))    5.490      2.511   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
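
Note that standardising the predictors changes the coefficients (they are now in standard-deviation units) but leaves the t-values, p-values, R² and F-statistic untouched; only the metric changes:

# fit statistics are invariant to linear rescaling of predictors
all.equal(summary(model3)$r.squared, summary(model3scaled)$r.squared)  # TRUE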

Interaction effect model (1)

model5 = lm(grade ~ attend * books, data = df)
summary(model5)
## 
## Call:
## lm(formula = grade ~ attend * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.844 -11.850   1.466   8.163  34.198 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.2213    11.2621   4.903 2.02e-05 ***
## attend        -0.1368     0.8782  -0.156   0.8771    
## books         -6.2078     5.1508  -1.205   0.2360    
## attend:books   0.7349     0.3494   2.104   0.0425 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 36 degrees of freedom
## Multiple R-squared:  0.4022, Adjusted R-squared:  0.3524 
## F-statistic: 8.073 on 3 and 36 DF,  p-value: 0.0003058
anova(model3, model5)
## Analysis of Variance Table
## 
## Model 1: grade ~ books + attend
## Model 2: grade ~ attend * books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     37 7306.2                              
## 2     36 6506.4  1    799.78 4.4252 0.04246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
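
For a single added term, the F-statistic of the anova comparison equals the squared t-statistic of that term, so the two tests agree up to rounding:

coef(summary(model5))["attend:books", "t value"]^2  # 2.104^2 ≈ 4.43, the F above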

Interaction effect model (2)

model6 = lm(grade ~ lucky * books, data = df)
summary(model6)
## 
## Call:
## lm(formula = grade ~ lucky * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.481  -9.802  -2.542  11.454  29.109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   47.542      4.980   9.547 2.11e-11 ***
## lucky         11.759      8.227   1.429  0.16152    
## books          6.004      2.104   2.853  0.00713 ** 
## lucky:books   -1.709      3.336  -0.512  0.61149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 36 degrees of freedom
## Multiple R-squared:  0.3086, Adjusted R-squared:  0.251 
## F-statistic: 5.356 on 3 and 36 DF,  p-value: 0.003737

Interaction effect visualisation (1)

sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

Interaction effect visualisation (2)

x = plotSlopes(model5, plotx = "books", 
           modx = "attend", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of attend OUTSIDE this interval:
##         lo         hi 
## -141.07291   13.13872 
## cause the slope of (b1 + b2*attend)books to be statistically significant

This is the marginal effect of books at different values of attend

Visualisation (3)

sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

sjp.int(model5, type = "eff", show.ci = T, mdrt.values = "all")

All models in one table

sjt.lm(model3, model5, model6)
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encounter cluttered labelling.
                grade (model3)              grade (model5)               grade (model6)
                B      CI             p     B       CI             p     B      CI             p
(Intercept)   37.38  21.69 – 53.07  <.001   55.22  32.38 – 78.06  <.001  47.54  37.44 – 57.64  <.001
books          4.04   0.48 – 7.59    .027   -6.21  -16.65 – 4.24   .236   6.00   1.74 – 10.27   .007
attend         1.28   0.09 – 2.47    .035   -0.14   -1.92 – 1.64   .877
attend:books                                 0.73    0.03 – 1.44   .042
lucky                                                                    11.76  -4.93 – 28.44   .162
lucky:books                                                              -1.71  -8.47 – 5.06    .611
Observations     40                            40                          40
R² / adj. R²  .329 / .292                   .402 / .352                  .309 / .251

Checklist for interaction-effect regressions (Brambor et al. 2005):

Include all constitutive terms of the interaction in the model.
Do not interpret the constitutive terms as unconditional average effects.
Calculate substantively meaningful marginal effects and their standard errors.

For interpretation, see: https://www.theanalysisfactor.com/interpreting-interactions-in-regression/

Summary: nuts and bolts of regression modelling

What’s next?

Tomorrow there is a seminar for some of you. Mind Chapter 14.

Monday’s lab is here:

rpubs.com/shirokaner/lablm1