Olesya Volchenko and Anna Shirokanova
April 2018
| Relationship | Third variable |
|---|---|
| The larger the foot size of a kid, the more clever s/he is | ? |
| The taller the person, the shorter the hair of that person | ? |
| The larger the school class, the better the average grades | ? |
| People using the Internet daily in Africa are happier | ? |
| Ice-cream sales are positively related to the number of people drowning | ? |
| Relationship | Third variable |
|---|---|
| The larger the foot size of a kid, the more clever s/he is | Age |
| The taller the person, the shorter the hair of that person | Gender |
| The larger the school class, the better the average grades | School size / equipment |
| People using the Internet daily in Africa are happier | Income |
| Ice-cream sales are positively related to the number of people drowning | Season |
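A quick simulation makes the last row concrete. This is a minimal sketch with made-up numbers (the temperature variable, coefficients, and noise levels are all assumptions): a common cause induces a correlation that vanishes once the third variable is partialled out.

```r
# Hypothetical data: temperature drives both ice-cream sales and drownings
set.seed(42)
temperature <- rnorm(365, mean = 15, sd = 10)
icecream  <- 100 + 5.0 * temperature + rnorm(365, sd = 25)
drownings <-   2 + 0.1 * temperature + rnorm(365, sd = 1)

cor(icecream, drownings)                 # clearly positive
cor(resid(lm(icecream ~ temperature)),   # near zero once the common
    resid(lm(drownings ~ temperature)))  # cause is partialled out
```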
Suppose there are two normally distributed variables, x and y, with n = 50 observations.
| y | x |
|---|---|
| -2.2769689 | -0.9030778 |
| 0.2932664 | -0.1464296 |
| -2.1536341 | -1.2766415 |
| 3.3128843 | 0.9340912 |
| 6.2678134 | 2.3101434 |
| 2.9741792 | 1.2217376 |
## y x
## Min. :-8.43173 Min. :-1.92319
## 1st Qu.:-2.07282 1st Qu.:-0.68045
## Median :-0.03748 Median :-0.11262
## Mean : 0.17194 Mean : 0.06454
## 3rd Qu.: 2.36191 3rd Qu.: 0.91913
## Max. : 6.26781 Max. : 2.31014
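The data could have been generated along these lines (an assumption, not the authors' code; the slope and noise level are chosen to roughly match the output above):

```r
set.seed(123)              # any seed; fixed for reproducibility
x <- rnorm(50)             # 50 standard-normal observations
y <- 2.8 * x + rnorm(50)   # y as a noisy linear function of x
```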
plot(x, y, pch = 16)
cor.test(x, y, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 18.604, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8912640 0.9640148
## sample estimates:
## cor
## 0.9371278
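Note that the squared correlation already anticipates the fit of the simple regression below:

```r
cor(x, y)^2  # 0.937^2 = approx. 0.878, the Multiple R-squared of model1
```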
model1 <- lm(y ~ x)
summary(model1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9636 -0.4744 -0.0366 0.4992 3.8284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01119 0.15885 -0.07 0.944
## x 2.83743 0.15252 18.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.121 on 48 degrees of freedom
## Multiple R-squared: 0.8782, Adjusted R-squared: 0.8757
## F-statistic: 346.1 on 1 and 48 DF, p-value: < 2.2e-16
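The fitted model can be used directly for prediction; a short sketch:

```r
# Predicted y at three chosen values of x (equivalently: -0.011 + 2.837 * x)
predict(model1, newdata = data.frame(x = c(-1, 0, 1)))
```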
plot(x, y, pch = 16)
abline(model1, col = "red")
text(0, 2, paste("y = ", round(model1$coefficients[1], 1), "+",
                 round(model1$coefficients[2], 1), "* x"))
abline(v = 0, lty = 2, col = "gray")

A sample of 200 males and females:
summary(genderdata)
## age salary sex
## Min. :14 Min. : 40.60 F:100
## 1st Qu.:26 1st Qu.: 85.44 M:100
## Median :30 Median : 94.70
## Mean :30 Mean : 94.98
## 3rd Qu.:33 3rd Qu.:105.08
## Max. :43 Max. :133.30
model2 <- lm(salary ~ age + sex, data = genderdata)
sjt.lm(model2, show.ci = F)

| salary | B | p |
|---|---|---|
| (Intercept) | 1.00 | .080 |
| age | 2.97 | <.001 |
| sex (M) | 10.08 | <.001 |
| Observations | 200 | |
| R2 / adj. R2 | .993 / .993 | |
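How to read the sex dummy: holding age constant, men are predicted to earn 10.08 more. A sketch using the coefficients from the table above:

```r
b0 <- 1.00; b_age <- 2.97; b_sexM <- 10.08   # values from the table
b0 + b_age * 30              # predicted salary of a 30-year-old woman
b0 + b_age * 30 + b_sexM     # a man of the same age: 10.08 higher
```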
## educ dummy1 dummy2
## 1 tertiary 0 0
## 2 secondary 1 0
## 3 primary 0 1
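The table above can be reproduced by hand; in practice R's factor contrasts (via model.matrix) build such dummies automatically. A minimal sketch:

```r
educ <- c("tertiary", "secondary", "primary")
dummy1 <- as.numeric(educ == "secondary")  # 1 only for secondary
dummy2 <- as.numeric(educ == "primary")    # 1 only for primary
data.frame(educ, dummy1, dummy2)           # tertiary is the reference (0, 0)
```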
model3 <- lm(salary ~ age + sex, data = genderdata2)
model4 <- lm(salary ~ age * sex, data = genderdata2)
sjt.lm(model3, model4, show.ci = F)

| salary | B (model3) | p | B (model4) | p |
|---|---|---|---|---|
| (Intercept) | 4.11 | .565 | 99.07 | <.001 |
| age | 4.19 | <.001 | 1.07 | <.001 |
| sexM | 77.86 | <.001 | -103.23 | <.001 |
| age:sexM | | | 6.04 | <.001 |
| Observations | 200 | | 200 | |
| R2 / adj. R2 | .884 / .883 | | .986 / .986 | |
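Reading the interaction in model4: the slope of age now depends on sex. A sketch using the coefficients from the table above:

```r
slope_age_F <- 1.07          # reference category (females)
slope_age_M <- 1.07 + 6.04   # males: main effect plus interaction
c(female = slope_age_F, male = slope_age_M)
```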
Sometimes, yes: heteroskedasticity may indicate a moderation/interaction effect. Let's recall the assumptions first.
x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)
y2 <- rnorm(n = 100, mean = x, sd = 0.4*x)
par(mfrow = c(1, 2))
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

The panel on the right-hand side, with the data points 'fanning out', shows that a third variable comes into play at high values of x.
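Visual inspection can be backed by a formal test; a sketch assuming the lmtest package is installed:

```r
library(lmtest)
bptest(lm(y1 ~ x))  # Breusch-Pagan; constant variance: expect a large p-value
bptest(lm(y2 ~ x))  # 'fanning out': expect a small p-value
```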
| Feature | Value |
|---|---|
| Mean x | 9.0 |
| Variance x | 10.0 |
| Mean y | 7.5 |
| Variance y | 3.75 |
| Correlation between x and y | 0.816 |
| Fitted regression line | y = 3 + 0.5x |
This example originates from the book by Jeremy Miles and Mark Shevlin (2001).
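R's built-in anscombe data makes the same point and is easy to check yourself:

```r
# Four x-y pairs with near-identical summaries but very different scatterplots
for (i in 1:4) {
  fit <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  print(round(coef(fit), 2))  # every line is approximately y = 3 + 0.5 * x
}
```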
shapiro.test(df$grade)
##
## Shapiro-Wilk normality test
##
## data: df$grade
## W = 0.93009, p-value = 0.01622
model3 = lm(grade ~ books + attend, data = df)
summary(model3)
##
## Call:
## lm(formula = grade ~ books + attend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.802 -13.374 0.060 9.173 32.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.379 7.745 4.827 2.41e-05 ***
## books 4.037 1.753 2.303 0.0270 *
## attend 1.284 0.587 2.187 0.0352 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared: 0.3287, Adjusted R-squared: 0.2924
## F-statistic: 9.059 on 2 and 37 DF, p-value: 0.0006278
model3scaled <- lm(grade ~ scale(as.numeric(books)) +
scale(as.numeric(attend)), data = df)
summary(model3scaled)
##
## Call:
## lm(formula = grade ~ scale(as.numeric(books)) + scale(as.numeric(attend)),
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.802 -13.374 0.060 9.173 32.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.550 2.222 28.602 <2e-16 ***
## scale(as.numeric(books)) 5.782 2.511 2.303 0.0270 *
## scale(as.numeric(attend)) 5.490 2.511 2.187 0.0352 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared: 0.3287, Adjusted R-squared: 0.2924
## F-statistic: 9.059 on 2 and 37 DF, p-value: 0.0006278
After standardization, the effect of books (5.78) is only slightly larger than the effect of attend (5.49).
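Note these coefficients standardize only the predictors. For fully standardized (beta) coefficients one would scale the outcome as well; a sketch:

```r
summary(lm(scale(grade) ~ scale(as.numeric(books)) + scale(as.numeric(attend)),
           data = df))
```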
model5 = lm(grade ~ attend * books, data = df)
summary(model5)
##
## Call:
## lm(formula = grade ~ attend * books, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.844 -11.850 1.466 8.163 34.198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.2213 11.2621 4.903 2.02e-05 ***
## attend -0.1368 0.8782 -0.156 0.8771
## books -6.2078 5.1508 -1.205 0.2360
## attend:books 0.7349 0.3494 2.104 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.44 on 36 degrees of freedom
## Multiple R-squared: 0.4022, Adjusted R-squared: 0.3524
## F-statistic: 8.073 on 3 and 36 DF, p-value: 0.0003058
anova(model3, model5)
## Analysis of Variance Table
##
## Model 1: grade ~ books + attend
## Model 2: grade ~ attend * books
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 7306.2
## 2 36 6506.4 1 799.78 4.4252 0.04246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction (multiplicative) model explains significantly more variance than the additive model (F = 4.43, p = .042).
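Information criteria point the same way (a quick sketch; lower values indicate the better-fitting model):

```r
AIC(model3, model5)
```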
model6 = lm(grade ~ lucky * books, data = df)
summary(model6)
##
## Call:
## lm(formula = grade ~ lucky * books, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.417 -10.952 -1.455 10.156 37.300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.92308 5.76906 8.827 2.57e-10 ***
## lucky1 3.76257 8.78658 0.428 0.6712
## lucky2 0.07692 11.83961 0.006 0.9949
## books 6.15385 2.42349 2.539 0.0159 *
## lucky1:books -3.64642 3.71632 -0.981 0.3334
## lucky2:books 2.55449 4.55189 0.561 0.5783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.62 on 34 degrees of freedom
## Multiple R-squared: 0.3322, Adjusted R-squared: 0.2339
## F-statistic: 3.382 on 5 and 34 DF, p-value: 0.01381
Luck has no effect on the final grade whatsoever.
library(sjPlot)
sjp.int(model5, type = "eff")
sjp.int(model5, type = "eff", show.ci = T)
library(rockchalk)
x = plotSlopes(model5, plotx = "books",
               modx = "attend", interval = "conf")
plot(testSlopes(x), shade = T)
## Values of attend OUTSIDE this interval:
## lo hi
## -141.07291 13.13872
## cause the slope of (b1 + b2*attend)books to be statistically significant
x = plotSlopes(model5, plotx = "attend",
               modx = "books", interval = "conf")
plot(testSlopes(x), shade = T)
## Values of books OUTSIDE this interval:
## lo hi
## -47.419893 1.746103
## cause the slope of (b1 + b2*books)attend to be statistically significant
The simple slopes become statistically significant above roughly 2 books read and 13 classes attended.
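The same logic can be checked by hand from model5's coefficients, e.g. the point estimate of the simple slope of books at a chosen attendance:

```r
b <- coef(model5)
b["books"] + b["attend:books"] * 20  # slope of books when attend = 20
```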
# sjp.int is now turning into plot_model, so please learn to use the new function:
sjPlot::plot_model(model5, type = "eff", terms = "books")
sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "all")
# by default, it takes the variable with the smaller range as the moderator, here books
sjPlot::plot_model(model5, type = "int", show.ci = T, terms = "books", mdrt.values = "all")
sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "meansd")
sjt.lm(model3, model5, model6, show.ci = F, p.numeric = F)

| grade | B (model3) | B (model5) | B (model6) |
|---|---|---|---|
| (Intercept) | 37.38 *** | 55.22 *** | 50.92 *** |
| books | 4.04 * | -6.21 | 6.15 * |
| attend | 1.28 * | -0.14 | |
| attend:books | | 0.73 * | |
| lucky1 | | | 3.76 |
| lucky2 | | | 0.08 |
| lucky1:books | | | -3.65 |
| lucky2:books | | | 2.55 |
| Observations | 40 | 40 | 40 |
| R2 / adj. R2 | .329 / .292 | .402 / .352 | .332 / .234 |
| Notes | * p<.05 ** p<.01 *** p<.001 | | |