To see a summary of the regression assumptions and more ways to run diagnostic tests, see our previous lab.
pkgs <- c("tidyverse",
"dplyr",
"haven",
"foreign",
"lme4",
"nlme",
"lsr",
"emmeans",
"afex",
"knitr",
"kableExtra",
"car",
"mediation",
"rockchalk",
"multilevel",
"bda",
"gvlma",
"stargazer",
"QuantPsyc",
"pequod",
"MASS",
"texreg",
"pwr",
"effectsize",
"semPlot",
"lmtest",
"semptools",
"conflicted",
"nnet",
"ordinal",
"DescTools")
packages <- rownames(installed.packages())
p_to_install <- pkgs[!(pkgs %in% packages)]
if(length(p_to_install) > 0){
install.packages(p_to_install)
}
lapply(pkgs, library, character.only = TRUE)
# devtools::install_github("cardiomoon/semMediation")
library(semMediation)
# tell R which package to use for functions that are in multiple packages
these_functions <- c("mutate", "select", "summarize", "filter")
lapply(these_functions, conflict_prefer, "dplyr")
conflict_prefer("mutate", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("summarize", "dplyr")# quick ANOVA for demonstration
# create id variable for error term
iris_afex <- iris %>%
dplyr::mutate(id = row_number())
anova_ex <- afex::aov_car(Sepal.Length ~ Species + Error(id), data = iris_afex)
summary(anova_ex)

Anova Table (Type 3 tests)
Response: Sepal.Length
num Df den Df MSE F ges Pr(>F)
Species 2 147 0.26501 119.26 0.61871 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
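The effect size below was presumably obtained with cohens_f() from the effectsize package (loaded above); only the output survives, so the exact call is an assumption:

cohens_f(anova_ex) # hypothetical call; effectsize supports afex models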
# Effect Size for ANOVA (Type III)
Parameter | Cohen's f | 95% CI
-----------------------------------
Species | 1.27 | [1.09, Inf]
- One-sided CIs: upper bound fixed at [Inf].
# here is another power analysis function that allows you to use cohen's w.
library(pwr)
pwr.chisq.test(w=0.2,df=1,power=.95,sig.level=0.05)
Chi squared power calculation
w = 0.2
N = 324.8677
df = 1
sig.level = 0.05
power = 0.95
NOTE: N is the number of observations
library(effectsize)
# Get f-squared
reg_ex <- lm(Sepal.Length ~ Species, data = iris_afex)
cohens_f_squared(reg_ex)

# Effect Size for ANOVA
Parameter | Cohen's f2 | 95% CI
------------------------------------
Species | 1.62 | [1.18, Inf]
- One-sided CIs: upper bound fixed at [Inf].
# Manually calculate f-squared
r_squared <- summary(reg_ex)$r.squared
f_squared <- r_squared / (1 - r_squared)
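As a quick check, the manual value should match the packaged estimate above:

f_squared # ~1.62, matching cohens_f_squared(reg_ex)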
library(pwr)
power_analysis <- pwr.f2.test(u = 2, f2 = .15, sig.level = 0.05, power = 0.95)
print(power_analysis)
Multiple regression power calculation
u = 2
v = 103.0185
f2 = 0.15
sig.level = 0.05
power = 0.95
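Since v is the error degrees of freedom (v = n - u - 1), this implies a required sample size of about n = u + v + 1 ≈ 106.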
Variables: Gender (Male, Female), Film (Bridget Jones, Memento).
For people who watched Memento, do women report higher arousal after watching the film than men?
d.data <- read.delim("/Users/kareenadelrosario/Downloads/ChickFlick.dat")
d.data <- d.data %>%
mutate(gender = as.factor(ifelse(gender == "Female", 0, 1)),
film = as.factor(ifelse(film == "Memento", 0, 1)))
contrasts(d.data$gender) = contr.treatment(2, base = 2) # female = 1, male = 0
contrasts(d.data$film) # memento = 0, bridget jones = 1
  1
0 0
1 1
Call:
lm(formula = arousal ~ gender * film, data = d.data)
Residuals:
Min 1Q Median 3Q Max
-10.700 -4.350 0.450 5.425 11.300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.800 2.019 12.778 0.0000000000000061 ***
gender1 -1.100 2.856 -0.385 0.70234
film1 -8.600 2.856 -3.012 0.00473 **
gender1:film1 -3.700 4.038 -0.916 0.36564
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.385 on 36 degrees of freedom
Multiple R-squared: 0.4525, Adjusted R-squared: 0.4069
F-statistic: 9.92 on 3 and 36 DF, p-value: 0.00006611
contrast estimate SE df t.ratio p.value
gender0 film0 - gender1 film0 -1.1 2.86 36 -0.385 0.9803
gender0 film0 - gender0 film1 12.3 2.86 36 4.307 0.0007
gender0 film0 - gender1 film1 7.5 2.86 36 2.627 0.0582
gender1 film0 - gender0 film1 13.4 2.86 36 4.693 0.0002
gender1 film0 - gender1 film1 8.6 2.86 36 3.012 0.0234
gender0 film1 - gender1 film1 -4.8 2.86 36 -1.681 0.3483
P value adjustment: tukey method for comparing a family of 4 estimates
d.data.f <- d.data %>%
filter(film == 0)
f_model <- lm(arousal ~ gender, data = d.data.f)
summary(f_model)
Call:
lm(formula = arousal ~ gender, data = d.data.f)
Residuals:
Min 1Q Median 3Q Max
-10.700 -5.050 -0.700 6.225 11.300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.800 2.309 11.173 0.00000000158 ***
gender1 -1.100 3.265 -0.337 0.74
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.302 on 18 degrees of freedom
Multiple R-squared: 0.006265, Adjusted R-squared: -0.04894
F-statistic: 0.1135 on 1 and 18 DF, p-value: 0.7401
credit: https://stats.oarc.ucla.edu/r/dae/robust-regression/
What is the difference between OLS and robust regression? When would you use one over the other?
Robust regression is suitable for cases where you’d typically use OLS regression but encounter outliers or high leverage points that aren’t errors or from a different population. Instead of excluding these points or treating all data equally as in OLS regression, robust regression offers a middle ground by assigning different weights to observations based on their reliability. Essentially, it’s a weighted least squares approach that adjusts weights in response to the features of the data.
Terms to know:
- Residual: The difference between the predicted value (based on the regression equation) and the actual, observed value.
- Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.
- Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.
- Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.
- Cook’s distance (or Cook’s D): A measure that combines the information of leverage and residual of the observation.
sid state crime murder pctmetro pctwhite pcths poverty single
1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3
2 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5
3 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7
4 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1
5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5
6 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1
We will begin by running an OLS regression and looking at diagnostic plots examining residuals, fitted values, Cook’s distance, and leverage.
Call:
lm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-811.14 -114.27 -22.44 121.86 689.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1368.189 187.205 -7.308 0.0000000024786 ***
poverty 6.787 8.989 0.755 0.454
single 166.373 19.423 8.566 0.0000000000312 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
F-statistic: 57.96 on 2 and 48 DF, p-value: 0.0000000000001578
From these plots, we can identify observations 9, 25, and 51 as possibly problematic to our model. We can look at these observations to see which states they represent.
sid state
9 9 fl
25 25 ms
51 51 dc
DC, Florida and Mississippi have either high leverage or large residuals. We can display the observations that have relatively large values of Cook’s D. A conventional cut-off point is 4/n, where n is the number of observations in the data set.
d1 <- cooks.distance(ols) # captures influential observations
r <- stdres(ols) # computes standardized residuals
a <- cbind(cdata, d1, r)
a[d1 > 4/51, ] # Filters the combined data frame a to display only those observations for which Cook's distance is greater than 4/n (n = 51)
sid state crime murder pctmetro pctwhite pcths poverty single d1 r
1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 0.1254750 -1.397418
9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.1425891 2.902663
25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.6138721 -3.562990
51 51 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2.6362519 2.616447
Now we will look at the residuals. We will generate a new variable called rabs, which is the absolute value of the residuals (the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.
sid state crime murder pctmetro pctwhite pcths poverty single d1 r
25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.61387212 -3.562990
9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.14258909 2.902663
51 51 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2.63625193 2.616447
46 46 vt 114 3.6 27.0 98.4 80.8 10.0 11.0 0.04271548 -1.742409
26 26 mt 178 3.0 24.0 92.6 81.0 14.9 10.8 0.01675501 -1.460885
21 21 me 126 1.6 35.7 98.5 78.8 10.7 10.6 0.02233128 -1.426741
1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 0.12547500 -1.397418
31 31 nj 627 5.3 100.0 80.8 76.7 10.9 9.6 0.02229184 1.354149
14 14 il 960 11.4 84.0 81.0 76.2 13.6 11.5 0.01265689 1.338192
20 20 md 998 12.7 92.8 68.9 78.4 9.7 12.0 0.03569623 1.287087
rabs
25 3.562990
9 2.902663
51 2.616447
46 1.742409
26 1.460885
21 1.426741
1 1.397418
31 1.354149
14 1.338192
20 1.287087
Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to first use the Huber weights in this example. We will then look at the final weights created by the IRLS process, which shows how much each observation was down-weighted.
Call: rlm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-846.09 -125.80 -16.49 119.15 679.94
Coefficients:
Value Std. Error t value
(Intercept) -1423.0373 167.5899 -8.4912
poverty 8.8677 8.0467 1.1020
single 168.9858 17.3878 9.7186
Residual standard error: 181.8 on 48 degrees of freedom
Let's take a look at the weights for the 15 observations with the largest absolute residuals.
hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
state resid weight
25 ms -846.08536 0.2889618
9 fl 679.94327 0.3595480
46 vt -410.48310 0.5955740
51 dc 376.34468 0.6494131
26 mt -356.13760 0.6864625
21 me -337.09622 0.7252263
31 nj 331.11603 0.7383578
14 il 319.10036 0.7661169
1 ak -313.15532 0.7807432
20 md 307.19142 0.7958154
19 ma 291.20817 0.8395172
18 la -266.95752 0.9159411
2 al 105.40319 1.0000000
3 ar 30.53589 1.0000000
4 az -43.25299 1.0000000
We can see that roughly, as the absolute residual goes down, the weight goes up.
All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.
Next, let’s run the same model, but using the bisquare weighting function. Again, we can look at the weights.
Call: rlm(formula = crime ~ poverty + single, data = cdata, psi = psi.bisquare)
Residuals:
Min 1Q Median 3Q Max
-905.59 -140.97 -14.98 114.65 668.38
Coefficients:
Value Std. Error t value
(Intercept) -1535.3338 164.5062 -9.3330
poverty 11.6903 7.8987 1.4800
single 175.9303 17.0678 10.3077
Residual standard error: 202.3 on 48 degrees of freedom
# if you need significance values. MASS doesn't trust p-values and won't report them.
# texreg(rr.bisquare, ci.force=TRUE)

Here are the weights for the same 15 observations using bisquare weights:
biweights <- data.frame(state = cdata$state, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
biweights2[1:15, ]
state resid weight
25 ms -905.5931 0.007652565
9 fl 668.3844 0.252870542
46 vt -402.8031 0.671495418
26 mt -360.8997 0.731136908
31 nj 345.9780 0.751347695
18 la -332.6527 0.768938330
21 me -328.6143 0.774103322
1 ak -325.8519 0.777662383
14 il 313.1466 0.793658594
20 md 308.7737 0.799065530
19 ma 297.6068 0.812596833
51 dc 260.6489 0.854441716
50 wy -234.1952 0.881660897
5 ca 201.4407 0.911713981
10 ga -186.5799 0.924033113
We can see that the weight given to Mississippi is dramatically lower using the bisquare weighting function than the Huber weighting function and the parameter estimates from these two different weighting methods differ.
When comparing the results of a regular OLS regression and a robust regression, if the results are very different, you will most likely want to use the results from the robust regression. Large differences suggest that the model parameters are being highly influenced by outliers. Different functions have advantages and drawbacks. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging.
Robust regression does not directly address heteroscedasticity. To account for heteroscedasticity or autocorrelation in GLS, you would typically need to specify a particular structure for the covariance matrix of the errors that reflects the nature of the heteroscedasticity or correlation you’re dealing with. In other words, what is driving the difference in variance?
Check the residuals vs. fitted plot: a funnel shape (residual spread that widens or narrows across the fitted values) would indicate heteroscedasticity, while a curved trend line would indicate non-linearity.
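The studentized Breusch-Pagan output below was presumably produced with bptest() from lmtest (loaded above); treat the exact call as an assumption:

bptest(ols) # hypothetical call reproducing the test below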
studentized Breusch-Pagan test
data: ols
BP = 7.6296, df = 2, p-value = 0.02204
ggplot(cdata, aes(x = single, y = crime)) +
geom_point(aes(color = poverty), alpha = 0.5) + # Color points by poverty level
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "green") +
labs(x = "Single Parent Percentage", y = "Crime Rate",
title = "Crime Rate vs. Single Parent Percentage",
subtitle = "Colored by Poverty Level") +
theme_minimal()

# Let's say we find that as single-parent household % increases, crime variance increases. This suggests a proportional relationship between single-parent and crime, so we'll account for variance changes as a power function
glsModel <- gls(crime ~ poverty + single, data = cdata,
weights = varPower(form = ~ single))
summary(glsModel)

Generalized least squares fit by REML
Model: crime ~ poverty + single
Data: cdata
AIC BIC logLik
680.0737 689.4297 -335.0368
Variance function:
Structure: Power of variance covariate
Formula: ~single
Parameter estimates:
power
1.848189
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1215.4468 229.36175 -5.299257 0.0000
poverty 8.7081 8.10173 1.074840 0.2878
single 150.3019 23.44772 6.410087 0.0000
Correlation:
(Intr) povrty
poverty -0.055
single -0.892 -0.387
Standardized residuals:
Min Q1 Med Q3 Max
-2.1045268 -0.4871898 -0.1202581 0.5487129 3.3453653
Residual standard error: 2.563121
Degrees of freedom: 51 total; 48 residual
# Plotting standardized residuals vs fitted values
plot(fitted(glsModel), residuals(glsModel, type = "normalized"),
xlab = "Fitted values", ylab = "Standardized Residuals")
abline(h = 0, lty = 2)

# Let's say that the variability in crime scores differs as a function of the state (could formally test with levene's test). We could account for that on the weights line. Note this doesn't work here because each state only has one entry.
glsModel2 <- gls(crime ~ poverty + single, data = cdata,
weights = varIdent(form = ~ 1 | state))
summary(glsModel2)

Generalized least squares fit by REML
Model: crime ~ poverty + single
Data: cdata
AIC BIC logLik
689.3053 790.3501 -290.6526
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | state
Parameter estimates:
ak al ar az ca
1.000000000000 2.048549058024 0.068085574850 0.576119727497 4.300623446230
co ct de fl ga
0.000005454337 2.461595107494 2.922156518667 9.754497652367 0.418252848397
hi ia id il in
1.615479026361 2.084911823269 0.267399898634 5.737528373745 0.600830252092
ks ky la ma md
2.097280314278 1.744754031846 2.671187574741 5.420342182435 6.481315070471
me mi mn mo ms
3.861041771601 0.039565313233 0.077530452236 2.998643425989 10.913292603298
mt nc nd ne nh
4.671621555841 2.174747057479 0.603570546230 1.539150720484 0.890348399478
nj nm nv ny oh
5.171393725120 0.000028911708 3.937796504993 4.388504628947 0.510888858428
ok or pa ri sc
0.029274306912 0.000004348616 1.489888746595 0.374030689892 3.728761133878
sd tn tx ut va
1.468792274624 1.769559166951 1.237389620521 0.226046963631 0.536417315386
vt wa wi wv wy
4.569384180392 0.645710065105 2.037106623757 3.747919048775 2.651974336771
dc
11.049378262028
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1161.1918 0.007238506 -160418.7 0
poverty 19.6750 0.000208372 94422.5 0
single 126.7281 0.000529827 239187.5 0
Correlation:
(Intr) povrty
poverty -0.586
single -0.952 0.311
Standardized residuals:
Min Q1 Med Q3 Max
-1.0000071 -0.9999989 0.9999944 1.0000002 1.0000075
Residual standard error: 69.06124
Degrees of freedom: 51 total; 48 residual
When to use bootstrapping: if you have a small sample, or if your data violate the regression assumptions and you want more accurate CI estimates. While bootstrapping won't correct for heterogeneity of variance, it can be used to estimate robust standard errors for regression coefficients.
bootReg <- function(formula, data, indices) {
d <- data[indices, ] # Use 'indices' to create a subset dataframe 'd'
fit <- lm(formula, data = d)
return(coef(fit))
}
library(boot) # boot() and boot.ci() come from the boot package
bootResults <- boot(statistic = bootReg, formula = crime ~ poverty + single, data = cdata, R = 2000)
# based on the bootstrapping, how biased are our estimates?
summary(bootResults)
Number of bootstrap replications R = 2000
original bootBias bootSE bootMed
1 -1368.1887 93.4373 336.521 -1364.583
2 6.7874 1.5695 10.896 8.219
3 166.3727 -10.2531 31.364 163.437
Call:
lm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-811.14 -114.27 -22.44 121.86 689.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1368.189 187.205 -7.308 0.0000000024786 ***
poverty 6.787 8.989 0.755 0.454
single 166.373 19.423 8.566 0.0000000000312 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
F-statistic: 57.96 on 2 and 48 DF, p-value: 0.0000000000001578
# intercept (index = 1)
boot.ci(bootResults, type = "bca", index = 1) # bca = bias corrected and accelerated

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = bootResults, type = "bca", index = 1)
Intervals :
Level BCa
95% (-1764, -590 )
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = bootResults, type = "bca", index = 2)
Intervals :
Level BCa
95% (-16.145, 26.785 )
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = bootResults, type = "bca", index = 3)
Intervals :
Level BCa
95% (103.9, 214.5 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
How do these compare to the OLS confidence intervals?
2.5 % 97.5 %
(Intercept) -1744.58988 -991.78745
poverty -11.28529 24.86001
single 127.32029 205.42505
A study was carried out to explore the relationship between Aggression and several potential predicting factors in 666 children who had an older sibling.
Variables measured were Aggression, Television, Computer_Games, Sibling_Aggression, Diet, and Parenting_Style.
The data are in the file ChildAggression.dat.
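Assuming the file is read in the same way as the ChickFlick data earlier, something like the following would create df (the path is hypothetical):

df <- read.delim("ChildAggression.dat") # hypothetical read-in; adjust the path to your file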
model <- lm(Aggression ~ Television + Computer_Games + Sibling_Aggression + Diet + Parenting_Style, data = df)
summary(model)
Call:
lm(formula = Aggression ~ Television + Computer_Games + Sibling_Aggression +
Diet + Parenting_Style, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.12629 -0.15253 -0.00421 0.15222 1.17669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004988 0.011983 -0.416 0.677350
Television 0.032916 0.046057 0.715 0.475059
Computer_Games 0.142161 0.036920 3.851 0.000129 ***
Sibling_Aggression 0.081684 0.038780 2.106 0.035550 *
Diet -0.109054 0.038076 -2.864 0.004315 **
Parenting_Style 0.056648 0.014557 3.891 0.000110 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3071 on 660 degrees of freedom
Multiple R-squared: 0.08258, Adjusted R-squared: 0.07563
F-statistic: 11.88 on 5 and 660 DF, p-value: 0.00000000005025
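The standardized coefficients below were presumably obtained with lm.beta() from QuantPsyc (loaded above); the call is an assumption:

lm.beta(model) # hypothetical call; returns standardized betas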
Television Computer_Games Sibling_Aggression Diet
0.03192490 0.15211518 0.08357717 -0.11503080
Parenting_Style
0.17735588
Residuals vs Fitted: is used to check the assumptions of linearity. If the residuals are spread equally around a horizontal line without distinct patterns (red line is approximately horizontal at zero), that is a good indication of having a linear relationship.
Normal Q-Q: is used to check the normality of residuals assumption. If the majority of the residuals follow the straight dashed line, then the assumption is fulfilled.
Scale-Location: is used to check the homoscedasticity of residuals (equal variance of residuals). If the residuals are spread randomly and you see a horizontal line with equally (randomly) spread points, then the assumption is fulfilled.
Residuals vs Leverage: is used to identify any influential value in our dataset. Influential values are extreme values that might influence the regression results when included or excluded from the analysis. Look for cases outside of a dashed line (if there is a concerning point, you’ll see red lines pop up).
Homoscedasticity: residuals vs fitted & bp
studentized Breusch-Pagan test
data: model
BP = 10.439, df = 5, p-value = 0.0637
Multicollinearity
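These diagnostics were presumably computed with vif() from car; the exact calls are assumptions based on the output format:

vif(model) # variance inflation factors
mean(vif(model)) # average VIF
1/vif(model) # tolerance (reciprocal of VIF)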
Television Computer_Games Sibling_Aggression Diet
1.435525 1.122719 1.132618 1.160466
Parenting_Style
1.494296
[1] 1.269125
Television Computer_Games Sibling_Aggression Diet
0.6966095 0.8906946 0.8829104 0.8617231
Parenting_Style
0.6692115
Independence
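The Durbin-Watson output below matches car's durbinWatsonTest(); the call is an assumption:

durbinWatsonTest(model) # hypothetical call; tests autocorrelation of residuals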
lag Autocorrelation D-W Statistic p-value
1 0.04218005 1.912808 0.286
Alternative hypothesis: rho != 0
Skew
Call:
lm(formula = Aggression ~ Television + Computer_Games + Sibling_Aggression +
Diet + Parenting_Style, data = df)
Coefficients:
(Intercept) Television Computer_Games Sibling_Aggression
-0.004988 0.032916 0.142161 0.081684
Diet Parenting_Style
-0.109054 0.056648
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 62.8865 0.000000000000716982 Assumptions NOT satisfied!
Skewness 0.4924 0.482846201232194572 Assumptions acceptable.
Kurtosis 61.4485 0.000000000000004552 Assumptions NOT satisfied!
Link Function 0.3833 0.535842599337205128 Assumptions acceptable.
Heteroscedasticity 0.5623 0.453336961025912588 Assumptions acceptable.
adapted from: https://advstats.psychstat.org/book/mregression/index.php
Note that you never HAVE to center/recode your predictors. It should never influence the test significance, just the interpretation of your intercept and coefficients.
Let’s say we’re interested in the effects of high school GPA (h.gpa), SAT, and quality of recommendation letters (recommd) on college GPA (c.gpa).
c.gpa h.gpa SAT recommd
1 2.04 2.01 1070 5
2 2.56 3.40 1254 6
3 3.75 3.68 1466 6
4 1.10 1.54 706 4
5 3.00 3.32 1160 5
6 0.05 0.33 756 3
par(mfrow=c(2,2))
plot(gpa_df$h.gpa, gpa_df$c.gpa)
plot(gpa_df$SAT, gpa_df$c.gpa)
plot(gpa_df$recommd, gpa_df$c.gpa)
# original model
gpaModel<- lm(c.gpa ~ h.gpa + SAT + recommd, data = gpa_df)
summary(gpaModel)
Call:
lm(formula = c.gpa ~ h.gpa + SAT + recommd, data = gpa_df)
Residuals:
Min 1Q Median 3Q Max
-1.0979 -0.4407 -0.0094 0.3859 1.7606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1532639 0.3229381 -0.475 0.636156
h.gpa 0.3763511 0.1142615 3.294 0.001385 **
SAT 0.0012269 0.0003032 4.046 0.000105 ***
recommd 0.0226843 0.0509817 0.445 0.657358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5895 on 96 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.381
F-statistic: 21.31 on 3 and 96 DF, p-value: 0.000000000116
How would you interpret these estimates?
# center separately
gpa_df$h.gpaC <- scale(gpa_df$h.gpa, center=TRUE, scale=FALSE) # mean center without rescaling
# shortcut
centeredModel<- lm(c.gpa~ h.gpaC + scale(SAT,scale=F) + scale(recommd,scale=F), data=gpa_df)
summary(centeredModel)
Call:
lm(formula = c.gpa ~ h.gpaC + scale(SAT, scale = F) + scale(recommd,
scale = F), data = gpa_df)
Residuals:
Min 1Q Median 3Q Max
-1.0979 -0.4407 -0.0094 0.3859 1.7606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9805000 0.0589476 33.598 < 0.0000000000000002 ***
h.gpaC 0.3763511 0.1142615 3.294 0.001385 **
scale(SAT, scale = F) 0.0012269 0.0003032 4.046 0.000105 ***
scale(recommd, scale = F) 0.0226843 0.0509817 0.445 0.657358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5895 on 96 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.381
F-statistic: 21.31 on 3 and 96 DF, p-value: 0.000000000116
When predictors like GPA (typically on a 0 to 4 scale) and SAT scores (typically on a 400 to 1600 scale) are involved, their coefficients can be difficult to compare directly because a one-unit change means something very different for each predictor. Through standardization, however, we can remove the scales of the predictors and therefore make the coefficients relatively more comparable. We can standardize predictors only or both predictors and the outcome variable.
After standardization, the variable means are all 0 and variances are all 1. The beta coefficient (output), tells us how many standard deviations the predicted DV changes given one standard deviation change in the IV when the other IVs are held constant.
In R, scale() can also be used for standardization. Note that to skip the estimation of the intercept, one can add -1 in the regression model formula.
# same as above except we remove the scale = F statement
betaModel<-lm(scale(c.gpa) ~ scale(h.gpa) + scale(SAT) + scale(recommd)-1, data=gpa_df)
summary(betaModel)
Call:
lm(formula = scale(c.gpa) ~ scale(h.gpa) + scale(SAT) + scale(recommd) -
1, data = gpa_df)
Residuals:
Min 1Q Median 3Q Max
-1.46544 -0.58820 -0.01254 0.51510 2.34994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
scale(h.gpa) 0.36285 0.10959 3.311 0.00131 **
scale(SAT) 0.35593 0.08751 4.067 0.0000968 ***
scale(recommd) 0.04528 0.10123 0.447 0.65568
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7827 on 97 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.3812
F-statistic: 21.53 on 3 and 97 DF, p-value: 0.00000000009028
Alternatively, you could feed your centered model through lm.beta() to get the betas without rerunning the model
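A call along these lines (assumed, since only the output survives) produces the betas below:

lm.beta(centeredModel) # hypothetical call using QuantPsyc's lm.beta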
h.gpaC scale(SAT, scale = F) scale(recommd, scale = F)
0.3628458 0.3559267 0.0452765
moderation & mediation content was adapted from: https://ademos.people.uic.edu/Chapter14.html#
You can think of this distinction as follows: mediation answers how two variables are related, whereas moderation tells us "it depends."
Mediators describe the how or why of a (typically well-established) relationship between two other variables and are sometimes called intermediary variables since they often describe the process through which an effect occurs. This is also sometimes called an indirect effect. For instance, people with higher incomes tend to live longer but this effect is explained by the mediating influence of having access to better health care.
Moderation tests whether a variable (Z) affects the direction and/or strength of the relation between an IV (X) and a DV (Y). In other words, moderation tests for interactions that affect WHEN relationships between variables occur. Moderators are conceptually different from mediators (when versus how/why) but some variables may be a moderator or a mediator depending on your question.
Like mediation, moderation assumes that there is little to no measurement error in the moderator variable and that the DV did not CAUSE the moderator.
In this example we’ll say we are interested in whether the relationship between the number of hours of sleep (X) a graduate student receives and the attention that they pay to this tutorial (Y) is influenced by their consumption of coffee (Z).
Here we create the moderation effect by making our DV (Y) the product of levels of the IV (X) and our moderator (Z).
set.seed(123)#Standardizes the numbers generated by rnorm
N <- 100 #Number of participants; graduate students
X.hours <- abs(rnorm(N, 6, 4)) #IV; Hours of sleep
X1 <- abs(rnorm(N, 60, 30)) #Adding some systematic variance for our DV
Z.coffee <- rnorm(N, 30, 8) #Moderator; Ounces of coffee consumed
Y.attn <- abs((-0.8*X.hours) * (0.2*Z.coffee) - 0.5*X.hours - 0.4*X1 + 10 + rnorm(N, 0, 3)) #DV; Attention Paid
Moddata <- data.frame(X.hours, X1, Z.coffee, Y.attn)
summary(Moddata)
X.hours X1 Z.coffee Y.attn
Min. : 0.195 Min. : 1.597 Min. :15.95 Min. : 2.386
1st Qu.: 4.025 1st Qu.: 35.967 1st Qu.:25.75 1st Qu.: 30.155
Median : 6.247 Median : 53.225 Median :30.29 Median : 47.761
Mean : 6.483 Mean : 56.806 Mean :30.96 Mean : 47.763
3rd Qu.: 8.767 3rd Qu.: 74.035 3rd Qu.:36.11 3rd Qu.: 61.727
Max. :14.749 Max. :157.231 Max. :48.34 Max. :136.947
Moderation can be tested by looking for significant interactions between the moderating variable (Z) and the IV (X). It is important to mean center both your moderator and your IV to reduce multicollinearity and make interpretation easier. Centering can be done using the scale function, which subtracts the mean of a variable from each value in that variable.
#Moderation
fitMod <- lm(Y.attn ~ X.hours + Z.coffee + X.hours:Z.coffee, data = Moddata) #Model interacts IV & moderator
# Alternative formatting
# fitMod <- lm(Y.attn ~ X.hours*Z.coffee, data = Moddata)
print_p(coef(summary(fitMod)))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.5216 10.4032 2.6455 0.0095
X.hours -2.0323 1.3796 -1.4732 0.1440
Z.coffee -0.4114 0.3075 -1.3381 0.1840
X.hours:Z.coffee 0.2338 0.0413 5.6563 0.0000
Sometimes our coefficients don’t make any sense. We can make the 0 meaningful by centering our predictors.
### Two ways to center
Moddata <- Moddata %>%
mutate(X.hoursC = X.hours - mean(X.hours),
Z.coffeeC = Z.coffee - mean(Z.coffee))
# Centering Data
X.hoursC <- scale(X.hours, center=TRUE, scale=FALSE) #Centering IV; hours of sleep
Z.coffeeC <- scale(Z.coffee, center=TRUE, scale=FALSE) #Centering moderator; coffee consumption
##################################
# Reanalyze with centered variables
fitModC <- lm(Y.attn ~ X.hoursC + Z.coffeeC + X.hoursC:Z.coffeeC, data = Moddata)
print_p(coef(summary(fitModC)))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.5444 1.1729 41.3899 0
X.hoursC 5.2081 0.3487 14.9358 0
Z.coffeeC 1.1044 0.1554 7.1083 0
X.hoursC:Z.coffeeC 0.2338 0.0413 5.6563 0
Call:
lm(formula = Y.attn ~ X.hoursC + Z.coffeeC + X.hoursC:Z.coffeeC,
data = Moddata)
Coefficients:
(Intercept) X.hoursC Z.coffeeC X.hoursC:Z.coffeeC
48.5444 5.2081 1.1044 0.2338
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = fitModC)
Value p-value Decision
Global Stat 7.68778 0.10371 Assumptions acceptable.
Skewness 5.97432 0.01452 Assumptions NOT satisfied!
Kurtosis 0.94082 0.33207 Assumptions acceptable.
Link Function 0.73540 0.39114 Assumptions acceptable.
Heteroscedasticity 0.03724 0.84698 Assumptions acceptable.
#Data Summary
stargazer(fitModC, type="html", digits = 2, font.size = "footnotesize", title = "Sleep and Coffee on Attention")

| Dependent variable: | |
| Y.attn | |
| X.hoursC | 5.21*** |
| (0.35) | |
| Z.coffeeC | 1.10*** |
| (0.16) | |
| X.hoursC:Z.coffeeC | 0.23*** |
| (0.04) | |
| Constant | 48.54*** |
| (1.17) | |
| Observations | 100 |
| R2 | 0.77 |
| Adjusted R2 | 0.76 |
| Residual Std. Error | 11.65 (df = 96) |
| F Statistic | 104.78*** (df = 3; 96) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
#Plotting
ps <- plotSlopes(fitModC, plotx="X.hoursC", modx="Z.coffeeC", xlab = "Sleep", ylab = "Attention Paid", modxVals = "std.dev")

Our model shows a significant interaction between hours slept and coffee consumption on attention paid to this tutorial (b = .23, SE = .04, p < .001). However, we'll need to unpack this interaction visually to get a better idea of what this means.
The rockchalk function will automatically plot the simple slopes (1 SD above and 1 SD below the mean) of the moderating effect.
credit: https://stats.oarc.ucla.edu/r/seminars/interactions-r/#s2
You know that hours spent exercising improves weight loss, but how does it interact with effort?
This is a hypothetical study of weight loss for 900 participants in a year-long trial of 3 different exercise programs: a jogging program, a swimming program, and a reading program that serves as a control activity. Variables include weight loss (loss), hours spent exercising (hours), effort during exercise (effort), and gender.
dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2019/03/exercise.csv")
dat$gender <- factor(dat$gender,labels=c("male","female"))

How do we interpret an interaction with continuous predictors? And what is extrapolating?
\(Weight\ Loss = b_0 + b_1 Hours + b_2 Effort + b_3 Hours \times Effort\)
Call:
lm(formula = loss ~ hours * effort, data = dat)
Residuals:
Min 1Q Median 3Q Max
-29.52 -10.60 -1.78 11.13 34.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.79864 11.60362 0.672 0.5017
hours -9.37568 5.66392 -1.655 0.0982 .
effort -0.08028 0.38465 -0.209 0.8347
hours:effort 0.39335 0.18750 2.098 0.0362 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.56 on 896 degrees of freedom
Multiple R-squared: 0.07818, Adjusted R-squared: 0.07509
F-statistic: 25.33 on 3 and 896 DF, p-value: 0.0000000000000009826
Although we might expect the slope of Hours to be positive, remember that the Hours coefficient here is the slope when Effort is zero. As we see in our data, an effort of zero is improbable, since the minimum value of effort is 12.95.
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.95 26.26 29.63 29.66 33.10 44.08
What would we expect weight loss to look like at average effort?
$hours
[1] 2
$effort
[1] 30
hours effort emmean SE df lower.CL upper.CL
2 30 10.2 0.453 896 9.35 11.1
Confidence level used: 0.95
The results show that predicted weight loss is 10.2 pounds if we put in two hours of exercise and an effort level of 30; this seems reasonable. Let’s see what happens when we predict weight loss for two hours of exercise given an effort level of 0.
hours effort emmean SE df lower.CL upper.CL
2 0 -11 2.65 896 -16.1 -5.76
Confidence level used: 0.95
The predicted weight gain is now 11 pounds. Weird, right? This is an example of extrapolating, which means we are making predictions about our data beyond what the data can support. This is why we should always choose reasonable values of our predictors in order to interpret our data properly.
We know to choose reasonable values when predicting values. The same concept applies when decomposing an interaction. Our output suggests that Hours varies by levels of Effort. Since effort is continuous, we can choose an infinite set of values with which to fix effort.
For ease of presentation, we can use what’s called “spotlight analysis” (Aiken & West, 1991). To do a spotlight analysis, we’ll divide Effort into three levels: low, average, high.
\(EffA = \overline{Effort} + SD(Effort) = 34.8\)
\(Eff = \overline{Effort} = 29.7\)
\(EffB = \overline{Effort} - SD(Effort) = 24.5\)
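These values were presumably computed from the mean and SD of effort; the object names match those used in mylist below, though the rounding is an assumption:

effar <- round(mean(dat$effort) + sd(dat$effort), 1) # 1 SD above the mean: 34.8
effr <- round(mean(dat$effort), 1) # mean effort: 29.7
effbr <- round(mean(dat$effort) - sd(dat$effort), 1) # 1 SD below the mean: 24.5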
Recall that the simple slope is the relationship of hours to weight loss fixed at particular values of effort. In R, we can obtain simple slopes using the function emtrends. We first create a list incorporating the three values of effort we found above in preparation for the spotlight analysis.
# contcont is the loss ~ hours * effort model summarized above
mylist <- list(effort=c(effbr,effr,effar))
emtrends(contcont, pairwise ~effort, var="hours",at=mylist, adjust="none")$emtrends
effort hours.trend SE df lower.CL upper.CL
24.5 0.261 1.352 896 -2.392 2.91
29.7 2.307 0.915 896 0.511 4.10
34.8 4.313 1.308 896 1.745 6.88
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
effort24.5 - effort29.7 -2.05 0.975 896 -2.098 0.0362
effort24.5 - effort34.8 -4.05 1.931 896 -2.098 0.0362
effort29.7 - effort34.8 -2.01 0.956 896 -2.098 0.0362
All comparisons of simple slopes result in the same p-value as the interaction itself.
Based on the tests above, yes, the magnitude of the slope difference is larger between "low" and "high" than between "medium" and "high", but the p-values are identical.
$hours
[1] 0.0 0.4 0.8 1.2 1.6 2.0 2.4 2.8 3.2 3.6 4.0
$effort
[1] 24.5 29.7 34.8
The results suggest that hours spent exercising is only effective for weight loss if we put in more effort, which supports the rationale for high intensity interval training.
The research question here is, do men and women (W) differ in the relationship between Hours (X) and Weight loss? Here, we’re going to use dummy coding.
dat$gender <- relevel(dat$gender, ref="female")
contcat <- lm(loss~hours*gender,data=dat)
summary(contcat)
Call:
lm(formula = loss ~ hours * gender, data = dat)
Residuals:
Min 1Q Median 3Q Max
-27.118 -11.350 -1.963 10.001 42.376
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.335 2.731 1.221 0.222
hours 3.315 1.332 2.489 0.013 *
gendermale 3.571 3.915 0.912 0.362
hours:gendermale -1.724 1.898 -0.908 0.364
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.06 on 896 degrees of freedom
Multiple R-squared: 0.008433, Adjusted R-squared: 0.005113
F-statistic: 2.54 on 3 and 896 DF, p-value: 0.05523
Since our goal is to obtain simple slopes of Hours by gender we use emtrends. We do not use emmeans because this function gives us the predicted values rather than slopes.
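A call like the following (assumed, since only the output survives) produces the simple slopes below:

emtrends(contcat, ~ gender, var = "hours") # slope of hours for each gender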
gender hours.trend SE df lower.CL upper.CL
female 3.32 1.33 896 0.702 5.93
male 1.59 1.35 896 -1.063 4.25
Confidence level used: 0.95
A common misconception is that since the simple slope of Hours is significant for females but not males, we should have seen a significant interaction. However, the interaction tests the difference of the Hours slope for males and females and not whether each simple slope is different from zero (which is what we have from the output above).
To test the difference in slopes, we add pairwise ~ gender to tell the function that we want the pairwise difference in the simple slope of Hours for females versus males.
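Based on that description, the call presumably looks like this:

emtrends(contcat, pairwise ~ gender, var = "hours") # slopes plus their pairwise difference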
$emtrends
gender hours.trend SE df lower.CL upper.CL
female 3.32 1.33 896 0.702 5.93
male 1.59 1.35 896 -1.063 4.25
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
female - male 1.72 1.9 896 0.908 0.3639
Recall from our summary table, this is exactly the same as the interaction, which verifies that we have in fact obtained the interaction coefficient.
$hours
[1] 0.0 0.4 0.8 1.2 1.6 2.0 2.4 2.8 3.2 3.6 4.0
$gender
[1] "female" "male"
Another common approach to plotting predicted values:
newdata <- expand.grid(hours = seq(0, 4, by = 0.4), gender = c("female", "male"))
prediction <- predict(contcat, newdata, interval="confidence", level = 0.95)
# combine
newdata <- cbind(newdata, prediction)
ggplot(newdata, aes(x = hours, y = fit, color = gender)) +
geom_line(size = 1) +
geom_point(size = 2, shape = 1, fill = "white") +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = gender), alpha = 0.2, color = NA) +
theme_minimal() +
scale_color_manual(values = c("#4878D0", "#D65F5F")) +
scale_fill_manual(values = c("#4878D0", "#D65F5F")) +
labs(
title = "Predicted Loss by Hours and Gender",
x = "Hours",
y = "Predicted Loss",
color = "Gender",
fill = "Gender"
)

The Baron & Kenny method is among the original methods for testing for mediation but tends to have low statistical power. We're covering it here because it provides a very clear approach to establishing relationships between variables and is still occasionally requested by reviewers.
Mediation tests whether the effects of X (the independent variable) on Y (the dependent variable) operate through a third variable, M (the mediator). In this way, mediators explain the causal relationship between two variables or “how” the relationship works.
c = the total effect of X on Y with no consideration of mediator variables
c'= the direct effect of X on Y after controlling for M
ab = the indirect effect of X on Y through M (for linear models, ab = c - c')
The above shows the standard mediation model. Full mediation occurs when the effect of X on Y disappears with M in the model. Partial mediation occurs when the effect of X on Y decreases by a nontrivial amount (the actual amount is up for debate) with M in the model.
In this example we'll say we are interested in whether the number of hours since dawn (X) affects the subjective ratings of wakefulness (Y) of 100 graduate students through the consumption of coffee (M).
Note that we are intentionally creating a mediation effect here (because statistics is always more fun if we have something to find) and we do so below by creating M so that it is related to X and Y so that it is related to M. This creates the causal chain for our analysis to parse.
#setwd("user location") #Working directory
set.seed(123) #Standardizes the numbers generated by rnorm; see Chapter 5
N <- 100 #Number of participants; graduate students
X.hours <- rnorm(N, 175, 7) #IV; hours since dawn
M.coffee <- 0.7*X.hours + rnorm(N, 0, 5) #Suspected mediator; coffee consumption
Y.wake <- 0.4*M.coffee + rnorm(N, 0, 5) #DV; wakefulness
Meddata <- data.frame(X.hours, M.coffee, Y.wake)

This is the original 4-step method used to describe a mediation effect. Steps 1 and 2 use basic linear regression while steps 3 and 4 use multiple regression.
Now, let’s apply the Baron & Kenny method to our data:
The Steps:
#1. Total Effect
fit <- lm(Y.wake ~ X.hours, data=Meddata)
#2. Path A (X on M)
fita <- lm(M.coffee ~ X.hours, data=Meddata)
#3. Path B (M on Y, controlling for X)
fitb <- lm(Y.wake ~ M.coffee + X.hours, data=Meddata)
#4. Reversed Path C (Y on X, controlling for M)
fitc <- lm(X.hours ~ Y.wake + M.coffee, data=Meddata)

#Summary Table
stargazer::stargazer(fit, fita, fitb, fitc, type="html", digits = 2, font.size = "footnotesize",title = "Baron and Kenny Method")

| Dependent variable: | ||||
| Y.wake | M.coffee | Y.wake | X.hours | |
| (1) | (2) | (3) | (4) | |
| Y.wake | -0.11 | |||
| (0.10) | ||||
| M.coffee | 0.42*** | 0.70*** | ||
| (0.10) | (0.08) | |||
| X.hours | 0.17** | 0.66*** | -0.11 | |
| (0.08) | (0.08) | (0.10) | ||
| Constant | 19.88 | 6.04 | 17.32 | 96.11*** |
| (14.26) | (13.42) | (13.16) | (9.28) | |
| Observations | 100 | 100 | 100 | 100 |
| R2 | 0.04 | 0.43 | 0.19 | 0.44 |
| Adjusted R2 | 0.03 | 0.43 | 0.18 | 0.43 |
| Residual Std. Error | 5.16 (df = 98) | 4.85 (df = 98) | 4.76 (df = 97) | 4.82 (df = 97) |
| F Statistic | 4.34** (df = 1; 98) | 75.31*** (df = 1; 98) | 11.72*** (df = 2; 97) | 38.39*** (df = 2; 97) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |||
Since the relationship between hours since dawn and wakefulness is no longer significant when controlling for coffee consumption, this suggests that coffee consumption does in fact mediate this relationship. However, this method alone does not allow for a formal test of the indirect effect so we don’t know if the change in this relationship is truly meaningful.
There are two primary methods for formally testing the significance of the indirect effect: the Sobel test & bootstrapping (covered under the mediation method). The Sobel test is considered outdated, so we're going to move on to our preferred method.
This package uses the more recent bootstrapping method of Preacher & Hayes (2004) to address the power limitations of the Sobel test. This method computes the point estimate of the indirect effect (ab) over a large number of random samples (typically 1000), so it does not assume that the data are normally distributed and is especially suitable for small sample sizes compared to the Baron & Kenny method.
To run the mediate function, we will again need a model of our IV (hours since dawn) predicting our mediator (coffee consumption), like our Path A model above. We will also need a model of the direct effect of our IV (hours since dawn) on our DV (wakefulness) when controlling for our mediator (coffee consumption). We can then use mediate to repeatedly simulate a comparison between these models and to test the significance of the indirect effect of coffee consumption.
# Path A
fitM <- lm(M.coffee ~ X.hours, data=Meddata) #IV on M; Hours since dawn predicting coffee consumption
# Outcome model (direct effect while controlling for the mediator)
fitY <- lm(Y.wake ~ X.hours + M.coffee, data=Meddata) #IV and M on DV; Hours since dawn and coffee predicting wakefulness
# Check skew
gvlma(fitM) #data is positively skewed; could log transform
Call:
lm(formula = M.coffee ~ X.hours, data = Meddata)
Coefficients:
(Intercept) X.hours
6.0449 0.6625
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = fitM)
Value p-value Decision
Global Stat 8.833 0.06542 Assumptions acceptable.
Skewness 6.314 0.01198 Assumptions NOT satisfied!
Kurtosis 1.219 0.26949 Assumptions acceptable.
Link Function 1.076 0.29959 Assumptions acceptable.
Heteroscedasticity 0.223 0.63674 Assumptions acceptable.
Call:
lm(formula = Y.wake ~ X.hours + M.coffee, data = Meddata)
Coefficients:
(Intercept) X.hours M.coffee
17.3218 -0.1118 0.4238
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = fitY)
Value p-value Decision
Global Stat 3.41844 0.4904 Assumptions acceptable.
Skewness 1.85648 0.1730 Assumptions acceptable.
Kurtosis 0.77788 0.3778 Assumptions acceptable.
Link Function 0.71512 0.3977 Assumptions acceptable.
Heteroscedasticity 0.06896 0.7929 Assumptions acceptable.
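The causal mediation output below was presumably generated along these lines; the name fitMed is referenced later in the text, but the exact call is an assumption (mediate's default of 1000 quasi-Bayesian simulations matches the output):

#Quasi-Bayesian mediation (hypothetical call)
fitMed <- mediate(fitM, fitY, treat="X.hours", mediator="M.coffee")
summary(fitMed)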
Causal Mediation Analysis
Quasi-Bayesian Confidence Intervals
Estimate 95% CI Lower 95% CI Upper p-value
ACME 0.2808 0.1437 0.42 <0.0000000000000002 ***
ADE -0.1133 -0.3116 0.09 0.258
Total Effect 0.1674 0.0208 0.34 0.028 *
Prop. Mediated 1.6428 0.5631 8.44 0.028 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sample Size Used: 100
Simulations: 1000
#Bootstrap
fitMedBoot <- mediate(fitM, fitY, boot=TRUE, sims=2000, treat="X.hours", mediator="M.coffee")
summary(fitMedBoot)
Causal Mediation Analysis
Nonparametric Bootstrap Confidence Intervals with the Percentile Method
Estimate 95% CI Lower 95% CI Upper p-value
ACME 0.280784 0.136681 0.43 <0.0000000000000002 ***
ADE -0.111790 -0.288539 0.09 0.254
Total Effect 0.168993 -0.000457 0.34 0.051 .
Prop. Mediated 1.661509 -1.369638 9.83 0.051 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sample Size Used: 100
Simulations: 2000
The mediate function gives us the ACME (average causal mediation effect), the ADE (average direct effect), the total effect, and the proportion mediated.
The ACME here is the indirect effect of M (total effect - direct effect) and thus this value tells us if our mediation effect is significant. If ACME is significant, but ADE is not, then we have full mediation. If ADE is significant, but ACME is not, then we have no mediation.
In this case, our fitMed model again shows a significant effect of coffee consumption on the relationship between hours since dawn and feelings of wakefulness (ACME = .28, p < .001), with no direct effect of hours since dawn (ADE = -0.11, p = .26) and a significant total effect (p = .03).
We can then bootstrap this comparison to verify this result in fitMedBoot and again find a significant mediation effect (ACME = .28, p < .001) and no direct effect of hours since dawn (ADE = -0.11, p = .25). However, the bootstrapped analysis no longer shows a significant total effect (p = .051).
library(lavaan)
model <-"
Y.wake ~ a*X.hours + b*M.coffee
M.coffee ~ c*X.hours
#indirect effect
ind:=b*c
#total effect
total:=a+(b*c)
"medsem<-sem(model = model, data = Meddata, se="bootstrap")
summary(medsem,standardized=T,fit.measures=T,rsquare=T)lavaan 0.6.17 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 78.650
Degrees of freedom 3
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -595.177
Loglikelihood unrestricted model (H1) -595.177
Akaike (AIC) 1200.354
Bayesian (BIC) 1213.380
Sample-size adjusted Bayesian (SABIC) 1197.589
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 999
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Y.wake ~
X.hours (a) -0.112 0.102 -1.101 0.271 -0.112 -0.136
M.coffee (b) 0.424 0.102 4.153 0.000 0.424 0.519
M.coffee ~
X.hours (c) 0.663 0.075 8.823 0.000 0.663 0.659
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Y.wake 21.945 2.699 8.130 0.000 21.945 0.805
.M.coffee 23.086 3.567 6.472 0.000 23.086 0.565
R-Square:
Estimate
Y.wake 0.195
M.coffee 0.435
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind 0.281 0.075 3.762 0.000 0.281 0.342
total 0.169 0.092 1.836 0.066 0.169 0.206
When you have categorical predictors with more than 2 levels, you will need to model the mediation effects for each level of that variable. To do that, we’re going to dummy code our categorical variable. We’re going to use the same dataset but create a 3-level predictor: a lot of sleep, average sleep, little sleep. I know this would be considered ordinal, but let’s ignore that for now.
IMPORTANT NOTE: here, the reference group will be the people who got 'a lot' of sleep (sleep = 0).
# creating our dataset
set.seed(123)
N <- 100
# sleep: 0 = a lot, 1 = average, 2 = a little
X.sleep <- sample(0:2, N, replace = TRUE)
sleep_effect <- -2
M.coffee <- 0.7 * X.hours + sleep_effect * X.sleep + rnorm(N, 0, 5)
Y.wake <- 0.4 * M.coffee + rnorm(N, 0, 5)
Meddata_cat <- data.frame(X.hours, X.sleep, M.coffee, Y.wake)
Meddata_cat$sleep <- factor(Meddata_cat$X.sleep)
# dummy code sleep
library(fastDummies)
Meddata_cat <- dummy_cols(Meddata_cat,
select_columns = "X.sleep")
# 'a lot' of sleep is the reference category, so you can omit X.sleep_0 from the model
model_cat <-"
# Direct Effects on wakefulness
Y.wake ~ a1*X.sleep_1 + a2*X.sleep_2 + b*M.coffee
# Effect of sleep on coffee (note: now we use only two dummies, assuming one as reference)
M.coffee ~ c1*X.sleep_1 + c2*X.sleep_2
# Indirect effects (now must include dummy paths for only two levels)
ind1:=b*c1
ind2:=b*c2
# Total effects
total1 := a1 + (b*c1)
total2 := a2 + (b*c2)
"
med_cat<-sem(model = model_cat, data = Meddata_cat)
summary(med_cat, standardized=T, fit.measures=T, rsquare=T)

lavaan 0.6.17 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 7
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 49.270
Degrees of freedom 5
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -628.077
Loglikelihood unrestricted model (H1) -628.077
Akaike (AIC) 1270.155
Bayesian (BIC) 1288.391
Sample-size adjusted Bayesian (SABIC) 1266.283
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Y.wake ~
X.sleep_1 (a1) -0.491 1.160 -0.423 0.672 -0.491 -0.042
X.sleep_2 (a2) -0.936 1.208 -0.775 0.438 -0.936 -0.081
M.coffee (b) 0.383 0.070 5.481 0.000 0.383 0.503
M.coffee ~
X.sleep_1 (c1) -0.180 1.660 -0.109 0.913 -0.180 -0.012
X.sleep_2 (c2) -5.957 1.624 -3.669 0.000 -5.957 -0.392
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Y.wake 21.848 3.090 7.071 0.000 21.848 0.718
.M.coffee 44.776 6.332 7.071 0.000 44.776 0.851
R-Square:
Estimate
Y.wake 0.282
M.coffee 0.149
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind1 -0.069 0.636 -0.109 0.913 -0.069 -0.006
ind2 -2.281 0.748 -3.049 0.002 -2.281 -0.197
total1 -0.560 1.322 -0.423 0.672 -0.560 -0.047
total2 -3.217 1.293 -2.487 0.013 -3.217 -0.278
The process is similar when using the mediation package, but note that you'll need to run separate mediations for each level.
# Model predicting mediator (coffee)
model_mediator <- lm(M.coffee ~ X.sleep_1 + X.sleep_2, data = Meddata_cat)
# Model predicting outcome (wake), including mediator
model_outcome <- lm(Y.wake ~ X.sleep_1 + X.sleep_2 + M.coffee, data = Meddata_cat)
# average sleep vs a lot of sleep
med.1 <- mediate(model_mediator, model_outcome, treat = "X.sleep_1", mediator = "M.coffee")
summary(med.1)
Causal Mediation Analysis
Quasi-Bayesian Confidence Intervals
Estimate 95% CI Lower 95% CI Upper p-value
ACME -0.0714 -1.4659 1.19 0.93
ADE -0.4922 -2.8570 1.79 0.67
Total Effect -0.5636 -3.1142 2.05 0.66
Prop. Mediated 0.2188 -3.1517 4.22 0.68
Sample Size Used: 100
Simulations: 1000
# little sleep vs a lot of sleep
med.2 <- mediate(model_mediator, model_outcome, treat = "X.sleep_2", mediator = "M.coffee")
summary(med.2)
Causal Mediation Analysis
Quasi-Bayesian Confidence Intervals
Estimate 95% CI Lower 95% CI Upper p-value
ACME -2.283 -3.946 -0.97 <0.0000000000000002 ***
ADE -0.946 -3.331 1.51 0.46
Total Effect -3.228 -5.672 -0.76 0.01 **
Prop. Mediated 0.712 0.276 2.29 0.01 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sample Size Used: 100
Simulations: 1000
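So far we have modeled a single mediator at a time. We can also test two mediators in parallel: below we create a second mediator (M2.activity) that is strongly driven by hours since dawn.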
Meddata$M2.activity <- 2.1 * Meddata$X.hours + rnorm(nrow(Meddata), mean = 0, sd = 1)
summary(lm(Y.wake ~ M2.activity, data = Meddata))
Call:
lm(formula = Y.wake ~ M2.activity, data = Meddata)
Residuals:
Min 1Q Median 3Q Max
-10.7903 -3.7035 -0.2119 2.8524 12.5561
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.01365 14.47448 1.314 0.1920
M2.activity 0.08282 0.03921 2.112 0.0372 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.157 on 98 degrees of freedom
Multiple R-squared: 0.04353, Adjusted R-squared: 0.03377
F-statistic: 4.461 on 1 and 98 DF, p-value: 0.03723
# Define the model for the outcome variable
parallelModel <- lm(Y.wake ~ X.hours + M.coffee + M2.activity, data = Meddata)
# Define the models for the mediator variables
medModel_coffee <- lm(M.coffee ~ X.hours, data = Meddata)
medModel_activity <- lm(M2.activity ~ X.hours, data = Meddata)
# Perform the mediation analysis for each mediator
mediation_coffee <- mediate(medModel_coffee, parallelModel,
treat = "X.hours", mediator = "M.coffee",
boot = TRUE, sims = 500)
mediation_activity <- mediate(medModel_activity, parallelModel,
treat = "X.hours", mediator = "M2.activity",
boot = TRUE, sims = 500)
# Summarize the mediation analysis results
summary(mediation_coffee)
Causal Mediation Analysis
Nonparametric Bootstrap Confidence Intervals with the Percentile Method
Estimate 95% CI Lower 95% CI Upper p-value
ACME 0.280 0.141 0.42 <0.0000000000000002 ***
ADE -0.394 -2.568 1.53 0.73
Total Effect -0.114 -2.277 1.84 0.86
Prop. Mediated -2.458 -3.849 2.55 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sample Size Used: 100
Simulations: 500
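The corresponding output for the second mediator comes from summary(mediation_activity):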
Causal Mediation Analysis
Nonparametric Bootstrap Confidence Intervals with the Percentile Method
Estimate 95% CI Lower 95% CI Upper p-value
ACME 0.283 -1.627 2.29 0.80
ADE -0.394 -2.459 1.49 0.71
Total Effect -0.111 -0.295 0.08 0.22
Prop. Mediated -2.553 -104.088 46.85 0.87
Sample Size Used: 100
Simulations: 500
multipleMediation <- '
# Direct
Y.wake ~ b1 * M.coffee + b2 * M2.activity + c * X.hours
M.coffee ~ a1 * X.hours
M2.activity ~ a2 * X.hours
# Indirect
indirect1 := a1 * b1
indirect2 := a2 * b2
# Total
total := c + (a1 * b1) + (a2 * b2)
# Covariance between mediators
M.coffee ~~ M2.activity
'
fit_mult <- sem(model = multipleMediation, data = Meddata)
summary(fit_mult, fit.measures = TRUE, standardized = TRUE)

lavaan 0.6.17 ended normally after 15 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 9
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 605.568
Degrees of freedom 6
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -731.262
Loglikelihood unrestricted model (H1) -731.262
Akaike (AIC) 1480.523
Bayesian (BIC) 1503.970
Sample-size adjusted Bayesian (SABIC) 1475.545
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Y.wake ~
M.coffee (b1) 0.422 0.098 4.326 0.000 0.422 0.517
M2.actvty (b2) 0.137 0.496 0.276 0.782 0.137 0.345
X.hours (c) -0.394 1.025 -0.384 0.701 -0.394 -0.479
M.coffee ~
X.hours (a1) 0.663 0.076 8.766 0.000 0.663 0.659
M2.activity ~
X.hours (a2) 2.063 0.015 138.745 0.000 2.063 0.997
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee ~~
.M2.activity 0.254 0.455 0.559 0.576 0.254 0.056
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Y.wake 21.929 3.101 7.071 0.000 21.929 0.805
.M.coffee 23.086 3.265 7.071 0.000 23.086 0.565
.M2.activity 0.894 0.126 7.071 0.000 0.894 0.005
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 0.280 0.072 3.880 0.000 0.280 0.341
indirect2 0.283 1.024 0.276 0.782 0.283 0.344
total 0.169 0.080 2.103 0.035 0.169 0.206
multMed <- sem(model = multipleMediation, data = Meddata, se = "bootstrap")
summary(multMed, standardized = TRUE, fit.measures = TRUE, rsquare = TRUE)
lavaan 0.6.17 ended normally after 15 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 9
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 605.568
Degrees of freedom 6
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -731.262
Loglikelihood unrestricted model (H1) -731.262
Akaike (AIC) 1480.523
Bayesian (BIC) 1503.970
Sample-size adjusted Bayesian (SABIC) 1475.545
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 974
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Y.wake ~
M.coffee (b1) 0.422 0.099 4.245 0.000 0.422 0.517
M2.actvty (b2) 0.137 0.529 0.259 0.796 0.137 0.345
X.hours (c) -0.394 1.097 -0.359 0.720 -0.394 -0.479
M.coffee ~
X.hours (a1) 0.663 0.074 8.968 0.000 0.663 0.659
M2.activity ~
X.hours (a2) 2.063 0.013 164.862 0.000 2.063 0.997
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee ~~
.M2.activity 0.254 0.440 0.578 0.563 0.254 0.056
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Y.wake 21.929 2.655 8.259 0.000 21.929 0.805
.M.coffee 23.086 3.552 6.500 0.000 23.086 0.565
.M2.activity 0.894 0.141 6.327 0.000 0.894 0.005
R-Square:
Estimate
Y.wake 0.195
M.coffee 0.435
M2.activity 0.995
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 0.280 0.074 3.782 0.000 0.280 0.341
indirect2 0.283 1.092 0.259 0.796 0.283 0.344
total 0.169 0.093 1.810 0.070 0.169 0.206
mediationPlot(multMed, indirect = TRUE, whatLabels = "par", secondIndirect = TRUE) # plot parameter estimates on the paths
# Generate random ages between 18 and 65
Meddata$age <- round(runif(N, min = 18, max = 65))
covMediation <- '
# Direct paths
M.coffee ~ a1 * X.hours + age
M2.activity ~ a2 * X.hours + age
Y.wake ~ b1 * M.coffee + b2 * M2.activity + c * X.hours + age
# Covariance between mediators
M.coffee ~~ M2.activity
# Indirect effects
indirect1 := a1 * b1
indirect2 := a2 * b2
# Total effect
total := c + (a1 * b1) + (a2 * b2)
'
fit_cov <- sem(model = covMediation, data = Meddata)
summary(fit_cov, fit.measures = TRUE, standardized = TRUE) # adds fit measures and standardized estimates
lavaan 0.6.17 ended normally after 20 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 606.331
Degrees of freedom 9
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -730.880
Loglikelihood unrestricted model (H1) -730.880
Akaike (AIC) 1485.760
Bayesian (BIC) 1517.022
Sample-size adjusted Bayesian (SABIC) 1479.123
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
M.coffee ~
X.hours (a1) 0.662 0.076 8.747 0.000 0.662 0.658
age -0.008 0.039 -0.212 0.832 -0.008 -0.016
M2.activity ~
X.hours (a2) 2.063 0.015 138.629 0.000 2.063 0.998
age 0.002 0.008 0.248 0.804 0.002 0.002
Y.wake ~
M.coffee (b1) 0.424 0.097 4.357 0.000 0.424 0.519
M2.actvty (b2) 0.127 0.495 0.256 0.798 0.127 0.319
X.hours (c) -0.370 1.022 -0.363 0.717 -0.370 -0.451
age 0.030 0.037 0.808 0.419 0.030 0.072
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee ~~
.M2.activity 0.257 0.455 0.565 0.572 0.257 0.057
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee 23.076 3.263 7.071 0.000 23.076 0.565
.M2.activity 0.893 0.126 7.071 0.000 0.893 0.005
.Y.wake 21.787 3.081 7.071 0.000 21.787 0.800
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 0.281 0.072 3.900 0.000 0.281 0.342
indirect2 0.261 1.021 0.256 0.798 0.261 0.318
total 0.172 0.080 2.138 0.033 0.172 0.209
covMed <- sem(model = covMediation, data = Meddata, se = "bootstrap")
semPaths(object = covMed, whatLabels = "std")
A note before we add a binary outcome: lavaan only treats a variable as categorical if you tell it to (e.g., by making it an ordered factor or passing the ordered argument to sem()). With a numeric 0/1 outcome and estimator = "ML", as below, lavaan models the outcome as continuous (a linear probability model).
Create binary DV: Pass or fail
pass_threshold <- median(Meddata$Y.wake)
# Create binary DV 'pass_fail', where 1 = pass, 0 = fail
Meddata$pass_fail <- ifelse(Meddata$Y.wake > pass_threshold, 1, 0)
binaryMediation <- '
# Direct paths to mediators
M.coffee ~ a1 * X.hours + age
M2.activity ~ a2 * X.hours + age
# Binary outcome path (with ML, lavaan treats the 0/1 outcome as continuous, i.e., a linear probability model)
pass_fail ~ b1 * M.coffee + b2 * M2.activity + c * X.hours + age
# Covariance between mediators
M.coffee ~~ M2.activity
# Indirect effects
indirect1 := a1 * b1
indirect2 := a2 * b2
# Total effect
total := c + (a1 * b1) + (a2 * b2)
'
# Fitting the model with lavaan
fitBinary <- sem(model = binaryMediation, data = Meddata, estimator = "ML", se = "bootstrap", bootstrap = 1000)
summary(fitBinary, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6.17 ended normally after 37 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 598.365
Degrees of freedom 9
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -500.302
Loglikelihood unrestricted model (H1) -500.302
Akaike (AIC) 1024.604
Bayesian (BIC) 1055.866
Sample-size adjusted Bayesian (SABIC) 1017.967
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 1000
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
M.coffee ~
X.hours (a1) 0.662 0.075 8.806 0.000 0.662 0.658
age -0.008 0.039 -0.209 0.835 -0.008 -0.016
M2.activity ~
X.hours (a2) 2.063 0.012 167.090 0.000 2.063 0.998
age 0.002 0.009 0.217 0.829 0.002 0.002
pass_fail ~
M.coffee (b1) 0.030 0.009 3.221 0.001 0.030 0.386
M2.actvty (b2) 0.024 0.057 0.420 0.674 0.024 0.630
X.hours (c) -0.055 0.118 -0.468 0.640 -0.055 -0.701
age 0.005 0.004 1.394 0.163 0.005 0.130
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee ~~
.M2.activity 0.257 0.423 0.608 0.544 0.257 0.057
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee 23.076 3.691 6.251 0.000 23.076 0.565
.M2.activity 0.893 0.141 6.318 0.000 0.893 0.005
.pass_fail 0.216 0.016 13.747 0.000 0.216 0.866
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 0.020 0.007 3.066 0.002 0.020 0.254
indirect2 0.049 0.118 0.420 0.674 0.049 0.628
total 0.014 0.008 1.848 0.065 0.014 0.182
binMed <- sem(model = binaryMediation, data = Meddata, se = "bootstrap")
# basic
semPaths(object = binMed, whatLabels = "std")
# change shape and label size
binSem <- semPaths(binMed,
whatLabels = "std",
edge.label.cex = 0.75,
label.cex = 1.5,
layout = "spring")Add labels
# change node label
binMed_lab <- change_node_label(binSem,
c(X.h = "Hours",
M.c = "Coffee",
M2. = "Activity",
ps_ = "Grade"),
label.cex = 1.1)
plot(binMed_lab)
Indicate which effects are significant
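The code that marks significance on this plot isn’t shown; mirroring the semptools workflow used for the moderated-mediation plot later in this lab, it was presumably something like:
sig_plot_bin <- binMed_lab %>%
  mark_sig(binMed) # add significance stars from the fitted lavaan model
plot(sig_plot_bin)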
credit: https://ademos.people.uic.edu/Chapter15.html#31_describe_the_dataset
These are challenging to understand, so we’re going to walk through a different example.
Let’s examine the question: does spending more time in grad school predict more job offers after graduation? We’ll see if this relationship is explained by number of publications (i.e., the more time spent in grad school, the more publications one has, and the more publications one has, the more job offers one gets). However, this causal chain may only work for people who spend their time in graduate school networking.
We are going to simulate a dataset that measures the following: time spent in grad school (X), time spent networking (Z, the moderator), number of publications (M, the mediator), and number of job offers (Y, the outcome).
### We are intentionally creating a moderated mediation effect here, and we do so below by setting the relationships (the paths) between our causal chain variables and setting the relationships for our interaction terms
set.seed(42) #This makes sure that everyone gets the same numbers generated through rnorm function
a1 = -.59 #Set the path a1 strength (effect of X on M)
a2 = -.17 #Set path a2 strength (effect of Z on M)
a3 = .29 #Set path a3 strength (interaction between X and Z on M)
b = .59 #Set path b strength (effect of M on Y)
cdash1 = .27 #Set path c'1 strength (effect of X on Y)
cdash2 = .01 #Set path c'2 strength (effect of Z on Y)
cdash3 = -.01 #Set path c'3 strength (interaction between X and Z on Y)
### Here we are creating the values of our variables for each subject
n <- 200 #Set sample size
X <- rnorm(n, 7, 1) #IV: Time spent in grad school (M = 7, SD = 1)
Z <- rnorm(n, 5, 1) #Moderator: Time spent (hours per week) with Professor Demos in class or in office hours (M = 5, SD = 1)
M <- a1*X + a2*Z + a3*X*Z + rnorm(n, 0, .1) #Mediator: Number of publications in grad school
#The mediator variable is created as a function of the IV, moderator, and their interaction with some random noise thrown in the mix
Y <- cdash1*X + cdash2*Z + cdash3*X*Z + b*M + rnorm(n, 0, .1) #DV: Number of job offers
#Similar to the mediator, the DV is a function of the IV, moderator, their interaction, and the mediator with some random noise thrown in the mix
Success.ModMed <- data.frame(jobs = Y, time = X, pubs = M, network = Z) #Build our data frame and give it recognizable variable names
Because we have interaction terms in our regression analyses, we need to mean center our IV and moderator (Z).
Success.ModMed$time.c <- scale(Success.ModMed$time, center = TRUE, scale = FALSE)[,] #Scale returns a matrix so we have to make it a vector by indexing one column
Success.ModMed$network.c <- scale(Success.ModMed$network, center = TRUE, scale = FALSE)[,]
Mod.Med.Lavaan <- '
#Regressions
#These are the same regression equations from our previous example
#Except in this code we are naming the coefficients that are produced from the regression equations
#E.g., the regression coefficient for the effect of time on pubs is named "a1"
pubs ~ a1*time.c + a2*network.c + a3*time.c:network.c
jobs ~ cdash1*time.c + cdash2*network.c + cdash3*time.c:network.c + b1*pubs
#Mean of centered network (for use in simple slopes)
#This is making a coefficient labeled "network.c.mean" which equals the intercept because of the "1"
#(Y~1) gives you the intercept, which is the mean for our network.c variable
network.c ~ network.c.mean*1
#Variance of centered network (for use in simple slopes)
#This is making a coefficient labeled "network.c.var" which equals the variance
#Two tildes separating the same variable gives you the variance
network.c ~~ network.c.var*network.c
#Indirect effects conditional on moderator (a1 + a3*ModValue)*b1
indirect.SDbelow := (a1 + a3*(network.c.mean-sqrt(network.c.var)))*b1
indirect.SDabove := (a1 + a3*(network.c.mean+sqrt(network.c.var)))*b1
#Direct effects conditional on moderator (cdash1 + cdash3*ModValue)
#We have to do it this way because you cannot call the mean and sd functions inside lavaan model syntax
direct.SDbelow := cdash1 + cdash3*(network.c.mean-sqrt(network.c.var))
direct.SDabove := cdash1 + cdash3*(network.c.mean+sqrt(network.c.var))
#Total effects conditional on moderator
total.SDbelow := direct.SDbelow + indirect.SDbelow
total.SDabove := direct.SDabove + indirect.SDabove
#Proportion mediated conditional on moderator
#To match the output of "mediate" package
prop.mediated.SDbelow := indirect.SDbelow / total.SDbelow
prop.mediated.SDabove := indirect.SDabove / total.SDabove
#Index of moderated mediation
#An alternative way of testing if conditional indirect effects are significantly different from each other
index.mod.med := a3*b1
'
#Fit model
Mod.Med.SEM <- sem(model = Mod.Med.Lavaan,
data = Success.ModMed,
se = "bootstrap",
                   bootstrap = 10, # only 10 draws so the example runs quickly; use 1000+ in practice
fixed.x = FALSE) # pubs should model the variance
#Fit measures
summary(Mod.Med.SEM,
fit.measures = FALSE,
        standardized = TRUE,
        rsquare = TRUE)
lavaan 0.6.17 ended normally after 51 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 18
Number of observations 200
Model Test User Model:
Test statistic 12.599
Degrees of freedom 2
P-value (Chi-square) 0.002
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 10
Number of successful bootstrap draws 10
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
pubs ~
time.c (a1) 0.858 0.006 138.664 0.000 0.858 0.425
ntwrk.c (a2) 1.854 0.006 309.044 0.000 1.854 0.892
tm.c:n. (a3) 0.309 0.010 30.402 0.000 0.309 0.149
jobs ~
time.c (cds1) 0.219 0.051 4.339 0.000 0.219 0.175
ntwrk.c (cds2) -0.048 0.104 -0.462 0.644 -0.048 -0.037
tm.c:n. (cds3) -0.009 0.017 -0.556 0.578 -0.009 -0.007
pubs (b1) 0.584 0.057 10.308 0.000 0.584 0.942
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
time.c ~~
time.c:ntwrk.c -0.008 0.127 -0.062 0.951 -0.008 -0.009
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ntwrk.c (nt..) -0.000 0.084 -0.000 1.000 -0.000 -0.000
.pubs 5.163 0.007 723.506 0.000 5.163 2.629
.jobs 1.602 0.292 5.480 0.000 1.602 1.316
time.c 0.000 0.091 0.000 1.000 0.000 0.000
tm.c:n. -0.074 0.041 -1.809 0.070 -0.074 -0.078
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ntwrk.c (nt..) 0.892 0.050 17.959 0.000 0.892 1.000
.pubs 0.010 0.001 11.043 0.000 0.010 0.003
.jobs 0.009 0.001 11.307 0.000 0.009 0.006
time.c 0.945 0.075 12.565 0.000 0.945 1.000
tm.c:n. 0.889 0.136 6.513 0.000 0.889 1.000
R-Square:
Estimate
pubs 0.997
jobs 0.994
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect.SDblw 0.330 0.036 9.182 0.000 0.330 0.260
indirect.SDabv 0.672 0.061 11.090 0.000 0.672 0.540
direct.SDbelow 0.228 0.040 5.704 0.000 0.228 0.182
direct.SDabove 0.210 0.068 3.101 0.002 0.210 0.168
total.SDbelow 0.559 0.014 41.333 0.000 0.559 0.443
total.SDabove 0.882 0.017 51.107 0.000 0.882 0.708
prp.mdtd.SDblw 0.591 0.067 8.771 0.000 0.591 0.588
prop.mdtd.SDbv 0.762 0.075 10.201 0.000 0.762 0.763
index.mod.med 0.181 0.019 9.693 0.000 0.181 0.140
To look at the moderator effect, we can evaluate the conditional direct and indirect effects at ±1 SD from the mean of the moderator.
mm_plot <- semPaths(Mod.Med.SEM,
whatLabels = "est",
edge.label.cex = 0.75,
label.cex = 1.5,
layout = "spring",
                    intercepts = FALSE)
# change node label
Modmed_lab <- change_node_label(mm_plot,
c(jbs = "Y.Jobs",
pbs = "M.Pubs",
tm. = "X.Time",
nt. = "Z.Network",
't.:' = "T*N"),
label.cex = 1.1)
sig_plot_mm <- Modmed_lab %>%
mark_sig(Mod.Med.SEM) %>%
mark_se(Mod.Med.SEM, sep = "\n")
plot(sig_plot_mm)
#Bootstraps
parameterEstimates(Mod.Med.SEM,
boot.ci.type = "bca.simple",
level = .95, ci = TRUE,
                   standardized = FALSE)[c(19:27), c(4:10)] #We index the matrix to display only the columns we are interested in
   label   est    se      z pvalue ci.lower ci.upper
19 indirect.SDbelow 0.330 0.036 9.182 0.000 0.287 0.395
20 indirect.SDabove 0.672 0.061 11.090 0.000 0.571 0.749
21 direct.SDbelow 0.228 0.040 5.704 0.000 0.156 0.273
22 direct.SDabove 0.210 0.068 3.101 0.002 0.117 0.296
23 total.SDbelow 0.559 0.014 41.333 0.000 0.544 0.589
24 total.SDabove 0.882 0.017 51.107 0.000 0.864 0.921
25 prop.mediated.SDbelow 0.591 0.067 8.771 0.000 0.523 0.715
26 prop.mediated.SDabove 0.762 0.075 10.201 0.000 0.678 0.865
27 index.mod.med 0.181 0.019 9.693 0.000 0.155 0.208
The difference is most likely a result of bootstrap estimation differences (e.g., lavaan uses bias-corrected, but not accelerated, bootstrapping for its confidence intervals).
#install.packages("JSmediation")
library(JSmediation)
Success.ModMed$pubs.c <- scale(Success.ModMed$pubs, center = TRUE, scale = FALSE)[,]
moderated_mediation_fit <-
mdt_moderated(data = Success.ModMed,
IV = time.c,
DV = jobs,
M = pubs.c,
Mod = network.c)
moderated_mediation_fit %>% add_index(stage = "first")Test of mediation (moderated mediation)
==============================================
Variables:
- IV: time.c
- DV: jobs
- M: pubs.c
- Mod: network.c
Paths:
======== ============== ===== =========================
Path Point estimate SE APA
======== ============== ===== =========================
a 0.858 0.008 t(196) = 114.12, p < .001
a * Mod 0.309 0.008 t(196) = 38.91, p < .001
b 0.584 0.065 t(194) = 8.99, p < .001
b * Mod -0.003 0.003 t(194) = 1.20, p = .233
c 0.720 0.008 t(196) = 88.82, p < .001
c * Mod 0.171 0.009 t(196) = 19.96, p < .001
c' 0.222 0.056 t(194) = 3.96, p < .001
c' * Mod -0.008 0.021 t(194) = 0.37, p = .709
======== ============== ===== =========================
Indirect effect index:
- type: Mediated moderation index (First stage)
- point estimate: 0.18
- confidence interval:
- method: Monte Carlo (5000 iterations)
- level: 0.05
- CI: [0.139; 0.221]
Fitted models:
- X * Mod -> Y
- X * Mod -> M
- (X + M) * Mod -> Y
- type: Conditional simple mediation index (Mod = 1)
- point estimate: 0.677
- confidence interval:
- method: Monte Carlo (5000 iterations)
- level: 0.05
- CI: [0.526; 0.832]
This dataset has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre, gpa and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
xtabs(~admit + rank, data = mydata)
     rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
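The profiled-likelihood confidence intervals below were presumably produced with:
confint(mylogit)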
2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
rank2 -1.3008888002 -0.056745722
rank3 -2.0276713127 -0.670372346
rank4 -2.4000265384 -0.753542605
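And the odds ratios with their CIs come from exponentiating the coefficients and interval bounds, presumably via:
exp(cbind(OR = coef(mylogit), confint(mylogit)))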
OR 2.5 % 97.5 %
(Intercept) 0.0185001 0.001889165 0.1665354
gre 1.0022670 1.000137602 1.0044457
gpa 2.2345448 1.173858216 4.3238349
rank2 0.5089310 0.272289674 0.9448343
rank3 0.2617923 0.131641717 0.5115181
rank4 0.2119375 0.090715546 0.4706961
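The Wald tests below match the output format of wald.test() from the aod package (an assumption; aod is not in the package list at the top). For the overall effect of rank, which spans terms 4 through 6 of the coefficient vector, the call would look like:
library(aod) # assumed dependency for wald.test()
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)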
Main effect of rank
Wald test:
----------
Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011
The chi-squared test statistic of 20.9, with three degrees of freedom, is associated with a p-value of 0.00011, indicating that the overall effect of rank is statistically significant.
Main effect of gre
Wald test:
----------
Chi-squared test:
X2 = 4.3, df = 1, P(> X2) = 0.038
Main effect of gpa
Wald test:
----------
Chi-squared test:
X2 = 5.9, df = 1, P(> X2) = 0.015
# step 1
newdata1 <- with(mydata,
data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
# step 2
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
    gre    gpa rank     rankP
1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684
# step 3
newdata2 <- with(mydata,
data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
# step 4
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type="link", se=TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
# graph
PredProbPlot <- ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = .2) +
  geom_line(aes(colour = rank), size = 1) + theme_minimal()
PredProbPlot + ggtitle("Admission to Grad School by Rank")
The data set contains variables on 200 students. The outcome variable is prog, program type. The predictors we use here are socioeconomic status (ses, a three-level categorical variable) and math and science scores (both continuous).
(Research Question): When high school students choose the program (general, vocational, and academic programs), how do their math and science scores and their social economic status (SES) affect their decision?
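The crosstab of SES by program below was presumably produced with something like (the code that reads in hsb is not shown here):
xtabs(~ ses + prog, data = hsb)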
prog
ses general academic vocation
low 16 19 12
middle 20 44 31
high 9 42 7
# Since we are going to use Academic as the reference group, we need to relevel the factor.
hsb$prog2 <- relevel(as.factor(hsb$prog), ref = 2)
hsb$ses <- as.factor(hsb$ses)
levels(hsb$prog2)
[1] "academic" "general"  "vocation"
# Run a multinomial model
multi_mo <- multinom(prog2 ~ ses + math + science + math*science, data = hsb, model = TRUE)
# weights:  21 (12 variable)
initial value 219.722458
iter 10 value 173.831002
iter 20 value 167.382760
final value 166.951813
converged
Call:
multinom(formula = prog2 ~ ses + math + science + math * science,
data = hsb, model = TRUE)
Coefficients:
(Intercept) sesmiddle seshigh math science math:science
general 5.897618 -0.4081497 -1.1254491 -0.1852220 0.01323626 0.001025283
vocation 22.728283 0.8402168 -0.5605656 -0.5036705 -0.28297703 0.006185571
Std. Errors:
(Intercept) sesmiddle seshigh math science math:science
general 0.002304064 0.2613732 0.2134308 0.02694593 0.02953364 0.0004761369
vocation 0.003856861 0.2959741 0.1984775 0.02681947 0.03142872 0.0004760567
Residual Deviance: 333.9036
AIC: 357.9036
Main effects
Analysis of Deviance Table (Type II tests)
Response: prog2
LR Chisq Df Pr(>Chisq)
ses 12.922 4 0.01166 *
math 39.156 2 0.000000003143 ***
science 8.134 2 0.01713 *
math:science 5.249 2 0.07247 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
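The Cox–Snell and Nagelkerke pseudo-R² values below can be obtained with DescTools (loaded at the top); a call consistent with this output is:
PseudoR2(multi_mo, which = c("CoxSnell", "Nagelkerke"))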
CoxSnell Nagelkerke
0.3102655 0.3565873
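The tidy coefficient table below, with exponentiated estimates (relative risk ratios) and confidence intervals, matches broom’s tidier for multinom models; assuming broom is installed, the call would be:
broom::tidy(multi_mo, exponentiate = TRUE, conf.int = TRUE)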
# A tibble: 12 × 8
y.level term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 general (Intercept) 3.64e+2 0.00230 2560. 0 3.63e+2 3.66e+2
2 general sesmiddle 6.65e-1 0.261 -1.56 1.18e- 1 3.98e-1 1.11e+0
3 general seshigh 3.25e-1 0.213 -5.27 1.34e- 7 2.14e-1 4.93e-1
4 general math 8.31e-1 0.0269 -6.87 6.25e-12 7.88e-1 8.76e-1
5 general science 1.01e+0 0.0295 0.448 6.54e- 1 9.56e-1 1.07e+0
6 general math:science 1.00e+0 0.000476 2.15 3.13e- 2 1.00e+0 1.00e+0
7 vocation (Intercept) 7.43e+9 0.00386 5893. 0 7.37e+9 7.48e+9
8 vocation sesmiddle 2.32e+0 0.296 2.84 4.53e- 3 1.30e+0 4.14e+0
9 vocation seshigh 5.71e-1 0.198 -2.82 4.74e- 3 3.87e-1 8.42e-1
10 vocation math 6.04e-1 0.0268 -18.8 1.10e-78 5.73e-1 6.37e-1
11 vocation science 7.54e-1 0.0314 -9.00 2.18e-19 7.09e-1 8.01e-1
12 vocation math:science 1.01e+0 0.000476 13.0 1.33e-38 1.01e+0 1.01e+0
Another way to understand the model is to look at averaged predicted probabilities for different values of the continuous predictor math within each level of ses. Here, we instead visualize the raw coefficients for each outcome category.
# Extract coefficients
coef_df <- as.data.frame(coef(multi_mo))
coef_df$Term <- row.names(coef_df)
# Melt the dataframe for ggplot
library(reshape2)
melted_coef_df <- melt(coef_df, id.vars = c("Term", "(Intercept)"))
# Plot using ggplot2
library(ggplot2)
ggplot(melted_coef_df, aes(x = Term, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(x = "Term", y = "Coefficient", fill = "Outcome") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
A one-unit increase in math is associated with a 0.19 decrease in the log odds of being in the general program vs. the academic program (equivalently, the odds are multiplied by 0.83 per one-unit increase in math).
Students in the highest SES level, compared to the lowest, are less likely to be in the general program than the academic one, b = -1.125, z = -5.27, p < .001. Their odds of choosing general over academic are about 0.325 times those of low-SES students; put differently, lower-SES students are more inclined to pick the general program over the academic one.
Similarly, high-SES students (compared to low) are less likely to be in the vocational program than the academic program, b = -0.56, z = -2.82, p < .01, an odds ratio of about 0.57. In other words, relative to high-SES students, lower-SES students prefer the vocational program over the academic one.
Here, we’re looking at whether prog and math predict the number of awards, a count outcome, so we use Poisson regression.
p <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels=1:3, labels=c("General", "Academic",
"Vocational"))
id <- factor(id)
})
summary(p)
       id        num_awards         prog          math
1 : 1 Min. :0.00 General : 45 Min. :33.00
2 : 1 1st Qu.:0.00 Academic :105 1st Qu.:45.00
3 : 1 Median :0.00 Vocational: 50 Median :52.00
4 : 1 Mean :0.63 Mean :52.65
5 : 1 3rd Qu.:1.00 3rd Qu.:59.00
6 : 1 Max. :6.00 Max. :75.00
(Other):194
ggplot(p, aes(num_awards, fill = prog)) +
geom_histogram(binwidth=.5, position="dodge") + theme_classic()
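The summary below comes from the Poisson model; the assignment itself wasn’t echoed above, but the Call line and the later use of m1 imply:
m1 <- glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)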
Call:
glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2043 -0.8436 -0.5106 0.2558 2.6796
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.24712 0.65845 -7.969 0.0000000000000016 ***
progAcademic 1.08386 0.35825 3.025 0.00248 **
progVocational 0.36981 0.44107 0.838 0.40179
math 0.07015 0.01060 6.619 0.0000000000362501 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 287.67 on 199 degrees of freedom
Residual deviance: 189.45 on 196 degrees of freedom
AIC: 373.5
Number of Fisher Scoring iterations: 6
Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean.
cov.m1 <- vcovHC(m1, type = "HC0") # vcovHC() comes from the sandwich package
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate= coef(m1), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(m1)/std.err), lower.tail=FALSE),
LL = coef(m1) - 1.96 * std.err,
UL = coef(m1) + 1.96 * std.err)
r.est
                 Estimate  Robust SE                Pr(>|z|)          LL          UL
(Intercept) -5.2471244 0.64599839 0.000000000000000456663 -6.51328124 -3.98096756
progAcademic 1.0838591 0.32104816 0.000735474482416738676 0.45460476 1.71311353
progVocational 0.3698092 0.40041731 0.355715684208826043999 -0.41500870 1.15462716
math 0.0701524 0.01043516 0.000000000017839751696 0.04969947 0.09060532
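Because Poisson coefficients are on the log scale, exponentiating gives incident rate ratios. A quick way to get IRRs with their robust 95% CI bounds from the r.est table built above:
exp(r.est[, c("Estimate", "LL", "UL")]) # IRRs and exponentiated robust CI bounds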
## calculate and store predicted values
p$phat <- predict(m1, type="response")
## order by program and then by math
p <- p[with(p, order(prog, math)), ]
## create the plot
ggplot(p, aes(x = math, y = phat, colour = prog)) +
geom_point(aes(y = num_awards), alpha=.5, position=position_jitter(h=.2)) +
geom_line(size = 1) + theme_classic() +
labs(x = "Math Score", y = "Expected number of awards")Being in an academic program was associated with a significant increase in the number of awards, b = 1.084, SE = 0.358, z = 3.025, p = 0.00248, indicating that students in academic programs tend to receive more awards than those in general programs, holding math scores constant.
Being in a vocational program did not significantly predict the number of awards compared to being in a general program, b = 0.370, SE = 0.441, z = 0.838, p = 0.40179, suggesting no significant difference in the expected count of awards between vocational and general program students, again holding math scores constant.
Math scores were also a significant predictor of the number of awards, b = 0.07015, SE = 0.01060, z = 6.619, p < 0.0001, suggesting that, regardless of program type, an increase in math scores is associated with an increase in the expected number of awards.
library(MASS) # For ordinal logistic regression (polr)
# Generate example data
set.seed(123)
data <- data.frame(
StudyHours = rnorm(100, mean = 5, sd = 2), # Simulating study hours
  Performance = ordered(sample(1:5, 100, replace = TRUE), levels = c(1, 2, 3, 4, 5), labels = c("Poor", "Below Average", "Average", "Above Average", "Excellent")) # drawn independently of StudyHours, so we expect little true association
)
# Fit an ordinal logistic regression model
model <- polr(Performance ~ StudyHours, data = data)
summary(model)
Call:
polr(formula = Performance ~ StudyHours, data = data)
Coefficients:
Value Std. Error t value
StudyHours 0.1162 0.09861 1.178
Intercepts:
Value Std. Error t value
Poor|Below Average -0.5097 0.5491 -0.9284
Below Average|Average 0.4270 0.5379 0.7938
Average|Above Average 0.9074 0.5377 1.6876
Above Average|Excellent 1.5883 0.5552 2.8607
Residual Deviance: 311.9673
AIC: 321.9673
## store the coefficient table (this call wasn't echoed above; ctable is used below)
(ctable <- coef(summary(model)))
                          Value Std. Error    t value
StudyHours 0.1161846 0.09860927 1.1782316
Poor|Below Average -0.5097395 0.54906589 -0.9283759
Below Average|Average 0.4269730 0.53789874 0.7937795
Average|Above Average 0.9073994 0.53768046 1.6876183
Above Average|Excellent 1.5882928 0.55521379 2.8606868
## calculate and store p values (note: this reuses the name p, overwriting the earlier data frame)
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, "p value" = p))
                          Value Std. Error    t value     p value
StudyHours 0.1161846 0.09860927 1.1782316 0.238704296
Poor|Below Average -0.5097395 0.54906589 -0.9283759 0.353212615
Below Average|Average 0.4269730 0.53789874 0.7937795 0.427323829
Average|Above Average 0.9073994 0.53768046 1.6876183 0.091484520
Above Average|Excellent 1.5882928 0.55521379 2.8606868 0.004227244
Note that the intercepts are the cutpoints between adjacent outcome categories on the latent scale, so the significant "Above Average|Excellent" row simply indicates that this cutpoint differs from zero; it does not mean study hours matter more at that boundary. In fact, StudyHours itself is not a significant predictor in this simulated example (p = .239).
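As with binary logistic regression, exponentiating the slope gives an odds ratio, here a proportional odds ratio under the polr model:
exp(coef(model)) # proportional odds ratio for StudyHours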
# Create a sequence of study hours for prediction
new_data <- data.frame(StudyHours = seq(min(data$StudyHours), max(data$StudyHours), length.out = 100))
# Predict probabilities for each performance category
predicted_probs <- as.data.frame(predict(model, newdata = new_data, type = "probs"))
# Rename the columns for clarity
colnames(predicted_probs) <- c("Poor", "Below Average", "Average", "Above Average", "Excellent")
# Add the study-hours sequence to the prediction data so ggplot can map it directly
predicted_probs$StudyHours <- new_data$StudyHours
# Create a visualization
ggplot(predicted_probs, aes(x = StudyHours)) +
geom_line(aes(y = `Poor`, color = "Poor"), size = 1) +
geom_line(aes(y = `Below Average`, color = "Below Average"), size = 1) +
geom_line(aes(y = `Average`, color = "Average"), size = 1) +
geom_line(aes(y = `Above Average`, color = "Above Average"), size = 1) +
geom_line(aes(y = `Excellent`, color = "Excellent"), size = 1) +
labs(title = "Ordinal Logistic Regression: Predicting Student Performance",
x = "Study Hours",
y = "Predicted Probability") +
scale_color_manual(values = c("Poor" = "#1f77b4",
"Below Average" = "#ff7f0e",
"Average" = "#2ca02c",
"Above Average" = "#d62728",
"Excellent" = "#9467bd"),
name = "Performance Level",
limits = c("Excellent","Above Average", "Average", "Below Average", "Poor")) + # Explicitly set the order
theme_minimal()