In this tutorial, we will mainly use lme4 and lmerTest to fit linear mixed effects models. The data Produc comes from the package plm. We are going to study the relationship between

- pc: private capital stock
- emp: labor input, measured by employment in non-agricultural payrolls

considering the effects of year and state.
library(ggplot2)
library(GGally)
library(plm) # linear models for panel data; provides the Produc data
library(tidyverse)
library(ggfortify)
library(ggpubr)
library(lme4) # linear mixed effect model
library(lmerTest)
data(Produc) # load the data
#head(Produc)
state <- c("AL","AZ","AR","CA","CO","CT","DE","FL","GA","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT",
"NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA",
"RI","SC","SD","TN","TX","UT","VT","VI","WA","WV","WI","WY")
Produc$state <- rep(state, each = 17) # 17 years (1970-1986) per state
data <- Produc %>%
  filter(year > 1980) %>%       # keep 1981-1986
  select(state, year, pc, emp)  # variables of interest
data$year <- as.factor(data$year)
data[1:10,]
ggpairs(data[,-1])
\[ H_0: \text{the mean private capital stock is the same from 1981 to 1986} \] \[ H_a: \text{the mean private capital stock is not the same across all years} \]
With ANOVA, we are testing the group means: the null hypothesis is that all group means are equal, while the alternative is that not all of them are the same.
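The F statistic compares the between-group variability to the within-group variability: \[ F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)}, \] where \(k\) is the number of groups (here, 6 years) and \(n\) is the total number of observations.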
boxplot(pc ~ year, data)
mod0 <- aov(pc ~ year, data)
summary(mod0)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 5 2.190e+09 4.380e+08 0.088 0.994
## Residuals 282 1.403e+12 4.974e+09
With a p-value of 0.994, we fail to reject the null hypothesis: there is no evidence that the mean pc differs across the years.
# mod0 <- lm(pc ~ year, data) # the same model fit via lm()
ggboxplot(data, x = "year", y = "pc",
color = "year", palette = "jco",
add = "jitter") + stat_compare_means(method = "anova")
# Fit linear regression
mod1 <- lm(pc ~ emp, data = data)
autoplot(mod1)
summary(mod1)
##
## Call:
## lm(formula = pc ~ emp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62787 -10491 -2709 5403 158714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7078.9203 2271.8141 3.116 0.00202 **
## emp 31.4491 0.8087 38.890 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27950 on 286 degrees of freedom
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8404
## F-statistic: 1512 on 1 and 286 DF, p-value: < 2.2e-16
ggplot(data,aes(x = emp, y = pc)) + geom_point() +
geom_smooth(method="lm") + ggtitle("Linear Regression Model: pc ~ emp")
If we directly fit the linear regression pc ~ emp, the diagnostic plots show many violations of the linear model assumptions. However, before applying any remedy, let's look at the data.
ggplot(data) +
geom_point(aes(x = emp, y = pc)) + ggtitle("Emp vs PC")
The scatter plot shows potential linearity, but with considerable variation. Let's recreate the scatter plot by year.
ggplot(data) +
geom_point(aes(x = emp, y = pc, color = year)) + facet_wrap(~year) + ggtitle("Emp vs PC by year")
ggplot(data) +
geom_text(aes(x = emp, y = pc, label = state, color = year)) + facet_wrap(~year)+ ggtitle("Emp vs PC by year (Outliers)")
# two outliers: Texas, Louisiana
# leverage point: California
On this graph, Texas and Louisiana appear to be outliers, while California is a leverage point. We delete the data from these three states. Since the data show a linear relationship when split by year, we add year as a fixed effect.
By using a fixed effects model, we assume the relationship (the slope \(\beta_1\)) between pc and emp is the same across the 6 years. However, due to the different economic situations, the intercepts (or offsets, or initial positions) differ. Thus, we will get 6 parallel regression lines.
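In equation form, the model fitted below is \[ pc_i = \beta_0 + \beta_1\, emp_i + \gamma_{year(i)} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \] where \(\gamma_{year(i)}\) is the fixed effect of the year of observation \(i\), with 1981 as the reference level (\(\gamma_{1981} = 0\)).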
df <- data %>% filter(state != "TX", state != "LA", state != "CA")
df <- df[order(df$state, df$year), ] # sort by state, then year
df$year <- as.factor(df$year) # change "year" to factor
# add "year" as a fixed effect
# the slope for "emp" are the same among the 6 year, however each year has a different "intercept"
# the regression lines should be parallel to each other
mod2 <- lm(pc ~ emp + year, data = df)
# check residuals
autoplot(mod2)
summary(mod2)
##
## Call:
## lm(formula = pc ~ emp + year, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27542.5 -7046.5 492.2 7566.6 28428.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9265.9131 1830.6468 5.062 7.82e-07 ***
## emp 26.3126 0.4531 58.067 < 2e-16 ***
## year1982 2354.0008 2373.8844 0.992 0.322
## year1983 2806.6694 2373.8524 1.182 0.238
## year1984 1566.3448 2373.9819 0.660 0.510
## year1985 1975.5624 2374.3226 0.832 0.406
## year1986 2460.8819 2374.7416 1.036 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11260 on 263 degrees of freedom
## Multiple R-squared: 0.9278, Adjusted R-squared: 0.9262
## F-statistic: 563.5 on 6 and 263 DF, p-value: < 2.2e-16
df$pred1 <- mod2$fitted.values
# get parallel regression line with different intercepts
ggplot(df) + geom_point(aes(x = emp, y = pc, color = year,group = year) ) +
geom_line(aes(x = emp, y = pred1, color = year,group = year) ) + ggtitle("Add Year as Fixed Effects ")
ggplot(df) + geom_point(aes(x = emp, y = pc, color = year)) +
geom_line(aes(x = emp, y = pred1, color= year)) + facet_wrap(~year) + ggtitle("Add Year as Fixed Effects ")
A fixed effects model estimates the effect of each level, here, of each year. Please note that the results cannot be extended to other years, and inference applies only to the given time period.
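As a quick illustration (a sketch; 1987 is a made-up year outside the data):
# a fixed effects fit cannot predict for a year level it has never seen:
# predict(mod2, newdata = data.frame(emp = 1000, year = factor("1987")))
# this stops with an error: "factor year has new levels 1987"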
If we want to account for the differences among states, state can be added to the model. Here, we treat it as a random effect. In contrast to fixed effects, random effects are unobserved random variables, not parameters. They can also represent variation that is not of primary interest in the study. Instead of estimating an effect for each level, we look for the general distribution of this variable.
Here, we add a random intercept to the model. That is, each state has its own regression line, and these lines are parallel to each other. The random intercepts follow a normal distribution.
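In equation form, the intercept-only model below is \[ pc_{ij} = \beta_0 + u_i + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2), \] where \(u_i\) is the random intercept of state \(i\) and \(j\) indexes the years.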
# state: random effect
# library(lme4)
# library(lmerTest)
# the default lmer function does not provide p-values;
# load the "lmerTest" package to get p-values (Satterthwaite's method)
# Intercept only: random intercept with fixed mean
# pc is predicted by state, a categorical random effect
# intercept is represented by "1"
# "|" means random effect
mod3 <- lmer(pc ~ (1 | state), df)
summary(mod3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pc ~ (1 | state)
## Data: df
##
## REML criterion at convergence: 5456.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9577 -0.3279 -0.0456 0.2270 5.1094
##
## Random effects:
## Groups Name Variance Std.Dev.
## state (Intercept) 1.737e+09 41683
## Residual 1.227e+07 3502
## Number of obs: 270, groups: state, 45
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 54674 6217 44 8.794 3.01e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the variance of the random intercept across states is 1.737e+09
# the intraclass correlation is 1.737e9 / (1.737e9 + 1.227e7), about 0.99
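# a quick sketch of that computation from the fitted model:
vc <- as.data.frame(VarCorr(mod3)) # variance components: state intercept, then residual
vc$vcov[1] / sum(vc$vcov) # between-state variance / total variance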
ggplot(df) + geom_point(aes(x = state, y = pc)) + ggtitle("Random Effects for Different States")
# next, keep state as a random intercept and add fixed effects:
# we believe the intercepts differ across states
In practice, people rarely use a random-intercept-only model, but we can add this random intercept to any fixed effects model to make it a mixed effects model.
We add a random intercept by state to the original linear regression to get the mixed effect model:
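\[ pc_{ij} = \beta_0 + u_i + \beta_1\, emp_{ij} + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2) \]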
mod4 <- lmer(pc ~ (1 | state) + emp, df)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
# summary(mod4)
We get a warning saying "Some predictor variables are on very different scales: consider rescaling", so we scale the predictor emp.
df$emp_s <- (df$emp-mean(df$emp))/sd(df$emp)
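# equivalent base R shortcut: df$emp_s <- as.numeric(scale(df$emp))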
mod5 <- lmer(pc ~ (1 | state) + emp_s, df)
summary(mod5)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pc ~ (1 | state) + emp_s
## Data: df
##
## REML criterion at convergence: 5086.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.2107 -0.2440 0.0420 0.3574 2.7730
##
## Random effects:
## Groups Name Variance Std.Dev.
## state (Intercept) 124456690 11156
## Residual 4276403 2068
## Number of obs: 270, groups: state, 45
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 54673.88 1667.80 43.67 32.78 <2e-16 ***
## emp_s 39913.34 1269.27 119.36 31.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## emp_s 0.000
# coef(summary(mod5)) # fixed effect estimates
# confint(mod5) # confidence intervals for the fixed effect estimates
# ranef(mod5) # predicted random effects (BLUPs); we can't calculate confidence intervals for individual random effects
Both fixed effects are significant. Let's check the residuals; the diagnostics are the same as in linear regression.
plot(mod5, main = "Residuals vs. Fitted Plot for Mixed Effects Model")
qqnorm(residuals(mod5), main = "QQ Plot for residuals from Mixed Effects Model")
The residual plots suggest that the data need to be transformed (econ data, of course!). Perform a log transformation and refit the model:
df$logpc <- log(df$pc)
mod6 <- lmer(logpc ~ (1 | state) + emp_s, df)
summary(mod6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logpc ~ (1 | state) + emp_s
## Data: df
##
## REML criterion at convergence: -654.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5631 -0.5775 0.0608 0.3558 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev.
## state (Intercept) 0.242602 0.49255
## Residual 0.001631 0.04039
## Number of obs: 270, groups: state, 45
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 10.61416 0.07347 42.05024 144.48 <2e-16 ***
## emp_s 0.51900 0.03379 263.63464 15.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## emp_s 0.000
plot(mod6)
qqnorm(residuals(mod6))
The residuals look good. Let's get the fitted values (for illustration only; we cannot really estimate the random effects, only predict them). Please note that our goal is the relationship between emp and pc; we do not care about the individual random effect values.
Let’s plot the results for “AL”, “AZ”, “CO” and “CT”.
ranef(mod6) # the predicted random intercept for each state
## $state
## (Intercept)
## AL 0.46833133
## AR 0.07578189
## AZ 0.12224831
## CO 0.25783900
## CT -0.10480453
## DE -0.96512524
## FL 0.22237142
## GA 0.32429287
## IA 0.36209278
## ID -0.61401315
## IL 0.32042408
## IN 0.56419835
## KS 0.44398626
## KY 0.35778086
## MA -0.01395532
## MD 0.02364600
## ME -0.76090185
## MI 0.47618189
## MN 0.34011221
## MO 0.34012386
## MS 0.09665642
## MT -0.25287013
## NC 0.27592028
## ND -0.15554241
## NE 0.02582168
## NH -0.99155704
## NJ 0.17347303
## NM 0.04687365
## NV -0.33533415
## NY -0.47870086
## OH 0.31205541
## OK 0.65334073
## OR 0.14640274
## PA 0.20404309
## RI -1.24944820
## SC 0.19247111
## SD -0.74503172
## TN 0.36831347
## UT -0.32254805
## VA  0.18990220
## VT -1.43085097
## WA 0.43855528
## WI 0.24784259
## WV 0.23033496
## WY 0.11926585
##
## with conditional variances for "state"
# fixed-effect part of the prediction; fixef() avoids hand-copying numbers from summary(mod6)
fixed <- fixef(mod6)[1] + fixef(mod6)[2] * df$emp_s
# one predicted intercept per state, repeated for its 6 yearly rows (df is sorted by state, then year)
random <- rep(unlist(ranef(mod6)$state), each = 6)
df$pred2 <- fixed + random # prediction
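# equivalently: df$pred2 <- predict(mod6), since predict() includes the random effects by default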
sub1 <- df %>% filter(state == "AL" | state == "AZ" | state == "CO"| state == "CT" )
# parallel lines
ggplot(sub1) + geom_point(aes(x = emp_s, y = logpc, color = state, group = state)) +
geom_line(aes(x = emp_s, y = pred2, color = state, group = state)) + ggtitle("Mixed Effects Model: Random Intercept")
The regression lines share the same slope for emp but have random intercepts. Again we should expect parallel regression lines.
The random intercept model is the most realistic and popular model in practice. However, reality could be more complicated: the relationship between pc and emp may differ among the states. That is, both the intercepts and the slopes (\(\beta_1\)) may vary across states. This leads to a random intercept and random slope model.
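In equation form, the model fitted below is \[ \log(pc_{ij}) = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,\text{emp\_s}_{ij} + \varepsilon_{ij}, \qquad (u_{0i}, u_{1i}) \sim N(0, \Sigma), \] where state \(i\) gets its own random intercept \(u_{0i}\) and random slope \(u_{1i}\), which may be correlated.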
# random slope and random intercept:
# allow each state to have its own regression line
mod7 <- lmer(logpc ~ (1 + emp_s | state) + emp_s, df)
summary(mod7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logpc ~ (1 + emp_s | state) + emp_s
## Data: df
##
## REML criterion at convergence: -723.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8799 -0.5287 0.0825 0.4310 4.1855
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## state (Intercept) 0.132141 0.3635
## emp_s 0.066951 0.2587 -0.98
## Residual 0.001296 0.0360
## Number of obs: 270, groups: state, 45
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 10.76483 0.05562 33.71439 193.54 < 2e-16 ***
## emp_s 0.64672 0.04192 22.63397 15.43 1.69e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## emp_s -0.857
plot(mod7,main = "Residuals vs. Fitted Plot for Mixed Effects Model")
qqnorm(residuals(mod7), main = "QQ Plot of Residuals from Mixed Effects Model")
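If we want to formally compare the random intercept model (mod6) with the random slope model (mod7), one option (a sketch, not part of the original analysis) is a likelihood ratio test; anova() refits both models with ML before comparing:
anova(mod6, mod7) # likelihood ratio test for the nested mixed models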
The residual plots look good. Let's again visualize the four states, now with both random intercepts and random slopes.
# coef(mod7) combines the fixed effects with each state's random effects
df$intercept <- rep(unlist(coef(mod7)$state[1]), each = 6) # per-state intercepts
df$slope <- rep(unlist(coef(mod7)$state[2]), each = 6) # per-state slopes
df$pred3 <- df$intercept + df$slope * df$emp_s
sub2 <- df %>% filter(state == "AL" | state == "AZ" | state == "CO"| state == "CT" )
# non-parallel lines
ggplot(sub2) + geom_point(aes(x = emp_s, y = logpc, color = state, group = state)) +
geom_line(aes(x = emp_s, y = pred3, color = state, group = state)) + ggtitle("Mixed Effects Model: Random Intercept and Random Slope")
Now, the four regression lines have different intercepts and different slopes. They are not parallel anymore.