Preparation

In this tutorial, we will mainly use lme4 and lmerTest for linear mixed effect model. The data Produc is from package plm. We are going to study on the relationship between

considering the effect of year and state.

library(ggplot2)
library(GGally)
library(plm) # panel data linear mixed
library(tidyverse)
library(ggfortify)
library(ggpubr)
library(lme4) # linear mixed effect model
library(lmerTest)

data(Produc)  # load the data
#head(Produc)
state <- c("AL","AZ","AR","CA","CO","CT","DE","FL","GA","ID","IL","IN",
           "IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT",
           "NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA",
           "RI","SC","SD","TN","TX","UT","VT","VI","WA","WV","WI","WY")
Produc$state <- rep(state, each = 17)
data <- Produc %>%
  filter(year > 1980) %>%
  select( state, year, pc, emp)
data$year <- as.factor(data$year)
data[1:10,]
ggpairs(data[,-1])

ANOVA: pc ~ year

\[ H_0: \text{the mean private capital stock were the same from 1981 - 1986} \] \[ H_a: \text{the mean private capital stock were not the same from 1981 - 1986} \]

By ANOVA, we are testing on the group means. The null hypothesis is all the group means are equal while the alternative is not all the means are the same.

boxplot(pc~ year, data)

mod0 <- aov(pc~year, data)
summary(mod0)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## year          5 2.190e+09 4.380e+08   0.088  0.994
## Residuals   282 1.403e+12 4.974e+09
# mod0 <- lm(pc ~ year, data)

ggboxplot(data, x = "year", y = "pc",
          color = "year", palette = "jco",
          add = "jitter") + stat_compare_means(method = "anova")

Linear Regression: pc ~ emp

# Fit linear regression
mod1 <- lm(pc ~ emp, data = data)
autoplot(mod1)

summary(mod1)
## 
## Call:
## lm(formula = pc ~ emp, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62787 -10491  -2709   5403 158714 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7078.9203  2271.8141   3.116  0.00202 ** 
## emp           31.4491     0.8087  38.890  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27950 on 286 degrees of freedom
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8404 
## F-statistic:  1512 on 1 and 286 DF,  p-value: < 2.2e-16
ggplot(data,aes(x = emp, y = pc)) + geom_point() +
  geom_smooth(method="lm") + ggtitle("Linear Regresion Model: pc ~ emp")

P2.If we directly fit a linear regression pc ~ emp, the diagnostic plots show lots of violation against linear model assumptions. However, before doing any remedy, let’s look at the data.

ggplot(data) + 
  geom_point(aes(x = emp, y = pc)) + ggtitle("Emp vs PC")

The scatter plot show potential linearity, but with some variation. Let’s recreate the scatter plot by year.

ggplot(data) + 
  geom_point(aes(x = emp, y = pc, color = year))  + facet_wrap(~year) + ggtitle("Emp vs PC by year")

ggplot(data) + 
  geom_text(aes(x = emp, y = pc, label = state, color = year))  + facet_wrap(~year)+ ggtitle("Emp vs PC by year (Outliers)")

# two outliers: Texas, Louisana, 
# leverage points California 

On this graph, Texas and Louisana seems to be outliers, while California is a leverage point. Delete data from the three states. Since the data has linear relationship if we split the data by year, we add year as fixed effect.

Fixed Effect Model: pc ~ emp(fixed effect) + year(fixed effect)

By using fixed effects model, we are assuming the relationship (slope \(\beta_1\)) between pc and emp are the same across 6 years. However, due to the different ecnomic situation, the intercepts (or offset, or initial position) are different. Thus, we are going to have 6 parallel regression lines.

df <- data %>% filter(state !="TX" , state != "LA", state !="CA")
df <- df[order(df$state, df$year), ]
df$year <- as.factor(df$year) # change "year" to factor

# add "year" as a fixed effect
# the slope for "emp" are the same among the 6 year, however each year has a different "intercept"
# the regression lines should be parallel to each other


mod2 <- lm(pc ~ emp + year, data = df)
# check residuals
autoplot(mod2)

summary(mod2)
## 
## Call:
## lm(formula = pc ~ emp + year, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27542.5  -7046.5    492.2   7566.6  28428.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9265.9131  1830.6468   5.062 7.82e-07 ***
## emp           26.3126     0.4531  58.067  < 2e-16 ***
## year1982    2354.0008  2373.8844   0.992    0.322    
## year1983    2806.6694  2373.8524   1.182    0.238    
## year1984    1566.3448  2373.9819   0.660    0.510    
## year1985    1975.5624  2374.3226   0.832    0.406    
## year1986    2460.8819  2374.7416   1.036    0.301    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11260 on 263 degrees of freedom
## Multiple R-squared:  0.9278, Adjusted R-squared:  0.9262 
## F-statistic: 563.5 on 6 and 263 DF,  p-value: < 2.2e-16
df$pred1 <- mod2$fitted.values


# get parallel regression line with different intercepts

ggplot(df) + geom_point(aes(x = emp, y = pc, color = year,group = year) ) + 
  geom_line(aes(x = emp, y = pred1, color = year,group = year) ) + ggtitle("Add Year as Fixed Effects ")

ggplot(df) + geom_point(aes(x = emp, y = pc, color = year)) + 
  geom_line(aes(x = emp, y = pred1, color= year)) + facet_wrap(~year) + ggtitle("Add Year as Fixed Effects ")

Fixed effects model estimate the fixed effect for each level, here, for each year. Please note the results can not be extended to another year, and the inference can only be done for the given time period.

Random Effect Model: pc ~ state(random intercept)

If we want to consider the difference among different states, state can be added into the model. Here, we treat it as random effects. Contrast to the fixed effects, the random effects are unobserved random variables, not parameters. It can also be some variation that are not important in the study. Instead of focusing on the estimates for each level, we are looking for a general distribution for this variable.

Here, we add a random intercept to this model. That is, each state will have its own regression line. These lines are parallel to each other. The random intercepts follows some normal distribution.

# state: random effect
# library(lme4)
# library(lmerTest)

# the default lmer function won't provide p-value
# load "lmerTest" package to get p-values


# Intercept only: random intercept with fixed mean
# pc is predicted by state, a categorical random effect 
# intercept is represented by "1"
# "|" means random effect
mod3 <- lmer(pc ~  (1|state) , df)
summary(mod3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pc ~ (1 | state)
##    Data: df
## 
## REML criterion at convergence: 5456.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9577 -0.3279 -0.0456  0.2270  5.1094 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  state    (Intercept) 1.737e+09 41683   
##  Residual             1.227e+07  3502   
## Number of obs: 270, groups:  state, 45
## 
## Fixed effects:
##             Estimate Std. Error    df t value Pr(>|t|)    
## (Intercept)    54674       6217    44   8.794 3.01e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the variability of the intercept across states is 1.737*10^9
# The intra class correlation is 0.98


ggplot(df) + geom_point(aes(x = state, y = pc)) + ggtitle("Random Effects for Different States")

# add state as random intercept, while year as fixed effect, 
# we believe there is a difference between the intercept 

In practice, people rarely use a random intercept only model. But we can add this random intercept into any fixed effect model, to make it a mixed effect model.

Mixed Effect Model

Random Intercept: pc ~ state(random intercept) + emp(fixed)

We add a random intercept by state to the original linear regression to get the mixed effect model:

mod4 <- lmer(pc ~ (1|state)+ emp  , df)
## Warning: Some predictor variables are on very different scales: consider
## rescaling

## Warning: Some predictor variables are on very different scales: consider
## rescaling
# summary(mod4)

We get the warning message saying “Some predictor variables are on very different scales: consider rescaling”. Scale the predictor emp.

df$emp_s <- (df$emp-mean(df$emp))/sd(df$emp)
mod5 <- lmer(pc ~ (1|state) + emp_s , df)
summary(mod5)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pc ~ (1 | state) + emp_s
##    Data: df
## 
## REML criterion at convergence: 5086.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.2107 -0.2440  0.0420  0.3574  2.7730 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  state    (Intercept) 124456690 11156   
##  Residual               4276403  2068   
## Number of obs: 270, groups:  state, 45
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 54673.88    1667.80    43.67   32.78   <2e-16 ***
## emp_s       39913.34    1269.27   119.36   31.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## emp_s 0.000
# coef(summary(mod5)) # fixed effect estimates
# confint(mod5) # confidence interval for fixed effect estimates
# ranef(mod5) # random effect estimates, we can't calculate confidence interval for random effect

The model is significant. Let’s check the residuals. It’s the same as the diagnostics in linear regression.

plot(mod5, main = "Residuals vs. Fitted Plot for Mixed Effects Model")

qqnorm(residuals(mod5), main = "QQ Plot for residuals from Mixed Effects Model")

The residuals plots suggest that the data need to be transformed. (Econ data, of course!) Perform log transformation and refit the model:

df$logpc <-log(df$pc)
mod6<- lmer(logpc ~ (1|state) + emp_s , df)
summary(mod6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logpc ~ (1 | state) + emp_s
##    Data: df
## 
## REML criterion at convergence: -654.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5631 -0.5775  0.0608  0.3558  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  state    (Intercept) 0.242602 0.49255 
##  Residual             0.001631 0.04039 
## Number of obs: 270, groups:  state, 45
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  10.61416    0.07347  42.05024  144.48   <2e-16 ***
## emp_s         0.51900    0.03379 263.63464   15.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## emp_s 0.000
plot(mod6)

qqnorm(residuals(mod6))

The residuals look good. Let’s get the fitted value (for illustration only, we can’t really get the estimation for random effects). Please note our goal is to find the relationship between emp and pc. We don’t care the estimation for random effects.

Let’s plot the results for “AL”, “AZ”, “CO” and “CT”.

ranef(mod6)  # get the estimate for random intercept for each state
## $state
##    (Intercept)
## AL  0.46833133
## AR  0.07578189
## AZ  0.12224831
## CO  0.25783900
## CT -0.10480453
## DE -0.96512524
## FL  0.22237142
## GA  0.32429287
## IA  0.36209278
## ID -0.61401315
## IL  0.32042408
## IN  0.56419835
## KS  0.44398626
## KY  0.35778086
## MA -0.01395532
## MD  0.02364600
## ME -0.76090185
## MI  0.47618189
## MN  0.34011221
## MO  0.34012386
## MS  0.09665642
## MT -0.25287013
## NC  0.27592028
## ND -0.15554241
## NE  0.02582168
## NH -0.99155704
## NJ  0.17347303
## NM  0.04687365
## NV -0.33533415
## NY -0.47870086
## OH  0.31205541
## OK  0.65334073
## OR  0.14640274
## PA  0.20404309
## RI -1.24944820
## SC  0.19247111
## SD -0.74503172
## TN  0.36831347
## UT -0.32254805
## VI  0.18990220
## VT -1.43085097
## WA  0.43855528
## WI  0.24784259
## WV  0.23033496
## WY  0.11926585
## 
## with conditional variances for "state"
intercept <- 10.61416 # extract from summary(mod6)
slope <- 0.51900   # extract from summary(mod6)
fixed <- intercept + slope * df$emp_s
random <- rep(unlist(ranef(mod6)$state), each =6)
df$pred2 <- fixed + random  # prediction

sub1 <- df %>% filter(state == "AL" | state == "AZ" | state == "CO"| state == "CT" )

# parallel lines
ggplot(sub1) + geom_point(aes(x = emp_s, y = logpc, color = state, group = state)) +
 geom_line(aes(x = emp_s, y = pred2, color = state, group = state)) + ggtitle("Mixed Effects Model: Random Intercept")

The regression lines share the same slope for emp but have random intercepts. Again we should expect parallel regression lines.

Random Intercept and Random Slope: pc ~ state(random intercept) + emp(random slope)

The random intercept model is the most realistic and popular model in practice. However, the reality could be more complicated: the relationship between pc and emp may be different among the states. That is, both the intercepts and the slopes(\(\beta_1\)) may be different across states. It will lead to a random intercept and random slope model.

# random slope, random intercept
# Allow each state has its own regression line

mod7 <- lmer(logpc ~ (1 + emp_s|state) + emp_s , df)
summary(mod7)             
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logpc ~ (1 + emp_s | state) + emp_s
##    Data: df
## 
## REML criterion at convergence: -723.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8799 -0.5287  0.0825  0.4310  4.1855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  state    (Intercept) 0.132141 0.3635        
##           emp_s       0.066951 0.2587   -0.98
##  Residual             0.001296 0.0360        
## Number of obs: 270, groups:  state, 45
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 10.76483    0.05562 33.71439  193.54  < 2e-16 ***
## emp_s        0.64672    0.04192 22.63397   15.43 1.69e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## emp_s -0.857
plot(mod7,main = "Residuals vs. Fitted Plot for Mixed Effects Model") 

qqnorm(residuals(mod7), main = "QQ Plot of Residuals from Mixed Effects Model")             

The residuals plots look normal. Let’s again get the visualization for the four states using random intercept and random slope.

# coef(mod7)

df$intercept <- rep(unlist(coef(mod7)$state[1]),each = 6)

df$slope <- rep(unlist(coef(mod7)$state[2]),each = 6)

df$pred3 <- df$intercept + df$slope * df$emp_s

sub2 <- df %>% filter(state == "AL" | state == "AZ" | state == "CO"| state == "CT" )

# non -parallel lines
ggplot(sub2) + geom_point(aes(x = emp_s, y = logpc, color = state, group = state)) +
 geom_line(aes(x = emp_s, y = pred3, color = state, group = state)) + ggtitle("Mixed Effects Model: Random Intercept and Random Slope")

Now, the four regression lines have different intercepts and different slopes. They are not parallel anymore.