ANLY 505 - Simulated Data and Linear Mixed Models

The statistical model:

\(y_t = \beta_0 + \beta_1 * (Elevation_s)_t + \beta_2 * Slope_t + (b_s)_t + \epsilon_t\)

Where:

\(\beta_0\) is the mean response when both Elevation and Slope are 0
\(\beta_1\) is the change in mean response for a 1-unit change in elevation. Elevation is measured at the stand level, so all plots in a stand share a single value in elevation.
\(\beta_2\) is the change in mean response for a 1-unit change in slope. Slope is measured at the plot level, so every plot potentially has a unique value of slope.

Let’s define the parameters:

the intercept \((\beta_0)\) will be -1
the coefficient for elevation \((\beta_1)\) will be set to 0.005
the coefficient for slope \((\beta_2)\) will be set to 0.1

nstand = 5
nplot = 4
b0 = -1
b1 = .005
b2 = .1
sds = 2
sd = 1

Simulate other variables:

set.seed(16)
stand = rep(LETTERS[1:nstand], each = nplot)
standeff = rep( rnorm(nstand, 0, sds), each = nplot)
ploteff = rnorm(nstand*nplot, 0, sd)

Simulate elevation and slope:

elevation = rep( runif(nstand, 1000, 1500), each = nplot)
slope = runif(nstand*nplot, 2, 75)

Simulate response variable:

resp2 = b0 + b1*elevation + b2*slope + standeff + ploteff 


library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(broom)
library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

Your tasks (complete each task in its’ own code chunk, make sure to use echo=TRUE so I can see your code):

Fit a linear mixed model with the response variable as a function of elevation and slope with stand as a random effect. Are the estimated parameters similar to the intial parameters as we defined them? Ans: No, the estimated parameters are very different from the initial parameters.

# use this chunk to answer question 1
fit_1 = lm(resp2 ~ stand) # fit the model
summary(fit_1)

## 
## Call:
## lm(formula = resp2 ~ stand)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0604 -1.1517  0.0334  0.9560  4.5691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.5276     1.0187  12.297 3.09e-09 ***
## standB       -2.0727     1.4407  -1.439    0.171    
## standC       -0.7438     1.4407  -0.516    0.613    
## standD       -8.1292     1.4407  -5.643 4.68e-05 ***
## standE        0.1064     1.4407   0.074    0.942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.037 on 15 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.6871 
## F-statistic: 11.43 on 4 and 15 DF,  p-value: 0.0001855

Create a function for your model and run 1000 simulations of that model.

# use this chunk to answer question 2
fit_1_fun = function(nstand = 5, b0 = -1, b1 = 0.005, b2 = .1, sigma = 2) {
   stand = rep(LETTERS[1:nstand], each = nplot)
   standeff = rep(rnorm(nstand, 0, sds), each = nplot)
  ploteff = rnorm(nstand*nplot, 0, sd)
  elevation = rep( runif(nstand, 1000, 1500), each = nplot)
  slope = runif(nstand*nplot, 2, 75)
  resp = b0 + b1*elevation + b2*slope + standeff + ploteff 
  dat = data.frame(stand, standeff, ploteff)
  fit1 = lmer(resp ~ elevation + slope + (1|stand), data = dat)
}
   
   
  
sims <- rerun(1000, fit_1_fun())

## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular

Extract the stand and residual variances from this simulation run. Print the first 6 rows of the data.

# use this chunk to answer question 3
stand_vars = sims %>%
  map_dfr(tidy, effects = "ran_pars",scales = "vcov")
 
head(stand_vars)

## # A tibble: 6 x 3
##   term                     group    estimate
##   <chr>                    <chr>       <dbl>
## 1 var_(Intercept).stand    stand       5.56 
## 2 var_Observation.Residual Residual    0.951
## 3 var_(Intercept).stand    stand       2.61 
## 4 var_Observation.Residual Residual    1.11 
## 5 var_(Intercept).stand    stand       9.73 
## 6 var_Observation.Residual Residual    1.36

Choose three different sample sizes (your choice) and run 1000 model simulations with each sample size. Create 3 visualizations that compare distributions of the variances for each of the 3 sample sizes. Make sure that the axes are labelled correctly. What do these graphs say about the relationship between sample size and variance?

Ans. Sample size chosen are 10, 25, 100. Increasing in sample size leads to variance decreases.

# use this chunk to answer question 4
stand_sims = c(10, 25, 100) %>%
   set_names() %>%
   map(~rerun(1000, fit_1_fun(nstand = .x)))

## boundary (singular) fit: see ?isSingular

stand_vars = stand_sims %>%
   modify_depth(2, ~tidy(.x, effects = "ran_pars", scales = "vcov")) %>%
   map_dfr(bind_rows, .id = "stand_num") %>%
   filter(group == "stand")
 head(stand_vars)

## # A tibble: 6 x 4
##   stand_num term                  group estimate
##   <chr>     <chr>                 <chr>    <dbl>
## 1 10        var_(Intercept).stand stand     5.60
## 2 10        var_(Intercept).stand stand     7.11
## 3 10        var_(Intercept).stand stand     1.68
## 4 10        var_(Intercept).stand stand     2.74
## 5 10        var_(Intercept).stand stand     3.16
## 6 10        var_(Intercept).stand stand     7.19

stand_vars = mutate(stand_vars, stand_num = fct_inorder(stand_num))
add_prefix = function(string) {
   paste("Number of stands:", string, sep = " ") #label plots
}

groupmed = stand_vars %>%
   group_by(stand_num) %>%
   summarise(mvar = median(estimate)) # add median of each distribution

ggplot(stand_vars, aes(x = estimate)) +
   geom_density(fill = "pink", alpha = 0.25) +
   facet_wrap(~stand_num, labeller = as_labeller(add_prefix)) +
   geom_vline(aes(xintercept = 4, linetype = "True Variance"), size = 0.5) +
   geom_vline(data = groupmed, aes(xintercept = mvar, linetype = "Median Variance"), size = 0.5) +
   theme_bw()+
   scale_linetype_manual(name = "", values = c(2,1)) +
   theme(legend.position = "bottom", legend.key.width = unit(.1, "cm")) +
   labs(x = "Estimated Variance", y = "Density")

Plot the coefficients of the estimates of elevation and slope. Hint: the x-axis should have 1000 values. Discuss the graphs.

Ans. The median estimates of coefficients for elevation and slope are 0.0047 and 0.1006, respectively, which is considered pretty close to the initial set parameters of 0.005 and 0.1. Although the estimate b1 and b2 are scattered in the plot, and some estimates of b1 are negative, considering the model predicts the two important factor pretty well, I’d say the model is a good fit.

# use this chunk to answer question 5
coef = sims %>%
  map_dfr(tidy, effects="fixed") %>%
  filter(term %in% c("elevation", "slope"))
coef

## # A tibble: 2,000 x 4
##    term      estimate std.error statistic
##    <chr>        <dbl>     <dbl>     <dbl>
##  1 elevation -0.00546   0.00927    -0.589
##  2 slope      0.0868    0.0140      6.21 
##  3 elevation  0.0152    0.00426     3.57 
##  4 slope      0.121     0.0157      7.73 
##  5 elevation  0.0156    0.0117      1.33 
##  6 slope      0.124     0.0166      7.48 
##  7 elevation  0.0130    0.00336     3.87 
##  8 slope      0.0937    0.0100      9.32 
##  9 elevation  0.0135    0.0157      0.856
## 10 slope      0.0767    0.0152      5.05 
## # … with 1,990 more rows

coef_med = coef %>%
  group_by(term) %>%
  summarise(med=median(estimate))
coef_med

## # A tibble: 2 x 2
##   term          med
##   <chr>       <dbl>
## 1 elevation 0.00470
## 2 slope     0.0999

coef_plot = ggplot(coef, aes(x=rep(1:1000, each=2), y=estimate))+
  geom_point(size=0.5)+
  facet_wrap(~term)+
  geom_hline(data=coef_med, aes(yintercept=med, color=term), size=0.8)+
  labs(x="Replicated number", y="Coefficients of the estimates")
coef_plot

6. Submit a link to this document in R Pubs to your Moodle. This assignment is worth 25 points.

ANLY 505 - Simulated Data and Linear Mixed Models

Week 2

Cheng-An Yang

2020-02-18