Question 1

A)

Manufacturer A has machines m1, m2, m3 and manufacturer B has machines m4, m5, m6. Right away the boxplots show that manufacturer A's machines appear, on average, to take more time. Machines m2 and m5 show the greatest spread between their low and high speed settings, and in general there is a clear difference between the low and high settings.

## 'data.frame':    24 obs. of  4 variables:
##  $ manufact: Factor w/ 2 levels "A","B": 1 1 2 2 1 1 2 2 1 1 ...
##  $ machine : Factor w/ 6 levels "m1","m2","m3",..: 1 1 4 4 1 1 4 4 2 2 ...
##  $ speed   : Factor w/ 2 levels "H","L": 2 1 2 1 2 1 2 1 2 1 ...
##  $ time    : int  211 278 205 247 230 278 217 251 184 249 ...
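
The plotting chunks themselves are not echoed above; a minimal sketch of how plots like the ones described could be drawn with ggplot2 (assuming the data are faraway::lawn):

library(ggplot2)
library(faraway)   # assumed source of the lawn data (manufact, machine, speed, time)

# time by manufacturer and speed
ggplot(lawn, aes(x = manufact, y = time, fill = speed)) +
  geom_boxplot()

# time for each machine at each speed, split by manufacturer
ggplot(lawn, aes(x = machine, y = time, colour = speed)) +
  geom_point(size = 2) +
  facet_wrap(~ manufact, scales = "free_x")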

B)

Explain why not all effects can be estimated?

An NA coefficient in a regression indicates that the variable in question is an exact linear combination of the other predictors. In our case the machinem6 indicator equals manufactB - machinem4 - machinem5, so there is no unique least-squares solution unless one of the redundant terms is dropped; R drops machinem6 and reports NA for its coefficient.

Put another way, once we know the manufacturer B effect and the effects of machines m4 and m5, the effect of machine m6 is already determined, since the three machines' means make up the manufacturer B mean. A quick check of this aliasing is sketched after the summary output below.

## 
## Call:
## lm(formula = time ~ speed + manufact + machine, data = lawn)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.50  -8.50  -3.00   7.50  19.25 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  278.750      6.042  46.138  < 2e-16 ***
## speedL       -59.000      4.567 -12.919 3.23e-10 ***
## manufactB    -26.750      7.910  -3.382  0.00355 ** 
## machinem2    -26.000      7.910  -3.287  0.00435 ** 
## machinem3     -0.750      7.910  -0.095  0.92557    
## machinem4      7.500      7.910   0.948  0.35635    
## machinem5    -15.500      7.910  -1.959  0.06667 .  
## machinem6         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.19 on 17 degrees of freedom
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.8986 
## F-statistic: 34.97 on 6 and 17 DF,  p-value: 1.183e-08
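
One way to confirm which coefficient is aliased is to inspect the fixed-effects design matrix directly; a small sketch (the names g and X are mine):

g <- lm(time ~ speed + manufact + machine, data = lawn)
alias(g)$Complete   # should show machinem6 = manufactB - machinem4 - machinem5

# equivalently, the design matrix has 8 columns but only rank 7
X <- model.matrix(~ speed + manufact + machine, data = lawn)
c(ncol(X), qr(X)$rank)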

C)

The fitted intercept of 270.50 is the mean time for a manufacturer A machine at high speed, and the machine-to-machine SD is 12.05, so about 68% of machine intercepts lie within 270.50 ± 12.05. The same reasoning applies to the residual SD of 11.50: if the same machine is tested again at the same speed, only this random variability affects the time, roughly ±11.50.

The 12.05 measures the variability that the machine random effect contributes to the model, which is somewhat larger than the residual SD of 11.50 (our random variability). This is reflected in the intraclass correlation, computed from the variances as 145.2/(145.2 + 132.3) ≈ 0.52, which is just above one half, as expected when the between-machine variation is a little larger than the random error.

If different machines were sampled from the same manufacturer and run at the same speed, both the machine variation and the random variability contribute, so the total SD for time is sqrt(145.2 + 132.3) ≈ 16.66, as in the code below.

mixed_model1 = lmer(time~manufact*speed + (1|machine),data=lawn)
summary(mixed_model1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: time ~ manufact * speed + (1 | machine)
##    Data: lawn
## 
## REML criterion at convergence: 168.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.0909 -0.6740 -0.1291  0.6661  1.5405 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  machine  (Intercept) 145.2    12.05   
##  Residual             132.3    11.50   
## Number of obs: 24, groups:  machine, 6
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       270.500      8.394  32.225
## manufactB         -21.833     11.871  -1.839
## speedL            -60.333      6.641  -9.085
## manufactB:speedL    2.667      9.391   0.284
## 
## Correlation of Fixed Effects:
##             (Intr) mnfctB speedL
## manufactB   -0.707              
## speedL      -0.396  0.280       
## mnfctB:spdL  0.280 -0.396 -0.707
# intraclass correlation (use the variances, not the SDs)
# p = 145.2/(145.2+132.3)    # ~ 0.52
# total SD for a new machine at a given speed
# tts = sqrt(145.2+132.3)    # ~ 16.66
## between machines: predicted machine-level intercepts
# fixef(mixed_model1)["(Intercept)"] + ranef(mixed_model1)$machine
# ranef(mixed_model1)$machine

D)

We find no significant interaction term (p = 0.78), so we can test each of the main effects, starting with speed: speed is clearly significant (p ≈ 3e-10). Next manufact: dropping manufact from the model seems reasonable, since its p-value of 0.1332 is large.

mixed_model2 = lmer(time~manufact+speed + (1|machine),data=lawn, REML=FALSE)
KRmodcomp(mixed_model1, mixed_model2)
## F-test with Kenward-Roger approximation; computing time: 0.09 sec.
## large : time ~ manufact * speed + (1 | machine)
## small : time ~ manufact + speed + (1 | machine)
##          stat     ndf     ddf F.scaling p.value
## Ftest  0.0806  1.0000 16.0000         1  0.7801
mixed_model3 = lmer(time~manufact + (1|machine),data=lawn, REML=FALSE)
## singular fit
KRmodcomp(mixed_model2, mixed_model3)
## F-test with Kenward-Roger approximation; computing time: 0.04 sec.
## large : time ~ manufact + speed + (1 | machine)
## small : time ~ manufact + (1 | machine)
##         stat    ndf    ddf F.scaling   p.value    
## Ftest 166.89   1.00  17.00         1 3.229e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mixed_model4 = lmer(time~speed + (1|machine),data=lawn, REML=FALSE)
KRmodcomp(mixed_model2, mixed_model4)
## F-test with Kenward-Roger approximation; computing time: 0.03 sec.
## large : time ~ manufact + speed + (1 | machine)
## small : time ~ speed + (1 | machine)
##         stat    ndf    ddf F.scaling p.value
## Ftest 3.5354 1.0000 4.0000         1  0.1332

E)

Variation between machines is important: the exact RLRT is significant both in the model including the manufact-by-speed interaction (p = 0.0053) and in the simpler model containing only speed after manufact has been removed (p = 4e-04).

exactRLRT(mixed_model1)
## 
##  simulated finite sample distribution of RLRT.
##  
##  (p-value based on 10000 simulated values)
## 
## data:  
## RLRT = 5.867, p-value = 0.0053
exactRLRT(mixed_model4)
## Using restricted likelihood evaluated at ML estimators.
## Refit with method="REML" for exact results.
## 
##  simulated finite sample distribution of RLRT.
##  
##  (p-value based on 10000 simulated values)
## 
## data:  
## RLRT = 11.269, p-value = 4e-04

F)

All of the variance components are of similar magnitude, so the variation in time can be ascribed to variation between manufacturers, variation between machines within manufacturer, and plain measurement error. The measurement error is slightly smaller, but comparable nonetheless.

mixed_model5 = lmer(time~speed + (1|manufact/machine),data=lawn)
summary(mixed_model5)
## Linear mixed model fit by REML ['lmerMod']
## Formula: time ~ speed + (1 | manufact/machine)
##    Data: lawn
## 
## REML criterion at convergence: 183.6
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.09600 -0.66847 -0.09456  0.66675  1.58890 
## 
## Random effects:
##  Groups           Name        Variance Std.Dev.
##  machine:manufact (Intercept) 147.0    12.12   
##  manufact         (Intercept) 150.6    12.27   
##  Residual                     125.2    11.19   
## Number of obs: 24, groups:  machine:manufact, 6; manufact, 2
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  259.583     10.500   24.72
## speedL       -59.000      4.567  -12.92
## 
## Correlation of Fixed Effects:
##        (Intr)
## speedL -0.217

G)

Both bootstrap intervals have a lower bound of zero, so either variance component could plausibly be negligible. The confidence interval for the manufacturer SD is wider than the one for machines, but overall there is no strong evidence for ascribing the variability solely to one source or the other. It is safest to conclude that some variation in time comes from all sources.

VarCorr(mixed_model5)
##  Groups           Name        Std.Dev.
##  machine:manufact (Intercept) 12.124  
##  manufact         (Intercept) 12.273  
##  Residual                     11.187
as.data.frame(VarCorr(mixed_model5))
##                grp        var1 var2     vcov    sdcor
## 1 machine:manufact (Intercept) <NA> 146.9937 12.12410
## 2         manufact (Intercept) <NA> 150.6291 12.27311
## 3         Residual        <NA> <NA> 125.1528 11.18717
# Parametric bootstrap for the machine-within-manufacturer SD (.sig01):
# simulate new responses from the fitted model, refit, and store that SD.
bsd <- numeric(1000)
for (i in 1:1000) {
  y <- unlist(simulate(mixed_model5))
  bmod <- refit(mixed_model5, y)
  bsd[i] <- as.data.frame(VarCorr(bmod))$sdcor[1]
}

quantile(bsd, c(0.025, 0.975))
##     2.5%    97.5% 
##  0.00000 20.49393
confint(mixed_model5, method="boot")
##                  2.5 %    97.5 %
## .sig01        0.000000  20.95349
## .sig02        0.000000  31.56916
## .sigma        7.309984  15.13770
## (Intercept) 237.463008 281.55232
## speedL      -67.898370 -50.23432

Question 2

A)

The trajectories of the different subjects are, for the most part, gradually increasing, with a few exceptions. The variation shown in the boxplots appears greatest at the very beginning and near the middle. After about day 7 of sleep deprivation the variation is actually quite low, but a few outliers make it appear larger.

head(sleepstudy)
##   Reaction Days Subject
## 1 249.5600    0     308
## 2 258.7047    1     308
## 3 250.8006    2     308
## 4 321.4398    3     308
## 5 356.8519    4     308
## 6 414.6901    5     308
ggplot(sleepstudy, aes(x=Days, y=Reaction)) + geom_line()+facet_wrap(~ Subject)

ggplot(sleepstudy, aes(x=Days, y=Reaction, shape=Subject, color=Subject)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

# boxplot of Reaction at each day, with each subject's trajectory overlaid in blue
ggplot(sleepstudy, aes(x=factor(Days), y=Reaction)) + geom_boxplot() +
  geom_line(aes(group = Subject), colour = "blue")

B)

Fit a mixed effects model that describes how the reaction time varies linearly with days and allows for random variation in both the slope and intercepts of the subject lines. Under this model, would it be unusual for an individual to have a reaction time that does not increase over time?

We can see from the fixed effects that, on average, reaction time slows (increases) by 10.467 ms per day.

The random variation (SD) for the intercept and the slope is 24.737 and 5.923 respectively, and these have a correlation of 0.07. On top of that there is the residual random variation of 25.592.

So given a mean slope of 10.467 and a slope SD of 5.923, roughly 95% (2 SD) of subject slopes lie within 10.467 ± 11.846, an interval that does include negative values. A subject's slope would need to fall about 10.467/5.923 ≈ 1.77 SDs below the mean, which happens roughly 4% of the time, so a reaction time that does not increase would be fairly unusual but not impossible. A quick check of this probability is sketched after the model summary below.

mixed_model6 = lmer(Reaction~Days + (Days|Subject),data=sleepstudy)
summary(mixed_model6)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4633  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 611.90   24.737       
##           Days         35.08    5.923   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.824  36.843
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
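
As a rough check of the probability quoted above (assuming normally distributed random slopes), it can be computed from the fitted model; slope and slope_sd are names introduced here:

slope    <- fixef(mixed_model6)["Days"]                             # 10.467
slope_sd <- attr(VarCorr(mixed_model6)$Subject, "stddev")["Days"]   # 5.923
pnorm(0, mean = slope, sd = slope_sd)   # chance of a negative per-day slope, roughly 0.04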

C)

Allow for quadratic effects in the previous model. Does the data support the inclusion of quadratic effects?

Testing the more complex model with the quadratic term against the previous model, we find the quadratic term is not significant (Kenward-Roger p = 0.20), so the data do not support including it.

mixed_model7 = lmer(Reaction~Days + I(Days^2) + (Days|Subject),data=sleepstudy)
summary(mixed_model7)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + I(Days^2) + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1742.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.0093 -0.4489  0.0422  0.5036  5.2702 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 613.12   24.761       
##           Days         35.11    5.925   0.06
##  Residual             651.97   25.534       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 255.4494     7.5135  33.999
## Days          7.4341     2.8189   2.637
## I(Days^2)     0.3370     0.2619   1.287
## 
## Correlation of Fixed Effects:
##           (Intr) Days  
## Days      -0.418       
## I(Days^2)  0.418 -0.836
## should we rely on the summary t value, or on the KR comparison below?
KRmodcomp(mixed_model7,mixed_model6)
## F-test with Kenward-Roger approximation; computing time: 0.10 sec.
## large : Reaction ~ Days + I(Days^2) + (Days | Subject)
## small : Reaction ~ Days + (Days | Subject)
##           stat      ndf      ddf F.scaling p.value
## Ftest   1.6558   1.0000 143.0000         1  0.2003
##confint(mixed_model6, method="boot")

D)

Using the model from part (b), make the following diagnostic plots and interpret: i. Residuals vs Fitted plot; ii. Normal QQ plot of the residuals; iii. Normal QQ plot of both random effects; iv. A scatterplot of the random effects.

Residuals vs fitted: good coverage on both sides of the zero line supports the homogeneous-variance assumption. There is a very slight funnel shape as the fitted values increase, but this could simply be due to a few outliers.

Normal QQ plot: the residuals look reasonably normal in the middle but have heavy tails.

Normal QQ plots of both random effects: the random effects also look fairly reasonably distributed, with perhaps slightly heavy tails.

Scatterplot of the random effects: there does not appear to be any discernible pattern between the random intercepts and the random slopes.

**Note:** when the random-effect estimates were used to plot per-subject regression lines, they showed a trend similar to the data.

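The chunk that produced these diagnostics is not echoed; a minimal sketch of how the four plots could be drawn from mixed_model6 with base graphics:

# i. residuals vs fitted
plot(fitted(mixed_model6), resid(mixed_model6), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)

# ii. normal QQ plot of the residuals
qqnorm(resid(mixed_model6)); qqline(resid(mixed_model6))

# iii. normal QQ plots of both random effects
re <- ranef(mixed_model6)$Subject
qqnorm(re[, "(Intercept)"]); qqline(re[, "(Intercept)"])
qqnorm(re[, "Days"]); qqline(re[, "Days"])

# iv. scatterplot of the random effects
plot(re[, "(Intercept)"], re[, "Days"], xlab = "Random intercept", ylab = "Random slope")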

E)

Identify any outlying cases and mark these on top of your initial plot. Try refitting the model without these cases and identify the largest change in the model fit.

Subject 332 and Subject 308 are our possible outlier points.

The largest changes in the fit are the reduction in the residual SD from 25.592 to 19.036, while at the same time the subject intercept SD increased from 24.737 to 28.466. Everything else stayed fairly similar, with the exception of a slight decrease in the slope (10.467 to 9.817).

It is interesting that with the outliers removed we get a clearer picture of where the variation in the data lies: the unexplained residual variation drops, while more of it is attributed to differences between subjects.
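
The refit shown below comes from a chunk that is not echoed; one way it could be reproduced, assuming all observations for subjects 308 and 332 are dropped (the model name mixed_model6_no_out is mine):

sleepstudy.No.Outliers <- subset(sleepstudy, !(Subject %in% c("308", "332")))
table(sleepstudy.No.Outliers$Subject)   # dropped subjects keep their factor levels, with count 0

mixed_model6_no_out <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy.No.Outliers)
summary(mixed_model6_no_out)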

## 
## 308 309 310 330 331 332 333 334 335 337 349 350 351 352 369 370 371 372 
##   0  10  10  10  10   0  10  10  10  10  10  10  10  10  10  10  10  10
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy.No.Outliers
## 
## REML criterion at convergence: 1473.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.79311 -0.56007 -0.00319  0.55782  2.25091 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Subject  (Intercept) 810.30   28.466        
##           Days         35.34    5.945   -0.02
##  Residual             362.36   19.036        
## Number of obs: 160, groups:  Subject, 16
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.053      7.646   32.83
## Days           9.817      1.576    6.23
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.123

F)

Simulate the response under your first model and plot it. Does the simulated data look like the actual data?

The red lines on the first graph below show the simulated data. They follow the original data fairly well, with a few that miss the mark completely. Since simulate() draws new random values for the random effects and residuals, comparing trajectories subject by subject like this does not mean much, but I like this graph and find it interesting.

The second graph gives a better overview of all the simulated data, and the general pattern is the same. It fails to capture some of the smaller details, though: every time I simulate the data it looks slightly different, and one thing I noticed in the original was that after day 7 or so the variation was quite small for most subjects, whereas in the simulations it appears fairly evenly spread.

Reaction_simulated <- unlist(simulate(mixed_model6))

ggplot(sleepstudy, aes(x=Days, y=Reaction)) + geom_line()+facet_wrap(~ Subject)+
    geom_line(aes(y = Reaction_simulated), color = "red")

# per-day boxplots of the simulated responses, with subject trajectories overlaid in blue
ggplot(sleepstudy, aes(x=factor(Days), y=Reaction_simulated)) + geom_boxplot() +
  geom_line(aes(group = Subject), colour = "blue")