GLMM - STAT 413

This week we discussed and reviewed many basics of statistical model analysis.

For this week, I’ll use the lifedata we used in class to summarize my work.

lifedata <- read.csv("http://www.cknudson.com/data/LifeExp.csv")
attach(lifedata)

The first thing we reviewed was simply how to construct and evaluate a simple linear model.

mod1 <- lm(MaleLife ~ GNP + Birth + Death)
summary(mod1)

## 
## Call:
## lm(formula = MaleLife ~ GNP + Birth + Death)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2262 -1.4835 -0.1487  1.4506 12.3041 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.105e+01  1.237e+00  65.541  < 2e-16 ***
## GNP          2.081e-04  5.174e-05   4.022 0.000123 ***
## Birth       -3.840e-01  3.377e-02 -11.371  < 2e-16 ***
## Death       -8.893e-01  8.056e-02 -11.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.088 on 87 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.9026, Adjusted R-squared:  0.8993 
## F-statistic: 268.8 on 3 and 87 DF,  p-value: < 2.2e-16

Looking at the summary, we notice a few thing:

First, the model is predicting MaleLife (expectancy) using the variables GNP, Birth, and Death.

Second, the p-values for all the variables are really low and therefore significant to explaining MaleLife.

Third, the R-squared is about .9 which means the variables explain about 90% of the variability in MaleLife.

This week we also talked about the importance of transforming variables

mod2 <- lm(MaleLife ~ GNP)
summary(mod2)

## 
## Call:
## lm(formula = MaleLife ~ GNP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.999  -5.388   1.222   5.840  12.553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.694e+01  9.647e-01   59.03  < 2e-16 ***
## GNP         7.728e-04  9.758e-05    7.92 6.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.492 on 89 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.4134, Adjusted R-squared:  0.4068 
## F-statistic: 62.72 on 1 and 89 DF,  p-value: 6.345e-12

GNP is a significant variable but the model is not very good.

plot(MaleLife ~ GNP)
abline(mod2)

Judging by this graph, it is clear GNP is not entirely linearly related to MaleLife.

Perhaps a log transformation will help.

lnGNP <- log(GNP)
mod3 <- lm(MaleLife ~ lnGNP)
summary(mod3)

## 
## Call:
## lm(formula = MaleLife ~ lnGNP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5872  -3.5064   0.0095   3.7562  14.1309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.4726     2.8387   8.973 4.27e-14 ***
## lnGNP         4.7804     0.3693  12.946  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.761 on 89 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.6532, Adjusted R-squared:  0.6493 
## F-statistic: 167.6 on 1 and 89 DF,  p-value: < 2.2e-16

Definitely a better model but lets check the plot of MaleLife and lnGNP.

plot(MaleLife ~ lnGNP)
abline(mod3)

Not perfect… but a much more linear relationship which is good.

With this example it is visually clear which variable will perform better, however, there are also numeric ways to determine which model is superior

Using AIC and BIC we can accomplish this.

AIC(mod1)

## [1] 469.3429

AIC(mod2)

## [1] 628.7462

AIC(mod3)

## [1] 580.9289

We are looking for models that minimize AIC and BIC.

Despite being the most complex, the first model is still our best model in terms of AIC.

BIC(mod1)

## [1] 481.8972

BIC(mod2)

## [1] 636.2788

BIC(mod3)

## [1] 588.4615

Same conclusion can be drawn from our BIC calculation.

plot(mod1, which = c(1,2))

plot(mod2, which = c(1,2))

plot(mod3, which = c(1,2))

The plots I want to highlight for this week are the Residual vs Fitted and the Normal Q-Q plot.

For the Residual vs Fitted plot I would want to see no patterns and a random distribution of the points. I would also want the red line to be close to the dotted line (horizontal).

For the QQ plot I would want to see the points hug that dotted diagonal line the whole way.

Models 1 & 3 have pretty good Residual vs Fitted plots but all the Q-Q plots leave a little to be desired.

The last thing I want to make note of in this first report is how to use the predict command to effectively utilize the formula from the linear model.

new_GNP <- data.frame(lnGNP = log(1710))
predict(mod3, new_GNP)

##        1 
## 61.05923

GLMM - STAT 413

Week One

Zachary Fahrendorff