This week we discussed and reviewed many basics of statistical model analysis.
For this week, I’ll use the lifedata we used in class to summarize my work.
lifedata <- read.csv("http://www.cknudson.com/data/LifeExp.csv")
attach(lifedata)
The first thing we reviewed was simply how to construct and evaluate a simple linear model.
mod1 <- lm(MaleLife ~ GNP + Birth + Death)
summary(mod1)
##
## Call:
## lm(formula = MaleLife ~ GNP + Birth + Death)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2262 -1.4835 -0.1487 1.4506 12.3041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.105e+01 1.237e+00 65.541 < 2e-16 ***
## GNP 2.081e-04 5.174e-05 4.022 0.000123 ***
## Birth -3.840e-01 3.377e-02 -11.371 < 2e-16 ***
## Death -8.893e-01 8.056e-02 -11.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.088 on 87 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.8993
## F-statistic: 268.8 on 3 and 87 DF, p-value: < 2.2e-16
Looking at the summary, we notice a few thing:
First, the model is predicting MaleLife (expectancy) using the variables GNP, Birth, and Death.
Second, the p-values for all the variables are really low and therefore significant to explaining MaleLife.
Third, the R-squared is about .9 which means the variables explain about 90% of the variability in MaleLife.
This week we also talked about the importance of transforming variables
mod2 <- lm(MaleLife ~ GNP)
summary(mod2)
##
## Call:
## lm(formula = MaleLife ~ GNP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.999 -5.388 1.222 5.840 12.553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.694e+01 9.647e-01 59.03 < 2e-16 ***
## GNP 7.728e-04 9.758e-05 7.92 6.35e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.492 on 89 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4134, Adjusted R-squared: 0.4068
## F-statistic: 62.72 on 1 and 89 DF, p-value: 6.345e-12
GNP is a significant variable but the model is not very good.
plot(MaleLife ~ GNP)
abline(mod2)
Judging by this graph, it is clear GNP is not entirely linearly related to MaleLife.
Perhaps a log transformation will help.
lnGNP <- log(GNP)
mod3 <- lm(MaleLife ~ lnGNP)
summary(mod3)
##
## Call:
## lm(formula = MaleLife ~ lnGNP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5872 -3.5064 0.0095 3.7562 14.1309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.4726 2.8387 8.973 4.27e-14 ***
## lnGNP 4.7804 0.3693 12.946 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.761 on 89 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.6532, Adjusted R-squared: 0.6493
## F-statistic: 167.6 on 1 and 89 DF, p-value: < 2.2e-16
Definitely a better model but lets check the plot of MaleLife and lnGNP.
plot(MaleLife ~ lnGNP)
abline(mod3)
Not perfect… but a much more linear relationship which is good.
With this example it is visually clear which variable will perform better, however, there are also numeric ways to determine which model is superior
Using AIC and BIC we can accomplish this.
AIC(mod1)
## [1] 469.3429
AIC(mod2)
## [1] 628.7462
AIC(mod3)
## [1] 580.9289
We are looking for models that minimize AIC and BIC.
Despite being the most complex, the first model is still our best model in terms of AIC.
BIC(mod1)
## [1] 481.8972
BIC(mod2)
## [1] 636.2788
BIC(mod3)
## [1] 588.4615
Same conclusion can be drawn from our BIC calculation.
plot(mod1, which = c(1,2))
plot(mod2, which = c(1,2))
plot(mod3, which = c(1,2))
The plots I want to highlight for this week are the Residual vs Fitted and the Normal Q-Q plot.
For the Residual vs Fitted plot I would want to see no patterns and a random distribution of the points. I would also want the red line to be close to the dotted line (horizontal).
For the QQ plot I would want to see the points hug that dotted diagonal line the whole way.
Models 1 & 3 have pretty good Residual vs Fitted plots but all the Q-Q plots leave a little to be desired.
The last thing I want to make note of in this first report is how to use the predict command to effectively utilize the formula from the linear model.
new_GNP <- data.frame(lnGNP = log(1710))
predict(mod3, new_GNP)
## 1
## 61.05923