Exercise 1

Instructions: In the previous lab, we used the file statedata.csv to examine the relationship between life expectancy and other socio-economic variables for the U.S. states. Using the same data set, now use stepwise regression (step()) to obtain the best model to estimate life expectancy. You will first need to make a new dataframe that excludes the first column of statedata (statedata2 <- statedata[,-1]).

1.a. Build the null model (mod0) and full model (mod1)
1.b. Use the step() function to perform automatic stepwise model building. Report the R code you used

#Setup
state <- read.csv("../datafiles/statedata.csv")
str(state)
## 'data.frame':    50 obs. of  9 variables:
##  $ State     : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 8 9 10 ...
##  $ Population: int  3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
##  $ Income    : int  3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : int  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : int  50708 566432 113417 51945 156361 103766 4862 1982 54090 58073 ...
state_2 <- state[,-1]  # drop the State name column


#Models
mod_null <- lm(Life.Exp ~ 1, data = state_2)  # null model: intercept only
mod_full <- lm(Life.Exp ~ ., data = state_2)  # full model: all predictors

step(mod_null, scope = formula(mod_full))  # stepwise selection (direction = "both" by default)
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = state_2)
## 
## Coefficients:
## (Intercept)       Murder      HS.Grad        Frost   Population  
##   7.103e+01   -3.001e-01    4.658e-02   -5.943e-03    5.014e-05
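
A note on the call above: step()'s default direction is "both", which is why the tables include "-" (drop) rows as well as "+" (add) rows. The AIC values it prints come from extractAIC(), which omits additive constants, so they differ from AIC() on the same model even though the ranking of models is identical. A quick sketch of both points (output not shown):

# restrict the search to pure forward selection, if preferred
# step(mod_null, scope = formula(mod_full), direction = "forward")

# step() reports extractAIC(); AIC() includes the constants it drops
extractAIC(mod_full)
AIC(mod_full)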

1.c. Describe the final model obtained, together with its AIC value and its R2.

With a scope supplied, step() searched in both directions (adding and dropping terms) starting from the null model, and the model with the lowest AIC includes murder rate, high school graduation rate, average number of frost days, and population, with murder rate accounting for by far the largest reduction in the residual sum of squares. The AIC for this model is -28.16 and, as shown in the summary below, the R-squared is 0.736.
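
Rather than refitting the selected model by hand, the result of step() can also be captured directly; setting trace = 0 suppresses the step-by-step printout (a sketch, equivalent to the refit below):

mod_final <- step(mod_null, scope = formula(mod_full), trace = 0)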

mod_final <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = state_2)

summary(mod_final)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = state_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12
plot(mod_final, which = 1, col = c("blue"))


Because this model was selected automatically, I wanted to explore the residual plot. The residuals dip and then rise again across the range of fitted values, and the smoothed line follows the same pattern, which could indicate mild non-linearity or that the automated selection has slightly overfit the data.
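
To follow up on this, the remaining diagnostic plots can be drawn in a single panel (a quick sketch; output not shown):

par(mfrow = c(2, 2))   # 2 x 2 grid of diagnostic plots
plot(mod_final)        # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting layout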

1.d. Using the final model you obtained, make a prediction for the life expectancy in Utah in 2009, given an increase in population to approximately 2 785 000, an increase in high school graduation to 75%, and a change in murder rate to 1.3/100 000. To do this you will need to make a new dataframe. This can be done easily by selecting the appropriate row from statedata (newstate <- statedata[44,]), then adjusting the values directly (newstate[2] <- 2785). Give the new life expectancy plus a 95% confidence interval.
1.e. Do the same for California: 2009 population = 36 962 000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000 (figures are approximate).

utah <- state_2[44,]  # row 44 is Utah
utah[1] <- 2785       # Population (in thousands)
utah[6] <- 75         # HS.Grad (%)
utah[5] <- 1.3        # Murder (per 100 000)

predict(mod_final, utah, interval = 'confidence')
##         fit      lwr      upr
## 44 73.45601 72.83971 74.07231
cali <- state[which(state$State == "CA"),]  # select California by state code
cali <- cali[,-1]                           # drop the State column to match state_2
cali[1] <- 36962                            # Population (in thousands)
cali[6] <- 68.3                             # HS.Grad (%)
cali[5] <- 5.3                              # Murder (per 100 000)

predict(mod_final, cali, interval = 'confidence')
##        fit      lwr      upr
## 5 74.35232 72.64674 76.05789

The estimated life expectancy for Utah is 73.5 (95% CI [72.8, 74.1]).
The estimated life expectancy for California is 74.4 (95% CI [72.6, 76.1]).
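
These intervals describe the mean life expectancy at the given covariate values. An interval for a single new observation would be wider, because it also accounts for residual variation; if that were wanted, interval = 'prediction' could be used instead (a sketch; output not shown):

predict(mod_final, utah, interval = 'prediction')
predict(mod_final, cali, interval = 'prediction')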

Exercise 2

Instructions: The file normtemp.csv contains measurements of body temperature from 130 healthy human subjects, as well as their weight and sex. Use this data to model body temperatures as a function of both weight and sex. You will need to convert the sex variable into a factor in order for R to recognize this as a dummy variable (bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))). You should also center the weight variable by subtracting the mean.

normtemp <- read.csv("../datafiles/normtemp.csv")

#Cleaning
library(psych)  # assumed source of describe(), based on the output format below

normtemp$sex <- factor(normtemp$sex, labels = c("male", "female"))
describe(normtemp$weight) #in kg
##    vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 130 73.76 7.06     74    73.9 7.41  57  89    32 -0.17    -0.53 0.62
normtemp$weight <- normtemp$weight - mean(normtemp$weight)  # center on the mean
normtemp$weight <- normtemp$weight/10                       # rescale to units of 10 kg
normtemp <- normtemp[,-1]                                   # drop the first column (not used in the model)

2.a. Start by testing the correlation between body temperature and weight using Pearson’s correlation coefficient.
2.b. Build a model including both weight and sex, and give the code you used.
2.c. Report the goodness-of-fit (F-statistic and associated p-value) and the R2.
2.d. The model should provide you with three coefficients. Give a very brief interpretation of what each of these means (see the lecture notes for an example).

#Correlation
cor(normtemp$weight, normtemp$temp) #pearson is the default method
## [1] 0.2536564
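
Since 2.a asks for a test of the correlation, cor.test() is worth adding here: it returns the same Pearson coefficient along with a t-statistic, p-value, and confidence interval (a sketch; output not shown):

cor.test(normtemp$weight, normtemp$temp, method = "pearson")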
#Model
summary(mod <- lm(temp ~ weight + sex, data = normtemp))
## 
## Call:
## lm(formula = temp ~ weight + sex, data = normtemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.11453    0.08710 1126.428  < 2e-16 ***
## weight       0.25267    0.08762    2.884  0.00462 ** 
## sexfemale    0.26941    0.12328    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406
#Plot
library(ggplot2)

# fitted lines implied by the model: eq1 for males (reference level), eq2 for females
eq1 <- function(x) { coef(mod)[2]*x + coef(mod)[1] }
eq2 <- function(x) { coef(mod)[2]*x + coef(mod)[1] + coef(mod)[3] }

t <- ggplot(normtemp, aes(y = temp, x = weight, color = sex)) +
  geom_point(alpha = 0.6, size = 3) +
  scale_color_manual(values = c("brown1", "deepskyblue3")) +
  stat_function(fun = eq1, geom = "line", color = "brown1") +
  stat_function(fun = eq2, geom = "line", color = "deepskyblue3") +
  theme_minimal() +
  labs(y = "Temperature (F)", x = "Weight (centered, 10-kg units)") +
  ggtitle("Model of Body Temperature by Weight and Sex")
t


The residuals look relatively even. The F-statistic is 6.919 (p-value = 0.001) with an R-squared of about 0.10. The coefficient estimates are: 98.11 (intercept, p < 2e-16), 0.25 (weight, p < 0.01), and 0.27 (female sex, p < 0.05). This means that a male of average weight has an expected body temperature of 98.11 °F; each 10-kg increase in weight is associated with a 0.25 °F increase in expected body temperature; and, at a given weight, females have an expected body temperature 0.27 °F higher than males.
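
As a quick check of this interpretation, the fitted temperatures for an average-weight male and an average-weight female (weight = 0 on the centered scale) can be computed directly from the model; their difference should equal the sexfemale coefficient of about 0.27 (a sketch; output not shown):

predict(mod, newdata = data.frame(weight = 0,
                                  sex = factor(c("male", "female"), levels = levels(normtemp$sex))))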

2.e. Build a subset model using only weight and then use the anova() function to test whether or not there is a significant improvement in the full model over the subset. Give the F-statistic, the associated p-value and state whether or not you believe the full model to be better.

mod_sub <- lm(temp ~ weight, data = normtemp)

anova(mod_sub, mod)
## Analysis of Variance Table
## 
## Model 1: temp ~ weight
## Model 2: temp ~ weight + sex
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    128 64.883                             
## 2    127 62.532  1    2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
models <- list(mod_sub, mod)  # collect the two models for the AICc comparison below

With an F-statistic of 4.78 and a p-value of 0.03, we reject the null hypothesis that adding sex provides no improvement over the weight-only model (i.e., that the sex coefficient is zero). At the 0.05 alpha level, I would conclude that the full model is better. However, because I did not set an alpha level before conducting this analysis, I would think critically about this decision.
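
As a sanity check, the same F-statistic can be reproduced by hand from the residual sums of squares in the anova() table:

# F = ((RSS_sub - RSS_full) / (df_sub - df_full)) / (RSS_full / df_full)
((64.883 - 62.532) / (128 - 127)) / (62.532 / 127)   # roughly 4.78, matching anova()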

To feel more comfortable with this choice, I also compared the models on AICc (see below). The difference is fairly small (Delta AICc = 2.67), but the full model has the lower value and carries most of the AICc weight (0.79), so I would feel comfortable concluding that sex improves the model despite the penalty for including another parameter.

library(AICcmodavg)  # for aictab()

mod_names <- c("Subset Model", "Full Model")
aictab(cand.set = models, modnames = mod_names)
## 
## Model selection based on AICc:
## 
##              K   AICc Delta_AICc AICcWt Cum.Wt      LL
## Full Model   4 282.10       0.00   0.79   0.79 -136.89
## Subset Model 3 284.77       2.67   0.21   1.00 -139.29
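
The same comparison can also be made with base R's AIC(), which reports the uncorrected AIC rather than the small-sample AICc used by aictab() (a sketch; output not shown):

AIC(mod_sub, mod)   # the lower value indicates the preferred model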