Question 1

In the previous lab, we used the file statedata.csv to examine the relationship between life expectancy and other socio-economic variables for the U.S. states. Using the same data set, now use stepwise regression (step()) to obtain the best model to estimate life expectancy. You will first need to make a new dataframe that excludes the first column of statedata (statedata2 <- statedata[,-1]).

stateData_full <- read.csv("statedata.csv")

stateData <- stateData_full[ ,-1]
head(stateData)
##   Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## 1       3615   3624        2.1    69.05   15.1    41.3    20  50708
## 2        365   6315        1.5    69.31   11.3    66.7   152 566432
## 3       2212   4530        1.8    70.55    7.8    58.1    15 113417
## 4       2110   3378        1.9    70.66   10.1    39.9    65  51945
## 5      21198   5114        1.1    71.71   10.3    62.6    20 156361
## 6       2541   4884        0.7    72.06    6.8    63.9   166 103766

There are several variables in this data frame, and they all appear to be continuous. I always like to get a feel for the data I'm working with first, so let's look at the distribution of each variable.

library(Hmisc)  # hist.data.frame() comes from the Hmisc package

hist.data.frame(stateData, nclass = 15, mtitl = "Distribution of variables in statedata.csv")

Build the null model (mod0) and full model (mod1)

nullModel <- lm(Life.Exp ~ 1, data = stateData)
anova(nullModel)
## Analysis of Variance Table
## 
## Response: Life.Exp
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 49 88.299   1.802
summary(nullModel)
## 
## Call:
## lm(formula = Life.Exp ~ 1, data = stateData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9186 -0.7611 -0.2036  1.0139  2.7214 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.8786     0.1898   373.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.342 on 49 degrees of freedom

The mean life expectancy is 70.88 years.
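The intercept of this intercept-only model is simply the sample mean of the response, which can be confirmed directly:

# The null model's estimate should equal the sample mean of Life.Exp (70.88 years)
mean(stateData$Life.Exp)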

fullModel <- lm(Life.Exp ~ ., data = stateData)
anova(fullModel)
## Analysis of Variance Table
## 
## Response: Life.Exp
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Population  1  0.4089  0.4089  0.7372  0.395434    
## Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
## Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
## Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
## HS.Grad     1  4.0989  4.0989  7.3895  0.009494 ** 
## Frost       1  2.0488  2.0488  3.6935  0.061426 .  
## Area        1  0.0011  0.0011  0.0020  0.964908    
## Residuals  42 23.2971  0.5547                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fullModel)
## 
## Call:
## lm(formula = Life.Exp ~ ., data = stateData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10

While the adjusted R-squared is fairly high (0.69) and the F-statistic p-value is extremely low (2.534e-10), some of the coefficient estimates are a bit suspect. For instance, what would a negative coefficient for Area mean? This suggests the predictors may be correlated with one another, so let's look at the pairwise correlations.

cor(stateData)
##             Population     Income  Illiteracy    Life.Exp     Murder
## Population  1.00000000  0.2082276  0.10762237 -0.06805195  0.3436428
## Income      0.20822756  1.0000000 -0.43707519  0.34025534 -0.2300776
## Illiteracy  0.10762237 -0.4370752  1.00000000 -0.58847793  0.7029752
## Life.Exp   -0.06805195  0.3402553 -0.58847793  1.00000000 -0.7808458
## Murder      0.34364275 -0.2300776  0.70297520 -0.78084575  1.0000000
## HS.Grad    -0.09848975  0.6199323 -0.65718861  0.58221620 -0.4879710
## Frost      -0.33215245  0.2262822 -0.67194697  0.26206801 -0.5388834
## Area        0.02254384  0.3633154  0.07726113 -0.10733194  0.2283902
##                HS.Grad      Frost        Area
## Population -0.09848975 -0.3321525  0.02254384
## Income      0.61993232  0.2262822  0.36331544
## Illiteracy -0.65718861 -0.6719470  0.07726113
## Life.Exp    0.58221620  0.2620680 -0.10733194
## Murder     -0.48797102 -0.5388834  0.22839021
## HS.Grad     1.00000000  0.3667797  0.33354187
## Frost       0.36677970  1.0000000  0.05922910
## Area        0.33354187  0.0592291  1.00000000

Looking at the correlations between the variables, HS.Grad, Illiteracy, and Murder appear to be fairly strongly correlated with one another.

library(car)  # vif() comes from the car package
vif(fullModel)
## Population     Income Illiteracy     Murder    HS.Grad      Frost       Area 
##   1.499915   1.992680   4.403151   2.616472   3.134887   2.358206   1.789764

While those variables do have higher variance inflation factors (VIF), none of them exceed the common threshold of 5.
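As a quick check of what vif() is reporting, the VIF for a single predictor can be computed by hand as \(1/(1 - R^2)\) from the auxiliary regression of that predictor on all the others (a sketch; for Illiteracy it should reproduce the value of about 4.40 shown above):

# Regress Illiteracy on the remaining predictors and convert its R^2 to a VIF
aux <- lm(Illiteracy ~ Population + Income + Murder + HS.Grad + Frost + Area,
          data = stateData)
1 / (1 - summary(aux)$r.squared)  # should match vif(fullModel)["Illiteracy"]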

anova(fullModel, nullModel)
## Analysis of Variance Table
## 
## Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost + Area
## Model 2: Life.Exp ~ 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     42 23.297                                  
## 2     49 88.299 -7   -65.002 16.741 2.534e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic for comparing the two models is 16.74, with a p-value of 2.534e-10, which is extremely low: there is a very small probability that the improvement of the full model over the null model is due to chance alone. Therefore we can proceed with the full model (or at least some of its variables) rather than the null model.
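That F-statistic can also be reproduced by hand from the two residual sums of squares in the table above:

# F = ((RSS_null - RSS_full) / 7) / (RSS_full / 42)
rss_null <- 88.299  # residual SS of the null model (49 df)
rss_full <- 23.297  # residual SS of the full model (42 df)
((rss_null - rss_full) / 7) / (rss_full / 42)  # approximately 16.74, as reported by anova()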

Use the step() function to perform automatic stepwise model building. Report the R code you used

step(nullModel, scope=formula(fullModel))
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = stateData)
## 
## Coefficients:
## (Intercept)       Murder      HS.Grad        Frost   Population  
##   7.103e+01   -3.001e-01    4.658e-02   -5.943e-03    5.014e-05

Describe the final model obtained, together with its AIC score

The final model is Life.Exp ~ Murder + HS.Grad + Frost + Population, with an AIC score of -28.16.
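As a sanity check, the same score can be obtained with extractAIC(), which is the AIC variant that step() reports (it differs from AIC() by an additive constant, so only differences between models matter):

# Should reproduce the -28.16 reported at the final step of step()
extractAIC(lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = stateData))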

From the summary() function, report the goodness-of-fit (\(F\)-statistic and associated \(p\)-value) and the \(R^2\)

bestModel <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = stateData)
summary(bestModel)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = stateData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

The goodness-of-fit is an F-statistic of 31.37 on 4 and 45 degrees of freedom, with a p-value of 1.696e-12. The multiple \(R^2\) is 0.736 (adjusted \(R^2\) = 0.713).
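The \(R^2\) can also be recovered from the sums of squares reported earlier: the final model's residual sum of squares divided by the total sum of squares (the residual sum of squares of the null model) gives the unexplained fraction of variance.

# R^2 = 1 - RSS(bestModel) / TSS, with TSS taken from the null model
1 - 23.308 / 88.299  # approximately 0.736, matching summary(bestModel)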

Using the final model you obtain, make a prediction for the life expectancy in Utah in 2009, given an increase in population size to approximately 2,785,000 in 2009, increase in high school graduation (known) to 75% and a change in murder rate to 1.3/100,000. To do this you will need to make a new dataframe. This can be done easily by selecting the appropriate row from statedata (newstate <- statedata[44,]), then adjusting the values directly (newstate[2] <- 2785). Give the new expectancy plus 95% confidence interval

library(stats)

# get UT values and adjust for 2009 data
utah <- stateData[44, ]
utah$Population <- 2785
utah$HS.Grad <- 75
utah$Murder <- 1.3

# apply new UT data to the model
predict(bestModel, utah, interval = 'confidence', level = 0.95)
##         fit      lwr      upr
## 44 73.45601 72.83971 74.07231

The model predicts that, with the 2009 values, life expectancy in Utah would be 73.46 years, with a 95% confidence interval of (72.84, 74.07).
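Note that interval = "confidence" gives an interval for the mean life expectancy of states with these covariate values; an interval for a single new observation would use interval = "prediction" and would be wider (a sketch, output not shown):

# Prediction interval for an individual observation rather than the mean response
predict(bestModel, utah, interval = "prediction", level = 0.95)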

Do the same for California. 2009 population = 36,962,000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000 (figures are approximate)

#make california data
ca09 <- stateData[5, ]
ca09$Population <- 36962
ca09$HS.Grad <- 68.3
ca09$Murder <- 5.3

predict(bestModel, ca09, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 5 74.35232 72.64674 76.05789

The model predicts that life expectancy in California in 2009 would be 74.35 years with a 95% confidence interval of (72.65, 76.06).
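The California interval is noticeably wider than Utah's, partly because the 2009 population value (36,962 thousand) lies far outside the range of the 1977 data the model was fitted to, so this prediction is an extrapolation. A quick check of the observed range:

# The fitted data's population values (in thousands) top out well below 36962
range(stateData$Population)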


Question 2

The file normtemp.csv contains measurements of body temperature from 130 healthy human subjects, as well as their weight and sex. Use this data to model body temperatures as a function of both weight and sex. You will need to convert the sex variable into a factor in order for R to recognize this as a dummy variable (bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))). You should also center the weight variable by subtracting the mean.

tempData <- read.csv("../Lab02/normtemp.csv")

# drop index column, make sex a factor
tempData <- tempData[ ,-1]
tempData$sex <- factor(tempData$sex, labels = c("male", "female"))

# center weight variable
tempData$centerWeight <- tempData$weight - mean(tempData$weight)

hist(tempData$weight, main = "", xlab = "Subject weight (kg)")

hist(tempData$centerWeight, main = "Centered weight distribution", 
     xlab = paste("Subject weight (kg): mean =", round(mean(tempData$weight), 2)))

Start by testing the correlation between body temperature and weight using Pearson’s correlation coefficient

cor(tempData$temp, tempData$centerWeight)
## [1] 0.2536564

The Pearson correlation between body temperature and (centered) weight is fairly low (r = 0.254); centering the weight does not change the correlation.
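Since the question asks to test the correlation, cor.test() can be used as well: it returns the same estimate of r together with a t-statistic and p-value (a sketch):

# Pearson correlation test between body temperature and (centered) weight
cor.test(tempData$temp, tempData$centerWeight, method = "pearson")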

plot(y = tempData$temp, x = tempData$centerWeight,
     ylab = "Body Temperature (F)",
     xlab = paste("Subject weight (kg): mean =", round(mean(tempData$weight), 2)))

There does not appear to be a strong relationship between the two variables, consistent with the low correlation.

Build a model including both weight and sex, and give the code you used

model1 <- lm(temp ~ centerWeight + sex, data = tempData)
summary(model1)
## 
## Call:
## lm(formula = temp ~ centerWeight + sex, data = tempData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  98.114528   0.087102 1126.428  < 2e-16 ***
## centerWeight  0.025267   0.008762    2.884  0.00462 ** 
## sexfemale     0.269406   0.123277    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

Report the goodness-of-fit (\(F\)-statistic and associated \(p\)-value) and the \(R^2\)

The model has an F-statistic of 6.92 on 2 and 127 degrees of freedom, with a corresponding p-value of 0.0014. The multiple \(R^2\) is 0.098 (adjusted \(R^2\) = 0.084), which is very low: weight and sex explain only about 10% of the variation in body temperature.

The model should provide you with three coefficients. Give a very brief interpretation of what each of these mean (see the lecture notes for an example)

The intercept (98.11 \(^\circ\)F) is the predicted body temperature for a male of average weight (centered weight = 0). The centerWeight coefficient (0.025) is the common slope for both sexes: each additional kilogram above the mean weight is associated with a 0.025 \(^\circ\)F increase in predicted temperature. The sexfemale coefficient (0.269) is the shift in intercept for females: at the same weight, a female is predicted to be about 0.27 \(^\circ\)F warmer than a male.
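Because the model has no interaction term, it fits two parallel lines that differ only in intercept. This can be illustrated by predicting the temperature of a male and a female at the mean weight; the predictions should differ by the sexfemale coefficient (a small sketch):

# Predictions at centered weight 0: about 98.11 F for males and 98.38 F for females
newSubjects <- data.frame(centerWeight = c(0, 0),
                          sex = factor(c("male", "female"), levels = levels(tempData$sex)))
predict(model1, newSubjects)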

Build a subset model using only weight and then use the anova() function to test whether or not there is a significant improvement in the full model over the subset. Give the \(F\)-statistic, the associated \(p\)-value and state whether or not you believe the full model to be better

model0 <- lm(temp ~ centerWeight, data = tempData)

anova(model0, model1)
## Analysis of Variance Table
## 
## Model 1: temp ~ centerWeight
## Model 2: temp ~ centerWeight + sex
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    128 64.883                             
## 2    127 62.532  1    2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the full model (with sex) is better than using centered weight alone: the residual sum of squares decreases from 64.88 to 62.53. The F-statistic for the comparison is 4.78, with a p-value of 0.031, which is low enough to be significant, meaning there is a low probability that the improvement is due to chance.
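As with the earlier model comparison, the partial F-statistic can be reproduced by hand from the two residual sums of squares in the table:

# F = ((RSS_reduced - RSS_full) / 1) / (RSS_full / 127)
(64.883 - 62.532) / (62.532 / 127)  # approximately 4.78, matching anova()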