In the previous lab, we used the file statedata.csv to examine
the relationship between life expectancy and other socio-economic
variables for the U.S. states. Using the same data set, now use stepwise
regression (step()) to obtain the best model to estimate
life expectancy. You will first need to make a new dataframe that
excludes the first column of statedata
(statedata2 <- statedata[,-1]).
stateData_full <- read.csv("statedata.csv")
stateData <- stateData_full[ ,-1]
head(stateData)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 1 3615 3624 2.1 69.05 15.1 41.3 20 50708
## 2 365 6315 1.5 69.31 11.3 66.7 152 566432
## 3 2212 4530 1.8 70.55 7.8 58.1 15 113417
## 4 2110 3378 1.9 70.66 10.1 39.9 65 51945
## 5 21198 5114 1.1 71.71 10.3 62.6 20 156361
## 6 2541 4884 0.7 72.06 6.8 63.9 166 103766
There are several variables in this data frame, they all look like continuous data. I always like to get a feel for the data I’m working with first. Lets see the distributions of each variable.
hist.data.frame(stateData, nclass = 15, title = "Distribution of variables in statedata.csv")
Build the null model (mod0) and full model (mod1)
nullModel <- lm(Life.Exp ~ 1, data = stateData)
anova(nullModel)
## Analysis of Variance Table
##
## Response: Life.Exp
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 49 88.299 1.802
summary(nullModel)
##
## Call:
## lm(formula = Life.Exp ~ 1, data = stateData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9186 -0.7611 -0.2036 1.0139 2.7214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.8786 0.1898 373.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.342 on 49 degrees of freedom
The mean life expectancy is 70.88 years.
fullModel <- lm(Life.Exp ~ ., data = stateData)
anova(fullModel)
## Analysis of Variance Table
##
## Response: Life.Exp
## Df Sum Sq Mean Sq F value Pr(>F)
## Population 1 0.4089 0.4089 0.7372 0.395434
## Income 1 11.5946 11.5946 20.9028 4.218e-05 ***
## Illiteracy 1 19.4207 19.4207 35.0116 5.228e-07 ***
## Murder 1 27.4288 27.4288 49.4486 1.308e-08 ***
## HS.Grad 1 4.0989 4.0989 7.3895 0.009494 **
## Frost 1 2.0488 2.0488 3.6935 0.061426 .
## Area 1 0.0011 0.0011 0.0020 0.964908
## Residuals 42 23.2971 0.5547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fullModel)
##
## Call:
## lm(formula = Life.Exp ~ ., data = stateData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48895 -0.51232 -0.02747 0.57002 1.49447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
## Population 5.180e-05 2.919e-05 1.775 0.0832 .
## Income -2.180e-05 2.444e-04 -0.089 0.9293
## Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
## Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
## HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
## Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
## Area -7.383e-08 1.668e-06 -0.044 0.9649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
While the adjusted R-squared is fairly hight (0.69) and the F-statistic p-value is extremely low (2.534e-10), the coefficients for some of the variables are a bit suspect. For instance, what does negative area mean?
cor(stateData)
## Population Income Illiteracy Life.Exp Murder
## Population 1.00000000 0.2082276 0.10762237 -0.06805195 0.3436428
## Income 0.20822756 1.0000000 -0.43707519 0.34025534 -0.2300776
## Illiteracy 0.10762237 -0.4370752 1.00000000 -0.58847793 0.7029752
## Life.Exp -0.06805195 0.3402553 -0.58847793 1.00000000 -0.7808458
## Murder 0.34364275 -0.2300776 0.70297520 -0.78084575 1.0000000
## HS.Grad -0.09848975 0.6199323 -0.65718861 0.58221620 -0.4879710
## Frost -0.33215245 0.2262822 -0.67194697 0.26206801 -0.5388834
## Area 0.02254384 0.3633154 0.07726113 -0.10733194 0.2283902
## HS.Grad Frost Area
## Population -0.09848975 -0.3321525 0.02254384
## Income 0.61993232 0.2262822 0.36331544
## Illiteracy -0.65718861 -0.6719470 0.07726113
## Life.Exp 0.58221620 0.2620680 -0.10733194
## Murder -0.48797102 -0.5388834 0.22839021
## HS.Grad 1.00000000 0.3667797 0.33354187
## Frost 0.36677970 1.0000000 0.05922910
## Area 0.33354187 0.0592291 1.00000000
Looking at the correlation between variables it looks like HS.Grad, Illiteracy, and Murder have a high correlation between themselves.
vif(fullModel)
## Population Income Illiteracy Murder HS.Grad Frost Area
## 1.499915 1.992680 4.403151 2.616472 3.134887 2.358206 1.789764
While those variables do have a higher variance inflation factor (VIF), none of them are above a 5 threshold.
anova(fullModel, nullModel)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
## Model 2: Life.Exp ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 23.297
## 2 49 88.299 -7 -65.002 16.741 2.534e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic p-value is 2.534e-10, which is extremely low. This means that there is a low probability that the difference between the two models is due to chance. Therefore we can proceed with the full model (or at least some of the variables from the full model) instead of the null model.
Use the step() function to perform automatic
stepwise model building. Report the R code you used
step(nullModel, scope=formula(fullModel))
## Start: AIC=30.44
## Life.Exp ~ 1
##
## Df Sum of Sq RSS AIC
## + Murder 1 53.838 34.461 -14.609
## + Illiteracy 1 30.578 57.721 11.179
## + HS.Grad 1 29.931 58.368 11.737
## + Income 1 10.223 78.076 26.283
## + Frost 1 6.064 82.235 28.878
## <none> 88.299 30.435
## + Area 1 1.017 87.282 31.856
## + Population 1 0.409 87.890 32.203
##
## Step: AIC=-14.61
## Life.Exp ~ Murder
##
## Df Sum of Sq RSS AIC
## + HS.Grad 1 4.691 29.770 -19.925
## + Population 1 4.016 30.445 -18.805
## + Frost 1 3.135 31.327 -17.378
## + Income 1 2.405 32.057 -16.226
## <none> 34.461 -14.609
## + Area 1 0.470 33.992 -13.295
## + Illiteracy 1 0.273 34.188 -13.007
## - Murder 1 53.838 88.299 30.435
##
## Step: AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
##
## Df Sum of Sq RSS AIC
## + Frost 1 4.3987 25.372 -25.920
## + Population 1 3.3405 26.430 -23.877
## <none> 29.770 -19.925
## + Illiteracy 1 0.4419 29.328 -18.673
## + Area 1 0.2775 29.493 -18.394
## + Income 1 0.1022 29.668 -18.097
## - HS.Grad 1 4.6910 34.461 -14.609
## - Murder 1 28.5974 58.368 11.737
##
## Step: AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## + Population 1 2.064 23.308 -28.161
## <none> 25.372 -25.920
## + Income 1 0.182 25.189 -24.280
## + Illiteracy 1 0.172 25.200 -24.259
## + Area 1 0.026 25.346 -23.970
## - Frost 1 4.399 29.770 -19.925
## - HS.Grad 1 5.955 31.327 -17.378
## - Murder 1 32.756 58.128 13.531
##
## Step: AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## + Income 1 0.006 23.302 -26.174
## + Illiteracy 1 0.004 23.304 -26.170
## + Area 1 0.001 23.307 -26.163
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
##
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population,
## data = stateData)
##
## Coefficients:
## (Intercept) Murder HS.Grad Frost Population
## 7.103e+01 -3.001e-01 4.658e-02 -5.943e-03 5.014e-05
Describe the final model obtained, together with its AIC score
The final model is: Life.Exp ~ Murder + HS.Grad + Frost + Population. With an AIC score of -28.16.
From the summary() function, report the
goodness-of-fit (\(F\)-statistic and
associated \(p\)-value) and the \(R^2\)
bestModel <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = stateData)
summary(bestModel)
##
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population,
## data = stateData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47095 -0.53464 -0.03701 0.57621 1.50683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
## Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
## HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
## Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
## Population 5.014e-05 2.512e-05 1.996 0.05201 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
The goodness-of-fit is an F-statistic of 31.37 on 4 & 45 degrees of freedom, with a resulting p-value of 1.69e-12.
Using the final model you obtain, make a prediction for the life
expectancy in Utah in 2009, given an increase in population size to
approximately 2,785,000 in 2009, increase in high school graduation
(known) to 75% and a change in murder rate to 1.3/100,000. To do this
you will need to make a new dataframe. This can be done easily by
selecting the appropriate row from statedata
(newstate <- statedata[44,]), then adjusting the values
directly (newstate[2] <- 2785). Give the new expectancy
plus 95% confidence interval
library(stats)
# get UT values and adjust for 2009 data
utah <- stateData[44, ]
utah$Population <- 2785
utah$HS.Grad <- 75
utah$Murder <- 1.3
# apply new UT data to the model
predict(bestModel, utah, interval = 'confidence', level = 0.95)
## fit lwr upr
## 44 73.45601 72.83971 74.07231
The model predicts for the 2009 data that in Utah, the life expectancy would be 73.46 years with a 95% confidence interval of (72.84, 74.07)
Do the same for California. 2009 population = 36,962,000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000 (figures are approximate)
#make california data
ca09 <- stateData[5, ]
ca09$Population <- 36962
ca09$HS.Grad <- 68.3
ca09$Murder <- 5.3
predict(bestModel, ca09, interval = "confidence", level = 0.95)
## fit lwr upr
## 5 74.35232 72.64674 76.05789
The model predicts that life expectancy in California in 2009 would be 74.35 years with a 95% confidence interval of (72.65, 76.06).
The file normtemp.csv contains measurements of body temperature
from 130 healthy human subjects, as well as their weight and sex. Use
this data to model body temperatures as a function of both weight and
sex. You will need to convert the sex variable into a
factor in order for R to recognize this as a dummy variable
(bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))).
You should also center the weight variable by subtracting the
mean.
tempData <- read.csv("../Lab02/normtemp.csv")
# drop index column, make sex a factor
tempData <- tempData[ ,-1]
tempData$sex <- factor(tempData$sex, labels = c("male", "female"))
# center weight variable
tempData$centerWeight <- tempData$weight - mean(tempData$weight)
hist(tempData$weight, main = "", xlab = "Subject weight (kg)")
hist(tempData$centerWeight, main = "Centered weight distribution",
xlab = paste("Subject weight (kg): mean =", round(mean(tempData$weight), 2)))
Start by testing the correlation between body temperature and weight using Pearson’s correlation coefficient
cor(tempData$temp, tempData$centerWeight)
## [1] 0.2536564
The correlation between body temperature and the centered weight is
fairly low (R^2=0.254).
plot(y = tempData$temp, x = tempData$centerWeight,
ylab = "Body Temperature (F)",
xlab = paste("Subject weight (kg): mean =", round(mean(tempData$weight), 2)))
There doesn’t look to be a strong relationship between the two variables.
Build a model including both weight and sex, and give the code you used
model1 <- lm(temp ~ centerWeight + sex, data = tempData)
summary(model1)
##
## Call:
## lm(formula = temp ~ centerWeight + sex, data = tempData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86363 -0.45624 0.01841 0.47366 2.33424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.114528 0.087102 1126.428 < 2e-16 ***
## centerWeight 0.025267 0.008762 2.884 0.00462 **
## sexfemale 0.269406 0.123277 2.185 0.03070 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared: 0.09825, Adjusted R-squared: 0.08405
## F-statistic: 6.919 on 2 and 127 DF, p-value: 0.001406
Report the goodness-of-fit (\(F\)-statistic and associated \(p\)-value) and the \(R^2\)
The model has an F-statistic of 6.9 (2 and 127 df) with a corresponding p-value of 0.0014. The adjusted \(R^2\) value is 0.08, which is really low.
The model should provide you with three coefficients. Give a very brief interpretation of what each of these mean (see the lecture notes for an example)
The intercept coefficient is 98.11 which is the mean temperature for the entire data set. The centerWeight coefficient means that for males, the relationship between temperature and weight has a slope of 0.025 \(\circ\)F/kg. The sexfemale coefficient means that for females, that relationship has a higher slope of 0.269 \(\circ\)F/kg.
Build a subset model using only weight and then use the
anova() function to test whether or not there is a
significant improvement in the full model over the subset. Give the
\(F\)-statistic, the associated \(p\)-value and state whether or not you
believe the full model to be better
model0 <- lm(temp ~ centerWeight, data = tempData)
anova(model0, model1)
## Analysis of Variance Table
##
## Model 1: temp ~ centerWeight
## Model 2: temp ~ centerWeight + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 128 64.883
## 2 127 62.532 1 2.3515 4.7758 0.0307 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the full model is better than only using the centered weight data because there is a decrease of error (RSS is lower). The F-statistic is 4.77 resulting in a p-value of 0.03, which is low enough to be significant, meaning that there is a low probability that the difference between the models is due to chance.