Exercise 1

In the previous lab, we used the file statedata.csv to examine the relationship between life expectancy and other socio-economic variables for the U.S. states. Using the same data set, now use stepwise regression (step()) to obtain the best model to estimate life expectancy. You will first need to make a new dataframe that excludes the first column of statedata (statedata2 <- statedata[,-1]).

setwd("N:/Projects/geog6000/lab04")
statedata = read.csv("../datafiles/statedata.csv")
statedata2 = statedata[,-1]

Build the null model (mod0) and full model (mod1)

## Null model:
mod0 = lm(Life.Exp ~ 1, data = statedata2)  

## Full model:
mod1 = lm(Life.Exp ~ ., data = statedata2)

Use the step() function to perform automatic stepwise model building. Report the R code you used

stepmod = step(mod0, scope = formula(mod1))
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528
stepmod
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Coefficients:
## (Intercept)       Murder      HS.Grad        Frost   Population  
##   7.103e+01   -3.001e-01    4.658e-02   -5.943e-03    5.014e-05

The final model obtained describes life expectancy as a function of murder, high school graduation rate, frost levels, and population. The AIC score for this model is -28.16

From the summary() function, report the goodness-of-fit (F-statistic and associated p-value) and the R-squared

summary(stepmod)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

The F-statistic for this model is 31.37. The p-value for the model is 1.696e-12. The adjusted R-squared for this model is 0.7126

Using the final model you obtain, make a prediction for the life expectancy in Utah in 2009, given an increase in population size to approximately 2 785 000 in 2009, increase in high school graduation (known) to 75% and a change in murder rate to 1.3/100 000. To do this you will need to make a new dataframe. This can be done easily by selecting the appropriate row from statedata (newstate <- statedata[44,]), then adjusting the values directly (newstate[2] <- 2785). Give the new expectancy plus 95% confidence interval

utahversion2 = statedata[44,]
utahversion2[2] = 2785
utahversion2[7] = 75
utahversion2[6] = 1.3
predict(stepmod, utahversion2, interval = 'confidence')
##         fit      lwr      upr
## 44 73.45601 72.83971 74.07231

Do the same for California. 2009 population = 36 962 000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000 (figures are approximate)

caliversion2 = statedata[44,]
caliversion2[2] = 36962
caliversion2[7] = 68.3
caliversion2[6] = 5.3
predict(stepmod, caliversion2, interval = 'confidence')
##         fit      lwr      upr
## 44 73.65695 71.92649 75.38741

Exercise 2

normtemp = read.csv("../datafiles/normtemp.csv")
normtemp$fsex = factor(normtemp$sex, 
                       levels = c(1, 2),
                       labels =  c("male", "female"))
normtemp$weightctr = (normtemp$weight - mean(normtemp$weight))

Start by testing the correlation between body temperature and weight using Pearson’s correlation coefficient

cor(normtemp$temp, normtemp$weightctr)
## [1] 0.2536564

Build a model including both weight and sex, and give the code you used

weightmod1 = lm(temp ~ weightctr + fsex, data = normtemp)

Report the goodness-of-fit (F-statistic and associated p-value) and the R-squared

summary(weightmod1)
## 
## Call:
## lm(formula = temp ~ weightctr + fsex, data = normtemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.114528   0.087102 1126.428  < 2e-16 ***
## weightctr    0.025267   0.008762    2.884  0.00462 ** 
## fsexfemale   0.269406   0.123277    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

The F-statistic for this model is 6.919 The p-value is .001406 The adjusted R-squared is 0.08405

Coefficients:

coef(weightmod1)
## (Intercept)   weightctr  fsexfemale 
## 98.11452772  0.02526674  0.26940610

The intercept is the predicted body temp of a male at average weight (98.1). The weightctr coefficient says each time weight increases by 1, temperature increases by .025. The fsexfemale coefficient says if the person is female, the predicted weight of a female at average weight will be 98.1 + 0.27 = 98.37 degrees.