Exercise 1

Set working directory

setwd("/Users/annapeterson/Desktop/Classes/GEOG6000/Lab04")

Read CSV

state_data = read.csv("/Users/annapeterson/Desktop/Classes/GEOG6000/Lab04/statedata.csv")
state_data2 = state_data[,-1]

Create null (mod0) and full (mod1) model

mod_0 = lm(Life.Exp ~ 1, data = state_data2)
mod_1 = lm(Life.Exp ~ ., data = state_data2)

Stepwise regression

st = step(mod_0, scope = formula(mod_1))
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528

We can see in the last step that the AIC is -28.16 for the final model of life expectancy. The final model uses murder, high school graduation, frost levels, and population as a determinant of life expectancy.

We can now create a summary of our step model

summary(st)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = state_data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

As you can see in the summary above, the p-value is 1.69x10-12, the r-squared is 0.7126, and the f-statistic is 31.37.

Creating a new data frame for life expectancy in Utah in 2009:

new_state = state_data[44,]

Increase population size (column 2):

new_state[2] = 2785

Increase high school graduation (column 7):

new_state[7] = 75

Change murder rate (column 6):

new_state[6] = 1.3

Now we can use these new values to create our new expectancy utilizing confidence intervals

predict(st, new_state, interval = "confidence")
##         fit      lwr      upr
## 44 73.45601 72.83971 74.07231

Using predict, the new life expectancy in Utah in 2009 is in the range of 72.8 to 74.1.

Creating a new dataframe for life expectancy in CA in 2009. It also uses the same columns as the previous example in UT.

ca_state = state_data[44,]
ca_state[2] = 36962
ca_state[7] = 68.3
ca_state[6] = 5.3

Then using predict:

predict(st, ca_state, interval = "confidence")
##         fit      lwr      upr
## 44 73.65695 71.92649 75.38741

For this prediction in CA, we see the range is 71.9 to 75.4.

Exercise 2

Read csv

norm_temp = read.csv("/Users/annapeterson/Desktop/Classes/GEOG6000/Lab04/normtemp.csv")

Code male and female subjects

norm_temp$sex = factor(norm_temp$sex,
                       levels = c(1, 2),
                       labels = c("male", "female"))

Centering weight

norm_temp$weight = (norm_temp$weight - mean(norm_temp$weight))/10

Test the correlation between temp and weight

cor(norm_temp$temp, norm_temp$weight)
## [1] 0.2536564

Building a linear model and reporting a summary of model

w_mod = lm(temp ~ weight + sex, data = norm_temp)
summary(w_mod)
## 
## Call:
## lm(formula = temp ~ weight + sex, data = norm_temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.11453    0.08710 1126.428  < 2e-16 ***
## weight       0.25267    0.08762    2.884  0.00462 ** 
## sexfemale    0.26941    0.12328    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

The goodness-of-fit, F-statistic and p-value, are 6.919 and 0.001406, respectively we can find the coefficients for the model from coef()

coef(w_mod)
## (Intercept)      weight   sexfemale 
##  98.1145277   0.2526674   0.2694061

What this means is that the intercept is equal to the male body temp at an average weight which is 98.11. For the weight column this describes the temperature change as weight increases. Assuming these are reported in kg,then this means for every 1 kg increase there will be a temperature increase of 0.25. The last column is the coefficient for female body temperature. This is the difference in temperature change from the average male body temperature which is 98.38 (98.11 + .27).

Building a subset mode:

sub_mod = lm(temp ~ weight, data = norm_temp)

Run anova off of subset and first mod

anova(sub_mod, w_mod)
## Analysis of Variance Table
## 
## Model 1: temp ~ weight
## Model 2: temp ~ weight + sex
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    128 64.883                             
## 2    127 62.532  1    2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic and p-value are 4.78 and 0.0307, respectively. Since the p-value is lower that 0.05, we reject the null. The subset model is not better than the than the full model.