GEOG 6000

Exercise 1

Read in State Data CSV and set variable.

statedata = read.csv("../datafiles/statedata.csv")

Make a new dataframe that excludes the first column of statedata

statedata2 = statedata[,-1]

Build the null model (mod0) and full model (mod1) Full model:

mod.1 <- lm(Life.Exp ~ ., data = statedata2)

Null model:

mod.0 <- lm(Life.Exp ~ 1, data = statedata2)

Perform automatic stepwise model building

step(mod.0, scope = formula(mod.1))

## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528

## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Coefficients:
## (Intercept)       Murder      HS.Grad        Frost   Population  
##   7.103e+01   -3.001e-01    4.658e-02   -5.943e-03    5.014e-05

Describe the final model obtained, together with its AIC score: The final model obtained shows that the most simple model for life expectancy without impacting performance includes the murder rate, high school graduation rate, frost, and population variables. The final AIC score is -28.16 compared to the first AIC score of 30.44, which includes all variables.

From the summary() function, report the goodness-of-fit (F-statistic and associated p-value) and the R2

model.step = (step(mod.0, scope = formula(mod.1)))

## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528

summary(model.step)

## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

The F-statistic is 31.37. The associated p-value is 1.696e-12. The adjusted R2 is 0.7126

Make a prediction for the life expectancy in Utah in 2009. Given an increase in population size to approximately 2,785,000 in 2009 and an increase in high school graduation (known) to 75%. A change in murder rate to 1.3/100 000.

Make a new dataframe and select the appropriate row (44) from statedata

newstate <- statedata[44,]

Adjust numbers

(newstate[2] <- 2785) #adjust the population

## [1] 2785

(newstate[7] = 75) #high school graduation

## [1] 75

(newstate[6] = 1.3) #murder rate

## [1] 1.3

summary(newstate)

##     State             Population       Income       Illiteracy     Life.Exp   
##  Length:1           Min.   :2785   Min.   :4022   Min.   :0.6   Min.   :72.9  
##  Class :character   1st Qu.:2785   1st Qu.:4022   1st Qu.:0.6   1st Qu.:72.9  
##  Mode  :character   Median :2785   Median :4022   Median :0.6   Median :72.9  
##                     Mean   :2785   Mean   :4022   Mean   :0.6   Mean   :72.9  
##                     3rd Qu.:2785   3rd Qu.:4022   3rd Qu.:0.6   3rd Qu.:72.9  
##                     Max.   :2785   Max.   :4022   Max.   :0.6   Max.   :72.9  
##      Murder       HS.Grad       Frost          Area      
##  Min.   :1.3   Min.   :75   Min.   :137   Min.   :82096  
##  1st Qu.:1.3   1st Qu.:75   1st Qu.:137   1st Qu.:82096  
##  Median :1.3   Median :75   Median :137   Median :82096  
##  Mean   :1.3   Mean   :75   Mean   :137   Mean   :82096  
##  3rd Qu.:1.3   3rd Qu.:75   3rd Qu.:137   3rd Qu.:82096  
##  Max.   :1.3   Max.   :75   Max.   :137   Max.   :82096

Make Prediction for Utah

prediction1 = predict(model.step, int='prediction', newdata = newstate)

Summary of prediction

summary(prediction1)

##       fit             lwr             upr       
##  Min.   :73.46   Min.   :71.88   Min.   :75.03  
##  1st Qu.:73.46   1st Qu.:71.88   1st Qu.:75.03  
##  Median :73.46   Median :71.88   Median :75.03  
##  Mean   :73.46   Mean   :71.88   Mean   :75.03  
##  3rd Qu.:73.46   3rd Qu.:71.88   3rd Qu.:75.03  
##  Max.   :73.46   Max.   :71.88   Max.   :75.03

Print prediction

prediction1

##         fit     lwr      upr
## 44 73.45601 71.8809 75.03112

The new life expectancy for Utah is 73.45601 years with a 95% confidence interval of 71.8809 to 75.03112.

California. 2009 population = 36 962 000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000

Make prediction for California Subset data to California

newstate2 = statedata[5,]

Adjust numbers

(newstate2[2] <- 36962) #adjust the population

## [1] 36962

(newstate2[7] = 68.3) #high school graduation

## [1] 68.3

(newstate2[6] = 5.3) #murder rate

## [1] 5.3

Run prediction

prediction2 = predict(model.step, int='prediction', newdata = newstate2)

Print prediction

prediction2

##        fit      lwr      upr
## 5 74.35232 72.11398 76.59065

The new life expectancy for California is 74.35232 years with a 95% confidence interval of 72.11398 to 76.59065.

Exercise 2

Read csv

bodytemp = read.csv("/Users/jessieeastburn/Documents/Fall 2021/GEOG 6000/datafiles/normtemp.csv")

Factor bodytemp by sex

bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))

Center the weight variable by subtracting the mean.

bodytemp$weight = (bodytemp$weight - mean(bodytemp$weight))

Summary

summary(bodytemp$weight)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -16.7615  -4.7615   0.2385   0.0000   5.2385  15.2385

Correlation

cor.test(bodytemp$temp, bodytemp$weight)

## 
##  Pearson's product-moment correlation
## 
## data:  bodytemp$temp and bodytemp$weight
## t = 2.9668, df = 128, p-value = 0.003591
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08519113 0.40802170
## sample estimates:
##       cor 
## 0.2536564

The correlation between body temperature and weight is 0.2536564

Build a model including both weight and sex

bodytempmodel1= lm(bodytemp$temp ~ bodytemp$weight + bodytemp$sex, data=bodytemp)

Summarize

summary(bodytempmodel1)

## 
## Call:
## lm(formula = bodytemp$temp ~ bodytemp$weight + bodytemp$sex, 
##     data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        98.114528   0.087102 1126.428  < 2e-16 ***
## bodytemp$weight     0.025267   0.008762    2.884  0.00462 ** 
## bodytemp$sexfemale  0.269406   0.123277    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

The F statistic is 6.919, the P Value is 0.001406, and the Adjusted R^2 is 0.08405.

Get the coefficients

coef(bodytempmodel1)

##        (Intercept)    bodytemp$weight bodytemp$sexfemale 
##        98.11452772         0.02526674         0.26940610

The coefficients are intercept = 98.11452772, bodytempweight = 0.02526674 and bodytempfemale= 0.26940610. The intercept coefficient is the average male temperature for the average weight. When there is an increase in 1 unit of weight there is an increase in 0.02526674 temperature to the average male. If it is a female their temperature is the intercept + 0.26940610 + the increase in 0.02526674 temperature per unit of weight.

Subset the model to include only weight

bodytempmodel2 = lm(bodytemp$temp ~ bodytemp$weight, data = bodytemp)

Summary

summary(bodytempmodel2)

## 
## Call:
## lm(formula = bodytemp$temp ~ bodytemp$weight, data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85017 -0.39999  0.01033  0.43915  2.46549 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     98.249231   0.062444 1573.402  < 2e-16 ***
## bodytemp$weight  0.026335   0.008876    2.967  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.712 on 128 degrees of freedom
## Multiple R-squared:  0.06434,    Adjusted R-squared:  0.05703 
## F-statistic: 8.802 on 1 and 128 DF,  p-value: 0.003591

F-statistic: 8.802, p-value: 0.003591, Adjusted R-squared: 0.05703

Run anova to compare models

anova(bodytempmodel1, bodytempmodel2)

## Analysis of Variance Table
## 
## Model 1: bodytemp$temp ~ bodytemp$weight + bodytemp$sex
## Model 2: bodytemp$temp ~ bodytemp$weight
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    127 62.532                             
## 2    128 64.883 -1   -2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic is 4.7758. The p value is 0.0307.

The p-values for the subset model and the full model show that both the full model and the subset model are statistically significant. The full model has a higher statistical significance of 0.001406 than the subset model p value of 0.003591. The RSS of the ANOVA shows that the full model has a slightly lower value than the subset model, so adding in the sex variable to the full model improves the model performance, but not by much. Based on the RSS and P value, I believe the full model to be better.

GEOG 6000_Lab4

Jessie Eastburn

9/19/2021