Exercise 1
Read in State Data CSV and set variable.
statedata = read.csv("../datafiles/statedata.csv")
Make a new dataframe that excludes the first column of statedata
statedata2 = statedata[,-1]
Build the null model (mod0) and full model (mod1) Full model:
mod.1 <- lm(Life.Exp ~ ., data = statedata2)
Null model:
mod.0 <- lm(Life.Exp ~ 1, data = statedata2)
Perform automatic stepwise model building
step(mod.0, scope = formula(mod.1))
## Start: AIC=30.44
## Life.Exp ~ 1
##
## Df Sum of Sq RSS AIC
## + Murder 1 53.838 34.461 -14.609
## + Illiteracy 1 30.578 57.721 11.179
## + HS.Grad 1 29.931 58.368 11.737
## + Income 1 10.223 78.076 26.283
## + Frost 1 6.064 82.235 28.878
## <none> 88.299 30.435
## + Area 1 1.017 87.282 31.856
## + Population 1 0.409 87.890 32.203
##
## Step: AIC=-14.61
## Life.Exp ~ Murder
##
## Df Sum of Sq RSS AIC
## + HS.Grad 1 4.691 29.770 -19.925
## + Population 1 4.016 30.445 -18.805
## + Frost 1 3.135 31.327 -17.378
## + Income 1 2.405 32.057 -16.226
## <none> 34.461 -14.609
## + Area 1 0.470 33.992 -13.295
## + Illiteracy 1 0.273 34.188 -13.007
## - Murder 1 53.838 88.299 30.435
##
## Step: AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
##
## Df Sum of Sq RSS AIC
## + Frost 1 4.3987 25.372 -25.920
## + Population 1 3.3405 26.430 -23.877
## <none> 29.770 -19.925
## + Illiteracy 1 0.4419 29.328 -18.673
## + Area 1 0.2775 29.493 -18.394
## + Income 1 0.1022 29.668 -18.097
## - HS.Grad 1 4.6910 34.461 -14.609
## - Murder 1 28.5974 58.368 11.737
##
## Step: AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## + Population 1 2.064 23.308 -28.161
## <none> 25.372 -25.920
## + Income 1 0.182 25.189 -24.280
## + Illiteracy 1 0.172 25.200 -24.259
## + Area 1 0.026 25.346 -23.970
## - Frost 1 4.399 29.770 -19.925
## - HS.Grad 1 5.955 31.327 -17.378
## - Murder 1 32.756 58.128 13.531
##
## Step: AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## + Income 1 0.006 23.302 -26.174
## + Illiteracy 1 0.004 23.304 -26.170
## + Area 1 0.001 23.307 -26.163
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
##
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population,
## data = statedata2)
##
## Coefficients:
## (Intercept) Murder HS.Grad Frost Population
## 7.103e+01 -3.001e-01 4.658e-02 -5.943e-03 5.014e-05
Describe the final model obtained, together with its AIC score: The final model obtained shows that the most simple model for life expectancy without impacting performance includes the murder rate, high school graduation rate, frost, and population variables. The final AIC score is -28.16 compared to the first AIC score of 30.44, which includes all variables.
From the summary() function, report the goodness-of-fit (F-statistic and associated p-value) and the R2
model.step = (step(mod.0, scope = formula(mod.1)))
## Start: AIC=30.44
## Life.Exp ~ 1
##
## Df Sum of Sq RSS AIC
## + Murder 1 53.838 34.461 -14.609
## + Illiteracy 1 30.578 57.721 11.179
## + HS.Grad 1 29.931 58.368 11.737
## + Income 1 10.223 78.076 26.283
## + Frost 1 6.064 82.235 28.878
## <none> 88.299 30.435
## + Area 1 1.017 87.282 31.856
## + Population 1 0.409 87.890 32.203
##
## Step: AIC=-14.61
## Life.Exp ~ Murder
##
## Df Sum of Sq RSS AIC
## + HS.Grad 1 4.691 29.770 -19.925
## + Population 1 4.016 30.445 -18.805
## + Frost 1 3.135 31.327 -17.378
## + Income 1 2.405 32.057 -16.226
## <none> 34.461 -14.609
## + Area 1 0.470 33.992 -13.295
## + Illiteracy 1 0.273 34.188 -13.007
## - Murder 1 53.838 88.299 30.435
##
## Step: AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
##
## Df Sum of Sq RSS AIC
## + Frost 1 4.3987 25.372 -25.920
## + Population 1 3.3405 26.430 -23.877
## <none> 29.770 -19.925
## + Illiteracy 1 0.4419 29.328 -18.673
## + Area 1 0.2775 29.493 -18.394
## + Income 1 0.1022 29.668 -18.097
## - HS.Grad 1 4.6910 34.461 -14.609
## - Murder 1 28.5974 58.368 11.737
##
## Step: AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## + Population 1 2.064 23.308 -28.161
## <none> 25.372 -25.920
## + Income 1 0.182 25.189 -24.280
## + Illiteracy 1 0.172 25.200 -24.259
## + Area 1 0.026 25.346 -23.970
## - Frost 1 4.399 29.770 -19.925
## - HS.Grad 1 5.955 31.327 -17.378
## - Murder 1 32.756 58.128 13.531
##
## Step: AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## + Income 1 0.006 23.302 -26.174
## + Illiteracy 1 0.004 23.304 -26.170
## + Area 1 0.001 23.307 -26.163
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
summary(model.step)
##
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population,
## data = statedata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47095 -0.53464 -0.03701 0.57621 1.50683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
## Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
## HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
## Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
## Population 5.014e-05 2.512e-05 1.996 0.05201 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
The F-statistic is 31.37. The associated p-value is 1.696e-12. The adjusted R2 is 0.7126
Make a prediction for the life expectancy in Utah in 2009. Given an increase in population size to approximately 2,785,000 in 2009 and an increase in high school graduation (known) to 75%. A change in murder rate to 1.3/100 000.
Make a new dataframe and select the appropriate row (44) from statedata
newstate <- statedata[44,]
Adjust numbers
(newstate[2] <- 2785) #adjust the population
## [1] 2785
(newstate[7] = 75) #high school graduation
## [1] 75
(newstate[6] = 1.3) #murder rate
## [1] 1.3
summary(newstate)
## State Population Income Illiteracy Life.Exp
## Length:1 Min. :2785 Min. :4022 Min. :0.6 Min. :72.9
## Class :character 1st Qu.:2785 1st Qu.:4022 1st Qu.:0.6 1st Qu.:72.9
## Mode :character Median :2785 Median :4022 Median :0.6 Median :72.9
## Mean :2785 Mean :4022 Mean :0.6 Mean :72.9
## 3rd Qu.:2785 3rd Qu.:4022 3rd Qu.:0.6 3rd Qu.:72.9
## Max. :2785 Max. :4022 Max. :0.6 Max. :72.9
## Murder HS.Grad Frost Area
## Min. :1.3 Min. :75 Min. :137 Min. :82096
## 1st Qu.:1.3 1st Qu.:75 1st Qu.:137 1st Qu.:82096
## Median :1.3 Median :75 Median :137 Median :82096
## Mean :1.3 Mean :75 Mean :137 Mean :82096
## 3rd Qu.:1.3 3rd Qu.:75 3rd Qu.:137 3rd Qu.:82096
## Max. :1.3 Max. :75 Max. :137 Max. :82096
Make Prediction for Utah
prediction1 = predict(model.step, int='prediction', newdata = newstate)
Summary of prediction
summary(prediction1)
## fit lwr upr
## Min. :73.46 Min. :71.88 Min. :75.03
## 1st Qu.:73.46 1st Qu.:71.88 1st Qu.:75.03
## Median :73.46 Median :71.88 Median :75.03
## Mean :73.46 Mean :71.88 Mean :75.03
## 3rd Qu.:73.46 3rd Qu.:71.88 3rd Qu.:75.03
## Max. :73.46 Max. :71.88 Max. :75.03
Print prediction
prediction1
## fit lwr upr
## 44 73.45601 71.8809 75.03112
The new life expectancy for Utah is 73.45601 years with a 95% confidence interval of 71.8809 to 75.03112.
California. 2009 population = 36 962 000; 2009 high school graduation = 68.3%; 2009 murder rate = 5.3/100 000
Make prediction for California Subset data to California
newstate2 = statedata[5,]
Adjust numbers
(newstate2[2] <- 36962) #adjust the population
## [1] 36962
(newstate2[7] = 68.3) #high school graduation
## [1] 68.3
(newstate2[6] = 5.3) #murder rate
## [1] 5.3
Run prediction
prediction2 = predict(model.step, int='prediction', newdata = newstate2)
Print prediction
prediction2
## fit lwr upr
## 5 74.35232 72.11398 76.59065
The new life expectancy for California is 74.35232 years with a 95% confidence interval of 72.11398 to 76.59065.
Exercise 2
Read csv
bodytemp = read.csv("/Users/jessieeastburn/Documents/Fall 2021/GEOG 6000/datafiles/normtemp.csv")
Factor bodytemp by sex
bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))
Center the weight variable by subtracting the mean.
bodytemp$weight = (bodytemp$weight - mean(bodytemp$weight))
Summary
summary(bodytemp$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.7615 -4.7615 0.2385 0.0000 5.2385 15.2385
Correlation
cor.test(bodytemp$temp, bodytemp$weight)
##
## Pearson's product-moment correlation
##
## data: bodytemp$temp and bodytemp$weight
## t = 2.9668, df = 128, p-value = 0.003591
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08519113 0.40802170
## sample estimates:
## cor
## 0.2536564
The correlation between body temperature and weight is 0.2536564
Build a model including both weight and sex
bodytempmodel1= lm(bodytemp$temp ~ bodytemp$weight + bodytemp$sex, data=bodytemp)
Summarize
summary(bodytempmodel1)
##
## Call:
## lm(formula = bodytemp$temp ~ bodytemp$weight + bodytemp$sex,
## data = bodytemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86363 -0.45624 0.01841 0.47366 2.33424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.114528 0.087102 1126.428 < 2e-16 ***
## bodytemp$weight 0.025267 0.008762 2.884 0.00462 **
## bodytemp$sexfemale 0.269406 0.123277 2.185 0.03070 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared: 0.09825, Adjusted R-squared: 0.08405
## F-statistic: 6.919 on 2 and 127 DF, p-value: 0.001406
The F statistic is 6.919, the P Value is 0.001406, and the Adjusted R^2 is 0.08405.
Get the coefficients
coef(bodytempmodel1)
## (Intercept) bodytemp$weight bodytemp$sexfemale
## 98.11452772 0.02526674 0.26940610
The coefficients are intercept = 98.11452772, bodytempweight = 0.02526674 and bodytempfemale= 0.26940610. The intercept coefficient is the average male temperature for the average weight. When there is an increase in 1 unit of weight there is an increase in 0.02526674 temperature to the average male. If it is a female their temperature is the intercept + 0.26940610 + the increase in 0.02526674 temperature per unit of weight.
Subset the model to include only weight
bodytempmodel2 = lm(bodytemp$temp ~ bodytemp$weight, data = bodytemp)
Summary
summary(bodytempmodel2)
##
## Call:
## lm(formula = bodytemp$temp ~ bodytemp$weight, data = bodytemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85017 -0.39999 0.01033 0.43915 2.46549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.249231 0.062444 1573.402 < 2e-16 ***
## bodytemp$weight 0.026335 0.008876 2.967 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.712 on 128 degrees of freedom
## Multiple R-squared: 0.06434, Adjusted R-squared: 0.05703
## F-statistic: 8.802 on 1 and 128 DF, p-value: 0.003591
F-statistic: 8.802, p-value: 0.003591, Adjusted R-squared: 0.05703
Run anova to compare models
anova(bodytempmodel1, bodytempmodel2)
## Analysis of Variance Table
##
## Model 1: bodytemp$temp ~ bodytemp$weight + bodytemp$sex
## Model 2: bodytemp$temp ~ bodytemp$weight
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 127 62.532
## 2 128 64.883 -1 -2.3515 4.7758 0.0307 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic is 4.7758. The p value is 0.0307.
The p-values for the subset model and the full model show that both the full model and the subset model are statistically significant. The full model has a higher statistical significance of 0.001406 than the subset model p value of 0.003591. The RSS of the ANOVA shows that the full model has a slightly lower value than the subset model, so adding in the sex variable to the full model improves the model performance, but not by much. Based on the RSS and P value, I believe the full model to be better.