Lab04 Modeling II

Part 1: Modeling Life Expectancy and Socio-economic Variables

Set and read data

statedata <- read.csv("statedata.csv")
statedata2 <- statedata[,-1]

Null model

mod.0 <-lm(Life.Exp ~ 1, data = statedata2)

Full model

mod.1<-lm(Life.Exp ~., data =statedata2)

Perform automatic stepwise model building

step(mod.0, scope = formula(mod.1))

## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528

## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Coefficients:
## (Intercept)       Murder      HS.Grad        Frost   Population  
##   7.103e+01   -3.001e-01    4.658e-02   -5.943e-03    5.014e-05

The stepwise function steps forward through the variables and reduces the AIC at each step. The AIC is -28.16 compared to the original AIC of 30.44. The final model consists of Murder, HS.GRad, Frost and Population. These results indicate that life expectancy increases as high school graduates and population increases and decreases when there’s a change in murder and frost.

Report goodness-of-fit

mod<-lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
    data = statedata2)
summary(mod)

## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

Using the summary() function, the goodness-of-fit reported an F-statistic of 31.37,p-value of 1.69e-12, and r2 value of 0.71256

Make a prediction for Utah in 2009

newstate <- statedata[44,]
newstate[2] <- 2785
newstate[7] <- 75
newstate[6] <- 1.3

predict(mod, level = 0.95, interval = "conf", newstate)

##         fit      lwr      upr
## 44 73.45601 72.83971 74.07231

Predicted life expectancy in Utah, with the above conditions is 73 years, with a 95% confidence level of 72.6 to 73.5.

Make a prediction for California

newstateca <- statedata[5,]
newstateca[2] <- 36962
newstateca[7] <- 68.3
newstateca[6] <- 5.3
predict(mod, level = 0.95, interval = "conf", newstateca)

##        fit      lwr      upr
## 5 74.35232 72.64674 76.05789

Predicted life expectancy in California, with the above conditions is 74 years, with a 95% confidence level of 71.5 to 74.6

Part 2: Modeling Body Temperature as a Function of Weight and Sex

Convert sex variable and center the data

bodytemp<- read.csv("normtemp.csv")
bodytemp$sex <- factor(bodytemp$sex, labels = c("male", "female"))
bodytemp$weight2 = bodytemp$weight - mean(bodytemp$weight)

Test correlaation between body temp and weight

cor(bodytemp$temp,bodytemp$weight2)

## [1] 0.2536564

cor.test(bodytemp$temp,bodytemp$weight)

## 
##  Pearson's product-moment correlation
## 
## data:  bodytemp$temp and bodytemp$weight
## t = 2.9668, df = 128, p-value = 0.003591
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08519113 0.40802170
## sample estimates:
##       cor 
## 0.2536564

Correlation is 0.25 and a p-value of 0.003591. From the Pearson’s correleation test we can reject the null hypothesis that the correlation is equal to 0.

Build model for body temperature as a function of weight and sex

bodytemp.lm3 <- lm(bodytemp$temp ~ bodytemp$weight2 + bodytemp$sex, data =bodytemp)
summary(bodytemp.lm3)

## 
## Call:
## lm(formula = bodytemp$temp ~ bodytemp$weight2 + bodytemp$sex, 
##     data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        98.114528   0.087102 1126.428  < 2e-16 ***
## bodytemp$weight2    0.025267   0.008762    2.884  0.00462 ** 
## bodytemp$sexfemale  0.269406   0.123277    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

Report goodness-of-fit and R^2

F-statistic of 6.919, R2 value of 0.09 and p-value of 0.001406 means this model describes 9% of the variation in the data set and that both variables are statistically significantly related to body temperature.

The three coeffecienets from the model are Intercept (98.114528), Weight2(0.025) and Sex Female (0.269). The sex coeffiencent means that if the person is a female, their body temperature is 0.269 higher than a male of the same weight. If the sex is female, the weight will be adjusted up by 0.269.

Subset model using weight

bodytemp.lm4 = lm(bodytemp$temp~weight2, data= bodytemp)
summary(bodytemp.lm4)

## 
## Call:
## lm(formula = bodytemp$temp ~ weight2, data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85017 -0.39999  0.01033  0.43915  2.46549 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.249231   0.062444 1573.402  < 2e-16 ***
## weight2      0.026335   0.008876    2.967  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.712 on 128 degrees of freedom
## Multiple R-squared:  0.06434,    Adjusted R-squared:  0.05703 
## F-statistic: 8.802 on 1 and 128 DF,  p-value: 0.003591

ANOVA test

anova(bodytemp.lm3,bodytemp.lm4)

## Analysis of Variance Table
## 
## Model 1: bodytemp$temp ~ bodytemp$weight2 + bodytemp$sex
## Model 2: bodytemp$temp ~ weight2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    127 62.532                             
## 2    128 64.883 -1   -2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic is 4.7758, and the p-value is 0.0307. The subset model is not significantly better. Even though the p-value is still signifanct, the R squared value is lower. Adding the sex variable does not drastically change the model performance.