GEOG 6000
Date: 9/29/2015
Sulochan Dhungel
Model to estimate Life Expectancy using socio-economic variables for US States
setwd("C:/Users/Sulochan/Copy/Fall 2015/Advance Geo/Assignment3")
statedata = read.csv("statedata.csv",header=T)
statedata2 <- statedata[,-1]
Null model: This is the model with only intercept.
mod.0 = lm(Life.Exp ~ 1, data = statedata2)
summary(mod.0)
##
## Call:
## lm(formula = Life.Exp ~ 1, data = statedata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9186 -0.7611 -0.2036 1.0139 2.7214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.8786 0.1898 373.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.342 on 49 degrees of freedom
Full Model: This is the model with all predictor variables included.
mod.1 = lm(Life.Exp ~ ., data = statedata2)
summary(mod.1)
##
## Call:
## lm(formula = Life.Exp ~ ., data = statedata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48895 -0.51232 -0.02747 0.57002 1.49447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
## Population 5.180e-05 2.919e-05 1.775 0.0832 .
## Income -2.180e-05 2.444e-04 -0.089 0.9293
## Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
## Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
## HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
## Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
## Area -7.383e-08 1.668e-06 -0.044 0.9649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
Comparing Null model with Full model: The full model significantly explains more variability than null model.
anova(mod.0,mod.1)
## Analysis of Variance Table
##
## Model 1: Life.Exp ~ 1
## Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 88.299
## 2 42 23.297 7 65.002 16.741 2.534e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are few coefficients in the full model which are not significant. So, we check if there is any multicollinearity. Pairwise plots show that there is some multicollinearity with values higher than 0.6 or 0.7.
So, we try using stepwise variable selection method with AIC score to select variables for the model, which could help remove multicollinearity.
Using r Step() function to perform stepwise model building.
The process starts with null model and adds a variable at each step, then retains the variable which results in lowest AIC score. In each step, each added variable in previous step is removed to check AIC score. The final model is the one with lowest AIC score.
mod.aic = step(mod.0, scope=formula(mod.1))
## Start: AIC=30.44
## Life.Exp ~ 1
##
## Df Sum of Sq RSS AIC
## + Murder 1 53.838 34.461 -14.609
## + Illiteracy 1 30.578 57.721 11.179
## + HS.Grad 1 29.931 58.368 11.737
## + Income 1 10.223 78.076 26.283
## + Frost 1 6.064 82.235 28.878
## <none> 88.299 30.435
## + Area 1 1.017 87.282 31.856
## + Population 1 0.409 87.890 32.203
##
## Step: AIC=-14.61
## Life.Exp ~ Murder
##
## Df Sum of Sq RSS AIC
## + HS.Grad 1 4.691 29.770 -19.925
## + Population 1 4.016 30.445 -18.805
## + Frost 1 3.135 31.327 -17.378
## + Income 1 2.405 32.057 -16.226
## <none> 34.461 -14.609
## + Area 1 0.470 33.992 -13.295
## + Illiteracy 1 0.273 34.188 -13.007
## - Murder 1 53.838 88.299 30.435
##
## Step: AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
##
## Df Sum of Sq RSS AIC
## + Frost 1 4.3987 25.372 -25.920
## + Population 1 3.3405 26.430 -23.877
## <none> 29.770 -19.925
## + Illiteracy 1 0.4419 29.328 -18.673
## + Area 1 0.2775 29.493 -18.394
## + Income 1 0.1022 29.668 -18.097
## - HS.Grad 1 4.6910 34.461 -14.609
## - Murder 1 28.5974 58.368 11.737
##
## Step: AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## + Population 1 2.064 23.308 -28.161
## <none> 25.372 -25.920
## + Income 1 0.182 25.189 -24.280
## + Illiteracy 1 0.172 25.200 -24.259
## + Area 1 0.026 25.346 -23.970
## - Frost 1 4.399 29.770 -19.925
## - HS.Grad 1 5.955 31.327 -17.378
## - Murder 1 32.756 58.128 13.531
##
## Step: AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## + Income 1 0.006 23.302 -26.174
## + Illiteracy 1 0.004 23.304 -26.170
## + Area 1 0.001 23.307 -26.163
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
Description of Final model with AIC score:
The final model includes Murder rate, HS.graduate, Frost and Population. Addition or removal of any other variables does not decrease AIC. At each step, variables were added and removed to minimize AIC. The plot of AIC for variable selection with stepwise model building process is given below
plot(1:length(mod.aic$anova$Step),mod.aic$anova$AIC,"b",axes=F, xlab="Addition of",ylab = "AIC",lwd=2, main = "AIC variable selection")
axis(2)
axis(1,at=c(1:5),labels=c("intercept" , mod.aic$anova$Step[2:5]))
box()
grid()
Checking for multicollinearity of selected model: The final selection of variables shows that the variables are not multicollinear.
Goodness of Fit
summary(mod.aic)
##
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population,
## data = statedata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47095 -0.53464 -0.03701 0.57621 1.50683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
## Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
## HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
## Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
## Population 5.014e-05 2.512e-05 1.996 0.05201 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
F-statistic value = 31.37
\(p-value = 1.696\times 10^{-12}\)
R2 = 0.7126
R-squared indicates that about 70% of variance is explained by the predictor variables. High F-value and low p-value indicates that the at least one of the variable has significant effect on the response. The p-values for almost all t-statistics for the coefficients are less than 0.05 or about 0.05. So, all of the selected predictor’s coefficients are significant.
Prediction
utID = which(statedata$State=="UT")
newutah = statedata[utID,]
newutah$Population = 2785
newutah$HS.Grad = 75.0
newutah$Murder = 1.3
newutah
## State Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 44 UT 2785 4022 0.6 72.9 1.3 75 137 82096
predUT.lexp = predict(mod.aic, interval ="prediction", newdata = newutah,level = .95)
predUT.lexp
## fit lwr upr
## 44 73.45601 71.8809 75.03112
At 95% confidence interval, life expectancy is estimated to be 73.456 and the upper and lower estimates are 71.88 and 75.03 in Utah.
califID = which(statedata$State=="CA")
newcalif = statedata[califID,]
newcalif$Population = 36962
newcalif$HS.Grad = 68.3
newcalif$Murder = 5.3
newcalif
## State Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 5 CA 36962 5114 1.1 71.71 5.3 68.3 20 156361
predCA.lexp = predict(mod.aic, interval ="prediction", newdata = newcalif,level = .95)
predCA.lexp
## fit lwr upr
## 5 74.35232 72.11398 76.59065
At 95% confidence interval, life expectancy is estimated to be 74.35 and the upper and lower estimates are 72.11 and 76.59 in California.
bodytemp1 = read.csv("normtemp.csv",header = T)
bodytemp = bodytemp1[,-1]
bodytemp$sex = factor(bodytemp$sex, labels=c("male","female"))
cor(bodytemp$weight, bodytemp$temp)
## [1] 0.2536564
The correlation between weight and temperature is about 0.25. The plot also shows very less correlation between weight and temperature within any sexes.
Center body weight
weight_cen = bodytemp$weight - mean(bodytemp$weight)
Build a model using body weight and sex
temp.lm1 = lm(temp ~ weight_cen + sex, data=bodytemp)
summary(temp.lm1)
##
## Call:
## lm(formula = temp ~ weight_cen + sex, data = bodytemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86363 -0.45624 0.01841 0.47366 2.33424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.114528 0.087102 1126.428 < 2e-16 ***
## weight_cen 0.025267 0.008762 2.884 0.00462 **
## sexfemale 0.269406 0.123277 2.185 0.03070 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared: 0.09825, Adjusted R-squared: 0.08405
## F-statistic: 6.919 on 2 and 127 DF, p-value: 0.001406
Goodness-of-fit:
F-statistic = 6.919
p-val = 0.001406
R2 = 0.08405
The first estimate is for the intercept = 98.11 which indicates temperature with mean weight. With increase in unit weight above the mean weight, increase in temperature is 0.025 degrees for males. For females, extra temperature increase is 0.269 compared to males.
Subset of the full Model only using weight
temp.mod_wt = lm(temp ~ weight_cen, data=bodytemp)
summary(temp.mod_wt)
##
## Call:
## lm(formula = temp ~ weight_cen, data = bodytemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85017 -0.39999 0.01033 0.43915 2.46549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.249231 0.062444 1573.402 < 2e-16 ***
## weight_cen 0.026335 0.008876 2.967 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.712 on 128 degrees of freedom
## Multiple R-squared: 0.06434, Adjusted R-squared: 0.05703
## F-statistic: 8.802 on 1 and 128 DF, p-value: 0.003591
The r-squared value for the subset as well as the full model is less than 0.1. So, only about 10% of the variance in temperature can be explained using full model while only about 6% of variance is explained by subset of the full model.
anova(temp.lm1, temp.mod_wt)
## Analysis of Variance Table
##
## Model 1: temp ~ weight_cen + sex
## Model 2: temp ~ weight_cen
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 127 62.532
## 2 128 64.883 -1 -2.3515 4.7758 0.0307 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p-value for the anova between the two models is less than 0.05 which indicates a significant improvement in model. But RSS change is not much different (increases by about 2) and also since R-squared values are not high, practically, I do not believe that adding sex to the model does not make it much better.