Advanced Geographical Statistics

Assignment 3

GEOG 6000

Date: 9/29/2015

Sulochan Dhungel

\(Answer 1\)

Model to estimate Life Expectancy using socio-economic variables for US States

setwd("C:/Users/Sulochan/Copy/Fall 2015/Advance Geo/Assignment3")
statedata = read.csv("statedata.csv",header=T)
statedata2 <- statedata[,-1]



Null model: This is the model with only intercept.

mod.0 = lm(Life.Exp ~ 1, data = statedata2)
summary(mod.0)
## 
## Call:
## lm(formula = Life.Exp ~ 1, data = statedata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9186 -0.7611 -0.2036  1.0139  2.7214 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.8786     0.1898   373.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.342 on 49 degrees of freedom



Full Model: This is the model with all predictor variables included.

mod.1 = lm(Life.Exp ~ ., data = statedata2)
summary(mod.1)
## 
## Call:
## lm(formula = Life.Exp ~ ., data = statedata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10



Comparing Null model with Full model: The full model significantly explains more variability than null model.

anova(mod.0,mod.1)
## Analysis of Variance Table
## 
## Model 1: Life.Exp ~ 1
## Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost + Area
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     49 88.299                                  
## 2     42 23.297  7    65.002 16.741 2.534e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


There are few coefficients in the full model which are not significant. So, we check if there is any multicollinearity. Pairwise plots show that there is some multicollinearity with values higher than 0.6 or 0.7.

So, we try using stepwise variable selection method with AIC score to select variables for the model, which could help remove multicollinearity.



Using r Step() function to perform stepwise model building.
The process starts with null model and adds a variable at each step, then retains the variable which results in lowest AIC score. In each step, each added variable in previous step is removed to check AIC score. The final model is the one with lowest AIC score.

mod.aic = step(mod.0, scope=formula(mod.1))
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1     4.691 29.770 -19.925
## + Population  1     4.016 30.445 -18.805
## + Frost       1     3.135 31.327 -17.378
## + Income      1     2.405 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1     0.470 33.992 -13.295
## + Illiteracy  1     0.273 34.188 -13.007
## - Murder      1    53.838 88.299  30.435
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## - HS.Grad     1    4.6910 34.461 -14.609
## - Murder      1   28.5974 58.368  11.737
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1     2.064 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1     0.182 25.189 -24.280
## + Illiteracy  1     0.172 25.200 -24.259
## + Area        1     0.026 25.346 -23.970
## - Frost       1     4.399 29.770 -19.925
## - HS.Grad     1     5.955 31.327 -17.378
## - Murder      1    32.756 58.128  13.531
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Income      1     0.006 23.302 -26.174
## + Illiteracy  1     0.004 23.304 -26.170
## + Area        1     0.001 23.307 -26.163
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528



Description of Final model with AIC score:

The final model includes Murder rate, HS.graduate, Frost and Population. Addition or removal of any other variables does not decrease AIC. At each step, variables were added and removed to minimize AIC. The plot of AIC for variable selection with stepwise model building process is given below

plot(1:length(mod.aic$anova$Step),mod.aic$anova$AIC,"b",axes=F, xlab="Addition of",ylab = "AIC",lwd=2, main = "AIC variable selection")
axis(2)
axis(1,at=c(1:5),labels=c("intercept" , mod.aic$anova$Step[2:5]))
box()
grid()


Checking for multicollinearity of selected model: The final selection of variables shows that the variables are not multicollinear.

Goodness of Fit

summary(mod.aic)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = statedata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12


F-statistic value = 31.37

\(p-value = 1.696\times 10^{-12}\)

R2 = 0.7126

R-squared indicates that about 70% of variance is explained by the predictor variables. High F-value and low p-value indicates that the at least one of the variable has significant effect on the response. The p-values for almost all t-statistics for the coefficients are less than 0.05 or about 0.05. So, all of the selected predictor’s coefficients are significant.



Prediction

utID = which(statedata$State=="UT")
newutah = statedata[utID,]
newutah$Population = 2785
newutah$HS.Grad = 75.0
newutah$Murder = 1.3
newutah
##    State Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## 44    UT       2785   4022        0.6     72.9    1.3      75   137 82096
predUT.lexp = predict(mod.aic, interval ="prediction", newdata = newutah,level = .95)
predUT.lexp
##         fit     lwr      upr
## 44 73.45601 71.8809 75.03112

At 95% confidence interval, life expectancy is estimated to be 73.456 and the upper and lower estimates are 71.88 and 75.03 in Utah.


califID = which(statedata$State=="CA")
newcalif = statedata[califID,]
newcalif$Population = 36962
newcalif$HS.Grad = 68.3
newcalif$Murder = 5.3
newcalif
##   State Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## 5    CA      36962   5114        1.1    71.71    5.3    68.3    20 156361
predCA.lexp = predict(mod.aic, interval ="prediction", newdata = newcalif,level = .95)
predCA.lexp
##        fit      lwr      upr
## 5 74.35232 72.11398 76.59065

At 95% confidence interval, life expectancy is estimated to be 74.35 and the upper and lower estimates are 72.11 and 76.59 in California.


\(Answer 2\)

bodytemp1 = read.csv("normtemp.csv",header = T)
bodytemp = bodytemp1[,-1]
bodytemp$sex = factor(bodytemp$sex, labels=c("male","female"))
cor(bodytemp$weight, bodytemp$temp)
## [1] 0.2536564

The correlation between weight and temperature is about 0.25. The plot also shows very less correlation between weight and temperature within any sexes.

Center body weight

weight_cen = bodytemp$weight - mean(bodytemp$weight)

Build a model using body weight and sex

temp.lm1 = lm(temp ~ weight_cen + sex, data=bodytemp)
summary(temp.lm1)
## 
## Call:
## lm(formula = temp ~ weight_cen + sex, data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86363 -0.45624  0.01841  0.47366  2.33424 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.114528   0.087102 1126.428  < 2e-16 ***
## weight_cen   0.025267   0.008762    2.884  0.00462 ** 
## sexfemale    0.269406   0.123277    2.185  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7017 on 127 degrees of freedom
## Multiple R-squared:  0.09825,    Adjusted R-squared:  0.08405 
## F-statistic: 6.919 on 2 and 127 DF,  p-value: 0.001406

Goodness-of-fit:

F-statistic = 6.919

p-val = 0.001406

R2 = 0.08405

The first estimate is for the intercept = 98.11 which indicates temperature with mean weight. With increase in unit weight above the mean weight, increase in temperature is 0.025 degrees for males. For females, extra temperature increase is 0.269 compared to males.

Subset of the full Model only using weight

temp.mod_wt = lm(temp ~ weight_cen, data=bodytemp)
summary(temp.mod_wt)
## 
## Call:
## lm(formula = temp ~ weight_cen, data = bodytemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85017 -0.39999  0.01033  0.43915  2.46549 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 98.249231   0.062444 1573.402  < 2e-16 ***
## weight_cen   0.026335   0.008876    2.967  0.00359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.712 on 128 degrees of freedom
## Multiple R-squared:  0.06434,    Adjusted R-squared:  0.05703 
## F-statistic: 8.802 on 1 and 128 DF,  p-value: 0.003591


The r-squared value for the subset as well as the full model is less than 0.1. So, only about 10% of the variance in temperature can be explained using full model while only about 6% of variance is explained by subset of the full model.

anova(temp.lm1, temp.mod_wt)
## Analysis of Variance Table
## 
## Model 1: temp ~ weight_cen + sex
## Model 2: temp ~ weight_cen
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1    127 62.532                             
## 2    128 64.883 -1   -2.3515 4.7758 0.0307 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value for the anova between the two models is less than 0.05 which indicates a significant improvement in model. But RSS change is not much different (increases by about 2) and also since R-squared values are not high, practically, I do not believe that adding sex to the model does not make it much better.