Exploratory Regression - Putting it all Together

Explaining Life Expectancy

We will begin by considering what could explain life expectancy in this world. Start by loading the “HDR.rda” data object that was pulled from a fairly recent Human Development Report. If you cannot access the link to the data, above, you may access it on the canvas site.

Once you have the data, set your working directory to wherever the data was saved and load it.

load("HDR.rda")

We don’t have a lot of time, so start with all the data we have. To do that, consider starting with all the variables that are available in the data set.

To run a regression with all remaining variables in the data set, use a “.” in place of the variable names. It is a handy trick that is worth remembering. You will, however, want to remove any variables that you do not want in the model.

In our case, we do not want to include the first two variables in the data set: country name, and HDR rank.

If you use dim(HDR) to check the dimensions of the data object, then you will see that HDR contains 174 rows and 78 columns.

Using names(HDR), you will also see that the first two variables are the ones that we would prefer not to include in the regression. We are, therefore, left with variables three through seventy-eight (3:78) that we would like to include in the model. So, we can subset the initial data set right in the model, as we do below.

initial.model <- lm(LIFE_EXP_95_00 ~ . , data=HDR[ , 3:78])
# summary(initial.model)

If you run this model for yourself, when you look at the model summary you will note that there are a lot of non-significant variables in this model.

We, therefore, need to streamline what is in the model.

Next, Use Stepwise Model Selection

Stepwise model selection will eliminate any variables that do not improve the model fit. Though, keep in mind that a lot of variables will still be problematic or redundant.

m1 <- step(initial.model, direcion="both")
# summary(m1)

Next, use basic diagnostic plots to evaluate the model.

par(mfrow=c(2, 2)) # Set the plot window to display a 2x2 matrix of plots. 
plot(m1)
par(mfrow=c(1, 1)) #

The fit is somewhere between fair and “meh” and there are a lot of redundant or uninformative variables in the model. There is also no indication of what may be poilcy-relevant in this model. Let’s take just the formula for the model using m1$call and reorganize the formula to include any variables that look like they may be policy-relevant up front. Then, we can re-run the mode and diagnose problems using Variance Inflation Factors (VIF) to see which variables are contributing too much multicollinearity.

Recalling that policy-relevant variables are anything that you can potentially manipulate to influence society, we he new model highlights items such as spending on health, tax revenue, general government expenditures and debts, and other items that may be policy relevant.

m2 <- lm(formula = LIFE_EXP_95_00 ~ HEALTH96_98 + HEALTH90 + TXREV98 + GOVCONS98 + WGOV98 + 
           AGRIC98 + EXPEND98 + BDGT_SUR + DEBT98 + DEBT_SER + WGOV98 + ELECTRI_ + 
           KWATT_HR_CAP80 + GNP98 + GDP98 +  AID98 + AID_CAP9 + AID_CAP98 + POP98 + POP015 +   
           PRATE98_ + POP65_98 + POP65_01 + FERTILIT + KWATT_HR + KWATT_HR_CAP97 + FOOD_IDX +
           GOVCONS98 + LIFE_EXP + MORTAL70 + MORTAL98 + MORT_UNDER5_98, data = HDR)

# summary(m2)

This model is still ridiculous, given that there ar still a lot of, essentially, duplicate variables and the R-square value is suspiciously good. Don’t count on seeing an R-squared in the 90s unless there is a lot of multicollinearity in the model. It is possible to get a great R-square. But be very suspicious when it is this high.

So, the next task will be to pare down the model you derived using stepwise regression. To do so, you have a choice of whether to eliminate (1) variables that are redundant, or (2) variables with the highest p-values, or (3) variables that are contributing too much multicollinearity to the model.

My suggestion is to begin by removing all variables that look like your outcome (i.e. remove any variables that refer to mortality of life expectancy). Next, use Variance Inflation Factors (vif()) to help choose between redundant variables. Recall that VIF > 10 for any variable is an indication that it is adding multicollinearity.

As you remove variables, you will notice that some other variables are no longer significant. This becomes an iterative process of culling, re-running the regression, and reconsidering what should be retained.

Initially, at least, you should try to keep as many of the non-redundant policy-relevant variables as you can.

Eventually, you will end up with a model that fits well.

m3 <- lm(formula = LIFE_EXP_95_00 ~ HEALTH96_98 + WGOV98 + AGRIC98 + 
             BDGT_SUR + DEBT98 + WGOV98 + ELECTRI_ + AID_CAP98 + 
             FERTILIT + FOOD_IDX, data = HDR)
# summary(m3)

# par(mfrow=c(2,2))
# plot(m3)
# par(mfrow=c(1,1))

You will notice that a two of these variables (women in government and budget surplus) are still in the model. They were retained so that we could compare the performance of the model with, and without them.

m4 <- lm(formula = LIFE_EXP_95_00 ~ HEALTH96_98 + AGRIC98 + 
             DEBT98 + AID_CAP98 + FERTILIT + 
             FOOD_IDX, data = HDR)
# summary(m4)

# par(mfrow=c(2,2))
# plot(m4)
# par(mfrow=c(1,1))

Consider your final Model(s)

Let’s compare the last two models. Are the differences significant?

anova(m3, m4)

## Analysis of Variance Table
## 
## Model 1: LIFE_EXP_95_00 ~ HEALTH96_98 + WGOV98 + AGRIC98 + BDGT_SUR + 
##     DEBT98 + WGOV98 + ELECTRI_ + AID_CAP98 + FERTILIT + FOOD_IDX
## Model 2: LIFE_EXP_95_00 ~ HEALTH96_98 + AGRIC98 + DEBT98 + AID_CAP98 + 
##     FERTILIT + FOOD_IDX
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    156 4064.6                                 
## 2    159 4686.0 -3   -621.48 7.951 5.742e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You may also wish to view Standardized Coefficients

library(relaimpo)
calc.relimp(m3)

## Response variable: LIFE_EXP_95_00 
## Total response variance: 123.6329 
## Analysis based on 166 observations 
## 
## 9 Regressors: 
## HEALTH96_98 WGOV98 AGRIC98 BDGT_SUR DEBT98 ELECTRI_ AID_CAP98 FERTILIT FOOD_IDX 
## Proportion of variance explained by model: 80.08%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##                     lmg
## HEALTH96_98 0.107557210
## WGOV98      0.006105127
## AGRIC98     0.131672254
## BDGT_SUR    0.003317696
## DEBT98      0.113146643
## ELECTRI_    0.018236034
## AID_CAP98   0.008116010
## FERTILIT    0.398937998
## FOOD_IDX    0.013662608
## 
## Average coefficients for different model sizes: 
## 
##                      1X          2Xs          3Xs          4Xs          5Xs
## HEALTH96_98  2.81407371  2.379336630  2.019574591  1.724813846  1.485586901
## WGOV98       0.21538030  0.128955097  0.067577546  0.023230098 -0.009656936
## AGRIC98     -0.38751207 -0.331330120 -0.283174176 -0.242062311 -0.207047603
## BDGT_SUR    -0.27059753 -0.265957233 -0.238199508 -0.199042184 -0.154443216
## DEBT98      -0.08581969 -0.071938875 -0.060002969 -0.049754979 -0.040960393
## ELECTRI_     0.00178679  0.003990495  0.005409689  0.006328510  0.006935880
## AID_CAP98   -0.04027545 -0.025473235 -0.013895158 -0.004866145  0.002189447
## FERTILIT    -5.34270946 -5.192289645 -5.053367756 -4.927359420 -4.815002040
## FOOD_IDX    -0.05398497 -0.026481151 -0.006289512  0.008511993  0.019434870
##                      6Xs          7Xs          8Xs          9Xs
## HEALTH96_98  1.292962127  1.138583977  1.014760718  0.914603711
## WGOV98      -0.034994956 -0.055519871 -0.073093375 -0.088906540
## AGRIC98     -0.177259826 -0.151925091 -0.130373731 -0.112042543
## BDGT_SUR    -0.107226682 -0.058634487 -0.009163311  0.040979953
## DEBT98      -0.033404015 -0.026888334 -0.021234557 -0.016286362
## ELECTRI_     0.007351033  0.007646287  0.007865365  0.008035639
## AID_CAP98    0.007752795  0.012215693  0.015887311  0.019003589
## FERTILIT    -4.716507662 -4.631683394 -4.560031557 -4.500834769
## FOOD_IDX     0.027623589  0.033904229  0.038840013  0.042795716

Last, Bootstrap the Regression Coefficients if You Wish

library(boot)

BReg <- function(formula, data, indices){
  d <- data[indices,]
  fit <- lm(formula, data=d)
  return(coef(fit))
}

result <- boot(statistic=BReg, R=1000,
               formula = LIFE_EXP_95_00 ~ HEALTH96_98 + AGRIC98 + 
               DEBT98 + AID_CAP98 + FERTILIT + FOOD_IDX, 
               data = HDR)
summary(result)

## 
## Number of bootstrap replications R = 1000 
##    original    bootBias    bootSE   bootMed
## 1 74.635336  0.23964736 2.3165485 74.966325
## 2  0.747136 -0.04396632 0.2437175  0.697778
## 3 -0.110663 -0.01049144 0.0505143 -0.117360
## 4 -0.020404 -0.00011581 0.0093373 -0.019783
## 5  0.016854 -0.00169278 0.0120731  0.015320
## 6 -4.293606  0.02761166 0.4746301 -4.248686
## 7  0.052190 -0.00005218 0.0149241  0.051066

boot.ci(result, type="bca", index=2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = result, type = "bca", index = 2)
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.3216,  1.3105 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

plot(result, index=2)