Data: Lifeexp developed countries.csv

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset related to life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the critical factors that are most representative were chosen.

#1. Perform the backward stepwise regression analysis.

Briefly describe the method that was used to eliminate variables, including the elimination criteria. List the variables that are removed from the developed countries model. (5)

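Backward stepwise regression starts with the full model (all 15 quantitative predictors) and removes one variable per step to build a reduced model. At each step, ols_step_backward_p drops the predictor with the largest p-value, provided that p-value exceeds the removal threshold; the call in the code requests a threshold of 0.20 (prem=0.20). Elimination stops when no remaining predictor qualifies for removal. Five variables were removed, in this order:

=> bmi
=> percentexp
=> gdp
=> polio
=> measles

Two retained predictors, infmort (p = 0.247) and diptheria (p = 0.275), still have p-values above 0.20 in the final model output, so the threshold actually applied may differ from the requested 0.20 (for example, if the installed olsrr version does not recognize the prem argument and falls back on its default). This is worth verifying.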
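To make the elimination rule concrete, here is a minimal sketch of p-value-based backward elimination written by hand. It is not the olsrr implementation: `backward_p()` is a hypothetical helper, the built-in `mtcars` data stands in for the life expectancy data, and the sketch assumes all predictors are numeric.

```r
# Hand-rolled backward elimination by p-value (illustrative only).
# backward_p() is a hypothetical helper, not part of olsrr.
backward_p <- function(data, response, threshold = 0.20) {
  predictors <- setdiff(names(data), response)
  removed <- character(0)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slope terms (intercept row dropped)
    pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])
    worst <- names(pvals)[which.max(pvals)]
    # stop once the largest p-value is at or below the threshold
    if (pvals[[worst]] <= threshold || length(predictors) == 1) break
    removed <- c(removed, worst)
    predictors <- setdiff(predictors, worst)
  }
  list(model = fit, removed = removed)
}

result <- backward_p(mtcars, "mpg", threshold = 0.20)
result$removed          # variables dropped, in order of removal
formula(result$model)   # formula of the reduced model
```

The model is refit after every removal, so the p-values of the remaining predictors are recomputed at each step rather than read once from the full model.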

#####################################

#Problem 1
#Life expectancy data for developed countries
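#years 2000-2015

#Packages assumed from the functions used below:
#olsrr for the ols_* functions, dplyr for %>% and select(), car for vif()
library(olsrr)
library(dplyr)
library(car)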

#Build the model with all quantitative predictors for the Developed countries
#We remove Country and Year

model1 <- lm(y~.-country-year, data=developed)
summary(model1)
## 
## Call:
## lm(formula = y ~ . - country - year, data = developed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2210 -1.7368 -0.6397  1.0129 10.2250 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.536e+01  5.184e+00   6.821 8.16e-11 ***
## adultmort   -1.180e-02  3.794e-03  -3.111 0.002106 ** 
## infmort     -6.530e-01  5.985e-01  -1.091 0.276390    
## alcohol     -1.653e-01  8.433e-02  -1.960 0.051262 .  
## percentexp   9.310e-05  1.904e-04   0.489 0.625314    
## hepB         1.678e-02  9.808e-03   1.711 0.088510 .  
## measles     -9.272e-05  9.838e-05  -0.942 0.346968    
## bmi         -4.526e-03  1.052e-02  -0.430 0.667457    
## under5mort   6.185e-01  5.006e-01   1.236 0.217887    
## polio        1.305e-02  2.866e-02   0.455 0.649352    
## totalexp     1.420e-01  7.574e-02   1.875 0.062101 .  
## diptheria   -3.308e-02  2.840e-02  -1.165 0.245287    
## gdp         -1.567e-05  3.043e-05  -0.515 0.607151    
## population   1.676e-08  1.251e-08   1.340 0.181456    
## income       6.408e+01  6.188e+00  10.356  < 2e-16 ***
## schooling   -5.152e-01  1.502e-01  -3.431 0.000715 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.807 on 226 degrees of freedom
##   (270 observations deleted due to missingness)
## Multiple R-squared:  0.5953, Adjusted R-squared:  0.5685 
## F-statistic: 22.17 on 15 and 226 DF,  p-value: < 2.2e-16
#Backwards stepwise regression using p-values
ols_step_backward_p(model1, prem=0.20, details=TRUE)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. adultmort 
## 2. infmort 
## 3. alcohol 
## 4. percentexp 
## 5. hepB 
## 6. measles 
## 7. bmi 
## 8. under5mort 
## 9. polio 
## 10. totalexp 
## 11. diptheria 
## 12. gdp 
## 13. population 
## 14. income 
## 15. schooling 
## 
## 
## Step   => 0 
## Model  => y ~ adultmort + infmort + alcohol + percentexp + hepB + measles + bmi + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling 
## R2     => 0.595 
## 
## Initiating stepwise selection... 
## 
## Step     => 1 
## Removed  => bmi 
## Model    => y ~ adultmort + infmort + alcohol + percentexp + hepB + measles + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling 
## R2       => 0.59501 
## 
## Step     => 2 
## Removed  => percentexp 
## Model    => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling 
## R2       => 0.59465 
## 
## Step     => 3 
## Removed  => gdp 
## Model    => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + polio + totalexp + diptheria + population + income + schooling 
## R2       => 0.59461 
## 
## Step     => 4 
## Removed  => polio 
## Model    => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + totalexp + diptheria + population + income + schooling 
## R2       => 0.59428 
## 
## Step     => 5 
## Removed  => measles 
## Model    => y ~ adultmort + infmort + alcohol + hepB + under5mort + totalexp + diptheria + population + income + schooling 
## R2       => 0.59244 
## 
## 
## No more variables to be removed.
## 
## Variables Removed: 
## 
## => bmi 
## => percentexp 
## => gdp 
## => polio 
## => measles
## 
## 
##                              Stepwise Summary                             
## ------------------------------------------------------------------------
## Step    Variable        AIC         SBC       SBIC      R2       Adj. R2 
## ------------------------------------------------------------------------
##  0      Full Model    1203.803    1263.115      NA    0.59534    0.56848 
##  1      bmi           1202.001    1257.824      NA    0.59501    0.57003 
##  2      percentexp    1200.214    1252.548      NA    0.59465    0.57154 
##  3      gdp           1198.235    1247.080      NA    0.59461    0.57337 
##  4      polio         1196.436    1241.793      NA    0.59428    0.57487 
##  5      measles       1195.530    1237.398      NA    0.59244    0.57479 
## ------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                           
## ----------------------------------------------------------------
## R                       0.770       RMSE                  2.723 
## R-Squared               0.592       MSE                   7.765 
## Adj. R-Squared          0.575       Coef. Var             3.541 
## Pred R-Squared          0.543       AIC                1195.530 
## MAE                     2.027       SBC                1237.398 
## ----------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                 ANOVA                                 
## ---------------------------------------------------------------------
##                 Sum of                                               
##                Squares         DF    Mean Square      F         Sig. 
## ---------------------------------------------------------------------
## Regression    2607.497         10        260.750    33.578    0.0000 
## Residual      1793.807        231          7.765                     
## Total         4401.303        241                                    
## ---------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    35.708         4.534                  7.875    0.000    26.773    44.642 
##   adultmort    -0.012         0.004       -0.151    -3.129    0.002    -0.019    -0.004 
##     infmort    -0.675         0.582       -0.170    -1.160    0.247    -1.821     0.472 
##     alcohol    -0.154         0.081       -0.083    -1.898    0.059    -0.314     0.006 
##        hepB     0.017         0.010        0.076     1.730    0.085    -0.002     0.036 
##  under5mort     0.630         0.479        0.188     1.316    0.189    -0.313     1.574 
##    totalexp     0.144         0.074        0.089     1.951    0.052    -0.001     0.290 
##   diptheria    -0.028         0.026       -0.048    -1.094    0.275    -0.080     0.023 
##  population     0.000         0.000        0.066     1.300    0.195     0.000     0.000 
##      income    64.082         5.166        0.786    12.404    0.000    53.902    74.261 
##   schooling    -0.515         0.142       -0.212    -3.637    0.000    -0.794    -0.236 
## ----------------------------------------------------------------------------------------

#2. Correlation matrix for model 2 variables

Calculate the correlation matrix for the remaining quantitative variables. Identify the pairs of variables that are highly correlated, per the criteria presented in the lecture. (3)

Using a criterion of an absolute correlation of at least 0.8, we find one pair of highly correlated variables: under5mort and infmort (r = 0.952).

#3. VIF

Calculate and report the VIFs for this model. (3)

The VIFs for this model are reported below.

Variable         VIF
adultmort     1.313605
infmort      12.247880
alcohol       1.074693
hepB          1.099257
under5mort   11.551952
totalexp      1.189583
diptheria     1.076459
population    1.464274
income        2.275913
schooling     1.931344

#Select only quantitative predictors that were retained by the 
#screening process to calculate correlations

#List the variables that were NOT dropped by the backward stepwise regression,
#separated by commas

x <- developed %>%
  select(adultmort,infmort,alcohol,hepB,under5mort,totalexp,diptheria,population,income,schooling)

#Correlation matrix
cor(x, use="complete.obs")
##               adultmort      infmort      alcohol        hepB   under5mort
## adultmort   1.000000000  0.159701466 -0.003275959  0.16432299  0.147250725
## infmort     0.159701466  1.000000000 -0.163708602  0.03165349  0.952134695
## alcohol    -0.003275959 -0.163708602  1.000000000 -0.02898060 -0.124022131
## hepB        0.164322991  0.031653485 -0.028980602  1.00000000  0.062632500
## under5mort  0.147250725  0.952134695 -0.124022131  0.06263250  1.000000000
## totalexp   -0.168415221 -0.161009792 -0.144020211  0.01185474 -0.176842234
## diptheria  -0.001228567 -0.004307695 -0.040850881  0.08777625 -0.004343776
## population -0.053288649  0.507738038 -0.019068825 -0.06860509  0.481058608
## income     -0.434170131 -0.122203614  0.021497401 -0.21656744 -0.068620847
## schooling  -0.192268496  0.014133537 -0.010357930 -0.18866472  0.057780050
##               totalexp    diptheria  population      income   schooling
## adultmort  -0.16841522 -0.001228567 -0.05328865 -0.43417013 -0.19226850
## infmort    -0.16100979 -0.004307695  0.50773804 -0.12220361  0.01413354
## alcohol    -0.14402021 -0.040850881 -0.01906883  0.02149740 -0.01035793
## hepB        0.01185474  0.087776247 -0.06860509 -0.21656744 -0.18866472
## under5mort -0.17684223 -0.004343776  0.48105861 -0.06862085  0.05778005
## totalexp    1.00000000  0.186814075 -0.16457756  0.14143569  0.16014959
## diptheria   0.18681408  1.000000000  0.04984900 -0.02498708 -0.10285151
## population -0.16457756  0.049848996  1.00000000  0.10651859  0.08488604
## income      0.14143569 -0.024987082  0.10651859  1.00000000  0.66240925
## schooling   0.16014959 -0.102851514  0.08488604  0.66240925  1.00000000
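With ten variables the matrix is still readable by eye, but the highly correlated pairs can also be pulled out programmatically. The sketch below is one way to do that; `high_cor_pairs()` is a hypothetical helper, the 0.8 cutoff mirrors the criterion used in the answer, and a few `mtcars` columns stand in for the data.

```r
# List variable pairs whose absolute correlation exceeds a cutoff.
# high_cor_pairs() is a hypothetical helper written for illustration.
high_cor_pairs <- function(df, cutoff = 0.8) {
  r <- cor(df, use = "complete.obs")
  # keep only the upper triangle so each pair appears once
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    row.names = NULL
  )
}

flagged <- high_cor_pairs(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], cutoff = 0.8)
flagged
```

Applied to the data frame `x` built above, the same call would be `high_cor_pairs(x)`.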
#Build model2 using only variables that were NOT dropped from the backward stepwise regression
#separate variables with "+"

model2<-lm(y~ adultmort+infmort+alcohol+hepB+under5mort+totalexp+diptheria+population+income+schooling, data=developed)

#Variance Inflation Factors
vif(model2)
##  adultmort    infmort    alcohol       hepB under5mort   totalexp  diptheria 
##   1.313605  12.247880   1.074693   1.099257  11.551952   1.189583   1.076459 
## population     income  schooling 
##   1.464274   2.275913   1.931344
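For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors. The sketch below computes it from first principles; `manual_vif()` is a hypothetical helper and `mtcars` columns stand in for the predictors. On the same predictors and rows, the values should agree with what `vif()` reports.

```r
# VIF computed by hand: regress each predictor on the others.
# manual_vif() is a hypothetical helper written for illustration.
manual_vif <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

X <- mtcars[, c("disp", "hp", "wt", "qsec")]
vifs <- manual_vif(X)
vifs
```

A VIF near 1 means the predictor is almost uncorrelated with the others; large values mean most of its variation is explained by the other predictors.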

#4. Multicollinearity

Do you have a multicollinearity issue with your model?

If no, explain why. Then use model2 as your final model for residual analysis. Make a decision as to which variables you will eliminate to remedy the multicollinearity.

If yes, report the eliminated variables here with your justification for their elimination. Eliminate them and calculate and report the model.(5)

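Yes, there is a multicollinearity issue with the model. under5mort and infmort are highly correlated (r = 0.952), and they are the only predictors with large VIFs (11.55 and 12.25, against at most 2.28 for every other predictor), so I removed both from the model. After their removal, every VIF in the final model is below 2.2.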

#FINAL MODEL TIME
#You must enter the variables that you want to include in your final model.
#To address multicollinearity, remove the variable that comes first alphabetically

#No Multicollinearity?
#final_model=model2

#Yes Multicollinearity?
final_model <-lm(y~ adultmort+alcohol+hepB+totalexp+diptheria+population+income+schooling, data=developed)

summary(final_model)
## 
## Call:
## lm(formula = y ~ adultmort + alcohol + hepB + totalexp + diptheria + 
##     population + income + schooling, data = developed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1550 -1.5411 -0.8003  1.0925 10.4292 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.480e+01  4.397e+00   7.914    1e-13 ***
## adultmort   -1.151e-02  3.706e-03  -3.105 0.002138 ** 
## alcohol     -1.470e-01  7.941e-02  -1.851 0.065375 .  
## hepB         1.861e-02  9.492e-03   1.960 0.051170 .  
## totalexp     1.331e-01  7.335e-02   1.815 0.070777 .  
## diptheria   -2.766e-02  2.602e-02  -1.063 0.288782    
## population   1.642e-08  1.043e-08   1.574 0.116788    
## income       6.464e+01  5.062e+00  12.769  < 2e-16 ***
## schooling   -4.975e-01  1.400e-01  -3.552 0.000462 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.785 on 233 degrees of freedom
##   (270 observations deleted due to missingness)
## Multiple R-squared:  0.5893, Adjusted R-squared:  0.5752 
## F-statistic: 41.79 on 8 and 233 DF,  p-value: < 2.2e-16
vif(final_model)
##  adultmort    alcohol       hepB   totalexp  diptheria population     income 
##   1.295371   1.026577   1.072943   1.168963   1.075280   1.063686   2.187208 
##  schooling 
##   1.891090

#5. Residual Analysis

Do the residuals meet the assumptions of constant variance and normality? Identify any outliers and/or leverage points (as many as you can). Are there any values that are both an outlier and have high leverage? (6)

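The residuals appear to meet the assumptions of constant variance and normality: the points in the versus-fits plot are randomly scattered around zero with no clear pattern, and the histogram is roughly symmetric. The residual summary does hint at a mild right skew (median -0.80, maximum 10.43 against a minimum of -5.16), so the largest positive residuals deserve a closer look as potential outliers. There are no values that are both an outlier and have high leverage.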

#Residual analysis
e <-resid(final_model)

#Histogram
hist(e)

#Versus fits plot
yhat <-predict(final_model)
plot(yhat, e)

ols_plot_resid_lev(final_model)
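
The outliers and leverage points can also be listed in a table rather than read off the plot. The following is a rough sketch using base R diagnostics. The cutoffs (|standardized residual| > 2 and leverage > 2p/n) are common rules of thumb and are assumptions here, not necessarily the criteria from the lecture or the ones the plot uses; a small `mtcars` model stands in for `final_model`.

```r
# Flag outliers and high-leverage points for a fitted lm.
# Cutoffs are rule-of-thumb assumptions; fit is a stand-in model.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

r_std <- rstandard(fit)      # internally studentized residuals
h     <- hatvalues(fit)      # leverage of each observation
p     <- length(coef(fit))   # number of coefficients, intercept included
n     <- nobs(fit)

outlier  <- abs(r_std) > 2
leverage <- h > 2 * p / n

diag_tbl <- data.frame(r_std, h, outlier, leverage, both = outlier & leverage)
diag_tbl[outlier | leverage, ]   # observations worth a closer look
```

Rows where `both` is TRUE are the observations that are simultaneously outliers and high-leverage points under these cutoffs.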