Data: Lifeexp developed countries.csv
The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset related to life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and its corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the critical factors that are most representative were chosen.
#1. Perform the backward stepwise regression analysis.
Briefly describe the method that was used to eliminate variables, including the elimination criteria. List the variables that are removed from the developed countries model. (5)
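Backward stepwise regression starts with the full model and removes one variable at a time, refitting after each removal, to arrive at a reduced model. Here the elimination criterion was the p-value, using ols_step_backward_p called with a removal threshold of prem = 0.20: at each step a predictor whose p-value is above the threshold is dropped, and the procedure stops when it finds no further variable to remove. Note that the final model still retains predictors with p-values above 0.20 (infmort, 0.247; diptheria, 0.275), so the threshold actually applied appears to have been looser than 0.20. This would be consistent with olsrr versions that take the threshold as p_val (default 0.3) rather than prem. Five variables were removed from the developed countries model, in this order:
=> bmi
=> percentexp
=> gdp
=> polio
=> measles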
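The elimination loop described above can be sketched by hand in base R. This is a simplified illustration, not the olsrr implementation: the function name backward_p, its arguments, and the use of the built-in mtcars data as a stand-in for developed are all assumptions made for this example, and the sketch is not expected to reproduce the removal order reported by ols_step_backward_p.

```r
# Minimal sketch of p-value-based backward elimination (illustration only).
# backward_p is a hypothetical helper; mtcars stands in for the real data.
backward_p <- function(data, response, threshold = 0.20) {
  predictors <- setdiff(names(data), response)
  removed <- character(0)
  repeat {
    # Refit the model with the predictors that are still in play
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slope terms only (row 1 is the intercept)
    pvals <- coefs[-1, 4]
    names(pvals) <- rownames(coefs)[-1]
    worst <- names(which.max(pvals))
    # Stop when the least significant predictor passes the threshold,
    # or when only one predictor is left
    if (pvals[worst] <= threshold || length(predictors) == 1) break
    removed <- c(removed, worst)
    predictors <- setdiff(predictors, worst)
  }
  list(model = fit, removed = removed, kept = predictors)
}

res <- backward_p(mtcars, "mpg", threshold = 0.20)
print(res$removed)  # variables dropped, in order of removal
print(res$kept)     # variables retained in the reduced model
```

Because the model is refit after every removal, a variable that looks insignificant in the full model can become significant later, which is why the loop removes only one variable per step.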
#####################################
#Problem 1
#Life expectancy data for developed countries
#years 2000-2015
#Build the model with all quantitative predictors for the Developed countries
#We remove Country and Year
model1 <- lm(y~.-country-year, data=developed)
summary(model1)
##
## Call:
## lm(formula = y ~ . - country - year, data = developed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2210 -1.7368 -0.6397 1.0129 10.2250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.536e+01 5.184e+00 6.821 8.16e-11 ***
## adultmort -1.180e-02 3.794e-03 -3.111 0.002106 **
## infmort -6.530e-01 5.985e-01 -1.091 0.276390
## alcohol -1.653e-01 8.433e-02 -1.960 0.051262 .
## percentexp 9.310e-05 1.904e-04 0.489 0.625314
## hepB 1.678e-02 9.808e-03 1.711 0.088510 .
## measles -9.272e-05 9.838e-05 -0.942 0.346968
## bmi -4.526e-03 1.052e-02 -0.430 0.667457
## under5mort 6.185e-01 5.006e-01 1.236 0.217887
## polio 1.305e-02 2.866e-02 0.455 0.649352
## totalexp 1.420e-01 7.574e-02 1.875 0.062101 .
## diptheria -3.308e-02 2.840e-02 -1.165 0.245287
## gdp -1.567e-05 3.043e-05 -0.515 0.607151
## population 1.676e-08 1.251e-08 1.340 0.181456
## income 6.408e+01 6.188e+00 10.356 < 2e-16 ***
## schooling -5.152e-01 1.502e-01 -3.431 0.000715 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.807 on 226 degrees of freedom
## (270 observations deleted due to missingness)
## Multiple R-squared: 0.5953, Adjusted R-squared: 0.5685
## F-statistic: 22.17 on 15 and 226 DF, p-value: < 2.2e-16
#Backwards stepwise regression using p-values (ols_step_backward_p is from the olsrr package)
library(olsrr)
ols_step_backward_p(model1, prem=0.20, details=TRUE)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1. adultmort
## 2. infmort
## 3. alcohol
## 4. percentexp
## 5. hepB
## 6. measles
## 7. bmi
## 8. under5mort
## 9. polio
## 10. totalexp
## 11. diptheria
## 12. gdp
## 13. population
## 14. income
## 15. schooling
##
##
## Step => 0
## Model => y ~ adultmort + infmort + alcohol + percentexp + hepB + measles + bmi + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling
## R2 => 0.595
##
## Initiating stepwise selection...
##
## Step => 1
## Removed => bmi
## Model => y ~ adultmort + infmort + alcohol + percentexp + hepB + measles + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling
## R2 => 0.59501
##
## Step => 2
## Removed => percentexp
## Model => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + polio + totalexp + diptheria + gdp + population + income + schooling
## R2 => 0.59465
##
## Step => 3
## Removed => gdp
## Model => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + polio + totalexp + diptheria + population + income + schooling
## R2 => 0.59461
##
## Step => 4
## Removed => polio
## Model => y ~ adultmort + infmort + alcohol + hepB + measles + under5mort + totalexp + diptheria + population + income + schooling
## R2 => 0.59428
##
## Step => 5
## Removed => measles
## Model => y ~ adultmort + infmort + alcohol + hepB + under5mort + totalexp + diptheria + population + income + schooling
## R2 => 0.59244
##
##
## No more variables to be removed.
##
## Variables Removed:
##
## => bmi
## => percentexp
## => gdp
## => polio
## => measles
##
##
## Stepwise Summary
## ------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## ------------------------------------------------------------------------
## 0 Full Model 1203.803 1263.115 NA 0.59534 0.56848
## 1 bmi 1202.001 1257.824 NA 0.59501 0.57003
## 2 percentexp 1200.214 1252.548 NA 0.59465 0.57154
## 3 gdp 1198.235 1247.080 NA 0.59461 0.57337
## 4 polio 1196.436 1241.793 NA 0.59428 0.57487
## 5 measles 1195.530 1237.398 NA 0.59244 0.57479
## ------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------
## R 0.770 RMSE 2.723
## R-Squared 0.592 MSE 7.765
## Adj. R-Squared 0.575 Coef. Var 3.541
## Pred R-Squared 0.543 AIC 1195.530
## MAE 2.027 SBC 1237.398
## ----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 2607.497 10 260.750 33.578 0.0000
## Residual 1793.807 231 7.765
## Total 4401.303 241
## ---------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 35.708 4.534 7.875 0.000 26.773 44.642
## adultmort -0.012 0.004 -0.151 -3.129 0.002 -0.019 -0.004
## infmort -0.675 0.582 -0.170 -1.160 0.247 -1.821 0.472
## alcohol -0.154 0.081 -0.083 -1.898 0.059 -0.314 0.006
## hepB 0.017 0.010 0.076 1.730 0.085 -0.002 0.036
## under5mort 0.630 0.479 0.188 1.316 0.189 -0.313 1.574
## totalexp 0.144 0.074 0.089 1.951 0.052 -0.001 0.290
## diptheria -0.028 0.026 -0.048 -1.094 0.275 -0.080 0.023
## population 0.000 0.000 0.066 1.300 0.195 0.000 0.000
## income 64.082 5.166 0.786 12.404 0.000 53.902 74.261
## schooling -0.515 0.142 -0.212 -3.637 0.000 -0.794 -0.236
## ----------------------------------------------------------------------------------------
#2. Correlation matrix for model 2 variables
Calculate the correlation matrix for the remaining quantitative variables. Identify the pairs of variables that are highly correlated, per the criteria presented in the lecture. (3)
Using a criterion of an absolute correlation of 0.8 or more, we find one pair of highly correlated variables: under5mort and infmort (r = 0.952).
#3. VIF
Calculate and report the VIFs for this model. (3)
The VIFs for this model are reported below.
Variable      VIF
adultmort     1.313605
infmort      12.247880
alcohol       1.074693
hepB          1.099257
under5mort   11.551952
totalexp      1.189583
diptheria     1.076459
population    1.464274
income        2.275913
schooling     1.931344
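For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes this by hand in base R. The helper name manual_vif and the mtcars columns are assumptions chosen for illustration, and the sketch assumes data without missing values, unlike the developed data.

```r
# Minimal sketch: VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by
# regressing predictor j on the remaining predictors.
# manual_vif is a hypothetical helper; mtcars stands in for the real data.
manual_vif <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

X <- mtcars[, c("disp", "hp", "wt", "qsec")]
vifs <- manual_vif(X)
print(round(vifs, 3))
```

A VIF near 1 means the predictor is nearly uncorrelated with the others, while a large value means most of its variation is already explained by the other predictors.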
#Select only the quantitative predictors retained by the backward
#stepwise regression to calculate correlations
#(%>% and select are from the dplyr package)
library(dplyr)
x <- developed %>%
  select(adultmort,infmort,alcohol,hepB,under5mort,totalexp,diptheria,population,income,schooling)
#Correlation matrix
cor(x, use="complete.obs")
## adultmort infmort alcohol hepB under5mort
## adultmort 1.000000000 0.159701466 -0.003275959 0.16432299 0.147250725
## infmort 0.159701466 1.000000000 -0.163708602 0.03165349 0.952134695
## alcohol -0.003275959 -0.163708602 1.000000000 -0.02898060 -0.124022131
## hepB 0.164322991 0.031653485 -0.028980602 1.00000000 0.062632500
## under5mort 0.147250725 0.952134695 -0.124022131 0.06263250 1.000000000
## totalexp -0.168415221 -0.161009792 -0.144020211 0.01185474 -0.176842234
## diptheria -0.001228567 -0.004307695 -0.040850881 0.08777625 -0.004343776
## population -0.053288649 0.507738038 -0.019068825 -0.06860509 0.481058608
## income -0.434170131 -0.122203614 0.021497401 -0.21656744 -0.068620847
## schooling -0.192268496 0.014133537 -0.010357930 -0.18866472 0.057780050
## totalexp diptheria population income schooling
## adultmort -0.16841522 -0.001228567 -0.05328865 -0.43417013 -0.19226850
## infmort -0.16100979 -0.004307695 0.50773804 -0.12220361 0.01413354
## alcohol -0.14402021 -0.040850881 -0.01906883 0.02149740 -0.01035793
## hepB 0.01185474 0.087776247 -0.06860509 -0.21656744 -0.18866472
## under5mort -0.17684223 -0.004343776 0.48105861 -0.06862085 0.05778005
## totalexp 1.00000000 0.186814075 -0.16457756 0.14143569 0.16014959
## diptheria 0.18681408 1.000000000 0.04984900 -0.02498708 -0.10285151
## population -0.16457756 0.049848996 1.00000000 0.10651859 0.08488604
## income 0.14143569 -0.024987082 0.10651859 1.00000000 0.66240925
## schooling 0.16014959 -0.102851514 0.08488604 0.66240925 1.00000000
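Scanning a 10 x 10 matrix by eye is error-prone, so the 0.8 screening can also be done programmatically. The sketch below is one way to do it. The helper name high_cor_pairs is an assumption, and the small matrix is a subset of the correlations reported above, rounded to four decimals.

```r
# Minimal sketch: list variable pairs whose absolute correlation exceeds a cutoff.
# high_cor_pairs is a hypothetical helper; r holds values copied (rounded)
# from the correlation matrix above for four of the ten variables.
high_cor_pairs <- function(r, cutoff = 0.8) {
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r = r[idx],
             stringsAsFactors = FALSE)
}

vars <- c("adultmort", "infmort", "under5mort", "population")
r <- matrix(c(
   1.0000, 0.1597, 0.1473, -0.0533,
   0.1597, 1.0000, 0.9521,  0.5077,
   0.1473, 0.9521, 1.0000,  0.4811,
  -0.0533, 0.5077, 0.4811,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

pairs_found <- high_cor_pairs(r, cutoff = 0.8)
print(pairs_found)
```

On the full data, the same helper could be applied to cor(x, use="complete.obs").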
#Build model2 using only the variables retained by the backward stepwise regression
model2 <- lm(y ~ adultmort+infmort+alcohol+hepB+under5mort+totalexp+diptheria+population+income+schooling, data=developed)
#Variance Inflation Factors (vif is assumed to come from the car package)
library(car)
vif(model2)
## adultmort infmort alcohol hepB under5mort totalexp diptheria
## 1.313605 12.247880 1.074693 1.099257 11.551952 1.189583 1.076459
## population income schooling
## 1.464274 2.275913 1.931344
#4. Multicollinearity
Do you have a multicollinearity issue with your model?
If no, explain why. Then use model2 as your final model for residual analysis.
If yes, make a decision as to which variables you will eliminate to remedy the multicollinearity. Report the eliminated variables here with your justification for their elimination. Eliminate them and calculate and report the model. (5)
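There is a multicollinearity issue with the model. under5mort and infmort are highly correlated (r = 0.952) and have VIFs of 11.55 and 12.25, both above the commonly used cutoff of 10, while every other VIF is below 2.3. I decided to remove both variables from the model. In the resulting final model, all VIFs are below 2.2.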
#Final model
#If there were no multicollinearity, model2 would be kept as is:
#final_model <- model2
#With multicollinearity, the template suggests removing the variable that comes
#first alphabetically; here both infmort and under5mort are dropped
final_model <- lm(y ~ adultmort+alcohol+hepB+totalexp+diptheria+population+income+schooling, data=developed)
summary(final_model)
##
## Call:
## lm(formula = y ~ adultmort + alcohol + hepB + totalexp + diptheria +
## population + income + schooling, data = developed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1550 -1.5411 -0.8003 1.0925 10.4292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.480e+01 4.397e+00 7.914 1e-13 ***
## adultmort -1.151e-02 3.706e-03 -3.105 0.002138 **
## alcohol -1.470e-01 7.941e-02 -1.851 0.065375 .
## hepB 1.861e-02 9.492e-03 1.960 0.051170 .
## totalexp 1.331e-01 7.335e-02 1.815 0.070777 .
## diptheria -2.766e-02 2.602e-02 -1.063 0.288782
## population 1.642e-08 1.043e-08 1.574 0.116788
## income 6.464e+01 5.062e+00 12.769 < 2e-16 ***
## schooling -4.975e-01 1.400e-01 -3.552 0.000462 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.785 on 233 degrees of freedom
## (270 observations deleted due to missingness)
## Multiple R-squared: 0.5893, Adjusted R-squared: 0.5752
## F-statistic: 41.79 on 8 and 233 DF, p-value: < 2.2e-16
vif(final_model)
## adultmort alcohol hepB totalexp diptheria population income
## 1.295371 1.026577 1.072943 1.168963 1.075280 1.063686 2.187208
## schooling
## 1.891090
#5. Residual Analysis
Do the residuals meet the assumptions of constant variance and normality? Identify any outliers and/or leverage points (as many as you can). Are there any values that are both an outlier and have high leverage? (6)
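The residuals appear to meet the assumptions of constant variance and normality: the versus-fits scatterplot shows no clear pattern and the histogram is roughly symmetric. That said, the residual summary (minimum -5.16, median -0.80, maximum 10.43) hints at some right skew, so the largest positive residuals deserve a closer look as potential outliers. There are no values that are both an outlier and a high-leverage point.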
#Residual analysis
e <-resid(final_model)
#Histogram
hist(e)
#Versus fits plot
yhat <-predict(final_model)
plot(yhat, e)
#Studentized residuals vs leverage plot (olsrr)
ols_plot_resid_lev(final_model)
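The plots above show outliers and leverage points visually. To list them by observation, the same diagnostics can be tabulated in base R, as in the sketch below. The cutoffs are common rules of thumb and are assumptions here (absolute studentized residual above 2, leverage above 2p/n with p the number of coefficients). The lecture's criteria and the thresholds used by ols_plot_resid_lev may differ. The mtcars model is a stand-in for final_model.

```r
# Minimal sketch: flag outliers and high-leverage points numerically.
# fit is a stand-in model on built-in data; replace with final_model in practice.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

r_stud <- rstudent(fit)    # externally studentized residuals
h <- hatvalues(fit)        # leverage (diagonal of the hat matrix)
p <- length(coef(fit))     # number of coefficients, intercept included
n <- nobs(fit)
lev_cut <- 2 * p / n       # rule-of-thumb leverage cutoff (assumption)

diag_tbl <- data.frame(rstudent = r_stud,
                       leverage = h,
                       outlier = abs(r_stud) > 2,
                       high_leverage = h > lev_cut)

# Observations flagged on at least one criterion
print(diag_tbl[diag_tbl$outlier | diag_tbl$high_leverage, ])

# Observations that are both an outlier and a high-leverage point
both <- rownames(diag_tbl)[diag_tbl$outlier & diag_tbl$high_leverage]
print(both)

# A formal normality check to complement the histogram
print(shapiro.test(resid(fit)))
```

Points that appear in both lists are the ones most likely to influence the fitted coefficients, so they are the first candidates for closer inspection.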