Assignment Details :
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
—– Importing provided dataset —–
This dataset was sourced from Kaggle (https://www.kaggle.com/). It contains life expectancy, health, immunization, economic, and demographic information for 179 countries over the years 2000-2015. The adjusted dataset has 21 variables and 2,864 rows, and it was last updated in April 2023.
life_exp_url <- 'https://raw.githubusercontent.com/tagensingh/sps_data605_wk12_discussion/main/wk_12_discussion_life_expectancy.csv'
life_exp_raw <- read.csv(life_exp_url, header=TRUE)
The following is a summary of the provided dataset :
summary(life_exp_raw)
## Country Region Year Infant_deaths
## Length:2864 Length:2864 Min. :2000 Min. : 1.80
## Class :character Class :character 1st Qu.:2004 1st Qu.: 8.10
## Mode :character Mode :character Median :2008 Median : 19.60
## Mean :2008 Mean : 30.36
## 3rd Qu.:2011 3rd Qu.: 47.35
## Max. :2015 Max. :138.10
## Under_five_deaths Adult_mortality Alcohol_consumption Hepatitis_B
## Min. : 2.300 Min. : 49.38 Min. : 0.000 Min. :12.00
## 1st Qu.: 9.675 1st Qu.:106.91 1st Qu.: 1.200 1st Qu.:78.00
## Median : 23.100 Median :163.84 Median : 4.020 Median :89.00
## Mean : 42.938 Mean :192.25 Mean : 4.821 Mean :84.29
## 3rd Qu.: 66.000 3rd Qu.:246.79 3rd Qu.: 7.777 3rd Qu.:96.00
## Max. :224.900 Max. :719.36 Max. :17.870 Max. :99.00
## Measles BMI Polio Diphtheria
## Min. :10.00 Min. :19.80 Min. : 8.0 Min. :16.00
## 1st Qu.:64.00 1st Qu.:23.20 1st Qu.:81.0 1st Qu.:81.00
## Median :83.00 Median :25.50 Median :93.0 Median :93.00
## Mean :77.34 Mean :25.03 Mean :86.5 Mean :86.27
## 3rd Qu.:93.00 3rd Qu.:26.40 3rd Qu.:97.0 3rd Qu.:97.00
## Max. :99.00 Max. :32.10 Max. :99.0 Max. :99.00
## Incidents_HIV GDP_per_capita Population_mln
## Min. : 0.0100 Min. : 148 Min. : 0.080
## 1st Qu.: 0.0800 1st Qu.: 1416 1st Qu.: 2.098
## Median : 0.1500 Median : 4217 Median : 7.850
## Mean : 0.8943 Mean : 11541 Mean : 36.676
## 3rd Qu.: 0.4600 3rd Qu.: 12557 3rd Qu.: 23.688
## Max. :21.6800 Max. :112418 Max. :1379.860
## Thinness_ten_nineteen_years Thinness_five_nine_years Schooling
## Min. : 0.100 Min. : 0.1 Min. : 1.100
## 1st Qu.: 1.600 1st Qu.: 1.6 1st Qu.: 5.100
## Median : 3.300 Median : 3.4 Median : 7.800
## Mean : 4.866 Mean : 4.9 Mean : 7.632
## 3rd Qu.: 7.200 3rd Qu.: 7.3 3rd Qu.:10.300
## Max. :27.700 Max. :28.6 Max. :14.100
## Economy_status_Developed Economy_status_Developing Life_expectancy
## Min. :0.0000 Min. :0.0000 Min. :39.40
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:62.70
## Median :0.0000 Median :1.0000 Median :71.40
## Mean :0.2067 Mean :0.7933 Mean :68.86
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:75.40
## Max. :1.0000 Max. :1.0000 Max. :83.80
The following are the first six rows of the dataset.
head(life_exp_raw)
## Country Region Year Infant_deaths Under_five_deaths
## 1 Turkiye Middle East 2015 11.1 13.0
## 2 Spain European Union 2015 2.7 3.3
## 3 India Asia 2007 51.5 67.9
## 4 Guyana South America 2006 32.8 40.5
## 5 Israel Middle East 2012 3.4 4.3
## 6 Costa Rica Central America and Caribbean 2006 9.8 11.2
## Adult_mortality Alcohol_consumption Hepatitis_B Measles BMI Polio Diphtheria
## 1 105.8240 1.32 97 65 27.8 97 97
## 2 57.9025 10.35 97 94 26.0 97 97
## 3 201.0765 1.57 60 35 21.2 67 64
## 4 222.1965 5.68 93 74 25.3 92 93
## 5 57.9510 2.89 97 89 27.0 94 94
## 6 95.2200 4.19 88 86 26.4 89 89
## Incidents_HIV GDP_per_capita Population_mln Thinness_ten_nineteen_years
## 1 0.08 11006 78.53 4.9
## 2 0.09 25742 46.44 0.6
## 3 0.13 1076 1183.21 27.1
## 4 0.79 4146 0.75 5.7
## 5 0.08 33995 7.91 1.2
## 6 0.16 9110 4.35 2.0
## Thinness_five_nine_years Schooling Economy_status_Developed
## 1 4.8 7.8 0
## 2 0.5 9.7 1
## 3 28.0 5.0 0
## 4 5.5 7.9 0
## 5 1.1 12.8 1
## 6 1.9 7.9 0
## Economy_status_Developing Life_expectancy
## 1 1 76.5
## 2 0 82.8
## 3 1 65.4
## 4 1 67.0
## 5 0 81.7
## 6 1 78.2
sort(colnames(life_exp_raw))
## [1] "Adult_mortality" "Alcohol_consumption"
## [3] "BMI" "Country"
## [5] "Diphtheria" "Economy_status_Developed"
## [7] "Economy_status_Developing" "GDP_per_capita"
## [9] "Hepatitis_B" "Incidents_HIV"
## [11] "Infant_deaths" "Life_expectancy"
## [13] "Measles" "Polio"
## [15] "Population_mln" "Region"
## [17] "Schooling" "Thinness_five_nine_years"
## [19] "Thinness_ten_nineteen_years" "Under_five_deaths"
## [21] "Year"
# The dimensions of the dataset :
dim(life_exp_raw)
## [1] 2864 21
# Checking for missing (NA) values : list every row that is incomplete
life_exp_raw[!complete.cases(life_exp_raw),]
## [1] Country Region
## [3] Year Infant_deaths
## [5] Under_five_deaths Adult_mortality
## [7] Alcohol_consumption Hepatitis_B
## [9] Measles BMI
## [11] Polio Diphtheria
## [13] Incidents_HIV GDP_per_capita
## [15] Population_mln Thinness_ten_nineteen_years
## [17] Thinness_five_nine_years Schooling
## [19] Economy_status_Developed Economy_status_Developing
## [21] Life_expectancy
## <0 rows> (or 0-length row.names)
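The check above returns zero rows, so no observation has a missing value. As a complementary sketch, the snippet below counts missing values per column, which can be easier to read when a dataset does contain gaps. The data frame `toy` and its values are made up for illustration and are not part of the life expectancy data.

```r
# Toy data frame standing in for life_exp_raw (values are made up).
toy <- data.frame(
  Country   = c("A", "B", "C", "D"),
  Schooling = c(7.8, NA, 5.0, 12.8),
  BMI       = c(27.8, 26.0, NA, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value.
n_incomplete <- sum(!complete.cases(toy))
print(n_incomplete)
```

Applied to `life_exp_raw`, both quantities would be zero, in agreement with the empty result shown above.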
Selecting the Columns for Regression Analysis
# select() comes from dplyr; the namespace prefix avoids needing library(dplyr) and avoids masking by MASS::select()
life_exp <- dplyr::select(life_exp_raw,Life_expectancy,Adult_mortality,Hepatitis_B,Measles,Polio,Diphtheria,Incidents_HIV,GDP_per_capita,BMI,Infant_deaths)
Creating the Quadratic Variable (Schooling squared) :
life_exp$le_quad <- (life_exp_raw$Schooling)^2
summary(life_exp$le_quad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.21 26.01 60.84 68.30 106.09 198.81
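Creating the Dichotomous vs. Quantitative Interaction Variable (Economy_status_Developed x Alcohol_consumption) :
Note : despite its name, le_dichotomous is not a 0/1 variable. It equals Alcohol_consumption for developed economies and 0 otherwise. The dichotomous indicator itself, Economy_status_Developed, is not entered in the model as a separate term.
life_exp$le_dichotomous <- life_exp_raw$Alcohol_consumption * life_exp_raw$Economy_status_Developed
summary(life_exp$le_dichotomous)
## Min. 1st Qu. Median Mean 3rd Qu. Max.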
## 0.000 0.000 0.000 2.077 0.000 17.870
Creating the Dichotomous vs. Quadratic Interaction Variable (Schooling^2 x Economy_status_Developing x Alcohol_consumption) :
life_exp$le_dich_quad <- life_exp$le_quad * life_exp_raw$Economy_status_Developing * life_exp_raw$Alcohol_consumption
summary(life_exp$le_dich_quad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4055 45.3852 179.7748 252.9990 2088.0000
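The derived columns above are products computed by hand. As a hedged sketch of an alternative, R's formula syntax can express a quadratic term with `I()` and an interaction with `:`, and it gives the same fit as precomputed product columns. The data frame `sim`, its column names, and the coefficients used to simulate it are all hypothetical. Unlike the model in this write-up, the sketch also includes the main effects (schooling, alcohol, and the 0/1 indicator), which is a common convention but an assumption here rather than the author's specification.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (not the life expectancy dataset).
sim <- data.frame(
  schooling = runif(n, 1, 14),
  alcohol   = runif(n, 0, 15),
  developed = rbinom(n, 1, 0.3)   # dichotomous 0/1 indicator
)
sim$life_exp <- 55 + 0.8 * sim$schooling + 0.03 * sim$schooling^2 +
  3 * sim$developed + 0.1 * sim$alcohol +
  0.2 * sim$alcohol * sim$developed + rnorm(n)

# Terms written directly in the formula:
#   I(schooling^2)    -> quadratic term
#   developed         -> dichotomous term
#   alcohol:developed -> dichotomous vs. quantitative interaction
fit_formula <- lm(life_exp ~ schooling + I(schooling^2) + developed +
                    alcohol + alcohol:developed, data = sim)

# The same terms as precomputed columns.
sim$schooling_sq <- sim$schooling^2
sim$alc_x_dev    <- sim$alcohol * sim$developed
fit_manual <- lm(life_exp ~ schooling + schooling_sq + developed +
                   alcohol + alc_x_dev, data = sim)

# Both specifications produce the same coefficient estimates.
same_fit <- isTRUE(all.equal(unname(coef(fit_formula)),
                             unname(coef(fit_manual))))
print(same_fit)
```

Keeping the terms in the formula makes the role of each one visible in the `summary()` output, and it lets `predict()` rebuild them automatically for new data.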
Generating the initial Regression Model - 1 Predicted Variable - 12 Predictor Variables :
life_exp_0.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)
The Regression Details
summary(life_exp_0.lm)
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita +
## le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths,
## data = life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3181 -0.9630 -0.0053 0.9512 5.3601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.365e+01 5.610e-01 149.104 < 2e-16 ***
## Adult_mortality -5.019e-02 6.441e-04 -77.919 < 2e-16 ***
## Hepatitis_B -1.365e-02 2.646e-03 -5.157 2.68e-07 ***
## Measles 2.330e-04 1.789e-03 0.130 0.8964
## Polio 1.351e-03 6.124e-03 0.221 0.8254
## Diphtheria 1.258e-02 6.137e-03 2.050 0.0405 *
## Incidents_HIV 1.103e-01 1.908e-02 5.779 8.30e-09 ***
## GDP_per_capita 2.439e-05 2.342e-06 10.417 < 2e-16 ***
## le_quad 5.911e-03 1.193e-03 4.957 7.58e-07 ***
## le_dichotomous 8.375e-02 1.121e-02 7.471 1.05e-13 ***
## le_dich_quad 9.468e-04 1.360e-04 6.960 4.20e-12 ***
## BMI -9.409e-02 1.739e-02 -5.411 6.77e-08 ***
## Infant_deaths -1.312e-01 2.709e-03 -48.425 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 2851 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.977
## F-statistic: 1.014e+04 on 12 and 2851 DF, p-value: < 2.2e-16
Initial Regression summary Discussion :
Residuals :
The median is very close to zero (-0.0053), the 1Q and 3Q values are almost perfectly balanced (-0.963 and 0.951), and the Min and Max are roughly even, with a slightly longer right tail (-4.32 vs. 5.36). This is consistent with a residual distribution that is close to “Normal”.
The Multiple R-squared of 0.9771 and the Adjusted R-squared of 0.977 are both very close to 1, which indicates that the model explains about 97.7% of the variation in life expectancy and fits the data well.
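To make the two fit statistics concrete, the sketch below computes R-squared and Adjusted R-squared by hand on simulated data and then applies the adjustment formula to the figures reported above (R-squared 0.9771, 2864 observations, 2851 residual degrees of freedom). The variables `x1`, `x2` and `y` are simulated and purely illustrative.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
p   <- length(coef(fit)) - 1     # number of predictors (intercept excluded)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same adjustment applied to the values reported in the summary above:
# n - p - 1 is the residual degrees of freedom (2851).
reported_adj <- 1 - (1 - 0.9771) * (2864 - 1) / 2851
print(round(reported_adj, 3))
```

With 2864 rows and only 12 predictors the penalty is tiny, which is why the two statistics differ only in the fourth decimal place.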
However, a high R-squared does not rule out multicollinearity, so we next examine the predictors for it.
Testing for and Freeing From Multicollinearity among Variables
Multicollinearity occurs when two or more predictor variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model.
If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.
To test this model for Multicollinearity we will employ the “imcdiag” function from the “mctest” library and examine the Variance Inflation Factor (VIF) score.
Note : As a rule of thumb, VIF scores over 5 indicate moderate multicollinearity, and scores over 10 are very problematic.
Using the VIF measure, we see that 8 of the 12 predictor variables possess low VIF scores (below 5), indicating that they are not strongly correlated with the other predictors, but the following variables range from moderate to problematic :
Adult_mortality - VIF score is 7.7 - Moderately multicollinear
Infant_deaths - VIF score is 7.8 - Moderately multicollinear
Polio - VIF score is 12.0 - Problematic - Will be removed from the model
Diphtheria - VIF score is 12.8 - Problematic - Will be removed from the model
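For readers who want to see where a VIF score comes from, here is a minimal sketch that computes it by hand: each predictor is regressed on all the other predictors, and VIF = 1 / (1 - R^2) of that auxiliary regression. The tolerance (the TOL column in the diagnostics output) is its reciprocal. The predictors `x1`, `x2` and `x3` are simulated for illustration. This is not a reimplementation of `imcdiag`, only of the standard VIF definition.

```r
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                  # unrelated to the other two
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_manual, 2))

# Tolerance is the reciprocal of VIF.
tolerance <- 1 / vif_manual
print(round(tolerance, 4))
```

In this toy setup `x1` and `x2` receive very large scores while `x3` stays near 1, the same pattern seen for Polio and Diphtheria relative to the low-VIF predictors.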
library(mctest)  # provides imcdiag()
imcdiag(life_exp_0.lm)
##
## Call:
## imcdiag(mod = life_exp_0.lm)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1
## Adult_mortality 7.7110 0.1297 1739.9711 1914.6393 0.3601 -0.0477 0 0.0005
## Hepatitis_B 2.5224 0.3964 394.7215 434.3458 0.6296 -0.0156 0 0.0015
## Measles 1.5690 0.6374 147.5183 162.3271 0.7983 -0.0097 0 0.0025
## Polio 12.0060 0.0833 2853.5539 3140.0099 0.2886 -0.0743 0 0.0003
## Diphtheria 12.7943 0.0782 3057.9513 3364.9259 0.2796 -0.0792 0 0.0003
## Incidents_HIV 2.9062 0.3441 494.2268 543.8401 0.5866 -0.0180 0 0.0013
## GDP_per_capita 2.2133 0.4518 314.5877 346.1678 0.6722 -0.0137 0 0.0017
## le_quad 4.5757 0.2185 927.0764 1020.1416 0.4675 -0.0283 0 0.0008
## le_dichotomous 3.1580 0.3167 559.5014 615.6673 0.5627 -0.0196 0 0.0012
## le_dich_quad 2.1173 0.4723 289.6765 318.7558 0.6872 -0.0131 0 0.0018
## BMI 2.0481 0.4883 271.7478 299.0274 0.6988 -0.0127 0 0.0019
## Infant_deaths 7.8355 0.1276 1772.2539 1950.1629 0.3572 -0.0485 0 0.0005
## IND2
## Adult_mortality 1.2650
## Hepatitis_B 0.8773
## Measles 0.5271
## Polio 1.3325
## Diphtheria 1.3399
## Incidents_HIV 0.9534
## GDP_per_capita 0.7968
## le_quad 1.1359
## le_dichotomous 0.9933
## le_dich_quad 0.7670
## BMI 0.7438
## Infant_deaths 1.2680
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Measles , Polio , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.9771
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Adjusting Model by removing both Problematic Variables
# Removed both Polio and Diphtheria variables
life_exp_1.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)
The Regression Details of the Adjusted Model :
summary(life_exp_1.lm)
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Measles + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths, data = life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3887 -0.9663 0.0045 0.9450 5.5421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.461e+01 5.051e-01 167.509 < 2e-16 ***
## Adult_mortality -5.015e-02 6.454e-04 -77.700 < 2e-16 ***
## Hepatitis_B -6.881e-03 2.059e-03 -3.341 0.000844 ***
## Measles 9.021e-04 1.777e-03 0.508 0.611736
## Incidents_HIV 1.159e-01 1.908e-02 6.073 1.42e-09 ***
## GDP_per_capita 2.411e-05 2.345e-06 10.283 < 2e-16 ***
## le_quad 6.055e-03 1.195e-03 5.068 4.27e-07 ***
## le_dichotomous 8.234e-02 1.120e-02 7.350 2.58e-13 ***
## le_dich_quad 9.100e-04 1.359e-04 6.695 2.59e-11 ***
## BMI -1.047e-01 1.722e-02 -6.078 1.38e-09 ***
## Infant_deaths -1.356e-01 2.462e-03 -55.071 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 2853 degrees of freedom
## Multiple R-squared: 0.977, Adjusted R-squared: 0.9769
## F-statistic: 1.21e+04 on 10 and 2853 DF, p-value: < 2.2e-16
The Multicollinearity test results for the Adjusted Model :
imcdiag(life_exp_1.lm)
##
## Call:
## imcdiag(mod = life_exp_1.lm)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1
## Adult_mortality 7.7026 0.1298 2125.4629 2391.9835 0.3603 -0.0613 0 0.0004
## Hepatitis_B 1.5193 0.6582 164.6644 185.3124 0.8113 -0.0121 0 0.0021
## Measles 1.5396 0.6495 171.1146 192.5714 0.8059 -0.0123 0 0.0020
## Incidents_HIV 2.8901 0.3460 599.3814 674.5403 0.5882 -0.0230 0 0.0011
## GDP_per_capita 2.2085 0.4528 383.2344 431.2898 0.6729 -0.0176 0 0.0014
## le_quad 4.5685 0.2189 1131.6196 1273.5181 0.4679 -0.0364 0 0.0007
## le_dichotomous 3.1385 0.3186 678.1369 763.1713 0.5645 -0.0250 0 0.0010
## le_dich_quad 2.1033 0.4755 349.8557 393.7256 0.6895 -0.0167 0 0.0015
## BMI 1.9999 0.5000 317.0702 356.8289 0.7071 -0.0159 0 0.0016
## Infant_deaths 6.4356 0.1554 1723.6941 1939.8353 0.3942 -0.0512 0 0.0005
## IND2
## Adult_mortality 1.4276
## Hepatitis_B 0.5607
## Measles 0.5750
## Incidents_HIV 1.0730
## GDP_per_capita 0.8978
## le_quad 1.2815
## le_dichotomous 1.1179
## le_dich_quad 0.8606
## BMI 0.8203
## Infant_deaths 1.3857
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Measles , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.977
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Discussion of Adjusted model and Multicollinearity tests :
Residuals :
The median is now essentially zero (0.0045), with 1Q and 3Q almost perfectly balanced. The Min and Max are roughly even, with a slightly longer right tail. This is consistent with a residual distribution that is close to “Normal”.
The Multiple R-squared of 0.977 and the Adjusted R-squared of 0.9769 are both very close to 1, which indicates that the model is still a good fit to the data.
There were some changes to the VIF scores of the following variables :
Adult_mortality - VIF score remains at 7.7 - Moderately multicollinear
Infant_deaths - VIF score is 6.4, down from 7.8 - Moderately multicollinear
Overall, this adjusted model is preferable: the fit is essentially unchanged (R-squared of 0.977 vs. 0.9771), and the severely multicollinear variables have been removed.
Note - the computed variables le_quad, le_dichotomous and le_dich_quad show low to borderline-moderate multicollinearity (VIF scores of 4.6, 3.1 and 2.1, all below 5).
When we examine a histogram of the residuals of the regression, we see that they are approximately normally distributed.
Reviewing the model by removing only one problematic variable at a time.
Note that when the “polio” variable was removed from the linear model, the VIF score for the “Diphtheria” variable vastly improved from 12.8 to 3.9
AND
when the “diphtheria” variable was removed from the linear model, the VIF score for the “polio” variable vastly improved from 12.0 to 3.6
MODEL Improvement - Remove ONLY one of the two problematic variables from the original model
# Removed only the Polio variable
life_exp_1_1.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)
summary(life_exp_1_1.lm)
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Measles + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad +
## le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3269 -0.9643 -0.0044 0.9502 5.3614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.366e+01 5.555e-01 150.611 < 2e-16 ***
## Adult_mortality -5.018e-02 6.437e-04 -77.960 < 2e-16 ***
## Hepatitis_B -1.366e-02 2.646e-03 -5.163 2.60e-07 ***
## Measles 2.749e-04 1.779e-03 0.155 0.877
## Diphtheria 1.371e-02 3.374e-03 4.064 4.96e-05 ***
## Incidents_HIV 1.104e-01 1.907e-02 5.787 7.94e-09 ***
## GDP_per_capita 2.437e-05 2.339e-06 10.418 < 2e-16 ***
## le_quad 5.919e-03 1.192e-03 4.966 7.24e-07 ***
## le_dichotomous 8.356e-02 1.118e-02 7.476 1.01e-13 ***
## le_dich_quad 9.453e-04 1.358e-04 6.959 4.23e-12 ***
## BMI -9.399e-02 1.738e-02 -5.409 6.88e-08 ***
## Infant_deaths -1.313e-01 2.670e-03 -49.177 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 2852 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.977
## F-statistic: 1.106e+04 on 11 and 2852 DF, p-value: < 2.2e-16
imcdiag(life_exp_1_1.lm)
##
## Call:
## imcdiag(mod = life_exp_1_1.lm)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1
## Adult_mortality 7.7041 0.1298 1912.6659 2125.9292 0.3603 -0.0537 0 0.0005
## Hepatitis_B 2.5216 0.3966 434.1120 482.5158 0.6297 -0.0176 0 0.0014
## Measles 1.5513 0.6446 157.2805 174.8174 0.8029 -0.0108 0 0.0023
## Diphtheria 3.8681 0.2585 818.2668 909.5040 0.5085 -0.0269 0 0.0009
## Incidents_HIV 2.9046 0.3443 543.3928 603.9814 0.5868 -0.0202 0 0.0012
## GDP_per_capita 2.2101 0.4525 345.2534 383.7494 0.6727 -0.0154 0 0.0016
## le_quad 4.5721 0.2187 1019.1273 1132.7606 0.4677 -0.0319 0 0.0008
## le_dichotomous 3.1407 0.3184 610.7559 678.8555 0.5643 -0.0219 0 0.0011
## le_dich_quad 2.1119 0.4735 317.2247 352.5955 0.6881 -0.0147 0 0.0017
## BMI 2.0469 0.4885 298.6797 331.9826 0.6990 -0.0143 0 0.0017
## Infant_deaths 7.6119 0.1314 1886.3653 2096.6961 0.3625 -0.0530 0 0.0005
## IND2
## Adult_mortality 1.3400
## Hepatitis_B 0.9292
## Measles 0.5472
## Diphtheria 1.1418
## Incidents_HIV 1.0098
## GDP_per_capita 0.8432
## le_quad 1.2031
## le_dichotomous 1.0496
## le_dich_quad 0.8108
## BMI 0.7876
## Infant_deaths 1.3376
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Measles , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.9771
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
# Removed Diphtheria variable
life_exp_1_2.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)
summary(life_exp_1_2.lm)
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Measles + Polio + Incidents_HIV + GDP_per_capita + le_quad +
## le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2766 -0.9560 0.0008 0.9405 5.3998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.378e+01 5.577e-01 150.230 < 2e-16 ***
## Adult_mortality -5.021e-02 6.444e-04 -77.923 < 2e-16 ***
## Hepatitis_B -1.169e-02 2.469e-03 -4.734 2.31e-06 ***
## Measles 8.168e-05 1.789e-03 0.046 0.96358
## Polio 1.184e-02 3.369e-03 3.513 0.00045 ***
## Incidents_HIV 1.110e-01 1.909e-02 5.817 6.65e-09 ***
## GDP_per_capita 2.447e-05 2.343e-06 10.447 < 2e-16 ***
## le_quad 5.892e-03 1.193e-03 4.938 8.34e-07 ***
## le_dichotomous 8.483e-02 1.120e-02 7.571 4.96e-14 ***
## le_dich_quad 9.487e-04 1.361e-04 6.971 3.90e-12 ***
## BMI -9.778e-02 1.730e-02 -5.651 1.75e-08 ***
## Infant_deaths -1.316e-01 2.704e-03 -48.672 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.427 on 2852 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.977
## F-statistic: 1.105e+04 on 11 and 2852 DF, p-value: < 2.2e-16
imcdiag(life_exp_1_2.lm)
##
## Call:
## imcdiag(mod = life_exp_1_2.lm)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1
## Adult_mortality 7.7086 0.1297 1913.9715 2127.3804 0.3602 -0.0535 0 0.0005
## Hepatitis_B 2.1925 0.4561 340.2164 378.1508 0.6754 -0.0152 0 0.0016
## Measles 1.5663 0.6384 161.5653 179.5799 0.7990 -0.0109 0 0.0022
## Polio 3.6298 0.2755 750.2683 833.9237 0.5249 -0.0252 0 0.0010
## Incidents_HIV 2.9051 0.3442 543.5266 604.1301 0.5867 -0.0202 0 0.0012
## GDP_per_capita 2.2127 0.4519 345.9876 384.5655 0.6723 -0.0154 0 0.0016
## le_quad 4.5754 0.2186 1020.0609 1133.7983 0.4675 -0.0318 0 0.0008
## le_dichotomous 3.1510 0.3174 613.6821 682.1080 0.5633 -0.0219 0 0.0011
## le_dich_quad 2.1172 0.4723 318.7268 354.2651 0.6873 -0.0147 0 0.0017
## BMI 2.0261 0.4936 292.7446 325.3858 0.7025 -0.0141 0 0.0017
## Infant_deaths 7.7949 0.1283 1938.5961 2154.7506 0.3582 -0.0541 0 0.0004
## IND2
## Adult_mortality 1.3533
## Hepatitis_B 0.8458
## Measles 0.5622
## Polio 1.1266
## Incidents_HIV 1.0197
## GDP_per_capita 0.8522
## le_quad 1.2151
## le_dichotomous 1.0615
## le_dich_quad 0.8205
## BMI 0.7875
## Infant_deaths 1.3555
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Measles , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.9771
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
The New Adjusted model is life_exp_1_2.lm, fitted above with “Diphtheria” removed and “Polio” retained (11 Predictor Variables + 1 Predicted Variable). Its regression summary and multicollinearity diagnostics are the ones shown directly above.
Final Adjusted model discussion :
The overall model does not change significantly.
The Adjusted \(R^2\) remains high at > 97% even after removing the “Diphtheria” variable.
The p-values for the individual coefficients are low (all below 0.001) except for the “Measles” variable (p = 0.96).
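One way to back up the statement that dropping a variable leaves the model essentially unchanged is a nested-model F-test with `anova()`. The sketch below uses simulated data in which `x2` is nearly redundant given `x1`. The variable names and the setup are illustrative assumptions, and this test was not part of the original analysis.

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # collinear with x1, no effect of its own
y  <- 2 + 1.5 * x1 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)           # same model with x2 dropped

# F-test of the reduced model against the full model.
cmp <- anova(reduced, full)
print(cmp)

# p-value for the null hypothesis that the dropped term adds nothing.
p_value <- cmp[["Pr(>F)"]][2]

# Change in R-squared caused by dropping x2.
r2_change <- summary(full)$r.squared - summary(reduced)$r.squared
print(r2_change)
```

A large p-value together with a negligible change in R-squared would support keeping the smaller model. In this write-up the analogous comparison would be `anova(life_exp_1_2.lm, life_exp_0.lm)`.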
Let us try the stepAIC function on our model.
Using the initial model with all initial variables
life_exp_0.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)
library(MASS)  # provides stepAIC()
stepAIC(life_exp_0.lm, direction="both")
## Start: AIC=2046.35
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio +
## Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths
##
## Df Sum of Sq RSS AIC
## - Measles 1 0.0 5798.8 2044.4
## - Polio 1 0.1 5798.9 2044.4
## <none> 5798.8 2046.3
## - Diphtheria 1 8.5 5807.3 2048.6
## - le_quad 1 50.0 5848.8 2068.9
## - Hepatitis_B 1 54.1 5852.9 2070.9
## - BMI 1 59.6 5858.3 2073.6
## - Incidents_HIV 1 67.9 5866.7 2077.7
## - le_dich_quad 1 98.5 5897.3 2092.6
## - le_dichotomous 1 113.5 5912.3 2099.9
## - GDP_per_capita 1 220.7 6019.5 2151.3
## - Infant_deaths 1 4769.6 10568.4 3763.4
## - Adult_mortality 1 12348.8 18147.5 5311.8
##
## Step: AIC=2044.37
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Polio + Diphtheria +
## Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths
##
## Df Sum of Sq RSS AIC
## - Polio 1 0.1 5798.9 2042.4
## <none> 5798.8 2044.4
## + Measles 1 0.0 5798.8 2046.3
## - Diphtheria 1 8.5 5807.3 2046.6
## - le_quad 1 51.1 5849.9 2067.5
## - Hepatitis_B 1 54.8 5853.6 2069.3
## - BMI 1 59.7 5858.5 2071.7
## - Incidents_HIV 1 68.1 5866.9 2075.8
## - le_dich_quad 1 99.0 5897.8 2090.9
## - le_dichotomous 1 113.8 5912.6 2098.0
## - GDP_per_capita 1 220.7 6019.5 2149.4
## - Infant_deaths 1 4770.8 10569.7 3761.7
## - Adult_mortality 1 12402.1 18201.0 5318.3
##
## Step: AIC=2042.42
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Diphtheria +
## Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths
##
## Df Sum of Sq RSS AIC
## <none> 5798.9 2042.4
## + Polio 1 0.1 5798.8 2044.4
## + Measles 1 0.0 5798.9 2044.4
## - Diphtheria 1 34.1 5833.0 2057.2
## - le_quad 1 51.4 5850.3 2065.7
## - Hepatitis_B 1 54.8 5853.7 2067.3
## - BMI 1 59.6 5858.5 2069.7
## - Incidents_HIV 1 68.3 5867.2 2074.0
## - le_dich_quad 1 98.9 5897.8 2088.9
## - le_dichotomous 1 113.9 5912.8 2096.1
## - GDP_per_capita 1 220.7 6019.6 2147.4
## - Infant_deaths 1 4917.3 10716.2 3799.2
## - Adult_mortality 1 12407.2 18206.1 5317.1
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths, data = life_exp)
##
## Coefficients:
## (Intercept) Adult_mortality Hepatitis_B Diphtheria
## 8.367e+01 -5.019e-02 -1.360e-02 1.376e-02
## Incidents_HIV GDP_per_capita le_quad le_dichotomous
## 1.105e-01 2.437e-05 5.942e-03 8.361e-02
## le_dich_quad BMI Infant_deaths
## 9.463e-04 -9.379e-02 -1.313e-01
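The AIC values in the trace above can be reproduced from the reported residual sums of squares. For a linear model, `stepAIC` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The sketch below plugs in the RSS values printed in the trace (rounded to one decimal there, so the match is approximate) and then checks the formula against `extractAIC()` on a small simulated model. The simulated `x` and `y` are illustrative only.

```r
# Values read from the stepAIC trace above.
n <- 2864

# Starting model: 12 predictors + intercept = 13 coefficients, RSS = 5798.8
aic_start <- n * log(5798.8 / n) + 2 * 13
print(round(aic_start, 2))

# Final model: 10 predictors + intercept = 11 coefficients, RSS = 5798.9
aic_final <- n * log(5798.9 / n) + 2 * 11
print(round(aic_final, 2))

# Check the formula against extractAIC() on simulated data.
set.seed(11)
x <- rnorm(50)
y <- 1 + x + rnorm(50)
fit <- lm(y ~ x)
manual_aic <- 50 * log(sum(residuals(fit)^2) / 50) + 2 * 2
builtin_aic <- extractAIC(fit)[2]
```

Dropping Measles and Polio barely changes the RSS, so the AIC falls by roughly 2 for each removed coefficient, which is why the procedure prefers the smaller model. Note that `AIC()` reports a different number for the same fit because it keeps the constant terms of the log-likelihood, while differences between models are unaffected.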
Applying the model recommended by stepAIC, which has the lowest AIC score (2042.42).
Note that this model excludes both the “Measles” and the “polio” variables.
The Adjusted \(R^2\) remains the same (0.977), and the overall model fit is about the same as that of the models suggested by the VIF scores.
life_exp_aic.lm <- lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)
summary(life_exp_aic.lm)
##
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B +
## Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous +
## le_dich_quad + BMI + Infant_deaths, data = life_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3226 -0.9662 -0.0040 0.9488 5.3671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.367e+01 5.537e-01 151.124 < 2e-16 ***
## Adult_mortality -5.019e-02 6.424e-04 -78.129 < 2e-16 ***
## Hepatitis_B -1.360e-02 2.621e-03 -5.191 2.24e-07 ***
## Diphtheria 1.376e-02 3.361e-03 4.093 4.37e-05 ***
## Incidents_HIV 1.105e-01 1.906e-02 5.796 7.52e-09 ***
## GDP_per_capita 2.437e-05 2.339e-06 10.419 < 2e-16 ***
## le_quad 5.942e-03 1.182e-03 5.027 5.29e-07 ***
## le_dichotomous 8.361e-02 1.117e-02 7.485 9.48e-14 ***
## le_dich_quad 9.463e-04 1.356e-04 6.976 3.76e-12 ***
## BMI -9.379e-02 1.733e-02 -5.413 6.70e-08 ***
## Infant_deaths -1.313e-01 2.669e-03 -49.186 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 2853 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.977
## F-statistic: 1.218e+04 on 10 and 2853 DF, p-value: < 2.2e-16
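Because Schooling and Alcohol_consumption enter this model only through the computed terms, their coefficients are easiest to interpret as marginal effects. The derivation below is a sketch based on how le_quad, le_dichotomous and le_dich_quad were defined earlier. It holds all other predictors fixed and describes associations, not causal effects. The evaluation points (Schooling = 10 years, Alcohol_consumption = 4 litres) are illustrative choices within the observed ranges.

```latex
% Terms involving Schooling (S) and Alcohol_consumption (A), with
% D_1 = Economy_status_Developed and D_0 = Economy_status_Developing:
%   le_quad = S^2,  le_dichotomous = A D_1,  le_dich_quad = S^2 D_0 A
\[
\widehat{LE} = \dots + \hat\beta_{q}\,S^{2} + \hat\beta_{d}\,A\,D_1
             + \hat\beta_{dq}\,S^{2}\,D_0\,A + \dots
\]
\[
\frac{\partial \widehat{LE}}{\partial S}
  = 2S\left(\hat\beta_{q} + \hat\beta_{dq}\,D_0\,A\right),
\qquad
\frac{\partial \widehat{LE}}{\partial A}
  = \hat\beta_{d}\,D_1 + \hat\beta_{dq}\,S^{2}\,D_0
\]
% With \hat\beta_q = 0.005942, \hat\beta_d = 0.08361, \hat\beta_{dq} = 0.0009463:
\[
\text{Developed } (D_0 = 0),\ S = 10:\quad
\frac{\partial \widehat{LE}}{\partial S} = 2(10)(0.005942) \approx 0.12
\]
\[
\text{Developing } (D_0 = 1),\ S = 10,\ A = 4:\quad
\frac{\partial \widehat{LE}}{\partial S}
  = 2(10)\bigl(0.005942 + 4 \times 0.0009463\bigr) \approx 0.19
\]
\[
\text{Developed: } \frac{\partial \widehat{LE}}{\partial A} = 0.08361,
\qquad
\text{Developing, } S = 10:\ 
\frac{\partial \widehat{LE}}{\partial A} = 0.0009463 \times 10^{2} \approx 0.095
\]
```

Read this way, one more year of schooling at the 10-year level is associated with roughly 0.12 more years of life expectancy in a developed economy and about 0.19 in a developing economy with alcohol consumption of 4. The remaining coefficients are ordinary slopes. For example, each additional unit of Adult_mortality is associated with about 0.05 fewer years of life expectancy.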
Residuals discussion
The histogram of the residuals is close to a normal distribution centered at 0.
In the Q-Q plot, points that lie close to the red line indicate normally distributed residuals. If the points deviate strongly from the line, consider adjusting the model by adding or removing variables. The model shown here is the result of such an adjustment.
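The visual checks can be complemented with a formal normality test. The sketch below applies the Shapiro-Wilk test from base R to the residuals of a model fitted to simulated data. The data and model are illustrative, and this test was not run in the original analysis. `shapiro.test()` accepts between 3 and 5000 values, so the 2864 residuals of `life_exp_aic.lm` would be within its limits.

```r
set.seed(5)
x <- rnorm(300)
y <- 2 + 3 * x + rnorm(300)     # errors are normal by construction
fit <- lm(y ~ x)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(fit))
print(sw)

w_stat  <- unname(sw$statistic)  # W close to 1 suggests normality
p_value <- sw$p.value
```

With thousands of observations the test can flag departures from normality that are too small to matter in practice, so it is best read alongside the histogram and Q-Q plot rather than instead of them.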
hist(life_exp_aic.lm$residuals, prob = TRUE)
abline(v = mean(life_exp_aic.lm$residuals), # Add line for mean
col = "red",
lwd = 3)
lines(density(life_exp_aic.lm$residuals),col = "green")
qqnorm(life_exp_aic.lm$residuals)
qqline(life_exp_aic.lm$residuals, col = "red")
The residuals-versus-fitted plot is consistent with a linear relationship between the predictors and life expectancy, although there is some evidence of heteroskedastic behavior.
plot(life_exp_aic.lm$fitted.values, life_exp_aic.lm$residuals,
xlab="Fitted Values", ylab="Residuals",
main="Residuals Plot",col = "greenyellow")
abline(h=0)
The proposed model fits the data and is an appropriate, credible predictor of life expectancy.
All models tested would yield good predictions of the data, but the stepAIC model is preferred.
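The heteroskedasticity noted for the residuals plot could be checked more formally. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test using only base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The simulated data, in which the error variance grows with `x`, are an illustrative assumption, and this test was not part of the original analysis.

```r
set.seed(9)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread increases with x
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor(s).
sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ x)

# Test statistic: n * R^2, chi-squared with one df per predictor.
lm_stat <- n * summary(aux)$r.squared
df_aux  <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df_aux, lower.tail = FALSE)
print(c(statistic = lm_stat, df = df_aux, p.value = p_value))
```

A small p-value indicates that the residual variance depends on the predictors. In that case the coefficient estimates are still unbiased, but the standard errors and p-values reported by `summary()` should be treated with caution.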
References :
https://www.kaggle.com/datasets/lashagoch/life-expectancy-who-updated
https://www.statology.org/multicollinearity-in-r