File Locations:

R Pubs Output

RMD on Github


Assignment Details :

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?


—– Importing provided dataset —–

This dataset was sourced from https://www.kaggle.com/. The data contains life expectancy, health, immunization, economic, and demographic information for 179 countries over the years 2000 to 2015. The adjusted dataset has 21 variables and 2,864 rows. It was last updated in April 2023.

life_exp_url <- 'https://raw.githubusercontent.com/tagensingh/sps_data605_wk12_discussion/main/wk_12_discussion_life_expectancy.csv'

life_exp_raw <- read.csv(life_exp_url, header=TRUE)




The following is a summary of the provided dataset.

summary(life_exp_raw)
##    Country             Region               Year      Infant_deaths   
##  Length:2864        Length:2864        Min.   :2000   Min.   :  1.80  
##  Class :character   Class :character   1st Qu.:2004   1st Qu.:  8.10  
##  Mode  :character   Mode  :character   Median :2008   Median : 19.60  
##                                        Mean   :2008   Mean   : 30.36  
##                                        3rd Qu.:2011   3rd Qu.: 47.35  
##                                        Max.   :2015   Max.   :138.10  
##  Under_five_deaths Adult_mortality  Alcohol_consumption  Hepatitis_B   
##  Min.   :  2.300   Min.   : 49.38   Min.   : 0.000      Min.   :12.00  
##  1st Qu.:  9.675   1st Qu.:106.91   1st Qu.: 1.200      1st Qu.:78.00  
##  Median : 23.100   Median :163.84   Median : 4.020      Median :89.00  
##  Mean   : 42.938   Mean   :192.25   Mean   : 4.821      Mean   :84.29  
##  3rd Qu.: 66.000   3rd Qu.:246.79   3rd Qu.: 7.777      3rd Qu.:96.00  
##  Max.   :224.900   Max.   :719.36   Max.   :17.870      Max.   :99.00  
##     Measles           BMI            Polio        Diphtheria   
##  Min.   :10.00   Min.   :19.80   Min.   : 8.0   Min.   :16.00  
##  1st Qu.:64.00   1st Qu.:23.20   1st Qu.:81.0   1st Qu.:81.00  
##  Median :83.00   Median :25.50   Median :93.0   Median :93.00  
##  Mean   :77.34   Mean   :25.03   Mean   :86.5   Mean   :86.27  
##  3rd Qu.:93.00   3rd Qu.:26.40   3rd Qu.:97.0   3rd Qu.:97.00  
##  Max.   :99.00   Max.   :32.10   Max.   :99.0   Max.   :99.00  
##  Incidents_HIV     GDP_per_capita   Population_mln    
##  Min.   : 0.0100   Min.   :   148   Min.   :   0.080  
##  1st Qu.: 0.0800   1st Qu.:  1416   1st Qu.:   2.098  
##  Median : 0.1500   Median :  4217   Median :   7.850  
##  Mean   : 0.8943   Mean   : 11541   Mean   :  36.676  
##  3rd Qu.: 0.4600   3rd Qu.: 12557   3rd Qu.:  23.688  
##  Max.   :21.6800   Max.   :112418   Max.   :1379.860  
##  Thinness_ten_nineteen_years Thinness_five_nine_years   Schooling     
##  Min.   : 0.100              Min.   : 0.1             Min.   : 1.100  
##  1st Qu.: 1.600              1st Qu.: 1.6             1st Qu.: 5.100  
##  Median : 3.300              Median : 3.4             Median : 7.800  
##  Mean   : 4.866              Mean   : 4.9             Mean   : 7.632  
##  3rd Qu.: 7.200              3rd Qu.: 7.3             3rd Qu.:10.300  
##  Max.   :27.700              Max.   :28.6             Max.   :14.100  
##  Economy_status_Developed Economy_status_Developing Life_expectancy
##  Min.   :0.0000           Min.   :0.0000            Min.   :39.40  
##  1st Qu.:0.0000           1st Qu.:1.0000            1st Qu.:62.70  
##  Median :0.0000           Median :1.0000            Median :71.40  
##  Mean   :0.2067           Mean   :0.7933            Mean   :68.86  
##  3rd Qu.:0.0000           3rd Qu.:1.0000            3rd Qu.:75.40  
##  Max.   :1.0000           Max.   :1.0000            Max.   :83.80


The following shows the first six rows of the dataset, followed by its column names in alphabetical order.

head(life_exp_raw)
##      Country                        Region Year Infant_deaths Under_five_deaths
## 1    Turkiye                   Middle East 2015          11.1              13.0
## 2      Spain                European Union 2015           2.7               3.3
## 3      India                          Asia 2007          51.5              67.9
## 4     Guyana                 South America 2006          32.8              40.5
## 5     Israel                   Middle East 2012           3.4               4.3
## 6 Costa Rica Central America and Caribbean 2006           9.8              11.2
##   Adult_mortality Alcohol_consumption Hepatitis_B Measles  BMI Polio Diphtheria
## 1        105.8240                1.32          97      65 27.8    97         97
## 2         57.9025               10.35          97      94 26.0    97         97
## 3        201.0765                1.57          60      35 21.2    67         64
## 4        222.1965                5.68          93      74 25.3    92         93
## 5         57.9510                2.89          97      89 27.0    94         94
## 6         95.2200                4.19          88      86 26.4    89         89
##   Incidents_HIV GDP_per_capita Population_mln Thinness_ten_nineteen_years
## 1          0.08          11006          78.53                         4.9
## 2          0.09          25742          46.44                         0.6
## 3          0.13           1076        1183.21                        27.1
## 4          0.79           4146           0.75                         5.7
## 5          0.08          33995           7.91                         1.2
## 6          0.16           9110           4.35                         2.0
##   Thinness_five_nine_years Schooling Economy_status_Developed
## 1                      4.8       7.8                        0
## 2                      0.5       9.7                        1
## 3                     28.0       5.0                        0
## 4                      5.5       7.9                        0
## 5                      1.1      12.8                        1
## 6                      1.9       7.9                        0
##   Economy_status_Developing Life_expectancy
## 1                         1            76.5
## 2                         0            82.8
## 3                         1            65.4
## 4                         1            67.0
## 5                         0            81.7
## 6                         1            78.2
sort(colnames(life_exp_raw))
##  [1] "Adult_mortality"             "Alcohol_consumption"        
##  [3] "BMI"                         "Country"                    
##  [5] "Diphtheria"                  "Economy_status_Developed"   
##  [7] "Economy_status_Developing"   "GDP_per_capita"             
##  [9] "Hepatitis_B"                 "Incidents_HIV"              
## [11] "Infant_deaths"               "Life_expectancy"            
## [13] "Measles"                     "Polio"                      
## [15] "Population_mln"              "Region"                     
## [17] "Schooling"                   "Thinness_five_nine_years"   
## [19] "Thinness_ten_nineteen_years" "Under_five_deaths"          
## [21] "Year"
#The dimension of the dataset :

dim(life_exp_raw)
## [1] 2864   21
#Checking for Missing (NA) Values : zero rows returned means no row has a missing value

life_exp_raw[!complete.cases(life_exp_raw),]
##  [1] Country                     Region                     
##  [3] Year                        Infant_deaths              
##  [5] Under_five_deaths           Adult_mortality            
##  [7] Alcohol_consumption         Hepatitis_B                
##  [9] Measles                     BMI                        
## [11] Polio                       Diphtheria                 
## [13] Incidents_HIV               GDP_per_capita             
## [15] Population_mln              Thinness_ten_nineteen_years
## [17] Thinness_five_nine_years    Schooling                  
## [19] Economy_status_Developed    Economy_status_Developing  
## [21] Life_expectancy            
## <0 rows> (or 0-length row.names)
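A more compact way to run the same kind of check is to count missing values per column instead of printing the incomplete rows. The sketch below uses a small made-up data frame (`toy`, not the life expectancy data) so the counts are easy to verify by eye; the names `toy`, `na_counts` and `n_incomplete` are illustrative assumptions.

```r
# Toy data frame for illustration only (not the life expectancy data).
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(1, NA, 3, 4),
  c = c(NA, NA, 3, 4)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Number of rows with at least one missing value.
n_incomplete <- sum(!complete.cases(toy))

print(na_counts)
print(n_incomplete)
```

Applied to life_exp_raw, the same two calls should report zero everywhere, which is what the empty result above implies.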



Selecting the Columns for Regression Analysis

library(dplyr)  # provides select()

life_exp <- select(life_exp_raw, Life_expectancy, Adult_mortality, Hepatitis_B, Measles, Polio, Diphtheria, Incidents_HIV, GDP_per_capita, BMI, Infant_deaths)

Creating the Quadratic Variable (Schooling squared) :

life_exp$le_quad <- (life_exp_raw$Schooling)^2

summary(life_exp$le_quad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.21   26.01   60.84   68.30  106.09  198.81

Creating the Dichotomous Term (Alcohol_consumption multiplied by the 0/1 indicator Economy_status_Developed) :

life_exp$le_dichotomous <- life_exp_raw$Alcohol_consumption * life_exp_raw$Economy_status_Developed

summary(life_exp$le_dichotomous)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.077   0.000  17.870

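Creating the Dichotomous vs Quadratic Interaction Term (le_quad multiplied by the 0/1 indicator Economy_status_Developing and by Alcohol_consumption) :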

life_exp$le_dich_quad <- life_exp$le_quad * life_exp_raw$Economy_status_Developing* life_exp_raw$Alcohol_consumption

summary(life_exp$le_dich_quad)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.4055   45.3852  179.7748  252.9990 2088.0000
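As an alternative to precomputing product columns, R's formula syntax can express a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction directly. The sketch below is an illustration on simulated data, not a re-specification of the model used in this analysis; the data frame `toy`, its columns, and the simulated coefficients are all assumptions.

```r
# Toy data only: simulated values, not the life expectancy dataset.
set.seed(1)
n <- 200
toy <- data.frame(
  schooling = runif(n, 1, 14),
  alcohol   = runif(n, 0, 18),
  developed = rbinom(n, 1, 0.2)   # dichotomous 0/1 indicator
)
toy$life <- 50 + 1.5 * toy$schooling + 0.05 * toy$schooling^2 +
  3 * toy$developed + 0.2 * toy$developed * toy$alcohol + rnorm(n)

# I() protects the square inside the formula;
# developed:alcohol is the dichotomous-by-quantitative interaction.
toy_fit <- lm(life ~ schooling + I(schooling^2) + alcohol + developed +
                developed:alcohol, data = toy)

toy_terms <- names(coef(toy_fit))
print(toy_terms)
```

Keeping the main effects (`alcohol`, `developed`) alongside the interaction lets the interaction coefficient be read as the difference in the alcohol slope between the two groups.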



Generating the initial Regression Model - 1 Predicted Variable - 12 Predictor Variables :

life_exp_0.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)

The Regression Details

summary(life_exp_0.lm)
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita + 
##     le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, 
##     data = life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3181 -0.9630 -0.0053  0.9512  5.3601 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.365e+01  5.610e-01 149.104  < 2e-16 ***
## Adult_mortality -5.019e-02  6.441e-04 -77.919  < 2e-16 ***
## Hepatitis_B     -1.365e-02  2.646e-03  -5.157 2.68e-07 ***
## Measles          2.330e-04  1.789e-03   0.130   0.8964    
## Polio            1.351e-03  6.124e-03   0.221   0.8254    
## Diphtheria       1.258e-02  6.137e-03   2.050   0.0405 *  
## Incidents_HIV    1.103e-01  1.908e-02   5.779 8.30e-09 ***
## GDP_per_capita   2.439e-05  2.342e-06  10.417  < 2e-16 ***
## le_quad          5.911e-03  1.193e-03   4.957 7.58e-07 ***
## le_dichotomous   8.375e-02  1.121e-02   7.471 1.05e-13 ***
## le_dich_quad     9.468e-04  1.360e-04   6.960 4.20e-12 ***
## BMI             -9.409e-02  1.739e-02  -5.411 6.77e-08 ***
## Infant_deaths   -1.312e-01  2.709e-03 -48.425  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.426 on 2851 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.977 
## F-statistic: 1.014e+04 on 12 and 2851 DF,  p-value: < 2.2e-16

Initial Regression summary Discussion :

Residuals :

The median residual is very close to zero (-0.0053), and 1Q and 3Q (-0.9630 and 0.9512) are almost perfectly balanced. The Min and Max (-4.3181 and 5.3601) are about even, with a very slight right tail. This suggests that the residual distribution is roughly symmetric and close to “Normal”.

The Multiple R-squared of 0.9771 and Adjusted R-squared of 0.977 are both very close to 1, which indicates that the model is a good fit to the data.

However, we will investigate the predictors to rule out instances of Multicollinearity before relying on the individual coefficients.

Testing for and Freeing From Multicollinearity among Variables

Multicollinearity occurs when two or more predictor variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model.

If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.

To test this model for Multicollinearity we will employ the “imcdiag” function from the “mctest” library and examine the Variance Inflation Factor (VIF) score.

Note : As a common rule of thumb, VIF scores over 5 indicate moderate multicollinearity, and scores over 10 are very problematic.

Using the VIF measure, we see that 8 of the 12 predictor variables possess low VIF scores (below 5), indicating that they are not strongly correlated with the other predictors. The following four variables range from moderate to problematic :

Adult_mortality : VIF score is 7.7, moderately multicollinear

Infant_deaths : VIF score is 7.8, moderately multicollinear

Polio : VIF score is 12.0, problematic; will be removed from the model

Diphtheria : VIF score is 12.8, problematic; will be removed from the model

library(mctest)  # provides imcdiag()

imcdiag(life_exp_0.lm)
## 
## Call:
## imcdiag(mod = life_exp_0.lm)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                     VIF    TOL        Wi        Fi Leamer    CVIF Klein   IND1
## Adult_mortality  7.7110 0.1297 1739.9711 1914.6393 0.3601 -0.0477     0 0.0005
## Hepatitis_B      2.5224 0.3964  394.7215  434.3458 0.6296 -0.0156     0 0.0015
## Measles          1.5690 0.6374  147.5183  162.3271 0.7983 -0.0097     0 0.0025
## Polio           12.0060 0.0833 2853.5539 3140.0099 0.2886 -0.0743     0 0.0003
## Diphtheria      12.7943 0.0782 3057.9513 3364.9259 0.2796 -0.0792     0 0.0003
## Incidents_HIV    2.9062 0.3441  494.2268  543.8401 0.5866 -0.0180     0 0.0013
## GDP_per_capita   2.2133 0.4518  314.5877  346.1678 0.6722 -0.0137     0 0.0017
## le_quad          4.5757 0.2185  927.0764 1020.1416 0.4675 -0.0283     0 0.0008
## le_dichotomous   3.1580 0.3167  559.5014  615.6673 0.5627 -0.0196     0 0.0012
## le_dich_quad     2.1173 0.4723  289.6765  318.7558 0.6872 -0.0131     0 0.0018
## BMI              2.0481 0.4883  271.7478  299.0274 0.6988 -0.0127     0 0.0019
## Infant_deaths    7.8355 0.1276 1772.2539 1950.1629 0.3572 -0.0485     0 0.0005
##                   IND2
## Adult_mortality 1.2650
## Hepatitis_B     0.8773
## Measles         0.5271
## Polio           1.3325
## Diphtheria      1.3399
## Incidents_HIV   0.9534
## GDP_per_capita  0.7968
## le_quad         1.1359
## le_dichotomous  0.9933
## le_dich_quad    0.7670
## BMI             0.7438
## Infant_deaths   1.2680
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## Measles , Polio , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.9771 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
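The VIF column in the output above follows the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The base-R sketch below reproduces that calculation on simulated predictors (not the life expectancy data); the names `X` and `vif_manual` are illustrative assumptions.

```r
# Toy data only: simulated predictors, two of them deliberately correlated.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the rest.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

In this toy setup the correlated pair `x1` and `x2` should show large VIFs while `x3` stays near 1, which mirrors the pattern seen for Polio and Diphtheria.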



Adjusting Model by removing both Problematic Variables

# Removed both Polio and Diphtheria variables

life_exp_1.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)

The Regression Details of the Adjusted Model :

summary(life_exp_1.lm)
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Measles + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths, data = life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3887 -0.9663  0.0045  0.9450  5.5421 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.461e+01  5.051e-01 167.509  < 2e-16 ***
## Adult_mortality -5.015e-02  6.454e-04 -77.700  < 2e-16 ***
## Hepatitis_B     -6.881e-03  2.059e-03  -3.341 0.000844 ***
## Measles          9.021e-04  1.777e-03   0.508 0.611736    
## Incidents_HIV    1.159e-01  1.908e-02   6.073 1.42e-09 ***
## GDP_per_capita   2.411e-05  2.345e-06  10.283  < 2e-16 ***
## le_quad          6.055e-03  1.195e-03   5.068 4.27e-07 ***
## le_dichotomous   8.234e-02  1.120e-02   7.350 2.58e-13 ***
## le_dich_quad     9.100e-04  1.359e-04   6.695 2.59e-11 ***
## BMI             -1.047e-01  1.722e-02  -6.078 1.38e-09 ***
## Infant_deaths   -1.356e-01  2.462e-03 -55.071  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 2853 degrees of freedom
## Multiple R-squared:  0.977,  Adjusted R-squared:  0.9769 
## F-statistic: 1.21e+04 on 10 and 2853 DF,  p-value: < 2.2e-16

The Multicollinearity test results for the Adjusted Model :

imcdiag(life_exp_1.lm)
## 
## Call:
## imcdiag(mod = life_exp_1.lm)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                    VIF    TOL        Wi        Fi Leamer    CVIF Klein   IND1
## Adult_mortality 7.7026 0.1298 2125.4629 2391.9835 0.3603 -0.0613     0 0.0004
## Hepatitis_B     1.5193 0.6582  164.6644  185.3124 0.8113 -0.0121     0 0.0021
## Measles         1.5396 0.6495  171.1146  192.5714 0.8059 -0.0123     0 0.0020
## Incidents_HIV   2.8901 0.3460  599.3814  674.5403 0.5882 -0.0230     0 0.0011
## GDP_per_capita  2.2085 0.4528  383.2344  431.2898 0.6729 -0.0176     0 0.0014
## le_quad         4.5685 0.2189 1131.6196 1273.5181 0.4679 -0.0364     0 0.0007
## le_dichotomous  3.1385 0.3186  678.1369  763.1713 0.5645 -0.0250     0 0.0010
## le_dich_quad    2.1033 0.4755  349.8557  393.7256 0.6895 -0.0167     0 0.0015
## BMI             1.9999 0.5000  317.0702  356.8289 0.7071 -0.0159     0 0.0016
## Infant_deaths   6.4356 0.1554 1723.6941 1939.8353 0.3942 -0.0512     0 0.0005
##                   IND2
## Adult_mortality 1.4276
## Hepatitis_B     0.5607
## Measles         0.5750
## Incidents_HIV   1.0730
## GDP_per_capita  0.8978
## le_quad         1.2815
## le_dichotomous  1.1179
## le_dich_quad    0.8606
## BMI             0.8203
## Infant_deaths   1.3857
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## Measles , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.977 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Discussion of Adjusted model and Multicollinearity tests :

Residuals :

The median residual is now essentially zero (0.0045), with 1Q and 3Q almost perfectly balanced. The Min and Max are about even, with a very slight right tail. This suggests that the residual distribution is roughly symmetric and close to “Normal”.

The Multiple R-squared of 0.977 and Adjusted R-squared of 0.9769 are still very close to 1, which indicates that the model remains a good fit to the data.

There were some changes to the VIF scores of the following variables :

Adult_mortality : VIF score remains at 7.7, moderately multicollinear

Infant_deaths : VIF score is 6.4, down from 7.8, still moderately multicollinear

Overall, this Adjusted model is preferable: the fit is essentially unchanged, and the severely multicollinear variables have been removed.

Note : the computed variables le_quad, le_dichotomous and le_dich_quad show low to moderate multicollinearity (VIF scores of 4.6, 3.1 and 2.1).

When we examine a histogram of the residuals of the regression, we see that it is approximately normally distributed.
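A residual check of this kind can be sketched with base R graphics: a histogram, a normal Q-Q plot, and a residuals-versus-fitted plot. The example below runs on simulated data with a hypothetical model `fit`, so it shows the mechanics only and is not the plot produced for this analysis.

```r
# Toy data only: simulated, not the life expectancy dataset.
set.seed(7)
n <- 300
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
res <- residuals(fit)

# Write the three diagnostic plots to a temporary PDF;
# drop pdf() and dev.off() to draw on the screen instead.
out <- tempfile(fileext = ".pdf")
pdf(out, width = 9, height = 3)
par(mfrow = c(1, 3))
hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residual")
qqnorm(res)
qqline(res)
plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
invisible(dev.off())
```

Replacing `fit` with life_exp_1.lm would give the corresponding plots for the adjusted model; a funnel or curve in the residuals-versus-fitted panel would argue against the linear specification.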



Reviewing the model by removing only one Problematic Variable at a time.

Note that when the “Polio” variable was removed from the linear model, the VIF score for the “Diphtheria” variable improved markedly, from 12.8 to 3.9

AND

when the “Diphtheria” variable was removed from the linear model, the VIF score for the “Polio” variable improved markedly, from 12.0 to 3.6

MODEL Improvement - Each of the two variables is removed in turn below; the model finally kept removes ONLY the “Diphtheria” variable from the original model

# Removed only the Polio variable (Diphtheria kept)

life_exp_1_1.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles +Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)

summary(life_exp_1_1.lm)
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Measles + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + 
##     le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3269 -0.9643 -0.0044  0.9502  5.3614 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.366e+01  5.555e-01 150.611  < 2e-16 ***
## Adult_mortality -5.018e-02  6.437e-04 -77.960  < 2e-16 ***
## Hepatitis_B     -1.366e-02  2.646e-03  -5.163 2.60e-07 ***
## Measles          2.749e-04  1.779e-03   0.155    0.877    
## Diphtheria       1.371e-02  3.374e-03   4.064 4.96e-05 ***
## Incidents_HIV    1.104e-01  1.907e-02   5.787 7.94e-09 ***
## GDP_per_capita   2.437e-05  2.339e-06  10.418  < 2e-16 ***
## le_quad          5.919e-03  1.192e-03   4.966 7.24e-07 ***
## le_dichotomous   8.356e-02  1.118e-02   7.476 1.01e-13 ***
## le_dich_quad     9.453e-04  1.358e-04   6.959 4.23e-12 ***
## BMI             -9.399e-02  1.738e-02  -5.409 6.88e-08 ***
## Infant_deaths   -1.313e-01  2.670e-03 -49.177  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.426 on 2852 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.977 
## F-statistic: 1.106e+04 on 11 and 2852 DF,  p-value: < 2.2e-16
imcdiag(life_exp_1_1.lm)
## 
## Call:
## imcdiag(mod = life_exp_1_1.lm)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                    VIF    TOL        Wi        Fi Leamer    CVIF Klein   IND1
## Adult_mortality 7.7041 0.1298 1912.6659 2125.9292 0.3603 -0.0537     0 0.0005
## Hepatitis_B     2.5216 0.3966  434.1120  482.5158 0.6297 -0.0176     0 0.0014
## Measles         1.5513 0.6446  157.2805  174.8174 0.8029 -0.0108     0 0.0023
## Diphtheria      3.8681 0.2585  818.2668  909.5040 0.5085 -0.0269     0 0.0009
## Incidents_HIV   2.9046 0.3443  543.3928  603.9814 0.5868 -0.0202     0 0.0012
## GDP_per_capita  2.2101 0.4525  345.2534  383.7494 0.6727 -0.0154     0 0.0016
## le_quad         4.5721 0.2187 1019.1273 1132.7606 0.4677 -0.0319     0 0.0008
## le_dichotomous  3.1407 0.3184  610.7559  678.8555 0.5643 -0.0219     0 0.0011
## le_dich_quad    2.1119 0.4735  317.2247  352.5955 0.6881 -0.0147     0 0.0017
## BMI             2.0469 0.4885  298.6797  331.9826 0.6990 -0.0143     0 0.0017
## Infant_deaths   7.6119 0.1314 1886.3653 2096.6961 0.3625 -0.0530     0 0.0005
##                   IND2
## Adult_mortality 1.3400
## Hepatitis_B     0.9292
## Measles         0.5472
## Diphtheria      1.1418
## Incidents_HIV   1.0098
## GDP_per_capita  0.8432
## le_quad         1.2031
## le_dichotomous  1.0496
## le_dich_quad    0.8108
## BMI             0.7876
## Infant_deaths   1.3376
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## Measles , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.9771 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
# Removed Diphtheria variable

life_exp_1_2.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)

summary(life_exp_1_2.lm)
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Measles + Polio + Incidents_HIV + GDP_per_capita + le_quad + 
##     le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2766 -0.9560  0.0008  0.9405  5.3998 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.378e+01  5.577e-01 150.230  < 2e-16 ***
## Adult_mortality -5.021e-02  6.444e-04 -77.923  < 2e-16 ***
## Hepatitis_B     -1.169e-02  2.469e-03  -4.734 2.31e-06 ***
## Measles          8.168e-05  1.789e-03   0.046  0.96358    
## Polio            1.184e-02  3.369e-03   3.513  0.00045 ***
## Incidents_HIV    1.110e-01  1.909e-02   5.817 6.65e-09 ***
## GDP_per_capita   2.447e-05  2.343e-06  10.447  < 2e-16 ***
## le_quad          5.892e-03  1.193e-03   4.938 8.34e-07 ***
## le_dichotomous   8.483e-02  1.120e-02   7.571 4.96e-14 ***
## le_dich_quad     9.487e-04  1.361e-04   6.971 3.90e-12 ***
## BMI             -9.778e-02  1.730e-02  -5.651 1.75e-08 ***
## Infant_deaths   -1.316e-01  2.704e-03 -48.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.427 on 2852 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.977 
## F-statistic: 1.105e+04 on 11 and 2852 DF,  p-value: < 2.2e-16
imcdiag(life_exp_1_2.lm)
## 
## Call:
## imcdiag(mod = life_exp_1_2.lm)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                    VIF    TOL        Wi        Fi Leamer    CVIF Klein   IND1
## Adult_mortality 7.7086 0.1297 1913.9715 2127.3804 0.3602 -0.0535     0 0.0005
## Hepatitis_B     2.1925 0.4561  340.2164  378.1508 0.6754 -0.0152     0 0.0016
## Measles         1.5663 0.6384  161.5653  179.5799 0.7990 -0.0109     0 0.0022
## Polio           3.6298 0.2755  750.2683  833.9237 0.5249 -0.0252     0 0.0010
## Incidents_HIV   2.9051 0.3442  543.5266  604.1301 0.5867 -0.0202     0 0.0012
## GDP_per_capita  2.2127 0.4519  345.9876  384.5655 0.6723 -0.0154     0 0.0016
## le_quad         4.5754 0.2186 1020.0609 1133.7983 0.4675 -0.0318     0 0.0008
## le_dichotomous  3.1510 0.3174  613.6821  682.1080 0.5633 -0.0219     0 0.0011
## le_dich_quad    2.1172 0.4723  318.7268  354.2651 0.6873 -0.0147     0 0.0017
## BMI             2.0261 0.4936  292.7446  325.3858 0.7025 -0.0141     0 0.0017
## Infant_deaths   7.7949 0.1283 1938.5961 2154.7506 0.3582 -0.0541     0 0.0004
##                   IND2
## Adult_mortality 1.3533
## Hepatitis_B     0.8458
## Measles         0.5622
## Polio           1.1266
## Incidents_HIV   1.0197
## GDP_per_capita  0.8522
## le_quad         1.2151
## le_dichotomous  1.0615
## le_dich_quad    0.8205
## BMI             0.7875
## Infant_deaths   1.3555
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## Measles , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.9771 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================



The New Adjusted model is as follows : 11 Predictor Variables + 1 Predicted Variable (Diphtheria removed, Polio kept; the same fit as life_exp_1_2.lm above)

life_exp_1_2.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)


Final Adjusted model discussion :

The overall model does not change significantly.

The Adjusted \(R^2\) remains high at > 97% even after removing the “Diphtheria” variable.

The p-values for the individual coefficients are all low except for that of the “Measles” variable (p = 0.96).



Let us try the stepAIC function on our model.

Using the initial model with all initial variables

life_exp_0.lm <- lm(Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data=life_exp)


library(MASS)  # provides stepAIC()

stepAIC(life_exp_0.lm, direction="both")
## Start:  AIC=2046.35
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Measles + Polio + 
##     Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths
## 
##                   Df Sum of Sq     RSS    AIC
## - Measles          1       0.0  5798.8 2044.4
## - Polio            1       0.1  5798.9 2044.4
## <none>                          5798.8 2046.3
## - Diphtheria       1       8.5  5807.3 2048.6
## - le_quad          1      50.0  5848.8 2068.9
## - Hepatitis_B      1      54.1  5852.9 2070.9
## - BMI              1      59.6  5858.3 2073.6
## - Incidents_HIV    1      67.9  5866.7 2077.7
## - le_dich_quad     1      98.5  5897.3 2092.6
## - le_dichotomous   1     113.5  5912.3 2099.9
## - GDP_per_capita   1     220.7  6019.5 2151.3
## - Infant_deaths    1    4769.6 10568.4 3763.4
## - Adult_mortality  1   12348.8 18147.5 5311.8
## 
## Step:  AIC=2044.37
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Polio + Diphtheria + 
##     Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths
## 
##                   Df Sum of Sq     RSS    AIC
## - Polio            1       0.1  5798.9 2042.4
## <none>                          5798.8 2044.4
## + Measles          1       0.0  5798.8 2046.3
## - Diphtheria       1       8.5  5807.3 2046.6
## - le_quad          1      51.1  5849.9 2067.5
## - Hepatitis_B      1      54.8  5853.6 2069.3
## - BMI              1      59.7  5858.5 2071.7
## - Incidents_HIV    1      68.1  5866.9 2075.8
## - le_dich_quad     1      99.0  5897.8 2090.9
## - le_dichotomous   1     113.8  5912.6 2098.0
## - GDP_per_capita   1     220.7  6019.5 2149.4
## - Infant_deaths    1    4770.8 10569.7 3761.7
## - Adult_mortality  1   12402.1 18201.0 5318.3
## 
## Step:  AIC=2042.42
## Life_expectancy ~ Adult_mortality + Hepatitis_B + Diphtheria + 
##     Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths
## 
##                   Df Sum of Sq     RSS    AIC
## <none>                          5798.9 2042.4
## + Polio            1       0.1  5798.8 2044.4
## + Measles          1       0.0  5798.9 2044.4
## - Diphtheria       1      34.1  5833.0 2057.2
## - le_quad          1      51.4  5850.3 2065.7
## - Hepatitis_B      1      54.8  5853.7 2067.3
## - BMI              1      59.6  5858.5 2069.7
## - Incidents_HIV    1      68.3  5867.2 2074.0
## - le_dich_quad     1      98.9  5897.8 2088.9
## - le_dichotomous   1     113.9  5912.8 2096.1
## - GDP_per_capita   1     220.7  6019.6 2147.4
## - Infant_deaths    1    4917.3 10716.2 3799.2
## - Adult_mortality  1   12407.2 18206.1 5317.1
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths, data = life_exp)
## 
## Coefficients:
##     (Intercept)  Adult_mortality      Hepatitis_B       Diphtheria  
##       8.367e+01       -5.019e-02       -1.360e-02        1.376e-02  
##   Incidents_HIV   GDP_per_capita          le_quad   le_dichotomous  
##       1.105e-01        2.437e-05        5.942e-03        8.361e-02  
##    le_dich_quad              BMI    Infant_deaths  
##       9.463e-04       -9.379e-02       -1.313e-01
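The AIC values in the trace above can be reproduced by hand. For a linear model, R's extractAIC (which stepAIC relies on) computes n * log(RSS / n) + 2 * edf, where edf counts the estimated coefficients including the intercept. The following is a minimal sketch that plugs in the RSS values printed above; the helper name `aic_from_rss` is hypothetical, and small discrepancies are expected because the printed RSS is rounded to one decimal place.

```r
# Hypothetical helper: AIC as reported by stepAIC()/extractAIC() for an lm fit,
# up to rounding of the printed RSS.
aic_from_rss <- function(rss, n, edf, k = 2) {
  n * log(rss / n) + k * edf
}

n <- 2864  # rows in life_exp (2853 residual df + 11 coefficients in the final model)

# Starting model: 12 predictors + intercept = 13 coefficients
aic_start <- aic_from_rss(rss = 5798.8, n = n, edf = 13)

# Final model: 10 predictors + intercept = 11 coefficients
aic_final <- aic_from_rss(rss = 5798.9, n = n, edf = 11)

round(c(start = aic_start, final = aic_final), 2)
```

The two values should land within rounding of the printed 2046.35 and 2042.42. Because dropping Measles and Polio leaves the RSS almost unchanged, the improvement in AIC comes almost entirely from the penalty of 2 per removed coefficient.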

We now fit the model recommended by stepAIC, which has the lowest AIC score (2042.42).

Note that, relative to the initial model, this model excludes the Measles and Polio variables, which stepAIC dropped in that order.

The adjusted \(R^2\) is unchanged at 0.977, and the overall fit is about the same as that of the model suggested by the VIF scores.

life_exp_aic.lm <- lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + le_dich_quad + BMI + Infant_deaths, data = life_exp)


summary(life_exp_aic.lm)
## 
## Call:
## lm(formula = Life_expectancy ~ Adult_mortality + Hepatitis_B + 
##     Diphtheria + Incidents_HIV + GDP_per_capita + le_quad + le_dichotomous + 
##     le_dich_quad + BMI + Infant_deaths, data = life_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3226 -0.9662 -0.0040  0.9488  5.3671 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.367e+01  5.537e-01 151.124  < 2e-16 ***
## Adult_mortality -5.019e-02  6.424e-04 -78.129  < 2e-16 ***
## Hepatitis_B     -1.360e-02  2.621e-03  -5.191 2.24e-07 ***
## Diphtheria       1.376e-02  3.361e-03   4.093 4.37e-05 ***
## Incidents_HIV    1.105e-01  1.906e-02   5.796 7.52e-09 ***
## GDP_per_capita   2.437e-05  2.339e-06  10.419  < 2e-16 ***
## le_quad          5.942e-03  1.182e-03   5.027 5.29e-07 ***
## le_dichotomous   8.361e-02  1.117e-02   7.485 9.48e-14 ***
## le_dich_quad     9.463e-04  1.356e-04   6.976 3.76e-12 ***
## BMI             -9.379e-02  1.733e-02  -5.413 6.70e-08 ***
## Infant_deaths   -1.313e-01  2.669e-03 -49.186  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.426 on 2853 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.977 
## F-statistic: 1.218e+04 on 10 and 2853 DF,  p-value: < 2.2e-16
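As a sanity check on the summary above, the headline statistics can be recomputed from a few printed numbers. This is only a sketch using rounded inputs copied from the output (the variable names are illustrative), so the results agree with the printout only up to rounding.

```r
# Values copied from the printed output above (rounded)
r2  <- 0.9771   # Multiple R-squared
n   <- 2864     # observations: 2853 residual df + 11 coefficients
p   <- 10       # predictors, excluding the intercept
rss <- 5798.9   # residual sum of squares from the final stepAIC table

df_resid <- n - p - 1   # residual degrees of freedom (2853)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# Residual standard error: square root of RSS per residual degree of freedom
rse <- sqrt(rss / df_resid)

c(adj_r2 = adj_r2, f_stat = f_stat, rse = rse)
```

With 2,864 observations and only 10 predictors, the adjustment factor (n - 1) / (n - p - 1) is very close to 1, which is why the adjusted \(R^2\) is nearly identical to the multiple \(R^2\).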



Residuals discussion

The histogram of the residuals is approximately normal and centered at zero: the median residual is -0.0040, and the quartiles (-0.9662 and 0.9488) are nearly symmetric.

In the Q-Q plot, points that lie close to the red line indicate normally distributed residuals. If the points deviate sharply from the line, consider adjusting the model by adding or removing variables; the model shown here is the result of that kind of adjustment.

hist(life_exp_aic.lm$residuals, prob = TRUE)              # density-scaled histogram
abline(v = mean(life_exp_aic.lm$residuals),               # vertical line at the mean residual
       col = "red",
       lwd = 3)
lines(density(life_exp_aic.lm$residuals), col = "green")  # kernel density overlay

qqnorm(life_exp_aic.lm$residuals)
qqline(life_exp_aic.lm$residuals, col = "red")
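One way to put a number on how closely the Q-Q points follow the line is the correlation between the sample quantiles and the theoretical normal quantiles. The sketch below is illustrative only: it uses simulated data rather than the life expectancy residuals, and the helper name `qq_cor` is hypothetical.

```r
# Hypothetical helper: correlation between sample quantiles and
# theoretical normal quantiles (values near 1 suggest normality).
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)  # list with x = theoretical, y = sample quantiles
  cor(q$x, q$y)
}

set.seed(605)  # arbitrary seed, for reproducibility only

normal_resid <- rnorm(1000)      # simulated, well-behaved residuals
skewed_resid <- rexp(1000) - 1   # simulated, right-skewed residuals

r_normal <- qq_cor(normal_resid)
r_skewed <- qq_cor(skewed_resid)

c(normal = r_normal, skewed = r_skewed)
```

Applied to this report, the call would be `qq_cor(life_exp_aic.lm$residuals)`; a value very close to 1 would support what the Q-Q plot shows visually.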

The plot of residuals against fitted values is consistent with a linear relationship, although there is some evidence of heteroskedastic behavior.

plot(life_exp_aic.lm$fitted.values, life_exp_aic.lm$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot",col = "greenyellow")
abline(h=0)
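The visual impression of heteroskedasticity can be backed by a formal test. Below is a minimal base-R sketch of the studentized (Koenker) Breusch-Pagan test: regress the squared residuals on the model's predictors and compare n times the auxiliary \(R^2\) to a chi-squared distribution. The helper name `bp_test` is hypothetical, the data are simulated for illustration, and the sketch assumes a full-rank design matrix; the lmtest package provides a packaged version as `bptest()`.

```r
# Hypothetical helper: studentized Breusch-Pagan test for an lm fit.
# Assumes the design matrix has full column rank.
bp_test <- function(model) {
  e2  <- residuals(model)^2            # squared residuals
  X   <- model.matrix(model)           # predictors, including the intercept column
  aux <- lm.fit(X, e2)                 # auxiliary regression of e^2 on the predictors
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  df  <- ncol(X) - 1                   # number of predictors, excluding the intercept
  stat <- length(e2) * r2              # LM statistic: n * R^2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example: the error spread grows with x (heteroskedastic by design)
set.seed(605)  # arbitrary seed
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
bp  <- bp_test(fit)
bp
```

For the model in this report the call would be `bp_test(life_exp_aic.lm)`. A small p-value would confirm non-constant variance, in which case the coefficient estimates remain unbiased but the reported standard errors should be read with caution.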


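The proposed model fits the data well (adjusted \(R^2\) of 0.977, with approximately normal residuals centered at zero) and is a credible predictor of life expectancy, so the linear model is appropriate.

All of the models tested would yield good predictions, but the stepAIC model is preferred.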



References :

https://www.kaggle.com/datasets/lashagoch/life-expectancy-who-updated

https://www.statology.org/multicollinearity-in-r

https://datascienceplus.com/multicollinearity-in-r

https://rforpoliticalscience.com/2020/11/12/interpret-multicollinearity-tests-from-the-mctest-package-in-r