1 Introduction

Many studies have examined the factors that affect life expectancy, typically using demographic variables, income composition and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some of the earlier research applied multiple linear regression to a single year of data for all countries. This motivates addressing both gaps by formulating regression models, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015.

2 Reading Data

#Load packages
library('ggplot2')
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#To load the dataset
Life <- read.csv("data/Life Expectancy Data.csv")

#examine data
glimpse(Life)
## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afghani…
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010, 20…
## $ Status                          <chr> "Developing", "Developing", "Developin…
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58…
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287…
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84…
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.…
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18…
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64…
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861…
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16…
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113…
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,…
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.…
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58…
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1…
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 669.9…
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2…
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18…
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18…
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4…
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8…

This shows us our variables, their class type, some of their observations, and what we could expect to see if we opened the dataset and looked at it. We have 2,938 observations of 22 variables.

3 Linear Regression: Examining Life Expectancy from 2000-2015

For now, I want to look at how Life Expectancy has changed for both developing and developed countries from 2000-2015 by running a linear model on the Life dataset.

#ggplot2 builds a plot in layers (data, aesthetic mappings, geoms, labels), which keeps the plotting code streamlined and easy to read.

ggplot(data=Life,mapping=aes(Year,Life.expectancy,color=Status))+geom_point()+geom_smooth(method="lm",se=FALSE)+labs(title="Life Expectancy from 2000-2015")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).

#Some additional explanation: I am using the Life dataset, mapping Year to the x-axis and Life.expectancy to the y-axis. I am differentiating each data point by color using Status
#(whether the country is developed or developing). geom_point() plots each observation as a point, and geom_smooth(method="lm") adds a fitted regression line for each Status group.

The warning messages tell us that 10 rows were dropped because they contained missing values: the missing-value check in Section 5 shows that Life.expectancy is missing in exactly 10 rows.

Looking at the graph, it’s clear that developed countries have higher life expectancies (LE) and contribute far fewer observations than developing countries.
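The two lines drawn by geom_smooth(method="lm") are ordinary least-squares fits, one per Status group. A minimal sketch of how the same slopes could be read off numerically is shown here, assuming an interaction model. The data frame `toy` and every number in it are simulated for illustration only and are not taken from the Life dataset.

```r
# Simulated stand-in data (NOT the real Life dataset): two groups, 2000-2015.
set.seed(42)
toy <- data.frame(
  Year   = rep(2000:2015, times = 2),
  Status = rep(c("Developed", "Developing"), each = 16)
)
toy$Life.expectancy <- ifelse(toy$Status == "Developed", 76, 64) +
  0.3 * (toy$Year - 2000) + rnorm(nrow(toy), sd = 0.5)

# Year * Status gives each group its own intercept and its own slope.
fit <- lm(Life.expectancy ~ Year * Status, data = toy)
b <- coef(fit)

# "Developed" is the reference level, so its slope is the Year coefficient;
# the Developing slope adds the interaction term.
slope_developed  <- b[["Year"]]
slope_developing <- b[["Year"]] + b[["Year:StatusDeveloping"]]
print(c(developed = slope_developed, developing = slope_developing))
```

On the real data, the same formula could be applied to Life directly; rows with missing Life.expectancy would be dropped by lm() in the same way the plot drops them.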

4 Multi Linear Regression 1: Developed and Developing countries (Splitting the data)

What if I wanted to examine which variables affect developed and developing countries separately? To do that, I’ll split the data by Status, inspect the result with str(), and then give each subset its own name.

#Split the dataset by Status (a character column).
Life_split<-split(Life,Life$Status)

str(Life_split)
## List of 2
##  $ Developed :'data.frame':  512 obs. of  22 variables:
##   ..$ Country                        : chr [1:512] "Australia" "Australia" "Australia" "Australia" ...
##   ..$ Year                           : int [1:512] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##   ..$ Status                         : chr [1:512] "Developed" "Developed" "Developed" "Developed" ...
##   ..$ Life.expectancy                : num [1:512] 82.8 82.7 82.5 82.3 82 81.9 81.7 81.3 81.3 81.2 ...
##   ..$ Adult.Mortality                : int [1:512] 59 6 61 61 63 64 66 66 66 66 ...
##   ..$ infant.deaths                  : int [1:512] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Alcohol                        : num [1:512] NA 9.71 9.87 10.03 10.3 ...
##   ..$ percentage.expenditure         : num [1:512] 0 10769 11735 11715 10986 ...
##   ..$ Hepatitis.B                    : int [1:512] 93 91 91 91 92 92 94 94 94 95 ...
##   ..$ Measles                        : int [1:512] 74 340 158 199 190 70 104 65 11 0 ...
##   ..$ BMI                            : num [1:512] 66.6 66.1 65.5 65 64.4 63.9 63.4 62.9 62.5 62 ...
##   ..$ under.five.deaths              : int [1:512] 1 1 1 1 1 1 1 1 2 2 ...
##   ..$ Polio                          : int [1:512] 93 92 91 92 92 92 92 92 92 92 ...
##   ..$ Total.expenditure              : num [1:512] NA 9.42 9.36 9.36 9.2 9.2 9.5 8.78 8.53 8.49 ...
##   ..$ Diphtheria                     : int [1:512] 93 92 91 92 92 92 92 92 92 92 ...
##   ..$ HIV.AIDS                       : num [1:512] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##   ..$ GDP                            : num [1:512] 56554 62215 67792 67678 62245 ...
##   ..$ Population                     : num [1:512] 23789338 2346694 23117353 22728254 223424 ...
##   ..$ thinness..1.19.years           : num [1:512] 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 ...
##   ..$ thinness.5.9.years             : num [1:512] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 ...
##   ..$ Income.composition.of.resources: num [1:512] 0.937 0.936 0.933 0.93 0.927 0.927 0.925 0.921 0.918 0.915 ...
##   ..$ Schooling                      : num [1:512] 20.4 20.4 20.3 20.1 19.8 19.5 19.1 19.1 19 20.3 ...
##  $ Developing:'data.frame':  2426 obs. of  22 variables:
##   ..$ Country                        : chr [1:2426] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##   ..$ Year                           : int [1:2426] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##   ..$ Status                         : chr [1:2426] "Developing" "Developing" "Developing" "Developing" ...
##   ..$ Life.expectancy                : num [1:2426] 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##   ..$ Adult.Mortality                : int [1:2426] 263 271 268 272 275 279 281 287 295 295 ...
##   ..$ infant.deaths                  : int [1:2426] 62 64 66 69 71 74 77 80 82 84 ...
##   ..$ Alcohol                        : num [1:2426] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##   ..$ percentage.expenditure         : num [1:2426] 71.3 73.5 73.2 78.2 7.1 ...
##   ..$ Hepatitis.B                    : int [1:2426] 65 62 64 67 68 66 63 64 63 64 ...
##   ..$ Measles                        : int [1:2426] 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##   ..$ BMI                            : num [1:2426] 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##   ..$ under.five.deaths              : int [1:2426] 83 86 89 93 97 102 106 110 113 116 ...
##   ..$ Polio                          : int [1:2426] 6 58 62 67 68 66 63 64 63 58 ...
##   ..$ Total.expenditure              : num [1:2426] 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##   ..$ Diphtheria                     : int [1:2426] 65 62 64 67 68 66 63 64 63 58 ...
##   ..$ HIV.AIDS                       : num [1:2426] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##   ..$ GDP                            : num [1:2426] 584.3 612.7 631.7 670 63.5 ...
##   ..$ Population                     : num [1:2426] 33736494 327582 31731688 3696958 2978599 ...
##   ..$ thinness..1.19.years           : num [1:2426] 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##   ..$ thinness.5.9.years             : num [1:2426] 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##   ..$ Income.composition.of.resources: num [1:2426] 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##   ..$ Schooling                      : num [1:2426] 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
#Separates datasets based on Developed/Developing
Life_developing<-Life_split$Developing
Life_developed<-Life_split$Developed

Suppose we want to see whether LE is related to the diseases and vices recorded in the dataset (alcohol consumption, hepatitis B, measles, polio, diphtheria and HIV/AIDS) in developed and in developing countries. We can run a multiple linear regression on those variables for each group and see whether they are associated with LE.
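Since the same formula is fitted to each subset, the two regressions could also be run in one pass over the split list. The sketch below illustrates the pattern on the built-in iris data, with Species standing in for Status; the dataset and the formula are placeholders, not part of the analysis.

```r
# Illustrative only: iris stands in for the life-expectancy data,
# and Species plays the role of Status.
groups <- split(iris, iris$Species)

# Fit the same formula to every subset.
fits <- lapply(groups, function(d) {
  lm(Sepal.Length ~ Petal.Length + Petal.Width, data = d)
})

# One row of coefficients per group, for side-by-side comparison.
coef_table <- t(sapply(fits, coef))
print(round(coef_table, 3))
```

Keeping the fits in a named list makes it harder to mix up which model belongs to which group when the results are compared later.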

4.1 Developing Countries

Let’s fit the regression and interpret what it returns. When we run a regression, we are trying to determine whether our independent variables are associated with our dependent variable. To do this, we test whether the coefficients on the independent variables differ from 0.

developing_model<- lm(Life.expectancy ~Alcohol+Hepatitis.B+Measles+Polio+Diphtheria+HIV.AIDS,data=Life_developing)
summary(developing_model)
## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + Hepatitis.B + Measles + 
##     Polio + Diphtheria + HIV.AIDS, data = Life_developing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6652  -3.8874   0.6864   3.8682  16.6351 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.929e+01  6.072e-01  97.648  < 2e-16 ***
## Alcohol      4.543e-01  4.028e-02  11.279  < 2e-16 ***
## Hepatitis.B  1.630e-03  6.827e-03   0.239  0.81130    
## Measles     -4.216e-05  1.305e-05  -3.230  0.00126 ** 
## Polio        4.830e-02  7.465e-03   6.471 1.24e-10 ***
## Diphtheria   5.973e-02  8.357e-03   7.148 1.26e-12 ***
## HIV.AIDS    -8.190e-01  2.385e-02 -34.342  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.84 on 1881 degrees of freedom
##   (538 observations deleted due to missingness)
## Multiple R-squared:  0.4831, Adjusted R-squared:  0.4815 
## F-statistic:   293 on 6 and 1881 DF,  p-value: < 2.2e-16

Interpretation:

Adjusted R-Squared: “R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.” - Investopedia. Adjusted R-Squared penalizes the number of predictors, so it falls when an added variable does not improve the fit enough to justify its inclusion. That makes it a fairer basis than plain R-Squared for comparing models with different numbers of predictors. Our Adjusted R-Squared is 0.4815, meaning that these six predictors explain roughly 48% of the variance in Life Expectancy among the observations used: moderate explanatory power, with about half of the variation left unexplained. Adding informative variables would raise the Adjusted R-Squared, but adding variables indiscriminately risks overfitting, which is exactly what the penalty guards against. Is a higher R-Squared desirable in this case? Generally yes: the more variation the model explains, the better it describes what goes along with higher or lower life expectancy in a country, although a high R-Squared on its own does not establish causation.
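As a quick check, the Adjusted R-Squared can be recomputed by hand from the figures printed in the summary above, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The inputs below are copied from that output; n is inferred as residual degrees of freedom plus the number of estimated coefficients.

```r
# Values read from the summary(developing_model) output above.
r2       <- 0.4831   # Multiple R-squared (as printed, i.e. rounded)
p        <- 6        # number of predictors
df_resid <- 1881     # residual degrees of freedom
n        <- df_resid + p + 1   # observations actually used in the fit

# Standard adjusted R-squared formula.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

With n this large relative to p, the penalty is small, which is why the adjusted and unadjusted values differ only in the third decimal place.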

Coefficient estimates: The estimate next to each variable shows the change in Life Expectancy associated with a one unit increase in that variable, holding the other predictors fixed. For example, a 1 unit increase in Alcohol (which in this case means recorded per capita (15+) consumption (in litres of pure alcohol)) is associated with an increase in life expectancy of 0.4543 years. The coefficient is positive, which may reflect confounding (for example with national income) rather than any benefit of drinking. A 1 unit increase in HIV.AIDS (Deaths per 1 000 live births HIV/AIDS (0-4 years)) is associated with a reduction in Life Expectancy of 0.819 years.

P-value: A p-value of 0.05 or less is conventionally treated as statistically significant. Thus, the Alcohol, Measles, Polio, Diphtheria and HIV/AIDS variables are statistically significant, while Hepatitis B is not. The Hepatitis.B variable actually represents immunization coverage among 1-year-olds in percentage terms, so we would expect an increase to go with higher life expectancy; the estimate is indeed positive, but too imprecise to distinguish from zero.
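Instead of reading significance off the printed table, the p-values could be pulled out of the fitted model programmatically. The sketch below uses the built-in mtcars data as a stand-in (the model and variables are placeholders) and also shows the equivalent view through 95% confidence intervals.

```r
# Stand-in model on built-in data; not part of the life-expectancy analysis.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(fit))

# Drop the intercept row and keep the p-values.
p_values <- coefs[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)

# Equivalent check: a predictor is significant at the 5% level exactly
# when its 95% confidence interval excludes zero.
ci <- confint(fit, level = 0.95)
print(ci)
```

The confidence intervals are often more informative than the stars, because they show how large or small the effect could plausibly be.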

4.2 Developed Countries

mlr_developed<-lm(formula = Life.expectancy ~ Alcohol + Hepatitis.B + Measles + 
    Polio + Diphtheria + HIV.AIDS, data = Life_developed)
summary(mlr_developed)
## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + Hepatitis.B + Measles + 
##     Polio + Diphtheria + HIV.AIDS, data = Life_developed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5419 -2.5812  0.2955  2.2043 11.1242 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.125e+01  2.677e+00  30.351  < 2e-16 ***
## Alcohol     -2.632e-01  7.861e-02  -3.348 0.000914 ***
## Hepatitis.B -2.338e-02  1.133e-02  -2.063 0.039937 *  
## Measles     -9.702e-05  1.330e-04  -0.729 0.466344    
## Polio        2.998e-02  2.484e-02   1.207 0.228300    
## Diphtheria  -7.317e-03  2.052e-02  -0.357 0.721595    
## HIV.AIDS            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.974 on 311 degrees of freedom
##   (195 observations deleted due to missingness)
## Multiple R-squared:  0.0508, Adjusted R-squared:  0.03554 
## F-statistic: 3.329 on 5 and 311 DF,  p-value: 0.006056

Interpretation:

Here is the summary for the developed countries and the effects that these diseases and vices have on life expectancy. The R-squared (0.0508) is close to 0! My interpretation is that these factors do not affect life expectancy heavily: with better healthcare systems and outcomes, aging-related conditions matter more for life expectancy than diseases that are either cured or well managed in developed countries. Alcohol and Hepatitis.B are individually significant at the 5% level, but the model as a whole explains only about 5% of the variance, so none of these predictors is particularly effective at approximating life expectancy in developed countries. The HIV.AIDS coefficient is NA because of a singularity, which typically means that the variable does not vary (or is perfectly collinear with other terms) in the rows used for the fit.
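The "1 not defined because of singularities" note can be reproduced on a small scale. In the str() output earlier, HIV.AIDS is 0.1 in every developed-country row shown, which suggests (but does not prove) that the column is constant in that subset. The sketch below uses simulated data with a hypothetical column `const` to show what lm() does with a predictor that never varies.

```r
# Simulated data: `const` never varies, so it is perfectly collinear
# with the intercept column.
set.seed(1)
toy <- data.frame(x = rnorm(20), const = rep(0.1, 20))
toy$y <- 2 + 3 * toy$x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x + const, data = toy)

# lm() cannot separate `const` from the intercept and reports NA for it.
print(coef(fit))

# A quick diagnostic before modelling: how many distinct values per column?
print(sapply(toy, function(col) length(unique(col))))
```

Running the same distinct-value check on the developed-country subset would confirm or rule out the constant-column explanation.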

5 Data Wrangling

Note: 538 observations were deleted due to missingness in the developing-country model. Considering that there were 2426 observations originally in the Developing dataset, this is a considerable share of the data (about 22%) and is likely to be affecting our output. We will therefore deal with the missing values before running multiple regression with stepwise feature selection.

# Drop the Country column (this time we focus on social variables)
Life <- select(Life, -Country)

# Missing value check
colSums(is.na(Life))
##                            Year                          Status 
##                               0                               0 
##                 Life.expectancy                 Adult.Mortality 
##                              10                              10 
##                   infant.deaths                         Alcohol 
##                               0                             194 
##          percentage.expenditure                     Hepatitis.B 
##                               0                             553 
##                         Measles                             BMI 
##                               0                              34 
##               under.five.deaths                           Polio 
##                               0                              19 
##               Total.expenditure                      Diphtheria 
##                             226                              19 
##                        HIV.AIDS                             GDP 
##                               0                             448 
##                      Population            thinness..1.19.years 
##                             652                              34 
##              thinness.5.9.years Income.composition.of.resources 
##                              34                             167 
##                       Schooling 
##                             163
library("tidyr")

Life <- Life%>% 
  drop_na()

colSums(is.na(Life))
##                            Year                          Status 
##                               0                               0 
##                 Life.expectancy                 Adult.Mortality 
##                               0                               0 
##                   infant.deaths                         Alcohol 
##                               0                               0 
##          percentage.expenditure                     Hepatitis.B 
##                               0                               0 
##                         Measles                             BMI 
##                               0                               0 
##               under.five.deaths                           Polio 
##                               0                               0 
##               Total.expenditure                      Diphtheria 
##                               0                               0 
##                        HIV.AIDS                             GDP 
##                               0                               0 
##                      Population            thinness..1.19.years 
##                               0                               0 
##              thinness.5.9.years Income.composition.of.resources 
##                               0                               0 
##                       Schooling 
##                               0
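Dropping every row that has at least one missing value can remove far more rows than any single column's NA count suggests, because the gaps sit in different rows. Here the final model's 1634 residual degrees of freedom plus 15 coefficients imply 1,649 remaining rows out of 2,938. The sketch below shows the effect on a tiny hypothetical data frame, using base R's complete.cases() as a stand-in for drop_na().

```r
# Hypothetical toy data: each column has exactly one NA, each in a different row.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(10, 20, NA, 40),
  c = c("x", "y", "z", NA)
)

# Per-column missing counts look harmless: one NA each.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Listwise deletion keeps only rows with no NA in any column.
complete <- toy[complete.cases(toy), ]
rows_lost <- nrow(toy) - nrow(complete)
print(rows_lost)
```

Comparing nrow() before and after the drop is a cheap way to see how much of the sample survives, and whether imputation or a narrower variable set might be worth considering.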

6 Multi Linear Regression 2: Step Wise Feature Selection

6.1 Backward Elimination

Backward elimination starts with all candidate variables and tests the deletion of each one using a chosen model fit criterion (R's step() uses AIC by default). It removes the variable whose absence gives the largest reduction in AIC (the biggest improvement in penalized model fit) and repeats this process until no further variable can be removed without a loss of fit.

lm.all <- lm(Life.expectancy ~., Life)
step(lm.all, direction="backward")
## Start:  AIC=4204.82
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness..1.19.years             1       0.2 20586 4202.8
## - Population                       1       1.6 20588 4202.9
## - Hepatitis.B                      1       6.8 20593 4203.4
## - GDP                              1       9.5 20596 4203.6
## - Measles                          1      11.8 20598 4203.8
## - thinness.5.9.years               1      11.8 20598 4203.8
## - Polio                            1      16.2 20603 4204.1
## <none>                                         20586 4204.8
## - percentage.expenditure           1      36.2 20622 4205.7
## - Total.expenditure                1      65.8 20652 4208.1
## - Diphtheria                       1      72.0 20658 4208.6
## - Status                           1      88.4 20675 4209.9
## - Alcohol                          1     192.5 20779 4218.2
## - BMI                              1     361.2 20948 4231.5
## - Year                             1     383.8 20970 4233.3
## - infant.deaths                    1     887.1 21473 4272.4
## - under.five.deaths                1     953.2 21540 4277.5
## - Income.composition.of.resources  1    1991.4 22578 4355.1
## - Schooling                        1    2899.2 23486 4420.1
## - Adult.Mortality                  1    3728.2 24314 4477.3
## - HIV.AIDS                         1    8013.7 28600 4745.0
## 
## Step:  AIC=4202.84
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness.5.9.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1       1.6 20588 4201.0
## - Hepatitis.B                      1       6.8 20593 4201.4
## - GDP                              1       9.5 20596 4201.6
## - Measles                          1      11.8 20598 4201.8
## - Polio                            1      16.1 20603 4202.1
## <none>                                         20586 4202.8
## - percentage.expenditure           1      36.2 20623 4203.7
## - thinness.5.9.years               1      54.6 20641 4205.2
## - Total.expenditure                1      65.7 20652 4206.1
## - Diphtheria                       1      72.3 20659 4206.6
## - Status                           1      88.3 20675 4207.9
## - Alcohol                          1     192.6 20779 4216.2
## - BMI                              1     363.0 20949 4229.7
## - Year                             1     384.4 20971 4231.3
## - infant.deaths                    1     888.0 21474 4270.5
## - under.five.deaths                1     954.9 21541 4275.6
## - Income.composition.of.resources  1    1996.0 22582 4353.4
## - Schooling                        1    2912.2 23499 4419.0
## - Adult.Mortality                  1    3730.6 24317 4475.5
## - HIV.AIDS                         1    8016.8 28603 4743.2
## 
## Step:  AIC=4200.96
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Hepatitis.B                      1       6.7 20595 4199.5
## - GDP                              1       9.6 20598 4199.7
## - Measles                          1      11.1 20599 4199.9
## - Polio                            1      16.0 20604 4200.2
## <none>                                         20588 4201.0
## - percentage.expenditure           1      35.9 20624 4201.8
## - thinness.5.9.years               1      54.6 20643 4203.3
## - Total.expenditure                1      65.8 20654 4204.2
## - Diphtheria                       1      71.8 20660 4204.7
## - Status                           1      88.5 20677 4206.0
## - Alcohol                          1     192.6 20781 4214.3
## - BMI                              1     361.8 20950 4227.7
## - Year                             1     384.6 20973 4229.5
## - infant.deaths                    1     924.1 21512 4271.4
## - under.five.deaths                1     971.4 21559 4275.0
## - Income.composition.of.resources  1    1998.0 22586 4351.7
## - Schooling                        1    2916.2 23504 4417.4
## - Adult.Mortality                  1    3741.5 24330 4474.3
## - HIV.AIDS                         1    8015.3 28603 4741.2
## 
## Step:  AIC=4199.5
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - GDP                              1       9.4 20604 4198.3
## - Measles                          1      10.9 20606 4198.4
## - Polio                            1      13.1 20608 4198.5
## <none>                                         20595 4199.5
## - percentage.expenditure           1      37.4 20632 4200.5
## - thinness.5.9.years               1      56.3 20651 4202.0
## - Total.expenditure                1      64.6 20659 4202.7
## - Diphtheria                       1      66.5 20661 4202.8
## - Status                           1      85.2 20680 4204.3
## - Alcohol                          1     190.8 20786 4212.7
## - BMI                              1     357.6 20952 4225.9
## - Year                             1     405.5 21000 4229.7
## - infant.deaths                    1     926.8 21522 4270.1
## - under.five.deaths                1     971.9 21567 4273.5
## - Income.composition.of.resources  1    2011.0 22606 4351.1
## - Schooling                        1    2917.3 23512 4416.0
## - Adult.Mortality                  1    3750.9 24346 4473.4
## - HIV.AIDS                         1    8008.7 28603 4739.2
## 
## Step:  AIC=4198.25
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Measles                          1      10.9 20615 4197.1
## - Polio                            1      14.0 20618 4197.4
## <none>                                         20604 4198.3
## - thinness.5.9.years               1      57.4 20662 4200.8
## - Total.expenditure                1      62.6 20667 4201.3
## - Diphtheria                       1      66.2 20670 4201.5
## - Status                           1      89.5 20694 4203.4
## - Alcohol                          1     188.7 20793 4211.3
## - BMI                              1     355.9 20960 4224.5
## - Year                             1     397.1 21001 4227.7
## - percentage.expenditure           1     748.3 21352 4255.1
## - infant.deaths                    1     928.3 21532 4268.9
## - under.five.deaths                1     973.3 21577 4272.4
## - Income.composition.of.resources  1    2033.1 22637 4351.4
## - Schooling                        1    2958.8 23563 4417.5
## - Adult.Mortality                  1    3750.3 24354 4472.0
## - HIV.AIDS                         1    8004.5 28609 4737.5
## 
## Step:  AIC=4197.12
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1      13.9 20629 4196.2
## <none>                                         20615 4197.1
## - thinness.5.9.years               1      51.5 20667 4199.2
## - Total.expenditure                1      64.8 20680 4200.3
## - Diphtheria                       1      66.4 20681 4200.4
## - Status                           1      91.2 20706 4202.4
## - Alcohol                          1     192.0 20807 4210.4
## - BMI                              1     371.8 20987 4224.6
## - Year                             1     392.1 21007 4226.2
## - percentage.expenditure           1     750.7 21366 4254.1
## - infant.deaths                    1     939.1 21554 4268.6
## - under.five.deaths                1     973.3 21588 4271.2
## - Income.composition.of.resources  1    2031.8 22647 4350.1
## - Schooling                        1    2969.8 23585 4417.0
## - Adult.Mortality                  1    3747.8 24363 4470.6
## - HIV.AIDS                         1    8017.7 28633 4736.9
## 
## Step:  AIC=4196.23
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + BMI + under.five.deaths + 
##     Total.expenditure + Diphtheria + HIV.AIDS + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         20629 4196.2
## - thinness.5.9.years               1      50.1 20679 4198.2
## - Total.expenditure                1      65.8 20695 4199.5
## - Status                           1      91.8 20721 4201.6
## - Diphtheria                       1     143.2 20772 4205.6
## - Alcohol                          1     189.7 20819 4209.3
## - BMI                              1     371.1 21000 4223.6
## - Year                             1     400.5 21029 4225.9
## - percentage.expenditure           1     746.5 21375 4252.9
## - infant.deaths                    1     951.7 21581 4268.6
## - under.five.deaths                1     987.8 21617 4271.4
## - Income.composition.of.resources  1    2033.2 22662 4349.2
## - Schooling                        1    3030.5 23659 4420.3
## - Adult.Mortality                  1    3767.7 24397 4470.9
## - HIV.AIDS                         1    8019.0 28648 4735.7
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     infant.deaths + Alcohol + percentage.expenditure + BMI + 
##     under.five.deaths + Total.expenditure + Diphtheria + HIV.AIDS + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling, 
##     data = Life)
## 
## Coefficients:
##                     (Intercept)                             Year  
##                       3.101e+02                       -1.277e-01  
##                StatusDeveloping                  Adult.Mortality  
##                      -8.975e-01                       -1.626e-02  
##                   infant.deaths                          Alcohol  
##                       8.608e-02                       -1.299e-01  
##          percentage.expenditure                              BMI  
##                       4.523e-04                        3.199e-02  
##               under.five.deaths                Total.expenditure  
##                      -6.516e-02                        9.201e-02  
##                      Diphtheria                         HIV.AIDS  
##                       1.509e-02                       -4.478e-01  
##              thinness.5.9.years  Income.composition.of.resources  
##                      -5.212e-02                        1.052e+01  
##                       Schooling  
##                       9.047e-01
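The AIC column in the trace above is the value returned by extractAIC(), which for a linear model is n * log(RSS / n) + 2 * (number of estimated coefficients). As a rough check against the last step, 1649 * log(20629 / 1649) + 2 * 15 comes to about 4196.2, in line with the reported 4196.23 given that the RSS is printed rounded. The sketch below verifies the formula on a stand-in model fitted to the built-in mtcars data.

```r
# Stand-in model; the formula and data are placeholders.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)                 # observations used
rss <- sum(residuals(fit)^2)     # residual sum of squares
edf <- n - df.residual(fit)      # number of estimated coefficients

# AIC on the scale used by step() (differs from AIC() by a constant).
aic_by_hand <- n * log(rss / n) + 2 * edf
aic_step    <- extractAIC(fit)[2]

print(c(by_hand = aic_by_hand, extractAIC = aic_step))
```

Because only differences in AIC matter for the selection, the constant offset between extractAIC() and AIC() does not change which variable is dropped.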

6.2 Summary Final Model

Having obtained the model selected by backward elimination, we refit it with the final formula and inspect its summary.

Final_model <- lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
    infant.deaths + Alcohol + percentage.expenditure + BMI + 
    under.five.deaths + Total.expenditure + Diphtheria + HIV.AIDS + 
    thinness.5.9.years + Income.composition.of.resources + Schooling, 
    data = Life)

summary(Final_model)
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     infant.deaths + Alcohol + percentage.expenditure + BMI + 
##     under.five.deaths + Total.expenditure + Diphtheria + HIV.AIDS + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling, 
##     data = Life)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7779  -2.1865   0.0023   2.2038  12.4209 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.101e+02  4.540e+01   6.831 1.19e-11 ***
## Year                            -1.277e-01  2.268e-02  -5.632 2.09e-08 ***
## StatusDeveloping                -8.975e-01  3.329e-01  -2.696 0.007089 ** 
## Adult.Mortality                 -1.626e-02  9.413e-04 -17.275  < 2e-16 ***
## infant.deaths                    8.608e-02  9.914e-03   8.682  < 2e-16 ***
## Alcohol                         -1.299e-01  3.351e-02  -3.877 0.000110 ***
## percentage.expenditure           4.523e-04  5.882e-05   7.690 2.53e-14 ***
## BMI                              3.199e-02  5.901e-03   5.421 6.80e-08 ***
## under.five.deaths               -6.516e-02  7.366e-03  -8.846  < 2e-16 ***
## Total.expenditure                9.201e-02  4.029e-02   2.284 0.022516 *  
## Diphtheria                       1.509e-02  4.481e-03   3.368 0.000776 ***
## HIV.AIDS                        -4.478e-01  1.777e-02 -25.203  < 2e-16 ***
## thinness.5.9.years              -5.212e-02  2.616e-02  -1.992 0.046497 *  
## Income.composition.of.resources  1.052e+01  8.292e-01  12.690  < 2e-16 ***
## Schooling                        9.047e-01  5.839e-02  15.493  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.553 on 1634 degrees of freedom
## Multiple R-squared:  0.8382, Adjusted R-squared:  0.8369 
## F-statistic: 604.8 on 14 and 1634 DF,  p-value: < 2.2e-16

7 Model Evaluation

In some cases, you end up with more than one candidate model. It is then important to evaluate the performance of each model and find out which one produces the least total error.

7.1 Root Mean Squared Error

A common measure, introduced earlier in this course, is the Root Mean Squared Error (RMSE). It is the square root of the mean squared difference between the observed values and the values predicted by the model, so a lower RMSE means the predictions sit closer to the data.

# MLR 1 Evaluation
sqrt(mean((developing_model$residuals^2)))
## [1] 5.829581
sqrt(mean((mlr_developed$residuals^2)))
## [1] 3.936332
# MLR 2 Evaluation
sqrt(mean((Final_model$residuals^2)))
## [1] 3.536935

Comparing the RMSE of developing_model (5.83), mlr_developed (3.94) and Final_model (3.54), we observe that Final_model has the lowest value, so it does the best job of the three in summarizing Life.expectancy from its predictors.
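The RMSE calculation repeated above can be wrapped in a small helper. The following is a minimal sketch, not part of the original analysis: the function name rmse is hypothetical, and the built-in mtcars data stands in for the Life data so the snippet runs on its own.

```r
# Hypothetical helper (not defined in the original analysis):
# in-sample RMSE of a fitted lm object.
rmse <- function(model) {
  sqrt(mean(residuals(model)^2))
}

# Illustration on R's built-in mtcars data, not the Life data.
small_fit <- lm(mpg ~ wt, data = mtcars)
large_fit <- lm(mpg ~ wt + hp, data = mtcars)

rmse(small_fit)
rmse(large_fit)
```

Note that the in-sample RMSE of a least-squares model cannot increase when predictors are added to it, so a comparison on held-out data is usually more informative than the in-sample figures alone.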

7.2 Assumption of Homoscedasticity

A common, objective way to check for heteroscedasticity is the Breusch-Pagan test. Heteroscedasticity is a condition in which the variance of the errors is not constant across the range of the data. In a linear regression model, residuals whose spread changes across the range of the target variable indicate that heteroscedasticity is present, and this shows up as a non-random pattern in the residuals. Now let’s apply the Breusch-Pagan test to our Final_model:

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(Final_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  Final_model
## BP = 172.8, df = 14, p-value < 2.2e-16

Breusch-Pagan hypothesis test:

H0: Variation of residual is constant (Homoscedasticity)

H1: Variation of residual is not constant (Heteroscedasticity)

The test has a p-value below the significance level of 0.05 (it is reported as smaller than 2.2e-16), therefore we reject the null hypothesis. We conclude that the residuals do not have constant variance, so the homoscedasticity assumption is not met.
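To make the BP statistic less of a black box, the default (studentized) form of the test can be reproduced by hand: regress the squared residuals on the same predictors, multiply the R-squared of that auxiliary regression by the number of observations, and compare the result with a chi-squared distribution. The sketch below assumes the built-in mtcars data and an illustrative two-predictor model rather than Final_model; all object names are hypothetical.

```r
# Illustrative model on built-in data (assumption: not the Life data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors.
e2  <- residuals(fit)^2
aux <- lm(e2 ~ wt + hp, data = mtcars)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary fit.
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value here has the same reading as in the output above: the squared residuals are explained by the predictors, so the error variance is not constant.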

7.3 Normality Assumption

The normality assumption means that the residuals from the linear regression model should be normally distributed around zero. To test this condition, we can apply the formal Shapiro-Wilk test to our residuals:

shapiro.test(Final_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Final_model$residuals
## W = 0.99168, p-value = 4.694e-08

Shapiro-Wilk hypothesis test:

H0: Residuals are normally distributed

H1: Residuals are not normally distributed

The test has a p-value below the significance level of 0.05, therefore we reject the null hypothesis. We conclude that the residuals are not normally distributed.
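With a sample of this size, the Shapiro-Wilk test can flag even small departures from normality (W = 0.99168 is still close to 1), so a normal Q-Q plot is a useful companion to see how large the departure is. The following is a minimal sketch on the built-in mtcars data with an illustrative model; it is an assumption about how one might inspect the residuals, not a step taken in the original analysis.

```r
# Illustrative model on built-in data (assumption: not the Life data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Points close to the reference line suggest approximately normal residuals.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# The formal test on the same residuals, for comparison with the plot.
sw <- shapiro.test(res)
sw
```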

7.4 Multicollinearity

One of the statistical tools at your disposal for assessing multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures how much the variance of a coefficient estimate is inflated by correlation between that predictor and the other predictors in the model:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(Final_model)
##                            Year                          Status 
##                        1.121545                        1.812569 
##                 Adult.Mortality                   infant.deaths 
##                        1.816404                      187.387309 
##                         Alcohol          percentage.expenditure 
##                        2.379172                        1.397921 
##                             BMI               under.five.deaths 
##                        1.774025                      187.954440 
##               Total.expenditure                      Diphtheria 
##                        1.120304                        1.220286 
##                        HIV.AIDS              thinness.5.9.years 
##                        1.499629                        1.934576 
## Income.composition.of.resources                       Schooling 
##                        3.008832                        3.478180

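Interpretation: A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection. Here infant.deaths and under.five.deaths both have a VIF of about 187, far above that threshold, while every other predictor stays below 4.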
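For numeric predictors, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on the built-in mtcars data; the model and the name manual_vif are illustrative assumptions, not part of the original analysis.

```r
# Illustrative model on built-in data (assumption: not the Life data).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Predictor matrix without the intercept column.
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
manual_vif <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  r2 <- summary(lm(X[, v] ~ others))$r.squared
  1 / (1 - r2)
})

manual_vif
```

A predictor that is almost a linear function of another one has an R² near 1 in this auxiliary regression, which is what drives a VIF into the hundreds.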

8 Conclusion

After these steps, we conclude that Final_model, selected by backward elimination, has the lowest Root Mean Squared Error (RMSE) at 3.536935. Unfortunately, Final_model fails all three assumption checks (homoscedasticity, normality and multicollinearity), so further inspection and treatment are required. Removing the problematic variables flagged in the evaluation, for example one of the two variables flagged for multicollinearity (infant.deaths and under.five.deaths), may be necessary in future work.

9 Acknowledgment

The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis (https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who).