Data Preparation

Data Reading

Description of the variables in the life dataset:
+ Country - Country Observed.
+ Year - Year Observed.
+ Status - Developed or Developing status.
+ Life.expectancy - Life Expectancy in age.
+ Adult.Mortality - Adult Mortality Rates on both sexes (probability of dying between 15-60 years/1000 population).
+ infant.deaths - Number of Infant Deaths per 1000 population.
+ Alcohol - Alcohol recorded per capita (15+) consumption (in litres of pure alcohol).
+ percentage.expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita (%).
+ Hepatitis.B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
+ Measles - Number of reported Measles cases per 1000 population.
+ BMI - Average Body Mass Index of entire population.
+ under_5.deaths - Number of under-five deaths per 1000 population.
+ Polio - Polio (Pol3) immunization coverage among 1-year-olds (%).
+ Total.expenditure - General government expenditure on health as a percentage of total government expenditure (%).
+ Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%).
+ HIV_AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years).
+ GDP - Gross Domestic Product per capita (in USD).
+ Population - Population of the country.
+ thinness.10_19.years - Prevalence of thinness among children and adolescents aged 10 to 19 (%).
+ thinness.5_9.years - Prevalence of thinness among children aged 5 to 9 (%).
+ Income.composition.of.resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
+ Schooling - Number of years of schooling.

Based on the life dataset, we are going to predict Life.expectancy using the other variables as independent variables (predictors).

Clean Up Data

Drop Observations with “NA” value
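
A minimal sketch of this step in base R. The data frame `life_raw` and its values are made up for illustration; in the real analysis it would be the data read from the source file.

```r
# Toy stand-in for the raw data (values are invented for illustration).
life_raw <- data.frame(
  Country         = c("A", "A", "B", "B"),
  Life.expectancy = c(65.0, NA, 71.2, 70.8),
  GDP             = c(584.3, 612.7, NA, 4575.8)
)

# Count missing values per column before dropping anything.
colSums(is.na(life_raw))

# Keep only the rows that have no missing value in any column.
life <- na.omit(life_raw)
nrow(life)
```

Counting the `NA` values per column first shows how many observations each variable costs before the rows are removed.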

Data Checking

## 'data.frame':    1649 obs. of  22 variables:
##  $ Country                        : Factor w/ 193 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under_5.deaths                 : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV_AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness.10_19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5_9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
##         Country          Year             Status     Life.expectancy
##  Afghanistan:  16   Min.   :2000   Developed : 242   Min.   :44.0   
##  Albania    :  16   1st Qu.:2005   Developing:1407   1st Qu.:64.4   
##  Armenia    :  15   Median :2008                     Median :71.7   
##  Austria    :  15   Mean   :2008                     Mean   :69.3   
##  Belarus    :  15   3rd Qu.:2011                     3rd Qu.:75.0   
##  Belgium    :  15   Max.   :2015                     Max.   :89.0   
##  (Other)    :1557                                                   
##  Adult.Mortality infant.deaths        Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 77.0   1st Qu.:   1.00   1st Qu.: 0.810   1st Qu.:   37.44      
##  Median :148.0   Median :   3.00   Median : 3.790   Median :  145.10      
##  Mean   :168.2   Mean   :  32.55   Mean   : 4.533   Mean   :  698.97      
##  3rd Qu.:227.0   3rd Qu.:  22.00   3rd Qu.: 7.340   3rd Qu.:  509.39      
##  Max.   :723.0   Max.   :1600.00   Max.   :17.870   Max.   :18961.35      
##                                                                           
##   Hepatitis.B       Measles            BMI        under_5.deaths   
##  Min.   : 2.00   Min.   :     0   Min.   : 2.00   Min.   :   0.00  
##  1st Qu.:74.00   1st Qu.:     0   1st Qu.:19.50   1st Qu.:   1.00  
##  Median :89.00   Median :    15   Median :43.70   Median :   4.00  
##  Mean   :79.22   Mean   :  2224   Mean   :38.13   Mean   :  44.22  
##  3rd Qu.:96.00   3rd Qu.:   373   3rd Qu.:55.80   3rd Qu.:  29.00  
##  Max.   :99.00   Max.   :131441   Max.   :77.10   Max.   :2100.00  
##                                                                    
##      Polio       Total.expenditure   Diphtheria       HIV_AIDS     
##  Min.   : 3.00   Min.   : 0.740    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:81.00   1st Qu.: 4.410    1st Qu.:82.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.840    Median :92.00   Median : 0.100  
##  Mean   :83.56   Mean   : 5.956    Mean   :84.16   Mean   : 1.984  
##  3rd Qu.:97.00   3rd Qu.: 7.470    3rd Qu.:97.00   3rd Qu.: 0.700  
##  Max.   :99.00   Max.   :14.390    Max.   :99.00   Max.   :50.600  
##                                                                    
##       GDP              Population         thinness.10_19.years
##  Min.   :     1.68   Min.   :        34   Min.   : 0.100      
##  1st Qu.:   462.15   1st Qu.:    191897   1st Qu.: 1.600      
##  Median :  1592.57   Median :   1419631   Median : 3.000      
##  Mean   :  5566.03   Mean   :  14653626   Mean   : 4.851      
##  3rd Qu.:  4718.51   3rd Qu.:   7658972   3rd Qu.: 7.100      
##  Max.   :119172.74   Max.   :1293859294   Max.   :27.200      
##                                                               
##  thinness.5_9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.100     Min.   :0.0000                  Min.   : 4.20  
##  1st Qu.: 1.700     1st Qu.:0.5090                  1st Qu.:10.30  
##  Median : 3.200     Median :0.6730                  Median :12.30  
##  Mean   : 4.908     Mean   :0.6316                  Mean   :12.12  
##  3rd Qu.: 7.100     3rd Qu.:0.7510                  3rd Qu.:14.00  
##  Max.   :28.200     Max.   :0.9360                  Max.   :20.70  
## 
## [1] 44 89

Feature Selection

In total, there are 22 variables: 20 numerical and 2 categorical.
We need to deselect or mutate some variables for the following reasons:
- Deselect Country -> Too many levels, and it gives no additional information for predicting Life.expectancy.
- Deselect Year -> Time series data, and it gives no additional information for predicting Life.expectancy.
- Mutate Hepatitis.B -> The range between the minimum and the 1st quartile is too wide, so the variable needs to be transformed.
- Mutate Polio -> The range between the minimum and the 1st quartile is too wide, so the variable needs to be transformed.
- Mutate Diphtheria -> The range between the minimum and the 1st quartile is too wide, so the variable needs to be transformed.

As stated in the Global Vaccine Action Plan 2011–2020 (GVAP) (1), endorsed by the World Health Assembly in 2012, all countries should reach ≥90% national coverage for all vaccines in the country’s routine immunization schedule by 2020. Based on that target, we recode Hepatitis.B, Polio, and Diphtheria as categorical variables with two values: “Under 90% Covered” and “Covered by 90% or More”. This should give a clearer view of the impact of immunization on Life.expectancy.
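
The recoding at the 90% threshold can be sketched as follows. This is an illustration in base R, not necessarily the code used for this report: the helper `to_coverage` is a hypothetical name, the toy values are invented, and the level labels are taken from the output shown in this document.

```r
# Toy coverage values in percent (invented for illustration).
life <- data.frame(
  Hepatitis.B = c(65, 90, 99, 2),
  Polio       = c(6, 93, 89, 97),
  Diphtheria  = c(65, 92, 90, 58)
)

# Hypothetical helper: recode a coverage percentage into a two-level factor.
to_coverage <- function(x) {
  factor(ifelse(x >= 90, ">=90% Covered", "<90% Covered"),
         levels = c("<90% Covered", ">=90% Covered"))
}

# Apply the same recoding to all three vaccine columns.
vaccines <- c("Hepatitis.B", "Polio", "Diphtheria")
life[vaccines] <- lapply(life[vaccines], to_coverage)
str(life)
```

Fixing the level order explicitly makes "<90% Covered" the reference level, so a regression coefficient for such a variable describes the ">=90% Covered" group relative to it.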

Data Wrangling

## 'data.frame':    1649 obs. of  20 variables:
##  $ Status                         : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : Factor w/ 2 levels "<90% Covered",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under_5.deaths                 : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : Factor w/ 2 levels "<90% Covered",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : Factor w/ 2 levels "<90% Covered",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HIV_AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness.10_19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5_9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

Correlations and Variances

Numerical Variables

To check whether the numerical independent variables are correlated with the dependent variable, we use the ggcorr function.

Life.expectancy, the dependent variable, has a fairly strong positive correlation with Schooling and Income.composition.of.resources, which we examine further in the model analysis. It is negatively correlated with Adult.Mortality, which is expected: when adult mortality is high, life expectancy is low.

Life.expectancy has only a very weak correlation with Population and Measles. We test this further in the next analysis.

The correlation matrix also shows a very strong correlation between infant.deaths and under_5.deaths, which indicates multicollinearity between them. We therefore deselect under_5.deaths, since the other variables seem more closely related to conditions during infancy.
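
The check for strongly correlated pairs can also be done numerically. The sketch below uses base R `cor()` instead of `ggcorr`, on simulated data that is built so that two columns are nearly collinear; the cutoff of 0.9 is an assumed threshold, not a value taken from this analysis.

```r
set.seed(1)
n <- 200
infant.deaths <- rpois(n, 30)

# Simulated data: under_5.deaths is constructed to track infant.deaths closely.
toy <- data.frame(
  infant.deaths  = infant.deaths,
  under_5.deaths = infant.deaths + rpois(n, 2),
  Schooling      = rnorm(n, 12, 3)
)

corr <- cor(toy)

# List each variable pair (upper triangle only) whose |r| exceeds the cutoff.
cutoff <- 0.9  # assumed threshold
high <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)
data.frame(var1 = rownames(corr)[high[, "row"]],
           var2 = colnames(corr)[high[, "col"]],
           r    = corr[high])
```

A pair flagged this way is a candidate for dropping one of its members before fitting a linear model.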

Categorical Variables

Check the distribution of Life.expectancy across each of the categorical variables.

Status Variable

## # A tibble: 2 x 3
##   Status     count percentage
##   <fct>      <int> <chr>     
## 1 Developed    242 14.68%    
## 2 Developing  1407 85.32%
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## Status         1  25005   25005   401.7 <0.0000000000000002 ***
## Residuals   1647 102525      62                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The number of observations from Developing countries is far larger than the number from Developed countries.
  • Life.expectancy is clearly higher in the Developed countries, with a large difference in medians. Although there are some outliers among the Developing countries, we keep them for the time being because they have low leverage.
  • As the p-value of the ANOVA is less than the 0.05 significance level, we conclude that Life.expectancy differs significantly between Developed and Developing countries.
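
A count table and one-way ANOVA of this kind can be sketched in base R as follows. The data is simulated for illustration, so the numbers will not match the output above.

```r
set.seed(42)

# Simulated data: two groups with different mean life expectancy.
toy <- data.frame(
  Status = factor(rep(c("Developed", "Developing"), times = c(30, 170))),
  Life.expectancy = c(rnorm(30, mean = 79, sd = 4),
                      rnorm(170, mean = 68, sd = 8))
)

# Group sizes and their share of all observations.
counts <- table(toy$Status)
data.frame(Status     = names(counts),
           count      = as.vector(counts),
           percentage = sprintf("%.2f%%", 100 * as.vector(counts) / sum(counts)))

# One-way ANOVA: does mean Life.expectancy differ between the groups?
fit <- aov(Life.expectancy ~ Status, data = toy)
summary(fit)

# The p-value can be pulled out of the summary table for a programmatic check.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value < 0.05
```

The same pattern applies to the other categorical variables by swapping `Status` for the variable of interest.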

Hepatitis B Coverage

## # A tibble: 2 x 3
##   Hepatitis.B   count percentage
##   <fct>         <int> <chr>     
## 1 <90% Covered    826 50.09%    
## 2 >=90% Covered   823 49.91%
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## Hepatitis.B    1  12164   12164   173.7 <0.0000000000000002 ***
## Residuals   1647 115366      70                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Surprisingly, about half of the observations have less than 90% coverage of Hepatitis.B immunization.
  • Life.expectancy is higher in countries with Hepatitis.B immunization coverage of 90% or more, with a large difference in medians. Although there are some outliers, we keep them for the time being because most of them have low leverage.
  • As the p-value is less than the 0.05 significance level, we conclude that Life.expectancy differs significantly between the Hepatitis B coverage groups.

Polio Coverage

## # A tibble: 2 x 3
##   Polio         count percentage
##   <fct>         <int> <chr>     
## 1 <90% Covered    700 42.45%    
## 2 >=90% Covered   949 57.55%
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## Polio          1  26047   26047   422.7 <0.0000000000000002 ***
## Residuals   1647 101482      62                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Polio coverage is better than Hepatitis.B coverage: a larger share of observations reaches 90% or more.
  • Life.expectancy is higher in countries with Polio immunization coverage of 90% or more, with a large difference in medians. Polio has fewer upper outliers than Hepatitis.B. Although there are some outliers, we keep them for the time being because most of them have low leverage.
  • As the p-value of the ANOVA is less than the 0.05 significance level, we conclude that Life.expectancy differs significantly between the Polio coverage groups.

Diphtheria Coverage

## # A tibble: 2 x 3
##   Diphtheria    count percentage
##   <fct>         <int> <chr>     
## 1 <90% Covered    704 42.69%    
## 2 >=90% Covered   945 57.31%
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## Diphtheria     1  25834   25834   418.4 <0.0000000000000002 ***
## Residuals   1647 101695      62                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Diphtheria coverage is nearly the same as Polio coverage in terms of the number of observations in each group.
  • The distribution of Diphtheria is similar to that of Polio, which may indicate that Polio and Diphtheria immunizations are given at the same time.
  • As the p-value of the ANOVA is less than the 0.05 significance level, we conclude that Life.expectancy differs significantly between the Diphtheria coverage groups.

Association between Categorical Variables

Development Status vs Hepatitis B Coverage

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(life_selected$Status, life_selected$Hepatitis.B)
## X-squared = 49.835, df = 1, p-value = 0.000000000001672
  • Most of the Developed countries have higher coverage of Hepatitis.B immunization.
  • The chi-square test gives strong evidence that Developed and Developing countries tend to have different coverage of Hepatitis.B immunization.
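
The test of association between two categorical variables can be sketched with `chisq.test` on a 2x2 table. The counts below are made up for illustration; in the analysis the table comes from `table(life_selected$Status, life_selected$Hepatitis.B)`, as the output above shows.

```r
# Made-up 2x2 counts for illustration only.
tab <- matrix(c(10, 40,
                90, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(Status      = c("Developed", "Developing"),
                              Hepatitis.B = c("<90% Covered", ">=90% Covered")))

# For 2x2 tables chisq.test applies Yates' continuity correction by default.
test <- chisq.test(tab)

test$expected   # expected counts under independence
test$p.value    # a small p-value suggests the two variables are associated
```

Inspecting `test$expected` is a quick way to confirm that the expected counts are large enough for the chi-square approximation to be reasonable.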

Development Status vs Polio Coverage

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(life_selected$Status, life_selected$Polio)
## X-squared = 127.6, df = 1, p-value < 0.00000000000000022
  • Developed countries have significantly higher coverage of Polio immunization.
  • The chi-square test gives strong evidence that Developed and Developing countries tend to have different coverage of Polio immunization.

Development Status vs Diphtheria Coverage

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(life_selected$Status, life_selected$Diphtheria)
## X-squared = 129.28, df = 1, p-value < 0.00000000000000022
  • As in the previous findings, Diphtheria shows a pattern similar to Polio. The next test will show whether we need only one of them.

Create Model

As mentioned at the beginning of this analysis, we are going to predict Life.expectancy using the selected variables. Below is the full linear model.

## 
## Call:
## lm(formula = Life.expectancy ~ ., data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0291  -2.1529   0.0557   2.3893  11.5018 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     55.002467863226  0.810849894348  67.833
## StatusDeveloping                -0.981517391893  0.346377773952  -2.834
## Adult.Mortality                 -0.017799228333  0.000967427774 -18.399
## infant.deaths                   -0.003007364967  0.001265712464  -2.376
## Alcohol                         -0.155154873503  0.033801578881  -4.590
## percentage.expenditure           0.000349139357  0.000186187084   1.875
## Hepatitis.B>=90% Covered        -0.637192410194  0.319248875671  -1.996
## Measles                          0.000016831614  0.000010792654   1.560
## BMI                              0.035845737407  0.006160560054   5.819
## Polio>=90% Covered               0.568041014306  0.443924117610   1.280
## Total.expenditure                0.069943232821  0.041790211855   1.674
## Diphtheria>=90% Covered          0.909716421385  0.489937625026   1.857
## HIV_AIDS                        -0.427934296549  0.018491961872 -23.142
## GDP                              0.000009180942  0.000029253467   0.314
## Population                       0.000000002496  0.000000001766   1.414
## thinness.10_19.years            -0.050182389508  0.054691293904  -0.918
## thinness.5_9.years               0.001518971381  0.053738340913   0.028
## Income.composition.of.resources 10.477963833400  0.850733729997  12.316
## Schooling                        0.884291070136  0.061717791004  14.328
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                             0.00466 ** 
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                                0.01762 *  
## Alcohol                                0.00000476823 ***
## percentage.expenditure                       0.06094 .  
## Hepatitis.B>=90% Covered                     0.04611 *  
## Measles                                      0.11906    
## BMI                                    0.00000000713 ***
## Polio>=90% Covered                           0.20087    
## Total.expenditure                            0.09439 .  
## Diphtheria>=90% Covered                      0.06352 .  
## HIV_AIDS                        < 0.0000000000000002 ***
## GDP                                          0.75368    
## Population                                   0.15769    
## thinness.10_19.years                         0.35899    
## thinness.5_9.years                           0.97745    
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.686 on 1630 degrees of freedom
## Multiple R-squared:  0.8263, Adjusted R-squared:  0.8244 
## F-statistic: 430.9 on 18 and 1630 DF,  p-value: < 0.00000000000000022
  • Coefficient interpretation: Since there is no situation in which all independent variables are 0, the intercept has no practical meaning in this context. Among the numerical variables, Adult.Mortality, infant.deaths, Alcohol, HIV_AIDS, and thinness.10_19.years have negative coefficients, indicating that an increase in any of them is associated with a decrease in Life.expectancy. On the other hand, Income.composition.of.resources has a large positive coefficient. Among the categorical variables, StatusDeveloping is expected to lower Life.expectancy by about 0.98 years compared with StatusDeveloped. Curiously, Hepatitis.B>=90% Covered has a negative coefficient, in contrast to the positive coefficients of Polio>=90% Covered and Diphtheria>=90% Covered.
  • Adj. R-squared interpretation: Approximately 82.44% of the observed variation in Life.expectancy is explained by the model’s inputs. This is a fairly good result, suggesting that we are on the right path toward a good linear model.
  • Significance of predictors: Based on the p-values, Adult.Mortality, Alcohol, BMI, HIV_AIDS, Income.composition.of.resources, and Schooling are the most significant predictors (p < 0.001), followed by StatusDeveloping at the 0.01 significance level, and infant.deaths and Hepatitis.B>=90% Covered at the 0.05 significance level. For the remaining predictors, we cannot conclude that changes in them are significantly associated with Life.expectancy.
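
The quantities discussed above can be read directly from a fitted model object. The sketch below fits `lm` on simulated data with invented coefficients, so its numbers are unrelated to the real model; only the column names mirror the dataset.

```r
set.seed(7)
n <- 300

# Simulated predictors and a response with known (invented) effects.
toy <- data.frame(
  Schooling = rnorm(n, 12, 3),
  HIV_AIDS  = rexp(n, rate = 0.5),
  Status    = factor(sample(c("Developed", "Developing"), n, replace = TRUE))
)
toy$Life.expectancy <- 55 + 1.2 * toy$Schooling - 0.5 * toy$HIV_AIDS -
  1 * (toy$Status == "Developing") + rnorm(n, sd = 3)

# Full model: the response against every other column.
model <- lm(Life.expectancy ~ ., data = toy)
s <- summary(model)

# Coefficient table restricted to terms significant at the 0.05 level.
coefs <- s$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]

# Share of variance explained, adjusted for the number of predictors.
s$adj.r.squared
```

A factor predictor appears in the table as the factor name followed by the non-reference level, for example `StatusDeveloping`, and its estimate is the difference from the reference level.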

Advanced Feature Selection

Next, we select the most important predictors using automated selection in R.
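
A minimal sketch of automated selection with `step()` from the stats package, on simulated data in which two columns (`noise1`, `noise2`, hypothetical names) are unrelated to the response by construction. The last lines reproduce the AIC value that `step()` reports for a linear model, n * log(RSS / n) + 2 * (number of coefficients).

```r
set.seed(3)
n <- 300

# Simulated data: two real predictors and two pure-noise columns.
toy <- data.frame(
  Schooling = rnorm(n, 12, 3),
  HIV_AIDS  = rexp(n, rate = 0.5),
  noise1    = rnorm(n),
  noise2    = rnorm(n)
)
toy$Life.expectancy <- 55 + 1.2 * toy$Schooling - 0.5 * toy$HIV_AIDS +
  rnorm(n, sd = 3)

full <- lm(Life.expectancy ~ ., data = toy)  # all predictors
null <- lm(Life.expectancy ~ 1, data = toy)  # intercept only

# Backward: start from the full model and drop terms while AIC improves.
backward <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model and add terms from the full formula.
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)

# AIC as used by step() for lm, computed by hand and via extractAIC().
rss <- sum(resid(full)^2)
aic_by_hand <- n * log(rss / n) + 2 * length(coef(full))
aic_by_hand
extractAIC(full)[2]
```

Setting `trace = 0` suppresses the step-by-step tables; leaving it at the default prints output in the same layout as the tables in this section.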

Stepwise Method

Backward Direction

## Start:  AIC=4321.29
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + thinness.5_9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5_9.years               1       0.0 22147 4319.3
## - GDP                              1       1.3 22148 4319.4
## - thinness.10_19.years             1      11.4 22158 4320.1
## - Polio                            1      22.2 22169 4320.9
## <none>                                         22147 4321.3
## - Population                       1      27.1 22174 4321.3
## - Measles                          1      33.0 22180 4321.8
## - Total.expenditure                1      38.1 22185 4322.1
## - Diphtheria                       1      46.8 22193 4322.8
## - percentage.expenditure           1      47.8 22194 4322.8
## - Hepatitis.B                      1      54.1 22201 4323.3
## - infant.deaths                    1      76.7 22223 4325.0
## - Status                           1     109.1 22256 4327.4
## - Alcohol                          1     286.3 22433 4340.5
## - BMI                              1     460.0 22607 4353.2
## - Income.composition.of.resources  1    2061.0 24208 4466.0
## - Schooling                        1    2789.2 24936 4514.9
## - Adult.Mortality                  1    4599.2 26746 4630.5
## - HIV_AIDS                         1    7276.2 29423 4787.8
## 
## Step:  AIC=4319.29
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - GDP                              1       1.3 22148 4317.4
## - Polio                            1      22.2 22169 4318.9
## <none>                                         22147 4319.3
## - Population                       1      27.1 22174 4319.3
## - Measles                          1      33.1 22180 4319.8
## - Total.expenditure                1      38.1 22185 4320.1
## - thinness.10_19.years             1      41.7 22188 4320.4
## - Diphtheria                       1      46.9 22193 4320.8
## - percentage.expenditure           1      47.8 22194 4320.8
## - Hepatitis.B                      1      54.2 22201 4321.3
## - infant.deaths                    1      77.2 22224 4323.0
## - Status                           1     109.1 22256 4325.4
## - Alcohol                          1     286.3 22433 4338.5
## - BMI                              1     468.5 22615 4351.8
## - Income.composition.of.resources  1    2061.3 24208 4464.0
## - Schooling                        1    2799.0 24946 4513.5
## - Adult.Mortality                  1    4609.5 26756 4629.1
## - HIV_AIDS                         1    7277.4 29424 4785.8
## 
## Step:  AIC=4317.39
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1      22.7 22171 4317.1
## <none>                                         22148 4317.4
## - Population                       1      27.0 22175 4317.4
## - Measles                          1      33.1 22181 4317.9
## - Total.expenditure                1      37.6 22185 4318.2
## - thinness.10_19.years             1      42.0 22190 4318.5
## - Diphtheria                       1      46.3 22194 4318.8
## - Hepatitis.B                      1      53.2 22201 4319.4
## - infant.deaths                    1      76.8 22225 4321.1
## - Status                           1     111.1 22259 4323.6
## - Alcohol                          1     286.1 22434 4336.6
## - BMI                              1     467.8 22616 4349.9
## - percentage.expenditure           1     590.7 22739 4358.8
## - Income.composition.of.resources  1    2079.2 24227 4463.4
## - Schooling                        1    2824.3 24972 4513.3
## - Adult.Mortality                  1    4608.3 26756 4627.1
## - HIV_AIDS                         1    7277.2 29425 4783.9
## 
## Step:  AIC=4317.08
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Population + 
##     thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         22171 4317.1
## - Population                       1      27.3 22198 4317.1
## - Measles                          1      31.2 22202 4317.4
## - Total.expenditure                1      36.0 22207 4317.8
## - thinness.10_19.years             1      39.9 22210 4318.0
## - Hepatitis.B                      1      44.4 22215 4318.4
## - infant.deaths                    1      79.3 22250 4321.0
## - Status                           1     113.0 22284 4323.5
## - Diphtheria                       1     206.4 22377 4330.4
## - Alcohol                          1     283.1 22454 4336.0
## - BMI                              1     471.9 22642 4349.8
## - percentage.expenditure           1     585.6 22756 4358.1
## - Income.composition.of.resources  1    2090.6 24261 4463.7
## - Schooling                        1    2851.8 25022 4514.6
## - Adult.Mortality                  1    4620.1 26791 4627.2
## - HIV_AIDS                         1    7337.5 29508 4786.5
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Population + 
##     thinness.10_19.years + Income.composition.of.resources + 
##     Schooling, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0614  -2.1244   0.0496   2.3996  11.4712 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     54.988635634772  0.809572673070  67.923
## StatusDeveloping                -0.996426463567  0.345406384670  -2.885
## Adult.Mortality                 -0.017814717510  0.000965712450 -18.447
## infant.deaths                   -0.003042129201  0.001258897693  -2.417
## Alcohol                         -0.154259916713  0.033781334791  -4.566
## percentage.expenditure           0.000402417581  0.000061274211   6.567
## Hepatitis.B>=90% Covered        -0.568682910837  0.314538919313  -1.808
## Measles                          0.000016309888  0.000010767114   1.515
## BMI                              0.035941493947  0.006096149812   5.896
## Total.expenditure                0.067891276955  0.041686924792   1.629
## Diphtheria>=90% Covered          1.348666445388  0.345884907735   3.899
## HIV_AIDS                        -0.429134063064  0.018459188110 -23.248
## Population                       0.000000002501  0.000000001765   1.417
## thinness.10_19.years            -0.047793875815  0.027868686866  -1.715
## Income.composition.of.resources 10.522821501001  0.848000360190  12.409
## Schooling                        0.889310284674  0.061359991042  14.493
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                             0.00397 ** 
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                                0.01578 *  
## Alcohol                              0.0000053327894 ***
## percentage.expenditure               0.0000000000686 ***
## Hepatitis.B>=90% Covered                     0.07079 .  
## Measles                                      0.13002    
## BMI                                  0.0000000045224 ***
## Total.expenditure                            0.10359    
## Diphtheria>=90% Covered                      0.00010 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## Population                                   0.15658    
## thinness.10_19.years                         0.08654 .  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.685 on 1633 degrees of freedom
## Multiple R-squared:  0.8262, Adjusted R-squared:  0.8246 
## F-statistic: 517.4 on 15 and 1633 DF,  p-value: < 0.00000000000000022

Forward Direction

## Start:  AIC=7172.14
## Life.expectancy ~ 1
## 
##                                   Df Sum of Sq    RSS    AIC
## + Schooling                        1     67520  60009 5931.1
## + Income.composition.of.resources  1     66310  61219 5964.0
## + Adult.Mortality                  1     62941  64589 6052.3
## + HIV_AIDS                         1     44730  82799 6461.9
## + BMI                              1     37469  90060 6600.5
## + thinness.10_19.years             1     26732 100797 6786.2
## + thinness.5_9.years               1     26694 100836 6786.9
## + Polio                            1     26047 101482 6797.4
## + Diphtheria                       1     25834 101695 6800.9
## + Status                           1     25005 102525 6814.3
## + GDP                              1     24838 102691 6816.9
## + percentage.expenditure           1     21399 106130 6871.3
## + Alcohol                          1     20683 106846 6882.3
## + Hepatitis.B                      1     12164 115366 7008.8
## + Total.expenditure                1      3893 123636 7123.0
## + infant.deaths                    1      3646 123884 7126.3
## + Measles                          1       605 126924 7166.3
## <none>                                         127529 7172.1
## + Population                       1        63 127466 7173.3
## 
## Step:  AIC=5931.06
## Life.expectancy ~ Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## + HIV_AIDS                         1   25626.4 34383 5014.7
## + Adult.Mortality                  1   24319.2 35690 5076.2
## + Income.composition.of.resources  1    7477.0 52532 5713.6
## + BMI                              1    3525.2 56484 5833.2
## + thinness.5_9.years               1    2123.1 57886 5873.7
## + Polio                            1    1743.5 58266 5884.4
## + thinness.10_19.years             1    1695.2 58314 5885.8
## + GDP                              1    1660.0 58349 5886.8
## + percentage.expenditure           1    1630.5 58379 5887.6
## + Diphtheria                       1    1549.3 58460 5889.9
## + Status                           1     844.1 59165 5909.7
## + Hepatitis.B                      1     455.8 59554 5920.5
## + Alcohol                          1     439.7 59570 5920.9
## <none>                                         60009 5931.1
## + Measles                          1      30.2 59979 5932.2
## + infant.deaths                    1      22.9 59987 5932.4
## + Population                       1       6.3 60003 5932.9
## + Total.expenditure                1       1.0 60009 5933.0
## 
## Step:  AIC=5014.67
## Life.expectancy ~ Schooling + HIV_AIDS
## 
##                                   Df Sum of Sq   RSS    AIC
## + Adult.Mortality                  1    7231.6 27152 4627.3
## + Income.composition.of.resources  1    4265.9 30117 4798.2
## + BMI                              1    1702.8 32680 4932.9
## + percentage.expenditure           1    1548.9 32834 4940.7
## + GDP                              1    1527.8 32855 4941.7
## + thinness.5_9.years               1     947.7 33435 4970.6
## + thinness.10_19.years             1     805.3 33578 4977.6
## + Status                           1     627.9 33755 4986.3
## + Polio                            1     329.3 34054 5000.8
## + Diphtheria                       1     322.9 34060 5001.1
## + Total.expenditure                1     227.9 34155 5005.7
## + infant.deaths                    1     123.6 34260 5010.7
## + Hepatitis.B                      1      46.2 34337 5014.5
## <none>                                         34383 5014.7
## + Population                       1      11.9 34371 5016.1
## + Measles                          1       0.8 34382 5016.6
## + Alcohol                          1       0.4 34383 5016.7
## 
## Step:  AIC=4627.29
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality
## 
##                                   Df Sum of Sq   RSS    AIC
## + Income.composition.of.resources  1   2815.85 24336 4448.7
## + percentage.expenditure           1   1059.31 26092 4563.7
## + GDP                              1   1057.75 26094 4563.8
## + BMI                              1   1011.53 26140 4566.7
## + thinness.5_9.years               1    619.35 26532 4591.2
## + thinness.10_19.years             1    591.82 26560 4592.9
## + Status                           1    340.09 26812 4608.5
## + Diphtheria                       1    279.04 26873 4612.3
## + Polio                            1    270.01 26882 4612.8
## + infant.deaths                    1    209.12 26943 4616.5
## + Total.expenditure                1    141.14 27010 4620.7
## <none>                                         27152 4627.3
## + Hepatitis.B                      1     30.45 27121 4627.4
## + Alcohol                          1     29.49 27122 4627.5
## + Population                       1     25.36 27126 4627.7
## + Measles                          1     11.99 27140 4628.6
## 
## Step:  AIC=4448.74
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources
## 
##                          Df Sum of Sq   RSS    AIC
## + percentage.expenditure  1    706.09 23630 4402.2
## + BMI                     1    664.20 23672 4405.1
## + GDP                     1    653.87 23682 4405.8
## + thinness.5_9.years      1    380.74 23955 4424.7
## + thinness.10_19.years    1    344.14 23992 4427.3
## + infant.deaths           1    284.56 24051 4431.3
## + Status                  1    170.61 24165 4439.1
## + Diphtheria              1    157.07 24179 4440.1
## + Polio                   1    155.91 24180 4440.1
## + Total.expenditure       1    147.42 24188 4440.7
## + Population              1     44.51 24291 4447.7
## + Measles                 1     32.54 24303 4448.5
## <none>                                24336 4448.7
## + Alcohol                 1     22.62 24313 4449.2
## + Hepatitis.B             1     18.65 24317 4449.5
## 
## Step:  AIC=4402.19
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure
## 
##                        Df Sum of Sq   RSS    AIC
## + BMI                   1    681.26 22948 4355.9
## + thinness.5_9.years    1    328.31 23301 4381.1
## + thinness.10_19.years  1    302.59 23327 4382.9
## + infant.deaths         1    276.51 23353 4384.8
## + Polio                 1    182.17 23448 4391.4
## + Diphtheria            1    172.04 23458 4392.1
## + Alcohol               1    112.62 23517 4396.3
## + Total.expenditure     1     95.00 23535 4397.5
## + Hepatitis.B           1     51.51 23578 4400.6
## + Population            1     42.71 23587 4401.2
## <none>                              23630 4402.2
## + Status                1     27.79 23602 4402.2
## + Measles               1     25.33 23604 4402.4
## + GDP                   1      1.11 23629 4404.1
## 
## Step:  AIC=4355.95
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI
## 
##                        Df Sum of Sq   RSS    AIC
## + Diphtheria            1   227.284 22721 4341.5
## + Polio                 1   223.936 22724 4341.8
## + infant.deaths         1   159.159 22789 4346.5
## + Alcohol               1   124.554 22824 4349.0
## + thinness.5_9.years    1    78.289 22870 4352.3
## + Hepatitis.B           1    72.254 22876 4352.7
## + thinness.10_19.years  1    72.167 22876 4352.8
## + Total.expenditure     1    59.666 22889 4353.7
## + Status                1    27.891 22921 4355.9
## <none>                              22948 4355.9
## + Population            1    19.272 22929 4356.6
## + Measles               1     3.196 22945 4357.7
## + GDP                   1     2.338 22946 4357.8
## 
## Step:  AIC=4341.53
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria
## 
##                        Df Sum of Sq   RSS    AIC
## + Alcohol               1   159.749 22561 4331.9
## + infant.deaths         1   109.257 22612 4335.6
## + thinness.10_19.years  1    78.356 22643 4337.8
## + thinness.5_9.years    1    73.033 22648 4338.2
## + Total.expenditure     1    43.232 22678 4340.4
## <none>                              22721 4341.5
## + Hepatitis.B           1    27.102 22694 4341.6
## + Status                1    21.917 22699 4341.9
## + Polio                 1    13.602 22708 4342.5
## + Population            1    10.366 22711 4342.8
## + GDP                   1     0.676 22720 4343.5
## + Measles               1     0.432 22721 4343.5
## 
## Step:  AIC=4331.9
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol
## 
##                        Df Sum of Sq   RSS    AIC
## + thinness.10_19.years  1   116.370 22445 4325.4
## + Status                1   112.074 22449 4325.7
## + thinness.5_9.years    1   106.354 22455 4326.1
## + infant.deaths         1    95.306 22466 4326.9
## + Total.expenditure     1    52.166 22509 4330.1
## + Hepatitis.B           1    41.428 22520 4330.9
## <none>                              22561 4331.9
## + Polio                 1    13.930 22547 4332.9
## + Population            1     9.950 22551 4333.2
## + GDP                   1     1.478 22560 4333.8
## + Measles               1     0.012 22561 4333.9
## 
## Step:  AIC=4325.37
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years
## 
##                      Df Sum of Sq   RSS    AIC
## + Status              1   113.215 22332 4319.0
## + Total.expenditure   1    39.968 22405 4324.4
## + Hepatitis.B         1    38.551 22406 4324.5
## + infant.deaths       1    31.863 22413 4325.0
## <none>                            22445 4325.4
## + Polio               1    15.716 22429 4326.2
## + thinness.5_9.years  1     2.539 22442 4327.2
## + Measles             1     1.611 22443 4327.3
## + GDP                 1     1.256 22444 4327.3
## + Population          1     0.047 22445 4327.4
## 
## Step:  AIC=4319.03
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status
## 
##                      Df Sum of Sq   RSS    AIC
## + Hepatitis.B         1    42.565 22289 4317.9
## + Total.expenditure   1    35.002 22297 4318.4
## + infant.deaths       1    28.608 22303 4318.9
## <none>                            22332 4319.0
## + Polio               1    13.763 22318 4320.0
## + Measles             1     2.718 22329 4320.8
## + thinness.5_9.years  1     1.738 22330 4320.9
## + Population          1     0.239 22332 4321.0
## + GDP                 1     0.192 22332 4321.0
## 
## Step:  AIC=4317.89
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status + Hepatitis.B
## 
##                      Df Sum of Sq   RSS    AIC
## + Total.expenditure   1    35.631 22254 4317.2
## + infant.deaths       1    30.432 22259 4317.6
## <none>                            22289 4317.9
## + Polio               1    22.471 22267 4318.2
## + Measles             1     2.472 22287 4319.7
## + thinness.5_9.years  1     1.184 22288 4319.8
## + GDP                 1     0.861 22288 4319.8
## + Population          1     0.113 22289 4319.9
## 
## Step:  AIC=4317.25
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status + Hepatitis.B + Total.expenditure
## 
##                      Df Sum of Sq   RSS    AIC
## + infant.deaths       1   27.5771 22226 4317.2
## <none>                            22254 4317.2
## + Polio               1   23.8495 22230 4317.5
## + Measles             1    3.8052 22250 4319.0
## + GDP                 1    1.3515 22252 4319.1
## + thinness.5_9.years  1    0.6689 22253 4319.2
## + Population          1    0.2788 22253 4319.2
## 
## Step:  AIC=4317.2
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status + Hepatitis.B + Total.expenditure + infant.deaths
## 
##                      Df Sum of Sq   RSS    AIC
## + Measles             1   28.2358 22198 4317.1
## <none>                            22226 4317.2
## + Population          1   24.3561 22202 4317.4
## + Polio               1   21.1019 22205 4317.6
## + GDP                 1    1.5363 22224 4319.1
## + thinness.5_9.years  1    0.0990 22226 4319.2
## 
## Step:  AIC=4317.11
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status + Hepatitis.B + Total.expenditure + infant.deaths + 
##     Measles
## 
##                      Df Sum of Sq   RSS    AIC
## + Population          1   27.2727 22171 4317.1
## <none>                            22198 4317.1
## + Polio               1   22.9679 22175 4317.4
## + GDP                 1    1.5394 22196 4319.0
## + thinness.5_9.years  1    0.0022 22198 4319.1
## 
## Step:  AIC=4317.08
## Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Alcohol + thinness.10_19.years + 
##     Status + Hepatitis.B + Total.expenditure + infant.deaths + 
##     Measles + Population
## 
##                      Df Sum of Sq   RSS    AIC
## <none>                            22171 4317.1
## + Polio               1   22.6508 22148 4317.4
## + GDP                 1    1.7488 22169 4318.9
## + thinness.5_9.years  1    0.0001 22171 4319.1
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + HIV_AIDS + Adult.Mortality + 
##     Income.composition.of.resources + percentage.expenditure + 
##     BMI + Diphtheria + Alcohol + thinness.10_19.years + Status + 
##     Hepatitis.B + Total.expenditure + infant.deaths + Measles + 
##     Population, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0614  -2.1244   0.0496   2.3996  11.4712 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     54.988635634772  0.809572673070  67.923
## Schooling                        0.889310284674  0.061359991042  14.493
## HIV_AIDS                        -0.429134063064  0.018459188110 -23.248
## Adult.Mortality                 -0.017814717510  0.000965712450 -18.447
## Income.composition.of.resources 10.522821501001  0.848000360190  12.409
## percentage.expenditure           0.000402417581  0.000061274211   6.567
## BMI                              0.035941493947  0.006096149812   5.896
## Diphtheria>=90% Covered          1.348666445388  0.345884907735   3.899
## Alcohol                         -0.154259916713  0.033781334791  -4.566
## thinness.10_19.years            -0.047793875815  0.027868686866  -1.715
## StatusDeveloping                -0.996426463567  0.345406384670  -2.885
## Hepatitis.B>=90% Covered        -0.568682910837  0.314538919313  -1.808
## Total.expenditure                0.067891276955  0.041686924792   1.629
## infant.deaths                   -0.003042129201  0.001258897693  -2.417
## Measles                          0.000016309888  0.000010767114   1.515
## Population                       0.000000002501  0.000000001765   1.417
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## percentage.expenditure               0.0000000000686 ***
## BMI                                  0.0000000045224 ***
## Diphtheria>=90% Covered                      0.00010 ***
## Alcohol                              0.0000053327894 ***
## thinness.10_19.years                         0.08654 .  
## StatusDeveloping                             0.00397 ** 
## Hepatitis.B>=90% Covered                     0.07079 .  
## Total.expenditure                            0.10359    
## infant.deaths                                0.01578 *  
## Measles                                      0.13002    
## Population                                   0.15658    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.685 on 1633 degrees of freedom
## Multiple R-squared:  0.8262, Adjusted R-squared:  0.8246 
## F-statistic: 517.4 on 15 and 1633 DF,  p-value: < 0.00000000000000022
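
The trace above has the shape that R's `step()` prints for a forward search: start from the intercept-only model and, at each step, add the term that lowers AIC the most until no addition helps. The original code is not shown, so the following is only a minimal sketch of that procedure. It assumes the built-in `mtcars` data as a stand-in for `life_selected`, and the names `model_none` and `model_all` are hypothetical.

```r
# Forward stepwise selection by AIC on the built-in mtcars data
# (a stand-in for life_selected, which is not reproduced here).
model_none <- lm(mpg ~ 1, data = mtcars)  # intercept-only starting model
model_all  <- lm(mpg ~ ., data = mtcars)  # full model, defines the upper scope

model_forward <- step(
  object    = model_none,
  scope     = list(lower = formula(model_none), upper = formula(model_all)),
  direction = "forward",
  trace     = 0                           # set to 1 to print each step's table
)

# Terms chosen by the search and the resulting adjusted R-squared
formula(model_forward)
summary(model_forward)$adj.r.squared
```

With `trace = 1`, each step prints a table like the ones above, where `<none>` marks the AIC of the current model and every `+` row shows the AIC after adding that term.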

Both Direction

## Start:  AIC=4321.29
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + thinness.5_9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5_9.years               1       0.0 22147 4319.3
## - GDP                              1       1.3 22148 4319.4
## - thinness.10_19.years             1      11.4 22158 4320.1
## - Polio                            1      22.2 22169 4320.9
## <none>                                         22147 4321.3
## - Population                       1      27.1 22174 4321.3
## - Measles                          1      33.0 22180 4321.8
## - Total.expenditure                1      38.1 22185 4322.1
## - Diphtheria                       1      46.8 22193 4322.8
## - percentage.expenditure           1      47.8 22194 4322.8
## - Hepatitis.B                      1      54.1 22201 4323.3
## - infant.deaths                    1      76.7 22223 4325.0
## - Status                           1     109.1 22256 4327.4
## - Alcohol                          1     286.3 22433 4340.5
## - BMI                              1     460.0 22607 4353.2
## - Income.composition.of.resources  1    2061.0 24208 4466.0
## - Schooling                        1    2789.2 24936 4514.9
## - Adult.Mortality                  1    4599.2 26746 4630.5
## - HIV_AIDS                         1    7276.2 29423 4787.8
## 
## Step:  AIC=4319.29
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - GDP                              1       1.3 22148 4317.4
## - Polio                            1      22.2 22169 4318.9
## <none>                                         22147 4319.3
## - Population                       1      27.1 22174 4319.3
## - Measles                          1      33.1 22180 4319.8
## - Total.expenditure                1      38.1 22185 4320.1
## - thinness.10_19.years             1      41.7 22188 4320.4
## - Diphtheria                       1      46.9 22193 4320.8
## - percentage.expenditure           1      47.8 22194 4320.8
## + thinness.5_9.years               1       0.0 22147 4321.3
## - Hepatitis.B                      1      54.2 22201 4321.3
## - infant.deaths                    1      77.2 22224 4323.0
## - Status                           1     109.1 22256 4325.4
## - Alcohol                          1     286.3 22433 4338.5
## - BMI                              1     468.5 22615 4351.8
## - Income.composition.of.resources  1    2061.3 24208 4464.0
## - Schooling                        1    2799.0 24946 4513.5
## - Adult.Mortality                  1    4609.5 26756 4629.1
## - HIV_AIDS                         1    7277.4 29424 4785.8
## 
## Step:  AIC=4317.39
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1      22.7 22171 4317.1
## <none>                                         22148 4317.4
## - Population                       1      27.0 22175 4317.4
## - Measles                          1      33.1 22181 4317.9
## - Total.expenditure                1      37.6 22185 4318.2
## - thinness.10_19.years             1      42.0 22190 4318.5
## - Diphtheria                       1      46.3 22194 4318.8
## + GDP                              1       1.3 22147 4319.3
## - Hepatitis.B                      1      53.2 22201 4319.4
## + thinness.5_9.years               1       0.0 22148 4319.4
## - infant.deaths                    1      76.8 22225 4321.1
## - Status                           1     111.1 22259 4323.6
## - Alcohol                          1     286.1 22434 4336.6
## - BMI                              1     467.8 22616 4349.9
## - percentage.expenditure           1     590.7 22739 4358.8
## - Income.composition.of.resources  1    2079.2 24227 4463.4
## - Schooling                        1    2824.3 24972 4513.3
## - Adult.Mortality                  1    4608.3 26756 4627.1
## - HIV_AIDS                         1    7277.2 29425 4783.9
## 
## Step:  AIC=4317.08
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Population + 
##     thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         22171 4317.1
## - Population                       1      27.3 22198 4317.1
## + Polio                            1      22.7 22148 4317.4
## - Measles                          1      31.2 22202 4317.4
## - Total.expenditure                1      36.0 22207 4317.8
## - thinness.10_19.years             1      39.9 22210 4318.0
## - Hepatitis.B                      1      44.4 22215 4318.4
## + GDP                              1       1.7 22169 4318.9
## + thinness.5_9.years               1       0.0 22171 4319.1
## - infant.deaths                    1      79.3 22250 4321.0
## - Status                           1     113.0 22284 4323.5
## - Diphtheria                       1     206.4 22377 4330.4
## - Alcohol                          1     283.1 22454 4336.0
## - BMI                              1     471.9 22642 4349.8
## - percentage.expenditure           1     585.6 22756 4358.1
## - Income.composition.of.resources  1    2090.6 24261 4463.7
## - Schooling                        1    2851.8 25022 4514.6
## - Adult.Mortality                  1    4620.1 26791 4627.2
## - HIV_AIDS                         1    7337.5 29508 4786.5
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Population + 
##     thinness.10_19.years + Income.composition.of.resources + 
##     Schooling, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0614  -2.1244   0.0496   2.3996  11.4712 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     54.988635634772  0.809572673070  67.923
## StatusDeveloping                -0.996426463567  0.345406384670  -2.885
## Adult.Mortality                 -0.017814717510  0.000965712450 -18.447
## infant.deaths                   -0.003042129201  0.001258897693  -2.417
## Alcohol                         -0.154259916713  0.033781334791  -4.566
## percentage.expenditure           0.000402417581  0.000061274211   6.567
## Hepatitis.B>=90% Covered        -0.568682910837  0.314538919313  -1.808
## Measles                          0.000016309888  0.000010767114   1.515
## BMI                              0.035941493947  0.006096149812   5.896
## Total.expenditure                0.067891276955  0.041686924792   1.629
## Diphtheria>=90% Covered          1.348666445388  0.345884907735   3.899
## HIV_AIDS                        -0.429134063064  0.018459188110 -23.248
## Population                       0.000000002501  0.000000001765   1.417
## thinness.10_19.years            -0.047793875815  0.027868686866  -1.715
## Income.composition.of.resources 10.522821501001  0.848000360190  12.409
## Schooling                        0.889310284674  0.061359991042  14.493
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                             0.00397 ** 
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                                0.01578 *  
## Alcohol                              0.0000053327894 ***
## percentage.expenditure               0.0000000000686 ***
## Hepatitis.B>=90% Covered                     0.07079 .  
## Measles                                      0.13002    
## BMI                                  0.0000000045224 ***
## Total.expenditure                            0.10359    
## Diphtheria>=90% Covered                      0.00010 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## Population                                   0.15658    
## thinness.10_19.years                         0.08654 .  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.685 on 1633 degrees of freedom
## Multiple R-squared:  0.8262, Adjusted R-squared:  0.8246 
## F-statistic: 517.4 on 15 and 1633 DF,  p-value: < 0.00000000000000022
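
The AIC column in these tables can be reproduced by hand. For a linear model, `step()` uses `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. For the final model above (n = 1649, rounded RSS = 22171, 16 coefficients including the intercept) this gives approximately 4317.1, consistent with the printed 4317.08. The sketch below checks the same identity on the built-in `mtcars` data, which is an assumed stand-in for `life_selected`; the variable names are illustrative.

```r
# Both-direction stepwise search, starting from the full model.
model_all  <- lm(mpg ~ ., data = mtcars)
model_both <- step(model_all, direction = "both", trace = 0)

# Recompute the AIC that step() reports for the selected model.
n   <- nrow(mtcars)
rss <- sum(residuals(model_both)^2)  # residual sum of squares
edf <- length(coef(model_both))      # estimated coefficients, incl. intercept

aic_by_hand <- n * log(rss / n) + 2 * edf
aic_by_hand
extractAIC(model_both)               # c(edf, AIC); the AIC should match
```

Note that this scale differs from the one returned by `AIC()`, which includes additive constants; only differences between models fitted to the same data are meaningful.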

All-Possible (Regsubsets)

Based on the plot, the subset with the largest adjusted R-squared consists of Adult.Mortality, Alcohol, percentage.expenditure, BMI, Diphtheria, HIV_AIDS, Income.composition.of.resources, and Schooling. These are also the variables with the most significant p-values (three stars, ***) in the other models.
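
The idea behind an all-possible-subsets search can be illustrated without any extra package. The sketch below is not the `regsubsets` call used for the plot; it is a base-R approximation that fits every non-empty subset of a small, arbitrarily chosen set of `mtcars` predictors and keeps the one with the largest adjusted R-squared. The data set and the predictor list are assumptions made for illustration.

```r
# Exhaustive subset search by adjusted R-squared (base R only).
response   <- "mpg"
predictors <- c("wt", "hp", "disp", "qsec", "drat")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its adjusted R-squared.
adj_r2 <- vapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = response), data = mtcars)
  summary(fit)$adj.r.squared
}, numeric(1))

best_subset <- subsets[[which.max(adj_r2)]]
best_subset
max(adj_r2)
```

The number of subsets doubles with every added predictor, which is why dedicated implementations use smarter search strategies than this brute-force loop.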

Create Model Based on Selected Variables:

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Alcohol + percentage.expenditure + 
##     BMI + Diphtheria + HIV_AIDS + Income.composition.of.resources + 
##     Schooling, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2945  -2.0744   0.0991   2.4163  12.4001 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     53.22608649  0.58395177  91.148
## Adult.Mortality                 -0.01805638  0.00096422 -18.726
## Alcohol                         -0.10392185  0.03049635  -3.408
## percentage.expenditure           0.00046830  0.00005927   7.901
## BMI                              0.04215698  0.00568985   7.409
## Diphtheria>=90% Covered          0.94152766  0.21554902   4.368
## HIV_AIDS                        -0.42895610  0.01838683 -23.330
## Income.composition.of.resources 10.53525738  0.84747104  12.431
## Schooling                        0.93299775  0.06085501  15.331
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                     0.000671 ***
## percentage.expenditure           0.00000000000000502 ***
## BMI                              0.00000000000020250 ***
## Diphtheria>=90% Covered          0.00001332152624881 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.709 on 1640 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8222 
## F-statistic: 953.8 on 8 and 1640 DF,  p-value: < 0.00000000000000022

RegBest (FactoMineR)

We would like to see which variables come out as the best predictors when only the numeric variables are used.

## 
## Call:
## lm(formula = as.formula(as.character(formul)), data = don)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9301  -3.0501   0.1115   3.1362  15.1075 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 46.58327    0.52114   89.39 <0.0000000000000002 ***
## HIV_AIDS    -0.66888    0.01910  -35.03 <0.0000000000000002 ***
## Schooling    1.98401    0.04121   48.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.57 on 1646 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7301 
## F-statistic:  2230 on 2 and 1646 DF,  p-value: < 0.00000000000000022

Create Model Based on Selected Variables:

## 
## Call:
## lm(formula = Life.expectancy ~ HIV_AIDS + Schooling, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9301  -3.0501   0.1115   3.1362  15.1075 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 46.58327    0.52114   89.39 <0.0000000000000002 ***
## HIV_AIDS    -0.66888    0.01910  -35.03 <0.0000000000000002 ***
## Schooling    1.98401    0.04121   48.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.57 on 1646 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7301 
## F-statistic:  2230 on 2 and 1646 DF,  p-value: < 0.00000000000000022

Compare the Adj. R-Squared from All Models

##            model AdjRsquare
## 1 model_backward  0.8245570
## 2  model_forward  0.8245570
## 3     model_both  0.8245570
## 4     model_regs  0.8222260
## 5   model_regMod  0.7300628

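The backward, forward, and both-direction searches arrive at the same model, with the highest adjusted R-squared (0.8246). We will therefore choose model_backward as our model to predict Life.expectancy.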
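
A comparison table like the one above can be assembled by pulling `adj.r.squared` out of each model's summary. The following is a minimal sketch under assumptions: the three models and their names are hypothetical and are fitted on the built-in `mtcars` data rather than on `life_selected`.

```r
# Collect candidate models in a named list.
models <- list(
  model_small  = lm(mpg ~ wt, data = mtcars),
  model_medium = lm(mpg ~ wt + hp, data = mtcars),
  model_full   = lm(mpg ~ ., data = mtcars)
)

# One row per model: adjusted R-squared and number of coefficients.
comparison <- data.frame(
  model      = names(models),
  AdjRsquare = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
  n_coef     = vapply(models, function(m) length(coef(m)), integer(1)),
  row.names  = NULL
)

# Best model first
comparison[order(-comparison$AdjRsquare), ]
```

Listing the coefficient count next to the adjusted R-squared makes it easier to see when a much smaller model gives up only a little fit.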

Checking on Errors

Create Prediction Model to define

Checking Errors with Various Methods

##   Method Error.Value
## 1    MSE 13.44479835
## 2   RMSE  3.66671493
## 3    MAE  2.81676542
## 4   MAPE  0.04251349
## [1] 44 89

Every error measure is small relative to the range of Life.expectancy, the dependent variable, which spans 44 to 89: the RMSE is about 3.67 and the MAPE is about 4.3%. We can therefore expect the predicted values to be reasonably close to the actual values.
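
The four error measures can be computed directly from the actual and predicted values. The sketch below uses base R only; the model and the `mtcars` data are assumptions standing in for the chosen model and `life_selected`. MAPE is returned as a fraction, which matches the scale of the value reported above.

```r
# Fit a stand-in model and predict on the same data.
fit       <- lm(mpg ~ wt + hp, data = mtcars)
actual    <- mtcars$mpg
predicted <- predict(fit, newdata = mtcars)

errors <- actual - predicted

mse  <- mean(errors^2)             # mean squared error
rmse <- sqrt(mse)                  # root mean squared error, in units of the response
mae  <- mean(abs(errors))          # mean absolute error
mape <- mean(abs(errors / actual)) # mean absolute percentage error, as a fraction

data.frame(
  Method      = c("MSE", "RMSE", "MAE", "MAPE"),
  Error.Value = c(mse, rmse, mae, mape)
)

# Range of the response, for judging the size of RMSE and MAE
range(actual)
```

RMSE and MAE are in the units of the response, so they are the ones to compare against its range; MAPE is unit-free but is undefined when an actual value is zero.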

Checking on Assumptions

Normality Test

Plot Residuals on Histogram

Most of the residuals are concentrated around the center of the histogram, which suggests that they are approximately normally distributed.

Plot Residuals on QQPlot

Most of the residuals lie close to the reference line, which suggests that they are approximately normally distributed.
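
Both visual checks need only base R graphics. The following is a minimal sketch, assuming a stand-in model fitted on the built-in `mtcars` data; the plot titles and the number of breaks are arbitrary choices.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Histogram: a roughly symmetric, bell-shaped pile around zero is the
# visual cue for approximate normality.
h <- hist(res, breaks = 10,
          main = "Histogram of residuals", xlab = "Residual")

# Q-Q plot: points close to the reference line mean the sample quantiles
# match the theoretical normal quantiles.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")
```

Departures from the line at the two ends of the Q-Q plot point to heavy or light tails, which a histogram can hide.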

Shapiro Test

## 
##  Shapiro-Wilk normality test
## 
## data:  model_regs$residuals
## W = 0.98912, p-value = 0.0000000008975

Based on the Shapiro-Wilk normality test, the p-value is below 0.05, which implies that the distribution of the residuals is significantly different from a normal distribution. Therefore, we need to make some adjustments to the data.
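
The decision rule applied above can be written out explicitly. This is a minimal sketch on an assumed stand-in model (built-in `mtcars` data), using `shapiro.test()` from base R's stats package; the 0.05 threshold is the one used in the text.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(fit))
sw$statistic  # W, close to 1 when the sample looks normal
sw$p.value

alpha <- 0.05
if (sw$p.value < alpha) {
  message("Reject normality of the residuals at the 5% level.")
} else {
  message("No evidence against normality of the residuals at the 5% level.")
}
```

`shapiro.test()` accepts between 3 and 5000 observations, so it can be applied to the 1649 residuals in this analysis.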

Create New Model

## Start:  AIC=3973.83
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + thinness.5_9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5_9.years               1       0.8 18900 3971.9
## - GDP                              1       2.3 18901 3972.0
## - Polio                            1       9.0 18908 3972.6
## - thinness.10_19.years             1      11.0 18910 3972.8
## - Population                       1      12.5 18911 3972.9
## <none>                                         18899 3973.8
## - Diphtheria                       1      38.8 18938 3975.1
## - percentage.expenditure           1      43.9 18943 3975.5
## - Hepatitis.B                      1      50.0 18949 3976.0
## - Status                           1      71.5 18970 3977.8
## - infant.deaths                    1      79.2 18978 3978.5
## - Measles                          1      79.9 18979 3978.5
## - Total.expenditure                1      80.1 18979 3978.6
## - Alcohol                          1      94.4 18993 3979.8
## - BMI                              1     312.0 19211 3997.9
## - Income.composition.of.resources  1    1690.6 20590 4108.1
## - Schooling                        1    2527.7 21427 4171.4
## - Adult.Mortality                  1    3246.1 22145 4223.9
## - HIV_AIDS                         1    4334.9 23234 4300.2
## 
## Step:  AIC=3971.9
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     GDP + Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - GDP                              1       2.3 18902 3970.1
## - Polio                            1       8.9 18909 3970.6
## - Population                       1      12.4 18912 3970.9
## <none>                                         18900 3971.9
## - thinness.10_19.years             1      24.8 18925 3972.0
## - Diphtheria                       1      38.5 18938 3973.1
## - percentage.expenditure           1      44.0 18944 3973.6
## - Hepatitis.B                      1      49.6 18949 3974.1
## - Status                           1      71.3 18971 3975.9
## - infant.deaths                    1      78.4 18978 3976.5
## - Measles                          1      79.3 18979 3976.6
## - Total.expenditure                1      79.5 18979 3976.6
## - Alcohol                          1      94.6 18994 3977.8
## - BMI                              1     313.7 19213 3996.1
## - Income.composition.of.resources  1    1691.6 20591 4106.2
## - Schooling                        1    2541.5 21441 4170.5
## - Adult.Mortality                  1    3248.6 22148 4222.1
## - HIV_AIDS                         1    4337.9 23238 4298.4
## 
## Step:  AIC=3970.09
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV_AIDS + 
##     Population + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Polio                            1       9.3 18911 3968.9
## - Population                       1      12.3 18914 3969.1
## <none>                                         18902 3970.1
## - thinness.10_19.years             1      25.2 18927 3970.2
## - Diphtheria                       1      37.7 18940 3971.3
## - Hepatitis.B                      1      48.4 18950 3972.2
## - Status                           1      73.2 18975 3974.2
## - infant.deaths                    1      77.9 18980 3974.6
## - Total.expenditure                1      78.7 18981 3974.7
## - Measles                          1      79.3 18981 3974.7
## - Alcohol                          1      94.5 18996 3976.0
## - BMI                              1     312.9 19215 3994.2
## - percentage.expenditure           1     597.4 19499 4017.6
## - Income.composition.of.resources  1    1708.1 20610 4105.6
## - Schooling                        1    2568.5 21470 4170.7
## - Adult.Mortality                  1    3247.5 22149 4220.2
## - HIV_AIDS                         1    4337.8 23240 4296.6
## 
## Step:  AIC=3968.86
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Population + 
##     thinness.10_19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1      12.4 18924 3967.9
## <none>                                         18911 3968.9
## - thinness.10_19.years             1      24.0 18935 3968.9
## - Hepatitis.B                      1      42.9 18954 3970.5
## - Status                           1      73.9 18985 3973.1
## - Measles                          1      77.4 18989 3973.4
## - Total.expenditure                1      77.6 18989 3973.4
## - infant.deaths                    1      79.6 18991 3973.5
## - Alcohol                          1      92.7 19004 3974.6
## - Diphtheria                       1     132.1 19043 3977.9
## - BMI                              1     314.8 19226 3993.1
## - percentage.expenditure           1     594.3 19505 4016.1
## - Income.composition.of.resources  1    1714.8 20626 4104.9
## - Schooling                        1    2584.1 21495 4170.5
## - Adult.Mortality                  1    3255.0 22166 4219.4
## - HIV_AIDS                         1    4390.3 23302 4298.8
## 
## Step:  AIC=3967.91
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + thinness.10_19.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.10_19.years             1      23.3 18947 3967.9
## <none>                                         18924 3967.9
## - Hepatitis.B                      1      43.4 18967 3969.6
## - infant.deaths                    1      71.9 18995 3971.9
## - Status                           1      73.0 18997 3972.0
## - Measles                          1      74.5 18998 3972.2
## - Total.expenditure                1      77.3 19001 3972.4
## - Alcohol                          1      94.2 19018 3973.8
## - Diphtheria                       1     134.6 19058 3977.2
## - BMI                              1     320.7 19244 3992.6
## - percentage.expenditure           1     596.5 19520 4015.3
## - Income.composition.of.resources  1    1714.6 20638 4103.8
## - Schooling                        1    2628.0 21552 4172.7
## - Adult.Mortality                  1    3247.1 22171 4217.7
## - HIV_AIDS                         1    4405.0 23329 4298.6
## 
## Step:  AIC=3967.86
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         18947 3967.9
## - Hepatitis.B                      1      45.7 18993 3969.7
## - Status                           1      71.4 19018 3971.8
## - Alcohol                          1      80.9 19028 3972.6
## - Total.expenditure                1      83.4 19030 3972.8
## - Measles                          1      87.1 19034 3973.2
## - Diphtheria                       1     129.1 19076 3976.7
## - infant.deaths                    1     137.8 19085 3977.4
## - BMI                              1     423.0 19370 4001.0
## - percentage.expenditure           1     598.0 19545 4015.3
## - Income.composition.of.resources  1    1760.5 20707 4107.1
## - Schooling                        1    2660.5 21607 4174.8
## - Adult.Mortality                  1    3255.5 22202 4218.0
## - HIV_AIDS                         1    4485.5 23432 4303.7
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Total.expenditure + Diphtheria + HIV_AIDS + Income.composition.of.resources + 
##     Schooling, data = life_clean1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.785  -1.980  -0.008   2.263  13.405 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.41212847  0.72528872  76.400
## StatusDeveloping                -0.79640456  0.32680840  -2.437
## Adult.Mortality                 -0.01758518  0.00106863 -16.456
## infant.deaths                   -0.00295954  0.00087412  -3.386
## Alcohol                         -0.08341745  0.03216117  -2.594
## percentage.expenditure           0.00040704  0.00005772   7.053
## Hepatitis.B>=90% Covered        -0.59461631  0.30495556  -1.950
## Measles                          0.00002824  0.00001049   2.691
## BMI                              0.03242806  0.00546683   5.932
## Total.expenditure                0.10607817  0.04028236   2.633
## Diphtheria>=90% Covered          1.09897888  0.33539972   3.277
## HIV_AIDS                        -0.59567125  0.03083829 -19.316
## Income.composition.of.resources  9.67520937  0.79952369  12.101
## Schooling                        0.86616701  0.05822545  14.876
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                            0.014923 *  
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                               0.000727 ***
## Alcohol                                     0.009582 ** 
## percentage.expenditure              0.00000000000262 ***
## Hepatitis.B>=90% Covered                    0.051372 .  
## Measles                                     0.007195 ** 
## BMI                                 0.00000000367619 ***
## Total.expenditure                           0.008537 ** 
## Diphtheria>=90% Covered                     0.001073 ** 
## HIV_AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.467 on 1576 degrees of freedom
## Multiple R-squared:  0.8063, Adjusted R-squared:  0.8047 
## F-statistic: 504.6 on 13 and 1576 DF,  p-value: < 0.00000000000000022
## Start:  AIC=6551.77
## Life.expectancy ~ 1
## 
##                                   Df Sum of Sq   RSS    AIC
## + Schooling                        1     54529 43287 5257.5
## + Income.composition.of.resources  1     51607 46209 5361.4
## + Adult.Mortality                  1     44653 53163 5584.3
## + BMI                              1     28468 69348 6006.9
## + HIV_AIDS                         1     28405 69410 6008.3
## + Status                           1     20987 76829 6169.8
## + GDP                              1     20982 76833 6169.9
## + thinness.5_9.years               1     20704 77112 6175.6
## + thinness.10_19.years             1     20532 77284 6179.2
## + Alcohol                          1     19296 78519 6204.4
## + Polio                            1     18881 78934 6212.8
## + Diphtheria                       1     18680 79136 6216.8
## + percentage.expenditure           1     18206 79610 6226.3
## + Hepatitis.B                      1      8291 89525 6412.9
## + Total.expenditure                1      4313 93503 6482.1
## + infant.deaths                    1      3390 94426 6497.7
## + Measles                          1       516 97300 6545.4
## + Population                       1       135 97681 6551.6
## <none>                                         97816 6551.8
## 
## Step:  AIC=5257.55
## Life.expectancy ~ Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## + Adult.Mortality                  1   15022.0 28265 4581.8
## + HIV_AIDS                         1   14988.1 28299 4583.8
## + Income.composition.of.resources  1    5267.1 38020 5053.3
## + BMI                              1    2517.9 40769 5164.3
## + GDP                              1    1676.9 41610 5196.7
## + percentage.expenditure           1    1657.6 41629 5197.5
## + thinness.5_9.years               1    1598.0 41689 5199.7
## + thinness.10_19.years             1    1206.0 42081 5214.6
## + Polio                            1    1046.5 42241 5220.6
## + Diphtheria                       1     901.3 42386 5226.1
## + Status                           1     887.3 42400 5226.6
## + Hepatitis.B                      1     184.0 43103 5252.8
## + Alcohol                          1      93.7 43193 5256.1
## + infant.deaths                    1      65.8 43221 5257.1
## + Total.expenditure                1      56.4 43231 5257.5
## <none>                                         43287 5257.5
## + Measles                          1      25.8 43261 5258.6
## + Population                       1       1.6 43285 5259.5
## 
## Step:  AIC=4581.85
## Life.expectancy ~ Schooling + Adult.Mortality
## 
##                                   Df Sum of Sq   RSS    AIC
## + HIV_AIDS                         1    5173.7 23091 4262.4
## + Income.composition.of.resources  1    2642.8 25622 4427.8
## + BMI                              1    1020.4 27245 4525.4
## + GDP                              1     947.6 27317 4529.6
## + percentage.expenditure           1     925.2 27340 4530.9
## + thinness.5_9.years               1     764.0 27501 4540.3
## + thinness.10_19.years             1     706.1 27559 4543.6
## + Polio                            1     479.0 27786 4556.7
## + Diphtheria                       1     451.5 27814 4558.2
## + Status                           1     340.2 27925 4564.6
## + infant.deaths                    1     172.0 28093 4574.1
## + Hepatitis.B                      1     100.4 28165 4578.2
## <none>                                         28265 4581.8
## + Total.expenditure                1      25.5 28240 4582.4
## + Alcohol                          1      24.1 28241 4582.5
## + Population                       1      17.8 28247 4582.8
## + Measles                          1       0.1 28265 4583.8
## 
## Step:  AIC=4262.4
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS
## 
##                                   Df Sum of Sq   RSS    AIC
## + Income.composition.of.resources  1   2383.53 20708 4091.2
## + percentage.expenditure           1   1117.44 21974 4185.5
## + GDP                              1   1108.05 21983 4186.2
## + BMI                              1    698.64 22393 4215.5
## + thinness.5_9.years               1    502.29 22589 4229.4
## + thinness.10_19.years             1    500.41 22591 4229.6
## + Status                           1    363.47 22728 4239.2
## + Total.expenditure                1    211.74 22880 4249.8
## + Diphtheria                       1    163.51 22928 4253.1
## + Alcohol                          1    147.89 22943 4254.2
## + infant.deaths                    1    145.57 22946 4254.3
## + Polio                            1    140.98 22950 4254.7
## <none>                                         23091 4262.4
## + Population                       1     28.61 23063 4262.4
## + Hepatitis.B                      1      5.36 23086 4264.0
## + Measles                          1      1.11 23090 4264.3
## 
## Step:  AIC=4091.17
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources
## 
##                          Df Sum of Sq   RSS    AIC
## + percentage.expenditure  1    784.52 19923 4031.8
## + GDP                     1    729.44 19978 4036.2
## + BMI                     1    446.26 20262 4058.5
## + thinness.5_9.years      1    304.70 20403 4069.6
## + thinness.10_19.years    1    289.35 20418 4070.8
## + Total.expenditure       1    209.53 20498 4077.0
## + infant.deaths           1    206.25 20502 4077.3
## + Status                  1    201.28 20507 4077.6
## + Diphtheria              1     82.48 20625 4086.8
## + Polio                   1     70.22 20638 4087.8
## + Population              1     46.81 20661 4089.6
## <none>                                20708 4091.2
## + Alcohol                 1      7.60 20700 4092.6
## + Measles                 1      1.81 20706 4093.0
## + Hepatitis.B             1      1.51 20706 4093.1
## 
## Step:  AIC=4031.76
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure
## 
##                        Df Sum of Sq   RSS    AIC
## + BMI                   1    459.75 19464 3996.6
## + thinness.5_9.years    1    255.24 19668 4013.3
## + thinness.10_19.years  1    248.96 19674 4013.8
## + infant.deaths         1    198.64 19725 4017.8
## + Total.expenditure     1    143.66 19780 4022.3
## + Diphtheria            1     92.69 19831 4026.4
## + Polio                 1     87.94 19835 4026.7
## + Population            1     45.13 19878 4030.2
## + Status                1     36.88 19886 4030.8
## <none>                              19923 4031.8
## + Hepatitis.B           1     17.28 19906 4032.4
## + Alcohol               1     10.71 19913 4032.9
## + GDP                   1      1.67 19922 4033.6
## + Measles               1      0.37 19923 4033.7
## 
## Step:  AIC=3996.64
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI
## 
##                        Df Sum of Sq   RSS    AIC
## + Diphtheria            1   129.618 19334 3988.0
## + infant.deaths         1   117.086 19346 3989.1
## + Polio                 1   115.371 19348 3989.2
## + Total.expenditure     1   105.723 19358 3990.0
## + thinness.10_19.years  1    73.332 19390 3992.6
## + thinness.5_9.years    1    71.556 19392 3992.8
## + Status                1    37.304 19426 3995.6
## + Hepatitis.B           1    29.100 19434 3996.3
## <none>                              19464 3996.6
## + Population            1    24.439 19439 3996.6
## + Alcohol               1    15.874 19448 3997.3
## + Measles               1     4.318 19459 3998.3
## + GDP                   1     2.880 19461 3998.4
## 
## Step:  AIC=3988.02
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria
## 
##                        Df Sum of Sq   RSS    AIC
## + Total.expenditure     1    87.195 19247 3982.8
## + infant.deaths         1    84.773 19249 3983.0
## + thinness.10_19.years  1    78.368 19256 3983.6
## + thinness.5_9.years    1    68.258 19266 3984.4
## + Hepatitis.B           1    33.278 19301 3987.3
## + Status                1    31.775 19302 3987.4
## + Alcohol               1    28.304 19306 3987.7
## <none>                              19334 3988.0
## + Population            1    16.356 19318 3988.7
## + Measles               1     8.449 19325 3989.3
## + Polio                 1     2.701 19331 3989.8
## + GDP                   1     1.355 19333 3989.9
## 
## Step:  AIC=3982.83
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure
## 
##                        Df Sum of Sq   RSS    AIC
## + infant.deaths         1    72.396 19174 3978.8
## + thinness.10_19.years  1    63.923 19183 3979.5
## + thinness.5_9.years    1    53.326 19193 3980.4
## + Alcohol               1    34.439 19212 3982.0
## + Hepatitis.B           1    33.222 19214 3982.1
## + Status                1    26.246 19221 3982.7
## <none>                              19247 3982.8
## + Measles               1    13.391 19233 3983.7
## + Population            1    12.352 19234 3983.8
## + Polio                 1     3.237 19244 3984.6
## + GDP                   1     2.116 19245 3984.7
## 
## Step:  AIC=3978.84
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure + 
##     infant.deaths
## 
##                        Df Sum of Sq   RSS    AIC
## + Measles               1    84.473 19090 3973.8
## + Hepatitis.B           1    34.805 19140 3978.0
## + Alcohol               1    29.598 19145 3978.4
## + Status                1    26.751 19148 3978.6
## <none>                              19174 3978.8
## + thinness.10_19.years  1    23.899 19150 3978.9
## + thinness.5_9.years    1    17.443 19157 3979.4
## + Population            1     9.306 19165 3980.1
## + GDP                   1     2.314 19172 3980.6
## + Polio                 1     1.986 19172 3980.7
## 
## Step:  AIC=3973.82
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure + 
##     infant.deaths + Measles
## 
##                        Df Sum of Sq   RSS    AIC
## + Hepatitis.B           1    34.563 19055 3972.9
## + Alcohol               1    29.720 19060 3973.3
## + Status                1    28.268 19062 3973.5
## <none>                              19090 3973.8
## + thinness.10_19.years  1    14.092 19076 3974.6
## + Population            1    12.580 19077 3974.8
## + thinness.5_9.years    1     8.747 19081 3975.1
## + Polio                 1     3.251 19087 3975.5
## + GDP                   1     2.250 19088 3975.6
## 
## Step:  AIC=3972.94
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure + 
##     infant.deaths + Measles + Hepatitis.B
## 
##                        Df Sum of Sq   RSS    AIC
## + Alcohol               1    37.094 19018 3971.8
## + Status                1    27.610 19028 3972.6
## <none>                              19055 3972.9
## + Population            1    12.271 19043 3973.9
## + thinness.10_19.years  1    11.856 19043 3973.9
## + Polio                 1     7.734 19048 3974.3
## + thinness.5_9.years    1     6.623 19049 3974.4
## + GDP                   1     3.776 19052 3974.6
## 
## Step:  AIC=3971.84
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure + 
##     infant.deaths + Measles + Hepatitis.B + Alcohol
## 
##                        Df Sum of Sq   RSS    AIC
## + Status                1    71.394 18947 3967.9
## <none>                              19018 3971.8
## + thinness.10_19.years  1    21.639 18997 3972.0
## + thinness.5_9.years    1    13.487 19005 3972.7
## + Population            1    10.846 19007 3972.9
## + Polio                 1     8.924 19009 3973.1
## + GDP                   1     4.738 19013 3973.4
## 
## Step:  AIC=3967.86
## Life.expectancy ~ Schooling + Adult.Mortality + HIV_AIDS + Income.composition.of.resources + 
##     percentage.expenditure + BMI + Diphtheria + Total.expenditure + 
##     infant.deaths + Measles + Hepatitis.B + Alcohol + Status
## 
##                        Df Sum of Sq   RSS    AIC
## <none>                              18947 3967.9
## + thinness.10_19.years  1   23.2693 18924 3967.9
## + thinness.5_9.years    1   13.9540 18933 3968.7
## + Population            1   11.6307 18935 3968.9
## + Polio                 1    8.2642 18939 3969.2
## + GDP                   1    2.7072 18944 3969.6
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     HIV_AIDS + Income.composition.of.resources + percentage.expenditure + 
##     BMI + Diphtheria + Total.expenditure + infant.deaths + Measles + 
##     Hepatitis.B + Alcohol + Status, data = life_clean1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.785  -1.980  -0.008   2.263  13.405 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     55.41212847  0.72528872  76.400
## Schooling                        0.86616701  0.05822545  14.876
## Adult.Mortality                 -0.01758518  0.00106863 -16.456
## HIV_AIDS                        -0.59567125  0.03083829 -19.316
## Income.composition.of.resources  9.67520937  0.79952369  12.101
## percentage.expenditure           0.00040704  0.00005772   7.053
## BMI                              0.03242806  0.00546683   5.932
## Diphtheria>=90% Covered          1.09897888  0.33539972   3.277
## Total.expenditure                0.10607817  0.04028236   2.633
## infant.deaths                   -0.00295954  0.00087412  -3.386
## Measles                          0.00002824  0.00001049   2.691
## Hepatitis.B>=90% Covered        -0.59461631  0.30495556  -1.950
## Alcohol                         -0.08341745  0.03216117  -2.594
## StatusDeveloping                -0.79640456  0.32680840  -2.437
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## percentage.expenditure              0.00000000000262 ***
## BMI                                 0.00000000367619 ***
## Diphtheria>=90% Covered                     0.001073 ** 
## Total.expenditure                           0.008537 ** 
## infant.deaths                               0.000727 ***
## Measles                                     0.007195 ** 
## Hepatitis.B>=90% Covered                    0.051372 .  
## Alcohol                                     0.009582 ** 
## StatusDeveloping                            0.014923 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.467 on 1576 degrees of freedom
## Multiple R-squared:  0.8063, Adjusted R-squared:  0.8047 
## F-statistic: 504.6 on 13 and 1576 DF,  p-value: < 0.00000000000000022

Unfortunately, the adjusted R-squared fell from 0.8245570 to 0.8047. This is not what we were expecting. Therefore, we will keep the original data, outliers included.
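The traces above have the shape of output from R's `step()` function, run once backward from the full model and once forward from the intercept-only model. The following is a minimal sketch of that workflow, assuming `step()` was the tool used; it runs on the built-in `mtcars` data as a stand-in, and the object names are chosen for illustration.

```r
# mtcars ships with R; it is used here only as a stand-in dataset
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from the full model and drop terms
model_backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms
model_forward <- step(null,
                      scope = list(lower = formula(null), upper = formula(full)),
                      direction = "forward", trace = 0)

formula(model_backward)
formula(model_forward)

# Compare the candidates on adjusted R-squared
summary(model_backward)$adj.r.squared
summary(model_forward)$adj.r.squared
```

Setting `trace = 0` suppresses the step-by-step AIC tables; leaving it at the default prints traces like the ones shown in this section.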

Transform the Data

  • Log Transformation

    Let us try transforming the data with a log transformation. Since we have already decided that “model_backward” is the best-fitting model, we will transform only the variables used in that model.

## 
## Call:
## lm(formula = log1p(Life.expectancy) ~ Status + log1p(Adult.Mortality) + 
##     log1p(infant.deaths) + log1p(Alcohol) + log1p(percentage.expenditure) + 
##     Hepatitis.B + log1p(Measles) + log1p(BMI) + log1p(Total.expenditure) + 
##     Diphtheria + log1p(HIV_AIDS) + log1p(thinness.10_19.years) + 
##     log1p(Income.composition.of.resources) + log1p(Schooling), 
##     data = life_selected)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.255542 -0.028055  0.001655  0.030288  0.189695 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                             3.9699293  0.0252575 157.178
## StatusDeveloping                       -0.0065261  0.0045033  -1.449
## log1p(Adult.Mortality)                 -0.0099200  0.0013338  -7.438
## log1p(infant.deaths)                   -0.0054746  0.0011774  -4.650
## log1p(Alcohol)                          0.0029923  0.0020038   1.493
## log1p(percentage.expenditure)           0.0066102  0.0008564   7.719
## Hepatitis.B>=90% Covered               -0.0090513  0.0044033  -2.056
## log1p(Measles)                         -0.0002588  0.0005172  -0.500
## log1p(BMI)                              0.0024199  0.0019714   1.228
## log1p(Total.expenditure)                0.0068497  0.0035765   1.915
## Diphtheria>=90% Covered                 0.0128582  0.0047825   2.689
## log1p(HIV_AIDS)                        -0.0923361  0.0019639 -47.018
## log1p(thinness.10_19.years)            -0.0096884  0.0024131  -4.015
## log1p(Income.composition.of.resources)  0.1784654  0.0157667  11.319
## log1p(Schooling)                        0.0995534  0.0096981  10.265
##                                                    Pr(>|t|)    
## (Intercept)                            < 0.0000000000000002 ***
## StatusDeveloping                                    0.14748    
## log1p(Adult.Mortality)                   0.0000000000001648 ***
## log1p(infant.deaths)                     0.0000035909852119 ***
## log1p(Alcohol)                                      0.13555    
## log1p(percentage.expenditure)            0.0000000000000203 ***
## Hepatitis.B>=90% Covered                            0.03998 *  
## log1p(Measles)                                      0.61695    
## log1p(BMI)                                          0.21979    
## log1p(Total.expenditure)                            0.05564 .  
## Diphtheria>=90% Covered                             0.00725 ** 
## log1p(HIV_AIDS)                        < 0.0000000000000002 ***
## log1p(thinness.10_19.years)              0.0000621539384934 ***
## log1p(Income.composition.of.resources) < 0.0000000000000002 ***
## log1p(Schooling)                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0515 on 1634 degrees of freedom
## Multiple R-squared:  0.851,  Adjusted R-squared:  0.8498 
## F-statistic: 666.8 on 14 and 1634 DF,  p-value: < 0.00000000000000022

This looks promising: the adjusted R-squared (0.8498) is even higher than that of “model_backward”.
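As a small illustration of the `log1p()` approach, the sketch below fits a log-log model on simulated data and brings the predictions back to the original scale with `expm1()`. The `toy` data frame and its columns are hypothetical and only loosely mimic the life dataset.

```r
set.seed(42)
# Hypothetical right-skewed predictor and positive response
gdp  <- rexp(200, rate = 1 / 5000)
life <- 50 + 3 * log1p(gdp) + rnorm(200, sd = 2)
toy  <- data.frame(gdp, life)

# log1p(x) = log(1 + x), which stays finite when a variable contains zeros
log_fit <- lm(log1p(life) ~ log1p(gdp), data = toy)
summary(log_fit)$adj.r.squared

# Predictions are on the log1p scale; expm1() inverts the transformation
pred_years <- expm1(predict(log_fit, newdata = toy))

# RMSE on the original scale, in the unit of the response
sqrt(mean((toy$life - pred_years)^2))
```

Because the response itself is transformed, the adjusted R-squared of a log model describes variance on the log scale. An error measure computed after back-transforming, as in the last line, gives a like-for-like comparison with an untransformed model.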

  • Box-Cox Transformation

## 
## Call:
## lm(formula = powerTransform(Life.expectancy, lambda) ~ Status + 
##     Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + 
##     Hepatitis.B + Measles + BMI + Total.expenditure + Diphtheria + 
##     HIV_AIDS + thinness.10_19.years + Income.composition.of.resources + 
##     Schooling, data = life_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -216.690  -32.484    0.082   34.665  194.624 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     423.2163283  11.8150671  35.820
## StatusDeveloping                -18.4114001   5.0610413  -3.638
## Adult.Mortality                  -0.2569250   0.0141492 -18.158
## infant.deaths                    -0.0275159   0.0149145  -1.845
## Alcohol                          -2.2417078   0.4947660  -4.531
## percentage.expenditure            0.0068207   0.0008978   7.597
## Hepatitis.B>=90% Covered         -9.9855002   4.6088322  -2.167
## Measles                           0.0002341   0.0001576   1.486
## BMI                               0.5115523   0.0892458   5.732
## Total.expenditure                 1.2746992   0.6108360   2.087
## Diphtheria>=90% Covered          20.8908095   5.0661567   4.124
## HIV_AIDS                         -5.5564512   0.2704781 -20.543
## thinness.10_19.years             -0.7798239   0.4082681  -1.910
## Income.composition.of.resources 153.2651542  12.4259713  12.334
## Schooling                        13.0819270   0.8964068  14.594
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                            0.000283 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                               0.065231 .  
## Alcohol                           0.0000063012976700 ***
## percentage.expenditure            0.0000000000000507 ***
## Hepatitis.B>=90% Covered                    0.030409 *  
## Measles                                     0.137488    
## BMI                               0.0000000118010916 ***
## Total.expenditure                           0.037060 *  
## Diphtheria>=90% Covered           0.0000391729179153 ***
## HIV_AIDS                        < 0.0000000000000002 ***
## thinness.10_19.years                        0.056298 .  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.99 on 1634 degrees of freedom
## Multiple R-squared:  0.8212, Adjusted R-squared:  0.8197 
## F-statistic:   536 on 14 and 1634 DF,  p-value: < 0.00000000000000022

The adjusted R-squared (0.8197) is smaller than that of “model_backward”, so we will not use the Box-Cox transformation and will instead use the log transformation for our new model.

Second Normality Test

Plot Residuals on Histogram

The residuals now appear to be better concentrated around the center.

Plot Residuals on QQPlot

The residuals below -2 and above 2 fall well away from the reference line, so they still do not appear to follow a normal distribution.
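The two residual plots above can be reproduced with base R graphics. The following is only a minimal sketch: it uses the built-in `mtcars` data as a stand-in for the life dataset, and the model formula and the name `fit` are illustrative assumptions, not the model from this report.

```r
# Illustrative stand-in model: mtcars replaces the life data (assumption).
fit <- lm(log1p(mpg) ~ wt + hp, data = mtcars)

# Standardized residuals put the points on a comparable scale,
# so values beyond -2 and 2 are easy to spot.
std_res <- rstandard(fit)

# Histogram of the residuals.
hist(std_res, breaks = 10, main = "Standardized residuals", xlab = "Residual")

# Normal QQ plot: sample quantiles against theoretical normal quantiles.
qqnorm(std_res)
qqline(std_res)  # reference line through the first and third quartiles

# Observations whose standardized residual lies beyond -2 or 2.
extreme <- which(abs(std_res) > 2)
```

Points that drift away from the `qqline()` reference line at either end correspond to tails that are heavier or lighter than a normal distribution would produce.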

Shapiro Test

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9798         0.0000 
## Kolmogorov-Smirnov        0.0439         0.0035 
## Cramer-von Mises         497.4152        0.1200 
## Anderson-Darling           6.03          0.0000 
## -----------------------------------------------

Based on the Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling normality tests, the p-value is < 0.05, implying that the distribution of the residuals is significantly different from a normal distribution, so the data would need some further adjustment. Only the Cramer-von Mises test gives a p-value > 0.05. We therefore conclude that the residuals still do not follow a normal distribution.
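The decision rule used here (reject normality when the p-value is below 0.05) can be sketched with the tests that ship with base R. This is an illustration under assumptions: `mtcars` stands in for the life data, the formula is arbitrary, and only two of the four tests in the table are shown because the other two need extra packages.

```r
# Illustrative stand-in model: mtcars replaces the life data (assumption).
fit <- lm(log1p(mpg) ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test. H0: the residuals come from a normal distribution.
sw <- shapiro.test(res)

# Kolmogorov-Smirnov test against a normal distribution whose mean and sd
# are estimated from the residuals (the p-value is then only approximate).
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

alpha <- 0.05
reject_normality <- c(
  shapiro_wilk       = sw$p.value < alpha,
  kolmogorov_smirnov = ks$p.value < alpha
)
print(reject_normality)
```

A `TRUE` entry means the corresponding test rejects the hypothesis that the residuals are normally distributed at the 5% level.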

Homoscedasticity

Plot Fitted vs Residuals

Based on a visual inspection of the plot, the residuals do not seem to follow any particular pattern.

Breusch-Pagan Test

## 
##  studentized Breusch-Pagan test
## 
## data:  log_life
## BP = 81.344, df = 14, p-value = 0.00000000001592
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                        Data                        
##  --------------------------------------------------
##  Response : log1p(Life.expectancy) 
##  Variables: fitted values of log1p(Life.expectancy) 
## 
##                  Test Summary                  
##  ----------------------------------------------
##  DF            =    1 
##  Chi2          =    85.30346 
##  Prob > Chi2   =    0.0000000000000000000255915

Using two different functions to test homoscedasticity, we reach the same conclusion in both cases: the variance of the residuals is not constant.
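To make the first statistic less of a black box, the studentized (Koenker) form of the Breusch-Pagan test can be computed by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by the number of observations. The sketch below assumes `mtcars` as stand-in data and an arbitrary formula; it is not the `log_life` model itself.

```r
# Illustrative stand-in model: mtcars replaces the life data (assumption).
fit <- lm(log1p(mpg) ~ wt + hp, data = mtcars)
n   <- nobs(fit)

# Auxiliary regression: squared residuals on the same predictors.
sq_res <- residuals(fit)^2
aux    <- lm(sq_res ~ wt + hp, data = mtcars)

# Studentized Breusch-Pagan statistic: n times the auxiliary R-squared.
bp_stat <- n * summary(aux)$r.squared

# Degrees of freedom: number of predictors, excluding the intercept.
bp_df <- length(coef(aux)) - 1

# Under H0 (constant variance) the statistic is approximately chi-squared.
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", bp_stat, " df =", bp_df, " p-value =", p_value, "\n")
```

A small p-value means the predictors explain a meaningful share of the squared residuals, which is evidence against constant variance. The second output above differs because it uses the fitted values as the single explanatory variable, hence DF = 1.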

Multicollinearity Test

##                                 Status 
##                               1.578679 
##                 log1p(Adult.Mortality) 
##                               1.210762 
##                   log1p(infant.deaths) 
##                               2.340268 
##                         log1p(Alcohol) 
##                               1.866246 
##          log1p(percentage.expenditure) 
##                               1.678846 
##                            Hepatitis.B 
##                               3.013337 
##                         log1p(Measles) 
##                               1.754956 
##                             log1p(BMI) 
##                               1.351176 
##               log1p(Total.expenditure) 
##                               1.100274 
##                             Diphtheria 
##                               3.478853 
##                        log1p(HIV_AIDS) 
##                               1.534598 
##            log1p(thinness.10_19.years) 
##                               1.864135 
## log1p(Income.composition.of.resources) 
##                               2.353177 
##                       log1p(Schooling) 
##                               3.115301

Since all VIF values are < 10, we can conclude that there is no serious multicollinearity among the independent variables.
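The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch of this calculation in base R follows, assuming `mtcars` as stand-in data, an arbitrary formula, and numeric predictors only; the name `vif_manual` is hypothetical.

```r
# Illustrative stand-in model: mtcars replaces the life data (assumption).
fit <- lm(log1p(mpg) ~ wt + hp + disp, data = mtcars)

# Design matrix without the intercept column.
X <- model.matrix(fit)[, -1, drop = FALSE]

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a variance inflation factor.
vif_manual <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  r2 <- summary(lm(X[, v] ~ others))$r.squared
  1 / (1 - r2)
})

print(vif_manual)

# Flag predictors above the threshold of 10 used in this report.
high_vif <- names(vif_manual)[vif_manual > 10]
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient variance is inflated by correlation with the remaining predictors.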

Conclusion

The linear model seems fit to predict Life.expectancy based on the adjusted R-squared value and the error value, and it passes 2 of the 4 assumption checks: multicollinearity and linearity. However, the normality and homoscedasticity checks do not give the expected result. Visually, the residual plots seem to follow a normal distribution and the homoscedasticity principle, but the statistical tests give a different result.

The linear model can be used to explain the linear relationship between Life.expectancy and the selected independent variables. However, this model is highly sensitive to outliers, which are numerous in this data and which we do not consider it a good option to remove. It is therefore highly recommended to examine the outlier pattern before using this model on a new set of Life.expectancy data.