Intro

The dataset I use in this regression analysis contains life expectancy data for 193 countries over the past several years. Life expectancy is a tool for evaluating how well a government is improving the welfare of its population in general and its health status in particular. My first goal is therefore to build a model that shows which variables most influence life expectancy, so that a government that wants to raise life expectancy can focus on those variables. My second goal is to build a model that predicts life expectancy from the available predictors.

Import Library

# import libs
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(randomForest)
library(partykit)

Preparation

Load data

life <- read.csv("lifee.csv")
head(life, 10)
str(life)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

Check NA

colSums(is.na(life))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                              10 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##            thinness..1.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163
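
Raw NA counts are easier to compare as proportions, and it helps to know how many rows will survive `na.omit()` before dropping anything. The following is a minimal sketch on a hypothetical toy data frame (`toy` is a stand-in, not the real `life` data):

```r
# Hypothetical toy data standing in for `life`
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# Share of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(na_share, 2))

# Number of rows with no missing value, i.e. the rows na.omit() would keep
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to `life`, the same two calls would show which columns (for example Population or Hepatitis.B) drive most of the row loss.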

Check Outliers

boxplot(life %>% select(-c(Country, Year, Status)))
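
A boxplot flags points that lie more than 1.5 times the box height beyond the hinges. The same rule can be applied numerically with `boxplot.stats()`. This is a small sketch on a hypothetical vector, not on the real columns:

```r
# Hypothetical vector with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 95)

# boxplot.stats() returns the points beyond the whiskers in $out
outliers <- boxplot.stats(x)$out
print(outliers)
```

Running this per column would give a list of candidate outliers to review before choosing a cutoff by hand.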

Dealing with NA and Outliers

life_clean <- life %>% 
  select(-c(Country, Year)) %>%
  mutate(Status = as.factor(Status)) %>%
  filter(thinness..1.19.years <= 20) %>% 
  na.omit()

Exploratory Data Analysis

ggcorr(life_clean %>% select(-Status), label = TRUE, hjust = 1, layout.exp = 5)

Modeling

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(1234)

index <- sample(nrow(life_clean), nrow(life_clean)*0.75)
train <- life_clean[index, ]
test <- life_clean[-index, ]
model1 <- lm(Life.expectancy~., data = train)
summary(model1)
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9988  -2.1057   0.0423   2.2513  11.9603 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.425e+01  9.883e-01  54.895  < 2e-16 ***
## StatusDeveloping                -1.064e+00  3.914e-01  -2.718  0.00666 ** 
## Adult.Mortality                 -1.659e-02  1.069e-03 -15.523  < 2e-16 ***
## infant.deaths                    9.456e-02  1.500e-02   6.306 4.01e-10 ***
## Alcohol                         -9.712e-02  3.888e-02  -2.498  0.01263 *  
## percentage.expenditure           1.514e-04  2.371e-04   0.639  0.52318    
## Hepatitis.B                     -6.521e-03  5.287e-03  -1.234  0.21761    
## Measles                         -1.095e-05  1.787e-05  -0.613  0.54027    
## BMI                              3.616e-02  6.993e-03   5.172 2.71e-07 ***
## under.five.deaths               -7.091e-02  1.009e-02  -7.025 3.57e-12 ***
## Polio                            1.162e-02  5.864e-03   1.981  0.04781 *  
## Total.expenditure                9.315e-02  4.893e-02   1.904  0.05718 .  
## Diphtheria                       1.507e-02  6.807e-03   2.214  0.02704 *  
## HIV.AIDS                        -4.349e-01  1.939e-02 -22.427  < 2e-16 ***
## GDP                              3.986e-05  3.725e-05   1.070  0.28471    
## Population                      -1.819e-09  3.795e-09  -0.479  0.63180    
## thinness..1.19.years             3.413e-02  6.239e-02   0.547  0.58447    
## thinness.5.9.years              -7.095e-02  6.183e-02  -1.147  0.25144    
## Income.composition.of.resources  9.922e+00  9.818e-01  10.106  < 2e-16 ***
## Schooling                        8.346e-01  6.953e-02  12.003  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.605 on 1205 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8373 
## F-statistic: 332.6 on 19 and 1205 DF,  p-value: < 2.2e-16
Based on the summary of the lm model, four predictors stand out as the most influential on life expectancy, judged by the size of their t values. Two have a positive effect, income composition of resources (Income.composition.of.resources) and level of education (Schooling), and two have a negative effect, the adult mortality rate (Adult.Mortality) and the HIV/AIDS rate (HIV.AIDS) in the country.
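
Raw coefficients depend on the unit of each predictor, so they cannot be compared directly. One common way to rank predictors in a linear model is to standardize all variables first. The sketch below uses the built-in `mtcars` data purely as a stand-in for `train`; the variable choice is an assumption for illustration:

```r
# Built-in mtcars data used purely as a stand-in for `train`
vars <- c("mpg", "wt", "hp", "qsec")
scaled <- as.data.frame(scale(mtcars[, vars]))

# With every variable centred and scaled, coefficients are in SD units
fit_std <- lm(mpg ~ ., data = scaled)

# Drop the intercept and rank predictors by absolute standardized coefficient
std_coef <- coef(fit_std)[-1]
ranked <- std_coef[order(abs(std_coef), decreasing = TRUE)]
print(round(ranked, 3))
```

The same steps on the numeric columns of `train` would give a ranking that does not depend on measurement units.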

Evaluation

Normality

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.99273, p-value = 1.008e-05
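
The p-value above is below 0.05, so the test rejects normality of the residuals, even though W is close to 1. With more than a thousand residuals, the Shapiro-Wilk test picks up small departures, so it is worth reading W and the p-value together and checking a normal Q-Q plot (`qqnorm()`). A minimal sketch on simulated residuals (hypothetical data, not the residuals of `model1`):

```r
set.seed(1)

# Hypothetical residuals: t-distributed, so slightly heavier-tailed than normal
res <- rt(1200, df = 8)

# shapiro.test() returns a list; the statistic and p-value can be extracted
sw <- shapiro.test(res)
cat("W =", round(sw$statistic, 4), " p =", signif(sw$p.value, 3), "\n")
```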

Multicollinearity

vif(model1)
##                          Status                 Adult.Mortality 
##                        1.784536                        1.774709 
##                   infant.deaths                         Alcohol 
##                       80.046595                        2.309383 
##          percentage.expenditure                     Hepatitis.B 
##                       16.904709                        1.650474 
##                         Measles                             BMI 
##                        1.623970                        1.807382 
##               under.five.deaths                           Polio 
##                       75.379264                        1.676715 
##               Total.expenditure                      Diphtheria 
##                        1.164806                        2.100722 
##                        HIV.AIDS                             GDP 
##                        1.468532                       17.628615 
##                      Population            thinness..1.19.years 
##                        1.235130                        6.731179 
##              thinness.5.9.years Income.composition.of.resources 
##                        6.704778                        3.052511 
##                       Schooling 
##                        3.574748
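
By the common rule of thumb that a VIF above 10 signals problematic multicollinearity, four predictors stand out: infant.deaths (80.0), under.five.deaths (75.4), GDP (17.6) and percentage.expenditure (16.9). For a numeric predictor j, the VIF is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes this by hand on the built-in `mtcars` data, which is only a stand-in for `train`:

```r
# Built-in mtcars data used purely as a stand-in for `train`
predictors <- c("wt", "hp", "disp", "qsec")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 2))
```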

Create a New Model to Resolve This Issue

Try Stepwise Selection in Both Directions

modelboth <- step(model1, direction = "both")
## Start:  AIC=3161.75
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1       3.0 15666 3160.0
## - thinness..1.19.years             1       3.9 15667 3160.1
## - Measles                          1       4.9 15668 3160.1
## - percentage.expenditure           1       5.3 15668 3160.2
## - GDP                              1      14.9 15678 3160.9
## - thinness.5.9.years               1      17.1 15680 3161.1
## - Hepatitis.B                      1      19.8 15683 3161.3
## <none>                                         15663 3161.8
## - Total.expenditure                1      47.1 15710 3163.4
## - Polio                            1      51.0 15714 3163.7
## - Diphtheria                       1      63.7 15727 3164.7
## - Alcohol                          1      81.1 15744 3166.1
## - Status                           1      96.1 15759 3167.2
## - BMI                              1     347.7 16011 3186.6
## - infant.deaths                    1     516.9 16180 3199.5
## - under.five.deaths                1     641.5 16305 3208.9
## - Income.composition.of.resources  1    1327.5 16991 3259.4
## - Schooling                        1    1872.8 17536 3298.1
## - Adult.Mortality                  1    3132.2 18795 3383.1
## - HIV.AIDS                         1    6537.6 22201 3587.0
## 
## Step:  AIC=3159.98
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness..1.19.years             1       3.6 15670 3158.3
## - Measles                          1       4.2 15670 3158.3
## - percentage.expenditure           1       5.3 15671 3158.4
## - GDP                              1      14.8 15681 3159.1
## - thinness.5.9.years               1      16.4 15682 3159.3
## - Hepatitis.B                      1      19.6 15686 3159.5
## <none>                                         15666 3160.0
## - Total.expenditure                1      48.2 15714 3161.7
## + Population                       1       3.0 15663 3161.8
## - Polio                            1      51.2 15717 3162.0
## - Diphtheria                       1      63.0 15729 3162.9
## - Alcohol                          1      82.5 15749 3164.4
## - Status                           1      98.7 15765 3165.7
## - BMI                              1     348.0 16014 3184.9
## - infant.deaths                    1     530.9 16197 3198.8
## - under.five.deaths                1     651.0 16317 3207.9
## - Income.composition.of.resources  1    1325.2 16991 3257.5
## - Schooling                        1    1869.8 17536 3296.1
## - Adult.Mortality                  1    3145.0 18811 3382.1
## - HIV.AIDS                         1    6536.0 22202 3585.1
## 
## Step:  AIC=3158.26
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Measles                          1       4.1 15674 3156.6
## - percentage.expenditure           1       5.3 15675 3156.7
## - GDP                              1      14.8 15684 3157.4
## - Hepatitis.B                      1      19.2 15689 3157.8
## - thinness.5.9.years               1      23.3 15693 3158.1
## <none>                                         15670 3158.3
## + thinness..1.19.years             1       3.6 15666 3160.0
## - Total.expenditure                1      48.3 15718 3160.0
## + Population                       1       2.7 15667 3160.1
## - Polio                            1      54.0 15724 3160.5
## - Diphtheria                       1      62.2 15732 3161.1
## - Alcohol                          1      85.7 15755 3162.9
## - Status                           1      99.6 15769 3164.0
## - BMI                              1     345.5 16015 3183.0
## - infant.deaths                    1     529.2 16199 3196.9
## - under.five.deaths                1     648.5 16318 3205.9
## - Income.composition.of.resources  1    1322.1 16992 3255.5
## - Schooling                        1    1866.3 17536 3294.1
## - Adult.Mortality                  1    3150.4 18820 3380.7
## - HIV.AIDS                         1    6534.5 22204 3583.2
## 
## Step:  AIC=3156.59
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - percentage.expenditure           1       5.3 15679 3155.0
## - GDP                              1      15.0 15689 3155.8
## - Hepatitis.B                      1      18.2 15692 3156.0
## - thinness.5.9.years               1      22.2 15696 3156.3
## <none>                                         15674 3156.6
## + Measles                          1       4.1 15670 3158.3
## + thinness..1.19.years             1       3.5 15670 3158.3
## - Total.expenditure                1      48.4 15722 3158.4
## + Population                       1       2.0 15672 3158.4
## - Polio                            1      53.4 15727 3158.7
## - Diphtheria                       1      61.4 15735 3159.4
## - Alcohol                          1      86.7 15760 3161.3
## - Status                           1      99.5 15773 3162.3
## - BMI                              1     347.0 16021 3181.4
## - infant.deaths                    1     569.8 16244 3198.3
## - under.five.deaths                1     673.9 16348 3206.2
## - Income.composition.of.resources  1    1323.7 16998 3253.9
## - Schooling                        1    1866.8 17541 3292.4
## - Adult.Mortality                  1    3146.7 18821 3378.7
## - HIV.AIDS                         1    6545.9 22220 3582.1
## 
## Step:  AIC=3155
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + Hepatitis.B + BMI + under.five.deaths + Polio + 
##     Total.expenditure + Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Hepatitis.B                      1      19.7 15699 3154.5
## - thinness.5.9.years               1      21.9 15701 3154.7
## <none>                                         15679 3155.0
## + percentage.expenditure           1       5.3 15674 3156.6
## + Measles                          1       4.1 15675 3156.7
## + thinness..1.19.years             1       3.5 15676 3156.7
## + Population                       1       2.1 15677 3156.8
## - Total.expenditure                1      50.6 15730 3156.9
## - Polio                            1      52.0 15731 3157.1
## - Diphtheria                       1      62.6 15742 3157.9
## - Alcohol                          1      88.6 15768 3159.9
## - Status                           1     102.3 15781 3161.0
## - BMI                              1     345.7 16025 3179.7
## - GDP                              1     452.0 16131 3187.8
## - infant.deaths                    1     569.0 16248 3196.7
## - under.five.deaths                1     673.3 16352 3204.5
## - Income.composition.of.resources  1    1318.6 16998 3251.9
## - Schooling                        1    1864.2 17543 3290.6
## - Adult.Mortality                  1    3151.3 18831 3377.4
## - HIV.AIDS                         1    6543.3 22222 3580.2
## 
## Step:  AIC=3154.54
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + BMI + under.five.deaths + Polio + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5.9.years               1      25.0 15724 3154.5
## <none>                                         15699 3154.5
## + Hepatitis.B                      1      19.7 15679 3155.0
## - Diphtheria                       1      43.8 15743 3156.0
## - Polio                            1      44.0 15743 3156.0
## + percentage.expenditure           1       6.9 15692 3156.0
## + thinness..1.19.years             1       3.2 15696 3156.3
## + Measles                          1       3.0 15696 3156.3
## - Total.expenditure                1      48.9 15748 3156.4
## + Population                       1       2.0 15697 3156.4
## - Alcohol                          1      82.1 15781 3158.9
## - Status                           1      94.8 15794 3159.9
## - BMI                              1     338.9 16038 3178.7
## - GDP                              1     468.3 16167 3188.6
## - infant.deaths                    1     567.6 16266 3196.0
## - under.five.deaths                1     671.1 16370 3203.8
## - Income.composition.of.resources  1    1327.4 17026 3252.0
## - Schooling                        1    1852.1 17551 3289.2
## - Adult.Mortality                  1    3167.9 18867 3377.7
## - HIV.AIDS                         1    6523.7 22223 3578.3
## 
## Step:  AIC=3154.49
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + BMI + under.five.deaths + Polio + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         15724 3154.5
## + thinness.5.9.years               1      25.0 15699 3154.5
## + Hepatitis.B                      1      22.8 15701 3154.7
## + thinness..1.19.years             1      11.9 15712 3155.6
## - Diphtheria                       1      43.1 15767 3155.8
## - Polio                            1      44.0 15768 3155.9
## + percentage.expenditure           1       6.7 15717 3156.0
## + Measles                          1       2.0 15722 3156.3
## + Population                       1       1.7 15722 3156.4
## - Total.expenditure                1      54.3 15778 3156.7
## - Alcohol                          1      71.7 15796 3158.1
## - Status                           1      95.2 15819 3159.9
## - BMI                              1     468.3 16192 3188.4
## - GDP                              1     474.1 16198 3188.9
## - infant.deaths                    1     566.8 16291 3195.9
## - under.five.deaths                1     672.0 16396 3203.8
## - Income.composition.of.resources  1    1362.4 17086 3254.3
## - Schooling                        1    1862.5 17586 3289.6
## - Adult.Mortality                  1    3203.3 18927 3379.6
## - HIV.AIDS                         1    6582.1 22306 3580.8
summary(modelboth)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + BMI + under.five.deaths + Polio + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + Income.composition.of.resources + 
##     Schooling, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.993  -2.065   0.020   2.270  11.530 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.377e+01  9.162e-01  58.693  < 2e-16 ***
## StatusDeveloping                -1.050e+00  3.879e-01  -2.708  0.00687 ** 
## Adult.Mortality                 -1.671e-02  1.064e-03 -15.707  < 2e-16 ***
## infant.deaths                    8.940e-02  1.353e-02   6.607 5.87e-11 ***
## Alcohol                         -8.970e-02  3.817e-02  -2.350  0.01894 *  
## BMI                              3.909e-02  6.508e-03   6.006 2.52e-09 ***
## under.five.deaths               -6.806e-02  9.461e-03  -7.194 1.10e-12 ***
## Polio                            1.062e-02  5.769e-03   1.841  0.06582 .  
## Total.expenditure                9.945e-02  4.863e-02   2.045  0.04106 *  
## Diphtheria                       1.105e-02  6.064e-03   1.822  0.06865 .  
## HIV.AIDS                        -4.349e-01  1.931e-02 -22.515  < 2e-16 ***
## GDP                              6.412e-05  1.061e-05   6.042 2.02e-09 ***
## Income.composition.of.resources  9.999e+00  9.761e-01  10.244  < 2e-16 ***
## Schooling                        8.296e-01  6.927e-02  11.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.603 on 1211 degrees of freedom
## Multiple R-squared:  0.8392, Adjusted R-squared:  0.8375 
## F-statistic: 486.3 on 13 and 1211 DF,  p-value: < 2.2e-16
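
The AIC values printed by `step()` come from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2 * k, with k the number of estimated coefficients. For the starting model, n = 1225, RSS = 15663 and k = 20 give about 3161.7, in line with the first AIC in the trace. The sketch below checks the formula on the built-in `mtcars` data (a stand-in, not the life expectancy data):

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
k   <- n - df.residual(fit)   # number of estimated coefficients

# AIC as used by step(): n * log(RSS / n) + 2 * k
aic_manual <- n * log(rss / n) + 2 * k
aic_step   <- extractAIC(fit)[2]
print(c(manual = aic_manual, extractAIC = aic_step))
```

This value differs from `AIC(fit)` by an additive constant, which does not affect the comparison between candidate models.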

Try Dropping the Multicollinear Predictors

modelnocoliniear <- lm(Life.expectancy ~ Status + Adult.Mortality + 
    Alcohol + Hepatitis.B + Measles + 
    BMI + Polio + Total.expenditure + Diphtheria + 
    HIV.AIDS + Population + thinness..1.19.years + thinness.5.9.years + 
    Income.composition.of.resources + Schooling, data = train)
summary(modelnocoliniear)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Alcohol + 
##     Hepatitis.B + Measles + BMI + Polio + Total.expenditure + 
##     Diphtheria + HIV.AIDS + Population + thinness..1.19.years + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2957  -2.2698   0.0644   2.3093  11.1739 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.299e+01  9.994e-01  53.026  < 2e-16 ***
## StatusDeveloping                -1.613e+00  3.955e-01  -4.079 4.83e-05 ***
## Adult.Mortality                 -1.764e-02  1.105e-03 -15.970  < 2e-16 ***
## Alcohol                         -1.437e-01  3.922e-02  -3.665 0.000258 ***
## Hepatitis.B                     -8.212e-03  5.460e-03  -1.504 0.132888    
## Measles                         -1.249e-05  1.509e-05  -0.828 0.407808    
## BMI                              3.641e-02  7.235e-03   5.032 5.58e-07 ***
## Polio                            1.559e-02  6.046e-03   2.579 0.010028 *  
## Total.expenditure                1.142e-01  5.049e-02   2.261 0.023936 *  
## Diphtheria                       2.124e-02  7.030e-03   3.021 0.002568 ** 
## HIV.AIDS                        -4.331e-01  2.014e-02 -21.501  < 2e-16 ***
## Population                      -1.087e-09  3.652e-09  -0.298 0.766054    
## thinness..1.19.years             8.002e-03  6.467e-02   0.124 0.901558    
## thinness.5.9.years              -5.821e-02  6.416e-02  -0.907 0.364416    
## Income.composition.of.resources  1.077e+01  1.011e+00  10.654  < 2e-16 ***
## Schooling                        9.192e-01  7.127e-02  12.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.746 on 1209 degrees of freedom
## Multiple R-squared:  0.8265, Adjusted R-squared:  0.8244 
## F-statistic:   384 on 15 and 1209 DF,  p-value: < 2.2e-16

Create New Models Based on Other Algorithms

Decision Tree

modeldtree <- ctree(Life.expectancy~., data = train)

Random Forest

modelrf <- randomForest(Life.expectancy~., data = train, importance = TRUE, ntree = 500)

Long Random Forest

# Repeated 5-fold cross-validation; training is slow, so the fitted model is saved and reloaded
#ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
#model_forest <- train(Life.expectancy~., data = train, method = "rf", trControl = ctrl)
#saveRDS(model_forest, "lifeforest.RDS")
modelrflong <- readRDS("lifeforest.RDS")

Predict and Evaluate Error

pred1 <- predict(model1, newdata = test)
predmodelboth <- predict(modelboth, newdata = test)
predmodelnocoliniear <- predict(modelnocoliniear, newdata = test)
preddtree <- predict(modeldtree, newdata = test)
predrf <- predict(modelrf, newdata = test)
predrflong <- predict(modelrflong, newdata = test)

RMSE

RMSE(pred1, obs = test$Life.expectancy)
## [1] 3.58042
RMSE(predmodelboth, obs = test$Life.expectancy)
## [1] 3.601059
RMSE(predmodelnocoliniear, obs = test$Life.expectancy)
## [1] 3.76168
RMSE(preddtree, obs = test$Life.expectancy)
## [1] 3.007273
RMSE(predrf, obs = test$Life.expectancy)
## [1] 1.869979
RMSE(predrflong, obs = test$Life.expectancy)
## [1] 1.166801

MAE

MAE(pred1, obs = test$Life.expectancy)
## [1] 2.696044
MAE(predmodelboth, obs = test$Life.expectancy)
## [1] 2.698483
MAE(predmodelnocoliniear, obs = test$Life.expectancy)
## [1] 2.837579
MAE(preddtree, obs = test$Life.expectancy)
## [1] 2.058884
MAE(predrf, obs = test$Life.expectancy)
## [1] 1.185654
MAE(predrflong, obs = test$Life.expectancy)
## [1] 0.6624638
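
Both metrics are in the same unit as the target (years of life expectancy). RMSE squares the errors before averaging, so it penalises large misses more than MAE does. A minimal base R sketch of the two definitions, on hypothetical values:

```r
# Hypothetical observed and predicted values
obs  <- c(70, 65, 80, 55)
pred <- c(68, 66, 83, 55)

err  <- pred - obs
rmse <- sqrt(mean(err^2))   # penalises large errors more heavily
mae  <- mean(abs(err))      # average absolute miss
print(c(RMSE = rmse, MAE = mae))
```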

Conclusion

varImp(modelrflong)
## rf variable importance
## 
##                                  Overall
## Income.composition.of.resources 100.0000
## HIV.AIDS                         55.2340
## Adult.Mortality                  43.0821
## Schooling                        11.3506
## BMI                               4.4277
## thinness.5.9.years                4.3889
## thinness..1.19.years              3.0001
## Alcohol                           2.7593
## Total.expenditure                 1.9680
## under.five.deaths                 1.7575
## percentage.expenditure            1.7139
## GDP                               1.5435
## infant.deaths                     1.2092
## Polio                             0.7421
## Population                        0.6633
## Measles                           0.5909
## Diphtheria                        0.5629
## Hepatitis.B                       0.5396
## StatusDeveloping                  0.0000
According to the random forest model, the three variables that most influence life expectancy are income composition of resources (Income.composition.of.resources), the HIV/AIDS rate (HIV.AIDS), and adult mortality (Adult.Mortality). These three are far ahead of all the other variables.
A surprising result is that a country's status as developed or developing contributes the least to this model (its scaled importance is 0, although StatusDeveloping was significant in the linear models), which suggests that even a developed country can have a low life expectancy. Population size also has little influence on life expectancy: a populous country can have a low life expectancy, and so can a sparsely populated one.
Finally, of all the models I built, the one with the lowest error is the longer-running random forest, with an RMSE of 1.17 and an MAE of 0.66. This model is therefore the better choice for predicting life expectancy in the coming years. Its weakness is that every predictor must be available to make a prediction. Predictors sometimes contain NA, and when they do, this model can fail.
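
One possible workaround for missing predictors at prediction time is to fill each NA with the median of that predictor in the training data. This is only a sketch of that idea, not something the analysis above did. The data, the `lm` model and all names below are hypothetical:

```r
# Hypothetical training data and a new row with a missing predictor
train_toy <- data.frame(
  y  = c(60, 66, 69, 76, 80),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(10, 9, 7, 6, 4)
)
new_row <- data.frame(x1 = 3, x2 = NA_real_)

fit <- lm(y ~ x1 + x2, data = train_toy)

# Replace each missing predictor with its median from the training data
for (col in names(new_row)) {
  is_missing <- is.na(new_row[[col]])
  new_row[[col]][is_missing] <- median(train_toy[[col]], na.rm = TRUE)
}

pred <- predict(fit, newdata = new_row)
print(pred)
```

Median imputation keeps the prediction step from failing, but it hides the uncertainty of the missing value, so imputed predictions deserve extra caution.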