Objective

Four models are built to predict the life expectancy for different countries based on multiple economics and social factors. ordinary least squares R2 for test set is 0.7458 and RMSE is 1.923. The stepwise model R2 for test set is 0.72 and RMSE is 2.0185 . Lasso model R2 for test set is 0.7474 and RMSE is 1.91. The ridge R2 for test set is 0.7357 and RMSE is 1.96. All models predict life expectancy with fair accuracy. Lasso performs best and stepwise model shows less performance on this case.

Introduction

This project is trying to find out the multiple economics and social factors other than health factors which may affect a country’s life expectancy through multiple linear regression: life expectancy as the response. We put emphasis on following factors: quality of life index such as living cost, climate, traffic, property price to income ratio, social factors such as corruption, generosity, freedom, economic status and some other health related factors such as healthcare, pollution, happiness. Our goal is to determine the predictors that are significantly contributing to the value of life expectancy. This will help to suggest a country which area should be paid attention in order to improve the life expectancy of its population.

Data Description

The observations of this dataset are based on different countries. Two data sets ‘Quality of Life Index’ & ‘world_happiness_report_2018’ are merged into one data frame. The data sources are from this two websites: Kaggle and NUMBEO. On initial visual inspection of the data showed some missing values. Countries’ Log GDP per capita are searched online and filled missing values in Log GDP column. Finding all data for these countries was difficult and hence, so some less known countries are excluded from the final model data-set. Other Missing data is handled in R software by using ‘mice’ package. The final merged file(final dataset) consists of 18 Columns(with country name) and 64 rows which means 16 predicting variables. All predicting variables are then divided into several broad categories: quality of life index factors, Economical factors and Social factors. Status is the only categorical predictors, other predictors are numerical. The plot of life expectancy distribution group by status shows that there’s strong relationship between life expectancy and country developing status. The box plot of life expectancy with two different statuses displays a higher mean and median of developed countries than developing countries. It indicates that, in general, the life expectancy of developed countries is much longer than the developing country.

Imputing the Missing Data

library(mice)
## Warning: package 'mice' was built under R version 3.6.2
## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
quality=read.csv(file = 'quality.csv',header = TRUE)[-1]
quality
##                 Country Rank Quality.of.Life PurchasingPower Safety
## 1             Argentina   48          122.49           55.00  37.45
## 2             Australia    4          191.13          122.85  57.24
## 3               Austria    5          191.05           96.70  78.63
## 4               Belarus   37          141.47           44.64  76.47
## 5               Belgium   23          162.09           95.09  57.54
## 6                Brazil   64          100.33           37.43  29.76
## 7              Bulgaria   43          130.59           52.18  60.00
## 8                Canada   18          170.32          109.44  60.49
## 9                 Chile   46          124.14           54.13  53.19
## 10                China   65           97.92           68.95  54.54
## 11             Colombia   57          108.36           34.66  48.74
## 12              Croatia   21          165.31           60.33  75.31
## 13               Cyprus   27          157.57           75.29  70.69
## 14       Czech Republic   25          158.79           71.32  73.34
## 15              Denmark    1          198.57          114.39  75.75
## 16                Egypt   71           83.98           23.67  49.29
## 17              Estonia   11          180.88           75.97  79.20
## 18              Finland    3          194.01          112.30  77.20
## 19               France   26          157.83           91.55  53.61
## 20              Georgia   50          120.90           33.19  80.14
## 21              Germany    8          187.05          116.20  65.49
## 22               Greece   39          137.82           49.57  61.43
## 23                India   52          117.51           66.91  57.28
## 24            Indonesia   58          107.20           32.95  53.99
## 25                 Iran   70           87.02           37.74  50.67
## 26              Ireland   24          160.82           95.09  55.48
## 27               Israel   30          153.82           92.48  67.84
## 28                Italy   36          145.69           77.49  54.98
## 29                Japan   12          180.50          103.12  86.27
## 30               Jordan   54          112.94           41.73  56.18
## 31           Kazakhstan   69           87.17           38.83  33.49
## 32               Latvia   34          149.15           55.16  63.23
## 33              Lebanon   55          111.21           53.97  55.73
## 34            Lithuania   29          156.36           60.34  63.49
## 35            Macedonia   56          110.64           39.83  60.71
## 36             Malaysia   49          122.11           73.14  39.21
## 37               Mexico   47          123.48           49.98  47.70
## 38          Netherlands    6          188.91          102.54  71.43
## 39          New Zealand    9          185.58          101.09  60.45
## 40               Norway   10          181.86          103.61  64.68
## 41             Pakistan   60          104.63           38.61  53.27
## 42               Panama   53          113.12           41.34  53.57
## 43          Philippines   67           90.73           29.46  59.17
## 44             Portugal   22          163.50           55.19  67.87
## 45              Romania   38          140.31           61.16  72.16
## 46               Russia   59          104.94           45.38  58.07
## 47         Saudi Arabia   32          152.72          118.59  70.27
## 48               Serbia   51          119.83           39.90  62.73
## 49            Singapore   28          156.91          103.77  78.53
## 50             Slovakia   31          153.10           63.24  70.46
## 51             Slovenia   15          175.98           75.38  77.43
## 52         South Africa   40          135.31           82.97  23.20
## 53          South Korea   33          149.53          103.40  65.95
## 54                Spain   16          174.16           83.80  67.54
## 55            Sri Lanka   66           95.30           27.34  58.97
## 56               Sweden   14          178.67          111.38  50.65
## 57          Switzerland    2          195.93          129.70  78.50
## 58             Thailand   61          103.26           40.86  53.34
## 59               Turkey   44          125.51           45.33  59.62
## 60              Ukraine   62          102.34           32.72  51.12
## 61 United Arab Emirates   20          167.81          119.60  83.68
## 62       United Kingdom   17          170.81          105.73  57.28
## 63        United States   13          179.20          122.03  52.87
## 64              Vietnam   68           88.82           33.33  51.22
##    HealthCare LivingCost  PPIR Traffic Pollution Climate     Status
## 1       69.39      33.09 19.02   43.04     52.35   98.28 developing
## 2       76.38      72.08  7.60   35.29     23.97   93.75  developed
## 3       79.19      71.79 10.14   25.15     21.97   77.74  developed
## 4       58.01      33.12 14.27   28.78     40.97   64.37 developing
## 5       79.44      72.97  7.18   34.86     49.89   86.03  developed
## 6       54.64      42.80 18.72   43.37     57.72   95.35 developing
## 7       54.04      37.17  8.72   28.79     63.98   82.76 developing
## 8       70.99      65.01  7.66   33.77     27.85   52.55  developed
## 9       65.50      47.73 13.51   36.79     65.54   90.21 developing
## 10      64.03      39.24 29.09   42.46     81.91   78.91 developing
## 11      67.51      30.15 18.85   45.06     61.74   86.04 developing
## 12      64.14      49.18 11.63   29.11     31.03   88.95 developing
## 13      50.17      55.57  7.17   20.27     53.75   92.23  developed
## 14      74.71      45.12 15.17   30.02     40.96   77.13  developed
## 15      79.41      81.38  6.93   28.51     22.14   81.80  developed
## 16      44.22      26.46 13.60   49.17     86.48   91.98 developing
## 17      72.12      51.01  9.31   25.75     18.15   64.28  developed
## 18      73.49      72.82  7.98   30.41     11.93   58.56  developed
## 19      78.55      74.85 12.85   35.27     42.70   88.25  developed
## 20      51.29      28.78 12.67   34.34     72.46   84.20 developing
## 21      74.32      67.62  9.02   29.91     28.01   82.51  developed
## 22      55.16      56.66  9.84   32.20     51.86   94.18  developed
## 23      68.32      24.17 11.33   45.71     75.81   65.74 developing
## 24      61.98      36.24 13.81   42.93     62.78   68.96 developing
## 25      51.18      35.80 16.87   48.38     79.35   70.99 developing
## 26      48.58      75.35  7.90   35.85     31.12   89.13  developed
## 27      73.44      74.28 13.98   36.32     57.25   93.80  developed
## 28      67.14      69.25  9.97   34.69     53.75   91.25  developed
## 29      80.40      83.33 11.25   40.03     37.08   84.79  developed
## 30      66.12      54.98  7.67   42.43     80.39   89.05 developing
## 31      51.27      29.83 11.51   29.75     74.25   39.78 developing
## 32      59.71      49.23  9.35   31.45     34.96   74.70 developing
## 33      65.90      58.06 13.44   37.36     87.39   94.74 developing
## 34      67.58      45.91 11.11   27.15     30.84   69.86 developing
## 35      58.98      31.58 13.65   27.94     80.85   76.30 developing
## 36      67.61      39.38  9.77   35.25     63.74   60.10 developing
## 37      69.72      32.71 10.19   39.08     66.02   86.29 developing
## 38      77.81      74.83  7.38   29.87     27.45   87.45  developed
## 39      73.62      72.62  8.34   31.19     22.74   95.46  developed
## 40      74.14     100.99  8.40   27.12     19.86   71.16  developed
## 41      60.58      20.40 14.06   38.63     75.89   72.99 developing
## 42      61.41      51.45 12.67   36.48     62.00   67.84 developing
## 43      65.87      35.09 21.83   44.15     74.47   60.23 developing
## 44      70.72      50.39 12.71   29.02     31.58   97.55  developed
## 45      54.49      36.45 10.15   34.36     55.39   77.62 developing
## 46      57.63      35.52 12.39   46.00     62.80   46.53 developing
## 47      59.07      47.61  2.89   29.61     67.26   41.42 developing
## 48      52.56      35.39 18.62   29.74     58.86   83.23 developing
## 49      70.30      69.79 21.56   42.15     32.29   57.45  developed
## 50      60.06      44.98 10.27   29.58     41.89   78.13  developed
## 51      62.81      52.51 10.16   24.81     24.33   78.08  developed
## 52      62.56      42.49  4.11   40.33     56.95   95.97 developing
## 53      84.51      76.93 17.58   40.02     54.19   68.39  developed
## 54      77.77      54.70  9.21   29.42     39.36   94.19  developed
## 55      72.30      30.24 26.32   51.90     57.69   59.11 developing
## 56      70.95      71.55 10.26   30.29     18.01   74.92  developed
## 57      72.68     121.16  9.63   29.05     22.03   80.05  developed
## 58      79.06      47.54 21.94   39.40     72.21   69.45 developing
## 59      69.36      35.21  9.26   44.77     69.15   93.26  developed
## 60      50.95      27.94 14.35   37.36     66.63   70.69 developing
## 61      67.99      56.16  4.40   37.46     53.11   45.23 developing
## 62      74.71      65.28  9.16   34.62     39.43   87.82  developed
## 63      69.41      69.91  3.58   32.87     33.95   77.51  developed
## 64      54.54      37.70 19.66   28.33     87.13   71.24 developing
##      Ladder    LogGDP SocialSuppor lifeExpectancy   Freedom  Generosity
## 1  5.792797  9.809972    0.8999116           68.8 0.8458947 -0.20693663
## 2  7.176993 10.721021    0.9401373           73.6 0.9160281  0.13779540
## 3  7.396002 10.741893    0.9116681           73.0 0.9041120  0.05155233
## 4  5.233770  9.778739    0.9045693           66.1 0.6436024 -0.18186484
## 5  6.892172 10.672445    0.9298155           72.0 0.8083866 -0.12727834
## 6  6.190922  9.557933    0.8815053           66.4 0.7506090 -0.12632652
## 7  5.098814  9.873219    0.9238533           66.8 0.7243359 -0.17910960
## 8  7.175497 10.701248    0.9227188           73.6 0.9457829  0.09796613
## 9  6.436221 10.065920    0.8900849           69.9 0.7885304 -0.07061584
## 10 5.131434  9.694376    0.7876053           69.3 0.8953777 -0.17489916
## 11 5.983512  9.511734    0.8709704           67.7 0.8507658 -0.15504800
## 12 5.536271 10.065751    0.9098068           70.2 0.6908556 -0.15252924
## 13 6.276443        NA    0.8255732           73.7 0.7942150          NA
## 14 7.034165 10.419474    0.9291639           70.3 0.7901324 -0.29435962
## 15 7.648786 10.755594    0.9582189           72.4 0.9354378  0.01559285
## 16 4.005451  9.293960    0.7588240           61.7 0.6816545 -0.22292954
## 17 6.091302 10.324107    0.9326938           68.6 0.8856184 -0.14466491
## 18 7.858107 10.636060    0.9621550           71.9 0.9378074 -0.13173516
## 19 6.665904 10.573352    0.9214631           73.8 0.8163772 -0.14156713
## 20 4.659097  9.229100    0.6172186           64.5 0.7751441 -0.21924759
## 21 7.118364 10.730945    0.9197631           72.2 0.8768875  0.03019965
## 22 5.409289 10.132058    0.7935008           72.4 0.5644557 -0.33638453
## 23 3.818069  8.830280    0.6380520           60.1 0.8904434  0.07180630
## 24 5.340296  9.362827    0.8093789           62.1 0.8793744  0.49937782
## 25 4.278118  9.886065    0.6737647           66.0 0.6033199  0.04471550
## 26 6.962336 11.163328    0.9378624           72.3 0.8614716  0.13978343
## 27 6.927179 10.424574    0.9095954           73.3 0.7246623  0.05254996
## 28 6.516527 10.480516    0.9126561           73.6 0.6500093 -0.02346227
## 29 5.793575 10.581618    0.8864319           75.0 0.7734721 -0.27297229
## 30 4.638934  9.024435    0.7995443           66.8 0.7624203 -0.18348970
## 31 6.007636 10.111166    0.9366567           64.6 0.8401830 -0.10859531
## 32 5.901154 10.184117    0.9132763           66.8 0.6082076 -0.21699953
## 33 5.167187  9.507958    0.8293806           67.1 0.6070307 -0.06632593
## 34 6.308879 10.339512    0.9293501           67.3 0.6989451 -0.24097855
## 35 5.239835  9.503044    0.8489152           67.5 0.7448009 -0.04029087
## 36 5.338818 10.235504    0.7894086           67.0 0.8745483  0.11288136
## 37 6.549579  9.769919    0.8580689           68.3 0.8162005 -0.18181843
## 38 7.463097 10.809204    0.9394432           72.3 0.9199852  0.15553649
## 39 7.370286 10.501477    0.9538627           73.2 0.9493002  0.11675003
## 40 7.444262 11.085626    0.9659619           73.2 0.9604290  0.07572849
## 41 5.471554  8.561664    0.6850595           58.5 0.7725691  0.05352156
## 42 6.281434 10.049730    0.9043898           69.6 0.8614481 -0.12278221
## 43 5.869173  8.985703    0.8458033           61.9 0.9178082 -0.11312916
## 44 5.919823 10.262012    0.8871133           72.4 0.8774042 -0.26273811
## 45 6.150879 10.112093    0.8179305           67.2 0.8451596 -0.22003184
## 46 5.513500 10.132390    0.9087261           64.3 0.7292822 -0.15622631
## 47 6.356393 10.797972    0.8678484           66.3 0.8549215 -0.20956351
## 48 5.936493  9.584804    0.8529453           68.2 0.7398918 -0.09851145
## 49 6.374564        NA    0.9028407           76.8 0.9160782          NA
## 50 6.235111 10.352236    0.9223787           68.9 0.7576340 -0.17969246
## 51 6.249419 10.397203    0.9409712           71.1 0.9420459 -0.12239641
## 52 4.883922  9.411724    0.8413440           56.5 0.7527313 -0.05903366
## 53 5.840231 10.511578    0.7977238           73.6 0.6001617 -0.09396584
## 54 6.513371 10.465594    0.9103147           74.4 0.7222507 -0.07935060
## 55 4.400223  9.400388    0.8280654           67.2 0.8526282  0.08676241
## 56 7.374792 10.766932    0.9306796           72.6 0.9417247  0.06957325
## 57 7.508587 10.975945    0.9302910           74.1 0.9264145  0.09636896
## 58 6.011562  9.734829    0.8730524           67.2 0.9048283  0.25165021
## 59 5.185689 10.148917    0.8470273           66.8 0.5286291 -0.18165372
## 60 4.661909  9.012027    0.9009367           64.6 0.6630551 -0.05504196
## 61 6.603744 11.127678    0.8510413           67.1 0.9436644  0.03649422
## 62 7.233445 10.596948    0.9284839           72.3 0.8375084  0.22199814
## 63 6.882685 10.922465    0.9038560           68.3 0.8246067  0.10771339
## 64 5.295547  8.783416    0.8319452           67.9 0.9092598 -0.03912370
##    Corruption
## 1  0.85525525
## 2  0.40464750
## 3  0.52306056
## 4  0.71845549
## 5  0.63041180
## 6  0.76325131
## 7  0.95201445
## 8  0.37174085
## 9  0.81629741
## 10         NA
## 11 0.85482091
## 12 0.92540830
## 13 0.84833723
## 14 0.85138226
## 15 0.15060744
## 16         NA
## 17 0.62067795
## 18 0.19860484
## 19 0.58177531
## 20 0.75485378
## 21 0.49567395
## 22 0.86030239
## 23 0.80526334
## 24 0.86772943
## 25 0.70343977
## 26 0.36221024
## 27 0.77013481
## 28 0.88782483
## 29 0.68678451
## 30         NA
## 31 0.82378298
## 32 0.79894924
## 33 0.90665030
## 34 0.85174507
## 35 0.90993434
## 36 0.89413112
## 37 0.80863810
## 38 0.37055778
## 39 0.20658022
## 40 0.26820144
## 41 0.79884166
## 42 0.83693099
## 43 0.72648335
## 44 0.87972784
## 45 0.92117017
## 46 0.86531156
## 47         NA
## 48 0.86372358
## 49 0.09656293
## 50 0.90994465
## 51 0.83925253
## 52 0.84119260
## 53 0.79682589
## 54 0.77650446
## 55 0.85801667
## 56 0.26279658
## 57 0.30125997
## 58 0.90659601
## 59 0.80487853
## 60 0.94296074
## 61         NA
## 62 0.40427601
## 63 0.70992827
## 64 0.80842298
quality2=read.csv(file = 'quality2.csv',header = TRUE)
quality2=quality2[,-1]
head(quality2)
##   LifeExpectancy     Status LifeQuality PurchaPower Safety HealthCare
## 1           68.8 developing      122.49       55.00  37.45      69.39
## 2           73.6  developed      191.13      122.85  57.24      76.38
## 3           73.0  developed      191.05       96.70  78.63      79.19
## 4           66.1 developing      141.47       44.64  76.47      58.01
## 5           72.0  developed      162.09       95.09  57.54      79.44
## 6           66.4 developing      100.33       37.43  29.76      54.64
##   LivingCost  PPIR Traffic Pollution Climate Happiness    LogGDP
## 1      33.09 19.02   43.04     52.35   98.28  5.792797  9.809972
## 2      72.08  7.60   35.29     23.97   93.75  7.176993 10.721021
## 3      71.79 10.14   25.15     21.97   77.74  7.396002 10.741893
## 4      33.12 14.27   28.78     40.97   64.37  5.233770  9.778739
## 5      72.97  7.18   34.86     49.89   86.03  6.892172 10.672445
## 6      42.80 18.72   43.37     57.72   95.35  6.190922  9.557933
##   SocialSuppor   Freedom  Generosity Corruption
## 1    0.8999116 0.8458947 -0.20693663  0.8552552
## 2    0.9401373 0.9160281  0.13779540  0.4046475
## 3    0.9116681 0.9041120  0.05155233  0.5230606
## 4    0.9045693 0.6436024 -0.18186484  0.7184555
## 5    0.9298155 0.8083866 -0.12727834  0.6304118
## 6    0.8815053 0.7506090 -0.12632652  0.7632513
sort(sapply(quality2,function(x) sum(is.na(x))),decreasing = T)
##     Corruption     Generosity LifeExpectancy         Status    LifeQuality 
##              5              2              0              0              0 
##    PurchaPower         Safety     HealthCare     LivingCost           PPIR 
##              0              0              0              0              0 
##        Traffic      Pollution        Climate      Happiness         LogGDP 
##              0              0              0              0              0 
##   SocialSuppor        Freedom 
##              0              0
md.pattern(quality2)

##    LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 57              1      1           1           1      1          1
## 5               1      1           1           1      1          1
## 2               1      1           1           1      1          1
##                 0      0           0           0      0          0
##    LivingCost PPIR Traffic Pollution Climate Happiness LogGDP SocialSuppor
## 57          1    1       1         1       1         1      1            1
## 5           1    1       1         1       1         1      1            1
## 2           1    1       1         1       1         1      1            1
##             0    0       0         0       0         0      0            0
##    Freedom Generosity Corruption  
## 57       1          1          1 0
## 5        1          1          0 1
## 2        1          0          1 1
##          0          2          5 7
#from the output, we can easily to see there are 2 missing value in 'Generosity' colume and 5 missing values in 'Corruption' colume.  Due to the small size of data set, we determined to impute the missing value using the 'mice' package in R.


###----A couple of notes on the parameters:-----

## m=5 refers to the number of imputed datasets. Five is the default value.
## meth='pmm' refers to the imputation method. In this case we are using predictive mean matching as imputation method. Other imputation methods can be used, type methods(mice) for a list of the available imputation methods.

set.seed(1)
tempData <- mice(quality2,m=5,maxit=50,meth='pmm',seed=500)
## 
##  iter imp variable
##   1   1  Generosity  Corruption
##   1   2  Generosity  Corruption
##   1   3  Generosity  Corruption
##   1   4  Generosity  Corruption
##   1   5  Generosity  Corruption
##   2   1  Generosity  Corruption
##   2   2  Generosity  Corruption
##   2   3  Generosity  Corruption
##   2   4  Generosity  Corruption
##   2   5  Generosity  Corruption
##   3   1  Generosity  Corruption
##   3   2  Generosity  Corruption
##   3   3  Generosity  Corruption
##   3   4  Generosity  Corruption
##   3   5  Generosity  Corruption
##   4   1  Generosity  Corruption
##   4   2  Generosity  Corruption
##   4   3  Generosity  Corruption
##   4   4  Generosity  Corruption
##   4   5  Generosity  Corruption
##   5   1  Generosity  Corruption
##   5   2  Generosity  Corruption
##   5   3  Generosity  Corruption
##   5   4  Generosity  Corruption
##   5   5  Generosity  Corruption
##   6   1  Generosity  Corruption
##   6   2  Generosity  Corruption
##   6   3  Generosity  Corruption
##   6   4  Generosity  Corruption
##   6   5  Generosity  Corruption
##   7   1  Generosity  Corruption
##   7   2  Generosity  Corruption
##   7   3  Generosity  Corruption
##   7   4  Generosity  Corruption
##   7   5  Generosity  Corruption
##   8   1  Generosity  Corruption
##   8   2  Generosity  Corruption
##   8   3  Generosity  Corruption
##   8   4  Generosity  Corruption
##   8   5  Generosity  Corruption
##   9   1  Generosity  Corruption
##   9   2  Generosity  Corruption
##   9   3  Generosity  Corruption
##   9   4  Generosity  Corruption
##   9   5  Generosity  Corruption
##   10   1  Generosity  Corruption
##   10   2  Generosity  Corruption
##   10   3  Generosity  Corruption
##   10   4  Generosity  Corruption
##   10   5  Generosity  Corruption
##   11   1  Generosity  Corruption
##   11   2  Generosity  Corruption
##   11   3  Generosity  Corruption
##   11   4  Generosity  Corruption
##   11   5  Generosity  Corruption
##   12   1  Generosity  Corruption
##   12   2  Generosity  Corruption
##   12   3  Generosity  Corruption
##   12   4  Generosity  Corruption
##   12   5  Generosity  Corruption
##   13   1  Generosity  Corruption
##   13   2  Generosity  Corruption
##   13   3  Generosity  Corruption
##   13   4  Generosity  Corruption
##   13   5  Generosity  Corruption
##   14   1  Generosity  Corruption
##   14   2  Generosity  Corruption
##   14   3  Generosity  Corruption
##   14   4  Generosity  Corruption
##   14   5  Generosity  Corruption
##   15   1  Generosity  Corruption
##   15   2  Generosity  Corruption
##   15   3  Generosity  Corruption
##   15   4  Generosity  Corruption
##   15   5  Generosity  Corruption
##   16   1  Generosity  Corruption
##   16   2  Generosity  Corruption
##   16   3  Generosity  Corruption
##   16   4  Generosity  Corruption
##   16   5  Generosity  Corruption
##   17   1  Generosity  Corruption
##   17   2  Generosity  Corruption
##   17   3  Generosity  Corruption
##   17   4  Generosity  Corruption
##   17   5  Generosity  Corruption
##   18   1  Generosity  Corruption
##   18   2  Generosity  Corruption
##   18   3  Generosity  Corruption
##   18   4  Generosity  Corruption
##   18   5  Generosity  Corruption
##   19   1  Generosity  Corruption
##   19   2  Generosity  Corruption
##   19   3  Generosity  Corruption
##   19   4  Generosity  Corruption
##   19   5  Generosity  Corruption
##   20   1  Generosity  Corruption
##   20   2  Generosity  Corruption
##   20   3  Generosity  Corruption
##   20   4  Generosity  Corruption
##   20   5  Generosity  Corruption
##   21   1  Generosity  Corruption
##   21   2  Generosity  Corruption
##   21   3  Generosity  Corruption
##   21   4  Generosity  Corruption
##   21   5  Generosity  Corruption
##   22   1  Generosity  Corruption
##   22   2  Generosity  Corruption
##   22   3  Generosity  Corruption
##   22   4  Generosity  Corruption
##   22   5  Generosity  Corruption
##   23   1  Generosity  Corruption
##   23   2  Generosity  Corruption
##   23   3  Generosity  Corruption
##   23   4  Generosity  Corruption
##   23   5  Generosity  Corruption
##   24   1  Generosity  Corruption
##   24   2  Generosity  Corruption
##   24   3  Generosity  Corruption
##   24   4  Generosity  Corruption
##   24   5  Generosity  Corruption
##   25   1  Generosity  Corruption
##   25   2  Generosity  Corruption
##   25   3  Generosity  Corruption
##   25   4  Generosity  Corruption
##   25   5  Generosity  Corruption
##   26   1  Generosity  Corruption
##   26   2  Generosity  Corruption
##   26   3  Generosity  Corruption
##   26   4  Generosity  Corruption
##   26   5  Generosity  Corruption
##   27   1  Generosity  Corruption
##   27   2  Generosity  Corruption
##   27   3  Generosity  Corruption
##   27   4  Generosity  Corruption
##   27   5  Generosity  Corruption
##   28   1  Generosity  Corruption
##   28   2  Generosity  Corruption
##   28   3  Generosity  Corruption
##   28   4  Generosity  Corruption
##   28   5  Generosity  Corruption
##   29   1  Generosity  Corruption
##   29   2  Generosity  Corruption
##   29   3  Generosity  Corruption
##   29   4  Generosity  Corruption
##   29   5  Generosity  Corruption
##   30   1  Generosity  Corruption
##   30   2  Generosity  Corruption
##   30   3  Generosity  Corruption
##   30   4  Generosity  Corruption
##   30   5  Generosity  Corruption
##   31   1  Generosity  Corruption
##   31   2  Generosity  Corruption
##   31   3  Generosity  Corruption
##   31   4  Generosity  Corruption
##   31   5  Generosity  Corruption
##   32   1  Generosity  Corruption
##   32   2  Generosity  Corruption
##   32   3  Generosity  Corruption
##   32   4  Generosity  Corruption
##   32   5  Generosity  Corruption
##   33   1  Generosity  Corruption
##   33   2  Generosity  Corruption
##   33   3  Generosity  Corruption
##   33   4  Generosity  Corruption
##   33   5  Generosity  Corruption
##   34   1  Generosity  Corruption
##   34   2  Generosity  Corruption
##   34   3  Generosity  Corruption
##   34   4  Generosity  Corruption
##   34   5  Generosity  Corruption
##   35   1  Generosity  Corruption
##   35   2  Generosity  Corruption
##   35   3  Generosity  Corruption
##   35   4  Generosity  Corruption
##   35   5  Generosity  Corruption
##   36   1  Generosity  Corruption
##   36   2  Generosity  Corruption
##   36   3  Generosity  Corruption
##   36   4  Generosity  Corruption
##   36   5  Generosity  Corruption
##   37   1  Generosity  Corruption
##   37   2  Generosity  Corruption
##   37   3  Generosity  Corruption
##   37   4  Generosity  Corruption
##   37   5  Generosity  Corruption
##   38   1  Generosity  Corruption
##   38   2  Generosity  Corruption
##   38   3  Generosity  Corruption
##   38   4  Generosity  Corruption
##   38   5  Generosity  Corruption
##   39   1  Generosity  Corruption
##   39   2  Generosity  Corruption
##   39   3  Generosity  Corruption
##   39   4  Generosity  Corruption
##   39   5  Generosity  Corruption
##   40   1  Generosity  Corruption
##   40   2  Generosity  Corruption
##   40   3  Generosity  Corruption
##   40   4  Generosity  Corruption
##   40   5  Generosity  Corruption
##   41   1  Generosity  Corruption
##   41   2  Generosity  Corruption
##   41   3  Generosity  Corruption
##   41   4  Generosity  Corruption
##   41   5  Generosity  Corruption
##   42   1  Generosity  Corruption
##   42   2  Generosity  Corruption
##   42   3  Generosity  Corruption
##   42   4  Generosity  Corruption
##   42   5  Generosity  Corruption
##   43   1  Generosity  Corruption
##   43   2  Generosity  Corruption
##   43   3  Generosity  Corruption
##   43   4  Generosity  Corruption
##   43   5  Generosity  Corruption
##   44   1  Generosity  Corruption
##   44   2  Generosity  Corruption
##   44   3  Generosity  Corruption
##   44   4  Generosity  Corruption
##   44   5  Generosity  Corruption
##   45   1  Generosity  Corruption
##   45   2  Generosity  Corruption
##   45   3  Generosity  Corruption
##   45   4  Generosity  Corruption
##   45   5  Generosity  Corruption
##   46   1  Generosity  Corruption
##   46   2  Generosity  Corruption
##   46   3  Generosity  Corruption
##   46   4  Generosity  Corruption
##   46   5  Generosity  Corruption
##   47   1  Generosity  Corruption
##   47   2  Generosity  Corruption
##   47   3  Generosity  Corruption
##   47   4  Generosity  Corruption
##   47   5  Generosity  Corruption
##   48   1  Generosity  Corruption
##   48   2  Generosity  Corruption
##   48   3  Generosity  Corruption
##   48   4  Generosity  Corruption
##   48   5  Generosity  Corruption
##   49   1  Generosity  Corruption
##   49   2  Generosity  Corruption
##   49   3  Generosity  Corruption
##   49   4  Generosity  Corruption
##   49   5  Generosity  Corruption
##   50   1  Generosity  Corruption
##   50   2  Generosity  Corruption
##   50   3  Generosity  Corruption
##   50   4  Generosity  Corruption
##   50   5  Generosity  Corruption
## Warning: Number of logged events: 500
tempData$imp$Generosity
##             1           2           3           4           5
## 13 -0.1525292 -0.02346227 -0.15622631 -0.05504196 -0.05504196
## 49 -0.1550480 -0.15504800  0.09636896  0.05352156  0.06957325
tempData$imp$Corruption
##            1         2         3         4         5
## 10 0.8637236 0.8392525 0.8237830 0.8086381 0.7548538
## 16 0.8048785 0.9429607 0.8237830 0.8048785 0.9099343
## 30 0.8369310 0.7034398 0.8517451 0.9066503 0.8084230
## 47 0.9211702 0.7968259 0.8513823 0.8878248 0.8878248
## 61 0.7099283 0.5230606 0.3705578 0.7701348 0.4042760

Inspecting the distribution of original and imputed data

#Let's compare the distributions of original and imputed data using a some useful plots. We can use a scatterplot and plot Corruption against other variables(random select from the dataset).What we would like to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed "plausible values".
xyplot(tempData,Corruption ~ Generosity+LogGDP+Climate+Freedom+Happiness+Pollution ,pch=17,cex=1.2)

completedData <- complete(tempData,1)
head(completedData)
##   LifeExpectancy     Status LifeQuality PurchaPower Safety HealthCare
## 1           68.8 developing      122.49       55.00  37.45      69.39
## 2           73.6  developed      191.13      122.85  57.24      76.38
## 3           73.0  developed      191.05       96.70  78.63      79.19
## 4           66.1 developing      141.47       44.64  76.47      58.01
## 5           72.0  developed      162.09       95.09  57.54      79.44
## 6           66.4 developing      100.33       37.43  29.76      54.64
##   LivingCost  PPIR Traffic Pollution Climate Happiness    LogGDP
## 1      33.09 19.02   43.04     52.35   98.28  5.792797  9.809972
## 2      72.08  7.60   35.29     23.97   93.75  7.176993 10.721021
## 3      71.79 10.14   25.15     21.97   77.74  7.396002 10.741893
## 4      33.12 14.27   28.78     40.97   64.37  5.233770  9.778739
## 5      72.97  7.18   34.86     49.89   86.03  6.892172 10.672445
## 6      42.80 18.72   43.37     57.72   95.35  6.190922  9.557933
##   SocialSuppor   Freedom  Generosity Corruption
## 1    0.8999116 0.8458947 -0.20693663  0.8552552
## 2    0.9401373 0.9160281  0.13779540  0.4046475
## 3    0.9116681 0.9041120  0.05155233  0.5230606
## 4    0.9045693 0.6436024 -0.18186484  0.7184555
## 5    0.9298155 0.8083866 -0.12727834  0.6304118
## 6    0.8815053 0.7506090 -0.12632652  0.7632513
dim(completedData)
## [1] 64 17
md.pattern(completedData)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##    LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 64              1      1           1           1      1          1
##                 0      0           0           0      0          0
##    LivingCost PPIR Traffic Pollution Climate Happiness LogGDP SocialSuppor
## 64          1    1       1         1       1         1      1            1
##             0    0       0         0       0         0      0            0
##    Freedom Generosity Corruption  
## 64       1          1          1 0
##          0          0          0 0

Table 1 Summary of Response Variable(Life Expectancy)

summary(completedData$LifeExpectancy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.50   66.80   68.70   68.96   72.40   76.80

Table 2 Variables Description

Variables Discription
Happiness a measure of life satisfaction
PurchasPowe Purchasing Power Index
SocialSuppor SocialSupport
Safety Safety Index
Freedom Freedom
HealthCare HealthCare Index
Status levels: developing&developed
Corruption Perception of Corruption
LivingCost Cost of Living Index
Generosity Generosity
PPIR Property Price to Income Ratio
LogGDP Log of GDP per capita
Traffic Traffic Commute Time Index
LifeQuality Quality of Life Index
Pollution Pollution Index
Climate Climate Index

Data Exploratory

Status is the only categorical predictors, other predictors are numerical. The plot of life expectancy distribution group by status shows that there’s strong relationship between life expectancy and country developing status. The box plot of life expectancy with two different statuses displays a higher mean and median of developed countries than developing countries. It indicates that, in general, the life expectancy of developed countries is much longer than the developing country.

Box Plot & Histogram of Life Expectancy Group By Country Status Box plot of Potential Predictors Group By Life Expectancy Greater/Less Than 65 (Safety, Healthcare, GDP,Traffic)

library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   1.0.0     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::complete() masks mice::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
ggplot(data=completedData,aes(x=LifeExpectancy,fill=Status))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=completedData,aes(x=Status,y=LifeExpectancy,fill=Status))+geom_boxplot(outlier.size = 2,outlier.shape = 2)

for (i in 3:17 ) {
  plot=ggplot(data=completedData,aes(y=completedData[,i]))+geom_boxplot()+xlab(names(completedData[i]))
  print(plot)
}

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

which(is_outlier(completedData$LivingCost))
## [1] 57
quality[57,c(1,7)]
##        Country LivingCost
## 57 Switzerland     121.16
which(is_outlier(completedData$PPIR))
## [1] 10 43 55 58
quality[c(10,43,55,58),c(1,8)]
##        Country  PPIR
## 10       China 29.09
## 43 Philippines 21.83
## 55   Sri Lanka 26.32
## 58    Thailand 21.94
subset(quality[,c(1,16,12,8,9,5,11,14)],quality$Country=='United States')
##          Country lifeExpectancy    Status PPIR Traffic Safety Climate
## 63 United States           68.3 developed 3.58   32.87  52.87   77.51
##      LogGDP
## 63 10.92247
# Based on the scatterplot of LifeExpectancy to each predictors, we suppose that these Life quality index and healthcare, econimic status are related to the life expectancy. Let's do the regression to see if it is correct. 

dat=completedData
dat$greater65=0

dat$greater65[which(dat$LifeExpectancy>=65)]='greater than 65'
dat$greater65[which(dat$LifeExpectancy<65)]='less than 65'

for (i in 3:17 ) {
  plot=ggplot(data=dat,aes(x=greater65,y=completedData[,i],fill=greater65))+geom_boxplot()+ylab(names(completedData[i]))+xlab('life expectancy')
  print(plot)
}

Method

The statistical technique used in this study is multiple linear regression. It is a method to find the best fitting line to a set of data with some quantitative explanatory variables X and a quantitative response variable Y and examining the slope of the regression line. Simple linear regression allows us to summarize the relationship between two quantitative variables. Multiple linear regression is with more than two predictors.

Train and test is used to evaluate the model performance. The completed data is randomly splitted to two part: 75% of the total observation are the train set and the left 25% of the total observations are validation set. The training data set is used to develop a number of regression models, while the test data set is used to evaluate the performance of these models.

Train Model:70% training / 30% validation set split of the data set

set.seed(222)
train=sample(length(completedData$LifeExpectancy),floor(0.70*length(completedData$LifeExpectancy)),replace=FALSE)
length(train)
## [1] 44
#train=c(60,  5, 31 , 1, 55, 57, 21, 24, 33,   22 , 4 , 6, 36, 48,  3, 32, 49,  10, 38, 12, 63 ,54,45 ,44 ,28, 18, 16, 50 ,42, 23, 61 ,17, 51, 39 ,20, 11, 46, 29, 19, 25,62 ,26, 15 ,41, 27)
#train2=c(60,  5, 31 , 1, 55, 57, 21, 24, 33,   22 , 4 , 6, 36, 48,  3, 32, 49,  10, 38, 12, 63 ,54,45 ,44 ,28, 18, 16, 50 ,42, 23, 61 ,17, 51, 39 ,20, 11, 46, 29, 19, 25,62 ,26, 15 ,41, 27,52,30,8)
# Check  if there is any collinearity,from the output, we can see there are several strong correlations between the predictors.
#train=subset(train, train!=52)
train
##  [1] 61 62 15 18 23 24 22 52  9 10 41 50 33 59 46 26  7 54 44 13 16 37 56
## [24]  1  5 58 42 53 40  4 14 47 63 27 29 39 48 34 60 20 35 12 25 30
library(faraway)
## Warning: package 'faraway' was built under R version 3.6.2
## 
## Attaching package: 'faraway'
## The following object is masked from 'package:mice':
## 
##     mammalsleep
## The following object is masked from 'package:lattice':
## 
##     melanoma
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 3.0-2
library(car)
## Warning: package 'car' was built under R version 3.6.2
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
## 
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
## 
##     logit, vif
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
round(cor(completedData[, -c(1,2)]), 2)
##              LifeQuality PurchaPower Safety HealthCare LivingCost  PPIR
## LifeQuality         1.00        0.86   0.59       0.58       0.78 -0.60
## PurchaPower         0.86        1.00   0.38       0.59       0.81 -0.54
## Safety              0.59        0.38   1.00       0.27       0.38 -0.18
## HealthCare          0.58        0.59   0.27       1.00       0.58 -0.08
## LivingCost          0.78        0.81   0.38       0.58       1.00 -0.38
## PPIR               -0.60       -0.54  -0.18      -0.08      -0.38  1.00
## Traffic            -0.61       -0.38  -0.43      -0.06      -0.37  0.48
## Pollution          -0.90       -0.70  -0.47      -0.50      -0.68  0.41
## Climate             0.15       -0.03  -0.14       0.08       0.17 -0.07
## Happiness           0.77        0.74   0.30       0.53       0.73 -0.39
## LogGDP              0.81        0.83   0.42       0.46       0.77 -0.48
## SocialSuppor        0.59        0.47   0.16       0.38       0.54 -0.28
## Freedom             0.35        0.43   0.14       0.33       0.31 -0.05
## Generosity          0.12        0.27  -0.16       0.25       0.25 -0.07
## Corruption         -0.64       -0.68  -0.28      -0.43      -0.70  0.24
##              Traffic Pollution Climate Happiness LogGDP SocialSuppor
## LifeQuality    -0.61     -0.90    0.15      0.77   0.81         0.59
## PurchaPower    -0.38     -0.70   -0.03      0.74   0.83         0.47
## Safety         -0.43     -0.47   -0.14      0.30   0.42         0.16
## HealthCare     -0.06     -0.50    0.08      0.53   0.46         0.38
## LivingCost     -0.37     -0.68    0.17      0.73   0.77         0.54
## PPIR            0.48      0.41   -0.07     -0.39  -0.48        -0.28
## Traffic         1.00      0.53   -0.04     -0.53  -0.43        -0.48
## Pollution       0.53      1.00   -0.06     -0.75  -0.75        -0.69
## Climate        -0.04     -0.06    1.00      0.04  -0.06         0.03
## Happiness      -0.53     -0.75    0.04      1.00   0.78         0.74
## LogGDP         -0.43     -0.75   -0.06      0.78   1.00         0.65
## SocialSuppor   -0.48     -0.69    0.03      0.74   0.65         1.00
## Freedom        -0.19     -0.38   -0.28      0.50   0.28         0.28
## Generosity      0.08     -0.12   -0.11      0.24   0.11         0.04
## Corruption      0.19      0.67    0.03     -0.67  -0.59        -0.41
##              Freedom Generosity Corruption
## LifeQuality     0.35       0.12      -0.64
## PurchaPower     0.43       0.27      -0.68
## Safety          0.14      -0.16      -0.28
## HealthCare      0.33       0.25      -0.43
## LivingCost      0.31       0.25      -0.70
## PPIR           -0.05      -0.07       0.24
## Traffic        -0.19       0.08       0.19
## Pollution      -0.38      -0.12       0.67
## Climate        -0.28      -0.11       0.03
## Happiness       0.50       0.24      -0.67
## LogGDP          0.28       0.11      -0.59
## SocialSuppor    0.28       0.04      -0.41
## Freedom         1.00       0.41      -0.53
## Generosity      0.41       1.00      -0.36
## Corruption     -0.53      -0.36       1.00


Before training the model, a correlation matrix is used to check if there’s any collinearity between the potential predictors. Predictors with the variance inflation factor (VIF=1/(1-\(R^2\)), is the percent of variation in \(X_j\) explained by the other predictors) exceeding 5 are removed.

From Correlation Matrix table, it indicates that there’s strong correlations between predictors: life quality index, purchase power, living cost, pollution, happiness, logGDP. These predictors are removed except logGDP. Even though LogGDP seems to have strong correlation with lots of predictors, it is an important economic factor and when some of the other correlated predictors of are removed, the correlations decreased. Surprisingly, there’s no strong relations between healthcare and LogGDP. Also the property price to income ratio is not strongly related to the logGDP. That means people in some rich regions are suffering a bad healthcare and people in some of the less rich countries are paying high housing price compare to their income.

The model shows signs of multicollinearity. The overall F statistic is large and R2 = 0.76, but only Status are significant.

lm_mod1=lm(LifeExpectancy~.,data=completedData,subset = train)
summary(lm_mod1)
## 
## Call:
## lm(formula = LifeExpectancy ~ ., data = completedData, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8015 -0.9826  0.0481  1.0661  2.6643 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -5112.5732  5862.6448  -0.872   0.3909  
## Statusdeveloping    -1.6619     1.1586  -1.434   0.1629  
## LifeQuality         51.3379    58.6660   0.875   0.3892  
## PurchaPower        -20.5658    23.4699  -0.876   0.3886  
## Safety             -25.6110    29.3404  -0.873   0.3904  
## HealthCare         -20.5122    23.4677  -0.874   0.3898  
## LivingCost           5.2159     5.8645   0.889   0.3816  
## PPIR                51.5902    58.6575   0.880   0.3869  
## Traffic             25.5514    29.3273   0.871   0.3913  
## Pollution           34.2595    39.1047   0.876   0.3887  
## Climate            -17.0477    19.5584  -0.872   0.3911  
## Happiness            0.5177     0.8372   0.618   0.5415  
## LogGDP               3.0033     1.2545   2.394   0.0239 *
## SocialSuppor         1.7490     6.6575   0.263   0.7948  
## Freedom              0.9434     3.9111   0.241   0.8112  
## Generosity          -2.9654     2.1707  -1.366   0.1832  
## Corruption          -0.6493     2.5741  -0.252   0.8028  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.872 on 27 degrees of freedom
## Multiple R-squared:  0.8828, Adjusted R-squared:  0.8133 
## F-statistic: 12.71 on 16 and 27 DF,  p-value: 1.15e-08
round(vif(lm_mod1), 2)
##       Status  LifeQuality  PurchaPower       Safety   HealthCare 
##         4.14  40275392.49   5889297.08   1518832.89    656250.08 
##   LivingCost         PPIR      Traffic    Pollution      Climate 
##    149149.85   1031694.64    463867.07   7755817.16    943373.07 
##    Happiness       LogGDP SocialSuppor      Freedom   Generosity 
##         8.71         7.86         3.87         2.32         1.58 
##   Corruption 
##         4.02
#Several of the variance in action factors are large and exceed the 5 cutting.
lm_mod2=lm(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData,subset = train)
summary(lm_mod2)
## 
## Call:
## lm(formula = LifeExpectancy ~ . - LifeQuality - PurchaPower - 
##     LivingCost - Happiness, data = completedData, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0220 -1.0514 -0.0337  1.1516  3.6639 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.89559   11.58821   0.940 0.354368    
## Statusdeveloping -1.73906    1.14406  -1.520 0.138627    
## Safety            0.07800    0.03612   2.159 0.038687 *  
## HealthCare        0.04426    0.03954   1.119 0.271613    
## PPIR              0.28456    0.07047   4.038 0.000329 ***
## Traffic          -0.13128    0.07089  -1.852 0.073579 .  
## Pollution         0.05354    0.03216   1.665 0.106032    
## Climate           0.08732    0.02832   3.083 0.004281 ** 
## LogGDP            3.78758    0.91886   4.122 0.000260 ***
## SocialSuppor      6.95486    6.07943   1.144 0.261381    
## Freedom           0.48396    3.69338   0.131 0.896595    
## Generosity       -1.89935    2.23060  -0.852 0.401022    
## Corruption       -2.96327    2.28687  -1.296 0.204618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.963 on 31 degrees of freedom
## Multiple R-squared:  0.8519, Adjusted R-squared:  0.7946 
## F-statistic: 14.86 on 12 and 31 DF,  p-value: 1.171e-09
round(vif(lm_mod2), 2)
##       Status       Safety   HealthCare         PPIR      Traffic 
##         3.67         2.09         1.69         1.35         2.46 
##    Pollution      Climate       LogGDP SocialSuppor      Freedom 
##         4.77         1.80         3.83         2.93         1.88 
##   Generosity   Corruption 
##         1.52         2.88
plot(lm_mod2,which=c(1,2))

#lm_mod2=lm(LifeExpectancy~Status+ Safety+PPIR+Climate+LogGDP,data=completedData,subset = train)
#summary(lm_mod2)
#plot(lm_mod2,which=c(1,2))

All potential predictor variables will be included in the regression model at the beginning of Backward elimination. Then, it deletes the predictor variable step by step such that the resulting model has the lowest value of an information criterion. (This amounts to deleting the predictor with the largest p-value each time.) This process is continued until all variables have been deleted from the model or the information criterion increases. The final model is determined by backwards stepwise selection, which has the lowest AIC. AIC is defined to be AIC=2[-log(L(\(\hat{\beta_0},\hat{\beta_1}...\hat{\beta_n},\hat{\sigma^2}\)|Y)+K], where K=p+2. R calculate AIC using AIC=nlog(RSS/n)+2p. (Sheather, S. J.,231-237)


There’s three assumption for multiple linear regression model: there’s a linear relationship between predictors and response (linearity), the variance of residual is the same for any value of X (constant variance), the observations should be independent to each other, the residuals should follow normal distribution. The basic tool for examining the model’s linearity and constant variance is the residuals vs fitted values plot. Residuals are measured as follows: residual = observed y – model-predicted y


From the Residuals vs Fitted Values Plot If the points are randomly spread around 0 line, no obvious fan pattern, the assumption of constant variance is met. If the points are too far away from the 0, the assumption of linearity is not satisfied. QQ-plot is a good technique to evaluate the normality of residuals by comparing the residuals to ‘ideal’ normal observations. Observations lie well along the 45-degree line in the QQ-plot, so we may assume that normality holds.


Another tool to check assumptions of the model is detecting high leverage points and outliers. High leverage points are the data points have x -values that have an unusually large effect on the estimated regression model. Detecting the outliers is to detect the points which do not follow the pattern set by the bulk of the data, when one takes into account the given model. The leverage for point i is quantified by hi , the i th diagonal entry of hat matrix H. A popular rule is to classify xi as a point of high leverage in a simple linear regression model if \(h_i>2* average(h_i)=2(p+1)/n\). And a popular rule for outliers is its standardized residuals falls outside the interval from -2 to 2.

Power transformation is considered in this case, but we decided not to transform the variables because of the limiting improvement of the model and the risk of overfitting.

Backwards stepwise selection applied to the data set.

step_mod=step(lm_mod2, direction='backward',trace=F)
summary(step_mod)
## 
## Call:
## lm(formula = LifeExpectancy ~ Status + Safety + PPIR + Traffic + 
##     Climate + LogGDP, data = completedData, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3996 -1.1764  0.1569  1.2754  4.8474 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      14.96300   10.16310   1.472  0.14940    
## Statusdeveloping -1.72397    0.97151  -1.775  0.08420 .  
## Safety            0.07526    0.03146   2.392  0.02196 *  
## PPIR              0.29865    0.06793   4.396 8.94e-05 ***
## Traffic          -0.10083    0.05580  -1.807  0.07888 .  
## Climate           0.09232    0.02666   3.463  0.00136 ** 
## LogGDP            4.23876    0.74541   5.687 1.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.944 on 37 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.7986 
## F-statistic: 29.43 on 6 and 37 DF,  p-value: 1.15e-12
plot(step_mod,which=c(1,2))

#check assumptions for the model:linear, constant variance.From the plot, we can see there's non linear relationship. powertransformation may be applied to resolve this issues.

  





####

diagnosis

There are 3 outliers (points that have standardized residuals out of the interval -2 to 2), and no points are flagged as having high leverage. There are no “bad” leverage points to remove, since none of the high leverage points have standardized residuals outside -2 to 2.

p <- 8
n <- nrow(completedData[train,])

plot(hatvalues(step_mod), rstandard(step_mod),xlab='Leverage', ylab='Standardized Residuals')%>%abline(v = 2*(p+1)/n, lty=2)

Evaluation and Result

The residuals vs fitted values plot shows a little non linearity, but not significant. Most points fall on the line. The residuals are approximately normally distributed. Power transformation is not considered in this case, because the non linearity is not significant and while applying transformation to the data, it may cause overfitting.

Finally the stepwise regression model got adjusted R-squared 0.7986 with the training set which means 79.86% of the variation on life expectancy is explained by the predicting variables. The R Square on test data set is 0.7199692, shows that there is positive relationship between the predicted value and the response variable. 72% of the variation of life expectancy in test set have been explained by the predicting variables. The RMSE obtained by training data is 2.018533, which is much smaller than the standard deviation(4.231). So it is a valid model to predict the life expectancy.

compute_rmse <- function(y, y_pred) {
n <- length(y)
sqrt((1 / n) * sum((y - y_pred)^2))
}
rsquare=function(y,y_pred){
1-sum((y-y_pred)^2)/sum((y-mean(y))^2)
} 

adrsquare=function(rsquare,k){
1-((1-rsquare)*(17-1)/(17-k-1))
} 
lm_pred=predict(step_mod,newdata = completedData[-train,])


#calculate the RMSE and R square for logY
compute_rmse(completedData$LifeExpectancy[-train], lm_pred)
## [1] 2.018533
rsquare(completedData$LifeExpectancy[-train], lm_pred)
## [1] 0.7199692
y_train=completedData[train,'LifeExpectancy']
y_test=completedData[-train,'LifeExpectancy']
x_train=model.matrix(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData)[train, -1]
x_test=model.matrix(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData)[-train,-1]
lm_fit=lm(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData,subset=train)
summary(lm_fit)
## 
## Call:
## lm(formula = LifeExpectancy ~ . - LifeQuality - PurchaPower - 
##     LivingCost - Happiness, data = completedData, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0220 -1.0514 -0.0337  1.1516  3.6639 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.89559   11.58821   0.940 0.354368    
## Statusdeveloping -1.73906    1.14406  -1.520 0.138627    
## Safety            0.07800    0.03612   2.159 0.038687 *  
## HealthCare        0.04426    0.03954   1.119 0.271613    
## PPIR              0.28456    0.07047   4.038 0.000329 ***
## Traffic          -0.13128    0.07089  -1.852 0.073579 .  
## Pollution         0.05354    0.03216   1.665 0.106032    
## Climate           0.08732    0.02832   3.083 0.004281 ** 
## LogGDP            3.78758    0.91886   4.122 0.000260 ***
## SocialSuppor      6.95486    6.07943   1.144 0.261381    
## Freedom           0.48396    3.69338   0.131 0.896595    
## Generosity       -1.89935    2.23060  -0.852 0.401022    
## Corruption       -2.96327    2.28687  -1.296 0.204618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.963 on 31 degrees of freedom
## Multiple R-squared:  0.8519, Adjusted R-squared:  0.7946 
## F-statistic: 14.86 on 12 and 31 DF,  p-value: 1.171e-09
rmse=function(y,y_pred){
  n=length(y)
  sqrt((1/n)*sum((y-y_pred)^2))
}
lm_pred=predict(lm_fit,newdata =completedData[-train,-1] )
a=paste('OLS model RMSE',rmse(y_test,lm_pred));a
## [1] "OLS model RMSE 1.92312653913355"
lm_step=step(lm_fit,trace = F)
summary(lm_step)
## 
## Call:
## lm(formula = LifeExpectancy ~ Status + Safety + PPIR + Traffic + 
##     Climate + LogGDP, data = completedData, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3996 -1.1764  0.1569  1.2754  4.8474 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      14.96300   10.16310   1.472  0.14940    
## Statusdeveloping -1.72397    0.97151  -1.775  0.08420 .  
## Safety            0.07526    0.03146   2.392  0.02196 *  
## PPIR              0.29865    0.06793   4.396 8.94e-05 ***
## Traffic          -0.10083    0.05580  -1.807  0.07888 .  
## Climate           0.09232    0.02666   3.463  0.00136 ** 
## LogGDP            4.23876    0.74541   5.687 1.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.944 on 37 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.7986 
## F-statistic: 29.43 on 6 and 37 DF,  p-value: 1.15e-12
lm_step_pred=predict(lm_step, newdata= completedData[-train,-1])
b=paste('OLS step RMSE',rmse(y_test,lm_step_pred), 'and 6 predictors(excluding the intercept) were retained in the model fit using the training data');b
## [1] "OLS step RMSE 2.01853320494638 and 6 predictors(excluding the intercept) were retained in the model fit using the training data"


Lasso and Ridge Regularization will be applied to the model and then compare to the stepwise model

library(glmnet)
set.seed(3)
ridge_fit=cv.glmnet(x_train,y_train, alpha=0)
ridge_fit$lambda.min
## [1] 0.7186131
coef(ridge_fit, s="lambda.min")
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)      26.74503810
## Statusdeveloping -1.93511363
## Safety            0.06884993
## HealthCare        0.04400887
## PPIR              0.21508465
## Traffic          -0.09224277
## Pollution         0.01365314
## Climate           0.06570276
## LogGDP            2.54386487
## SocialSuppor      6.82429889
## Freedom           0.31773772
## Generosity       -1.81318963
## Corruption       -1.75265931
ridge_pred=predict(ridge_fit,newx=x_test, s='lambda.min')
ridge_pred=as.numeric(ridge_pred)
c=paste('ridge RMSE',rmse(y_test,ridge_pred),'and 12 predictors(excluding the intercept) were retained in the model fit using the training data');c
## [1] "ridge RMSE 1.96101639158768 and 12 predictors(excluding the intercept) were retained in the model fit using the training data"
set.seed(10)
lasso_fit=cv.glmnet(x_train,y_train, alpha=1)
lasso_fit$lambda.min
## [1] 0.01548205
coef(lasso_fit,s='lambda.min')
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)      12.25266188
## Statusdeveloping -1.70095023
## Safety            0.07661894
## HealthCare        0.04275590
## PPIR              0.27946826
## Traffic          -0.12446662
## Pollution         0.04517256
## Climate           0.08607239
## LogGDP            3.74995355
## SocialSuppor      6.36571026
## Freedom           0.24991505
## Generosity       -1.69604511
## Corruption       -2.61802234
sum(abs(as.numeric(coef(lasso_fit,s='lambda.min')))>0)
## [1] 13
lasso_pred=predict(lasso_fit,newx=x_test,s='lambda.min')
lasso_pred=as.numeric(lasso_pred)
d=paste('lasso RMSE', rmse(y_test,lasso_pred),' and 12 predictors were retained (including the intercept) in the model fit using the training data');d
## [1] "lasso RMSE 1.91695047281746  and 12 predictors were retained (including the intercept) in the model fit using the training data"

So ordinary least squares R2 for test set is 0.7458 and RMSE is 1.923. The stepwise model R2 for test set is 0.72 and RMSE is 2.0185 . Lasso model R2 for test set is 0.7474 and RMSE is 1.91. The ridge R2 for test set is 0.7357 and RMSE is 1.96. All models predict life expectancy with fair accuracy. Lasso performs best and stepwise model shows less performance on this case.

rsquare=function(y,y_pred){
  1-sum((y-y_pred)^2)/sum((y-mean(y))^2)
}
paste('OLS test R square', rsquare(y_test,lm_pred))
## [1] "OLS test R square 0.745815122180905"
paste('OLS stepwise test R square',rsquare(y_test,lm_step_pred))
## [1] "OLS stepwise test R square 0.719969211570034"
paste('Ridge test R square',rsquare(y_test,ridge_pred))
## [1] "Ridge test R square 0.735700443187198"
paste('Lasso test R square',rsquare(y_test,lasso_pred))
## [1] "Lasso test R square 0.747445115678852"
a;b;c;d
## [1] "OLS model RMSE 1.92312653913355"
## [1] "OLS step RMSE 2.01853320494638 and 6 predictors(excluding the intercept) were retained in the model fit using the training data"
## [1] "ridge RMSE 1.96101639158768 and 12 predictors(excluding the intercept) were retained in the model fit using the training data"
## [1] "lasso RMSE 1.91695047281746  and 12 predictors were retained (including the intercept) in the model fit using the training data"

Conclusion

A country’s life expectancy mostly depends on the economic factors not social factors and healthcare. People in countries with higher economic status tend to has a longer life expectancy. Climate and safety is non-economic factors that affect the life expectancy, which makes us interested in relationship between climate and life expectancy. To improve the life expectancy of a country, is important to improve the economics. It makes sense because when the economics improved, the country has money to improve the healthcare, environment and many other aspect which are relevant to life expectancy. People would pay more attention to their health.


Multiple linear regression might not be the best method for prediction. There might be better method for predicting life expectancy such as LASSO and ridge regression, regression trees and random forest. They might use more predicting variable and provide more precise predictions. Maybe in the future we’ll gather climate information with each region and discuss the role it plays in life expectancy. Also, we’ll try to find more data about country’s economic factors, social factors, environmental factors and health related factors in different years to predict the life expectancy with certain years.