Four models are built to predict the life expectancy for different countries based on multiple economics and social factors. ordinary least squares R2 for test set is 0.7458 and RMSE is 1.923. The stepwise model R2 for test set is 0.72 and RMSE is 2.0185 . Lasso model R2 for test set is 0.7474 and RMSE is 1.91. The ridge R2 for test set is 0.7357 and RMSE is 1.96. All models predict life expectancy with fair accuracy. Lasso performs best and stepwise model shows less performance on this case.
This project is trying to find out the multiple economics and social factors other than health factors which may affect a country’s life expectancy through multiple linear regression: life expectancy as the response. We put emphasis on following factors: quality of life index such as living cost, climate, traffic, property price to income ratio, social factors such as corruption, generosity, freedom, economic status and some other health related factors such as healthcare, pollution, happiness. Our goal is to determine the predictors that are significantly contributing to the value of life expectancy. This will help to suggest a country which area should be paid attention in order to improve the life expectancy of its population.
The observations of this dataset are based on different countries. Two data sets ‘Quality of Life Index’ & ‘world_happiness_report_2018’ are merged into one data frame. The data sources are from this two websites: Kaggle and NUMBEO. On initial visual inspection of the data showed some missing values. Countries’ Log GDP per capita are searched online and filled missing values in Log GDP column. Finding all data for these countries was difficult and hence, so some less known countries are excluded from the final model data-set. Other Missing data is handled in R software by using ‘mice’ package. The final merged file(final dataset) consists of 18 Columns(with country name) and 64 rows which means 16 predicting variables. All predicting variables are then divided into several broad categories: quality of life index factors, Economical factors and Social factors. Status is the only categorical predictors, other predictors are numerical. The plot of life expectancy distribution group by status shows that there’s strong relationship between life expectancy and country developing status. The box plot of life expectancy with two different statuses displays a higher mean and median of developed countries than developing countries. It indicates that, in general, the life expectancy of developed countries is much longer than the developing country.
Imputing the Missing Data
library(mice)
## Warning: package 'mice' was built under R version 3.6.2
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
quality=read.csv(file = 'quality.csv',header = TRUE)[-1]
quality
## Country Rank Quality.of.Life PurchasingPower Safety
## 1 Argentina 48 122.49 55.00 37.45
## 2 Australia 4 191.13 122.85 57.24
## 3 Austria 5 191.05 96.70 78.63
## 4 Belarus 37 141.47 44.64 76.47
## 5 Belgium 23 162.09 95.09 57.54
## 6 Brazil 64 100.33 37.43 29.76
## 7 Bulgaria 43 130.59 52.18 60.00
## 8 Canada 18 170.32 109.44 60.49
## 9 Chile 46 124.14 54.13 53.19
## 10 China 65 97.92 68.95 54.54
## 11 Colombia 57 108.36 34.66 48.74
## 12 Croatia 21 165.31 60.33 75.31
## 13 Cyprus 27 157.57 75.29 70.69
## 14 Czech Republic 25 158.79 71.32 73.34
## 15 Denmark 1 198.57 114.39 75.75
## 16 Egypt 71 83.98 23.67 49.29
## 17 Estonia 11 180.88 75.97 79.20
## 18 Finland 3 194.01 112.30 77.20
## 19 France 26 157.83 91.55 53.61
## 20 Georgia 50 120.90 33.19 80.14
## 21 Germany 8 187.05 116.20 65.49
## 22 Greece 39 137.82 49.57 61.43
## 23 India 52 117.51 66.91 57.28
## 24 Indonesia 58 107.20 32.95 53.99
## 25 Iran 70 87.02 37.74 50.67
## 26 Ireland 24 160.82 95.09 55.48
## 27 Israel 30 153.82 92.48 67.84
## 28 Italy 36 145.69 77.49 54.98
## 29 Japan 12 180.50 103.12 86.27
## 30 Jordan 54 112.94 41.73 56.18
## 31 Kazakhstan 69 87.17 38.83 33.49
## 32 Latvia 34 149.15 55.16 63.23
## 33 Lebanon 55 111.21 53.97 55.73
## 34 Lithuania 29 156.36 60.34 63.49
## 35 Macedonia 56 110.64 39.83 60.71
## 36 Malaysia 49 122.11 73.14 39.21
## 37 Mexico 47 123.48 49.98 47.70
## 38 Netherlands 6 188.91 102.54 71.43
## 39 New Zealand 9 185.58 101.09 60.45
## 40 Norway 10 181.86 103.61 64.68
## 41 Pakistan 60 104.63 38.61 53.27
## 42 Panama 53 113.12 41.34 53.57
## 43 Philippines 67 90.73 29.46 59.17
## 44 Portugal 22 163.50 55.19 67.87
## 45 Romania 38 140.31 61.16 72.16
## 46 Russia 59 104.94 45.38 58.07
## 47 Saudi Arabia 32 152.72 118.59 70.27
## 48 Serbia 51 119.83 39.90 62.73
## 49 Singapore 28 156.91 103.77 78.53
## 50 Slovakia 31 153.10 63.24 70.46
## 51 Slovenia 15 175.98 75.38 77.43
## 52 South Africa 40 135.31 82.97 23.20
## 53 South Korea 33 149.53 103.40 65.95
## 54 Spain 16 174.16 83.80 67.54
## 55 Sri Lanka 66 95.30 27.34 58.97
## 56 Sweden 14 178.67 111.38 50.65
## 57 Switzerland 2 195.93 129.70 78.50
## 58 Thailand 61 103.26 40.86 53.34
## 59 Turkey 44 125.51 45.33 59.62
## 60 Ukraine 62 102.34 32.72 51.12
## 61 United Arab Emirates 20 167.81 119.60 83.68
## 62 United Kingdom 17 170.81 105.73 57.28
## 63 United States 13 179.20 122.03 52.87
## 64 Vietnam 68 88.82 33.33 51.22
## HealthCare LivingCost PPIR Traffic Pollution Climate Status
## 1 69.39 33.09 19.02 43.04 52.35 98.28 developing
## 2 76.38 72.08 7.60 35.29 23.97 93.75 developed
## 3 79.19 71.79 10.14 25.15 21.97 77.74 developed
## 4 58.01 33.12 14.27 28.78 40.97 64.37 developing
## 5 79.44 72.97 7.18 34.86 49.89 86.03 developed
## 6 54.64 42.80 18.72 43.37 57.72 95.35 developing
## 7 54.04 37.17 8.72 28.79 63.98 82.76 developing
## 8 70.99 65.01 7.66 33.77 27.85 52.55 developed
## 9 65.50 47.73 13.51 36.79 65.54 90.21 developing
## 10 64.03 39.24 29.09 42.46 81.91 78.91 developing
## 11 67.51 30.15 18.85 45.06 61.74 86.04 developing
## 12 64.14 49.18 11.63 29.11 31.03 88.95 developing
## 13 50.17 55.57 7.17 20.27 53.75 92.23 developed
## 14 74.71 45.12 15.17 30.02 40.96 77.13 developed
## 15 79.41 81.38 6.93 28.51 22.14 81.80 developed
## 16 44.22 26.46 13.60 49.17 86.48 91.98 developing
## 17 72.12 51.01 9.31 25.75 18.15 64.28 developed
## 18 73.49 72.82 7.98 30.41 11.93 58.56 developed
## 19 78.55 74.85 12.85 35.27 42.70 88.25 developed
## 20 51.29 28.78 12.67 34.34 72.46 84.20 developing
## 21 74.32 67.62 9.02 29.91 28.01 82.51 developed
## 22 55.16 56.66 9.84 32.20 51.86 94.18 developed
## 23 68.32 24.17 11.33 45.71 75.81 65.74 developing
## 24 61.98 36.24 13.81 42.93 62.78 68.96 developing
## 25 51.18 35.80 16.87 48.38 79.35 70.99 developing
## 26 48.58 75.35 7.90 35.85 31.12 89.13 developed
## 27 73.44 74.28 13.98 36.32 57.25 93.80 developed
## 28 67.14 69.25 9.97 34.69 53.75 91.25 developed
## 29 80.40 83.33 11.25 40.03 37.08 84.79 developed
## 30 66.12 54.98 7.67 42.43 80.39 89.05 developing
## 31 51.27 29.83 11.51 29.75 74.25 39.78 developing
## 32 59.71 49.23 9.35 31.45 34.96 74.70 developing
## 33 65.90 58.06 13.44 37.36 87.39 94.74 developing
## 34 67.58 45.91 11.11 27.15 30.84 69.86 developing
## 35 58.98 31.58 13.65 27.94 80.85 76.30 developing
## 36 67.61 39.38 9.77 35.25 63.74 60.10 developing
## 37 69.72 32.71 10.19 39.08 66.02 86.29 developing
## 38 77.81 74.83 7.38 29.87 27.45 87.45 developed
## 39 73.62 72.62 8.34 31.19 22.74 95.46 developed
## 40 74.14 100.99 8.40 27.12 19.86 71.16 developed
## 41 60.58 20.40 14.06 38.63 75.89 72.99 developing
## 42 61.41 51.45 12.67 36.48 62.00 67.84 developing
## 43 65.87 35.09 21.83 44.15 74.47 60.23 developing
## 44 70.72 50.39 12.71 29.02 31.58 97.55 developed
## 45 54.49 36.45 10.15 34.36 55.39 77.62 developing
## 46 57.63 35.52 12.39 46.00 62.80 46.53 developing
## 47 59.07 47.61 2.89 29.61 67.26 41.42 developing
## 48 52.56 35.39 18.62 29.74 58.86 83.23 developing
## 49 70.30 69.79 21.56 42.15 32.29 57.45 developed
## 50 60.06 44.98 10.27 29.58 41.89 78.13 developed
## 51 62.81 52.51 10.16 24.81 24.33 78.08 developed
## 52 62.56 42.49 4.11 40.33 56.95 95.97 developing
## 53 84.51 76.93 17.58 40.02 54.19 68.39 developed
## 54 77.77 54.70 9.21 29.42 39.36 94.19 developed
## 55 72.30 30.24 26.32 51.90 57.69 59.11 developing
## 56 70.95 71.55 10.26 30.29 18.01 74.92 developed
## 57 72.68 121.16 9.63 29.05 22.03 80.05 developed
## 58 79.06 47.54 21.94 39.40 72.21 69.45 developing
## 59 69.36 35.21 9.26 44.77 69.15 93.26 developed
## 60 50.95 27.94 14.35 37.36 66.63 70.69 developing
## 61 67.99 56.16 4.40 37.46 53.11 45.23 developing
## 62 74.71 65.28 9.16 34.62 39.43 87.82 developed
## 63 69.41 69.91 3.58 32.87 33.95 77.51 developed
## 64 54.54 37.70 19.66 28.33 87.13 71.24 developing
## Ladder LogGDP SocialSuppor lifeExpectancy Freedom Generosity
## 1 5.792797 9.809972 0.8999116 68.8 0.8458947 -0.20693663
## 2 7.176993 10.721021 0.9401373 73.6 0.9160281 0.13779540
## 3 7.396002 10.741893 0.9116681 73.0 0.9041120 0.05155233
## 4 5.233770 9.778739 0.9045693 66.1 0.6436024 -0.18186484
## 5 6.892172 10.672445 0.9298155 72.0 0.8083866 -0.12727834
## 6 6.190922 9.557933 0.8815053 66.4 0.7506090 -0.12632652
## 7 5.098814 9.873219 0.9238533 66.8 0.7243359 -0.17910960
## 8 7.175497 10.701248 0.9227188 73.6 0.9457829 0.09796613
## 9 6.436221 10.065920 0.8900849 69.9 0.7885304 -0.07061584
## 10 5.131434 9.694376 0.7876053 69.3 0.8953777 -0.17489916
## 11 5.983512 9.511734 0.8709704 67.7 0.8507658 -0.15504800
## 12 5.536271 10.065751 0.9098068 70.2 0.6908556 -0.15252924
## 13 6.276443 NA 0.8255732 73.7 0.7942150 NA
## 14 7.034165 10.419474 0.9291639 70.3 0.7901324 -0.29435962
## 15 7.648786 10.755594 0.9582189 72.4 0.9354378 0.01559285
## 16 4.005451 9.293960 0.7588240 61.7 0.6816545 -0.22292954
## 17 6.091302 10.324107 0.9326938 68.6 0.8856184 -0.14466491
## 18 7.858107 10.636060 0.9621550 71.9 0.9378074 -0.13173516
## 19 6.665904 10.573352 0.9214631 73.8 0.8163772 -0.14156713
## 20 4.659097 9.229100 0.6172186 64.5 0.7751441 -0.21924759
## 21 7.118364 10.730945 0.9197631 72.2 0.8768875 0.03019965
## 22 5.409289 10.132058 0.7935008 72.4 0.5644557 -0.33638453
## 23 3.818069 8.830280 0.6380520 60.1 0.8904434 0.07180630
## 24 5.340296 9.362827 0.8093789 62.1 0.8793744 0.49937782
## 25 4.278118 9.886065 0.6737647 66.0 0.6033199 0.04471550
## 26 6.962336 11.163328 0.9378624 72.3 0.8614716 0.13978343
## 27 6.927179 10.424574 0.9095954 73.3 0.7246623 0.05254996
## 28 6.516527 10.480516 0.9126561 73.6 0.6500093 -0.02346227
## 29 5.793575 10.581618 0.8864319 75.0 0.7734721 -0.27297229
## 30 4.638934 9.024435 0.7995443 66.8 0.7624203 -0.18348970
## 31 6.007636 10.111166 0.9366567 64.6 0.8401830 -0.10859531
## 32 5.901154 10.184117 0.9132763 66.8 0.6082076 -0.21699953
## 33 5.167187 9.507958 0.8293806 67.1 0.6070307 -0.06632593
## 34 6.308879 10.339512 0.9293501 67.3 0.6989451 -0.24097855
## 35 5.239835 9.503044 0.8489152 67.5 0.7448009 -0.04029087
## 36 5.338818 10.235504 0.7894086 67.0 0.8745483 0.11288136
## 37 6.549579 9.769919 0.8580689 68.3 0.8162005 -0.18181843
## 38 7.463097 10.809204 0.9394432 72.3 0.9199852 0.15553649
## 39 7.370286 10.501477 0.9538627 73.2 0.9493002 0.11675003
## 40 7.444262 11.085626 0.9659619 73.2 0.9604290 0.07572849
## 41 5.471554 8.561664 0.6850595 58.5 0.7725691 0.05352156
## 42 6.281434 10.049730 0.9043898 69.6 0.8614481 -0.12278221
## 43 5.869173 8.985703 0.8458033 61.9 0.9178082 -0.11312916
## 44 5.919823 10.262012 0.8871133 72.4 0.8774042 -0.26273811
## 45 6.150879 10.112093 0.8179305 67.2 0.8451596 -0.22003184
## 46 5.513500 10.132390 0.9087261 64.3 0.7292822 -0.15622631
## 47 6.356393 10.797972 0.8678484 66.3 0.8549215 -0.20956351
## 48 5.936493 9.584804 0.8529453 68.2 0.7398918 -0.09851145
## 49 6.374564 NA 0.9028407 76.8 0.9160782 NA
## 50 6.235111 10.352236 0.9223787 68.9 0.7576340 -0.17969246
## 51 6.249419 10.397203 0.9409712 71.1 0.9420459 -0.12239641
## 52 4.883922 9.411724 0.8413440 56.5 0.7527313 -0.05903366
## 53 5.840231 10.511578 0.7977238 73.6 0.6001617 -0.09396584
## 54 6.513371 10.465594 0.9103147 74.4 0.7222507 -0.07935060
## 55 4.400223 9.400388 0.8280654 67.2 0.8526282 0.08676241
## 56 7.374792 10.766932 0.9306796 72.6 0.9417247 0.06957325
## 57 7.508587 10.975945 0.9302910 74.1 0.9264145 0.09636896
## 58 6.011562 9.734829 0.8730524 67.2 0.9048283 0.25165021
## 59 5.185689 10.148917 0.8470273 66.8 0.5286291 -0.18165372
## 60 4.661909 9.012027 0.9009367 64.6 0.6630551 -0.05504196
## 61 6.603744 11.127678 0.8510413 67.1 0.9436644 0.03649422
## 62 7.233445 10.596948 0.9284839 72.3 0.8375084 0.22199814
## 63 6.882685 10.922465 0.9038560 68.3 0.8246067 0.10771339
## 64 5.295547 8.783416 0.8319452 67.9 0.9092598 -0.03912370
## Corruption
## 1 0.85525525
## 2 0.40464750
## 3 0.52306056
## 4 0.71845549
## 5 0.63041180
## 6 0.76325131
## 7 0.95201445
## 8 0.37174085
## 9 0.81629741
## 10 NA
## 11 0.85482091
## 12 0.92540830
## 13 0.84833723
## 14 0.85138226
## 15 0.15060744
## 16 NA
## 17 0.62067795
## 18 0.19860484
## 19 0.58177531
## 20 0.75485378
## 21 0.49567395
## 22 0.86030239
## 23 0.80526334
## 24 0.86772943
## 25 0.70343977
## 26 0.36221024
## 27 0.77013481
## 28 0.88782483
## 29 0.68678451
## 30 NA
## 31 0.82378298
## 32 0.79894924
## 33 0.90665030
## 34 0.85174507
## 35 0.90993434
## 36 0.89413112
## 37 0.80863810
## 38 0.37055778
## 39 0.20658022
## 40 0.26820144
## 41 0.79884166
## 42 0.83693099
## 43 0.72648335
## 44 0.87972784
## 45 0.92117017
## 46 0.86531156
## 47 NA
## 48 0.86372358
## 49 0.09656293
## 50 0.90994465
## 51 0.83925253
## 52 0.84119260
## 53 0.79682589
## 54 0.77650446
## 55 0.85801667
## 56 0.26279658
## 57 0.30125997
## 58 0.90659601
## 59 0.80487853
## 60 0.94296074
## 61 NA
## 62 0.40427601
## 63 0.70992827
## 64 0.80842298
quality2=read.csv(file = 'quality2.csv',header = TRUE)
quality2=quality2[,-1]
head(quality2)
## LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 1 68.8 developing 122.49 55.00 37.45 69.39
## 2 73.6 developed 191.13 122.85 57.24 76.38
## 3 73.0 developed 191.05 96.70 78.63 79.19
## 4 66.1 developing 141.47 44.64 76.47 58.01
## 5 72.0 developed 162.09 95.09 57.54 79.44
## 6 66.4 developing 100.33 37.43 29.76 54.64
## LivingCost PPIR Traffic Pollution Climate Happiness LogGDP
## 1 33.09 19.02 43.04 52.35 98.28 5.792797 9.809972
## 2 72.08 7.60 35.29 23.97 93.75 7.176993 10.721021
## 3 71.79 10.14 25.15 21.97 77.74 7.396002 10.741893
## 4 33.12 14.27 28.78 40.97 64.37 5.233770 9.778739
## 5 72.97 7.18 34.86 49.89 86.03 6.892172 10.672445
## 6 42.80 18.72 43.37 57.72 95.35 6.190922 9.557933
## SocialSuppor Freedom Generosity Corruption
## 1 0.8999116 0.8458947 -0.20693663 0.8552552
## 2 0.9401373 0.9160281 0.13779540 0.4046475
## 3 0.9116681 0.9041120 0.05155233 0.5230606
## 4 0.9045693 0.6436024 -0.18186484 0.7184555
## 5 0.9298155 0.8083866 -0.12727834 0.6304118
## 6 0.8815053 0.7506090 -0.12632652 0.7632513
sort(sapply(quality2,function(x) sum(is.na(x))),decreasing = T)
## Corruption Generosity LifeExpectancy Status LifeQuality
## 5 2 0 0 0
## PurchaPower Safety HealthCare LivingCost PPIR
## 0 0 0 0 0
## Traffic Pollution Climate Happiness LogGDP
## 0 0 0 0 0
## SocialSuppor Freedom
## 0 0
md.pattern(quality2)
## LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 57 1 1 1 1 1 1
## 5 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 0 0 0 0 0 0
## LivingCost PPIR Traffic Pollution Climate Happiness LogGDP SocialSuppor
## 57 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## Freedom Generosity Corruption
## 57 1 1 1 0
## 5 1 1 0 1
## 2 1 0 1 1
## 0 2 5 7
#from the output, we can easily to see there are 2 missing value in 'Generosity' colume and 5 missing values in 'Corruption' colume. Due to the small size of data set, we determined to impute the missing value using the 'mice' package in R.
###----A couple of notes on the parameters:-----
## m=5 refers to the number of imputed datasets. Five is the default value.
## meth='pmm' refers to the imputation method. In this case we are using predictive mean matching as imputation method. Other imputation methods can be used, type methods(mice) for a list of the available imputation methods.
set.seed(1)
tempData <- mice(quality2,m=5,maxit=50,meth='pmm',seed=500)
##
## iter imp variable
## 1 1 Generosity Corruption
## 1 2 Generosity Corruption
## 1 3 Generosity Corruption
## 1 4 Generosity Corruption
## 1 5 Generosity Corruption
## 2 1 Generosity Corruption
## 2 2 Generosity Corruption
## 2 3 Generosity Corruption
## 2 4 Generosity Corruption
## 2 5 Generosity Corruption
## 3 1 Generosity Corruption
## 3 2 Generosity Corruption
## 3 3 Generosity Corruption
## 3 4 Generosity Corruption
## 3 5 Generosity Corruption
## 4 1 Generosity Corruption
## 4 2 Generosity Corruption
## 4 3 Generosity Corruption
## 4 4 Generosity Corruption
## 4 5 Generosity Corruption
## 5 1 Generosity Corruption
## 5 2 Generosity Corruption
## 5 3 Generosity Corruption
## 5 4 Generosity Corruption
## 5 5 Generosity Corruption
## 6 1 Generosity Corruption
## 6 2 Generosity Corruption
## 6 3 Generosity Corruption
## 6 4 Generosity Corruption
## 6 5 Generosity Corruption
## 7 1 Generosity Corruption
## 7 2 Generosity Corruption
## 7 3 Generosity Corruption
## 7 4 Generosity Corruption
## 7 5 Generosity Corruption
## 8 1 Generosity Corruption
## 8 2 Generosity Corruption
## 8 3 Generosity Corruption
## 8 4 Generosity Corruption
## 8 5 Generosity Corruption
## 9 1 Generosity Corruption
## 9 2 Generosity Corruption
## 9 3 Generosity Corruption
## 9 4 Generosity Corruption
## 9 5 Generosity Corruption
## 10 1 Generosity Corruption
## 10 2 Generosity Corruption
## 10 3 Generosity Corruption
## 10 4 Generosity Corruption
## 10 5 Generosity Corruption
## 11 1 Generosity Corruption
## 11 2 Generosity Corruption
## 11 3 Generosity Corruption
## 11 4 Generosity Corruption
## 11 5 Generosity Corruption
## 12 1 Generosity Corruption
## 12 2 Generosity Corruption
## 12 3 Generosity Corruption
## 12 4 Generosity Corruption
## 12 5 Generosity Corruption
## 13 1 Generosity Corruption
## 13 2 Generosity Corruption
## 13 3 Generosity Corruption
## 13 4 Generosity Corruption
## 13 5 Generosity Corruption
## 14 1 Generosity Corruption
## 14 2 Generosity Corruption
## 14 3 Generosity Corruption
## 14 4 Generosity Corruption
## 14 5 Generosity Corruption
## 15 1 Generosity Corruption
## 15 2 Generosity Corruption
## 15 3 Generosity Corruption
## 15 4 Generosity Corruption
## 15 5 Generosity Corruption
## 16 1 Generosity Corruption
## 16 2 Generosity Corruption
## 16 3 Generosity Corruption
## 16 4 Generosity Corruption
## 16 5 Generosity Corruption
## 17 1 Generosity Corruption
## 17 2 Generosity Corruption
## 17 3 Generosity Corruption
## 17 4 Generosity Corruption
## 17 5 Generosity Corruption
## 18 1 Generosity Corruption
## 18 2 Generosity Corruption
## 18 3 Generosity Corruption
## 18 4 Generosity Corruption
## 18 5 Generosity Corruption
## 19 1 Generosity Corruption
## 19 2 Generosity Corruption
## 19 3 Generosity Corruption
## 19 4 Generosity Corruption
## 19 5 Generosity Corruption
## 20 1 Generosity Corruption
## 20 2 Generosity Corruption
## 20 3 Generosity Corruption
## 20 4 Generosity Corruption
## 20 5 Generosity Corruption
## 21 1 Generosity Corruption
## 21 2 Generosity Corruption
## 21 3 Generosity Corruption
## 21 4 Generosity Corruption
## 21 5 Generosity Corruption
## 22 1 Generosity Corruption
## 22 2 Generosity Corruption
## 22 3 Generosity Corruption
## 22 4 Generosity Corruption
## 22 5 Generosity Corruption
## 23 1 Generosity Corruption
## 23 2 Generosity Corruption
## 23 3 Generosity Corruption
## 23 4 Generosity Corruption
## 23 5 Generosity Corruption
## 24 1 Generosity Corruption
## 24 2 Generosity Corruption
## 24 3 Generosity Corruption
## 24 4 Generosity Corruption
## 24 5 Generosity Corruption
## 25 1 Generosity Corruption
## 25 2 Generosity Corruption
## 25 3 Generosity Corruption
## 25 4 Generosity Corruption
## 25 5 Generosity Corruption
## 26 1 Generosity Corruption
## 26 2 Generosity Corruption
## 26 3 Generosity Corruption
## 26 4 Generosity Corruption
## 26 5 Generosity Corruption
## 27 1 Generosity Corruption
## 27 2 Generosity Corruption
## 27 3 Generosity Corruption
## 27 4 Generosity Corruption
## 27 5 Generosity Corruption
## 28 1 Generosity Corruption
## 28 2 Generosity Corruption
## 28 3 Generosity Corruption
## 28 4 Generosity Corruption
## 28 5 Generosity Corruption
## 29 1 Generosity Corruption
## 29 2 Generosity Corruption
## 29 3 Generosity Corruption
## 29 4 Generosity Corruption
## 29 5 Generosity Corruption
## 30 1 Generosity Corruption
## 30 2 Generosity Corruption
## 30 3 Generosity Corruption
## 30 4 Generosity Corruption
## 30 5 Generosity Corruption
## 31 1 Generosity Corruption
## 31 2 Generosity Corruption
## 31 3 Generosity Corruption
## 31 4 Generosity Corruption
## 31 5 Generosity Corruption
## 32 1 Generosity Corruption
## 32 2 Generosity Corruption
## 32 3 Generosity Corruption
## 32 4 Generosity Corruption
## 32 5 Generosity Corruption
## 33 1 Generosity Corruption
## 33 2 Generosity Corruption
## 33 3 Generosity Corruption
## 33 4 Generosity Corruption
## 33 5 Generosity Corruption
## 34 1 Generosity Corruption
## 34 2 Generosity Corruption
## 34 3 Generosity Corruption
## 34 4 Generosity Corruption
## 34 5 Generosity Corruption
## 35 1 Generosity Corruption
## 35 2 Generosity Corruption
## 35 3 Generosity Corruption
## 35 4 Generosity Corruption
## 35 5 Generosity Corruption
## 36 1 Generosity Corruption
## 36 2 Generosity Corruption
## 36 3 Generosity Corruption
## 36 4 Generosity Corruption
## 36 5 Generosity Corruption
## 37 1 Generosity Corruption
## 37 2 Generosity Corruption
## 37 3 Generosity Corruption
## 37 4 Generosity Corruption
## 37 5 Generosity Corruption
## 38 1 Generosity Corruption
## 38 2 Generosity Corruption
## 38 3 Generosity Corruption
## 38 4 Generosity Corruption
## 38 5 Generosity Corruption
## 39 1 Generosity Corruption
## 39 2 Generosity Corruption
## 39 3 Generosity Corruption
## 39 4 Generosity Corruption
## 39 5 Generosity Corruption
## 40 1 Generosity Corruption
## 40 2 Generosity Corruption
## 40 3 Generosity Corruption
## 40 4 Generosity Corruption
## 40 5 Generosity Corruption
## 41 1 Generosity Corruption
## 41 2 Generosity Corruption
## 41 3 Generosity Corruption
## 41 4 Generosity Corruption
## 41 5 Generosity Corruption
## 42 1 Generosity Corruption
## 42 2 Generosity Corruption
## 42 3 Generosity Corruption
## 42 4 Generosity Corruption
## 42 5 Generosity Corruption
## 43 1 Generosity Corruption
## 43 2 Generosity Corruption
## 43 3 Generosity Corruption
## 43 4 Generosity Corruption
## 43 5 Generosity Corruption
## 44 1 Generosity Corruption
## 44 2 Generosity Corruption
## 44 3 Generosity Corruption
## 44 4 Generosity Corruption
## 44 5 Generosity Corruption
## 45 1 Generosity Corruption
## 45 2 Generosity Corruption
## 45 3 Generosity Corruption
## 45 4 Generosity Corruption
## 45 5 Generosity Corruption
## 46 1 Generosity Corruption
## 46 2 Generosity Corruption
## 46 3 Generosity Corruption
## 46 4 Generosity Corruption
## 46 5 Generosity Corruption
## 47 1 Generosity Corruption
## 47 2 Generosity Corruption
## 47 3 Generosity Corruption
## 47 4 Generosity Corruption
## 47 5 Generosity Corruption
## 48 1 Generosity Corruption
## 48 2 Generosity Corruption
## 48 3 Generosity Corruption
## 48 4 Generosity Corruption
## 48 5 Generosity Corruption
## 49 1 Generosity Corruption
## 49 2 Generosity Corruption
## 49 3 Generosity Corruption
## 49 4 Generosity Corruption
## 49 5 Generosity Corruption
## 50 1 Generosity Corruption
## 50 2 Generosity Corruption
## 50 3 Generosity Corruption
## 50 4 Generosity Corruption
## 50 5 Generosity Corruption
## Warning: Number of logged events: 500
tempData$imp$Generosity
## 1 2 3 4 5
## 13 -0.1525292 -0.02346227 -0.15622631 -0.05504196 -0.05504196
## 49 -0.1550480 -0.15504800 0.09636896 0.05352156 0.06957325
tempData$imp$Corruption
## 1 2 3 4 5
## 10 0.8637236 0.8392525 0.8237830 0.8086381 0.7548538
## 16 0.8048785 0.9429607 0.8237830 0.8048785 0.9099343
## 30 0.8369310 0.7034398 0.8517451 0.9066503 0.8084230
## 47 0.9211702 0.7968259 0.8513823 0.8878248 0.8878248
## 61 0.7099283 0.5230606 0.3705578 0.7701348 0.4042760
Inspecting the distribution of original and imputed data
#Let's compare the distributions of original and imputed data using a some useful plots. We can use a scatterplot and plot Corruption against other variables(random select from the dataset).What we would like to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed "plausible values".
xyplot(tempData,Corruption ~ Generosity+LogGDP+Climate+Freedom+Happiness+Pollution ,pch=17,cex=1.2)
completedData <- complete(tempData,1)
head(completedData)
## LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 1 68.8 developing 122.49 55.00 37.45 69.39
## 2 73.6 developed 191.13 122.85 57.24 76.38
## 3 73.0 developed 191.05 96.70 78.63 79.19
## 4 66.1 developing 141.47 44.64 76.47 58.01
## 5 72.0 developed 162.09 95.09 57.54 79.44
## 6 66.4 developing 100.33 37.43 29.76 54.64
## LivingCost PPIR Traffic Pollution Climate Happiness LogGDP
## 1 33.09 19.02 43.04 52.35 98.28 5.792797 9.809972
## 2 72.08 7.60 35.29 23.97 93.75 7.176993 10.721021
## 3 71.79 10.14 25.15 21.97 77.74 7.396002 10.741893
## 4 33.12 14.27 28.78 40.97 64.37 5.233770 9.778739
## 5 72.97 7.18 34.86 49.89 86.03 6.892172 10.672445
## 6 42.80 18.72 43.37 57.72 95.35 6.190922 9.557933
## SocialSuppor Freedom Generosity Corruption
## 1 0.8999116 0.8458947 -0.20693663 0.8552552
## 2 0.9401373 0.9160281 0.13779540 0.4046475
## 3 0.9116681 0.9041120 0.05155233 0.5230606
## 4 0.9045693 0.6436024 -0.18186484 0.7184555
## 5 0.9298155 0.8083866 -0.12727834 0.6304118
## 6 0.8815053 0.7506090 -0.12632652 0.7632513
dim(completedData)
## [1] 64 17
md.pattern(completedData)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## LifeExpectancy Status LifeQuality PurchaPower Safety HealthCare
## 64 1 1 1 1 1 1
## 0 0 0 0 0 0
## LivingCost PPIR Traffic Pollution Climate Happiness LogGDP SocialSuppor
## 64 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## Freedom Generosity Corruption
## 64 1 1 1 0
## 0 0 0 0
Table 1 Summary of Response Variable(Life Expectancy)
summary(completedData$LifeExpectancy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.50 66.80 68.70 68.96 72.40 76.80
Table 2 Variables Description
| Variables | Discription |
|---|---|
| Happiness | a measure of life satisfaction |
| PurchasPowe | Purchasing Power Index |
| SocialSuppor | SocialSupport |
| Safety | Safety Index |
| Freedom | Freedom |
| HealthCare | HealthCare Index |
| Status | levels: developing&developed |
| Corruption | Perception of Corruption |
| LivingCost | Cost of Living Index |
| Generosity | Generosity |
| PPIR | Property Price to Income Ratio |
| LogGDP | Log of GDP per capita |
| Traffic | Traffic Commute Time Index |
| LifeQuality | Quality of Life Index |
| Pollution | Pollution Index |
| Climate | Climate Index |
Status is the only categorical predictors, other predictors are numerical. The plot of life expectancy distribution group by status shows that there’s strong relationship between life expectancy and country developing status. The box plot of life expectancy with two different statuses displays a higher mean and median of developed countries than developing countries. It indicates that, in general, the life expectancy of developed countries is much longer than the developing country.
Box Plot & Histogram of Life Expectancy Group By Country Status Box plot of Potential Predictors Group By Life Expectancy Greater/Less Than 65 (Safety, Healthcare, GDP,Traffic)
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 1.0.0 v dplyr 0.8.3
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.3 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::complete() masks mice::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot(data=completedData,aes(x=LifeExpectancy,fill=Status))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=completedData,aes(x=Status,y=LifeExpectancy,fill=Status))+geom_boxplot(outlier.size = 2,outlier.shape = 2)
for (i in 3:17 ) {
plot=ggplot(data=completedData,aes(y=completedData[,i]))+geom_boxplot()+xlab(names(completedData[i]))
print(plot)
}
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
which(is_outlier(completedData$LivingCost))
## [1] 57
quality[57,c(1,7)]
## Country LivingCost
## 57 Switzerland 121.16
which(is_outlier(completedData$PPIR))
## [1] 10 43 55 58
quality[c(10,43,55,58),c(1,8)]
## Country PPIR
## 10 China 29.09
## 43 Philippines 21.83
## 55 Sri Lanka 26.32
## 58 Thailand 21.94
subset(quality[,c(1,16,12,8,9,5,11,14)],quality$Country=='United States')
## Country lifeExpectancy Status PPIR Traffic Safety Climate
## 63 United States 68.3 developed 3.58 32.87 52.87 77.51
## LogGDP
## 63 10.92247
# Based on the scatterplot of LifeExpectancy to each predictors, we suppose that these Life quality index and healthcare, econimic status are related to the life expectancy. Let's do the regression to see if it is correct.
dat=completedData
dat$greater65=0
dat$greater65[which(dat$LifeExpectancy>=65)]='greater than 65'
dat$greater65[which(dat$LifeExpectancy<65)]='less than 65'
for (i in 3:17 ) {
plot=ggplot(data=dat,aes(x=greater65,y=completedData[,i],fill=greater65))+geom_boxplot()+ylab(names(completedData[i]))+xlab('life expectancy')
print(plot)
}
The statistical technique used in this study is multiple linear regression. It is a method to find the best fitting line to a set of data with some quantitative explanatory variables X and a quantitative response variable Y and examining the slope of the regression line. Simple linear regression allows us to summarize the relationship between two quantitative variables. Multiple linear regression is with more than two predictors.
Train and test is used to evaluate the model performance. The completed data is randomly splitted to two part: 75% of the total observation are the train set and the left 25% of the total observations are validation set. The training data set is used to develop a number of regression models, while the test data set is used to evaluate the performance of these models.
Train Model:70% training / 30% validation set split of the data set
set.seed(222)
train=sample(length(completedData$LifeExpectancy),floor(0.70*length(completedData$LifeExpectancy)),replace=FALSE)
length(train)
## [1] 44
#train=c(60, 5, 31 , 1, 55, 57, 21, 24, 33, 22 , 4 , 6, 36, 48, 3, 32, 49, 10, 38, 12, 63 ,54,45 ,44 ,28, 18, 16, 50 ,42, 23, 61 ,17, 51, 39 ,20, 11, 46, 29, 19, 25,62 ,26, 15 ,41, 27)
#train2=c(60, 5, 31 , 1, 55, 57, 21, 24, 33, 22 , 4 , 6, 36, 48, 3, 32, 49, 10, 38, 12, 63 ,54,45 ,44 ,28, 18, 16, 50 ,42, 23, 61 ,17, 51, 39 ,20, 11, 46, 29, 19, 25,62 ,26, 15 ,41, 27,52,30,8)
# Check if there is any collinearity,from the output, we can see there are several strong correlations between the predictors.
#train=subset(train, train!=52)
train
## [1] 61 62 15 18 23 24 22 52 9 10 41 50 33 59 46 26 7 54 44 13 16 37 56
## [24] 1 5 58 42 53 40 4 14 47 63 27 29 39 48 34 60 20 35 12 25 30
library(faraway)
## Warning: package 'faraway' was built under R version 3.6.2
##
## Attaching package: 'faraway'
## The following object is masked from 'package:mice':
##
## mammalsleep
## The following object is masked from 'package:lattice':
##
## melanoma
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 3.0-2
library(car)
## Warning: package 'car' was built under R version 3.6.2
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
## logit, vif
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
round(cor(completedData[, -c(1,2)]), 2)
## LifeQuality PurchaPower Safety HealthCare LivingCost PPIR
## LifeQuality 1.00 0.86 0.59 0.58 0.78 -0.60
## PurchaPower 0.86 1.00 0.38 0.59 0.81 -0.54
## Safety 0.59 0.38 1.00 0.27 0.38 -0.18
## HealthCare 0.58 0.59 0.27 1.00 0.58 -0.08
## LivingCost 0.78 0.81 0.38 0.58 1.00 -0.38
## PPIR -0.60 -0.54 -0.18 -0.08 -0.38 1.00
## Traffic -0.61 -0.38 -0.43 -0.06 -0.37 0.48
## Pollution -0.90 -0.70 -0.47 -0.50 -0.68 0.41
## Climate 0.15 -0.03 -0.14 0.08 0.17 -0.07
## Happiness 0.77 0.74 0.30 0.53 0.73 -0.39
## LogGDP 0.81 0.83 0.42 0.46 0.77 -0.48
## SocialSuppor 0.59 0.47 0.16 0.38 0.54 -0.28
## Freedom 0.35 0.43 0.14 0.33 0.31 -0.05
## Generosity 0.12 0.27 -0.16 0.25 0.25 -0.07
## Corruption -0.64 -0.68 -0.28 -0.43 -0.70 0.24
## Traffic Pollution Climate Happiness LogGDP SocialSuppor
## LifeQuality -0.61 -0.90 0.15 0.77 0.81 0.59
## PurchaPower -0.38 -0.70 -0.03 0.74 0.83 0.47
## Safety -0.43 -0.47 -0.14 0.30 0.42 0.16
## HealthCare -0.06 -0.50 0.08 0.53 0.46 0.38
## LivingCost -0.37 -0.68 0.17 0.73 0.77 0.54
## PPIR 0.48 0.41 -0.07 -0.39 -0.48 -0.28
## Traffic 1.00 0.53 -0.04 -0.53 -0.43 -0.48
## Pollution 0.53 1.00 -0.06 -0.75 -0.75 -0.69
## Climate -0.04 -0.06 1.00 0.04 -0.06 0.03
## Happiness -0.53 -0.75 0.04 1.00 0.78 0.74
## LogGDP -0.43 -0.75 -0.06 0.78 1.00 0.65
## SocialSuppor -0.48 -0.69 0.03 0.74 0.65 1.00
## Freedom -0.19 -0.38 -0.28 0.50 0.28 0.28
## Generosity 0.08 -0.12 -0.11 0.24 0.11 0.04
## Corruption 0.19 0.67 0.03 -0.67 -0.59 -0.41
## Freedom Generosity Corruption
## LifeQuality 0.35 0.12 -0.64
## PurchaPower 0.43 0.27 -0.68
## Safety 0.14 -0.16 -0.28
## HealthCare 0.33 0.25 -0.43
## LivingCost 0.31 0.25 -0.70
## PPIR -0.05 -0.07 0.24
## Traffic -0.19 0.08 0.19
## Pollution -0.38 -0.12 0.67
## Climate -0.28 -0.11 0.03
## Happiness 0.50 0.24 -0.67
## LogGDP 0.28 0.11 -0.59
## SocialSuppor 0.28 0.04 -0.41
## Freedom 1.00 0.41 -0.53
## Generosity 0.41 1.00 -0.36
## Corruption -0.53 -0.36 1.00
Before training the model, a correlation matrix is used to check if there’s any collinearity between the potential predictors. Predictors with the variance inflation factor (VIF=1/(1-\(R^2\)), is the percent of variation in \(X_j\) explained by the other predictors) exceeding 5 are removed.
From Correlation Matrix table, it indicates that there’s strong correlations between predictors: life quality index, purchase power, living cost, pollution, happiness, logGDP. These predictors are removed except logGDP. Even though LogGDP seems to have strong correlation with lots of predictors, it is an important economic factor and when some of the other correlated predictors of are removed, the correlations decreased. Surprisingly, there’s no strong relations between healthcare and LogGDP. Also the property price to income ratio is not strongly related to the logGDP. That means people in some rich regions are suffering a bad healthcare and people in some of the less rich countries are paying high housing price compare to their income.
The model shows signs of multicollinearity. The overall F statistic is large and R2 = 0.76, but only Status are significant.
lm_mod1=lm(LifeExpectancy~.,data=completedData,subset = train)
summary(lm_mod1)
##
## Call:
## lm(formula = LifeExpectancy ~ ., data = completedData, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8015 -0.9826 0.0481 1.0661 2.6643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5112.5732 5862.6448 -0.872 0.3909
## Statusdeveloping -1.6619 1.1586 -1.434 0.1629
## LifeQuality 51.3379 58.6660 0.875 0.3892
## PurchaPower -20.5658 23.4699 -0.876 0.3886
## Safety -25.6110 29.3404 -0.873 0.3904
## HealthCare -20.5122 23.4677 -0.874 0.3898
## LivingCost 5.2159 5.8645 0.889 0.3816
## PPIR 51.5902 58.6575 0.880 0.3869
## Traffic 25.5514 29.3273 0.871 0.3913
## Pollution 34.2595 39.1047 0.876 0.3887
## Climate -17.0477 19.5584 -0.872 0.3911
## Happiness 0.5177 0.8372 0.618 0.5415
## LogGDP 3.0033 1.2545 2.394 0.0239 *
## SocialSuppor 1.7490 6.6575 0.263 0.7948
## Freedom 0.9434 3.9111 0.241 0.8112
## Generosity -2.9654 2.1707 -1.366 0.1832
## Corruption -0.6493 2.5741 -0.252 0.8028
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872 on 27 degrees of freedom
## Multiple R-squared: 0.8828, Adjusted R-squared: 0.8133
## F-statistic: 12.71 on 16 and 27 DF, p-value: 1.15e-08
round(vif(lm_mod1), 2)
## Status LifeQuality PurchaPower Safety HealthCare
## 4.14 40275392.49 5889297.08 1518832.89 656250.08
## LivingCost PPIR Traffic Pollution Climate
## 149149.85 1031694.64 463867.07 7755817.16 943373.07
## Happiness LogGDP SocialSuppor Freedom Generosity
## 8.71 7.86 3.87 2.32 1.58
## Corruption
## 4.02
#Several of the variance in action factors are large and exceed the 5 cutting.
lm_mod2=lm(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData,subset = train)
summary(lm_mod2)
##
## Call:
## lm(formula = LifeExpectancy ~ . - LifeQuality - PurchaPower -
## LivingCost - Happiness, data = completedData, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0220 -1.0514 -0.0337 1.1516 3.6639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89559 11.58821 0.940 0.354368
## Statusdeveloping -1.73906 1.14406 -1.520 0.138627
## Safety 0.07800 0.03612 2.159 0.038687 *
## HealthCare 0.04426 0.03954 1.119 0.271613
## PPIR 0.28456 0.07047 4.038 0.000329 ***
## Traffic -0.13128 0.07089 -1.852 0.073579 .
## Pollution 0.05354 0.03216 1.665 0.106032
## Climate 0.08732 0.02832 3.083 0.004281 **
## LogGDP 3.78758 0.91886 4.122 0.000260 ***
## SocialSuppor 6.95486 6.07943 1.144 0.261381
## Freedom 0.48396 3.69338 0.131 0.896595
## Generosity -1.89935 2.23060 -0.852 0.401022
## Corruption -2.96327 2.28687 -1.296 0.204618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.963 on 31 degrees of freedom
## Multiple R-squared: 0.8519, Adjusted R-squared: 0.7946
## F-statistic: 14.86 on 12 and 31 DF, p-value: 1.171e-09
round(vif(lm_mod2), 2)
## Status Safety HealthCare PPIR Traffic
## 3.67 2.09 1.69 1.35 2.46
## Pollution Climate LogGDP SocialSuppor Freedom
## 4.77 1.80 3.83 2.93 1.88
## Generosity Corruption
## 1.52 2.88
plot(lm_mod2,which=c(1,2))
#lm_mod2=lm(LifeExpectancy~Status+ Safety+PPIR+Climate+LogGDP,data=completedData,subset = train)
#summary(lm_mod2)
#plot(lm_mod2,which=c(1,2))
All potential predictor variables will be included in the regression model at the beginning of Backward elimination. Then, it deletes the predictor variable step by step such that the resulting model has the lowest value of an information criterion. (This amounts to deleting the predictor with the largest p-value each time.) This process is continued until all variables have been deleted from the model or the information criterion increases. The final model is determined by backwards stepwise selection, which has the lowest AIC. AIC is defined to be AIC=2[-log(L(\(\hat{\beta_0},\hat{\beta_1}...\hat{\beta_n},\hat{\sigma^2}\)|Y)+K], where K=p+2. R calculate AIC using AIC=nlog(RSS/n)+2p. (Sheather, S. J.,231-237)
There’s three assumption for multiple linear regression model: there’s a linear relationship between predictors and response (linearity), the variance of residual is the same for any value of X (constant variance), the observations should be independent to each other, the residuals should follow normal distribution. The basic tool for examining the model’s linearity and constant variance is the residuals vs fitted values plot. Residuals are measured as follows: residual = observed y – model-predicted y
From the Residuals vs Fitted Values Plot If the points are randomly spread around 0 line, no obvious fan pattern, the assumption of constant variance is met. If the points are too far away from the 0, the assumption of linearity is not satisfied. QQ-plot is a good technique to evaluate the normality of residuals by comparing the residuals to ‘ideal’ normal observations. Observations lie well along the 45-degree line in the QQ-plot, so we may assume that normality holds.
Another tool to check assumptions of the model is detecting high leverage points and outliers. High leverage points are the data points have x -values that have an unusually large effect on the estimated regression model. Detecting the outliers is to detect the points which do not follow the pattern set by the bulk of the data, when one takes into account the given model. The leverage for point i is quantified by hi , the i th diagonal entry of hat matrix H. A popular rule is to classify xi as a point of high leverage in a simple linear regression model if \(h_i>2* average(h_i)=2(p+1)/n\). And a popular rule for outliers is its standardized residuals falls outside the interval from -2 to 2.
Power transformation is considered in this case, but we decided not to transform the variables because of the limiting improvement of the model and the risk of overfitting.
Backwards stepwise selection applied to the data set.
step_mod=step(lm_mod2, direction='backward',trace=F)
summary(step_mod)
##
## Call:
## lm(formula = LifeExpectancy ~ Status + Safety + PPIR + Traffic +
## Climate + LogGDP, data = completedData, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3996 -1.1764 0.1569 1.2754 4.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.96300 10.16310 1.472 0.14940
## Statusdeveloping -1.72397 0.97151 -1.775 0.08420 .
## Safety 0.07526 0.03146 2.392 0.02196 *
## PPIR 0.29865 0.06793 4.396 8.94e-05 ***
## Traffic -0.10083 0.05580 -1.807 0.07888 .
## Climate 0.09232 0.02666 3.463 0.00136 **
## LogGDP 4.23876 0.74541 5.687 1.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.944 on 37 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.7986
## F-statistic: 29.43 on 6 and 37 DF, p-value: 1.15e-12
plot(step_mod,which=c(1,2))
#check assumptions for the model:linear, constant variance.From the plot, we can see there's non linear relationship. powertransformation may be applied to resolve this issues.
####
There are 3 outliers (points that have standardized residuals out of the interval -2 to 2), and no points are flagged as having high leverage. There are no “bad” leverage points to remove, since none of the high leverage points have standardized residuals outside -2 to 2.
p <- 8
n <- nrow(completedData[train,])
plot(hatvalues(step_mod), rstandard(step_mod),xlab='Leverage', ylab='Standardized Residuals')%>%abline(v = 2*(p+1)/n, lty=2)
The residuals vs fitted values plot shows a little non linearity, but not significant. Most points fall on the line. The residuals are approximately normally distributed. Power transformation is not considered in this case, because the non linearity is not significant and while applying transformation to the data, it may cause overfitting.
Finally the stepwise regression model got adjusted R-squared 0.7986 with the training set which means 79.86% of the variation on life expectancy is explained by the predicting variables. The R Square on test data set is 0.7199692, shows that there is positive relationship between the predicted value and the response variable. 72% of the variation of life expectancy in test set have been explained by the predicting variables. The RMSE obtained by training data is 2.018533, which is much smaller than the standard deviation(4.231). So it is a valid model to predict the life expectancy.
compute_rmse <- function(y, y_pred) {
n <- length(y)
sqrt((1 / n) * sum((y - y_pred)^2))
}
rsquare=function(y,y_pred){
1-sum((y-y_pred)^2)/sum((y-mean(y))^2)
}
adrsquare=function(rsquare,k){
1-((1-rsquare)*(17-1)/(17-k-1))
}
lm_pred=predict(step_mod,newdata = completedData[-train,])
#calculate the RMSE and R square for logY
compute_rmse(completedData$LifeExpectancy[-train], lm_pred)
## [1] 2.018533
rsquare(completedData$LifeExpectancy[-train], lm_pred)
## [1] 0.7199692
y_train=completedData[train,'LifeExpectancy']
y_test=completedData[-train,'LifeExpectancy']
x_train=model.matrix(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData)[train, -1]
x_test=model.matrix(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData)[-train,-1]
lm_fit=lm(LifeExpectancy~.-LifeQuality-PurchaPower-LivingCost-Happiness,data=completedData,subset=train)
summary(lm_fit)
##
## Call:
## lm(formula = LifeExpectancy ~ . - LifeQuality - PurchaPower -
## LivingCost - Happiness, data = completedData, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0220 -1.0514 -0.0337 1.1516 3.6639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89559 11.58821 0.940 0.354368
## Statusdeveloping -1.73906 1.14406 -1.520 0.138627
## Safety 0.07800 0.03612 2.159 0.038687 *
## HealthCare 0.04426 0.03954 1.119 0.271613
## PPIR 0.28456 0.07047 4.038 0.000329 ***
## Traffic -0.13128 0.07089 -1.852 0.073579 .
## Pollution 0.05354 0.03216 1.665 0.106032
## Climate 0.08732 0.02832 3.083 0.004281 **
## LogGDP 3.78758 0.91886 4.122 0.000260 ***
## SocialSuppor 6.95486 6.07943 1.144 0.261381
## Freedom 0.48396 3.69338 0.131 0.896595
## Generosity -1.89935 2.23060 -0.852 0.401022
## Corruption -2.96327 2.28687 -1.296 0.204618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.963 on 31 degrees of freedom
## Multiple R-squared: 0.8519, Adjusted R-squared: 0.7946
## F-statistic: 14.86 on 12 and 31 DF, p-value: 1.171e-09
rmse=function(y,y_pred){
n=length(y)
sqrt((1/n)*sum((y-y_pred)^2))
}
lm_pred=predict(lm_fit,newdata =completedData[-train,-1] )
a=paste('OLS model RMSE',rmse(y_test,lm_pred));a
## [1] "OLS model RMSE 1.92312653913355"
lm_step=step(lm_fit,trace = F)
summary(lm_step)
##
## Call:
## lm(formula = LifeExpectancy ~ Status + Safety + PPIR + Traffic +
## Climate + LogGDP, data = completedData, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3996 -1.1764 0.1569 1.2754 4.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.96300 10.16310 1.472 0.14940
## Statusdeveloping -1.72397 0.97151 -1.775 0.08420 .
## Safety 0.07526 0.03146 2.392 0.02196 *
## PPIR 0.29865 0.06793 4.396 8.94e-05 ***
## Traffic -0.10083 0.05580 -1.807 0.07888 .
## Climate 0.09232 0.02666 3.463 0.00136 **
## LogGDP 4.23876 0.74541 5.687 1.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.944 on 37 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.7986
## F-statistic: 29.43 on 6 and 37 DF, p-value: 1.15e-12
lm_step_pred=predict(lm_step, newdata= completedData[-train,-1])
b=paste('OLS step RMSE',rmse(y_test,lm_step_pred), 'and 6 predictors(excluding the intercept) were retained in the model fit using the training data');b
## [1] "OLS step RMSE 2.01853320494638 and 6 predictors(excluding the intercept) were retained in the model fit using the training data"
Lasso and Ridge Regularization will be applied to the model and then compare to the stepwise model
library(glmnet)
set.seed(3)
ridge_fit=cv.glmnet(x_train,y_train, alpha=0)
ridge_fit$lambda.min
## [1] 0.7186131
coef(ridge_fit, s="lambda.min")
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 26.74503810
## Statusdeveloping -1.93511363
## Safety 0.06884993
## HealthCare 0.04400887
## PPIR 0.21508465
## Traffic -0.09224277
## Pollution 0.01365314
## Climate 0.06570276
## LogGDP 2.54386487
## SocialSuppor 6.82429889
## Freedom 0.31773772
## Generosity -1.81318963
## Corruption -1.75265931
ridge_pred=predict(ridge_fit,newx=x_test, s='lambda.min')
ridge_pred=as.numeric(ridge_pred)
c=paste('ridge RMSE',rmse(y_test,ridge_pred),'and 12 predictors(excluding the intercept) were retained in the model fit using the training data');c
## [1] "ridge RMSE 1.96101639158768 and 12 predictors(excluding the intercept) were retained in the model fit using the training data"
set.seed(10)
lasso_fit=cv.glmnet(x_train,y_train, alpha=1)
lasso_fit$lambda.min
## [1] 0.01548205
coef(lasso_fit,s='lambda.min')
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 12.25266188
## Statusdeveloping -1.70095023
## Safety 0.07661894
## HealthCare 0.04275590
## PPIR 0.27946826
## Traffic -0.12446662
## Pollution 0.04517256
## Climate 0.08607239
## LogGDP 3.74995355
## SocialSuppor 6.36571026
## Freedom 0.24991505
## Generosity -1.69604511
## Corruption -2.61802234
sum(abs(as.numeric(coef(lasso_fit,s='lambda.min')))>0)
## [1] 13
lasso_pred=predict(lasso_fit,newx=x_test,s='lambda.min')
lasso_pred=as.numeric(lasso_pred)
d=paste('lasso RMSE', rmse(y_test,lasso_pred),' and 12 predictors were retained (including the intercept) in the model fit using the training data');d
## [1] "lasso RMSE 1.91695047281746 and 12 predictors were retained (including the intercept) in the model fit using the training data"
So ordinary least squares R2 for test set is 0.7458 and RMSE is 1.923. The stepwise model R2 for test set is 0.72 and RMSE is 2.0185 . Lasso model R2 for test set is 0.7474 and RMSE is 1.91. The ridge R2 for test set is 0.7357 and RMSE is 1.96. All models predict life expectancy with fair accuracy. Lasso performs best and stepwise model shows less performance on this case.
rsquare=function(y,y_pred){
1-sum((y-y_pred)^2)/sum((y-mean(y))^2)
}
paste('OLS test R square', rsquare(y_test,lm_pred))
## [1] "OLS test R square 0.745815122180905"
paste('OLS stepwise test R square',rsquare(y_test,lm_step_pred))
## [1] "OLS stepwise test R square 0.719969211570034"
paste('Ridge test R square',rsquare(y_test,ridge_pred))
## [1] "Ridge test R square 0.735700443187198"
paste('Lasso test R square',rsquare(y_test,lasso_pred))
## [1] "Lasso test R square 0.747445115678852"
a;b;c;d
## [1] "OLS model RMSE 1.92312653913355"
## [1] "OLS step RMSE 2.01853320494638 and 6 predictors(excluding the intercept) were retained in the model fit using the training data"
## [1] "ridge RMSE 1.96101639158768 and 12 predictors(excluding the intercept) were retained in the model fit using the training data"
## [1] "lasso RMSE 1.91695047281746 and 12 predictors were retained (including the intercept) in the model fit using the training data"
A country’s life expectancy mostly depends on the economic factors not social factors and healthcare. People in countries with higher economic status tend to has a longer life expectancy. Climate and safety is non-economic factors that affect the life expectancy, which makes us interested in relationship between climate and life expectancy. To improve the life expectancy of a country, is important to improve the economics. It makes sense because when the economics improved, the country has money to improve the healthcare, environment and many other aspect which are relevant to life expectancy. People would pay more attention to their health.
Multiple linear regression might not be the best method for prediction. There might be better method for predicting life expectancy such as LASSO and ridge regression, regression trees and random forest. They might use more predicting variable and provide more precise predictions. Maybe in the future we’ll gather climate information with each region and discuss the role it plays in life expectancy. Also, we’ll try to find more data about country’s economic factors, social factors, environmental factors and health related factors in different years to predict the life expectancy with certain years.