Background
The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website, with the help of Deeksha Russell and Duan Wang.
Purpose of the project :
- Understand the relationship between “Life Expectancy” and its predictors based on historical data.
- Learn to use a linear regression model to predict “Life Expectancy” based on the dataset.
Explanation on “Life Expectancy” data :
- Country = Country Observed.
- Year = Year Observed.
- Status = Developed or Developing status.
- Life.expectancy = Life Expectancy in years.
- Adult.Mortality = Adult Mortality Rates on both sexes (probability of dying between 15-60 years/1000 population).
- infant.deaths = Number of Infant Deaths per 1000 population.
- Alcohol = Alcohol recorded per capita (15+) consumption (in litres of pure alcohol).
- percentage.expenditure = Expenditure on health as a percentage of Gross Domestic Product per capita (%).
- Hepatitis.B = Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
- Measles = Number of reported Measles cases per 1000 population.
- BMI = Average Body Mass Index of entire population.
- under.five.deaths = Number of under-five deaths per 1000 population.
- Polio = Polio (Pol3) immunization coverage among 1-year-olds (%).
- Total.expenditure = General government expenditure on health as a percentage of total government expenditure (%).
- Diphtheria = Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%).
- HIV.AIDS = Deaths per 1 000 live births HIV/AIDS (0-4 years).
- GDP = Gross Domestic Product per capita (in USD).
- Population = Population of the country.
- thinness..1.19.years = Prevalence of thinness among children and adolescents for Age 10 to 19 (%).
- thinness.5.9.years = Prevalence of thinness among children for Age 5 to 9 (%).
- Income.composition.of.resources = Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
- Schooling = Number of years of Schooling (years).
Data Preparation
Load required libraries.
# Load libraries
library(caret)
library(GGally)
library(car)
library(lmtest)
library(rmarkdown)
library(dplyr)
options(scipen = 100, max.print = 1e+06)
Load the dataset.
# Load data
le <- read.csv("assets/le.csv")
# Show data as table
paged_table(le)
Check the structure of the data frame
# Check structure
le %>% glimpse()
## Rows: 2,938
## Columns: 22
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghani…
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 20…
## $ Status <chr> "Developing", "Developing", "Developin…
## $ Life.expectancy <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58…
## $ Adult.Mortality <int> 263, 271, 268, 272, 275, 279, 281, 287…
## $ infant.deaths <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84…
## $ Alcohol <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.…
## $ percentage.expenditure <dbl> 71.279624, 73.523582, 73.219243, 78.18…
## $ Hepatitis.B <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64…
## $ Measles <int> 1154, 492, 430, 2787, 3013, 1989, 2861…
## $ BMI <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16…
## $ under.five.deaths <int> 83, 86, 89, 93, 97, 102, 106, 110, 113…
## $ Polio <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,…
## $ Total.expenditure <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.…
## $ Diphtheria <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58…
## $ HIV.AIDS <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1…
## $ GDP <dbl> 584.25921, 612.69651, 631.74498, 669.9…
## $ Population <dbl> 33736494, 327582, 31731688, 3696958, 2…
## $ thinness..1.19.years <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18…
## $ thinness.5.9.years <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18…
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4…
## $ Schooling <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8…
# Update to categorical
le <- le %>%
  mutate_at(vars(Country, Year, Status), as.factor)
Missing (NA) values in our data frame
# Check proportion of missing data
table(is.na(le))
##
## FALSE TRUE
## 62073 2563
le <- le %>% na.omit()
le %>% is.na() %>% colSums()
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 0 0
## Alcohol percentage.expenditure
## 0 0
## Hepatitis.B Measles
## 0 0
## BMI under.five.deaths
## 0 0
## Polio Total.expenditure
## 0 0
## Diphtheria HIV.AIDS
## 0 0
## GDP Population
## 0 0
## thinness..1.19.years thinness.5.9.years
## 0 0
## Income.composition.of.resources Schooling
## 0 0
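Only about 4% of all cells are missing (2,563 of 64,636), so the incomplete observations are dropped with na.omit(). Note that na.omit() removes every row that contains at least one NA, so the row count falls from 2,938 to 1,649 (see the “Status” counts in the summary below).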
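The difference between the share of missing cells and the share of rows that na.omit() drops can be checked directly. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration, not taken from the dataset):

```r
# Toy data frame (hypothetical values) with NAs spread over different rows
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, 2, 3, 4, 5),
  c = c(1, 2, 3, NA, 5)
)

# Share of missing cells across the whole data frame
cell_na_share <- mean(is.na(toy))

# Share of rows that contain at least one NA (these are dropped by na.omit())
row_na_share <- mean(!complete.cases(toy))

# Rows remaining after na.omit()
rows_left <- nrow(na.omit(toy))

print(c(cells = cell_na_share, rows = row_na_share))
```

Because each NA sits in a different row here, a small share of missing cells turns into a much larger share of dropped rows. Running the same two lines on `le` before calling na.omit() would show how much data the deletion costs.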
Take a look at the data summary
le %>% summary()
## Country Year Status Life.expectancy
## Afghanistan: 16 2014 :131 Developed : 242 Min. :44.0
## Albania : 16 2011 :130 Developing:1407 1st Qu.:64.4
## Armenia : 15 2013 :130 Median :71.7
## Austria : 15 2012 :129 Mean :69.3
## Belarus : 15 2010 :128 3rd Qu.:75.0
## Belgium : 15 2009 :126 Max. :89.0
## (Other) :1557 (Other):875
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.010 Min. : 0.00
## 1st Qu.: 77.0 1st Qu.: 1.00 1st Qu.: 0.810 1st Qu.: 37.44
## Median :148.0 Median : 3.00 Median : 3.790 Median : 145.10
## Mean :168.2 Mean : 32.55 Mean : 4.533 Mean : 698.97
## 3rd Qu.:227.0 3rd Qu.: 22.00 3rd Qu.: 7.340 3rd Qu.: 509.39
## Max. :723.0 Max. :1600.00 Max. :17.870 Max. :18961.35
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.00 Min. : 0 Min. : 2.00 Min. : 0.00
## 1st Qu.:74.00 1st Qu.: 0 1st Qu.:19.50 1st Qu.: 1.00
## Median :89.00 Median : 15 Median :43.70 Median : 4.00
## Mean :79.22 Mean : 2224 Mean :38.13 Mean : 44.22
## 3rd Qu.:96.00 3rd Qu.: 373 3rd Qu.:55.80 3rd Qu.: 29.00
## Max. :99.00 Max. :131441 Max. :77.10 Max. :2100.00
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.740 Min. : 2.00 Min. : 0.100
## 1st Qu.:81.00 1st Qu.: 4.410 1st Qu.:82.00 1st Qu.: 0.100
## Median :93.00 Median : 5.840 Median :92.00 Median : 0.100
## Mean :83.56 Mean : 5.956 Mean :84.16 Mean : 1.984
## 3rd Qu.:97.00 3rd Qu.: 7.470 3rd Qu.:97.00 3rd Qu.: 0.700
## Max. :99.00 Max. :14.390 Max. :99.00 Max. :50.600
##
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. : 34 Min. : 0.100
## 1st Qu.: 462.15 1st Qu.: 191897 1st Qu.: 1.600
## Median : 1592.57 Median : 1419631 Median : 3.000
## Mean : 5566.03 Mean : 14653626 Mean : 4.851
## 3rd Qu.: 4718.51 3rd Qu.: 7658972 3rd Qu.: 7.100
## Max. :119172.74 Max. :1293859294 Max. :27.200
##
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.100 Min. :0.0000 Min. : 4.20
## 1st Qu.: 1.700 1st Qu.:0.5090 1st Qu.:10.30
## Median : 3.200 Median :0.6730 Median :12.30
## Mean : 4.908 Mean :0.6316 Mean :12.12
## 3rd Qu.: 7.100 3rd Qu.:0.7510 3rd Qu.:14.00
## Max. :28.200 Max. :0.9360 Max. :20.70
##
Exploratory Data Analysis
EDA (exploratory data analysis) is the phase in which we explore the variables to find patterns and insights in each of them. It also lets us identify correlations between variables.
Check target data distribution
boxplot(le$Life.expectancy, ylab = "Life Expectancy (Age)")
💡 Insight :
- “Life.expectancy” has many outlier values. Remember : Regression models can be sensitive to outlier values.
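A boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the hinges. As a minimal sketch, the same rule can be applied numerically with `boxplot.stats()`. The vector `x` below holds hypothetical values, not values from the dataset:

```r
# Hypothetical life-expectancy-like values with one low and one high extreme
x <- c(44, 62, 65, 68, 70, 71, 72, 73, 75, 76, 78, 95)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws
stats <- boxplot.stats(x)

# Values flagged as outliers
out_values <- stats$out

# Lower and upper hinges used for the rule
hinges <- stats$stats[c(2, 4)]

print(out_values)
print(hinges)
```

Applied to `le$Life.expectancy`, `boxplot.stats(le$Life.expectancy)$out` would list the points that the plot above draws as individual dots.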
Check the correlation between variables
ggcorr(le,
       label = TRUE,
       label_size = 3,
       hjust = 1,
       layout.exp = 10)
💡 Insight :
- “Schooling” and “Income.composition.of.resources” are the predictors most strongly correlated with “Life.expectancy”. On the other hand, “Life.expectancy” is negatively correlated with “Adult.Mortality”, which makes sense: when the adult mortality rate is high, life expectancy will be low.
- “Life.expectancy” has weak correlation with “Population”, “Measles” and “infant.deaths”.
- There are 3 pairs of variables with strong correlation :
- “thinness.5.9.years” and “thinness..1.19.years”.
- “GDP” and “percentage.expenditure”.
- “infant.deaths” and “under.five.deaths”.
- The correlation within each pair is so strong that the two variables essentially measure the same underlying concept, so we can say there is multicollinearity.
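Strongly correlated pairs like these can also be listed programmatically instead of being read off the plot. The sketch below uses simulated data and a cutoff of 0.9; the data frame `toy`, its columns and the cutoff are all assumptions chosen for illustration:

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)

# Simulated predictors: x2 is almost a copy of x1, x3 is independent
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),
  x3 = rnorm(n)
)

# Correlation matrix; blank out the upper triangle and diagonal
# so that each pair is reported only once
cor_mat <- cor(toy)
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Positions whose absolute correlation exceeds the (assumed) cutoff
high <- which(abs(cor_mat) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)

print(high_pairs)
```

For the real data, the same steps could be run on the numeric columns of `le`. One variable from each reported pair is then a candidate for removal.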
Check levels on categorical variables
# Country variable
levels(le$Country)
## [1] "Afghanistan"
## [2] "Albania"
## [3] "Algeria"
## [4] "Angola"
## [5] "Antigua and Barbuda"
## [6] "Argentina"
## [7] "Armenia"
## [8] "Australia"
## [9] "Austria"
## [10] "Azerbaijan"
## [11] "Bahamas"
## [12] "Bahrain"
## [13] "Bangladesh"
## [14] "Barbados"
## [15] "Belarus"
## [16] "Belgium"
## [17] "Belize"
## [18] "Benin"
## [19] "Bhutan"
## [20] "Bolivia (Plurinational State of)"
## [21] "Bosnia and Herzegovina"
## [22] "Botswana"
## [23] "Brazil"
## [24] "Brunei Darussalam"
## [25] "Bulgaria"
## [26] "Burkina Faso"
## [27] "Burundi"
## [28] "Cabo Verde"
## [29] "Cambodia"
## [30] "Cameroon"
## [31] "Canada"
## [32] "Central African Republic"
## [33] "Chad"
## [34] "Chile"
## [35] "China"
## [36] "Colombia"
## [37] "Comoros"
## [38] "Congo"
## [39] "Cook Islands"
## [40] "Costa Rica"
## [41] "Côte d'Ivoire"
## [42] "Croatia"
## [43] "Cuba"
## [44] "Cyprus"
## [45] "Czechia"
## [46] "Democratic People's Republic of Korea"
## [47] "Democratic Republic of the Congo"
## [48] "Denmark"
## [49] "Djibouti"
## [50] "Dominica"
## [51] "Dominican Republic"
## [52] "Ecuador"
## [53] "Egypt"
## [54] "El Salvador"
## [55] "Equatorial Guinea"
## [56] "Eritrea"
## [57] "Estonia"
## [58] "Ethiopia"
## [59] "Fiji"
## [60] "Finland"
## [61] "France"
## [62] "Gabon"
## [63] "Gambia"
## [64] "Georgia"
## [65] "Germany"
## [66] "Ghana"
## [67] "Greece"
## [68] "Grenada"
## [69] "Guatemala"
## [70] "Guinea"
## [71] "Guinea-Bissau"
## [72] "Guyana"
## [73] "Haiti"
## [74] "Honduras"
## [75] "Hungary"
## [76] "Iceland"
## [77] "India"
## [78] "Indonesia"
## [79] "Iran (Islamic Republic of)"
## [80] "Iraq"
## [81] "Ireland"
## [82] "Israel"
## [83] "Italy"
## [84] "Jamaica"
## [85] "Japan"
## [86] "Jordan"
## [87] "Kazakhstan"
## [88] "Kenya"
## [89] "Kiribati"
## [90] "Kuwait"
## [91] "Kyrgyzstan"
## [92] "Lao People's Democratic Republic"
## [93] "Latvia"
## [94] "Lebanon"
## [95] "Lesotho"
## [96] "Liberia"
## [97] "Libya"
## [98] "Lithuania"
## [99] "Luxembourg"
## [100] "Madagascar"
## [101] "Malawi"
## [102] "Malaysia"
## [103] "Maldives"
## [104] "Mali"
## [105] "Malta"
## [106] "Marshall Islands"
## [107] "Mauritania"
## [108] "Mauritius"
## [109] "Mexico"
## [110] "Micronesia (Federated States of)"
## [111] "Monaco"
## [112] "Mongolia"
## [113] "Montenegro"
## [114] "Morocco"
## [115] "Mozambique"
## [116] "Myanmar"
## [117] "Namibia"
## [118] "Nauru"
## [119] "Nepal"
## [120] "Netherlands"
## [121] "New Zealand"
## [122] "Nicaragua"
## [123] "Niger"
## [124] "Nigeria"
## [125] "Niue"
## [126] "Norway"
## [127] "Oman"
## [128] "Pakistan"
## [129] "Palau"
## [130] "Panama"
## [131] "Papua New Guinea"
## [132] "Paraguay"
## [133] "Peru"
## [134] "Philippines"
## [135] "Poland"
## [136] "Portugal"
## [137] "Qatar"
## [138] "Republic of Korea"
## [139] "Republic of Moldova"
## [140] "Romania"
## [141] "Russian Federation"
## [142] "Rwanda"
## [143] "Saint Kitts and Nevis"
## [144] "Saint Lucia"
## [145] "Saint Vincent and the Grenadines"
## [146] "Samoa"
## [147] "San Marino"
## [148] "Sao Tome and Principe"
## [149] "Saudi Arabia"
## [150] "Senegal"
## [151] "Serbia"
## [152] "Seychelles"
## [153] "Sierra Leone"
## [154] "Singapore"
## [155] "Slovakia"
## [156] "Slovenia"
## [157] "Solomon Islands"
## [158] "Somalia"
## [159] "South Africa"
## [160] "South Sudan"
## [161] "Spain"
## [162] "Sri Lanka"
## [163] "Sudan"
## [164] "Suriname"
## [165] "Swaziland"
## [166] "Sweden"
## [167] "Switzerland"
## [168] "Syrian Arab Republic"
## [169] "Tajikistan"
## [170] "Thailand"
## [171] "The former Yugoslav republic of Macedonia"
## [172] "Timor-Leste"
## [173] "Togo"
## [174] "Tonga"
## [175] "Trinidad and Tobago"
## [176] "Tunisia"
## [177] "Turkey"
## [178] "Turkmenistan"
## [179] "Tuvalu"
## [180] "Uganda"
## [181] "Ukraine"
## [182] "United Arab Emirates"
## [183] "United Kingdom of Great Britain and Northern Ireland"
## [184] "United Republic of Tanzania"
## [185] "United States of America"
## [186] "Uruguay"
## [187] "Uzbekistan"
## [188] "Vanuatu"
## [189] "Venezuela (Bolivarian Republic of)"
## [190] "Viet Nam"
## [191] "Yemen"
## [192] "Zambia"
## [193] "Zimbabwe"
# Year variable
levels(le$Year)
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015"
# Status variable
levels(le$Status)
## [1] "Developed" "Developing"
💡 Insight :
- “Country” has many levels (193) and does not provide useful information for predicting “Life.expectancy”.
- “Year” is a time index. It is not suitable as a predictor for “Life.expectancy” in this model.
- “Status” has 2 levels and is suitable as a categorical predictor for “Life.expectancy”.
A glimpse of the vaccination data
# Subset and see the vaccination
le_vaccination <- le %>%
  select(c(Hepatitis.B,
           Polio,
           Diphtheria))
# Check range for each variable
summary(le_vaccination)
## Hepatitis.B Polio Diphtheria
## Min. : 2.00 Min. : 3.00 Min. : 2.00
## 1st Qu.:74.00 1st Qu.:81.00 1st Qu.:82.00
## Median :89.00 Median :93.00 Median :92.00
## Mean :79.22 Mean :83.56 Mean :84.16
## 3rd Qu.:96.00 3rd Qu.:97.00 3rd Qu.:97.00
## Max. :99.00 Max. :99.00 Max. :99.00
💡 Insight :
- For all three variables, the range between the minimum value and the 1st quartile is very wide (for example, 2% to 74% for “Hepatitis.B”), so the values are heavily skewed toward high coverage. Therefore, these variables should be adjusted.
- We can use the Global Vaccine Action Plan statement to convert them into categorical variables with two levels, “< 90% Covered” and “>= 90% Covered”. The purpose is to get a clearer view of the impact of immunization on “Life.expectancy”.
Feature Engineering
Based on the summary above, we will need to :
- Remove one variable from each strongly correlated pair = “thinness..1.19.years”, “GDP” and “infant.deaths”.
- Remove variables that are not informative or suitable = “Country” and “Year”.
- Convert to categorical data type = “Hepatitis.B”, “Polio” and “Diphtheria”.
- Remove outliers in “Life.expectancy” (keep only observations with a value above 50).
# Drop unsuitable and redundant columns, then bin vaccination coverage at 90%
le_clean <- le %>%
  select(-Country, -Year, -infant.deaths, -GDP, -thinness..1.19.years) %>%
  mutate(Hepatitis.B = ifelse(Hepatitis.B < 90, "< 90% Covered", ">= 90% Covered"),
         Polio = ifelse(Polio < 90, "< 90% Covered", ">= 90% Covered"),
         Diphtheria = ifelse(Diphtheria < 90, "< 90% Covered", ">= 90% Covered"),
         Hepatitis.B = as.factor(Hepatitis.B),
         Polio = as.factor(Polio),
         Diphtheria = as.factor(Diphtheria))
# Remove low outliers: keep rows with life expectancy above 50
le_clean <- le_clean[le_clean$Life.expectancy > 50, ]
Check the structure of the new data frame
le_clean %>% glimpse()
## Rows: 1,590
## Columns: 17
## $ Status <fct> Developing, Developing, Developing, De…
## $ Life.expectancy <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58…
## $ Adult.Mortality <int> 263, 271, 268, 272, 275, 279, 281, 287…
## $ Alcohol <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.…
## $ percentage.expenditure <dbl> 71.279624, 73.523582, 73.219243, 78.18…
## $ Hepatitis.B <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ Measles <int> 1154, 492, 430, 2787, 3013, 1989, 2861…
## $ BMI <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16…
## $ under.five.deaths <int> 83, 86, 89, 93, 97, 102, 106, 110, 113…
## $ Polio <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ Total.expenditure <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.…
## $ Diphtheria <fct> < 90% Covered, < 90% Covered, < 90% Co…
## $ HIV.AIDS <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1…
## $ Population <dbl> 33736494, 327582, 31731688, 3696958, 2…
## $ thinness.5.9.years <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18…
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4…
## $ Schooling <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8…
Check outliers on “Life.expectancy”
boxplot(le_clean$Life.expectancy, ylab = "Life Expectancy (Age)")
Train-Test Split
Before we build a regression model, we need to split the data into train and test datasets. This is a crucial step in the machine learning process, as it allows us to evaluate the performance of our models on data they have not seen and make informed decisions about how to improve them. We will use 80% of the data for training and the remaining 20% for testing.
# set seed
set.seed(123)
# split data
samplesize <- round(0.8 * nrow(le_clean), 0)
index <- sample(seq_len(nrow(le_clean)), size = samplesize)
le_train <- le_clean[index, ]
le_test <- le_clean[-index, ]
Create Model
Multiple predictors
We create a model using “Life.expectancy” as the target variable.
# Create model
le_model <- lm(formula = Life.expectancy ~ .,
data = le_train)
# Model summary
summary(le_model)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = le_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6497 -1.9303 -0.0174 2.3692 11.7388
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.805745917713 0.882964812003 63.203
## StatusDeveloping -0.918239525132 0.377137757239 -2.435
## Adult.Mortality -0.017419078387 0.001189654093 -14.642
## Alcohol -0.085254722512 0.036409231465 -2.342
## percentage.expenditure 0.000414862898 0.000067217776 6.172
## Hepatitis.B>= 90% Covered -0.831983940017 0.348766970892 -2.386
## Measles 0.000026844691 0.000012230715 2.195
## BMI 0.034367794254 0.006465365766 5.316
## under.five.deaths -0.003140719911 0.000937012759 -3.352
## Polio>= 90% Covered 0.063306758025 0.473798530450 0.134
## Total.expenditure 0.103002083684 0.045706670500 2.254
## Diphtheria>= 90% Covered 1.119748320645 0.525159100317 2.132
## HIV.AIDS -0.556403937803 0.033568046892 -16.575
## Population 0.000000001844 0.000000001767 1.043
## thinness.5.9.years -0.013148223109 0.029056477496 -0.453
## Income.composition.of.resources 9.398409439085 0.876786690379 10.719
## Schooling 0.860327636148 0.065550198898 13.125
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## StatusDeveloping 0.015040 *
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.019359 *
## percentage.expenditure 0.000000000909 ***
## Hepatitis.B>= 90% Covered 0.017204 *
## Measles 0.028356 *
## BMI 0.000000125660 ***
## under.five.deaths 0.000827 ***
## Polio>= 90% Covered 0.893728
## Total.expenditure 0.024397 *
## Diphtheria>= 90% Covered 0.033183 *
## HIV.AIDS < 0.0000000000000002 ***
## Population 0.297081
## thinness.5.9.years 0.650983
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.49 on 1255 degrees of freedom
## Multiple R-squared: 0.7969, Adjusted R-squared: 0.7943
## F-statistic: 307.7 on 16 and 1255 DF, p-value: < 0.00000000000000022
💡 Insight :
- The adjusted R-squared is 79.4%, so about a fifth of the variance in “Life.expectancy” is still unexplained and the model leaves room for improvement.
- Most predictors are significant. Only “Polio”, “Population” and “thinness.5.9.years” are not significant for the target (p-value > 0.05).
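The adjusted R-squared and the list of non-significant predictors can be pulled out of the summary object rather than read from the printout. This is a minimal sketch on the built-in `mtcars` data, used here only as a stand-in; the model formula and the 0.05 cutoff are assumptions for illustration:

```r
# Stand-in model on a built-in dataset (not the life expectancy data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Adjusted R-squared as a single number
adj_r2 <- s$adj.r.squared

# Coefficient table: drop the intercept row and keep the p-value column
p_values <- s$coefficients[-1, "Pr(>|t|)"]

# Predictors whose p-value is at or above the (assumed) 0.05 cutoff
not_significant <- names(p_values)[p_values >= 0.05]

print(round(adj_r2, 4))
print(not_significant)
```

Replacing `fit` with `le_model` would return the three non-significant terms named in the insight above, with factor terms shown under their dummy-variable names.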
Stepwise method
Create a model without predictors (intercept only)
le_none <- lm(formula = Life.expectancy ~ 1,
data = le_train)
Create a backward stepwise model
# Backward
le_backward <- step(le_model, direction = "backward")
## Start: AIC=3197
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Polio +
## Total.expenditure + Diphtheria + HIV.AIDS + Population +
## thinness.5.9.years + Income.composition.of.resources + Schooling
##
## Df Sum of Sq RSS AIC
## - Polio 1 0.2 15291 3195.0
## - thinness.5.9.years 1 2.5 15293 3195.2
## - Population 1 13.3 15304 3196.1
## <none> 15290 3197.0
## - Diphtheria 1 55.4 15346 3199.6
## - Measles 1 58.7 15349 3199.9
## - Total.expenditure 1 61.9 15352 3200.1
## - Alcohol 1 66.8 15357 3200.5
## - Hepatitis.B 1 69.3 15360 3200.8
## - Status 1 72.2 15363 3201.0
## - under.five.deaths 1 136.9 15427 3206.3
## - BMI 1 344.3 15635 3223.3
## - percentage.expenditure 1 464.1 15754 3233.0
## - Income.composition.of.resources 1 1399.9 16690 3306.4
## - Schooling 1 2098.7 17389 3358.6
## - Adult.Mortality 1 2612.1 17902 3395.6
## - HIV.AIDS 1 3347.4 18638 3446.8
##
## Step: AIC=3195.01
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Population + thinness.5.9.years +
## Income.composition.of.resources + Schooling
##
## Df Sum of Sq RSS AIC
## - thinness.5.9.years 1 2.5 15293 3193.2
## - Population 1 13.3 15304 3194.1
## <none> 15291 3195.0
## - Measles 1 58.5 15349 3197.9
## - Total.expenditure 1 61.8 15352 3198.1
## - Alcohol 1 66.7 15357 3198.5
## - Hepatitis.B 1 69.8 15360 3198.8
## - Status 1 72.4 15363 3199.0
## - Diphtheria 1 117.4 15408 3202.7
## - under.five.deaths 1 137.1 15428 3204.4
## - BMI 1 344.1 15635 3221.3
## - percentage.expenditure 1 463.9 15754 3231.0
## - Income.composition.of.resources 1 1400.9 16692 3304.5
## - Schooling 1 2104.9 17395 3357.1
## - Adult.Mortality 1 2613.2 17904 3393.7
## - HIV.AIDS 1 3372.2 18663 3446.5
##
## Step: AIC=3193.22
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Population + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## - Population 1 12.8 15306 3192.3
## <none> 15293 3193.2
## - Measles 1 62.5 15356 3196.4
## - Total.expenditure 1 64.0 15357 3196.5
## - Alcohol 1 64.3 15357 3196.6
## - Hepatitis.B 1 71.1 15364 3197.1
## - Status 1 72.2 15365 3197.2
## - Diphtheria 1 116.9 15410 3200.9
## - under.five.deaths 1 171.7 15465 3205.4
## - BMI 1 411.2 15704 3225.0
## - percentage.expenditure 1 464.3 15757 3229.3
## - Income.composition.of.resources 1 1413.2 16706 3303.6
## - Schooling 1 2115.2 17408 3356.0
## - Adult.Mortality 1 2619.9 17913 3392.4
## - HIV.AIDS 1 3436.4 18729 3449.1
##
## Step: AIC=3192.28
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## <none> 15306 3192.3
## - Measles 1 62.7 15369 3195.5
## - Total.expenditure 1 62.9 15369 3195.5
## - Alcohol 1 65.8 15372 3195.7
## - Status 1 71.2 15377 3196.2
## - Hepatitis.B 1 71.6 15377 3196.2
## - Diphtheria 1 119.1 15425 3200.1
## - under.five.deaths 1 183.2 15489 3205.4
## - BMI 1 414.2 15720 3224.2
## - percentage.expenditure 1 464.9 15771 3228.3
## - Income.composition.of.resources 1 1414.7 16721 3302.7
## - Schooling 1 2151.1 17457 3357.6
## - Adult.Mortality 1 2618.7 17924 3391.2
## - HIV.AIDS 1 3447.7 18754 3448.7
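The AIC values in these tables come from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2 * k, where n is the number of observations and k the number of estimated coefficients. The sketch below reproduces the starting AIC from the printed (rounded) RSS and then checks the formula against `extractAIC()` on the built-in `mtcars` data, which is only a stand-in:

```r
# Values read from the output above: 1255 residual degrees of freedom
# plus 17 coefficients gives n = 1272; the starting RSS is 15290 (rounded)
n   <- 1272
k   <- 17
rss <- 15290
aic_start <- n * log(rss / n) + 2 * k
print(round(aic_start, 1))  # close to the printed "Start: AIC=3197"

# Same formula checked against extractAIC() on a stand-in model
fit     <- lm(mpg ~ wt + hp, data = mtcars)
n_fit   <- nobs(fit)
k_fit   <- length(coef(fit))
rss_fit <- sum(residuals(fit)^2)

aic_manual  <- n_fit * log(rss_fit / n_fit) + 2 * k_fit
aic_builtin <- extractAIC(fit)[2]

print(all.equal(aic_manual, aic_builtin))
```

Each row of a step table shows the AIC the model would have after removing that term. `step()` removes the term with the lowest resulting AIC and stops when `<none>` is at the top.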
Create a forward stepwise model
le_forward <- step(le_none,
                   scope = list(lower = le_none, upper = le_model),
                   direction = "forward")
## Start: AIC=5192.52
## Life.expectancy ~ 1
##
## Df Sum of Sq RSS AIC
## + Schooling 1 41020 34258 4193.1
## + Income.composition.of.resources 1 38051 37226 4298.8
## + Adult.Mortality 1 32200 43078 4484.5
## + BMI 1 20882 54396 4781.2
## + HIV.AIDS 1 20165 55113 4797.9
## + Status 1 16525 58753 4879.3
## + thinness.5.9.years 1 15837 59441 4894.1
## + percentage.expenditure 1 14391 60887 4924.6
## + Alcohol 1 14016 61261 4932.4
## + Diphtheria 1 13255 62023 4948.2
## + Polio 1 13153 62125 4950.2
## + Hepatitis.B 1 5836 69442 5091.9
## + Total.expenditure 1 3848 71430 5127.8
## + under.five.deaths 1 3053 72225 5141.9
## + Measles 1 731 74547 5182.1
## <none> 75278 5192.5
## + Population 1 87 75191 5193.0
##
## Step: AIC=4193.11
## Life.expectancy ~ Schooling
##
## Df Sum of Sq RSS AIC
## + Adult.Mortality 1 11389.8 22868 3681.0
## + HIV.AIDS 1 11286.6 22971 3686.7
## + Income.composition.of.resources 1 3859.3 30399 4043.1
## + BMI 1 1987.1 32271 4119.1
## + percentage.expenditure 1 1339.9 32918 4144.4
## + thinness.5.9.years 1 1313.2 32945 4145.4
## + Status 1 853.2 33405 4163.0
## + Polio 1 711.8 33546 4168.4
## + Diphtheria 1 664.6 33593 4170.2
## + under.five.deaths 1 182.5 34075 4188.3
## + Hepatitis.B 1 129.8 34128 4190.3
## + Total.expenditure 1 108.7 34149 4191.1
## + Alcohol 1 73.8 34184 4192.4
## <none> 34258 4193.1
## + Population 1 7.5 34251 4194.8
## + Measles 1 0.1 34258 4195.1
##
## Step: AIC=3681.01
## Life.expectancy ~ Schooling + Adult.Mortality
##
## Df Sum of Sq RSS AIC
## + HIV.AIDS 1 4037.3 18831 3435.9
## + Income.composition.of.resources 1 2053.7 20814 3563.3
## + BMI 1 1003.4 21865 3625.9
## + percentage.expenditure 1 758.0 22110 3640.1
## + thinness.5.9.years 1 700.5 22168 3643.4
## + Diphtheria 1 347.7 22520 3663.5
## + Polio 1 330.0 22538 3664.5
## + Status 1 316.5 22552 3665.3
## + under.five.deaths 1 276.1 22592 3667.6
## + Hepatitis.B 1 66.4 22802 3679.3
## + Population 1 36.1 22832 3681.0
## <none> 22868 3681.0
## + Total.expenditure 1 25.3 22843 3681.6
## + Alcohol 1 23.8 22844 3681.7
## + Measles 1 10.6 22858 3682.4
##
## Step: AIC=3435.92
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
##
## Df Sum of Sq RSS AIC
## + Income.composition.of.resources 1 1873.51 16957 3304.6
## + percentage.expenditure 1 905.07 17926 3375.3
## + BMI 1 686.08 18145 3390.7
## + thinness.5.9.years 1 414.85 18416 3409.6
## + Status 1 343.59 18487 3414.5
## + under.five.deaths 1 221.64 18609 3422.9
## + Total.expenditure 1 197.80 18633 3424.5
## + Alcohol 1 127.41 18704 3429.3
## + Diphtheria 1 107.91 18723 3430.6
## + Polio 1 72.88 18758 3433.0
## + Population 1 43.74 18787 3435.0
## <none> 18831 3435.9
## + Measles 1 2.12 18829 3437.8
## + Hepatitis.B 1 0.57 18830 3437.9
##
## Step: AIC=3304.63
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources
##
## Df Sum of Sq RSS AIC
## + percentage.expenditure 1 670.31 16287 3255.3
## + BMI 1 472.83 16485 3270.7
## + under.five.deaths 1 277.61 16680 3285.6
## + thinness.5.9.years 1 265.83 16692 3286.5
## + Status 1 208.27 16749 3290.9
## + Total.expenditure 1 198.60 16759 3291.6
## + Population 1 62.17 16895 3302.0
## + Diphtheria 1 50.96 16906 3302.8
## + Polio 1 31.05 16926 3304.3
## <none> 16957 3304.6
## + Measles 1 13.68 16944 3305.6
## + Alcohol 1 6.75 16951 3306.1
## + Hepatitis.B 1 0.18 16957 3306.6
##
## Step: AIC=3255.32
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure
##
## Df Sum of Sq RSS AIC
## + BMI 1 479.07 15808 3219.3
## + under.five.deaths 1 272.28 16015 3235.9
## + thinness.5.9.years 1 230.34 16057 3239.2
## + Total.expenditure 1 127.56 16160 3247.3
## + Diphtheria 1 59.14 16228 3252.7
## + Population 1 57.47 16230 3252.8
## + Status 1 39.60 16248 3254.2
## + Polio 1 39.56 16248 3254.2
## <none> 16287 3255.3
## + Measles 1 9.91 16277 3256.5
## + Alcohol 1 8.26 16279 3256.7
## + Hepatitis.B 1 3.28 16284 3257.1
##
## Step: AIC=3219.35
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI
##
## Df Sum of Sq RSS AIC
## + under.five.deaths 1 175.895 15632 3207.1
## + Total.expenditure 1 87.448 15721 3214.3
## + Diphtheria 1 81.711 15726 3214.8
## + thinness.5.9.years 1 57.697 15750 3216.7
## + Polio 1 57.324 15751 3216.7
## + Status 1 44.185 15764 3217.8
## + Population 1 31.888 15776 3218.8
## <none> 15808 3219.3
## + Alcohol 1 11.905 15796 3220.4
## + Hepatitis.B 1 6.632 15801 3220.8
## + Measles 1 0.202 15808 3221.3
##
## Step: AIC=3207.12
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths
##
## Df Sum of Sq RSS AIC
## + Total.expenditure 1 67.829 15564 3203.6
## + Measles 1 54.065 15578 3204.7
## + Diphtheria 1 48.668 15584 3205.1
## + Status 1 43.343 15589 3205.6
## + Polio 1 29.959 15602 3206.7
## <none> 15632 3207.1
## + Population 1 13.661 15618 3208.0
## + Alcohol 1 8.856 15623 3208.4
## + thinness.5.9.years 1 4.388 15628 3208.8
## + Hepatitis.B 1 0.534 15632 3209.1
##
## Step: AIC=3203.58
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure
##
## Df Sum of Sq RSS AIC
## + Measles 1 61.291 15503 3200.6
## + Diphtheria 1 38.595 15526 3202.4
## + Status 1 35.291 15529 3202.7
## <none> 15564 3203.6
## + Polio 1 23.358 15541 3203.7
## + Population 1 15.017 15549 3204.4
## + Alcohol 1 13.699 15551 3204.5
## + thinness.5.9.years 1 2.203 15562 3205.4
## + Hepatitis.B 1 0.015 15564 3205.6
##
## Step: AIC=3200.57
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure +
## Measles
##
## Df Sum of Sq RSS AIC
## + Diphtheria 1 40.313 15463 3199.3
## + Status 1 35.631 15467 3199.6
## + Polio 1 27.011 15476 3200.3
## <none> 15503 3200.6
## + Population 1 14.781 15488 3201.4
## + Alcohol 1 13.712 15489 3201.4
## + thinness.5.9.years 1 0.268 15503 3202.5
## + Hepatitis.B 1 0.070 15503 3202.6
##
## Step: AIC=3199.25
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure +
## Measles + Diphtheria
##
## Df Sum of Sq RSS AIC
## + Hepatitis.B 1 58.028 15405 3196.5
## + Status 1 32.965 15430 3198.5
## <none> 15463 3199.3
## + Alcohol 1 19.933 15443 3199.6
## + Population 1 13.415 15449 3200.1
## + thinness.5.9.years 1 0.815 15462 3201.2
## + Polio 1 0.444 15462 3201.2
##
## Step: AIC=3196.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure +
## Measles + Diphtheria + Hepatitis.B
##
## Df Sum of Sq RSS AIC
## + Status 1 33.058 15372 3195.7
## + Alcohol 1 27.595 15377 3196.2
## <none> 15405 3196.5
## + Population 1 13.102 15392 3197.4
## + Polio 1 0.239 15404 3198.5
## + thinness.5.9.years 1 0.189 15404 3198.5
##
## Step: AIC=3195.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure +
## Measles + Diphtheria + Hepatitis.B + Status
##
## Df Sum of Sq RSS AIC
## + Alcohol 1 65.775 15306 3192.3
## <none> 15372 3195.7
## + Population 1 14.289 15357 3196.6
## + Polio 1 0.093 15372 3197.7
## + thinness.5.9.years 1 0.012 15372 3197.7
##
## Step: AIC=3192.28
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Income.composition.of.resources +
## percentage.expenditure + BMI + under.five.deaths + Total.expenditure +
## Measles + Diphtheria + Hepatitis.B + Status + Alcohol
##
## Df Sum of Sq RSS AIC
## <none> 15306 3192.3
## + Population 1 12.8055 15293 3193.2
## + thinness.5.9.years 1 1.9979 15304 3194.1
## + Polio 1 0.1993 15306 3194.3
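The backward and forward searches above both stop at AIC=3192.28 with the same 13 predictors. Whether two stepwise fits agree can also be checked in code. The sketch below does this on the built-in `mtcars` data as a stand-in; the variable choice is an assumption for illustration, and `trace = 0` only silences the step tables:

```r
# Stand-in full and intercept-only models (not the life expectancy data)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
none <- lm(mpg ~ 1, data = mtcars)

# Backward and forward selection without printing each step
bwd <- step(full, direction = "backward", trace = 0)
fwd <- step(none,
            scope = list(lower = none, upper = full),
            direction = "forward",
            trace = 0)

# Predictors kept by each search
terms_bwd <- attr(terms(bwd), "term.labels")
terms_fwd <- attr(terms(fwd), "term.labels")

# TRUE when both searches selected the same set of predictors
same_terms <- setequal(terms_bwd, terms_fwd)

print(sort(terms_bwd))
print(sort(terms_fwd))
print(same_terms)
```

The two directions are not guaranteed to agree in general, so a check of this kind is worth running before treating either result as the final model.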
Create a stepwise model in both directions (forward and backward)
# Both
le_both <- step(le_model,
                scope = list(lower = le_none,
                             upper = le_model),
                direction = "both")
## Start: AIC=3197
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Polio +
## Total.expenditure + Diphtheria + HIV.AIDS + Population +
## thinness.5.9.years + Income.composition.of.resources + Schooling
##
## Df Sum of Sq RSS AIC
## - Polio 1 0.2 15291 3195.0
## - thinness.5.9.years 1 2.5 15293 3195.2
## - Population 1 13.3 15304 3196.1
## <none> 15290 3197.0
## - Diphtheria 1 55.4 15346 3199.6
## - Measles 1 58.7 15349 3199.9
## - Total.expenditure 1 61.9 15352 3200.1
## - Alcohol 1 66.8 15357 3200.5
## - Hepatitis.B 1 69.3 15360 3200.8
## - Status 1 72.2 15363 3201.0
## - under.five.deaths 1 136.9 15427 3206.3
## - BMI 1 344.3 15635 3223.3
## - percentage.expenditure 1 464.1 15754 3233.0
## - Income.composition.of.resources 1 1399.9 16690 3306.4
## - Schooling 1 2098.7 17389 3358.6
## - Adult.Mortality 1 2612.1 17902 3395.6
## - HIV.AIDS 1 3347.4 18638 3446.8
##
## Step: AIC=3195.01
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Population + thinness.5.9.years +
## Income.composition.of.resources + Schooling
##
## Df Sum of Sq RSS AIC
## - thinness.5.9.years 1 2.5 15293 3193.2
## - Population 1 13.3 15304 3194.1
## <none> 15291 3195.0
## + Polio 1 0.2 15290 3197.0
## - Measles 1 58.5 15349 3197.9
## - Total.expenditure 1 61.8 15352 3198.1
## - Alcohol 1 66.7 15357 3198.5
## - Hepatitis.B 1 69.8 15360 3198.8
## - Status 1 72.4 15363 3199.0
## - Diphtheria 1 117.4 15408 3202.7
## - under.five.deaths 1 137.1 15428 3204.4
## - BMI 1 344.1 15635 3221.3
## - percentage.expenditure 1 463.9 15754 3231.0
## - Income.composition.of.resources 1 1400.9 16692 3304.5
## - Schooling 1 2104.9 17395 3357.1
## - Adult.Mortality 1 2613.2 17904 3393.7
## - HIV.AIDS 1 3372.2 18663 3446.5
##
## Step: AIC=3193.22
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Population + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## - Population 1 12.8 15306 3192.3
## <none> 15293 3193.2
## + thinness.5.9.years 1 2.5 15291 3195.0
## + Polio 1 0.2 15293 3195.2
## - Measles 1 62.5 15356 3196.4
## - Total.expenditure 1 64.0 15357 3196.5
## - Alcohol 1 64.3 15357 3196.6
## - Hepatitis.B 1 71.1 15364 3197.1
## - Status 1 72.2 15365 3197.2
## - Diphtheria 1 116.9 15410 3200.9
## - under.five.deaths 1 171.7 15465 3205.4
## - BMI 1 411.2 15704 3225.0
## - percentage.expenditure 1 464.3 15757 3229.3
## - Income.composition.of.resources 1 1413.2 16706 3303.6
## - Schooling 1 2115.2 17408 3356.0
## - Adult.Mortality 1 2619.9 17913 3392.4
## - HIV.AIDS 1 3436.4 18729 3449.1
##
## Step: AIC=3192.28
## Life.expectancy ~ Status + Adult.Mortality + Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Total.expenditure +
## Diphtheria + HIV.AIDS + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## <none> 15306 3192.3
## + Population 1 12.8 15293 3193.2
## + thinness.5.9.years 1 2.0 15304 3194.1
## + Polio 1 0.2 15306 3194.3
## - Measles 1 62.7 15369 3195.5
## - Total.expenditure 1 62.9 15369 3195.5
## - Alcohol 1 65.8 15372 3195.7
## - Status 1 71.2 15377 3196.2
## - Hepatitis.B 1 71.6 15377 3196.2
## - Diphtheria 1 119.1 15425 3200.1
## - under.five.deaths 1 183.2 15489 3205.4
## - BMI 1 414.2 15720 3224.2
## - percentage.expenditure 1 464.9 15771 3228.3
## - Income.composition.of.resources 1 1414.7 16721 3302.7
## - Schooling 1 2151.1 17457 3357.6
## - Adult.Mortality 1 2618.7 17924 3391.2
## - HIV.AIDS 1 3447.7 18754 3448.7
summary(le_backward)
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Alcohol +
## percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths +
## Total.expenditure + Diphtheria + HIV.AIDS + Income.composition.of.resources +
## Schooling, data = le_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6587 -1.9273 -0.0234 2.3269 11.8412
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.60390048 0.82344174 67.526
## StatusDeveloping -0.91154846 0.37671511 -2.420
## Adult.Mortality -0.01743226 0.00118823 -14.671
## Alcohol -0.08349779 0.03591153 -2.325
## percentage.expenditure 0.00041518 0.00006716 6.182
## Hepatitis.B>= 90% Covered -0.83487886 0.34417632 -2.426
## Measles 0.00002749 0.00001211 2.271
## BMI 0.03545956 0.00607768 5.834
## under.five.deaths -0.00277041 0.00071399 -3.880
## Total.expenditure 0.10350644 0.04553187 2.273
## Diphtheria>= 90% Covered 1.17656001 0.37612305 3.128
## HIV.AIDS -0.55922622 0.03322075 -16.834
## Income.composition.of.resources 9.42925025 0.87443614 10.783
## Schooling 0.86694378 0.06519956 13.297
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## StatusDeveloping 0.01567 *
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.02023 *
## percentage.expenditure 0.000000000856 ***
## Hepatitis.B>= 90% Covered 0.01542 *
## Measles 0.02334 *
## BMI 0.000000006859 ***
## under.five.deaths 0.00011 ***
## Total.expenditure 0.02318 *
## Diphtheria>= 90% Covered 0.00180 **
## HIV.AIDS < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared: 0.7967, Adjusted R-squared: 0.7946
## F-statistic: 379.2 on 13 and 1258 DF, p-value: < 0.00000000000000022
summary(le_forward)
##
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality +
## HIV.AIDS + Income.composition.of.resources + percentage.expenditure +
## BMI + under.five.deaths + Total.expenditure + Measles + Diphtheria +
## Hepatitis.B + Status + Alcohol, data = le_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6587 -1.9273 -0.0234 2.3269 11.8412
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.60390048 0.82344174 67.526
## Schooling 0.86694378 0.06519956 13.297
## Adult.Mortality -0.01743226 0.00118823 -14.671
## HIV.AIDS -0.55922622 0.03322075 -16.834
## Income.composition.of.resources 9.42925025 0.87443614 10.783
## percentage.expenditure 0.00041518 0.00006716 6.182
## BMI 0.03545956 0.00607768 5.834
## under.five.deaths -0.00277041 0.00071399 -3.880
## Total.expenditure 0.10350644 0.04553187 2.273
## Measles 0.00002749 0.00001211 2.271
## Diphtheria>= 90% Covered 1.17656001 0.37612305 3.128
## Hepatitis.B>= 90% Covered -0.83487886 0.34417632 -2.426
## StatusDeveloping -0.91154846 0.37671511 -2.420
## Alcohol -0.08349779 0.03591153 -2.325
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## Adult.Mortality < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## percentage.expenditure 0.000000000856 ***
## BMI 0.000000006859 ***
## under.five.deaths 0.00011 ***
## Total.expenditure 0.02318 *
## Measles 0.02334 *
## Diphtheria>= 90% Covered 0.00180 **
## Hepatitis.B>= 90% Covered 0.01542 *
## StatusDeveloping 0.01567 *
## Alcohol 0.02023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared: 0.7967, Adjusted R-squared: 0.7946
## F-statistic: 379.2 on 13 and 1258 DF, p-value: < 0.00000000000000022
summary(le_both)
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Alcohol +
## percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths +
## Total.expenditure + Diphtheria + HIV.AIDS + Income.composition.of.resources +
## Schooling, data = le_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6587 -1.9273 -0.0234 2.3269 11.8412
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.60390048 0.82344174 67.526
## StatusDeveloping -0.91154846 0.37671511 -2.420
## Adult.Mortality -0.01743226 0.00118823 -14.671
## Alcohol -0.08349779 0.03591153 -2.325
## percentage.expenditure 0.00041518 0.00006716 6.182
## Hepatitis.B>= 90% Covered -0.83487886 0.34417632 -2.426
## Measles 0.00002749 0.00001211 2.271
## BMI 0.03545956 0.00607768 5.834
## under.five.deaths -0.00277041 0.00071399 -3.880
## Total.expenditure 0.10350644 0.04553187 2.273
## Diphtheria>= 90% Covered 1.17656001 0.37612305 3.128
## HIV.AIDS -0.55922622 0.03322075 -16.834
## Income.composition.of.resources 9.42925025 0.87443614 10.783
## Schooling 0.86694378 0.06519956 13.297
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## StatusDeveloping 0.01567 *
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.02023 *
## percentage.expenditure 0.000000000856 ***
## Hepatitis.B>= 90% Covered 0.01542 *
## Measles 0.02334 *
## BMI 0.000000006859 ***
## under.five.deaths 0.00011 ***
## Total.expenditure 0.02318 *
## Diphtheria>= 90% Covered 0.00180 **
## HIV.AIDS < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.488 on 1258 degrees of freedom
## Multiple R-squared: 0.7967, Adjusted R-squared: 0.7946
## F-statistic: 379.2 on 13 and 1258 DF, p-value: < 0.00000000000000022
💡 Insight :
- The adjusted R-squared is identical for the “backward”, “forward” and “both” stepwise models, at 79.5% (0.7946). All three procedures converge on the same 13 predictors.
- The result is still not satisfactory.
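The AIC values printed by step() can be checked by hand. The sketch below assumes the form used for linear models, AIC = n * log(RSS / n) + 2k, where n is the number of training rows and k the number of estimated coefficients. The inputs are read from the output above; n = 1272 is inferred from the 1258 residual degrees of freedom plus 14 coefficients.

```r
# Values read from the step() and summary() output above
n <- 1258 + 14          # residual degrees of freedom + estimated coefficients

# AIC as printed by step() for lm fits: n * log(RSS / n) + 2 * k
step_aic <- function(rss, k, n) n * log(rss / n) + 2 * k

aic_full  <- step_aic(rss = 15290, k = 17, n = n)  # full model: 16 predictors + intercept
aic_final <- step_aic(rss = 15306, k = 14, n = n)  # final model: 13 predictors + intercept

round(c(full = aic_full, final = aic_final), 1)
```

Both values should land close to the reported 3197 and 3192.28; small differences come from the rounded RSS shown in the printout.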
Feature selection
Keeping only the predictors that are significant at the highest level (three stars, ***, p < 0.001), let’s create a new model with those predictors.
# Model selection
le_selected <- lm(formula = Life.expectancy ~ Adult.Mortality + under.five.deaths + HIV.AIDS + percentage.expenditure + BMI + Income.composition.of.resources + Schooling,
data = le_train)
summary(le_selected)
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + under.five.deaths +
## HIV.AIDS + percentage.expenditure + BMI + Income.composition.of.resources +
## Schooling, data = le_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8367 -2.0608 0.0356 2.3203 12.1137
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.27699936 0.61067502 90.518
## Adult.Mortality -0.01839507 0.00117363 -15.674
## under.five.deaths -0.00231010 0.00061255 -3.771
## HIV.AIDS -0.56342891 0.03255143 -17.309
## percentage.expenditure 0.00046564 0.00006319 7.369
## BMI 0.03374420 0.00606621 5.563
## Income.composition.of.resources 9.48884449 0.86342388 10.990
## Schooling 0.88661004 0.06080591 14.581
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Adult.Mortality < 0.0000000000000002 ***
## under.five.deaths 0.00017 ***
## HIV.AIDS < 0.0000000000000002 ***
## percentage.expenditure 0.000000000000309 ***
## BMI 0.000000032397113 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.517 on 1264 degrees of freedom
## Multiple R-squared: 0.7923, Adjusted R-squared: 0.7912
## F-statistic: 689 on 7 and 1264 DF, p-value: < 0.00000000000000022
💡 Insight :
- The adjusted R-squared for the selected predictors drops slightly to 79.1% (0.7912), so removing the weaker predictors does not improve the model. The result is still not satisfactory.
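The three-star predictors above were picked by reading the summary. The same selection can be done programmatically from the coefficient table. This is a minimal sketch on hypothetical toy data; `toy`, `x1`, `x2` and `strong` are illustrative names, not part of the life expectancy data.

```r
set.seed(1)
# Hypothetical toy data: x1 drives y, x2 is pure noise
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 3 + 2 * toy$x1 + rnorm(200)

fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table as a matrix; the "Pr(>|t|)" column holds the p-values
p_values <- coef(summary(fit))[, "Pr(>|t|)"]

# Keep the terms significant at the 0.001 level, dropping the intercept
strong <- setdiff(names(p_values)[p_values < 0.001], "(Intercept)")
strong
```

For factor predictors the row names are level-specific (for example `StatusDeveloping`), so they would need to be mapped back to the variable name before building a formula.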
Model comparison
# Compare the adjusted R-squared of every model fitted so far
data.frame(model = c("le_model", "le_backward", "le_forward", "le_both", "le_selected"),
           AdjRsquare = c(summary(le_model)$adj.r.squared,
                          summary(le_backward)$adj.r.squared,
                          summary(le_forward)$adj.r.squared,
                          summary(le_both)$adj.r.squared,
                          summary(le_selected)$adj.r.squared))
## model AdjRsquare
## 1 le_model 0.7942915
## 2 le_backward 0.7945742
## 3 le_forward 0.7945742
## 4 le_both 0.7945742
## 5 le_selected 0.7911908
💡 Insight :
- “le_backward”, “le_forward” and “le_both” share the highest adjusted R-squared (0.7946), ahead of “le_model” and “le_selected”. Therefore, we will tune one of them before predicting “Life.expectancy”.
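The adjusted R-squared values in the table can be reproduced from the plain R-squared, which shows why a model with fewer predictors can score lower despite the smaller penalty. This sketch uses the rounded R-squared values printed in the summaries and assumes n = 1272 training rows (1258 residual degrees of freedom plus 14 coefficients).

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1272  # training rows: residual df + number of coefficients

adj_stepwise <- adj_r2(0.7967, n, p = 13)  # le_backward / le_forward / le_both
adj_selected <- adj_r2(0.7923, n, p = 7)   # le_selected

round(c(stepwise = adj_stepwise, selected = adj_selected), 4)
```

The results agree with the table to about three decimals; the remaining gap comes from using the rounded R-squared as input.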
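Log transformation
A log transformation rescales the data with the logarithm; here we use log1p(x) = log(1 + x), which is also defined at zero. Since “le_backward”, “le_forward” and “le_both” are the same model, we take its variables and log-transform the target and the numeric predictors. The categorical predictors (Status, Hepatitis.B, Diphtheria) are left unchanged.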
le_log <- lm(formula = log1p(Life.expectancy) ~ Status + log1p(Adult.Mortality) + log1p(Alcohol) + log1p(percentage.expenditure) + Hepatitis.B + log1p(Measles) + log1p(BMI) + log1p(under.five.deaths) + log1p(Total.expenditure) + Diphtheria + log1p(HIV.AIDS) + log1p(Income.composition.of.resources) + log1p(Schooling),
data = le_clean)
summary(le_log)
##
## Call:
## lm(formula = log1p(Life.expectancy) ~ Status + log1p(Adult.Mortality) +
## log1p(Alcohol) + log1p(percentage.expenditure) + Hepatitis.B +
## log1p(Measles) + log1p(BMI) + log1p(under.five.deaths) +
## log1p(Total.expenditure) + Diphtheria + log1p(HIV.AIDS) +
## log1p(Income.composition.of.resources) + log1p(Schooling),
## data = le_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.186507 -0.028061 0.001444 0.027619 0.178458
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.9520531 0.0230530 171.434
## StatusDeveloping -0.0097076 0.0041594 -2.334
## log1p(Adult.Mortality) -0.0097795 0.0012800 -7.640
## log1p(Alcohol) 0.0062903 0.0018735 3.358
## log1p(percentage.expenditure) 0.0063802 0.0008170 7.809
## Hepatitis.B>= 90% Covered -0.0126270 0.0041799 -3.021
## log1p(Measles) -0.0004827 0.0004812 -1.003
## log1p(BMI) 0.0030001 0.0018224 1.646
## log1p(under.five.deaths) -0.0059778 0.0010672 -5.601
## log1p(Total.expenditure) 0.0069510 0.0033860 2.053
## Diphtheria>= 90% Covered 0.0128293 0.0045744 2.805
## log1p(HIV.AIDS) -0.0841528 0.0022181 -37.939
## log1p(Income.composition.of.resources) 0.1739592 0.0146800 11.850
## log1p(Schooling) 0.1018532 0.0091858 11.088
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## StatusDeveloping 0.019725 *
## log1p(Adult.Mortality) 0.0000000000000375 ***
## log1p(Alcohol) 0.000805 ***
## log1p(percentage.expenditure) 0.0000000000000104 ***
## Hepatitis.B>= 90% Covered 0.002561 **
## log1p(Measles) 0.315981
## log1p(BMI) 0.099919 .
## log1p(under.five.deaths) 0.0000000250690925 ***
## log1p(Total.expenditure) 0.040250 *
## Diphtheria>= 90% Covered 0.005100 **
## log1p(HIV.AIDS) < 0.0000000000000002 ***
## log1p(Income.composition.of.resources) < 0.0000000000000002 ***
## log1p(Schooling) < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04782 on 1576 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8251
## F-statistic: 577.6 on 13 and 1576 DF, p-value: < 0.00000000000000022
# Same comparison, now including the log-transformed model
data.frame(model = c("le_model", "le_backward", "le_forward", "le_both", "le_selected", "le_log"),
           AdjRsquare = c(summary(le_model)$adj.r.squared,
                          summary(le_backward)$adj.r.squared,
                          summary(le_forward)$adj.r.squared,
                          summary(le_both)$adj.r.squared,
                          summary(le_selected)$adj.r.squared,
                          summary(le_log)$adj.r.squared))
## model AdjRsquare
## 1 le_model 0.7942915
## 2 le_backward 0.7945742
## 3 le_forward 0.7945742
## 4 le_both 0.7945742
## 5 le_selected 0.7911908
## 6 le_log 0.8250929
💡 Insight :
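- The adjusted R-squared improves to 82.5%, which indicates a better linear fit. Note that this value is measured on the log scale, so it is not strictly comparable with the models above, and that “le_log” is fitted on “le_clean” (1,590 observations) rather than on “le_train”.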
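Because the target was transformed with log1p(), predictions from “le_log” come back on the log scale and have to be transformed back before they can be compared with “Life.expectancy”. This is a small sketch of the round trip, using made-up values inside the observed range:

```r
life   <- c(51, 70, 89)    # example values (years), chosen for illustration
logged <- log1p(life)      # log(1 + x), the transform applied to the target

back_ok  <- expm1(logged)  # exact inverse of log1p(): exp(x) - 1
back_off <- exp(logged)    # plain exp() is one unit too high

back_ok
back_off - life
```

expm1() recovers the original values, while exp() returns each value plus one.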
Prediction Model & Errors
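# Prediction
le_pred_temp <- predict(le_log, le_test)
# expm1() inverts log1p(); the error values reported below were computed with exp(),
# which adds 1 to every prediction
le_pred <- expm1(le_pred_temp)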
# Check error
data.frame(Method = c("RMSE","MAE"),
Error.Value = c(RMSE(le_pred, le_test$Life.expectancy),
MAE(le_pred, le_test$Life.expectancy)))
## Method Error.Value
## 1 RMSE 3.280840
## 2 MAE 2.523816
# Check target range
range(le_test$Life.expectancy)
## [1] 51 89
💡 Insight :
- The error values are small compared with the range of “Life.expectancy” (51 to 89 years): the RMSE is about 3.3 years and the MAE about 2.5 years. Therefore, we expect the predicted values to be close to the actual values.
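For reference, both error measures are simple to write out in base R. The helpers `rmse` and `mae` and the four example values below are hypothetical and only illustrate the definitions; the last line relates the reported RMSE to the observed range.

```r
# Root mean squared error and mean absolute error, written out in base R
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
mae  <- function(pred, actual) mean(abs(pred - actual))

# Hypothetical predictions and observed values (years)
actual <- c(60, 72, 81, 55)
pred   <- c(62, 70, 80, 59)

toy_rmse <- rmse(pred, actual)   # errors 2, -2, -1, 4 -> sqrt(25 / 4) = 2.5
toy_mae  <- mae(pred, actual)    # (2 + 2 + 1 + 4) / 4 = 2.25

# Reported RMSE as a share of the observed target range (51 to 89)
rmse_share <- 3.280840 / diff(c(51, 89))
round(rmse_share, 3)
```

RMSE penalises large misses more heavily than MAE, which is why it is the larger of the two here and in the results above.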
Evaluation Model
Normality test
Using histogram
hist(le_log$residuals, breaks = 20)
Most of the residuals are concentrated around the center (zero), which suggests an approximately normal distribution.
Using QQ Plot
plot(le_log, which = 2)
Most of the residuals are gathered around the center of the plot, which suggests an approximately normal distribution.
Shapiro test
shapiro.test(le_log$residuals)
##
## Shapiro-Wilk normality test
##
## data: le_log$residuals
## W = 0.99285, p-value = 0.0000005719
The W statistic is 0.99285, which is close to 1, so the residuals deviate only slightly from a normal distribution. However, the p-value is 0.0000005719, far below 0.05, so the null hypothesis of normality is rejected. With a sample this large (1,590 observations), the Shapiro-Wilk test detects even small deviations, so the residuals are formally not normally distributed even though the plots above look close to normal.
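The combination of a W statistic near 1 and a very small p-value is typical of large samples. The sketch below is an illustration on simulated data, not on the model residuals: it draws mildly heavy-tailed values at two sample sizes and runs the same test on each. The distribution, sizes and seed are assumptions chosen for the example.

```r
set.seed(42)
# Hypothetical residuals: mildly heavy-tailed (t distribution with 8 df)
small_sample <- rt(50, df = 8)
large_sample <- rt(5000, df = 8)   # 5000 is the largest size shapiro.test() accepts

res_small <- shapiro.test(small_sample)
res_large <- shapiro.test(large_sample)

# Side-by-side comparison of W and the p-value at both sample sizes
data.frame(n = c(50, 5000),
           W = c(res_small$statistic, res_large$statistic),
           p_value = c(res_small$p.value, res_large$p.value))
```

With the same underlying distribution, the larger sample usually gives a much smaller p-value while W stays close to 1, so the plots and W are a useful complement to the p-value.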
Homoscedasticity
Create a plot to check visually
# Residuals against fitted values for the model under test (le_log)
plot(le_log$fitted.values, le_log$residuals)
abline(h = 0, col = "red")
bptest(le_log)
##
## studentized Breusch-Pagan test
##
## data: le_log
## BP = 58.76, df = 13, p-value = 0.00000008744
The test statistic is BP = 58.76 with 13 degrees of freedom (one per model term), and the p-value is 0.00000008744, far below 0.05. The null hypothesis of homoscedasticity is therefore rejected: there is strong evidence of heteroscedasticity in the residuals.
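To make the statistic less of a black box, the studentized Breusch-Pagan test can be reproduced in base R: regress the squared residuals on the predictors and multiply the resulting R-squared by n. The sketch below does this on simulated data whose error spread grows with x; the data and variable names are assumptions for illustration, and the result is not meant to match the BP value above.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictors
aux <- lm(residuals(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared     # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1          # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

A small p-value means the squared residuals are explained by the predictors, which is exactly what non-constant variance looks like.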
Multicollinearity Test
vif(le_log)
## Status log1p(Adult.Mortality)
## 1.552076 1.190205
## log1p(Alcohol) log1p(percentage.expenditure)
## 1.861351 1.675288
## Hepatitis.B log1p(Measles)
## 3.035487 1.701054
## log1p(BMI) log1p(under.five.deaths)
## 1.287062 2.413507
## log1p(Total.expenditure) Diphtheria
## 1.096936 3.524288
## log1p(HIV.AIDS) log1p(Income.composition.of.resources)
## 1.460273 2.266542
## log1p(Schooling)
## 3.086015
Based on the VIF result, there is no evidence of serious multicollinearity in the model. All VIF values are below 4 (the largest is 3.52, for Diphtheria), well under the common rule-of-thumb threshold of 10, which suggests that the predictors are not highly correlated with each other. This is a good sign for the stability of the coefficient estimates.
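The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors, two of which are deliberately correlated. The names `x1`, `x2`, `x3` and `vif_manual` are hypothetical.

```r
set.seed(3)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

The correlated pair gets clearly inflated values while the independent predictor stays near 1, which is the pattern to look for in the table above.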
Linearity Test
Create plot
linear <- data.frame(residual = le_log$residuals, fitted = le_log$fitted.values)
plot(linear)
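# Run cor.test() between the target and every column, collecting the p-values
cor.test.all <- function(data, target) {
  cols <- names(data)
  df <- NULL
  for (i in seq_along(cols)) {
    y <- target
    x <- cols[[i]]
    # [[ ]] extracts a plain vector from both data frames and tibbles
    p_value <- cor.test(data[[y]], data[[x]])$p.value
    temp <- data.frame(x = x,
                       y = y,
                       p_value = p_value)
    df <- rbind(df, temp)
  }
  return(df)
}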
data_num2 <- le_clean %>% select(Status,
Adult.Mortality,
Alcohol,
percentage.expenditure,
Hepatitis.B,
Measles,
BMI,
under.five.deaths,
Total.expenditure,
Diphtheria,
HIV.AIDS,
Income.composition.of.resources,
Schooling,
Life.expectancy) %>% # select only variables in le_log + target
select_if(is.numeric)
p_value <- cor.test.all(data_num2, "Life.expectancy")
p_value %>%
filter(p_value > 0.05)
## [1] x y p_value
## <0 rows> (or 0-length row.names)
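All numeric predictors are significantly correlated with the dependent variable, since no p-value is greater than 0.05. The categorical predictors (Status, Hepatitis.B, Diphtheria) are dropped by select_if(is.numeric) and are not part of this check.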
Conclusion
Result :
- Adj. R-squared value is 82.5%, indicating a good linear model.
- The RMSE and MAE values are small compared with the range of “Life.expectancy”, indicating that the predicted values will be close to the actual values.
- The model passed the VIF and linearity tests.
- However, the normality and homoscedasticity tests do not give the expected results.
Even after the log transformation, the model does not pass all of the assumption tests. Therefore, here is some advice for future work :
- The linear model can still be used to explain the linear relationship between “Life.expectancy” and the selected independent variables. The drawback is that the model is highly sensitive to outliers, so it is strongly recommended to inspect the outlier pattern before applying a linear model to a new “Life.expectancy” data set.
- Another option is weighted regression, where each observation is weighted by the inverse of its error variance. This gives more weight to the observations with lower variance, which helps to reduce the impact of heteroscedasticity.
- Use a robust method: robust regression estimators, such as the Huber M-estimator and the S-estimator, are designed to be less sensitive to outliers. Heteroscedasticity-consistent (Huber-White) standard errors keep the inference valid when the error variance is not constant.
- Use a different model: if the relationship between the dependent variable and the independent variables is not linear, a nonlinear model may be more appropriate.
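As an illustration of the weighted regression suggestion, the sketch below fits ordinary and weighted least squares on simulated data where the error variance is known to grow with x². In practice the variance structure is unknown and has to be estimated, so the weights used here (`1 / x^2`) are an assumption made for the example.

```r
set.seed(11)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error variance proportional to x^2

ols <- lm(y ~ x)
# Weight each observation by the inverse of its (assumed known) error variance
wls <- lm(y ~ x, weights = 1 / x^2)

# Coefficients side by side; the true values are intercept 1 and slope 2
rbind(OLS = coef(ols), WLS = coef(wls))
```

Both fits estimate the same coefficients, but the weighted fit relies more on the low-variance observations, so its estimates are generally more precise under this kind of heteroscedasticity.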