Linear Regression

Concept

Regression is one of the most important concepts in machine learning. Regression analysis lets us predict a target variable (y) from the values of one or more predictor variables (x). The target variable is also known as the dependent variable, response, or label. The predictor variables are also known as independent variables or features.

The types of regressions are represented in the diagram:

In the formulas below you will find several notations, such as \(y\), \(b_0\), \(b_n\), and \(x_n\). Note that:

  1. The target variable (dependent variable) is denoted by \(y\).
  2. The predictor variables (independent variables) are denoted by \(x_n\).
  3. The intercept is denoted by \(b_0\).
  4. The slope (gradient) values are denoted by \(b_n\). A slope can be positive, negative, or zero, and its value has to be learnt from the data.


Without further ado, let’s go through each of the points shown in the block diagram in detail!

Based on:

A. Degree of the independent variable

1. Linear: the model fitted to predict the target variable is linear in the predictors.

\[y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + \dots + b_n*x_n\]

2. Polynomial: the model fitted to predict the target variable is a polynomial in the predictor, with powers of \(x\) up to degree \(n\).

\[y = b_0 + b_1*x + b_2*x^2 + b_3*x^3 + \dots + b_n*x^n\]

B. Number of independent variables

1. Univariate: the number of predictor variables (independent variables) is 1.

\[y = b_0 + b_1*x_1\]

2. Bivariate: the number of predictor variables (independent variables) is 2.

\[y = b_0 + b_1*x_1 + b_2*x_2\]

3. Multivariate: the number of predictor variables (independent variables) is more than 2.

\[y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + \dots + b_n*x_n\]
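The model forms listed above map directly onto R model formulas. The following is a minimal sketch on simulated data; the variable names, coefficients, and the `toy` data frame are made up for illustration and are not part of the life expectancy dataset.

```r
# Simulated data; names and coefficients are assumptions for illustration only
set.seed(42)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 0.5)
toy <- data.frame(y, x1, x2)

fit_uni  <- lm(y ~ x1, data = toy)            # one predictor
fit_bi   <- lm(y ~ x1 + x2, data = toy)       # two predictors
fit_poly <- lm(y ~ x1 + I(x1^2), data = toy)  # degree-2 polynomial in x1

# Number of estimated coefficients (intercept included) for each form
sapply(list(uni = fit_uni, bi = fit_bi, poly = fit_poly),
       function(m) length(coef(m)))
```

Note that the polynomial model is still fitted with `lm`, because it remains linear in the coefficients even though it is not linear in `x1`.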

Linear Regression

The goal of a regression task is to predict the value of the target variable from the independent variable(s) using the best-fitting model. Linear regression looks for the best possible linear relationship between the target variable and the predictor variables.

With a single predictor, linear regression consists of finding the best-fitting straight line through the points. The best-fitting line gives the predicted score on \(Y\) for each possible value of \(X\). The vertical distances from the points to the best-fitting line represent the errors of prediction.

The error of prediction for a point is the value of the point \(Y\) minus the predicted value \(Y'\) (the value on the line).

\[ Error = Y - Y'\]

The most commonly used criterion for the best-fitting line is that it minimizes the sum of the squared errors of prediction, \(\sum (Y-Y')^2\).
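The least-squares criterion can be checked numerically. This sketch uses five invented points and a hypothetical helper `sse()`; it compares the sum of squared errors of the line found by `lm` with that of a slightly shifted line.

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for a candidate line Y' = b0 + b1 * X
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)
b   <- unname(coef(fit))

sse_fit   <- sse(b[1], b[2])        # SSE of the least-squares line
sse_other <- sse(b[1] + 0.5, b[2])  # SSE of the same line shifted up by 0.5
sse_fit < sse_other                 # the fitted line has the smaller SSE
```

Any other choice of intercept or slope gives a sum of squared errors at least as large as `sse_fit`.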

Limitations & Assumptions

Limitations

Linear regression models, even though they are considered the powerhouse of statistics, come with limitations.

  • Linear regression fits best when a linear relationship exists between the predictor variables and the target.
  • Simple and multiple regression models can be sensitive to outliers (recall the chapter regarding leverage and power).

Assumptions

Linear regression models, even though they are considered the powerhouse of statistics, also come with assumptions.

1. Linearity: there is a linear relationship between the target variable and the independent variable(s).

2. Normality of errors: the errors (residuals) are normally distributed.

3. Homoscedasticity: the errors are randomly scattered with constant variance across the fitted values.

4. No multicollinearity: no independent variables are strongly correlated with each other.
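These assumptions can be inspected after fitting a model. The sketch below uses only base R and the built-in `mtcars` data as a stand-in; the helper `vif_manual()` is hypothetical and written here for illustration. The lmtest and car packages loaded later in this document provide `bptest()` and `vif()` for the same kind of checks.

```r
# mtcars ships with R; it stands in for the real data here
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- residuals(fit)

# Normality of errors: Shapiro-Wilk test on the residuals
shapiro.test(res)

# Homoscedasticity: residuals vs fitted values should show no funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Multicollinearity: variance inflation factor, VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vif_values <- vif_manual(mtcars, c("wt", "hp", "disp"))
vif_values
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; larger values point to multicollinearity.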

Packages Required

Initially, we begin by loading the packages that will be required throughout the course of the analysis.

library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(scales)
library(caret)
library(psych)
library(stats)
library(leaps)
library(GGally)
library(MASS)
library(lmtest)
library(car)
library(MLmetrics)

The descriptions of the packages are in the table below.

Package      Description
data.table   Fast data manipulation that can greatly reduce programming and compute time
DT           An R interface to the DataTables library
kableExtra   Styling of tables, including interactive ones, within Markdown
knitr        A general-purpose tool for dynamic report generation
tidyverse    Collection of R packages (tidyr, dplyr, ggplot2) designed for data science that work harmoniously with other packages
tidyr        Changing the layout of data sets to convert data into a tidy format
dplyr        For data manipulation
ggplot2      Customizable graphical representation
caret        For data pre-processing and feature selection
psych        Functions primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis
stats        Contains functions for statistical calculations and random number generation
leaps        Regression subset selection, including exhaustive search
GGally       Extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data
MASS         Functions to support Venables and Ripley, “Modern Applied Statistics with S” (4th edition, 2002)
lmtest       A collection of tests for diagnostic checking in linear regression models, plus some generic tools for inference in parametric models
car          Functions to accompany J. Fox and S. Weisberg, An R Companion to Applied Regression, Third Edition, Sage, 2019
MLmetrics    A collection of evaluation metrics, including loss, score and utility functions, that measure regression, classification and ranking performance

Data Preparation

Now, let’s load the dataset into the R-Environment.

Importing Data

This project aims to build the best model for predicting life expectancy based on the Global Health Observatory (GHO) dataset.

Data Source

This project uses the Life Expectancy (WHO) dataset published by Kumar Rajarshi on kaggle.com. The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset related to life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been a huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project we consider data from the years 2000-2015 for 193 countries for further analysis. The individual data files have been merged into a single dataset.

Read the Data

The dataset is in .csv format, so we will use the read.csv function to read it.

raw.data <- read.csv("assets/Life Expectancy Data.csv")

Glimpse of the Data

After importing the dataset, let’s take a peek at it!

The dataset has 2,938 rows and 22 columns.

# DATA INPUT: RAW DATASET
glimpse(raw.data)
## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afgh...
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010,...
## $ Status                          <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8,...
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, ...
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82,...
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,...
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78...
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63,...
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2...
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7,...
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, ...
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, ...
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20,...
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63,...
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, ...
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 66...
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958...
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4,...
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4,...
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, ...
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9...

Data Wrangling

Quality Check

Missing Values

Let’s see how many missing values there are in each column.

# Counting missing values in each column
numMissVal <- colSums(is.na(raw.data))

# Result table
kable(as.data.frame(numMissVal)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "250px")
numMissVal
Country 0
Year 0
Status 0
Life.expectancy 10
Adult.Mortality 10
infant.deaths 0
Alcohol 194
percentage.expenditure 0
Hepatitis.B 553
Measles 0
BMI 34
under.five.deaths 0
Polio 19
Total.expenditure 226
Diphtheria 19
HIV.AIDS 0
GDP 448
Population 652
thinness..1.19.years 34
thinness.5.9.years 34
Income.composition.of.resources 167
Schooling 163


Observation findings:

  1. The majority of the columns (14 of 22) have missing values.
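Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to compute both on a small invented data frame; `toy`, `na_count`, and `na_share` are illustrative names, not objects from this analysis.

```r
# Toy data frame with made-up values
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

na_count <- colSums(is.na(toy))                   # missing values per column
na_share <- round(100 * colMeans(is.na(toy)), 1)  # percentage missing per column

data.frame(na_count, na_share)
sum(na_count > 0)   # number of columns with at least one missing value
```

The same two lines applied to the real data frame would show, for each column, what fraction of its rows would be affected by imputation.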

Duplicated Rows

No duplicated rows were found.

# Check duplicated rows (returns zero rows when there are none)
raw.data[duplicated(raw.data), ]

Data Formatting

Some columns need to be converted to factors.

# data format : factor
raw.data$Country <- as.factor(raw.data$Country)
raw.data$Status <-  as.factor(raw.data$Status)

NA Values Treatment

Concept


Missing values can severely distort the distribution of the data. However, there isn’t a single best way to deal with missing data. Removing columns or rows with missing values can bias the analysis, and imputation does not necessarily give better results.


Alvira Swalin gives a fuller explanation of how to handle missing data on towardsdatascience.com. The methods for handling missing values are as follows:


Following the flowchart, our case is a general problem with continuous data, so we use data imputation.

Check Outliers

As the flowchart indicates, we can treat the missing values with data imputation. The columns that have missing values are all of a continuous data type. For each column we can impute either its mean or its median, depending on how many outliers it has. If a column has many outliers, the median is the better choice because it is robust to extreme values. If it has few outliers, we can use the mean.

par(mfrow=c(1,3))
boxplot(raw.data$Life.expectancy,
        ylab = "Life Expectancy",
        main = "Boxplot of Life Expectancy")
boxplot(raw.data$Adult.Mortality,
        ylab = "Adult Mortality",
        main = "Boxplot of Adult Mortality")
boxplot(raw.data$Alcohol,
        ylab = "Alcohol",
        main = "Boxplot of Alcohol")

par(mfrow=c(1,3))
boxplot(raw.data$Hepatitis.B,
        ylab = "Hepatitis B",
        main = "Boxplot of Hepatitis B")
boxplot(raw.data$BMI,
        ylab = "BMI",
        main = "Boxplot of BMI")
boxplot(raw.data$Polio,
        ylab = "Polio",
        main = "Boxplot of Polio")

par(mfrow=c(1,3))
boxplot(raw.data$Total.expenditure,
        ylab = "Total Expenditure",
        main = "Boxplot of Total Expenditure")
boxplot(raw.data$Diphtheria,
        ylab = "Diphtheria",
        main = "Boxplot of Diphtheria")
boxplot(raw.data$GDP,
        ylab = "GDP",
        main = "Boxplot of GDP")

par(mfrow=c(1,3))
boxplot(raw.data$Population,
        ylab = "Population",
        main = "Boxplot of Population")
boxplot(raw.data$thinness..1.19.years,
        ylab = "Thinness 1-19 years",
        main = "Boxplot of Thinness for 1-19 years old")
boxplot(raw.data$thinness.5.9.years,
        ylab = "Thinness 5-9 years",
        main = "Boxplot of Thinness for 5-9 years old")

par(mfrow=c(1,3))
boxplot(raw.data$Income.composition.of.resources,
        ylab = "Income Composition",
        main = "Boxplot of Income Composition")
boxplot(raw.data$Schooling,
        ylab = "Schooling",
        main = "Boxplot of Schooling")


Observation findings:

  1. Most of the columns have many outliers, except Alcohol, BMI, and Income.composition.of.resources.
  2. For the columns that have many outliers, we impute the median.
  3. For the columns that do not have many outliers, we impute the mean.
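Reading outliers off boxplots is a judgment call. The rule can be made explicit with the same 1.5 × IQR criterion that `boxplot` uses. This is a sketch under stated assumptions: `outlier_share()` and `choose_center()` are hypothetical helpers, and the 5% cut-off is an arbitrary choice, not a rule taken from this analysis.

```r
# Share of observations flagged by the boxplot rule (beyond 1.5 * IQR from the hinges)
outlier_share <- function(x) {
  x <- x[!is.na(x)]
  length(boxplot.stats(x)$out) / length(x)
}

# Hypothetical helper: the 5% threshold is an assumption for illustration
choose_center <- function(x, threshold = 0.05) {
  if (outlier_share(x) > threshold) median(x, na.rm = TRUE) else mean(x, na.rm = TRUE)
}

# Made-up values with one extreme point and one missing value
x <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 200, NA)
outlier_share(x)   # one of the ten observed values is flagged
choose_center(x)   # falls back to the median because of the extreme value
```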

Data Imputation

We use data imputation by its median value to most of the columns with missing values. These columns have many outliers.

# Find the median value of each column that has many outliers

life_median <- median(raw.data$Life.expectancy, na.rm = TRUE)
mortality_median <- median(raw.data$Adult.Mortality, na.rm = TRUE)
hepatitis_median <- median(raw.data$Hepatitis.B, na.rm = TRUE)
polio_median <- median(raw.data$Polio, na.rm = TRUE)
diph_median <- median(raw.data$Diphtheria, na.rm = TRUE)
exp_median <- median(raw.data$Total.expenditure, na.rm = TRUE)
gdp_median <- median(raw.data$GDP, na.rm = TRUE)
pop_median <- median(raw.data$Population, na.rm = TRUE)
thin19_median <- median(raw.data$thinness..1.19.years, na.rm = TRUE)
thin9_median <- median(raw.data$thinness.5.9.years, na.rm = TRUE)
school_median <- median(raw.data$Schooling, na.rm = TRUE)

Then replace the missing values with the median of the corresponding columns.

raw.data$Life.expectancy[is.na(raw.data$Life.expectancy)] <- life_median
raw.data$Adult.Mortality[is.na(raw.data$Adult.Mortality)] <- mortality_median
raw.data$Hepatitis.B[is.na(raw.data$Hepatitis.B)] <- hepatitis_median
raw.data$Polio[is.na(raw.data$Polio)] <- polio_median
raw.data$Diphtheria[is.na(raw.data$Diphtheria)] <- diph_median
raw.data$Total.expenditure[is.na(raw.data$Total.expenditure)] <- exp_median
raw.data$GDP[is.na(raw.data$GDP)] <- gdp_median
raw.data$Population[is.na(raw.data$Population)] <- pop_median
raw.data$thinness..1.19.years[is.na(raw.data$thinness..1.19.years)] <- thin19_median
raw.data$thinness.5.9.years[is.na(raw.data$thinness.5.9.years)] <- thin9_median
raw.data$Schooling[is.na(raw.data$Schooling)] <- school_median


Next, we find the mean value for the Alcohol, BMI, Income.composition.of.resources columns. These columns don’t have many outliers.

alcohol_mean <- mean(raw.data$Alcohol,  na.rm = TRUE)
bmi_mean <- mean(raw.data$BMI,  na.rm = TRUE)
income_mean <- mean(raw.data$Income.composition.of.resources,  na.rm = TRUE)

Then replace the missing values with the average of the corresponding columns.

raw.data$Alcohol[is.na(raw.data$Alcohol)] <- alcohol_mean
raw.data$BMI[is.na(raw.data$BMI)] <- bmi_mean
raw.data$Income.composition.of.resources[is.na(raw.data$Income.composition.of.resources)] <- income_mean
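The column-by-column assignments can also be written as a loop. The sketch below is one possible compact form on invented data; `impute_cols()` is a hypothetical helper and `toy` is a made-up data frame, so this illustrates the idea rather than replacing the code above.

```r
# Hypothetical helper; `fun` is the summary statistic used as the fill value
impute_cols <- function(data, cols, fun) {
  for (col in cols) {
    fill <- fun(data[[col]], na.rm = TRUE)
    data[[col]][is.na(data[[col]])] <- fill
  }
  data
}

# Toy data with made-up values
toy <- data.frame(skewed = c(1, 2, 3, NA, 100),
                  smooth = c(4, 5, NA, 6, 5))

toy <- impute_cols(toy, "skewed", median)  # median for the outlier-heavy column
toy <- impute_cols(toy, "smooth", mean)    # mean for the well-behaved column
anyNA(toy)                                 # no missing values remain
```

Passing a vector of column names to `cols` would treat several columns with the same statistic in one call.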

Cleaned Dataset

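Here’s the cleaned data set:

# clean_data is assumed to be the imputed raw.data; it is not defined elsewhere in this section
clean_data <- raw.data

datatable(head(clean_data, 50),
          options = list(scroller = TRUE, scrollX = TRUE),
          style = 'bootstrap',
          class = 'table-bordered table-condensed')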

Summary of Variables

The variables of the dataset are summarised in the table below.


No.  Variable                         Class    Description
1    Country                          factor   Country name
2    Year                             numeric  Year of the data
3    Status                           factor   Country status: developed or developing
4    Life.expectancy                  numeric  Life expectancy in years
5    Adult.Mortality                  numeric  Adult mortality rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
6    infant.deaths                    numeric  Number of infant deaths per 1000 population
7    Alcohol                          numeric  Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
8    percentage.expenditure           numeric  Expenditure on health as a percentage of Gross Domestic Product per capita (%)
9    Hepatitis.B                      numeric  Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
10   Measles                          numeric  Measles, number of reported cases per 1000 population
11   BMI                              numeric  Average Body Mass Index of entire population
12   under.five.deaths                numeric  Number of under-five deaths per 1000 population
13   Polio                            numeric  Polio (Pol3) immunization coverage among 1-year-olds (%)
14   Total.expenditure                numeric  General government expenditure on health as a percentage of total government expenditure (%)
15   Diphtheria                       numeric  Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
16   HIV.AIDS                         numeric  Deaths per 1000 live births due to HIV/AIDS (0-4 years)
17   GDP                              numeric  Gross Domestic Product per capita (in USD)
18   Population                       numeric  Population of the country
19   thinness..1.19.years             numeric  Prevalence of thinness among children and adolescents for age 10 to 19 (%)
20   thinness.5.9.years               numeric  Prevalence of thinness among children for age 5 to 9 (%)
21   Income.composition.of.resources  numeric  Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
22   Schooling                        numeric  Number of years of schooling (years)


Exploratory Data Analysis

Target Variable

In this case study, we use Life.expectancy as our target variable. Let’s do EDA on the target variable.

par(mfrow=c(1,2))
# target variable
# histogram
hist(clean_data$Life.expectancy,
     main = "Life Expectancy Distribution",
     xlab = "Life Expectancy(yrs)")
# kernel density plot with a vertical indication of location of the mean
plot(density(clean_data$Life.expectancy),
     main = "Distribution of Life Expectancy",
     xlab = "Life Expectancy (yrs)")
abline(v=mean(clean_data$Life.expectancy))


Observation findings:

  1. The target variable Life.expectancy is not perfectly normally distributed; it is slightly left-skewed.

  2. The unit of Life.expectancy is years.
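Skewness can also be expressed as a number instead of being read off the plot. The sketch below implements the moment-based sample skewness in base R on invented values; `skewness()` is a hypothetical helper and `left_tailed` is made-up data, not the life expectancy column.

```r
# Moment-based sample skewness: negative values indicate a longer left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

left_tailed <- c(40, 55, 62, 68, 70, 72, 73, 74, 75, 76)  # made-up ages
skewness(left_tailed)    # negative: the tail is on the left
skewness(-left_tailed)   # same magnitude, opposite sign
```

Applied to a left-skewed variable, the result is negative; a value near zero suggests a roughly symmetric distribution.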

Predictor Variables

Univariate Plots

Alcohol

par(mfrow=c(2,2))
layout(matrix(c(1,1,2,3), 2, 2, byrow = F),
   widths=c(1,1), heights=c(1,1))
boxplot(clean_data$Alcohol,
        main = "Alcohol consumption")         # box plot 
plot(density(clean_data$Alcohol),
     main = "Distribution of Alcohol consumed",
     xlab = "Alcohol(litres)")   # kernel density plot
# square-root transform to reduce the right skew
plot(density(clean_data$Alcohol^0.5),
     main = "Distribution of Alcohol consumed",
     xlab = "Alcohol(litres)")   # kernel density plot of the square-root-transformed values


Observation findings:

  1. The predictor variable Alcohol is not normally distributed. It is highly right-skewed.

  2. The outliers do not appear to be data errors. They are extreme values that arise because some countries have a high GDP while others have a very low GDP, so they should not be removed.

    • Supporting evidence: Alcohol and GDP are significantly correlated, with a correlation coefficient of 0.31 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$Alcohol, clean_data$GDP)
## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$Alcohol and clean_data$GDP
## t = 17.831, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2795771 0.3448433
## sample estimates:
##       cor 
## 0.3125791
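The printed `p-value < 0.00000000000000022` is a display floor at machine precision rather than the exact value. The object returned by `cor.test` stores the individual numbers, which can be extracted directly. This sketch uses the built-in `mtcars` data purely for illustration.

```r
ct <- cor.test(mtcars$wt, mtcars$mpg)   # built-in data, used only for illustration

unname(ct$estimate)   # Pearson correlation coefficient
ct$p.value            # p-value stored in the object
ct$conf.int           # 95% confidence interval for the correlation

# Scientific notation makes very small p-values easier to report
format(ct$p.value, scientific = TRUE, digits = 3)
```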

Under Five Year Old Deaths

par(mfrow=c(2,2))
layout(matrix(c(1,1,2,3), 2, 2, byrow = F),
   widths=c(1,1), heights=c(1,1))
boxplot(clean_data$under.five.deaths,
        main = "Under Five Year Old Deaths")         # box plot 
plot(density(clean_data$under.five.deaths),
     main = "Distribution / 1000 Population",
     xlab = "Under Five Year Old Deaths(cnt)")   # kernel density plot
# square-root transform to reduce the right skew
plot(density(clean_data$under.five.deaths^0.5),
     main = "Distribution Rate / 1000 Population",
     xlab = "Under Five Year Old Deaths rate")   # kernel density plot of the square-root-transformed values


Observation findings:

  1. The predictor variable under.five.deaths is not normally distributed. It is highly right-skewed.

  2. The outliers do not appear to be data errors. They are extreme values that arise because some countries have a high GDP while others have a very low GDP, so they should not be removed.

    • Supporting evidence: under.five.deaths and GDP are significantly, though weakly, correlated, with a correlation coefficient of -0.11 and a p-value of \(8.19 \times 10^{-9}\).
cor.test(clean_data$under.five.deaths, clean_data$GDP)
## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$under.five.deaths and clean_data$GDP
## t = -5.7813, df = 2936, p-value = 0.000000008194
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14171143 -0.07020008
## sample estimates:
##        cor 
## -0.1060929

Percentage Expenditure

par(mfrow=c(1,2))
boxplot(clean_data$percentage.expenditure,
        main = "Percentage expenditure")         # box plot
plot(density(clean_data$percentage.expenditure),
     main = "% Expenditure on health",
     xlab = "Percentage expenditure(%)")   # kernel density plot


Observation findings:

  1. The predictor variable percentage.expenditure is not normally distributed. It is heavily right-skewed.

  2. The outliers do not appear to be data errors. They are extreme values that arise because some countries have a high GDP while others have a very low GDP, so they should not be removed.

    • Supporting evidence: percentage.expenditure and GDP are significantly correlated, with a correlation coefficient of 0.90 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$percentage.expenditure, clean_data$GDP)
## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$percentage.expenditure and clean_data$GDP
## t = 113.08, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8948392 0.9083581
## sample estimates:
##       cor 
## 0.9018191

Polio

par(mfrow=c(1,2))
boxplot(clean_data$Polio,
        main = "Polio Immunization ")         # box plot
plot(density(clean_data$Polio),
     main = "% Polio Immunization Coverage",
     xlab = "Polio Immunization (%)")   # kernel density plot


Observation findings:

  1. The predictor variable Polio is not normally distributed. It is heavily left-skewed.

  2. The outliers do not appear to be data errors. They are extreme values that arise because some countries have a high GDP while others have a very low GDP, so they should not be removed.

    • Supporting evidence: Polio and GDP are significantly correlated, with a correlation coefficient of 0.19 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$Polio, clean_data$GDP)
## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$Polio and clean_data$GDP
## t = 10.482, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1548269 0.2245457
## sample estimates:
##       cor 
## 0.1899257

Prevalence of Thinness for Age 10 to 19

par(mfrow=c(1,2))
boxplot(clean_data$thinness..1.19.years,
        main = "Prevalence of thinness ")         # box plot
plot(density(clean_data$thinness..1.19.years),
     main = "% Prevalence of thinness",
     xlab = "Prevalence of thinness (%)")   # kernel density plot


Observation findings:

  1. The predictor variable thinness..1.19.years is not normally distributed. It is right-skewed.

  2. The outliers do not appear to be data errors. They are extreme values that arise because some countries have a high GDP while others have a very low GDP, so they should not be removed.

    • Supporting evidence: thinness..1.19.years and GDP are significantly correlated, with a correlation coefficient of -0.26 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$thinness..1.19.years, clean_data$GDP)
## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$thinness..1.19.years and clean_data$GDP
## t = -14.79, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2966604 -0.2293449
## sample estimates:
##        cor 
## -0.2633231

Bivariate Plots

We have plotted four of the predictor variables to show how they relate to the target variable overall.

Life Expectancy ~ Income Compositions

# life expectancy vs. income composition - positively correlated
plot(y = clean_data$Life.expectancy,
     x = clean_data$Income.composition.of.resources,
     main = "Life Expectancy vs. Income compositions",
     xlab = "Income composition of resources",
     ylab = "Life Expectancy",
     pch = 19,
     col = "yellowgreen")
abline(60, 1,
       col = "red")       # reference line with intercept 60 and slope 1


Observation findings:

  1. Life.expectancy and Income.composition.of.resources are positively correlated.

  2. The red line is only a reference line with slope 1, not a fitted regression line. A slope of 1 is not the same as a correlation of 1: the correlation measures how tightly the points cluster around a straight line, and the visible scatter shows that it is less than 1.
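A line with slope 1 says nothing by itself about the correlation coefficient. The sketch below uses made-up vectors to show that two perfectly linear relationships with different slopes both have a correlation of 1, while adding noise lowers the correlation even when the underlying slope stays at 1.

```r
x  <- 1:10
y1 <- 60 + 1   * x    # slope 1, perfectly linear
y2 <- 60 + 0.1 * x    # slope 0.1, also perfectly linear
cor(x, y1)
cor(x, y2)

set.seed(1)
y3 <- 60 + 1 * x + rnorm(10, sd = 5)   # true slope 1, but noisy
cor(x, y3)                             # below 1 because of the scatter
coef(lm(y3 ~ x))[["x"]]                # estimated slope from the noisy data
```

To judge the relationship in a scatter plot, a fitted line such as `abline(lm(y ~ x))` is usually more informative than a fixed reference line.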

Life Expectancy ~ Schooling

plot(y = clean_data$Life.expectancy,
     x = clean_data$Schooling,
     main = "Life Expectancy vs. Schooling",
     xlab = "Schooling",
     ylab = "Life Expectancy",
     pch = 19,
     col = "rosybrown1")
abline(50, 1,
       col = "red")      # reference line with intercept 50 and slope 1


Observation findings:

  1. Life.expectancy and Schooling are positively correlated.

  2. The red line is only a reference line with slope 1, not a fitted regression line. The scatter of the points around the upward trend shows that the correlation is less than 1.

Life Expectancy ~ Adult Mortality

plot(y = clean_data$Life.expectancy,
     x = clean_data$Adult.Mortality,
     main = "Life Expectancy vs. Adult Mortality",
     xlab = "Adult Mortality",
     ylab = "Life Expectancy",
     pch = 19,
     col = "mediumpurple1")
abline(80, -1,
       col = "red")       # reference line with intercept 80 and slope -1


Observation findings:

  1. Life.expectancy and Adult.Mortality are negatively correlated.

  2. The red line is only a reference line with slope -1, not a fitted regression line. The scatter of the points around the downward trend shows that the correlation is not a perfect -1.

Life Expectancy ~ Population

plot(y = clean_data$Life.expectancy,
     x = clean_data$Population,
     main = "Life Expectancy vs. Population",
     xlab = "Population",
     ylab = "Life Expectancy",
     pch = 19,
     col = "lightsteelblue3")
abline(50, 1,
       col = "red")      # reference line with intercept 50 and slope 1

Observation findings:

  1. Life.expectancy and Population show almost no correlation.

  2. The red line is only a reference line with slope 1, not a fitted regression line. The points show no linear trend, which is consistent with a negligible correlation.

Correlations

We need to check the linear relationship between the target variable and the predictor variables (independent variables).

Let’s find the correlations between the target variable Life.expectancy and the first 5 predictors, i.e. Adult.Mortality, infant.deaths, Alcohol, percentage.expenditure, and Hepatitis.B.

# check correlations of the target variable with the first 5 predictors using Pearson correlation
pairs.panels(clean_data[,4:9], 
             method = "pearson", # correlation method
             hist.col = "green",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

Next, let’s find the correlations between the target variable Life.expectancy and the next 5 predictors, i.e. Measles, BMI, under.five.deaths, Polio, and Total.expenditure.

# check correlations of the target variable with the next 5 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,10:14)], 
             method = "pearson", # correlation method
             hist.col = "green",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

Next, let’s find the correlations between the target variable Life.expectancy and the next 5 predictors, i.e. Diphtheria, HIV.AIDS, GDP, Population, and thinness..1.19.years.

# check correlations of the target variable with the next 5 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,15:19)], 
             method = "pearson", # correlation method
             hist.col = "green",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

Next, let’s find the correlations between the target variable Life.expectancy and the last 3 predictors, i.e. thinness.5.9.years, Income.composition.of.resources, and Schooling.

# check correlations of the target variable with the last 3 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,20:22)], 
             method = "pearson", # correlation method
             hist.col = "green",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)


Following Schober & Boer’s labels for interpreting r values, the strength of a correlation is described by the absolute value of r as follows:

  1. Negligible correlation = 0.00 - 0.09
  2. Weak correlation = 0.10 - 0.39
  3. Moderate correlation = 0.40 - 0.69
  4. Strong correlation = 0.70 - 0.89
  5. Very strong correlation = 0.90 - 1.00
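The labels can be applied programmatically to a vector of correlation coefficients. The sketch below encodes the bands listed above with `cut`; `label_correlation()` is a hypothetical helper, and the input values are the GDP correlations reported by the `cor.test` outputs earlier in this document.

```r
# Label |r| with the bands listed above (lower bound inclusive, 1 included in the top band)
label_correlation <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.40, 0.70, 0.90, 1),
      labels = c("negligible", "weak", "moderate", "strong", "very strong"),
      right = FALSE, include.lowest = TRUE)
}

# Correlations with GDP from the cor.test outputs shown earlier
r_gdp <- c(Alcohol = 0.3125791, under.five.deaths = -0.1060929,
           percentage.expenditure = 0.9018191, Polio = 0.1899257,
           thinness..1.19.years = -0.2633231)
data.frame(r = r_gdp, label = label_correlation(r_gdp))
```

Taking the absolute value first means negative correlations are labelled by their strength, independent of direction.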


Observation findings :

  1. According to the Pearson correlation, the target variable Life.expectancy is strongly correlated with Schooling, Adult.Mortality, and Income.composition.of.resources.

  2. According to the Pearson correlation, the target variable Life.expectancy has a moderate correlation with BMI, HIV.AIDS, Diphtheria, thinness..1.19.years, Polio, thinness.5.9.years, and GDP.

  3. According to the Pearson correlation, the target variable Life.expectancy has a weak correlation with Alcohol, percentage.expenditure, under.five.deaths, Total.expenditure, and infant.deaths.

  4. According to the Pearson correlation, the target variable Life.expectancy has a very weak correlation with Hepatitis.B, Measles, and Population.

Modelling and Predicting

Let’s build a few models and make predictions based on the dataset!

Null Model


Modelling

Let’s build a model without any predictors!

# baseline model with no predictors
nullModel <- lm(Life.expectancy ~ 1,
                data = clean_data)

# check the model
summary(nullModel)
## 
## Call:
## lm(formula = Life.expectancy ~ 1, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.935  -6.035   2.865   6.365  19.765 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  69.2347     0.1754   394.6 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.509 on 2937 degrees of freedom

Predictions

The model is set! Let’s make a prediction based on the nullModel!

nullModelPreds <- predict(nullModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the nullModel!

summary (nullModel)
## 
## Call:
## lm(formula = Life.expectancy ~ 1, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.935  -6.035   2.865   6.365  19.765 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  69.2347     0.1754   394.6 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.509 on 2937 degrees of freedom


Observation findings:

  1. The nullModel has no predictors.

  2. Without predictors, the nullModel predicts every value as the intercept, which equals the mean of the target variable (69.23 years).
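The link between the intercept-only model and the mean can be verified directly. This sketch uses the built-in `mtcars` data as a stand-in and also computes the root mean squared error of the baseline, which richer models should improve on; the object names are illustrative.

```r
fit0 <- lm(mpg ~ 1, data = mtcars)   # intercept-only model on built-in data

coef(fit0)[["(Intercept)"]]          # the intercept ...
mean(mtcars$mpg)                     # ... equals the mean of the target

preds <- predict(fit0, newdata = mtcars)
length(unique(round(unname(preds), 10)))   # every row gets the same prediction

# Root mean squared error of the baseline, a yardstick for richer models
rmse0 <- sqrt(mean((mtcars$mpg - preds)^2))
rmse0
```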

Full Model


Modelling

Let’s build a model with all predictors, except Country!

# full model with all predictors, except Country
fullModel <- lm(Life.expectancy ~ . - Country,
                data = clean_data)

# check the model
summary(fullModel)
## 
## Call:
## lm(formula = Life.expectancy ~ . - Country, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2860  -2.2437  -0.0817   2.3740  16.4282 
## 
## Coefficients:
##                                          Estimate        Std. Error t value
## (Intercept)                     81.57371776468257 34.65542497468053   2.354
## Year                            -0.01242470778379  0.01732564244728  -0.717
## StatusDeveloping                -1.60345349486797  0.27022829393151  -5.934
## Adult.Mortality                 -0.01981086120344  0.00079540789532 -24.907
## infant.deaths                    0.09932330190305  0.00842868251499  11.784
## Alcohol                          0.06135257680996  0.02608460875382   2.352
## percentage.expenditure           0.00003237640088  0.00009046842117   0.358
## Hepatitis.B                     -0.01660284006457  0.00371935046024  -4.464
## Measles                         -0.00001913469438  0.00000765428319  -2.500
## BMI                              0.04455147570409  0.00493230714916   9.033
## under.five.deaths               -0.07437771658324  0.00617691845324 -12.041
## Polio                            0.02865527021677  0.00444805095607   6.442
## Total.expenditure                0.07444995871406  0.03436772061263   2.166
## Diphtheria                       0.04078688657161  0.00464496059143   8.781
## HIV.AIDS                        -0.47215145572195  0.01764488354663 -26.759
## GDP                              0.00004260320835  0.00001378343058   3.091
## Population                       0.00000000001346  0.00000000168702   0.008
## thinness..1.19.years            -0.08195313416986  0.05028535403442  -1.630
## thinness.5.9.years               0.00847794195443  0.04956902498094   0.171
## Income.composition.of.resources  5.83669846548877  0.64052133954243   9.112
## Schooling                        0.64795815007715  0.04176733243002  15.514
##                                             Pr(>|t|)    
## (Intercept)                                  0.01865 *  
## Year                                         0.47335    
## StatusDeveloping                      0.000000003311 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                   < 0.0000000000000002 ***
## Alcohol                                      0.01874 *  
## percentage.expenditure                       0.72046    
## Hepatitis.B                           0.000008352972 ***
## Measles                                      0.01248 *  
## BMI                             < 0.0000000000000002 ***
## under.five.deaths               < 0.0000000000000002 ***
## Polio                                 0.000000000137 ***
## Total.expenditure                            0.03037 *  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## GDP                                          0.00201 ** 
## Population                                   0.99364    
## thinness..1.19.years                         0.10326    
## thinness.5.9.years                           0.86421    
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.045 on 2917 degrees of freedom
## Multiple R-squared:  0.8203, Adjusted R-squared:  0.819 
## F-statistic: 665.6 on 20 and 2917 DF,  p-value: < 0.00000000000000022

Predictions

The model is set! Let’s make a prediction based on the fullModel!

fullModelPreds <- predict(fullModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the fullModel!

summary(fullModel)
## 
## Call:
## lm(formula = Life.expectancy ~ . - Country, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2860  -2.2437  -0.0817   2.3740  16.4282 
## 
## Coefficients:
##                                          Estimate        Std. Error t value
## (Intercept)                     81.57371776468257 34.65542497468053   2.354
## Year                            -0.01242470778379  0.01732564244728  -0.717
## StatusDeveloping                -1.60345349486797  0.27022829393151  -5.934
## Adult.Mortality                 -0.01981086120344  0.00079540789532 -24.907
## infant.deaths                    0.09932330190305  0.00842868251499  11.784
## Alcohol                          0.06135257680996  0.02608460875382   2.352
## percentage.expenditure           0.00003237640088  0.00009046842117   0.358
## Hepatitis.B                     -0.01660284006457  0.00371935046024  -4.464
## Measles                         -0.00001913469438  0.00000765428319  -2.500
## BMI                              0.04455147570409  0.00493230714916   9.033
## under.five.deaths               -0.07437771658324  0.00617691845324 -12.041
## Polio                            0.02865527021677  0.00444805095607   6.442
## Total.expenditure                0.07444995871406  0.03436772061263   2.166
## Diphtheria                       0.04078688657161  0.00464496059143   8.781
## HIV.AIDS                        -0.47215145572195  0.01764488354663 -26.759
## GDP                              0.00004260320835  0.00001378343058   3.091
## Population                       0.00000000001346  0.00000000168702   0.008
## thinness..1.19.years            -0.08195313416986  0.05028535403442  -1.630
## thinness.5.9.years               0.00847794195443  0.04956902498094   0.171
## Income.composition.of.resources  5.83669846548877  0.64052133954243   9.112
## Schooling                        0.64795815007715  0.04176733243002  15.514
##                                             Pr(>|t|)    
## (Intercept)                                  0.01865 *  
## Year                                         0.47335    
## StatusDeveloping                      0.000000003311 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                   < 0.0000000000000002 ***
## Alcohol                                      0.01874 *  
## percentage.expenditure                       0.72046    
## Hepatitis.B                           0.000008352972 ***
## Measles                                      0.01248 *  
## BMI                             < 0.0000000000000002 ***
## under.five.deaths               < 0.0000000000000002 ***
## Polio                                 0.000000000137 ***
## Total.expenditure                            0.03037 *  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## GDP                                          0.00201 ** 
## Population                                   0.99364    
## thinness..1.19.years                         0.10326    
## thinness.5.9.years                           0.86421    
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.045 on 2917 degrees of freedom
## Multiple R-squared:  0.8203, Adjusted R-squared:  0.819 
## F-statistic: 665.6 on 20 and 2917 DF,  p-value: < 0.00000000000000022


Observation findings :

  1. Apart from the intercept, the largest parameter estimates in the fullModel are Income.composition.of.resources (5.837) and StatusDeveloping (-1.603).

  2. Income.composition.of.resources has the largest positive coefficient: with the other predictors held fixed, a one-unit increase is associated with about 5.84 more years of Life.expectancy. Coefficients depend on the scale of each predictor, so this ranking compares raw estimates rather than standardized effects.

  3. On the other hand, StatusDeveloping has the largest negative coefficient: with the other predictors held fixed, a Developing country is predicted to have a Life.expectancy about 1.60 years lower than a country in the reference Status category.

  4. The p-values for Adult.Mortality, infant.deaths, BMI, under.five.deaths, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest among all predictors (below 2e-16), indicating that they are highly significant predictors of Life.expectancy.

  5. The fullModel has an R-squared value of 0.8203, which means it explains about 82% of the variance in Life.expectancy.
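
As a worked check, the R-squared value can be recomputed from the residual sums of squares printed in the stepwise output later in this section. The fullModel leaves a residual sum of squares of 47735, while the nullModel, which only predicts the mean, leaves 265573. The symbols \(RSS\) and \(TSS\) are introduced here for illustration, with \(TSS\) taken to be the nullModel's residual sum of squares.

\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{47735}{265573} \approx 0.8203\]

This agrees with the Multiple R-squared reported by summary(fullModel).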

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of fullModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = fullModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Full model",
     xlab = "Actual",
     ylab = "Predicted(fullModel)",
     pch = 19)
abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(fullModel$fitted.values ~ fullModel$model$Life.expectancy),
       col = "red", lwd = 2)

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted values of fullModel with interval bounds.


The blue line is the regression line of the predicted on the actual values, and the grey band around it is the 95% confidence interval for that line. The dashed red lines mark the lower and upper bounds of the 95% prediction interval returned by predict(). Every plotted point is a fitted value, so it lies between these two bounds by construction.

# predict Life expectancy
predictedLE1 <- predict(fullModel, interval = "prediction")

# combine the actual data and predicted data
comb1 <- cbind.data.frame(clean_data, predictedLE1)

# plot predicted vs. actual values with the prediction-interval bounds
ggplot(comb1, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +  # lower prediction bound
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +  # upper prediction bound
  geom_smooth(method = lm, se = TRUE) +  # regression line with its confidence band
  ggtitle("Actual vs. Predicted for FullModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")
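
Confidence intervals and prediction intervals are easy to mix up, so here is a minimal sketch of the difference. It assumes the built-in mtcars dataset rather than clean_data, and the names fit, new_car, conf_int, pred_int, and widths are illustrative only.

```r
# Illustrative data: the built-in mtcars dataset, not the tutorial's clean_data
fit <- lm(mpg ~ wt, data = mtcars)

# New observation to predict for (hypothetical value chosen for illustration)
new_car <- data.frame(wt = 3)

# Confidence interval: uncertainty about the mean response at wt = 3
conf_int <- as.data.frame(
  predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
)

# Prediction interval: uncertainty about a single new response at wt = 3
pred_int <- as.data.frame(
  predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
)

# Both share the same point estimate, but the prediction interval is wider
widths <- c(confidence = conf_int$upr - conf_int$lwr,
            prediction = pred_int$upr - pred_int$lwr)
print(widths)
```

The prediction interval is wider because it adds the residual variance of a single observation to the uncertainty of the fitted mean.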

EDA-Based Model


Modelling

Let’s build a model with the predictors that are strongly correlated with the target variable! The predictors are Schooling, Adult.Mortality, and Income.composition.of.resources.

# model with the predictors that are strongly correlated with the target variable
EDAModel <- lm(Life.expectancy ~ Schooling + Adult.Mortality + Income.composition.of.resources,
               data = clean_data)

# check the model
summary(EDAModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     Income.composition.of.resources, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.541  -1.933   0.374   2.619  23.151 
## 
## Coefficients:
##                                   Estimate Std. Error t value
## (Intercept)                     56.5263202  0.4782747  118.19
## Schooling                        0.9883753  0.0483207   20.45
## Adult.Mortality                 -0.0345377  0.0008571  -40.30
## Income.composition.of.resources 10.4014206  0.7731076   13.45
##                                            Pr(>|t|)    
## (Intercept)                     <0.0000000000000002 ***
## Schooling                       <0.0000000000000002 ***
## Adult.Mortality                 <0.0000000000000002 ***
## Income.composition.of.resources <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.114 on 2934 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.7107 
## F-statistic:  2406 on 3 and 2934 DF,  p-value: < 0.00000000000000022

Predictions

The model is set! Let’s make a prediction based on the EDAModel!

EDAModelPreds <- predict(EDAModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the EDAModel!

summary(EDAModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     Income.composition.of.resources, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.541  -1.933   0.374   2.619  23.151 
## 
## Coefficients:
##                                   Estimate Std. Error t value
## (Intercept)                     56.5263202  0.4782747  118.19
## Schooling                        0.9883753  0.0483207   20.45
## Adult.Mortality                 -0.0345377  0.0008571  -40.30
## Income.composition.of.resources 10.4014206  0.7731076   13.45
##                                            Pr(>|t|)    
## (Intercept)                     <0.0000000000000002 ***
## Schooling                       <0.0000000000000002 ***
## Adult.Mortality                 <0.0000000000000002 ***
## Income.composition.of.resources <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.114 on 2934 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.7107 
## F-statistic:  2406 on 3 and 2934 DF,  p-value: < 0.00000000000000022


Observation findings :

  1. Apart from the intercept, the largest parameter estimate in the EDAModel is Income.composition.of.resources (10.40).

  2. Income.composition.of.resources has the largest positive coefficient: with the other predictors held fixed, a one-unit increase is associated with about 10.4 more years of Life.expectancy.

  3. The p-values of all predictors are far below 0.05 (all below 2e-16), indicating that they are highly significant predictors of Life.expectancy.

  4. The EDAModel has an R-squared value of 0.711, which means it explains about 71.1% of the variance in Life.expectancy.
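
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the EDAModel summary above and applies them to a hypothetical country profile (12 years of schooling, adult mortality of 150, income composition of 0.7); the profile values and the names eda_coefs, profile, and predicted_le are assumptions made for illustration.

```r
# Coefficients copied from the EDAModel summary above
eda_coefs <- c("(Intercept)" = 56.5263202,
               Schooling = 0.9883753,
               Adult.Mortality = -0.0345377,
               Income.composition.of.resources = 10.4014206)

# Hypothetical country profile, chosen only for illustration
# (the leading 1 multiplies the intercept)
profile <- c("(Intercept)" = 1,
             Schooling = 12,
             Adult.Mortality = 150,
             Income.composition.of.resources = 0.7)

# Linear prediction: intercept plus the sum of coefficient * value
predicted_le <- sum(eda_coefs * profile[names(eda_coefs)])
print(round(predicted_le, 2))
```

For this profile the prediction comes out at roughly 70.5 years, which is what predict(EDAModel, newdata = ...) would return for a data frame holding the same three values.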

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of EDAModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = EDAModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using EDA model",
     xlab = "Actual",
     ylab = "Predicted(EDAModel)",
     pch = 19)
abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(EDAModel$fitted.values ~ EDAModel$model$Life.expectancy),
       col = "red", lwd = 2)

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted values of EDAModel with interval bounds.


The blue line is the regression line of the predicted on the actual values, and the grey band around it is the 95% confidence interval for that line. The dashed red lines mark the lower and upper bounds of the 95% prediction interval returned by predict(). Every plotted point is a fitted value, so it lies between these two bounds by construction.

# predict Life expectancy
predictedLE2 <- predict(EDAModel, interval = "prediction")

# combine the actual data and predicted data
comb2 <- cbind.data.frame(clean_data, predictedLE2)

# plot predicted vs. actual values with the prediction-interval bounds
ggplot(comb2, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +  # lower prediction bound
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +  # upper prediction bound
  geom_smooth(method = lm, se = TRUE) +  # regression line with its confidence band
  ggtitle("Actual vs. Predicted for EDAModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")

Backward Step Model


Modelling

Let’s build a model using stepwise regression with backward elimination!


The predictors of the BackwardStepModel are Status, Adult.Mortality, infant.deaths, Alcohol, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, GDP, thinness..1.19.years, Income.composition.of.resources, and Schooling.

BackwardStepModel <- step(fullModel,
                      direction = "backward")
## Start:  AIC=8232.93
## Life.expectancy ~ (Country + Year + Status + Adult.Mortality + 
##     infant.deaths + Alcohol + percentage.expenditure + Hepatitis.B + 
##     Measles + BMI + under.five.deaths + Polio + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + 
##     thinness.5.9.years + Income.composition.of.resources + Schooling) - 
##     Country
## 
##                                   Df Sum of Sq   RSS    AIC
## - Population                       1       0.0 47735 8230.9
## - thinness.5.9.years               1       0.5 47735 8231.0
## - percentage.expenditure           1       2.1 47737 8231.1
## - Year                             1       8.4 47743 8231.4
## <none>                                         47735 8232.9
## - thinness..1.19.years             1      43.5 47778 8233.6
## - Total.expenditure                1      76.8 47811 8235.7
## - Alcohol                          1      90.5 47825 8236.5
## - Measles                          1     102.3 47837 8237.2
## - GDP                              1     156.3 47891 8240.5
## - Hepatitis.B                      1     326.1 48061 8250.9
## - Status                           1     576.2 48311 8266.2
## - Polio                            1     679.2 48414 8272.4
## - Diphtheria                       1    1261.8 48996 8307.6
## - BMI                              1    1335.1 49070 8312.0
## - Income.composition.of.resources  1    1358.8 49093 8313.4
## - infant.deaths                    1    2272.4 50007 8367.6
## - under.five.deaths                1    2372.7 50107 8373.5
## - Schooling                        1    3938.4 51673 8463.9
## - Adult.Mortality                  1   10151.4 57886 8797.4
## - HIV.AIDS                         1   11717.2 59452 8875.8
## 
## Step:  AIC=8230.93
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - thinness.5.9.years               1       0.5 47735 8229.0
## - percentage.expenditure           1       2.1 47737 8229.1
## - Year                             1       8.4 47743 8229.4
## <none>                                         47735 8230.9
## - thinness..1.19.years             1      43.5 47778 8231.6
## - Total.expenditure                1      76.8 47811 8233.7
## - Alcohol                          1      90.6 47825 8234.5
## - Measles                          1     102.4 47837 8235.2
## - GDP                              1     156.3 47891 8238.5
## - Hepatitis.B                      1     327.6 48062 8249.0
## - Status                           1     576.2 48311 8264.2
## - Polio                            1     679.2 48414 8270.4
## - Diphtheria                       1    1263.8 48998 8305.7
## - BMI                              1    1335.8 49070 8310.0
## - Income.composition.of.resources  1    1358.9 49094 8311.4
## - infant.deaths                    1    2346.3 50081 8369.9
## - under.five.deaths                1    2411.0 50146 8373.7
## - Schooling                        1    3939.6 51674 8461.9
## - Adult.Mortality                  1   10152.5 57887 8795.5
## - HIV.AIDS                         1   11717.2 59452 8873.9
## 
## Step:  AIC=8228.96
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + thinness..1.19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - percentage.expenditure           1       2.1 47737 8227.1
## - Year                             1       8.4 47744 8227.5
## <none>                                         47735 8229.0
## - Total.expenditure                1      76.4 47811 8231.7
## - Alcohol                          1      90.4 47826 8232.5
## - Measles                          1     103.0 47838 8233.3
## - GDP                              1     156.1 47891 8236.6
## - thinness..1.19.years             1     159.7 47895 8236.8
## - Hepatitis.B                      1     327.6 48063 8247.1
## - Status                           1     575.8 48311 8262.2
## - Polio                            1     678.7 48414 8268.4
## - Diphtheria                       1    1265.5 49001 8303.8
## - BMI                              1    1348.6 49084 8308.8
## - Income.composition.of.resources  1    1359.4 49095 8309.5
## - infant.deaths                    1    2357.7 50093 8368.6
## - under.five.deaths                1    2418.8 50154 8372.2
## - Schooling                        1    3941.7 51677 8460.1
## - Adult.Mortality                  1   10156.8 57892 8793.7
## - HIV.AIDS                         1   11721.7 59457 8872.1
## 
## Step:  AIC=8227.09
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + 
##     Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     thinness..1.19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## - Year                             1       9.4 47747 8225.7
## <none>                                         47737 8227.1
## - Total.expenditure                1      82.3 47820 8230.2
## - Alcohol                          1      93.1 47830 8230.8
## - Measles                          1     103.0 47840 8231.4
## - thinness..1.19.years             1     159.9 47897 8234.9
## - Hepatitis.B                      1     331.6 48069 8245.4
## - Status                           1     583.8 48321 8260.8
## - Polio                            1     676.9 48414 8266.5
## - GDP                              1     818.6 48556 8275.0
## - Diphtheria                       1    1265.5 49003 8302.0
## - BMI                              1    1347.4 49085 8306.9
## - Income.composition.of.resources  1    1357.4 49095 8307.5
## - infant.deaths                    1    2361.4 50099 8366.9
## - under.five.deaths                1    2422.3 50159 8370.5
## - Schooling                        1    3941.8 51679 8458.2
## - Adult.Mortality                  1   10154.7 57892 8791.7
## - HIV.AIDS                         1   11730.6 59468 8870.6
## 
## Step:  AIC=8225.67
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     thinness..1.19.years + Income.composition.of.resources + 
##     Schooling
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         47747 8225.7
## - Total.expenditure                1      77.7 47824 8228.4
## - Measles                          1     100.0 47847 8229.8
## - Alcohol                          1     103.5 47850 8230.0
## - thinness..1.19.years             1     162.3 47909 8233.6
## - Hepatitis.B                      1     329.1 48076 8243.8
## - Status                           1     598.8 48345 8260.3
## - Polio                            1     680.9 48428 8265.3
## - GDP                              1     814.2 48561 8273.3
## - Diphtheria                       1    1257.0 49004 8300.0
## - BMI                              1    1348.5 49095 8305.5
## - Income.composition.of.resources  1    1352.2 49099 8305.7
## - infant.deaths                    1    2373.7 50120 8366.2
## - under.five.deaths                1    2435.4 50182 8369.8
## - Schooling                        1    3937.5 51684 8456.5
## - Adult.Mortality                  1   10266.6 58013 8795.9
## - HIV.AIDS                         1   11809.6 59556 8873.0
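
The AIC values printed by step() can be reproduced from the RSS column. The following is a minimal sketch, assuming the formula n * log(RSS / n) + 2 * (number of coefficients) that R's extractAIC() applies to linear models; the helper name step_aic is hypothetical, and the inputs (2938 observations, 21 coefficients for the fullModel, 1 for the nullModel) are read from the model summaries in this section.

```r
# AIC as reported by step() for lm fits, up to rounding of the printed RSS
# (hypothetical helper, not part of the original analysis)
step_aic <- function(rss, n, n_coef) {
  n * log(rss / n) + 2 * n_coef
}

n_obs <- 2938  # residual df of the nullModel (2937) plus its one coefficient

# fullModel: RSS = 47735 with 21 coefficients; step() prints 8232.93
aic_full <- step_aic(rss = 47735, n = n_obs, n_coef = 21)

# nullModel: RSS = 265573 with only the intercept; step() prints 13235.23
aic_null <- step_aic(rss = 265573, n = n_obs, n_coef = 1)

print(c(full = aic_full, null = aic_null))
```

At each step, backward elimination drops the term whose removal gives the lowest AIC and stops once the <none> row is at the top of the table, as in the last step above.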

Predictions

The model is set! Let’s make a prediction based on the BackwardStepModel!

BackwardStepModelPreds <- predict(BackwardStepModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the BackwardStepModel!

summary(BackwardStepModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     thinness..1.19.years + Income.composition.of.resources + 
##     Schooling, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2503  -2.2438  -0.1097   2.3665  16.3824 
## 
## Coefficients:
##                                     Estimate   Std. Error t value
## (Intercept)                     56.745781830  0.668314373  84.909
## StatusDeveloping                -1.625002116  0.268478794  -6.053
## Adult.Mortality                 -0.019850253  0.000792059 -25.062
## infant.deaths                    0.099679379  0.008271756  12.051
## Alcohol                          0.064726322  0.025725807   2.516
## Hepatitis.B                     -0.016610250  0.003701821  -4.487
## Measles                         -0.000018876  0.000007631  -2.474
## BMI                              0.044354554  0.004883370   9.083
## under.five.deaths               -0.074629707  0.006114028 -12.206
## Polio                            0.028664684  0.004441145   6.454
## Total.expenditure                0.073561760  0.033733716   2.181
## Diphtheria                       0.040593295  0.004629077   8.769
## HIV.AIDS                        -0.470653224  0.017510073 -26.879
## GDP                              0.000046752  0.000006624   7.058
## thinness..1.19.years            -0.074951946  0.023787239  -3.151
## Income.composition.of.resources  5.762655777  0.633592391   9.095
## Schooling                        0.645416268  0.041585022  15.520
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## StatusDeveloping                    0.00000000160683 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## infant.deaths                   < 0.0000000000000002 ***
## Alcohol                                      0.01192 *  
## Hepatitis.B                         0.00000750069921 ***
## Measles                                      0.01343 *  
## BMI                             < 0.0000000000000002 ***
## under.five.deaths               < 0.0000000000000002 ***
## Polio                               0.00000000012680 ***
## Total.expenditure                            0.02929 *  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## GDP                                 0.00000000000211 ***
## thinness..1.19.years                         0.00164 ** 
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.8192 
## F-statistic: 832.9 on 16 and 2921 DF,  p-value: < 0.00000000000000022


Observation findings :

  1. Apart from the intercept, the largest parameter estimates in the BackwardStepModel are Income.composition.of.resources (5.763) and StatusDeveloping (-1.625).

  2. Income.composition.of.resources has the largest positive coefficient: with the other predictors held fixed, a one-unit increase is associated with about 5.76 more years of Life.expectancy.

  3. On the other hand, StatusDeveloping has the largest negative coefficient: with the other predictors held fixed, a Developing country is predicted to have a Life.expectancy about 1.63 years lower than a country in the reference Status category.

  4. The p-values for the intercept, Adult.Mortality, infant.deaths, BMI, under.five.deaths, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest in the model (below 2e-16), indicating that these are highly significant predictors of Life.expectancy.

  5. The BackwardStepModel has an R-squared value of 0.8202, which means it explains about 82% of the variance in Life.expectancy.
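
The BackwardStepModel uses 16 predictors instead of 20 yet reports nearly the same fit, and the adjusted R-squared shows why: it penalizes the number of predictors. Here is a minimal sketch using the standard adjusted R-squared formula; the helper name adj_r2 is hypothetical, and the inputs are the rounded values printed in the two summaries.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
# (hypothetical helper, not part of the original analysis)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 2938  # observations in clean_data, from the residual degrees of freedom

# fullModel: R-squared 0.8203 with 20 predictors
adj_full <- adj_r2(r2 = 0.8203, n = n_obs, p = 20)

# BackwardStepModel: R-squared 0.8202 with 16 predictors
adj_back <- adj_r2(r2 = 0.8202, n = n_obs, p = 16)

print(round(c(fullModel = adj_full, BackwardStepModel = adj_back), 4))
```

Both results land close to the reported 0.819 and 0.8192; small differences come from starting with rounded R-squared values.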

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of BackwardStepModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = BackwardStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Backward Step model",
     xlab = "Actual",
     ylab = "Predicted(BackwardStepModel)",
     pch = 19)
abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(BackwardStepModel$fitted.values ~ BackwardStepModel$model$Life.expectancy),
       col = "red", lwd = 2)

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted values of BackwardStepModel with interval bounds.


The blue line is the regression line of the predicted on the actual values, and the grey band around it is the 95% confidence interval for that line. The dashed red lines mark the lower and upper bounds of the 95% prediction interval returned by predict(). Every plotted point is a fitted value, so it lies between these two bounds by construction.

# predict Life expectancy
predictedLE3 <- predict(BackwardStepModel, interval = "prediction")

# combine the actual data and predicted data
comb3 <- cbind.data.frame(clean_data, predictedLE3)

# plot predicted vs. actual values with the prediction-interval bounds
ggplot(comb3, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +  # lower prediction bound
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +  # upper prediction bound
  geom_smooth(method = lm, se = TRUE) +  # regression line with its confidence band
  ggtitle("Actual vs. Predicted for BackwardStepModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")

Forward Step Model


Modelling

Let’s build a model using stepwise regression with forward selection!


The predictors of the forwardStepModel are Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, Status, Polio, GDP, Hepatitis.B, under.five.deaths, infant.deaths, thinness..1.19.years, Alcohol, Measles, and Total.expenditure.

forwardStepModel <- step(nullModel,
                         direction="forward",
                         scope=list(lower=nullModel,
                                    upper=fullModel))
## Start:  AIC=13235.23
## Life.expectancy ~ 1
## 
##                                   Df Sum of Sq    RSS   AIC
## + Schooling                        1    135029 130544 11151
## + Adult.Mortality                  1    128792 136781 11288
## + Income.composition.of.resources  1    127379 138194 11318
## + BMI                              1     83419 182155 12130
## + HIV.AIDS                         1     82306 183267 12147
## + Status                           1     61549 204024 12463
## + Diphtheria                       1     59218 206355 12496
## + thinness..1.19.years             1     58167 207406 12511
## + thinness.5.9.years               1     56801 208772 12530
## + Polio                            1     55805 209768 12544
## + GDP                              1     49210 216363 12635
## + Alcohol                          1     40534 225039 12751
## + percentage.expenditure           1     38636 226938 12775
## + under.five.deaths                1     13176 252397 13088
## + Total.expenditure                1     11583 253990 13106
## + infant.deaths                    1     10282 255291 13121
## + Year                             1      7749 257824 13150
## + Hepatitis.B                      1      7695 257878 13151
## + Measles                          1      6610 258963 13163
## + Population                       1       224 265350 13235
## <none>                                         265573 13235
## 
## Step:  AIC=11150.71
## Life.expectancy ~ Schooling
## 
##                                   Df Sum of Sq    RSS     AIC
## + Adult.Mortality                  1     49061  81483  9768.0
## + HIV.AIDS                         1     44779  85765  9918.5
## + BMI                              1     14155 116389 10815.5
## + Diphtheria                       1     12645 117899 10853.4
## + Income.composition.of.resources  1     11318 119225 10886.3
## + Polio                            1     11213 119331 10888.9
## + thinness.5.9.years               1      8113 122431 10964.2
## + thinness..1.19.years             1      8021 122523 10966.4
## + Status                           1      5919 124625 11016.4
## + GDP                              1      4882 125662 11040.7
## + percentage.expenditure           1      3515 127029 11072.5
## + under.five.deaths                1      1588 128955 11116.7
## + Measles                          1      1383 129161 11121.4
## + Hepatitis.B                      1      1308 129235 11123.1
## + infant.deaths                    1      1013 129531 11129.8
## + Total.expenditure                1       750 129793 11135.8
## + Alcohol                          1       424 130120 11143.1
## + Year                             1       183 130361 11148.6
## <none>                                         130544 11150.7
## + Population                       1         2 130542 11152.7
## 
## Step:  AIC=9767.98
## Life.expectancy ~ Schooling + Adult.Mortality
## 
##                                   Df Sum of Sq   RSS    AIC
## + HIV.AIDS                         1   14069.6 67413 9213.1
## + Diphtheria                       1    7221.9 74261 9497.3
## + Polio                            1    6103.7 75379 9541.2
## + BMI                              1    5520.4 75962 9563.9
## + Income.composition.of.resources  1    4734.9 76748 9594.1
## + thinness..1.19.years             1    3713.6 77769 9632.9
## + thinness.5.9.years               1    3479.5 78003 9641.8
## + Status                           1    2376.4 79106 9683.0
## + GDP                              1    1982.1 79501 9697.6
## + Measles                          1    1798.9 79684 9704.4
## + percentage.expenditure           1    1551.3 79931 9713.5
## + under.five.deaths                1    1492.1 79991 9715.7
## + infant.deaths                    1    1075.2 80407 9731.0
## + Alcohol                          1     791.2 80691 9741.3
## + Total.expenditure                1     541.6 80941 9750.4
## + Hepatitis.B                      1     433.2 81049 9754.3
## + Year                             1     246.7 81236 9761.1
## <none>                                         81483 9768.0
## + Population                       1      44.3 81438 9768.4
## 
## Step:  AIC=9213.08
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
## 
##                                   Df Sum of Sq   RSS    AIC
## + Diphtheria                       1    6587.0 60826 8913.0
## + Polio                            1    5636.2 61777 8958.6
## + BMI                              1    4450.0 62963 9014.4
## + Income.composition.of.resources  1    4032.2 63381 9033.9
## + thinness..1.19.years             1    2808.0 64605 9090.1
## + thinness.5.9.years               1    2604.4 64809 9099.3
## + Status                           1    2582.3 64831 9100.3
## + GDP                              1    2295.7 65117 9113.3
## + percentage.expenditure           1    1884.8 65528 9131.8
## + Measles                          1    1618.6 65794 9143.7
## + under.five.deaths                1    1600.3 65813 9144.5
## + Alcohol                          1    1274.3 66139 9159.0
## + infant.deaths                    1    1214.8 66198 9161.6
## + Total.expenditure                1     987.1 66426 9171.7
## + Hepatitis.B                      1     314.9 67098 9201.3
## + Population                       1      73.9 67339 9211.9
## <none>                                         67413 9213.1
## + Year                             1       2.3 67411 9215.0
## 
## Step:  AIC=8912.99
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria
## 
##                                   Df Sum of Sq   RSS    AIC
## + BMI                              1    3596.0 57230 8735.9
## + Income.composition.of.resources  1    3185.5 57640 8756.9
## + Status                           1    2422.7 58403 8795.6
## + thinness..1.19.years             1    2361.1 58465 8798.7
## + GDP                              1    2233.9 58592 8805.1
## + thinness.5.9.years               1    2213.3 58613 8806.1
## + percentage.expenditure           1    1993.4 58833 8817.1
## + Alcohol                          1    1068.0 59758 8862.9
## + Polio                            1    1026.4 59800 8865.0
## + Measles                          1     999.2 59827 8866.3
## + under.five.deaths                1     879.1 59947 8872.2
## + Total.expenditure                1     663.8 60162 8882.7
## + infant.deaths                    1     657.9 60168 8883.0
## + Hepatitis.B                      1     349.8 60476 8898.0
## + Population                       1      43.7 60782 8912.9
## <none>                                         60826 8913.0
## + Year                             1      10.3 60816 8914.5
## 
## Step:  AIC=8735.95
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI
## 
##                                   Df Sum of Sq   RSS    AIC
## + Income.composition.of.resources  1   2544.82 54685 8604.3
## + Status                           1   2111.11 55119 8627.5
## + GDP                              1   1959.73 55270 8635.6
## + percentage.expenditure           1   1854.67 55375 8641.2
## + Polio                            1    856.71 56373 8693.6
## + thinness..1.19.years             1    782.07 56448 8697.5
## + Alcohol                          1    725.32 56505 8700.5
## + thinness.5.9.years               1    657.35 56573 8704.0
## + Measles                          1    568.01 56662 8708.6
## + under.five.deaths                1    431.96 56798 8715.7
## + Hepatitis.B                      1    339.63 56890 8720.5
## + Total.expenditure                1    304.28 56926 8722.3
## + infant.deaths                    1    280.40 56950 8723.5
## <none>                                         57230 8735.9
## + Year                             1      8.62 57221 8737.5
## + Population                       1      7.38 57223 8737.6
## 
## Step:  AIC=8604.31
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources
## 
##                          Df Sum of Sq   RSS    AIC
## + Status                  1   1699.37 52986 8513.6
## + percentage.expenditure  1   1394.08 53291 8530.4
## + GDP                     1   1346.68 53338 8533.1
## + Polio                   1    834.91 53850 8561.1
## + thinness..1.19.years    1    700.86 53984 8568.4
## + Alcohol                 1    641.60 54044 8571.6
## + thinness.5.9.years      1    605.26 54080 8573.6
## + Measles                 1    536.56 54149 8577.3
## + under.five.deaths       1    505.38 54180 8579.0
## + Total.expenditure       1    439.61 54246 8582.6
## + infant.deaths           1    352.37 54333 8587.3
## + Hepatitis.B             1    257.09 54428 8592.5
## + Year                    1     83.26 54602 8601.8
## <none>                                54685 8604.3
## + Population              1     17.23 54668 8605.4
## 
## Step:  AIC=8513.56
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status
## 
##                          Df Sum of Sq   RSS    AIC
## + Polio                   1    802.17 52184 8470.7
## + GDP                     1    699.53 52286 8476.5
## + percentage.expenditure  1    662.90 52323 8478.6
## + Measles                 1    511.75 52474 8487.0
## + under.five.deaths       1    491.60 52494 8488.2
## + thinness..1.19.years    1    392.03 52594 8493.7
## + Hepatitis.B             1    345.31 52640 8496.4
## + infant.deaths           1    329.36 52656 8497.2
## + thinness.5.9.years      1    315.68 52670 8498.0
## + Total.expenditure       1    153.60 52832 8507.0
## + Alcohol                 1     59.81 52926 8512.2
## <none>                                52986 8513.6
## + Year                    1     13.42 52972 8514.8
## + Population              1      9.94 52976 8515.0
## 
## Step:  AIC=8470.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio
## 
##                          Df Sum of Sq   RSS    AIC
## + GDP                     1    682.97 51501 8434.0
## + percentage.expenditure  1    673.70 51510 8434.6
## + Hepatitis.B             1    464.89 51719 8446.5
## + Measles                 1    462.60 51721 8446.6
## + under.five.deaths       1    433.02 51751 8448.3
## + thinness..1.19.years    1    395.77 51788 8450.4
## + thinness.5.9.years      1    309.75 51874 8455.3
## + infant.deaths           1    285.54 51898 8456.6
## + Total.expenditure       1    150.80 52033 8464.2
## + Alcohol                 1     56.41 52127 8469.6
## <none>                                52184 8470.7
## + Year                    1      8.62 52175 8472.3
## + Population              1      5.97 52178 8472.4
## 
## Step:  AIC=8434.04
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP
## 
##                          Df Sum of Sq   RSS    AIC
## + Hepatitis.B             1    462.41 51038 8409.5
## + Measles                 1    452.06 51049 8410.1
## + under.five.deaths       1    418.70 51082 8412.1
## + thinness..1.19.years    1    379.40 51121 8414.3
## + thinness.5.9.years      1    286.65 51214 8419.6
## + infant.deaths           1    270.37 51230 8420.6
## + Total.expenditure       1    179.55 51321 8425.8
## + Alcohol                 1     57.66 51443 8432.7
## + percentage.expenditure  1     42.17 51458 8433.6
## <none>                                51501 8434.0
## + Year                    1     13.23 51487 8435.3
## + Population              1      5.06 51496 8435.7
## 
## Step:  AIC=8409.54
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B
## 
##                          Df Sum of Sq   RSS    AIC
## + under.five.deaths       1    513.06 50525 8381.9
## + Measles                 1    461.23 50577 8384.9
## + thinness..1.19.years    1    380.09 50658 8389.6
## + infant.deaths           1    351.84 50686 8391.2
## + thinness.5.9.years      1    291.88 50746 8394.7
## + Total.expenditure       1    167.68 50871 8401.9
## + Alcohol                 1     53.27 50985 8408.5
## <none>                                51038 8409.5
## + percentage.expenditure  1     27.04 51011 8410.0
## + Population              1     23.99 51014 8410.2
## + Year                    1     17.07 51021 8410.6
## 
## Step:  AIC=8381.86
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths
## 
##                          Df Sum of Sq   RSS    AIC
## + infant.deaths           1   2223.33 48302 8251.6
## + Measles                 1    145.58 50380 8375.4
## + Total.expenditure       1    136.09 50389 8375.9
## + thinness..1.19.years    1    122.92 50402 8376.7
## + Population              1     74.17 50451 8379.5
## + thinness.5.9.years      1     69.02 50456 8379.8
## + Alcohol                 1     58.10 50467 8380.5
## <none>                                50525 8381.9
## + percentage.expenditure  1     26.98 50498 8382.3
## + Year                    1     17.75 50507 8382.8
## 
## Step:  AIC=8251.64
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths
## 
##                          Df Sum of Sq   RSS    AIC
## + thinness..1.19.years    1   248.735 48053 8238.5
## + Alcohol                 1   196.874 48105 8241.6
## + thinness.5.9.years      1   190.728 48111 8242.0
## + Total.expenditure       1   146.782 48155 8244.7
## + Measles                 1    87.496 48214 8248.3
## <none>                                48302 8251.6
## + percentage.expenditure  1    23.257 48279 8252.2
## + Year                    1    14.517 48287 8252.8
## + Population              1     0.068 48302 8253.6
## 
## Step:  AIC=8238.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years
## 
##                          Df Sum of Sq   RSS    AIC
## + Alcohol                 1   120.915 47932 8233.1
## + Total.expenditure       1   107.663 47945 8233.9
## + Measles                 1   103.706 47949 8234.1
## <none>                                48053 8238.5
## + percentage.expenditure  1    17.121 48036 8239.4
## + Year                    1     9.060 48044 8239.9
## + Population              1     0.251 48053 8240.5
## + thinness.5.9.years      1     0.143 48053 8240.5
## 
## Step:  AIC=8233.07
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol
## 
##                          Df Sum of Sq   RSS    AIC
## + Measles                 1   107.930 47824 8228.4
## + Total.expenditure       1    85.647 47847 8229.8
## <none>                                47932 8233.1
## + percentage.expenditure  1     9.067 47923 8234.5
## + Year                    1     2.540 47930 8234.9
## + thinness.5.9.years      1     0.376 47932 8235.0
## + Population              1     0.034 47932 8235.1
## 
## Step:  AIC=8228.45
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol + Measles
## 
##                          Df Sum of Sq   RSS    AIC
## + Total.expenditure       1    77.730 47747 8225.7
## <none>                                47824 8228.4
## + percentage.expenditure  1     8.980 47815 8229.9
## + Year                    1     4.783 47820 8230.2
## + thinness.5.9.years      1     0.085 47824 8230.4
## + Population              1     0.023 47824 8230.4
## 
## Step:  AIC=8225.67
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol + Measles + Total.expenditure
## 
##                          Df Sum of Sq   RSS    AIC
## <none>                                47747 8225.7
## + Year                    1    9.3541 47737 8227.1
## + percentage.expenditure  1    3.0234 47744 8227.5
## + thinness.5.9.years      1    0.5265 47746 8227.6
## + Population              1    0.0011 47747 8227.7
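
The AIC values in the trace above can be reproduced by hand. For a linear model, step() scores each candidate as n*log(RSS/n) + 2*k, where n is the number of observations and k is the number of estimated coefficients. Here is a minimal sketch, assuming n = 2938 (the 2921 residual degrees of freedom plus the 17 coefficients reported in the model summary). The name step_aic is a hypothetical helper, not an R function.

```r
# Hypothetical helper: AIC as reported in the step() trace for a linear model
step_aic <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n <- 2938  # assumed: 2921 residual df + 17 estimated coefficients

# RSS values copied from the first two steps of the trace
aic_null      <- step_aic(rss = 265573, n = n, k = 1)  # Life.expectancy ~ 1
aic_schooling <- step_aic(rss = 130544, n = n, k = 2)  # Life.expectancy ~ Schooling

print(round(c(null = aic_null, schooling = aic_schooling), 1))
```

The results agree with the reported 13235.23 and 11150.71 up to the rounding of RSS in the printed trace. At each step, the candidate with the lowest AIC is added, and the search stops when <none> is at the top of the table.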

Predictions

The model is set! Let’s make a prediction based on the forwardStepModel!

forwardStepModelPreds <- predict(forwardStepModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values
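
Because newdata is the same clean_data that was used for fitting, these predictions coincide with the model's fitted values. The sketch below illustrates this on a stand-in model and also computes an in-sample RMSE. It uses the built-in mtcars data set because clean_data is not reproduced here, so the names toyModel and toyPreds and all resulting numbers are illustrative only.

```r
# Stand-in model on built-in data (assumption: mtcars replaces clean_data)
toyModel <- lm(mpg ~ wt + hp, data = mtcars)

# Predicting on the training data reproduces the fitted values
toyPreds <- predict(toyModel, newdata = mtcars, type = "response")
same_as_fitted <- all.equal(unname(toyPreds), unname(fitted(toyModel)))

# In-sample root mean squared error
rmse <- sqrt(mean((mtcars$mpg - toyPreds)^2))

print(same_as_fitted)
print(rmse)
```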

Model Interpretation

Let’s interpret the forwardStepModel!

summary(forwardStepModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     HIV.AIDS + Diphtheria + BMI + Income.composition.of.resources + 
##     Status + Polio + GDP + Hepatitis.B + under.five.deaths + 
##     infant.deaths + thinness..1.19.years + Alcohol + Measles + 
##     Total.expenditure, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2503  -2.2438  -0.1097   2.3665  16.3824 
## 
## Coefficients:
##                                     Estimate   Std. Error t value
## (Intercept)                     56.745781830  0.668314373  84.909
## Schooling                        0.645416268  0.041585022  15.520
## Adult.Mortality                 -0.019850253  0.000792059 -25.062
## HIV.AIDS                        -0.470653224  0.017510073 -26.879
## Diphtheria                       0.040593295  0.004629077   8.769
## BMI                              0.044354554  0.004883370   9.083
## Income.composition.of.resources  5.762655777  0.633592391   9.095
## StatusDeveloping                -1.625002116  0.268478794  -6.053
## Polio                            0.028664684  0.004441145   6.454
## GDP                              0.000046752  0.000006624   7.058
## Hepatitis.B                     -0.016610250  0.003701821  -4.487
## under.five.deaths               -0.074629707  0.006114028 -12.206
## infant.deaths                    0.099679379  0.008271756  12.051
## thinness..1.19.years            -0.074951946  0.023787239  -3.151
## Alcohol                          0.064726322  0.025725807   2.516
## Measles                         -0.000018876  0.000007631  -2.474
## Total.expenditure                0.073561760  0.033733716   2.181
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## Diphtheria                      < 0.0000000000000002 ***
## BMI                             < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## StatusDeveloping                    0.00000000160683 ***
## Polio                               0.00000000012680 ***
## GDP                                 0.00000000000211 ***
## Hepatitis.B                         0.00000750069921 ***
## under.five.deaths               < 0.0000000000000002 ***
## infant.deaths                   < 0.0000000000000002 ***
## thinness..1.19.years                         0.00164 ** 
## Alcohol                                      0.01192 *  
## Measles                                      0.01343 *  
## Total.expenditure                            0.02929 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.8192 
## F-statistic: 832.9 on 16 and 2921 DF,  p-value: < 0.00000000000000022
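
Each row of the coefficient table can be checked by hand: the t value is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom. Here is a short sketch using the Total.expenditure row and the final RSS (47747) from the forward step trace. The variable names are illustrative.

```r
estimate  <- 0.073561760   # Total.expenditure estimate from the summary
std_error <- 0.033733716   # its standard error
df_resid  <- 2921          # residual degrees of freedom

# t value and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Residual standard error = sqrt(RSS / residual df)
rse <- sqrt(47747 / df_resid)

print(round(c(t = t_value, p = p_value, rse = rse), 4))
```

These should match the reported t value (2.181), p-value (0.02929) and residual standard error (4.043) up to rounding.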


Observation findings :

  1. Excluding the intercept, the largest coefficient estimate in the forwardStepModel is Income.composition.of.resources (5.763), followed in magnitude by StatusDeveloping (-1.625).

  2. Holding the other predictors fixed, Income.composition.of.resources has the largest positive coefficient. Because the predictors are measured on different scales, this is the effect per unit of the predictor and not a scale-free ranking of importance.

  3. On the other hand, StatusDeveloping has the largest negative coefficient: with the other predictors equal, the predicted Life.expectancy is about 1.6 lower than for the reference Status category.

  4. The p-values for the intercept, Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, under.five.deaths, and infant.deaths are the smallest among all terms (each below 2e-16), indicating that these are highly significant predictors of Life.expectancy.

  5. The forwardStepModel has an R-squared value of 0.8202, which means the model explains about 82% of the variance in Life.expectancy.
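
The R-squared figures follow directly from the residual sums of squares in the step() trace: 265573 for the intercept-only model and 47747 for the final model. Here is a minimal sketch of the arithmetic, assuming n = 2938 observations and p = 16 predictor terms, as implied by the degrees of freedom in the summary.

```r
rss_null  <- 265573  # total sum of squares (intercept-only model)
rss_model <- 47747   # residual sum of squares of forwardStepModel
n <- 2938            # assumed: 2921 residual df + 17 coefficients
p <- 16              # predictor terms, excluding the intercept

# Share of variance explained, and its degrees-of-freedom adjusted version
r2     <- 1 - rss_model / rss_null
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained mean square over residual mean square
f_stat <- ((rss_null - rss_model) / p) / (rss_model / (n - p - 1))

print(round(c(r2 = r2, adj_r2 = adj_r2, f = f_stat), 4))
```

The values should agree with the Multiple R-squared (0.8202), Adjusted R-squared (0.8192) and F-statistic (832.9) in the summary, up to rounding of the printed RSS.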

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of forwardStepModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = forwardStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Forward Step model",
     xlab = "Actual",
     ylab = "Predicted(forwardStepModel)",
     pch = 19)

abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(forwardStepModel$fitted.values ~ forwardStepModel$model$Life.expectancy),
       col = "red", lwd = 2)
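
For a least-squares model with an intercept, regressing the fitted values on the actual values gives a slope equal to the model's R-squared. This is why the red line is flatter than the green 45-degree line. Here is a small sketch of this property on a stand-in model. It uses the built-in mtcars data as an assumption, because clean_data is not reproduced here, and toyModel is an illustrative name.

```r
# Stand-in model on built-in data (assumption: mtcars replaces clean_data)
toyModel <- lm(mpg ~ wt + hp, data = mtcars)

# Regress fitted values on actual values, as done for the red line
lineFit <- lm(fitted(toyModel) ~ mtcars$mpg)
slope   <- unname(coef(lineFit)[2])
r2      <- summary(toyModel)$r.squared

print(c(slope = slope, r.squared = r2))
```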

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted of forwardStepModel with Confidence Interval.


The blue line is the regression line of the predicted values on the actual values, and the grey shade around it is its 95% confidence band. The dashed red lines mark the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within the 95% prediction interval.

# predict Life expectancy
predictedLE4 <- predict(forwardStepModel, interval = "prediction")

# combine the actual data and predicted data
comb4 <- cbind.data.frame(clean_data, predictedLE4)

# Plotting the combined data
ggplot(comb4, aes(Life.expectancy, fit)) +
  geom_point() + 
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  geom_smooth(method = lm, se = TRUE) +  # regression line with its 95% confidence band
  ggtitle("Actual vs. Predicted for forwardStepModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")
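
predict() can return two kinds of interval: interval = "confidence" bounds the mean response, while interval = "prediction" bounds an individual new observation and is therefore always wider. Here is a hedged sketch of the difference on a stand-in model. As an assumption, mtcars replaces clean_data, and the names toyModel, conf_int and pred_int are illustrative.

```r
# Stand-in model on built-in data (assumption: mtcars replaces clean_data)
toyModel <- lm(mpg ~ wt + hp, data = mtcars)

# Passing newdata avoids the warning about intervals on the training data
conf_int <- predict(toyModel, newdata = mtcars, interval = "confidence", level = 0.95)
pred_int <- predict(toyModel, newdata = mtcars, interval = "prediction", level = 0.95)

# Width of each interval per observation
ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]

# Ratio > 1 everywhere: prediction intervals are wider
print(summary(pi_width / ci_width))
```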

Mixed Step Model


Modelling

Let’s build a model using step-wise regression with both forward and backward steps!


The predictors of the MixedStepModel are Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, Status, Polio, GDP, Hepatitis.B, under.five.deaths, infant.deaths, thinness..1.19.years, Alcohol, Measles, and Total.expenditure.

MixedStepModel <- step(nullModel,
                         scope=list(lower=nullModel,
                                    upper=fullModel),
                         direction="both")
## Start:  AIC=13235.23
## Life.expectancy ~ 1
## 
##                                   Df Sum of Sq    RSS   AIC
## + Schooling                        1    135029 130544 11151
## + Adult.Mortality                  1    128792 136781 11288
## + Income.composition.of.resources  1    127379 138194 11318
## + BMI                              1     83419 182155 12130
## + HIV.AIDS                         1     82306 183267 12147
## + Status                           1     61549 204024 12463
## + Diphtheria                       1     59218 206355 12496
## + thinness..1.19.years             1     58167 207406 12511
## + thinness.5.9.years               1     56801 208772 12530
## + Polio                            1     55805 209768 12544
## + GDP                              1     49210 216363 12635
## + Alcohol                          1     40534 225039 12751
## + percentage.expenditure           1     38636 226938 12775
## + under.five.deaths                1     13176 252397 13088
## + Total.expenditure                1     11583 253990 13106
## + infant.deaths                    1     10282 255291 13121
## + Year                             1      7749 257824 13150
## + Hepatitis.B                      1      7695 257878 13151
## + Measles                          1      6610 258963 13163
## + Population                       1       224 265350 13235
## <none>                                         265573 13235
## 
## Step:  AIC=11150.71
## Life.expectancy ~ Schooling
## 
##                                   Df Sum of Sq    RSS     AIC
## + Adult.Mortality                  1     49061  81483  9768.0
## + HIV.AIDS                         1     44779  85765  9918.5
## + BMI                              1     14155 116389 10815.5
## + Diphtheria                       1     12645 117899 10853.4
## + Income.composition.of.resources  1     11318 119225 10886.3
## + Polio                            1     11213 119331 10888.9
## + thinness.5.9.years               1      8113 122431 10964.2
## + thinness..1.19.years             1      8021 122523 10966.4
## + Status                           1      5919 124625 11016.4
## + GDP                              1      4882 125662 11040.7
## + percentage.expenditure           1      3515 127029 11072.5
## + under.five.deaths                1      1588 128955 11116.7
## + Measles                          1      1383 129161 11121.4
## + Hepatitis.B                      1      1308 129235 11123.1
## + infant.deaths                    1      1013 129531 11129.8
## + Total.expenditure                1       750 129793 11135.8
## + Alcohol                          1       424 130120 11143.1
## + Year                             1       183 130361 11148.6
## <none>                                         130544 11150.7
## + Population                       1         2 130542 11152.7
## - Schooling                        1    135029 265573 13235.2
## 
## Step:  AIC=9767.98
## Life.expectancy ~ Schooling + Adult.Mortality
## 
##                                   Df Sum of Sq    RSS     AIC
## + HIV.AIDS                         1     14070  67413  9213.1
## + Diphtheria                       1      7222  74261  9497.3
## + Polio                            1      6104  75379  9541.2
## + BMI                              1      5520  75962  9563.9
## + Income.composition.of.resources  1      4735  76748  9594.1
## + thinness..1.19.years             1      3714  77769  9632.9
## + thinness.5.9.years               1      3480  78003  9641.8
## + Status                           1      2376  79106  9683.0
## + GDP                              1      1982  79501  9697.6
## + Measles                          1      1799  79684  9704.4
## + percentage.expenditure           1      1551  79931  9713.5
## + under.five.deaths                1      1492  79991  9715.7
## + infant.deaths                    1      1075  80407  9731.0
## + Alcohol                          1       791  80691  9741.3
## + Total.expenditure                1       542  80941  9750.4
## + Hepatitis.B                      1       433  81049  9754.3
## + Year                             1       247  81236  9761.1
## <none>                                          81483  9768.0
## + Population                       1        44  81438  9768.4
## - Adult.Mortality                  1     49061 130544 11150.7
## - Schooling                        1     55298 136781 11287.8
## 
## Step:  AIC=9213.08
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
## 
##                                   Df Sum of Sq    RSS     AIC
## + Diphtheria                       1      6587  60826  8913.0
## + Polio                            1      5636  61777  8958.6
## + BMI                              1      4450  62963  9014.4
## + Income.composition.of.resources  1      4032  63381  9033.9
## + thinness..1.19.years             1      2808  64605  9090.1
## + thinness.5.9.years               1      2604  64809  9099.3
## + Status                           1      2582  64831  9100.3
## + GDP                              1      2296  65117  9113.3
## + percentage.expenditure           1      1885  65528  9131.8
## + Measles                          1      1619  65794  9143.7
## + under.five.deaths                1      1600  65813  9144.5
## + Alcohol                          1      1274  66139  9159.0
## + infant.deaths                    1      1215  66198  9161.6
## + Total.expenditure                1       987  66426  9171.7
## + Hepatitis.B                      1       315  67098  9201.3
## + Population                       1        74  67339  9211.9
## <none>                                          67413  9213.1
## + Year                             1         2  67411  9215.0
## - HIV.AIDS                         1     14070  81483  9768.0
## - Adult.Mortality                  1     18352  85765  9918.5
## - Schooling                        1     55892 123305 10985.1
## 
## Step:  AIC=8912.99
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria
## 
##                                   Df Sum of Sq    RSS     AIC
## + BMI                              1      3596  57230  8735.9
## + Income.composition.of.resources  1      3186  57640  8756.9
## + Status                           1      2423  58403  8795.6
## + thinness..1.19.years             1      2361  58465  8798.7
## + GDP                              1      2234  58592  8805.1
## + thinness.5.9.years               1      2213  58613  8806.1
## + percentage.expenditure           1      1993  58833  8817.1
## + Alcohol                          1      1068  59758  8862.9
## + Polio                            1      1026  59800  8865.0
## + Measles                          1       999  59827  8866.3
## + under.five.deaths                1       879  59947  8872.2
## + Total.expenditure                1       664  60162  8882.7
## + infant.deaths                    1       658  60168  8883.0
## + Hepatitis.B                      1       350  60476  8898.0
## + Population                       1        44  60782  8912.9
## <none>                                          60826  8913.0
## + Year                             1        10  60816  8914.5
## - Diphtheria                       1      6587  67413  9213.1
## - HIV.AIDS                         1     13435  74261  9497.3
## - Adult.Mortality                  1     16152  76978  9602.9
## - Schooling                        1     40329 101155 10405.4
## 
## Step:  AIC=8735.95
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI
## 
##                                   Df Sum of Sq   RSS    AIC
## + Income.composition.of.resources  1    2544.8 54685 8604.3
## + Status                           1    2111.1 55119 8627.5
## + GDP                              1    1959.7 55270 8635.6
## + percentage.expenditure           1    1854.7 55375 8641.2
## + Polio                            1     856.7 56373 8693.6
## + thinness..1.19.years             1     782.1 56448 8697.5
## + Alcohol                          1     725.3 56505 8700.5
## + thinness.5.9.years               1     657.4 56573 8704.0
## + Measles                          1     568.0 56662 8708.6
## + under.five.deaths                1     432.0 56798 8715.7
## + Hepatitis.B                      1     339.6 56890 8720.5
## + Total.expenditure                1     304.3 56926 8722.3
## + infant.deaths                    1     280.4 56950 8723.5
## <none>                                         57230 8735.9
## + Year                             1       8.6 57221 8737.5
## + Population                       1       7.4 57223 8737.6
## - BMI                              1    3596.0 60826 8913.0
## - Diphtheria                       1    5733.1 62963 9014.4
## - HIV.AIDS                         1   12527.5 69758 9315.5
## - Adult.Mortality                  1   13696.8 70927 9364.4
## - Schooling                        1   26760.3 83990 9861.0
## 
## Step:  AIC=8604.31
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources
## 
##                                   Df Sum of Sq   RSS    AIC
## + Status                           1    1699.4 52986 8513.6
## + percentage.expenditure           1    1394.1 53291 8530.4
## + GDP                              1    1346.7 53338 8533.1
## + Polio                            1     834.9 53850 8561.1
## + thinness..1.19.years             1     700.9 53984 8568.4
## + Alcohol                          1     641.6 54044 8571.6
## + thinness.5.9.years               1     605.3 54080 8573.6
## + Measles                          1     536.6 54149 8577.3
## + under.five.deaths                1     505.4 54180 8579.0
## + Total.expenditure                1     439.6 54246 8582.6
## + infant.deaths                    1     352.4 54333 8587.3
## + Hepatitis.B                      1     257.1 54428 8592.5
## + Year                             1      83.3 54602 8601.8
## <none>                                         54685 8604.3
## + Population                       1      17.2 54668 8605.4
## - Income.composition.of.resources  1    2544.8 57230 8735.9
## - BMI                              1    2955.3 57640 8756.9
## - Diphtheria                       1    5095.2 59780 8864.0
## - Schooling                        1    7143.4 61829 8963.0
## - HIV.AIDS                         1   12106.7 66792 9189.9
## - Adult.Mortality                  1   12307.2 66992 9198.7
## 
## Step:  AIC=8513.56
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status
## 
##                                   Df Sum of Sq   RSS    AIC
## + Polio                            1     802.2 52184 8470.7
## + GDP                              1     699.5 52286 8476.5
## + percentage.expenditure           1     662.9 52323 8478.6
## + Measles                          1     511.8 52474 8487.0
## + under.five.deaths                1     491.6 52494 8488.2
## + thinness..1.19.years             1     392.0 52594 8493.7
## + Hepatitis.B                      1     345.3 52640 8496.4
## + infant.deaths                    1     329.4 52656 8497.2
## + thinness.5.9.years               1     315.7 52670 8498.0
## + Total.expenditure                1     153.6 52832 8507.0
## + Alcohol                          1      59.8 52926 8512.2
## <none>                                         52986 8513.6
## + Year                             1      13.4 52972 8514.8
## + Population                       1       9.9 52976 8515.0
## - Status                           1    1699.4 54685 8604.3
## - Income.composition.of.resources  1    2133.1 55119 8627.5
## - BMI                              1    2748.8 55735 8660.2
## - Diphtheria                       1    5053.7 58040 8779.2
## - Schooling                        1    5510.7 58496 8802.3
## - Adult.Mortality                  1   11298.0 64284 9079.4
## - HIV.AIDS                         1   12327.9 65314 9126.1
## 
## Step:  AIC=8470.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio
## 
##                                   Df Sum of Sq   RSS    AIC
## + GDP                              1     683.0 51501 8434.0
## + percentage.expenditure           1     673.7 51510 8434.6
## + Hepatitis.B                      1     464.9 51719 8446.5
## + Measles                          1     462.6 51721 8446.6
## + under.five.deaths                1     433.0 51751 8448.3
## + thinness..1.19.years             1     395.8 51788 8450.4
## + thinness.5.9.years               1     309.7 51874 8455.3
## + infant.deaths                    1     285.5 51898 8456.6
## + Total.expenditure                1     150.8 52033 8464.2
## + Alcohol                          1      56.4 52127 8469.6
## <none>                                         52184 8470.7
## + Year                             1       8.6 52175 8472.3
## + Population                       1       6.0 52178 8472.4
## - Polio                            1     802.2 52986 8513.6
## - Diphtheria                       1    1534.0 53718 8553.9
## - Status                           1    1666.6 53850 8561.1
## - Income.composition.of.resources  1    2117.2 54301 8585.6
## - BMI                              1    2611.4 54795 8612.2
## - Schooling                        1    5175.2 57359 8746.6
## - Adult.Mortality                  1   11025.7 63209 9031.9
## - HIV.AIDS                         1   12298.2 64482 9090.5
## 
## Step:  AIC=8434.04
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP
## 
##                                   Df Sum of Sq   RSS    AIC
## + Hepatitis.B                      1     462.4 51038 8409.5
## + Measles                          1     452.1 51049 8410.1
## + under.five.deaths                1     418.7 51082 8412.1
## + thinness..1.19.years             1     379.4 51121 8414.3
## + thinness.5.9.years               1     286.7 51214 8419.6
## + infant.deaths                    1     270.4 51230 8420.6
## + Total.expenditure                1     179.5 51321 8425.8
## + Alcohol                          1      57.7 51443 8432.7
## + percentage.expenditure           1      42.2 51458 8433.6
## <none>                                         51501 8434.0
## + Year                             1      13.2 51487 8435.3
## + Population                       1       5.1 51496 8435.7
## - GDP                              1     683.0 52184 8470.7
## - Polio                            1     785.6 52286 8476.5
## - Status                           1    1033.3 52534 8490.4
## - Diphtheria                       1    1566.9 53068 8520.1
## - Income.composition.of.resources  1    1761.3 53262 8530.8
## - BMI                              1    2550.8 54051 8574.1
## - Schooling                        1    4904.6 56405 8699.3
## - Adult.Mortality                  1   10635.9 62137 8983.6
## - HIV.AIDS                         1   12474.3 63975 9069.3
## 
## Step:  AIC=8409.54
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B
## 
##                                   Df Sum of Sq   RSS    AIC
## + under.five.deaths                1     513.1 50525 8381.9
## + Measles                          1     461.2 50577 8384.9
## + thinness..1.19.years             1     380.1 50658 8389.6
## + infant.deaths                    1     351.8 50686 8391.2
## + thinness.5.9.years               1     291.9 50746 8394.7
## + Total.expenditure                1     167.7 50871 8401.9
## + Alcohol                          1      53.3 50985 8408.5
## <none>                                         51038 8409.5
## + percentage.expenditure           1      27.0 51011 8410.0
## + Population                       1      24.0 51014 8410.2
## + Year                             1      17.1 51021 8410.6
## - Hepatitis.B                      1     462.4 51501 8434.0
## - GDP                              1     680.5 51719 8446.5
## - Polio                            1     903.7 51942 8459.1
## - Status                           1    1111.8 52150 8470.8
## - Income.composition.of.resources  1    1660.7 52699 8501.6
## - Diphtheria                       1    1955.8 52994 8518.0
## - BMI                              1    2535.8 53574 8550.0
## - Schooling                        1    4868.0 55906 8675.2
## - Adult.Mortality                  1   10650.8 61689 8964.4
## - HIV.AIDS                         1   12549.6 63588 9053.4
## 
## Step:  AIC=8381.86
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths
## 
##                                   Df Sum of Sq   RSS    AIC
## + infant.deaths                    1    2223.3 48302 8251.6
## + Measles                          1     145.6 50380 8375.4
## + Total.expenditure                1     136.1 50389 8375.9
## + thinness..1.19.years             1     122.9 50402 8376.7
## + Population                       1      74.2 50451 8379.5
## + thinness.5.9.years               1      69.0 50456 8379.8
## + Alcohol                          1      58.1 50467 8380.5
## <none>                                         50525 8381.9
## + percentage.expenditure           1      27.0 50498 8382.3
## + Year                             1      17.8 50507 8382.8
## - under.five.deaths                1     513.1 51038 8409.5
## - Hepatitis.B                      1     556.8 51082 8412.1
## - GDP                              1     664.4 51190 8418.2
## - Polio                            1     850.1 51375 8428.9
## - Status                           1    1115.9 51641 8444.0
## - Income.composition.of.resources  1    1716.3 52242 8478.0
## - Diphtheria                       1    1883.7 52409 8487.4
## - BMI                              1    2143.7 52669 8501.9
## - Schooling                        1    4625.7 55151 8637.2
## - Adult.Mortality                  1   10759.8 61285 8947.1
## - HIV.AIDS                         1   12672.7 63198 9037.4
## 
## Step:  AIC=8251.64
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths
## 
##                                   Df Sum of Sq   RSS    AIC
## + thinness..1.19.years             1     248.7 48053 8238.5
## + Alcohol                          1     196.9 48105 8241.6
## + thinness.5.9.years               1     190.7 48111 8242.0
## + Total.expenditure                1     146.8 48155 8244.7
## + Measles                          1      87.5 48214 8248.3
## <none>                                         48302 8251.6
## + percentage.expenditure           1      23.3 48279 8252.2
## + Year                             1      14.5 48287 8252.8
## + Population                       1       0.1 48302 8253.6
## - Hepatitis.B                      1     388.7 48691 8273.2
## - Polio                            1     692.9 48995 8291.5
## - GDP                              1     796.0 49098 8297.7
## - Status                           1    1352.4 49654 8330.8
## - Diphtheria                       1    1358.4 49660 8331.1
## - Income.composition.of.resources  1    1361.3 49663 8331.3
## - BMI                              1    2194.7 50497 8380.2
## - infant.deaths                    1    2223.3 50525 8381.9
## - under.five.deaths                1    2384.5 50686 8391.2
## - Schooling                        1    4583.1 52885 8516.0
## - Adult.Mortality                  1   10175.6 58477 8811.3
## - HIV.AIDS                         1   11984.3 60286 8900.8
## 
## Step:  AIC=8238.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years
## 
##                                   Df Sum of Sq   RSS    AIC
## + Alcohol                          1     120.9 47932 8233.1
## + Total.expenditure                1     107.7 47945 8233.9
## + Measles                          1     103.7 47949 8234.1
## <none>                                         48053 8238.5
## + percentage.expenditure           1      17.1 48036 8239.4
## + Year                             1       9.1 48044 8239.9
## + Population                       1       0.3 48053 8240.5
## + thinness.5.9.years               1       0.1 48053 8240.5
## - thinness..1.19.years             1     248.7 48302 8251.6
## - Hepatitis.B                      1     356.5 48410 8258.2
## - Polio                            1     706.1 48759 8279.3
## - GDP                              1     790.4 48844 8284.4
## - Status                           1    1129.1 49182 8304.7
## - Income.composition.of.resources  1    1318.8 49372 8316.0
## - Diphtheria                       1    1339.1 49392 8317.2
## - BMI                              1    1487.1 49540 8326.0
## - infant.deaths                    1    2349.1 50402 8376.7
## - under.five.deaths                1    2469.9 50523 8383.7
## - Schooling                        1    4429.0 52482 8495.5
## - Adult.Mortality                  1   10110.4 58164 8797.5
## - HIV.AIDS                         1   11669.2 59722 8875.2
## 
## Step:  AIC=8233.07
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol
## 
##                                   Df Sum of Sq   RSS    AIC
## + Measles                          1     107.9 47824 8228.4
## + Total.expenditure                1      85.6 47847 8229.8
## <none>                                         47932 8233.1
## + percentage.expenditure           1       9.1 47923 8234.5
## + Year                             1       2.5 47930 8234.9
## + thinness.5.9.years               1       0.4 47932 8235.0
## + Population                       1       0.0 47932 8235.1
## - Alcohol                          1     120.9 48053 8238.5
## - thinness..1.19.years             1     172.8 48105 8241.6
## - Hepatitis.B                      1     351.1 48283 8252.5
## - Status                           1     691.0 48623 8273.1
## - Polio                            1     693.2 48625 8273.3
## - GDP                              1     797.5 48730 8279.6
## - Diphtheria                       1    1309.8 49242 8310.3
## - Income.composition.of.resources  1    1325.8 49258 8311.2
## - BMI                              1    1481.5 49414 8320.5
## - infant.deaths                    1    2441.6 50374 8377.0
## - under.five.deaths                1    2568.3 50501 8384.4
## - Schooling                        1    3963.2 51895 8464.5
## - Adult.Mortality                  1   10216.1 58148 8798.7
## - HIV.AIDS                         1   11783.0 59715 8876.8
## 
## Step:  AIC=8228.45
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol + Measles
## 
##                                   Df Sum of Sq   RSS    AIC
## + Total.expenditure                1      77.7 47747 8225.7
## <none>                                         47824 8228.4
## + percentage.expenditure           1       9.0 47815 8229.9
## + Year                             1       4.8 47820 8230.2
## + thinness.5.9.years               1       0.1 47824 8230.4
## + Population                       1       0.0 47824 8230.4
## - Measles                          1     107.9 47932 8233.1
## - Alcohol                          1     125.1 47949 8234.1
## - thinness..1.19.years             1     185.4 48010 8237.8
## - Hepatitis.B                      1     335.2 48159 8247.0
## - Status                           1     674.1 48498 8267.6
## - Polio                            1     682.7 48507 8268.1
## - GDP                              1     793.7 48618 8274.8
## - Diphtheria                       1    1287.6 49112 8304.5
## - Income.composition.of.resources  1    1305.7 49130 8305.6
## - BMI                              1    1412.1 49236 8311.9
## - infant.deaths                    1    2384.3 50209 8369.4
## - under.five.deaths                1    2445.6 50270 8373.0
## - Schooling                        1    3995.7 51820 8462.2
## - Adult.Mortality                  1   10307.8 58132 8799.9
## - HIV.AIDS                         1   11732.4 59557 8871.0
## 
## Step:  AIC=8225.67
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria + 
##     BMI + Income.composition.of.resources + Status + Polio + 
##     GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years + 
##     Alcohol + Measles + Total.expenditure
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         47747 8225.7
## + Year                             1       9.4 47737 8227.1
## + percentage.expenditure           1       3.0 47744 8227.5
## + thinness.5.9.years               1       0.5 47746 8227.6
## + Population                       1       0.0 47747 8227.7
## - Total.expenditure                1      77.7 47824 8228.4
## - Measles                          1     100.0 47847 8229.8
## - Alcohol                          1     103.5 47850 8230.0
## - thinness..1.19.years             1     162.3 47909 8233.6
## - Hepatitis.B                      1     329.1 48076 8243.8
## - Status                           1     598.8 48345 8260.3
## - Polio                            1     680.9 48428 8265.3
## - GDP                              1     814.2 48561 8273.3
## - Diphtheria                       1    1257.0 49004 8300.0
## - BMI                              1    1348.5 49095 8305.5
## - Income.composition.of.resources  1    1352.2 49099 8305.7
## - infant.deaths                    1    2373.7 50120 8366.2
## - under.five.deaths                1    2435.4 50182 8369.8
## - Schooling                        1    3937.5 51684 8456.5
## - Adult.Mortality                  1   10266.6 58013 8795.9
## - HIV.AIDS                         1   11809.6 59556 8873.0

Predictions

The model is set! Let’s make a prediction based on the MixedStepModel!

MixedStepModelPreds <- predict(MixedStepModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the MixedStepModel!

summary(MixedStepModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality + 
##     HIV.AIDS + Diphtheria + BMI + Income.composition.of.resources + 
##     Status + Polio + GDP + Hepatitis.B + under.five.deaths + 
##     infant.deaths + thinness..1.19.years + Alcohol + Measles + 
##     Total.expenditure, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2503  -2.2438  -0.1097   2.3665  16.3824 
## 
## Coefficients:
##                                     Estimate   Std. Error t value
## (Intercept)                     56.745781830  0.668314373  84.909
## Schooling                        0.645416268  0.041585022  15.520
## Adult.Mortality                 -0.019850253  0.000792059 -25.062
## HIV.AIDS                        -0.470653224  0.017510073 -26.879
## Diphtheria                       0.040593295  0.004629077   8.769
## BMI                              0.044354554  0.004883370   9.083
## Income.composition.of.resources  5.762655777  0.633592391   9.095
## StatusDeveloping                -1.625002116  0.268478794  -6.053
## Polio                            0.028664684  0.004441145   6.454
## GDP                              0.000046752  0.000006624   7.058
## Hepatitis.B                     -0.016610250  0.003701821  -4.487
## under.five.deaths               -0.074629707  0.006114028 -12.206
## infant.deaths                    0.099679379  0.008271756  12.051
## thinness..1.19.years            -0.074951946  0.023787239  -3.151
## Alcohol                          0.064726322  0.025725807   2.516
## Measles                         -0.000018876  0.000007631  -2.474
## Total.expenditure                0.073561760  0.033733716   2.181
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## Diphtheria                      < 0.0000000000000002 ***
## BMI                             < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## StatusDeveloping                    0.00000000160683 ***
## Polio                               0.00000000012680 ***
## GDP                                 0.00000000000211 ***
## Hepatitis.B                         0.00000750069921 ***
## under.five.deaths               < 0.0000000000000002 ***
## infant.deaths                   < 0.0000000000000002 ***
## thinness..1.19.years                         0.00164 ** 
## Alcohol                                      0.01192 *  
## Measles                                      0.01343 *  
## Total.expenditure                            0.02929 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.8192 
## F-statistic: 832.9 on 16 and 2921 DF,  p-value: < 0.00000000000000022


Observation findings :

  1. Among the predictors of the MixedStepModel, Income.composition.of.resources has the largest coefficient estimate in absolute value (5.763), followed by StatusDeveloping (-1.625).

  2. With the other predictors held fixed, a one-unit increase in Income.composition.of.resources is associated with the largest increase in Life.expectancy. Because the predictors are measured on different scales, the raw coefficients are not a direct ranking of importance.

  3. On the other hand, StatusDeveloping has the largest negative coefficient: with the other predictors held fixed, its estimated Life.expectancy is about 1.6 years lower than that of the reference level of Status.

  4. The p-values for the intercept, Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, under.five.deaths, and infant.deaths are the smallest among all terms (each below 2e-16), indicating that they are highly significant predictors of Life.expectancy.

  5. The MixedStepModel has an R-squared value of 0.8202, which indicates that the model explains about 82% of the variance in Life.expectancy.
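
The findings on coefficient size and R-squared can be made concrete with a small sketch. It uses the built-in mtcars data rather than the life expectancy data, and the variable names (fit, fit_std, r2_manual) are illustrative assumptions, not part of the analysis above. It recomputes R-squared from the residuals and refits the model on standardised variables so that the slopes become comparable in size.

```r
# Illustrative only: built-in mtcars data, not the life expectancy data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

# Compare with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)

# Standardised coefficients: scale the response and the predictors first,
# so each slope is expressed in standard-deviation units.
scaled <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_std <- lm(mpg ~ wt + hp, data = scaled)
coef(fit_std)
```

On standardised variables the intercept is zero up to rounding, and the slope with the larger absolute value corresponds to the predictor with the stronger association per standard deviation.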

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of MixedStepModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = MixedStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Mixed Step model",
     xlab = "Actual",
     ylab = "Predicted(MixedStepModel)",
     pch = 19)
abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(MixedStepModel$fitted.values ~ MixedStepModel$model$Life.expectancy),
       col = "red", lwd = 2)

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted of MixedStepModel with Confidence Interval.


The blue line is the regression line of the predicted values on the actual values, and the grey shade around it is its 95% confidence band. The dashed red lines mark the lower and upper bounds of the 95% prediction interval for each observation. Every predicted value lies between these bounds by construction, so the plot mainly conveys how wide the prediction interval is relative to the spread of the data.

# predict Life expectancy
predictedLE5 <- predict(MixedStepModel, interval = "prediction")

# combine the actual data and predicted data
comb5 <- cbind.data.frame(clean_data, predictedLE5)

# Plotting the combined data
ggplot(comb5, aes(Life.expectancy, fit)) +
  geom_point() + 
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  geom_smooth(method = "lm", se = TRUE) +  # regression line with its 95% confidence band
  ggtitle("Actual vs. Predicted for MixedStepModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")
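
The difference between a confidence interval and a prediction interval can be checked directly with predict(). The sketch below is a minimal illustration on the built-in mtcars data, with hypothetical object names (fit, new_cars, conf_int, pred_int); it is not a rerun of the MixedStepModel.

```r
# Illustrative only: built-in mtcars data, not the life expectancy data.
fit <- lm(mpg ~ wt, data = mtcars)

# Two hypothetical new observations
new_cars <- data.frame(wt = c(2.5, 3.5))

# Confidence interval: uncertainty about the mean response at each wt
conf_int <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation at each wt
pred_int <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)

# Both share the same point prediction, but the prediction interval is wider
ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(fit = conf_int[, "fit"], ci_width, pi_width)
```

The prediction interval is wider because it adds the residual variance of an individual observation to the uncertainty of the fitted mean.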

Fitting Reduced model using VIF


Modelling

The fitModel starts as the fullModel and keeps only the predictors whose VIF (Variance Inflation Factor) is no greater than 5. VIF measures how strongly a predictor is correlated with the other predictors. It is computed by regressing that predictor on all the other predictors and taking \(VIF = 1 / (1 - R^2)\), where \(R^2\) comes from that auxiliary regression. So, the closer the \(R^2\) value is to 1, the higher the VIF and the stronger the multicollinearity involving that predictor.
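
As a minimal sketch of how a VIF is obtained, the helper below regresses each predictor on the others and applies 1 / (1 - R^2). The function manual_vif is a hypothetical name introduced here for illustration (it is not part of the car package), and the example uses the built-in mtcars data.

```r
# Hypothetical helper for illustration: VIF of each predictor, computed by
# regressing it on the remaining predictors and applying 1 / (1 - R^2).
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)  # auxiliary regression
    1 / (1 - summary(aux)$r.squared)
  })
}

# Illustrative only: three correlated predictors from mtcars
vifs <- manual_vif(mtcars, c("wt", "hp", "disp"))
vifs
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean more of its variance is explained by the other predictors.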


Let’s build a reduced model using VIF! First, let’s check the VIF values of fitModel, which starts as a copy of fullModel:

fitModel <- fullModel

vif(fitModel)
##                            Year                          Status 
##                        1.146861                        1.886573 
##                 Adult.Mortality                   infant.deaths 
##                        1.748372                      177.316529 
##                         Alcohol          percentage.expenditure 
##                        1.872944                        5.804925 
##                     Hepatitis.B                         Measles 
##                        1.313055                        1.382727 
##                             BMI               under.five.deaths 
##                        1.733886                      176.281234 
##                           Polio               Total.expenditure 
##                        1.938914                        1.221827 
##                      Diphtheria                        HIV.AIDS 
##                        2.166894                        1.440765 
##                             GDP                      Population 
##                        6.028414                        1.490720 
##            thinness..1.19.years              thinness.5.9.years 
##                        8.776585                        8.873971 
## Income.composition.of.resources                       Schooling 
##                        3.088999                        3.337981

Next, let’s build the fitModel!


The predictors of the fitModel are Year, Status, Adult.Mortality, Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, Population, thinness..1.19.years, Income.composition.of.resources, and Schooling.

# sort the VIFs in ascending order, so the last element is the largest
temp <- sort(vif(fitModel))

# reduce the model until no included predictor has a VIF above 5
while (temp[length(temp)] > 5) {
  worst <- names(temp[length(temp)])           # predictor with the highest VIF
  cat("\nVariable with highest VIF - ", worst)
  frm <- as.formula(paste(".~.-", worst))      # formula that drops this predictor
  cat("\nRemoving variable - ", worst)
  fitModel <- update(fitModel, frm)            # refit the model without it
  cat("\n")
  print(summary(fitModel))                     # summary of the reduced model
  temp <- sort(vif(fitModel))                  # recheck the VIFs for the new model
}
## 
## Variable with highest VIF -  infant.deaths
## Removing variable -  infant.deaths
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.4783  -2.3102  -0.1014   2.4085  17.1550 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     94.218172614547 35.447627021394   2.658
## Year                            -0.019317910967  0.017720088476  -1.090
## StatusDeveloping                -1.580759427126  0.276531114362  -5.716
## Adult.Mortality                 -0.020270686093  0.000813000612 -24.933
## Alcohol                          0.023881976769  0.026494604134   0.901
## percentage.expenditure           0.000050510735  0.000092567468   0.546
## Hepatitis.B                     -0.019661995380  0.003796914705  -5.178
## Measles                         -0.000022980625  0.000007825887  -2.936
## BMI                              0.045806862566  0.005046299234   9.077
## under.five.deaths               -0.002044012162  0.000705589452  -2.897
## Polio                            0.031708636288  0.004544183354   6.978
## Total.expenditure                0.081557612752  0.035164791641   2.319
## Diphtheria                       0.047551230150  0.004716983157  10.081
## HIV.AIDS                        -0.485050990087  0.018022110037 -26.914
## GDP                              0.000036253975  0.000014094493   2.572
## Population                       0.000000003529  0.000000001699   2.077
## thinness..1.19.years            -0.094503107011  0.051447976779  -1.837
## thinness.5.9.years               0.043264631538  0.050636428739   0.854
## Income.composition.of.resources  6.566711285448  0.652404575651  10.065
## Schooling                        0.665004031763  0.042716960064  15.568
##                                             Pr(>|t|)    
## (Intercept)                                  0.00790 ** 
## Year                                         0.27573    
## StatusDeveloping                    0.00000001198118 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.36746    
## percentage.expenditure                       0.58534    
## Hepatitis.B                         0.00000023898660 ***
## Measles                                      0.00335 ** 
## BMI                             < 0.0000000000000002 ***
## under.five.deaths                            0.00380 ** 
## Polio                               0.00000000000369 ***
## Total.expenditure                            0.02045 *  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## GDP                                          0.01015 *  
## Population                                   0.03792 *  
## thinness..1.19.years                         0.06633 .  
## thinness.5.9.years                           0.39294    
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.14 on 2918 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8105 
## F-statistic:   662 on 19 and 2918 DF,  p-value: < 0.00000000000000022
## 
## 
## Variable with highest VIF -  thinness.5.9.years
## Removing variable -  thinness.5.9.years
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + Income.composition.of.resources + 
##     Schooling, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.451  -2.305  -0.096   2.422  17.138 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     94.499341217681 35.444460303979   2.666
## Year                            -0.019444490254  0.017718649747  -1.097
## StatusDeveloping                -1.574966541377  0.276435203206  -5.697
## Adult.Mortality                 -0.020252878125  0.000812695826 -24.921
## Alcohol                          0.023347378652  0.026485990194   0.881
## percentage.expenditure           0.000051217858  0.000092559487   0.553
## Hepatitis.B                     -0.019670902367  0.003796724807  -5.181
## Measles                         -0.000023226726  0.000007820223  -2.970
## BMI                              0.045305548712  0.005011841794   9.040
## under.five.deaths               -0.001976059007  0.000701060439  -2.819
## Polio                            0.031653416786  0.004543513628   6.967
## Total.expenditure                0.080107936440  0.035122211380   2.281
## Diphtheria                       0.047673618745  0.004714589677  10.112
## HIV.AIDS                        -0.484693143962  0.018016409670 -26.903
## GDP                              0.000035995139  0.000014090586   2.555
## Population                       0.000000003514  0.000000001699   2.068
## thinness..1.19.years            -0.055765398178  0.024316454018  -2.293
## Income.composition.of.resources  6.574453247811  0.652311481310  10.079
## Schooling                        0.665622890040  0.042708843846  15.585
##                                             Pr(>|t|)    
## (Intercept)                                  0.00772 ** 
## Year                                         0.27256    
## StatusDeveloping                    0.00000001337641 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.37812    
## percentage.expenditure                       0.58007    
## Hepatitis.B                         0.00000023569735 ***
## Measles                                      0.00300 ** 
## BMI                             < 0.0000000000000002 ***
## under.five.deaths                            0.00485 ** 
## Polio                               0.00000000000399 ***
## Total.expenditure                            0.02263 *  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## GDP                                          0.01068 *  
## Population                                   0.03868 *  
## thinness..1.19.years                         0.02190 *  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.14 on 2919 degrees of freedom
## Multiple R-squared:  0.8117, Adjusted R-squared:  0.8105 
## F-statistic: 698.8 on 18 and 2919 DF,  p-value: < 0.00000000000000022
## 
## 
## Variable with highest VIF -  GDP
## Removing variable -  GDP
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + Population + thinness..1.19.years + Income.composition.of.resources + 
##     Schooling, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3575  -2.3315  -0.0647   2.4009  17.2581 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     85.482265573542 35.301615607074   2.421
## Year                            -0.014949672567  0.017647743435  -0.847
## StatusDeveloping                -1.615194866551  0.276247300110  -5.847
## Adult.Mortality                 -0.020343311970  0.000812692294 -25.032
## Alcohol                          0.019115507918  0.026459132204   0.722
## percentage.expenditure           0.000258526489  0.000044556256   5.802
## Hepatitis.B                     -0.019199377414  0.003795821784  -5.058
## Measles                         -0.000023256305  0.000007827611  -2.971
## BMI                              0.046156161752  0.005005497667   9.221
## under.five.deaths               -0.001980860848  0.000701720938  -2.823
## Polio                            0.032055472441  0.004545081249   7.053
## Total.expenditure                0.065953799793  0.034715214262   1.900
## Diphtheria                       0.047562454272  0.004718847433  10.079
## HIV.AIDS                        -0.483073575096  0.018022279479 -26.804
## Population                       0.000000003463  0.000000001701   2.036
## thinness..1.19.years            -0.056059956546  0.024339177430  -2.303
## Income.composition.of.resources  6.725532217572  0.650239347161  10.343
## Schooling                        0.668854850329  0.042730474148  15.653
##                                             Pr(>|t|)    
## (Intercept)                                  0.01552 *  
## Year                                         0.39700    
## StatusDeveloping                    0.00000000556230 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.47007    
## percentage.expenditure              0.00000000724548 ***
## Hepatitis.B                         0.00000044981054 ***
## Measles                                      0.00299 ** 
## BMI                             < 0.0000000000000002 ***
## under.five.deaths                            0.00479 ** 
## Polio                               0.00000000000218 ***
## Total.expenditure                            0.05755 .  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## Population                                   0.04181 *  
## thinness..1.19.years                         0.02133 *  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.143 on 2920 degrees of freedom
## Multiple R-squared:  0.8112, Adjusted R-squared:  0.8101 
## F-statistic: 738.2 on 17 and 2920 DF,  p-value: < 0.00000000000000022

Predictions

The model is set! Let’s make a prediction based on the fitModel!

fitModelPreds <- predict(fitModel,  # my model
                    newdata = clean_data, # dataset
                    type = "response") # to get predicted values

Model Interpretation

Let’s interpret the fitModel!

summary (fitModel)
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + Population + thinness..1.19.years + Income.composition.of.resources + 
##     Schooling, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3575  -2.3315  -0.0647   2.4009  17.2581 
## 
## Coefficients:
##                                        Estimate      Std. Error t value
## (Intercept)                     85.482265573542 35.301615607074   2.421
## Year                            -0.014949672567  0.017647743435  -0.847
## StatusDeveloping                -1.615194866551  0.276247300110  -5.847
## Adult.Mortality                 -0.020343311970  0.000812692294 -25.032
## Alcohol                          0.019115507918  0.026459132204   0.722
## percentage.expenditure           0.000258526489  0.000044556256   5.802
## Hepatitis.B                     -0.019199377414  0.003795821784  -5.058
## Measles                         -0.000023256305  0.000007827611  -2.971
## BMI                              0.046156161752  0.005005497667   9.221
## under.five.deaths               -0.001980860848  0.000701720938  -2.823
## Polio                            0.032055472441  0.004545081249   7.053
## Total.expenditure                0.065953799793  0.034715214262   1.900
## Diphtheria                       0.047562454272  0.004718847433  10.079
## HIV.AIDS                        -0.483073575096  0.018022279479 -26.804
## Population                       0.000000003463  0.000000001701   2.036
## thinness..1.19.years            -0.056059956546  0.024339177430  -2.303
## Income.composition.of.resources  6.725532217572  0.650239347161  10.343
## Schooling                        0.668854850329  0.042730474148  15.653
##                                             Pr(>|t|)    
## (Intercept)                                  0.01552 *  
## Year                                         0.39700    
## StatusDeveloping                    0.00000000556230 ***
## Adult.Mortality                 < 0.0000000000000002 ***
## Alcohol                                      0.47007    
## percentage.expenditure              0.00000000724548 ***
## Hepatitis.B                         0.00000044981054 ***
## Measles                                      0.00299 ** 
## BMI                             < 0.0000000000000002 ***
## under.five.deaths                            0.00479 ** 
## Polio                               0.00000000000218 ***
## Total.expenditure                            0.05755 .  
## Diphtheria                      < 0.0000000000000002 ***
## HIV.AIDS                        < 0.0000000000000002 ***
## Population                                   0.04181 *  
## thinness..1.19.years                         0.02133 *  
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling                       < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.143 on 2920 degrees of freedom
## Multiple R-squared:  0.8112, Adjusted R-squared:  0.8101 
## F-statistic: 738.2 on 17 and 2920 DF,  p-value: < 0.00000000000000022


Observation findings :

  1. The largest parameter estimate in fitModel belongs to Income.composition.of.resources (6.725), followed in absolute size by StatusDeveloping (-1.615).

  2. Per unit of change, Income.composition.of.resources therefore affects Life.expectancy the most in the positive direction. Keep in mind that raw estimates depend on the scale of each predictor.

  3. On the other hand, StatusDeveloping affects Life.expectancy the most in the negative direction: holding the other predictors fixed, a country with Developing status is predicted to have a Life.expectancy about 1.6 lower than the baseline status.

  4. The p-values for Adult.Mortality, BMI, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest among all predictors, indicating that they are highly significant predictors of Life.expectancy.

  5. The fitModel has an R-squared value of 0.8112, which indicates that the model explains 81.12% of the variance in Life.expectancy.
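
The findings above can also be read off programmatically. The following is a minimal sketch, not the code used for this report: it assumes the built-in mtcars data as a stand-in for clean_data, and the names demo_fit, coef_table, ranked, and r2_manual are hypothetical. It ranks the slopes by absolute estimate and recomputes R-squared as 1 - SSE/SST.

```r
# Stand-in model: mtcars replaces clean_data, which is not available here
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(demo_fit))

# Rank the slopes (intercept excluded) by absolute estimate
slopes <- coef_table[-1, "Estimate"]
ranked <- slopes[order(abs(slopes), decreasing = TRUE)]
print(ranked)

# R-squared by hand: share of the variance in the target explained by the model
y   <- mtcars$mpg
sse <- sum(residuals(demo_fit)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)         # total variation
r2_manual <- 1 - sse / sst
print(r2_manual)
```

Because the ranking uses raw estimates, it changes when a predictor is rescaled, so it is best treated as a rough guide rather than a measure of importance.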

Actual vs. Predicted

Original Plot

Here is the Actual vs Predicted Plot of fitModel. The green line represents a perfect prediction, while the red line represents the regression line.

# actual vs predicted
plot(y = fitModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Fit model",
     xlab = "Actual",
     ylab = "Predicted(fitModel)",
     pch = 19)
abline(0,1, col = "green", lwd = 2)  # this is a perfect prediction - 45 degree line

# add the regression line 
abline(lm(fitModel$fitted.values ~ fitModel$model$Life.expectancy),
       col = "red", lwd = 2)

Plot with Confidence Interval (CI)

Here is the plot of the Actual vs Predicted of fitModel with Confidence Interval.


The blue line is the regression line of the predicted values on the actual values, and the grey shade surrounding it is its 95% confidence band. The dashed red lines above and below the regression line mark the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within these bounds.

# predict Life expectancy
predictedLE6 <- predict(fitModel, interval = "prediction")

# combine the actual data and predicted data
comb6 <- cbind.data.frame(clean_data, predictedLE6)

# Plotting the combined data
ggplot(comb6, aes(Life.expectancy, fit)) +
  geom_point() + 
  # lower and upper bounds of the prediction interval
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  # regression line with its confidence band (one smoothing layer is enough)
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Actual vs. Predicted for fitModel with CI") +
  xlab("Actual Life Expectancy") + 
  ylab("Predicted Life Expectancy")
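
To make the difference between the two kinds of interval concrete, here is a small sketch under stated assumptions: mtcars stands in for clean_data, and demo_fit, conf_int, and pred_int are hypothetical names. It calls predict() with both interval types and compares their widths.

```r
# Stand-in model: mtcars replaces clean_data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Confidence interval: uncertainty about the mean response
conf_int <- predict(demo_fit, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation
# (predict() warns that intervals on the training data refer to future responses)
pred_int <- suppressWarnings(
  predict(demo_fit, interval = "prediction", level = 0.95)
)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]

# The prediction interval is wider for every observation
print(all(pred_width > conf_width))

# Share of actual values that fall inside the prediction interval
coverage <- mean(mtcars$mpg >= pred_int[, "lwr"] & mtcars$mpg <= pred_int[, "upr"])
print(coverage)
```

The prediction interval adds the residual variance on top of the uncertainty of the fitted mean, which is why it is always the wider of the two.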

Model Evaluation

Let’s evaluate the best model based on several criteria!

Assumptions Checking

Linear regression makes several assumptions about the data, such as linearity of the data, normality of the residuals (errors), homogeneity of the residual variance (homoscedasticity), and the absence of strong correlation among the predictors (no multicollinearity). The linearity assumption has been checked in the correlation tab. Let’s check the remaining three assumptions!

Normality of Errors

To check the normality of errors assumption, we can use the density plot of the residuals.

par(mfrow = c(3, 2))
plot(density(fullModel$residuals))
plot(density(EDAModel$residuals))
plot(density(BackwardStepModel$residuals))
plot(density(forwardStepModel$residuals))
plot(density(MixedStepModel$residuals))
plot(density(fitModel$residuals))

Observation findings :

  1. The density plots show that the errors of all models are approximately normally distributed.
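
A density plot is a visual check only. As a complement, a formal test can be run on the residuals. The sketch below is an illustration rather than part of the original analysis: it assumes mtcars as a stand-in for clean_data and uses the hypothetical names demo_fit and sw. It applies the Shapiro-Wilk test from base R, which accepts between 3 and 5000 observations.

```r
# Stand-in model: mtcars replaces clean_data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(demo_fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# A small p-value (for example below 0.05) is evidence against normality
looks_normal <- sw$p.value >= 0.05
print(looks_normal)
```

With thousands of rows, such a test tends to flag even small departures from normality, so it is usually read together with the plots rather than instead of them.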

Homoscedasticity

To check the homoscedasticity assumption, we can plot the residuals against the fitted values.

par(mfrow = c(3, 2))
plot(fullModel$fitted.values, fullModel$residuals)
abline(h=0, col = "red")
plot(EDAModel$fitted.values, EDAModel$residuals)
abline(h=0, col = "red")
plot(BackwardStepModel$fitted.values, BackwardStepModel$residuals)
abline(h=0, col = "red")
plot(forwardStepModel$fitted.values, forwardStepModel$residuals)
abline(h=0, col = "red")
plot(MixedStepModel$fitted.values, MixedStepModel$residuals)
abline(h=0, col = "red")
plot(fitModel$fitted.values, fitModel$residuals)
abline(h=0, col = "red")

Observation findings :

  1. The residuals from all models are spread fairly evenly across the range of fitted values.
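
The visual check can be backed by a numeric one. The following sketch is not from the original report: it assumes mtcars as a stand-in for clean_data and hand-computes the studentized (Koenker) variant of the Breusch-Pagan test in base R. The names demo_fit, aux_fit, bp_stat, and bp_pval are hypothetical.

```r
# Stand-in model: mtcars replaces clean_data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- data.frame(sq_res = residuals(demo_fit)^2,
                       mtcars[, c("wt", "hp")])
aux_fit <- lm(sq_res ~ wt + hp, data = aux_data)

n <- nrow(aux_data)   # number of observations
k <- 2                # number of predictors in the auxiliary regression

# Test statistic: n * R-squared of the auxiliary regression,
# compared against a chi-squared distribution with k degrees of freedom
bp_stat <- n * summary(aux_fit)$r.squared
bp_pval <- pchisq(bp_stat, df = k, lower.tail = FALSE)

print(c(statistic = bp_stat, p.value = bp_pval))
```

The null hypothesis is constant residual variance, so a small p-value points to heteroscedasticity.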

No Multicollinearity

To check the no-multicollinearity assumption, we can use the VIF (Variance Inflation Factor). The VIF measures how strongly each predictor is correlated with the other predictors. It is computed by taking one predictor and regressing it against every other predictor, which gives an R^2 value, and then VIF = 1 / (1 - R^2). So, the closer the R^2 value is to 1, the higher the VIF and the stronger the multicollinearity involving that particular predictor.
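
The computation described above can be sketched by hand. This is an illustration under stated assumptions, not the implementation behind vif(): mtcars stands in for clean_data, and predictors and manual_vif are hypothetical names. Each predictor is regressed on the others and VIF = 1 / (1 - R^2) is applied.

```r
# Numeric predictors of a stand-in model (mtcars replaces clean_data)
predictors <- c("wt", "hp", "disp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on all the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  # VIF = 1 / (1 - R^2) of that auxiliary regression
  1 / (1 - summary(aux)$r.squared)
})

print(manual_vif)
```

A VIF of 1 means the predictor is uncorrelated with the others, and the value grows without bound as the auxiliary R^2 approaches 1.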

Full Model

vif(fullModel)
##                            Year                          Status 
##                        1.146861                        1.886573 
##                 Adult.Mortality                   infant.deaths 
##                        1.748372                      177.316529 
##                         Alcohol          percentage.expenditure 
##                        1.872944                        5.804925 
##                     Hepatitis.B                         Measles 
##                        1.313055                        1.382727 
##                             BMI               under.five.deaths 
##                        1.733886                      176.281234 
##                           Polio               Total.expenditure 
##                        1.938914                        1.221827 
##                      Diphtheria                        HIV.AIDS 
##                        2.166894                        1.440765 
##                             GDP                      Population 
##                        6.028414                        1.490720 
##            thinness..1.19.years              thinness.5.9.years 
##                        8.776585                        8.873971 
## Income.composition.of.resources                       Schooling 
##                        3.088999                        3.337981

Observation findings :

  1. All predictors have a VIF value < 10, except infant.deaths and under.five.deaths (both above 170), which indicates strong multicollinearity between these two predictors.

EDA-Based Model

vif(EDAModel)
##                       Schooling                 Adult.Mortality 
##                        2.794906                        1.269928 
## Income.composition.of.resources 
##                        2.815281


Observation findings :

  1. All predictors have a VIF value < 10, so it’s safe to assume that the EDAModel does not suffer from multicollinearity.

Backward Step Model

vif(BackwardStepModel)
##                          Status                 Adult.Mortality 
##                        1.864311                        1.735623 
##                   infant.deaths                         Alcohol 
##                      170.966783                        1.823815 
##                     Hepatitis.B                         Measles 
##                        1.302165                        1.375869 
##                             BMI               under.five.deaths 
##                        1.701555                      172.903451 
##                           Polio               Total.expenditure 
##                        1.935065                        1.178482 
##                      Diphtheria                        HIV.AIDS 
##                        2.154512                        1.420424 
##                             GDP            thinness..1.19.years 
##                        1.393999                        1.966152 
## Income.composition.of.resources                       Schooling 
##                        3.025917                        3.312613


Observation findings :

  1. All predictors have a VIF value < 10, except infant.deaths and under.five.deaths (both above 170).

Forward Step Model

vif(forwardStepModel)
##                       Schooling                 Adult.Mortality 
##                        3.312613                        1.735623 
##                        HIV.AIDS                      Diphtheria 
##                        1.420424                        2.154512 
##                             BMI Income.composition.of.resources 
##                        1.701555                        3.025917 
##                          Status                           Polio 
##                        1.864311                        1.935065 
##                             GDP                     Hepatitis.B 
##                        1.393999                        1.302165 
##               under.five.deaths                   infant.deaths 
##                      172.903451                      170.966783 
##            thinness..1.19.years                         Alcohol 
##                        1.966152                        1.823815 
##                         Measles               Total.expenditure 
##                        1.375869                        1.178482


Observation findings :

  1. All predictors have a VIF value < 10, except infant.deaths and under.five.deaths (both above 170).

Mixed Step Model

vif(MixedStepModel)
##                       Schooling                 Adult.Mortality 
##                        3.312613                        1.735623 
##                        HIV.AIDS                      Diphtheria 
##                        1.420424                        2.154512 
##                             BMI Income.composition.of.resources 
##                        1.701555                        3.025917 
##                          Status                           Polio 
##                        1.864311                        1.935065 
##                             GDP                     Hepatitis.B 
##                        1.393999                        1.302165 
##               under.five.deaths                   infant.deaths 
##                      172.903451                      170.966783 
##            thinness..1.19.years                         Alcohol 
##                        1.966152                        1.823815 
##                         Measles               Total.expenditure 
##                        1.375869                        1.178482


Observation findings :

  1. All predictors have a VIF value < 10, except infant.deaths and under.five.deaths (both above 170).

Fitting Reduced model using VIF

vif(fitModel)
##                            Year                          Status 
##                        1.134178                        1.879225 
##                 Adult.Mortality                         Alcohol 
##                        1.739711                        1.836868 
##          percentage.expenditure                     Hepatitis.B 
##                        1.342118                        1.303560 
##                         Measles                             BMI 
##                        1.378340                        1.702102 
##               under.five.deaths                           Polio 
##                        2.168512                        1.929626 
##               Total.expenditure                      Diphtheria 
##                        1.188280                        2.131651 
##                        HIV.AIDS                      Population 
##                        1.432668                        1.443763 
##            thinness..1.19.years Income.composition.of.resources 
##                        1.959860                        3.034365 
##                       Schooling 
##                        3.330094


Observation findings :

  1. All predictors have a VIF value < 10, so it’s safe to assume that the fitModel does not suffer from multicollinearity.
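
The log printed while fitting this model ("Variable with highest VIF - GDP", "Removing variable - GDP") suggests an iterative elimination. The sketch below shows one way such a loop could look. It is an assumption, not the author's procedure: the helper names vif_manual and drop_high_vif, the threshold of 5, and the use of mtcars in place of clean_data are all hypothetical.

```r
# Hypothetical helper: VIF of each variable, computed by auxiliary regressions
vif_manual <- function(vars, data) {
  sapply(vars, function(v) {
    aux <- lm(reformulate(setdiff(vars, v), response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Hypothetical helper: drop the highest-VIF variable until all are below threshold
drop_high_vif <- function(vars, data, threshold = 5) {
  while (length(vars) >= 2) {
    v <- vif_manual(vars, data)
    if (max(v) < threshold) break
    worst <- names(which.max(v))
    cat("Removing variable -", worst, "\n")
    vars <- setdiff(vars, worst)
  }
  vars
}

# Stand-in data: mtcars replaces clean_data; the threshold is an assumed value
kept <- drop_high_vif(c("wt", "hp", "disp", "cyl", "qsec"), mtcars, threshold = 5)
reduced_fit <- lm(reformulate(kept, response = "mpg"), data = mtcars)
print(kept)
```

Removing one variable at a time matters because the VIFs of the remaining predictors change after each removal.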

Diagnostic Plots

Linear regression makes several assumptions about the data, such as linearity of the data, normality of the residuals (errors), homogeneity of the residual variance (homoscedasticity), and the absence of strong correlation among the predictors (no multicollinearity).

To check the assumptions, we use the diagnostic plots. The diagnostic plots show residuals (error) in four different ways:

  1. Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line without distinct patterns is an indication of a linear relationship, which is good.

  2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.

  3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.

  4. Residuals vs Leverage. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis.
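
The quantities behind the fourth plot can also be extracted numerically. The following is a minimal sketch, assuming mtcars as a stand-in for clean_data. The names demo_fit, outliers, influential, and top3 are hypothetical, and the cutoffs (3 for standardized residuals, 0.5 for Cook's distance) are the ones mentioned in this report.

```r
# Stand-in model: mtcars replaces clean_data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

std_res <- rstandard(demo_fit)        # standardized residuals
lev     <- hatvalues(demo_fit)        # leverage of each observation
cook    <- cooks.distance(demo_fit)   # Cook's distance (influence)

# Outliers: standardized residual beyond 3 in absolute value
outliers <- names(std_res)[abs(std_res) > 3]

# Influential cases: Cook's distance beyond the 0.5 band
influential <- names(cook)[cook > 0.5]

# The three observations with the largest Cook's distance
top3 <- names(sort(cook, decreasing = TRUE))[1:3]

print(outliers)
print(influential)
print(top3)
```

An observation can be an outlier without being influential: a large residual only moves the fit noticeably when the observation also has high leverage.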

Full Model

par(mfrow = c(2, 2))
plot(fullModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • All the points fall adequately along the reference line, so it’s sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points (#1202, #1901, and #1909), which have standardized residuals below -4. Any cases beyond the 0.5 Cook’s distance band would be influential. Separately, there are outliers whose standardized residuals exceed 3 standard deviations.

EDA-Based Model

par(mfrow = c(2, 2))
plot(EDAModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • The points between -1 and 2 on the Theoretical Quantiles axis fall adequately along the reference line, while the points outside that range deviate from it. Even so, it’s still sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points, which have standardized residuals below -4. Any cases beyond the 0.5 Cook’s distance band would be influential. Separately, there are outliers whose standardized residuals exceed 3 standard deviations.

Backward Step Model

par(mfrow = c(2, 2))
plot(BackwardStepModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • All the points fall adequately along the reference line, so it’s sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points (#1202, #1901, and #1909), which have standardized residuals below -4. Any cases beyond the 0.5 Cook’s distance band would be influential. Separately, there are outliers whose standardized residuals exceed 3 standard deviations.

Forward Step Model

par(mfrow = c(2, 2))
plot(forwardStepModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • All the points fall adequately along the reference line, so it’s sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points (#1202, #1901, and #1909), which have standardized residuals below -4. Any cases beyond the 0.5 Cook’s distance band would be influential. Separately, there are outliers whose standardized residuals exceed 3 standard deviations.

Mixed Step Model

par(mfrow = c(2, 2))
plot(MixedStepModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • All the points fall adequately along the reference line, so it’s sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points (#1202, #1901, and #1909), which have standardized residuals below -4. Any cases beyond the 0.5 Cook’s distance band would be influential. Separately, there are outliers whose standardized residuals exceed 3 standard deviations.

Fitting Reduced model using VIF

par(mfrow = c(2, 2))
plot(fitModel)


Observation findings :

  1. The Residuals vs Fitted plot is to check the linearity assumption.
    • There is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
  2. The Scale-Location plot is to check the homoscedasticity assumption.
    • The residuals are spread fairly evenly across the range of fitted values.
  3. The Normal Q-Q plot is to check the normality of the residuals (error) assumption.
    • All the points fall adequately along the reference line, so it’s sufficient to assume that the residuals are normally distributed.
  4. The Residuals vs Leverage plot is to identify influential cases.
    • The plot highlights the top 3 most extreme points (#1199, #1200, and #1202), which have standardized residuals below -4, so there are outliers that exceed 3 standard deviations.

Adjusted R-squared & Root Mean Square Error (RMSE)

summary(fullModel)$adj.r.squared
summary(EDAModel)$adj.r.squared 
summary(BackwardStepModel)$adj.r.squared 
summary(forwardStepModel)$adj.r.squared 
summary(MixedStepModel)$adj.r.squared 
summary(fitModel)$adj.r.squared 
RMSE(y_pred = fullModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = EDAModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = BackwardStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = forwardStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = MixedStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = fitModelPreds, y_true = clean_data$Life.expectancy)

Model              Adjusted.R.squared  RMSE
fullModel          0.8190258           4.030796
EDAModel           0.7107153           5.111013
BackwardStepModel  0.8192284           4.031300
forwardStepModel   0.8192284           4.031300
MixedStepModel     0.8192284           4.031300
fitModel           0.8101345           4.130748
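
Both metrics can be reproduced by hand, which also shows why the RMSE of fitModel (4.130748) is slightly lower than its residual standard error in summary() (4.143): RMSE divides the squared errors by n, while the residual standard error divides by the residual degrees of freedom. The sketch below assumes mtcars as a stand-in for clean_data, and the names demo_fit, rmse_manual, and adj_r2_manual are hypothetical.

```r
# Stand-in model: mtcars replaces clean_data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(demo_fit)
n     <- length(y)                    # number of observations
p     <- length(coef(demo_fit)) - 1   # number of predictors (intercept excluded)

# RMSE: square root of the mean squared error (divides by n)
rmse_manual <- sqrt(mean((y - y_hat)^2))

# Residual standard error: divides by the residual degrees of freedom instead
rse_manual <- sqrt(sum((y - y_hat)^2) / (n - p - 1))

# Adjusted R-squared: R-squared penalized for the number of predictors
r2            <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(RMSE = rmse_manual, RSE = rse_manual, Adj.R2 = adj_r2_manual))
```

Because of the penalty term, adding a predictor raises the adjusted R-squared only if it improves the fit by more than would be expected by chance.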


Observation findings :

  1. The Step-Wise Models (BackwardStepModel, forwardStepModel, and MixedStepModel) have the highest Adjusted R-squared (0.8192284).

  2. The EDAModel has the lowest Adjusted R-squared (0.7107153).

  3. The EDAModel has the highest RMSE (5.111013).

  4. The fullModel has the lowest RMSE (4.030796), followed closely by the Step-Wise Models (4.031300).

Final Conclusion

Here are the criteria to find the best model to predict Life Expectancy:

  1. The model has the highest value of Adjusted R-Squared

  2. The model has the lowest value of RMSE

  3. The model has the least predictors.

Model              Adjusted.R.squared  RMSE      Number.of.Predictors
fullModel          0.8190258           4.030796  20
EDAModel           0.7107153           5.111013  3
BackwardStepModel  0.8192284           4.031300  16
forwardStepModel   0.8192284           4.031300  16
MixedStepModel     0.8192284           4.031300  16
fitModel           0.8101345           4.130748  17

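So, the best model that fits the criteria is the Step-Wise Model: it has the highest Adjusted R-squared, an RMSE practically equal to that of the fullModel, and fewer predictors (16 versus 20).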
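
A comparison table like the one above can be assembled and ranked in a few lines. The following is a sketch under stated assumptions: the three models are hypothetical stand-ins fitted on mtcars, and the tie-breaking order (Adjusted R-squared first, then RMSE, then number of predictors) is one possible reading of the criteria, not a rule stated in this report.

```r
# Hypothetical candidate models on stand-in data (mtcars replaces clean_data)
models <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  large  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
)

# One row per model with the three selection criteria
comparison <- data.frame(
  Model                = names(models),
  Adjusted.R.squared   = sapply(models, function(m) summary(m)$adj.r.squared),
  RMSE                 = sapply(models, function(m) sqrt(mean(residuals(m)^2))),
  Number.of.Predictors = sapply(models, function(m) length(coef(m)) - 1),
  row.names = NULL
)

# Rank: highest Adjusted R-squared, then lowest RMSE, then fewest predictors
ranked <- comparison[order(-comparison$Adjusted.R.squared,
                           comparison$RMSE,
                           comparison$Number.of.Predictors), ]
print(ranked)
```

When two models are nearly tied on the first two criteria, as the fullModel and the Step-Wise Models are here, the number of predictors is what separates them.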

References

[1] A. Roy, “A Deep Dive Into The Concept of Regression.” [Online]. Available: https://towardsdatascience.com/a-deep-dive-into-the-concept-of-regression-fb912d427a2e

[2] Sathwick, “What is a Linear Regression?” [Online]. Available: https://towardsdatascience.com/the-concepts-behind-linear-regression-and-its-implementation-ffbab5a4d65e

[3] Algoritma Team, “Inclass Regression Model.”

[4] S. Swaminathan, “Linear Regression — Detailed View.” [Online]. Available: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86

[6] P. Schober, C. Boer, and L. Schwarte, “Correlation coefficients: Appropriate use and interpretation,” Anesthesia & Analgesia, vol. 126, p. 1, Feb. 2018, doi: 10.1213/ANE.0000000000002864.

[7] “What is Multicollinearity? Here’s Everything You Need to Know.” [Online]. Available: https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/

[9] “Predicting Life Expectancy for Countries.” [Online]. Available: https://rpubs.com/mrunws/OPIM5603-Healthcare2