Regression is one of the most important concepts in machine learning. Regression analysis lets us predict a target variable (y) from the values of one or more predictor variables (x). The target variable is also known as the dependent variable or label. The predictor variables, on the other hand, are also known as independent variables.
The types of regressions are represented in the diagram:
In the formulas below you will find several notations: \(y\) is the target variable, \(b_0\) is the intercept, \(b_n\) is the coefficient of the \(n\)-th predictor, and \(x_n\) is the \(n\)-th predictor.
Without further ado, let’s talk in detail about each of the points shown in the block diagram!
1. Linear: the model fitted to predict the target variable is linear in the predictors.
\[y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + ... + b_n*x_n\]
2. Polynomial: the model fitted to predict the target variable is a polynomial in the predictor.
\[y = b_0 + b_1*x_1 + b_2*x_1^2 + b_3*x_1^3 + ... + b_n*x_1^n\]
1. Univariate: the number of predictor variables (independent variables) is 1.
\[y = b_0 + b_1*x_1\]
2. Bivariate: the number of predictor variables (independent variables) is 2.
\[y = b_0 + b_1*x_1 + b_2*x_2\]
3. Multivariate: the number of predictor variables (independent variables) is more than 2.
\[y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + ... + b_n*x_n\]
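The model forms listed above map directly onto R formulas. Here is a minimal sketch on simulated data; the data frame `toy` and the object names `fit_one`, `fit_two`, and `fit_poly` are made up for illustration and are not part of the analysis in this article.

```r
# Simulated data, for illustration only
set.seed(42)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 3 + 2 * x1 - 1.5 * x2 + 0.5 * x1^2 + rnorm(n)
toy <- data.frame(y, x1, x2)

# One predictor: y = b0 + b1*x1
fit_one <- lm(y ~ x1, data = toy)

# Two predictors: y = b0 + b1*x1 + b2*x2
fit_two <- lm(y ~ x1 + x2, data = toy)

# Polynomial of degree 2 in x1: y = b0 + b1*x1 + b2*x1^2
fit_poly <- lm(y ~ poly(x1, 2, raw = TRUE), data = toy)

# Each model has one intercept plus one coefficient per term
coef(fit_one)
coef(fit_two)
coef(fit_poly)
```

Note that the polynomial model is still fitted by `lm`, because it is linear in the coefficients even though it is not linear in `x1`.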
The goal of a regression task is to predict the value of the target variable from the independent variable(s) using the best-fitting model. Linear regression tries to find the best possible linear relationship between the target variable and the predictor variables.
Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line consists of the predicted score on \(Y\) for each possible value of \(X\). The vertical lines from the points to the best-fitting line represent the errors of prediction.
The error of prediction for a point is the value of the point \(Y\) minus the predicted value \(Y'\) (the value on the line).
\[ Error = Y - Y'\]
The most commonly used criterion for the best-fitting line is that it minimizes the sum of the squared errors of prediction, \(\sum (Y-Y')^2\).
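To make the least-squares criterion concrete, here is a small sketch on simulated data (the vectors `x` and `y` and the alternative line are made up for illustration). It computes the errors of prediction and their sum of squares for the fitted line, and compares it with the sum of squares of another candidate line.

```r
# Simulated data, for illustration only
set.seed(1)
x <- 1:20
y <- 5 + 2 * x + rnorm(20)

# Least-squares fit
fit   <- lm(y ~ x)
y_hat <- fitted(fit)      # predicted values Y'
error <- y - y_hat        # Error = Y - Y'
sse   <- sum(error^2)     # sum of squared errors of prediction

# Any other straight line gives a sum of squared errors that is at least as large
sse_other <- sum((y - (5 + 2.2 * x))^2)

sse
sse_other
```

For a model fitted with `lm`, `deviance(fit)` returns the same sum of squared errors.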
Linear regression models, although often considered the powerhouse of statistics, come with limitations and rest on the following assumptions:
1. Linearity: there is a linear relationship between the target variable and the independent variable(s).
2. Normality of errors: the errors are normally distributed.
3. Homoscedasticity: the errors have constant variance, i.e. they are scattered randomly with no visible pattern.
4. No multicollinearity: no independent variables are strongly correlated with each other.
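Two of these assumptions can be checked with base R alone. The sketch below uses the built-in `mtcars` data rather than the life expectancy data, and computes the variance inflation factor (VIF) by hand; the choice of predictors is arbitrary. The `lmtest` and `car` packages provide `bptest()` and `vif()` for the same kind of checks.

```r
# Illustrative model on a built-in dataset (not the data used in this article)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Normality of errors: Shapiro-Wilk test on the residuals
normality <- shapiro.test(residuals(fit))
normality$p.value

# Multicollinearity: VIF of each predictor, computed as 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on the other predictors
predictors <- c("wt", "hp", "disp")
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; larger values point to multicollinearity.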
We begin by loading the packages that are required throughout the analysis.
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(tidyverse)
library(scales)
library(caret)
library(psych)
library(stats)
library(leaps)
library(GGally)
library(MASS)
library(lmtest)
library(car)
library(MLmetrics)
The descriptions of the packages are in the table below.
| Packages | Description |
|---|---|
| data.table | Fast data manipulation that can greatly reduce programming and compute time |
| DT | An R interface to the DataTables library |
| kableExtra | Styling of tables within Markdown |
| knitr | A general-purpose tool for dynamic report generation |
| tidyverse | Collection of R packages (tidyr, dplyr, ggplot2) designed for data science that works harmoniously with other packages |
| tidyr | Changing the layout of the data sets, to convert data into a tidy format |
| dplyr | For data manipulation |
| ggplot2 | Customizable graphical representation |
| caret | For data Pre-Processing and Feature Selection |
| psych | Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis |
| stats | Contains functions for statistical calculations and random number generation. |
| leaps | Regression subset selection, including exhaustive search |
| GGally | Extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data. |
| MASS | Functions to support Venables and Ripley, "Modern Applied Statistics with S" (4th edition, 2002) |
| lmtest | A collection of tests for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided. |
| car | Functions to Accompany J. Fox and S. Weisberg, An R Companion to Applied Regression, Third Edition, Sage, 2019. |
| MLmetrics | A collection of evaluation metrics, including loss, score and utility functions, that measure regression, classification and ranking performance. |
Now, let’s load the dataset into the R-Environment.
This project aims to build the best model for predicting Life Expectancy based on the Global Health Observatory (GHO) dataset.
This project uses the Life Expectancy (WHO) dataset published by Kumar Rajarshi on kaggle.com. The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.
The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project we consider data from the years 2000-2015 for 193 countries. The individual data files have been merged into a single dataset.
The dataset is in .csv format, so we will use the read.csv function to read it.
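As a rough sketch of this step, the example below writes a tiny CSV file to a temporary location and reads it back with `read.csv`. The file contents are a made-up excerpt and the path is temporary; the real file name is not given in this article, so substitute your own path. Only the name `raw.data` is taken from the code used later on.

```r
# Write a tiny, made-up CSV file to a temporary location (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Year,Status,Life expectancy",
  "Afghanistan,2015,Developing,65.0",
  "Afghanistan,2014,Developing,59.9",
  "Albania,2015,Developing,"
), csv_path)

# Read it; by default read.csv turns the header "Life expectancy" into the
# syntactic name "Life.expectancy" and reads the empty numeric field as NA
raw.data <- read.csv(csv_path, stringsAsFactors = FALSE)

str(raw.data)
```

This also shows where the dotted column names such as `Life.expectancy` come from: spaces in the CSV header are replaced by dots.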
After importing the dataset, let’s take a peek at it!
The dataset has 2,938 rows and 22 columns.
## Rows: 2,938
## Columns: 22
## $ Country <chr> "Afghanistan", "Afghanistan", "Afgh...
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010,...
## $ Status <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8,...
## $ Adult.Mortality <int> 263, 271, 268, 272, 275, 279, 281, ...
## $ infant.deaths <int> 62, 64, 66, 69, 71, 74, 77, 80, 82,...
## $ Alcohol <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,...
## $ percentage.expenditure <dbl> 71.279624, 73.523582, 73.219243, 78...
## $ Hepatitis.B <int> 65, 62, 64, 67, 68, 66, 63, 64, 63,...
## $ Measles <int> 1154, 492, 430, 2787, 3013, 1989, 2...
## $ BMI <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7,...
## $ under.five.deaths <int> 83, 86, 89, 93, 97, 102, 106, 110, ...
## $ Polio <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, ...
## $ Total.expenditure <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20,...
## $ Diphtheria <int> 65, 62, 64, 67, 68, 66, 63, 64, 63,...
## $ HIV.AIDS <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, ...
## $ GDP <dbl> 584.25921, 612.69651, 631.74498, 66...
## $ Population <dbl> 33736494, 327582, 31731688, 3696958...
## $ thinness..1.19.years <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4,...
## $ thinness.5.9.years <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4,...
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, ...
## $ Schooling <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9...
Let’s see how many missing values there are in each column.
# Counting missing values in each column
numMissVal <- sapply(raw.data, function(x) sum(is.na(x)))
# Result table
kable(as.data.frame(numMissVal)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "250px")

| Column | numMissVal |
|---|---|
| Country | 0 |
| Year | 0 |
| Status | 0 |
| Life.expectancy | 10 |
| Adult.Mortality | 10 |
| infant.deaths | 0 |
| Alcohol | 194 |
| percentage.expenditure | 0 |
| Hepatitis.B | 553 |
| Measles | 0 |
| BMI | 34 |
| under.five.deaths | 0 |
| Polio | 19 |
| Total.expenditure | 226 |
| Diphtheria | 19 |
| HIV.AIDS | 0 |
| GDP | 448 |
| Population | 652 |
| thinness..1.19.years | 34 |
| thinness.5.9.years | 34 |
| Income.composition.of.resources | 167 |
| Schooling | 163 |
Observation findings:
Some columns need to be converted to factors.
Missing values can severely distort the distribution of the data, and there is no single best way to deal with them. Removing columns or rows with missing values can bias the analysis, and imputation does not necessarily give better results either.
Alvira Swalin gives a more detailed explanation of how to handle missing data on towardsdatascience.com. The methods for handling missing values are summarized in the flowchart:
Following the flowchart, we treat the missing values by imputation, since all the columns with missing values are continuous. We impute either the mean or the median of a column, depending on how many outliers it has: if there are many outliers, it is best to use the median; if there are few, we can use the mean.
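The rule above can be sketched as a small helper function. This is only an illustration: the function name `impute_by_outliers` and the 1% threshold are assumptions, while in this article the choice between mean and median is made by looking at the boxplots.

```r
# Hypothetical helper: impute the median if the share of boxplot outliers
# exceeds a threshold (an assumed 1%), otherwise impute the mean
impute_by_outliers <- function(x, max_outlier_share = 0.01) {
  outliers <- boxplot.stats(x)$out              # points beyond the whiskers
  share    <- length(outliers) / sum(!is.na(x))
  fill <- if (share > max_outlier_share) {
    median(x, na.rm = TRUE)
  } else {
    mean(x, na.rm = TRUE)
  }
  x[is.na(x)] <- fill
  x
}

# Toy vectors, for illustration only
with_outliers    <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000, NA)
without_outliers <- c(1, 2, 3, 4, 5, NA)

impute_by_outliers(with_outliers)     # NA replaced by the median
impute_by_outliers(without_outliers)  # NA replaced by the mean
```

The median is preferred when outliers are present because a single extreme value, such as the 1000 above, pulls the mean far away from the bulk of the data.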
par(mfrow=c(1,3))
boxplot(raw.data$Life.expectancy,
ylab = "Life Expectancy",
main = "Boxplot of Life Expectancy")
boxplot(raw.data$Adult.Mortality,
ylab = "Adult Mortality",
main = "Boxplot of Adult Mortality")
boxplot(raw.data$Alcohol,
ylab = "Alcohol",
main = "Boxplot of Alcohol")
par(mfrow=c(1,3))
boxplot(raw.data$Hepatitis.B,
ylab = "Hepatitis B",
main = "Boxplot of Hepatitis B")
boxplot(raw.data$BMI,
ylab = "BMI",
main = "Boxplot of BMI")
boxplot(raw.data$Polio,
ylab = "Polio",
main = "Boxplot of Polio")
par(mfrow=c(1,3))
boxplot(raw.data$Total.expenditure,
ylab = "Total Expenditure",
main = "Boxplot of Total Expenditure")
boxplot(raw.data$Diphtheria,
ylab = "Diphtheria",
main = "Boxplot of Diphtheria")
boxplot(raw.data$GDP,
ylab = "GDP",
main = "Boxplot of GDP")
par(mfrow=c(1,3))
boxplot(raw.data$Population,
ylab = "Population",
main = "Boxplot of Population")
boxplot(raw.data$thinness..1.19.years,
ylab = "Thinness 1-19 years",
main = "Boxplot of Thinness for 1-19 years old")
boxplot(raw.data$thinness.5.9.years,
ylab = "Thinness 5-9 years",
main = "Boxplot of Thinness for 5-9 years old")
par(mfrow=c(1,3))
boxplot(raw.data$Income.composition.of.resources,
ylab = "Income Composition",
main = "Boxplot of Income Composition")
boxplot(raw.data$Schooling,
ylab = "Schooling",
main = "Boxplot of Schooling")
Observation findings:
The columns Alcohol, BMI, and Income.composition.of.resources have few outliers.
We impute the median for the remaining columns with missing values, because these columns have many outliers.
# Find median value
life_median <- median(raw.data$Life.expectancy, na.rm = TRUE)
mortality_median <- median(raw.data$Adult.Mortality, na.rm = TRUE)
hepatitis_median <- median(raw.data$Hepatitis.B, na.rm = TRUE)
polio_median <- median(raw.data$Polio, na.rm = TRUE)
diph_median <- median(raw.data$Diphtheria, na.rm = TRUE)
exp_median <- median(raw.data$Total.expenditure, na.rm = TRUE)
gdp_median <- median(raw.data$GDP, na.rm = TRUE)
pop_median <- median(raw.data$Population, na.rm = TRUE)
thin19_median <- median(raw.data$thinness..1.19.years, na.rm = TRUE)
thin9_median <- median(raw.data$thinness.5.9.years, na.rm = TRUE)
school_median <- median(raw.data$Schooling, na.rm = TRUE)
Then replace the missing values with the median of the corresponding columns.
raw.data$Life.expectancy[is.na(raw.data$Life.expectancy)] <- life_median
raw.data$Adult.Mortality[is.na(raw.data$Adult.Mortality)] <- mortality_median
raw.data$Hepatitis.B[is.na(raw.data$Hepatitis.B)] <- hepatitis_median
raw.data$Polio[is.na(raw.data$Polio)] <- polio_median
raw.data$Diphtheria[is.na(raw.data$Diphtheria)] <- diph_median
raw.data$Total.expenditure[is.na(raw.data$Total.expenditure)] <- exp_median
raw.data$GDP[is.na(raw.data$GDP)] <- gdp_median
raw.data$Population[is.na(raw.data$Population)] <- pop_median
raw.data$thinness..1.19.years[is.na(raw.data$thinness..1.19.years)] <- thin19_median
raw.data$thinness.5.9.years[is.na(raw.data$thinness.5.9.years)] <- thin9_median
raw.data$Schooling[is.na(raw.data$Schooling)] <- school_median
Next, we find the mean value for the Alcohol, BMI, and Income.composition.of.resources columns. These columns don’t have many outliers.
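alcohol_mean <- mean(raw.data$Alcohol, na.rm = TRUE)
bmi_mean <- mean(raw.data$BMI, na.rm = TRUE)
income_mean <- mean(raw.data$Income.composition.of.resources, na.rm = TRUE)
Then replace the missing values with the average of the corresponding columns.
raw.data$Alcohol[is.na(raw.data$Alcohol)] <- alcohol_mean
raw.data$BMI[is.na(raw.data$BMI)] <- bmi_mean
raw.data$Income.composition.of.resources[is.na(raw.data$Income.composition.of.resources)] <- income_mean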
Here’s the cleaned data set:
The variables of the dataset are summarized in the table below.
| No. | Variable | Class | Description |
|---|---|---|---|
| 1 | Country | factor | Country name |
| 2 | Year | numeric | Year of the data |
| 3 | Status | factor | Country status of developed or developing |
| 4 | Life.expectancy | numeric | Life expectancy in years |
| 5 | Adult.Mortality | numeric | Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population) |
| 6 | infant.deaths | numeric | Number of Infant Deaths per 1000 population |
| 7 | Alcohol | numeric | Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) |
| 8 | percentage.expenditure | numeric | Expenditure on health as a percentage of Gross Domestic Product per capita(%) |
| 9 | Hepatitis.B | numeric | Hepatitis B (HepB) immunization coverage among 1-year-olds (%) |
| 10 | Measles | numeric | Measles - number of reported cases per 1000 population |
| 11 | BMI | numeric | Average Body Mass Index of entire population |
| 12 | under.five.deaths | numeric | Number of under-five deaths per 1000 population |
| 13 | Polio | numeric | Polio (Pol3) immunization coverage among 1-year-olds (%) |
| 14 | Total.expenditure | numeric | General government expenditure on health as a percentage of total government expenditure (%) |
| 15 | Diphtheria | numeric | Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%) |
| 16 | HIV.AIDS | numeric | Deaths per 1 000 live births HIV/AIDS (0-4 years) |
| 17 | GDP | numeric | Gross Domestic Product per capita (in USD) |
| 18 | Population | numeric | Population of the country |
| 19 | thinness..1.19.years | numeric | Prevalence of thinness among children and adolescents for Age 10 to 19 (%) |
| 20 | thinness.5.9.years | numeric | Prevalence of thinness among children for Age 5 to 9 (%) |
| 21 | Income.composition.of.resources | numeric | Human Development Index in terms of income composition of resources (index ranging from 0 to 1) |
| 22 | Schooling | numeric | Number of years of Schooling(years) |
In this case study, we use Life.expectancy as our target variable. Let’s do EDA on the target variable.
par(mfrow=c(1,2))
# target variable
# histogram
hist(clean_data$Life.expectancy,
main = "Life Expectancy Distribution",
xlab = "Life Expectancy(yrs)")
# kernel density plot with a vertical indication of location of the mean
plot(density(clean_data$Life.expectancy),
main = "Distribution of Life Expectancy",
xlab = "Life Expectancy (yrs)")
abline(v=mean(clean_data$Life.expectancy))
Observation findings:
The target variable Life.expectancy is not perfectly normally distributed; it is slightly left-skewed.
The unit of Life Expectancy is years.
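Skewness can also be checked numerically instead of only by eye. The sketch below defines a simple moment-based skewness function in base R and applies it to two made-up vectors; the function name `skewness` and the toy data are assumptions for illustration. A negative value indicates a left skew and a positive value a right skew.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy vectors, for illustration only
left_skewed  <- c(1, 8, 9, 9, 10, 10, 10)   # long tail towards small values
right_skewed <- c(1, 1, 1, 2, 2, 3, 10)     # long tail towards large values

skewness(left_skewed)    # negative
skewness(right_skewed)   # positive
```

The same function could be applied to a column such as `clean_data$Life.expectancy` to back up the visual impression from the histogram.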
par(mfrow=c(2,2))
layout(matrix(c(1,1,2,3), 2, 2, byrow = F),
widths=c(1,1), heights=c(1,1))
boxplot(clean_data$Alcohol,
main = "Alcohol consumption") # box plot
plot(density(clean_data$Alcohol),
main = "Distribution of Alcohol consumed",
xlab = "Alcohol(litres)") # kernel density plot
# square-root transform to reduce the skew of the density plot
plot(density(clean_data$Alcohol^0.5),
main = "Distribution of sqrt(Alcohol consumed)",
xlab = "sqrt(Alcohol(litres))") # kernel density plot of the transformed values
Observation findings:
The predictor variable Alcohol is not normally distributed. It is highly right-skewed.
The outliers are not data errors; they are extreme but genuine values that arise because some countries have a very high GDP while others have a very low GDP. Thus, they should not be removed.
Alcohol and GDP are significantly correlated, with a correlation coefficient of 0.31 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$Alcohol, clean_data$GDP)
##
## Pearson's product-moment correlation
##
## data: clean_data$Alcohol and clean_data$GDP
## t = 17.831, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2795771 0.3448433
## sample estimates:
## cor
## 0.3125791
par(mfrow=c(2,2))
layout(matrix(c(1,1,2,3), 2, 2, byrow = F),
widths=c(1,1), heights=c(1,1))
boxplot(clean_data$under.five.deaths,
main = "Under Five Year Old Deaths") # box plot
plot(density(clean_data$under.five.deaths),
main = "Distribution / 1000 Population",
xlab = "Under Five Year Old Deaths(cnt)") # kernel density plot
# square-root transform to reduce the skew of the density plot
plot(density(clean_data$under.five.deaths^0.5),
main = "Distribution Rate / 1000 Population (sqrt)",
xlab = "sqrt(Under Five Year Old Deaths rate)") # kernel density plot of the transformed values
Observation findings:
The predictor variable under.five.deaths is not normally distributed. It is highly right-skewed.
The outliers are not data errors; they are extreme but genuine values that arise because some countries have a very high GDP while others have a very low GDP. Thus, they should not be removed.
under.five.deaths and GDP are significantly correlated, with a correlation coefficient of -0.11 and a p-value of \(8.19 \times 10^{-9}\).
cor.test(clean_data$under.five.deaths, clean_data$GDP)
##
## Pearson's product-moment correlation
##
## data: clean_data$under.five.deaths and clean_data$GDP
## t = -5.7813, df = 2936, p-value = 0.000000008194
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14171143 -0.07020008
## sample estimates:
## cor
## -0.1060929
par(mfrow=c(1,2))
boxplot(clean_data$percentage.expenditure,
main = "Percentage expenditure") # box plot
plot(density(clean_data$percentage.expenditure),
main = "% Expenditure on health",
xlab = "Percentage expenditure(%)") # kernel density plot
Observation findings:
The predictor variable percentage.expenditure is not normally distributed. It is heavily right-skewed.
The outliers are not data errors; they are extreme but genuine values that arise because some countries have a very high GDP while others have a very low GDP. Thus, they should not be removed.
percentage.expenditure and GDP are significantly correlated, with a correlation coefficient of 0.90 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$percentage.expenditure, clean_data$GDP)
##
## Pearson's product-moment correlation
##
## data: clean_data$percentage.expenditure and clean_data$GDP
## t = 113.08, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8948392 0.9083581
## sample estimates:
## cor
## 0.9018191
par(mfrow=c(1,2))
boxplot(clean_data$Polio,
main = "Polio Immunization ") # box plot
plot(density(clean_data$Polio),
main = "% Polio Immunization Coverage",
xlab = "Polio Immunization (%)") # kernel density plot
Observation findings:
The predictor variable Polio is not normally distributed. It is heavily left-skewed.
The outliers are not data errors; they are extreme but genuine values that arise because some countries have a very high GDP while others have a very low GDP. Thus, they should not be removed.
Polio and GDP are significantly correlated, with a correlation coefficient of 0.19 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$Polio, clean_data$GDP)
##
## Pearson's product-moment correlation
##
## data: clean_data$Polio and clean_data$GDP
## t = 10.482, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1548269 0.2245457
## sample estimates:
## cor
## 0.1899257
par(mfrow=c(1,2))
boxplot(clean_data$thinness..1.19.years,
main = "Prevalence of thinness ") # box plot
plot(density(clean_data$thinness..1.19.years),
main = "% Prevalence of thinness",
xlab = "Prevalence of thinness (%)") # kernel density plot
Observation findings:
The predictor variable thinness..1.19.years is not normally distributed. It is right-skewed.
The outliers are not data errors; they are extreme but genuine values that arise because some countries have a very high GDP while others have a very low GDP. Thus, they should not be removed.
thinness..1.19.years and GDP are significantly correlated, with a correlation coefficient of -0.26 and a p-value below \(2.2 \times 10^{-16}\).
cor.test(clean_data$thinness..1.19.years, clean_data$GDP)
##
## Pearson's product-moment correlation
##
## data: clean_data$thinness..1.19.years and clean_data$GDP
## t = -14.79, df = 2936, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2966604 -0.2293449
## sample estimates:
## cor
## -0.2633231
We have plotted four of the predictor variables against the target variable to show how they relate to it overall.
# life expectancy vs. income composition - positively correlated
plot(y = clean_data$Life.expectancy,
x = clean_data$Income.composition.of.resources,
main = "Life Expectancy vs. Income compositions",
xlab = "Income composition of resources",
ylab = "Life Expectancy",
pch = 19,
col = "yellowgreen")
abline(60,1,
col = "red") # 45 degree line (line with slope 1)
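Observation findings:
The Life.expectancy and Income.composition.of.resources are positively correlated.
The red line in the plot is a reference line with slope 1 (a 45 degree line only if both axes share the same scale). It does not represent a correlation of 1: the slope of a line says nothing about the strength of a correlation. The scatter of the points shows that the correlation is clearly less than 1.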
plot(y = clean_data$Life.expectancy,
x = clean_data$Schooling,
main = "Life Expectancy vs. Schooling",
xlab = "Schooling",
ylab = "Life Expectancy",
pch = 19,
col = "rosybrown1")
abline(50,1,
col = "red") # 45 degree line (line with slope 1)
Observation findings:
The Life.expectancy and Schooling are positively correlated.
The red line in the plot is a reference line with slope 1 (a 45 degree line only if both axes share the same scale). It does not represent a correlation of 1. The scatter of the points shows that the correlation is clearly less than 1.
plot(y = clean_data$Life.expectancy,
x = clean_data$Adult.Mortality,
main = "Life Expectancy vs. Adult Mortality",
xlab = "Adult Mortality",
ylab = "Life Expectancy",
pch = 19,
col = "mediumpurple1")
abline(80, - 1,
col = "red") # 135 degree line (line with slope -1)
Observation findings:
The Life.expectancy and Adult.Mortality are negatively correlated.
The red line in the plot is a reference line with slope -1 (a 135 degree line only if both axes share the same scale). It does not represent a correlation of -1. The scatter of the points shows that the correlation is clearly not a perfect -1.
plot(y = clean_data$Life.expectancy,
x = clean_data$Population,
main = "Life Expectancy vs. Population",
xlab = "Population",
ylab = "Life Expectancy",
pch = 19,
col = "lightsteelblue3")
abline(50,1,
col = "red") # 45 degree line (line with slope 1)
Observation findings:
The Life.expectancy and Population are not really correlated.
The red line in the plot is a reference line with slope 1. It does not represent a correlation of 1. The points show no visible linear pattern, so the correlation is negligible.
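A reference line with a fixed slope is only a visual guide. To judge the relationship itself, it can help to draw the fitted regression line and to report the Pearson coefficient. The sketch below uses the built-in `mtcars` data rather than the life expectancy data, and also illustrates that rescaling a variable changes the slope but not the correlation.

```r
# Illustrative data from a built-in dataset (not the data used in this article)
x <- mtcars$wt
y <- mtcars$mpg

# Strength and direction of the linear relationship
r <- cor(x, y, method = "pearson")

# Fitted least-squares line
fit <- lm(y ~ x)

plot(x, y,
     main = "mpg vs. weight",
     xlab = "Weight",
     ylab = "Miles per gallon",
     pch = 19)
abline(fit, col = "red")   # fitted regression line, not a fixed-slope guide

# Rescaling x changes the slope of the fitted line but leaves r unchanged
r_rescaled     <- cor(x * 1000, y)
slope          <- unname(coef(fit)[2])
slope_rescaled <- unname(coef(lm(y ~ I(x * 1000)))[2])
```

Here `slope_rescaled` is a thousand times smaller than `slope`, while `r_rescaled` equals `r`, which is why the slope of a line cannot be read as a correlation.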
We need to check the linear relationship between the target variable and the predictor variables (independent variables).
Let’s find the correlations between the target variable Life.expectancy and the first 5 predictors, i.e. Adult.Mortality, infant.deaths, Alcohol, percentage.expenditure, and Hepatitis.B.
# check correlations of the target variable with the first 5 predictors using Pearson correlation
pairs.panels(clean_data[,4:9],
method = "pearson", # correlation method
hist.col = "green",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
Next, let’s find the correlations between the target variable Life.expectancy and the next 5 predictors, i.e. Measles, BMI, under.five.deaths, Polio, and Total.expenditure.
# check correlations of the target variable with the next 5 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,10:14)],
method = "pearson", # correlation method
hist.col = "green",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
Next, let’s find the correlations between the target variable Life.expectancy and the next 5 predictors, i.e. Diphtheria, HIV.AIDS, GDP, Population, and thinness..1.19.years.
# check correlations of the target variable with the next 5 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,15:19)],
method = "pearson", # correlation method
hist.col = "green",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
Next, let’s find the correlations between the target variable Life.expectancy and the last 3 predictors, i.e. thinness.5.9.years, Income.composition.of.resources, and Schooling.
# check correlations of the target variable with the last 3 predictors using Pearson correlation
pairs.panels(clean_data[,c(4,20:22)],
method = "pearson", # correlation method
hist.col = "green",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
Following the interpretation of r values given by Schober & Boer, the predictors can be labeled by the strength of their correlation with the target variable:
Observation findings:
The target variable Life.expectancy is strongly correlated to Schooling, Adult.Mortality, and Income.composition.of.resources as indicated by the Pearson correlation.
According to the Pearson correlation, the target variable Life Expectancy has a moderate correlation to BMI, HIV.AIDS, Diphtheria, thinness..1.19.years, Polio, thinness.5.9.years, and GDP.
According to the Pearson correlation, the target variable Life Expectancy has a weak correlation to Alcohol, percentage.expenditure, under.five.deaths, Total.expenditure, and infant.deaths.
According to the Pearson correlation, the target variable Life Expectancy has a very weak correlation to Hepatitis.B, Measles, and Population.
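Labeling correlations by strength can be automated with `cut()`. The sketch below uses the built-in `mtcars` data, and the cut-offs 0.2, 0.4, 0.6 and 0.8 are an assumed convention chosen for illustration; they may differ from the thresholds applied in this article.

```r
# Pearson correlation of every other variable with the target (here: mpg)
r <- cor(mtcars)[, "mpg"]
r <- r[names(r) != "mpg"]

# Assumed cut-offs on the absolute value of r (illustrative convention only)
strength <- cut(abs(r),
                breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                labels = c("very weak", "weak", "moderate", "strong", "very strong"),
                include.lowest = TRUE)

# Table of predictors sorted by absolute correlation
labels_table <- data.frame(predictor = names(r),
                           r = round(unname(r), 2),
                           strength = strength)
labels_table[order(-abs(labels_table$r)), ]
```

Replacing `mtcars` and `"mpg"` with the numeric columns of `clean_data` and `"Life.expectancy"` would produce the corresponding table for this dataset.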
Let’s build a few models and make predictions based on the dataset!
Let’s build a model without any predictors!
# baseline model with no predictors
nullModel <- lm(Life.expectancy ~ 1,
data = clean_data)
# check the model
summary(nullModel)
##
## Call:
## lm(formula = Life.expectancy ~ 1, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.935 -6.035 2.865 6.365 19.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.2347 0.1754 394.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.509 on 2937 degrees of freedom
The model is set! Let’s make a prediction based on the nullModel!
Let’s interpret the nullModel!
Observation findings:
The nullModel has no predictors.
With no predictors, the nullModel predicts every value as the mean of the target variable, which is exactly the intercept (69.23).
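This property of an intercept-only model is easy to verify. The sketch below uses the built-in `mtcars` data rather than the life expectancy data; the object name `null_fit` is made up for illustration.

```r
# Intercept-only model on a built-in dataset (illustration only)
null_fit <- lm(mpg ~ 1, data = mtcars)

intercept   <- unname(coef(null_fit))   # the single estimated parameter
target_mean <- mean(mtcars$mpg)

# Every fitted value is the mean of the target variable
fitted_values <- unname(fitted(null_fit))

# The residual standard error equals the standard deviation of the target
rse <- sigma(null_fit)

c(intercept = intercept, mean = target_mean, rse = rse, sd = sd(mtcars$mpg))
```

The same reasoning explains the output above: the intercept of the nullModel is the mean life expectancy, and its residual standard error is the standard deviation of Life.expectancy.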
Let’s build a model with all predictors, except Country!
# full model with all predictors, except Country
fullModel <- lm(Life.expectancy ~ . - Country,
data = clean_data)
# check the model
summary(fullModel)
##
## Call:
## lm(formula = Life.expectancy ~ . - Country, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2860 -2.2437 -0.0817 2.3740 16.4282
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 81.57371776468257 34.65542497468053 2.354
## Year -0.01242470778379 0.01732564244728 -0.717
## StatusDeveloping -1.60345349486797 0.27022829393151 -5.934
## Adult.Mortality -0.01981086120344 0.00079540789532 -24.907
## infant.deaths 0.09932330190305 0.00842868251499 11.784
## Alcohol 0.06135257680996 0.02608460875382 2.352
## percentage.expenditure 0.00003237640088 0.00009046842117 0.358
## Hepatitis.B -0.01660284006457 0.00371935046024 -4.464
## Measles -0.00001913469438 0.00000765428319 -2.500
## BMI 0.04455147570409 0.00493230714916 9.033
## under.five.deaths -0.07437771658324 0.00617691845324 -12.041
## Polio 0.02865527021677 0.00444805095607 6.442
## Total.expenditure 0.07444995871406 0.03436772061263 2.166
## Diphtheria 0.04078688657161 0.00464496059143 8.781
## HIV.AIDS -0.47215145572195 0.01764488354663 -26.759
## GDP 0.00004260320835 0.00001378343058 3.091
## Population 0.00000000001346 0.00000000168702 0.008
## thinness..1.19.years -0.08195313416986 0.05028535403442 -1.630
## thinness.5.9.years 0.00847794195443 0.04956902498094 0.171
## Income.composition.of.resources 5.83669846548877 0.64052133954243 9.112
## Schooling 0.64795815007715 0.04176733243002 15.514
## Pr(>|t|)
## (Intercept) 0.01865 *
## Year 0.47335
## StatusDeveloping 0.000000003311 ***
## Adult.Mortality < 0.0000000000000002 ***
## infant.deaths < 0.0000000000000002 ***
## Alcohol 0.01874 *
## percentage.expenditure 0.72046
## Hepatitis.B 0.000008352972 ***
## Measles 0.01248 *
## BMI < 0.0000000000000002 ***
## under.five.deaths < 0.0000000000000002 ***
## Polio 0.000000000137 ***
## Total.expenditure 0.03037 *
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## GDP 0.00201 **
## Population 0.99364
## thinness..1.19.years 0.10326
## thinness.5.9.years 0.86421
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.045 on 2917 degrees of freedom
## Multiple R-squared: 0.8203, Adjusted R-squared: 0.819
## F-statistic: 665.6 on 20 and 2917 DF, p-value: < 0.00000000000000022
The model is set! Let’s make a prediction based on the fullModel!
Let’s interpret the fullModel!
Observation findings:
In the fullModel, Income.composition.of.resources has the largest coefficient estimate (5.837), followed by StatusDeveloping (-1.603). Raw coefficients are expressed per unit of each predictor, so their sizes are not directly comparable across predictors measured on different scales.
Holding the other predictors fixed, Income.composition.of.resources has the largest positive association with Life.expectancy per unit of change.
On the other hand, StatusDeveloping has the most negative coefficient: Developing status is associated with a Life.expectancy about 1.6 lower than the reference Status level.
The p-values for Adult.Mortality, infant.deaths, BMI, under.five.deaths, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest in the model (each below 0.0000000000000002), indicating that they are highly significant predictors of Life.expectancy.
The fullModel has an R-squared of 0.8203, meaning it explains about 82% of the variance in Life.expectancy.
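Because raw coefficients depend on each predictor's units, ranking predictors by raw estimate can mislead. One common remedy, sketched here, is to refit on standardized variables so that every coefficient is expressed in standard-deviation units. The sketch uses R's built-in mtcars data as a stand-in for clean_data, and the names fit_raw, fit_std, and std_data are illustrative rather than part of the original analysis.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Standardize the response and predictors to mean 0 and standard deviation 1.
std_data <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_std  <- lm(mpg ~ wt + hp, data = std_data)

# Raw estimates are per unit of each predictor;
# standardized estimates are per standard deviation of each predictor.
raw_coef <- coef(fit_raw)[c("wt", "hp")]
std_coef <- coef(fit_std)[c("wt", "hp")]

# The two are linked by beta_std = beta_raw * sd(x) / sd(y).
rescaled <- raw_coef * c(sd(mtcars$wt), sd(mtcars$hp)) / sd(mtcars$mpg)

print(round(cbind(raw = raw_coef, standardized = std_coef), 4))
```

Standardizing this way applies to numeric predictors only; a factor such as Status would need separate handling.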
Here is the Actual vs Predicted Plot of fullModel. The green line represents a perfect prediction, while the red line represents the regression line.
# actual vs predicted
plot(y = fullModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Full model",
     xlab = "Actual",
     ylab = "Predicted(fullModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(fullModel$fitted.values ~ fullModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the plot of the Actual vs Predicted of fullModel with Confidence Interval.
The blue line is a linear fit through the plotted points, and the grey shade around it is the 95% confidence band for that line. The dashed red lines above and below are the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within the 95% prediction interval.
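Confidence and prediction intervals are easy to mix up, so a small side-by-side comparison may help. The sketch below fits a one-predictor model on the built-in mtcars data (a stand-in for clean_data) and requests both interval types from predict(); the names fit, new_cars, conf_int, and pred_int are illustrative.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
fit <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))  # hypothetical new observations

# Confidence interval: uncertainty about the mean response at each wt.
conf_int <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)
# Prediction interval: uncertainty about a single new observation at each wt.
pred_int <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)

# Both share the same fitted value, but the prediction interval is wider
# because it also accounts for the residual variance.
width_ci <- conf_int[, "upr"] - conf_int[, "lwr"]
width_pi <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(fit = conf_int[, "fit"], width_ci, width_pi))
```

In the plotting code for this model, lwr and upr come from predict() with interval = "prediction", so the dashed lines are prediction bounds, while the shaded band drawn by geom_smooth() is a confidence band.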
# predict Life expectancy with 95% prediction intervals (columns fit, lwr, upr)
predictedLE1 <- predict(fullModel, interval = "prediction")
# combine the actual data and predicted data
comb1 <- cbind.data.frame(clean_data, predictedLE1)
# Plotting the combined data
ggplot(comb1, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  # stat_smooth() and geom_smooth() draw the same layer, so one call is enough
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Actual vs. Predicted for FullModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
Let’s build a model with the predictors that are strongly correlated with the target variable! The predictors are Schooling, Adult.Mortality, and Income.composition.of.resources.
# baseline model with the predictors that are strongly correlated with the target variable
EDAModel <- lm(Life.expectancy ~ Schooling + Adult.Mortality + Income.composition.of.resources,
               data = clean_data)
# check the model
summary(EDAModel)
##
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality +
## Income.composition.of.resources, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.541 -1.933 0.374 2.619 23.151
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.5263202 0.4782747 118.19
## Schooling 0.9883753 0.0483207 20.45
## Adult.Mortality -0.0345377 0.0008571 -40.30
## Income.composition.of.resources 10.4014206 0.7731076 13.45
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Schooling <0.0000000000000002 ***
## Adult.Mortality <0.0000000000000002 ***
## Income.composition.of.resources <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.114 on 2934 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.7107
## F-statistic: 2406 on 3 and 2934 DF, p-value: < 0.00000000000000022
The model is set! Let’s make a prediction based on the EDAModel!
Let’s interpret the EDAModel!
##
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality +
## Income.composition.of.resources, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.541 -1.933 0.374 2.619 23.151
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.5263202 0.4782747 118.19
## Schooling 0.9883753 0.0483207 20.45
## Adult.Mortality -0.0345377 0.0008571 -40.30
## Income.composition.of.resources 10.4014206 0.7731076 13.45
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Schooling <0.0000000000000002 ***
## Adult.Mortality <0.0000000000000002 ***
## Income.composition.of.resources <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.114 on 2934 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.7107
## F-statistic: 2406 on 3 and 2934 DF, p-value: < 0.00000000000000022
Observation findings:
In the EDAModel, Income.composition.of.resources has the largest coefficient estimate (10.4).
Holding the other predictors fixed, Income.composition.of.resources has the largest positive association with Life.expectancy per unit of change.
The p-values of all predictors are far below 0.05, indicating that they are highly significant predictors of Life.expectancy.
The EDAModel has an R-squared of 0.711, meaning it explains about 71.1% of the variance in Life.expectancy.
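To make the R-squared reading concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on the built-in mtcars data as a stand-in for clean_data; the variable names are illustrative.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares: variation left unexplained by the model.
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of the response around its mean.
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of the total variation that the model explains.
r2_manual  <- 1 - rss / tss
r2_summary <- summary(fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```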
Here is the Actual vs Predicted Plot of EDAModel. The green line represents a perfect prediction, while the red line represents the regression line.
# actual vs predicted
plot(y = EDAModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using EDA model",
     xlab = "Actual",
     ylab = "Predicted(EDAModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(EDAModel$fitted.values ~ EDAModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the plot of the Actual vs Predicted of EDAModel with Confidence Interval.
The blue line is a linear fit through the plotted points, and the grey shade around it is the 95% confidence band for that line. The dashed red lines above and below are the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within the 95% prediction interval.
# predict Life expectancy with 95% prediction intervals (columns fit, lwr, upr)
predictedLE2 <- predict(EDAModel, interval = "prediction")
# combine the actual data and predicted data
comb2 <- cbind.data.frame(clean_data, predictedLE2)
# Plotting the combined data
ggplot(comb2, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  # stat_smooth() and geom_smooth() draw the same layer, so one call is enough
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Actual vs. Predicted for EDAModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
Let’s build a model using stepwise regression with the backward step!
The predictors of the BackwardStepModel are Status, Adult.Mortality, infant.deaths, Alcohol, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, GDP, thinness..1.19.years, Income.composition.of.resources, and Schooling.
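The call that produced the trace that follows is not shown in this write-up. A backward search is typically started from the full model with step(), roughly as sketched here. The sketch runs on the built-in mtcars data, and the names full_fit and backward_fit are illustrative; the original presumably started from fullModel, but that is an assumption.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
# Start from a model that contains every candidate predictor.
full_fit <- lm(mpg ~ wt + hp + qsec + am + drat, data = mtcars)

# Backward elimination: at each step drop the term whose removal lowers AIC the most,
# and stop when no removal lowers AIC further. trace = 0 suppresses the step-by-step log.
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# Terms that survived the search, and the AIC before and after.
print(formula(backward_fit))
print(c(start = extractAIC(full_fit)[2], final = extractAIC(backward_fit)[2]))
```

Leaving trace at its default of 1 prints a log in the same layout as the trace shown for the BackwardStepModel.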
## Start: AIC=8232.93
## Life.expectancy ~ (Country + Year + Status + Adult.Mortality +
## infant.deaths + Alcohol + percentage.expenditure + Hepatitis.B +
## Measles + BMI + under.five.deaths + Polio + Total.expenditure +
## Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years +
## thinness.5.9.years + Income.composition.of.resources + Schooling) -
## Country
##
## Df Sum of Sq RSS AIC
## - Population 1 0.0 47735 8230.9
## - thinness.5.9.years 1 0.5 47735 8231.0
## - percentage.expenditure 1 2.1 47737 8231.1
## - Year 1 8.4 47743 8231.4
## <none> 47735 8232.9
## - thinness..1.19.years 1 43.5 47778 8233.6
## - Total.expenditure 1 76.8 47811 8235.7
## - Alcohol 1 90.5 47825 8236.5
## - Measles 1 102.3 47837 8237.2
## - GDP 1 156.3 47891 8240.5
## - Hepatitis.B 1 326.1 48061 8250.9
## - Status 1 576.2 48311 8266.2
## - Polio 1 679.2 48414 8272.4
## - Diphtheria 1 1261.8 48996 8307.6
## - BMI 1 1335.1 49070 8312.0
## - Income.composition.of.resources 1 1358.8 49093 8313.4
## - infant.deaths 1 2272.4 50007 8367.6
## - under.five.deaths 1 2372.7 50107 8373.5
## - Schooling 1 3938.4 51673 8463.9
## - Adult.Mortality 1 10151.4 57886 8797.4
## - HIV.AIDS 1 11717.2 59452 8875.8
##
## Step: AIC=8230.93
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling
##
## Df Sum of Sq RSS AIC
## - thinness.5.9.years 1 0.5 47735 8229.0
## - percentage.expenditure 1 2.1 47737 8229.1
## - Year 1 8.4 47743 8229.4
## <none> 47735 8230.9
## - thinness..1.19.years 1 43.5 47778 8231.6
## - Total.expenditure 1 76.8 47811 8233.7
## - Alcohol 1 90.6 47825 8234.5
## - Measles 1 102.4 47837 8235.2
## - GDP 1 156.3 47891 8238.5
## - Hepatitis.B 1 327.6 48062 8249.0
## - Status 1 576.2 48311 8264.2
## - Polio 1 679.2 48414 8270.4
## - Diphtheria 1 1263.8 48998 8305.7
## - BMI 1 1335.8 49070 8310.0
## - Income.composition.of.resources 1 1358.9 49094 8311.4
## - infant.deaths 1 2346.3 50081 8369.9
## - under.five.deaths 1 2411.0 50146 8373.7
## - Schooling 1 3939.6 51674 8461.9
## - Adult.Mortality 1 10152.5 57887 8795.5
## - HIV.AIDS 1 11717.2 59452 8873.9
##
## Step: AIC=8228.96
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + thinness..1.19.years + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## - percentage.expenditure 1 2.1 47737 8227.1
## - Year 1 8.4 47744 8227.5
## <none> 47735 8229.0
## - Total.expenditure 1 76.4 47811 8231.7
## - Alcohol 1 90.4 47826 8232.5
## - Measles 1 103.0 47838 8233.3
## - GDP 1 156.1 47891 8236.6
## - thinness..1.19.years 1 159.7 47895 8236.8
## - Hepatitis.B 1 327.6 48063 8247.1
## - Status 1 575.8 48311 8262.2
## - Polio 1 678.7 48414 8268.4
## - Diphtheria 1 1265.5 49001 8303.8
## - BMI 1 1348.6 49084 8308.8
## - Income.composition.of.resources 1 1359.4 49095 8309.5
## - infant.deaths 1 2357.7 50093 8368.6
## - under.five.deaths 1 2418.8 50154 8372.2
## - Schooling 1 3941.7 51677 8460.1
## - Adult.Mortality 1 10156.8 57892 8793.7
## - HIV.AIDS 1 11721.7 59457 8872.1
##
## Step: AIC=8227.09
## Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths +
## Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths +
## Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP +
## thinness..1.19.years + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## - Year 1 9.4 47747 8225.7
## <none> 47737 8227.1
## - Total.expenditure 1 82.3 47820 8230.2
## - Alcohol 1 93.1 47830 8230.8
## - Measles 1 103.0 47840 8231.4
## - thinness..1.19.years 1 159.9 47897 8234.9
## - Hepatitis.B 1 331.6 48069 8245.4
## - Status 1 583.8 48321 8260.8
## - Polio 1 676.9 48414 8266.5
## - GDP 1 818.6 48556 8275.0
## - Diphtheria 1 1265.5 49003 8302.0
## - BMI 1 1347.4 49085 8306.9
## - Income.composition.of.resources 1 1357.4 49095 8307.5
## - infant.deaths 1 2361.4 50099 8366.9
## - under.five.deaths 1 2422.3 50159 8370.5
## - Schooling 1 3941.8 51679 8458.2
## - Adult.Mortality 1 10154.7 57892 8791.7
## - HIV.AIDS 1 11730.6 59468 8870.6
##
## Step: AIC=8225.67
## Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
## Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths +
## Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP +
## thinness..1.19.years + Income.composition.of.resources +
## Schooling
##
## Df Sum of Sq RSS AIC
## <none> 47747 8225.7
## - Total.expenditure 1 77.7 47824 8228.4
## - Measles 1 100.0 47847 8229.8
## - Alcohol 1 103.5 47850 8230.0
## - thinness..1.19.years 1 162.3 47909 8233.6
## - Hepatitis.B 1 329.1 48076 8243.8
## - Status 1 598.8 48345 8260.3
## - Polio 1 680.9 48428 8265.3
## - GDP 1 814.2 48561 8273.3
## - Diphtheria 1 1257.0 49004 8300.0
## - BMI 1 1348.5 49095 8305.5
## - Income.composition.of.resources 1 1352.2 49099 8305.7
## - infant.deaths 1 2373.7 50120 8366.2
## - under.five.deaths 1 2435.4 50182 8369.8
## - Schooling 1 3937.5 51684 8456.5
## - Adult.Mortality 1 10266.6 58013 8795.9
## - HIV.AIDS 1 11809.6 59556 8873.0
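The AIC column in this trace is the value that step() obtains from extractAIC(), which for a linear model is n * log(RSS / n) + 2 * p, with n observations and p estimated coefficients. As a rough check, the starting model has an RSS of about 47735, n = 2938 (2917 residual degrees of freedom plus 21 coefficients), and p = 21, which gives approximately 8233, in line with the reported starting AIC of 8232.93. A minimal sketch of the same calculation on the built-in mtcars data, with illustrative variable names:

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)               # number of observations
p   <- length(coef(fit))       # estimated coefficients, including the intercept
rss <- sum(residuals(fit)^2)   # residual sum of squares

# AIC as used by step(): goodness of fit plus a penalty of 2 per coefficient.
aic_manual <- n * log(rss / n) + 2 * p
aic_step   <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

This value differs from the one returned by AIC() by an additive constant, so only differences between models fitted to the same data are meaningful.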
The model is set! Let’s make a prediction based on the BackwardStepModel!
Let’s interpret the BackwardStepModel!
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
## Alcohol + Hepatitis.B + Measles + BMI + under.five.deaths +
## Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP +
## thinness..1.19.years + Income.composition.of.resources +
## Schooling, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2503 -2.2438 -0.1097 2.3665 16.3824
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.745781830 0.668314373 84.909
## StatusDeveloping -1.625002116 0.268478794 -6.053
## Adult.Mortality -0.019850253 0.000792059 -25.062
## infant.deaths 0.099679379 0.008271756 12.051
## Alcohol 0.064726322 0.025725807 2.516
## Hepatitis.B -0.016610250 0.003701821 -4.487
## Measles -0.000018876 0.000007631 -2.474
## BMI 0.044354554 0.004883370 9.083
## under.five.deaths -0.074629707 0.006114028 -12.206
## Polio 0.028664684 0.004441145 6.454
## Total.expenditure 0.073561760 0.033733716 2.181
## Diphtheria 0.040593295 0.004629077 8.769
## HIV.AIDS -0.470653224 0.017510073 -26.879
## GDP 0.000046752 0.000006624 7.058
## thinness..1.19.years -0.074951946 0.023787239 -3.151
## Income.composition.of.resources 5.762655777 0.633592391 9.095
## Schooling 0.645416268 0.041585022 15.520
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## StatusDeveloping 0.00000000160683 ***
## Adult.Mortality < 0.0000000000000002 ***
## infant.deaths < 0.0000000000000002 ***
## Alcohol 0.01192 *
## Hepatitis.B 0.00000750069921 ***
## Measles 0.01343 *
## BMI < 0.0000000000000002 ***
## under.five.deaths < 0.0000000000000002 ***
## Polio 0.00000000012680 ***
## Total.expenditure 0.02929 *
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## GDP 0.00000000000211 ***
## thinness..1.19.years 0.00164 **
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.8192
## F-statistic: 832.9 on 16 and 2921 DF, p-value: < 0.00000000000000022
Observation findings:
In the BackwardStepModel, Income.composition.of.resources has the largest coefficient estimate (5.763), followed by StatusDeveloping (-1.625). Raw coefficients are expressed per unit of each predictor, so their sizes are not directly comparable across predictors measured on different scales.
Holding the other predictors fixed, Income.composition.of.resources has the largest positive association with Life.expectancy per unit of change.
On the other hand, StatusDeveloping has the most negative coefficient: Developing status is associated with a Life.expectancy about 1.6 lower than the reference Status level.
The p-values for the intercept, Adult.Mortality, infant.deaths, BMI, under.five.deaths, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest in the model (each below 0.0000000000000002), indicating that these are highly significant predictors of Life.expectancy.
The BackwardStepModel has an R-squared of 0.8202, meaning it explains about 82% of the variance in Life.expectancy.
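When models with different numbers of predictors are compared, adjusted R-squared is the fairer yardstick because it penalizes extra terms: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), with n observations and p predictors. Plugging in the reported values with n = 2938 gives about 0.819 for the fullModel (R-squared 0.8203, p = 20) and about 0.8192 for the BackwardStepModel (R-squared 0.8202, p = 16), matching the summaries. The sketch below checks the formula on the built-in mtcars data; the variable names are illustrative.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nobs(fit)               # number of observations
p  <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
r2 <- summary(fit)$r.squared

# Adjusted R-squared shrinks R-squared according to the number of predictors used.
adj_manual  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_summary <- summary(fit)$adj.r.squared

print(c(manual = adj_manual, summary = adj_summary))
```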
Here is the Actual vs Predicted Plot of BackwardStepModel. The green line represents a perfect prediction, while the red line represents the regression line.
# actual vs predicted
plot(y = BackwardStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Backward Step model",
     xlab = "Actual",
     ylab = "Predicted(BackwardStepModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(BackwardStepModel$fitted.values ~ BackwardStepModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the plot of the Actual vs Predicted of BackwardStepModel with Confidence Interval.
The blue line is a linear fit through the plotted points, and the grey shade around it is the 95% confidence band for that line. The dashed red lines above and below are the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within the 95% prediction interval.
# predict Life expectancy with 95% prediction intervals (columns fit, lwr, upr)
predictedLE3 <- predict(BackwardStepModel, interval = "prediction")
# combine the actual data and predicted data
comb3 <- cbind.data.frame(clean_data, predictedLE3)
# Plotting the combined data
ggplot(comb3, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  # stat_smooth() and geom_smooth() draw the same layer, so one call is enough
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Actual vs. Predicted for BackwardStepModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
Let’s build a model using stepwise regression with the forward step!
The predictors of the forwardStepModel are Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, Status, Polio, GDP, Hepatitis.B, under.five.deaths, infant.deaths, thinness..1.19.years, Alcohol, Measles, and Total.expenditure.
forwardStepModel <- step(nullModel,
                         direction = "forward",
                         scope = list(lower = nullModel,
                                      upper = fullModel))
## Start: AIC=13235.23
## Life.expectancy ~ 1
##
## Df Sum of Sq RSS AIC
## + Schooling 1 135029 130544 11151
## + Adult.Mortality 1 128792 136781 11288
## + Income.composition.of.resources 1 127379 138194 11318
## + BMI 1 83419 182155 12130
## + HIV.AIDS 1 82306 183267 12147
## + Status 1 61549 204024 12463
## + Diphtheria 1 59218 206355 12496
## + thinness..1.19.years 1 58167 207406 12511
## + thinness.5.9.years 1 56801 208772 12530
## + Polio 1 55805 209768 12544
## + GDP 1 49210 216363 12635
## + Alcohol 1 40534 225039 12751
## + percentage.expenditure 1 38636 226938 12775
## + under.five.deaths 1 13176 252397 13088
## + Total.expenditure 1 11583 253990 13106
## + infant.deaths 1 10282 255291 13121
## + Year 1 7749 257824 13150
## + Hepatitis.B 1 7695 257878 13151
## + Measles 1 6610 258963 13163
## + Population 1 224 265350 13235
## <none> 265573 13235
##
## Step: AIC=11150.71
## Life.expectancy ~ Schooling
##
## Df Sum of Sq RSS AIC
## + Adult.Mortality 1 49061 81483 9768.0
## + HIV.AIDS 1 44779 85765 9918.5
## + BMI 1 14155 116389 10815.5
## + Diphtheria 1 12645 117899 10853.4
## + Income.composition.of.resources 1 11318 119225 10886.3
## + Polio 1 11213 119331 10888.9
## + thinness.5.9.years 1 8113 122431 10964.2
## + thinness..1.19.years 1 8021 122523 10966.4
## + Status 1 5919 124625 11016.4
## + GDP 1 4882 125662 11040.7
## + percentage.expenditure 1 3515 127029 11072.5
## + under.five.deaths 1 1588 128955 11116.7
## + Measles 1 1383 129161 11121.4
## + Hepatitis.B 1 1308 129235 11123.1
## + infant.deaths 1 1013 129531 11129.8
## + Total.expenditure 1 750 129793 11135.8
## + Alcohol 1 424 130120 11143.1
## + Year 1 183 130361 11148.6
## <none> 130544 11150.7
## + Population 1 2 130542 11152.7
##
## Step: AIC=9767.98
## Life.expectancy ~ Schooling + Adult.Mortality
##
## Df Sum of Sq RSS AIC
## + HIV.AIDS 1 14069.6 67413 9213.1
## + Diphtheria 1 7221.9 74261 9497.3
## + Polio 1 6103.7 75379 9541.2
## + BMI 1 5520.4 75962 9563.9
## + Income.composition.of.resources 1 4734.9 76748 9594.1
## + thinness..1.19.years 1 3713.6 77769 9632.9
## + thinness.5.9.years 1 3479.5 78003 9641.8
## + Status 1 2376.4 79106 9683.0
## + GDP 1 1982.1 79501 9697.6
## + Measles 1 1798.9 79684 9704.4
## + percentage.expenditure 1 1551.3 79931 9713.5
## + under.five.deaths 1 1492.1 79991 9715.7
## + infant.deaths 1 1075.2 80407 9731.0
## + Alcohol 1 791.2 80691 9741.3
## + Total.expenditure 1 541.6 80941 9750.4
## + Hepatitis.B 1 433.2 81049 9754.3
## + Year 1 246.7 81236 9761.1
## <none> 81483 9768.0
## + Population 1 44.3 81438 9768.4
##
## Step: AIC=9213.08
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
##
## Df Sum of Sq RSS AIC
## + Diphtheria 1 6587.0 60826 8913.0
## + Polio 1 5636.2 61777 8958.6
## + BMI 1 4450.0 62963 9014.4
## + Income.composition.of.resources 1 4032.2 63381 9033.9
## + thinness..1.19.years 1 2808.0 64605 9090.1
## + thinness.5.9.years 1 2604.4 64809 9099.3
## + Status 1 2582.3 64831 9100.3
## + GDP 1 2295.7 65117 9113.3
## + percentage.expenditure 1 1884.8 65528 9131.8
## + Measles 1 1618.6 65794 9143.7
## + under.five.deaths 1 1600.3 65813 9144.5
## + Alcohol 1 1274.3 66139 9159.0
## + infant.deaths 1 1214.8 66198 9161.6
## + Total.expenditure 1 987.1 66426 9171.7
## + Hepatitis.B 1 314.9 67098 9201.3
## + Population 1 73.9 67339 9211.9
## <none> 67413 9213.1
## + Year 1 2.3 67411 9215.0
##
## Step: AIC=8912.99
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria
##
## Df Sum of Sq RSS AIC
## + BMI 1 3596.0 57230 8735.9
## + Income.composition.of.resources 1 3185.5 57640 8756.9
## + Status 1 2422.7 58403 8795.6
## + thinness..1.19.years 1 2361.1 58465 8798.7
## + GDP 1 2233.9 58592 8805.1
## + thinness.5.9.years 1 2213.3 58613 8806.1
## + percentage.expenditure 1 1993.4 58833 8817.1
## + Alcohol 1 1068.0 59758 8862.9
## + Polio 1 1026.4 59800 8865.0
## + Measles 1 999.2 59827 8866.3
## + under.five.deaths 1 879.1 59947 8872.2
## + Total.expenditure 1 663.8 60162 8882.7
## + infant.deaths 1 657.9 60168 8883.0
## + Hepatitis.B 1 349.8 60476 8898.0
## + Population 1 43.7 60782 8912.9
## <none> 60826 8913.0
## + Year 1 10.3 60816 8914.5
##
## Step: AIC=8735.95
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI
##
## Df Sum of Sq RSS AIC
## + Income.composition.of.resources 1 2544.82 54685 8604.3
## + Status 1 2111.11 55119 8627.5
## + GDP 1 1959.73 55270 8635.6
## + percentage.expenditure 1 1854.67 55375 8641.2
## + Polio 1 856.71 56373 8693.6
## + thinness..1.19.years 1 782.07 56448 8697.5
## + Alcohol 1 725.32 56505 8700.5
## + thinness.5.9.years 1 657.35 56573 8704.0
## + Measles 1 568.01 56662 8708.6
## + under.five.deaths 1 431.96 56798 8715.7
## + Hepatitis.B 1 339.63 56890 8720.5
## + Total.expenditure 1 304.28 56926 8722.3
## + infant.deaths 1 280.40 56950 8723.5
## <none> 57230 8735.9
## + Year 1 8.62 57221 8737.5
## + Population 1 7.38 57223 8737.6
##
## Step: AIC=8604.31
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources
##
## Df Sum of Sq RSS AIC
## + Status 1 1699.37 52986 8513.6
## + percentage.expenditure 1 1394.08 53291 8530.4
## + GDP 1 1346.68 53338 8533.1
## + Polio 1 834.91 53850 8561.1
## + thinness..1.19.years 1 700.86 53984 8568.4
## + Alcohol 1 641.60 54044 8571.6
## + thinness.5.9.years 1 605.26 54080 8573.6
## + Measles 1 536.56 54149 8577.3
## + under.five.deaths 1 505.38 54180 8579.0
## + Total.expenditure 1 439.61 54246 8582.6
## + infant.deaths 1 352.37 54333 8587.3
## + Hepatitis.B 1 257.09 54428 8592.5
## + Year 1 83.26 54602 8601.8
## <none> 54685 8604.3
## + Population 1 17.23 54668 8605.4
##
## Step: AIC=8513.56
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status
##
## Df Sum of Sq RSS AIC
## + Polio 1 802.17 52184 8470.7
## + GDP 1 699.53 52286 8476.5
## + percentage.expenditure 1 662.90 52323 8478.6
## + Measles 1 511.75 52474 8487.0
## + under.five.deaths 1 491.60 52494 8488.2
## + thinness..1.19.years 1 392.03 52594 8493.7
## + Hepatitis.B 1 345.31 52640 8496.4
## + infant.deaths 1 329.36 52656 8497.2
## + thinness.5.9.years 1 315.68 52670 8498.0
## + Total.expenditure 1 153.60 52832 8507.0
## + Alcohol 1 59.81 52926 8512.2
## <none> 52986 8513.6
## + Year 1 13.42 52972 8514.8
## + Population 1 9.94 52976 8515.0
##
## Step: AIC=8470.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio
##
## Df Sum of Sq RSS AIC
## + GDP 1 682.97 51501 8434.0
## + percentage.expenditure 1 673.70 51510 8434.6
## + Hepatitis.B 1 464.89 51719 8446.5
## + Measles 1 462.60 51721 8446.6
## + under.five.deaths 1 433.02 51751 8448.3
## + thinness..1.19.years 1 395.77 51788 8450.4
## + thinness.5.9.years 1 309.75 51874 8455.3
## + infant.deaths 1 285.54 51898 8456.6
## + Total.expenditure 1 150.80 52033 8464.2
## + Alcohol 1 56.41 52127 8469.6
## <none> 52184 8470.7
## + Year 1 8.62 52175 8472.3
## + Population 1 5.97 52178 8472.4
##
## Step: AIC=8434.04
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP
##
## Df Sum of Sq RSS AIC
## + Hepatitis.B 1 462.41 51038 8409.5
## + Measles 1 452.06 51049 8410.1
## + under.five.deaths 1 418.70 51082 8412.1
## + thinness..1.19.years 1 379.40 51121 8414.3
## + thinness.5.9.years 1 286.65 51214 8419.6
## + infant.deaths 1 270.37 51230 8420.6
## + Total.expenditure 1 179.55 51321 8425.8
## + Alcohol 1 57.66 51443 8432.7
## + percentage.expenditure 1 42.17 51458 8433.6
## <none> 51501 8434.0
## + Year 1 13.23 51487 8435.3
## + Population 1 5.06 51496 8435.7
##
## Step: AIC=8409.54
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B
##
## Df Sum of Sq RSS AIC
## + under.five.deaths 1 513.06 50525 8381.9
## + Measles 1 461.23 50577 8384.9
## + thinness..1.19.years 1 380.09 50658 8389.6
## + infant.deaths 1 351.84 50686 8391.2
## + thinness.5.9.years 1 291.88 50746 8394.7
## + Total.expenditure 1 167.68 50871 8401.9
## + Alcohol 1 53.27 50985 8408.5
## <none> 51038 8409.5
## + percentage.expenditure 1 27.04 51011 8410.0
## + Population 1 23.99 51014 8410.2
## + Year 1 17.07 51021 8410.6
##
## Step: AIC=8381.86
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths
##
## Df Sum of Sq RSS AIC
## + infant.deaths 1 2223.33 48302 8251.6
## + Measles 1 145.58 50380 8375.4
## + Total.expenditure 1 136.09 50389 8375.9
## + thinness..1.19.years 1 122.92 50402 8376.7
## + Population 1 74.17 50451 8379.5
## + thinness.5.9.years 1 69.02 50456 8379.8
## + Alcohol 1 58.10 50467 8380.5
## <none> 50525 8381.9
## + percentage.expenditure 1 26.98 50498 8382.3
## + Year 1 17.75 50507 8382.8
##
## Step: AIC=8251.64
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths
##
## Df Sum of Sq RSS AIC
## + thinness..1.19.years 1 248.735 48053 8238.5
## + Alcohol 1 196.874 48105 8241.6
## + thinness.5.9.years 1 190.728 48111 8242.0
## + Total.expenditure 1 146.782 48155 8244.7
## + Measles 1 87.496 48214 8248.3
## <none> 48302 8251.6
## + percentage.expenditure 1 23.257 48279 8252.2
## + Year 1 14.517 48287 8252.8
## + Population 1 0.068 48302 8253.6
##
## Step: AIC=8238.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years
##
## Df Sum of Sq RSS AIC
## + Alcohol 1 120.915 47932 8233.1
## + Total.expenditure 1 107.663 47945 8233.9
## + Measles 1 103.706 47949 8234.1
## <none> 48053 8238.5
## + percentage.expenditure 1 17.121 48036 8239.4
## + Year 1 9.060 48044 8239.9
## + Population 1 0.251 48053 8240.5
## + thinness.5.9.years 1 0.143 48053 8240.5
##
## Step: AIC=8233.07
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol
##
## Df Sum of Sq RSS AIC
## + Measles 1 107.930 47824 8228.4
## + Total.expenditure 1 85.647 47847 8229.8
## <none> 47932 8233.1
## + percentage.expenditure 1 9.067 47923 8234.5
## + Year 1 2.540 47930 8234.9
## + thinness.5.9.years 1 0.376 47932 8235.0
## + Population 1 0.034 47932 8235.1
##
## Step: AIC=8228.45
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol + Measles
##
## Df Sum of Sq RSS AIC
## + Total.expenditure 1 77.730 47747 8225.7
## <none> 47824 8228.4
## + percentage.expenditure 1 8.980 47815 8229.9
## + Year 1 4.783 47820 8230.2
## + thinness.5.9.years 1 0.085 47824 8230.4
## + Population 1 0.023 47824 8230.4
##
## Step: AIC=8225.67
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol + Measles + Total.expenditure
##
## Df Sum of Sq RSS AIC
## <none> 47747 8225.7
## + Year 1 9.3541 47737 8227.1
## + percentage.expenditure 1 3.0234 47744 8227.5
## + thinness.5.9.years 1 0.5265 47746 8227.6
## + Population 1 0.0011 47747 8227.7
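Each row of a forward step reports how much the residual sum of squares would fall if that term were added: Sum of Sq is the current RSS minus the RSS of the enlarged model. In the first step of this trace, for instance, adding Schooling lowers the RSS from 265573 to 130544, a drop of 135029. The sketch below reproduces one such table with add1() on the built-in mtcars data; the object names are illustrative.

```r
# Stand-in data: mtcars ships with base R (assumption: clean_data is not available here).
# Intercept-only model, the usual starting point of a forward search.
null_fit <- lm(mpg ~ 1, data = mtcars)

# One forward step: evaluate every single-term addition from the candidate scope.
candidates <- add1(null_fit, scope = ~ wt + hp + qsec)
print(candidates)

# "Sum of Sq" for a term equals the drop in RSS when that term is added.
rss_null <- sum(residuals(null_fit)^2)
rss_wt   <- sum(residuals(lm(mpg ~ wt, data = mtcars))^2)
drop_wt  <- rss_null - rss_wt
print(c(table = candidates["wt", "Sum of Sq"], by_hand = drop_wt))
```

step() with direction = "forward" repeats this table at every step and adds the candidate with the lowest AIC, stopping when the row labelled <none> has the lowest AIC.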
The model is set! Let’s make a prediction based on the forwardStepModel!
Let’s interpret the forwardStepModel!
##
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality +
## HIV.AIDS + Diphtheria + BMI + Income.composition.of.resources +
## Status + Polio + GDP + Hepatitis.B + under.five.deaths +
## infant.deaths + thinness..1.19.years + Alcohol + Measles +
## Total.expenditure, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2503 -2.2438 -0.1097 2.3665 16.3824
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.745781830 0.668314373 84.909
## Schooling 0.645416268 0.041585022 15.520
## Adult.Mortality -0.019850253 0.000792059 -25.062
## HIV.AIDS -0.470653224 0.017510073 -26.879
## Diphtheria 0.040593295 0.004629077 8.769
## BMI 0.044354554 0.004883370 9.083
## Income.composition.of.resources 5.762655777 0.633592391 9.095
## StatusDeveloping -1.625002116 0.268478794 -6.053
## Polio 0.028664684 0.004441145 6.454
## GDP 0.000046752 0.000006624 7.058
## Hepatitis.B -0.016610250 0.003701821 -4.487
## under.five.deaths -0.074629707 0.006114028 -12.206
## infant.deaths 0.099679379 0.008271756 12.051
## thinness..1.19.years -0.074951946 0.023787239 -3.151
## Alcohol 0.064726322 0.025725807 2.516
## Measles -0.000018876 0.000007631 -2.474
## Total.expenditure 0.073561760 0.033733716 2.181
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## Adult.Mortality < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## Diphtheria < 0.0000000000000002 ***
## BMI < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## StatusDeveloping 0.00000000160683 ***
## Polio 0.00000000012680 ***
## GDP 0.00000000000211 ***
## Hepatitis.B 0.00000750069921 ***
## under.five.deaths < 0.0000000000000002 ***
## infant.deaths < 0.0000000000000002 ***
## thinness..1.19.years 0.00164 **
## Alcohol 0.01192 *
## Measles 0.01343 *
## Total.expenditure 0.02929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.8192
## F-statistic: 832.9 on 16 and 2921 DF, p-value: < 0.00000000000000022
Observation findings:
Among the predictors of the forwardStepModel, Income.composition.of.resources has the largest estimate in absolute value (5.763), followed by StatusDeveloping (-1.625). The predictors are measured on different scales, so a raw estimate is the change in Life.expectancy per unit of that predictor, not a direct ranking of importance.
A one-unit increase in Income.composition.of.resources is associated with the largest positive change in Life.expectancy, holding the other predictors fixed.
On the other hand, StatusDeveloping has the largest negative estimate: it lowers the predicted Life.expectancy by about 1.6 relative to the baseline Status level, holding the other predictors fixed.
The p-values for the intercept, Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, under.five.deaths, and infant.deaths are the smallest of all (below 2e-16), indicating that they are highly significant predictors of Life.expectancy.
The forwardStepModel has an R-squared value of 0.8202, which means the model explains about 82% of the variance in Life.expectancy.
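Raw estimates depend on the units of each predictor, so comparing them across predictors can mislead. The sketch below is only an illustration: it uses the built-in mtcars data and the predictors wt and hp as stand-ins (both are assumptions, since clean_data is not reproduced here) and refits the model on predictors rescaled to one standard deviation.

```r
# Illustrative only: mtcars, wt and hp stand in for clean_data and its columns.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Rescale each predictor to mean 0 and standard deviation 1, then refit.
scaled <- mtcars
scaled[c("wt", "hp")] <- as.data.frame(scale(mtcars[c("wt", "hp")]))
fit_std <- lm(mpg ~ wt + hp, data = scaled)

# Raw slopes are per unit of the predictor; standardized slopes are per
# standard deviation of the predictor and can be compared with each other.
raw_slopes <- coef(fit_raw)[c("wt", "hp")]
std_slopes <- coef(fit_std)[c("wt", "hp")]
print(rbind(raw = raw_slopes, standardized = std_slopes))
```

Each standardized slope equals the raw slope multiplied by the standard deviation of its predictor, so a predictor with a tiny raw estimate but a wide range (such as GDP in the summary above) can still matter.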
Here is the Actual vs Predicted Plot of forwardStepModel. The green line represents a perfect prediction, while the red line represents the regression line.
# actual vs predicted
plot(y = forwardStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Forward Step model",
     xlab = "Actual",
     ylab = "Predicted(forwardStepModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(forwardStepModel$fitted.values ~ forwardStepModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the plot of the Actual vs Predicted of forwardStepModel with Confidence Interval.
The blue line is the regression line of the predicted values on the actual values, and the grey shade around it is the 95% confidence band for that line. The dashed red lines above and below it are the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows the predicted values lying well within the 95% prediction bounds.
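A confidence interval and a prediction interval answer different questions: the first covers the mean response at given predictor values, the second covers a single new observation. A minimal sketch of the difference, assuming the built-in cars dataset and three arbitrary speeds purely for illustration:

```r
# Illustrative only: cars and the chosen speeds are assumptions, not part of the analysis.
fit <- lm(dist ~ speed, data = cars)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Interval for the mean response at each speed.
conf_band <- predict(fit, newdata = new_speeds, interval = "confidence", level = 0.95)
# Interval for one new observation at each speed.
pred_band <- predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)

conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]
print(cbind(conf_width, pred_width))
```

Both calls return the same fit column; only lwr and upr differ, and the prediction interval is always the wider of the two because it adds the residual variance of an individual observation.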
# predict Life expectancy with 95% prediction bounds (columns fit, lwr, upr)
predictedLE4 <- predict(forwardStepModel, interval = "prediction")
# combine the actual data and predicted data
comb4 <- cbind.data.frame(clean_data, predictedLE4)
# plot the combined data
ggplot(comb4, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  geom_smooth(method = lm, se = TRUE) + # regression line with its 95% confidence band
  ggtitle("Actual vs. Predicted for forwardStepModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
Let’s build a model using step-wise regression with both backward and forward steps!
The predictors of the MixedStepModel are Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, Status, Polio, GDP, Hepatitis.B, under.five.deaths, infant.deaths, thinness..1.19.years, Alcohol, Measles, and Total.expenditure.
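The call that produced the trace is not shown in this section. A minimal sketch of how a search in both directions is commonly set up with step() from the stats package, assuming the built-in mtcars data and the names null_model, full_model and mixed_sketch as illustrative stand-ins:

```r
# Illustrative only: mtcars stands in for clean_data; all object names are hypothetical.
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only, the starting point
full_model <- lm(mpg ~ ., data = mtcars)   # every available predictor, the upper bound

# direction = "both" lets each step either add a term ("+" rows in the trace)
# or drop one that is already in the model ("-" rows).
mixed_sketch <- step(null_model,
                     scope = list(lower = formula(null_model),
                                  upper = formula(full_model)),
                     direction = "both",
                     trace = 0)             # trace = 1 prints a step-by-step table

print(formula(mixed_sketch))
```

At each step the candidate with the lowest AIC is chosen, and the search stops when the row labelled <none> has the lowest AIC.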
## Start: AIC=13235.23
## Life.expectancy ~ 1
##
## Df Sum of Sq RSS AIC
## + Schooling 1 135029 130544 11151
## + Adult.Mortality 1 128792 136781 11288
## + Income.composition.of.resources 1 127379 138194 11318
## + BMI 1 83419 182155 12130
## + HIV.AIDS 1 82306 183267 12147
## + Status 1 61549 204024 12463
## + Diphtheria 1 59218 206355 12496
## + thinness..1.19.years 1 58167 207406 12511
## + thinness.5.9.years 1 56801 208772 12530
## + Polio 1 55805 209768 12544
## + GDP 1 49210 216363 12635
## + Alcohol 1 40534 225039 12751
## + percentage.expenditure 1 38636 226938 12775
## + under.five.deaths 1 13176 252397 13088
## + Total.expenditure 1 11583 253990 13106
## + infant.deaths 1 10282 255291 13121
## + Year 1 7749 257824 13150
## + Hepatitis.B 1 7695 257878 13151
## + Measles 1 6610 258963 13163
## + Population 1 224 265350 13235
## <none> 265573 13235
##
## Step: AIC=11150.71
## Life.expectancy ~ Schooling
##
## Df Sum of Sq RSS AIC
## + Adult.Mortality 1 49061 81483 9768.0
## + HIV.AIDS 1 44779 85765 9918.5
## + BMI 1 14155 116389 10815.5
## + Diphtheria 1 12645 117899 10853.4
## + Income.composition.of.resources 1 11318 119225 10886.3
## + Polio 1 11213 119331 10888.9
## + thinness.5.9.years 1 8113 122431 10964.2
## + thinness..1.19.years 1 8021 122523 10966.4
## + Status 1 5919 124625 11016.4
## + GDP 1 4882 125662 11040.7
## + percentage.expenditure 1 3515 127029 11072.5
## + under.five.deaths 1 1588 128955 11116.7
## + Measles 1 1383 129161 11121.4
## + Hepatitis.B 1 1308 129235 11123.1
## + infant.deaths 1 1013 129531 11129.8
## + Total.expenditure 1 750 129793 11135.8
## + Alcohol 1 424 130120 11143.1
## + Year 1 183 130361 11148.6
## <none> 130544 11150.7
## + Population 1 2 130542 11152.7
## - Schooling 1 135029 265573 13235.2
##
## Step: AIC=9767.98
## Life.expectancy ~ Schooling + Adult.Mortality
##
## Df Sum of Sq RSS AIC
## + HIV.AIDS 1 14070 67413 9213.1
## + Diphtheria 1 7222 74261 9497.3
## + Polio 1 6104 75379 9541.2
## + BMI 1 5520 75962 9563.9
## + Income.composition.of.resources 1 4735 76748 9594.1
## + thinness..1.19.years 1 3714 77769 9632.9
## + thinness.5.9.years 1 3480 78003 9641.8
## + Status 1 2376 79106 9683.0
## + GDP 1 1982 79501 9697.6
## + Measles 1 1799 79684 9704.4
## + percentage.expenditure 1 1551 79931 9713.5
## + under.five.deaths 1 1492 79991 9715.7
## + infant.deaths 1 1075 80407 9731.0
## + Alcohol 1 791 80691 9741.3
## + Total.expenditure 1 542 80941 9750.4
## + Hepatitis.B 1 433 81049 9754.3
## + Year 1 247 81236 9761.1
## <none> 81483 9768.0
## + Population 1 44 81438 9768.4
## - Adult.Mortality 1 49061 130544 11150.7
## - Schooling 1 55298 136781 11287.8
##
## Step: AIC=9213.08
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS
##
## Df Sum of Sq RSS AIC
## + Diphtheria 1 6587 60826 8913.0
## + Polio 1 5636 61777 8958.6
## + BMI 1 4450 62963 9014.4
## + Income.composition.of.resources 1 4032 63381 9033.9
## + thinness..1.19.years 1 2808 64605 9090.1
## + thinness.5.9.years 1 2604 64809 9099.3
## + Status 1 2582 64831 9100.3
## + GDP 1 2296 65117 9113.3
## + percentage.expenditure 1 1885 65528 9131.8
## + Measles 1 1619 65794 9143.7
## + under.five.deaths 1 1600 65813 9144.5
## + Alcohol 1 1274 66139 9159.0
## + infant.deaths 1 1215 66198 9161.6
## + Total.expenditure 1 987 66426 9171.7
## + Hepatitis.B 1 315 67098 9201.3
## + Population 1 74 67339 9211.9
## <none> 67413 9213.1
## + Year 1 2 67411 9215.0
## - HIV.AIDS 1 14070 81483 9768.0
## - Adult.Mortality 1 18352 85765 9918.5
## - Schooling 1 55892 123305 10985.1
##
## Step: AIC=8912.99
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria
##
## Df Sum of Sq RSS AIC
## + BMI 1 3596 57230 8735.9
## + Income.composition.of.resources 1 3186 57640 8756.9
## + Status 1 2423 58403 8795.6
## + thinness..1.19.years 1 2361 58465 8798.7
## + GDP 1 2234 58592 8805.1
## + thinness.5.9.years 1 2213 58613 8806.1
## + percentage.expenditure 1 1993 58833 8817.1
## + Alcohol 1 1068 59758 8862.9
## + Polio 1 1026 59800 8865.0
## + Measles 1 999 59827 8866.3
## + under.five.deaths 1 879 59947 8872.2
## + Total.expenditure 1 664 60162 8882.7
## + infant.deaths 1 658 60168 8883.0
## + Hepatitis.B 1 350 60476 8898.0
## + Population 1 44 60782 8912.9
## <none> 60826 8913.0
## + Year 1 10 60816 8914.5
## - Diphtheria 1 6587 67413 9213.1
## - HIV.AIDS 1 13435 74261 9497.3
## - Adult.Mortality 1 16152 76978 9602.9
## - Schooling 1 40329 101155 10405.4
##
## Step: AIC=8735.95
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI
##
## Df Sum of Sq RSS AIC
## + Income.composition.of.resources 1 2544.8 54685 8604.3
## + Status 1 2111.1 55119 8627.5
## + GDP 1 1959.7 55270 8635.6
## + percentage.expenditure 1 1854.7 55375 8641.2
## + Polio 1 856.7 56373 8693.6
## + thinness..1.19.years 1 782.1 56448 8697.5
## + Alcohol 1 725.3 56505 8700.5
## + thinness.5.9.years 1 657.4 56573 8704.0
## + Measles 1 568.0 56662 8708.6
## + under.five.deaths 1 432.0 56798 8715.7
## + Hepatitis.B 1 339.6 56890 8720.5
## + Total.expenditure 1 304.3 56926 8722.3
## + infant.deaths 1 280.4 56950 8723.5
## <none> 57230 8735.9
## + Year 1 8.6 57221 8737.5
## + Population 1 7.4 57223 8737.6
## - BMI 1 3596.0 60826 8913.0
## - Diphtheria 1 5733.1 62963 9014.4
## - HIV.AIDS 1 12527.5 69758 9315.5
## - Adult.Mortality 1 13696.8 70927 9364.4
## - Schooling 1 26760.3 83990 9861.0
##
## Step: AIC=8604.31
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources
##
## Df Sum of Sq RSS AIC
## + Status 1 1699.4 52986 8513.6
## + percentage.expenditure 1 1394.1 53291 8530.4
## + GDP 1 1346.7 53338 8533.1
## + Polio 1 834.9 53850 8561.1
## + thinness..1.19.years 1 700.9 53984 8568.4
## + Alcohol 1 641.6 54044 8571.6
## + thinness.5.9.years 1 605.3 54080 8573.6
## + Measles 1 536.6 54149 8577.3
## + under.five.deaths 1 505.4 54180 8579.0
## + Total.expenditure 1 439.6 54246 8582.6
## + infant.deaths 1 352.4 54333 8587.3
## + Hepatitis.B 1 257.1 54428 8592.5
## + Year 1 83.3 54602 8601.8
## <none> 54685 8604.3
## + Population 1 17.2 54668 8605.4
## - Income.composition.of.resources 1 2544.8 57230 8735.9
## - BMI 1 2955.3 57640 8756.9
## - Diphtheria 1 5095.2 59780 8864.0
## - Schooling 1 7143.4 61829 8963.0
## - HIV.AIDS 1 12106.7 66792 9189.9
## - Adult.Mortality 1 12307.2 66992 9198.7
##
## Step: AIC=8513.56
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status
##
## Df Sum of Sq RSS AIC
## + Polio 1 802.2 52184 8470.7
## + GDP 1 699.5 52286 8476.5
## + percentage.expenditure 1 662.9 52323 8478.6
## + Measles 1 511.8 52474 8487.0
## + under.five.deaths 1 491.6 52494 8488.2
## + thinness..1.19.years 1 392.0 52594 8493.7
## + Hepatitis.B 1 345.3 52640 8496.4
## + infant.deaths 1 329.4 52656 8497.2
## + thinness.5.9.years 1 315.7 52670 8498.0
## + Total.expenditure 1 153.6 52832 8507.0
## + Alcohol 1 59.8 52926 8512.2
## <none> 52986 8513.6
## + Year 1 13.4 52972 8514.8
## + Population 1 9.9 52976 8515.0
## - Status 1 1699.4 54685 8604.3
## - Income.composition.of.resources 1 2133.1 55119 8627.5
## - BMI 1 2748.8 55735 8660.2
## - Diphtheria 1 5053.7 58040 8779.2
## - Schooling 1 5510.7 58496 8802.3
## - Adult.Mortality 1 11298.0 64284 9079.4
## - HIV.AIDS 1 12327.9 65314 9126.1
##
## Step: AIC=8470.74
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio
##
## Df Sum of Sq RSS AIC
## + GDP 1 683.0 51501 8434.0
## + percentage.expenditure 1 673.7 51510 8434.6
## + Hepatitis.B 1 464.9 51719 8446.5
## + Measles 1 462.6 51721 8446.6
## + under.five.deaths 1 433.0 51751 8448.3
## + thinness..1.19.years 1 395.8 51788 8450.4
## + thinness.5.9.years 1 309.7 51874 8455.3
## + infant.deaths 1 285.5 51898 8456.6
## + Total.expenditure 1 150.8 52033 8464.2
## + Alcohol 1 56.4 52127 8469.6
## <none> 52184 8470.7
## + Year 1 8.6 52175 8472.3
## + Population 1 6.0 52178 8472.4
## - Polio 1 802.2 52986 8513.6
## - Diphtheria 1 1534.0 53718 8553.9
## - Status 1 1666.6 53850 8561.1
## - Income.composition.of.resources 1 2117.2 54301 8585.6
## - BMI 1 2611.4 54795 8612.2
## - Schooling 1 5175.2 57359 8746.6
## - Adult.Mortality 1 11025.7 63209 9031.9
## - HIV.AIDS 1 12298.2 64482 9090.5
##
## Step: AIC=8434.04
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP
##
## Df Sum of Sq RSS AIC
## + Hepatitis.B 1 462.4 51038 8409.5
## + Measles 1 452.1 51049 8410.1
## + under.five.deaths 1 418.7 51082 8412.1
## + thinness..1.19.years 1 379.4 51121 8414.3
## + thinness.5.9.years 1 286.7 51214 8419.6
## + infant.deaths 1 270.4 51230 8420.6
## + Total.expenditure 1 179.5 51321 8425.8
## + Alcohol 1 57.7 51443 8432.7
## + percentage.expenditure 1 42.2 51458 8433.6
## <none> 51501 8434.0
## + Year 1 13.2 51487 8435.3
## + Population 1 5.1 51496 8435.7
## - GDP 1 683.0 52184 8470.7
## - Polio 1 785.6 52286 8476.5
## - Status 1 1033.3 52534 8490.4
## - Diphtheria 1 1566.9 53068 8520.1
## - Income.composition.of.resources 1 1761.3 53262 8530.8
## - BMI 1 2550.8 54051 8574.1
## - Schooling 1 4904.6 56405 8699.3
## - Adult.Mortality 1 10635.9 62137 8983.6
## - HIV.AIDS 1 12474.3 63975 9069.3
##
## Step: AIC=8409.54
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B
##
## Df Sum of Sq RSS AIC
## + under.five.deaths 1 513.1 50525 8381.9
## + Measles 1 461.2 50577 8384.9
## + thinness..1.19.years 1 380.1 50658 8389.6
## + infant.deaths 1 351.8 50686 8391.2
## + thinness.5.9.years 1 291.9 50746 8394.7
## + Total.expenditure 1 167.7 50871 8401.9
## + Alcohol 1 53.3 50985 8408.5
## <none> 51038 8409.5
## + percentage.expenditure 1 27.0 51011 8410.0
## + Population 1 24.0 51014 8410.2
## + Year 1 17.1 51021 8410.6
## - Hepatitis.B 1 462.4 51501 8434.0
## - GDP 1 680.5 51719 8446.5
## - Polio 1 903.7 51942 8459.1
## - Status 1 1111.8 52150 8470.8
## - Income.composition.of.resources 1 1660.7 52699 8501.6
## - Diphtheria 1 1955.8 52994 8518.0
## - BMI 1 2535.8 53574 8550.0
## - Schooling 1 4868.0 55906 8675.2
## - Adult.Mortality 1 10650.8 61689 8964.4
## - HIV.AIDS 1 12549.6 63588 9053.4
##
## Step: AIC=8381.86
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths
##
## Df Sum of Sq RSS AIC
## + infant.deaths 1 2223.3 48302 8251.6
## + Measles 1 145.6 50380 8375.4
## + Total.expenditure 1 136.1 50389 8375.9
## + thinness..1.19.years 1 122.9 50402 8376.7
## + Population 1 74.2 50451 8379.5
## + thinness.5.9.years 1 69.0 50456 8379.8
## + Alcohol 1 58.1 50467 8380.5
## <none> 50525 8381.9
## + percentage.expenditure 1 27.0 50498 8382.3
## + Year 1 17.8 50507 8382.8
## - under.five.deaths 1 513.1 51038 8409.5
## - Hepatitis.B 1 556.8 51082 8412.1
## - GDP 1 664.4 51190 8418.2
## - Polio 1 850.1 51375 8428.9
## - Status 1 1115.9 51641 8444.0
## - Income.composition.of.resources 1 1716.3 52242 8478.0
## - Diphtheria 1 1883.7 52409 8487.4
## - BMI 1 2143.7 52669 8501.9
## - Schooling 1 4625.7 55151 8637.2
## - Adult.Mortality 1 10759.8 61285 8947.1
## - HIV.AIDS 1 12672.7 63198 9037.4
##
## Step: AIC=8251.64
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths
##
## Df Sum of Sq RSS AIC
## + thinness..1.19.years 1 248.7 48053 8238.5
## + Alcohol 1 196.9 48105 8241.6
## + thinness.5.9.years 1 190.7 48111 8242.0
## + Total.expenditure 1 146.8 48155 8244.7
## + Measles 1 87.5 48214 8248.3
## <none> 48302 8251.6
## + percentage.expenditure 1 23.3 48279 8252.2
## + Year 1 14.5 48287 8252.8
## + Population 1 0.1 48302 8253.6
## - Hepatitis.B 1 388.7 48691 8273.2
## - Polio 1 692.9 48995 8291.5
## - GDP 1 796.0 49098 8297.7
## - Status 1 1352.4 49654 8330.8
## - Diphtheria 1 1358.4 49660 8331.1
## - Income.composition.of.resources 1 1361.3 49663 8331.3
## - BMI 1 2194.7 50497 8380.2
## - infant.deaths 1 2223.3 50525 8381.9
## - under.five.deaths 1 2384.5 50686 8391.2
## - Schooling 1 4583.1 52885 8516.0
## - Adult.Mortality 1 10175.6 58477 8811.3
## - HIV.AIDS 1 11984.3 60286 8900.8
##
## Step: AIC=8238.47
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years
##
## Df Sum of Sq RSS AIC
## + Alcohol 1 120.9 47932 8233.1
## + Total.expenditure 1 107.7 47945 8233.9
## + Measles 1 103.7 47949 8234.1
## <none> 48053 8238.5
## + percentage.expenditure 1 17.1 48036 8239.4
## + Year 1 9.1 48044 8239.9
## + Population 1 0.3 48053 8240.5
## + thinness.5.9.years 1 0.1 48053 8240.5
## - thinness..1.19.years 1 248.7 48302 8251.6
## - Hepatitis.B 1 356.5 48410 8258.2
## - Polio 1 706.1 48759 8279.3
## - GDP 1 790.4 48844 8284.4
## - Status 1 1129.1 49182 8304.7
## - Income.composition.of.resources 1 1318.8 49372 8316.0
## - Diphtheria 1 1339.1 49392 8317.2
## - BMI 1 1487.1 49540 8326.0
## - infant.deaths 1 2349.1 50402 8376.7
## - under.five.deaths 1 2469.9 50523 8383.7
## - Schooling 1 4429.0 52482 8495.5
## - Adult.Mortality 1 10110.4 58164 8797.5
## - HIV.AIDS 1 11669.2 59722 8875.2
##
## Step: AIC=8233.07
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol
##
## Df Sum of Sq RSS AIC
## + Measles 1 107.9 47824 8228.4
## + Total.expenditure 1 85.6 47847 8229.8
## <none> 47932 8233.1
## + percentage.expenditure 1 9.1 47923 8234.5
## + Year 1 2.5 47930 8234.9
## + thinness.5.9.years 1 0.4 47932 8235.0
## + Population 1 0.0 47932 8235.1
## - Alcohol 1 120.9 48053 8238.5
## - thinness..1.19.years 1 172.8 48105 8241.6
## - Hepatitis.B 1 351.1 48283 8252.5
## - Status 1 691.0 48623 8273.1
## - Polio 1 693.2 48625 8273.3
## - GDP 1 797.5 48730 8279.6
## - Diphtheria 1 1309.8 49242 8310.3
## - Income.composition.of.resources 1 1325.8 49258 8311.2
## - BMI 1 1481.5 49414 8320.5
## - infant.deaths 1 2441.6 50374 8377.0
## - under.five.deaths 1 2568.3 50501 8384.4
## - Schooling 1 3963.2 51895 8464.5
## - Adult.Mortality 1 10216.1 58148 8798.7
## - HIV.AIDS 1 11783.0 59715 8876.8
##
## Step: AIC=8228.45
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol + Measles
##
## Df Sum of Sq RSS AIC
## + Total.expenditure 1 77.7 47747 8225.7
## <none> 47824 8228.4
## + percentage.expenditure 1 9.0 47815 8229.9
## + Year 1 4.8 47820 8230.2
## + thinness.5.9.years 1 0.1 47824 8230.4
## + Population 1 0.0 47824 8230.4
## - Measles 1 107.9 47932 8233.1
## - Alcohol 1 125.1 47949 8234.1
## - thinness..1.19.years 1 185.4 48010 8237.8
## - Hepatitis.B 1 335.2 48159 8247.0
## - Status 1 674.1 48498 8267.6
## - Polio 1 682.7 48507 8268.1
## - GDP 1 793.7 48618 8274.8
## - Diphtheria 1 1287.6 49112 8304.5
## - Income.composition.of.resources 1 1305.7 49130 8305.6
## - BMI 1 1412.1 49236 8311.9
## - infant.deaths 1 2384.3 50209 8369.4
## - under.five.deaths 1 2445.6 50270 8373.0
## - Schooling 1 3995.7 51820 8462.2
## - Adult.Mortality 1 10307.8 58132 8799.9
## - HIV.AIDS 1 11732.4 59557 8871.0
##
## Step: AIC=8225.67
## Life.expectancy ~ Schooling + Adult.Mortality + HIV.AIDS + Diphtheria +
## BMI + Income.composition.of.resources + Status + Polio +
## GDP + Hepatitis.B + under.five.deaths + infant.deaths + thinness..1.19.years +
## Alcohol + Measles + Total.expenditure
##
## Df Sum of Sq RSS AIC
## <none> 47747 8225.7
## + Year 1 9.4 47737 8227.1
## + percentage.expenditure 1 3.0 47744 8227.5
## + thinness.5.9.years 1 0.5 47746 8227.6
## + Population 1 0.0 47747 8227.7
## - Total.expenditure 1 77.7 47824 8228.4
## - Measles 1 100.0 47847 8229.8
## - Alcohol 1 103.5 47850 8230.0
## - thinness..1.19.years 1 162.3 47909 8233.6
## - Hepatitis.B 1 329.1 48076 8243.8
## - Status 1 598.8 48345 8260.3
## - Polio 1 680.9 48428 8265.3
## - GDP 1 814.2 48561 8273.3
## - Diphtheria 1 1257.0 49004 8300.0
## - BMI 1 1348.5 49095 8305.5
## - Income.composition.of.resources 1 1352.2 49099 8305.7
## - infant.deaths 1 2373.7 50120 8366.2
## - under.five.deaths 1 2435.4 50182 8369.8
## - Schooling 1 3937.5 51684 8456.5
## - Adult.Mortality 1 10266.6 58013 8795.9
## - HIV.AIDS 1 11809.6 59556 8873.0
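The AIC column in the trace can be reproduced from the RSS column. For a linear model, step() reports n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. The sketch below checks the first and last steps; n = 2938 is inferred from the 2921 residual degrees of freedom plus the 17 coefficients of the final model, and trace_aic is a hypothetical helper name.

```r
# Number of observations, inferred from the final model summary:
# 2921 residual degrees of freedom + 17 estimated coefficients.
n_obs <- 2921 + 17

# AIC as printed in a step() trace for lm fits (hypothetical helper name).
trace_aic <- function(rss, n_coef, n) n * log(rss / n) + 2 * n_coef

aic_null  <- trace_aic(rss = 265573, n_coef = 1,  n = n_obs)  # Life.expectancy ~ 1
aic_final <- trace_aic(rss = 47747,  n_coef = 17, n = n_obs)  # last step of the trace

print(round(c(null = aic_null, final = aic_final), 2))
```

The results agree with the printed values 13235.23 and 8225.67 up to the rounding of the RSS column. This figure differs from the value returned by AIC() by a constant that depends only on n, so the ranking of candidate models is the same either way.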
The model is set! Let’s make a prediction based on the MixedStepModel!
Let’s interpret the MixedStepModel!
##
## Call:
## lm(formula = Life.expectancy ~ Schooling + Adult.Mortality +
## HIV.AIDS + Diphtheria + BMI + Income.composition.of.resources +
## Status + Polio + GDP + Hepatitis.B + under.five.deaths +
## infant.deaths + thinness..1.19.years + Alcohol + Measles +
## Total.expenditure, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2503 -2.2438 -0.1097 2.3665 16.3824
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.745781830 0.668314373 84.909
## Schooling 0.645416268 0.041585022 15.520
## Adult.Mortality -0.019850253 0.000792059 -25.062
## HIV.AIDS -0.470653224 0.017510073 -26.879
## Diphtheria 0.040593295 0.004629077 8.769
## BMI 0.044354554 0.004883370 9.083
## Income.composition.of.resources 5.762655777 0.633592391 9.095
## StatusDeveloping -1.625002116 0.268478794 -6.053
## Polio 0.028664684 0.004441145 6.454
## GDP 0.000046752 0.000006624 7.058
## Hepatitis.B -0.016610250 0.003701821 -4.487
## under.five.deaths -0.074629707 0.006114028 -12.206
## infant.deaths 0.099679379 0.008271756 12.051
## thinness..1.19.years -0.074951946 0.023787239 -3.151
## Alcohol 0.064726322 0.025725807 2.516
## Measles -0.000018876 0.000007631 -2.474
## Total.expenditure 0.073561760 0.033733716 2.181
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## Adult.Mortality < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## Diphtheria < 0.0000000000000002 ***
## BMI < 0.0000000000000002 ***
## Income.composition.of.resources < 0.0000000000000002 ***
## StatusDeveloping 0.00000000160683 ***
## Polio 0.00000000012680 ***
## GDP 0.00000000000211 ***
## Hepatitis.B 0.00000750069921 ***
## under.five.deaths < 0.0000000000000002 ***
## infant.deaths < 0.0000000000000002 ***
## thinness..1.19.years 0.00164 **
## Alcohol 0.01192 *
## Measles 0.01343 *
## Total.expenditure 0.02929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.043 on 2921 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.8192
## F-statistic: 832.9 on 16 and 2921 DF, p-value: < 0.00000000000000022
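Observation findings:
Among the predictors of the MixedStepModel, Income.composition.of.resources has the largest estimate in absolute value (5.763), followed by StatusDeveloping (-1.625). The predictors are measured on different scales, so a raw estimate is the change in Life.expectancy per unit of that predictor, not a direct ranking of importance.
A one-unit increase in Income.composition.of.resources is associated with the largest positive change in Life.expectancy, holding the other predictors fixed.
On the other hand, StatusDeveloping has the largest negative estimate: it lowers the predicted Life.expectancy by about 1.6 relative to the baseline Status level, holding the other predictors fixed.
The p-values for the intercept, Schooling, Adult.Mortality, HIV.AIDS, Diphtheria, BMI, Income.composition.of.resources, under.five.deaths, and infant.deaths are the smallest of all (below 2e-16), indicating that they are highly significant predictors of Life.expectancy.
The MixedStepModel has an R-squared value of 0.8202, which means the model explains about 82% of the variance in Life.expectancy. Its formula, estimates, and fit statistics are identical to those of the forwardStepModel, so both searches selected the same model.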
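The fit statistics at the bottom of the summary follow directly from two numbers in the step-wise trace: the RSS of the intercept-only model (265573, which is the total sum of squares) and the RSS of the final model (47747). A short sketch of the arithmetic, with variable names chosen only for illustration:

```r
tss      <- 265573  # RSS of Life.expectancy ~ 1, i.e. the total sum of squares
rss      <- 47747   # RSS of the final 16-predictor model
df_resid <- 2921    # residual degrees of freedom from the summary
n_pred   <- 16      # number of predictor coefficients (intercept excluded)

r2     <- 1 - rss / tss                                    # share of variance explained
adj_r2 <- 1 - (rss / df_resid) / (tss / (df_resid + n_pred))  # penalised for model size
rse    <- sqrt(rss / df_resid)                             # residual standard error
f_stat <- ((tss - rss) / n_pred) / (rss / df_resid)        # overall F-statistic

print(round(c(r2 = r2, adj_r2 = adj_r2, rse = rse, f = f_stat), 4))
```

Up to the rounding of the RSS values, these reproduce the printed 0.8202, 0.8192, 4.043 and 832.9, which shows that R-squared is a statement about the variance of Life.expectancy and not about the predictors.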
Here is the Actual vs Predicted Plot of MixedStepModel. The green line represents a perfect prediction, while the red line represents the regression line.
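In a plot like this the red line is always flatter than the green one unless the fit is perfect. For a least-squares model with an intercept, regressing the fitted values on the actual values gives a slope equal to the R-squared of the model. A minimal check, assuming the built-in mtcars data and two arbitrary predictors as stand-ins:

```r
# Illustrative only: mtcars, wt and hp are assumptions, not part of the analysis.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The same kind of line the plot draws in red: fitted values regressed on actual values.
calibration <- lm(fitted(fit) ~ mtcars$mpg)

slope <- unname(coef(calibration)[2])
r2    <- summary(fit)$r.squared
print(c(slope = slope, r_squared = r2))
```

For a model with an R-squared of 0.8202, the red line should therefore have a slope of about 0.82 against the 45 degree line's slope of 1.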
# actual vs predicted
plot(y = MixedStepModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Mixed Step model",
     xlab = "Actual",
     ylab = "Predicted(MixedStepModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(MixedStepModel$fitted.values ~ MixedStepModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the plot of the Actual vs Predicted of MixedStepModel with Confidence Interval.
The blue line is the regression line of the predicted values on the actual values, and the grey shade around it is the 95% confidence band for that line. The dashed red lines above and below it are the lower and upper bounds of the 95% prediction interval returned by predict(). The plot shows the predicted values lying well within the 95% prediction bounds.
# predict Life expectancy with 95% prediction bounds (columns fit, lwr, upr)
predictedLE5 <- predict(MixedStepModel, interval = "prediction")
# combine the actual data and predicted data
comb5 <- cbind.data.frame(clean_data, predictedLE5)
# plot the combined data
ggplot(comb5, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  geom_smooth(method = lm, se = TRUE) + # regression line with its 95% confidence band
  ggtitle("Actual vs. Predicted for MixedStepModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
The fitModel is the fullModel reduced until the VIF (Variance Inflation Factor) of every remaining predictor does not exceed 5. VIF measures how strongly a predictor is correlated with the other predictors. It is computed by regressing that predictor on all the other predictors and taking 1 / (1 - R^2) of that auxiliary regression. So, the closer that R^2 is to 1, the higher the VIF and the stronger the multicollinearity involving that predictor.
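The definition can be checked by hand. The sketch below computes each VIF from its auxiliary regression in base R and compares it with the diagonal of the inverse correlation matrix, which gives the same numbers for numeric predictors. The mtcars data and the three chosen columns are assumptions for illustration only.

```r
# Illustrative only: mtcars and these three columns stand in for the real predictors.
predictors <- mtcars[, c("wt", "hp", "disp")]

# VIF from the definition: regress each predictor on all the others.
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

# Equivalent shortcut: diagonal of the inverse of the correlation matrix.
shortcut_vif <- diag(solve(cor(predictors)))

print(round(rbind(manual_vif, shortcut_vif), 3))

# Reading a VIF backwards: a VIF of 177.316529 corresponds to an auxiliary R^2 of
implied_r2 <- 1 - 1 / 177.316529
print(round(implied_r2, 4))
```

A VIF can never fall below 1, and a value near 177 means the other predictors explain more than 99% of that predictor's variance.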
Let’s build a reduced model using VIF! First, let’s check the VIF values of the fitModel, which starts out as the full model.
## Year Status
## 1.146861 1.886573
## Adult.Mortality infant.deaths
## 1.748372 177.316529
## Alcohol percentage.expenditure
## 1.872944 5.804925
## Hepatitis.B Measles
## 1.313055 1.382727
## BMI under.five.deaths
## 1.733886 176.281234
## Polio Total.expenditure
## 1.938914 1.221827
## Diphtheria HIV.AIDS
## 2.166894 1.440765
## GDP Population
## 6.028414 1.490720
## thinness..1.19.years thinness.5.9.years
## 8.776585 8.873971
## Income.composition.of.resources Schooling
## 3.088999 3.337981
Next, let’s build the fitModel!
After the reduction, the predictors of the fitModel are Year, Status, Adult.Mortality, Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, Population, thinness..1.19.years, Income.composition.of.resources, and Schooling.
library(car) # vif() is assumed to come from the car package
# sort the VIFs in ascending order so that the largest is the last element
temp <- sort(vif(fitModel))
# reduce the model until no included predictor has a VIF above 5
while (temp[length(temp)] > 5) {
  worst <- names(temp[length(temp)]) # variable with the highest VIF
  cat("\nVariable with highest VIF - ", worst)
  frm <- as.formula(paste(".~.-", worst)) # formula that removes the variable from the model
  cat("\nRemoving variable - ", worst)
  fitModel <- update(fitModel, frm) # refit the model without that variable
  cat("\n")
  print(summary(fitModel)) # summary of the reduced model
  temp <- sort(vif(fitModel)) # recheck the VIFs for the new model
}
##
## Variable with highest VIF - infant.deaths
## Removing variable - infant.deaths
##
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.4783 -2.3102 -0.1014 2.4085 17.1550
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 94.218172614547 35.447627021394 2.658
## Year -0.019317910967 0.017720088476 -1.090
## StatusDeveloping -1.580759427126 0.276531114362 -5.716
## Adult.Mortality -0.020270686093 0.000813000612 -24.933
## Alcohol 0.023881976769 0.026494604134 0.901
## percentage.expenditure 0.000050510735 0.000092567468 0.546
## Hepatitis.B -0.019661995380 0.003796914705 -5.178
## Measles -0.000022980625 0.000007825887 -2.936
## BMI 0.045806862566 0.005046299234 9.077
## under.five.deaths -0.002044012162 0.000705589452 -2.897
## Polio 0.031708636288 0.004544183354 6.978
## Total.expenditure 0.081557612752 0.035164791641 2.319
## Diphtheria 0.047551230150 0.004716983157 10.081
## HIV.AIDS -0.485050990087 0.018022110037 -26.914
## GDP 0.000036253975 0.000014094493 2.572
## Population 0.000000003529 0.000000001699 2.077
## thinness..1.19.years -0.094503107011 0.051447976779 -1.837
## thinness.5.9.years 0.043264631538 0.050636428739 0.854
## Income.composition.of.resources 6.566711285448 0.652404575651 10.065
## Schooling 0.665004031763 0.042716960064 15.568
## Pr(>|t|)
## (Intercept) 0.00790 **
## Year 0.27573
## StatusDeveloping 0.00000001198118 ***
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.36746
## percentage.expenditure 0.58534
## Hepatitis.B 0.00000023898660 ***
## Measles 0.00335 **
## BMI < 0.0000000000000002 ***
## under.five.deaths 0.00380 **
## Polio 0.00000000000369 ***
## Total.expenditure 0.02045 *
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## GDP 0.01015 *
## Population 0.03792 *
## thinness..1.19.years 0.06633 .
## thinness.5.9.years 0.39294
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.14 on 2918 degrees of freedom
## Multiple R-squared: 0.8117, Adjusted R-squared: 0.8105
## F-statistic: 662 on 19 and 2918 DF, p-value: < 0.00000000000000022
##
##
## Variable with highest VIF - thinness.5.9.years
## Removing variable - thinness.5.9.years
##
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + Population + thinness..1.19.years + Income.composition.of.resources +
## Schooling, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.451 -2.305 -0.096 2.422 17.138
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 94.499341217681 35.444460303979 2.666
## Year -0.019444490254 0.017718649747 -1.097
## StatusDeveloping -1.574966541377 0.276435203206 -5.697
## Adult.Mortality -0.020252878125 0.000812695826 -24.921
## Alcohol 0.023347378652 0.026485990194 0.881
## percentage.expenditure 0.000051217858 0.000092559487 0.553
## Hepatitis.B -0.019670902367 0.003796724807 -5.181
## Measles -0.000023226726 0.000007820223 -2.970
## BMI 0.045305548712 0.005011841794 9.040
## under.five.deaths -0.001976059007 0.000701060439 -2.819
## Polio 0.031653416786 0.004543513628 6.967
## Total.expenditure 0.080107936440 0.035122211380 2.281
## Diphtheria 0.047673618745 0.004714589677 10.112
## HIV.AIDS -0.484693143962 0.018016409670 -26.903
## GDP 0.000035995139 0.000014090586 2.555
## Population 0.000000003514 0.000000001699 2.068
## thinness..1.19.years -0.055765398178 0.024316454018 -2.293
## Income.composition.of.resources 6.574453247811 0.652311481310 10.079
## Schooling 0.665622890040 0.042708843846 15.585
## Pr(>|t|)
## (Intercept) 0.00772 **
## Year 0.27256
## StatusDeveloping 0.00000001337641 ***
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.37812
## percentage.expenditure 0.58007
## Hepatitis.B 0.00000023569735 ***
## Measles 0.00300 **
## BMI < 0.0000000000000002 ***
## under.five.deaths 0.00485 **
## Polio 0.00000000000399 ***
## Total.expenditure 0.02263 *
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## GDP 0.01068 *
## Population 0.03868 *
## thinness..1.19.years 0.02190 *
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.14 on 2919 degrees of freedom
## Multiple R-squared: 0.8117, Adjusted R-squared: 0.8105
## F-statistic: 698.8 on 18 and 2919 DF, p-value: < 0.00000000000000022
##
##
## Variable with highest VIF - GDP
## Removing variable - GDP
##
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + Population + thinness..1.19.years + Income.composition.of.resources +
## Schooling, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.3575 -2.3315 -0.0647 2.4009 17.2581
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 85.482265573542 35.301615607074 2.421
## Year -0.014949672567 0.017647743435 -0.847
## StatusDeveloping -1.615194866551 0.276247300110 -5.847
## Adult.Mortality -0.020343311970 0.000812692294 -25.032
## Alcohol 0.019115507918 0.026459132204 0.722
## percentage.expenditure 0.000258526489 0.000044556256 5.802
## Hepatitis.B -0.019199377414 0.003795821784 -5.058
## Measles -0.000023256305 0.000007827611 -2.971
## BMI 0.046156161752 0.005005497667 9.221
## under.five.deaths -0.001980860848 0.000701720938 -2.823
## Polio 0.032055472441 0.004545081249 7.053
## Total.expenditure 0.065953799793 0.034715214262 1.900
## Diphtheria 0.047562454272 0.004718847433 10.079
## HIV.AIDS -0.483073575096 0.018022279479 -26.804
## Population 0.000000003463 0.000000001701 2.036
## thinness..1.19.years -0.056059956546 0.024339177430 -2.303
## Income.composition.of.resources 6.725532217572 0.650239347161 10.343
## Schooling 0.668854850329 0.042730474148 15.653
## Pr(>|t|)
## (Intercept) 0.01552 *
## Year 0.39700
## StatusDeveloping 0.00000000556230 ***
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.47007
## percentage.expenditure 0.00000000724548 ***
## Hepatitis.B 0.00000044981054 ***
## Measles 0.00299 **
## BMI < 0.0000000000000002 ***
## under.five.deaths 0.00479 **
## Polio 0.00000000000218 ***
## Total.expenditure 0.05755 .
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## Population 0.04181 *
## thinness..1.19.years 0.02133 *
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.143 on 2920 degrees of freedom
## Multiple R-squared: 0.8112, Adjusted R-squared: 0.8101
## F-statistic: 738.2 on 17 and 2920 DF, p-value: < 0.00000000000000022
The model is set! Let’s make a prediction based on the fitModel!
Let’s interpret the fitModel!
##
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + Population + thinness..1.19.years + Income.composition.of.resources +
## Schooling, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.3575 -2.3315 -0.0647 2.4009 17.2581
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 85.482265573542 35.301615607074 2.421
## Year -0.014949672567 0.017647743435 -0.847
## StatusDeveloping -1.615194866551 0.276247300110 -5.847
## Adult.Mortality -0.020343311970 0.000812692294 -25.032
## Alcohol 0.019115507918 0.026459132204 0.722
## percentage.expenditure 0.000258526489 0.000044556256 5.802
## Hepatitis.B -0.019199377414 0.003795821784 -5.058
## Measles -0.000023256305 0.000007827611 -2.971
## BMI 0.046156161752 0.005005497667 9.221
## under.five.deaths -0.001980860848 0.000701720938 -2.823
## Polio 0.032055472441 0.004545081249 7.053
## Total.expenditure 0.065953799793 0.034715214262 1.900
## Diphtheria 0.047562454272 0.004718847433 10.079
## HIV.AIDS -0.483073575096 0.018022279479 -26.804
## Population 0.000000003463 0.000000001701 2.036
## thinness..1.19.years -0.056059956546 0.024339177430 -2.303
## Income.composition.of.resources 6.725532217572 0.650239347161 10.343
## Schooling 0.668854850329 0.042730474148 15.653
## Pr(>|t|)
## (Intercept) 0.01552 *
## Year 0.39700
## StatusDeveloping 0.00000000556230 ***
## Adult.Mortality < 0.0000000000000002 ***
## Alcohol 0.47007
## percentage.expenditure 0.00000000724548 ***
## Hepatitis.B 0.00000044981054 ***
## Measles 0.00299 **
## BMI < 0.0000000000000002 ***
## under.five.deaths 0.00479 **
## Polio 0.00000000000218 ***
## Total.expenditure 0.05755 .
## Diphtheria < 0.0000000000000002 ***
## HIV.AIDS < 0.0000000000000002 ***
## Population 0.04181 *
## thinness..1.19.years 0.02133 *
## Income.composition.of.resources < 0.0000000000000002 ***
## Schooling < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.143 on 2920 degrees of freedom
## Multiple R-squared: 0.8112, Adjusted R-squared: 0.8101
## F-statistic: 738.2 on 17 and 2920 DF, p-value: < 0.00000000000000022
Observation findings :
Income.composition.of.resources has the largest coefficient estimate in the fitModel (6.725), followed in absolute value by StatusDeveloping (-1.615).
Holding the other predictors fixed, a one-unit increase in Income.composition.of.resources is associated with the largest positive change in Life.expectancy. Note that raw coefficients depend on each predictor’s units, so they are not a direct ranking of importance.
On the other hand, StatusDeveloping has the largest negative coefficient: holding the other predictors fixed, it lowers the predicted Life.expectancy by about 1.6 relative to the reference level of Status.
The p-values for Adult.Mortality, BMI, Diphtheria, HIV.AIDS, Income.composition.of.resources, and Schooling are the smallest among all predictors (each below 2e-16), indicating that they are highly significant predictors of Life.expectancy.
The fitModel has an R-squared of 0.8112, meaning that the model explains 81.12% of the variance in Life.expectancy.
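To make the R-squared reading concrete, the sketch below recomputes it by hand. It uses R’s built-in mtcars data and a hypothetical demo_fit model as stand-ins (assumptions for illustration, not the life expectancy data or the fitModel). It also refits the model on standardized predictors, since raw coefficient sizes depend on each predictor’s units.

```r
# Illustrative model on the built-in mtcars data (not clean_data)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - SSE / SST
sse <- sum(residuals(demo_fit)^2)              # variation left unexplained by the model
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in the target
r2_manual <- 1 - sse / sst

# The same value is reported by summary()
r2_summary <- summary(demo_fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))

# Standardizing the predictors puts their coefficients on a common scale
# (change in mpg per one standard deviation of each predictor)
std_fit <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)
std_coefs <- coef(std_fit)[-1]  # drop the intercept
print(std_coefs)
```

Standardizing changes the coefficients but not the fit itself, so the R-squared of std_fit matches that of demo_fit.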
Here is the Actual vs Predicted Plot of fitModel. The green line represents a perfect prediction, while the red line represents the regression line.
# actual vs predicted
plot(y = fitModel$fitted.values,
     x = clean_data$Life.expectancy,
     main = "Actual vs Predicted using Fit model",
     xlab = "Actual",
     ylab = "Predicted(fitModel)",
     pch = 19)
abline(0, 1, col = "green", lwd = 2) # this is a perfect prediction - 45 degree line
# add the regression line
abline(lm(fitModel$fitted.values ~ fitModel$model$Life.expectancy),
       col = "red", lwd = 2)
Here is the Actual vs Predicted plot of fitModel with its confidence and prediction intervals.
The blue line is the regression line, and the grey shade around it is its 95% confidence band. The dashed red lines above and below it mark the 95% prediction interval returned by predict(). The plot shows that almost all the data points lie well within the 95% prediction interval.
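The difference between the two kinds of interval can be sketched on a small example. The code below uses the built-in mtcars data, a hypothetical demo_fit model, and made-up new_cars values (all assumptions for illustration) to request both intervals from predict().

```r
# Illustrative one-predictor model on the built-in mtcars data
demo_fit <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5))  # hypothetical new observations

# Confidence interval: uncertainty about the mean response at each wt
conf_int <- predict(demo_fit, newdata = new_cars,
                    interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation at each wt
pred_int <- predict(demo_fit, newdata = new_cars,
                    interval = "prediction", level = 0.95)

# Both share the same fitted values, but the prediction interval is wider
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(fit = conf_int[, "fit"], conf_width, pred_width))
```

The prediction interval is wider because it adds the residual variance of an individual observation to the uncertainty of the fitted line.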
# predict Life expectancy with a 95% prediction interval
library(ggplot2) # provides ggplot() and the geoms used below
predictedLE6 <- predict(fitModel, interval = "prediction")
# combine the actual data and predicted data
comb6 <- cbind.data.frame(clean_data, predictedLE6)
# plot the combined data
ggplot(comb6, aes(Life.expectancy, fit)) +
  geom_point() +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") + # lower prediction bound
  geom_line(aes(y = upr), color = "red", linetype = "dashed") + # upper prediction bound
  geom_smooth(method = lm, se = TRUE) + # regression line with its confidence band
  ggtitle("Actual vs. Predicted for fitModel with CI") +
  xlab("Actual Life Expectancy") +
  ylab("Predicted Life Expectancy")
Let’s evaluate the best model based on several criteria!
Linear regression makes several assumptions about the data: linearity of the relationship, normality of the residuals (errors), homogeneity of the residual variance (homoscedasticity), and absence of strong correlation among the predictors (no multicollinearity). The linearity assumption has been checked in the correlation tab. Let’s check the remaining three assumptions!
To check the normality of the residuals, we can use a density plot of each model’s residuals.
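As a complement to the density plots, a numeric check can be sketched with the Shapiro-Wilk test from base R. The example runs on the built-in mtcars data with a hypothetical demo_fit model (assumptions for illustration, not one of the six models of this analysis).

```r
# Illustrative model on the built-in mtcars data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(demo_fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
# (shapiro.test() accepts between 3 and 5000 observations)
sw <- shapiro.test(res)
print(sw)

# A small p-value (for example below 0.05) is evidence against normality
residuals_look_normal <- sw$p.value >= 0.05
```

With large samples the test flags even small departures from normality, so it is best read together with the density plots rather than instead of them.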
par(mfrow = c(3, 2))
plot(density(fullModel$residuals))
plot(density(EDAModel$residuals))
plot(density(BackwardStepModel$residuals))
plot(density(forwardStepModel$residuals))
plot(density(MixedStepModel$residuals))
plot(density(fitModel$residuals))
Observation findings :
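To check the homoscedasticity assumption, we can plot the residuals against the fitted values. Under homoscedasticity, the points scatter evenly around the red zero line without forming a funnel shape.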
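Alongside the residual plots, homoscedasticity can also be checked numerically. The sketch below hand-computes a studentized (Koenker) Breusch-Pagan statistic in base R; this test is brought in here for illustration and is not part of the original analysis, and the mtcars data and demo_fit model are assumptions as well.

```r
# Illustrative model on the built-in mtcars data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(demo_fit)^2
aux_fit <- lm(sq_res ~ wt + hp, data = mtcars)

# Test statistic: n * R-squared of the auxiliary regression,
# compared to a chi-squared distribution with one degree of
# freedom per predictor
n <- nobs(demo_fit)
lm_stat <- n * summary(aux_fit)$r.squared
dof <- length(coef(aux_fit)) - 1
p_value <- pchisq(lm_stat, df = dof, lower.tail = FALSE)

print(c(statistic = lm_stat, df = dof, p.value = p_value))
# A small p-value suggests the residual variance changes with the predictors
```

If the lmtest package is available, its bptest() function offers a packaged version of this test.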
par(mfrow = c(3, 2))
plot(fullModel$fitted.values, fullModel$residuals)
abline(h=0, col = "red")
plot(EDAModel$fitted.values, EDAModel$residuals)
abline(h=0, col = "red")
plot(BackwardStepModel$fitted.values, BackwardStepModel$residuals)
abline(h=0, col = "red")
plot(forwardStepModel$fitted.values, forwardStepModel$residuals)
abline(h=0, col = "red")
plot(MixedStepModel$fitted.values, MixedStepModel$residuals)
abline(h=0, col = "red")
plot(fitModel$fitted.values, fitModel$residuals)
abline(h=0, col = "red")
Observation findings :
To check the no-multicollinearity assumption, we can use the VIF (Variance Inflation Factor). The VIF measures how strongly each predictor is correlated with the other predictors. It is computed by taking one predictor and regressing it against all the other predictors, which gives VIF = 1 / (1 - R^2). So, the closer that R^2 value is to 1, the higher the VIF and the stronger the multicollinearity for that particular predictor.
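The definition above can be sketched by hand for numeric predictors. The example uses the built-in mtcars data and an arbitrary set of three predictors (assumptions for illustration): each predictor is regressed on the others, and its VIF is 1 / (1 - R^2).

```r
# Hypothetical predictor set from the built-in mtcars data
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # regress predictor p on all the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  # VIF = 1 / (1 - R^2) of that auxiliary regression
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others; as a common rule of thumb, values above roughly 5 to 10 are treated as a warning sign. Packaged functions such as vif() in the car package report the same quantity for numeric predictors. The VIF values reported for the models in this analysis are listed next.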
## Year Status
## 1.146861 1.886573
## Adult.Mortality infant.deaths
## 1.748372 177.316529
## Alcohol percentage.expenditure
## 1.872944 5.804925
## Hepatitis.B Measles
## 1.313055 1.382727
## BMI under.five.deaths
## 1.733886 176.281234
## Polio Total.expenditure
## 1.938914 1.221827
## Diphtheria HIV.AIDS
## 2.166894 1.440765
## GDP Population
## 6.028414 1.490720
## thinness..1.19.years thinness.5.9.years
## 8.776585 8.873971
## Income.composition.of.resources Schooling
## 3.088999 3.337981
Observation findings :
fullModel shows multicollinearity between infant.deaths and under.five.deaths.
## Schooling Adult.Mortality
## 2.794906 1.269928
## Income.composition.of.resources
## 2.815281
Observation findings :
EDAModel shows no multicollinearity.
## Status Adult.Mortality
## 1.864311 1.735623
## infant.deaths Alcohol
## 170.966783 1.823815
## Hepatitis.B Measles
## 1.302165 1.375869
## BMI under.five.deaths
## 1.701555 172.903451
## Polio Total.expenditure
## 1.935065 1.178482
## Diphtheria HIV.AIDS
## 2.154512 1.420424
## GDP thinness..1.19.years
## 1.393999 1.966152
## Income.composition.of.resources Schooling
## 3.025917 3.312613
Observation findings :
BackwardStepModel shows multicollinearity between infant.deaths and under.five.deaths.
## Schooling Adult.Mortality
## 3.312613 1.735623
## HIV.AIDS Diphtheria
## 1.420424 2.154512
## BMI Income.composition.of.resources
## 1.701555 3.025917
## Status Polio
## 1.864311 1.935065
## GDP Hepatitis.B
## 1.393999 1.302165
## under.five.deaths infant.deaths
## 172.903451 170.966783
## thinness..1.19.years Alcohol
## 1.966152 1.823815
## Measles Total.expenditure
## 1.375869 1.178482
Observation findings :
forwardStepModel shows multicollinearity between infant.deaths and under.five.deaths.
## Schooling Adult.Mortality
## 3.312613 1.735623
## HIV.AIDS Diphtheria
## 1.420424 2.154512
## BMI Income.composition.of.resources
## 1.701555 3.025917
## Status Polio
## 1.864311 1.935065
## GDP Hepatitis.B
## 1.393999 1.302165
## under.five.deaths infant.deaths
## 172.903451 170.966783
## thinness..1.19.years Alcohol
## 1.966152 1.823815
## Measles Total.expenditure
## 1.375869 1.178482
Observation findings :
MixedStepModel shows multicollinearity between infant.deaths and under.five.deaths.
## Year Status
## 1.134178 1.879225
## Adult.Mortality Alcohol
## 1.739711 1.836868
## percentage.expenditure Hepatitis.B
## 1.342118 1.303560
## Measles BMI
## 1.378340 1.702102
## under.five.deaths Polio
## 2.168512 1.929626
## Total.expenditure Diphtheria
## 1.188280 2.131651
## HIV.AIDS Population
## 1.432668 1.443763
## thinness..1.19.years Income.composition.of.resources
## 1.959860 3.034365
## Schooling
## 3.330094
Observation findings :
fitModel shows no multicollinearity.
Linear regression makes several assumptions about the data: linearity of the relationship, normality of the residuals (errors), homogeneity of the residual variance (homoscedasticity), and absence of strong correlation among the predictors (no multicollinearity).
To check the assumptions, we use the diagnostic plots. The diagnostic plots show residuals (error) in four different ways:
Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line without distinct patterns indicates a linear relationship, which is good.
Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.
Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis.
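A minimal sketch of how these four plots can be produced with base R’s plot() method for lm objects, on the built-in mtcars data with a hypothetical demo_fit model (assumptions for illustration only). The numeric quantities behind the plots are extracted as well.

```r
# Illustrative model on the built-in mtcars data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(demo_fit)
par(op)  # restore the previous plotting layout

# Quantities shown in the plots
std_res  <- rstandard(demo_fit)       # standardized residuals
leverage <- hatvalues(demo_fit)       # leverage of each observation
cooks_d  <- cooks.distance(demo_fit)  # influence of each observation

# The three most influential observations by Cook's distance
print(head(sort(cooks_d, decreasing = TRUE), 3))
```

The same call can be repeated for each candidate model to compare their diagnostics side by side.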
# adjusted R-squared of each model
summary(fullModel)$adj.r.squared
summary(EDAModel)$adj.r.squared
summary(BackwardStepModel)$adj.r.squared
summary(forwardStepModel)$adj.r.squared
summary(MixedStepModel)$adj.r.squared
summary(fitModel)$adj.r.squared
# RMSE of each model's predictions against the actual values
RMSE(y_pred = fullModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = EDAModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = BackwardStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = forwardStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = MixedStepModelPreds, y_true = clean_data$Life.expectancy)
RMSE(y_pred = fitModelPreds, y_true = clean_data$Life.expectancy)

| Model | Adjusted.R.squared | RMSE |
|---|---|---|
| fullModel | 0.8190258 | 4.030796 |
| EDAModel | 0.7107153 | 5.111013 |
| BackwardStepModel | 0.8192284 | 4.031300 |
| forwardStepModel | 0.8192284 | 4.031300 |
| MixedStepModel | 0.8192284 | 4.031300 |
| fitModel | 0.8101345 | 4.130748 |
Observation findings :
The step-wise models (BackwardStepModel, forwardStepModel, and MixedStepModel) have the highest Adjusted R-squared (0.8192).
The EDAModel has the lowest Adjusted R-squared (0.7107).
The EDAModel has the highest RMSE (5.111).
The fullModel has the lowest RMSE (4.0308), followed closely by the step-wise models (4.0313).
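Both metrics can be recomputed from their definitions. The sketch below does so on the built-in mtcars data with a hypothetical demo_fit model (assumptions for illustration), using base R only rather than the RMSE() helper called above.

```r
# Illustrative model on the built-in mtcars data
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
actual <- mtcars$mpg
preds  <- predict(demo_fit)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((actual - preds)^2))

# Adjusted R-squared: R-squared penalized for the number of predictors p
n  <- nobs(demo_fit)
p  <- length(coef(demo_fit)) - 1
r2 <- summary(demo_fit)$r.squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(RMSE = rmse, Adjusted.R.squared = adj_r2))
```

RMSE computed on the training data cannot increase when predictors are added to a nested least-squares model, which is consistent with the fullModel having the lowest RMSE here. Adjusted R-squared applies a penalty for each extra predictor, so it can favor a smaller model.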
Here are the criteria to find the best model to predict Life Expectancy:
The model has the highest value of Adjusted R-Squared
The model has the lowest value of RMSE
The model has the fewest predictors.
| Model | Adjusted.R.squared | RMSE | Number.of.Predictors |
|---|---|---|---|
| fullModel | 0.8190258 | 4.030796 | 20 |
| EDAModel | 0.7107153 | 5.111013 | 3 |
| BackwardStepModel | 0.8192284 | 4.031300 | 16 |
| forwardStepModel | 0.8192284 | 4.031300 | 16 |
| MixedStepModel | 0.8192284 | 4.031300 | 16 |
| fitModel | 0.8101345 | 4.130748 | 17 |
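So, the model that best fits the criteria is the step-wise model: BackwardStepModel, forwardStepModel, and MixedStepModel give identical results, with the highest Adjusted R-squared, an RMSE nearly equal to the lowest, and fewer predictors than the fullModel.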
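A comparison table like the one above can be assembled programmatically. The sketch below does this for three hypothetical nested models on the built-in mtcars data (the model names and formulas are assumptions for illustration, not the six models of this analysis).

```r
# Hypothetical candidate models on the built-in mtcars data
models <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  large  = lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
)

# One row per model with the three selection criteria
comparison <- data.frame(
  Model = names(models),
  Adjusted.R.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  RMSE = sapply(models, function(m) sqrt(mean(residuals(m)^2))),
  Number.of.Predictors = sapply(models, function(m) length(coef(m)) - 1),
  row.names = NULL
)

# Sort so that the model with the highest adjusted R-squared comes first
comparison[order(-comparison$Adjusted.R.squared), ]
```

The three criteria can disagree, as they do in this analysis, so the final choice remains a judgment about how much accuracy to trade for a simpler model.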