What is the importance of knowing Life Expectancy? Life expectancy, or the average number of years a person is expected to live, can be used as an overall indicator of community health. Low life expectancies can result from high infant mortality rates, high rates of drug overdose or suicide, barriers to high quality healthcare, and other factors. There are many uses for it in the financial world, including life insurance, pension planning, and U.S. Social Security benefits.
The above packages were installed and used in this project.
There are 22 columns in our dataset. These columns’s label are listed below.
Country - Country Year - data is collected from 2000 - 2015 years Status - Developed or Developing status Life expectancy - Life Expectancy in age Adult Mortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population) infant deaths - Number of Infant Deaths per 1000 population Alcohol - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) percentage expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita(%) Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%) Measles - number of reported cases per 1000 population BMI - Average Body Mass Index of entire population under-five deaths - Number of under-five deaths per 1000 population Polio - (Pol3) immunization coverage among 1-year-olds (%) Total expenditure - General government expenditure on health as a percentage of total government expenditure (%) Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%) HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years) GDP - Gross Domestic Product per capita (in USD) Population - Population of the country thinness 1-19 years - Prevalence of thinness among children and adolescents for Age 10 to 19 (% ) thinness 5-9 years - Prevalence of thinness among children for Age 5 to 9(%) Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1) Schooling - Number of years of Schooling(years)
#Reading csv file
life_expectancy_data <- read.csv("C://Users/raghu/Desktop/Daya Courses/Projects/Life-Expectancy-Prediction/Life Expectancy Data.csv")
head(life_expectancy_data, 5)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
The dataset contains 2938 rows and 22 columns
#Dimensions : Gives numbers of rows and columns
dim(life_expectancy_data)
## [1] 2938 22
We can understand more about the structure of the dataset by using the str() function.
# Structure of dataset
str(life_expectancy_data)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
There are 13 variables that are taken as indicators from this dataset.
#statistical summary of the variables
summary(life_expectancy_data)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 463.94 1st Qu.:1.958e+05 1st Qu.: 1.60
## Median : 1766.95 Median :1.387e+06 Median : 3.30
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 5910.81 3rd Qu.:7.420e+06 3rd Qu.: 7.20
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## NA's :448 NA's :652 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.50 1st Qu.:0.4930 1st Qu.:10.10
## Median : 3.30 Median :0.6770 Median :12.30
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :28.60 Max. :0.9480 Max. :20.70
## NA's :34 NA's :167 NA's :163
#Check for missing values
colSums(is.na(life_expectancy_data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..1.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
We will now impute the missing values with the mean to avoid errors in our analysis.
# Select numeric variables for calculating mean
life_expectancy_data_num <- select(life_expectancy_data,-c(1,2,3))
#Calculate means of all the numeric variables
colMeans(life_expectancy_data_num, na.rm = TRUE)
## Life.expectancy Adult.Mortality
## 6.922493e+01 1.647964e+02
## infant.deaths Alcohol
## 3.030395e+01 4.602861e+00
## percentage.expenditure Hepatitis.B
## 7.382513e+02 8.094046e+01
## Measles BMI
## 2.419592e+03 3.832125e+01
## under.five.deaths Polio
## 4.203574e+01 8.255019e+01
## Total.expenditure Diphtheria
## 5.938190e+00 8.232408e+01
## HIV.AIDS GDP
## 1.742103e+00 7.483158e+03
## Population thinness..1.19.years
## 1.275338e+07 4.839704e+00
## thinness.5.9.years Income.composition.of.resources
## 4.870317e+00 6.275511e-01
## Schooling
## 1.199279e+01
# Impute missing values in numeric variables with mean
for(i in 4:ncol(life_expectancy_data)) {
life_expectancy_data[ , i][is.na(life_expectancy_data[ , i])] <- mean(life_expectancy_data[ , i], na.rm=TRUE)
}
summary(life_expectancy_data)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.20
## Mode :character Median :2008 Mode :character Median :72.00
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.60
## Max. :2015 Max. :89.00
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.010 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 1.093 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 4.160 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.603 Mean : 738.251
## 3rd Qu.:227.0 3rd Qu.: 22.0 3rd Qu.: 7.390 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.870 Max. :19479.912
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:80.94 1st Qu.: 0.0 1st Qu.:19.40 1st Qu.: 0.00
## Median :87.00 Median : 17.0 Median :43.00 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:96.00 3rd Qu.: 360.2 3rd Qu.:56.10 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.370 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.938 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.330 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 580.49 1st Qu.:4.189e+05 1st Qu.: 1.60
## Median : 3116.56 Median :3.676e+06 Median : 3.40
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 7483.16 3rd Qu.:1.275e+07 3rd Qu.: 7.10
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.60 1st Qu.:0.5042 1st Qu.:10.30
## Median : 3.40 Median :0.6620 Median :12.10
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7720 3rd Qu.:14.10
## Max. :28.60 Max. :0.9480 Max. :20.70
# We can see that now the data set has no missing values
colSums(is.na(life_expectancy_data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 0 0
## Alcohol percentage.expenditure
## 0 0
## Hepatitis.B Measles
## 0 0
## BMI under.five.deaths
## 0 0
## Polio Total.expenditure
## 0 0
## Diphtheria HIV.AIDS
## 0 0
## GDP Population
## 0 0
## thinness..1.19.years thinness.5.9.years
## 0 0
## Income.composition.of.resources Schooling
## 0 0
dim(life_expectancy_data)
## [1] 2938 22
While predicting life expectancy there could be few outliers that we need to ignore. Detection of outliers is important as, it increases the error variance and reduces the power of statistical tests. They can cause bias and/or influence estimates. They can also impact the basic assumption of regression as well as other statistical models.
#Plotting box plots of life expectancy to understand outliers
boxplot(life_expectancy_data$Life.expectancy, xlab="Life Expectancy")
From the box plot we can see that age below 45 is Outlier. Our analysis is not applicable for these records.
outliers <- boxplot(life_expectancy_data$Life.expectancy, plot=FALSE)$out
life_expectancy_data<- life_expectancy_data[-which(life_expectancy_data$Life.expectancy %in% outliers),]
dim(life_expectancy_data)
## [1] 2921 22
Now we will perform correlation analysis to identify how each factor is related to the life expectancy of a person.
Typically, for each of the independent variables (predictors), the following plots are drawn to visualize the following behavior:
1. Scatter plot: Visualize the linear relationship between the predictor and response.
2. Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.
3. Density plot: To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred.
We will be using mostly scatter plots to identify any relationship between variables as they are easy to detect visually detect any correlation.
#correlation between percentage expenditure and life expectancy
life_expectancy_vs_percenntage_expenditure <- ggplot(life_expectancy_data, aes(percentage.expenditure, Life.expectancy)) +
geom_jitter(color = "yellow", alpha = 0.5) + theme_light()
life_expectancy_vs_Total_expenditure <- ggplot(life_expectancy_data, aes(Total.expenditure, Life.expectancy)) +
geom_jitter(color = "Light green", alpha = 0.5) + theme_light()
p <- plot_grid(life_expectancy_vs_percenntage_expenditure, life_expectancy_vs_Total_expenditure)
title <- ggdraw() + draw_label("Correlation between Health expenditure and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))
We can see from the above graph that the concentration of life expectancy is more when expenditure varies from 5k - 20k. Similar Analysis could be done for other variables. Lets find out if there is any effect of immunization coverage on life expectancy.
library(plotly)
life_expectancy_vs_Hepatitis_B <- ggplot(life_expectancy_data, aes(Hepatitis.B, Life.expectancy)) +
geom_jitter(color = "purple", alpha = 0.5) + theme_light()
life_expectancy_vs_Diphtheria <- ggplot(life_expectancy_data, aes(Diphtheria, Life.expectancy)) +
geom_jitter(color = "orange", alpha = 0.5) + theme_light()
life_expectancy_vs_Polio <- ggplot(life_expectancy_data, aes(Polio, Life.expectancy)) + geom_jitter(color = "pink", alpha = 0.5) + theme_grey()
p <- plot_grid(life_expectancy_vs_Hepatitis_B, life_expectancy_vs_Diphtheria, life_expectancy_vs_Polio )
title <- ggdraw() + draw_label("Correlation between Immunizations and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))
Data points for Immunizations ( Hepatitis.B and Diphtheria) are concentrated in the age range from 60 - 80 years old, this means getting imunized is definitely positive for better life expectancy.
Measles represents the number of reported cases per 1000 population
#correlation between measles and life expectancy
life_expectancy_vs_Measles <- plot_ly(data = life_expectancy_data, x = ~Measles , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(221,160,221, .3)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Measles <- life_expectancy_vs_Measles %>% layout(title = 'Scatter Plot: Life Expectancy vs Measles',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Measles
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
We can see that the concentration of reported cases of Measles is between 0 - 50k and is varied across all the age groups. We can also see some values in the 50k - 250k range that may have influenced the Life expectancy.
Alcohol represents Gross Domestic Product per capita (in USD)
#correlation between alcohol and life expectancy
life_expectancy_vs_Alcohol <- plot_ly(data = life_expectancy_data, x = ~Alcohol , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(152, 215, 182, .5)',
line = list(color = 'rgba(0, 0, 0, 0)',
width = 2)))
life_expectancy_vs_Alcohol <- life_expectancy_vs_Alcohol %>% layout(title = 'Scatter Plot: Life Expectancy vs Alcohol ',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Alcohol
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
#correlation between BMI and life expectancy
life_expectancy_vs_BMI <- plot_ly(data = life_expectancy_data, x = ~BMI, y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(255,182,193, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_BMI <- life_expectancy_vs_BMI %>% layout(title = 'Scatter Plot: Life Expectancy vs BMI',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_BMI
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
We can see that life expectancy of the population decrease as the BMI increase.
We want to analyse if lesser deaths are positively related to life expectancy.
Under five deaths represents the number of under-five deaths per 1000 population
library(plotly)
life_expectancy_vs_under_five_deaths <- ggplot(life_expectancy_data, aes(under.five.deaths, Life.expectancy)) + geom_jitter(color = "pink", alpha = 0.5) + theme_grey()
p <- plot_grid(life_expectancy_vs_under_five_deaths)
title <- ggdraw() + draw_label("Correlation between Under five deaths and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))
GDP Gross Domestic Product per capita (in USD)
#correlation between GDP and lif expectancy
life_expectancy_vs_GDP <- ggplot(life_expectancy_data, aes(GDP, Life.expectancy)) +
geom_jitter(color = "dark green", alpha = 0.5) + theme_light()
p <- plot_grid(life_expectancy_vs_GDP)
title <- ggdraw() + draw_label("Correlation between GDP vs Life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))
thinness 1 to 19 years Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
#correlation between thinness and life expectancy
life_expectancy_vs_thinness_1_19_years <- ggplot(life_expectancy_data, aes(thinness..1.19.years, Life.expectancy)) +
geom_jitter(color = "blue", alpha = 0.5) + theme_light()
life_expectancy_vs_thinness_5_9_years <- ggplot(life_expectancy_data, aes(thinness.5.9.years, Life.expectancy)) +
geom_jitter(color = "orange", alpha = 0.5) + theme_light()
p <- plot_grid(life_expectancy_vs_thinness_1_19_years, life_expectancy_vs_thinness_5_9_years)
title <- ggdraw() + draw_label("Correlation between Thinness vs Life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))
Income composition of resources Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
library(plotly)
life_expectancy_vs_Income_composition_of_resources <- plot_ly(data = life_expectancy_data, x = ~Income.composition.of.resources , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(181, 201, 253, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Income_composition_of_resources <- life_expectancy_vs_Income_composition_of_resources %>% layout(title = 'Scatter Plot: Life Expectancy vs Income composition of resources',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Income_composition_of_resources
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
We can see Human development positively influences Life expectancy.
We will use ggcorr function to check the correlation between independent numerical variables and the dependent variable.
## Warning in ggcorr(life_expectancy_data, label = T, label_size = 2, label_round =
## 2, : data in column(s) 'Country', 'Status' are not numeric and were ignored
We observe that the Life.Expectancy variable has a strong positive correlation with schooling and Income.composition.of.resources. On the other hand, Adult.Mortality has a strong negative correlation with Life.Expectancy and is a valid finding since when the Adult mortality is high, the life expectancy will definitely be low.
#we will split the data into train and test for model building
n_train <- round(0.8 * nrow(life_expectancy_data))
train_indices <- sample(1:nrow(life_expectancy_data), n_train)
train_data <- life_expectancy_data[train_indices, ]
test_data <- life_expectancy_data[-train_indices, ]
Now that we have seen correlation showing the relationship of Life Expectancy with each independent variables. Let’s analyse the data set using Multiple linear regression.
The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use this regression model to predict the Y when only the X is known.
Let’s build our initial Linear regression model.
Outcome: life.expectancy (reasons for high or low life expectancy?) Predictors: Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, GDP, Total.expenditure, Diphtheria, thinness..1.19.years, thinness.5.9.years, Income.composition.of.resources
#first model
formula1 <- as.formula("Life.expectancy ~ Alcohol + percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + Polio+ Total.expenditure + Diphtheria + thinness..1.19.years + thinness.5.9.years + GDP + Income.composition.of.resources")
model1 <- lm(formula1, data = train_data)
summary(model1)
##
## Call:
## lm(formula = formula1, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9221 -3.0859 0.1629 3.1554 22.9896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.643e+01 7.399e-01 62.744 < 2e-16 ***
## Alcohol 6.435e-02 3.590e-02 1.793 0.07317 .
## percentage.expenditure 1.565e-04 1.312e-04 1.193 0.23308
## Hepatitis.B -1.352e-02 6.070e-03 -2.228 0.02600 *
## Measles -1.919e-05 1.274e-05 -1.506 0.13226
## BMI 9.389e-02 7.399e-03 12.689 < 2e-16 ***
## under.five.deaths 2.807e-04 9.539e-04 0.294 0.76858
## Polio 5.381e-02 6.883e-03 7.819 8.03e-15 ***
## Total.expenditure 9.183e-02 5.324e-02 1.725 0.08467 .
## Diphtheria 6.084e-02 7.219e-03 8.428 < 2e-16 ***
## thinness..1.19.years -1.038e-01 7.900e-02 -1.314 0.18911
## thinness.5.9.years -1.301e-01 7.753e-02 -1.678 0.09355 .
## GDP 5.920e-05 2.077e-05 2.850 0.00441 **
## Income.composition.of.resources 1.714e+01 7.424e-01 23.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.64 on 2323 degrees of freedom
## Multiple R-squared: 0.6327, Adjusted R-squared: 0.6306
## F-statistic: 307.8 on 13 and 2323 DF, p-value: < 2.2e-16
r_sq1 <- summary(model1)$r.squared
prediction1 <- predict(model1, newdata = test_data)
residuals1 <- test_data$Life.expectancy - prediction1
rmse1 <- sqrt(mean(residuals1^2, na.rm=TRUE))
The first step in interpreting the regression is to examine the F-statistic and the associated p-value, at the bottom of model summary.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of regression beta coefficients and the associated t-statitic p-values:
This is very interesting to see the effect of all other variables - Alcohol, Immunizations (Polio, Diptheria), BMI and Human development have a statistically significant effect on the outcome of Life expectancy.
summary(model1)$coefficient
## Estimate Std. Error t value
## (Intercept) 4.642729e+01 7.399437e-01 62.744351
## Alcohol 6.435314e-02 3.590015e-02 1.792559
## percentage.expenditure 1.565036e-04 1.312087e-04 1.192784
## Hepatitis.B -1.352081e-02 6.069735e-03 -2.227578
## Measles -1.918537e-05 1.274108e-05 -1.505788
## BMI 9.388895e-02 7.399244e-03 12.688993
## under.five.deaths 2.806923e-04 9.538888e-04 0.294261
## Polio 5.381452e-02 6.882801e-03 7.818695
## Total.expenditure 9.183487e-02 5.323976e-02 1.724930
## Diphtheria 6.084419e-02 7.219329e-03 8.427957
## thinness..1.19.years -1.037799e-01 7.900391e-02 -1.313605
## thinness.5.9.years -1.300648e-01 7.752717e-02 -1.677667
## GDP 5.920481e-05 2.077122e-05 2.850329
## Income.composition.of.resources 1.714154e+01 7.424478e-01 23.087872
## Pr(>|t|)
## (Intercept) 0.000000e+00
## Alcohol 7.317358e-02
## percentage.expenditure 2.330759e-01
## Hepatitis.B 2.600415e-02
## Measles 1.322575e-01
## BMI 1.013461e-35
## under.five.deaths 7.685847e-01
## Polio 8.027681e-15
## Total.expenditure 8.467309e-02
## Diphtheria 6.079207e-17
## thinness..1.19.years 1.891089e-01
## thinness.5.9.years 9.354661e-02
## GDP 4.405924e-03
## Income.composition.of.resources 2.381008e-106
For a given the predictor, the t-statistic evaluates whether or not there is significant association between the predictor and the outcome variable, that is whether the beta coefficient of the predictor is significantly different from zero.
It can be seen that, change in the Alcohol,BMI,Polio, Total expenditure,Diphtheria, Thinness 1- 19 years,Income composition of resources are significantly associated to life expectancy of a person.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.
We found that Measles, percentage expenditure, Hepatitis B, under five deaths variables are not significant in the multiple regression model. We can remove these variables from our analysis.
#second model
#dropping insignificant variables like measles, percentage.expenditure, Hepatitis.B, Under.five.deaths, Thinness.5.9.years
formula2 <- as.formula("Life.expectancy ~ Alcohol + Diphtheria + BMI + Polio + Total.expenditure + thinness..1.19.years + Income.composition.of.resources")
model2 <- lm(formula2, data = train_data)
summary(model2)
##
## Call:
## lm(formula = formula2, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0275 -3.0852 0.1366 3.0732 23.3538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.780807 0.692541 64.662 < 2e-16 ***
## Alcohol 0.111912 0.035692 3.135 0.00174 **
## Diphtheria 0.054351 0.006880 7.899 4.29e-15 ***
## BMI 0.097621 0.007428 13.143 < 2e-16 ***
## Polio 0.053603 0.006934 7.731 1.58e-14 ***
## Total.expenditure 0.122170 0.053438 2.286 0.02233 *
## thinness..1.19.years -0.237843 0.034424 -6.909 6.26e-12 ***
## Income.composition.of.resources 18.911779 0.720720 26.240 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.731 on 2329 degrees of freedom
## Multiple R-squared: 0.6198, Adjusted R-squared: 0.6187
## F-statistic: 542.4 on 7 and 2329 DF, p-value: < 2.2e-16
r_sq2 <- summary(model2)$r.squared
prediction2 <- predict(model2, newdata = test_data)
residuals2 <- test_data$Life.expectancy - prediction2
rmse2 <- sqrt(mean(residuals2^2, na.rm=TRUE))
print(paste0("R-squared for first model:", round(r_sq1, 4)))
## [1] "R-squared for first model:0.6327"
print(paste0("R-squared for second model: ", round(r_sq2, 4)))
## [1] "R-squared for second model: 0.6198"
print(paste0("RMSE for first model: ", round(rmse1, 2)))
## [1] "RMSE for first model: 6.12"
print(paste0("RMSE for second model: ", round(rmse2, 2)))
## [1] "RMSE for second model: 6.17"
The confidence interval of the model coefficient can be extracted as follow:
confint(model2, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 43.42274699 46.13886800
## Alcohol 0.04191988 0.18190398
## Diphtheria 0.04085837 0.06784335
## BMI 0.08305466 0.11218642
## Polio 0.04000570 0.06720047
## Total.expenditure 0.01737986 0.22696006
## thinness..1.19.years -0.30534709 -0.17033909
## Income.composition.of.resources 17.49845990 20.32509864
#prediction
test_data$prediction <- predict(model2, newdata = test_data)
ggplot(test_data, aes(x = prediction, y = Life.expectancy)) +
geom_point(color = "blue", alpha = 0.7) +
geom_abline(color = "red") +
ggtitle("Prediction vs. Real values")
Diagnostic Plots are used to evaluate the model assumptions and understand whether or not there are observations that can strongly have influence on the analysis. Residuals are the measure of how far from the regression line the Data points are. Fitted values are models prediction of mean response value when you input the values of the predictors, factor levels or components into the model. We have plotted the following graphs: 1. Residuals Vs Fitted values graph 2. Histogram of residuals
#residuals vs linear model prediction
test_data$residuals <- test_data$Life.expectancy - test_data$prediction
ggplot(data = test_data, aes(x = prediction, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals), color = "purple", alpha = 0.7) + geom_hline(yintercept = 0, linetype = 4, color = "red") +
ggtitle("Residuals vs. Linear model prediction")
#histogram for residuals
ggplot(test_data, aes(x = residuals)) +
geom_histogram(bins = 15, fill = "light blue") +
ggtitle("Histogram of residuals")
Residual: It is an error between a predicted value and the observed actual value. Assumptions for a Residual Plot: 1. The most important assumption of a linear regression model is that errors are independent and normally distributed. 2. It has high quality of points close to the origin and low density of points away from the origin. 3. It is symmetric about the origin. To validate the regression model, you must use redual plot to visually confirm the validity of your model.
The use case for this visualization is to compare a predictive model score to an actual outcome (either binary (0/1) or continuous). In this case the gain curve plot measures how well the model score sorts the data compared to the true outcome value.
GainCurvePlot(test_data, "prediction", "Life.expectancy", "Model2")
Finally our model can be written as follow:
As we have seen in our Linear regression model, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).
R-squared: In multiple linear regression, the R2 represents the correlation coefficient between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, the value of R will always be positive and will range from zero to one.
R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with the R2, is that, it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.
Residual Standard Error (RSE), or sigma: The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand). The error rate can be estimated by dividing the RSE by the mean outcome variable:
sigma(model2)/mean(life_expectancy_data$Life.expectancy)
## [1] 0.08260357
In our regression example, the RSE is 5.860291 corresponding to 8.4% error rate which is pretty good.
We can test our model by giving the input values for the model to predict the approx. life expectancy of a person.
#test1
XYZ <- data.frame( Country = "India",
Alcohol = 5.28,
Diphtheria = 86,
BMI = 38.9,
Polio = 98,
Total.expenditure = 11.14,
thinness..1.19.years = 2.1,
Income.composition.of.resources = 0.741)
print(paste0("Life expectancy for XYZ: ", round(predict(model2, XYZ), 2)))
## [1] "Life expectancy for XYZ: 73.97"
It is deduced from the above observations that, there is a high chance of increased Life expectancy by being immunized. We were successfully able to meet the objective and figure out variables that play important role in Life expectancy prediction model, that can be used in companies and as well as by customers. Data set with more features will provide more accurate outputs.