Introduction

Life expectancy is a key measure of the health of a population.
According to the WHO [1], the life expectancy of people in a country is influenced by factors such as GDP, education, water, sanitation, and vaccination.
This analysis examines four major factors that contribute to the life expectancy of people around the world.
We also aim to predict life expectancy from the major contributors by checking the correlation of these predictors with life expectancy.
The dataset under consideration contains values for 264 different countries with the following variables:
Life Expectancy, Water, GDP, Sanitation, Vaccine

Problem Statement

Of the factors that influence life expectancy at birth (in years), as published by the World Bank, which are the major contributors to the life expectancy of people in different countries of the world?
Is it possible to predict the life expectancy of people in different countries using these factors?
If so, what percentage of the variability in Life Expectancy can be explained by the predictor variables?
How are the residuals of the fitted model distributed?

This analysis will try to answer these questions through the following steps:

Collecting life expectancy data, and data on the other variables that affect life expectancy, from a publicly available and reliable source such as World Bank data.
Checking the correlation between life expectancy and the other variables considered for the analysis.
Creating a linear regression model using the data downloaded from World Bank Data.

Data

The data is taken from 5 different files:
Life expectancy at birth, in years (Target Variable)
Improved sanitation facilities (% of population with access)
Immunization, DPT (% of children ages 12-23 months)
GDP per capita (current US$)
Improved water source (% of population with access)

These data are collected and derived from sources such as:

Male and female life expectancy at birth from the United Nations Population Division, census reports, and other statistical publications from national statistical offices.
Water and sanitation data collected by the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation.
GDP of the countries from World Bank national accounts data and OECD National Accounts data files.

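# Packages used throughout this analysis
library(dplyr)    # %>% and summarise()
library(ggplot2)  # ggplot()
library(psych)    # pairs.panels()
library(caTools)  # sample.split()

setwd("~/Desktop/Stats Assignment 4")
# skip = 4 skips the lines that precede the header row in each file
Lifeexpectancy <- read.csv(file="Lifeexpectancyatbirth.csv", skip=4, header=TRUE)
gdpp <- read.csv(file="GDP.csv", skip=4, header=TRUE)
Improvedsanitationfacilities <- read.csv(file="Improvedsanitationfacilities.csv", skip=4, header=TRUE)
vaccinations <- read.csv(file="ImmunizationDPT.csv", skip=4, header=TRUE)
watersource <- read.csv(file="ImprovedWatersource.csv", skip=4, header=TRUE)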

Data Preparation and Exploration

The available data covers the years 1960 to 2014, with 264 observations and 62 variables.
Only the 2014 data is used for the analysis, as it is the latest in the dataset.
A new data frame is created with the following variables:
Country Name, Country Code, Life Expectancy, GDP, Sanitation, Vaccine, Water

#Loading variables into Data frame
HealthData <- data.frame(Country.Code=Lifeexpectancy$Country.Code, Country.Name=Lifeexpectancy$Country.Name,
                  LifeExp=Lifeexpectancy$X2014,GDP=gdpp$X2014,sanitation=Improvedsanitationfacilities$X2014,
                  Vaccine=vaccinations$X2014,water=watersource$X2014)

Since data is not available for all countries, 49 observations in the data frame contain missing values and have to be removed.

# Data cleaning: remove observations with any NAs
HealthData_cleaned <- subset(HealthData, 
                  !(is.na(HealthData$Country.Code)|
                    is.na(HealthData$Country.Name)| 
                    is.na(HealthData$GDP) |
                    is.na(HealthData$LifeExp) |
                    is.na(HealthData$sanitation) |
                    is.na(HealthData$water) |
                    is.na(HealthData$Vaccine)))
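
The same row-wise filtering can also be written with base R's `complete.cases()`. The sketch below is only an illustration on a small made-up data frame (`toy` and its values are hypothetical, not taken from the World Bank files); it shows how rows with any missing value are dropped and how the number of dropped rows can be counted.

```r
# Hypothetical toy data standing in for HealthData (values are invented)
toy <- data.frame(
  Country.Code = c("AAA", "BBB", "CCC", "DDD"),
  LifeExp      = c(71.2, NA, 65.4, 80.1),
  sanitation   = c(88.0, 45.5, NA, 99.0)
)

# complete.cases() is TRUE only for rows with no NA in any column
toy_cleaned <- toy[complete.cases(toy), ]

# Number of observations removed because of at least one NA
n_dropped <- nrow(toy) - nrow(toy_cleaned)

print(toy_cleaned)
print(n_dropped)
```

Counting the dropped rows this way makes it easy to report how many observations were lost to missing values.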

Visualisation

pairs.panels(subset(HealthData_cleaned, select = -c(Country.Name, Country.Code)), hist.col = "#3498DB")

Visualisation Cont..

The paired plot on the previous slide shows scatter plots below the diagonal, histograms on the diagonal, and Pearson correlations above the diagonal.
From the graph, the variables sanitation (0.86) and water (0.78) are highly correlated with Life Expectancy.
The variables GDP (0.61) and Vaccine (0.57) are less strongly correlated with the dependent variable, so they are removed from further analysis.
Water and sanitation are highly collinear (0.81), so water is also removed.
We are left with only one variable, Sanitation.
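
The screening described above can be sketched numerically with `cor()`. This is a minimal illustration on invented data: the data frame `toy`, its values, and the 0.7 cut-off are assumptions made for the example, not figures from the analysis.

```r
# Hypothetical toy data: one target and three candidate predictors
toy <- data.frame(
  LifeExp    = c(55, 60, 66, 70, 74, 79, 82),
  sanitation = c(20, 35, 50, 62, 75, 90, 98),
  water      = c(50, 58, 70, 80, 85, 95, 99),
  Vaccine    = c(90, 70, 85, 60, 95, 80, 92)
)

# Pearson correlation matrix of all variables
cor_matrix <- cor(toy, method = "pearson")

# Correlation of each predictor with the target
target_cor <- cor_matrix["LifeExp", colnames(cor_matrix) != "LifeExp"]

# Keep predictors whose absolute correlation exceeds an assumed cut-off
strong <- names(target_cor)[abs(target_cor) > 0.7]

# Correlation among the retained predictors, to spot collinear pairs
predictor_cor <- cor_matrix[strong, strong, drop = FALSE]

print(round(target_cor, 2))
print(round(predictor_cor, 2))
```

A high off-diagonal value in `predictor_cor` indicates that two retained predictors carry largely the same information, which is the reason given for keeping only one of them.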

#Loading variables into Data frame
HealthData <- data.frame(Country.Code=Lifeexpectancy$Country.Code, Country.Name=Lifeexpectancy$Country.Name,
                  LifeExp=Lifeexpectancy$X2014,sanitation=Improvedsanitationfacilities$X2014)
# Data cleaning : Remove observations with any NAs, 
HealthData_cleaned <- subset(HealthData, 
                  !(is.na(HealthData$Country.Code)|
                    is.na(HealthData$Country.Name)| 
                    is.na(HealthData$LifeExp) |
                    is.na(HealthData$sanitation)))

The data is cleaned of missing values again. With only two variables retained, just 35 observations are removed, leaving 229 observations in the dataset for analysis.

Descriptive Statistics

HealthData_cleaned %>% summarise(Min = min(LifeExp,na.rm = TRUE),
                                         Q1 = quantile(LifeExp,probs = .25,na.rm = TRUE),
                                         Median = median(LifeExp, na.rm = TRUE),
                                         Q3 = quantile(LifeExp,probs = .75,na.rm = TRUE),
                                         Max = max(LifeExp,na.rm = TRUE),
                                         Mean = mean(LifeExp, na.rm = TRUE),
                                         SD = sd(LifeExp, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(LifeExp)))

HealthData_cleaned %>% summarise(Min = min(sanitation,na.rm = TRUE),
                                         Q1 = quantile(sanitation,probs = .25,na.rm = TRUE),
                                         Median = median(sanitation, na.rm = TRUE),
                                         Q3 = quantile(sanitation,probs = .75,na.rm = TRUE),
                                         Max = max(sanitation,na.rm = TRUE),
                                         Mean = mean(sanitation, na.rm = TRUE),
                                         SD = sd(sanitation, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(sanitation)))

Scatter Plot

The scatter plot of Life Expectancy vs. Sanitation shows that the relationship is approximately linear.

ggplot(HealthData_cleaned, aes(x = sanitation, y = LifeExp)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # least-squares line
  geom_point() +
  theme_bw()

Histogram

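The histograms for Life Expectancy and Sanitation plotted below do not show evidence that the data is normally distributed.
Linear regression requires normality of the residuals rather than of the variables themselves, and with more than 30 observations inference on the coefficients is reasonably robust to moderate non-normality.
We will check the normality of the residuals later using a Q-Q plot.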

par(mfrow=c(1,2))
#Histograms
hist(HealthData_cleaned$LifeExp, main = "Histogram for Life expectancy")
hist(HealthData_cleaned$sanitation, main = "Histogram for Improvement in sanitation")

Assumptions

The assumptions for the regression model are:

Independence
- The data is assumed to be independent.
- No country is considered twice in the analysis.
- The measurements for one country do not depend on the measurements for any other country.

Linearity
- The relationship is assumed to be linear, as seen in the scatter plot on the previous slides.

Normality of residuals
- The residuals are assumed to be approximately normal; with more than 30 observations, inference on the coefficients is also reasonably robust to moderate departures from normality.
- This assumption is checked after the model is created, using the Q-Q plot.

Homoscedasticity
- The variance of the residuals is assumed to be constant across the fitted values in this dataset.

Linear Regression Model

For the purpose of analysis and validation of the model, the dataset is divided in a ratio of 70:30.
70% of the data is assigned to the training dataset and 30% to the validation dataset.
The regression model is built on the training dataset.
The model is validated using the validation dataset.

# Define the training and validation samples from the main data source
# Requires caTools for sample.split(); the split is random, so call set.seed()
# beforehand if the same split is needed on every run
HealthData_cleaned$Split <- sample.split(HealthData_cleaned$LifeExp, SplitRatio = .7)

training <- subset(HealthData_cleaned, HealthData_cleaned$Split==TRUE, select = -Split)

validation <- subset(HealthData_cleaned, HealthData_cleaned$Split==FALSE, select = -Split)
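
For readers without caTools, a 70:30 split can also be sketched in base R with `sample()`. This swaps in plain random sampling of row indices for `sample.split()`, so it is not the method used above; the data frame `toy` and the seed value are hypothetical.

```r
# Hypothetical toy data with 100 rows (values are invented)
toy <- data.frame(
  LifeExp    = seq(50, 85, length.out = 100),
  sanitation = seq(10, 100, length.out = 100)
)

set.seed(42)  # arbitrary seed, chosen only to make the example repeatable

n <- nrow(toy)
# Draw 70% of the row indices without replacement for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))

training_toy   <- toy[train_idx, ]
validation_toy <- toy[-train_idx, ]

print(c(training = nrow(training_toy), validation = nrow(validation_toy)))
```

Fixing the seed means the same rows land in each set on every run, so fitted coefficients and R-squared values quoted in the text stay in step with the printed output.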

Linear Regression Model cont..

The R-squared value is found to be good (0.7197).
Therefore, Sanitation explains 71.97% of the variability in the Life Expectancy of people in different countries.

# Simple linear regression model with Life Expectancy as target variable
model <- lm(data = training, LifeExp ~ sanitation)
summary(model)

## 
## Call:
## lm(formula = LifeExp ~ sanitation, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7608  -2.0168   0.3339   2.8363   8.0749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.22231    0.89146   60.82   <2e-16 ***
## sanitation   0.23432    0.01163   20.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.086 on 158 degrees of freedom
## Multiple R-squared:  0.7197, Adjusted R-squared:  0.718 
## F-statistic: 405.8 on 1 and 158 DF,  p-value: < 2.2e-16
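
The "Multiple R-squared" figure in this kind of output is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on invented data (the data frame `toy` is hypothetical) that recomputes it by hand and compares it with the value reported by `summary()`:

```r
# Hypothetical toy data (values are invented)
toy <- data.frame(
  sanitation = c(20, 35, 50, 62, 75, 90, 98),
  LifeExp    = c(55, 61, 64, 70, 73, 80, 81)
)

fit <- lm(LifeExp ~ sanitation, data = toy)

# Residual sum of squares: variation left unexplained by the fit
ss_res <- sum(residuals(fit)^2)
# Total sum of squares: variation of the target around its mean
ss_tot <- sum((toy$LifeExp - mean(toy$LifeExp))^2)

r_squared <- 1 - ss_res / ss_tot

print(r_squared)
print(summary(fit)$r.squared)
```

With a single predictor, this value also equals the squared Pearson correlation between the predictor and the target.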

Hypothesis testing for model parameters

Consider the following statistical hypotheses for the overall model (F-test):
– H0: The data do not fit the linear regression model (the slope is zero).
– HA: The data fit the linear regression model (the slope is not zero).
A regression of Life Expectancy on Sanitation is run.
From the regression output, the p-value of the F-test (< 2.2e-16) is less than 0.05, which means that the result is statistically significant.
So the decision is to reject the null hypothesis.
The data fit the linear regression model.

model %>% confint()

##                  2.5 %     97.5 %
## (Intercept) 52.4615998 55.9830172
## sanitation   0.2113424  0.2572927

Since the model is statistically significant, the parameters of the linear regression model are interpreted next.
Hypotheses for the intercept
- H0: Intercept = 0
- HA: Intercept ≠ 0
Hypotheses for the slope
- H0: Slope = 0
- HA: Slope ≠ 0
Neither 95% confidence interval contains 0, so the decision is to reject the null hypothesis for both the intercept and the slope.
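
The intervals returned by `confint()` can be reproduced from the coefficient table as estimate ± t × standard error, where t is the 97.5th percentile of the t distribution with the residual degrees of freedom. The sketch below plugs in the estimates, standard errors, and degrees of freedom printed in the model summary; the variable names are chosen only for this illustration.

```r
# Estimates and standard errors copied from the summary(model) output
estimate  <- c(intercept = 54.22231, sanitation = 0.23432)
std_error <- c(intercept = 0.89146,  sanitation = 0.01163)
df_resid  <- 158  # residual degrees of freedom from the same output

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- cbind(
  lower = estimate - t_crit * std_error,
  upper = estimate + t_crit * std_error
)

# A coefficient differs significantly from 0 when its interval excludes 0
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0

print(round(ci, 4))
print(excludes_zero)
```

Up to rounding of the inputs, the bounds agree with the `confint()` output, and both intervals lie entirely above zero.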

Regression Model Summary

Residual vs. Fitted Plot
- The relationship between fitted values and residuals is almost flat.
- The assumption of homoscedasticity, or constant variance, appears to hold in this case.
Normal Q-Q Plot
- The plot below suggests there are no major deviations from normality.
- It would be safe to assume the residuals are approximately normally distributed.

par(mfrow=c(1,2))
plot(model, which = 1)
plot(model, which = 2)

Regression Model Summary Cont..

Scale-Location
- The red line is close to flat, which strengthens the assumption of homoscedasticity.
- The square root of the standardised residuals shows an almost constant spread across fitted values.
Residual vs Leverage
- Any cases beyond the 0.5 Cook's distance band are influential.
- This data doesn’t appear to have any influential cases.

par(mfrow=c(1,2))
plot(model, which = 3)
plot(model, which = 5)
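
The bands drawn on the residuals vs. leverage plot are Cook's distance contours, so the visual check can be backed by a numeric one using `cooks.distance()`. This is a minimal sketch on invented data (`toy` is hypothetical); the 0.5 threshold mirrors the band mentioned above.

```r
# Hypothetical toy data (values are invented)
toy <- data.frame(
  sanitation = c(20, 35, 50, 62, 75, 90, 98),
  LifeExp    = c(55, 61, 64, 70, 73, 80, 81)
)

fit <- lm(LifeExp ~ sanitation, data = toy)

# One Cook's distance per observation: how much the fit changes
# when that observation is left out
cd <- cooks.distance(fit)

# Observations beyond the 0.5 band, if any
influential <- which(cd > 0.5)

print(round(cd, 3))
print(influential)
```

An empty `influential` vector corresponds to a plot in which no point falls outside the 0.5 contour.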

Model Validation

Model validation using validation set

validation$LifeExp.Predicted <- predict(model, newdata = validation)
valid.error <- validation$LifeExp - validation$LifeExp.Predicted
valid.correlation <- round(cor(validation$LifeExp, validation$LifeExp.Predicted),4)
# Absolute mean error: the signed errors are averaged first, so this measures bias
valid.abs.mean.error <- round(abs(mean(valid.error)),4)
c(correlation = valid.correlation, abs.mean.error = valid.abs.mean.error)

##    correlation abs.mean.error 
##         0.8795         0.2373

Model validation using training set

training$LifeExp.Predicted <- predict(model, newdata = training)
train.error <- training$LifeExp - training$LifeExp.Predicted
train.correlation <- round(cor(training$LifeExp,training$LifeExp.Predicted),2)
# For a least-squares fit with an intercept, the training errors average to zero
train.abs.mean.error <- abs(mean(train.error))
c(correlation = train.correlation, abs.mean.error = train.abs.mean.error)

##    correlation abs.mean.error 
##   8.500000e-01   4.085336e-15

The predicted values correlate well with the observed values in both the validation set (0.88) and the training set (0.85).
This shows that the model is good at predicting Life Expectancy based on Sanitation.
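
The root mean squared error (RMSE) squares each error before averaging, which is different from squaring the average error: the latter only yields the absolute mean error and is zero whenever positive and negative errors cancel. A minimal sketch, where the helper `rmse()` and the small vectors are hypothetical and the last line uses the residual standard error and degrees of freedom from the model summary:

```r
# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Invented values whose errors are -1, 0 and 1
actual    <- c(1, 2, 3)
predicted <- c(2, 2, 2)

rmse_value     <- rmse(actual, predicted)            # squares, then averages
abs_mean_error <- sqrt(mean(actual - predicted)^2)   # averages, then squares

# Training RMSE implied by the summary output:
# residual standard error 4.086 on 158 degrees of freedom, n = 160
train_rmse_from_summary <- 4.086 * sqrt(158 / 160)

print(c(rmse = rmse_value, abs_mean_error = abs_mean_error))
print(train_rmse_from_summary)
```

The residual standard error divides the residual sum of squares by n - 2 rather than n, which is why it is rescaled by sqrt(158/160) to obtain the training RMSE of roughly 4.06 years.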

Residuals

The residuals reflect how far an observed value deviates from the value predicted by the line of best fit.

training$residuals <- residuals(model)
ggplot(training, aes(x = sanitation, y = LifeExp)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  # vertical segment from each observation to its fitted value
  geom_segment(aes(xend = sanitation, yend = LifeExp.Predicted), alpha = .2) +
  geom_point(aes(color = abs(residuals))) +
  scale_color_continuous(low = "#2ECC71", high = "#E74C3C") +
  guides(color = "none") +  # hide the colour legend
  geom_point(aes(y = LifeExp.Predicted), shape = 1) +
  theme_bw()

Filled dots are actual values and hollow circles are predicted values.
Each data point has a vertical line that connects it to the line of best fit. The length of each line is the size of the residual.
The larger the residual, the redder the point; the smaller the residual, the greener the point.

Discussion

Strengths
- There is a very good correlation between the predictor and dependent variables of the model.
- The R-squared value obtained for the model is good, which suggests that the model predicts life expectancy well from the sanitation value.

Limitations
- The dataset had 35 of 264 observations with missing values. These observations were removed from further analysis, which could have led to losing major insights from the dataset.
- Only 2014 data is considered. Life expectancy and its dependencies may have changed over time.
- This study considers only a few variables related to Life Expectancy. There may be many other variables with a greater effect on Life Expectancy.
- Water was removed because of its collinearity with Sanitation rather than being modelled alongside it. Since water was also highly correlated with Life Expectancy, this may have decreased the model accuracy.

Findings
- Of the 4 variables considered for analysis, Sanitation has the major effect in determining the life expectancy of people in different countries.
- The regression equation is found to be: LifeExp = 54.22231 + 0.23432*Sanitation
- According to the conducted study, the correlation between Life Expectancy and Sanitation is 0.86.
- About 72% of the variability in Life Expectancy is explained by Sanitation.
- The regression model is statistically significant.
- The residuals are found to be approximately normally distributed using the Q-Q plot.

This analysis suggests that, by improving the sanitation facilities in a country, the life expectancy of its people can be improved to a great extent.

MATH1324 Assignment 4

Life Expectancy Prediction