MATH1324 Assignment 4

Life Expectancy Prediction

Amal Joy - S3644794, Nupura Sanjay Swale - S3639703, Soumyashree Giri - S3640845

Last updated: 04 June, 2017

Introduction

Problem Statement

This analysis will try to answer the following questions:

Data

These data are collected and derived from sources such as:

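# Packages used in this analysis
library(dplyr)    # %>% and summarise()
library(ggplot2)  # ggplot()
library(psych)    # pairs.panels()
library(caTools)  # sample.split()
setwd("~/Desktop/Stats Assignment 4")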
Lifeexpectancy <- read.csv(file="Lifeexpectancyatbirth.csv", skip=4, header=TRUE) 
gdpp <- read.csv(file="GDP.csv", skip=4, header=TRUE) 
Improvedsanitationfacilities <- read.csv(file="Improvedsanitationfacilities.csv", skip=4, header=TRUE) 
vaccinations <- read.csv(file="ImmunizationDPT.csv", skip=4, header=TRUE)
watersource <- read.csv(file="ImprovedWatersource.csv", skip=4, header=TRUE)

Data Preparation and Exploration

# Load the 2014 value of each indicator into one data frame
# (columns are combined by row position, so all five files must list countries in the same order)
HealthData <- data.frame(Country.Code = Lifeexpectancy$Country.Code,
                         Country.Name = Lifeexpectancy$Country.Name,
                         LifeExp = Lifeexpectancy$X2014,
                         GDP = gdpp$X2014,
                         sanitation = Improvedsanitationfacilities$X2014,
                         Vaccine = vaccinations$X2014,
                         water = watersource$X2014)
# Data cleaning: remove observations with an NA in any column
HealthData_cleaned <- HealthData[complete.cases(HealthData), ]
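
The data frame above is built by row position, which only works if every file lists the countries in the same order. A minimal sketch of a safer alternative is to join on Country.Code. The two small tables below are made-up stand-ins for the real files, and the names life, sanit and merged are hypothetical:

```r
# Toy stand-ins for two indicator tables (values invented for illustration only)
life <- data.frame(Country.Code = c("AAA", "BBB", "CCC"),
                   X2014 = c(70.1, 65.3, NA))
sanit <- data.frame(Country.Code = c("CCC", "AAA", "BBB"),  # different row order
                    X2014 = c(55.0, 90.2, 40.7))

# Join on the country code instead of relying on row position
merged <- merge(life, sanit, by = "Country.Code",
                suffixes = c(".life", ".sanit"))
names(merged) <- c("Country.Code", "LifeExp", "sanitation")

# Keep only rows with no missing values
merged_clean <- merged[complete.cases(merged), ]
print(merged_clean)
```

Country CCC is dropped because its life expectancy is missing, and the remaining rows are matched correctly even though the two tables are ordered differently.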

Visualisation

pairs.panels(subset(HealthData_cleaned, select = -c(Country.Name, Country.Code)), hist.col = "#3498DB")

Visualisation Cont..

# Rebuild the data frame with only life expectancy and sanitation
HealthData <- data.frame(Country.Code = Lifeexpectancy$Country.Code,
                         Country.Name = Lifeexpectancy$Country.Name,
                         LifeExp = Lifeexpectancy$X2014,
                         sanitation = Improvedsanitationfacilities$X2014)
# Data cleaning: remove observations with an NA in any column
HealthData_cleaned <- HealthData[complete.cases(HealthData), ]

Descriptive Statistics

HealthData_cleaned %>% summarise(Min = min(LifeExp,na.rm = TRUE),
                                         Q1 = quantile(LifeExp,probs = .25,na.rm = TRUE),
                                         Median = median(LifeExp, na.rm = TRUE),
                                         Q3 = quantile(LifeExp,probs = .75,na.rm = TRUE),
                                         Max = max(LifeExp,na.rm = TRUE),
                                         Mean = mean(LifeExp, na.rm = TRUE),
                                         SD = sd(LifeExp, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(LifeExp)))
HealthData_cleaned %>% summarise(Min = min(sanitation,na.rm = TRUE),
                                         Q1 = quantile(sanitation,probs = .25,na.rm = TRUE),
                                         Median = median(sanitation, na.rm = TRUE),
                                         Q3 = quantile(sanitation,probs = .75,na.rm = TRUE),
                                         Max = max(sanitation,na.rm = TRUE),
                                         Mean = mean(sanitation, na.rm = TRUE),
                                         SD = sd(sanitation, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(sanitation)))

Scatter Plot

# Scatter plot of life expectancy against sanitation with the fitted regression line
ggplot(HealthData_cleaned, aes(x = sanitation, y = LifeExp)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_point() +
  theme_bw()

Histogram

# Histograms of the two variables, side by side
par(mfrow = c(1, 2))
hist(HealthData_cleaned$LifeExp, main = "Histogram of life expectancy")
hist(HealthData_cleaned$sanitation, main = "Histogram of improved sanitation")

Assumptions

The assumptions for the regression model are:

Independence
- The observations are assumed to be independent.
- No country is included more than once in the analysis.
- The measurements for one country do not depend on the measurements for any other country.

Linearity
- The relationship between sanitation and life expectancy is assumed to be linear, as suggested by the scatter plot on the earlier slide.

Normality of residuals
- The residuals are assumed to be approximately normally distributed. With more than 30 observations, inference on the coefficients is also reasonably robust to mild departures from normality.
- This assumption is checked after fitting the model using the Q-Q plot.

Homoscedasticity
- The variance of the residuals is assumed to be constant across the fitted values.
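
The residual assumptions can also be checked numerically, alongside the diagnostic plots. The sketch below uses simulated data rather than the assignment data; the names x, y, fit and spread are hypothetical, and the correlation between absolute residuals and fitted values is only a rough, informal indicator of changing variance:

```r
set.seed(1)

# Simulated data with a linear trend and constant-variance normal noise
x <- runif(100, min = 0, max = 100)
y <- 54 + 0.23 * x + rnorm(100, sd = 4)

fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
sw <- shapiro.test(res)
print(sw$p.value)

# Rough homoscedasticity check: absolute residuals should not trend with fitted values
spread <- cor(abs(res), fitted(fit))
print(spread)
```

A large p-value and a correlation near zero are consistent with the assumptions, but neither replaces looking at the plots.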

Linear Regression Model

# define the training and validation samples from main data source

# Requires the caTools package for sample.split()

HealthData_cleaned$Split <- sample.split(HealthData_cleaned$LifeExp, SplitRatio = .7)

training <- subset(HealthData_cleaned, HealthData_cleaned$Split==TRUE, select = -Split)

validation <- subset(HealthData_cleaned, HealthData_cleaned$Split==FALSE, select = -Split)
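
For a reproducible split without caTools, a simple random 70/30 split in base R is one option. This is a different technique from sample.split(), and the data frame toy and the seed value are assumptions made for illustration:

```r
set.seed(42)  # fix the random seed so the split can be reproduced

# Toy data standing in for HealthData_cleaned
toy <- data.frame(id = 1:10, LifeExp = seq(60, 78, by = 2))

# Draw 70% of the row indices at random for training
n_train <- floor(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

training_toy <- toy[train_idx, ]
validation_toy <- toy[-train_idx, ]

print(nrow(training_toy))
print(nrow(validation_toy))
```

Setting the seed before the split means the reported coefficients and validation metrics can be regenerated exactly.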

Linear Regression Model Cont..

# Simple linear regression model with LifeExp as the target variable and sanitation as the only predictor
model <- lm(data = training, LifeExp ~ sanitation)
summary(model)
## 
## Call:
## lm(formula = LifeExp ~ sanitation, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7608  -2.0168   0.3339   2.8363   8.0749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.22231    0.89146   60.82   <2e-16 ***
## sanitation   0.23432    0.01163   20.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.086 on 158 degrees of freedom
## Multiple R-squared:  0.7197, Adjusted R-squared:  0.718 
## F-statistic: 405.8 on 1 and 158 DF,  p-value: < 2.2e-16
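
The coefficients in the summary output can be read as a prediction rule: each additional percentage point of sanitation coverage is associated with about 0.23 more years of life expectancy. A small sketch, with the coefficients copied by hand from the output above (predict_life_exp is a hypothetical helper, not part of the original analysis):

```r
# Coefficients copied from the summary(model) output
intercept <- 54.22231
slope <- 0.23432

# Predicted life expectancy for a given sanitation coverage (percent)
predict_life_exp <- function(sanitation) {
  intercept + slope * sanitation
}

print(predict_life_exp(c(50, 100)))
```

In practice predict(model, newdata = ...) does the same job without copying numbers by hand.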

Hypothesis testing for model parameters

model %>% confint()
##                  2.5 %     97.5 %
## (Intercept) 52.4615998 55.9830172
## sanitation   0.2113424  0.2572927

Since the model is statistically significant, the parameters of the linear regression model can be interpreted.
Hypotheses for the intercept
- H0: Intercept = 0
- HA: Intercept ≠ 0
Hypotheses for the slope
- H0: Slope = 0
- HA: Slope ≠ 0
Decision
- Neither 95% confidence interval (intercept: 52.46 to 55.98; slope: 0.211 to 0.257) contains 0, so both null hypotheses are rejected.
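
The confidence intervals reported by confint() can be reproduced from the estimates and standard errors in the model summary, as estimate ± t critical value × standard error with 158 degrees of freedom. A minimal sketch, with the numbers copied by hand from the summary output (so small rounding differences are expected):

```r
# Estimates and standard errors copied from summary(model)
est <- c(Intercept = 54.22231, sanitation = 0.23432)
se <- c(Intercept = 0.89146, sanitation = 0.01163)
df <- 158

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df)

# t statistics and two-sided p-values for H0: parameter = 0
t_stat <- est / se
p_val <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# 95% confidence intervals
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
print(p_val)
```

Both intervals lie entirely above zero, which is the same conclusion as rejecting H0 at the 5% level.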

Regression Model Summary

Residuals vs. Fitted Plot
- The trend line through the residuals is almost flat across the fitted values.
- The assumption of homoscedasticity, or constant variance, appears to hold in this case.
Normal Q-Q Plot
- The plot suggests there are no major deviations from normality.
- It is reasonable to assume the residuals are approximately normally distributed.

par(mfrow=c(1,2))
plot(model, which = 1)
plot(model, which = 2)

Regression Model Summary Cont..

Scale-Location
- The red line is close to flat, which supports the assumption of homoscedasticity.
- The spread of the square root of the standardised residuals is almost constant across the fitted values.
Residuals vs Leverage
- Cases that fall outside the 0.5 Cook's distance band are considered influential.
- This data doesn't appear to have any influential cases.

par(mfrow=c(1,2))
plot(model, which = 3)
plot(model, which = 5)

Model Validation

Model validation using validation set

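validation$LifeExp.Predicted <- predict(model, newdata = validation)
valid.error <- validation$LifeExp - validation$LifeExp.Predicted
valid.correlation <- round(cor(validation$LifeExp, validation$LifeExp.Predicted), 4)
# sqrt(mean(error)^2) is the absolute mean error (bias), not the RMSE
valid.abs.mean.error <- round(abs(mean(valid.error)), 4)
c(correlation = valid.correlation, abs.mean.error = valid.abs.mean.error)
##    correlation abs.mean.error 
##         0.8795         0.2373
# RMSE squares each error before averaging
valid.RMSE <- round(sqrt(mean(valid.error^2)), 4)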

Model validation using training set

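training$LifeExp.Predicted <- predict(model, newdata = training)
train.error <- training$LifeExp - training$LifeExp.Predicted
train.correlation <- round(cor(training$LifeExp, training$LifeExp.Predicted), 2)
# sqrt(mean(error)^2) is the absolute mean error, which is essentially zero for a least-squares fit on its own training data
train.abs.mean.error <- abs(mean(train.error))
c(correlation = train.correlation, abs.mean.error = train.abs.mean.error)
##    correlation abs.mean.error 
##   8.500000e-01   4.085336e-15
# RMSE squares each error before averaging
train.RMSE <- sqrt(mean(train.error^2))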
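
The placement of the square matters when computing RMSE. Squaring after averaging gives the absolute mean error, in which positive and negative errors cancel; squaring before averaging gives the RMSE. A small sketch with made-up numbers:

```r
# Made-up observed and predicted values
actual <- c(70, 72, 68, 74)
predicted <- c(72, 70, 70, 72)
err <- actual - predicted  # -2, 2, -2, 2

# Square AFTER averaging: errors cancel, giving the absolute mean error
abs_mean_error <- sqrt(mean(err)^2)

# Square BEFORE averaging: the root mean squared error
rmse <- sqrt(mean(err^2))

print(c(abs_mean_error = abs_mean_error, rmse = rmse))
```

Here every prediction is off by 2 years, so the RMSE is 2, while the absolute mean error is 0 because the errors cancel.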

Residuals

The residuals reflect how far an observed value deviates from the value predicted by the line of best fit.

training$residuals <- residuals(model)
# Observed points coloured by absolute residual, joined to their fitted values (open circles)
ggplot(training, aes(x = sanitation, y = LifeExp)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = sanitation, yend = LifeExp.Predicted), alpha = .2) +
  geom_point(aes(color = abs(residuals))) +
  scale_color_continuous(low = "#2ECC71", high = "#E74C3C") +
  guides(color = "none") +
  geom_point(aes(y = LifeExp.Predicted), shape = 1) +
  theme_bw()

Discussion

Strengths
- There is a strong positive correlation between the predictor (sanitation) and the dependent variable (life expectancy).
- The model's R-squared is high (0.72 on the training set), which suggests that sanitation predicts life expectancy well.

Limitations
- The dataset had 35 of 264 observations with missing values. These observations were removed before further analysis, which could have led to the loss of important insights.
- Only 2014 data is considered. Life expectancy and the factors it depends on may have changed over time.
- This study considers only a few variables related to life expectancy. There may be many other variables that have a greater effect on life expectancy.
- Multicollinearity between water and sanitation was not considered. Water was also highly correlated with life expectancy, so leaving it out may have reduced the model's accuracy.

Findings
- Of the four variables considered for analysis, sanitation has the major effect in determining the life expectancy of people in different countries.
- Based on the model summary output, the regression equation is: LifeExp = 54.22231 + 0.23432 * sanitation
- According to this study, the correlation between life expectancy and sanitation is 0.86.
- About 74% of the variability in life expectancy is explained by sanitation (0.86 squared is approximately 0.74); the model fitted on the training set gives an R-squared of 0.72.
- The regression model is statistically significant.
- The Q-Q plot indicates that the residuals are approximately normally distributed.

This analysis suggests that improving the sanitation facilities in a country could improve the life expectancy of its people to a great extent.
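
For a regression with a single predictor, R-squared is the square of the correlation between the predictor and the response, so the correlation and the share of variability explained can be converted into each other. A short worked sketch using the figures reported in this deck (the variable names are hypothetical):

```r
# Correlation reported in the findings
r_reported <- 0.86
r2_from_r <- r_reported^2
print(r2_from_r)  # share of variability explained, as a proportion

# R-squared reported by summary(model) on the training set
r2_train <- 0.7197
r_from_r2 <- sqrt(r2_train)  # slope is positive, so the correlation is positive
print(round(r_from_r2, 2))
```

The second value agrees with the training-set correlation of 0.85 reported under Model Validation.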

References