Amal Joy - S3644794, Nupura Sanjay Swale - S3639703, Soumyashree Giri - S3640845
Last updated: 04 June, 2017
This analysis has the following objectives:
Collecting life expectancy data, and data on the other variables that affect life expectancy, from a publicly available and reliable source such as World Bank data.
Checking the correlation between life expectancy and the other variables considered for the analysis.
Creating a linear regression model using data downloaded from the World Bank.
These data are collected and derived from sources such as:
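# Packages used below: psych (pairs.panels), dplyr (%>%, summarise), ggplot2, caTools (sample.split)
library(psych)
library(dplyr)
library(ggplot2)
library(caTools)
setwd("~/Desktop/Stats Assignment 4")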
Lifeexpectancy <- read.csv(file="Lifeexpectancyatbirth.csv", skip=4, header=TRUE)
gdpp <- read.csv(file="GDP.csv", skip=4, header=TRUE)
Improvedsanitationfacilities <- read.csv(file="Improvedsanitationfacilities.csv", skip=4, header=TRUE)
vaccinations <- read.csv(file="ImmunizationDPT.csv", skip=4, header=TRUE)
watersource <- read.csv(file="ImprovedWatersource.csv", skip=4, header=TRUE)
# Loading variables into data frame
HealthData <- data.frame(Country.Code=Lifeexpectancy$Country.Code, Country.Name=Lifeexpectancy$Country.Name,
LifeExp=Lifeexpectancy$X2014,GDP=gdpp$X2014,sanitation=Improvedsanitationfacilities$X2014,
Vaccine=vaccinations$X2014,water=watersource$X2014)
# Data cleaning: remove observations with any NAs
HealthData_cleaned <- subset(HealthData,
!(is.na(HealthData$Country.Code)|
is.na(HealthData$Country.Name)|
is.na(HealthData$GDP) |
is.na(HealthData$LifeExp) |
is.na(HealthData$sanitation) |
is.na(HealthData$water) |
is.na(HealthData$Vaccine)))
# Scatter-plot matrix of the numeric variables (pairs.panels comes from the psych package)
pairs.panels(subset(HealthData_cleaned, select = -c(Country.Name, Country.Code)), hist.col = "#3498DB")
# Loading variables into data frame
HealthData <- data.frame(Country.Code=Lifeexpectancy$Country.Code, Country.Name=Lifeexpectancy$Country.Name,
LifeExp=Lifeexpectancy$X2014,sanitation=Improvedsanitationfacilities$X2014)
# Data cleaning : Remove observations with any NAs,
HealthData_cleaned <- subset(HealthData,
!(is.na(HealthData$Country.Code)|
is.na(HealthData$Country.Name)|
is.na(HealthData$LifeExp) |
is.na(HealthData$sanitation)))
# Summary statistics for life expectancy
HealthData_cleaned %>% summarise(Min = min(LifeExp,na.rm = TRUE),
Q1 = quantile(LifeExp,probs = .25,na.rm = TRUE),
Median = median(LifeExp, na.rm = TRUE),
Q3 = quantile(LifeExp,probs = .75,na.rm = TRUE),
Max = max(LifeExp,na.rm = TRUE),
Mean = mean(LifeExp, na.rm = TRUE),
SD = sd(LifeExp, na.rm = TRUE),
n = n(),
Missing = sum(is.na(LifeExp)))
# Summary statistics for sanitation
HealthData_cleaned %>% summarise(Min = min(sanitation,na.rm = TRUE),
Q1 = quantile(sanitation,probs = .25,na.rm = TRUE),
Median = median(sanitation, na.rm = TRUE),
Q3 = quantile(sanitation,probs = .75,na.rm = TRUE),
Max = max(sanitation,na.rm = TRUE),
Mean = mean(sanitation, na.rm = TRUE),
SD = sd(sanitation, na.rm = TRUE),
n = n(),
Missing = sum(is.na(sanitation)))
# Scatter plot of life expectancy against sanitation with a fitted regression line
ggplot(HealthData_cleaned, aes(x = sanitation, y = LifeExp)) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
guides(alpha = FALSE)+
geom_point() + theme_bw()
par(mfrow=c(1,2))
#Histograms
hist(HealthData_cleaned$LifeExp,main= " Histogram for Life expectancy")
hist(HealthData_cleaned$sanitation,main= " Histogram for Improvement in sanitation")
The assumptions for the regression model are:
Independence
- The observations are assumed to be independent.
- No country is counted twice in the analysis.
- The measurements for one country do not depend on the measurements for any other country.
Linearity
- The relationship between sanitation and life expectancy is assumed to be linear, as the scatter plot on the earlier slide suggests.
Normality of residuals
- The residuals are assumed to be approximately normal. With more than 30 observations, inference on the coefficients is also reasonably robust to mild departures from normality.
- This assumption is checked after fitting the model using the Q-Q plot.
Homoscedasticity
- The variance of the residuals is assumed to be constant across the fitted values.
# define the training and validation samples from main data source
#Requires caTools
HealthData_cleaned$Split <- sample.split(HealthData_cleaned$LifeExp, SplitRatio = .7)
training <- subset(HealthData_cleaned, HealthData_cleaned$Split==TRUE, select = -Split)
validation <- subset(HealthData_cleaned, HealthData_cleaned$Split==FALSE, select = -Split)
# Simple linear regression model with life expectancy as the target variable
model <- lm(data = training, LifeExp ~ sanitation)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ sanitation, data = training)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18.7608  -2.0168   0.3339   2.8363   8.0749
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22231    0.89146   60.82   <2e-16 ***
## sanitation   0.23432    0.01163   20.14   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.086 on 158 degrees of freedom
## Multiple R-squared: 0.7197, Adjusted R-squared: 0.718
## F-statistic: 405.8 on 1 and 158 DF, p-value: < 2.2e-16
model %>% confint()
##                  2.5 %     97.5 %
## (Intercept) 52.4615998 55.9830172
## sanitation   0.2113424  0.2572927
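The training sample is drawn at random, so the coefficients change from run to run unless the random seed is fixed. A minimal sketch of the idea, using made-up stand-in data and base R's sample() in place of sample.split (the seed values, the toy data frame and the split_fit helper are all hypothetical and not part of the original analysis):

```r
# Hypothetical stand-in data; the real analysis uses HealthData_cleaned
set.seed(2017)  # arbitrary seed, chosen only for illustration
toy <- data.frame(sanitation = runif(100, 10, 100))
toy$LifeExp <- 54 + 0.23 * toy$sanitation + rnorm(100, sd = 4)

# Draw a 70% training sample with a given seed and fit the regression on it
split_fit <- function(data, seed) {
  set.seed(seed)
  in_train <- sample(nrow(data), size = round(0.7 * nrow(data)))
  coef(lm(LifeExp ~ sanitation, data = data[in_train, ]))
}

fit_a <- split_fit(toy, seed = 1)
fit_b <- split_fit(toy, seed = 1)

# Same seed gives the same split and therefore the same coefficients
identical(fit_a, fit_b)
```

Calling set.seed() immediately before the split in the real analysis should make the reported coefficients repeatable, assuming sample.split draws from R's standard random number generator.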
Since the model is statistically significant, the parameters of the linear regression model can be interpreted.
Hypotheses for the intercept
- H0: Intercept = 0
- HA: Intercept ≠ 0
Hypotheses for the slope
- H0: Slope = 0
- HA: Slope ≠ 0
- Neither 95% confidence interval contains 0, the value stated under H0. The decision is therefore to reject the null hypothesis for both the intercept and the slope.
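As a worked check, each interval is the estimate plus or minus a critical t value times its standard error. The critical value of about 1.975 for 158 degrees of freedom comes from standard t tables and is not printed in the output, so treat it as an assumption:

```latex
\begin{align*}
\text{slope: } & 0.23432 \pm 1.975 \times 0.01163 \approx 0.23432 \pm 0.02297 = (0.2114,\ 0.2573) \\
\text{intercept: } & 54.22231 \pm 1.975 \times 0.89146 \approx 54.22231 \pm 1.76063 = (52.4617,\ 55.9829)
\end{align*}
```

Both intervals agree with the confint() output to about three decimal places, and neither includes 0.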
Residual vs. Fitted Plot
- Relationship between fitted values and residuals is almost flat.
- The assumption of homoscedasticity, or constant variance, appears to hold in this case.
Normal Q-Q Plot
- The plot below suggests there are no major deviations from normality.
- It would be safe to assume the residuals are approximately normally distributed.
par(mfrow=c(1,2))
plot(model, which = 1)
plot(model, which = 2)
Scale-Location
- The red line is close to flat, which strengthens the assumption of homoscedasticity.
- The square root of the absolute standardised residuals is almost constant across the fitted values.
Residual vs Leverage
- Cases falling beyond the 0.5 Cook's distance contour are considered influential.
- This data does not appear to contain any influential cases.
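The same check can be made numerically with cooks.distance(). A minimal sketch on made-up data with one deliberately planted outlier (the toy values and the 0.5 cut-off applied here are illustrative assumptions, not results from the World Bank data):

```r
# Made-up data: 20 well-behaved points plus one planted outlier at x = 60
x <- c(1:20, 60)
y <- c(2 * (1:20) + rep(c(-0.5, 0.5), 10), 0)
toy_fit <- lm(y ~ x)

# Cook's distance per observation; flag those beyond the 0.5 contour
cd <- cooks.distance(toy_fit)
influential <- which(cd > 0.5)
influential
```

Applied to the fitted model in this analysis, which(cooks.distance(model) > 0.5) would be expected to return an empty result if no case is influential.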
par(mfrow=c(1,2))
plot(model, which = 3)
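plot(model, which = 5)
# Predict life expectancy for the validation sample
validation$LifeExp.Predicted <- predict(model,
subset(validation,
select = -c(Country.Code, LifeExp)))
valid.correlation <- round(cor(validation$LifeExp, validation$LifeExp.Predicted),4)
# RMSE: square each error before averaging, then take the square root
valid.RMSE <- round(sqrt(mean((validation$LifeExp-validation$LifeExp.Predicted)^2)),4)
c(correlation = valid.correlation, RMSE = valid.RMSE)
## correlation RMSE
## 0.8795 0.2373
The RMSE of 0.2373 printed above came from sqrt(mean(error)^2), which is the absolute mean error rather than the root mean squared error, so it understates the typical prediction error.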
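RMSE squares each error before averaging. A minimal sketch with made-up error values (not taken from the World Bank data) showing how much the position of the square matters:

```r
# Made-up prediction errors (observed minus predicted), for illustration only
errors <- c(-2, 1, 1)

abs_mean_error <- sqrt(mean(errors)^2)  # squares the mean: |mean(errors)|
rmse <- sqrt(mean(errors^2))            # averages the squares: the RMSE

c(abs_mean_error = abs_mean_error, rmse = rmse)
```

Positive and negative errors cancel in the first expression, so it can be close to zero even when individual predictions are far off.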
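# Predict life expectancy for the training sample
training$LifeExp.Predicted <- predict(model,
subset(training,
select = -c(Country.Code, LifeExp)))
train.correlation <- round(cor(training$LifeExp,training$LifeExp.Predicted),2)
# RMSE: square each error before averaging, then take the square root
train.RMSE <- sqrt(mean((training$LifeExp-training$LifeExp.Predicted)^2))
c(correlation = train.correlation, RMSE = train.RMSE)
## correlation RMSE
## 8.500000e-01 4.085336e-15
The RMSE of 4.085336e-15 printed above came from sqrt(mean(residual)^2). Least-squares residuals from a model with an intercept average to zero, so that value is floating-point noise and not the training RMSE.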
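The training RMSE can be recovered from the residual standard error in the model summary. This worked step assumes n = 160 training rows, inferred from the 158 residual degrees of freedom plus the 2 estimated coefficients:

```latex
\begin{align*}
\text{RSE} &= \sqrt{\frac{\text{RSS}}{n-2}} = 4.086, \qquad n = 160 \\
\text{RMSE}_{\text{train}} &= \sqrt{\frac{\text{RSS}}{n}}
  = \text{RSE} \times \sqrt{\frac{n-2}{n}}
  = 4.086 \times \sqrt{\frac{158}{160}} \approx 4.06
\end{align*}
```

On this basis the model's typical error on the training sample is about 4 years of life expectancy.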
The residuals reflect how far an observed value deviates from the value predicted by the line of best fit.
training$residuals <- residuals(model)
ggplot(training, aes(x = sanitation, y = LifeExp)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = sanitation, yend = LifeExp.Predicted), alpha = .2) +
geom_point(aes(color = abs(residuals))) +
scale_color_continuous(low = "#2ECC71", high = "#E74C3C") +
guides(color = FALSE)+
geom_point(aes(y = LifeExp.Predicted), shape = 1)+
theme_bw()
Strengths
- There is a strong correlation between the predictor and the dependent variable of the model.
- The slope coefficient is statistically significant and the model explains about 72% of the variance in life expectancy on the training sample (R-squared = 0.7197), so sanitation is a useful predictor of life expectancy.
Limitations
- The dataset had 35 of 264 observations with missing values. These observations were removed before further analysis, which could have led to losing important insights from the dataset.
- Only 2014 data is considered. Life expectancy and its dependencies may have changed over time.
- This study considers only a few variables related to life expectancy. There may be many other variables with a greater effect on life expectancy.
- Multicollinearity between water and sanitation has not been considered. This may have reduced the model's accuracy, since water was also highly correlated with life expectancy.
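One common way to quantify multicollinearity is the variance inflation factor (VIF), obtained by regressing one predictor on the others. A minimal sketch on simulated data (the variables below are made-up stand-ins for the World Bank columns, and the seed and coefficients are arbitrary):

```r
# Simulated stand-ins for the two predictors, deliberately correlated
set.seed(42)  # arbitrary seed so the illustration is repeatable
n <- 200
sanitation <- runif(n, 10, 100)
water <- 0.8 * sanitation + rnorm(n, sd = 8)

# VIF for sanitation: regress it on the other predictor and use 1 / (1 - R^2)
r2 <- summary(lm(sanitation ~ water))$r.squared
vif_sanitation <- 1 / (1 - r2)
vif_sanitation
```

A VIF near 1 indicates little collinearity. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb and not a formal test.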
Findings
- Out of the 4 variables considered for analysis, sanitation has the major effect in determining the life expectancy of people in different countries.
- The regression equation, taken from the model summary shown earlier, is: LifeExp = 54.22231 + 0.23432*Sanitation
- The correlation between life expectancy and sanitation is about 0.85 on the training sample.
- About 72% of the variability in life expectancy is explained by sanitation (R-squared = 0.7197).
- The regression model is statistically significant.
- The residuals are found to be approximately normally distributed using the Q-Q plot.
This analysis suggests that improving the sanitation facilities in a country can improve the life expectancy of its people to a great extent.