PUJA(S3795543)
Last updated: 25 October, 2018
-Many factors affect the human life span, such as demographic variables, income composition and mortality rates.
-It is a general notion that being healthy increases an individual’s life expectancy.
-The analysis aims to answer our question: “Does BMI affect life expectancy in humans?”
-Is there any relationship between life expectancy and BMI in humans?
-Statistical models and hypothesis tests on the model parameters provide the evidence to support or reject the problem statement.
In this assignment, data from 2000 to 2015 for 193 countries are considered for the analysis.
The dataset consists of 2938 observations and 22 variables.
The data is collected from: https://www.kaggle.com/kumarajarshi/life-expectancy-who
There are two important variables that we are going to take further:
-Life Expectancy
-BMI Index
The data is subset to select only Life expectancy and BMI
library(readr)
Life_expect <- read_csv("/Users/puja/Downloads/Life Expectancy Data.csv")
# Keep only column 4 (Life expectancy) and column 11 (BMI)
Life_expect <- Life_expect[, c(4, 11)]
head(Life_expect)
-There were 10 missing values in Life expectancy and 34 missing values in BMI; the affected rows were deleted.
-As the missing values make up less than 5% of the data, it is safe to exclude them.
## Life expectancy BMI
## 10 34
| | x |
|---|---|
| Life expectancy | 0 |
| BMI | 0 |
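The code for the missing-value check is not shown in the report. A minimal sketch of how such a check and removal could look in base R is given here; the data frame `toy` and its values are made up for illustration and are not the WHO data.

```r
# Toy data standing in for the two selected columns (values are invented).
toy <- data.frame(
  Life.expectancy = c(65.0, NA, 71.2, 80.1, 59.9),
  BMI             = c(19.1, 25.3, NA, 55.0, 18.6)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows that have at least one missing value.
share_missing <- mean(!complete.cases(toy))
print(share_missing)

# Keep only the rows where both variables are present.
toy_complete <- toy[complete.cases(toy), ]
print(colSums(is.na(toy_complete)))
```

Comparing `share_missing` with the 5% threshold mentioned above is one way to justify dropping the rows instead of imputing them.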
-The Mahalanobis distance is used to detect outliers, and a new data frame without outliers is created.
## [1] 2609 2
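The outlier-removal code itself is not included in the report. The following is only a sketch of one common way to filter on the Mahalanobis distance with base R; the toy data, the injected outlier and the chi-square cutoff at the 0.999 quantile are all assumptions, not the settings used for the result above.

```r
set.seed(1)

# Toy bivariate data (invented); the last row is a deliberately extreme point.
toy <- data.frame(
  Life.expectancy = c(rnorm(200, mean = 70, sd = 8), 5),
  BMI             = c(rnorm(200, mean = 38, sd = 15), 300)
)

# Squared Mahalanobis distance of every row from the multivariate centre.
d2 <- mahalanobis(toy, center = colMeans(toy), cov = cov(toy))

# Assumed cutoff: 0.999 quantile of a chi-square with df = number of variables.
cutoff <- qchisq(0.999, df = ncol(toy))

# Keep only the rows whose distance does not exceed the cutoff.
toy_filtered <- toy[d2 <= cutoff, ]
dim(toy_filtered)
```

A stricter or looser quantile changes how many rows are dropped, so the cutoff should be reported alongside the filtered dimensions.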
-The data frame without outliers and missing values is used to calculate summary statistics with the summarise_at() function.
library(dplyr)
# list() of formulas replaces funs(), which is deprecated since dplyr 0.8.0
Descriptive_statistics <- Life_expect_filtered %>%
  summarise_at(vars(Life.expectancy, BMI),
               list(Nos. = ~ n(),
                    Mean = ~ mean(., na.rm = TRUE),
                    SD = ~ sd(., na.rm = TRUE),
                    Median = ~ median(., na.rm = TRUE),
                    Max = ~ max(., na.rm = TRUE),
                    Min = ~ min(., na.rm = TRUE),
                    Q1 = ~ quantile(., probs = 0.25, na.rm = TRUE),
                    Q3 = ~ quantile(., probs = 0.75, na.rm = TRUE)))
knitr::kable(Descriptive_statistics)

| Life.expectancy_Nos. | BMI_Nos. | Life.expectancy_Mean | BMI_Mean | Life.expectancy_SD | BMI_SD | Life.expectancy_Median | BMI_Median | Life.expectancy_Max | BMI_Max | Life.expectancy_Min | BMI_Min | Life.expectancy_Q1 | BMI_Q1 | Life.expectancy_Q3 | BMI_Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2609 | 2609 | 69.122 | 37.88724 | 9.557769 | 19.9721 | 71.9 | 41.7 | 89 | 77.6 | 36.3 | 1 | 63 | 19.2 | 75.7 | 56.1 |
Test for normal distribution of the variables:
The sample size is greater than 30, so by the Central Limit Theorem (CLT) we can move forward with testing.
## [1] 127 1305
## [1] 378 379
A linear regression model is fitted to check whether the data exhibit a relationship between the two variables “Life Expectancy” and “BMI”.
Linear Regression : Assumptions
Independence - The observations are treated as independent, as they were measured independently for different countries.
Linearity - This can be verified from the scatter plot (Sqrt(Life Expectancy) vs. Sqrt(BMI))
Normality of residuals - (after model fitting)
Homoscedasticity - (after model fitting)
Linear Regression Model - Linearity
-From the raw data plot we can see that the relationship is not linear.
-To make the relationship linear, a transformation has to be applied; the square root transformation gave the best linear relationship between the two variables.
-The relationship shows a positive trend.
par(mfrow = c(1,2))
plot(Life_expect_filtered$BMI, Life_expect_filtered$Life.expectancy,
main = "Raw data",
xlab = "BMI INDEX",
ylab = "Life expectancy")
plot(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy),
main = "Square root transformation",
xlab = "Square root(BMI INDEX)",
ylab = "Square root(Life expectancy)")
-Linear Regression : Line of Best Fit
-The line of best fit is obtained with the lm() function.
-In the model summary below, R2 is 0.2603, which indicates that 26.03% of the variance in the response variable is explained by a linear relationship with the predictor. The lm() call below uses sqrt(BMI) as the response and sqrt(Life expectancy) as the predictor; in a simple linear regression, R2 is the same in either direction.
fittingmodel <- lm(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
fittingmodel %>% summary()
##
## Call:
## lm(formula = sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6554 -0.6508 0.5581 1.0450 4.4171
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -7.42914 0.44006 -16.88
## sqrt(Life_expect_filtered$Life.expectancy) 1.60339 0.05293 30.29
## Pr(>|t|)
## (Intercept) <2e-16 ***
## sqrt(Life_expect_filtered$Life.expectancy) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 2607 degrees of freedom
## Multiple R-squared: 0.2603, Adjusted R-squared: 0.2601
## F-statistic: 917.6 on 1 and 2607 DF, p-value: < 2.2e-16
# The formula puts sqrt(Life expectancy) on the x-axis and sqrt(BMI) on the y-axis
plot(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy),
main = "Line of best fit",
xlab = "Square root (Life Expectancy)",
ylab = "Square root (BMI)")
# Draws the fitted line (intercept -7.42914, slope 1.60339)
abline(fittingmodel, col = "red")
-The F-test for the above linear regression model has the following hypotheses:
-Null hypothesis: the data do not fit the linear regression model. H0: √BMI ≠ α+β√LE
-Alternate hypothesis: the data fit the linear regression model. HA: √BMI = α+β√LE
-The p-value obtained in the summary is less than 0.001, which is below the 0.05 level of significance, so we reject H0.
## [1] 5.857957e-173
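The call that produced this p-value is not shown. As a sketch, the overall F-test p-value of any fitted `lm` object can be recovered from its summary with `pf()`; the toy data and the object name `toy_model` are invented for illustration.

```r
# Toy data (invented) to obtain a fitted simple linear regression.
set.seed(42)
x <- runif(100, 1, 10)
y <- 2 + 0.5 * x + rnorm(100)
toy_model <- lm(y ~ x)

# summary()$fstatistic holds the F value and its two degrees of freedom.
f <- summary(toy_model)$fstatistic

# The upper-tail probability of the F distribution is the overall p-value.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(p_value)

# In a simple linear regression this matches the slope's t-test p-value.
p_slope <- coef(summary(toy_model))["x", "Pr(>|t|)"]
print(p_slope)
```

In this report the F-test p-value (5.857957e-173) and the slope p-value in the coefficient table (5.779917e-173) differ slightly; that would be consistent with a rounded F value such as 917.6 having been passed to `pf()`, although this cannot be confirmed from the output alone.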
We get intercept = -7.42914 and slope = 1.60339, and we can check their significance with the following hypotheses:
Intercept: Null hypothesis: H0: α=0. Alternate hypothesis: HA: α≠0.
Slope: Null hypothesis: H0: β=0. Alternate hypothesis: HA: β≠0.
Intercept: the t-statistic is -16.88215 and the p-value is < 0.001, so we conclude that the intercept is statistically significant (significance level = 0.05). Hence, we reject the null hypothesis H0: α=0.
Slope: the t-statistic is 30.29251 and the p-value is < 0.001, so we conclude that the slope is statistically significant (significance level = 0.05). Hence, we reject the null hypothesis H0: β=0.
## Estimate Std. Error t value
## (Intercept) -7.429140 0.4400590 -16.88215
## sqrt(Life_expect_filtered$Life.expectancy) 1.603386 0.0529301 30.29251
## Pr(>|t|)
## (Intercept) 9.162349e-61
## sqrt(Life_expect_filtered$Life.expectancy) 5.779917e-173
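Checking the remaining assumptions (normality of residuals and homoscedasticity) with diagnostic plots is an important part of linear regression model hypothesis testing.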
par(mfrow = c(2,2))
fittingmodel %>% plot(which=1)
fittingmodel %>% plot(which=2)
fittingmodel %>% plot(which=3)
fittingmodel %>% plot(which=5)
The correlation coefficient is 0.5102443, which indicates a positive but weak linear relationship between the variables.
r <- cor(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy), use = "complete.obs")
r
## [1] 0.5102443
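As a consistency check, in a simple linear regression the squared correlation equals R2, whichever variable is treated as the response. The sketch below shows this on invented toy data and then squares the correlation reported above.

```r
# Toy data (invented) for illustration only.
set.seed(7)
x <- runif(50, 1, 9)
y <- 1 + 0.8 * x + rnorm(50, sd = 2)

r_toy  <- cor(x, y)
r2_fwd <- summary(lm(y ~ x))$r.squared  # y as the response
r2_rev <- summary(lm(x ~ y))$r.squared  # x as the response

# Both comparisons should report TRUE.
print(all.equal(r_toy^2, r2_fwd))
print(all.equal(r_toy^2, r2_rev))

# Squaring the correlation reported above gives about 0.2603,
# in line with the Multiple R-squared in the model summary.
print(0.5102443^2)
```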
-The results show that there is a positive linear relationship between Life Expectancy and BMI Index.
-A higher BMI Index is associated with a longer life span in humans.
-The correlation coefficient shows that the relationship between the variables is weak.
-The linear regression model gives α = -7.429140 and β = 1.603386. With √BMI as the response in the fitted model, the equation becomes:
√BMI = -7.429140 + 1.603386 √LE
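To show how the fitted coefficients can be used, the sketch below plugs a life expectancy into the model as it appears in the summary output (sqrt(BMI) as the response, sqrt(Life expectancy) as the predictor) and squares the result to return to the BMI scale. The helper `fitted_bmi` is a hypothetical name introduced here, and squaring a fitted square root is a simple back-transformation, not an unbiased estimate of mean BMI.

```r
# Coefficients copied from the model summary output.
alpha <- -7.429140
beta  <-  1.603386

# Hypothetical helper: fitted BMI Index for a given life expectancy in years.
fitted_bmi <- function(life_expectancy) {
  sqrt_bmi <- alpha + beta * sqrt(life_expectancy)  # fitted value on the sqrt scale
  sqrt_bmi^2                                        # undo the square root
}

# Example input: the median life expectancy from the descriptive statistics.
print(fitted_bmi(71.9))
```

The function is only meaningful within the observed range of life expectancy (36.3 to 89 years in the descriptive statistics).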
-Life Expectancy (WHO). “Statistical Analysis on factors influencing Life Expectancy”.
-Module Notes.