MATH1324 ASSIGNMENT 3

DOES BMI AFFECT OVERALL LIFE SPAN IN HUMANS?

PUJA(S3795543)

Last updated: 25 October 2018

Introduction

-Many factors affect the life span of humans, such as demographic variables, income composition and mortality rates.

-It is a general notion that being healthy increases an individual’s life expectancy.

-The analysis aims to answer our question: “Does BMI affect life expectancy in humans?”

Problem Statement

-Is there a relationship between life expectancy and BMI in humans?

-A statistical model, together with hypothesis tests on the model parameters, is used to judge whether the data provide enough evidence of such a relationship.

Data

In this assignment, data for 193 countries from 2000 to 2015 are considered for the analysis.

The dataset consists of 2938 observations and 22 variables.

The data is collected from: https://www.kaggle.com/kumarajarshi/life-expectancy-who

There are two important variables that we are going to take further:

-Life Expectancy

-BMI Index

DATASET SUBSET

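The data are subset to keep only the Life expectancy and BMI columns.

library(readr)  # provides read_csv()
Life_expect <- read_csv("/Users/puja/Downloads/Life Expectancy Data.csv")
# Keep column 4 (Life expectancy) and column 11 (BMI)
Life_expect <- Life_expect[, c(4, 11)]
head(Life_expect)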

Scanning Missing Values

-There were 10 missing values in Life expectancy and 34 missing values in BMI; the rows containing them are deleted.

-As the missing values make up less than 5% of the data, it is safe to exclude them.

colSums(is.na(Life_expect))
## Life expectancy             BMI 
##              10              34
Life_expect <-na.omit(Life_expect)
knitr::kable(colSums(is.na(Life_expect)))
Variable          Missing values
Life expectancy   0
BMI               0
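
The "less than 5%" check can be made explicit by computing the share of missing values per column before dropping rows. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the WHO data); the same two calls could be applied to `Life_expect`.

```r
# Hypothetical toy data standing in for the two selected columns.
toy <- data.frame(
  Life.expectancy = c(65.0, NA, 71.2, 80.1, 59.3, 74.8, 68.0, 77.5, 62.4, 70.9),
  BMI             = c(19.1, 45.3, NA, 55.7, 17.2, 48.9, NA, 60.2, 18.5, 41.0)
)

# Percentage of missing values in each column.
pct_missing <- 100 * colMeans(is.na(toy))
print(pct_missing)

# Listwise deletion: na.omit() drops every row that has at least one NA.
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

Note that `na.omit()` removes a whole row when either variable is missing, so the number of rows lost can be larger than the missing count of any single column.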

Scanning outliers

-The Mahalanobis distance is used to detect outliers, and a new data frame without the outliers is created.

library(MVN)  # provides mvn()
results <- mvn(data = Life_expect, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

dim(results$newData)
## [1] 2609    2
# Name the columns as they are referenced in the rest of the analysis
Life_expect_filtered <- data.frame(Life.expectancy = results$newData[,1],
                                   BMI = results$newData[,2])
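
To illustrate the idea behind distance-based outlier screening, here is a minimal sketch that uses the classical Mahalanobis distance from base R's `mahalanobis()` with a chi-square cutoff. It is an assumption-laden stand-in, not a reproduction of what `mvn()` does internally (the MVN procedure may differ, for example in how the centre and covariance are estimated). The data, the planted extreme point and the 97.5% cutoff are all chosen for illustration.

```r
set.seed(1)  # reproducible synthetic data (not the WHO dataset)
toy <- data.frame(x = rnorm(200, mean = 70, sd = 9),
                  y = rnorm(200, mean = 38, sd = 20))
toy[1, ] <- c(150, 300)  # one planted, clearly extreme point

# Squared Mahalanobis distance of every row from the sample centre.
d2 <- mahalanobis(toy, center = colMeans(toy), cov = cov(toy))

# Flag rows beyond the 97.5% chi-square quantile (df = number of variables).
cutoff <- qchisq(0.975, df = ncol(toy))
is_outlier <- d2 > cutoff

# Keep only the rows that were not flagged.
toy_filtered <- toy[!is_outlier, ]
print(dim(toy_filtered))
```

Because the classical mean and covariance are themselves pulled towards extreme points, robust estimates are often preferred when many outliers are expected.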

Descriptive Statistics

-The data frame without outliers and missing values is used to calculate summary statistics with the summarise_at() function.

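library(dplyr)  # provides %>%, summarise_at(), vars(), funs() and n()
# funs() is deprecated from dplyr 0.8.0 onwards; newer code passes a list() of formulas instead
Descriptive_statistics <- Life_expect_filtered %>% summarise_at(vars(Life.expectancy,BMI), funs(Nos. = n(),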
                                            Mean = mean(.,na.rm = TRUE), 
                                            SD = sd(.,na.rm = TRUE), 
                                            Median = median(.,na.rm = TRUE),
                                            Max = max(.,na.rm = TRUE),
                                            Min = min(.,na.rm = TRUE),
                                            Q1 = quantile(.,probs = 0.25,na.rm = TRUE),
                                            Q3 = quantile(.,probs = 0.75,na.rm = TRUE) )) 
knitr::kable(Descriptive_statistics)
Statistic   Life.expectancy   BMI
Nos.        2609              2609
Mean        69.122            37.88724
SD          9.557769          19.9721
Median      71.9              41.7
Max         89                77.6
Min         36.3              1
Q1          63                19.2
Q3          75.7              56.1

Normality Test

The normality of each variable is checked with a normal Q-Q plot.

Because the sample size is far greater than 30, the Central Limit Theorem (CLT) allows us to move forward with testing.

library(car)  # provides qqPlot()
Life_expect_filtered$Life.expectancy %>%  qqPlot(dist="norm", main = "Normal Q-Q Plot - Life Expectancy")

## [1]  127 1305
Life_expect_filtered$BMI %>%  qqPlot(dist="norm", main = "Normal Q-Q Plot - BMI")

## [1] 378 379

Hypothesis Testing

A linear regression model is used to check whether the data exhibit a linear relationship between the two variables “Life Expectancy” and “BMI”.

Linear Regression : Assumptions

Independence - The observations are treated as independent because they have been measured independently for different countries.

Linearity - This can be verified from the scatter plot (Sqrt(Life Expectancy) vs. Sqrt(BMI))

Normality of residuals - (after model fitting)

Homoscedasticity - (after model fitting)

Linear Regression Model - Linearity

-From the raw data plot we can see that the relationship is not linear.

-To make the relationship linear, a transformation has to be applied; the square root transformation of both variables gave the best linear relationship between them.

-The relationship shows a positive trend.

par(mfrow = c(1,2))
plot(Life_expect_filtered$BMI, Life_expect_filtered$Life.expectancy,
     main = "Raw data",
     xlab = "BMI INDEX",
     ylab = "Life expectancy")
plot(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy),
     main = "Square root transformation",
     xlab = "Square root(BMI INDEX)",
     ylab = "Square root(Life expectancy)")

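Hypothesis Testing Cont.

-Linear Regression : Line of Best Fit

-The line of best fit is obtained with the lm() function.

-From the summary below, the R2 value is 0.2603, which indicates that 26.03% of the variability in the response variable can be explained by a linear relationship with the predictor variable. In the lm() call as written, sqrt(BMI) is the response and sqrt(Life expectancy) is the predictor; in a simple linear regression R2 is the same whichever of the two variables is taken as the response.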

fittingmodel <- lm(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
fittingmodel %>% summary()
## 
## Call:
## lm(formula = sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6554 -0.6508  0.5581  1.0450  4.4171 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                                -7.42914    0.44006  -16.88
## sqrt(Life_expect_filtered$Life.expectancy)  1.60339    0.05293   30.29
##                                            Pr(>|t|)    
## (Intercept)                                  <2e-16 ***
## sqrt(Life_expect_filtered$Life.expectancy)   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 2607 degrees of freedom
## Multiple R-squared:  0.2603, Adjusted R-squared:  0.2601 
## F-statistic: 917.6 on 1 and 2607 DF,  p-value: < 2.2e-16
# The formula y ~ x puts sqrt(Life expectancy) on the x-axis and sqrt(BMI) on the y-axis
plot(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy), 
     main = "Line of best fit",
     xlab = "Square root (Life Expectancy)",
     ylab = "Square root (BMI)")

abline(a = -7.42914, b= 1.60339, col = "red")
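
The direction of the formula matters for the coefficients but not for R2. The sketch below illustrates this on synthetic data (the data frame `toy`, its generating formula and the object names `fit_le` and `fit_bmi` are assumptions made for illustration, not the WHO data): it fits the regression once with life expectancy as the response and once the other way round.

```r
set.seed(42)  # synthetic stand-in data, not the WHO dataset
toy <- data.frame(BMI = runif(300, min = 1, max = 78))
toy$Life.expectancy <- (6.5 + 0.3 * sqrt(toy$BMI) + rnorm(300, sd = 0.5))^2

# Life expectancy as the response, BMI as the predictor.
fit_le  <- lm(sqrt(Life.expectancy) ~ sqrt(BMI), data = toy)
# The reverse direction, matching the order used in the call above.
fit_bmi <- lm(sqrt(BMI) ~ sqrt(Life.expectancy), data = toy)

# R-squared is identical in both directions ...
r2_le  <- summary(fit_le)$r.squared
r2_bmi <- summary(fit_bmi)$r.squared
print(c(r2_le, r2_bmi))

# ... but the intercepts and slopes are not.
print(coef(fit_le))
print(coef(fit_bmi))
```

Using the `data =` argument keeps the coefficient names short (`sqrt(BMI)` rather than `sqrt(Life_expect_filtered$BMI)`), and `abline(fit_le)` would draw the fitted line without retyping the coefficients by hand.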

Hypothesis Testing cont.

-The F-test for the above linear regression model, √BMI = α + β√LE, has the following hypotheses:

-Null hypothesis (the data do not fit the linear regression model): H0: β = 0

-Alternate hypothesis (the data fit the linear regression model): HA: β ≠ 0

-The p-value obtained in the summary is less than 0.001, which is below the 0.05 level of significance, so we reject H0.

pf(q=917.6,1,2607,lower.tail = FALSE)
## [1] 5.857957e-173
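
In a simple linear regression with one predictor, the F-statistic equals the square of the slope's t value, so the overall F-test and the slope t-test give the same p-value. A short check of that arithmetic, using the slope t value of 30.29251 and the 2607 residual degrees of freedom from the model output (the variable names are illustrative):

```r
t_slope  <- 30.29251  # slope t value from the model output
df_resid <- 2607      # residual degrees of freedom from the summary

# Squaring the t value gives the F-statistic (reported as 917.6).
f_from_t <- t_slope^2
print(f_from_t)

# Upper-tail F probability and two-sided t probability agree.
p_f <- pf(f_from_t, df1 = 1, df2 = df_resid, lower.tail = FALSE)
p_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)
print(c(p_f, p_t))
```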

Hypothesis Testing cont.

We get intercept = -7.42914 and slope = 1.60339, and we can check their significance with the following hypotheses:

Intercept: Null hypothesis: H0: α = 0. Alternate hypothesis: HA: α ≠ 0.

Slope: Null hypothesis: H0: β = 0. Alternate hypothesis: HA: β ≠ 0.

Intercept: the t-statistic is -16.88215 with p-value < 0.001, so the intercept is statistically significant at the 0.05 significance level. Hence, we reject the null hypothesis H0: α = 0.

Slope: the t-statistic is 30.29251 with p-value < 0.001, so the slope is statistically significant at the 0.05 significance level. Hence, we reject the null hypothesis H0: β = 0.

fittingmodel %>% summary() %>% coef()
##                                             Estimate Std. Error   t value
## (Intercept)                                -7.429140  0.4400590 -16.88215
## sqrt(Life_expect_filtered$Life.expectancy)  1.603386  0.0529301  30.29251
##                                                 Pr(>|t|)
## (Intercept)                                 9.162349e-61
## sqrt(Life_expect_filtered$Life.expectancy) 5.779917e-173
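
Each t value in the table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below redoes that arithmetic from the printed numbers and adds 95% confidence intervals; the intervals are an addition for illustration and were not part of the original analysis.

```r
est <- c(intercept = -7.429140, slope = 1.603386)   # estimates from the table
se  <- c(intercept = 0.4400590, slope = 0.0529301)  # standard errors from the table
df_resid <- 2607                                    # residual degrees of freedom

# t statistic and two-sided p-value for H0: coefficient = 0.
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
print(rbind(t_stat, p_val))

# 95% confidence intervals: estimate +/- critical t value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
```

With a fitted model object, `confint(fittingmodel)` should return the same kind of intervals directly.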

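Linear Regression : Assumption Testing

Checking the assumptions is an important part of hypothesis testing with a linear regression model. The four diagnostic plots requested below are Residuals vs Fitted (linearity), Normal Q-Q (normality of residuals), Scale-Location (homoscedasticity) and Residuals vs Leverage (influential observations).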

par(mfrow = c(2,2))
fittingmodel %>% plot(which=1)
fittingmodel %>% plot(which=2)
fittingmodel %>% plot(which=3)
fittingmodel %>% plot(which=5)

Correlation Testing

The correlation coefficient between the square-root-transformed variables is 0.5102443, which indicates a weak positive linear relationship between the variables.

r <- cor(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy), use = "complete.obs")
r
## [1] 0.5102443
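
The coefficient can be cross-checked against the regression output: in a simple linear regression r squared equals R2, and the t statistic for testing a zero correlation equals the slope's t value. A minimal sketch of both checks from the reported numbers (n = 2609 is the number of rows after filtering; `t_r` is an illustrative name):

```r
r <- 0.5102443  # correlation coefficient reported above
n <- 2609       # observations after removing missing values and outliers

# r squared should agree with the Multiple R-squared of 0.2603.
print(r^2)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_r <- r * sqrt((n - 2) / (1 - r^2))
p_r <- 2 * pt(abs(t_r), df = n - 2, lower.tail = FALSE)
print(c(t_r, p_r))
```

On the data frame itself, `cor.test()` from base R's stats package performs this test and also reports a confidence interval for the correlation.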

Discussion

-The results show a positive linear relationship between Life Expectancy and BMI on the square-root scale.

-A higher BMI is associated with a longer life span in humans.

-The correlation coefficient shows that the relationship between the variables is weak.

-The linear regression model gives α = -7.429140 and β = 1.603386. Because sqrt(BMI) is the response in the fitted lm() call, the equation becomes:

√BMI = -7.429140 + 1.603386 √LE
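
Because the model is fitted on the square-root scale, a prediction on the original scale is obtained by squaring the fitted value. The sketch below uses the intercept and slope from the coefficient table; the helper `predict_bmi()` is hypothetical and follows the direction of the fitted lm() call, in which sqrt(BMI) is the response and sqrt(Life expectancy) the predictor.

```r
alpha <- -7.429140  # intercept from the coefficient table
beta  <-  1.603386  # slope from the coefficient table

# Hypothetical helper: evaluate the fitted line on the square-root scale,
# then square the result to return to the original BMI scale.
predict_bmi <- function(life_expectancy) {
  (alpha + beta * sqrt(life_expectancy))^2
}

print(predict_bmi(c(60, 70, 80)))
```

Squaring is only meaningful while the fitted value on the square-root scale is non-negative, which holds across the observed life expectancy range of 36.3 to 89 years.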

References

-Kaggle. “Life Expectancy (WHO): Statistical Analysis on factors influencing Life Expectancy”. https://www.kaggle.com/kumarajarshi/life-expectancy-who

-MATH1324 Module Notes.