Introduction

-Many factors affect human life span, such as demographic variables, income composition and mortality rates.

-It is a general notion that being healthy increases an individual’s life expectancy.

-The analysis aims to answer our question: “Does BMI affect life expectancy in humans?”

Problem Statement

-Is there any relationship between life expectancy and BMI in humans?

-A statistical model and hypothesis tests on its parameters are used to assess whether the data provide evidence of such a relationship.

Data

In this assignment, data from the years 2000-2015 for 193 countries are considered for the analysis.

The dataset consists of 2938 observations and 22 variables.

The data is collected from: https://www.kaggle.com/kumarajarshi/life-expectancy-who

There are two important variables that we are going to take further:

-Life Expectancy

-BMI Index

DATASET SUBSET

The data is subset to select only Life expectancy and BMI

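library(readr)  # provides read_csv()
# Load the data and keep only column 4 (Life expectancy) and column 11 (BMI)
Life_expect <- read_csv("/Users/puja/Downloads/Life Expectancy Data.csv")
Life_expect <- Life_expect[, c(4, 11)]
head(Life_expect)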

Scanning Missing Values

-There were 10 missing values in Life expectancy and 34 missing values in BMI; the rows containing them are deleted.

-As the missing values make up less than 5% of the observations, it is safe to exclude them.

colSums(is.na(Life_expect))

## Life expectancy             BMI 
##              10              34

Life_expect <-na.omit(Life_expect)
knitr::kable(colSums(is.na(Life_expect)))

	x
Life expectancy	0
BMI	0
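
The "less than 5%" rule can be checked directly instead of read off the counts. The following is a minimal sketch on a made-up data frame (the `toy` values are hypothetical and are not taken from the WHO data); it reports the share of missing values per column and the share of rows that na.omit() would drop.

```r
# Toy data standing in for the two-column subset (values are made up)
toy <- data.frame(
  Life.expectancy = c(65.0, NA, 71.2, 80.1, 59.9, 74.3, 68.8, 77.0, 62.4, 70.5),
  BMI             = c(19.1, 23.4, NA, 55.2, 17.8, 48.9, 26.0, 60.3, 21.7, 44.4)
)

# Percentage of missing values in each column
pct_missing <- colMeans(is.na(toy)) * 100

# Percentage of rows na.omit() would remove (a row goes if any column is NA)
pct_rows_dropped <- mean(!complete.cases(toy)) * 100

pct_missing
pct_rows_dropped
```

Applied to the real subset, the same two lines would take Life_expect in place of toy, before na.omit() is called.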

Scanning outliers

-To detect outliers, the Mahalanobis distance is used and a new data frame without outliers is created.

library(MVN)  # provides mvn()
results <- mvn(data = Life_expect, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

dim(results$newData)

## [1] 2609    2

# Name the columns the way the later code refers to them (Life.expectancy, BMI)
Life_expect_filtered <- data.frame( Life.expectancy = results$newData[,1],
                                           BMI = results$newData[,2])
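
To illustrate the idea behind distance-based outlier screening, here is a minimal sketch using base R's mahalanobis() with the classical mean and covariance and a chi-square cutoff. This is a simplification and not the same procedure as the "quan" method of mvn(), which is based on robust estimates; the data are simulated and the planted point (150, 300) is hypothetical.

```r
set.seed(1)

# Simulated two-column data (hypothetical), plus one planted extreme point
x <- cbind(a = rnorm(200, mean = 70, sd = 9),
           b = rnorm(200, mean = 38, sd = 20))
x <- rbind(x, c(150, 300))

# Squared Mahalanobis distance of every row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under bivariate normality d2 is approximately chi-square with 2 df,
# so rows beyond the 97.5% quantile are flagged
cutoff     <- qchisq(0.975, df = ncol(x))
is_outlier <- d2 > cutoff

# Data without the flagged rows
x_clean <- x[!is_outlier, , drop = FALSE]
dim(x_clean)
```

Because the classical mean and covariance are themselves pulled by extreme points, robust variants are usually preferred on real data.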

Descriptive Statistics

-The data frame without outliers and missing values is used to calculate summary statistics with the summarise_at() function.

library(dplyr)  # provides %>%, summarise_at() and vars()

# A list of lambdas replaces the deprecated funs(); length(.) counts the observations
Descriptive_statistics <- Life_expect_filtered %>% summarise_at(vars(Life.expectancy,BMI), list(Nos. = ~ length(.),
                                            Mean = ~ mean(.,na.rm = TRUE), 
                                            SD = ~ sd(.,na.rm = TRUE), 
                                            Median = ~ median(.,na.rm = TRUE),
                                            Max = ~ max(.,na.rm = TRUE),
                                            Min = ~ min(.,na.rm = TRUE),
                                            Q1 = ~ quantile(.,probs = 0.25,na.rm = TRUE),
                                            Q3 = ~ quantile(.,probs = 0.75,na.rm = TRUE) )) 
knitr::kable(Descriptive_statistics)

Statistic	Life.expectancy	BMI
Nos.	2609	2609
Mean	69.122	37.88724
SD	9.557769	19.9721
Median	71.9	41.7
Max	89	77.6
Min	36.3	1
Q1	63	19.2
Q3	75.7	56.1

Normality Test

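Test whether the variables are normally distributed:

As the sample size is greater than 30, we can move forward with testing: by the Central Limit Theorem (CLT), the sampling distribution of the mean is approximately normal even if the variables themselves are not.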

library(car)  # provides qqPlot()
Life_expect_filtered$Life.expectancy %>%  qqPlot(dist="norm", main = "Normal Q-Q Plot - Life Expectancy")

## [1]  127 1305

Life_expect_filtered$BMI %>%  qqPlot(dist="norm", main = "Normal Q-Q Plot - BMI")

## [1] 378 379

Hypothesis Testing

A linear regression model is used to check whether the data exhibit a linear relationship between the two variables “Life Expectancy” and “BMI”.

Linear Regression : Assumptions

Independence - The observations are treated as independent, as they have been measured independently for different countries.

Linearity - This can be verified from the scatter plot (Sqrt(Life Expectancy) vs. Sqrt(BMI))

Normality of residuals - (after model fitting)

Homoscedasticity - (after model fitting)

Linear Regression Model - Linearity

-From the raw data plot we can see that the relationship is not linear.

-To make the relationship linear, a transformation has to be applied; the square root transformation of both variables provided the best linear relationship between them.

-The relationship shows a positive trend.

par(mfrow = c(1,2))
plot(Life_expect_filtered$BMI, Life_expect_filtered$Life.expectancy,
     main = "Raw data",
     xlab = "BMI INDEX",
     ylab = "Life expectancy")
plot(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy),
     main = "Square root transformation",
     xlab = "Square root(BMI INDEX)",
     ylab = "Square root(Life expectancy)")

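Hypothesis Testing Cont.

-Linear Regression : Line of Best Fit

-The line of best fit is obtained with the lm() function.

-In the model summary below, the R2 value is 0.2603, which indicates that 26.03% of the variance in the dependent variable can be explained by a linear relationship with the predictor variable. Note that the formula as written makes sqrt(BMI) the dependent variable and sqrt(Life expectancy) the predictor; in a simple linear regression, R2 and the F-statistic are the same whichever of the two variables is the response.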

fittingmodel <- lm(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
fittingmodel %>% summary()

## 
## Call:
## lm(formula = sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6554 -0.6508  0.5581  1.0450  4.4171 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                                -7.42914    0.44006  -16.88
## sqrt(Life_expect_filtered$Life.expectancy)  1.60339    0.05293   30.29
##                                            Pr(>|t|)    
## (Intercept)                                  <2e-16 ***
## sqrt(Life_expect_filtered$Life.expectancy)   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 2607 degrees of freedom
## Multiple R-squared:  0.2603, Adjusted R-squared:  0.2601 
## F-statistic: 917.6 on 1 and 2607 DF,  p-value: < 2.2e-16

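plot(sqrt(Life_expect_filtered$BMI) ~ sqrt(Life_expect_filtered$Life.expectancy), 
     main = "Line of best fit",
     xlab = "Square root (Life Expectancy)",  # predictor in the formula, drawn on the x-axis
     ylab = "Square root (BMI)")              # response in the formula, drawn on the y-axis

# Draw the fitted line from the model object instead of hard-coded coefficients
abline(fittingmodel, col = "red")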
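
Since the formula determines which variable is the response, it helps to see what changes when the two roles are swapped. The sketch below uses simulated data (hypothetical, not the WHO dataset) to show that in a simple linear regression R2 and the slope t-statistic are identical in both directions, while the intercept and slope estimates are not.

```r
set.seed(42)

# Simulated data (hypothetical), not the WHO dataset
x <- runif(300, min = 1, max = 9)
y <- 2 + 0.5 * x + rnorm(300, sd = 1)

fit_yx <- lm(y ~ x)  # y as the response
fit_xy <- lm(x ~ y)  # roles swapped

r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared
t_yx  <- coef(summary(fit_yx))["x", "t value"]
t_xy  <- coef(summary(fit_xy))["y", "t value"]

c(r2_yx = r2_yx, r2_xy = r2_xy)  # the two R-squared values agree
c(t_yx = t_yx, t_xy = t_xy)      # the two slope t-statistics agree
coef(fit_yx)                     # intercept and slope for y ~ x
coef(fit_xy)                     # different intercept and slope for x ~ y
```

The practical consequence is that the significance of the association does not depend on the direction of the formula, but the fitted equation does.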

Hypothesis Testing cont.

-The F-test for the above linear regression model has the following hypotheses -

-Null hypothesis : (The data does not fit the linear regression) H0: β=0 in √BMI=α+β√LE

-Alternate hypothesis : (The data fits the linear regression) HA: β≠0 in √BMI=α+β√LE

-We observe that the p-value obtained in the summary is less than 0.001, therefore we can reject H0 as the p-value is less than the 0.05 level of significance.

pf(q=917.6,1,2607,lower.tail = FALSE)

## [1] 5.857957e-173
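
The F-statistic in the summary can be cross-checked from other numbers in the same output. This is a minimal sketch using values copied from the summary above; the variable names are only illustrative. With a single predictor, F equals the squared slope t-statistic, and it can also be recovered (up to rounding of R2) from R2 and the degrees of freedom.

```r
# Values copied from the model summary above
t_slope <- 30.29251
r2      <- 0.2603
df1     <- 1
df2     <- 2607

# With one predictor, F is the square of the slope t-statistic
f_from_t <- t_slope^2

# F expressed through R-squared (approximate, because R2 is rounded)
f_from_r2 <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail p-value of the F distribution
p_value <- pf(f_from_t, df1, df2, lower.tail = FALSE)

c(f_from_t = f_from_t, f_from_r2 = f_from_r2)
p_value
```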

Hypothesis Testing cont.

We get intercept = -7.42914 and slope = 1.60339, and we can check their significance by:

Intercept: Null hypothesis :H0:α=0 Alternate hypothesis :HA:α≠0

Slope: Null hypothesis :H0:β=0 Alternate hypothesis :HA:β≠0

Intercept: The t-statistic is -16.88215 with p-value <0.001, so we conclude that the intercept is statistically significant (significance level=0.05). Hence, we reject the null hypothesis H0:α=0.

Slope: The t-statistic is 30.29251 with p-value <0.001, so we conclude that the slope is statistically significant (significance level=0.05). Hence, we reject the null hypothesis H0:β=0.

fittingmodel %>% summary() %>% coef()

##                                             Estimate Std. Error   t value
## (Intercept)                                -7.429140  0.4400590 -16.88215
## sqrt(Life_expect_filtered$Life.expectancy)  1.603386  0.0529301  30.29251
##                                                 Pr(>|t|)
## (Intercept)                                 9.162349e-61
## sqrt(Life_expect_filtered$Life.expectancy) 5.779917e-173
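
Each t-statistic in the coefficient table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with 2607 degrees of freedom. The sketch below reproduces this from the printed values; the 95% confidence intervals are an addition for illustration and are not part of the original output.

```r
# Estimates and standard errors copied from the coefficient table above
est <- c(intercept = -7.429140, slope = 1.603386)
se  <- c(intercept = 0.4400590, slope = 0.0529301)
df  <- 2607

# t-statistic: estimate divided by its standard error
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence intervals (illustrative addition)
half_width <- qt(0.975, df) * se
ci <- cbind(lower = est - half_width, upper = est + half_width)

t_stat
p_val
ci
```

A confidence interval that excludes 0 leads to the same decision as rejecting the corresponding null hypothesis at the 0.05 level.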

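Linear Regression : Assumption Testing

This is an important part of linear regression hypothesis testing. The diagnostic plots below are, in order, Residuals vs Fitted (linearity), Normal Q-Q (normality of residuals), Scale-Location (homoscedasticity) and Residuals vs Leverage (influential observations).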

par(mfrow = c(2,2))
fittingmodel %>% plot(which=1)
fittingmodel %>% plot(which=2)
fittingmodel %>% plot(which=3)
fittingmodel %>% plot(which=5)
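
The diagnostic plots can be complemented with numerical checks. The following is a rough sketch on simulated data (hypothetical, not the fitted model above), using only base R: a Shapiro-Wilk test for normality of the residuals, a rank correlation between absolute residuals and fitted values as an informal heteroscedasticity check (a stand-in chosen here, not a formal test such as Breusch-Pagan), and Cook's distance with the common 4/n rule of thumb.

```r
set.seed(7)

# Simulated data and model (hypothetical)
x   <- runif(200, min = 1, max = 9)
y   <- 1 + 0.8 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)

res      <- residuals(fit)
fit_vals <- fitted(fit)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
sw <- shapiro.test(res)

# Informal homoscedasticity check: do absolute residuals trend with fitted values?
het <- cor.test(abs(res), fit_vals, method = "spearman")

# Influential observations: Cook's distance above 4/n (rule of thumb)
cd            <- cooks.distance(fit)
n_influential <- sum(cd > 4 / length(cd))

sw$p.value
het$p.value
n_influential
```

With samples as large as the one in this analysis, such tests can flag small departures that matter little in practice, so they are best read together with the plots.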

Correlation Testing

The correlation coefficient is 0.5102443, which indicates a positive linear relationship of weak-to-moderate strength between the variables.

r <- cor(sqrt(Life_expect_filtered$BMI), sqrt(Life_expect_filtered$Life.expectancy), use = "complete.obs")
r

## [1] 0.5102443
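
The correlation coefficient ties back to the regression output: its square is the R2 of the simple linear regression, and the t-statistic for testing zero correlation matches the slope t-statistic. A minimal sketch with the values reported in this analysis (n = 2609 observations after filtering):

```r
# Correlation and sample size as reported in this analysis
r <- 0.5102443
n <- 2609

# Squared correlation equals R-squared in simple linear regression
r_squared <- r^2

# t-statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

r_squared
t_r
```

On the data itself, cor.test() applied to the same two square-root vectors would report this t-statistic together with a confidence interval for the correlation.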

Discussion

-The results show that there is a positive linear relationship between the square roots of Life Expectancy and BMI.

-A higher BMI is associated with a longer life span in humans.

-The correlation coefficient shows that the relationship between the variables is weak to moderate.

-The linear regression model gives α=-7.429140 and β=1.603386. With sqrt(BMI) as the response, as in the fitted formula, the equation becomes:

√BMI = -7.429140 + 1.603386√LE

-On the square-root scale, about 26.03% of the variation in one variable can be explained by its linear relationship with the other (R2=0.2603).
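
As a worked example of using the fitted coefficients, the sketch below computes a prediction from the model as specified in the lm() call, where sqrt(BMI) is the response and sqrt(Life expectancy) the predictor. The input value of 70 years is arbitrary and chosen only for illustration; the result is squared to return from the square-root scale to the original scale.

```r
# Coefficients copied from the model summary
alpha <- -7.429140
beta  <-  1.603386

# Arbitrary illustrative input (an assumption, not a value from the analysis)
life_expectancy <- 70

# Prediction on the square-root scale, then back-transformed by squaring
sqrt_bmi_hat <- alpha + beta * sqrt(life_expectancy)
bmi_hat      <- sqrt_bmi_hat^2

sqrt_bmi_hat
bmi_hat
```

With R2 at 0.2603, individual predictions of this kind carry considerable uncertainty.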

MATH1324 ASSIGNMENT 3

DOES BMI INDEX AFFECT OVERALL LIFE SPAN IN HUMANS
