1.1. Introduction

Code:

data <- read.csv("./Gender_Height_Weight_Index.csv")
data<-data.frame(data)
head(data,5)
##   Gender Height Weight Index      BMI
## 1   Male    174     96     4 31.70828
## 2   Male    189     87     2 24.35542
## 3 Female    185    110     4 32.14025
## 4 Female    195    104     3 27.35043
## 5   Male    149     61     3 27.47624
hist(data$BMI)
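The BMI column appears to be weight divided by squared height in metres. The units (centimetres and kilograms) are an assumption, but the five printed rows are consistent with it. A minimal check that retypes those rows by hand (`rows` and `bmi_check` are illustrative names, not part of the original script):

```r
# First five rows exactly as printed by head(data, 5) above
rows <- data.frame(
  Height = c(174, 189, 185, 195, 149),   # assumed to be centimetres
  Weight = c(96, 87, 110, 104, 61),      # assumed to be kilograms
  BMI    = c(31.70828, 24.35542, 32.14025, 27.35043, 27.47624)
)

# BMI = weight (kg) / height (m)^2
bmi_check <- rows$Weight / (rows$Height / 100)^2

# Side-by-side comparison of the stored and recomputed values
cbind(rows, bmi_check)
```

The recomputed values agree with the stored column to the printed precision, which supports reading BMI as a derived variable rather than an independent measurement.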

1.2. Empirical Distribution Function

Code:

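bmi_ecdf<-ecdf(data$BMI)
plot(bmi_ecdf,main="ECDF of BMI")
Alpha=0.05
n=length(data$BMI)
#half-width of the (1-Alpha) confidence band from the DKW inequality
Eps=sqrt(log(2/Alpha)/(2*n))
#fine grid so that the band follows the steps of the ECDF
grid<-seq(0,100, length.out = 1000)
lines(grid, pmin(bmi_ecdf(grid)+Eps,1), lty=2)
lines(grid, pmax(bmi_ecdf(grid)-Eps,0), lty=2)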

print(bmi_ecdf(100)-bmi_ecdf(25))
## [1] 0.8

Observations:

  • The empirical CDF of BMI is shown above together with a 95% confidence band based on the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality.
  • About 80 percent of the people in this dataset have a BMI above 25, i.e. they are overweight or obese.
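The band construction can be wrapped in a small helper. This is a sketch on simulated data, since the CSV is not reproduced here; `dkw_band` and the simulated vector `x` are hypothetical names introduced only for illustration.

```r
# Simulated stand-in for data$BMI (assumption: the real data are not available here)
set.seed(1)
x <- rnorm(500, mean = 37, sd = 14)

# Hypothetical helper: ECDF values with a (1 - alpha) DKW confidence band
dkw_band <- function(x, alpha = 0.05, grid = sort(unique(x))) {
  n   <- length(x)
  eps <- sqrt(log(2 / alpha) / (2 * n))   # DKW half-width
  Fn  <- ecdf(x)(grid)                    # ECDF evaluated on the grid
  data.frame(x     = grid,
             ecdf  = Fn,
             lower = pmax(Fn - eps, 0),   # clip the band to [0, 1]
             upper = pmin(Fn + eps, 1))
}

band <- dkw_band(x)
head(band)

# Draw the band only in an interactive session
if (interactive()) {
  plot(band$x, band$ecdf, type = "s", xlab = "x", ylab = "F(x)")
  lines(band$x, band$lower, type = "s", lty = 2)
  lines(band$x, band$upper, type = "s", lty = 2)
}
```

Evaluating the band at the observed data points, rather than on a coarse grid, keeps the band aligned with every jump of the ECDF.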

1.3. Bootstrap and Confidence Intervals

Code:

bmi <- data$BMI
n<-length(bmi)
B=3000
library(bootstrap)
median.boot<-bootstrap(bmi,B,median)
hist(median.boot$thetastar)

#bootstrap standard error: the standard deviation (not the variance) of the replicates
se_boot <- sd(median.boot$thetastar)
theta_hat <- median(bmi)
#normal.ci
c(theta_hat-2*se_boot,theta_hat+2*se_boot)
## [1] 35.04770 38.86618
#pivotal.ci
c(2*theta_hat - quantile(median.boot$thetastar,0.975),2*theta_hat - quantile(median.boot$thetastar,0.025))
##    97.5%     2.5% 
## 35.34639 39.14397
#quantile.ci
quantile(median.boot$thetastar,c(0.025,0.975))
##     2.5%    97.5% 
## 34.76991 38.56749

Observations:

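  • The variance of the bootstrap medians is 0.896, so the bootstrap standard error of the sample median is about 0.95 (the square root of that variance).
  • The normal confidence interval, the point estimate plus or minus two standard errors, is approximately [35.06, 38.85].
  • The pivotal confidence interval is [35.396, 39.076].
  • The quantile (percentile) confidence interval is [34.8, 38.518].
  • The point estimate of the median BMI is 36.957.
  • These figures appear to come from a different bootstrap run than the printed output, so they differ slightly from it; no random seed was set.
  • Under repeated sampling, intervals constructed this way would contain the true population median of BMI about 95% of the time.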
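The three intervals can also be computed with base R alone. The following is a minimal sketch on simulated data, not the `bootstrap` package call used above; `boot_median_ci` and the simulated vector `x` are hypothetical names.

```r
# Simulated stand-in for data$BMI (assumption: the real data are not available here)
set.seed(2)
x <- rnorm(500, mean = 37, sd = 14)

# Hypothetical helper: nonparametric bootstrap of the median with three intervals
boot_median_ci <- function(x, B = 3000, alpha = 0.05) {
  theta_hat  <- median(x)
  # Resample with replacement B times and recompute the median each time
  theta_star <- replicate(B, median(sample(x, replace = TRUE)))
  se <- sd(theta_star)                      # bootstrap standard error
  z  <- qnorm(1 - alpha / 2)
  q  <- unname(quantile(theta_star, c(alpha / 2, 1 - alpha / 2)))
  list(estimate   = theta_hat,
       se         = se,
       normal     = c(theta_hat - z * se, theta_hat + z * se),
       pivotal    = c(2 * theta_hat - q[2], 2 * theta_hat - q[1]),
       percentile = q)
}

res <- boot_median_ci(x)
str(res)
```

The pivotal interval reflects the percentile interval around the point estimate, so the two coincide only when the bootstrap distribution is symmetric about the sample median.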

1.4. MLE and its asymptotic distributions

Code:

male_bmi <- data[which(data$Gender=='Male'),"BMI"]
female_bmi <- data[which(data$Gender=='Female'),"BMI"]
mu1<-mean(male_bmi) #male bmi mean
mu2<-mean(female_bmi) #female bmi mean
mudiff_hat <- mu1-mu2
#Parametric bootstrap to calculate CI for mu1-mu2
var1<-var(male_bmi)
var2<-var(female_bmi)
sd1<-sqrt(var1)
sd2<-sqrt(var2)
n_male <- length(male_bmi)
n_female <- length(female_bmi)

mu_diff_par<-c()
for (i in 1:1000){
  x <- rnorm(n_male,mu1,sd1)
  y <- rnorm(n_female,mu2,sd2)
  mu_diff_par[i]<-mean(x)-mean(y)
}

mu_diff_par.mean <- mean(mu_diff_par)
mu_diff_par.sd<-sd(mu_diff_par)
CI<-c(mu_diff_par.mean-1.96*mu_diff_par.sd, mu_diff_par.mean+1.96*mu_diff_par.sd)
hist(mu_diff_par)

Observations:

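  • The maximum likelihood estimate of the difference in mean BMI (male minus female) is 0.7575097.
  • The standard error of this estimate was approximated with a parametric bootstrap: 1000 replicates drawn from normal models fitted to each group.
  • A 95% confidence interval was then built from the mean and standard deviation of the bootstrap replicates: (-1.751231, 3.141853).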
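The parametric bootstrap standard error can be compared with the plug-in formula sqrt(s1^2/n1 + s2^2/n2) for a difference of two independent means. This is a sketch on simulated groups; the group sizes, means and standard deviations are made-up stand-ins for `male_bmi` and `female_bmi`.

```r
# Simulated stand-ins for male_bmi and female_bmi (all parameters are assumptions)
set.seed(3)
a <- rnorm(245, mean = 38, sd = 14)
b <- rnorm(255, mean = 37, sd = 14)

# Plug-in standard error of mean(a) - mean(b)
se_analytic <- sqrt(var(a) / length(a) + var(b) / length(b))

# Parametric bootstrap: simulate both groups from their fitted normal models
diff_star <- replicate(1000, {
  mean(rnorm(length(a), mean(a), sd(a))) -
    mean(rnorm(length(b), mean(b), sd(b)))
})
se_boot <- sd(diff_star)

c(analytic = se_analytic, bootstrap = se_boot)
```

With 1000 replicates the two values typically agree to within a few percent, so the bootstrap here mainly serves as a check on the closed-form standard error.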

1.5. Hypothesis Testing (Wald Test)

Code:

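#Wald statistic for H0: mu1 - mu2 = 0, using the standard error of the difference in means
se_diff <- sqrt(var1/n_male + var2/n_female)
z <- (mudiff_hat - 0)/se_diff
#two-sided p-value
p_value <- 2*(1-pnorm(abs(z)))
p_value

Observations:

  • As a rough check, the parametric bootstrap standard error from Section 1.4 is about 1.25, which gives z ≈ 0.7575/1.25 ≈ 0.61 and a two-sided p-value of roughly 0.54.
  • Since the p-value is well above 0.05, the result is not statistically significant and we cannot reject the null hypothesis. The data provide no evidence of a difference between the mean BMI of males and females.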
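The Wald test for a difference in means can be sketched as a small function. The function name `wald_two_sample` and the simulated groups are hypothetical; only base R calls are used.

```r
# Hypothetical helper: two-sided Wald test of H0: mean(a) - mean(b) = delta0
wald_two_sample <- function(a, b, delta0 = 0) {
  est <- mean(a) - mean(b)
  se  <- sqrt(var(a) / length(a) + var(b) / length(b))  # SE of the difference
  z   <- (est - delta0) / se
  c(estimate = est, se = se, z = z, p_value = 2 * pnorm(-abs(z)))
}

# Simulated stand-ins for male_bmi and female_bmi (all parameters are assumptions)
set.seed(4)
a <- rnorm(245, mean = 38, sd = 14)
b <- rnorm(255, mean = 37, sd = 14)

res <- wald_two_sample(a, b)
res
```

The statistic has the same form as the Welch t statistic returned by `t.test(a, b)`; the Wald test simply refers it to a standard normal instead of a t distribution, which makes little difference at these sample sizes.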

1.6. Bayesian Analysis

Code:

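#posterior of mu1-mu2 under a flat prior: normal, centred at the MLE, with variance var1/n_male+var2/n_female
mu_diff_var <- var1/n_male+var2/n_female
posterior_mudiff<-rnorm(n_male+n_female,mudiff_hat,sqrt(mu_diff_var))
hist(posterior_mudiff)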

Observation:

  • The posterior distribution of the difference in means is similar to the sampling distribution of the difference in means obtained with the frequentist approach.
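The similarity can be made concrete by comparing a 95% credible interval from posterior draws with the 95% Wald interval. This sketch assumes a flat prior on each mean and treats the sample variances as known, so the posterior of the difference is normal; the simulated groups are stand-ins for the real data.

```r
# Simulated stand-ins for male_bmi and female_bmi (all parameters are assumptions)
set.seed(5)
a <- rnorm(245, mean = 38, sd = 14)
b <- rnorm(255, mean = 37, sd = 14)

est <- mean(a) - mean(b)
se  <- sqrt(var(a) / length(a) + var(b) / length(b))

# Posterior draws of mu1 - mu2 under a flat prior: N(est, se^2)
draws <- rnorm(10000, mean = est, sd = se)

# 95% credible interval from the draws versus the 95% Wald interval
cred <- unname(quantile(draws, c(0.025, 0.975)))
wald <- est + c(-1, 1) * qnorm(0.975) * se

rbind(credible = cred, wald = wald)
```

Under these assumptions the two intervals agree up to Monte Carlo error, although they are interpreted differently: the credible interval is a probability statement about the parameter, the Wald interval a statement about the procedure.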

1.7. Conclusion and Future Scope