Lab 7 Assignment

Problem 1:

Import the data from Flight Delays into R. Although the data cover all UA and AA flights flown in May and June of 2009, we will assume they represent a sample from a larger population of UA and AA flights flown under similar circumstances. We will consider the ratio of the mean flight delay lengths, \(\mu_{UA}/\mu_{AA}\).

  1. Perform some exploratory data analysis (EDA) on the flight delay lengths for UA and AA flights separately (look at some plots). Comment on your results.

  2. Bootstrap the mean of flight delay lengths for each airline separately. Plot and describe each bootstrap distribution.

  3. Bootstrap the ratio of means. Provide plots of the bootstrap distribution and describe the distribution.

  4. Find the 95% bootstrap percentile interval for the ratio of means. Interpret this interval.

  5. What is the bootstrap estimate of the bias of the ratio of means?
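
The bootstrap steps above can be sketched as follows. This is a minimal sketch on simulated delays, not the Flight Delays data: the vectors `delay_ua` and `delay_aa`, the seed, and the number of resamples `B` are assumptions made only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-ins for the delay lengths (assumption: NOT the real data)
delay_ua <- rexp(200, rate = 1 / 16)
delay_aa <- rexp(300, rate = 1 / 10)

B <- 10000  # number of bootstrap resamples (assumed value)

# Resample each airline independently, with replacement, at its own sample size,
# and record the ratio of the two resampled means
boot_ratio <- replicate(B, {
  mean(sample(delay_ua, replace = TRUE)) / mean(sample(delay_aa, replace = TRUE))
})

# Ratio of means in the original (here: simulated) samples
obs_ratio <- mean(delay_ua) / mean(delay_aa)

# 95% bootstrap percentile interval: the 2.5% and 97.5% quantiles
ci_95 <- quantile(boot_ratio, probs = c(0.025, 0.975))

# Bootstrap estimate of bias: mean of the bootstrap statistics minus the observed statistic
boot_bias <- mean(boot_ratio) - obs_ratio
```

Calling `hist(boot_ratio)` and `qqnorm(boot_ratio)` on the result would give the plots asked for in the exercise. Resampling the two airlines separately keeps each bootstrap sample the same size as the group it came from.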

Problem 2: Hypothesis testing using the bootstrap approach.

(Chihara2011)

Import the data set Titanic.csv, which contains survival data (0 = death, 1 = survival) and ages of 658 passengers of the Titanic, which sank on April 15, 1912 (the day when Americans had to file income tax returns for the first time). Test the null hypothesis that the mean ages of survivors and of victims are the same against the alternative that these mean ages are different, using a bootstrap approach.

Titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(Titanic)
##   ID Survived   Age
## 1  1        1  0.92
## 2  2        0 30.00
## 3  3        1 48.00
## 4  4        0 39.00
## 5  5        0 71.00
## 6  6        0 47.00

Let’s state a hypothesis. Are the two population means different?

Null Hypothesis: Mean ages of survivors and of victims are the same. (That means that there is no difference between the two population means)

Let’s use the bootstrap approach.

  1. First, let’s separate the two samples: “survivors (= 1)” and “victims (= 0)”.

  2. Draw many bootstrap samples (100000) of the difference of sample means.

  3. Make a 90% percentile bootstrap confidence interval for the difference of means. Using this CI, state your conclusion about the hypothesis.

  4. Make a 95% percentile bootstrap confidence interval for the difference of means. Using this CI, state your conclusion about the hypothesis.
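
The steps above can be sketched as follows. This is a minimal sketch on a simulated data frame that only borrows the `Survived` and `Age` column names shown in the `head(Titanic)` output. The name `titanic_sim`, the simulated values, the seed, and the reduced number of resamples are assumptions for illustration.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in with the same column names as Titanic.csv (assumption: NOT the real data)
titanic_sim <- data.frame(
  Survived = rbinom(658, size = 1, prob = 0.4),
  Age      = round(runif(658, min = 1, max = 70))
)

# Step 1: separate the two samples
age_surv <- titanic_sim$Age[titanic_sim$Survived == 1]
age_vict <- titanic_sim$Age[titanic_sim$Survived == 0]

B <- 10000  # the assignment asks for 100000; reduced here to keep the sketch fast

# Step 2: bootstrap the difference of sample means, resampling each group separately
boot_diff <- replicate(B, {
  mean(sample(age_surv, replace = TRUE)) - mean(sample(age_vict, replace = TRUE))
})

# Percentile intervals: 90% uses the 5% and 95% quantiles, 95% uses 2.5% and 97.5%
ci_90 <- quantile(boot_diff, probs = c(0.05, 0.95))
ci_95 <- quantile(boot_diff, probs = c(0.025, 0.975))

# Decision rule: reject H0 (difference = 0) if 0 lies outside the interval
reject_90 <- ci_90[[1]] > 0 || ci_90[[2]] < 0
reject_95 <- ci_95[[1]] > 0 || ci_95[[2]] < 0
```

Because the 95% interval is at least as wide as the 90% interval, a difference of 0 can fall inside the 95% interval while falling outside the 90% one, so the two conclusions need not agree.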

Problem 3 (BONUS: 1 point. Note that this is out of 10, so in total you can get up to 110%)

Load the Bangladesh data. We made inferences about Arsenic in water in Lab 9. Now let’s make inferences about Cobalt.

bdesh <- read.csv("Bangladesh.csv")
head(bdesh)
##   Arsenic Chlorine Cobalt
## 1    2400      6.2   0.42
## 2       6    116.0   0.45
## 3     904     14.8   0.63
## 4     321     35.9   0.68
## 5    1280     18.9   0.58
## 6     151      7.8   0.35

  1. Draw a boxplot of Cobalt. What looks wrong? Do you think a transformation would describe the data better?

  2. Apply a log transformation (log10) to the Cobalt data and draw the boxplot again. What do you see that is different? Comment.

  3. Are there any missing values? If so, remove them and draw bootstrap samples (10000) of the sample mean. (Remember, we are still using the log transformation.)

  4. What is the bias?

  5. Find the standard error of the bootstrap estimate.

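Also, calculate the bootstrap estimate of the mean squared error, \(\text{MSE} = \text{Var}(\hat{\theta}^*) + (\text{bias})^2\), where \(\text{Var}(\hat{\theta}^*)\) is the variance of the bootstrap estimates.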

Comment on your results.

  6. Make a 95% percentile bootstrap CI for the mean of the log-transformed (log10) Cobalt data. Comment on your results.

  7. Repeat the bootstrap sampling process to estimate the median of the original data (Cobalt, no transformation).

  8. Find the bias and the standard error of the bootstrap estimate of the median. Comment on your results.
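
The bias, standard error, MSE, and interval calculations above can be sketched as follows. This is a minimal sketch on a simulated right-skewed vector, not the Bangladesh data: the vector `cobalt`, its lognormal parameters, the two injected `NA` values, and the seed are assumptions for illustration.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed stand-in for the Cobalt column, with two NAs
# (assumption: NOT the real data)
cobalt <- c(rlnorm(250, meanlog = -0.7, sdlog = 0.5), NA, NA)

cobalt     <- cobalt[!is.na(cobalt)]  # drop missing values before resampling
log_cobalt <- log10(cobalt)           # base-10 log transformation

B <- 10000  # number of bootstrap resamples

# Bootstrap the mean of log10(Cobalt)
boot_mean <- replicate(B, mean(sample(log_cobalt, replace = TRUE)))

bias_mean <- mean(boot_mean) - mean(log_cobalt)  # bootstrap estimate of bias
se_mean   <- sd(boot_mean)                       # bootstrap standard error
mse_mean  <- var(boot_mean) + bias_mean^2        # variance of the estimates + bias^2
ci_mean   <- quantile(boot_mean, probs = c(0.025, 0.975))  # 95% percentile CI

# Same procedure for the median of the untransformed data
boot_med <- replicate(B, median(sample(cobalt, replace = TRUE)))
bias_med <- mean(boot_med) - median(cobalt)
se_med   <- sd(boot_med)
```

The bootstrap distribution of a median is typically more discrete than that of a mean, because each resampled median can only take values at or between observed data points. That is worth keeping in mind when commenting on its bias and standard error.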

Quiz 5 type Question (PRACTICE ONLY)

(You don’t have to submit this question, but I highly recommend that you practice it, as Quiz 5 problem 1 will be similar.)

Let \((x_1, x_2, \ldots, x_n)\) be a random sample of size \(n\) from a population described by the density function:

\[ f(x|\theta) = \begin{cases} \frac{x e^{-x/\theta}}{\theta^3}, & x > 0, \theta > 0 \\ 0, & \text{otherwise} \end{cases} \]

Questions

  1. Find the likelihood function.
  2. Find the log-likelihood.
  3. Find the MLE of \(\theta\).
  4. Find the method of moments (MOM) estimate of \(\theta\).
  5. Is the MOM estimator unbiased? Justify your answer.
  6. Is the MLE unbiased? Justify your answer.
  7. Construct an unbiased estimator \(\tilde{\theta}\) based on the MLE.
  8. Using the concept of efficiency, which estimator is better? Hints: the \(X_i\)’s are i.i.d., and \(\bar{X}\) is the sample mean of the \(X_i\)’s.