Introduction

Last time we learned how to generate a bootstrap confidence interval estimate for the mean. In this lesson, we will explore confidence interval estimates further and learn how to perform hypothesis testing.

Recall CIs

To warm up, let’s first recall some basics about confidence intervals. Both their construction and their interpretation are worth revisiting.

Question

Suppose that, for a sample of numerical values X, a confidence interval estimate for the mean is computed as (45, 55). What is the appropriate interpretation?

Question

Given a sample mean \(\bar{x}\), sample size \(n\), and sample standard deviation \(s\), write out the formula for a 95% confidence interval using the t-distribution. How is this done mathematically? How is this done in R?
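
As a rough illustration of the interval \(\bar{x} \pm t^{*}\, s/\sqrt{n}\), here is a minimal R sketch. The values of `xbar`, `s`, and `n` below are made up for illustration and are not taken from any dataset in this lesson.

```r
# Hypothetical summary statistics (illustrative values only)
xbar <- 50   # sample mean
s    <- 10   # sample standard deviation
n    <- 25   # sample size

# Critical value t* with n - 1 degrees of freedom for 95% confidence
tstar <- qt(0.975, df = n - 1)

# Margin of error and the resulting interval
moe <- tstar * s / sqrt(n)
ci  <- c(lower = xbar - moe, upper = xbar + moe)
print(ci)
```

Using 0.975 rather than 0.95 in `qt` leaves 2.5% in each tail, which is what a two-sided 95% interval needs.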

Main Example: Stats Enrollment

Below is a dataset called StatsEnrollment, which records the enrollment of 82 statistics graduate programs. It has two columns: UniversityDepartment, the name of the institution and department, and FTGradEnrollment, the number of full-time graduate students in the program. Here are the first few entries and some summary statistics for the enrollment numbers:

head(StatsEnrollment)
print(summary(StatsEnrollment$FTGradEnrollment))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   30.25   45.50   53.54   68.00  196.00
print(sd(StatsEnrollment$FTGradEnrollment))
## [1] 36.88962
print(nrow(StatsEnrollment))
## [1] 82

Question

Using the t-distribution, compute a 95% confidence interval estimate for the mean enrollment. Write out an interpretation of your interval in a complete sentence.

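You should get approximately (45.43, 61.64), that is, \(\bar{x} \pm t^{*}\, s/\sqrt{n}\) with \(t^{*}\) taken from the t-distribution with 81 degrees of freedom.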

Question

Now compute the same interval by bootstrapping with 1000 resamples. Below is the code we used last time. Note that x is not defined and needs to be. What should we use for x?

B<-1000 # number of bootstrap samples to obtain
xbar<-rep(0,B) # a vector of B zeros; it will hold the bootstrap sample means

# next we draw a resample with replacement B times and store the mean of each one
for (i in 1:B){
  xbs<-sample(x, length(x), replace=TRUE)
  xbar[i]<-mean(xbs)
}

# the 2.5th and 97.5th percentiles of the bootstrap means give the 95% interval
quantile(xbar, probs=c(0.025, 0.975))
##     2.5%    97.5% 
## 46.54726 61.67287

Your solution should be close to the one above, but it will probably not be identical, because the resampling is random.
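
The loop above can also be wrapped in a small function. The sketch below is one possible way to do it, not the only one. The function name `boot_mean_ci` and the short data vector `y` are hypothetical and exist only so that the example runs on its own; they are not the enrollment data.

```r
# Percentile bootstrap confidence interval for the mean (illustrative sketch)
boot_mean_ci <- function(y, B = 1000, level = 0.95) {
  # Each replicate: resample y with replacement, then take the mean
  means <- replicate(B, mean(sample(y, length(y), replace = TRUE)))
  alpha <- 1 - level
  quantile(means, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(1)                               # makes the random resampling repeatable
y <- c(12, 30, 45, 51, 68, 75, 90, 110)   # made-up values for illustration
ci_boot <- boot_mean_ci(y)
print(ci_boot)
```

Calling `set.seed` first is what makes two runs agree; without it, each run gives a slightly different interval.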

Follow up question:

Compare the width of your bootstrap interval with the width of the interval computed using the t-distribution. Which one is wider, and by how much? Think about why this makes sense. It is worth noting that the t-interval is always wider than the corresponding interval based on the normal distribution.

Hypothesis testing using bootstrapping

First, as before, let’s recall hypothesis testing using the t-test as a basis for comparison. Assume that our set of colleges is only a sample, and not a census of all colleges with statistics graduate programs.

Suppose that, historically, enrollment in statistics graduate programs averaged 55 students. We would like to know whether we can support the claim that this has decreased. Let \(\mu\) be the true current mean enrollment of all statistics graduate programs. Our hypotheses:

\[H_0: \mu=55\]
\[H_1: \mu<55\]

Recall now that a p-value is the probability of your observation, or an equally or more extreme observation, assuming the null hypothesis is true.

Question

Compute the t-score, the p-value and the decision. You will need to use some of the summary statistics we computed previously. You will also need to use the R function pt.

You should get a p-value of 0.3601797.
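
One way to carry out this calculation from the summary statistics printed earlier is sketched below. It uses the rounded mean 53.54 from the `summary()` output, so the p-value may differ from the one quoted above in the later decimal places. The variable names are illustrative.

```r
# Summary statistics reported earlier in the lesson
xbar <- 53.54      # sample mean (rounded, from summary())
s    <- 36.88962   # sample standard deviation
n    <- 82         # sample size
mu0  <- 55         # mean under the null hypothesis

# t-score: how many standard errors the sample mean lies from mu0
t_score <- (xbar - mu0) / (s / sqrt(n))

# Left-tailed p-value, because the alternative is mu < 55
p_left <- pt(t_score, df = n - 1)

print(t_score)
print(p_left)
```

Because the p-value is far above common significance levels such as 0.05, we fail to reject \(H_0\).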

Follow up question

What would you do if the direction of the alternative were reversed (\(H_1: \mu>55\))? What if the test were two-sided (\(H_1: \mu\neq 55\))?
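
For the follow-up, only the tail area taken from `pt` changes. A minimal sketch, reusing the summary statistics from the lesson (with the rounded mean 53.54); the variable names are illustrative:

```r
xbar <- 53.54; s <- 36.88962; n <- 82; mu0 <- 55
t_score <- (xbar - mu0) / (s / sqrt(n))

# Right-tailed alternative (mu > 55): area to the right of the t-score
p_right <- pt(t_score, df = n - 1, lower.tail = FALSE)

# Two-sided alternative (mu != 55): twice the area beyond |t| in one tail
p_two <- 2 * pt(-abs(t_score), df = n - 1)

print(c(right = p_right, two_sided = p_two))
```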

Now hypothesis testing for bootstrapping

We will do the same hypothesis test, this time using bootstrapping. Before we get too far in, based on what we have seen before, do you think the p-value will, in general, be greater than, less than, or roughly the same as the previous p-value? As a reminder, since bootstrapping is random, we won’t know for sure!

We do not need to generate a new bootstrap sample; we can use the same xbar as before. Recall that this is a vector of bootstrap sample means.

What proportion of them are greater than or equal to 55?

sum(xbar>=55)/length(xbar)
## [1] 0.342

Question

Why did we have a “>=” rather than a “<” if our extreme is to the left? Think about the definition of a p-value.

Question

Repeat the exercise for a right-tailed test and for a two-sided test. State the new hypotheses clearly.
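
A sketch of how the three bootstrap proportions could be computed side by side, following the same convention as the `sum(xbar>=55)/length(xbar)` line above. The data vector `y` and the null value `mu0` are made up so that the block runs on its own; they are not the enrollment data. Doubling the smaller tail for the two-sided case is one common convention, assumed here rather than taken from the lesson.

```r
set.seed(2)
y   <- c(12, 30, 45, 51, 68, 75, 90, 110)   # made-up sample
mu0 <- 70                                    # hypothetical null mean
B   <- 1000

# Bootstrap distribution of the sample mean
boot_means <- replicate(B, mean(sample(y, length(y), replace = TRUE)))

# Alternative mu < mu0: proportion of bootstrap means at or above mu0
p_left  <- mean(boot_means >= mu0)
# Alternative mu > mu0: proportion of bootstrap means at or below mu0
p_right <- mean(boot_means <= mu0)
# Alternative mu != mu0: double the smaller tail, capped at 1
p_two   <- min(1, 2 * min(p_left, p_right))

print(c(left = p_left, right = p_right, two_sided = p_two))
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values, so it is equivalent to dividing `sum()` by `length()`.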

Appendix: Stats Enrollment Table

Copy all of the text below and paste it into RStudio to complete the exercises.

StatsEnrollment<-read.table(text="UniversityDepartment, FTGradEnrollment
Baylor University (Statistics), 26
Boston University (Biostatistics), 39
Brown University (Biostatistics), 21
Carnegie Mellon University (Statistics), 39
Case Western Reserve University (Statistics), 11
Colorado State University (Statistics), 14
Columbia University (Biostatistics), 64
Columbia University (Statistics), 196
Cornell University (Statistics), 78
Duke University (Statistics), 31
Emory University (Biostatistics), 58
Florida State University (Statistics), 47
George Mason University (Statistics), 10
George Washington University (Statistics), 9
Harvard University (Biostatistics), 70
Harvard University (Statistics), 67
Iowa State University (Statistics), 145
Johns Hopkins University (Biostatistics), 41
Kansas State University (Statistics), 44
Medical College of Georgia (Biostatistics), 11
Medical College of Wisconsin (Biostatistics), 7
Medical University of South Carolina (Biostatistics), 46
Michigan State University (Statistics), 81
New York University (Statistics), 6
North Carolina State University (Statistics), 163
North Dakota State University (Statistics), 25
Northwestern University (Statistics), 12
Ohio State University (Statistics), 101
Oklahoma State University (Statistics), 22
Oregon State University (Statistics), 30
Pennsylvania State University (Statistics), 75
Purdue University (Statistics), 85
Rice University (Statistics), 55
Rutgers University (Statistics), 111
Southern Methodist University (Statistics), 21
Stanford University (Statistics), 100
State University of New York - Buffalo (Biostatistics), 43
Temple University (Statistics), 40
Texas A&M University (Statistics), 101
University of Alabama - Birmingham (Biostatistics), 49
University of Arizona (Statistics), 3
University of California - Berkeley (Biostatistics), 36
University of California - Berkeley (Statistics), 58
University of California - Davis (Statistics), 34
University of California - Los Angeles (Biostatistics), 60
University of California - Los Angeles (Statistics), 72
University of California - Riverside (Statistics), 54
University of California - Santa Barbara (Statistics), 53
University of Chicago (Statistics), 109
University of Cincinnati (Biostatistics), 31
University of Connecticut (Statistics), 45
University of Florida (Statistics), 68
University of Georgia (Statistics), 59
University of Illinois (Statistics), 58
University of Iowa (Biostatistics), 35
University of Iowa (Statistics), 75
University of Kentucky (Statistics), 40
University of Massachusetts - Amherst (Biostatistics), 19
University of Michigan (Biostatistics), 117
University of Michigan (Statistics), 108
University of Minnesota (Biostatistics), 48
University of Minnesota (Statistics), 47
University of Missouri (Statistics), 58
University of Nebraska (Statistics), 44
University of North Carolina (Biostatistics), 118
University of North Carolina (Statistics), 78
University of Pennsylvania (Statistics), 23
University of Pittsburgh (Statistics), 32
University of Rochester (Biostatistics), 18
University of South Carolina (Biostatistics), 45
University of South Carolina (Statistics), 32
University of Texas - Houston (Biostatistics), 62
University of Virginia (Statistics), 43
University of Washington (Biostatistics), 68
University of Washington (Statistics), 53
University of Wisconsin (Statistics), 116
University of Wyoming (Statistics), 11
Virginia Commonwealth University (Biostatistics), 24
Virginia Commonwealth University (Statistics), 15
Virginia Polytechnic Institute (Statistics), 60
Western Michigan Statistics (Statistics), 31
Yale University (Statistics), 36", sep=",", header=TRUE)