Recall CI’s
To warm up, let’s first recall some basics about confidence intervals. The construction and the interpretation are worth revisiting.
Question
Suppose that, for a sample of numerical values X, a confidence interval estimate for the mean is computed as (45, 55). What is the appropriate interpretation?
Question
Given a sample mean \(\bar{x}\), sample size \(n\), and sample standard deviation \(s\), write out the formula for a 95% confidence interval using the t-distribution. How is this done mathematically? How is this done in R?
Main Example: Stats Enrollment
We will use a dataset called StatsEnrollment (the code that creates it is in the appendix). It contains the enrollment of 82 statistics graduate programs. It has two columns: UniversityDepartment, which is the name of the institution, and FTGradEnrollment, which is the number of full-time graduate students in the program.
The code below shows the first few entries and some summary statistics of the enrollment numbers:
head(StatsEnrollment)
print(summary(StatsEnrollment$FTGradEnrollment))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 30.25 45.50 53.54 68.00 196.00
print(sd(StatsEnrollment$FTGradEnrollment))
## [1] 36.88962
print(nrow(StatsEnrollment))
## [1] 82
Question
Using the t-distribution, compute a 95% confidence interval estimate for the mean enrollment. Write out an interpretation of your interval in a complete sentence.
You should get approximately (45.43, 61.64).
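As a rough check, the interval can be computed directly from the summary statistics printed above. The sketch below is only one way to do it: the variable names (xbar_obs, se, t_crit, ci) are illustrative choices, and the mean is the rounded value from summary(), so the last digits may differ slightly from a computation on the raw data.

```r
# Summary statistics copied from the output above (the mean is rounded)
xbar_obs <- 53.54
s <- 36.88962
n <- 82

se <- s / sqrt(n)                 # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # 97.5th percentile of the t-distribution, n - 1 df
ci <- c(lower = xbar_obs - t_crit * se, upper = xbar_obs + t_crit * se)
print(ci)
```

With the data frame loaded, t.test(StatsEnrollment$FTGradEnrollment)$conf.int should report the same kind of interval in a single call.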
Question
Now compute the same interval by bootstrapping with 1000 resamples. Below is the code we used last time. Note that x is not defined, and it needs to be. What should we use for x?
B <- 1000           # number of bootstrap resamples to obtain
xbar <- rep(0, B)   # a vector of 1000 zeros; it will hold the bootstrap sample means
# Next, resample with replacement 1000 times and compute the mean of each resample.
for (i in 1:B) {
  xbs <- sample(x, length(x), replace = TRUE)
  xbar[i] <- mean(xbs)
}
quantile(xbar, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 46.54726 61.67287
Your solution should be close to the one above but it will probably not be equal.
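Because the resampling is random, it can help to fix the seed so that a run can be repeated. The following is a minimal sketch, not the code from last time: the helper boot_mean_ci and the seed value are hypothetical choices for illustration, and enroll is simply the FTGradEnrollment column from the appendix typed in as a vector.

```r
# Enrollment counts copied from the appendix table (82 programs)
enroll <- c(26, 39, 21, 39, 11, 14, 64, 196, 78, 31, 58, 47, 10, 9, 70, 67,
            145, 41, 44, 11, 7, 46, 81, 6, 163, 25, 12, 101, 22, 30, 75, 85,
            55, 111, 21, 100, 43, 40, 101, 49, 3, 36, 58, 34, 60, 72, 54, 53,
            109, 31, 45, 68, 59, 58, 35, 75, 40, 19, 117, 108, 48, 47, 58, 44,
            118, 78, 23, 32, 18, 45, 32, 62, 43, 68, 53, 116, 11, 24, 15, 60,
            31, 36)

# Hypothetical helper: percentile bootstrap interval for the mean
boot_mean_ci <- function(x, B = 1000, level = 0.95) {
  # B bootstrap means, each computed from a resample of x with replacement
  means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(means, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(2024)  # arbitrary seed, only so that the run is repeatable
ci_boot <- boot_mean_ci(enroll)
print(ci_boot)
```

Changing or removing the seed should give a slightly different interval each time, which is the run-to-run variation described above.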
Follow up question:
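Compare the width and position of your bootstrap interval with those of the interval computed using the t-distribution. The percentile bootstrap interval does not have to be symmetric about the sample mean, whereas the t-interval always is. Think about why this makes sense for these data. It is worth noting that the t-interval is always wider than the one you would get using the normal distribution, because the t quantiles are larger than the corresponding normal quantiles.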
Hypothesis testing using bootstrapping
First, as before, let’s recall hypothesis testing using the t-test as a basis for comparison. Assume that our collection of colleges is only a sample, and not a census of all colleges with stats graduate programs.
Suppose that, historically, graduate school enrollment in statistics programs was on average 55 students. We would like to know if we can support the claim that this has decreased. Let \(\mu\) be the true current mean enrollment of all statistics programs. Our hypotheses:
\[H_0: \mu=55\] \[H_1: \mu<55\]
Recall now that a p-value is the probability of obtaining your observation, or an equally or more extreme observation, assuming the null hypothesis is true.
Question
Compute the t-score, the p-value and the decision. You will need to use some of the summary statistics we computed previously. You will also need to use the R function pt.
You should get a p-value of 0.3601797.
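One way to carry out the computation from the summary statistics alone is sketched below. The names mu0, t_score, and p_value and the 5% significance level are assumptions made for illustration. Because the mean is the rounded value printed by summary(), the p-value matches the one quoted above only approximately.

```r
# Summary statistics from earlier in the worksheet (the mean is rounded)
xbar_obs <- 53.54
s <- 36.88962
n <- 82
mu0 <- 55                                     # mean under the null hypothesis

t_score <- (xbar_obs - mu0) / (s / sqrt(n))   # distance from mu0 in standard errors
p_value <- pt(t_score, df = n - 1)            # left-tail area, since H1 is mu < 55

alpha <- 0.05                                 # assumed significance level
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(c(t = t_score, p = p_value))
print(decision)
```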
Follow up question
What would you do if the direction of the alternative were reversed? What if the test were two-sided?
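The three alternatives differ only in which tail area of the t-distribution is taken. The sketch below uses the summary statistics printed earlier, with the rounded mean; the variable names are illustrative, not part of the original exercise.

```r
# Summary statistics from earlier in the worksheet (the mean is rounded)
xbar_obs <- 53.54
s <- 36.88962
n <- 82
mu0 <- 55
t_score <- (xbar_obs - mu0) / (s / sqrt(n))

p_left  <- pt(t_score, df = n - 1)                      # H1: mu < 55
p_right <- pt(t_score, df = n - 1, lower.tail = FALSE)  # H1: mu > 55
p_two   <- 2 * pt(-abs(t_score), df = n - 1)            # H1: mu != 55
print(c(left = p_left, right = p_right, two_sided = p_two))
```

The left and right p-values add up to 1, and the two-sided p-value is twice the smaller of the two.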
Now hypothesis testing for bootstrapping
We will do the same hypothesis test, this time using bootstrapping. Before we get too far in: based on what we have seen before, do you think the p-value will, in general, be greater than, less than, or roughly the same as the previous p-value? As a reminder, since bootstrapping is random, we won’t know for sure!
We do not need to generate a new bootstrap sample; we can use the same xbar as before. Recall that this is a vector of bootstrap sample means.
What proportion of them are greater than or equal to 55?
sum(xbar>=55)/length(xbar)
## [1] 0.342
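The calculation above counts the bootstrap means on the far side of 55. A different, commonly used formulation first shifts the bootstrap means so that they are centered at the null value and then counts how many fall at or below the observed mean. The sketch below computes both for comparison. It is an illustration rather than the method of this worksheet, and the names enroll, boot_means, and null_means and the seed are assumptions.

```r
# Enrollment counts copied from the appendix table (82 programs)
enroll <- c(26, 39, 21, 39, 11, 14, 64, 196, 78, 31, 58, 47, 10, 9, 70, 67,
            145, 41, 44, 11, 7, 46, 81, 6, 163, 25, 12, 101, 22, 30, 75, 85,
            55, 111, 21, 100, 43, 40, 101, 49, 3, 36, 58, 34, 60, 72, 54, 53,
            109, 31, 45, 68, 59, 58, 35, 75, 40, 19, 117, 108, 48, 47, 58, 44,
            118, 78, 23, 32, 18, 45, 32, 62, 43, 68, 53, 116, 11, 24, 15, 60,
            31, 36)

set.seed(2024)  # arbitrary seed, only so that the run is repeatable
B <- 1000
boot_means <- replicate(B, mean(sample(enroll, length(enroll), replace = TRUE)))

mu0 <- 55
# Approach used in this worksheet: share of bootstrap means at or above mu0
p_direct <- mean(boot_means >= mu0)

# Alternative: recenter the bootstrap means at mu0 to mimic the null hypothesis,
# then take the left tail at the observed sample mean (H1: mu < 55)
null_means <- boot_means - mean(enroll) + mu0
p_shifted <- mean(null_means <= mean(enroll))

print(c(direct = p_direct, shifted = p_shifted))
```

The two proportions would agree exactly only if the bootstrap distribution were perfectly symmetric about the sample mean, so a small gap between them is expected.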
Question
Why did we have a “>=” rather than a “<” if our extreme is to the left? Think about the definition of a p-value.
Question
Repeat the exercise for a right-tailed and a two-sided alternative. State the new hypotheses clearly.
Appendix Stats Enrollment Table
Copy all of the text below and paste it into RStudio to complete the exercises.
StatsEnrollment<-read.table(text="UniversityDepartment, FTGradEnrollment
Baylor University (Statistics), 26
Boston University (Biostatistics), 39
Brown University (Biostatistics), 21
Carnegie Mellon University (Statistics), 39
Case Western Reserve University (Statistics), 11
Colorado State University (Statistics), 14
Columbia University (Biostatistics), 64
Columbia University (Statistics), 196
Cornell University (Statistics), 78
Duke University (Statistics), 31
Emory University (Biostatistics), 58
Florida State University (Statistics), 47
George Mason University (Statistics), 10
George Washington University (Statistics), 9
Harvard University (Biostatistics), 70
Harvard University (Statistics), 67
Iowa State University (Statistics), 145
Johns Hopkins University (Biostatistics), 41
Kansas State University (Statistics), 44
Medical College of Georgia (Biostatistics), 11
Medical College of Wisconsin (Biostatistics), 7
Medical University of South Carolina (Biostatistics), 46
Michigan State University (Statistics), 81
New York University (Statistics), 6
North Carolina State University (Statistics), 163
North Dakota State University (Statistics), 25
Northwestern University (Statistics), 12
Ohio State University (Statistics), 101
Oklahoma State University (Statistics), 22
Oregon State University (Statistics), 30
Pennsylvania State University (Statistics), 75
Purdue University (Statistics), 85
Rice University (Statistics), 55
Rutgers University (Statistics), 111
Southern Methodist University (Statistics), 21
Stanford University (Statistics), 100
State University of New York - Buffalo (Biostatistics), 43
Temple University (Statistics), 40
Texas A&M University (Statistics), 101
University of Alabama - Birmingham (Biostatistics), 49
University of Arizona (Statistics), 3
University of California - Berkeley (Biostatistics), 36
University of California - Berkeley (Statistics), 58
University of California - Davis (Statistics), 34
University of California - Los Angeles (Biostatistics), 60
University of California - Los Angeles (Statistics), 72
University of California - Riverside (Statistics), 54
University of California - Santa Barbara (Statistics), 53
University of Chicago (Statistics), 109
University of Cincinnati (Biostatistics), 31
University of Connecticut (Statistics), 45
University of Florida (Statistics), 68
University of Georgia (Statistics), 59
University of Illinois (Statistics), 58
University of Iowa (Biostatistics), 35
University of Iowa (Statistics), 75
University of Kentucky (Statistics), 40
University of Massachusetts - Amherst (Biostatistics), 19
University of Michigan (Biostatistics), 117
University of Michigan (Statistics), 108
University of Minnesota (Biostatistics), 48
University of Minnesota (Statistics), 47
University of Missouri (Statistics), 58
University of Nebraska (Statistics), 44
University of North Carolina (Biostatistics), 118
University of North Carolina (Statistics), 78
University of Pennsylvania (Statistics), 23
University of Pittsburgh (Statistics), 32
University of Rochester (Biostatistics), 18
University of South Carolina (Biostatistics), 45
University of South Carolina (Statistics), 32
University of Texas - Houston (Biostatistics), 62
University of Virginia (Statistics), 43
University of Washington (Biostatistics), 68
University of Washington (Statistics), 53
University of Wisconsin (Statistics), 116
University of Wyoming (Statistics), 11
Virginia Commonwealth University (Biostatistics), 24
Virginia Commonwealth University (Statistics), 15
Virginia Polytechnic Institute (Statistics), 60
Western Michigan Statistics (Statistics), 31
Yale University (Statistics), 36", sep=",", header=TRUE)