In our last class, we discussed how to use the bootstrap to build confidence intervals. However, we have not tried to code it yet. Today, we are going to practice the coding and explore some properties of the bootstrap.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the training data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.
source("http://www.openintro.org/stat/data/cdc.R")
We have 9 variables in our training data.
genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor.
exerany: indicates whether the respondent exercised in the past month (1) or did not (0).
hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0).
smoke100: indicates whether the respondent had smoked at least 100 cigarettes in their lifetime.
height: in inches.
weight: in pounds.
wtdesire: desired weight in pounds.
age: in years.
gender: biological sex, limited to male/female.
We are going to make one quick change.
cdc$wtdesire <- cdc$weight - cdc$wtdesire
This means that wtdesire now represents how many pounds a person would like their weight to change. A positive number is the number of pounds a person would like to lose, and a negative number is the number of pounds a person would like to gain.
Let's look at two particular variables in this data set, weight and height. We are going to model weight as our response variable, using height as our feature variable. Note that weight is recorded in pounds and height is in inches. We want to use the following LSLR model:
\(\text{weight}_i = \beta_0 + \beta_1 \, \text{height}_i + \epsilon_i\)
Use lm to train the LSLR regression model on these data, and write out the trained regression line. Verify that your sample statistics match your answers from the previous question.
These values we have just obtained are sample statistics, meaning estimates computed from a sample. We are generally more interested in using these sample statistics to learn about population parameters, such as the population value of the regression slope.
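As a minimal sketch of how the intercept and slope can be read off a fitted model, the code below fits an LSLR line with lm and extracts the estimates with coef. The data frame toy is simulated purely so the snippet runs without the download; it is not the BRFSS sample, and the names toy, fit, beta0_hat and beta1_hat are illustrative assumptions. In the lab you would pass cdc as the data instead.

```r
set.seed(1)

# Simulated stand-in data (an assumption for illustration, NOT the BRFSS sample)
toy <- data.frame(height = rnorm(200, mean = 67, sd = 4))
toy$weight <- -190 + 5.4 * toy$height + rnorm(200, mean = 0, sd = 30)

# Fit weight as the response and height as the feature
fit <- lm(weight ~ height, data = toy)

# Sample statistics: the estimated intercept and slope
beta0_hat <- coef(fit)[["(Intercept)"]]
beta1_hat <- coef(fit)[["height"]]
beta0_hat
beta1_hat
```

The slope from lm should agree with the usual closed form, the sample covariance of height and weight divided by the sample variance of height.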
To use the sample statistics to learn about the population parameter, we need to build a confidence interval. Today, we are going to use the bootstrap to help us do that.
The first step in the bootstrap is to create a single bootstrap sample. We do this by sampling with replacement from our training data.
Why do we do this? Well, we want to create a sampling distribution, meaning a distribution of a sample statistic, in this case \(\hat{\beta}_1\). This means we want to know which values \(\hat{\beta}_1\) can take on, and how often it takes on each value.
In theory, we do this by reaching into the population, getting a training sample with n rows, training our model, and obtaining \(\hat{\beta}_1\). We then repeat this many, many times, and plot all the values of \(\hat{\beta}_1\) to see the distribution.
With these data, we are not able to reach back into the population to get more samples; we have to deal with the data that we have. In fact, we can almost never obtain the sampling distribution by sampling from the population. Instead, we either (1) use some sort of theory that tells us what the sampling distribution would be (like the CLT) or we (2) estimate the sampling distribution by using simulation. The bootstrap helps us with (2)!
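To make the "in theory" process concrete, here is a small sketch in which we pretend to have access to an entire population. Everything about the population is simulated, and the names population, reps and slopes are assumptions for illustration. With real data this step is exactly what we cannot do, which is why the bootstrap is needed.

```r
set.seed(1)

# Simulated "population" (an assumption for illustration; not the BRFSS data)
N <- 100000
population <- data.frame(height = rnorm(N, mean = 67, sd = 4))
population$weight <- -190 + 5.4 * population$height + rnorm(N, mean = 0, sd = 30)

n <- 200      # rows in each training sample
reps <- 500   # number of fresh samples drawn from the population
slopes <- numeric(reps)

for (r in 1:reps) {
  # Reach into the population for a brand new training sample
  rows <- sample(1:N, n, replace = FALSE)
  fit <- lm(weight ~ height, data = population[rows, ])
  # Keep the slope estimate from this sample
  slopes[r] <- coef(fit)[["height"]]
}

# The collection of slopes approximates the sampling distribution of the slope
summary(slopes)
sd(slopes)
```

Each pass through the loop uses a different sample, so the spread of slopes shows how much the estimate varies from sample to sample.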
The first step is to create a bootstrap sample of the same dimensions as our original data.
Create a bootstrap sample called Sample1 and show your code. Hint: The key is to use replace = TRUE in your code.
Now that we have our bootstrap training sample, let's use it.
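A single bootstrap sample, and the slope it produces, can be sketched as follows. The data frame toy is simulated so the snippet runs on its own and is only a stand-in for cdc; treat this as one possible approach under that assumption rather than the required answer.

```r
set.seed(1)

# Simulated stand-in for the training data (NOT the BRFSS sample)
toy <- data.frame(height = rnorm(50, mean = 67, sd = 4))
toy$weight <- -190 + 5.4 * toy$height + rnorm(50, mean = 0, sd = 30)

n <- nrow(toy)

# Draw n row numbers with replacement: some rows repeat, others are left out
ChosenRows <- sample(1:n, n, replace = TRUE)

# The bootstrap sample has the same dimensions as the original data
Sample1 <- toy[ChosenRows, ]
dim(Sample1)

# Train the model on the bootstrap sample to get one bootstrap slope
boot_slope <- coef(lm(weight ~ height, data = Sample1))[["height"]]
boot_slope
```

Because the rows are drawn with replacement, Sample1 generally differs from the original data even though it has the same number of rows.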
Okay, so now we have obtained \(\hat{\beta}_1^{(1)}\), our first bootstrap statistic! However, one value is not enough to create a distribution. We are going to need to do this process again...and again...and again.
set.seed(2021)
n <- nrow(cdc)  # number of rows in the training data
B <- 10         # number of bootstrap samples
for (i in 1:B) {
  # Decide which rows will go in your bootstrap sample
  ChosenRows <- sample(1:n, n, replace = TRUE)
  # Create your bootstrap sample from these rows
  # Train the regression model
  # Store the intercept
  # Store the slope
}
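One way the commented steps in the loop might be filled in is sketched below. It runs on a simulated stand-in data frame called toy rather than cdc, and the names BootSample, fit, beta0 and beta1 are assumptions chosen to match the plotting code used later in the lab.

```r
set.seed(2021)

# Simulated stand-in for the training data (NOT the BRFSS sample)
toy <- data.frame(height = rnorm(200, mean = 67, sd = 4))
toy$weight <- -190 + 5.4 * toy$height + rnorm(200, mean = 0, sd = 30)

n <- nrow(toy)   # rows in the training data
B <- 10          # number of bootstrap samples

# Empty vectors to hold the bootstrap statistics
beta0 <- numeric(B)
beta1 <- numeric(B)

for (i in 1:B) {
  # Decide which rows will go in the bootstrap sample
  ChosenRows <- sample(1:n, n, replace = TRUE)
  # Create the bootstrap sample from these rows
  BootSample <- toy[ChosenRows, ]
  # Train the regression model
  fit <- lm(weight ~ height, data = BootSample)
  # Store the intercept
  beta0[i] <- coef(fit)[["(Intercept)"]]
  # Store the slope
  beta1[i] <- coef(fit)[["height"]]
}

beta1
```

Creating beta0 and beta1 before the loop means each iteration only fills in one position, which keeps the loop simple when B is increased.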
Right now, we only have 10 bootstrap statistics, so we can use a dot plot to visualize the bootstrap sampling distribution that we have so far. To do this, we can use the code below.
library(ggplot2)
ggplot() +
  geom_dotplot(aes(x = beta1)) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Hint: set the binwidth argument. This goes in the geom_dotplot(aes(x = beta1), binwidth = ) part of the code, and indicates how we want to group data for the dot plot. Try .005.
In this lab, we have explored how the bootstrap can be used to build confidence intervals. We will use the idea of the bootstrap a few more times this semester.
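One common way to turn a set of bootstrap slopes into a confidence interval is the percentile method. That choice is an assumption here, since the text above does not say which interval the course uses. The sketch below again uses a simulated stand-in data frame, toy, and a larger, arbitrarily chosen B so the percentiles are more stable.

```r
set.seed(2021)

# Simulated stand-in for the training data (NOT the BRFSS sample)
toy <- data.frame(height = rnorm(200, mean = 67, sd = 4))
toy$weight <- -190 + 5.4 * toy$height + rnorm(200, mean = 0, sd = 30)

n <- nrow(toy)
B <- 1000              # assumed number of bootstrap samples
beta1 <- numeric(B)

for (i in 1:B) {
  # Resample rows with replacement and refit the model
  ChosenRows <- sample(1:n, n, replace = TRUE)
  fit <- lm(weight ~ height, data = toy[ChosenRows, ])
  beta1[i] <- coef(fit)[["height"]]
}

# 95% percentile interval: the middle 95% of the bootstrap slopes
ci <- quantile(beta1, probs = c(0.025, 0.975))
ci
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the bootstrap slopes, so no distributional formula is needed.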