STA 363 Lab 4: The Bootstrap

Goal

In our last class, we discussed how to use the bootstrap to build confidence intervals. However, we have not tried to code it yet. Today, we are going to practice the coding and explore some properties of the bootstrap.

The Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the training data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.

source("http://www.openintro.org/stat/data/cdc.R")

We have 9 variables in our training data.

genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor.

exerany: indicates whether the respondent exercised in the past month (1) or did not (0).

hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0)

smoke100: indicates whether the respondent had smoked at least 100 cigarettes in their lifetime

height: in inches

weight: in pounds

wtdesire: desired weight in pounds

age: in years

gender: biological sex, limited to male/female.

We are going to make one quick change.

cdc$wtdesire <- cdc$weight - cdc$wtdesire

This means that wtdesire now represents how many pounds a person would like their weight to change. A positive number is the number of pounds a person would like to lose, and a negative number is the number of pounds a person would like to gain.
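To make the sign convention concrete, here is a small sketch using three made-up people. The vectors weight_ex and wtdesire_ex are hypothetical values chosen for illustration, not rows from the cdc data.

```r
# Hypothetical current and desired weights (in pounds) for three people
weight_ex   <- c(180, 150, 130)
wtdesire_ex <- c(160, 150, 140)

# Same calculation as above: current weight minus desired weight
change_ex <- weight_ex - wtdesire_ex

# Person 1 wants to lose 20 pounds, person 2 wants no change,
# and person 3 wants to gain 10 pounds (shown as -10)
change_ex
```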

Let's look at two particular variables in this data set, weight and height. We are going to model weight as our response variable, using height as our feature variable. Note that weight is recorded in pounds and height is in inches. We want to use the following LSLR model:

$$Weight_i = \beta_0 + \beta_1 Height_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma).$$

Create a design matrix for this LSLR regression model. Show your code. Hint: To show your code, change the header of your chunk to say {r, echo = TRUE}

Using appropriate matrix algebra, what are the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for this model? Show your work. Note: You may not use lm.

Now use lm to train the LSLR regression model on these data, and write out the trained regression line. Verify that your sample statistics match your answers from the previous question.
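As a rough sketch of the idea behind the last few questions, the code below builds a design matrix, solves the normal equations $(X^TX)\hat{\beta} = X^Ty$, and compares the result to lm. It runs on simulated stand-in data rather than the cdc data, and every object name in it (height_toy, weight_toy, X_toy, and so on) is made up for illustration. It is one possible approach, not necessarily the one from our notes.

```r
# Simulated stand-in data (hypothetical values, NOT the cdc data)
set.seed(363)
n_toy <- 50
height_toy <- rnorm(n_toy, mean = 67, sd = 4)                 # inches
weight_toy <- -180 + 5 * height_toy + rnorm(n_toy, sd = 20)   # pounds

# Design matrix: a column of 1s for the intercept, then the feature
X_toy <- cbind(1, height_toy)
y_toy <- weight_toy

# Solve the normal equations (X'X) beta = X'y for beta
beta_hat_toy <- solve(t(X_toy) %*% X_toy, t(X_toy) %*% y_toy)

# Fit the same model with lm and check that the estimates agree
fit_toy <- lm(weight_toy ~ height_toy)
matches_lm <- isTRUE(all.equal(as.numeric(beta_hat_toy),
                               as.numeric(coef(fit_toy))))
matches_lm
```

The same steps should carry over to the cdc data by swapping in the real weight and height columns.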

The values we have just obtained are sample statistics, meaning estimates computed from a sample. We are generally more interested in using these sample statistics to learn about population parameters, such as the population value of the regression slope.

To use the sample statistics to learn about the population parameter, we need to build a confidence interval. Today, we are going to use the bootstrap to help us do that.

Obtain a Bootstrap Sample

The first step in the bootstrap is to create a single bootstrap sample. We do this by sampling with replacement from our training data.

Why do we do this? Well, we want to create a sampling distribution, meaning a distribution of a sample statistic, in this case $\hat{\beta}_1$. This means we want to know which values $\hat{\beta}_1$ can take on, and how often it takes on each value.

In theory, we do this by reaching into the population, getting a training sample with n rows, training our model, and obtaining $\hat{\beta}_1$. We then do this many many times, and plot all the values of $\hat{\beta}_1$ to see the distribution.

In 1-2 sentences, explain why this might not be possible for these data.

With these data, we are not able to reach back into the population to get more samples; we have to deal with the data that we have. In fact, we can almost never obtain the sampling distribution by sampling from the population. Instead, we either (1) use some sort of theory that tells us what the sampling distribution would be (like the CLT) or we (2) estimate the sampling distribution by using simulation. The bootstrap helps us with (2)!
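To see what the "in theory" process would look like if we could keep reaching into the population, here is a minimal sketch using a made-up population. The population itself, its size, and the true slope of 5 are all assumptions chosen for illustration; with real data like the cdc data, we never have this luxury.

```r
set.seed(363)

# Hypothetical population of 100,000 people with a true slope of 5
N_pop <- 100000
pop_height <- rnorm(N_pop, mean = 67, sd = 4)
pop_weight <- -180 + 5 * pop_height + rnorm(N_pop, sd = 20)

n_draw <- 200     # size of each training sample
n_reps <- 500     # number of fresh samples taken from the population
slopes_pop <- numeric(n_reps)

for (r in 1:n_reps) {
  # A brand new sample from the population (no replacement needed here)
  rows_r <- sample(1:N_pop, n_draw)
  slopes_pop[r] <- coef(lm(pop_weight[rows_r] ~ pop_height[rows_r]))[2]
}

# These 500 slopes approximate the sampling distribution of the sample slope
summary(slopes_pop)
```

The bootstrap mimics this loop, but replaces the fresh samples from the population with resamples of the one data set we actually have.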

The first step is to create a bootstrap sample of the same dimensions as our original data.

Create a single bootstrap sample from your training data. Call the sample Sample1 and show your code. Hint: The key is to use replace = TRUE in your code.
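As an illustration of what replace = TRUE does, here is a small sketch on a toy data frame with five rows. The data frame toy and its values are hypothetical; the same pattern should apply to the cdc data.

```r
# Toy data frame with 5 rows (hypothetical values, not the cdc data)
toy <- data.frame(id = 1:5,
                  height = c(62, 65, 68, 70, 73),
                  weight = c(120, 150, 165, 180, 210))

set.seed(1)
n_rows <- nrow(toy)

# Draw n row numbers WITH replacement, so a row can be chosen more than once
chosen_toy <- sample(1:n_rows, n_rows, replace = TRUE)

# Build the bootstrap sample from the chosen rows
toy_boot <- toy[chosen_toy, ]

dim(toy_boot)       # same dimensions as the original toy data
table(chosen_toy)   # how many times each original row was chosen
```

Because we sample with replacement, some rows typically appear more than once and others not at all, which is what makes each bootstrap sample different from the original data.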

Now that we have our bootstrap training sample, let's use it.

Train an LSLR regression model on the bootstrap training data, and write out the regression line. Comment on how the value of the sample slope compares to the value you obtained on the original training data. Note: Notation-wise, we use $\hat{\beta}_1$ to represent the sample statistic from the original training data, and $\hat{\beta}_1^{(s)}$ to represent the sample statistic value from bootstrap sample s. This means the sample slope from our first bootstrap sample is denoted $\hat{\beta}_1^{(1)}$.

Okay, so now we have obtained $\hat{\beta}_1^{(1)}$, our first bootstrap statistic! However, one value is not enough to create a distribution. We are going to need to do this process again...and again...and again.

We are about to repeat a coding process over and over. What type of code structure would be helpful for us to use?

Using the code structure from our notes, create B = 10 bootstrap samples. Make sure you annotate your code, and show your code.

set.seed(2021)
n <- nrow(cdc)  # number of rows in the training data
B <- 10         # number of bootstrap samples
for(i in 1:B){
  # Decide which rows will go in your bootstrap sample
  ChosenRows <- sample(1:n, n, replace = TRUE)
  # Create your bootstrap sample from these rows
  # Train the regression model
  # Store the intercept
  # Store the slope
}

Note: You may not use any built in bootstrap functions; you must build your own following this structure. This helps us to understand how the process works.
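For reference, here is one way a loop with this structure might look when run end to end. It is only a sketch on simulated stand-in data, and the names sim, B_sim, beta0_sim, and beta1_sim are made up for this example. Your own code should use the cdc data and your own annotations.

```r
set.seed(2021)

# Simulated stand-in data (hypothetical, not the cdc data)
n_sim <- 100
sim <- data.frame(height = rnorm(n_sim, mean = 67, sd = 4))
sim$weight <- -180 + 5 * sim$height + rnorm(n_sim, sd = 20)

B_sim <- 10
beta0_sim <- numeric(B_sim)   # storage for the bootstrap intercepts
beta1_sim <- numeric(B_sim)   # storage for the bootstrap slopes

for (i in 1:B_sim) {
  rows_i <- sample(1:n_sim, n_sim, replace = TRUE)   # rows for this sample
  boot_i <- sim[rows_i, ]                            # the bootstrap sample
  fit_i  <- lm(weight ~ height, data = boot_i)       # train the model
  beta0_sim[i] <- coef(fit_i)[1]                     # store the intercept
  beta1_sim[i] <- coef(fit_i)[2]                     # store the slope
}

beta1_sim   # the B bootstrap slopes
```

Creating the storage vectors before the loop means each pass only has to fill in position i, rather than growing a vector as it goes.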

The Bootstrap Sampling Distribution

Right now, we only have 10 bootstrap statistics, so we can use a dot plot to visualize the bootstrap sampling distribution that we have so far. To do this, we can use the code below.

ggplot() + 
 geom_dotplot(aes(x = beta1)) + 
 theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())

Using the code above, create a dotplot to visualize the sampling distribution so far. Make sure to add appropriate figure labels!

Use your code from Question 8 to create B = 100 bootstrap samples, and display the bootstrap sampling distribution dotplot. You do not need to show your code in this question, just the plot.

Repeat for B = 1000. If your distribution is too tall, we can fix it using the binwidth argument. This goes in the geom_dotplot(aes(x = beta1), binwidth = ) part of the code, and indicates how we want to group data for the dot plot. Try 0.005.

Create a 95% bootstrap confidence interval for each of your three bootstrap sampling distributions: B = 10, B = 100, and B = 1000. Report all three intervals.
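One common way to turn bootstrap statistics into an interval is the percentile method, which takes the middle 95% of the bootstrap slopes. The sketch below assumes that method (check our notes for the approach we used in class) and runs on simulated stand-in data, with made-up names such as slopes_sim and ci_95.

```r
set.seed(2021)

# Simulated stand-in data (hypothetical, not the cdc data)
n_sim <- 100
height_sim <- rnorm(n_sim, mean = 67, sd = 4)
weight_sim <- -180 + 5 * height_sim + rnorm(n_sim, sd = 20)

# Collect B bootstrap slopes
B_sim <- 1000
slopes_sim <- numeric(B_sim)
for (i in 1:B_sim) {
  rows_i <- sample(1:n_sim, n_sim, replace = TRUE)
  slopes_sim[i] <- coef(lm(weight_sim[rows_i] ~ height_sim[rows_i]))[2]
}

# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrap slopes
ci_95 <- quantile(slopes_sim, probs = c(0.025, 0.975))
ci_95
```

With a small B such as 10, these two percentiles are estimated from very few values, which is worth keeping in mind when comparing your three intervals.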

Which of the three bootstrap intervals would you choose to use in practice, and why? Based on your choice, interpret the confidence interval for the population slope in terms of the data. Note: You will not receive credit for "We are 95% confident that ${\beta}_1$ is between X and X."

Conclusion

In this lab, we have explored how the bootstrap can be used to build confidence intervals. We will use the idea of the bootstrap a few times this semester.

Last updated 2021 March 3.

The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.

The data set used in this lab is the cdc data set, provided by OpenIntro.
