STA 112 Lab: Simulation

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

Motivation

We have been discussing inference procedures, which are processes we employ to use information from sample statistics to learn about about population parameters.

We have learned how to do both hypothesis tests and confidence intervals for the population slope \(\beta_1\) in LSLR models. Both of these procedures involve the standard error \(SE_{\hat{\beta}_1}\) of the sample slope.

In class, we saw a formula for computing the standard error. Today, we are going to do a simulation study to verify that result.

The Data

We are going to use data today from the Behavioral Risk Factor Surveillance System (BRFSS). This is a survey that is conducted annually in the United States, and the data is collected and processed by the Centers for Disease Control and Prevention (CDC).

We have \(N = 20000\) rows of data from this survey. To load the data, copy the code below into a chunk in your Markdown file and press play. It may take a moment for this to load - it’s a big data set!

source("http://www.openintro.org/stat/data/cdc.R")

For today, we are interested in two particular variables in this data set: X = height in inches and Y = weight in pounds. We are interested in seeing if there is a linear relationship between height and weight.

The Population data

We are going to start by exploring all \(N=20000\) rows of data.

Question 1

Create an appropriate visualization to explore the relationship between the X = the height of a person in inches and Y = the weight of that person in pounds. Make sure you have labeled your axes “Height (in inches)” and “Weight (in pounds)”, and title your graph Figure 1.

Question 2

Build the LSLR line for X = height and Y = weight. Write down the equation of the line as your answer to this question (you don’t have to graph it).

What you just built in Question 2 is the LSLR line you get using all \(N\) rows of the original data set. In other words, this is the value of the y-intercept and the slope that you get if you have access to all the data.

The entire point of performing inference is when we do not have access to the whole population. So, for today we are going to treat these \(N = 20000\) rows as a population. We are them going to take a sample from this population and see how the sample statistics that we get from that sample compare to the population values we just found in Question 2.

Why do we want to do that?? The point of the lab today is for us to explore the inference procedures we learned in class. To do that, it is helpful to actually know what the population value of the slope coefficient is so that we can check to see if that value is in our confidence interval, and if the standard error we computed in class matches what we get if we compute it ourselves from samples.

The first sample

Let’s take a sample of \(n = 1000\) rows from the population, meaning from the cdc data set. You will notice we have switched from \(N\) (population size) to \(n\) (sample size). Notation matters, especially when we are bouncing back and forth between quantities relating to a sample and those relating to the population.

To create our random sample of 1000 rows, put the following code in a chunk and press play.

set.seed(112)
# Randomly choose 1000 rows from the original data set
RowsInSample <- sample( 1:nrow(cdc), 1000, replace = F)

# Create a data set with just those rows
cdc_sample1 <- cdc[RowsInSample,c("height","weight")]

It’s not important that you understand the code, but please ask if you are curious! The important thing is that this code will draw a random sample of \(n=1000\) rows from our original population of \(N= 20000\). We then store our sample under the name cdc_sample1. If you look in your Environment (upper right panel of RStudio), this means you will see a data set called cdc_sample1.

Question 3

Build the LSLR line for X = height and Y = weight using the new data set cdc_sample1 that we just created. Write down the equation of the line as your answer to this question.

Question 4

Is the slope you found in Question 3 (a) the same as, (b) smaller than, or (c) larger than the population slope from Question 2?

The LSLR line we built in Question 3 involved a sample of data from a population. The LSLR line we built in Question 2 involved all the data in the population. This means that the value we got in Question 2 is a population parameter and the value we got in Question 3 is a sample statistic.

It is very common that we are not able to obtain data on the entire population, so typically we need to rely on inference techniques to use information from the sample statistic to learn about the value of the population parameter. Today, we actually have the population parameter, so we are going to try out our inference techniques and see how they work!

Finding the Standard Error of the sample slope

For both hypothesis tests and confidence intervals, we rely on the concept of the standard error of the sample slope, denoted by \(SE_{\hat{\beta}_1}\). The standard error of the sample slope describes the variability of the distribution of \(\hat{\beta}_1\). In other words, if we…

1. took many different random samples of size \(n=1000\) from the population
1. built a LSLR line on each sample
1. stored the \(\hat{\beta}_1\) estimate we got from each sample
1. and took the standard deviation of all the sample slopes,

the value we would get would be the standard deviation of \(\hat{\beta}_1\), which is called the standard error of \(\hat{\beta}_1\).

Steps 1-4 above are typically not something we can do. In practice we usually do not have the population. This means we cannot draw random samples over and over…we only have the data that we’ve got.

However, today we do have the population, so we can do Steps 1 - 4!

A Second Random Sample

Before we get to many, many random samples, let’s start off with just one more sample. To draw our second random sample, cdc_sample2, use the following code:

set.seed(12)
# Randomly choose 1000 rows from the original data set
RowsInSample <- sample( 1:nrow(cdc), 1000, replace = F)

# Create a data set with just those rows
cdc_sample2 <- cdc[RowsInSample,c("height","weight")]

There are only two differences between this code and the from the code we used to draw our first random sample. We changed the value in set.seed() and we named the sample cdc_sample2 rather than cdc_sample1.

Changing the name hopefully makes sense - we want to create two different data sets, one for each sample. But what’s a seed??

Seeds are how we control random sampling in R. The goal is to make sure that every time you turn on your computer or run a chunk, you get the same random sample. It also means that anyone else running your code will get the same results that you do.

Let’s see what we mean by that.

Question 5

Create a chunk and paste in the following the following code. Then, do the same thing again (create a new chunk and paste in the code) to create two identical chunks.

sample(1:10,1)
sample(1:10,1)

Run both chunks and report the values that you get.

Note: This code command tells R to sample (sample) one number ((,1)) from 1 to 10 (1:10,)).

Question 6

Create a chunk and paste in the following the following code. Then, do the same thing again (create a new chunk and paste in the code) to create two identical chunks.

set.seed(112)
sample(1:10,1,)
sample(1:10,1)

Run both chunks and report the values that you get.

Do you notice the difference in your results from Question 5 and Question 6? This is what set.seed() does. It makes sure that every time you run the code, the results stay the same. Imagine how annoying it would be if your random sample changed every time you knit your Lab!!

We took a little detour talking about seeds, so let’s remind ourselves of what we were trying to do. The goal is to:

1. take many different random samples of size \(n=1000\) from the population
1. build a LSLR line on each sample
1. store the \(\hat{\beta}_1\) estimate we got from each sample
1. and take the standard deviation of the sample slopes. We’ve done Step 1 for a second sample. Let’s do Step 2 and Step 3.

Question 7

Build the LSLR line for X = height and Y = weight using the second sample data set cdc_sample2 that we just created. Write down the estimate \(\hat{\beta}_1\) that you get from this sample.

Question 8

Is this second \(\hat{\beta}_1\) value you obtained larger or smaller than the value you got from the first random sample?

What we have just illustrated is called sampling variability. The value of the sample statistic depends upon the actual sample data that was used to compute it. This means if you change the sample, the value of the statistic might change.

The goal of our work today is to explore just that - what are the different kinds of values for \(\hat{\beta}_1\) we will get on different samples from the same population?

Many, many random samples

We have seen sampling variability in action with two samples. Now, let’s see what kinds of values for \(\hat{\beta}_1\) we get if we use many samples.

We are going to use the code below to generate \(S = 10000\) random samples from the original population. We will then find the estimate of \(\hat{\beta}_1\) on each sample and store that value.

To do this, copy the code below into a chunk and press play. Note: If this runs too slowly on your computer, change the 10000 to 5000 or 1000.

set.seed(112)
# Decide how many samples you want 
S <-  10000

# Create a space to store the slope coefficients 
hatbeta1 <- data.frame("Slope" = rep(NA, S))

# Do the following S = 10000 times: 
for( i in 1:S){
  # Obtain the random sample 
  S2 <- sample( 1:nrow(cdc), 1000, replace = F)
  cdc2 <- cdc[S2,c("height","weight")]
  
  # Build the LSLR line 
  m2 <- lm(cdc2$weight ~ cdc2$height)
  
  # Store the slope coefficient
  hatbeta1$Slope[i] <- m2$coefficients[2]
}

What in the world did we just create?? Take a look in your Environment (upper right tab of your RStudio screen where all your data sets are). You will see something called hatbeta1. This is a data set with \(S= 10000\) rows and 1 column. Each row represents one of the 10000 random samples we drew from the population, and the value that you see in that row gives us the value of \(\hat{\beta}_1\) we got using that sample.

Now, we are interested in seeing what the standard deviation of \(\hat{\beta}_1\) is across these 10000 samples! In other words, what is \(SE_{\hat{\beta}_1}\)???

sd(hatbeta1$Slope)

Question 9

Run the code in the chunk above to find the standard deviation of \(\hat{\beta}_1\).

In class, we used R to estimate the standard error. Would we have gotten something similar if we used R?

Question 10

Go back to the LSLR line you fit on the first random sample cdc_sample1. Using the summary command, what is the value of \(SE_{\hat{\beta}_1}\) you get?

So, this is pretty cool. When we used information from just our first random sample of \(n=1000\) rows, we were able to estimate the standard deviation of \(\hat{\beta}_1\) fairly well. And we can do it without needing to take all 10000 random samples from the population!!!

Now, are the value identical? No. The theory estimate that we see in Question 10 assumes we take infinitely many samples from the population. The value in Question 9 came from \(S=10000\) random samples. However, the two values are fairly similar, with the theory estimate being the more accurate of the two. The more random samples we draw, the more accurate the standard error we get from our simulation will be.

If you are interested in all of the statistical theory that made this possible, take STA 312!

The whole distribution

Since we have already created all of these estimates of \(\hat{\beta}_1\), let’s use them to create a histogram and explore the distribution of this sample statistic.

Question 11

Create a histogram of the sample slopes in hatbeta1. Remember, you can flip back to Lab 1 if you need a refresher on the code for how to do this!

Question 12

Is the distribution you see in Question 11 unimodal or multimodal? Is it skewed or symmetric?

We can also find the mean of this distribution using:

mean(hatbeta1$Slope)

Hey!! We have seen this number before! In Question 2, you determined that 5.39 is the slope of the LSLR line when you use the population data. In other words, the mean of the distribution of \(\hat{\beta}_1\) is the population parameter \(\beta_1\)!!

Okay, so we now know the standard deviation of the distribution of \(\hat{\beta}_1\) and the mean of this distribution. Why is this important?

Well, it shows us that when we draw random samples, the slopes we get from those random samples are centered on the population slope. Some are smaller, and some are larger, but we can use \(SE_{\hat{\beta}_1}\) to describe how far we expect the sample statistics to be from the mean.

Confidence Intervals

Remember that a confidence interval is a range of plausible values for a population parameter? We talked about the fact that when we have a sample, \(\hat{\beta}_1\) is our best guess, but we know that \(\hat{\beta}_1\) does not have to equal \(\beta_1\). So, we build our confidence interval using \[(\hat{\beta}_1 - t^{*}~SE_{\hat{\beta}_1}, \hat{\beta}_1 + t^{*}~SE_{\hat{\beta}_1})\] We just figured out the standard deviation of \(\hat{\beta_1}\), so all we need is \(t^{*}\) and we can build a confidence interval.

Question 13

What value of \(t^{*}\) do you need for a 95% confidence interval? Remember, \(n=1000\)

Question 14

Using the \(\hat{\beta}_1\) estimate from the first random sample (cdc_sample1), build and interpret a 95% confidence interval for \(\beta_1\). Use \(SE_{\hat{\beta}_1} = .27\), from your simulation.

Question 15

Is the true value of \(\beta_1\) (5.39) in this interval?

Verifying the confidence level

Great!! So for our first random sample, we have seen whether or not \(\beta_1\) falls in our interval. In class, we learned that for a 95% confidence interval, this means that if we took many many samples from the population and build confidence intervals on each one, 95% of those intervals would contain the true population parameter.

Well, we currently have \(S=10000\) samples.

Question 16

If our confidence level is correct, for how many of our random samples would the confidence interval for the population slope contain 5.39?

Now, let’s try it and see! Put the following code in a chunk and press play:

# Store the true value of the population slope 
beta1 <- 5.39 
# Store the critical value
tstar <- 1.96
# Add a column to hatbeta1
hatbeta1$InCI <- NA

for(i in 1:S){
  # Find the lower bound of the CI 
  CIlower <- hatbeta1$Slope[i] - tstar*.27
  # Find the upper bound of the CI 
  CIupper <- hatbeta1$Slope[i] + tstar*.27
  # Check to see if beta 1 is in the interval 
  hatbeta1$InCI[i] <- ifelse(beta1 < CIupper & beta1 > CIlower, "Yes", "No")
}

What did we just do? We created a 95% confidence interval using each of our \(S= 10000\) samples, and we created a new column called InCI in our hatbeta1 data set which indicates whether or not the true value \(\beta_1 = 5.39\) is included in that confidence interval. If we recorded a Yes, then \(\beta_1\) was in the confidence interval. If we recorded No, then \(\beta_1\) was not in the confidence interval.

Question 17

Did the confidence interval we built on the 3rd sample contain \(\beta_1\)?

If our confidence level is correct, then we should see about as many “Yes” outcomes as you reported in Question 16. To check, use this code:

table(hatbeta1$InCI)

Question 18

For how many of the 10000 samples did the confidence interval contain the true value?

Question 19

Is this close to the 95% we should have gotten?

Wrapping it up

We have seen that we can use simulation to verify some of the theory results we have observed in class. We have also explored sampling variability in practice!

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 February 8.