In a past class, we found a relationship in our sample data between annual murders per million and poverty. Below is our estimated regression line for the relationship between murder rate and poverty rate, according to our sample.
\[ \widehat{murders} = -29.91 + 2.56~poverty \]
A skeptic might tell us, well, this is just your sample. Maybe your sample is weird. Maybe there really isn't a relationship between murder rates and poverty. How might we respond? Our skeptic has asked us to defend our belief that this result from our sample data is actually indicative of a population relationship between annual murders per million (murder) and percent living in poverty (poverty).
Our skeptic has a point that our estimate of \(\hat{\beta}_1\) came from one sample. But does that really matter? If we had a different sample, would we have seen different results and obtained different estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)?
We're going to run a simulation study to find out.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.
source("http://www.openintro.org/stat/data/cdc.R")
To view the names of the variables, type the command
names(cdc)
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent's height in inches, weight in pounds, as well as their desired weight, wtdesire, age in years, and gender.
Let's look at two particular variables in this data set, weight and height. We are going to model weight as our response variable, using height as our explanatory variable. Note that weight is recorded in pounds and height in inches.
We are going to pretend that the cdc data set represents population data. Does it in reality? No. But for the purposes of this simulation, assume that it does. This means that our population values for \({\beta}_0\) and \({\beta}_1\) (notice that the hats are gone; this is population data) can be determined. Specifically, \({\beta}_0 = -192.74\) and \({\beta}_1 = 5.39\).
\[ {weight} = -192.74 + 5.39~height + \epsilon \]
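Where do these numbers come from? Since we are treating cdc as the population, the population values are simply the least-squares coefficients from fitting the line to all 20,000 rows. A quick check (this should reproduce the values above):
lm(weight ~ height, data = cdc)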
Typically, we work with data that is a sample from some population. To mimic that here, we are going to reach into this cdc population and grab a sample of 1000 data points, i.e., 1000 rows from cdc. Don't worry about understanding the code (though if you are interested, please ask!). In essence, what this code does is choose 1000 random values between 1 and 20000 (the number of rows in cdc). The sample command does this part, and we store the 1000 random numbers in the object S1. Then, to obtain our random sample of the cdc data, we select for our sample only those rows that are stored in S1. All other rows are excluded from our sample. The set.seed command ensures that everyone in the class gets the same random sample. Don't change the seed. If you are interested in learning more about random seeds and how they work, visit https://www.stata.com/manuals13/rsetseed.pdf and take a look at page 2.
set.seed(400)
S1 <- sample( 1:nrow(cdc), 1000, replace = F)
cdc1 <- cdc[S1,c("height","weight")]
The code above creates our sample, which we call cdc1. It contains 1000 rows from cdc, i.e., 1000 people with information in the cdc data have their information in our sample cdc1. Remember that: one sample = 1000 people. Now, let's use this sample to fit a linear regression model and obtain estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The 1000 data points in the sample are used to fit the model.
m_cdc1 <- lm(cdc1$weight ~ cdc1$height)
m1.coefs <- m_cdc1$coefficients
m1.coefs
According to this, we have \(\hat{\beta}_0 = -195.6\) and \(\hat{\beta}_1 = 5.43\).
So, we're not exactly on target, but this is a sample; we don't expect that an estimate obtained from a sample will exactly agree with the population values. Why? There are 20000 people in the population, and only 1000 in the sample. Information on 19000 people is excluded from cdc1. It makes sense that these 19000 people would have some influence on our estimation of the regression line. In statistical terms, the sample estimates differ from the population values.
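To see exactly how far off we are, we can subtract the population values from our sample estimates (a quick check using the m1.coefs object from above):
m1.coefs - c(-192.74, 5.39)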
This gives some credence to our skeptic's claim. If we use only our sample data to estimate the relationships in the data, we won't necessarily be reporting values that match the population. But we can use the values from our sample to make statements about the population values. More to come.
Let's plot the sample data and the regression line, to see what that looks like.
plot( x = cdc1$height, y = cdc1$weight, xlab = "Height", ylab = "Weight")
abline(m_cdc1)
Our skeptic claimed that the result we saw from our sample of murder data was not enough to make claims about the population. But do estimates really vary from sample to sample? We just used one sample to obtain estimates of our regression parameters. What if we had a different sample? Would the estimates be different? Let's find out.
Grab a second random sample of 1000 people from cdc. We use the same process as before, but the choice of a different seed means that this sample will contain a different set of rows from cdc than our first sample did. In other words, the 1000 people in cdc1 may be different from the 1000 people in cdc2.
set.seed(500)
S2 <- sample( 1:nrow(cdc), 1000, replace = F)
cdc2 <- cdc[S2,c("height","weight")]
Now that we have the second sample, let's fit a regression line.
m_cdc2 <- lm(cdc2$weight ~ cdc2$height)
m2.coefs <- m_cdc2$coefficients
m2.coefs
Hmm. For sample 1, \(\hat{\beta}_0 = -195.6\) and \(\hat{\beta}_1 = 5.43\). For sample 2, \(\hat{\beta}_0 = -188.6\) and \(\hat{\beta}_1 = 5.37\).
These are similar, but they are not exactly the same! This suggests that if we fit the model using different samples from the population, we can obtain different estimates of the regression parameters.

So the estimates can be different from sample to sample. This is called sampling variability, meaning that the variability (or differences) in our estimates of the regression parameters is due to differences in the samples.
But, how different are our estimates from sample to sample? Are there huge differences? Are they small? How can we tell?
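Before plotting anything, one simple way to compare is to stack the two sets of estimates side by side:
rbind(cdc1 = m1.coefs, cdc2 = m2.coefs)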
We know that our estimates of the regression coefficients are different when we use sample cdc1 to fit a line than when we use sample cdc2 to fit a line. If we draw the line using the estimates from cdc1, and then on the same plot draw the line using the estimates from cdc2, do we expect that the regression lines will be the same? Why or why not? Let's try it.
plot(cdc2$height, cdc2$weight, col = "white", xlab = "Height", ylab = "Weight", ylim = c(50,300))  # white points are invisible; this just sets up the axes
abline(m_cdc1, lty = 2, col = "red")   # regression line from sample cdc1 (dashed red)
abline(m_cdc2, col = "blue")           # regression line from sample cdc2 (solid blue)
The dashed (red) line is the regression line fit with sample cdc1 and the solid (blue) line is the regression line fit with sample cdc2. So, we can see that the lines are slightly different. The samples used to fit the lines are different, meaning that the regression line, i.e., the line that minimizes the sum of squared residuals for the sample, is different for the two samples. It looks like, potentially, there were a few more short individuals in cdc1 than in cdc2, resulting in a dashed line that is a little lower than the other line.
This is only two samples. What if we did more? Would this behavior continue?
Let's try taking 20 random samples (each containing information on 1000 people) from the cdc data, fitting a regression line to each, and see what happens.
set.seed(666)
coefs.store <- NULL   # will collect the estimated coefficients from each sample
plot( cdc2$height, cdc2$weight, col="white", xlab ="Height", ylab = "Weight", ylim = c(50,300))   # white points just set up the axes
for( i in 1:20){
  S2 <- sample( 1:nrow(cdc), 1000, replace = F)       # draw a fresh sample of 1000 rows
  cdc2 <- cdc[S2,c("height","weight")]                # keep height and weight for those rows
  m2 <- lm(cdc2$weight ~ cdc2$height)                 # fit the regression on this sample
  coefs.store <- rbind(coefs.store, m2$coefficients)  # store this sample's estimates
  abline(m2)                                          # add this sample's line to the plot
}
Again, we see that the lines are all slightly different, each reflecting a different random sample of the data.
Let's try taking 500 random samples from the cdc data, fitting a regression line to each, and see what happens.
set.seed(666)
coefs.store <- NULL
plot( cdc2$height, cdc2$weight, col="white", xlab ="Height", ylab = "Weight",ylim = c(50,300))
for( i in 1:500){
  S2 <- sample( 1:nrow(cdc), 1000, replace = F)
  cdc2 <- cdc[S2,c("height","weight")]
  m2 <- lm(cdc2$weight ~ cdc2$height)
  coefs.store <- rbind(coefs.store, m2$coefficients)
  abline(m2)
}
What we notice here is that near the center of the plot, close to the means of weight and height, the lines are more similar to one another. At the edges, the lines start to diverge. This behavior is due to the fact that different samples are likely to have different maximum and minimum data points, tilting each line up or down at the edges. Think about the population distribution of weight.
hist(cdc$weight)
We can see that most people in the population have a weight between 100 and 200 pounds, but we have some outliers. Any sample taken from this population will likely have a lot of individuals with weight values in the center of this population, but will also potentially have a few individuals that have weights that are more in the tails, i.e., to the right side of the distribution above. These individuals will be in some samples, but not others, resulting in regression lines with different slopes depending on who is in the sample and who is not.
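A numeric summary of the population weights backs this up; compare the maximum to the third quartile to see the long right tail.
summary(cdc$weight)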
This is an illustration of the fact that for each different sample, we will obtain different estimates of our regression parameters. In order to take our single sample, and generalize to making statements about the population, we have to get a handle on the variability of these estimates.
The code I used to perform the simulation did more than just plot lines. It also stored the estimates of our regression parameters for each line. The object coefs.store has 500 estimates of \(\hat{\beta}_0\) and 500 estimates of \(\hat{\beta}_1\). Think about that: each of the 500 samples of 1000 people produces one estimate of each. The second column of coefs.store, i.e., coefs.store[,2], stores the 500 estimates of \(\hat{\beta}_1\). Let's take a look at those 500 estimates of \(\hat{\beta}_1\).
plot( coefs.store[,2], ylab = "Estimates of Beta 1 for each line", xlab = "Simulation", pch = 20)
This plot is a little hard to interpret. We can see all 500 estimates, but it is hard to directly see the center, the spread, etc. What we can see is that each of the 500 samples provides a slightly different value of \(\hat{\beta}_1\).
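For a quick numeric sense of the center and the spread before we compute them one at a time, we can look at a five-number summary of the 500 estimates:
summary(coefs.store[,2])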
Let's determine the average estimate of \(\hat{\beta}_1\).
mean(coefs.store[,2])
Recall that our true population value is \({\beta}_1 = 5.39\). The mean we see here is roughly the same! This means that even if any one sample does not exactly estimate the population parameter, if we average all the estimates across our samples, we end up with a result that is very near the true value. In math terms,
\[ \bar{\hat{\beta}}_1 = \frac{1}{500} \sum_{i=1}^{500} \hat{\beta}_{i1} \approx \beta_1 \]
This makes sense. We expect that some samples from a population will contain larger values, some smaller, some overlapping. Averaging across these samples should give us a better idea of the true value of \({\beta}_1\) than any single sample. We can add this average to the plot.
plot( coefs.store[,2], ylab = "Estimates of Beta 1 for each line", xlab = "Simulation", pch = 20)
abline(h = mean(coefs.store[,2]), col = "red")
As we can see in the plot above, not all of the estimates of \(\hat{\beta}_1\) are the same, nor are they all equal to the mean. This means that we can discuss the variability in the estimates of \(\hat{\beta}_1\).
We have talked in class about the variability of data (i.e., the variance in a response variable y). For instance, we can compute the standard deviation of weight using the code
sd(cdc$weight)
However, in this case, we are not thinking about variability in the data. We are instead thinking about the variability of a statistic, i.e., a value that is computed from sample data. The sample mean, for instance, is a statistic. When we want to measure how much a sample statistic varies from sample to sample, we use the concept of standard error.
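As a quick illustration (with a new seed chosen just for this example, not part of the class simulation), notice that the sample mean of weight changes each time we draw a fresh sample of 1000 people:
set.seed(123)                                # illustration only; any seed works
mean(cdc$weight[sample(1:nrow(cdc), 1000)])  # mean of one sample...
mean(cdc$weight[sample(1:nrow(cdc), 1000)])  # ...differs from the mean of the next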
Estimates of the regression parameters are also sample statistics. We have computed 500 estimates of the sample statistic \(\hat{\beta}_1\). Each estimate came from a different sample. This means that the differences among these 500 estimates are caused by the fact that different samples contain different observations from the cdc data set.
For instance, we already know that we obtained different estimates of \(\hat{\beta}_1\) when we used two different samples from the cdc data. If we look at summaries of weight in each of the two samples, we'll see that they are not the same.
summary(cdc1$weight)
summary(cdc2$weight)
The differences we see in the estimates of \(\hat{\beta}_1\) are due to the differences in the values in the two samples. Because the differences in our estimates stem from the different samples, we call the variability of a sample statistic the sampling variability.
Let's try another visual to help understand this better. We already plotted our 500 estimates, but let's try a histogram.
hist(coefs.store[,2], main = "Histogram of estimates of beta 1", xlab = "hat beta 1")
This is a distribution! So, just like we have been plotting the distributions of data, we can plot the distribution of a sample statistic!
We have been using things like the mean and standard deviation to describe distributions of data. The standard deviation measures how much the data points vary from the mean. When we are talking about the distribution of a sample statistic, we use the mean and the standard error. The standard error is defined as the standard deviation of the distribution of a sample statistic. Basically, it tells us how much the sample statistic typically deviates from the center of that distribution.
How do we compute the standard error? Well, the standard error is the standard deviation of the sample statistic. Since we have 500 estimates of the sample statistic, we just compute the standard deviation of these 500 values.
mean(coefs.store[,2])
sd(coefs.store[,2])
Some of you are thinking: hey, I remember a \( \frac{\sigma}{\sqrt{n}} \) or something when we deal with the standard error. Where did that go? You are correct that when dealing with the distribution of the sample mean, we can use such a formula to approximate the standard error when we only have one sample. In that case, we don't get to see 500 estimates of our sample statistic. We only have one, and we have to figure out how to use that one to obtain some guess at the variability we would see if we could create many estimates. We'll see something similar next class. Right now, however, we actually have multiple estimates of our statistic of interest, and we have seen that the average of these estimates is a good estimate of the population parameter. This means the standard error can be estimated as the standard deviation of these 500 estimates.
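If you want to see this connection in action, here is a small sketch (again with an illustration-only seed): simulate 500 sample means of weight and compare their standard deviation to the \( \frac{\sigma}{\sqrt{n}} \) formula. The two numbers should be close (they won't match exactly, in part because we are sampling 1000 people from a finite population of 20000).
set.seed(321)   # illustration only
xbar.store <- replicate(500, mean(cdc$weight[sample(1:nrow(cdc), 1000)]))
sd(xbar.store)                 # standard error, estimated by simulation
sd(cdc$weight) / sqrt(1000)    # standard error from the sigma / sqrt(n) formula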
The standard error of our estimates of \(\hat{\beta}_1\), denoted \( SE_{\hat{\beta}_1}\), is .27. What does this tell us? It tells us that, on average, the estimates of \(\hat{\beta}_1\) that we obtained from the 500 samples vary from the mean of the distribution of \(\hat{\beta}_1\) (the red line in the plot above) by about \(\pm .27\).
Let's add lines to our plot at \(\pm .27\) from the mean.
plot( coefs.store[,2], ylab = "Estimates of Beta 1 for each line", xlab = "Simulation", pch = 20, col="gray")
abline(h = mean(coefs.store[,2]), col = "red")
abline(h = mean(coefs.store[,2])+.27, col = "black", lty = 2, lwd = 2)
abline(h = mean(coefs.store[,2])-.27, col = "black", lty = 2, lwd = 2)
We notice that a lot of our estimates for \(\hat{\beta}_1\) lie within one standard error (i.e., plus or minus .27) of the mean. Let's go out 2 standard errors.
plot( coefs.store[,2], ylab = "Estimates of Beta 1 for each line", xlab = "Simulation", pch = 20, col="gray")
abline(h = mean(coefs.store[,2]), col = "red")
abline(h = mean(coefs.store[,2])+.27, col = "black", lty = 2, lwd = 2)
abline(h = mean(coefs.store[,2])-.27, col = "black", lty = 2, lwd = 2)
abline(h = mean(coefs.store[,2])+2*.27, col = "blue", lty = 2, lwd = 2)
abline(h = mean(coefs.store[,2])-2*.27, col = "blue", lty = 2, lwd = 2)
Within two standard errors, we see that we capture almost all of the estimates of \(\hat{\beta}_1\). Specifically, we can compute the percent of the estimates that lie within two standard errors of the mean.
length(which(coefs.store[,2] > mean(coefs.store[,2])-2*.27 &
coefs.store[,2] < mean(coefs.store[,2])+2*.27))/ nrow(coefs.store)
95% of the estimates of \(\hat{\beta}_1\) are contained in the interval (4.84, 5.93), i.e., within two standard errors of the mean.
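An equivalent, more compact way to compute this proportion is to take the mean of a logical vector, since R counts each TRUE as 1 and each FALSE as 0:
mean( abs(coefs.store[,2] - mean(coefs.store[,2])) < 2*.27 )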
We have already seen that, from our simulation, we can estimate the mean and the standard error of \(\hat{\beta}_1\). Can we summarize this concisely? Maybe. Let's take a look at the shape of the distribution.
hist(coefs.store[,2], main = "Histogram of estimates of beta 1", xlab = "hat beta 1")
Hmm. This is looking vaguely familiar. We have a symmetric, unimodal distribution. It looks like a normal distribution! What does this mean?
We drew random samples (each the same size) from our population. We then fit a regression line using the sample data, and estimated \({\beta}_0\) and \({\beta}_1\). It turns out that if we plot all the different \(\hat{\beta}_1\) values, we end up with a normal distribution. This suggests that the distribution of \(\hat{\beta}_1\) is a normal distribution!
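One informal check of this claim: redraw the histogram on the density scale and overlay a normal curve whose mean and standard deviation come from our 500 estimates. If the curve hugs the bars, the normal model is a reasonable description.
hist(coefs.store[,2], freq = FALSE, main = "Histogram of estimates of beta 1", xlab = "hat beta 1")
curve(dnorm(x, mean = mean(coefs.store[,2]), sd = sd(coefs.store[,2])), add = TRUE, col = "red", lwd = 2)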
There is a formula for the exact parameters of this normal sampling distribution, but it is not necessary for this course. For those interested, it is listed here. Suppose \(\sigma\), the standard deviation of the error term, is known. Then:
\[ \hat{\beta}_1 \mid \sigma \sim N\!\left( \beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right) \]
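We can roughly check this formula against our simulation. The second parameter of the normal distribution above is a variance, so the standard error is its square root. Treating the full cdc data set as the population, \(\sigma\) is the residual standard deviation of the population fit, and for a random sample of 1000 people, \( \sum_{i=1}^{n}(x_i - \bar{x})^2 \) is roughly \( (n-1) \) times the variance of height. This is a sketch, not something you need to reproduce:
pop.fit <- lm(weight ~ height, data = cdc)
sigma.pop <- summary(pop.fit)$sigma                   # residual SD of the "population" fit
se.theory <- sigma.pop / sqrt(999 * var(cdc$height))  # approximate SE of the slope estimate
se.theory                                             # compare to sd(coefs.store[,2])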
What about \(\hat{\beta}_0\)? Let's look at the distribution of our 500 estimates of the intercept, stored in the first column of coefs.store.
hist(coefs.store[,1], main = "Beta 0")
Yes, we are still dealing with a normal, but we will have a different mean and standard error.
mean(coefs.store[,1])
sd(coefs.store[,1])
Now, we chose a sample of size 1000. Would our results change if we chose a larger sample size? Let's try a sample size of 5000 and find out.
We are going to recreate the plot from the simulation with a sample size of 1000, and then overlay the results from the simulation with a sample size of 5000.
set.seed(666)
coefs.store <- NULL
plot( cdc2$height, cdc2$weight, col="white", xlab ="Height",
ylab = "Weight", ylim = c(50,300))
for( i in 1:500){
  S2 <- sample( 1:nrow(cdc), 1000, replace = F)
  cdc2 <- cdc[S2,c("height","weight")]
  m2 <- lm(cdc2$weight ~ cdc2$height)
  coefs.store <- rbind(coefs.store, m2$coefficients)
  abline(m2)
}
set.seed(444)
coefs.store2 <- NULL
for( i in 1:500){
  S2 <- sample( 1:nrow(cdc), 5000, replace = F)   # note the larger sample size: 5000
  cdc2 <- cdc[S2,c("height","weight")]
  m2 <- lm(cdc2$weight ~ cdc2$height)
  coefs.store2 <- rbind(coefs.store2, m2$coefficients)
  abline(m2, col="grey")
}
abline( lm(cdc$weight~cdc$height), col="red",lty = 2, lwd = 3)
The gray lines are the regression lines from the samples with the larger sample size (5000). The black lines are the regression lines from the samples with the smaller sample size (1000). The true population regression line is the dashed red line.
Obtain the estimated mean and the standard error for (1) the first round of simulations, with sample size 1000, and (2) the second round, with sample size 5000.
mean(coefs.store2[,2])
sd(coefs.store2[,2])
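For comparison, here are the same quantities from the first round (sample size 1000). Since we re-ran that simulation above with the same seed, coefs.store holds the same 500 estimates as before.
mean(coefs.store[,2])
sd(coefs.store[,2])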
Our estimate of the mean hasn't changed much, but look at the standard error. It has gone way down. This mimics what we see in the plot with the regression lines.
To understand why, let's plot the distribution of \(\hat{\beta}_1\) from both simulations.
hist(coefs.store[,2], xlim = c(4.5,6.5), main ="Smaller sample size")
hist(coefs.store2[,2], xlim = c(4.5,6.5), main = "Larger sample size")
What do you notice? We notice that, with a larger sample size, there is less variability in the estimates of \(\hat{\beta}_1\). This means that our estimate of \(\hat{\beta}_1\) varies less from sample to sample.
In this simulation, we have illustrated that the estimates of our regression parameters differ depending on the sample we use. This means that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) each have a distribution of possible values. Specifically, each estimate is (approximately) normally distributed.
When we increase the size of the samples, we see a reduced standard error. This means that with larger samples, we can be more confident in our estimates of \({\beta}_0\) and \({\beta}_1\).
The next step is to take these new realizations and use them to help make formal statements about the population parameters based on these sample statistics. That's next week!