Please indicate who you collaborated with on this problem set:
For this exercise, we will mimic the tactile sampling you did in class with virtual sampling. We will use some data from the general social survey, an annual personal-interview survey conducted in the United States. The survey is designed to monitor changes in both social characteristics and attitudes.
For this exercise, the “study population” will not be the US population, but rather 2538 individuals living in a single neighborhood whose information are saved in the gss_cat data frame in the forcats package. As an analogy to the tactile sampling you did in class, the neighborhood is the “bowl” and the 2,538 people are the little balls. We will use sampling to estimate a certain population proportion \(p\) of interest:
The proportion of the study population who are divorced
Now you might ask yourself, if we have all 2538 individual’s saved in gss_cat then why are we sampling to estimate these population proportions \(p\)? Can’t we just compute both population proportions exactly? You are correct! We are only sampling from gss_cat as an exercise to study sampling variation. In a real life sampling study with a realistic study population, we won’t have access to everyone’s data!
First load the necessary packages:
library(ggplot2)
library(dplyr)
library(forcats)
library(moderndive)
library(readr)Next let’s explore the gss_cat data set in the forcats package using the glimpse() function (note that the forcats package was loaded automatically with tidyverse).
glimpse(gss_cat)## Observations: 21,483
## Variables: 9
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divor...
## $ age <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51...
## $ race <fct> White, White, White, White, White, White, White, White...
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not appl...
## $ partyid <fct> Ind,near rep, Not str republican, Independent, Ind,nea...
## $ relig <fct> Protestant, Protestant, Protestant, Orthodox-christian...
## $ denom <fct> Southern baptist, Baptist-dk which, No denomination, N...
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, ...
Use the View() function in the console to take a look at the data in the viewer. And type ?gss_cat into the console to see a description of the variables in this data set.
This data set includes many of years of data, and many variables. We will restrict our analysis to only the most recent year (2014), and to only the variable marital.
gss_14 <- gss_cat %>%
filter(year == 2014) %>%
select(marital)You can use the following to see what the different responses were for the marital status question:
gss_14 %>%
distinct(marital) | marital |
|---|
| Divorced |
| Married |
| Never married |
| Separated |
| Widowed |
| No answer |
We are going to use random sampling to estimate the population proportion \(p\) of the neighborhood that are divorced. We will start by taking a single virtual random sample of fifty human respondents. Be sure to View() the results to get a sense of what our random sample single_sample_50 looks like.
single_sample_50 <- gss_14 %>%
rep_sample_n(size = 50, reps = 1)Next, let’s calculate the sample proportion \(\widehat{p}\) of people who identified themselves as Divorced in our single replicate. Recall that the sample proportion \(\widehat{p}\) is an “estimate” of the population proportion \(p\) who are divorced. This estimate is based on our sample of size n = 50.
p_hat <- single_sample_50 %>%
summarize(divorced = sum(marital == "Divorced")) %>%
mutate(prop_divorced = divorced / 50)
p_hat| replicate | divorced | prop_divorced |
|---|---|---|
| 1 | 3 | 0.06 |
However, we are interested in studying how sampling variation affects our estimates. In other words we want to see how much p_hat will vary based on different random samples of size n = 50. In order to study this variation in \(\widehat{p}\), we need more than one value of \(\widehat{p}\), and hence more than one sample of size n = 50.
Let’s take not 1, but 1000 samples of size n = 50. In other words, 1000 “replicates”:
sample_n_50_1000_rep <- gss_14 %>%
rep_sample_n(size = 50, reps = 1000)Be sure to View() the results sample_n_50_1000_rep and scroll down to get a sense of these 1000 replicates look like and how the different replicates are differentiated. Note our sample size is still n = 50 i.e. how many humans are sampled in each replicate; we’re just simulating the act of taking many samples of size n = 50:
50 humans are simulated in each replicate
We will again calculate the sample proportion \(\widehat{p}\) of people who reported they were divorced for each of the 1000 replicates:
p_hat_n_50_1000_rep <- sample_n_50_1000_rep %>%
group_by(replicate) %>%
summarize(divorced = sum(marital == "Divorced")) %>%
mutate(prop_divorced = divorced / 50)Take a look at the first five rows of the results:
p_hat_n_50_1000_rep %>%
slice(1:5)| replicate | divorced | prop_divorced |
|---|---|---|
| 1 | 6 | 0.12 |
| 2 | 6 | 0.12 |
| 3 | 3 | 0.06 |
| 4 | 10 | 0.20 |
| 5 | 10 | 0.20 |
We plot the sampling distribution of these 1000 simulated values of the sample proportion \(\widehat{p}\) of divorced respondents with a histogram
ggplot(p_hat_n_50_1000_rep, aes(x = prop_divorced)) +
geom_histogram(binwidth = 0.05, color = "white", fill = "darkgreen") +
labs(x = "Sample proportion divorced", title = "Sampling distribution of p-hat based on n = 50") Finally we can calculate the standard error, which is the standard deviation of the sampling distribution.
p_hat_n_50_1000_rep %>%
summarize(SE = sd(prop_divorced))| SE |
|---|
| 0.0527385 |
Let’s now calculate the true population proportion \(p\) of Divorced respondents for the full gss_14 data set like so. In other words, this would be like doing a census and asking everyone in the neighborhood if they are divorced or not.
gss_14 %>%
summarize(divorced = sum(marital == "Divorced")) %>%
mutate(p_divorced = divorced / nrow(gss_14))| divorced | p_divorced |
|---|---|
| 411 | 0.1619385 |
Questions:
Divorced respondents for all n=2538 people surveyed?Answers:
Now use the rep_sample_n function to collect 1000 virtual samples of size n = 15.
Note: BE SURE TO NAME YOUR DATAFRAME WITH 1000 SAMPLES SOMETHING NEW, TO ENSURE YOU CAN DISTINGUISH IT FROM THE n = 50 SAMPLE ABOVE!
sample_n_15_1000_rep <- gss_14 %>%
rep_sample_n(size = 15, reps = 1000)Calculate the sample proportion \(\widehat{p}\) of people who reported they were divorced for each of the 1000 replicates:
p_hat_n_15_1000_rep <- sample_n_15_1000_rep %>%
group_by(replicate) %>%
summarize(divorced = sum(marital == "Divorced")) %>%
mutate(prop_divorced = divorced / 15)Plot the sampling distribution of these 1000 simulated values of the sample proportion \(\widehat{p}\) of divorced respondents with a histogram:
ggplot(p_hat_n_15_1000_rep, aes(x = prop_divorced)) +
geom_histogram(binwidth = 0.05, color = "white", fill = "light blue") +
labs(x = "Sample proportion divorced", title = "Sampling distribution of p-hat based on n = 15")Question:
Answer:
Calculate the standard error, which is the standard deviation of the sampling distribution.
p_hat_n_15_1000_rep %>%
summarize(SE = sd(prop_divorced))| SE |
|---|
| 0.0937101 |
Question:
Answer:
Now use the rep_sample_n function to collect 1000 virtual samples of size n = 400.
Note: BE SURE TO NAME YOUR DATAFRAME WITH 1000 SAMPLES SOMETHING NEW, TO ENSURE YOU CAN DISTINGUISH IT FROM THE n = 50 and n = 15 SAMPLES ABOVE!
sample_n_400_1000_rep <- gss_14 %>%
rep_sample_n(size = 400, reps = 1000)Calculate the proportion of people who reported they were divorced for each replicate.
p_hat_n_400_1000_rep <- sample_n_400_1000_rep %>%
group_by(replicate) %>%
summarize(divorced = sum(marital == "Divorced")) %>%
mutate(prop_divorced = divorced/400)Plot the sampling distribution of these 1000 simulated values of the sample proportion of divorced respondents with a histogram.
ggplot(p_hat_n_400_1000_rep, aes(x = prop_divorced)) +
geom_histogram(binwidth = 0.05, color = "white", fill = "light pink") +
labs(x = "Sample population divorced", title = "Sample distribution of p-hat based on n = 400")Question:
Answer:
Calculate the standard error, which is the standard deviation of the sampling distribution.
p_hat_n_400_1000_rep%>%
summarize(SE = sd(prop_divorced))| SE |
|---|
| 0.0168646 |
Question:
Answer:
Questions:
Answers:
Now let’s sample values of a numerical variable age in order to estimate the population mean age \(\mu\) of the “Independent” voters. The following will read in a data frame age_ind_voters that you should View() after loading:
age_ind_voters <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQF_uMZ6_F8Qk-qnFbXc3qIa77lDu3ZkSFDJpBp19eYG-COsEPcNHVvCwQtG66P87VheH9nqn2p6igY/pub?gid=72203649&single=true&output=csv")age_ind_voters contains the ages of all 4101 individuals in one neighborhood that identified their political party affiliation as “Independent”. This neighborhood will be our study population of interest and therefore the data in age_ind_voters represents a full census of the 4101 independent voters’ ages.
Use the rep_sample_n() function to sample 50 respondents from this population 1000 times. Save this as sample_n_50_1000_rep. Be sure to View() the results and scroll down to get a sense of these 1000 replicates look like. Note our sample size is still n = 50 i.e. how many humans are sampled in each replicate.
sample_n_50_1000_rep <- age_ind_voters %>%
rep_sample_n(size = 50, reps = 1000)Use group_by() and summarize() on sample_n_50_1000_rep to calculate the sample mean \(\bar{x}\) of age for each of the 1000 replicates; save these values in x_bar_n_50_1000_rep. We will use these values to construct the sampling distribution for the sample mean age \(\bar{x}\) based on samples of size n = 50. Be sure to View() the results to check your work!
x_bar_n_50_1000_rep <- sample_n_50_1000_rep %>%
group_by(replicate) %>%
summarize(mean_age = mean(age))Plot the sampling distribution of the 1000 sample means \(\bar{x}\) of age with a histogram. Give your histogram a meaningful x-axis label and title.
ggplot(x_bar_n_50_1000_rep, aes(x = mean_age)) +
geom_histogram(binwidth = 1, color = "white", fill = "light blue") +
labs(x = "mean age of sample population", title = "sample population based on n = 50")Numerically compute and output in your HTML document the values of:
# Center of sampling distribution
x_bar_n_50_1000_rep %>%
summarize(x_bar_centre = mean(mean_age))| x_bar_centre |
|---|
| 43.26866 |
# True population mean
pop_mean <- age_ind_voters %>%
summarize(meanage = mean(age))
pop_mean| meanage |
|---|
| 43.27262 |
Question: Why are these two values close?
Answer: The sampling distribution is bootstrapped and the taking the centre simulates taking the mean of multiple samples from the original population, which would in theory be more likely produce an estimate of the mean very close to the true mean.
Calculate the standard error of the sample mean age \(\bar{x}\), which is the standard deviation of the sampling distribution for \(\bar{x}\).
x_bar_n_50_1000_rep %>%
summarize(SE = sd(mean_age))| SE |
|---|
| 2.299385 |
Question: If our sampling were based on samples of size n = 100 individuals instead, what would happen to the standard error?
Answer: The standard error would decrease.