Collaboration

Please indicate who you collaborated with on this problem set:

Background

For this exercise, we will mimic the tactile sampling you did in class with virtual sampling. We will use some data from the general social survey, an annual personal-interview survey conducted in the United States. The survey is designed to monitor changes in both social characteristics and attitudes.

For this exercise, the “study population” will not be the US population, but rather 2538 individuals living in a single neighborhood whose information are saved in the gss_cat data frame in the forcats package. As an analogy to the tactile sampling you did in class, the neighborhood is the “bowl” and the 2,538 people are the little balls. We will use sampling to estimate a certain population proportion \(p\) of interest:

The proportion of the study population who are divorced

Now you might ask yourself, if we have all 2538 individual’s saved in gss_cat then why are we sampling to estimate these population proportions \(p\)? Can’t we just compute both population proportions exactly? You are correct! We are only sampling from gss_cat as an exercise to study sampling variation. In a real life sampling study with a realistic study population, we won’t have access to everyone’s data!

Setup

First load the necessary packages:

library(ggplot2)
library(dplyr)
library(forcats)
library(moderndive)
library(readr)

Next let’s explore the gss_cat data set in the forcats package using the glimpse() function (note that the forcats package was loaded automatically with tidyverse).

glimpse(gss_cat)
## Observations: 21,483
## Variables: 9
## $ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divor...
## $ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51...
## $ race    <fct> White, White, White, White, White, White, White, White...
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not appl...
## $ partyid <fct> Ind,near rep, Not str republican, Independent, Ind,nea...
## $ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian...
## $ denom   <fct> Southern baptist, Baptist-dk which, No denomination, N...
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, ...

Use the View() function in the console to take a look at the data in the viewer. And type ?gss_cat into the console to see a description of the variables in this data set.

Exploratory data wrangling

This data set includes many of years of data, and many variables. We will restrict our analysis to only the most recent year (2014), and to only the variable marital.

gss_14 <- gss_cat %>% 
  filter(year == 2014) %>% 
  select(marital)

You can use the following to see what the different responses were for the marital status question:

gss_14  %>% 
  distinct(marital) 
marital
Divorced
Married
Never married
Separated
Widowed
No answer

Demo

We are going to use random sampling to estimate the population proportion \(p\) of the neighborhood that are divorced. We will start by taking a single virtual random sample of fifty human respondents. Be sure to View() the results to get a sense of what our random sample single_sample_50 looks like.

single_sample_50 <- gss_14 %>% 
  rep_sample_n(size = 50, reps = 1)

Next, let’s calculate the sample proportion \(\widehat{p}\) of people who identified themselves as Divorced in our single replicate. Recall that the sample proportion \(\widehat{p}\) is an “estimate” of the population proportion \(p\) who are divorced. This estimate is based on our sample of size n = 50.

p_hat <- single_sample_50 %>% 
  summarize(divorced = sum(marital == "Divorced")) %>% 
  mutate(prop_divorced = divorced / 50)
p_hat
replicate divorced prop_divorced
1 3 0.06

a) Sampling a thousand replicates at n = 50

However, we are interested in studying how sampling variation affects our estimates. In other words we want to see how much p_hat will vary based on different random samples of size n = 50. In order to study this variation in \(\widehat{p}\), we need more than one value of \(\widehat{p}\), and hence more than one sample of size n = 50.

Let’s take not 1, but 1000 samples of size n = 50. In other words, 1000 “replicates”:

sample_n_50_1000_rep <- gss_14 %>% 
  rep_sample_n(size = 50, reps = 1000)

Be sure to View() the results sample_n_50_1000_rep and scroll down to get a sense of these 1000 replicates look like and how the different replicates are differentiated. Note our sample size is still n = 50 i.e. how many humans are sampled in each replicate; we’re just simulating the act of taking many samples of size n = 50:

50 humans are simulated in each replicate

b) Calculate sample proportion \(\widehat{p}\) for each replicate

We will again calculate the sample proportion \(\widehat{p}\) of people who reported they were divorced for each of the 1000 replicates:

p_hat_n_50_1000_rep <- sample_n_50_1000_rep %>% 
  group_by(replicate) %>% 
  summarize(divorced = sum(marital == "Divorced")) %>% 
  mutate(prop_divorced = divorced / 50)

Take a look at the first five rows of the results:

p_hat_n_50_1000_rep %>%
  slice(1:5)
replicate divorced prop_divorced
1 6 0.12
2 6 0.12
3 3 0.06
4 10 0.20
5 10 0.20

c) Visualize the sampling distribution of \(\widehat{p}\)

We plot the sampling distribution of these 1000 simulated values of the sample proportion \(\widehat{p}\) of divorced respondents with a histogram

ggplot(p_hat_n_50_1000_rep, aes(x = prop_divorced)) +
  geom_histogram(binwidth = 0.05, color = "white", fill = "darkgreen") +
  labs(x = "Sample proportion divorced", title = "Sampling distribution of p-hat based on n = 50") 

d) Standard error

Finally we can calculate the standard error, which is the standard deviation of the sampling distribution.

p_hat_n_50_1000_rep %>% 
  summarize(SE = sd(prop_divorced))
SE
0.0527385

e) Compare sampling distribution to true population parameter \(p\)

Let’s now calculate the true population proportion \(p\) of Divorced respondents for the full gss_14 data set like so. In other words, this would be like doing a census and asking everyone in the neighborhood if they are divorced or not.

gss_14 %>% 
  summarize(divorced = sum(marital == "Divorced")) %>% 
  mutate(p_divorced = divorced / nrow(gss_14))
divorced p_divorced
411 0.1619385

Questions:

  1. Does it look like our sampling distribution as displayed in the histogram is centered at the true population proportion of Divorced respondents for all n=2538 people surveyed?
  2. Is this what you expected?

Answers:

  1. Yes, the histogram is centered on ~ 0.16.
  2. I did expect this, since we collected a decently sized sample and ran 1000 reps, and since there is no reason to think that our R sampling is biased.

Question 1

a) Sampling a thousand replicates at n = 15

Now use the rep_sample_n function to collect 1000 virtual samples of size n = 15.

Note: BE SURE TO NAME YOUR DATAFRAME WITH 1000 SAMPLES SOMETHING NEW, TO ENSURE YOU CAN DISTINGUISH IT FROM THE n = 50 SAMPLE ABOVE!

sample_n_15_1000_rep <- gss_14 %>% 
  rep_sample_n(size = 15, reps = 1000)

b) Calculate sample proportion \(\widehat{p}\) for each replicate

Calculate the sample proportion \(\widehat{p}\) of people who reported they were divorced for each of the 1000 replicates:

p_hat_n_15_1000_rep <- sample_n_15_1000_rep %>% 
  group_by(replicate) %>% 
  summarize(divorced = sum(marital == "Divorced")) %>% 
  mutate(prop_divorced = divorced / 15)

c) Visualize the sampling distribution of \(\widehat{p}\)

Plot the sampling distribution of these 1000 simulated values of the sample proportion \(\widehat{p}\) of divorced respondents with a histogram:

ggplot(p_hat_n_15_1000_rep, aes(x = prop_divorced)) +
  geom_histogram(binwidth = 0.05, color = "white", fill = "light blue") +
  labs(x = "Sample proportion divorced", title = "Sampling distribution of p-hat based on n = 15")

Question:

  1. How does this sampling distribution compare to the one you made earlier with a sample size n of 50?

Answer:

  1. The bellcurve is more spread out. It is still centred around the true mean 0.16, but less so than when sample size n was 50. Many points are further away from the 0.16 mean.

d) Standard error

Calculate the standard error, which is the standard deviation of the sampling distribution.

p_hat_n_15_1000_rep %>% 
  summarize(SE = sd(prop_divorced))
SE
0.0937101

Question:

  1. How does this standard error compare to the one you made earlier with a sample size n of 50?

Answer:

  1. The standard error is much larger than (almost double) that when sample size n was 50.

Question 2

a) Sampling a thousand replicates at n = 400

Now use the rep_sample_n function to collect 1000 virtual samples of size n = 400.

Note: BE SURE TO NAME YOUR DATAFRAME WITH 1000 SAMPLES SOMETHING NEW, TO ENSURE YOU CAN DISTINGUISH IT FROM THE n = 50 and n = 15 SAMPLES ABOVE!

sample_n_400_1000_rep <- gss_14 %>% 
  rep_sample_n(size = 400, reps = 1000)

b) Calculate sample proportion \(\widehat{p}\) for each replicate

Calculate the proportion of people who reported they were divorced for each replicate.

p_hat_n_400_1000_rep <- sample_n_400_1000_rep %>% 
  group_by(replicate) %>% 
  summarize(divorced = sum(marital == "Divorced")) %>% 
  mutate(prop_divorced = divorced/400)

c) Visualize the sampling distribution of \(\widehat{p}\)

Plot the sampling distribution of these 1000 simulated values of the sample proportion of divorced respondents with a histogram.

ggplot(p_hat_n_400_1000_rep, aes(x = prop_divorced)) +
  geom_histogram(binwidth = 0.05, color = "white", fill = "light pink") +
  labs(x = "Sample population divorced", title = "Sample distribution of p-hat based on n = 400")

Question:

  1. How does this sampling distribution compare to the one you made earlier with a sample size n of 50?

Answer:

  1. The distribution of p-hats are much more tightly congregated near 0.16. It’s also got the more consistent p-hat values as compared to the sample size n = 50 distribution.

d) Standard error

Calculate the standard error, which is the standard deviation of the sampling distribution.

p_hat_n_400_1000_rep%>% 
  summarize(SE = sd(prop_divorced))
SE
0.0168646

Question:

  1. How does this standard error compare to the one you made earlier with a sample size n of 50?

Answer:

  1. The standard error is almost a fifth of that of sample size n of 50. The error is much smaller with the larger sample size.

Question 3

Questions:

  1. Which sampling distribution looked more normally distributed (bell shaped and symmetrical); the one built on n = 15, 50 or 400? Why?
  2. Let’s pretend the GSS organizers can no longer afford to do a full census of this neighborhood in the future due to budget cuts. As their stats adviser, what sample size would you recommend they use?

Answers:

  1. n = 400.The bigger the sample size, the closer to p and the narrower the sampling distribution.
  2. I would recommend they conduct as big an investigation as they can with the given budget. But if they can’t, for this case especially, n = 400 produces statistics very close to the original census.

Question 4

Now let’s sample values of a numerical variable age in order to estimate the population mean age \(\mu\) of the “Independent” voters. The following will read in a data frame age_ind_voters that you should View() after loading:

age_ind_voters <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQF_uMZ6_F8Qk-qnFbXc3qIa77lDu3ZkSFDJpBp19eYG-COsEPcNHVvCwQtG66P87VheH9nqn2p6igY/pub?gid=72203649&single=true&output=csv")

age_ind_voters contains the ages of all 4101 individuals in one neighborhood that identified their political party affiliation as “Independent”. This neighborhood will be our study population of interest and therefore the data in age_ind_voters represents a full census of the 4101 independent voters’ ages.

a) Sampling a thousand replicates at n = 50

Use the rep_sample_n() function to sample 50 respondents from this population 1000 times. Save this as sample_n_50_1000_rep. Be sure to View() the results and scroll down to get a sense of these 1000 replicates look like. Note our sample size is still n = 50 i.e. how many humans are sampled in each replicate.

sample_n_50_1000_rep <- age_ind_voters %>% 
  rep_sample_n(size = 50, reps = 1000)

b) Calculate sample mean \(\bar{x}\) for each replicate

Use group_by() and summarize() on sample_n_50_1000_rep to calculate the sample mean \(\bar{x}\) of age for each of the 1000 replicates; save these values in x_bar_n_50_1000_rep. We will use these values to construct the sampling distribution for the sample mean age \(\bar{x}\) based on samples of size n = 50. Be sure to View() the results to check your work!

x_bar_n_50_1000_rep <- sample_n_50_1000_rep %>% 
  group_by(replicate) %>%
  summarize(mean_age = mean(age))

c) Visualize the sampling distribution of \(\bar{x}\)

Plot the sampling distribution of the 1000 sample means \(\bar{x}\) of age with a histogram. Give your histogram a meaningful x-axis label and title.

ggplot(x_bar_n_50_1000_rep, aes(x = mean_age)) +
  geom_histogram(binwidth = 1, color = "white", fill = "light blue") +
  labs(x = "mean age of sample population", title = "sample population based on n = 50")

d) Center of sampling distribution

Numerically compute and output in your HTML document the values of:

  1. The center of this sampling distribution
  2. The true population mean age \(\mu\).
# Center of sampling distribution
x_bar_n_50_1000_rep %>% 
  summarize(x_bar_centre = mean(mean_age))
x_bar_centre
43.26866
# True population mean
pop_mean <- age_ind_voters %>% 
  summarize(meanage = mean(age))
pop_mean
meanage
43.27262

Question: Why are these two values close?

Answer: The sampling distribution is bootstrapped and the taking the centre simulates taking the mean of multiple samples from the original population, which would in theory be more likely produce an estimate of the mean very close to the true mean.

e) Spread of sampling distribution: standard error

Calculate the standard error of the sample mean age \(\bar{x}\), which is the standard deviation of the sampling distribution for \(\bar{x}\).

x_bar_n_50_1000_rep %>% 
  summarize(SE = sd(mean_age))
SE
2.299385

Question: If our sampling were based on samples of size n = 100 individuals instead, what would happen to the standard error?

Answer: The standard error would decrease.