Lab 5 - 606, Part 1: Foundations for statistical inference - Sampling distributions

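library(tidyverse)  # tibble(), count(), mutate(), ggplot(), and the %>% pipe
library(infer)      # rep_sample_n()

global_monitor <- tibble(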
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
global_monitor

## # A tibble: 100,000 × 1
##    scientist_work
##    <chr>         
##  1 Benefits      
##  2 Benefits      
##  3 Benefits      
##  4 Benefits      
##  5 Benefits      
##  6 Benefits      
##  7 Benefits      
##  8 Benefits      
##  9 Benefits      
## 10 Benefits      
## # … with 99,990 more rows

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

# sample_n() collects a simple random sample of size 50 from the global_monitor
# dataset, and the result is assigned to samp1. This is similar to randomly
# drawing names from a hat that contains the names of everyone in the population.
samp1 <- global_monitor %>%
  sample_n(50)
samp1

## # A tibble: 50 × 1
##    scientist_work 
##    <chr>          
##  1 Doesn't benefit
##  2 Benefits       
##  3 Benefits       
##  4 Doesn't benefit
##  5 Benefits       
##  6 Benefits       
##  7 Benefits       
##  8 Benefits       
##  9 Benefits       
## 10 Benefits       
## # … with 40 more rows

Exercise 1: Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample; however, be careful not to label your proportion p, since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.

print(samp1, n = 50)

## # A tibble: 50 × 1
##    scientist_work 
##    <chr>          
##  1 Doesn't benefit
##  2 Benefits       
##  3 Benefits       
##  4 Doesn't benefit
##  5 Benefits       
##  6 Benefits       
##  7 Benefits       
##  8 Benefits       
##  9 Benefits       
## 10 Benefits       
## 11 Doesn't benefit
## 12 Doesn't benefit
## 13 Benefits       
## 14 Benefits       
## 15 Benefits       
## 16 Doesn't benefit
## 17 Benefits       
## 18 Benefits       
## 19 Benefits       
## 20 Benefits       
## 21 Doesn't benefit
## 22 Doesn't benefit
## 23 Doesn't benefit
## 24 Benefits       
## 25 Benefits       
## 26 Benefits       
## 27 Benefits       
## 28 Benefits       
## 29 Benefits       
## 30 Benefits       
## 31 Doesn't benefit
## 32 Benefits       
## 33 Benefits       
## 34 Benefits       
## 35 Benefits       
## 36 Doesn't benefit
## 37 Benefits       
## 38 Benefits       
## 39 Doesn't benefit
## 40 Benefits       
## 41 Benefits       
## 42 Benefits       
## 43 Benefits       
## 44 Doesn't benefit
## 45 Benefits       
## 46 Doesn't benefit
## 47 Benefits       
## 48 Benefits       
## 49 Benefits       
## 50 Benefits

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

ggplot(samp1, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you? (Sample Version)"
  ) +
  coord_flip()

The distribution of responses in this sample is similar to that of the population. "Benefits" accounts for 74% of the sample (80% in the population), and "Doesn't benefit" accounts for 26% (20% in the population).

Exercise 2: Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

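I would not expect my sample proportion to match the sample proportion of another student's sample exactly, because each sample is a separate random draw of 50 people. I would expect the proportions to be somewhat different rather than very different: both samples come from the same population, so both should land reasonably close to the population values. Randomness still matters, though. If I roll a die a few times, I might get three 6s in a row and then a run of low values, and a sample of 50 can be off in the same way.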

Exercise 3: Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           39  0.78
## 2 Doesn't benefit    11  0.22

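samp2 is very similar to samp1: 78% of samp2 answered "Benefits", compared with 74% of samp1. Of the two larger samples, the one of size 1000 would give the more accurate estimate, because the variability of a sample proportion shrinks as the sample size grows, so larger samples tend to land closer to the population proportion.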
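To make the sample-size comparison concrete, here is a small simulation sketch in base R. It assumes the population proportion of 0.8 for "Benefits" from the table above and uses rbinom() as a stand-in for drawing samples with replacement from global_monitor. The helper name mean_abs_error is hypothetical and not part of the lab.

```r
# Sketch: how far does p_hat typically fall from p = 0.8 at each sample size?
# Assumption: rbinom() stands in for sampling with replacement from global_monitor.
set.seed(606)
p <- 0.8

# Hypothetical helper: average distance between p_hat and p over many samples
mean_abs_error <- function(n, reps = 5000) {
  p_hat <- rbinom(reps, size = n, prob = p) / n
  mean(abs(p_hat - p))
}

errors <- sapply(c(50, 100, 1000), mean_abs_error)
names(errors) <- c("n = 50", "n = 100", "n = 1000")
errors  # the typical error shrinks as n grows
```

The exact numbers depend on the seed, but the ordering should not: the sample of size 1000 is, on average, the closest to 0.8.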

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 4: How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

count(sample_props50)

## # A tibble: 15,000 × 2
## # Groups:   replicate [15,000]
##    replicate     n
##        <int> <int>
##  1         1     1
##  2         2     1
##  3         3     1
##  4         4     1
##  5         5     1
##  6         6     1
##  7         7     1
##  8         8     1
##  9         9     1
## 10        10     1
## # … with 14,990 more rows

There are 15,000 elements in sample_props50, one sample proportion for each of the 15,000 samples. The histogram above shows that the sampling distribution is centered at about 0.2 and is fairly symmetric, with little skew.

Exercise 5: To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

set.seed(1000)
sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")
sample_props_small

## # A tibble: 23 × 4
## # Groups:   replicate [23]
##    replicate scientist_work      n p_hat
##        <int> <chr>           <int> <dbl>
##  1         2 Doesn't benefit     4   0.4
##  2         3 Doesn't benefit     1   0.1
##  3         4 Doesn't benefit     1   0.1
##  4         5 Doesn't benefit     2   0.2
##  5         6 Doesn't benefit     1   0.1
##  6         7 Doesn't benefit     3   0.3
##  7         8 Doesn't benefit     3   0.3
##  8         9 Doesn't benefit     2   0.2
##  9        10 Doesn't benefit     2   0.2
## 10        11 Doesn't benefit     2   0.2
## # … with 13 more rows

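The code requests 25 samples, so the sampling distribution should contain 25 observations, but the printed tibble has only 23 rows (replicate 1, for instance, is missing). A sample of size 10 that contains no "Doesn't benefit" responses gets no row for that outcome from count(), so its proportion of 0 is lost after the filter step. Each observation represents the proportion of respondents in one sample of size 10 who believe that the work scientists do doesn’t benefit them.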
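A sample of size 10 can easily contain no "Doesn't benefit" responses at all (the probability is 0.8^10, about 0.11), and a count-then-filter pipeline produces no row for such a sample. The sketch below shows one way to keep those samples, with p_hat = 0, by taking the mean of a logical vector instead of counting. It uses base R on a rebuilt copy of the population rather than the lab's rep_sample_n pipeline; the names population and draw_p_hat are illustrative and not from the lab.

```r
# Rebuild the population used in this lab: 80,000 "Benefits", 20,000 "Doesn't benefit"
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

# Hypothetical helper: proportion of "Doesn't benefit" in one random sample.
# mean() of a logical vector returns 0 when the outcome never occurs,
# so samples without the outcome are kept instead of dropped.
draw_p_hat <- function(n) {
  mean(sample(population, size = n, replace = TRUE) == "Doesn't benefit")
}

set.seed(1000)
p_hats <- replicate(25, draw_p_hat(10))

length(p_hats)    # always 25, one proportion per sample
sum(p_hats == 0)  # number of samples with no "Doesn't benefit" responses
```

In the lab's own pipeline, the same idea would be to summarise each replicate with mean(scientist_work == "Doesn't benefit") instead of using count() followed by filter().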

ggplot(data = sample_props_small, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.05) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  )

# rep_sample_n() was used to compute a sampling distribution: specifically,
# the sampling distribution of the proportions from samples of 50 people.
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

# Same histogram as the one shown earlier for sample_props50, without the labels (code given in the assignment).

Because the sample proportion is an unbiased estimator, the sampling distribution is centered at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.

Exercise 6: Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How do the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

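library(shiny)  # shinyApp(), fluidPage(), and the input/output functions used below

shinyApp(
  ui <- fluidPage(
    
    # Sidebar with inputs for the outcome, sample size, number of samples, and binwidth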
    sidebarLayout(
      sidebarPanel(
        
        selectInput("outcome",
                    "Outcome of interest:",
                    choices = c("Benefits", "Doesn't benefit"),
                    selected = "Doesn't benefit"),
        
        numericInput("n_samp",
                     "Sample size:",
                     min = 1,
                     max = nrow(global_monitor),
                     value = 30),
        
        numericInput("n_rep",
                     "Number of samples:",
                     min = 1,
                     max = 30000,
                     value = 15000),
        
        hr(),
        
        sliderInput("binwidth",
                    "Binwidth:",
                    min = 0, max = 0.5,
                    value = 0.02,
                    step = 0.005)
        
      ),
      
      # Show a plot of the generated distribution
      mainPanel(
        plotOutput("sampling_plot"),
        textOutput("sampling_mean"),
        textOutput("sampling_se")
      )
    )
  ),
  
  server <- function(input, output) {
    
    # create sampling distribution
    sampling_dist <- reactive({
      global_monitor %>%
        rep_sample_n(size = input$n_samp, reps = input$n_rep, replace = TRUE) %>%
        count(scientist_work) %>%
        mutate(p_hat = n /sum(n)) %>%
        filter(scientist_work == input$outcome)
    })
    
    # plot sampling distribution
    output$sampling_plot <- renderPlot({
      
      ggplot(sampling_dist(), aes(x = p_hat)) +
        geom_histogram(binwidth = input$binwidth) +
        xlim(0, 1) +
        labs(
          x = paste0("p_hat (", input$outcome, ")"),
          title = "Sampling distribution of p_hat",
          subtitle = paste0("Sample size = ", input$n_samp, " Number of samples = ", input$n_rep)
        ) +
        theme(plot.title = element_text(face = "bold", size = 16))
    })
    
    # mean of sampling distribution
    output$sampling_mean <- renderText({
      paste0("Mean of sampling distribution = ", round(mean(sampling_dist()$p_hat), 2))
    })
    
    # standard error of sampling distribution
    output$sampling_se <- renderText({
      paste0("SE of sampling distribution = ", round(sd(sampling_dist()$p_hat), 2))
    })
  },
  
  options = list(height = 900) 
)

Shiny applications not supported in static R Markdown documents

Note: This Shiny App was taken from the template on the 606 class page.

Each observation represents the proportion of respondents in one sample who believe that the work scientists do doesn’t benefit them. As the sample size increased, the mean of the sampling distribution stayed close to 0.2 (it decreased only very slightly), while the standard error decreased from 0.11 (n = 10) to 0.06 (n = 50) to 0.04 (n = 100). The shape also becomes more symmetric and bell-shaped as the sample size grows. Increasing the number of simulations had essentially no effect on the mean or the standard error (I tested 10,000 and 20,000): the standard errors stayed at 0.11, 0.06, and 0.04 regardless of the number of simulations.
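The standard errors above can be compared with the formula SE = sqrt(p(1 - p)/n) for p = 0.2. The sketch below is a rough check in base R, with rbinom() standing in for the app's rep_sample_n call (an assumption). It also looks at what happens for n = 10 when samples with no "Doesn't benefit" responses are left out, which is what a count-then-filter step does; that may be why the app shows 0.11 at n = 10 where the formula gives about 0.13.

```r
p <- 0.2
n <- c(10, 50, 100)

# Theoretical standard error of p_hat for each sample size
se_theory <- sqrt(p * (1 - p) / n)
round(se_theory, 3)  # 0.126 0.057 0.040

# Simulated proportions for samples of size 10
set.seed(606)
p_hat_10 <- rbinom(20000, size = 10, prob = p) / 10

se_all     <- sd(p_hat_10)                # every sample kept
se_nonzero <- sd(p_hat_10[p_hat_10 > 0])  # samples with p_hat = 0 left out
round(c(all = se_all, nonzero = se_nonzero), 3)
```

The formula contains n and p but not the number of simulations, which is consistent with the standard error staying put when only the number of simulations changes.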

Exercise 7: Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?

#  Benefits

samp15 <- global_monitor %>%
  sample_n(15)

samp15 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           10 0.667
## 2 Doesn't benefit     5 0.333

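The proportion changes with each run, because sample_n draws a new random sample every time. In the sample shown above, 10 out of 15 people think the work scientists do benefits them, so p_hat = 0.667, and that sample proportion is my best point estimate of the population proportion. Other runs gave different values (for example 11/15, about 73%, and even 15/15, or 100%), which shows how variable an estimate based on only 15 people can be.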
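As a small worked check, the point estimate from the sample printed above is the count divided by the sample size. The sketch below also attaches an estimated standard error using the usual formula sqrt(p_hat(1 - p_hat)/n); that part is an addition for context, not something the exercise asks for.

```r
# Counts from the samp15 output above: 10 "Benefits" out of 15
x <- 10
n <- 15

p_hat  <- x / n                           # point estimate of the population proportion
se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error (added for context)

round(c(p_hat = p_hat, se_hat = se_hat), 3)
```

A standard error of roughly 0.12 is large relative to the estimate itself, which fits the run-to-run swings described above.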

Exercise 8: Since you have access to the population, simulate the sampling distribution of proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples from the population of size 15 and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.

sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

sample_props15

## # A tibble: 2,000 × 4
## # Groups:   replicate [2,000]
##    replicate scientist_work     n p_hat
##        <int> <chr>          <int> <dbl>
##  1         1 Benefits          13 0.867
##  2         2 Benefits          10 0.667
##  3         3 Benefits          10 0.667
##  4         4 Benefits          11 0.733
##  5         5 Benefits          10 0.667
##  6         6 Benefits          12 0.8  
##  7         7 Benefits          12 0.8  
##  8         8 Benefits          12 0.8  
##  9         9 Benefits          11 0.733
## 10        10 Benefits          14 0.933
## # … with 1,990 more rows

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

mean(sample_props15$p_hat)

## [1] 0.7988333

median(sample_props15$p_hat)

## [1] 0.8

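This graph is left skewed: the sample proportions pile up near the high end of the scale, with a longer tail toward lower values, so most samples show a large majority of people who view the work of scientists as beneficial. The mean (0.799) and the median (0.8) are both about 0.8, so I would guess the true proportion to be about 0.8. This matches the population proportion, which is 80,000/100,000 = 0.8.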
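The direction of the skew can be checked against the binomial model. The sketch below assumes p = 0.8 and uses the standard skewness formula for a binomial count, (1 - 2p)/sqrt(np(1 - p)), which also applies to p_hat because dividing by n does not change skewness. The function name binom_skew is made up for this example.

```r
p <- 0.8

# Hypothetical helper: skewness of a binomial count (and hence of p_hat)
binom_skew <- function(n, p) (1 - 2 * p) / sqrt(n * p * (1 - p))

# Negative values mean a longer tail toward lower proportions (left skew)
round(c(n15 = binom_skew(15, p), n150 = binom_skew(150, p)), 3)

# Expected successes and failures; the normal approximation is usually
# considered reasonable when both are at least 10
rbind(n15  = c(successes = 15 * p,  failures = 15 * (1 - p)),
      n150 = c(successes = 150 * p, failures = 150 * (1 - p)))
```

With only about 3 expected "Doesn't benefit" responses in a sample of 15, the distribution is noticeably skewed; with about 30 in a sample of 150, the skew is much weaker.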

Exercise 9: Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?

sample_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

sample_props150

## # A tibble: 2,000 × 4
## # Groups:   replicate [2,000]
##    replicate scientist_work     n p_hat
##        <int> <chr>          <int> <dbl>
##  1         1 Benefits         119 0.793
##  2         2 Benefits         125 0.833
##  3         3 Benefits         110 0.733
##  4         4 Benefits         126 0.84 
##  5         5 Benefits         127 0.847
##  6         6 Benefits         115 0.767
##  7         7 Benefits         118 0.787
##  8         8 Benefits         126 0.84 
##  9         9 Benefits         105 0.7  
## 10        10 Benefits         119 0.793
## # … with 1,990 more rows

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )

mean(sample_props150$p_hat)

## [1] 0.7995833

median(sample_props150$p_hat)

## [1] 0.8

With samples of size 150, the sampling distribution is approximately symmetric and bell-shaped, and it is much narrower than the distribution for samples of size 15, which had a visible skew. The mean is about 0.80 and the median is 0.8, so I would guess the true proportion of those who think the work scientists do enhances their lives to be about 0.8.

Exercise 10: Of the sampling distributions from Exercises 8 and 9, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

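The standard deviation of a sampling distribution (the standard error) measures how far the sample proportions typically fall from their mean. For a proportion, SE = sqrt(p(1 - p)/n), a special case of sigma/sqrt(n), so the standard error decreases as n increases. Therefore, sample_props150 has the smaller spread, which the comparison below confirms. If I want estimates that are more often close to the true value, I would prefer a sampling distribution with a small spread.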

sd(sample_props15$p_hat) > sd(sample_props150$p_hat)

## [1] TRUE

sd(sample_props15$p_hat) <  sd(sample_props150$p_hat)

## [1] FALSE
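The simulated comparison can be set beside the theoretical values. Assuming p = 0.8, the sketch below computes the standard errors for both sample sizes; multiplying the sample size by 10 should shrink the spread by a factor of sqrt(10), about 3.16.

```r
p <- 0.8

se_15  <- sqrt(p * (1 - p) / 15)   # about 0.103
se_150 <- sqrt(p * (1 - p) / 150)  # about 0.033

se_15 / se_150  # equals sqrt(10)
```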

Sangeetha Sasikumar

10/9/2022