Sampling Distributions

library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)

set.seed(74229)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

Exercise 1

Given that sample was taken randomly (assuming indp) and that size is > 30, we can assume that sampling distribution will approximate population parameter but may be flatter/broader than actual population distribution.

samp1 <- global_monitor %>%
  sample_n(50)

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           39  0.78
## 2 Doesn't benefit    11  0.22

Exercise 2

No - no randomly selected sample can be expected to be an exact match to another sample due to simple variance in data. I would expect proportions to be only somewhat different as CLT would apply at an independently selected sample size of 50. It is possible that samples could be noticeably different depending on how much noise there is in the sample.

Exercise 3

samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           42  0.84
## 2 Doesn't benefit     8  0.16

Sample 2 is very close to Sample 1 in proportion of those who do and dont believe science benefits them.

The sample size of 1000 would most closely estimate the population parameter as the more samples there are to make statistical inference, the more the sampling distribution will align to the population distribution following the logic of CLT.

Exercise 4

15,000 samples of sample size 50 were drawn. Within each sample, the sample proportion of interest (p_hat) was evaluated and plotted on histogram. Reviewing the sampling distribution, we can see unsurprisingly that this sampling distribution follows the bell curve of a normal distribution, and the assumed distribution of the population parameter. The center of the distribution is found at .2, just the same as population parameter.

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 5

There are 25 observations in sample_props_small; each observation represents a sample taken and sample parameter computed. The mean of each replicated sampling parameter is above the actual population parameter, which is to be expected with such a small sample size.

sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, 
                                 reps = 25, 
                                 replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

sample_props_small

## # A tibble: 22 x 4
## # Groups:   replicate [22]
##    replicate scientist_work      n p_hat
##        <int> <chr>           <int> <dbl>
##  1         1 Doesn't benefit     2   0.2
##  2         2 Doesn't benefit     4   0.4
##  3         4 Doesn't benefit     2   0.2
##  4         5 Doesn't benefit     3   0.3
##  5         6 Doesn't benefit     1   0.1
##  6         7 Doesn't benefit     2   0.2
##  7         8 Doesn't benefit     1   0.1
##  8         9 Doesn't benefit     2   0.2
##  9        10 Doesn't benefit     1   0.1
## 10        11 Doesn't benefit     1   0.1
## # ... with 12 more rows

sample_props_small %>%
  group_by(scientist_work) %>%
  summarize(p_hat_avg = mean(p_hat))

## # A tibble: 1 x 2
##   scientist_work  p_hat_avg
##   <chr>               <dbl>
## 1 Doesn't benefit     0.214

Exercise 6

As sample size increases, the sampling distribution will better approximate the population distribution. As such, the mean of the sampling distribution will move closer and closer to the actual population distribution center of .2. Standard error (which estimates the precision of sample mean estimate to the population mean ) will decrease as uncertainty decreases with more data points to observe.

The same is true as simulation size increases. Again, we are taking enough sample parameters to approximate normal distribution of population distribution. Std error decreases, mean or center gravitates more and more toward actual population parameter of .2 and distribution of sample parameters becomes more narrower and more bell curved symmetrical.

LS0tDQp0aXRsZTogIlNhbXBsaW5nIERpc3RyaWJ1dGlvbnMiDQphdXRob3I6ICJDYXNzYW5kcmEgQm95bGFuIg0KZGF0ZTogImByIFN5cy5EYXRlKClgIg0Kb3V0cHV0OiBvcGVuaW50cm86OmxhYl9yZXBvcnQNCi0tLQ0KDQpgYGB7ciBsb2FkLXBhY2thZ2VzLCBtZXNzYWdlPUZBTFNFfQ0KbGlicmFyeSh0aWR5dmVyc2UpDQpsaWJyYXJ5KG9wZW5pbnRybykNCmxpYnJhcnkoaW5mZXIpDQpsaWJyYXJ5KGdncGxvdDIpDQoNCnNldC5zZWVkKDc0MjI5KQ0KYGBgDQoNCmBgYHtyfQ0KZ2xvYmFsX21vbml0b3IgPC0gdGliYmxlKA0KICBzY2llbnRpc3Rfd29yayA9IGMocmVwKCJCZW5lZml0cyIsIDgwMDAwKSwgcmVwKCJEb2Vzbid0IGJlbmVmaXQiLCAyMDAwMCkpDQopDQpgYGANCg0KIyMgRXhlcmNpc2UgMQ0KDQpHaXZlbiB0aGF0IHNhbXBsZSB3YXMgdGFrZW4gcmFuZG9tbHkgKGFzc3VtaW5nIGluZHApIGFuZCB0aGF0IHNpemUgaXMgPiAzMCwgd2UgY2FuIGFzc3VtZSB0aGF0IHNhbXBsaW5nIGRpc3RyaWJ1dGlvbiB3aWxsIGFwcHJveGltYXRlIHBvcHVsYXRpb24gcGFyYW1ldGVyIGJ1dCBtYXkgYmUgZmxhdHRlci9icm9hZGVyIHRoYW4gYWN0dWFsIHBvcHVsYXRpb24gZGlzdHJpYnV0aW9uLg0KYGBge3J9DQpzYW1wMSA8LSBnbG9iYWxfbW9uaXRvciAlPiUNCiAgc2FtcGxlX24oNTApDQoNCnNhbXAxICU+JQ0KICBjb3VudChzY2llbnRpc3Rfd29yaykgJT4lDQogIG11dGF0ZShwX2hhdCA9IG4gL3N1bShuKSkNCmBgYA0KDQoNCiMjIEV4ZXJjaXNlIDINCg0KTm8gLSBubyByYW5kb21seSBzZWxlY3RlZCBzYW1wbGUgY2FuIGJlIGV4cGVjdGVkIHRvIGJlIGFuIGV4YWN0IG1hdGNoIHRvIGFub3RoZXIgc2FtcGxlIGR1ZSB0byBzaW1wbGUgdmFyaWFuY2UgaW4gZGF0YS4gIEkgd291bGQgZXhwZWN0IHByb3BvcnRpb25zIHRvIGJlIG9ubHkgc29tZXdoYXQgZGlmZmVyZW50IGFzIENMVCB3b3VsZCBhcHBseSBhdCBhbiBpbmRlcGVuZGVudGx5IHNlbGVjdGVkIHNhbXBsZSBzaXplIG9mIDUwLiAgSXQgaXMgcG9zc2libGUgdGhhdCBzYW1wbGVzIGNvdWxkIGJlIG5vdGljZWFibHkgZGlmZmVyZW50IGRlcGVuZGluZyBvbiBob3cgbXVjaCBub2lzZSB0aGVyZSBpcyBpbiB0aGUgc2FtcGxlLg0KDQojIyBFeGVyY2lzZSAzDQpgYGB7ciBzYW1wMn0NCnNhbXAyIDwtIGdsb2JhbF9tb25pdG9yICU+JQ0KICBzYW1wbGVfbig1MCkNCg0Kc2FtcDIgJT4lDQogIGNvdW50KHNjaWVudGlzdF93b3JrKSAlPiUNCiAgbXV0YXRlKHBfaGF0ID0gbiAvc3VtKG4pKQ0KYGBgDQoNClNhbXBsZSAyIGlzIHZlcnkgY2xvc2UgdG8gU2FtcGxlIDEgaW4gcHJvcG9ydGlvbiBvZiB0aG9zZSB3aG8gZG8gYW5kIGRvbnQgYmVsaWV2ZSBzY2llbmNlIGJlbmVmaXRzIHRoZW0uDQoNClRoZSBzYW1wbGUgc2l6ZSBvZiAxMDAwIHdvdWxkIG1vc3QgY2xvc2VseSBlc3RpbWF0ZSB0aGUgcG9wdWxhdGlvbiBwYXJhbWV0ZXIgYXMgdGhlIG1vcmUgc2FtcGxlcyB0aGVyZSBhcmUgdG8gbWFrZSBzdGF0aXN0aWNhbCBpbmZlcmVuY2UsIHRoZSBtb3JlIHRoZSBzYW1wbGluZyBkaXN0cmlidXRpb24gd2lsbCBhbGlnbiB0byB0aGUgcG9wdWxhdGlvbiBkaXN0cmlidXRpb24gZm9sbG93aW5nIHRoZSBsb2dpYyBvZiBDTFQuDQoNCiMjIEV4ZXJjaXNlIDQNCjE1LDAwMCBzYW1wbGVzIG9mIHNhbXBsZSBzaXplIDUwIHdlcmUgZHJhd24uICBXaXRoaW4gZWFjaCBzYW1wbGUsIHRoZSBzYW1wbGUgcHJvcG9ydGlvbiBvZiBpbnRlcmVzdCAocF9oYXQpIHdhcyBldmFsdWF0ZWQgYW5kIHBsb3R0ZWQgb24gaGlzdG9ncmFtLiAgUmV2aWV3aW5nIHRoZSBzYW1wbGluZyBkaXN0cmlidXRpb24sIHdlIGNhbiBzZWUgdW5zdXJwcmlzaW5nbHkgdGhhdCB0aGlzIHNhbXBsaW5nIGRpc3RyaWJ1dGlvbiBmb2xsb3dzIHRoZSBiZWxsIGN1cnZlIG9mIGEgbm9ybWFsIGRpc3RyaWJ1dGlvbiwgYW5kIHRoZSBhc3N1bWVkIGRpc3RyaWJ1dGlvbiBvZiB0aGUgcG9wdWxhdGlvbiBwYXJhbWV0ZXIuICBUaGUgY2VudGVyIG9mIHRoZSBkaXN0cmlidXRpb24gaXMgZm91bmQgYXQgLjIsIGp1c3QgdGhlIHNhbWUgYXMgcG9wdWxhdGlvbiBwYXJhbWV0ZXIuDQpgYGB7ciBpdGVyYXRlfQ0Kc2FtcGxlX3Byb3BzNTAgPC0gZ2xvYmFsX21vbml0b3IgJT4lDQogICAgICAgICAgICAgICAgICAgIHJlcF9zYW1wbGVfbihzaXplID0gNTAsIHJlcHMgPSAxNTAwMCwgcmVwbGFjZSA9IFRSVUUpICU+JQ0KICAgICAgICAgICAgICAgICAgICBjb3VudChzY2llbnRpc3Rfd29yaykgJT4lDQogICAgICAgICAgICAgICAgICAgIG11dGF0ZShwX2hhdCA9IG4gL3N1bShuKSkgJT4lDQogICAgICAgICAgICAgICAgICAgIGZpbHRlcihzY2llbnRpc3Rfd29yayA9PSAiRG9lc24ndCBiZW5lZml0IikNCmBgYA0KYGBge3J9DQpnZ3Bsb3QoZGF0YSA9IHNhbXBsZV9wcm9wczUwLCBhZXMoeCA9IHBfaGF0KSkgKw0KICBnZW9tX2hpc3RvZ3JhbShiaW53aWR0aCA9IDAuMDIpICsNCiAgbGFicygNCiAgICB4ID0gInBfaGF0IChEb2Vzbid0IGJlbmVmaXQpIiwNCiAgICB0aXRsZSA9ICJTYW1wbGluZyBkaXN0cmlidXRpb24gb2YgcF9oYXQiLA0KICAgIHN1YnRpdGxlID0gIlNhbXBsZSBzaXplID0gNTAsIE51bWJlciBvZiBzYW1wbGVzID0gMTUwMDAiDQogICkNCmBgYA0KDQojIyBFeGVyY2lzZSA1DQoNClRoZXJlIGFyZSAyNSBvYnNlcnZhdGlvbnMgaW4gc2FtcGxlX3Byb3BzX3NtYWxsOyBlYWNoIG9ic2VydmF0aW9uIHJlcHJlc2VudHMgYSBzYW1wbGUgdGFrZW4gYW5kIHNhbXBsZSBwYXJhbWV0ZXIgY29tcHV0ZWQuICBUaGUgbWVhbiBvZiBlYWNoIHJlcGxpY2F0ZWQgc2FtcGxpbmcgcGFyYW1ldGVyIGlzIGFib3ZlIHRoZSBhY3R1YWwgcG9wdWxhdGlvbiBwYXJhbWV0ZXIsIHdoaWNoIGlzIHRvIGJlIGV4cGVjdGVkIHdpdGggc3VjaCBhIHNtYWxsIHNhbXBsZSBzaXplLg0KYGBge3IgbWVzc2FnZT1GQUxTRX0NCnNhbXBsZV9wcm9wc19zbWFsbCA8LSBnbG9iYWxfbW9uaXRvciAlPiUNCiAgICAgICAgICAgICAgICAgICAgcmVwX3NhbXBsZV9uKHNpemUgPSAxMCwgDQogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByZXBzID0gMjUsIA0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgcmVwbGFjZSA9IFRSVUUpICU+JQ0KICAgICAgICAgICAgICAgICAgICBjb3VudChzY2llbnRpc3Rfd29yaykgJT4lDQogICAgICAgICAgICAgICAgICAgIG11dGF0ZShwX2hhdCA9IG4gL3N1bShuKSkgJT4lDQogICAgICAgICAgICAgICAgICAgIGZpbHRlcihzY2llbnRpc3Rfd29yayA9PSAiRG9lc24ndCBiZW5lZml0IikNCg0Kc2FtcGxlX3Byb3BzX3NtYWxsDQoNCnNhbXBsZV9wcm9wc19zbWFsbCAlPiUNCiAgZ3JvdXBfYnkoc2NpZW50aXN0X3dvcmspICU+JQ0KICBzdW1tYXJpemUocF9oYXRfYXZnID0gbWVhbihwX2hhdCkpDQpgYGANCg0KIyMgRXhlcmNpc2UgNg0KQXMgc2FtcGxlIHNpemUgaW5jcmVhc2VzLCB0aGUgc2FtcGxpbmcgZGlzdHJpYnV0aW9uIHdpbGwgYmV0dGVyIGFwcHJveGltYXRlIHRoZSBwb3B1bGF0aW9uIGRpc3RyaWJ1dGlvbi4gIEFzIHN1Y2gsIHRoZSBtZWFuIG9mIHRoZSBzYW1wbGluZyBkaXN0cmlidXRpb24gd2lsbCBtb3ZlIGNsb3NlciBhbmQgY2xvc2VyIHRvIHRoZSBhY3R1YWwgcG9wdWxhdGlvbiBkaXN0cmlidXRpb24gY2VudGVyIG9mIC4yLiAgU3RhbmRhcmQgZXJyb3IgKHdoaWNoIGVzdGltYXRlcyB0aGUgcHJlY2lzaW9uIG9mIHNhbXBsZSBtZWFuIGVzdGltYXRlIHRvIHRoZSBwb3B1bGF0aW9uIG1lYW4gKSB3aWxsIGRlY3JlYXNlIGFzIHVuY2VydGFpbnR5IGRlY3JlYXNlcyB3aXRoIG1vcmUgZGF0YSBwb2ludHMgdG8gb2JzZXJ2ZS4NCg0KVGhlIHNhbWUgaXMgdHJ1ZSBhcyBzaW11bGF0aW9uIHNpemUgaW5jcmVhc2VzLiAgQWdhaW4sIHdlIGFyZSB0YWtpbmcgZW5vdWdoIHNhbXBsZSBwYXJhbWV0ZXJzIHRvIGFwcHJveGltYXRlIG5vcm1hbCBkaXN0cmlidXRpb24gb2YgcG9wdWxhdGlvbiBkaXN0cmlidXRpb24uICBTdGQgZXJyb3IgZGVjcmVhc2VzLCBtZWFuIG9yIGNlbnRlciBncmF2aXRhdGVzIG1vcmUgYW5kIG1vcmUgdG93YXJkIGFjdHVhbCBwb3B1bGF0aW9uIHBhcmFtZXRlciBvZiAuMiBhbmQgZGlzdHJpYnV0aW9uIG9mIHNhbXBsZSBwYXJhbWV0ZXJzIGJlY29tZXMgbW9yZSBuYXJyb3dlciBhbmQgbW9yZSBiZWxsIGN1cnZlZCBzeW1tZXRyaWNhbC4=