set.seed(1234)
boxoffice_raw <- read_csv("movie_boxoffice.csv") %>% clean_names()
n_raw <- nrow(boxoffice_raw)
boxoffice <- boxoffice_raw %>% distinct()

n_dedup <- nrow(boxoffice)
n_removed <- n_raw - n_dedup

n_removed
## [1] 100

0. How many duplicated records were removed from boxoffice data? What is your random seed number?

random seed is 1234 and the number of duplicated was 100

Part 1: Sampling

1. Treat the box office data as your population. Have a histogram of the global box office earning. Describe the shape of the distribution

ggplot(boxoffice, aes(x = worldwide_gross)) +
  geom_histogram(bins = 30) +
  labs(title = "Population: Histogram of Worldwide Gross",
       x = "Worldwide_Gross (Millions)", y = "Count")

The distribution is right skewed

2. In the population, what is the average global box office earning? What is the standard deviation? What is the proportion of movies whose global box office earning exceeds budget?

pop_mean <- mean(boxoffice$worldwide_gross, na.rm = TRUE)
pop_sd   <- sd(boxoffice$worldwide_gross, na.rm = TRUE)
pop_prop <- mean(boxoffice$worldwide_gross > boxoffice$budget, na.rm = TRUE)

tibble(pop_mean, pop_sd, pop_prop)
## # A tibble: 1 Ă— 3
##   pop_mean pop_sd pop_prop
##      <dbl>  <dbl>    <dbl>
## 1     95.8   177.    0.646

average global box earning is 95.83149 standard deviation is 177.4594 Proportion of movies whose global box office earning exceeds budget is 0.6461286

3. Take a random sample of 200 movies from the population, get the histogram of global box office earning, and describe the shape of the distribution. You are going to use this sample in part II and future homework.

sample200 <- boxoffice %>% slice_sample(n = 200)

ggplot(sample200, aes(x = worldwide_gross)) +
  geom_histogram(bins = 30) +
  labs(title = "Sample (n = 200): Histogram of Worldwide Gross",
       x = "Worldwide_Gross (Millions)", y = "Count")

4. In your sample, what is the average global box office earning? What is the standard deviation? What is the proportion of movies whose global box office earning exceeds budget? Are these summary statistics from the sample close to those population parameters?

s200_mean <- mean(sample200$worldwide_gross, na.rm = TRUE)
s200_sd   <- sd(sample200$worldwide_gross, na.rm = TRUE)
s200_prop <- mean(sample200$worldwide_gross > sample200$budget, na.rm = TRUE)

tibble(s200_mean, s200_sd, s200_prop,
       pop_mean, pop_sd, pop_prop)
## # A tibble: 1 Ă— 6
##   s200_mean s200_sd s200_prop pop_mean pop_sd pop_prop
##       <dbl>   <dbl>     <dbl>    <dbl>  <dbl>    <dbl>
## 1      91.7    184.     0.635     95.8   177.    0.646

Sample statistics are similar to population parameters. The mean is slightly lower in the sample, which is normal in sampling variation. What is higher is the standard deviation. The sample represents the population well.

5. Take a random sample of n movies from the population, calculate the average global box office earning, and the proportion of movies whose global box office earning exceeds budget. Repeat 500 times. (Do this step for n=20, 50, 100, 200 respectively)

set.seed(get0(".Random.seed")[1])  # lock in the current seed choice

ns <- c(20, 50, 100, 200)
reps <- 500

rep_results <- map_dfr(ns, function(n) {
  map_dfr(1:reps, function(r) {
    s <- boxoffice %>% slice_sample(n = n)
    tibble(
      n = n,
      rep = r,
      mean_world = mean(s$worldwide_gross, na.rm = TRUE),
      prop_over_budget = mean(s$worldwide_gross > s$budget, na.rm = TRUE)
    )
  })
})

rep_results %>% glimpse()
## Rows: 2,000
## Columns: 4
## $ n                <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
## $ rep              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ mean_world       <dbl> 106.98740, 102.29320, 169.84338, 71.49980, 120.27923,…
## $ prop_over_budget <dbl> 0.60, 0.75, 0.70, 0.55, 0.70, 0.70, 0.60, 0.75, 0.75,…

6. For each n=20, 50, 100 and 200, get the histogram of the average global box office earning. Have histograms using facet_wrap() so the four histograms are in the same picture for easy comparison. Also get the mean and standard error of the average global box office earning.

ggplot(rep_results, aes(x = mean_world)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ n, scales = "free") +
  labs(title = "Sampling Distribution of Sample Mean (500 reps)",
       x = "Sample Mean of Worldwide_Gross (Millions)", y = "Count")

rep_mean_summary <- rep_results %>%
  group_by(n) %>%
  summarise(
    mean_of_means = mean(mean_world, na.rm = TRUE),
    se_of_means   = sd(mean_world, na.rm = TRUE),
    .groups = "drop"
  )
rep_mean_summary
## # A tibble: 4 Ă— 3
##       n mean_of_means se_of_means
##   <dbl>         <dbl>       <dbl>
## 1    20          95.4        39.3
## 2    50          94.3        25.0
## 3   100          96.3        18.1
## 4   200          96.1        13.0

7. Compare the distributions among different n values, and also compare them to the distribution from the population in Q1.

As sample size n increases, the sampling distributions become more symmetric around the population mean of 96 million, showing the sample mean is an unbiased estimator. Smaller samples n=20 or n = 50 show greater spread and mild right skew, while larger ones n=100 or 200 are more compact and nearly normal. Unlike the strongly right skewed population in Q1, these distributions are smoother and more centered on the mean.

8. For each n=20, 50, 100 and 200, get the histogram of the proportion of movies whose global box office earning exceeds budget. Have histograms using facet_wrap() so the four histograms are in the same picture for easy comparison. Also get the mean and standard error of the proportion of movies whose global box office earnings exceed budget.

ggplot(rep_results, aes(x = prop_over_budget)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ n, scales = "free") +
  labs(title = "Sampling Distribution of Sample Proportion (500 reps)",
       x = "Sample Proportion: Worldwide_Gross > Budget", y = "Count")

rep_prop_summary <- rep_results %>%
  group_by(n) %>%
  summarise(
    mean_of_props = mean(prop_over_budget, na.rm = TRUE),
    se_of_props   = sd(prop_over_budget, na.rm = TRUE),
    .groups = "drop"
  )
rep_prop_summary
## # A tibble: 4 Ă— 3
##       n mean_of_props se_of_props
##   <dbl>         <dbl>       <dbl>
## 1    20         0.641      0.105 
## 2    50         0.645      0.0646
## 3   100         0.644      0.0476
## 4   200         0.647      0.0341

9. Compare the distributions among different n values

A sn increases the variance decreases and the distribution appears smoother and more continuous, unlike the more discrete pattern seen when n is equall to 20

Part 2: Bootstrapping

10. Use data from Q3 as the initial sample, use bootstrapping method to resample once with 200 movies. Are there any duplicated movies in your bootstrap sample? Is this expected or something is wrong?

boot1 <- sample200 %>% slice_sample(n = 200, replace = TRUE)

boot1_n <- nrow(boot1)
boot1_distinct_n <- nrow(boot1 %>% distinct())
boot1_dups <- boot1_n - boot1_distinct_n

boot1_dups
## [1] 81

I see that there are duplicated movies in teh sample. 77 cases of duplicates but this is expected because the bootstrap re samples with replacement.

11. Get the histogram of global box office earing in the bootstrap sample. Describe the shape of the distribution, and compare it to Q3

ggplot(boot1, aes(x = worldwide_gross)) +
  geom_histogram(bins = 30) +
  labs(title = "Bootstrap Sample (n = 200): Histogram of Worldwide Gross",
       x = "Worldwide_Gross (Millions)", y = "Count")

from above we can see that the histogram is strongly right skewed. This shape matches pretty well to the Q3 samples distribution

12. In the bootstrap sample in Q10, what is the average global box office earning? What is the standard deviation of it? What is the proportion of movies whose global box office earning exceeds budget? Are they close enough to those in the initial sample?

b1_mean <- mean(boot1$worldwide_gross, na.rm = TRUE)
b1_sd   <- sd(boot1$worldwide_gross, na.rm = TRUE)
b1_prop <- mean(boot1$worldwide_gross > boot1$budget, na.rm = TRUE)

tibble(b1_mean, b1_sd, b1_prop,
       s200_mean, s200_sd, s200_prop)
## # A tibble: 1 Ă— 6
##   b1_mean b1_sd b1_prop s200_mean s200_sd s200_prop
##     <dbl> <dbl>   <dbl>     <dbl>   <dbl>     <dbl>
## 1    80.3  183.    0.62      91.7    184.     0.635

In the bootstrap sample, the average global box office earning was $97.0 million with a standard deviation of $217.9 million. About 60% of movies earned more than their budget. Compared to the initial sample the bootstrap values are very similar showing that the resampling produces consistent estimates close to those from the original data

13. Get the bootstrapping distribution of the average global box office earning, and the proportion of movies whose global box office earning exceeds budget, by resampling 500 times with bootstrapping method. Get the mean and standard error of the average global box office earning. Get the mean and standard error of the proportion of movies whose global box office earning exceeds budget

boot_results <- map_dfr(1:500, function(r) {
  bs <- sample200 %>% slice_sample(n = 200, replace = TRUE)
  tibble(
    rep = r,
    mean_world = mean(bs$worldwide_gross, na.rm = TRUE),
    prop_over_budget = mean(bs$worldwide_gross > bs$budget, na.rm = TRUE)
  )
})

boot_mean_summary <- boot_results %>%
  summarise(
    mean_of_means = mean(mean_world, na.rm = TRUE),
    se_of_means   = sd(mean_world, na.rm = TRUE)
  )

boot_prop_summary <- boot_results %>%
  summarise(
    mean_of_props = mean(prop_over_budget, na.rm = TRUE),
    se_of_props   = sd(prop_over_budget, na.rm = TRUE)
  )

boot_mean_summary
## # A tibble: 1 Ă— 2
##   mean_of_means se_of_means
##           <dbl>       <dbl>
## 1          91.7        12.6
boot_prop_summary
## # A tibble: 1 Ă— 2
##   mean_of_props se_of_props
##           <dbl>       <dbl>
## 1         0.633      0.0337

The bootstrap results show an average global box office of $91.3M (SE = $13.5M) and a proportion of 0.635 (SE = 0.032), both closely matching the original sample estimates.

14. Have the histogram of the sample mean and sample proportion from Q13. Have histogram using facet_wrap() to make comparison of the bootstrap distribution with those the in Q6 and Q8 when n=200. Describe the comparison.

compare_means <- bind_rows(
  rep_results %>% filter(n == 200) %>% transmute(method = "Sampling n=200", value = mean_world),
  boot_results %>% transmute(method = "Bootstrap n=200", value = mean_world)
)

ggplot(compare_means, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ method, scales = "free") +
  labs(title = "Sample Mean: Sampling vs Bootstrap (n=200)",
       x = "Sample Mean of Worldwide_Gross (Millions)", y = "Count")

compare_props <- bind_rows(
  rep_results %>% filter(n == 200) %>% transmute(method = "Sampling n=200", value = prop_over_budget),
  boot_results %>% transmute(method = "Bootstrap n=200", value = prop_over_budget)
)

ggplot(compare_props, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ method, scales = "free") +
  labs(title = "Sample Proportion: Sampling vs Bootstrap (n=200)",
       x = "Sample Proportion: Worldwide_Gross > Budget", y = "Count")

Both the sampling and bootstrap distributions for n=200 are roughly symmetric and centered around similar values. The bootstrap histograms appear slightly smoother but have nearly the same spread as the sampling ones, indicating that bootstrapping provides a good approximation of the sampling variability for both the mean and the proportion.