Introduction

This week’s data dive is focused on analyzing multiple random samples taken from a dataset to understand potential variations and inconsistencies in data collection and conclusions. The dataset will be sampled 5 times (with replacement), each sample containing roughly 50% of the data. We will explore how these subsamples differ and what patterns remain consistent across them, using Monte Carlo simulations for further analysis.

# Load necessary libraries 
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset (replace this with your dataset path)
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows
head(tv_data)
# Inspect the structure of the dataset
str(tv_data)
## spc_tbl_ [168,639 × 29] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                  : num [1:168639] 1399 71446 66732 1402 63174 ...
##  $ name                : chr [1:168639] "Game of Thrones" "Money Heist" "Stranger Things" "The Walking Dead" ...
##  $ number_of_seasons   : num [1:168639] 8 3 4 11 6 7 2 5 6 1 ...
##  $ number_of_episodes  : num [1:168639] 73 41 34 177 93 137 9 62 116 9 ...
##  $ original_language   : chr [1:168639] "en" "es" "en" "en" ...
##  $ vote_count          : num [1:168639] 21857 17836 16161 15432 13870 ...
##  $ vote_average        : num [1:168639] 8.44 8.26 8.62 8.12 8.49 ...
##  $ overview            : chr [1:168639] "Seven noble families fight for control of the mythical land of Westeros. Friction between the houses leads to f"| __truncated__ "To carry out the biggest heist in history, a mysterious man called The Professor recruits a band of eight robbe"| __truncated__ "When a young boy vanishes, a small town uncovers a mystery involving secret experiments, terrifying supernatura"| __truncated__ "Sheriff's deputy Rick Grimes awakens from a coma to find a post-apocalyptic world dominated by flesh-eating zom"| __truncated__ ...
##  $ adult               : logi [1:168639] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ backdrop_path       : chr [1:168639] "/2OMB0ynKlyIenMJWI2Dy9IWT4c.jpg" "/gFZriCkpJYsApPZEF3jhxL4yLzG.jpg" "/2MaumbgBlW1NoPo3ZJO38A6v7OS.jpg" "/x4salpjB11umlUOltfNvSSrjSXm.jpg" ...
##  $ first_air_date      : Date[1:168639], format: "2011-04-17" "2017-05-02" ...
##  $ last_air_date       : Date[1:168639], format: "2019-05-19" "2021-12-03" ...
##  $ homepage            : chr [1:168639] "http://www.hbo.com/game-of-thrones" "https://www.netflix.com/title/80192098" "https://www.netflix.com/title/80057281" "http://www.amc.com/shows/the-walking-dead--1002293" ...
##  $ in_production       : logi [1:168639] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ original_name       : chr [1:168639] "Game of Thrones" "La Casa de Papel" "Stranger Things" "The Walking Dead" ...
##  $ popularity          : num [1:168639] 1083.9 96.4 185.7 489.7 416.7 ...
##  $ poster_path         : chr [1:168639] "/1XS1oqL89opfnbLl8WnZY1O1uJx.jpg" "/reEMJA1uzscCbkpeRJeTT2bjqUp.jpg" "/49WJfeN0moxb9IPfGn8AIqMGskD.jpg" "/n7PVu0hSz2sAsVekpOIoCnkWlbn.jpg" ...
##  $ type                : chr [1:168639] "Scripted" "Scripted" "Scripted" "Scripted" ...
##  $ status              : chr [1:168639] "Ended" "Ended" "Returning Series" "Ended" ...
##  $ tagline             : chr [1:168639] "Winter Is Coming" "The perfect robbery." "Every ending has a beginning." "Fight the dead. Fear the living." ...
##  $ genres              : chr [1:168639] "Sci-Fi & Fantasy, Drama, Action & Adventure" "Crime, Drama" "Drama, Sci-Fi & Fantasy, Mystery" "Action & Adventure, Drama, Sci-Fi & Fantasy" ...
##  $ created_by          : chr [1:168639] "David Benioff, D.B. Weiss" "Álex Pina" "Matt Duffer, Ross Duffer" "Frank Darabont" ...
##  $ languages           : chr [1:168639] "en" "es" "en" "en" ...
##  $ networks            : chr [1:168639] "HBO" "Netflix, Antena 3" "Netflix" "AMC" ...
##  $ origin_country      : chr [1:168639] "US" "ES" "US" "US" ...
##  $ spoken_languages    : chr [1:168639] "English" "Español" "English" "English" ...
##  $ production_companies: chr [1:168639] "Revolution Sun Studios, Television 360, Generator Entertainment, Bighead Littlehead" "Vancouver Media" "21 Laps Entertainment, Monkey Massacre Productions" "AMC Studios, Circle of Confusion, Valhalla Motion Pictures, Darkwoods Productions, Skybound Entertainment, Idiotbox" ...
##  $ production_countries: chr [1:168639] "United Kingdom, United States of America" "Spain" "United States of America" "United States of America" ...
##  $ episode_run_time    : num [1:168639] 0 70 0 42 45 45 0 0 43 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   number_of_seasons = col_double(),
##   ..   number_of_episodes = col_double(),
##   ..   original_language = col_character(),
##   ..   vote_count = col_double(),
##   ..   vote_average = col_double(),
##   ..   overview = col_character(),
##   ..   adult = col_logical(),
##   ..   backdrop_path = col_character(),
##   ..   first_air_date = col_date(format = ""),
##   ..   last_air_date = col_date(format = ""),
##   ..   homepage = col_character(),
##   ..   in_production = col_logical(),
##   ..   original_name = col_character(),
##   ..   popularity = col_double(),
##   ..   poster_path = col_character(),
##   ..   type = col_character(),
##   ..   status = col_character(),
##   ..   tagline = col_character(),
##   ..   genres = col_character(),
##   ..   created_by = col_character(),
##   ..   languages = col_character(),
##   ..   networks = col_character(),
##   ..   origin_country = col_character(),
##   ..   spoken_languages = col_character(),
##   ..   production_companies = col_character(),
##   ..   production_countries = col_character(),
##   ..   episode_run_time = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Generating Random Samples

We will generate 5 random samples (with replacement), each containing roughly 50% of the original data.

set.seed(123)  # For reproducibility

# Number of samples to generate
n_samples <- 5

# List to store samples
samples <- list()

# Sampling process (50% of the data with replacement)
for (i in 1:n_samples) {
   samples[[i]] <- tv_data |> sample_frac(size = 0.5, replace = TRUE)
}

# Assign each sample to a data frame
df_1 <- samples[[1]]
df_2 <- samples[[2]]
df_3 <- samples[[3]]
df_4 <- samples[[4]]
df_5 <- samples[[5]]

# Show the first few rows of the first sample
head(df_1)

We are simulating the process of collecting random samples from a population. By generating 5 random subsamples (each with 50% of the data), we replicate different scenarios of data collection. Since the samples are drawn with replacement, some values may appear multiple times in a single sample, reflecting real-world randomness in data collection.

Scrutiny of Subsamples

We will now analyze the differences and consistencies across the subsamples.

- Group Summary of Each Subsample

We can use group_by to examine categorical and continuous data across the samples. For instance, we will analyze the distribution of original_language and average vote_average across the subsamples.

# Group by original_language and summarize vote_average for each sample
group_summary <- function(df) {
   df |> 
    group_by(original_language) |>
    summarize(avg_rating = mean(vote_average, na.rm = TRUE), 
              count = n())
}

# Summarizing each sample
summary_1 <- group_summary(df_1)
summary_2 <- group_summary(df_2)
summary_3 <- group_summary(df_3)
summary_4 <- group_summary(df_4)
summary_5 <- group_summary(df_5)

# Summary for the first sample
summary_1

By grouping the subsamples based on the original_language, we calculate the average rating for each language in every sample. This summary provides insight into how different languages are represented in the dataset and whether certain languages tend to have higher or lower average ratings. Comparing the results across samples helps identify whether these patterns are consistent or if they vary depending on the sample.

- Investigating Differences and Anomalies

We will now examine whether anomalies present in one sample also exist in another.

# Checking if any samples have languages with unusually low or high ratings
# To make it more readable, let's compare the top languages by average rating in each sample
compare_samples <- function(summary_list) {
  lapply(summary_list, function(summary) {
        summary |> filter(avg_rating > 8 | avg_rating < 2)  # Filtering for anomalies
  })
}

# Store summaries for comparison
summaries <- list(summary_1, summary_2, summary_3, summary_4, summary_5)

# Compare anomalies across subsamples
anomalies <- compare_samples(summaries)
anomalies
## [[1]]
## # A tibble: 36 × 3
##    original_language avg_rating count
##    <chr>                  <dbl> <int>
##  1 am                      1.99    23
##  2 de                      1.83  3914
##  3 dv                      0        1
##  4 et                      1.06    31
##  5 ga                      0       11
##  6 gd                      0        4
##  7 gu                      1.33     6
##  8 he                      1.99   258
##  9 ho                      0        1
## 10 hr                      1.78    67
## # ℹ 26 more rows
## 
## [[2]]
## # A tibble: 36 × 3
##    original_language avg_rating count
##    <chr>                  <dbl> <int>
##  1 am                     1.94     27
##  2 av                     0         4
##  3 ca                     1.58     84
##  4 de                     1.74   3988
##  5 dv                     0         4
##  6 et                     1.09     25
##  7 fy                     0         5
##  8 gd                     0         2
##  9 gl                     0.686     7
## 10 gu                     0        10
## # ℹ 26 more rows
## 
## [[3]]
## # A tibble: 42 × 3
##    original_language avg_rating count
##    <chr>                  <dbl> <int>
##  1 ab                      1.89     9
##  2 av                      0        1
##  3 bs                      1.78    18
##  4 ca                      1.60    75
##  5 cn                      1.84   853
##  6 cs                      2.00   640
##  7 de                      1.66  3841
##  8 et                      1.33    27
##  9 fy                      0        1
## 10 gd                      0        1
## # ℹ 32 more rows
## 
## [[4]]
## # A tibble: 33 × 3
##    original_language avg_rating count
##    <chr>                  <dbl> <int>
##  1 am                      1.15    20
##  2 av                      0        3
##  3 az                      1        6
##  4 be                      1        2
##  5 cs                      1.97   693
##  6 de                      1.79  3877
##  7 dv                      0        1
##  8 et                      1.31    29
##  9 fy                      0        1
## 10 ga                      1.04    14
## # ℹ 23 more rows
## 
## [[5]]
## # A tibble: 38 × 3
##    original_language avg_rating count
##    <chr>                  <dbl> <int>
##  1 am                      1.5     21
##  2 az                      0        1
##  3 be                      0        1
##  4 bs                      1.9     15
##  5 cs                      1.96   722
##  6 de                      1.81  3815
##  7 et                      1.05    21
##  8 fy                      0        1
##  9 gd                      0        2
## 10 gl                      1.4      5
## # ℹ 28 more rows

We are filtering languages with either very high or very low average ratings (above 8 or below 2). This allows us to identify anomalies like languages that have unusual ratings in a given sample. By comparing the results across subsamples, we can see whether these anomalies are consistently present or if they are sample-specific. This comparison helps us understand the reliability of our conclusions, as an anomaly in one subsample may not be present in another.

Monte Carlo Simulation

Using Monte Carlo simulations, we can simulate additional sampling scenarios to assess how variability in the data might affect our conclusions.

# Monte Carlo simulation: simulate random sampling 100 times and calculate the mean rating each time
monte_carlo_simulation <- function(df, n_simulations = 100) {
  results <- replicate(n_simulations, {
    df |> sample_frac(size = 0.5, replace = TRUE) |>
      summarize(mean_rating = mean(vote_average, na.rm = TRUE)) |>
      pull(mean_rating)
  })
  results
}

# Running Monte Carlo simulations on the original dataset
simulation_results <- monte_carlo_simulation(tv_data)

# Plotting distribution of means from Monte Carlo simulation
hist(simulation_results, breaks = 30, col = "steelblue", 
     main = "Monte Carlo Simulation of Vote Average", 
     xlab = "Mean Vote Average")

Monte Carlo simulations allow us to repeatedly sample the data and observe how the average vote rating varies across multiple simulations. The histogram from these simulations shows the distribution of mean ratings across the 100 simulations, giving us an idea of the stability and variability of the average vote. This technique helps to assess whether our conclusions from a single sample are representative or if there’s significant variability due to random sampling.

Overall Insights

  1. There are notable differences in the subsamples in terms of average ratings and language distributions. Some subsamples exhibit anomalies in specific categories, such as languages with unusually low or high average ratings, which may not appear in other subsamples.

  2. Certain aspects of the data, such as the most common languages, remain consistent across subsamples, suggesting strong trends that are less affected by sampling variability.

  3. The Monte Carlo simulation shows that the mean vote average is relatively stable when sampled multiple times, but there is some variability. This suggests that while overall trends are consistent, individual samples might lead to different conclusions, especially if they include outliers.

  4. When drawing conclusions from sampled data for future consideration, it’s important to be aware of sampling variability. Monte Carlo simulations can help us understand the potential range of outcomes and make more informed decisions.