Load the Netflix Dataset and Prepare for Sampling

First, let’s load the Netflix dataset and examine its structure.

# Load the packages used for grouping and plotting below
library(dplyr)
library(ggplot2)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

# Preview the dataset
head(netflix_data)

##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3

# Check the structure of the data
str(netflix_data)

## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

# View a summary of the data
summary(netflix_data)

##       id               title               type           description       
##  Length:5806        Length:5806        Length:5806        Length:5806       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1945   Length:5806        Min.   :  0.00   Length:5806       
##  1st Qu.:2015   Class :character   1st Qu.: 44.00   Class :character  
##  Median :2018   Mode  :character   Median : 84.00   Mode  :character  
##  Mean   :2016                      Mean   : 77.64                     
##  3rd Qu.:2020                      3rd Qu.:105.00                     
##  Max.   :2022                      Max.   :251.00                     
##                                                                       
##  production_countries    seasons         imdb_id            imdb_score   
##  Length:5806          Min.   : 1.000   Length:5806        Min.   :1.500  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:5.800  
##  Mode  :character     Median : 1.000   Mode  :character   Median :6.600  
##                       Mean   : 2.166                      Mean   :6.533  
##                       3rd Qu.: 2.000                      3rd Qu.:7.400  
##                       Max.   :42.000                      Max.   :9.600  
##                       NA's   :3759                        NA's   :523    
##    imdb_votes      tmdb_popularity       tmdb_score    
##  Min.   :      5   Min.   :   0.0094   Min.   : 0.500  
##  1st Qu.:    521   1st Qu.:   3.1553   1st Qu.: 6.100  
##  Median :   2279   Median :   7.4780   Median : 6.900  
##  Mean   :  23407   Mean   :  22.5257   Mean   : 6.818  
##  3rd Qu.:  10144   3rd Qu.:  17.7757   3rd Qu.: 7.500  
##  Max.   :2268288   Max.   :1823.3740   Max.   :10.000  
##  NA's   :539       NA's   :94          NA's   :318

1. Generating Samples from the Population

Generate 5 Random Samples (With Replacement)

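Each subsample has 50% as many rows as the original dataset (2,903 of 5,806). Because rows are drawn with replacement, a subsample can contain the same title more than once, so it covers fewer than 2,903 distinct titles.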

set.seed(123)  # For reproducibility

# Sample sizes will be 50% of the dataset
sample_size <- floor(0.5 * nrow(netflix_data))

# Generate 5 random samples with replacement
df_1 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_2 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_3 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_4 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_5 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
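The five near-identical lines above could also be written as a loop that stores the subsamples in a list. The following is a minimal sketch under stated assumptions: `toy_data` is a made-up stand-in for `netflix_data`, and `draw_subsample` is a hypothetical helper, not something from the original analysis.

```r
# Toy stand-in for netflix_data (assumption: the real data has more columns)
set.seed(123)
toy_data <- data.frame(
  id = sprintf("tm%03d", 1:200),
  imdb_score = round(runif(200, min = 1.5, max = 9.6), 1)
)

# Hypothetical helper: draw one subsample of `size` rows with replacement
draw_subsample <- function(data, size) {
  data[sample(seq_len(nrow(data)), size, replace = TRUE), , drop = FALSE]
}

sample_size <- floor(0.5 * nrow(toy_data))

# Store the five subsamples in a named list instead of df_1 ... df_5
subsamples <- lapply(1:5, function(i) draw_subsample(toy_data, sample_size))
names(subsamples) <- paste("Sample", 1:5)

# Row count per subsample, and how many distinct titles each one contains
sapply(subsamples, nrow)
sapply(subsamples, function(s) length(unique(s$id)))
```

Keeping the subsamples in a list means later steps can use `lapply()` or `sapply()` instead of repeating the same call five times. The second `sapply()` call shows how many distinct rows survive sampling with replacement.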

Investigate Differences in Subsamples

We’ll inspect how different or similar these samples are by comparing their distributions and using group-by summaries.

Example: Compare IMDb Scores Across Subsamples

# Summary statistics for IMDb scores across samples
summary(df_1$imdb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.500   5.900   6.600   6.532   7.300   9.300     269

summary(df_2$imdb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.800   5.900   6.700   6.571   7.400   9.600     244

summary(df_3$imdb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.700   5.800   6.600   6.542   7.400   9.600     242

summary(df_4$imdb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.500   5.800   6.700   6.578   7.400   9.600     279

summary(df_5$imdb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.600   5.800   6.700   6.526   7.300   9.300     234

# Plot the distribution of IMDb scores for each subsample
ggplot() +
  geom_density(data = df_1, aes(x = imdb_score, color = "Sample 1"), na.rm = TRUE) +
  geom_density(data = df_2, aes(x = imdb_score, color = "Sample 2"), na.rm = TRUE) +
  geom_density(data = df_3, aes(x = imdb_score, color = "Sample 3"), na.rm = TRUE) +
  geom_density(data = df_4, aes(x = imdb_score, color = "Sample 4"), na.rm = TRUE) +
  geom_density(data = df_5, aes(x = imdb_score, color = "Sample 5"), na.rm = TRUE) +
  labs(title = "Distribution of IMDb Scores Across Subsamples", x = "IMDb Score", y = "Density") +
  theme_minimal()

The density plot shows the distribution of IMDb scores across five subsamples. From the summary statistics, I can see that the median IMDb score is consistent across subsamples, ranging from 6.6 to 6.7, and the mean IMDb score stays between 6.5 and 6.6. The range of scores spans from 1.5 to 9.6, but there are minor variations in the quartiles between samples.

Key Observations:

Similarity: The subsamples have fairly consistent median and mean IMDb scores, suggesting that the distribution of scores is stable across the samples.

Variability: The 1st quartile ranges from 5.8 to 5.9 and the 3rd quartile from 7.3 to 7.4, so each quartile differs by at most 0.1 between subsamples and all five follow a similar pattern.

NA’s: Each subsample has between 234 and 279 missing IMDb scores (roughly 8% to 10% of its 2,903 rows), close to the 9% missing in the full dataset (523 of 5,806), so the missing values are unlikely to change the overall conclusions.

Conclusions:

The subsample means (6.53 to 6.58) and medians (6.6 to 6.7) sit close to the full-dataset mean of 6.533 and median of 6.6, which suggests that my random subsamples are representative of the overall dataset.

Although there’s minor variability, it’s not enough to affect my confidence in the stability of IMDb scores across these samples.
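To make this comparison less dependent on reading five separate `summary()` printouts, the same statistics can be collected into one table alongside the population values. This is only a sketch: `scores` is simulated toy data standing in for `netflix_data$imdb_score`, and `score_stats` is a hypothetical helper.

```r
set.seed(123)
# Toy scores with some missing values (assumption: stand-in for imdb_score)
scores <- round(rnorm(1000, mean = 6.5, sd = 1.2), 1)
scores[sample(seq_along(scores), 90)] <- NA

# Hypothetical helper: the statistics compared in the text, for one vector
score_stats <- function(x) {
  c(median = median(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    q1 = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3 = unname(quantile(x, 0.75, na.rm = TRUE)),
    na_share = mean(is.na(x)))
}

# Five subsamples of half the population size, drawn with replacement
subsamples <- replicate(5, sample(scores, length(scores) %/% 2, replace = TRUE),
                        simplify = FALSE)

# One row per subsample plus a population row for reference
comparison <- rbind(
  score_stats(scores),
  do.call(rbind, lapply(subsamples, score_stats))
)
rownames(comparison) <- c("Population", paste("Sample", 1:5))
round(comparison, 3)
```

With the population row on top, any subsample whose mean, quartiles, or share of missing values drifts away from the full dataset is visible at a glance.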

2. Scrutinizing the Subsamples

2.1 Using `group_by` to Scrutinize Subsamples

We will analyze subsamples by grouping based on categorical data (e.g., type, age_certification) to see if there are consistent patterns or anomalies.

Group by type and Calculate Average IMDb Score

# Group by 'type' in each sample and calculate mean IMDb score
df_1_group <- df_1 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_2_group <- df_2 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_3_group <- df_3 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_4_group <- df_4 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_5_group <- df_5 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))

# Combine the results into a single data frame for comparison
df_combined <- rbind(
  df_1_group %>% mutate(sample = "Sample 1"),
  df_2_group %>% mutate(sample = "Sample 2"),
  df_3_group %>% mutate(sample = "Sample 3"),
  df_4_group %>% mutate(sample = "Sample 4"),
  df_5_group %>% mutate(sample = "Sample 5")
)

# Visualize the comparison using dodged bars
ggplot(df_combined, aes(x = type, y = avg_imdb, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average IMDb Scores by Type Across Subsamples", x = "Type", y = "Average IMDb Score") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

I grouped the Netflix data by content type (MOVIE or SHOW) and calculated the average IMDb score for each in five random subsamples. After combining these, I compared the results in a bar plot to see how they differ.

Visualizing the Comparison: The bar chart uses geom_bar() with dodged bars for easy comparison of IMDb scores across the subsamples. The color scheme helps distinguish each subsample.
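An alternative to summarizing each subsample separately and then combining the results is to stack the subsamples first, tag each row with its sample label, and group once by sample and type. The sketch below uses base R's `aggregate()` so that it runs without extra packages; `toy_data` is an assumed stand-in for `netflix_data` with made-up values.

```r
set.seed(123)
# Toy stand-in for netflix_data (assumption): type and imdb_score only
toy_data <- data.frame(
  type = sample(c("MOVIE", "SHOW"), 400, replace = TRUE, prob = c(0.65, 0.35)),
  imdb_score = round(rnorm(400, mean = 6.5, sd = 1.1), 1)
)
toy_data$imdb_score[sample(400, 30)] <- NA

sample_size <- floor(0.5 * nrow(toy_data))

# Draw five subsamples and tag each row with the subsample it came from
stacked <- do.call(rbind, lapply(1:5, function(i) {
  rows <- sample(seq_len(nrow(toy_data)), sample_size, replace = TRUE)
  cbind(toy_data[rows, ], sample = paste("Sample", i))
}))

# One grouped mean per (sample, type); the formula interface drops NA scores
avg_by_type <- aggregate(imdb_score ~ sample + type, data = stacked, FUN = mean)
avg_by_type
```

The result has the same shape as `df_combined` (one row per sample and type), so it could feed the same dodged bar chart.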

Scrutinizing the Subsamples

How Different Are They? The subsamples show some variation in IMDb scores due to random selection. For instance, one sample might have more highly-rated movies, while another might skew lower.

Consistent Trends: Despite variability, I noticed some consistent trends. TV Shows, for example, might consistently score higher than Movies across all samples, though the exact averages differ.

Impact on Conclusions

Sampling Variability: This shows the variability that comes with random sampling. If I relied on just one subsample, it might not represent the full dataset accurately. Multiple subsamples offer a more reliable view of the trends.

Reliability of Findings: If the variability between subsamples is low, I can trust my findings more. High variability suggests I may need to use larger samples or more structured sampling methods.

Future Data Analysis: Using multiple random samples or larger sample sizes will lead to more stable conclusions about IMDb scores. Relying on a single subsample could give misleading insights, especially if it’s skewed by outliers.

Further Investigation

Does Increasing Subsample Size Reduce Variability? Would the differences between subsamples shrink if I used 75% or 90% of the population?
What About Stratified Sampling? If I ensure equal proportions of Movies and Shows in each subsample, will that reduce variability?
Effect on Other Variables? Could this sampling variability impact other variables like runtime or popularity? Investigating other dimensions might provide more insights into the dataset’s stability.

By answering these questions, I can better understand how to generalize findings from this dataset and avoid misleading conclusions from random sampling.
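The stratified-sampling question can be explored with a small sketch. It assumes that "equal proportions" means each subsample keeps the same MOVIE/SHOW shares as the full dataset; `toy_data` is made up and `stratified_sample` is a hypothetical helper, not part of the original analysis.

```r
set.seed(123)
# Toy stand-in for netflix_data (assumption): 260 movies and 140 shows
toy_data <- data.frame(
  type = rep(c("MOVIE", "SHOW"), times = c(260, 140)),
  imdb_score = round(rnorm(400, mean = 6.5, sd = 1.1), 1)
)

# Hypothetical helper: sample the same fraction within each level of `strata`
stratified_sample <- function(data, strata, fraction) {
  groups <- split(seq_len(nrow(data)), data[[strata]])
  picked <- unlist(lapply(groups, function(idx) {
    # Index into idx so that a one-row group is not misread by sample()
    idx[sample.int(length(idx), floor(fraction * length(idx)), replace = TRUE)]
  }))
  data[picked, , drop = FALSE]
}

strat <- stratified_sample(toy_data, "type", 0.5)

# Type shares in the population and in the stratified draw
prop.table(table(toy_data$type))
prop.table(table(strat$type))
```

Because the type shares are fixed by construction, any remaining difference between stratified subsamples comes from which titles are drawn within each type, not from how many movies or shows a subsample happens to contain.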

2.2 Identify Consistencies and Anomalies

Now, we’ll investigate potential anomalies and consistencies within the subsamples.

Anomalies in IMDb Scores for Specific Genres

# Find genres with unusually high/low average IMDb scores in each sample
df_1_genre <- df_1 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_2_genre <- df_2 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_3_genre <- df_3 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_4_genre <- df_4 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_5_genre <- df_5 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))

# Print potential anomalies (extremely high or low IMDb scores)
anomalies_df_1 <- filter(df_1_genre, avg_imdb > 8.5 | avg_imdb < 3)
anomalies_df_2 <- filter(df_2_genre, avg_imdb > 8.5 | avg_imdb < 3)

print(anomalies_df_1)

## # A tibble: 11 × 2
##    genres                                                               avg_imdb
##    <chr>                                                                   <dbl>
##  1 ['action', 'comedy', 'drama', 'sport']                                    8.6
##  2 ['action', 'fantasy', 'scifi', 'animation', 'comedy']                     8.7
##  3 ['crime', 'scifi', 'thriller', 'drama', 'fantasy']                        8.6
##  4 ['drama', 'action', 'comedy', 'crime', 'animation', 'documentation'…      9  
##  5 ['drama', 'comedy', 'action']                                             1.7
##  6 ['drama', 'scifi', 'thriller']                                            1.5
##  7 ['family', 'comedy', 'animation']                                         8.8
##  8 ['horror', 'thriller', 'fantasy']                                         2.8
##  9 ['scifi', 'crime', 'drama', 'thriller']                                   8.7
## 10 ['scifi', 'family', 'fantasy', 'animation', 'action']                     9.3
## 11 ['scifi', 'thriller', 'drama', 'european']                                8.8

print(anomalies_df_2)

## # A tibble: 19 × 2
##    genres                                                               avg_imdb
##    <chr>                                                                   <dbl>
##  1 ['action', 'comedy', 'drama', 'sport']                                    8.6
##  2 ['action', 'comedy', 'romance', 'drama', 'fantasy', 'horror']             8.8
##  3 ['action', 'fantasy', 'scifi', 'animation', 'comedy']                     8.7
##  4 ['animation', 'scifi', 'action', 'fantasy', 'thriller', 'horror']         8.7
##  5 ['comedy', 'drama', 'music', 'reality']                                   8.7
##  6 ['comedy', 'family', 'drama', 'sport']                                    1.8
##  7 ['comedy', 'scifi']                                                       2.7
##  8 ['crime', 'scifi', 'thriller', 'drama', 'fantasy']                        8.6
##  9 ['documentation', 'family']                                               8.6
## 10 ['drama', 'action', 'history', 'romance', 'war']                          8.7
## 11 ['drama', 'comedy', 'animation']                                          8.7
## 12 ['drama', 'fantasy', 'romance']                                           9.2
## 13 ['drama', 'romance', 'crime']                                             8.6
## 14 ['family', 'comedy', 'animation']                                         8.8
## 15 ['fantasy', 'action', 'scifi', 'thriller', 'comedy']                      2.9
## 16 ['scifi', 'action', 'drama', 'fantasy', 'horror', 'animation']            9  
## 17 ['scifi', 'animation', 'crime', 'drama', 'fantasy', 'thriller']           9  
## 18 ['scifi', 'music', 'thriller', 'action']                                  8.8
## 19 ['western', 'action', 'scifi', 'thriller', 'animation', 'comedy', '…      8.9

# Combine anomalies from all samples for plotting
anomalies_combined <- rbind(
  anomalies_df_1 %>% mutate(sample = "Sample 1"),
  anomalies_df_2 %>% mutate(sample = "Sample 2"),
  filter(df_3_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 3"),
  filter(df_4_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 4"),
  filter(df_5_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 5")
)

# Plot anomalies by genre across the samples
ggplot(anomalies_combined, aes(x = reorder(genres, -avg_imdb), y = avg_imdb, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Anomalous Genres with Unusually High/Low IMDb Scores Across Samples", 
       x = "Genres", y = "Average IMDb Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

Identifying Outliers:

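To identify anomalies, we computed the average IMDb score for each value of the genres column in every subsample and flagged averages above 8.5 or below 3. Because genres stores the full list of genres for a title, each group is a genre combination rather than a single genre. In Sample 1, the flagged averages of 1.5 and 9.3 equal that subsample's minimum and maximum scores, which suggests those groups rest on a single title (possibly drawn more than once).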

Combining and Visualizing Anomalies:

After identifying anomalies in each subsample, we consolidated them into a single data frame for easier comparison. A bar plot was then created to visually represent these anomalies across all samples.

Sample-Specific Anomalies:

Due to the random nature of sampling, a genre combination can be flagged in one sample but not in another. For instance, ['drama', 'fantasy', 'romance'] is flagged in Sample 2 with an average of 9.2 but does not appear among the Sample 1 anomalies, while ['family', 'comedy', 'animation'] is flagged at 8.8 in both. This variability arises from the random inclusion of highly rated or poorly rated titles within each subsample.
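Since the genres column holds a whole list per title, grouping on it compares genre combinations. To ask the question per individual genre, the strings would first need to be split. Here is a minimal base-R sketch, assuming every value follows the `['a', 'b']` format seen in the preview; the toy rows and the column names `avg_imdb` and `n` are illustrative.

```r
# Toy rows in the same format as the `genres` column (values are illustrative)
toy_data <- data.frame(
  genres = c("['crime', 'drama']", "['comedy', 'fantasy']", "['comedy']",
             "['horror']", "['comedy', 'european']"),
  imdb_score = c(8.3, 8.2, 8.0, 8.1, 8.8),
  stringsAsFactors = FALSE
)

# Strip brackets, quotes and spaces, then split on commas
cleaned <- gsub("\\[|\\]|'| ", "", toy_data$genres)
genre_list <- strsplit(cleaned, ",", fixed = TRUE)

# One row per (title, genre) pair
long <- data.frame(
  genre = unlist(genre_list),
  imdb_score = rep(toy_data$imdb_score, lengths(genre_list))
)

# Average score and number of titles per individual genre
avg <- tapply(long$imdb_score, long$genre, mean)
n <- table(long$genre)
by_genre <- data.frame(
  genre = names(avg),
  avg_imdb = as.numeric(avg),
  n = as.integer(n[names(avg)])
)
by_genre
```

Reporting `n` next to each average makes it easy to see which extreme averages come from a single title and to apply a minimum group size before flagging anything as an anomaly.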

Insights and Implications:

Inconsistencies in Genre Ratings: Sample 1 flags 11 genre combinations and Sample 2 flags 19, with only partial overlap between the two lists, which highlights the impact of random sampling.

Outliers Affecting Averages: When a group contains only a few titles, a single extreme score can dominate its average.

Sampling Variability: Conclusions drawn from a single sample might not hold across other samples.

Impact of Random Sampling: Random sampling can lead to over- or under-representation of certain genres, affecting results.

Further Investigation:

To gain a deeper understanding of anomalies, we could investigate:

Genre-Specific Anomalies: Which genres are most prone to anomalies?
Stratified Sampling: Would stratified sampling reduce variability?
Sample Size: How does increasing sample size affect anomalies?
Other Metrics: Are anomalies consistent across different metrics?

2.3 Monte Carlo Simulation

Monte Carlo simulations can be used to repeatedly sample from the data, which will help assess the variability across multiple runs.

# Monte Carlo simulation: calculate average IMDb score from 100 subsamples
set.seed(123)  # Set seed for reproducibility

# generate 100 subsamples and calculate the average IMDb score of each
mc_sim <- replicate(100, {
  sample_data <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
  mean(sample_data$imdb_score, na.rm = TRUE)
})

# converting the results to a data frame
mc_sim_df <- data.frame(mc_sim = mc_sim)

# calculating descriptive statistics for the simulation
mean_sim <- mean(mc_sim)
sd_sim <- sd(mc_sim)
ci_sim <- quantile(mc_sim, probs = c(0.025, 0.975))  # middle 95% of the simulated means (percentile interval)

# displaying statistics
cat("Mean of Simulation:", mean_sim, "\n")

## Mean of Simulation: 6.538487

cat("Standard Deviation of Simulation:", sd_sim, "\n")

## Standard Deviation of Simulation: 0.02225251

cat("95% Confidence Interval:", ci_sim, "\n")

## 95% Confidence Interval: 6.498188 6.57921

# plotting the results of the Monte Carlo simulation with enhanced labeling
ggplot(mc_sim_df, aes(x = mc_sim)) +
  geom_histogram(binwidth = 0.05, fill = "blue", alpha = 0.7) +  # Adjusted binwidth for better visualization
  # linewidth replaces size for lines, which is deprecated since ggplot2 3.4.0
  geom_vline(aes(xintercept = mean(mc_sim)), color = "red", linetype = "dashed", linewidth = 1.2) +  # simulation mean line
  geom_vline(aes(xintercept = mean(netflix_data$imdb_score, na.rm = TRUE)), color = "green", linetype = "solid", linewidth = 1.2) +  # Population mean line
  labs(title = "Monte Carlo Simulation of Average IMDb Scores", 
       x = "Average IMDb Score", 
       y = "Frequency") +
  theme_minimal() +
  # Adjust labels to avoid overlapping
  annotate("text", x = mean_sim + 0.15, y = 10, label = paste("Sim Mean:", round(mean_sim, 2)), color = "red", size = 4, vjust = -1) +
  annotate("text", x = mean(netflix_data$imdb_score, na.rm = TRUE) - 0.15, y = 9, 
           label = paste("Pop Mean:", round(mean(netflix_data$imdb_score, na.rm = TRUE), 2)), color = "green", size = 4, vjust = 1)

Descriptive Statistics:

Mean: Shows the average IMDb score across the 100 subsamples (6.538), giving us a sense of the central tendency.
Standard Deviation: Indicates how much the subsample means vary from one another. Here it is about 0.022, so the means typically differ by only a few hundredths of a point.
Confidence Interval (CI): The interval from 6.498 to 6.579 spans the 2.5th to 97.5th percentiles of the 100 subsample means, so it describes where about 95% of the simulated means fall.

Insights Gathered:

Variation in IMDb Scores: The Monte Carlo simulation shows how average IMDb scores fluctuate across random subsamples, helping us understand the reliability of sample estimates. A lower standard deviation means the subsamples are more consistent.
Comparison to Population Mean: The proximity of the subsample mean (red dashed line, 6.538) to the population mean (green line, 6.533) indicates how well the subsamples represent the full dataset. A large deviation would suggest less reliability.
Confidence Interval: The 95% interval gives us an idea of the uncertainty in our estimates: about 95% of the means from subsamples of this size would be expected to fall within this range.
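One way to sanity-check the simulated spread is to compare it with the textbook standard error of a mean, sd(x) / sqrt(n), which applies when the n draws are independent (as they are when sampling with replacement). The sketch below runs on a simulated toy population rather than the Netflix data; the names `population`, `sim_means`, and `se_theory` are illustrative.

```r
set.seed(123)
# Toy population of scores (assumption: stand-in for the non-missing IMDb scores)
population <- round(rnorm(5000, mean = 6.5, sd = 1.2), 1)
n <- length(population) %/% 2

# 100 subsample means, each from n draws with replacement
sim_means <- replicate(100, mean(sample(population, n, replace = TRUE)))

# Empirical spread of the subsample means
sd_sim <- sd(sim_means)

# Theoretical standard error of a mean of n independent draws
se_theory <- sd(population) / sqrt(n)

# Middle 95% of the simulated means (a percentile interval)
interval <- unname(quantile(sim_means, probs = c(0.025, 0.975)))

c(sd_sim = sd_sim, se_theory = se_theory)
interval
```

If the two numbers are close, the simulation is behaving as theory predicts, and the percentile interval should be roughly the mean plus or minus two standard errors.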

Significance of This Investigation:

Sampling Variability: The simulation shows that there’s natural variability in random samples. Over many samples, the average of the subsample means should converge to the population mean; here the two differ by about 0.005 (6.538 versus 6.533).
Better Estimates with More Samples: Combining many subsamples gives a more accurate estimate of the true population mean, rather than relying on one sample.
Informed Decision-Making: Understanding the variability across samples helps us decide whether we need larger or more structured samples. A low standard deviation means more confidence in the results.

Further Questions to Investigate:

Does Increasing Subsamples Reduce Variability? If we increase Monte Carlo iterations (e.g., from 100 to 1,000), will the variability and confidence intervals shrink?
Does Sample Size Matter? How would the results change if we sampled 25% or 75% of the population instead of 50%?
Impact of Genres: Do certain genres influence the average IMDb scores more than others, and how much do outliers affect this?
Behavior of Other Metrics: Would other metrics, like runtime or popularity, show similar variability across subsamples, or are they more stable?

By exploring these, we can get a deeper understanding of how the dataset behaves under different sampling conditions.
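The first two questions can be probed with a short experiment that varies the subsample size and the number of iterations separately. This is a sketch on a simulated toy population, not the Netflix data, and `sd_of_means` is a hypothetical helper.

```r
set.seed(123)
# Toy population of scores (assumption: stand-in for the IMDb scores)
population <- round(rnorm(5000, mean = 6.5, sd = 1.2), 1)

# Hypothetical helper: sd of subsample means for a given fraction and run count
sd_of_means <- function(x, fraction, iterations) {
  n <- floor(fraction * length(x))
  sd(replicate(iterations, mean(sample(x, n, replace = TRUE))))
}

# Vary the subsample size at a fixed 500 iterations
by_fraction <- sapply(c(0.25, 0.50, 0.75),
                      function(f) sd_of_means(population, f, 500))
names(by_fraction) <- c("25%", "50%", "75%")

# Vary the iteration count at a fixed 50% subsample size
by_iterations <- sapply(c(100, 1000),
                        function(k) sd_of_means(population, 0.5, k))
names(by_iterations) <- c("100 runs", "1000 runs")

by_fraction
by_iterations
```

In theory, the spread of the subsample means scales with 1 / sqrt(n), so larger subsamples should give a smaller standard deviation. Running more iterations does not shrink that spread; it only estimates the same spread more precisely.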

NetflixDataDiveWeek4

Junaid Ahmed Mohammed

2024-09-20
