First, let’s load the Netflix dataset and examine its structure.
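# Load the packages used throughout (dplyr for group_by/summarize, ggplot2 for plots)
library(dplyr)
library(ggplot2)
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")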
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime genres
## 1 1945 TV-MA 48 ['documentation']
## 2 1976 R 113 ['crime', 'drama']
## 3 1975 PG 91 ['comedy', 'fantasy']
## 4 1979 R 94 ['comedy']
## 5 1973 R 133 ['horror']
## 6 1969 TV-14 30 ['comedy', 'european']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] 1 NA NA 0.600
## 2 ['US'] NA tt0075314 8.3 795222 27.612
## 3 ['GB'] NA tt0071853 8.2 530877 18.216
## 4 ['GB'] NA tt0079470 8.0 392419 17.505
## 5 ['US'] NA tt0070047 8.1 391942 95.337
## 6 ['GB'] 4 tt0063929 8.8 72895 12.919
## tmdb_score
## 1 NA
## 2 8.2
## 3 7.8
## 4 7.8
## 5 7.7
## 6 8.3
# Check the structure of the data
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
# View a summary of the data
summary(netflix_data)
## id title type description
## Length:5806 Length:5806 Length:5806 Length:5806
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_year age_certification runtime genres
## Min. :1945 Length:5806 Min. : 0.00 Length:5806
## 1st Qu.:2015 Class :character 1st Qu.: 44.00 Class :character
## Median :2018 Mode :character Median : 84.00 Mode :character
## Mean :2016 Mean : 77.64
## 3rd Qu.:2020 3rd Qu.:105.00
## Max. :2022 Max. :251.00
##
## production_countries seasons imdb_id imdb_score
## Length:5806 Min. : 1.000 Length:5806 Min. :1.500
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:5.800
## Mode :character Median : 1.000 Mode :character Median :6.600
## Mean : 2.166 Mean :6.533
## 3rd Qu.: 2.000 3rd Qu.:7.400
## Max. :42.000 Max. :9.600
## NA's :3759 NA's :523
## imdb_votes tmdb_popularity tmdb_score
## Min. : 5 Min. : 0.0094 Min. : 0.500
## 1st Qu.: 521 1st Qu.: 3.1553 1st Qu.: 6.100
## Median : 2279 Median : 7.4780 Median : 6.900
## Mean : 23407 Mean : 22.5257 Mean : 6.818
## 3rd Qu.: 10144 3rd Qu.: 17.7757 3rd Qu.: 7.500
## Max. :2268288 Max. :1823.3740 Max. :10.000
## NA's :539 NA's :94 NA's :318
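Next, we draw five random subsamples with replacement. Each subsample has half as many rows as the original dataset (2,903 of 5,806); because rows are drawn with replacement, some titles can appear more than once within a subsample.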
set.seed(123) # For reproducibility
# Sample sizes will be 50% of the dataset
sample_size <- floor(0.5 * nrow(netflix_data))
# Generate 5 random samples with replacement
df_1 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_2 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_3 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_4 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
df_5 <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
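Because the draws use replace = TRUE, a subsample with half as many rows as the dataset does not cover half of the distinct titles. The sketch below estimates the covered share from row indices alone; the row count is taken from the str() output, and the variable names are illustrative rather than part of the original analysis.

```r
n <- 5806             # rows in the full dataset (from the str() output)
m <- floor(0.5 * n)   # rows drawn per subsample

# Expected share of distinct original rows in one subsample drawn with replacement:
# each row is missed by a single draw with probability (1 - 1/n)
expected_unique <- 1 - (1 - 1 / n)^m

# Empirical check using row indices only (no data needed)
set.seed(123)
idx <- sample(seq_len(n), m, replace = TRUE)
observed_unique <- length(unique(idx)) / n

cat("Expected share of distinct rows:", round(expected_unique, 3), "\n")
cat("Observed share in one draw:", round(observed_unique, 3), "\n")
```

Under these assumptions the expected share is about 39%, so each subsample contains roughly two fifths of the distinct titles, with the remaining rows being repeats.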
We’ll inspect how different or similar these samples are by comparing their distributions and using group-by summaries.
# Summary statistics for IMDb scores across samples
summary(df_1$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.500 5.900 6.600 6.532 7.300 9.300 269
summary(df_2$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.800 5.900 6.700 6.571 7.400 9.600 244
summary(df_3$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.700 5.800 6.600 6.542 7.400 9.600 242
summary(df_4$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.500 5.800 6.700 6.578 7.400 9.600 279
summary(df_5$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.600 5.800 6.700 6.526 7.300 9.300 234
# Plot the distribution of IMDb scores for each subsample
ggplot() +
geom_density(data = df_1, aes(x = imdb_score, color = "Sample 1"), na.rm = TRUE) +
geom_density(data = df_2, aes(x = imdb_score, color = "Sample 2"), na.rm = TRUE) +
geom_density(data = df_3, aes(x = imdb_score, color = "Sample 3"), na.rm = TRUE) +
geom_density(data = df_4, aes(x = imdb_score, color = "Sample 4"), na.rm = TRUE) +
geom_density(data = df_5, aes(x = imdb_score, color = "Sample 5"), na.rm = TRUE) +
labs(title = "Distribution of IMDb Scores Across Subsamples", x = "IMDb Score", y = "Density") +
theme_minimal()
The density plot shows the distribution of IMDb scores across five subsamples. From the summary statistics, I can see that the median IMDb score is consistent across subsamples, ranging from 6.6 to 6.7, and the mean IMDb score stays between 6.5 and 6.6. The range of scores spans from 1.5 to 9.6, but there are minor variations in the quartiles between samples.
Key Observations:
Similarity: The subsamples have fairly consistent median and mean IMDb scores, suggesting that the distribution of scores is stable across the samples.
Variability: The slight differences in the 1st and 3rd quartiles show that some subsamples have a wider spread of IMDb scores than others, but overall, they follow a similar pattern.
Missing values: Each subsample has between 234 and 279 missing IMDb scores (NAs), roughly 8 to 10% of its 2,903 rows. This is in line with the 523 missing values (about 9%) in the full dataset, so missingness is unlikely to distort the comparison.
Conclusions:
The subsample means (6.53 to 6.58) and medians (6.6 to 6.7) sit close to the full dataset's mean of 6.533 and median of 6.6, which suggests that my random subsamples are representative of the overall dataset.
Although there’s minor variability, it’s not enough to affect my confidence in the stability of IMDb scores across these samples.
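The side-by-side comparison above can also be collected into one table, which makes it easier to check each subsample against the full data. This is a minimal sketch using base R; the toy data frame is a made-up stand-in for netflix_data, and the helper name describe is hypothetical.

```r
set.seed(123)
# Hypothetical stand-in for netflix_data (scores and NA share are made up)
scores <- round(rnorm(1000, mean = 6.5, sd = 1.2), 1)
scores[sample(1000, 90)] <- NA
toy <- data.frame(imdb_score = scores)

# Draw five subsamples with replacement, as in the analysis above
sample_size <- floor(0.5 * nrow(toy))
samples <- lapply(1:5, function(i) {
  toy[sample(seq_len(nrow(toy)), sample_size, replace = TRUE), , drop = FALSE]
})
names(samples) <- paste0("df_", 1:5)

# Mean, median, and number of missing scores for one data frame
describe <- function(d) {
  c(mean = mean(d$imdb_score, na.rm = TRUE),
    median = median(d$imdb_score, na.rm = TRUE),
    n_missing = sum(is.na(d$imdb_score)))
}

# One row per data set, with the full data in the first row
comparison <- t(sapply(c(list(full = toy), samples), describe))
print(round(comparison, 3))
```

With the real data, replacing toy with netflix_data should reproduce the figures reported by the individual summary() calls.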
Using group_by to Scrutinize Subsamples
We will analyze subsamples by grouping based on categorical data (e.g., type, age_certification) to see if there are consistent patterns or anomalies.
Group by type and Calculate Average IMDb Score
# Group by 'type' in each sample and calculate mean IMDb score
df_1_group <- df_1 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_2_group <- df_2 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_3_group <- df_3 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_4_group <- df_4 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_5_group <- df_5 %>% group_by(type) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
# Combine the results into a single data frame for comparison
df_combined <- rbind(
df_1_group %>% mutate(sample = "Sample 1"),
df_2_group %>% mutate(sample = "Sample 2"),
df_3_group %>% mutate(sample = "Sample 3"),
df_4_group %>% mutate(sample = "Sample 4"),
df_5_group %>% mutate(sample = "Sample 5")
)
# Visualize the comparison using dodged bars
ggplot(df_combined, aes(x = type, y = avg_imdb, fill = sample)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average IMDb Scores by Type Across Subsamples", x = "Type", y = "Average IMDb Score") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
I grouped the Netflix data by content type (MOVIE or SHOW) and calculated the average IMDb score for each in five random subsamples. After combining these, I compared the results in a bar plot to see how they differ.
Visualizing the Comparison: The bar chart uses geom_bar() with dodged bars for easy comparison of IMDb scores across the subsamples. The color scheme helps distinguish each subsample.
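The group-and-combine step can be written more compactly by stacking the subsamples first and grouping once. The sketch below assumes dplyr is installed and uses a made-up toy data frame in place of netflix_data; the n column is an addition that records how many rows feed each average.

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for netflix_data with only the columns used here
toy <- data.frame(
  type = sample(c("MOVIE", "SHOW"), 200, replace = TRUE),
  imdb_score = round(runif(200, 1.5, 9.6), 1)
)

# Draw five subsamples and keep them in a named list
samples <- lapply(1:5, function(i) toy[sample(nrow(toy), 100, replace = TRUE), ])
names(samples) <- paste("Sample", 1:5)

# Stack the samples; .id records which sample each row came from
df_combined <- bind_rows(samples, .id = "sample") %>%
  group_by(sample, type) %>%
  summarise(avg_imdb = mean(imdb_score, na.rm = TRUE), n = n(), .groups = "drop")

print(df_combined)
```

The resulting df_combined has the same sample, type, and avg_imdb columns as the hand-built version, so it can be passed to the same dodged bar plot.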
Scrutinizing the Subsamples
How Different Are They? The subsamples show some variation in IMDb scores due to random selection. For instance, one sample might have more highly-rated movies, while another might skew lower.
Consistent Trends: Despite variability, I noticed some consistent trends. TV Shows, for example, might consistently score higher than Movies across all samples, though the exact averages differ.
Impact on Conclusions
Sampling Variability: This shows the variability that comes with random sampling. If I relied on just one subsample, it might not represent the full dataset accurately. Multiple subsamples offer a more reliable view of the trends.
Reliability of Findings: If the variability between subsamples is low, I can trust my findings more. High variability suggests I may need to use larger samples or more structured sampling methods.
Future Data Analysis: Using multiple random samples or larger sample sizes will lead to more stable conclusions about IMDb scores. Relying on a single subsample could give misleading insights, especially if it’s skewed by outliers.
Further Investigation
Does Increasing Subsample Size Reduce Variability? Would the differences between subsamples shrink if I used 75% or 90% of the population?
What About Stratified Sampling? If I ensure equal proportions of Movies and Shows in each subsample, will that reduce variability?
Effect on Other Variables? Could this sampling variability impact other variables like runtime or popularity? Investigating other dimensions might provide more insights into the dataset’s stability.
By answering these questions, I can better understand how to generalize findings from this dataset and avoid misleading conclusions from random sampling.
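The first question can be explored with a small simulation. The sketch below uses made-up scores instead of netflix_data$imdb_score, and spread_for is a hypothetical helper; it measures how much the subsample mean varies at several sampling fractions.

```r
set.seed(123)
# Hypothetical population of scores standing in for netflix_data$imdb_score
scores <- rnorm(5000, mean = 6.5, sd = 1.2)

# Standard deviation of the subsample mean for a given sampling fraction
spread_for <- function(frac, reps = 500) {
  m <- floor(frac * length(scores))
  means <- replicate(reps, mean(sample(scores, m, replace = TRUE)))
  sd(means)
}

fractions <- c(0.25, 0.50, 0.75, 0.90)
spread <- sapply(fractions, spread_for)
print(data.frame(fraction = fractions, sd_of_means = spread))
```

For draws with replacement, the spread of the mean shrinks roughly in proportion to one over the square root of the subsample size, so moving from 50% to 90% should reduce it only modestly.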
Now, we’ll investigate potential anomalies and consistencies within the subsamples.
Anomalies in IMDb Scores for Specific Genres
# Average IMDb score per genre combination in each sample (genres stores the full genre list of each title)
df_1_genre <- df_1 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_2_genre <- df_2 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_3_genre <- df_3 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_4_genre <- df_4 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
df_5_genre <- df_5 %>% group_by(genres) %>% summarize(avg_imdb = mean(imdb_score, na.rm = TRUE))
# Print potential anomalies (extremely high or low IMDb scores)
anomalies_df_1 <- filter(df_1_genre, avg_imdb > 8.5 | avg_imdb < 3)
anomalies_df_2 <- filter(df_2_genre, avg_imdb > 8.5 | avg_imdb < 3)
print(anomalies_df_1)
## # A tibble: 11 × 2
## genres avg_imdb
## <chr> <dbl>
## 1 ['action', 'comedy', 'drama', 'sport'] 8.6
## 2 ['action', 'fantasy', 'scifi', 'animation', 'comedy'] 8.7
## 3 ['crime', 'scifi', 'thriller', 'drama', 'fantasy'] 8.6
## 4 ['drama', 'action', 'comedy', 'crime', 'animation', 'documentation'… 9
## 5 ['drama', 'comedy', 'action'] 1.7
## 6 ['drama', 'scifi', 'thriller'] 1.5
## 7 ['family', 'comedy', 'animation'] 8.8
## 8 ['horror', 'thriller', 'fantasy'] 2.8
## 9 ['scifi', 'crime', 'drama', 'thriller'] 8.7
## 10 ['scifi', 'family', 'fantasy', 'animation', 'action'] 9.3
## 11 ['scifi', 'thriller', 'drama', 'european'] 8.8
print(anomalies_df_2)
## # A tibble: 19 × 2
## genres avg_imdb
## <chr> <dbl>
## 1 ['action', 'comedy', 'drama', 'sport'] 8.6
## 2 ['action', 'comedy', 'romance', 'drama', 'fantasy', 'horror'] 8.8
## 3 ['action', 'fantasy', 'scifi', 'animation', 'comedy'] 8.7
## 4 ['animation', 'scifi', 'action', 'fantasy', 'thriller', 'horror'] 8.7
## 5 ['comedy', 'drama', 'music', 'reality'] 8.7
## 6 ['comedy', 'family', 'drama', 'sport'] 1.8
## 7 ['comedy', 'scifi'] 2.7
## 8 ['crime', 'scifi', 'thriller', 'drama', 'fantasy'] 8.6
## 9 ['documentation', 'family'] 8.6
## 10 ['drama', 'action', 'history', 'romance', 'war'] 8.7
## 11 ['drama', 'comedy', 'animation'] 8.7
## 12 ['drama', 'fantasy', 'romance'] 9.2
## 13 ['drama', 'romance', 'crime'] 8.6
## 14 ['family', 'comedy', 'animation'] 8.8
## 15 ['fantasy', 'action', 'scifi', 'thriller', 'comedy'] 2.9
## 16 ['scifi', 'action', 'drama', 'fantasy', 'horror', 'animation'] 9
## 17 ['scifi', 'animation', 'crime', 'drama', 'fantasy', 'thriller'] 9
## 18 ['scifi', 'music', 'thriller', 'action'] 8.8
## 19 ['western', 'action', 'scifi', 'thriller', 'animation', 'comedy', '… 8.9
# Combine anomalies from all samples for plotting
anomalies_combined <- rbind(
anomalies_df_1 %>% mutate(sample = "Sample 1"),
anomalies_df_2 %>% mutate(sample = "Sample 2"),
filter(df_3_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 3"),
filter(df_4_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 4"),
filter(df_5_genre, avg_imdb > 8.5 | avg_imdb < 3) %>% mutate(sample = "Sample 5")
)
# Plot anomalies by genre across the samples
ggplot(anomalies_combined, aes(x = reorder(genres, -avg_imdb), y = avg_imdb, fill = sample)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Anomalous Genres with Unusually High/Low IMDb Scores Across Samples",
x = "Genres", y = "Average IMDb Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
Identifying Outliers:
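To identify anomalies, we grouped each subsample by the genres column and looked for groups with exceptionally high or low average IMDb scores. Because genres stores the full list of genres for each title, every group is an exact combination of genres rather than a single genre. Groups with an average score above 8.5 or below 3 are flagged as outliers that deviate strongly from the overall trend.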
Combining and Visualizing Anomalies:
After identifying anomalies in each subsample, we consolidated them into a single data frame for easier comparison. A bar plot was then created to visually represent these anomalies across all samples.
Sample-Specific Anomalies:
Due to the random nature of sampling, a genre combination might be flagged as anomalous in one sample but not in another. For instance, a combination such as ['horror', 'thriller', 'fantasy'] averages 2.8 in Sample 1 but is not flagged in Sample 2. This variability arises from the random inclusion of highly-rated or poorly-rated content within each subsample, and it is likely amplified when a combination contains only a few titles.
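To ask the question about single genres such as horror, the genres strings would first have to be split into one row per title and genre. The following is a minimal base R sketch on a made-up toy data frame that mimics the column format shown in the head() output; the titles and scores are invented for illustration.

```r
# Toy stand-in for netflix_data (hypothetical values, same genres format)
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  genres = c("['crime', 'drama']", "['comedy']", "['comedy', 'drama']", "[]"),
  imdb_score = c(8.3, 8.0, 6.0, NA),
  stringsAsFactors = FALSE
)

# Strip brackets and quotes, then split each string on commas
clean <- gsub("\\[|\\]|'", "", toy$genres)
genre_list <- strsplit(clean, ",\\s*")

# One row per (title, genre) pair; titles with an empty list contribute no rows
long <- data.frame(
  genre = unlist(genre_list),
  imdb_score = rep(toy$imdb_score, lengths(genre_list)),
  stringsAsFactors = FALSE
)

# Average score per individual genre, plus the number of rows behind it
by_genre <- aggregate(imdb_score ~ genre, data = long, FUN = mean)
by_genre$n_titles <- as.vector(table(long$genre)[as.character(by_genre$genre)])
print(by_genre)
```

Keeping the count next to each average makes it easier to tell a genuinely unusual genre from a group that contains only one or two titles.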
Insights and Implications:
Inconsistencies in Genre Ratings: Sample 1 flags 11 genre combinations and Sample 2 flags 19, and only some appear in both (for example, ['family', 'comedy', 'animation'] averages 8.8 in each), which highlights the impact of random sampling.
Outliers Affecting Averages: When a genre combination contains few titles, a single extreme score can dominate its average.
Sampling Variability: Conclusions drawn from a single sample might not hold across other samples.
Impact of Random Sampling: Random sampling can lead to over- or under-representation of certain genres, affecting results.
Further Investigation:
To gain a deeper understanding of anomalies, we could investigate:
Genre-Specific Anomalies: Which genres are most prone to anomalies?
Stratified Sampling: Would stratified sampling reduce variability?
Sample Size: How does increasing sample size affect anomalies?
Other Metrics: Are anomalies consistent across different metrics?
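The stratified sampling question can be prototyped by drawing the same fraction within each content type. The sketch below uses base R and a made-up toy data frame with a 60/40 split of movies and shows; stratified_sample is a hypothetical helper, not a function from the original analysis.

```r
set.seed(123)
# Hypothetical stand-in for netflix_data (600 movies, 400 shows; scores are made up)
toy <- data.frame(
  type = rep(c("MOVIE", "SHOW"), times = c(600, 400)),
  imdb_score = c(rnorm(600, 6.3, 1.1), rnorm(400, 7.0, 1.1))
)

# Draw the same fraction of rows within each type, with replacement,
# so every subsample keeps the type proportions of the full data
stratified_sample <- function(data, frac = 0.5) {
  rows_by_type <- split(seq_len(nrow(data)), data$type)
  picked <- lapply(rows_by_type, function(rows) {
    rows[sample.int(length(rows), floor(frac * length(rows)), replace = TRUE)]
  })
  data[unlist(picked, use.names = FALSE), ]
}

strat <- stratified_sample(toy)
print(table(strat$type))
```

Comparing the spread of means across several stratified draws with the spread across simple random draws would show how much of the variability comes from the changing mix of movies and shows.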
Monte Carlo simulations can be used to repeatedly sample from the data, which will help assess the variability across multiple runs.
# Monte Carlo simulation: calculate average IMDb score from 100 subsamples
set.seed(123) # Set seed for reproducibility
# generate 100 subsamples and calculate the average IMDb score of each
mc_sim <- replicate(100, {
sample_data <- netflix_data[sample(1:nrow(netflix_data), sample_size, replace = TRUE), ]
mean(sample_data$imdb_score, na.rm = TRUE)
})
# converting the results to a data frame
mc_sim_df <- data.frame(mc_sim = mc_sim)
# calculating descriptive statistics for the simulation
mean_sim <- mean(mc_sim)
sd_sim <- sd(mc_sim)
ci_sim <- quantile(mc_sim, probs = c(0.025, 0.975)) # 2.5th and 97.5th percentiles of the simulated means (95% percentile interval)
# displaying statistics
cat("Mean of Simulation:", mean_sim, "\n")
## Mean of Simulation: 6.538487
cat("Standard Deviation of Simulation:", sd_sim, "\n")
## Standard Deviation of Simulation: 0.02225251
cat("95% Confidence Interval:", ci_sim, "\n")
## 95% Confidence Interval: 6.498188 6.57921
# plotting the results of the Monte Carlo simulation with enhanced labeling
ggplot(mc_sim_df, aes(x = mc_sim)) +
geom_histogram(binwidth = 0.05, fill = "blue", alpha = 0.7) + # Adjusted binwidth for better visualization
geom_vline(aes(xintercept = mean(mc_sim)), color = "red", linetype = "dashed", linewidth = 1.2) + # simulation mean line (linewidth replaces the deprecated size for lines)
geom_vline(aes(xintercept = mean(netflix_data$imdb_score, na.rm = TRUE)), color = "green", linetype = "solid", linewidth = 1.2) + # Population mean line
labs(title = "Monte Carlo Simulation of Average IMDb Scores",
x = "Average IMDb Score",
y = "Frequency") +
theme_minimal() +
# Adjust labels to avoid overlapping
annotate("text", x = mean_sim + 0.15, y = 10, label = paste("Sim Mean:", round(mean_sim, 2)), color = "red", size = 4, vjust = -1) +
annotate("text", x = mean(netflix_data$imdb_score, na.rm = TRUE) - 0.15, y = 9,
label = paste("Pop Mean:", round(mean(netflix_data$imdb_score, na.rm = TRUE), 2)), color = "green", size = 4, vjust = 1)
Descriptive Statistics:
Mean: The average IMDb score across the 100 subsamples is about 6.54, giving us a sense of the central tendency.
Standard Deviation: At about 0.022, it indicates how much the subsample means vary from one another. A higher value would mean greater differences between them.
Confidence Interval (CI): The reported 95% interval (6.50 to 6.58) is the range between the 2.5th and 97.5th percentiles of the 100 simulated subsample means, that is, where the middle 95% of subsample means fall.
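The standard deviation of the simulated means can be cross-checked against the textbook standard error of a mean, which is the standard deviation of the scores divided by the square root of the subsample size. The sketch below runs that check on made-up scores; the variable names mirror the simulation above, but the data are an assumption.

```r
set.seed(123)
# Hypothetical scores standing in for the non-missing netflix_data$imdb_score values
scores <- rnorm(5000, mean = 6.5, sd = 1.2)
m <- floor(0.5 * length(scores))

# Means of 1,000 subsamples drawn with replacement
sim_means <- replicate(1000, mean(sample(scores, m, replace = TRUE)))

sd_sim <- sd(sim_means)                           # spread of the simulated means
se_theory <- sd(scores) / sqrt(m)                 # standard error of a mean of m draws
interval <- quantile(sim_means, c(0.025, 0.975))  # middle 95% of the simulated means

cat("Simulated SD of means:", round(sd_sim, 4), "\n")
cat("Theoretical standard error:", round(se_theory, 4), "\n")
cat("95% percentile interval:", round(interval, 3), "\n")
```

If the two spread figures agree closely, the simulation is behaving as sampling theory predicts, and the interval width is driven mainly by the subsample size.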
Insights Gathered:
Variation in IMDb Scores: The Monte Carlo simulation shows how average IMDb scores fluctuate across random subsamples, helping us understand the reliability of sample estimates. A lower standard deviation means the subsamples are more consistent.
Comparison to Population Mean: The simulation mean (red dashed line, about 6.538) lies very close to the population mean (green line, about 6.533), which indicates that the subsamples represent the full dataset well. A large deviation would suggest less reliability.
Confidence Interval: The 95% interval gives us an idea of the uncertainty in our estimates, showing that the means of most future subsamples of this size would fall within this range (6.50 to 6.58).
Significance of This Investigation:
Sampling Variability: The simulation shows that there’s natural variability in random samples. Over many samples, the average of the subsample means should converge to the population mean.
Better Estimates with More Samples: Combining many subsamples gives a more accurate estimate of the true population mean, rather than relying on one sample.
Informed Decision-Making: Understanding the variability across samples helps us decide whether we need larger or more structured samples. A low standard deviation means more confidence in the results.
Further Questions to Investigate:
Does Increasing Subsamples Reduce Variability? If we increase Monte Carlo iterations (e.g., from 100 to 1,000), will the variability and confidence intervals shrink?
Does Sample Size Matter? How would the results change if we sampled 25% or 75% of the population instead of 50%?
Impact of Genres: Do certain genres influence the average IMDb scores more than others, and how much do outliers affect this?
Behavior of Other Metrics: Would other metrics, like runtime or popularity, show similar variability across subsamples, or are they more stable?
By exploring these, we can get a deeper understanding of how the dataset behaves under different sampling conditions.
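As a starting point for the first question, the sketch below repeats the simulation with different numbers of iterations on made-up scores. The data and the helper sd_of_means are assumptions for illustration only.

```r
set.seed(123)
# Hypothetical scores standing in for netflix_data$imdb_score
scores <- rnorm(5000, mean = 6.5, sd = 1.2)
m <- floor(0.5 * length(scores))

# Spread of the subsample means for a given number of Monte Carlo iterations
sd_of_means <- function(iterations) {
  sd(replicate(iterations, mean(sample(scores, m, replace = TRUE))))
}

iterations <- c(100, 1000, 5000)
result <- data.frame(iterations = iterations,
                     sd_of_means = sapply(iterations, sd_of_means))
print(result)
```

More iterations estimate the same spread more precisely rather than shrinking it, because the spread of the subsample mean depends on the subsample size and not on how many subsamples are drawn.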