First, let’s load the Netflix dataset and examine its structure.
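# Load the packages used below (%>% and the grouping verbs come from dplyr)
library(dplyr)
library(ggplot2)
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")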
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime genres
## 1 1945 TV-MA 48 ['documentation']
## 2 1976 R 113 ['crime', 'drama']
## 3 1975 PG 91 ['comedy', 'fantasy']
## 4 1979 R 94 ['comedy']
## 5 1973 R 133 ['horror']
## 6 1969 TV-14 30 ['comedy', 'european']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] 1 NA NA 0.600
## 2 ['US'] NA tt0075314 8.3 795222 27.612
## 3 ['GB'] NA tt0071853 8.2 530877 18.216
## 4 ['GB'] NA tt0079470 8.0 392419 17.505
## 5 ['US'] NA tt0070047 8.1 391942 95.337
## 6 ['GB'] 4 tt0063929 8.8 72895 12.919
## tmdb_score
## 1 NA
## 2 8.2
## 3 7.8
## 4 7.8
## 5 7.7
## 6 8.3
# Check the structure of the dataset
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
We’ll group the data by various categorical variables, calculate summaries, and investigate the probabilities of these groups.
This dataset contains columns such as type (Movie/TV Show), age_certification (content rating), genres (the genre of the content), and imdb_score (IMDb rating). In this analysis, we will explore these columns in various ways to gather insights.
Group by type and calculate the average IMDb score
We first group by type alone; age_certification is added in a later step to explore the distribution of age ratings for both movies and shows.
# Group by 'type' and calculate average IMDb score
group_by_type <- netflix_data %>%
  group_by(type) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE))
# Assign probabilities to each group
group_by_type <- group_by_type %>%
  mutate(probability = count / sum(count))
# Tag the smallest group
group_by_type <- group_by_type %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))
# Print the result
print(group_by_type)
## # A tibble: 2 × 5
## type count avg_imdb probability tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 MOVIE 3759 6.27 0.647 Regular
## 2 SHOW 2047 7.02 0.353 Low Probability
# Visualization
ggplot(group_by_type, aes(x = type, y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Type", x = "Type", y = "Average IMDb Score") +
  theme_minimal()
We grouped the data by the column type (whether a Netflix entry is a movie or TV show).
We calculated how many entries exist for each type and the average IMDb score for each group.
We also calculated the probability of randomly selecting an entry from each group and tagged the group with the lowest probability as “Low Probability.”
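To make the "probability of randomly selecting an entry" concrete, the sketch below computes it on a small toy data frame and compares it with the share obtained by actually drawing rows at random. The `toy` rows, the seed, and the number of draws are illustrative assumptions, not values from the Netflix dataset.

```r
library(dplyr)

# Hypothetical stand-in for netflix_data: 3 movies and 2 shows
toy <- data.frame(type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"))

# Probability of picking each type = group count / total count
type_prob <- toy %>%
  count(type, name = "count") %>%
  mutate(probability = count / sum(count))

# Draw rows at random (with replacement) and measure how often each type appears
set.seed(42)
draws <- sample(toy$type, size = 10000, replace = TRUE)
simulated <- prop.table(table(draws))

print(type_prob)
print(simulated)
```

With enough draws, the simulated shares should settle close to the `probability` column, which is all the "Low Probability" tag is measuring.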
Movies vs. TV Shows: The dataset contains 3,759 movies and 2,047 shows, so a randomly selected entry is a movie with probability 0.647 and a show with probability 0.353.
Low Probability Group: SHOW has the fewest entries and therefore the lowest probability of selection. We tagged this group as “Low Probability,” meaning TV shows are rarer in this dataset than movies. Shows nevertheless have the higher average IMDb score (7.02 versus 6.27 for movies).
Hypothesis: A possible hypothesis for why movies are more common than TV shows is that Netflix originally focused on building a larger movie catalog, or that producing movies requires fewer resources compared to producing entire TV series.
Further Questions:
Why are movies more prevalent than TV shows in the dataset? Is this a strategic choice or a reflection of user preferences?
Do TV Shows have higher variability in IMDb scores due to episodic nature compared to movies?
Are the low-probability groups underserved, or is their audience niche enough to not warrant an increased catalog?
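One way to start on the variability question is to compare the spread of IMDb scores by type. The sketch below uses a hypothetical `toy` data frame (its scores are made up for illustration) and base R's `var.test()`, an F test that assumes roughly normal scores; treat it as a starting point rather than a definitive analysis.

```r
library(dplyr)

# Hypothetical scores; replace `toy` with netflix_data to run on the real data
toy <- data.frame(
  type = rep(c("MOVIE", "SHOW"), each = 5),
  imdb_score = c(5.1, 6.3, 6.8, 7.4, NA, 6.9, 7.0, 7.2, 7.5, 8.1)
)

# Spread of IMDb scores within each type, ignoring missing scores
spread_by_type <- toy %>%
  group_by(type) %>%
  summarize(n_rated = sum(!is.na(imdb_score)),
            sd_imdb = sd(imdb_score, na.rm = TRUE),
            iqr_imdb = IQR(imdb_score, na.rm = TRUE))

# F test for equal variances between the two types (rows with NA are dropped)
f_test <- var.test(imdb_score ~ type, data = toy)

print(spread_by_type)
print(f_test$p.value)
```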
Group by age_certification and calculate the average IMDb score
# Group by 'age_certification' and calculate average IMDb score
group_by_age <- netflix_data %>%
  group_by(age_certification) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE),
            sd_imdb = sd(imdb_score, na.rm = TRUE))
# Assign probabilities
group_by_age <- group_by_age %>%
  mutate(probability = count / sum(count)) %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))
# Print the result
print(group_by_age)
## # A tibble: 12 × 6
## age_certification count avg_imdb sd_imdb probability tag
## <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 "" 2610 6.28 1.15 0.450 Regular
## 2 "G" 131 6.39 1.32 0.0226 Regular
## 3 "NC-17" 14 6.4 1.44 0.00241 Low Probability
## 4 "PG" 246 6.21 1.15 0.0424 Regular
## 5 "PG-13" 440 6.44 1.02 0.0758 Regular
## 6 "R" 575 6.32 1.05 0.0990 Regular
## 7 "TV-14" 470 7.26 1.02 0.0810 Regular
## 8 "TV-G" 76 6.35 1.22 0.0131 Regular
## 9 "TV-MA" 841 7.07 1.00 0.145 Regular
## 10 "TV-PG" 186 6.92 1.18 0.0320 Regular
## 11 "TV-Y" 105 6.55 1.14 0.0181 Regular
## 12 "TV-Y7" 112 6.91 1.08 0.0193 Regular
# Visualization
ggplot(group_by_age, aes(x = age_certification, y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Age Certification", x = "Age Certification", y = "Average IMDb Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
We grouped the data by the age_certification column (e.g., TV-MA, TV-14, etc.) and calculated how many entries exist for each certification.
We calculated the average and standard deviation of the IMDb score for each certification group.
The probabilities were assigned based on the count, and the smallest group was tagged as “Low Probability.”
Age Certification Distribution: The largest group (2,610 entries, probability 0.450) has an empty age_certification, so almost half of the catalog is unrated in this dataset. Among rated content, TV-MA (841) and R (575) are the most common, while children’s certifications such as TV-Y (105) and TV-G (76) are much smaller.
Low Probability Group: NC-17 has the fewest entries (14, probability 0.00241), so it has the lowest chance of being selected if we randomly pick a row from the dataset.
Hypothesis: We could hypothesize that content aimed at younger audiences is less common on Netflix due to a higher focus on mature content for adult audiences. The standard deviations are similar across certifications (roughly 1.0 to 1.4), so this table alone does not show that TV ratings vary more than movie ratings.
Further Questions:
Why is there more content with certain age certifications, like TV-MA? Is it driven by audience demographics or demand for mature content?
What is the rationale behind the smaller catalog of content aimed at younger audiences?
Does Netflix produce fewer children’s shows (TV-Y) due to competition from platforms like Disney+?
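Because the empty string counts as its own certification group in the table above, it may be worth treating blanks as missing before computing probabilities. The sketch below shows one way to do that with `dplyr::na_if()`; the `toy` rows are hypothetical, and whether unrated titles should be dropped is an analysis choice rather than something the dataset dictates.

```r
library(dplyr)

# Hypothetical rows; "" mimics the blank certifications seen in the real data
toy <- data.frame(
  age_certification = c("", "TV-MA", "", "R", "TV-MA", "TV-Y"),
  imdb_score = c(6.1, 7.3, 5.9, 6.4, 7.0, 6.6)
)

rated_summary <- toy %>%
  # Turn blank certifications into NA so they can be excluded explicitly
  mutate(age_certification = na_if(age_certification, "")) %>%
  filter(!is.na(age_certification)) %>%
  count(age_certification, name = "count") %>%
  # Probabilities are now relative to rated titles only
  mutate(probability = count / sum(count))

print(rated_summary)
```

With blanks removed, the probabilities describe only rated titles, which makes comparisons such as TV-MA versus TV-Y easier to read.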
Group by genres and calculate the average IMDb score
We focus on the top genres, since there are many.
# Group by 'genres' and calculate average IMDb score
group_by_genres <- netflix_data %>%
  filter(genres != "[]") %>%
  group_by(genres) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE))
# Assign probabilities
group_by_genres <- group_by_genres %>%
  mutate(probability = count / sum(count)) %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))
# Print the result
print(group_by_genres)
## # A tibble: 1,625 × 5
## genres count avg_imdb probability tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 ['action', 'animation', 'comedy', 'crime', … 1 6 0.000174 Low …
## 2 ['action', 'animation', 'comedy', 'drama', … 1 4.8 0.000174 Low …
## 3 ['action', 'animation', 'comedy', 'drama', … 1 6.6 0.000174 Low …
## 4 ['action', 'animation', 'comedy', 'drama', … 1 6.3 0.000174 Low …
## 5 ['action', 'animation', 'comedy', 'drama', … 1 8.5 0.000174 Low …
## 6 ['action', 'animation', 'comedy', 'drama', … 1 7 0.000174 Low …
## 7 ['action', 'animation', 'comedy', 'family',… 1 6.6 0.000174 Low …
## 8 ['action', 'animation', 'comedy', 'family',… 1 6.7 0.000174 Low …
## 9 ['action', 'animation', 'comedy', 'family',… 1 5.5 0.000174 Low …
## 10 ['action', 'animation', 'comedy', 'family'] 2 6.35 0.000349 Regu…
## # ℹ 1,615 more rows
# Visualization for top genres
top_genres <- group_by_genres %>%
  top_n(10, avg_imdb)
ggplot(top_genres, aes(x = reorder(genres, -avg_imdb), y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Genre", x = "Genres", y = "Average IMDb Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
We grouped the data by the genres column and calculated how many entries exist for each genre and their average IMDb score.
To make the chart more readable, we focused on the top 10 genres based on IMDb scores.
We again calculated probabilities and tagged the rarest genre as “Low Probability.”
Genre Distribution: Because genres stores the full list of genres for each title, every distinct combination (for example, ['comedy', 'fantasy']) is its own group. That yields 1,625 groups, and nine of the first ten rows shown contain a single title, so single genres like “Action” or “Drama” are not compared directly here.
Low Probability Group: Every combination that occurs exactly once shares the minimum probability (0.000174), so the “Low Probability” tag applies to many groups at once rather than to one niche genre such as “Western” or “Musical.”
Hypothesis: A possible hypothesis is that certain niche genres are less popular or harder to produce, leading to fewer entries.
Further Question:
How does IMDb rating correlate with genre popularity—does a genre’s rating predict its probability of being produced more frequently?
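Since the genres column holds a bracketed list per title, questions about individual genres need that list split into one row per genre first. The sketch below is one possible approach using `gsub()` and `tidyr::separate_rows()`; the `toy` rows are hypothetical, and the regular expression assumes the `['a', 'b']` format seen in the preview above.

```r
library(dplyr)
library(tidyr)

# Hypothetical titles using the same list-style genres strings as the dataset
toy <- data.frame(
  title = c("A", "B", "C"),
  genres = c("['comedy', 'drama']", "['drama']", "['comedy', 'fantasy']"),
  imdb_score = c(7.0, 8.0, 6.0)
)

by_single_genre <- toy %>%
  # Strip brackets and quotes, leaving "comedy, drama"
  mutate(genres = gsub("\\[|\\]|'", "", genres)) %>%
  # One row per (title, genre) pair
  separate_rows(genres, sep = ",\\s*") %>%
  filter(genres != "") %>%
  group_by(genres) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE)) %>%
  arrange(desc(count))

print(by_single_genre)
```

Note that a title with several genres is counted once per genre here, so the counts no longer sum to the number of titles.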
In this step, we are analyzing the combinations of two categorical variables—type (whether the content is a movie or a TV show) and age_certification (the content’s rating like TV-MA, TV-14, etc.)—to understand how Netflix distributes content across these categories.
Combinations of type and age_certification
# Create combinations of 'type' and 'age_certification'
combinations <- netflix_data %>%
  group_by(type, age_certification) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE)) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
# Print the result
print(combinations)
## # A tibble: 13 × 4
## # Groups: type [2]
## type age_certification count avg_imdb
## <chr> <chr> <int> <dbl>
## 1 MOVIE "" 2353 6.22
## 2 SHOW "TV-MA" 841 7.07
## 3 MOVIE "R" 575 6.32
## 4 SHOW "TV-14" 470 7.26
## 5 MOVIE "PG-13" 440 6.44
## 6 SHOW "" 257 6.90
## 7 MOVIE "PG" 246 6.21
## 8 SHOW "TV-PG" 186 6.92
## 9 MOVIE "G" 131 6.39
## 10 SHOW "TV-Y7" 112 6.91
## 11 SHOW "TV-Y" 105 6.55
## 12 SHOW "TV-G" 76 6.35
## 13 MOVIE "NC-17" 14 6.4
We combined the two categorical columns type (Movie/TV Show) and age_certification (e.g., TV-MA, TV-14) to create a data frame of all unique combinations.
We counted how many entries exist for each combination.
Most Common Combinations: The most common combination is MOVIE with an empty age_certification (2,353 entries). Among rated content, SHOW with TV-MA leads (841), followed by MOVIE with R (575), suggesting that Netflix has a large volume of mature content.
Least Common Combinations: MOVIE with NC-17 is the rarest combination (14 entries), followed by SHOW with TV-G (76).
Further Questions:
Which combination is the most prevalent, and why? For example, why are TV-MA shows more common than TV-14 shows?
Does the lack of certain combinations, like TV-Y movies, indicate a strategic decision by Netflix, or does it highlight a content gap that needs to be addressed?
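Raw counts mix two effects: how many movies versus shows exist, and how certifications are distributed within each. A sketch of one way to separate them is to compute each certification's share within its own type. The `toy` rows and the `share_of_type` column name are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical titles: 3 movies and 4 shows
toy <- data.frame(
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW", "SHOW"),
  age_certification = c("R", "R", "PG", "TV-MA", "TV-MA", "TV-MA", "TV-14")
)

share_within_type <- toy %>%
  count(type, age_certification, name = "count") %>%
  group_by(type) %>%
  # Share of each certification among titles of the same type
  mutate(share_of_type = count / sum(count)) %>%
  ungroup() %>%
  arrange(type, desc(share_of_type))

print(share_within_type)
```

Shares within a type sum to 1, so a certification's weight among shows can be compared with another's among movies even though the two types differ in size.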
We’ll identify combinations that do not exist.
# Generate all possible combinations of 'type' and 'age_certification'
# stringsAsFactors = FALSE keeps both columns as character, matching 'combinations'
all_combinations <- expand.grid(type = unique(netflix_data$type),
                                age_certification = unique(netflix_data$age_certification),
                                stringsAsFactors = FALSE)
# Left join with actual data to find missing combinations
missing_combinations <- all_combinations %>%
  left_join(combinations, by = c("type", "age_certification")) %>%
  filter(is.na(count))
# Print missing combinations
print(missing_combinations)
## type age_certification count avg_imdb
## 1 MOVIE TV-MA NA NA
## 2 SHOW R NA NA
## 3 SHOW PG NA NA
## 4 MOVIE TV-14 NA NA
## 5 SHOW G NA NA
## 6 SHOW PG-13 NA NA
## 7 MOVIE TV-PG NA NA
## 8 MOVIE TV-Y NA NA
## 9 MOVIE TV-G NA NA
## 10 MOVIE TV-Y7 NA NA
## 11 SHOW NC-17 NA NA
We identified combinations of type and age_certification that don’t exist in the dataset. These missing combinations could provide insight into content gaps in the Netflix catalog.
Missing Combinations: All 11 missing combinations pair a type with a certification from the other rating scheme: MOVIE never appears with a TV-* certification (TV-MA, TV-14, TV-PG, TV-Y, TV-G, TV-Y7), and SHOW never appears with G, PG, PG-13, R, or NC-17. This indicates that some age certifications are simply not applied to certain content types.
Hypothesis: The gaps are more likely structural than strategic: the TV-* certifications are used for shows and the G-to-NC-17 certifications for movies, so these combinations cannot occur. Children’s shows do exist in the data (for example, 105 TV-Y shows).
Further Questions:
Why are some combinations missing? For instance, are there no TV-Y movies because that certification is reserved for shows, or because Netflix is focused on catering to older audiences?
Is Netflix strategically avoiding certain combinations, or does this indicate a potential opportunity to attract younger viewers with more diverse content offerings?
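As an alternative to building the grid by hand, `tidyr::complete()` can fill in the unobserved combinations directly and give them a count of zero. The sketch below uses a hypothetical `toy` data frame; the names `all_pairs` and `missing_pairs` are illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical titles: movies carry film ratings, shows carry TV ratings
toy <- data.frame(
  type = c("MOVIE", "MOVIE", "SHOW", "SHOW"),
  age_certification = c("R", "PG", "TV-MA", "TV-14")
)

all_pairs <- toy %>%
  count(type, age_certification, name = "count") %>%
  # Add every type x certification pair that never occurs, with count 0
  complete(type, age_certification, fill = list(count = 0L))

# Pairs that are absent from the data
missing_pairs <- filter(all_pairs, count == 0)

print(missing_pairs)
```

Keeping zero-count rows in the same table as the observed ones can make it easier to plot or report the gaps alongside the real counts.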
# Bar plot for combinations
ggplot(combinations, aes(x = type, y = count)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ age_certification) +
  labs(title = "Distribution of Type by Age Certification", x = "Type", y = "Count") +
  theme_minimal()
We visualized the combinations of type and age_certification using a bar plot.
This visualization gives us a sense of which combinations are the most common and which ones are rare. For example, the tallest bars are MOVIE with no certification and SHOW with TV-MA, which suggests that much of the rated catalog targets an adult audience.
Missing combinations may indicate certain restrictions or trends (e.g., no TV-Y rated films).
A hypothesis could be: “Some combinations of age certification and type may not exist because certain formats (e.g., movies) may not cater to specific age groups (like children).”
Further Question:
Could Netflix benefit from exploring underrepresented combinations, or are these less likely to succeed with the current audience?
Overall, this analysis paints a broad picture of Netflix’s content strategy in terms of content type, age certification, and genre. However, many of the results open up further avenues for investigation.
Insight 1: There are more movies (3,759) than TV shows (2,047) in the dataset, but shows have the higher average IMDb score (7.02 versus 6.27).
Insight 2: Some age certifications, like NC-17, TV-G, or TV-Y, are rare, while TV-MA and R are the most common ratings, which suggests that Netflix is focused more on mature content.
Insight 3: Certain combinations of type and age_certification are missing, which mostly reflects that TV and movie certifications are separate schemes, while the most common combinations reflect Netflix’s content strategy.
Further Investigation: We could investigate why certain combinations are missing, or why some groups have low probabilities. For example, is Netflix investing more in mature content due to demand? How do IMDb ratings across different categories and combinations influence user retention and satisfaction?
By summarizing and visualizing the dataset, we have identified trends in content types, age certifications, and genres, which can help draw conclusions about Netflix’s content strategy.