Netflix Dataset Data Dive - Group By and Probabilities

Load the Data

First, let’s load the Netflix dataset and examine its structure.

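# Load the packages used throughout (dplyr for grouping, ggplot2 for plots)
library(dplyr)
library(ggplot2)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")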

# Preview the dataset
head(netflix_data)

##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3

# Check the structure of the dataset
str(netflix_data)

## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

Grouping and Summarizing Data

We’ll group the data by various categorical variables, calculate summaries, and investigate the probabilities of these groups.

This dataset contains columns such as type (Movie/TV Show), age_certification (content rating), genres (the genres of the content), and imdb_score (IMDb rating). In this analysis, we explore these columns to gather insights.

1.1 Group by `type` and calculate the average IMDb score

We start by grouping by type to compare how many movies and shows the dataset contains and how their average IMDb scores differ.

# Group by 'type' and calculate average IMDb score
group_by_type <- netflix_data %>%
  group_by(type) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE))

# Assign probabilities to each group
group_by_type <- group_by_type %>%
  mutate(probability = count / sum(count))

# Tag the smallest group
group_by_type <- group_by_type %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))

# Print the result
print(group_by_type)

## # A tibble: 2 × 5
##   type  count avg_imdb probability tag            
##   <chr> <int>    <dbl>       <dbl> <chr>          
## 1 MOVIE  3759     6.27       0.647 Regular        
## 2 SHOW   2047     7.02       0.353 Low Probability

# Visualization
ggplot(group_by_type, aes(x = type, y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Type", x = "Type", y = "Average IMDb Score") +
  theme_minimal()

Code Summary:

We grouped the data by the column type (whether a Netflix entry is a movie or TV show).
We calculated how many entries exist for each type and the average IMDb score for each group.
We also calculated the probability of randomly selecting an entry from each group and tagged the group with the lowest probability as “Low Probability.”
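The probability column is each group’s count divided by the total number of rows. As a quick check of that arithmetic, the base R sketch below recomputes it from the counts in the printed summary (3,759 movies and 2,047 shows). The names counts, probability, and low_group are illustrative only.

```r
# Counts taken from the printed summary table
counts <- c(MOVIE = 3759, SHOW = 2047)

# Probability of drawing each type when picking one row at random
probability <- counts / sum(counts)
print(round(probability, 3))

# The group with the smallest probability is the one tagged "Low Probability"
low_group <- names(probability)[probability == min(probability)]
print(low_group)
```

With only two groups, the tag always lands on the less common type, so it carries more information in the later sections where there are many groups.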

Interpretation of Results:

Movies vs. TV Shows: The dataset contains 3,759 movies and 2,047 TV shows, so a randomly selected entry is a movie with probability 0.647 and a show with probability 0.353.
Low Probability Group: TV shows are the smaller group and are therefore tagged “Low Probability.” With only two groups, the tag simply marks the less common type; shows still make up over a third of the dataset. Shows also have the higher average IMDb score (7.02 versus 6.27 for movies).
Hypothesis: One possible explanation for the larger movie count is that Netflix originally focused on building a movie catalog, or that a single movie requires fewer resources than an entire TV series. The dataset alone cannot confirm either.
Further Questions:

Why are movies more prevalent than TV shows in the dataset? Is this a strategic choice or a reflection of user preferences?

Do TV Shows have higher variability in IMDb scores due to episodic nature compared to movies?

Are the low-probability groups underserved, or is their audience niche enough to not warrant an increased catalog?
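The question about score variability can be checked by computing a standard deviation per type. The sketch below is a minimal base R version, assuming a hypothetical helper named score_spread_by_type. It is demonstrated on the six rows shown by head(netflix_data), which are far too few to draw conclusions from.

```r
# First six rows as shown by head(netflix_data)
first_rows <- data.frame(
  type = c("SHOW", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"),
  imdb_score = c(NA, 8.3, 8.2, 8.0, 8.1, 8.8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: standard deviation of IMDb score within each type,
# ignoring missing scores
score_spread_by_type <- function(df) {
  tapply(df$imdb_score, df$type, sd, na.rm = TRUE)
}

spread <- score_spread_by_type(first_rows)
print(spread)
```

In this tiny sample SHOW has a single non-missing score, so its standard deviation is NA. Calling score_spread_by_type(netflix_data) on the full data frame would give the comparison the question asks for.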

Visualization:

The bar plot shows the average IMDb scores for movies and TV shows, allowing us to compare the quality (based on ratings) between the two types of content.

1.2 Group by `age_certification` and calculate average IMDb score

# Group by 'age_certification' and calculate average IMDb score
group_by_age <- netflix_data %>%
  group_by(age_certification) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE),
            sd_imdb = sd(imdb_score, na.rm = TRUE))

# Assign probabilities
group_by_age <- group_by_age %>%
  mutate(probability = count / sum(count)) %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))

# Print the result
print(group_by_age)

## # A tibble: 12 × 6
##    age_certification count avg_imdb sd_imdb probability tag            
##    <chr>             <int>    <dbl>   <dbl>       <dbl> <chr>          
##  1 ""                 2610     6.28    1.15     0.450   Regular        
##  2 "G"                 131     6.39    1.32     0.0226  Regular        
##  3 "NC-17"              14     6.4     1.44     0.00241 Low Probability
##  4 "PG"                246     6.21    1.15     0.0424  Regular        
##  5 "PG-13"             440     6.44    1.02     0.0758  Regular        
##  6 "R"                 575     6.32    1.05     0.0990  Regular        
##  7 "TV-14"             470     7.26    1.02     0.0810  Regular        
##  8 "TV-G"               76     6.35    1.22     0.0131  Regular        
##  9 "TV-MA"             841     7.07    1.00     0.145   Regular        
## 10 "TV-PG"             186     6.92    1.18     0.0320  Regular        
## 11 "TV-Y"              105     6.55    1.14     0.0181  Regular        
## 12 "TV-Y7"             112     6.91    1.08     0.0193  Regular

# Visualization
ggplot(group_by_age, aes(x = age_certification, y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Age Certification", x = "Age Certification", y = "Average IMDb Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code Summary:

We grouped the data by the age_certification column (e.g., TV-MA, TV-14, etc.) and calculated how many entries exist for each certification.
We calculated the average and standard deviation IMDb score for each certification group.
The probabilities were assigned based on the count, and the smallest group was tagged as “Low Probability.”
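The printed table includes a group whose age_certification is an empty string (2,610 rows), which takes part in the probability calculation like any other group. The base R sketch below, built from the printed counts, labels that group explicitly and also computes probabilities among rated titles only. The label "Unrated" and the column name probability_among_rated are assumptions made for illustration.

```r
# Counts copied from the printed summary table
by_cert <- data.frame(
  age_certification = c("", "G", "NC-17", "PG", "PG-13", "R",
                        "TV-14", "TV-G", "TV-MA", "TV-PG", "TV-Y", "TV-Y7"),
  count = c(2610, 131, 14, 246, 440, 575, 470, 76, 841, 186, 105, 112),
  stringsAsFactors = FALSE
)

# Label the empty string so it is not mistaken for a real certification
by_cert$age_certification[by_cert$age_certification == ""] <- "Unrated"
by_cert$probability <- by_cert$count / sum(by_cert$count)

# Probabilities restricted to titles that actually carry a certification
rated <- by_cert[by_cert$age_certification != "Unrated", ]
rated$probability_among_rated <- rated$count / sum(rated$count)
print(rated[order(-rated$count), ])
```

Whether to keep or drop the unrated group depends on the question: it matters for "what does a random row look like", but it distorts comparisons between certifications.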

Interpretation of Results:

Age Certification Distribution: The largest group (2,610 entries, probability 0.450) has an empty age_certification, so almost half of the dataset carries no rating. Among rated content, TV-MA (841) and R (575) are the most common, while children’s ratings such as TV-Y (105) and TV-G (76) are much smaller.
Low Probability Group: NC-17 has the fewest entries (14, probability 0.00241), so it is the group least likely to be drawn when picking a row at random.
Hypothesis: Content aimed at younger audiences may be less common on Netflix because the catalog leans toward mature content for adult audiences. The standard deviations are similar across groups (roughly 1.0 to 1.4), so no certification stands out as having markedly more variable ratings.
Further Questions:

Why is there more content with certain age certifications, like TV-MA? Is it driven by audience demographics or demand for mature content?

What is the rationale behind the smaller catalog of content aimed at younger audiences?

Does Netflix produce fewer children’s shows (TV-Y) due to competition from platforms like Disney+?

Visualization:

The bar plot compares average IMDb scores across age certifications. The TV certifications TV-14 (7.26) and TV-MA (7.07) score highest, while the film certifications all fall between 6.21 and 6.44.

1.3 Group by `genres` and calculate average IMDb score

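The genres column stores a list of genres for each title, so grouping on it produces one group per distinct genre combination, and there are many of them. The plot therefore focuses on the top 10.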

# Group by 'genres' and calculate average IMDb score
group_by_genres <- netflix_data %>%
  filter(genres != "[]") %>%
  group_by(genres) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE))

# Assign probabilities
group_by_genres <- group_by_genres %>%
  mutate(probability = count / sum(count)) %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Regular"))

# Print the result
print(group_by_genres)

## # A tibble: 1,625 × 5
##    genres                                       count avg_imdb probability tag  
##    <chr>                                        <int>    <dbl>       <dbl> <chr>
##  1 ['action', 'animation', 'comedy', 'crime', …     1     6       0.000174 Low …
##  2 ['action', 'animation', 'comedy', 'drama', …     1     4.8     0.000174 Low …
##  3 ['action', 'animation', 'comedy', 'drama', …     1     6.6     0.000174 Low …
##  4 ['action', 'animation', 'comedy', 'drama', …     1     6.3     0.000174 Low …
##  5 ['action', 'animation', 'comedy', 'drama', …     1     8.5     0.000174 Low …
##  6 ['action', 'animation', 'comedy', 'drama', …     1     7       0.000174 Low …
##  7 ['action', 'animation', 'comedy', 'family',…     1     6.6     0.000174 Low …
##  8 ['action', 'animation', 'comedy', 'family',…     1     6.7     0.000174 Low …
##  9 ['action', 'animation', 'comedy', 'family',…     1     5.5     0.000174 Low …
## 10 ['action', 'animation', 'comedy', 'family']      2     6.35    0.000349 Regu…
## # ℹ 1,615 more rows

# Visualization for top genres
top_genres <- group_by_genres %>%
  top_n(10, avg_imdb)

ggplot(top_genres, aes(x = reorder(genres, -avg_imdb), y = avg_imdb, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average IMDb Score by Genre", x = "Genres", y = "Average IMDb Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code Summary:

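We grouped the data by the genres column and calculated the count and average IMDb score for each group. Each value is a full list such as ['crime', 'drama'], so the 1,625 groups are genre combinations rather than individual genres.
To make the chart more readable, we kept the 10 combinations with the highest average IMDb score.
We again calculated probabilities and tagged every combination that ties for the lowest probability as “Low Probability.” Because many combinations occur only once, many rows receive this tag.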
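To analyze individual genres rather than whole genre lists, the strings have to be split first. The sketch below is one possible base R approach, using the six genres values and IMDb scores shown by head(netflix_data). The helper name parse_genres is hypothetical, and the parsing assumes every value follows the ['a', 'b'] pattern seen in the output.

```r
# Hypothetical helper: turn "['crime', 'drama']" into c("crime", "drama")
parse_genres <- function(x) {
  cleaned <- gsub("\\[|\\]|'", "", x)   # drop brackets and single quotes
  strsplit(cleaned, ",\\s*")            # split on commas; "[]" yields character(0)
}

# Values taken from the head(netflix_data) output
genres_raw <- c("['documentation']", "['crime', 'drama']", "['comedy', 'fantasy']",
                "['comedy']", "['horror']", "['comedy', 'european']")
scores <- c(NA, 8.3, 8.2, 8.0, 8.1, 8.8)

genre_list <- parse_genres(genres_raw)

# One row per (title, genre) pair, repeating each score once per genre
long <- data.frame(
  genre = unlist(genre_list),
  imdb_score = rep(scores, lengths(genre_list)),
  stringsAsFactors = FALSE
)

genre_counts <- table(long$genre)
genre_avg <- tapply(long$imdb_score, long$genre, mean, na.rm = TRUE)
print(sort(genre_counts, decreasing = TRUE))
print(round(genre_avg, 2))
```

A title with several genres is counted once per genre here, so the per-genre counts add up to more than the number of titles. A genre whose titles all lack a score gets NaN as its average.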

Interpretation of Results:

Genre Distribution: Most genre combinations are rare. In the printed rows, nine of the ten combinations occur once (probability 0.000174), and the tenth occurs twice.
Low Probability Group: No single niche genre is isolated here; every combination with one entry shares the minimum probability. Identifying rare individual genres, such as “Western” or “Musical,” would require splitting the genre lists first.
Hypothesis: Certain niche genres may be less popular or harder to produce, leading to fewer entries, but testing this requires per-genre counts.
Further Question:

How does IMDb rating correlate with genre popularity—does a genre’s rating predict its probability of being produced more frequently?

Visualization:

The plot compares average IMDb ratings across the 10 highest-rated genre combinations. Because most combinations contain a single title, the bars mostly reflect individual titles rather than genre-wide quality.
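Ranking groups by average score favors groups with very few entries, since a single highly rated title can top the list. One way to guard against this is to require a minimum count before ranking. The sketch below is a base R illustration on the ten rows printed in the genres table, identified by row number because the genre labels are truncated in the output. The helper rank_groups and its min_count threshold are assumptions, not part of the original analysis.

```r
# The ten rows printed in the genres summary (labels are truncated there,
# so rows are identified by their position)
printed <- data.frame(
  row = 1:10,
  count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2),
  avg_imdb = c(6, 4.8, 6.6, 6.3, 8.5, 7, 6.6, 6.7, 5.5, 6.35)
)

# Hypothetical helper: keep groups with at least min_count entries,
# then return the n groups with the highest average score
rank_groups <- function(summary_df, n = 10, min_count = 1) {
  kept <- summary_df[summary_df$count >= min_count, , drop = FALSE]
  kept <- kept[order(-kept$avg_imdb), , drop = FALSE]
  head(kept, n)
}

top_any <- rank_groups(printed, n = 3, min_count = 1)
top_repeated <- rank_groups(printed, n = 3, min_count = 2)
print(top_any)
print(top_repeated)
```

Without a threshold the top three are all single-title groups. Requiring at least two entries leaves only one of the ten printed rows.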

Combinations of Categorical Variables

In this step, we are analyzing the combinations of two categorical variables—type (whether the content is a movie or a TV show) and age_certification (the content’s rating like TV-MA, TV-14, etc.)—to understand how Netflix distributes content across these categories.

2.1 Find all combinations of `type` and `age_certification`

# Create combinations of 'type' and 'age_certification'
combinations <- netflix_data %>%
  group_by(type, age_certification) %>%
  summarize(count = n(),
            avg_imdb = mean(imdb_score, na.rm = TRUE)) %>%
  arrange(desc(count))

## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.

# Print the result
print(combinations)

## # A tibble: 13 × 4
## # Groups:   type [2]
##    type  age_certification count avg_imdb
##    <chr> <chr>             <int>    <dbl>
##  1 MOVIE ""                 2353     6.22
##  2 SHOW  "TV-MA"             841     7.07
##  3 MOVIE "R"                 575     6.32
##  4 SHOW  "TV-14"             470     7.26
##  5 MOVIE "PG-13"             440     6.44
##  6 SHOW  ""                  257     6.90
##  7 MOVIE "PG"                246     6.21
##  8 SHOW  "TV-PG"             186     6.92
##  9 MOVIE "G"                 131     6.39
## 10 SHOW  "TV-Y7"             112     6.91
## 11 SHOW  "TV-Y"              105     6.55
## 12 SHOW  "TV-G"               76     6.35
## 13 MOVIE "NC-17"              14     6.4

Code Summary:

We combined the two categorical columns type (Movie/TV Show) and age_certification (e.g., TV-MA, TV-14) to create a data frame of all unique combinations.
We counted how many entries exist for each combination.
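The same counts can be turned into probabilities in two ways: the joint probability of a (type, age_certification) pair over the whole dataset, and the conditional probability of a certification within its type. The base R sketch below computes both from the counts in the printed combinations table. The column names joint_prob and prob_within_type are illustrative.

```r
# Counts copied from the printed combinations table
combos <- data.frame(
  type = c("MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE",
           "SHOW", "MOVIE", "SHOW", "SHOW", "SHOW", "MOVIE"),
  age_certification = c("", "TV-MA", "R", "TV-14", "PG-13", "", "PG",
                        "TV-PG", "G", "TV-Y7", "TV-Y", "TV-G", "NC-17"),
  count = c(2353, 841, 575, 470, 440, 257, 246, 186, 131, 112, 105, 76, 14),
  stringsAsFactors = FALSE
)

# Joint probability: share of all rows that fall in each combination
combos$joint_prob <- combos$count / sum(combos$count)

# Conditional probability: share of rows within the same type
type_totals <- ave(combos$count, combos$type, FUN = sum)
combos$prob_within_type <- combos$count / type_totals

print(combos, digits = 3)
```

The conditional view makes the two types comparable despite their different sizes, for example the share of shows rated TV-MA against the share of movies rated R.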

Interpretation of Results:

Most Common Combinations: The most common combination is movies with no age certification (2,353 entries), followed by TV shows rated TV-MA (841) and movies rated R (575). Among rated content, mature ratings lead for both types.
Least Common Combinations: Movies rated NC-17 (14) are the rarest, followed by TV shows rated TV-G (76) and TV-Y (105).
Further Questions:

Which combination is the most prevalent, and why? For example, why are TV-MA shows (841) more common than TV-14 shows (470)?

Does the absence of certain combinations, like TV-Y movies, indicate a strategic decision by Netflix, or does it simply reflect that TV ratings are not assigned to movies?

2.2 Find missing combinations

We’ll identify combinations that do not exist.

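# Generate all possible combinations of 'type' and 'age_certification'
# (stringsAsFactors = FALSE keeps both columns as character for the join)
all_combinations <- expand.grid(type = unique(netflix_data$type),
                                age_certification = unique(netflix_data$age_certification),
                                stringsAsFactors = FALSE)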

# Left join with actual data to find missing combinations
missing_combinations <- all_combinations %>%
  left_join(combinations, by = c("type", "age_certification")) %>%
  filter(is.na(count))

# Print missing combinations
print(missing_combinations)

##     type age_certification count avg_imdb
## 1  MOVIE             TV-MA    NA       NA
## 2   SHOW                 R    NA       NA
## 3   SHOW                PG    NA       NA
## 4  MOVIE             TV-14    NA       NA
## 5   SHOW                 G    NA       NA
## 6   SHOW             PG-13    NA       NA
## 7  MOVIE             TV-PG    NA       NA
## 8  MOVIE              TV-Y    NA       NA
## 9  MOVIE              TV-G    NA       NA
## 10 MOVIE             TV-Y7    NA       NA
## 11  SHOW             NC-17    NA       NA

Code Summary:

We checked for combinations of type and age_certification that don’t exist in the dataset. These missing combinations could provide insight into content gaps in the Netflix catalog.
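Missing combinations can also be found without a join: cross-tabulating the counts fills every absent pair with zero. The sketch below is a base R alternative built from the counts in the printed combinations table, not the method used above. The object names tab and missing_pairs are illustrative.

```r
# Observed combinations and counts from the printed table
combos <- data.frame(
  type = c("MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE",
           "SHOW", "MOVIE", "SHOW", "SHOW", "SHOW", "MOVIE"),
  age_certification = c("", "TV-MA", "R", "TV-14", "PG-13", "", "PG",
                        "TV-PG", "G", "TV-Y7", "TV-Y", "TV-G", "NC-17"),
  count = c(2353, 841, 575, 470, 440, 257, 246, 186, 131, 112, 105, 76, 14),
  stringsAsFactors = FALSE
)

# Cross-tabulate: every (type, certification) cell appears, absent ones as 0
tab <- xtabs(count ~ type + age_certification, data = combos)

# Flatten the table and keep the cells with no entries
cells <- as.data.frame(tab)
missing_pairs <- cells[cells$Freq == 0, c("type", "age_certification")]
print(missing_pairs)
```

This yields the same 11 missing pairs as the join (2 types times 12 certification values, minus the 13 observed combinations).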

Interpretation of Results:

Missing Combinations: All 11 missing combinations pair a type with a rating from the other rating system: movies never carry TV ratings (TV-Y, TV-Y7, TV-G, TV-PG, TV-14, TV-MA), and shows never carry film ratings (G, PG, PG-13, R, NC-17). Children’s TV shows are not missing; TV-Y shows account for 105 entries.
Hypothesis: The gaps most likely reflect how age certifications are assigned to each content type rather than a decision by Netflix to avoid particular audiences.
Further Questions:

Why are some combinations missing? For instance, are there no TV-Y movies because TV ratings are only assigned to shows, or because of how this dataset was compiled?

Within each rating system, do the smaller groups (such as G movies or TV-G shows) indicate a potential opportunity to attract younger viewers with more diverse content offerings?
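One way to probe these questions is to check whether each missing pair simply mixes a content type with the other format’s rating scale. The base R sketch below classifies the 11 missing pairs from the printed output. The split into film ratings (G, PG, PG-13, R, NC-17) and TV ratings (TV-Y through TV-MA) is an assumption about how the dataset’s certifications are defined, and the column names are illustrative.

```r
# Assumed rating scales: film ratings versus TV ratings
film_ratings <- c("G", "PG", "PG-13", "R", "NC-17")
tv_ratings <- c("TV-Y", "TV-Y7", "TV-G", "TV-PG", "TV-14", "TV-MA")

# The 11 missing pairs, copied from the printed output
missing_pairs <- data.frame(
  type = c("MOVIE", "SHOW", "SHOW", "MOVIE", "SHOW", "SHOW",
           "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"),
  age_certification = c("TV-MA", "R", "PG", "TV-14", "G", "PG-13",
                        "TV-PG", "TV-Y", "TV-G", "TV-Y7", "NC-17"),
  stringsAsFactors = FALSE
)

# Which scale does each certification belong to?
missing_pairs$rating_system <- ifelse(
  missing_pairs$age_certification %in% film_ratings, "film",
  ifelse(missing_pairs$age_certification %in% tv_ratings, "tv", "unknown")
)

# A pair is cross-system when a movie has a TV rating or a show has a film rating
missing_pairs$cross_system <-
  (missing_pairs$type == "MOVIE" & missing_pairs$rating_system == "tv") |
  (missing_pairs$type == "SHOW" & missing_pairs$rating_system == "film")

print(missing_pairs)
print(all(missing_pairs$cross_system))
```

If every missing pair is cross-system, the gaps say more about how ratings are assigned than about which audiences the catalog serves.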

2.3 Visualization of Combinations

# Bar plot for combinations
ggplot(combinations, aes(x = type, y = count)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ age_certification) +
  labs(title = "Distribution of Type by Age Certification", x = "Type", y = "Count") +
  theme_minimal()

Code Summary:

We visualized the combinations of type and age_certification using a bar plot.

Interpretation:

This visualization gives us a sense of which combinations are the most common and which ones are rare. Movies with no certification form the tallest bar, and among rated content TV-MA shows and R movies stand out, which suggests the rated catalog leans toward adult audiences.
Missing combinations appear as absent bars and may indicate certain restrictions or trends (e.g., no TV-Y rated films).

A hypothesis could be: “Some combinations of age certification and type may not exist because each rating system applies to only one format (e.g., TV ratings are not given to movies).”
Further Question:

Could Netflix benefit from exploring underrepresented combinations, or are these less likely to succeed with the current audience?

Conclusion

Overall, this analysis gives a first picture of Netflix’s catalog in terms of content type, age certification, and genre. Many of the results open up further avenues for investigation.

Insight 1: There are more movies (3,759) than TV shows (2,047) in the dataset, while TV shows have the higher average IMDb score (7.02 versus 6.27).
Insight 2: Nearly half of the entries (45%) have no age certification. Among rated content, children’s ratings such as TV-Y and TV-G are rare compared with TV-MA and R, which suggests a focus on mature content. NC-17 is the rarest certification, with 14 entries.
Insight 3: The 11 missing combinations of type and age_certification all pair a type with the other format’s rating system, so they reflect how ratings are assigned rather than content gaps. The most common rated combinations are TV-MA shows and R movies.
Further Investigation: We could investigate why some groups have low probabilities and why so many entries lack a certification. For example, is Netflix investing more in mature content due to demand? How do IMDb ratings across different categories and combinations influence user retention and satisfaction?

By summarizing and visualizing the dataset, we have identified trends in content types, age certifications, and genres, which can help draw conclusions about Netflix’s content strategy.

Junaid Ahmed Mohammed

2024-09-17
