# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3
# Summary statistics for runtime and imdb_score
summary(netflix_data$runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 44.00 84.00 77.64 105.00 251.00
summary(netflix_data$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.500 5.800 6.600 6.533 7.400 9.600 523
# Quantiles for runtime
quantile(netflix_data$runtime, probs = seq(0, 1, 0.25))
## 0% 25% 50% 75% 100%
## 0 44 84 105 251
# Quantiles for imdb_score
quantile(netflix_data$imdb_score, probs = seq(0, 1, 0.25), na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 1.5 5.8 6.6 7.4 9.6
We focus on summarizing two key numeric columns:
runtime: the length of a movie or show, in minutes.
imdb_score: a rating out of 10 from the IMDb website.
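Summary Statistics:
For runtime, summary() reports the minimum (0), first quartile (44), median (84), mean (77.64), third quartile (105), and maximum (251). Together these describe the spread of runtimes in the dataset.
For imdb_score, summary() reports the same statistics plus a count of 523 missing values (NA), and quantile() needs na.rm = TRUE to exclude them. The quartiles (5.8, 6.6, 7.4) show that half of the rated titles score between 5.8 and 7.4.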
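Key Insights:
Runtimes range from 0 to 251 minutes. The mean (77.64) sits below the median (84), which points to a cluster of short content pulling the average down. A minimum of 0 may be a data-entry gap and is worth checking.
IMDb scores range from 1.5 to 9.6 and center on a median of 6.6, close to the mean of 6.533.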
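To see how missing values affect these summaries, the sketch below runs the same kind of calculation on a small made-up vector. The values in scores are hypothetical and are not taken from the Netflix dataset.

```r
# Hypothetical scores (not from the Netflix data), with two missing values
scores <- c(5.8, 6.6, 7.4, NA, 8.1, NA, 4.9)

# Count missing values before summarizing
n_missing <- sum(is.na(scores))

# Quartiles computed on the non-missing values only
q <- quantile(scores, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Interquartile range: the spread of the middle half of the data
iqr <- IQR(scores, na.rm = TRUE)

n_missing
q
iqr
```

Without na.rm = TRUE, quantile() stops with an error when the input contains NA, which is why the imdb_score call sets it while the runtime call does not need to.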
# Count titles for each value of 'type' (MOVIE or SHOW)
table(netflix_data$type)
##
## MOVIE SHOW
## 3759 2047
# Count titles for each value of 'age_certification'
table(netflix_data$age_certification)
##
##           G NC-17    PG PG-13     R TV-14  TV-G TV-MA TV-PG  TV-Y TV-Y7
##  2610   131    14   246   440   575   470    76   841   186   105   112
# Ten most frequent values of 'genres' (each value is a full genre list)
head(sort(table(netflix_data$genres), decreasing = TRUE), 10)
##
## ['comedy'] ['drama']
## 510 350
## ['documentation'] ['comedy', 'drama']
## 320 141
## ['drama', 'comedy'] ['reality']
## 128 120
## ['drama', 'romance'] ['comedy', 'documentation']
## 112 93
## ['animation'] []
## 69 68
Here we analyze the categorical columns:
type: counts the number of SHOW and MOVIE titles.
age_certification: counts the titles under each age certification (e.g., TV-MA, PG-13).
genres: counts the 10 most frequent genre lists in the dataset.
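Key Insights:
The type count tells us how balanced the dataset is between movies and shows: 3,759 movies against 2,047 shows, or roughly 65% movies.
The age_certification count informs us about the distribution of age ratings, useful for analyzing the target audience. TV-MA is the most common rating (841 titles), but the first count (2,610) belongs to titles with a blank certification, so about 45% of the dataset has no rating recorded.
The genres count shows the most frequent genres, which helps us understand the type of content available on Netflix. Comedy (510), drama (350), and documentation (320) lead. Because each value is a full list, ['comedy', 'drama'] (141) and ['drama', 'comedy'] (128) are counted separately, and 68 titles have an empty list.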
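Since each genres value is a whole list stored as text, one possible way to count individual genres is to split the strings first. The sketch below assumes the bracket-and-quote format shown in the output above and uses made-up strings, not the real column.

```r
# Hypothetical genre strings in the same format as the 'genres' column
genres_raw <- c("['comedy', 'drama']", "['drama', 'comedy']", "['comedy']", "[]")

# Strip the brackets and single quotes, leaving comma-separated names
cleaned <- gsub("\\[|\\]|'", "", genres_raw)

# Split each string into its individual genres; "[]" becomes an empty vector
genre_list <- strsplit(cleaned, ", ", fixed = TRUE)

# Flatten to one entry per (title, genre) pair and count each genre
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)
genre_counts
```

Counted this way, the two orderings of comedy and drama contribute to the same genre totals, and titles with an empty list drop out of the count.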
To address one of the questions above, I use the dplyr functions group_by() and summarize() to compare the average runtime of movies and TV shows.
# Load dplyr for %>%, group_by(), and summarize()
library(dplyr)
# Group by type and calculate average runtime
avg_runtime <- netflix_data %>%
  group_by(type) %>%
  summarize(average_runtime = mean(runtime, na.rm = TRUE))
avg_runtime
## # A tibble: 2 × 2
## type average_runtime
## <chr> <dbl>
## 1 MOVIE 98.8
## 2 SHOW 38.8
We group the data by type (SHOW or MOVIE) and calculate the average runtime for each. This gives insight into how the typical duration differs between shows and movies.
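Key Insights: Movies average 98.8 minutes against 38.8 minutes for shows, so a typical movie runs about 2.5 times as long as a typical show.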
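A mean on its own can hide skew, so it may help to report the group size and median alongside it. The following is a minimal sketch on a made-up data frame; toy and its values are hypothetical, and the column names n_titles, mean_runtime, and median_runtime are chosen only for illustration.

```r
library(dplyr)

# Hypothetical mini-dataset (not the Netflix data)
toy <- data.frame(
  type    = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  runtime = c(90, 110, 200, 30, 50)
)

# One row per type, with the count, mean, and median of runtime
runtime_by_type <- toy %>%
  group_by(type) %>%
  summarize(
    n_titles       = n(),
    mean_runtime   = mean(runtime, na.rm = TRUE),
    median_runtime = median(runtime, na.rm = TRUE)
  )

runtime_by_type
```

In this toy data the single 200-minute movie pulls the movie mean well above the movie median, which is the kind of gap the extra columns are meant to expose.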
# Load ggplot2 for plotting
library(ggplot2)
# Histogram of runtime
ggplot(netflix_data, aes(x = runtime)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Runtime", x = "Runtime (minutes)", y = "Frequency")
This histogram visualizes the distribution of runtime for all content. The binwidth is set to 10 minutes.
Interpretation of the Plot:
The histogram shows the frequency of content with different runtimes.
Key Insights: A peak in short runtimes may indicate a prevalence of shows, while a tail at the higher end might reflect longer movies.
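One way to check whether the short-runtime peak really comes from shows is to draw a separate panel per type. The sketch below does this with facet_wrap() on a made-up data frame; toy and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical mini-dataset (not the Netflix data)
toy <- data.frame(
  type    = rep(c("MOVIE", "SHOW"), each = 5),
  runtime = c(85, 95, 100, 120, 150, 22, 25, 30, 45, 60)
)

# Same histogram as above, split into one panel per type
p <- ggplot(toy, aes(x = runtime)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  facet_wrap(~ type, ncol = 1) +
  labs(title = "Distribution of Runtime by Type",
       x = "Runtime (minutes)", y = "Frequency")

p
```

Stacking the panels in one column keeps the x-axis shared, so the two distributions can be compared directly.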
# Filter data for MOVIES only
movies_data <- netflix_data %>%
  filter(type == "MOVIE")
# Scatter plot of imdb_score vs tmdb_score
ggplot(movies_data, aes(x = imdb_score, y = tmdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  labs(title = "IMDB Score vs TMDB Score (Movies)", x = "IMDB Score", y = "TMDB Score") +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 490 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 490 rows containing missing values or values outside the scale range
## (`geom_point()`).
This scatter plot shows the relationship between imdb_score and tmdb_score for movies. A linear regression line (red) is added to help visualize any trends. As the warnings report, the 490 movies missing either score are dropped from the plot.
Interpretation of the Plot:
The points represent individual movies with their IMDb and TMDB ratings.
Key Insights: If the points are tightly clustered along the red line, it suggests a strong positive correlation between the two rating systems. Outliers indicate movies with differing IMDb and TMDB scores.
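The strength of the relationship can also be put into a number rather than read off the plot. The sketch below computes a Pearson correlation and the fitted line on made-up scores; toy_scores and its values are hypothetical and are not the Netflix data.

```r
# Hypothetical scores (not from the Netflix data); NA marks a missing rating
toy_scores <- data.frame(
  imdb_score = c(8.3, 8.2, 8.0, 8.1, 6.0, NA),
  tmdb_score = c(8.2, 7.8, 7.8, 7.7, 6.5, 7.0)
)

# Pearson correlation using only rows where both scores are present
r <- cor(toy_scores$imdb_score, toy_scores$tmdb_score, use = "complete.obs")

# Least-squares line of tmdb_score on imdb_score; lm() drops rows with NA by default
fit <- lm(tmdb_score ~ imdb_score, data = toy_scores)

r
coef(fit)
```

A value of r close to 1 would correspond to points lying tightly along the red line, while a value near 0 would mean the two rating systems barely agree.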
# Count the occurrences of each genre and age certification pair
genres_age <- netflix_data %>%
  group_by(genres, age_certification) %>%
  tally()
# List the top 10 genres by total count (slice_max() supersedes top_n())
top_10_genres <- genres_age %>%
  group_by(genres) %>%
  summarise(total_count = sum(n)) %>%
  arrange(desc(total_count)) %>%
  slice_max(total_count, n = 10)
# Keep only the rows that belong to the top 10 genres
genres_age_top10 <- genres_age %>%
  filter(genres %in% top_10_genres$genres)
# Bar plot for the top 10 genres and age certification
ggplot(genres_age_top10, aes(x = reorder(genres, -n), y = n, fill = age_certification)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Top 10 Genres by Age Certification", x = "Genres", y = "Count")
Note: The bar graph was hard to read with every genre included, so only the top 10 genres are plotted.
This bar plot shows the distribution of genres across different age certifications (e.g., TV-MA, PG-13). The bars are color-coded by age_certification, with position = "dodge" placing them side by side for comparison.
Interpretation of the Plot:
The plot shows how genres are distributed across different age groups. For example, you might see that certain genres (like drama or crime) are more common in the TV-MA category.
Key Insights: This helps in understanding the relationship between content type and target audience (e.g., adult-oriented vs family-friendly genres).
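Raw counts favor the largest genres, so comparing shares within each genre can make the audience pattern easier to read. The following sketch builds a cross-tabulation on a made-up data frame and converts it to row proportions; toy and its values are hypothetical, and treating blank certifications as missing is an assumption about how they should be handled.

```r
# Hypothetical mini-dataset (not the Netflix data)
toy <- data.frame(
  genres = c("['drama']", "['drama']", "['drama']",
             "['comedy']", "['comedy']", "['comedy']"),
  age_certification = c("TV-MA", "TV-MA", "R", "PG", "TV-MA", "")
)

# Treat blank certifications as missing so they do not form their own category
toy$age_certification[toy$age_certification == ""] <- NA

# Cross-tabulate genre against certification; table() drops NA by default
counts <- table(toy$genres, toy$age_certification)

# Row-wise proportions: the share of each certification within a genre
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Each row of shares sums to 1, so genres of very different sizes can be compared on the same scale.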
Summary of the insights gathered from the data and visualizations:
The distribution of runtime shows that a significant amount of content has shorter durations, likely due to shows with shorter episodes.
There appears to be a positive correlation between imdb_score and tmdb_score for movies, although some outliers suggest discrepancies in ratings between the platforms.
Certain genres are predominantly associated with specific age certifications, which provides insight into content targeting different age groups.