Netflix is one of the world’s leading streaming services, offering a vast array of TV shows and movies to millions of users across the globe. With such a diverse content library, understanding viewer preferences and trends is crucial for both enhancing user experience and guiding content strategy. This is where the need for a comprehensive dashboard becomes evident. A dashboard that lets users explore genre distributions and popular shows and movies helps identify which genres viewers favor most. This data-driven approach allows Netflix to tailor its recommendations and content acquisitions so that it continues to meet the evolving tastes of its audience.
We aim to load and analyze a dataset of Netflix titles to understand genre popularity. This will involve inspecting and cleansing the data, followed by data manipulation and transformation, culminating in actionable business recommendations.
First, we load the dataset and inspect its structure to identify any issues:
## 'data.frame': 5791 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm154986" "tm127384" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Deliverance" "Monty Python and the Holy Grail" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock ta"| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ ...
## $ release_year : int 1945 1976 1972 1975 1967 1969 1979 1971 1967 1980 ...
## $ age_certification : chr "TV-MA" "R" "R" "PG" ...
## $ runtime : int 51 114 109 91 150 30 94 102 110 104 ...
## $ genres : chr "documentation" "drama" "drama" "fantasy" ...
## $ production_countries: chr "['US']" "['US']" "['US']" "['GB']" ...
## $ seasons : int 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0068473" "tt0071853" ...
## $ imdb_score : num NA 8.2 7.7 8.2 7.7 8.8 8 7.7 7.7 5.8 ...
## $ imdb_votes : int NA 808582 107673 534486 72662 73424 395024 155051 112048 69844 ...
## $ tmdb_popularity : num 0.6 41 10 15.5 20.4 ...
## $ tmdb_score : num NA 8.18 7.3 7.81 7.6 ...
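The structure listing above is the kind of output `str()` prints. As a minimal sketch of the load-and-inspect step, the block below writes a three-row stand-in file and reads it back. The file name `netflix_titles.csv` and the reduced column set are assumptions made for illustration; the report does not state where the data come from.

```r
# Hypothetical file name; the report does not say where the data are stored.
csv_path <- file.path(tempdir(), "netflix_titles.csv")

# Write a three-row stand-in so the sketch runs on its own.
sample_rows <- data.frame(
  id = c("ts300399", "tm84618", "tm154986"),
  title = c("Five Came Back: The Reference Films", "Taxi Driver", "Deliverance"),
  type = c("SHOW", "MOVIE", "MOVIE"),
  release_year = c(1945L, 1976L, 1972L),
  seasons = c(1L, NA, NA),
  imdb_score = c(NA, 8.2, 7.7)
)
write.csv(sample_rows, csv_path, row.names = FALSE)

# Load the file; stringsAsFactors = FALSE keeps text columns as character.
netflix <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: dimensions, column types, and the first values.
str(netflix)
```

With the real file, only the `read.csv()` and `str()` calls would be needed.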
To ensure data quality, we will handle any missing values, rename columns if necessary, and convert data types as needed:
To determine whether the data contain any missing values, we can use the anyNA() function:
## [1] TRUE
The output TRUE from anyNA(netflix) indicates that the netflix dataset contains at least one missing value (NA). It does not say where the missing values are or how many there are.
To check for missing values in each column, we can use the is.na() and colSums() functions.
## id title type
## 0 0 0
## description release_year age_certification
## 0 0 0
## runtime genres production_countries
## 0 0 0
## seasons imdb_id imdb_score
## 3710 0 429
## imdb_votes tmdb_popularity tmdb_score
## 443 82 284
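Each number is the count of missing values in the corresponding column. A count of 0 means the column has no missing values. Five columns contain NAs: seasons (3,710), imdb_score (429), imdb_votes (443), tmdb_popularity (82), and tmdb_score (284). The large count for seasons is expected, since the sample rows above show NA in seasons for movies.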
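The two checks described above can be sketched on a toy data frame. The `toy` data frame and its values are illustrative assumptions, not rows from the real dataset; the percentage line is an extra that is not part of the original analysis.

```r
# Toy data frame standing in for the netflix data (values are illustrative).
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  seasons = c(1L, NA, NA, 4L),
  imdb_score = c(NA, 8.2, 7.7, 8.8),
  stringsAsFactors = FALSE
)

# TRUE as soon as at least one cell is NA.
any_missing <- anyNA(toy)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
missing_per_column <- colSums(is.na(toy))

# Share of missing values per column, as a percentage.
missing_pct <- round(100 * colMeans(is.na(toy)), 1)

print(any_missing)
print(missing_per_column)
print(missing_pct)
```

Reporting the share alongside the count makes it easier to judge whether a column such as seasons is mostly empty by design or is missing data unexpectedly.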
To convert data types, we can use the following functions:
* as.character()
* as.Date()
* as.integer()
* as.numeric()
* as.factor()
# Convert data types
netflix$id <- as.character(netflix$id)
netflix$title <- as.character(netflix$title)
netflix$type <- as.character(netflix$type)
netflix$description <- as.character(netflix$description)
netflix$age_certification <- as.character(netflix$age_certification)
netflix$genres <- as.character(netflix$genres)
netflix$production_countries <- as.character(netflix$production_countries)
netflix$imdb_id <- as.character(netflix$imdb_id)
netflix$seasons <- as.factor(netflix$seasons)
To check the result, we can use the str() function:
## 'data.frame': 5791 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm154986" "tm127384" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Deliverance" "Monty Python and the Holy Grail" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock ta"| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ ...
## $ release_year : int 1945 1976 1972 1975 1967 1969 1979 1971 1967 1980 ...
## $ age_certification : chr "TV-MA" "R" "R" "PG" ...
## $ runtime : int 51 114 109 91 150 30 94 102 110 104 ...
## $ genres : chr "documentation" "drama" "drama" "fantasy" ...
## $ production_countries: chr "['US']" "['US']" "['US']" "['GB']" ...
## $ seasons : Factor w/ 26 levels "1","2","3","4",..: 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0068473" "tt0071853" ...
## $ imdb_score : num NA 8.2 7.7 8.2 7.7 8.8 8 7.7 7.7 5.8 ...
## $ imdb_votes : int NA 808582 107673 534486 72662 73424 395024 155051 112048 69844 ...
## $ tmdb_popularity : num 0.6 41 10 15.5 20.4 ...
## $ tmdb_score : num NA 8.18 7.3 7.81 7.6 ...
The column types of the netflix dataset are now set as needed. The missing values identified earlier are still present as NA, and str() continues to report all 5,791 observations, so no rows have been removed at this stage. The dataset describes the TV shows and movies available on Netflix and supports analysis of content preferences, popularity, and production details.
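The str() output lists an empty string as the first imdb_id value, yet the per-column count reports 0 missing values for that column, because is.na() does not treat "" as missing. The sketch below shows one way to recode empty strings as NA and what as.factor() does with NA. The `toy` data frame is an illustrative assumption, and the recoding step is a suggestion rather than something the original analysis performs.

```r
# Toy data frame with an empty string and NAs (values are illustrative).
toy <- data.frame(
  imdb_id = c("", "tt0075314", "tt0068473"),
  seasons = c(1L, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() does not flag empty strings, so nothing is counted here.
missing_before <- sum(is.na(toy$imdb_id))

# Recode empty strings as NA so they are counted as missing.
toy$imdb_id[toy$imdb_id == ""] <- NA
missing_after <- sum(is.na(toy$imdb_id))

# as.factor() keeps NA as NA; it does not become a factor level.
toy$seasons <- as.factor(toy$seasons)
season_levels <- levels(toy$seasons)

print(c(before = missing_before, after = missing_after))
print(season_levels)
```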
We will manipulate and transform the data to derive meaningful insights, such as identifying the most popular genres:
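# Mean IMDb votes per type and genre
# (the formula interface drops rows with NA imdb_votes by default)
agg_mean <- aggregate(imdb_votes ~ type + genres, data = netflix, FUN = mean)
# Sort by mean votes in descending order
agg_mean_sorted <- agg_mean[order(-agg_mean$imdb_votes), ]
agg_mean_sorted
The output shows that the genre with the highest average votes is western (type MOVIE), followed by crime (type MOVIE).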
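A group mean can be driven by very few titles, so it can help to report the group size next to it. The sketch below applies the same aggregate() call to a toy data frame and adds a title count per group. The `toy` data, the `n_titles` column, and the merge step are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame (values are illustrative, not taken from the dataset).
toy <- data.frame(
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  genres = c("western", "western", "crime", "drama", "drama"),
  imdb_votes = c(1000, 3000, 500, NA, 200)
)

# Mean votes per type/genre; the formula method drops NA rows by default.
agg_mean <- aggregate(imdb_votes ~ type + genres, data = toy, FUN = mean)

# Number of titles behind each mean, to flag groups with few titles.
agg_n <- aggregate(imdb_votes ~ type + genres, data = toy, FUN = length)
names(agg_n)[names(agg_n) == "imdb_votes"] <- "n_titles"

# Combine both summaries and sort by mean votes, highest first.
agg <- merge(agg_mean, agg_n, by = c("type", "genres"))
agg <- agg[order(-agg$imdb_votes), ]
print(agg)
```

In this toy data the SHOW drama group has two rows, but only one enters the mean because the other has no vote count.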
# Subset data for MOVIE and SHOW types
movies <- subset(netflix, type == "MOVIE")
shows <- subset(netflix, type == "SHOW")
# Create frequency tables for genres
freq_movies <- table(movies$genres)
freq_shows <- table(shows$genres)
# Convert frequency tables to data frames
freq_movies_df <- as.data.frame(freq_movies)
freq_shows_df <- as.data.frame(freq_shows)
# Rename columns for clarity
names(freq_movies_df) <- c("genres", "Frequency_Movie")
names(freq_shows_df) <- c("genres", "Frequency_Show")
# Sort data frames by frequency in descending order
freq_movies_df <- freq_movies_df[order(-freq_movies_df$Frequency_Movie), ]
freq_shows_df <- freq_shows_df[order(-freq_shows_df$Frequency_Show), ]
# Display the frequency tables
freq_movies_df
freq_shows_df
These lists show the genres for both types of content (MOVIE and SHOW) on Netflix, sorted by frequency of occurrence in descending order.
Based on the frequency analysis of genres for both types, we can draw the following conclusions:
Popular genres in TV shows (SHOW): drama and comedy are the top genres, followed by documentation.
Popular genres in movies (MOVIE): comedy and drama are the top genres, followed by documentation.
We can also use xtabs() to show the genre frequencies for both movies and TV shows:
## genres
## type action animation comedy crime documentation drama family fantasy
## MOVIE 232 130 961 122 414 882 56 85
## SHOW 133 187 344 116 251 539 57 3
## genres
## type history horror music reality romance scifi sport thriller war western
## MOVIE 15 107 55 2 222 75 3 311 25 13
## SHOW 6 7 4 169 10 164 1 66 21 3
The output above leads to the same conclusion. Popular genres in TV shows (SHOW): drama and comedy are the top genres, followed by documentation. Popular genres in movies (MOVIE): comedy and drama are the top genres, followed by documentation.
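A cross-tabulation like the one above can be produced with a one-sided formula in xtabs(). The sketch below uses a toy data frame (an illustrative assumption) and also converts the counts to row-wise shares, an extra step that is not part of the original analysis.

```r
# Toy data frame (values are illustrative, not taken from the dataset).
toy <- data.frame(
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"),
  genres = c("comedy", "comedy", "drama", "drama", "drama", "comedy")
)

# Cross-tabulate: one row per type, one column per genre.
genre_counts <- xtabs(~ type + genres, data = toy)
print(genre_counts)

# Row-wise shares make the two types comparable despite different totals.
genre_share <- round(prop.table(genre_counts, margin = 1), 2)
print(genre_share)
```

Shares are useful here because the dataset holds more movies than shows, so raw counts alone favor the MOVIE row.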
# Subset the top 10 titles with highest IMDb scores
top_10_titles <- netflix[order(-netflix$imdb_score), ][1:10, c("type","genres","title", "imdb_score")]
# Display the top 10 titles with IMDb scores
top_10_titles
The output shows that the title with the highest IMDb score is Breaking Bad, followed by Khawatir.
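The ranking above relies on how order() treats missing scores. The sketch below, on a toy data frame (an illustrative assumption), shows that NA scores are placed last by default and uses head() as an alternative to indexing with 1:10.

```r
# Toy data frame with one missing score (values are illustrative).
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  imdb_score = c(8.2, NA, 9.5, 7.7),
  stringsAsFactors = FALSE
)

# order() puts NA last by default, so unscored titles sink to the bottom.
ranked <- toy[order(-toy$imdb_score), ]

# head() returns at most n rows, so it is safe when fewer than n rows exist.
top_3 <- head(ranked, 3)
print(top_3)
```

Indexing with a fixed range such as 1:10 pads the result with all-NA rows when the data frame is shorter than the range, which head() avoids.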
First, the average IMDb votes by genre and type (MOVIE or SHOW) show that westerns and crime titles in the MOVIE category receive the highest average votes, which suggests strong viewer engagement with these genres. The western average rests on a small sample, however, since the dataset contains only 13 western movies.
Second, the frequency of genres across movies and TV shows highlights a clear pattern. Among movies, comedy (961 titles) and drama (882) are the most prevalent genres. Among TV shows, drama (539) and comedy (344) also dominate, reflecting a similar emphasis on these genres across the Netflix catalog.
Finally, the titles with the highest IMDb scores are led by “Breaking Bad” and “Khawatir”, which points to the strong ratings and viewer satisfaction these series have earned.
In summary, the Netflix dataset shows that comedy and drama are consistently the most common genres across both movies and TV shows on Netflix.