1. OVERVIEW

knitr::include_graphics("assets/netflix.jpg")

Netflix is one of the world’s leading streaming services, offering a vast array of TV shows and movies to millions of users across the globe. With such a diverse content library, understanding viewer preferences and trends is crucial for both enhancing user experience and guiding content strategy. This is where the need for a comprehensive dashboard becomes evident. By letting users explore genre distributions and popular shows and movies, such a dashboard helps identify which genres viewers favor most. This data-driven approach allows Netflix to tailor its recommendations and content acquisitions, so that it continues to meet the evolving tastes and preferences of its audience.

2. DATA PROCESSING

We aim to load and analyze a dataset of Netflix titles to understand genre popularity. This will involve inspecting and cleansing the data, followed by data manipulation and transformation, culminating in actionable business recommendations.

A. Import Data & Inspection

First, we load the dataset and inspect its structure to identify any issues:

# Load the Data
netflix <- read.csv("data_input/titles.csv")
head(netflix)

# Inspect the data
str(netflix)

## 'data.frame':    5791 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm154986" "tm127384" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Deliverance" "Monty Python and the Holy Grail" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock ta"| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ ...
##  $ release_year        : int  1945 1976 1972 1975 1967 1969 1979 1971 1967 1980 ...
##  $ age_certification   : chr  "TV-MA" "R" "R" "PG" ...
##  $ runtime             : int  51 114 109 91 150 30 94 102 110 104 ...
##  $ genres              : chr  "documentation" "drama" "drama" "fantasy" ...
##  $ production_countries: chr  "['US']" "['US']" "['US']" "['GB']" ...
##  $ seasons             : int  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0068473" "tt0071853" ...
##  $ imdb_score          : num  NA 8.2 7.7 8.2 7.7 8.8 8 7.7 7.7 5.8 ...
##  $ imdb_votes          : int  NA 808582 107673 534486 72662 73424 395024 155051 112048 69844 ...
##  $ tmdb_popularity     : num  0.6 41 10 15.5 20.4 ...
##  $ tmdb_score          : num  NA 8.18 7.3 7.81 7.6 ...

B. Data Cleansing & Coercions

-. Missing Value

To ensure data quality, we will check for missing values, rename columns if necessary, and convert data types as needed.

To determine whether the data contains any missing values, we can use the anyNA() function:

anyNA(netflix)

## [1] TRUE

The output TRUE from anyNA(netflix) indicates that the netflix dataset contains at least one missing value (NA). It does not tell us which columns are affected or how many values are missing.

To check for missing values in each column, we can use the is.na() and colSums() functions.

colSums(is.na(netflix))

##                   id                title                 type 
##                    0                    0                    0 
##          description         release_year    age_certification 
##                    0                    0                    0 
##              runtime               genres production_countries 
##                    0                    0                    0 
##              seasons              imdb_id           imdb_score 
##                 3710                    0                  429 
##           imdb_votes      tmdb_popularity           tmdb_score 
##                  443                   82                  284

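Each number is the count of missing values in the corresponding column, and a count of 0 means that the column has no NA values. Five columns contain missing values: seasons (3,710), imdb_score (429), imdb_votes (443), tmdb_popularity (82), and tmdb_score (284). The count for seasons equals the number of MOVIE titles in the dataset, which is expected because movies have no seasons. Note that empty strings, such as the first imdb_id value in the str() output, are not counted as NA.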
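As a minimal sketch, the per-column counts can also be expressed as percentages, and empty strings can be counted separately because is.na() does not treat them as missing. The toy data frame and the names na_pct and empty_counts are hypothetical and only mimic a few columns of the dataset.

```r
# Hypothetical toy data mimicking three columns of the Netflix dataset
toy <- data.frame(
  imdb_id    = c("", "tt0075314", "tt0068473", ""),
  seasons    = c(1L, NA, NA, NA),
  imdb_score = c(NA, 8.2, 7.7, 8.2),
  stringsAsFactors = FALSE
)

# Share of NA values per column, in percent
na_pct <- round(colMeans(is.na(toy)) * 100, 1)

# Empty strings are not NA, so count them separately for character columns
is_chr <- vapply(toy, is.character, logical(1))
empty_counts <- colSums(toy[, is_chr, drop = FALSE] == "", na.rm = TRUE)

na_pct
empty_counts
```

Applied to netflix instead of toy, the same two calls would show how much of each column is missing and whether imdb_id hides additional gaps as empty strings.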

-. Rename columns if needed

To rename columns, we can use the make.names() function, which converts column names into syntactically valid and unique R names.

# Rename columns if needed
colnames(netflix) <- make.names(colnames(netflix), unique = TRUE)
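To illustrate what make.names() does, here is a small sketch on hypothetical column names (raw_names is not part of the dataset). It shows the three typical repairs: invalid characters, a leading digit, and duplicates.

```r
# Hypothetical column names that would be awkward to use with `$`
raw_names <- c("imdb score", "2nd_title", "title", "title")

# make.names() replaces invalid characters with ".", prefixes names that
# start with a digit with "X", and (with unique = TRUE) de-duplicates repeats
clean_names <- make.names(raw_names, unique = TRUE)
clean_names
# [1] "imdb.score" "X2nd_title" "title"      "title.1"
```

The column names shown in the str() output above are already valid and unique, so for this dataset the call should leave them unchanged.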

C. Convert data types

To convert data types, we can use the functions below:

* as.character()
* as.Date()
* as.integer()
* as.numeric()
* as.factor()

# Convert data types

netflix$id <- as.character(netflix$id)
netflix$title <- as.character(netflix$title)
netflix$type <- as.character(netflix$type)
netflix$description <- as.character(netflix$description)
netflix$age_certification <- as.character(netflix$age_certification)
netflix$genres <- as.character(netflix$genres)
netflix$production_countries <- as.character(netflix$production_countries)
netflix$imdb_id <- as.character(netflix$imdb_id)
netflix$seasons <- as.factor(netflix$seasons)
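One thing to keep in mind when storing a numeric column such as seasons as a factor: converting it back with as.integer() returns the internal level codes rather than the original numbers. The sketch below uses a hypothetical vector of season counts to show the difference.

```r
# Hypothetical season counts; NA stands for titles without seasons
seasons <- c(1L, NA, 4L, 10L, 2L)

seasons_f <- as.factor(seasons)
levels(seasons_f)   # levels are stored as text: "1" "2" "4" "10"

# as.integer() on a factor returns the internal level codes, not the values
codes <- as.integer(seasons_f)

# Converting through character recovers the original numbers
values <- as.integer(as.character(seasons_f))

codes    # 1 NA 3 4 2
values   # 1 NA 4 10 2
```

If seasons is later needed for arithmetic (for example, an average number of seasons), the as.integer(as.character(...)) form is the safer way back.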

D. Check the data after changing the data types

To check the data, we can use the str() function:

str(netflix)

## 'data.frame':    5791 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm154986" "tm127384" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Deliverance" "Monty Python and the Holy Grail" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock ta"| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ ...
##  $ release_year        : int  1945 1976 1972 1975 1967 1969 1979 1971 1967 1980 ...
##  $ age_certification   : chr  "TV-MA" "R" "R" "PG" ...
##  $ runtime             : int  51 114 109 91 150 30 94 102 110 104 ...
##  $ genres              : chr  "documentation" "drama" "drama" "fantasy" ...
##  $ production_countries: chr  "['US']" "['US']" "['US']" "['GB']" ...
##  $ seasons             : Factor w/ 26 levels "1","2","3","4",..: 1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0068473" "tt0071853" ...
##  $ imdb_score          : num  NA 8.2 7.7 8.2 7.7 8.8 8 7.7 7.7 5.8 ...
##  $ imdb_votes          : int  NA 808582 107673 534486 72662 73424 395024 155051 112048 69844 ...
##  $ tmdb_popularity     : num  0.6 41 10 15.5 20.4 ...
##  $ tmdb_score          : num  NA 8.18 7.3 7.81 7.6 ...

The data types in the netflix dataset have now been converted, with seasons stored as a factor. As the str() output shows, the dataset still contains all 5,791 rows and 15 columns, so the missing values identified earlier remain in place and must be kept in mind during the analysis. The dataset provides comprehensive information about TV shows and movies available on Netflix, facilitating analysis of content preferences, popularity, and production details.
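If rows with missing values were to be dropped, it helps to first count how many rows that would affect. The following is a sketch on hypothetical toy data, not the step performed in this report; the names n_dropped and n_complete_score are illustrative.

```r
# Hypothetical toy data with gaps in two columns
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  seasons    = c(1L, NA, NA, 2L),
  imdb_score = c(NA, 8.2, 7.7, 6.5),
  stringsAsFactors = FALSE
)

# Rows that have no NA in any column
n_complete <- sum(complete.cases(toy))

# Rows that na.omit(toy) would remove
n_dropped <- nrow(toy) - n_complete

# Checking only the column needed for a given analysis keeps more rows
n_complete_score <- sum(complete.cases(toy[, "imdb_score", drop = FALSE]))

c(n_complete = n_complete, n_dropped = n_dropped,
  n_complete_score = n_complete_score)
```

Because every movie has an NA in seasons, calling na.omit() on the whole dataset would discard all movies. Restricting the check to the columns an analysis actually uses avoids that.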

3. DATA EXPLORATION

We will manipulate and transform the data to derive meaningful insights, such as identifying the most popular genres:

A. Which genres on Netflix have the highest average IMDb votes?

# Average IMDb votes for each combination of type and genre
agg_mean <- aggregate(imdb_votes ~ type + genres, data = netflix, FUN = mean)

agg_mean_sorted <- agg_mean[order(-agg_mean$imdb_votes), ]

agg_mean_sorted

The output shows that the genre with the highest average number of IMDb votes is western (type MOVIE), followed by crime (type MOVIE).
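Two details of aggregate() are worth illustrating: with the formula interface, rows with a missing imdb_votes value are dropped before averaging, and an average can rest on very few titles. The sketch below, on hypothetical toy data, adds a count next to each mean; the column name n_titles is an assumption made for this example.

```r
# Hypothetical toy data with one missing vote count
toy <- data.frame(
  type       = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  genres     = c("western", "western", "drama", "drama", "drama"),
  imdb_votes = c(1000, NA, 200, 300, 500),
  stringsAsFactors = FALSE
)

# The formula interface drops rows with NA in imdb_votes before averaging
agg_mean <- aggregate(imdb_votes ~ type + genres, data = toy, FUN = mean)

# Number of non-missing titles behind each average
agg_n <- aggregate(imdb_votes ~ type + genres, data = toy, FUN = length)
names(agg_n)[names(agg_n) == "imdb_votes"] <- "n_titles"

# Combine means and counts, then sort by average votes in descending order
agg <- merge(agg_mean, agg_n, by = c("type", "genres"))
agg <- agg[order(-agg$imdb_votes), ]
agg
```

In the toy result the western average comes from a single title, which is the kind of case the count makes visible. The xtabs table later in this report shows only 13 western movies, so the same caution may apply to the real data.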

B. Which genres have the highest frequency for each type (MOVIE and SHOW)?

# Subset data for MOVIE and SHOW types
movies <- subset(netflix, type == "MOVIE")
shows <- subset(netflix, type == "SHOW")

# Create frequency tables for genres
freq_movies <- table(movies$genres)
freq_shows <- table(shows$genres)

# Convert frequency tables to data frames
freq_movies_df <- as.data.frame(freq_movies)
freq_shows_df <- as.data.frame(freq_shows)

# Rename columns for clarity
names(freq_movies_df) <- c("genres", "Frequency_Movie")
names(freq_shows_df) <- c("genres", "Frequency_Show")

# Sort data frames by frequency in descending order
freq_movies_df <- freq_movies_df[order(-freq_movies_df$Frequency_Movie), ]
freq_shows_df <- freq_shows_df[order(-freq_shows_df$Frequency_Show), ]

# Display the frequency tables
freq_movies_df

freq_shows_df

These tables list the genres for both types of content (MOVIE and SHOW) on Netflix, sorted by frequency in descending order. Based on this frequency analysis, we can draw the following conclusions:

* Popular genres in TV shows (SHOW): drama and comedy are the top genres, followed by documentation.
* Popular genres in movies (MOVIE): comedy and drama are the top genres, followed by documentation.

We can also use xtabs() to show the frequency of each genre for both MOVIE and SHOW in a single table.

xtabs( ~ type + genres, data = netflix)

##        genres
## type    action animation comedy crime documentation drama family fantasy
##   MOVIE    232       130    961   122           414   882     56      85
##   SHOW     133       187    344   116           251   539     57       3
##        genres
## type    history horror music reality romance scifi sport thriller war western
##   MOVIE      15    107    55       2     222    75     3      311  25      13
##   SHOW        6      7     4     169      10   164     1       66  21       3

The output above leads to the same conclusion. For TV shows (SHOW), drama (539) and comedy (344) are the top genres, followed by documentation (251). For movies (MOVIE), comedy (961) and drama (882) are the top genres, followed by documentation (414).
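Because the dataset contains more movies than shows, raw counts are hard to compare across the two types. As a sketch, prop.table() can convert an xtabs table into within-type shares. The toy data below is hypothetical.

```r
# Hypothetical toy data: four movies and two shows
toy <- data.frame(
  type   = c("MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  genres = c("comedy", "comedy", "drama", "action", "drama", "comedy"),
  stringsAsFactors = FALSE
)

# Cross-tabulate type against genre
counts <- xtabs(~ type + genres, data = toy)

# margin = 1 divides each row by its row total, so each type sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Running the same two calls on netflix would show, for example, what fraction of all movies are comedies versus what fraction of all shows are comedies.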

C. The top 10 titles with the highest IMDb scores in the Netflix dataset

# Subset the top 10 titles with highest IMDb scores
top_10_titles <- netflix[order(-netflix$imdb_score), ][1:10, c("type","genres","title", "imdb_score")]

# Display the top 10 titles with IMDb scores
top_10_titles

The output shows that the title with the highest IMDb score is Breaking Bad, followed by Khawatir.
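The ranking above relies on order() placing NA scores last by default. A slightly more explicit variant, sketched here on hypothetical toy data, removes titles without a score from the ranking altogether by using na.last = NA.

```r
# Hypothetical toy data with one missing score
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  imdb_score = c(7.5, NA, 9.1, 8.0),
  stringsAsFactors = FALSE
)

# na.last = NA removes rows with a missing score from the ordering
idx <- order(toy$imdb_score, decreasing = TRUE, na.last = NA)

# head() returns at most n rows, even if fewer scored titles exist
top_2 <- head(toy[idx, ], 2)
top_2
```

With this form, a request for the top 10 can never return a row whose score is NA, even in a small subset where fewer than 10 titles have a score.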

4. CONCLUSION

Firstly, analyzing the average IMDb votes by genre and type (MOVIE or SHOW) reveals that westerns and crime titles in the MOVIE category receive the highest average number of votes. Since the vote count reflects how many viewers rated a title, this indicates strong viewer engagement with these genres.

Secondly, examining the frequency of genres across movies and TV shows highlights notable trends. Among movies, comedy and drama are the most prevalent genres, followed by documentation. Among TV shows, drama and comedy also dominate, so the same two genres lead the Netflix catalog for both types of content.

Finally, the list of titles with the highest IMDb scores is led by the series “Breaking Bad” and “Khawatir”, reflecting the very positive ratings these shows have received from viewers.

In summary, the Netflix dataset shows that comedy and drama are consistently the most common genres across both movies and TV shows on Netflix.

LBB Programming for Data Science

Eva Marudur

2024-07-05