WQD7004 Group Assignment - Netflix Movie Analysis

Group Members

Name Matric No.
EMILY SIA ZI XUAN 17205326
ANG QI KANG 17205824
ZHAOZIHUI S2187551
SIYU JIANG 22060253
MEI EN TEE 22079668

Problem Statement

  1. Enhancing customer engagement is key to increasing Netflix's profit.

  2. Support decision-making processes in the entertainment industry.

  3. Develop a predictive model to accurately predict the IMDb score of a movie.

  4. Classify movies based on their popularity.

  5. Utilize machine learning techniques to build robust predictive and classification models.

Objectives

  1. To predict the IMDb score of a movie.

  2. To classify movies based on their popularity.


Data Sourcing

Initialization

# for data preprocessing
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.1     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
# for data visualization
library(ggplot2)
library(treemap)
## Warning: package 'treemap' was built under R version 4.2.3
library(igraph)
## Warning: package 'igraph' was built under R version 4.2.3
## 
## Attaching package: 'igraph'
## 
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## 
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## 
## The following object is masked from 'package:tidyr':
## 
##     crossing
## 
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## 
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## The following object is masked from 'package:base':
## 
##     union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
# for regression
library(caret) 
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.3
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
# for classification
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.3

Data Ingestion

df <- read.csv("titles.csv",header = TRUE,sep = ",") 

Data Understanding

glimpse(df)
## Rows: 6,137
## Columns: 15
## $ id                   <chr> "ts300399", "tm82169", "tm17823", "tm191099", "tm…
## $ title                <chr> "Five Came Back: The Reference Films", "Rocky", "…
## $ type                 <chr> "SHOW", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVI…
## $ description          <chr> "This collection includes 12 World War II-era pro…
## $ release_year         <int> 1945, 1976, 1978, 1973, 1979, 1975, 1978, 1969, 1…
## $ age_certification    <chr> "TV-MA", "PG", "PG", "PG", "PG", "PG", "R", "TV-1…
## $ runtime              <int> 51, 119, 110, 129, 119, 91, 109, 30, 94, 120, 112…
## $ genres               <chr> "['documentation']", "['drama', 'sport']", "['rom…
## $ production_countries <chr> "['US']", "['US']", "['US']", "['US']", "['US']",…
## $ seasons              <dbl> 1, NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, NA, NA,…
## $ imdb_id              <chr> "", "tt0075148", "tt0077631", "tt0070735", "tt007…
## $ imdb_score           <dbl> NA, 8.1, 7.2, 8.3, 7.3, 8.2, 7.4, 8.8, 8.0, 7.5, …
## $ imdb_votes           <dbl> NA, 588100, 283316, 266738, 216307, 547292, 12361…
## $ tmdb_popularity      <dbl> 0.601, 106.361, 33.160, 24.616, 75.699, 20.964, 1…
## $ tmdb_score           <dbl> NA, 7.782, 7.406, 8.020, 7.246, 7.804, 7.020, 8.2…
# Check if there is any NA value
colSums(is.na(df))
##                   id                title                 type 
##                    0                    0                    0 
##          description         release_year    age_certification 
##                    0                    0                    0 
##              runtime               genres production_countries 
##                    0                    0                    0 
##              seasons              imdb_id           imdb_score 
##                 3831                    0                  468 
##           imdb_votes      tmdb_popularity           tmdb_score 
##                  484                   76                  252
# Check if there is any empty string value
colSums(df=="")
##                   id                title                 type 
##                    0                    0                    0 
##          description         release_year    age_certification 
##                   23                    0                 2743 
##              runtime               genres production_countries 
##                    0                    0                    0 
##              seasons              imdb_id           imdb_score 
##                   NA                  396                   NA 
##           imdb_votes      tmdb_popularity           tmdb_score 
##                   NA                   NA                   NA
# Check if there is any zero value
colSums(df==0)
##                   id                title                 type 
##                    0                    0                    0 
##          description         release_year    age_certification 
##                    0                    0                    0 
##              runtime               genres production_countries 
##                   19                    0                    0 
##              seasons              imdb_id           imdb_score 
##                   NA                    0                   NA 
##           imdb_votes      tmdb_popularity           tmdb_score 
##                   NA                   NA                   NA
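The `NA` entries in the two outputs above appear because comparing a missing value with `""` or `0` yields `NA`, and `colSums()` propagates it. A minimal sketch of one way to get counts for every column, using a small made-up data frame (`demo` and its values are illustrative, not taken from `titles.csv`):

```r
# Hypothetical three-row stand-in for df
demo <- data.frame(
  age_certification = c("PG", "", "R"),
  imdb_score        = c(8.1, NA, 7.2),
  runtime           = c(119, 0, 94)
)

# Without na.rm, any column that contains NA reports NA
colSums(demo == "")

# With na.rm = TRUE, missing values are skipped and every column gets a count
empty_counts <- colSums(demo == "", na.rm = TRUE)
zero_counts  <- colSums(demo == 0, na.rm = TRUE)
empty_counts
zero_counts
```

Missing values themselves are still counted separately with `colSums(is.na(...))`, as done earlier in this section.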

Data Preprocessing

Drop unnecessary columns

# Remove unnecessary columns and store in df1
df1 <- select(df,-id,-seasons,-age_certification, -imdb_id, -description)
str(df1)
## 'data.frame':    6137 obs. of  10 variables:
##  $ title               : chr  "Five Came Back: The Reference Films" "Rocky" "Grease" "The Sting" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ release_year        : int  1945 1976 1978 1973 1979 1975 1978 1969 1979 1954 ...
##  $ runtime             : int  51 119 110 129 119 91 109 30 94 120 ...
##  $ genres              : chr  "['documentation']" "['drama', 'sport']" "['romance', 'comedy']" "['crime', 'drama', 'comedy', 'music']" ...
##  $ production_countries: chr  "['US']" "['US']" "['US']" "['US']" ...
##  $ imdb_score          : num  NA 8.1 7.2 8.3 7.3 8.2 7.4 8.8 8 7.5 ...
##  $ imdb_votes          : num  NA 588100 283316 266738 216307 ...
##  $ tmdb_popularity     : num  0.601 106.361 33.16 24.616 75.699 ...
##  $ tmdb_score          : num  NA 7.78 7.41 8.02 7.25 ...

Check for duplicate records

has_duplicates <- any(duplicated(df1))
paste0("There is duplicate record? ", has_duplicates)
## [1] "There is duplicate record? FALSE"

Clean missing values in imdb_score, imdb_votes, tmdb_popularity and tmdb_score

# Drop every record that has a missing value in any of these columns
df1 <- df1[complete.cases(df1$imdb_score), ]
df1 <- df1[complete.cases(df1$imdb_votes), ]
df1 <- df1[complete.cases(df1$tmdb_popularity), ]
df1 <- df1[complete.cases(df1$tmdb_score), ]

# Check whether any missing values remain
missing_counts <- cbind(
  NA_Count = colSums(is.na(df1)),
  EmptyString_Count = colSums(df1 == "")
)
missing_counts
##                      NA_Count EmptyString_Count
## title                       0                 0
## type                        0                 0
## release_year                0                 0
## runtime                     0                 0
## genres                      0                 0
## production_countries        0                 0
## imdb_score                  0                 0
## imdb_votes                  0                 0
## tmdb_popularity             0                 0
## tmdb_score                  0                 0
df2<-df1
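The four successive `complete.cases()` filters above could also be written as a single call over the selected columns. A minimal sketch on a made-up data frame (`demo` and `score_cols` are illustrative names, not part of the original script):

```r
# Hypothetical stand-in for df1 with missing values in different columns
demo <- data.frame(
  title           = c("A", "B", "C", "D"),
  imdb_score      = c(8.1, NA, 7.2, 6.5),
  imdb_votes      = c(100, 50, NA, 20),
  tmdb_popularity = c(10.5, 3.2, 8.8, 1.1),
  tmdb_score      = c(7.0, 6.0, 5.0, NA)
)

score_cols <- c("imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score")

# Keep only rows that are complete in all four columns at once
cleaned <- demo[complete.cases(demo[, score_cols]), ]
cleaned
```

The result is the same set of rows as filtering one column at a time; only the number of steps changes.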

Pre-process columns genres and production_countries

The genres and production_countries variables each hold a list of values. We retain only the first element as the main genre or production country.

# Only retain the first element in the list and remove stray characters
# Convert empty brackets to NA and extract the first genre value in "genres" column
df2$genres <- ifelse(df2$genres == "[]", NA, gsub("\\[|\\]|'", "", sapply(strsplit(df2$genres, ","), function(x) trimws(x[1]))))
df2$production_countries <- ifelse(df2$production_countries == "[]", NA, gsub("\\[|\\]|'", "", sapply(strsplit(df2$production_countries, ","), function(x) trimws(x[1]))))

unique(df2$genres)
##  [1] "drama"         "romance"       "crime"         "fantasy"      
##  [5] "comedy"        "documentation" "thriller"      "action"       
##  [9] "animation"     "family"        "reality"       "scifi"        
## [13] "western"       "horror"        "war"           "music"        
## [17] "history"       NA              "sport"
unique(df2$production_countries)
##  [1] "US" "GB" "EG" "IN" "DE" "CA" "LB" "JP" "AR" "FR" "IE" "AU" "ET" "HK" "MX"
## [16] "CN" "ES" "SU" "IT" "NZ" "DK" "CO" "TW" "KR" "RU" "NG" NA   "PS" "TR" "MY"
## [31] "PH" "ZA" "MA" "SE" "SG" "KE" "NO" "CL" "SA" "BR" "ID" "IS" "IL" "PL" "FI"
## [46] "CD" "RO" "BE" "NL" "UA" "QA" "GL" "AT" "AE" "BY" "JO" "VN" "TN" "TH" "KH"
## [61] "CH" "CU" "UY" "CZ" "PE" "PR" "KW" "IR" "PY" "PK" "HU" "IQ" "BD" "TZ" "CM"
## [76] "LU" "SN" "BT" "PT" "AO" "GH" "ZW" "MW" "GT" "MU" "BG" "DO" "PA" "IO" "FO"
# Check for missing values
count_missing_genres <- sum(is.na(df2$genres))
count_missing_production_countries <- sum(is.na(df2$production_countries))

# Print the counts of missing values
cat(
  paste("Number of missing values in genres: ", count_missing_genres), "\n",
  paste("Number of missing values in production_countries: ", count_missing_production_countries)
)
## Number of missing values in genres:  2 
##  Number of missing values in production_countries:  89
# Delete rows with missing values in genres or production_countries
df3 <- df2[complete.cases(df2$genres, df2$production_countries), ]
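The first-element extraction used above can be illustrated on a few sample strings. The helper below is a sketch only; `first_list_element` is a hypothetical name introduced here, and it assumes the values are formatted like `"['drama', 'sport']"` with `"[]"` for an empty list:

```r
# Return the first item of a bracketed, comma-separated string, or NA if empty
first_list_element <- function(x) {
  # Take the text before the first comma
  first <- vapply(strsplit(x, ",", fixed = TRUE),
                  function(parts) parts[1], character(1))
  # Strip brackets and quotes, then surrounding whitespace
  cleaned <- trimws(gsub("\\[|\\]|'", "", first))
  # An empty list "[]" ends up as an empty string; treat it as missing
  cleaned[!is.na(cleaned) & cleaned == ""] <- NA_character_
  cleaned
}

result <- first_list_element(c("['drama', 'sport']", "[]", "['US']"))
result
```

Keeping only the first element is a simplification: a title listed as drama and sport is treated purely as drama in all later steps.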

Pre-process column runtime

# Remove entry where runtime=0
df3 <- df3[which(df3$runtime!=0),]
paste0("Number of zero runtime after preprocessing: ", sum(df3$runtime == 0))
## [1] "Number of zero runtime after preprocessing: 0"
# save the current df
write.csv(df3, file = "cleaned_data_netflix.csv", row.names = FALSE)

Exploratory Data Analysis (EDA)

Summary Statistics

An overview of the distribution and central tendencies of all features.

netflix_df <- read.csv("cleaned_data_netflix.csv",header = TRUE,sep = ",") 
summary(netflix_df)
##     title               type            release_year     runtime      
##  Length:5368        Length:5368        Min.   :1954   Min.   :  2.00  
##  Class :character   Class :character   1st Qu.:2017   1st Qu.: 45.00  
##  Mode  :character   Mode  :character   Median :2019   Median : 84.00  
##                                        Mean   :2017   Mean   : 78.47  
##                                        3rd Qu.:2021   3rd Qu.:106.00  
##                                        Max.   :2023   Max.   :225.00  
##     genres          production_countries   imdb_score      imdb_votes     
##  Length:5368        Length:5368          Min.   :1.500   Min.   :      5  
##  Class :character   Class :character     1st Qu.:5.900   1st Qu.:    616  
##  Mode  :character   Mode  :character     Median :6.600   Median :   2370  
##                                          Mean   :6.553   Mean   :  22152  
##                                          3rd Qu.:7.300   3rd Qu.:   9696  
##                                          Max.   :9.600   Max.   :2684317  
##  tmdb_popularity      tmdb_score    
##  Min.   :   0.600   Min.   : 1.000  
##  1st Qu.:   3.857   1st Qu.: 6.037  
##  Median :   8.306   Median : 6.800  
##  Mean   :  20.914   Mean   : 6.667  
##  3rd Qu.:  17.740   3rd Qu.: 7.400  
##  Max.   :1078.637   Max.   :10.000

Overview of features

Bar plot of type
# Bar plot of type (SHOW vs MOVIE)
ggplot(netflix_df, aes(x = type)) +
  geom_bar(fill = "steelblue") +
  labs(x = "Show Type", y = "Count") +
  ggtitle("Count of TV Shows and Movies")

Histogram of release_year

Movies and shows by release year.

ggplot(netflix_df, aes(x = release_year)) +
  geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
  labs(x = "Release Year", y = "Count", title = "Distribution of Movies and Shows by Release Year") +
  scale_x_continuous(breaks = seq(min(netflix_df$release_year), max(netflix_df$release_year), by = 10))+
  scale_y_continuous(breaks = seq(0, 800, by =50))

Histogram distribution of runtime
ggplot(netflix_df, aes(x = runtime)) +
  geom_histogram(binwidth = 10, color = "black", fill = "skyblue") +
  labs(x = "Runtime", y = "Count", title = "Distribution of Runtime") +
  scale_x_continuous(breaks = seq(min(netflix_df$runtime), max(netflix_df$runtime), by = 20)) +
  scale_y_continuous(breaks = seq(0, 800, by =50))

Boxplot distribution of runtime
ggplot(netflix_df, aes(x = "", y = runtime)) +
  geom_boxplot() +
  labs(x = "", y = "Runtime (minutes)") +
  ggtitle("Distribution of Runtime") +
  theme_minimal() +
  coord_cartesian(ylim = c(-10, 250)) +  # Adjust the y-axis limits as per your preference
  geom_text(aes(x = 1, y = max(runtime), label = paste("Max:", max(runtime))),
            vjust = -1, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = min(runtime), label = paste("Min:", min(runtime))),
            vjust = 2, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = median(runtime), label = paste("Median:", median(runtime))),
            vjust = 0, hjust = -1, color = "blue") +
  geom_text(aes(x = 1, y = quantile(runtime, 0.25),
                label = paste("Q1:", quantile(runtime, 0.25))),
            vjust = -1, hjust = -1, color = "blue") +
  geom_text(aes(x = 1, y = quantile(runtime, 0.75),
                label = paste("Q3:", quantile(runtime, 0.75))),
            vjust = 1, hjust = -1, color = "blue")
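In the boxplot code above, each `geom_text()` layer inherits the full data set, so every label is drawn once per row, all on top of each other. One possible alternative, sketched here on made-up runtimes, is to compute the five statistics once and pass them as a separate small data frame (`demo` and `stats` are illustrative names):

```r
library(ggplot2)

# Hypothetical stand-in for netflix_df$runtime
demo <- data.frame(runtime = c(2, 45, 84, 106, 225))

# One row per statistic, so each label is drawn exactly once
stats <- data.frame(
  stat  = c("Min", "Q1", "Median", "Q3", "Max"),
  value = unname(quantile(demo$runtime, c(0, 0.25, 0.5, 0.75, 1)))
)
stats$label <- paste0(stats$stat, ": ", stats$value)

p <- ggplot(demo, aes(x = "", y = runtime)) +
  geom_boxplot() +
  # The label layer uses its own data and does not inherit the row-level data
  geom_text(data = stats, aes(x = 1, y = value, label = label),
            hjust = -0.5, color = "blue", inherit.aes = FALSE) +
  labs(x = "", y = "Runtime (minutes)", title = "Distribution of Runtime") +
  theme_minimal()

print(p)
```

The same pattern would apply to the imdb_score, imdb_votes, tmdb_popularity and tmdb_score boxplots by swapping the column name.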

Bubble chart of genres
#GENRE FEATURE
genre_counts <- table(netflix_df$genres)
genre_data <- data.frame(genre = names(genre_counts), count = as.numeric(genre_counts))
genre_data$percentage <- genre_data$count / sum(genre_data$count) * 100
genre_data <- genre_data[order(-genre_data$count), ]
genre_data$size <- sqrt(genre_data$count)

#Bubble Chart
ggplot(genre_data, aes(x = genre, y = count, size = size, label = paste(sprintf("%.1f%%", percentage)))) +
  geom_point(color = "black", fill = "skyblue", shape = 21) +
  labs(x = "Genres", y = "Count of Movies and TV Shows by Genre", title = "Bubble Chart of Genres") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.line.y = element_blank(),
        panel.grid.major.y = element_line(color = "lightgray", linetype = "dashed"),
        panel.grid.minor.y = element_blank()) +
  scale_size_continuous(range = c(5, 15), guide = "none") +
  geom_text(size = 2.2, vjust = -1, color = "red") +
  ylim(0, max(genre_data$count) * 1.2)

Treemap of production_countries
country_counts <- table(netflix_df$production_countries)

country_data <- data.frame(
  country = names(country_counts),
  count = as.numeric(country_counts)
)

country_data$percentage <- format(round(country_data$count / sum(country_data$count) * 100, 2), nsmall=2)
country_data <- country_data[order(-country_data$count), ]

#Treemap
country_data$Country.Index <- paste0(country_data$country," ", country_data$percentage, "%")
treemap(country_data, index = "Country.Index", vSize = "count", title = "Production Countries Treemap")

Boxplot distribution for imdb_score
ggplot(data = netflix_df, aes(x = "", y = imdb_score)) +
  geom_boxplot() +
  labs(x = "", y = "IMDb Score") +
  ggtitle("Distribution of IMDb Scores") +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 10)) +  # Adjust the y-axis limits as per your preference
  geom_text(aes(x = 1, y = max(imdb_score), label = paste("Max:", max(imdb_score))),
            vjust = -1, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = min(imdb_score), label = paste("Min:", min(imdb_score))),
            vjust = 2, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = median(imdb_score), label = paste("Median:", median(imdb_score))),
            vjust = 0, hjust = -1, color = "red") +
  geom_text(aes(x = 1, y = quantile(imdb_score, 0.25),
                label = paste("Q1:", quantile(imdb_score, 0.25))),
            vjust = -1, hjust = -1, color = "blue") +
  geom_text(aes(x = 1, y = quantile(imdb_score, 0.75),
                label = paste("Q3:", quantile(imdb_score, 0.75))),
            vjust = 1, hjust = -1, color = "blue")

Boxplot distribution for imdb_votes
options(repr.plot.width = 10, repr.plot.height = 6)
ggplot(data = netflix_df, aes(x = "", y = imdb_votes)) +
  geom_boxplot() +
  labs(x = "", y = "IMDb Votes") +
  ggtitle("Distribution of IMDb Votes") +
  theme_minimal() +
  coord_cartesian(ylim = c(-1000, 2800000)) +  # Adjust the y-axis limits as per your preference
  geom_text(aes(x = 1, y = max(imdb_votes), label = paste("Max:", max(imdb_votes))),
            vjust = -1, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = min(imdb_votes), label = paste("Min:", min(imdb_votes))),
            vjust = 2, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = median(imdb_votes), label = paste("Median:", median(imdb_votes))),
            vjust = 0, hjust = -1, color = "red") +
  geom_text(aes(x = 1, y = quantile(imdb_votes, 0.25),
                label = paste("Q1:", quantile(imdb_votes, 0.25))),
            vjust = -1, hjust = -1, color = "blue") +
  geom_text(aes(x = 1, y = quantile(imdb_votes, 0.75),
                label = paste("Q3:", quantile(imdb_votes, 0.75))),
            vjust = 1, hjust = -1, color = "blue")

Boxplot distribution for tmdb_popularity
ggplot(data = netflix_df, aes(x = "", y = tmdb_popularity)) +
  geom_boxplot() +
  labs(x = "", y = "TMDB Popularity") +
  ggtitle("Distribution of TMDB Popularity") +
  theme_minimal() +
  coord_cartesian(ylim = c(-100, 1500)) +
  geom_text(aes(x = 1, y = max(tmdb_popularity), label = paste("Max:", max(tmdb_popularity))),
            vjust = -1, hjust = 0, color = "red", size = 3) +
  geom_text(aes(x = 1, y = min(tmdb_popularity), label = paste("Min:", min(tmdb_popularity))),
            vjust = 2, hjust = 0, color = "red", size = 3) +
  geom_text(aes(x = 1, y = median(tmdb_popularity), label = paste("Median:", median(tmdb_popularity))),
            vjust = 0, hjust = -1, color = "red", size = 3) +
  geom_text(aes(x = 1, y = quantile(tmdb_popularity, 0.25),
                label = paste("Q1:", quantile(tmdb_popularity, 0.25))),
            vjust = -1, hjust = -1, color = "blue", size = 3) +
  geom_text(aes(x = 1, y = quantile(tmdb_popularity, 0.75),
                label = paste("Q3:", quantile(tmdb_popularity, 0.75))),
            vjust = 1, hjust = -1, color = "blue", size = 3)

Boxplot distribution for tmdb_score
# tmdb_score feature
ggplot(data = netflix_df, aes(x = "", y = tmdb_score)) +
  geom_boxplot() +
  labs(x = "", y = "TMDB Score") +
  ggtitle("Distribution of TMDB Score") +
  theme_minimal() +
  coord_cartesian(ylim = c(-2, 15)) +  # Adjust the y-axis limits as per your preference
  geom_text(aes(x = 1, y = max(tmdb_score), label = paste("Max:", max(tmdb_score))),
            vjust = -1, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = min(tmdb_score), label = paste("Min:", min(tmdb_score))),
            vjust = 2, hjust = 0, color = "red") +
  geom_text(aes(x = 1, y = median(tmdb_score), label = paste("Median:", median(tmdb_score))),
            vjust = 0, hjust = -1, color = "red") +
  geom_text(aes(x = 1, y = quantile(tmdb_score, 0.25),
                label = paste("Q1:", quantile(tmdb_score, 0.25))),
            vjust = -1, hjust = -1, color = "blue") +
  geom_text(aes(x = 1, y = quantile(tmdb_score, 0.75),
                label = paste("Q3:", quantile(tmdb_score, 0.75))),
            vjust = 1, hjust = -1, color = "blue")

Density plot of imdb_score and tmdb_score
# Density plot of imdb_score and tmdb_score
ggplot(netflix_df) +
  geom_density(aes(x = imdb_score, fill = "IMDb Score"), alpha = 0.5) +
  geom_density(aes(x = tmdb_score, fill = "TMDB Score"), alpha = 0.5) +
  scale_fill_manual(values = c("blue", "green")) +
  labs(x = "Score", y = "Density") +
  ggtitle("Density Plot - IMDb Score vs TMDB Score") +
  theme_minimal()

Remove maximum value for imdb_votes and tmdb_popularity

max_imdb_votes <- max(netflix_df$imdb_votes)
netflix_df <- netflix_df[which(netflix_df$imdb_votes!=max_imdb_votes),]

max_tmdb_popularity <- max(netflix_df$tmdb_popularity)
netflix_df <- netflix_df[which(netflix_df$tmdb_popularity!=max_tmdb_popularity),]

summary(netflix_df)
##     title               type            release_year     runtime      
##  Length:5366        Length:5366        Min.   :1954   Min.   :  2.00  
##  Class :character   Class :character   1st Qu.:2017   1st Qu.: 45.00  
##  Mode  :character   Mode  :character   Median :2019   Median : 84.00  
##                                        Mean   :2017   Mean   : 78.46  
##                                        3rd Qu.:2021   3rd Qu.:106.00  
##                                        Max.   :2023   Max.   :225.00  
##     genres          production_countries   imdb_score      imdb_votes     
##  Length:5366        Length:5366          Min.   :1.500   Min.   :      5  
##  Class :character   Class :character     1st Qu.:5.900   1st Qu.:    616  
##  Mode  :character   Mode  :character     Median :6.600   Median :   2370  
##                                          Mean   :6.552   Mean   :  21594  
##                                          3rd Qu.:7.300   3rd Qu.:   9684  
##                                          Max.   :9.600   Max.   :2106826  
##  tmdb_popularity      tmdb_score    
##  Min.   :   0.600   Min.   : 1.000  
##  1st Qu.:   3.850   1st Qu.: 6.036  
##  Median :   8.305   Median : 6.800  
##  Mean   :  20.706   Mean   : 6.667  
##  3rd Qu.:  17.735   3rd Qu.: 7.400  
##  Max.   :1005.232   Max.   :10.000

Correlation heatmap between features

The correlation heatmap shows that most pairwise correlation coefficients between these variables are low. However, the correlation between imdb_score and tmdb_score is 0.5373, which is relatively high, so we consider combining them in the prediction module.

#Identify correlations between runtime, release_year, imdb_score, tmdb_score, imdb_votes, tmdb_popularity
selected_data <- netflix_df %>% select(runtime, release_year, imdb_score, tmdb_score, imdb_votes, tmdb_popularity)
correlation_matrix <- cor(selected_data) #Compute the correlation matrix
correlation_df <- as.data.frame(as.table(correlation_matrix)) #Convert the correlation matrix to a data frame
#Create a heatmap
my_colors <- c("#e31a1c", "#ff7f00", "#fdbf6f", "#a6cee3", "#1f78b4") #Custom color palette
ggplot(correlation_df, aes(Var1, Var2)) +
  geom_tile(aes(fill = Freq), color = "white") +
  scale_fill_gradientn(colors = my_colors, limits = c(-1, 1), na.value = "white") +
  labs(title = "Correlation Heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
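The `Freq` column used as the fill in the heatmap code is simply the default name that `as.data.frame(as.table(...))` gives to the cell values, so here it holds correlation coefficients rather than frequencies. A small sketch using the built-in `mtcars` data as a stand-in for the Netflix columns (the choice of `mtcars` and of the name `correlation` is illustrative):

```r
# Correlation matrix of three numeric columns
cm <- cor(mtcars[, c("mpg", "hp", "wt")])

# Long format: one row per (Var1, Var2) pair, values land in a column named Freq
long <- as.data.frame(as.table(cm))
names(long)

# Renaming the column makes its meaning explicit
names(long)[3] <- "correlation"

# Look up a single pair; it matches the matrix entry
pair <- long[long$Var1 == "mpg" & long$Var2 == "wt", ]
pair
cm["mpg", "wt"]
```

A single coefficient such as the one between imdb_score and tmdb_score can be read the same way, either from the long table or directly from the matrix.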

Average IMDb and TMDB scores by genre

The war genre has the highest average IMDb score and TMDB score.

#Find the scores in different genres
#Calculate average scores by genre
average_scores <- netflix_df %>%
  group_by(genres) %>%
  summarise(avg_imdb_score = mean(imdb_score),
            avg_tmdb_score = mean(tmdb_score))

#Pivot the data into longer format to create a side-by-side bar chart
data_long <- average_scores %>%
  select(genres, avg_imdb_score, avg_tmdb_score) %>%
  pivot_longer(cols = c(avg_imdb_score, avg_tmdb_score), names_to = "score_type", values_to = "score")
#Create the side-by-side bar chart by genre
ggplot(data_long) +
  geom_bar(aes(x = genres, y = score, fill = score_type), position = "dodge", stat = "identity") +
  scale_fill_manual(values = c("avg_imdb_score" = "steelblue", "avg_tmdb_score" = "salmon")) +
  labs(title = "Average IMDB Score and TMDB Score by Genre",
       x = "Genre",
       y = "Average Score",
       fill = "Score Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Line Chart of TMDB Popularity Over Release Year

A drama title released in 2005 reaches the top TMDB popularity of around 1000, and a thriller title released in 2023 reaches a similar peak of around 1000. The popularity of each genre appears to vary considerably across release years.

#Determine the current trends of genres in movies and TV shows
#Create the line chart of TMDB popularity
ggplot(netflix_df, aes(x = release_year, y = tmdb_popularity, color = genres, group = genres)) +
  geom_line() +
  labs(title = "Line Chart of TMDB Popularity Over Release Year",
       x = "Release Year",
       y = "TMDB Popularity",
       color = "Genres") +
  scale_x_continuous(breaks = seq(1953, 2023, by = 10)) +
  scale_y_continuous(breaks = seq(0, 1100, by = 100)) +
  theme_minimal()
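Because the line chart above is drawn from one row per title, a genre with several titles in the same year contributes several points at the same x position. One possible refinement, sketched on made-up data, is to average popularity per year and genre before plotting (`demo` and `yearly` are illustrative names, and using the mean is an assumption; a median or maximum could be used instead):

```r
# Hypothetical stand-in for netflix_df
demo <- data.frame(
  release_year    = c(2020, 2020, 2021, 2021, 2021),
  genres          = c("drama", "drama", "drama", "comedy", "comedy"),
  tmdb_popularity = c(10, 30, 20, 5, 15)
)

# One row per (release_year, genres) pair with the mean popularity
yearly <- aggregate(tmdb_popularity ~ release_year + genres,
                    data = demo, FUN = mean)
yearly

# The aggregated table can then be passed to the same plotting call, e.g.
# ggplot(yearly, aes(x = release_year, y = tmdb_popularity,
#                    color = genres, group = genres)) + geom_line()
```

With one value per year and genre, each line traces a single trajectory, which makes trends easier to compare.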

Line chart of TMDB Score

The TMDB score of each genre also appears to vary considerably across release years.

#Create the line chart of TMDB Score
ggplot(netflix_df, aes(x = release_year, y = tmdb_score, color = genres, group = genres)) +
  geom_line() +
  labs(title = "Line Chart of TMDB Score Over Release Year",
       x = "Release Year",
       y = "TMDB Score",
       color = "Genre") +
  scale_x_continuous(breaks = seq(1953, 2023, by = 10)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  theme_minimal() 

Data Visualization of imdb_score and tmdb_score combined (labeled as popularity_score)

As noted earlier, imdb_score and tmdb_score have a relatively high correlation of 0.5373 and will be combined in the prediction module. The following graphs provide further understanding of, and justification for, this combined feature.

Heatmap of Popularity Score and other features

netflix_df$popularity_score = ((netflix_df$imdb_score + netflix_df$tmdb_score)/2)
# Compute the correlation matrix
correlation_matrix <- cor(netflix_df[, c("popularity_score", "runtime", "imdb_score", "imdb_votes", "tmdb_score")])
# Melt the correlation matrix for visualization
melted_correlation <- melt(correlation_matrix)

# Create the heatmap
heatmap_plot <- ggplot(melted_correlation, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red", na.value = "white") +
  labs(x = "", y = "", title = "Correlation Heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(heatmap_plot)

The heatmap shows that the average of imdb_score and tmdb_score, labeled popularity_score, has a relatively high correlation with imdb_votes. Its very high correlation with imdb_score and tmdb_score (red, close to 1) is expected, since popularity_score is their average, and is consistent with combining the two scores into a single feature.
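The size of the correlation between an average and one of its components follows from the correlation between the two components and their standard deviations. The sketch below works this out; `cor_mean_with_component` is a hypothetical helper, and applying it with equal standard deviations to the reported 0.5373 is an assumption, since the actual standard deviations of the two scores are not shown here:

```r
# Correlation between (X + Y) / 2 and X, given cor(X, Y) = rho and the two SDs
cor_mean_with_component <- function(rho, sd_x = 1, sd_y = 1) {
  (sd_x + rho * sd_y) / sqrt(sd_x^2 + sd_y^2 + 2 * rho * sd_x * sd_y)
}

# Under the equal-variance assumption, rho = 0.5373 gives roughly 0.88
cor_mean_with_component(0.5373)

# Check the formula against a direct computation on small made-up vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
direct  <- cor((x + y) / 2, x)
formula <- cor_mean_with_component(cor(x, y), sd(x), sd(y))
c(direct = direct, formula = formula)
```

This shows that a strong correlation between popularity_score and each of its inputs arises by construction and carries no independent information about the data.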

Boxplot of Popularity Score by Genre

unique_genres <- unique(netflix_df$genres)
unique_genres
##  [1] "drama"         "romance"       "crime"         "fantasy"      
##  [5] "comedy"        "documentation" "thriller"      "action"       
##  [9] "animation"     "family"        "reality"       "scifi"        
## [13] "western"       "horror"        "war"           "music"        
## [17] "history"       "sport"
unique_countries <- unique(netflix_df$production_countries)
unique_countries
##  [1] "US" "GB" "EG" "IN" "DE" "CA" "LB" "JP" "AR" "FR" "IE" "AU" "ET" "HK" "MX"
## [16] "CN" "ES" "SU" "IT" "NZ" "DK" "CO" "TW" "KR" "RU" "NG" "PS" "TR" "MY" "PH"
## [31] "ZA" "MA" "SE" "SG" "KE" "NO" "CL" "SA" "BR" "ID" "IS" "IL" "PL" "FI" "CD"
## [46] "RO" "BE" "NL" "UA" "QA" "GL" "AT" "AE" "BY" "JO" "VN" "TN" "TH" "KH" "CH"
## [61] "CU" "UY" "CZ" "PE" "PR" "KW" "IR" "PY" "PK" "HU" "IQ" "BD" "TZ" "CM" "LU"
## [76] "SN" "BT" "PT" "AO" "GH" "ZW" "MW" "GT" "MU" "BG" "DO" "PA" "IO" "FO"
# Filter movies and select specific genres
movie_list1 <- netflix_df %>%
  filter(type == "MOVIE" & genres %in% c("romance", "horror", "fantasy", "comedy", "thriller", "action", "scifi"))

movie_list2 <- netflix_df %>%
  filter(type == "MOVIE" & genres %in% c("family", "drama", "crime", "music", "sport", "animation"))

movie_list3<- netflix_df %>%
  filter(type == "MOVIE" & genres %in% c("documentation", "western", "history", "war"))

# Create boxplot 1
plot1 <- ggplot(movie_list1, aes(x = genres, y = popularity_score)) +
  geom_boxplot() +
  xlab("Genres") +
  ylab("Popularity Score") +
  ggtitle("Popularity Score of Movies in Each Genre Category")

# Create boxplot 2
plot2 <- ggplot(movie_list2, aes(x = genres, y = popularity_score)) +
  geom_boxplot() +
  xlab("Genres") +
  ylab("Popularity Score") +
  ggtitle("Popularity Score of Movies in Each Genre Category")

# Create boxplot 3
plot3 <- ggplot(movie_list3, aes(x = genres, y = popularity_score)) +
  geom_boxplot() +
  xlab("Genres") +
  ylab("Popularity Score") +
  ggtitle("Popularity Score of Movies in Each Genre Category")

# Print the boxplots
print(plot1)

Action, comedy, and fantasy movies reach popularity scores above 7.5.

print(plot2)

The only genres that score below 7.5 are family and sport.

print(plot3)

The boxplots of popularity score by genre give a clearer picture of the typical score for movies in each genre. Documentation is the most popular genre on average, but when outliers are considered, action, romance, and animation are rated higher than the other genres. Conversely, drama, thriller, and western have lower popularity scores than the other genres when outliers are considered.

# save the current df
write.csv(netflix_df, file = "cleaned_data_netflix.csv", row.names = FALSE)

Model Development

Objective 1: Movie IMDb Scores Prediction

Preprocess data to fit regression model

netflix_df_reg <- read.csv("cleaned_data_netflix.csv", sep = ',')
# Convert nominal features to factors
netflix_df_reg$type <- as.factor(netflix_df_reg$type)
netflix_df_reg$genres <- as.factor(netflix_df_reg$genres)
# Select the predictor features and target variable
features <- netflix_df_reg[, c("type", "runtime", "genres", "imdb_votes")]
target <- netflix_df_reg$imdb_score
# Preprocess the data
preprocess_params <- preProcess(features, method = c("center", "scale"))
features_scaled <- predict(preprocess_params, features)
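For numeric columns, the "center" and "scale" methods amount to subtracting the column mean and dividing by the column standard deviation. The following is a minimal base-R sketch of that transformation on a made-up data frame (`toy` and `train_rows` are hypothetical). As an assumption of this sketch, and unlike the code above, the mean and standard deviation are estimated on the training rows only and then reused for the remaining rows, which is one common way to keep test data out of the preprocessing step.

```r
# Toy numeric features (hypothetical values, not the Netflix data)
toy <- data.frame(runtime = c(90, 100, 110, 120, 130),
                  imdb_votes = c(1000, 2000, 3000, 4000, 50000))

train_rows <- 1:4  # hypothetical training rows

# Estimate the centering and scaling parameters from the training rows only
centers <- colMeans(toy[train_rows, ])
scales  <- apply(toy[train_rows, ], 2, sd)

# Apply the same parameters to every row: (x - mean) / sd
toy_scaled <- as.data.frame(scale(toy, center = centers, scale = scales))

print(round(toy_scaled, 3))
```

After the transformation the training rows have mean 0 and standard deviation 1 in each column, while the held-out row is expressed on the same scale.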

Split dataset into training and test sets

# Split the data into training and testing sets
set.seed(42)
train_indices <- sample(1:nrow(features_scaled), nrow(features_scaled) * 0.8)
X_train <- features_scaled[train_indices, ]
y_train <- target[train_indices]
X_test <- features_scaled[-train_indices, ]
y_test <- target[-train_indices]
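The 80/20 split can be sanity-checked on a small example. The following is a sketch assuming 10 hypothetical rows. Here `floor()` makes the training size an explicit integer, and the check confirms that the training and test indices do not overlap.

```r
set.seed(42)
n <- 10  # hypothetical number of rows

# Sample 80% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), floor(n * 0.8))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)                        # 8
length(test_idx)                         # 2
length(intersect(train_idx, test_idx))   # 0
```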

Regression Algorithm 1: Random Forest Regression

#Random Forest Regression
# Train a Random Forest regression model
model_rf <- randomForest(X_train, y_train)
# Make predictions on the test set
y_pred <- predict(model_rf, X_test)

# Evaluate the model
RMSE <- sqrt(mean((y_pred - y_test)^2))
R_squared <- cor(y_pred, y_test)^2

# Print the results
cat("Random Forest Regression:\n",
        paste("Root Mean Squared Error (RMSE):", RMSE), "\n",
        paste("R-squared:", R_squared))
## Random Forest Regression:
##  Root Mean Squared Error (RMSE): 0.971529427451738 
##  R-squared: 0.270530849394465
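Both metrics can be reproduced by hand. The following is a minimal sketch on made-up vectors (`actual` and `predicted` are hypothetical, not model output). Next to the squared Pearson correlation used above, it also computes the coefficient of determination, 1 - SSE/SST. The two agree for an ordinary least-squares fit evaluated on its own training data but can differ on a test set.

```r
actual    <- c(6.0, 7.0, 5.0, 8.0)   # hypothetical true scores
predicted <- c(6.5, 6.5, 5.5, 7.5)   # hypothetical predictions

# Root mean squared error: typical size of a prediction error
rmse <- sqrt(mean((predicted - actual)^2))

# Squared Pearson correlation between predictions and truth
r2_cor <- cor(predicted, actual)^2

# Coefficient of determination: 1 - SSE / SST
sse <- sum((actual - predicted)^2)
sst <- sum((actual - mean(actual))^2)
r2_det <- 1 - sse / sst

cat("RMSE:", rmse, "\n",
    "Squared correlation:", r2_cor, "\n",
    "1 - SSE/SST:", r2_det, "\n")
```

On these toy values the RMSE is 0.5, the squared correlation is 0.9, and 1 - SSE/SST is 0.8, so the squared correlation can be the more optimistic of the two.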

Regression Algorithm 2: Linear Regression

# Train a Linear Regression model
model_lm <- train(X_train, y_train, method = "lm")

# Make predictions on the test set using Linear Regression
y_pred_lm <- predict(model_lm, X_test)

RMSE_lm <- sqrt(mean((y_pred_lm - y_test)^2))
R_squared_lm <- cor(y_pred_lm, y_test)^2

cat("Linear Regression:\n",
        paste("Root Mean Squared Error (RMSE):", RMSE_lm), "\n",
        paste("R-squared:", R_squared_lm))
## Linear Regression:
##  Root Mean Squared Error (RMSE): 1.01196085625222 
##  R-squared: 0.199071005416998

Regression Algorithm 3: XGBoost Regression

# Train an XGBoost regression model
model_xgb <- xgboost(data = as.matrix(sapply(X_train, as.numeric)), label = y_train, nrounds = 100, objective = "reg:squarederror")
## [1]  train-rmse:4.370645 
## [2]  train-rmse:3.137754 
## [3]  train-rmse:2.297079 
## [4]  train-rmse:1.735711 
## [5]  train-rmse:1.373568 
## [6]  train-rmse:1.151367 
## [7]  train-rmse:1.021556 
## [8]  train-rmse:0.948363 
## [9]  train-rmse:0.905359 
## [10] train-rmse:0.881415 
## [11] train-rmse:0.869031 
## [12] train-rmse:0.860006 
## [13] train-rmse:0.849110 
## [14] train-rmse:0.837696 
## [15] train-rmse:0.832084 
## [16] train-rmse:0.828467 
## [17] train-rmse:0.819644 
## [18] train-rmse:0.813108 
## [19] train-rmse:0.809294 
## [20] train-rmse:0.804897 
## [21] train-rmse:0.800795 
## [22] train-rmse:0.795338 
## [23] train-rmse:0.794334 
## [24] train-rmse:0.788178 
## [25] train-rmse:0.787154 
## [26] train-rmse:0.779294 
## [27] train-rmse:0.773339 
## [28] train-rmse:0.767792 
## [29] train-rmse:0.765389 
## [30] train-rmse:0.761413 
## [31] train-rmse:0.756249 
## [32] train-rmse:0.754820 
## [33] train-rmse:0.748801 
## [34] train-rmse:0.746557 
## [35] train-rmse:0.742853 
## [36] train-rmse:0.740443 
## [37] train-rmse:0.736762 
## [38] train-rmse:0.731600 
## [39] train-rmse:0.727791 
## [40] train-rmse:0.724929 
## [41] train-rmse:0.722721 
## [42] train-rmse:0.718119 
## [43] train-rmse:0.715060 
## [44] train-rmse:0.712672 
## [45] train-rmse:0.708882 
## [46] train-rmse:0.706684 
## [47] train-rmse:0.702212 
## [48] train-rmse:0.700128 
## [49] train-rmse:0.699390 
## [50] train-rmse:0.695263 
## [51] train-rmse:0.693857 
## [52] train-rmse:0.687627 
## [53] train-rmse:0.682315 
## [54] train-rmse:0.678093 
## [55] train-rmse:0.673580 
## [56] train-rmse:0.669791 
## [57] train-rmse:0.666542 
## [58] train-rmse:0.663987 
## [59] train-rmse:0.662294 
## [60] train-rmse:0.659679 
## [61] train-rmse:0.659322 
## [62] train-rmse:0.658000 
## [63] train-rmse:0.656244 
## [64] train-rmse:0.655728 
## [65] train-rmse:0.649599 
## [66] train-rmse:0.648203 
## [67] train-rmse:0.644497 
## [68] train-rmse:0.641948 
## [69] train-rmse:0.640363 
## [70] train-rmse:0.638205 
## [71] train-rmse:0.633388 
## [72] train-rmse:0.632447 
## [73] train-rmse:0.631156 
## [74] train-rmse:0.628032 
## [75] train-rmse:0.624470 
## [76] train-rmse:0.621060 
## [77] train-rmse:0.618899 
## [78] train-rmse:0.615379 
## [79] train-rmse:0.612696 
## [80] train-rmse:0.609816 
## [81] train-rmse:0.608471 
## [82] train-rmse:0.605020 
## [83] train-rmse:0.603320 
## [84] train-rmse:0.601185 
## [85] train-rmse:0.600780 
## [86] train-rmse:0.597664 
## [87] train-rmse:0.594632 
## [88] train-rmse:0.593355 
## [89] train-rmse:0.590853 
## [90] train-rmse:0.588052 
## [91] train-rmse:0.586884 
## [92] train-rmse:0.584685 
## [93] train-rmse:0.581916 
## [94] train-rmse:0.579682 
## [95] train-rmse:0.578330 
## [96] train-rmse:0.576104 
## [97] train-rmse:0.575613 
## [98] train-rmse:0.572875 
## [99] train-rmse:0.571367 
## [100]    train-rmse:0.569285
# Make predictions on the test set using XGBoost
y_pred_xgb <- predict(model_xgb, as.matrix(sapply(X_test, as.numeric)))

# Evaluate the model
RMSE_xgb <- sqrt(mean((y_pred_xgb - y_test)^2))
R_squared_xgb <- cor(y_pred_xgb, y_test)^2

cat("XGBoost Regression:\n",
  paste("Root Mean Squared Error (RMSE):", RMSE_xgb), "\n",
  paste("R-squared:", R_squared_xgb))
## XGBoost Regression:
##  Root Mean Squared Error (RMSE): 1.03702433872803 
##  R-squared: 0.209229207703833

Make prediction for regression models

# New data for prediction
new_data <- data.frame(
  type = factor("MOVIE", levels = levels(features$type)),
  runtime = 120L,
  genres = factor("drama", levels = levels(features$genres)),
  imdb_votes = 5000L
)

# Preprocess the new data
new_data_scaled <- predict(preprocess_params, new_data)

# Make prediction on the new data
prediction_rf <- predict(model_rf, new_data_scaled)
prediction_lm <- predict(model_lm, new_data_scaled)
# data.matrix() keeps the single observation as a 1 x p row; sapply() would collapse it to a vector
prediction_xgb <- predict(model_xgb, newdata = data.matrix(new_data_scaled))

# Calculate the average of all predictions (XGB)
average_prediction_xgb <- mean(prediction_xgb)

# Print the prediction
cat( "Regression Prediction Results: \n" , 
  "Predicted imdb_score (RF):", round(prediction_rf, 2), "\n",
  "Predicted imdb_score (LR):", round(prediction_lm, 2), "\n",
  "Predicted imdb_score (XGB):", round(average_prediction_xgb, 2))
## Regression Prediction Results: 
##  Predicted imdb_score (RF): 6.41 
##  Predicted imdb_score (LR): 6.44 
##  Predicted imdb_score (XGB): 5.22
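When a single new observation is converted to a matrix for a model that expects one row per observation, the shape of the result is worth checking. The following is a minimal sketch with a made-up one-row data frame (`one_row` and its values are hypothetical). `sapply()` collapses a one-row data frame to a plain vector, which `as.matrix()` then turns into a single column, whereas `data.matrix()` keeps the 1 x p layout and converts factors to their integer codes.

```r
# One hypothetical observation with a factor and two numeric features
one_row <- data.frame(
  type = factor("MOVIE", levels = c("MOVIE", "SHOW")),
  runtime = 0.5,
  imdb_votes = -0.2
)

as_column <- as.matrix(sapply(one_row, as.numeric))
as_row    <- data.matrix(one_row)

dim(as_column)  # 3 1: three "observations" with one feature each
dim(as_row)     # 1 3: one observation with three features
```

A model given the 3 x 1 version returns one prediction per matrix row, so more than one predicted value for a single new observation is a sign that the input was transposed.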

Objective 2: Movie Popularity Classification

Preprocess data to fit classification model

netflix_df_cls <- read.csv("cleaned_data_netflix.csv", sep = ',')
# Calculate the popularity score
netflix_df_cls$popularity_score <- ((netflix_df_cls$imdb_score + netflix_df_cls$tmdb_score)/2)

# Optional: You can round the popularity score to a desired decimal place
netflix_df_cls$popularity_score <- round(netflix_df_cls$popularity_score, 2)

netflix_df_cls$popularityLabel <- ifelse(netflix_df_cls$popularity_score >7.5,"Popular","Not Popular")
netflix_df_cls$popularity <- ifelse(netflix_df_cls$popularityLabel == "Popular", 1, 0)

# Select the features and target variables
features <- netflix_df_cls[, c( "type", "runtime", "genres", "imdb_score", "imdb_votes", "tmdb_score")]
target <- factor(netflix_df_cls$popularity, levels = c(0, 1), labels = c("Not Popular", "Popular"))

# Convert nominal features to factors
features$type <- as.factor(features$type)
features$genres <- as.factor(features$genres)
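The labelling rule above is the mean of the two scores followed by a threshold at 7.5. The following is a minimal sketch on three made-up titles (`toy_scores` is hypothetical) showing how the rule behaves, including at the boundary.

```r
# Hypothetical scores for three titles
toy_scores <- data.frame(imdb_score = c(7.4, 6.0, 8.0),
                         tmdb_score = c(8.1, 7.0, 7.0))

# Popularity score: mean of the two ratings, rounded to 2 decimals
toy_scores$popularity_score <-
  round((toy_scores$imdb_score + toy_scores$tmdb_score) / 2, 2)

# Only scores strictly greater than 7.5 count as popular
toy_scores$popularityLabel <-
  ifelse(toy_scores$popularity_score > 7.5, "Popular", "Not Popular")

print(toy_scores)
```

The third title scores exactly 7.5 and is therefore labelled "Not Popular". Since the label is a deterministic function of `imdb_score` and `tmdb_score`, a model that receives both columns as features can in principle recover it from them.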

Split dataset into training and test sets

# Split the data into training and testing sets
set.seed(42)
train_indices <- sample(1:nrow(features), nrow(features) * 0.8)
X_train <- features[train_indices, ]
y_train <- target[train_indices]
X_test <- features[-train_indices, ]
y_test <- target[-train_indices]

Classification Algorithm 1: Random Forest Classifier

# Train a Random Forest Classifier
classifier_rf <- randomForest(X_train, y_train)

# Make predictions on the test set
y_pred <- predict(classifier_rf, X_test)

# Evaluate the model
accuracy_rf <- sum(y_pred == y_test) / length(y_test)
classification_report_rf <- table(y_test, y_pred)

# Print the results
cat("Accuracy (Random Forest):", accuracy_rf, "\n\n",
    "Classification Report (Random Forest):\n",
    paste(capture.output(print(classification_report_rf)), collapse = "\n")
)
## Accuracy (Random Forest): 0.9944134 
## 
##  Classification Report (Random Forest):
##               y_pred
## y_test        Not Popular Popular
##   Not Popular         859       2
##   Popular               4     209
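Accuracy is only one of several metrics that can be read off a confusion matrix. The following is a minimal sketch that re-enters the Random Forest counts reported above by hand and derives precision, recall, and F1 for the "Popular" class (the object name `cm` is an assumption of this sketch).

```r
# Confusion matrix reported above (rows: actual, columns: predicted)
cm <- matrix(c(859, 4, 2, 209), nrow = 2,
             dimnames = list(actual = c("Not Popular", "Popular"),
                             predicted = c("Not Popular", "Popular")))

# Share of all test titles classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Of the titles predicted "Popular", the share that really are
precision <- cm["Popular", "Popular"] / sum(cm[, "Popular"])

# Of the titles that really are "Popular", the share that were found
recall <- cm["Popular", "Popular"] / sum(cm["Popular", ])

# Harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
```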

Classification Algorithm 2: Support Vector Machines Classifier

# Feature scaling using preProcess() for SVM
preprocess_params <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preprocess_params, X_train)
X_test_scaled <- predict(preprocess_params, X_test)

# Train an SVM classifier on the scaled features (scale = FALSE because they are already scaled)
classifier_svm <- svm(y_train ~ ., data = X_train_scaled, kernel = "radial", cost = 1, scale = FALSE)

# Make predictions on the test set using SVM
y_pred_svm <- predict(classifier_svm, newdata = X_test_scaled)

# Evaluate the SVM model
accuracy_svm <- sum(y_pred_svm == y_test) / length(y_test)
classification_report_svm <- table(y_test, y_pred_svm)

# Print the SVM results
cat("Accuracy (SVM):", accuracy_svm, "\n\n",
    "Classification Report (SVM):\n",
    paste(capture.output(print(classification_report_svm)), collapse = "\n")
)
## Accuracy (SVM): 0.9841713 
## 
##  Classification Report (SVM):
##               y_pred_svm
## y_test        Not Popular Popular
##   Not Popular         860       1
##   Popular              16     197

Classification Algorithm 3: Naive Bayes Classifier

# Train a Naive Bayes Classifier
classifier_nb <- naiveBayes(X_train, y_train)

# Make predictions on the test set using Naive Bayes
y_pred_nb <- predict(classifier_nb, X_test)

# Evaluate the Naive Bayes model
accuracy_nb <- sum(y_pred_nb == y_test) / length(y_test)
classification_report_nb <- table(y_test, y_pred_nb)

# Print the Naive Bayes results
cat("Accuracy (Naive Bayes):", accuracy_nb, "\n\n",
    "Classification Report (Naive Bayes):\n",
    paste(capture.output(print(classification_report_nb)), collapse = "\n")
)
## Accuracy (Naive Bayes): 0.9134078 
## 
##  Classification Report (Naive Bayes):
##               y_pred_nb
## y_test        Not Popular Popular
##   Not Popular         838      23
##   Popular              70     143
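Because about 80% of the test titles are "Not Popular", accuracy alone can hide differences between the classifiers. The following is a minimal sketch that uses only the counts from the three confusion matrices above, with hypothetical variable names. It compares each model's recall on the "Popular" class with the accuracy of always predicting the majority class.

```r
# Test-set class sizes taken from the confusion matrices above
n_not_popular <- 861   # 859 + 2
n_popular     <- 213   # 4 + 209

# Accuracy of a baseline that always predicts "Not Popular"
baseline_accuracy <- n_not_popular / (n_not_popular + n_popular)

# Correctly predicted "Popular" titles per model
popular_hits <- c(random_forest = 209, svm = 197, naive_bayes = 143)

# Recall on the "Popular" class for each model
popular_recall <- popular_hits / n_popular

round(baseline_accuracy, 4)
round(popular_recall, 4)
```

The baseline already reaches roughly 0.80 accuracy, while recall on "Popular" ranges from about 0.98 for Random Forest to about 0.67 for Naive Bayes, a wider gap than the accuracy figures suggest.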

Make prediction for classification models

# Making predictions
# New data for prediction
new_data <- data.frame(
  type = factor("MOVIE", levels = levels(features$type)),
  runtime = 78L,
  genres = factor("comedy", levels = levels(features$genres)),
  imdb_score = 7.4,
  #production_countries = factor("US", levels = levels(features$production_countries)),
  imdb_votes = 2000L,
  tmdb_score = 8.1
)

# Define labels for factor levels
class_labels <- levels(target)

# Make prediction on new data
prediction_rf <- predict(classifier_rf, newdata = new_data)
prediction_nb <- predict(classifier_nb, newdata = new_data)
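# The SVM was trained on scaled features, so scale the new data with the same parameters
prediction_svm <- predict(classifier_svm, newdata = predict(preprocess_params, new_data))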
# Convert prediction to class labels
predicted_class_rf <- class_labels[prediction_rf]
predicted_class_nb <- class_labels[prediction_nb]
predicted_class_svm <- class_labels[prediction_svm]

cat(
  "Classifier Prediction Results:", "\n",
  "Random Forest (RF) : ", predicted_class_rf, "\n",
  "Naive Bayes (NB) : ", predicted_class_nb, "\n",
  "Support Vector Machines (SVM) : ", predicted_class_svm
)
## Classifier Prediction Results: 
##  Random Forest (RF) :  Popular 
##  Naive Bayes (NB) :  Not Popular 
##  Support Vector Machines (SVM) :  Popular

Conclusion

Based on the Root Mean Squared Error (RMSE), the Random Forest regressor performed best, with the lowest RMSE of 0.972. The linear regression model reached an RMSE of 1.012 and XGBoost an RMSE of 1.037. For the sample input, the predicted imdb_score is 6.41 with Random Forest, 6.44 with linear regression, and 5.22 with XGBoost.

Among the three classifiers, the Random Forest classifier achieved the highest accuracy at 99.4%, followed by the Support Vector Machine (SVM) at 98.4%. The Naive Bayes model had the lowest accuracy at 91.3%. For the sample input, whose mean score of 7.75 is above the 7.5 threshold, Random Forest and SVM correctly predicted the movie as popular, whereas Naive Bayes predicted it as not popular.

With the regression and classification models, Netflix can identify the next trending movies to add to its platform and thereby enhance customer engagement.