Group Members
| Name | Matric No. |
|---|---|
| EMILY SIA ZI XUAN | 17205326 |
| ANG QI KANG | 17205824 |
| ZHAOZIHUI | S2187551 |
| SIYU JIANG | 22060253 |
| MEI EN TEE | 22079668 |
Enhancing customer engagement is key to increasing Netflix's profit.
Support decision-making processes in the entertainment industry.
Develop a predictive model to accurately predict the IMDb score of a movie.
Classify movies based on their popularity.
Utilize machine learning techniques to build robust predictive and classification models.
Netflix TV Shows and Movies
https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?select=titles.csv
To predict the IMDb score of a movie.
To classify movies based on popularity.
# for data preprocessing
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.1 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
# for data visualization
library(ggplot2)
library(treemap)
## Warning: package 'treemap' was built under R version 4.2.3
library(igraph)
## Warning: package 'igraph' was built under R version 4.2.3
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:lubridate':
##
## %--%, union
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
# for regression
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.3
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
# for classification
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.3
df <- read.csv("titles.csv",header = TRUE,sep = ",")
glimpse(df)
## Rows: 6,137
## Columns: 15
## $ id <chr> "ts300399", "tm82169", "tm17823", "tm191099", "tm…
## $ title <chr> "Five Came Back: The Reference Films", "Rocky", "…
## $ type <chr> "SHOW", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVI…
## $ description <chr> "This collection includes 12 World War II-era pro…
## $ release_year <int> 1945, 1976, 1978, 1973, 1979, 1975, 1978, 1969, 1…
## $ age_certification <chr> "TV-MA", "PG", "PG", "PG", "PG", "PG", "R", "TV-1…
## $ runtime <int> 51, 119, 110, 129, 119, 91, 109, 30, 94, 120, 112…
## $ genres <chr> "['documentation']", "['drama', 'sport']", "['rom…
## $ production_countries <chr> "['US']", "['US']", "['US']", "['US']", "['US']",…
## $ seasons <dbl> 1, NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, NA, NA,…
## $ imdb_id <chr> "", "tt0075148", "tt0077631", "tt0070735", "tt007…
## $ imdb_score <dbl> NA, 8.1, 7.2, 8.3, 7.3, 8.2, 7.4, 8.8, 8.0, 7.5, …
## $ imdb_votes <dbl> NA, 588100, 283316, 266738, 216307, 547292, 12361…
## $ tmdb_popularity <dbl> 0.601, 106.361, 33.160, 24.616, 75.699, 20.964, 1…
## $ tmdb_score <dbl> NA, 7.782, 7.406, 8.020, 7.246, 7.804, 7.020, 8.2…
# Check if there is any NA value
colSums(is.na(df))
## id title type
## 0 0 0
## description release_year age_certification
## 0 0 0
## runtime genres production_countries
## 0 0 0
## seasons imdb_id imdb_score
## 3831 0 468
## imdb_votes tmdb_popularity tmdb_score
## 484 76 252
# Check if there is any empty string value
colSums(df=="")
## id title type
## 0 0 0
## description release_year age_certification
## 23 0 2743
## runtime genres production_countries
## 0 0 0
## seasons imdb_id imdb_score
## NA 396 NA
## imdb_votes tmdb_popularity tmdb_score
## NA NA NA
# Check if there is any zero value
colSums(df==0)
## id title type
## 0 0 0
## description release_year age_certification
## 0 0 0
## runtime genres production_countries
## 19 0 0
## seasons imdb_id imdb_score
## NA 0 NA
## imdb_votes tmdb_popularity tmdb_score
## NA NA NA
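The three checks above can be collected into a single table. The following is a minimal sketch on a small toy data frame; `toy` and `count_issues` are hypothetical names that are not part of the original analysis. Passing `na.rm = TRUE` keeps a column's count from becoming `NA` when that column contains missing values, which is why several columns show `NA` in the output above.

```r
# Toy data frame standing in for titles.csv (hypothetical values)
toy <- data.frame(
  title   = c("A", "", "C", "D"),
  runtime = c(90, 0, NA, 45),
  score   = c(7.1, NA, NA, 6.0),
  stringsAsFactors = FALSE
)

# Count NA, empty-string, and zero entries per column in one pass
count_issues <- function(d) {
  data.frame(
    na_count    = colSums(is.na(d)),
    empty_count = colSums(d == "", na.rm = TRUE),
    zero_count  = colSums(d == 0, na.rm = TRUE)
  )
}

issues <- count_issues(toy)
print(issues)
```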
# Remove unnecessary columns and store in df1
df1 <- select(df,-id,-seasons,-age_certification, -imdb_id, -description)
str(df1)
## 'data.frame': 6137 obs. of 10 variables:
## $ title : chr "Five Came Back: The Reference Films" "Rocky" "Grease" "The Sting" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ release_year : int 1945 1976 1978 1973 1979 1975 1978 1969 1979 1954 ...
## $ runtime : int 51 119 110 129 119 91 109 30 94 120 ...
## $ genres : chr "['documentation']" "['drama', 'sport']" "['romance', 'comedy']" "['crime', 'drama', 'comedy', 'music']" ...
## $ production_countries: chr "['US']" "['US']" "['US']" "['US']" ...
## $ imdb_score : num NA 8.1 7.2 8.3 7.3 8.2 7.4 8.8 8 7.5 ...
## $ imdb_votes : num NA 588100 283316 266738 216307 ...
## $ tmdb_popularity : num 0.601 106.361 33.16 24.616 75.699 ...
## $ tmdb_score : num NA 7.78 7.41 8.02 7.25 ...
has_duplicates <- any(duplicated(df1))
paste0("There is duplicate record? ", has_duplicates)
## [1] "There is duplicate record? FALSE"
imdb_score, imdb_votes, tmdb_popularity and tmdb_score
# Drop all records where any of the columns above has a missing value
df1 <- df1[complete.cases(df1$imdb_score), ]
df1 <- df1[complete.cases(df1$imdb_votes), ]
df1 <- df1[complete.cases(df1$tmdb_popularity), ]
df1 <- df1[complete.cases(df1$tmdb_score), ]
### check if there is any remaining missing value
missing_counts <- cbind(
NA_Count = colSums(is.na(df1)),
EmptyString_Count = colSums(df1 == "")
)
missing_counts
## NA_Count EmptyString_Count
## title 0 0
## type 0 0
## release_year 0 0
## runtime 0 0
## genres 0 0
## production_countries 0 0
## imdb_score 0 0
## imdb_votes 0 0
## tmdb_popularity 0 0
## tmdb_score 0 0
df2<-df1
genres and production_countries
The genres and production_countries variables each hold a list of values. We retain the first element of each list as the main genre or production country.
# Only retain the first element in the list and remove stray characters
# Convert empty brackets to NA and extract the first genre value in "genres" column
df2$genres <- ifelse(df2$genres == "[]", NA, gsub("\\[|\\]|'", "", sapply(strsplit(df2$genres, ","), function(x) trimws(x[1]))))
df2$production_countries <- ifelse(df2$production_countries == "[]", NA, gsub("\\[|\\]|'", "", sapply(strsplit(df2$production_countries, ","), function(x) trimws(x[1]))))
unique(df2$genres)
## [1] "drama" "romance" "crime" "fantasy"
## [5] "comedy" "documentation" "thriller" "action"
## [9] "animation" "family" "reality" "scifi"
## [13] "western" "horror" "war" "music"
## [17] "history" NA "sport"
unique(df2$production_countries)
## [1] "US" "GB" "EG" "IN" "DE" "CA" "LB" "JP" "AR" "FR" "IE" "AU" "ET" "HK" "MX"
## [16] "CN" "ES" "SU" "IT" "NZ" "DK" "CO" "TW" "KR" "RU" "NG" NA "PS" "TR" "MY"
## [31] "PH" "ZA" "MA" "SE" "SG" "KE" "NO" "CL" "SA" "BR" "ID" "IS" "IL" "PL" "FI"
## [46] "CD" "RO" "BE" "NL" "UA" "QA" "GL" "AT" "AE" "BY" "JO" "VN" "TN" "TH" "KH"
## [61] "CH" "CU" "UY" "CZ" "PE" "PR" "KW" "IR" "PY" "PK" "HU" "IQ" "BD" "TZ" "CM"
## [76] "LU" "SN" "BT" "PT" "AO" "GH" "ZW" "MW" "GT" "MU" "BG" "DO" "PA" "IO" "FO"
# Check for missing values
count_missing_genres <- sum(is.na(df2$genres))
count_missing_production_countries <- sum(is.na(df2$production_countries))
# Print the counts of missing values
cat(
paste("Number of missing values in genres: ", count_missing_genres), "\n",
paste("Number of missing values in production_countries: ", count_missing_production_countries)
)
## Number of missing values in genres: 2
## Number of missing values in production_countries: 89
# Delete rows with missing values in genres or production_countries
df3 <- df2[complete.cases(df2$genres, df2$production_countries), ]
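The first-element extraction used above can be checked on a few example strings. This is a minimal sketch, not the code that produced the results in this report; `first_element` and `examples` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: return the first element of a list-like string,
# or NA when the list is empty ("[]")
first_element <- function(x) {
  cleaned <- gsub("\\[|\\]|'", "", x)          # drop brackets and quotes
  first   <- trimws(sub(",.*$", "", cleaned))  # keep text before the first comma
  ifelse(first == "", NA_character_, first)
}

examples <- c("['drama', 'sport']", "['US']", "[]")
res <- first_element(examples)
print(res)
```

Wrapping the logic in a function means the same rule can be applied to both columns without repeating the expression.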
runtime
# Remove entries where runtime = 0
df3 <- df3[which(df3$runtime!=0),]
paste0("Number of zero runtime after preprocessing: ", sum(df3$runtime == 0))
## [1] "Number of zero runtime after preprocessing: 0"
# save the current df
write.csv(df3, file = "cleaned_data_netflix.csv", row.names = FALSE)
An overview of the distribution and central tendencies of all features.
netflix_df <- read.csv("cleaned_data_netflix.csv",header = TRUE,sep = ",")
summary(netflix_df)
## title type release_year runtime
## Length:5368 Length:5368 Min. :1954 Min. : 2.00
## Class :character Class :character 1st Qu.:2017 1st Qu.: 45.00
## Mode :character Mode :character Median :2019 Median : 84.00
## Mean :2017 Mean : 78.47
## 3rd Qu.:2021 3rd Qu.:106.00
## Max. :2023 Max. :225.00
## genres production_countries imdb_score imdb_votes
## Length:5368 Length:5368 Min. :1.500 Min. : 5
## Class :character Class :character 1st Qu.:5.900 1st Qu.: 616
## Mode :character Mode :character Median :6.600 Median : 2370
## Mean :6.553 Mean : 22152
## 3rd Qu.:7.300 3rd Qu.: 9696
## Max. :9.600 Max. :2684317
## tmdb_popularity tmdb_score
## Min. : 0.600 Min. : 1.000
## 1st Qu.: 3.857 1st Qu.: 6.037
## Median : 8.306 Median : 6.800
## Mean : 20.914 Mean : 6.667
## 3rd Qu.: 17.740 3rd Qu.: 7.400
## Max. :1078.637 Max. :10.000
show type
# Bar plot of show type
ggplot(netflix_df, aes(x = type)) +
geom_bar(fill = "steelblue") +
labs(x = "Show Type", y = "Count") +
ggtitle("Count of TV Shows and Movies")
release_year
Movies and shows by release year.
ggplot(netflix_df, aes(x = release_year)) +
geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
labs(x = "Release Year", y = "Count", title = "Distribution of Movies and Shows by Release Year") +
scale_x_continuous(breaks = seq(min(netflix_df$release_year), max(netflix_df$release_year), by = 10))+
scale_y_continuous(breaks = seq(0, 800, by =50))
runtime
ggplot(netflix_df, aes(x = runtime)) +
geom_histogram(binwidth = 10, color = "black", fill = "skyblue") +
labs(x = "Runtime", y = "Count", title = "Distribution of Runtime") +
scale_x_continuous(breaks = seq(min(netflix_df$runtime), max(netflix_df$runtime), by = 20)) +
scale_y_continuous(breaks = seq(0, 800, by =50))
runtime
ggplot(netflix_df, aes(x = "", y = runtime)) +
geom_boxplot() +
labs(x = "", y = "Runtime (minutes)") +
ggtitle("Distribution of Runtime") +
theme_minimal() +
coord_cartesian(ylim = c(-10, 250)) + # Adjust the y-axis limits as per your preference
geom_text(aes(x = 1, y = max(runtime), label = paste("Max:", max(runtime))),
vjust = -1, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = min(runtime), label = paste("Min:", min(runtime))),
vjust = 2, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = median(runtime), label = paste("Median:", median(runtime))),
vjust = 0, hjust = -1, color = "blue") +
geom_text(aes(x = 1, y = quantile(runtime, 0.25),
label = paste("Q1:", quantile(runtime, 0.25))),
vjust = -1, hjust = -1, color = "blue") +
geom_text(aes(x = 1, y = quantile(runtime, 0.75),
label = paste("Q3:", quantile(runtime, 0.75))),
vjust = 1, hjust = -1, color = "blue")
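The boxplot annotations above rely on five summary statistics. As a minimal sketch, they can be computed once and stored in a small table; `box_stats` and the toy `runtime` vector are hypothetical and only illustrate the idea.

```r
# Toy runtimes standing in for netflix_df$runtime (hypothetical values)
runtime <- c(2, 30, 45, 84, 90, 106, 120, 225)

# One row per statistic, with a ready-made text label
box_stats <- data.frame(
  stat  = c("Min", "Q1", "Median", "Q3", "Max"),
  value = as.numeric(quantile(runtime, probs = c(0, 0.25, 0.5, 0.75, 1)))
)
box_stats$label <- paste0(box_stats$stat, ": ", box_stats$value)
print(box_stats)
```

With ggplot2 loaded, a table like this could be passed to a single layer such as `geom_text(data = box_stats, aes(x = 1, y = value, label = label))` in place of five separate `geom_text()` calls, although the per-label offsets and colors used above would then need to be handled separately.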
genres
# Genre feature
genre_counts <- table(netflix_df$genres)
genre_data <- data.frame(genre = names(genre_counts), count = as.numeric(genre_counts))
genre_data$percentage <- genre_data$count / sum(genre_data$count) * 100
genre_data <- genre_data[order(-genre_data$count), ]
genre_data$size <- sqrt(genre_data$count)
#Bubble Chart
ggplot(genre_data, aes(x = genre, y = count, size = size, label = paste(sprintf("%.1f%%", percentage)))) +
geom_point(color = "black", fill = "skyblue", shape = 21) +
labs(x = "Genres", y = "Count of Movies and TV Show by Genres", title = "Bubble Chart of Genres") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.line.y = element_blank(),
panel.grid.major.y = element_line(color = "lightgray", linetype = "dashed"),
panel.grid.minor.y = element_blank()) +
scale_size_continuous(range = c(5, 15), guide = "none") +
geom_text(size = 2.2, vjust = -1, color = "red") +
ylim(0, max(genre_data$count) * 1.2)
production_countries
country_counts <- table(netflix_df$production_countries)
country_data <- data.frame(
country = names(country_counts),
count = as.numeric(country_counts)
)
country_data$percentage <- format(round(country_data$count / sum(country_data$count) * 100, 2), nsmall=2)
country_data <- country_data[order(-country_data$count), ]
#Treemap
country_data$Country.Index <- paste0(country_data$country," ", country_data$percentage, "%")
treemap(country_data, index = "Country.Index", vSize = "count", title = "Production Countries Treemap")
imdb_score
ggplot(data = netflix_df, aes(x = "", y = imdb_score)) +
geom_boxplot() +
labs(x = "", y = "IMDb Score") +
ggtitle("Distribution of IMDb Scores") +
theme_minimal() +
coord_cartesian(ylim = c(0, 10)) + # Adjust the y-axis limits as per your preference
geom_text(aes(x = 1, y = max(imdb_score), label = paste("Max:", max(imdb_score))),
vjust = -1, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = min(imdb_score), label = paste("Min:", min(imdb_score))),
vjust = 2, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = median(imdb_score), label = paste("Median:", median(imdb_score))),
vjust = 0, hjust = -1, color = "red") +
geom_text(aes(x = 1, y = quantile(imdb_score, 0.25),
label = paste("Q1:", quantile(imdb_score, 0.25))),
vjust = -1, hjust = -1, color = "blue") +
geom_text(aes(x = 1, y = quantile(imdb_score, 0.75),
label = paste("Q3:", quantile(imdb_score, 0.75))),
vjust = 1, hjust = -1, color = "blue")
imdb_votes
options(repr.plot.width = 10, repr.plot.height = 6)
ggplot(data = netflix_df, aes(x = "", y = imdb_votes)) +
geom_boxplot() +
labs(x = "", y = "IMDb Votes") +
ggtitle("Distribution of IMDb Votes") +
theme_minimal() +
coord_cartesian(ylim = c(-1000, 2800000)) + # Adjust the y-axis limits as per your preference
geom_text(aes(x = 1, y = max(imdb_votes), label = paste("Max:", max(imdb_votes))),
vjust = -1, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = min(imdb_votes), label = paste("Min:", min(imdb_votes))),
vjust = 2, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = median(imdb_votes), label = paste("Median:", median(imdb_votes))),
vjust = 0, hjust = -1, color = "red") +
geom_text(aes(x = 1, y = quantile(imdb_votes, 0.25),
label = paste("Q1:", quantile(imdb_votes, 0.25))),
vjust = -1, hjust = -1, color = "blue") +
geom_text(aes(x = 1, y = quantile(imdb_votes, 0.75),
label = paste("Q3:", quantile(imdb_votes, 0.75))),
vjust = 1, hjust = -1, color = "blue")
tmdb_popularity
ggplot(data = netflix_df, aes(x = "", y = tmdb_popularity)) +
geom_boxplot() +
labs(x = "", y = "TMDB Popularity") +
ggtitle("Distribution of TMDB Popularity") +
theme_minimal() +
coord_cartesian(ylim = c(-100, 1500)) +
geom_text(aes(x = 1, y = max(tmdb_popularity), label = paste("Max:", max(tmdb_popularity))),
vjust = -1, hjust = 0, color = "red", size = 3) +
geom_text(aes(x = 1, y = min(tmdb_popularity), label = paste("Min:", min(tmdb_popularity))),
vjust = 2, hjust = 0, color = "red", size = 3) +
geom_text(aes(x = 1, y = median(tmdb_popularity), label = paste("Median:", median(tmdb_popularity))),
vjust = 0, hjust = -1, color = "red", size = 3) +
geom_text(aes(x = 1, y = quantile(tmdb_popularity, 0.25),
label = paste("Q1:", quantile(tmdb_popularity, 0.25))),
vjust = -1, hjust = -1, color = "blue", size = 3) +
geom_text(aes(x = 1, y = quantile(tmdb_popularity, 0.75),
label = paste("Q3:", quantile(tmdb_popularity, 0.75))),
vjust = 1, hjust = -1, color = "blue", size = 3)
tmdb_score
# TMDB score feature
ggplot(data = netflix_df, aes(x = "", y = tmdb_score)) +
geom_boxplot() +
labs(x = "", y = "TMDB Score") +
ggtitle("Distribution of TMDB Score") +
theme_minimal() +
coord_cartesian(ylim = c(-2, 15)) + # Adjust the y-axis limits as per your preference
geom_text(aes(x = 1, y = max(tmdb_score), label = paste("Max:", max(tmdb_score))),
vjust = -1, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = min(tmdb_score), label = paste("Min:", min(tmdb_score))),
vjust = 2, hjust = 0, color = "red") +
geom_text(aes(x = 1, y = median(tmdb_score), label = paste("Median:", median(tmdb_score))),
vjust = 0, hjust = -1, color = "red") +
geom_text(aes(x = 1, y = quantile(tmdb_score, 0.25),
label = paste("Q1:", quantile(tmdb_score, 0.25))),
vjust = -1, hjust = -1, color = "blue") +
geom_text(aes(x = 1, y = quantile(tmdb_score, 0.75),
label = paste("Q3:", quantile(tmdb_score, 0.75))),
vjust = 1, hjust = -1, color = "blue")
imdb_score and tmdb_score variables
# Density plot of the IMDb and TMDB scores
ggplot(netflix_df) +
geom_density(aes(x = imdb_score, fill = "IMDb Score"), alpha = 0.5) +
geom_density(aes(x = tmdb_score, fill = "TMDB Score"), alpha = 0.5) +
scale_fill_manual(values = c("blue", "green")) +
labs(x = "Score", y = "Density") +
ggtitle("Density Plot - IMDb Score vs TMDB Score") +
theme_minimal()
imdb_votes and tmdb_popularity
max_imdb_votes <- max(netflix_df$imdb_votes)
netflix_df <- netflix_df[which(netflix_df$imdb_votes!=max_imdb_votes),]
max_tmdb_popularity <- max(netflix_df$tmdb_popularity)
netflix_df <- netflix_df[which(netflix_df$tmdb_popularity!=max_tmdb_popularity),]
summary(netflix_df)
## title type release_year runtime
## Length:5366 Length:5366 Min. :1954 Min. : 2.00
## Class :character Class :character 1st Qu.:2017 1st Qu.: 45.00
## Mode :character Mode :character Median :2019 Median : 84.00
## Mean :2017 Mean : 78.46
## 3rd Qu.:2021 3rd Qu.:106.00
## Max. :2023 Max. :225.00
## genres production_countries imdb_score imdb_votes
## Length:5366 Length:5366 Min. :1.500 Min. : 5
## Class :character Class :character 1st Qu.:5.900 1st Qu.: 616
## Mode :character Mode :character Median :6.600 Median : 2370
## Mean :6.552 Mean : 21594
## 3rd Qu.:7.300 3rd Qu.: 9684
## Max. :9.600 Max. :2106826
## tmdb_popularity tmdb_score
## Min. : 0.600 Min. : 1.000
## 1st Qu.: 3.850 1st Qu.: 6.036
## Median : 8.305 Median : 6.800
## Mean : 20.706 Mean : 6.667
## 3rd Qu.: 17.735 3rd Qu.: 7.400
## Max. :1005.232 Max. :10.000
From the correlation heatmap, the correlations between most of these
variables are low. However, the correlation between
imdb_score and tmdb_score is 0.5373, which is
relatively high, so we consider combining them in the prediction
module.
#Identify correlations between runtime, release_year, imdb_score, tmdb_score, imdb_votes, tmdb_popularity
selected_data <- netflix_df %>% select(runtime, release_year, imdb_score, tmdb_score, imdb_votes, tmdb_popularity)
correlation_matrix <- cor(selected_data) #Compute the correlation matrix
correlation_df <- as.data.frame(as.table(correlation_matrix)) #Convert the correlation matrix to a data frame
#Create a heatmap
my_colors <- c("#e31a1c", "#ff7f00", "#fdbf6f", "#a6cee3", "#1f78b4") #Custom color palette
ggplot(correlation_df, aes(Var1, Var2)) +
geom_tile(aes(fill = Freq), color = "white") +
scale_fill_gradientn(colors = my_colors, limits = c(-1, 1), na.value = "white") +
labs(title = "Correlation Heatmap") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
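The heatmap code above fills tiles by a column named `Freq`. That name is only the default that `as.table()` followed by `as.data.frame()` assigns to the cell values, which here hold correlation coefficients rather than frequencies. A minimal sketch on toy data (all names and values hypothetical):

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),  # perfectly correlated with a
  c = c(5, 3, 4, 1, 2)
)

cm <- cor(toy)                          # Pearson correlation matrix
cm_long <- as.data.frame(as.table(cm))  # columns: Var1, Var2, Freq
print(cm_long)
```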
The war genre has the highest average IMDb score and TMDB score.
#Find the scores in different genres
#Calculate average scores by genre
average_scores <- netflix_df %>%
group_by(genres) %>%
summarise(avg_imdb_score = mean(imdb_score),
avg_tmdb_score = mean(tmdb_score))
#Pivot the data into longer format to create a side-by-side bar chart
data_long <- average_scores %>%
select(genres, avg_imdb_score, avg_tmdb_score) %>%
pivot_longer(cols = c(avg_imdb_score, avg_tmdb_score), names_to = "score_type", values_to = "score")
#Create the side-by-side bar chart by genre
ggplot(data_long) +
geom_bar(aes(x = genres, y = score, fill = score_type), position = "dodge", stat = "identity") +
scale_fill_manual(values = c("avg_imdb_score" = "steelblue", "avg_tmdb_score" = "salmon")) +
labs(title = "Average IMDB Score and TMDB Score by Genre",
x = "Genre",
y = "Average Score",
fill = "Score Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The drama genre reached its peak TMDB popularity of around 1000 for the 2005 release year, and the thriller genre reached a peak of around 1000 for the 2023 release year. Genre popularity appears to vary considerably across release years.
#Determine the current trends of genres in movies and TV shows
#Create the line chart of TMDB popularity
ggplot(netflix_df, aes(x = release_year, y = tmdb_popularity, color = genres, group = genres)) +
geom_line() +
labs(title = "Line Chart of TMDB Popularity Over Release Year",
x = "Release Year",
y = "TMDB Popularity",
color = "Genres") +
scale_x_continuous(breaks = seq(1953, 2023, by = 10)) +
scale_y_continuous(breaks = seq(0, 1100, by = 100)) +
theme_minimal()
The TMDB score of each genre appears to vary considerably across release years.
#Create the line chart of TMDB Score
ggplot(netflix_df, aes(x = release_year, y = tmdb_score, color = genres, group = genres)) +
geom_line() +
labs(title = "Line Chart of TMDB Score Over Release Year",
x = "Release Year",
y = "TMDB Score",
color = "Genre") +
scale_x_continuous(breaks = seq(1953, 2023, by = 10)) +
scale_y_continuous(breaks = seq(0, 10, by = 1)) +
theme_minimal()
### Data Visualization of imdb_score and tmdb_score Combined (Labeled as popularity_score)
As mentioned previously, imdb_score and tmdb_score have a relatively high correlation of 0.5373 and will be combined in the prediction module. These graphs provide further understanding of, and justification for, this feature.
### Heatmap of Popularity Score and Other Features
netflix_df$popularity_score = ((netflix_df$imdb_score + netflix_df$tmdb_score)/2)
# Compute the correlation matrix
correlation_matrix <- cor(netflix_df[, c("popularity_score", "runtime", "imdb_score", "imdb_votes", "tmdb_score")])
# Melt the correlation matrix for visualization
melted_correlation <- melt(correlation_matrix)
# Create the heatmap
heatmap_plot <- ggplot(melted_correlation, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red", na.value = "white") +
labs(x = "", y = "", title = "Correlation Heatmap") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(heatmap_plot)
The heatmap shows that the popularity score, the average of imdb_score and tmdb_score, has a relatively high correlation with imdb_votes. Its very high correlation with imdb_score and tmdb_score (red tiles, close to 1) again supports combining the two scores into a single feature labeled popularity score.
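Part of the correlation between the popularity score and its two components is expected by construction, because the popularity score is their mean. As a rough check, assume for illustration only that the two scores have equal variances (this is not verified in the report). The correlation between the mean of two variables and either one of them is then sqrt((1 + r) / 2), where r is their mutual correlation. The sketch below evaluates this for the reported r = 0.5373 and compares it with a simulation; all variable names are hypothetical.

```r
# Correlation between the two scores, as reported in the text
r <- 0.5373

# Under the equal-variance assumption, cor((x + y) / 2, x) = sqrt((1 + r) / 2)
expected_cor <- sqrt((1 + r) / 2)
print(round(expected_cor, 3))

# Simulation check with two unit-variance variables correlated at about r
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- r * x + sqrt(1 - r^2) * rnorm(n)
simulated_cor <- cor((x + y) / 2, x)
print(round(simulated_cor, 3))
```

Under this assumption the expected value is roughly 0.88, so a strongly red tile for these pairs does not by itself carry new information about the data.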
### Boxplot of Popularity Score by Genre
unique_genres <- unique(netflix_df$genres)
unique_genres
## [1] "drama" "romance" "crime" "fantasy"
## [5] "comedy" "documentation" "thriller" "action"
## [9] "animation" "family" "reality" "scifi"
## [13] "western" "horror" "war" "music"
## [17] "history" "sport"
unique_countries <- unique(netflix_df$production_countries)
unique_countries
## [1] "US" "GB" "EG" "IN" "DE" "CA" "LB" "JP" "AR" "FR" "IE" "AU" "ET" "HK" "MX"
## [16] "CN" "ES" "SU" "IT" "NZ" "DK" "CO" "TW" "KR" "RU" "NG" "PS" "TR" "MY" "PH"
## [31] "ZA" "MA" "SE" "SG" "KE" "NO" "CL" "SA" "BR" "ID" "IS" "IL" "PL" "FI" "CD"
## [46] "RO" "BE" "NL" "UA" "QA" "GL" "AT" "AE" "BY" "JO" "VN" "TN" "TH" "KH" "CH"
## [61] "CU" "UY" "CZ" "PE" "PR" "KW" "IR" "PY" "PK" "HU" "IQ" "BD" "TZ" "CM" "LU"
## [76] "SN" "BT" "PT" "AO" "GH" "ZW" "MW" "GT" "MU" "BG" "DO" "PA" "IO" "FO"
# Filter movies and select specific genres
movie_list1 <- netflix_df %>%
filter(type == "MOVIE" & genres %in% c("romance", "horror", "fantasy", "comedy", "thriller", "action", "scifi"))
movie_list2 <- netflix_df %>%
filter(type == "MOVIE" & genres %in% c("family", "drama", "crime", "music", "sport", "animation"))
movie_list3<- netflix_df %>%
filter(type == "MOVIE" & genres %in% c("documentation", "western", "history", "war"))
# Create boxplot 1
plot1 <- ggplot(movie_list1, aes(x = genres, y = popularity_score)) +
geom_boxplot() +
xlab("Genres") +
ylab("Popularity Score") +
ggtitle("Popularity Score of Movies in Each Genre Category")
# Create boxplot 2
plot2 <- ggplot(movie_list2, aes(x = genres, y = popularity_score)) +
geom_boxplot() +
xlab("Genres") +
ylab("Popularity Score") +
ggtitle("Popularity Score of Movies in Each Genre Category")
# Create boxplot 3
plot3 <- ggplot(movie_list3, aes(x = genres, y = popularity_score)) +
geom_boxplot() +
xlab("Genres") +
ylab("Popularity Score") +
ggtitle("Popularity Score of Movies in Each Genre Category")
# Print the boxplots
print(plot1)
Action, comedy, and fantasy do reach popularity scores above 7.5.
print(plot2)
The only genres that score below 7.5 are family and sport.
print(plot3)
The boxplots of movie popularity score by genre give a better idea of the typical score in each genre. The most popular movie genre on average is documentation, but when we look at outliers, action, romance, and animation are rated higher than the other genres. On the other hand, drama, thriller, and western have lower popularity scores than the others when outliers are considered.
# save the current df
write.csv(netflix_df, file = "cleaned_data_netflix.csv", row.names = FALSE)
netflix_df_reg <- read.csv("cleaned_data_netflix.csv", sep = ',')
# Convert nominal features to factors
netflix_df_reg$type <- as.factor(netflix_df_reg$type)
netflix_df_reg$genres <- as.factor(netflix_df_reg$genres)
# Select the predictor features and target variable
features <- netflix_df_reg[, c("type", "runtime", "genres", "imdb_votes")]
target <- netflix_df_reg$imdb_score
# Preprocess the data
preprocess_params <- preProcess(features, method = c("center", "scale"))
features_scaled <- predict(preprocess_params, features)
# Split the data into training and testing sets
set.seed(42)
train_indices <- sample(1:nrow(features_scaled), nrow(features_scaled) * 0.8)
X_train <- features_scaled[train_indices, ]
y_train <- target[train_indices]
X_test <- features_scaled[-train_indices, ]
y_test <- target[-train_indices]
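In the code above, the centering and scaling parameters are estimated on all rows before the data is split. A common alternative is to estimate them on the training rows only and then apply them to the test rows. The sketch below shows that alternative with base R `scale()` rather than caret's `preProcess()`, on toy data; every name and value in it is hypothetical.

```r
set.seed(42)

# Toy numeric predictors (hypothetical values)
toy <- data.frame(
  runtime    = rnorm(100, mean = 80, sd = 30),
  imdb_votes = rexp(100, rate = 1 / 20000)
)

# 80/20 split, drawn before any scaling
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Centering and scaling parameters estimated on the training rows only
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# The same parameters are applied to both subsets
train_scaled <- scale(train, center = centers, scale = scales)
test_scaled  <- scale(test,  center = centers, scale = scales)
```

With this ordering the test rows play no part in choosing the transformation, so the test metrics are not influenced by them.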
#Random Forest Regression
# Train a Random Forest regression model
model_rf <- randomForest(X_train, y_train)
# Make predictions on the test set
y_pred <- predict(model_rf, X_test)
# Evaluate the model
RMSE <- sqrt(mean((y_pred - y_test)^2))
R_squared <- cor(y_pred, y_test)^2
# Print the results
cat("Random Forest Regression:\n",
paste("Root Mean Squared Error (RMSE):", RMSE), "\n",
paste("R-squared:", R_squared))
## Random Forest Regression:
## Root Mean Squared Error (RMSE): 0.971529427451738
## R-squared: 0.270530849394465
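The R-squared reported here is the squared Pearson correlation between predictions and observed values. Another common definition is the coefficient of determination, 1 - SS_res / SS_tot, which can be lower and can even be negative for poor predictions. The sketch below computes both on toy values so the two can be compared; the helper names `rmse`, `r2_det`, and `r2_cor` are hypothetical.

```r
# Toy observed and predicted values (hypothetical)
y_true <- c(6.5, 7.2, 5.8, 8.1, 6.9)
y_hat  <- c(6.0, 7.0, 6.2, 7.5, 7.1)

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Coefficient of determination: 1 - SS_res / SS_tot
r2_det <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Squared Pearson correlation, as used in the report
r2_cor <- function(actual, predicted) cor(actual, predicted)^2

metrics <- c(
  RMSE             = rmse(y_true, y_hat),
  R2_determination = r2_det(y_true, y_hat),
  R2_correlation   = r2_cor(y_true, y_hat)
)
print(metrics)
```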
# Train a Linear Regression model
model_lm <- train(X_train, y_train, method = "lm")
# Make predictions on the test set using Linear Regression
y_pred_lm <- predict(model_lm, X_test)
RMSE_lm <- sqrt(mean((y_pred_lm - y_test)^2))
R_squared_lm <- cor(y_pred_lm, y_test)^2
cat("Linear Regression:\n",
paste("Root Mean Squared Error (RMSE):", RMSE_lm), "\n",
paste("R-squared:", R_squared_lm))
## Linear Regression:
## Root Mean Squared Error (RMSE): 1.01196085625222
## R-squared: 0.199071005416998
# Train an XGBoost regression model
model_xgb <- xgboost(data = as.matrix(sapply(X_train, as.numeric)), label = y_train, nrounds = 100, objective = "reg:squarederror")
## [1] train-rmse:4.370645
## [2] train-rmse:3.137754
## [3] train-rmse:2.297079
## [4] train-rmse:1.735711
## [5] train-rmse:1.373568
## [6] train-rmse:1.151367
## [7] train-rmse:1.021556
## [8] train-rmse:0.948363
## [9] train-rmse:0.905359
## [10] train-rmse:0.881415
## [11] train-rmse:0.869031
## [12] train-rmse:0.860006
## [13] train-rmse:0.849110
## [14] train-rmse:0.837696
## [15] train-rmse:0.832084
## [16] train-rmse:0.828467
## [17] train-rmse:0.819644
## [18] train-rmse:0.813108
## [19] train-rmse:0.809294
## [20] train-rmse:0.804897
## [21] train-rmse:0.800795
## [22] train-rmse:0.795338
## [23] train-rmse:0.794334
## [24] train-rmse:0.788178
## [25] train-rmse:0.787154
## [26] train-rmse:0.779294
## [27] train-rmse:0.773339
## [28] train-rmse:0.767792
## [29] train-rmse:0.765389
## [30] train-rmse:0.761413
## [31] train-rmse:0.756249
## [32] train-rmse:0.754820
## [33] train-rmse:0.748801
## [34] train-rmse:0.746557
## [35] train-rmse:0.742853
## [36] train-rmse:0.740443
## [37] train-rmse:0.736762
## [38] train-rmse:0.731600
## [39] train-rmse:0.727791
## [40] train-rmse:0.724929
## [41] train-rmse:0.722721
## [42] train-rmse:0.718119
## [43] train-rmse:0.715060
## [44] train-rmse:0.712672
## [45] train-rmse:0.708882
## [46] train-rmse:0.706684
## [47] train-rmse:0.702212
## [48] train-rmse:0.700128
## [49] train-rmse:0.699390
## [50] train-rmse:0.695263
## [51] train-rmse:0.693857
## [52] train-rmse:0.687627
## [53] train-rmse:0.682315
## [54] train-rmse:0.678093
## [55] train-rmse:0.673580
## [56] train-rmse:0.669791
## [57] train-rmse:0.666542
## [58] train-rmse:0.663987
## [59] train-rmse:0.662294
## [60] train-rmse:0.659679
## [61] train-rmse:0.659322
## [62] train-rmse:0.658000
## [63] train-rmse:0.656244
## [64] train-rmse:0.655728
## [65] train-rmse:0.649599
## [66] train-rmse:0.648203
## [67] train-rmse:0.644497
## [68] train-rmse:0.641948
## [69] train-rmse:0.640363
## [70] train-rmse:0.638205
## [71] train-rmse:0.633388
## [72] train-rmse:0.632447
## [73] train-rmse:0.631156
## [74] train-rmse:0.628032
## [75] train-rmse:0.624470
## [76] train-rmse:0.621060
## [77] train-rmse:0.618899
## [78] train-rmse:0.615379
## [79] train-rmse:0.612696
## [80] train-rmse:0.609816
## [81] train-rmse:0.608471
## [82] train-rmse:0.605020
## [83] train-rmse:0.603320
## [84] train-rmse:0.601185
## [85] train-rmse:0.600780
## [86] train-rmse:0.597664
## [87] train-rmse:0.594632
## [88] train-rmse:0.593355
## [89] train-rmse:0.590853
## [90] train-rmse:0.588052
## [91] train-rmse:0.586884
## [92] train-rmse:0.584685
## [93] train-rmse:0.581916
## [94] train-rmse:0.579682
## [95] train-rmse:0.578330
## [96] train-rmse:0.576104
## [97] train-rmse:0.575613
## [98] train-rmse:0.572875
## [99] train-rmse:0.571367
## [100] train-rmse:0.569285
# Make predictions on the test set using XGBoost
y_pred_xgb <- predict(model_xgb, as.matrix(sapply(X_test, as.numeric)))
# Evaluate the model
RMSE_xgb <- sqrt(mean((y_pred_xgb - y_test)^2))
# Squared Pearson correlation between predictions and targets; this can differ from 1 - SSE/SST
R_squared_xgb <- cor(y_pred_xgb, y_test)^2
cat("XGBoost Regression:\n",
paste("Root Mean Squared Error (RMSE):", RMSE_xgb), "\n",
paste("R-squared:", R_squared_xgb))
## XGBoost Regression:
## Root Mean Squared Error (RMSE): 1.03702433872803
## R-squared: 0.209229207703833
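The two evaluation quantities used above can be reproduced on a small example. The sketch below uses made-up vectors (`y_true` and `y_hat` are hypothetical, not values from the Netflix data) and compares the squared correlation with the coefficient of determination, 1 - SSE/SST, since the two need not agree.

```r
# Toy vectors (hypothetical values, not taken from the Netflix data)
y_true <- c(6.1, 7.3, 5.8, 8.0, 6.6)
y_hat  <- c(6.4, 6.9, 6.0, 7.1, 6.5)

# Root mean squared error, computed as in the evaluation above
rmse <- sqrt(mean((y_hat - y_true)^2))

# Squared Pearson correlation between predictions and targets
r2_cor <- cor(y_hat, y_true)^2

# Coefficient of determination: 1 - SSE / SST
sse <- sum((y_true - y_hat)^2)
sst <- sum((y_true - mean(y_true))^2)
r2_det <- 1 - sse / sst

cat("RMSE:", round(rmse, 4), "\n",
    "Squared correlation:", round(r2_cor, 4), "\n",
    "1 - SSE/SST:", round(r2_det, 4), "\n")
```

The squared correlation is never smaller than 1 - SSE/SST, so it can look more favourable when predictions are biased or badly scaled.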
# New data for prediction
new_data <- data.frame(
type = factor("MOVIE", levels = levels(features$type)),
runtime = 120L,
genres = factor("drama", levels = levels(features$genres)),
imdb_votes = 5000L
)
# Preprocess the new data
new_data_scaled <- predict(preprocess_params, new_data)
# Make prediction on the new data
prediction_rf <- predict(model_rf, new_data_scaled)
prediction_lm <- predict(model_lm, new_data_scaled)
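# Caveat: sapply() on a one-row data frame returns a plain vector, so as.matrix()
# gives a 4 x 1 column matrix rather than 1 x 4; data.matrix(new_data_scaled) keeps one row
prediction_xgb <- predict(model_xgb, newdata = as.matrix(sapply(new_data_scaled, as.numeric)))
# Average the values returned for that matrix (XGB)
average_prediction_xgb <- mean(prediction_xgb)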
# Print the prediction
cat( "Regression Prediction Results: \n" ,
"Predicted imdb_score (RF):", round(prediction_rf, 2), "\n",
"Predicted imdb_score (LR):", round(prediction_lm, 2), "\n",
"Predicted imdb_score (XGB):", round(average_prediction_xgb, 2))
## Regression Prediction Results:
## Predicted imdb_score (RF): 6.41
## Predicted imdb_score (LR): 6.44
## Predicted imdb_score (XGB): 5.22
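The matrix conversion used for the XGBoost prediction behaves differently for a single row than for a full test set. The sketch below shows the shapes involved; the data frame `one_row` and its factor levels are hypothetical stand-ins for `new_data`. This shape difference may be why several values came back and had to be averaged, although that depends on the xgboost version in use.

```r
# Hypothetical one-row data frame with the same four columns as new_data
one_row <- data.frame(
  type = factor("MOVIE", levels = c("MOVIE", "SHOW")),
  runtime = 120L,
  genres = factor("drama", levels = c("comedy", "drama")),
  imdb_votes = 5000L
)

# sapply() simplifies four length-1 results to a vector, so as.matrix()
# produces a single column with four rows
m_sapply <- as.matrix(sapply(one_row, as.numeric))

# data.matrix() converts factors to their integer codes and keeps one row
m_data <- data.matrix(one_row)

print(dim(m_sapply))
print(dim(m_data))
```

Both conversions hold the same four numbers; only the layout differs, and a model trained on four feature columns expects the single-row layout.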
netflix_df_cls <- read.csv("cleaned_data_netflix.csv", sep = ',')
# Calculate the popularity score as the mean of the IMDb and TMDB scores
netflix_df_cls$popularity_score <- (netflix_df_cls$imdb_score + netflix_df_cls$tmdb_score) / 2
# Round the popularity score to two decimal places
netflix_df_cls$popularity_score <- round(netflix_df_cls$popularity_score, 2)
# Label titles scoring above 7.5 as "Popular" and encode the label as 1/0
netflix_df_cls$popularityLabel <- ifelse(netflix_df_cls$popularity_score > 7.5, "Popular", "Not Popular")
netflix_df_cls$popularity <- ifelse(netflix_df_cls$popularityLabel == "Popular", 1, 0)
# Select the features and target variables
features <- netflix_df_cls[, c( "type", "runtime", "genres", "imdb_score", "imdb_votes", "tmdb_score")]
target <- factor(netflix_df_cls$popularity, levels = c(0, 1), labels = c("Not Popular", "Popular"))
# Convert nominal features to factors
features$type <- as.factor(features$type)
features$genres <- as.factor(features$genres)
# Split the data into training and testing sets
set.seed(42)
train_indices <- sample(1:nrow(features), nrow(features) * 0.8)
X_train <- features[train_indices, ]
y_train <- target[train_indices]
X_test <- features[-train_indices, ]
y_test <- target[-train_indices]
# Train a Random Forest Classifier
classifier_rf <- randomForest(X_train, y_train)
# Make predictions on the test set
y_pred <- predict(classifier_rf, X_test)
# Evaluate the model
accuracy_rf <- sum(y_pred == y_test) / length(y_test)
classification_report_rf <- table(y_test, y_pred)
# Print the results
cat("Accuracy (Random Forest):", accuracy_rf, "\n\n",
"Classification Report (Random Forest):\n",
paste(capture.output(print(classification_report_rf)), collapse = "\n")
)
## Accuracy (Random Forest): 0.9944134
##
## Classification Report (Random Forest):
## y_pred
## y_test Not Popular Popular
## Not Popular 859 2
## Popular 4 209
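The near-perfect accuracy is easier to interpret alongside how the label is built: the popularity label is a threshold on the mean of `imdb_score` and `tmdb_score`, and both columns are also input features. The sketch below uses simulated scores (hypothetical, not the Netflix data) and a hypothetical helper, `popularity_rule`, to show that a fixed rule on those two columns reproduces such a label without any model fitting.

```r
set.seed(1)
n <- 500

# Simulated scores on a 0-10 scale (hypothetical, not the Netflix data)
imdb_score <- round(runif(n, 3, 9.5), 1)
tmdb_score <- round(runif(n, 3, 9.5), 1)

# Label built the same way as popularityLabel in the report
popularity_score <- round((imdb_score + tmdb_score) / 2, 2)
label <- ifelse(popularity_score > 7.5, "Popular", "Not Popular")

# Fixed rule using only the two score columns (hypothetical helper)
popularity_rule <- function(imdb, tmdb, threshold = 7.5) {
  ifelse(round((imdb + tmdb) / 2, 2) > threshold, "Popular", "Not Popular")
}
rule_pred <- popularity_rule(imdb_score, tmdb_score)

cat("Accuracy of the fixed rule:", mean(rule_pred == label), "\n")
```

A classifier given both score columns can learn essentially this rule, so dropping them from the features would give a stricter test of whether type, runtime, genres and votes predict popularity.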
# Train an SVM classifier
# Feature scaling using preProcess() for SVM
preprocess_params <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preprocess_params, X_train)
X_test_scaled <- predict(preprocess_params, X_test)
classifier_svm <- svm(y_train ~ ., data = X_train_scaled, kernel = "radial", cost = 1, scale = FALSE, max_iter = 10000)
# Make predictions on the test set using SVM
y_pred_svm <- predict(classifier_svm, newdata = X_test_scaled)
# Evaluate the SVM model
accuracy_svm <- sum(y_pred_svm == y_test) / length(y_test)
classification_report_svm <- table(y_test, y_pred_svm)
# Print the SVM results
cat("Accuracy (SVM):", accuracy_svm, "\n\n",
"Classification Report (SVM):\n",
paste(capture.output(print(classification_report_svm)), collapse = "\n")
)
## Accuracy (SVM): 0.9841713
##
## Classification Report (SVM):
## y_pred_svm
## y_test Not Popular Popular
## Not Popular 860 1
## Popular 16 197
# Train a Naive Bayes Classifier
classifier_nb <- naiveBayes(X_train, y_train)
# Make predictions on the test set using Naive Bayes
y_pred_nb <- predict(classifier_nb, X_test)
# Evaluate the Naive Bayes model
accuracy_nb <- sum(y_pred_nb == y_test) / length(y_test)
classification_report_nb <- table(y_test, y_pred_nb)
# Print the Naive Bayes results
cat("Accuracy (Naive Bayes):", accuracy_nb, "\n\n",
"Classification Report (Naive Bayes):\n",
paste(capture.output(print(classification_report_nb)), collapse = "\n")
)
## Accuracy (Naive Bayes): 0.9134078
##
## Classification Report (Naive Bayes):
## y_pred_nb
## y_test Not Popular Popular
## Not Popular 838 23
## Popular 70 143
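Accuracy alone hides how each model treats the smaller "Popular" class. The sketch below copies the three confusion matrices printed above and derives precision, recall and F1 for that class; the helpers `make_cm` and `popular_metrics` are hypothetical names introduced for illustration.

```r
# Confusion matrices copied from the printed output above
# (rows = actual class, columns = predicted class)
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("Not Popular", "Popular"),
                         predicted = c("Not Popular", "Popular")))
}
cms <- list(
  "Random Forest" = make_cm(859, 2, 4, 209),
  "SVM"           = make_cm(860, 1, 16, 197),
  "Naive Bayes"   = make_cm(838, 23, 70, 143)
)

# Accuracy plus precision, recall and F1 for the "Popular" class
popular_metrics <- function(cm) {
  tp <- cm["Popular", "Popular"]
  fp <- cm["Not Popular", "Popular"]
  fn <- cm["Popular", "Not Popular"]
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = sum(diag(cm)) / sum(cm),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# One row per model, one column per metric
metrics <- t(sapply(cms, popular_metrics))
print(round(metrics, 4))
```

The recall column makes the gap between models clearer than accuracy does: Naive Bayes recovers 143 of the 213 popular test titles, while Random Forest recovers 209.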
# Making predictions
# New data for prediction
new_data <- data.frame(
type = factor("MOVIE", levels = levels(features$type)),
runtime = 78L,
genres = factor("comedy", levels = levels(features$genres)),
imdb_score = 7.4,
#production_countries = factor("US", levels = levels(features$production_countries)),
imdb_votes = 2000L,
tmdb_score = 8.1
)
# Define labels for factor levels
class_labels <- levels(target)
# Make prediction on new data
prediction_rf <- predict(classifier_rf, newdata = new_data)
prediction_nb <- predict(classifier_nb, newdata = new_data)
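# Caveat: classifier_svm was trained on centred and scaled features, so new_data
# should first be passed through predict(preprocess_params, new_data)
prediction_svm <- predict(classifier_svm, newdata = new_data)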
# Convert prediction to class labels
predicted_class_rf <- class_labels[prediction_rf]
predicted_class_nb <- class_labels[prediction_nb]
predicted_class_svm <- class_labels[prediction_svm]
cat(
"Classifier Prediction Results:", "\n",
"Random Forest (RF) : ", predicted_class_rf, "\n",
"Naive Bayes (NB) : ", predicted_class_nb, "\n",
"Support Vector Machines (SVM) : ", predicted_class_svm
)
## Classifier Prediction Results:
## Random Forest (RF) : Popular
## Naive Bayes (NB) : Not Popular
## Support Vector Machines (SVM) : Popular
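Because the SVM was trained on centred and scaled features, a new observation needs the same transformation, using the training-set statistics, before prediction. The sketch below mirrors in base R what centring and scaling do; the training values and the names `centre`, `spread` and `new_obs` are hypothetical.

```r
# Hypothetical training values for two numeric features
train <- data.frame(runtime = c(90, 105, 120, 45, 60),
                    imdb_votes = c(1500, 32000, 870, 12000, 560))

# Centre and scale with statistics estimated on the training set only
centre <- colMeans(train)
spread <- sapply(train, sd)
train_scaled <- scale(train, center = centre, scale = spread)

# A new observation must be transformed with the same training statistics
new_obs <- data.frame(runtime = 78, imdb_votes = 2000)
new_scaled <- scale(new_obs, center = centre, scale = spread)

print(new_obs)
print(round(new_scaled, 3))
```

Passing the raw values (78 minutes, 2000 votes) to a model fitted on standardised inputs places the observation far outside the range the model saw during training.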
Based on the Root Mean Squared Error (RMSE), the Random Forest Regressor performed best with the lowest RMSE of 0.9705, followed by the linear regression model at 1.012 and XGBoost at 1.037. For the example movie, the predicted imdb_score is 6.41 from the Random Forest Regressor, 6.44 from the linear regression model and 5.22 from the XGBoost model.
Based on the classification reports of the three models, the Random Forest Classifier performed best with an accuracy of 99.4%, followed by Support Vector Machines (SVM) at 98.4%, while Naive Bayes had the lowest accuracy at 91.3%. These accuracies should be read with care, because imdb_score and tmdb_score are among the input features and the popularity label is derived from their mean. For the example input, whose mean score of 7.75 is above the 7.5 threshold, the Random Forest Classifier and SVM predicted the movie as popular, while Naive Bayes predicted it as not popular.
Together, the regression and classification models allow Netflix to identify potentially trending movies to add to its platform and so enhance customer engagement.