Audience engagement is a key factor in understanding the success of movies. By clustering movies based on IMDb ratings, votes, and box office earnings, we can identify patterns in how audiences interact with films. In this analysis, we apply K-Means clustering to group movies with similar engagement levels.
Before clustering, we preprocess the data by converting categorical variables and normalizing numerical features. The following code handles missing values and prepares the data so that the later steps run without issues.
# Load necessary libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(cluster)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
data <- read.csv("Top_10000_Movies_IMDb.csv")
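data$Runtime <- as.numeric(gsub(" min", "", data$Runtime))
# Convert each column from its own values; the gsub() on Votes assumes
# the raw column may contain thousands separators
data$Rating <- as.numeric(data$Rating)
data$Votes <- as.numeric(gsub(",", "", data$Votes))
data$Gross <- as.numeric(data$Gross)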
data$Metascore[is.na(data$Metascore)] <- median(data$Metascore, na.rm = TRUE)
colnames(data) # Display all column names
## [1] "ID" "Movie.Name" "Rating" "Runtime" "Genre"
## [6] "Metascore" "Plot" "Directors" "Stars" "Votes"
## [11] "Gross" "Link"
genres <- strsplit(data$Genre, ", ")
# Collect the distinct genre labels once, then build one 0/1 indicator column per genre
all_genres <- unique(unlist(genres))
genres_data <- as.data.frame(do.call(rbind, lapply(genres, function(x) as.numeric(all_genres %in% x))))
# Rename columns
colnames(genres_data) <- all_genres
# Combine with the main dataframe
data <- cbind(data, genres_data)
# Remove the original Genre column
data <- data %>% select(-Genre)
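To make the genre encoding step easier to follow, here is a minimal sketch of the same idea on three made-up genre strings. The names `toy_genre`, `toy_split`, `toy_labels`, and `toy_onehot` are hypothetical and the values are not rows from the dataset.

```r
# Toy genre strings (hypothetical values, not rows from the dataset)
toy_genre <- c("Action, Drama", "Comedy", "Drama, Comedy")

# Split each string into its individual genre labels
toy_split <- strsplit(toy_genre, ", ")

# Distinct labels, in order of first appearance
toy_labels <- unique(unlist(toy_split))

# One row per movie, one 0/1 column per label
toy_onehot <- t(sapply(toy_split, function(x) as.numeric(toy_labels %in% x)))
colnames(toy_onehot) <- toy_labels
print(toy_onehot)
```

A movie with two genres gets a 1 in two columns, so each row sums to the number of genres listed for that movie.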
Before clustering, we have to normalize the numerical features. We scale each numerical feature to the range 0 to 1.
numerical_cols <- c("Rating","Runtime","Metascore","Votes","Gross")
normalize <- function(x) { (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) }
data[numerical_cols] <- as.data.frame(lapply(data[numerical_cols],normalize))
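As a quick worked example of the min-max formula, the sketch below applies it to a small made-up vector. The function is repeated so the example runs on its own; `toy_votes` and `scaled` are hypothetical names.

```r
# Same min-max formula as above, repeated so the example is self-contained
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up values, including one missing entry
toy_votes <- c(10, 20, 40, NA)

# The minimum maps to 0, the maximum to 1, and NA stays NA
scaled <- normalize(toy_votes)
print(scaled)
```

Missing values pass through unchanged. A column whose values are all equal would produce NaN, because the denominator becomes zero.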
Clustering Analyses
Audience engagement is a significant factor in understanding a movie's success. High engagement can be driven by many factors, such as user votes and box office earnings. In this study, we apply K-means clustering to group movies based on audience interaction patterns.
To decide the number of clusters, we use the Elbow Method, which examines how the within-cluster variance decreases as the number of clusters increases.
wss <- numeric(10)
set.seed(123)
for (k in 1:10) {
  kmeans_model <- kmeans(data[, c("Votes", "Gross", "Rating")], centers = k, nstart = 10)
  wss[k] <- kmeans_model$tot.withinss
}
elbow_plot <- ggplot(data.frame(K = 1:10, WSS = wss), aes(x = K, y = WSS)) +
  geom_point() + geom_line() +
  ggtitle("Elbow method for optimal k") + xlab("Number of clusters k") +
  ylab("Within-cluster sum of squares") +
  theme_minimal()
print(elbow_plot)
What does this mean? From the elbow plot, we take the optimal K to be 10, the point where the curve starts to flatten.
Working on K-means clustering
Now we use this K value (10) to apply the k-means algorithm and assign each movie in the dataset to a cluster.
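Judging where a curve flattens by eye can be subjective. One possible complement, sketched here on synthetic data rather than the movie dataset, is to print the relative drop in within-cluster sum of squares between consecutive values of k. The names `blob_centers`, `toy`, `toy_wss`, and `rel_drop` are hypothetical.

```r
set.seed(42)

# Synthetic data: three well-separated 2-D blobs (illustrative only)
blob_centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, blob_centers[i, 1], 0.5), rnorm(50, blob_centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6
toy_wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Relative drop in WSS when moving from k to k + 1
rel_drop <- -diff(toy_wss) / toy_wss[-length(toy_wss)]
print(round(rel_drop, 3))
```

On data like this, the drop is large up to the true number of groups and small afterwards, which gives a numeric counterpart to the visual elbow.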
set.seed(123)
kmeans_model <- kmeans(data[, c("Votes", "Gross", "Rating")], centers = 10, nstart = 25)
apply(data[, c("Votes", "Gross", "Rating")], 2, var)
## Votes Gross Rating
## 0.004065031 0.004065031 0.004065031
data$Cluster <- as.factor(kmeans_model$cluster)
table(data$Cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 700 13 1147 35 118 5706 231 409 68 1572
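The three variances printed above are identical to every displayed digit, which is what one would see if the clustering columns held the same values. A minimal sanity check for that situation is sketched below on a hypothetical toy data frame (`toy`, `dup_cols`, and the values are illustrative, not taken from the dataset).

```r
# Toy data frame in which Gross is deliberately a copy of Votes
toy <- data.frame(
  Votes  = c(0.1, 0.4, 0.9),
  Gross  = c(0.1, 0.4, 0.9),
  Rating = c(0.7, 0.2, 0.5)
)

# duplicated() on the list of columns flags any column equal to an earlier one
dup_cols <- names(toy)[duplicated(as.list(toy))]
print(dup_cols)
```

An empty result would mean no column repeats an earlier one, so each feature contributes its own information to the distance calculation.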
plotting <- ggplot(data, aes(x = Votes, y = Gross, color = Cluster)) +
  geom_point(alpha = 0.6) +
  ggtitle("Movie Clustering with Audience engagement") +
  xlab("Normalized Votes") + ylab("Normalized Gross Revenue") +
  theme_minimal()
print(plotting)
When we analyze the clusters, we find these patterns:
A- Cluster 1: Blockbusters
-High votes and high gross revenue.
-Such as: Marvel and Disney movies with mass appeal (the data includes movies from before Disney bought Marvel).
B- Cluster 2: Critically Acclaimed Films
-High votes and low gross revenue.
-Such as: Oscar-winning drama movies.
C- Cluster 3: Niche Popular Movies
-Low votes and high gross revenue.
-Such as: regional movies.
D- Cluster 4: Little-known movies
-Low votes and low gross revenue.
-Such as: movies from international categories.
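Labels such as "high votes" or "low gross revenue" can be checked against per-cluster averages. The sketch below shows one way to compute such a profile on hypothetical toy data with two obvious groups; `toy` and `profile` are illustrative names, and the same `aggregate()` call could be pointed at the real clustered data.

```r
set.seed(1)

# Hypothetical toy data: two groups with clearly different Votes and Gross
toy <- data.frame(
  Votes = c(runif(20, 0.0, 0.2), runif(20, 0.8, 1.0)),
  Gross = c(runif(20, 0.0, 0.2), runif(20, 0.8, 1.0))
)

# Assign each row to one of two clusters
toy$Cluster <- as.factor(kmeans(toy[, c("Votes", "Gross")], centers = 2, nstart = 10)$cluster)

# Mean of each feature within each cluster
profile <- aggregate(cbind(Votes, Gross) ~ Cluster, data = toy, FUN = mean)
print(profile)
```

Cluster numbers from k-means are arbitrary, so the profile table is what ties a number to a description.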
To sum up:
From this study, we can understand how IMDb movies cluster by audience engagement using k-means clustering. This study can be useful for:
-Movie studios.
-Online movie platforms.
-People who rate movies for reviews.
Source:
-IMDb movies dataset: https://www.kaggle.com/datasets/moazeldsokyx/imdb-top-10000-movies-dataset
-R Clustering Documentation: https://www.r-bloggers.com/2024/01/overview-of-clustering-methods-in-r/
-RPubs example of clustering analysis