Clustering by Audience Engagement

Audience engagement is a key factor in understanding the success of movies. By clustering movies based on IMDb ratings, votes, and box office earnings, we can identify patterns in how audiences interact with films. In this analysis, we apply K-Means clustering to group movies with similar engagement levels.

Data Preprocessing

Before clustering, we preprocess the data by converting categorical variables and normalizing numerical features. The following code handles missing values and scales the data so that the later steps run without issues.

#load necessary libraries
library(dplyr)

library(ggplot2)

library(cluster)
library(reshape2)

library(tidyr)

Loading the data

data <- read.csv("Top_10000_Movies_IMDb.csv")

Converting Runtime, Rating, Votes, and Gross to numeric

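# Strip the " min" suffix before converting Runtime
data$Runtime <- as.numeric(gsub(" min", "", data$Runtime))
# Each column is converted from its own values
data$Rating <- as.numeric(data$Rating)
data$Votes <- as.numeric(data$Votes)
data$Gross <- as.numeric(data$Gross)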
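As a cautious check on this kind of conversion, the sketch below applies the same `gsub`/`as.numeric` pattern to a small made-up vector and counts the entries that end up as NA. The values are hypothetical and not taken from the dataset. The point is that `as.numeric` returns NA for strings it cannot parse rather than stopping with an error, so it is worth counting them.

```r
# Toy runtime strings; the values are hypothetical, not from the IMDb file
runtime_raw <- c("142 min", "175 min", "", "96 min")

# Strip the " min" suffix, then convert; the empty string becomes NA
runtime_num <- as.numeric(gsub(" min", "", runtime_raw))

# Count how many values could not be parsed
n_missing <- sum(is.na(runtime_num))
print(runtime_num)
print(n_missing)
```

The same count could be run on each converted column before clustering, because `kmeans` does not accept missing values.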

Replacing missing values in Metascore with the median

data$Metascore[is.na(data$Metascore)] <- median(data$Metascore, na.rm = TRUE)
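The median imputation above can be illustrated on a toy vector. This is only a sketch with made-up scores, meant to show which entries change and which stay as they are.

```r
# Hypothetical Metascore values with gaps
metascore <- c(80, NA, 65, 90, NA)

# The median ignores NAs when na.rm = TRUE
fill_value <- median(metascore, na.rm = TRUE)

# Replace only the missing entries; observed values are untouched
metascore[is.na(metascore)] <- fill_value
print(metascore)
```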

Creating separate columns for each genre

colnames(data)  # Display all column names

##  [1] "ID"         "Movie.Name" "Rating"     "Runtime"    "Genre"     
##  [6] "Metascore"  "Plot"       "Directors"  "Stars"      "Votes"     
## [11] "Gross"      "Link"

genres <- strsplit(data$Genre, ", ")
# Compute the list of distinct genres once
all_genres <- unique(unlist(genres))
# One row per movie, one 0/1 indicator per genre
genres_data <- as.data.frame(do.call(rbind, lapply(genres, function(x) as.numeric(all_genres %in% x))))

# Rename columns
colnames(genres_data) <- all_genres

# Combine with the main dataframe
data <- cbind(data, genres_data)

# Remove the original Genre column
data <- data %>% select(-Genre)
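To make the genre step easier to follow, here is a minimal sketch on three made-up genre strings in the same "A, B" format. The rows and the names `genre_raw`, `genre_list`, and `indicator` are assumptions for illustration only.

```r
# Toy genre strings in the same comma-separated format (made-up rows)
genre_raw <- c("Action, Drama", "Drama", "Comedy, Action")

genre_list <- strsplit(genre_raw, ", ")
all_genres <- unique(unlist(genre_list))

# One row per movie, one 0/1 column per genre
indicator <- t(sapply(genre_list, function(g) as.numeric(all_genres %in% g)))
colnames(indicator) <- all_genres
print(indicator)
```

Each row sums to the number of genres listed for that movie, which gives a quick sanity check on the result.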

Before clustering, we need to normalize the numerical features. We scale each numerical feature to the range 0 to 1 using min-max normalization.

numerical_cols <- c("Rating","Runtime","Metascore","Votes","Gross")
normalize <- function(x) { (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) }

data[numerical_cols] <- as.data.frame(lapply(data[numerical_cols],normalize))
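As a small worked example of min-max scaling, the sketch below applies the same formula, (x - min) / (max - min), to a made-up vector of vote counts. The numbers are hypothetical. Missing values stay missing, and a column whose values are all equal would produce NaN because the denominator is zero.

```r
# Same min-max formula as in the analysis
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Hypothetical vote counts, including one missing value
votes <- c(1000, 5000, 9000, NA)
scaled <- normalize(votes)
print(scaled)

# A constant column has max == min, so the result is 0/0 = NaN
constant_scaled <- normalize(c(3, 3, 3))
print(constant_scaled)
```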

Clustering Analysis

Audience engagement is a significant factor in understanding a movie's success. High engagement can be driven by many factors, such as user votes and box office earnings. In this study, we apply K-means clustering to group movies based on audience interaction patterns.

To decide the number of clusters, we use the Elbow Method, which examines how the total within-cluster sum of squares decreases as the number of clusters increases.

wss <- numeric(10)
set.seed(123)
for (k in 1:10){
  kmeans_model <- kmeans(data[, c("Votes", "Gross", "Rating")], centers = k, nstart = 10)
  wss[k] <- kmeans_model$tot.withinss
}

elbow_plot <- ggplot(data.frame(K = 1:10, WSS = wss), aes(x = K, y = WSS)) +
  geom_point() + geom_line() +
  ggtitle("Elbow method for optimal k") + xlab("Number of clusters k") +
  ylab("Within-cluster sum of squares") +
  theme_minimal()
print(elbow_plot)

What does this mean? From the elbow plot, we choose K = 10 as the optimal value, since that is where the curve starts to flatten.
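Reading an elbow from a plot is partly a judgment call, so a second criterion can help. The sketch below is not part of the original analysis: it uses `silhouette()` from the `cluster` package on synthetic data with three well-separated groups (an assumption made purely for illustration) and picks the k with the highest average silhouette width.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Synthetic 2-D data with three well-separated groups (illustrative only)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Average silhouette width for each candidate k (k = 1 is undefined)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, dist(toy))
  mean(sil[, "sil_width"])
})

# Candidate k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

On the movie data, the same loop could be run on the Votes, Gross, and Rating columns. The distance matrix grows quadratically with the number of rows, so a random sample of movies may be more practical than the full dataset.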

Applying K-means Clustering

We now apply the K-means algorithm with K = 10 and assign each movie in the dataset to a cluster.

set.seed(123)
kmeans_model <- kmeans(data[, c("Votes", "Gross", "Rating")], centers = 10, nstart = 25)
apply(data[, c("Votes", "Gross", "Rating")], 2, var)

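##       Votes       Gross      Rating 
## 0.004065031 0.004065031 0.004065031

The three variances are identical, which indicates that Votes, Gross, and Rating held the same values in the run that produced this output. The cluster sizes reported below come from that same run.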

data$Cluster <- as.factor(kmeans_model$cluster)

table(data$Cluster)

## 
##    1    2    3    4    5    6    7    8    9   10 
##  700   13 1147   35  118 5706  231  409   68 1572

Data Visualization

To understand what the clusters capture, we plot Votes against Gross, colored by cluster.

plotting <- ggplot(data, aes(x = Votes, y = Gross, color = Cluster)) +
  geom_point(alpha = 0.6) +
  ggtitle("Movie Clustering with Audience engagement") +
  xlab("Normalized Votes") +
  ylab("Normalized Gross Revenue") +
  theme_minimal()
print(plotting)

Cluster Interpretation

When we analyze the clusters, the following patterns emerge:

A- Cluster 1: Blockbusters
-High votes and high gross revenue.
-Such as: Marvel and Disney movies with mass appeal (the data includes movies released before Disney bought Marvel).

B- Cluster 2: Critically Acclaimed Films
-High votes and low gross revenue.
-Such as: Oscar-winning drama movies.

C- Cluster 3: Niche Popular Movies
-Low votes and high gross revenue.
-Such as: regional movies.

D- Cluster 4: Little-Known Movies
-Low votes and low gross revenue.
-Such as: movies from international categories.
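Labels such as "high votes" or "low gross" can be checked against per-cluster averages. The sketch below shows one way to build such a profile with `aggregate()`. It uses made-up normalized values and two clusters, so the data frame `toy` and its contents are assumptions, not results from the movie dataset.

```r
set.seed(1)
# Made-up normalized features; column names mirror those used in the analysis
toy <- data.frame(
  Votes = c(runif(20, 0.7, 1.0), runif(20, 0.0, 0.3)),
  Gross = c(runif(20, 0.7, 1.0), runif(20, 0.0, 0.3))
)

km <- kmeans(toy, centers = 2, nstart = 10)
toy$Cluster <- as.factor(km$cluster)

# Mean of each feature within each cluster
profile <- aggregate(cbind(Votes, Gross) ~ Cluster, data = toy, FUN = mean)
print(profile)
```

Applied to the movie data, a table like this, together with the cluster sizes, would show which clusters actually combine high votes with high gross and which do not.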

To sum up:

This study identifies clusters of IMDb movies based on audience engagement using K-means clustering. The results can be useful for:
-Movie studios.
-Online movie platforms.
-People who rate and review movies.

Sources:

-IMDb movies dataset: https://www.kaggle.com/datasets/moazeldsokyx/imdb-top-10000-movies-dataset
-R clustering documentation: https://www.r-bloggers.com/2024/01/overview-of-clustering-methods-in-r/
-RPubs example of clustering analysis

ClusteringIMDb

Yüceltan Ebiri

2025-01-10