Introduction

This project explores patterns and relationships in a movie dataset that records each film's budget, release year, user rating, and genre, among other fields. Using clustering techniques, we aim to uncover natural groupings of similar movies. These groupings can be useful to both audiences and producers.

  • The data come from IMDB.

Data Exploration and Cleaning

  • Load the movie dataset from the local project folder.
  • Examine the dataset's summary statistics and structure.
  • Convert the relevant columns to numeric types and handle any missing values, so that the data are consistent and reliable for the analysis.
# Read the dataset (adjust the path to your local copy of the file)
movies_AR <- read.csv("/Users/marcoayuob/Downloads/Data science /1st Year/Unsupervised Learning/Clustering Project/CL_draft.csv")
# Display summary information and structure of the dataset
summary(movies_AR)
##      Budget            Language           Country               Year     
##  Min.   :   100000   Length:5737        Length:5737        Min.   :1991  
##  1st Qu.:  5000000   Class :character   Class :character   1st Qu.:2001  
##  Median : 18000000   Mode  :character   Mode  :character   Median :2007  
##  Mean   : 30751907                                         Mean   :2006  
##  3rd Qu.: 39000000                                         3rd Qu.:2012  
##  Max.   :380000000                                         Max.   :2017  
##      Name                Rate           Genre          
##  Length:5737        Min.   : 0.000   Length:5737       
##  Class :character   1st Qu.: 5.300   Class :character  
##  Mode  :character   Median : 6.000   Mode  :character  
##                     Mean   : 5.864                     
##                     3rd Qu.: 6.600                     
##                     Max.   :10.000
str(movies_AR)
## 'data.frame':    5737 obs. of  7 variables:
##  $ Budget  : int  30000000 200000 250000 300000 500000 800000 800000 1000000 1182273 1227401 ...
##  $ Language: chr  "en" "en" "en" "en" ...
##  $ Country : chr  "United States of America" "United States of America" "United States of America" "United States of America" ...
##  $ Year    : int  1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
##  $ Name    : chr  "Madonna: Truth or Dare" "Dingo" "Poison" "High Strung" ...
##  $ Rate    : num  6.3 6 6.4 5.4 4.8 6.8 6 4.8 4.2 6.4 ...
##  $ Genre   : chr  "Documentary" "Drama" "Drama" "Comedy" ...
# Ensure Budget, Year, and Rate are numeric (as.character() guards against factor columns)
movies_AR$Budget <- as.numeric(as.character(movies_AR$Budget))
movies_AR$Year <- as.numeric(as.character(movies_AR$Year))
movies_AR$Rate <- as.numeric(as.character(movies_AR$Rate))
# Checking for missing values in the dataset and handling them
colSums(is.na(movies_AR))
##   Budget Language  Country     Year     Name     Rate    Genre 
##        0        0        0        0        0        0        0
# No missing values were found, so na.omit() is only a safeguard here
movies_AR <- na.omit(movies_AR)
# Remove rows with empty Genre
movies_AR <- movies_AR[movies_AR$Genre != "", ]
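
Note that `is.na()` does not flag empty strings, which is why the empty `Genre` values need a separate filter. A minimal sketch, using a small made-up data frame (`toy` and its values are assumptions, not the movie data), of counting both kinds of gaps per column:

```r
# Toy data frame standing in for the movie data (hypothetical values)
toy <- data.frame(
  Budget = c(100000, NA, 250000, 300000),
  Genre  = c("Drama", "", "Comedy", NA),
  stringsAsFactors = FALSE
)

# Count NA values and empty strings separately for every column
gap_counts <- data.frame(
  n_na    = sapply(toy, function(col) sum(is.na(col))),
  n_empty = sapply(toy, function(col) sum(!is.na(col) & col == ""))
)
print(gap_counts)

# Keep only rows with no NA in any column and a non-empty Genre
toy_clean <- toy[complete.cases(toy) & toy$Genre != "", ]
print(toy_clean)
```

Running a check like this before and after cleaning makes it easy to confirm how many rows each step removes.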

Statistical values and plots before clustering

# Calculate the mean rating by genre and visualize it with a bar plot
mean_ratings <- aggregate(Rate ~ Genre, data = movies_AR, FUN = mean, na.rm = TRUE)
barplot(mean_ratings$Rate, names.arg = mean_ratings$Genre, 
        col = "skyblue", xlab = "Genre", ylab = "Mean Rating",
        main = "Mean Ratings by Genre", las = 2)

This plot shows the mean rating for each genre.

# Calculate mean budget by rate and create a scatter plot to visualize them
mean_budget <- aggregate(Budget ~ Rate, data = movies_AR, FUN = mean, na.rm = TRUE)
plot(mean_budget$Rate, mean_budget$Budget, 
     xlab = "Rate", ylab = "Mean Budget", 
     main = "Mean Budget by Rate", col = "blue", pch = 16)

The plot shows that movies rated between 6 and 8 have the highest mean budgets (production costs).
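
Averaging the budget at every distinct rating value gives noisy points wherever only a few movies share a rating. One way to check the pattern is to group ratings into bins first. The sketch below uses made-up numbers (`toy` is an assumption, not the movie data) and base R's `cut()`:

```r
# Hypothetical ratings and budgets, for illustration only
toy <- data.frame(
  Rate   = c(3.1, 4.5, 5.2, 6.1, 6.8, 7.4, 7.9, 8.6),
  Budget = c(2e6, 5e6, 1e7, 4e7, 6e7, 9e7, 5e7, 2e7)
)

# Group ratings into two-point bins: [0,2), [2,4), [4,6), [6,8), [8,10]
toy$RateBin <- cut(toy$Rate, breaks = seq(0, 10, by = 2),
                   right = FALSE, include.lowest = TRUE)

# Mean budget and number of movies in each bin (empty bins give NA / 0)
bin_mean <- tapply(toy$Budget, toy$RateBin, mean)
bin_n    <- table(toy$RateBin)
print(bin_mean)
print(bin_n)
```

Reporting the count next to each mean shows how much weight a bin's average deserves.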

# Find the maximum rating and the year(s) in which it occurred
max_rating <- max(movies_AR$Rate)
years_highest_rating <- unique(movies_AR$Year[movies_AR$Rate == max_rating])
years_highest_rating
## [1] 1998 2002 2006 2010 2012

These are the years in which at least one movie received the maximum rating.

# Bar plot for years with maximum rating
barplot(table(movies_AR$Year[movies_AR$Rate == max_rating]), 
        xlab = "Year", ylab = "Frequency", 
        main = "Years with Maximum Rating", col = "darkgreen")

Perform K-means Clustering

# Select the numeric columns of movies_AR using sapply() with is.numeric().
# Then build movies_z by applying scale() to every column of movies_numeric,
# so that each variable has mean 0 and standard deviation 1.
library(fpc)
movies_numeric <- movies_AR[, sapply(movies_AR, is.numeric)]
movies_z <- as.data.frame(lapply(movies_numeric, scale))
# Calculate within-cluster sum of squares (WCSS) for different cluster counts
wcss <- numeric(10)
for (i in 1:10) {
  movies1 <- kmeans(movies_z, centers = i)
  wcss[i] <- movies1$tot.withinss
}
# Plot WCSS to determine optimal number of clusters (Elbow Method)
plot(1:10, wcss, type = "b", xlab = "Number of Clusters", ylab = "WCSS")

The elbow plot suggests that 4 is a suitable number of clusters: the curve flattens beyond this point, so additional clusters reduce the WCSS only slightly.
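
Because `kmeans()` starts from randomly chosen centers, the WCSS curve and the cluster labels can change between runs. A minimal sketch of a repeatable version, using simulated data (the `toy_z` matrix, the seed value, and `nstart = 25` are illustrative assumptions, not the settings used for the results in this report):

```r
set.seed(42)  # arbitrary seed, chosen only to make the run repeatable

# Simulated data with three loose groups (hypothetical values)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
toy_z <- scale(toy)

# Total within-cluster sum of squares for k = 1..6;
# nstart = 25 keeps the best of 25 random starts for each k
max_k <- 6
wcss <- sapply(1:max_k, function(k) {
  kmeans(toy_z, centers = k, nstart = 25)$tot.withinss
})

plot(1:max_k, wcss, type = "b",
     xlab = "Number of Clusters", ylab = "WCSS")
```

Fixing the seed also keeps the cluster numbering stable, which matters whenever the text refers to clusters by number.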

# Perform k-means clustering with the selected number of clusters
k <- 4
kmeans_result <- kmeans(movies_z, centers = k)
movies_AR$cluster <- kmeans_result$cluster
kmeans_result
## K-means clustering with 4 clusters of sizes 951, 2463, 496, 1794
## 
## Cluster means:
##        Budget       Year       Rate
## 1 -0.48114370  0.3067912 -1.5452994
## 2 -0.28506312  0.6631598  0.3407349
## 3  2.58391862  0.4047969  0.3995035
## 4 -0.06797409 -1.1850057  0.2409118
## 
## Clustering vector:
##    [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##   [38] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##   [75] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [112] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [186] 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 1 4 4
##  [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [260] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [297] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [334] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [371] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4
##  [408] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [445] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [482] 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
##  [519] 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [556] 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4
##  [593] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [630] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4
##  [667] 4 4 4 4 1 4 4 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [704] 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [741] 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4
##  [778] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 4 4 4 4 4
##  [815] 4 4 4 4 1 4 4 4 4 4 4 4 4 4 1 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4
##  [852] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4
##  [889] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [926] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 4 1 4 1 4 4 4 4 4 4
##  [963] 1 4 1 4 1 4 4 4 1 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1000] 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4
## [1037] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4
## [1074] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1111] 4 4 4 4 3 3 3 3 3 3 3 3 3 1 4 4 4 1 1 1 4 4 4 4 4 4 4 1 4 1 1 4 4 1 1 4 1
## [1148] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 1 4 4 4 4
## [1185] 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 1 4 4
## [1222] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4
## [1259] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3
## [1296] 3 3 3 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 1 4 4 4 1 1
## [1333] 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 1 4 4 1 1 4 4 4 4 4 4
## [1370] 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1407] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1481] 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 4 4 1 1 1 4 1 4 1
## [1518] 4 1 4 4 1 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 1 4 4 4 4 4 1 4
## [1555] 4 1 4 4 4 4 1 1 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4 4 1 4 4 4
## [1592] 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4
## [1629] 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1
## [1666] 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 1 4 1 4 4 4 1 1 1 1 4 4 4
## [1703] 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 1 1 4 4 4
## [1740] 4 4 4 4 4 4 4 4 4 1 1 1 1 4 4 1 1 1 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4
## [1777] 1 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4
## [1814] 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3
## [1851] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 1 4 4 4 1 1 1 4 4 4 1 1 1 1 2 2 1 1 2 1
## [1888] 2 4 1 4 4 4 1 2 4 1 2 4 1 2 1 2 2 2 4 2 4 1 4 2 1 4 4 2 1 1 4 2 2 2 4 4 1
## [1925] 4 4 1 1 4 4 4 4 4 4 4 4 1 2 2 1 2 1 4 4 4 1 1 4 4 4 2 4 1 4 4 4 1 1 2 4 4
## [1962] 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4 4 4
## [1999] 4 1 4 4 4 4 4 4 1 1 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4
## [2036] 4 4 4 4 4 4 4 4 4 4 4 4 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [2073] 3 2 1 2 2 2 2 1 1 1 1 2 1 2 2 2 2 2 2 1 2 1 2 1 2 2 1 1 1 2 2 2 2 2 2 1 1
## [2110] 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 2 1 2 1 1
## [2147] 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 2 2 1 1 1 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2
## [2184] 2 1 1 2 2 1 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 1 2 2
## [2221] 2 2 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 1 2 2 2 4 2 4
## [2258] 2 2 2 2 2 1 4 2 2 4 4 2 4 2 2 4 1 1 1 4 2 2 4 4 4 4 1 4 4 4 4 1 1 4 4 4 4
## [2295] 4 4 4 4 3 4 3 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 1 2 1 2
## [2332] 2 1 1 2 2 1 2 1 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 2
## [2369] 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2
## [2406] 2 1 1 2 1 1 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2
## [2443] 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2
## [2480] 2 2 1 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [2517] 2 2 1 2 1 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2
## [2554] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [2591] 3 3 3 3 3 3 3 3 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 1 1 2 2 2
## [2628] 1 1 2 1 1 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 1 1 2 1 2 1 2 1
## [2665] 2 1 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [2702] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 1 1 2 2 2 2 1 1 2 2 2
## [2739] 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2
## [2776] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [2813] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1
## [2850] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 1 2
## [2887] 2 2 1 2 1 1 2 2 2 1 1 1 2 2 1 1 2 2 2 2 2 1 2 1 1 2 1 1 2 2 2 2 1 2 1 1 2
## [2924] 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 1 1 1 2 1
## [2961] 2 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 2 2 2
## [2998] 2 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2
## [3035] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [3072] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2
## [3109] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1
## [3146] 2 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [3183] 1 2 2 2 2 1 1 1 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2 1 2
## [3220] 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 1 1 1 2 2 2 2 2 1
## [3257] 1 1 2 2 2 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 1 1 2 2
## [3294] 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 1 1 2 1 2 2 1 2 2 2
## [3331] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2
## [3368] 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [3405] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [3442] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3
## [3479] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 2 1 2
## [3516] 1 1 1 2 1 1 1 2 2 1 1 1 2 2 2 1 2 1 1 2 1 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2
## [3553] 2 2 2 1 2 1 1 2 2 2 2 2 2 1 1 2 1 2 1 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2
## [3590] 2 2 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2
## [3627] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 1 2 1 1 1 2
## [3664] 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [3701] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [3738] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3
## [3775] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [3812] 3 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 1 2 2 2 1 2 2
## [3849] 2 2 1 1 2 2 2 1 1 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 2 2 2 2 2 1 2 2 1 2 1 2 1
## [3886] 1 2 1 2 1 2 2 2 1 1 1 2 1 1 1 2 1 1 2 2 2 1 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2
## [3923] 2 2 1 2 2 1 2 1 2 1 2 1 1 1 1 2 2 2 1 1 2 1 2 1 1 1 2 2 2 1 2 2 2 1 2 1 1
## [3960] 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 1 2 2 1 2 2 2 2 2 2 2
## [3997] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1
## [4034] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2
## [4071] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4108] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [4145] 3 3 3 3 3 3 3 3 2 2 1 2 2 1 2 2 2 1 1 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 1
## [4182] 1 2 2 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1
## [4219] 1 1 2 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 1 1 1 2 1 2 2
## [4256] 1 1 1 2 1 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 1 1 2 2 2 1 1 2
## [4293] 1 1 2 2 2 2 2 2 1 2 2 1 2 1 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2
## [4330] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2
## [4367] 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4404] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [4441] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 2 2 1 2 2 1 1 2 2 1 1 2 1 1 2 2
## [4478] 2 1 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 2
## [4515] 1 2 1 2 2 1 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2
## [4552] 2 1 2 1 1 1 2 1 1 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2
## [4589] 1 1 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [4626] 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4663] 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4700] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4737] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [4774] 3 3 3 3 3 3 3 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 1 1
## [4811] 1 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1 2 2 2 2 2
## [4848] 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1
## [4885] 2 2 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1
## [4922] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4959] 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [4996] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5033] 2 2 2 2 2 2 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5070] 3 2 2 2 2 1 2 1 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 1 2 2 2
## [5107] 1 1 2 1 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 1 2 2 1 1 1 1 2
## [5144] 2 1 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2
## [5181] 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5218] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2
## [5255] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5292] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5329] 3 2 1 2 2 1 2 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2
## [5366] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2
## [5403] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5440] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 2 2 2 1
## [5477] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5514] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5551] 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5588] 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2
## [5625] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [5662] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5699] 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 1600.620 2046.433 1157.040 2051.356
##  (between_SS / total_SS =  59.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
  • The four clusters can be characterized by budget, release year, and rating as follows. The cluster means are z-scores, so 0 corresponds to the dataset average.
    • Cluster 1 comprises 951 movies with below-average budgets (-0.48), slightly recent release years (0.31), and low ratings (-1.55). An example of a movie with this profile is “Sharknado” (2013). The within-cluster sum of squares is 1600.620, or about 1.68 per movie.

    • Cluster 2 is the largest cluster, with 2463 movies that have below-average budgets (-0.29), recent release years (0.66), and slightly above-average ratings (0.34). An example of a movie with this profile is “Get Out” (2017). The within-cluster sum of squares is 2046.433, but because the cluster is so large this is only about 0.83 per movie, the lowest of the four, which makes it the most homogeneous cluster.

    • Cluster 3 is the smallest cluster, with 496 movies that have very high budgets (2.58), moderately recent release years (0.40), and slightly above-average ratings (0.40). An example of a movie with this profile is “Inception” (2010). The within-cluster sum of squares is 1157.040, the lowest total, but about 2.33 per movie, the highest of the four, so its movies are the most spread out.

    • Cluster 4 consists of 1794 movies with near-average budgets (-0.07), older release years (-1.19), and slightly above-average ratings (0.24). An example of a movie with this profile is “The Matrix” (1999). The within-cluster sum of squares is 2051.356, or about 1.14 per movie.
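
A raw within-cluster sum of squares grows with cluster size, so comparing the totals directly can mislead, and centers on the z-score scale are hard to read. A minimal sketch, on simulated data (`toy`, the seed, and two clusters are assumptions for illustration), of dividing each cluster's sum of squares by its size and converting the centers back to original units:

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Simulated budgets and ratings (hypothetical values)
toy <- data.frame(
  Budget = c(rnorm(30, 1e7, 2e6), rnorm(10, 1e8, 1e7)),
  Rate   = c(rnorm(30, 5, 0.5),   rnorm(10, 7, 0.5))
)

toy_z <- scale(toy)  # keeps the column means and SDs as attributes
km <- kmeans(toy_z, centers = 2, nstart = 10)

# Average squared distance to the cluster center, per cluster
wss_per_movie <- km$withinss / km$size
print(data.frame(size = km$size, withinss = km$withinss, wss_per_movie))

# Convert the standardized centers back to original units
col_means <- attr(toy_z, "scaled:center")
col_sds   <- attr(toy_z, "scaled:scale")
centers_raw <- sweep(sweep(km$centers, 2, col_sds, "*"), 2, col_means, "+")
print(centers_raw)
```

The back-transformed centers equal the per-cluster means of the raw columns, so they can be read directly as a typical budget and rating.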

Visualization of clusters

# Visualize clusters by rating and budget
plot(movies_AR$Rate, movies_AR$Budget, col = movies_AR$cluster, 
     xlab = "Rating", ylab = "Budget", 
     main = "Clusters of Movies by Rating and Budget")
# Sort the labels so that each legend entry matches its plotting color
cluster_ids <- sort(unique(movies_AR$cluster))
legend("topright", legend = cluster_ids, col = cluster_ids, pch = 1)

  • The results are easier to inspect in a 3D plot.
# Install and load the necessary library for 3D scatter plots
# install.packages("scatterplot3d")
library(scatterplot3d)
# Create the 3D scatter plot
scatterplot3d(movies_AR$Rate, movies_AR$Budget, movies_AR$cluster,
              xlab = "Rating", ylab = "Budget", zlab = "Cluster",
              main = "Clusters of Movies by Rating and Budget",
              color = as.numeric(movies_AR$cluster),
              pch = 16,
              type = "h",
              angle = 55
)

Colors distinguish the clusters. The ‘Rating’ axis shows each movie's rating on a scale of 0 to 10. The ‘Budget’ axis shows the production budget in scientific notation; for example, 4e+08 means a budget of 400 million currency units. The ‘Cluster’ axis shows the cluster to which each movie has been assigned.

# Aggregate data within each cluster to find mean ratings and budgets
aggregate(movies_AR[, c("Rate", "Budget")], by = list(movies_AR$cluster), FUN = mean)
##   Group.1     Rate    Budget
## 1       1 4.024816  11950748
## 2       2 6.280715  19668375
## 3       3 6.351008 132589961
## 4       4 6.161315  28212883
  • The results above show the mean rating and mean budget in each cluster, before the clusters are related to genres.

  • Relating genres to each cluster

# Extract relevant columns for cluster and genre
cluster_genre <- data.frame(Cluster = movies_AR$cluster, Genre = movies_AR$Genre)
# Find the most frequent and least frequent genre in each cluster
find_high_low_genre <- function(cluster_data) {
  freq_genre <- table(cluster_data$Genre)
  highest_genre <- names(freq_genre)[which.max(freq_genre)]
  lowest_genre <- names(freq_genre)[which.min(freq_genre)]
  return(data.frame(Cluster = unique(cluster_data$Cluster), Highest_Genre = highest_genre, Lowest_Genre = lowest_genre))
}

# Apply the function to get a summary of highest and lowest genres in each cluster
genre_summary <- by(cluster_genre, cluster_genre$Cluster, find_high_low_genre)
# Convert the by() result to a data frame
genre_summary_df <- do.call(rbind, lapply(genre_summary, data.frame))
# Count movies per cluster and genre, then draw a stacked bar plot
library(ggplot2)
genre_plot_data <- as.data.frame(
  table(Cluster = movies_AR$cluster, Genre = movies_AR$Genre),
  responseName = "Count"
)
ggplot(genre_plot_data, aes(x = Cluster, y = Count, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(x = "Cluster", y = "Count", fill = "Genre") +
  theme_minimal() +
  theme(legend.position = "right", 
        legend.title = element_blank(),
        legend.text = element_text(size = 8),
        legend.key.height = unit(0.5, "lines"))  # Adjust the height to your preference

This plot displays the number of movies with a specific genre in each cluster.
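
Raw counts in a stacked bar plot are dominated by the larger clusters. A minimal sketch, on made-up labels (`toy` is an assumption, not the movie data), of a cluster-by-genre table expressed as within-cluster shares with base R's `prop.table()`:

```r
# Hypothetical cluster labels and genres
toy <- data.frame(
  Cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Genre   = c("Horror", "Horror", "Horror", "Drama",
              "Drama", "Drama", "Drama", "Comedy", "Comedy", "Horror"),
  stringsAsFactors = FALSE
)

# Counts of each genre within each cluster
counts <- table(Cluster = toy$Cluster, Genre = toy$Genre)
print(counts)

# Share of each genre within its cluster (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Most frequent genre per cluster
top_genre <- colnames(counts)[apply(counts, 1, which.max)]
names(top_genre) <- rownames(counts)
print(top_genre)
```

Shares make clusters of different sizes comparable and show how strongly the leading genre actually dominates.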

cluster_genre <- data.frame(Cluster = movies_AR$cluster, Genre = movies_AR$Genre)

get_high_low_genre <- function(cluster_data) {
  freq_genre <- table(cluster_data$Genre)
  highest_genre <- names(freq_genre)[which.max(freq_genre)]
  lowest_genre <- names(freq_genre)[which.min(freq_genre)]
  return(c(Highest_Genre = highest_genre, Lowest_Genre = lowest_genre))
}

# Calculate highest and lowest genres for each cluster
genre_summary <- by(cluster_genre, cluster_genre$Cluster, get_high_low_genre)
# Display results
for (i in seq_along(genre_summary)) {
  cat("Cluster", names(genre_summary)[i], "\n")
  cat("Highest genre:", genre_summary[[i]]["Highest_Genre"], "\n")
  cat("Lowest genre:", genre_summary[[i]]["Lowest_Genre"], "\n\n")
}
## Cluster 1 
## Highest genre: Horror 
## Lowest genre: Foreign 
## 
## Cluster 2 
## Highest genre: Drama 
## Lowest genre: Foreign 
## 
## Cluster 3 
## Highest genre: Action 
## Lowest genre: Mystery 
## 
## Cluster 4 
## Highest genre: Comedy 
## Lowest genre: TV Movie
  • Cluster 1
    • Highest Genre: Horror
    • Lowest Genre: Foreign
    • Analysis:
    • Horror is the most frequent genre in the first cluster, which is also the cluster with the lowest mean budget and rating.
    • Foreign is the least frequent genre, which suggests that the cluster mostly contains horror titles aimed at a mainstream audience.
    • Well-known examples of the genre include “The Conjuring” (Horror), a successful horror film known for its suspense and supernatural elements, and “A Nightmare on Elm Street” (Horror), which features the iconic character Freddy Krueger.
  • Cluster 2
    • Highest Genre: Drama
    • Lowest Genre: Foreign
    • Analysis:
    • Drama is the most frequent genre in the second cluster and, as in Cluster 1, Foreign is the least frequent.
    • This suggests a cluster of mainstream dramas, potentially with wide audience appeal and relatable themes.
    • Well-known examples of the genre include “Forrest Gump” (Drama), a beloved drama with a mix of heartwarming and thought-provoking moments, and “The Shawshank Redemption” (Drama), a critically acclaimed drama known for its powerful storytelling.
  • Cluster 3
    • Highest Genre: Action
    • Lowest Genre: Mystery
    • Analysis:
    • Action is the most frequent genre in the third cluster, which is also the cluster with by far the highest mean budget. Mystery is the least frequent genre.
    • The cluster can span a wide range of Action sub-genres, from intense blockbusters to adventure movies.
    • Well-known examples of the genre include “Die Hard” (Action), known for its intensity and iconic scenes, and “Mad Max: Fury Road” (Action), set in a post-apocalyptic world.
  • Cluster 4
    • Highest Genre: Comedy
    • Lowest Genre: TV Movie
    • Analysis:
    • Comedy is the most frequent genre in the fourth cluster, which can include sub-genres such as romantic comedies and slapstick humor.
    • TV Movie is the least frequent genre, which suggests that the cluster consists mainly of theatrical releases rather than made-for-TV productions.
    • Well-known examples of the genre include “Dumb and Dumber” (Comedy), a classic slapstick comedy known for its humor and memorable characters, and “Anchorman: The Legend of Ron Burgundy” (Comedy), a satirical comedy set in the world of television journalism.

Overall Conclusion

The K-means analysis grouped the movies into four clusters by budget, release year, and rating, and a different genre dominates each cluster. These groupings have practical uses for both producers and audiences. Producers can use them to tailor content to target audience segments, allocate budgets more efficiently, and direct marketing efforts more strategically. Audiences can use them to find films similar to ones they already like, explore films beyond their typical preferences, and make more informed choices. More broadly, this kind of analysis can support a more targeted film industry with more diverse content.

Recommendations for Audience

# Return all movies that belong to a given cluster
get_movies_in_cluster <- function(movies_data, cluster_number) {
  # Filter movies in the specified cluster
  cluster_movies <- subset(movies_data, cluster == cluster_number)
  return(cluster_movies)
}


movies_in_cluster3 <- get_movies_in_cluster(movies_AR, cluster_number = 3)
head(movies_in_cluster3)
##       Budget Language                  Country Year                Name Rate
## 392 1.15e+08       en United States of America 1994           True Lies  6.8
## 517 1.75e+08       en United States of America 1995          Waterworld  5.9
## 805 1.05e+08       en United States of America 1997   Starship Troopers  6.7
## 806 1.10e+08       en United States of America 1997 Tomorrow Never Dies  6.0
## 807 1.16e+08       en United States of America 1997        Dante's Peak  5.8
## 808 1.25e+08       en United States of America 1997      Batman & Robin  4.2
##         Genre cluster
## 392    Action       3
## 517 Adventure       3
## 805 Adventure       3
## 806 Adventure       3
## 807    Action       3
## 808    Action       3

Users can call this function to list the movies that belong to a given cluster, such as Cluster 3 in the example above. This lets the audience discover movies with similar budgets, release years, and viewer ratings.
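
Building on the same idea, a lookup by title could return other movies from the same cluster, best rated first. The helper below is hypothetical (`recommend_similar` and the toy data are assumptions, not part of the original analysis) and assumes a data frame with `Name`, `Rate`, and `cluster` columns:

```r
# Hypothetical helper: movies sharing a cluster with `title`, best rated first
recommend_similar <- function(movies_data, title, n = 5) {
  idx <- which(movies_data$Name == title)
  if (length(idx) == 0) {
    stop("Title not found: ", title)
  }
  # If the title appears more than once, use its first occurrence
  target_cluster <- movies_data$cluster[idx[1]]
  same <- movies_data[movies_data$cluster == target_cluster &
                        movies_data$Name != title, ]
  same <- same[order(-same$Rate), ]
  head(same, n)
}

# Toy data standing in for movies_AR (made-up values)
toy <- data.frame(
  Name    = c("A", "B", "C", "D", "E"),
  Rate    = c(6.8, 5.9, 7.2, 4.1, 6.0),
  cluster = c(3, 3, 3, 1, 3),
  stringsAsFactors = FALSE
)
res <- recommend_similar(toy, "A", n = 2)
print(res)
```

Sorting by rating is only one possible ranking; distance to the chosen movie in the standardized feature space would be another.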

K-means was selected as the most efficient option after evaluating several other clustering algorithms.