In this project, we explore patterns and relationships in a dataset of movies that records budget, language, country, release year, user rating, and genre. Using clustering techniques, we aim to uncover natural groupings and similarities among movies. The results may be useful to both audiences and producers.
# Read the dataset from a CSV file
movies_AR <- read.csv("/Users/marcoayuob/Downloads/Data science /1st Year/Unsupervised Learning/Clustering Project/CL_draft.csv")
# Display summary information and structure of the dataset
summary(movies_AR)
## Budget Language Country Year
## Min. : 100000 Length:5737 Length:5737 Min. :1991
## 1st Qu.: 5000000 Class :character Class :character 1st Qu.:2001
## Median : 18000000 Mode :character Mode :character Median :2007
## Mean : 30751907 Mean :2006
## 3rd Qu.: 39000000 3rd Qu.:2012
## Max. :380000000 Max. :2017
## Name Rate Genre
## Length:5737 Min. : 0.000 Length:5737
## Class :character 1st Qu.: 5.300 Class :character
## Mode :character Median : 6.000 Mode :character
## Mean : 5.864
## 3rd Qu.: 6.600
## Max. :10.000
str(movies_AR)
## 'data.frame': 5737 obs. of 7 variables:
## $ Budget : int 30000000 200000 250000 300000 500000 800000 800000 1000000 1182273 1227401 ...
## $ Language: chr "en" "en" "en" "en" ...
## $ Country : chr "United States of America" "United States of America" "United States of America" "United States of America" ...
## $ Year : int 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
## $ Name : chr "Madonna: Truth or Dare" "Dingo" "Poison" "High Strung" ...
## $ Rate : num 6.3 6 6.4 5.4 4.8 6.8 6 4.8 4.2 6.4 ...
## $ Genre : chr "Documentary" "Drama" "Drama" "Comedy" ...
# Make sure the numeric columns are stored as numeric (Budget and Year are read as integers)
movies_AR$Budget <- as.numeric(as.character(movies_AR$Budget))
movies_AR$Year <- as.numeric(as.character(movies_AR$Year))
movies_AR$Rate <- as.numeric(as.character(movies_AR$Rate))
# Checking for missing values in the dataset and handling them
colSums(is.na(movies_AR))
## Budget Language Country Year Name Rate Genre
## 0 0 0 0 0 0 0
# No missing values were found; na.omit() is kept as a safeguard
movies_AR <- na.omit(movies_AR)
# Remove rows with empty Genre
movies_AR <- movies_AR[movies_AR$Genre != "", ]
# Calculate mean ratings by genre and create a bar plot to visualize them
mean_ratings <- aggregate(Rate ~ Genre, data = movies_AR, FUN = mean, na.rm = TRUE)
barplot(mean_ratings$Rate, names.arg = mean_ratings$Genre,
        col = "skyblue", xlab = "Genre", ylab = "Mean Rating",
        main = "Mean Ratings by Genre", las = 2)
This plot shows the mean (average) rating for each genre.
# Calculate mean budget by rate and create a scatter plot to visualize them
mean_budget <- aggregate(Budget ~ Rate, data = movies_AR, FUN = mean, na.rm = TRUE)
plot(mean_budget$Rate, mean_budget$Budget,
xlab = "Rate", ylab = "Mean Budget",
main = "Mean Budget by Rate", col = "blue", pch = 16)
From this plot we can observe that movies rated between 6 and 8 have the highest mean budgets (production cost).
# Find the maximum rating and the year(s) in which a movie received it
max_rating <- max(movies_AR$Rate)
years_highest_rating <- unique(movies_AR$Year[movies_AR$Rate == max_rating])
years_highest_rating
## [1] 1998 2002 2006 2010 2012
These are the years in which at least one movie received the maximum rating.
# Bar plot for years with maximum rating
barplot(table(movies_AR$Year[movies_AR$Rate == max_rating]),
xlab = "Year", ylab = "Frequency",
main = "Years with Maximum Rating", col = "darkgreen")
# Select only the numeric columns of movies_AR (sapply with is.numeric on each column),
# then standardise them with scale() and store the result in a new data frame, movies_z
library(fpc)
movies_numeric <- movies_AR[, sapply(movies_AR, is.numeric)]
movies_z <- as.data.frame(lapply(movies_numeric, scale))
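Standardising matters here because Budget is measured in millions while Rate ranges from 0 to 10, so without it the Euclidean distances used by k-means would be dominated by Budget. The following is a minimal sketch on a hypothetical toy data frame (the values are made up, not taken from the movies dataset) showing that scale() computes the usual z-score, (x - mean) / sd, for each column.

```r
# Hypothetical toy data (not from the movies dataset)
toy <- data.frame(
  Budget = c(1e6, 5e6, 2e7, 1.5e8),
  Rate   = c(4.2, 6.0, 6.8, 7.5)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual_z <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same computation column by column
scaled_z <- as.data.frame(scale(toy))

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(scaled_z), 10)
sapply(scaled_z, sd)
```

Once both columns are on the same scale, a difference of one standard deviation in Budget counts as much as a difference of one standard deviation in Rate.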
# Calculate within-cluster sum of squares (WCSS) for different cluster counts
wcss <- numeric(10)
for (i in 1:10) {
  movies1 <- kmeans(movies_z, centers = i)
  wcss[i] <- movies1$tot.withinss
}
# Plot WCSS to determine optimal number of clusters (Elbow Method)
plot(1:10, wcss, type = "b", xlab = "Number of Clusters", ylab = "WCSS")
The elbow plot suggests that 4 is a suitable number of clusters: the curve flattens beyond this point, so additional clusters reduce the WCSS only marginally.
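Because kmeans() starts from randomly chosen centres, the elbow curve can change slightly from run to run. The following is a minimal sketch, assuming synthetic data in place of movies_z, of how a fixed seed and several random starts could make the curve reproducible. The seed value and nstart = 25 are illustrative choices, not settings used in the analysis above.

```r
set.seed(42)  # arbitrary seed, chosen only for this illustration

# Synthetic stand-in for movies_z: three well-separated groups in 2 dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
toy_z <- scale(toy)

# nstart = 25 keeps the best of 25 random initialisations for each k
wcss <- sapply(1:10, function(k) {
  kmeans(toy_z, centers = k, nstart = 25)$tot.withinss
})

plot(1:10, wcss, type = "b", xlab = "Number of Clusters", ylab = "WCSS")
```

With k = 1 the WCSS equals the total sum of squares, which for standardised data is (n - 1) times the number of columns; the elbow is the point after which the drop becomes small.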
# Perform k-means clustering with the selected number of clusters
k <- 4
kmeans_result <- kmeans(movies_z, centers = k)
movies_AR$cluster <- kmeans_result$cluster
kmeans_result
## K-means clustering with 4 clusters of sizes 951, 2463, 496, 1794
##
## Cluster means:
## Budget Year Rate
## 1 -0.48114370 0.3067912 -1.5452994
## 2 -0.28506312 0.6631598 0.3407349
## 3 2.58391862 0.4047969 0.3995035
## 4 -0.06797409 -1.1850057 0.2409118
##
## Clustering vector:
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [38] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [75] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [186] 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 1 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [260] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [297] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [334] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [371] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4
## [408] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [445] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [482] 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4
## [519] 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [556] 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4
## [593] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [630] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4
## [667] 4 4 4 4 1 4 4 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [704] 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [741] 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4
## [778] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 4 4 4 4 4
## [815] 4 4 4 4 1 4 4 4 4 4 4 4 4 4 1 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4
## [852] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4
## [889] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [926] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 4 1 4 1 4 4 4 4 4 4
## [963] 1 4 1 4 1 4 4 4 1 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1000] 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4
## [1037] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4
## [1074] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1111] 4 4 4 4 3 3 3 3 3 3 3 3 3 1 4 4 4 1 1 1 4 4 4 4 4 4 4 1 4 1 1 4 4 1 1 4 1
## [1148] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 1 4 4 4 4
## [1185] 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 1 4 4
## [1222] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4
## [1259] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3
## [1296] 3 3 3 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 1 4 4 4 1 1
## [1333] 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 1 4 4 1 1 4 4 4 4 4 4
## [1370] 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1407] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1481] 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 4 4 1 1 1 4 1 4 1
## [1518] 4 1 4 4 1 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 1 4 4 4 4 4 1 4
## [1555] 4 1 4 4 4 4 1 1 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4 4 1 4 4 4
## [1592] 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4
## [1629] 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1
## [1666] 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 1 4 1 4 4 4 1 1 1 1 4 4 4
## [1703] 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 1 1 4 4 4
## [1740] 4 4 4 4 4 4 4 4 4 1 1 1 1 4 4 1 1 1 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4
## [1777] 1 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 1 4 4 4 4 4 4 4 4 4
## [1814] 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3
## [1851] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 1 4 4 4 1 1 1 4 4 4 1 1 1 1 2 2 1 1 2 1
## [1888] 2 4 1 4 4 4 1 2 4 1 2 4 1 2 1 2 2 2 4 2 4 1 4 2 1 4 4 2 1 1 4 2 2 2 4 4 1
## [1925] 4 4 1 1 4 4 4 4 4 4 4 4 1 2 2 1 2 1 4 4 4 1 1 4 4 4 2 4 1 4 4 4 1 1 2 4 4
## [1962] 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4 4 4 4 4 4
## [1999] 4 1 4 4 4 4 4 4 1 1 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4
## [2036] 4 4 4 4 4 4 4 4 4 4 4 4 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [2073] 3 2 1 2 2 2 2 1 1 1 1 2 1 2 2 2 2 2 2 1 2 1 2 1 2 2 1 1 1 2 2 2 2 2 2 1 1
## [2110] 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 2 1 2 1 1
## [2147] 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 2 2 1 1 1 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2
## [2184] 2 1 1 2 2 1 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 1 2 2
## [2221] 2 2 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 1 2 2 2 4 2 4
## [2258] 2 2 2 2 2 1 4 2 2 4 4 2 4 2 2 4 1 1 1 4 2 2 4 4 4 4 1 4 4 4 4 1 1 4 4 4 4
## [2295] 4 4 4 4 3 4 3 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 1 2 1 2
## [2332] 2 1 1 2 2 1 2 1 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 2
## [2369] 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2
## [2406] 2 1 1 2 1 1 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2
## [2443] 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2
## [2480] 2 2 1 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [2517] 2 2 1 2 1 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2
## [2554] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [2591] 3 3 3 3 3 3 3 3 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 1 1 2 2 2
## [2628] 1 1 2 1 1 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 1 1 2 1 2 1 2 1
## [2665] 2 1 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [2702] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 1 1 2 2 2 2 1 1 2 2 2
## [2739] 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2
## [2776] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [2813] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1
## [2850] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 1 2
## [2887] 2 2 1 2 1 1 2 2 2 1 1 1 2 2 1 1 2 2 2 2 2 1 2 1 1 2 1 1 2 2 2 2 1 2 1 1 2
## [2924] 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 1 1 1 2 1
## [2961] 2 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 2 2 2
## [2998] 2 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2
## [3035] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [3072] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2
## [3109] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1
## [3146] 2 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [3183] 1 2 2 2 2 1 1 1 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2 1 2
## [3220] 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 1 1 1 2 2 2 2 2 1
## [3257] 1 1 2 2 2 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 1 1 2 2
## [3294] 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 1 1 2 1 2 2 1 2 2 2
## [3331] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2
## [3368] 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [3405] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [3442] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3
## [3479] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 2 1 2
## [3516] 1 1 1 2 1 1 1 2 2 1 1 1 2 2 2 1 2 1 1 2 1 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2
## [3553] 2 2 2 1 2 1 1 2 2 2 2 2 2 1 1 2 1 2 1 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2
## [3590] 2 2 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2
## [3627] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 1 2 1 1 1 2
## [3664] 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [3701] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [3738] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3
## [3775] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [3812] 3 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 1 2 2 2 1 2 2
## [3849] 2 2 1 1 2 2 2 1 1 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 2 2 2 2 2 1 2 2 1 2 1 2 1
## [3886] 1 2 1 2 1 2 2 2 1 1 1 2 1 1 1 2 1 1 2 2 2 1 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2
## [3923] 2 2 1 2 2 1 2 1 2 1 2 1 1 1 1 2 2 2 1 1 2 1 2 1 1 1 2 2 2 1 2 2 2 1 2 1 1
## [3960] 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 1 2 2 1 2 2 2 2 2 2 2
## [3997] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1
## [4034] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2
## [4071] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4108] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [4145] 3 3 3 3 3 3 3 3 2 2 1 2 2 1 2 2 2 1 1 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 1
## [4182] 1 2 2 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1
## [4219] 1 1 2 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 1 1 1 2 1 2 2
## [4256] 1 1 1 2 1 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 1 1 2 2 2 1 1 2
## [4293] 1 1 2 2 2 2 2 2 1 2 2 1 2 1 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2
## [4330] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2
## [4367] 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4404] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [4441] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 2 2 1 2 2 1 1 2 2 1 1 2 1 1 2 2
## [4478] 2 1 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 2
## [4515] 1 2 1 2 2 1 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2
## [4552] 2 1 2 1 1 1 2 1 1 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2
## [4589] 1 1 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [4626] 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4663] 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4700] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4737] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [4774] 3 3 3 3 3 3 3 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 1 1
## [4811] 1 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1 2 2 2 2 2
## [4848] 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1
## [4885] 2 2 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1
## [4922] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [4959] 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [4996] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5033] 2 2 2 2 2 2 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5070] 3 2 2 2 2 1 2 1 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 1 2 2 2
## [5107] 1 1 2 1 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 1 2 2 1 1 1 1 2
## [5144] 2 1 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2
## [5181] 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5218] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2
## [5255] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5292] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5329] 3 2 1 2 2 1 2 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2
## [5366] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2
## [5403] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5440] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 2 2 2 1
## [5477] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [5514] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [5551] 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5588] 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2
## [5625] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## [5662] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [5699] 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 1600.620 2046.433 1157.040 2051.356
## (between_SS / total_SS = 59.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The cluster numbers below refer to the k-means output printed above. Because k-means starts from random centres, the labels and sizes can differ between runs unless a seed is fixed with set.seed().
Cluster 1 comprises 951 movies and is characterized by below-average budgets, release years slightly more recent than average, and clearly low rates (about 1.5 standard deviations below the mean). An example of a movie with this profile is “Sharknado” (2013). The within-cluster sum of squares for this cluster is 1600.620.
Cluster 2 is the largest cluster, consisting of 2463 movies that have below-average budgets, recent release years, and rates slightly above average. An example of a movie with this profile is “Get Out” (2017). The within-cluster sum of squares for this cluster is 2046.433. Relative to its size (about 0.83 per movie), this is the most homogeneous of the four clusters.
Cluster 3 is the smallest cluster, made up of 496 movies characterized by very high budgets (about 2.6 standard deviations above the mean), moderately recent release years, and rates slightly above average. An example of a movie with this profile is “Inception” (2010). The within-cluster sum of squares for this cluster is 1157.040, the lowest in absolute terms. However, the sum of squares grows with cluster size, and per movie (about 2.33) this cluster is the most spread out.
Cluster 4 consists of 1794 movies that have roughly average budgets, older release years (about 1.2 standard deviations below the mean), and rates slightly above average. An example of a movie with this profile is “The Matrix” (1999). The within-cluster sum of squares for this cluster is 2051.356, the highest in absolute terms (about 1.14 per movie).
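The within-cluster sums of squares are hard to compare directly because they grow with the number of movies in a cluster. A small worked computation, using the sizes and sums of squares from the kmeans_result printout, divides each sum of squares by its cluster size to give an average squared distance per movie.

```r
# Values copied from the kmeans_result printout above
size     <- c(951, 2463, 496, 1794)
withinss <- c(1600.620, 2046.433, 1157.040, 2051.356)

# Average squared distance to the cluster centre, per movie
avg_withinss <- withinss / size
round(avg_withinss, 2)
# [1] 1.68 0.83 2.33 1.14

# With a fitted model, the same quantity is kmeans_result$withinss / kmeans_result$size
```

On this measure the cluster of 2463 movies is the tightest and the cluster of 496 movies is the most spread out.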
# Visualize clusters by rating and budget
plot(movies_AR$Rate, movies_AR$Budget, col = movies_AR$cluster,
     xlab = "Rating", ylab = "Budget",
     main = "Clusters of Movies by Rating and Budget")
# Sort the cluster labels so that each legend entry matches its plotting colour
cluster_ids <- sort(unique(movies_AR$cluster))
legend("topright", legend = cluster_ids, col = cluster_ids, pch = 1)
# Install and load the necessary library for 3D scatter plots
# install.packages("scatterplot3d")
library(scatterplot3d)
# Create the 3D scatter plot
scatterplot3d(movies_AR$Rate, movies_AR$Budget, movies_AR$cluster,
xlab = "Rating", ylab = "Budget", zlab = "Cluster",
main = "Clusters of Movies by Rating and Budget",
color = as.numeric(movies_AR$cluster),
pch = 16,
type = "h",
angle = 55
)
The clusters are shown in different colors. The ‘Rating’ axis shows each movie's rating on a scale of 0 to 10. The ‘Budget’ axis shows the production budget in scientific notation; for example, 4e+08 means a budget of 400 million units of currency. The ‘Cluster’ axis shows the cluster to which each movie has been assigned.
# Aggregate data within each cluster to find mean ratings and budgets
aggregate(movies_AR[, c("Rate", "Budget")], by = list(movies_AR$cluster), FUN = mean)
## Group.1 Rate Budget
## 1 1 4.024816 11950748
## 2 2 6.280715 19668375
## 3 3 6.351008 132589961
## 4 4 6.161315 28212883
The results above show the mean rate and mean budget of each cluster, before we relate the clusters to genres.
Relating genres to each cluster
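The cluster means printed by kmeans() are in standardised units, whereas the aggregated means are in the original units. The two are linked by undoing the standardisation. The sketch below illustrates this on synthetic data; the data frame toy, the seed, and nstart = 10 are assumptions made for the example and are not part of the analysis.

```r
set.seed(1)  # arbitrary seed for this illustration

# Synthetic stand-in for the numeric movie columns (hypothetical values)
toy <- data.frame(
  Budget = c(rnorm(50, 1e7, 2e6), rnorm(50, 1e8, 1e7)),
  Rate   = c(rnorm(50, 5, 0.5),   rnorm(50, 7, 0.5))
)
toy_z <- scale(toy)
fit   <- kmeans(toy_z, centers = 2, nstart = 10)

# Undo the standardisation: multiply by the column sd, then add the column mean
centers_orig <- sweep(fit$centers, 2, attr(toy_z, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(toy_z, "scaled:center"), "+")

# The result matches the per-cluster means computed directly on the raw data
cluster_means <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
centers_orig
cluster_means
```

The same back-transformation could be applied to kmeans_result$centers if movies_z were kept as the matrix returned by scale(), since that matrix carries the "scaled:center" and "scaled:scale" attributes.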
# Extract relevant columns for cluster and genre
cluster_genre <- data.frame(Cluster = movies_AR$cluster, Genre = movies_AR$Genre)
# Return the most frequent and least frequent genre among the rows of one cluster
find_high_low_genre <- function(cluster_data) {
  freq_genre <- table(cluster_data$Genre)
  highest_genre <- names(freq_genre)[which.max(freq_genre)]
  lowest_genre <- names(freq_genre)[which.min(freq_genre)]
  return(data.frame(Cluster = unique(cluster_data$Cluster),
                    Highest_Genre = highest_genre,
                    Lowest_Genre = lowest_genre))
}
# Apply the function to get a summary of highest and lowest genres in each cluster
genre_summary <- by(cluster_genre, cluster_genre$Cluster, find_high_low_genre)
# Convert the object to a data frame
genre_summary_df <- do.call(rbind, lapply(genre_summary, data.frame))
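# Plotting the stacked bar plot
library(ggplot2)
# genre_plot_data is not defined earlier in the report; it is assumed here to
# hold the number of movies per cluster and genre
genre_plot_data <- as.data.frame(table(Cluster = movies_AR$cluster, Genre = movies_AR$Genre),
                                 responseName = "Count")
ggplot(genre_plot_data, aes(x = Cluster, y = Count, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(x = "Cluster", y = "Count", fill = "Genre") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.title = element_blank(),
        legend.text = element_text(size = 8),
        legend.key.height = unit(0.5, "lines")) # Adjust the height to your preference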
This plot displays the number of movies of each genre within each cluster.
cluster_genre <- data.frame(Cluster = movies_AR$cluster, Genre = movies_AR$Genre)
get_high_low_genre <- function(cluster_data) {
  freq_genre <- table(cluster_data$Genre)
  highest_genre <- names(freq_genre)[which.max(freq_genre)]
  lowest_genre <- names(freq_genre)[which.min(freq_genre)]
  return(c(Highest_Genre = highest_genre, Lowest_Genre = lowest_genre))
}
# Calculate highest and lowest genres for each cluster
genre_summary <- by(cluster_genre, cluster_genre$Cluster, get_high_low_genre)
# Display results
for (i in seq_along(genre_summary)) {
  cat("Cluster", names(genre_summary)[i], "\n")
  cat("Highest genre:", genre_summary[[i]]["Highest_Genre"], "\n")
  cat("Lowest genre:", genre_summary[[i]]["Lowest_Genre"], "\n\n")
}
## Cluster 1
## Highest genre: Horror
## Lowest genre: Foreign
##
## Cluster 2
## Highest genre: Drama
## Lowest genre: Foreign
##
## Cluster 3
## Highest genre: Action
## Lowest genre: Mystery
##
## Cluster 4
## Highest genre: Comedy
## Lowest genre: TV Movie
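The summary above is based on raw counts, so a genre that is common in the whole dataset tends to rank first in large clusters. A complementary view is the share of each genre within a cluster. The following is a minimal sketch on hypothetical toy assignments (the data frame toy is made up for illustration and does not reproduce the clustering result).

```r
# Hypothetical toy assignments (not the real clustering result)
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2, 2, 2),
  Genre   = c("Horror", "Horror", "Drama", "Drama", "Drama", "Comedy", "Drama", "Horror")
)

# Counts of each genre within each cluster
genre_counts <- table(Cluster = toy$cluster, Genre = toy$Genre)

# Row-wise proportions: each row sums to 1, so clusters of different size are comparable
genre_share <- prop.table(genre_counts, margin = 1)
round(genre_share, 2)
```

Applied to the movies, the same two calls on movies_AR$cluster and movies_AR$Genre would give the genre mix of each cluster as proportions.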
The K-means clustering analysis separates the movies into four groups that differ in budget, release year, and rate, and each group is dominated by a different genre (Horror, Drama, Action, and Comedy). These groupings have possible implications for both film producers and audiences. Producers could use them to tailor content to target audience segments, plan budgets, and direct marketing efforts more strategically. Audiences could use them to find films similar to ones they already like, explore films beyond their typical preferences, and make more informed choices. Because the clustering uses only budget, year, and rate, it describes the movies rather than measuring audience preferences directly, so these implications are suggestions rather than established effects.
# Function that returns all movies belonging to a given cluster
get_movies_in_cluster <- function(movies_data, cluster_number) {
  # Filter movies in the specified cluster
  cluster_movies <- subset(movies_data, cluster == cluster_number)
  return(cluster_movies)
}
movies_in_cluster3 <- get_movies_in_cluster(movies_AR, cluster_number = 3)
head(movies_in_cluster3)
## Budget Language Country Year Name Rate
## 392 1.15e+08 en United States of America 1994 True Lies 6.8
## 517 1.75e+08 en United States of America 1995 Waterworld 5.9
## 805 1.05e+08 en United States of America 1997 Starship Troopers 6.7
## 806 1.10e+08 en United States of America 1997 Tomorrow Never Dies 6.0
## 807 1.16e+08 en United States of America 1997 Dante's Peak 5.8
## 808 1.25e+08 en United States of America 1997 Batman & Robin 4.2
## Genre cluster
## 392 Action 3
## 517 Adventure 3
## 805 Adventure 3
## 806 Adventure 3
## 807 Action 3
## 808 Action 3
This function lets users list the movies that belong to a given cluster, such as Cluster 3 above. It helps the audience discover movies with comparable budgets, release years, and viewer ratings.
K-means was selected as the clustering algorithm after evaluating it against several other algorithms.