Movies play a crucial role in the entertainment side of our lives. Most of us have a go-to movie that we could watch anytime. Some people love romantic movies, whereas others like a little touch of comedy with it and hence prefer a "romedy". One thing we all have in common is that we use these predefined genres to categorise different types of movies. But how similar are these genres in the way they are perceived by us? In this paper, I aim to use the MovieLens movie rating data (the ml-latest-small dataset) to find genres of movies that might be similar to each other, using clustering methods. So, let’s get started!
First we start by loading the data. I have obtained two datasets: 1) movies, which contains information about each movie, namely its title and genres; 2) ratings, which contains the user ratings of the movies. Both datasets share a key (movieId), which we will later use to merge them. Let’s look at the two datasets first:
movies <- read.csv("C:/Users/PC-CATHERINE/Desktop/New folder/Unsupervised learning/clustering/ml-latest-small/ml-latest-small/movies.csv")
ratings <- read.csv("C:/Users/PC-CATHERINE/Desktop/New folder/Unsupervised learning/clustering/ml-latest-small/ml-latest-small/ratings.csv", stringsAsFactors=FALSE)
movies[1:5,]
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
ratings[1:5,]
## userId movieId rating timestamp
## 1 1 1 4 964982703
## 2 1 3 4 964981247
## 3 1 6 4 964982224
## 4 1 47 5 964983815
## 5 1 50 5 964982931
Now, let’s move on to prepping the data so that we can run clustering methods on it. First we create one column per genre, containing 1 or 0 to indicate whether a particular movie belongs to that genre or not. This is done because a movie can belong to several genres at once, and the indicator columns let us analyse the data more easily.
movies$Adventure <- ifelse(grepl("Adventure",movies$genres),1,0)
movies$Animation <- ifelse(grepl("Animation",movies$genres),1,0)
movies$Children <- ifelse(grepl("Children",movies$genres),1,0)
movies$Fantasy <- ifelse(grepl("Fantasy",movies$genres),1,0)
movies$Comedy <- ifelse(grepl("Comedy",movies$genres),1,0)
movies$Romance <- ifelse(grepl("Romance",movies$genres),1,0)
movies$Action <- ifelse(grepl("Action",movies$genres),1,0)
movies$Thriller <- ifelse(grepl("Thriller",movies$genres),1,0)
movies$Drama <- ifelse(grepl("Drama",movies$genres),1,0)
movies$Crime <- ifelse(grepl("Crime",movies$genres),1,0)
movies$SciFi <- ifelse(grepl("Sci-Fi",movies$genres),1,0)
movies$Horror <- ifelse(grepl("Horror",movies$genres),1,0)
movies$Mystery <- ifelse(grepl("Mystery",movies$genres),1,0)
movies$War <- ifelse(grepl("War",movies$genres),1,0)
movies$Musical <- ifelse(grepl("Musical",movies$genres),1,0)
movies$Documentary <- ifelse(grepl("Documentary",movies$genres),1,0)
movies$Western <- ifelse(grepl("Western",movies$genres),1,0)
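The seventeen assignments above all follow the same pattern, so they could also be written as a loop. Below is a minimal sketch on a made-up three-row data frame; `toy_movies` and `genre_patterns` are my own illustrative names and values, not part of the dataset. It uses `fixed = TRUE` so that "Sci-Fi" is matched literally rather than as a regular expression.

```r
# Hypothetical toy data with the same pipe-separated genre format
toy_movies <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  genres = c("Adventure|Children|Fantasy", "Comedy|Romance", "Sci-Fi|War"),
  stringsAsFactors = FALSE
)

# Column name -> text to search for in the genres string
genre_patterns <- c(Adventure = "Adventure", Comedy = "Comedy",
                    SciFi = "Sci-Fi", War = "War")

# One 0/1 indicator column per genre
for (col in names(genre_patterns)) {
  toy_movies[[col]] <- as.integer(
    grepl(genre_patterns[[col]], toy_movies$genres, fixed = TRUE)
  )
}

toy_movies
```

The named vector keeps the column name (SciFi) separate from the search text ("Sci-Fi"), which is the one place where the two differ in the code above.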
movies_agg<-merge(movies, ratings, by="movieId")
movies_agg<-aggregate(movies_agg[,4:22], list(movies_agg$movieId), mean)
movies_agg<-merge(movies[,1:2],movies_agg, by.x ="movieId", by.y="Group.1")
# Drop the averaged userId column; timestamp is not among the aggregated columns (4:22)
movies_agg<-within(movies_agg, rm(userId))
movies_agg[1:5,]
## movieId title Adventure Animation Children
## 1 1 Toy Story (1995) 1 1 1
## 2 2 Jumanji (1995) 1 0 1
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## Fantasy Comedy Romance Action Thriller Drama Crime SciFi Horror Mystery War
## 1 1 1 0 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0 0 0
## 3 0 1 1 0 0 0 0 0 0 0 0
## 4 0 1 1 0 0 1 0 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0 0 0
## Musical Documentary Western rating
## 1 0 0 0 3.920930
## 2 0 0 0 3.431818
## 3 0 0 0 3.259615
## 4 0 0 0 2.357143
## 5 0 0 0 3.071429
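To make the aggregation step easier to follow, here is a small sketch of the same idea on invented data: average the ratings per movieId, then attach the titles. The `toy_*` objects and their values are hypothetical, and the formula interface of `aggregate()` is just one possible way to write it.

```r
# Hypothetical toy tables sharing the movieId key
toy_movies <- data.frame(
  movieId = c(1, 2),
  title   = c("Movie A", "Movie B"),
  stringsAsFactors = FALSE
)
toy_ratings <- data.frame(
  userId  = c(1, 2, 1, 2, 3),
  movieId = c(1, 1, 2, 2, 2),
  rating  = c(4, 5, 3, 2, 4)
)

# Mean rating per movie (userId is simply left out of the formula)
mean_rating <- aggregate(rating ~ movieId, data = toy_ratings, FUN = mean)

# Attach the titles through the shared key
toy_agg <- merge(toy_movies, mean_rating, by = "movieId")
toy_agg
```

Because only `rating` appears on the left-hand side of the formula, there is no averaged userId column to remove afterwards.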
The output is shown above. Now that the data frame we need is ready, let us move on to installing the required packages for clustering.
install.packages(
  c("cluster", "factoextra", "flexclust", "fpc", "clustertend",
    "clValid", "ClusterR", "clusterSim", "ClustGeo"),
  repos = "http://cran.us.r-project.org"
)
library(cluster)
## Warning: package 'cluster' was built under R version 3.6.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.3
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Warning: package 'flexclust' was built under R version 3.6.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
## Warning: package 'fpc' was built under R version 3.6.3
library(clustertend)
library(ClusterR)
## Warning: package 'ClusterR' was built under R version 3.6.2
## Loading required package: gtools
library(clValid)
## Warning: package 'clValid' was built under R version 3.6.3
##
## Attaching package: 'clValid'
## The following object is masked from 'package:flexclust':
##
## clusters
## The following object is masked from 'package:modeltools':
##
## clusters
library(clusterSim)
## Warning: package 'clusterSim' was built under R version 3.6.2
## Loading required package: MASS
library(ClustGeo)
## Warning: package 'ClustGeo' was built under R version 3.6.3
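And now that we have all the libraries, let’s start! Since the variables are not all on the same scale, we first standardise the data with scale(), using the first 2000 movies. Note that the column range 4:19 selects the genre indicators from Animation to Western; Adventure (column 3) and rating (column 20) fall outside it, as the column names in the output below show.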
movies_scale<-scale(movies_agg[1:2000,4:19])
movies_scale[1:5,]
## Animation Children Fantasy Comedy Romance Action Thriller
## 1 5.1734311 2.9827068 3.644046 1.3200805 -0.484201 -0.4379488 -0.4912679
## 2 -0.1931987 2.9827068 3.644046 -0.7571508 -0.484201 -0.4379488 -0.4912679
## 3 -0.1931987 -0.3350983 -0.274283 1.3200805 2.064225 -0.4379488 -0.4912679
## 4 -0.1931987 -0.3350983 -0.274283 1.3200805 2.064225 -0.4379488 -0.4912679
## 5 -0.1931987 -0.3350983 -0.274283 1.3200805 -0.484201 -0.4379488 -0.4912679
## Drama Crime SciFi Horror Mystery War Musical
## 1 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 2 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 3 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 4 1.0648495 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 5 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## Documentary Western
## 1 -0.1391341 -0.1409889
## 2 -0.1391341 -0.1409889
## 3 -0.1391341 -0.1409889
## 4 -0.1391341 -0.1409889
## 5 -0.1391341 -0.1409889
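To see what scale() does to a 0/1 column, here is a minimal sketch on a made-up indicator vector (`flag` is hypothetical). Each value has the column mean subtracted and is divided by the column standard deviation, so a rare genre gets a large positive value for 1 and a small negative value for 0. This is why Animation shows 5.17 versus -0.19 above.

```r
# Hypothetical indicator: one movie out of ten belongs to the genre
flag <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Standardise by hand: subtract the mean, divide by the standard deviation
z_manual <- (flag - mean(flag)) / sd(flag)

# scale() does the same, column by column
z_scale <- as.numeric(scale(flag))

cbind(flag, z_manual, z_scale)
```

One consequence worth keeping in mind is that rare genres contribute large distances between the few movies that have them and all the others.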
Now that we have scaled the data, the next step is to check whether the dataset is complete, that is, whether it contains any missing values. We do it in the following manner:
sum(is.na(movies_scale))
## [1] 0
As we can see we don’t have missing values. This means that we can move forward.
In the next step we check whether the data has a clustering tendency at all, since many datasets are not suitable for clustering. There are two ways to do that. The first is to check visually, which is subjective and not very accurate. The second is more quantifiable: the get_clust_tendency function from factoextra assesses the clustering tendency of the data using the Hopkins statistic, and can also produce a visual aid, the ordered dissimilarity image. In that image, clearly contrasting blocks along the diagonal suggest well-defined and distinguishable clusters. For more details, please refer to: http://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/
The Hopkins statistic should be higher than 0.75 for the data to be considered clusterable. The null hypothesis is that the data is not clusterable (it is uniformly distributed), and a value above 0.75 lets us reject it at the 90% confidence level.
clusterability <- get_clust_tendency(movies_scale, n = nrow(movies_scale)-1, graph = FALSE)
clusterability$hopkins_stat
## [1] 0.7800369
d<-dist(movies_scale)
fviz_dist(d, show_labels = FALSE)+ labs(title = "Movies Data")
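For intuition about what the Hopkins statistic measures, here is a rough base R sketch. It is my own simplified version, not the code used inside get_clust_tendency, and it follows the convention in which values near 1 suggest clusters and values near 0.5 suggest random data. The function name `hopkins_sketch` and the toy data are assumptions made for illustration.

```r
# Hopkins statistic sketch: compare nearest-neighbour distances of real points
# with those of artificial points drawn uniformly over the data range
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Euclidean distance from one point to its nearest row of `data`
  nearest <- function(pt, data) min(sqrt(rowSums(sweep(data, 2, pt)^2)))
  # w: distances for m sampled real points (the point itself is excluded)
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))
  # u: distances for m artificial points, uniform within each column's range
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(runif(m * p, rep(lo, each = m), rep(hi, each = m)), nrow = m)
  u <- apply(U, 1, nearest, data = X)
  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
# Two tight, well-separated groups versus structureless noise
blobs <- rbind(matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
               matrix(rnorm(100, mean = 5, sd = 0.2), ncol = 2))
noise <- matrix(runif(200), ncol = 2)

h_blobs <- hopkins_sketch(blobs, m = 20)
h_noise <- hopkins_sketch(noise, m = 20)
c(blobs = h_blobs, noise = h_noise)
```

In clustered data the artificial points land far from any real point while real points sit close to their neighbours, which pushes the ratio towards 1. Other packages define the statistic slightly differently, so thresholds should not be carried over between implementations without checking.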
As we can see, the Hopkins statistic (0.78) is above the 0.75 threshold, indicating that our data is suitable for clustering. The next question is: which clustering method is the most appropriate for this particular dataset? To answer it, we use the clValid function from the clValid package, which validates clustering methods and so helps us pinpoint the most appropriate one in the given case. With validation = "internal" it reports three measures: connectivity, the Silhouette width and the Dunn index. The Silhouette width compares, for each point, the average distance to the points in its own cluster with the average distance to the points in the nearest other cluster. It ranges from -1 to 1, with values near 1 for points that fit their cluster well and values near -1 for points that are poorly fitted. The Dunn index is the ratio of the smallest distance between two points in different clusters to the largest distance between two points in the same cluster. As you could have guessed, both should be maximised, whereas connectivity should be minimised. This function takes quite long to run, as we have asked it to compare several clustering methods over 2 to 30 clusters.
clmethods <- c("hierarchical","kmeans","pam","clara")
internal <- clValid(movies_scale, nClust = 2:30, clMethods = clmethods, validation = "internal", maxitems = 100000)
summary(internal)
##
## Clustering Methods:
## hierarchical kmeans pam clara
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## hierarchical Connectivity 2.9290 2.9290 4.6702 4.6702 7.5992 8.0782 11.9361 13.6472 13.6472 21.4401 28.2341 31.6087 34.9901 37.4401 38.6579 41.5869 45.4607 50.9317 53.8317 55.8079 61.1230 63.9702 71.7032 75.5611 79.4782 80.6738 86.2107 88.7599 91.6889
## Dunn 0.5712 0.5712 0.4619 0.5041 0.4943 0.4981 0.4981 0.4891 0.4891 0.4617 0.4331 0.4326 0.4947 0.4947 0.4947 0.4947 0.4947 0.4437 0.4193 0.3916 0.3916 0.3916 0.3916 0.3916 0.3916 0.3950 0.3950 0.3950 0.3950
## Silhouette 0.4285 0.4169 0.4238 0.4367 0.4294 0.3482 0.3469 0.3478 0.3523 0.3563 0.3459 0.3416 0.3474 0.3455 0.3420 0.3406 0.3407 0.3304 0.3176 0.2967 0.2922 0.2924 0.2925 0.2924 0.2865 0.3295 0.3285 0.3300 0.3296
## kmeans Connectivity 2.9290 10.3361 10.3361 13.2651 13.7440 19.1984 23.0563 20.8937 20.8937 24.4357 28.9702 32.3448 34.9901 38.6579 44.7167 50.0036 53.8774 62.4889 64.8099 85.7190 76.0567 78.9040 83.3302 87.1881 83.5794 86.4254 89.5508 94.5115 97.4405
## Dunn 0.5712 0.2856 0.3040 0.3102 0.3102 0.3984 0.3984 0.4082 0.4244 0.2736 0.3096 0.3096 0.4947 0.4947 0.4384 0.4305 0.4305 0.4472 0.4615 0.3513 0.2787 0.2787 0.2787 0.2787 0.3513 0.3513 0.3520 0.3520 0.3520
## Silhouette 0.4285 0.4206 0.4366 0.4344 0.3505 0.3311 0.3299 0.3363 0.3406 0.3512 0.3527 0.3500 0.3474 0.3468 0.3534 0.3513 0.3512 0.3473 0.3047 0.3325 0.3716 0.3716 0.3728 0.3736 0.3805 0.3840 0.3847 0.3845 0.3841
## pam Connectivity 199.4647 178.0766 246.5175 318.8103 314.5234 295.8302 260.7583 270.4635 310.6246 268.7397 260.5516 236.2976 214.7722 172.9544 146.0349 157.3496 165.8933 174.5206 184.5865 216.3675 204.5333 203.3968 201.9000 209.1504 212.9444 187.3353 190.6730 183.8706 178.2218
## Dunn 0.1562 0.1562 0.1521 0.1521 0.1554 0.1560 0.1647 0.1647 0.1647 0.1668 0.1819 0.1819 0.1819 0.1928 0.1928 0.1928 0.2077 0.2077 0.2077 0.2077 0.2262 0.2262 0.2262 0.2262 0.2262 0.2262 0.2262 0.2262 0.2262
## Silhouette 0.1100 0.1484 0.1656 0.1957 0.2199 0.2414 0.2655 0.2782 0.2582 0.2996 0.3235 0.3511 0.3904 0.4244 0.4550 0.4710 0.4906 0.4962 0.5146 0.5217 0.5317 0.5435 0.5513 0.5684 0.5768 0.5998 0.6060 0.6121 0.6203
## clara Connectivity 199.4647 207.9440 285.6972 281.4103 252.6913 278.6921 242.5175 262.4115 287.2337 299.9663 212.5028 213.2869 247.2611 259.9802 225.3448 204.9909 262.8540 247.2056 243.2437 225.0429 211.7222 207.5107 226.2452 185.8171 193.4583 178.6508 225.2726 212.2643 176.4690
## Dunn 0.1562 0.1562 0.1521 0.1539 0.1601 0.1609 0.1732 0.1732 0.1609 0.1609 0.1836 0.1954 0.1836 0.1836 0.2170 0.2170 0.1954 0.1996 0.1954 0.1996 0.1954 0.1954 0.1954 0.1996 0.1836 0.1836 0.2430 0.2430 0.2430
## Silhouette 0.1100 0.1380 0.1770 0.2013 0.2205 0.2408 0.2548 0.2740 0.2268 0.2549 0.3112 0.3310 0.3225 0.3607 0.3957 0.4178 0.3965 0.3970 0.4036 0.4796 0.5021 0.4909 0.4895 0.5354 0.5162 0.5231 0.5452 0.5421 0.5431
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 2.9290 hierarchical 2
## Dunn 0.5712 hierarchical 2
## Silhouette 0.6203 pam 30
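To make the Silhouette width and the Dunn index more concrete, here is a small sketch on invented two-dimensional data. It uses silhouette() from the cluster package and computes the Dunn index by hand from the distance matrix. The toy data and object names are hypothetical, and this is not how clValid computes its table internally.

```r
library(cluster)  # silhouette()

set.seed(42)
# Hypothetical toy data: two well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2))
d_toy  <- dist(toy)
labels <- cutree(hclust(d_toy, method = "average"), k = 2)

# Average Silhouette width over all points
sil     <- silhouette(labels, d_toy)
avg_sil <- mean(sil[, "sil_width"])

# Dunn index: smallest between-cluster distance / largest within-cluster distance
dm   <- as.matrix(d_toy)
same <- outer(labels, labels, "==")
dunn_toy <- min(dm[!same]) / max(dm[same])

c(silhouette = avg_sil, dunn = dunn_toy)
```

On data this cleanly separated both measures come out high; on the movie data the much lower values in the table reflect clusters that overlap.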
As you can see from the results, connectivity and the Dunn index both pick hierarchical clustering with 2 clusters (the Dunn index is equally high for 3), and among the hierarchical solutions the Silhouette width peaks at 5 clusters. The best Silhouette width overall belongs to pam with 30 clusters, but two of the three measures favour the hierarchical method, so we continue with it. But which number of clusters is better? And should we even be dividing the data into clusters? The latter question is answered by the Duda-Hart test, whose null hypothesis is that the data is homogeneous, that is, not suitable for dividing into clusters. The former question is answered by the Calinski-Harabasz index, which measures which number of clusters is the most appropriate (higher is better). Before we move on to these tests, let’s look at the Silhouette and Elbow graphs. In the first graph, the optimal number of clusters is where the curve reaches its peak; in the second, it is at the "elbow", the point after which the within-cluster sum of squares stops dropping sharply.
# fviz_nbclust() returns ggplot2 objects, so base-graphics par(mfrow) has no effect here
fviz_nbclust(movies_scale, hcut, method = "silhouette", k.max = 30) +
  labs(subtitle = "Silhouette method")
fviz_nbclust(movies_scale, hcut, method = "wss", k.max = 30) +
  labs(subtitle = "Elbow method")
Now let’s see whether we should be dividing the data into clusters in the first place, using the Duda-Hart test:
hclust_2<-cutree(hclust(d),2)
hclust_4<-cutree(hclust(d),4)
hclust_5<-cutree(hclust(d),5)
dudahart2(movies_scale,hclust_2)
## $p.value
## [1] 3.621325e-12
##
## $dh
## [1] 0.9074254
##
## $compare
## [1] 0.9364077
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
As we can see from the p-value (about 3.6e-12), we reject the null hypothesis of homogeneity (cluster1 is FALSE), and hence we can divide the data into clusters. Next, let’s use the Calinski-Harabasz index to find the optimal number of clusters for each type of hierarchical method, agglomerative (hclust) and divisive (diana):
hc_diana<-diana(movies_scale, diss=FALSE)
d<-dist(movies_scale)
diana_1<- cutree(hc_diana, k= 2)
diana_2<- cutree(hc_diana, k= 4)
diana_3<- cutree(hc_diana, k= 5)
c.hclust.2<-cluster.stats(d,hclust_2)
c.hclust.4<-cluster.stats(d,hclust_4)
c.hclust.5<-cluster.stats(d,hclust_5)
c.diana_2<-cluster.stats(d,diana_1)
c.diana_4<-cluster.stats(d,diana_2)
c.diana_5<-cluster.stats(d,diana_3)
c.hclust.2$ch
## [1] 203.8339
c.hclust.4$ch
## [1] 184.8868
c.hclust.5$ch
## [1] 196.6868
c.diana_2$ch
## [1] 209.0694
c.diana_4$ch
## [1] 189.7788
c.diana_5$ch
## [1] 143.8248
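The Calinski-Harabasz index reported by cluster.stats can also be computed by hand, which shows what it rewards: a large between-cluster sum of squares relative to the within-cluster sum of squares, each divided by its degrees of freedom. The sketch below is a base R illustration on invented data; `ch_index` is a hypothetical helper name, not a function from fpc.

```r
# Calinski-Harabasz index: (BSS / (k - 1)) / (WSS / (n - k))
ch_index <- function(X, labels) {
  X <- as.matrix(X)
  n <- nrow(X)
  k <- length(unique(labels))
  # Total sum of squares around the overall centroid
  tss <- sum(sweep(X, 2, colMeans(X))^2)
  # Within-cluster sum of squares around each cluster centroid
  wss <- sum(sapply(split(seq_len(n), labels), function(idx) {
    block <- X[idx, , drop = FALSE]
    sum(sweep(block, 2, colMeans(block))^2)
  }))
  bss <- tss - wss
  (bss / (k - 1)) / (wss / (n - k))
}

set.seed(7)
# Hypothetical toy data: two groups of 30 points
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 3, sd = 0.5), ncol = 2))
labels_2 <- cutree(hclust(dist(toy)), k = 2)
labels_5 <- cutree(hclust(dist(toy)), k = 5)

c(k2 = ch_index(toy, labels_2), k5 = ch_index(toy, labels_5))
```

Comparing the index across several values of k, as done above for 2, 4 and 5 clusters, is the usual way to use it: the k with the highest value is preferred.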
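The highest score is for 2 clusters in each type of hierarchical clustering. Now let’s plot the dendrogram for each method. Note that hcut below is called with Ward linkage (ward.D2), whereas the hclust(d) calls above use the default complete linkage, so the agglomerative dendrogram is not the same tree that produced hclust_2.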
hc_hclust<-hcut(movies_scale, k = 2, isdiss = FALSE, hc_func = c("hclust"), hc_method = "ward.D2", hc_metric = "euclidean",graph = FALSE)
fviz_dend(
hc_hclust,
k = 2,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
cex = 0.1,
main = "Dendrogram for Agglomerative Method"
)
## Warning in data.frame(xmin = unlist(xleft), ymin = unlist(ybottom), xmax =
## unlist(xright), : row names were found from a short variable and have been
## discarded
hc_diana<-diana(movies_scale, diss=FALSE)
fviz_dend(
hc_diana,
k = 2,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
cex = 0.1,
main = "Dendrogram for Divisive Method"
)
Next, let’s use the inertia ratio to check the quality of the divisions obtained.
inertion_hclust<-matrix(0, nrow=4, ncol=2)
colnames(inertion_hclust)<-c("division with agglomerative method", "division with divisive method")
rownames(inertion_hclust)<-c("intra-clust", "total", "percentage", "Q")
inertion_hclust[1,1]<-withindiss(dist(movies_scale), part=hclust_2) # intra-cluster
inertion_hclust[2,1]<-inertdiss(dist(movies_scale)) # overall
inertion_hclust[3,1]<-inertion_hclust[1,1]/ inertion_hclust[2,1] # ratio
inertion_hclust[4,1]<-1-inertion_hclust[3,1]
inertion_hclust[1,2]<-withindiss(dist(movies_scale), part=diana_1) # intra-cluster
inertion_hclust[2,2]<-inertdiss(dist(movies_scale)) # overall
inertion_hclust[3,2]<-inertion_hclust[1,2]/ inertion_hclust[2,2] # ratio
inertion_hclust[4,2]<-1-inertion_hclust[3,2] # Q, inter-cluster
inertion_hclust
## division with agglomerative method division with divisive method
## intra-clust 14.51154663 14.47712338
## total 15.99200000 15.99200000
## percentage 0.90742538 0.90527285
## Q 0.09257462 0.09472715
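The numbers in this table can be reproduced from first principles, which helps in reading them. The sketch below assumes uniform weights of 1/n for every observation (I believe this matches the defaults of withindiss and inertdiss, but treat it as an assumption) and uses a hypothetical helper `inertia_ratio` on invented data.

```r
# Within-cluster inertia, total inertia and Q = 1 - within / total,
# with every observation weighted 1/n
inertia_ratio <- function(X, labels) {
  X <- as.matrix(X)
  n <- nrow(X)
  total <- sum(sweep(X, 2, colMeans(X))^2) / n
  within <- sum(sapply(split(seq_len(n), labels), function(idx) {
    block <- X[idx, , drop = FALSE]
    sum(sweep(block, 2, colMeans(block))^2)
  })) / n
  c(within = within, total = total, Q = 1 - within / total)
}

set.seed(11)
# Hypothetical toy data: 50 rows, 4 standardised columns
toy <- scale(matrix(rnorm(200), ncol = 4))
labels <- cutree(hclust(dist(toy)), k = 2)

res <- inertia_ratio(toy, labels)
res
```

For standardised data the total inertia is the number of columns times (n - 1)/n, so with 16 columns and 2000 rows it is 16 × 1999/2000 = 15.992, the value in the table above. For fixed n and k, Q also determines the Calinski-Harabasz index through CH = Q/(1 - Q) × (n - k)/(k - 1); for example 0.09257462/0.90742538 × 1998 ≈ 203.83, the value reported earlier for c.hclust.2$ch.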
As we can see, the values are quite similar for both types of hierarchical clustering, with the divisive method marginally better. That being said, the low Q value indicates that the between-cluster share of the total inertia is very small (under 10%) in both cases, and the high intra-cluster value indicates that the divisions obtained are not of very good quality.

## Results

Let’s see how different genres of movies are distributed across the clusters:
hclust_2<-c(hclust_2)
movies_agg<-cbind(movies_agg[1:2000,],hclust_2)
par(mfrow=c(2,3))
hist(movies_agg[movies_agg$Adventure==1,]$hclust_2, nclass=2, main="Adventure Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Animation==1,]$hclust_2, nclass=2, main="Animation Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Children==1,]$hclust_2, nclass=2, main="Children Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Fantasy==1,]$hclust_2, nclass=2, main="Fantasy Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Comedy==1,]$hclust_2, nclass=2, main="Comedy Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Romance==1,]$hclust_2, nclass=2, main="Romance Cluster Distribution", xlab= "Clusters")
par(mfrow=c(2,3))
hist(movies_agg[movies_agg$Action==1,]$hclust_2, nclass=2, main="Action Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Thriller==1,]$hclust_2, nclass=2, main="Thriller Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Drama==1,]$hclust_2, nclass=2, main="Drama Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Crime==1,]$hclust_2, nclass=2, main="Crime Cluster Distribution", xlab= "Clusters")
#hist(movies_agg[movies_agg$SciFi==1,]$hclust_2, nclass=2, main="Sci-Fi Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Horror==1,]$hclust_2, nclass=2, main="Horror Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Mystery==1,]$hclust_2, nclass=2, main="Mystery Cluster Distribution", xlab= "Clusters")
par(mfrow=c(2,2))
hist(movies_agg[movies_agg$War==1,]$hclust_2, nclass=2, main="War Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Musical==1,]$hclust_2, nclass=2, main="Musical Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Documentary==1,]$hclust_2, nclass=2, main="Documentary Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Western==1,]$hclust_2, nclass=2, main="Western Cluster Distribution", xlab= "Clusters")
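Histograms of a two-valued cluster label are hard to compare across sixteen panels; counts per genre and cluster carry the same information in a single table. Here is a minimal sketch of that idea on invented data. The `toy` data frame, its values and the `cluster_` row labels are hypothetical.

```r
# Hypothetical toy data: two genre indicators and a cluster label
toy <- data.frame(
  Animation = c(1, 1, 0, 0, 0, 1),
  Thriller  = c(0, 0, 1, 1, 1, 0),
  cluster   = c(1, 1, 2, 2, 2, 1)
)
genres   <- c("Animation", "Thriller")
clusters <- sort(unique(toy$cluster))

# For each genre, count how many of its movies fall in each cluster
counts <- sapply(genres, function(g) {
  members <- toy$cluster[toy[[g]] == 1]
  tabulate(match(members, clusters), nbins = length(clusters))
})
rownames(counts) <- paste0("cluster_", clusters)
counts
```

A genre whose column has a zero in one row sits entirely in the other cluster, which is the pattern the histograms are meant to reveal.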
As we can see, different movies of the same genre are assigned to different clusters. This can be explained by the fact that the same movie can belong to several genres. Still, from the results we can see that all movies classified as Documentary, Mystery or Thriller belong to the same cluster, while Animation movies belong to the other one. I read this as the former genres being perceived similarly to each other and differently from Animation movies, which is plausible because Animation movies are judged on different criteria (they have no on-screen actors) than Mystery, Thriller and Documentary movies, which are not only script and visual driven but also actor driven. One caveat: the clustered matrix movies_scale was built from columns 4:19 of movies_agg, which are genre indicators only, so the clusters reflect which genres occur together rather than the rating values themselves.