Clustering on IMDb Movies Data

Introduction

Movies play a crucial role in the entertainment side of our lives. Most of us have a go-to movie that we could watch anytime. Some people love romantic movies, whereas others like a touch of comedy with their romance and so prefer romantic comedies ("romedies"). One thing we all have in common is that we use these predefined genres to categorise different types of movies. But how similar are these movies in the way we perceive them? In this paper, I aim to use movie rating data obtained from the IMDb website to find genres of movies that are similar to one another, using clustering methods. So, let's get started!

Preparing the Data

First we load the data. I have obtained two datasets: 1) movies, which contains information about each movie: its title and its genres. 2) ratings, which contains the IMDb ratings of the movies. Both datasets share a key (movieId), which we will later use to merge them; it is unique within movies, while ratings holds one row per user and movie. Let's look at the two datasets first:

movies <- read.csv("C:/Users/PC-CATHERINE/Desktop/New folder/Unsupervised learning/clustering/ml-latest-small/ml-latest-small/movies.csv")
ratings <- read.csv("C:/Users/PC-CATHERINE/Desktop/New folder/Unsupervised learning/clustering/ml-latest-small/ml-latest-small/ratings.csv", stringsAsFactors=FALSE)
movies[1:5,]

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy

ratings[1:5,]

##   userId movieId rating timestamp
## 1      1       1      4 964982703
## 2      1       3      4 964981247
## 3      1       6      4 964982224
## 4      1      47      5 964983815
## 5      1      50      5 964982931

Now, let's prepare the data so that we can run clustering methods on it. We start by creating one column per genre, containing 1 or 0 to indicate whether the movie belongs to that genre. This is needed because a movie can belong to several genres at once, and a single text column cannot be used directly by a clustering algorithm. We then merge in the ratings and average them per movie, so that each movie ends up as a single row.

movies$Adventure <- ifelse(grepl("Adventure",movies$genres),1,0)
movies$Animation <- ifelse(grepl("Animation",movies$genres),1,0)
movies$Children <- ifelse(grepl("Children",movies$genres),1,0)
movies$Fantasy <- ifelse(grepl("Fantasy",movies$genres),1,0)
movies$Comedy <- ifelse(grepl("Comedy",movies$genres),1,0)
movies$Romance <- ifelse(grepl("Romance",movies$genres),1,0)
movies$Action <- ifelse(grepl("Action",movies$genres),1,0)
movies$Thriller <- ifelse(grepl("Thriller",movies$genres),1,0)
movies$Drama <- ifelse(grepl("Drama",movies$genres),1,0)
movies$Crime <- ifelse(grepl("Crime",movies$genres),1,0)
movies$SciFi <- ifelse(grepl("Sci-Fi",movies$genres),1,0)
movies$Horror <- ifelse(grepl("Horror",movies$genres),1,0)
movies$Mystery <- ifelse(grepl("Mystery",movies$genres),1,0)
movies$War <- ifelse(grepl("War",movies$genres),1,0)
movies$Musical <- ifelse(grepl("Musical",movies$genres),1,0)
movies$Documentary <- ifelse(grepl("Documentary",movies$genres),1,0)
movies$Western <- ifelse(grepl("Western",movies$genres),1,0)
movies_agg<-merge(movies, ratings, by="movieId")
movies_agg<-aggregate(movies_agg[,4:22], list(movies_agg$movieId), mean)
movies_agg<-merge(movies[,1:2],movies_agg, by.x  ="movieId", by.y="Group.1")
# timestamp was not among the aggregated columns 4:22, so only userId needs removing
movies_agg<-within(movies_agg, rm(userId))

movies_agg[1:5,]

##   movieId                              title Adventure Animation Children
## 1       1                   Toy Story (1995)         1         1        1
## 2       2                     Jumanji (1995)         1         0        1
## 3       3            Grumpier Old Men (1995)         0         0        0
## 4       4           Waiting to Exhale (1995)         0         0        0
## 5       5 Father of the Bride Part II (1995)         0         0        0
##   Fantasy Comedy Romance Action Thriller Drama Crime SciFi Horror Mystery War
## 1       1      1       0      0        0     0     0     0      0       0   0
## 2       1      0       0      0        0     0     0     0      0       0   0
## 3       0      1       1      0        0     0     0     0      0       0   0
## 4       0      1       1      0        0     1     0     0      0       0   0
## 5       0      1       0      0        0     0     0     0      0       0   0
##   Musical Documentary Western   rating
## 1       0           0       0 3.920930
## 2       0           0       0 3.431818
## 3       0           0       0 3.259615
## 4       0           0       0 2.357143
## 5       0           0       0 3.071429
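The preparation steps above can also be written more compactly. The following is a minimal sketch on made-up data, not the code used for the results in this paper: the toy data frames, their rows and the `genre_labels` lookup are all assumptions introduced for illustration. It builds the indicator columns in a loop (stored as integers rather than the numeric 1/0 produced by `ifelse`) and then averages the ratings per movie.

```r
# Toy stand-ins for the movies and ratings data frames (hypothetical rows)
toy_movies <- data.frame(
  movieId = 1:3,
  title   = c("Movie A", "Movie B", "Movie C"),
  genres  = c("Adventure|Comedy", "Comedy|Romance", "Sci-Fi|War"),
  stringsAsFactors = FALSE
)
toy_ratings <- data.frame(
  userId  = c(1, 2, 1, 2, 3),
  movieId = c(1, 1, 2, 3, 3),
  rating  = c(4, 5, 3, 2, 4)
)

# Column name -> label as it appears inside the genres string
genre_labels <- c(Adventure = "Adventure", Comedy = "Comedy",
                  Romance = "Romance", SciFi = "Sci-Fi", War = "War")

for (col in names(genre_labels)) {
  # fixed = TRUE matches the label literally instead of as a regular expression
  toy_movies[[col]] <- as.integer(
    grepl(genre_labels[[col]], toy_movies$genres, fixed = TRUE)
  )
}

# Average rating per movie, then attach it to the indicator columns
avg_rating <- aggregate(rating ~ movieId, data = toy_ratings, FUN = mean)
toy_agg <- merge(toy_movies[, c("movieId", "title", names(genre_labels))],
                 avg_rating, by = "movieId")
toy_agg
```

Keeping the genre names in one vector means a new genre only has to be added in a single place, and averaging only the rating column avoids carrying userId or timestamp through the aggregation.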

With the aggregated data frame ready, let us move on to installing the packages required for clustering.

install.packages("cluster", repos = "http://cran.us.r-project.org")
install.packages("factoextra", repos = "http://cran.us.r-project.org")
install.packages("flexclust", repos = "http://cran.us.r-project.org")
install.packages("fpc", repos = "http://cran.us.r-project.org")
install.packages("clustertend", repos = "http://cran.us.r-project.org")
install.packages("clValid", repos = "http://cran.us.r-project.org")
install.packages("ClusterR", repos = "http://cran.us.r-project.org")
install.packages("clusterSim", repos = "http://cran.us.r-project.org")
install.packages("ClustGeo", repos = "http://cran.us.r-project.org")

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(clValid)
library(clusterSim)
library(ClustGeo)

Analysing the Data

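Now that all the libraries are loaded, let's start! The columns are not all on the same scale (a rare genre has a much smaller spread than a common one), so we first standardise the data. Note that the selection below takes the first 2000 movies and columns 4 to 19, that is, the genre indicators from Animation to Western; Adventure (column 3) and the average rating (column 20) are not part of the scaled matrix.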

movies_scale<-scale(movies_agg[1:2000,4:19])
movies_scale[1:5,]

##    Animation   Children   Fantasy     Comedy   Romance     Action   Thriller
## 1  5.1734311  2.9827068  3.644046  1.3200805 -0.484201 -0.4379488 -0.4912679
## 2 -0.1931987  2.9827068  3.644046 -0.7571508 -0.484201 -0.4379488 -0.4912679
## 3 -0.1931987 -0.3350983 -0.274283  1.3200805  2.064225 -0.4379488 -0.4912679
## 4 -0.1931987 -0.3350983 -0.274283  1.3200805  2.064225 -0.4379488 -0.4912679
## 5 -0.1931987 -0.3350983 -0.274283  1.3200805 -0.484201 -0.4379488 -0.4912679
##        Drama      Crime      SciFi     Horror    Mystery        War   Musical
## 1 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 2 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 3 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 4  1.0648495 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
## 5 -0.9386303 -0.3656774 -0.3124836 -0.3267315 -0.2548141 -0.1973362 -0.214481
##   Documentary    Western
## 1  -0.1391341 -0.1409889
## 2  -0.1391341 -0.1409889
## 3  -0.1391341 -0.1409889
## 4  -0.1391341 -0.1409889
## 5  -0.1391341 -0.1409889
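To see what `scale` does to a 0/1 column, here is a small sketch on a made-up indicator vector (the vector `x` is an assumption for illustration). Each value has the column mean subtracted and is divided by the column standard deviation, which is why a rare genre such as Animation gets a large positive value (5.17 in the output above) wherever it is present, while a common genre such as Comedy gets a smaller one (1.32).

```r
x <- c(1, 0, 0, 0, 1, 0, 0, 0)   # a made-up genre indicator

manual <- (x - mean(x)) / sd(x)  # sd() uses the n - 1 denominator
scaled <- as.numeric(scale(x))   # scale() centres and divides by sd() by default

all.equal(manual, scaled)
range(scaled)
```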

Now that we have scaled the data, the next step is to check whether the dataset is complete, that is, whether it contains any missing values. We do it in the following manner:

sum(is.na(movies_scale))

## [1] 0

As we can see we don’t have missing values. This means that we can move forward.

In the next step we check whether the data has a clustering tendency at all, since many datasets are not suitable for clustering. There are two ways to do that. The first is to check visually; this method is not very accurate and is quite subjective. The second is more quantifiable and measurable. How can we measure clusterability? The answer is the get_clust_tendency function from the factoextra package in R. This function assesses the clustering tendency of the data using the Hopkins statistic and, optionally, a visual approach: an ordered dissimilarity image. In that image, clearly visible blocks along the diagonal suggest well-defined, distinguishable clusters. For more details, please refer to: http://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/

As a rule of thumb, the Hopkins statistic should be higher than 0.75 for the data to be considered clusterable. A Hopkins statistic above 0.75 indicates a clustering tendency at the 90% confidence level: the null hypothesis is that the data is not clusterable (uniformly distributed), and if the statistic is higher than 0.75 we reject it with 90% confidence.
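For intuition, the idea behind the Hopkins statistic can be sketched in a few lines of base R. This is a simplified illustration and not the implementation inside factoextra: the function name `hopkins_sketch`, the sample size `m` and the toy data are assumptions, and the sketch uses the convention in which values near 1 suggest clustered data and values near 0.5 suggest uniformly scattered data. It compares nearest-neighbour distances of real points with those of artificial points drawn uniformly over the same range.

```r
hopkins_sketch <- function(X, m = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  # m real observations, sampled without replacement
  idx <- sample(n, m)
  # m artificial points drawn uniformly inside the bounding box of the data
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  # Euclidean distance from point p to its nearest row of M
  nearest <- function(p, M) min(sqrt(rowSums(sweep(M, 2, p)^2)))
  u <- sapply(seq_len(m), function(i) nearest(U[i, ], X))
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))
  # Clustered data: real points sit close together, so sum(w) is small
  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
blobs <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
               matrix(rnorm(100, 5, 0.1), ncol = 2))
h_blobs <- hopkins_sketch(blobs)                       # two tight groups
h_unif  <- hopkins_sketch(matrix(runif(200), ncol = 2)) # no structure
c(blobs = h_blobs, uniform = h_unif)
```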

clusterability <- get_clust_tendency(movies_scale, n = nrow(movies_scale)-1, graph = FALSE)
clusterability$hopkins_stat

## [1] 0.7800369

d<-dist(movies_scale)
fviz_dist(d, show_labels = FALSE)+ labs(title = "Movies Data")

As we can see, the Hopkins statistic (0.78) is above the 0.75 threshold, indicating that our data is suitable for clustering. The next question is: which clustering method is the most appropriate for this particular dataset? To answer it, we use the clValid function from the clValid package in R. This function validates clustering methods, helping us pinpoint the one that is most appropriate in the given case. With validation = "internal" it reports three measures: connectivity, the Dunn index and the silhouette width. The silhouette width compares, for each point, the average distance to the points in its own cluster with the average distance to the points in the nearest other cluster. It ranges from -1 to 1, with values near 1 for points that fit their cluster well and values near -1 for points that are poorly fitted. The Dunn index is the ratio of the smallest distance between two points in different clusters to the largest distance between two points in the same cluster. As you could have guessed, both should be maximised, whereas connectivity should be minimised. This function takes a while to run, as we have asked it to compare many different clustering methods and cluster counts.

clmethods <- c("hierarchical","kmeans","pam","clara")
internal <- clValid(movies_scale, nClust = 2:30, clMethods = clmethods, validation = "internal", maxitems = 100000)
summary(internal)

## 
## Clustering Methods:
##  hierarchical kmeans pam clara 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
## 
## Validation Measures:
##                                   2        3        4        5        6        7        8        9       10       11       12       13       14       15       16       17       18       19       20       21       22       23       24       25       26       27       28       29       30
##                                                                                                                                                                                                                                                                                                
## hierarchical Connectivity    2.9290   2.9290   4.6702   4.6702   7.5992   8.0782  11.9361  13.6472  13.6472  21.4401  28.2341  31.6087  34.9901  37.4401  38.6579  41.5869  45.4607  50.9317  53.8317  55.8079  61.1230  63.9702  71.7032  75.5611  79.4782  80.6738  86.2107  88.7599  91.6889
##              Dunn            0.5712   0.5712   0.4619   0.5041   0.4943   0.4981   0.4981   0.4891   0.4891   0.4617   0.4331   0.4326   0.4947   0.4947   0.4947   0.4947   0.4947   0.4437   0.4193   0.3916   0.3916   0.3916   0.3916   0.3916   0.3916   0.3950   0.3950   0.3950   0.3950
##              Silhouette      0.4285   0.4169   0.4238   0.4367   0.4294   0.3482   0.3469   0.3478   0.3523   0.3563   0.3459   0.3416   0.3474   0.3455   0.3420   0.3406   0.3407   0.3304   0.3176   0.2967   0.2922   0.2924   0.2925   0.2924   0.2865   0.3295   0.3285   0.3300   0.3296
## kmeans       Connectivity    2.9290  10.3361  10.3361  13.2651  13.7440  19.1984  23.0563  20.8937  20.8937  24.4357  28.9702  32.3448  34.9901  38.6579  44.7167  50.0036  53.8774  62.4889  64.8099  85.7190  76.0567  78.9040  83.3302  87.1881  83.5794  86.4254  89.5508  94.5115  97.4405
##              Dunn            0.5712   0.2856   0.3040   0.3102   0.3102   0.3984   0.3984   0.4082   0.4244   0.2736   0.3096   0.3096   0.4947   0.4947   0.4384   0.4305   0.4305   0.4472   0.4615   0.3513   0.2787   0.2787   0.2787   0.2787   0.3513   0.3513   0.3520   0.3520   0.3520
##              Silhouette      0.4285   0.4206   0.4366   0.4344   0.3505   0.3311   0.3299   0.3363   0.3406   0.3512   0.3527   0.3500   0.3474   0.3468   0.3534   0.3513   0.3512   0.3473   0.3047   0.3325   0.3716   0.3716   0.3728   0.3736   0.3805   0.3840   0.3847   0.3845   0.3841
## pam          Connectivity  199.4647 178.0766 246.5175 318.8103 314.5234 295.8302 260.7583 270.4635 310.6246 268.7397 260.5516 236.2976 214.7722 172.9544 146.0349 157.3496 165.8933 174.5206 184.5865 216.3675 204.5333 203.3968 201.9000 209.1504 212.9444 187.3353 190.6730 183.8706 178.2218
##              Dunn            0.1562   0.1562   0.1521   0.1521   0.1554   0.1560   0.1647   0.1647   0.1647   0.1668   0.1819   0.1819   0.1819   0.1928   0.1928   0.1928   0.2077   0.2077   0.2077   0.2077   0.2262   0.2262   0.2262   0.2262   0.2262   0.2262   0.2262   0.2262   0.2262
##              Silhouette      0.1100   0.1484   0.1656   0.1957   0.2199   0.2414   0.2655   0.2782   0.2582   0.2996   0.3235   0.3511   0.3904   0.4244   0.4550   0.4710   0.4906   0.4962   0.5146   0.5217   0.5317   0.5435   0.5513   0.5684   0.5768   0.5998   0.6060   0.6121   0.6203
## clara        Connectivity  199.4647 207.9440 285.6972 281.4103 252.6913 278.6921 242.5175 262.4115 287.2337 299.9663 212.5028 213.2869 247.2611 259.9802 225.3448 204.9909 262.8540 247.2056 243.2437 225.0429 211.7222 207.5107 226.2452 185.8171 193.4583 178.6508 225.2726 212.2643 176.4690
##              Dunn            0.1562   0.1562   0.1521   0.1539   0.1601   0.1609   0.1732   0.1732   0.1609   0.1609   0.1836   0.1954   0.1836   0.1836   0.2170   0.2170   0.1954   0.1996   0.1954   0.1996   0.1954   0.1954   0.1954   0.1996   0.1836   0.1836   0.2430   0.2430   0.2430
##              Silhouette      0.1100   0.1380   0.1770   0.2013   0.2205   0.2408   0.2548   0.2740   0.2268   0.2549   0.3112   0.3310   0.3225   0.3607   0.3957   0.4178   0.3965   0.3970   0.4036   0.4796   0.5021   0.4909   0.4895   0.5354   0.5162   0.5231   0.5452   0.5421   0.5431
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 2.9290 hierarchical 2       
## Dunn         0.5712 hierarchical 2       
## Silhouette   0.6203 pam          30
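To make the Silhouette and Dunn rows of the summary more concrete, here is a small sketch on made-up data (the toy matrix and the variable names are assumptions for illustration). The average silhouette width comes from `silhouette` in the cluster package; the Dunn index is computed by hand from its definition rather than through clValid, so the sketch does not show how clValid computes it internally.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Two made-up, well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 5, 0.3), ncol = 2))
d_toy  <- dist(toy)
labels <- cutree(hclust(d_toy), k = 2)

# Average silhouette width over all points
sil <- silhouette(labels, d_toy)
avg_sil <- mean(sil[, "sil_width"])

# Dunn index by hand:
# smallest between-cluster distance / largest within-cluster distance
D <- as.matrix(d_toy)
same <- outer(labels, labels, "==")
dunn_manual <- min(D[!same]) / max(D[same])

c(silhouette = avg_sil, dunn = dunn_manual)
```

On data like this both values are high; on the movies data the table shows much lower values, which already hints that the clusters are not sharply separated.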

According to the clValid summary, hierarchical clustering with a small number of clusters is the best choice: it has the best connectivity and Dunn index at 2 clusters, and its silhouette width stays above 0.41 for 2 to 6 clusters. (PAM with 30 clusters has the highest silhouette width overall, but that many clusters would be hard to interpret here.) In what follows we compare 2, 4 and 5 clusters. But which number of clusters is better? And should we even be dividing the data into clusters? The latter question is answered by running the Duda-Hart test, whose null hypothesis is that the data is homogeneous, that is, not suitable for dividing into clusters. The former question is answered by the Calinski-Harabasz index, which measures which number of clusters is the most appropriate. Before we move on to these tests, let's look at the silhouette and elbow graphs. In the first graph, the optimal number of clusters is where the curve reaches its peak; in the second, it is at the "elbow", the point after which adding more clusters no longer reduces the within-cluster sum of squares by much.

par(mfrow=c(4,4))
fviz_nbclust(movies_scale, hcut, method = "silhouette",k.max=30)+
  labs(subtitle = "Silhouette method")

fviz_nbclust(movies_scale, hcut, method = "wss",k.max = 30) +
  labs(subtitle = "Elbow method")
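The quantity plotted by the elbow method is the total within-cluster sum of squares for each number of clusters. A minimal base R sketch of that calculation for a hierarchical tree is given below; the toy data, the choice of Ward linkage and the helper name `wss_for_k` are assumptions for illustration, not what `fviz_nbclust` does internally.

```r
set.seed(7)
# Three made-up, well-separated groups in two dimensions
X <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 4, 0.3), ncol = 2),
           matrix(rnorm(60, 8, 0.3), ncol = 2))
tree <- hclust(dist(X), method = "ward.D2")

# Total within-cluster sum of squares after cutting the tree into k clusters
wss_for_k <- function(X, tree, k) {
  labels <- cutree(tree, k)
  sum(sapply(split(seq_len(nrow(X)), labels), function(idx) {
    block <- X[idx, , drop = FALSE]
    # squared distances of the cluster's points to the cluster centroid
    sum(sweep(block, 2, colMeans(block))^2)
  }))
}

wss <- sapply(1:6, function(k) wss_for_k(X, tree, k))
round(wss, 1)
```

The values drop sharply up to the true number of groups and flatten afterwards; that bend is the elbow one looks for in the plot.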

Now let’s see if we should be dividing the data into clusters in the first place, using the Duda-Hart test:

hclust_2<-cutree(hclust(d),2)
hclust_4<-cutree(hclust(d),4)
hclust_5<-cutree(hclust(d),5) 
dudahart2(movies_scale,hclust_2)

## $p.value
## [1] 3.621325e-12
## 
## $dh
## [1] 0.9074254
## 
## $compare
## [1] 0.9364077
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

As the p-value shows, we reject the null hypothesis of homogeneity, and hence we can divide the data into clusters. Next, let's use the Calinski-Harabasz index to find the optimal number of clusters for each type of hierarchical method (agglomerative via hclust and divisive via diana):

hc_diana<-diana(movies_scale, diss=FALSE)
d<-dist(movies_scale)
diana_1<- cutree(hc_diana, k= 2)
diana_2<- cutree(hc_diana, k= 4)
diana_3<- cutree(hc_diana, k= 5)

c.hclust.2<-cluster.stats(d,hclust_2)
c.hclust.4<-cluster.stats(d,hclust_4)
c.hclust.5<-cluster.stats(d,hclust_5)
c.diana_2<-cluster.stats(d,diana_1)
c.diana_4<-cluster.stats(d,diana_2)
c.diana_5<-cluster.stats(d,diana_3)

c.hclust.2$ch

## [1] 203.8339

c.hclust.4$ch

## [1] 184.8868

c.hclust.5$ch

## [1] 196.6868

c.diana_2$ch

## [1] 209.0694

c.diana_4$ch

## [1] 189.7788

c.diana_5$ch

## [1] 143.8248
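For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (B / (k - 1)) / (W / (n - k)). The sketch below computes it by hand on made-up data; the helper `ch_index` and the toy matrix are assumptions for illustration, and the values are not comparable to those printed above.

```r
ch_index <- function(X, labels) {
  X <- as.matrix(X)
  n <- nrow(X)
  k <- length(unique(labels))
  # Sum of squared distances of all rows of M to their centroid
  ss <- function(M) sum(sweep(M, 2, colMeans(M))^2)
  total  <- ss(X)
  within <- sum(sapply(split(seq_len(n), labels),
                       function(idx) ss(X[idx, , drop = FALSE])))
  between <- total - within
  (between / (k - 1)) / (within / (n - k))
}

set.seed(11)
# Two made-up, well-separated groups in two dimensions
X <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 5, 0.3), ncol = 2))
tree <- hclust(dist(X))

ch_2 <- ch_index(X, cutree(tree, 2))
ch_4 <- ch_index(X, cutree(tree, 4))
c(k2 = ch_2, k4 = ch_4)
```

A higher value means tighter, better separated clusters, so the number of clusters with the largest index is preferred.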

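The Calinski-Harabasz index is highest for 2 clusters in both types of hierarchical clustering. Now let's plot the dendrogram for each method. Note that hcut below uses Ward linkage (ward.D2), whereas the hclust(d) call used for the tests above relies on the default complete linkage, so the agglomerative dendrogram is not the exact tree that was cut earlier: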

hc_hclust<-hcut(movies_scale, k = 2, isdiss = FALSE, hc_func = c("hclust"), hc_method = "ward.D2", hc_metric = "euclidean",graph = FALSE)
fviz_dend(
  hc_hclust,
  k = 2,
  horiz = TRUE,
  rect = TRUE,
  rect_fill = TRUE,
  cex = 0.1,
  main = "Dendrogram for Agglomerative Method"
)

## Warning in data.frame(xmin = unlist(xleft), ymin = unlist(ybottom), xmax =
## unlist(xright), : row names were found from a short variable and have been
## discarded

hc_diana<-diana(movies_scale, diss=FALSE)
fviz_dend(
  hc_diana,
  k = 2,
  horiz = TRUE,
  rect = TRUE,
  rect_fill = TRUE,
  cex = 0.1,
  main = "Dendrogram for Divisive Method"
)

Next, let's use the inertia ratio to check the quality of the divisions obtained.

inertion_hclust<-matrix(0, nrow=4, ncol=2)
colnames(inertion_hclust)<-c("division with agglomerative method", "division with divisive method")
rownames(inertion_hclust)<-c("intra-clust", "total", "percentage", "Q")
inertion_hclust[1,1]<-withindiss(dist(movies_scale), part=hclust_2) # intra-cluster
inertion_hclust[2,1]<-inertdiss(dist(movies_scale))                 # overall
inertion_hclust[3,1]<-inertion_hclust[1,1]/ inertion_hclust[2,1]        # ratio
inertion_hclust[4,1]<-1-inertion_hclust[3,1]

inertion_hclust[1,2]<-withindiss(dist(movies_scale), part=diana_1)  # intra-cluster
inertion_hclust[2,2]<-inertdiss(dist(movies_scale))                 # overall
inertion_hclust[3,2]<-inertion_hclust[1,2]/ inertion_hclust[2,2]        # ratio
inertion_hclust[4,2]<-1-inertion_hclust[3,2]                # Q, inter-cluster
inertion_hclust

##             division with agglomerative method division with divisive method
## intra-clust                        14.51154663                   14.47712338
## total                              15.99200000                   15.99200000
## percentage                          0.90742538                    0.90527285
## Q                                   0.09257462                    0.09472715
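The rows of this table can be reproduced from first principles. The sketch below is an illustration on made-up data, assuming equal observation weights of 1/n; that assumption is consistent with the total of 15.992 reported above, which equals 16 standardised columns times 1999/2000. The helper `inertia` and the toy matrix are not part of ClustGeo. It is also worth noting that the "percentage" row for the agglomerative method (0.9074) coincides with the $dh value printed by dudahart2 earlier, which is what one would expect if both are the within-to-total ratio for the same 2-cluster split.

```r
set.seed(3)
# Two made-up, overlapping groups in two dimensions
X <- rbind(matrix(rnorm(40, 0, 1), ncol = 2),
           matrix(rnorm(40, 2, 1), ncol = 2))
labels <- cutree(hclust(dist(X)), k = 2)
n <- nrow(X)

# Inertia with equal weights 1/n: squared distances to the centroid, divided by n
inertia <- function(M) sum(sweep(M, 2, colMeans(M))^2) / n

total  <- inertia(X)
within <- sum(sapply(split(seq_len(n), labels),
                     function(idx) inertia(X[idx, , drop = FALSE])))

# The same total, obtained from pairwise distances only
total_from_dist <- sum(as.matrix(dist(X))^2) / (2 * n^2)

ratio <- within / total   # share of inertia left inside the clusters
Q     <- 1 - ratio        # share explained by the split
c(within = within, total = total, ratio = ratio, Q = Q)
```

A Q close to 1 would mean the split explains most of the variation; a Q near 0 means the clusters differ little from the data as a whole.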

As we can see, the values are quite similar for both types of hierarchical clustering, with the divisive method marginally better. That being said, the Q value shows that the between-cluster inertia is very small in both cases (about 9% of the total), while the within-cluster share is high (about 91%), which indicates that the divisions obtained are not of very good quality.

Results

Let's see how the different genres of movies are distributed across the clusters:

hclust_2<-c(hclust_2)

movies_agg<-cbind(movies_agg[1:2000,],hclust_2)
par(mfrow=c(2,3))
hist(movies_agg[movies_agg$Adventure==1,]$hclust_2, nclass=2, main="Adventure Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Animation==1,]$hclust_2, nclass=2, main="Animation Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Children==1,]$hclust_2, nclass=2, main="Children Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Fantasy==1,]$hclust_2, nclass=2, main="Fantasy Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Comedy==1,]$hclust_2, nclass=2, main="Comedy Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Romance==1,]$hclust_2, nclass=2, main="Romance Cluster Distribution", xlab= "Clusters")

par(mfrow=c(2,3))
hist(movies_agg[movies_agg$Action==1,]$hclust_2, nclass=2, main="Action Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Thriller==1,]$hclust_2, nclass=2, main="Thriller Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Drama==1,]$hclust_2, nclass=2, main="Drama Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Crime==1,]$hclust_2, nclass=2, main="Crime Cluster Distribution", xlab= "Clusters")
#hist(movies_agg[movies_agg$SciFi==1,]$hclust_2, nclass=2, main="Sci-Fi Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Horror==1,]$hclust_2, nclass=2, main="Horror Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Mystery==1,]$hclust_2, nclass=2, main="Mystery Cluster Distribution", xlab= "Clusters")

par(mfrow=c(2,2))

hist(movies_agg[movies_agg$War==1,]$hclust_2, nclass=2, main="War Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Musical==1,]$hclust_2, nclass=2, main="Musical Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Documentary==1,]$hclust_2, nclass=2, main="Documentary Cluster Distribution", xlab= "Clusters")
hist(movies_agg[movies_agg$Western==1,]$hclust_2, nclass=2, main="Western Cluster Distribution", xlab= "Clusters")
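The same information can be summarised in a single table of counts instead of one histogram per genre. The sketch below uses a made-up data frame; the rows, the `cluster` column name and the `genre_cols` vector are assumptions for illustration.

```r
# Made-up example: two genre indicators and a cluster label per movie
toy <- data.frame(
  Comedy  = c(1, 1, 0, 0, 1),
  Drama   = c(0, 1, 1, 1, 0),
  cluster = c(1, 1, 2, 2, 1)
)
genre_cols <- c("Comedy", "Drama")
cluster_levels <- sort(unique(toy$cluster))

# For each genre, count how many of its movies fall into each cluster
counts <- sapply(genre_cols, function(g) {
  table(factor(toy$cluster[toy[[g]] == 1], levels = cluster_levels))
})
counts
```

Rows are clusters and columns are genres, so a genre that sits entirely in one cluster shows a zero in the other row.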

The histograms show that movies of the same genre can be assigned to different clusters. This can be explained by the fact that the same movie can belong to several genres. Even so, all movies classified as Documentary, Mystery or Thriller fall into the same cluster, while Animation movies fall into the other one. Since the scaled matrix was built from columns 4:19 (the genre indicators, without the rating column), this separation reflects how the genres co-occur rather than how the movies were rated; the rating column would have to be included in the selection before concluding that the ratings of these genres are similar. One plausible reason for the split is that Animation movies are judged by quite different criteria (they do not have on-screen actors) than Mystery, Thriller and Documentary movies, which are driven not only by script and visuals but also by the people on screen.

Catherine Sunil

2/1/2020
