Unit 6 - Introduction to Clustering

After following the steps in the video, load the data into R

movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
str(movies)
## 'data.frame':    1682 obs. of  24 variables:
##  $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ V4 : logi  NA NA NA NA NA NA ...
##  $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
##  $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
##  $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
##  $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ V24: int  0 0 0 0 0 0 0 0 0 0 ...
  • header=FALSE tells R that the file has no header row, that is, no row of variable names.
  • sep="|" tells R that the fields on each line are separated by the “|” character.
  • quote="\"" makes sure that the text is read in properly by treating only double quotes as quote characters. Because movieLens.txt has no variable names and header is FALSE, R labels the columns V1, V2, V3, and so on.
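
The effect of these three arguments can be checked on a couple of lines typed in by hand. This is only a minimal sketch: the two records and the names sampleLines and toy are made up for illustration and are not read from movieLens.txt.

```r
# Two made-up records in a pipe-separated layout (not taken from movieLens.txt)
sampleLines = "1|Toy Story (1995)|01-Jan-1995
2|'Til There Was You (1997)|30-May-1997"

# header=FALSE: there is no header row, so R names the columns V1, V2, V3
# sep="|":      fields are separated by the pipe character
# quote="\"":   only double quotes delimit strings, so the apostrophe in 'Til
#               stays part of the title
toy = read.table(text = sampleLines, header = FALSE, sep = "|", quote = "\"",
                 stringsAsFactors = FALSE)
str(toy)
```

With R's default quote setting the apostrophe would also be treated as a quote character, which is why titles such as 'Til There Was You need quote="\"".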

Add column names

colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
str(movies)
## 'data.frame':    1682 obs. of  24 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title           : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
##  $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens       : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy          : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...

Remove unnecessary variables

The assignments below remove four variables that we will not use for clustering.

movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

Remove duplicates

There are a few duplicate entries in our dataset, so we’ll remove them with the unique function.

movies = unique(movies)
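
As a small illustration of what unique does to a data frame, here is a sketch on made-up rows (the data frame toy and its values are invented, not part of the MovieLens data). A row is dropped only when every column matches an earlier row.

```r
# Made-up rows: the second row repeats the first in every column
toy = data.frame(Title  = c("A (1990)", "A (1990)", "B (1991)"),
                 Action = c(1, 1, 0),
                 Drama  = c(0, 0, 1))

toyUnique = unique(toy)   # keeps the first copy of each distinct row
nrow(toyUnique)
rownames(toyUnique)       # original row names are kept, so the numbering can have gaps
```

The kept row names matter later, when a movie is looked up by its row number.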

Take a look at our data again:

str(movies)
## 'data.frame':    1664 obs. of  20 variables:
##  $ Title      : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens  : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...

Hierarchical Clustering

Hierarchical clustering takes two steps: first, compute the distances between all data points; second, cluster the points.

Compute distances

distances = dist(movies[2:20], method = "euclidean") # except for 'Title' variable
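
To see what the Euclidean distance means for 0/1 genre flags, here is a minimal sketch on three made-up movies (the matrix genres and the labels m1, m2, m3 are invented for illustration).

```r
# Three made-up movies described by three 0/1 genre flags
genres = rbind(m1 = c(Action = 1, Comedy = 0, Drama = 0),
               m2 = c(Action = 1, Comedy = 1, Drama = 0),
               m3 = c(Action = 0, Comedy = 1, Drama = 1))

d = dist(genres, method = "euclidean")
as.matrix(d)   # full symmetric distance matrix

# For 0/1 data, the distance is the square root of the number of genres
# on which two movies disagree
sqrt(sum(genres["m1", ] != genres["m3", ]))
```

Two movies with identical genre flags are therefore at distance 0, no matter how different their titles are.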

Hierarchical clustering

clusterMovies = hclust(distances, method = "ward.D")

The "ward.D" method takes into account both the distance between clusters, measured between their centroids, and the variance within each cluster.

Plot the dendrogram

plot(clusterMovies)
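
The idea behind Ward's method, combining centroid distance with within-cluster variance, can be sketched numerically. The helper names withinSS and wardCost and the two small groups are made up for this illustration. The code shows the textbook form of Ward's merge cost, not R's internal implementation; ?hclust describes how the "ward.D" and "ward.D2" options relate to that criterion.

```r
# Sum of squared distances from each row to the centroid of its group
withinSS = function(x) sum(sweep(x, 2, colMeans(x))^2)

# Ward's merge cost: size-weighted squared distance between the two centroids
wardCost = function(a, b) {
  na = nrow(a)
  nb = nrow(b)
  na * nb / (na + nb) * sum((colMeans(a) - colMeans(b))^2)
}

# Two made-up groups of 0/1 genre vectors
a = rbind(c(1, 1, 0), c(1, 0, 0))
b = rbind(c(0, 0, 1), c(0, 1, 1), c(0, 1, 1))

wardCost(a, b)
# The same number, computed as the increase in within-cluster variation after merging
withinSS(rbind(a, b)) - withinSS(a) - withinSS(b)
```

Merging two groups is cheap when their centroids are close and expensive when the merged group would be much more spread out than the two groups were separately.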


Q. How many clusters would we pick? It looks like 3 or 4 clusters would be a good choice according to the dendrogram. However, it depends on the application. For example, if we want very specific genre groups, we should select more clusters.
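
One way to compare candidate numbers of clusters is to cut the same tree at several values of k at once. The sketch below uses six made-up movies (the matrix toy and its row labels are hypothetical), not the MovieLens data.

```r
# Six made-up movies with 0/1 genre flags; row names are hypothetical labels
toy = rbind(a1 = c(1, 1, 0, 0), a2 = c(1, 1, 0, 0), a3 = c(1, 0, 0, 0),
            d1 = c(0, 0, 1, 1), d2 = c(0, 0, 1, 1), d3 = c(0, 0, 0, 1))
tree = hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Passing several values of k returns one column of cluster labels per k
cuts = cutree(tree, k = 2:3)
cuts

# Number of distinct clusters in each column
apply(cuts, 2, function(g) length(unique(g)))
```

Looking at which movies move between columns shows what the extra cluster buys in terms of more specific groups.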


Assign points to clusters

clusterGroups = cutree(clusterMovies, k = 10) # 10 Clusters

Now let’s figure out what the clusters are like.

Let’s use the tapply function to compute, for each cluster, the proportion of movies that belong to a given genre:

tapply(movies$Action, clusterGroups, mean)
##         1         2         3         4         5         6         7 
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 
##         8         9        10 
## 0.0000000 0.0000000 0.0000000
tapply(movies$Romance, clusterGroups, mean)
##          1          2          3          4          5          6 
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 
##          7          8          9         10 
## 1.00000000 0.00000000 0.00000000 0.00000000

The tapply function shows which clusters are dominated by Action or Romance movies. It divides our data points into the 10 clusters and then computes the average value of the Action or Romance variable for each cluster. Because each genre variable is 0 or 1, that average is the proportion of movies in the cluster that belong to the genre.
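
The same computation on a handful of made-up values makes the grouping step easy to follow. The vectors action and groups below are invented for illustration.

```r
# Made-up data: Action flag for six movies and a hypothetical cluster label for each
action = c(1, 1, 0, 0, 0, 1)
groups = c(1, 1, 1, 2, 2, 2)

# Mean of a 0/1 flag within each group = share of that group's movies that are Action
actionShare = tapply(action, groups, mean)
actionShare
```

Here two of the three movies in group 1 are Action movies, and one of the three in group 2.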


We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods.

Find which cluster Men in Black is in.

subset(movies, Title=="Men in Black (1997)")
##                   Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997)       0      1         1         0         0
##     Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257      1     0           0     0       0        0      0       0       0
##     Romance SciFi Thriller War Western
## 257       0     1        0   0       0
clusterGroups[257] 
## 257 
##   2

Q. Which cluster did the 257th movie go into? It looks like ‘Men in Black’ went into cluster 2.
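
Indexing with 257 relies on the movie sitting at position 257 in the deduplicated data. A lookup that does not depend on the position can be sketched with a logical test on Title. The data frame toy, its titles, and toyGroups are made up for this illustration.

```r
# Four made-up movies; the first two and the last two share their genre flags
toy = data.frame(Title  = c("Alpha (1990)", "Beta (1991)", "Gamma (1992)", "Delta (1993)"),
                 Action = c(1, 1, 0, 0),
                 Drama  = c(0, 0, 1, 1))
toyGroups = cutree(hclust(dist(toy[2:3], method = "euclidean"), method = "ward.D"), k = 2)

# cutree returns one label per row, in row order, so a logical test on Title
# picks out the label of the matching movie
toyGroups[toy$Title == "Gamma (1992)"]
```

Applied to the real data, the same pattern would be clusterGroups[movies$Title == "Men in Black (1997)"].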


Create a new data set with just the movies from cluster 2

cluster2 = subset(movies, clusterGroups==2)

Look at the first 10 titles in this cluster:

cluster2$Title[1:10]
##  [1] GoldenEye (1995)                              
##  [2] Bad Boys (1995)                               
##  [3] Apollo 13 (1995)                              
##  [4] Net, The (1995)                               
##  [5] Natural Born Killers (1994)                   
##  [6] Outbreak (1995)                               
##  [7] Stargate (1994)                               
##  [8] Fugitive, The (1993)                          
##  [9] Jurassic Park (1993)                          
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... Zeus and Roxanne (1997)

An Advanced Approach to finding cluster centroids

1. tapply function

So far, we have found the cluster centroids by running tapply once for each variable in the dataset. While this approach works and is familiar to us, it can be a little tedious when there are a lot of variables.


2. colMeans function

An alternative approach is to use the colMeans function. With this approach, we only need one command for each cluster instead of one command for each variable. The following command returns all of the column (variable) means for cluster 1:

colMeans(subset(movies[2:20], clusterGroups==1))
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091

You can repeat this for each cluster by changing the clusterGroups number. However, if we also have a lot of clusters, this approach is not much more efficient than just using the tapply function.
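
The repetition over clusters can also be written as a loop. The sketch below does this with sapply on made-up data; toy, toyGroups and centroids are hypothetical names, and the cluster labels are assigned by hand rather than produced by cutree.

```r
# Five made-up movies with 0/1 genre flags
toy = data.frame(Action = c(1, 1, 0, 0, 0),
                 Comedy = c(0, 1, 1, 1, 0),
                 Drama  = c(0, 0, 0, 1, 1))
toyGroups = c(1, 1, 2, 2, 2)   # hypothetical cluster labels

# One colMeans call per cluster, collected into a matrix with one column per cluster
clusterIds = sort(unique(toyGroups))
centroids = sapply(clusterIds,
                   function(k) colMeans(toy[toyGroups == k, , drop = FALSE]))
colnames(centroids) = clusterIds
centroids
```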


3. split & lapply function (advanced approach)

A more advanced approach uses the split and lapply functions. The split function divides the genre columns into a list, spl, with one subset of rows per cluster.
The lapply function then runs its second argument (colMeans) on each element of its first argument (each cluster subset in spl). So instead of using 19 tapply commands or 10 colMeans commands, we can output our centroids with just two commands: one to define spl, and then the lapply command.

spl = split(movies[2:20], clusterGroups)
# spl[[1]] is the same as subset(movies[2:20], clusterGroups==1)
# so colMeans(spl[[1]]) will output the centroid of cluster 1
lapply(spl, colMeans)
## $`1`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091 
## 
## $`2`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.783919598 0.351758794 0.010050251 0.005025126 0.065326633 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.005025126 0.000000000 0.110552764 0.000000000 0.000000000 0.080402010 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.000000000 0.045226131 0.346733668 0.376884422 0.015075377 
##     Western 
## 0.000000000 
## 
## $`3`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.123853211 0.036697248 0.000000000 0.009174312 0.064220183 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.412844037 0.000000000 0.380733945 0.004587156 0.105504587 0.018348624 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.275229358 0.036697248 0.041284404 0.610091743 0.000000000 
##     Western 
## 0.000000000 
## 
## $`4`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           0 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`5`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`6`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.1015625   0.0000000   0.0000000   0.0000000   0.1093750 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0468750   0.0000000   0.6640625   0.0000000   0.0078125   0.0156250 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   1.0000000   0.0000000   0.1406250   0.0000000 
##     Western 
##   0.0000000 
## 
## $`7`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           1           0           0           0 
##     Western 
##           0 
## 
## $`8`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0212766 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0000000   1.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0212766 
##     Western 
##   0.0000000 
## 
## $`9`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`10`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.1587302 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   1.0000000 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.1587302   0.0000000 
##     Western 
##   0.0000000
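
The list printed by lapply is long. If a compact table is easier to read, sapply can be used in place of lapply to return the same centroids as a matrix with one row per genre and one column per cluster. The sketch below shows both on made-up data; toy, toyGroups and toySpl are hypothetical names, not part of the analysis above.

```r
# Five made-up movies with 0/1 genre flags and hand-assigned cluster labels
toy = data.frame(Action = c(1, 1, 0, 0, 0),
                 Comedy = c(0, 1, 1, 1, 0),
                 Drama  = c(0, 0, 0, 1, 1))
toyGroups = c(1, 1, 2, 2, 2)

toySpl = split(toy, toyGroups)                 # list with one data frame per cluster
centroidList = lapply(toySpl, colMeans)        # list of centroids, one element per cluster
centroidTable = sapply(toySpl, colMeans)       # same numbers as a genre-by-cluster matrix
centroidTable
```

On the real data the corresponding call would be sapply(spl, colMeans).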