movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ V10: int 1 0 0 0 0 0 0 1 0 0 ...
## $ V11: int 1 0 0 1 0 0 0 1 0 0 ...
## $ V12: int 0 0 0 0 1 0 0 0 0 0 ...
## $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V14: int 0 0 0 1 1 1 1 1 1 1 ...
## $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V16: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V17: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V20: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V21: int 0 0 0 0 0 0 1 0 0 0 ...
## $ V22: int 0 1 1 0 1 0 0 0 0 0 ...
## $ V23: int 0 0 0 0 0 0 0 0 0 1 ...
## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...
header=FALSE tells R that the file has no header (variable name) row. sep="|" tells R that the fields on each line are separated by the "|" character. quote="\"" makes sure that our text is read in properly. Since the variables in movieLens.txt have no names and header is FALSE, R simply labels them V1, V2, V3, and so on. We can assign meaningful names with colnames:
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ VideoReleaseDate: logi NA NA NA NA NA NA ...
## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
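As a small illustration of the read.table arguments and the colnames step, here is a sketch on two made-up lines in the same pipe-delimited layout. The sample lines, the genre values, and the names `sample_lines` and `toy` are assumptions for this example only, not part of movieLens.txt.

```r
# Hypothetical pipe-delimited sample (NOT the real movieLens.txt contents)
sample_lines <- c(
  "1|Toy Story (1995)|1|0",
  "2|'Til There Was You (1997)|0|1"
)

# header = FALSE : there is no row of variable names
# sep = "|"      : fields are split on the pipe character
# quote = "\""   : only double quotes delimit strings, so the apostrophe
#                  in "'Til" is kept as an ordinary character
toy <- read.table(text = sample_lines, header = FALSE, sep = "|",
                  quote = "\"", stringsAsFactors = FALSE)
names(toy)  # default names assigned by R: V1, V2, V3, V4

# Illustrative names for the four toy columns
colnames(toy) <- c("ID", "Title", "Comedy", "Romance")
str(toy)
```

read.table's default quote set includes the apostrophe as well as the double quote, which is why restricting it to the double quote matters for titles that begin with an apostrophe.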
The commands below remove the ID, ReleaseDate, VideoReleaseDate, and IMDB variables from our dataset.
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
There are a few duplicate entries in our dataset, so we'll go ahead and remove them with the unique function.
movies = unique(movies)
str(movies)
## 'data.frame': 1664 obs. of 20 variables:
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
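The two cleaning steps can be sketched on a toy data frame. The data and the names `toy` and `toy_unique` are made up for illustration. The sketch also shows why the order can matter: while a unique ID column is present, no two rows are identical, so unique would have nothing to remove.

```r
# Toy data frame with one repeated movie (hypothetical values)
toy <- data.frame(
  ID     = 1:4,
  Title  = c("A", "B", "A", "C"),
  Action = c(1, 0, 1, 0)
)

nrow(unique(toy))          # 4: the ID column makes every row distinct

toy$ID <- NULL             # assigning NULL drops the column
toy_unique <- unique(toy)  # keeps the first occurrence of each repeated row

nrow(toy_unique)           # 3
rownames(toy_unique)       # original row labels are kept: "1" "2" "4"
```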
First, compute the distances between all data points. Second, cluster the points.
#### Compute distances
distances = dist(movies[2:20], method = "euclidean") # except for 'Title' variable
clusterMovies = hclust(distances, method = "ward.D")
The "ward.D" method takes into account both the distance between clusters, measured between their centroids, and the variance within each cluster.
#### Plot the dendrogram
plot(clusterMovies)
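To get a feel for the Euclidean distances used here, the sketch below computes them for three made-up movies described by three 0/1 genre indicators. The matrix `toy` and its values are assumptions for illustration. For 0/1 data, the Euclidean distance between two movies is the square root of the number of genres on which they differ.

```r
# Hypothetical genre indicators (Action, Comedy, Drama) for three movies
toy <- rbind(
  m1 = c(1, 0, 0),
  m2 = c(1, 1, 0),
  m3 = c(0, 1, 1)
)

d <- dist(toy, method = "euclidean")
d

# m1 vs m2 differ in 1 genre  -> sqrt(1)
# m1 vs m3 differ in 3 genres -> sqrt(3)
# m2 vs m3 differ in 2 genres -> sqrt(2)
all.equal(as.vector(d), sqrt(c(1, 3, 2)))
```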
Q. How many clusters would we pick? Judging by the dendrogram, 3 or 4 clusters looks like a good choice. However, it depends on the application. For example, if we want very specific genre groups, we should select more clusters, so we use 10 here.
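How the number of clusters changes the grouping can be sketched on a small made-up dataset. The matrix `toy` and the object `tree` are hypothetical names for this example. Cutting the same tree at a larger k only splits existing clusters further; it never mixes points that were already separated.

```r
# Hypothetical 0/1 genre indicators for six movies: three identical pairs
toy <- rbind(
  a1 = c(1, 0, 0), a2 = c(1, 0, 0),
  c1 = c(0, 1, 0), c2 = c(0, 1, 0),
  d1 = c(0, 0, 1), d2 = c(0, 0, 1)
)

tree <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

groups2 <- cutree(tree, k = 2)  # coarse: two clusters
groups3 <- cutree(tree, k = 3)  # finer: one cluster per identical pair

table(groups2)           # cluster sizes for k = 2
table(groups3)           # cluster sizes for k = 3
table(groups2, groups3)  # each k = 3 cluster sits inside one k = 2 cluster
```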
clusterGroups = cutree(clusterMovies, k = 10) # 10 Clusters
tapply(movies$Action, clusterGroups, mean)
## 1 2 3 4 5 6 7
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000
## 8 9 10
## 0.0000000 0.0000000 0.0000000
tapply(movies$Romance, clusterGroups, mean)
## 1 2 3 4 5 6
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000
## 7 8 9 10
## 1.00000000 0.00000000 0.00000000 0.00000000
With the tapply function, we can see which clusters are dominated by Action or Romance movies. tapply divides our data points into the 10 clusters and then computes the average value of the Action or Romance variable for each cluster. Because each genre variable is 0 or 1, this average is the fraction of movies in the cluster that belong to that genre.
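A minimal sketch of the same tapply call on made-up data, small enough to check by hand. The vectors `action` and `groups` are hypothetical.

```r
action <- c(1, 1, 0, 0, 1, 0)  # hypothetical 0/1 genre indicator
groups <- c(1, 1, 2, 2, 3, 3)  # hypothetical cluster label per movie

# Split `action` by `groups`, then take the mean within each group
res <- tapply(action, groups, mean)
res
# cluster 1: (1 + 1) / 2 = 1
# cluster 2: (0 + 0) / 2 = 0
# cluster 3: (1 + 0) / 2 = 0.5
```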
subset(movies, Title=="Men in Black (1997)")
## Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997) 0 1 1 0 0
## Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257 1 0 0 0 0 0 0 0 0
## Romance SciFi Thriller War Western
## 257 0 1 0 0 0
clusterGroups[257]
## 257
## 2
Q. Which cluster did the 257th movie go into? It looks like 'Men in Black' went into cluster 2.
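One caveat worth sketching: a number inside square brackets selects by position, while the label printed by subset is a row name. In the output above the two agree, but after rows have been dropped they can differ. The example below uses made-up titles and a hand-built named vector (`titles` and `groups` are hypothetical) to show a lookup by title that does not depend on position.

```r
# Hypothetical state after de-duplication: original row 3 was dropped,
# so the remaining row labels are 1, 2, 4, 5
titles <- c("A", "B", "C", "D")
groups <- c("1" = 1, "2" = 2, "4" = 1, "5" = 2)  # cluster label per row

groups[4]              # by position: 4th element, whose row label is "5"
groups["4"]            # by row label "4"
groups[titles == "C"]  # by title: independent of position and labels
```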
cluster2 = subset(movies, clusterGroups==2)
cluster2$Title[1:10]
## [1] GoldenEye (1995)
## [2] Bad Boys (1995)
## [3] Apollo 13 (1995)
## [4] Net, The (1995)
## [5] Natural Born Killers (1994)
## [6] Outbreak (1995)
## [7] Stargate (1994)
## [8] Fugitive, The (1993)
## [9] Jurassic Park (1993)
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... Zeus and Roxanne (1997)
So far, by using the tapply function for each variable in the dataset, we can find the cluster centroids. While this approach works and is familiar to us, it can be a little tedious when there are a lot of variables.
An alternative approach is to use the colMeans function. With this approach, we only need one command for each cluster instead of one command for each variable. Running the following command gives all of the column (variable) means for cluster 1:
colMeans(subset(movies[2:20], clusterGroups==1))
## Unknown Action Adventure Animation Childrens Comedy
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017
## Musical Mystery Romance SciFi Thriller War
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226
## Western
## 0.090909091
You can repeat this for each cluster by changing the cluster number in the clusterGroups condition. However, if we also have a lot of clusters, this approach is not much more efficient than just using the tapply function.
A more advanced approach uses the split and lapply functions.
The lapply function runs its second argument (colMeans) on each element of its first argument (each cluster subset in spl). So instead of using 19 tapply commands, or 10 colMeans commands, we can output our centroids with just two commands: one to define spl, and then the lapply command.
spl = split(movies[2:20], clusterGroups)
# spl[[1]] is the same as subset(movies[2:20], clusterGroups==1)
# so, colMeans(spl[[1]]) will output the centroid of cluster 1
lapply(spl, colMeans)
## $`1`
## Unknown Action Adventure Animation Childrens Comedy
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017
## Musical Mystery Romance SciFi Thriller War
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226
## Western
## 0.090909091
##
## $`2`
## Unknown Action Adventure Animation Childrens Comedy
## 0.000000000 0.783919598 0.351758794 0.010050251 0.005025126 0.065326633
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.005025126 0.000000000 0.110552764 0.000000000 0.000000000 0.080402010
## Musical Mystery Romance SciFi Thriller War
## 0.000000000 0.000000000 0.045226131 0.346733668 0.376884422 0.015075377
## Western
## 0.000000000
##
## $`3`
## Unknown Action Adventure Animation Childrens Comedy
## 0.000000000 0.123853211 0.036697248 0.000000000 0.009174312 0.064220183
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.412844037 0.000000000 0.380733945 0.004587156 0.105504587 0.018348624
## Musical Mystery Romance SciFi Thriller War
## 0.000000000 0.275229358 0.036697248 0.041284404 0.610091743 0.000000000
## Western
## 0.000000000
##
## $`4`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 0
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`5`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 0 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`6`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.1015625 0.0000000 0.0000000 0.0000000 0.1093750
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0468750 0.0000000 0.6640625 0.0000000 0.0078125 0.0156250
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 1.0000000 0.0000000 0.1406250 0.0000000
## Western
## 0.0000000
##
## $`7`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 0 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 1 0 0 0
## Western
## 0
##
## $`8`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766
## Western
## 0.0000000
##
## $`9`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`10`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302 0.0000000
## Western
## 0.0000000
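The list printed by lapply is long. As one possible variation, sapply returns the same centroids as a single matrix, with one row per genre and one column per cluster, which can be easier to scan. The sketch below uses a made-up data frame; `toy`, `groups`, `spl_toy`, and `centroids` are hypothetical names.

```r
# Hypothetical genre indicators for four movies and their cluster labels
toy <- data.frame(Action = c(1, 1, 0, 0),
                  Comedy = c(0, 1, 1, 1))
groups <- c(1, 1, 2, 2)

spl_toy <- split(toy, groups)           # one data frame per cluster
centroids <- sapply(spl_toy, colMeans)  # matrix: genres x clusters
centroids

round(centroids, 2)                     # rounding makes large tables easier to read
```

On the real data, the analogous call would be sapply(spl, colMeans).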