Through this example, we’ll introduce the method of clustering. Netflix is an online DVD rental and streaming video service: customers can receive movie rentals by mail, and they can also watch selected movies and TV shows online. Netflix has more than 40 million subscribers worldwide and an annual revenue of $3.6 billion.
A key aspect of the company is being able to offer customers accurate movie recommendations based on a customer’s own preferences and viewing history. From 2006 through 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies. This algorithm would be useful for Netflix when making recommendations to users. Netflix provided a training data set of about 100 million user ratings and a test data set of about three million user ratings.
When predicting user ratings, what data could be useful? There are two main types of data we could use. The first is the ratings that this user and other users have already given to movies; predicting a user’s ratings from the ratings of users with similar tastes is called collaborative filtering. The second is information about the movies themselves, such as genre; recommending items whose attributes are similar to those of items the user has already liked is called content filtering.
Collaborative filtering can accurately suggest complex items without understanding the nature of the items themselves: it doesn’t matter that the items are movies, since we are only comparing user ratings. However, it requires a lot of data about each user to make accurate recommendations, and when there are millions of items, computing the similarities between users takes a lot of computing power. Content filtering, on the other hand, requires very little data to get started, but its major weakness is that it can be limited in scope: you only recommend things similar to what the user has already liked, so the recommendations are often not surprising or particularly insightful. Netflix actually uses what’s called a hybrid recommendation system.
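To make “comparing user ratings” concrete, here is a minimal sketch in R; the tiny ratings matrix and the user and movie names in it are invented purely for illustration.

# Toy ratings matrix (rows = users, columns = movies); all names and values are made up.
ratings = rbind(Amy  = c(Terminator = 5, Titanic = 1, Alien = 4),
                Carl = c(Terminator = 5, Titanic = 2, Alien = 5),
                Dana = c(Terminator = 1, Titanic = 5, Alien = 2))
# Correlation between users' rating vectors: Amy and Carl come out as highly similar,
# so movies Carl rates highly become candidate recommendations for Amy (and vice versa).
cor(t(ratings))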
Suppose collaborative filtering tells us that two users, Amy and Carl, have similar tastes, and that both of them liked the movie Terminator. We could then do content filtering as well and find that Terminator is classified in almost the same set of genres as Starship Troopers, so we could recommend Starship Troopers to both Amy and Carl, even though neither of them has seen it before. If we were only doing collaborative filtering, one of them would have had to have seen it first, and if we were only doing content filtering, we would only be recommending to one user at a time. By combining the two methods, the algorithm can be both more efficient and more accurate.
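As a rough sketch of the content-filtering half of that idea (the genre flags below are invented for illustration), we can measure how similar two movies are by the distance between their genre vectors; this is the same kind of comparison the clustering on the MovieLens genre data below performs.

# Toy genre indicators (1 = the movie belongs to that genre); values are made up.
genres = rbind(Terminator       = c(Action = 1, SciFi = 1, Thriller = 1, Romance = 0),
               StarshipTroopers = c(Action = 1, SciFi = 1, Thriller = 0, Romance = 0),
               Titanic          = c(Action = 0, SciFi = 0, Thriller = 0, Romance = 1))
# Movies separated by a small distance share most of their genres.
dist(genres, method = "euclidean")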
# Read in the MovieLens item (movie) data: pipe-separated, with no header row
movies = read.table("movieLens.txt", header = FALSE, sep = "|", quote = "\"")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1377 565 505 542 310 1661 1399 103 357 1130 ...
## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ V10: int 1 0 0 0 0 0 0 1 0 0 ...
## $ V11: int 1 0 0 1 0 0 0 1 0 0 ...
## $ V12: int 0 0 0 0 1 0 0 0 0 0 ...
## $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V14: int 0 0 0 1 1 1 1 1 1 1 ...
## $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V16: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V17: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V20: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V21: int 0 0 0 0 0 0 1 0 0 0 ...
## $ V22: int 0 1 1 0 1 0 0 0 0 0 ...
## $ V23: int 0 0 0 0 0 0 0 0 0 1 ...
## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ VideoReleaseDate: logi NA NA NA NA NA NA ...
## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1377 565 505 542 310 1661 1399 103 357 1130 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
# Remove the variables we won't use for clustering
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
# Remove duplicate entries (a few movies appear more than once)
movies = unique(movies)
str(movies)
## 'data.frame': 1664 obs. of 20 variables:
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
# How many movies belong to particular genres or genre combinations?
table(movies$Comedy == 1)
##
## FALSE TRUE
## 1162 502
table(movies$Western == 1)
##
## FALSE TRUE
## 1637 27
table((movies$Romance == 1) & (movies$Drama == 1))
##
## FALSE TRUE
## 1567 97
# Compute distances between movies, using only the 19 genre variables (columns 2 through 20)
distances = dist(movies[2:20], method = "euclidean")
# Hierarchical clustering with Ward's method ("ward" was renamed to "ward.D" in recent R versions)
clusterMovies = hclust(distances, method = "ward.D")
# Plot the dendrogram
plot(clusterMovies)
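# Optionally (not part of the original output), we can outline a candidate
# 10-cluster cut directly on the dendrogram we just plotted:
rect.hclust(clusterMovies, k = 10, border = "red")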
# Assign points to clusters
clusterGroups = cutree(clusterMovies, k = 10)
# Now let's figure out what the clusters are like.
# Use tapply to compute, for each cluster, the fraction of its movies that belong to a given genre
tapply(movies$Action, clusterGroups, mean)
## 1 2 3 4 5 6 7
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000
## 8 9 10
## 0.0000000 0.0000000 0.0000000
tapply(movies$Romance, clusterGroups, mean)
## 1 2 3 4 5 6
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000
## 7 8 9 10
## 1.00000000 0.00000000 0.00000000 0.00000000
# We can repeat this for each of the remaining genres; the full set of results is collected in ClusterMeans.ods
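# A compact alternative (a sketch, not part of the original analysis): build the whole
# cluster-by-genre table in one step, instead of calling tapply once per genre.
clusterGenreMeans = sapply(movies[2:20], function(g) tapply(g, clusterGroups, mean))
round(clusterGenreMeans, 2)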
# Find which cluster Men in Black is in.
subset(movies, Title=="Men in Black (1997)")
## Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997) 0 1 1 0 0
## Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257 1 0 0 0 0 0 0 0 0
## Romance SciFi Thriller War Western
## 257 0 1 0 0 0
clusterGroups[257]
## 257
## 2
# Create a new data set with just the movies from cluster 2
cluster2 = subset(movies, clusterGroups==2)
# Look at the first 10 titles in this cluster:
cluster2$Title[1:10]
## [1] GoldenEye (1995)
## [2] Bad Boys (1995)
## [3] Apollo 13 (1995)
## [4] Net, The (1995)
## [5] Natural Born Killers (1994)
## [6] Outbreak (1995)
## [7] Stargate (1994)
## [8] Fugitive, The (1993)
## [9] Jurassic Park (1993)
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... \301 k\366ldum klaka (Cold Fever) (1994)
# Cut the tree into just two clusters and compute the genre means within each cluster
clusterGroups2 = cutree(clusterMovies, k = 2)
spl = split(movies[2:20], clusterGroups2)
lapply(spl, colMeans)
## $`1`
## Unknown Action Adventure Animation Childrens Comedy
## 0.001545595 0.192426584 0.102782071 0.032457496 0.092735703 0.387944359
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.082689335 0.038639876 0.267387944 0.017001546 0.018547141 0.069551777
## Musical Mystery Romance SciFi Thriller War
## 0.043276662 0.046367852 0.188562597 0.077279753 0.191653787 0.054868624
## Western
## 0.020865533
##
## $`2`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 0
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
Recommendation systems are used in many areas other than movies. Jeff Bezos, the CEO of Amazon, said, “If I have 3 million customers on the web, I should have 3 million stores on the web.” The internet allows for mass personalization, and recommendation systems are a key part of that: they build models of users’ preferences to personalize the user experience.
Recommendation systems are a cornerstone of many of today’s top businesses. Social networking sites like Facebook, music streaming sites like Pandora, and retail companies like Amazon all provide recommendation systems for their users. Both collaborative filtering and content filtering are used in practice: collaborative filtering is used by companies like Amazon, Facebook, and Google News, while content filtering is used by companies like Pandora, Rotten Tomatoes, and See This Next. Netflix uses both.
In today’s digital age, businesses often have hundreds of thousands of items to offer their customers, whether they’re movies, songs, or people they might know on Facebook. Excellent recommendation systems can make or break these businesses. Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems. Clustering also has many other interesting applications.