Online DVD rental and streaming video service
More than 40 million subscribers worldwide
$3.6 billion in revenue
A key aspect is offering customers accurate movie recommendations based on each customer’s own preferences and viewing history
From 2006 to 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies
A training data set of ~100,000,000 ratings and a test data set of ~3,000,000 ratings were provided
Offered a grand prize of $1,000,000 USD to the team who could beat Netflix’s own algorithm, Cinematch, by more than 10%, measured in RMSE
Netflix was willing to pay over $1M for the best user rating algorithm, which shows how critical the recommendation system was to their business
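The RMSE criterion mentioned above can be sketched in a few lines of R. RMSE is the square root of the mean squared difference between predicted and actual ratings, and "beating by more than 10%" is read here as a relative reduction in RMSE. The function name `rmse` and all the ratings below are made up for illustration; they are not Netflix Prize data.

```r
# Root mean squared error between predicted and actual ratings
rmse = function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Hypothetical ratings on a 1-5 scale, for illustration only
actual    = c(4, 3, 5, 2, 1)
baseline  = c(3, 3, 4, 3, 2)  # stand-in for an existing system's predictions
candidate = c(4, 3, 4, 2, 2)  # stand-in for a submitted algorithm's predictions

# Relative reduction in RMSE achieved by the candidate over the baseline
improvement = 1 - rmse(candidate, actual) / rmse(baseline, actual)

# TRUE if the candidate beats the baseline by more than 10%
improvement > 0.10
```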
What data could be used to predict user ratings?
Every movie in Netflix’s database has the ratings from all users who have rated that movie
We also know facts about the movie itself, actors, directors, genre classification, year released, etc.
Consider suggesting to Carl that he watch “Men in Black”, since Amy rated it highly and Carl and Amy seem to have similar preferences
This technique is called Collaborative Filtering
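A minimal sketch of the idea, assuming a small user-by-movie ratings matrix. The ratings, the extra user "Dana", and the extra movie columns are hypothetical; similarity is measured here with Pearson correlation over co-rated movies, which is one common choice and not necessarily what Netflix uses.

```r
# Hypothetical ratings (1-5); NA means the user has not rated the movie
ratings = rbind(
  Amy  = c(MenInBlack = 5,  Terminator = 5, Titanic = 1, Amelie = 2),
  Carl = c(MenInBlack = NA, Terminator = 5, Titanic = 2, Amelie = 1),
  Dana = c(MenInBlack = 1,  Terminator = 2, Titanic = 5, Amelie = 5)
)

# Correlation between Carl and every other user, using only co-rated movies
others = setdiff(rownames(ratings), "Carl")
similarity = sapply(others, function(u) {
  cor(ratings["Carl", ], ratings[u, ], use = "complete.obs")
})

# The most similar user is Carl's nearest neighbour
neighbour = names(which.max(similarity))

# Movies the neighbour rated 4 or higher that Carl has not rated yet
candidates = colnames(ratings)[which(is.na(ratings["Carl", ]) &
                                     ratings[neighbour, ] >= 4)]
candidates
```

Note that this approach needs no information about the movies themselves, only the ratings.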
Netflix uses both collaborative and content filtering
For example, consider a collaborative filtering approach where we determine that Amy and Carl have similar preferences
We could then do content filtering, where we would find that “Terminator”, which both Amy and Carl liked, is classified in almost the same set of genres as “Starship Troopers”
Recommend “Starship Troopers” to both Amy and Carl, even though neither of them has seen it before
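The content filtering step can be sketched by comparing movies on their genre flags. The 0/1 values below and the placeholder title "Romantic Comedy X" are hypothetical and only meant to show the mechanics: the movie with the smallest distance to one the user liked becomes the recommendation.

```r
# Hypothetical 0/1 genre flags, for illustration only
genres = rbind(
  "Terminator"        = c(Action = 1, SciFi = 1, Thriller = 1, Romance = 0, Comedy = 0),
  "Starship Troopers" = c(Action = 1, SciFi = 1, Thriller = 0, Romance = 0, Comedy = 0),
  "Romantic Comedy X" = c(Action = 0, SciFi = 0, Thriller = 0, Romance = 1, Comedy = 1)
)

# Pairwise Euclidean distances between the genre vectors
d = as.matrix(dist(genres, method = "euclidean"))

# Among the other movies, pick the one closest to a movie the user liked
liked = "Terminator"
others = setdiff(rownames(genres), liked)
closest = others[which.min(d[liked, others])]
closest
```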
www.movielens.org is a movie recommendation website run by GroupLens Research Lab at the University of Minnesota
They collect user preferences about movies and do collaborative filtering to make recommendations
We will use their movie database to do content filtering using a technique called clustering
Distance is highly influenced by the scale of the variables, so it is customary to normalize first
In our movie dataset, all genre variables are on the same scale and so normalization is not necessary
However, if we included a variable such as “Box Office Revenue”, we would need to normalize
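To illustrate why scale matters, here is a small sketch with a 0/1 genre flag next to a revenue column. The four films and their revenue figures are hypothetical. The sketch uses `scale()`, which centres each column and divides by its standard deviation; this is one common way to normalize, not the only one.

```r
# Hypothetical films: a 0/1 genre flag and box office revenue in dollars
films = data.frame(
  Action = c(1, 0, 1, 0),
  BoxOfficeRevenue = c(100e6, 101e6, 120e6, 20e6),
  row.names = c("A", "B", "C", "D")
)

# Without normalization, revenue differences swamp the genre flag
raw = as.matrix(dist(films))

# After scale(), each column has mean 0 and standard deviation 1
scaled = as.matrix(dist(scale(films)))

# Nearest film to A before and after normalization
nearestRaw = names(which.min(raw["A", c("B", "C", "D")]))
nearestScaled = names(which.min(scaled["A", c("B", "C", "D")]))
c(raw = nearestRaw, scaled = nearestScaled)
```

In this toy data, the unscaled distances pair A with B (a different genre but nearly the same revenue), while the scaled distances pair A with C (the same genre).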
In today’s digital age, businesses often have hundreds of thousands of items to offer their customers
Excellent recommendation systems can make or break these businesses
Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems.
# Load the dataset
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
# Output the structure of the dataset
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ V10: int 1 0 0 0 0 0 0 1 0 0 ...
## $ V11: int 1 0 0 1 0 0 0 1 0 0 ...
## $ V12: int 0 0 0 0 1 0 0 0 0 0 ...
## $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V14: int 0 0 0 1 1 1 1 1 1 1 ...
## $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V16: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V17: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V20: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V21: int 0 0 0 0 0 0 1 0 0 0 ...
## $ V22: int 0 1 1 0 1 0 0 0 0 0 ...
## $ V23: int 0 0 0 0 0 0 0 0 0 1 ...
## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...
# Add column names
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
# Output the structure again
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ VideoReleaseDate: logi NA NA NA NA NA NA ...
## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
# Remove unnecessary variables
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
# Remove duplicates
movies = unique(movies)
# Examine the structure
str(movies)
## 'data.frame': 1664 obs. of 20 variables:
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
# Compute Euclidean distances between movies using the 19 genre columns
distances = dist(movies[2:20], method = "euclidean")
# Run hierarchical clustering; "ward.D" is the current name of the method that older R versions called "ward"
clusterMovies = hclust(distances, method = "ward.D")
# Plot the dendrogram
plot(clusterMovies)
# Divide the points into 10 clusters
clusterGroups = cutree(clusterMovies, k = 10)
# Compare the mean of a genre variable across the clusters
z = tapply(movies$Action, clusterGroups, mean)  # fraction of Action movies in each cluster
x = 1:10
y = cbind(x, z)
library(knitr)  # provides kable()
kable(y)

| x | z |
|---|---|
| 1 | 0.1784512 |
| 2 | 0.7839196 |
| 3 | 0.1238532 |
| 4 | 0.0000000 |
| 5 | 0.0000000 |
| 6 | 0.1015625 |
| 7 | 0.0000000 |
| 8 | 0.0000000 |
| 9 | 0.0000000 |
| 10 | 0.0000000 |
z = tapply(movies$Romance, clusterGroups, mean)
y = cbind(x,z)
kable(y)

| x | z |
|---|---|
| 1 | 0.1043771 |
| 2 | 0.0452261 |
| 3 | 0.0366972 |
| 4 | 0.0000000 |
| 5 | 0.0000000 |
| 6 | 1.0000000 |
| 7 | 1.0000000 |
| 8 | 0.0000000 |
| 9 | 0.0000000 |
| 10 | 0.0000000 |
We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods
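One way to repeat the calculation for every genre at once is to apply `tapply` to each genre column. The sketch below uses small made-up stand-ins (`genreFlags`, `groups`) so that it runs on its own; with the real data, `movies[2:20]` and `clusterGroups` would take their place.

```r
# Toy stand-ins so the sketch runs on its own; with the real data,
# use movies[2:20] for genreFlags and clusterGroups for groups
genreFlags = data.frame(
  Action  = c(1, 1, 0, 0),
  Romance = c(0, 0, 1, 1),
  Comedy  = c(0, 1, 1, 0)
)
groups = c(1, 1, 2, 2)

# One row per cluster, one column per genre; each cell is the fraction
# of movies in that cluster that carry that genre
clusterMeans = sapply(genreFlags, function(col) tapply(col, groups, mean))
round(clusterMeans, 2)
```

The result is a matrix, so a single cluster's genre profile can be read off as one row.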
# Find the row for "Men in Black"
z = subset(movies, Title=="Men in Black (1997)")
kable(z)

| | Title | Unknown | Action | Adventure | Animation | Childrens | Comedy | Crime | Documentary | Drama | Fantasy | FilmNoir | Horror | Musical | Mystery | Romance | SciFi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 257 | Men in Black (1997) | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
clusterGroups[257]
## 257
## 2
# Create a subset with movies from cluster 2
cluster2 = subset(movies, clusterGroups==2)
# Examine the first 10 titles in the cluster
z = cluster2$Title[1:10]
kable(z)

| x |
|---|
| GoldenEye (1995) |
| Bad Boys (1995) |
| Apollo 13 (1995) |
| Net, The (1995) |
| Natural Born Killers (1994) |
| Outbreak (1995) |
| Stargate (1994) |
| Fugitive, The (1993) |
| Jurassic Park (1993) |
| Robert A. Heinlein’s The Puppet Masters (1994) |