Story of Netflix

Through this example, we’ll introduce the method of clustering. Netflix is an online DVD rental and streaming video service. Customers can receive movie rentals by mail, and they can also watch selected movies and TV shows online. Netflix has more than 40 million subscribers worldwide and has an annual revenue of $3.6 billion.

A key aspect of the company is being able to offer customers accurate movie recommendations based on a customer’s own preferences and viewing history. From 2006 through 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies. This algorithm would be useful for Netflix when making recommendations to users. Netflix provided a training data set of about 100 million user ratings and a test data set of about three million user ratings.

How their recommendation system is worth a million dollars.

When predicting user ratings, what data could be useful? There are two main types of data that we could use.

  • Every movie in Netflix’s database, we have a ranking from all users who have ranked that movie.
  • We know facts about the movie itself– the actors in the movie, the director, the genre classifications of the movie, the year it was released, et cetera.
  • This technique is called content filtering.

    Strengths and weaknesses to both types of recommendation systems

    Collaborative filtering can accurately suggest complex items without understanding the nature of the items. It didn’t matter at all that our items were movies in the collaborative filtering example. We were just comparing user ratings. However, this requires a lot of data about the user to make accurate recommendations. Also, when there are millions of items, it needs a lot of computing power to compute the user similarities. On the other hand, content filtering requires very little data to get started. But the major weakness of content filtering is that it can be limited in scope. You’re only recommending similar things to what the user has already liked. So the recommendations are often not surprising or particularly insightful. Netflix actually uses what’s called a hybrid recommendation system.

    We could then do content filtering as well, where we could find that the movie Terminator, which they both liked, is classified in almost the same set of genres as Starship Troopers. So then we could recommend Starship Troopers to both Amy and Carl, even though neither of them have seen it before. If we were only doing collaborative filtering, one of them would have had to have seen it before. And if we were only doing content filtering, we would only be recommending to one user at a time. So by combining the two methods, the algorithm can be much more efficient and accurate.

    movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
    
    str(movies)
    ## 'data.frame':    1682 obs. of  24 variables:
    ##  $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
    ##  $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
    ##  $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
    ##  $ V4 : logi  NA NA NA NA NA NA ...
    ##  $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1377 565 505 542 310 1661 1399 103 357 1130 ...
    ##  $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
    ##  $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
    ##  $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
    ##  $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
    ##  $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
    ##  $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
    ##  $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
    ##  $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
    ##  $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
    ##  $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
    ##  $ V24: int  0 0 0 0 0 0 0 0 0 0 ...
    colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
    
    str(movies)
    ## 'data.frame':    1682 obs. of  24 variables:
    ##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
    ##  $ Title           : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
    ##  $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
    ##  $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
    ##  $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1377 565 505 542 310 1661 1399 103 357 1130 ...
    ##  $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
    ##  $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
    ##  $ Animation       : int  1 0 0 0 0 0 0 0 0 0 ...
    ##  $ Childrens       : int  1 0 0 0 0 0 0 1 0 0 ...
    ##  $ Comedy          : int  1 0 0 1 0 0 0 1 0 0 ...
    ##  $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
    ##  $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
    ##  $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
    ##  $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
    ##  $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
    ##  $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...
    movies$ID = NULL
    movies$ReleaseDate = NULL
    movies$VideoReleaseDate = NULL
    movies$IMDB = NULL

    Remove duplicates

    movies = unique(movies)
    
    str(movies)
    ## 'data.frame':    1664 obs. of  20 variables:
    ##  $ Title      : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 592 342 1317 1544 110 390 1239 ...
    ##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
    ##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
    ##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
    ##  $ Childrens  : int  1 0 0 0 0 0 0 1 0 0 ...
    ##  $ Comedy     : int  1 0 0 1 0 0 0 1 0 0 ...
    ##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
    ##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
    ##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
    ##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
    ##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
    ##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...
    table(movies$Comedy == 1)
    ## 
    ## FALSE  TRUE 
    ##  1162   502
    table(movies$Western == 1)
    ## 
    ## FALSE  TRUE 
    ##  1637    27
    table((movies$Romance == 1) & (movies$Drama == 1))
    ## 
    ## FALSE  TRUE 
    ##  1567    97

    Hierarchical Clustering in R

    # Compute distances
    distances = dist(movies[2:20], method = "euclidean")
    
    # Hierarchical clustering
    clusterMovies = hclust(distances, method = "ward") 
    ## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
    # Plot the dendrogram
    plot(clusterMovies)

    # Assign points to clusters
    clusterGroups = cutree(clusterMovies, k = 10)
    
    #Now let's figure out what the clusters are like.
    
    # Let's use the tapply function to compute the percentage of movies in each genre and cluster
    
    tapply(movies$Action, clusterGroups, mean)
    ##         1         2         3         4         5         6         7 
    ## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 
    ##         8         9        10 
    ## 0.0000000 0.0000000 0.0000000
    tapply(movies$Romance, clusterGroups, mean)
    ##          1          2          3          4          5          6 
    ## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 
    ##          7          8          9         10 
    ## 1.00000000 0.00000000 0.00000000 0.00000000
    # We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods
    
    
    # Find which cluster Men in Black is in.
    
    subset(movies, Title=="Men in Black (1997)")
    ##                   Title Unknown Action Adventure Animation Childrens
    ## 257 Men in Black (1997)       0      1         1         0         0
    ##     Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
    ## 257      1     0           0     0       0        0      0       0       0
    ##     Romance SciFi Thriller War Western
    ## 257       0     1        0   0       0
    clusterGroups[257]
    ## 257 
    ##   2
    # Create a new data set with just the movies from cluster 2
    cluster2 = subset(movies, clusterGroups==2)
    
    # Look at the first 10 titles in this cluster:
    cluster2$Title[1:10]
    ##  [1] GoldenEye (1995)                              
    ##  [2] Bad Boys (1995)                               
    ##  [3] Apollo 13 (1995)                              
    ##  [4] Net, The (1995)                               
    ##  [5] Natural Born Killers (1994)                   
    ##  [6] Outbreak (1995)                               
    ##  [7] Stargate (1994)                               
    ##  [8] Fugitive, The (1993)                          
    ##  [9] Jurassic Park (1993)                          
    ## [10] Robert A. Heinlein's The Puppet Masters (1994)
    ## 1664 Levels: 'Til There Was You (1997) ... \301 k\366ldum klaka (Cold Fever) (1994)
    clusterGroups2 = cutree(clusterMovies, k = 2)
    spl = split(movies[2:20], clusterGroups2)
    lapply(spl, colMeans)
    ## $`1`
    ##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
    ## 0.001545595 0.192426584 0.102782071 0.032457496 0.092735703 0.387944359 
    ##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
    ## 0.082689335 0.038639876 0.267387944 0.017001546 0.018547141 0.069551777 
    ##     Musical     Mystery     Romance       SciFi    Thriller         War 
    ## 0.043276662 0.046367852 0.188562597 0.077279753 0.191653787 0.054868624 
    ##     Western 
    ## 0.020865533 
    ## 
    ## $`2`
    ##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
    ##           0           0           0           0           0           0 
    ##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
    ##           0           0           1           0           0           0 
    ##     Musical     Mystery     Romance       SciFi    Thriller         War 
    ##           0           0           0           0           0           0 
    ##     Western 
    ##           0

    The Analytics Edge of Recommendation Systems

    Recommendation systems are used in many different areasother than movies. Jeff Bezos, the CEO of Amazon, said that, “If I have 3 million customers on the web, I should have 3 million stores on the web.” The internet allows for mass personalization, and recommendation systems are a key part of that. Recommendation systems build models about users’ preferences to personalize the user experience.

    Recommendation systems are a cornerstone of these top businesses. Social networking sites, like Facebook, music streaming sites, like Pandora, and retail companies, like Amazon, all provide recommendation systems for their users. Both collaborative filtering and content filtering are used in practice. Collaborative filtering is used by companies like Amazon, Facebook, and Google News. Content filtering is used by companies like Pandora, Rotten Tomatoes, and See This Next. And Netflix uses both collaborative filtering and content filtering.

    In today’s digital age, businesses often have hundreds of thousands of items to offer their customers, whether they’re movies, songs , or people they might know on Facebook. Excellent recommendation systems can make or break these businesses. Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems. Clustering also has many other interesting applications.