Anil Kumar

Lecture Note

Hybrid Recommendation Systems

Netflix uses both collaborative and content-based filtering.

In this presentation we use the MovieLens data set. It contains users' preferences about movies, and we will use collaborative filtering to make recommendations.

Content filtering can be done with a technique called clustering.

Movies in the data set are categorized by genre, for example action, adventure, animation, and war. A movie can belong to several genres. Our task is to find the movies that belong to the same genres.

We use the following algorithms in this presentation:

Hierarchical
K-means

We group items based on the distance between them. There are several ways to measure the distance between two points:

Euclidean distance (the most popular)
Manhattan Distance
Maximum Coordinate Distance
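
As a small illustration of the three measures, the sketch below applies R's dist function to two made-up points. The points p and q are assumptions chosen for the example, not values from the data set.

```r
# Two hypothetical points in three dimensions (illustrative values only)
pts = rbind(p = c(1, 2, 3),
            q = c(4, 6, 3))

# Euclidean: square root of the sum of squared coordinate differences
dEuclidean = as.numeric(dist(pts, method = "euclidean"))

# Manhattan: sum of the absolute coordinate differences
dManhattan = as.numeric(dist(pts, method = "manhattan"))

# Maximum coordinate: largest absolute coordinate difference
dMaximum = as.numeric(dist(pts, method = "maximum"))

print(c(euclidean = dEuclidean, manhattan = dManhattan, maximum = dMaximum))
```

The coordinate differences here are 3, 4 and 0, so the three measures give 5, 7 and 4 respectively.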

Distance Between Clusters

After grouping the points into clusters, we need the distance between clusters. Here we measure it as the distance between the centroids of the clusters. Distances can be strongly influenced by the scale of the variables, so normalization is sometimes required.
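
The effect of scale can be seen in a minimal sketch. The data frame toy, its columns, and the cluster assignment in groups are all hypothetical values invented for this example. The genre columns used later in this note are all 0/1 indicators, so they already share a scale.

```r
# Hypothetical data: two variables measured on very different scales
toy = data.frame(height_m = c(1.6, 1.7, 1.8, 1.9),
                 income   = c(20000, 21000, 60000, 61000))
groups = c(1, 1, 2, 2)   # assumed cluster assignment, for illustration only

# Centroid of a cluster: the column means of its members
centroids = rbind(colMeans(toy[groups == 1, ]),
                  colMeans(toy[groups == 2, ]))

# Centroid distance on the raw scale is dominated by the income column
rawDistance = as.numeric(dist(centroids))

# Normalize each column to mean 0 and standard deviation 1, then repeat
toyScaled = scale(toy)
centroidsScaled = rbind(colMeans(toyScaled[groups == 1, ]),
                        colMeans(toyScaled[groups == 2, ]))
scaledDistance = as.numeric(dist(centroidsScaled))

print(c(raw = rawDistance, scaled = scaledDistance))
```

After scaling, both variables contribute to the distance instead of income alone.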

Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems.

Load the Data

movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")

Structure of the data

str(movies)

## 'data.frame':    1682 obs. of  24 variables:
##  $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2 : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
##  $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ V4 : logi  NA NA NA NA NA NA ...
##  $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact?06%20(1994)",..: 1431 564 504 542 309 1661 1453 102 356 1183 ...
##  $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
##  $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
##  $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
##  $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ V24: int  0 0 0 0 0 0 0 0 0 0 ...

The data has no column names, so let's add them.

Add column names

colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", 
                     "Unknown", "Action", "Adventure", "Animation", "Childrens", 
                     "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", 
                     "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", 
                     "War", "Western")

Structure of the data with column names

str(movies)

## 'data.frame':    1682 obs. of  24 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title           : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
##  $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
##  $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact?06%20(1994)",..: 1431 564 504 542 309 1661 1453 102 356 1183 ...
##  $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens       : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy          : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...

Some of the variables are not used in this analysis, so let's remove them.

Remove unnecessary variables

movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

Remove duplicates

movies = unique(movies)

Take a look at our data again:

str(movies)

## 'data.frame':    1664 obs. of  20 variables:
##  $ Title      : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens  : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...

Compute distances

distances = dist(movies[2:20], method = "euclidean")
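
To see what this distance means for 0/1 genre columns, the sketch below rebuilds the genre indicators of the first two rows shown in the str(movies) output and compares them. The helper names genres, row1 and row2 are introduced only for this illustration.

```r
genres = c("Unknown", "Action", "Adventure", "Animation", "Childrens",
           "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir",
           "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller",
           "War", "Western")

# Row 1 in the str() output: Animation, Childrens and Comedy are 1
row1 = setNames(rep(0, length(genres)), genres)
row1[c("Animation", "Childrens", "Comedy")] = 1

# Row 2 in the str() output: Action, Adventure and Thriller are 1
row2 = setNames(rep(0, length(genres)), genres)
row2[c("Action", "Adventure", "Thriller")] = 1

# Euclidean distance between the two genre vectors
d = as.numeric(dist(rbind(row1, row2), method = "euclidean"))

# For 0/1 data this equals the square root of the number of differing genres
nDiffer = sum(row1 != row2)
print(c(distance = d, sqrtDiffering = sqrt(nDiffer)))
```

The two movies differ in six genres, so their distance is the square root of 6. Two movies with identical genre flags have distance 0.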

Hierarchical clustering

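clusterMovies = hclust(distances, method = "ward.D")

Note: recent versions of R call this method "ward.D". The old name "ward" is still accepted but prints a message about the renaming. A separate "ward.D2" method is also available.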

Plot the dendrogram

plot(clusterMovies)

[Figure: cluster dendrogram of the movies]

Assign points to clusters

clusterGroups = cutree(clusterMovies, k = 10)
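
The same three steps (distances, tree, cut) can be tried on a tiny example where the result is easy to check by eye. The table toy and its six movies are made up for this sketch, and "ward.D" is the current name R uses for the "ward" method.

```r
# Hypothetical 0/1 genre table for six made-up movies
toy = data.frame(Action  = c(1, 1, 1, 0, 0, 0),
                 Romance = c(0, 0, 0, 1, 1, 1),
                 Comedy  = c(0, 1, 0, 0, 1, 0),
                 row.names = paste("Movie", LETTERS[1:6]))

# Pairwise distances, hierarchical tree, then cut the tree into 2 clusters
toyDistances = dist(toy, method = "euclidean")
toyTree = hclust(toyDistances, method = "ward.D")
toyGroups = cutree(toyTree, k = 2)

# One cluster label per movie, in the same order as the rows of toy
print(toyGroups)

# Number of movies in each cluster
print(table(toyGroups))
```

cutree returns one label per row, in row order, which is why the labels can later be used to subset the original data frame.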

Movies in cluster

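Now let's find what fraction of the movies in each cluster belongs to a given genre. Because each genre column holds only 0 and 1, its mean within a cluster is exactly that fraction, and tapply computes the mean for every cluster in one call.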

tapply(movies$Action, clusterGroups, mean)

##         1         2         3         4         5         6         7 
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 
##         8         9        10 
## 0.0000000 0.0000000 0.0000000

tapply(movies$Romance, clusterGroups, mean)

##          1          2          3          4          5          6 
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 
##          7          8          9         10 
## 1.00000000 0.00000000 0.00000000 0.00000000
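
Instead of calling tapply once per genre, the same calculation can be applied to all genre columns at once. This is only a sketch on made-up data: toy and toyGroups are hypothetical stand-ins for the genre columns of movies and for clusterGroups.

```r
# Hypothetical genre flags for five movies and an assumed cluster label each
toy = data.frame(Action  = c(1, 1, 0, 0, 0),
                 Romance = c(0, 0, 1, 1, 0),
                 Drama   = c(0, 1, 1, 0, 1))
toyGroups = c(1, 1, 2, 2, 2)

# For every genre column, take the mean within each cluster
genreShare = sapply(toy, function(column) tapply(column, toyGroups, mean))

# Rows are clusters, columns are genres
print(round(genreShare, 2))
```

On the real data the equivalent call would use movies[2:20] and clusterGroups, giving one row per cluster and one column per genre.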

Cluster Search

Now let's find the cluster of a particular movie, such as Men in Black.

subset(movies, Title=="Men in Black (1997)")

##                   Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997)       0      1         1         0         0
##     Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257      1     0           0     0       0        0      0       0       0
##     Romance SciFi Thriller War Western
## 257       0     1        0   0       0

Men in Black is in row 257, so we can look up the cluster assigned to that row:

clusterGroups[257]

## 257 
##   2
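
Looking up the row number by hand can be avoided. Since the cluster labels are in the same order as the rows of the data, a logical test on the title selects the matching label directly. The sketch below shows the idea on made-up values; toyTitles and toyGroups are hypothetical.

```r
# Hypothetical titles and cluster labels, in the same row order
toyTitles = c("Movie A (1995)", "Movie B (1996)", "Movie C (1997)")
toyGroups = c(1, 2, 2)

# Row labels can have gaps after duplicates are removed, so positions and
# labels need not agree; the logical test below does not depend on either
names(toyGroups) = c("1", "2", "4")

# Select the cluster label of the movie with the matching title
movieCluster = toyGroups[toyTitles == "Movie C (1997)"]
print(movieCluster)
```

On the real data the same idea would read clusterGroups[movies$Title == "Men in Black (1997)"].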

Subset of data

Let's create a data set containing only the movies in cluster 2.

cluster2 = subset(movies, clusterGroups==2)

Now we can view the first 10 titles in this cluster.

cluster2$Title[1:10]

##  [1] GoldenEye (1995)                              
##  [2] Bad Boys (1995)                               
##  [3] Apollo 13 (1995)                              
##  [4] Net, The (1995)                               
##  [5] Natural Born Killers (1994)                   
##  [6] Outbreak (1995)                               
##  [7] Stargate (1994)                               
##  [8] Fugitive, The (1993)                          
##  [9] Jurassic Park (1993)                          
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Zeus and Roxanne (1997)
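
One possible way to turn a cluster into recommendations is to return the other titles in the same cluster as a movie the user liked. The function recommendFromCluster and the toy data below are hypothetical and only sketch the idea.

```r
# Hypothetical titles with an assumed cluster label for each
toyTitles = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E")
toyGroups = c(1, 2, 2, 1, 2)

# Return every title that shares a cluster with the given title,
# excluding the title itself
recommendFromCluster = function(title, titles, groups) {
  target = groups[titles == title][1]      # cluster of the liked movie
  setdiff(titles[groups == target], title) # its cluster mates
}

recommended = recommendFromCluster("Movie B", toyTitles, toyGroups)
print(recommended)
```

With the real data, titles would be as.character(movies$Title) and groups would be clusterGroups. Converting the factor to character also avoids printing the long Levels line.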