netflix is using both collaborative and content base filtering
Here in this presentaion we are using the data movieLens this data contain user preference about movies and we will do collaborative filtering to make recommendation.
using the technique called clustering we can do content filtering.
Movies in the dataset are categorized as belonging to different genres: like action, adventure, animation and war. a movie can belong to many genres our task is find the movies that belong to the same genres.
we are using folling algorithms in this presentation
we group the items based on the distance between them by using different approachs of finding the distance between points.
after grouping the points into groups, now find the distance between the each clusters. this distance basicaly we are calculating the distance between the centroids of clusters. some times distance is highly influenced by the scale of variables, so normalization is required some times.
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact?06%20(1994)",..: 1431 564 504 542 309 1661 1453 102 356 1183 ...
## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ V10: int 1 0 0 0 0 0 0 1 0 0 ...
## $ V11: int 1 0 0 1 0 0 0 1 0 0 ...
## $ V12: int 0 0 0 0 1 0 0 0 0 0 ...
## $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V14: int 0 0 0 1 1 1 1 1 1 1 ...
## $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V16: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V17: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V20: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V21: int 0 0 0 0 0 0 1 0 0 0 ...
## $ V22: int 0 1 1 0 1 0 0 0 0 0 ...
## $ V23: int 0 0 0 0 0 0 0 0 0 1 ...
## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...
there no column name present in the data so let add column name
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB",
"Unknown", "Action", "Adventure", "Animation", "Childrens",
"Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir",
"Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller",
"War", "Western")
str(movies)
## 'data.frame': 1682 obs. of 24 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ VideoReleaseDate: logi NA NA NA NA NA NA ...
## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact?06%20(1994)",..: 1431 564 504 542 309 1661 1453 102 356 1183 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
In this analysis some of the variables are not use so let remove those variables.
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
movies = unique(movies)
str(movies)
## 'data.frame': 1664 obs. of 20 variables:
## $ Title : Factor w/ 1664 levels "101 Dalmatians (1996)",..: 1525 616 553 592 341 1317 1545 107 389 1238 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
distances = dist(movies[2:20], method = "euclidean")
clusterMovies = hclust(distances, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(clusterMovies)
clusterGroups = cutree(clusterMovies, k = 10)
now let's find how many percentage of movies are present in each cluster this can be done by using the tapply function on each Groups.
tapply(movies$Action, clusterGroups, mean)
## 1 2 3 4 5 6 7
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000
## 8 9 10
## 0.0000000 0.0000000 0.0000000
tapply(movies$Romance, clusterGroups, mean)
## 1 2 3 4 5 6
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000
## 7 8 9 10
## 1.00000000 0.00000000 0.00000000 0.00000000
Now let's find cluster of a movie like Men in Black.
subset(movies, Title=="Men in Black (1997)")
## Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997) 0 1 1 0 0
## Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257 1 0 0 0 0 0 0 0 0
## Romance SciFi Thriller War Western
## 257 0 1 0 0 0
we can print this
clusterGroups[257]
## 257
## 2
Let's create a data set for the cluster 2
cluster2 = subset(movies, clusterGroups==2)
now we can view this cluster for the first 10 titles
cluster2$Title[1:10]
## [1] GoldenEye (1995)
## [2] Bad Boys (1995)
## [3] Apollo 13 (1995)
## [4] Net, The (1995)
## [5] Natural Born Killers (1994)
## [6] Outbreak (1995)
## [7] Stargate (1994)
## [8] Fugitive, The (1993)
## [9] Jurassic Park (1993)
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Zeus and Roxanne (1997)