Result of the exercise is a functional content-based recommender for movie recommendations, which takes as inputs:
titleFilmDF listing the movies and their corresponding genre affiliation (note that I only display the first seven attributes, more genres are Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller War Western)## movid movtitle unknown Action Adventure Animation Childrens
## 1 1 Toy Story (1995) 0 0 0 1 1
## 2 2 GoldenEye (1995) 0 1 1 0 0
## 3 3 Four Rooms (1995) 0 0 0 0 0
## 4 4 Get Shorty (1995) 0 1 0 0 0
## 5 5 Copycat (1995) 0 0 0 0 0
userDF with the ratings of a single user for the movies he watched## userid itemid rating
## 1 196 242 3
## 2 186 302 3
## 3 22 377 1
## 4 244 51 2
## 5 166 346 1
a userID for which the Recommender recommends movies
and the number no_films of movies to recommend.
Our first function clusters the movies based on their genre affiliation using k-means. As a decision criterion for the optimal cluster number an additional cluster should reduce heterogeneity to not less than 20%.
In case of hetergeneity, we want to look at the the sum of the squared distance between each member of a cluster and its cluster centroid (SSE: sum of squared error). As the number of clusters increases, the SSE should decrease because clusters are, by definition, smaller. Here it stops building clusters when SSE decreases by less than 20%.
tot.withinss = sum(withinss of every cluster)
clusterFilms<-function(titleFilmDF){
set.seed(123)
i<-1
#get rid of movie ids and titles
titleFilmDF<-titleFilmDF[,c(-1,-2)]
repeat {
set.seed(123)
#build two kmeans models starting with 2 and 3 clusters and repeat until dss<0.2
i <- i + 1
movieCluster<-kmeans(titleFilmDF,i)
movieCluster2<-kmeans(titleFilmDF,i+1)
#decision criterion
dss<-((movieCluster$tot.withinss-movieCluster2$tot.withinss)/movieCluster$tot.withinss)
#exit if dss < 0.2
if (dss < 0.2) break
}
return(movieCluster)
}
Find all the movies with the associated ratings our selected user has already watched. The return value is an activeUser data frame with the columns “itemid”(=movieid), “rating” and “cluster”.
“cluster” is set to the dummy value zero.
getUserInfo<-function(dat,id){
#Select all rows from user_DF that have the userid==user_id and keep the columns itemid & rating
a<-subset(dat, userid==id,select=c(itemid, rating))
# allocate 0 to the cluster column
cluster<-0
activeUser <- data.frame( a[order(a$itemid),] ,cluster)
return(activeUser)
}
Here we assign to each movie the corresponding cluster number.
setUserFilmCluster<-function(movieCluster, activeUser){
# set up temporary dataframe to match cluster assignments to movie ids
df1<- data.frame(cbind(titleFilmDF$movid, clusterNum = movieCluster$cluster))
names(df1)<-c("movie_id", "cluster")
#This matches the cluster number to the activeUser movie id
activeUser$cluster<-df1[match(activeUser$itemid, df1$movie_id),2]
return(activeUser)
}
Calculate for each cluster the average of the movie ratings. The return value is “like”, an integer vector, in which all clusters whose average rating is greater or equal 3 are included.
If we do not find clusters with >= 3 rating, we give back a dummy value of zero.
getMeanClusterRating<-function(movieCluster, activeUser){
#aggregate() function is used along with the cluster memberships to determine variable means for each cluster in the original metric
like<-aggregate(activeUser$rating, by=list(cluster=activeUser$cluster), mean)
#A bit different approach here: If the max mean rating is below three it gives out the dummy value zero
if(max(like$x)<3){
like<-as.vector(0)
#Else it gives out the cluster number of the max mean value
} else{
like<-as.vector(t(max(subset(like, x>=3, select=cluster))))
}
return(like)
}
If we have several clusters which have a greater or equal “3” rating, we select the highest rating cluster. Now we search all movies of this cluster (both movies our user has and has not yet watched).
If there is no cluster with a baseline rating of 3 or above, select at random 100 movies. The return value is an integer “recommend” vector containing the found or randomly selected movie IDs.
getGoodFilms<-function(like, movieCluster, titleFilmDF){
# Again a temporary dataframe is created to get a list of all movies and their associated clusters
df1<- data.frame(cbind(titleFilmDF$movid, clusterNum = movieCluster$cluster))
names(df1)<-c("movie_id", "cluster")
#if like has the value zero it selects randomly 100 movies
if(like==0){
recommend<-titleFilmDf[sample.int(n = dim(titleFilmDF)[1], size = 100), 1]
}
#else it selects all movies from the winning max mean cluster
else{
recommend<-as.vector(t(subset(df1, cluster==like, select=movie_id)))
}
return(recommend)
}
Now we see the previously implemented functions in order to perform the following calculations:
Create a movie cluster and find the relevant information about our selected user.
Calculate the average rating per cluster and find the movies to the cluster, which our user likes the most.
Select all movies our user has not yet seen. The return value contains in addition to the movie ID the movie title (movtitle).
getRecommendedFilms<-function(titleFilmDF, userDF, userid){
# according to plan we call all functions in order of our logic
movieCluster<-clusterFilms(titleFilmDF)
activeUser<-getUserInfo(userDF, userid)
activeUser<-setUserFilmCluster(movieCluster, activeUser)
like<-getMeanClusterRating(movieCluster, activeUser)
recommend<-getGoodFilms(like, movieCluster, titleFilmDF)
# only select not yet watched movies
recommend<-recommend[-activeUser$itemid]
# add movietitle
movtitle<-titleFilmDF[match(recommend,titleFilmDF$movid),2]
recommend<-data.frame(recommend,movtitle)
return(recommend)
}
We’re almost done! This function recommends a particular user (userid) a certain number (no_films) of movies.
suggestFilms<-function(titleFilmDF, userDF, userid, no_films){
#get suggestions
suggestions = getRecommendedFilms(titleFilmDF, userDF, userid)
#select stated number of selections
suggestions = suggestions[1:no_films,]
#implementing some German here
writeLines("You may also like these movies:")
#print suggestions without column headers or row indices
write.table(suggestions[2], row.names = FALSE, col.names = FALSE)
}
Our finished recommender should look like this:
suggestFilms(titleFilmDF, userDF, 6, 15)
## You may also like these movies:
## "Get Shorty (1995)"
## "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
## "Twelve Monkeys (1995)"
## "Babe (1995)"
## "Dead Man Walking (1995)"
## "Mighty Aphrodite (1995)"
## "Mr. Holland's Opus (1995)"
## "French Twist (Gazon maudit) (1995)"
## "Rumble in the Bronx (1995)"
## "Birdcage, The (1996)"
## "Brothers McMullen, The (1995)"
## "Batman Forever (1995)"
## "Free Willy 2: The Adventure Home (1995)"
## "Mad Love (1995)"
## "Nadja (1994)"