1.0 Introduction

As someone who spends a shockingly large proportion of their life glued to a television or computer screen watching movies, the availability of quality movie recommendations is, dare I say, one of the key drivers of my general mood and disposition. However, instead of relying on recommendations from well-meaning, but regrettably tasteless, friends, I've turned to machine learning to help satisfy my film cravings.

Admittedly, this is an oft-explored topic in machine learning, with arguably one of the most prominent data science competitions of all time, the Netflix Prize, challenging teams from around the world to solve this very problem. While I hold no delusions that my solution will measure up to the winning entry, the aim of this project is to demonstrate that machine learning, even wielded by a novice, can lead to significant improvements in making recommendations.

2.0 Data Explanation

The data was collected from MovieLens, a working group of the University of Minnesota's GroupLens lab that aims to use the data from its movie recommendation site to develop new experimental tools and interfaces for data exploration and recommendation. The data was initially divided over three files: (1) ratings, (2) user information and (3) item (aka movie) information.

library(readr)

#The MovieLens files have no header row, so read every column in as character
#and supply the column names explicitly

#read in ratings
ratings <- read_delim("ml-100k/u.data", delim = "\t",
                      col_names = c("userID", "itemID", "rating", "timestamp"),
                      col_types = cols(.default = "c"))

#read in items (movies)
movies <- read_delim("ml-100k/u.item", delim = "|",
                     col_names = c("itemID", "title", "releaseDate", "videoReleaseDate",
                                   "IMDb", "unknown", "Action", "Adventure", "Animation",
                                   "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                                   "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi",
                                   "Thriller", "War", "Western"),
                     col_types = cols(.default = "c"))

#read in users
users <- read_delim("ml-100k/u.user", delim = "|",
                    col_names = c("userID", "age", "gender", "occupation", "zip"),
                    col_types = cols(.default = "c"))

#merge the data
merge1 <- merge(ratings, users)
merged_data <- merge(merge1, movies)

To give you an idea of the initial data, here's a breakdown of the variables, including their type and a brief description, followed by some sample rows.

Variable            Type       Description
itemID              Factor     ID for each movie
userID              Factor     ID for each user
rating              Factor     User's rating of the movie on a 5-point scale
timestamp           Integer    Time of review in UNIX seconds since 1/1/1970 UTC
age                 Integer    User's age
gender              Factor     User's gender
occupation          Factor     User's occupation
zip                 Factor     User's ZIP code
title               Character  Title of the film
releaseDate         Date       Film's release date
videoReleaseDate    Date       Release date of the home video
IMDb                Character  URL of the film's IMDb page
Genre columns (19)  Binary     Indicators of a film's genre(s) (1 if the movie is in the genre, 0 otherwise)
##   itemID userID rating timestamp age gender occupation   zip
## 1      1    311      4 884963202  32      M technician 73071
## 2      1    899      3 884120105  32      M      other 55116
## 3      1    495      4 888632943  29      M   engineer 03052
## 4      1    747      5 888639138  19      M      other 93612
## 5      1    417      4 879646413  27      F      other 48103
## 6      1    902      5 879465583  45      F     artist 97203
##              title releaseDate videoReleaseDate
## 1 Toy Story (1995) 01-Jan-1995             <NA>
## 2 Toy Story (1995) 01-Jan-1995             <NA>
## 3 Toy Story (1995) 01-Jan-1995             <NA>
## 4 Toy Story (1995) 01-Jan-1995             <NA>
## 5 Toy Story (1995) 01-Jan-1995             <NA>
## 6 Toy Story (1995) 01-Jan-1995             <NA>
##                                                    IMDb unknown Action
## 1 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 2 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 3 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 4 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 5 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 6 http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
##   Adventure Animation Children's Comedy Crime Documentary Drama Fantasy
## 1         0         1          1      1     0           0     0       0
## 2         0         1          1      1     0           0     0       0
## 3         0         1          1      1     0           0     0       0
## 4         0         1          1      1     0           0     0       0
## 5         0         1          1      1     0           0     0       0
## 6         0         1          1      1     0           0     0       0
##   Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
## 1         0      0       0       0       0      0        0   0       0
## 2         0      0       0       0       0      0        0   0       0
## 3         0      0       0       0       0      0        0   0       0
## 4         0      0       0       0       0      0        0   0       0
## 5         0      0       0       0       0      0        0   0       0
## 6         0      0       0       0       0      0        0   0       0

3.0 Origin and Collection

As previously mentioned, the data was collected from MovieLens, a site that provides personalized movie recommendations. Users can log onto the site to get movie recommendations and can also rate movies; those ratings are in turn used to further guide recommendations.

This mimics many popular movie and television streaming websites like Netflix and Hulu, and provides an interesting chance to try to build a recommendation system similar to the ones used on those sites.

4.0 Data Preparation

All of the columns were read in as characters, so before I could begin any exploratory analysis I first had to prepare the data by correcting their classes.

#Check out what the current classes are
sapply(merged_data, class)
##           itemID           userID           rating        timestamp 
##      "character"      "character"      "character"      "character" 
##              age           gender       occupation              zip 
##      "character"      "character"      "character"      "character" 
##            title      releaseDate videoReleaseDate             IMDb 
##      "character"      "character"      "character"      "character" 
##          unknown           Action        Adventure        Animation 
##      "character"      "character"      "character"      "character" 
##       Children's           Comedy            Crime      Documentary 
##      "character"      "character"      "character"      "character" 
##            Drama          Fantasy        Film-Noir           Horror 
##      "character"      "character"      "character"      "character" 
##          Musical          Mystery          Romance           Sci-Fi 
##      "character"      "character"      "character"      "character" 
##         Thriller              War          Western 
##      "character"      "character"      "character"
#Convert timestamp and age to numeric, and rating to a factor
merged_data$rating <- as.factor(merged_data$rating)
merged_data$timestamp <- as.numeric(merged_data$timestamp)
merged_data$age <- as.numeric(merged_data$age)

#Make gender and occupation factors
merged_data$gender <- as.factor(merged_data$gender)
merged_data$occupation <- as.factor(merged_data$occupation)

#Use lubridate to parse releaseDate as a Date variable
library(lubridate)
merged_data$releaseDate <- dmy(merged_data$releaseDate)

#Remove videoReleaseDate because there was no data for it
merged_data <- merged_data[,-11]

#Convert Genre columns into factors
merged_data[,12:30] <- lapply(merged_data[,12:30], as.factor)

5.0 Exploratory Analysis

To begin the exploratory analysis, I created a correlation plot to divine whether there were any variables that were particularly influential on ratings. Unfortunately, none of the original features were particularly correlated with rating, putting the onus on feature engineering to create some predictive features.
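
The plot itself isn't reproduced here, but a minimal sketch of how such a plot could be generated (assuming the corrplot package, and treating rating and the binary genre flags as numeric) looks roughly like this:

#Sketch only: correlation plot of rating against the numeric-coercible features
library(corrplot)
num_vars <- data.frame(
  rating = as.numeric(as.character(merged_data$rating)),
  age = merged_data$age,
  timestamp = merged_data$timestamp,
  lapply(merged_data[, 12:30], function(x) as.numeric(as.character(x)))
)
corrplot(cor(num_vars, use = "pairwise.complete.obs"), method = "circle", tl.cex = 0.6)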

I was also curious whether I was using a good metric to define whether someone liked a movie (i.e. the rating). My gut feeling was that a 5-point rating scale wasn't a good indicator, as people can have different rating styles (person A might always give a 4 to an average movie, whereas person B only hands out 4s for their favorites). I threw together a quick plot of the distribution of each user's average score to check my hypothesis.
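
The figure itself isn't reproduced here; a quick sketch of how it could be drawn (ggplot2 assumed) is:

#Sketch only: distribution of each user's mean rating
library(ggplot2)
user_means <- tapply(as.numeric(as.character(merged_data$rating)), merged_data$userID, mean)
ggplot(data.frame(mean_rating = user_means), aes(mean_rating)) +
  geom_histogram(bins = 30) +
  labs(x = "User's mean rating", y = "Number of users")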

As I suspected, the user mean ratings are quite variable. Each user rated at least 20 movies, so I doubt the distribution could be caused just by chance variance in the quality of movies. In the next section I’m going to have to construct a more robust measure of user preference.

6.0 Feature Engineering

Time to start transforming our data into something with predictive power! This section constituted the bulk of my work, so I've broken it down into four parts: (1) defining a success metric, (2) initial feature creation, (3) data scraping and (4) clustering.

6.1 Defining a success metric

As discussed in section 5, a 5-point rating system wasn't going to cut the mustard as a metric for determining a good movie recommendation. I began by standardizing each user's ratings using the plyr package. I then created a new metric, view, a simple indicator of whether a user would like or dislike a movie (a standardized rating above 0 indicates like, below 0 indicates dislike). While not a perfect metric (what if a user genuinely liked every movie they rated?), it should be a decent one given the data I had to work with. In the future it would be beneficial to collect explicit, user-generated like/dislike data instead.

#Standardize the ratings for each user (plyr's ddply)
library(plyr)
merged_data <- ddply(merged_data, "userID", transform, std.ratings = scale(as.numeric(rating)))
#Create a new view variable which deems whether a user will like or dislike a movie
merged_data <- transform(merged_data, view = ifelse(std.ratings>0, 1, -1))
merged_data$view <- factor(merged_data$view, levels=c(1,-1), labels = c("LIKE","DISLIKE"))
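
As a quick sanity check (just a sketch), the overall like/dislike balance and each user's share of LIKEs can be inspected before moving on:

#Quick check of the like/dislike balance produced by the new metric
table(merged_data$view)
#spread of each user's proportion of LIKEs
summary(tapply(merged_data$view == "LIKE", merged_data$userID, mean))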

6.2 Initial Features

Let's start by crafting some new features out of our existing data. The first step is to turn timestamp into something useful: actual dates.

I assumed that time could have some impact on a review and constructed three new features to explore the effect. The first, timeSinceRelease, measures the time between when a movie was released and when it was reviewed. The latter two, monthReviewed and dayReviewed, are the month and the day of the week on which the movie was reviewed.

#convert timestamp to actual dates
merged_data$timestamp <- as.Date(as.POSIXct(merged_data$timestamp, origin="1970-01-01"))

#new feature - amount of time since release date and rating
merged_data <- mutate(merged_data, timeSinceRelease = as.double(timestamp-releaseDate))

#New features - Month and day of the week reviewed
merged_data$monthReviewed <- as.factor(month(merged_data$timestamp))
merged_data$dayReviewed <- as.factor(wday(merged_data$timestamp))

6.3 Data Scraping

Looking at the movie data, it was clear there wasn't enough information on each movie to guide a real recommendation (the data only contained genre, release date and title). I love some new comedies, but that doesn't mean I'm going to enjoy the newest Adam Sandler flick! To get more data I used the OMDb API to scrape additional information about each of the movies. Note: the code to scrape the data is commented out because it is a very time-consuming process; the results were saved to an RData object that is used in the rest of the code.

#Code to scrape more data on each movie
#titles <- unique(merged_data$title)
#library(stringr)
#library(RCurl)
#library(jsonlite)

#Use the titles in the data set to generate the GET request urls 
#titles <- str_replace_all(titles, "[(].*[)]", "")
#titles <- str_replace_all(titles, ", The", "")
#titles <- str_replace_all(titles, ",", "")
#titles <- str_trim(titles)
#titles <- str_replace_all(titles, " ", "+")
#http://www.omdbapi.com/?t=toy+story&y=&plot=short&r=json
#urls <- paste("http://www.omdbapi.com/?t=", titles, "&y=&plot=short&r=json", sep="")

#Scrape all the JSON data into a list
#IMDBdat <- lapply(urls, function(x) fromJSON(getURL(x)))

#Set up the IMDB data frame with the correct column names and character class
#IMDBdf <- data.frame()
#IMDBdf <- rbind(IMDBdf, unlist(IMDBdat[1]))
#names(IMDBdf) <- names(unlist(IMDBdat[1]))
#IMDBdf <-IMDBdf[-1,]
#IMDBdf <- as.data.frame(lapply(IMDBdf, as.character))

#Read in all the IMDB data into the IMDB data frame
#IMDBdf <-do.call(rbind, lapply(IMDBdat, unlist))
#Remove items for which the GET request failed
#IMDBdf <- IMDBdf[!(IMDBdf[,1] == "False"),]

#save(IMDBdf, file = "IMDBdat.RData")

Now that we have all this new information about the movies, it's time to clean it up and engineer some features! The new OMDb data adds Rated (the film's content rating), Runtime, Language, Country and imdbRating features. From these I also derived two new features, English and Foreign, which indicate whether the movie uses English as its primary language and whether the movie was made outside the USA, respectively (I have a feeling people may not be as amenable to foreign films, or films they have to watch with subtitles).

load("IMDBdat.RData")
#Grab the useful columns
IMDB.clean <- IMDBdf[, c("Title", "Rated", "Runtime", "Language", "Country", "imdbRating")]
IMDB.clean <- as.data.frame(IMDB.clean)

#fix up the classes
IMDB.clean$imdbRating <- as.numeric(IMDB.clean$imdbRating)
IMDB.clean$Rated <- as.factor(IMDB.clean$Rated)
IMDB.clean$Title <- as.character(IMDB.clean$Title)

#Runtimes expressed as (X min) - extract the numeric time
IMDB.clean$Runtime <- sapply(IMDB.clean$Runtime, function(x) {as.numeric(str_split(x, " ")[[1]][1])})


#engineered "English" feature - says if movie is in English or not
IMDB.clean$English <- ifelse(str_detect(IMDB.clean$Language, "English"),T, F)
#engineered "Foreign" feature - says if movie is foreign to the USA
IMDB.clean$Foreign <- ifelse(str_detect(IMDB.clean$Country, "USA"),F, T)

IMDB.clean2 <- na.omit(IMDB.clean)
names(IMDB.clean2)[1] <- "title"

#Merge all these new features to the data set
#clean up titles - going to use this to merge it
merged_data$title <- str_replace_all(merged_data$title, "[(].*[)]", "")
merged_data$title[str_detect(merged_data$title, ", The")] <- sapply(merged_data$title[str_detect(merged_data$title, ", The")], function(x){
  x<- str_replace_all(x, ", The", "")
  x<- str_trim(x)
  x<- paste("The", x, sep = " ")
})
merged_data$title <- str_trim(merged_data$title)

movies.full <- merge(merged_data, IMDB.clean2)

6.4 Clustering

One major shortcoming of my features up to now is the model's inability to incorporate previous rating information into its predictions. Each user in this dataset has rated at least 20 movies (all included in the data), information that should be pretty telling of what movies they'll like in the future (some people may love comedies but hate pretentious dramas, a preference that is captured in past reviews but underutilized in the model).

To incorporate past reviews I decided to use a clustering algorithm to group users with similar taste preferences. Each user's review history was aggregated into a breakdown of which types of movies they watch on average and how their standardized ratings compare to standardized IMDb ratings; the comparison to IMDb was used instead of a raw average standardized score to control for objectively better or worse movies. Recommendations for a user can then draw on the ratings of previous users with similar taste profiles.

#Switch from plyr to dplyr for the aggregation (detach plyr to avoid function conflicts)
library(dplyr)
detach(package:plyr)


movies.full$userID <- as.factor(movies.full$userID)
movies.full$std.imdbRating <- scale(movies.full$imdbRating)

clustering_data <- movies.full %>% group_by(userID) %>% summarise(
  nCrime = sum(Crime==1),
  pCrime = nCrime/length(Crime),
  rCrime = ifelse(is.na(mean(std.ratings[Crime==1]))|is.nan(mean(std.ratings[Crime==1])), 0, mean(std.ratings[Crime==1])),
  iCrime = ifelse(is.na(mean(std.imdbRating[Crime==1]))|is.nan(mean(std.imdbRating[Crime==1])), 0, mean(std.imdbRating[Crime==1])),
  dCrime = rCrime - iCrime,
                  
  nAction = sum(Action==1),
  pAction = nAction/length(Action),
  rAction = ifelse(is.na(mean(std.ratings[Action==1]))|is.nan(mean(std.ratings[Action==1])), 0, mean(std.ratings[Action==1])),
  iAction = ifelse(is.na(mean(std.imdbRating[Action==1]))|is.nan(mean(std.imdbRating[Action==1])), 0, mean(std.imdbRating[Action==1])),
  dAction = rAction - iAction,
  
  nComedy = sum(Comedy==1),
  pComedy = nComedy/length(Comedy),
  rComedy = ifelse(is.na(mean(std.ratings[Comedy==1]))|is.nan(mean(std.ratings[Comedy==1])), 0, mean(std.ratings[Comedy==1])),
  iComedy = ifelse(is.na(mean(std.imdbRating[Comedy==1]))|is.nan(mean(std.imdbRating[Comedy==1])), 0, mean(std.imdbRating[Comedy==1])),
  dComedy = rComedy - iComedy,
  
  nAnimation= sum(Animation==1),
  pAnimation = nAnimation/length(Animation),
  rAnimation = ifelse(is.na(mean(std.ratings[Animation==1]))|is.nan(mean(std.ratings[Animation==1])), 0, mean(std.ratings[Animation==1])),
  iAnimation = ifelse(is.na(mean(std.imdbRating[Animation==1]))|is.nan(mean(std.imdbRating[Animation==1])), 0, mean(std.imdbRating[Animation==1])),
  dAnimation = rAnimation - iAnimation,
  
  nHorror = sum(Horror==1),
  pHorror = nHorror/length(Horror),
  rHorror = ifelse(is.na(mean(std.ratings[Horror==1]))|is.nan(mean(std.ratings[Horror==1])), 0, mean(std.ratings[Horror==1])),
  iHorror = ifelse(is.na(mean(std.imdbRating[Horror==1]))|is.nan(mean(std.imdbRating[Horror==1])), 0, mean(std.imdbRating[Horror==1])),
  dHorror = rHorror - iHorror,
  
  nUnk = sum(unknown==1),
  pUnk = nUnk/length(unknown),
  rUnk= ifelse(is.na(mean(std.ratings[unknown==1]))|is.nan(mean(std.ratings[unknown==1])), 0, mean(std.ratings[unknown==1])),
  iUnk = ifelse(is.na(mean(std.imdbRating[unknown==1]))|is.nan(mean(std.imdbRating[unknown==1])), 0, mean(std.imdbRating[unknown==1])),
  dUnk = rUnk - iUnk,
  
  nDrama = sum(Drama==1),
  pDrama = nDrama/length(Drama),
  rDrama = ifelse(is.na(mean(std.ratings[Drama==1]))|is.nan(mean(std.ratings[Drama==1])), 0, mean(std.ratings[Drama==1])),
  iDrama = ifelse(is.na(mean(std.imdbRating[Drama==1]))|is.nan(mean(std.imdbRating[Drama==1])), 0, mean(std.imdbRating[Drama==1])),
  dDrama = rDrama - iDrama,
  
  nAdventure = sum(Adventure==1),
  pAdventure = nAdventure/length(Adventure),
  rAdventure = ifelse(is.na(mean(std.ratings[Adventure==1]))|is.nan(mean(std.ratings[Adventure==1])), 0, mean(std.ratings[Adventure==1])),
  iAdventure = ifelse(is.na(mean(std.imdbRating[Adventure==1]))|is.nan(mean(std.imdbRating[Adventure==1])), 0, mean(std.imdbRating[Adventure==1])),
  dAdventure = rAdventure - iAdventure,
  
  nScifi = sum(Sci.Fi==1),
  pScifi = nScifi/length(Sci.Fi),
  rScifi = ifelse(is.na(mean(std.ratings[Sci.Fi==1]))|is.nan(mean(std.ratings[Sci.Fi==1])), 0, mean(std.ratings[Sci.Fi==1])),
  iScifi = ifelse(is.na(mean(std.imdbRating[Sci.Fi==1]))|is.nan(mean(std.imdbRating[Sci.Fi==1])), 0, mean(std.imdbRating[Sci.Fi==1])),
  dScifi = rScifi - iScifi,
  
  nThrill = sum(Thriller==1),
  pThrill = nThrill/length(Thriller),
  rThrill= ifelse(is.na(mean(std.ratings[Thriller==1]))|is.nan(mean(std.ratings[Thriller==1])), 0, mean(std.ratings[Thriller==1])),
  iThrill = ifelse(is.na(mean(std.imdbRating[Thriller==1]))|is.nan(mean(std.imdbRating[Thriller==1])), 0, mean(std.imdbRating[Thriller==1])),
  dThrill = rThrill - iThrill,
  
  nFantasy = sum(Fantasy==1),
  pFantasy = nFantasy/length(Fantasy),
  rFantasy = ifelse(is.na(mean(std.ratings[Fantasy==1]))|is.nan(mean(std.ratings[Fantasy==1])), 0, mean(std.ratings[Fantasy==1])),
  iFantasy = ifelse(is.na(mean(std.imdbRating[Fantasy==1]))|is.nan(mean(std.imdbRating[Fantasy==1])), 0, mean(std.imdbRating[Fantasy==1])),
  dFantasy = rFantasy - iFantasy,
  
  nChild = sum(Children.s==1),
  pChild = nChild/length(Children.s),
  rChild = ifelse(is.na(mean(std.ratings[Children.s==1]))|is.nan(mean(std.ratings[Children.s==1])), 0, mean(std.ratings[Children.s==1])),
  iChild = ifelse(is.na(mean(std.imdbRating[Children.s==1]))|is.nan(mean(std.imdbRating[Children.s==1])), 0, mean(std.imdbRating[Children.s==1])),
  dChild = rChild - iChild,
  
  nNoir= sum(Film.Noir==1),
  pNoir = nNoir/length(Film.Noir),
  rNoir = ifelse(is.na(mean(std.ratings[Film.Noir==1]))|is.nan(mean(std.ratings[Film.Noir==1])), 0, mean(std.ratings[Film.Noir==1])),
  iNoir = ifelse(is.na(mean(std.imdbRating[Film.Noir==1]))|is.nan(mean(std.imdbRating[Film.Noir==1])), 0, mean(std.imdbRating[Film.Noir==1])),
  dNoir = rNoir - iNoir,
  
  nWestern = sum(Western==1),
  pWestern = nWestern/length(Western),
  rWestern = ifelse(is.na(mean(std.ratings[Western==1]))|is.nan(mean(std.ratings[Western==1])), 0, mean(std.ratings[Western==1])),
  iWestern = ifelse(is.na(mean(std.imdbRating[Western==1]))|is.nan(mean(std.imdbRating[Western==1])), 0, mean(std.imdbRating[Western==1])),
  dWestern = rWestern - iWestern,
  
  nMystery= sum(Mystery==1),
  pMystery = nMystery/length(Mystery),
  rMystery= ifelse(is.na(mean(std.ratings[Mystery==1]))|is.nan(mean(std.ratings[Mystery==1])), 0, mean(std.ratings[Mystery==1])),
  iMystery = ifelse(is.na(mean(std.imdbRating[Mystery==1]))|is.nan(mean(std.imdbRating[Mystery==1])), 0, mean(std.imdbRating[Mystery==1])),
  dMystery = rMystery - iMystery,
  
  nDoc = sum(Documentary==1),
  pDoc = nDoc/length(Documentary),
  rDoc = ifelse(is.na(mean(std.ratings[Documentary==1]))|is.nan(mean(std.ratings[Documentary==1])), 0, mean(std.ratings[Documentary==1])),
  iDoc = ifelse(is.na(mean(std.imdbRating[Documentary==1]))|is.nan(mean(std.imdbRating[Documentary==1])), 0, mean(std.imdbRating[Documentary==1])),
  dDoc = rDoc - iDoc,
  
  nWar = sum(War==1),
  pWar = nWar/length(War),
  rWar = ifelse(is.na(mean(std.ratings[War==1]))|is.nan(mean(std.ratings[War==1])), 0, mean(std.ratings[War==1])),
  iWar = ifelse(is.na(mean(std.imdbRating[War==1]))|is.nan(mean(std.imdbRating[War==1])), 0, mean(std.imdbRating[War==1])),
  dWar = rWar - iWar,
  
  nRom = sum(Romance==1),
  pRom = nRom/length(Romance),
  rRom= ifelse(is.na(mean(std.ratings[Romance==1]))|is.nan(mean(std.ratings[Romance==1])), 0, mean(std.ratings[Romance==1])),
  iRom = ifelse(is.na(mean(std.imdbRating[Romance==1]))|is.nan(mean(std.imdbRating[Romance==1])), 0, mean(std.imdbRating[Romance==1])),
  dRom = rRom - iRom,
  
  nMusical = sum(Musical==1),
  pMusical = nMusical/length(Musical),
  rMusical = ifelse(is.na(mean(std.ratings[Musical==1]))|is.nan(mean(std.ratings[Musical==1])), 0, mean(std.ratings[Musical==1])),
  iMusical = ifelse(is.na(mean(std.imdbRating[Musical==1]))|is.nan(mean(std.imdbRating[Musical==1])), 0, mean(std.imdbRating[Musical==1])),
  dMusical = rMusical - iMusical
)

I tested clustering the data with a varying number of clusters (ranging from 1 to 300) and, for each number of clusters, plotted the ratio of between-cluster variation to the total variation of the set.

ratios = sapply(1:300, function(k) {
 with(kmeans(clustering_data[,c(seq(3,length(clustering_data),5),seq(6,length(clustering_data),5))], k, nstart = 20), betweenss / totss)
})

plot(ratios)

Based on these ratios I started by exploring the data clustered into roughly 100 clusters. I used a decision tree model to test a few different numbers of clusters around 100 (ranging from 50 to 150 in intervals of 25), as I figured the number of clusters could have a pronounced impact on predictive accuracy: too many clusters removes any benefit to the model, since few previous users in the same cluster will have reviewed similar types of movies, while too few clusters aren't specific enough to provide any predictive edge. The actual testing code is excluded for brevity (a sketch is given below), but it confirmed 100 clusters to be the optimal number.
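
For reference, a minimal sketch of such a test (not the code that was actually run; the formula and sample size here are illustrative assumptions) refits the clusters for each candidate number and compares cross-validated rpart accuracy with caret:

#Sketch only: compare a few candidate cluster counts via cross-validated rpart accuracy
library(caret)

cluster_counts <- seq(50, 150, by = 25)
profile_cols <- c(seq(3, length(clustering_data), 5), seq(6, length(clustering_data), 5))

cluster_accuracy <- sapply(cluster_counts, function(k) {
  #refit the clustering with k clusters and attach the labels to the movie data
  cl <- kmeans(clustering_data[, profile_cols], k, nstart = 20)
  tagged <- merge(movies.full,
                  data.frame(userID = clustering_data$userID,
                             cluster = as.factor(cl$cluster)))
  tagged <- na.omit(tagged)

  #cross-validated decision tree on a small partition, using the cluster plus a few features
  idx <- createDataPartition(tagged$view, p = 0.02)[[1]]
  fit <- train(view ~ cluster + imdbRating + timeSinceRelease, method = "rpart",
               data = tagged[idx, ], trControl = trainControl(method = "cv"))
  max(fit$results$Accuracy)
})

data.frame(clusters = cluster_counts, accuracy = cluster_accuracy)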

#going with 100 clusters - add the clusters to the data
clusters <- kmeans(clustering_data[,c(seq(3,length(clustering_data),5),seq(6,length(clustering_data),5))], 100, nstart = 20)
clusters_clean <- data.frame(userID = clustering_data$userID, cluster = clusters$cluster)
clusters_clean$cluster <- as.factor(clusters_clean$cluster)
movies.clustered2 <- merge(movies.full, clusters_clean)

7.0 Modelling

Now that we've constructed a new set of features, it's time to actually create a model. Before diving into models built on the engineered data, I thought it would be instructive to first create a simple model on the original data to act as a baseline; this way I'd be able to see whether all the feature engineering actually led to an increase in predictive accuracy. It's important to note that due to the size of the dataset (100,000 reviews) and my limited computational power, I've decided to work with a subset of the original data (roughly 2,000 reviews).

library(caret)

#create a data partition (a ~2% sample of the reviews)
tmp <- createDataPartition(merged_data$view, p = 0.02)[[1]]

#remove variables that aren't used for predictive purposes (i.e. userID, itemID, movie title...etc.)
modelling_data <- na.omit(merged_data)
modelling_data <- modelling_data[,-c(1:4,8:9,11,31,33,34,35)]

#run rpart model
base.rpart <- train(view ~ . , method="rpart", data=modelling_data[tmp,], trControl = trainControl(method = "cv"))

confusionMatrix(base.rpart)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction LIKE DISLIKE
##    LIKE    39.2    28.5
##    DISLIKE 16.4    15.8
##                            
##  Accuracy (average) : 0.551

We can definitely do better than 55%. Let's try three different models (logistic regression, decision tree and random forest) on our newly engineered features.

7.1 Logistic Regression

The results of the logistic regression model are included below; unfortunately, it doesn't seem to perform much better than our base model. It is worth noting, however, that a summary of the final model (excluded due to the number of variables and the ungainly nature of the table) shows that all of the cluster terms were statistically significant.

tmp <- createDataPartition(movies.clustered2$view, p =0.02)[[1]]

#remove variables that aren't used for predictive purposes (i.e. userID, itemID, movie title...etc.)
modelling_data <- na.omit(movies.clustered2)
modelling_data <- modelling_data[,-c(1:5,9,11,31, length(movies.clustered2)-1)]

#run logistic regression (glm) model
model.glm <- train(view ~ . , method="glm", data=modelling_data[tmp,], trControl = trainControl(method = "cv", verboseIter = T))
## + Fold01: parameter=none 
## - Fold01: parameter=none 
## + Fold02: parameter=none 
## - Fold02: parameter=none 
## + Fold03: parameter=none 
## - Fold03: parameter=none 
## + Fold04: parameter=none 
## - Fold04: parameter=none 
## + Fold05: parameter=none 
## - Fold05: parameter=none 
## + Fold06: parameter=none 
## - Fold06: parameter=none 
## + Fold07: parameter=none 
## - Fold07: parameter=none 
## + Fold08: parameter=none 
## - Fold08: parameter=none 
## + Fold09: parameter=none 
## - Fold09: parameter=none 
## + Fold10: parameter=none 
## - Fold10: parameter=none 
## Aggregating results
## Fitting final model on full training set
confusionMatrix(model.glm)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction LIKE DISLIKE
##    LIKE    32.0    23.0
##    DISLIKE 21.7    23.3
##                             
##  Accuracy (average) : 0.5533
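
For reference, the significance of the cluster terms mentioned above can be checked from the fitted glm that caret stores in finalModel; a minimal sketch:

#Sketch only: inspect the cluster coefficients of the final glm
glm_coefs <- summary(model.glm$finalModel)$coefficients
glm_coefs[grepl("^cluster", rownames(glm_coefs)), c("Estimate", "Pr(>|z|)")]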

7.2 Decision Trees

The decision tree model fared much better than the logistic regression, seeing a relatively large jump in accuracy to 62.8%.

tmp <- createDataPartition(movies.clustered2$view, times = 1, p = 0.02)[[1]]

modelling_data <- na.omit(movies.clustered2)
modelling_data <- modelling_data[,-c(1:5,9,11,31, length(movies.clustered2)-1)]
model.rpart <- train(view ~ . , method="rpart", data=modelling_data[tmp,], 
                       trControl = trainControl(method = "cv", verboseIter = T))
## + Fold01: cp=0.01018 
## - Fold01: cp=0.01018 
## + Fold02: cp=0.01018 
## - Fold02: cp=0.01018 
## + Fold03: cp=0.01018 
## - Fold03: cp=0.01018 
## + Fold04: cp=0.01018 
## - Fold04: cp=0.01018 
## + Fold05: cp=0.01018 
## - Fold05: cp=0.01018 
## + Fold06: cp=0.01018 
## - Fold06: cp=0.01018 
## + Fold07: cp=0.01018 
## - Fold07: cp=0.01018 
## + Fold08: cp=0.01018 
## - Fold08: cp=0.01018 
## + Fold09: cp=0.01018 
## - Fold09: cp=0.01018 
## + Fold10: cp=0.01018 
## - Fold10: cp=0.01018 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0102 on full training set
confusionMatrix(model.rpart)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction LIKE DISLIKE
##    LIKE    32.8    16.3
##    DISLIKE 20.9    30.0
##                             
##  Accuracy (average) : 0.6276

7.3 Random Forest

The final model, a random forest, performed slightly better than the decision tree, with a modest accuracy of 62.9%.

tmp <- createDataPartition(movies.clustered2$view, times = 1, p = 0.02)[[1]]

modelling_data <- na.omit(movies.clustered2)
modelling_data <- modelling_data[,-c(1:5,9,11,31, length(movies.clustered2)-1)]
model.rf <- train(view ~ . , method="rf", data=modelling_data[tmp,], 
                       trControl = trainControl(method = "cv", verboseIter = T))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
## + Fold01: mtry=  2 
## - Fold01: mtry=  2 
## + Fold01: mtry= 34 
## - Fold01: mtry= 34 
## + Fold01: mtry=584 
## - Fold01: mtry=584 
## + Fold02: mtry=  2 
## - Fold02: mtry=  2 
## + Fold02: mtry= 34 
## - Fold02: mtry= 34 
## + Fold02: mtry=584 
## - Fold02: mtry=584 
## + Fold03: mtry=  2 
## - Fold03: mtry=  2 
## + Fold03: mtry= 34 
## - Fold03: mtry= 34 
## + Fold03: mtry=584 
## - Fold03: mtry=584 
## + Fold04: mtry=  2 
## - Fold04: mtry=  2 
## + Fold04: mtry= 34 
## - Fold04: mtry= 34 
## + Fold04: mtry=584 
## - Fold04: mtry=584 
## + Fold05: mtry=  2 
## - Fold05: mtry=  2 
## + Fold05: mtry= 34 
## - Fold05: mtry= 34 
## + Fold05: mtry=584 
## - Fold05: mtry=584 
## + Fold06: mtry=  2 
## - Fold06: mtry=  2 
## + Fold06: mtry= 34 
## - Fold06: mtry= 34 
## + Fold06: mtry=584 
## - Fold06: mtry=584 
## + Fold07: mtry=  2 
## - Fold07: mtry=  2 
## + Fold07: mtry= 34 
## - Fold07: mtry= 34 
## + Fold07: mtry=584 
## - Fold07: mtry=584 
## + Fold08: mtry=  2 
## - Fold08: mtry=  2 
## + Fold08: mtry= 34 
## - Fold08: mtry= 34 
## + Fold08: mtry=584 
## - Fold08: mtry=584 
## + Fold09: mtry=  2 
## - Fold09: mtry=  2 
## + Fold09: mtry= 34 
## - Fold09: mtry= 34 
## + Fold09: mtry=584 
## - Fold09: mtry=584 
## + Fold10: mtry=  2 
## - Fold10: mtry=  2 
## + Fold10: mtry= 34 
## - Fold10: mtry= 34 
## + Fold10: mtry=584 
## - Fold10: mtry=584 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 34 on full training set
confusionMatrix(model.rf)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction LIKE DISLIKE
##    LIKE    38.2    21.7
##    DISLIKE 15.4    24.6
##                             
##  Accuracy (average) : 0.6288

8.0 Model Results

A summary of the four models' results is included below. The results indicate that the random forest model is the most accurate; however, its modest gain in accuracy over the decision tree may be outweighed by its much larger computational cost. Practically, a decision tree model would likely suffice for this data.

Model           Accuracy
Base            55.1%
GLM             55.33%
Decision Tree   62.76%
Random Forest   62.88%
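
These cross-validated accuracies can also be pulled directly from the caret train objects (a quick sketch; base.rpart is the baseline model from section 7.0):

#Pull each model's best cross-validated accuracy
sapply(list(Base = base.rpart, GLM = model.glm, DecisionTree = model.rpart, RandomForest = model.rf),
       function(m) max(m$results$Accuracy))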

9.0 Conclusion

The results of this project demonstrate how dramatic an impact feature engineering can have on recommendations. By applying a clustering algorithm, I was able to realize a roughly eight percentage point jump in predictive accuracy for movie recommendations, not an insignificant amount in this inherently unpredictable field.

Next steps for this endeavor would be to increase the quality and quantity of data on which to base these predictions, and to secure the computational power to run the algorithms on a larger portion of the data set. I predict that training the clustering on more data should lead to an increase in accuracy, as it allows for the creation of more targeted, and thus more effective, clusters. As previously mentioned, it would also be valuable to collect a better absolute gauge of a user's enjoyment of a movie (similar to the created view field, but generated by the user).