Industry and Company Background

A recommendation engine filters data using different algorithms and recommends the most relevant items to users. It first captures a customer's past behaviour and, based on that, recommends products the customer is likely to buy. The global recommendation engine market is expected to grow from USD 801.1 million in 2017 to USD 4,414.8 million by 2022, at a compound annual growth rate (CAGR) of 40.7% during the forecast period.

Several industry verticals, such as retail, media and entertainment, and transportation, have deployed recommendation engines for various applications, including personalized campaigns and customer discovery, product planning, strategy and operations planning, and proactive asset management.

The retail end-user segment is expected to be the highest contributor in terms of revenue during the forecast period, while the media and entertainment end-user segment is projected to grow at the highest CAGR. Recommendation engines are classified into three types: content-based, collaborative, and hybrid approaches.
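As a quick sanity check, the quoted growth rate can be recomputed from the two market sizes. This is only a sketch of the standard CAGR formula; the variable names are illustrative.

```r
# Market sizes quoted in the text (USD million)
start_value <- 801.1    # 2017
end_value   <- 4414.8   # 2022
n_years     <- 2022 - 2017

# CAGR = (end / start)^(1 / years) - 1
cagr <- (end_value / start_value)^(1 / n_years) - 1
round(100 * cagr, 1)    # 40.7
```

The result agrees with the 40.7% figure stated for the forecast period.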

Major Challenges to Be Addressed

  1. Overspecialization: The system does not recommend items that are different from anything the user has seen before. This can be a problem because the user might want to try something new and the system would never suggest it. Serendipity (variety in recommendations) is ignored. The user should therefore be presented with a range of options rather than only a few selected alternatives.

  2. Limited content analysis: Two different items may be represented by the same set of attributes, in which case they cannot be differentiated.

  3. New user problem: New users do not yet have enough ratings, so they cannot be given accurate recommendations.

About the Research

Our research mainly revolves around what Netflix has become famous for: its recommendation engine. Recommendation engines typically work in this manner:

  1. A user watches a movie and rates it. As the user watches more movies, more data is collected.
  2. This data describes the user’s preferences and can be used to recommend other movies to the user.

The research objective is to determine what accurately describes a user’s movie preferences.

Specific research questions

  1. Does the user like certain genres?
  2. Does the user prefer to watch movies of a certain rating?
  3. Does the user watch movies with a particular cast?
  4. Does the user watch movies by a particular director?
  5. Does the user prefer to watch movies from a certain time period?

Why is this research important

  1. Recommendation engines are very powerful personalization tools because they are a great way to enable “discovery”: showing people items they will like but are unlikely to discover by themselves. They improve a visitor’s experience by offering relevant items at the right time and on the right page.

  2. Because of how well recommendation engines boost subscriber numbers through engagement and stickiness, facilitating such serendipitous discovery has turned into a high stakes multi-billion-dollar race for the world’s biggest digital companies

  3. The customer personalization journeys of Amazon and Netflix demonstrate just how powerful recommendation engines can be

  4. On-demand streaming video is probably the world’s biggest market for digital consumption of content.

  5. The savings produced by the Netflix algorithm show up through increased viewership and lower churn.

  6. Strong recommendations also increase the amount of time viewers spend watching content on Netflix, keeping subscriber churn as low as possible.

  7. According to a paper published by Netflix executives, the on-demand video streaming service claims its AI assisted recommendation system saves the company $1 billion per year. This means Netflix can confidently spend huge sums ($6 billion a year) on new content, knowing viewers will consume enough over time to give them healthy returns on the investment

  8. Updates to the algorithms are researched and tested by a team of over 70 engineers. In 2006, Netflix launched an open competition, the Netflix Prize, offering $1 million to any research team that could improve the accuracy of its recommendation algorithm by 10%; the prize was awarded in 2009. The Netflix Prize was an important event in the development of content discovery systems, shining a light on recommendation engine technology and bringing new machine learning scientists to the topic.

Dataset description

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens’s research projects have explored a variety of fields, including:

  1. recommender systems
  2. online communities
  3. mobile and ubiquitous technologies
  4. digital libraries
  5. local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data.

This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of these files follow.

Formatting and Encoding

The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes (").

User Ids

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

Movie Ids

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following variables:

  1. userId - unique identifier of the user
  2. movieId - unique identifier of the movie
  3. rating - the rating the user gave the movie
  4. timestamp - the time when the user rated the movie

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
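A minimal sketch of turning such a timestamp into a calendar date in R. The value below is chosen for illustration and is not taken from the dataset; the variable names are hypothetical.

```r
# Seconds since 1970-01-01 00:00:00 UTC (illustrative value, not from the data)
ts_seconds <- 1104537600

# Convert to a date-time, stating the epoch and time zone explicitly
rated_at <- as.POSIXct(ts_seconds, origin = "1970-01-01", tz = "UTC")
format(rated_at, "%Y-%m-%d %H:%M:%S")   # "2005-01-01 00:00:00"

# Derived variable: the calendar year in which the rating was made
rating_year <- as.integer(format(rated_at, "%Y"))
```

The same call works on a whole column, for example on the timestamp column of the ratings file once it has been read into a data frame.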

Tags Data File Structure (tags.csv)

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following variables:

  1. userId - unique identifier of the user
  2. movieId - unique identifier of the movie
  3. tag - user-generated metadata about the movie. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
  4. timestamp - the time when the user applied the tag

The lines within this file are ordered first by userId, then, within user, by movieId.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Movies Data File Structure (movies.csv)

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following variables:

  1. movieId - unique identifier of the movie
  2. title - the title of the movie. Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
  3. genres - the genres associated with the movie. Genres are a pipe-separated list, and are selected from the following: Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, (no genres listed)
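Because genres are pipe-separated and the release year is embedded in the title, both usually need to be unpacked before they can serve as variables. The following is a sketch on a toy data frame; the titles, genres, and the names `movies_toy`, `genre_matrix`, and `year` are hypothetical.

```r
# Toy stand-in for movies.csv (rows are made up for illustration)
movies_toy <- data.frame(
  movieId = c(1, 2),
  title   = c("Movie A (1992)", "Movie B (2005)"),
  genres  = c("Adventure|Animation|Children|Comedy", "Action|Crime|Drama"),
  stringsAsFactors = FALSE
)

# Split the genre string; fixed = TRUE treats "|" as a literal character
genre_list <- strsplit(movies_toy$genres, "|", fixed = TRUE)

# Build one 0/1 indicator column per genre
all_genres   <- sort(unique(unlist(genre_list)))
genre_matrix <- t(sapply(genre_list, function(g) as.integer(all_genres %in% g)))
colnames(genre_matrix) <- all_genres

# Extract the four-digit year from the trailing "(YYYY)" in the title
movies_toy$year <- as.integer(
  sub(".*\\(([0-9]{4})\\)[[:space:]]*$", "\\1", movies_toy$title)
)
```

Titles that lack a year in parentheses would produce NA (with a warning) in the last step, which is worth checking given that the titles may contain errors and inconsistencies.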

Tag Genome (genome-scores.csv and genome-tags.csv)

This data set includes a current copy of the Tag Genome.

The tag genome is a data structure that contains tag relevance scores for movies. The structure is a dense matrix: each movie in the genome has a value for every tag in the genome.

The tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files. The file genome-scores.csv contains movie-tag relevance data in the following format:

    movieId,tagId,relevance

The second file, genome-tags.csv, provides the tag descriptions for the tag IDs in the genome file, in the following format:

    tagId,tag

The tagId values are generated when the data set is exported, so they may vary from version to version of the MovieLens data sets.
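The two genome files hold the dense matrix in long form, one row per movie-tag pair. A sketch of joining them and pivoting to a movie-by-tag matrix, using made-up relevance values and hypothetical object names:

```r
# Toy stand-ins for genome-scores.csv and genome-tags.csv (values are made up)
gscores_toy <- data.frame(
  movieId   = c(1, 1, 1, 2, 2, 2),
  tagId     = c(1, 2, 3, 1, 2, 3),
  relevance = c(0.90, 0.10, 0.05, 0.20, 0.80, 0.40)
)
gtags_toy <- data.frame(
  tagId = 1:3,
  tag   = c("atmospheric", "realistic", "thought-provoking"),
  stringsAsFactors = FALSE
)

# Attach readable tag names, then pivot from long form to movie x tag
g <- merge(gscores_toy, gtags_toy, by = "tagId")
genome <- tapply(g$relevance, list(movieId = g$movieId, tag = g$tag), mean)
genome

# Most relevant tag for each movie
top_tag <- colnames(genome)[apply(genome, 1, which.max)]
```

Each row of `genome` can then be used as a feature vector describing a movie. On the full dataset a matrix of this kind is large, so memory use should be checked before pivoting.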

Dataset link - https://drive.google.com/drive/u/1/my-drive?ogsrc=32

Addressing stated research questions

Movie recommendation is based on customers' consumption patterns and the ratings they give to movies, broken down by genre, director, actor, etc. Users can be grouped using an unsupervised machine learning technique, such as k-means clustering or an unsupervised neural network, applied to selected variables and derived variables (combinations of the given variables). For example, a user's ratings history:

  1. for a given director
  2. for a given combination of genre and director
  3. for a given lead actor
  4. for a given combination of lead actor and director
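A minimal sketch of the clustering step with base R's `kmeans`. It assumes each user has already been summarised as a vector of derived variables; here those are mean ratings per genre, and all names and values are made up for illustration.

```r
# Toy user profiles: mean rating each user gave to three genres (made-up values)
user_profile <- rbind(
  u1 = c(Action = 4.5, Comedy = 2.0, Drama = 1.5),
  u2 = c(Action = 5.0, Comedy = 2.5, Drama = 2.0),
  u3 = c(Action = 4.0, Comedy = 1.5, Drama = 1.0),
  u4 = c(Action = 1.5, Comedy = 4.5, Drama = 4.0),
  u5 = c(Action = 2.0, Comedy = 5.0, Drama = 4.5),
  u6 = c(Action = 1.0, Comedy = 4.0, Drama = 5.0)
)

# k-means starts from random centres, so fix the seed and try several starts
set.seed(42)
km <- kmeans(user_profile, centers = 2, nstart = 10)

km$cluster            # cluster label assigned to each user
round(km$centers, 2)  # average profile of each cluster
```

The number of clusters (2 here) is an assumption of the sketch; on real data it would need to be chosen, for example by comparing the within-cluster sum of squares across several values.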

After clustering, we intend to use segmentation techniques to dissect the clusters, arrive at different psychographic profiles for the users, and develop a recommendation engine that generates accurate recommendations for each segment. To improve model accuracy, we intend to use supervised learning techniques such as regression, support vector machines, and naive Bayes.

We intend to use a combination of content-based and collaborative filtering methods that:

  1. use attributes of items and users
  2. recommend items similar to those the user liked in the past
  3. recommend items liked by similar users
  4. enable exploration of diverse content

Content-based methods rely on the similarity of item attributes. Association rules can also be used for recommendation: items that are frequently consumed together are connected by an edge in a graph. Such a graph typically shows clusters of best sellers (densely connected items that almost everybody has interacted with) and small, separate clusters of niche content.
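One common way to measure item similarity is the cosine between attribute vectors. The sketch below uses genre indicators as the attributes; the items and their genres are hypothetical, and cosine similarity is one possible choice rather than the method fixed by this project.

```r
# Toy item-by-genre indicator matrix (hypothetical items)
items <- rbind(
  itemA = c(Action = 1, Comedy = 0, Drama = 0, Thriller = 1),
  itemB = c(Action = 1, Comedy = 0, Drama = 1, Thriller = 1),
  itemC = c(Action = 0, Comedy = 1, Drama = 1, Thriller = 0)
)

# Cosine similarity between every pair of items:
# dot product divided by the product of the vector lengths
norms   <- sqrt(rowSums(items^2))
cos_sim <- (items %*% t(items)) / (norms %o% norms)

# Items ranked by similarity to itemA, excluding itemA itself
sim_to_A <- cos_sim["itemA", colnames(cos_sim) != "itemA"]
sort(sim_to_A, decreasing = TRUE)
```

Here itemB ranks above itemC for a user who liked itemA, because it shares two genres with itemA while itemC shares none.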

Collaborative methods work with the interaction matrix, which can also be called the rating matrix in the rare case when users provide explicit ratings of items. The machine learning task is to learn a function that predicts the utility of each item to each user. The matrix is typically huge and very sparse: most of its values are missing.
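To make the idea concrete, here is a small user-based sketch: a missing entry of the rating matrix is predicted as the similarity-weighted mean of the ratings other users gave that movie. The matrix values and the helper names `user_sim` and `predict_rating` are hypothetical, and this is only one simple variant of collaborative filtering.

```r
# Toy rating matrix: rows are users, columns are movies, NA = not rated
rating_matrix <- rbind(
  u1 = c(m1 = 5, m2 = 4, m3 = NA, m4 = 1),
  u2 = c(m1 = 4, m2 = 5, m3 = 5,  m4 = 1),
  u3 = c(m1 = 1, m2 = 2, m3 = 1,  m4 = 5)
)

# Cosine similarity between two users over the movies both have rated
user_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(0)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# Predict a missing rating from the users who did rate the movie
predict_rating <- function(m, user, movie) {
  others <- setdiff(rownames(m)[!is.na(m[, movie])], user)
  sims   <- sapply(others, function(o) user_sim(m[user, ], m[o, ]))
  sum(sims * m[others, movie]) / sum(abs(sims))
}

pred <- predict_rating(rating_matrix, "u1", "m3")
pred
```

Because u1's ratings resemble u2's much more than u3's, the prediction is pulled toward the 5 that u2 gave m3. A fuller version would typically centre each user's ratings first and handle the case where no other user has rated the movie.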

Reading the data and preparing summary statistics

# Folder containing the extracted ml-20m CSV files (adjust for your machine)
data_dir <- "C:/Users/pgp33387/Documents/Mohan/DAM proj/ml-20m"

# The factor-style summaries below assume text columns are read as factors
# (the read.csv default before R 4.0)
ratings <- read.csv(file.path(data_dir, "ratings.csv"))
tags    <- read.csv(file.path(data_dir, "tags.csv"))
movies  <- read.csv(file.path(data_dir, "movies.csv"))
links   <- read.csv(file.path(data_dir, "links.csv"))
gscores <- read.csv(file.path(data_dir, "genome-scores.csv"))
gtags   <- read.csv(file.path(data_dir, "genome-tags.csv"))

#Summary of ratings dataset
summary(ratings)
##      userId          movieId           rating        timestamp        
##  Min.   :     1   Min.   :     1   Min.   :0.500   Min.   :7.897e+08  
##  1st Qu.: 34395   1st Qu.:   902   1st Qu.:3.000   1st Qu.:9.668e+08  
##  Median : 69141   Median :  2167   Median :3.500   Median :1.104e+09  
##  Mean   : 69046   Mean   :  9042   Mean   :3.526   Mean   :1.101e+09  
##  3rd Qu.:103637   3rd Qu.:  4770   3rd Qu.:4.000   3rd Qu.:1.226e+09  
##  Max.   :138493   Max.   :131262   Max.   :5.000   Max.   :1.428e+09
#Summary of tags dataset
summary(tags)
##      userId          movieId                    tag        
##  Min.   :    18   Min.   :     1   sci-fi         :  3384  
##  1st Qu.: 28780   1st Qu.:  2571   based on a book:  3281  
##  Median : 70201   Median :  7373   atmospheric    :  2917  
##  Mean   : 68712   Mean   : 32628   comedy         :  2779  
##  3rd Qu.:107322   3rd Qu.: 62235   action         :  2657  
##  Max.   :138472   Max.   :131258   (Other)        :450530  
##                                    NA's           :    16  
##    timestamp        
##  Min.   :1.135e+09  
##  1st Qu.:1.245e+09  
##  Median :1.302e+09  
##  Mean   :1.299e+09  
##  3rd Qu.:1.366e+09  
##  Max.   :1.428e+09  
## 
#Summary of movies dataset
summary(movies)
##     movieId                                       title      
##  Min.   :     1   20,000 Leagues Under the Sea (1997):    2  
##  1st Qu.:  6931   Aladdin (1992)                     :    2  
##  Median : 68068   Beneath (2013)                     :    2  
##  Mean   : 59855   Blackout (2007)                    :    2  
##  3rd Qu.:100293   Casanova (2005)                    :    2  
##  Max.   :131262   Chaos (2005)                       :    2  
##                   (Other)                            :27266  
##             genres     
##  Drama         : 4520  
##  Comedy        : 2294  
##  Documentary   : 1942  
##  Comedy|Drama  : 1264  
##  Drama|Romance : 1075  
##  Comedy|Romance:  757  
##  (Other)       :15426
#Summary of links dataset
summary(links)
##     movieId           imdbId            tmdbId      
##  Min.   :     1   Min.   :      5   Min.   :     2  
##  1st Qu.:  6931   1st Qu.:  77417   1st Qu.: 15936  
##  Median : 68068   Median : 152435   Median : 39469  
##  Mean   : 59855   Mean   : 578186   Mean   : 63847  
##  3rd Qu.:100293   3rd Qu.: 906272   3rd Qu.: 82504  
##  Max.   :131262   Max.   :4530184   Max.   :421510  
##                                     NA's   :252
#Summary of gscores dataset
summary(gscores)
##     movieId           tagId          relevance      
##  Min.   :     1   Min.   :   1.0   Min.   :0.00025  
##  1st Qu.:  2926   1st Qu.: 282.8   1st Qu.:0.02425  
##  Median :  6017   Median : 564.5   Median :0.05650  
##  Mean   : 25843   Mean   : 564.5   Mean   :0.11648  
##  3rd Qu.: 46062   3rd Qu.: 846.2   3rd Qu.:0.14150  
##  Max.   :131170   Max.   :1128.0   Max.   :1.00000
#Summary of gtags dataset
summary(gtags)
##      tagId                  tag      
##  Min.   :   1.0   007         :   1  
##  1st Qu.: 282.8   007 (series):   1  
##  Median : 564.5   18th century:   1  
##  Mean   : 564.5   1920s       :   1  
##  3rd Qu.: 846.2   1930s       :   1  
##  Max.   :1128.0   1950s       :   1  
##                   (Other)     :1122
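# Inner joins: only ratings where the same user also tagged the same movie are kept.
# After the first merge, timestamp.x is the rating time and timestamp.y the tagging time.
first <- merge(ratings, tags, by = c("userId", "movieId"))
sec   <- merge(first, movies, by = "movieId")
third <- merge(sec, links, by = "movieId")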

# Highest rated movies
aggdata <-aggregate(third$rating, by=list(third$title),FUN=mean, na.rm=TRUE)
colnames(aggdata)[2]="rating"
newdata <- aggdata[order(-aggdata$rating),] 
head(newdata)
##                                                     Group.1 rating
## 3                                       'Salem's Lot (2004)      5
## 28                                              1066 (2009)      5
## 211 A Pigeon Sat on a Branch Reflecting on Existence (2014)      5
## 240                                       About Adam (2000)      5
## 250                                         Absentia (2011)      5
## 253                                Absolute Giganten (1999)      5
# Highest rated tags
aggdata <-aggregate(third$rating, by=list(third$tag),FUN=mean, na.rm=TRUE)
colnames(aggdata)[2]="rating"
newdata <- aggdata[order(-aggdata$rating),] 
head(newdata)
##                                                                      Group.1
## 26 'Take This Sinking Boat & Point It Home You Hoover Fixer Sucker Guuuyyy!'
## 34                                                   !950's Superman TV show
## 38                                                        "A Mão-de-Deus"
## 47                                                        "Ghost for adults"
## 55                                                      "meet me in montauk"
## 63                                                                   "Piggy"
##    rating
## 26      5
## 34      5
## 38      5
## 47      5
## 55      5
## 63      5
# Highest rated genres
aggdata <-aggregate(third$rating, by=list(third$genres),FUN=mean, na.rm=TRUE)
colnames(aggdata)[2]="rating"
newdata <- aggdata[order(-aggdata$rating),] 
head(newdata)
##                                     Group.1 rating
## 88             Action|Adventure|Documentary      5
## 96   Action|Adventure|Drama|Horror|Thriller      5
## 114 Action|Adventure|Fantasy|Horror|Romance      5
## 460        Adventure|Children|Drama|Romance      5
## 489        Adventure|Comedy|Fantasy|Musical      5
## 513           Adventure|Crime|Drama|Western      5
#Correlation of time with rating
correlation<-third[,c("timestamp.x","rating")]
cor(correlation)
##             timestamp.x      rating
## timestamp.x  1.00000000 -0.05882623
## rating      -0.05882623  1.00000000
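The "highest rated" listings above all show a mean of exactly 5, which usually indicates groups with very few ratings rather than broad agreement. One possible refinement, sketched here on toy data, is to require a minimum number of ratings before ranking. The data frame `toy` and the threshold `min_n` are assumptions made for illustration.

```r
# Toy stand-in for the merged data (values are made up)
toy <- data.frame(
  title  = c("Movie A", "Movie B", "Movie B", "Movie B", "Movie C", "Movie C"),
  rating = c(5.0, 4.0, 4.5, 3.5, 2.0, 3.0),
  stringsAsFactors = FALSE
)

# Mean rating and number of ratings per title
mean_by_title <- tapply(toy$rating, toy$title, mean)
n_by_title    <- tapply(toy$rating, toy$title, length)
per_title <- data.frame(
  title = names(mean_by_title),
  mean  = as.vector(mean_by_title),
  n     = as.vector(n_by_title),
  stringsAsFactors = FALSE
)

# Rank only titles with at least min_n ratings (the threshold is an assumption)
min_n  <- 2
ranked <- per_title[per_title$n >= min_n, ]
ranked <- ranked[order(-ranked$mean), ]
ranked
```

In this toy example Movie A drops out despite its mean of 5, because that mean rests on a single rating. The same pattern could be applied to the tag and genre rankings.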
