A recommendation engine filters data using different algorithms and recommends the most relevant items to users. It first captures a customer's past behaviour and, based on that, recommends products the customer is likely to buy. The global recommendation engine market is expected to grow from USD 801.1 million in 2017 to USD 4,414.8 million by 2022, at a compound annual growth rate (CAGR) of 40.7% during the forecast period.
Several industry verticals, such as retail, media and entertainment, and transportation, have deployed recommendation engines for various applications, including personalized campaigns and customer discovery, product planning, strategy and operations planning, and proactive asset management.
In terms of revenue, the retail end-user segment is expected to be the highest contributor during the forecast period, while the media and entertainment segment is projected to grow at the highest CAGR. Recommendation engines are classified into three types: content-based, collaborative, and hybrid approaches.
Major Challenges to Be Addressed
Overspecialization: the system does not recommend items that differ from anything the user has seen before. This can be a problem because the user might want to try something new, and the system would never make that happen. Serendipity (variety in recommendations) is ignored. The user should therefore be presented with a range of options rather than only a few narrowly selected alternatives.
Limited content analysis: two different items may be represented by the same set of attributes, in which case they cannot be differentiated.
New user problem: a new user has not yet provided enough ratings, so the system cannot generate accurate recommendations for them.
Our research mainly revolves around what Netflix has become famous for: its recommendation engine. Recommendation engines typically work in this manner. A user watches a movie and rates it. As the user watches more movies, more data is collected. This data describes the user's preferences and can be used to recommend other movies to the user. The research objective is to determine what will accurately describe users' movie preferences.
Recommendation engines are very powerful personalization tools because they are a great way to do “discovery”: showing people items they will like but are unlikely to discover by themselves. They improve a visitor’s experience by offering relevant items at the right time and on the right page.
Because recommendation engines boost subscriber numbers through engagement and stickiness, facilitating such serendipitous discovery has turned into a high-stakes, multi-billion-dollar race for the world’s biggest digital companies.
The customer personalization journeys of Amazon and Netflix demonstrate just how powerful recommendation engines can be.
On-demand streaming video is probably the world’s biggest market for digital consumption of content.
The savings produced by the Netflix algorithm show up through increased viewership and lower churn.
Strong recommendations also increase the amount of time viewers spend watching content on Netflix, keeping subscriber churn as low as possible.
According to a paper published by Netflix executives, the on-demand video streaming service claims its AI-assisted recommendation system saves the company $1 billion per year. This means Netflix can confidently spend huge sums ($6 billion a year) on new content, knowing viewers will consume enough over time to give it healthy returns on the investment.
Updates to the algorithms are researched and tested by a team of over 70 engineers. In 2006, Netflix launched an open competition, the Netflix Prize, offering $1 million to any research team that could improve the accuracy of its rating predictions by 10%; the prize was awarded in 2009. The Netflix Prize was an important event in the development of content discovery systems, shining a light on recommendation engine technology and bringing new machine learning scientists to the topic.
GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens’s research projects have explored a variety of fields including: 1. recommender systems 2. online communities 3. mobile and ubiquitous technologies 4. digital libraries 5. local geographic information systems
GroupLens Research operates MovieLens, a movie recommender based on collaborative filtering, which is the source of these data.
This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in six files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.
The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes (").
User Ids
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).
Movie Ids
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following variables:
UserId - unique identifier to ID the user
MovieId - unique identifier to ID the movie
Rating - the rating the user gave the movie
Timestamp - the time when the user rated
The lines within this file are ordered first by userId, then, within user, by movieId.
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
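As a minimal sketch of how such a timestamp can be turned into a readable date in R, the snippet below converts one value with as.POSIXct. The value of ts is a hypothetical example in the same format as the timestamp column, not a row quoted from ratings.csv, and the variable names are illustrative.

```r
# Hypothetical timestamp in the ratings.csv format (seconds since the Unix epoch)
ts <- 789652009

# Convert to a date-time; the origin and time zone follow the definition above
rated_at <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
print(format(rated_at, "%Y-%m-%d %H:%M:%S"))

# Derived variable: the calendar year in which the rating was made
rating_year <- as.integer(format(rated_at, "%Y"))
print(rating_year)
```

The same call works on a whole column, for example on ratings$timestamp, because as.POSIXct is vectorized.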
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following variables: 1. movieId 2. title 3. genres
MovieId - unique identifier to ID the movie
Title - the title of the movie. Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
Genres - the genres associated with the movie. Genres are a pipe-separated list, and are selected from the following: 1. Action 2. Adventure 3. Animation 4. Children’s 5. Comedy 6. Crime 7. Documentary 8. Drama 9. Fantasy 10. Film-Noir 11. Horror 12. Musical 13. Mystery 14. Romance 15. Sci-Fi 16. Thriller 17. War 18. Western 19. (no genres listed)
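Because the genres are stored as one pipe-separated string and the release year is embedded in the title, both usually need to be unpacked before modelling. The following is a small sketch on two hand-typed rows (movies_toy is a hypothetical stand-in for movies.csv); it builds a 0/1 genre indicator matrix and extracts the year, assuming the title ends with the year in parentheses.

```r
# Two hand-typed rows standing in for movies.csv (illustrative only)
movies_toy <- data.frame(
  movieId = c(1, 2),
  title   = c("Toy Story (1995)", "Jumanji (1995)"),
  genres  = c("Adventure|Animation|Children|Comedy|Fantasy",
              "Adventure|Children|Fantasy"),
  stringsAsFactors = FALSE
)

# Split on the literal pipe character (fixed = TRUE avoids regex alternation)
genre_list <- strsplit(movies_toy$genres, "|", fixed = TRUE)
all_genres <- sort(unique(unlist(genre_list)))

# One row per movie, one 0/1 column per genre
genre_matrix <- t(sapply(genre_list, function(g) as.integer(all_genres %in% g)))
dimnames(genre_matrix) <- list(movies_toy$title, all_genres)
print(genre_matrix)

# Release year, assuming the title ends with "(YYYY)"
release_year <- as.integer(
  sub(".*\\(([0-9]{4})\\)[[:space:]]*$", "\\1", movies_toy$title)
)
print(release_year)
```

Titles that do not follow the "(YYYY)" pattern would yield NA with a coercion warning, which is one way to surface the title inconsistencies mentioned above.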
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following variables: 1. movieId 2. imdbId 3. tmdbId
movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.
imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.
tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
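The three identifiers can be turned back into URLs with sprintf, as sketched below for the Toy Story example. The sketch assumes that imdbId has been read as a number, so the leading zero of tt0114709 is lost and has to be restored by padding to seven digits; links_toy is a hypothetical one-row stand-in for links.csv.

```r
# One hand-typed row standing in for links.csv (illustrative only)
links_toy <- data.frame(movieId = 1L, imdbId = 114709L, tmdbId = 862L)

movielens_url <- sprintf("https://movielens.org/movies/%d", links_toy$movieId)

# Assumption: IMDb ids are zero-padded to seven digits after the "tt" prefix
imdb_url <- sprintf("http://www.imdb.com/title/tt%07d/", links_toy$imdbId)

# Some movies have no tmdbId, so keep NA instead of building a broken URL
tmdb_url <- ifelse(
  is.na(links_toy$tmdbId),
  NA_character_,
  sprintf("https://www.themoviedb.org/movie/%d", links_toy$tmdbId)
)

print(c(movielens_url, imdb_url, tmdb_url))
```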
Movie recommendation is based on customers' consumption patterns and the ratings they give to movies, which depend on attributes such as genre, director and actors. User groups can be found with an unsupervised machine learning technique such as k-means clustering or neural networks, applied to selected variables and derived variables (combinations of the given variables). For example, ratings history: for a given director; for a given combination of genre and director; for a given lead actor; for a given combination of lead actor and director; and so on.
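To make the clustering idea concrete, here is a minimal sketch using kmeans from base R. The feature table is simulated: each hypothetical user is described by their mean rating for two genres, which is only one possible derived variable and not the feature set used in this project. The number of clusters (2) is likewise an assumption chosen to match the simulated data.

```r
set.seed(42)
n <- 10

# Simulated derived variables: mean rating per genre for two kinds of users
action_fans <- cbind(
  Action  = pmin(5, rnorm(n, mean = 4.5, sd = 0.2)),
  Romance = rnorm(n, mean = 2.0, sd = 0.2)
)
romance_fans <- cbind(
  Action  = rnorm(n, mean = 2.0, sd = 0.2),
  Romance = pmin(5, rnorm(n, mean = 4.5, sd = 0.2))
)
user_features <- rbind(action_fans, romance_fans)

# k-means with several random starts to reduce sensitivity to initialization
fit <- kmeans(user_features, centers = 2, nstart = 10)

# Compare the recovered segments with the simulated groups
true_group <- rep(c("action", "romance"), each = n)
print(table(segment = fit$cluster, group = true_group))
print(fit$centers)
```

On real data the number of clusters is not known in advance, so it would have to be chosen, for example by comparing the within-cluster sum of squares (fit$tot.withinss) across several values of centers.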
After clustering, we intend to use segmentation techniques to dissect the clusters, arrive at different psychographic profiles for the users, and develop a recommendation engine that generates highly accurate recommendations for each segment. To improve model accuracy we intend to use supervised learning techniques such as regression, support vector machines and naive Bayes.
We intend to use a combination of content-based and collaborative filtering methods which: - use attributes of items and users - recommend items similar to those the user liked in the past - recommend items liked by similar users - enable exploration of diverse content
Content-based methods rely on item similarity. Association rules can also be used for recommendation: items that are frequently consumed together are connected by an edge in a graph. In such a graph you can see clusters of best sellers (densely connected items that almost everybody interacted with) and small, separate clusters of niche content.
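The "consumed together" graph can be sketched with a co-occurrence matrix. The example below uses a hypothetical 4-user by 4-item interaction matrix and an assumed threshold of two shared users for drawing an edge; it is a simplified stand-in for association rule mining, not a full implementation of it.

```r
# Toy binary interaction matrix: rows are users, columns are items (hypothetical)
interactions <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("A", "B", "C", "D"))
)

# Entry [i, j] counts the users who interacted with both item i and item j
co_occurrence <- crossprod(interactions)
diag(co_occurrence) <- 0

# Keep an edge when at least min_support users consumed both items (assumed threshold)
min_support <- 2
edges <- which(co_occurrence >= min_support & upper.tri(co_occurrence),
               arr.ind = TRUE)
edge_list <- data.frame(
  from   = rownames(co_occurrence)[edges[, "row"]],
  to     = colnames(co_occurrence)[edges[, "col"]],
  weight = co_occurrence[edges]
)
print(edge_list)
```

In this toy data item D ends up with no edges, which mirrors the small separated clusters of niche content described above.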
Collaborative methods work with the interaction matrix, which is also called the rating matrix when users provide explicit ratings of items (the less common case). The task of machine learning is to learn a function that predicts the utility of each item to each user. The matrix is typically huge and very sparse: most of its values are missing.
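The sketch below illustrates both points on a small scale. It first computes how sparse the rating matrix of this dataset would be from the counts quoted earlier, then predicts one missing entry of a hypothetical 4 by 4 rating matrix with a simple user-based scheme (cosine similarity over co-rated movies, similarity-weighted average). The helper names cosine_sim and predict_rating are illustrative, and this is one basic variant of collaborative filtering rather than the method used by MovieLens or Netflix.

```r
# Share of the user-by-movie matrix that is actually filled in
n_ratings <- 20000263
n_users   <- 138493
n_movies  <- 27278
density   <- n_ratings / (n_users * n_movies)
print(round(100 * density, 2))  # percentage of observed entries

# Toy rating matrix with missing values (hypothetical users and movies)
R <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  2, NA,
     1, NA,  5,  4,
    NA,  1,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Cosine similarity computed only over the movies both users rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(0)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# Similarity-weighted average of the ratings other users gave to the item
predict_rating <- function(R, user, item) {
  others <- setdiff(rownames(R)[!is.na(R[, item])], user)
  sims <- sapply(others, function(o) cosine_sim(R[user, ], R[o, ]))
  sum(sims * R[others, item]) / sum(abs(sims))
}

pred <- predict_rating(R, "u1", "m3")
print(pred)
```

A dense matrix like R is only workable for toy data; at the scale of this dataset a sparse representation would be needed.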
# Folder holding the unzipped ml-20m files (author's local path)
data_dir <- "C:/Users/pgp33387/Documents/Mohan/DAM proj/ml-20m"
# stringsAsFactors = TRUE keeps text columns as factors (needed on R >= 4.0
# to reproduce the factor counts in the summaries that follow)
read_ml <- function(file) read.csv(file.path(data_dir, file), stringsAsFactors = TRUE)
ratings <- read_ml("ratings.csv")
tags <- read_ml("tags.csv")
movies <- read_ml("movies.csv")
links <- read_ml("links.csv")
gscores <- read_ml("genome-scores.csv")
gtags <- read_ml("genome-tags.csv")
#Summary of ratings dataset
summary(ratings)
## userId movieId rating timestamp
## Min. : 1 Min. : 1 Min. :0.500 Min. :7.897e+08
## 1st Qu.: 34395 1st Qu.: 902 1st Qu.:3.000 1st Qu.:9.668e+08
## Median : 69141 Median : 2167 Median :3.500 Median :1.104e+09
## Mean : 69046 Mean : 9042 Mean :3.526 Mean :1.101e+09
## 3rd Qu.:103637 3rd Qu.: 4770 3rd Qu.:4.000 3rd Qu.:1.226e+09
## Max. :138493 Max. :131262 Max. :5.000 Max. :1.428e+09
#Summary of tags dataset
summary(tags)
## userId movieId tag
## Min. : 18 Min. : 1 sci-fi : 3384
## 1st Qu.: 28780 1st Qu.: 2571 based on a book: 3281
## Median : 70201 Median : 7373 atmospheric : 2917
## Mean : 68712 Mean : 32628 comedy : 2779
## 3rd Qu.:107322 3rd Qu.: 62235 action : 2657
## Max. :138472 Max. :131258 (Other) :450530
## NA's : 16
## timestamp
## Min. :1.135e+09
## 1st Qu.:1.245e+09
## Median :1.302e+09
## Mean :1.299e+09
## 3rd Qu.:1.366e+09
## Max. :1.428e+09
##
#Summary of movies dataset
summary(movies)
## movieId title
## Min. : 1 20,000 Leagues Under the Sea (1997): 2
## 1st Qu.: 6931 Aladdin (1992) : 2
## Median : 68068 Beneath (2013) : 2
## Mean : 59855 Blackout (2007) : 2
## 3rd Qu.:100293 Casanova (2005) : 2
## Max. :131262 Chaos (2005) : 2
## (Other) :27266
## genres
## Drama : 4520
## Comedy : 2294
## Documentary : 1942
## Comedy|Drama : 1264
## Drama|Romance : 1075
## Comedy|Romance: 757
## (Other) :15426
#Summary of links dataset
summary(links)
## movieId imdbId tmdbId
## Min. : 1 Min. : 5 Min. : 2
## 1st Qu.: 6931 1st Qu.: 77417 1st Qu.: 15936
## Median : 68068 Median : 152435 Median : 39469
## Mean : 59855 Mean : 578186 Mean : 63847
## 3rd Qu.:100293 3rd Qu.: 906272 3rd Qu.: 82504
## Max. :131262 Max. :4530184 Max. :421510
## NA's :252
#Summary of gscores dataset
summary(gscores)
## movieId tagId relevance
## Min. : 1 Min. : 1.0 Min. :0.00025
## 1st Qu.: 2926 1st Qu.: 282.8 1st Qu.:0.02425
## Median : 6017 Median : 564.5 Median :0.05650
## Mean : 25843 Mean : 564.5 Mean :0.11648
## 3rd Qu.: 46062 3rd Qu.: 846.2 3rd Qu.:0.14150
## Max. :131170 Max. :1128.0 Max. :1.00000
#Summary of gtags dataset
summary(gtags)
## tagId tag
## Min. : 1.0 007 : 1
## 1st Qu.: 282.8 007 (series): 1
## Median : 564.5 18th century: 1
## Mean : 564.5 1920s : 1
## 3rd Qu.: 846.2 1930s : 1
## Max. :1128.0 1950s : 1
## (Other) :1122
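# Inner joins: only ratings whose user also tagged the same movie are kept,
# and each such rating is repeated once per tag the user applied
first <- merge(ratings, tags, by = c("userId", "movieId"))
sec <- merge(first, movies, by = "movieId")
third <- merge(sec, links, by = "movieId")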
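# The default merge is an inner join, which changes which ratings reach the
# later aggregates. A small sketch on hypothetical toy tables (ratings_toy and
# tags_toy are illustrative, not rows from the dataset) shows the effect and
# the all.x = TRUE alternative that keeps untagged ratings.

```r
# Three ratings, but only user 1 tagged movie 10 (twice)
ratings_toy <- data.frame(
  userId  = c(1, 1, 2),
  movieId = c(10, 20, 10),
  rating  = c(4.0, 3.5, 5.0)
)
tags_toy <- data.frame(
  userId  = c(1, 1),
  movieId = c(10, 10),
  tag     = c("sci-fi", "classic")
)

# Inner join: untagged ratings disappear, the tagged rating appears twice
inner <- merge(ratings_toy, tags_toy, by = c("userId", "movieId"))
print(inner)

# Left join: every rating is kept, with NA where no tag exists
left <- merge(ratings_toy, tags_toy, by = c("userId", "movieId"), all.x = TRUE)
print(left)
```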
# Highest rated movies
aggdata <- aggregate(third$rating, by = list(third$title), FUN = mean, na.rm = TRUE)
colnames(aggdata)[2] <- "rating"
newdata <- aggdata[order(-aggdata$rating), ]
head(newdata)
## Group.1 rating
## 3 'Salem's Lot (2004) 5
## 28 1066 (2009) 5
## 211 A Pigeon Sat on a Branch Reflecting on Existence (2014) 5
## 240 About Adam (2000) 5
## 250 Absentia (2011) 5
## 253 Absolute Giganten (1999) 5
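A plain mean puts any title with a single 5-star rating at the top, which is one likely reason every movie listed above has a mean of exactly 5. A common remedy is to require a minimum number of ratings before ranking. The sketch below shows the idea on hypothetical toy data; the titles and the threshold min_ratings are assumptions for illustration.

```r
# Toy ratings: one film with five ratings, one with a single 5-star rating
ratings_toy <- data.frame(
  title  = c(rep("Popular Film (2000)", 5), "Obscure Film (2001)"),
  rating = c(4, 4.5, 4, 3.5, 4.5, 5)
)

# Mean rating and number of ratings per title
mean_rating <- tapply(ratings_toy$rating, ratings_toy$title, mean)
n_ratings   <- tapply(ratings_toy$rating, ratings_toy$title, length)
stats <- data.frame(
  title       = names(mean_rating),
  mean_rating = as.vector(mean_rating),
  n_ratings   = as.vector(n_ratings)
)

# Rank only titles with enough ratings (threshold is an assumption)
min_ratings <- 3
eligible <- stats[stats$n_ratings >= min_ratings, ]
ranked <- eligible[order(-eligible$mean_rating), ]
print(ranked)
```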
# Highest rated tags
aggdata <- aggregate(third$rating, by = list(third$tag), FUN = mean, na.rm = TRUE)
colnames(aggdata)[2] <- "rating"
newdata <- aggdata[order(-aggdata$rating), ]
head(newdata)
## Group.1
## 26 'Take This Sinking Boat & Point It Home You Hoover Fixer Sucker Guuuyyy!'
## 34 !950's Superman TV show
## 38 "A Mão-de-Deus"
## 47 "Ghost for adults"
## 55 "meet me in montauk"
## 63 "Piggy"
## rating
## 26 5
## 34 5
## 38 5
## 47 5
## 55 5
## 63 5
# Highest rated genres
aggdata <- aggregate(third$rating, by = list(third$genres), FUN = mean, na.rm = TRUE)
colnames(aggdata)[2] <- "rating"
newdata <- aggdata[order(-aggdata$rating), ]
head(newdata)
## Group.1 rating
## 88 Action|Adventure|Documentary 5
## 96 Action|Adventure|Drama|Horror|Thriller 5
## 114 Action|Adventure|Fantasy|Horror|Romance 5
## 460 Adventure|Children|Drama|Romance 5
## 489 Adventure|Comedy|Fantasy|Musical 5
## 513 Adventure|Crime|Drama|Western 5
#Correlation of time with rating
# timestamp.x is the rating timestamp (timestamp.y comes from tags)
correlation <- third[, c("timestamp.x", "rating")]
cor(correlation)
## timestamp.x rating
## timestamp.x 1.00000000 -0.05882623
## rating -0.05882623 1.00000000