The main purpose of the “Data analysis and recommendation on Top movies” project is to provide a user-friendly recommendation system that helps users pick from the top trending movies using a list of suggestions. The project is written in R, a programming language for data science, so that the structure of the system stays transparent and its usage and functionality are easy to understand. A recommendation engine is a machine learning system that helps a user make better selections based on the information and suggestions it provides. The recommendations can be customized for every user with a filtering method called content-based filtering, which lets the system take inputs such as top-rated movies, newly released movies, most watched movies, user preferences, and search history. These inputs are processed and analyzed by machine learning algorithms to create the list of recommended top movies.
The system can also detect how similar two movies are based on certain factors. For example, if a user has a record of watching a specific genre, the system suggests highly rated and top trending movies in that same genre. The system can also provide recommendations using collaborative filtering. This method produces suggestions for a user from the preferences of a few other users who have similar interests in movies and whose watch and search histories are similar to theirs, so the user is offered more movies that match their interests. In this R project, an item-based collaborative filtering approach is used to build the recommendation engine.
• Identify top trending movies based on the greatest number of likes.
• Identify and generate a list of previously watched movies and search history from the users.
• Identify the similarities between two different users who might have the same interest in specific movie genres.
• Obtain a movie recommendation system from the data analysis process.
For our recommendation system based on top movies, we used the IMDB-Dataset, downloaded from an open-source data website. The dataset consists of two CSV files, movies.csv and ratings.csv.
3.0 DATA PREPARATION
In the process of preparing data for analysis, several tasks need to be performed, such as data reading, data merging, data cleaning, and data selection.
Reading Data:
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
library(reshape2)
library(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
getwd()
## [1] "C:/Study/Sem 4/Programming DS/Project"
setwd("C:/Study/Sem 4/Programming DS/Project")
movie_data <- read.csv("movies.csv",stringsAsFactors=FALSE)
rating_data <- read.csv("ratings.csv")
str(movie_data)
## 'data.frame': 10329 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
## $ genres : chr "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
str(rating_data)
## 'data.frame': 105339 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 16 24 32 47 50 110 150 161 165 204 ...
## $ rating : num 4 1.5 4 4 4 4 3 4 3 0.5 ...
## $ timestamp: int 1217897793 1217895807 1217896246 1217896556 1217896523 1217896150 1217895940 1217897864 1217897135 1217895786 ...
Data analysis is the step where procedures such as data cleaning, data transformation, and data modeling are carried out in order to reach informative conclusions and to support effective decisions. In this project, the analysis follows the needs of the project: to identify the most popular and top-rated movies by analyzing how the audience rated each movie, and to determine the relationship between two individuals whose recommendation lists show the same preferences, as predicted by the machine learning algorithm from their previous choices and ratings. This stage of analysis can be divided into the following levels:
1. Descriptive Analysis
• Top rated movies (shown via tables and graphs)
• New trending movies (shown through tables and graphs)
• Most searched movies (shown by graph)
2. Predictive Analysis - Predicting a list of suggestions in the same genre of movies, based on the user’s past choices and search history.
Here are some examples of the data analysis that has been done:
summary(movie_data)
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
head(movie_data)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
summary(rating_data)
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
head(rating_data)
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
# Split the pipe-separated genre string of each movie into separate columns
movie_genre <- as.data.frame(movie_data$genres, stringsAsFactors = FALSE)
movie_genre2 <- as.data.frame(tstrsplit(movie_genre[, 1], '[|]',
                                        type.convert = TRUE),
                              stringsAsFactors = FALSE)
colnames(movie_genre2) <- seq_len(ncol(movie_genre2))
list_genre <- c("Action", "Adventure", "Animation", "Children",
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")
# One row per movie plus a first row that holds the genre names
genre_mat1 <- matrix(0, nrow(movie_data) + 1, length(list_genre))
genre_mat1[1, ] <- list_genre
colnames(genre_mat1) <- list_genre
# Set the cell to 1 for every genre that appears for a movie
for (index in 1:nrow(movie_genre2)) {
  for (col in 1:ncol(movie_genre2)) {
    gen_col <- which(genre_mat1[1, ] == movie_genre2[index, col])
    genre_mat1[index + 1, gen_col] <- 1
  }
}
# Drop the row of genre names and convert to a data frame
genre_mat2 <- as.data.frame(genre_mat1[-1, ], stringsAsFactors = FALSE)
# Convert every genre column from character to integer
for (col in 1:ncol(genre_mat2)) {
  genre_mat2[, col] <- as.integer(genre_mat2[, col])
}
str(genre_mat2) # inspect the result once, after the loop
## 'data.frame': 10329 obs. of 18 variables:
## $ Action : int 0 0 0 0 0 1 0 0 1 1 ...
## $ Adventure : int 1 1 0 0 0 0 0 1 0 1 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Children : int 1 1 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 1 1 1 0 1 0 0 0 ...
## $ Crime : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Fantasy : int 1 1 0 0 0 0 0 0 0 0 ...
## $ Film-Noir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 1 1 0 0 1 0 0 0 ...
## $ Sci-Fi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Thriller : int 0 0 0 0 0 1 0 0 0 1 ...
## $ War : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
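The genre matrix built by the nested loop above can also be produced without looping over individual cells. The following is a minimal sketch, assuming a small hand-made `genres` vector and a shortened `list_genre` (both are illustrative stand-ins for `movie_data$genres` and the full 18-genre list); the helper name `encode_genres` is hypothetical.

```r
# Hypothetical helper: one row per movie, one 0/1 column per genre
encode_genres <- function(genres, list_genre) {
  parts <- strsplit(genres, "|", fixed = TRUE)   # split "A|B|C" into c("A", "B", "C")
  onehot <- t(vapply(parts,
                     function(g) as.integer(list_genre %in% g),
                     integer(length(list_genre))))
  colnames(onehot) <- list_genre
  onehot
}

# Illustrative inputs, not the real dataset
list_genre <- c("Action", "Adventure", "Comedy", "Romance")
genres <- c("Adventure|Comedy", "Comedy|Romance", "Action")
genre_flags <- encode_genres(genres, list_genre)
print(genre_flags)
```

Matching whole genre names with `%in%` avoids partial matches that a substring search could produce.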
SearchMatrix <- cbind(movie_data[,1:2], genre_mat2[])
head(SearchMatrix)
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
# rating_data is a data.frame, so call reshape2::dcast explicitly instead of relying on data.table's deprecated redirection
ratingMatrix <- reshape2::dcast(rating_data, userId ~ movieId, value.var = "rating", na.rm = FALSE)
ratingMatrix <- as.matrix(ratingMatrix[,-1])
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
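The reshaping step turns the long table of ratings (one row per user and movie) into a wide user-by-movie matrix in which unrated cells are missing. With 105339 ratings spread over 668 x 10325 cells, only about 1.5% of the matrix is filled. A minimal base R sketch of the same idea, using a made-up `toy_ratings` table (an assumption for illustration, not the project's data):

```r
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 3),
  movieId = c(10, 20, 10, 30),
  rating  = c(4, 1.5, 5, 3)
)

# tapply() leaves user/movie combinations without a rating as NA
wide <- with(toy_ratings, tapply(rating, list(userId, movieId), mean))
print(wide)

# Share of cells that actually hold a rating
density <- mean(!is.na(wide))
print(density)
```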
recommendation_model <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommendation_model)
## [1] "HYBRID_realRatingMatrix" "ALS_realRatingMatrix"
## [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"
## [5] "LIBMF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [7] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [9] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [11] "UBCF_realRatingMatrix"
lapply(recommendation_model, "[[", "description")
## $HYBRID_realRatingMatrix
## [1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
##
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
recommendation_model$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
similarity_mat <- similarity(ratingMatrix[1:4, ],
method = "cosine",
which = "users")
as.matrix(similarity_mat)
## 1 2 3 4
## 1 NA 0.9880430 0.9820862 0.9957199
## 2 0.9880430 NA 0.9962866 0.9687126
## 3 0.9820862 0.9962866 NA 0.9944484
## 4 0.9957199 0.9687126 0.9944484 NA
image(as.matrix(similarity_mat), main = "User's Similarities")
movie_similarity <- similarity(ratingMatrix[, 1:4], method =
"cosine", which = "items")
as.matrix(movie_similarity)
## 1 2 3 4
## 1 NA 0.9834866 0.9779671 0.9550638
## 2 0.9834866 NA 0.9829378 0.9706208
## 3 0.9779671 0.9829378 NA 0.9932438
## 4 0.9550638 0.9706208 0.9932438 NA
image(as.matrix(movie_similarity), main = "Movies similarity")
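Cosine similarity compares two rating vectors by the angle between them. The sketch below is a plain base R illustration, assuming the convention that only users who rated both movies contribute. It is not taken from recommenderlab and may differ from how `similarity()` treats missing ratings. The function name `cosine_sim` and the toy vectors are hypothetical.

```r
# Hypothetical helper: cosine similarity over co-rated entries only
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)   # users who rated both items (assumed convention)
  sum(x[both] * y[both]) /
    (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

item_a <- c(4, 5, NA, 3)   # ratings of movie A by four users
item_b <- c(4, 4, 2, NA)   # ratings of movie B by the same users
sim_ab <- cosine_sim(item_a, item_b)
print(sim_ab)
```

Because all ratings are positive, raw cosine values tend to lie close to 1, which may explain why the entries in the two matrices above fall between roughly 0.95 and 1.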
rating_values <- as.vector(ratingMatrix@data)
unique(rating_values) # extracting unique ratings
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
Table_of_Ratings <- table(rating_values) # creating a count of movie ratings
Table_of_Ratings
## rating_values
## 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
## 6791761 1198 3258 1567 7943 5484 21729 12237 28880 8187
## 5
## 14856
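In the table above the value 0 dominates because the sparse matrix stores unrated cells as 0, while the lowest actual rating in `rating_data` is 0.5. The count of zeros, 6791761, is exactly the 668 x 10325 cells minus the 105339 ratings. A small sketch of dropping those placeholder zeros before counting, using an invented vector in place of `rating_values`:

```r
toy_values <- c(0, 0, 4, 3.5, 0, 5, 4, 0)   # illustrative stand-in for rating_values

observed <- toy_values[toy_values != 0]      # keep only actual ratings
rating_counts <- table(observed)
print(rating_counts)
```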
movie_views <- colCounts(ratingMatrix) # count views for each movie
table_views <- data.frame(movie = names(movie_views),
views = movie_views) # create dataframe of views
table_views <- table_views[order(table_views$views,
decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
# look up the title of each movie by its movieId
for (index in seq_len(nrow(table_views))) {
  table_views[index, 3] <- as.character(subset(movie_data,
                                 movie_data$movieId == table_views[index, 1])$title)
}
table_views[1:6,]
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:6, ], aes(x = title, y = views)) +
geom_bar(stat="identity", fill = 'steelblue') +
geom_text(aes(label=views), vjust=-0.3, size=3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Total Views of the Top Films")
image(ratingMatrix[1:20, 1:25], axes = FALSE, main = "Heatmap of the first 20 rows and 25 columns")
movie_ratings <- ratingMatrix[rowCounts(ratingMatrix) > 50,
colCounts(ratingMatrix) > 50]
movie_ratings
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
minimum_movies<- quantile(rowCounts(movie_ratings), 0.98)
minimum_users <- quantile(colCounts(movie_ratings), 0.98)
image(movie_ratings[rowCounts(movie_ratings) > minimum_movies,
colCounts(movie_ratings) > minimum_users],
main = "Heatmap of the top users and movies")
average_ratings <- rowMeans(movie_ratings)
qplot(average_ratings, fill=I("steelblue"), col=I("red")) +
ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
normalized_ratings <- normalize(movie_ratings)
sum(rowMeans(normalized_ratings) > 0.00001)
## [1] 0
image(normalized_ratings[rowCounts(normalized_ratings) > minimum_movies,
colCounts(normalized_ratings) > minimum_users],
main = "Normalized Ratings of the Top Users")
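Centering, the normalization listed as `"center"` in the IBCF parameters, subtracts each user's mean rating from that user's ratings, which is why no row mean exceeds 0.00001 above. A minimal base R sketch on a toy matrix, assuming row-wise centering (the matrix `m` is made up for illustration):

```r
m <- matrix(c(4, NA, 2,
              5, 3, NA), nrow = 2, byrow = TRUE)

user_means <- rowMeans(m, na.rm = TRUE)   # mean of each user's observed ratings
centered <- m - user_means                # recycled down columns, so row i loses user_means[i]
print(centered)
print(rowMeans(centered, na.rm = TRUE))
```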
binary_minimum_movies <- quantile(rowCounts(movie_ratings), 0.95)
binary_minimum_users <- quantile(colCounts(movie_ratings), 0.95)
#movies_watched <- binarize(movie_ratings, minRating = 1)
good_rated_films <- binarize(movie_ratings, minRating = 3)
image(good_rated_films[rowCounts(movie_ratings) > binary_minimum_movies,
colCounts(movie_ratings) > binary_minimum_users],
main = "Heatmap of the top users and movies")
sampled_data<- sample(x = c(TRUE, FALSE),
size = nrow(movie_ratings),
replace = TRUE,
prob = c(0.8, 0.2))
training_data <- movie_ratings[sampled_data, ]
testing_data <- movie_ratings[!sampled_data, ]
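The split above draws TRUE or FALSE independently for every user, so the training share is only approximately 80%, and the result changes on every run. A sketch of making the draw repeatable with `set.seed()`; the seed value 123 is an arbitrary assumption and would not reproduce the numbers reported in this document:

```r
set.seed(123)   # arbitrary seed, chosen only for illustration

n_users <- 420  # number of rows in movie_ratings in this report
in_train <- sample(x = c(TRUE, FALSE),
                   size = n_users,
                   replace = TRUE,
                   prob = c(0.8, 0.2))

n_train <- sum(in_train)
n_test <- sum(!in_train)
print(c(train = n_train, test = n_test))
```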
recommendation_system <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommendation_system$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recommen_model <- Recommender(data = training_data,
method = "IBCF",
parameter = list(k = 30))
recommen_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 328 users.
class(recommen_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
model_info <- getModel(recommen_model)
class(model_info$sim)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(model_info$sim)
## [1] 447 447
top_items <- 20
image(model_info$sim[1:top_items, 1:top_items],
main = "Heatmap of the first rows and columns")
sum_rows <- rowSums(model_info$sim > 0)
table(sum_rows)
## sum_rows
## 30
## 447
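Every one of the 447 rows of the similarity matrix has exactly 30 positive entries, which matches the parameter k = 30: only the most similar items are kept for each movie. The sketch below illustrates that pruning idea on a toy matrix with k = 2; the function `keep_top_k` is hypothetical and is not recommenderlab's implementation.

```r
# Hypothetical helper: keep the k largest similarities in each row
keep_top_k <- function(sim, k) {
  diag(sim) <- 0                                    # an item is not its own neighbour
  for (i in seq_len(nrow(sim))) {
    cutoff <- sort(sim[i, ], decreasing = TRUE)[k]  # k-th largest similarity in row i
    sim[i, sim[i, ] < cutoff] <- 0                  # drop everything below it
  }
  sim
}

toy_sim <- matrix(c(1.0, 0.9, 0.5, 0.1,
                    0.9, 1.0, 0.3, 0.7,
                    0.5, 0.3, 1.0, 0.2,
                    0.1, 0.7, 0.2, 1.0), nrow = 4, byrow = TRUE)
pruned <- keep_top_k(toy_sim, k = 2)
print(rowSums(pruned > 0))
```

In this simple version, tied similarities at the cutoff would all be kept, so a row could end up with more than k entries.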
sum_cols <- colSums(model_info$sim > 0)
qplot(sum_cols, fill=I("steelblue"), col=I("red"))+ ggtitle("Distribution of the column count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
top_recommendations <- 10 # the number of items to recommend to each user
predicted_recommendations <- predict(object = recommen_model,
newdata = testing_data,
n = top_recommendations)
predicted_recommendations
## Recommendations as 'topNList' with n = 10 for 92 users.
user1 <- predicted_recommendations@items[[1]] # recommendation for the first user
movies_user1 <- predicted_recommendations@itemLabels[user1]
movies_user2 <- movies_user1
# replace each recommended movieId with its title
for (index in seq_along(movies_user1)) {
  movies_user2[index] <- as.character(subset(movie_data,
                               movie_data$movieId == movies_user1[index])$title)
}
movies_user2
## [1] "South Park: Bigger, Longer and Uncut (1999)"
## [2] "Wizard of Oz, The (1939)"
## [3] "Gattaca (1997)"
## [4] "Chinatown (1974)"
## [5] "Annie Hall (1977)"
## [6] "One Flew Over the Cuckoo's Nest (1975)"
## [7] "2001: A Space Odyssey (1968)"
## [8] "Christmas Story, A (1983)"
## [9] "Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)"
## [10] "Taxi Driver (1976)"
recommendation_matrix <- sapply(predicted_recommendations@items,
function(x){ as.integer(colnames(movie_ratings)[x]) }) # matrix with the recommendations for each user
# dim(recommendation_matrix)
recommendation_matrix[,1:4]
## 0 1 2 3
## [1,] 2700 25 1206 11
## [2,] 919 47 1213 17
## [3,] 1653 62 7438 21
## [4,] 1252 111 1263 25
## [5,] 1230 474 2858 36
## [6,] 1193 508 111 62
## [7,] 924 529 1193 111
## [8,] 2804 590 1198 165
## [9,] 4973 858 1208 236
## [10,] 111 913 1210 260
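Item-based collaborative filtering scores a movie the user has not rated from the user's ratings of similar movies. A minimal sketch of one common weighted-average formulation, assuming a toy similarity matrix and a single user; all names are hypothetical and recommenderlab's internal computation may differ in detail:

```r
items <- c("A", "B", "C")
toy_sim <- matrix(c(1.0, 0.9, 0.2,
                    0.9, 1.0, 0.4,
                    0.2, 0.4, 1.0),
                  nrow = 3, byrow = TRUE, dimnames = list(items, items))
user <- c(A = 5, B = NA, C = 2)   # this user has not rated movie B

rated <- !is.na(user)
scores <- sapply(names(user)[!rated], function(i) {
  w <- toy_sim[i, rated]                # similarity of movie i to each rated movie
  sum(w * user[rated]) / sum(abs(w))    # similarity-weighted average rating
})
print(scores)
```

Sorting such scores in decreasing order and keeping the first n gives a top-n list of the kind returned by `predict()` above.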
Data visualization is the process of transforming the information obtained from data analysis into a visual presentation. It is one of the steps in the data mining process and comes after data analysis. Its main aim is to make data easier for people to understand by providing simple visual contexts such as graphs, charts, and maps. This helps people to see and identify patterns and behavior, and to predict future outcomes from big data. The data analysis is visualized through the Analysis of Top Movies application, which was developed using the latest data analytics software and technology such as R, RStudio, and Shiny, and can be accessed through web browsers and mobile phones at the following address: Analysis of Top Movies.
A project report is one of the most important deliverables for a project developer, because it presents the finished product and its accuracy and confirms that it covers all the necessary objectives of the project. This project report was generated with R Markdown, and it can be accessed at the following address: http://rpubs.com/WanAtiq/r-python
In today’s world of technology, data plays a vital role in all aspects of life. Data analysis makes life easier by helping people make faster and better decisions with less concern about inaccurate results. A movie recommendation system is one of the most popular machine learning applications. It uses advanced filtering methods to give a user a shortcut: they can select and watch movies from a suggested list without spending hours searching. These recommendations mostly match the user’s interests, because the predictions are based on their history of preferences, choices, and behavior. This project provides a recommendation list for a user with the help of data analysis based on the collaborative filtering method. This method finds close similarities between the preference lists of different users based on the movies they have rated before, which helps the system suggest more movies from users who watched and liked similar genres. Here are some conclusions that can be drawn from this project:
- Uses the latest R, RStudio, and Shiny data analytics software platforms, which are open-source software.
- Has analytical functions and a Graphical User Interface (GUI) that support an analytical and user-friendly data platform.
- Able to run on various operating platforms.
- Development can be done in a short time.
- The system is accessible 24/7 from multiple devices (web, mobile).