2.0 Project Requirement

2.1 Introduction to the project

The main goal of the “Data analysis and recommendation on Top movies” project is to provide a user-friendly recommendation system that helps users pick from the top trending movies using a list of suggestions. The project is written in R, a programming language widely used in data science, with the aim of keeping the structure transparent so that the usage and functionality of the recommendation system are easy to understand. A recommendation engine helps a user make better selections based on the information and suggestions produced by a machine learning system. These recommendations can be customized for every user through content-based filtering, which lets the system take inputs such as top-rated movies, newly released movies, most-watched movies, user preferences, and search history. This input is processed and analyzed by machine learning algorithms to create the list of recommended top movies.

The system can also detect the similarity between two movies based on certain factors. For example, if a user has a record of watching a specific genre of movies, the system suggests highly rated and top trending movies in that same genre. The system can also provide recommendations using collaborative filtering. This method draws suggestions for a user from other users who have similar interests in movies and whose watch and search histories are similar to theirs, so that the user sees more movies matching their interests. In this R project, item-based collaborative filtering (IBCF) has been used to build the recommendation engine.

2.2 SYSTEM FUNCTIONS

• Identify top trending movies based on the greatest number of likes.
• Identify and generate a list of previously watched movies and search history from the users.
• Identify the similarities between two different users who might have the same interest in specific movie genres.
• Obtain a movie recommendation system from the data analysis process.

2.3 DATA SOURCES

For this recommendation system, we used data retrieved from an open-source data website, from which the IMDB dataset was downloaded. The dataset consists of two CSV files, movies.csv and ratings.csv, which are read in Section 3.0.

2.4 APPLICATION DEVELOPMENT PLATFORM


3.0 Data Preparation

In the process of preparing data for analysis, several tasks need to be performed, such as data reading, data merging, data cleaning, and data selection.
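The code in this section covers the reading step, so the merging, cleaning, and selection tasks are sketched here on toy data. This is a minimal illustration under assumptions, not the project's actual preparation code: the data frames `movies` and `ratings`, their values, and the decision to drop missing and duplicated ratings are all hypothetical.

```r
# Toy stand-ins for movies.csv and ratings.csv (all values are made up)
movies <- data.frame(movieId = c(1, 2, 3),
                     title = c("Movie A", "Movie B", "Movie C"),
                     stringsAsFactors = FALSE)
ratings <- data.frame(userId  = c(1, 1, 2, 2, 2),
                      movieId = c(1, 2, 1, 2, 1),
                      rating  = c(4, 3.5, 5, NA, 5))

# Cleaning: drop rows with a missing rating, then repeated user-movie pairs
ratings_clean <- ratings[!is.na(ratings$rating), ]
is_dup <- duplicated(ratings_clean[, c("userId", "movieId")])
ratings_clean <- ratings_clean[!is_dup, ]

# Merging: attach titles to ratings through the shared movieId key
merged <- merge(ratings_clean, movies, by = "movieId")

# Selection: keep only the columns needed downstream
selected <- merged[, c("userId", "title", "rating")]
print(selected)
```

Because `merge()` performs an inner join by default, a movie with no ratings ("Movie C" here) drops out of the result.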

Reading Data:

library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
library(reshape2)
library(ggplot2)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt

getwd()
## [1] "C:/Study/Sem 4/Programming DS/Project"
setwd("C:/Study/Sem 4/Programming DS/Project")

movie_data <- read.csv("movies.csv",stringsAsFactors=FALSE)
rating_data <- read.csv("ratings.csv")
str(movie_data)
## 'data.frame':    10329 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
str(rating_data)
## 'data.frame':    105339 obs. of  4 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  16 24 32 47 50 110 150 161 165 204 ...
##  $ rating   : num  4 1.5 4 4 4 4 3 4 3 0.5 ...
##  $ timestamp: int  1217897793 1217895807 1217896246 1217896556 1217896523 1217896150 1217895940 1217897864 1217897135 1217895786 ...

4.0 Data Analysis

Data analysis is the step where procedures such as data cleaning, data transformation, and data modeling take place in order to draw informative conclusions and support effective decisions. In this project, the analysis follows the needs of the project, which aims to identify the most popular and top-rated movies by analyzing how the audience rated each movie, and to determine the relationship between two individuals whose recommendation lists, predicted from their previous choices and ratings, share the same preferences. These lists are generated by the machine learning algorithm. This stage of analysis can be categorized into the following levels:

1. Descriptive Analysis
• Top rated movies (shown via tables and graphs)
• New trending movies (shown through tables and graphs)
• Most searched movies (shown by graph)

2. Predictive Analysis
• Predicting a list of suggestions in the same genre of movies, based on the user’s past choices and search history.

Here are some examples of the data analysis that has been carried out:

summary(movie_data)
##     movieId          title              genres         
##  Min.   :     1   Length:10329       Length:10329      
##  1st Qu.:  3240   Class :character   Class :character  
##  Median :  7088   Mode  :character   Mode  :character  
##  Mean   : 31924                                        
##  3rd Qu.: 59900                                        
##  Max.   :149532
head(movie_data)
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller
summary(rating_data)
##      userId         movieId           rating        timestamp        
##  Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.286e+08  
##  1st Qu.:192.0   1st Qu.:  1073   1st Qu.:3.000   1st Qu.:9.711e+08  
##  Median :383.0   Median :  2497   Median :3.500   Median :1.115e+09  
##  Mean   :364.9   Mean   : 13381   Mean   :3.517   Mean   :1.130e+09  
##  3rd Qu.:557.0   3rd Qu.:  5991   3rd Qu.:4.000   3rd Qu.:1.275e+09  
##  Max.   :668.0   Max.   :149532   Max.   :5.000   Max.   :1.452e+09
head(rating_data)
##   userId movieId rating  timestamp
## 1      1      16    4.0 1217897793
## 2      1      24    1.5 1217895807
## 3      1      32    4.0 1217896246
## 4      1      47    4.0 1217896556
## 5      1      50    4.0 1217896523
## 6      1     110    4.0 1217896150
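The descriptive "top rated movies" analysis could be approached by averaging the ratings per movie and requiring a minimum number of ratings, so that a single enthusiastic vote does not dominate the ranking. The sketch below uses made-up ratings; the `min_ratings` threshold is an assumption, not a value from the project.

```r
# Toy ratings in the same long format as ratings.csv (values are made up)
ratings <- data.frame(movieId = c(1, 1, 1, 2, 2, 3),
                      rating  = c(4, 5, 3, 2, 3, 5))

# Mean rating and number of ratings per movie
mean_rating <- tapply(ratings$rating, ratings$movieId, mean)
n_ratings   <- tapply(ratings$rating, ratings$movieId, length)
summary_tbl <- data.frame(movieId     = names(mean_rating),
                          mean_rating = as.numeric(mean_rating),
                          n_ratings   = as.integer(n_ratings))

# Keep movies with enough ratings, then sort by mean rating
min_ratings <- 2  # hypothetical threshold
top_rated <- summary_tbl[summary_tbl$n_ratings >= min_ratings, ]
top_rated <- top_rated[order(top_rated$mean_rating, decreasing = TRUE), ]
print(top_rated)
```

Movie 3 has the highest mean but only one rating, so the threshold removes it from the ranking.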
# Split the pipe-delimited genre strings into one column per listed genre
movie_genre <- as.data.frame(movie_data$genres, stringsAsFactors=FALSE)
movie_genre2 <- as.data.frame(tstrsplit(movie_genre[,1],'[|]',
                                        type.convert=TRUE),
                              stringsAsFactors=FALSE) 
colnames(movie_genre2) <- seq_len(ncol(movie_genre2))
list_genre <- c("Action", "Adventure", "Animation", "Children", 
                "Comedy", "Crime","Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery","Romance",
                "Sci-Fi", "Thriller", "War", "Western")
# One row per movie plus a first row that holds the genre names
genre_mat1 <- matrix(0, nrow(movie_data) + 1, length(list_genre))
genre_mat1[1,] <- list_genre
colnames(genre_mat1) <- list_genre
# Set the matching genre column to 1 for every genre listed for a movie
for (index in 1:nrow(movie_genre2)) {
  for (col in 1:ncol(movie_genre2)) {
    gen_col <- which(genre_mat1[1,] == movie_genre2[index,col])
    genre_mat1[index+1,gen_col] <- 1
  }
}
# Drop the genre-name row; the values are still character strings here
genre_mat2 <- as.data.frame(genre_mat1[-1,], stringsAsFactors=FALSE) 
for (col in 1:ncol(genre_mat2)) {
  genre_mat2[,col] <- as.integer(genre_mat2[,col]) # convert from characters to integers
}
str(genre_mat2)
## 'data.frame':    10329 obs. of  18 variables:
##  $ Action     : int  0 0 0 0 0 1 0 0 1 1 ...
##  $ Adventure  : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Children   : int  1 1 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : int  1 0 1 1 1 0 1 0 0 0 ...
##  $ Crime      : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ Fantasy    : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ Film-Noir  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 1 1 0 0 1 0 0 0 ...
##  $ Sci-Fi     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Thriller   : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...
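The same 0/1 genre encoding can also be produced without the nested loop. The following is a vectorized sketch on three made-up genre strings and a shortened genre list, not the project's code; `genres` and `genre_levels` are illustrative names.

```r
# Three toy genre strings in the same pipe-delimited format as movies.csv
genres <- c("Adventure|Animation|Children",
            "Comedy|Romance",
            "Action|Crime|Thriller")
genre_levels <- c("Action", "Adventure", "Animation", "Children",
                  "Comedy", "Crime", "Romance", "Thriller")

# Split each string into its genre labels
split_genres <- strsplit(genres, "|", fixed = TRUE)

# For every movie, flag which known genres appear among its labels
genre_flags <- t(sapply(split_genres,
                        function(g) as.integer(genre_levels %in% g)))
colnames(genre_flags) <- genre_levels
genre_df <- as.data.frame(genre_flags)
print(genre_df)
```

Because `%in%` returns logical values that are converted straight to integers, no later character-to-integer conversion pass is needed.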
SearchMatrix <- cbind(movie_data[,1:2], genre_mat2[])
head(SearchMatrix)    
##   movieId                              title Action Adventure Animation
## 1       1                   Toy Story (1995)      0         1         1
## 2       2                     Jumanji (1995)      0         1         0
## 3       3            Grumpier Old Men (1995)      0         0         0
## 4       4           Waiting to Exhale (1995)      0         0         0
## 5       5 Father of the Bride Part II (1995)      0         0         0
## 6       6                        Heat (1995)      1         0         0
##   Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1        1      1     0           0     0       1         0      0       0
## 2        1      0     0           0     0       1         0      0       0
## 3        0      1     0           0     0       0         0      0       0
## 4        0      1     0           0     1       0         0      0       0
## 5        0      1     0           0     0       0         0      0       0
## 6        0      0     1           0     0       0         0      0       0
##   Mystery Romance Sci-Fi Thriller War Western
## 1       0       0      0        0   0       0
## 2       0       0      0        0   0       0
## 3       0       1      0        0   0       0
## 4       0       1      0        0   0       0
## 5       0       0      0        0   0       0
## 6       0       0      0        1   0       0
ratingMatrix <- reshape2::dcast(rating_data, userId~movieId, value.var = "rating", na.rm=FALSE) # call reshape2 explicitly; data.table's dcast expects a data.table
ratingMatrix <- as.matrix(ratingMatrix[,-1]) 
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
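To make the reshaping step concrete, the sketch below builds the same kind of user-by-movie matrix in base R from a toy long-format table. It illustrates the long-to-wide idea only and is not how `dcast` is implemented; the data values are made up.

```r
# Toy long-format ratings: one row per (user, movie) pair
ratings <- data.frame(userId  = c(1, 1, 2, 3),
                      movieId = c(10, 20, 10, 30),
                      rating  = c(4, 3, 5, 2))

users  <- sort(unique(ratings$userId))
movies <- sort(unique(ratings$movieId))

# Start from an all-NA matrix: NA means "this user has not rated this movie"
wide <- matrix(NA_real_, nrow = length(users), ncol = length(movies),
               dimnames = list(as.character(users), as.character(movies)))

# A two-column (row, column) index fills exactly one cell per rating
cell_index <- cbind(match(ratings$userId, users),
                    match(ratings$movieId, movies))
wide[cell_index] <- ratings$rating
print(wide)
```

Most cells stay `NA`, which is why a sparse class such as `realRatingMatrix` is a natural container for this kind of data.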
recommendation_model <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommendation_model)
##  [1] "HYBRID_realRatingMatrix"       "ALS_realRatingMatrix"         
##  [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"        
##  [5] "LIBMF_realRatingMatrix"        "POPULAR_realRatingMatrix"     
##  [7] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
##  [9] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [11] "UBCF_realRatingMatrix"
lapply(recommendation_model, "[[", "description")
## $HYBRID_realRatingMatrix
## [1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
## 
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
recommendation_model$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
## 
## $method
## [1] "cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE
similarity_mat <- similarity(ratingMatrix[1:4, ],
                             method = "cosine",
                             which = "users")
as.matrix(similarity_mat)
##           1         2         3         4
## 1        NA 0.9880430 0.9820862 0.9957199
## 2 0.9880430        NA 0.9962866 0.9687126
## 3 0.9820862 0.9962866        NA 0.9944484
## 4 0.9957199 0.9687126 0.9944484        NA
image(as.matrix(similarity_mat), main = "User's Similarities")

movie_similarity <- similarity(ratingMatrix[, 1:4], method =
                                 "cosine", which = "items")
as.matrix(movie_similarity)
##           1         2         3         4
## 1        NA 0.9834866 0.9779671 0.9550638
## 2 0.9834866        NA 0.9829378 0.9706208
## 3 0.9779671 0.9829378        NA 0.9932438
## 4 0.9550638 0.9706208 0.9932438        NA
image(as.matrix(movie_similarity), main = "Movies similarity")
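For reference, cosine similarity between two movies can be written out by hand. The sketch below compares two rating columns over the users who rated both, which is one common convention; recommenderlab's exact handling of missing values may differ, and `cosine_sim` and the toy matrix are illustrative names and data.

```r
# Cosine similarity restricted to positions where both vectors have a rating
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)  # no co-rated entries, similarity undefined
  sum(x[both] * y[both]) /
    (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Toy user-by-movie matrix (rows: users, columns: movies)
m <- matrix(c(5,  4, NA,
              3,  3,  4,
              4, NA,  5),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))

sim_m1_m2 <- cosine_sim(m[, "m1"], m[, "m2"])
print(sim_m1_m2)
```

Since ratings are all positive, uncentered cosine values tend to sit close to 1, which is consistent with the similarity matrices shown above.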

rating_values <- as.vector(ratingMatrix@data)
unique(rating_values) # extracting unique ratings (0 marks a user-movie pair with no rating)
##  [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
Table_of_Ratings <- table(rating_values) # creating a count of movie ratings
Table_of_Ratings
## rating_values
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5 
## 6791761    1198    3258    1567    7943    5484   21729   12237   28880    8187 
##       5 
##   14856
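The count for 0 in the table above equals 668 × 10325 − 105339 = 6791761, that is, every user-movie cell without a rating. If only observed ratings are of interest, the zeros can be dropped before tabulating. A minimal sketch on a made-up vector:

```r
# Toy vector mimicking a flattened rating matrix, where 0 means "no rating"
rating_values <- c(0, 0, 4, 3.5, 0, 5, 4, 0)

# Keep only the observed ratings before counting
observed <- rating_values[rating_values != 0]
rating_counts <- table(observed)
print(rating_counts)
```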
library(ggplot2)
movie_views <- colCounts(ratingMatrix) # count views for each movie
table_views <- data.frame(movie = names(movie_views),
                          views = movie_views) # create dataframe of views
table_views <- table_views[order(table_views$views,
                                 decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
for (index in seq_len(nrow(table_views))){ # look up the title for every movie ID
  table_views[index,3] <- as.character(subset(movie_data,
                                              movie_data$movieId == table_views[index,1])$title)
}
table_views[1:6,]
##     movie views                                     title
## 296   296   325                       Pulp Fiction (1994)
## 356   356   311                       Forrest Gump (1994)
## 318   318   308          Shawshank Redemption, The (1994)
## 480   480   294                      Jurassic Park (1993)
## 593   593   290          Silence of the Lambs, The (1991)
## 260   260   273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:6, ], aes(x = title, y = views)) +
  geom_bar(stat="identity", fill = 'steelblue') +
  geom_text(aes(label=views), vjust=-0.3, size=3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Total Views of the Top Films")

image(ratingMatrix[1:20, 1:25], axes = FALSE, main = "Heatmap of the first 20 rows and 25 columns")

movie_ratings <- ratingMatrix[rowCounts(ratingMatrix) > 50,
                              colCounts(ratingMatrix) > 50]
movie_ratings
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
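The filtering above keeps only users and movies with more than 50 ratings. The same idea in base R, on a toy matrix where `NA` marks a missing rating, might look as follows; the threshold of 1 is chosen only to suit the tiny example.

```r
# Toy user-by-movie matrix; NA means the user has not rated the movie
m <- matrix(c( 5, 4,  3, NA,
              NA, 2, NA, NA,
               4, 4,  5, NA),
            nrow = 3, byrow = TRUE)

min_count <- 1  # hypothetical threshold; the report uses 50

# Count the available ratings per user (row) and per movie (column)
row_counts <- rowSums(!is.na(m))
col_counts <- colSums(!is.na(m))

# Keep users and movies that exceed the threshold
filtered <- m[row_counts > min_count, col_counts > min_count, drop = FALSE]
print(filtered)
```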
minimum_movies<- quantile(rowCounts(movie_ratings), 0.98)
minimum_users <- quantile(colCounts(movie_ratings), 0.98)
image(movie_ratings[rowCounts(movie_ratings) > minimum_movies,
                    colCounts(movie_ratings) > minimum_users],
      main = "Heatmap of the top users and movies")

average_ratings <- rowMeans(movie_ratings)
qplot(average_ratings, fill=I("steelblue"), col=I("red")) +
  ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

normalized_ratings <- normalize(movie_ratings)
sum(rowMeans(normalized_ratings) > 0.00001)
## [1] 0
image(normalized_ratings[rowCounts(normalized_ratings) > minimum_movies,
                         colCounts(normalized_ratings) > minimum_users],
      main = "Normalized Ratings of the Top Users")
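
The check above returns 0 because normalization centers each user's ratings around zero. As a minimal sketch of that idea in base R (a stand-in for recommenderlab's normalize(), using a hypothetical toy matrix named toy_ratings and not the project data), the centering step might look like this:

```r
# Toy user-by-movie ratings; NA marks a movie the user has not rated.
# (Hypothetical data, not taken from the data set used in this report.)
toy_ratings <- matrix(c(5, 3, NA, 4,
                        2, NA, 4, 1,
                        NA, 5, 4, 3),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(paste0("user", 1:3),
                                      paste0("movie", 1:4)))

# Subtract each user's mean rating from that user's observed ratings.
user_means <- rowMeans(toy_ratings, na.rm = TRUE)
centered <- sweep(toy_ratings, 1, user_means, FUN = "-")

# After centering, every user's mean rating is zero up to rounding error.
centered_means <- rowMeans(centered, na.rm = TRUE)
print(round(centered_means, 10))
```

Unrated entries stay NA, so only the ratings a user actually gave are shifted.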

binary_minimum_movies <- quantile(rowCounts(movie_ratings), 0.95)
binary_minimum_users <- quantile(colCounts(movie_ratings), 0.95)
#movies_watched <- binarize(movie_ratings, minRating = 1)
good_rated_films <- binarize(movie_ratings, minRating = 3)
image(good_rated_films[rowCounts(movie_ratings) > binary_minimum_movies,
                       colCounts(movie_ratings) > binary_minimum_users],
      main = "Heatmap of the top users and movies")
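
Binarization reduces each rating to a liked / not-liked flag. A minimal base R sketch of the thresholding idea follows. The toy matrix and the name min_rating are illustrative assumptions, and the sketch assumes that ratings at or above the threshold are flagged as 1, which is how the binarize() call above is read here:

```r
# Hypothetical ratings for two users and four movies; NA means "not rated".
toy <- matrix(c(5, 3, NA, 1,
                2, NA, 4, 3.5),
              nrow = 2, byrow = TRUE)

min_rating <- 3  # same threshold as minRating = 3 above

# TRUE only where a rating exists and reaches the threshold.
liked <- !is.na(toy) & toy >= min_rating

# Store the flags as 0/1 integers while keeping the matrix shape.
storage.mode(liked) <- "integer"
print(liked)
```

Unrated movies end up as 0, so a binary matrix built this way cannot tell "not rated" apart from "rated below the threshold".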

sampled_data <- sample(x = c(TRUE, FALSE),
                      size = nrow(movie_ratings),
                      replace = TRUE,
                      prob = c(0.8, 0.2))
training_data <- movie_ratings[sampled_data, ]
testing_data <- movie_ratings[!sampled_data, ]
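
Because sample() draws a TRUE or FALSE for each row independently, the 80/20 proportion is only approximate (the run reported in this section ends up with 328 training users and 92 testing users out of 420), and the result changes on every run. One possible way to make the split repeatable, sketched here with a hypothetical seed value, is to call set.seed() first:

```r
set.seed(123)   # hypothetical seed; any fixed integer gives a repeatable split
n_users <- 420  # number of rows in movie_ratings reported in this section

# Draw one TRUE/FALSE flag per user: TRUE with probability 0.8.
in_training <- sample(x = c(TRUE, FALSE),
                      size = n_users,
                      replace = TRUE,
                      prob = c(0.8, 0.2))

train_idx <- which(in_training)   # row numbers for the training set
test_idx  <- which(!in_training)  # row numbers for the testing set

# Every user lands in exactly one of the two sets.
print(c(train = length(train_idx), test = length(test_idx)))
```

The index vectors could then be used in place of the logical vector, for example movie_ratings[train_idx, ].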

recommendation_system <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommendation_system$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
## 
## $method
## [1] "cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE
recommen_model <- Recommender(data = training_data,
                              method = "IBCF",
                              parameter = list(k = 30))
recommen_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 328 users.
class(recommen_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
model_info <- getModel(recommen_model)
class(model_info$sim)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(model_info$sim)
## [1] 447 447
top_items <- 20
image(model_info$sim[1:top_items, 1:top_items],
      main = "Heatmap of the first rows and columns")

sum_rows <- rowSums(model_info$sim > 0)
table(sum_rows)
## sum_rows
##  30 
## 447
sum_cols <- colSums(model_info$sim > 0)
qplot(sum_cols, fill=I("steelblue"), col=I("red")) +
  ggtitle("Distribution of the column count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
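
The table above shows that every row of the similarity matrix keeps exactly 30 positive entries, matching k = 30. As a rough illustration of what an item-based model stores, the sketch below computes cosine similarity between the movie columns of a small hypothetical matrix and keeps only the k most similar movies per row. It treats unrated entries as 0 for simplicity, which is an assumption and not necessarily how recommenderlab handles missing ratings:

```r
# Hypothetical ratings: 4 users (rows) by 4 movies (columns), 0 = not rated.
toy <- matrix(c(5, 4, 1, 0,
                4, 5, 0, 1,
                1, 0, 5, 4,
                0, 1, 4, 5),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("user", 1:4), paste0("movie", 1:4)))

# Cosine similarity between every pair of movie columns.
norms <- sqrt(colSums(toy^2))
sim <- crossprod(toy) / outer(norms, norms)
diag(sim) <- 0  # a movie is not counted as its own neighbour

# Keep only the k largest similarities in each row and zero out the rest.
k <- 2
top_k_sim <- t(apply(sim, 1, function(row) {
  keep <- order(row, decreasing = TRUE)[seq_len(k)]
  row[-keep] <- 0
  row
}))

print(round(top_k_sim, 3))
print(rowSums(top_k_sim > 0))  # each row keeps k positive entries
```

Truncating row by row is also why the column counts can vary: a popular movie may appear among the nearest neighbours of many other movies, while each row is capped at k.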

top_recommendations <- 10 # the number of items to recommend to each user
predicted_recommendations <- predict(object = recommen_model,
                                     newdata = testing_data,
                                     n = top_recommendations)
predicted_recommendations
## Recommendations as 'topNList' with n = 10 for 92 users.
user1 <- predicted_recommendations@items[[1]] # recommendation for the first user
movies_user1 <- predicted_recommendations@itemLabels[user1]
movies_user2 <- movies_user1
for (index in 1:10){
  movies_user2[index] <- as.character(subset(movie_data,
                                             movie_data$movieId == movies_user1[index])$title)
}
movies_user2
##  [1] "South Park: Bigger, Longer and Uncut (1999)"         
##  [2] "Wizard of Oz, The (1939)"                            
##  [3] "Gattaca (1997)"                                      
##  [4] "Chinatown (1974)"                                    
##  [5] "Annie Hall (1977)"                                   
##  [6] "One Flew Over the Cuckoo's Nest (1975)"              
##  [7] "2001: A Space Odyssey (1968)"                        
##  [8] "Christmas Story, A (1983)"                           
##  [9] "Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)"
## [10] "Taxi Driver (1976)"
recommendation_matrix <- sapply(predicted_recommendations@items,
                                function(x){ as.integer(colnames(movie_ratings)[x]) }) # matrix with the recommendations for each user
# dim(recommendation_matrix)
recommendation_matrix[,1:4]
##          0   1    2   3
##  [1,] 2700  25 1206  11
##  [2,]  919  47 1213  17
##  [3,] 1653  62 7438  21
##  [4,] 1252 111 1263  25
##  [5,] 1230 474 2858  36
##  [6,] 1193 508  111  62
##  [7,]  924 529 1193 111
##  [8,] 2804 590 1198 165
##  [9,] 4973 858 1208 236
## [10,]  111 913 1210 260
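
One way to summarize these recommendations is to count how often each movie is suggested. The sketch below does this for only the four columns printed above, copied by hand into a matrix with the hypothetical name shown, so the counts cover those four users and not all 92:

```r
# Movie IDs recommended to the four users printed above (columns 0 to 3).
shown <- cbind(
  "0" = c(2700, 919, 1653, 1252, 1230, 1193, 924, 2804, 4973, 111),
  "1" = c(25, 47, 62, 111, 474, 508, 529, 590, 858, 913),
  "2" = c(1206, 1213, 7438, 1263, 2858, 111, 1193, 1198, 1208, 1210),
  "3" = c(11, 17, 21, 25, 36, 62, 111, 165, 236, 260)
)

# Count how many of these users received each movie, most frequent first.
counts <- sort(table(as.vector(shown)), decreasing = TRUE)
print(head(counts, 4))
```

Among these four users, movie ID 111 is the only one recommended to all of them. Applying the same table() call to the full recommendation_matrix would give the counts across every test user.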

5.0 Data Visualization

Data visualization is the process of transforming the information obtained from data analysis into a visual presentation. It is the step of the data mining process that follows data analysis. Its main aim is to make data easier for people to understand by presenting it in simple visual forms such as graphs, charts, and maps. This helps people to identify patterns and behavior in big data and to predict future outcomes from it. In this project, the results of the analysis are visualized through the Analysis of Top Movies application, which was developed with data analytics software such as R, RStudio, and Shiny and can be accessed through web browsers and mobile phones at the following address: Analysis of Top Movies.


6.0 Project Report And Presentation

A project report is an important deliverable for every project developer, because it presents the finished product and its accuracy and confirms that all of the project’s objectives have been covered. This project report was generated with R Markdown and can be accessed at the following address: http://rpubs.com/WanAtiq/r-python


7.0 Conclusion

In today’s world of technology, data plays a vital role in all aspects of life. Data analysis makes life easier by helping people to make faster and better-informed decisions. A movie recommendation system is one of the most popular machine learning applications. It uses filtering methods to give the user a shortcut: movies can be selected from a suggested list instead of spending hours searching. These recommendations are mostly accurate for the user’s interests, because the predictions are based on the user’s past preferences, choices, and behavior. This project produces a recommendation list for a user by means of the collaborative filtering method. This method finds close similarities between the preference lists of different users, based on the movies they have rated before, so that the system can suggest movies that users with similar tastes have watched and liked. Here are some conclusions that can be drawn from this project:
- Uses R, RStudio, and Shiny, which are open-source data analytics platforms.
- Has analytical functions and a graphical user interface (GUI) that make the platform both analytical and user-friendly.
- Can run on various operating platforms.
- Development can be done in a short time.
- The system is accessible 24/7 from multiple devices (web, mobile).