The objective of this project is to build a recommendation system that suggests movies based on user preferences and behaviour. Recommendation algorithms broadly fall into two types: content-based filtering and collaborative filtering. The former recommends movies based on the attributes of the movies a user has already watched or rated highly, such as the genres or the cast and crew. The latter recommends movies to a user based on the preferences of similar users.

Concepts for both types will be explained in detail at every step of the project, be it the intuition, the algorithm or the syntax.

In this project we will be building a content-based recommendation system.

We first set the working directory.

setwd("C:/Users/salil/Desktop/AnalyticsEdgeFolder")

We then load the data set we need for now.

Movies <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/movies.csv", stringsAsFactors = FALSE)

This is what the data looks like.

head(Movies)
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

The data set describes only one trait of the listed movies: their genres. We therefore use the genres to build the algorithm. The same techniques apply equally well when other characteristics are available.

To work further with the genres, we extract the column out of the data set.

MoviesGenresA <- as.data.frame(Movies$genres, stringsAsFactors = FALSE)

This is what the column looks like. The genres of each movie are listed together, separated by ‘|’.

head(MoviesGenresA)
##                                 Movies$genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Next, we split the genres of each movie into separate columns.

library(data.table)
MoviesGenresB <- as.data.frame(tstrsplit(MoviesGenresA[,1], "\\|"), stringsAsFactors = FALSE)
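To see what the split produces, here is a minimal base-R sketch on three made-up genre strings. It does not use ‘tstrsplit’; it mimics the same result with ‘strsplit’ plus NA padding, and the names ‘genres’, ‘parts’ and ‘toy_split’ are hypothetical.

```r
# Toy genre strings in the same "|"-separated format as Movies$genres
genres <- c("Adventure|Animation|Children", "Comedy|Romance", "Comedy")

# Split each string on the literal "|" (escaped, since "|" is a regex operator)
parts <- strsplit(genres, "\\|")

# Pad every movie's genre vector with NA up to the longest one,
# then arrange the result with one row per movie
width <- max(lengths(parts))
padded <- t(sapply(parts, function(p) c(p, rep(NA, width - length(p)))))

toy_split <- as.data.frame(padded, stringsAsFactors = FALSE)
colnames(toy_split) <- seq_len(width)
print(toy_split)
```

Movies with fewer genres than the widest row are filled with NA on the right, which is the shape the later steps rely on.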

We number the columns of the resulting table for convenience.

colnames(MoviesGenresB) <- c(1:10)

This is what the table looks like.

head(MoviesGenresB)
##           1         2        3      4       5    6    7    8    9   10
## 1 Adventure Animation Children Comedy Fantasy <NA> <NA> <NA> <NA> <NA>
## 2 Adventure  Children  Fantasy   <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
## 3    Comedy   Romance     <NA>   <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
## 4    Comedy     Drama  Romance   <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
## 5    Comedy      <NA>     <NA>   <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
## 6    Action     Crime Thriller   <NA>    <NA> <NA> <NA> <NA> <NA> <NA>

We now list the unique genre values in the first three columns of the table.

unique(MoviesGenresB$`1`)
##  [1] "Adventure"          "Comedy"             "Action"            
##  [4] "Drama"              "Crime"              "Children"          
##  [7] "Mystery"            "Animation"          "Documentary"       
## [10] "Thriller"           "Horror"             "Fantasy"           
## [13] "Western"            "Film-Noir"          "Romance"           
## [16] "Sci-Fi"             "Musical"            "War"               
## [19] "(no genres listed)"
unique(MoviesGenresB$`2`)
##  [1] "Animation"   "Children"    "Romance"     "Drama"       NA           
##  [6] "Crime"       "Adventure"   "Horror"      "Comedy"      "Sci-Fi"     
## [11] "War"         "Thriller"    "Mystery"     "Film-Noir"   "Musical"    
## [16] "Fantasy"     "Documentary" "Western"     "IMAX"
unique(MoviesGenresB$`3`)
##  [1] "Children"    "Fantasy"     NA            "Romance"     "Thriller"   
##  [6] "Crime"       "Horror"      "IMAX"        "Drama"       "Comedy"     
## [11] "War"         "Mystery"     "Western"     "Sci-Fi"      "Animation"  
## [16] "Musical"     "Film-Noir"   "Documentary"

Every genre appears at least once among the first three columns of the table, so inspecting those three is enough. We can ignore the “IMAX” value, as it is a type of cinematic experience and not a genre.

We now count the unique values in each of these columns.

length(unique(MoviesGenresB$`1`))
## [1] 19
length(unique(MoviesGenresB$`2`))
## [1] 19
length(unique(MoviesGenresB$`3`))
## [1] 18

So, once we exclude ‘NA’, “IMAX” and the “(no genres listed)” placeholder, there are 18 unique genres.
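The same count can be obtained in one pass over all columns instead of inspecting them one by one. The sketch below runs on a small made-up table standing in for ‘MoviesGenresB’; the object names are hypothetical.

```r
# Toy stand-in for MoviesGenresB (hypothetical values)
toy_split <- data.frame(
  `1` = c("Adventure", "Comedy", "(no genres listed)"),
  `2` = c("Children", "IMAX", NA),
  `3` = c("Fantasy", NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Pool every cell, keep distinct values, then drop the non-genre entries
all_values <- unique(unlist(toy_split, use.names = FALSE))
genre_values <- sort(setdiff(all_values, c(NA, "IMAX", "(no genres listed)")))

print(genre_values)
length(genre_values)
```

Applied to the real table, the length of the resulting vector would be the number of genre columns needed in the next step.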

We now create a zero matrix with one row per film and one column per unique genre.

nrow(Movies)
## [1] 10329
GenreMatrixA <- matrix(0,10329,18)

Then we name the columns according to the genres.

list_genre <- c("Action", "Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
colnames(GenreMatrixA) <- list_genre

This is what it looks like.

head(GenreMatrixA)
##      Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy
## [1,]      0         0         0        0      0     0           0     0       0
## [2,]      0         0         0        0      0     0           0     0       0
## [3,]      0         0         0        0      0     0           0     0       0
## [4,]      0         0         0        0      0     0           0     0       0
## [5,]      0         0         0        0      0     0           0     0       0
## [6,]      0         0         0        0      0     0           0     0       0
##      FilmNoir Horror Musical Mystery Romance SciFi Thriller War Western
## [1,]        0      0       0       0       0     0        0   0       0
## [2,]        0      0       0       0       0     0        0   0       0
## [3,]        0      0       0       0       0     0        0   0       0
## [4,]        0      0       0       0       0     0        0   0       0
## [5,]        0      0       0       0       0     0        0   0       0
## [6,]        0      0       0       0       0     0        0   0       0

We now use a nested ‘for’ loop to turn this matrix into a binary matrix of ‘0’ and ‘1’ values. For every movie, each genre value in ‘MoviesGenresB’ is compared with the column names of the matrix: where there is a match, the corresponding genre column gets a ‘1’ in that row, and it stays ‘0’ otherwise. The comparison is an exact string match, so the values “Film-Noir” and “Sci-Fi” in the data never match the column names ‘FilmNoir’ and ‘SciFi’, and those two columns remain all zeros.

for (row in 1:nrow(MoviesGenresB)) {
  for (col in 1:ncol(MoviesGenresB)) {
    gen_col = which(colnames(GenreMatrixA) == MoviesGenresB[row,col]) 
    GenreMatrixA[row,gen_col] <- 1
  }
}

We then coerce the entries to integers. The matrix is already numeric at this point, so this step does not change the values.

for (row in 1:nrow(GenreMatrixA)) {
  GenreMatrixA[row,] <- as.integer(GenreMatrixA[row,]) 
} 
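The filling step can also be sketched without explicit loops. The example below works on three made-up movies and additionally strips hyphens before matching, on the assumption that “Sci-Fi” and “Film-Noir” are meant to land in the ‘SciFi’ and ‘FilmNoir’ columns; that normalisation is not part of the code above, and all object names here are hypothetical.

```r
# Toy inputs (hypothetical): genre labels spelled as they appear in the data
toy_genres <- c("Action|Sci-Fi", "Comedy|Romance", "Film-Noir")
toy_columns <- c("Action", "Comedy", "FilmNoir", "Romance", "SciFi")

# Assumption: remove hyphens so "Sci-Fi" and "Film-Noir" match the column names
parts <- strsplit(gsub("-", "", toy_genres, fixed = TRUE), "|", fixed = TRUE)

# One row per movie: 1L where the column's genre is among the movie's labels
toy_matrix <- t(sapply(parts, function(p) as.integer(toy_columns %in% p)))
colnames(toy_matrix) <- toy_columns
print(toy_matrix)
```

Because ‘%in%’ returns a logical vector per movie, the matrix is built as integers directly and no separate conversion pass is needed.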

Now we attach the original ‘Movies’ table to the matrix.

GenreMatrixB <- cbind(Movies, GenreMatrixA)

As we do not need the combined genre column or the movie id column, we drop them and keep the rest.

GenreMatrixB$movieId <- NULL
GenreMatrixB$genres <- NULL

We have now obtained the table we need to perform hierarchical clustering. This is what it looks like so far.

head(GenreMatrixB)
##                                title Action Adventure Animation Children Comedy
## 1                   Toy Story (1995)      0         1         1        1      1
## 2                     Jumanji (1995)      0         1         0        1      0
## 3            Grumpier Old Men (1995)      0         0         0        0      1
## 4           Waiting to Exhale (1995)      0         0         0        0      1
## 5 Father of the Bride Part II (1995)      0         0         0        0      1
## 6                        Heat (1995)      1         0         0        0      0
##   Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi
## 1     0           0     0       1        0      0       0       0       0     0
## 2     0           0     0       1        0      0       0       0       0     0
## 3     0           0     0       0        0      0       0       0       1     0
## 4     0           0     1       0        0      0       0       0       1     0
## 5     0           0     0       0        0      0       0       0       0     0
## 6     1           0     0       0        0      0       0       0       0     0
##   Thriller War Western
## 1        0   0       0
## 2        0   0       0
## 3        0   0       0
## 4        0   0       0
## 5        0   0       0
## 6        1   0       0

To begin with, imagine an 18-dimensional space in which every dimension corresponds to a genre. Every movie is a point in this space whose coordinates are its values in ‘GenreMatrixB’. For instance, the ‘Action’ genre lies along the first axis, the ‘Adventure’ genre along the second, and so on. The closer two movies are in this space, the more similar they are.

We first compute the distance between the points using the Euclidean distance. For example, for the first two movies in the table, the Euclidean distance between the two points is √[(0-0)^2 + (1-1)^2 + (1-0)^2 + … + (0-0)^2]. The two movies differ only in the ‘Animation’ and ‘Comedy’ columns, so the distance equals √2.

distances <- dist(GenreMatrixB[2:19], method = "euclidean")
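The √2 figure can be checked by hand. The sketch below copies the 18 genre flags of the first two movies from the table above into two vectors (the vector names are hypothetical) and compares a manual calculation with ‘dist’.

```r
# Genre flags in column order: Action, Adventure, Animation, Children, Comedy,
# Crime, Documentary, Drama, Fantasy, FilmNoir, Horror, Musical, Mystery,
# Romance, SciFi, Thriller, War, Western
toy_story <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
jumanji   <- c(0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Manual Euclidean distance: square root of the summed squared differences
manual <- sqrt(sum((toy_story - jumanji)^2))

# The same distance as computed by dist() on a two-row matrix
via_dist <- as.numeric(dist(rbind(toy_story, jumanji), method = "euclidean"))

print(c(manual = manual, via_dist = via_dist, sqrt_2 = sqrt(2)))
```

With binary flags, the squared distance is simply the number of genres on which two movies disagree.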

After the distances have been computed, the next step is to group the points into clusters based on their positions in the space. Each cluster is a set of points with similar coordinates. Intuitively, movies with similar genres end up in the same cluster, so every cluster represents a set of similar movies.

As already mentioned, the algorithm we will be using is hierarchical clustering. In hierarchical clustering, each data point starts in its own cluster. At every step the algorithm merges the two nearest clusters into one, and the process continues until all of the data points are in a single cluster. The ‘method=“ward”’ argument determines how "nearest" is measured: Ward's criterion merges the pair of clusters whose union gives the smallest increase in within-cluster variance. Recent versions of R call this option “ward.D”, as the message printed after the call indicates.

MoviesCluster <- hclust(distances, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
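To make the merging order concrete, here is a small sketch on four made-up points (not the movie data). It uses the “ward.D” spelling that the message above refers to; the names ‘toy’ and ‘toy_tree’ are hypothetical.

```r
# Four toy points in a 3-dimensional binary space
toy <- rbind(
  a = c(1, 0, 0),
  b = c(1, 0, 0),
  c = c(0, 1, 1),
  d = c(0, 1, 0)
)

toy_tree <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Each row of $merge is one merge step: negative numbers are single points,
# positive numbers refer to clusters formed at an earlier step
print(toy_tree$merge)

# Height at which each merge happened
print(toy_tree$height)
```

The identical points ‘a’ and ‘b’ are merged first at height 0, and the last row of ‘merge’ joins the two clusters formed earlier.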

The process of hierarchical clustering can be displayed through a dendrogram.

plot(MoviesCluster)

The bottom of the plot looks crowded because there are over 10,000 data points. The lines show how the clusters were combined, and the height of each line shows how far apart the clusters were when they were combined.

The dendrogram helps us visualise the process of hierarchical clustering, and we can use it to decide how many clusters we want in order to build an efficient recommendation system. Based on our understanding of the problem, we choose 15 clusters. This number seems reasonable: it is neither too coarse nor too fine, and it gives a workable number of specific genre groups.

We then label each data point with the cluster it belongs to using the ‘cutree’ function.

ClusterGroups <- cutree(MoviesCluster, k=15)
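As a small illustration of what ‘cutree’ returns, the sketch below cuts a tree built on four made-up points into two groups and counts the members of each. The data and names are hypothetical.

```r
# Four toy points; a and b are identical, c and d are close to each other
toy <- rbind(a = c(1, 0, 0), b = c(1, 0, 0), c = c(0, 1, 1), d = c(0, 1, 0))
toy_tree <- hclust(dist(toy), method = "ward.D")

# One cluster label per point, in the order of the original rows
toy_groups <- cutree(toy_tree, k = 2)
print(toy_groups)

# Number of points per cluster
print(table(toy_groups))
```

On the movie data, ‘table(ClusterGroups)’ would give the size of each of the 15 clusters in the same way.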

We now use the ‘tapply’ function to compute, for each cluster, the proportion of movies that carry a given genre.

Action <- tapply(GenreMatrixB$Action, ClusterGroups, mean)
Action
##         1         2         3         4         5         6         7         8 
## 0.1679820 0.0000000 0.0000000 0.0000000 0.8179298 0.1076357 0.0000000 0.0000000 
##         9        10        11        12        13        14        15 
## 0.0000000 0.4458955 0.1465116 0.3033708 0.0000000 0.0000000 0.0000000

The ‘tapply’ function groups the data points by cluster and computes the mean of the ‘Action’ variable within each of the 15 clusters. Because the variable takes only the values 0 and 1, this mean is the proportion of movies in the cluster that carry the ‘Action’ label.

For instance, in cluster 5 almost 82% of the movies have the ‘Action’ genre label, which is a significant share. By contrast, clusters 2 and 3 contain no movies with the ‘Action’ genre. The other clusters, and the remaining genres, can be interpreted in the same way.
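A tiny made-up example shows why the mean of a 0/1 flag is a proportion. The vectors below are hypothetical and unrelated to the movie data.

```r
# Six toy movies: 1 means the movie carries the genre label, 0 means it does not
toy_action  <- c(1, 1, 0, 1, 0, 0)
toy_cluster <- c(1, 1, 1, 2, 2, 2)

# Mean of the flag within each cluster = share of labelled movies per cluster
share <- tapply(toy_action, toy_cluster, mean)
print(share)
```

Cluster 1 contains two labelled movies out of three and cluster 2 contains one out of three, so the means are 2/3 and 1/3.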

Adventure <- tapply(GenreMatrixB$Adventure, ClusterGroups, mean)
Animation <- tapply(GenreMatrixB$Animation, ClusterGroups, mean)
Children <- tapply(GenreMatrixB$Children, ClusterGroups, mean)
Comedy <- tapply(GenreMatrixB$Comedy, ClusterGroups, mean)
Crime <- tapply(GenreMatrixB$Crime, ClusterGroups, mean)
Documentary <- tapply(GenreMatrixB$Documentary, ClusterGroups, mean)
Drama <- tapply(GenreMatrixB$Drama, ClusterGroups, mean)
Fantasy <-  tapply(GenreMatrixB$Fantasy, ClusterGroups, mean) 
FilmNoir <- tapply(GenreMatrixB$FilmNoir, ClusterGroups, mean)
Horror <- tapply(GenreMatrixB$Horror, ClusterGroups, mean)
Musical <-  tapply(GenreMatrixB$Musical, ClusterGroups, mean) 
Mystery <- tapply(GenreMatrixB$Mystery, ClusterGroups, mean)
Romance <- tapply(GenreMatrixB$Romance, ClusterGroups, mean)
SciFi <- tapply(GenreMatrixB$SciFi, ClusterGroups, mean)
Thriller <- tapply(GenreMatrixB$Thriller, ClusterGroups, mean)
War <- tapply(GenreMatrixB$War, ClusterGroups, mean)
Western <-  tapply(GenreMatrixB$Western, ClusterGroups, mean) 

We now create one consolidated table for the derived percentage values.

FinalTable <- rbind(Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, FilmNoir, Horror, Musical, Mystery, Romance, SciFi, Thriller, War, Western)
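The eighteen ‘tapply’ calls and the ‘rbind’ can also be collapsed into a single step. The following is a sketch on a made-up three-genre table, not a drop-in replacement verified against the movie data; ‘toy_flags’, ‘toy_cluster’ and ‘toy_table’ are hypothetical names.

```r
# Toy genre flags for four movies and their cluster labels
toy_flags <- data.frame(
  Action = c(1, 1, 0, 0),
  Comedy = c(0, 1, 1, 1),
  Drama  = c(0, 0, 0, 1)
)
toy_cluster <- c(1, 1, 2, 2)

# Apply the same per-cluster mean to every genre column at once;
# sapply() returns clusters in rows and genres in columns
per_cluster <- sapply(toy_flags, function(flag) tapply(flag, toy_cluster, mean))

# Transpose so that genres are rows and clusters are columns, like FinalTable
toy_table <- t(per_cluster)
print(toy_table)
```

On the real data the analogous call would iterate over the genre columns of ‘GenreMatrixB’ with ‘ClusterGroups’ as the grouping vector.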

We export the table to a CSV file so that we can work on it further in Excel.

write.table(FinalTable, file = "CombinedTableCluster.csv", sep = ",")

We use Excel to apply conditional formatting to the cells. We pick out the cells whose value is approximately 80% or higher, and each cluster is then named after the genres that have such significant values in it.

We import the Excel document.
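The same naming rule could also be applied directly in R. The sketch below is an assumption about how the rule might be coded, not the procedure used for the Excel file: it uses a strict cut-off of 0.8 on a made-up table, joins the dominant genres with a hyphen, and falls back to a hypothetical “Miscellaneous” label when no genre reaches the cut-off.

```r
# Toy genre-by-cluster table of proportions (hypothetical values)
toy_table <- rbind(
  Action = c(0.82, 0.00, 0.10),
  Comedy = c(0.05, 1.00, 0.30),
  Drama  = c(0.37, 1.00, 0.35)
)
colnames(toy_table) <- c("1", "2", "3")

threshold <- 0.8  # assumed strict version of "approximately 80%"

# For each cluster (column), collect the genres at or above the threshold
cluster_labels <- apply(toy_table, 2, function(share) {
  dominant <- rownames(toy_table)[share >= threshold]
  if (length(dominant) == 0) "Miscellaneous" else paste(dominant, collapse = "-")
})
print(cluster_labels)
```

A hard threshold is less forgiving than judging the cells by eye, so borderline clusters may be labelled differently from a manual pass.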

library(readxl)
FinalTableCluster <- read_xlsx("C:/Users/salil/Desktop/AllDocuments/AnalyticsEdgeFolder/FinalTableCluster.xlsx")
## New names:
## * `` -> ...1

This is what our final table looks like.

FinalTableCluster
## # A tibble: 19 x 16
##    ...1  `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`   `10`  `11`  `12` 
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Clus… Misc… Roma… Come… Come… Acti… Horr… Drama Crim… Roma… Come… Myst… Dram…
##  2 Acti… 0.16… 0     0     0     0.81… 0.10… 0     0     0     0.44… 0.14… 0.30…
##  3 Adve… 0.30… 0     0     0     0.38… 2.39… 0     0     0     0.18… 6.27… 0.12…
##  4 Anim… 0.21… 0     0     0     5.54… 3.67… 0     0     0     5.59… 0     8.98…
##  5 Chil… 0.29… 0     0     0     5.54… 3.67… 0     0     0     0     1.39… 0    
##  6 Come… 0.40… 1     1     1     1.75… 8.83… 0     0     0     1     0.17… 0.12…
##  7 Crime 7.55… 0     0     0     0.34… 5.70… 0     1     0     0.51… 0.39… 3.14…
##  8 Docu… 3.94… 0     0     0     0     0     0     0     0     0     0     4.49…
##  9 Drama 0.34… 0     1     0     0.37… 0.18… 1     1     1     0.20… 0.55… 0.79…
## 10 Fant… 0.28… 0     0     0     4.43… 8.37… 0     0     0     0     4.18… 1.12…
## 11 Film… 0     0     0     0     0     0     0     0     0     0     0     0    
## 12 Horr… 2.48… 0     0     0     4.62… 0.84… 0     0     0     4.29… 2.55… 2.24…
## 13 Musi… 0.20… 0     0     0     9.24… 1.83… 0     0     0     0     0     8.98…
## 14 Myst… 3.43… 0     0     0     4.62… 0.15… 0     0     0     1.11… 0.99… 2.02…
## 15 Roma… 0.31… 1     1     0     0     3.31… 0     0     1     7.08… 2.32… 0.12…
## 16 SciFi 0     0     0     0     0     0     0     0     0     0     0     0    
## 17 Thri… 8.56… 0     0     0     0.48… 0.55… 0     0.44… 0     0.29… 0.60… 0.13…
## 18 War   2.14… 0     0     0     0     4.59… 0     0     0     0     0     1    
## 19 West… 0.12… 0     0     0     0     5.51… 0     0     0     3.73… 2.32… 2.24…
## # … with 3 more variables: `13` <chr>, `14` <chr>, `15` <chr>

We can now finally use this recommendation system to obtain a list of movies similar to one that a user likes or has already watched.

Suppose a friend of mine recently watched ‘Detective Dee and the Mystery of the Phantom Flame (Di Renjie) (2010)’ and would like to watch similar movies when he has some free time. The recommendation system can be used as follows.

which(Movies$title=="Detective Dee and the Mystery of the Phantom Flame (Di Renjie) (2010)")
## [1] 9273
ClusterGroups[9273]
## [1] 11
Cluster11 <- subset(Movies, ClusterGroups==11)
Cluster11$title[1:10]
##  [1] "City of Lost Children, The (Cité des enfants perdus, La) (1995)"
##  [2] "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)"                       
##  [3] "Seven (a.k.a. Se7en) (1995)"                                     
##  [4] "Usual Suspects, The (1995)"                                      
##  [5] "Confessional, The (Confessionnal, Le) (1995)"                    
##  [6] "Unforgettable (1996)"                                            
##  [7] "Before and After (1996)"                                         
##  [8] "Clockers (1995)"                                                 
##  [9] "Congo (1995)"                                                    
## [10] "Devil in a Blue Dress (1995)"

These are the movies that my friend can watch.
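The lookup steps above can be wrapped in a small helper. This is a sketch under stated assumptions: ‘recommend_similar’ is a hypothetical function name, it expects a data frame with a ‘title’ column and a vector of cluster labels in the same row order, and it is shown here on made-up data.

```r
# Return up to n titles from the same cluster as the given title
recommend_similar <- function(title, movies, groups, n = 10) {
  idx <- which(movies$title == title)
  if (length(idx) == 0) stop("Title not found: ", title)
  # All titles sharing the cluster of the first match, minus the title itself
  same_cluster <- movies$title[groups == groups[idx[1]]]
  head(setdiff(same_cluster, title), n)
}

# Toy data standing in for Movies and ClusterGroups
toy_movies <- data.frame(
  title = c("A (1995)", "B (1996)", "C (1997)", "D (1998)"),
  stringsAsFactors = FALSE
)
toy_groups <- c(1, 2, 1, 1)

print(recommend_similar("A (1995)", toy_movies, toy_groups))
```

With the objects built earlier, the call would take ‘Movies’ and ‘ClusterGroups’ as its second and third arguments. Unlike the manual steps, the helper leaves the watched title out of the returned list.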