Movie Recommendation based on User’s Most Watched

Yohana

28 August 2020

Movie Recommendation

Background

Movies are now part of many people’s lives, and not only the latest releases but older ones too. There are many movie applications and websites: some require a monthly subscription, while others are free, and the free ones are often blocked by the local government. Movie apps like Netflix record our viewing history and recommend similar movies based on what the user played most recently. Besides the movie apps themselves, trusted movie rating sites like IMDb can also recommend movies based on the high ratings given to certain titles.

In this project, I will build a recommender that suggests movies to users of a movie app based on the movies they rate the highest, so they can watch similar movies when they have no idea what to watch.

Library and Read The Data

Before we continue, let’s load the libraries we need.

library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(plotly)

options(scipen = 9999)

First of all, let’s read the data.

movies <- read_csv("data/IMDb movies.csv")
movies

The movies data set consists of 22 columns and 81,273 rows.

ratings <- read_csv("data/IMDb ratings.csv")
ratings

The ratings data set consists of 49 columns and 81,273 rows, which represent the same IDs as our movies data set.

Let’s check if there are any duplicates in our data.

movies[!duplicated(movies$imdb_title_id),]

There are no duplicates: the result still has 81,273 rows, so every row has a unique ID.
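
A more direct way to confirm this is to count the repeated IDs instead of comparing row counts by eye. The sketch below uses a small made-up data frame (`toy_movies` is hypothetical, not the IMDb data) to show how `duplicated()` behaves.

```r
# Toy stand-in for the movies data; the IDs and titles are made up.
toy_movies <- data.frame(
  imdb_title_id = c("tt001", "tt002", "tt002", "tt003"),
  title = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an ID after its first occurrence,
# so the sum is the number of duplicated rows.
n_dup <- sum(duplicated(toy_movies$imdb_title_id))
n_dup

# Keep only the first occurrence of each ID.
toy_unique <- toy_movies[!duplicated(toy_movies$imdb_title_id), ]
nrow(toy_unique)
```

On the real data, `sum(duplicated(movies$imdb_title_id))` returning 0 would confirm that there are no duplicates.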

Let’s check if there are any missing values (NA) in our movies data.

colSums(is.na(movies))

##         imdb_title_id                 title        original_title 
##                     0                     0                     0 
##                  year        date_published                 genre 
##                     0                     0                     0 
##              duration               country              language 
##                     0                    39                   755 
##              director                writer    production_company 
##                    73                  1493                  4325 
##                actors           description              avg_vote 
##                    66                  2430                     0 
##                 votes                budget      usa_gross_income 
##                     0                 58469                 66179 
## worlwide_gross_income             metascore    reviews_from_users 
##                 51381                 68551                  7077 
##  reviews_from_critics 
##                 10987

There are some missing values in our movies data. We will remove the rows with missing values later, after selecting the columns we need.
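
Raw counts are easier to judge as a share of all rows. For example, budget is missing in 58,469 of 81,273 rows (about 72%) and metascore in 68,551 (about 84%), which is why it makes sense to drop such columns before removing incomplete rows. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data with missing values; the values are made up for illustration.
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  budget = c(NA, "$ 100", NA, NA),
  metascore = c(50, NA, NA, 70),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; the column means of that matrix
# are the share of missing values in each column.
na_share <- colMeans(is.na(toy))

# Show the columns with the largest share of missing values first.
sort(na_share, decreasing = TRUE)
```

The same call on the real data would be `colMeans(is.na(movies))`.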

Let’s check if there are any missing values in ratings data set.

colSums(is.na(ratings))

##             imdb_title_id     weighted_average_vote               total_votes 
##                         0                         0                         0 
##                 mean_vote               median_vote                  votes_10 
##                         0                         0                         0 
##                   votes_9                   votes_8                   votes_7 
##                         0                         0                         0 
##                   votes_6                   votes_5                   votes_4 
##                         0                         0                         0 
##                   votes_3                   votes_2                   votes_1 
##                         0                         0                         0 
##  allgenders_0age_avg_vote     allgenders_0age_votes allgenders_18age_avg_vote 
##                     54730                     54730                       415 
##    allgenders_18age_votes allgenders_30age_avg_vote    allgenders_30age_votes 
##                       415                         9                         9 
## allgenders_45age_avg_vote    allgenders_45age_votes    males_allages_avg_vote 
##                       113                       113                         1 
##       males_allages_votes       males_0age_avg_vote          males_0age_votes 
##                         1                     60934                     60934 
##      males_18age_avg_vote         males_18age_votes      males_30age_avg_vote 
##                      1056                      1056                         9 
##         males_30age_votes      males_45age_avg_vote         males_45age_votes 
##                         9                       153                       153 
##  females_allages_avg_vote     females_allages_votes     females_0age_avg_vote 
##                        70                        70                     65940 
##        females_0age_votes    females_18age_avg_vote       females_18age_votes 
##                     65940                      5034                      5034 
##    females_30age_avg_vote       females_30age_votes    females_45age_avg_vote 
##                       864                       864                      2572 
##       females_45age_votes     top1000_voters_rating      top1000_voters_votes 
##                      2572                       606                       606 
##          us_voters_rating           us_voters_votes      non_us_voters_rating 
##                       239                       239                         4 
##       non_us_voters_votes 
##                         4

There are some missing values in the ratings data set, but here NA represents 0, so we will replace every NA with 0.

ratings[is.na(ratings)] <- 0
ratings

As we can see from the data frame, the NA values have been replaced with 0.
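
The same replacement can also be written in tidyverse style and limited to the numeric columns, so that an ID column would never be touched by accident. This is only an alternative sketch on made-up data (`toy_ratings` is hypothetical), not what the analysis above ran.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the ratings data; the values are made up.
toy_ratings <- tibble(
  imdb_title_id = c("tt001", "tt002"),
  males_0age_avg_vote = c(NA, 7.5),
  males_0age_votes = c(NA, 4)
)

# Replace NA with 0 in numeric columns only; character columns are left as-is.
toy_filled <- toy_ratings %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

toy_filled
```

One thing to keep in mind with either version: a 0 in an `*_avg_vote` column then means "no votes from this group", not "an average rating of 0".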

We won’t use all of the columns, so I will keep only imdb_title_id, title, year, genre, country, avg_vote, votes, reviews_from_users, and reviews_from_critics.

movies <- movies %>% 
  select(imdb_title_id, title, year, genre, country, avg_vote, votes, reviews_from_users, reviews_from_critics) %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(year = as.factor(year))

movies

Since votes in the movies data set is the total of the votes in the ratings data set, let’s join both data frames into a new data frame, movies_join, and remove all rows with missing values.

movies_join <- left_join(movies, ratings, by = c("imdb_title_id"))
movies_join <- movies_join %>% 
  na.omit()
movies_join

Now we have a joined data frame that consists of the movies and ratings data. After removing the rows with missing values, 66,482 movies remain.
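
Assuming the join kept all 81,273 rows (both tables share the same unique IDs), na.omit() dropped 81,273 - 66,482 = 14,791 movies. The sketch below shows how that number can be tracked explicitly; the two toy tables are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the movies and ratings data; values are made up.
toy_movies <- tibble(
  imdb_title_id = c("tt001", "tt002", "tt003"),
  title = c("A", "B", "C"),
  reviews_from_users = c(10, NA, 3)
)
toy_ratings <- tibble(
  imdb_title_id = c("tt001", "tt002", "tt003"),
  mean_vote = c(6.1, 7.0, 5.2)
)

# A left join keeps every row of toy_movies.
toy_join <- left_join(toy_movies, toy_ratings, by = "imdb_title_id")

# na.omit() removes every row that has at least one NA in any column.
toy_clean <- na.omit(toy_join)

# Number of rows lost to missing values.
n_dropped <- nrow(toy_join) - nrow(toy_clean)
n_dropped
```

Checking `n_dropped` after each cleaning step makes it clear how much data a filter costs.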

Let’s recheck if we still have any missing values in our data

colSums(is.na(movies_join))

##             imdb_title_id                     title                      year 
##                         0                         0                         0 
##                     genre                   country                  avg_vote 
##                         0                         0                         0 
##                     votes        reviews_from_users      reviews_from_critics 
##                         0                         0                         0 
##     weighted_average_vote               total_votes                 mean_vote 
##                         0                         0                         0 
##               median_vote                  votes_10                   votes_9 
##                         0                         0                         0 
##                   votes_8                   votes_7                   votes_6 
##                         0                         0                         0 
##                   votes_5                   votes_4                   votes_3 
##                         0                         0                         0 
##                   votes_2                   votes_1  allgenders_0age_avg_vote 
##                         0                         0                         0 
##     allgenders_0age_votes allgenders_18age_avg_vote    allgenders_18age_votes 
##                         0                         0                         0 
## allgenders_30age_avg_vote    allgenders_30age_votes allgenders_45age_avg_vote 
##                         0                         0                         0 
##    allgenders_45age_votes    males_allages_avg_vote       males_allages_votes 
##                         0                         0                         0 
##       males_0age_avg_vote          males_0age_votes      males_18age_avg_vote 
##                         0                         0                         0 
##         males_18age_votes      males_30age_avg_vote         males_30age_votes 
##                         0                         0                         0 
##      males_45age_avg_vote         males_45age_votes  females_allages_avg_vote 
##                         0                         0                         0 
##     females_allages_votes     females_0age_avg_vote        females_0age_votes 
##                         0                         0                         0 
##    females_18age_avg_vote       females_18age_votes    females_30age_avg_vote 
##                         0                         0                         0 
##       females_30age_votes    females_45age_avg_vote       females_45age_votes 
##                         0                         0                         0 
##     top1000_voters_rating      top1000_voters_votes          us_voters_rating 
##                         0                         0                         0 
##           us_voters_votes      non_us_voters_rating       non_us_voters_votes 
##                         0                         0                         0

There are no missing values in our data

summary(movies_join)

##  imdb_title_id                       title            year      
##  Length:66482       Darling             :    8   2017   : 2480  
##  Class :character   Home                :    8   2016   : 2374  
##  Mode  :character   Solo                :    8   2018   : 2285  
##                     The Three Musketeers:    8   2014   : 2273  
##                     Bloodline           :    7   2015   : 2273  
##                     Eden                :    7   2013   : 2227  
##                     (Other)             :66436   (Other):52570  
##                     genre          country         avg_vote    
##  Drama                 : 9019   USA    :25632   Min.   : 1.00  
##  Comedy                : 4587   UK     : 3571   1st Qu.: 5.30  
##  Comedy, Drama         : 2973   India  : 3528   Median : 6.10  
##  Drama, Romance        : 2651   Japan  : 2457   Mean   : 5.93  
##  Horror                : 2028   France : 2451   3rd Qu.: 6.80  
##  Comedy, Drama, Romance: 1875   Italy  : 1617   Max.   :10.00  
##  (Other)               :43349   (Other):27226                  
##      votes         reviews_from_users reviews_from_critics
##  Min.   :     99   Min.   :   1.00    Min.   :  1.00      
##  1st Qu.:    259   1st Qu.:   4.00    1st Qu.:  3.00      
##  Median :    673   Median :  11.00    Median :  9.00      
##  Mean   :  11436   Mean   :  48.34    Mean   : 29.34      
##  3rd Qu.:   2656   3rd Qu.:  31.00    3rd Qu.: 26.00      
##  Max.   :2159628   Max.   :8302.00    Max.   :987.00      
##                                                           
##  weighted_average_vote  total_votes        mean_vote       median_vote    
##  Min.   : 1.00         Min.   :     99   Min.   : 1.300   Min.   : 1.000  
##  1st Qu.: 5.30         1st Qu.:    259   1st Qu.: 5.600   1st Qu.: 6.000  
##  Median : 6.10         Median :    673   Median : 6.400   Median : 6.000  
##  Mean   : 5.93         Mean   :  11436   Mean   : 6.238   Mean   : 6.262  
##  3rd Qu.: 6.80         3rd Qu.:   2656   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.00         Max.   :2159628   Max.   :10.000   Max.   :10.000  
##                                                                           
##     votes_10          votes_9          votes_8          votes_7         
##  Min.   :      0   Min.   :     0   Min.   :     0   Min.   :     0.00  
##  1st Qu.:     26   1st Qu.:    11   1st Qu.:    22   1st Qu.:    35.25  
##  Median :     68   Median :    34   Median :    72   Median :   112.00  
##  Mean   :   1479   Mean   :  1451   Mean   :  2458   Mean   :  2525.12  
##  3rd Qu.:    275   3rd Qu.:   174   3rd Qu.:   366   3rd Qu.:   521.00  
##  Max.   :1197087   Max.   :596808   Max.   :397945   Max.   :231381.00  
##                                                                         
##     votes_6          votes_5           votes_4           votes_3       
##  Min.   :     0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    38   1st Qu.:   28.0   1st Qu.:   16.0   1st Qu.:   10.0  
##  Median :   105   Median :   70.0   Median :   41.0   Median :   25.0  
##  Mean   :  1622   Mean   :  841.9   Mean   :  410.5   Mean   :  231.8  
##  3rd Qu.:   420   3rd Qu.:  257.0   3rd Qu.:  142.0   3rd Qu.:   88.0  
##  Max.   :135547   Max.   :72485.0   Max.   :41751.0   Max.   :36360.0  
##                                                                        
##     votes_2           votes_1        allgenders_0age_avg_vote
##  Min.   :    0.0   Min.   :    0.0   Min.   : 0.000          
##  1st Qu.:    6.0   1st Qu.:   12.0   1st Qu.: 0.000          
##  Median :   18.0   Median :   32.0   Median : 0.000          
##  Mean   :  152.5   Mean   :  264.4   Mean   : 2.354          
##  3rd Qu.:   63.0   3rd Qu.:  110.0   3rd Qu.: 6.000          
##  Max.   :31211.0   Max.   :67515.0   Max.   :10.000          
##                                                              
##  allgenders_0age_votes allgenders_18age_avg_vote allgenders_18age_votes
##  Min.   :   0.000      Min.   : 0.000            Min.   :     0        
##  1st Qu.:   0.000      1st Qu.: 5.300            1st Qu.:    18        
##  Median :   0.000      Median : 6.300            Median :    65        
##  Mean   :   9.065      Mean   : 6.002            Mean   :  2554        
##  3rd Qu.:   1.000      3rd Qu.: 7.000            3rd Qu.:   344        
##  Max.   :4028.000      Max.   :10.000            Max.   :600243        
##                                                                        
##  allgenders_30age_avg_vote allgenders_30age_votes allgenders_45age_avg_vote
##  Min.   : 0.000            Min.   :     0         Min.   : 0.000           
##  1st Qu.: 5.200            1st Qu.:    85         1st Qu.: 5.100           
##  Median : 6.100            Median :   241         Median : 6.000           
##  Mean   : 5.876            Mean   :  4690         Mean   : 5.775           
##  3rd Qu.: 6.800            3rd Qu.:  1022         3rd Qu.: 6.600           
##  Max.   :10.000            Max.   :781955         Max.   :10.000           
##                                                                            
##  allgenders_45age_votes males_allages_avg_vote males_allages_votes
##  Min.   :     0         Min.   : 0.000         Min.   :      0    
##  1st Qu.:    71         1st Qu.: 5.200         1st Qu.:    169    
##  Median :   170         Median : 6.100         Median :    439    
##  Mean   :  1424         Mean   : 5.861         Mean   :   7363    
##  3rd Qu.:   601         3rd Qu.: 6.700         3rd Qu.:   1732    
##  Max.   :179646         Max.   :10.000         Max.   :1374105    
##                                                                   
##  males_0age_avg_vote males_0age_votes   males_18age_avg_vote males_18age_votes
##  Min.   : 0.000      Min.   :   0.000   Min.   : 0.000       Min.   :     0   
##  1st Qu.: 0.000      1st Qu.:   0.000   1st Qu.: 5.100       1st Qu.:    12   
##  Median : 0.000      Median :   0.000   Median : 6.200       Median :    46   
##  Mean   : 1.805      Mean   :   6.247   Mean   : 5.916       Mean   :  1921   
##  3rd Qu.: 4.000      3rd Qu.:   1.000   3rd Qu.: 7.000       3rd Qu.:   250   
##  Max.   :10.000      Max.   :2849.000   Max.   :10.000       Max.   :488238   
##                                                                               
##  males_30age_avg_vote males_30age_votes males_45age_avg_vote males_45age_votes
##  Min.   : 0.000       Min.   :     0    Min.   : 0.000       Min.   :     0   
##  1st Qu.: 5.100       1st Qu.:    69    1st Qu.: 5.000       1st Qu.:    60   
##  Median : 6.000       Median :   197    Median : 5.900       Median :   144   
##  Mean   : 5.833       Mean   :  3870    Mean   : 5.722       Mean   :  1185   
##  3rd Qu.: 6.700       3rd Qu.:   843    3rd Qu.: 6.600       3rd Qu.:   500   
##  Max.   :10.000       Max.   :664458    Max.   :10.000       Max.   :146000   
##                                                                               
##  females_allages_avg_vote females_allages_votes females_0age_avg_vote
##  Min.   : 0.000           Min.   :     0        Min.   : 0.000       
##  1st Qu.: 5.400           1st Qu.:    29        1st Qu.: 0.000       
##  Median : 6.300           Median :    83        Median : 0.000       
##  Mean   : 6.068           Mean   :  1677        Mean   : 1.479       
##  3rd Qu.: 7.000           3rd Qu.:   347        3rd Qu.: 0.000       
##  Max.   :10.000           Max.   :269839        Max.   :10.000       
##                                                                      
##  females_0age_votes females_18age_avg_vote females_18age_votes
##  Min.   :  0.00     Min.   : 0.000         Min.   :     0.0   
##  1st Qu.:  0.00     1st Qu.: 5.100         1st Qu.:     4.0   
##  Median :  0.00     Median : 6.400         Median :    15.0   
##  Mean   :  1.91     Mean   : 5.896         Mean   :   594.4   
##  3rd Qu.:  0.00     3rd Qu.: 7.200         3rd Qu.:    75.0   
##  Max.   :524.00     Max.   :10.000         Max.   :121451.0   
##                                                               
##  females_30age_avg_vote females_30age_votes females_45age_avg_vote
##  Min.   : 0.000         Min.   :     0.0    Min.   : 0.000        
##  1st Qu.: 5.300         1st Qu.:    12.0    1st Qu.: 5.300        
##  Median : 6.300         Median :    35.0    Median : 6.300        
##  Mean   : 6.039         Mean   :   763.8    Mean   : 6.017        
##  3rd Qu.: 7.000         3rd Qu.:   157.0    3rd Qu.: 7.000        
##  Max.   :10.000         Max.   :114034.0    Max.   :10.000        
##                                                                   
##  females_45age_votes top1000_voters_rating top1000_voters_votes
##  Min.   :    0       Min.   : 0.000        Min.   :  0.00      
##  1st Qu.:    8       1st Qu.: 4.500        1st Qu.: 17.00      
##  Median :   22       Median : 5.400        Median : 37.00      
##  Mean   :  217       Mean   : 5.207        Mean   : 91.29      
##  3rd Qu.:   85       3rd Qu.: 6.100        3rd Qu.: 99.00      
##  Max.   :30244       Max.   :10.000        Max.   :936.00      
##                                                                
##  us_voters_rating us_voters_votes  non_us_voters_rating non_us_voters_votes
##  Min.   : 0.000   Min.   :     0   Min.   : 0.000       Min.   :     0.0   
##  1st Qu.: 5.300   1st Qu.:    42   1st Qu.: 5.100       1st Qu.:   115.0   
##  Median : 6.200   Median :   128   Median : 6.000       Median :   312.5   
##  Mean   : 5.973   Mean   :  2036   Mean   : 5.787       Mean   :  5300.5   
##  3rd Qu.: 6.900   3rd Qu.:   529   3rd Qu.: 6.700       3rd Qu.:  1260.0   
##  Max.   :10.000   Max.   :341457   Max.   :10.000       Max.   :862970.0   
##

From the summary, we can see that 2017 is the year with the most movies in our data (2,480), and the most common genre is Drama with 9,019 movies. The country that produced the most movies is the USA (25,632). The avg_vote ranges from 1 to 10. On average, a movie has more reviews_from_users (mean 48.34) than reviews_from_critics (mean 29.34).
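
These facts can also be pulled out directly instead of being read off the long summary() output. A minimal sketch with dplyr on a made-up data frame (`toy` is hypothetical; the column names follow movies_join):

```r
library(dplyr)

# Toy data using the same column names as movies_join; values are made up.
toy <- tibble(
  year = c("2016", "2017", "2017", "2017", "2015"),
  genre = c("Drama", "Drama", "Comedy", "Drama", "Horror"),
  reviews_from_users = c(10, 40, 5, 8, 2),
  reviews_from_critics = c(4, 20, 3, 9, 1)
)

# count(..., sort = TRUE) puts the most frequent value in the first row.
top_year <- toy %>% count(year, sort = TRUE) %>% slice(1)
top_genre <- toy %>% count(genre, sort = TRUE) %>% slice(1)

# Compare the average number of user and critic reviews.
review_means <- toy %>%
  summarise(users = mean(reviews_from_users),
            critics = mean(reviews_from_critics))

top_year
top_genre
review_means
```

Replacing `toy` with `movies_join` would give the corresponding values for the real data.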

EDA

To understand our data better, let’s explore it. Before we continue, I will create my own plot theme.

soft_blue_theme <- theme(
  panel.background = element_rect(fill="lemonchiffon"),
  plot.background = element_rect(fill="slategray3"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank(),
  text = element_text(color="black"),
  axis.text = element_text(color="black"),
  strip.background =element_rect(fill="linen"),
  strip.text = element_text(colour = 'black')
  )

Based on the summary above, 2017 is the year with the most movies. To confirm this, let’s plot the number of movies per year, limited to the top 10 years.

movies %>%
  count(year, sort = TRUE) %>% 
  head(10) %>% 
  ggplot(aes(x = year, y = n)) +
  geom_col(aes(fill = year)) +
  labs(x = NULL, y = NULL, title = "Most Movies Per Year") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme

The year with the most movies in our data is 2017.

Next, let’s look at the top 20 movie titles by average reviews_from_users, so we can see what type of movies most people like.

movies %>%
  select(title, reviews_from_users) %>% 
  group_by(title) %>% 
  # summarise() rather than mutate(), so that each title appears only once
  summarise(reviews_from_users = mean(reviews_from_users)) %>% 
  arrange(desc(reviews_from_users)) %>% 
  head(20) %>% 
  ggplot(aes(x = reviews_from_users, y = reorder(title, reviews_from_users), fill = title)) +
  geom_col() +
  labs(title = "Top 20 Movies", 
       subtitle = "Based on reviews from users",
       x = "Review from Users",
       y = "Title",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme

The top movie based on user reviews is Avengers: Endgame, which is also one of the newer films.

As we know from the summary, most movies in the dataset are Drama, but does Drama also have the highest avg_vote? Let’s find out.

movies %>%
  select(genre, title, avg_vote) %>% 
  group_by(genre) %>% 
  summarise(avg_vote = mean(avg_vote)) %>% 
  arrange(desc(avg_vote)) %>% 
  head(15) %>% 
  ggplot(mapping = aes(x = avg_vote,
                       y = reorder(genre, avg_vote),
                       fill = genre)) +
  geom_col() +
  labs(title = "Top 15 Genres",
       subtitle = "Based on Average Vote",
       x = "Average Vote",
       y = "Genre",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme

It turns out that the genre with the highest average vote is Musical, Comedy, Family, not Drama.
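
Note that the genre column holds combinations such as "Comedy, Drama", so a combination with only a few movies can easily reach the top of this ranking. One possible refinement, not part of the analysis above, is to split the combinations into single genres and require a minimum number of movies per genre. The sketch below uses made-up data, and the threshold of 2 is an arbitrary assumption chosen for the toy example.

```r
library(dplyr)
library(tidyr)

# Toy data; titles, genres and votes are made up for illustration.
toy <- tibble(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy, Drama", "Drama", "Musical, Comedy, Family", "Comedy"),
  avg_vote = c(6, 8, 9, 5)
)

by_genre <- toy %>%
  # One row per movie and single genre.
  separate_rows(genre, sep = ", ") %>%
  group_by(genre) %>%
  summarise(n_movies = n(), avg_vote = mean(avg_vote)) %>%
  # Drop genres with too few movies for the average to be meaningful.
  filter(n_movies >= 2) %>%
  arrange(desc(avg_vote))

by_genre
```

In this notebook genre was converted to a factor, so on the real data it may need to be converted back with as.character() before splitting.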

To learn more, I will look at the top 20 movies in the Comedy genre.

movies %>%
  select(genre, title, reviews_from_critics) %>% 
  group_by(genre, title) %>% 
  summarise(reviews_from_critics = mean(reviews_from_critics)) %>% 
  arrange(desc(reviews_from_critics)) %>% 
  ungroup() %>% 
  filter(genre == "Comedy") %>% 
  head(20) %>% 
  ggplot(mapping = aes(x = reviews_from_critics,
                       y = reorder(title, reviews_from_critics),
                       fill = title)) +
  geom_col() +
  labs(title = "Top 20 Comedy Movies",
       subtitle = "Based on Reviews From Critics",
       x = "Reviews From Critics",
       y = "Title",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme

Ted, This Is the End, and The Hangover Part III are the top 3 Comedy movies based on reviews from critics.