Movies are part of many people's lives, and not only the latest releases but also older titles. There are now many movie applications and websites: some require a monthly subscription, and some are free, though the free ones are often banned by local governments. Movie apps like Netflix record our viewing history and recommend similar movies based on what the user played most recently. Besides the movie apps themselves, trusted movie rating sites like IMDb can also recommend movies based on the high ratings those movies have received.
In this project, I will build a recommender that suggests movies to users of a movie app based on the movies they rated highest, so that they can find similar movies when they have no idea what to watch.
Before we continue, let's load the libraries we need.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(plotly)
options(scipen = 9999)
First of all, let's read the data.
The movies data set consists of 22 columns and 81,273 rows.
The ratings data set consists of 49 columns and 81,273 rows, one for each ID in our movies data set.
Let's check whether there are any duplicates in our data.
There are no duplicates, which means our list contains 81,273 unique IDs.
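A duplicate check like the one described above can be sketched as follows. This is a minimal sketch on a made-up toy data frame (`movies_toy` and its IDs are assumptions for illustration, not the real data); the same two calls would apply to the `imdb_title_id` column of the real data set.

```r
library(dplyr)

# Toy stand-in for the movies data; the IDs below are made up for illustration.
movies_toy <- tibble(
  imdb_title_id = c("tt0000001", "tt0000002", "tt0000003"),
  title = c("A", "B", "C")
)

# Number of repeated IDs (0 means every ID is unique).
n_duplicated_ids <- sum(duplicated(movies_toy$imdb_title_id))

# Number of distinct IDs; equal to nrow() when there are no duplicates.
n_unique_ids <- n_distinct(movies_toy$imdb_title_id)

n_duplicated_ids
n_unique_ids == nrow(movies_toy)
```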
Let's check whether there are any missing values (N/A) in our movies data:
## imdb_title_id title original_title
## 0 0 0
## year date_published genre
## 0 0 0
## duration country language
## 0 39 755
## director writer production_company
## 73 1493 4325
## actors description avg_vote
## 66 2430 0
## votes budget usa_gross_income
## 0 58469 66179
## worlwide_gross_income metascore reviews_from_users
## 51381 68551 7077
## reviews_from_critics
## 10987
There are some missing values in our movies data. We will remove the rows with N/A later, after selecting the columns we need.
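Per-column counts of missing values like those shown above can be obtained by combining `is.na()` with `colSums()`. The sketch below uses a small made-up data frame (`movies_toy` is an assumption for illustration).

```r
# Toy data frame with a few missing values; the values are made up.
movies_toy <- data.frame(
  title = c("A", "B", "C"),
  country = c("USA", NA, "UK"),
  budget = c(NA, NA, "$ 1000"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_per_column <- colSums(is.na(movies_toy))
na_per_column
```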
Let's check whether there are any missing values in the ratings data set:
## imdb_title_id weighted_average_vote total_votes
## 0 0 0
## mean_vote median_vote votes_10
## 0 0 0
## votes_9 votes_8 votes_7
## 0 0 0
## votes_6 votes_5 votes_4
## 0 0 0
## votes_3 votes_2 votes_1
## 0 0 0
## allgenders_0age_avg_vote allgenders_0age_votes allgenders_18age_avg_vote
## 54730 54730 415
## allgenders_18age_votes allgenders_30age_avg_vote allgenders_30age_votes
## 415 9 9
## allgenders_45age_avg_vote allgenders_45age_votes males_allages_avg_vote
## 113 113 1
## males_allages_votes males_0age_avg_vote males_0age_votes
## 1 60934 60934
## males_18age_avg_vote males_18age_votes males_30age_avg_vote
## 1056 1056 9
## males_30age_votes males_45age_avg_vote males_45age_votes
## 9 153 153
## females_allages_avg_vote females_allages_votes females_0age_avg_vote
## 70 70 65940
## females_0age_votes females_18age_avg_vote females_18age_votes
## 65940 5034 5034
## females_30age_avg_vote females_30age_votes females_45age_avg_vote
## 864 864 2572
## females_45age_votes top1000_voters_rating top1000_voters_votes
## 2572 606 606
## us_voters_rating us_voters_votes non_us_voters_rating
## 239 239 4
## non_us_voters_votes
## 4
There are some missing values in the ratings data set, but here N/A actually represents 0, so we will simply change N/A to 0.
As we can see from the resulting data frame, every N/A has been replaced with 0.
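One way to replace N/A with 0 in every numeric column is `tidyr::replace_na()` inside `dplyr::across()`. This is a minimal sketch on made-up data (`ratings_toy` is an assumption), not necessarily the code used for the real ratings data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the ratings data; the values are made up.
ratings_toy <- tibble(
  imdb_title_id = c("tt0000001", "tt0000002"),
  males_0age_avg_vote = c(NA, 7.5),
  males_0age_votes = c(NA, 12)
)

# Replace NA with 0 in every numeric column; the ID column is left untouched.
ratings_filled <- ratings_toy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

ratings_filled
```

Keep in mind that after this step a 0 in an `*_avg_vote` column means that the group cast no votes, not that the movie was rated zero.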
Before we continue: we won't use all columns, so I will keep only imdb_title_id, title, year, genre, country, avg_vote, votes, reviews_from_users, and reviews_from_critics.
# Keep only the columns used in the analysis
movies <- movies %>%
  select(imdb_title_id, title, year, genre, country, avg_vote, votes,
         reviews_from_users, reviews_from_critics) %>%
  # Treat text columns and year as categories
  mutate_if(is.character, as.factor) %>%
  mutate(year = as.factor(year))
movies
Since votes in our movies data set is the total number of votes in ratings, let's join both data frames into a new data frame, movies_join, and remove all rows with N/A:
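# Attach the rating columns to each movie by its IMDb ID
movies_join <- left_join(movies, ratings, by = "imdb_title_id")
# Drop every row that still contains a missing value
movies_join <- movies_join %>%
  na.omit()
movies_join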
Now we have a joined data frame that combines the movies and ratings data. After removing the rows with missing values, 66,482 movies remain.
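The effect of joining and then dropping incomplete rows can be seen on a small example. The sketch below uses made-up data (`movies_toy`, `ratings_toy`, and the short IDs are assumptions) and counts how many rows `na.omit()` removes.

```r
library(dplyr)

# Toy stand-ins for the two data sets; all values are made up.
movies_toy <- tibble(
  imdb_title_id = c("tt1", "tt2", "tt3"),
  title = c("A", "B", "C"),
  country = c("USA", NA, "UK")
)
ratings_toy <- tibble(
  imdb_title_id = c("tt1", "tt2", "tt3"),
  total_votes = c(100, 250, 90)
)

# left_join() keeps every movie and adds its rating columns by ID.
joined <- left_join(movies_toy, ratings_toy, by = "imdb_title_id")

# na.omit() drops every row with at least one missing value.
complete <- na.omit(joined)

rows_dropped <- nrow(joined) - nrow(complete)
rows_dropped
```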
Let's recheck whether we still have any missing values in our data:
## imdb_title_id title year
## 0 0 0
## genre country avg_vote
## 0 0 0
## votes reviews_from_users reviews_from_critics
## 0 0 0
## weighted_average_vote total_votes mean_vote
## 0 0 0
## median_vote votes_10 votes_9
## 0 0 0
## votes_8 votes_7 votes_6
## 0 0 0
## votes_5 votes_4 votes_3
## 0 0 0
## votes_2 votes_1 allgenders_0age_avg_vote
## 0 0 0
## allgenders_0age_votes allgenders_18age_avg_vote allgenders_18age_votes
## 0 0 0
## allgenders_30age_avg_vote allgenders_30age_votes allgenders_45age_avg_vote
## 0 0 0
## allgenders_45age_votes males_allages_avg_vote males_allages_votes
## 0 0 0
## males_0age_avg_vote males_0age_votes males_18age_avg_vote
## 0 0 0
## males_18age_votes males_30age_avg_vote males_30age_votes
## 0 0 0
## males_45age_avg_vote males_45age_votes females_allages_avg_vote
## 0 0 0
## females_allages_votes females_0age_avg_vote females_0age_votes
## 0 0 0
## females_18age_avg_vote females_18age_votes females_30age_avg_vote
## 0 0 0
## females_30age_votes females_45age_avg_vote females_45age_votes
## 0 0 0
## top1000_voters_rating top1000_voters_votes us_voters_rating
## 0 0 0
## us_voters_votes non_us_voters_rating non_us_voters_votes
## 0 0 0
There are no missing values left in our data. Here is a summary of the joined data frame:
## imdb_title_id title year
## Length:66482 Darling : 8 2017 : 2480
## Class :character Home : 8 2016 : 2374
## Mode :character Solo : 8 2018 : 2285
## The Three Musketeers: 8 2014 : 2273
## Bloodline : 7 2015 : 2273
## Eden : 7 2013 : 2227
## (Other) :66436 (Other):52570
## genre country avg_vote
## Drama : 9019 USA :25632 Min. : 1.00
## Comedy : 4587 UK : 3571 1st Qu.: 5.30
## Comedy, Drama : 2973 India : 3528 Median : 6.10
## Drama, Romance : 2651 Japan : 2457 Mean : 5.93
## Horror : 2028 France : 2451 3rd Qu.: 6.80
## Comedy, Drama, Romance: 1875 Italy : 1617 Max. :10.00
## (Other) :43349 (Other):27226
## votes reviews_from_users reviews_from_critics
## Min. : 99 Min. : 1.00 Min. : 1.00
## 1st Qu.: 259 1st Qu.: 4.00 1st Qu.: 3.00
## Median : 673 Median : 11.00 Median : 9.00
## Mean : 11436 Mean : 48.34 Mean : 29.34
## 3rd Qu.: 2656 3rd Qu.: 31.00 3rd Qu.: 26.00
## Max. :2159628 Max. :8302.00 Max. :987.00
##
## weighted_average_vote total_votes mean_vote median_vote
## Min. : 1.00 Min. : 99 Min. : 1.300 Min. : 1.000
## 1st Qu.: 5.30 1st Qu.: 259 1st Qu.: 5.600 1st Qu.: 6.000
## Median : 6.10 Median : 673 Median : 6.400 Median : 6.000
## Mean : 5.93 Mean : 11436 Mean : 6.238 Mean : 6.262
## 3rd Qu.: 6.80 3rd Qu.: 2656 3rd Qu.: 7.000 3rd Qu.: 7.000
## Max. :10.00 Max. :2159628 Max. :10.000 Max. :10.000
##
## votes_10 votes_9 votes_8 votes_7
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.00
## 1st Qu.: 26 1st Qu.: 11 1st Qu.: 22 1st Qu.: 35.25
## Median : 68 Median : 34 Median : 72 Median : 112.00
## Mean : 1479 Mean : 1451 Mean : 2458 Mean : 2525.12
## 3rd Qu.: 275 3rd Qu.: 174 3rd Qu.: 366 3rd Qu.: 521.00
## Max. :1197087 Max. :596808 Max. :397945 Max. :231381.00
##
## votes_6 votes_5 votes_4 votes_3
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 38 1st Qu.: 28.0 1st Qu.: 16.0 1st Qu.: 10.0
## Median : 105 Median : 70.0 Median : 41.0 Median : 25.0
## Mean : 1622 Mean : 841.9 Mean : 410.5 Mean : 231.8
## 3rd Qu.: 420 3rd Qu.: 257.0 3rd Qu.: 142.0 3rd Qu.: 88.0
## Max. :135547 Max. :72485.0 Max. :41751.0 Max. :36360.0
##
## votes_2 votes_1 allgenders_0age_avg_vote
## Min. : 0.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 6.0 1st Qu.: 12.0 1st Qu.: 0.000
## Median : 18.0 Median : 32.0 Median : 0.000
## Mean : 152.5 Mean : 264.4 Mean : 2.354
## 3rd Qu.: 63.0 3rd Qu.: 110.0 3rd Qu.: 6.000
## Max. :31211.0 Max. :67515.0 Max. :10.000
##
## allgenders_0age_votes allgenders_18age_avg_vote allgenders_18age_votes
## Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 5.300 1st Qu.: 18
## Median : 0.000 Median : 6.300 Median : 65
## Mean : 9.065 Mean : 6.002 Mean : 2554
## 3rd Qu.: 1.000 3rd Qu.: 7.000 3rd Qu.: 344
## Max. :4028.000 Max. :10.000 Max. :600243
##
## allgenders_30age_avg_vote allgenders_30age_votes allgenders_45age_avg_vote
## Min. : 0.000 Min. : 0 Min. : 0.000
## 1st Qu.: 5.200 1st Qu.: 85 1st Qu.: 5.100
## Median : 6.100 Median : 241 Median : 6.000
## Mean : 5.876 Mean : 4690 Mean : 5.775
## 3rd Qu.: 6.800 3rd Qu.: 1022 3rd Qu.: 6.600
## Max. :10.000 Max. :781955 Max. :10.000
##
## allgenders_45age_votes males_allages_avg_vote males_allages_votes
## Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 71 1st Qu.: 5.200 1st Qu.: 169
## Median : 170 Median : 6.100 Median : 439
## Mean : 1424 Mean : 5.861 Mean : 7363
## 3rd Qu.: 601 3rd Qu.: 6.700 3rd Qu.: 1732
## Max. :179646 Max. :10.000 Max. :1374105
##
## males_0age_avg_vote males_0age_votes males_18age_avg_vote males_18age_votes
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5.100 1st Qu.: 12
## Median : 0.000 Median : 0.000 Median : 6.200 Median : 46
## Mean : 1.805 Mean : 6.247 Mean : 5.916 Mean : 1921
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.: 7.000 3rd Qu.: 250
## Max. :10.000 Max. :2849.000 Max. :10.000 Max. :488238
##
## males_30age_avg_vote males_30age_votes males_45age_avg_vote males_45age_votes
## Min. : 0.000 Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 5.100 1st Qu.: 69 1st Qu.: 5.000 1st Qu.: 60
## Median : 6.000 Median : 197 Median : 5.900 Median : 144
## Mean : 5.833 Mean : 3870 Mean : 5.722 Mean : 1185
## 3rd Qu.: 6.700 3rd Qu.: 843 3rd Qu.: 6.600 3rd Qu.: 500
## Max. :10.000 Max. :664458 Max. :10.000 Max. :146000
##
## females_allages_avg_vote females_allages_votes females_0age_avg_vote
## Min. : 0.000 Min. : 0 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.: 29 1st Qu.: 0.000
## Median : 6.300 Median : 83 Median : 0.000
## Mean : 6.068 Mean : 1677 Mean : 1.479
## 3rd Qu.: 7.000 3rd Qu.: 347 3rd Qu.: 0.000
## Max. :10.000 Max. :269839 Max. :10.000
##
## females_0age_votes females_18age_avg_vote females_18age_votes
## Min. : 0.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 5.100 1st Qu.: 4.0
## Median : 0.00 Median : 6.400 Median : 15.0
## Mean : 1.91 Mean : 5.896 Mean : 594.4
## 3rd Qu.: 0.00 3rd Qu.: 7.200 3rd Qu.: 75.0
## Max. :524.00 Max. :10.000 Max. :121451.0
##
## females_30age_avg_vote females_30age_votes females_45age_avg_vote
## Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 5.300 1st Qu.: 12.0 1st Qu.: 5.300
## Median : 6.300 Median : 35.0 Median : 6.300
## Mean : 6.039 Mean : 763.8 Mean : 6.017
## 3rd Qu.: 7.000 3rd Qu.: 157.0 3rd Qu.: 7.000
## Max. :10.000 Max. :114034.0 Max. :10.000
##
## females_45age_votes top1000_voters_rating top1000_voters_votes
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 8 1st Qu.: 4.500 1st Qu.: 17.00
## Median : 22 Median : 5.400 Median : 37.00
## Mean : 217 Mean : 5.207 Mean : 91.29
## 3rd Qu.: 85 3rd Qu.: 6.100 3rd Qu.: 99.00
## Max. :30244 Max. :10.000 Max. :936.00
##
## us_voters_rating us_voters_votes non_us_voters_rating non_us_voters_votes
## Min. : 0.000 Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 5.300 1st Qu.: 42 1st Qu.: 5.100 1st Qu.: 115.0
## Median : 6.200 Median : 128 Median : 6.000 Median : 312.5
## Mean : 5.973 Mean : 2036 Mean : 5.787 Mean : 5300.5
## 3rd Qu.: 6.900 3rd Qu.: 529 3rd Qu.: 6.700 3rd Qu.: 1260.0
## Max. :10.000 Max. :341457 Max. :10.000 Max. :862970.0
##
From the summary, we can see that 2017 has the most movies in our data (2,480) and that Drama is the most common genre with 9,019 movies. The country that produced the most movies is the USA (25,632). The avg_vote column ranges from 1 to 10. On average, a movie has more reviews_from_users (mean 48.34) than reviews_from_critics (mean 29.34).
To understand our data better, let's explore it. Before we continue, I will create my own plot theme.
soft_blue_theme <- theme(
  panel.background = element_rect(fill = "lemonchiffon"),
  plot.background = element_rect(fill = "slategray3"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank(),
  text = element_text(color = "black"),
  axis.text = element_text(color = "black"),
  strip.background = element_rect(fill = "linen"),
  strip.text = element_text(colour = "black")
)
Based on the summary above, most movies were produced in 2017. To confirm this, let's plot the number of movies per year, limited to the top 10 years.
Here is the plot:
# Count movies per year and keep the 10 years with the most movies
movies %>%
  count(year, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col(aes(fill = year)) +
  labs(x = NULL, y = NULL, title = "Most Movies Per Year") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme
The year with the most movies in our data is 2017.
Next, we would like to know the top 20 movie titles based on the average of reviews_from_users, so we can see what type of movies most people like.
# Average the user review counts per title, then keep the top 20 titles.
# summarise() returns one row per title, so repeated titles are not double counted.
movies %>%
  group_by(title) %>%
  summarise(reviews_from_users = mean(reviews_from_users, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(desc(reviews_from_users)) %>%
  head(20) %>%
  ggplot(aes(x = reviews_from_users,
             y = reorder(title, reviews_from_users),
             fill = title)) +
  geom_col() +
  labs(title = "Top 20 Movies",
       subtitle = "Based on reviews from users",
       x = "Review from Users",
       y = "Title",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme
The top movie based on user reviews is Avengers: Endgame, which is one of the most recent films.
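Several titles occur more than once in the data (the summary lists Darling 8 times), so when averaging per title it matters whether the grouped mean is computed with mutate(), which keeps one row per movie, or with summarise(), which returns one row per title. A minimal sketch on made-up data (`reviews_toy` is an assumption for illustration):

```r
library(dplyr)

# Toy data with one repeated title; the numbers are made up.
reviews_toy <- tibble(
  title = c("Home", "Home", "Solo"),
  reviews_from_users = c(10, 30, 5)
)

# mutate() writes the group mean into every row, so "Home" still appears twice.
with_mutate <- reviews_toy %>%
  group_by(title) %>%
  mutate(reviews_from_users = mean(reviews_from_users)) %>%
  ungroup()

# summarise() collapses each group to a single row per title.
with_summarise <- reviews_toy %>%
  group_by(title) %>%
  summarise(reviews_from_users = mean(reviews_from_users), .groups = "drop")

nrow(with_mutate)
nrow(with_summarise)
```

With geom_col(), repeated rows for the same title are stacked into one bar, so the mutate() variant can overstate a repeated title and leave fewer than 20 distinct titles after head(20).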
As we know from the summary, Drama has the most movies in the data set, but does it also have the highest avg_vote? Let's find out.
movies %>%
  select(genre, title, avg_vote) %>%
  group_by(genre) %>%
  summarise(avg_vote = mean(avg_vote)) %>%
  arrange(desc(avg_vote)) %>%
  head(15) %>%
  ggplot(mapping = aes(x = avg_vote,
                       y = reorder(genre, avg_vote),
                       fill = genre)) +
  geom_col() +
  labs(title = "Top 15 Genres",
       subtitle = "Based on Average Vote",
       x = "Average Vote",
       y = "Genre",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme
It turns out that the genre with the highest average vote is Musical, Comedy, Family.
To dig deeper, I will find the top 20 movies in the Comedy genre.
# Top 20 titles labelled exactly "Comedy", ranked by mean reviews from critics
movies %>%
  filter(genre == "Comedy") %>%
  group_by(title) %>%
  summarise(reviews_from_critics = mean(reviews_from_critics, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(desc(reviews_from_critics)) %>%
  head(20) %>%
  ggplot(mapping = aes(x = reviews_from_critics,
                       y = reorder(title, reviews_from_critics),
                       fill = title)) +
  geom_col() +
  labs(title = "Top 20 Comedy Movies",
       subtitle = "Based on Reviews From Critics",
       x = "Reviews From Critics",
       y = "Title",
       fill = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  soft_blue_theme
Ted, This Is the End, and The Hangover Part III are the top 3 Comedy movies based on reviews from critics.
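Note that the genre column stores combined labels such as "Comedy, Drama", so a filter of the form genre == "Comedy" keeps only movies whose label is exactly Comedy. The sketch below contrasts this with a substring match using `stringr::str_detect()`; the data frame `genres_toy` is made up, and the substring variant is shown only as an alternative, not as the method used in this analysis.

```r
library(dplyr)
library(stringr)

# Toy data with single and combined genre labels; the titles are made up.
genres_toy <- tibble(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Comedy, Drama", "Musical, Comedy, Family", "Horror")
)

# Exact match: only movies labelled "Comedy" and nothing else.
exact_comedy <- genres_toy %>%
  filter(genre == "Comedy")

# Substring match: every movie whose label contains "Comedy".
any_comedy <- genres_toy %>%
  filter(str_detect(genre, fixed("Comedy")))

nrow(exact_comedy)
nrow(any_comedy)
```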