The Internet Movie Database (IMDb) is an online platform that provides information about movies, television programs, streaming video, and video games, including cast and director names, user votes, the production crew and more. IMDb is based in Seattle, but the office of Col Needham, its founder and CEO, remains in Bristol, England, where the website was created.
The IMDb website has expanded beyond its original aim of indexing film and TV production credits. An entry for a film can now include the studios and other businesses involved, release dates in different countries, censorship classifications, box office grosses, awards received, and other information. Searching IMDb is free of charge. Users who want to rate or review films or contribute information must sign up for an IMDb account, but this is also free. The firm launched IMDbPro in 2002 as a fee-based service for industry professionals.
A commercially successful movie not only entertains fans but also earns the film studio a considerable profit. Many factors are critical for making a good movie, such as a capable director and experienced actors. Popular directors and actors can carry a reasonably predictable box office income, but they cannot guarantee a high IMDb rating.
The IMDb movies dataset contains information on approximately 5,000 movies, including director and actor names, Facebook likes, genres, content rating, number of votes and more. Before cleaning, the dataset provides 28 features, both categorical and numerical, including the budget and gross income of each movie. Most of the movies were released between 1998 and 2016.
With such a large dataset of interesting and valuable features, it is worth investigating which factors make one movie more popular than another. The project therefore aims to examine what kinds of movies earn a higher IMDb score. The outcomes of the analysis are presented visually using ggplot2 in R.
In this project, the dataset's features are used to predict the IMDb score and to determine which factors actually influence it. We also explore how unsupervised machine learning can help cluster movies according to their gross income.
The objectives of the project are:
1. To provide useful and interesting information for users by examining the significant factors that lead to a high movie rating.
2. To compare different machine learning models in predicting a movie's IMDb score and to determine the most important features.
3. To examine cluster analysis by grouping the movies based on their gross income.
library(lattice)
library(readr)
library(dbplyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggrepel)
library(ggthemes) # visualization
library(scales) # visualization
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(corrplot)
## corrplot 0.84 loaded
library(formattable)
##
## Attaching package: 'formattable'
## The following objects are masked from 'package:scales':
##
## comma, percent, scientific
## The following object is masked from 'package:plotly':
##
## style
library(stringr)
library(rpart)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret) #Data Modelling
library(mlbench) #Data Modelling
library(cluster)
library(fpc)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
data <- read.csv("movie_imdb_extended.csv", header=TRUE)
head(data)
## movie_title color
## 1 Avatar Color
## 2 Pirates of the Caribbean: At World's End Color
## 3 Spectre Color
## 4 The Dark Knight Rises Color
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter Color
## director_name num_critic_for_reviews duration director_facebook_likes
## 1 James Cameron 723 178 0
## 2 Gore Verbinski 302 169 563
## 3 Sam Mendes 602 148 0
## 4 Christopher Nolan 813 164 22000
## 5 Doug Walker NA NA 131
## 6 Andrew Stanton 462 132 475
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 1 855 Joel David Moore 1000
## 2 1000 Orlando Bloom 40000
## 3 161 Rory Kinnear 11000
## 4 23000 Christian Bale 27000
## 5 NA Rob Walker 131
## 6 530 Samantha Morton 640
## genres actor_1_name num_voted_users
## 1 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204
## 2 Action|Adventure|Fantasy Johnny Depp 471220
## 3 Action|Adventure|Thriller Christoph Waltz 275868
## 4 Action|Thriller Tom Hardy 1144337
## 5 Documentary Doug Walker 8
## 6 Action|Adventure|Sci-Fi Daryl Sabara 212204
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 1 4834 Wes Studi 0
## 2 48350 Jack Davenport 0
## 3 11700 Stephanie Sigman 1
## 4 106759 Joseph Gordon-Levitt 0
## 5 143 0
## 6 1873 Polly Walker 1
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## 6 alien|american civil war|male nipple|mars|princess
## movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 NA
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1 738
## language country content_rating budget title_year actor_2_facebook_likes
## 1 English USA PG-13 237000000 2009 936
## 2 English USA PG-13 300000000 2007 5000
## 3 English UK PG-13 245000000 2015 393
## 4 English USA PG-13 250000000 2012 23000
## 5 NA NA 12
## 6 English USA PG-13 263700000 2012 632
## aspect_ratio movie_facebook_likes imdb_score gross
## 1 1.78 33000 7.9 760505847
## 2 2.35 0 7.1 309404152
## 3 2.35 85000 6.8 200074175
## 4 2.35 164000 8.5 448130642
## 5 NA 0 7.1 NA
## 6 2.35 24000 6.6 73058679
class(data)
## [1] "data.frame"
The dataset has 28 variables and 5,043 rows in total. The target variable imdb_score is numerical, and the predictors are a mix of numerical and categorical variables.
dim(data)
## [1] 5043 28
str(data)
## 'data.frame': 5043 obs. of 28 variables:
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
colnames(data)
## [1] "movie_title" "color"
## [3] "director_name" "num_critic_for_reviews"
## [5] "duration" "director_facebook_likes"
## [7] "actor_3_facebook_likes" "actor_2_name"
## [9] "actor_1_facebook_likes" "genres"
## [11] "actor_1_name" "num_voted_users"
## [13] "cast_total_facebook_likes" "actor_3_name"
## [15] "facenumber_in_poster" "plot_keywords"
## [17] "movie_imdb_link" "num_user_for_reviews"
## [19] "language" "country"
## [21] "content_rating" "budget"
## [23] "title_year" "actor_2_facebook_likes"
## [25] "aspect_ratio" "movie_facebook_likes"
## [27] "imdb_score" "gross"
1. Tidy up the movie titles
All the movie titles in the raw dataset contain a special character (Â), and some have trailing white space, probably introduced during data collection. The code below removes both.
data$movie_title <- gsub("Â", "", as.character(factor(data$movie_title)))
data$movie_title <- str_trim(data$movie_title, side = "right")   # keep the trimmed titles
data$movie_title[6]
## [1] "John Carter"
2. Removing unnecessary columns
data <- subset( data, select = -c(plot_keywords, movie_imdb_link))
head(data)
## movie_title color
## 1 Avatar Color
## 2 Pirates of the Caribbean: At World's End Color
## 3 Spectre Color
## 4 The Dark Knight Rises Color
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter Color
## director_name num_critic_for_reviews duration director_facebook_likes
## 1 James Cameron 723 178 0
## 2 Gore Verbinski 302 169 563
## 3 Sam Mendes 602 148 0
## 4 Christopher Nolan 813 164 22000
## 5 Doug Walker NA NA 131
## 6 Andrew Stanton 462 132 475
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 1 855 Joel David Moore 1000
## 2 1000 Orlando Bloom 40000
## 3 161 Rory Kinnear 11000
## 4 23000 Christian Bale 27000
## 5 NA Rob Walker 131
## 6 530 Samantha Morton 640
## genres actor_1_name num_voted_users
## 1 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204
## 2 Action|Adventure|Fantasy Johnny Depp 471220
## 3 Action|Adventure|Thriller Christoph Waltz 275868
## 4 Action|Thriller Tom Hardy 1144337
## 5 Documentary Doug Walker 8
## 6 Action|Adventure|Sci-Fi Daryl Sabara 212204
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 1 4834 Wes Studi 0
## 2 48350 Jack Davenport 0
## 3 11700 Stephanie Sigman 1
## 4 106759 Joseph Gordon-Levitt 0
## 5 143 0
## 6 1873 Polly Walker 1
## num_user_for_reviews language country content_rating budget title_year
## 1 3054 English USA PG-13 237000000 2009
## 2 1238 English USA PG-13 300000000 2007
## 3 994 English UK PG-13 245000000 2015
## 4 2701 English USA PG-13 250000000 2012
## 5 NA NA NA
## 6 738 English USA PG-13 263700000 2012
## actor_2_facebook_likes aspect_ratio movie_facebook_likes imdb_score gross
## 1 936 1.78 33000 7.9 760505847
## 2 5000 2.35 0 7.1 309404152
## 3 393 2.35 85000 6.8 200074175
## 4 23000 2.35 164000 8.5 448130642
## 5 12 NA 0 7.1 NA
## 6 632 2.35 24000 6.6 73058679
3. Change title_year and aspect_ratio column type
data$title_year = as.factor(data$title_year)
data$aspect_ratio = as.factor(data$aspect_ratio)
4. Handling null values
a. Count the NA values in each column.
colSums(is.na(data))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 50 15 104
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 23 0 7
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 13
## num_user_for_reviews language country
## 21 0 0
## content_rating budget title_year
## 0 492 108
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 13 329 0
## imdb_score gross
## 0 884
b. Drop movies with zero or missing gross and budget (subset() also removes rows where either value is NA).
data1<-subset(data, gross!=0 & budget!=0)
dim(data1)
## [1] 3891 26
colSums(is.na(data1))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 1 1 0
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 10 0 3
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 6
## num_user_for_reviews language country
## 0 0 0
## content_rating budget title_year
## 0 0 0
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 5 75 0
## imdb_score gross
## 0 0
c. Impute the remaining missing values: use the column mean (or the median for duration) for the count variables, zero for facenumber_in_poster, and the most common aspect ratio, 2.35, chosen from the bar chart below.
data1$num_critic_for_reviews[is.na(data1$num_critic_for_reviews)]<-mean(data1$num_critic_for_reviews,na.rm=TRUE)
data1$duration[is.na(data1$duration)]<-median(data1$duration,na.rm=TRUE)
data1$actor_1_facebook_likes[is.na(data1$actor_1_facebook_likes)]<-mean(data1$actor_1_facebook_likes,na.rm=TRUE)
data1$actor_2_facebook_likes[is.na(data1$actor_2_facebook_likes)]<-mean(data1$actor_2_facebook_likes,na.rm=TRUE)
data1$actor_3_facebook_likes[is.na(data1$actor_3_facebook_likes)]<-mean(data1$actor_3_facebook_likes,na.rm=TRUE)
data1$facenumber_in_poster[is.na(data1$facenumber_in_poster)] <- 0
ggplot(data1) + geom_bar(aes(x = aspect_ratio))
data1$aspect_ratio[is.na(data1$aspect_ratio)] <- "2.35"
colSums(is.na(data1))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 0 0 0
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 0 0 0
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 0
## num_user_for_reviews language country
## 0 0 0
## content_rating budget title_year
## 0 0 0
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 0 0 0
## imdb_score gross
## 0 0
Next, count how many movies come from each unique country.
table(data1$country)
##
## Afghanistan Argentina Aruba Australia Belgium
## 1 3 1 41 2
## Brazil Canada Chile China Colombia
## 5 64 1 15 1
## Czech Republic Denmark Finland France Georgia
## 3 9 1 106 1
## Germany Greece Hong Kong Hungary Iceland
## 82 1 13 2 2
## India Indonesia Iran Ireland Israel
## 14 1 4 7 3
## Italy Japan Mexico Netherlands New Line
## 11 17 11 3 1
## New Zealand Norway Official site Peru Philippines
## 11 4 1 1 1
## Poland Romania Russia South Africa South Korea
## 1 2 3 3 9
## Spain Sweden Taiwan Thailand UK
## 22 1 2 4 325
## USA West Germany
## 3074 1
The most frequent country in the data is the USA, followed by the United Kingdom (UK); all other countries are grouped together as "Others".
levels(data1$country) <- c(levels(data1$country), "Others")
data1$country[(data1$country != 'USA')&(data1$country != 'UK')] <- 'Others'
data1$country <- factor(data1$country)
table(data1$country)
##
## Others UK USA
## 492 325 3074
The histogram below shows that the number of movies released per year starts to climb around 1997, after which more than 100 movies are released each year.
fig1 <- plot_ly (x=data1$title_year, histfunc='count', type = "histogram")
fig1 <- fig1 %>% layout(yaxis=list(type='linear'))
fig1
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The plot below shows the top 10 most profitable movies. To generate it, a new column, profit, is calculated as gross minus budget. The graph shows a roughly linear trend: profit increases as the budget increases.
data1 <- data1 %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100)
data1 %>%
filter(title_year %in% c(2000:2016)) %>%
arrange(desc(profit)) %>%
top_n(10, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point() +
geom_smooth() +
geom_text_repel(aes(label=movie_title)) +
labs(x = "Budget $million", y = "Profit $million", title = "Top 10 Profitable Movies") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Well-known directors are frequently searched for by users, who expect a strong director to produce good movies based on their past work. The pie chart lists the 20 directors with the highest average IMDb score; Charles Chaplin and Tony Kaye top the list, sharing the same value of 8.6%.
top_director <- data1 %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb)
## `summarise()` ungrouping output (override with `.groups` argument)
fig <- plot_ly(top_director, labels = ~director_name, values = ~avg_imdb, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most competitive director',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the top 20 lead actors (actor 1) ranked by their average Facebook likes. The lead actor is also a major factor in a movie achieving a high rating. The result shows that the most liked lead actor is Darcy Donavan.
top_actor_1 <- data1 %>%
group_by(actor_1_name) %>%
summarise(avg_fb_likes = mean(actor_1_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_1
## # A tibble: 21 x 2
## actor_1_name avg_fb_likes
## <chr> <dbl>
## 1 Darcy Donavan 640000
## 2 Matthew Ziff 260000
## 3 Krista Allen 164000
## 4 Andrew Fiscella 137000
## 5 Jimmy Bennett 87000
## 6 Michael Joiner 77000
## 7 Robin Williams 49000
## 8 Craig Stark 46000
## 9 Muse Watson 45000
## 10 Tim Holmes 44000
## # ... with 11 more rows
fig <- plot_ly(top_actor_1, labels = ~actor_1_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 1',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the most liked supporting actors (actor 2), based on their Facebook likes. The result shows that the top actor 2 is Andrew Fiscella.
top_actor_2 <- data1 %>%
group_by(actor_2_name) %>%
summarise(avg_fb_likes = mean(actor_2_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_2
## # A tibble: 24 x 2
## actor_2_name avg_fb_likes
## <chr> <dbl>
## 1 Andrew Fiscella 137000
## 2 Leonardo DiCaprio 29000
## 3 Tom Hardy 27000
## 4 Alan Rickman 25000
## 5 Christian Bale 23000
## 6 Paul Walker 23000
## 7 Peter Dinklage 22000
## 8 Philip Seymour Hoffman 22000
## 9 Robert De Niro 22000
## 10 Robert Downey Jr. 21000
## # ... with 14 more rows
fig <- plot_ly(top_actor_2, labels = ~actor_2_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 2',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the most liked third-billed actors (actor 3), based on their Facebook likes. The result shows that the top actor 3 is Joseph Gordon-Levitt.
top_actor_3 <- data1 %>%
group_by(actor_3_name) %>%
summarise(avg_fb_likes = mean(actor_3_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_3
## # A tibble: 21 x 2
## actor_3_name avg_fb_likes
## <chr> <dbl>
## 1 Joseph Gordon-Levitt 23000
## 2 Hugh Jackman 20000
## 3 Benedict Cumberbatch 19000
## 4 Scarlett Johansson 19000
## 5 Kristen Stewart 17000
## 6 Charlie Hunnam 16000
## 7 Christopher Lee 16000
## 8 Ryan Reynolds 16000
## 9 Jud Tylor 15000
## 10 Bradley Cooper 14000
## # ... with 11 more rows
fig <- plot_ly(top_actor_3, labels = ~actor_3_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 3',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The bar chart shows the 20 highest-budget movies. For this analysis the movies are arranged by budget and grouped by budget and title. The movie with the largest recorded budget, about 12.2 billion, is The Host; several of these very large figures appear to be recorded in local currency rather than US dollars.
HighBudget <- data1%>%
filter(num_voted_users > 7.0)%>%
arrange(desc(budget))%>%
head(n = 20) %>%
group_by(budget,movie_title)%>%
summarise(n = n())%>%
arrange(desc(n))
## `summarise()` regrouping output by 'budget' (override with `.groups` argument)
HighBudget
## # A tibble: 19 x 3
## # Groups: budget [17]
## budget movie_title n
## <dbl> <chr> <int>
## 1 258000000 Spider-Man 3 2
## 2 250000000 The Dark Knight Rises 1
## 3 260000000 Tangled 1
## 4 263700000 John Carter 1
## 5 300000000 Ong-bak 2 1
## 6 300000000 Pirates of the Caribbean: At World's End 1
## 7 390000000 The Messenger: The Story of Joan of Arc 1
## 8 400000000 The Legend of Suriyothai 1
## 9 553632000 Red Cliff 1
## 10 600000000 Kites 1
## 11 700000000 Kabhi Alvida Naa Kehna 1
## 12 700000000 Tango 1
## 13 1000000000 Godzilla 2000 1
## 14 1100000000 Akira 1
## 15 2127519898 Steamboy 1
## 16 2400000000 Princess Mononoke 1
## 17 2500000000 Fateless 1
## 18 4200000000 Lady Vengeance 1
## 19 12215500000 The Host 1
# 'title' is a layout attribute rather than a bar-trace attribute, so it is set via layout()
fig <- plot_ly(HighBudget, x = ~budget, y = ~movie_title, type = 'bar', orientation = 'h')
fig <- fig %>% layout(title = 'Top 20 High Budget Movies')
fig
The table shows the content ratings ranked by the average number of user votes. Movies rated PG-13 receive the most votes, presumably because their content suits a wide audience.
content_rating <- data1 %>%
group_by(content_rating) %>%
summarise(avg_num_vote = mean(num_voted_users)) %>%
arrange(desc(avg_num_vote)) %>%
top_n(10, avg_num_vote) %>%
formattable(list(avg_num_vote = color_bar("orange")), align = 'l')
## `summarise()` ungrouping output (override with `.groups` argument)
content_rating
| content_rating | avg_num_vote |
|---|---|
| PG-13 | 118324.18 |
| M | 106015.00 |
| G | 103782.29 |
| R | 102887.92 |
| Passed | 99862.67 |
| Approved | 89020.76 |
| PG | 84322.64 |
| GP | 74249.00 |
| X | 58022.60 |
| NC-17 | 42763.50 |
Genre plays an important part, as users mostly choose movies based on the genres they like. The table lists the genre combinations with the highest average IMDb score, with Adventure|Animation|Drama|Family|Musical and Crime|Drama|Fantasy|Mystery at the top; a sketch after the table shows how the pipe-separated genre field could be split into individual genres.
genres <- data1 %>%
group_by(genres) %>%
summarise(avg_imdb_score = mean(imdb_score)) %>%   # the value is an average score, not a vote count
arrange(desc(avg_imdb_score)) %>%
top_n(10, avg_imdb_score) %>%
formattable(list(avg_imdb_score = color_bar("red")), align = 'l')
## `summarise()` ungrouping output (override with `.groups` argument)
genres
| genres | avg_imdb_score |
|---|---|
| Adventure|Animation|Drama|Family|Musical | 8.50 |
| Crime|Drama|Fantasy|Mystery | 8.50 |
| Action|Adventure|Drama|Fantasy|War | 8.40 |
| Adventure|Animation|Fantasy | 8.40 |
| Adventure|Drama|Thriller|War | 8.40 |
| Adventure|Animation|Comedy|Drama|Family|Fantasy | 8.30 |
| Biography|Drama|History|Music | 8.30 |
| Documentary|Drama|Sport | 8.30 |
| Documentary|War | 8.30 |
| Adventure|Drama|War | 8.25 |
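The genres field stores several genres separated by "|", so the table above treats every combination as a single category. If a per-genre view were wanted, the field could be split so that each movie contributes one row per genre; a minimal sketch (it assumes the tidyr package is available, which is not loaded elsewhere in this report):
library(tidyr)
genre_scores <- data1 %>%
separate_rows(genres, sep = "\\|") %>%   # one row per (movie, genre) pair
group_by(genres) %>%
summarise(avg_imdb_score = mean(imdb_score), n_movies = n()) %>%
arrange(desc(avg_imdb_score))
head(genre_scores, 10)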
The scatter plot below is broken down by content rating. Movies with extremely high Facebook likes appear to have a higher IMDb score, while the score varies over a very wide range for movies with few Facebook likes.
data1 %>%
plot_ly(x = ~movie_facebook_likes, y = ~imdb_score, color = ~content_rating , mode = "markers", text = ~content_rating, alpha = 0.7, type = "scatter")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The table and pie chart below show low-budget movies with relatively high ratings: the data are filtered to budgets under $3 million, arranged by budget, and grouped by IMDb score, movie title and budget. For example, Cavite scored 6.3 on a budget of only $7,000.
LowBudgetHighRated<- data1 %>%
filter(num_voted_users > 8.0)%>%
filter(budget < 3000000)%>%
arrange(budget) %>%
head(n = 5) %>%
group_by(imdb_score , movie_title, budget) %>%
summarise(n = n())%>%
arrange(desc(n))
## `summarise()` regrouping output by 'imdb_score', 'movie_title' (override with `.groups` argument)
LowBudgetHighRated
## # A tibble: 5 x 4
## # Groups: imdb_score, movie_title [5]
## imdb_score movie_title budget n
## <dbl> <chr> <dbl> <int>
## 1 6.3 Cavite 7000 1
## 2 6.6 My Date with Drew 1100 1
## 3 6.9 Clean 4500 1
## 4 7 Primer 7000 1
## 5 7.2 Tarnation 218 1
fig <- plot_ly(LowBudgetHighRated, labels = ~movie_title , values = ~budget , type = 'pie')
fig <- fig %>% layout(title = 'Low Budget Movie with high vote',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
num = sapply(data1, is.numeric) # num contains numeric column indices
imdb_numeric = data1[, num] # the numeric data
names(imdb_numeric)
## [1] "num_critic_for_reviews" "duration"
## [3] "director_facebook_likes" "actor_3_facebook_likes"
## [5] "actor_1_facebook_likes" "num_voted_users"
## [7] "cast_total_facebook_likes" "facenumber_in_poster"
## [9] "num_user_for_reviews" "budget"
## [11] "actor_2_facebook_likes" "movie_facebook_likes"
## [13] "imdb_score" "gross"
## [15] "profit" "return_on_investment_perc"
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(imdb_numeric, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors. The highest, 0.95, is between actor_1_facebook_likes and cast_total_facebook_likes, and the actor 2 and actor 3 likes are also correlated with the total. We therefore reduce them to two variables: actor_1_facebook_likes and other_actors_facebook_likes.
There are also high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews. We keep num_voted_users and replace the other two with the ratio of num_critic_for_reviews to num_user_for_reviews.
Finally, budget and profit are highly correlated, so we keep profit and drop budget.
# add up actor 2 and 3 facebook likes into other actors facebook likes
imdb_numeric$other_actors_facebook_likes <- imdb_numeric$actor_2_facebook_likes + imdb_numeric$actor_3_facebook_likes
# use the ratio of critical reviews amount to total reviews amount
imdb_numeric$critic_review_ratio <- imdb_numeric$num_critic_for_reviews / imdb_numeric$num_user_for_reviews
# delete columns
IMDB <- subset(imdb_numeric, select = -c(cast_total_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, num_critic_for_reviews, num_user_for_reviews, budget))
ggcorr(IMDB, label = TRUE, label_round = 2, label_size = 4, size = 3, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
Now that no strong correlations (absolute value greater than 0.7) remain, we can proceed with data modelling.
1. Supervised Learning (Regression)
Regression algorithms are a form of supervised learning used to model continuous-valued outputs. A regression task begins with a data set in which the output values are known; the algorithm learns the output as a function of the input features, and the resulting model can then be applied to a different data set in which the output values are not known.
Before we proceed with modelling, the data need to be normalized so that all the variables are on a comparable scale.
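Concretely, preProcess() from caret with method = c("center", "scale") standardizes each numeric predictor,

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of predictor $j$ computed from the data passed to preProcess().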
# calculate the pre-process parameters from the dataset
# (column 7 of IMDB is imdb_score, the target, and is excluded from centering and scaling)
preprocess1 <- preProcess(IMDB[,c(1,2,3,4,5,6,8,9,10,11,12)], method=c("center", "scale"))
transformed <- predict(preprocess1, IMDB[,c(1,2,3,4,5,6,8,9,10,11,12)])
head(transformed)
## duration director_facebook_likes actor_1_facebook_likes num_voted_users
## 1 2.9996817 -0.25891018 -0.4286856 5.1993129
## 2 2.6032748 -0.07234285 2.1103520 2.4458992
## 3 1.6783253 -0.25891018 0.2223497 1.1497411
## 4 2.3830487 7.03146504 1.2640061 6.9120222
## 5 0.9736018 -0.10150435 -0.4521229 0.7273312
## 6 2.0306870 -0.25891018 1.0686955 1.8609322
## facenumber_in_poster movie_facebook_likes gross profit
## 1 -0.6706420 1.1201293 10.1636906 2.28871906
## 2 -0.6706420 -0.4289656 3.7011456 0.01573722
## 3 -0.1838918 3.5611273 2.1348693 -0.22447022
## 4 -0.6706420 7.2695666 5.6885604 0.85014778
## 5 -0.1838918 0.6976488 0.3152278 -0.86871761
## 6 -0.6706420 -0.4289656 4.0897586 0.32136252
## return_on_investment_perc other_actors_facebook_likes critic_review_ratio
## 1 -0.02349024 -0.1628445 -0.5707955
## 2 -0.04028933 0.5716373 -0.5637571
## 3 -0.04194582 -0.3787043 -0.2103388
## 4 -0.03441708 7.5517440 -0.5080045
## 5 -0.04610850 -0.2726067 -0.1904226
## 6 -0.03818295 2.1421613 -0.6007345
# reattach the unscaled target column to the transformed predictors
score_db <- subset( IMDB, select = c(imdb_score))
dat <- cbind(transformed,score_db)
Multivariate linear regression is one of the most popular regression models. The output is continuous-valued and depends on one or more features; a linear function of all the features is fitted to predict the output, as sketched below.
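With the eleven standardized predictors $x_1,\dots,x_{11}$ (duration, director Facebook likes, lead-actor Facebook likes, number of voting users, faces in the poster, movie Facebook likes, gross, profit, return on investment, other actors' Facebook likes, and the critic-to-user review ratio), the model fitted by lm() below has the form

$$\widehat{\text{imdb\_score}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{11} x_{11},$$

with the coefficients $\beta_j$ estimated by minimizing the residual sum of squares on the training set; the estimates appear in the summary output further down.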
dat = data.frame(dat)
# hold out a random 25% of the rows as the test set; the rest is used for training
index <- 1:nrow(dat)
testindex <- sample(index, trunc(length(index)*1/4))
testset <- dat[testindex,]
trainset <- dat[-testindex,]
lm1 = lm(imdb_score ~., data = trainset)
# predict on the held-out rows (column 12 is imdb_score) and compute the test MSE
prediction = predict(lm1, testset[,-12])
mse <- mean((testset$imdb_score - prediction)^2)
print(mse)
## [1] 0.7704616
summary(lm1)
##
## Call:
## lm(formula = imdb_score ~ ., data = trainset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8012 -0.4828 0.1001 0.6171 2.4033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.45756 0.01654 390.477 < 2e-16 ***
## duration 0.23519 0.01748 13.457 < 2e-16 ***
## director_facebook_likes 0.03765 0.01777 2.118 0.0342 *
## actor_1_facebook_likes 0.01476 0.01663 0.888 0.3748
## num_voted_users 0.53882 0.02592 20.791 < 2e-16 ***
## facenumber_in_poster -0.07624 0.01637 -4.656 3.36e-06 ***
## movie_facebook_likes 0.02361 0.01885 1.253 0.2104
## gross -0.15976 0.02220 -7.196 7.88e-13 ***
## profit 0.01400 0.01600 0.875 0.3816
## return_on_investment_perc 0.01304 0.01438 0.907 0.3646
## other_actors_facebook_likes -0.03328 0.01878 -1.772 0.0764 .
## critic_review_ratio 0.07843 0.01551 5.058 4.50e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8928 on 2907 degrees of freedom
## Multiple R-squared: 0.2936, Adjusted R-squared: 0.2909
## F-statistic: 109.8 on 11 and 2907 DF, p-value: < 2.2e-16
A Support Vector Machine can also be used as a regression method, retaining the main ideas that characterize the algorithm. Support Vector Regression (SVR) follows the same principles as SVM classification, with a few differences: a tolerance margin epsilon is set around the regression function, and the goal is to find the flattest function that keeps most errors within that margin. In SVR the input is mapped onto an n-dimensional feature space using a non-linear mapping, and a linear model is then constructed in that space; the standard formulation is sketched below.
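For reference, the epsilon-insensitive formulation solved by e1071::svm() (an interface to libsvm) is

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \quad \text{s.t.} \quad y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i,\;\; \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*,\;\; \xi_i, \xi_i^* \ge 0,$$

where $\phi$ is the feature map of the chosen kernel. The print(svm1) output below confirms the defaults used here: a radial kernel, cost $C = 1$, $\varepsilon = 0.1$, and gamma equal to one over the number of predictors (1/11 ≈ 0.0909).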
svm1 = svm(imdb_score ~., data = trainset)
prediction1 = predict(svm1, testset[,-12])
mse <- mean((testset$imdb_score - prediction1)^2)
print(mse)
## [1] 0.5890036
print(svm1)
##
## Call:
## svm(formula = imdb_score ~ ., data = trainset)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2555
A Random Forest is an ensemble technique capable of performing both regression and classification tasks, using multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging. The basic idea is to combine many decision trees when determining the final output rather than relying on a single tree.
rfr <- randomForest(imdb_score~., data=trainset, mtry = 3, importance = TRUE, na.action = na.omit)
# display results
print(rfr)
##
## Call:
## randomForest(formula = imdb_score ~ ., data = trainset, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.5488715
## % Var explained: 51.15
plot(rfr)
This plot shows the error against the number of trees: the error drops steadily as more trees are added and averaged. The test-set error and the variable importance stored in the fit are examined below.
prediction2 = predict(rfr, testset[,-12])
mse1 <- mean((testset$imdb_score - prediction2)^2)
print(mse1)
## [1] 0.4954532
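Because the forest was fitted with importance = TRUE, the contribution of each predictor can be inspected directly with the randomForest accessors; a short sketch:
# permutation importance (%IncMSE) and node-purity importance for every predictor
importance(rfr)
# dot chart of the predictors ranked by importance
varImpPlot(rfr, main = "Random Forest variable importance")
This ranking is the basis for the important features discussed in the conclusion.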
library(rpart)
library(rpart.plot)
# fit a single regression tree on the full (unscaled) numeric data and plot it
fit3=rpart(imdb_score~., IMDB)
prp(fit3)
Decision rules read from the tree (variable names abbreviated as in the plot; see the sketch after the list for extracting them programmatically):
1. If (num_vote > 384000) then the predicted score is 8.
2. If (num_vote < 384000) and (gross > 34000000) then the predicted score is 7.5.
3. If (num_vote < 384000) and (gross < 34000000) then the predicted score is 6.9.
4. If (num_vote < 111000) and (duration >= 106) and (duration < 122) then the predicted score is 6.8.
5. If (num_vote < 111000) and (duration >= 122) then the predicted score is 6.4.
6. If (num_vote < 111000) and (duration < 106) and (other_ac < 185) then the predicted score is 6.7.
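A minimal sketch using rpart.plot, which is already loaded above:
# one row per leaf: the predicted score and the splits leading to it
rpart.rules(fit3, roundint = FALSE)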
2. Unsupervised Learning
K-means clustering partitions a given data set into k clusters by defining k centroids that act as cluster representatives: each point is assigned to its nearest centroid, the centroids are recomputed, and the process repeats until the assignments stop changing. The objective it minimizes is sketched below.
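Formally, for a chosen $k$ the algorithm seeks assignments of the points to clusters $C_1,\dots,C_k$ with centroids $\mu_1,\dots,\mu_k$ that minimize the total within-cluster sum of squares,

$$\sum_{j=1}^{k}\sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

which is the tot.withinss quantity plotted in the elbow curve further below.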
# build the clustering data: four numeric features plus gross discretized into 5 bins (arules::discretize)
clusterdata = cbind.data.frame(imdb_numeric$num_critic_for_reviews,imdb_numeric$duration,imdb_numeric$num_voted_users,imdb_numeric$num_user_for_reviews,discretize(imdb_numeric$gross,breaks = 5))
colnames(clusterdata) = c("num_critic_for_reviews","duration","num_voted_users","num_users_for_review","gross")
clusterdata=na.omit(clusterdata)
# rescale each numeric feature by dividing by its maximum
clusterdat <- data.frame(lapply(clusterdata[,1:4], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
set.seed(123)
# Compute and plot wss for k = 2 to k = 15.
k.max <- 15
dat2 <- clusterdat
wss <- sapply(1:k.max,
function(k){kmeans(dat2, k, nstart=50,iter.max = 15 )$tot.withinss})
## Warning: did not converge in 15 iterations
wss
## [1] 165.19936 88.56859 67.25311 56.42546 49.51705 44.04464 39.16862
## [8] 35.70770 33.38798 31.03961 28.84333 26.78012 25.50375 24.28070
## [15] 23.34072
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
Using the elbow method above, the total within-cluster sum of squares flattens noticeably after about 5 clusters, so we group the observations into 5 clusters.
imdbCluster <- kmeans(clusterdat, 5, nstart = 25)
table(imdbCluster$cluster,clusterdata$gross)
##
## [162,4.19e+06) [4.19e+06,1.8e+07) [1.8e+07,3.87e+07) [3.87e+07,7.96e+07)
## 1 44 67 103 134
## 2 572 447 336 276
## 3 0 7 8 20
## 4 6 6 17 23
## 5 156 251 315 324
##
## [7.96e+07,7.61e+08]
## 1 166
## 2 118
## 3 103
## 4 98
## 5 294
plot(imdbCluster$cluster)
clusplot(clusterdata, imdbCluster$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
plotcluster(clusterdata[,1:4], imdbCluster$cluster)
imdbCluster$size
## [1] 514 1749 138 150 1340
This project set out to determine which features influence the IMDb user rating and a movie's gross income. We were also interested in finding out which regression models perform better on this dataset. Since the regression models used here take numerical inputs, only numerical features were used for modelling, and the data were cleaned and normalized according to each objective.
From the results above, the test MSE of Random Forest regression (about 0.50) is the lowest, compared with linear regression (about 0.77) and SVR (about 0.59), for predicting the IMDb user rating. Comparing against other algorithms such as neural networks or gradient boosting might give better results still; a compact comparison table is sketched below.
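A compact way to put the three test errors side by side (a sketch; it assumes the three MSE values were stored in separate objects, say mse_lm, mse_svr and mse_rf, instead of being overwritten as above):
# hypothetical objects holding the test MSEs computed earlier
model_comparison <- data.frame(model = c("Linear regression", "SVR", "Random Forest"),
test_mse = c(mse_lm, mse_svr, mse_rf))
model_comparison[order(model_comparison$test_mse), ]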
Based on the Random Forest regression and the fitted tree, four features stand out in predicting imdb_score: duration, num_voted_users, gross and other_actors_facebook_likes. From these rules we can conclude that movies with many votes on the IMDb website tend to have a higher score, which makes sense because popular movies attract many fans who vote high scores for them.
Conversely, a movie with fewer votes can still score well if its duration is longer. It is somewhat surprising that movies that make less profit can score well, although movies that make more profit score acceptably too.
K-means clustering suggested that the data can be grouped into 5 clusters, although the majority of the points fall into two of them. The approach can nevertheless be used to choose a sensible number of categories when predicting a movie's gross income.
To predict the rating, we found that the numerical variables were the most correlated with the rating and hence the most useful predictors. The most influential features were the duration and the number of users who voted. Combining certain categorical attributes did not improve the accuracy of the model.
Out of all the regression models, the mean squared error values make it clear that the Random Forest regression model was the most accurate at predicting the IMDb user rating.
In predicting gross-income categories, K-means clustering showed a promising result, since the data were spread across all the categories within the 5 clusters. A different approach, such as a CART model, could be compared against the K-means result and might perform better; a sketch follows.
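As a rough illustration of that suggestion, a classification tree could be fitted directly to the discretized gross categories used for clustering; a sketch (it reuses the clusterdata frame built earlier, in which gross is already a five-level factor produced by discretize()):
# CART classifier for the gross-income bin, using the same four numeric features
cart_gross <- rpart(gross ~ num_critic_for_reviews + duration + num_voted_users + num_users_for_review,
data = clusterdata, method = "class")
# confusion table of predicted vs. actual bins on the training data
table(predicted = predict(cart_gross, type = "class"), actual = clusterdata$gross)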