The Internet Movie Database (IMDb) is an online platform that provides information about movies, television programs, streaming video, and video games, including cast and director names, user votes, the production crew and more. IMDb is based in Seattle, but the office of Col Needham, its founder and CEO, remains in Bristol, England, where the website was created.
The IMDb website has expanded beyond its original aim of indexing film and TV production credits. An entry for a film can now include the studios and other businesses involved, release dates in different countries, censorship classifications, box office grosses, awards received, and other information. Searching IMDb is free of charge. Users who want to rate or review films or contribute information must sign up for an IMDb account, but this is also free. The firm launched IMDbPro in 2002 as a fee-based service for industry professionals.
A commercially successful movie not only entertains fans but also earns the film studio a considerable profit. Many factors are critical for making a good movie, such as a capable director and experienced actors. Popular directors and actors can carry a reasonably predictable box office income, but they cannot guarantee a high IMDb rating.
The IMDb movies dataset contains information on approximately 5,000 movies, including director and actor names, Facebook likes, genres, content rating, number of votes and more. Before cleaning, the dataset provides 28 features, both categorical and numerical, including the budget and gross income of each movie. Most of the movies were released between 1998 and 2016.
With such a large dataset of interesting and valuable features, it is worth investigating which factors make one movie more popular than another. The project therefore aims to examine what kinds of movies earn a higher IMDb score. The outcomes of the analysis are presented visually using ggplot2 in R.
In this project, the dataset's features are used to predict the IMDb score and to determine which factors actually influence it. We also explore how unsupervised machine learning can help cluster movies according to their gross income.
The objectives of the project are:
1. To provide useful and interesting information for users by examining the significant factors that lead to a high movie rating.
2. To compare different machine learning models in predicting a movie's IMDb score and to determine the most important features.
3. To examine cluster analysis by grouping the movies based on their gross income.
library(lattice)
library(readr)
library(dbplyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggrepel)
library(ggthemes) # visualization
library(scales) # visualization
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(corrplot)
## corrplot 0.84 loaded
library(formattable)
##
## Attaching package: 'formattable'
## The following objects are masked from 'package:scales':
##
## comma, percent, scientific
## The following object is masked from 'package:plotly':
##
## style
library(stringr)
library(rpart)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret) #Data Modelling
library(mlbench) #Data Modelling
library(cluster)
library(fpc)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
data <- read.csv("movie_imdb_extended.csv", header=TRUE)
head(data)
## movie_title color
## 1 Avatar Color
## 2 Pirates of the Caribbean: At World's End Color
## 3 Spectre Color
## 4 The Dark Knight Rises Color
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter Color
## director_name num_critic_for_reviews duration director_facebook_likes
## 1 James Cameron 723 178 0
## 2 Gore Verbinski 302 169 563
## 3 Sam Mendes 602 148 0
## 4 Christopher Nolan 813 164 22000
## 5 Doug Walker NA NA 131
## 6 Andrew Stanton 462 132 475
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 1 855 Joel David Moore 1000
## 2 1000 Orlando Bloom 40000
## 3 161 Rory Kinnear 11000
## 4 23000 Christian Bale 27000
## 5 NA Rob Walker 131
## 6 530 Samantha Morton 640
## genres actor_1_name num_voted_users
## 1 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204
## 2 Action|Adventure|Fantasy Johnny Depp 471220
## 3 Action|Adventure|Thriller Christoph Waltz 275868
## 4 Action|Thriller Tom Hardy 1144337
## 5 Documentary Doug Walker 8
## 6 Action|Adventure|Sci-Fi Daryl Sabara 212204
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 1 4834 Wes Studi 0
## 2 48350 Jack Davenport 0
## 3 11700 Stephanie Sigman 1
## 4 106759 Joseph Gordon-Levitt 0
## 5 143 0
## 6 1873 Polly Walker 1
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## 6 alien|american civil war|male nipple|mars|princess
## movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 NA
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1 738
## language country content_rating budget title_year actor_2_facebook_likes
## 1 English USA PG-13 237000000 2009 936
## 2 English USA PG-13 300000000 2007 5000
## 3 English UK PG-13 245000000 2015 393
## 4 English USA PG-13 250000000 2012 23000
## 5 NA NA 12
## 6 English USA PG-13 263700000 2012 632
## aspect_ratio movie_facebook_likes imdb_score gross
## 1 1.78 33000 7.9 760505847
## 2 2.35 0 7.1 309404152
## 3 2.35 85000 6.8 200074175
## 4 2.35 164000 8.5 448130642
## 5 NA 0 7.1 NA
## 6 2.35 24000 6.6 73058679
class(data)
## [1] "data.frame"
The dataset has 28 variables and 5,043 rows in total. The target variable imdb_score is numerical, and the predictors are a mix of numerical and categorical variables.
dim(data)
## [1] 5043 28
str(data)
## 'data.frame': 5043 obs. of 28 variables:
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
colnames(data)
## [1] "movie_title" "color"
## [3] "director_name" "num_critic_for_reviews"
## [5] "duration" "director_facebook_likes"
## [7] "actor_3_facebook_likes" "actor_2_name"
## [9] "actor_1_facebook_likes" "genres"
## [11] "actor_1_name" "num_voted_users"
## [13] "cast_total_facebook_likes" "actor_3_name"
## [15] "facenumber_in_poster" "plot_keywords"
## [17] "movie_imdb_link" "num_user_for_reviews"
## [19] "language" "country"
## [21] "content_rating" "budget"
## [23] "title_year" "actor_2_facebook_likes"
## [25] "aspect_ratio" "movie_facebook_likes"
## [27] "imdb_score" "gross"
1. Tidy up the movie titles
All the movie titles in the raw dataset contain a special character (Â), and some have trailing white space, probably introduced during data collection. The code below removes both.
data$movie_title <- gsub("Â", "", as.character(factor(data$movie_title)))
data$movie_title <- str_trim(data$movie_title, side = "right")   # keep the trimmed titles
data$movie_title[6]
## [1] "John Carter"
2. Removing unnecessary columns
data <- subset( data, select = -c(plot_keywords, movie_imdb_link))
head(data)
## movie_title color
## 1 Avatar Color
## 2 Pirates of the Caribbean: At World's End Color
## 3 Spectre Color
## 4 The Dark Knight Rises Color
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter Color
## director_name num_critic_for_reviews duration director_facebook_likes
## 1 James Cameron 723 178 0
## 2 Gore Verbinski 302 169 563
## 3 Sam Mendes 602 148 0
## 4 Christopher Nolan 813 164 22000
## 5 Doug Walker NA NA 131
## 6 Andrew Stanton 462 132 475
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 1 855 Joel David Moore 1000
## 2 1000 Orlando Bloom 40000
## 3 161 Rory Kinnear 11000
## 4 23000 Christian Bale 27000
## 5 NA Rob Walker 131
## 6 530 Samantha Morton 640
## genres actor_1_name num_voted_users
## 1 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204
## 2 Action|Adventure|Fantasy Johnny Depp 471220
## 3 Action|Adventure|Thriller Christoph Waltz 275868
## 4 Action|Thriller Tom Hardy 1144337
## 5 Documentary Doug Walker 8
## 6 Action|Adventure|Sci-Fi Daryl Sabara 212204
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 1 4834 Wes Studi 0
## 2 48350 Jack Davenport 0
## 3 11700 Stephanie Sigman 1
## 4 106759 Joseph Gordon-Levitt 0
## 5 143 0
## 6 1873 Polly Walker 1
## num_user_for_reviews language country content_rating budget title_year
## 1 3054 English USA PG-13 237000000 2009
## 2 1238 English USA PG-13 300000000 2007
## 3 994 English UK PG-13 245000000 2015
## 4 2701 English USA PG-13 250000000 2012
## 5 NA NA NA
## 6 738 English USA PG-13 263700000 2012
## actor_2_facebook_likes aspect_ratio movie_facebook_likes imdb_score gross
## 1 936 1.78 33000 7.9 760505847
## 2 5000 2.35 0 7.1 309404152
## 3 393 2.35 85000 6.8 200074175
## 4 23000 2.35 164000 8.5 448130642
## 5 12 NA 0 7.1 NA
## 6 632 2.35 24000 6.6 73058679
3. Change title_year and aspect_ratio column type
data$title_year = as.factor(data$title_year)
data$aspect_ratio = as.factor(data$aspect_ratio)
4. Handling null values
a. Count the NA values in each column.
colSums(is.na(data))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 50 15 104
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 23 0 7
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 13
## num_user_for_reviews language country
## 21 0 0
## content_rating budget title_year
## 0 492 108
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 13 329 0
## imdb_score gross
## 0 884
b. Drop movies with zero or missing gross and budget (subset() also removes rows where either value is NA).
data1<-subset(data, gross!=0 & budget!=0)
dim(data1)
## [1] 3891 26
colSums(is.na(data1))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 1 1 0
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 10 0 3
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 6
## num_user_for_reviews language country
## 0 0 0
## content_rating budget title_year
## 0 0 0
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 5 75 0
## imdb_score gross
## 0 0
c. Impute the remaining missing values: use the column mean (or the median for duration) for the count variables, zero for facenumber_in_poster, and the most common aspect ratio, 2.35, chosen from the bar chart below.
data1$num_critic_for_reviews[is.na(data1$num_critic_for_reviews)]<-mean(data1$num_critic_for_reviews,na.rm=TRUE)
data1$duration[is.na(data1$duration)]<-median(data1$duration,na.rm=TRUE)
data1$actor_1_facebook_likes[is.na(data1$actor_1_facebook_likes)]<-mean(data1$actor_1_facebook_likes,na.rm=TRUE)
data1$actor_2_facebook_likes[is.na(data1$actor_2_facebook_likes)]<-mean(data1$actor_2_facebook_likes,na.rm=TRUE)
data1$actor_3_facebook_likes[is.na(data1$actor_3_facebook_likes)]<-mean(data1$actor_3_facebook_likes,na.rm=TRUE)
data1$facenumber_in_poster[is.na(data1$facenumber_in_poster)] <- 0
ggplot(data1) + geom_bar(aes(x = aspect_ratio))
data1$aspect_ratio[is.na(data1$aspect_ratio)] <- "2.35"
colSums(is.na(data1))
## movie_title color director_name
## 0 0 0
## num_critic_for_reviews duration director_facebook_likes
## 0 0 0
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 0 0 0
## genres actor_1_name num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 0
## num_user_for_reviews language country
## 0 0 0
## content_rating budget title_year
## 0 0 0
## actor_2_facebook_likes aspect_ratio movie_facebook_likes
## 0 0 0
## imdb_score gross
## 0 0
Next, count how many movies come from each unique country.
table(data1$country)
##
## Afghanistan Argentina Aruba Australia Belgium
## 1 3 1 41 2
## Brazil Canada Chile China Colombia
## 5 64 1 15 1
## Czech Republic Denmark Finland France Georgia
## 3 9 1 106 1
## Germany Greece Hong Kong Hungary Iceland
## 82 1 13 2 2
## India Indonesia Iran Ireland Israel
## 14 1 4 7 3
## Italy Japan Mexico Netherlands New Line
## 11 17 11 3 1
## New Zealand Norway Official site Peru Philippines
## 11 4 1 1 1
## Poland Romania Russia South Africa South Korea
## 1 2 3 3 9
## Spain Sweden Taiwan Thailand UK
## 22 1 2 4 325
## USA West Germany
## 3074 1
The most frequent country in the data is the USA, followed by the United Kingdom (UK); all other countries are grouped together as "Others".
levels(data1$country) <- c(levels(data1$country), "Others")
data1$country[(data1$country != 'USA')&(data1$country != 'UK')] <- 'Others'
data1$country <- factor(data1$country)
table(data1$country)
##
## Others UK USA
## 492 325 3074
The histogram below shows that the number of movies released per year starts to climb around 1997, after which more than 100 movies are released each year.
fig1 <- plot_ly (x=data1$title_year, histfunc='count', type = "histogram")
fig1 <- fig1 %>% layout(yaxis=list(type='linear'))
fig1
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The plot below shows the top 10 most profitable movies. To generate it, a new column, profit, is calculated as gross minus budget. The graph shows a roughly linear trend: profit increases as the budget increases.
data1 <- data1 %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100)
data1 %>%
filter(title_year %in% c(2000:2016)) %>%
arrange(desc(profit)) %>%
top_n(10, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point() +
geom_smooth() +
geom_text_repel(aes(label=movie_title)) +
labs(x = "Budget $million", y = "Profit $million", title = "Top 10 Profitable Movies") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Well-known directors are frequently searched for by users, who expect a strong director to produce good movies based on their past work. The pie chart lists the 20 directors with the highest average IMDb score; Charles Chaplin and Tony Kaye top the list, sharing the same value of 8.6%.
top_director <- data1 %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb)
## `summarise()` ungrouping output (override with `.groups` argument)
fig <- plot_ly(top_director, labels = ~director_name, values = ~avg_imdb, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most competitive director',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the top 20 lead actors (actor 1) ranked by their average Facebook likes. The lead actor is also a major factor in a movie achieving a high rating. The result shows that the most liked lead actor is Darcy Donavan.
top_actor_1 <- data1 %>%
group_by(actor_1_name) %>%
summarise(avg_fb_likes = mean(actor_1_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_1
## # A tibble: 21 x 2
## actor_1_name avg_fb_likes
## <chr> <dbl>
## 1 Darcy Donavan 640000
## 2 Matthew Ziff 260000
## 3 Krista Allen 164000
## 4 Andrew Fiscella 137000
## 5 Jimmy Bennett 87000
## 6 Michael Joiner 77000
## 7 Robin Williams 49000
## 8 Craig Stark 46000
## 9 Muse Watson 45000
## 10 Tim Holmes 44000
## # ... with 11 more rows
fig <- plot_ly(top_actor_1, labels = ~actor_1_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 1',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the most liked supporting actors (actor 2), based on their Facebook likes. The result shows that the top actor 2 is Andrew Fiscella.
top_actor_2 <- data1 %>%
group_by(actor_2_name) %>%
summarise(avg_fb_likes = mean(actor_2_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_2
## # A tibble: 24 x 2
## actor_2_name avg_fb_likes
## <chr> <dbl>
## 1 Andrew Fiscella 137000
## 2 Leonardo DiCaprio 29000
## 3 Tom Hardy 27000
## 4 Alan Rickman 25000
## 5 Christian Bale 23000
## 6 Paul Walker 23000
## 7 Peter Dinklage 22000
## 8 Philip Seymour Hoffman 22000
## 9 Robert De Niro 22000
## 10 Robert Downey Jr. 21000
## # ... with 14 more rows
fig <- plot_ly(top_actor_2, labels = ~actor_2_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 2',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The pie chart shows the most liked third-billed actors (actor 3), based on their Facebook likes. The result shows that the top actor 3 is Joseph Gordon-Levitt.
top_actor_3 <- data1 %>%
group_by(actor_3_name) %>%
summarise(avg_fb_likes = mean(actor_3_facebook_likes)) %>%
arrange(desc(avg_fb_likes)) %>%
top_n(20, avg_fb_likes)
## `summarise()` ungrouping output (override with `.groups` argument)
top_actor_3
## # A tibble: 21 x 2
## actor_3_name avg_fb_likes
## <chr> <dbl>
## 1 Joseph Gordon-Levitt 23000
## 2 Hugh Jackman 20000
## 3 Benedict Cumberbatch 19000
## 4 Scarlett Johansson 19000
## 5 Kristen Stewart 17000
## 6 Charlie Hunnam 16000
## 7 Christopher Lee 16000
## 8 Ryan Reynolds 16000
## 9 Jud Tylor 15000
## 10 Bradley Cooper 14000
## # ... with 11 more rows
fig <- plot_ly(top_actor_3, labels = ~actor_3_name, values = ~avg_fb_likes, type = 'pie')
fig <- fig %>% layout(title = 'Top 20 most famous Actor 3',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
The bar chart shows the 20 highest-budget movies. For this analysis the movies are arranged by budget and grouped by budget and title. The movie with the largest recorded budget, about 12.2 billion, is The Host; several of these very large figures appear to be recorded in local currency rather than US dollars.
HighBudget <- data1%>%
filter(num_voted_users > 7.0)%>%
arrange(desc(budget))%>%
head(n = 20) %>%
group_by(budget,movie_title)%>%
summarise(n = n())%>%
arrange(desc(n))
## `summarise()` regrouping output by 'budget' (override with `.groups` argument)
HighBudget
## # A tibble: 19 x 3
## # Groups: budget [17]
## budget movie_title n
## <dbl> <chr> <int>
## 1 258000000 Spider-Man 3 2
## 2 250000000 The Dark Knight Rises 1
## 3 260000000 Tangled 1
## 4 263700000 John Carter 1
## 5 300000000 Ong-bak 2 1
## 6 300000000 Pirates of the Caribbean: At World's End 1
## 7 390000000 The Messenger: The Story of Joan of Arc 1
## 8 400000000 The Legend of Suriyothai 1
## 9 553632000 Red Cliff 1
## 10 600000000 Kites 1
## 11 700000000 Kabhi Alvida Naa Kehna 1
## 12 700000000 Tango 1
## 13 1000000000 Godzilla 2000 1
## 14 1100000000 Akira 1
## 15 2127519898 Steamboy 1
## 16 2400000000 Princess Mononoke 1
## 17 2500000000 Fateless 1
## 18 4200000000 Lady Vengeance 1
## 19 12215500000 The Host 1
# 'title' is a layout attribute rather than a bar-trace attribute, so it is set via layout()
fig <- plot_ly(HighBudget, x = ~budget, y = ~movie_title, type = 'bar', orientation = 'h')
fig <- fig %>% layout(title = 'Top 20 High Budget Movies')
fig
The table shows the content ratings ranked by the average number of user votes. Movies rated PG-13 receive the most votes, presumably because their content suits a wide audience.
content_rating <- data1 %>%
group_by(content_rating) %>%
summarise(avg_num_vote = mean(num_voted_users)) %>%
arrange(desc(avg_num_vote)) %>%
top_n(10, avg_num_vote) %>%
formattable(list(avg_num_vote = color_bar("orange")), align = 'l')
## `summarise()` ungrouping output (override with `.groups` argument)
content_rating
| content_rating | avg_num_vote |
|---|---|
| PG-13 | 118324.18 |
| M | 106015.00 |
| G | 103782.29 |
| R | 102887.92 |
| Passed | 99862.67 |
| Approved | 89020.76 |
| PG | 84322.64 |
| GP | 74249.00 |
| X | 58022.60 |
| NC-17 | 42763.50 |
Genre plays an important part, as users mostly choose movies based on the genres they like. The table lists the genre combinations with the highest average IMDb score, with Adventure|Animation|Drama|Family|Musical and Crime|Drama|Fantasy|Mystery at the top; a sketch after the table shows how the pipe-separated genre field could be split into individual genres.
genres <- data1 %>%
group_by(genres) %>%
summarise(avg_imdb_score = mean(imdb_score)) %>%   # the value is an average score, not a vote count
arrange(desc(avg_imdb_score)) %>%
top_n(10, avg_imdb_score) %>%
formattable(list(avg_imdb_score = color_bar("red")), align = 'l')
## `summarise()` ungrouping output (override with `.groups` argument)
genres
| genres | avg_imdb_score |
|---|---|
| Adventure|Animation|Drama|Family|Musical | 8.50 |
| Crime|Drama|Fantasy|Mystery | 8.50 |
| Action|Adventure|Drama|Fantasy|War | 8.40 |
| Adventure|Animation|Fantasy | 8.40 |
| Adventure|Drama|Thriller|War | 8.40 |
| Adventure|Animation|Comedy|Drama|Family|Fantasy | 8.30 |
| Biography|Drama|History|Music | 8.30 |
| Documentary|Drama|Sport | 8.30 |
| Documentary|War | 8.30 |
| Adventure|Drama|War | 8.25 |
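The genres field stores several genres separated by "|", so the table above treats every combination as a single category. If a per-genre view were wanted, the field could be split so that each movie contributes one row per genre; a minimal sketch (it assumes the tidyr package is available, which is not loaded elsewhere in this report):
library(tidyr)
genre_scores <- data1 %>%
separate_rows(genres, sep = "\\|") %>%   # one row per (movie, genre) pair
group_by(genres) %>%
summarise(avg_imdb_score = mean(imdb_score), n_movies = n()) %>%
arrange(desc(avg_imdb_score))
head(genre_scores, 10)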
The scatter plot below is broken down by content rating. Movies with extremely high Facebook likes appear to have a higher IMDb score, while the score varies over a very wide range for movies with few Facebook likes.
data1 %>%
plot_ly(x = ~movie_facebook_likes, y = ~imdb_score, color = ~content_rating , mode = "markers", text = ~content_rating, alpha = 0.7, type = "scatter")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The table and pie chart below show low-budget movies with relatively high ratings: the data are filtered to budgets under $3 million, arranged by budget, and grouped by IMDb score, movie title and budget. For example, Cavite scored 6.3 on a budget of only $7,000.
LowBudgetHighRated<- data1 %>%
filter(num_voted_users > 8.0)%>%
filter(budget < 3000000)%>%
arrange(budget) %>%
head(n = 5) %>%
group_by(imdb_score , movie_title, budget) %>%
summarise(n = n())%>%
arrange(desc(n))
## `summarise()` regrouping output by 'imdb_score', 'movie_title' (override with `.groups` argument)
LowBudgetHighRated
## # A tibble: 5 x 4
## # Groups: imdb_score, movie_title [5]
## imdb_score movie_title budget n
## <dbl> <chr> <dbl> <int>
## 1 6.3 Cavite 7000 1
## 2 6.6 My Date with Drew 1100 1
## 3 6.9 Clean 4500 1
## 4 7 Primer 7000 1
## 5 7.2 Tarnation 218 1
fig <- plot_ly(LowBudgetHighRated, labels = ~movie_title , values = ~budget , type = 'pie')
fig <- fig %>% layout(title = 'Low Budget Movie with high vote',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
num = sapply(data1, is.numeric) # num contains numeric column indices
imdb_numeric = data1[, num] # the numeric data
names(imdb_numeric)
## [1] "num_critic_for_reviews" "duration"
## [3] "director_facebook_likes" "actor_3_facebook_likes"
## [5] "actor_1_facebook_likes" "num_voted_users"
## [7] "cast_total_facebook_likes" "facenumber_in_poster"
## [9] "num_user_for_reviews" "budget"
## [11] "actor_2_facebook_likes" "movie_facebook_likes"
## [13] "imdb_score" "gross"
## [15] "profit" "return_on_investment_perc"
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(imdb_numeric, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors. The highest, 0.95, is between actor_1_facebook_likes and cast_total_facebook_likes, and the actor 2 and actor 3 likes are also correlated with the total. We therefore reduce them to two variables: actor_1_facebook_likes and other_actors_facebook_likes.
There are also high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews. We keep num_voted_users and replace the other two with the ratio of num_critic_for_reviews to num_user_for_reviews.
Finally, budget and profit are highly correlated, so we keep profit and drop budget.
# add up actor 2 and 3 facebook likes into other actors facebook likes
imdb_numeric$other_actors_facebook_likes <- imdb_numeric$actor_2_facebook_likes + imdb_numeric$actor_3_facebook_likes
# use the ratio of critical reviews amount to total reviews amount
imdb_numeric$critic_review_ratio <- imdb_numeric$num_critic_for_reviews / imdb_numeric$num_user_for_reviews
# delete columns
IMDB <- subset(imdb_numeric, select = -c(cast_total_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, num_critic_for_reviews, num_user_for_reviews, budget))
ggcorr(IMDB, label = TRUE, label_round = 2, label_size = 4, size = 3, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
Now that no strong correlations (absolute value greater than 0.7) remain, we can proceed with data modelling.
1. Supervised Learning (Regression)
Regression algorithms are a form of supervised learning used to model continuous-valued outputs. A regression task begins with a data set in which the output values are known; the algorithm learns the output as a function of the input features, and the resulting model can then be applied to a different data set in which the output values are not known.
Before we proceed with modelling, the data need to be normalized so that all the variables are on a comparable scale.
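Concretely, preProcess() from caret with method = c("center", "scale") standardizes each numeric predictor,

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of predictor $j$ computed from the data passed to preProcess().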
# calculate the pre-process parameters from the dataset
# (column 7 of IMDB is imdb_score, the target, and is excluded from centering and scaling)
preprocess1 <- preProcess(IMDB[,c(1,2,3,4,5,6,8,9,10,11,12)], method=c("center", "scale"))
transformed <- predict(preprocess1, IMDB[,c(1,2,3,4,5,6,8,9,10,11,12)])
head(transformed)
## duration director_facebook_likes actor_1_facebook_likes num_voted_users
## 1 2.9996817 -0.25891018 -0.4286856 5.1993129
## 2 2.6032748 -0.07234285 2.1103520 2.4458992
## 3 1.6783253 -0.25891018 0.2223497 1.1497411
## 4 2.3830487 7.03146504 1.2640061 6.9120222
## 5 0.9736018 -0.10150435 -0.4521229 0.7273312
## 6 2.0306870 -0.25891018 1.0686955 1.8609322
## facenumber_in_poster movie_facebook_likes gross profit
## 1 -0.6706420 1.1201293 10.1636906 2.28871906
## 2 -0.6706420 -0.4289656 3.7011456 0.01573722
## 3 -0.1838918 3.5611273 2.1348693 -0.22447022
## 4 -0.6706420 7.2695666 5.6885604 0.85014778
## 5 -0.1838918 0.6976488 0.3152278 -0.86871761
## 6 -0.6706420 -0.4289656 4.0897586 0.32136252
## return_on_investment_perc other_actors_facebook_likes critic_review_ratio
## 1 -0.02349024 -0.1628445 -0.5707955
## 2 -0.04028933 0.5716373 -0.5637571
## 3 -0.04194582 -0.3787043 -0.2103388
## 4 -0.03441708 7.5517440 -0.5080045
## 5 -0.04610850 -0.2726067 -0.1904226
## 6 -0.03818295 2.1421613 -0.6007345
# reattach the unscaled target column to the transformed predictors
score_db <- subset( IMDB, select = c(imdb_score))
dat <- cbind(transformed,score_db)
Multivariate linear regression is one of the most popular regression models. The output is continuous-valued and depends on one or more features; a linear function of all the features is fitted to predict the output, as sketched below.
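With the eleven standardized predictors $x_1,\dots,x_{11}$ (duration, director Facebook likes, lead-actor Facebook likes, number of voting users, faces in the poster, movie Facebook likes, gross, profit, return on investment, other actors' Facebook likes, and the critic-to-user review ratio), the model fitted by lm() below has the form

$$\widehat{\text{imdb\_score}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{11} x_{11},$$

with the coefficients $\beta_j$ estimated by minimizing the residual sum of squares on the training set; the estimates appear in the summary output further down.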
dat = data.frame(dat)
# hold out a random 25% of the rows as the test set; the rest is used for training
index <- 1:nrow(dat)
testindex <- sample(index, trunc(length(index)*1/4))
testset <- dat[testindex,]
trainset <- dat[-testindex,]
lm1 = lm(imdb_score ~., data = trainset)
# predict on the held-out rows (column 12 is imdb_score) and compute the test MSE
prediction = predict(lm1, testset[,-12])
mse <- mean((testset$imdb_score - prediction)^2)
print(mse)
## [1] 0.7704616
summary(lm1)
##
## Call:
## lm(formula = imdb_score ~ ., data = trainset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8012 -0.4828 0.1001 0.6171 2.4033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.45756 0.01654 390.477 < 2e-16 ***
## duration 0.23519 0.01748 13.457 < 2e-16 ***
## director_facebook_likes 0.03765 0.01777 2.118 0.0342 *
## actor_1_facebook_likes 0.01476 0.01663 0.888 0.3748
## num_voted_users 0.53882 0.02592 20.791 < 2e-16 ***
## facenumber_in_poster -0.07624 0.01637 -4.656 3.36e-06 ***
## movie_facebook_likes 0.02361 0.01885 1.253 0.2104
## gross -0.15976 0.02220 -7.196 7.88e-13 ***
## profit 0.01400 0.01600 0.875 0.3816
## return_on_investment_perc 0.01304 0.01438 0.907 0.3646
## other_actors_facebook_likes -0.03328 0.01878 -1.772 0.0764 .
## critic_review_ratio 0.07843 0.01551 5.058 4.50e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8928 on 2907 degrees of freedom
## Multiple R-squared: 0.2936, Adjusted R-squared: 0.2909
## F-statistic: 109.8 on 11 and 2907 DF, p-value: < 2.2e-16
A Support Vector Machine can also be used as a regression method, retaining the main ideas that characterize the algorithm. Support Vector Regression (SVR) follows the same principles as SVM classification, with a few differences: a tolerance margin epsilon is set around the regression function, and the goal is to find the flattest function that keeps most errors within that margin. In SVR the input is mapped onto an n-dimensional feature space using a non-linear mapping, and a linear model is then constructed in that space; the standard formulation is sketched below.
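For reference, the epsilon-insensitive formulation solved by e1071::svm() (an interface to libsvm) is

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \quad \text{s.t.} \quad y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i,\;\; \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*,\;\; \xi_i, \xi_i^* \ge 0,$$

where $\phi$ is the feature map of the chosen kernel. The print(svm1) output below confirms the defaults used here: a radial kernel, cost $C = 1$, $\varepsilon = 0.1$, and gamma equal to one over the number of predictors (1/11 ≈ 0.0909).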
svm1 = svm(imdb_score ~., data = trainset)
prediction1 = predict(svm1, testset[,-12])
mse <- mean((testset$imdb_score - prediction1)^2)
print(mse)
## [1] 0.5890036
print(svm1)
##
## Call:
## svm(formula = imdb_score ~ ., data = trainset)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2555
A Random Forest is an ensemble technique capable of performing both regression and classification tasks, using multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging. The basic idea is to combine many decision trees when determining the final output rather than relying on a single tree.
rfr <- randomForest(imdb_score~., data=trainset, mtry = 3, importance = TRUE, na.action = na.omit)
# display results
print(rfr)
##
## Call:
## randomForest(formula = imdb_score ~ ., data = trainset, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.5488715
## % Var explained: 51.15
plot(rfr)
This plot shows the error against the number of trees: the error drops steadily as more trees are added and averaged. The test-set error and the variable importance stored in the fit are examined below.
prediction2 = predict(rfr, testset[,-12])
mse1 <- mean((testset$imdb_score - prediction2)^2)
print(mse1)
## [1] 0.4954532
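Because the forest was fitted with importance = TRUE, the contribution of each predictor can be inspected directly with the randomForest accessors; a short sketch:
# permutation importance (%IncMSE) and node-purity importance for every predictor
importance(rfr)
# dot chart of the predictors ranked by importance
varImpPlot(rfr, main = "Random Forest variable importance")
This ranking is the basis for the important features discussed in the conclusion.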
library(rpart)
library(rpart.plot)
# fit a single regression tree on the full (unscaled) numeric data and plot it
fit3=rpart(imdb_score~., IMDB)
prp(fit3)
Decision rules read from the tree (variable names abbreviated as in the plot; see the sketch after the list for extracting them programmatically):
1. If (num_vote > 384000) then the predicted score is 8.
2. If (num_vote < 384000) and (gross > 34000000) then the predicted score is 7.5.
3. If (num_vote < 384000) and (gross < 34000000) then the predicted score is 6.9.
4. If (num_vote < 111000) and (duration >= 106) and (duration < 122) then the predicted score is 6.8.
5. If (num_vote < 111000) and (duration >= 122) then the predicted score is 6.4.
6. If (num_vote < 111000) and (duration < 106) and (other_ac < 185) then the predicted score is 6.7.
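A minimal sketch using rpart.plot, which is already loaded above:
# one row per leaf: the predicted score and the splits leading to it
rpart.rules(fit3, roundint = FALSE)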
2. Unsupervised Learning
K-means clustering partitions a given data set into k clusters by defining k centroids that act as cluster representatives: each point is assigned to its nearest centroid, the centroids are recomputed, and the process repeats until the assignments stop changing. The objective it minimizes is sketched below.
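Formally, for a chosen $k$ the algorithm seeks assignments of the points to clusters $C_1,\dots,C_k$ with centroids $\mu_1,\dots,\mu_k$ that minimize the total within-cluster sum of squares,

$$\sum_{j=1}^{k}\sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

which is the tot.withinss quantity plotted in the elbow curve further below.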
# build the clustering data: four numeric features plus gross discretized into 5 bins (arules::discretize)
clusterdata = cbind.data.frame(imdb_numeric$num_critic_for_reviews,imdb_numeric$duration,imdb_numeric$num_voted_users,imdb_numeric$num_user_for_reviews,discretize(imdb_numeric$gross,breaks = 5))
colnames(clusterdata) = c("num_critic_for_reviews","duration","num_voted_users","num_users_for_review","gross")
clusterdata=na.omit(clusterdata)
# rescale each numeric feature by dividing by its maximum
clusterdat <- data.frame(lapply(clusterdata[,1:4], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
set.seed(123)
# Compute and plot wss for k = 2 to k = 15.
k.max <- 15
dat2 <- clusterdat
wss <- sapply(1:k.max,
function(k){kmeans(dat2, k, nstart=50,iter.max = 15 )$tot.withinss})
## Warning: did not converge in 15 iterations
wss
## [1] 165.19936 88.56859 67.25311 56.42546 49.51705 44.04464 39.16862
## [8] 35.70770 33.38798 31.03961 28.84333 26.78012 25.50375 24.28070
## [15] 23.34072
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
Using the elbow method above, the total within-cluster sum of squares flattens noticeably after about 5 clusters, so we group the observations into 5 clusters.
imdbCluster <- kmeans(clusterdat, 5, nstart = 25)
table(imdbCluster$cluster,clusterdata$gross)
##
## [162,4.19e+06) [4.19e+06,1.8e+07) [1.8e+07,3.87e+07) [3.87e+07,7.96e+07)
## 1 44 67 103 134
## 2 572 447 336 276
## 3 0 7 8 20
## 4 6 6 17 23
## 5 156 251 315 324
##
## [7.96e+07,7.61e+08]
## 1 166
## 2 118
## 3 103
## 4 98
## 5 294
plot(imdbCluster$cluster)
clusplot(clusterdata, imdbCluster$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
plotcluster(clusterdata[,1:4], imdbCluster$cluster)
imdbCluster$size
## [1] 514 1749 138 150 1340
This project set out to determine which features influence the IMDb user rating and a movie's gross income. We were also interested in finding out which regression models perform better on this dataset. Since the regression models used here take numerical inputs, only numerical features were used for modelling, and the data were cleaned and normalized according to each objective.
From the results above, the test MSE of Random Forest regression (about 0.50) is the lowest, compared with linear regression (about 0.77) and SVR (about 0.59), for predicting the IMDb user rating. Comparing against other algorithms such as neural networks or gradient boosting might give better results still; a compact comparison table is sketched below.
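A compact way to put the three test errors side by side (a sketch; it assumes the three MSE values were stored in separate objects, say mse_lm, mse_svr and mse_rf, instead of being overwritten as above):
# hypothetical objects holding the test MSEs computed earlier
model_comparison <- data.frame(model = c("Linear regression", "SVR", "Random Forest"),
test_mse = c(mse_lm, mse_svr, mse_rf))
model_comparison[order(model_comparison$test_mse), ]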
Based on the Random Forest regression and the fitted tree, four features stand out in predicting imdb_score: duration, num_voted_users, gross and other_actors_facebook_likes. From these rules we can conclude that movies with many votes on the IMDb website tend to have a higher score, which makes sense because popular movies attract many fans who vote high scores for them.
Conversely, a movie with fewer votes can still score well if its duration is longer. It is somewhat surprising that movies that make less profit can score well, although movies that make more profit score acceptably too.
K-means clustering suggested that the data can be grouped into 5 clusters, although the majority of the points fall into two of them. The approach can nevertheless be used to choose a sensible number of categories when predicting a movie's gross income.
To predict the rating, we found that the numerical variables were the most correlated with the rating and hence the most useful predictors. The most influential features were the duration and the number of users who voted. Combining certain categorical attributes did not improve the accuracy of the model.
Out of all the regression models, the mean squared error values make it clear that the Random Forest regression model was the most accurate at predicting the IMDb user rating.
In predicting gross-income categories, K-means clustering showed a promising result, since the data were spread across all the categories within the 5 clusters. A different approach, such as a CART model, could be compared against the K-means result and might perform better; a sketch follows.
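As a rough illustration of that suggestion, a classification tree could be fitted directly to the discretized gross categories used for clustering; a sketch (it reuses the clusterdata frame built earlier, in which gross is already a five-level factor produced by discretize()):
# CART classifier for the gross-income bin, using the same four numeric features
cart_gross <- rpart(gross ~ num_critic_for_reviews + duration + num_voted_users + num_users_for_review,
data = clusterdata, method = "class")
# confusion table of predicted vs. actual bins on the training data
table(predicted = predict(cart_gross, type = "class"), actual = clusterdata$gross)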