Objective:

Investigate the 10,000 most-voted movies in the IMDb database to understand which attributes contribute to a movie's ranking, and then use that analysis to predict the rating of a movie given that it appears in the top 10,000.

Dataset

The dataset is a list of the top 10,000 movies on IMDb, ranked by number of votes. It is taken from https://www.kaggle.com/datasets/isaactaylorofficial/imdb-10000-most-voted-feature-films-041118. IMDb is an online database of information related to films, television series, home videos, video games, and streaming content. In this dataset we analyze only the most popular movies.

The dataset includes:

##   Rank                    Title Year Score Metascore                     Genre
## 1    1 The Shawshank Redemption 1994   9.3        80                     Drama
## 2    2          The Dark Knight 2008   9.0        84      Action, Crime, Drama
## 3    3                Inception 2010   8.8        74 Action, Adventure, Sci-Fi
## 4    4               Fight Club 1999   8.8        66                     Drama
## 5    5             Pulp Fiction 1994   8.9        94              Crime, Drama
## 6    6             Forrest Gump 1994   8.8        82            Drama, Romance
##      Vote          Director Runtime Revenue
## 1 2011509    Frank Darabont     142   28.34
## 2 1980200 Christopher Nolan     152  534.86
## 3 1760209 Christopher Nolan     148  292.58
## 4 1609459     David Fincher     139   37.03
## 5 1570194 Quentin Tarantino     154  107.93
## 6 1532024   Robert Zemeckis     142  330.25
##                                                                                                                                                                                                                                   Description
## 1                                                                                                                      Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
## 2 When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham. The Dark Knight must accept one of the greatest psychological and physical tests of his ability to fight injustice.
## 3                                                                                      A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a CEO.
## 4                                                                                                       An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.
## 5                                                                                                   The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.
## 6                                                                                           The presidencies of Kennedy and Johnson, Vietnam, Watergate, and other history unfold through the perspective of an Alabama man with an IQ of 75.
summary(imdb)
##       Rank          Title                Year          Score      
##  Min.   :    1   Length:10000       Min.   :1915   Min.   :1.300  
##  1st Qu.: 2501   Class :character   1st Qu.:1991   1st Qu.:6.000  
##  Median : 5000   Mode  :character   Median :2004   Median :6.700  
##  Mean   : 5000                      Mean   :1998   Mean   :6.628  
##  3rd Qu.: 7500                      3rd Qu.:2011   3rd Qu.:7.400  
##  Max.   :10000                      Max.   :2018   Max.   :9.600  
##                                                                   
##    Metascore        Genre                Vote           Director        
##  Min.   :10.00   Length:10000       Min.   :   6015   Length:10000      
##  1st Qu.:44.00   Class :character   1st Qu.:  10147   Class :character  
##  Median :57.00   Mode  :character   Median :  21172   Mode  :character  
##  Mean   :56.53                      Mean   :  64488                     
##  3rd Qu.:70.00                      3rd Qu.:  62052                     
##  Max.   :99.00                      Max.   :2011509                     
##  NA's   :3219                                                           
##     Runtime         Revenue       Description       
##  Min.   : 45.0   Min.   :  0.00   Length:10000      
##  1st Qu.: 94.0   1st Qu.:  1.89   Class :character  
##  Median :105.0   Median : 15.09   Mode  :character  
##  Mean   :108.7   Mean   : 36.26                     
##  3rd Qu.:118.0   3rd Qu.: 43.86                     
##  Max.   :450.0   Max.   :936.66                     
##                  NA's   :2527

The summary shows that some variables are incomplete, so we visualize the missing values to get a clear picture of how complete the dataset is.

Visualizing the missing values

vis_miss(imdb)

9 out of 11 variables are complete, with no NA values. The “Metascore” variable has 3219 missing values, which is over 32% of all observations. The “Revenue” variable is similar, with 2527 missing values (about 25%).
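The same counts can be obtained numerically rather than read off the plot. The following is a minimal sketch in base R; `imdb_toy` is a hypothetical stand-in with made-up values, not the real `imdb` data frame.

```r
# Toy stand-in for the imdb data frame (values are made up for illustration)
imdb_toy <- data.frame(
  Title     = c("A", "B", "C", "D"),
  Score     = c(7.1, 6.4, 8.0, 5.9),
  Metascore = c(70, NA, 85, NA),
  Revenue   = c(12.5, 3.1, NA, 40.2),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_count <- colSums(is.na(imdb_toy))

# Share of missing values per column, in percent
na_pct <- round(100 * colMeans(is.na(imdb_toy)), 1)

print(data.frame(na_count, na_pct))
```

Applied to the real data frame, the same two calls should reproduce the NA counts reported by `summary(imdb)`.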

A “VoteMln” variable is created to make the charts easier to read: it expresses the number of votes in millions rather than as large raw counts.
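A derived column of this kind could be built as in the sketch below. The two vote counts are the maximum and minimum from the summary above; the data frame name `imdb_toy` and the second title are hypothetical.

```r
# Two rows using the max and min vote counts from the summary
imdb_toy <- data.frame(
  Title = c("The Shawshank Redemption", "Some other film"),
  Vote  = c(2011509, 6015),
  stringsAsFactors = FALSE
)

# Express the vote count in millions for more readable chart axes
imdb_toy$VoteMln <- imdb_toy$Vote / 1e6

print(imdb_toy)
```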

Exploring the Variables

The most popular movie is The Shawshank Redemption (2.01 million votes). It is worth noting that the most popular movie is also the top-ranked movie in the list. We can check whether the most popular movies are also the best-rated ones by visualizing the user ratings and looking at the most highly rated movies.

The movie best rated by the audience is Aloko Udapadi. It does not appear in the ranking of the best rated movies, because that ranking also takes the number of votes into account (this movie has only 6.5k). The worst-rated of the 10,000 most popular films is Cumali Ceber, which is relatively popular in terms of votes (over 36k).
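Lookups like these reduce to finding the row with the highest or lowest value of a column. A minimal sketch with a hypothetical three-row data frame (titles and values are illustrative only):

```r
# Hypothetical rows: a high score with few votes, a low score, a popular film
imdb_toy <- data.frame(
  Title = c("Film A", "Film B", "Film C"),
  Score = c(9.6, 1.3, 8.8),
  Vote  = c(6500, 36000, 1500000),
  stringsAsFactors = FALSE
)

best       <- imdb_toy[which.max(imdb_toy$Score), ]  # highest user rating
worst      <- imdb_toy[which.min(imdb_toy$Score), ]  # lowest user rating
most_voted <- imdb_toy[which.max(imdb_toy$Vote), ]   # most popular by votes

print(rbind(best, worst, most_voted))
```

Note that `which.max` and `which.min` return only the first match, so ties would need separate handling.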

Looking at the most popular directors

The most popular director is Steven Spielberg, whose films have been rated 10.35 million times on IMDb. Christopher Nolan follows close behind with 10.22 million votes. Notably, none of Spielberg’s movies is among the top 10 most popular films, while two of Nolan’s are. The director of the most popular movie (The Shawshank Redemption) is not in the top 10 list of directors.
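Director popularity here is the sum of votes over all of a director's films. One way to compute it in base R is sketched below; the directors and vote counts are hypothetical and chosen so that the director with the most total votes does not own the single most-voted film.

```r
# Hypothetical films: Director Y has the single most-voted film,
# but Director X has more votes in total
imdb_toy <- data.frame(
  Director = c("Director X", "Director Y", "Director X",
               "Director Z", "Director Y"),
  Vote     = c(500000, 900000, 450000, 300000, 20000),
  stringsAsFactors = FALSE
)

# Sum the votes per director and sort in descending order
votes_by_director <- aggregate(Vote ~ Director, data = imdb_toy, FUN = sum)
votes_by_director <- votes_by_director[order(-votes_by_director$Vote), ]
votes_by_director$VoteMln <- votes_by_director$Vote / 1e6

# Keep the ten most popular directors (only three exist in this toy data)
print(head(votes_by_director, 10))
```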

Visualizing the Distribution of All Variables

## Number of Votes received
##      Votes                     Values
## [1,] "Over million votes"      "27"  
## [2,] "Over 500 thousand votes" "174" 
## [3,] "Over 100 thousand votes" "1617"
## [4,] "Over 10 thousand votes"  "7547"
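The counts in this table appear to be cumulative: 7547 movies above 10 thousand votes matches the first quartile of Vote (10147) in the summary. Under that assumption, such a table could be produced as sketched below; the vote vector is a small made-up sample.

```r
# Made-up vote counts for illustration
votes <- c(2011509, 750000, 120000, 45000, 15000, 8000, 6015)

# Thresholds, labelled in the style of the table above
thresholds <- c(
  "Over 1 million votes"    = 1e6,
  "Over 500 thousand votes" = 5e5,
  "Over 100 thousand votes" = 1e5,
  "Over 10 thousand votes"  = 1e4
)

# Cumulative counts: each row counts every movie above its threshold
counts <- sapply(thresholds, function(t) sum(votes > t))

print(cbind(Votes = names(counts), Values = counts))
```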

Analysis from the distribution of all variables:

How do user ratings and critic ratings differ?

The scatter plot shows IMDb user ratings against Metascore critic ratings. The fitted linear-model line shows that the relationship between the two is approximately linear.
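The strength of this relationship can also be summarized with a correlation coefficient and a simple linear fit. The sketch below uses synthetic data, with made-up coefficients and noise level, and some Metascore values set to NA to mimic the incomplete column.

```r
set.seed(42)

# Synthetic ratings: Score depends linearly on Metascore plus noise
n <- 200
Metascore <- round(runif(n, 10, 99))
Score <- 4.5 + 0.035 * Metascore + rnorm(n, sd = 0.6)
Metascore[sample(n, 30)] <- NA   # mimic missing critic scores
ratings <- data.frame(Score, Metascore)

# Pearson correlation on the rows where both values are present
r <- cor(ratings$Score, ratings$Metascore, use = "complete.obs")

# Simple linear regression; lm() drops rows with NA by default
fit <- lm(Score ~ Metascore, data = ratings)

print(r)
print(coef(fit))
```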

From the distribution graph we can see that the distribution of IMDb Score is more peaked and less asymmetric (right-sided asymmetry). Critics’ scores from Metascore are more spread out, while IMDb user ratings cluster close to the average.

We can fit a polynomial model to see whether it fits any better.
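One way to make this comparison concrete is to fit a linear and a quadratic model to the same data and compare them with adjusted R-squared and a nested-model F-test. The sketch below uses synthetic data with made-up coefficients; it is not the fit from this analysis.

```r
set.seed(42)

# Synthetic data with a mild quadratic component
n <- 200
Metascore <- runif(n, 10, 99)
Score <- 3.4 + 0.057 * Metascore - 0.0002 * Metascore^2 + rnorm(n, sd = 0.6)

fit_lin  <- lm(Score ~ Metascore)            # straight line
fit_quad <- lm(Score ~ poly(Metascore, 2))   # second-degree polynomial

# Adjusted R-squared penalizes the extra parameter
adj <- c(linear    = summary(fit_lin)$adj.r.squared,
         quadratic = summary(fit_quad)$adj.r.squared)
print(adj)

# F-test for whether the quadratic term improves the fit
print(anova(fit_lin, fit_quad))
```

Plain R-squared can never decrease when a term is added, which is why the adjusted value and the F-test are the more informative comparison.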

Analyzing further correlations between the variables

  • There are large positive correlations between Metascore and IMDb user ratings, and between the number of votes and revenue.

  • There are large negative correlations between Rank and the number of votes, and likewise between Rank and revenue. This is because ranks start at 1 and a lower rank number means a better position.
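A correlation matrix of this kind can be computed over the numeric columns only, using pairwise complete observations so that the missing values do not discard whole rows. The sketch below uses synthetic data in which Rank is assigned by descending vote count, as in the dataset; all values are made up.

```r
set.seed(1)

# Synthetic data: rank 1 goes to the film with the most votes
n <- 100
Vote <- sort(round(rlnorm(n, meanlog = 10, sdlog = 1)), decreasing = TRUE)
toy <- data.frame(
  Rank    = seq_len(n),
  Vote    = Vote,
  Revenue = pmax(0, Vote / 1000 + rnorm(n, sd = 5)),
  Title   = paste("Film", seq_len(n)),
  stringsAsFactors = FALSE
)
toy$Revenue[sample(n, 20)] <- NA   # mimic the incomplete Revenue column

# Keep numeric columns and correlate them pairwise, ignoring NA pairs
num_cols <- sapply(toy, is.numeric)
cm <- cor(toy[, num_cols], use = "pairwise.complete.obs")

print(round(cm, 2))
```

The negative sign between Rank and Vote follows directly from how the rank is assigned.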

How have ratings affected popularity?

What impact does runtime have on revenue and popularity?

How do important predictors such as popularity and ratings affect the revenue of movies?

Prediction of Score (User Rating) for a movie in the top 10000

We will attempt to predict the rating a movie receives given that it is in the list of the 10,000 most popular movies. We use a simple multiple linear regression model for the prediction.

Using Multiple Linear Regression

We create a model using only the numeric variables.

##                   train test
## Number of rows     4900 1225
## Number of columns     7    7
## 
## Call:
## lm(formula = Score ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4388 -0.3155  0.0331  0.3739  2.1806 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept) 15.04677370775  1.47921307287  10.172 < 0.0000000000000002 ***
## Rank        -0.00004156969  0.00000406916 -10.216 < 0.0000000000000002 ***
## Year        -0.00555475651  0.00073467177  -7.561   0.0000000000000475 ***
## Metascore    0.03403416813  0.00050829209  66.958 < 0.0000000000000002 ***
## Vote         0.00000139992  0.00000008638  16.207 < 0.0000000000000002 ***
## Runtime      0.00765446371  0.00048240900  15.867 < 0.0000000000000002 ***
## Revenue     -0.00202792176  0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared:  0.6232, Adjusted R-squared:  0.6228 
## F-statistic:  1349 on 6 and 4893 DF,  p-value: < 0.00000000000000022
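The row counts above (4900 + 1225 = 6125, with 4900 / 6125 = 0.8) suggest an 80/20 split of the rows that have no missing values; this is an inference from the output, not something the source states. Under that assumption, the split and fit could be sketched as follows on synthetic data with the same seven numeric columns.

```r
set.seed(123)

# Synthetic stand-in with the seven numeric columns used in the model
n <- 500
toy <- data.frame(
  Rank      = seq_len(n),
  Year      = sample(1915:2018, n, replace = TRUE),
  Metascore = round(runif(n, 10, 99)),
  Vote      = round(rlnorm(n, 10, 1)),
  Runtime   = round(runif(n, 45, 200)),
  Revenue   = round(runif(n, 0, 900), 2)
)
toy$Score <- 4.5 + 0.035 * toy$Metascore + rnorm(n, sd = 0.6)
toy$Metascore[sample(n, 100)] <- NA   # mimic missing values

# Keep only rows without missing values, then split 80/20
complete <- toy[complete.cases(toy), ]
idx   <- sample(nrow(complete), size = floor(0.8 * nrow(complete)))
train <- complete[idx, ]
test  <- complete[-idx, ]

# Regress Score on all remaining columns
model1 <- lm(Score ~ ., data = train)
print(summary(model1))
```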

Using Genre (qualitative) in the model

We add a categorical variable, Genre, to the model to evaluate whether the additional parameters give a better R-squared.

##                   train test
## Number of rows    12423 3106
## Number of columns     8    8
## 
## Call:
## lm(formula = Score ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4388 -0.3155  0.0331  0.3739  2.1806 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept) 15.04677370775  1.47921307287  10.172 < 0.0000000000000002 ***
## Rank        -0.00004156969  0.00000406916 -10.216 < 0.0000000000000002 ***
## Year        -0.00555475651  0.00073467177  -7.561   0.0000000000000475 ***
## Metascore    0.03403416813  0.00050829209  66.958 < 0.0000000000000002 ***
## Vote         0.00000139992  0.00000008638  16.207 < 0.0000000000000002 ***
## Runtime      0.00765446371  0.00048240900  15.867 < 0.0000000000000002 ***
## Revenue     -0.00202792176  0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared:  0.6232, Adjusted R-squared:  0.6228 
## F-statistic:  1349 on 6 and 4893 DF,  p-value: < 0.00000000000000022

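The summary printed for this model is identical to that of the numeric-only model (the same six coefficients, 4893 residual degrees of freedom, and an R-squared of 0.6232), so adding Genre brings no improvement and we leave the qualitative variables out of the model. Next, we look at how the first model performs when polynomial terms are added.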
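The train/test table for the Genre model reports 12423 + 3106 = 15529 rows, more than the 10,000 movies in the dataset. One possible explanation, stated here as an assumption rather than a description of the original code, is that each movie was repeated once per listed genre. A sketch of that expansion, using three rows from the head of the dataset:

```r
# Three rows taken from the head of the dataset shown earlier
imdb_toy <- data.frame(
  Title = c("The Shawshank Redemption", "The Dark Knight", "Inception"),
  Score = c(9.3, 9.0, 8.8),
  Genre = c("Drama", "Action, Crime, Drama", "Action, Adventure, Sci-Fi"),
  stringsAsFactors = FALSE
)

# Split the comma-separated genre string into individual genres
genres <- strsplit(imdb_toy$Genre, ",\\s*")

# Repeat each movie once per genre and store the genre as a factor
imdb_long <- imdb_toy[rep(seq_len(nrow(imdb_toy)), lengths(genres)), ]
imdb_long$Genre <- factor(unlist(genres))
rownames(imdb_long) <- NULL

print(imdb_long)
```

With Genre stored as a factor, `lm(Score ~ ., data = ...)` would create one dummy coefficient per genre level beyond the first, so genre terms would be expected to appear in the coefficient table.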

Using polynomial terms in the model

## 
## Call:
## lm(formula = Score ~ Rank + Revenue + Metascore + Runtime + I(Rank^2) + 
##     I(Revenue^2) + I(Metascore^2) + I(Runtime^2) + I(Vote^2), 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5496 -0.3239  0.0242  0.3625  2.2757 
## 
## Coefficients:
##                            Estimate           Std. Error t value
## (Intercept)     3.35855807147775520  0.14010206649037127  23.972
## Rank           -0.00022164228231170  0.00001422072038590 -15.586
## Revenue        -0.00403836367820153  0.00029307147247807 -13.779
## Metascore       0.05739091125029285  0.00253452703948769  22.644
## Runtime         0.01586429871288007  0.00188979917829083   8.395
## I(Rank^2)       0.00000001478640119  0.00000000137325081  10.767
## I(Revenue^2)    0.00000522397763860  0.00000061948060846   8.433
## I(Metascore^2) -0.00020948316261189  0.00002235917894638  -9.369
## I(Runtime^2)   -0.00003162943124259  0.00000712760142695  -4.438
## I(Vote^2)       0.00000000000071684  0.00000000000007281   9.845
##                            Pr(>|t|)    
## (Intercept)    < 0.0000000000000002 ***
## Rank           < 0.0000000000000002 ***
## Revenue        < 0.0000000000000002 ***
## Metascore      < 0.0000000000000002 ***
## Runtime        < 0.0000000000000002 ***
## I(Rank^2)      < 0.0000000000000002 ***
## I(Revenue^2)   < 0.0000000000000002 ***
## I(Metascore^2) < 0.0000000000000002 ***
## I(Runtime^2)              0.0000093 ***
## I(Vote^2)      < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5882 on 4890 degrees of freedom
## Multiple R-squared:  0.6309, Adjusted R-squared:  0.6302 
## F-statistic: 928.7 on 9 and 4890 DF,  p-value: < 0.00000000000000022

Here too the performance is similar (adjusted R-squared of 0.6302 versus 0.6228), so we use the simple multiple linear regression model for prediction. Using the first model, we predict the rating, measure the goodness of fit, and plot the predicted values against the actual values.

##        pred1 test.Score
## 1  10.538695        9.3
## 5  10.327653        8.9
## 6   9.323065        8.8
## 9   9.896258        8.9
## 12  9.374097        8.7
## 16  9.182940        8.4
## R-squared: 0.6232221
## RMSE Test: 0.5617891
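RMSE is the square root of the mean squared difference between predicted and actual scores. The sketch below computes it for the six rows printed above only, so the result differs from the RMSE reported for the full test set. It also shows one optional way, not part of the original analysis, to clip predictions to IMDb's 1 to 10 rating scale, since a linear model is unbounded and two of these predictions exceed 10.

```r
# The six predicted and actual scores printed above
pred1  <- c(10.538695, 10.327653, 9.323065, 9.896258, 9.374097, 9.182940)
actual <- c(9.3, 8.9, 8.8, 8.9, 8.7, 8.4)

# Root mean squared error over these six rows only
rmse <- sqrt(mean((pred1 - actual)^2))
print(rmse)

# Optional: clip predictions to the valid 1-10 rating range
pred_clipped <- pmin(pmax(pred1, 1), 10)
print(pred_clipped)
```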

Plot predicted vs observed values

Using Random Forest

## Ranger result
## 
## Call:
##  ranger(Score ~ ., data = train, num.trees = 100, mtry = 6, min.node.size = 1,      replace = T) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      4900 
## Number of independent variables:  6 
## Mtry:                             6 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.3263479 
## R squared (OOB):                  0.6511699
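A fit with the same arguments as the call shown above could be reproduced along these lines. The data is synthetic, and the sketch only runs the forest if the ranger package is installed.

```r
set.seed(123)

# Synthetic stand-in: six predictors plus the Score response
n <- 300
toy <- data.frame(
  Rank      = seq_len(n),
  Year      = sample(1915:2018, n, replace = TRUE),
  Metascore = round(runif(n, 10, 99)),
  Vote      = round(rlnorm(n, 10, 1)),
  Runtime   = round(runif(n, 45, 200)),
  Revenue   = round(runif(n, 0, 900), 2)
)
toy$Score <- 4.5 + 0.035 * toy$Metascore + rnorm(n, sd = 0.6)

# 80/20 train/test split
idx   <- sample(n, size = 0.8 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

rf <- NULL
if (requireNamespace("ranger", quietly = TRUE)) {
  # Same arguments as the call in the output above
  rf <- ranger::ranger(Score ~ ., data = train, num.trees = 100, mtry = 6,
                       min.node.size = 1, replace = TRUE)

  # Out-of-bag MSE and predictions on the held-out rows
  print(rf$prediction.error)
  pred_test <- predict(rf, data = test)$predictions
  print(head(data.frame(pred_test, test.Score = test$Score)))
}
```

With `mtry = 6` and six predictors, every split considers all variables, which makes this configuration equivalent to bagged regression trees rather than a forest with random feature subsets.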

Predicted Scores

##   pred_test test.Score
## 1      8.77        9.3
## 2      8.77        8.9
## 3      8.72        8.8
## 4      8.75        8.9
## 5      8.67        8.7
## 6      8.57        8.4

Impact

Conclusion

  • The variable with by far the most influence on the user rating in this algorithm is Metascore. The other variables have an impact similar to that of the number of votes. Audience ratings are quite close to critic ratings, although critics rate more severely than the public.

  • It’s also important to look at the coefficients associated with each feature. All variables in the model have a statistically significant effect on user ratings. Beyond the variables considered here, a movie’s rating can depend on further factors such as the time of year it is released, the actors, the production, and so on. A larger model could determine the ratings and popularity of a movie more accurately.

  • The movie business is a high-demand, high-cost business, and analyses such as this one on a larger scale can help identify the factors that make a movie successful.