Exploring IMDB top 10000 movies with visualization

Objective:

Investigate the top 10,000 movies in the IMDb database to understand which attributes contribute to a movie's ranking, and use that analysis to predict the rating of a movie given its presence in the top 10,000.

Dataset

The dataset is a list of the 10,000 most-voted feature films on IMDb, ranked by popularity (number of votes). It is taken from https://www.kaggle.com/datasets/isaactaylorofficial/imdb-10000-most-voted-feature-films-041118. IMDb is an online database of information related to films, television series, home videos, video games, and streaming content. In this dataset we analyze only the most popular movies.

The dataset includes:

Movie title
Genre of the film
Director of the film
Duration of the film (in minutes)
Release year of the film
Number of votes (raw count)
Score: public rating (score out of 10)
Metascore: critics' rating (score out of 100)
Movie revenue (millions of dollars)

##   Rank                    Title Year Score Metascore                     Genre
## 1    1 The Shawshank Redemption 1994   9.3        80                     Drama
## 2    2          The Dark Knight 2008   9.0        84      Action, Crime, Drama
## 3    3                Inception 2010   8.8        74 Action, Adventure, Sci-Fi
## 4    4               Fight Club 1999   8.8        66                     Drama
## 5    5             Pulp Fiction 1994   8.9        94              Crime, Drama
## 6    6             Forrest Gump 1994   8.8        82            Drama, Romance
##      Vote          Director Runtime Revenue
## 1 2011509    Frank Darabont     142   28.34
## 2 1980200 Christopher Nolan     152  534.86
## 3 1760209 Christopher Nolan     148  292.58
## 4 1609459     David Fincher     139   37.03
## 5 1570194 Quentin Tarantino     154  107.93
## 6 1532024   Robert Zemeckis     142  330.25
##                                                                                                                                                                                                                                   Description
## 1                                                                                                                      Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
## 2 When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham. The Dark Knight must accept one of the greatest psychological and physical tests of his ability to fight injustice.
## 3                                                                                      A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a CEO.
## 4                                                                                                       An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.
## 5                                                                                                   The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.
## 6                                                                                           The presidencies of Kennedy and Johnson, Vietnam, Watergate, and other history unfold through the perspective of an Alabama man with an IQ of 75.

summary(imdb)

##       Rank          Title                Year          Score      
##  Min.   :    1   Length:10000       Min.   :1915   Min.   :1.300  
##  1st Qu.: 2501   Class :character   1st Qu.:1991   1st Qu.:6.000  
##  Median : 5000   Mode  :character   Median :2004   Median :6.700  
##  Mean   : 5000                      Mean   :1998   Mean   :6.628  
##  3rd Qu.: 7500                      3rd Qu.:2011   3rd Qu.:7.400  
##  Max.   :10000                      Max.   :2018   Max.   :9.600  
##                                                                   
##    Metascore        Genre                Vote           Director        
##  Min.   :10.00   Length:10000       Min.   :   6015   Length:10000      
##  1st Qu.:44.00   Class :character   1st Qu.:  10147   Class :character  
##  Median :57.00   Mode  :character   Median :  21172   Mode  :character  
##  Mean   :56.53                      Mean   :  64488                     
##  3rd Qu.:70.00                      3rd Qu.:  62052                     
##  Max.   :99.00                      Max.   :2011509                     
##  NA's   :3219                                                           
##     Runtime         Revenue       Description       
##  Min.   : 45.0   Min.   :  0.00   Length:10000      
##  1st Qu.: 94.0   1st Qu.:  1.89   Class :character  
##  Median :105.0   Median : 15.09   Mode  :character  
##  Mean   :108.7   Mean   : 36.26                     
##  3rd Qu.:118.0   3rd Qu.: 43.86                     
##  Max.   :450.0   Max.   :936.66                     
##                  NA's   :2527

The summary shows that some variables are incomplete, so we visualize the missing values to get a clear picture of how complete the dataset is.

Visualizing the missing values

vis_miss(imdb)

9 of the 11 variables are complete: they contain no NA values. The “Metascore” variable has 3219 missing values, which is over 32% of all observations. The “Revenue” variable is similar, with 2527 missing values (about 25%).
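The missing-value counts above can also be obtained without a plot. The following is a minimal base-R sketch, assuming a data frame with the same column names as the report; the `toy` data frame and its NA pattern are illustrative only and are not taken from the dataset.

```r
# Toy data frame (illustrative values, NOT the real dataset).
toy <- data.frame(
  Score     = c(9.3, 9.0, 8.8, 8.8),
  Metascore = c(80, NA, 74, NA),
  Revenue   = c(28.34, 534.86, NA, 37.03)
)

# Count and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_pct   <- 100 * colMeans(is.na(toy))
print(data.frame(missing = na_count, percent = na_pct))

# The same arithmetic applied to the counts reported by summary(imdb).
n_rows       <- 10000
reported_pct <- 100 * c(Metascore = 3219, Revenue = 2527) / n_rows
print(reported_pct)
```

On the real data, the same two calls would be applied to `imdb` instead of `toy`.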

A “VoteMln” variable is created to make the charts more readable by expressing the number of votes in millions instead of as large raw counts.
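A sketch of how such a column might be derived, assuming `Vote` holds raw vote counts as in the listing above. Only the first three rows of the dataset are used here; the name `imdb_head` is a placeholder.

```r
# First three rows from the listing above (Title and Vote only).
imdb_head <- data.frame(
  Title = c("The Shawshank Redemption", "The Dark Knight", "Inception"),
  Vote  = c(2011509, 1980200, 1760209),
  stringsAsFactors = FALSE
)

# Express the raw vote counts in millions for more readable chart axes.
imdb_head$VoteMln <- imdb_head$Vote / 1e6
print(imdb_head)
```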

Exploring the Variables

The most popular movie is The Shawshank Redemption (2.01 million votes). It is worth noting that the most popular movie is also the top-ranked movie in the list. Next we check whether the most popular movies are also the ones rated best by users, and visualize the user ratings to find the most highly rated movies.

The movie rated highest by the audience is Aloko Udapadi. It does not appear in rankings of the best-rated movies because those rankings also take the number of votes into account (this movie has only 6.5k). The worst-rated movie among the 10,000 most popular films is Cumali Ceber, which is relatively popular in terms of votes (over 36k).
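Lookups like these can be sketched with `which.max` and `which.min`. The example below uses only the six rows shown in the listing above (the data frame name `six` is a placeholder), so its results describe those six films rather than the full dataset.

```r
# The six rows shown in the listing above.
six <- data.frame(
  Title = c("The Shawshank Redemption", "The Dark Knight", "Inception",
            "Fight Club", "Pulp Fiction", "Forrest Gump"),
  Score = c(9.3, 9.0, 8.8, 8.8, 8.9, 8.8),
  Vote  = c(2011509, 1980200, 1760209, 1609459, 1570194, 1532024),
  stringsAsFactors = FALSE
)

# Row with the most votes, and rows with the highest and lowest user score.
# which.max() and which.min() return the FIRST matching row when there are ties.
most_voted  <- six[which.max(six$Vote), ]
best_rated  <- six[which.max(six$Score), ]
worst_rated <- six[which.min(six$Score), ]

print(most_voted)
print(best_rated)
print(worst_rated)
```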

Looking at the most popular directors

The most popular director is Steven Spielberg. His films have been rated 10.35 million times on IMDb. Just behind him, by a very small margin, is Christopher Nolan (10.22 million votes). Notably, none of Spielberg's movies is among the 10 most popular films, while two of Nolan's are. The director of the most popular movie (The Shawshank Redemption) is not in the top 10 list of directors.
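Director popularity here is the sum of votes over all of a director's films. A minimal sketch of that aggregation, using only the six rows from the listing above (so the totals are far smaller than the full-dataset figures quoted in the text):

```r
# Director and Vote for the six rows shown in the listing above.
six <- data.frame(
  Director = c("Frank Darabont", "Christopher Nolan", "Christopher Nolan",
               "David Fincher", "Quentin Tarantino", "Robert Zemeckis"),
  Vote     = c(2011509, 1980200, 1760209, 1609459, 1570194, 1532024),
  stringsAsFactors = FALSE
)

# Sum the votes per director, then sort from most to least voted.
votes_by_director <- aggregate(Vote ~ Director, data = six, FUN = sum)
votes_by_director <- votes_by_director[order(-votes_by_director$Vote), ]
print(votes_by_director)
```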

What genre is most popular?

One interesting finding is that although the highest-grossing genre is Drama and the most popular is Action, Film-Noir has the highest average rating. A likely reason is that this genre contains fewer movies, which on average receive good reviews.
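Because a film can list several genres in one string, a per-genre comparison needs one row per film-genre pair. The sketch below shows one way to do that in base R on the six rows from the listing above; it is an assumed approach, not necessarily the one used to produce the charts.

```r
# Score and Genre for the six rows shown in the listing above.
six <- data.frame(
  Score = c(9.3, 9.0, 8.8, 8.8, 8.9, 8.8),
  Genre = c("Drama", "Action, Crime, Drama", "Action, Adventure, Sci-Fi",
            "Drama", "Crime, Drama", "Drama, Romance"),
  stringsAsFactors = FALSE
)

# Split each comma-separated genre string into its individual genres.
genre_list <- strsplit(six$Genre, ",\\s*")

# Build a long table: each film is repeated once per genre it belongs to.
long <- data.frame(
  Score = rep(six$Score, times = lengths(genre_list)),
  Genre = unlist(genre_list),
  stringsAsFactors = FALSE
)

# Mean user score per genre.
mean_by_genre <- aggregate(Score ~ Genre, data = long, FUN = mean)
print(mean_by_genre)
```

A long table of this kind has more rows than there are films, which may be why the train and test sets of the Genre model later in the report are larger than those of the numeric-only model.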

Comparing the Runtime of the movies

The longest movie lasts 450 minutes (seven and a half hours). The shortest movie is only 45 minutes. Neither of these movies is very popular.

Visualizing the Distribution of All Variables

## Number of Votes received

##      Votes                     Values
## [1,] "Over milion votes"       "27"  
## [2,] "Over 500 thousand votes" "174" 
## [3,] "Over 100 thousand votes" "1617"
## [4,] "Over 10 thousand votes"  "7547"
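A table like this one can be built by counting the films above each threshold. The sketch below assumes that approach. Its `votes` vector combines the six vote counts from the first listing with the quartile values from `summary(imdb)`; it is a hand-picked illustration, not a sample of the dataset, so the counts it prints differ from the table above.

```r
# Illustrative vote counts: six rows from the first listing plus the
# Min / 1st Qu. / Median / 3rd Qu. values reported by summary(imdb).
votes <- c(2011509, 1980200, 1760209, 1609459, 1570194, 1532024,
           6015, 10147, 21172, 62052)

# Thresholds to count against, named for display.
thresholds <- c("Over 1 million votes"    = 1e6,
                "Over 500 thousand votes" = 5e5,
                "Over 100 thousand votes" = 1e5,
                "Over 10 thousand votes"  = 1e4)

# Number of films whose vote count exceeds each threshold.
counts <- sapply(thresholds, function(t) sum(votes > t))
print(cbind(Votes = names(counts), Values = unname(counts)))
```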

Analysis from the distribution of all variables:

- The mean release year of films in the database is 1998. Because the distribution is strongly skewed toward recent years (left-side asymmetry), the median is a better measure of the typical film, and it puts the typical release year at 2004. Old movies are rare among the 10,000 most popular films.
- The number of votes shows a very strong right-side asymmetry. The vast majority of movies have fewer votes than the mean. 27 movies have over one million votes. Most movies in this database have over 10,000 votes.
- Movies in the database run roughly 110 minutes on average. Most films last between 75 and 135 minutes.
- Revenue shows a strong right-side asymmetry. Most movies earned little revenue and only a few earned a great deal. The mean revenue per movie was over $36 million.
- User feedback is one of the most important things on the IMDb site. The mean user rating is 6.63. The distribution is roughly symmetrical, and most user ratings fall between 6 and 8.
- The mean Metascore given by critics to movies in the database is 56.5 points. The distribution is also symmetrical, but extreme values occur relatively more often than for Score.

How do the user ratings and critics' ratings differ?

The scatter plot shows IMDb user ratings against Metascore critic ratings. The line fitted by the linear model shows that the relationship is approximately linear.

From the distribution graph we can see that the distribution of IMDb Score is more peaked and less asymmetric (right-sided asymmetry). Critics' scores from Metascore are more spread out, while IMDb user ratings cluster close to the average.

We can fit a polynomial model to see whether it fits any better.
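One way to make that comparison is to fit both models and compare their R-squared values, or to run an F-test on the nested models. The sketch below does this on the six rows from the first listing only. Six top-ranked films are far too few and too unrepresentative to draw conclusions from, so it illustrates the mechanics rather than the result.

```r
# Metascore and Score for the six rows shown in the first listing.
six <- data.frame(
  Metascore = c(80, 84, 74, 66, 94, 82),
  Score     = c(9.3, 9.0, 8.8, 8.8, 8.9, 8.8)
)

# Straight-line fit and quadratic fit of Score on Metascore.
fit_linear <- lm(Score ~ Metascore, data = six)
fit_poly   <- lm(Score ~ poly(Metascore, 2), data = six)

# Plain and adjusted R-squared for both fits.
fit_stats <- rbind(
  r2     = c(linear    = summary(fit_linear)$r.squared,
             quadratic = summary(fit_poly)$r.squared),
  adj_r2 = c(linear    = summary(fit_linear)$adj.r.squared,
             quadratic = summary(fit_poly)$adj.r.squared)
)
print(fit_stats)

# F-test for whether the quadratic term improves on the linear model.
print(anova(fit_linear, fit_poly))
```

Plain R-squared can never decrease when a term is added, so adjusted R-squared or the F-test is the fairer basis for deciding whether the polynomial term is worth keeping.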

Analyzing more correlation between the variables

We can see a large positive correlation between Metascore and IMDb user ratings, and between the number of votes and revenue.
Large negative correlations can be seen between Rank and the number of votes, and likewise between Rank and revenue. This is because ranks start at 1 and a lower rank number is better.
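A correlation matrix of the numeric columns can be computed with `cor`. The sketch below uses the six rows from the first listing, so the coefficients it prints are not the ones in the plot; `use = "pairwise.complete.obs"` is included on the assumption that the missing Metascore and Revenue values in the full data need to be skipped pair by pair.

```r
# Numeric columns for the six rows shown in the first listing.
six <- data.frame(
  Rank      = 1:6,
  Score     = c(9.3, 9.0, 8.8, 8.8, 8.9, 8.8),
  Metascore = c(80, 84, 74, 66, 94, 82),
  Vote      = c(2011509, 1980200, 1760209, 1609459, 1570194, 1532024),
  Revenue   = c(28.34, 534.86, 292.58, 37.03, 107.93, 330.25)
)

# Pearson correlations, ignoring missing values pair by pair.
cor_mat <- cor(six, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Even in this tiny subset the Rank and Vote correlation is negative, because rank 1 is the film with the most votes.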

How have ratings affected popularity?

There is a small positive correlation between the popularity of a movie and the opinion of IMDb users. Most movies have fewer than 500 thousand votes, so most of the observations are at the bottom of the chart. The relationship is not very linear, so a polynomial fits it better. We can conclude that movies rated very highly by users are much more popular than others.
There is a small positive correlation between the popularity of a movie and the opinion of film critics.

What impact does runtime have on revenue and popularity?

Most movies run 50 to 150 minutes, and votes are spread across this range. The relationship is not particularly linear. Very few movies longer than 200 minutes are popular, and movies longer than 200 minutes are not among the highest earners.

How do important predictors like popularity and ratings affect the revenue of the movies?

The number of votes is moderately correlated with the revenue from the movie. Both variables are right-skewed, so most of the points are concentrated near 0. A non-linear model fits better, and under it an increase in the number of votes is associated with higher revenue. In other words, popular movies tend to earn more, which seems plausible.
There is no significant correlation between the rating of IMDb users and the revenue from the movie. Ratings are more dispersed than revenue.
There is little to no correlation between Metascore and the revenue from the movie. Metascore values are clearly more dispersed than revenue. It isn't possible to relate these two variables.

Prediction of Score (User Rating) if the movie is in the top 10000

We will attempt to predict the rating a movie receives, given that it is in the list of the 10,000 most popular movies. We use a simple multiple linear regression model for the prediction.

Using Multi-linear Regression

We create a model using only the numeric variables.

##                   train test
## Number of rows     4900 1225
## Number of columns     7    7
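The sizes above add up to 6125 rows, with 4900 of them (80%) in the training set. That pattern suggests rows with missing values were dropped and the rest split 80/20 at random. The sketch below reproduces the sizes under that assumption; the placeholder data frame, the seed, and the use of `sample` are illustrative choices, not taken from the report.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Placeholder for the complete-case data: 6125 rows, one dummy column.
n <- 6125
complete <- data.frame(id = seq_len(n))

# Draw 80% of the row indices at random for training; the rest is the test set.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- complete[train_idx, , drop = FALSE]
test  <- complete[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

On the real data, `complete` would be something like `na.omit(imdb[, numeric_columns])`, where `numeric_columns` is a hypothetical vector naming the seven numeric variables.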

## 
## Call:
## lm(formula = Score ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4388 -0.3155  0.0331  0.3739  2.1806 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept) 15.04677370775  1.47921307287  10.172 < 0.0000000000000002 ***
## Rank        -0.00004156969  0.00000406916 -10.216 < 0.0000000000000002 ***
## Year        -0.00555475651  0.00073467177  -7.561   0.0000000000000475 ***
## Metascore    0.03403416813  0.00050829209  66.958 < 0.0000000000000002 ***
## Vote         0.00000139992  0.00000008638  16.207 < 0.0000000000000002 ***
## Runtime      0.00765446371  0.00048240900  15.867 < 0.0000000000000002 ***
## Revenue     -0.00202792176  0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared:  0.6232, Adjusted R-squared:  0.6228 
## F-statistic:  1349 on 6 and 4893 DF,  p-value: < 0.00000000000000022
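To see how the coefficients above turn into a prediction, the fitted equation can be evaluated by hand for one film. The sketch below does this for The Shawshank Redemption, using the coefficients from the summary above and the film's values from the first listing.

```r
# Coefficients copied from the model summary above.
coefs <- c("(Intercept)" = 15.04677370775,
           Rank          = -0.00004156969,
           Year          = -0.00555475651,
           Metascore     =  0.03403416813,
           Vote          =  0.00000139992,
           Runtime       =  0.00765446371,
           Revenue       = -0.00202792176)

# Predictor values for The Shawshank Redemption (first listing);
# the intercept is multiplied by 1.
shawshank <- c("(Intercept)" = 1,
               Rank          = 1,
               Year          = 1994,
               Metascore     = 80,
               Vote          = 2011509,
               Runtime       = 142,
               Revenue       = 28.34)

# Linear prediction: the sum of coefficient * value over all terms.
pred <- sum(coefs * shawshank[names(coefs)])
print(pred)
```

The result is about 10.54, which is above the maximum possible score of 10. A linear model is not bounded to the rating scale, so predictions for the very top films can overshoot it.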

Using Genre(qualitative) in the model

We add a categorical variable, Genre, to the model to evaluate whether the extra parameters give a better R-squared.

##                   train test
## Number of rows    12423 3106
## Number of columns     8    8

## 
## Call:
## lm(formula = Score ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4388 -0.3155  0.0331  0.3739  2.1806 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept) 15.04677370775  1.47921307287  10.172 < 0.0000000000000002 ***
## Rank        -0.00004156969  0.00000406916 -10.216 < 0.0000000000000002 ***
## Year        -0.00555475651  0.00073467177  -7.561   0.0000000000000475 ***
## Metascore    0.03403416813  0.00050829209  66.958 < 0.0000000000000002 ***
## Vote         0.00000139992  0.00000008638  16.207 < 0.0000000000000002 ***
## Runtime      0.00765446371  0.00048240900  15.867 < 0.0000000000000002 ***
## Revenue     -0.00202792176  0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared:  0.6232, Adjusted R-squared:  0.6228 
## F-statistic:  1349 on 6 and 4893 DF,  p-value: < 0.00000000000000022

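The summary printed above is identical to that of the first model: it contains no Genre coefficients and has the same 4893 residual degrees of freedom. It therefore shows no improvement from adding Genre, and we leave the qualitative variables out of our model. Next we look at how the model performs when polynomial terms are added to the first model.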

Using polynomial terms in the model

## 
## Call:
## lm(formula = Score ~ Rank + Revenue + Metascore + Runtime + I(Rank^2) + 
##     I(Revenue^2) + I(Metascore^2) + I(Runtime^2) + I(Vote^2), 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5496 -0.3239  0.0242  0.3625  2.2757 
## 
## Coefficients:
##                            Estimate           Std. Error t value
## (Intercept)     3.35855807147775520  0.14010206649037127  23.972
## Rank           -0.00022164228231170  0.00001422072038590 -15.586
## Revenue        -0.00403836367820153  0.00029307147247807 -13.779
## Metascore       0.05739091125029285  0.00253452703948769  22.644
## Runtime         0.01586429871288007  0.00188979917829083   8.395
## I(Rank^2)       0.00000001478640119  0.00000000137325081  10.767
## I(Revenue^2)    0.00000522397763860  0.00000061948060846   8.433
## I(Metascore^2) -0.00020948316261189  0.00002235917894638  -9.369
## I(Runtime^2)   -0.00003162943124259  0.00000712760142695  -4.438
## I(Vote^2)       0.00000000000071684  0.00000000000007281   9.845
##                            Pr(>|t|)    
## (Intercept)    < 0.0000000000000002 ***
## Rank           < 0.0000000000000002 ***
## Revenue        < 0.0000000000000002 ***
## Metascore      < 0.0000000000000002 ***
## Runtime        < 0.0000000000000002 ***
## I(Rank^2)      < 0.0000000000000002 ***
## I(Revenue^2)   < 0.0000000000000002 ***
## I(Metascore^2) < 0.0000000000000002 ***
## I(Runtime^2)              0.0000093 ***
## I(Vote^2)      < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5882 on 4890 degrees of freedom
## Multiple R-squared:  0.6309, Adjusted R-squared:  0.6302 
## F-statistic: 928.7 on 9 and 4890 DF,  p-value: < 0.00000000000000022

Here too the performance is similar (adjusted R-squared of 0.6302 versus 0.6228), so we use the simple multiple linear regression model for prediction. Using this first model, we predict the ratings, measure the goodness of fit, and plot the predicted values against the actual ones.
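The comparison rests on adjusted R-squared, which penalizes a model for each extra predictor. As a check, the reported values can be recomputed from the plain R-squared, the training-set size, and the number of predictors using the standard formula. The helper name `adjusted_r2` below is introduced only for this illustration.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_train <- 4900  # rows in the training set

# Simple model: 6 predictors, R-squared as reported in the report.
adj_simple <- adjusted_r2(0.6232221, n_train, p = 6)

# Polynomial model: 9 predictors, R-squared 0.6309 from its summary.
adj_poly <- adjusted_r2(0.6309, n_train, p = 9)

print(round(c(simple = adj_simple, polynomial = adj_poly), 4))
```

With 4900 observations the penalty is tiny, so the gap of well under 0.01 between the two models comes almost entirely from the small gain in plain R-squared.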

##        pred1 test.Score
## 1  10.538695        9.3
## 5  10.327653        8.9
## 6   9.323065        8.8
## 9   9.896258        8.9
## 12  9.374097        8.7
## 16  9.182940        8.4

## R-squared: 0.6232221

## RMSE Test: 0.5617891
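The two metrics can be sketched as small helper functions. The names `rmse` and `r_squared` are introduced here for illustration, and the data are only the six prediction and score pairs listed above, so the printed values will not match the full test-set figures. The R-squared reported above equals the Multiple R-squared in the training summary (0.6232), so it may have been taken from the fitted model rather than computed on the test set.

```r
# The six predicted / observed pairs listed above.
pred <- c(10.538695, 10.327653, 9.323065, 9.896258, 9.374097, 9.182940)
obs  <- c(9.3, 8.9, 8.8, 8.9, 8.7, 8.4)

# Root mean squared error: typical size of a prediction error, in score points.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared on held-out data: 1 - residual sum of squares / total sum of squares.
# It can be negative when predictions are worse than always predicting the mean.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

print(rmse(obs, pred))
print(r_squared(obs, pred))
```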

Plot of predicted vs. observed values

Using Random Forest

## Ranger result
## 
## Call:
##  ranger(Score ~ ., data = train, num.trees = 100, mtry = 6, min.node.size = 1,      replace = T) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      4900 
## Number of independent variables:  6 
## Mtry:                             6 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.3263479 
## R squared (OOB):                  0.6511699
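The random forest reports an out-of-bag (OOB) mean squared error, while the linear model was summarized with an RMSE. Taking the square root puts the two on the same scale. The comparison below is only indicative: the OOB error is estimated on the training data and the linear model's RMSE on the test set.

```r
# Figures copied from the outputs above.
oob_mse          <- 0.3263479  # ranger OOB prediction error (MSE)
linear_test_rmse <- 0.5617891  # RMSE of the linear model on the test set
linear_train_rse <- 0.5941     # residual standard error of the linear model

# Convert the OOB mean squared error to a root mean squared error.
oob_rmse <- sqrt(oob_mse)

print(round(c(forest_oob_rmse  = oob_rmse,
              linear_test_rmse = linear_test_rmse,
              linear_train_rse = linear_train_rse), 4))
```

The forest's OOB RMSE of about 0.571 falls between the linear model's training and test errors, so on these figures the two approaches perform similarly.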

Predicted Scores

##   pred_test test.Score
## 1      8.77        9.3
## 2      8.77        8.9
## 3      8.72        8.8
## 4      8.75        8.9
## 5      8.67        8.7
## 6      8.57        8.4

Impact

Conclusion

The variable with by far the most influence on the user rating in this model is Metascore. The other variables have an impact similar to that of the number of votes. Audience ratings are quite close to critics' ratings, although critics rate more severely than the public.
It is also important to look at the coefficients associated with each feature. All variables in the model are statistically significant predictors of the user rating. Beyond the variables considered in the model, the rating of a movie can depend on further factors such as the time of year it is released, the actors, the production, and so on. With a larger model we could determine the ratings and popularity of a movie more accurately.
The movie business is a high-demand, high-cost business, and data analysis like this on a larger scale can help identify the factors that make a movie successful.