Investigate the 10,000 most-voted movies in the IMDb database to understand which attributes contribute to a movie's ranking, and then use that analysis to predict the rating of movies that appear in this top 10,000.
The dataset is a list of the 10,000 most-voted feature films on IMDb, ranked by number of votes. It is taken from https://www.kaggle.com/datasets/isaactaylorofficial/imdb-10000-most-voted-feature-films-041118. IMDb is an online database of information related to films, television series, home videos, video games, and streaming content. In this dataset we analyze only the most popular movies.
The dataset includes:
## Rank Title Year Score Metascore Genre
## 1 1 The Shawshank Redemption 1994 9.3 80 Drama
## 2 2 The Dark Knight 2008 9.0 84 Action, Crime, Drama
## 3 3 Inception 2010 8.8 74 Action, Adventure, Sci-Fi
## 4 4 Fight Club 1999 8.8 66 Drama
## 5 5 Pulp Fiction 1994 8.9 94 Crime, Drama
## 6 6 Forrest Gump 1994 8.8 82 Drama, Romance
## Vote Director Runtime Revenue
## 1 2011509 Frank Darabont 142 28.34
## 2 1980200 Christopher Nolan 152 534.86
## 3 1760209 Christopher Nolan 148 292.58
## 4 1609459 David Fincher 139 37.03
## 5 1570194 Quentin Tarantino 154 107.93
## 6 1532024 Robert Zemeckis 142 330.25
## Description
## 1 Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
## 2 When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham. The Dark Knight must accept one of the greatest psychological and physical tests of his ability to fight injustice.
## 3 A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a CEO.
## 4 An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.
## 5 The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.
## 6 The presidencies of Kennedy and Johnson, Vietnam, Watergate, and other history unfold through the perspective of an Alabama man with an IQ of 75.
summary(imdb)
## Rank Title Year Score
## Min. : 1 Length:10000 Min. :1915 Min. :1.300
## 1st Qu.: 2501 Class :character 1st Qu.:1991 1st Qu.:6.000
## Median : 5000 Mode :character Median :2004 Median :6.700
## Mean : 5000 Mean :1998 Mean :6.628
## 3rd Qu.: 7500 3rd Qu.:2011 3rd Qu.:7.400
## Max. :10000 Max. :2018 Max. :9.600
##
## Metascore Genre Vote Director
## Min. :10.00 Length:10000 Min. : 6015 Length:10000
## 1st Qu.:44.00 Class :character 1st Qu.: 10147 Class :character
## Median :57.00 Mode :character Median : 21172 Mode :character
## Mean :56.53 Mean : 64488
## 3rd Qu.:70.00 3rd Qu.: 62052
## Max. :99.00 Max. :2011509
## NA's :3219
## Runtime Revenue Description
## Min. : 45.0 Min. : 0.00 Length:10000
## 1st Qu.: 94.0 1st Qu.: 1.89 Class :character
## Median :105.0 Median : 15.09 Mode :character
## Mean :108.7 Mean : 36.26
## 3rd Qu.:118.0 3rd Qu.: 43.86
## Max. :450.0 Max. :936.66
## NA's :2527
The summary shows that some variables are incomplete, so we visualize the missing values to get a clear picture of the completeness of our dataset.
vis_miss(imdb)
9 of the 11 variables are complete, with no NA values. The “Metascore” variable has 3219 missing values, which is over 32% of all observations. The “Revenue” variable is similar, with 2527 missing values (about 25%).
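The missing-value shares quoted above can be checked with a few lines of base R. This is a minimal sketch: the data frame `toy` and its values are made up for illustration, and only the two counts in `reported` come from the summary output.

```r
# Toy stand-in for the imdb data frame (values are made up for illustration).
toy <- data.frame(
  Score     = c(9.3, 9.0, 8.8, 6.1, 5.4),
  Metascore = c(80, NA, 74, NA, 41),
  Revenue   = c(28.34, 534.86, NA, 1.2, 0.5)
)

# Count and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- round(100 * colMeans(is.na(toy)), 1)
print(data.frame(missing = na_count, percent = na_share))

# The same arithmetic applied to the NA counts reported by summary(imdb).
reported <- c(Metascore = 3219, Revenue = 2527) / 10000 * 100
print(reported)
```

Running the same two lines on the full data frame should reproduce the percentages that the missingness plot displays.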
A “VoteMln” variable is created to make the charts easier to read: it expresses the number of votes in millions rather than as large raw counts.
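A minimal sketch of how such a column might be derived, assuming it is simply the vote count divided by one million (the exact expression is not shown in the report). The data frame `imdb_head` is a hypothetical name holding three rows from the listing above.

```r
# Vote counts taken from the first rows of the dataset shown earlier.
imdb_head <- data.frame(
  Title = c("The Shawshank Redemption", "The Dark Knight", "Inception"),
  Vote  = c(2011509, 1980200, 1760209)
)

# Express votes in millions; the divisor 1e6 is an assumption about how VoteMln is derived.
imdb_head$VoteMln <- imdb_head$Vote / 1e6
print(round(imdb_head$VoteMln, 2))
```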
Exploring the Variables
The most popular movie is The Shawshank Redemption (2.01 million votes). It is worth noting that the most popular movie is also the top-ranked movie in the list. We can check whether the most popular movies are also the best movies according to users, and visualize the user ratings to see the most highly rated movies.
The movie best rated by the audience is Aloko Udapadi. It does not appear among the top-ranked movies because the ranking also takes the number of votes into account (this movie has only 6.5k). The worst-rated movie among the 10,000 most popular films is Cumali Ceber, which is relatively popular in terms of votes (over 36k votes).
Looking at the most popular directors
The most popular director is Steven Spielberg. His films have been rated 10.35 million times on IMDb. Close behind is Christopher Nolan (10.22 million votes). It is worth adding that none of the top 10 most popular films is by Spielberg, while two of them are by Nolan. The director of the most popular movie (The Shawshank Redemption) is not on the top 10 list of directors.
One interesting thing here is that although the highest-grossing genre is Drama and the most popular is Action, Film-Noir has the highest average rating. A possible reason is that this genre contains fewer movies, which on average receive good reviews.
Comparing the Runtime of the movies
The longest movie lasts 450 minutes (seven and a half hours). The shortest movie is only 45 minutes long. Neither of these movies is very popular.
## Number of Votes received
## Votes Values
## [1,] "Over milion votes" "27"
## [2,] "Over 500 thousand votes" "174"
## [3,] "Over 100 thousand votes" "1617"
## [4,] "Over 10 thousand votes" "7547"
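A table like the one above can be produced by counting the films that exceed each threshold. The sketch below assumes the counts are cumulative (a film with over a million votes is also counted in every lower band), which is consistent with the reported numbers; the `votes` vector is hypothetical.

```r
# Hypothetical vote counts; the real analysis would use imdb$Vote.
votes <- c(2011509, 1532024, 650000, 240000, 98000, 36000, 12000, 6500)

thresholds <- c("Over 1 million votes"    = 1e6,
                "Over 500 thousand votes" = 5e5,
                "Over 100 thousand votes" = 1e5,
                "Over 10 thousand votes"  = 1e4)

# For each threshold, count the films whose vote total exceeds it.
counts <- sapply(thresholds, function(t) sum(votes > t))
print(counts)
```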
Analysis from the distribution of all variables:
The scatter plot shows IMDb user ratings against Metascore critic ratings. The line fitted by a linear model suggests that the relationship is approximately linear.
From the distribution plot we can see that the distribution of the IMDb Score is more peaked and less asymmetric (right-sided asymmetry) than that of the Metascore. Critics’ scores from Metascore are more spread out, while IMDb user scores cluster close to the average.
We can fit a polynomial model to see whether it does any better.
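One way to compare a straight line with a polynomial is to fit both and look at the adjusted R-squared and a nested-model F-test. The sketch below uses synthetic data, and the degree-2 polynomial is an assumption, since the report does not state which degree was fitted.

```r
set.seed(1)  # synthetic data, for illustration only

# Hypothetical critic scores (0-100) and user scores with a mild curvature.
metascore <- runif(200, min = 10, max = 99)
score <- 3.4 + 0.057 * metascore - 0.0002 * metascore^2 + rnorm(200, sd = 0.5)

fit_linear <- lm(score ~ metascore)
fit_poly   <- lm(score ~ poly(metascore, 2))

# Adjusted R-squared penalizes the extra term, so it is a fair first comparison.
adj_linear <- summary(fit_linear)$adj.r.squared
adj_poly   <- summary(fit_poly)$adj.r.squared
print(c(linear = adj_linear, quadratic = adj_poly))

# An F-test for nested models checks whether the quadratic term adds anything.
print(anova(fit_linear, fit_poly))
```

If the two adjusted R-squared values are close and the F-test is not significant, the simpler linear fit is usually the better choice.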
We can see a large positive correlation between the Metascore and IMDb user ratings, and between the number of votes and revenue.
Large negative correlations can be seen between Rank and the number of votes, and likewise between Rank and revenue. This is because ranks start at 1 for the top movie: the smaller the rank number, the better.
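Because Metascore and Revenue contain missing values, a correlation matrix needs to be told how to handle them. The sketch below uses the first six rows from the listing at the top, with two values blanked out to imitate the gaps; using `pairwise.complete.obs` is an assumption about how the original matrix was computed.

```r
# First six rows of the dataset; two values are set to NA here for illustration.
toy <- data.frame(
  Rank      = 1:6,
  Vote      = c(2011509, 1980200, 1760209, 1609459, 1570194, 1532024),
  Metascore = c(80, 84, 74, NA, 94, 82),
  Revenue   = c(28.34, 534.86, 292.58, 37.03, NA, 330.25)
)

# pairwise.complete.obs uses every row where both columns of a pair are present,
# so the NAs in Metascore and Revenue do not wipe out the whole matrix.
corr <- cor(toy, use = "pairwise.complete.obs")
print(round(corr, 2))
```

The Rank and Vote entry comes out negative, which illustrates the point above: a smaller rank number goes with a larger vote count.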
There is a small positive correlation between the popularity of a movie and the opinion of IMDb users. Most movies have fewer than 500 thousand votes, so most of the observations are at the bottom of the chart. The relationship is not very linear, so a polynomial fits it better. We can conclude that movies rated very highly by users are much more popular than others.
There is a small positive correlation between the popularity of a movie and the opinion of film critics.
The number of votes is moderately correlated with the revenue from the movie. The data for both variables are right-skewed, so most of the points are concentrated towards 0. A non-linear model fits better, and it suggests that revenue tends to rise with the number of votes. In other words, popular movies tend to earn more revenue, which seems plausible.
There is no significant correlation between the rating of IMDb users and the revenue from the movie. Ratings are more dispersed than revenue.
There is little to no correlation between Metascore and the revenue from the movie. Metascores are clearly more dispersed than revenue, and the two variables cannot be related to each other.
We will attempt to predict the rating a movie receives given that it is in the list of the 10,000 most popular movies. We use a multiple linear regression model for the prediction.
We create a model with only the numeric variables.
## train test
## Number of rows 4900 1225
## Number of columns 7 7
##
## Call:
## lm(formula = Score ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4388 -0.3155 0.0331 0.3739 2.1806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.04677370775 1.47921307287 10.172 < 0.0000000000000002 ***
## Rank -0.00004156969 0.00000406916 -10.216 < 0.0000000000000002 ***
## Year -0.00555475651 0.00073467177 -7.561 0.0000000000000475 ***
## Metascore 0.03403416813 0.00050829209 66.958 < 0.0000000000000002 ***
## Vote 0.00000139992 0.00000008638 16.207 < 0.0000000000000002 ***
## Runtime 0.00765446371 0.00048240900 15.867 < 0.0000000000000002 ***
## Revenue -0.00202792176 0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared: 0.6232, Adjusted R-squared: 0.6228
## F-statistic: 1349 on 6 and 4893 DF, p-value: < 0.00000000000000022
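The split and fit reported above can be sketched as follows. The 80/20 proportion is inferred from the 4900 and 1225 row counts, the seed is arbitrary, and the data frame `toy` is synthetic; none of this is confirmed as the code used in the report.

```r
set.seed(42)  # arbitrary seed; the original seed is not stated

# Synthetic stand-in for the numeric columns, including missing Metascore values.
n <- 500
toy <- data.frame(
  Metascore = sample(c(NA, 10:99), n, replace = TRUE),
  Runtime   = round(rnorm(n, mean = 108, sd = 20)),
  Vote      = round(rexp(n, rate = 1 / 64000)) + 6000
)
toy$Score <- 3 + 0.034 * ifelse(is.na(toy$Metascore), 57, toy$Metascore) +
  0.008 * toy$Runtime + rnorm(n, sd = 0.6)

# lm() drops incomplete rows anyway; removing them first keeps the split sizes honest.
complete <- na.omit(toy)

# 80/20 split (an assumption that matches the 4900 / 1225 row counts above).
train_idx <- sample(nrow(complete), size = floor(0.8 * nrow(complete)))
train <- complete[train_idx, ]
test  <- complete[-train_idx, ]

# Regress Score on every other column, as in lm(formula = Score ~ ., data = train).
fit <- lm(Score ~ ., data = train)
print(summary(fit)$adj.r.squared)
```

Dropping rows with missing Metascore or Revenue before splitting would also explain why the training and test sets together hold fewer than 10,000 rows.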
We add a categorical variable, Genre, to the model to see whether the extra parameters give a better R-squared.
## train test
## Number of rows 12423 3106
## Number of columns 8 8
##
## Call:
## lm(formula = Score ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4388 -0.3155 0.0331 0.3739 2.1806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.04677370775 1.47921307287 10.172 < 0.0000000000000002 ***
## Rank -0.00004156969 0.00000406916 -10.216 < 0.0000000000000002 ***
## Year -0.00555475651 0.00073467177 -7.561 0.0000000000000475 ***
## Metascore 0.03403416813 0.00050829209 66.958 < 0.0000000000000002 ***
## Vote 0.00000139992 0.00000008638 16.207 < 0.0000000000000002 ***
## Runtime 0.00765446371 0.00048240900 15.867 < 0.0000000000000002 ***
## Revenue -0.00202792176 0.00016986233 -11.939 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5941 on 4893 degrees of freedom
## Multiple R-squared: 0.6232, Adjusted R-squared: 0.6228
## F-statistic: 1349 on 6 and 4893 DF, p-value: < 0.00000000000000022
There is no significant improvement in the model after adding more variables (the adjusted R-squared stays at 0.6228), so we can leave the qualitative variables out of our model. Next we look at the performance of our model after adding polynomial terms to our first model.
##
## Call:
## lm(formula = Score ~ Rank + Revenue + Metascore + Runtime + I(Rank^2) +
## I(Revenue^2) + I(Metascore^2) + I(Runtime^2) + I(Vote^2),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5496 -0.3239 0.0242 0.3625 2.2757
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.35855807147775520 0.14010206649037127 23.972
## Rank -0.00022164228231170 0.00001422072038590 -15.586
## Revenue -0.00403836367820153 0.00029307147247807 -13.779
## Metascore 0.05739091125029285 0.00253452703948769 22.644
## Runtime 0.01586429871288007 0.00188979917829083 8.395
## I(Rank^2) 0.00000001478640119 0.00000000137325081 10.767
## I(Revenue^2) 0.00000522397763860 0.00000061948060846 8.433
## I(Metascore^2) -0.00020948316261189 0.00002235917894638 -9.369
## I(Runtime^2) -0.00003162943124259 0.00000712760142695 -4.438
## I(Vote^2) 0.00000000000071684 0.00000000000007281 9.845
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Rank < 0.0000000000000002 ***
## Revenue < 0.0000000000000002 ***
## Metascore < 0.0000000000000002 ***
## Runtime < 0.0000000000000002 ***
## I(Rank^2) < 0.0000000000000002 ***
## I(Revenue^2) < 0.0000000000000002 ***
## I(Metascore^2) < 0.0000000000000002 ***
## I(Runtime^2) 0.0000093 ***
## I(Vote^2) < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5882 on 4890 degrees of freedom
## Multiple R-squared: 0.6309, Adjusted R-squared: 0.6302
## F-statistic: 928.7 on 9 and 4890 DF, p-value: < 0.00000000000000022
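The squared terms in the formula above are wrapped in `I()` for a reason: inside an R formula, `^` is formula syntax rather than arithmetic. The following sketch, on synthetic data with hypothetical variable names, shows the difference.

```r
set.seed(7)  # synthetic data, for illustration only
runtime <- runif(300, min = 45, max = 200)
score <- 4 + 0.03 * runtime - 0.0001 * runtime^2 + rnorm(300, sd = 0.5)

# Without I(), ^ is formula syntax, so runtime^2 collapses back to runtime.
fit_plain <- lm(score ~ runtime + runtime^2)

# With I(), the square is computed arithmetically and enters as its own column.
fit_quad <- lm(score ~ runtime + I(runtime^2))

print(names(coef(fit_plain)))
print(names(coef(fit_quad)))
```

Only the second model contains a quadratic coefficient, so forgetting `I()` silently fits the plain linear model.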
Here too the performance is similar, so we use the simple multiple linear regression model for prediction. Using the first model, we predict the rating, measure the goodness of fit, and plot the predicted values against the actual values.
## pred1 test.Score
## 1 10.538695 9.3
## 5 10.327653 8.9
## 6 9.323065 8.8
## 9 9.896258 8.9
## 12 9.374097 8.7
## 16 9.182940 8.4
## R-squared: 0.6232221
## RMSE Test: 0.5617891
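The RMSE is the square root of the mean squared difference between observed and predicted scores. The sketch below applies that definition to the six rows printed above; since it uses only those rows, its result will not match the test-set figure of 0.5617891. The helper name `rmse` is introduced here for illustration.

```r
# First six predicted and observed scores from the table above.
pred     <- c(10.538695, 10.327653, 9.323065, 9.896258, 9.374097, 9.182940)
observed <- c(9.3, 8.9, 8.8, 8.9, 8.7, 8.4)

# Root mean squared error: the typical size of a prediction error, in score points.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean signed error: positive when predictions run above the observed values.
bias <- mean(pred - observed)

print(rmse(observed, pred))
print(bias)
```

On these six top-ranked films every prediction lies above the observed score, and two exceed the maximum possible rating of 10, which hints that the linear model extrapolates poorly at the very top of the list.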
Plot predicted vs observed value
## Ranger result
##
## Call:
## ranger(Score ~ ., data = train, num.trees = 100, mtry = 6, min.node.size = 1, replace = T)
##
## Type: Regression
## Number of trees: 100
## Sample size: 4900
## Number of independent variables: 6
## Mtry: 6
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.3263479
## R squared (OOB): 0.6511699
Predicted Scores
## pred_test test.Score
## 1 8.77 9.3
## 2 8.77 8.9
## 3 8.72 8.8
## 4 8.75 8.9
## 5 8.67 8.7
## 6 8.57 8.4
The variable with by far the most influence on the user rating in this algorithm is Metascore. The other variables have an impact similar to that of the number of votes. Audience ratings of the movies are quite close to the critics’ ratings, although critics rate more severely than the public.
It is also important to look at the coefficients associated with each feature. All variables have a significant impact on the user ratings. Beyond the variables considered in the model, the rating of a movie can depend on other factors such as the time of year it is released, the actors, the production, and so on. With a larger model we could determine the ratings and popularity of a movie more accurately.
The movie business is a high-demand, high-cost business, and data analyses such as this one, on a larger scale, can help identify the factors that make a movie successful.