Overview

Ever wondered which movie among the Avengers (Marvel cinematic universe) has the best rating ? One may be biased to think that the most recent movie Avengers: End game is the best movie but it may not be truth. To truly evaluate the movies released over a period of time, we need ratings from the audience. ImDb solves this problem. IMDb is an online database (owned by Amazon) of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews ImDb Wiki:

In snap of a finger, we can learn which movie among the Avengers is the best one. But do you know if your friends feel the same way ? Let’s conduct a survey and compare the results with ImDb rating. Survey results are available here for reference.

Five Star rating for recent Marvel movies

Step 1:Let’s connect to ImDb table, but wait is it possible ?

Fortunately Kaggle provides the “All Marvel Cinematic Universe Movies on IMDB” data which we can download and load into MySQl DB.

We need RMySQL library to connect to the MySQL DB. Let’s install and load them:

library(RMySQL)

Up next, we need to connect to the MySQL DB by entering the server type and credentials

mydb = dbConnect(MySQL(), user='root', password='newrootpassword', dbname='movies', host='127.0.0.1')

Step 2: Importing required tables

We can find the list of tables in the db with the use of the function ‘dbListTables’. The tables in the schema/db are " “imdb_rating” “movie_title” and “movies”

dbListTables(mydb)

Let’s retrieve all the three tables from the database for our analysis. We can create three data frames through below queries:

tab1 = dbSendQuery(mydb, "select * from movies")
survey_data = fetch(tab1)
knitr::kable(survey_data, "simple")
IP_Address Time_Spent Movie_Frequency Movie_1 Movie_2 Movie_3 Movie_4 Movie_5 Movie_6
72.21.196.65 38 2-5 times a month 4 4 4 5 4 3
100.35.108.12 44 <1 time a month 4 4 5 5 1 4
174.241.132.76 46 5+ times a month 5 5 0 5 4 5
183.83.137.201 56 5+ times a month 1 3 0 3 0 0
100.35.108.12 37 2-5 times a month 4 5 5 4 3 5
115.99.172.152 62 5+ times a month 4 4 4 4 4 3
162.83.178.107 40 2-5 times a month 4 5 5 3 5 5
157.34.81.38 84 <1 time a month 0 0 4 0 0 4
106.198.172.233 64 NA 4 5 1 4 1 4
50.200.142.50 57 2-5 times a month 4 4 0 0 0 0
45.124.146.23 88 2-5 times a month 4 4 0 0 0 4
49.35.84.179 83 1 time a month 0 0 0 0 0 0
157.34.99.133 117 2-5 times a month 5 3 0 0 0 0
91.73.122.28 86 2-5 times a month 4 5 5 5 0 0
173.70.126.249 44 2-5 times a month 4 5 0 0 0 0
131.55.328.12 47 2-5 times a month 5 5 3 4 4 4
106.206.55.85 57 5+ times a month 3 4 0 0 3 4
223.229.184.53 90 5+ times a month 0 0 3 4 3 0
tab2 = dbSendQuery(mydb, "select * from movie_title")
movie_title = fetch(tab2)
knitr::kable(movie_title, "simple")
Movie_Index Movie_name
Movie_1 Avengers: Endgame (2019)
Movie_2 Avengers: Infinity War (2018)
Movie_3 Spider-Man: Far from Home (2019)
Movie_4 Captain Marvel (2019)
Movie_5 Ant-Man and the Wasp (2018)
Movie_6 Guardians of the Galaxy Vol. 2 (2017)
tab3 = dbSendQuery(mydb, "select * from imdb_rating")
imdb_rating = fetch(tab3)
head(imdb_rating)

Step 3: Time for analysis

Let’s calculate average rating for all the movies, excluding the entries where poll results were not available (marked as ‘0’ in the data). Also, we can leverage movie_title table to obtain the movie names and understand the results better.

survey_data_1 = replace(survey_data, survey_data==0, NA)
survey_col <- colnames(survey_data_1)
Survey_rating <- colMeans(survey_data_1[sapply(survey_data_1, is.numeric)],na.rm = TRUE)
Survey_rating <- as.data.frame(Survey_rating)

Survey_rating <- data.frame(Movie_Index = row.names(Survey_rating), Survey_rating)[-1,]

Survey_final <- merge(x = movie_title, y = Survey_rating, by = "Movie_Index")

# Leveraged round_df code from Stackoverflow https://stackoverflow.com/questions/29875914/rounding-values-in-a-dataframe-in-r
round_df <- function(x, digits) {
    # round all numeric variables
    # x: data frame 
    # digits: number of digits to round
    numeric_columns <- sapply(x, mode) == 'numeric'
    x[numeric_columns] <-  round(x[numeric_columns], digits)
    x
}

Survey_final <- round_df(Survey_final,1)
knitr::kable(Survey_final, "simple")
Movie_Index Movie_name Survey_rating
Movie_1 Avengers: Endgame (2019) 3.9
Movie_2 Avengers: Infinity War (2018) 4.3
Movie_3 Spider-Man: Far from Home (2019) 3.9
Movie_4 Captain Marvel (2019) 4.2
Movie_5 Ant-Man and the Wasp (2018) 3.2
Movie_6 Guardians of the Galaxy Vol. 2 (2017) 4.1

I wonder how my survey sample compares to broader ImDb rating, let’s find out! But before we do that, there might be differences in the movie title in the survey response table compared to ImDb table. We have to run fuzzy logic to match both of them. Based on the example available on the site here, let’s merge both the datasets.

library("fuzzyjoin")
colnames(Survey_final)[2] <- "Title" 
Survey_comp <- stringdist_inner_join(Survey_final[ , c("Title", "Survey_rating")],imdb_rating[ , c("Title", "Review Rating")],by = "Title")[,c("Title.y","Survey_rating","Review Rating")]
colnames(Survey_comp)[1] <- "Title" 
knitr::kable(Survey_comp, "simple")
Title Survey_rating Review Rating
Avengers: Endgame (2019) 3.9 9
Avengers: Infinity War (2018) 4.3 9
Spider-Man: Far from Home (2019) 3.9 0
Captain Marvel (2019) 4.2 7
Ant-Man and the Wasp (2018) 3.2 7
Guardians of the Galaxy Vol. 2 (2017) 4.1 8

Let’s bring ImDb rating to the scale of 5 to make the comparison even.

Survey_comp[3] <- Survey_comp[3]/2

knitr::kable(Survey_comp, "simple")
Title Survey_rating Review Rating
Avengers: Endgame (2019) 3.9 4.5
Avengers: Infinity War (2018) 4.3 4.5
Spider-Man: Far from Home (2019) 3.9 0.0
Captain Marvel (2019) 4.2 3.5
Ant-Man and the Wasp (2018) 3.2 3.5
Guardians of the Galaxy Vol. 2 (2017) 4.1 4.0

Conclusion

Movies released prior to 2019 had similar ratings from survey sample and ImDb but movies in 2019 seem to have difference of opinion. Also, Spider-Man movie did not have any rating in the ImDb databases possibly because the ImDb dataset was outdated. We may assume that Spider-Man might be a good recommendation to broader audience as it has 3.9 star rating from the survey conducted (not bad spidey!)