Assignment 2: Recommend movies. SQL & R.

Overview

Ever wondered which movie among the Avengers (Marvel cinematic universe) has the best rating ? One may be biased to think that the most recent movie Avengers: End game is the best movie but it may not be truth. To truly evaluate the movies released over a period of time, we need ratings from the audience. ImDb solves this problem. IMDb is an online database (owned by Amazon) of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews ImDb Wiki:

In snap of a finger, we can learn which movie among the Avengers is the best one. But do you know if your friends feel the same way ? Let’s conduct a survey and compare the results with ImDb rating. Survey results are available here for reference.

Five Star rating for recent Marvel movies

Step 1:Let’s connect to ImDb table, but wait is it possible ?

Fortunately Kaggle provides the “All Marvel Cinematic Universe Movies on IMDB” data which we can download and load into MySQl DB.

We need RMySQL library to connect to the MySQL DB. Let’s install and load them:

library(RMySQL)

Up next, we need to connect to the MySQL DB by entering the server type and credentials

mydb = dbConnect(MySQL(), user='root', password='newrootpassword', dbname='movies', host='127.0.0.1')

Step 2: Importing required tables

We can find the list of tables in the db with the use of the function ‘dbListTables’. The tables in the schema/db are " “imdb_rating” “movie_title” and “movies”

dbListTables(mydb)

Let’s retrieve all the three tables from the database for our analysis. We can create three data frames through below queries:

tab1 = dbSendQuery(mydb, "select * from movies")
survey_data = fetch(tab1)
knitr::kable(survey_data, "simple")

IP_Address	Time_Spent	Movie_Frequency	Movie_1	Movie_2	Movie_3	Movie_4	Movie_5	Movie_6
72.21.196.65	38	2-5 times a month	4	4	4	5	4	3
100.35.108.12	44	<1 time a month	4	4	5	5	1	4
174.241.132.76	46	5+ times a month	5	5	0	5	4	5
183.83.137.201	56	5+ times a month	1	3	0	3	0	0
100.35.108.12	37	2-5 times a month	4	5	5	4	3	5
115.99.172.152	62	5+ times a month	4	4	4	4	4	3
162.83.178.107	40	2-5 times a month	4	5	5	3	5	5
157.34.81.38	84	<1 time a month	0	0	4	0	0	4
106.198.172.233	64	NA	4	5	1	4	1	4
50.200.142.50	57	2-5 times a month	4	4	0	0	0	0
45.124.146.23	88	2-5 times a month	4	4	0	0	0	4
49.35.84.179	83	1 time a month	0	0	0	0	0	0
157.34.99.133	117	2-5 times a month	5	3	0	0	0	0
91.73.122.28	86	2-5 times a month	4	5	5	5	0	0
173.70.126.249	44	2-5 times a month	4	5	0	0	0	0
131.55.328.12	47	2-5 times a month	5	5	3	4	4	4
106.206.55.85	57	5+ times a month	3	4	0	0	3	4
223.229.184.53	90	5+ times a month	0	0	3	4	3	0

tab2 = dbSendQuery(mydb, "select * from movie_title")
movie_title = fetch(tab2)
knitr::kable(movie_title, "simple")

Movie_Index	Movie_name
Movie_1	Avengers: Endgame (2019)
Movie_2	Avengers: Infinity War (2018)
Movie_3	Spider-Man: Far from Home (2019)
Movie_4	Captain Marvel (2019)
Movie_5	Ant-Man and the Wasp (2018)
Movie_6	Guardians of the Galaxy Vol. 2 (2017)

tab3 = dbSendQuery(mydb, "select * from imdb_rating")
imdb_rating = fetch(tab3)
head(imdb_rating)

Step 3: Time for analysis

Let’s calculate average rating for all the movies, excluding the entries where poll results were not available (marked as ‘0’ in the data). Also, we can leverage movie_title table to obtain the movie names and understand the results better.

survey_data_1 = replace(survey_data, survey_data==0, NA)
survey_col <- colnames(survey_data_1)
Survey_rating <- colMeans(survey_data_1[sapply(survey_data_1, is.numeric)],na.rm = TRUE)
Survey_rating <- as.data.frame(Survey_rating)

Survey_rating <- data.frame(Movie_Index = row.names(Survey_rating), Survey_rating)[-1,]

Survey_final <- merge(x = movie_title, y = Survey_rating, by = "Movie_Index")

# Leveraged round_df code from Stackoverflow https://stackoverflow.com/questions/29875914/rounding-values-in-a-dataframe-in-r
round_df <- function(x, digits) {
    # round all numeric variables
    # x: data frame 
    # digits: number of digits to round
    numeric_columns <- sapply(x, mode) == 'numeric'
    x[numeric_columns] <-  round(x[numeric_columns], digits)
    x
}

Survey_final <- round_df(Survey_final,1)
knitr::kable(Survey_final, "simple")

Movie_Index	Movie_name	Survey_rating
Movie_1	Avengers: Endgame (2019)	3.9
Movie_2	Avengers: Infinity War (2018)	4.3
Movie_3	Spider-Man: Far from Home (2019)	3.9
Movie_4	Captain Marvel (2019)	4.2
Movie_5	Ant-Man and the Wasp (2018)	3.2
Movie_6	Guardians of the Galaxy Vol. 2 (2017)	4.1

I wonder how my survey sample compares to broader ImDb rating, let’s find out! But before we do that, there might be differences in the movie title in the survey response table compared to ImDb table. We have to run fuzzy logic to match both of them. Based on the example available on the site here, let’s merge both the datasets.

library("fuzzyjoin")
colnames(Survey_final)[2] <- "Title" 
Survey_comp <- stringdist_inner_join(Survey_final[ , c("Title", "Survey_rating")],imdb_rating[ , c("Title", "Review Rating")],by = "Title")[,c("Title.y","Survey_rating","Review Rating")]
colnames(Survey_comp)[1] <- "Title" 
knitr::kable(Survey_comp, "simple")

Title	Survey_rating	Review Rating
Avengers: Endgame (2019)	3.9	9
Avengers: Infinity War (2018)	4.3	9
Spider-Man: Far from Home (2019)	3.9	0
Captain Marvel (2019)	4.2	7
Ant-Man and the Wasp (2018)	3.2	7
Guardians of the Galaxy Vol. 2 (2017)	4.1	8

Let’s bring ImDb rating to the scale of 5 to make the comparison even.

Survey_comp[3] <- Survey_comp[3]/2

knitr::kable(Survey_comp, "simple")

Title	Survey_rating	Review Rating
Avengers: Endgame (2019)	3.9	4.5
Avengers: Infinity War (2018)	4.3	4.5
Spider-Man: Far from Home (2019)	3.9	0.0
Captain Marvel (2019)	4.2	3.5
Ant-Man and the Wasp (2018)	3.2	3.5
Guardians of the Galaxy Vol. 2 (2017)	4.1	4.0

Conclusion

Movies released prior to 2019 had similar ratings from survey sample and ImDb but movies in 2019 seem to have difference of opinion. Also, Spider-Man movie did not have any rating in the ImDb databases possibly because the ImDb dataset was outdated. We may assume that Spider-Man might be a good recommendation to broader audience as it has 3.9 star rating from the survey conducted (not bad spidey!)