Movie and Rating Data Analysis Using SQL and R

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.

Environment set up

library(tidyverse)
library(RMySQL)

Database Interaction (RDMBS: MySQL)

Requirement: Created Database DATA607_MOVIEDB, user ‘DATA607User’ and 3 tables: Movie, Reviewer, Rating
Note : SQL Scripts provided separately (github link)

  • Establish the MySQL database connection with MySQL Driver and proper connection string:
    • user id
    • password
    • host
    • database name

Once the connection is established, view the tables.

Database Connection

mydb =dbConnect(MySQL(),user='DATA607User',password=mypassword,dbname='DATA607_MOVIEDB',host='localhost')
dbListTables(mydb)
## [1] "movie"    "rating"   "reviewer"

Select the tables: movie, reviwer and ratings and fetch the resultset in their respective data frames

Movie Data

movie_resultSet = dbSendQuery(mydb, 'select * from movie')
movie_df <- fetch(movie_resultSet)
movie_df
##   movie_id                       movie_name
## 1        1          JUMANJI: THE NEXT LEVEL
## 2        2                Spies in Disguise
## 3        3                         Dolittle
## 4        4                        Fronzen 2
## 5        5 Star Wars: The Rise of Skywalker
## 6        6                             1917

Reviewer Data

reviewer_resultSet = dbSendQuery(mydb, 'select * from reviewer')
reviewer_df <- fetch(reviewer_resultSet)
reviewer_df
##   reviewer_id reviewer_name gender
## 1           1           Tom      M
## 2           2         Alice      F
## 3           3        Nathan      M
## 4           4           Dan      M
## 5           5          Lucy      F

Rating Data

rating_resultSet = dbSendQuery(mydb, 'select * from rating')
rating_df <- fetch(rating_resultSet)
head(rating_df,10)
##    reviewer_id movie_id rating
## 1            1        1    6.5
## 2            1        2    6.0
## 3            1        3    7.0
## 4            1        4    8.0
## 5            1        5    9.0
## 6            1        6    7.5
## 7            2        1    7.0
## 8            2        2    6.5
## 9            2        3    7.0
## 10           2        4    8.5

Joining Movie Data and Rating Data

reviewer_movie_rating <- left_join(movie_df,rating_df,by.x ="movie_id",by.y = "movie_id")
## Joining, by = "movie_id"

Now, checking the combined dataframe of movie and rating data

head(reviewer_movie_rating,30)
##    movie_id                       movie_name reviewer_id rating
## 1         1          JUMANJI: THE NEXT LEVEL           1    6.5
## 2         1          JUMANJI: THE NEXT LEVEL           2    7.0
## 3         1          JUMANJI: THE NEXT LEVEL           3    7.7
## 4         1          JUMANJI: THE NEXT LEVEL           4    7.5
## 5         1          JUMANJI: THE NEXT LEVEL           5    7.0
## 6         2                Spies in Disguise           1    6.0
## 7         2                Spies in Disguise           2    6.5
## 8         2                Spies in Disguise           3    6.8
## 9         2                Spies in Disguise           4    6.5
## 10        2                Spies in Disguise           5    6.5
## 11        3                         Dolittle           1    7.0
## 12        3                         Dolittle           2    7.0
## 13        3                         Dolittle           3    7.0
## 14        3                         Dolittle           4    7.5
## 15        3                         Dolittle           5    7.5
## 16        4                        Fronzen 2           1    8.0
## 17        4                        Fronzen 2           2    8.5
## 18        4                        Fronzen 2           3    7.0
## 19        4                        Fronzen 2           4    8.0
## 20        4                        Fronzen 2           5    8.0
## 21        5 Star Wars: The Rise of Skywalker           1    9.0
## 22        5 Star Wars: The Rise of Skywalker           2    7.0
## 23        5 Star Wars: The Rise of Skywalker           3    8.0
## 24        5 Star Wars: The Rise of Skywalker           4    7.0
## 25        5 Star Wars: The Rise of Skywalker           5    5.5
## 26        6                             1917           1    7.5
## 27        6                             1917           2    9.0
## 28        6                             1917           3    7.0
## 29        6                             1917           4     NA
## 30        6                             1917           5     NA

Hanling Missing Values

reviewer_movie_rating %>%
  filter(is.na(rating))
##   movie_id movie_name reviewer_id rating
## 1        6       1917           4     NA
## 2        6       1917           5     NA

Imputing the missing rating values with mean rating value

reviewer_rating_cleaned_df <- reviewer_movie_rating %>%
                            mutate(
                              rating = chron::times(rating),
                              rating = if_else(is.na(rating), mean(rating), rating)
                              )
reviewer_rating_cleaned_df
##    movie_id                       movie_name reviewer_id   rating
## 1         1          JUMANJI: THE NEXT LEVEL           1 6.500000
## 2         1          JUMANJI: THE NEXT LEVEL           2 7.000000
## 3         1          JUMANJI: THE NEXT LEVEL           3 7.700000
## 4         1          JUMANJI: THE NEXT LEVEL           4 7.500000
## 5         1          JUMANJI: THE NEXT LEVEL           5 7.000000
## 6         2                Spies in Disguise           1 6.000000
## 7         2                Spies in Disguise           2 6.500000
## 8         2                Spies in Disguise           3 6.800000
## 9         2                Spies in Disguise           4 6.500000
## 10        2                Spies in Disguise           5 6.500000
## 11        3                         Dolittle           1 7.000000
## 12        3                         Dolittle           2 7.000000
## 13        3                         Dolittle           3 7.000000
## 14        3                         Dolittle           4 7.500000
## 15        3                         Dolittle           5 7.500000
## 16        4                        Fronzen 2           1 8.000000
## 17        4                        Fronzen 2           2 8.500000
## 18        4                        Fronzen 2           3 7.000000
## 19        4                        Fronzen 2           4 8.000000
## 20        4                        Fronzen 2           5 8.000000
## 21        5 Star Wars: The Rise of Skywalker           1 9.000000
## 22        5 Star Wars: The Rise of Skywalker           2 7.000000
## 23        5 Star Wars: The Rise of Skywalker           3 8.000000
## 24        5 Star Wars: The Rise of Skywalker           4 7.000000
## 25        5 Star Wars: The Rise of Skywalker           5 5.500000
## 26        6                             1917           1 7.500000
## 27        6                             1917           2 9.000000
## 28        6                             1917           3 7.000000
## 29        6                             1917           4 7.267857
## 30        6                             1917           5 7.267857

Now, checking for the reviewer rating after the data imputation.

movie_rating <- reviewer_rating_cleaned_df %>%
    group_by(movie_name,movie_id) %>%
    summarize(reviewer_rating =mean(rating)) %>%
    arrange(desc(reviewer_rating))
movie_rating
## # A tibble: 6 x 3
## # Groups:   movie_name [6]
##   movie_name                       movie_id reviewer_rating
##   <chr>                               <int> <S3: times>    
## 1 Fronzen 2                               4 7.900000       
## 2 1917                                    6 7.607143       
## 3 Star Wars: The Rise of Skywalker        5 7.300000       
## 4 Dolittle                                3 7.200000       
## 5 JUMANJI: THE NEXT LEVEL                 1 7.140000       
## 6 Spies in Disguise                       2 6.460000

Conclusion

The three different datasets of movie, reiviewer and rating data fetched from SQL database “DATA607_MOVIEDB” and have been analyzed to find the final average score of each movie rating. After handling the missing values present in the reviewer dataset it’s been found that “Frozen 2” topped the list and “Spies in Disguise”" finished last.