What is required to run this assignment?

  1. 3 csv files containing the data:
    1. names.csv
    2. movies.csv
    3. ratings.csv
  2. MySQL Workbench server setup and running
  3. R package RMySQL installed

Loading Data

I begin by establishing connection.

library(RMySQL)
## Loading required package: DBI
con <- dbConnect(MySQL(), user = "user007", password = "123456789", host = "DESKTOP-1GKOF0M", dbname = "movierate")

The user, password, host, and dbname will depend on how your system is set up.

Now let’s see what the tables look like.

dbGetQuery(con, "SELECT * FROM movies")
##   movieid                    movie  movie_description
## 1       1             Wonder Woman DC come back movie
## 2       2    Spider-Man Homecoming  Marvel fail movie
## 3       3                       IT       Not so scary
## 4       4             Terminator 2      Classic in 3D
## 5       5          Despicable Me 3  Same old same old
## 6       6           The Mummy 2017             A joke
## 7       7 Star Wars: The Last Jedi        Not out yet
dbGetQuery(con, "SELECT * FROM names")
##   nameid  last_name first_name
## 1      1 Washington       Gigi
## 2      2  Jefferson      Simon
## 3      3    Lincoln    Yuenfei
## 4      4    Kennedy        Xin
## 5      5  Roosevelt      Davis
## 6      6         No        One
dbGetQuery(con, "SELECT * FROM ratings")
##    ratingid movieid nameid rating
## 1         1       1      1      5
## 2         2       1      2      4
## 3         3       1      3      5
## 4         4       1      4      5
## 5         5       1      5      4
## 6         6       2      1      1
## 7         7       2      2      3
## 8         8       2      3      4
## 9         9       2      4      1
## 10       10       2      5      1
## 11       11       3      1      3
## 12       12       3      2      2
## 13       13       3      3      1
## 14       14       3      4      3
## 15       15       3      5      2
## 16       16       4      1      3
## 17       17       4      2      1
## 18       18       4      3      2
## 19       19       4      4      1
## 20       20       4      5      2
## 21       21       5      1      3
## 22       22       5      2      2
## 23       23       5      3      2
## 24       24       5      4      1
## 25       25       5      5      1
## 26       26       6      1      2
## 27       27       6      2      1
## 28       28       6      3      1
## 29       29       6      4      1
## 30       30       6      5      3

Notice that “Star Wars: The Last Jedi” is not out yet, so no one rated it, and in the names table, there is a dummy person named “No One” who has not rated any movies.

Join Table

I want to join the tables together. Here I use two LEFT JOIN statement.

querry1 <- "SELECT movie, last_name, first_name, rating
            FROM movies 
            LEFT JOIN ratings on movies.movieid = ratings.movieid
            LEFT JOIN names on ratings.nameid = names.nameid
            ORDER BY movie, rating DESC"

joinTable1 <- dbGetQuery(con, querry1)
joinTable1
##                       movie  last_name first_name rating
## 1           Despicable Me 3 Washington       Gigi      3
## 2           Despicable Me 3  Jefferson      Simon      2
## 3           Despicable Me 3    Lincoln    Yuenfei      2
## 4           Despicable Me 3    Kennedy        Xin      1
## 5           Despicable Me 3  Roosevelt      Davis      1
## 6                        IT Washington       Gigi      3
## 7                        IT    Kennedy        Xin      3
## 8                        IT  Jefferson      Simon      2
## 9                        IT  Roosevelt      Davis      2
## 10                       IT    Lincoln    Yuenfei      1
## 11    Spider-Man Homecoming    Lincoln    Yuenfei      4
## 12    Spider-Man Homecoming  Jefferson      Simon      3
## 13    Spider-Man Homecoming Washington       Gigi      1
## 14    Spider-Man Homecoming    Kennedy        Xin      1
## 15    Spider-Man Homecoming  Roosevelt      Davis      1
## 16 Star Wars: The Last Jedi       <NA>       <NA>     NA
## 17             Terminator 2 Washington       Gigi      3
## 18             Terminator 2    Lincoln    Yuenfei      2
## 19             Terminator 2  Roosevelt      Davis      2
## 20             Terminator 2  Jefferson      Simon      1
## 21             Terminator 2    Kennedy        Xin      1
## 22           The Mummy 2017  Roosevelt      Davis      3
## 23           The Mummy 2017 Washington       Gigi      2
## 24           The Mummy 2017  Jefferson      Simon      1
## 25           The Mummy 2017    Lincoln    Yuenfei      1
## 26           The Mummy 2017    Kennedy        Xin      1
## 27             Wonder Woman Washington       Gigi      5
## 28             Wonder Woman    Lincoln    Yuenfei      5
## 29             Wonder Woman    Kennedy        Xin      5
## 30             Wonder Woman  Jefferson      Simon      4
## 31             Wonder Woman  Roosevelt      Davis      4

Notice that the Star Wars movie is listed with NA as value. And the dummy person No One is not on this table. This is because LEFT JOIN preserves only the left side table of the join statement.

Let’s try RIGHT JOIN. It should preserve the right side table of the join statement.

querry2 <- "SELECT movie, last_name, first_name, rating
            FROM movies 
            RIGHT JOIN ratings on movies.movieid = ratings.movieid
            RIGHT JOIN names on ratings.nameid = names.nameid
            ORDER BY movie, rating DESC"

joinTable2 <- dbGetQuery(con, querry2)
joinTable2
##                    movie  last_name first_name rating
## 1                   <NA>         No        One     NA
## 2        Despicable Me 3 Washington       Gigi      3
## 3        Despicable Me 3  Jefferson      Simon      2
## 4        Despicable Me 3    Lincoln    Yuenfei      2
## 5        Despicable Me 3    Kennedy        Xin      1
## 6        Despicable Me 3  Roosevelt      Davis      1
## 7                     IT Washington       Gigi      3
## 8                     IT    Kennedy        Xin      3
## 9                     IT  Jefferson      Simon      2
## 10                    IT  Roosevelt      Davis      2
## 11                    IT    Lincoln    Yuenfei      1
## 12 Spider-Man Homecoming    Lincoln    Yuenfei      4
## 13 Spider-Man Homecoming  Jefferson      Simon      3
## 14 Spider-Man Homecoming Washington       Gigi      1
## 15 Spider-Man Homecoming    Kennedy        Xin      1
## 16 Spider-Man Homecoming  Roosevelt      Davis      1
## 17          Terminator 2 Washington       Gigi      3
## 18          Terminator 2    Lincoln    Yuenfei      2
## 19          Terminator 2  Roosevelt      Davis      2
## 20          Terminator 2  Jefferson      Simon      1
## 21          Terminator 2    Kennedy        Xin      1
## 22        The Mummy 2017  Roosevelt      Davis      3
## 23        The Mummy 2017 Washington       Gigi      2
## 24        The Mummy 2017  Jefferson      Simon      1
## 25        The Mummy 2017    Lincoln    Yuenfei      1
## 26        The Mummy 2017    Kennedy        Xin      1
## 27          Wonder Woman Washington       Gigi      5
## 28          Wonder Woman    Lincoln    Yuenfei      5
## 29          Wonder Woman    Kennedy        Xin      5
## 30          Wonder Woman  Jefferson      Simon      4
## 31          Wonder Woman  Roosevelt      Davis      4

As expected, Star Wars movie is dropped from the table, but the dummpy person No One is listed eventhough that person has not rated any movies.

Average Rating

Here I am interested in finding the average rating of each movie, first using SQL, then compare with the result using R code.

Using SQL

querry <- "SELECT movie, AVG(rating) AS avg_rating
            FROM movies 
            LEFT JOIN ratings on movies.movieid = ratings.movieid
            LEFT JOIN names on ratings.nameid = names.nameid
            GROUP BY movie
            ORDER BY avg_rating DESC"

avgTable <- dbGetQuery(con, querry)
## Warning in .local(conn, statement, ...): Decimal MySQL column 1 imported as
## numeric
avgTable
##                      movie avg_rating
## 1             Wonder Woman        4.6
## 2                       IT        2.2
## 3    Spider-Man Homecoming        2.0
## 4             Terminator 2        1.8
## 5          Despicable Me 3        1.8
## 6           The Mummy 2017        1.6
## 7 Star Wars: The Last Jedi         NA

Using R

tble <- aggregate(joinTable1$rating, by = list(joinTable1$movie), mean)
names(tble) <- c("movie", "avg_rating")
tble[order(tble$avg_rating, decreasing = TRUE), ]
##                      movie avg_rating
## 7             Wonder Woman        4.6
## 2                       IT        2.2
## 3    Spider-Man Homecoming        2.0
## 1          Despicable Me 3        1.8
## 5             Terminator 2        1.8
## 6           The Mummy 2017        1.6
## 4 Star Wars: The Last Jedi         NA

The results are the same.