I begin by establishing connection.
library(RMySQL)
## Loading required package: DBI
con <- dbConnect(MySQL(), user = "user007", password = "123456789", host = "DESKTOP-1GKOF0M", dbname = "movierate")
The user, password, host, and dbname will depend on how your system is set up.
Now let’s see what the tables look like.
dbGetQuery(con, "SELECT * FROM movies")
## movieid movie movie_description
## 1 1 Wonder Woman DC come back movie
## 2 2 Spider-Man Homecoming Marvel fail movie
## 3 3 IT Not so scary
## 4 4 Terminator 2 Classic in 3D
## 5 5 Despicable Me 3 Same old same old
## 6 6 The Mummy 2017 A joke
## 7 7 Star Wars: The Last Jedi Not out yet
dbGetQuery(con, "SELECT * FROM names")
## nameid last_name first_name
## 1 1 Washington Gigi
## 2 2 Jefferson Simon
## 3 3 Lincoln Yuenfei
## 4 4 Kennedy Xin
## 5 5 Roosevelt Davis
## 6 6 No One
dbGetQuery(con, "SELECT * FROM ratings")
## ratingid movieid nameid rating
## 1 1 1 1 5
## 2 2 1 2 4
## 3 3 1 3 5
## 4 4 1 4 5
## 5 5 1 5 4
## 6 6 2 1 1
## 7 7 2 2 3
## 8 8 2 3 4
## 9 9 2 4 1
## 10 10 2 5 1
## 11 11 3 1 3
## 12 12 3 2 2
## 13 13 3 3 1
## 14 14 3 4 3
## 15 15 3 5 2
## 16 16 4 1 3
## 17 17 4 2 1
## 18 18 4 3 2
## 19 19 4 4 1
## 20 20 4 5 2
## 21 21 5 1 3
## 22 22 5 2 2
## 23 23 5 3 2
## 24 24 5 4 1
## 25 25 5 5 1
## 26 26 6 1 2
## 27 27 6 2 1
## 28 28 6 3 1
## 29 29 6 4 1
## 30 30 6 5 3
Notice that “Star Wars: The Last Jedi” is not out yet, so no one rated it, and in the names table, there is a dummy person named “No One” who has not rated any movies.
I want to join the tables together. Here I use two LEFT JOIN statement.
querry1 <- "SELECT movie, last_name, first_name, rating
FROM movies
LEFT JOIN ratings on movies.movieid = ratings.movieid
LEFT JOIN names on ratings.nameid = names.nameid
ORDER BY movie, rating DESC"
joinTable1 <- dbGetQuery(con, querry1)
joinTable1
## movie last_name first_name rating
## 1 Despicable Me 3 Washington Gigi 3
## 2 Despicable Me 3 Jefferson Simon 2
## 3 Despicable Me 3 Lincoln Yuenfei 2
## 4 Despicable Me 3 Kennedy Xin 1
## 5 Despicable Me 3 Roosevelt Davis 1
## 6 IT Washington Gigi 3
## 7 IT Kennedy Xin 3
## 8 IT Jefferson Simon 2
## 9 IT Roosevelt Davis 2
## 10 IT Lincoln Yuenfei 1
## 11 Spider-Man Homecoming Lincoln Yuenfei 4
## 12 Spider-Man Homecoming Jefferson Simon 3
## 13 Spider-Man Homecoming Washington Gigi 1
## 14 Spider-Man Homecoming Kennedy Xin 1
## 15 Spider-Man Homecoming Roosevelt Davis 1
## 16 Star Wars: The Last Jedi <NA> <NA> NA
## 17 Terminator 2 Washington Gigi 3
## 18 Terminator 2 Lincoln Yuenfei 2
## 19 Terminator 2 Roosevelt Davis 2
## 20 Terminator 2 Jefferson Simon 1
## 21 Terminator 2 Kennedy Xin 1
## 22 The Mummy 2017 Roosevelt Davis 3
## 23 The Mummy 2017 Washington Gigi 2
## 24 The Mummy 2017 Jefferson Simon 1
## 25 The Mummy 2017 Lincoln Yuenfei 1
## 26 The Mummy 2017 Kennedy Xin 1
## 27 Wonder Woman Washington Gigi 5
## 28 Wonder Woman Lincoln Yuenfei 5
## 29 Wonder Woman Kennedy Xin 5
## 30 Wonder Woman Jefferson Simon 4
## 31 Wonder Woman Roosevelt Davis 4
Notice that the Star Wars movie is listed with NA as value. And the dummy person No One is not on this table. This is because LEFT JOIN preserves only the left side table of the join statement.
Let’s try RIGHT JOIN. It should preserve the right side table of the join statement.
querry2 <- "SELECT movie, last_name, first_name, rating
FROM movies
RIGHT JOIN ratings on movies.movieid = ratings.movieid
RIGHT JOIN names on ratings.nameid = names.nameid
ORDER BY movie, rating DESC"
joinTable2 <- dbGetQuery(con, querry2)
joinTable2
## movie last_name first_name rating
## 1 <NA> No One NA
## 2 Despicable Me 3 Washington Gigi 3
## 3 Despicable Me 3 Jefferson Simon 2
## 4 Despicable Me 3 Lincoln Yuenfei 2
## 5 Despicable Me 3 Kennedy Xin 1
## 6 Despicable Me 3 Roosevelt Davis 1
## 7 IT Washington Gigi 3
## 8 IT Kennedy Xin 3
## 9 IT Jefferson Simon 2
## 10 IT Roosevelt Davis 2
## 11 IT Lincoln Yuenfei 1
## 12 Spider-Man Homecoming Lincoln Yuenfei 4
## 13 Spider-Man Homecoming Jefferson Simon 3
## 14 Spider-Man Homecoming Washington Gigi 1
## 15 Spider-Man Homecoming Kennedy Xin 1
## 16 Spider-Man Homecoming Roosevelt Davis 1
## 17 Terminator 2 Washington Gigi 3
## 18 Terminator 2 Lincoln Yuenfei 2
## 19 Terminator 2 Roosevelt Davis 2
## 20 Terminator 2 Jefferson Simon 1
## 21 Terminator 2 Kennedy Xin 1
## 22 The Mummy 2017 Roosevelt Davis 3
## 23 The Mummy 2017 Washington Gigi 2
## 24 The Mummy 2017 Jefferson Simon 1
## 25 The Mummy 2017 Lincoln Yuenfei 1
## 26 The Mummy 2017 Kennedy Xin 1
## 27 Wonder Woman Washington Gigi 5
## 28 Wonder Woman Lincoln Yuenfei 5
## 29 Wonder Woman Kennedy Xin 5
## 30 Wonder Woman Jefferson Simon 4
## 31 Wonder Woman Roosevelt Davis 4
As expected, Star Wars movie is dropped from the table, but the dummpy person No One is listed eventhough that person has not rated any movies.
Here I am interested in finding the average rating of each movie, first using SQL, then compare with the result using R code.
Using SQL
querry <- "SELECT movie, AVG(rating) AS avg_rating
FROM movies
LEFT JOIN ratings on movies.movieid = ratings.movieid
LEFT JOIN names on ratings.nameid = names.nameid
GROUP BY movie
ORDER BY avg_rating DESC"
avgTable <- dbGetQuery(con, querry)
## Warning in .local(conn, statement, ...): Decimal MySQL column 1 imported as
## numeric
avgTable
## movie avg_rating
## 1 Wonder Woman 4.6
## 2 IT 2.2
## 3 Spider-Man Homecoming 2.0
## 4 Terminator 2 1.8
## 5 Despicable Me 3 1.8
## 6 The Mummy 2017 1.6
## 7 Star Wars: The Last Jedi NA
Using R
tble <- aggregate(joinTable1$rating, by = list(joinTable1$movie), mean)
names(tble) <- c("movie", "avg_rating")
tble[order(tble$avg_rating, decreasing = TRUE), ]
## movie avg_rating
## 7 Wonder Woman 4.6
## 2 IT 2.2
## 3 Spider-Man Homecoming 2.0
## 1 Despicable Me 3 1.8
## 5 Terminator 2 1.8
## 6 The Mummy 2017 1.6
## 4 Star Wars: The Last Jedi NA
The results are the same.