In this assignment, I design a movie ratings database and import database onto a Google Cloud MySQL instance. I provide links to the SQL scripts to create the schema and data. I demonstrate the use of RMySQL to provide connectivity. I display the raw data and end by providing illustrative queries against the normalized database tables.
This section defines the Movies Ratings database business objectives and scope.
The Movies Ratings database allows us to store an inventory of movies, reviewers and their associated reviews. Each review assign a unique viewer to a movie for which a score from 1 to 5 is assigned. Each review is date-stamped. This allows one person to rate many movies. Moreover, a person may rate one movie multiple times. This design provides considerably flexiblity.
Lastly, we configure this database on Google Cloud Services MySQL Instance.
The SQL schema was defined in a simple SQL script.
The script has been posted to Github where it can be found at:
https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/Week2/smallimdb.sql
We create 4 tables * movies * viewers * ratings * reviews
Each table has an integer primary key as an id. The reviews table has foreign key constraints to the other 3 tables.
No use of autoincremented primary keys is used for simplicity of the design.
The data used to populate the database is contained in the SQL script uploaded to Github here: https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/Week2/smallimdbData.sql
MySQL workbench was used to run the SQL scripts to create the schema and import the database.
I set up the Google Database instance and learned it is hosted at IP address of 35.239.199.156. To ensure privacy and reproducibility of the code, I obscure the password from the HTML knitted output but allow the code to run to completion.
Thus, the password is only stored in the RMD file from which the password is manually deleted.
The connection to the MysQL instance uses the RMySQL package which works effectively both with cloud and local database instances.
library(RMySQL)
con = dbConnect(RMySQL::MySQL(), dbname="smallimdb",
host="35.239.199.156", user="trialuser", password=hiddenPassword)
dbGetInfo(con)
## $host
## [1] "35.239.199.156"
##
## $user
## [1] "trialuser"
##
## $dbname
## [1] "smallimdb"
##
## $conType
## [1] "35.239.199.156 via TCP/IP"
##
## $serverVersion
## [1] "5.7.14-google-log"
##
## $protocolVersion
## [1] 10
##
## $threadId
## [1] 2053
##
## $rsId
## list()
Now we list the tables contained in the movies rating database by using the dbListTables() command.
dbListTables(con)
## [1] "movies" "ratings" "reviews" "viewers"
Now we display the database contents. Note that most movies are recent blockbusters except for a classic arthouse film, “The Seventh Seal.” Reviewers also rated between 1 and 3 movies each.
knitr::kable( dbGetQuery(con, "SELECT * FROM movies;" ), format="markdown")
movieid | title | director | year | studio |
---|---|---|---|---|
1 | Mary Poppins Returns | Rob Marshall | 2018 | Walt Disney Studios |
2 | Bohemian Rhapsody | Bryan Singer, Dexter Fletcher | 2018 | 20th Century Fox |
3 | Black Panther | Ryan Coogler | 2018 | Walt Disney Studios |
4 | Fantastic Beasts: The Crimes of Grindelwald | David Yates | 2018 | Warner Bros. Pictures |
5 | Mission: Impossible - Fallout | Christopher McQuarrie | 2018 | Paramount Pictures |
6 | The Seventh Seal | Ingmar Bergman | 1957 | AB Svensk Filmindustri |
knitr::kable( dbGetQuery(con, "SELECT * FROM reviews;" ), format="markdown")
reviewid | movieid | viewerid | score | reviewdate |
---|---|---|---|---|
1 | 1 | 1 | 2 | 2019-02-07 |
2 | 2 | 2 | 2 | 2019-02-05 |
3 | 3 | 3 | 3 | 2019-01-05 |
4 | 4 | 4 | 4 | 2019-01-11 |
5 | 5 | 5 | 2 | 2019-02-07 |
6 | 6 | 5 | 1 | 2019-02-01 |
7 | 4 | 1 | 3 | 2019-02-01 |
8 | 1 | 2 | 3 | 2019-02-04 |
9 | 6 | 5 | 2 | 2019-02-04 |
knitr::kable( dbGetQuery(con, "SELECT * FROM viewers;" ), format="markdown")
viewerid | lastname | firstname |
---|---|---|
1 | Wang | Y |
2 | Johnson | Fred |
3 | Thakkar | Ossie |
4 | Astrom | Sophie |
5 | Anderson | Birgit |
We end the assignment by displaying 2 reports on the data illustrating the relational links between the reviewers and movies.
The report below shows the highest ranking movies at the top sorted in descending order across the population of reviews.
The Seventh Seal is the highest rated movie, followed by Mission: Impossible - Fallout.
query =
"select M.movieid, M.title, M.director, M.year, M.studio,
AVG( R.score ) as AvgScore, COUNT( R.ReviewID) as NumReviews
from reviews R INNER JOIN
movies M ON ( R.movieid = M.movieid )
GROUP BY M.movieid, M.title, M.director, M.year, M.studio
ORDER BY AvgScore ASC;"
knitr::kable( dbGetQuery(con, query), format="markdown")
movieid | title | director | year | studio | AvgScore | NumReviews |
---|---|---|---|---|---|---|
6 | The Seventh Seal | Ingmar Bergman | 1957 | AB Svensk Filmindustri | 1.5 | 2 |
5 | Mission: Impossible - Fallout | Christopher McQuarrie | 2018 | Paramount Pictures | 2.0 | 1 |
2 | Bohemian Rhapsody | Bryan Singer, Dexter Fletcher | 2018 | 20th Century Fox | 2.0 | 1 |
1 | Mary Poppins Returns | Rob Marshall | 2018 | Walt Disney Studios | 2.5 | 2 |
3 | Black Panther | Ryan Coogler | 2018 | Walt Disney Studios | 3.0 | 1 |
4 | Fantastic Beasts: The Crimes of Grindelwald | David Yates | 2018 | Warner Bros. Pictures | 3.5 | 2 |
We can also examine the behavior of the reviewers by looking at how their scores appear in the cross section.
query=
"select V.* , AVG(R.score), COUNT(R.score)
FROM
reviews R INNER JOIN
viewers V on (V.viewerid = R.viewerid )
GROUP BY V.viewerid"
knitr::kable(dbGetQuery(con, query), format="markdown")
viewerid | lastname | firstname | AVG(R.score) | COUNT(R.score) |
---|---|---|---|---|
1 | Wang | Y | 2.5000 | 2 |
2 | Johnson | Fred | 2.5000 | 2 |
3 | Thakkar | Ossie | 3.0000 | 1 |
4 | Astrom | Sophie | 4.0000 | 1 |
5 | Anderson | Birgit | 1.6667 | 3 |
Finally, we disconnect from the database to tidy up our resource utilization.
dbDisconnect(con)
## [1] TRUE