Overview

In this assignment, I design a movie ratings database and import database onto a Google Cloud MySQL instance. I provide links to the SQL scripts to create the schema and data. I demonstrate the use of RMySQL to provide connectivity. I display the raw data and end by providing illustrative queries against the normalized database tables.

Goals Of A Movie Ratings Database

This section defines the Movies Ratings database business objectives and scope.

The Movies Ratings database allows us to store an inventory of movies, reviewers and their associated reviews. Each review assign a unique viewer to a movie for which a score from 1 to 5 is assigned. Each review is date-stamped. This allows one person to rate many movies. Moreover, a person may rate one movie multiple times. This design provides considerably flexiblity.

Lastly, we configure this database on Google Cloud Services MySQL Instance.

Defining the SQL Schema

The SQL schema was defined in a simple SQL script.

The script has been posted to Github where it can be found at:

https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/Week2/smallimdb.sql

We create 4 tables * movies * viewers * ratings * reviews

Each table has an integer primary key as an id. The reviews table has foreign key constraints to the other 3 tables.

No use of autoincremented primary keys is used for simplicity of the design.

The data used to populate the database is contained in the SQL script uploaded to Github here: https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/Week2/smallimdbData.sql

MySQL workbench was used to run the SQL scripts to create the schema and import the database.

Connecting to the Database in Google Cloud

I set up the Google Database instance and learned it is hosted at IP address of 35.239.199.156. To ensure privacy and reproducibility of the code, I obscure the password from the HTML knitted output but allow the code to run to completion.

Thus, the password is only stored in the RMD file from which the password is manually deleted.

The connection to the MysQL instance uses the RMySQL package which works effectively both with cloud and local database instances.

library(RMySQL)

con = dbConnect(RMySQL::MySQL(), dbname="smallimdb", 
                host="35.239.199.156", user="trialuser",  password=hiddenPassword)

dbGetInfo(con)
## $host
## [1] "35.239.199.156"
## 
## $user
## [1] "trialuser"
## 
## $dbname
## [1] "smallimdb"
## 
## $conType
## [1] "35.239.199.156 via TCP/IP"
## 
## $serverVersion
## [1] "5.7.14-google-log"
## 
## $protocolVersion
## [1] 10
## 
## $threadId
## [1] 2053
## 
## $rsId
## list()

Now we list the tables contained in the movies rating database by using the dbListTables() command.

dbListTables(con)
## [1] "movies"  "ratings" "reviews" "viewers"

Displaying the raw database tables

Now we display the database contents. Note that most movies are recent blockbusters except for a classic arthouse film, “The Seventh Seal.” Reviewers also rated between 1 and 3 movies each.

knitr::kable( dbGetQuery(con, "SELECT * FROM movies;" ), format="markdown")
movieid title director year studio
1 Mary Poppins Returns Rob Marshall 2018 Walt Disney Studios
2 Bohemian Rhapsody Bryan Singer, Dexter Fletcher 2018 20th Century Fox
3 Black Panther Ryan Coogler 2018 Walt Disney Studios
4 Fantastic Beasts: The Crimes of Grindelwald David Yates 2018 Warner Bros. Pictures
5 Mission: Impossible - Fallout Christopher McQuarrie 2018 Paramount Pictures
6 The Seventh Seal Ingmar Bergman 1957 AB Svensk Filmindustri
knitr::kable( dbGetQuery(con, "SELECT * FROM reviews;" ), format="markdown")
reviewid movieid viewerid score reviewdate
1 1 1 2 2019-02-07
2 2 2 2 2019-02-05
3 3 3 3 2019-01-05
4 4 4 4 2019-01-11
5 5 5 2 2019-02-07
6 6 5 1 2019-02-01
7 4 1 3 2019-02-01
8 1 2 3 2019-02-04
9 6 5 2 2019-02-04
knitr::kable( dbGetQuery(con, "SELECT * FROM viewers;" ), format="markdown")
viewerid lastname firstname
1 Wang Y
2 Johnson Fred
3 Thakkar Ossie
4 Astrom Sophie
5 Anderson Birgit

Displaying Some Reports on the Data

We end the assignment by displaying 2 reports on the data illustrating the relational links between the reviewers and movies.

The report below shows the highest ranking movies at the top sorted in descending order across the population of reviews.

The Seventh Seal is the highest rated movie, followed by Mission: Impossible - Fallout.

query =
"select M.movieid, M.title, M.director, M.year, M.studio, 
AVG( R.score ) as AvgScore, COUNT( R.ReviewID) as NumReviews
from reviews R INNER JOIN 
movies M ON ( R.movieid = M.movieid ) 
GROUP BY M.movieid, M.title, M.director, M.year, M.studio
ORDER BY AvgScore ASC;"

knitr::kable( dbGetQuery(con, query), format="markdown")
movieid title director year studio AvgScore NumReviews
6 The Seventh Seal Ingmar Bergman 1957 AB Svensk Filmindustri 1.5 2
5 Mission: Impossible - Fallout Christopher McQuarrie 2018 Paramount Pictures 2.0 1
2 Bohemian Rhapsody Bryan Singer, Dexter Fletcher 2018 20th Century Fox 2.0 1
1 Mary Poppins Returns Rob Marshall 2018 Walt Disney Studios 2.5 2
3 Black Panther Ryan Coogler 2018 Walt Disney Studios 3.0 1
4 Fantastic Beasts: The Crimes of Grindelwald David Yates 2018 Warner Bros. Pictures 3.5 2

We can also examine the behavior of the reviewers by looking at how their scores appear in the cross section.

query=
"select V.* , AVG(R.score), COUNT(R.score)
FROM
reviews R INNER JOIN
viewers V  on (V.viewerid = R.viewerid )
GROUP BY V.viewerid"
  
knitr::kable(dbGetQuery(con, query), format="markdown")
viewerid lastname firstname AVG(R.score) COUNT(R.score)
1 Wang Y 2.5000 2
2 Johnson Fred 2.5000 2
3 Thakkar Ossie 3.0000 1
4 Astrom Sophie 4.0000 1
5 Anderson Birgit 1.6667 3

Finally, we disconnect from the database to tidy up our resource utilization.

dbDisconnect(con)
## [1] TRUE