In this assignment, we are asked to create a database capturing the reviews of six movies by five of our friends. We store the data in a SQL database and then use the associated packages and functions within R to retrieve the data and load into an R dataframe.
I identified two entity tables - Movies and Reviewers. Their relationship is defined as “Movies are rated by Reviewers.” Since each movie may be reviewed by multiple reviewers and each reviewer may review multiple movies, the two tables have a many-to-many relationship. This relationship requires a third table called Ratings, identified by the primary keys of the Movie and Reviewer tables.
Movie Review Design
In addition to basic information, the database captures additional information about each movie and reviewer that may be of interest for further analysis.
I entered the data for each table into separate .csv files. In MySQL, I created a new schema called “movies” and created the three tables within that schema.
The SQL script and associated .csv files used to create the MySQL database are uploaded to the github repository below:
https://github.com/stipton/CUNY-SPS/tree/DATA607-Week-2-Assignment-021118
Since I am working with MySQL as my SQL database, I installed the RMySQL package into R Studio to connect the two systems. I connected to the database:
library(RMySQL)## Loading required package: DBI
driver <- dbDriver("MySQL")
con <- dbConnect(driver,
user = "root",
password = password, ## password hidden in previous code
dbname = "movies")With the connection established, I can explore the functions of the RMySQL package:
# view tables
dbListTables(con)## [1] "movies" "ratings" "reviewers"
# check table existence
dbExistsTable(con, "movies")## [1] TRUE
# display table in R
dbReadTable(con, "movies")## movieID title MPAARating length
## 1 1 The Sound of Music G 174
## 2 2 Sweet Charity G 149
## 3 3 Cabaret PG 124
## 4 4 Fiddler on the Roof G 181
## 5 5 Chicago PG-13 113
## 6 6 Rent PG-13 135
Using the dbGetQuery function allows me to pull data from the SQL database with plain SQL language.
sql_fullDataSet <- "SELECT M.*, REV.*, numStars, comments
FROM (movies M JOIN ratings RAT
ON M.movieID = RAT.movieID)
JOIN reviewers REV ON RAT.reviewerID = REV.reviewerID
ORDER BY M.movieID, REV.reviewerID"
(fullDataSet <- dbGetQuery(con, sql_fullDataSet))## movieID title MPAARating length reviewerID firstName
## 1 1 The Sound of Music G 174 A Patrick
## 2 1 The Sound of Music G 174 C Mike
## 3 2 Sweet Charity G 149 A Patrick
## 4 2 Sweet Charity G 149 B Kate
## 5 3 Cabaret PG 124 B Kate
## 6 3 Cabaret PG 124 D Alex
## 7 3 Cabaret PG 124 E Caroline
## 8 4 Fiddler on the Roof G 181 B Kate
## 9 4 Fiddler on the Roof G 181 C Mike
## 10 4 Fiddler on the Roof G 181 D Alex
## 11 4 Fiddler on the Roof G 181 E Caroline
## 12 5 Chicago PG-13 113 A Patrick
## 13 5 Chicago PG-13 113 B Kate
## 14 5 Chicago PG-13 113 C Mike
## 15 5 Chicago PG-13 113 D Alex
## 16 5 Chicago PG-13 113 E Caroline
## 17 6 Rent PG-13 135 A Patrick
## 18 6 Rent PG-13 135 C Mike
## 19 6 Rent PG-13 135 D Alex
## 20 6 Rent PG-13 135 E Caroline
## lastName age numStars comments
## 1 Hinds 39 3 Too long too many nuns\r
## 2 Jensen 40 5 Childhood favorite\r
## 3 Hinds 39 4 Love the dancing\r
## 4 Tipton 30 4 Catchy music\r
## 5 Tipton 30 5 Great story\r
## 6 Jonsson 28 4 Liza Minnelli is my favorite\r
## 7 Nguyen 32 4 Sexy and smart\r
## 8 Tipton 30 3 A little long\r
## 9 Jensen 40 5 Great acting and songs\r
## 10 Jonsson 28 1 Waaaay too long\r
## 11 Nguyen 32 5 My favorite musical\r
## 12 Hinds 39 5 Modern and sexy - love it\r
## 13 Tipton 30 2 Too violent\r
## 14 Jensen 40 5 Best new musical of the 200s\r
## 15 Jonsson 28 3 It was okay\r
## 16 Nguyen 32 2 Too violent\r
## 17 Hinds 39 2 Poor adaptation\r
## 18 Jensen 40 3 Stage version is much better\r
## 19 Jonsson 28 5 Loved the gritty NYC feel\r
## 20 Nguyen 32 1 Not interesting to me\r
With the full data set now captured as an R dataframe, I can subset and examine with R functions:
longMovies <- subset(fullDataSet, length > 150)
shortMovies <- subset(fullDataSet, length <= 150)
mean(longMovies$numStars)## [1] 3.666667
mean(shortMovies$numStars)## [1] 3.5
topReviews <- subset(fullDataSet, numStars == 5)
topReviews## movieID title MPAARating length reviewerID firstName
## 2 1 The Sound of Music G 174 C Mike
## 5 3 Cabaret PG 124 B Kate
## 9 4 Fiddler on the Roof G 181 C Mike
## 11 4 Fiddler on the Roof G 181 E Caroline
## 12 5 Chicago PG-13 113 A Patrick
## 14 5 Chicago PG-13 113 C Mike
## 19 6 Rent PG-13 135 D Alex
## lastName age numStars comments
## 2 Jensen 40 5 Childhood favorite\r
## 5 Tipton 30 5 Great story\r
## 9 Jensen 40 5 Great acting and songs\r
## 11 Nguyen 32 5 My favorite musical\r
## 12 Hinds 39 5 Modern and sexy - love it\r
## 14 Jensen 40 5 Best new musical of the 200s\r
## 19 Jonsson 28 5 Loved the gritty NYC feel\r