Overview

In this assignment, we are asked to create a database capturing the reviews of six movies by five of our friends. We store the data in a SQL database and then use the associated packages and functions within R to retrieve the data and load into an R dataframe.

Tasks

  1. Design database
  2. Import database into MySQL
  3. Read database from MySQL into an R dataframe

Database design

I identified two entity tables - Movies and Reviewers. Their relationship is defined as “Movies are rated by Reviewers.” Since each movie may be reviewed by multiple reviewers and each reviewer may review multiple movies, the two tables have a many-to-many relationship. This relationship requires a third table called Ratings, identified by the primary keys of the Movie and Reviewer tables.

Movie Review Design

Movie Review Design

In addition to basic information, the database captures additional information about each movie and reviewer that may be of interest for further analysis.

Import Database into MySQL

I entered the data for each table into separate .csv files. In MySQL, I created a new schema called “movies” and created the three tables within that schema.

The SQL script and associated .csv files used to create the MySQL database are uploaded to the github repository below:

https://github.com/stipton/CUNY-SPS/tree/DATA607-Week-2-Assignment-021118

Read database from MySQL into an R dataframe

Since I am working with MySQL as my SQL database, I installed the RMySQL package into R Studio to connect the two systems. I connected to the database:

library(RMySQL)
## Loading required package: DBI
driver <- dbDriver("MySQL")
con <- dbConnect(driver, 
                 user = "root", 
                 password = password, ## password hidden in previous code 
                 dbname = "movies")

With the connection established, I can explore the functions of the RMySQL package:

# view tables
dbListTables(con)
## [1] "movies"    "ratings"   "reviewers"
# check table existence
dbExistsTable(con, "movies")
## [1] TRUE
# display table in R
dbReadTable(con, "movies")
##   movieID               title MPAARating length
## 1       1  The Sound of Music          G    174
## 2       2       Sweet Charity          G    149
## 3       3             Cabaret         PG    124
## 4       4 Fiddler on the Roof          G    181
## 5       5             Chicago      PG-13    113
## 6       6                Rent      PG-13    135

Using the dbGetQuery function allows me to pull data from the SQL database with plain SQL language.

sql_fullDataSet <- "SELECT M.*, REV.*, numStars, comments
        FROM (movies M JOIN ratings RAT
        ON M.movieID = RAT.movieID) 
        JOIN reviewers REV ON RAT.reviewerID = REV.reviewerID
        ORDER BY M.movieID, REV.reviewerID"
(fullDataSet <- dbGetQuery(con, sql_fullDataSet))
##    movieID               title MPAARating length reviewerID firstName
## 1        1  The Sound of Music          G    174          A   Patrick
## 2        1  The Sound of Music          G    174          C      Mike
## 3        2       Sweet Charity          G    149          A   Patrick
## 4        2       Sweet Charity          G    149          B      Kate
## 5        3             Cabaret         PG    124          B      Kate
## 6        3             Cabaret         PG    124          D      Alex
## 7        3             Cabaret         PG    124          E  Caroline
## 8        4 Fiddler on the Roof          G    181          B      Kate
## 9        4 Fiddler on the Roof          G    181          C      Mike
## 10       4 Fiddler on the Roof          G    181          D      Alex
## 11       4 Fiddler on the Roof          G    181          E  Caroline
## 12       5             Chicago      PG-13    113          A   Patrick
## 13       5             Chicago      PG-13    113          B      Kate
## 14       5             Chicago      PG-13    113          C      Mike
## 15       5             Chicago      PG-13    113          D      Alex
## 16       5             Chicago      PG-13    113          E  Caroline
## 17       6                Rent      PG-13    135          A   Patrick
## 18       6                Rent      PG-13    135          C      Mike
## 19       6                Rent      PG-13    135          D      Alex
## 20       6                Rent      PG-13    135          E  Caroline
##    lastName age numStars                       comments
## 1     Hinds  39        3       Too long too many nuns\r
## 2    Jensen  40        5           Childhood favorite\r
## 3     Hinds  39        4             Love the dancing\r
## 4    Tipton  30        4                 Catchy music\r
## 5    Tipton  30        5                  Great story\r
## 6   Jonsson  28        4 Liza Minnelli is my favorite\r
## 7    Nguyen  32        4               Sexy and smart\r
## 8    Tipton  30        3                A little long\r
## 9    Jensen  40        5       Great acting and songs\r
## 10  Jonsson  28        1              Waaaay too long\r
## 11   Nguyen  32        5          My favorite musical\r
## 12    Hinds  39        5    Modern and sexy - love it\r
## 13   Tipton  30        2                  Too violent\r
## 14   Jensen  40        5 Best new musical of the 200s\r
## 15  Jonsson  28        3                  It was okay\r
## 16   Nguyen  32        2                  Too violent\r
## 17    Hinds  39        2              Poor adaptation\r
## 18   Jensen  40        3 Stage version is much better\r
## 19  Jonsson  28        5    Loved the gritty NYC feel\r
## 20   Nguyen  32        1        Not interesting to me\r

Explore the data in R

With the full data set now captured as an R dataframe, I can subset and examine with R functions:

longMovies <- subset(fullDataSet, length > 150)
shortMovies <- subset(fullDataSet, length <= 150)
mean(longMovies$numStars)
## [1] 3.666667
mean(shortMovies$numStars)
## [1] 3.5
topReviews <- subset(fullDataSet, numStars == 5)
topReviews
##    movieID               title MPAARating length reviewerID firstName
## 2        1  The Sound of Music          G    174          C      Mike
## 5        3             Cabaret         PG    124          B      Kate
## 9        4 Fiddler on the Roof          G    181          C      Mike
## 11       4 Fiddler on the Roof          G    181          E  Caroline
## 12       5             Chicago      PG-13    113          A   Patrick
## 14       5             Chicago      PG-13    113          C      Mike
## 19       6                Rent      PG-13    135          D      Alex
##    lastName age numStars                       comments
## 2    Jensen  40        5           Childhood favorite\r
## 5    Tipton  30        5                  Great story\r
## 9    Jensen  40        5       Great acting and songs\r
## 11   Nguyen  32        5          My favorite musical\r
## 12    Hinds  39        5    Modern and sexy - love it\r
## 14   Jensen  40        5 Best new musical of the 200s\r
## 19  Jonsson  28        5    Loved the gritty NYC feel\r