Running this markdown

  1. Download the data607_illien_assignment2.sql file found in the Github repository and save it in your directory: https://github.com/maelillien/data607-assignment2
  2. Start MySQLWorkbench and load the script data607_illien_assignment2.sql
  3. Set the movies schema as the default schema and run the script
  4. Run code below replacing user and password with your own credentials

Install required packages

install.packages("RMySQL", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/qp/xyrhr9kx6md2dyjyq9w6grrh0000gn/T//Rtmpv90qur/downloaded_packages

Loading required Libraries

library(RMySQL)
## Loading required package: DBI

Connecting to database server

We create a connection to the MySQL server and load the movies data in the dataframe ‘movies’

mydb = dbConnect(MySQL(), user = user, password = password, dbname = 'movies', host = 'localhost')
movies <- dbGetQuery(mydb, "select * from movies")

Displaying data

There are 36 observations in this data set representing 6 movies titles and 6 reviewers. A decent portion of the reviews are NULL since not every movie had been seen by all participants. The data set is in tidy data format with one observation per line.

summary(movies)
##    review_id        title             reviewer             rating  
##  Min.   : 1.00   Length:36          Length:36          Min.   :2   
##  1st Qu.: 9.75   Class :character   Class :character   1st Qu.:3   
##  Median :18.50   Mode  :character   Mode  :character   Median :4   
##  Mean   :18.50                                         Mean   :4   
##  3rd Qu.:27.25                                         3rd Qu.:5   
##  Max.   :36.00                                         Max.   :5   
##                                                        NA's   :11
movies
##    review_id         title reviewer rating
## 1          1          Roma    Jason      5
## 2          2          Roma    Bruce      3
## 3          3          Roma     Elie      3
## 4          4          Roma   George      4
## 5          5          Roma    Brian     NA
## 6          6          Roma    Frank     NA
## 7          7         Venom    Jason      3
## 8          8         Venom    Bruce     NA
## 9          9         Venom     Elie      4
## 10        10         Venom   George      3
## 11        11         Venom    Brian      2
## 12        12         Venom    Frank      3
## 13        13     John Wick    Jason      4
## 14        14     John Wick    Bruce     NA
## 15        15     John Wick     Elie      5
## 16        16     John Wick   George      4
## 17        17     John Wick    Brian      3
## 18        18     John Wick    Frank     NA
## 19        19 The Lion King    Jason      5
## 20        20 The Lion King    Bruce      5
## 21        21 The Lion King     Elie      5
## 22        22 The Lion King    Brian     NA
## 23        23 The Lion King   George     NA
## 24        24 The Lion King    Frank      5
## 25        25      Avengers    Jason      5
## 26        26      Avengers    Bruce      4
## 27        27      Avengers     Elie      5
## 28        28      Avengers   George      5
## 29        29      Avengers    Brian      5
## 30        30      Avengers    Frank      4
## 31        31 Hobbes & Shaw    Jason      3
## 32        32 Hobbes & Shaw    Bruce     NA
## 33        33 Hobbes & Shaw     Elie      3
## 34        34 Hobbes & Shaw   George     NA
## 35        35 Hobbes & Shaw    Brian     NA
## 36        36 Hobbes & Shaw    Frank     NA

Removing null values

Here we omit the NA values from the data.

movies <- na.omit(movies)
movies
##    review_id         title reviewer rating
## 1          1          Roma    Jason      5
## 2          2          Roma    Bruce      3
## 3          3          Roma     Elie      3
## 4          4          Roma   George      4
## 7          7         Venom    Jason      3
## 9          9         Venom     Elie      4
## 10        10         Venom   George      3
## 11        11         Venom    Brian      2
## 12        12         Venom    Frank      3
## 13        13     John Wick    Jason      4
## 15        15     John Wick     Elie      5
## 16        16     John Wick   George      4
## 17        17     John Wick    Brian      3
## 19        19 The Lion King    Jason      5
## 20        20 The Lion King    Bruce      5
## 21        21 The Lion King     Elie      5
## 24        24 The Lion King    Frank      5
## 25        25      Avengers    Jason      5
## 26        26      Avengers    Bruce      4
## 27        27      Avengers     Elie      5
## 28        28      Avengers   George      5
## 29        29      Avengers    Brian      5
## 30        30      Avengers    Frank      4
## 31        31 Hobbes & Shaw    Jason      3
## 33        33 Hobbes & Shaw     Elie      3

Simple analysis

Let’s take a look at the average reviews for each movies title by running another MySQL query.

movies_avg_sql <- dbGetQuery(mydb, "select title, avg(rating) from movies group by title")
movies_avg_sql
##           title avg(rating)
## 1          Roma      3.7500
## 2         Venom      3.0000
## 3     John Wick      4.0000
## 4 The Lion King      5.0000
## 5      Avengers      4.6667
## 6 Hobbes & Shaw      3.0000

Let’s get the averages in R directly, but sorted by title this time.

aggregate(movies[, 4], list(movies$title), mean)
##         Group.1        x
## 1      Avengers 4.666667
## 2 Hobbes & Shaw 3.000000
## 3     John Wick 4.000000
## 4          Roma 3.750000
## 5 The Lion King 5.000000
## 6         Venom 3.000000