Lab Week 2: SQL and R - Movie Survey

Note: A lot of this work should probably be done in SQL, and the data should be loaded from files instead of manually loaded via script. However, as I’m trying to get more comfortable with R and am already competent with SQL, this assignment will be done entirely from R markdown script.

## Overall Plan:

1. Create PopMovie Schema
2. Create Movie table
3. Create Respondent table
4. Create Rating table
5. Load responses into R dataframe
6. Show r dataframe
7. Analysis - Average Rating by Movie
8. Analysis - Average Rating by Respondent
9. Analysis - Graph

## Prepare MySQL connection
Load library RMySQL to enable connecting to MySQL database via RStudio.

library(RMySQL)

## Loading required package: DBI

Note that difficulties were encountered connecting from RStudio to the latest version of MySQL(8.0) when attempted in Sept. 2018. Based on feedback from others who encountered the same error, an earlier version of MySQL (5.7) was installed. With 5.7, connectivity worked using the following R code.

mydb = dbConnect(MySQL(), user='root', password='password', host='127.0.0.1')

Create a new schema called “popmovies” in the MySQL database from R. In MySQL, schemas and databases are basically synonymous. This makes me nervous. Note that is commented out in the code, as the database was already created during testing.

dbSendQuery(mydb, 'DROP SCHEMA IF EXISTS`popmovies` ;')

## <MySQLResult:309911952,0,0>

dbSendQuery(mydb, 'CREATE SCHEMA `popmovies` ;')

## <MySQLResult:0,0,1>

Tell R to use that schema for future activity. Previous SQL work had us change the default schema in the MySQL workbench. I don’t want to do that from R - or know if if it’s possible.

dbSendQuery(mydb, 'USE`popmovies` ;')

## <MySQLResult:311672088,0,2>

## Create movies table
Drop “movies” table on the popmovies schema if it exist - just in case

dbSendQuery(mydb, 'DROP TABLE IF EXISTS movies;')

## <MySQLResult:0,0,3>

Create a “movies” table on the popmovies schema.

dbSendQuery(mydb, 'CREATE TABLE movies (
    Movie_ID int NOT NULL AUTO_INCREMENT,
    Movie_Name varchar(255) NOT NULL,
    PRIMARY KEY (Movie_ID)
);')

## <MySQLResult:0,0,4>

Populate the “movies” table on the popmovies schema.The Movie_ID field will be automatically generated.

dbSendQuery(mydb, "INSERT INTO movies
    (Movie_Name)
VALUES
    ('Deadpool 2'),
    ('Mission: Impossible - Fallout'),
    ('Crazy Rich Asians'),
    ('Ocean''s 8'),
    ('BlacKkKlansman'),
    ('Searching');
    ")

## <MySQLResult:302743632,0,5>

Go through same progression to create and populate a respondent table.
Drop “respondents” table on the popmovies schema if it exist - just in case.

dbSendQuery(mydb, 'DROP TABLE IF EXISTS respondents;')

## <MySQLResult:336574560,0,6>

## Create respondents table
Create a “respondents” table on the popmovies schema.

dbSendQuery(mydb, 'CREATE TABLE respondents (
    Respondent_ID int NOT NULL AUTO_INCREMENT,
    Respondent_Name varchar(255) NOT NULL,
    PRIMARY KEY (Respondent_ID)
);')

## <MySQLResult:1,0,7>

Populate the “respondents” table on the popmovies schema.The Respondent_ID field will be automatically generated.

dbSendQuery(mydb, "INSERT INTO respondents
    (Respondent_Name)
VALUES
    ('Wife'),
    ('Friend 1'),
    ('Friend 2'),
    ('Brother 1'),
    ('Brother 2');
    ")

## <MySQLResult:1,0,8>

##Create surveys table
Now to the survery table, which will compile ratings for the movies by those surveyed. We could do a compound primary key with the Movie_ID and Respondent_ID, but best practices would likely dictate creating a Response_ID and then including foreign key restraints.

Drop “survey” table on the popmovies schema if it exist - just in case.

dbSendQuery(mydb, 'DROP TABLE IF EXISTS surveys;')

## <MySQLResult:1,0,9>

Create a “surveys” table on the popmovies schema. In a perfect world, we would probably put an input control on the Survey_Rating field to ensure it falls within the 1-5 scale. Note that we’ll allow NULLs in case the respondent has not seen the movie.

dbSendQuery(mydb, 'CREATE TABLE surveys (
    Survey_ID int NOT NULL AUTO_INCREMENT,
    Movie_ID int NOT NULL,
    Respondent_ID int NOT NULL,
    Survey_Rating int,
    PRIMARY KEY (Survey_ID),
    FOREIGN KEY (Movie_ID)
      REFERENCES movies(Movie_ID),
    FOREIGN KEY (Respondent_ID)
      REFERENCES respondents(Respondent_ID)
);')

## <MySQLResult:1,0,10>

Populate the “respondents” table on the popmovies schema.The Respondent_ID field will be automatically generated.

dbSendQuery(mydb, "INSERT INTO surveys
    (Movie_ID,Respondent_ID,Survey_Rating)
VALUES
    (1,1,3),
    (1,2,3),
    (1,3,NULL),
    (1,4,5),  
    (1,5,2),
    (2,1,5),
    (2,2,5),
    (2,3,NULL),
    (2,4,4),  
    (2,5,4),
    (3,1,5),
    (3,2,NULL),
    (3,3,NULL),
    (3,4,2),  
    (3,5,5),
    (4,1,4),
    (4,2,NULL),
    (4,3,2),
    (4,4,2),  
    (4,5,NULL),
    (5,1,NULL),
    (5,2,4),
    (5,3,NULL),
    (5,4,NULL),  
    (5,5,5),
    (6,1,5),
    (6,2,NULL),
    (6,3,NULL),
    (6,4,NULL),  
    (6,5,NULL);
    ")

## <MySQLResult:1,0,11>

## Load dataframe

df_moviesurvey <- dbGetQuery(mydb, "SELECT 
  m.Movie_Name
, r.Respondent_Name
, s.Survey_Rating
FROM popmovies.surveys s
    INNER JOIN popmovies.movies m ON m.Movie_ID = s.Movie_ID
    INNER JOIN popmovies.respondents r ON r.Respondent_ID = s.Respondent_ID;")

Show the df summary just to verify load.

summary(df_moviesurvey)

##   Movie_Name        Respondent_Name    Survey_Rating  
##  Length:30          Length:30          Min.   :2.000  
##  Class :character   Class :character   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :4.000  
##                                        Mean   :3.824  
##                                        3rd Qu.:5.000  
##                                        Max.   :5.000  
##                                        NA's   :13

##Analysis

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Plot the average rating by movie, excluding the non-response NULLs.Searching has the highest rating, but this summary does not convey non-responses in any way.

ggplot(df_moviesurvey, aes(x=factor(df_moviesurvey$Movie_Name), y=df_moviesurvey$Survey_Rating)) + stat_summary(fun.y="mean", geom="bar") + ggtitle("Plot of movie ratings (1-5) by movie") +
  xlab("Movie") + ylab("Rating (1-5)")

## Warning: Removed 13 rows containing non-finite values (stat_summary).

Ugly histogram.

ggplot(df_moviesurvey,aes(x=df_moviesurvey$Survey_Rating,group=df_moviesurvey$Movie_Name,fill=df_moviesurvey$Movie_Name))+
  geom_histogram(position="dodge",binwidth=0.25)+theme_bw()

## Warning: Removed 13 rows containing non-finite values (stat_bin).

Plot the average rating by respondent, excluding the non-response NULLs. Friend 2 is obviously picky.

ggplot(df_moviesurvey, aes(x=factor(df_moviesurvey$Respondent_Name), y=df_moviesurvey$Survey_Rating)) + geom_boxplot() + ggtitle("Plot of movie ratings (1-5) by respondent") +
  xlab("Respondent") + ylab("Rating (1-5)")

## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

Lab Week 2: SQL and R - Movie Survey

Jeff Littlejohn

September 9, 2018