Assignment 2 - SQL and R

Imports

The chunk below includes all the pertinent R packages for running this .Rmd file.

library("RMySQL")

## Loading required package: DBI

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Introduction

This document outlines the process that was taken to create a SQL database using raw csv data, and then load that data into an R dataframe. The latter part of this document then provides some additional analysis/querying to show how to answer various questions about the data in both SQL and R. The data that was used comprises movie ratings of 10 different movies taken from 10 different individuals (100 total records).

Methodology

Data Collection

The first step in the process was to come up with a data collection method to easily retrieve and store the ratings that were provided by the participants. In this case, a Google form was used, given the application’s simplicity in both creating a survey and collecting its results. The list below provides a summary of the questions and possible answers that were given on the survey:

Question	Possible Answers
Please enter your name	Open-ended (text)
What is your gender?	Male, Female, Non-Binary, Other
How old are you?	1-99
What is your job title?	Open-ended (text)
Rate the movie, “The Shining” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “It (2017)” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “Indiana Jones: Raiders of the Lost Ark” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “Avatar” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “Caddyshack” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “The Hangover” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “The Shawshank Redemption” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “The Social Network” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “Fahrenheit 9/11” on a scale of 1-5	1-5, I’ve never seen this movie
Rate the movie, “Jiro Dreams of Sushi” on a scale of 1-5	1-5, I’ve never seen this movie

The ten different movies that each participant will rate are shown above, and were chosen for a number of reasons:

They are all “well-known” movies
They comprise a variety of genres (horror, action/adventure, comedy, drama, documentary)
They span a relatively large range of release years (1980-2017)

Once the survey was created, it was sent out to 10 individuals and the results were recorded in Google Forms application. These individuals spanned a range of both genders and ages.

Data Cleaning

The Google Forms application allows you to download the results of a Google Form as a csv file which takes the following format:

Question 1	Question 2	Question 3	…
Participant 1 answer to Q1	Participant 1 answer to Q2	Participant 1 Answer to Q3	…
Participant 2 answer to Q1	Participant 2 answer to Q2	Participant 2 Answer to Q3	…
…	…	…	…

Additionally, all answers are stored as strings enclosed in double quote characters. While a helpful data collection tool, this formatting/structure is not the best for doing data analysis: the header comprises the actual questions, all the values are enclosed in strings, and it seems clunky (all of the data thrown together haphazardly into a single csv file). It would be beneficial to come up with some kind of architecture that would better suit the data, and allow for easier analysis. Thinking about this in terms of an SQL database, one option is to set up three different tables that would combine all the information present in this csv file into three different tables:

Table Name	Description	Fields
respondents	Contains all the personal information provided by each respondent	respondent_name, gender, age, job_title
movies	Contains metadata pertaining to the movies used in the survey	movie_name, genre, release_date
ratings	Contains the ratings provided by each respondent	respondent_name, movie_name, rating

Each of the field names provided above can be sourced from the initial csv file, with the exception of release_date, and genre which must be researched and inputted manually. One way to create these three tables is to clean the data so that we have three new csv files containing the required fields. To create these three files, the following python script was used:

filename_str = 'DATA_607-Assignment_2-Movie_Ratings_Survey.csv'

ratings_list = []
people_list = []
movies_list = []

with open(filename_str, 'r') as file_to_read:
    lines = file_to_read.readlines()
    for line in lines[1:]: # skip header w/ question titles
        line_arr = line.split(',')
        name = line_arr[0].replace('"', '')
        gender = line_arr[1].replace('"', '')
        age = line_arr[2].replace('"', '')
        job = line_arr[3].replace('"', '')
        shining_rating = line_arr[4].replace('"', '')
        it_rating = line_arr[5].replace('"', '')
        raiders_rating = line_arr[6].replace('"', '')
        avatar_rating = line_arr[7].replace('"', '')
        caddy_rating = line_arr[8].replace('"', '')
        hangover_rating = line_arr[9].replace('"', '')
        ssr_rating = line_arr[10].replace('"', '')
        sn_rating = line_arr[11].replace('"', '')
        f911_rating = line_arr[12].replace('"', '')
        jdos_rating = line_arr[13].replace('"', '')
        people_list.append([name, gender, age, job])
        ratings_list.append([name, 'The Shining', shining_rating])
        ratings_list.append([name, 'It (2017)', it_rating])
        ratings_list.append([name, 'Indiana Jones: Raiders of the Lost Ark', raiders_rating])
        ratings_list.append([name, 'Avatar', avatar_rating])
        ratings_list.append([name, 'Caddyshack', caddy_rating])
        ratings_list.append([name, 'The Hangover', hangover_rating])
        ratings_list.append([name, 'The Shawshank Redemption', ssr_rating])
        ratings_list.append([name, 'The Social Network', sn_rating])
        ratings_list.append([name, 'Fahrenheit 9/11', f911_rating])
        ratings_list.append([name, 'Jiro Dreams of Sushi', jdos_rating.replace('\n', '')])

movies_list.append(['The Shining', 'Horror', '1980'])
movies_list.append(['It (2017)', 'Horror', '2017'])
movies_list.append(['Indiana Jones: Raiders of the Lost Ark', 'Action/Adventure', '1981'])
movies_list.append(['Avatar', 'Action/Adventure', '2009'])
movies_list.append(['Caddyshack', 'Comedy', '1980'])
movies_list.append(['The Hangover', 'Comedy', '2009'])
movies_list.append(['The Shawshank Redemption', 'Drama', '1994'])
movies_list.append(['The Social Network', 'Drama', '2010'])
movies_list.append(['Fahrenheit 9/11', 'Documentary', '2004'])
movies_list.append(['Jiro Dreams of Sushi', 'Documentary', '2011'])

movies_data = []
people_data = []
rating_data = []
for row in movies_list:
    movies_data.append(",".join(row)+ '\n')

for row in ratings_list:
    rating_data.append(",".join(row) + '\n')

for row in people_list:
    people_data.append(",".join(row)+ '\n')

with open('respondents.csv', 'w') as file_to_write:
    file_to_write.writelines(people_data)

with open('movies.csv', 'w') as file_to_write:
    file_to_write.writelines(movies_data)

with open('ratings.csv', 'w') as file_to_write:
    file_to_write.writelines(rating_data)

The script above accomplishes a number of data cleaning steps:

Removes the header of the initial csv file (containing all the questions),
removes the double quotes around each value,
parses the file to find the information that will populate the ratings and respondents tables,
manually inputs the information needed to populate the movies table (information pulled from IMDB),
outputs three different csv files: respondents.csv, movies.csv, ratings.csv, each corresponding to the data we want contained in our three SQL tables.

Creating the SQL Database

The next step is to use an SQL script to turn the previously created csv files into actual SQL tables, thus creating a complete database. In this case, MySQL was used and before running any SQL scripts a schema named movie_survey was created in the MySQL Workbench GUI. Instructions for creating a new schema can be found here.

Once a schema has been created to load the data into, the following SQL script can be used to create the movies, respondents, and ratings tables:

/*
  movie_survey.sql
*/

#################################################################################################
/* Load movies.csv file */

DROP TABLE IF EXISTS movies;

CREATE TABLE movies 
(
  movie_name varchar(100) NOT NULL,
  genre varchar(100) NOT NULL,
  release_year int NOT NULL
);

LOAD DATA INFILE '/Users/williamjasmine/repos/CUNY_SPS_DS/DATA_607/Homeworks/HW2/movies.csv' 
INTO TABLE movies
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
(movie_name, genre, release_year);


################################################################################################
/* Load respondents.csv file */

DROP TABLE IF EXISTS respondents;

CREATE TABLE respondents
(
  respondent_name varchar(100) NOT NULL,
  gender varchar(100) NOT NULL,
  age int NOT NULL,
  job_title varchar(100) NOT NULL
);

LOAD DATA INFILE '/Users/williamjasmine/repos/CUNY_SPS_DS/DATA_607/Homeworks/HW2/respondents.csv' 
INTO TABLE respondents
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
(respondent_name, gender, age, job_title);


################################################################################################
/* Load ratings.csv file */

DROP TABLE IF EXISTS ratings;

CREATE TABLE ratings 
(
  respondent_name varchar(100) NOT NULL,
  movie_name varchar(100) NOT NULL,
  rating int NULL
);

LOAD DATA INFILE '/Users/williamjasmine/repos/CUNY_SPS_DS/DATA_607/Homeworks/HW2/ratings.csv' 
INTO TABLE ratings
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
(respondent_name, movie_name, @rating)
SET
rating = nullif(@rating, "I've never seen this movie.")
;

The above script creates the movies, respondents, and ratings table and inserts them into the movie_survey schema. The script includes specifics about the data types of the fields in each table, but the field names are the same as those in the table from the previous subsection. The sctipt also carries out one last data cleaning step: because “I’ve never seen this movie” was a possible answer when rating the movies, the script searches for any time this string occurs and inputs those values as null. As such, the rating column can be cast as an integer field.

Loading Data Into R

Now that the data from the survey exists in a MySQL database, it can be loaded directly into R. The following R chunk uses the RMySQL package to connect to the database. The password has been hidden using R markdown parameters.

mydb <- dbConnect(MySQL(), user='williamjasmine', dbname='movie_survey', password=params$pw)

Now that there’s a connection to the database, the dbGetQuery() function can be used to get the results from actual SQL queries and store them as R dataframes. The following code chunk uses queries to select all data from the movies, ratings, and respondents tables.

movies_df <- dbGetQuery(mydb, 'SELECT * FROM movies')
respondents_df <- dbGetQuery(mydb, 'SELECT * FROM respondents')
ratings_df <- dbGetQuery(mydb, 'SELECT * FROM ratings')

Next, the glimpse() function can be used to get some information about each of the data frames, and ensure they are loaded correctly.

glimpse(movies_df)

## Rows: 10
## Columns: 3
## $ movie_name   <chr> "The Shining", "It (2017)", "Indiana Jones: Raiders of th…
## $ genre        <chr> "Horror", "Horror", "Action/Adventure", "Action/Adventure…
## $ release_year <int> 1980, 2017, 1981, 2009, 1980, 2009, 1994, 2010, 2004, 2011

glimpse(respondents_df)

## Rows: 10
## Columns: 4
## $ respondent_name <chr> "William", "Maha", "Donna", "Theodore", "Michelle", "R…
## $ gender          <chr> "Male", "Female", "Female", "Male", "Female", "Male", …
## $ age             <int> 28, 27, 68, 89, 37, 40, 52, 31, 22, 41
## $ job_title       <chr> "Data Scientist", "Teacher", "Retired", "Retired", "Pu…

glimpse(ratings_df)

## Rows: 100
## Columns: 3
## $ respondent_name <chr> "William", "William", "William", "William", "William",…
## $ movie_name      <chr> "The Shining", "It (2017)", "Indiana Jones: Raiders of…
## $ rating          <int> 5, 4, 5, 3, 4, 4, 5, 3, 4, 4, 1, 2, 3, 4, 2, 4, 3, 4, …

Based on the above results, it is clear that all the data has been loaded correctly into R: movies_df contains ten movies, respondents_df contains ten people, and ratings_df contains one hundred ratings.

Some Analysis

Now that we have our data in R, we can use it to answer some questions…

What is the Highest and Lowest Rated Movie?

In R:

ratings_df %>% 
  filter(!is.na(rating)) %>%
    group_by(movie_name) %>% 
      summarise(avg_score = mean(rating)) %>%
        arrange(desc(avg_score))

## # A tibble: 10 × 2
##    movie_name                             avg_score
##    <chr>                                      <dbl>
##  1 The Shawshank Redemption                    4.11
##  2 The Hangover                                4   
##  3 Indiana Jones: Raiders of the Lost Ark      3.8 
##  4 Fahrenheit 9/11                             3.78
##  5 Jiro Dreams of Sushi                        3.67
##  6 The Social Network                          3.5 
##  7 Caddyshack                                  3.33
##  8 It (2017)                                   3.25
##  9 Avatar                                      3.22
## 10 The Shining                                 2.88

In SQL:

query <- "
SELECT
  movie_name,
  AVG(rating) as average_rating
FROM ratings
GROUP BY movie_name
ORDER BY 2 DESC
"
results_df <- dbGetQuery(mydb, query)

## Warning in .local(conn, statement, ...): Decimal MySQL column 1 imported as
## numeric

results_df

##                                movie_name average_rating
## 1                The Shawshank Redemption         4.1111
## 2                            The Hangover         4.0000
## 3  Indiana Jones: Raiders of the Lost Ark         3.8000
## 4                         Fahrenheit 9/11         3.7778
## 5                    Jiro Dreams of Sushi         3.6667
## 6                      The Social Network         3.5000
## 7                              Caddyshack         3.3333
## 8                               It (2017)         3.2500
## 9                                  Avatar         3.2222
## 10                            The Shining         2.8750

We see that both the R and SQL results are the same: The Shawshank Redemption was the most highy rated movie, while “The Shining” was the lowest rated movie.

Which movie was the most popular for people under 50?

In R:

ratings_df %>%
  left_join(respondents_df, by="respondent_name") %>%
      filter(!is.na(rating) & age < 50) %>%
        group_by(movie_name) %>% 
          summarise(avg_score = mean(rating)) %>%
            arrange(desc(avg_score)) %>%
              slice(which.max(avg_score))

## # A tibble: 1 × 2
##   movie_name   avg_score
##   <chr>            <dbl>
## 1 The Hangover      4.29

In SQL:

query <- "
SELECT
    movie_name,
    AVG(rating) as avg_rating
FROM ratings 
LEFT JOIN respondents
USING (respondent_name)
WHERE age < 50
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1
"
results_df <- dbGetQuery(mydb, query)

## Warning in .local(conn, statement, ...): Decimal MySQL column 1 imported as
## numeric

results_df

##     movie_name avg_rating
## 1 The Hangover     4.2857

The above results confirm in both R and SQL that for people under 50, the top rated movie is “The Hangover.”

Conclusion

The above steps outlined the entire process of generating a database from a set of survey questions, loading that data as R data frames, and giving a glimpse into how analysis might be done to answer some simple questions. While there are definitely parts of this process that could be simplified (or automated), these scripts and code can be rerun as people continue to fill out the Google Form and can be updated if necessary.

Extra Credit - Global Baseline Estimate

Problem Description: Most recommender systems use personalized algorithms like “content management” and “item-item collaborative filtering.” Sometimes non-personalized recommenders are also useful or necessary. One of the best non-personalized recommender system algorithms is the “Global Baseline Estimate.

The job here is to use the survey data collected and write the R code that makes a movie recommendation using the Global Baseline Estimate algorithm.

Solution:

First, find the respondents who have unrated items:

ratings_df %>%
  group_by(respondent_name) %>% 
    summarise(num_unrated_movies = sum(is.na(rating))) %>%
       arrange(desc(num_unrated_movies))

## # A tibble: 10 × 2
##    respondent_name num_unrated_movies
##    <chr>                        <int>
##  1 Teal                             5
##  2 Max                              4
##  3 Theodore                         4
##  4 Donna                            3
##  5 Jarobi                           2
##  6 Michelle                         2
##  7 Jay                              1
##  8 Maha                             0
##  9 Rob                              0
## 10 William                          0

According to the output above, Teal has the most unrated items. These unrated items are outputted by the R chunk below:

ratings_df %>%
  filter(is.na(rating) == TRUE & respondent_name == "Teal") %>%
    pull(var = 'movie_name')

## [1] "The Shining"          "It (2017)"            "Caddyshack"          
## [4] "Fahrenheit 9/11"      "Jiro Dreams of Sushi"

The above five movies were unrated by Teal, one of which can be randomly chosen to provide a recommendation for. In this case, “The Shining” will be used.

Creating a movie recommendation using the Global Baseline Estimate requires both the average rating for each movie, and the average rating provided by each user. The former is calculated below and stored as a dataframe called avg_movie_ratings:

avg_movie_ratings <- 
  ratings_df %>% 
    filter(!is.na(rating)) %>%
      group_by(movie_name) %>% 
        summarise(avg_rating = mean(rating)) %>%
          arrange(desc(avg_rating))
avg_movie_ratings

## # A tibble: 10 × 2
##    movie_name                             avg_rating
##    <chr>                                       <dbl>
##  1 The Shawshank Redemption                     4.11
##  2 The Hangover                                 4   
##  3 Indiana Jones: Raiders of the Lost Ark       3.8 
##  4 Fahrenheit 9/11                              3.78
##  5 Jiro Dreams of Sushi                         3.67
##  6 The Social Network                           3.5 
##  7 Caddyshack                                   3.33
##  8 It (2017)                                    3.25
##  9 Avatar                                       3.22
## 10 The Shining                                  2.88

The chunk below calculates the average rating provided by each user and stores it as a dataframe called avg_user_rating:

avg_user_ratings <- 
  ratings_df %>% 
    filter(!is.na(rating)) %>%
      group_by(respondent_name) %>% 
        summarise(avg_rating = mean(rating)) %>%
          arrange(desc(avg_rating))
avg_user_ratings

## # A tibble: 10 × 2
##    respondent_name avg_rating
##    <chr>                <dbl>
##  1 Rob                   4.3 
##  2 Theodore              4.17
##  3 William               4.1 
##  4 Max                   4   
##  5 Michelle              3.5 
##  6 Donna                 3.43
##  7 Jarobi                3.38
##  8 Jay                   3.22
##  9 Teal                  3.2 
## 10 Maha                  2.7

The next step is to calculate the average movie rating provided by all users for all movies. This is done in the code chunk below, and the value is stored in variable called total_avg_rating:

total_avg_rating <-
  ratings_df %>%
    filter(is.na(rating) == FALSE) %>%
      pull(var='rating') %>%
        mean()
total_avg_rating

## [1] 3.594937

Next, the difference between both the user and movie ratings averages and the total_avg_rating is calculated. This is done in the cell below, by adding a mean_diff column to both the avg_user_ratings and avg_movie_ratings dataframes.

avg_user_ratings <-
  avg_user_ratings %>% 
    mutate(mean_diff = avg_rating - total_avg_rating)

avg_movie_ratings <- 
  avg_movie_ratings %>% 
    mutate(mean_diff = avg_rating - total_avg_rating)

glimpse(avg_user_ratings)

## Rows: 10
## Columns: 3
## $ respondent_name <chr> "Rob", "Theodore", "William", "Max", "Michelle", "Donn…
## $ avg_rating      <dbl> 4.300000, 4.166667, 4.100000, 4.000000, 3.500000, 3.42…
## $ mean_diff       <dbl> 0.70506329, 0.57172996, 0.50506329, 0.40506329, -0.094…

glimpse(avg_movie_ratings)

## Rows: 10
## Columns: 3
## $ movie_name <chr> "The Shawshank Redemption", "The Hangover", "Indiana Jones:…
## $ avg_rating <dbl> 4.111111, 4.000000, 3.800000, 3.777778, 3.666667, 3.500000,…
## $ mean_diff  <dbl> 0.51617440, 0.40506329, 0.20506329, 0.18284107, 0.07172996,…

All the components to determine a recommendation via a Global Baseline Estimate have now been determined, and all that remains is the final calculation.

To provide a user x a recommendation for movie y, we can define \(\mu_x\) and \(\mu_y\) as the average user and movie ratings, respectively. To calculate the Global Baseline Estimate \(GBE\) we can then use the following equation:

\[GBE=\mu+(\mu_x-\mu)+(\mu_y-\mu)\] where \(\mu\) is the average movie rating across all movies and users (total_avg_rating in this case). The following code cell translates the above equation into R for our example, in which x=The Shining and y=Teal:

movie = 'The Shining'
user = 'Teal'

# first_term
total_avg_rating +

# second term
avg_movie_ratings %>% 
  filter(movie_name == movie) %>%
    pull(var = mean_diff) +

# third term  
avg_user_ratings %>% 
  filter(respondent_name == user) %>%
    pull(var = mean_diff)

## [1] 2.480063

Based off the results of the output above, a recommender system implementing a Global Baseline Estimate methodology would predict Teal’s rating for the The Shining to be ~ 2.48.