Assignment 2 - SQL and R

Introduction

For this assignment, I decided to poll some friends to rate various Tom Hanks’ movies. The main focus of this assignment will be to design a database, create and fill tables from csv files to the database, connect to the database from R, and do some analysis on the data.

Database Design

For this database, I decided to create three tables: movies, person and ratings. The tables and their respective columns can be seen below:

The movies table gives the movie id, movie name and year the movie was made. The person table provides a person id and name for each person involved in the poll. The ratings table has four columns which contain a rating id, movie id, person id, and rating for the movie.

The “PK” symbol stands for “Primary Key,” while “FK” stands for “Foreign Key.” From the table, we can see that the ratings table has two foreign keys, with each foreign key being a primary key of the other tables. With this set up, we can link these tables and create a new table that contains information from all tables.

In this example, this set up may seem a little over the top, since we can simply replace the ‘movie_id’ and ‘person_id’ with movie name and person name and just create one table titled ratings. However, the movies and person table could be more involved. For example, the movies table could contain many more columns such as genre, supporting actors, movie length, etc. However, this information may not be important to the ratings of the movies. Therefore, it is much more clean to create multiple tables that give the necessary information to describe that specific table.

SQL Code to Create/Fill Tables from CSV Files

Below is the SQL code that will create and fill the tables. The “DROP TABLE IF EXISTS” command will delete all tables from the database if they exist. This is good to have first, so that it clears out any of the tables that we are going to create.

Next, the “CREATE TABLE” line will create each table with the specified columns. In each of these statements we define a primary key of the table. In the ratings table we also define foreign keys, which reference the primary key of the other tables. This will allow us to join these tables together.

Finally, we have the “LOAD DATA INFILE” lines that load each csv into their respective tables.

DROP TABLE IF EXISTS ratings;
DROP TABLE IF EXISTS movies;
DROP TABLE IF EXISTS person;

CREATE TABLE movies (
    movie_id integer NOT NULL,
    movie_name VARCHAR(255) UNIQUE NOT NULL,
    movie_year integer NOT NULL,
    PRIMARY KEY (movie_id)
);

CREATE TABLE person (
    person_id integer NOT NULL,
    person_name VARCHAR(255) UNIQUE NOT NULL,
    PRIMARY KEY (person_id)
);

CREATE TABLE ratings (
    rating_id integer NOT NULL,
    movie_id integer NOT NULL,
    person_id integer NOT NULL,
    rating integer NOT NULL,
    PRIMARY KEY (rating_id),
    FOREIGN KEY (movie_id) REFERENCES movies (movie_id),
    FOREIGN KEY (person_id) REFERENCES person (person_id)
);

LOAD DATA INFILE '/tmp/movies.csv'
INTO TABLE movies
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

LOAD DATA INFILE '/tmp/person.csv'
INTO TABLE person
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

LOAD DATA INFILE '/tmp/ratings.csv'
INTO TABLE ratings
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

Connect to Movie Database and Create Data Frame

First, we need to load the libaries that are necessary for connection to my SQL (RMySQL), and allows us to query data from SQL (dplyr).

library(RMySQL)
library(dplyr)

Next, we need to make a connection to the MySQL server. For this, we use the dbConnect() function and store it in a variable called ‘con’:

con <- dbConnect(MySQL(), user='root', password=pwd, dbname='movie', host='localhost')

In the previous section, we saw how the tables were set up in the database. It would be useful for us to create a new table where each row lists a person, movie and their rating for this movie. This will require us to join all three tables and select these columns. This is done with the following R command that calls a SQL query:

sql.table <- tbl(con, sql("SELECT p.person_name, m.movie_name, r.rating
                           FROM ratings r
                           JOIN person p ON r.person_id = p.person_id
                           JOIN movies m ON r.movie_id = m.movie_id"))

Finally, we need to convert this table to a data frame and check the information provided.

movie.reviews <- as.data.frame(sql.table)

The structure of the data frame appears to be accurate, as it lists the reviewer name, movie name, and their rating.

head(movie.reviews)

  person_name          movie_name rating
1       Brett        Forrest Gump      4
2       Brett           Cast Away      1
3       Brett               Sully      4
4       Brett      The Green Mile      3
5       Brett        Philadelphia      2
6       Brett Saving Private Ryan      4

The dimensions show there are 30 rows and 3 columns. 30 rows is accurate since there six movies and five reviewers (6 x 5 = 30).

dim(movie.reviews)

[1] 30  3

It is also good practice to disconnect from the database when you are done using it. This command can be seen below:

dbDisconnect(con)

[1] TRUE

Data Analysis

The previous sections involved designing the database and placing this information into R through SQL. Now that we have the data we need in a data frame, it is time to conduct statistical analysis on it. In the following subsections, I will compare the average rating between movies, and the average rating given by each person.

Average Rating For Each Movie

This section focuses on comparing the average rating for each movie. In order to do this, we can use a simple function called aggregate(), which splits the data into subsets and then computes summary statistics for each. In this case, we can look at the mean.

mean.movie <- aggregate(rating ~ movie_name, movie.reviews, mean)
mean.movie

           movie_name rating
1           Cast Away    2.2
2        Forrest Gump    4.6
3        Philadelphia    2.8
4 Saving Private Ryan    4.4
5               Sully    3.6
6      The Green Mile    3.2

From the data provided in the ‘mean.movie’ data frame we can see that the highest rated movie was Forrest Gump, while the lowest rated movie was Cast Away.

To better visualize this data, we can create a bar plot that will provide us with the same conclusions, but in a more visual way.

library(ggplot2)
gginit.movie <- ggplot(mean.movie,aes(x=movie_name,y=rating))
plottype.movie <- geom_bar(stat="Identity",color='red',fill='pink',alpha=0.5)
plottheme.movie <- theme_bw()
gginit.movie + plottype.movie + plottheme.movie + xlab('Movie Name') + ylab('Average Rating') + ylim(0,5)

Average Rating Given By Each Person

In this section, we will focus on the ratings of the reviewers. We will use nearly identical code that was used in the last section. The creation of the data frame and the plot is shown below:

mean.person <- aggregate(rating ~ person_name, movie.reviews, mean)
mean.person

  person_name   rating
1       Brett 3.000000
2         Dan 3.666667
3         Joe 3.333333
4        Ross 4.000000
5        Ryan 3.333333

gginit.person <- ggplot(mean.person,aes(x=person_name,y=rating))
plottype.person <- geom_bar(stat="Identity",color='red',fill='pink',alpha=0.5)
plottheme.person <- theme_bw()
gginit.person + plottype.person + plottheme.person + xlab('Person Name') + ylab('Average Rating') + ylim(0,5)

We can see that the average rating by person ranges from 3 to 4. This range isn’t as large as when we grouped by movie, but there is a noticeable difference between the average rating by person.

The conclusions from the average rating by person are unclear from the data at hand. At first glance, it is tempting to say that Brett does not like Tom Hanks’ movies as much as Ross. However, we do not have Brett or Ross’ ratings on any other types of movies (i.e. Denzel Washington movies). Therefore, it may just be that Brett is a harder reviewer than Ross.