The goal of this assignment is to show how to import data from a SQL database into R.
Steps
1. Collect data on movie ratings in a CSV file
2. Import CSV file with movies and ratings into MySQL database
3. Import the DB into R
4. Explore the movie ratings
I have used Google forms to make a small survey of movie ratings among my friends. The survey is available here.
The data from survey responses flows into a Google Spreadsheet. The form of the spreadsheet is a default defined by Google.
This step requires you to have a working instance of MySQL Server and to download the form responses from the step 1 as a CSV file on your computer.
Run the following SQL script to populate a new table with the movie ratings information from the survey
-- Create the schema and the table
DROP SCHEMA IF EXISTS movieRatings;
CREATE SCHEMA `movieRatings`;
-- Create the table for movie ratings
CREATE TABLE movieRatings.ratings (
ts VARCHAR(50),
star_trek VARCHAR(2) NULL,
suicide_squad VARCHAR(2) NULL,
the_lobster VARCHAR(2) NULL,
pets VARCHAR(2) NULL,
deadpool VARCHAR(2) NULL,
batman_the_dark_knight VARCHAR(2) NULL,
pulp_fiction VARCHAR(2) NULL,
the_lord_of_the_rings VARCHAR(2) NULL
);
-- Populate the table for movie ratings
LOAD DATA LOCAL INFILE 'insert path to your downloaded CSV file here'
INTO TABLE movieRatings.ratings
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Note the structure of the table where every movie has its own column. This is due to the nature of the questionaire, as every movie was a new question.
This form is not good for data storage and analysis as it does not meet the requirements of the First Normal Form. In the future, the survey design can be improved to populate a normalized table.
For this and the next step to run in R, the following packages should be installed RMySQL, dplyr, tidyr, knitr, ggplot2.
Run the following R script to import the data from the MySQL database into an R data frame. I have used the following tutorial to write the code: https://www.r-bloggers.com/accessing-mysql-through-r/
library(RMySQL)
# Create a DB connection, adjust connection information as required
mydb = dbConnect(MySQL(), user='root', password='root', dbname='movieratings', host='localhost')
# Fetch all data from the ratings table via a Result Set variable
ratings = dbSendQuery(mydb, "select * from ratings")
data = fetch(ratings)
# Close the DB connection
dbDisconnect(mydb)
## [1] TRUE
Let us look into the top rows of the imported dataset
library(knitr)
kable(data[1:5,])
| ts | star_trek | suicide_squad | the_lobster | pets | deadpool | batman_the_dark_knight | pulp_fiction | the_lord_of_the_rings |
|---|---|---|---|---|---|---|---|---|
| 9/5/2016 23:00:17 | 4 | 3 | 5 | 4 | 4 | 5 | 4 | |
| 9/5/2016 23:06:07 | 3 | 1 | 5 | 5 | 3 | 4 | 4 | |
| 9/6/2016 0:35:36 | 3 | 2 | 5 | 1 | 2 | 1 | 3 | 4 |
| 9/6/2016 5:13:56 | 3 | 4 | 5 | |||||
| 9/6/2016 6:19:11 | 3 | 3 |
The data needs to be cleaned up and brought into the tidy format for analysis.
library(dplyr)
library(tidyr)
# Transpose the table into the tidy data format
data = as.tbl(data)
data = data %>% gather(movie, rating, star_trek:the_lord_of_the_rings)
# Fix the rating column type and drop missing values
data = mutate(data, rating = as.integer(rating)) %>% filter(complete.cases(rating))
After cleaning up, the top of the table looks like this:
kable(data[1:5,])
| ts | movie | rating |
|---|---|---|
| 9/5/2016 23:00:17 | star_trek | 4 |
| 9/6/2016 0:35:36 | star_trek | 3 |
| 9/6/2016 6:19:11 | star_trek | 3 |
| 9/6/2016 8:02:36 | star_trek | 4 |
| 9/6/2016 10:07:05 | star_trek | 4 |
Now let us first see how many ratings we have collected per movie to estimate how “objective” our view is.
table(data$movie)
##
## star_trek suicide_squad the_lobster
## 10 9 11
## pets deadpool batman_the_dark_knight
## 8 10 11
## pulp_fiction the_lord_of_the_rings
## 12 14
We see that each movie was rated by at least 8 people, which is not representative, but should at least provide an array of opinions about each one.
So what are the ratings of the movies?
library(ggplot2)
qplot(data = data, x = movie, y = rating, geom = "boxplot", fill = movie)+coord_flip() + ggtitle("Movie Ratings")
As the sample is very small, opinions are split about most films. But here are some first observations:
Outlook
In the future, after streamlining the data collection process, more details about the movies and their ratings can be included into the analysis, such as year, genre, starring actors, budget etc.