My friends and family have probably had enough of me talking about data, so rather than interview them, I decided to use a movie dataset that I found on the internet here. I’ve copied the data to my github and reference that copy in my code.

I’ve opted to do all my MySQL work directly from R, so the only file that i have to show is a .rmd which can be found in my here in my github and here on rpubs

Load Required Packages

library(RMySQL)
library(data.table)

Read in username and password from file

user_df <- read.table("mysql_temp_pw.csv",header=TRUE,sep=",")

user = toString(user_df[1,"user"])
pw = toString(user_df[1,"password"])

First we’ll open a connection to MySQL. Note that I’ve pre-created a database called “DATA_607” using the following SQL command: CREATE DATABASE “DATA_607”.

con = dbConnect(MySQL(), user=user, password=pw, dbname='DATA_607', host='localhost')

Next we’ll read the movie data into dataframes. The data we care about are contained in 2 separate files (ratings, movies) which we’ll end up combining into a single db table later.

ratings <- read.table("https://raw.githubusercontent.com/plb2018/DATA607/master/ml-latest-small/ratings.csv",header = TRUE, sep = ",")

movies <- read.table("https://raw.githubusercontent.com/plb2018/DATA607/master/ml-latest-small/movies.csv",header = TRUE, sep = ",",fill = TRUE)

And we’ll take a look at the data:

head(ratings,5)
##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
head(movies,5)
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy

The data look good, so now we’ll load the data into individual tables in MySQL:

dbWriteTable(con, name='ratings', value=ratings)
dbWriteTable(con, name='movies', value=movies)

Now we should have 2 tables: “movies” and “ratings”. Let’s check:

dbListTables(con)
## [1] "movies"  "ratings"

Looks good! Now we’ll pull some data and take a peek just to make sure that everything is as we expect:

query = dbSendQuery(con, "SELECT * FROM movies LIMIT 5")
movie_data = fetch(query)
movie_data
##   row_names movieId                              title
## 1         1       1                   Toy Story (1995)
## 2         2       2                     Jumanji (1995)
## 3         3       3            Grumpier Old Men (1995)
## 4         4       4           Waiting to Exhale (1995)
## 5         5       5 Father of the Bride Part II (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
query = dbSendQuery(con, "SELECT * FROM ratings LIMIT 5")
rating_data = fetch(query)
rating_data
##   row_names userId movieId rating  timestamp
## 1         1      1      31    2.5 1260759144
## 2         2      1    1029    3.0 1260759179
## 3         3      1    1061    3.0 1260759182
## 4         4      1    1129    2.0 1260759185
## 5         5      1    1172    4.0 1260759205

The tables look fine, so we’ll try to join the data “movies” and “ratings” tables using the common “movieId” field and create a new “movie_ratings” table to house the joined data:

dbSendQuery(con, "CREATE TABLE movie_ratings AS (
SELECT  movies.movieId,ratings.userId, ratings.rating, movies.title,movies.genres 
FROM movies
INNER JOIN ratings ON  movies.movieId = ratings.movieId);")

And finally, we’ll pull a sample out of the newly created table into an R dataframe:

query = dbSendQuery(con, "SELECT * FROM movie_ratings LIMIT 5")
data = fetch(query)

data
##   movieId userId rating                                  title
## 1      31      1    2.5                 Dangerous Minds (1995)
## 2    1405      1    1.0 Beavis and Butt-Head Do America (1996)
## 3    2150      1    3.0         Gods Must Be Crazy, The (1980)
## 4      10      2    4.0                       GoldenEye (1995)
## 5      17      2    5.0           Sense and Sensibility (1995)
##                             genres
## 1                            Drama
## 2 Adventure|Animation|Comedy|Crime
## 3                 Adventure|Comedy
## 4        Action|Adventure|Thriller
## 5                    Drama|Romance

And finally, we’ll close out db conneciton.

dbDisconnect(con)