My friends and family have probably had enough of me talking about data, so rather than interview them, I decided to use a movie dataset that I found on the internet here. I’ve copied the data to my github and reference that copy in my code.
I’ve opted to do all my MySQL work directly from R, so the only file that i have to show is a .rmd which can be found in my here in my github and here on rpubs
library(RMySQL)
library(data.table)user_df <- read.table("mysql_temp_pw.csv",header=TRUE,sep=",")
user = toString(user_df[1,"user"])
pw = toString(user_df[1,"password"])First we’ll open a connection to MySQL. Note that I’ve pre-created a database called “DATA_607” using the following SQL command: CREATE DATABASE “DATA_607”.
con = dbConnect(MySQL(), user=user, password=pw, dbname='DATA_607', host='localhost')Next we’ll read the movie data into dataframes. The data we care about are contained in 2 separate files (ratings, movies) which we’ll end up combining into a single db table later.
ratings <- read.table("https://raw.githubusercontent.com/plb2018/DATA607/master/ml-latest-small/ratings.csv",header = TRUE, sep = ",")
movies <- read.table("https://raw.githubusercontent.com/plb2018/DATA607/master/ml-latest-small/movies.csv",header = TRUE, sep = ",",fill = TRUE)And we’ll take a look at the data:
head(ratings,5)## userId movieId rating timestamp
## 1 1 31 2.5 1260759144
## 2 1 1029 3.0 1260759179
## 3 1 1061 3.0 1260759182
## 4 1 1129 2.0 1260759185
## 5 1 1172 4.0 1260759205
head(movies,5)## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
The data look good, so now we’ll load the data into individual tables in MySQL:
dbWriteTable(con, name='ratings', value=ratings)
dbWriteTable(con, name='movies', value=movies)Now we should have 2 tables: “movies” and “ratings”. Let’s check:
dbListTables(con)## [1] "movies" "ratings"
Looks good! Now we’ll pull some data and take a peek just to make sure that everything is as we expect:
query = dbSendQuery(con, "SELECT * FROM movies LIMIT 5")
movie_data = fetch(query)
movie_data## row_names movieId title
## 1 1 1 Toy Story (1995)
## 2 2 2 Jumanji (1995)
## 3 3 3 Grumpier Old Men (1995)
## 4 4 4 Waiting to Exhale (1995)
## 5 5 5 Father of the Bride Part II (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
query = dbSendQuery(con, "SELECT * FROM ratings LIMIT 5")
rating_data = fetch(query)
rating_data## row_names userId movieId rating timestamp
## 1 1 1 31 2.5 1260759144
## 2 2 1 1029 3.0 1260759179
## 3 3 1 1061 3.0 1260759182
## 4 4 1 1129 2.0 1260759185
## 5 5 1 1172 4.0 1260759205
The tables look fine, so we’ll try to join the data “movies” and “ratings” tables using the common “movieId” field and create a new “movie_ratings” table to house the joined data:
dbSendQuery(con, "CREATE TABLE movie_ratings AS (
SELECT movies.movieId,ratings.userId, ratings.rating, movies.title,movies.genres
FROM movies
INNER JOIN ratings ON movies.movieId = ratings.movieId);")And finally, we’ll pull a sample out of the newly created table into an R dataframe:
query = dbSendQuery(con, "SELECT * FROM movie_ratings LIMIT 5")
data = fetch(query)
data## movieId userId rating title
## 1 31 1 2.5 Dangerous Minds (1995)
## 2 1405 1 1.0 Beavis and Butt-Head Do America (1996)
## 3 2150 1 3.0 Gods Must Be Crazy, The (1980)
## 4 10 2 4.0 GoldenEye (1995)
## 5 17 2 5.0 Sense and Sensibility (1995)
## genres
## 1 Drama
## 2 Adventure|Animation|Comedy|Crime
## 3 Adventure|Comedy
## 4 Action|Adventure|Thriller
## 5 Drama|Romance
And finally, we’ll close out db conneciton.
dbDisconnect(con)