The data set was pulled from (https://grouplens.org/datasets/movielens/); the ml-latest-small set recommended for education and development. (100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.)
# Helper for getting new connection to Cloud SQL
getSqlConnection <- function(){
con <-
dbConnect(
RMySQL::MySQL(),
username = 'csps',
password = '1234',
host = '35.243.176.233',
dbname = 'Movie_rating'
) # TODO: use a configuration group `group = "my-db")`
return(con)
}
# Setting the Connection to pull the tables and validate.
conn<- getSqlConnection()
check_tables<- dbListTables(conn)
print(check_tables)
## [1] "movie_rating_combined" "movies" "ratings"
# Setting the tables as Dataframes
movies <- dbGetQuery(conn, "select * FROM movies")
ratings <- dbGetQuery (conn, "select * FROM ratings")
combined <- dbGetQuery (conn, "select * FROM movie_rating_combined")
## tableID movieId title
## 1 1 0 title
## 2 2 2 Jumanji (1995)
## 3 3 3 Grumpier Old Men (1995)
## 4 4 4 Waiting to Exhale (1995)
## 5 5 5 Father of the Bride Part II (1995)
## 6 6 6 Heat (1995)
## 7 7 7 Sabrina (1995)
## 8 8 8 Tom and Huck (1995)
## 9 9 9 Sudden Death (1995)
## 10 10 10 GoldenEye (1995)
## 11 11 11 American President, The (1995)
## 12 12 12 Dracula: Dead and Loving It (1995)
## 13 13 13 Balto (1995)
## 14 14 14 Nixon (1995)
## 15 15 15 Cutthroat Island (1995)
## 16 16 16 Casino (1995)
## 17 17 17 Sense and Sensibility (1995)
## 18 18 18 Four Rooms (1995)
## 19 19 19 Ace Ventura: When Nature Calls (1995)
## 20 20 20 Money Train (1995)
## genres
## 1 genres\r
## 2 Adventure|Children|Fantasy\r
## 3 Comedy|Romance\r
## 4 Comedy|Drama|Romance\r
## 5 Comedy\r
## 6 Action|Crime|Thriller\r
## 7 Comedy|Romance\r
## 8 Adventure|Children\r
## 9 Action\r
## 10 Action|Adventure|Thriller\r
## 11 Comedy|Drama|Romance\r
## 12 Comedy|Horror\r
## 13 Adventure|Animation|Children\r
## 14 Drama\r
## 15 Action|Adventure|Romance\r
## 16 Crime|Drama\r
## 17 Drama|Romance\r
## 18 Comedy\r
## 19 Comedy\r
## 20 Action|Comedy|Crime|Drama|Thriller\r
## tableID userId movieId rating
## 1 1 0 0 0
## 2 2 1 3 4
## 3 3 1 6 4
## 4 4 1 47 5
## 5 5 1 50 5
## 6 6 1 70 3
## 7 7 1 101 5
## 8 8 1 110 4
## 9 9 1 151 5
## 10 10 1 157 5
## 11 11 1 163 5
## 12 12 1 216 5
## 13 13 1 223 3
## 14 14 1 231 5
## 15 15 1 235 4
## 16 16 1 260 5
## 17 17 1 296 3
## 18 18 1 316 3
## 19 19 1 333 5
## 20 20 1 349 4
## tableID movieId title genres
## 1 1 0 title genres\r
## 2 2 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 3 3 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 4 4 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 5 5 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 6 6 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 7 7 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 8 8 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 9 9 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 10 10 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 11 11 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 12 12 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 13 13 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 14 14 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 15 15 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 16 16 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 17 17 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 18 18 2 Jumanji (1995) Adventure|Children|Fantasy\r
## 19 19 3 Grumpier Old Men (1995) Comedy|Romance\r
## 20 20 3 Grumpier Old Men (1995) Comedy|Romance\r
## rating userId
## 1 0 0
## 2 4 6
## 3 4 8
## 4 3 18
## 5 3 19
## 6 3 20
## 7 4 21
## 8 4 27
## 9 5 51
## 10 4 62
## 11 3 68
## 12 3 82
## 13 3 91
## 14 5 93
## 15 4 94
## 16 4 103
## 17 3 104
## 18 5 107
## 19 4 1
## 20 5 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 4.000 3.652 4.000 5.000
## [1] 9720
user1review <- subset(combined, userId == 1)
u1rating <- table(user1review$rating)
barplot(u1rating, main="User 1 Ratings", xlab = "Ratings" )
In SQL, I had to set up the data set twice. Once, to get everything in order and ensure the CSV’s would load. Second, in Google Cloud (the reasons will be come clear later in the text).
In my Local SQL I ran the following script:
– Drop database IF EXISTS Movie_rating;
CREATE database IF NOT EXISTS Movie_rating;
Use Movie_rating;
SET SQL_MODE = ’’;CREATE TABLE IF NOT EXISTS movies (
tableID INT AUTO_INCREMENT,
movieId INT,
title TEXT,
genres TEXT,
PRIMARY KEY (tableID)
);CREATE TABLE IF NOT EXISTS ratings(
tableID INT AUTO_INCREMENT NOT NULL,
userId INT,
movieId INT,
rating INT,
PRIMARY KEY (tableID)
);LOAD DATA infile ‘C:/Users/x/…/movies.csv’
INTO TABLE movies
FIELDS TERMINATED BY ‘,’
ENCLOSED BY ‘"’
LINES TERMINATED BY ‘’
IGNORE 1 ROWS;LOAD DATA infile ‘C:/Users/x/…/ratings3.csv’
INTO TABLE ratings
FIELDS TERMINATED BY ‘,’
ENCLOSED BY ‘"’
LINES TERMINATED BY ’
IGNORE 1 ROWS
;SELECT COUNT() FROM movies;
SELECT COUNT() FROM ratings;SELECT m.movieId, m.title, m.genres, r.rating, r.userId
FROM ratings r
LEFT JOIN movies m on r.movieId=m.movieId;CREATE TABLE IF NOT EXISTS movie_rating_combined(
tableID INT AUTO_INCREMENT NOT NULL,
movieId INT,
title TEXT,
genres TEXT,
rating INT,
userId INT,
PRIMARY KEY (tableID)
);INSERT INTO movie_rating_combined (movieId, title, genres, rating, userId)
SELECT m.movieId, m.title, m.genres, r.rating, r.userId
FROM ratings r
LEFT JOIN movies m on r.movieId=m.movieId;
This allowed me to check all the creations and loads while keeping the assignment fairly basic.
I ran into an issue attempting to import the local database as a .SQL file in the Cloud instance; permissions errors (check the discussion board). I also ran into an issue when attempting to load a local csv file into a local connection of the Cloud MYSQL instance. The movies and ratings CSV’s had to be uploaded to the Google Storage drive. I ran the script below to create the tables:
Use Movie_rating;
SET SQL_MODE = ’’;CREATE TABLE IF NOT EXISTS movies (
tableID INT AUTO_INCREMENT,
movieId INT,
title TEXT,
genres TEXT,
PRIMARY KEY (tableID)
);CREATE TABLE IF NOT EXISTS ratings(
tableID INT AUTO_INCREMENT NOT NULL,
userId INT,
movieId INT,
rating INT,
PRIMARY KEY (tableID)
);
Once the tables were created, I could then use the import wizard in the Cloud Platform GUI to load the data into the tables; no other way could be found at the time. Once the tables were loaded, I could run the final step to normalize the database to my local copy.
CREATE TABLE IF NOT EXISTS movie_rating_combined(
tableID INT AUTO_INCREMENT NOT NULL,
movieId INT,
title TEXT,
genres TEXT,
rating INT,
userId INT,
PRIMARY KEY (tableID)
);INSERT INTO movie_rating_combined (movieId, title, genres, rating, userId)
SELECT m.movieId, m.title, m.genres, r.rating, r.userId
FROM ratings r
LEFT JOIN movies m on r.movieId=m.movieId;– SELECT * from movie_rating_combined
When these steps were complete, I could run the R programming seen in the top of this document.