Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe. Your deliverables should include your SQL scripts and your R Markdown code, posted to GitHub.
This is by design a very open ended assignment. A variety of reasonable approaches are acceptable. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.
Metacritic is a website that aggregates reviews of music albums, games, movies, TV shows, DVDs, and formerly, books. For each product, a numerical score from each review is obtained and the total is averaged. It was created and founded by Jason Dietz, Marc Doyle, and Julie Doyle Roberts. Source
Choosing top movies as rated by Metacritic. A while loop is necessary because for this website, getURL returns just the headers with “Error 429 Slow down” the majority times it is queried. A 30-second pause is built in every 500 tries because sometimes pausing for a bit helps.
library(XML)
library(RCurl)
## Loading required package: bitops
library(RMySQL)
## Loading required package: DBI
scrape <- function(url) {
webpage = ""; n = 0
options(warn = -1)
while (nchar(webpage)<1000) {
webpage <- getURL(url, isHTTP = T)
n = n + 1
if (n %% 500 == 0) {
Sys.sleep(30)
}
}
options(warn = 0)
return(webpage)
}
url <- "http://www.metacritic.com/browse/movies/score/metascore/year/filtered"
raw <- scrape(url); nchar(raw)
## [1] 189063
Choosing top six movies to extract from the raw data scraped from Metacritic.
clean <- function(raw, m) {
data <- htmlParse(raw)
nodes <- c("product_item row_num", "product_item product_title",
"metascore_w small movie positive", "product_item product_date")
n = length(nodes)
movies <- matrix(NA, nrow=m, ncol=n)
for (i in 1:m) {
for (j in 1:n) {
x <- getNodeSet(data, paste0("//div[@class='", nodes[j], "']"))[[i]]
x <- gsub("\\n|\\.|<a.*\">|</a>", "", as(x, "character"))
loc <- gregexpr(">.*?</", x)
a <- loc[[1]][1] + 1
b <- loc[[1]][1] + attr(loc[[1]],"match.length") - 3
x <- substr(x, start = a, stop = b)
movies[i,j] <- gsub("^\\s+|\\s+$", "", x)
}
}
popular = data.frame(movies, stringsAsFactors = F)
popular[,1] <- as.integer(popular[,1])
popular[,3] <- as.integer(popular[,3])
month <- substr(movies[,4],0,3)
day <- gsub("\\s", "0", gsub(",", "", substr(movies[ ,4], nchar(movies[,4])-7, nchar(movies[,4]) - 5)))
year <- substr(movies[,4], nchar(movies[,4]) - 3, nchar(movies[ ,4]))
popular[,4] <- as.Date(paste0(day,month,year), "%d%B%Y")
colnames(popular) <- c("Movie_Rank","Movie_Title","MetaScore","Release_Date")
return(popular)
}
popular <- clean(raw, 6); popular
## Movie_Rank Movie_Title MetaScore Release_Date
## 1 1 Moonlight 94 2016-10-21
## 2 2 Toni Erdmann 91 2016-12-25
## 3 3 One More Time with Feeling 91 2016-09-08
## 4 4 La La Land 91 2016-12-02
## 5 5 Manchester by the Sea 91 2016-11-18
## 6 6 The Fits 90 2016-06-03
Interesting note: Some of the highest rated movies on the site have future release dates. This must be do to a small sample –potentially made up of people involved with, and therefore biased toward, the movie– being used to represent the population.
Makes a list of five friends then produces a list of random ratings from one to six. The number six accounts for instances where movies were not watched by the individual and is therefore replaced with “NA” in the final step of the function.
imagine <- function(f) {
friends <- data.frame(cbind(seq(1:5), NA))
friends[ ,2] <- LETTERS[1:f]
colnames(friends) <- c("Friend_Rank", "Friend_Name")
return(friends)
}
rate <- function(min, max) {
m = length(popular[,1])
f = length(friends[,1])
ratings <- array(NA, dim=c(m,f))
set.seed(11217)
for (i in 1:m){
for (j in 1:f) {
score <- ceiling(runif(1, min - 1, max + 1))
if (score <= max) { ratings[i, j] <- score }
}
}
ratings <- cbind(as.numeric(row(ratings)), as.numeric(col(ratings)),as.numeric(ratings))
ratings <- data.frame(subset(ratings, complete.cases(ratings)))
colnames(ratings) <- c("Movie_Rank", "Friend_Rank", "Rating")
return(ratings)
}
friends <- imagine(5); friends
## Friend_Rank Friend_Name
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
ratings <- rate(1, 5); ratings
## Movie_Rank Friend_Rank Rating
## 1 1 1 5
## 2 2 1 5
## 3 3 1 4
## 4 4 1 1
## 5 6 1 4
## 6 1 2 2
## 7 2 2 1
## 8 3 2 1
## 9 4 2 4
## 10 5 2 5
## 11 6 2 1
## 12 2 3 1
## 13 4 3 2
## 14 5 3 2
## 15 6 3 4
## 16 1 4 3
## 17 3 4 4
## 18 4 4 5
## 19 6 4 4
## 20 1 5 1
## 21 2 5 5
## 22 3 5 5
## 23 4 5 2
## 24 5 5 4
## 25 6 5 5
Connect to MySQL. Create “Movies” database. Load movies, friends, and ratings into the database.
dbSendQuery(connection, "CREATE USER 'RMySQL' IDENTIFIED BY '123456';" )
## <MySQLResult:21441544,0,0>
dbSendQuery(connection, "GRANT ALL ON *.* TO 'RMySQL';")
## <MySQLResult:1,0,1>
connection <- dbConnect(MySQL(), user="RMySQL", password="123456")
dbSendQuery(connection, 'CREATE SCHEMA IF NOT EXISTS Movies;')
## <MySQLResult:0,1,0>
dbSendQuery(connection, 'USE Movies;')
## <MySQLResult:21788696,1,1>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Movies;')
## <MySQLResult:21500384,1,2>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Friends;')
## <MySQLResult:21439816,1,3>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Ratings;')
## <MySQLResult:112225112,1,4>
dbWriteTable(connection, "tbl_Movies", popular, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Movies
MODIFY COLUMN Movie_Rank INT NOT NULL,
MODIFY COLUMN Movie_Title VARCHAR(30) NOT NULL,
MODIFY COLUMN MetaScore INT NOT NULL,
MODIFY COLUMN Release_Date DATE NOT NULL,
ADD PRIMARY KEY (Movie_Rank);")
## <MySQLResult:21439648,1,8>
dbWriteTable(connection, "tbl_Friends", friends, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Friends
MODIFY COLUMN Friend_Rank INT NOT NULL,
MODIFY COLUMN Friend_Name VARCHAR(30) NOT NULL,
ADD PRIMARY KEY (Friend_Rank);")
## <MySQLResult:2,1,12>
dbWriteTable(connection, "tbl_Ratings", ratings, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Ratings
MODIFY COLUMN Rating INT NOT NULL,
MODIFY COLUMN Movie_Rank INT NOT NULL,
MODIFY COLUMN Friend_Rank INT NOT NULL,
ADD FOREIGN KEY (Movie_Rank) REFERENCES tbl_Movies (Movie_Rank),
ADD FOREIGN KEY (Friend_Rank) REFERENCES tbl_Friends (Friend_Rank),
ADD PRIMARY KEY (Movie_Rank, Friend_Rank);")
## <MySQLResult:2,1,16>
Queries of MySQL tables displays data as a dataframe in R.
dbGetQuery(connection, "SELECT Movie_Title, Metascore, Release_Date, Friend_Name, Rating
FROM movies.tbl_Movies AS M
CROSS JOIN movies.tbl_Friends AS F
LEFT JOIN movies.tbl_Ratings AS R
ON R.Movie_Rank = M.Movie_Rank AND F.Friend_Rank = R.Friend_Rank
ORDER BY M.Movie_Rank, F.Friend_Rank;")
## Movie_Title Metascore Release_Date Friend_Name Rating
## 1 Moonlight 94 2016-10-21 A 5
## 2 Moonlight 94 2016-10-21 B 2
## 3 Moonlight 94 2016-10-21 C NA
## 4 Moonlight 94 2016-10-21 D 3
## 5 Moonlight 94 2016-10-21 E 1
## 6 Toni Erdmann 91 2016-12-25 A 5
## 7 Toni Erdmann 91 2016-12-25 B 1
## 8 Toni Erdmann 91 2016-12-25 C 1
## 9 Toni Erdmann 91 2016-12-25 D NA
## 10 Toni Erdmann 91 2016-12-25 E 5
## 11 One More Time with Feeling 91 2016-09-08 A 4
## 12 One More Time with Feeling 91 2016-09-08 B 1
## 13 One More Time with Feeling 91 2016-09-08 C NA
## 14 One More Time with Feeling 91 2016-09-08 D 4
## 15 One More Time with Feeling 91 2016-09-08 E 5
## 16 La La Land 91 2016-12-02 A 1
## 17 La La Land 91 2016-12-02 B 4
## 18 La La Land 91 2016-12-02 C 2
## 19 La La Land 91 2016-12-02 D 5
## 20 La La Land 91 2016-12-02 E 2
## 21 Manchester by the Sea 91 2016-11-18 A NA
## 22 Manchester by the Sea 91 2016-11-18 B 5
## 23 Manchester by the Sea 91 2016-11-18 C 2
## 24 Manchester by the Sea 91 2016-11-18 D NA
## 25 Manchester by the Sea 91 2016-11-18 E 4
## 26 The Fits 90 2016-06-03 A 4
## 27 The Fits 90 2016-06-03 B 1
## 28 The Fits 90 2016-06-03 C 4
## 29 The Fits 90 2016-06-03 D 4
## 30 The Fits 90 2016-06-03 E 5
dbSendQuery(connection, 'DROP TABLE tbl_Ratings;')
## <MySQLResult:99050168,1,18>
dbSendQuery(connection, 'DROP TABLE tbl_Friends;')
## <MySQLResult:8,1,19>
dbSendQuery(connection, 'DROP TABLE tbl_Movies;')
## <MySQLResult:21451392,1,20>
dbSendQuery(connection, 'DROP SCHEMA Movies;')
## <MySQLResult:1364425037,1,21>
dbSendQuery(connection, "REVOKE ALL ON *.* FROM 'RMySQL' ;")
## <MySQLResult:21439648,2,0>
dbSendQuery(connection, "DROP USER 'RMySQL' ;")
## <MySQLResult:1,2,1>
dbDisconnect(connection)
## Warning: Closing open result sets
## [1] TRUE