Assignment Instructions

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe. Your deliverables should include your SQL scripts and your R Markdown code, posted to GitHub.

This is by design a very open ended assignment. A variety of reasonable approaches are acceptable. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.

Metacritic

Metacritic is a website that aggregates reviews of music albums, games, movies, TV shows, DVDs, and formerly, books. For each product, a numerical score from each review is obtained and the total is averaged. It was created and founded by Jason Dietz, Marc Doyle, and Julie Doyle Roberts. Source

Scape Web Page

Choosing top movies as rated by Metacritic. A while loop is necessary because for this website, getURL returns just the headers with “Error 429 Slow down” the majority times it is queried. A 30-second pause is built in every 500 tries because sometimes pausing for a bit helps.

library(XML)
library(RCurl)
## Loading required package: bitops
library(RMySQL)
## Loading required package: DBI
scrape <- function(url) {
  webpage = ""; n = 0
  options(warn = -1)
  while (nchar(webpage)<1000) {
    webpage <- getURL(url, isHTTP = T)
    n = n + 1
    if (n %% 500 == 0) { 
      Sys.sleep(30)
      }
  }
  options(warn = 0)
  return(webpage)
}

url <- "http://www.metacritic.com/browse/movies/score/metascore/year/filtered"
raw <- scrape(url); nchar(raw)
## [1] 189063

Clean Data

Choosing top six movies to extract from the raw data scraped from Metacritic.

clean <- function(raw, m) {
    data <- htmlParse(raw)
    nodes <- c("product_item row_num", "product_item product_title", 
               "metascore_w small movie positive", "product_item product_date")
    n = length(nodes)
    movies <- matrix(NA, nrow=m, ncol=n)
    for (i in 1:m) {
    for (j in 1:n) {
      x <- getNodeSet(data, paste0("//div[@class='", nodes[j], "']"))[[i]]
      x <- gsub("\\n|\\.|<a.*\">|</a>", "", as(x, "character"))
      loc <- gregexpr(">.*?</", x)
      a <- loc[[1]][1] + 1
      b <- loc[[1]][1] + attr(loc[[1]],"match.length") - 3
      x <- substr(x, start = a, stop = b) 
      movies[i,j] <- gsub("^\\s+|\\s+$", "", x)
    }
  }
  popular = data.frame(movies, stringsAsFactors = F)
  popular[,1] <- as.integer(popular[,1])
  popular[,3] <- as.integer(popular[,3])
  month <- substr(movies[,4],0,3)
  day <- gsub("\\s", "0", gsub(",", "", substr(movies[ ,4], nchar(movies[,4])-7, nchar(movies[,4]) - 5)))
  year <- substr(movies[,4], nchar(movies[,4]) - 3, nchar(movies[ ,4]))
  popular[,4] <- as.Date(paste0(day,month,year), "%d%B%Y")
  colnames(popular) <- c("Movie_Rank","Movie_Title","MetaScore","Release_Date")
  return(popular)
}

popular <- clean(raw, 6); popular
##   Movie_Rank                Movie_Title MetaScore Release_Date
## 1          1                  Moonlight        94   2016-10-21
## 2          2               Toni Erdmann        91   2016-12-25
## 3          3 One More Time with Feeling        91   2016-09-08
## 4          4                 La La Land        91   2016-12-02
## 5          5      Manchester by the Sea        91   2016-11-18
## 6          6                   The Fits        90   2016-06-03

Interesting note: Some of the highest rated movies on the site have future release dates. This must be do to a small sample –potentially made up of people involved with, and therefore biased toward, the movie– being used to represent the population.

Imaginary Friend’s and their Ratings

Makes a list of five friends then produces a list of random ratings from one to six. The number six accounts for instances where movies were not watched by the individual and is therefore replaced with “NA” in the final step of the function.

imagine <- function(f) {
  friends <- data.frame(cbind(seq(1:5), NA))
  friends[ ,2] <- LETTERS[1:f]
  colnames(friends) <- c("Friend_Rank", "Friend_Name")   
  return(friends)
}

rate <- function(min, max) {
  m = length(popular[,1])
  f = length(friends[,1])
  ratings <- array(NA, dim=c(m,f))
  set.seed(11217)
  for (i in 1:m){
    for (j in 1:f) {
      score <- ceiling(runif(1, min - 1, max + 1))
      if (score <= max) { ratings[i, j] <- score }
    }
  }
  ratings <- cbind(as.numeric(row(ratings)), as.numeric(col(ratings)),as.numeric(ratings))
  ratings <- data.frame(subset(ratings, complete.cases(ratings)))
  colnames(ratings) <- c("Movie_Rank", "Friend_Rank", "Rating")
  return(ratings)
}

friends <- imagine(5); friends
##   Friend_Rank Friend_Name
## 1           1           A
## 2           2           B
## 3           3           C
## 4           4           D
## 5           5           E
ratings <- rate(1, 5); ratings
##    Movie_Rank Friend_Rank Rating
## 1           1           1      5
## 2           2           1      5
## 3           3           1      4
## 4           4           1      1
## 5           6           1      4
## 6           1           2      2
## 7           2           2      1
## 8           3           2      1
## 9           4           2      4
## 10          5           2      5
## 11          6           2      1
## 12          2           3      1
## 13          4           3      2
## 14          5           3      2
## 15          6           3      4
## 16          1           4      3
## 17          3           4      4
## 18          4           4      5
## 19          6           4      4
## 20          1           5      1
## 21          2           5      5
## 22          3           5      5
## 23          4           5      2
## 24          5           5      4
## 25          6           5      5

Store Results in SQL Database

Connect to MySQL. Create “Movies” database. Load movies, friends, and ratings into the database.

dbSendQuery(connection, "CREATE USER 'RMySQL' IDENTIFIED BY '123456';" )
## <MySQLResult:21441544,0,0>
dbSendQuery(connection, "GRANT ALL ON *.* TO 'RMySQL';")
## <MySQLResult:1,0,1>
connection <- dbConnect(MySQL(), user="RMySQL", password="123456")

dbSendQuery(connection, 'CREATE SCHEMA IF NOT EXISTS Movies;')
## <MySQLResult:0,1,0>
dbSendQuery(connection, 'USE Movies;')
## <MySQLResult:21788696,1,1>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Movies;')
## <MySQLResult:21500384,1,2>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Friends;')
## <MySQLResult:21439816,1,3>
dbSendQuery(connection, 'DROP TABLE IF EXISTS tbl_Ratings;')
## <MySQLResult:112225112,1,4>
dbWriteTable(connection, "tbl_Movies", popular, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Movies
                         MODIFY COLUMN Movie_Rank INT NOT NULL,
                         MODIFY COLUMN Movie_Title VARCHAR(30) NOT NULL,
                         MODIFY COLUMN MetaScore INT NOT NULL,
                         MODIFY COLUMN Release_Date DATE NOT NULL,
                         ADD PRIMARY KEY (Movie_Rank);")
## <MySQLResult:21439648,1,8>
dbWriteTable(connection, "tbl_Friends", friends, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Friends
                         MODIFY COLUMN Friend_Rank INT NOT NULL,
                         MODIFY COLUMN Friend_Name VARCHAR(30) NOT NULL,
                         ADD PRIMARY KEY (Friend_Rank);")
## <MySQLResult:2,1,12>
dbWriteTable(connection, "tbl_Ratings", ratings, append = TRUE, row.names = FALSE)
## [1] TRUE
dbSendQuery(connection, "ALTER TABLE tbl_Ratings
                         MODIFY COLUMN Rating INT NOT NULL,
                         MODIFY COLUMN Movie_Rank INT NOT NULL,
                         MODIFY COLUMN Friend_Rank INT NOT NULL,
                         ADD FOREIGN KEY (Movie_Rank) REFERENCES tbl_Movies (Movie_Rank),
                         ADD FOREIGN KEY (Friend_Rank) REFERENCES tbl_Friends (Friend_Rank),
                         ADD PRIMARY KEY (Movie_Rank, Friend_Rank);")
## <MySQLResult:2,1,16>

Run Query from SQL Database

Queries of MySQL tables displays data as a dataframe in R.

dbGetQuery(connection, "SELECT Movie_Title, Metascore, Release_Date, Friend_Name, Rating
                        FROM movies.tbl_Movies AS M
                        CROSS JOIN movies.tbl_Friends AS F
                        LEFT JOIN movies.tbl_Ratings AS R
                        ON R.Movie_Rank = M.Movie_Rank AND F.Friend_Rank = R.Friend_Rank
                        ORDER BY M.Movie_Rank, F.Friend_Rank;")
##                   Movie_Title Metascore Release_Date Friend_Name Rating
## 1                   Moonlight        94   2016-10-21           A      5
## 2                   Moonlight        94   2016-10-21           B      2
## 3                   Moonlight        94   2016-10-21           C     NA
## 4                   Moonlight        94   2016-10-21           D      3
## 5                   Moonlight        94   2016-10-21           E      1
## 6                Toni Erdmann        91   2016-12-25           A      5
## 7                Toni Erdmann        91   2016-12-25           B      1
## 8                Toni Erdmann        91   2016-12-25           C      1
## 9                Toni Erdmann        91   2016-12-25           D     NA
## 10               Toni Erdmann        91   2016-12-25           E      5
## 11 One More Time with Feeling        91   2016-09-08           A      4
## 12 One More Time with Feeling        91   2016-09-08           B      1
## 13 One More Time with Feeling        91   2016-09-08           C     NA
## 14 One More Time with Feeling        91   2016-09-08           D      4
## 15 One More Time with Feeling        91   2016-09-08           E      5
## 16                 La La Land        91   2016-12-02           A      1
## 17                 La La Land        91   2016-12-02           B      4
## 18                 La La Land        91   2016-12-02           C      2
## 19                 La La Land        91   2016-12-02           D      5
## 20                 La La Land        91   2016-12-02           E      2
## 21      Manchester by the Sea        91   2016-11-18           A     NA
## 22      Manchester by the Sea        91   2016-11-18           B      5
## 23      Manchester by the Sea        91   2016-11-18           C      2
## 24      Manchester by the Sea        91   2016-11-18           D     NA
## 25      Manchester by the Sea        91   2016-11-18           E      4
## 26                   The Fits        90   2016-06-03           A      4
## 27                   The Fits        90   2016-06-03           B      1
## 28                   The Fits        90   2016-06-03           C      4
## 29                   The Fits        90   2016-06-03           D      4
## 30                   The Fits        90   2016-06-03           E      5
dbSendQuery(connection, 'DROP TABLE tbl_Ratings;')
## <MySQLResult:99050168,1,18>
dbSendQuery(connection, 'DROP TABLE tbl_Friends;')
## <MySQLResult:8,1,19>
dbSendQuery(connection, 'DROP TABLE tbl_Movies;')
## <MySQLResult:21451392,1,20>
dbSendQuery(connection, 'DROP SCHEMA Movies;')
## <MySQLResult:1364425037,1,21>
dbSendQuery(connection, "REVOKE ALL ON *.* FROM 'RMySQL' ;")
## <MySQLResult:21439648,2,0>
dbSendQuery(connection, "DROP USER 'RMySQL' ;")
## <MySQLResult:1,2,1>
dbDisconnect(connection)
## Warning: Closing open result sets
## [1] TRUE