Data-607-Assignment-2

I Listed some classic Nicolas Cage movies and had may siblings rate them on Google Sheets. If I had to do it over I would have picked some Marvel movies or something since my data ended up pretty sparse.

library(googlesheets)
library(RMySQL)

## Loading required package: DBI

library(reshape2)

Connect to local DB. redacted code will look like this:
con = dbConnect(dbDriver(“MySQL”), user=‘’, password=’‘, dbname=’cuny’, host=’’,port = 3306)

This is probably not best practice but I have all of the SQL inside of the R code. It was a bit of a challenge to get it all to work but it’s nice that it’s all together in one document, and it gets completely reset every time I change and re-run.

Drop Tables if they exist

res <- dbSendQuery(con,'drop table if exists movie_critics')
res <- dbSendQuery(con,'drop table if exists movies')
res <- dbSendQuery(con,'drop table if exists movie_ratings')

Create table to list critics

res <- dbSendQuery(con,"
create table movie_critics (
id int primary key auto_increment,
name varchar(100)
)")

Create table to list movies

res <- dbSendQuery(con,"
create table movies (
id int primary key auto_increment,
name varchar(100)
)")

Create table of ratings I created “rating” as a decimal in case someone put a half star.

res <- dbSendQuery(con,"
create table movie_ratings(
id int primary key auto_increment,
critic_id int,
movie_id int,
rating decimal(14,4)
)")

Use stored token to get access to Google Sheets

gs_auth(token = "googlesheets_token.rds")

Pull in the spreadsheet

sheet <- gs_url('https://docs.google.com/spreadsheets/d/1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E')

## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.

## Putative key: 1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E

## Sheet successfully identified: "Movie Ratings"

sheet

##                   Spreadsheet title: Movie Ratings
##                  Spreadsheet author: steven.w.ellingson
##   Date of googlesheets registration: 2019-09-06 01:38:12 GMT
##     Date of last spreadsheet update: 2019-09-05 23:12:42 GMT
##                          visibility: private
##                         permissions: rw
##                             version: new
## 
## Contains 5 worksheets:
## (Title): (Nominal worksheet extent as rows x columns)
## Steven: 1000 x 26
## Sarah: 1000 x 26
## David: 1000 x 26
## Katie: 1000 x 26
## Jason: 1000 x 26
## 
## Key: 1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E
## Browser URL: https://docs.google.com/spreadsheets/d/1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E/

List out the worksheets

gs_ws_ls(sheet)

## [1] "Steven" "Sarah"  "David"  "Katie"  "Jason"

First Insert the movie critics into the table

query <- paste("INSERT INTO movie_critics (name) VALUES", paste("('",gs_ws_ls(sheet), "')", collapse = ", ", sep = ""), sep = "")
res <- dbSendQuery(con,query)

Now let’s define a couple of functions to read our data and push into DB

This function should be run for each worksheet
It will pull id for the critic, then use the “insert_rating” function to insert the rows

read_worksheet <- function(ss, ws) {
  
  # Get ID from table for this critic
  res <- dbSendQuery(con, paste0("select id from movie_critics where lower(name) = '",ws,"'"))
  data <- fetch(res,n=-1)
  
  if (nrow(data) > 1) {
    print(paste("WARNING, MULITPLE CRITICS WITH NAME",ws))
  } else if (nrow(data) == 0) {
    print(paste("WARNING, NO CRITICS WITH NAME",ws))
  } else {
    critic_id = data$id[1]
  }
  
  ratings <- gs_read(ss, ws)
  apply(ratings, 1, function(x) insert_rating(ss,critic_id,x))
}

This function should be run for every each individual rating
It will get the ids for the critic and movie and insert into DB

insert_rating <- function(ss, critic_id, rating_row) {
  #Extract Movie and Rating from Row
  movie = rating_row['Movie']
  rating = rating_row['Rating']
  
  #Get ID for movie from table
  res <- dbSendQuery(con, paste0("select id from movies where lower(name) = '",tolower(movie),"'"))
  data <- fetch(res,n=-1)
  
  if (nrow(data) > 1) {
    print(paste("WARNING, MULITPLE MOVIES WITH NAME",movie))
  } else if (nrow(data) == 0) {
    #there is no entry for this movie, so add it!
    res <- dbSendQuery(con, paste0("insert into movies (name) values ('",movie,"')"))
  }
  
  # Run Query again in case it wasn't there
  res <- dbSendQuery(con, paste0("select id from movies where lower(name) = '",movie,"'"))
  data <- fetch(res, n=-1)
  # Assume the first one is the one we want
  movie_id = data$id[1]
  
  # I didn't give clear instructions on what to do if you haven't seen the movie
  # we'll assume anything that isn't a number should go in as null, but also give a warning
  
  if (is.na(as.numeric(rating))) {
    if (is.na(rating)) {
      print(paste0("Warning - N/A for ",critic_id,",",movie_id,': this is probably OK because it was N/A in google sheets'))
    } else {
      print(paste0("Warning - N/A for ",critic_id,",",movie_id,': "',rating,'" was changed to N/A!'))
    }
    # Either way, lets change to "null" so it can be inserted into DB
    rating = 'null'
  }
  

  res <- dbSendQuery(con, paste0("insert into movie_ratings (critic_id,movie_id,rating) values (",critic_id,",",movie_id,",",rating,")"))
  
}

Use the above functions to insert the data

sapply(gs_ws_ls(sheet), function(x) read_worksheet(sheet,x))

## Accessing worksheet titled 'Steven'.

## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Rating = col_double()
## )

## Accessing worksheet titled 'Sarah'.

## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Rating = col_character()
## )

## [1] "Warning - N/A for 2,2: \"N/A\" was changed to N/A!"
## [1] "Warning - N/A for 2,3: \"N/A\" was changed to N/A!"

## Accessing worksheet titled 'David'.

## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Rating = col_double()
## )

## [1] "Warning - N/A for 3,1: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 3,3: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 3,4: this is probably OK because it was N/A in google sheets"

## Accessing worksheet titled 'Katie'.
## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Rating = col_double()
## )

## [1] "Warning - N/A for 4,1: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,2: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,3: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,7: this is probably OK because it was N/A in google sheets"

## Accessing worksheet titled 'Jason'.

## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Rating = col_character()
## )

## [1] "Warning - N/A for 5,1: \"Didn't Watch\" was changed to N/A!"
## [1] "Warning - N/A for 5,4: \"Didn't Watch\" was changed to N/A!"

##      Steven Sarah David Katie Jason
## [1,] ?      ?     ?     ?     ?    
## [2,] ?      ?     ?     ?     ?    
## [3,] ?      ?     ?     ?     ?    
## [4,] ?      ?     ?     ?     ?    
## [5,] ?      ?     ?     ?     ?    
## [6,] ?      ?     ?     ?     ?

OK, the warnings look reasonable (changing “Didn’t Watch” and “N/A” to N/A seems ok) so lets pull our data into a data frame

query = "select 
    mc.name as critic,
    m.name as movie,
    mr.rating
 from movie_ratings mr
 left join movies m on mr.movie_id = m.id
 left join movie_critics mc on mr.critic_id = mc.id"

res <- dbSendQuery(con, query)
data <- fetch(res, n=-1)

str(data)

## 'data.frame':    30 obs. of  3 variables:
##  $ critic: chr  "Steven" "Steven" "Steven" "Steven" ...
##  $ movie : chr  "Raising Arizona" "Con Air" "The Rock" "Bangkok Dangerous" ...
##  $ rating: num  5 4 4 2 3 4 4 NA NA 2 ...

looks ok at a glance Lets change it so the critics and movies are the row and column names

data.pivot = dcast(data, movie ~ critic,value.var = "rating")
data.pivot

##                movie David Jason Katie Sarah Steven
## 1  Bangkok Dangerous    NA    NA    NA     2      2
## 2 Bangkok Dangerous2    NA    NA    NA    NA     NA
## 3            Con Air     4     5    NA    NA      4
## 4           Face/Off     4     4     1     4      4
## 5  National Treasure     3     3     3     4      3
## 6    Raising Arizona    NA    NA    NA     4      5
## 7           The Rock    NA     5    NA    NA      4

What the heck, there is a movie titled “Bangkok Dangerous2”. Since I don’t believe that picture got a sequel I think something is wrong.

data[data$movie=="Bangkok Dangerous2",]

##    critic              movie rating
## 22  Katie Bangkok Dangerous2     NA

I’m going to assume that Katie tried to put in a rating of 2 for “Bangkok Dangerous” In real life I would fix this either in the spreadsheet itself or in the database, but for this exercize I’ll just fix the data frame

data[data$movie=="Bangkok Dangerous2",'rating'] = 2
data[data$movie=="Bangkok Dangerous2",'movie'] = "Bangkok Dangerous"

Let’s try that again

data.pivot = dcast(data, movie ~ critic,value.var = "rating")
data.pivot

##               movie David Jason Katie Sarah Steven
## 1 Bangkok Dangerous    NA    NA     2     2      2
## 2           Con Air     4     5    NA    NA      4
## 3          Face/Off     4     4     1     4      4
## 4 National Treasure     3     3     3     4      3
## 5   Raising Arizona    NA    NA    NA     4      5
## 6          The Rock    NA     5    NA    NA      4

lets change our row names to the movie titles

row.names(data.pivot) = data.pivot$movie
data.pivot['movie']=NULL

Since we have our data we may as well take a look at some summary statistics

Check how many people watched each movie

5-rowSums(is.na(data.pivot))

## Bangkok Dangerous           Con Air          Face/Off National Treasure 
##                 3                 3                 5                 5 
##   Raising Arizona          The Rock 
##                 2                 2

National Treasure and Face/Off were seen by everyone. Only a couple saw Raising Arizona and The Rock, which is disappointing

How many responses did I get per person

5-colSums(is.na(data.pivot))

##  David  Jason  Katie  Sarah Steven 
##      2      3      2      3      5

I saw all of them, which makes sense since I picked the movies. My younger siblings David and Katie are missing out.

Let’s look at the average rating by movie

round(rowMeans(data.pivot,na.rm=TRUE), 1)

## Bangkok Dangerous           Con Air          Face/Off National Treasure 
##               2.0               4.3               3.4               3.2 
##   Raising Arizona          The Rock 
##               4.5               4.5

More evidence that Katie and David should watch Raising Arizona and The Rock

And average rating by critic

round(colMeans(data.pivot,na.rm=TRUE), 1)

##  David  Jason  Katie  Sarah Steven 
##    3.7    4.2    2.0    3.5    3.7

Katie seems to be a pretty harsh critic. I wonder if we adjust the movie ratings based on the users average rating if that would adjust our results
This is a pretty silly thing to do with such a sparse dataset, but it’s my own fault I guess

data.normalized <- round(sweep(data.pivot,2,colMeans(data.pivot,na.rm=TRUE),'/'), 1)

Let’s now multiply by overall rating average to get it back on the same scale

data.normalized <- data.normalized*mean(data$rating, na.rm=TRUE)
rbind(round(rowMeans(data.pivot,na.rm=TRUE), 1),
round(rowMeans(data.normalized,na.rm=TRUE), 1))

##      Bangkok Dangerous Con Air Face/Off National Treasure Raising Arizona
## [1,]               2.0     4.3      3.4               3.2             4.5
## [2,]               2.5     4.0      3.3               3.4             4.4
##      The Rock
## [1,]      4.5
## [2,]      4.0

Looking at iot this way, action classics “The Rock and”Con Air" get taken down a notch while Bangkok Dangerous looks less pitiful

Data-607-Assignment-2

Steven Ellingson

September 5, 2019