I Listed some classic Nicolas Cage movies and had may siblings rate them on Google Sheets. If I had to do it over I would have picked some Marvel movies or something since my data ended up pretty sparse.
library(googlesheets)
library(RMySQL)
## Loading required package: DBI
library(reshape2)
Connect to local DB. redacted code will look like this:
con = dbConnect(dbDriver(“MySQL”), user=‘’, password=’‘, dbname=’cuny’, host=’’,port = 3306)
This is probably not best practice but I have all of the SQL inside of the R code. It was a bit of a challenge to get it all to work but it’s nice that it’s all together in one document, and it gets completely reset every time I change and re-run.
Drop Tables if they exist
res <- dbSendQuery(con,'drop table if exists movie_critics')
res <- dbSendQuery(con,'drop table if exists movies')
res <- dbSendQuery(con,'drop table if exists movie_ratings')
Create table to list critics
res <- dbSendQuery(con,"
create table movie_critics (
id int primary key auto_increment,
name varchar(100)
)")
Create table to list movies
res <- dbSendQuery(con,"
create table movies (
id int primary key auto_increment,
name varchar(100)
)")
Create table of ratings I created “rating” as a decimal in case someone put a half star.
res <- dbSendQuery(con,"
create table movie_ratings(
id int primary key auto_increment,
critic_id int,
movie_id int,
rating decimal(14,4)
)")
Use stored token to get access to Google Sheets
gs_auth(token = "googlesheets_token.rds")
Pull in the spreadsheet
sheet <- gs_url('https://docs.google.com/spreadsheets/d/1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E')
## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
## Putative key: 1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E
## Sheet successfully identified: "Movie Ratings"
sheet
## Spreadsheet title: Movie Ratings
## Spreadsheet author: steven.w.ellingson
## Date of googlesheets registration: 2019-09-06 01:38:12 GMT
## Date of last spreadsheet update: 2019-09-05 23:12:42 GMT
## visibility: private
## permissions: rw
## version: new
##
## Contains 5 worksheets:
## (Title): (Nominal worksheet extent as rows x columns)
## Steven: 1000 x 26
## Sarah: 1000 x 26
## David: 1000 x 26
## Katie: 1000 x 26
## Jason: 1000 x 26
##
## Key: 1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E
## Browser URL: https://docs.google.com/spreadsheets/d/1eMTKJkwgP0mZup-25iF0u67EzgFUBEJYx3V3A5Khi_E/
List out the worksheets
gs_ws_ls(sheet)
## [1] "Steven" "Sarah" "David" "Katie" "Jason"
First Insert the movie critics into the table
query <- paste("INSERT INTO movie_critics (name) VALUES", paste("('",gs_ws_ls(sheet), "')", collapse = ", ", sep = ""), sep = "")
res <- dbSendQuery(con,query)
Now let’s define a couple of functions to read our data and push into DB
This function should be run for each worksheet
It will pull id for the critic, then use the “insert_rating” function to insert the rows
read_worksheet <- function(ss, ws) {
# Get ID from table for this critic
res <- dbSendQuery(con, paste0("select id from movie_critics where lower(name) = '",ws,"'"))
data <- fetch(res,n=-1)
if (nrow(data) > 1) {
print(paste("WARNING, MULITPLE CRITICS WITH NAME",ws))
} else if (nrow(data) == 0) {
print(paste("WARNING, NO CRITICS WITH NAME",ws))
} else {
critic_id = data$id[1]
}
ratings <- gs_read(ss, ws)
apply(ratings, 1, function(x) insert_rating(ss,critic_id,x))
}
This function should be run for every each individual rating
It will get the ids for the critic and movie and insert into DB
insert_rating <- function(ss, critic_id, rating_row) {
#Extract Movie and Rating from Row
movie = rating_row['Movie']
rating = rating_row['Rating']
#Get ID for movie from table
res <- dbSendQuery(con, paste0("select id from movies where lower(name) = '",tolower(movie),"'"))
data <- fetch(res,n=-1)
if (nrow(data) > 1) {
print(paste("WARNING, MULITPLE MOVIES WITH NAME",movie))
} else if (nrow(data) == 0) {
#there is no entry for this movie, so add it!
res <- dbSendQuery(con, paste0("insert into movies (name) values ('",movie,"')"))
}
# Run Query again in case it wasn't there
res <- dbSendQuery(con, paste0("select id from movies where lower(name) = '",movie,"'"))
data <- fetch(res, n=-1)
# Assume the first one is the one we want
movie_id = data$id[1]
# I didn't give clear instructions on what to do if you haven't seen the movie
# we'll assume anything that isn't a number should go in as null, but also give a warning
if (is.na(as.numeric(rating))) {
if (is.na(rating)) {
print(paste0("Warning - N/A for ",critic_id,",",movie_id,': this is probably OK because it was N/A in google sheets'))
} else {
print(paste0("Warning - N/A for ",critic_id,",",movie_id,': "',rating,'" was changed to N/A!'))
}
# Either way, lets change to "null" so it can be inserted into DB
rating = 'null'
}
res <- dbSendQuery(con, paste0("insert into movie_ratings (critic_id,movie_id,rating) values (",critic_id,",",movie_id,",",rating,")"))
}
Use the above functions to insert the data
sapply(gs_ws_ls(sheet), function(x) read_worksheet(sheet,x))
## Accessing worksheet titled 'Steven'.
## Parsed with column specification:
## cols(
## Movie = col_character(),
## Rating = col_double()
## )
## Accessing worksheet titled 'Sarah'.
## Parsed with column specification:
## cols(
## Movie = col_character(),
## Rating = col_character()
## )
## [1] "Warning - N/A for 2,2: \"N/A\" was changed to N/A!"
## [1] "Warning - N/A for 2,3: \"N/A\" was changed to N/A!"
## Accessing worksheet titled 'David'.
## Parsed with column specification:
## cols(
## Movie = col_character(),
## Rating = col_double()
## )
## [1] "Warning - N/A for 3,1: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 3,3: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 3,4: this is probably OK because it was N/A in google sheets"
## Accessing worksheet titled 'Katie'.
## Parsed with column specification:
## cols(
## Movie = col_character(),
## Rating = col_double()
## )
## [1] "Warning - N/A for 4,1: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,2: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,3: this is probably OK because it was N/A in google sheets"
## [1] "Warning - N/A for 4,7: this is probably OK because it was N/A in google sheets"
## Accessing worksheet titled 'Jason'.
## Parsed with column specification:
## cols(
## Movie = col_character(),
## Rating = col_character()
## )
## [1] "Warning - N/A for 5,1: \"Didn't Watch\" was changed to N/A!"
## [1] "Warning - N/A for 5,4: \"Didn't Watch\" was changed to N/A!"
## Steven Sarah David Katie Jason
## [1,] ? ? ? ? ?
## [2,] ? ? ? ? ?
## [3,] ? ? ? ? ?
## [4,] ? ? ? ? ?
## [5,] ? ? ? ? ?
## [6,] ? ? ? ? ?
OK, the warnings look reasonable (changing “Didn’t Watch” and “N/A” to N/A seems ok) so lets pull our data into a data frame
query = "select
mc.name as critic,
m.name as movie,
mr.rating
from movie_ratings mr
left join movies m on mr.movie_id = m.id
left join movie_critics mc on mr.critic_id = mc.id"
res <- dbSendQuery(con, query)
data <- fetch(res, n=-1)
str(data)
## 'data.frame': 30 obs. of 3 variables:
## $ critic: chr "Steven" "Steven" "Steven" "Steven" ...
## $ movie : chr "Raising Arizona" "Con Air" "The Rock" "Bangkok Dangerous" ...
## $ rating: num 5 4 4 2 3 4 4 NA NA 2 ...
looks ok at a glance Lets change it so the critics and movies are the row and column names
data.pivot = dcast(data, movie ~ critic,value.var = "rating")
data.pivot
## movie David Jason Katie Sarah Steven
## 1 Bangkok Dangerous NA NA NA 2 2
## 2 Bangkok Dangerous2 NA NA NA NA NA
## 3 Con Air 4 5 NA NA 4
## 4 Face/Off 4 4 1 4 4
## 5 National Treasure 3 3 3 4 3
## 6 Raising Arizona NA NA NA 4 5
## 7 The Rock NA 5 NA NA 4
What the heck, there is a movie titled “Bangkok Dangerous2”. Since I don’t believe that picture got a sequel I think something is wrong.
data[data$movie=="Bangkok Dangerous2",]
## critic movie rating
## 22 Katie Bangkok Dangerous2 NA
I’m going to assume that Katie tried to put in a rating of 2 for “Bangkok Dangerous” In real life I would fix this either in the spreadsheet itself or in the database, but for this exercize I’ll just fix the data frame
data[data$movie=="Bangkok Dangerous2",'rating'] = 2
data[data$movie=="Bangkok Dangerous2",'movie'] = "Bangkok Dangerous"
Let’s try that again
data.pivot = dcast(data, movie ~ critic,value.var = "rating")
data.pivot
## movie David Jason Katie Sarah Steven
## 1 Bangkok Dangerous NA NA 2 2 2
## 2 Con Air 4 5 NA NA 4
## 3 Face/Off 4 4 1 4 4
## 4 National Treasure 3 3 3 4 3
## 5 Raising Arizona NA NA NA 4 5
## 6 The Rock NA 5 NA NA 4
lets change our row names to the movie titles
row.names(data.pivot) = data.pivot$movie
data.pivot['movie']=NULL
Since we have our data we may as well take a look at some summary statistics
Check how many people watched each movie
5-rowSums(is.na(data.pivot))
## Bangkok Dangerous Con Air Face/Off National Treasure
## 3 3 5 5
## Raising Arizona The Rock
## 2 2
National Treasure and Face/Off were seen by everyone. Only a couple saw Raising Arizona and The Rock, which is disappointing
How many responses did I get per person
5-colSums(is.na(data.pivot))
## David Jason Katie Sarah Steven
## 2 3 2 3 5
I saw all of them, which makes sense since I picked the movies. My younger siblings David and Katie are missing out.
Let’s look at the average rating by movie
round(rowMeans(data.pivot,na.rm=TRUE), 1)
## Bangkok Dangerous Con Air Face/Off National Treasure
## 2.0 4.3 3.4 3.2
## Raising Arizona The Rock
## 4.5 4.5
More evidence that Katie and David should watch Raising Arizona and The Rock
And average rating by critic
round(colMeans(data.pivot,na.rm=TRUE), 1)
## David Jason Katie Sarah Steven
## 3.7 4.2 2.0 3.5 3.7
Katie seems to be a pretty harsh critic. I wonder if we adjust the movie ratings based on the users average rating if that would adjust our results
This is a pretty silly thing to do with such a sparse dataset, but it’s my own fault I guess
data.normalized <- round(sweep(data.pivot,2,colMeans(data.pivot,na.rm=TRUE),'/'), 1)
Let’s now multiply by overall rating average to get it back on the same scale
data.normalized <- data.normalized*mean(data$rating, na.rm=TRUE)
rbind(round(rowMeans(data.pivot,na.rm=TRUE), 1),
round(rowMeans(data.normalized,na.rm=TRUE), 1))
## Bangkok Dangerous Con Air Face/Off National Treasure Raising Arizona
## [1,] 2.0 4.3 3.4 3.2 4.5
## [2,] 2.5 4.0 3.3 3.4 4.4
## The Rock
## [1,] 4.5
## [2,] 4.0
Looking at iot this way, action classics “The Rock and”Con Air" get taken down a notch while Bangkok Dangerous looks less pitiful