week 2 Assignment

By Brian Weinfeld

February 8, 2018

All of my students filled out a Google form asking them to rate each of the 6 highest grossing movies of 2017 on a scale from 0-5 based on how much they enjoyed the film. 0 was used to indicate the student had not seen the movie as Google forms did not appear to have an N/A option. The data was stored and downloaded in a csv file.


The data was read in to r and lightly modified for better display.

raw <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment2/master/week2assignmentMovies.csv',
                header=TRUE, stringsAsFactors=FALSE)
raw <- cbind(raw, ID=1:nrow(raw))
colnames(raw) <- lapply(names(raw), FUN=function(x){gsub('(\\.)+', ' ', x) %>% toupper()})
kable(head(raw, n=10))
TIMESTAMP EMAIL ADDRESS STAR WARS THE LAST JEDI BEAUTY AND THE BEAST 2017 WONDER WOMAN GUARDIANS OF THE GALAXY VOL 2 JUMANJI WELCOME TO THE JUNGLE SPIDER MAN HOMECOMING ID
2/5/2018 7:25:53 Redacted Unique 01 3 1 2 3 3 4 1
2/5/2018 7:27:25 Redacted Unique 02 5 4 0 0 5 0 2
2/5/2018 7:43:56 Redacted Unique 03 5 0 0 5 0 5 3
2/5/2018 7:45:10 Redacted Unique 04 3 0 4 3 0 5 4
2/5/2018 7:46:36 Redacted Unique 05 4 0 0 5 0 0 5
2/5/2018 7:49:44 Redacted Unique 06 3 0 0 3 0 0 6
2/5/2018 7:55:32 Redacted Unique 07 5 0 0 4 0 0 7
2/5/2018 7:55:36 Redacted Unique 08 0 3 3 3 0 0 8
2/5/2018 7:55:45 Redacted Unique 09 4 0 5 0 0 0 9
2/5/2018 7:55:58 Redacted Unique 10 4 0 4 5 0 0 10

I ran into a small problem with the way the data was formatted. While the csv file was formatting nicely by Google for import into R, it was not formatted properly to be stored in a MYSQL table. Ideally, there would be two tables, one for users containing the timestamp and email address and one for each individual movie rating containing the user’s id, the movie they rated and their rating.


I decided to modify the table to create two csv files that could be read into MYSQL to create properly formatted tables. Then those tables could be queried in R. This is clearly a roundabout solution, but I did not feel it appropriate to simply read the original csv into MYSQL and then query it all into R. This did not seem to be in the spirit of the original project.

users <- subset(raw, select=c(1, 2))
reviews <- melt(subset(raw, select=3:ncol(raw)), id=c('ID'))

#write.csv(users, 'local-pathway', eol='\n')
#write.csv(reviews, 'local-pathway', eol = "\n")
kable(head(users))
TIMESTAMP EMAIL ADDRESS
2/5/2018 7:25:53 Redacted Unique 01
2/5/2018 7:27:25 Redacted Unique 02
2/5/2018 7:43:56 Redacted Unique 03
2/5/2018 7:45:10 Redacted Unique 04
2/5/2018 7:46:36 Redacted Unique 05
2/5/2018 7:49:44 Redacted Unique 06
kable(rbind(head(reviews), tail(reviews)))
ID variable value
1 1 STAR WARS THE LAST JEDI 3
2 2 STAR WARS THE LAST JEDI 5
3 3 STAR WARS THE LAST JEDI 5
4 4 STAR WARS THE LAST JEDI 3
5 5 STAR WARS THE LAST JEDI 4
6 6 STAR WARS THE LAST JEDI 3
493 78 SPIDER MAN HOMECOMING 4
494 79 SPIDER MAN HOMECOMING 0
495 80 SPIDER MAN HOMECOMING 0
496 81 SPIDER MAN HOMECOMING 5
497 82 SPIDER MAN HOMECOMING 5
498 83 SPIDER MAN HOMECOMING 0

I wrote an SQL script to read in both of these csv files and store them in a database. The script is recreated below. In addition to reading in the csv files I converted the timestamp column from a String to a DateTime and replaced all the 0 scores with Null.

DROP TABLE IF EXISTS reviews;
DROP TABLE IF EXISTS users;

CREATE TABLE users(
 user_id INTEGER PRIMARY KEY NOT NULL,
 timestamp DATETIME NOT NULL,
 email VARCHAR(50)
);

CREATE TABLE reviews(
 review_id INTEGER PRIMARY KEY NOT NULL,
 user_id INTEGER NOT NULL,
 movie VARCHAR(50) NOT NULL,
 rating INTEGER,
 foreign key(user_id) references users(user_id)
);

LOAD DATA LOCAL INFILE 'local-pathway' 
INTO TABLE users
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(user_id, @timestamp, email)
SET 
timestamp = STR_TO_DATE(@timestamp, '%d/%m/%Y %H:%i:%s')
;

LOAD DATA LOCAL INFILE 'local-pathway' 
INTO TABLE reviews
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(review_id, user_id, movie, @rating)
SET
rating = nullif(@rating,0)
;

Using the RMySQL library, I connected to the database and read the relevant information into R.

#mydb <- dbConnect(MySQL(), user='root', password='SECRET', dbname='week2assignment', host='localhost')
rb <- dbSendQuery(mydb, 'SELECT users.email, reviews.movie, reviews.rating FROM users JOIN reviews ON users.user_id = reviews.user_id')
data <- fetch(rb, n=-1) %>%
  dcast(formula=email~movie)
kable(head(data))
email BEAUTY AND THE BEAST 2017 GUARDIANS OF THE GALAXY VOL 2 JUMANJI WELCOME TO THE JUNGLE SPIDER MAN HOMECOMING STAR WARS THE LAST JEDI WONDER WOMAN
Redacted Unique 1 1 3 3 4 3 2
Redacted Unique 10 NA 5 NA NA 4 4
Redacted Unique 11 NA 5 NA 4 4 2
Redacted Unique 12 NA 4 4 4 4 5
Redacted Unique 13 NA 4 NA 4 3 5
Redacted Unique 14 3 5 NA NA NA 4

I wanted to have a total count of each response (including N/A) so I created a new data frame that summarized this information.

moded <- data.frame(t(rbind(apply(data[,2:ncol(data)], 2, FUN=function(x){table(factor(x, levels=1:5), useNA = 'always')}))))
colnames(moded) <- c(1:5, 'Not Seen')
kable(moded)
1 2 3 4 5 Not Seen
BEAUTY AND THE BEAST 2017 1 3 7 3 2 67
GUARDIANS OF THE GALAXY VOL 2 0 4 11 23 23 22
JUMANJI WELCOME TO THE JUNGLE 0 2 1 5 7 68
SPIDER MAN HOMECOMING 0 2 1 17 21 42
STAR WARS THE LAST JEDI 2 2 11 21 19 28
WONDER WOMAN 0 6 7 17 13 40

Finally, I used the new data frame to display the proportion of rating that each movie received. I created a second, similar plot that excluded N/A responses as I believe that provided additional insight into my student’s movie preferences.

moded$Movie <- rownames(moded)
melt(moded) %>%
  ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green', 'black'))

moded['Not Seen'] <- NULL
melt(moded) %>%
  ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green'))

Amongst my high school students, the most popular of the top 6 films of 2017 was Spider-Man: Homecoming while the most viewed was Guardians of the Galaxy 2. Unsurprisingly, the least liked movie and 2nd to least viewed was Beauty and the Beast, a movie whose target demographic consisted of families with younger children.