week 2 Assignment

By Brian Weinfeld

February 8, 2018

What do my students think of the top grossing films of 2017?

All of my students filled out a Google form asking them to rate each of the 6 highest grossing movies of 2017 on a scale from 0-5 based on how much they enjoyed the film. 0 was used to indicate the student had not seen the movie as Google forms did not appear to have an N/A option. The data was stored and downloaded in a csv file.

The data was read in to r and lightly modified for better display.

raw <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment2/master/week2assignmentMovies.csv',
                header=TRUE, stringsAsFactors=FALSE)
raw <- cbind(raw, ID=1:nrow(raw))
colnames(raw) <- lapply(names(raw), FUN=function(x){gsub('(\\.)+', ' ', x) %>% toupper()})
kable(head(raw, n=10))

TIMESTAMP	EMAIL ADDRESS	STAR WARS THE LAST JEDI	BEAUTY AND THE BEAST 2017	WONDER WOMAN	GUARDIANS OF THE GALAXY VOL 2	JUMANJI WELCOME TO THE JUNGLE	SPIDER MAN HOMECOMING	ID
2/5/2018 7:25:53	Redacted Unique 01	3	1	2	3	3	4	1
2/5/2018 7:27:25	Redacted Unique 02	5	4	0	0	5	0	2
2/5/2018 7:43:56	Redacted Unique 03	5	0	0	5	0	5	3
2/5/2018 7:45:10	Redacted Unique 04	3	0	4	3	0	5	4
2/5/2018 7:46:36	Redacted Unique 05	4	0	0	5	0	0	5
2/5/2018 7:49:44	Redacted Unique 06	3	0	0	3	0	0	6
2/5/2018 7:55:32	Redacted Unique 07	5	0	0	4	0	0	7
2/5/2018 7:55:36	Redacted Unique 08	0	3	3	3	0	0	8
2/5/2018 7:55:45	Redacted Unique 09	4	0	5	0	0	0	9
2/5/2018 7:55:58	Redacted Unique 10	4	0	4	5	0	0	10

I ran into a small problem with the way the data was formatted. While the csv file was formatting nicely by Google for import into R, it was not formatted properly to be stored in a MYSQL table. Ideally, there would be two tables, one for users containing the timestamp and email address and one for each individual movie rating containing the user’s id, the movie they rated and their rating.

I decided to modify the table to create two csv files that could be read into MYSQL to create properly formatted tables. Then those tables could be queried in R. This is clearly a roundabout solution, but I did not feel it appropriate to simply read the original csv into MYSQL and then query it all into R. This did not seem to be in the spirit of the original project.

users <- subset(raw, select=c(1, 2))
reviews <- melt(subset(raw, select=3:ncol(raw)), id=c('ID'))

#write.csv(users, 'local-pathway', eol='\n')
#write.csv(reviews, 'local-pathway', eol = "\n")
kable(head(users))

TIMESTAMP	EMAIL ADDRESS
2/5/2018 7:25:53	Redacted Unique 01
2/5/2018 7:27:25	Redacted Unique 02
2/5/2018 7:43:56	Redacted Unique 03
2/5/2018 7:45:10	Redacted Unique 04
2/5/2018 7:46:36	Redacted Unique 05
2/5/2018 7:49:44	Redacted Unique 06

kable(rbind(head(reviews), tail(reviews)))

	ID	variable	value
1	1	STAR WARS THE LAST JEDI	3
2	2	STAR WARS THE LAST JEDI	5
3	3	STAR WARS THE LAST JEDI	5
4	4	STAR WARS THE LAST JEDI	3
5	5	STAR WARS THE LAST JEDI	4
6	6	STAR WARS THE LAST JEDI	3
493	78	SPIDER MAN HOMECOMING	4
494	79	SPIDER MAN HOMECOMING	0
495	80	SPIDER MAN HOMECOMING	0
496	81	SPIDER MAN HOMECOMING	5
497	82	SPIDER MAN HOMECOMING	5
498	83	SPIDER MAN HOMECOMING	0

I wrote an SQL script to read in both of these csv files and store them in a database. The script is recreated below. In addition to reading in the csv files I converted the timestamp column from a String to a DateTime and replaced all the 0 scores with Null.

DROP TABLE IF EXISTS reviews;
DROP TABLE IF EXISTS users;

CREATE TABLE users(
 user_id INTEGER PRIMARY KEY NOT NULL,
 timestamp DATETIME NOT NULL,
 email VARCHAR(50)
);

CREATE TABLE reviews(
 review_id INTEGER PRIMARY KEY NOT NULL,
 user_id INTEGER NOT NULL,
 movie VARCHAR(50) NOT NULL,
 rating INTEGER,
 foreign key(user_id) references users(user_id)
);

LOAD DATA LOCAL INFILE 'local-pathway' 
INTO TABLE users
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(user_id, @timestamp, email)
SET 
timestamp = STR_TO_DATE(@timestamp, '%d/%m/%Y %H:%i:%s')
;

LOAD DATA LOCAL INFILE 'local-pathway' 
INTO TABLE reviews
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS(review_id, user_id, movie, @rating)
SET
rating = nullif(@rating,0)
;

Using the RMySQL library, I connected to the database and read the relevant information into R.

#mydb <- dbConnect(MySQL(), user='root', password='SECRET', dbname='week2assignment', host='localhost')
rb <- dbSendQuery(mydb, 'SELECT users.email, reviews.movie, reviews.rating FROM users JOIN reviews ON users.user_id = reviews.user_id')
data <- fetch(rb, n=-1) %>%
  dcast(formula=email~movie)
kable(head(data))

email	BEAUTY AND THE BEAST 2017	GUARDIANS OF THE GALAXY VOL 2	JUMANJI WELCOME TO THE JUNGLE	SPIDER MAN HOMECOMING	STAR WARS THE LAST JEDI	WONDER WOMAN
Redacted Unique 1	1	3	3	4	3	2
Redacted Unique 10	NA	5	NA	NA	4	4
Redacted Unique 11	NA	5	NA	4	4	2
Redacted Unique 12	NA	4	4	4	4	5
Redacted Unique 13	NA	4	NA	4	3	5
Redacted Unique 14	3	5	NA	NA	NA	4

I wanted to have a total count of each response (including N/A) so I created a new data frame that summarized this information.

moded <- data.frame(t(rbind(apply(data[,2:ncol(data)], 2, FUN=function(x){table(factor(x, levels=1:5), useNA = 'always')}))))
colnames(moded) <- c(1:5, 'Not Seen')
kable(moded)

	1	2	3	4	5	Not Seen
BEAUTY AND THE BEAST 2017	1	3	7	3	2	67
GUARDIANS OF THE GALAXY VOL 2	0	4	11	23	23	22
JUMANJI WELCOME TO THE JUNGLE	0	2	1	5	7	68
SPIDER MAN HOMECOMING	0	2	1	17	21	42
STAR WARS THE LAST JEDI	2	2	11	21	19	28
WONDER WOMAN	0	6	7	17	13	40

Finally, I used the new data frame to display the proportion of rating that each movie received. I created a second, similar plot that excluded N/A responses as I believe that provided additional insight into my student’s movie preferences.

moded$Movie <- rownames(moded)
melt(moded) %>%
  ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green', 'black'))

moded['Not Seen'] <- NULL
melt(moded) %>%
  ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green'))

Amongst my high school students, the most popular of the top 6 films of 2017 was Spider-Man: Homecoming while the most viewed was Guardians of the Galaxy 2. Unsurprisingly, the least liked movie and 2nd to least viewed was Beauty and the Beast, a movie whose target demographic consisted of families with younger children.