All of my students filled out a Google form asking them to rate each of the 6 highest grossing movies of 2017 on a scale from 0-5 based on how much they enjoyed the film. 0 was used to indicate the student had not seen the movie as Google forms did not appear to have an N/A option. The data was stored and downloaded in a csv file.
The data was read in to r and lightly modified for better display.
raw <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment2/master/week2assignmentMovies.csv',
header=TRUE, stringsAsFactors=FALSE)
raw <- cbind(raw, ID=1:nrow(raw))
colnames(raw) <- lapply(names(raw), FUN=function(x){gsub('(\\.)+', ' ', x) %>% toupper()})
kable(head(raw, n=10))
| TIMESTAMP | EMAIL ADDRESS | STAR WARS THE LAST JEDI | BEAUTY AND THE BEAST 2017 | WONDER WOMAN | GUARDIANS OF THE GALAXY VOL 2 | JUMANJI WELCOME TO THE JUNGLE | SPIDER MAN HOMECOMING | ID |
|---|---|---|---|---|---|---|---|---|
| 2/5/2018 7:25:53 | Redacted Unique 01 | 3 | 1 | 2 | 3 | 3 | 4 | 1 |
| 2/5/2018 7:27:25 | Redacted Unique 02 | 5 | 4 | 0 | 0 | 5 | 0 | 2 |
| 2/5/2018 7:43:56 | Redacted Unique 03 | 5 | 0 | 0 | 5 | 0 | 5 | 3 |
| 2/5/2018 7:45:10 | Redacted Unique 04 | 3 | 0 | 4 | 3 | 0 | 5 | 4 |
| 2/5/2018 7:46:36 | Redacted Unique 05 | 4 | 0 | 0 | 5 | 0 | 0 | 5 |
| 2/5/2018 7:49:44 | Redacted Unique 06 | 3 | 0 | 0 | 3 | 0 | 0 | 6 |
| 2/5/2018 7:55:32 | Redacted Unique 07 | 5 | 0 | 0 | 4 | 0 | 0 | 7 |
| 2/5/2018 7:55:36 | Redacted Unique 08 | 0 | 3 | 3 | 3 | 0 | 0 | 8 |
| 2/5/2018 7:55:45 | Redacted Unique 09 | 4 | 0 | 5 | 0 | 0 | 0 | 9 |
| 2/5/2018 7:55:58 | Redacted Unique 10 | 4 | 0 | 4 | 5 | 0 | 0 | 10 |
I ran into a small problem with the way the data was formatted. While the csv file was formatting nicely by Google for import into R, it was not formatted properly to be stored in a MYSQL table. Ideally, there would be two tables, one for users containing the timestamp and email address and one for each individual movie rating containing the user’s id, the movie they rated and their rating.
I decided to modify the table to create two csv files that could be read into MYSQL to create properly formatted tables. Then those tables could be queried in R. This is clearly a roundabout solution, but I did not feel it appropriate to simply read the original csv into MYSQL and then query it all into R. This did not seem to be in the spirit of the original project.
users <- subset(raw, select=c(1, 2))
reviews <- melt(subset(raw, select=3:ncol(raw)), id=c('ID'))
#write.csv(users, 'local-pathway', eol='\n')
#write.csv(reviews, 'local-pathway', eol = "\n")
kable(head(users))
| TIMESTAMP | EMAIL ADDRESS |
|---|---|
| 2/5/2018 7:25:53 | Redacted Unique 01 |
| 2/5/2018 7:27:25 | Redacted Unique 02 |
| 2/5/2018 7:43:56 | Redacted Unique 03 |
| 2/5/2018 7:45:10 | Redacted Unique 04 |
| 2/5/2018 7:46:36 | Redacted Unique 05 |
| 2/5/2018 7:49:44 | Redacted Unique 06 |
kable(rbind(head(reviews), tail(reviews)))
| ID | variable | value | |
|---|---|---|---|
| 1 | 1 | STAR WARS THE LAST JEDI | 3 |
| 2 | 2 | STAR WARS THE LAST JEDI | 5 |
| 3 | 3 | STAR WARS THE LAST JEDI | 5 |
| 4 | 4 | STAR WARS THE LAST JEDI | 3 |
| 5 | 5 | STAR WARS THE LAST JEDI | 4 |
| 6 | 6 | STAR WARS THE LAST JEDI | 3 |
| 493 | 78 | SPIDER MAN HOMECOMING | 4 |
| 494 | 79 | SPIDER MAN HOMECOMING | 0 |
| 495 | 80 | SPIDER MAN HOMECOMING | 0 |
| 496 | 81 | SPIDER MAN HOMECOMING | 5 |
| 497 | 82 | SPIDER MAN HOMECOMING | 5 |
| 498 | 83 | SPIDER MAN HOMECOMING | 0 |
I wrote an SQL script to read in both of these csv files and store them in a database. The script is recreated below. In addition to reading in the csv files I converted the timestamp column from a String to a DateTime and replaced all the 0 scores with Null.
DROP TABLE IF EXISTS reviews; DROP TABLE IF EXISTS users; CREATE TABLE users( user_id INTEGER PRIMARY KEY NOT NULL, timestamp DATETIME NOT NULL, email VARCHAR(50) ); CREATE TABLE reviews( review_id INTEGER PRIMARY KEY NOT NULL, user_id INTEGER NOT NULL, movie VARCHAR(50) NOT NULL, rating INTEGER, foreign key(user_id) references users(user_id) ); LOAD DATA LOCAL INFILE 'local-pathway' INTO TABLE users FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS(user_id, @timestamp, email) SET timestamp = STR_TO_DATE(@timestamp, '%d/%m/%Y %H:%i:%s') ; LOAD DATA LOCAL INFILE 'local-pathway' INTO TABLE reviews FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS(review_id, user_id, movie, @rating) SET rating = nullif(@rating,0) ;
Using the RMySQL library, I connected to the database and read the relevant information into R.
#mydb <- dbConnect(MySQL(), user='root', password='SECRET', dbname='week2assignment', host='localhost')
rb <- dbSendQuery(mydb, 'SELECT users.email, reviews.movie, reviews.rating FROM users JOIN reviews ON users.user_id = reviews.user_id')
data <- fetch(rb, n=-1) %>%
dcast(formula=email~movie)
kable(head(data))
| BEAUTY AND THE BEAST 2017 | GUARDIANS OF THE GALAXY VOL 2 | JUMANJI WELCOME TO THE JUNGLE | SPIDER MAN HOMECOMING | STAR WARS THE LAST JEDI | WONDER WOMAN | |
|---|---|---|---|---|---|---|
| Redacted Unique 1 | 1 | 3 | 3 | 4 | 3 | 2 |
| Redacted Unique 10 | NA | 5 | NA | NA | 4 | 4 |
| Redacted Unique 11 | NA | 5 | NA | 4 | 4 | 2 |
| Redacted Unique 12 | NA | 4 | 4 | 4 | 4 | 5 |
| Redacted Unique 13 | NA | 4 | NA | 4 | 3 | 5 |
| Redacted Unique 14 | 3 | 5 | NA | NA | NA | 4 |
I wanted to have a total count of each response (including N/A) so I created a new data frame that summarized this information.
moded <- data.frame(t(rbind(apply(data[,2:ncol(data)], 2, FUN=function(x){table(factor(x, levels=1:5), useNA = 'always')}))))
colnames(moded) <- c(1:5, 'Not Seen')
kable(moded)
| 1 | 2 | 3 | 4 | 5 | Not Seen | |
|---|---|---|---|---|---|---|
| BEAUTY AND THE BEAST 2017 | 1 | 3 | 7 | 3 | 2 | 67 |
| GUARDIANS OF THE GALAXY VOL 2 | 0 | 4 | 11 | 23 | 23 | 22 |
| JUMANJI WELCOME TO THE JUNGLE | 0 | 2 | 1 | 5 | 7 | 68 |
| SPIDER MAN HOMECOMING | 0 | 2 | 1 | 17 | 21 | 42 |
| STAR WARS THE LAST JEDI | 2 | 2 | 11 | 21 | 19 | 28 |
| WONDER WOMAN | 0 | 6 | 7 | 17 | 13 | 40 |
Finally, I used the new data frame to display the proportion of rating that each movie received. I created a second, similar plot that excluded N/A responses as I believe that provided additional insight into my student’s movie preferences.
moded$Movie <- rownames(moded)
melt(moded) %>%
ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green', 'black'))
moded['Not Seen'] <- NULL
melt(moded) %>%
ggplot(aes(x=Movie, weight=value, fill=variable)) + geom_bar(position=position_fill(reverse=TRUE)) + labs(title='Proportion of Scores') + coord_flip() + labs(x='Proportion', fill='Score') + scale_fill_manual(values=c("red", "orange", "yellow", 'blue', 'green'))
Amongst my high school students, the most popular of the top 6 films of 2017 was Spider-Man: Homecoming while the most viewed was Guardians of the Galaxy 2. Unsurprisingly, the least liked movie and 2nd to least viewed was Beauty and the Beast, a movie whose target demographic consisted of families with younger children.