For Homework 2, I distributed a movie ranking Google Form to a group of friends, and uploaded the survey results to a SQL database. The data was then downloaded to R where I performed several transformations and visualizations.
A movie ranking survey created in Google forms was distributed to a population of students at Worcester Polytechnic Institute (my alma mater) through a Discord server. Movies were selected to cater towards the interests of the average STEM undergraduate student to increase the likelihood that participants had seen all six movies on the survey and could complete it successfully. Participants were asked to rank movies on a scale from 1 to 5, where 1 indicated that the movie was awful, and 5 indicated that the movie was excellent. If the survey participant had not seen the movie, they were instructed to rank it as 0. The survey was exported from Google Drive as a .csv file and upload to Github for reproducibility purposes. Participant names have been anonymized to preserve participant privacy.
The first step was to create a new database in MySQL.
CREATE DATABASE hw2_607;
Next, I created a table in the database called responses.
CREATE TABLE
hw2_607.responses(idVARCHAR(45) NOT NULL,jojo_rankINT NULL,joker_rankINT NULL,endgame_rankINT NULL,rocket_rankINT NULL,little_rankINT NULL,sonic_rankINT NULL, PRIMARY KEY (id));
Lastly, I used MySQL’s Table Import Wizard to import data from my .csv file into the responses table.
library(RMySQL)
## Loading required package: DBI
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(wesanderson)
library(keyring)
key_set('user', 'test')
## Please enter password in TK window (Alt+Tab)
db = dbConnect(MySQL(), user='test', password=key_get('user', 'test'), dbname='hw2_607', host='localhost')
movie_data <- dbGetQuery(db, "SELECT * FROM responses")
colnames(movie_data)[1] <- gsub('^...','',colnames(movie_data)[1])
movie_data %>%
kbl(caption = "Original Movie Ranking Data Exported from SQL") %>%
kable_styling(bootstrap_options = "striped")
| id | jojo_rank | joker_rank | endgame_rank | rocket_rank | little_rank | sonic_rank |
|---|---|---|---|---|---|---|
| A | 4 | 3 | 3 | 4 | 4 | 2 |
| B | 0 | 3 | 4 | 4 | 0 | 0 |
| C | 4 | 3 | 5 | 3 | 4 | 1 |
| D | 4 | 3 | 2 | 4 | 5 | 2 |
| E | 5 | 4 | 3 | 3 | 4 | 4 |
| F | 5 | 4 | 5 | 5 | 0 | 1 |
Each participant has been identified with an id instead of with their name. However, if we did need to retrieve the participant’s names, those are stored in a separate table in the database. For the purpose of this assignment, the names have been fictionalized.
names <- dbGetQuery(db, "SELECT * FROM names")
colnames(names)[1] <- gsub('^...','',colnames(names)[1])
names %>%
kbl(caption = "Names and IDs") %>%
kable_styling(bootstrap_options = "striped")
| id | name |
|---|---|
| A | Adelbert Anderson |
| B | Boris Brown |
| C | Cordelia Cohen |
| D | Dorcas Donovan |
| E | Ernest Englewort |
| F | Felicity Forrester |
In this scenario, a score of 0 means the participant did not see the movie. Let’s make sure that we don’t include that when we compute rankings.
movie_data[movie_data == 0] <- NA
movie_data %>%
kbl(caption="Movie Rankings With Missing Values Handled") %>%
kable_styling(bootstrap_options = "striped")
| id | jojo_rank | joker_rank | endgame_rank | rocket_rank | little_rank | sonic_rank |
|---|---|---|---|---|---|---|
| A | 4 | 3 | 3 | 4 | 4 | 2 |
| B | NA | 3 | 4 | 4 | NA | NA |
| C | 4 | 3 | 5 | 3 | 4 | 1 |
| D | 4 | 3 | 2 | 4 | 5 | 2 |
| E | 5 | 4 | 3 | 3 | 4 | 4 |
| F | 5 | 4 | 5 | 5 | NA | 1 |
To further analyze the data, I created a bar graph with the average movie rankings using ggplot and customized the colors using the wesanderson Zissou1 color palette. Missing rankings were not included in the averages.
jojo_mean <- mean(movie_data[,'jojo_rank'], na.rm=TRUE)
joker_mean <- mean(movie_data[,'joker_rank'], na.rm=TRUE)
endgame_mean <- mean(movie_data[,'endgame_rank'], na.rm=TRUE)
rocket_mean <- mean(movie_data[,'rocket_rank'], na.rm=TRUE)
little_mean <- mean(movie_data[,'little_rank'], na.rm=TRUE)
sonic_mean <- mean(movie_data[,'sonic_rank'], na.rm=TRUE)
mean_df <- data.frame("Movie_Name"= c('Jojo Rabbit', 'The Joker', 'The Avengers: End Game', 'Rocketman', 'Little Women', 'Sonic the Hedgehog'), 'Average_Ranking'= c(jojo_mean, joker_mean, endgame_mean, rocket_mean, little_mean, sonic_mean))
ggplot(mean_df, aes(x=Average_Ranking, y=Movie_Name, fill=Movie_Name)) + geom_bar(stat="identity") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + theme(axis.line = element_line(colour = "black")) + scale_fill_manual(values=wes_palette( name="Zissou1", 6, type="continuous")) +labs(title= "Average Movie Ranking Bar Graph")
Based on the original data set, there are a few clear recommendations to make . Participant A’s top ranked movies were Jojo Rabbit, Rocket Man, and Little Women, indicating a preference for historical films. Based on this, I would recommend they watch Bohemian Rhapsody, which is a historical film about the band Queen. Participant C only ranked Avengers: End Game as a 5, which may denote a preference for films set in the Marvel Cinematic Universe (MCU). Based on this, they may enjoy other movies set in the MCU, such as Avengers: Infinity War or Doctor Strange.