DATA607 Assignment 2

Overview

In this assignment, I used Google Forms (https://forms.gle/y6mbTf7X4EEW2nu7A) to collect ratings for six movies. I then used MySQL Workbench’s Table Import function to import the csv of the survey results as a table. The raw data included the email addresses provided by those who did the survey, the timestamp it was received, and the ‘ratings’ for each of the six movies, if they had seen them. Once I had the data imported into MySql, I normalized the data by splitting it into three tables: person, survey, and survey_response with the proper primary and foreign keys. This will allow my database to handle any future surveys of most formats.

Also, I used a keyring to store db password. There is likely a way to do this more efficiently, such as a way to hide the host, user, etc, possibly with a DSN file, but I have not explored this.

I could not, however, figure out in time how to knit with prompts for the keyring… I instead knitted/published using parameters in the Rmd.

Extracting Data from MySQL

kb <- keyring::backend_file$new()

azuredb = dbConnect(MySQL(), user=params$dbuser, password=params$dbpass, dbname=params$dbname, host=params$dbhost)

movieSurvey <- dbGetQuery(azuredb, "select pid as voter, question as movie, response as rating from survey_response where sid=1 and pid<6;")

dbDisconnect(azuredb)
## [1] TRUE
head(movieSurvey)
##   voter                     movie rating
## 1     1 Pokémon: The First Movie      5
## 2     2 Pokémon: The First Movie      4
## 3     3 Pokémon: The First Movie      4
## 4     4 Pokémon: The First Movie      4
## 5     5 Pokémon: The First Movie      5
## 6     1         Princess Mononoke      3

Let’s add some arbitrary names to our survey participants to make it more readable.

movieSurvey$voter[movieSurvey$voter == 1] <- 'Jeff'
movieSurvey$voter[movieSurvey$voter == 2] <- 'Mira'
movieSurvey$voter[movieSurvey$voter == 3] <- 'Fred'
movieSurvey$voter[movieSurvey$voter == 4] <- 'Olive'
movieSurvey$voter[movieSurvey$voter == 5] <- 'Muhammad'
movieSurvey$rating<-as.integer(movieSurvey$rating)

Data Exploration

ggplot(data = movieSurvey, aes(x = movie, y = rating, fill = voter)) + 
  geom_bar(position="dodge", stat="identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From first glance, it is obvious from this limited data that some movies are missing ratings from those surveyed. The most obvious is The Last Airbender. Olive is missing votes for Sahara and The Last Airbender. Mira is just missing The Last Airbender.

To make sure our data is ‘clean’, let’s see whether the data is actually missing or has a value that may cause issues with anything we try to do with this data.

subset(movieSurvey, voter=='Mira' | voter=='Olive')
##    voter                     movie rating
## 2   Mira Pokémon: The First Movie      4
## 4  Olive Pokémon: The First Movie      4
## 7   Mira         Princess Mononoke      3
## 9  Olive         Princess Mononoke      3
## 12  Mira       Howls Moving Castle      4
## 14 Olive       Howls Moving Castle      5
## 17  Mira                    Sahara      3
## 19 Olive                    Sahara      0
## 22  Mira       Saving Private Ryan      3
## 24 Olive       Saving Private Ryan      3
## 27  Mira The Last Airbender (2010)      0
## 29 Olive The Last Airbender (2010)      0

It looks like our data for movies not seen results in a default value of 0. We should definitely remove these values, at least until we can think of a better way to handle this. Otherwise, 0 will cause issues when doing any kind of analysis against the ratings values, as 0 will alter the mean, median, etc.

movieSurvey <- subset(movieSurvey, rating!=0)
ggplot(data = movieSurvey, aes(x = movie, y = rating, fill = voter)) + 
  geom_bar(position="dodge", stat="identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))