Load general libraries:

library(dplyr)
library(ggplot2)

Prerequisites: Generate Data and Create SQL Table

The generate_reviews.R script generates a CSV of five people’s reviews of six movies. The create_install.sql script creates a SQL table from that CSV. Compiling this R Markdown document depends on completing both steps first; alternatively, running main.sh sets everything up.

Pull from SQL

Open a connection to the bmh607 PostgreSQL database, pull out the table, and close the connection:

library(RPostgreSQL)

# connect to the local bmh607 database
drv <- dbDriver('PostgreSQL')
con <- dbConnect(drv, host='localhost', dbname='bmh607')

# pull the entire reviews table into a data frame
q <- 'SELECT * FROM bmh_movies;'
df <- dbGetQuery(con, q)

head(df)

##   review_id reviewer             movie score
## 1         1   Alexis Crazy Rich Asians     1
## 2         2     Dina Crazy Rich Asians     2
## 3         3     Zach Crazy Rich Asians     1
## 4         4  Patrick Crazy Rich Asians     3
## 5         5   Claire Crazy Rich Asians     5
## 6         6   Alexis           The Nun     1

str(df)

## 'data.frame':    30 obs. of  4 variables:
##  $ review_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ reviewer : chr  "Alexis" "Dina" "Zach" "Patrick" ...
##  $ movie    : chr  "Crazy Rich Asians" "Crazy Rich Asians" "Crazy Rich Asians" "Crazy Rich Asians" ...
##  $ score    : int  1 2 1 3 5 1 2 3 2 5 ...

# be nice to the database
dbDisconnect(con)

## [1] TRUE
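
If the query fails, the dbDisconnect() call above is never reached and the connection stays open. Below is a minimal sketch of one way to guard against that, assuming the same host and database name as above. The helper name fetch_reviews is hypothetical and not part of the original scripts.

```r
# Hypothetical helper (not from the original scripts): run one query and
# always close the connection, even when the query throws an error.
fetch_reviews <- function(query = 'SELECT * FROM bmh_movies;',
                          host = 'localhost', dbname = 'bmh607') {
  con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), host = host, dbname = dbname)
  # on.exit() runs when the function exits, normally or through an error
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, query)
}

# Usage (requires the bmh607 database from the prerequisites):
# df <- fetch_reviews()
```

Registering the disconnect with on.exit() keeps the "be nice to the database" step from depending on every line before it succeeding.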

Inspecting the Data

Below is a boxplot of the distribution of scores for each movie.

ggplot(df, aes(x=movie, y=score)) + 
  geom_boxplot()

Slender Man had the highest average score (3) and BlacKkKlansman the lowest (1), contrary to the actual box office returns! BlacKkKlansman also had the narrowest spread of scores: every reviewer rated it somewhere between terrible and below average.

Consensus was harder to reach on Crazy Rich Asians, however. Most reviewers considered it below average, but one considered it ‘average’ and another considered it superb.

df %>% filter(movie == 'Crazy Rich Asians')

##   review_id reviewer             movie score
## 1         1   Alexis Crazy Rich Asians     1
## 2         2     Dina Crazy Rich Asians     2
## 3         3     Zach Crazy Rich Asians     1
## 4         4  Patrick Crazy Rich Asians     3
## 5         5   Claire Crazy Rich Asians     5

It was easier to agree on Slender Man and The Nun, though two outlying reviewers rated both a 5.
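
The averages and spreads read off the boxplot can also be computed directly. Below is a minimal sketch with dplyr, using only the five Crazy Rich Asians rows printed above so that it runs without the database. The name cra and the summary column names are illustrative choices, not part of the original analysis.

```r
library(dplyr)

# The five Crazy Rich Asians reviews, copied from the output above
cra <- data.frame(
  reviewer = c('Alexis', 'Dina', 'Zach', 'Patrick', 'Claire'),
  movie    = 'Crazy Rich Asians',
  score    = c(1L, 2L, 1L, 3L, 5L),
  stringsAsFactors = FALSE
)

# One row per movie: center (mean, median) and spread (IQR, range)
movie_summary <- cra %>%
  group_by(movie) %>%
  summarise(
    mean_score   = mean(score),
    median_score = median(score),
    iqr_score    = IQR(score),
    min_score    = min(score),
    max_score    = max(score)
  )

movie_summary
```

Swapping cra for the full df should give one row per movie, and grouping by reviewer instead of movie would give the per-reviewer view.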

A Tangent: Notch Plots

While looking at the documentation for geom_boxplot(), I noticed the notch argument. It’s not extremely helpful in a dataset with \(n = 30\), but I imagine it could elucidate distributions better than a plain box plot when dealing with larger data. Pretty cool!

(Though with this data it prints a message rather than an error: the notches extend past the hinges of every box.)

ggplot(df, aes(x=movie, y=score)) + 
  geom_boxplot(notch=TRUE)

## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
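
The message has a simple cause. A notch spans roughly median ± 1.58 × IQR / sqrt(n), and with only five scores per movie that interval can be wider than the box itself. Below is a minimal sketch of the arithmetic in base R, using the Crazy Rich Asians scores from the output above (variable names are illustrative).

```r
# Crazy Rich Asians scores, copied from the output above
scores <- c(1, 2, 1, 3, 5)

n      <- length(scores)
med    <- median(scores)
hinges <- unname(quantile(scores, c(0.25, 0.75)))  # lower and upper box edges
iqr    <- hinges[2] - hinges[1]

# Approximate 95% interval around the median, drawn as the notch
half_width  <- 1.58 * iqr / sqrt(n)
notch_lower <- med - half_width
notch_upper <- med + half_width

# Both comparisons are TRUE here: the notch extends past both hinges
c(below_lower_hinge = notch_lower < hinges[1],
  above_upper_hinge = notch_upper > hinges[2])
```

For these five scores the notch runs from about 0.59 to 3.41 while the box only spans 1 to 3. The interval shrinks with sqrt(n), so notches become more useful with many more observations per group.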

Reviewing the Reviewers

It might also be instructive to examine the reviewers themselves. Which ones are delighted by almost any film? Which ones are especially critical?

ggplot(df, aes(x=reviewer, y=score)) + 
  geom_boxplot()

There is a tremendous amount of variation in taste. Alexis hates all of the films; maybe she’s more of a book reader? Dina, too, is a fairly critical movie watcher, with a mean score of 1.5 and low variance. Zach didn’t really enjoy many of the films either.

Claire appears to be the most discerning, with the widest spread of scores. She generally enjoyed the films but was willing to rate them anywhere from below average to superb. Patrick considered most of the films average, and only average.

DATA 607—Homework No. 2

Ben Horvath

September 9, 2018
