Week 2 Assignment

Description

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.

The films used in this study were selected by Googling “2018 recent films” and getting the first 6 films listed in the search results.

What’s Been Done So Far

The ‘Week 2 Assignment.sql’ script has been executed which has set up the flix database with the ratings.

Connect To The Database

Using settings in a hidden file, connect to the database.

library(RMySQL)
library(ggplot2)
library(dplyr)
source('hidden/MySQL Settings.R')
conn <- dbConnect(MySQL(), user=user, password=password, dbname='flix', host=host)

Pull Data From The Database

Pull data via the connection.

sql <- 'SELECT `films`.`title`, 
`reviewers`.`name`, 
`ratings`.`rating`
FROM `ratings`
JOIN `films` ON `films`.`id` = `ratings`.`films_id`
JOIN `reviewers` ON `reviewers`.`id` = `ratings`.`reviewers_id`'
df <- dbGetQuery(conn, sql)

# Disconnect from the database
disconnected <- dbDisconnect(conn)

Exploratory Data Analysis

Now that the data is loaded into a dataframe let’s take a look at the data.

dim(df)

## [1] 30  3

names(df)

## [1] "title"  "name"   "rating"

Now for a preview of the data

head(df)

##                    title                         name rating
## 1 Avengers: Infinity War                         json    4.0
## 2 Avengers: Infinity War Dr. Christensen, the Supreme    3.0
## 3 Avengers: Infinity War                        Jesse    1.0
## 4 Avengers: Infinity War                          Vid    1.5
## 5 Avengers: Infinity War                       FamGuy    4.0
## 6          Black Panther                         json    5.0

ggplot(df, aes(x=title, y=rating, fill=title)) + 
  geom_boxplot() + 
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

The data is centered arround 3 with the exception of Annihilaton and Deadpool 2 which have a lower rating.

df %>%
  group_by(title) %>%
  summarise(min=min(rating),
            mean=mean(rating),
            median=median(rating),
            max=max(rating))

## # A tibble: 6 x 5
##   title                         min  mean median   max
##   <chr>                       <dbl> <dbl>  <dbl> <dbl>
## 1 Annihilation                    1   1.9    2       3
## 2 Avengers: Infinity War          1   2.7    3       4
## 3 Black Panther                   1   3.4    3       5
## 4 Deadpool 2                      1   1.9    1.5     3
## 5 Incredibles 2                   3   3.3    3       4
## 6 Mission Impossible: Fallout     1   2.5    3       4