Description

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.

The films used in this study were selected by Googling “2018 recent films” and getting the first 6 films listed in the search results.

What’s Been Done So Far

The ‘Week 2 Assignment.sql’ script has been executed which has set up the flix database with the ratings.

Connect To The Database

Using settings in a hidden file, connect to the database.

library(RMySQL)
library(ggplot2)
library(dplyr)
source('hidden/MySQL Settings.R')
conn <- dbConnect(MySQL(), user=user, password=password, dbname='flix', host=host)

Pull Data From The Database

Pull data via the connection.

sql <- 'SELECT `films`.`title`, 
`reviewers`.`name`, 
`ratings`.`rating`
FROM `ratings`
JOIN `films` ON `films`.`id` = `ratings`.`films_id`
JOIN `reviewers` ON `reviewers`.`id` = `ratings`.`reviewers_id`'
df <- dbGetQuery(conn, sql)

# Disconnect from the database
disconnected <- dbDisconnect(conn)

Exploratory Data Analysis

Now that the data is loaded into a dataframe let’s take a look at the data.

dim(df)
## [1] 30  3
names(df)
## [1] "title"  "name"   "rating"

Now for a preview of the data

head(df)
##                    title                         name rating
## 1 Avengers: Infinity War                         json    4.0
## 2 Avengers: Infinity War Dr. Christensen, the Supreme    3.0
## 3 Avengers: Infinity War                        Jesse    1.0
## 4 Avengers: Infinity War                          Vid    1.5
## 5 Avengers: Infinity War                       FamGuy    4.0
## 6          Black Panther                         json    5.0
ggplot(df, aes(x=title, y=rating, fill=title)) + 
  geom_boxplot() + 
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

The data is centered arround 3 with the exception of Annihilaton and Deadpool 2 which have a lower rating.

df %>%
  group_by(title) %>%
  summarise(min=min(rating),
            mean=mean(rating),
            median=median(rating),
            max=max(rating))
## # A tibble: 6 x 5
##   title                         min  mean median   max
##   <chr>                       <dbl> <dbl>  <dbl> <dbl>
## 1 Annihilation                    1   1.9    2       3
## 2 Avengers: Infinity War          1   2.7    3       4
## 3 Black Panther                   1   3.4    3       5
## 4 Deadpool 2                      1   1.9    1.5     3
## 5 Incredibles 2                   3   3.3    3       4
## 6 Mission Impossible: Fallout     1   2.5    3       4