DATA 607 - Week 2 Assignment

library(tidyverse)
library(RMySQL)

A. Import movie data from CSV file output from SQL (Option 1)

The movie data was queried in the SQL script and output as a CSV file, then uploaded to GitHub. For the first approach, let’s read in the movie data from the CSV file.

# read CSV file saved at GitHub 
url <- "https://raw.githubusercontent.com/kecbenson/Data_607_Wk2/master/movie_ratings.csv"
df <- read_csv(url, col_names = c("Movie", "Rating", "Viewer"))

Now review the dataframe and ensure that it loaded correctly.

# review the dataframe 
df

## # A tibble: 24 x 3
##    Movie                 Rating Viewer
##    <chr>                  <int> <chr> 
##  1 Crazy Rich Asians          5 Matt  
##  2 Incredibles 2              5 Matt  
##  3 Ant-Man and the Wasp       4 Matt  
##  4 Jurassic World             4 Matt  
##  5 The Predator               3 Matt  
##  6 Mission Impossible 15      5 Matt  
##  7 Crazy Rich Asians          4 Kelly 
##  8 Incredibles 2              5 Kelly 
##  9 Jurassic World             2 Kelly 
## 10 Mission Impossible 15      4 Kelly 
## # ... with 14 more rows

summary(df)

##     Movie               Rating         Viewer         
##  Length:24          Min.   :1.000   Length:24         
##  Class :character   1st Qu.:2.750   Class :character  
##  Mode  :character   Median :3.500   Mode  :character  
##                     Mean   :3.458                     
##                     3rd Qu.:5.000                     
##                     Max.   :5.000

class(df)

## [1] "tbl_df"     "tbl"        "data.frame"

B. Import movie data directly from MySQL database (Option 2)

Alternatively, we can import the data directly from the MySQL database, by using the “RMySQL” package. In this second approach, we’ll need to connect to the database and pass the SQL query to load the data.

# use "RMySQL" package to query MySQL database
movies_db <- dbConnect(MySQL(), user=usr, password=pwd, dbname='data607_wk2_movies', host='localhost')
dbListTables(movies_db)

## [1] "movies"  "ratings" "viewers"

query <- "SELECT m.movie_name AS 'Movie', r.rating as 'Rating', v.viewer_name AS 'Viewer'
            FROM Ratings AS r
            INNER JOIN Viewers AS v
            ON r.viewer_id = v.viewer_id
            INNER JOIN Movies as m
            ON r.movie_id = m.movie_id;"
df2 <- as_tibble(dbGetQuery(movies_db, query))
dbDisconnect(movies_db)

## [1] TRUE

Review the dataframe and make sure it loaded correctly.

df2

## # A tibble: 24 x 3
##    Movie                 Rating Viewer
##    <chr>                  <int> <chr> 
##  1 Crazy Rich Asians          5 Matt  
##  2 Incredibles 2              5 Matt  
##  3 Ant-Man and the Wasp       4 Matt  
##  4 Jurassic World             4 Matt  
##  5 The Predator               3 Matt  
##  6 Mission Impossible 15      5 Matt  
##  7 Crazy Rich Asians          4 Kelly 
##  8 Incredibles 2              5 Kelly 
##  9 Jurassic World             2 Kelly 
## 10 Mission Impossible 15      4 Kelly 
## # ... with 14 more rows

summary(df2)

##     Movie               Rating         Viewer         
##  Length:24          Min.   :1.000   Length:24         
##  Class :character   1st Qu.:2.750   Class :character  
##  Mode  :character   Median :3.500   Mode  :character  
##                     Mean   :3.458                     
##                     3rd Qu.:5.000                     
##                     Max.   :5.000

class(df2)

## [1] "tbl_df"     "tbl"        "data.frame"

C. Review movie data and preliminary statistics

Let’s look at average, min and max ratings by movie.

df %>% group_by(Movie) %>% 
    summarize(Count = n(), Avg = mean(Rating), Min = min(Rating), Max = max(Rating)) %>% 
    arrange(desc(Avg))

## # A tibble: 6 x 5
##   Movie                 Count   Avg   Min   Max
##   <chr>                 <int> <dbl> <dbl> <dbl>
## 1 Incredibles 2             3  4.33     3     5
## 2 Crazy Rich Asians         5  3.6      2     5
## 3 Jurassic World            5  3.4      1     5
## 4 Mission Impossible 15     5  3.4      2     5
## 5 Ant-Man and the Wasp      4  3.25     1     5
## 6 The Predator              2  2.5      2     3

We can also plot the rating distribution (histogram) for all movies.

ggplot(data = df) + geom_bar(mapping = aes(x = Rating)) + 
    labs(x = "Rating (5=Best, 1=Worst)", y = "Count", title = "Distribution of all movie ratings")

ggplot(data = df) + geom_bar(mapping = aes(x = Rating, fill = Movie)) + facet_wrap(~ Movie, nrow = 2) + 
    labs(x = "Rating (5=Best, 1=Worst)", y = "Count", title = "Distribution of ratings by movie")

DATA 607 - Week 2 Assignment

Kevin Benson

September 7, 2018

A. Import movie data from CSV file output from SQL (Option 1)

B. Import movie data directly from MySQL database (Option 2)

C. Review movie data and preliminary statistics