library(tidyverse)
library(RMySQL)
The movie data was queried in the SQL script and output as a CSV file, then uploaded to GitHub. For the first approach, let’s read in the movie data from the CSV file.
# read CSV file saved at GitHub
url <- "https://raw.githubusercontent.com/kecbenson/Data_607_Wk2/master/movie_ratings.csv"
df <- read_csv(url, col_names = c("Movie", "Rating", "Viewer"))
Now review the dataframe and ensure that it loaded correctly.
# review the dataframe
df
## # A tibble: 24 x 3
## Movie Rating Viewer
## <chr> <int> <chr>
## 1 Crazy Rich Asians 5 Matt
## 2 Incredibles 2 5 Matt
## 3 Ant-Man and the Wasp 4 Matt
## 4 Jurassic World 4 Matt
## 5 The Predator 3 Matt
## 6 Mission Impossible 15 5 Matt
## 7 Crazy Rich Asians 4 Kelly
## 8 Incredibles 2 5 Kelly
## 9 Jurassic World 2 Kelly
## 10 Mission Impossible 15 4 Kelly
## # ... with 14 more rows
summary(df)
## Movie Rating Viewer
## Length:24 Min. :1.000 Length:24
## Class :character 1st Qu.:2.750 Class :character
## Mode :character Median :3.500 Mode :character
## Mean :3.458
## 3rd Qu.:5.000
## Max. :5.000
class(df)
## [1] "tbl_df" "tbl" "data.frame"
Alternatively, we can import the data directly from the MySQL database, by using the “RMySQL” package. In this second approach, we’ll need to connect to the database and pass the SQL query to load the data.
# use "RMySQL" package to query MySQL database
movies_db <- dbConnect(MySQL(), user=usr, password=pwd, dbname='data607_wk2_movies', host='localhost')
dbListTables(movies_db)
## [1] "movies" "ratings" "viewers"
query <- "SELECT m.movie_name AS 'Movie', r.rating as 'Rating', v.viewer_name AS 'Viewer'
FROM Ratings AS r
INNER JOIN Viewers AS v
ON r.viewer_id = v.viewer_id
INNER JOIN Movies as m
ON r.movie_id = m.movie_id;"
df2 <- as_tibble(dbGetQuery(movies_db, query))
dbDisconnect(movies_db)
## [1] TRUE
Review the dataframe and make sure it loaded correctly.
df2
## # A tibble: 24 x 3
## Movie Rating Viewer
## <chr> <int> <chr>
## 1 Crazy Rich Asians 5 Matt
## 2 Incredibles 2 5 Matt
## 3 Ant-Man and the Wasp 4 Matt
## 4 Jurassic World 4 Matt
## 5 The Predator 3 Matt
## 6 Mission Impossible 15 5 Matt
## 7 Crazy Rich Asians 4 Kelly
## 8 Incredibles 2 5 Kelly
## 9 Jurassic World 2 Kelly
## 10 Mission Impossible 15 4 Kelly
## # ... with 14 more rows
summary(df2)
## Movie Rating Viewer
## Length:24 Min. :1.000 Length:24
## Class :character 1st Qu.:2.750 Class :character
## Mode :character Median :3.500 Mode :character
## Mean :3.458
## 3rd Qu.:5.000
## Max. :5.000
class(df2)
## [1] "tbl_df" "tbl" "data.frame"
Let’s look at average, min and max ratings by movie.
df %>% group_by(Movie) %>%
summarize(Count = n(), Avg = mean(Rating), Min = min(Rating), Max = max(Rating)) %>%
arrange(desc(Avg))
## # A tibble: 6 x 5
## Movie Count Avg Min Max
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Incredibles 2 3 4.33 3 5
## 2 Crazy Rich Asians 5 3.6 2 5
## 3 Jurassic World 5 3.4 1 5
## 4 Mission Impossible 15 5 3.4 2 5
## 5 Ant-Man and the Wasp 4 3.25 1 5
## 6 The Predator 2 2.5 2 3
We can also plot the rating distribution (histogram) for all movies.
ggplot(data = df) + geom_bar(mapping = aes(x = Rating)) +
labs(x = "Rating (5=Best, 1=Worst)", y = "Count", title = "Distribution of all movie ratings")
ggplot(data = df) + geom_bar(mapping = aes(x = Rating, fill = Movie)) + facet_wrap(~ Movie, nrow = 2) +
labs(x = "Rating (5=Best, 1=Worst)", y = "Count", title = "Distribution of ratings by movie")