Assignmet 2-Alex Moyse

Introduction

I really wanted to work with a slightly larger data set so I opted to pull the The Movies Data set from Kaggle. It is a Public Domain License, which has No Copyright. This data set contains movies, reviews, summaries, genres and more. It is a great method to learn how to write with a recommendation bot.

We will first initialize our DB libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(RMariaDB)
library(DBI)

We will then define the connection to the container. In terms of implemented Security, the container forces a localhost only connection, which for the purpose of this demo, allows for easy, rapid deployments with an effective yet simple security implementation. I prefer listing tables here as it easily demonstrates a connect to a DB, and demonstrates the current status of the container.

con <- dbConnect(RMariaDB::MariaDB(), username="root", password="TestCase123.", dbname ="review_db", host="localhost")
dbListTables(con)

## [1] "movies_metadata" "ratings"

What is occurring here is that we’re reading a table named ratings, and we are loading it into a data frame in memory. Since ratings link a Movie and a value 1-5, we are going to summarize the results in a singular table, linking movieID and its average rating.

avg_ratings <- dbReadTable(con = con, name = 'ratings')
agg_ratings <- aggregate(avg_ratings$rating, list(avg_ratings$movieId), FUN=mean)
colnames(agg_ratings) <- c('Movie_ID','Rating')
head(agg_ratings)

##   Movie_ID   Rating
## 1        1 3.872470
## 2        2 3.401869
## 3        3 3.161017
## 4        4 2.384615
## 5        5 3.267857
## 6        6 3.884615

At this point, we are going to read the movies_metadata table into a dataframe memory. This database is fairly large, and since it’s loaded in in bulk, its not the most effective operation if you are hardware limited, but great in terms of managing number of connections.

movie_metadata <- dbReadTable(con = con, name = 'movies_metadata')

Now I want to see the trend in movie ratings over time. So I need to join the movie_metadata Dataframe, which has the release date in it, to the average rating. For any resulting joins that do not have values, I will drop and exclude them from the set.

agg_ratings$title <- "NAN"
agg_ratings$Movie_ID <- as.character(agg_ratings$Movie_ID)
combo <-left_join(agg_ratings, movie_metadata, by = c("Movie_ID" = "id"))
combo$release_year = substr(combo$release_date,1,4)
movie_ratings = subset(combo, select = c(original_title,  Rating, release_year) )
clean_movie_ratings <- na.omit(movie_ratings) 
head(clean_movie_ratings)

##          original_title   Rating release_year
## 2                 Ariel 3.401869         1988
## 3  Varjoja paratiisissa 3.161017         1986
## 5            Four Rooms 3.267857         1995
## 6        Judgment Night 3.884615         1993
## 11            Star Wars 3.689024         1977
## 12         Finding Nemo 2.861111         2003

At this point, I just want to visualize the data a bit, and see if there is anything that sticks out to me as interesting.

At this point, we have a simple chart, fully loaded set, and have cleaned the basic irregularities from the data. Further steps may by to remove older movies, because silent movies may have biased results due to the typical population who watch them are hyper focused on them. In addition, it may make sense to remove movies under a certain number of views.

Assignmet 2-Alex Moyse

Introduction

References

The Movie Dataset-https://www.kaggle.com/rounakbanik/the-movies-dataset

RMariaDB-https://cran.r-project.org/web/packages/RMariaDB/RMariaDB.pdf