I really wanted to work with a slightly larger data set so I opted to pull the The Movies Data set from Kaggle. It is a Public Domain License, which has No Copyright. This data set contains movies, reviews, summaries, genres and more. It is a great method to learn how to write with a recommendation bot.
We will first initialize our DB libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RMariaDB)
library(DBI)
We will then define the connection to the container. In terms of implemented Security, the container forces a localhost only connection, which for the purpose of this demo, allows for easy, rapid deployments with an effective yet simple security implementation. I prefer listing tables here as it easily demonstrates a connect to a DB, and demonstrates the current status of the container.
con <- dbConnect(RMariaDB::MariaDB(), username="root", password="TestCase123.", dbname ="review_db", host="localhost")
dbListTables(con)
## [1] "movies_metadata" "ratings"
What is occurring here is that we’re reading a table named ratings, and we are loading it into a data frame in memory. Since ratings link a Movie and a value 1-5, we are going to summarize the results in a singular table, linking movieID and its average rating.
avg_ratings <- dbReadTable(con = con, name = 'ratings')
agg_ratings <- aggregate(avg_ratings$rating, list(avg_ratings$movieId), FUN=mean)
colnames(agg_ratings) <- c('Movie_ID','Rating')
head(agg_ratings)
## Movie_ID Rating
## 1 1 3.872470
## 2 2 3.401869
## 3 3 3.161017
## 4 4 2.384615
## 5 5 3.267857
## 6 6 3.884615
At this point, we are going to read the movies_metadata table into a dataframe memory. This database is fairly large, and since it’s loaded in in bulk, its not the most effective operation if you are hardware limited, but great in terms of managing number of connections.
movie_metadata <- dbReadTable(con = con, name = 'movies_metadata')
Now I want to see the trend in movie ratings over time. So I need to join the movie_metadata Dataframe, which has the release date in it, to the average rating. For any resulting joins that do not have values, I will drop and exclude them from the set.
agg_ratings$title <- "NAN"
agg_ratings$Movie_ID <- as.character(agg_ratings$Movie_ID)
combo <-left_join(agg_ratings, movie_metadata, by = c("Movie_ID" = "id"))
combo$release_year = substr(combo$release_date,1,4)
movie_ratings = subset(combo, select = c(original_title, Rating, release_year) )
clean_movie_ratings <- na.omit(movie_ratings)
head(clean_movie_ratings)
## original_title Rating release_year
## 2 Ariel 3.401869 1988
## 3 Varjoja paratiisissa 3.161017 1986
## 5 Four Rooms 3.267857 1995
## 6 Judgment Night 3.884615 1993
## 11 Star Wars 3.689024 1977
## 12 Finding Nemo 2.861111 2003
At this point, I just want to visualize the data a bit, and see if there is anything that sticks out to me as interesting.
At this point, we have a simple chart, fully loaded set, and have cleaned the basic irregularities from the data. Further steps may by to remove older movies, because silent movies may have biased results due to the typical population who watch them are hyper focused on them. In addition, it may make sense to remove movies under a certain number of views.
RMariaDB-https://cran.r-project.org/web/packages/RMariaDB/RMariaDB.pdf