DATA 607 Assignment 7: NoSQL Migration

Intro
Load Packages
The Schema
Connect to MySQL Database
Get Movies
Generate Some Imaginary Friends
Let Them Vote
Access Reviews
MongoDB Data Import
MongoDB Data Exploration
Conclusion

Intro

For this assignment we were tasked with moving information from a relational database and migrate them to a NoSQL database of our own choosing. I decided to revisit a previous assignment in which I created a relational database and populate it on the fly using movies loaded from https://www.themoviedb.org's API. I used the randomNames package to create reviewers. I then made the reviewers rate each movie with a random value from 1 to 5. I increased the number of movies to 50 and the number of reviewers for each movie to 100. This information will be pulled from the MySQL database and loaded into a MongoDB database hosted on https://mlab.com/.

Load Packages

library("httr")
library("DBI")
library("RMySQL")
library("randomNames")
library("tidyverse")

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("DT")
library("mongolite")

The Schema

Connect to MySQL Database

rmysql.settingsfile <- "config/mylogin.cnf"
rmysql.db <- "movies"
con <- dbConnect(MySQL(), default.file = rmysql.settingsfile, group = rmysql.db, user = NULL, password = NULL)
dbListTables(con)

## [1] "movies"    "reviewers" "reviews"

Get Movies

  link <- "https://api.themoviedb.org/3/discover/movie?api_key=53b3abb279c64aa6b8bd31cedf177293&language=en-US&include_adult=false&primary_release_year=2017&sort_by=vote_average.desc"
  num_movies <- 20
  request_movies <- GET(link)
  request_movies$status_code

## [1] 200

  my_movies <- content(request_movies, "parsed")
  movies <- my_movies$results
 
  if(length(my_movies$results) < num_movies){
    message("Movies not available. I suggest using the movies database as is.")
  } else {
    #Clear the movies table in preparation to new inserts
    oldMovies <- dbGetQuery(con, "SELECT group_concat(movies.movie_id SEPARATOR ',') AS listing, 
                                  Count(*) AS num_rows FROM movies")
  
  if(oldMovies$num_rows[1] > 0){
    ds <- dbSendQuery(con, sprintf("DELETE FROM movies WHERE movies.movie_id IN (%s)",oldMovies[1]$listing))
    dbClearResult(ds)
  }
    a <- 1
    while(a <= num_movies){
      sql <- sprintf('INSERT INTO movies SET 
                      movie_id=%d, 
                      movie_title= "%s", 
                      movie_popularity="%s", 
                      movie_release_date="%s" 
                      ON DUPLICATE KEY UPDATE movie_id="%s"', 
                      movies[[a]]$id, movies[[a]]$title, movies[[a]]$popularity, movies[[a]]$release_date, movies[[a]]$id)
      
      rs <- dbSendQuery(con, (sql))
      dbClearResult(rs)
      a <- a + 1
    }
  }

Generate Some Imaginary Friends

num_friends = 100
new_friends <- randomNames(num_friends)
  # Clear the friends table in preparation to new friends
  oldFriends <- dbGetQuery(con, "SELECT group_concat(reviewers.reviewer_id SEPARATOR ',') AS listing, 
                                Count(*) AS num_rows FROM reviewers")
  
  if(oldFriends$num_rows[1] > 0){
    ds <- dbSendQuery(con, sprintf("DELETE FROM reviewers WHERE reviewers.reviewer_id IN (%s)", oldFriends[1]$listing))
    dbClearResult(ds)
  }

## [1] TRUE

  a <- 1
  while(a <= num_friends){
    new_friend = strsplit(new_friends[[a]],",")
    sql <- sprintf('INSERT INTO reviewers SET 
                    reviewer_id=%d,
                    reviewer_first_name="%s", 
                    reviewer_last_name="%s"  
                    ON DUPLICATE KEY UPDATE   reviewer_first_name="%s", reviewer_last_name="%s"', 
                    a, new_friend[[1]][1], new_friend[[1]][2],new_friend[[1]][1], new_friend[[1]][2])
    
    rs <- dbSendQuery(con, sql)
    dbClearResult(rs)
    a <- a + 1
  }

Let Them Vote

 # Remove old reviews
 oldReviews <- dbGetQuery(con, "SELECT group_concat(reviews.review_id SEPARATOR ',') AS listing, 
                                Count(*) AS num_rows FROM reviews")
  if(oldReviews$num_rows[1] > 0){
    ds <- dbSendQuery(con, sprintf("DELETE FROM reviews WHERE reviews.review_id IN (%s)", oldReviews[1]$listing))
    dbClearResult(ds)
  }

theMovies <- dbGetQuery(con, sprintf("SELECT movie_id FROM movies LIMIT %d", num_movies))
theReviewers <- dbGetQuery(con, sprintf("SELECT reviewer_id FROM reviewers LIMIT %d", num_friends))

for(r in 1:nrow(theReviewers) ){
  for(m in 1:nrow(theMovies)){
    rating <- floor(runif(1,1,6))
    sql <- sprintf("INSERT INTO reviews SET 
                    movie_id=%d, 
                    reviewer_id=%d, 
                    rating=%d  
                    ON DUPLICATE KEY UPDATE movie_id=%d, reviewer_id=%d, rating=%d", 
                    theMovies$movie_id[m], theReviewers$reviewer_id[r], rating, theMovies$movie_id[m], theReviewers$reviewer_id[r], rating)
    rs <- dbSendQuery(con, sql)
    dbClearResult(rs)
  }
}

Access Reviews

To access the reviews for each movie, all the tables in this relational database must be joined as follows. Because NoSQL databases are not relational, there is no benefit in separating its data into different tables/collections.

Therefore one record/document will contain detailed information about the movie, its reviewer, and its rating.

sql <- "SELECT movies.movie_title AS Movie, 
        movies.movie_popularity AS Movie_Popularity, 
        movies.movie_release_date AS Release_Date,
        reviews.rating AS Rating, 
        CONCAT(reviewers.reviewer_first_name,' ',reviewers.reviewer_last_name) AS Reviewer
        FROM reviews 
        INNER JOIN movies ON (movies.movie_id = reviews.movie_id)
        INNER JOIN  reviewers ON (reviewers.reviewer_id = reviews.reviewer_id)"

theReviews <- suppressWarnings(dbGetQuery(con, sql))

datatable(theReviews, options = list(filter = FALSE))

MongoDB Data Import

con$remove("{}")
con$insert(theReviews)

## List of 5
##  $ nInserted  : num 2000
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

MongoDB Data Exploration

Load data:

# Very easy to summarize reviews using the aggregation framework
mongoReviews <- con$aggregate('[{"$group":{"_id":{"Movie":"$Movie", "Rating":"$Rating"},  "Count":  {"$sum":1}}}]')

mongoReviews$`_id`$Rating <- factor(mongoReviews$`_id`$Rating, levels = c("One"<-1, "Two"<-2, "Three"<-3, "Four"<-4,"Five"<-5))

ggplot(data = mongoReviews, 
       aes(x = `_id`$Movie, y = Count, fill = `_id`$Rating)) + 
  geom_bar(stat = "identity") + coord_flip() + scale_fill_brewer(palette = 12) + 
  labs(title = "Distribution of Ratings Per Movie", x = "Movie", y="Rating", fill = "Rating")

Conclusion

The main advantage of relational databases is the facilitation of data integrity rules and its complementing features and funtionalities. For example, if a record is updated or deleted, all of its related records are updated or deleted accordingly, if the database is configured properly (no orphaned records). Relational databases can become very complex when modeling business processes, expertise in normalization may be required.

A key advantage of moving data from from relational databases to NoSQL databases is that the need for separating the data into related entities is greatly reduced, making database normalization skills optional. This also means that NoSQL read and write operations can generally be faster than it’s relational equivalent, primarily because of the decreased overhead of ensuring data integrity.

A disadvantage of this migration is that data integrity becomes more of a manual process. The software developer or database administrator must perform all the necessary “dirty work” when updating and deleting records. NoSQL tables would typically be larger in size than their relational counterpart. Primarily because a NoSQL table must contain all the necessary data in one entity, resulting in repetitive storing of texts, which require more storage than integers.

Indexing of records plays an even more important role in NoSQL databases. One must profile read and write operations to create the necessary indexes to speed up operations. Indexes drastically increase the size of NoSQL databases, and can negatively effect the speed of write operations, therefore they must be used carefully.