For this assignment we were tasked with moving information from a relational database and migrate them to a NoSQL database of our own choosing. I decided to revisit a previous assignment in which I created a relational database and populate it on the fly using movies loaded from https://www.themoviedb.org's API. I used the randomNames package to create reviewers. I then made the reviewers rate each movie with a random value from 1 to 5. I increased the number of movies to 50 and the number of reviewers for each movie to 100. This information will be pulled from the MySQL database and loaded into a MongoDB database hosted on https://mlab.com/.
library("httr")
library("DBI")
library("RMySQL")
library("randomNames")
library("tidyverse")## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("DT")
library("mongolite")rmysql.settingsfile <- "config/mylogin.cnf"
rmysql.db <- "movies"
con <- dbConnect(MySQL(), default.file = rmysql.settingsfile, group = rmysql.db, user = NULL, password = NULL)
dbListTables(con)## [1] "movies" "reviewers" "reviews"
link <- "https://api.themoviedb.org/3/discover/movie?api_key=53b3abb279c64aa6b8bd31cedf177293&language=en-US&include_adult=false&primary_release_year=2017&sort_by=vote_average.desc"
num_movies <- 20
request_movies <- GET(link)
request_movies$status_code## [1] 200
my_movies <- content(request_movies, "parsed")
movies <- my_movies$results
if(length(my_movies$results) < num_movies){
message("Movies not available. I suggest using the movies database as is.")
} else {
#Clear the movies table in preparation to new inserts
oldMovies <- dbGetQuery(con, "SELECT group_concat(movies.movie_id SEPARATOR ',') AS listing,
Count(*) AS num_rows FROM movies")
if(oldMovies$num_rows[1] > 0){
ds <- dbSendQuery(con, sprintf("DELETE FROM movies WHERE movies.movie_id IN (%s)",oldMovies[1]$listing))
dbClearResult(ds)
}
a <- 1
while(a <= num_movies){
sql <- sprintf('INSERT INTO movies SET
movie_id=%d,
movie_title= "%s",
movie_popularity="%s",
movie_release_date="%s"
ON DUPLICATE KEY UPDATE movie_id="%s"',
movies[[a]]$id, movies[[a]]$title, movies[[a]]$popularity, movies[[a]]$release_date, movies[[a]]$id)
rs <- dbSendQuery(con, (sql))
dbClearResult(rs)
a <- a + 1
}
}num_friends = 100
new_friends <- randomNames(num_friends)
# Clear the friends table in preparation to new friends
oldFriends <- dbGetQuery(con, "SELECT group_concat(reviewers.reviewer_id SEPARATOR ',') AS listing,
Count(*) AS num_rows FROM reviewers")
if(oldFriends$num_rows[1] > 0){
ds <- dbSendQuery(con, sprintf("DELETE FROM reviewers WHERE reviewers.reviewer_id IN (%s)", oldFriends[1]$listing))
dbClearResult(ds)
}## [1] TRUE
a <- 1
while(a <= num_friends){
new_friend = strsplit(new_friends[[a]],",")
sql <- sprintf('INSERT INTO reviewers SET
reviewer_id=%d,
reviewer_first_name="%s",
reviewer_last_name="%s"
ON DUPLICATE KEY UPDATE reviewer_first_name="%s", reviewer_last_name="%s"',
a, new_friend[[1]][1], new_friend[[1]][2],new_friend[[1]][1], new_friend[[1]][2])
rs <- dbSendQuery(con, sql)
dbClearResult(rs)
a <- a + 1
} # Remove old reviews
oldReviews <- dbGetQuery(con, "SELECT group_concat(reviews.review_id SEPARATOR ',') AS listing,
Count(*) AS num_rows FROM reviews")
if(oldReviews$num_rows[1] > 0){
ds <- dbSendQuery(con, sprintf("DELETE FROM reviews WHERE reviews.review_id IN (%s)", oldReviews[1]$listing))
dbClearResult(ds)
}
theMovies <- dbGetQuery(con, sprintf("SELECT movie_id FROM movies LIMIT %d", num_movies))
theReviewers <- dbGetQuery(con, sprintf("SELECT reviewer_id FROM reviewers LIMIT %d", num_friends))
for(r in 1:nrow(theReviewers) ){
for(m in 1:nrow(theMovies)){
rating <- floor(runif(1,1,6))
sql <- sprintf("INSERT INTO reviews SET
movie_id=%d,
reviewer_id=%d,
rating=%d
ON DUPLICATE KEY UPDATE movie_id=%d, reviewer_id=%d, rating=%d",
theMovies$movie_id[m], theReviewers$reviewer_id[r], rating, theMovies$movie_id[m], theReviewers$reviewer_id[r], rating)
rs <- dbSendQuery(con, sql)
dbClearResult(rs)
}
}To access the reviews for each movie, all the tables in this relational database must be joined as follows. Because NoSQL databases are not relational, there is no benefit in separating its data into different tables/collections.
Therefore one record/document will contain detailed information about the movie, its reviewer, and its rating.
sql <- "SELECT movies.movie_title AS Movie,
movies.movie_popularity AS Movie_Popularity,
movies.movie_release_date AS Release_Date,
reviews.rating AS Rating,
CONCAT(reviewers.reviewer_first_name,' ',reviewers.reviewer_last_name) AS Reviewer
FROM reviews
INNER JOIN movies ON (movies.movie_id = reviews.movie_id)
INNER JOIN reviewers ON (reviewers.reviewer_id = reviews.reviewer_id)"
theReviews <- suppressWarnings(dbGetQuery(con, sql))
datatable(theReviews, options = list(filter = FALSE))con$remove("{}")
con$insert(theReviews)## List of 5
## $ nInserted : num 2000
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Load data:
# Very easy to summarize reviews using the aggregation framework
mongoReviews <- con$aggregate('[{"$group":{"_id":{"Movie":"$Movie", "Rating":"$Rating"}, "Count": {"$sum":1}}}]')
mongoReviews$`_id`$Rating <- factor(mongoReviews$`_id`$Rating, levels = c("One"<-1, "Two"<-2, "Three"<-3, "Four"<-4,"Five"<-5))
ggplot(data = mongoReviews,
aes(x = `_id`$Movie, y = Count, fill = `_id`$Rating)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_brewer(palette = 12) +
labs(title = "Distribution of Ratings Per Movie", x = "Movie", y="Rating", fill = "Rating")The main advantage of relational databases is the facilitation of data integrity rules and its complementing features and funtionalities. For example, if a record is updated or deleted, all of its related records are updated or deleted accordingly, if the database is configured properly (no orphaned records). Relational databases can become very complex when modeling business processes, expertise in normalization may be required.
A key advantage of moving data from from relational databases to NoSQL databases is that the need for separating the data into related entities is greatly reduced, making database normalization skills optional. This also means that NoSQL read and write operations can generally be faster than it’s relational equivalent, primarily because of the decreased overhead of ensuring data integrity.
A disadvantage of this migration is that data integrity becomes more of a manual process. The software developer or database administrator must perform all the necessary “dirty work” when updating and deleting records. NoSQL tables would typically be larger in size than their relational counterpart. Primarily because a NoSQL table must contain all the necessary data in one entity, resulting in repetitive storing of texts, which require more storage than integers.
Indexing of records plays an even more important role in NoSQL databases. One must profile read and write operations to create the necessary indexes to speed up operations. Indexes drastically increase the size of NoSQL databases, and can negatively effect the speed of write operations, therefore they must be used carefully.