Migrating data from a relational database to a NoSQL database.

A SQL database “movie_reviews” has been created that contains ratings given by a set of viewers to 6 recent movies. This database is loaded and read into a data frame. The related file “reviews.sql” contains SQL commands that were issued to create the database. Configuration: MySQL version 5.7, Windows 10

library(RMySQL)
library(datasets)

# Open a MySql database containing movie ratings from movie viewers
dbcon = dbConnect(MySQL(), user = 'root', password = 'pwd001', dbname='movie_reviews')

# Check the database has table(s).
dbListTables(dbcon)
## [1] "responses"
dbListFields(dbcon, 'responses')
## [1] "reviewer" "movie"    "rating"
# Read all rows in the responses table into data frame. Print data frame.
result = dbSendQuery(dbcon, "select * from responses")
mr.df = fetch(result, n=30)
print(mr.df)
##    reviewer        movie rating
## 1      Mark     LaLaLand      1
## 2      Mark        Moana      2
## 3      Mark     Zootopia      4
## 4      Mark      Arrival      4
## 5      Mark  FindingDory      5
## 6      Mark SuicideSquad      3
## 7      Jane     LaLaLand      5
## 8      Jane        Moana      3
## 9      Jane     Zootopia      3
## 10     Jane      Arrival      4
## 11     Jane  FindingDory      2
## 12     Jane SuicideSquad      2
## 13     Nate     LaLaLand      4
## 14     Nate        Moana      2
## 15     Nate     Zootopia      2
## 16     Nate      Arrival      4
## 17     Nate  FindingDory      5
## 18     Nate SuicideSquad      5
## 19      Rob     LaLaLand      3
## 20      Rob        Moana      3
## 21      Rob     Zootopia      4
## 22      Rob      Arrival      1
## 23      Rob  FindingDory      1
## 24      Rob SuicideSquad      2
## 25     Lisa     LaLaLand      1
## 26     Lisa        Moana      5
## 27     Lisa     Zootopia      5
## 28     Lisa      Arrival      3
## 29     Lisa  FindingDory      3
## 30     Lisa SuicideSquad      4

Write the dataframe from above into a MongoDB database.

library(mongolite)

# Create a MongoDB database and collection

mr <- mongo(collection = "movies", db = "reviews")
if (mr$count() > 0)
    mr$drop()

data(mr.df)
mr$insert(mr.df)
## List of 5
##  $ nInserted  : num 30
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
mr$count()
## [1] 30
# Read back from the MongoDB database.
mout <- mr$find()
print(mout)
##    reviewer        movie rating
## 1      Mark     LaLaLand      1
## 2      Mark        Moana      2
## 3      Mark     Zootopia      4
## 4      Mark      Arrival      4
## 5      Mark  FindingDory      5
## 6      Mark SuicideSquad      3
## 7      Jane     LaLaLand      5
## 8      Jane        Moana      3
## 9      Jane     Zootopia      3
## 10     Jane      Arrival      4
## 11     Jane  FindingDory      2
## 12     Jane SuicideSquad      2
## 13     Nate     LaLaLand      4
## 14     Nate        Moana      2
## 15     Nate     Zootopia      2
## 16     Nate      Arrival      4
## 17     Nate  FindingDory      5
## 18     Nate SuicideSquad      5
## 19      Rob     LaLaLand      3
## 20      Rob        Moana      3
## 21      Rob     Zootopia      4
## 22      Rob      Arrival      1
## 23      Rob  FindingDory      1
## 24      Rob SuicideSquad      2
## 25     Lisa     LaLaLand      1
## 26     Lisa        Moana      5
## 27     Lisa     Zootopia      5
## 28     Lisa      Arrival      3
## 29     Lisa  FindingDory      3
## 30     Lisa SuicideSquad      4

Relational vs NoSQL databases

Relational databases such as MySQL allow more advanced querying and indexing. They store data in rows where each row contains values of all variables (columns) corresponding to that row. They allow multiple secondary indexes to be built and used with predictable performance. NoSQL databases are limited in the number of secondary indexes that can be supported.

On the other hand, with massive datasets that are common for web-scale data, a MySQL database suffers from limited scalability: it is usually not designed to scale to multiple nodes, thus limiting the amount of data that it can store under a single repository. In this regard the NoSQL databases are clearly superior: they are designed from the ground up to scale to hundreds, often thousands of nodes. For this reason, they are the only choice in data centers where data from millions of users must be stored and accessed efficiently. A relational database, for all its sophisticated features, simply cannot scale to those levels of size and speed.

Today, an IT administrator must learn to deploy both types of databases as required by the size, growth rates and usage of their data.