Assignment

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Solution

Load Libraries

library(mongolite)

Data Import

The csv downloaded below is the result of the following SQL query passed to data.stackexchange which was used for project 3.

SELECT Id, TagName, Count from Tags ORDER BY Count DESC;

It will be read into a data frame called Tags. It could be read directly into the Mongo database, but this gives us a comparison against with to check.

Tags <- read.csv('https://data.stackexchange.com/datascience/csv/1389058',
                 stringsAsFactors = FALSE)

Data Export

Using the mongolite package and running MongoDB as a service in the background, we will upload the tags data into a local Mongo database called “tags”. However, since completing this homework required trial and error, we will clear the database first to ensure nothing is left from prior runs.

mongo_tags <- mongo(collection = "DS", db = "tags")
mongo_tags$remove(query = "{}")
mongo_tags$insert(Tags)

## List of 5
##  $ nInserted  : num 514
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Data Test

Let’s see if the data loaded properly. At a minimum, there should be 514 rows.

mongo_tags$count()

## [1] 514

Finally, lets pull all tags whose count is greater than 1000 and compare:

mongo_count <- mongo_tags$find('{"Count" : { "$gt" : 1000}}')
mongo_count

##     Id          TagName Count
## 1    2 machine-learning  6479
## 2   46           python  3667
## 3   81   neural-network  2751
## 4  194    deep-learning  2576
## 5   77   classification  1777
## 6  324            keras  1595
## 7  128     scikit-learn  1233
## 8  321       tensorflow  1132
## 9   47              nlp  1077
## 10  24                r  1064

csv_count <- subset(Tags, Count > 1000)
all.equal(mongo_count, csv_count)

## [1] TRUE

Discussion

Advantages

The advantage of using a NoSQL database is its flexibility. Anything can be stored in a NoSQL database as its own JSON object. There is no necessity for similar fields across records and no need to update the database schema if new kinds of data need to be retained. If the information has some form of digital representation, it can be stored in a NoSQL database.

Disadvantages

Its flexibility underlies its key disadvantage as well. Without a regular structure, most kinds of data analysis, such as aggregations and summaries, become more difficult to construct and more time-consuming to process. Also, joining together different databases becomes much more difficult as there may be few, if any, common fields on which to join. One may argue with the last and say to just store everything in one database, but that can get very unweildly.

In this particular case, the data elements are all similar to each other. Also, the type of analysis that would be used would lean towards the mathematical more than merely information retrieval. Therefore, the advantages of a relational database, with its structure and ensuing optimization, outweigh those of a NoSQL database, despite its freedom and flexibility.

DT 607—Fall 2019

HW 12

Avraham Adler

11/11/2019