For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
library(mongolite)
The csv downloaded below is the result of the following SQL query passed to data.stackexchange which was used for project 3.
SELECT Id, TagName, Count from Tags ORDER BY Count DESC;
It will be read into a data frame called Tags. It could be read directly into the Mongo database, but this gives us a comparison against with to check.
Tags <- read.csv('https://data.stackexchange.com/datascience/csv/1389058',
stringsAsFactors = FALSE)
Using the mongolite package and running MongoDB as a service in the background, we will upload the tags data into a local Mongo database called “tags”. However, since completing this homework required trial and error, we will clear the database first to ensure nothing is left from prior runs.
mongo_tags <- mongo(collection = "DS", db = "tags")
mongo_tags$remove(query = "{}")
mongo_tags$insert(Tags)
## List of 5
## $ nInserted : num 514
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Let’s see if the data loaded properly. At a minimum, there should be 514 rows.
mongo_tags$count()
## [1] 514
Finally, lets pull all tags whose count is greater than 1000 and compare:
mongo_count <- mongo_tags$find('{"Count" : { "$gt" : 1000}}')
mongo_count
## Id TagName Count
## 1 2 machine-learning 6479
## 2 46 python 3667
## 3 81 neural-network 2751
## 4 194 deep-learning 2576
## 5 77 classification 1777
## 6 324 keras 1595
## 7 128 scikit-learn 1233
## 8 321 tensorflow 1132
## 9 47 nlp 1077
## 10 24 r 1064
csv_count <- subset(Tags, Count > 1000)
all.equal(mongo_count, csv_count)
## [1] TRUE
The advantage of using a NoSQL database is its flexibility. Anything can be stored in a NoSQL database as its own JSON object. There is no necessity for similar fields across records and no need to update the database schema if new kinds of data need to be retained. If the information has some form of digital representation, it can be stored in a NoSQL database.
Its flexibility underlies its key disadvantage as well. Without a regular structure, most kinds of data analysis, such as aggregations and summaries, become more difficult to construct and more time-consuming to process. Also, joining together different databases becomes much more difficult as there may be few, if any, common fields on which to join. One may argue with the last and say to just store everything in one database, but that can get very unweildly.
In this particular case, the data elements are all similar to each other. Also, the type of analysis that would be used would lean towards the mathematical more than merely information retrieval. Therefore, the advantages of a relational database, with its structure and ensuing optimization, outweigh those of a NoSQL database, despite its freedom and flexibility.