For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
I will be using the tb database for this project. Let’s first import the data into our R session:
tb <- dbGetQuery(mydb, 'SELECT * FROM tb.tb')
head(tb)
## country year sex child adult elderly
## 1 Afghanistan 1995 female NA NA NA
## 2 Afghanistan 1995 male NA NA NA
## 3 Afghanistan 1996 female NA NA NA
## 4 Afghanistan 1996 male NA NA NA
## 5 Afghanistan 1997 female 5 96 1
## 6 Afghanistan 1997 male 0 26 0
I’ve chosen MongoDB as my NoSQL database. Instructions for using MongoDB can be found here: https://gist.github.com/Btibert3/7751989.
First, let’s connect to the MongoDB and create a collection for our data called “TB_DATA”. We’ll stored this in a database called tb.
# connect to MongoDB
library(mongolite)
con <- mongo(collection = "TB_DATA",db = "tb")
Next, let’s insert our data into the collection we’ve created. In order to ensure that there are no duplicates from previous runs, we’ll drop the table if it exists.
if(con$count() > 0) con$drop()
con$insert(tb)
## List of 5
## $ nInserted : num 3800
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
stopifnot(con$count() == nrow(tb))
Let’s check to see if the data was inserted properly.
con$count() == nrow(tb)
## [1] TRUE
Success!
For practice, let’s play around with some of the mongo querying language.
We’ll pull in only records for China that are greater than 2010.
con$find(
query = '{"country" : "China", "year" : { "$gt" : 2010 } }'
)
## country year sex child adult elderly
## 1 China 2011 female 2710 218852 42371
## 2 China 2011 male 3089 478461 119576
## 3 China 2012 female 2597 215915 46598
## 4 China 2012 male 3028 466634 124089
## 5 China 2013 female 2301 211310 47438
## 6 China 2013 male 2529 459122 124476
Let’s take this query and only display the country, year, sex, and number of children.
con$find(
query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}'
)
## country year sex child
## 1 China 2011 female 2710
## 2 China 2011 male 3089
## 3 China 2012 female 2597
## 4 China 2012 male 3028
## 5 China 2013 female 2301
## 6 China 2013 male 2529
We’ll take a look at the data in descending order of children.
con$find(
query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}',
sort = '{"child": -1}'
)
## country year sex child
## 1 China 2011 male 3089
## 2 China 2012 male 3028
## 3 China 2011 female 2710
## 4 China 2012 female 2597
## 5 China 2013 male 2529
## 6 China 2013 female 2301
And last, we’ll limit this to the top 3 records:
con$find(
query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}',
sort = '{"child": -1}',
limit = 3
)
## country year sex child
## 1 China 2011 male 3089
## 2 China 2012 male 3028
## 3 China 2011 female 2710
The advantages of NoSQL are the disadvantages of relational databases. Likewise, the disadvantages of NoSQL are the advantages of relational databases. In general, relational databases are ideal for uniformity and standardization and NoSQL databases are great for unstructured data.
The advantages of storing the data in a NoSQL environment are:
The disadvantages of storing the data in a NoSQL environment are: