Assignment

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Importing the Data

I will be using the tb database for this project. Let’s first import the data into our R session:

tb <- dbGetQuery(mydb, 'SELECT * FROM tb.tb')
head(tb)
##       country year    sex child adult elderly
## 1 Afghanistan 1995 female    NA    NA      NA
## 2 Afghanistan 1995   male    NA    NA      NA
## 3 Afghanistan 1996 female    NA    NA      NA
## 4 Afghanistan 1996   male    NA    NA      NA
## 5 Afghanistan 1997 female     5    96       1
## 6 Afghanistan 1997   male     0    26       0

Storing the data in MongoDB

I’ve chosen MongoDB as my NoSQL database. Instructions for using MongoDB can be found here: https://gist.github.com/Btibert3/7751989.

First, let’s connect to the MongoDB and create a collection for our data called “TB_DATA”. We’ll stored this in a database called tb.

# connect to MongoDB
library(mongolite)
con <- mongo(collection = "TB_DATA",db = "tb")

Next, let’s insert our data into the collection we’ve created. In order to ensure that there are no duplicates from previous runs, we’ll drop the table if it exists.

if(con$count() > 0) con$drop()
con$insert(tb)
## List of 5
##  $ nInserted  : num 3800
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
stopifnot(con$count() == nrow(tb))

Let’s check to see if the data was inserted properly.

con$count() == nrow(tb)
## [1] TRUE

Success!

Querying the data

For practice, let’s play around with some of the mongo querying language.

We’ll pull in only records for China that are greater than 2010.

con$find(
  query = '{"country" : "China", "year" : { "$gt" : 2010 } }'
)
##   country year    sex child  adult elderly
## 1   China 2011 female  2710 218852   42371
## 2   China 2011   male  3089 478461  119576
## 3   China 2012 female  2597 215915   46598
## 4   China 2012   male  3028 466634  124089
## 5   China 2013 female  2301 211310   47438
## 6   China 2013   male  2529 459122  124476

Let’s take this query and only display the country, year, sex, and number of children.

con$find(
  query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
  fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}'
)
##   country year    sex child
## 1   China 2011 female  2710
## 2   China 2011   male  3089
## 3   China 2012 female  2597
## 4   China 2012   male  3028
## 5   China 2013 female  2301
## 6   China 2013   male  2529

We’ll take a look at the data in descending order of children.

con$find(
  query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
  fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}',
  sort = '{"child": -1}'
)
##   country year    sex child
## 1   China 2011   male  3089
## 2   China 2012   male  3028
## 3   China 2011 female  2710
## 4   China 2012 female  2597
## 5   China 2013   male  2529
## 6   China 2013 female  2301

And last, we’ll limit this to the top 3 records:

con$find(
  query = '{"country" : "China", "year" : { "$gt" : 2010 } }',
  fields = '{"country" : true, "year" : true, "sex": true, "child" : true, "_id" : false}',
  sort = '{"child": -1}',
  limit = 3
)
##   country year    sex child
## 1   China 2011   male  3089
## 2   China 2012   male  3028
## 3   China 2011 female  2710

Pros and Cons of MongoDB

The advantages of NoSQL are the disadvantages of relational databases. Likewise, the disadvantages of NoSQL are the advantages of relational databases. In general, relational databases are ideal for uniformity and standardization and NoSQL databases are great for unstructured data.

The advantages of storing the data in a NoSQL environment are:

The disadvantages of storing the data in a NoSQL environment are: