Project Instructions

Take information from a relational database and migrate it to a NoSQL database. The relational database can be the flights database, the tb database, the data skills database from Project 3, or any another database. The NoSQL database can be MongoDB, Neo4j, or any another NoSQL database. The migration process needs to be reproducible. R code is encouraged, but not required. Also briefly describe the advantages and disadvantages of storing the data in a relational database versus a NoSQL database.

Pull Data from Relational Database

library(RMySQL)
connection <- dbConnect(MySQL(), user=MySQL_Username, password=MySQL_Password)
dbSendQuery(connection, 'USE Flights;')
## <MySQLResult:0,0,0>
Airlines <- dbGetQuery(connection, "SELECT * FROM Airlines;")
Airports <- dbGetQuery(connection, "SELECT * FROM Airports;")
Flights <- dbGetQuery(connection, "SELECT * FROM Flights;")
Planes <- dbGetQuery(connection, "SELECT * FROM Planes;")
Weather <- dbGetQuery(connection, "SELECT * FROM Weather;")
dbDisconnect(connection)
## [1] TRUE

Migrate to NoSQL Database

MongoDB

The mongolite package is more efficient than the RMongo package which requires conversion to JSON, string manipulation, and several lines of additional code using the below functions:

library(jsonlite)
library(RMongo)
convert <- function(x){
  JSON <- toJSON(x)
  JSON <- gsub("\\[", "{", JSON)
  JSON <- gsub("\\]", "}", JSON)
  JSON <- as.character(JSON)
  return(JSON)
}
mongodb <- mongoDbConnect("Flights", host = "localhost", port = 27017)
dbInsertDocument(mongodb, 'Table', convert(Table))
dbShowCollections(mongodb)
dbGetQuery(mongodb, "Table", "{}")
dbRemoveQuery(mongodb, 'Table', convert(Table))\
dbDisconnect(mongodb)

An empty query filter document ({}) selects all documents in the collection.

library(mongolite)
mongodb <- mongo(collection = "Flights")
mongodb$insert(Airlines)
## 
Complete! Processed total of 16 rows.
## $nInserted
## [1] 16
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongodb$insert(Airports)
## 
Processed 1000 rows...
Complete! Processed total of 1397 rows.
## $nInserted
## [1] 1397
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongodb$insert(Flights, pagesize = 100000)
## 
Processed 1e+05 rows...
Processed 2e+05 rows...
Processed 3e+05 rows...
Complete! Processed total of 336776 rows.
## $nInserted
## [1] 336776
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongodb$insert(Planes)
## 
Processed 1000 rows...
Processed 2000 rows...
Processed 3000 rows...
Complete! Processed total of 3322 rows.
## $nInserted
## [1] 3322
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongodb$insert(Weather)
## 
Processed 1000 rows...
Processed 2000 rows...
Processed 3000 rows...
Processed 4000 rows...
Processed 5000 rows...
Processed 6000 rows...
Processed 7000 rows...
Processed 8000 rows...
Complete! Processed total of 8719 rows.
## $nInserted
## [1] 8719
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongodb$count()
## [1] 350230
mongodb$drop()
## [1] TRUE
mongodb$count()
## [1] 0
rm(mongodb)

The connection through mongolite automatically disconnects when the connection is removed.

Neo4j

A Neo4j database must be up and running in order to use RNeo4j. The RNeo4j createNode function returns a “collections containing mixed types cannot be stored in properties” error when nulls exist in the data. The function unmix changes nulls to the string “NA” as a work around. The Flights table is rather large with 336776 rows and returns either a “Server error: (500) Internal Server Error” or a “Java heap space” error due the memory constraints on this machine. The function has been limited to the first \(10^5\) rows for the sake of this assignment.

library(RNeo4j)
# system('"C:/Program Files/Neo4j CE 3.0.7/bin/neo4j-ce.exe"')
unmix <- function(x) { x[is.na(x)] <- "NA"; return(x) }
graph = startGraph("http://localhost:7474/db/data/")
clear(graph, input = FALSE)
createNode(graph, "Airlines", unmix(Airlines))
## < Node > 
## Airlines
## 
## $carrier
##  [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ"
##  [ reached getOption("max.print") -- omitted 6 entries ]
## 
## $name
##  [1] "Endeavor Air Inc.\r"           "American Airlines Inc.\r"     
##  [3] "Alaska Airlines Inc.\r"        "JetBlue Airways\r"            
##  [5] "Delta Air Lines Inc.\r"        "ExpressJet Airlines Inc.\r"   
##  [7] "Frontier Airlines Inc.\r"      "AirTran Airways Corporation\r"
##  [9] "Hawaiian Airlines Inc.\r"      "Envoy Air\r"                  
##  [ reached getOption("max.print") -- omitted 6 entries ]
createNode(graph, "Airports", unmix(Airports))
## < Node > 
## Airports
## 
## $dst
##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "U" "A"
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $tz
##  [1] -5 -5 -6 -5 -4 -4 -5 -5 -5 -8
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $faa
##  [1] "04G" "06A" "06C" "06N" "09J" "0A9" "0G6" "0G7" "0P2" "0S9"
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $name
##  [1] "Lansdowne Airport"              "Moton Field Municipal Airport" 
##  [3] "Schaumburg Regional"            "Randall Airport"               
##  [5] "Jekyll Island Airport"          "Elizabethton Municipal Airport"
##  [7] "Williams County Airport"        "Finger Lakes Regional Airport" 
##  [9] "Shoestring Aviation Airfield"   "Jefferson County Intl"         
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $alt
##  [1] 1044  264  801  523   11 1593  730  492 1000  108
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $lon
##  [1]  -80.6196  -85.6800  -88.1012  -74.3916  -81.4278  -82.1734  -84.5068
##  [8]  -76.7812  -76.6472 -122.8106
##  [ reached getOption("max.print") -- omitted 1387 entries ]
## 
## $lat
##  [1] 41.1305 32.4606 41.9893 41.4319 31.0745 36.3712 41.4673 42.8836
##  [9] 39.7948 48.0538
##  [ reached getOption("max.print") -- omitted 1387 entries ]
createNode(graph, "Flights", unmix(Flights[1:10^5, ]))
## < Node > 
## Flights
## 
## $flight
##  [1] 1545 1714 1141  725  461 1696  507 5708   79  301
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $arr_delay
##  [1] "11"  "20"  "33"  "-18" "-25" "12"  "19"  "-14" "-8"  "8"  
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $distance
##  [1] 1400 1416 1089 1576  762  719 1065  229  944  733
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $tailnum
##  [1] "N14228" "N24211" "N619AA" "N804JB" "N668DN" "N39463" "N516JB"
##  [8] "N829AS" "N593JB" "N3ALAA"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $year
##  [1] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $dep_time
##  [1] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $origin
##  [1] "EWR" "LGA" "JFK" "JFK" "LGA" "EWR" "EWR" "LGA" "JFK" "LGA"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $dep_delay
##  [1] "2"  "4"  "2"  "-1" "-6" "-4" "-5" "-3" "-3" "-2"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $dest
##  [1] "IAH" "IAH" "MIA" "BQN" "ATL" "ORD" "FLL" "IAD" "MCO" "ORD"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $minute
##  [1] "17" "33" "42" "44" "54" "54" "55" "57" "57" "58"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $carrier
##  [1] "UA" "UA" "AA" "B6" "DL" "UA" "B6" "EV" "B6" "AA"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $hour
##  [1] "5" "5" "5" "5" "6" "6" "6" "6" "6" "6"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $month
##  [1] 1 1 1 1 1 1 1 1 1 1
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $arr_time
##  [1] "830"  "850"  "923"  "1004" "812"  "740"  "913"  "709"  "838"  "753" 
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $air_time
##  [1] "227" "227" "160" "183" "116" "150" "158" "53"  "140" "138"
##  [ reached getOption("max.print") -- omitted 99990 entries ]
## 
## $day
##  [1] 1 1 1 1 1 1 1 1 1 1
##  [ reached getOption("max.print") -- omitted 99990 entries ]
createNode(graph, "Planes", unmix(Planes))
## < Node > 
## Planes
## 
## $engine
##  [1] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan"
##  [6] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $tailnum
##  [1] "N10156" "N102UW" "N103US" "N104UW" "N10575" "N105UW" "N107US"
##  [8] "N108UW" "N109UW" "N110UW"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $year
##  [1] "2004" "1998" "1999" "1999" "2002" "1999" "1999" "1999" "1999" "1999"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $engines
##  [1] 2 2 2 2 2 2 2 2 2 2
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $model
##  [1] "EMB-145XR" "A320-214"  "A320-214"  "A320-214"  "EMB-145LR"
##  [6] "A320-214"  "A320-214"  "A320-214"  "A320-214"  "A320-214" 
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $type
##  [1] "Fixed wing multi engine" "Fixed wing multi engine"
##  [3] "Fixed wing multi engine" "Fixed wing multi engine"
##  [5] "Fixed wing multi engine" "Fixed wing multi engine"
##  [7] "Fixed wing multi engine" "Fixed wing multi engine"
##  [9] "Fixed wing multi engine" "Fixed wing multi engine"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $seats
##  [1]  55 182 182 182  55 182 182 182 182 182
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $speed
##  [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
## 
## $manufacturer
##  [1] "EMBRAER"          "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE"
##  [4] "AIRBUS INDUSTRIE" "EMBRAER"          "AIRBUS INDUSTRIE"
##  [7] "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE"
## [10] "AIRBUS INDUSTRIE"
##  [ reached getOption("max.print") -- omitted 3312 entries ]
createNode(graph, "Weather", unmix(Weather))
## < Node > 
## Weather
## 
## $temp
##  [1] "37.04" "37.04" "37.94" "37.94" "37.94" "39.02" "39.02" "39.02"
##  [9] "39.92" "39.02"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $year
##  [1] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $origin
##  [1] "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $visib
##  [1] 10 10 10 10 10 10 10 10 10 10
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $wind_dir
##  [1] 230 230 230 230 240 270 250 240 250 260
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $humid
##  [1] "53.97" "53.97" "52.09" "54.51" "57.04" "59.37" "61.63" "64.43"
##  [9] "62.21" "64.43"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $pressure
##  [1] "1013.9" "1013"   "1012.6" "1012.7" "1012.8" "1012"   "1012.3"
##  [8] "1012.5" "1012.2" "1011.9"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $wind_gust
##  [1] "11.9186514756" "15.8915353008" "14.5672406924" "15.8915353008"
##  [5] "17.2158299092" "11.9186514756" "9.2700622588"  "13.242946084" 
##  [9] "14.5672406924" "14.5672406924"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $precip
##  [1] 0 0 0 0 0 0 0 0 0 0
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $hour
##  [1] "0"  "1"  "2"  "3"  "4"  "6"  "7"  "8"  "9"  "10"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $dewp
##  [1] "21.92" "21.92" "21.92" "23"    "24.08" "26.06" "26.96" "28.04"
##  [9] "28.04" "28.04"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $month
##  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $wind_speed
##  [1] "10.35702" "13.80936" "12.65858" "13.80936" "14.96014" "10.35702"
##  [7] "8.05546"  "11.5078"  "12.65858" "12.65858"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
## 
## $day
##  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [ reached getOption("max.print") -- omitted 8709 entries ]
options(op)
(query <- cypher(graph, "MATCH (n) RETURN COUNT(n)"))
##   COUNT(n)
## 1        5
clear(graph, input = FALSE)

Comparison

Column-based versus Row-based Storage



RDBMS NoSQL
MySQL, Postgres, Oracle Database MongoDB, Cassandra, HBase, Neo4j
Relational schema Schema-free
Strong Consistency Strong/Eventual Consistency
Stores individual records Storage model Varies
Scalable reads Scalable writes/reads
Custom high-availability (Vertical) Auto high-availability (Horizontal)
Flexible queries (SQL) Limited queries (APIs)
Closed/open source Open source
Supports transactions Transactions may be limited
ACID BASE

Most NoSQL databases are designed more for performance than data integrity. NoSQL databases do not have a relational schema like the standard RDMS. This has an impact on the consistency of the NoSQL databases. RDMS data is stored as individual records, but NoSQL databases have various models including models that “do away with the table-and-row model altogether” which can help boost performance. NoSQL databases are better for data that requires a lot of reads and writes. NoSQL databases automatically provide auto high-availability and disaster recovery by auto-sharding (partitioning) across many server instances. NoSQL databases have very limited query abilities however. The open-source development model of NoSQL can be considered both an advantage and disadvantage with arguments for reach being outside the scope of this comparison. Lastly, update transactions can be limited in NoSQL databases.

Different Data Storage Models

References

https://www.mongodb.com/nosql-inline

http://dataconomy.com/sql-vs-nosql-need-know/

https://www.devbridge.com/articles/benefits-of-nosql/

https://cran.r-project.org/web/packages/RMongo/RMongo.pdf

https://docs.mongodb.com/v3.2/tutorial/query-documents/

https://neo4j.com/developer/r/

https://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/

http://stackoverflow.com/questions/25295590/how-to-create-nodes-in-rneo4j-using-vectors-or-dataframes

http://stackoverflow.com/questions/25530538/java-heap-space-error-neo4j

http://www.slideshare.net/nurulferdous/nosql-is-it-for-you

http://stackoverflow.com/questions/25550711/convert-data-frame-to-json

http://www.browniethoughts.com/2013/02/nosql-databases-introduction.html

https://kvaes.wordpress.com/2015/01/21/database-variants-explained-sql-or-nosql-is-that-really-the-question/