Take information from a relational database and migrate it to a NoSQL database. The relational database can be the flights database, the tb database, the data skills database from Project 3, or any another database. The NoSQL database can be MongoDB, Neo4j, or any another NoSQL database. The migration process needs to be reproducible. R
code is encouraged, but not required. Also briefly describe the advantages and disadvantages of storing the data in a relational database versus a NoSQL database.
library(RMySQL)
connection <- dbConnect(MySQL(), user=MySQL_Username, password=MySQL_Password)
dbSendQuery(connection, 'USE Flights;')
## <MySQLResult:0,0,0>
Airlines <- dbGetQuery(connection, "SELECT * FROM Airlines;")
Airports <- dbGetQuery(connection, "SELECT * FROM Airports;")
Flights <- dbGetQuery(connection, "SELECT * FROM Flights;")
Planes <- dbGetQuery(connection, "SELECT * FROM Planes;")
Weather <- dbGetQuery(connection, "SELECT * FROM Weather;")
dbDisconnect(connection)
## [1] TRUE
The mongolite package is more efficient than the RMongo package which requires conversion to JSON
, string manipulation, and several lines of additional code using the below functions:
library(jsonlite)
library(RMongo)
convert <- function(x){
JSON <- toJSON(x)
JSON <- gsub("\\[", "{", JSON)
JSON <- gsub("\\]", "}", JSON)
JSON <- as.character(JSON)
return(JSON)
}
mongodb <- mongoDbConnect("Flights", host = "localhost", port = 27017)
dbInsertDocument(mongodb, 'Table', convert(Table))
dbShowCollections(mongodb)
dbGetQuery(mongodb, "Table", "{}")
dbRemoveQuery(mongodb, 'Table', convert(Table))\
dbDisconnect(mongodb)
An empty query filter document ({}) selects all documents in the collection.
library(mongolite)
mongodb <- mongo(collection = "Flights")
mongodb$insert(Airlines)
##
Complete! Processed total of 16 rows.
## $nInserted
## [1] 16
##
## $nMatched
## [1] 0
##
## $nRemoved
## [1] 0
##
## $nUpserted
## [1] 0
##
## $writeErrors
## list()
mongodb$insert(Airports)
##
Processed 1000 rows...
Complete! Processed total of 1397 rows.
## $nInserted
## [1] 1397
##
## $nMatched
## [1] 0
##
## $nRemoved
## [1] 0
##
## $nUpserted
## [1] 0
##
## $writeErrors
## list()
mongodb$insert(Flights, pagesize = 100000)
##
Processed 1e+05 rows...
Processed 2e+05 rows...
Processed 3e+05 rows...
Complete! Processed total of 336776 rows.
## $nInserted
## [1] 336776
##
## $nMatched
## [1] 0
##
## $nRemoved
## [1] 0
##
## $nUpserted
## [1] 0
##
## $writeErrors
## list()
mongodb$insert(Planes)
##
Processed 1000 rows...
Processed 2000 rows...
Processed 3000 rows...
Complete! Processed total of 3322 rows.
## $nInserted
## [1] 3322
##
## $nMatched
## [1] 0
##
## $nRemoved
## [1] 0
##
## $nUpserted
## [1] 0
##
## $writeErrors
## list()
mongodb$insert(Weather)
##
Processed 1000 rows...
Processed 2000 rows...
Processed 3000 rows...
Processed 4000 rows...
Processed 5000 rows...
Processed 6000 rows...
Processed 7000 rows...
Processed 8000 rows...
Complete! Processed total of 8719 rows.
## $nInserted
## [1] 8719
##
## $nMatched
## [1] 0
##
## $nRemoved
## [1] 0
##
## $nUpserted
## [1] 0
##
## $writeErrors
## list()
mongodb$count()
## [1] 350230
mongodb$drop()
## [1] TRUE
mongodb$count()
## [1] 0
rm(mongodb)
The connection through mongolite automatically disconnects when the connection is removed.
A Neo4j database must be up and running in order to use RNeo4j. The RNeo4j createNode
function returns a “collections containing mixed types cannot be stored in properties” error when nulls exist in the data. The function unmix
changes nulls to the string “NA
” as a work around. The Flights table is rather large with 336776 rows and returns either a “Server error: (500) Internal Server Error” or a “Java heap space” error due the memory constraints on this machine. The function has been limited to the first \(10^5\) rows for the sake of this assignment.
library(RNeo4j)
# system('"C:/Program Files/Neo4j CE 3.0.7/bin/neo4j-ce.exe"')
unmix <- function(x) { x[is.na(x)] <- "NA"; return(x) }
graph = startGraph("http://localhost:7474/db/data/")
clear(graph, input = FALSE)
createNode(graph, "Airlines", unmix(Airlines))
## < Node >
## Airlines
##
## $carrier
## [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ"
## [ reached getOption("max.print") -- omitted 6 entries ]
##
## $name
## [1] "Endeavor Air Inc.\r" "American Airlines Inc.\r"
## [3] "Alaska Airlines Inc.\r" "JetBlue Airways\r"
## [5] "Delta Air Lines Inc.\r" "ExpressJet Airlines Inc.\r"
## [7] "Frontier Airlines Inc.\r" "AirTran Airways Corporation\r"
## [9] "Hawaiian Airlines Inc.\r" "Envoy Air\r"
## [ reached getOption("max.print") -- omitted 6 entries ]
createNode(graph, "Airports", unmix(Airports))
## < Node >
## Airports
##
## $dst
## [1] "A" "A" "A" "A" "A" "A" "A" "A" "U" "A"
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $tz
## [1] -5 -5 -6 -5 -4 -4 -5 -5 -5 -8
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $faa
## [1] "04G" "06A" "06C" "06N" "09J" "0A9" "0G6" "0G7" "0P2" "0S9"
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $name
## [1] "Lansdowne Airport" "Moton Field Municipal Airport"
## [3] "Schaumburg Regional" "Randall Airport"
## [5] "Jekyll Island Airport" "Elizabethton Municipal Airport"
## [7] "Williams County Airport" "Finger Lakes Regional Airport"
## [9] "Shoestring Aviation Airfield" "Jefferson County Intl"
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $alt
## [1] 1044 264 801 523 11 1593 730 492 1000 108
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $lon
## [1] -80.6196 -85.6800 -88.1012 -74.3916 -81.4278 -82.1734 -84.5068
## [8] -76.7812 -76.6472 -122.8106
## [ reached getOption("max.print") -- omitted 1387 entries ]
##
## $lat
## [1] 41.1305 32.4606 41.9893 41.4319 31.0745 36.3712 41.4673 42.8836
## [9] 39.7948 48.0538
## [ reached getOption("max.print") -- omitted 1387 entries ]
createNode(graph, "Flights", unmix(Flights[1:10^5, ]))
## < Node >
## Flights
##
## $flight
## [1] 1545 1714 1141 725 461 1696 507 5708 79 301
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $arr_delay
## [1] "11" "20" "33" "-18" "-25" "12" "19" "-14" "-8" "8"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $distance
## [1] 1400 1416 1089 1576 762 719 1065 229 944 733
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $tailnum
## [1] "N14228" "N24211" "N619AA" "N804JB" "N668DN" "N39463" "N516JB"
## [8] "N829AS" "N593JB" "N3ALAA"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $year
## [1] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $dep_time
## [1] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $origin
## [1] "EWR" "LGA" "JFK" "JFK" "LGA" "EWR" "EWR" "LGA" "JFK" "LGA"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $dep_delay
## [1] "2" "4" "2" "-1" "-6" "-4" "-5" "-3" "-3" "-2"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $dest
## [1] "IAH" "IAH" "MIA" "BQN" "ATL" "ORD" "FLL" "IAD" "MCO" "ORD"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $minute
## [1] "17" "33" "42" "44" "54" "54" "55" "57" "57" "58"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $carrier
## [1] "UA" "UA" "AA" "B6" "DL" "UA" "B6" "EV" "B6" "AA"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $hour
## [1] "5" "5" "5" "5" "6" "6" "6" "6" "6" "6"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $month
## [1] 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $arr_time
## [1] "830" "850" "923" "1004" "812" "740" "913" "709" "838" "753"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $air_time
## [1] "227" "227" "160" "183" "116" "150" "158" "53" "140" "138"
## [ reached getOption("max.print") -- omitted 99990 entries ]
##
## $day
## [1] 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 99990 entries ]
createNode(graph, "Planes", unmix(Planes))
## < Node >
## Planes
##
## $engine
## [1] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan"
## [6] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $tailnum
## [1] "N10156" "N102UW" "N103US" "N104UW" "N10575" "N105UW" "N107US"
## [8] "N108UW" "N109UW" "N110UW"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $year
## [1] "2004" "1998" "1999" "1999" "2002" "1999" "1999" "1999" "1999" "1999"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $engines
## [1] 2 2 2 2 2 2 2 2 2 2
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $model
## [1] "EMB-145XR" "A320-214" "A320-214" "A320-214" "EMB-145LR"
## [6] "A320-214" "A320-214" "A320-214" "A320-214" "A320-214"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $type
## [1] "Fixed wing multi engine" "Fixed wing multi engine"
## [3] "Fixed wing multi engine" "Fixed wing multi engine"
## [5] "Fixed wing multi engine" "Fixed wing multi engine"
## [7] "Fixed wing multi engine" "Fixed wing multi engine"
## [9] "Fixed wing multi engine" "Fixed wing multi engine"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $seats
## [1] 55 182 182 182 55 182 182 182 182 182
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $speed
## [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
## [ reached getOption("max.print") -- omitted 3312 entries ]
##
## $manufacturer
## [1] "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE"
## [4] "AIRBUS INDUSTRIE" "EMBRAER" "AIRBUS INDUSTRIE"
## [7] "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE"
## [10] "AIRBUS INDUSTRIE"
## [ reached getOption("max.print") -- omitted 3312 entries ]
createNode(graph, "Weather", unmix(Weather))
## < Node >
## Weather
##
## $temp
## [1] "37.04" "37.04" "37.94" "37.94" "37.94" "39.02" "39.02" "39.02"
## [9] "39.92" "39.02"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $year
## [1] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $origin
## [1] "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR" "EWR"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $visib
## [1] 10 10 10 10 10 10 10 10 10 10
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $wind_dir
## [1] 230 230 230 230 240 270 250 240 250 260
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $humid
## [1] "53.97" "53.97" "52.09" "54.51" "57.04" "59.37" "61.63" "64.43"
## [9] "62.21" "64.43"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $pressure
## [1] "1013.9" "1013" "1012.6" "1012.7" "1012.8" "1012" "1012.3"
## [8] "1012.5" "1012.2" "1011.9"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $wind_gust
## [1] "11.9186514756" "15.8915353008" "14.5672406924" "15.8915353008"
## [5] "17.2158299092" "11.9186514756" "9.2700622588" "13.242946084"
## [9] "14.5672406924" "14.5672406924"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $precip
## [1] 0 0 0 0 0 0 0 0 0 0
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $hour
## [1] "0" "1" "2" "3" "4" "6" "7" "8" "9" "10"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $dewp
## [1] "21.92" "21.92" "21.92" "23" "24.08" "26.06" "26.96" "28.04"
## [9] "28.04" "28.04"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $month
## [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $wind_speed
## [1] "10.35702" "13.80936" "12.65858" "13.80936" "14.96014" "10.35702"
## [7] "8.05546" "11.5078" "12.65858" "12.65858"
## [ reached getOption("max.print") -- omitted 8709 entries ]
##
## $day
## [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [ reached getOption("max.print") -- omitted 8709 entries ]
options(op)
(query <- cypher(graph, "MATCH (n) RETURN COUNT(n)"))
## COUNT(n)
## 1 5
clear(graph, input = FALSE)
RDBMS | NoSQL |
---|---|
MySQL, Postgres, Oracle Database | MongoDB, Cassandra, HBase, Neo4j |
Relational schema | Schema-free |
Strong Consistency | Strong/Eventual Consistency |
Stores individual records | Storage model Varies |
Scalable reads | Scalable writes/reads |
Custom high-availability (Vertical) | Auto high-availability (Horizontal) |
Flexible queries (SQL) | Limited queries (APIs) |
Closed/open source | Open source |
Supports transactions | Transactions may be limited |
ACID | BASE |
Most NoSQL databases are designed more for performance than data integrity. NoSQL databases do not have a relational schema like the standard RDMS. This has an impact on the consistency of the NoSQL databases. RDMS data is stored as individual records, but NoSQL databases have various models including models that “do away with the table-and-row model altogether” which can help boost performance. NoSQL databases are better for data that requires a lot of reads and writes. NoSQL databases automatically provide auto high-availability and disaster recovery by auto-sharding (partitioning) across many server instances. NoSQL databases have very limited query abilities however. The open-source development model of NoSQL can be considered both an advantage and disadvantage with arguments for reach being outside the scope of this comparison. Lastly, update transactions can be limited in NoSQL databases.
https://www.mongodb.com/nosql-inline
http://dataconomy.com/sql-vs-nosql-need-know/
https://www.devbridge.com/articles/benefits-of-nosql/
https://cran.r-project.org/web/packages/RMongo/RMongo.pdf
https://docs.mongodb.com/v3.2/tutorial/query-documents/
https://neo4j.com/developer/r/
https://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
http://stackoverflow.com/questions/25530538/java-heap-space-error-neo4j
http://www.slideshare.net/nurulferdous/nosql-is-it-for-you
http://stackoverflow.com/questions/25550711/convert-data-frame-to-json
http://www.browniethoughts.com/2013/02/nosql-databases-introduction.html