Assignment

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required.

Part I:

I use flights data sets in this assignment. Following I compare Neo4j DB and Access SQL

Neo4j

Import CSV files and Creat Nodes Airport and Flight in Neo4j: eo4j-airport-csv-raw.csv and neo4j-flight-lab-data.csv. The following steps are part of the neo4j lab in this week.

load csv with headers from "https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-airport-csv-raw.csv" as airports create (a:Airport {label: airports.label, city: airports.city, state: airports.state})
load csv with headers from "https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-flight-lab-data.csv" as flights create (n:Flight {number: flights.flight, airline: flights.airline, capacity: flights.capacity,depart:flights.depart, arrive:flights.arrive, takeoff:flights.takeoff,landing:flights.landing})

Creat relationship Types Arrives and Departs

To connect two tables if Flight.arrive and Ariport.label match. From the graph,we can see the number of flight by ailines arrving and departing in each city.

load csv with headers from "https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-flight-lab-data.csv" as flights match (a: Flight {number: flights.flight}),(b:Airport {label: flights.arrive}) create (a) -[r:Arrives {landing:
flights.landing}]-> (b)
load csv with headers from "https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-flight-lab-data.csv" as flights
match (a: Flight {number: flights.flight}),(b:Airport {label: flights.depart}) create (a) -[r:Departs {takeoff:
flights.takeoff}]-> (b)
neo4j.arrive

neo4j.arrive

neo4j.depart

neo4j.depart

Access

Creat relationship Types Arrives and Departs To connect two tables if Flight.arrive and Ariport.label match. In this data set, we can count the number of flight by ailines arrving to each city.

access.arrive.ralationship

access.arrive.ralationship

To connect two tables if Flight.depart and Ariport.label match. In this data set, we can count the number of flight by ailines departing from each city.

access.depart.ralationship

access.depart.ralationship

Creat tables for Arrives and Departs

access.arrive.data

access.arrive.data

access.arrive.data

access.arrive.data

You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Relationships are first-class citizens of the graph data model, unlike other database management systems, which require us to infer connections between entities using special properties such as foreign keys, or out-of-band processing like map-reduce.

Relational db nomally require us to infer connection between entities suing speical properties such as foreign keys, or out-of-band processing like map-reduce. Relational db can also be called relational database management systems (RDBMS) or SQL databases. The most popular of these are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.

Pros: 1. Expressivge query language & secondary Indexes: Users should be able to access and manipulate their data in sophisticated ways to support both operational and analytical applications. Indexes play a critical role in providing efficient access to data, supported natively by the database rather than maintained in application code.

  1. Strong consistency: It is much more complex to build applications around an eventually consistent model, imposing significant work on the developer, even for the most sophisticated engineering teams.

  2. Enterprise Management and Integrations: Organizations need a database that can be secured, monitored, automated, and integrated with their existing technology infrastructure, processes, and staff, including operations teams, DBAs, and data analysts.

Cons:

  1. Performance problems associated with re-assembling simple data structures into their more complicated real-world representations.

  2. Lack of support for complex base types, e.g., drawings.

  3. SQL is limited when accessing complex data.

  4. Knowledge of the database structure is required to create ad hoc queries.

  5. Locking mechanisms defined by RDBMSs do not allow design transactions to be supported, e.g., the “check in” and “check out” type of feature that would allow an engineer to modify a drawing over the course of several working days.

Graph db is assembling the simple abstractions of nodes and relationships into connected structures. It enable to bnuild sophisticated models that map closely to the problem domain.Graph db also called NoSQL databases, the most popular being MongoDB, DocumentDB, Cassandra, Coachbase, HBase, Redis, and Neo4j.

pros:

  1. Flexible data model:Whether document,graph, key-value, or wide-column, all offer a flexible data model making it easy to store and combine data of any structure and allow dynamic modification of the schema without downtime or performance impace.

  2. Scalability and performance:This allows the database to scale out on commodity hardware deployed on-premises or in the cloud, enabling almost unlimited growth with higher throughput and lower latency than relational databases.

  3. Always-on global deployments:They are designed to run across many nodes, including replication to automatically synchronize data across servers, racks, and data centers.

Cons:

  1. It is a growing technology that will be mature in a few years, so, you have to bet for one and pray for it becoming successful.

  2. Most of them do not have a declarative language, and those that have it lack the capability to optimise queries in a proper way.

  3. You have to use an API.

  4. Many lack native implementations for different platforms, except for Sparksee high-performance graph database, which has native implementations for Linux, Windows, MacOS, iOS, Android and BB10.

Reference

  1. http://www.jamesserra.com/archive/2015/08/relational-databases-vs-non-relational-databases/

  2. https://www.mongodb.com/scale/relational-vs-non-relational-database

  3. http://it.toolbox.com/blogs/enterprise-solutions/some-pros-cons-of-relational-databases-24144

  4. https://www.quora.com/What-are-the-pros-and-cons-of-using-a-graph-database

_____________________________________________________________________________________________________________________________

Part II:

The following to connect neo4j data base, try to manipulation of Neo4j graph objects within the R environment. ___

Load package

## Warning: package 'RNeo4j' was built under R version 3.3.3
## Warning: package 'dbConnect' was built under R version 3.3.3
## Loading required package: RMySQL
## Warning: package 'RMySQL' was built under R version 3.3.3
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.3.3
## Loading required package: gWidgets
## Warning: package 'gWidgets' was built under R version 3.3.3
## 
## Attaching package: 'gWidgets'
## The following object is masked from 'package:RNeo4j':
## 
##     delete
## [[1]]
## [1] "RNeo4j"    "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "dbConnect" "gWidgets"  "RMySQL"    "DBI"       "RNeo4j"   
##  [6] "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [11] "methods"   "base"
graph=startGraph("http://127.0.0.1:7474/db/data/", username = "neo4j", password = "password")
clear(graph, input = FALSE) # delect all existed data base

“CREATE CONSTRAINT” command to create unique constraints on node or relationship properties.

addConstraint(graph, "Airport","label") #the three-character airport code (unique)
addConstraint(graph, "Flight",  "flight" ) #the flight identification number (this is unique)

Load flight and airport csv file form github

Flight<-read.csv("https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-flight-lab-data.csv",header = T)
Airport<-read.csv("https://raw.githubusercontent.com/czhu505/neo4j/master/neo4j-airport-csv-raw.csv",header = T)

createNode: Airport and Flight.

for ( i in 1: nrow(Airport) ) {
createNode (graph, "Airport", 
            label = Airport[[i,"label" ]],
            city = Airport[[i,"city"]], 
            state = Airport[[i,"state"]]  
            )
}


for ( j in 1: nrow(Flight) ) {
createNode (graph, "Flight", 
            flight = Flight[[j,"flight"]], 
            airline = Flight[[j,"airline"]] , 
            capacity = Flight[[j,"capacity" ]],
            arrive = Flight[[j,"arrive" ]]
            )
}

Load relationships

To connect two tables if Flight.arrive and Ariport.label match. In this data set, we can count the number of flight by ailines. And I stop at this step for showinhg how to do it in r.

query1 = "MATCH (a:Flight), (b:Airport) where a.arrive =b.label RETURN a.flight,a.airline, a.capacity, a.arrive, b.label, b.city, b. state"
cypher(graph,query1) 
##    a.flight a.airline a.capacity a.arrive b.label     b.city
## 1        24     Delta        160      ATL     ATL    Atlanta
## 2        28     Delta        160      DTW     DTW    Detroit
## 3        36     Delta        128      DTW     DTW    Detroit
## 4        38     Delta        160      ATL     ATL    Atlanta
## 5        44     Delta        160      ATL     ATL    Atlanta
## 6        46     Delta        160      BOS     BOS     Boston
## 7        23     Delta        160      DTW     DTW    Detroit
## 8        27     Delta        160      BOS     BOS     Boston
## 9        35     Delta        128      PIT     PIT Pittsburgh
## 10       37     Delta        160      BOS     BOS     Boston
## 11       43     Delta        160      PIT     PIT Pittsburgh
## 12       45     Delta        160      PIT     PIT Pittsburgh
## 13      101 Southwest        136      BOS     BOS     Boston
## 14      103 Southwest        136      PIT     PIT Pittsburgh
## 15     1231  American        160      BOS     BOS     Boston
## 16     1257  American        128      ATL     ATL    Atlanta
## 17     1277  American        160      DTW     DTW    Detroit
## 18     1291  American        128      DTW     DTW    Detroit
## 19      102 Southwest        136      DTW     DTW    Detroit
## 20      104 Southwest        136      DTW     DTW    Detroit
## 21     1232  American        160      PIT     PIT Pittsburgh
## 22     1258  American        160      ATL     ATL    Atlanta
## 23     1278  American        160      DTW     DTW    Detroit
## 24     1292  American        160      BOS     BOS     Boston
##          b. state
## 1         Georgia
## 2        Michigan
## 3        Michigan
## 4         Georgia
## 5         Georgia
## 6  Massachussetts
## 7        Michigan
## 8  Massachussetts
## 9    Pennsylvania
## 10 Massachussetts
## 11   Pennsylvania
## 12   Pennsylvania
## 13 Massachussetts
## 14   Pennsylvania
## 15 Massachussetts
## 16        Georgia
## 17       Michigan
## 18       Michigan
## 19       Michigan
## 20       Michigan
## 21   Pennsylvania
## 22        Georgia
## 23       Michigan
## 24 Massachussetts