Assignment Description:

The purpose of this assignment is to migrate a MySQL relational database into a NoSQL database platform. I have chosen ‘Flights’ database as a source and Neo4j graph database as a target NoSQL DB platform.

Assignment Goal:

The purpose of this DB platform migration is to be able to capture relationships between various entities graphically and ability to easily answer critical questions based on the property graph created in graph database.

MySQL Database Tables:

This migration process involves following tables in the ‘flights’ source database -

1. flights

2. airlines

3. airports

4. planes

5. weather

Note: MySQL database software need to be installed in the local computer in order to successfully execute the R markdown code.

R Libraries:

Load necessary libraries -

library(RODBC)
library(dplyr)
library(stringr)
library(kableExtra)
library(data.table)

Below are the steps followed to perform database migration -

  1. Establish MySQL DB Connection:

I have used RODBC package and an ODBC data source called ‘MySQL_Flights’ in order to connect to the database and retrieve tables into respective data frames.

con <- odbcConnect("MySQL_Flights")

Data Sets:

A. Flights:

flights_df <- as.data.frame(sqlFetch(con,"flights"),stringsAsFactors = FALSE)

head(flights_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 6 54
2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 6 54

B. Airports:

airports_df <- as.data.frame(sqlFetch(con,"airports"),stringsAsFactors = FALSE)

head(airports_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
faa name lat lon alt tz dst
04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
06N Randall Airport 41.43191 -74.39156 523 -5 A
09J Jekyll Island Airport 31.07447 -81.42778 11 -4 A
0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4 A

C. Airlines:

airlines_df <- as.data.frame(sqlFetch(con,"airlines"),stringsAsFactors = FALSE)

head(airlines_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
carrier name
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.
EV ExpressJet Airlines Inc.

D. Planes:

planes_df <- as.data.frame(sqlFetch(con,"planes"),stringsAsFactors = FALSE)

head(planes_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan

E. Weather:

weather_df <- as.data.frame(sqlFetch(con,"weather"),stringsAsFactors = FALSE)

head(weather_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702 11.91865 0 1013.9 10
EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936 15.89154 0 1013.0 10
EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858 14.56724 0 1012.6 10
EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936 15.89154 0 1012.7 10
EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014 17.21583 0 1012.8 10
EWR 2013 1 1 6 39.02 26.06 59.37 270 10.35702 11.91865 0 1012.0 10

Graph Database Migration:

  1. Create data model queries to transform data for graph database:
## Airports Export
write.csv(airports_df, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\airports.csv")

flights_df <- flights_df %>% mutate(flight_desc = paste(carrier,"-",flight))

## Flights Export
flight_unique <- unique(flights_df %>% select(c("flight","carrier","flight_desc")))
write.csv(flight_unique, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flights.csv")

## Dates Export
flights_df <- flights_df %>% mutate(date = paste(month,"/",day,"/",year))
flight_dates <- unique(flights_df %>% select(c("date","day","month","year")))
write.csv(flight_dates, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flight_dates.csv")

## Flight Origin & Destination Airports 
flight_airport <- unique(flights_df %>% select(c("flight_desc","origin","dest")))
write.csv(flight_airport, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flight_airport.csv")
  1. Below CYPHER queries were executed to create nodes and relationships in the Neo4j graph database.

CSV file exported from the R data frames are used as source for Neo4j CYPHER queries. These files are to be placed under {Neo4j Install Dir}/Import location on local computer.

A. Airport Nodes - Cypher Query:

LOAD CSV WITH HEADERS FROM ‘file:///airports.csv’ as line CREATE (airport:Airport {id:line.faa}) SET airport.name = line.name, airport.latitude = TOFLOAT(line.lat), airport.logitude = TOFLOAT(line.lon), airport.altitude = TOINT(line.alt), airport.timezone = TOINT(line.tz), airport.dst = line.dst RETURN airport;

B. Flight Nodes - Cypher Query:

LOAD CSV WITH HEADERS FROM ‘file:///flights.csv’ as line CREATE (flight:Flight {id:line.flight_desc}) SET flight.name = line.flight_desc, flight.carrier = line.carrier, flight.number = line.flight RETURN flight;

C. Flight Origin Relation - Cypher Query:

LOAD CSV WITH HEADERS FROM ‘file:///flight_airport.csv’ as line MATCH (flight:Flight {id:line.flight_desc}) MATCH (airport:Airport {id:line.origin})

CREATE (flight)-[:ORIGIN]->(airport) RETURN flight,airport;

D. Flight Destiation Relation - Cypher Query:

LOAD CSV WITH HEADERS FROM ‘file:///flight_airport.csv’ as line MATCH (flight:Flight {id:line.flight_desc}) MATCH (airport:Airport {id:line.dest})

CREATE (flight)-[:DESTINATION]->(airport) RETURN flight,airport;

E. Final Graph database output:

Conclusion:

There are many advantages of using NoSQL graph databases -

a) By assembling nodes and relationships into connected structures, graph databases enable us to build simple and sophisticated models that map closely to our problem domain.

b) Each node (entity or attribute) in the graph database model directly and physically contains a list of relationship records that represent the relationships to other nodes.

c) Ability to pre-materialize relationships into the database structure allows Neo4j to provide performance of several orders of magnitude above others, especially for join-heavy relational queries.

d) Cypher, Neo4j’s declarative graph query language, is built on the basic concepts and clauses of SQL but has a lot of additional graph-specific functionality to make it easy to work with your graph model.

Some of the disadvantages of NoSQL databases are as below -

a) Simple NoSQL queries require some programming knowledge, and the most common business intelligence tools that many enterprises rely on do not offer connectivity to NoSQL databases.

b) The end goal for NoSQL database design was to offer a solution that would require no administration, but the reality on the ground is much different. NoSQL databases still demand a lot of technical skill with both installation and maintenance.

c) Each NoSQL database in contrast tends to be open-source, with just one or two firms handling the support angle. Many of them have been developed by smaller startups which lack the resources to fund support on a global scale, and also the credibility that the established RDBMS vendors like Oracle, IBM and Microsoft enjoy.