The purpose of this assignment is to migrate a MySQL relational database into a NoSQL database platform. I have chosen ‘Flights’ database as a source and Neo4j graph database as a target NoSQL DB platform.
The purpose of this DB platform migration is to be able to capture relationships between various entities graphically and ability to easily answer critical questions based on the property graph created in graph database.
Note: MySQL database software need to be installed in the local computer in order to successfully execute the R markdown code.
Load necessary libraries -
library(RODBC)
library(dplyr)
library(stringr)
library(kableExtra)
library(data.table)Below are the steps followed to perform database migration -
I have used RODBC package and an ODBC data source called ‘MySQL_Flights’ in order to connect to the database and retrieve tables into respective data frames.
con <- odbcConnect("MySQL_Flights")flights_df <- as.data.frame(sqlFetch(con,"flights"),stringsAsFactors = FALSE)
head(flights_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 2 | 830 | 11 | UA | N14228 | 1545 | EWR | IAH | 227 | 1400 | 5 | 17 |
| 2013 | 1 | 1 | 533 | 4 | 850 | 20 | UA | N24211 | 1714 | LGA | IAH | 227 | 1416 | 5 | 33 |
| 2013 | 1 | 1 | 542 | 2 | 923 | 33 | AA | N619AA | 1141 | JFK | MIA | 160 | 1089 | 5 | 42 |
| 2013 | 1 | 1 | 544 | -1 | 1004 | -18 | B6 | N804JB | 725 | JFK | BQN | 183 | 1576 | 5 | 44 |
| 2013 | 1 | 1 | 554 | -6 | 812 | -25 | DL | N668DN | 461 | LGA | ATL | 116 | 762 | 6 | 54 |
| 2013 | 1 | 1 | 554 | -4 | 740 | 12 | UA | N39463 | 1696 | EWR | ORD | 150 | 719 | 6 | 54 |
airports_df <- as.data.frame(sqlFetch(con,"airports"),stringsAsFactors = FALSE)
head(airports_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| faa | name | lat | lon | alt | tz | dst |
|---|---|---|---|---|---|---|
| 04G | Lansdowne Airport | 41.13047 | -80.61958 | 1044 | -5 | A |
| 06A | Moton Field Municipal Airport | 32.46057 | -85.68003 | 264 | -5 | A |
| 06C | Schaumburg Regional | 41.98934 | -88.10124 | 801 | -6 | A |
| 06N | Randall Airport | 41.43191 | -74.39156 | 523 | -5 | A |
| 09J | Jekyll Island Airport | 31.07447 | -81.42778 | 11 | -4 | A |
| 0A9 | Elizabethton Municipal Airport | 36.37122 | -82.17342 | 1593 | -4 | A |
airlines_df <- as.data.frame(sqlFetch(con,"airlines"),stringsAsFactors = FALSE)
head(airlines_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| carrier | name |
|---|---|
| 9E | Endeavor Air Inc. |
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
| EV | ExpressJet Airlines Inc. |
planes_df <- as.data.frame(sqlFetch(con,"planes"),stringsAsFactors = FALSE)
head(planes_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
|---|---|---|---|---|---|---|---|---|
| N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
| N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
| N105UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
weather_df <- as.data.frame(sqlFetch(con,"weather"),stringsAsFactors = FALSE)
head(weather_df) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")| origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EWR | 2013 | 1 | 1 | 0 | 37.04 | 21.92 | 53.97 | 230 | 10.35702 | 11.91865 | 0 | 1013.9 | 10 |
| EWR | 2013 | 1 | 1 | 1 | 37.04 | 21.92 | 53.97 | 230 | 13.80936 | 15.89154 | 0 | 1013.0 | 10 |
| EWR | 2013 | 1 | 1 | 2 | 37.94 | 21.92 | 52.09 | 230 | 12.65858 | 14.56724 | 0 | 1012.6 | 10 |
| EWR | 2013 | 1 | 1 | 3 | 37.94 | 23.00 | 54.51 | 230 | 13.80936 | 15.89154 | 0 | 1012.7 | 10 |
| EWR | 2013 | 1 | 1 | 4 | 37.94 | 24.08 | 57.04 | 240 | 14.96014 | 17.21583 | 0 | 1012.8 | 10 |
| EWR | 2013 | 1 | 1 | 6 | 39.02 | 26.06 | 59.37 | 270 | 10.35702 | 11.91865 | 0 | 1012.0 | 10 |
## Airports Export
write.csv(airports_df, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\airports.csv")
flights_df <- flights_df %>% mutate(flight_desc = paste(carrier,"-",flight))
## Flights Export
flight_unique <- unique(flights_df %>% select(c("flight","carrier","flight_desc")))
write.csv(flight_unique, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flights.csv")
## Dates Export
flights_df <- flights_df %>% mutate(date = paste(month,"/",day,"/",year))
flight_dates <- unique(flights_df %>% select(c("date","day","month","year")))
write.csv(flight_dates, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flight_dates.csv")
## Flight Origin & Destination Airports
flight_airport <- unique(flights_df %>% select(c("flight_desc","origin","dest")))
write.csv(flight_airport, "C:\\Neo4J\\neo4j-community-3.4.10\\import\\flight_airport.csv")CSV file exported from the R data frames are used as source for Neo4j CYPHER queries. These files are to be placed under {Neo4j Install Dir}/Import location on local computer.
LOAD CSV WITH HEADERS FROM ‘file:///airports.csv’ as line CREATE (airport:Airport {id:line.faa}) SET airport.name = line.name, airport.latitude = TOFLOAT(line.lat), airport.logitude = TOFLOAT(line.lon), airport.altitude = TOINT(line.alt), airport.timezone = TOINT(line.tz), airport.dst = line.dst RETURN airport;
LOAD CSV WITH HEADERS FROM ‘file:///flights.csv’ as line CREATE (flight:Flight {id:line.flight_desc}) SET flight.name = line.flight_desc, flight.carrier = line.carrier, flight.number = line.flight RETURN flight;
LOAD CSV WITH HEADERS FROM ‘file:///flight_airport.csv’ as line MATCH (flight:Flight {id:line.flight_desc}) MATCH (airport:Airport {id:line.origin})
CREATE (flight)-[:ORIGIN]->(airport) RETURN flight,airport;
LOAD CSV WITH HEADERS FROM ‘file:///flight_airport.csv’ as line MATCH (flight:Flight {id:line.flight_desc}) MATCH (airport:Airport {id:line.dest})
CREATE (flight)-[:DESTINATION]->(airport) RETURN flight,airport;
There are many advantages of using NoSQL graph databases -
a) By assembling nodes and relationships into connected structures, graph databases enable us to build simple and sophisticated models that map closely to our problem domain.
b) Each node (entity or attribute) in the graph database model directly and physically contains a list of relationship records that represent the relationships to other nodes.
c) Ability to pre-materialize relationships into the database structure allows Neo4j to provide performance of several orders of magnitude above others, especially for join-heavy relational queries.
d) Cypher, Neo4j’s declarative graph query language, is built on the basic concepts and clauses of SQL but has a lot of additional graph-specific functionality to make it easy to work with your graph model.
Some of the disadvantages of NoSQL databases are as below -
a) Simple NoSQL queries require some programming knowledge, and the most common business intelligence tools that many enterprises rely on do not offer connectivity to NoSQL databases.
b) The end goal for NoSQL database design was to offer a solution that would require no administration, but the reality on the ground is much different. NoSQL databases still demand a lot of technical skill with both installation and maintenance.
c) Each NoSQL database in contrast tends to be open-source, with just one or two firms handling the support angle. Many of them have been developed by smaller startups which lack the resources to fund support on a global scale, and also the credibility that the established RDBMS vendors like Oracle, IBM and Microsoft enjoy.