The purpose of this assignment is to migrate data from a relational database to a NoSQL relational database. We will demonstrate how to move from a dataframe to a graph database using igraph and then RNeo4J.
All NYC flights from 2013 are included in a table called flights from the dplyr package.
library(nycflights13)
library(dplyr)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
We will create a small relational database using a subset of the flights data.
remove(flights)
library(nycflights13)
library(dplyr)
library(tidyr)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
flights_raw <- flights
airports <- flights_raw %>%
select(origin, dest) %>%
gather(value = "airportCode") %>%
select(airportCode) %>%
distinct() %>%
arrange(airportCode) %>%
glimpse()
## Observations: 107
## Variables: 1
## $ airportCode <chr> "ABQ", "ACK", "ALB", "ANC", "ATL", "AUS", "AVL", "...
flights <- flights_raw %>%
select(origin, dest, flight, carrier) %>%
group_by(origin, dest, flight, carrier) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
glimpse()
## Observations: 12,075
## Variables: 5
## $ origin <chr> "EWR", "EWR", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK"...
## $ dest <chr> "HNL", "LAX", "BUR", "CLT", "IAD", "LAS", "LAX", "LAX"...
## $ flight <int> 15, 119, 359, 219, 5712, 251, 181, 407, 413, 1783, 215...
## $ carrier <chr> "UA", "AA", "B6", "B6", "EV", "VX", "AA", "VX", "VX", ...
## $ count <int> 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365,...
Now we have two tables, which could be linked using the airport code as a primary/foreign key in a relational database. (We don’t have the data, but one could imagine including more properties in the airports table, for example, full name, city, number of terminals, etc.)
We can visualize the airports as nodes and scheduled flights as relationships between them.
library(igraph)
g <- graph_from_data_frame(flights, directed=TRUE, vertices=airports)
plot.igraph(g)
This graph is impossible to read! To get a smaller sense, let’s look at a smaller subset of the airports, such as just flights that are scheduled at least once a day.
daily_flights <- flights %>% filter(count >=365)
eids <- daily_flights$flight
plot(subgraph.edges(g, eids, delete.vertices = TRUE), width = 10)
RNeo4J is not yet available on CRAN. Install from the developer’s Git Hub site using devtools:
install.packages("devtools")
devtools::install_github("nicolewhite/RNeo4j")
Once installed, load using the library function, as you would with any other package.
library(RNeo4j)
This connection is through an Amazon Web Server virtual Ubuntu machine. This was done to avoid configuration issues using a Mac, and to ensure reproducibility. For a fuller explanation on how to configure an Amazon Web Serives virtual machine, see https://timhagmann.com/html/e63/hw2-hagmann-tim.html#appendix:_neo4j_setup
graph <- startGraph(
#This links to my virtual machine
url = "http://ec2-18-212-155-38.compute-1.amazonaws.com:7474/db/data/",
username = "neo4j",
password = "neopw",
#this option will give you back R in case you write a bad query by timing out after 2 seconds
opts = list(timeout=2))
Once again, we will simplify our analysis by using only those flights which occur at least 365 times (most likley daily flights, which we have stored as daily_flights).
clear(graph, input = F)
addConstraint(graph, "airports", "airportCode")
addConstraint(graph, "flights", "flight")
for(i in 1:length(daily_flights$flight)){
origin = getOrCreateNode(graph, "airports", airportCode = daily_flights[i,]$origin)
dest = getOrCreateNode(graph, "airports", airportCode = daily_flights[i,]$dest)
createRel(origin, "Flights", dest)
}
Neo4J Browser Results
A relational database is very powerful, but it can be tricky to maintain joins across multiple types of relationships. For a very simple database, such as the one we have here, probably a relational database is easier to work with; however, we can imagine adding more kinds of data, for example more informaiton about the airports, or people who visit them, or carriers, where a relational database would quickly become overwhelmed–or at least overwhelming. For a data structure that may change frequently, or be heavily based on different types of relationships, a NoSQL database may be preferred.
A good explanation of how to configure Neo4J to run on an Amazon Web Services Virtual Machine can be found here: https://timhagmann.com/html/e63/hw2-hagmann-tim.html#appendix:_neo4j_setup
Code adapted and inspiration from data scientist Nicole White at her blog.