library(RNeo4j)
library(readr)
library(dplyr)
library(knitr)
library(ggmap)
library(ggplot2)
library(lubridate)

Assingment

This assignment is to take a relational data base and convert it to a No SQL database using either MongoDB or Neo4j

My Take

I download the station list and the trip data from the Health Ride Pittsburgh (https://healthyridepgh.com/data/) website for 2016Q3. Healthy ride Pittsburgh is Pittsburgh, PA’s bike share program. I wanted to use a graph database (Neo4j) to find where do people bike ride from when with rent a bike at PNC Park, the home of the Pittsburgh Pirates.

Initialize graph

graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "password")
clear(graph, input = FALSE)

addConstraint(graph, "stations", "station_number")

Load Data from CSV files

trips <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRide-Rentals-2016-Q3.csv")
stations <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRideStations2016.csv")

#NOTE:  While this data is not in a relational database per say, the data is relational.  It is 2 tables Stations, and Trips and the Trips table has 2 foriegn keys back to the stations table.  One for destination and one for origin.  

kable(head(trips,5))
Trip id Starttime Stoptime Bikeid Tripduration From station id From station name To station id To station name Usertype
20217133 7/1/2016 0:44 7/1/2016 1:22 70362 2263 1009 12th St & Penn Ave 1025 Penn Ave & N Fairmount St Customer
20218077 7/1/2016 1:22 7/1/2016 1:34 70124 708 1001 Forbes Ave & Market Square 1012 North Shore Trail & Fort Duquesne Bridge Customer
20220848 7/1/2016 2:31 7/1/2016 2:39 70135 475 1049 S 12th St & E Carson St 1007 Stevenson St & Forbes Ave Customer
20227674 7/1/2016 5:48 7/1/2016 6:10 70260 1300 1022 Liberty Ave & S Millvale Ave (West Penn Hospital) 1005 Forbes Ave & Grant St Subscriber
20228273 7/1/2016 6:05 7/1/2016 6:24 70010 1107 1025 Penn Ave & N Fairmount St 1001 Forbes Ave & Market Square Subscriber
kable(head(stations,5))
Station # Station Name # of Racks Latitude Longitude
1000 Liberty Ave & Stanwix St 16 40.44133 -80.00468
1001 Forbes Ave & Market Square 19 40.44088 -80.00308
1002 Third Ave & Wood St 15 40.43903 -80.00186
1003 First Ave & Smithfield St (Art Institute) 15 40.43720 -80.00038
1004 First Ave & B St (T Station) 15 40.43589 -79.99710

Load Stations into the nodes of the graph

for ( i in 1: nrow(stations ) ) {
createNode (graph, "stations", station_number = stations[[i,"Station #"]], name = stations[[i,"Stations Name"]] , num_of_racks = stations[[i,"# of Racks" ]], lat = stations[[i,"Latitude"]] , long = stations[[i,"Longitude"]] )
}

Load relationships into graph

query = "
MATCH (o:stations),(d:stations)
WHERE o.station_number =  {origin_id} AND  d.station_number =  {destination_id} 
CREATE (o)-[r:BIKES_TO]->(d) "

t = newTransaction(graph)

for (i in 1 :nrow(trips)) {
  origin_id = trips[i, ]$`To station id`
  destination_id = trips[i, ]$`From station id`
  trip_id = trips[i, ]$`Trip id`
  bike_id = trips[i, ]$`Bikeid`
  start_time = mdy_hm(trips[i, ]$Starttime, tz = "EST")
  end_time = mdy_hm(trips[i, ]$Stoptime, tz = "EST")
   
  appendCypher(t, 
               query, 
               origin_id = origin_id, 
               destination_id = destination_id,
               trip_id =trip_id,
               start_time = start_time,
               end_time = end_time,
               bike_id = bike_id) 
}

commit(t)

Map Destinations from PNC Park (Station #1013)

trips_from_pnc <- cypher(graph, "match (o:stations)-->(d:stations) where o.station_number = 1013 return d.lat, d.long")
trips_from_pnc <- unique(trips_from_pnc)

map <- get_map(location = c(lon = -79.96, lat = 40.442), zoom = 13, scale = 2, maptype = "roadmap")
ggmap(map) + geom_point(data=trips_from_pnc , aes(y=d.lat, x=d.long) ,color="red", size=1)

Relational Database Pro’s / Cons

Pros:

The main pro of a relational database is that it was built for transactions in a traditional business/ERP setting. It is highly efficient when dealing with transactional rather than reporting or analytic data. SQL the main query language of a relational database is simple and widely understood. Most relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) principles, wich for tranditional business transactional worklaods is critical. The toolset around relational database is extremely mature. Very good at aggregation and windowing functions.

Traditional ACID is important in situations like banking or shopping when you have to do multi-phsased commits. This ensures things like when you take money out of the ATM, the money is properly deducted from your account if the transaction fails in-flight. In many companies the ATM database will be in one system and the core banking in another and changes to those sysems need to be commited in an all or nothing fashion.

Cons:

The main con’s of a relational database is that is not built for analytic, business intelligence or reporting workloads. When modeling complex data the database design can get rather convoluted rather quickly. SQL statements can get very large and can be cumbersome and time consuming to tune. Fault tolerance/High Availablity of the ACID model is hard to scale. It is easier to scale up then out. This requires investment in large expensive servers. Requires Object-Relational or Cursor processing when integrating results into code.

NoSQL Database Pro’s/Cons

Pros:

Data format is highly flexible and can model a variety of senarios with relative ease. Supports more modern work flows like those built around networks or documents (CRM) more easiliy Query language is compact when compared to sql. Built for analytic workloads. Supports more flexible transactions than traditional ACID transactions, like eventual consistency, which may be ok and even disierable in situtuations like facebook that need to scale out but the users do not need a consistent answer on every query. (Some NoSQL are begining to support traditional ACID transations, but this is at the cost of the ease of scale out) It is easy to deploy in a fault tolerant/high availability seneraio on commodity hardware. When writing code there is less translation needed from the database tier to the applicaiton tier (I can put a JSON object directly into a Java Object). There are main NoSQL databases so you can pick one that meets the needs of your individual project.

Cons:

Each NoSQL database has its own query language thus introduces a learning curve. Not all NoSQL databases are ACID Compliant (however depending on your wokflow this may not matter). Tooling is less mature. Does not readiliy integrate with Business Intellegence toolsets (Although this is changing). Alot of NoSQL database run in the JVM this adds overhead when compared to traditional RDBMSs that run in low level code. Traditional aggregation and windowing functions ( Row_number/lag/lead/avg/sum) are slower in NoSQL databases than traditional RDBMSs.

Note: The landscape is rapidly evolving and anything said above may be disproven rather rapidly. I know people who are huge fans of NoSQL database and hate traditional databases and I know people who are huge fans of tranditional database and hate NoSQL.