library(RNeo4j)
library(readr)
library(dplyr)
library(knitr)
library(ggmap)
library(ggplot2)
library(lubridate)
This assignment is to take a relational database and convert it to a NoSQL database using either MongoDB or Neo4j.
I downloaded the station list and the trip data for 2016 Q3 from the Healthy Ride Pittsburgh (https://healthyridepgh.com/data/) website. Healthy Ride is Pittsburgh, PA's bike share program. I wanted to use a graph database (Neo4j) to find where people ride to when they rent a bike at PNC Park, the home of the Pittsburgh Pirates.
# Connect to the local Neo4j instance, clear any existing data, and add a
# uniqueness constraint so each station_number maps to exactly one node
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "password")
clear(graph, input = FALSE)
addConstraint(graph, "stations", "station_number")
trips <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRide-Rentals-2016-Q3.csv")
stations <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRideStations2016.csv")
#NOTE: While this data is not in a relational database per se, the data is relational. It consists of two tables, Stations and Trips, and the Trips table has two foreign keys back to the Stations table: one for the origin and one for the destination.
kable(head(trips,5))
| Trip id | Starttime | Stoptime | Bikeid | Tripduration | From station id | From station name | To station id | To station name | Usertype |
|---|---|---|---|---|---|---|---|---|---|
| 20217133 | 7/1/2016 0:44 | 7/1/2016 1:22 | 70362 | 2263 | 1009 | 12th St & Penn Ave | 1025 | Penn Ave & N Fairmount St | Customer |
| 20218077 | 7/1/2016 1:22 | 7/1/2016 1:34 | 70124 | 708 | 1001 | Forbes Ave & Market Square | 1012 | North Shore Trail & Fort Duquesne Bridge | Customer |
| 20220848 | 7/1/2016 2:31 | 7/1/2016 2:39 | 70135 | 475 | 1049 | S 12th St & E Carson St | 1007 | Stevenson St & Forbes Ave | Customer |
| 20227674 | 7/1/2016 5:48 | 7/1/2016 6:10 | 70260 | 1300 | 1022 | Liberty Ave & S Millvale Ave (West Penn Hospital) | 1005 | Forbes Ave & Grant St | Subscriber |
| 20228273 | 7/1/2016 6:05 | 7/1/2016 6:24 | 70010 | 1107 | 1025 | Penn Ave & N Fairmount St | 1001 | Forbes Ave & Market Square | Subscriber |
kable(head(stations,5))
| Station # | Station Name | # of Racks | Latitude | Longitude |
|---|---|---|---|---|
| 1000 | Liberty Ave & Stanwix St | 16 | 40.44133 | -80.00468 |
| 1001 | Forbes Ave & Market Square | 19 | 40.44088 | -80.00308 |
| 1002 | Third Ave & Wood St | 15 | 40.43903 | -80.00186 |
| 1003 | First Ave & Smithfield St (Art Institute) | 15 | 40.43720 | -80.00038 |
| 1004 | First Ave & B St (T Station) | 15 | 40.43589 | -79.99710 |
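Before loading anything into Neo4j, the relational structure is easy to see with a quick dplyr sketch (illustrative only, not part of the load): each trip row joins back to the stations table twice, once per foreign key. Column names are taken from the kable output above.

trips_relational <- trips %>%
  # resolve the origin foreign key
  inner_join(stations, by = c("From station id" = "Station #")) %>%
  rename(origin_station = `Station Name`) %>%
  # resolve the destination foreign key
  inner_join(stations, by = c("To station id" = "Station #")) %>%
  rename(destination_station = `Station Name`)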
# Create one stations node per row of the station list
for (i in 1:nrow(stations)) {
  createNode(graph, "stations", station_number = stations[[i, "Station #"]],
             name = stations[[i, "Station Name"]], num_of_racks = stations[[i, "# of Racks"]],
             lat = stations[[i, "Latitude"]], long = stations[[i, "Longitude"]])
}
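As a quick optional sanity check, the number of stations nodes should match nrow(stations):

# Optional sanity check on the station load
cypher(graph, "MATCH (s:stations) RETURN count(s) AS station_count")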
# Each trip becomes a BIKES_TO relationship carrying the trip-level properties
query = "
MATCH (o:stations), (d:stations)
WHERE o.station_number = {origin_id} AND d.station_number = {destination_id}
CREATE (o)-[r:BIKES_TO {trip_id: {trip_id}, bike_id: {bike_id}, start_time: {start_time}, end_time: {end_time}}]->(d) "
t = newTransaction(graph)
for (i in 1:nrow(trips)) {
  # the origin of a trip is its "From" station, the destination its "To" station
  origin_id = trips[i, ]$`From station id`
  destination_id = trips[i, ]$`To station id`
  trip_id = trips[i, ]$`Trip id`
  bike_id = trips[i, ]$Bikeid
  start_time = mdy_hm(trips[i, ]$Starttime, tz = "EST")
  end_time = mdy_hm(trips[i, ]$Stoptime, tz = "EST")
appendCypher(t,
query,
origin_id = origin_id,
destination_id = destination_id,
trip_id = trip_id,
start_time = start_time,
end_time = end_time,
bike_id = bike_id)
}
commit(t)
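As an optional check that the transaction committed, the relationship count should equal nrow(trips):

# Optional sanity check: one BIKES_TO relationship per trip
cypher(graph, "MATCH (:stations)-[r:BIKES_TO]->(:stations) RETURN count(r) AS trip_count")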
# Station 1013 is PNC Park; pull the coordinates of every destination station reached from it
trips_from_pnc <- cypher(graph, "MATCH (o:stations)-->(d:stations) WHERE o.station_number = 1013 RETURN d.lat, d.long")
trips_from_pnc <- unique(trips_from_pnc)
map <- get_map(location = c(lon = -79.96, lat = 40.442), zoom = 13, scale = 2, maptype = "roadmap")
ggmap(map) + geom_point(data=trips_from_pnc , aes(y=d.lat, x=d.long) ,color="red", size=1)
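The map only shows unique destination coordinates. A natural follow-up (a sketch, not something run in the original analysis) is to rank destinations from PNC Park by trip volume with an aggregating Cypher query:

# Sketch: top destinations from PNC Park (station 1013) by number of trips
top_destinations <- cypher(graph,
  "MATCH (o:stations)-[r:BIKES_TO]->(d:stations)
   WHERE o.station_number = 1013
   RETURN d.name AS destination, count(r) AS trips
   ORDER BY trips DESC LIMIT 10")
kable(top_destinations)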
The main pro of a relational database is that it was built for transactions in a traditional business/ERP setting, and it is highly efficient for transactional rather than reporting or analytic workloads. SQL, the main query language of relational databases, is simple and widely understood. Most relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) principles, which is critical for traditional business transactional workloads. The toolset around relational databases is extremely mature, and they are very good at aggregation and windowing functions.
Traditional ACID is important in situations like banking or shopping, where you have to do multi-phased commits. It ensures, for example, that when you take money out of an ATM your account is debited only if the whole transaction succeeds; if the transaction fails in-flight, every step is rolled back. In many companies the ATM database is in one system and the core banking in another, and changes to those systems need to be committed in an all-or-nothing fashion.
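As a rough illustration of those all-or-nothing semantics in R (illustrative only, not part of this project; DBI and an in-memory RSQLite database are assumed):

library(DBI)
# An in-memory SQLite database standing in for a bank ledger
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE accounts (id INTEGER, balance REAL)")
dbExecute(con, "INSERT INTO accounts VALUES (1, 500), (2, 500)")

# Atomic transfer: both UPDATEs commit together, or neither does
dbBegin(con)
ok <- tryCatch({
  dbExecute(con, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
  dbExecute(con, "UPDATE accounts SET balance = balance + 100 WHERE id = 2")
  TRUE
}, error = function(e) FALSE)
if (ok) dbCommit(con) else dbRollback(con)
dbDisconnect(con)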
The main cons of a relational database are that it is not built for analytic, business intelligence, or reporting workloads, and that the design can get convoluted rather quickly when modeling complex data. SQL statements can get very large, cumbersome, and time-consuming to tune. The fault tolerance/high availability of the ACID model is hard to scale: it is easier to scale up than out, which requires investment in large, expensive servers. Finally, integrating results into application code requires object-relational mapping or cursor processing.
NoSQL databases have their own pros. The data format is highly flexible and can model a variety of scenarios with relative ease, and it supports more modern workflows, like those built around networks or documents (CRM), more easily. The query language is often compact compared to SQL. Many NoSQL databases are built for analytic workloads. They support more flexible transaction models than traditional ACID, such as eventual consistency, which may be acceptable and even desirable in situations like Facebook's, where the system needs to scale out but users do not need a consistent answer on every query. (Some NoSQL databases are beginning to support traditional ACID transactions, but at the cost of easy scale-out.) They are easy to deploy in a fault-tolerant/high-availability scenario on commodity hardware. When writing code there is less translation needed from the database tier to the application tier (I can put a JSON object directly into a Java object, as sketched below). Finally, there are many NoSQL databases, so you can pick one that meets the needs of your individual project.
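To illustrate that last point in R rather than Java (a minimal sketch, assuming the jsonlite package; the values are taken from the first trip in the table above):

library(jsonlite)
# A document-style record parses straight into a native R list,
# with no object-relational mapping layer in between
doc <- '{"trip_id": 20217133, "bike_id": 70362, "route": ["1009", "1025"]}'
trip <- fromJSON(doc)
trip$bike_id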
NoSQL has cons as well. Each NoSQL database has its own query language, which introduces a learning curve. Not all NoSQL databases are ACID-compliant (though depending on your workload this may not matter). Tooling is less mature, and NoSQL does not readily integrate with business intelligence toolsets (although this is changing). A lot of NoSQL databases run in the JVM, which adds overhead compared to traditional RDBMSs that run as native code. Traditional aggregation and windowing functions (row_number/lag/lead/avg/sum) are slower in NoSQL databases than in traditional RDBMSs.
Note: The landscape is rapidly evolving, and anything said above may be disproven rather quickly. I know people who are huge fans of NoSQL databases and hate traditional databases, and I know people who are huge fans of traditional databases and hate NoSQL.