library(RNeo4j)
library(readr)
library(dplyr)
library(knitr)
library(ggmap)
library(ggplot2)
library(lubridate)
This assignment is to take a relational database and convert it to a NoSQL database using either MongoDB or Neo4j.
I downloaded the station list and the trip data for 2016 Q3 from the Healthy Ride Pittsburgh (https://healthyridepgh.com/data/) website. Healthy Ride is Pittsburgh, PA's bike share program. I wanted to use a graph database (Neo4j) to find where people ride to when they rent a bike at PNC Park, the home of the Pittsburgh Pirates.
# Connect to the local Neo4j instance, clear any existing data, and add a
# uniqueness constraint so each station_number maps to exactly one node
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "password")
clear(graph, input = FALSE)
addConstraint(graph, "stations", "station_number")
trips <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRide-Rentals-2016-Q3.csv")
stations <- read_csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607/master/Project4-NoSQL/HealthyRideStations2016.csv")
#NOTE: While this data is not in a relational database per se, the data is relational. It consists of two tables, Stations and Trips, and the Trips table has two foreign keys back to the Stations table: one for the origin and one for the destination.
kable(head(trips,5))
| Trip id | Starttime | Stoptime | Bikeid | Tripduration | From station id | From station name | To station id | To station name | Usertype |
|---|---|---|---|---|---|---|---|---|---|
| 20217133 | 7/1/2016 0:44 | 7/1/2016 1:22 | 70362 | 2263 | 1009 | 12th St & Penn Ave | 1025 | Penn Ave & N Fairmount St | Customer |
| 20218077 | 7/1/2016 1:22 | 7/1/2016 1:34 | 70124 | 708 | 1001 | Forbes Ave & Market Square | 1012 | North Shore Trail & Fort Duquesne Bridge | Customer |
| 20220848 | 7/1/2016 2:31 | 7/1/2016 2:39 | 70135 | 475 | 1049 | S 12th St & E Carson St | 1007 | Stevenson St & Forbes Ave | Customer |
| 20227674 | 7/1/2016 5:48 | 7/1/2016 6:10 | 70260 | 1300 | 1022 | Liberty Ave & S Millvale Ave (West Penn Hospital) | 1005 | Forbes Ave & Grant St | Subscriber |
| 20228273 | 7/1/2016 6:05 | 7/1/2016 6:24 | 70010 | 1107 | 1025 | Penn Ave & N Fairmount St | 1001 | Forbes Ave & Market Square | Subscriber |
kable(head(stations,5))
| Station # | Station Name | # of Racks | Latitude | Longitude |
|---|---|---|---|---|
| 1000 | Liberty Ave & Stanwix St | 16 | 40.44133 | -80.00468 |
| 1001 | Forbes Ave & Market Square | 19 | 40.44088 | -80.00308 |
| 1002 | Third Ave & Wood St | 15 | 40.43903 | -80.00186 |
| 1003 | First Ave & Smithfield St (Art Institute) | 15 | 40.43720 | -80.00038 |
| 1004 | First Ave & B St (T Station) | 15 | 40.43589 | -79.99710 |
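Before loading anything into Neo4j, the relational structure is easy to see with a quick dplyr sketch (illustrative only, not part of the load): each trip row joins back to the stations table twice, once per foreign key. Column names are taken from the kable output above.

trips_relational <- trips %>%
  # resolve the origin foreign key
  inner_join(stations, by = c("From station id" = "Station #")) %>%
  rename(origin_station = `Station Name`) %>%
  # resolve the destination foreign key
  inner_join(stations, by = c("To station id" = "Station #")) %>%
  rename(destination_station = `Station Name`)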
# Create one stations node per row of the station list
for (i in 1:nrow(stations)) {
  createNode(graph, "stations", station_number = stations[[i, "Station #"]],
             name = stations[[i, "Station Name"]], num_of_racks = stations[[i, "# of Racks"]],
             lat = stations[[i, "Latitude"]], long = stations[[i, "Longitude"]])
}
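As a quick optional sanity check, the number of stations nodes should match nrow(stations):

# Optional sanity check on the station load
cypher(graph, "MATCH (s:stations) RETURN count(s) AS station_count")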
# Each trip becomes a BIKES_TO relationship carrying the trip-level properties
query = "
MATCH (o:stations), (d:stations)
WHERE o.station_number = {origin_id} AND d.station_number = {destination_id}
CREATE (o)-[r:BIKES_TO {trip_id: {trip_id}, bike_id: {bike_id}, start_time: {start_time}, end_time: {end_time}}]->(d) "
t = newTransaction(graph)
for (i in 1:nrow(trips)) {
  # the origin of a trip is its "From" station, the destination its "To" station
  origin_id = trips[i, ]$`From station id`
  destination_id = trips[i, ]$`To station id`
  trip_id = trips[i, ]$`Trip id`
  bike_id = trips[i, ]$Bikeid
  start_time = mdy_hm(trips[i, ]$Starttime, tz = "EST")
  end_time = mdy_hm(trips[i, ]$Stoptime, tz = "EST")
appendCypher(t,
query,
origin_id = origin_id,
destination_id = destination_id,
trip_id = trip_id,
start_time = start_time,
end_time = end_time,
bike_id = bike_id)
}
commit(t)
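As an optional check that the transaction committed, the relationship count should equal nrow(trips):

# Optional sanity check: one BIKES_TO relationship per trip
cypher(graph, "MATCH (:stations)-[r:BIKES_TO]->(:stations) RETURN count(r) AS trip_count")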
# Station 1013 is PNC Park; pull the coordinates of every destination station reached from it
trips_from_pnc <- cypher(graph, "MATCH (o:stations)-->(d:stations) WHERE o.station_number = 1013 RETURN d.lat, d.long")
trips_from_pnc <- unique(trips_from_pnc)
map <- get_map(location = c(lon = -79.96, lat = 40.442), zoom = 13, scale = 2, maptype = "roadmap")
ggmap(map) + geom_point(data=trips_from_pnc , aes(y=d.lat, x=d.long) ,color="red", size=1)
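The map only shows unique destination coordinates. A natural follow-up (a sketch, not something run in the original analysis) is to rank destinations from PNC Park by trip volume with an aggregating Cypher query:

# Sketch: top destinations from PNC Park (station 1013) by number of trips
top_destinations <- cypher(graph,
  "MATCH (o:stations)-[r:BIKES_TO]->(d:stations)
   WHERE o.station_number = 1013
   RETURN d.name AS destination, count(r) AS trips
   ORDER BY trips DESC LIMIT 10")
kable(top_destinations)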
The main pro of a relational database is that it was built for transactions in a traditional business/ERP setting, and it is highly efficient for transactional rather than reporting or analytic workloads. SQL, the main query language of relational databases, is simple and widely understood. Most relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) principles, which is critical for traditional business transactional workloads. The toolset around relational databases is extremely mature, and they are very good at aggregation and windowing functions.
Traditional ACID is important in situations like banking or shopping, where you have to do multi-phased commits. It ensures, for example, that when you take money out of an ATM your account is debited only if the whole transaction succeeds; if the transaction fails in-flight, every step is rolled back. In many companies the ATM database is in one system and the core banking in another, and changes to those systems need to be committed in an all-or-nothing fashion.
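As a rough illustration of those all-or-nothing semantics in R (illustrative only, not part of this project; DBI and an in-memory RSQLite database are assumed):

library(DBI)
# An in-memory SQLite database standing in for a bank ledger
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE accounts (id INTEGER, balance REAL)")
dbExecute(con, "INSERT INTO accounts VALUES (1, 500), (2, 500)")

# Atomic transfer: both UPDATEs commit together, or neither does
dbBegin(con)
ok <- tryCatch({
  dbExecute(con, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
  dbExecute(con, "UPDATE accounts SET balance = balance + 100 WHERE id = 2")
  TRUE
}, error = function(e) FALSE)
if (ok) dbCommit(con) else dbRollback(con)
dbDisconnect(con)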
The main cons of a relational database are that it is not built for analytic, business intelligence, or reporting workloads, and that the design can get convoluted rather quickly when modeling complex data. SQL statements can get very large, cumbersome, and time-consuming to tune. The fault tolerance/high availability of the ACID model is hard to scale: it is easier to scale up than out, which requires investment in large, expensive servers. Finally, integrating results into application code requires object-relational mapping or cursor processing.
NoSQL databases have their own pros. The data format is highly flexible and can model a variety of scenarios with relative ease, and it supports more modern workflows, like those built around networks or documents (CRM), more easily. The query language is often compact compared to SQL. Many NoSQL databases are built for analytic workloads. They support more flexible transaction models than traditional ACID, such as eventual consistency, which may be acceptable and even desirable in situations like Facebook's, where the system needs to scale out but users do not need a consistent answer on every query. (Some NoSQL databases are beginning to support traditional ACID transactions, but at the cost of easy scale-out.) They are easy to deploy in a fault-tolerant/high-availability scenario on commodity hardware. When writing code there is less translation needed from the database tier to the application tier (I can put a JSON object directly into a Java object, as sketched below). Finally, there are many NoSQL databases, so you can pick one that meets the needs of your individual project.
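To illustrate that last point in R rather than Java (a minimal sketch, assuming the jsonlite package; the values are taken from the first trip in the table above):

library(jsonlite)
# A document-style record parses straight into a native R list,
# with no object-relational mapping layer in between
doc <- '{"trip_id": 20217133, "bike_id": 70362, "route": ["1009", "1025"]}'
trip <- fromJSON(doc)
trip$bike_id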
NoSQL has cons as well. Each NoSQL database has its own query language, which introduces a learning curve. Not all NoSQL databases are ACID-compliant (though depending on your workload this may not matter). Tooling is less mature, and NoSQL does not readily integrate with business intelligence toolsets (although this is changing). A lot of NoSQL databases run in the JVM, which adds overhead compared to traditional RDBMSs that run as native code. Traditional aggregation and windowing functions (row_number/lag/lead/avg/sum) are slower in NoSQL databases than in traditional RDBMSs.
Note: The landscape is rapidly evolving, and anything said above may be disproven rather quickly. I know people who are huge fans of NoSQL databases and hate traditional databases, and I know people who are huge fans of traditional databases and hate NoSQL.