Project 4 - NoSQL and Neo4j

Solution summary

For Project 4, we will use the data set from the package nycflights13. This package has different flight information. It is also has airports, weather, airlines, plane type information.

knitr::opts_chunk$set(echo = TRUE)

library("RNeo4j")

## Warning: package 'RNeo4j' was built under R version 3.3.2

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")
library(mongolite)

## Warning: package 'mongolite' was built under R version 3.3.2

library(RMySQL)

## Loading required package: DBI

library(nycflights13)

## Warning: package 'nycflights13' was built under R version 3.3.2

library(DT)

Load the data into MySQL server:

Here were have created a database in local mysql server and connecting to that database.

conn <- dbConnect(MySQL(),
                 user = 'root',
                 password = 'admin',
                 host = 'localhost',
                 dbname = 'nycflights13')

Once we established the connection, we are loading the records from nycflights13 dataset. We are creating all the tables and loading the necessary data.

if(!dbExistsTable(conn,"airlines")) {

dbWriteTable(conn, value = airlines, name = "airlines", row.names=F, append = TRUE ) 
}

if(!dbExistsTable(conn,"airports")) {

dbWriteTable(conn, value = airports, name = "airports", row.names=F, append = TRUE ) 
} 
if(!dbExistsTable(conn,"flights")) {

dbWriteTable(conn, value = flights, name = "flights", row.names=F, append = TRUE ) 
}
if(!dbExistsTable(conn,"planes")) {

dbWriteTable(conn, value = planes, name = "planes", row.names=F, append = TRUE ) 
}

if(!dbExistsTable(conn,"weather")) {

dbWriteTable(conn, value = weather, name = "weather", row.names=F, append = TRUE ) 
}


dbDisconnect(conn)

## [1] TRUE

Information Observed in MYSQL:

Setting up the database is ease and it can be scalled quickly.
Database actions are very fast using this package.
Data maupulations is quick. No much time lag for this dataset.
Lot of duplications in the data when we combine the different tables.
Database queries are traditional SQL. It is quick to learn and easy to work on it.

Joins tables in R

Once we load the necessary tables, we will join the different dataframes and create a aggregated dataset with all the rows.

flightfact_mysql <- inner_join(flights,airlines,by="carrier")

flightfact_mysql <- inner_join(flightfact_mysql,airports,by=c("origin"="faa"))

flightfact_mysql <- left_join(flightfact_mysql,planes,by="tailnum")

flightfact_mysql  <- left_join(flightfact_mysql,weather,by.flightfact_mysql=c("year","month","day","hour","origin"),by.weather=c("year","month","day","hour","origin"))

## Joining, by = c("month", "day", "origin", "hour", "time_hour")

flightfact_mysql <- dplyr::rename(flightfact_mysql,year=`year.x`,airlines =`name.x`,airport=`name.y`,flight_year=`year.y`)

head(flightfact_mysql,2)

## # A tibble: 2 × 44
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <dbl> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## # ... with 37 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, airlines <chr>, airport <chr>, lat <dbl>, lon <dbl>,
## #   alt <int>, tz <dbl>, dst <chr>, flight_year <int>, type <chr>,
## #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>, year <dbl>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Now the above dataset contains all the aggegated information from different dataframes. This can be used to load in other databases.

Mongodb

Mongodb is one of the Nosql database. It is very famous for its fluid data structure. All the formats are loaded in JSON format. Here the databases are called as collections.

Here I have installed the local mongodb instance and created a database in it.

m <- mongo(collection="flightfact_mongodb",db="nycflights13")

m$insert(flightfact_mysql)

We are inserting the data into mongodb database from the aggegated results which is manipulated in R.

Finding data

In mongodb, the search query is replaced by JSON format of query. The syntax is easy to write and human readable. Below queries finds the flights between origin EWR and destination LGA

m$find('{"origin":"EWR","dest":"LGA"}')

m$find('{"airlines":"United Air Lines Inc."}')

Information Observed in Mongodb:

Setting the database needs little bit of knowledge and need to know how to start the server.
There is no GUI for mongodb. Everything is performed in command prompt. Although there are some from 3rd party, it is not combined with database.
When R interacts with mongodb, there is a lag between the operations.
Writing to database takes time. It is not quick as mysql.
Fetching the results are quick.
There are no joins. So to perform joins, we need to do lot of operations.

Neo4j

Neo4j is an graphical database. We can link the data of different tables. All the data points are called as nodes. The nodes can have label. These labels will have properties of that specific node. The relationship also been tagged as labels. It also can have properties.

I have installed the local instance of NEO4J and created the database. All the interactions are performed via CYPER queries.

#setting up the database instance

neo4j = startGraph("http://localhost:7474/db/data/","neo4j","admin")


#Delete the previous relationships and nodes

cypher(neo4j,"MATCH ()-[r:Departs]-() delete r")
cypher(neo4j,"MATCH ()-[r:Arrival]-() delete r")
cypher(neo4j,"match(n) delete n")

Here we need to create nodes for each dimension. For example, all the flights and locations are nodes. The relationships define the relation between nodes. In this dataset Arrival and Departure are called as relationships.

#cypher(neo4j,"match(n:airline) delete n")

#for airlines

for(i in 1:nrow(airlines)) {
  
  cypher(neo4j,paste0("create(carrier:airline{code:'",airlines[i,1],"',name:'",airlines[i,2],"'})",collapse = ""))
  
}


#For origin

#cypher(neo4j,"match(n:origin) delete n")

origin_codes <- unique(flightfact_mysql$origin) 

for(i in 1:length(origin_codes)) {
  
  cypher(neo4j,paste0("create(location:origin{origin_code:'",origin_codes[i],"'})",collapse = ""))
  
}


#For dest

#cypher(neo4j,"match(n:dest) delete n")

dest_codes <- unique(flightfact_mysql$dest)

for(i in 1:length(dest_codes)) {
  
  cypher(neo4j,paste0("create(location:dest{dest_code:'",dest_codes[i],"'})",collapse = ""))
  
}


#For depart

#cypher(neo4j,"MATCH ()-[r:Departs]-() delete r")


depart <- unique(flightfact_mysql[,c("origin","carrier")])


for(i in 1:nrow(depart)) {
  
  cypher(neo4j,paste0("match (air:airline{code:'",depart[i,2],"'}),(org:origin{origin_code:'",depart[i,1],"'}) create (air)-[r:Departs]->(org)"))
  
}



#For arrival

#cypher(neo4j,"MATCH ()-[r:Arrival]-() delete r")


arrival <- unique(flightfact_mysql[,c("dest","carrier")])


for(i in 1:nrow(arrival)) {
  
  cypher(neo4j,paste0("match (air:airline{code:'",arrival[i,2],"'}),(dest:dest{dest_code:'",arrival[i,1],"'}) create (air)-[r:Arrival]->(dest)"))
  
}

Information Observed in Mongodb:

Installation is easy and creating a database was simple.
The data is seen in visual format. So it is easy to think more about data.
It is done in step by step. So we need to understand the data more.
Relationships play a major role. It gives meaning to the nodes.
The grphical interface is very nice. But it becomes messy if you have more data and nodes.
Fetching the required data is very quick and fast.
Need to learn new query language called CYPER. Although it is easy for smaller operations.

Difference between the databases

Name	MYSQL	Mongodb	Neo4j
Description	Widely used database	Popular nosql database	Getting popular
Database type	RDBMS	Document Store	Graph
Transaction	ACID	No ACID	ACID
In-memory	Yes	Yes	No
Application scenarios	Traditional structure	API’s and IOT	Real-time recommendations
Scalability	Easy	Easy	Medium
Main character	Zylo data	JSON format	Relationship

Project 4 - NoSQL and Neo4j

Shyam BV

November 15, 2016

Project 4:

Solution summary

Load the data into MySQL server:

Information Observed in MYSQL:

Joins tables in R

Mongodb

Finding data

Information Observed in Mongodb:

Neo4j

Information Observed in Mongodb:

Difference between the databases