For this project we will be loading the data from the fire by state data set into a Neo4J graph database. From the work done by Prashanth Padebettu’s group we know that there are performance issues if Neo4J so I have chosen a small data set with only ~150 rows. The first thing that we need to do is load the drivers for Neo4J by using the devtools.
if (!require(RNeo4j)) {install.packages("devtools")
devtools::install_github("nicolewhite/RNeo4j")}
## Loading required package: RNeo4j
Next we load the dplyr() package and load the data from a public github account. We will then gather some basic information about the data set.
if (!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# file <- 'https://raw.githubusercontent.com/eriknylander99/Data/master/statefire.csv' Kit will not read the file from the website for some reason even if I pre-load it into R. Therfore I will be using a local file for the knit process but this is location of the file used to generate the project.
file <- 'statefire.csv' # Pulls the data from a downloaded copy of the file in the default R working directory.
fire <- read.csv(file, stringsAsFactors = FALSE)
This gives us a dataframe of 196 rows and 16 columns now we need to investigate the data set that we have pulled in. Notice that we have NAs in every column. In this data that is an indication that the state did not report fires of that type for a given year. This does not mean that a fire of that type did not occur it just means that they were not coded in on the reports. Therefore we will leave the NAs as NA and work with them when creating the Ne04J database. The state of Kansas is also not in the data set and I was unable to determine why that was the case.
head(fire)
## State Abrv Year Arson Campfire Children Debris.Burning
## 1 Alaska AK 2010 NA 123275.20 174269.10 109774.40
## 2 Alaska AK 2011 0.40 126.30 7.20 28.20
## 3 Alaska AK 2012 NA 183.30 4.00 2696.50
## 4 Alaska AK 2013 NA NA NA NA
## 5 Alabama AL 2010 12297.11 472.35 70.68 7421.65
## 6 Alabama AL 2012 8627.00 311.00 80.00 NA
## Equipment.Use Fireworks Lightning Miscellaneous Powerline Railroad
## 1 19355.60 NA 820884.10 149976.60 3656.10 NA
## 2 8.30 3.2 267782.80 24996.50 2.60 0.10
## 3 81.30 NA 252705.10 30871.40 1.70 0.50
## 4 NA NA NA NA NA NA
## 5 1385.58 101.8 346.15 12634.01 362.78 276.75
## 6 679.00 56.5 3211.50 14470.00 870.50 NA
## Smoking Structure Unassigned Total
## 1 0.30 24815.30 NA 1426006.70
## 2 0.80 59.70 NA 293016.10
## 3 0.20 1.20 1 286546.20
## 4 NA NA 1320753 1320752.70
## 5 167.15 129.62 NA 35665.63
## 6 26.00 76.00 NA 28407.50
tail(fire)
## State Abrv Year Arson Campfire Children Debris.Burning
## 189 West Virginia WV 2012 9402.4 231.90 278.5 2027.90
## 190 Wyoming WY 2011 NA 17.50 7.1 1151.27
## 191 Wyoming WY 2010 NA 6278.40 10.8 604.51
## 192 Wyoming WY 2012 NA 17.25 10.2 4125.60
## 193 Wyoming WY 2013 NA 5.20 9.3 191.00
## 194 Summary SUM NA 633485.1 365693.70 193011.0 855282.45
## Equipment.Use Fireworks Lightning Miscellaneous Powerline Railroad
## 189 742.50 2807.10 407.90 NA 620.1 1.200
## 190 25496.71 217.65 26113.22 1838.28 NA 273.912
## 191 29051.88 74.76 4048.68 57889.50 NA 16157.110
## 192 67345.64 NA 287125.13 33731.58 NA 528.510
## 193 67.97 23.30 9255.05 425.85 NA 24.200
## 194 750034.65 37026.39 5282713.75 2935718.53 643172.7 329540.762
## Smoking Structure Unassigned Total
## 189 0.70 91.40 NA 16611.60
## 190 8.25 NA NA 55123.89
## 191 46.63 NA NA 114162.27
## 192 145.40 NA NA 393029.31
## 193 9.50 NA NA 10011.37
## 194 65163.88 45079.44 2764753 14900675.81
str(fire)
## 'data.frame': 194 obs. of 17 variables:
## $ State : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ Abrv : chr "AK" "AK" "AK" "AK" ...
## $ Year : int 2010 2011 2012 2013 2010 2012 2013 2011 2011 2010 ...
## $ Arson : num NA 0.4 NA NA 12297.1 ...
## $ Campfire : num 123275 126 183 NA 472 ...
## $ Children : num 174269.1 7.2 4 NA 70.7 ...
## $ Debris.Burning: num 109774.4 28.2 2696.5 NA 7421.6 ...
## $ Equipment.Use : num 19355.6 8.3 81.3 NA 1385.6 ...
## $ Fireworks : num NA 3.2 NA NA 101.8 ...
## $ Lightning : num 820884 267783 252705 NA 346 ...
## $ Miscellaneous : num 149977 24997 30871 NA 12634 ...
## $ Powerline : num 3656.1 2.6 1.7 NA 362.8 ...
## $ Railroad : num NA 0.1 0.5 NA 276.8 ...
## $ Smoking : num 0.3 0.8 0.2 NA 167.2 ...
## $ Structure : num 24815.3 59.7 1.2 NA 129.6 ...
## $ Unassigned : num NA NA 1 1320753 NA ...
## $ Total : num 1426007 293016 286546 1320753 35666 ...
The next thing that we need to do is connect to the Neo4J database. It is necessary to have the Neo4J database running to connect to the database.
graph = startGraph("http://localhost:7474/db/data/")
The following code will clear any existing graph so that we can restart with a fresh database. You will not need to enter a yes or no.
clear(graph, input = FALSE)
Next we need to create nodes. The first set of Nodes that we will create are for the unique states. Given the data set the state nodes will only contain the state name and the abbreviation for the state. We will run the following cypher query to generate the State nodes. CREATE (States {name: ‘State’, abrv: ‘Abrv’})
states <- unique(fire$State)
abrv <- unique(fire$Abrv)
for(x in 1:length(states)){
createNode(graph, "States", state= states[x], abrv= abrv[x])
}
Now we will create the Nodes for the types of fire causes that we can have. These nodes will only have a cause property.
causes <- colnames(select(fire, -State, -Abrv, -Year))
for(x in 1:length(causes)){
createNode(graph, "Causes", cause= causes[x])
}
The data set that we have contains data from 4 different years. We have decided to load the data from a single year so that the graph is less complicated with only one relation between state and cause. If we wanted to load the data for each year we would simply need to loop through data for each of the years. We have decided to pick the 2011 year for this load.
fire2011 <- fire %>%
filter(Year == 2011)
head(fire2011)
## State Abrv Year Arson Campfire Children Debris.Burning
## 1 Alaska AK 2011 0.4 126.30 7.20 28.20
## 2 Alabama AL 2011 18367.1 1934.65 105.40 10488.35
## 3 Arkansas AR 2011 14136.7 97.10 108.00 8139.10
## 4 Arizona AZ 2011 2.5 2.50 NA 61.63
## 5 California CA 2011 16409.2 153.10 703.45 1677.92
## 6 Colorado CO 2011 NA NA NA NA
## Equipment.Use Fireworks Lightning Miscellaneous Powerline Railroad
## 1 8.30 3.2 267782.80 24996.50 2.60 0.10
## 2 2796.60 479.6 5028.00 25232.67 463.20 572.95
## 3 2798.00 NA 5009.50 5890.10 NA 304.00
## 4 89.60 NA 27.30 4999.82 NA NA
## 5 15928.33 NA 2153.81 9525.01 992.69 1.10
## 6 NA NA NA NA NA NA
## Smoking Structure Unassigned Total
## 1 0.80 59.70 NA 293016.10
## 2 185.40 154.15 NA 65808.07
## 3 493.00 NA NA 36975.50
## 4 2.40 NA 494.32 5680.07
## 5 161.79 NA 161933.83 209640.23
## 6 NA NA 1030614.67 1030614.67
Now that we have just the data from 2011 we can start building our relationships. We create a relation if there is a value in the column otherwise we will skip building a relationship.
for(x in 1:length(causes)){
cause <- causes[x]
for(y in 1:nrow(fire2011)){
query <- "MATCH (s:States {state:{STATE}}) RETURN s"
stateNode <- getSingleNode(graph, query, STATE = fire2011[y, 'State'])
query <- "MATCH (c: Causes {cause:{CAUSE}}) RETURN c"
causeNode <- getSingleNode(graph, query, CAUSE = cause)
if(!is.na(fire2011[y, cause])){
createRel(stateNode, "2011Acres", causeNode, acres = fire2011[y, cause])
}
}
}
We have now created all of the nodes and relationships in the database. At this point the graphical representation of the data really started to show performance issues. We then decided to finish up by running a simple query against the database using R.
query = 'MATCH (:States {state:{STATE}})-[r]-(:Causes{cause:{CAUSE}}) RETURN r'
answer = getRels(graph, query, STATE = 'Alaska', CAUSE = 'Lightning')
sapply(answer, function(p) p$acres)
## [1] 267782.8
# 267782.8 Acres Burned
query = 'MATCH (:Causes {cause:{CAUSE}})<-[r]-(state) RETURN state'
answer = getNodes(graph, query, CAUSE = 'Lightning')
sapply(answer, function(p) p$abrv)
## [1] "AK" "AL" "AR" "AZ" "CA" "DE" "FL" "GA" "HI" "ID" "IL" "KY" "LA" "MD"
## [15] "ME" "MN" "MO" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OK" "OR"
## [29] "PA" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"
# Returns all of the states that reported a lightning caused fire in 2011
Given the data set that we used for this project there is no real advantage or disadvantage to storing the data in a relational database or in Neo4J’s graph data base. One of the advantages to using a graph database is the speed improvements that you gain in a heavy use setting. For us this would not be a concern. However, we do get an interesting and more intuitive way to visualize our data. Now that we have the basics of the database set up we can easily add more years as connections between the states and the causes. This should allow us to pull information quickly and gives us the opportunity to see relationships between states or by causes. The graphical nature of the database also lets us see if there are any patterns in the data that don’t jump out in a table. While the data was tricky to get into the graph database now that we have done this type of process R and dplyr()’s ability to manipulate a data frame allowed us to complete the process without the need of any intermediate steps or manipulation of the original CSV file.