Project 5

For this project we will be loading the data from the fire by state data set into a Neo4J graph database. From the work done by Prashanth Padebettu’s group we know that there are performance issues if Neo4J so I have chosen a small data set with only ~150 rows. The first thing that we need to do is load the drivers for Neo4J by using the devtools.

if (!require(RNeo4j)) {install.packages("devtools")
devtools::install_github("nicolewhite/RNeo4j")}

## Loading required package: RNeo4j

Loading Data

Next we load the dplyr() package and load the data from a public github account. We will then gather some basic information about the data set.

if (!require(dplyr)){install.packages("dplyr")}

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# file <- 'https://raw.githubusercontent.com/eriknylander99/Data/master/statefire.csv' Kit will not read the file from the website for some reason even if I pre-load it into R. Therfore I will be using a local file for the knit process but this is location of the file used to generate the project.
file <- 'statefire.csv' # Pulls the data from a downloaded copy of the file in the default R working directory.
fire <- read.csv(file, stringsAsFactors = FALSE)

This gives us a dataframe of 196 rows and 16 columns now we need to investigate the data set that we have pulled in. Notice that we have NAs in every column. In this data that is an indication that the state did not report fires of that type for a given year. This does not mean that a fire of that type did not occur it just means that they were not coded in on the reports. Therefore we will leave the NAs as NA and work with them when creating the Ne04J database. The state of Kansas is also not in the data set and I was unable to determine why that was the case.

head(fire)

##     State Abrv Year    Arson  Campfire  Children Debris.Burning
## 1  Alaska   AK 2010       NA 123275.20 174269.10      109774.40
## 2  Alaska   AK 2011     0.40    126.30      7.20          28.20
## 3  Alaska   AK 2012       NA    183.30      4.00        2696.50
## 4  Alaska   AK 2013       NA        NA        NA             NA
## 5 Alabama   AL 2010 12297.11    472.35     70.68        7421.65
## 6 Alabama   AL 2012  8627.00    311.00     80.00             NA
##   Equipment.Use Fireworks Lightning Miscellaneous Powerline Railroad
## 1      19355.60        NA 820884.10     149976.60   3656.10       NA
## 2          8.30       3.2 267782.80      24996.50      2.60     0.10
## 3         81.30        NA 252705.10      30871.40      1.70     0.50
## 4            NA        NA        NA            NA        NA       NA
## 5       1385.58     101.8    346.15      12634.01    362.78   276.75
## 6        679.00      56.5   3211.50      14470.00    870.50       NA
##   Smoking Structure Unassigned      Total
## 1    0.30  24815.30         NA 1426006.70
## 2    0.80     59.70         NA  293016.10
## 3    0.20      1.20          1  286546.20
## 4      NA        NA    1320753 1320752.70
## 5  167.15    129.62         NA   35665.63
## 6   26.00     76.00         NA   28407.50

tail(fire)

##             State Abrv Year    Arson  Campfire Children Debris.Burning
## 189 West Virginia   WV 2012   9402.4    231.90    278.5        2027.90
## 190       Wyoming   WY 2011       NA     17.50      7.1        1151.27
## 191       Wyoming   WY 2010       NA   6278.40     10.8         604.51
## 192       Wyoming   WY 2012       NA     17.25     10.2        4125.60
## 193       Wyoming   WY 2013       NA      5.20      9.3         191.00
## 194       Summary  SUM   NA 633485.1 365693.70 193011.0      855282.45
##     Equipment.Use Fireworks  Lightning Miscellaneous Powerline   Railroad
## 189        742.50   2807.10     407.90            NA     620.1      1.200
## 190      25496.71    217.65   26113.22       1838.28        NA    273.912
## 191      29051.88     74.76    4048.68      57889.50        NA  16157.110
## 192      67345.64        NA  287125.13      33731.58        NA    528.510
## 193         67.97     23.30    9255.05        425.85        NA     24.200
## 194     750034.65  37026.39 5282713.75    2935718.53  643172.7 329540.762
##      Smoking Structure Unassigned       Total
## 189     0.70     91.40         NA    16611.60
## 190     8.25        NA         NA    55123.89
## 191    46.63        NA         NA   114162.27
## 192   145.40        NA         NA   393029.31
## 193     9.50        NA         NA    10011.37
## 194 65163.88  45079.44    2764753 14900675.81

str(fire)

## 'data.frame':    194 obs. of  17 variables:
##  $ State         : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ Abrv          : chr  "AK" "AK" "AK" "AK" ...
##  $ Year          : int  2010 2011 2012 2013 2010 2012 2013 2011 2011 2010 ...
##  $ Arson         : num  NA 0.4 NA NA 12297.1 ...
##  $ Campfire      : num  123275 126 183 NA 472 ...
##  $ Children      : num  174269.1 7.2 4 NA 70.7 ...
##  $ Debris.Burning: num  109774.4 28.2 2696.5 NA 7421.6 ...
##  $ Equipment.Use : num  19355.6 8.3 81.3 NA 1385.6 ...
##  $ Fireworks     : num  NA 3.2 NA NA 101.8 ...
##  $ Lightning     : num  820884 267783 252705 NA 346 ...
##  $ Miscellaneous : num  149977 24997 30871 NA 12634 ...
##  $ Powerline     : num  3656.1 2.6 1.7 NA 362.8 ...
##  $ Railroad      : num  NA 0.1 0.5 NA 276.8 ...
##  $ Smoking       : num  0.3 0.8 0.2 NA 167.2 ...
##  $ Structure     : num  24815.3 59.7 1.2 NA 129.6 ...
##  $ Unassigned    : num  NA NA 1 1320753 NA ...
##  $ Total         : num  1426007 293016 286546 1320753 35666 ...

Connectint to Neo4J

The next thing that we need to do is connect to the Neo4J database. It is necessary to have the Neo4J database running to connect to the database.

graph = startGraph("http://localhost:7474/db/data/")

The following code will clear any existing graph so that we can restart with a fresh database. You will not need to enter a yes or no.

clear(graph, input = FALSE)

Creating Nodes

Next we need to create nodes. The first set of Nodes that we will create are for the unique states. Given the data set the state nodes will only contain the state name and the abbreviation for the state. We will run the following cypher query to generate the State nodes. CREATE (States {name: ‘State’, abrv: ‘Abrv’})

states <- unique(fire$State)
abrv <- unique(fire$Abrv)

for(x in 1:length(states)){
    createNode(graph, "States", state= states[x], abrv= abrv[x])
}

Now we will create the Nodes for the types of fire causes that we can have. These nodes will only have a cause property.

causes <- colnames(select(fire, -State, -Abrv, -Year))

for(x in 1:length(causes)){
    createNode(graph, "Causes", cause= causes[x])
}

Limiting the Data

The data set that we have contains data from 4 different years. We have decided to load the data from a single year so that the graph is less complicated with only one relation between state and cause. If we wanted to load the data for each year we would simply need to loop through data for each of the years. We have decided to pick the 2011 year for this load.

fire2011 <- fire %>%
    filter(Year == 2011)
head(fire2011)

##        State Abrv Year   Arson Campfire Children Debris.Burning
## 1     Alaska   AK 2011     0.4   126.30     7.20          28.20
## 2    Alabama   AL 2011 18367.1  1934.65   105.40       10488.35
## 3   Arkansas   AR 2011 14136.7    97.10   108.00        8139.10
## 4    Arizona   AZ 2011     2.5     2.50       NA          61.63
## 5 California   CA 2011 16409.2   153.10   703.45        1677.92
## 6   Colorado   CO 2011      NA       NA       NA             NA
##   Equipment.Use Fireworks Lightning Miscellaneous Powerline Railroad
## 1          8.30       3.2 267782.80      24996.50      2.60     0.10
## 2       2796.60     479.6   5028.00      25232.67    463.20   572.95
## 3       2798.00        NA   5009.50       5890.10        NA   304.00
## 4         89.60        NA     27.30       4999.82        NA       NA
## 5      15928.33        NA   2153.81       9525.01    992.69     1.10
## 6            NA        NA        NA            NA        NA       NA
##   Smoking Structure Unassigned      Total
## 1    0.80     59.70         NA  293016.10
## 2  185.40    154.15         NA   65808.07
## 3  493.00        NA         NA   36975.50
## 4    2.40        NA     494.32    5680.07
## 5  161.79        NA  161933.83  209640.23
## 6      NA        NA 1030614.67 1030614.67

Creating Relationships

Now that we have just the data from 2011 we can start building our relationships. We create a relation if there is a value in the column otherwise we will skip building a relationship.

for(x in 1:length(causes)){
    cause <- causes[x]
    for(y in 1:nrow(fire2011)){
        
        query <- "MATCH (s:States {state:{STATE}}) RETURN s"
        stateNode <- getSingleNode(graph, query, STATE = fire2011[y, 'State'])
        
        query <- "MATCH (c: Causes {cause:{CAUSE}}) RETURN c"
        causeNode <- getSingleNode(graph, query, CAUSE = cause)
        
        if(!is.na(fire2011[y, cause])){
            createRel(stateNode, "2011Acres", causeNode, acres = fire2011[y, cause])
        }

    }
}

Querying the Database

We have now created all of the nodes and relationships in the database. At this point the graphical representation of the data really started to show performance issues. We then decided to finish up by running a simple query against the database using R.

query = 'MATCH (:States {state:{STATE}})-[r]-(:Causes{cause:{CAUSE}}) RETURN r'
answer = getRels(graph, query, STATE = 'Alaska', CAUSE = 'Lightning')
sapply(answer, function(p) p$acres)

## [1] 267782.8

# 267782.8 Acres Burned

query = 'MATCH (:Causes {cause:{CAUSE}})<-[r]-(state) RETURN state'
answer = getNodes(graph, query, CAUSE = 'Lightning')
sapply(answer, function(p) p$abrv)

##  [1] "AK" "AL" "AR" "AZ" "CA" "DE" "FL" "GA" "HI" "ID" "IL" "KY" "LA" "MD"
## [15] "ME" "MN" "MO" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OK" "OR"
## [29] "PA" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"

# Returns all of the states that reported a lightning caused fire in 2011

Advantages and Disadvantages

Given the data set that we used for this project there is no real advantage or disadvantage to storing the data in a relational database or in Neo4J’s graph data base. One of the advantages to using a graph database is the speed improvements that you gain in a heavy use setting. For us this would not be a concern. However, we do get an interesting and more intuitive way to visualize our data. Now that we have the basics of the database set up we can easily add more years as connections between the states and the causes. This should allow us to pull information quickly and gives us the opportunity to see relationships between states or by causes. The graphical nature of the database also lets us see if there are any patterns in the data that don’t jump out in a table. While the data was tricky to get into the graph database now that we have done this type of process R and dplyr()’s ability to manipulate a data frame allowed us to complete the process without the need of any intermediate steps or manipulation of the original CSV file.