My NoSQL database of choice for this assignment is MongoDB. It is quick and simple to set up, and it appears to be an industry leader among NoSQL databases. Let's see how it compares to SQL overall.
The data set we will be pulling is the Details of Motor Vehicle Collisions in New York City, provided by the New York City Police Department (NYPD). This dataset is important as part of NYC Mayor de Blasio's Vision Zero program to reduce traffic-related injuries and deaths.
The primary mission of government is to protect the public. New York’s families deserve and expect safe streets. But today in New York, approximately 4,000 New Yorkers are seriously injured and more than 250 are killed each year in traffic crashes. Being struck by a vehicle is the leading cause of injury-related death for children under 14, and the second leading cause for seniors. On average, vehicles seriously injure or kill a New Yorker every two hours.
You can learn more about Vision Zero here.
The snippet below will not knit in R, so you will need to run this code separately on your machine to load the dataset. The download is ~140 MB and will take several minutes through NYC's open data portal, so keep that in mind before running.
destfile <- "traffic_accidents.csv"

# Avoid silently overwriting an earlier download
if (file.exists(destfile)) {
  stop(sprintf("%s already exists in %s", destfile, getwd()))
}

file_to_download <- "https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD"
download.file(file_to_download, destfile = destfile)

# Read the CSV into memory, then clean up the file on disk
traffic_accidents <- read.csv(destfile)
file.remove(destfile)

I am assuming you have already installed MongoDB, but if you haven't you will need to navigate to the MongoDB website and download the latest version. After installing MongoDB, follow the next steps to open a connection.
After installation:

1. Open a command prompt window in the bin folder of your MongoDB installation and run mongod to start the server.
2. In a new command prompt window in the same bin folder, run mongo, then continue to the next section to create our new collection.
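Before creating the real collection, a quick sanity check from R confirms the server is reachable. This is just an illustrative sketch (the connectivity_check collection name is made up); mongolite connects to mongodb://localhost:27017 by default, and mongo() errors immediately if no server is listening.

library(mongolite)

# mongo() connects eagerly, so this line fails right away if mongod
# is not running on the default localhost:27017
check <- mongo(collection = "connectivity_check")
check$count()  # 0 for a brand-new, empty collection
rm(check)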
Let's create a traffic_accidents collection and load our data into it to make sure we have the appropriate setup.
library(mongolite)

# Connect to the collection (it is created on first insert)
m <- mongo(collection = "traffic_accidents")

# Start from a clean slate if the collection already holds documents
if (m$count() > 0) m$drop()

# MongoDB field names cannot contain periods (which read.csv substitutes
# for spaces), so strip them from the column names before inserting
names(traffic_accidents) <- gsub("[.]", "", names(traffic_accidents))

m$insert(traffic_accidents)

Checking the results of our inserts:
sprintf("%s rows were inserted out of %s from %s",
        m$count(), nrow(traffic_accidents), destfile)

[1] "769054 rows were inserted out of 769054 from traffic_accidents.csv"
It looks like we were able to add all of our information. Now let's work with the data to investigate some of the traffic accidents.
Let's see if there is any clustering of multiple-person traffic injuries in Manhattan. We will pull all traffic accidents in which more than 3 people were injured and map them to see if any pattern emerges. The threshold of 3 injuries is arbitrary; we could use a lower cutoff, but then we might need to restrict the time frame to keep the number of records manageable.
results <- m$find('{"BOROUGH" : "MANHATTAN", "NUMBEROFPERSONSINJURED" : { "$gt" : 3} }')

library(leaflet)
leaflet(results) %>%
  addProviderTiles("Stamen.Toner") %>%
  addMarkers(clusterOptions = markerClusterOptions(),
             lng = ~LONGITUDE, lat = ~LATITUDE,
             popup = paste("Reason:",
                           as.character(results$CONTRIBUTINGFACTORVEHICLE1), "<br>",
                           "# of People Injured:",
                           as.character(results$NUMBEROFPERSONSINJURED), "<br>",
                           "Date of Accident:",
                           as.character(results$DATE)))

Interestingly, we see a large cluster of multiple-injury accidents in East Harlem. I can't tell if there is any particular reason for the clustering, but further research could determine the root cause, or whether the count is even significant for the region.
The data set we used is moderately large at ~145 MB, and with a relational database I think we could reduce its size considerably, especially as the data set grows. For instance, if we look at the unique counts of some of the fields, we can see that we are storing a lot of duplicate information in our MongoDB collection.
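As a quick illustration, here is a sketch using mongolite's distinct method (these column names exist in the dataset after the period cleanup above):

# How many distinct values do a few of the text-heavy columns hold?
# Each value is stored in full in every one of the ~769k documents.
sapply(c("BOROUGH", "CONTRIBUTINGFACTORVEHICLE1", "VEHICLETYPECODE1"),
       function(field) length(m$distinct(field)))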
There are only a few listed factors that contribute to an accident, and they could easily be stored in a separate table.
library(knitr)

CFV <- as.data.frame(m$distinct("CONTRIBUTINGFACTORVEHICLE1"))
colnames(CFV) <- "Contributing Factor to Accident"
kable(CFV)

As we can see, there are only 48 contributing factors. But let's look at how many times each one is repeated, in its full-text form, in the MongoDB collection.
# Count how many documents carry each contributing factor
CFV <- m$aggregate('[{"$group":{"_id":"$CONTRIBUTINGFACTORVEHICLE1", "Instance": {"$sum":1}}}]')
colnames(CFV) <- c("Contributing Factor to Accident", "# of Instances")

If we simply created IDs for the contributing factors, we could reduce the amount of duplicate information in the collection.
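As a sketch of that idea (the factor_lookup table and FACTORID column are hypothetical names I am introducing here, not part of the original dataset):

# Build a lookup table mapping each distinct factor to an integer ID
factor_text <- m$distinct("CONTRIBUTINGFACTORVEHICLE1")
factor_lookup <- data.frame(factor_id = seq_along(factor_text),
                            factor_text = factor_text,
                            stringsAsFactors = FALSE)

# Each record then carries a small integer instead of the full text
traffic_accidents$FACTORID <- match(traffic_accidents$CONTRIBUTINGFACTORVEHICLE1,
                                    factor_lookup$factor_text)

The 48 full-text strings would then be stored only once in the lookup table, while each of the ~769k records carries just an integer.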
## Set up a temporary SQLite database in memory
library(RSQLite)
tmp <- dbConnect(SQLite(), ":memory:")
dbWriteTable(tmp, "traffic_accidents", traffic_accidents)
library(timeit)

# Run each query 50 times so we can compare the timing distributions
n <- 50
sql_query_time <- timeit(
  dbGetQuery(tmp,
             "SELECT *
              FROM traffic_accidents
              WHERE BOROUGH = 'MANHATTAN' AND NUMBEROFPERSONSINJURED > 3"),
  times = n)
mongo_query_time <- timeit(m$find('{"BOROUGH" : "MANHATTAN",
                                    "NUMBEROFPERSONSINJURED" : { "$gt" : 3} }'),
                           times = n)

Plotting the timing of each method:
library(reshape2)
library(plotly)

# Keep the .Call self-times, which capture the actual query execution
SQL <- sql_query_time$self.time[as.character(sql_query_time$func) == "\".Call\""]
MongoDB <- mongo_query_time$self.time[as.character(mongo_query_time$func) == "\".Call\""]

# Reshape into long format for plotting
df <- melt(cbind(as.data.frame(SQL), MongoDB))

xlabel <- list(title = "Method")
ylabel <- list(title = "Time (in milliseconds)")

plot_ly(df, y = ~value, color = ~variable, type = "box",
        boxpoints = "all", jitter = 0.3) %>%
  layout(title = "Duration of Query between MongoDB and RSQLite",
         xaxis = xlabel,
         yaxis = ylabel)

So the SQL database performs much better than the NoSQL database on this query. My assumption is that to see improvements from NoSQL we would need to add complexity to the SQL database, with reference tables and the joins needed to query across them. However, I found the NoSQL solution very easy to use, and if any new information were collected for traffic accidents it would be simple to pull it into our NoSQL database. With a relational database, by contrast, each change would take some thought about how to update the schema and would not be trivial.
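For illustration, here is a rough sketch of what that added complexity would look like on the SQL side, reusing the hypothetical factor_lookup table from earlier (the contributing_factors table is my own naming, and this assumes traffic_accidents was loaded into SQLite with the FACTORID column in place):

# Store the factor text once, in a reference table
dbWriteTable(tmp, "contributing_factors", factor_lookup)

# The equivalent query now needs a join to recover the factor text
dbGetQuery(tmp,
           "SELECT ta.*, cf.factor_text
            FROM traffic_accidents ta
            JOIN contributing_factors cf ON ta.FACTORID = cf.factor_id
            WHERE ta.BOROUGH = 'MANHATTAN'
              AND ta.NUMBEROFPERSONSINJURED > 3")

The storage savings come at the cost of exactly this kind of join, which is the trade-off being weighed here.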
Both database implementations are very good at the problems they were designed to address. Ultimately, each method has to be weighed against the problem at hand, and there is no single easy answer.
We can now drop our MongoDB collection and disconnect from our temporary SQLite database.
m$drop()
dbDisconnect(tmp)