Homework 5: Fun Times with ggplot2

For this assignment I decided to go out and find some new data, as I was getting a bit tired of working with GapMinder. I chose NYC Open Data because I grew up in NY and thought it would be fun to explore. The site hosts over 1100 data sets, but luckily it has good search and sort features. I ended up settling on a dataset on response times to emergencies by the NYC Fire Department located here.

The data has 5 variables: two are categorical (the borough or location, and the type of incident) and three are numerical (a count for each type of incident, an average response time, and a date). After playing around a bit I managed to get the data loaded with the code below. I should note that I first had to go into the actual file and clean it up: it wasn't quite ready to be loaded in the form in which I downloaded it.

dat <- read.delim("fdny.txt", header = TRUE, sep = "\t")
library(ggplot2)

summary(dat)

##       date                                  type             location 
##  Min.   :200907   All Fire/Emergency Incidents:72   Bronx        :84  
##  1st Qu.:200910   False Alarm                 :72   Brooklyn     :84  
##  Median :200956   Medical Emergencies         :72   Citywide     :84  
##  Mean   :200956   Medical False Alarm         :72   Manhattan    :84  
##  3rd Qu.:201003   Non Medical Emergencies     :72   Queens       :84  
##  Max.   :201006   Non Structural Fires        :72   Staten Island:84  
##                   Structural Fires            :72                     
##      count            time    
##  Min.   :    7   4:20   : 16  
##  1st Qu.:  329   4:30   : 14  
##  Median :  783   4:19   : 13  
##  Mean   : 3884   4:16   : 12  
##  3rd Qu.: 4194   4:17   : 11  
##  Max.   :44564   4:32   : 11  
##                  (Other):427

Running the summary() function shows us several important pieces of information:

There are six different locations, corresponding to the 5 boroughs and a 6th “Citywide” location
There are 7 types of incidents, including one aggregated level (as with location)
Date is stored in an annoying format. It is coded as YYYYMM, with no separator
The response time is really problematic. It is being read in as a factor, and so is pretty useless in this form.
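
One way to deal with the YYYYMM date format noted above is to append a day of the month and parse the result as a real date. This is only a minimal sketch: the `yyyymm` vector is a hypothetical stand-in for `dat$date`, and treating each month as its first day is an assumption made purely so that `as.Date()` has a complete date to work with.

```r
# Hypothetical sample of YYYYMM codes like those in the summary above
yyyymm <- c(200907, 200912, 201001, 201006)

# Append "01" (first of the month) so the string is a complete date
month_start <- as.Date(sprintf("%06d01", as.integer(yyyymm)), format = "%Y%m%d")

month_start
```

With the real data this would look something like `dat$month <- as.Date(sprintf("%06d01", as.integer(dat$date)), format = "%Y%m%d")`, where `month` is a made-up column name.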

After doing some searching on working with times in R, I taught myself the following line of code, which converts the time variable into a format that R can understand:

dat$time <- strptime(dat$time, format = "%M:%S")
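
Note that `strptime()` returns a date-time, so each response time gets the current date attached to it. An alternative that may be easier to work with is to store the response time as a plain number of seconds. The sketch below assumes the values always look like "M:SS"; `time_chr` is a hypothetical stand-in for `as.character(dat$time)`.

```r
# Hypothetical response times in the "M:SS" form shown in the summary
time_chr <- as.character(factor(c("4:20", "4:30", "5:07")))

# Split on the colon, then combine minutes and seconds into total seconds
parts <- strsplit(time_chr, ":", fixed = TRUE)
secs <- vapply(parts,
               function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
               numeric(1))

secs  # 260 270 307
```

A numeric column like this can be averaged, plotted, and modelled without any of the date-time machinery.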

Now that that is done I can start making plots! My first plot will be a stacked bar chart showing the counts of different types of incidents, for each borough, in June of 2010. Before plotting this I create a dataframe that does not have the “Citywide” or “All Fire/Emergency Incidents” levels. The latter in particular would be confusing, as it would result in visual double counting in the figure.

dat2 <- subset(dat, type != "All Fire/Emergency Incidents" & location != "Citywide")

ggplot(data = dat2[dat2$date == 201006, ], aes(x = location, y = count, fill = type)) + 
    geom_bar(stat = "identity") + xlab("Borough") + ylab("Count") + 
    ggtitle("Recorded Citywide FDNY Calls in June 2010") + scale_fill_hue(name = "Type of Incident")

[Figure: stacked bar chart, "Recorded Citywide FDNY Calls in June 2010", counts by borough and type of incident]

From the plot several interesting pieces of information pop out. One is that Brooklyn has the most incidents, and Staten Island the least. This makes sense based upon what I know about relative population levels, although I did expect Manhattan to beat out Brooklyn.

We also see that despite its name, the majority of the incidents that the FDNY deals with are not fires at all, but medical and non-medical emergencies. In fact, fires make up only a very small proportion.

Furthermore, compared to true medical emergencies there are very few false medical alarms. However, there seem to be a lot of false fire alarms, at least when compared to the number of fires. Other than that it is worth noting the total count numbers: The FDNY is busy!
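
The proportions described above can also be checked numerically rather than read off the bars. The sketch below uses made-up counts for two boroughs (they are not the real FDNY figures) to show the idea: cross-tabulate the counts, then turn each borough's row into shares.

```r
# Hypothetical counts for two boroughs; NOT the real FDNY figures
toy <- data.frame(
  location = rep(c("Brooklyn", "Queens"), each = 3),
  type     = rep(c("Medical Emergencies", "False Alarm", "Structural Fires"),
                 times = 2),
  count    = c(600, 300, 100, 250, 200, 50)
)

# Cross-tabulate counts, then convert each borough's row to proportions
tab   <- xtabs(count ~ location + type, data = toy)
share <- prop.table(tab, margin = 1)

round(share, 2)
```

With the real data, something like `xtabs(count ~ location + type, data = droplevels(dat2[dat2$date == 201006, ]))` should work in place of the toy table; `droplevels()` is there so that the excluded "Citywide" level does not produce an empty row.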

For my next plot I wanted to do a scatter-plot (a line chart in this case) of response times for different types of incidents over time, to see if there was any temporal trend. I will do this analysis on a citywide basis and will use only the data from the year 2009, which in this dataset means July through December.

dat3 <- subset(dat, location == "Citywide")
ggplot(data = dat3[dat3$date <= 200912, ], aes(x = date, y = time, colour = type)) + 
    geom_line(size = 1) + xlab("Month") + ylab("Response Time") + ggtitle("Response Times by Type of Call During 2009") + 
    scale_colour_hue(name = "Type of Call")

[Figure: line chart, "Response Times by Type of Call During 2009", response time by month and type of call]

Before commenting on the results I want to explain what is happening with the y-axis, as it is a little hard to read. Essentially ggplot2 is having trouble with the time nature of the data and has dropped the minutes, displaying only the seconds. For instance, the “00” marker should be “3:00”. I played around quite a bit to correct this and in the end got only errors, but I suspect I need to add something similar to: + scale_y_datetime(limits=c(as.POSIXct('3:00'), as.POSIXct('7:00')) ,format = "%M:%S")

to my plot code, and something like: dat$time = as.POSIXct(dat[,4], format="%H:%M:%S") after I read in my data.

Alas, I could not figure it out, and in the end it is purely aesthetic.
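
One possible workaround, which I have not run against the real data, is to skip date-times on the y-axis altogether: plot the response time as a number of seconds and have the axis labels formatted as minutes and seconds. In the sketch below the data frame `toy`, its values, and the helper `fmt_mmss()` are all hypothetical; only `scale_y_continuous(labels = ...)` is standard ggplot2.

```r
library(ggplot2)

# Format a number of seconds as "M:SS" for use as axis labels
fmt_mmss <- function(s) {
  s <- round(s)
  sprintf("%d:%02d", as.integer(s %/% 60), as.integer(s %% 60))
}

# Hypothetical data: response time in seconds by month index
toy <- data.frame(
  month = rep(1:6, times = 2),
  type  = rep(c("Structural Fires", "Non Medical Emergencies"), each = 6),
  secs  = c(245, 250, 248, 252, 255, 251, 330, 335, 340, 338, 345, 350)
)

# The y-axis stays numeric; only the tick labels are reformatted
p <- ggplot(toy, aes(x = month, y = secs, colour = type)) +
  geom_line() +
  scale_y_continuous(labels = fmt_mmss) +
  xlab("Month") + ylab("Response Time (min:sec)")
p
```

Because the axis is an ordinary continuous scale, limits can be given in seconds as well, for example `limits = c(180, 420)` for 3:00 to 7:00.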

Turning back to the plot, we can see that non-medical emergencies have the longest response times and that structural fires have the shortest. This is probably a result of how the department prioritizes different types of calls, and it matches what you would expect. Finally, we also note that there may be a slight increase in response times over time, but it is hard to tell visually.
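
Since the trend is hard to judge by eye, one rough check would be to fit a straight line to response time against month and look at the slope. The numbers below are invented for illustration, and the sketch assumes the response time has already been converted to seconds; with only six months per type of call, any such fit should be read very cautiously.

```r
# Hypothetical citywide response times (seconds) for six consecutive months
toy <- data.frame(
  month_index = 1:6,
  secs        = c(260, 262, 265, 263, 268, 270)
)

# Simple linear regression of response time on month
fit <- lm(secs ~ month_index, data = toy)

coef(fit)["month_index"]   # estimated change in seconds per month
summary(fit)$coefficients  # standard error and p-value for the slope
```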