A Million Geocode Data Visualization With ggplot2

“The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey

Motivation

When it comes to data visualization, the more data we have to work with, the clearer the visual communication becomes. Plotting few hundred, even few thousand data points on spatial map renders a fuzzy visual(depending on the type of image). However, plotting a million data points on the same spatial map gives a better, crisper detailed visual, and can bring to light what may not be obvious with smaller amount of data.

The Objective for this blog is to attempt plotting a million spatial geocode data with ggplot2 (R grammer of graphics ploting device). I am also motivated to load test (stress test) the R graphics system on my home computer, and see if the resulting graphics can be visually useful and interpretable .

Getting Started

New York City Taxi and Limousine commission (NYCTL) graciously makes available a geocoded data for city customers pickup and drop-off data for Taxis and Limousines in the city. It had been sharing the data since 2009 on a monthly bases. Each month of dataset contains, roughly, a million plus pickup and drop off geocode locations. Which is just about the right size for the test I want to perform. There are several Taxi & limousine companies, for this blog we will be using only NY city yellow cap taxi geocode data for the month of January 2016.

Load the required R packages and the data into R

As always, we start by loading the packages needed to clean and map the spatial data. The following code chunk shows the packages used:

rm(list = ls())         #clean slate

# load packages
library("dplyr")         # data wrangler
library("ggplot2")       # graphic
library("ggmap")         # map
library("ggthemes")      # theme
library("knitr")         # A general-purpose tool for dynamic report generation in R

Load and Tidy the raw geocoded data

Next up is tidying (cleaning) the raw geocoded data. Because this is a large dataset (large enough for home computer), we will carve out features that will not be used for mapping the data, and engineer features that do not exist to make the data better suited for our plotting.

The first action here is to separating the date, month, hour, minute and year so we can use it to drill down and study traffic patterns by day/hour/min. Figure 2 & Figure 3, for example, use the newly generated factored variable to demonstrate busy hour count for 140 thousand yellow cab taxi rides for the month of January 2016. The following code wrangle the data and generate a clean dataset that we will use for plotting. A snip of the data after data wrangling is shown on Table 1.

Table 1

# download the data from NY taxi and limo commission
# ny_yellow_cab_jan2016 <- "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv" 

ny_yellow_cab_jan2016 <- read.csv("./data/ny_yellow_cab_jan2016.csv", stringsAsFactors = FALSE) 

set.seed(2022)
# Tidy the data using dplyr select and mutate functions
df_ride_total <- 
        ny_yellow_cab_jan2016 %>% 
           select(tpep_pickup_datetime, 
                  pickup_longitude, 
                  pickup_latitude, 
                  dropoff_longitude, 
                  dropoff_latitude) %>% 
                      mutate (year  = as.numeric(substr(tpep_pickup_datetime, 1,  4)),
                              month = as.numeric(substr(tpep_pickup_datetime, 6,  7)),
                              date  = as.numeric(substr(tpep_pickup_datetime, 9, 10)),
                              hour  = as.numeric(substr(tpep_pickup_datetime, 12,13)),
                              min   = as.numeric(substr(tpep_pickup_datetime, 15,16))
                             )
                      
# arrange and select relevant variable                    
df_ride_total <- df_ride_total[,c(6:10,2:5)]

# remove the orig data to free up space
rm(ny_yellow_cab_jan2016) 

knitr::kable(head(df_ride_total, 10), digits = 3, caption = "Table1: Here is sample of the data after its wrangled")

Table1: Here is sample of the data after its wrangled
year	month	date	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude
2016	1	1	-73.990	40.735	-73.982	40.732
2016	1	1	-73.981	40.730	-73.944	40.717
2016	1	1	-73.985	40.680	-73.950	40.789
2016	1	1	-73.993	40.719	-73.962	40.657
2016	1	1	-73.961	40.781	-73.977	40.759
2016	1	1	-73.980	40.743	-73.913	40.763
2016	1	1	-73.994	40.720	-73.966	40.790
2016	1	1	-73.979	40.745	-73.992	40.754
2016	1	1	-73.947	40.791	-73.921	40.866
2016	1	1	-73.998	40.724	-73.996	40.688

Stress test - Plotting the Million geocode data to a Spatial NY city Map

The data is now ready for the stress test! But, before plotting the million geocodes, each with its own longitude and latitude, we use the qmap function (a wrapper for ggmap function) that comes with ggplot2, to use a tile of New York City map from OpenStreetMap.

“OpenStreetMap is built by community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.” And is freely available.

Because Spatial map is static, and we will be using relatively small foot print of NY city, it took tuning of a number of features so as not to overwhelm and obscure the spatial map of the city with over plotting. The consideration included what level of zoom to use for the city map, size & color of each geocode data point. Also consideration for level of shade and opacity for the map.

The objective is to have the streets visible with as many of the one million data points on the map captured. The following code chunk mungs the tidy data and plots a million geocode locations spatial map of NY city tile provided by operstreetmap.

Figure 1

#Get the patial map from openstreetmap
pp_qmap_z13_test_normal <- qmap("new york, ny", 
                                zoom    = 13, 
                                darken  = .3, 
                                exten   = "normal", 
                                maptype = "toner-lite", 
                                source  = "stamen") 

#Plot the 1 million geocode data on the map
pp1_fig1 <- pp_qmap_z13_test_normal +  
                 geom_point(aes(x      = dropoff_longitude, 
                                y      = dropoff_latitude), 
                                data   = head(df_ride_total, 1000000),  #1 million data
                                colour = "darkolivegreen1", 
                                alpha  = .4, 
                                size   = .000001) + theme_gdocs()
pp1_fig1

Fig1:439,000 geocoded data Visualization

What we see in the figure is 439,000 data points! What happen to the rest of the data points? qmap dropped them because they fall outside of visible longitude and latitude spatial NY city map shown here. Interestingly, the warning ggplot2 returns for not plotting the entire geocoded points is “missing data”, which is a bit misleading (took me for a spin). But what it means is that the data isn’t plotted because it doesn’t fit the imported Open Street map at the current zoom level.

I had successfully plotted all 1 million data pionts at a higher zoom out level that captures most of NY city. However, the plots overwhelm the spatial map obscuring the street level view. That plot isn’t included here because it isn’t useful. Still, I think, it is impressive to view 439k plots in such a small foot print.

Visualizing Yellow cab traffic for different times of the day

Now that we know R graphic device can handle massive data plotting, the next thing I wanted to try is improve the existing data and to get a better understanding of taxi traffic flow in the City? For that I will add ngineering new feature that doesn’t exist in the original dataset. The new feature will bring to light traffic flow for the city based on time of the day. This new visual information can be used to plan street traffic engineering and planning; for street lights timing modification etc.

I will add a factor feature named TimeOfDay that will have four levels. That is we will split the 24 hour in a day into four sub groups (levels). And they are labeled Morning, Afternoon, Evening and Midnight, each has six hour window. Once we have the new imporived tidy data, we can plot the geocode data for each time subgroup utilizing the facet function to split the four levels and visualize traffic flow for all four times of the day.

Figure 2: NY City yellow cab taxi dropp off points by time of the day.

###Get the data for the time of the day plot
set.seed(2022)


df_fig2 <- df_ride_total %>%  
        mutate(TimeOfDay = as.factor(ifelse(hour >= 0  & hour < 6 ,  "Midnight - 00-06AM",  #adding a new categorical data
                                     ifelse(hour >= 6  & hour < 12 , "Morning  - 06-12AM", 
                                     ifelse(hour >= 12 & hour < 18 , "Afternoon- 12-18PM", 
                                     ifelse(hour >= 18 & hour <= 23 ,"Evening  - 18-23PM", 
                                                         hour
                                                          ))))
                                     ))

## plot the data
pp_qmap_z13_test_normal <- qmap("new york, ny", 
                                zoom = 13, 
                                darken = .3, 
                                exten = "normal", 
                                maptype = "toner-lite", 
                                source = "stamen") 

pp1_fig2 <- pp_qmap_z13_test_normal +  
                  geom_point(aes(x = dropoff_longitude, 
                                 y = dropoff_latitude,
                                 colour = TimeOfDay), 
                                 data = head(df_fig2, 325000),
                                 #colour = "darkolivegreen1", 
                                 alpha = .4, 
                                 size = .00005
                                  ) + facet_wrap( ~ TimeOfDay, nrow = 2) + theme_gdocs() 

pp1_fig2

$\label{fig:fig2}Fig2:Time of day Visualizing NY Taxi traffic for Jan 2016$

Fig2:Time of day Visualizing NY Taxi traffic for Jan 2016

Bar plot of factored data

This plot is a follow up to the above map. We use the dataset used to generate faceted plot on figure 2, but counts the traffic for each of the four parts of the day. Bar plot is just right for comparing the counts side by side. The plot is for about 140k geo special points, a smaller subset of the full data.

Figure 3

#summarize the data for bar plot
df_fig3 <- head(df_fig2, 400000) %>% select(TimeOfDay) %>% 
                                       group_by(TimeOfDay) %>% 
                                               summarise(cnt = n())


options(scipen=5)
pp3_fig3 <- ggplot(data = df_fig3, aes(x = TimeOfDay, y = cnt, label = cnt), size=14) + 
            geom_bar(aes(fill = TimeOfDay), alpha = .6, stat="identity", position = position_dodge(width = 0.90)) +
            geom_text(position = position_dodge(.9),  vjust = 2, colour="white", fontface = "bold", size = 5 ) +
            xlab("") +
            theme_minimal() +
            theme(axis.text.x = element_text(angle =45, hjust = 1, size = 10)) 

pp3_fig3

$\label{fig:fig3}Fig3: Bar plot to quantify traffic count by time of the day.$

Fig3: Bar plot to quantify traffic count by time of the day.

Take Away

Spatial data visualization is very useful tool to plot massive amount of data in a small foot print. As seen here, ggplot2/ggmap is capable of processing over a million data points even on home personal computer with limited memory and CPU resources. Vsualizing the taxi traffic data can be used to identify busy hours of traffic in the city and make necessary plans for maintenance schedule, traffic engineering and more.

Lessons learned

When mapping large set of data in a small foot print, paying attention to features such as opacity, color code and size selection for each plot point, together with zoom level of the static map produce a visual spatial map that is readable and Interpretable.

On my next blog, I will use leaflet a javascript html widget that adds interactivity to R based Maps plot the same One million geocode data from NY taxi commission yellow taxi cab, and contrast it against the static Spatial map from this blog. We will also explore overlaying the geojson data using the shape file provided by the NY Taxi And Limousine commission.

Contact

If you need consultation on this kind of work, feel free to contact abiyu.giday@gmail.com.

References

Chang, Winston. 2012. R Graphics Cookbook. O’Reilly Media.
Peng, Roger. 2014. R Programming for Data Science. Leanpub.
Wickham, Hadley. 2010. “Stringr: Elegant Graphics for Data Analysis.”
Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation.
Wickham, Hadley, 2016b. Ggplot2: An Implementation of the Grammar of Graphics.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.

This a fully reproducible markdown document generated using RStudio IDE.