While the data is rather dated, this project was a great exercise in working with geospatial data as well as dates and time. The workflow is pretty straightforward with a lot data visualization. Certainly a great way to practice working with lubridate() and ggplot2() packages.
library(ggplot2)
library(ggthemes)
library(lubridate)
library(dplyr)
library(tidyr)
library(DT)
library(scales)
#Going ahead and setting some pretty colors
colors = c("#CC1011", "#665555", "#05a399", "#cfcaca", "#f5e840", "#0683c9", "#e075b0")
After downloading zip files from internet each month’s data was moved to the working directory on my computer. Transferring execution from R Studio console to R Markdown presented some challenges in this initial stage. This was due to the fact the default working directories for the R console and R source editor were not congruent. As such setwd() function was used to point source editor in right direction to read the csv files and use rbind() to consolidate data. It should be noted the resulting data_2014 data.frame includes 4.5 million observations so set run time expectations accordingly.
setwd("/Users/worthsmacbookair")
apr_data <- read.csv("uber-raw-data-apr14.csv")
may_data <- read.csv("uber-raw-data-may14.csv")
jun_data <- read.csv("uber-raw-data-jun14.csv")
jul_data <- read.csv("uber-raw-data-jul14.csv")
aug_data <- read.csv("uber-raw-data-aug14.csv")
sep_data <- read.csv("uber-raw-data-sep14.csv")
data_2014 <- rbind(apr_data,may_data, jun_data, jul_data, aug_data, sep_data)
head(data_2014)
## Date.Time Lat Lon Base
## 1 4/1/2014 0:11:00 40.7690 -73.9549 B02512
## 2 4/1/2014 0:17:00 40.7267 -74.0345 B02512
## 3 4/1/2014 0:21:00 40.7316 -73.9873 B02512
## 4 4/1/2014 0:28:00 40.7588 -73.9776 B02512
## 5 4/1/2014 0:33:00 40.7594 -73.9722 B02512
## 6 4/1/2014 0:33:00 40.7383 -74.0403 B02512
This was a great project for date and time data. Here data_2014$Date.Time is set to as.POSIXct which is rather common. From there Time is broken out into its own column for later use. Then Date.Time is reformatted to year-month-date-hour-minute-second from month-date-year-hour-minute-second while Time remains unchanged in format. Examining the head() call after each steps reveals the differences with each transformation.
data_2014$Date.Time <- as.POSIXct(data_2014$Date.Time, format = "%m/%d/%Y %H:%M:%S")
head(data_2014)
## Date.Time Lat Lon Base
## 1 2014-04-01 00:11:00 40.7690 -73.9549 B02512
## 2 2014-04-01 00:17:00 40.7267 -74.0345 B02512
## 3 2014-04-01 00:21:00 40.7316 -73.9873 B02512
## 4 2014-04-01 00:28:00 40.7588 -73.9776 B02512
## 5 2014-04-01 00:33:00 40.7594 -73.9722 B02512
## 6 2014-04-01 00:33:00 40.7383 -74.0403 B02512
data_2014$Time <- format(as.POSIXct(data_2014$Date.Time, format = "%m/%d/%Y %H:%M:%S"), format="%H:%M:%S")
head(data_2014)
## Date.Time Lat Lon Base Time
## 1 2014-04-01 00:11:00 40.7690 -73.9549 B02512 00:11:00
## 2 2014-04-01 00:17:00 40.7267 -74.0345 B02512 00:17:00
## 3 2014-04-01 00:21:00 40.7316 -73.9873 B02512 00:21:00
## 4 2014-04-01 00:28:00 40.7588 -73.9776 B02512 00:28:00
## 5 2014-04-01 00:33:00 40.7594 -73.9722 B02512 00:33:00
## 6 2014-04-01 00:33:00 40.7383 -74.0403 B02512 00:33:00
data_2014$Date.Time <- ymd_hms(data_2014$Date.Time)
head(data_2014)
## Date.Time Lat Lon Base Time
## 1 2014-04-01 00:11:00 40.7690 -73.9549 B02512 00:11:00
## 2 2014-04-01 00:17:00 40.7267 -74.0345 B02512 00:17:00
## 3 2014-04-01 00:21:00 40.7316 -73.9873 B02512 00:21:00
## 4 2014-04-01 00:28:00 40.7588 -73.9776 B02512 00:28:00
## 5 2014-04-01 00:33:00 40.7594 -73.9722 B02512 00:33:00
## 6 2014-04-01 00:33:00 40.7383 -74.0403 B02512 00:33:00
As much of the analysis will be comparing patterns by different components of date and time, it works well to categorize into itemized components via factor().
data_2014$day <- factor(day(data_2014$Date.Time))
data_2014$month <- factor(month(data_2014$Date.Time, label = TRUE))
data_2014$year <- factor(year(data_2014$Date.Time))
data_2014$dayofweek <- factor(wday(data_2014$Date.Time, label = TRUE))
data_2014$hour <- factor(hour(hms(data_2014$Time)))
data_2014$minute <- factor(minute(hms(data_2014$Time)))
data_2014$second <- factor(second(hms(data_2014$Time)))
##Trips by Hour
The first analysis breaks down Uber trips by the hour of the day. Here it seems quite evident usage peaks upon the end of the workday and into dinner time with a lesser peak around the hours associated with the beginning of the workday.
hour_data <- data_2014 %>%
group_by(hour) %>%
dplyr::summarize(Total = n())
datatable(hour_data)
ggplot(hour_data, aes(hour, Total)) +
geom_bar( stat = "identity", fill = "steelblue") +
ggtitle("Trips Every Hour") +
theme(legend.position = "none") +
scale_y_continuous(labels = comma)
##Trips by Hour and Month
The previous trend of usage by hour is further broken down by month. The overall trend of peaking around 17:00 with a secondary peak around 7:00 or 8:00 appears to hold across months indicating minimal seasonality.
month_hour <- data_2014 %>%
group_by(month, hour) %>%
dplyr::summarize(Total = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
ggplot(month_hour, aes(hour, Total, fill = month)) +
geom_bar( stat = "identity") +
ggtitle("Trips by Hour and Month") +
scale_y_continuous(labels = comma)
Here trips are analyzed by which day in the month they occured. This could have potentially provided insight into whether users traveled more earlier or later in the month. The results yield little insight here. Clearly the 31st would be lower as not all months include a 31st day. Furthermore use is fairly consistent through out the month.
day_group <- data_2014 %>%
group_by(day) %>%
dplyr::summarize(Total = n())
datatable(day_group)
ggplot(day_group, aes(day, Total)) +
geom_bar( stat = "identity", fill = "steelblue") +
ggtitle("Trips Every Day") +
theme(legend.position = "none") +
scale_y_continuous(labels = comma)
Previous analysis is furthered by breaking it down into monthly contributions. While there is not an obvious pattern this visualization piqued my curiosity. The previous graphic confirmed there was not a large variance in usage across different parts of the month. However, we see more relative variance in the monthly chunks across different days. While which day of the week the day fell on is a likely partial explanation, comparing this data weather conditions could yield some further insight too.
day_month_group <- data_2014 %>%
group_by(month, day) %>%
dplyr::summarize(Total = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
ggplot(day_month_group, aes(day, Total, fill = month)) +
geom_bar( stat = "identity") +
ggtitle("Trips by Day and Month") +
scale_y_continuous(labels = comma) +
scale_fill_manual(values = colors)
Analyzing trip data by day of week did reveal some interesting trends. Sundays consistently underperformed other days of the week which comes as no surprise. However, trips on Mondays were surprisingly low. While the rest of the week outperformed Sunday and Monday it was a stark difference between days of the week across months as seen in the table below.
| Month | Top Day of Week | Bottom Day of Week |
|---|---|---|
| April | Wednesday | Sunday |
| May | Friday | Sunday |
| June | Thursday | Sunday |
| July | Thursday | Sunday |
| August | Friday | Monday |
| September | Tuesday | Sunday |
month_weekday <- data_2014 %>%
group_by(month, dayofweek) %>%
dplyr::summarize(Total = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
ggplot(month_weekday, aes(month, Total, fill = dayofweek)) +
geom_bar( stat = "identity", position = "dodge") +
ggtitle("Trips by Day of Week and Month") +
scale_y_continuous(labels = comma) +
scale_fill_manual(values = colors)
By leveraging geom_tile trip data can be visualized in a different way to confirm thoughts from earlier visual analysis. The first heat map by hour and day furthers the notion of peak usage around 17:00 with a secondary peak earlier in the day around 7:00. These peaks are seen lighter patches of blue stretching across the month. The second heat map is included for another reason. Clearly usage increased substantially from April to September as September is consistently lighter than April across all days. It would interesting to gather further data and see if this is a seasonal behavior or if this is representative of Uber’s growth trajectory in New York City during 2014.
ggplot(day_month_group, aes(day, month, fill = Total)) +
geom_tile(color = "white") +
ggtitle("Heat Map by Month and Day")
ggplot(month_weekday, aes(dayofweek, month, fill = Total)) +
geom_tile(color = "white") +
ggtitle("Heat Map by Month and Day of Week")
The most fascinating component of this data set were the GPS coordinates associated with the origination of each Uber trip. Geospatial data can be intimidating at first but framing it as simply plotting some points on a plane with an X and Y axis made it much easier. Minimum and maximum latitudes and longitudes are defined as to outline the New York City area which is the area of focus. The initial plot is yields little insight but it is amazing to watch the quantification of all these Uber trip coordinates paint a fairly accurate map of the area especially in the most densely populated areas.
The final map I felt began to offer some actual insight to the behavioral pattern of Uber users during this time. Uber trips in the heart of the residential area of Manhattan, Queens, and Brooklyn were dominated by usage associated with the work week. However, areas further from the commercial and residential center of the area see activity skewed more towards weekends. Given the range of months this data encompasses this makes sense as users look to escape the dense urban core on weekends through out the summer.
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004
ggplot(data_2014, aes(x=Lon, y=Lat)) +
geom_point(size=1, color = "blue") +
scale_x_continuous(limits=c(min_long, max_long)) +
scale_y_continuous(limits=c(min_lat, max_lat)) +
theme_map() +
ggtitle("NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP)")
## Warning: Removed 71701 rows containing missing values (geom_point).
ggplot(data_2014, aes(x=Lon, y=Lat, color = dayofweek)) +
geom_point(size=1) +
scale_x_continuous(limits=c(min_long, max_long)) +
scale_y_continuous(limits=c(min_lat, max_lat)) +
theme_map() +
ggtitle("NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP) by Day of Week")
## Warning: Removed 71701 rows containing missing values (geom_point).
Working with geospatial data is not nearly as difficult as I thought it would be. Certainly there are more complex applications but in general it makes a lot of since to anyone who played with their graphing calculator in school. Working with dates and time also proved to be less annoying than I feared diving into this project. Formatting and factoring time related components well and systematically is certainly a practice I would continue.
As for the actual data, while dated it was still a great source and as someone who loves Uber and New York City it was engaging. The patterns were pretty clear with usage being highly correlated to the workweek and workday. I can only imagine how much fun it would be working at Uber with access to this amount and type of data.