Today we’re going to start looking at taxi cab rides that took place in NYC from the start of 2009 through the first half of 2015. While the full data set has 55 million rides, we’ll start by looking at a 50,000-ride sample.

sample_taxi_train <- read.csv("/home/rstudioshared/shared_files/data/sample_taxi_train.csv")
library(dplyr);library(ggplot2)

summary(sample_taxi_train)
head(sample_taxi_train)

The summary gives us a quick look at the data. One of the first things we might notice is that (for whatever reason) there are some entirely implausible longitudes and latitudes. We can blank these out (replacing them with NA, so the rest of each row is kept) as follows:

sample_taxi_train <- sample_taxi_train %>% 
  mutate(pickup_latitude = ifelse(between(pickup_latitude,40.25, 41.25),pickup_latitude,NA),
         dropoff_latitude= ifelse(between(dropoff_latitude,40.25, 41.25),dropoff_latitude,NA),
         pickup_longitude= ifelse(between(pickup_longitude,-74.5,-73.5),pickup_longitude,NA),
         dropoff_longitude= ifelse(between(dropoff_longitude, -74.5,-73.5),dropoff_longitude,NA)
  ) 
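
To see what this masking step does, here is a minimal sketch on a toy data frame. The three rows are made up for illustration (they are not taken from the taxi data), and only the pickup columns are shown:

```r
library(dplyr)

# Hypothetical toy rows: the second has a zero latitude,
# the third has a wildly wrong longitude.
toy <- data.frame(
  pickup_latitude  = c(40.75, 0, 40.64),
  pickup_longitude = c(-73.99, -73.98, -740.0)
)

# Same pattern as above: keep values inside the bounding box, otherwise NA.
toy_clean <- toy %>%
  mutate(
    pickup_latitude  = ifelse(between(pickup_latitude, 40.25, 41.25), pickup_latitude, NA),
    pickup_longitude = ifelse(between(pickup_longitude, -74.5, -73.5), pickup_longitude, NA)
  )

# Count how many values were blanked out in each column.
na_counts <- colSums(is.na(toy_clean))
print(na_counts)
```

Running the same colSums(is.na(...)) check on sample_taxi_train is a quick way to see how many coordinates fell outside the box.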

Let’s take a look at the numbers of passengers on rides. Is there anything surprising in the bar chart below?

sample_taxi_train %>% ggplot(aes(passenger_count)) + geom_bar()

We can use the longitudes and latitudes to make maps of pickups and drop-offs. We’ll need to make the points quite small because there are so many of them. See if you can recognize parts of the city in these maps:

sample_taxi_train %>% ggplot(aes(pickup_longitude, pickup_latitude)) +geom_point(size=0.05)

sample_taxi_train %>% ggplot(aes(dropoff_longitude, dropoff_latitude)) +geom_point(size=0.05)

Let’s take a look at the most common pickup and drop-off locations. To do this, I’ll round longitudes and latitudes to three decimal places. To look up a location, replace the coordinates in this Google Map with those of the most common pickup and drop-off locations.

sample_taxi_train %>% mutate(rnd_pickup_longitude = round(pickup_longitude,3), 
                             rnd_pickup_latitude = round(pickup_latitude,3)) %>%
  group_by(rnd_pickup_longitude, rnd_pickup_latitude) %>% 
  summarize(n=n()) %>% ungroup() %>%
  top_n(10, n) %>% View()

sample_taxi_train %>% mutate(rnd_dropoff_longitude = round(dropoff_longitude,3), 
                             rnd_dropoff_latitude = round(dropoff_latitude,3)) %>%
  group_by(rnd_dropoff_longitude, rnd_dropoff_latitude) %>% 
  summarize(n=n()) %>% ungroup() %>%
  top_n(10, n) %>% View()
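
How big is a cell when we round to three decimal places? The back-of-the-envelope sketch below treats the Earth as a sphere of radius 3959 miles and uses 40.75 degrees as an assumed, roughly mid-Manhattan latitude. The variable names are just for illustration:

```r
r_miles  <- 3959             # radius of the Earth in miles
step_rad <- 0.001 * pi / 180 # 0.001 degrees, in radians

# North-south size of a 0.001-degree step, in feet.
ns_feet <- r_miles * step_rad * 5280

# East-west steps shrink by cos(latitude) away from the equator.
ew_feet <- ns_feet * cos(40.75 * pi / 180)

print(round(c(north_south = ns_feet, east_west = ew_feet)))
```

Under these assumptions each rounded location covers roughly 365 feet north to south and about 275 feet east to west, so a "most common location" is closer to a city block than to a single address.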

Pickup Times

Now, let’s start making use of the “pickup_datetime” field. We can extract years, months, days and hours using the lubridate package.

library(lubridate)
sample_taxi_train <- sample_taxi_train %>% mutate(pickup_datetime=ymd_hms(pickup_datetime),
                                                  year = as.factor(year(pickup_datetime)),
                                                  month = as.factor(month(pickup_datetime)),
                                                  day = as.numeric(day(pickup_datetime)),
                                                  dayOfWeek = as.factor(wday(pickup_datetime)),
                                                  hour = as.numeric(hour(pickup_datetime))
) 
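
Before reading the graphs, it helps to see what these extractors return for a single timestamp. The sketch below uses a made-up pickup time; note that with lubridate's default settings wday() numbers the week starting from Sunday = 1:

```r
library(lubridate)

# Hypothetical timestamp, written in the same style as pickup_datetime.
ts <- ymd_hms("2015-06-30 17:45:10 UTC")

parts <- list(
  year  = year(ts),
  month = month(ts),
  day   = day(ts),
  hour  = hour(ts),
  # Numeric day of week; by default 1 = Sunday, so a Tuesday is 3.
  wday  = wday(ts),
  # Labelled version (the text shown depends on your locale, e.g. "Tue").
  wday_label = as.character(wday(ts, label = TRUE))
)
str(parts)
```

So in the faceted graphs that follow, panel 1 is Sunday and panel 7 is Saturday, assuming the default week start has not been changed.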

Take a look at when rides are most frequent. Take a few moments to make sense of these graphs:

sample_taxi_train %>% 
  group_by(dayOfWeek, hour) %>% 
  summarize(n= n()) %>% 
  ggplot(aes(hour, n)) + geom_line() + facet_wrap(~dayOfWeek) +
  ggtitle("# of Pickups by Day and Hour")

We could also look at when rides are the most expensive:

sample_taxi_train %>% 
  group_by(dayOfWeek, hour) %>% summarize(mean_fare=mean(fare_amount,na.rm=TRUE)) %>% 
  ggplot(aes(hour, mean_fare)) + geom_line() + facet_wrap(~dayOfWeek)+
  ggtitle("Mean Fare by Day and Hour")

Distances

Within the limited bounds of New York City the Earth is nearly flat, and we could probably get away with calculating distances (as the crow flies) between pickups and drop-offs using the Pythagorean Theorem… but let’s do this right and use the geosphere package (and the distHaversine function) to calculate distances as the shortest arcs on the surface of a sphere. By virtue of using r=3959 (the radius of the Earth in miles), this function will return the distance between points in miles.

library(geosphere)
sample_taxi_train <- sample_taxi_train %>% 
  # distHaversine expects points as (longitude, latitude), in that order
  mutate(CrowD = distHaversine(cbind(pickup_longitude, pickup_latitude),
                               cbind(dropoff_longitude, dropoff_latitude), r=3959))
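
How much would the Pythagorean shortcut actually cost us? The sketch below is a hand-rolled comparison in base R, not the geosphere implementation: haversine_miles() and flat_miles() are illustrative helpers, and the two points are approximate coordinates for Times Square and JFK Airport, chosen only as an example of a long ride.

```r
# Great-circle distance via the haversine formula, in miles.
# Inputs are in degrees; r = 3959 is the Earth's radius in miles.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 3959) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Flat-Earth approximation: treat the changes in longitude and latitude
# as the legs of a right triangle, shrinking the east-west leg by
# cos(latitude).
flat_miles <- function(lon1, lat1, lon2, lat2, r = 3959) {
  to_rad <- pi / 180
  mean_lat <- (lat1 + lat2) / 2 * to_rad
  x <- (lon2 - lon1) * to_rad * cos(mean_lat)
  y <- (lat2 - lat1) * to_rad
  r * sqrt(x^2 + y^2)
}

# Approximate coordinates (assumed for illustration).
d_sphere <- haversine_miles(-73.9855, 40.7580, -73.7781, 40.6413)
d_flat   <- flat_miles(-73.9855, 40.7580, -73.7781, 40.6413)
print(c(sphere = d_sphere, flat = d_flat, difference = abs(d_sphere - d_flat)))
```

Over a distance like this the two answers should agree to well within a hundredth of a mile, which backs up the claim that the flat approximation would be fine at city scale.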

Let’s take a look at the frequency of rides of different lengths. Again, these are distances as the crow flies. The cab no doubt registered more mileage (at least in almost all cases).

sample_taxi_train %>% ggplot(aes(CrowD))+geom_histogram()

Let’s look at when the longest rides take place (distance-wise):

sample_taxi_train %>% 
  group_by(dayOfWeek, hour) %>% summarize(mean_CrowD=mean(CrowD,na.rm=TRUE)) %>% 
  ggplot(aes(hour, mean_CrowD)) + geom_line() + facet_wrap(~dayOfWeek)+
  ggtitle("Mean Crow Distance by Day and Hour")

Finally, I’ll leave you with a puzzle. The plot below shows fare against crow distance for rides of under 5 miles. Why are rides that are shorter than a quarter of a mile so expensive? Are these simply errors?

sample_taxi_train %>% filter(CrowD < 5) %>%
  ggplot(aes(CrowD, fare_amount)) + geom_smooth() +ggtitle("Fare v. Crow Distance")