Chicago Divvy Bike Analysis

Divvy bikes are rental bikes provided by the city of Chicago. Bike distribution stations are found throughout the north side of the city with a few centers on the south side. Stations have anywhere from 11 to 35 bikes available for rental. Renters can also subscribe to the Divvy service for a full year. This data set covers the first full year of Divvy bike rentals. Rentals started in June 2013. This data set comprises 1,665,487 bike rentals.

Note – For the sake of distinction, those that I refer to as customers are renters who are non-subscribers to the Divvy service. No information is provided on birth year or gender for non-subscribers.

Exploratory Analysis

suppressMessages(library("ggplot2"))
suppressMessages(library("tidyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("scales"))
suppressMessages(library("gridExtra"))

Divvy2013 <- read.csv("Divvy_Trips_2013.csv")
Divvy2014 <- read.csv("Divvy_Trips_2014_Q1Q2.csv")

names(Divvy2013)[names(Divvy2013) == 'birthday'] <- 'birthyear'

Divvy2013$age <- 2013 - Divvy2013$birthyear
Divvy2014$age <- 2014 - Divvy2014$birthyear

Divvy <- rbind(Divvy2013, Divvy2014)

rm(Divvy2013)
rm(Divvy2014)

The first analysis I wanted to try was seasonal analysis of riders.

Divvy$starttime <- strptime(Divvy$starttime, format = "%Y-%m-%d %H:%M")
Divvy$date.month <- as.numeric(strftime(Divvy$starttime, format = "%m"))
# Map each month number to a season; rows with a missing month stay NA
Divvy$season[Divvy$date.month %in% c(9, 10, 11)] <- "Fall"
Divvy$season[Divvy$date.month %in% c(12, 1, 2)] <- "Winter"
Divvy$season[Divvy$date.month %in% c(3, 4, 5)] <- "Spring"
Divvy$season[Divvy$date.month %in% c(6, 7, 8)] <- "Summer"
Divvy$season <- factor(Divvy$season)
table(Divvy$season)

## 
##   Fall Summer Winter 
## 465735 249380  44673

Well, this is a strange result. It seems no one rents Divvy bikes during the spring. Those values also don’t add up to the total of the data set.

table(Divvy$date.month)

## 
##      6      7      8      9     10     11     12 
##   4005  74867 170508 201030 174695  90010  44673

No one rents Divvy bikes during the first six months of the year? Let’s go back to the starttime variable and see what is missing.

sum(is.na(Divvy$starttime))

## [1] 905699

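And there is the problem. A little over 900,000 rentals have no parsed start time, and the 759,788 rentals counted by month plus these 905,699 account for all 1,665,487 rows. This number matches the number of rentals found in the 2014 data set, so every 2014 start time came back as NA. Either Divvy stopped recording start times at the beginning of 2014 or, more likely, the 2014 file stores them in a layout that the "%Y-%m-%d %H:%M" pattern passed to strptime does not match. Either way, the time-based analysis below covers only the 2013 rentals.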
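If the missing values come from a format mismatch rather than from missing data, one way to recover them would be to try more than one timestamp layout. The following is a minimal sketch, not part of the original analysis. It assumes the 2014 file uses a month/day/year layout, which should be checked against the raw file first, and parse_starttime is a hypothetical helper name.

```r
# Toy start times: two in the layout that parsed correctly above, two in a
# month/day/year layout (an ASSUMPTION about the 2014 file).
raw <- c("2013-06-27 12:11", "2013-12-31 23:58",
         "1/1/2014 0:17", "6/30/2014 23:57")

# Hypothetical helper: try each format in turn and keep the first that parses.
parse_starttime <- function(x,
                            formats = c("%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M")) {
  # Start with an all-NA POSIXct vector of the right length
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = "UTC")
  for (fmt in formats) {
    todo <- is.na(out)  # only retry the values that have not parsed yet
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = "UTC"))
  }
  out
}

parsed <- parse_starttime(raw)
sum(is.na(parsed))                 # how many values still failed to parse
as.numeric(format(parsed, "%m"))   # month numbers, as used for date.month
```

Returning POSIXct rather than the POSIXlt that strptime produces also keeps the column as a plain vector inside the data frame.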

Age of Riders

Let’s see how old the Divvy subscribers are.

It looks like there are some subscribers over 75. Let’s look at that a little bit closer.

We’ve got some spry 100+ year olds in Chicago subscribing to Divvy. The ages could be genuine, but I think it is more likely that these values come from subscribers masking their age with a fake birth year. I will cut those over 75 from any age-based analysis from here on out.
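The age cutoff could be applied with a simple subset. This is a minimal sketch on toy records; the birth years are invented and riders.plausible is a hypothetical name.

```r
# Toy records; the birth years are made up for illustration.
riders <- data.frame(
  trip_id   = 1:6,
  birthyear = c(1985, 1990, 1900, 1936, NA, 1913)
)
riders$age <- 2013 - riders$birthyear

# Keep rows with a known age of 75 or less. Rows with no birth year
# (the non-subscribers) drop out as well.
age_cutoff <- 75
riders.plausible <- subset(riders, !is.na(age) & age <= age_cutoff)
riders.plausible$age
```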

I think I will take a look at bike rental duration next. Since Divvy records rental duration in seconds, I will change the duration to minutes. Minutes are easier to visualize numerically.

Duration of Rentals

Divvy$duration.mins <- Divvy$tripduration/60
summary(Divvy$duration.mins)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.55   13.12   19.15   21.70 1440.00

Wow! Someone kept their bike for a full 24 hours (1,440 minutes)? That couldn’t have been cheap. Using the current cost model for non-subscribers, that bike rental was $365. Ouch!

Divvy bikes are meant to be commuter bikes that go one way from station to station. The distance between the furthest north and furthest south Divvy station is a little over 17 miles. Google Maps estimates it will take about 90 minutes to ride that distance along the city bike trails. I will give an upper bound of 90 minutes for a traditional Divvy user. I would assume anyone who keeps a bike longer than 90 minutes is not riding continuously, which is outside the intent of Divvy bikes. Let’s see how much of my data this affects.

sum(Divvy$duration.mins > 90) / nrow(Divvy)

## [1] 0.01478787

This omission affects only 1.5% of the data, so I will focus rental duration graphs up to 90 minutes.
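A minimal sketch of how a duration histogram capped at 90 minutes, with a marker at 30 minutes, could be drawn with ggplot2. It is not the original plotting code: the durations are randomly generated toy values, and the bin width and labels are assumptions.

```r
library(ggplot2)

# Toy durations in minutes, generated only so the sketch runs on its own.
set.seed(1)
toy <- data.frame(duration.mins = rexp(500, rate = 1 / 19))

# Keep rentals of 90 minutes or less.
short <- subset(toy, duration.mins <= 90)

# One-minute bins, with a vertical green line at the 30-minute mark.
p <- ggplot(short, aes(x = duration.mins)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = 30, color = "green") +
  xlab("Rental duration (minutes)") +
  ylab("Number of rentals")

print(p)
```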

The green line represents the 30-minute mark. Divvy heavily encourages rentals to occur in 30-minute segments. A renter who plans on riding for longer than 30 minutes should stop at a station after 30 minutes and exchange bikes. I wonder what portion of the rentals are 30 minutes or less.

sum(Divvy$duration.mins <= 30) / nrow(Divvy)

## [1] 0.8895158

Almost 89% of the rentals fall within the 30 minute range Divvy suggests. Let’s look at when rentals occur.

Rental Time of Day

Divvy$time.hour <- as.numeric(strftime(Divvy$starttime, format = "%H"))

This histogram displays rentals by time of day. It seems there are two peaks that may fall in line with the morning and evening rush hour.

Source: http://www.wbez.org/series/curious-city/when-chicago-area-traffic-worst-111374

AADT represents the annual average daily traffic, collected from 18 sites throughout the region between 2010 and 2013 by the Illinois Department of Transportation. The peak rush hour times fall between 6 and 8 AM for the morning rush and 4 and 6 PM for the evening rush.

The histogram peaks fall right in line with the evening rush hour but miss the morning rush hour by an hour. I wonder if this is caused by the day of the week. Are Divvy rentals occurring at the same time during the weekends as they are during the weekdays?

# Day of the week as a factor ordered Sunday through Saturday
Divvy$day <- factor(weekdays(as.Date(Divvy$starttime)), 
                    levels = c("Sunday", "Monday", "Tuesday", 
                               "Wednesday", "Thursday", "Friday", 
                               "Saturday"))
DivvyTime <- separate(subset(Divvy, !is.na(starttime)), starttime, 
                      c("start.date", "start.time"), sep = " ")
DivvyTime <- subset(DivvyTime, select = c(start.date, time.hour, day))
DivvyTime$time.hour <- as.factor(DivvyTime$time.hour)
DivvyTime.count <- DivvyTime %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

The weekend rental peaks are definitely different than the weekday rentals. Let’s examine the variance of time of rental between the weekday and the weekend.

DivvyTime$day <- as.factor(DivvyTime$day)

DivvyTime.count.weekday <- subset(DivvyTime, day != 'Saturday' & 
                                        day != 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

DivvyTime.count.weekend <- subset(DivvyTime, day == 'Saturday' 
                                  | day == 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())
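One way to look at the spread of hourly rental counts for weekdays and weekends is a boxplot per hour, split by day type. The sketch below is not the original plotting code: it uses invented counts in place of DivvyTime.count.weekday and DivvyTime.count.weekend, and day.type is a hypothetical column name.

```r
library(ggplot2)

# Toy per-date, per-hour rental counts; the numbers are invented.
# 2013-07-01 and 2013-07-02 were a Monday and a Tuesday,
# 2013-07-06 and 2013-07-07 a Saturday and a Sunday.
toy.counts <- data.frame(
  start.date = rep(c("2013-07-01", "2013-07-02", "2013-07-06", "2013-07-07"),
                   each = 3),
  time.hour  = factor(rep(c(8, 13, 17), times = 4)),
  count      = c(120, 60, 150, 110, 70, 160, 30, 140, 90, 25, 150, 80)
)

# POSIXlt's wday runs from 0 (Sunday) to 6 (Saturday) regardless of locale.
wday <- as.POSIXlt(as.Date(toy.counts$start.date))$wday
toy.counts$day.type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")

# One box per hour, with weekdays and weekends in separate panels.
p <- ggplot(toy.counts, aes(x = time.hour, y = count)) +
  geom_boxplot() +
  facet_wrap(~ day.type) +
  xlab("Hour of day") +
  ylab("Rentals per hour per date")

print(p)
```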

Rental Duration by Day of the Week

Renters are renting bikes at different times between weekdays and weekends. How about the length of rental by day of the week?

It looks like there are more renters renting bikes for 15 minutes or longer on the weekends.
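The weekend observation can also be put into numbers by comparing the share of rentals lasting 15 minutes or longer. This is a minimal sketch on invented rentals, using base R only; the column names mirror the ones used in this analysis.

```r
# Toy rentals; days and durations are invented.
toy <- data.frame(
  day = c("Monday", "Tuesday", "Friday", "Saturday",
          "Saturday", "Sunday", "Sunday", "Wednesday"),
  duration.mins = c(8, 12, 16, 25, 40, 14, 33, 9)
)

# Flag weekend rentals, then take the mean of a logical vector per group,
# which gives the proportion of rentals at or above 15 minutes.
toy$weekend <- toy$day %in% c("Saturday", "Sunday")
share.long <- tapply(toy$duration.mins >= 15, toy$weekend, mean)
share.long
```

The result is indexed by "FALSE" (weekdays) and "TRUE" (weekends).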

Rental Duration and Rental Time by Customer Type

Customers also seem to have longer rental times than subscribers.

The histogram for customers looks more like the weekend usage, while the histogram of subscribers looks like weekday usage.

Rental Duration by Season

And just for the sake of comparison, how about a look at how season affects rental duration.

One thing seems pretty obvious: rental duration drops during the winter. It also looks like fall has more rentals than summer. This seems counter-intuitive, but remember, these seasonal data come from Divvy’s first six months of rentals. Since the service started in the summer, I’d expect the first few months to be slower than the following three months. The summer slope between 10 and 20 minutes does seem more gradual than the fall slope over the same range, which gave me the sense that rental durations were proportionately longer in the summer than in the fall.

Rentals by Station

We can see that arrival and departure locations are varied throughout the city. But we can’t make too much sense out of these two histograms otherwise.

The first thing I’ll need to do is create a data frame of just the stations to investigate departures and arrivals.

Divvy$starttime <- as.character(Divvy$starttime)

stations_dep <- Divvy %>%
                  group_by(from_station_id) %>%
                  summarize(departure = n()) %>%
                  rename(station_id = from_station_id) %>%
                  arrange(desc(departure))

stations_arr <- Divvy %>%
                  group_by(to_station_id) %>%
                  summarize(arrival = n()) %>%
                  rename(station_id = to_station_id) %>%
                  arrange(desc(arrival))

stations <- merge(stations_dep, stations_arr, by = "station_id")
stations$total <- stations$departure + stations$arrival
stations_tidy <- gather(stations, "depart_arrive", "riders", 2:3)

stations_tidy_sum <- transform(stations_tidy, 
                               station_id = reorder(station_id, total))
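Once a table shaped like stations exists, the busiest stations can be read off by sorting on the total. This is a minimal sketch on a toy table; the counts are invented and top3 is a hypothetical name.

```r
# Toy station table with the same columns as stations; counts are invented.
stations.toy <- data.frame(
  station_id = c(35, 76, 90, 91, 85),
  departure  = c(900, 700, 650, 600, 300),
  arrival    = c(950, 720, 600, 640, 310)
)
stations.toy$total <- stations.toy$departure + stations.toy$arrival

# Sort by total activity, busiest first, and keep the first three rows.
ranked <- stations.toy[order(stations.toy$total, decreasing = TRUE), ]
top3 <- head(ranked, 3)
top3$station_id
```

Sorting on departure or arrival instead of total gives the other two rankings.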

Top 20 stations by total activity

Top 10 stations by departure station

Top 10 stations by arrival station

The top ten stations correspond to the following landmarks.

35 - Navy Pier
76 - Grant Park
90 - Millennium Park
91 - Next to Chicago Ogilvie Transportation Center
85 - Magnificent Mile Shopping District
177 - Theatre on the Lake
97 - Museum Campus
26 - Streeterville
174 - Next to Chicago Ogilvie Transportation Center
52 - Michigan Ave & Lake St (Just northwest of Millennium Park)

The majority of these stations are either on Lake Michigan or extremely close to Lake Michigan, with most of them by extremely popular tourist attractions. The only exceptions are the two stations next to Ogilvie Transportation Center, one of the busiest commuter train stations in the US. Those two stations are on opposing corners of the station.

Since I saw a difference between the time customers and subscribers rent Divvy bikes and how long they rent them, maybe there is a difference in which stations customers and subscribers use as well.

# I am splitting the data by day, station, and user type. This will allow me to 
# count how many renters there are per day per station. I can then find the 
# median number of renters per station per customer type.
Divvy.date.split <- separate(subset(Divvy, !is.na(starttime)), 
                             starttime, c("start.date", "start.time"), 
                             sep = " ")

CusSub <- Divvy.date.split %>%
      group_by(usertype, start.date, from_station_id) %>%
      summarize(count = n()) %>%
      group_by(usertype, from_station_id) %>%
      summarize(median = median(count))

It is somewhat clear from this plot that some stations are used more by customers than by subscribers and vice versa. However, the plot does not show clearly which stations those are or how large the gap is.

CusSubDiff <- spread(CusSub, usertype, median)
CusSubDiff$difference <- CusSubDiff$Customer - CusSubDiff$Subscriber

This plot, however, is very clear about which stations see a majority of the renters as customers versus subscribers. The stations with predominantly customers as renters show a positive difference and are coded blue. The stations with predominantly subscribers show a negative difference and are coded red. It does look like the majority of the stations see about an equal usage by both customers and subscribers. Just to get a sense of which stations show the greatest disparity, I will make a new data set to show those stations with an absolute difference of 15 or greater.

CusSubDiff.greatest <- subset(CusSubDiff, abs(difference) >= 15)
print(CusSubDiff.greatest, max = 250)

## Source: local data frame [23 x 4]
## 
##    from_station_id Customer Subscriber difference
## 1               25     28.5       11.0       17.5
## 2               26     36.5       17.0       19.5
## 3               35     91.0        9.5       81.5
## 4               45     24.0        9.0       15.0
## 5               49      9.0       29.0      -20.0
## 6               66      7.0       22.0      -15.0
## 7               75     10.0       43.0      -33.0
## 8               76     62.0       11.0       51.0
## 9               77      9.0       35.5      -26.5
## 10              81     15.0       30.0      -15.0
## ..             ...      ...        ...        ...

Maybe there is also a difference in the duration of a rental, depending on the station. It could be that some stations are used by renters who will only use the Divvy bike for a short period of time, while other stations may find renters who use the bikes for a longer period of time.

DiffStation <- Divvy.date.split %>%
      group_by(start.date, from_station_id) %>%
      summarize(median.duration = median(duration.mins)) %>%
      group_by(from_station_id) %>%
      summarize(median.duration = median(median.duration))

There may be a difference based on what we see here, but it is hard to say. I did see a difference in the duration of rental by customer versus subscriber. Perhaps by combining my two previous plots, I can gain a better understanding.

CusSub.DiffStation <- inner_join(CusSubDiff.greatest, 
                                 DiffStation, by = "from_station_id")

Here we see that the stations that customers use heavily also see higher rental durations than the stations used heavily by subscribers.

Percentage of Renters by Sex and Age

Switching gears a bit, I wanted to see what the age and sex distribution was of the subscribers.

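Divvy$age.bucket <- cut(Divvy$age, breaks = c(16, 24, 34, 44, 54, 64, 74))
# age.bucket is a factor, so geom_bar is used here: geom_histogram needs a 
# continuous x in current ggplot2. after_stat() replaces the deprecated 
# ..count.. notation and requires ggplot2 3.3.0 or later.
ggplot(aes(x = age.bucket, y = after_stat(count / sum(count)), 
           color = gender, fill = gender), 
       data = subset(Divvy, (gender == 'Male' | gender == 'Female') 
                     & !is.na(age.bucket))) +
      geom_bar(position = 'dodge') + 
      scale_y_continuous(labels = percent) +
      ylab("Percentage of Riders")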

Men make up a dramatically higher proportion of Divvy riders than women in every age group. One interesting way of looking at this might be inspecting the percentage of male or female riders as a function of time of day.

This plot confirms that female renters never make up more than 30% of the renters at any given time throughout the day.
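The share of female renters per hour can be computed directly before plotting. This is a minimal sketch on invented rentals, using base R only. It assumes that rows without a recorded gender hold an empty string, which should be checked against the real data.

```r
# Toy rentals; hours and genders are invented. The empty string stands in
# for a rental with no recorded gender (an ASSUMPTION about the encoding).
toy <- data.frame(
  time.hour = c(8, 8, 8, 8, 17, 17, 17, 17, 17, 8),
  gender    = c("Male", "Male", "Female", "Male", "Male",
                "Female", "Male", "Male", "Male", "")
)

# Keep rentals with a known gender, then take the mean of a logical vector
# per hour, which gives the proportion of female renters in that hour.
known <- subset(toy, gender %in% c("Male", "Female"))
female.share <- tapply(known$gender == "Female", known$time.hour, mean)
female.share
```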

Age Buckets for Rental Time and Duration

The following two plots show that regardless of subscriber age, subscribers rent bikes for roughly the same duration of time and at the same times during the day.

Variance in daily ridership

DivvyTime.by.day <- DivvyTime %>%
      group_by(start.date, day) %>%
      summarize(count = n())

aggregate(count ~ day, DivvyTime.by.day, mean)

##         day    count
## 1    Sunday 4165.037
## 2    Monday 3922.519
## 3   Tuesday 3803.815
## 4 Wednesday 3732.654
## 5  Thursday 3587.815
## 6    Friday 4272.741
## 7  Saturday 4793.963

Even though Fridays and Saturdays typically have more renters, the daily averages are close enough that the differences among the days of the week may not be statistically significant.
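A claim about statistical significance would normally rest on a formal test. One option is a Kruskal-Wallis rank-sum test on the daily counts grouped by day of the week. The sketch below only shows the mechanics: the counts are randomly generated with no real day effect, so its result says nothing about the Divvy data. With the real data, DivvyTime.by.day would take the place of toy.daily.

```r
# Toy daily counts: ten dates per day of the week, drawn from one
# distribution, so any difference between days is noise by construction.
set.seed(42)
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
toy.daily <- data.frame(
  day   = factor(rep(days, each = 10), levels = days),
  count = round(rnorm(70, mean = 4000, sd = 1500))
)

# Rank-based test of whether the counts differ across the seven groups.
kw <- kruskal.test(count ~ day, data = toy.daily)
kw$p.value
```

A rank-based test is used here because daily counts need not be normally distributed; a one-way ANOVA would be the parametric alternative.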

Final Plots and Summary

This first pair of plots shows when Divvy renters are renting. During the week, rentals peak just outside of morning rush hour and during evening rush hour, denoted by the green bars. From the exploratory analysis, the subscribers appear to be commuters coming from the train station or people living in the city who use the bikes to go to work or return home. The weekday and subscriber histograms look the same. The weekend and customer histograms also look the same. I would guess that the customers are either tourists or Chicagoans who do not own bikes, based on the following two plots.

These next two plots are extensions of the previous two plots. The first shows that while the majority of stations see roughly the same number of customers and subscribers on a given day, some stations are patronized largely by either customers (positive difference values) or subscribers (negative difference values). If we look closely at the stations used mostly by customers, we see that their rental durations are longer than those at stations used mostly by subscribers. This continues to suggest that subscribers’ short trips may be due to commuting, while the customers’ trips are leisurely in nature and so can be longer. This, in conjunction with the identification of the 10 most popular stations earlier, suggests that the customer-centric stations are in more tourist-centered areas of the city, like Millennium Park. These stations see longer duration rentals. The subscriber-centric stations are by centers of commute, like Ogilvie Transportation Center. These stations see shorter duration rentals.

Something that struck me about the Divvy subscriber rental demographic is that it is overwhelmingly male. I had not come into this data set hypothesizing a 50/50 split between male and female, but I was not expecting it to be skewed so heavily to male. The percentage of subscribers to age bucket comparison showed the same information, but I find this plot more compelling. It is a relatively steady-state distribution with only some exceptions during the early morning hours.

Reflection

The most difficulty I had with this project was figuring out what I wanted to explore after some very basic histograms. After that, the difficulty came from organizing the data in a format that would allow me to create the plots I wanted. I found the greatest success in using the dplyr package to format the data tables. I created a number of new data tables from the parent data set, which made the plots much easier to build. It took some tinkering to understand what I wanted, but the visual introduction to the tidyr and dplyr packages in Lesson 5 was extremely useful in visualizing how I needed to format each new data frame. I was very impressed with myself when I created the third and, particularly, the fourth plot in my final plots section. Those two plots provide a lot of insight into the difference between the customers and subscribers that rent Divvy bikes.

Further exploration would be to see if this trend continues for the upcoming years of Divvy rentals. I think it would reinforce the idea that there are two types of Divvy renters, the leisurely weekend customer and the commuting weekday subscriber. I would also be curious to look at Chicago’s bike registration data to see if the sex demographic trend holds there as well. Is there really a 3:1 male to female bike ridership in Chicago? Perhaps I could also do a comparison with analogous bike rental programs, like New York City’s Citi Bike. A web post from I Quant NY suggests the sex demographic of renters is similar between the two cities’ bike rental programs.