Divvy bikes are rental bikes provided by the city of Chicago. Bike distribution stations are found throughout the north side of the city with a few centers on the south side. Stations have anywhere from 11 to 35 bikes available for rental. Renters can also subscribe to the Divvy service for a full year. This data set covers the first full year of Divvy bike rentals. Rentals started in June 2013. This data set comprises 1,665,487 bike rentals.

Note – To keep the two groups distinct, I refer to renters who do not subscribe to the Divvy service as customers. No information is provided on birth year or gender for non-subscribers.

Exploratory Analysis

suppressMessages(library("ggplot2"))
suppressMessages(library("tidyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("scales"))
suppressMessages(library("gridExtra"))

Divvy_2013 <- read.csv("Divvy_Trips_2013.csv")
Divvy_2014 <- read.csv("Divvy_Trips_2014_Q1Q2.csv")

names(Divvy_2013)[names(Divvy_2013) == 'birthday'] <- 'birthyear'

Divvy_2013$age <- 2013 - Divvy_2013$birthyear
Divvy_2013$starttime <- strptime(Divvy_2013$starttime, format = "%Y-%m-%d %H:%M")
Divvy_2014$age <- 2014 - Divvy_2014$birthyear
Divvy_2014$starttime <- strptime(Divvy_2014$starttime, format = "%m/%d/%Y %H:%M")

Divvy <- rbind(Divvy_2013, Divvy_2014)

rm(Divvy_2013)
rm(Divvy_2014)

Age of Riders

Let’s see how old the Divvy subscribers are.

It looks like there are some subscribers over 75. Let’s look at that a little bit closer.

We’ve got some spry 100+ year olds in Chicago subscribing to Divvy. These ages could be genuine, but I think it is more likely that they come from subscribers masking their age with a fake birth year. I will exclude subscribers over 75 from any age-based analysis from here on out.
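
The code for this cutoff is not shown in the report, so the following is only a minimal sketch of one way to apply it. The toy data frame and the column name age.clean are hypothetical stand-ins for the real Divvy data.

```r
# Toy stand-in for the Divvy data; the values are invented for illustration.
toy <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Subscriber", "Customer"),
  age      = c(29, 52, 113, NA)
)

# Keep ages of 75 or under; mark anything older as missing so that
# age-based summaries skip it without dropping the rental itself.
toy$age.clean <- ifelse(!is.na(toy$age) & toy$age <= 75, toy$age, NA)

summary(toy$age.clean)
```

Setting the implausible ages to NA, rather than deleting the rows, keeps those rentals available for the analyses that do not depend on age.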

I think I will take a look at bike rental duration next. Since Divvy records rental duration in seconds, I will convert the duration to minutes. Minutes are easier to interpret.

Duration of Rentals

Divvy$duration.mins <- Divvy$tripduration/60
summary(Divvy$duration.mins)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.55   13.12   19.15   21.70 1440.00

Wow! Someone rented their bike for 24 hours? That couldn’t have been cheap. Using the current cost model for non-subscribers, that bike rental cost $365. Ouch! It might have been cheaper to just buy a bike.

Divvy bikes are meant to be commuter bikes that go one way from station to station. The distance between the furthest north and furthest south Divvy stations is a little over 17 miles. Google Maps estimates it will take about 90 minutes to ride that distance along the city bike trails. I will use 90 minutes as an upper bound for a traditional Divvy user. I would assume anyone who keeps a bike longer than 90 minutes is not riding continuously, which is outside the intent of Divvy bikes. Let’s see how much of my data this affects.

sum(Divvy$duration.mins > 90) / nrow(Divvy)
## [1] 0.01478787

This cutoff excludes only about 1.5% of the data, so I will limit the rental duration graphs to 90 minutes.

The green line represents the 30 minute mark. Divvy heavily encourages renters to complete their journeys in 30 minutes or less. A renter who plans to ride for longer than 30 minutes should stop at a station within 30 minutes and rent a new bike. I wonder what portion of the rentals are 30 minutes or less.
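
The plotting code is not included in the report. As a rough sketch, a histogram like the one described could be drawn with ggplot2 as follows; the toy durations are randomly generated, and the binwidth and axis labels are my own assumptions.

```r
library(ggplot2)

# Toy durations in minutes (randomly generated for illustration);
# the real column is Divvy$duration.mins.
set.seed(1)
toy <- data.frame(duration.mins = rexp(1000, rate = 1 / 19))

# Histogram restricted to rentals of 90 minutes or less,
# with a green vertical line at the 30 minute mark.
p <- ggplot(subset(toy, duration.mins <= 90), aes(x = duration.mins)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = 30, color = "green") +
  xlab("Rental duration (minutes)") +
  ylab("Number of rentals")
print(p)
```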

sum(Divvy$duration.mins <= 30) / nrow(Divvy)
## [1] 0.8895158

Almost 89% of the rentals fall within the suggested 30 minute rental period. Let’s look at when rentals occur.

Rental Time of Day

Divvy$time.hour <- as.numeric(strftime(Divvy$starttime, format = "%H"))

This histogram displays rentals by time of day. It seems there are two peaks that may fall in line with the morning and evening rush hour.
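
To show what sits behind such a histogram, here is a small base R sketch that counts rentals per hour and finds the busiest hour. The start times are invented, and the names toy.start, toy.hour, and hour.counts are hypothetical.

```r
# Toy start times (invented); the real column is Divvy$starttime.
toy.start <- as.POSIXct(c("2013-07-01 08:05", "2013-07-01 08:40",
                          "2013-07-01 17:15", "2013-07-02 17:30",
                          "2013-07-02 17:55", "2013-07-03 12:00"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")

# Extract the hour as a number, then count rentals for every hour 0-23.
# Using a factor with fixed levels keeps hours with zero rentals in the table.
toy.hour <- as.numeric(strftime(toy.start, format = "%H", tz = "UTC"))
hour.counts <- table(factor(toy.hour, levels = 0:23))

# The hour with the most rentals.
names(hour.counts)[which.max(hour.counts)]
```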

Source: http://www.wbez.org/series/curious-city/when-chicago-area-traffic-worst-111374

In the above chart, sourced from data provided by the Illinois Department of Transportation, we see the density of traffic in the Chicago area. The AADT variable represents the annual average daily traffic, collected from 18 sites throughout the Chicagoland region between 2010 and 2013. The peak rush hour times fall between 6 and 8 AM for the morning rush and 4 and 6 PM for the evening rush.

The histogram peaks fall right in line with the evening rush hour, but miss the morning rush hour by one hour. I wonder if this is caused by the day of the week. Are Divvy rentals occurring at the same times on weekends as they are on weekdays?

Divvy$day <- weekdays(as.Date(Divvy$starttime))
Divvy$day <- as.factor(Divvy$day)
Divvy$day <- factor(Divvy$day, levels = c("Sunday", "Monday", "Tuesday", 
                                          "Wednesday","Thursday", "Friday", 
                                          "Saturday"))
DivvyTime <- separate(subset(Divvy, !is.na(starttime)), starttime, 
                      c("start.date", "start.time"), sep = " ")
DivvyTime <- subset(DivvyTime, select = c(start.date, time.hour, day))
DivvyTime$time.hour <- as.factor(DivvyTime$time.hour)
DivvyTime.count <- DivvyTime %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

The weekend rental peaks are definitely different from the weekday peaks. Let’s examine how the time of rental varies between weekdays and weekends.

DivvyTime$day <- as.factor(DivvyTime$day)

DivvyTime.count.weekday <- subset(DivvyTime, day != 'Saturday' & 
                                        day != 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

DivvyTime.count.weekend <- subset(DivvyTime, day == 'Saturday' 
                                  | day == 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())
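
The report does not show how the two count tables were compared. One possible summary, sketched here on invented data in the same shape as DivvyTime.count.weekday and DivvyTime.count.weekend, is the median number of rentals per hour for each day type. The names weekday.toy, weekend.toy, and hourly are hypothetical.

```r
# Toy per-day, per-hour counts (values invented for illustration).
weekday.toy <- data.frame(
  start.date = rep(c("2013-07-01", "2013-07-02"), each = 3),
  time.hour  = rep(c(8, 12, 17), times = 2),
  count      = c(120, 60, 150, 110, 70, 160)
)
weekend.toy <- data.frame(
  start.date = rep(c("2013-07-06", "2013-07-07"), each = 3),
  time.hour  = rep(c(8, 12, 17), times = 2),
  count      = c(30, 140, 90, 40, 150, 100)
)

# Median rentals per hour across days, one column per day type.
hourly <- merge(
  aggregate(count ~ time.hour, data = weekday.toy, FUN = median),
  aggregate(count ~ time.hour, data = weekend.toy, FUN = median),
  by = "time.hour", suffixes = c(".weekday", ".weekend")
)
hourly
```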

Rental Duration by Day of the Week

Renters rent bikes at different times on weekdays than on weekends. How does the length of rental vary by day of the week?

It looks like there are more renters renting bikes for 15 minutes or longer on the weekends.
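
A visual impression like this can be backed by a number. As a sketch, assuming a data frame with the day and duration.mins columns built earlier, the share of rentals lasting 15 minutes or longer per day could be computed as below; the toy values and the name long.share are invented.

```r
# Toy rentals (invented); the real columns are Divvy$day and Divvy$duration.mins.
toy <- data.frame(
  day = factor(c("Monday", "Monday", "Monday",
                 "Saturday", "Saturday", "Saturday"),
               levels = c("Monday", "Saturday")),
  duration.mins = c(6, 9, 22, 12, 25, 40)
)

# The mean of a logical vector is the proportion of TRUE values, so this
# gives the share of rentals of 15 minutes or longer for each day.
long.share <- tapply(toy$duration.mins >= 15, toy$day, mean)
round(long.share, 2)
```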

Rental Duration and Rental Time by Customer Type

Customers also seem to have longer rental times than subscribers.

The histogram for customers looks more like the weekend usage, while the histogram of subscribers looks like weekday usage.

Rental Duration by Season

Next, let’s look at the number of riders based on the season of the year.

Divvy$date.month <- as.numeric(strftime(Divvy$starttime, format = "%m"))
Divvy$season[Divvy$date.month == 9 | Divvy$date.month == 10 
             | Divvy$date.month == 11 ] <- "Fall"
Divvy$season[Divvy$date.month == 12 | Divvy$date.month == 1 
             | Divvy$date.month == 2 ] <- "Winter"
Divvy$season[Divvy$date.month == 3 | Divvy$date.month == 4 
             | Divvy$date.month == 5 ] <- "Spring"
Divvy$season[Divvy$date.month == 6 | Divvy$date.month == 7 
             | Divvy$date.month == 8 ] <- "Summer"
Divvy$season <- factor(Divvy$season)
table(Divvy$season)
## 
##   Fall Spring Summer Winter 
## 465735 506529 595387  97836

This table shows the total number of rentals for a given season. As one would expect, the greatest numbers of rentals are in the spring and summer. The fewest rentals are in the winter.
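
As an aside, the same month-to-season grouping can be written more compactly with a lookup vector indexed by month number. This is only an alternative sketch, not the code used for the table above; season.lookup, toy.month, and toy.season are hypothetical names.

```r
# Month number -> season, using the same groupings as the report:
# Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov fall.
season.lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                   "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
                   "Winter")

# Toy month numbers (invented); the real column is Divvy$date.month.
toy.month <- c(1, 4, 6, 7, 10, 12)
toy.season <- factor(season.lookup[toy.month])

table(toy.season)
```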

table(Divvy$date.month)
## 
##      1      2      3      4      5      6      7      8      9     10 
##  25076  28087  69308 158476 278745 350012  74867 170508 201030 174695 
##     11     12 
##  90010  44673

Breaking the seasons down into months, we see a dip in ridership for July and August. At first this seems strange, as one would expect those warm summer months to have the highest number of rentals. However, Divvy started renting in June 2013 and the 2014 file ends with the second quarter, so July and August appear only once in the data, during the launch period, when rentals were not yet at the levels they would reach the following summer.
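
One way to check this explanation is to count rentals by calendar year and month rather than by month alone, so that June 2013 and June 2014 are kept apart. The sketch below uses invented start times; toy.start and year.month are hypothetical names.

```r
# Toy start times (invented); the real column is Divvy$starttime.
toy.start <- as.POSIXct(c("2013-06-28 10:00", "2013-07-04 11:00",
                          "2013-08-15 09:00", "2014-06-02 08:00",
                          "2014-06-10 17:00", "2014-06-21 13:00"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")

# Format each timestamp as "YYYY-MM" and tabulate.
year.month <- table(format(toy.start, "%Y-%m"))
year.month
```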

For the sake of comparison, let’s look at how season affects rental duration.

One thing seems pretty obvious: both rental duration and the number of rentals drop during the winter. The plot also shows, as confirmed previously, that there are more riders in the summer, and those summer rentals are typically longer.

Rentals by Station

We can see that arrival and departure locations are varied throughout the city, but we can’t make much more sense out of these two histograms.

The first thing I’ll need to do is create a data frame of just the stations to investigate departures and arrivals.

Divvy$starttime <- as.character(Divvy$starttime)

stations_dep <- Divvy %>%
                  group_by(from_station_id) %>%
                  summarize(departure = n()) %>%
                  rename(station_id = from_station_id) %>%
                  arrange(desc(departure))

stations_arr <- Divvy %>%
                  group_by(to_station_id) %>%
                  summarize(arrival = n()) %>%
                  rename(station_id = to_station_id) %>%
                  arrange(desc(arrival))

stations <- merge(stations_dep, stations_arr, by = "station_id")
stations$total <- stations$departure + stations$arrival
stations_tidy <- gather(stations, "depart_arrive", "riders", 2:3)

stations_tidy_sum <- transform(stations_tidy, 
                               station_id = reorder(station_id, total))
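
The code that selects the busiest stations for the plots that follow is not shown. A minimal base R sketch, assuming a data frame shaped like stations above, could rank by total activity and keep the first few rows. The counts here are invented and stations.toy and top.n are hypothetical names.

```r
# Toy station table in the same shape as `stations` (counts invented).
stations.toy <- data.frame(
  station_id = c(35, 76, 90, 91),
  departure  = c(500, 300, 280, 260),
  arrival    = c(480, 310, 300, 200)
)
stations.toy$total <- stations.toy$departure + stations.toy$arrival

# Sort by total activity, busiest first, and keep the top n rows.
# n = 3 here for the toy data; the plots in the report use 20 and 10.
top.n <- head(stations.toy[order(-stations.toy$total), ], n = 3)
top.n$station_id
```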

Top 20 stations by total activity

Top 10 stations by departure station

Top 10 stations by arrival station

The top ten stations correspond to the following landmarks.

  • 35 - Navy Pier
  • 76 - Grant Park
  • 90 - Millennium Park
  • 91 - Next to Chicago Ogilvie Transportation Center
  • 85 - Magnificent Mile Shopping District
  • 177 - Theatre on the Lake
  • 97 - Museum Campus
  • 26 - Streeterville
  • 174 - Next to Chicago Ogilvie Transportation Center
  • 52 - Michigan Ave & Lake St (Just northwest of Millennium Park)

The majority of these stations are either on Lake Michigan or extremely close to it, and most of them are by extremely popular tourist attractions. The only exceptions are the two stations next to Ogilvie Transportation Center, one of the busiest commuter train stations in the US. Those two stations are on opposite corners of the train station.

Since I saw a difference between the time customers and subscribers rent Divvy bikes and how long they rent them, maybe there is a difference in which stations customers and subscribers use as well.

# I am splitting the data by day, station, and user type. This will allow me to 
# count how many renters there are per day per station. I can then find the 
# median number of renters per station per customer type.
Divvy.date.split <- separate(subset(Divvy, !is.na(starttime)), 
                             starttime, c("start.date", "start.time"), 
                             sep = " ")

CusSub <- Divvy.date.split %>%
      group_by(usertype, start.date, from_station_id) %>%
      summarize(count = n()) %>%
      group_by(usertype, from_station_id) %>%
      summarize(median = median(count))

It is somewhat clear from this plot that some stations are used more by customers than by subscribers and vice versa. However, the plot does not make clear which stations those are or how large the gap is.

CusSubDiff <- spread(CusSub, usertype, median)
CusSubDiff$difference <- CusSubDiff$Customer - CusSubDiff$Subscriber

This plot, however, is very clear about which stations see a majority of the renters as customers versus subscribers. The stations with predominantly customers as renters show a positive difference and are coded blue. The stations with predominantly subscribers show a negative difference and are coded red. It does look like the majority of the stations see about an equal usage by both customers and subscribers. Just to get a sense of which stations show the greatest disparity, I will make a new data set to show those stations with an absolute difference of 15 or greater.

CusSubDiff.greatest <- subset(CusSubDiff, difference >= 15 
                              | difference <= -15)
print(CusSubDiff.greatest, max = 250)
## Source: local data frame [21 x 4]
## 
##    from_station_id Customer Subscriber difference
##              (int)    (dbl)      (dbl)      (dbl)
## 1               35       86       12.0       74.0
## 2               48        7       24.0      -17.0
## 3               49        8       29.5      -21.5
## 4               66        6       22.0      -16.0
## 5               75        7       28.0      -21.0
## 6               76       59       11.0       48.0
## 7               77        8       30.0      -22.0
## 8               81        9       26.0      -17.0
## 9               85       40       16.0       24.0
## 10              90       59       16.0       43.0
## ..             ...      ...        ...        ...

Maybe there is also a difference in the duration of a rental, depending on the station. It could be that some stations are used by renters who will only use the Divvy bike for a short period of time, while other stations may find renters who use the bikes for a longer period of time.

DiffStation <- Divvy.date.split %>%
      group_by(start.date, from_station_id) %>%
      summarize(median.duration = median(duration.mins)) %>%
      group_by(from_station_id) %>%
      summarize(median.duration = median(median.duration))

There may be a difference based on what we see here, but it is hard to say. I did see a difference in the duration of rental by customer versus subscriber. Perhaps by combining my two previous plots, I can gain a better understanding.

CusSub.DiffStation <- inner_join(CusSubDiff.greatest, 
                                 DiffStation, by = "from_station_id")

Here we see that the stations that customers use heavily also see higher rental durations than the stations used heavily by subscribers.
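
To put a number on this comparison, one could label each station by which user type dominates and take the median duration per group. This is a sketch only: the difference values come from the table printed earlier, but the median.duration values are invented, and heavier.use and duration.by.group are hypothetical names.

```r
# Stations and differences taken from the printed table above;
# median.duration values are invented for illustration.
toy <- data.frame(
  from_station_id = c(35, 76, 90, 48, 49, 75),
  difference      = c(74, 48, 43, -17, -21.5, -21),
  median.duration = c(24, 22, 21, 9, 10, 8)
)

# Positive difference = more customers, negative = more subscribers.
toy$heavier.use <- ifelse(toy$difference > 0, "Customer", "Subscriber")

# Median of the station-level median durations, per group.
duration.by.group <- tapply(toy$median.duration, toy$heavier.use, median)
duration.by.group
```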

Percentage of Renters by Sex and Age

Switching gears a bit, I wanted to see the age and sex distribution of the subscribers.

Divvy$age.bucket <- cut(Divvy$age, breaks = c(16, 24, 34, 44, 54, 64, 74))
ggplot(aes(x = age.bucket, y = ..count../sum(..count..), 
           color = gender, fill = gender), 
       data = subset(Divvy, (gender == 'Male' | gender == 'Female') 
                     & !is.na(age.bucket))) +
      geom_bar(position = 'dodge') + 
      scale_y_continuous(labels = percent) +
      ylab("Percentage of Riders")

Men make up a dramatically higher proportion of Divvy riders than women in every age group. One interesting way of looking at this might be to inspect the percentage of male or female riders as a function of time of day.

This plot confirms that female renters never make up more than 30% of the renters at any given time throughout the day.
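
The underlying percentages can be computed with a two-way table normalized by row. The sketch below assumes columns like time.hour and gender from the analysis above; the toy rows are invented and share is a hypothetical name.

```r
# Toy subscriber rentals (invented); the real columns are
# Divvy$time.hour and Divvy$gender.
toy <- data.frame(
  time.hour = c(8, 8, 8, 8, 17, 17, 17, 17, 17),
  gender    = c("Male", "Male", "Male", "Female",
                "Male", "Male", "Male", "Male", "Female")
)

# margin = 1 normalizes each row, so every hour's rentals sum to 1
# across the gender columns.
share <- prop.table(table(toy$time.hour, toy$gender), margin = 1)

# Percentage of female renters for each hour.
round(100 * share[, "Female"], 1)
```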

Age Buckets for Rental Time and Duration

The following two plots show that regardless of subscriber age, subscribers rent bikes for roughly the same duration of time and at the same times during the day.

Variance in daily ridership

DivvyTime.by.day <- DivvyTime %>%
      group_by(start.date, day) %>%
      summarize(count = n())

aggregate(count ~ day, DivvyTime.by.day, mean)
##         day    count
## 1    Sunday 4726.321
## 2    Monday 4409.717
## 3   Tuesday 4240.078
## 4 Wednesday 4014.824
## 5  Thursday 4094.585
## 6    Friday 4849.264
## 7  Saturday 5401.000

Even though Fridays and Saturdays typically have more renters, there doesn’t appear to be a statistically significant difference in the number of renters among the days of the week.
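
The report does not say how significance was judged. One option, which is my own suggestion and not necessarily the method used here, is a Kruskal-Wallis rank sum test on the daily counts grouped by day of the week. The sketch below runs it on invented counts for three days; with the real data, DivvyTime.by.day would take the place of toy.

```r
# Toy daily rental counts (invented), four observed dates per weekday.
toy <- data.frame(
  day   = factor(rep(c("Monday", "Wednesday", "Saturday"), each = 4)),
  count = c(4400, 4300, 4550, 4380,
            4000, 4100, 3950, 4020,
            5400, 5300, 5500, 5450)
)

# Null hypothesis: daily counts follow the same distribution on every day.
kw <- kruskal.test(count ~ day, data = toy)
kw$p.value
```

A small p-value would indicate that at least one day differs from the others. A rank-based test is used in this sketch because daily counts are unlikely to be normally distributed across seasons.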

Final Plots and Summary

This first pair of plots shows when Divvy renters are renting. We see that during the week, rentals peak just outside of morning rush hour and during evening rush hour, denoted by the green bars. From the exploratory analysis, it seems the subscribers are commuters coming from the train station or people living in the city who use the bikes to go to work or return home. The weekday and subscriber histograms look the same. The weekend and customer histograms also look the same. I would guess that the customers are either tourists or Chicagoans who do not own bikes, based on the following two plots.

These next two plots are extensions of the previous two plots. The first shows that while the majority of stations see roughly the same number of customers and subscribers on a given day, there are stations that are patronized largely by either customers (positive difference values) or subscribers (negative difference values). If we look closely at the stations used mostly by customers, we see that the rental durations are longer than at those used mostly by subscribers. This continues to suggest that subscribers’ short trips may be due to commuting, while the customers’ trips are leisurely in nature and so can be longer. This, in conjunction with the identification of the 10 most popular stations earlier, suggests that the customer-centric stations are in more tourist-centered areas of the city, like Millennium Park. These stations see longer duration rentals. The subscriber-centric stations are by centers of commute, like Ogilvie Transportation Center. These stations see shorter duration rentals.

Something that struck me about the Divvy subscriber rental demographic is that it is overwhelmingly male. I had not come into this data set hypothesizing a 50/50 split between male and female, but I was not expecting it to be skewed so heavily to male. The percentage of subscribers to age bucket comparison showed the same information, but I find this plot more compelling. It is a relatively steady-state distribution with only some exceptions during the early morning hours.

Reflection

The greatest difficulty I had with this project was figuring out what I wanted to explore after some very basic histograms. After that, the difficulty came from organizing the data in a format that would allow me to create the plots I wanted. I found the greatest success using the dplyr package to format the data tables. I created a number of new data tables from the parent data set, which helped give me new ideas for analysis. I was very impressed with myself when I created the third and, particularly, the fourth plot cited in my final plots section. Those two plots provide a lot of insight into the difference between the customers and subscribers that rent Divvy bikes. Further exploration would be to see if this trend continues in the upcoming years of Divvy rentals. I think it would reinforce the idea that there are two types of Divvy renters, a leisurely weekend customer and a commuting weekday subscriber. I would also be curious to look at Chicago’s bike registration data to see if the sex demographic trend holds there as well. Is there really a 3:1 male to female bike ridership in Chicago? Perhaps I could also do a comparison with analogous bike rental programs, like New York City’s Citi Bike. This web post from I Quant NY suggests the sex demographic of renters is similar between the two cities’ bike rental programs.