Divvy bikes are rental bikes provided by the city of Chicago. Bike distribution stations are found throughout the north side of the city with a few centers on the south side. Stations have anywhere from 11 to 35 bikes available for rental. Renters can also subscribe to the Divvy service for a full year. This data set covers the first full year of Divvy bike rentals. Rentals started in June 2013. This data set comprises 1,665,487 bike rentals.

Note – For the sake of distinction, those that I refer to as customers are renters who are non-subscribers to the Divvy service. No information is provided on birth year or gender for non-subscribers.

Exploratory Analysis

# Load each package once; suppressMessages() hides the startup banners.
suppressMessages(library("ggplot2"))
suppressMessages(library("tidyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("scales"))
suppressMessages(library("gridExtra"))

Divvy2013 <- read.csv("Divvy_Trips_2013.csv")
Divvy2014 <- read.csv("Divvy_Trips_2014_Q1Q2.csv")

names(Divvy2013)[names(Divvy2013) == 'birthday'] <- 'birthyear'

Divvy2013$age <- 2013 - Divvy2013$birthyear
Divvy2014$age <- 2014 - Divvy2014$birthyear

Divvy <- rbind(Divvy2013, Divvy2014)

rm(Divvy2013)
rm(Divvy2014)

The first analysis I wanted to try was seasonal analysis of riders.

Divvy$starttime <- strptime(Divvy$starttime, format = "%Y-%m-%d %H:%M")
Divvy$date.month <- as.numeric(strftime(Divvy$starttime, format = "%m"))
Divvy$season[Divvy$date.month == 9 | Divvy$date.month == 10 
             | Divvy$date.month == 11 ] <- "Fall"
Divvy$season[Divvy$date.month == 12 | Divvy$date.month == 1 
             | Divvy$date.month == 2 ] <- "Winter"
Divvy$season[Divvy$date.month == 3 | Divvy$date.month == 4 
             | Divvy$date.month == 5 ] <- "Spring"
Divvy$season[Divvy$date.month == 6 | Divvy$date.month == 7 
             | Divvy$date.month == 8 ] <- "Summer"
Divvy$season <- factor(Divvy$season)
table(Divvy$season)
## 
##   Fall Summer Winter 
## 465735 249380  44673

Well, this is a strange result. It seems no one rents Divvy bikes during the spring. Those values also don’t add up to the total of the data set.

table(Divvy$date.month)
## 
##      6      7      8      9     10     11     12 
##   4005  74867 170508 201030 174695  90010  44673

No one rents Divvy bikes during the first five months of the year? Let’s go back to the starttime variable and see what is missing.

sum(is.na(Divvy$starttime))
## [1] 905699

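And there is the problem. A little over 900,000 rentals have no starting time value, which matches the number of rentals in the 2014 data set. It seems Divvy stopped recording starting time data at the beginning of 2014, although strptime also returns NA for any string that does not match the given format, so a change in how the 2014 file writes its timestamps would produce the same result.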
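A missing starttime does not have to mean a missing record: strptime and as.POSIXct return NA, without an error, for any string that does not match the format they are given. The sketch below shows one way to check for that with a fallback format. The helper name parse_start and the sample strings are hypothetical, and the second format ("%m/%d/%Y %H:%M") is only an assumption about how a later file might write its timestamps; looking at a few raw values first would confirm or rule it out.

```r
# Hypothetical helper: try each format in turn and keep the first that parses.
# The fallback format is an assumption, not something confirmed from the data.
parse_start <- function(x, formats = c("%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M")) {
  x <- as.character(x)
  out <- as.POSIXct(x, format = formats[1], tz = "UTC")
  for (fmt in formats[-1]) {
    miss <- is.na(out)
    if (any(miss)) {
      # Re-parse only the values the earlier formats could not read
      out[miss] <- as.POSIXct(x[miss], format = fmt, tz = "UTC")
    }
  }
  out
}

# Made-up sample values in two layouts, plus one that matches neither
raw <- c("2013-06-27 12:11", "6/30/2014 23:57", "not a time")
parsed <- parse_start(raw)
is.na(parsed)
```

If the number of NA values drops sharply after adding a fallback format, the gap was a parsing issue rather than missing data.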

Age of Riders

Let’s see how old the Divvy subscribers are.

It looks like there are some subscribers over 75. Let’s look at that a little bit closer.

We’ve got some spry 100+ year olds in Chicago subscribing to Divvy. They could be subscribers, but I think it is more likely that these age values are from subscribers masking their age with a fake birth year. I will cut those over 75 from any age-based analysis from here on out.
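The over-75 cut can be sketched as a simple filter. This is a minimal illustration on made-up rows, assuming only the age and usertype columns used elsewhere in this analysis; the object names rides and age_ok are hypothetical.

```r
# Toy rows standing in for the Divvy data; the values are invented.
rides <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Customer", "Subscriber"),
  age      = c(29, 113, NA, 75)
)

# How many riders report an age above 75?
n_over_75 <- sum(rides$age > 75, na.rm = TRUE)

# Keep riders aged 75 or younger. Rows with no age (customers) drop out too,
# which is what an age-based analysis needs.
age_ok <- subset(rides, !is.na(age) & age <= 75)
```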

I think I will take a look at bike rental duration next. Since Divvy records rental duration in seconds, I will convert the duration to minutes. Minutes are easier to visualize numerically.

Duration of Rentals

Divvy$duration.mins <- Divvy$tripduration/60
summary(Divvy$duration.mins)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.55   13.12   19.15   21.70 1440.00

Wow! Someone kept their bike for a full 24 hours (1,440 minutes)? That couldn’t have been cheap. Using the current cost model for non-subscribers, that bike rental was $365. Ouch!

Divvy bikes are meant to be commuter bikes that go one way from station to station. The distance between the furthest north and furthest south Divvy stations is a little over 17 miles. Google Maps estimates it will take about 90 minutes to ride that distance along the city bike trails. I will set an upper bound of 90 minutes for a traditional Divvy user. I would assume anyone who keeps a bike longer than 90 minutes is not riding continuously, which is outside the intent of Divvy bikes. Let’s see if this will affect my data.

sum(Divvy$duration.mins > 90) / 1665487
## [1] 0.01478787

This omission affects only 1.5% of the data, so I will focus rental duration graphs up to 90 minutes.
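The same share can be computed without hard-coding the row count, since the mean of a logical vector is the proportion of TRUE values. A minimal sketch on made-up durations (the vector duration_mins below is invented for illustration):

```r
# Invented rental durations in minutes, including one missing value
duration_mins <- c(5, 12.5, 30, 45, 95, 1440, NA)

# mean() of a logical vector gives the share of TRUE values.
# With na.rm = TRUE the denominator counts only non-missing durations,
# so the result differs from dividing by the full row count if NAs exist.
share_over_90 <- mean(duration_mins > 90, na.rm = TRUE)
share_over_90
```

The threshold can be swapped for any other cut-off without touching the denominator.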

The green line represents the 30 minute mark. Divvy heavily encourages rentals to occur in 30 minute segments. If one plans on renting for longer than 30 minutes, a renter should stop at a station after 30 minutes and exchange bikes. I wonder what portion of the rentals are 30 minutes or less.

sum(Divvy$duration.mins <= 30) / 1665487
## [1] 0.8895158

Almost 89% of the rentals fall within the 30 minute range Divvy suggests. Let’s look at when rentals occur.

Rental Time of Day

Divvy$time.hour <- as.numeric(strftime(Divvy$starttime, format = "%H"))

This histogram displays rentals by time of day. It seems there are two peaks that may fall in line with the morning and evening rush hour.

Source: http://www.wbez.org/series/curious-city/when-chicago-area-traffic-worst-111374

AADT represents the annual average daily traffic, collected from 18 sites throughout the region between 2010 and 2013 by the Illinois Department of Transportation. The peak rush hour times fall between 6 and 8 AM for the morning rush and 4 and 6 PM for the evening rush.

The histogram peaks fall right in line with the evening rush hour but miss the morning rush hour by an hour. I wonder if this is caused by the day of the week. Are Divvy rentals occurring at the same time during the weekends as they are during the weekdays?

Divvy$day <- weekdays(as.Date(Divvy$starttime))
Divvy$day <- as.factor(Divvy$day)
Divvy$day <- factor(Divvy$day, levels = c("Sunday", "Monday", "Tuesday", 
                                          "Wednesday","Thursday", "Friday", 
                                          "Saturday"))
DivvyTime <- separate(subset(Divvy, !is.na(starttime)), starttime, 
                      c("start.date", "start.time"), sep = " ")
DivvyTime <- subset(DivvyTime, select = c(start.date, time.hour, day))
DivvyTime$time.hour <- as.factor(DivvyTime$time.hour)
DivvyTime.count <- DivvyTime %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

The weekend rental peaks are definitely different than the weekday rentals. Let’s examine the variance of time of rental between the weekday and the weekend.

DivvyTime$day <- as.factor(DivvyTime$day)

DivvyTime.count.weekday <- subset(DivvyTime, day != 'Saturday' & 
                                        day != 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())

DivvyTime.count.weekend <- subset(DivvyTime, day == 'Saturday' 
                                  | day == 'Sunday') %>%
      group_by(start.date, time.hour) %>%
      summarize(count = n())
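The weekday and weekend subsets above can also be described with a single flag. The sketch below is one possible variant on made-up dates and hours, not the code used for the plots. It relies on format(x, "%u"), which returns the ISO weekday number (Monday is 1, Sunday is 7) and so does not depend on the locale the way the day names from weekdays() do.

```r
# Invented start dates and hours: a Friday, Saturday, Sunday and Monday
start_date <- as.Date(c("2013-07-05", "2013-07-06", "2013-07-07", "2013-07-08"))
time_hour  <- c(8, 14, 14, 17)

# "%u" gives the weekday as "1" (Monday) through "7" (Sunday)
day_type <- ifelse(format(start_date, "%u") %in% c("6", "7"),
                   "Weekend", "Weekday")

# Rentals per hour, split by weekday/weekend
hourly <- table(day_type, time_hour)
hourly
```

Keeping one flag column instead of two data frames makes it easier to facet or color a single plot by day type.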

Rental Duration by Day of the Week

Renters are renting bikes at different times between weekdays and weekends. How about the length of rental by day of the week?

It looks like there are more renters renting bikes for 15 minutes or longer on the weekends.

Rental Duration and Rental Time by Customer Type

Customers also seem to have longer rental times than subscribers.

The histogram for customers looks more like the weekend usage, while the histogram of subscribers looks like weekday usage.

Rental Duration by Season

And just for the sake of comparison, how about a look at how season affects rental duration.

One thing seems pretty obvious: rental duration drops during the winter. But it also looks like fall has more rentals than summer. This seems counter-intuitive, but remember that this seasonal data comes from Divvy’s first six months of rental data. Since the service started in the summer, I’d expect the first few months to be slower than the following three months. The summer slope between 10 and 20 minutes does seem to be more gradual than the same portion of the fall curve. This gave me the sense that rental durations were proportionately longer during the summer than during the fall.

Rentals by Station

We can see that arrival and departure locations are varied throughout the city. But we can’t make too much sense out of these two histograms otherwise.

The first thing I’ll need to do is create a data frame of just the stations to investigate departures and arrivals.

Divvy$starttime <- as.character(Divvy$starttime)

stations_dep <- Divvy %>%
                  group_by(from_station_id) %>%
                  summarize(departure = n()) %>%
                  rename(station_id = from_station_id) %>%
                  arrange(desc(departure))

stations_arr <- Divvy %>%
                  group_by(to_station_id) %>%
                  summarize(arrival = n()) %>%
                  rename(station_id = to_station_id) %>%
                  arrange(desc(arrival))

stations <- merge(stations_dep, stations_arr, by = "station_id")
stations$total <- stations$departure + stations$arrival
stations_tidy <- gather(stations, "depart_arrive", "riders", 2:3)

stations_tidy_sum <- transform(stations_tidy, 
                               station_id = reorder(station_id, total))
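One detail worth knowing about merge: by default it keeps only the station ids that appear in both tables, so a station with departures but no arrivals (or the reverse) would silently drop out. Whether that happens in the Divvy data is not checked here. The sketch below uses made-up counts to show the all = TRUE variant and a simple top-N ranking.

```r
# Invented counts; station 90 has no arrivals and station 91 no departures
dep <- data.frame(station_id = c(35, 76, 90), departure = c(120, 80, 60))
arr <- data.frame(station_id = c(35, 76, 91), arrival   = c(110, 85, 40))

# all = TRUE keeps stations that appear in only one of the two tables
both <- merge(dep, arr, by = "station_id", all = TRUE)
both[is.na(both)] <- 0            # a missing count means zero rentals
both$total <- both$departure + both$arrival

# Two busiest stations by total activity
top <- head(both[order(-both$total), ], 2)
top
```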

Top 20 stations by total activity

Top 10 stations by departure station

Top 10 stations by arrival station

The top ten stations correspond to the following landmarks.

  • 35 - Navy Pier
  • 76 - Grant Park
  • 90 - Millennium Park
  • 91 - Next to Chicago Ogilvie Transportation Center
  • 85 - Magnificent Mile Shopping District
  • 177 - Theatre on the Lake
  • 97 - Museum Campus
  • 26 - Streeterville
  • 174 - Next to Chicago Ogilvie Transportation Center
  • 52 - Michigan Ave & Lake St (Just northwest of Millennium Park)

The majority of these stations are either on Lake Michigan or extremely close to Lake Michigan, with most of them by extremely popular tourist attractions. The only exceptions are the two stations next to Ogilvie Transportation Center, one of the busiest commuter train stations in the US. Those two stations are on opposite corners of the station.

Since I saw a difference between the time customers and subscribers rent Divvy bikes and how long they rent them, maybe there is a difference in which stations customers and subscribers use as well.

# I am splitting the data by day, station, and user type. This will allow me to 
# count how many renters there are per day per station. I can then find the 
# median number of renters per station per customer type.
Divvy.date.split <- separate(subset(Divvy, !is.na(starttime)), 
                             starttime, c("start.date", "start.time"), 
                             sep = " ")

CusSub <- Divvy.date.split %>%
      group_by(usertype, start.date, from_station_id) %>%
      summarize(count = n()) %>%
      group_by(usertype, from_station_id) %>%
      summarize(median = median(count))

It is somewhat clear from this plot that some stations are used more by customers than by subscribers and vice versa. What the plot does not show clearly is which stations those are, or by how much.

CusSubDiff <- spread(CusSub, usertype, median)
CusSubDiff$difference <- CusSubDiff$Customer - CusSubDiff$Subscriber

This plot, however, is very clear about which stations see a majority of the renters as customers versus subscribers. The stations with predominantly customers as renters show a positive difference and are coded blue. The stations with predominantly subscribers show a negative difference and are coded red. It does look like the majority of the stations see about an equal usage by both customers and subscribers. Just to get a sense of which stations show the greatest disparity, I will make a new data set to show those stations with an absolute difference of 15 or greater.

CusSubDiff.greatest <- subset(CusSubDiff, difference >= 15 
                              | difference <= -15)
print(CusSubDiff.greatest, max = 250)
## Source: local data frame [23 x 4]
## 
##    from_station_id Customer Subscriber difference
## 1               25     28.5       11.0       17.5
## 2               26     36.5       17.0       19.5
## 3               35     91.0        9.5       81.5
## 4               45     24.0        9.0       15.0
## 5               49      9.0       29.0      -20.0
## 6               66      7.0       22.0      -15.0
## 7               75     10.0       43.0      -33.0
## 8               76     62.0       11.0       51.0
## 9               77      9.0       35.5      -26.5
## 10              81     15.0       30.0      -15.0
## ..             ...      ...        ...        ...
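The two-sided condition can be written more compactly with abs(). A minimal sketch follows; the first three rows reuse medians from the output above, while stations 60 and 61 are invented to show a small difference and a station with no customer rentals.

```r
diffs <- data.frame(
  from_station_id = c(25, 35, 49, 60, 61),
  Customer        = c(28.5, 91, 9, 12, NA),   # NA: no customer rentals recorded
  Subscriber      = c(11, 9.5, 29, 10, 20)
)
diffs$difference <- diffs$Customer - diffs$Subscriber

# abs() covers both directions at once; subset() drops rows where the
# condition is NA, so stations missing one user type are excluded.
greatest <- subset(diffs, abs(difference) >= 15)
greatest
```

If stations with only one user type should count as large differences, the NA medians would need to be replaced with 0 before subtracting.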

Maybe there is also a difference in the duration of a rental, depending on the station. It could be that some stations are used by renters who will only use the Divvy bike for a short period of time, while other stations may find renters who use the bikes for a longer period of time.

DiffStation <- Divvy.date.split %>%
      group_by(start.date, from_station_id) %>%
      summarize(median.duration = median(duration.mins)) %>%
      group_by(from_station_id) %>%
      summarize(median.duration = median(median.duration))

There may be a difference based on what we see here, but it is hard to say. I did see a difference in the duration of rental by customer versus subscriber. Perhaps by combining my two previous plots, I can gain a better understanding.

CusSub.DiffStation <- inner_join(CusSubDiff.greatest, 
                                 DiffStation, by = "from_station_id")

Here we see that the stations that customers use heavily also see higher rental durations than the stations used heavily by subscribers.

Percentage of Renters by Sex and Age

Switching gears a bit, I wanted to see what the age and sex distribution was of the subscribers.

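Divvy$age.bucket <- cut(Divvy$age, breaks = c(16, 24, 34, 44, 54, 64, 74))
# age.bucket is a factor, so geom_bar() (which counts discrete values) is used;
# geom_histogram() bins a continuous x and rejects a factor in current ggplot2.
# after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0).
ggplot(aes(x = age.bucket, y = after_stat(count / sum(count)), 
           color = gender, fill = gender), 
       data = subset(Divvy, (gender == 'Male' | gender == 'Female') 
                     & !is.na(age.bucket))) +
      geom_bar(position = 'dodge') + 
      scale_y_continuous(labels = percent) +
      ylab("Percentage of Riders")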
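A note on the age buckets: by default cut() builds intervals that are open on the left and closed on the right, so with these breaks a 16 year old falls outside the first bucket and becomes NA. The sketch below shows the behavior on a few made-up ages and the include.lowest = TRUE option that closes the first interval. Whether any 16 year olds are in the data is not checked here.

```r
ages   <- c(16, 17, 24, 25, 74, 75)
breaks <- c(16, 24, 34, 44, 54, 64, 74)

# Default: intervals are (16,24], (24,34], ... so 16 itself is not covered
default_buckets <- cut(ages, breaks = breaks)

# include.lowest = TRUE turns the first interval into [16,24]
closed_buckets <- cut(ages, breaks = breaks, include.lowest = TRUE)

data.frame(ages, default_buckets, closed_buckets)
```

Ages above 74 are NA under both settings, which matches the earlier decision to leave older riders out of age-based plots.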