INTRODUCTION


The Google Data Analytics Professional Certificate requires completing a Capstone Project.

I chose Track 1, Case Study 1, “How Does a Bike-Share Navigate Speedy Success?”, to see 1) what Google considers a legitimate scenario and 2) what other enrolees had done with this case study, and whether I could add something new.

The full instructions for this Case Study are here (Coursera link)

The Google Certificate breaks a project down into the following steps: Ask, Prepare, Process, Analyse, Share, and Act.

The 1. Executive Report section of this document presents the research question (Ask), a description of the data (Analyse, Share), and recommendations (Act). For the “Prepare” step, I sought additional information from press releases, news articles, blog posts by data analysts, and the projects of other Google Certificate enrolees.

The 2. Appendix section then presents the data cleaning steps, with notes on missing data and outliers (Process).



1. EXECUTIVE REPORT


1.1 CASE STUDY


The Case Study instructions introduce the following problem:

“You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently.”

The instructions suggest focusing on ride duration, day of the week, and variation across months, so let’s stick with that!


1.2 CONTEXT


The Case Study is based on the real-world example of the Divvy bike-share company in Chicago.


History [link]

Divvy is the European-inspired brainchild of former Chicago Mayor Richard M. Daley, who served the city from 1989 to 2011. It launched in June of 2013 with 750 bikes at 75 stations. Divvy is provided as a service of the Chicago Department of Transportation (CDOT) and is operated by private partner Motivate (formerly Alta Bicycle Share).


Early results (2014) [link]

A total of 97 percent of the responding members said they were “satisfied” or “very satisfied” with Divvy. In addition, when asked how likely on a scale of 1 to 10 they are to recommend Divvy to a friend, members responded on average with 9.1. A total of 80 percent of members are “somewhat more likely” or “much more likely” to patronize a business that is near a Divvy bike station.

Compared to what they were spending before they joined, members on average save $760 a year on travel, including auto expenses, taxis and other forms of public transit.

On average, members said they take nearly three trips each month that they would not have made if Divvy was not available. Reasons these trips wouldn’t have been taken include being too far to walk (44 percent), bicycle is faster or easier (36 percent), no bus/train or bus/train is inconvenient to that destination (32 percent), parking is limited or expensive at that destination (27 percent), or Divvy is cheaper than alternatives (23 percent).

Top things that motivated members to join Divvy: 1. Get around more easily and faster: 85%; 2. Station near home or work: 67%; 3. Like biking: 66%; 4. Save money on transportation: 51%; 5. Exercise and fitness: 51%.

Members “sometimes” or “often” use Divvy for the following purposes: 1. Go to/from work: 84%; 2. Social/entertainment: 82%; 3. Shopping/errands: 78%; 4. Go to/from transit: 76%; 5. Exercise/recreation: 57%.


Current challenges (2022) [link]

Anecdotally, bike availability has gotten worse, and an analysis of August and September shows things aren’t so great. And this is against a backdrop of price hikes and a shift away from classic bikes to more expensive e-bikes.

Still, there were 914,000 trips in August, a new record, so perhaps some of these issues are just growing pains, as the Divvy team learns to adapt to a more heavily-used system. Lyft also says it is on track to complete the deployment of 10,500 new e-bikes required by the latest contract renewal. While not everyone wants to pay extra for an e-bike, a higher number of bikes in the system overall should help increase the availability of classic bikes.

One possible way forward on the staffing front: Lyft is working with the nonprofit bike shop Working Bikes on a new mechanic training program, which saw its first graduates in April, to try to alleviate the shortage of bike mechanics.


Schematic map of the bike-share service area


Interactive map of Divvy stations

More information on the Divvy bike sharing platform

More information on the Divvy pricing model


1.3 DATASET


The Case Study instructions ask analysts to use the most recent 12 months of data from this website:

Link to Divvy database

More information on the Divvy data

Importantly, the rows represent rides, not riders.

The City of Chicago shares the same data with additional information on the age and sex of riders, but this will not be explored here.

I used the 12 monthly files from December 2021 to November 2022, since I started the project on December 21st, 2022.


1.4 RESULTS


1.4.1 INTRODUCTION


The dataset records, for each ride, the time and dock station at which it started and ended, whether it was made on a classic or electric bike, and whether the rider was a casual rider or an annual member.

A total of 5,733,451 rides were available for analysis between December 2021 and November 2022.

During data cleaning, a small number of rides were found to have: 1) started at a dock station but not ended at one, 2) an invalid ride duration (e.g., negative or lasting entire days), or 3) invalid geographical coordinates (i.e., not in Chicago). These rides were removed. Rides that lasted less than a minute were also removed, in line with Divvy’s data cleaning policy.

Analyses were therefore done on 5,592,668 valid rides (i.e., 97.5% of the sample).
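
The exclusion rules above can be sketched as a single filter. This is a minimal illustration in Python, not the R code used for this report. The field name `duration_seconds`, the helper `make_ride`, the bounding box around Chicago, and the sample rides are all assumptions made for the example. The three-hour upper limit is the one reported in the Appendix.

```python
# Hypothetical bounding box around Chicago (assumed values, for illustration only).
LAT_RANGE = (41.6, 42.1)
LNG_RANGE = (-88.0, -87.5)

MIN_SECONDS = 60            # rides under one minute are dropped
MAX_SECONDS = 3 * 60 * 60   # rides over three hours are treated as invalid


def is_valid_ride(ride):
    """Return True if a ride passes all of the cleaning rules described above."""
    # Rule 1: a classic or docked bike that left a station must end at a station.
    if (ride["rideable_type"] != "electric_bike"
            and ride["start_station_name"] is not None
            and ride["end_station_name"] is None):
        return False
    # Rule 2: duration must be at least one minute and at most three hours.
    if not MIN_SECONDS <= ride["duration_seconds"] <= MAX_SECONDS:
        return False
    # Rule 3: start and end coordinates must exist and fall inside the bounding box.
    for lat, lng in ((ride["start_lat"], ride["start_lng"]),
                     (ride["end_lat"], ride["end_lng"])):
        if lat is None or lng is None:
            return False
        if not (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
                and LNG_RANGE[0] <= lng <= LNG_RANGE[1]):
            return False
    return True


def make_ride(**overrides):
    """Build a made-up valid ride, optionally overriding some fields."""
    ride = {
        "rideable_type": "classic_bike",
        "start_station_name": "Millennium Park",
        "end_station_name": "Shedd Aquarium",
        "duration_seconds": 900,
        "start_lat": 41.88, "start_lng": -87.62,
        "end_lat": 41.87, "end_lng": -87.61,
    }
    ride.update(overrides)
    return ride


rides = [
    make_ride(),                              # valid
    make_ride(end_station_name=None),         # classic bike with no end station
    make_ride(duration_seconds=45),           # under one minute
    make_ride(duration_seconds=-30),          # negative duration
    make_ride(end_lat=0.0, end_lng=0.0),      # outside the bounding box
    make_ride(rideable_type="electric_bike",  # e-bike left outside a station: kept
              start_station_name=None, end_station_name=None),
]

valid = [ride for ride in rides if is_valid_ride(ride)]
share_valid = 100 * len(valid) / len(rides)
print(f"{len(valid)} of {len(rides)} rides kept ({share_valid:.1f}%)")
```

A ride can fail more than one rule, so the number of rides flagged by each rule separately does not have to add up to the total number removed.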

Overall, 40.8% of rides were done by casual riders and 59.2% by annual members. By bike type (i.e., classic versus electric bikes), 54.0% of rides by casual riders were on e-bikes, compared to 48.6% of rides by annual members.


1.4.2 GEOGRAPHICAL VARIATION


Figure 1 presents the geographical distribution of 1,306 stations, highlighting the 50 most common start and end dock stations:


FIGURE 1 - INTERACTIVE MAP OF DOCKING STATIONS

Top start stations are in pale blue, top end stations are in pale red, and stations that are both appear in darker red.


Table 1 lists the most common start and end dock stations:


TABLE 1 - MOST COMMON DOCKING STATIONS

Start station | Total | % Casual | End station | Total | % Casual
Streeter Dr & Grand Ave | 39962 | 81.4 | Streeter Dr & Grand Ave | 40439 | 83.6
DuSable Lake Shore Dr & Monroe St | 23084 | 83.9 | DuSable Lake Shore Dr & Monroe St | 21128 | 80.8
Michigan Ave & Oak St | 19831 | 70.5 | Michigan Ave & Oak St | 19767 | 72.7
DuSable Lake Shore Dr & North Blvd | 17810 | 63.7 | DuSable Lake Shore Dr & North Blvd | 19614 | 66.0
Millennium Park | 15854 | 80.9 | Millennium Park | 16290 | 83.5
Theater on the Lake | 15558 | 62.0 | Wells St & Concord Ln | 14955 | 43.7
Wells St & Concord Ln | 15078 | 46.3 | Theater on the Lake | 14691 | 64.3
Clark St & Armitage Ave | 12909 | 49.7 | Clark St & Elm St | 13039 | 39.2
Shedd Aquarium | 12809 | 85.9 | Clark St & Armitage Ave | 12615 | 48.8
Clark St & Elm St | 12619 | 39.8 | Broadway & Barry Ave | 12186 | 40.7
Broadway & Barry Ave | 12182 | 42.4 | Clark St & Lincoln Ave | 11764 | 53.1
Wells St & Elm St | 11889 | 43.3 | Wells St & Elm St | 11697 | 41.7
Clark St & Lincoln Ave | 11874 | 52.2 | Shedd Aquarium | 11297 | 84.4
Clark St & Wrightwood Ave | 10144 | 42.8 | Lakeview Ave & Fullerton Pkwy | 10437 | 46.7
Wells St & Huron St | 10116 | 38.5 | Wabash Ave & Grand Ave | 10434 | 52.6


Rides are most frequent in downtown Chicago, and to and from dock stations near major landmarks and public transit hubs.

The “% Casual” column supports the idea that the purpose of rides differs across stations. For instance, rides starting at the “Millennium Park” station were mostly made by casual riders (80.9%), presumably leaving the park, whereas rides starting at the “Clark St & Elm St” station were mostly made by annual members (60.2%), e.g., travelling to and from work.
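
As a rough illustration of how the “Total” and “% Casual” columns of a table like Table 1 can be derived, here is a minimal Python sketch. The report itself was produced in R; the function name `station_summary` and the toy rides are assumptions made for the example, not the original code or data.

```python
from collections import Counter

# Toy rides: (start_station_name, member_casual). Values are made up for illustration.
rides = [
    ("Millennium Park", "casual"),
    ("Millennium Park", "casual"),
    ("Millennium Park", "casual"),
    ("Millennium Park", "member"),
    ("Clark St & Elm St", "member"),
    ("Clark St & Elm St", "member"),
    ("Clark St & Elm St", "casual"),
    (None, "casual"),  # rides without a start station are skipped
]


def station_summary(rides, top_n=15):
    """Return (station, total rides, % casual) for the top_n busiest stations."""
    totals = Counter()
    casual = Counter()
    for station, rider_type in rides:
        if station is None:
            continue
        totals[station] += 1
        if rider_type == "casual":
            casual[station] += 1
    return [
        (station, total, round(100 * casual[station] / total, 1))
        for station, total in totals.most_common(top_n)
    ]


summary = station_summary(rides)
for station, total, pct_casual in summary:
    print(f"{station:<20} {total:>3} {pct_casual:>5.1f}")
```

The same function applied to end station names would give the right-hand half of the table.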


1.4.3 SEASONAL VARIATION


Figures 2 and 3 present the seasonal variation in rides among casual riders and annual members, by month and by day.


FIGURE 2 - Monthly rides, by membership status

FIGURE 3 - Daily rides, by membership status


Annual members made more rides than casual riders throughout the year, except in June, around the start of summer, when casual riders and annual members used the bike-share system to a similar extent.


FIGURE 4 - Rides over days of the week, by month and membership status


Figure 4 shows that rides by casual riders were more common on Saturdays and Sundays, particularly in July and October, whereas rides by annual members were more common on weekdays.
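
The day-of-week breakdown behind a figure like Figure 4 can be sketched as follows. This is an illustrative Python example rather than the R code used for the report; the function names `rides_by_weekday` and `weekend_share` and the toy rides are assumptions.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Toy rides: (started_at, member_casual). Values are made up for illustration.
rides = [
    ("2022-07-02 10:15:00", "casual"),   # Saturday
    ("2022-07-03 14:30:00", "casual"),   # Sunday
    ("2022-07-04 08:05:00", "member"),   # Monday
    ("2022-07-05 17:45:00", "member"),   # Tuesday
    ("2022-07-09 11:00:00", "casual"),   # Saturday
    ("2022-07-09 12:00:00", "member"),   # Saturday
]


def rides_by_weekday(rides):
    """Count rides per (rider type, day of the week)."""
    counts = Counter()
    for started_at, rider_type in rides:
        started = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S")
        counts[(rider_type, DAY_NAMES[started.weekday()])] += 1
    return counts


def weekend_share(counts, rider_type):
    """Percentage of a rider type's rides that started on Saturday or Sunday."""
    total = sum(n for (kind, _), n in counts.items() if kind == rider_type)
    weekend = counts[(rider_type, "Saturday")] + counts[(rider_type, "Sunday")]
    return 100 * weekend / total


counts = rides_by_weekday(rides)
for rider_type in ("casual", "member"):
    print(f"{rider_type}: {weekend_share(counts, rider_type):.1f}% of rides on weekends")
```

Adding the month of `started_at` to the counter key would give the month-by-weekday grid.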


1.4.4 RIDE DURATION


Figures 5 and 6 present the distribution of ride duration over months and days of the week. Since ride duration is not normally distributed (i.e., some rides lasted much longer than the average), Figure 5 shows both the mean (in light gray) and the median (in dark gray).


FIGURE 5 - Mean and median ride duration by month, by membership

FIGURE 6 - Mean ride duration over days of the week, by month and membership


Among casual riders, the mean ride duration was 20.5 minutes and the median was 13 minutes (i.e., half of these rides lasted at least 13 minutes). Ride duration varied across months and days of the week: casual riders took longer rides in the summer, and on Saturdays and Sundays.

Among annual members, the mean ride duration was 12.3 minutes and the median was 9 minutes. Their ride duration varied across months and days of the week in a similar way, but to a much lesser degree than among casual riders.

The difference in duration between casual riders and annual members, and between weekdays and weekend days, was relatively small in the winter months (i.e., November to January) and relatively large in the main season (i.e., March to October). This supports the idea that the purpose of rides likely varied across days of the week and seasons, especially among casual riders.
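
To illustrate why both the mean and the median are reported for ride duration, here is a small Python sketch. The durations are made up for the example and are not taken from the dataset; the report itself was produced in R.

```python
from statistics import mean, median

# Toy ride durations in minutes, by rider type. One very long ride per group
# pulls the mean up while leaving the median almost untouched.
durations = {
    "casual": [6, 9, 13, 25, 170],
    "member": [4, 7, 9, 11, 30],
}

summary = {
    rider_type: (mean(values), median(values))
    for rider_type, values in durations.items()
}

for rider_type, (mean_minutes, median_minutes) in summary.items():
    print(f"{rider_type}: mean {mean_minutes:.1f} min, median {median_minutes} min")
```

With a few very long rides in the data, the mean sits well above the median, which is why the median is the better guide to a typical ride.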


1.5 RECOMMENDATIONS


These recommendations come from the combined work of data analysts outside the Google Certificate, enrolees in the Google Certificate, and myself. The articles used to support this list are referenced in the next section.

  • Develop a relational database that links riders to their rides. This is critical: even without demographics, a rider ID alone would open up many possibilities for predicting membership uptake among casual riders.

  • Ensure bike availability across docking stations and monitor public sentiment toward the system. Given that people often ride with others, this means ensuring at least two working bikes per station. [link]

  • Consider sending feedback to casual riders who have used the system for over a year (e.g., “You started riding with us over a year ago! Did you know that riders who subscribed saved up to $X over the past year? Consider the annual membership!”).

  • Consider referral incentives for annual memberships. [link]

  • Consider optimizing the pricing model (mix of and price for subscriptions vs one-time rides) to encourage annual membership. [link]

  • Given the age distribution of the bike-share system usage, prioritise Gen X’ers, Millennials, and Gen Z’ers who are most likely to use it to go to work (and back), and therefore more likely to become annual members. [link]

  • Given that many also combine the bike-share system and the public transport system to go to work (and back), prioritise docking stations close to major public transit stops. [link]

  • Given the large fluctuation in rides across months, prioritise busier months for casual riders, i.e., May to October.

  • Alternatively, prioritise months with large month-on-month increases, i.e., May to October, which likely highlights new users.

  • Given that most rides by casual riders are done during the weekend, target an advertising campaign promoting the annual membership at casual riders who use the system on Saturdays and Sundays.

  • Given the pricing model, target an advertising campaign promoting the annual membership at casual riders who have taken rides of between 30 and 45 minutes. This duration is over the 30-minute limit of the “Single Ride” payment option but under the 45-minute limit of the “Annual Member” option.

  • Given that riders may be tourists, prioritise casual riders who actually live in Chicago. [link]

  • Given that casual rides plummet during the off-season, consider an alternative “seasonal” membership model for casual riders who only use the bike-share system during the summer period. [link]


1.6 REFERENCES


1.6.2 OTHER SOURCES


Blog posts from other data analysts that examined the bike-share system.

“What’s going on with Divvy availability? Let’s look at the data.” - Steven Lucy, August 2022

“What does the Bike Sharing Business look like in Chicago?” - Gabriela Baker, 2020

“Oh, the Places You’ll Go! Analyzing Chicago Divvy Bike Share Data” - Samaksh (Avi) Goyal, 2019


I also found blog posts that examined the user experience with the docking stations and the app, but did not take them into account since the focus of the project was on the Divvy public data.

“Divvy Bikes Review : Chicago Bike Sharing Puts Brakes on UX” - Will Scott, 2016

“UX Design Case Study: Divvy Bikes” - AKFantham, 2016





2. APPENDIX


With a combined sample of over 5 million observations, I skipped Excel and went straight to R. I also skipped formal statistical tests: with a sample this large, even trivially small differences would come out as statistically significant.


1. VARIABLE LIST

A first look at the dataset:

## Rows: 5,733,451
## Columns: 13
## $ ride_id            <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at           <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id   <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name   <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id     <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat          <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng          <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat            <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng            <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…

We have 13 variables:

  • ride_id, an id variable for each observation
  • rideable_type, telling us whether the ride was done with a classic, docked, or electric bike
  • started_at, the time at which the ride started
  • ended_at, the time at which the ride ended
  • start_station_name, the name of the station where the ride started
  • start_station_id, the id of the station where the ride started
  • end_station_name, the name of the station where the ride ended
  • end_station_id, the id of the station where the ride ended
  • start_lat, the latitude of the location where the ride started
  • start_lng, the longitude of the location where the ride started
  • end_lat, the latitude of the location where the ride ended
  • end_lng, the longitude of the location where the ride ended
  • member_casual, our key variable, whether the ride was made by an annual member or a casual rider.


2. MISSING DATA
## [1] "ride_id"
## [1] "0 %"
## [1] "rideable_type"
## [1] "0 %"
## [1] "started_at"
## [1] "0 %"
## [1] "ended_at"
## [1] "0 %"
## [1] "start_station_name"
## [1] "14.91 %"
## [1] "start_station_id"
## [1] "14.91 %"
## [1] "end_station_name"
## [1] "15.96 %"
## [1] "end_station_id"
## [1] "15.96 %"
## [1] "start_lat"
## [1] "0 %"
## [1] "start_lng"
## [1] "0 %"
## [1] "end_lat"
## [1] "0.1 %"
## [1] "end_lng"
## [1] "0.1 %"
## [1] "member_casual"
## [1] "0 %"

Only the start station and end station variables had a substantial amount of missing data (about 15% and 16% of rides, respectively).

For start stations, all missing data were in rides on electric bikes.

For end stations, some missing data were in rides on classic and docked bikes, but the large majority were in rides on electric bikes.

This means that: 1) many rides on electric bikes did not start and/or end at a docking station; 2) a very small number of riders on classic (or docked) bikes ended their ride by leaving the bike somewhere invalid (e.g., at the side of the road or in a lake).
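
The two checks described above (the percentage missing per variable, then the breakdown of missing values by bike type) can be sketched as follows. This is an illustrative Python example, not the R code behind the output shown; the function names `percent_missing` and `missing_by_bike_type` and the toy rows are assumptions.

```python
from collections import defaultdict

# Toy rows with made-up values; None marks a missing station name.
rides = [
    {"rideable_type": "electric_bike", "start_station_name": None,
     "end_station_name": None},
    {"rideable_type": "electric_bike", "start_station_name": "Clark St & Elm St",
     "end_station_name": None},
    {"rideable_type": "electric_bike", "start_station_name": "Millennium Park",
     "end_station_name": "Shedd Aquarium"},
    {"rideable_type": "classic_bike", "start_station_name": "Millennium Park",
     "end_station_name": None},
    {"rideable_type": "classic_bike", "start_station_name": "Shedd Aquarium",
     "end_station_name": "Millennium Park"},
]


def percent_missing(rows, column):
    """Percentage of rows in which the given column is missing."""
    missing = sum(1 for row in rows if row[column] is None)
    return round(100 * missing / len(rows), 2)


def missing_by_bike_type(rows, column):
    """Number of rows with a missing value in the column, per bike type."""
    counts = defaultdict(int)
    for row in rows:
        if row[column] is None:
            counts[row["rideable_type"]] += 1
    return dict(counts)


for column in ("start_station_name", "end_station_name"):
    print(column, f"{percent_missing(rides, column)} %",
          missing_by_bike_type(rides, column))
```

In this toy data, missing start stations occur only on electric bikes, while missing end stations occur mostly, but not only, on electric bikes, which mirrors the pattern described above.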


3. OUTLIERS


3.1 IMPOSSIBLE GEOGRAPHICAL COORDINATES


3.2 IMPOSSIBLE TIME DURATION

## [1] "Rows with a duration between 1 and 59 seconds"
## [1] 119686
## [1] "Percentage with a duration between 1 and 59 seconds"
## [1] "2.0875 %"
## [1] "Rows with a duration of zero or negative seconds, or over three hours"
## [1] 20198
## [1] "Percentage with a duration of zero or negative seconds, or over three hours"
## [1] "0.3523 %"
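
A tally like the one above can be obtained by computing each ride's duration from `started_at` and `ended_at` and assigning it to a category. The sketch below is in Python for illustration; the report itself was produced in R. The function names and all sample rides except the first (taken from the dataset preview in the variable list) are assumptions.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
THREE_HOURS = 3 * 60 * 60


def duration_seconds(started_at, ended_at):
    """Ride duration in seconds, from two timestamp strings."""
    start = datetime.strptime(started_at, TIME_FORMAT)
    end = datetime.strptime(ended_at, TIME_FORMAT)
    return (end - start).total_seconds()


def duration_category(seconds):
    """Assign a duration to one of the categories used in the output above."""
    if seconds <= 0 or seconds > THREE_HOURS:
        return "zero, negative, or over three hours"
    if seconds < 60:
        return "between 1 and 59 seconds"
    return "kept"


rides = [
    ("2021-12-07 15:06:07", "2021-12-07 15:13:42"),  # first previewed ride: 455 seconds
    ("2022-01-01 10:00:00", "2022-01-01 10:00:30"),  # made up: 30 seconds
    ("2022-01-01 10:00:00", "2022-01-01 09:59:00"),  # made up: ends before it starts
    ("2022-01-01 10:00:00", "2022-01-02 10:00:00"),  # made up: a full day
]

tally = Counter(duration_category(duration_seconds(start, end))
                for start, end in rides)
for category, n in tally.items():
    print(f"{category}: {n} ({100 * n / len(rides):.4f} %)")
```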


3.3 START STATION BUT NO END STATION AMONG CLASSIC/DOCK BIKE RIDES

## [1] "Rides with a start station but no end station"
## [1] 6685
## [1] "Percentage with a start station but no end station"
## [1] "0.2295 %"


4. EXTRA ANALYSES


4.1 Rides by casual riders between 30 and 45 minutes

## [1] "Rides by casual riders >= 30 and < 45 minutes"
## [1] 196907
## [1] "Percentage by casual riders >= 30 and < 45 minutes"
## [1] "8.6253 %"
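
The count above can be reproduced with a simple filter on rider type and duration. The sketch below is in Python for illustration (the report itself was produced in R); the function name `casual_rides_in_window`, the field name `duration_minutes`, and the sample rides are assumptions. It uses casual rides as the denominator, which appears to be the denominator behind the percentage reported above.

```python
# Toy rides with made-up durations.
rides = [
    {"member_casual": "casual", "duration_minutes": 12.0},
    {"member_casual": "casual", "duration_minutes": 30.0},   # lower bound is included
    {"member_casual": "casual", "duration_minutes": 44.9},
    {"member_casual": "casual", "duration_minutes": 45.0},   # upper bound is excluded
    {"member_casual": "member", "duration_minutes": 35.0},   # members are ignored
]


def casual_rides_in_window(rides, lower=30, upper=45):
    """Count casual rides with lower <= duration < upper, and their share of casual rides."""
    casual = [r for r in rides if r["member_casual"] == "casual"]
    in_window = [r for r in casual if lower <= r["duration_minutes"] < upper]
    return len(in_window), round(100 * len(in_window) / len(casual), 4)


count, share = casual_rides_in_window(rides)
print(f"Rides by casual riders >= 30 and < 45 minutes: {count} ({share} %)")
```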