The Google Data Analytics Professional Certificate requires completing a Capstone Project.
I chose Track 1, Case Study 1, “How Does a Bike-Share Navigate Speedy Success?”, to see: 1) what Google considers a legitimate scenario, 2) what other enrolees had done with this case study, and 3) whether I could add something new.
The full instructions for this Case Study are here (Coursera link)
The Google Certificate proposes breaking a project down into the following steps: Ask, Prepare, Process, Analyse, Share, and Act.
Section 1, Executive Report, presents the research question (Ask), a description of the data (Analyse, Share), and recommendations (Act). For the “Prepare” step, I sought additional information from press releases and news articles, blog posts from data analysts, and the projects of other Google Certificate enrolees.
Section 2, Appendix, then presents the data cleaning steps, with notes on missing data and outliers (Process).
The Case Study instructions introduce the following problem:
“You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently.”
The instructions propose to focus on the role of ride duration, day of the week, and variation across months, so let’s stick with that!
The Case Study is based on the real-world example of the Divvy bike-share company in Chicago.
History [link]
“Divvy is the European-inspired brainchild of former Chicago Mayor Richard M. Daley, who served the city from 1989 to 2011. It launched in June of 2013 with 750 bikes at 75 stations. Divvy is provided as a service of the Chicago Department of Transportation (CDOT) and is operated by private partner Motivate (formerly Alta Bicycle Share).”
Early results (2014) [link]
“A total of 97 percent of the responding members said they were “satisfied” or “very satisfied” with Divvy. In addition, when asked how likely on a scale of 1 to 10 they are to recommend Divvy to a friend, members responded on average with 9.1. A total of 80 percent of members are “somewhat more likely” or “much more likely” to patronize a business that is near a Divvy bike station.
Compared to what they were spending before they joined, members on average save $760 a year on travel, including auto expenses, taxis and other forms of public transit.
On average, members said they take nearly three trips each month that they would not have made if Divvy was not available. Reasons these trips wouldn’t have been taken include being too far to walk (44 percent), bicycle is faster or easier (36 percent), no bus/train or bus/train is inconvenient to that destination (32 percent), parking is limited or expensive at that destination (27 percent), or Divvy is cheaper than alternatives (23 percent).
Top things that motivated members to join Divvy: 1. Get around more easily and faster: 85%; 2. Station near home or work: 67%; 3. Like biking: 66%; 4. Save money on transportation: 51%; 5. Exercise and fitness: 51%.
Members “sometimes” or “often” use Divvy for the following purposes: 1. Go to/from work: 84%; 2. Social/entertainment: 82%; 3. Shopping/errands: 78%; 4. Go to/from transit: 76%; 5. Exercise/recreation: 57%.”
Current challenges (2022) [link]
“Anecdotally, bike availability has gotten worse, and an analysis of August and September shows things aren’t so great. And this is against a backdrop of price hikes and a shift away from classic bikes to more expensive e-bikes.
Still, there were 914,000 trips in August, a new record, so perhaps some of these issues are just growing pains, as the Divvy team learns to adapt to a more heavily-used system. Lyft also says it is on track to complete the deployment of 10,500 new e-bikes required by the latest contract renewal. While not everyone wants to pay extra for an e-bike, a higher number of bikes in the system overall should help increase the availability of classic bikes.
One possible way forward on the staffing front: Lyft is working with the nonprofit bike shop Working Bikes on a new mechanic training program, which saw its first graduates in April, to try to alleviate the shortage of bike mechanics.”
Schematic map of the ride-share service area
Interactive map of Divvy stations
More information on the Divvy bike sharing platform
More information on the Divvy pricing model
The Case Study instructions ask to use the last 12 months of data on this website:
More information on the Divvy data
Importantly, the rows represent rides, not riders.
I will use the 12 monthly files from December 2021 to November 2022 (since I started the project on December 21st, 2022).
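Combining the monthly files could look roughly like the sketch below, in base R. The helper name `read_monthly_files`, the folder, and the file names are assumptions made for illustration. The demo writes two tiny made-up files to a temporary folder so that it runs on its own.

```r
# Hypothetical helper: read every CSV in a folder and stack them into one data frame.
read_monthly_files <- function(folder) {
  files <- sort(list.files(folder, pattern = "\\.csv$", full.names = TRUE))
  monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, monthly)
}

# Toy demonstration with two made-up monthly files (not the real Divvy data).
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202112-demo-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202201-demo-tripdata.csv"), row.names = FALSE)

rides <- read_monthly_files(demo_dir)
nrow(rides)  # one row per ride across all files
```

Stacking with `rbind` assumes every monthly file has the same columns, which would need checking before trusting the combined table.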
The dataset records when and at which docking stations each ride started and ended, whether the bike was classic or electric, and whether the rider was a casual rider or an annual member.
A total of 5,733,451 rides were available for analysis between December 2021 and November 2022.
During data cleaning, a small number of rides were removed because they had: 1) started at a dock station but not ended at one, 2) an invalid ride duration (e.g., negative or lasting entire days), or 3) invalid geographical coordinates (i.e., not in Chicago). Rides shorter than one minute were also removed to conform with Divvy’s data cleaning policy.
Analyses were therefore run on 5,592,668 valid rides (i.e., 97.5% of the sample).
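As a rough illustration of these cleaning rules, here is a minimal base R sketch on a made-up six-row data frame. The bounding box used for "in Chicago" is my assumption, not a value from the project. The three-hour upper limit is taken from the appendix.

```r
# Toy rides (made up) covering each cleaning rule.
rides <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "electric_bike",
                    "electric_bike", "classic_bike", "classic_bike"),
  started_at = as.POSIXct(c("2022-06-01 08:00:00", "2022-06-01 09:00:00",
                            "2022-06-01 10:00:00", "2022-06-01 11:00:00",
                            "2022-06-01 12:00:00", "2022-06-01 13:00:00"), tz = "UTC"),
  ended_at = as.POSIXct(c("2022-06-01 08:15:00", "2022-06-01 09:00:30",
                          "2022-06-01 10:20:00", "2022-06-01 11:20:00",
                          "2022-06-01 12:10:00", "2022-06-01 17:30:00"), tz = "UTC"),
  start_station_name = c("Millennium Park", "Shedd Aquarium", NA,
                         "Clark St & Elm St", "Wells St & Elm St", "Millennium Park"),
  end_station_name = c("Shedd Aquarium", "Shedd Aquarium", NA,
                       "Wells St & Elm St", NA, "Shedd Aquarium"),
  end_lat = c(41.87, 41.87, 41.90, 0, 41.90, 41.87),
  end_lng = c(-87.62, -87.62, -87.63, 0, -87.63, -87.62),
  stringsAsFactors = FALSE
)

duration_sec <- as.numeric(difftime(rides$ended_at, rides$started_at, units = "secs"))

# Rule: at least one minute, at most three hours.
valid_duration <- duration_sec >= 60 & duration_sec <= 3 * 3600
# Rule: end coordinates inside an ASSUMED bounding box around Chicago.
valid_coords <- !is.na(rides$end_lat) & !is.na(rides$end_lng) &
  rides$end_lat > 41.6 & rides$end_lat < 42.1 &
  rides$end_lng > -87.95 & rides$end_lng < -87.5
# Rule: classic/docked rides that start at a station must also end at one.
undocked <- rides$rideable_type != "electric_bike" &
  !is.na(rides$start_station_name) & is.na(rides$end_station_name)

clean <- rides[valid_duration & valid_coords & !undocked, ]
round(100 * nrow(clean) / nrow(rides), 1)  # share of rides kept
```

In this toy example only the first and third rides survive. The electric bike ride without stations is kept because electric bikes can be parked outside docking stations.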
Overall, 40.8% of rides were done by casual riders and 59.2% by annual members. Comparing ride type (i.e., classic bikes versus electric bike), 54.0% of rides were done on e-bikes among casual riders compared to 48.6% among annual members.
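Shares like these can be obtained with `table()` and `prop.table()`. The sketch below uses a made-up five-row data frame, so the numbers are not those of the real dataset.

```r
# Toy data (made up): one row per ride.
rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  rideable_type = c("electric_bike", "classic_bike", "electric_bike",
                    "classic_bike", "classic_bike"),
  stringsAsFactors = FALSE
)

# Share of rides by rider type, in percent.
share_by_rider <- round(100 * prop.table(table(rides$member_casual)), 1)

# Share of bike types within each rider type (rows sum to 100).
share_by_bike <- round(
  100 * prop.table(table(rides$member_casual, rides$rideable_type), margin = 1), 1
)

share_by_rider
share_by_bike
```

Setting `margin = 1` makes each row a rider type, which matches the way the e-bike percentages are reported "among casual riders" and "among annual members".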
Figure 1 presents the geographical distribution of 1,306 stations, highlighting the 50 most common start and end dock stations:
FIGURE 1 - INTERACTIVE MAP OF DOCKING STATIONS
Top start stations are in pale blue, top end stations are in pale red, and stations that are both appear in darker red.
Table 1 lists the most common start and end dock stations:
TABLE 1 - MOST COMMON DOCKING STATIONS
| Start station | Total | % Casual | | End station | Total | % Casual |
|---|---|---|---|---|---|---|
| Streeter Dr & Grand Ave | 39962 | 81.4 | /// | Streeter Dr & Grand Ave | 40439 | 83.6 |
| DuSable Lake Shore Dr & Monroe St | 23084 | 83.9 | /// | DuSable Lake Shore Dr & Monroe St | 21128 | 80.8 |
| Michigan Ave & Oak St | 19831 | 70.5 | /// | Michigan Ave & Oak St | 19767 | 72.7 |
| DuSable Lake Shore Dr & North Blvd | 17810 | 63.7 | /// | DuSable Lake Shore Dr & North Blvd | 19614 | 66.0 |
| Millennium Park | 15854 | 80.9 | /// | Millennium Park | 16290 | 83.5 |
| Theater on the Lake | 15558 | 62.0 | /// | Wells St & Concord Ln | 14955 | 43.7 |
| Wells St & Concord Ln | 15078 | 46.3 | /// | Theater on the Lake | 14691 | 64.3 |
| Clark St & Armitage Ave | 12909 | 49.7 | /// | Clark St & Elm St | 13039 | 39.2 |
| Shedd Aquarium | 12809 | 85.9 | /// | Clark St & Armitage Ave | 12615 | 48.8 |
| Clark St & Elm St | 12619 | 39.8 | /// | Broadway & Barry Ave | 12186 | 40.7 |
| Broadway & Barry Ave | 12182 | 42.4 | /// | Clark St & Lincoln Ave | 11764 | 53.1 |
| Wells St & Elm St | 11889 | 43.3 | /// | Wells St & Elm St | 11697 | 41.7 |
| Clark St & Lincoln Ave | 11874 | 52.2 | /// | Shedd Aquarium | 11297 | 84.4 |
| Clark St & Wrightwood Ave | 10144 | 42.8 | /// | Lakeview Ave & Fullerton Pkwy | 10437 | 46.7 |
| Wells St & Huron St | 10116 | 38.5 | /// | Wabash Ave & Grand Ave | 10434 | 52.6 |
Rides are done more often in the Chicago downtown area, and from and to dock stations near major landmarks and public transit hubs.
The “% Casual” column supports the idea that the purpose of rides differs across stations. For instance, rides starting at the “Millennium Park” station were mostly by casual riders leaving the park, whereas rides starting at the “Clark St & Elm St” station were mostly by annual members, e.g., moving to and from work.
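A table of this kind (total rides and share of casual riders per start station) could be built along these lines in base R. The data frame is made up, and rides without a start station are dropped first, which is an assumption about how the table was produced.

```r
# Toy data (made up): start station and rider type for six rides.
rides <- data.frame(
  start_station_name = c("Millennium Park", "Millennium Park", "Millennium Park",
                         "Clark St & Elm St", "Clark St & Elm St", NA),
  member_casual = c("casual", "casual", "member", "member", "member", "casual"),
  stringsAsFactors = FALSE
)

# Keep rides that started at a named station.
docked <- rides[!is.na(rides$start_station_name), ]

totals <- table(docked$start_station_name)
pct_casual <- tapply(docked$member_casual == "casual",
                     docked$start_station_name, mean) * 100

top_start <- data.frame(
  station = names(totals),
  total = as.vector(totals),
  pct_casual = round(as.vector(pct_casual[names(totals)]), 1),
  stringsAsFactors = FALSE
)
# Busiest stations first; head() keeps the top 15 as in Table 1.
top_start <- top_start[order(-top_start$total), ]
head(top_start, 15)
```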
Figures 2 and 3 present the seasonal variation in rides among casual riders and annual members by month and day.
FIGURE 2 - Monthly rides, by membership status
FIGURE 3 - Daily rides, by membership status
Annual members took more rides than casual riders throughout the year, except in June, around the start of summer, when both groups used the bike-share system to a similar extent.
FIGURE 4 - Rides over days of the week, by month and membership status
Figure 4 shows that rides were more common among casual riders on Saturday and Sunday, particularly in July and October, whereas rides were more common among annual members on weekdays.
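The counts behind figures like these can be sketched in base R by deriving the month and the weekday from `started_at`. The data below are made up. The `"%u"` format (1 = Monday to 7 = Sunday) is used because it does not depend on the locale.

```r
# Toy data (made up): five rides in early July 2022.
rides <- data.frame(
  started_at = as.POSIXct(c("2022-07-02 10:00:00", "2022-07-03 11:00:00",
                            "2022-07-04 08:00:00", "2022-07-05 08:30:00",
                            "2022-07-09 15:00:00"), tz = "UTC"),
  member_casual = c("casual", "casual", "member", "member", "casual"),
  stringsAsFactors = FALSE
)

# ISO weekday number: 1 = Monday ... 7 = Sunday.
rides$weekday_num <- as.integer(format(rides$started_at, "%u"))
rides$day_type <- ifelse(rides$weekday_num >= 6, "weekend", "weekday")
rides$month <- format(rides$started_at, "%Y-%m")

# Rides by rider type and day type, and by month and rider type.
by_day_type <- table(rides$member_casual, rides$day_type)
by_month <- table(rides$month, rides$member_casual)

by_day_type
by_month
```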
Figures 5 and 6 present the distribution of ride duration over months and days of the week. Since ride duration is not normally distributed (i.e., some rides lasted much longer than the average), Figure 5 shows both the mean (in light gray) and median (in dark gray).
FIGURE 5 - Mean and median ride duration by month, by membership
FIGURE 6 - Mean ride duration over days of the week, by month and membership
Among casual riders, the average ride time was 20.5 minutes and 50% of these rides lasted at least 13 minutes. This average duration varied over months and days of the week, with casual riders being more likely to have longer rides in the summer season, and on Saturday and Sunday.
Among annual members, the average ride time was 12.3 minutes and 50% of these rides lasted at least 9 minutes. Their average duration varied across months and days of the week in a similar way, but to a much lesser degree than among casual riders.
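To show why both the mean and the median are reported, here is a small base R sketch with made-up durations. A single long ride pulls the casual mean well above the median, which is the pattern a right-skewed distribution produces.

```r
# Toy durations in minutes (made up); one long casual ride skews the mean.
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  duration_min = c(8, 13, 70, 7, 9, 14),
  stringsAsFactors = FALSE
)

mean_min <- tapply(rides$duration_min, rides$member_casual, mean)
median_min <- tapply(rides$duration_min, rides$member_casual, median)

round(rbind(mean = mean_min, median = median_min), 1)
```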
The difference in duration between casual riders and annual members and between week and weekend days was relatively small in the winter months (i.e., November to January), and relatively large in the main season (i.e., March to October). This supports the idea that the purpose of rides likely varied across days of the week and seasons, especially among casual riders.
These recommendations come from the combined work of data analysts outside the Google Certificate, enrolees in the Google Certificate, and myself. The articles used to support this list are referenced in the next section.
Develop a relational database system that links riders with rides. This is critical. Even without rider demographics, just having the rider’s id would open up many possibilities for predicting membership uptake among casual riders.
Ensure bike availability across docking stations and monitor public sentiment toward the system. Given that people often ride with others, this means ensuring at least two working bikes per station. [link]
Consider providing feedback to casual riders who have used the system for over a year (e.g., “You started with us over a year ago! Did you know that riders who subscribed saved up to $X over the past year? Consider the annual membership!”).
Consider referral incentives for annual memberships. [link]
Consider optimizing the pricing model (mix of and price for subscriptions vs one-time rides) to encourage annual membership. [link]
Given the age distribution of the bike-share system usage, prioritise Gen X’ers, Millennials, and Gen Z’ers who are most likely to use it to go to work (and back), and therefore more likely to become annual members. [link]
Given that many also combine the bike-share system and the public transport system to go to work (and back), prioritise docking stations close to major public transit stops. [link]
Given the large fluctuation in rides across months, prioritise busier months for casual riders, i.e., May to October.
Alternatively, prioritise months with large month-on-month increases, i.e., May to October, which likely highlights new users.
Given that most rides by casual riders are done during the weekend, tailor an advertisement campaign to casual riders that use the system on Saturday and Sunday to consider the annual membership.
Given the pricing model, tailor an advertisement campaign to casual riders who have done rides between 30 and 45 minutes to consider the annual membership. This duration is over the 30-minute limit of the “Single Ride” payment option but under the 45-minute limit of the “Annual Member” option.
Given that riders may be tourists, prioritise casual riders who actually live in Chicago. [link]
Given that casual rides plummet during the off-season, consider an alternative “seasonal” membership model for casual riders who only use the bike-share system during the summer period. [link]
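The first recommendation (linking riders with rides) can be illustrated with a purely hypothetical sketch. The public Divvy data has no rider identifier, so the `rider_id` and `home_city` columns and the two-ride threshold below are all invented for the example.

```r
# Hypothetical rider table (no such table exists in the public data).
riders <- data.frame(
  rider_id = c(1, 2),
  home_city = c("Chicago", "Boston"),
  stringsAsFactors = FALSE
)
# Hypothetical rides table carrying the rider's id.
rides <- data.frame(
  ride_id = c("a", "b", "c", "d"),
  rider_id = c(1, 1, 2, 1),
  member_casual = rep("casual", 4),
  stringsAsFactors = FALSE
)

# Count rides per rider, then join the counts to the rider table.
rides_per_rider <- aggregate(ride_id ~ rider_id, data = rides, FUN = length)
names(rides_per_rider)[2] <- "n_rides"
linked <- merge(riders, rides_per_rider, by = "rider_id")

# Flag repeat casual riders (the threshold of 2 rides is arbitrary).
linked$repeat_rider <- linked$n_rides >= 2
linked
```

With a link like this, repeat casual riders living in Chicago could be separated from one-off visitors, which several of the recommendations above depend on.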
Blog posts from Coursera enrolees:
Blog posts from other data analysts that examined the bike-share system.
“What’s going on with Divvy availability? Let’s look at the data.” - Steven Lucy, August 2022
“What does the Bike Sharing Business look like in Chicago?” - Gabriela Baker, 2020
“Oh, the Places You’ll Go! Analyzing Chicago Divvy Bike Share Data” - Samaksh (Avi) Goyal, 2019
I also found blog posts that examined the user experience with the docking stations and the app, but did not take them into account since the focus of the project was on the Divvy public data.
“Divvy Bikes Review : Chicago Bike Sharing Puts Brakes on UX” - Will Scott, 2016
“UX Design Case Study: Divvy Bikes” - AKFantham, 2016
With a combined sample of over 5 million observations, I skipped Excel and went straight to R. I also skipped formal statistical tests: with a sample this large, even trivially small differences would come out as statistically significant.
A first look at the dataset:
## Rows: 5,733,451
## Columns: 13
## $ ride_id <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual <chr> "member", "casual", "member", "member", "member", "…
We have 13 variables. The percentage of missing values for each is:
## [1] "ride_id"
## [1] "0 %"
## [1] "rideable_type"
## [1] "0 %"
## [1] "started_at"
## [1] "0 %"
## [1] "ended_at"
## [1] "0 %"
## [1] "start_station_name"
## [1] "14.91 %"
## [1] "start_station_id"
## [1] "14.91 %"
## [1] "end_station_name"
## [1] "15.96 %"
## [1] "end_station_id"
## [1] "15.96 %"
## [1] "start_lat"
## [1] "0 %"
## [1] "start_lng"
## [1] "0 %"
## [1] "end_lat"
## [1] "0.1 %"
## [1] "end_lng"
## [1] "0.1 %"
## [1] "member_casual"
## [1] "0 %"
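Percentages like the ones above can be computed per column along these lines. This is a base R sketch on a made-up data frame. Treating empty strings as missing is an assumption, added because `read.csv` keeps blank text fields as `""` rather than `NA`.

```r
# Toy data (made up) with a few missing values.
rides <- data.frame(
  ride_id = c("a", "b", "c", "d"),
  start_station_name = c("Millennium Park", NA, "", "Shedd Aquarium"),
  end_lat = c(41.88, 41.87, NA, 41.87),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column (NA or empty string).
pct_missing <- sapply(rides, function(col) {
  missing <- is.na(col) | col %in% ""
  round(100 * mean(missing), 2)
})

for (name in names(pct_missing)) {
  print(name)
  print(paste(pct_missing[[name]], "%"))
}
```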
Only the start station and end station variables had a substantial amount of missing data.
For start stations, all missing data were in rides on electric bikes.
For end stations, some missing data were in rides on classic and docked bikes, but the large majority were in rides on electric bikes.
This means that: 1) many rides on electric bikes were simply not started and/or ended at a normal docking station; 2) a very small number of those who rode with a classic (or docked) bike ended their ride with their bike left somewhere invalid (e.g., the side of the road or in a lake).
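A cross-tabulation of missing station names by bike type is one way to check this pattern. The sketch below uses made-up rows in base R.

```r
# Toy data (made up): bike type and station names for five rides.
rides <- data.frame(
  rideable_type = c("electric_bike", "electric_bike", "classic_bike",
                    "classic_bike", "docked_bike"),
  start_station_name = c(NA, "Clark St & Elm St", "Millennium Park",
                         "Shedd Aquarium", "Shedd Aquarium"),
  end_station_name = c(NA, NA, "Shedd Aquarium", NA, "Millennium Park"),
  stringsAsFactors = FALSE
)

# Rows are bike types; the TRUE column counts rides with a missing station.
start_missing <- table(rides$rideable_type, is.na(rides$start_station_name),
                       dnn = c("rideable_type", "start_missing"))
end_missing <- table(rides$rideable_type, is.na(rides$end_station_name),
                     dnn = c("rideable_type", "end_missing"))

start_missing
end_missing
```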
3.1 IMPOSSIBLE GEOGRAPHICAL COORDINATES
3.2 IMPOSSIBLE TIME DURATION
## [1] "Rows with a duration between 1 and 59 seconds"
## [1] 119686
## [1] "Percentage with a duration between 1 and 59 seconds"
## [1] "2.0875 %"
## [1] "Rows with a duration of zero or negative seconds, or over three hours"
## [1] 20198
## [1] "Percentage with a duration of zero or negative seconds, or over three hours"
## [1] "0.3523 %"
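The two duration checks reported above can be sketched as follows in base R, on made-up timestamps. The thresholds (1 to 59 seconds, and zero, negative, or over three hours) are the ones named in the output.

```r
# Toy timestamps (made up): 45 s, 12 min, -1 min, and 4 h rides.
started_at <- as.POSIXct(c("2022-03-01 08:00:00", "2022-03-01 09:00:00",
                           "2022-03-01 10:00:00", "2022-03-01 11:00:00"), tz = "UTC")
ended_at <- as.POSIXct(c("2022-03-01 08:00:45", "2022-03-01 09:12:00",
                         "2022-03-01 09:59:00", "2022-03-01 15:00:00"), tz = "UTC")

duration_sec <- as.numeric(difftime(ended_at, started_at, units = "secs"))

# Rides between 1 and 59 seconds.
too_short <- duration_sec >= 1 & duration_sec <= 59
# Rides of zero or negative seconds, or over three hours.
impossible <- duration_sec <= 0 | duration_sec > 3 * 3600

sum(too_short)
paste(round(100 * mean(too_short), 4), "%")
sum(impossible)
paste(round(100 * mean(impossible), 4), "%")
```

Passing `units = "secs"` to `difftime` matters here, because its default picks the unit automatically and could return minutes or hours instead.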
3.3 START STATION BUT NO END STATION AMONG CLASSIC/DOCK BIKE RIDES
## [1] "Rides with a start station but no end station"
## [1] 6685
## [1] "Percentage with a start station but no end station"
## [1] "0.2295 %"
4.1 Rides by casual riders between 30 and 45 minutes
## [1] "Rides by casual riders >= 30 and < 45 minutes"
## [1] 196907
## [1] "Percentage by casual riders >= 30 and < 45 minutes"
## [1] "8.6253 %"
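A count like this could be produced as in the base R sketch below, on made-up durations. Using casual rides only as the denominator of the percentage is my assumption about how the figure above was computed.

```r
# Toy data (made up): ride durations in minutes.
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "casual", "member"),
  duration_min = c(12, 30, 44.9, 45, 35),
  stringsAsFactors = FALSE
)

casual <- rides[rides$member_casual == "casual", ]
# At least 30 minutes and strictly under 45 minutes.
target <- casual$duration_min >= 30 & casual$duration_min < 45

sum(target)
paste(round(100 * mean(target), 4), "%")
```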