INTRODUCTION

The Google Data Analytics Professional Certificate requires completing a Capstone Project.

I chose Track 1, Case Study 1, “How Does a Bike-Share Navigate Speedy Success?”, to see: 1) what Google considers a legitimate scenario, and 2) what other enrolees had done with this case study and whether I could add something new.

The full instructions for this Case Study are here (Coursera link)

The Google Certificate proposes breaking a project down into the following steps: Ask, Prepare, Process, Analyse, Share, and Act.

Section 1 (Executive Report) presents the research question (Ask), a description of the data (Analyse, Share), and recommendations (Act). For the “Prepare” step, I gathered additional information from press releases, news articles, blog posts by data analysts, and the projects of other Google Certificate enrolees.

Section 2 (Appendix) then presents the data cleaning steps, with notes on missing data and outliers (Process).

1. EXECUTIVE REPORT

1.1 CASE STUDY

The Case Study instructions introduce the following problem:

“You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently.”

The instructions suggest focusing on the role of ride duration, day of the week, and variation across months, so let’s stick with that!

1.2 CONTEXT

The Case Study is based on the real-world example of the Divvy bike-share company in Chicago.

History [link]

“Divvy is the European-inspired brainchild of former Chicago Mayor Richard M. Daley, who served the city from 1989 to 2011. It launched in June of 2013 with 750 bikes at 75 stations. Divvy is provided as a service of the Chicago Department of Transportation (CDOT) and is operated by private partner Motivate (formerly Alta Bicycle Share).”

Early results (2014) [link]

“A total of 97 percent of the responding members said they were “satisfied” or “very satisfied” with Divvy. In addition, when asked how likely on a scale of 1 to 10 they are to recommend Divvy to a friend, members responded on average with 9.1. A total of 80 percent of members are “somewhat more likely” or “much more likely” to patronize a business that is near a Divvy bike station.

Compared to what they were spending before they joined, members on average save $760 a year on travel, including auto expenses, taxis and other forms of public transit.

On average, members said they take nearly three trips each month that they would not have made if Divvy was not available. Reasons these trips wouldn’t have been taken include being too far to walk (44 percent), bicycle is faster or easier (36 percent), no bus/train or bus/train is inconvenient to that destination (32 percent), parking is limited or expensive at that destination (27 percent), or Divvy is cheaper than alternatives (23 percent).

Top things that motivated members to join Divvy: 1. Get around more easily and faster: 85%; 2. Station near home or work: 67%; 3. Like biking: 66%; 4. Save money on transportation: 51%; 5. Exercise and fitness: 51%.

Members “sometimes” or “often” use Divvy for the following purposes: 1. Go to/from work: 84%; 2. Social/entertainment: 82%; 3. Shopping/errands: 78%; 4. Go to/from transit: 76%; 5. Exercise/recreation: 57%.”

Current challenges (2022) [link]

“Anecdotally, bike availability has gotten worse, and an analysis of August and September shows things aren’t so great. And this is against a backdrop of price hikes and a shift away from classic bikes to more expensive e-bikes.

Still, there were 914,000 trips in August, a new record, so perhaps some of these issues are just growing pains, as the Divvy team learns to adapt to a more heavily-used system. Lyft also says it is on track to complete the deployment of 10,500 new e-bikes required by the latest contract renewal. While not everyone wants to pay extra for an e-bike, a higher number of bikes in the system overall should help increase the availability of classic bikes.

One possible way forward on the staffing front: Lyft is working with the nonprofit bike shop Working Bikes on a new mechanic training program, which saw its first graduates in April, to try to alleviate the shortage of bike mechanics.”

Schematic map of the ride-share service area

Interactive map of Divvy stations

More information on the Divvy bike sharing platform

More information on the Divvy pricing model

1.3 DATASET

The Case Study instructions ask us to use the most recent 12 months of data from this website:

Link to Divvy database

More information on the Divvy data

Importantly, the rows represent rides, not riders.

The City of Chicago shares the same data with additional information on the age and sex of riders, but this will not be explored here.

I will use the 12 monthly files from December 2021 to November 2022, since I started the project on 21 December 2022.
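As a minimal sketch of this step, assuming the 12 monthly CSV files have been downloaded and unzipped into a single folder, the files can be read and stacked in base R. The helper name read_monthly_rides and the demo file names are hypothetical; the two toy files below only stand in for the real Divvy downloads.

```r
# Hypothetical helper: read every CSV file in `dir` and stack the rows.
# Empty strings are read as NA (an assumption about how blanks should be treated).
read_monthly_rides <- function(dir, pattern = "\\.csv$") {
  files <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  if (length(files) == 0) stop("No CSV files found in ", dir)
  monthly <- lapply(files, read.csv, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
  do.call(rbind, monthly)
}

# Toy demonstration: two tiny "monthly" files in a fresh temporary folder.
demo_dir <- tempfile("divvy_demo_")
dir.create(demo_dir)
write.csv(data.frame(ride_id = c("A1", "A2"),
                     member_casual = c("member", "casual")),
          file.path(demo_dir, "202112-demo.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202201-demo.csv"), row.names = FALSE)

rides <- read_monthly_rides(demo_dir)
nrow(rides)
```

Stacking with rbind assumes every monthly file has the same columns, which is worth checking before combining the real files.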

1.4 RESULTS

1.4.1 INTRODUCTION

The dataset records, for each ride, the start and end times, the start and end docking stations, whether the bike was classic or electric, and whether the rider was a casual rider or an annual member.

A total of 5,733,451 rides were available for analysis between December 2021 and November 2022.

During data cleaning, I excluded a small number of rides that had: 1) started at a docking station but not ended at one, 2) an invalid ride duration (e.g., zero or negative, or over three hours; see the Appendix), or 3) invalid geographical coordinates (i.e., not in Chicago). I also removed rides that lasted less than a minute, to conform with Divvy’s data cleaning policy.

Analyses were therefore based on 5,592,668 valid rides (i.e., 97.5% of the sample).

Overall, 40.8% of rides were made by casual riders and 59.2% by annual members. By bike type (classic versus electric), 54.0% of rides by casual riders were on e-bikes, compared to 48.6% of rides by annual members.
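Shares like these can be computed with table() and prop.table(). The sketch below uses a five-row toy data frame, not the real data, with the column names from the variable list in the Appendix; the object names are my own.

```r
# Toy data standing in for the cleaned rides (values are illustrative only).
rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  rideable_type = c("electric_bike", "classic_bike", "electric_bike",
                    "classic_bike", "classic_bike")
)

# Share of all rides by membership status, in percent.
membership_share <- round(100 * prop.table(table(rides$member_casual)), 1)

# Share of each bike type within a membership group (each row sums to 100).
type_by_membership <- round(
  100 * prop.table(table(rides$member_casual, rides$rideable_type), margin = 1),
  1
)

membership_share
type_by_membership
```

Using margin = 1 gives percentages within each membership group, which is the comparison made in the text (e-bike share among casual riders versus among members).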

1.4.2 GEOGRAPHICAL VARIATION

Figure 1 presents the geographical distribution of 1,306 stations, highlighting the 50 most common start and end dock stations:

FIGURE 1 - INTERACTIVE MAP OF DOCKING STATIONS

Top start stations are in pale blue, top end stations are in pale red, and stations that are both appear in darker red.

Table 1 lists the most common start and end dock stations:

TABLE 1 - MOST COMMON DOCKING STATIONS

Start station	Total	% Casual		End station	Total	% Casual
Streeter Dr & Grand Ave	39962	81.4	///	Streeter Dr & Grand Ave	40439	83.6
DuSable Lake Shore Dr & Monroe St	23084	83.9	///	DuSable Lake Shore Dr & Monroe St	21128	80.8
Michigan Ave & Oak St	19831	70.5	///	Michigan Ave & Oak St	19767	72.7
DuSable Lake Shore Dr & North Blvd	17810	63.7	///	DuSable Lake Shore Dr & North Blvd	19614	66.0
Millennium Park	15854	80.9	///	Millennium Park	16290	83.5
Theater on the Lake	15558	62.0	///	Wells St & Concord Ln	14955	43.7
Wells St & Concord Ln	15078	46.3	///	Theater on the Lake	14691	64.3
Clark St & Armitage Ave	12909	49.7	///	Clark St & Elm St	13039	39.2
Shedd Aquarium	12809	85.9	///	Clark St & Armitage Ave	12615	48.8
Clark St & Elm St	12619	39.8	///	Broadway & Barry Ave	12186	40.7
Broadway & Barry Ave	12182	42.4	///	Clark St & Lincoln Ave	11764	53.1
Wells St & Elm St	11889	43.3	///	Wells St & Elm St	11697	41.7
Clark St & Lincoln Ave	11874	52.2	///	Shedd Aquarium	11297	84.4
Clark St & Wrightwood Ave	10144	42.8	///	Lakeview Ave & Fullerton Pkwy	10437	46.7
Wells St & Huron St	10116	38.5	///	Wabash Ave & Grand Ave	10434	52.6

Rides are concentrated in downtown Chicago, and start and end most often at docking stations near major landmarks and public transit hubs.

The “% Casual” column suggests that the purpose of rides differs across stations. For instance, rides starting at the “Millennium Park” station were mostly made by casual riders (80.9%), presumably leaving the park, whereas rides starting at the “Clark St & Elm St” station were mostly made by annual members (39.8% casual), e.g., travelling to and from work.
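A table like Table 1 can be built by counting rides per station and taking the share of casual rides within each station. This is a sketch on toy data; the helper name top_stations is hypothetical, and rides without a station name are simply dropped here.

```r
# Toy data: station names are real, the counts are not.
rides <- data.frame(
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Streeter Dr & Grand Ave", "Clark St & Elm St",
                         "Clark St & Elm St", NA),
  member_casual = c("casual", "casual", "member", "member", "member", "casual")
)

# Hypothetical helper: the n busiest stations with their share of casual rides.
top_stations <- function(station, member_casual, n = 15) {
  keep <- !is.na(station)                       # drop rides with no station
  total <- tapply(member_casual[keep], station[keep], length)
  pct_casual <- tapply(member_casual[keep] == "casual", station[keep], mean)
  out <- data.frame(station = names(total),
                    total = as.vector(total),
                    pct_casual = round(100 * as.vector(pct_casual), 1))
  head(out[order(-out$total), ], n)             # busiest stations first
}

start_table <- top_stations(rides$start_station_name, rides$member_casual)
start_table
```

The same helper can be called with end_station_name to produce the right-hand side of the table.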

1.4.3 SEASONAL VARIATION

Figures 2 and 3 present the seasonal variation in rides among casual riders and annual members by month and day.

FIGURE 2 - Monthly rides, by membership status

FIGURE 3 - Daily rides, by membership status

Annual members accounted for more rides than casual riders throughout the year, except in June, around the start of summer, when both groups used the bike-share system to a similar extent.

FIGURE 4 - Rides over days of the week, by month and membership status

Figure 4 shows that rides by casual riders were more common on Saturday and Sunday, particularly during the months of July and October, whereas rides by annual members were more common on weekdays.
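The counts behind a figure of this kind can be derived from started_at. The sketch below is an assumption about one way to do it, on toy timestamps; it takes the weekday from as.POSIXlt()$wday so that the labels do not depend on the system locale.

```r
# Toy rides in July 2022 (2 July 2022 was a Saturday).
rides <- data.frame(
  started_at = as.POSIXct(c("2022-07-02 10:00:00", "2022-07-03 11:00:00",
                            "2022-07-04 08:00:00", "2022-07-05 08:30:00",
                            "2022-07-09 15:00:00"), tz = "UTC"),
  member_casual = c("casual", "casual", "member", "member", "casual")
)

# wday runs from 0 (Sunday) to 6 (Saturday), hence the + 1 for indexing.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
rides$weekday <- factor(day_labels[as.POSIXlt(rides$started_at)$wday + 1],
                        levels = day_labels)
rides$month <- format(rides$started_at, "%Y-%m")

# Rides by month, weekday and membership status.
counts <- table(month = rides$month, weekday = rides$weekday,
                membership = rides$member_casual)

# Share of each group's rides that fall on a weekend.
rides$is_weekend <- rides$weekday %in% c("Sat", "Sun")
weekend_share <- tapply(rides$is_weekend, rides$member_casual, mean)
weekend_share
```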

1.4.4 RIDE DURATION

Figures 5 and 6 present the distribution of ride duration over months and days of the week. Since ride duration is right-skewed rather than normally distributed (i.e., some rides lasted much longer than the average), Figure 5 shows both the mean (in light gray) and the median (in dark gray).

FIGURE 5 - Mean and median ride duration by month, by membership

FIGURE 6 - Mean ride duration over days of the week, by month and membership

Among casual riders, the mean ride duration was 20.5 minutes and the median was 13 minutes (i.e., half of these rides lasted at least 13 minutes). Mean duration varied over months and days of the week, with casual riders taking longer rides in the summer and on Saturday and Sunday.

Among annual members, the mean ride duration was 12.3 minutes and the median was 9 minutes. Mean duration among annual members varied across months and days of the week in a similar way, but to a much lesser degree than among casual riders.

The difference in duration between casual riders and annual members, and between weekdays and weekend days, was relatively small in the winter months (i.e., November to January) and relatively large in the main season (i.e., March to October). This supports the idea that the purpose of rides likely varies across days of the week and seasons, especially among casual riders.
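To illustrate why both the mean and the median are reported, here is a small sketch on toy data: duration is computed from started_at and ended_at with difftime(), then summarised by membership status. One long casual ride pulls the mean well above the median.

```r
# Toy rides: casual durations are 10, 13 and 60 minutes; member 8, 9 and 10.
start <- as.POSIXct("2022-06-04 14:00:00", tz = "UTC")
rides <- data.frame(
  started_at = rep(start, 6),
  ended_at = start + 60 * c(10, 13, 60, 8, 9, 10),
  member_casual = c("casual", "casual", "casual", "member", "member", "member")
)

# Ride duration in minutes.
rides$duration_min <- as.numeric(
  difftime(rides$ended_at, rides$started_at, units = "mins")
)

# Mean and median duration by membership status.
mean_min <- tapply(rides$duration_min, rides$member_casual, mean)
median_min <- tapply(rides$duration_min, rides$member_casual, median)

round(rbind(mean = mean_min, median = median_min), 1)
```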

1.5 RECOMMENDATIONS

These recommendations come from the combined work of data analysts outside the Google Certificate, enrolees in the Google Certificate, and myself. The articles used to support this list are referenced in the next section.

Develop a relational database system that links riders to their rides. This is critical: even without demographics, having the rider id alone would open up the possibility of predicting membership uptake among casual riders.
Ensure bike availability across docking stations and monitor public sentiment toward the system. Given that people often ride with others, this means ensuring at least two working bikes per station. [link]
Consider sending feedback to casual riders who have used the system for a year or more (e.g., “You started riding with us over a year ago! Did you know that riders who subscribed saved up to $X over the past year? Consider the annual membership!”).
Consider referral incentives for annual memberships. [link]
Consider optimizing the pricing model (mix of and price for subscriptions vs one-time rides) to encourage annual membership. [link]
Given the age distribution of the bike-share system usage, prioritise Gen X’ers, Millennials, and Gen Z’ers who are most likely to use it to go to work (and back), and therefore more likely to become annual members. [link]
Given that many also combine the bike-share system and the public transport system to go to work (and back), prioritise docking stations close to major public transit stops. [link]
Given the large fluctuation in rides across months, prioritise busier months for casual riders, i.e., May to October.
Alternatively, prioritise months with large month-on-month increases, i.e., May to October, which likely highlights new users.
Given that rides by casual riders are most common during the weekend, tailor an advertising campaign to casual riders who use the system on Saturday and Sunday, inviting them to consider the annual membership.
Given the pricing model, tailor an advertising campaign to casual riders who have taken rides of between 30 and 45 minutes, inviting them to consider the annual membership. This duration is over the 30-minute limit of the “Single Ride” payment option but under the 45-minute limit of the “Annual Member” option.
Given that riders may be tourists, prioritise casual riders who actually live in Chicago. [link]
Given that casual rides plummet during the off-season, consider an alternative “seasonal” membership model for casual riders who only use the bike-share system during the summer period. [link]

1.6 REFERENCES

1.6.1 COURSERA ENROLEES

Blog posts from Coursera enrolees:

Analysis by Lion Shi

Analysis by Akhelaaditya

Analysis by Hock Chong

Analysis by Rodney Boyd

Analysis by Ajoke Onojeghuo

1.6.2 OTHER SOURCES

Blog posts from other data analysts who examined the bike-share system:

“What’s going on with Divvy availability? Let’s look at the data.” - Steven Lucy, August 2022

“What does the Bike Sharing Business look like in Chicago?” - Gabriela Baker, 2020

“Oh, the Places You’ll Go! Analyzing Chicago Divvy Bike Share Data” - Samaksh (Avi) Goyal, 2019

I also found blog posts that examined the user experience with the docking stations and the app, but did not take them into account since the focus of the project was on the Divvy public data.

“Divvy Bikes Review : Chicago Bike Sharing Puts Brakes on UX” - Will Scott, 2016

“UX Design Case Study: Divvy Bikes” - AKFantham, 2016

2. APPENDIX

With a combined sample of over 5 million observations, I skipped Excel and went straight to R. I also skipped formal statistical tests: with a sample this large, even trivially small differences would come out as statistically significant.

1. VARIABLE LIST

A first look at the dataset:

## Rows: 5,733,451
## Columns: 13
## $ ride_id            <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at           <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id   <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name   <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id     <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat          <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng          <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat            <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng            <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…

We have 13 variables:

ride_id, an id variable for each observation
rideable_type, telling us whether the ride was done with a classic, docked, or electric bike
started_at, the time at which the ride started
ended_at, the time at which the ride ended
start_station_name, the name of the station where the ride started
start_station_id, the id of the station where the ride started
end_station_name, the name of the station where the ride ended
end_station_id, the id of the station where the ride ended
start_lat, the latitude of the location where the ride started
start_lng, the longitude of the location where the ride started
end_lat, the latitude of the location where the ride ended
end_lng, the longitude of the location where the ride ended
member_casual, our key variable, whether someone is a member or a casual rider.

2. MISSING DATA

## [1] "ride_id"
## [1] "0 %"
## [1] "rideable_type"
## [1] "0 %"
## [1] "started_at"
## [1] "0 %"
## [1] "ended_at"
## [1] "0 %"
## [1] "start_station_name"
## [1] "14.91 %"
## [1] "start_station_id"
## [1] "14.91 %"
## [1] "end_station_name"
## [1] "15.96 %"
## [1] "end_station_id"
## [1] "15.96 %"
## [1] "start_lat"
## [1] "0 %"
## [1] "start_lng"
## [1] "0 %"
## [1] "end_lat"
## [1] "0.1 %"
## [1] "end_lng"
## [1] "0.1 %"
## [1] "member_casual"
## [1] "0 %"

Only the start station and end station variables had a substantial amount of missing data (about 15% and 16% of rides, respectively).
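Percentages like the ones listed above can be obtained by averaging a missingness indicator over each column. This is a sketch on toy data rather than the exact code used for the output; the helper is_missing is hypothetical and treats empty strings as missing, which is an assumption about how blank station names were read in.

```r
# Toy data with one NA and one empty station name, and one missing coordinate.
rides <- data.frame(
  ride_id = c("A", "B", "C", "D"),
  start_station_name = c("Clark St & Elm St", NA, "", "Millennium Park"),
  end_lat = c(41.90, 41.88, NA, 41.88),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NA counts as missing, and so does "" for text columns.
is_missing <- function(x) {
  if (is.character(x)) is.na(x) | x == "" else is.na(x)
}

# Percentage of missing values per column.
pct_missing <- round(100 * sapply(rides, function(x) mean(is_missing(x))), 2)
pct_missing
```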

For start stations, all missing data were in rides on electric bikes.

For end stations, some missing data were in rides on classic and docked bikes, but the large majority were in rides on electric bikes.

This means that: 1) many rides on electric bikes simply did not start and/or end at a regular docking station; 2) a very small number of those who rode a classic (or docked) bike left it somewhere invalid at the end of their ride (e.g., the side of the road or in a lake).
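The breakdown by bike type can be checked by counting missing station names within each rideable_type. The sketch below uses toy data built to mirror the pattern described above (missing start stations only on electric bikes, missing end stations mostly on electric bikes); the real counts are not reproduced.

```r
# Toy data mirroring the pattern in the text, not the real counts.
rides <- data.frame(
  rideable_type = c("electric_bike", "electric_bike", "classic_bike",
                    "classic_bike", "docked_bike"),
  start_station_name = c(NA, "Wells St & Elm St", "Clark St & Elm St",
                         "Millennium Park", "Shedd Aquarium"),
  end_station_name = c(NA, NA, NA, "Millennium Park", "Shedd Aquarium")
)

# Number of rides with a missing start or end station, by bike type.
missing_start <- tapply(is.na(rides$start_station_name), rides$rideable_type, sum)
missing_end <- tapply(is.na(rides$end_station_name), rides$rideable_type, sum)

rbind(missing_start, missing_end)
```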

3. OUTLIERS

3.1 IMPOSSIBLE GEOGRAPHICAL COORDINATES
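One simple way to flag coordinates that are not in Chicago is a bounding-box check. This is only a sketch: the helper in_box and the latitude and longitude limits below are rough assumptions of mine, not the thresholds used in this project, and rides with missing coordinates are treated as invalid.

```r
# Toy data: ride C starts far from Chicago, ride B has no end coordinates.
rides <- data.frame(
  ride_id = c("A", "B", "C"),
  start_lat = c(41.89, 41.90, 45.64),
  start_lng = c(-87.63, -87.65, -73.80),
  end_lat = c(41.88, NA, 41.90),
  end_lng = c(-87.62, NA, -87.63)
)

# Hypothetical helper; the default box is an assumed, generous outline of
# the Chicago area and should be adjusted to the actual service area.
in_box <- function(lat, lng,
                   lat_range = c(41.6, 42.1),
                   lng_range = c(-88.0, -87.5)) {
  !is.na(lat) & !is.na(lng) &
    lat >= lat_range[1] & lat <= lat_range[2] &
    lng >= lng_range[1] & lng <= lng_range[2]
}

# A ride is kept only if both its start and its end fall inside the box.
rides$valid_coords <- in_box(rides$start_lat, rides$start_lng) &
  in_box(rides$end_lat, rides$end_lng)
rides[, c("ride_id", "valid_coords")]
```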

3.2 IMPOSSIBLE TIME DURATION

## [1] "Rows with a duration between 1 and 59 seconds"

## [1] 119686

## [1] "Percentage with a duration between 1 and 59 seconds"

## [1] "2.0875 %"

## [1] "Rows with a duration of zero or negative seconds, or over three hours"

## [1] 20198

## [1] "Percentage with a duration of zero or negative seconds, or over three hours"

## [1] "0.3523 %"
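The two duration rules above can be expressed as a single flag. This is a sketch on toy timestamps under the thresholds stated in the output (under one minute; zero or negative, or over three hours); the column names duration_sec and duration_flag are my own.

```r
# Toy rides lasting 30 seconds, -60 seconds, 4 hours and 12 minutes.
start <- as.POSIXct("2022-05-01 10:00:00", tz = "UTC")
rides <- data.frame(
  started_at = rep(start, 4),
  ended_at = start + c(30, -60, 4 * 3600, 12 * 60)
)

# Ride duration in seconds.
rides$duration_sec <- as.numeric(
  difftime(rides$ended_at, rides$started_at, units = "secs")
)

# Flag impossible durations first, then rides shorter than one minute.
rides$duration_flag <- ifelse(
  rides$duration_sec <= 0 | rides$duration_sec > 3 * 3600, "impossible",
  ifelse(rides$duration_sec < 60, "under_one_minute", "valid")
)

# Counts and percentages per flag.
flag_counts <- table(rides$duration_flag)
round(100 * prop.table(flag_counts), 4)
```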

3.3 START STATION BUT NO END STATION AMONG CLASSIC/DOCK BIKE RIDES

## [1] "Rides with a start station but no end station"

## [1] 6685

## [1] "Percentage with a start station but no end station"

## [1] "0.2295 %"
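The rule in this section can be sketched as a logical flag on toy data. Note that 6,685 rides at 0.2295% implies a denominator of about 2.9 million rides, which would be consistent with classic and docked bike rides only; the sketch assumes that reading when computing the percentage.

```r
# Toy data: two classic/docked rides start at a station but have no end station.
rides <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "docked_bike",
                    "electric_bike"),
  start_station_name = c("Clark St & Elm St", "Millennium Park",
                         "Shedd Aquarium", "Wells St & Elm St"),
  end_station_name = c(NA, "Millennium Park", NA, NA)
)

# Only classic and docked bikes are expected to end at a docking station.
is_docked_type <- rides$rideable_type %in% c("classic_bike", "docked_bike")
no_end <- is_docked_type &
  !is.na(rides$start_station_name) &
  is.na(rides$end_station_name)

n_no_end <- sum(no_end)
# Assumed denominator: classic and docked bike rides only.
pct_no_end <- round(100 * n_no_end / sum(is_docked_type), 4)
c(n = n_no_end, pct = pct_no_end)
```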

4. EXTRA ANALYSES

4.1 Rides by casual riders between 30 and 45 minutes

## [1] "Rides by casual riders >= 30 and < 45 minutes"

## [1] 196907

## [1] "Percentage by casual riders >= 30 and < 45 minutes"

## [1] "8.6253 %"
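A count like this one can be obtained by restricting to casual riders and applying the interval stated in the output (at least 30 and under 45 minutes). The sketch below uses toy durations and assumes a duration_min column already computed from started_at and ended_at.

```r
# Toy data: durations in minutes (illustrative values only).
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "casual"),
  duration_min = c(12, 30, 44.5, 35, 45)
)

# Keep casual riders, then flag rides of at least 30 and under 45 minutes.
casual <- rides[rides$member_casual == "casual", ]
in_window <- casual$duration_min >= 30 & casual$duration_min < 45

n_window <- sum(in_window)
pct_window <- round(100 * mean(in_window), 4)
c(n = n_window, pct = pct_window)
```

The percentage here is relative to rides by casual riders, which matches the reported figures (196,907 rides is about 8.6% of the roughly 2.28 million casual rides).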

GOOGLE CERTIFICATE CAPSTONE PROJECT

TRACK 1 / CASE STUDY 1 - Bike-share analysis

Thierry Gagné

13 January, 2023