Notice: This case study/publication is part of the capstone project under Google-Coursera’s Data Analytics course. Comments/feedback on this case study are more than welcome and greatly appreciated. You may get in touch with me via email at

STEP 1 - ASK

What topic are you exploring? In this case study, I will explore Cyclistic’s bike-sharing offering by comparing usage data of casual riders versus annual members, with the aim of uncovering insights and providing recommendations on how to turn casual riders into paying members.

What is the problem you are trying to solve? The issue at hand is how to maximize ridership by converting casual riders into annual paying members. This includes identifying marketing strategies, and the marketing channels, that casual riders may be most receptive to. The company’s director of marketing believes that the company’s long-term viability rests on this endeavor.

Specific task: Determine how annual members and casual riders use Cyclistic bikes differently.

What metrics will you use to measure your data to achieve your objective? Who are the stakeholders?

The metrics to be used to measure the data include the following:

Who is your audience for this analysis and how does this affect your analysis process and presentation? The audience for this presentation is Lily Moreno, Director of Marketing, and the Cyclistic Executive team. With this in mind, the presentation needs to be as detailed as possible without sacrificing clarity so that the final recommendation can come across to the decision-makers coherently.

How will this data help your stakeholders make decisions? The results of this analysis will help stakeholders decide on the recommendations in this report, including marketing strategies that encourage casual riders to convert to annual membership.

STEP 2 - PREPARE

Where is your data located? Data is stored on Amazon Web Services and is owned (first-party) by Cyclistic.

How is the data organized? Data is kept in csv files separated by month. All files contain usage of Cyclistic bikes including ride type, start and end stations (with station IDs and coordinates), as well as whether users are members or casual riders. Upon inspection of the 2021 files, however, it has been determined that some start and end points do not have names and IDs. All observations, nonetheless, have lat and long coordinates.

Are there issues with bias or credibility in this data? Does your data ROCCC? Apart from the unrecorded station names and IDs, the data seem to meet the ROCCC requirements. Let’s look at them one by one.

How are you addressing licensing, privacy, security, and accessibility? The data downloaded will be stored on my local drive and backed up in Google Drive as I go through the process. Raw data will not be shared by any means, and processed data will only be displayed as tables, tibbles and visualizations, keeping the raw data inaccessible to the public.

How did you verify the data’s integrity? Despite the missing values (start/end station names and their corresponding IDs), the lat/long coordinates ensure that each transaction is geospatially identifiable. The coordinates are also complete (not rounded off), although a specific station may have multiple variations in its coordinates, owing perhaps to the location data’s accuracy down to the smallest possible unit of measure (feet?).

How does it help you answer your question? This matters because the missing values account for 30% of the transactions. If accounting for this 30% turns out to be necessary for a valid analysis, the coordinates can serve as secondary data.

Are there any problems with the data? Although about 30% of transactions have missing station names and IDs, data integrity remains intact because coordinates are present across all transactions. The challenge for me as the analyst is to generate the station names and IDs for the missing values should that be required.
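One possible way to generate a missing station name from coordinates is a nearest-station lookup. The sketch below is only an illustration under stated assumptions: the `stations` table, its coordinates, the helper names `haversine_m` and `nearest_station`, and the 100-metre cut-off are all hypothetical and not part of the original analysis.

```r
# Hypothetical lookup table of known stations (coordinates are illustrative).
stations <- data.frame(
  station_name = c("Station A", "Station B"),
  lat = c(41.8920, 41.8810),
  lng = c(-87.6120, -87.6170),
  stringsAsFactors = FALSE
)

# Haversine (great-circle) distance in metres between one point and many points.
haversine_m <- function(lat1, lng1, lat2, lng2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * 6371000 * asin(sqrt(a))
}

# Name of the closest known station, or NA if none lies within max_m metres.
# max_m = 100 is an assumed threshold, not a value taken from the data.
nearest_station <- function(lat, lng, stations, max_m = 100) {
  d <- haversine_m(lat, lng, stations$lat, stations$lng)
  if (min(d) > max_m) return(NA_character_)
  stations$station_name[which.min(d)]
}

nearest_station(41.8921, -87.6121, stations)
```

A distance threshold keeps rides that started far from any dock from being assigned to a station they never used.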

STEP 3 - PROCESS

Following are the steps in cleaning and manipulating the data.

  1. Added ride_length, day_of_week and number_of_days (the number of full days a borrow lasted, for borrows longer than 24 hours) to the CSV files using Excel.
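The same three columns could also be derived in R instead of Excel. This is a minimal sketch, assuming the raw files carry start and end timestamps in columns named `started_at` and `ended_at` (names assumed here, not confirmed by the text) and that ride_length is measured in minutes.

```r
# Toy rides; the timestamp column names are assumptions.
trips <- data.frame(
  started_at = as.POSIXct(c("2021-06-05 10:00:00", "2021-06-07 08:30:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-06-05 10:25:00", "2021-06-09 09:00:00"), tz = "UTC")
)

# Ride length in minutes.
trips$ride_length <- as.numeric(
  difftime(trips$ended_at, trips$started_at, units = "mins")
)

# Day of week coded 1 = Sunday to 7 = Saturday (POSIXlt wday is 0 = Sunday).
trips$day_of_week <- as.POSIXlt(trips$started_at)$wday + 1

# Number of full 24-hour periods covered by the borrow.
trips$number_of_days <- floor(trips$ride_length / (24 * 60))

trips
```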

  2. Loaded all Cyclistic 2021 data files and aggregated them into one data frame named divvy_tripdata.
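A loading step like this one can be sketched in base R as follows. The helper name `read_all_trips` and the folder path in the commented call are hypothetical; the sketch also assumes every monthly file has the same columns.

```r
# Read every CSV in a folder and stack the results into one data frame.
read_all_trips <- function(data_dir) {
  files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, monthly)  # requires identical column names in every file
}

# Example call with a hypothetical folder of the twelve 2021 files:
# divvy_tripdata <- read_all_trips("data/2021")
```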

  3. Counted the total number of borrows per day of the week (1 = Sunday to 7 = Saturday).

Table 1

## # A tibble: 7 × 2
##   day_of_week      n
##         <dbl>  <int>
## 1           7 991047
## 2           1 857285
## 3           6 810508
## 4           4 756142
## 5           3 739905
## 6           5 737588
## 7           2 702588
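A table of this shape can be produced with dplyr's `count()`. The sketch below uses toy data and is only a guess at how the counts were built; the `day_name` column is an addition for readability and is not in the original output.

```r
library(dplyr)

# Toy data standing in for divvy_tripdata.
toy_trips <- tibble(day_of_week = c(7, 7, 1, 7, 2, 1))

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Count rides per day, busiest day first, and attach a readable label.
day_count <- toy_trips %>%
  count(day_of_week, sort = TRUE) %>%
  mutate(day_name = factor(day_names[day_of_week], levels = day_names))

day_count
```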

STEP 4 - ANALYZE & STEP 5 - SHARE

Table 2

## # A tibble: 6 × 3
## # Groups:   rideable_type, member_casual [6]
##   rideable_type member_casual       n
##   <chr>         <chr>           <int>
## 1 classic_bike  member        1984371
## 2 classic_bike  casual        1266657
## 3 electric_bike member        1081686
## 4 electric_bike casual         950006
## 5 docked_bike   casual         312342
## 6 docked_bike   member              1

By the numbers:

In terms of user type, 54.8% of riders were members, 45.2% were casual riders. A small 0.0006% of rides was unidentifiable. Converting casual riders to members indeed has the potential to improve revenue/cash flow.
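The member and casual shares can be checked directly against the counts in Table 2. The snippet below simply re-enters those counts by hand and recomputes the percentages.

```r
# Counts copied from Table 2.
rides <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "electric_bike",
                    "electric_bike", "docked_bike", "docked_bike"),
  member_casual = c("member", "casual", "member", "casual", "casual", "member"),
  n = c(1984371, 1266657, 1081686, 950006, 312342, 1),
  stringsAsFactors = FALSE
)

# Total rides per rider type, then each type's share of all rides in percent.
totals <- tapply(rides$n, rides$member_casual, sum)
share_pct <- round(100 * totals / sum(totals), 1)

totals
share_pct
```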

Figure 1. Total usage across casual and member riders combined is highest on weekends, peaking on Saturday. Note: To get a better view of usage, we need to split the visualization between casual and member riders.

# Plot in bar chart
ggplot(data = day_count) +
  geom_col(mapping = aes(x = day_of_week, y = n)) +
  labs(title = "Total usage per day", subtitle = "Figure 1")

Figure 2 starkly shows the difference in bike-sharing usage between casual and member riders. More casual riders use the bike-sharing service on Saturdays and Sundays than on any other day of the week. The trend is reversed for member riders, for whom more rides happen on weekdays (peaking on Wednesday).

day_of_week_per_rider <- divvy_tripdata %>%
  select(member_casual, day_of_week) %>%
  count(member_casual, day_of_week)

ggplot(data = day_of_week_per_rider) +
  geom_col(mapping = aes(x=day_of_week, y=n, fill=member_casual)) +
  facet_wrap(~member_casual) +
  theme(axis.text.x=element_text(angle = 45, hjust=1)) +
  labs(title = "Total usage per day per rider type", subtitle = "Figure 2")

Figure 3 shows that casual riders primarily choose classic bikes, followed by electric and then docked bikes. The chart suggests an uptick in usage of all bike types during the weekend, with classic bikes being the most preferred ride type. Docked bikes are almost exclusive to casual riders (with only 1 docked bike borrowed by a member in 2021). Given the nature of docked bikes, we can surmise that they are the cheapest of the three bike types.
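The code behind Figure 3 is not shown, so the following is only a sketch of how such a chart might be built in the same style as the other figures. The toy data and the object names `type_per_day` and `fig3` are assumptions.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for divvy_tripdata.
toy_trips <- tibble(
  member_casual = c("casual", "casual", "casual", "member", "member"),
  rideable_type = c("classic_bike", "docked_bike", "classic_bike",
                    "classic_bike", "electric_bike"),
  day_of_week   = c(7, 7, 1, 3, 4)
)

# Rides per rider type, bike type and day of week.
type_per_day <- toy_trips %>%
  count(member_casual, rideable_type, day_of_week)

# Stacked bars by bike type, one panel per rider type.
fig3 <- ggplot(data = type_per_day) +
  geom_col(mapping = aes(x = factor(day_of_week), y = n, fill = rideable_type)) +
  facet_wrap(~member_casual) +
  labs(title = "Ride type per day per rider type", x = "day_of_week")

fig3
```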

(Casual vs Members) Number of multi-day transactions

Tables 3 (members) and 4 (casual riders)

## # A tibble: 3 × 2
##   number_of_days       n
##            <dbl>   <int>
## 1              0 3055633
## 2              1   10396
## 3              2      29
## # A tibble: 35 × 2
##    number_of_days       n
##             <dbl>   <int>
##  1              0 2496268
##  2              1   31632
##  3              2     419
##  4              3     139
##  5              4      99
##  6              5      69
##  7              6      48
##  8              7      36
##  9              8      48
## 10              9      32
## # … with 25 more rows

While the majority of transactions are completed within 24 hours, a noticeable number of borrows, particularly by casual riders, last longer than 24 hours (32,737 transactions), with 39 days as the longest borrow.

In contrast, there were only 10,425 borrows by members that spanned more than 24 hours.
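The member figure can be reproduced from the three-row table above, and both figures can be put in proportion to each group's total rides. In this sketch the casual count of 32,737 is taken from the text and the casual total of 2,529,005 is the sum of the casual rows in Table 2.

```r
# Member rows copied from the three-row table above.
member_days <- data.frame(number_of_days = c(0, 1, 2),
                          n = c(3055633, 10396, 29))

# Borrows that ran past 24 hours are those with number_of_days > 0.
multi_day_members <- sum(member_days$n[member_days$number_of_days > 0])

# Multi-day borrows as a share of each group's rides, in percent.
share_members <- 100 * multi_day_members / sum(member_days$n)
share_casual  <- 100 * 32737 / 2529005

multi_day_members
round(c(member = share_members, casual = share_casual), 2)
```

Seen this way, multi-day borrows are rare for both groups (roughly 0.34% of member rides and 1.29% of casual rides), but they are several times more common among casual riders.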

Note: There is a need to look at the rate paid by casual and member riders to further understand their borrowing behavior.

Most used start stations

Note: 12.5% and 12.1% of start stations for casual and member riders, respectively, are unrecorded.
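These shares can be recomputed from figures reported elsewhere in this document: the `<NA>` counts in Tables 6 and 7 divided by each rider type's total rides from Table 2. The snippet below is a hand check, not the original code.

```r
# <NA> start-station counts from Tables 6 and 7.
missing_starts <- c(casual = 317686, member = 373123)

# Total rides per rider type, summed from Table 2.
total_rides <- c(casual = 1266657 + 950006 + 312342,
                 member = 1984371 + 1081686 + 1)

# Share of rides with no recorded start station, in percent.
missing_pct <- 100 * missing_starts / total_rides
round(missing_pct, 2)
```

The results come out at about 12.56% for casual riders and 12.17% for members, close to the figures quoted in the note.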

Figures 4 and 5 visualize the count of start and end stations for both casual (blue line) and member (red line) riders.

Table 6 Most used start stations of casual riders

## # A tibble: 842 × 2
##    start_station_name             n
##    <chr>                      <int>
##  1 <NA>                      317686
##  2 Streeter Dr & Grand Ave    66360
##  3 Millennium Park            33590
##  4 Michigan Ave & Oak St      29781
##  5 Shedd Aquarium             23251
##  6 Theater on the Lake        21351
##  7 Wells St & Concord Ln      19892
##  8 Lake Shore Dr & Monroe St  19617
##  9 Clark St & Lincoln Ave     17033
## 10 Wells St & Elm St          16668
## # … with 832 more rows

Table 7 Most used start stations of member riders

## # A tibble: 837 × 2
##    start_station_name            n
##    <chr>                     <int>
##  1 <NA>                     373123
##  2 Clark St & Elm St         24740
##  3 Wells St & Concord Ln     23718
##  4 Kingsbury St & Kinzie St  23565
##  5 Wells St & Elm St         21022
##  6 Dearborn St & Erie St     19585
##  7 Wells St & Huron St       19190
##  8 St. Clair St & Erie St    18903
##  9 Broadway & Barry Ave      17803
## 10 Clinton St & Madison St   16914
## # … with 827 more rows
## Warning: Removed 1 rows containing missing values (position_stack).


Table 8

## # A tibble: 622,302 × 3
##    start_lat start_lng     n
##        <dbl>     <dbl> <int>
##  1      41.9     -87.6 55620
##  2      41.9     -87.6 29964
##  3      41.9     -87.6 24250
##  4      41.9     -87.6 17863
##  5      41.9     -87.6 17498
##  6      41.9     -87.6 17222
##  7      41.9     -87.6 14174
##  8      41.9     -87.6 13123
##  9      41.9     -87.6 12432
## 10      41.9     -87.6 12415
## # … with 622,292 more rows
## # A tibble: 701,633 × 3
##    start_lat start_lng     n
##        <dbl>     <dbl> <int>
##  1      41.8     -87.6 29393
##  2      41.9     -87.6 19059
##  3      41.9     -87.6 17396
##  4      41.9     -87.6 17095
##  5      41.8     -87.6 16370
##  6      41.9     -87.6 15646
##  7      41.9     -87.6 15301
##  8      41.9     -87.6 13980
##  9      41.9     -87.6 13907
## 10      41.9     -87.6 13758
## # … with 701,623 more rows
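The rows in Table 8 look identical because tibbles print numbers with three significant digits by default, not because the stored coordinates are the same. The sketch below counts rides per coordinate pair on toy data and raises the print precision through the `pillar.sigfig` option; the coordinate values are illustrative only.

```r
library(dplyr)

# Toy start coordinates (illustrative values).
toy_starts <- tibble(
  start_lat = c(41.892278, 41.892278, 41.880958, 41.892278),
  start_lng = c(-87.612043, -87.612043, -87.616743, -87.612043)
)

# Rides per exact coordinate pair, most used first.
coord_count <- toy_starts %>%
  count(start_lat, start_lng, sort = TRUE)

# Print more significant digits so distinct coordinates become visible.
options(pillar.sigfig = 8)
print(coord_count)
```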

Counting by coordinates, however, gives a different picture of the top six most used start stations for casual riders. These are:

  1. Streeter Dr & Grand Ave
  2. Lake Shore Dr & Monroe St
  3. Lake Shore Dr & North Blvd
  4. Theater on the Lake
  5. Millennium Park
  6. Michigan Ave & Oak St

Lake Shore Dr & Monroe St, Lake Shore Dr & North Blvd and Theater on the Lake pushed Millennium Park (previously 2nd) and Michigan Ave & Oak St (previously 3rd) down to 5th and 6th. Using the actual coordinates gives us more accurate information.

Figure 6

Figure 6, meanwhile, displays the top 10 start stations for casual (standard marker) and member (circle marker) riders each.

Notice here the positions of the markers, which show that more casual riders start their journey close to or along Lake Shore Drive, an expressway designated as part of U.S. Route 41.

This analysis has allowed us to form observations on the differences between casual and member riders.

  1. Casual riders tend to borrow more on weekends. (See Figure 2)
  2. Docked bikes are used almost exclusively by casual riders. (See Figure 3)
  3. Casual riders borrow for longer periods. (See Table 4)
  4. Many casual riders start their borrow near Lake Shore Drive, an expressway. (See Figure 6)

From these, we can derive a number of assumptions from which we can gather insights.

  1. Casual riders may be service workers who go to the city to work on weekends.
  2. Casual riders may have tighter budgets, which would explain why many of them use docked bikes (presumably the least expensive ride type).
  3. Casual riders borrow longer because they want a more cost-efficient means of getting around (no need to pay for parking, gas, toll fees, etc.).

Note: The current workstation has difficulty processing and plotting 5.5M data points on the map through R. Thus, the data was exported to and visualized in Tableau.
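One way to lighten a map like this is to collapse rides that start at practically the same spot into a single weighted point before plotting or exporting. The sketch below is not what was done here; the four-decimal rounding (roughly ten metres), the toy data and the output file name are all assumptions.

```r
library(dplyr)

# Toy start points standing in for the full data set.
toy_starts <- tibble(
  member_casual = c("casual", "casual", "member", "casual"),
  start_lat = c(41.892278, 41.892281, 41.880958, 41.911722),
  start_lng = c(-87.612043, -87.612040, -87.616743, -87.626804)
)

# Round coordinates and count rides per rounded point and rider type.
map_points <- toy_starts %>%
  mutate(lat = round(start_lat, 4), lng = round(start_lng, 4)) %>%
  count(member_casual, lat, lng, name = "rides")

# Write the much smaller table to a CSV that a mapping tool can read.
out_file <- file.path(tempdir(), "start_points.csv")
write.csv(map_points, out_file, row.names = FALSE)

map_points
```

Each output row then represents one location with a ride count, which a mapping tool can render as marker size or colour instead of drawing every ride.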

The dashboard shows all the start points on the map, including those whose station name is NA, and shows that these points lie close to Lake Shore Drive.

Even with more data points, the dashboard does not invalidate our observations and assumptions. In fact, it further supports them.

STEP 6 - ACT

Recommendations Based on the insights gathered, I am recommending the following additional action points:

  1. At the outset, we can launch a communication campaign (email, SMS, app marketing) exclusive to casual riders, offering them a limited-time promo (discounts) when they convert to an annual membership.
  2. There may also be a need to look at pricing and determine which is more cost-efficient for riders: casual use or annual membership. If annual membership turns out to be more cost-efficient in the long run, this may be communicated in the campaign as well.
  3. Conduct a qualitative study (interviews) with casual riders to further understand their motivations and behavior. This is to:

The End