STEP 1 - ASK
What topic are you exploring? In this case study, I will explore Cyclistic’s bike-sharing offering by comparing usage data of casual riders and annual members, with the aim of uncovering insights and recommending how to turn casual riders into paying members.
What is the problem you are trying to solve? The issue at hand is how to maximize ridership by converting casual riders to annual paying members. This includes identifying marketing strategies, including the marketing channels that casual riders may be most receptive to. The company’s director of marketing believes that the company’s long-term viability rests on this endeavor.
Specific task: Determine how annual members and casual riders use Cyclistic bikes differently.
What metrics will you use to measure your data to achieve your objective? Who are the stakeholders?
- Company: Cyclistic - 5,824 bicycles and 692 docking stations
- Lily Moreno - Director of Marketing, Project Manager
- Cyclistic Executive team
The metrics to be used to measure the data include the following:
- Most used ride type
- Aggregate 7-day usage behavior
- Most used start and end stations
Who is your audience for this analysis and how does this affect your analysis process and presentation? The audience for this presentation is Lily Moreno, Director of Marketing, and the Cyclistic Executive team. With this in mind, the presentation needs to be as detailed as possible without sacrificing clarity so that the final recommendation can come across to the decision-makers coherently.
How will this data help your stakeholders make decisions? The results of this analysis will help stakeholders decide on the recommendations in this report, including marketing strategies that encourage casual riders to convert to annual membership.
STEP 2 - PREPARE
Where is your data located? Data is stored on Amazon Web Services and is owned (first-party) by Cyclistic.
How is the data organized? Data is kept in CSV files separated by month. All files record usage of Cyclistic bikes, including ride type, start and end stations (with station IDs and coordinates), and whether users are members or casual riders. Inspection of the 2021 files, however, shows that some start and end points lack names and IDs. All observations nonetheless have lat and long coordinates.
Are there issues with bias or credibility in this data? Does your data ROCCC? Apart from the unrecorded station names and IDs, the data seem to meet the ROCCC requirements. Let’s look at them one by one.
- Reliable - Despite the absence of station names and IDs in about 30% of the transactions, the data remains reliable overall because coordinates are present across all transactions. If we need to identify the station names and IDs of the missing values, we can match their coordinates to the coordinates of transactions with existing station names/IDs.
- Original - Data is proprietary to Cyclistic.
- Comprehensive - We are only looking at 2021 data, but records go back to 2013. In 2021 alone there were about 5.6M transactions, giving us a wealth of data to work with and the possibility of accurate results even from a sample.
- Current - The data remains current and valid for producing plausible recommendations, given the broadly similar circumstances in 2021 and 2022 in terms of mobility and economic behavior amid the pandemic.
- Cited - As mentioned, we are working with first-party data generated from Cyclistic’s bike-sharing technology.
How are you addressing licensing, privacy, security, and accessibility? The downloaded data will be stored on my local drive and backed up in Google Drive as I go through the process. Raw data will not be shared by any means, and processed data will only be displayed as tables, tibbles and visualizations, keeping the raw data inaccessible to the public.
How did you verify the data’s integrity? Despite the missing values (start/end station names and their corresponding IDs), the lat/long coordinates ensure that each transaction is geospatially identifiable. The coordinates are also recorded at full precision (not rounded off). A specific station may nonetheless appear under several slightly different coordinates, perhaps because the location data is accurate down to the smallest possible unit of measure (feet?).
How does it help you answer your question? It is necessary to know this because the missing values account for 30% of the transactions. If accounting for this 30% turns out to be necessary for a valid analysis, the coordinates are available as secondary data.
Are there any problems with the data? Although about 30% of transactions have missing station names and IDs, data integrity remains intact because coordinates are present across all transactions. The challenge for me as the analyst is to generate the station names and IDs for the missing values if required.
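The coordinate matching described above could be sketched as follows. This is a minimal illustration in base R, not the method used in this report: the toy rows and the helper `nearest_station()` are hypothetical, and only the column names `start_station_name`, `start_lat` and `start_lng` come from the data.

```r
# Hypothetical toy rides; column names follow the tables in this report.
rides <- data.frame(
  start_station_name = c("Station A", "Station B", NA, NA),
  start_lat = c(41.8920, 41.8810, 41.8921, 41.8809),
  start_lng = c(-87.6120, -87.6170, -87.6121, -87.6171),
  stringsAsFactors = FALSE
)

# Reference table: mean coordinate of every named station.
named <- rides[!is.na(rides$start_station_name), ]
stations <- aggregate(cbind(start_lat, start_lng) ~ start_station_name,
                      data = named, FUN = mean)

# Hypothetical helper: name of the station closest to one (lat, lng) pair.
# Squared planar distance is a simplification that is adequate only for
# ranking nearby points within a single city.
nearest_station <- function(lat, lng) {
  d2 <- (stations$start_lat - lat)^2 + (stations$start_lng - lng)^2
  as.character(stations$start_station_name[which.min(d2)])
}

# Fill only the rows whose station name is missing.
missing <- is.na(rides$start_station_name)
rides$start_station_filled <- rides$start_station_name
rides$start_station_filled[missing] <- mapply(
  nearest_station, rides$start_lat[missing], rides$start_lng[missing]
)
print(rides)
```

A nearest-station fill assigns a station to every row, including rides that may not have started at a station at all, so a maximum-distance cutoff would be worth adding before relying on the filled names.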
STEP 4 - ANALYZE & STEP 5 - SHARE
Table 2
## # A tibble: 6 × 3
## # Groups: rideable_type, member_casual [6]
## rideable_type member_casual n
## <chr> <chr> <int>
## 1 classic_bike member 1984371
## 2 classic_bike casual 1266657
## 3 electric_bike member 1081686
## 4 electric_bike casual 950006
## 5 docked_bike casual 312342
## 6 docked_bike member 1
By the numbers:
- TOTAL number of borrows in 2021: 5,595,063
- Number of borrows by casual riders: 2,529,005
- Number of borrows by member riders: 3,066,058
In terms of rider type, members took 54.8% of rides and casual riders took 45.2%. Together the two groups account for all 5,595,063 rides in Table 2, so no ride is left without a rider type. Converting casual riders to members therefore has the potential to improve revenue and cash flow.
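The totals and percentages above can be reproduced from the Table 2 counts. The sketch below uses base R only; the data frame name `table2` is introduced here for illustration.

```r
# Counts copied from Table 2.
table2 <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "electric_bike",
                    "electric_bike", "docked_bike", "docked_bike"),
  member_casual = c("member", "casual", "member", "casual", "casual", "member"),
  n = c(1984371, 1266657, 1081686, 950006, 312342, 1)
)

# Sum rides per rider type, then convert to a percentage of all rides.
by_type <- tapply(table2$n, table2$member_casual, sum)
share <- round(100 * by_type / sum(by_type), 1)

print(by_type)  # casual 2529005, member 3066058
print(share)    # casual 45.2, member 54.8
```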
Figure 1. Weekends show the highest usage for casual and member riders combined, with usage peaking on Saturday. Note: To get a better view of usage, we will need to separate the visualization between casual and member riders.
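# Count rides per day of week (assumes divvy_tripdata has a day_of_week
# column, as in the Figure 2 code)
day_count <- divvy_tripdata %>%
  count(day_of_week)
# Plot in bar chart
ggplot(data = day_count) +
  geom_col(mapping = aes(x = day_of_week, y = n)) +
  labs(title = "Total usage per day", subtitle = "Figure 1")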

Figure 2 shows starkly the difference in bike-sharing usage between casual and member riders. More casual riders are using the bike-sharing service on Saturdays and Sundays than any other day of the week. The trend is reversed with member riders where more rides happen on weekdays (peaking on Wednesday).
# Count rides for each combination of rider type and day of week
day_of_week_per_rider <- divvy_tripdata %>%
  select(member_casual, day_of_week) %>%
  count(member_casual, day_of_week)
# One panel per rider type, with slanted day labels for readability
ggplot(data = day_of_week_per_rider) +
  geom_col(mapping = aes(x = day_of_week, y = n, fill = member_casual)) +
  facet_wrap(~member_casual) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Total usage per day per rider type", subtitle = "Figure 2")
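The plots rely on a `day_of_week` column whose derivation is not shown here. One way it could be built is sketched below in base R. The `started_at` column name and the sample timestamps are assumptions, and the Sunday-first ordering is only one possible choice.

```r
# Hypothetical start times; started_at is an assumed column name.
started_at <- as.POSIXct(
  c("2021-07-03 10:15:00", "2021-07-04 14:30:00", "2021-07-07 08:05:00"),
  tz = "UTC"
)

day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), which avoids
# the locale-dependent names returned by weekdays().
wday_index <- as.POSIXlt(started_at)$wday + 1

# An explicit level order keeps the bars in calendar order instead of
# alphabetical order when plotted.
day_of_week <- factor(day_levels[wday_index], levels = day_levels)

print(table(day_of_week))
```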

Figure 3 shows that casual riders primarily choose classic bikes, followed by electric and then docked bikes. The chart suggests an uptick in usage of all bike types during the weekend, with classic bikes being the most preferred ride type. Docked bikes are almost exclusive to casual riders (with only 1 docked bike borrowed by a member in 2021). Given the nature of docked bikes, we can surmise that they are the cheapest of the three bike types.

(Casual vs Members) Number of multi-day transactions
Table 3 Number of days per borrow, member riders
## # A tibble: 3 × 2
##   number_of_days       n
##            <dbl>   <int>
## 1              0 3055633
## 2              1   10396
## 3              2      29
Table 4 Number of days per borrow, casual riders
## # A tibble: 35 × 2
##    number_of_days       n
##             <dbl>   <int>
##  1              0 2496268
##  2              1   31632
##  3              2     419
##  4              3     139
##  5              4      99
##  6              5      69
##  7              6      48
##  8              7      36
##  9              8      48
## 10              9      32
## # … with 25 more rows
While the majority of transactions happen within a 24-hour period, a noticeable number of borrows, particularly by casual riders, last more than 24 hours (32,737 transactions), with 39 days as the longest borrow.
In contrast, there were only 10,425 borrows by members that spanned more than 24 hours.
Note: There is a need to look at the rate paid by casual and member riders to further understand their borrowing behavior.
- Why are members typically only borrowing less than 24 hours, with only 0.3% of borrows lasting more than a day?
- Conversely, why are there more casual riders borrowing for more than a day (1.29%)?
- Are more casual riders reliant on bike-sharing as their primary mode of transportation?
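The report does not show how `number_of_days` was derived. A minimal base R sketch is given below, assuming it is the number of whole days elapsed between start and end. The `started_at` and `ended_at` column names and the toy rows are assumptions. The last lines recompute the multi-day shares quoted above from the Table 3 and Table 4 counts.

```r
# Hypothetical rides; started_at and ended_at are assumed column names.
rides <- data.frame(
  member_casual = c("member", "casual", "casual"),
  started_at = as.POSIXct(c("2021-06-01 08:00:00", "2021-06-05 09:00:00",
                            "2021-06-10 12:00:00"), tz = "UTC"),
  ended_at = as.POSIXct(c("2021-06-01 08:25:00", "2021-06-06 10:30:00",
                          "2021-06-13 13:00:00"), tz = "UTC")
)

# Whole days elapsed between start and end (0 = returned within 24 hours).
duration_days <- as.numeric(difftime(rides$ended_at, rides$started_at,
                                     units = "days"))
rides$number_of_days <- floor(duration_days)

# Cross-tabulate rider type against borrow length in days.
print(table(rides$member_casual, rides$number_of_days))

# Multi-day shares from the Table 3 and Table 4 counts.
member_multi_day <- 10396 + 29
casual_multi_day <- 2529005 - 2496268
print(round(100 * member_multi_day / 3066058, 2))  # 0.34
print(round(100 * casual_multi_day / 2529005, 2))  # 1.29
```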
Most used start stations
Note: 12.6% and 12.2% of start stations for casual and member riders respectively are unrecorded.
- Of those with values, Streeter Dr & Grand Ave is the most used station by casual riders (2.6%), followed by Millennium Park (1.3%).
- In contrast, Clark St & Elm St (0.8%), Wells St & Concord Ln (0.77%) and Kingsbury St & Kinzie St (0.77%) are the most used stations by member riders.
Figures 4 and 5 visualize the count of start and end stations for both casual (blue line) and member (red line) riders.
Table 6 Most used start stations of casual riders
## # A tibble: 842 × 2
## start_station_name n
## <chr> <int>
## 1 <NA> 317686
## 2 Streeter Dr & Grand Ave 66360
## 3 Millennium Park 33590
## 4 Michigan Ave & Oak St 29781
## 5 Shedd Aquarium 23251
## 6 Theater on the Lake 21351
## 7 Wells St & Concord Ln 19892
## 8 Lake Shore Dr & Monroe St 19617
## 9 Clark St & Lincoln Ave 17033
## 10 Wells St & Elm St 16668
## # … with 832 more rows
Table 7 Most used start stations of member riders
## # A tibble: 837 × 2
## start_station_name n
## <chr> <int>
## 1 <NA> 373123
## 2 Clark St & Elm St 24740
## 3 Wells St & Concord Ln 23718
## 4 Kingsbury St & Kinzie St 23565
## 5 Wells St & Elm St 21022
## 6 Dearborn St & Erie St 19585
## 7 Wells St & Huron St 19190
## 8 St. Clair St & Erie St 18903
## 9 Broadway & Barry Ave 17803
## 10 Clinton St & Madison St 16914
## # … with 827 more rows
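Rankings like Tables 6 and 7, together with the share of unrecorded stations, can be obtained by counting rides per station name while keeping NA as its own category. The sketch below uses base R and hypothetical station names. The tibble output above suggests the report used dplyr for this step, so this is an equivalent illustration rather than the original code.

```r
# Hypothetical start stations for eight rides; NA marks an unrecorded name.
start_station_name <- c("Station A", "Station A", "Station B", NA,
                        "Station A", "Station C", NA, "Station B")

# Count rides per station, keeping NA as its own category.
counts <- sort(table(start_station_name, useNA = "ifany"), decreasing = TRUE)
print(counts)

# Share of rides with no recorded start station.
na_share <- mean(is.na(start_station_name))

# Ranking among recorded stations only (table() drops NA by default).
named_counts <- sort(table(start_station_name), decreasing = TRUE)
top_station <- names(named_counts)[1]
```

Dividing each station's count by the total number of rides, NA included, gives percentages comparable to those quoted in the text.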
Table 8
## # A tibble: 622,302 × 3
## start_lat start_lng n
## <dbl> <dbl> <int>
## 1 41.9 -87.6 55620
## 2 41.9 -87.6 29964
## 3 41.9 -87.6 24250
## 4 41.9 -87.6 17863
## 5 41.9 -87.6 17498
## 6 41.9 -87.6 17222
## 7 41.9 -87.6 14174
## 8 41.9 -87.6 13123
## 9 41.9 -87.6 12432
## 10 41.9 -87.6 12415
## # … with 622,292 more rows
## # A tibble: 701,633 × 3
## start_lat start_lng n
## <dbl> <dbl> <int>
## 1 41.8 -87.6 29393
## 2 41.9 -87.6 19059
## 3 41.9 -87.6 17396
## 4 41.9 -87.6 17095
## 5 41.8 -87.6 16370
## 6 41.9 -87.6 15646
## 7 41.9 -87.6 15301
## 8 41.9 -87.6 13980
## 9 41.9 -87.6 13907
## 10 41.9 -87.6 13758
## # … with 701,623 more rows
Using coordinates, however, shows us a different picture with the top six most used stations by casual riders. These are:
- Streeter Dr & Grand Ave
- Lake Shore Dr & Monroe St
- Lake Shore Dr & North Blvd
- Theater on the Lake
- Millennium Park
- Michigan Ave & Oak St
Stations Lake Shore Dr & Monroe St, Lake Shore Dr & North Blvd and Theater on the Lake pushed back the previously 2nd (Millennium Park) and 3rd (Michigan Ave & Oak St) most used start stations. Using the actual coordinates gives us more accurate information.
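Because one station can appear under many slightly different coordinates, counting on full-precision coordinates splits its rides across many groups. One possible remedy, sketched below in base R, is to round coordinates before counting. The sample coordinates and the choice of 3 decimal places are assumptions made for illustration, not settings taken from this analysis.

```r
# Hypothetical coordinates: rows 1-3 are the same dock with small variations.
starts <- data.frame(
  start_lat = c(41.892278, 41.892301, 41.892265, 41.880958),
  start_lng = c(-87.612043, -87.612011, -87.612060, -87.616743)
)

# Grouping on full-precision coordinates treats every variation as a new place.
exact_groups <- nrow(unique(starts))

# Rounding to 3 decimals (roughly 100 m in latitude) merges the variations.
starts$lat_r <- round(starts$start_lat, 3)
starts$lng_r <- round(starts$start_lng, 3)
starts$n <- 1

# Count rides per rounded coordinate pair, most used first.
rounded <- aggregate(n ~ lat_r + lng_r, data = starts, FUN = sum)
rounded <- rounded[order(-rounded$n), ]
print(rounded)
```

Rounding can still split a station that sits on a grid boundary, so matching each point to the nearest known station is a more robust alternative when station names are needed.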
Figure 6
Figure 6, meanwhile, displays the top 10 start stations for casual (standard marker) and member (circle marker) riders each.
Notice the positions of the markers: more casual riders start their journey close to or along Lake Shore Drive, a lakefront expressway designated as part of U.S. Route 41.
This analysis has allowed us to form observations about the differences between casual and member riders.
- Casual riders tend to borrow more on weekends. (See Figure 2)
- Docked bikes are used almost exclusively by casual riders. (See Figure 3)
- Casual riders are borrowing longer. (See Table 4)
- Many casual riders start their borrow near Lake Shore Drive, a lakefront expressway. (See Figure 6)
From these, we can derive a number of assumptions from which to gather insights.
- Casual riders may be service workers who go to the city to work on weekends.
- Casual riders may have tighter budgets, which would explain why many of them use docked bikes (presumably the least expensive ride type).
- Casual riders borrow longer because they want a more cost-efficient way to get around (no need to pay for parking, gas, toll fees, etc.).
Note: The current workstation had difficulty processing and plotting about 5.6M data points on a map in R, so the data was exported to and visualized in Tableau.
The dashboard shows all start points on the map, including those whose station name is NA, and these points also fall close to Lake Shore Drive.
Moreover, the dashboard does not invalidate our observations and assumptions despite the additional data points; it in fact supports them.