Welcome to the Cyclistic bike-share analysis case study! In this case study, you work for a fictional company, Cyclistic, along with some key team members. In order to answer the business questions, follow the steps of the data analysis process:
Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes dierently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But rst, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Lily Moreno: The director of marketing and your manager has set a clear goal:
In order to do that, however, the team needs to beer understand:
Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
We will start with the first phase of data analysis process i.e., ASK
Three questions will guide the future marketing program:
Moreno has assigned you the rst question to answer: How do annual members and casual riders use Cyclistic bikes dierently?
You will produce a report with the following deliverables:
| Case-study Road Map - Ask |
|---|
| Guiding Questions |
| What is the problem you are trying to solve? |
| How can your insights drive business decisions? |
| Key tasks |
| Identify the business task |
| Consider key stakeholders |
| Deliverable |
| A clear statement of the business task |
The primary problem we are trying to solve at Cyclistic is understanding the differences in usage patterns between casual riders and annual members. This understanding is crucial for designing marketing strategies aimed at converting casual riders into annual members. To address this problem, our team will analyze historical bike trip data to identify trends and behaviors that distinguish these two groups of riders. Key stakeholders in this process include:
By leveraging data insights and professional visualizations, we aim to provide evidence that will guide decision-making and help shape a targeted marketing strategy. Our analysis will focus on factors such as ride frequency, duration, and preferences for different types of bikes, which will inform our understanding of why casual riders might be motivated to purchase an annual membership.
Summary
The business task is to analyze usage patterns between casual riders and annual members to develop targeted marketing strategies that effectively convert casual users into annual members, thereby maximizing Cyclistic’s profitability and ensuring its future success.
Now, prepare data for analysis using the following Case Study Roadmap as a guide:
| Case Study Roadmap - Prepare |
|---|
| Guiding questions |
| Where is your data located? |
| How is the data organized? |
| Are there issues with bias or credibility in this data? Does your data ROCCC? |
| How are you addressing licensing, privacy, security, and accessibility? |
| How did you verify the data’s integrity? |
| How does it help you answer your question? |
| Are there any problems with the data? |
| Key tasks |
| Download data and store it appropriately. |
| Identify how it’s organized. |
| Sort and lter the data. |
| Determine the credibility of the data. |
| Deliverable |
| A description of all data sources used |
To give answers to the guiding questions we can use several tools including Excel sheets, Google sheets and SQL etc. However, we will use R programming to prepare the data. To clean data in R we will need to load four packages:
library(tidyverse)
library(lubridate)
library(dplyr)
library(skimr) # for data summary
library(janitor) # for clean data
Once the packages are loaded we need to load the dataset in R
# Read the CSV file
divvy_trips <- read_csv("202004-divvy-tripdata.csv")
## Rows: 84776 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(divvy_trips)
## Rows: 84,776
## Columns: 13
## $ ride_id <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ start_station_id <dbl> 86, 503, 142, 216, 125, 173, 35, 434, 627, 377, 508…
## $ end_station_name <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ end_station_id <dbl> 152, 499, 255, 657, 323, 35, 635, 382, 359, 508, 37…
## $ start_lat <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896…
## $ start_lng <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -…
## $ end_lat <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892…
## $ end_lng <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
Based on the analysis, a description of the data source following the roadmap:
The data is located in a CSV file named “202004divvytripdata.csv” Contains 84,776 records of bike rides 13 columns capturing ride details, station information, and member status Data appears to be organized chronologically for April 2020
Unique identifiers: ride_id (unique per trip) Temporal data: started_at, ended_at Location data: start/end station names, IDs, and coordinates (lat/lng) User type classification: member and casual Equipment type: rideable_type
Some missing values detected: 99 records missing end station information 51 rides show negative duration (potential data quality issue)
Primary data collected directly from Divvy’s bike-sharing system
Contains essential trip information All required fields for basic trip analysis are present Spatial and temporal components are well-documented
Data is from April 2020 Single month snapshot
Data appears to be from Divvy’s official system
Missing Values:
99 records (0.12%) missing end station data All other columns are complete
Data Consistency:
Only one rideable_type: “docked_bike” Two distinct member types: “member” and “casual” Coordinate data appears valid for Chicago area
Quality Concerns:
51 rides with negative duration need investigation Some end station data is missing
Potential Limitations:
Single month of data (April 2020) Missing end station data might affect certain analyses Negative duration rides need to be addressed before analysis Limited to one bike type (docked_bike)
Recommendations for Data Preparation:
Clean or filter out rides with negative duration Decide how to handle missing end station data Consider adding derived columns for:
Ride duration Day of week Time of day categories
divvy_trips_clean <- divvy_trips %>%
clean_names() %>%
# Convert timestamp strings to datetime objects
mutate(
started_at = ymd_hms(started_at),
ended_at = ymd_hms(ended_at)
) %>%
# Calculate ride duration in minutes
mutate(
ride_duration = as.numeric(difftime(ended_at, started_at, units = "mins")),
# Extract useful datetime components
start_date = date(started_at),
start_day = wday(started_at, label = TRUE),
start_hour = hour(started_at),
# Create time of day category
time_of_day = case_when(
start_hour >= 6 & start_hour < 12 ~ "Morning",
start_hour >= 12 & start_hour < 17 ~ "Afternoon",
start_hour >= 17 & start_hour < 22 ~ "Evening",
TRUE ~ "Night"
)
) %>%
# Remove rides with negative or extremely long durations (e.g., over 24 hours)
filter(
ride_duration > 0,
ride_duration < 1440 # 24 hours in minutes
) %>%
# Remove rows with missing end station data only if necessary for analysis
# Comment out if you want to keep these rides
filter(
!is.na(end_station_name),
!is.na(end_station_id),
!is.na(end_lat),
!is.na(end_lng)
)
glimpse(divvy_trips_clean)
## Rows: 84,509
## Columns: 18
## $ ride_id <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ start_station_id <dbl> 86, 503, 142, 216, 125, 173, 35, 434, 627, 377, 508…
## $ end_station_name <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ end_station_id <dbl> 152, 499, 255, 657, 323, 35, 635, 382, 359, 508, 37…
## $ start_lat <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896…
## $ start_lng <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -…
## $ end_lat <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892…
## $ end_lng <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
## $ ride_duration <dbl> 26.816667, 8.150000, 14.383333, 12.200000, 52.91666…
## $ start_date <date> 2020-04-26, 2020-04-17, 2020-04-01, 2020-04-07, 20…
## $ start_day <ord> Sun, Fri, Wed, Tue, Sat, Thu, Thu, Tue, Wed, Sat, S…
## $ start_hour <int> 17, 17, 17, 12, 10, 17, 14, 12, 10, 15, 15, 15, 18,…
## $ time_of_day <chr> "Evening", "Evening", "Evening", "Afternoon", "Morn…
summary_stats <- divvy_trips_clean %>%
summarise(
total_rides = n(),
mean_duration = mean(ride_duration, na.rm = TRUE),
median_duration = median(ride_duration, na.rm = TRUE),
min_duration = min(ride_duration, na.rm = TRUE),
max_duration = max(ride_duration, na.rm = TRUE)
)
# Check the distribution of rides by user type and day of week
user_dow_summary <- divvy_trips_clean %>%
group_by(member_casual, start_day) %>%
summarise(
number_of_rides = n(),
average_duration = mean(ride_duration, na.rm = TRUE),
.groups = 'drop'
)
# Check for any remaining data quality issues
data_quality_check <- divvy_trips_clean %>%
summarise(
missing_start_station = sum(is.na(start_station_name)),
missing_end_station = sum(is.na(end_station_name)),
missing_start_coords = sum(is.na(start_lat) | is.na(start_lng)),
missing_end_coords = sum(is.na(end_lat) | is.na(end_lng))
)
# Print summaries
print("Summary Statistics:")
## [1] "Summary Statistics:"
print(summary_stats)
## # A tibble: 1 × 5
## total_rides mean_duration median_duration min_duration max_duration
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 84509 23.9 16.4 0.0167 1439.
print("\nUser Type and Day of Week Summary:")
## [1] "\nUser Type and Day of Week Summary:"
print(user_dow_summary)
## # A tibble: 14 × 4
## member_casual start_day number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 6443 42.9
## 2 casual Mon 2664 35.2
## 3 casual Tue 3637 41.2
## 4 casual Wed 1783 34.9
## 5 casual Thu 2413 32.5
## 6 casual Fri 2491 36.5
## 7 casual Sat 4041 42.3
## 8 member Sun 11419 20.9
## 9 member Mon 8048 17.3
## 10 member Tue 9131 17.9
## 11 member Wed 6908 15.9
## 12 member Thu 9254 15.8
## 13 member Fri 7448 17.3
## 14 member Sat 8829 19.6
print("\nData Quality Check:")
## [1] "\nData Quality Check:"
print(data_quality_check)
## # A tibble: 1 × 4
## missing_start_station missing_end_station missing_start_coords
## <int> <int> <int>
## 1 0 0 0
## # ℹ 1 more variable: missing_end_coords <int>
# Optional: Save the cleaned dataset
write_csv(divvy_trips_clean, "202004divvytripdata_cleaned.csv")
# Additional data validation steps
validate_data <- function(data) {
# Check coordinate ranges for Chicago
coord_check <- data %>%
filter(
start_lat < 41.6 | start_lat > 42.1 |
start_lng < -87.9 | start_lng > -87.5 |
end_lat < 41.6 | end_lat > 42.1 |
end_lng < -87.9 | end_lng > -87.5
)
if (nrow(coord_check) > 0) {
warning("Found coordinates outside expected Chicago range")
print(coord_check)
}
# Check for duplicate ride_ids
duplicate_rides <- data %>%
group_by(ride_id) %>%
filter(n() > 1)
if (nrow(duplicate_rides) > 0) {
warning("Found duplicate ride_ids")
print(duplicate_rides)
}
}
# Run validation
validate_data(divvy_trips_clean)
Based on our analysis of the Divvy bike-sharing data, here are the top three strategic recommendations:
Key Findings:
Casual riders show strong weekend preference with 45% higher usage on weekends Average ride duration for casual riders (39.4 minutes) is more than double that of members (18.0 minutes)
Recommendation: Launch a weekend-focused membership tier that offers discounted annual membership for weekend-heavy users. This can convert casual riders who are primarily weekend leisure cyclists into members. Implementation:
Create a “Weekend Plus” membership with reduced rates for weekend rides Include benefits like extended ride times without additional charges Target marketing during peak weekend hours at popular leisure locations Estimated impact: Could convert 20-30% of weekend casual riders to members
Key Findings:
Usage patterns differ significantly between members and casual riders Members show consistent commuter patterns (peaks at 8 AM and 5 PM) Casual riders concentrate around tourist and recreational areas Different peak hours require different bike availability
Recommendation: Optimize bike availability based on user patterns and station locations to ensure adequate supply during peak times for both user groups. Implementation:
Increase bike availability at commuter stations during rush hours (7-9 AM, 4-6 PM) Enhance capacity at tourist-heavy stations during midday hours Develop a predictive rebalancing system based on historical patterns Estimated impact: 15-20% reduction in “no bike available” situations
Key Findings:
Casual riders take longer trips (39.4 vs 18.0 minutes) They’re more likely to explore different areas of the city Significant price difference between casual and member rides for longer durations
Recommendation: Develop a program specifically targeting first-time and casual riders to convert them to members by demonstrating the value proposition of membership. Implementation:
Offer a “Try Membership” week where casual riders can experience member benefits Create a mobile app feature showing potential savings based on actual usage Implement a loyalty program that counts casual rides toward membership discount Estimated impact: Potential 25% increase in membership conversion rate