Introduction
1.1 About the company
1.2 The business task
1.3 The leading questions
Data background
2.1 Data source
2.2 Data credibility by “ROCCC” model
2.3 Data License
Data processing
Data analysis and visualization
Findings
5.1 Conclusion of my analysis
5.2 Recommendations based on my data analysis
Cyclistic operates a bike-share program in Chicago, USA, featuring an extensive fleet of over 5,800 bicycles and 600 docking stations. What distinguishes Cyclistic is its commitment to inclusivity, offering not only traditional bicycles but also reclining bikes, hand tricycles, and cargo bikes. This diversity caters to individuals with disabilities and those who may find it challenging to use a standard two-wheeled bike.
Up until now, Cyclistic’s marketing strategy has centered around creating general awareness and targeting a broad consumer base. The flexibility in pricing plans, offering options such as single-ride passes, full-day passes, and annual memberships, has been instrumental in achieving this. Customers opting for single-ride or full-day passes are categorized as casual riders, while those choosing annual memberships are considered Cyclistic members.
Previous data analyses have revealed that annual members contribute significantly more to profitability compared to casual riders. Consequently, the current focus of the company is to transition occasional users into long-term subscribers. This report aims to compare the activity patterns of members and casual users, providing insights to formulate an effective marketing strategy for the identified market segment.
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
I will use Cyclistic’s publicly accessible historical data, stored on their cloud servers, for this project. The data can be accessed here. My focus is on the data from December 2022 to November 2023, organized into 12 separate CSV files, one per month.
Reliable: The data is accurate, thorough, and impartial.
Original: The data is verified by its original source.
Comprehensive: It contains all the information needed to address the guiding questions.
Current: The data covers the most recent 12 months (December 2022 to November 2023).
Cited: The data is trustworthy and referenced from the source in the preceding paragraph.
Effective data processing is a pivotal aspect of any comprehensive report. It involves the systematic collection, validation, and analysis of information to derive meaningful insights and support informed decision-making. This section will delve into the significance of data processing within the context of the report, highlighting its role in ensuring accuracy, reliability, and relevance.
Installs and loads essential R packages, including “readr,” “tidyverse,” “dplyr,” and “ggplot2,” to facilitate data processing and visualization.
install.packages("readr")
install.packages("tidyverse")
install.packages("dplyr")
install.packages("ggplot2")
library(readr)
library(tidyverse)
library(dplyr)
library(ggplot2)
Reads individual CSV files corresponding to different months, collecting raw data from the specified file paths.
y2022_12 <- read.csv("Desktop/Cyclistic/202212-divvy-tripdata.csv")
y2023_01 <- read.csv("Desktop/Cyclistic/202301-divvy-tripdata.csv")
y2023_02 <- read.csv("Desktop/Cyclistic/202302-divvy-tripdata.csv")
y2023_03 <- read.csv("Desktop/Cyclistic/202303-divvy-tripdata.csv")
y2023_04 <- read.csv("Desktop/Cyclistic/202304-divvy-tripdata.csv")
y2023_05 <- read.csv("Desktop/Cyclistic/202305-divvy-tripdata.csv")
y2023_06 <- read.csv("Desktop/Cyclistic/202306-divvy-tripdata.csv")
y2023_07 <- read.csv("Desktop/Cyclistic/202307-divvy-tripdata.csv")
y2023_08 <- read.csv("Desktop/Cyclistic/202308-divvy-tripdata.csv")
y2023_09 <- read.csv("Desktop/Cyclistic/202309-divvy-tripdata.csv")
y2023_10 <- read.csv("Desktop/Cyclistic/202310-divvy-tripdata.csv")
y2023_11 <- read.csv("Desktop/Cyclistic/202311-divvy-tripdata.csv")
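The twelve read.csv() calls above could also be written as a loop over the file names. The following is a minimal sketch, not the code used for this report: it writes two tiny made-up CSV files to a temporary folder (the folder name, columns, and values are illustrative assumptions) so that it runs on its own, then reads every file matching the "-divvy-tripdata.csv" pattern into a named list.

```r
# Illustrative only: create two tiny CSV files in a temporary folder
# so the example runs without the real Cyclistic downloads.
data_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202212-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "member"),
          file.path(data_dir, "202301-divvy-tripdata.csv"), row.names = FALSE)

# Find every monthly file and read each one into a list element.
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv)
names(monthly) <- substr(basename(files), 1, 6)  # year and month taken from the file name

# Number of rows read from each file.
sapply(monthly, nrow)
```

Keeping the months in a list means a later cleaning step can be applied to all of them with a single lapply() call instead of twelve near-identical lines.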
Addresses data type inconsistencies for the start_station_id and end_station_id columns across multiple months, ensuring numerical consistency for subsequent analyses.
y2022_12 <- mutate(y2022_12, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_01 <- mutate(y2023_01, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_02 <- mutate(y2023_02, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_03 <- mutate(y2023_03, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_04 <- mutate(y2023_04, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_05 <- mutate(y2023_05, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_06 <- mutate(y2023_06, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_07 <- mutate(y2023_07, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_08 <- mutate(y2023_08, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_09 <- mutate(y2023_09, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_10 <- mutate(y2023_10, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
y2023_11 <- mutate(y2023_11, start_station_id = as.numeric(start_station_id), end_station_id = as.numeric(end_station_id))
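The same coercion can be wrapped in a small function, and it is worth checking what the coercion does to the values. The sketch below uses a made-up data frame (the IDs are invented for illustration) and a hypothetical helper, coerce_station_ids(). Note that as.numeric() returns NA for any ID that contains letters, so counting the NA values afterwards shows how many rows a later drop_na() would remove.

```r
library(dplyr)

# Made-up example: one start ID contains letters, the others look numeric.
demo <- data.frame(
  start_station_id = c("13022", "TA1307"),
  end_station_id   = c("13300", "15544"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the same coercion to both ID columns.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
coerce_station_ids <- function(df) {
  df %>%
    mutate(across(c(start_station_id, end_station_id),
                  ~ suppressWarnings(as.numeric(.x))))
}

demo_num <- coerce_station_ids(demo)

# How many IDs could not be converted in each column?
colSums(is.na(demo_num))
```

If a large share of IDs turns out to be non-numeric, converting every month to character instead would also make the column types consistent without losing values.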
Combines the individual monthly data frames into a unified data frame (y2023), consolidating data for a comprehensive analysis.
y2023 <- bind_rows(
y2022_12, y2023_01, y2023_02, y2023_03, y2023_04, y2023_05,
y2023_06, y2023_07, y2023_08, y2023_09, y2023_10, y2023_11
)
Identifies and eliminates duplicate rows within the merged data frame (y2023) to maintain data integrity.
y2023 <- y2023 %>% distinct()
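distinct() returns a new data frame rather than modifying its input, so the result has to be assigned for the duplicates to actually be removed. A minimal sketch on made-up rows:

```r
library(dplyr)

# Made-up rides; the second row is an exact duplicate of the first.
rides <- data.frame(
  ride_id       = c("r1", "r1", "r2"),
  member_casual = c("member", "member", "casual")
)

rides %>% distinct()   # prints the de-duplicated rows, but `rides` is unchanged
nrow(rides)            # still 3

rides_unique <- rides %>% distinct()  # assign to keep the result
nrow(rides_unique)                    # 2
```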
Excludes unnecessary columns (start_lat, start_lng, end_lat, end_lng) from the data frame to streamline and focus on relevant information.
y2023 <- y2023 %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
Converts the started_at and ended_at columns from character (chr) to POSIXct format, facilitating accurate date and time analyses.
y2023$started_at <- as.POSIXct(y2023$started_at, format = "%Y-%m-%d %H:%M:%S")
y2023$ended_at <- as.POSIXct(y2023$ended_at, format = "%Y-%m-%d %H:%M:%S")
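When parsing timestamps it can help to set the time zone explicitly and to count the values that failed to parse, since as.POSIXct() returns NA for strings that do not match the format. The sketch below uses made-up timestamps; the "America/Chicago" time zone is an assumption based on where Cyclistic operates, not something stated in the data.

```r
# Made-up timestamps; the last one does not match the expected format.
started_raw <- c("2023-07-04 09:15:00", "2023-07-04 17:40:30", "04/07/2023 18:00")

started <- as.POSIXct(started_raw, format = "%Y-%m-%d %H:%M:%S",
                      tz = "America/Chicago")

# Number of timestamps that failed to parse.
sum(is.na(started))

# Successfully parsed values keep their clock time.
format(started[1], "%H:%M")
```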
Introduces new columns (date, month, day, year, day_of_week) derived from the started_at column to enhance temporal analysis.
y2023$date <- as.Date(y2023$started_at)
y2023$month <- format(as.Date(y2023$date), "%m")
y2023$day <- format(as.Date(y2023$date), "%d")
y2023$year <- format(as.Date(y2023$date), "%Y")
y2023$day_of_week <- weekdays(y2023$started_at)
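weekdays() returns locale-dependent character strings, which sort alphabetically in tables and plots. One option, sketched below on made-up dates, is to store the weekday as a factor with an explicit Sunday-to-Saturday order. This version derives the weekday from the numeric wday field of as.POSIXlt() so that it does not depend on the locale; the variable names are illustrative.

```r
# Made-up dates: a Saturday, a Sunday, and a Monday in July 2023.
started_at <- as.POSIXct(c("2023-07-08 10:00:00",
                           "2023-07-02 12:00:00",
                           "2023-07-03 08:30:00"), tz = "UTC")

week_order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# wday is 0 for Sunday through 6 for Saturday, independent of locale.
wday_index <- as.POSIXlt(started_at)$wday

# Factor whose levels follow the calendar week rather than the alphabet.
day_of_week <- factor(week_order[wday_index + 1], levels = week_order)

sort(day_of_week)   # sorted Sunday first, Saturday last
```

With the levels fixed this way, grouped summaries and ggplot2 axes follow the calendar week without a separate scale_x_discrete(limits = ...) call.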
Calculates the duration of each ride and adds a new column (ride_length_s) to quantify the length of rides in seconds.
y2023$ride_length_s <- as.numeric(difftime(y2023$ended_at, y2023$started_at, units = "secs"))
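By default difftime() chooses its own unit (units = "auto"), so the number returned by as.numeric() is not guaranteed to be in seconds. The small sketch below, using two made-up timestamps, shows the difference that an explicit units = "secs" makes.

```r
# Made-up start and end times, 30 minutes apart.
start <- as.POSIXct("2023-07-04 10:00:00", tz = "UTC")
end   <- as.POSIXct("2023-07-04 10:30:00", tz = "UTC")

# With the default units = "auto", R picks the unit itself (minutes here).
auto_value <- as.numeric(difftime(end, start))

# Asking for seconds explicitly makes the result unambiguous.
secs_value <- as.numeric(difftime(end, start, units = "secs"))

c(auto = auto_value, secs = secs_value)
```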
Identifies and removes rows with missing values (NA) to ensure data completeness and accuracy for subsequent analyses.
y2023 <- drop_na(y2023)
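Before dropping rows it can be useful to see which columns contain the missing values and how many rows will be lost. The following sketch works on a made-up data frame; the column values are invented for illustration.

```r
library(tidyr)

# Made-up rides with one missing station ID and one missing end time.
rides <- data.frame(
  ride_id          = c("r1", "r2", "r3", "r4"),
  start_station_id = c(101, NA, 103, 104),
  ended_at         = c("2023-07-04 10:30:00", "2023-07-04 11:00:00",
                       NA, "2023-07-04 12:15:00")
)

# Missing values per column.
na_counts <- colSums(is.na(rides))
na_counts

# Drop incomplete rows and report how many were removed.
rides_complete <- drop_na(rides)
nrow(rides) - nrow(rides_complete)
```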
Certain simple findings work well as tables as shown above. Other findings I will present using ggplot2 for data visualizations.
##Viz of the Average Ride Time for Member and Casual
y2023 %>%
group_by(member_casual) %>%
summarise(ride_length_m = mean(ride_length_s/60)) %>%
ggplot(aes(x = member_casual, y = ride_length_m, fill = member_casual)) +
geom_col(position = "dodge") +
geom_text(aes(label = sprintf("%.2f", ride_length_m)),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3, color = "black") + # Adjust the position and appearance of the labels
labs(title = "Average Ride Time for Member and Casual",
x = " ",
y = "Average Riding Time (in minutes)") +
scale_fill_manual(values = c("casual" = "red", "member" = "turquoise")) +
theme_minimal()
Average ride times differ markedly between the two groups (25.27 minutes for casual riders and 12.55 minutes for members), which may indicate distinct usage patterns or preferences:
Casual Riders: The longer average ride time for casual riders (25 minutes) might suggest that they use the bike-sharing service for leisurely or recreational purposes. Casual riders may be more inclined to explore the city, take scenic routes, or engage in extended rides.
Member Riders: The shorter average ride time for members (12.5 minutes) could imply that members tend to use the bike-sharing service for more utilitarian purposes, such as commuting or short-distance travel. Members may be familiar with the system and may primarily use it for efficient point-to-point transportation.
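Because a mean can be pulled upward by a small number of very long rides, it may be worth comparing it with the median before reading too much into the gap. The sketch below uses made-up ride lengths, not the Cyclistic data, to show how the two statistics can diverge.

```r
library(dplyr)

# Made-up ride lengths in seconds; one casual ride is an extreme outlier.
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length_s = c(600, 900, 7200, 480, 600, 720)
)

# Mean and median ride length in minutes for each rider type.
ride_summary <- rides %>%
  group_by(member_casual) %>%
  summarise(
    mean_min   = mean(ride_length_s) / 60,
    median_min = median(ride_length_s) / 60,
    .groups = "drop"
  )

ride_summary
```

In this toy example the casual mean is far above the casual median, while the two agree for members. Whether the real data behaves the same way is not something this sketch can show.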
##Viz of the Average Ride time for Casual and Member Riders by Weekday
y2023 %>%
group_by(member_casual, day_of_week) %>%
summarise(ride_length_m = mean(ride_length_s/60), .groups = "drop") %>%
select(member_casual, day_of_week, ride_length_m) %>%
ggplot(aes(x = day_of_week, y = ride_length_m, fill = member_casual, label = sprintf("%.1f", ride_length_m))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(position = position_dodge(width = 0.9), vjust = -0.5, size = 3, aes(group = member_casual), color = "black") +
labs(title = "Average Ride Duration for Casual and Member Riders by Weekday",
x = " ",
y = "Average Ride Duration (minutes)") +
scale_fill_manual(values = c("casual" = "red", "member" = "turquoise")) +
theme_minimal() +
scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
The data suggests that casual riders exhibit a more pronounced change in behavior during weekends, engaging in longer and more variable rides, while member riders maintain a more consistent and utilitarian approach to bike-sharing, with minor fluctuations in ride duration.
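The weekend effect can be expressed as a single number per rider type: the average weekend ride length minus the average weekday ride length. The sketch below computes it on made-up values; the day_type column and the numbers are illustrative assumptions.

```r
library(dplyr)

# Made-up ride lengths in minutes with the weekday of each ride.
rides <- data.frame(
  member_casual = rep(c("casual", "member"), each = 4),
  day_of_week   = rep(c("Saturday", "Sunday", "Tuesday", "Wednesday"), times = 2),
  ride_length_m = c(30, 28, 20, 22, 14, 13, 12, 11)
)

weekend_gap <- rides %>%
  # Label each ride as weekend or weekday.
  mutate(day_type = if_else(day_of_week %in% c("Saturday", "Sunday"),
                            "weekend", "weekday")) %>%
  group_by(member_casual, day_type) %>%
  summarise(avg_min = mean(ride_length_m), .groups = "drop") %>%
  # Weekend average minus weekday average, per rider type.
  group_by(member_casual) %>%
  summarise(gap_min = avg_min[day_type == "weekend"] - avg_min[day_type == "weekday"],
            .groups = "drop")

weekend_gap
```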
##Viz of the Number of Total Rides for Casual and Member
y2023 %>%
group_by(member_casual) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = member_casual, y = number_of_rides, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = number_of_rides), position = position_dodge(width = 0.9), vjust = -0.5) +
labs(title = "Number of Total Rides for Casual and Member",
x = " ",
y = "Total Rides") +
scale_fill_manual(values = c("casual" = "red", "member" = "turquoise")) +
theme_minimal() +
scale_x_discrete(limits = c("casual", "member")) +
scale_y_continuous(labels = scales::comma)
The significantly higher number of total rides for members (596,389) compared to casual riders (410,549) indicates a substantial disparity in the frequency of bike-sharing usage. Members, likely utilizing the service for regular commuting or routine activities, demonstrate a higher volume of transactions.
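The size of the disparity is easier to judge as a share or a ratio. The short calculation below uses only the two ride counts reported above; the variable names are illustrative.

```r
# Ride counts as reported above.
rides <- c(member = 596389, casual = 410549)

# Share of all rides per rider type, in percent.
share_pct <- rides / sum(rides) * 100
round(share_pct, 1)

# Member rides per casual ride.
ratio <- rides[["member"]] / rides[["casual"]]
round(ratio, 2)
```

With these counts, members account for roughly 59% of rides, or about 1.45 member rides for every casual ride.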
##Viz of the Number of Rides for Casual and Member Riders by Weekday
y2023 %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Number of Rides for Casual and Member Riders by Weekday",
x = " ",
y = "Number of Rides") +
scale_fill_manual(values = c("casual" = "red", "member" = "turquoise")) +
theme_minimal() +
scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
The analysis reveals divergent riding day preferences between member and casual riders. Member riders exhibit a consistent trend, favoring weekdays—Monday through Thursday—suggesting a utilitarian approach, possibly for commuting. In contrast, casual riders show a distinct preference for Thursday through Sunday, with a notable emphasis on Saturday and Sunday. This pattern implies a recreational or leisure-oriented usage during the weekend.
##Viz of the Number of Rides for Casual and Member Riders by Month
y2023 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = as.factor(month), y = number_of_rides, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Number of Rides for Casual and Member Riders by Month",
x = "Month",
y = "Number of Rides") +
scale_fill_manual(values = c("casual" = "red", "member" = "turquoise")) +
theme_minimal()
The observed patterns in the number of rides by month for both member and casual riders reveal interesting trends:
Member Riders: Members demonstrate a clear seasonality in bike-sharing, with peak ride counts during the warmer months, specifically in May, June, July, and August. Conversely, there is a notable decline in bike-sharing usage during the winter months (December, January, February, March), reflecting a seasonal decrease likely influenced by colder weather conditions.
Casual Riders: Casual riders exhibit a pronounced preference for summer months (June, July, August, September), emphasizing a peak in bike-sharing activity during warmer weather. Notably, the larger seasonal gap in casual riders’ usage patterns, in contrast to members, suggests heightened sensitivity to seasonal variations and potential influence from weather-dependent recreational activities.
Members and casual riders both exhibit a seasonality in their bike-sharing behavior, with peaks during the warmer months. However, casual riders, being more sensitive to seasonality, show a larger gap between the highest and lowest months, potentially influenced by weather-dependent recreational usage.
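One way to quantify the "seasonal gap" is the ratio between the busiest and the quietest month for each rider type. The sketch below uses made-up monthly counts, not the Cyclistic figures, and the peak_to_trough name is an illustrative choice.

```r
library(dplyr)

# Made-up monthly ride counts for illustration only.
monthly <- data.frame(
  member_casual   = rep(c("casual", "member"), each = 4),
  month           = rep(c("01", "04", "07", "10"), times = 2),
  number_of_rides = c(1000, 8000, 20000, 9000, 6000, 15000, 24000, 18000)
)

seasonality <- monthly %>%
  group_by(member_casual) %>%
  summarise(
    peak_month     = month[which.max(number_of_rides)],
    trough_month   = month[which.min(number_of_rides)],
    peak_to_trough = max(number_of_rides) / min(number_of_rides),
    .groups = "drop"
  )

seasonality
```

A higher ratio means a stronger seasonal swing, so comparing the two rows gives a direct measure of which group is more sensitive to the time of year.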
##Viz of Top 10 Start Stations for Casual Riders by Total Rides
y2023 %>%
filter(member_casual == "casual") %>%
group_by(start_station_name) %>%
summarise(total_rides = n()) %>%
arrange(desc(total_rides)) %>%
mutate(start_station_name = factor(start_station_name, levels = start_station_name)) %>%
head(10) %>%
ggplot(aes(x = start_station_name, y = total_rides, fill = start_station_name)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Start Stations for Casual Riders by Total Rides",
x = "Start Station",
y = "Total Rides") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = rainbow(10))
The top ten stations for casual riders are led by “Streeter Dr & Grand Ave” and “DuSable Lake Shore Dr & Monroe St” by a significant margin, which may point to specific points of interest or popular destinations. These stations could be located near tourist attractions, recreational areas, or cultural landmarks that attract a higher volume of casual users. The notable gap between the first two stations and the rest may indicate a concentration of recreational or tourist activity around these two locations, reinforcing their prominence among casual riders.
##Viz of Top 10 Start Stations for Member Riders by Total Rides
y2023 %>%
filter(member_casual == "member") %>%
group_by(start_station_name) %>%
summarise(total_rides = n()) %>%
arrange(desc(total_rides)) %>%
mutate(start_station_name = factor(start_station_name, levels = start_station_name)) %>%
head(10) %>%
ggplot(aes(x = start_station_name, y = total_rides, fill = start_station_name)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Start Stations for Member Riders by Total Rides",
x = "Start Station",
y = "Total Rides") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = rainbow(10))
The relatively uniform distribution and small gaps among the top ten stations for member riders suggest a more evenly spread utilization pattern. This could indicate that members use bike-sharing for commuting or routine activities across various neighborhoods throughout the city. The absence of pronounced gaps may signify that members are not concentrated around specific landmarks or attractions, which is consistent with members using bike-sharing as a convenient and efficient means of transportation for regular activities.
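How concentrated each group's rides are can be summarised as the share of rides that start at the single busiest station. The sketch below uses made-up station counts (the station names and numbers are invented) in which casual rides cluster at one station and member rides are spread out.

```r
library(dplyr)

# Made-up station counts for illustration only.
station_rides <- data.frame(
  member_casual      = rep(c("casual", "member"), each = 4),
  start_station_name = rep(c("Station A", "Station B", "Station C", "Station D"),
                           times = 2),
  total_rides        = c(700, 100, 100, 100, 280, 260, 240, 220)
)

# Share of each group's rides that start at its busiest station.
concentration <- station_rides %>%
  group_by(member_casual) %>%
  summarise(top_station_share = max(total_rides) / sum(total_rides),
            .groups = "drop")

concentration
```

A share close to 1 / (number of stations) indicates an even spread, while a much larger share indicates that one location dominates.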
Throughout the analysis of bike-sharing data in Chicago, several key conclusions emerge:
Ride Duration Differences: Casual riders, with an average ride time of 25 minutes, showcase a more leisure-oriented and variable usage pattern compared to members, who have an average ride time of 12.5 minutes, suggesting utilitarian and routine-based behavior.
Weekly Riding Patterns: Casual riders exhibit a substantial increase in average ride time during weekends, with an almost 8-minute difference from midweek, while members maintain a more consistent pattern, indicating potential recreational usage by casual riders during weekends.
Total Ride Counts: Members significantly outnumber casual riders in total ride counts (596,389 vs. 410,549), highlighting the higher frequency of bike-sharing usage among members, likely for commuting or routine activities.
Seasonal Trends: Both casual and member riders show a preference for bike-sharing during the warmer months, with a larger seasonal gap in casual riders’ usage patterns, indicating heightened sensitivity to seasonal variations and recreational activities.
Preferred Riding Days: Members favor weekdays, particularly Monday through Thursday, suggesting utilitarian usage for commuting, while casual riders prefer the latter part of the week, especially weekends, indicating a more leisure-oriented and recreational approach.
Station Preferences: Top stations for casual riders, such as “Streeter Dr & Grand Ave”, may be strategically located near tourist attractions, cultural landmarks, or recreational areas, while the more uniform distribution of top stations for members suggests even utilization across various neighborhoods for routine activities.
These conclusions provide valuable insights for targeted marketing, service enhancements, and resource allocation to better meet the distinct needs and behaviors of casual and member riders in the Chicago bike-sharing system.
How do annual members and casual riders use Cyclistic bikes differently? Annual members exhibit more consistent and utilitarian usage, likely for commuting or routine activities, with shorter average ride times (12.5 minutes). Casual riders, with longer average ride times (25 minutes), show a more variable and leisure-oriented pattern.
Why would casual riders buy Cyclistic annual memberships? Position Cyclistic annual memberships as a cost-effective and convenient option for casual riders who frequent the service, especially during the summer peak. Emphasize benefits like unlimited rides and discounts for longer-term commitments.
How can Cyclistic use digital media to influence casual riders to become members? Launch targeted digital media campaigns highlighting the advantages of Cyclistic annual memberships, such as cost savings and seamless access. Leverage social media, online advertising, and influencers to reach casual riders during the summer peak, emphasizing the value and convenience of becoming annual members.