step1:Ask

1.What is the problem you are trying to solve? The difference in use between casual riders and annual (member)riders.

2.What is the stakeholder’s expectation? Design marketing strategies aimed at converting casual riders into annual members. The team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

3.How can your insights drive business decisions? The results of this analysis will be used to design a new marketing strategy to convert casual riders to annual members.

Identify business task: Analyze the differences in bike usage between casual riders and annual members to develop insights that will inform a marketing strategy aimed at converting casual riders into annual members.

Consider key stakeholders Director of Marketing (Lily Moreno), Cyclistic marketing analytics team and Cyclistic executive team

Install important packages:

library(tidyverse)
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(bigrquery)
install.packages("DT")
library(DT)

I did prepare and process steps in BigQuery.

Step2:Prepare(Data Exploration)

1) ride_id: the length of the ride id should be uniform

query_ride_id <- "
SELECT LENGTH(ride_id) as rider_id_length, COUNT(*) AS count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
GROUP BY LENGTH(ride_id)
"

# Execute query
ride_id_length <- bq_table_download(bq_project_query(project_id, query_ride_id))

# View results
# Create an interactive DataTable with proper alignment
datatable(
  ride_id_length, 
  colnames = c("Rider ID Length", "Count"),          # Custom column names
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center; font-size: 14px; color: grey;',
    "Interactive Table: Rider ID Length Summary"
  ),                                                 # Add a caption at the bottom
  options = list(
    pageLength = 5,                                  # Default number of rows per page
    autoWidth = TRUE,                                # Automatically adjust column width
    dom = 'Bfrtip',                                  # Add buttons (copy, CSV, etc.)
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'), # Export buttons
    className = 'hover',                             # Highlight rows on hover
    columnDefs = list(
      list(className = 'dt-center', targets = "_all"), # Center-align all columns
      list(width = '50%', targets = 0),              # Set specific width for the first column
      list(width = '50%', targets = 1)               # Set specific width for the second column
    ),
    initComplete = JS(                               # JavaScript for better theme
      "function(settings, json) {
         $(this.api().table().container()).css({'font-family': 'Arial', 'font-size': '14px'});
      }"
    )
  ),
  extensions = c('Buttons', 'Responsive')           # Add export buttons and responsiveness
)

–the ride_id is consistent with 16 characters.

2) rideable_type: determine the type of bikes

query_rideable_type <- "
SELECT rideable_type, COUNT(*) AS count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
GROUP BY rideable_type
"

# Execute query
rideable_type <- bq_table_download(bq_project_query(project_id, query_rideable_type))

–there are two types of bike: electric and docked

3) started_at, ended_at: ride duration

query_ride_duration <- "
SELECT ride_id, started_at, ended_at,
       TIMESTAMP_DIFF(ended_at, started_at, MINUTE) AS duration_minutes
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
WHERE 
  TIMESTAMP_DIFF(ended_at, started_at, MINUTE) <= 1 OR
  TIMESTAMP_DIFF(ended_at, started_at, MINUTE) >= 1440
"

# Execute query
ride_duration <- bq_table_download(bq_project_query(project_id, query_ride_duration))

– check if the ride time is less than a minute or longer than a day – the end time is behind the start time – TIMESTAMP is in YYYY-MM-DD hh:mm:ss UTC format

4) name & id of start_station and end_station

query_start_station <- "
SELECT start_station_name, COUNT(*) AS start_station_count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
GROUP BY start_station_name
"

# Execute query
start_station <- bq_table_download(bq_project_query(project_id, query_start_station))

– a total of 82958 start_station_name with null values are determined

query_end_station <- "
SELECT end_station_name, COUNT(*) AS end_station_count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
GROUP BY end_station_name
"

# Execute query
end_station <-bq_table_download(bq_project_query(project_id, query_end_station))

– 97645 null values observed – the string lengths of station id are inconsistent. however, it will be ignored

5) lat & lng of start and end

query_lat_lng <- "
SELECT COUNT(*) AS null_count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
WHERE start_lat IS NULL OR start_lng IS NULL OR
      end_lat IS NULL OR end_lng IS NULL
"

# Execute query
lat_lng_nulls <- bq_table_download(bq_project_query(project_id, query_lat_lng))

– 4145 null values are observed

6) member_casual: type of membership

query_membership <- "
SELECT member_casual, COUNT(*) AS membership_count
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
GROUP BY member_casual
"

# Execute query
membership_types <- bq_table_download(bq_project_query(project_id, query_membership))

– only two types: member and causal – total membership count : 2073615 – total Casual count: 1336495

7) Count Null Values

query_null_values <- "
SELECT
  COUNTIF(ride_id IS NULL) AS null_count_ride_id,
  COUNTIF(rideable_type IS NULL) AS null_count_rideable_type,
  COUNTIF(started_at IS NULL) AS null_count_started_at,
  COUNTIF(ended_at IS NULL) AS null_count_ended_at,
  COUNTIF(start_station_name IS NULL) AS null_count_start_station_name,
  COUNTIF(start_station_id IS NULL) AS null_count_start_station_id,
  COUNTIF(end_station_name IS NULL) AS null_count_end_station_name,
  COUNTIF(end_station_id IS NULL) AS null_count_end_station_id,
  COUNTIF(start_lat IS NULL) AS null_count_start_lat,
  COUNTIF(start_lng IS NULL) AS null_count_start_lng,
  COUNTIF(end_lat IS NULL) AS null_count_end_lat,
  COUNTIF(end_lng IS NULL) AS null_count_end_lng,
  COUNTIF(member_casual IS NULL) AS null_count_member_casual
FROM `astute-curve-435613-t7.Data_trip.Data_trip2020`
"

# Execute the query and download the results
null_values <-bq_table_download(bq_project_query(project_id, query_null_values))

Step3:Process(data cleaning)

# Explicitly set project ID
project_id <- "astute-curve-435613-t7"

# SQL Query for creating the cleaned dataset
query_with_cast <- "
CREATE OR REPLACE TABLE `astute-curve-435613-t7.Data_trip.Cleaned_Data` AS (
  SELECT 
    t.*,
    FORMAT_TIMESTAMP('%B', t.started_at) AS month_name,
    TIMESTAMP_DIFF(t.ended_at, t.started_at, MINUTE) AS ride_length,
    FORMAT_TIMESTAMP('%A', t.started_at) AS day_name,
    SAFE_CAST(t.ride_id AS STRING) AS fixed_ride_id,
    SAFE_CAST(t.rideable_type AS STRING) AS fixed_rideable_type,
    SAFE_CAST(t.started_at AS TIMESTAMP) AS fixed_started_at,
    SAFE_CAST(t.ended_at AS TIMESTAMP) AS fixed_ended_at,
    SAFE_CAST(t.start_station_name AS STRING) AS fixed_start_station_name,
    SAFE_CAST(t.start_station_id AS INT64) AS fixed_start_station_id,
    SAFE_CAST(t.end_station_name AS STRING) AS fixed_end_station_name,
    SAFE_CAST(t.end_station_id AS INT64) AS fixed_end_station_id,
    SAFE_CAST(t.start_lat AS FLOAT64) AS fixed_start_lat,
    SAFE_CAST(t.start_lng AS FLOAT64) AS fixed_start_lng,
    SAFE_CAST(t.end_lat AS FLOAT64) AS fixed_end_lat,
    SAFE_CAST(t.end_lng AS FLOAT64) AS fixed_end_lng,
    SAFE_CAST(t.member_casual AS STRING) AS fixed_member_casual
  FROM 
    `astute-curve-435613-t7.Data_trip.Data_trip2020` t
  WHERE 
    TIMESTAMP_DIFF(t.ended_at, t.started_at, MINUTE) BETWEEN 1 AND 1440 AND
    t.ride_id IS NOT NULL
    AND t.rideable_type IS NOT NULL
    AND t.started_at IS NOT NULL
    AND t.ended_at IS NOT NULL
    AND t.start_station_name IS NOT NULL
    AND t.start_station_id IS NOT NULL
    AND t.end_station_name IS NOT NULL
    AND t.end_station_id IS NOT NULL
    AND t.start_lat IS NOT NULL
    AND t.start_lng IS NOT NULL
    AND t.end_lat IS NOT NULL
    AND t.end_lng IS NOT NULL
    AND t.member_casual IS NOT NULL
)
 "
# Execute the query
bq_project_query(project_id, query_with_cast)
## <bq_table> astute-curve-435613-t7.Data_trip.Cleaned_Data

After These Steps(Prepare,process),I did Analyze,visualization and Act steps in Rstudio

## Rows: 500 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (7): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 4 & 5: Analyze , Visualization

Table1:Total Rides by Bike Type and Membership Type

#total number of rides by each bike type and by member type
rides_per_bike_type <- data_trip %>% 
  group_by(rideable_type,member_casual) %>% 
  summarise(total_rides= n(),.groups = "drop")

Visualize the table in bar chart:

This bar chart compares the total number of rides for different bike types (rideable_type) between members and casual riders. It highlights trends in bike usage based on membership status.

library(ggplot2)
ggplot(rides_per_bike_type, aes(x = rideable_type, y = total_rides, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +  # Bar plot with dodged bars
  labs(title = "Total Rides by Bike Type and Membership Type",
       x = "Bike Type",
       y = "Total Rides") +
  theme_minimal() +
  scale_fill_manual(values = c("member" = "blue", "casual" = "orange"))  # Customize colors
Figure 1: Total rides by bike type for members and casual riders.

Figure 1: Total rides by bike type for members and casual riders.

Summary of Total Rides by Bike Type and Membership Type

Key findings include:

  • Docked bikes are predominantly used by members, with significantly higher ride counts compared to casual riders.
  • Electric bikes show a more balanced usage between members and casual riders, with casual riders slightly leading in total rides.

These insights suggest that docked bikes are favored by regular users, while electric bikes appeal equally to both membership types.

Table 2: Total Rides per Month by Membership Type

The following table summarizes the total number of rides for each month, grouped by membership type:

library(dplyr)
rides_per_month <- data_trip %>%
  group_by(month_name, member_casual) %>%
  summarise(total_rides = n(),.groups = "drop") 

Visualize Monthly Total Rides by Membership Type (Faceted View)

The faceted bar chart below visualizes the total rides for each month, grouped by membership type (member and casual). Each panel focuses on one membership type, allowing for clearer identification of trends.

Insights

Casual Riders:

  • Seasonal Preference: Rides peak around May, likely aligning with warmer weather, outdoor activities, or leisure riding.
  • Decline in Winter: There’s a noticeable drop during colder months, suggesting casual riders are less inclined to use the service during unfavorable weather.

Members:

  • Winter Peak: A spike in rides during January-February might reflect commuting for school or work, where regular transportation needs persist despite the weather.

Comparison:

  • Ride Volume: Members significantly outnumber casual riders in total ride counts, reflecting their regular and committed usage.

  • Purpose of Usage: The difference in ride patterns suggests that casual riders tend to ride for leisure, while members primarily use the service for daily commuting, such as traveling to school or work.

Table3: Total Number of Rides by Member Type and Day of Week

This analysis shows the total number of rides for each day of the week, split by member and casual rider types

total_rides_by_day <-data_trip %>% 
  group_by(day_name,member_casual) %>% 
  summarise(total_ride=n(),.groups = "drop")

Visualize of Total Rides by Day and Membership Type

The following plot shows the total number of rides for each day of the week, split by membership type (casual and member). The bars are grouped side-by-side to compare the number of rides between casual and member riders for each day.

Insights

Casual Riders:

  • Weekend Peak: Casual riders have significantly higher ride counts on Friday and Saturday, suggesting that their usage is more focused on leisure activities and social outings.
  • Lower Weekday Usage: The low number of rides on weekdays indicates that casual riders are less likely to use the service for commuting.

Members:

  • Consistent Weekday Rides: Member rides remain high and consistent from Monday to Friday, reflecting regular use for commuting purposes such as school or work.
  • Slight Weekend Decline: While member rides drop slightly on the weekend compared to weekdays, the numbers remain substantial, indicating a mix of practical and leisure usage.

Comparison:

  • Purpose Differences: Casual riders are more active on weekends for leisure, while members maintain consistent usage throughout the week, likely for commuting.
  • Volume Gap: Members have significantly more rides on all days of the week compared to casual riders, highlighting their reliance on the service for regular transportation.

Table 4:Total Rides by Hour of Day and Membership Type

This analysis categorizes the time of day into four periods: Night, Morning, Afternoon, and Evening. We then group the data by hour of the day, membership type (casual or member), and time of day. The total number of rides for each category is calculated, and the results are summarized. A bar plot is generated to visualize how ride counts vary by time of day and membership type.

# Data preparation:Extract the hour of day from the start_time
library(dplyr)
data_trip$started_at <- as.POSIXct(data_trip$started_at, format = "%m/%d/%Y %H:%M:%S")
data_trip$hour_of_day <- format(data_trip$started_at, "%H")
data_trip$hour_of_day <- as.numeric(data_trip$hour_of_day)#Convert to numeric for easier analysis

# Categorize time of day
data_trip$time_of_day <- case_when(
  data_trip$hour_of_day >= 0 & data_trip$hour_of_day < 6 ~ "Night",
  data_trip$hour_of_day >= 6 & data_trip$hour_of_day < 12 ~"Morning",
  data_trip$hour_of_day >= 12 & data_trip$hour_of_day < 18 ~ "Afternoon",
  data_trip$hour_of_day >= 18 & data_trip$hour_of_day <= 23 ~ "Evening"
)
# Group by hour of day and member/casual, then count the number of rides
ride_counts_by_time <- data_trip %>%
  group_by(hour_of_day, member_casual,time_of_day) %>%
  summarise(total_rides = n(), .groups = "drop") %>% 
  arrange(hour_of_day, member_casual)

From this analysis, you can gain insights into which times of the day are most popular for both casual and member riders.

Insights

Casual Riders:

  • Midday Preference: Casual riders’ activity peaks between 12 PM and 6 PM , likely due to leisure trips or social activities.
  • Low Morning Rides: Casual riders rarely use the service during early hours , suggesting limited usage for commuting.

Members:

  • Morning and Evening Peaks: Member rides peak at 7-9 AM and 4-6 PM, aligning with typical commute times for school or work.
  • Sustained Usage: Members also show steady activity throughout the day, indicating mixed usage for commuting and other purposes.

Comparison:

  • Commuting Patterns: The distinct morning and evening peaks for members point to frequent use for commuting, while casual riders mostly ride for leisure.
  • Volume Difference: Members consistently outnumber casual riders across all hours, showcasing their reliance on the service.

Table5: Average Ride Length by Membership Type

This section calculates and visualizes the average ride length by membership type (casual vs member).

#average ride length by each member/casual
average_ride_length <- data_trip %>%
  group_by(member_casual) %>%
  summarise(avg_ride_length = mean(ride_length, na.rm = TRUE))

Insights

Casual Riders:

  • Longer Ride Duration: Casual riders average about 30 minutes per ride, which is significantly longer than member rides.
  • Leisure Usage: The longer ride duration likely reflects recreational or exploratory trips rather than utilitarian purposes.

Members:

  • Shorter Ride Duration: Members average about 15 minutes per ride, indicating more practical, short-distance trips.
  • Commuting Focus: This shorter average ride length aligns with the hypothesis that members often use rides for commuting to work or school.

Comparison:

  • Behavioral Differences: The stark contrast in average ride lengths highlights differing usage patterns: casual riders tend to take longer, leisure-oriented trips, while members focus on shorter, utility-driven rides.

Table6: Average Ride Length per day of week

#Average ride length per day of week
average_ride_length_per_day <- data_trip %>%
  group_by(day_name, member_casual) %>%
  summarise(avg_ride_length_per_day = mean(ride_length, na.rm = TRUE),.groups = "drop")

Insights

Casual Riders:

  • Weekend Peak: Casual riders’ average ride lengths peak on Saturday and Sunday, reaching around 40-42 minutes, reflecting their recreational and leisure-oriented usage patterns.
  • Steady Weekday Usage: On weekdays, the average ride length for casual riders hovers between 20-25 minutes, which could indicate shorter trips during workdays.

Members:

  • Consistent Across Days: Members maintain relatively stable average ride lengths across all days, ranging between 10-15 minutes.
  • Functional Usage: This consistent pattern suggests that members primarily use the service for functional purposes like commuting to work or school, regardless of the day of the week.

Comparison:

  • Weekend Differences: The difference between casual and member ride lengths is most pronounced on weekends, where casual riders take much longer trips compared to members.
  • Behavioral Patterns: Casual riders are more likely to engage in long leisure rides on weekends, while members focus on shorter, utility-driven trips consistently throughout the week.

1. start station for casual riders

# Reusable function for filtering, grouping, and sorting by station
get_station_data <- function(data, rider_type, station_type) {
  data %>%
    filter(member_casual == rider_type) %>%
    group_by(!!sym(station_type)) %>%
    summarise(total_rides = n(), .groups = "drop") %>%
    arrange(desc(total_rides))
}

# Get data for casual riders (starting station)
start_station_casual <- get_station_data(data_trip, "casual",
                                         "start_station_name")

Insights

1. Top Station
  • The most popular start station is Southport Ave & Wellington Ave with 20 total rides, indicating it is a key starting point for riders.

2. start station for member riders

Insights

1. Top Station
  • The most popular start station is Southport Ave & Wellington Ave with 106 rides, indicating it is a key starting point for member riders.

3. end station for casual riders

Insights

Top Destination
  • The most popular end station for casual riders is Warren Park West, with 8 total rides, suggesting that this station is a key leisure or recreational destination.
Observations
  • Recreational Focus:
    • The high usage of Warren Park West suggests it could be a leisure destination or located near parks or residential areas that attract casual riders.

4. end station for member riders

The chart compares Total Rides (represented by blue bars) and Average Ride Length(represented by red dots) for two membership types: casual and member.

# Create a combined data frame for visualization
combined_data <- data.frame(
  member_casual = c("member", "casual"),
  total_rides = c(sum(data_trip$member_casual == "member"), sum(data_trip$member_casual == "casual")),
  avg_ride_length = c(mean(data_trip$ride_length[data_trip$member_casual == "member"], na.rm = TRUE),
                      mean(data_trip$ride_length[data_trip$member_casual == "casual"], na.rm = TRUE))
)

# Plot the data
ggplot(combined_data, aes(x = member_casual)) +
  geom_bar(aes(y = total_rides), stat = "identity", fill = "skyblue", alpha = 0.7) +
  geom_point(aes(y = avg_ride_length * 10), color = "red", size = 4) + 
  scale_y_continuous(
    name = "Total Rides",
    sec.axis = sec_axis(~ . / 10, name = "Avg Ride Length (scaled)")
  ) +
  labs(title = "Total Rides and Average Ride Length by Membership Type") +
  theme_minimal()

Insights

Key Observations
  1. Total Rides:
    • Members account for significantly more rides (over 300) compared to casual users (just over 100), making them the primary users of the service in terms of volume.
  2. Average Ride Length:
    • Casual riders have longer average ride lengths compared to members, indicating leisure-oriented usage.
    • Members take shorter rides, suggesting their trips are practical and utilitarian, like commuting.
  3. Membership Behavior:
    • Members: High ride volume with short trips highlights frequent usage for daily transportation needs.
    • Casual Riders: Low ride volume but longer trip durations suggest occasional usage, likely for leisure or recreation.
Implications
  • Members: Their frequent usage makes them core users. Shorter ride lengths align with commuting patterns.
  • Casual Riders: Longer trips indicate leisure-focused use, presenting opportunities for recreational promotions.

Step6: ACT (Recommendations)

1. Membership Conversion:
  • Offer 20–30% discounts on annual memberships for casual riders who have taken more than 10 rides in a month.
  • Provide a one-month free trial or 50% off the first month for new members.
2. Leisure Marketing:
  • Create weekend promotions offering 10% off ride costs or free rides on Sundays during off-peak hours.
  • Promote scenic bike routes and guided tours with discounts up to 15% for casual riders.
3. Operational Optimizations:
  • Ensure 95% bike availability at key commuter stations during peak hours.
  • Introduce a pre-booking feature allowing members to reserve bikes at no extra cost.
4. Seasonal Strategies:
  • Offer seasonal membership passes (e.g., summer pass for $50) with savings of 20–25% compared to casual pricing.
  • Provide free heated handle grips or 10% off winter gear rentals for winter riders.
5. Digital Engagement:
  • Send personalized ride summaries to casual riders, highlighting their potential savings as members.
  • Use push notifications offering 10–20% discounts for casual riders to encourage repeat usage.
6. Data-Driven Promotions:
  • Identify frequent casual riders (e.g., those with 5+ rides per month) and offer them 15–25% discounts on memberships.
  • Provide 5–10% discounts for casual riders taking rides longer than 30 minutes.
7. Improved Accessibility:
  • Allocate extra bikes at top stations (e.g., increasing availability by 20% at “Southport Ave & Wellington Ave”).
  • Allow casual riders to upgrade to membership with 0% interest monthly payment plans.
8. Behavioral Nudges:
  • Use price anchoring to show annual memberships as 40% cheaper per ride compared to casual pricing.
  • Gamify experiences with challenges like “Ride 50 miles in a month” to earn 10% off the next month’s rides.
9. Expand Marketing Channels:
  • Use paid social media ads offering exclusive 10% discounts on first rides.
  • Partner with local businesses to offer 5% discounts or special deals for rides to/from partner locations.
10. Performance Monitoring:
  • Target a 10% increase in membership conversion rates over 6 months.
  • Aim for a 5–7% increase in casual rider revenue through promotions and upselling.