Case Study: How Does A Bike-Share Navigate Speedy Success?

In this Case Study, I took on the role of a Junior Data Analyst at a hypothetical company - Cyclist, I tasked with analyzing data from the first 5 months of 2023 to support answering business questions for the company’s upcoming marketing campaign

I will follow the steps of the data analysis process: Ask, Prepare, Process, Analyze and Share


I. Ask

Business Task

Provide marketing strategies to convert casual riders into member riders

Analysis questions

  1. How do members riders and causual riders differ in their use of Cyclistic bikes?

  2. Why do regular riders purchase Cyclistic’s annual membership?

  3. How can Cyclistic utilize digital platforms to influence regular riders to become members?

Moreno assigned me the first question to answer: How do members riders and causual riders differ in their use of Cyclistic bikes?


II. Prepare

Resource

I utilize Cyclistc’s historical data to conduct analysis and identify trends over the first 5 months of 2023

Data has been collected from divvy_tripdata. Provided by Motivate International under this license


III. Process

1. Combine Data

Utilize data from 01/2023 to 05/2023

Conncet to SQL Server in R language

Install Packages to connect with SQL

install.packages("odbc")

install.packages("DBI")

install.packages("tidyverse")

In there:

  • Package “odbc” allow connect with Database of SQL Server

  • Package “DBI” allow interact with Database

Utilize libary of packages

library(odbc) 

library(DBI)

library(ggplot2)

Connect to SQL Server

Combine data of 5 tables into single table

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'divvy_tripdata_2023')
BEGIN
    SELECT * INTO divvy_tripdata_2023 
    FROM (
        SELECT * FROM divvy_tripdata_01_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_02_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_03_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_04_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_05_2023 ) AS row_data
END

SELECT COUNT(*) AS count
    FROM divvy_tripdata_2023
  
1 records
count
1670841

In this way, ‘divvy_tripdata_2023’ table has been created with 1,607,841 records

2. Data Exploration

The following table shows all field name and data type of table ‘divvy_tripdata_2023’

The query bellow shows the number of NULL vaule per field

SELECT
    COUNT(CASE WHEN ride_id IS NULL THEN 1 ELSE NULL END) AS ride_id,
    COUNT(CASE WHEN rideable_type IS NULL THEN 1 ELSE NULL END) AS rideable_type,
    COUNT(CASE WHEN started_at IS NULL THEN 1 ELSE NULL END) AS started_at,
    COUNT(CASE WHEN ended_at IS NULL THEN 1 ELSE NULL END) AS ended_at,
    COUNT(CASE WHEN start_station_name IS NULL THEN 1 ELSE NULL END) AS start_station_name,
    COUNT(CASE WHEN start_station_id IS NULL THEN 1 ELSE NULL END) AS start_station_id,
    COUNT(CASE WHEN end_station_name IS NULL THEN 1 ELSE NULL END) AS end_station_name,
    COUNT(CASE WHEN end_station_id IS NULL THEN 1 ELSE NULL END) AS end_station_id,
    COUNT(CASE WHEN start_lat IS NULL THEN 1 ELSE NULL END) AS start_lat,
    COUNT(CASE WHEN start_lng IS NULL THEN 1 ELSE NULL END) AS start_lng,
    COUNT(CASE WHEN end_lat IS NULL THEN 1 ELSE NULL END) AS end_lat,
    COUNT(CASE WHEN end_lng IS NULL THEN 1 ELSE NULL END) AS end_lng,
    COUNT(CASE WHEN member_casual IS NULL THEN 1 ELSE NULL END) AS member_casual
FROM divvy_tripdata_2023
1 records
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 0 0 0 241158 241290 256913 257054 0 0 1571 1571 0

Observation:

  • The start_station_name field has 241,158 NULL values and needs to be removed
  • The start_station_id field has 241,290 NULL values and needs to be removed
  • The end_station_name field has 256,913 NULL values and needs to be removed
  • The end_station_id field has 257,054 NULL values and needs to be removed
  • The end_lat and end_lng both fields have 1,571 NULL values and needs to be removed

Due to rise_id field is a Primary Key, there will be no duplicate values

3. Data Cleaning

The current format of the ‘divvy_tripdata_2023’ table is yyyy-mm-dd hh:mm:ss. To perform deeper analysis, I will create 4 new columns with respective value types: Hour, Day of Week, Month, and Year.

Create a new column named rise_length to determine the duration of each trip (in minute format).

All records containing NULL values will be removed

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'pre_clean_combined_data')
BEGIN
    SELECT * INTO pre_clean_combined_data
    FROM (
        SELECT ride_id, rideable_type,
          CAST(DATENAME(hh, started_at) AS INT) AS hour,
            DATENAME(dw, started_at) AS day_of_week,
            DAY(started_at) AS day,
            MONTH(started_at) AS month,
            YEAR(started_at) AS year,
            -- Cast the data type of the DateDiff function result to the round result of the division to 2 decimal places
            ROUND(CAST(DATEDIFF(s, started_at, ended_at) AS float)/60, 2) AS ride_length_minute,
            start_station_name, end_station_name, start_lat, start_lng, 
            end_lat, end_lng, member_casual
        FROM divvy_tripdata_2023
        WHERE start_station_name IS NOT NULL AND
            start_station_id IS NOT NULL AND
            end_station_name IS NOT NULL AND
            end_station_id IS NOT NULL AND
            end_lat IS NOT NULL AND
            end_lng IS NOT NULL
    ) AS cleaned
END

-- Sorting ride_length_minute column in ascending
SELECT TOP(10) * FROM pre_clean_combined_data
    ORDER BY ride_length_minute ASC
          
Displaying records 1 - 10
ride_id rideable_type hour day_of_week day month year ride_length_minute start_station_name end_station_name start_lat start_lng end_lat end_lng member_casual
A769AB597DEA18C4 classic_bike 5 Saturday 27 5 2023 -0.23 Southport Ave & Belmont Ave Southport Ave & Belmont Ave 41.93948 -87.66375 41.93948 -87.66375 member
934174DB8E2AD791 classic_bike 17 Monday 29 5 2023 -0.20 Lakefront Trail & Bryn Mawr Ave Lakefront Trail & Bryn Mawr Ave 41.98404 -87.65228 41.98404 -87.65228 member
00AC4040E25E347E classic_bike 15 Sunday 7 5 2023 -0.18 Wentworth Ave & Cermak Rd* Wentworth Ave & Cermak Rd* 41.85321 -87.63187 41.85321 -87.63187 casual
97BF63D06721A3B9 classic_bike 18 Saturday 13 5 2023 -0.10 Wentworth Ave & Cermak Rd* Wentworth Ave & Cermak Rd* 41.85321 -87.63187 41.85321 -87.63187 member
0063C3704F56EC55 electric_bike 7 Thursday 27 4 2023 -0.08 Green St & Randolph St* Green St & Randolph St* 41.88346 -87.64860 41.88360 -87.64863 casual
81E1C5175FA5A23D classic_bike 14 Wednesday 19 4 2023 -0.07 Green St & Randolph St* Green St & Randolph St* 41.88360 -87.64863 41.88360 -87.64863 member
7A4D237E2C99D424 electric_bike 17 Tuesday 4 4 2023 -0.05 Green St & Randolph St* Green St & Randolph St* 41.88361 -87.64856 41.88360 -87.64863 member
579596DD4C7C7538 classic_bike 17 Tuesday 23 5 2023 -0.05 Wentworth Ave & Cermak Rd* Wentworth Ave & Cermak Rd* 41.85321 -87.63187 41.85321 -87.63187 casual
1CEB299B0AACB63F electric_bike 16 Wednesday 19 4 2023 0.00 Desplaines St & Kinzie St Desplaines St & Kinzie St 41.88850 -87.64436 41.88872 -87.64445 casual
BF7BE067A7FD077A electric_bike 7 Tuesday 18 4 2023 0.00 California Ave & Division St California Ave & Division St 41.90303 -87.69754 41.90303 -87.69747 member

Observation:

  • The result-set in ride_length_minute column have 8 values with value < 0

  • The cause may be due to user error or system error when inserting data into the database

-- Remove records that *ride_length_minute* value < 0

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'clean_combined_data')
BEGIN
    SELECT * INTO clean_combined_data
    FROM (
        SELECT * FROM pre_clean_combined_data
        wHERE ride_length_minute >= 0 ) AS completely_clean
END

-- Count the number of records in *clean_combined_data* 
SELECT COUNT(*) AS count
    FROM clean_combined_data
1 records
count
1285707

In this way, after the data clean step, a new table named clean_combined_data will be obtained with 1,285,707 records


IV. Analyze And Share

1. Statitics Analysis

query = "SELECT
              member_casual,ROUND(AVG(ride_length_minute), 2) AS avg_ride_length
         FROM clean_combined_data
         GROUP BY member_casual"

result = dbGetQuery(con, query)

# Visualization

ggplot(data = result) +
  geom_bar(stat = "identity", mapping = aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
  geom_text(aes(x = member_casual, y = avg_ride_length, label = avg_ride_length), vjust = -0.25)

2. Analysis of a number of member riders and casual riders

query = "SELECT member_casual, count(member_casual) AS total_riders
            FROM clean_combined_data
            GROUP BY member_casual"

riders_rs = dbGetQuery(con, query)

# Visualization

ggplot(data = riders_rs) +
  geom_bar(stat = "identity", mapping = aes(x = member_casual, y = total_riders, fill = member_casual)) +
  geom_text(aes(x = member_casual, y = total_riders, label = total_riders), vjust = -0.25)

3. Analysis of a number of member riders and casual riders from vehicle type

query = "SELECT 
              member_casual, rideable_type,
              COUNT(*) AS total_trips
         FROM clean_combined_data
         GROUP BY member_casual, rideable_type"

result = dbGetQuery(con, query)

# Visualization

ggplot(result, aes(x = member_casual, y = total_trips, fill = rideable_type)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = total_trips), position = position_dodge(width = 0.9), vjust = -0.25)

4. Analysis of total travel time per ride for member riders and casual riders from month, weekday and hour

query = "SELECT 
              member_casual, month,
            ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY month, member_casual
         ORDER BY member_casual, month"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = month, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In First 5 months of 2023", subtitle = "Per Month")

Observation: With the visual chart Total Trips In First 5 months of 2023, both members and casual riders exhibit similar behavior, with fewer trips in the spring and gradually increasing in the summer months

query = "SELECT member_casual, day_of_week,
            CASE day_of_week
                WHEN 'Monday' THEN 2
                WHEN 'Tuesday' THEN 3
                WHEN 'Wednesday' THEN 4
                WHEN 'Thursday' THEN 5
                WHEN 'Friday' THEN 6
                WHEN 'Saturday' THEN 7
                WHEN 'Sunday' THEN 8
            END AS day_order,
              ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY day_of_week, member_casual
         ORDER BY member_casual, day_order"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = day_order, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Day of Week", subtitle = "Per Day of Week")

Observation: With the visual chart ‘Total Trips In Day of Week,’ members and casual riders exhibit contrasting travel behaviors. Member riders have higher travel time on weekdays and gradually decrease from Wednesday onwards, while casual riders have travel time on weekdays and significantly increase on the weekends

query = "SELECT 
              member_casual, hour,
            ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY hour, member_casual
         ORDER BY member_casual, hour"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = hour, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Hour", subtitle = "Per Hour")

Observation: With the visual chart ‘Total Trips In Hour,’ both members and casual riders exhibit similar behavior. They have minimal travel during the period from 12:00 am to 5:00 am, increased travel after 5:00 am, and the highest travel volume between 3:00 pm and 5:00 pm

5. Analysis of average travel time per ride for member riders and casual riders from month, weekday and hour

query = "SELECT 
              member_casual, month,
            ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY month, member_casual
         ORDER BY member_casual, month"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = month, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In First 5 months of 2023", subtitle = "Per Month")

query = "SELECT member_casual, day_of_week,
            CASE day_of_week
                WHEN 'Monday' THEN 2
                WHEN 'Tuesday' THEN 3
                WHEN 'Wednesday' THEN 4
                WHEN 'Thursday' THEN 5
                WHEN 'Friday' THEN 6
                WHEN 'Saturday' THEN 7
                WHEN 'Sunday' THEN 8
            END AS day_order,
              ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY day_of_week, member_casual
         ORDER BY member_casual, day_order"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = day_order, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Day of Week", subtitle = "Per Day of Week")

query = "SELECT 
              member_casual, hour,
            ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY hour, member_casual
         ORDER BY member_casual, hour"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = hour, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Hour", subtitle = "Per Hour")

Observation: Through the three visual charts, it is observed that the travel time of member riders tends to remain consistent across hours, days of the week, or months. On the other hand, casual riders tend to travel more than twice as much. They have a higher tendency to travel during the time frame of 8:00 am to 2:00 pm, on weekends, and during the summer months


V. Act

After identifying the differences between casual and member riders, marketing strategies to target casual riders can be developed to persuade them to become members. Recommendations:


Reference

Kaggle