Google Data Analytics Capstone: Analyze annual data of bike

I. Ask

Business Task

Provide marketing strategies to convert casual riders into member riders

Analysis questions

How do members riders and causual riders differ in their use of Cyclistic bikes?
Why do regular riders purchase Cyclistic’s annual membership?
How can Cyclistic utilize digital platforms to influence regular riders to become members?

Moreno assigned me the first question to answer: How do members riders and causual riders differ in their use of Cyclistic bikes?

II. Prepare

Resource

I utilize Cyclistc’s historical data to conduct analysis and identify trends over the first 5 months of 2023

Data has been collected from divvy_tripdata. Provided by Motivate International under this license

III. Process

1. Combine Data

Utilize data from 01/2023 to 05/2023

Conncet to SQL Server in R language

Install Packages to connect with SQL

install.packages("odbc")

install.packages("DBI")

install.packages("tidyverse")

In there:

Package “odbc” allow connect with Database of SQL Server
Package “DBI” allow interact with Database

Utilize libary of packages

library(odbc) 

library(DBI)

library(ggplot2)

Connect to SQL Server

Combine data of 5 tables into single table

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'divvy_tripdata_2023')
BEGIN
    SELECT * INTO divvy_tripdata_2023 
    FROM (
        SELECT * FROM divvy_tripdata_01_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_02_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_03_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_04_2023
        UNION ALL
        SELECT * FROM divvy_tripdata_05_2023 ) AS row_data
END

SELECT COUNT(*) AS count
    FROM divvy_tripdata_2023

1 records
count
1670841

In this way, ‘divvy_tripdata_2023’ table has been created with 1,607,841 records

2. Data Exploration

The following table shows all field name and data type of table ‘divvy_tripdata_2023’

The query bellow shows the number of NULL vaule per field

SELECT
    COUNT(CASE WHEN ride_id IS NULL THEN 1 ELSE NULL END) AS ride_id,
    COUNT(CASE WHEN rideable_type IS NULL THEN 1 ELSE NULL END) AS rideable_type,
    COUNT(CASE WHEN started_at IS NULL THEN 1 ELSE NULL END) AS started_at,
    COUNT(CASE WHEN ended_at IS NULL THEN 1 ELSE NULL END) AS ended_at,
    COUNT(CASE WHEN start_station_name IS NULL THEN 1 ELSE NULL END) AS start_station_name,
    COUNT(CASE WHEN start_station_id IS NULL THEN 1 ELSE NULL END) AS start_station_id,
    COUNT(CASE WHEN end_station_name IS NULL THEN 1 ELSE NULL END) AS end_station_name,
    COUNT(CASE WHEN end_station_id IS NULL THEN 1 ELSE NULL END) AS end_station_id,
    COUNT(CASE WHEN start_lat IS NULL THEN 1 ELSE NULL END) AS start_lat,
    COUNT(CASE WHEN start_lng IS NULL THEN 1 ELSE NULL END) AS start_lng,
    COUNT(CASE WHEN end_lat IS NULL THEN 1 ELSE NULL END) AS end_lat,
    COUNT(CASE WHEN end_lng IS NULL THEN 1 ELSE NULL END) AS end_lng,
    COUNT(CASE WHEN member_casual IS NULL THEN 1 ELSE NULL END) AS member_casual
FROM divvy_tripdata_2023

1 records
ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
0	0	0	0	241158	241290	256913	257054	0	0	1571	1571	0

Observation:

The start_station_name field has 241,158 NULL values and needs to be removed
The start_station_id field has 241,290 NULL values and needs to be removed
The end_station_name field has 256,913 NULL values and needs to be removed
The end_station_id field has 257,054 NULL values and needs to be removed
The end_lat and end_lng both fields have 1,571 NULL values and needs to be removed

Due to rise_id field is a Primary Key, there will be no duplicate values

3. Data Cleaning

The current format of the ‘divvy_tripdata_2023’ table is yyyy-mm-dd hh:mm:ss. To perform deeper analysis, I will create 4 new columns with respective value types: Hour, Day of Week, Month, and Year.

Create a new column named rise_length to determine the duration of each trip (in minute format).

All records containing NULL values will be removed

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'pre_clean_combined_data')
BEGIN
    SELECT * INTO pre_clean_combined_data
    FROM (
        SELECT ride_id, rideable_type,
          CAST(DATENAME(hh, started_at) AS INT) AS hour,
            DATENAME(dw, started_at) AS day_of_week,
            DAY(started_at) AS day,
            MONTH(started_at) AS month,
            YEAR(started_at) AS year,
            -- Cast the data type of the DateDiff function result to the round result of the division to 2 decimal places
            ROUND(CAST(DATEDIFF(s, started_at, ended_at) AS float)/60, 2) AS ride_length_minute,
            start_station_name, end_station_name, start_lat, start_lng, 
            end_lat, end_lng, member_casual
        FROM divvy_tripdata_2023
        WHERE start_station_name IS NOT NULL AND
            start_station_id IS NOT NULL AND
            end_station_name IS NOT NULL AND
            end_station_id IS NOT NULL AND
            end_lat IS NOT NULL AND
            end_lng IS NOT NULL
    ) AS cleaned
END

-- Sorting ride_length_minute column in ascending
SELECT TOP(10) * FROM pre_clean_combined_data
    ORDER BY ride_length_minute ASC

Displaying records 1 - 10
ride_id	rideable_type	hour	day_of_week	day	month	year	ride_length_minute	start_station_name	end_station_name	start_lat	start_lng	end_lat	end_lng	member_casual
A769AB597DEA18C4	classic_bike	5	Saturday	27	5	2023	-0.23	Southport Ave & Belmont Ave	Southport Ave & Belmont Ave	41.93948	-87.66375	41.93948	-87.66375	member
934174DB8E2AD791	classic_bike	17	Monday	29	5	2023	-0.20	Lakefront Trail & Bryn Mawr Ave	Lakefront Trail & Bryn Mawr Ave	41.98404	-87.65228	41.98404	-87.65228	member
00AC4040E25E347E	classic_bike	15	Sunday	7	5	2023	-0.18	Wentworth Ave & Cermak Rd*	Wentworth Ave & Cermak Rd*	41.85321	-87.63187	41.85321	-87.63187	casual
97BF63D06721A3B9	classic_bike	18	Saturday	13	5	2023	-0.10	Wentworth Ave & Cermak Rd*	Wentworth Ave & Cermak Rd*	41.85321	-87.63187	41.85321	-87.63187	member
0063C3704F56EC55	electric_bike	7	Thursday	27	4	2023	-0.08	Green St & Randolph St*	Green St & Randolph St*	41.88346	-87.64860	41.88360	-87.64863	casual
81E1C5175FA5A23D	classic_bike	14	Wednesday	19	4	2023	-0.07	Green St & Randolph St*	Green St & Randolph St*	41.88360	-87.64863	41.88360	-87.64863	member
7A4D237E2C99D424	electric_bike	17	Tuesday	4	4	2023	-0.05	Green St & Randolph St*	Green St & Randolph St*	41.88361	-87.64856	41.88360	-87.64863	member
579596DD4C7C7538	classic_bike	17	Tuesday	23	5	2023	-0.05	Wentworth Ave & Cermak Rd*	Wentworth Ave & Cermak Rd*	41.85321	-87.63187	41.85321	-87.63187	casual
1CEB299B0AACB63F	electric_bike	16	Wednesday	19	4	2023	0.00	Desplaines St & Kinzie St	Desplaines St & Kinzie St	41.88850	-87.64436	41.88872	-87.64445	casual
BF7BE067A7FD077A	electric_bike	7	Tuesday	18	4	2023	0.00	California Ave & Division St	California Ave & Division St	41.90303	-87.69754	41.90303	-87.69747	member

Observation:

The result-set in ride_length_minute column have 8 values with value < 0
The cause may be due to user error or system error when inserting data into the database

-- Remove records that *ride_length_minute* value < 0

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'clean_combined_data')
BEGIN
    SELECT * INTO clean_combined_data
    FROM (
        SELECT * FROM pre_clean_combined_data
        wHERE ride_length_minute >= 0 ) AS completely_clean
END

-- Count the number of records in *clean_combined_data* 
SELECT COUNT(*) AS count
    FROM clean_combined_data

1 records
count
1285707

In this way, after the data clean step, a new table named clean_combined_data will be obtained with 1,285,707 records

IV. Analyze And Share

1. Statitics Analysis

query = "SELECT
              member_casual,ROUND(AVG(ride_length_minute), 2) AS avg_ride_length
         FROM clean_combined_data
         GROUP BY member_casual"

result = dbGetQuery(con, query)

# Visualization

ggplot(data = result) +
  geom_bar(stat = "identity", mapping = aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
  geom_text(aes(x = member_casual, y = avg_ride_length, label = avg_ride_length), vjust = -0.25)

2. Analysis of a number of member riders and casual riders

query = "SELECT member_casual, count(member_casual) AS total_riders
            FROM clean_combined_data
            GROUP BY member_casual"

riders_rs = dbGetQuery(con, query)

# Visualization

ggplot(data = riders_rs) +
  geom_bar(stat = "identity", mapping = aes(x = member_casual, y = total_riders, fill = member_casual)) +
  geom_text(aes(x = member_casual, y = total_riders, label = total_riders), vjust = -0.25)

3. Analysis of a number of member riders and casual riders from vehicle type

query = "SELECT 
              member_casual, rideable_type,
              COUNT(*) AS total_trips
         FROM clean_combined_data
         GROUP BY member_casual, rideable_type"

result = dbGetQuery(con, query)

# Visualization

ggplot(result, aes(x = member_casual, y = total_trips, fill = rideable_type)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = total_trips), position = position_dodge(width = 0.9), vjust = -0.25)

4. Analysis of total travel time per ride for member riders and casual riders from month, weekday and hour

query = "SELECT 
              member_casual, month,
            ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY month, member_casual
         ORDER BY member_casual, month"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = month, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In First 5 months of 2023", subtitle = "Per Month")

Observation: With the visual chart Total Trips In First 5 months of 2023, both members and casual riders exhibit similar behavior, with fewer trips in the spring and gradually increasing in the summer months

query = "SELECT member_casual, day_of_week,
            CASE day_of_week
                WHEN 'Monday' THEN 2
                WHEN 'Tuesday' THEN 3
                WHEN 'Wednesday' THEN 4
                WHEN 'Thursday' THEN 5
                WHEN 'Friday' THEN 6
                WHEN 'Saturday' THEN 7
                WHEN 'Sunday' THEN 8
            END AS day_order,
              ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY day_of_week, member_casual
         ORDER BY member_casual, day_order"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = day_order, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Day of Week", subtitle = "Per Day of Week")

Observation: With the visual chart ‘Total Trips In Day of Week,’ members and casual riders exhibit contrasting travel behaviors. Member riders have higher travel time on weekdays and gradually decrease from Wednesday onwards, while casual riders have travel time on weekdays and significantly increase on the weekends

query = "SELECT 
              member_casual, hour,
            ROUND(SUM(ride_length_minute), 2) AS total_trips
         FROM clean_combined_data
         GROUP BY hour, member_casual
         ORDER BY member_casual, hour"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = hour, y = total_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Hour", subtitle = "Per Hour")

Observation: With the visual chart ‘Total Trips In Hour,’ both members and casual riders exhibit similar behavior. They have minimal travel during the period from 12:00 am to 5:00 am, increased travel after 5:00 am, and the highest travel volume between 3:00 pm and 5:00 pm

5. Analysis of average travel time per ride for member riders and casual riders from month, weekday and hour

query = "SELECT 
              member_casual, month,
            ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY month, member_casual
         ORDER BY member_casual, month"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = month, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In First 5 months of 2023", subtitle = "Per Month")

query = "SELECT member_casual, day_of_week,
            CASE day_of_week
                WHEN 'Monday' THEN 2
                WHEN 'Tuesday' THEN 3
                WHEN 'Wednesday' THEN 4
                WHEN 'Thursday' THEN 5
                WHEN 'Friday' THEN 6
                WHEN 'Saturday' THEN 7
                WHEN 'Sunday' THEN 8
            END AS day_order,
              ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY day_of_week, member_casual
         ORDER BY member_casual, day_order"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = day_order, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Day of Week", subtitle = "Per Day of Week")

query = "SELECT 
              member_casual, hour,
            ROUND(AVG(ride_length_minute), 2) AS AVG_trips
         FROM clean_combined_data
         GROUP BY hour, member_casual
         ORDER BY member_casual, hour"

result = dbGetQuery(con, query)

# Visualization
ggplot(data = result, aes(x = hour, y = AVG_trips, group = as.factor(member_casual), color = member_casual)) + geom_line() + geom_point() + labs(title = "Total Trips In Hour", subtitle = "Per Hour")

Observation: Through the three visual charts, it is observed that the travel time of member riders tends to remain consistent across hours, days of the week, or months. On the other hand, casual riders tend to travel more than twice as much. They have a higher tendency to travel during the time frame of 8:00 am to 2:00 pm, on weekends, and during the summer months

V. Act

After identifying the differences between casual and member riders, marketing strategies to target casual riders can be developed to persuade them to become members. Recommendations:

Marketing campaigns might be conducted in spring and summer at tourist/recreational locations popular among casual riders.
Casual riders are most active on weekends and during the summer and spring, thus they may be offered seasonal or weekend-only memberships.
Casual riders use their bikes for longer durations than members. Offering discounts for longer rides may incentivize casual riders and entice members to ride for longer periods of time.

Reference

Kaggle

Google Data Analytics Capstone: Analyze annual data of bike - Cyclist

Author: Đỗ Việt Hưng

Creation Date: 2023-07-09

I. Ask

Business Task

Analysis questions

II. Prepare

Resource

III. Process

1. Combine Data

Conncet to SQL Server in R language

2. Data Exploration

3. Data Cleaning

V. Act

Reference

Google Data Analytics Capstone: Analyze annual data of bike - Cyclist

Author: Đỗ Việt Hưng

Creation Date: 2023-07-09

Case Study: How Does A Bike-Share Navigate Speedy Success?

I. Ask

Business Task

Analysis questions

II. Prepare

Resource

III. Process

1. Combine Data

Conncet to SQL Server in R language

2. Data Exploration

3. Data Cleaning

IV. Analyze And Share

1. Statitics Analysis

2. Analysis of a number of member riders and casual riders

3. Analysis of a number of member riders and casual riders from vehicle type

4. Analysis of total travel time per ride for member riders and casual riders from month, weekday and hour

5. Analysis of average travel time per ride for member riders and casual riders from month, weekday and hour

V. Act

Reference