GOOLGE DATA ANALYSIS CERTIFICATION- CASE STUDY-1

Ask Stage:

The primary objective is to understand the differences in how annual members and casual riders use Cyclistic bikes. By analyzing these usage patterns, we aim to uncover meaningful insights that can guide strategies to convert casual riders into annual members.

Insights and Impact:

This analysis will reveal key behaviors and trends for both user groups, enabling Cyclistic’s marketing team to create targeted, data-driven campaigns. These campaigns will address the needs and motivations of casual riders, using personalized offers, tailored messaging, and engaging content to encourage them to commit to annual memberships.

Business Objectives:

The primary task is to compare the usage patterns of annual members and casual riders by examining:

Frequency of use: How often rides occur.
Time of day: Preferred riding times.
Trip duration: Average length of rides.
Bike types: Most commonly used bikes.

This understanding will support the development of effective marketing strategies aimed at converting casual riders into annual members. Successfully achieving this goal will drive revenue growth, enhance customer retention, and strengthen Cyclistic’s long-term success.

Identifying Stakeholders

Lily Moreno (Marketing Director):
Leads the development of marketing campaigns and is focused on leveraging insights to design strategies that effectively increase the number of annual members.
Cyclistic Executives:
Committed to ensuring the marketing strategy’s success while aligning it with the company’s broader objectives of growing annual memberships and driving long-term business performance.
Cyclistic Marketing Analytics Team:
Responsible for gathering and analyzing data to uncover actionable insights that will inform and optimize the marketing efforts.
Casual Riders and Annual Members:
While not directly involved in decision-making, their behaviors, preferences, and usage patterns are central to the analysis. Understanding these groups is crucial for creating impactful marketing strategies that drive engagement and conversions.

Prepare Stage

Data Collection

Cyclistic’s historical trip data was downloaded from the official website provided by Motivate International Inc. under a public license. This dataset includes trip details for the most recent 12 months of 2024. The files were downloaded as monthly zip archives, extracted into 12 CSV files, and organized in a folder named casestudy on the desktop. The data underwent thorough checks to ensure completeness, consistency, and compliance with privacy standards.

Data File Details

The dataset consists of 13 columns, each contributing to the analysis in different ways:

ride_id: A unique identifier for each trip. Useful for tracking individual rides but not essential for comparing usage patterns.
rideable_type: Indicates the type of bike (e.g., classic or electric). Crucial for understanding bike preferences among users.
started_at: Timestamp marking the start of a ride. Critical for analyzing trip frequency, peak usage times, and seasonal trends.
ended_at: Timestamp for when a ride ends. Essential for calculating trip duration and assessing usage behavior.
start_station_name: Name of the station where the ride began. Provides insights into station-level demand and usage.
start_station_id: Unique identifier for the starting station. Useful for geographic analyses but less critical than station names.
end_station_name: Name of the station where the ride concluded. Helps identify drop-off patterns and station utilization.
end_station_id: Unique identifier for the ending station. Supports station-level analyses, similar to start_station_id.
start_lat & start_lng: Geographical coordinates for the starting station. Important for mapping and spatial analysis.
end_lat & end_lng: Geographical coordinates for the ending station. Useful for identifying drop-off trends and station demand.
member_casual: Indicates whether the rider is an annual member or a casual rider. This is the most critical column for segmenting and analyzing user behavior.

Process Stage

Why RStudio?
RStudio, paired with the R programming language, was selected for this analysis due to its robust capabilities in:

Statistical Analysis: Efficient handling of complex modeling and statistical computations.
Data Wrangling: Comprehensive tools for cleaning and transforming data into structured formats.
Visualization: High-quality static and dynamic visualizations to interpret trends effectively.
Community Support: A large, active community that provides ongoing support and resources.

By utilizing RStudio, we aim to ensure data integrity while uncovering actionable insights that drive business growth.

Load necessary libraries

library(tidyverse)
library(skimr)
library(janitor)
library(readr)
library(dplyr)
library(lubridate)
library(data.table)
library(tidyr)
library(ggplot2)

set the directory of datastes

setwd("C:/Users/Soo/Desktop/casestudy")

Step1:List all CSV files in the current directory

file_paths <- list.files(pattern = "*.csv", full.names = TRUE)

Step 2: Load and validate each file

datasets <- lapply(file_names, function(file) { data <- read_csv(file) print(glue::glue("\nFile: {file}")) print(glimpse(data)) print(summary(data)) return(data) })

The result as shown below:

Step 3: Print summary containing file name of each file and total number of rows and columns

for (i in seq_along(datasets)) { cat("File:", basename(file_names[i]), "\n") cat("Rows:", nrow(datasets[[i]]), "Columns:", ncol(datasets[[i]]), "\n\n") }

Step 4: Standardize column names to ensure consistency

we need to check the columns names for each files to avoid non consistency between files

datasets <- lapply(datasets, clean_names)

Step 5: Check for missing values in dataset

missing_summary <- lapply(datasets, function(data) { colSums(is.na(data)) })

for (i in seq_along(missing_summary)) { cat("Missing Values in", basename(file_names[i]), ":\n") print(missing_summary[[i]]) cat("\n") }

the missing values in each file with the column names

Step 6: Define possible datetime formats

Due to different DatTime formats in dataset files, we need to fix format issue by this code:

datetime_formats <- c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%Y-%m-%d", "%d.%m.%Y")

Step 7: Function to parse datetime with multiple formats

parse_datetime <- function(datetime_str) { parse_date_time(datetime_str, orders = datetime_formats, tz = "UTC") }

Step 8: Ensure ‘started_time’ and ‘ended_time’ are in POSIXct datetime format

datasets <- lapply(datasets, function(df) { df %>% mutate( started_time = parse_datetime(started_time), ended_time = parse_datetime(ended_time) ) })

Step 9: Combine Datasets

all_trips <- rbindlist(datasets)

Step 10: Save combined dataset

fwrite(all_trips, "combined_all_trips.csv")

Step 11: Check for missing values in the combined dataset

missing_summary_combined <- colSums(is.na(all_trips)) print(missing_summary_combined)

To clean our data, we decided not to delete the rows with missing values, even though certain columns have up to 18% missing data. Instead, we chose to remove these columns from our analysis and focus on the critical columns listed below.

Step 12: Define important columns

 important_columns <- c("trip_num", "bike_type", "started_time", "ended_time", "membership")

Step 13: Select only the important columns

combined_data <- all_trips %>% select(all_of(important_columns))

print(summary(combined_data))

validate the columns name and total num of rows

Analyze Stage:

Step 1: Calculate trip length in minutes, day of the week, and month

combined_data <- combined_data %>% mutate( trip_length = as.numeric(difftime(ended_time, started_time, units = "mins")), week_day = lubridate::wday(started_time, label = TRUE), month = format(started_time, "%b") # Add month for later use )

Step 2: Remove negative trip durations

combined_data <- combined_data %>% filter(trip_length >= 0)

Step 3: Calculating total trips by day of the week

total_trips_day <- combined_data %>% group_by(week_day) %>% summarise(total_trips = n(), .groups = 'drop')

cat("Total Trips by Day of the Week:\n") print(total_trips_day)

Total Trips by WeekDay

Share Stage:

Data Visualizations

Total trips by Membership type vs day of the week

ggplot(combined_data, aes(x = week_day, fill = membership)) + geom_bar(position = "dodge") + labs(title = "Total Trips by Member Type vs Day of the Week", x = "Day of the Week", y = "Total Trips") + theme_minimal()

This chart shows the total number of trips taken by members and casual riders, broken down by day of the week.

Members take significantly more trips than casual riders every day of the week.
The peak day for both members and casual riders is Wednesday with a decline on the weekends.

Plot average number of trips by membership type per month

Calculate the total number of trips by membership and month

total_trips_month <- combined_data %>% group_by(membership, month) %>% summarise(total_trips = n(), .groups = 'drop')

Calculate the number of unique days in each month

unique_days_month <- combined_data %>% group_by(month) %>% summarise(unique_days = n_distinct(as.Date(started_time)), .groups = 'drop')

Merge the two data frames to get the total trips and unique days

avg_trips_month <- merge(total_trips_month, unique_days_month, by = "month")

Calculate the average trips per day for each membership type and month

avg_trips_month <- avg_trips_month %>% mutate(average_trips = total_trips / unique_days)

Convert month to a factor with levels in chronological order (without year)

 avg_trips_month$month <- factor(avg_trips_month$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

#Plotting

ggplot(avg_trips_month, aes(x = month, y = average_trips, fill = membership)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Number of Trips by Membership Type per Month", x = "Month", y = "Average Number of Trips") + theme_minimal()

This chart shows the average number of trips taken by members and casual riders each month.

Both member and casual riders show a strong seasonal trend.
The peak month for both is September with a significant decrease in ridership during the winter months.
The number of trips taken by members is consistently higher than that of casual riders.

Average trip duration by membership type on each day of the week

avg_duration_day <- combined_data %>% group_by(membership, week_day) %>% summarise(average_duration = mean(trip_length, na.rm = TRUE), .groups = 'drop')

ggplot(avg_duration_day, aes(x = week_day, y = average_duration, fill = membership)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Trip Duration by Membership Type Per Day", x = "Day of the Week", y = "Average Trip Duration (mins)") + theme_minimal()

This chart shows the average duration of trips taken by members and casual riders each day of the week.

Casual riders have a significantly longer average trip duration than members each day of the week.
The average trip duration for both members and casual riders is relatively consistent throughout the week

#Calculate the average trip duration by membership type and month

avg_duration_month <- combined_data %>% group_by(membership, month) %>% summarise(average_duration = mean(trip_length, na.rm = TRUE), .groups = 'drop')

#Convert month to a factor with levels in chronological order (without year)

avg_duration_month$month <- factor(avg_duration_month$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

ggplot(avg_duration_month, aes(x = month, y = average_duration, fill = membership)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Trip Duration by Membership Type per Month", x = "Month", y = "Average Trip Duration (mins)") + theme_minimal()

Calculate the average trip duration by membership type and month


avg_duration_month <- combined_data %>% group_by(membership, month) %>% summarise(average_duration = mean(trip_length, na.rm = TRUE), .groups = 'drop')

#Convert month to a factor with levels in chronological order (without year)

avg_duration_month$month <- factor(avg_duration_month$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

ggplot(avg_duration_month, aes(x = month, y = average_duration, fill = membership)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Average Trip Duration by Membership Type per Month", x = "Month", y = "Average Trip Duration (mins)") + theme_minimal()

This chart shows the average duration of trips taken by members and casual riders each month.

Casual riders have a significantly longer average trip duration than members each month.
The average trip duration for both members and casual riders follows a similar trend, decreasing in the winter months

Plot the data for bike demand

#Ensure combined_data is a data.table setDT(combined_data)

#Extract hour from started_time combined_data[, hour := as.numeric(format(started_time, "%H"))] Aggregate total trips by hour and week_day

bike_demand <- combined_data[, .(total_trips = .N), by = .(hour, week_day)]

#Convert week_day to a factor with levels ordered from Monday to Sunday

combined_data[, week_day := factor(week_day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))]

ggplot(bike_demand, aes(x = hour, y = total_trips)) + geom_line() + geom_point() + facet_wrap(~ week_day, scales = "free_y") + labs(title = "Bike Demand Over a Day by Day of the Week", x = "Hour of the Day", y = "Total Trips") + theme_minimal()

This chart shows the number of bike trips taken each hour of the day by day of the week.

There is a clear peak demand for bikes in the afternoon hours, between 4 and 6 PM.
The demand for bikes is significantly higher on weekdays than on weekends

Number of trips by membership type vs bike type

# Number of trips by membership type vs bike type
> if ("bike_type" %in% colnames(combined_data)) {
+   trips_by_bike <- combined_data[, .(total_trips = .N), by = .(membership, bike_type)]
+   ggplot(trips_by_bike, aes(x = bike_type, y = total_trips, fill = membership)) +
+     geom_bar(stat = "identity", position = "dodge") +
+     labs(title = "Number of Trips by Membership Type vs Bike Type", 
+          x = "Bike Type", 
+          y = "Total Trips") +
+     theme_minimal()
+ } else {
+   cat("Column 'bike_type' not found in the data.\n")
+ }


important_columns <- c("trip_num", "bike_type", "started_time", "ended_time", "membership") if ("bike_type" %in% colnames(combined_data)) { trips_by_bike <- combined_data[, .(total_trips = .N), by = .(membership, bike_type)] ggplot(trips_by_bike, aes(x = bike_type, y = total_trips, fill = membership)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Number of Trips by Membership Type vs Bike Type", x = "Bike Type", y = "Total Trips") + theme_minimal() } else { cat("Column 'bike_type' not found in the data.\n") }

This chart shows the number of trips taken by members and casual riders, broken down by bike type.

Members take significantly more trips than casual riders on all bike types.
The majority of trips are taken on classic bikes, followed by electric bikes.
Electric scooters have a significantly lower number of trips compared to the other two bike types.

Save the combined data to a CSV file

data.table::fwrite(combined_data, "combined_data_with_new_columns.csv") cat("Combined File Summary:\n") cat("Column Names:", colnames(combined_data), "\n") cat("Total Rows:", nrow(combined_data), "\n") cat("Total Columns:", ncol(combined_data), "\n")

Act Stage:

Analysis of Cyclistic’s Historical Bike Trip Data

The analysis of Cyclistic’s historical bike trip data reveals several key insights about the differences between casual and member riders and the potential for converting casual riders into members.

Key Findings

Member riders take more trips: The total number of trips taken by members is significantly higher than that of casual riders across all days of the week.
Member riders are more frequent: The average number of trips per month is higher for members than for casual riders throughout the year.
Member riders use bikes more frequently: Member riders take an average of 13 trips per month, while casual riders take an average of 3 trips per month. This suggests a strong level of engagement with bikes.
Member riders use bikes for shorter trips: The average trip duration is shorter for members than for casual riders both on a daily and monthly basis.
Member riders prefer classic bikes: Members favor classic bikes over electric bikes and electric scooters. Casual riders, however, tend to use electric bikes slightly more than classic bikes and are also the primary users of electric scooters.

Recommendations

Based on these findings, the following recommendations are proposed to convert casual riders into annual members:

1. Promote Membership Benefits:

Focus on convenience and value: Highlight the convenience and cost savings of an annual membership. Emphasize the fact that members can ride for shorter durations with the same affordability as casual riders.
Offer exclusive discounts: Provide members with exclusive discounts on merchandise, food, or services at partnering businesses.

2. Target Specific User Groups:

Emphasize the frequency of use: Target casual riders who take multiple trips a week, especially during peak hours or in specific locations.
Focus on the value of electric bikes: Promote the benefits of electric bikes to casual riders who prefer shorter, more frequent rides.

3. Leverage Digital Media:

Personalized email campaigns: Segment casual riders based on their trip history and send tailored emails with promotional offers, incentives, and educational content about the benefits of membership.
Targeted social media advertising: Utilize social media advertising platforms to target casual riders with relevant ads based on their interests, demographics, and location.
Mobile app promotions: Promote the benefits of membership within the Cyclistic mobile app, potentially offering in

Case Study 1_How does a bike-shared navigate speedy success

SARI HUMADI

2025-01-14