Introduction and Problem Statement

In this analysis, we delve into the Cyclistic bike-share data to gain insights into how different customer types use the bike-sharing service. We will follow the data analysis process(Ask, Prepare, Process, Analyze, Share, and Act) to answer key business questions and provide data-driven recommendations.

The primary question guiding our analysis is: How do annual members and casual riders use Cyclistic bikes differently? By addressing this question, we aim to design a new marketing strategy to convert casual riders into annual members.

Ask Phase

During the “Ask” phase, we focused on identifying the key business task and understanding the problem we need to solve. The goal was to lay the foundation for our analysis by framing the problem and outlining the scope of our investigation.

Our business task is to explore how annual members and casual riders use Cyclistic bikes differently. By understanding their usage patterns, we can inform the design of a marketing strategy aimed at converting casual riders into annual members.

Prepare Phase

During the Prepare phase, we focused on acquiring the necessary data, understanding its organization, and ensuring its credibility. Our goal was to ensure that the data is well-prepared for subsequent analysis.

The bike-sharing data has been downloaded from the company’s provided link and is stored in a dedicated ‘data’ folder within the project directory. Each quarterly dataset is saved as an individual CSV file. This organization ensures easy access and separation of data by quarters, enhancing the overall analysis process.

Data Source Used

The bike-sharing data originates from Cyclistic, the bike-sharing service provider. The data includes trip information recorded in four quarters of 2019 and the first quarter of 2020. Each dataset captures details such as trip start and end times, station information, user types, and geographic coordinates. The data sources are crucial for understanding user behavior and usage patterns, which are essential for our analysis.

(Note from the course: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.)

Data Credibility and Completeness

The bike-sharing data appears to be collected directly from user interactions with the bike-sharing service, which should reflect actual usage patterns. However, potential sources of bias could include differences in user demographics or changes in data collection methods over time. Missing values, especially in certain columns like gender and birth year, need to be addressed to ensure completeness.The consistency of column names, data types, and values across quarters should be checked and standardized.Overall, while there might be minor concerns related to bias and data completeness, the data seems credible for answering the analysis questions.

Data Acquisition and Organization

To begin, we downloaded the required Cyclistic trip data for the previous quarters. We stored the data in individual CSV files corresponding to each quarter of 2019 and 2020. We also ensured that the working directory is set to the location of the R Markdown file and the data files.

# Set the working directory to the location of your R Markdown file and CSV files
setwd("D:/Documents/rstudio/Cyclistic_Bike_Share_Analysis")

# Read CSV files
csv_data_2019q2 <- read_csv("data/Divvy_Trips_2019_Q2.csv")
csv_data_2019q3 <- read_csv("data/Divvy_Trips_2019_Q3.csv")
csv_data_2019q4 <- read_csv("data/Divvy_Trips_2019_Q4.csv")
csv_data_2020q1 <- read_csv("data/Divvy_Trips_2020_Q1.csv")

Explore the structure of each dataset

View(csv_data_2019q2)

Check for missing values in each column in each quarter

print(colSums(is.na(csv_data_2020q1)))

Data Structure and Types

To ensure the consistency of data types, we inspected the structure of each dataset. This allowed us to confirm that columns are correctly assigned to their respective data types.

# Check if data types match for each quarter
str(csv_data_2019q2)

Process Phase

In the Process phase, we focused on using the prepared data, understanding its structure, and ensuring that it’s in a suitable format for analysis. Let’s go through the key tasks performed in this phase.

I chose to use RStudio as my primary tool for this analysis due to its compatibility with handling large datasets. While other platforms like BigQuery, Google Sheets, and posit.cloud were considered, they have limitations that could hinder my analysis. For instance, BigQuery’s sandbox has size limitations for CSV files, Google Sheets struggles with the large dataset, and posit.cloud’s free tier imposes RAM constraints. RStudio provides the necessary capabilities to work with the extensive bike-sharing data effectively.

To ensure data cleanliness, we have undertaken several crucial steps. These include identifying and handling missing values, standardizing column names and data types, and verifying data consistency across different quarters. We also calculated derived columns, such as trip duration, to verify their accuracy and coherence with existing data. Furthermore, we reviewed the data for any potential outliers or anomalies that might affect the analysis. By executing these steps, we have strived to enhance the quality and reliability of the data for subsequent analysis.

Data Cleaning and Standardization

We proceeded to clean and standardize the data by addressing missing values, renaming columns, and ensuring consistent data types across different quarters. During this phase, we uncovered slight discrepancies in the “tripduration” column’s values. We calculated a new column, “calculated_duration,” based on the difference between “end_time” and “start_time.” However, due to minimal discrepancies likely caused by rounding, we decided to continue using the original “tripduration” values.

# Calculate calculated_duration by subtracting start_time from end_time a
csv_data_2019q2$calculated_duration <- as.numeric(difftime(csv_data_2019q2$end_time, csv_data_2019q2$start_time, units = "secs"))


# Check if trip duration matches calculated_duration
duration_match <- all(csv_data_2019q2$tripduration == csv_data_2019q2$calculated_duration)

# Identify rows with mismatched durations
mismatched_rows <- which(csv_data_2019q2$tripduration != csv_data_2019q2$calculated_duration)
mismatched_data <- csv_data_2019q2[mismatched_rows, ]

We also ensured that column names were consistent across quarters by renaming columns in accordance with the structure of the later quarters.

# Clean and standardize column names for q2 using the q3 names
colnames(csv_data_2019q2) <- colnames(csv_data_2019q3)

colnames(csv_data_2020q1) <- c(
  "trip_id", "rideable_type", "start_time", "end_time",
  "from_station_name", "from_station_id",
  "to_station_name", "to_station_id",
  "start_lat", "start_lng", "end_lat", "end_lng",
  "usertype"
)

Data Consistency and Usertype Standardization

We also addressed variations in the “usertype” column between 2019 and 2020. In 2020, the values were “member” and “casual,” which we mapped to “Subscriber” and “Customer,” respectively, to match the values used in 2019. We also converted the trip IDs from numeric to character format for the 2019 data to align with the format used in 2020.

# Add tripduration column to 2020q1 and calculate
csv_data_2020q1$tripduration <- as.numeric(difftime(csv_data_2020q1$end_time, csv_data_2020q1$start_time, units = "secs"))

# Rename 'Subscriber' to 'Member' and 'Customer' to 'Casual' in usertype column
csv_data_2019q2 <- csv_data_2019q2 %>%
  mutate(usertype = case_when(
    usertype == "Subscriber" ~ "Member",
    usertype == "Customer" ~ "Casual",
    TRUE ~ usertype
  )) 
csv_data_2019q3 <- csv_data_2019q3 %>%
  mutate(usertype = case_when(
    usertype == "Subscriber" ~ "Member",
    usertype == "Customer" ~ "Casual",
    TRUE ~ usertype
  )) 
csv_data_2019q4 <- csv_data_2019q4 %>%
  mutate(usertype = case_when(
    usertype == "Subscriber" ~ "Member",
    usertype == "Customer" ~ "Casual",
    TRUE ~ usertype
  )) 
csv_data_2020q1 <- csv_data_2020q1 %>%
  mutate(usertype = case_when(
    usertype == "member" ~ "Member",
    usertype == "casual" ~ "Casual",
    TRUE ~ usertype
  )) 
csv_data_2019q2$trip_id <- as.character(csv_data_2019q2$trip_id)
csv_data_2019q3$trip_id <- as.character(csv_data_2019q3$trip_id)
csv_data_2019q4$trip_id <- as.character(csv_data_2019q4$trip_id)

Adding New Columns

We added two new columns to the data frames: “ride_length” and “day_of_week.”

# Add ride_length column (trip duration in HH:MM:SS format)
csv_data_2019q2$ride_length <- format(as.POSIXct(csv_data_2019q2$end_time) - as.POSIXct(csv_data_2019q2$start_time), format = "%H:%M:%S")
csv_data_2019q3$ride_length <- format(as.POSIXct(csv_data_2019q3$end_time) - as.POSIXct(csv_data_2019q3$start_time), format = "%H:%M:%S")
csv_data_2019q4$ride_length <- format(as.POSIXct(csv_data_2019q4$end_time) - as.POSIXct(csv_data_2019q4$start_time), format = "%H:%M:%S")
csv_data_2020q1$ride_length <- format(as.POSIXct(csv_data_2020q1$end_time) - as.POSIXct(csv_data_2020q1$start_time), format = "%H:%M:%S")
# Calculate and add day_of_week column
csv_data_2019q2$day_of_week <- wday(csv_data_2019q2$start_time, label = TRUE)
csv_data_2019q3$day_of_week <- wday(csv_data_2019q3$start_time, label = TRUE)
csv_data_2019q4$day_of_week <- wday(csv_data_2019q4$start_time, label = TRUE)
csv_data_2020q1$day_of_week <- wday(csv_data_2020q1$start_time, label = TRUE)

Combine Data for Analysis

In this step, we are combining the cleaned datasets for the different quarters of 2019 to create a unified dataset for analysis. This process involves reordering the columns and standardizing their format to ensure consistency across all quarters.

This step ensures that the data from all quarters are aggregated into a single dataset, setting the stage for the subsequent analysis and exploration of insights.

# Reorder columns in each cleaned dataset
cleaned_2019q2 <- csv_data_2019q2 %>%
  select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)

cleaned_2019q3 <- csv_data_2019q3 %>%
  select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)

cleaned_2019q4 <- csv_data_2019q4 %>%
  select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)

cleaned_2020q1 <- csv_data_2020q1 %>%
  select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)

# Combine the cleaned datasets using rbind
combined_cleaned_last12months <- rbind(cleaned_2019q2, cleaned_2019q3, cleaned_2019q4, cleaned_2020q1)

# Convert day_of_week to an ordered factor
combined_cleaned_last12months$day_of_week <- factor(combined_cleaned_last12months$day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), ordered = TRUE)

The “Prepare” phase has equipped us with cleaned and standardized data that can be further analyzed to answer our business questions.

Analyze Phase

Calculate Summary Statistics

In this step, we calculated summary statistics to provide insights into ride lengths and the most common day of the week for rides. These statistics help us better understand user behavior and usage patterns within the bike-sharing service.

In some cases, we observed trip durations that exceed 24 hours. While most bike-sharing trips are typically short in duration, we encountered instances where the time between the start and end of a trip spanned across multiple days. These extended durations are not isolated incidents; there are a total of 466 instances where the trip duration exceeded 24 hours. Such instances could occur when users rent a bike for an extended period, such as a multi-day rental or a bike being returned at a later time than the rental start. It’s important to note that these extended durations could impact certain calculations and visualizations. For example, when calculating the mean ride length or analyzing patterns by day of the week, these extended durations may contribute to outliers that could skew the results. As a result, when interpreting our findings, we’ll take these instances into consideration and evaluate their potential impact on the overall analysis.

We started by calculating the mean and maximum ride lengths. The mean ride length was calculated to be approximately 00:24:37 (HH:MM:SS), while the maximum ride length was found to be 2607:30:24. The high maximum ride length can be attributed to instances where trip durations exceed 24 hours.We also determined that Tuesday is the most common day of the week for rides. This information can be valuable for operational planning and resource allocation.

# Calculate the mean ride_length in seconds
mean_ride_length_seconds <- mean(combined_cleaned_last12months$tripduration, na.rm = TRUE)

# Convert mean ride_length to HH:MM:SS format
mean_ride_length_hms <- hms::hms(seconds = mean_ride_length_seconds)


# Calculate the max ride_length in seconds
max_ride_length_seconds <- max(combined_cleaned_last12months$tripduration, na.rm = TRUE)

# Convert max ride_length to HH:MM:SS format
max_ride_length_hms <- hms::hms(seconds = max_ride_length_seconds)

# Calculate the mode of day_of_week
day_of_week_freq <- table(combined_cleaned_last12months$day_of_week)
mode_day_of_week <- names(day_of_week_freq)[which.max(day_of_week_freq)]
# Display the calculated statistics
data <- data.frame(metric = c("Mean Ride Length", "Max Ride Length", "Mode Day of Week"),
                   value = c("00:24:37.461257", "2607:30:24", "Tuesday"))

kable(data, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
metric value
Mean Ride Length 00:24:37.461257
Max Ride Length 2607:30:24
Mode Day of Week Tuesday

Average Ride Length by User Type and Day of Week

We further analyzed the data by calculating the average ride length for different user types (members and casual riders) and across different days of the week. These calculations provide insights into the typical ride lengths for each user type and how they vary throughout the week.

average_ride_length <- combined_cleaned_last12months %>%
  group_by(usertype) %>%
  summarise(average_ride_length = mean(tripduration))

average ride_length for users by day_of_week:

average_ride_length_by_day <- combined_cleaned_last12months %>%
  group_by(usertype, day_of_week) %>%
  summarise(average_ride_length = mean(tripduration))

Number of Rides by User Type and Day of Week

Additionally, we examined the number of rides by user type and day of the week. By visualizing this information, we can identify patterns in ride frequency based on user type and day of the week.

rides_by_day <- combined_cleaned_last12months %>%
  group_by(usertype, day_of_week) %>%
  summarise(number_of_rides = n())

kable(average_ride_length, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
usertype average_ride_length
Casual 3537.8214
Member 849.7967
kable(average_ride_length_by_day, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
usertype day_of_week average_ride_length
Casual Sun 3581.1876
Casual Mon 3334.7510
Casual Tue 3569.2883
Casual Wed 3689.3930
Casual Thu 3658.9860
Casual Fri 3757.4195
Casual Sat 3331.8338
Member Sun 919.6931
Member Mon 842.2905
Member Tue 825.8680
Member Wed 823.7212
Member Thu 823.6426
Member Fri 824.2674
Member Sat 968.7362
kable(rides_by_day, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
usertype day_of_week number_of_rides
Casual Sun 181299
Casual Mon 104452
Casual Tue 91191
Casual Wed 93189
Casual Thu 103347
Casual Fri 122933
Casual Sat 209543
Member Sun 267972
Member Mon 472196
Member Tue 508445
Member Wed 500330
Member Thu 484177
Member Fri 452790
Member Sat 287958

The following visualizations display our findings:

# Plot average ride length by user type and day of week
ggplot(average_ride_length_by_day, aes(x = day_of_week, y = average_ride_length, fill = usertype)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Ride Length by User Type and Day of Week",
       x = "Day of Week",
       y = "Average Ride Length") +
  theme_minimal()

# Plot number of rides by user type and day of week
ggplot(rides_by_day, aes(x = day_of_week, y = number_of_rides, fill = usertype)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of Rides by User Type and Day of Week",
       x = "Day of Week",
       y = "Number of Rides") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma_format())

By analyzing these summary statistics and visualizations, we gain insights into user behavior trends and patterns, which can guide decision-making and operational strategies for the bike-sharing service.

Average Ride Length

Upon analyzing the data, we observed that casual riders tend to have higher average ride lengths compared to members. This indicates that casual riders often opt for longer rides. On the other hand, members have shorter average ride lengths. This difference can be attributed to the diverse use cases and preferences of these two user types.

Number of Rides

The analysis revealed another interesting trend. While members have a higher number of rides overall, their ride durations are shorter. This suggests that members use the bike-sharing service more frequently for short trips or daily commutes. In contrast, casual riders, despite taking longer rides on average, engage in fewer rides. This implies that casual riders might be more inclined to use the service for leisure or occasional trips.

Day of the Week Patterns

When examining average ride length across days of the week, we found that the values remain relatively level. This suggests that ride lengths are consistent throughout the week and not heavily influenced by specific days.

Number of Rides by Day of the Week

An intriguing insight emerged when looking at the number of rides by user type and day of the week. Members tend to ride more frequently on weekends and less on weekdays. This aligns with the idea that members might use the service for recreational purposes during weekends. On the contrary, casual riders display the opposite pattern, with more rides occurring on weekdays and fewer on weekends. This observation hints that casual riders might rely on the service for commuting or short trips during the workweek.

In conclusion, this analysis provides valuable insights into user behavior patterns based on user type and day of the week. It underscores the distinction between casual riders’ longer, infrequent rides and members’ shorter, more frequent rides. Furthermore, the observations about day-of-week preferences shed light on the diverse ways in which different user types engage with the bike-sharing service.

Additional visualizations

# Top starting stations
top_start_stations <- combined_cleaned_last12months %>%
  group_by(from_station_name) %>%
  summarise(num_trips = n()) %>%
  arrange(desc(num_trips)) %>%
  head(10)

# Bar chart for top starting stations with a single blue color and horizontal bars
ggplot(top_start_stations, aes(x = num_trips, y = reorder(from_station_name, num_trips))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = num_trips), hjust = 1.5, color = "black", size = 3.5) +  # Add annotations
  labs(title = "Top 10 Starting Stations",
       x = "Number of Trips",
       y = "Station Name") +
 theme_minimal()

# Extract month and year from start_time
combined_cleaned_last12months$month <- lubridate::month(combined_cleaned_last12months$start_time, label = TRUE, abbr = FALSE)
combined_cleaned_last12months$year <- lubridate::year(combined_cleaned_last12months$start_time)

# Define the correct order of months
month_order <- c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March")

# Convert month to an ordered factor
combined_cleaned_last12months$month <- factor(combined_cleaned_last12months$month, levels = month_order, ordered = TRUE)

# Total number of rides by month and usertype
monthly_totals_by_usertype <- combined_cleaned_last12months %>%
  group_by(year, month, usertype) %>%
  summarise(total_rides = n())

# Line plot for monthly trends grouped by usertype
ggplot(monthly_totals_by_usertype, aes(x = month, y = total_rides, group = usertype, color = factor(usertype))) +
  geom_line() +
  labs(title = "Monthly Ride Trends by User Type",
       subtitle = "April 2019 to March 2020",
       x = "Month",
       y = "Total Number of Rides",
       color = "User Type") +
  scale_x_discrete(labels = month_order) +
  scale_y_continuous(labels = scales::comma_format())+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Extract hour from start_time
combined_cleaned_last12months$hour <- lubridate::hour(combined_cleaned_last12months$start_time)

# Create a vector of labels for the x-axis
hour_labels <- c("12am", paste(1:11, "am"), "12pm", paste(1:11, "pm"))

# Histogram for usage by hour of the day
ggplot(combined_cleaned_last12months, aes(x = hour)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(title = "Usage by Time of Day",
       x = "Hour of the Day",
       y = "Number of Rides") +
  scale_x_continuous(breaks = 0:23, labels = hour_labels) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Share Phase

In the “Share” phase, our focus is on effectively communicating the insights gained from our analysis. The goal is to create polished and sophisticated visualizations that convey the story our data tells and address the initial business question.

Key Findings

Usage Patterns: Our analysis revealed that annual members and casual riders use Cyclistic bikes differently. Casual riders tend to have longer average ride lengths compared to annual members. However, annual members have a higher frequency of rides, indicating that they use the service more frequently for shorter trips.

Day of the Week: Both casual riders and annual members show consistent ride patterns throughout the week. Tuesdays have the highest number of rides, and the usage decreases slightly towards the weekend. This trend is observed for both user types, with casual riders having a slightly more balanced distribution compared to annual members.

Visualizations

We’ve created a series of visualizations to support our key findings:

Average Ride Length by User Type and Day of Week: This stacked bar chart illustrates the average ride length for both user types across different days of the week. The chart visually highlights the differences in ride lengths between casual riders and annual members. Number of Rides by User Type and Day of Week: This stacked bar chart shows the number of rides for each user type on different days of the week. The chart highlights the usage patterns and days with the highest ride frequencies. Monthly Ride Trends by User Type: A line plot depicts the monthly ride trends for both user types. The chart helps us understand the seasonal variations in ride volumes and the differences between casual riders and annual members.

Click here for the Presentation