In this analysis, we delve into the Cyclistic bike-share data to gain insights into how different customer types use the bike-sharing service. We will follow the data analysis process(Ask, Prepare, Process, Analyze, Share, and Act) to answer key business questions and provide data-driven recommendations.
The primary question guiding our analysis is: How do annual members and casual riders use Cyclistic bikes differently? By addressing this question, we aim to design a new marketing strategy to convert casual riders into annual members.
During the “Ask” phase, we focused on identifying the key business task and understanding the problem we need to solve. The goal was to lay the foundation for our analysis by framing the problem and outlining the scope of our investigation.
Our business task is to explore how annual members and casual riders use Cyclistic bikes differently. By understanding their usage patterns, we can inform the design of a marketing strategy aimed at converting casual riders into annual members.
During the Prepare phase, we focused on acquiring the necessary data, understanding its organization, and ensuring its credibility. Our goal was to ensure that the data is well-prepared for subsequent analysis.
The bike-sharing data has been downloaded from the company’s provided link and is stored in a dedicated ‘data’ folder within the project directory. Each quarterly dataset is saved as an individual CSV file. This organization ensures easy access and separation of data by quarters, enhancing the overall analysis process.
The bike-sharing data originates from Cyclistic, the bike-sharing service provider. The data includes trip information recorded in four quarters of 2019 and the first quarter of 2020. Each dataset captures details such as trip start and end times, station information, user types, and geographic coordinates. The data sources are crucial for understanding user behavior and usage patterns, which are essential for our analysis.
(Note from the course: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.)
The bike-sharing data appears to be collected directly from user interactions with the bike-sharing service, which should reflect actual usage patterns. However, potential sources of bias could include differences in user demographics or changes in data collection methods over time. Missing values, especially in certain columns like gender and birth year, need to be addressed to ensure completeness.The consistency of column names, data types, and values across quarters should be checked and standardized.Overall, while there might be minor concerns related to bias and data completeness, the data seems credible for answering the analysis questions.
To begin, we downloaded the required Cyclistic trip data for the previous quarters. We stored the data in individual CSV files corresponding to each quarter of 2019 and 2020. We also ensured that the working directory is set to the location of the R Markdown file and the data files.
# Set the working directory to the location of your R Markdown file and CSV files
setwd("D:/Documents/rstudio/Cyclistic_Bike_Share_Analysis")
# Read CSV files
csv_data_2019q2 <- read_csv("data/Divvy_Trips_2019_Q2.csv")
csv_data_2019q3 <- read_csv("data/Divvy_Trips_2019_Q3.csv")
csv_data_2019q4 <- read_csv("data/Divvy_Trips_2019_Q4.csv")
csv_data_2020q1 <- read_csv("data/Divvy_Trips_2020_Q1.csv")
View(csv_data_2019q2)
print(colSums(is.na(csv_data_2020q1)))
To ensure the consistency of data types, we inspected the structure of each dataset. This allowed us to confirm that columns are correctly assigned to their respective data types.
# Check if data types match for each quarter
str(csv_data_2019q2)
In the Process phase, we focused on using the prepared data, understanding its structure, and ensuring that it’s in a suitable format for analysis. Let’s go through the key tasks performed in this phase.
I chose to use RStudio as my primary tool for this analysis due to its compatibility with handling large datasets. While other platforms like BigQuery, Google Sheets, and posit.cloud were considered, they have limitations that could hinder my analysis. For instance, BigQuery’s sandbox has size limitations for CSV files, Google Sheets struggles with the large dataset, and posit.cloud’s free tier imposes RAM constraints. RStudio provides the necessary capabilities to work with the extensive bike-sharing data effectively.
To ensure data cleanliness, we have undertaken several crucial steps. These include identifying and handling missing values, standardizing column names and data types, and verifying data consistency across different quarters. We also calculated derived columns, such as trip duration, to verify their accuracy and coherence with existing data. Furthermore, we reviewed the data for any potential outliers or anomalies that might affect the analysis. By executing these steps, we have strived to enhance the quality and reliability of the data for subsequent analysis.
We proceeded to clean and standardize the data by addressing missing values, renaming columns, and ensuring consistent data types across different quarters. During this phase, we uncovered slight discrepancies in the “tripduration” column’s values. We calculated a new column, “calculated_duration,” based on the difference between “end_time” and “start_time.” However, due to minimal discrepancies likely caused by rounding, we decided to continue using the original “tripduration” values.
# Calculate calculated_duration by subtracting start_time from end_time a
csv_data_2019q2$calculated_duration <- as.numeric(difftime(csv_data_2019q2$end_time, csv_data_2019q2$start_time, units = "secs"))
# Check if trip duration matches calculated_duration
duration_match <- all(csv_data_2019q2$tripduration == csv_data_2019q2$calculated_duration)
# Identify rows with mismatched durations
mismatched_rows <- which(csv_data_2019q2$tripduration != csv_data_2019q2$calculated_duration)
mismatched_data <- csv_data_2019q2[mismatched_rows, ]
We also ensured that column names were consistent across quarters by renaming columns in accordance with the structure of the later quarters.
# Clean and standardize column names for q2 using the q3 names
colnames(csv_data_2019q2) <- colnames(csv_data_2019q3)
colnames(csv_data_2020q1) <- c(
"trip_id", "rideable_type", "start_time", "end_time",
"from_station_name", "from_station_id",
"to_station_name", "to_station_id",
"start_lat", "start_lng", "end_lat", "end_lng",
"usertype"
)
We also addressed variations in the “usertype” column between 2019 and 2020. In 2020, the values were “member” and “casual,” which we mapped to “Subscriber” and “Customer,” respectively, to match the values used in 2019. We also converted the trip IDs from numeric to character format for the 2019 data to align with the format used in 2020.
# Add tripduration column to 2020q1 and calculate
csv_data_2020q1$tripduration <- as.numeric(difftime(csv_data_2020q1$end_time, csv_data_2020q1$start_time, units = "secs"))
# Rename 'Subscriber' to 'Member' and 'Customer' to 'Casual' in usertype column
csv_data_2019q2 <- csv_data_2019q2 %>%
mutate(usertype = case_when(
usertype == "Subscriber" ~ "Member",
usertype == "Customer" ~ "Casual",
TRUE ~ usertype
))
csv_data_2019q3 <- csv_data_2019q3 %>%
mutate(usertype = case_when(
usertype == "Subscriber" ~ "Member",
usertype == "Customer" ~ "Casual",
TRUE ~ usertype
))
csv_data_2019q4 <- csv_data_2019q4 %>%
mutate(usertype = case_when(
usertype == "Subscriber" ~ "Member",
usertype == "Customer" ~ "Casual",
TRUE ~ usertype
))
csv_data_2020q1 <- csv_data_2020q1 %>%
mutate(usertype = case_when(
usertype == "member" ~ "Member",
usertype == "casual" ~ "Casual",
TRUE ~ usertype
))
csv_data_2019q2$trip_id <- as.character(csv_data_2019q2$trip_id)
csv_data_2019q3$trip_id <- as.character(csv_data_2019q3$trip_id)
csv_data_2019q4$trip_id <- as.character(csv_data_2019q4$trip_id)
We added two new columns to the data frames: “ride_length” and “day_of_week.”
# Add ride_length column (trip duration in HH:MM:SS format)
csv_data_2019q2$ride_length <- format(as.POSIXct(csv_data_2019q2$end_time) - as.POSIXct(csv_data_2019q2$start_time), format = "%H:%M:%S")
csv_data_2019q3$ride_length <- format(as.POSIXct(csv_data_2019q3$end_time) - as.POSIXct(csv_data_2019q3$start_time), format = "%H:%M:%S")
csv_data_2019q4$ride_length <- format(as.POSIXct(csv_data_2019q4$end_time) - as.POSIXct(csv_data_2019q4$start_time), format = "%H:%M:%S")
csv_data_2020q1$ride_length <- format(as.POSIXct(csv_data_2020q1$end_time) - as.POSIXct(csv_data_2020q1$start_time), format = "%H:%M:%S")
# Calculate and add day_of_week column
csv_data_2019q2$day_of_week <- wday(csv_data_2019q2$start_time, label = TRUE)
csv_data_2019q3$day_of_week <- wday(csv_data_2019q3$start_time, label = TRUE)
csv_data_2019q4$day_of_week <- wday(csv_data_2019q4$start_time, label = TRUE)
csv_data_2020q1$day_of_week <- wday(csv_data_2020q1$start_time, label = TRUE)
In this step, we are combining the cleaned datasets for the different quarters of 2019 to create a unified dataset for analysis. This process involves reordering the columns and standardizing their format to ensure consistency across all quarters.
This step ensures that the data from all quarters are aggregated into a single dataset, setting the stage for the subsequent analysis and exploration of insights.
# Reorder columns in each cleaned dataset
cleaned_2019q2 <- csv_data_2019q2 %>%
select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)
cleaned_2019q3 <- csv_data_2019q3 %>%
select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)
cleaned_2019q4 <- csv_data_2019q4 %>%
select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)
cleaned_2020q1 <- csv_data_2020q1 %>%
select(trip_id, start_time, end_time, tripduration, from_station_id, from_station_name, to_station_id, to_station_name, usertype, ride_length, day_of_week)
# Combine the cleaned datasets using rbind
combined_cleaned_last12months <- rbind(cleaned_2019q2, cleaned_2019q3, cleaned_2019q4, cleaned_2020q1)
# Convert day_of_week to an ordered factor
combined_cleaned_last12months$day_of_week <- factor(combined_cleaned_last12months$day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), ordered = TRUE)
The “Prepare” phase has equipped us with cleaned and standardized data that can be further analyzed to answer our business questions.
In this step, we calculated summary statistics to provide insights into ride lengths and the most common day of the week for rides. These statistics help us better understand user behavior and usage patterns within the bike-sharing service.
In some cases, we observed trip durations that exceed 24 hours. While most bike-sharing trips are typically short in duration, we encountered instances where the time between the start and end of a trip spanned across multiple days. These extended durations are not isolated incidents; there are a total of 466 instances where the trip duration exceeded 24 hours. Such instances could occur when users rent a bike for an extended period, such as a multi-day rental or a bike being returned at a later time than the rental start. It’s important to note that these extended durations could impact certain calculations and visualizations. For example, when calculating the mean ride length or analyzing patterns by day of the week, these extended durations may contribute to outliers that could skew the results. As a result, when interpreting our findings, we’ll take these instances into consideration and evaluate their potential impact on the overall analysis.
We started by calculating the mean and maximum ride lengths. The mean ride length was calculated to be approximately 00:24:37 (HH:MM:SS), while the maximum ride length was found to be 2607:30:24. The high maximum ride length can be attributed to instances where trip durations exceed 24 hours.We also determined that Tuesday is the most common day of the week for rides. This information can be valuable for operational planning and resource allocation.
# Calculate the mean ride_length in seconds
mean_ride_length_seconds <- mean(combined_cleaned_last12months$tripduration, na.rm = TRUE)
# Convert mean ride_length to HH:MM:SS format
mean_ride_length_hms <- hms::hms(seconds = mean_ride_length_seconds)
# Calculate the max ride_length in seconds
max_ride_length_seconds <- max(combined_cleaned_last12months$tripduration, na.rm = TRUE)
# Convert max ride_length to HH:MM:SS format
max_ride_length_hms <- hms::hms(seconds = max_ride_length_seconds)
# Calculate the mode of day_of_week
day_of_week_freq <- table(combined_cleaned_last12months$day_of_week)
mode_day_of_week <- names(day_of_week_freq)[which.max(day_of_week_freq)]
# Display the calculated statistics
data <- data.frame(metric = c("Mean Ride Length", "Max Ride Length", "Mode Day of Week"),
value = c("00:24:37.461257", "2607:30:24", "Tuesday"))
kable(data, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| metric | value |
|---|---|
| Mean Ride Length | 00:24:37.461257 |
| Max Ride Length | 2607:30:24 |
| Mode Day of Week | Tuesday |
We further analyzed the data by calculating the average ride length for different user types (members and casual riders) and across different days of the week. These calculations provide insights into the typical ride lengths for each user type and how they vary throughout the week.
average_ride_length <- combined_cleaned_last12months %>%
group_by(usertype) %>%
summarise(average_ride_length = mean(tripduration))
average ride_length for users by day_of_week:
average_ride_length_by_day <- combined_cleaned_last12months %>%
group_by(usertype, day_of_week) %>%
summarise(average_ride_length = mean(tripduration))
Additionally, we examined the number of rides by user type and day of the week. By visualizing this information, we can identify patterns in ride frequency based on user type and day of the week.
rides_by_day <- combined_cleaned_last12months %>%
group_by(usertype, day_of_week) %>%
summarise(number_of_rides = n())
kable(average_ride_length, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| usertype | average_ride_length |
|---|---|
| Casual | 3537.8214 |
| Member | 849.7967 |
kable(average_ride_length_by_day, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| usertype | day_of_week | average_ride_length |
|---|---|---|
| Casual | Sun | 3581.1876 |
| Casual | Mon | 3334.7510 |
| Casual | Tue | 3569.2883 |
| Casual | Wed | 3689.3930 |
| Casual | Thu | 3658.9860 |
| Casual | Fri | 3757.4195 |
| Casual | Sat | 3331.8338 |
| Member | Sun | 919.6931 |
| Member | Mon | 842.2905 |
| Member | Tue | 825.8680 |
| Member | Wed | 823.7212 |
| Member | Thu | 823.6426 |
| Member | Fri | 824.2674 |
| Member | Sat | 968.7362 |
kable(rides_by_day, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| usertype | day_of_week | number_of_rides |
|---|---|---|
| Casual | Sun | 181299 |
| Casual | Mon | 104452 |
| Casual | Tue | 91191 |
| Casual | Wed | 93189 |
| Casual | Thu | 103347 |
| Casual | Fri | 122933 |
| Casual | Sat | 209543 |
| Member | Sun | 267972 |
| Member | Mon | 472196 |
| Member | Tue | 508445 |
| Member | Wed | 500330 |
| Member | Thu | 484177 |
| Member | Fri | 452790 |
| Member | Sat | 287958 |
The following visualizations display our findings:
# Plot average ride length by user type and day of week
ggplot(average_ride_length_by_day, aes(x = day_of_week, y = average_ride_length, fill = usertype)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Ride Length by User Type and Day of Week",
x = "Day of Week",
y = "Average Ride Length") +
theme_minimal()
# Plot number of rides by user type and day of week
ggplot(rides_by_day, aes(x = day_of_week, y = number_of_rides, fill = usertype)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Number of Rides by User Type and Day of Week",
x = "Day of Week",
y = "Number of Rides") +
theme_minimal() +
scale_y_continuous(labels = scales::comma_format())
By analyzing these summary statistics and visualizations, we gain insights into user behavior trends and patterns, which can guide decision-making and operational strategies for the bike-sharing service.
Upon analyzing the data, we observed that casual riders tend to have higher average ride lengths compared to members. This indicates that casual riders often opt for longer rides. On the other hand, members have shorter average ride lengths. This difference can be attributed to the diverse use cases and preferences of these two user types.
The analysis revealed another interesting trend. While members have a higher number of rides overall, their ride durations are shorter. This suggests that members use the bike-sharing service more frequently for short trips or daily commutes. In contrast, casual riders, despite taking longer rides on average, engage in fewer rides. This implies that casual riders might be more inclined to use the service for leisure or occasional trips.
When examining average ride length across days of the week, we found that the values remain relatively level. This suggests that ride lengths are consistent throughout the week and not heavily influenced by specific days.
An intriguing insight emerged when looking at the number of rides by user type and day of the week. Members tend to ride more frequently on weekends and less on weekdays. This aligns with the idea that members might use the service for recreational purposes during weekends. On the contrary, casual riders display the opposite pattern, with more rides occurring on weekdays and fewer on weekends. This observation hints that casual riders might rely on the service for commuting or short trips during the workweek.
In conclusion, this analysis provides valuable insights into user behavior patterns based on user type and day of the week. It underscores the distinction between casual riders’ longer, infrequent rides and members’ shorter, more frequent rides. Furthermore, the observations about day-of-week preferences shed light on the diverse ways in which different user types engage with the bike-sharing service.
# Top starting stations
top_start_stations <- combined_cleaned_last12months %>%
group_by(from_station_name) %>%
summarise(num_trips = n()) %>%
arrange(desc(num_trips)) %>%
head(10)
# Bar chart for top starting stations with a single blue color and horizontal bars
ggplot(top_start_stations, aes(x = num_trips, y = reorder(from_station_name, num_trips))) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = num_trips), hjust = 1.5, color = "black", size = 3.5) + # Add annotations
labs(title = "Top 10 Starting Stations",
x = "Number of Trips",
y = "Station Name") +
theme_minimal()
# Extract month and year from start_time
combined_cleaned_last12months$month <- lubridate::month(combined_cleaned_last12months$start_time, label = TRUE, abbr = FALSE)
combined_cleaned_last12months$year <- lubridate::year(combined_cleaned_last12months$start_time)
# Define the correct order of months
month_order <- c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March")
# Convert month to an ordered factor
combined_cleaned_last12months$month <- factor(combined_cleaned_last12months$month, levels = month_order, ordered = TRUE)
# Total number of rides by month and usertype
monthly_totals_by_usertype <- combined_cleaned_last12months %>%
group_by(year, month, usertype) %>%
summarise(total_rides = n())
# Line plot for monthly trends grouped by usertype
ggplot(monthly_totals_by_usertype, aes(x = month, y = total_rides, group = usertype, color = factor(usertype))) +
geom_line() +
labs(title = "Monthly Ride Trends by User Type",
subtitle = "April 2019 to March 2020",
x = "Month",
y = "Total Number of Rides",
color = "User Type") +
scale_x_discrete(labels = month_order) +
scale_y_continuous(labels = scales::comma_format())+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Extract hour from start_time
combined_cleaned_last12months$hour <- lubridate::hour(combined_cleaned_last12months$start_time)
# Create a vector of labels for the x-axis
hour_labels <- c("12am", paste(1:11, "am"), "12pm", paste(1:11, "pm"))
# Histogram for usage by hour of the day
ggplot(combined_cleaned_last12months, aes(x = hour)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
scale_y_continuous(labels = scales::comma_format()) +
labs(title = "Usage by Time of Day",
x = "Hour of the Day",
y = "Number of Rides") +
scale_x_continuous(breaks = 0:23, labels = hour_labels) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))