Project Title: How Does a Bike-Share Navigate Speedy Success?
Company: Cyclistic Bike
Objective: To analyze how casual riders and annual members use Cyclistic bikes differently and provide insights for converting casual riders into annual members.
Business Task: The marketing team at Cyclistic aims to increase the number of annual memberships. Understanding the differences in usage patterns between casual riders and annual members will help design a targeted marketing strategy.
Cyclistic operates a fleet of more than 5,800 bicycles that can be accessed from over 600 docking stations across the city. Bikes can be borrowed from one docking station, ridden, and then returned to any docking station. Over the years, marketing campaigns have been broad and targeted a cross-section of potential users. Data analysis has shown that riders with an annual membership are more profitable than casual riders. Lily Moreno, the director of marketing, wants to implement a new marketing strategy to convert casual riders into annual members. She believes that, with the right campaign, there is a very good chance of converting casual riders into members. Cyclistic also offers several bike options, including electric bikes, classic bikes, and docked bikes, which makes its service accessible to more people. Lily has tasked the marketing analytics team with analyzing one year of past user data to find trends and habits among Cyclistic’s users that can inform this campaign. The marketing analytics team would like to know:
How annual members and casual riders differ
Why casual riders would buy a membership
How Cyclistic can use digital media to influence casual riders to become members
My task is to analyze Cyclistic's historical bike trip data to identify trends in how casual riders and members use the bikes.
The business objective of the case study is to identify opportunities for targeted marketing campaigns to convert casual riders into annual members. This will be done through analysis of bike trip data and understanding user behavior and preferences. The ultimate goal is to increase profitability and drive future growth for the company.
As an analyst, my task is to do the following:
The Stakeholders in this case study include:
Lily Moreno: Director of Marketing at Cyclistic, who is responsible for implementing the marketing campaigns at Cyclistic.
Cyclistic’s marketing team: They will be responsible for conducting the analysis and developing the marketing strategy based on the insights gained.
Cyclistic’s investors and shareholders: They have a financial interest in the company’s success and may be interested in the results of the analysis and any changes to the marketing strategy.
The data for this analysis can be accessed through the provided link. It includes 12 months of historical trip data from Cyclistic, a fictional bike share company based in Chicago. It should be noted that the data is public and can be used to explore how different customer types are using Cyclistic bikes.
For this project, the data consists of monthly CSV files covering 12 months (January 2021 to December 2021). Each file has 13 columns of ride details, such as ride ID, rider type, start and end times, start and end locations, and geographic coordinates. The data is organized in a way that allows for analysis of trends and patterns in the usage of Cyclistic’s bike share services.
Motivate, Inc. collected the data for this analysis directly through its management of the Cyclistic Bike Share program for the City of Chicago. The data is comprehensive and consistent, as it includes information on all rides taken by users and is not just a sample. It is also current, as it is released on a monthly basis by the City of Chicago. The data is made available to the public by the City of Chicago.
The data used for this analysis has had all identifying information removed in order to protect the privacy of users. This limitation on the data does restrict the scope of the possible analysis, as it is not possible to determine whether casual riders are repeat users or residents of the Chicago area. The data is released under a specific license and is made available for use in this analysis.
The available dataset is sufficient to answer the business question about how usage patterns differ between annual members and casual riders. Each ride is labeled by rider type: casual riders typically pay for individual or daily rides, while members purchase annual subscriptions. This label makes it possible to compare the behavior of the two groups and to inform targeted marketing campaigns. Further analysis of other variables, such as ride duration and location, may reveal more about how each group uses the service.
The challenges I faced during the analysis were:
In order to efficiently prepare, process, clean, analyze, and visualize the data for this project, I selected RStudio Desktop as the primary tool. The large size of the dataset made it impractical to use tools such as Microsoft Excel or Google Sheets, and RStudio Cloud was also unable to handle the volume of data. RStudio Desktop provided the necessary capabilities to effectively work with the data and generate meaningful insights.
In addition to RStudio Desktop, I also utilized Tableau to create visualizations for this project. The powerful data visualization capabilities of Tableau allowed me to effectively communicate the results of the analysis and highlight key trends and patterns in the data.
Overall, the combination of RStudio Desktop and Tableau proved to be a powerful toolkit for preparing, processing, cleaning, analyzing, and visualizing the data for this project.
In order to gain an understanding of the data and its potential for analysis, a review was conducted to assess the content of the variables, the format of the data, and the integrity of the data. This initial review provided an overview of the data and helped to identify any potential issues or challenges that would need to be addressed in the preparation and analysis process.
Data review involved the following:
The review found the following:
All 12 files were combined into one data set after the initial review was completed. The final data set consisted of 5733451 rows and 13 columns of character and numeric data, which matched the total number of records in the 12 monthly data files.
#load packages
library(tidyverse)
library(lubridate)
library(janitor)
library(data.table)
library(readr)
library(psych)
library(hrbrthemes)
library(ggplot2)
#Import Data
january_2021 <- read.csv("CyclisticBike/202101-divvy-tripdata.csv")
february_2021 <- read.csv("CyclisticBike/202102-divvy-tripdata.csv")
march_2021 <- read.csv("CyclisticBike/202103-divvy-tripdata.csv")
april_2021 <- read.csv("CyclisticBike/202104-divvy-tripdata.csv")
may_2021 <- read.csv("CyclisticBike/202105-divvy-tripdata.csv")
june_2021 <- read.csv("CyclisticBike/202106-divvy-tripdata.csv")
july_2021 <- read.csv("CyclisticBike/202107-divvy-tripdata.csv")
august_2021 <- read.csv("CyclisticBike/202108-divvy-tripdata.csv")
september_2021 <- read.csv("CyclisticBike/202109-divvy-tripdata.csv")
october_2021 <- read.csv("CyclisticBike/202110-divvy-tripdata.csv")
november_2021 <- read.csv("CyclisticBike/202111-divvy-tripdata.csv")
december_2021 <- read.csv("CyclisticBike/202112-divvy-tripdata.csv")
#Data Validation
colnames(january_2021)
colnames(february_2021)
colnames(march_2021)
colnames(april_2021)
colnames(may_2021)
colnames(june_2021)
colnames(july_2021)
colnames(august_2021)
colnames(september_2021)
colnames(october_2021)
colnames(november_2021)
colnames(december_2021)
#Total number of rows
sum(nrow(january_2021) + nrow(february_2021) + nrow(march_2021) + nrow(april_2021)
+ nrow(may_2021) + nrow(june_2021) + nrow(july_2021) + nrow(august_2021)
+ nrow(september_2021) + nrow(october_2021)+ nrow(november_2021) + nrow(december_2021))
#Combine Data of 12 month into one
trip_merge <- rbind(january_2021,february_2021,march_2021,april_2021,may_2021,june_2021,
july_2021,august_2021,september_2021,october_2021,november_2021, december_2021)
# Save the combined files
write.csv(trip_merge,file = "data/trip_final.csv",row.names = FALSE)
#Final data validation
colnames(trip_merge)
str(trip_merge)
View(head(trip_merge))
View(tail(trip_merge))
dim(trip_merge)
summary(trip_merge)
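The import and validation steps above can also be written as a loop over the files. The following is a minimal sketch, not the code used for this project. It writes two tiny stand-in CSV files to a temporary folder (the folder, the file contents, and the helper name `read_monthly_files` are made up for illustration), reads every file that matches the naming pattern, checks that all files share the same column names, and confirms that the combined row count equals the sum of the individual files.

```r
# Hypothetical helper: read every monthly file in `dir`, check that the
# columns match, and stack the files into one data frame.
read_monthly_files <- function(dir, pattern = "-divvy-tripdata\\.csv$") {
  paths <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  stopifnot(length(paths) > 0)
  monthly <- lapply(paths, read.csv, stringsAsFactors = FALSE)

  # Every file must have the same column names in the same order
  first_cols <- colnames(monthly[[1]])
  same_cols <- vapply(monthly,
                      function(df) identical(colnames(df), first_cols),
                      logical(1))
  stopifnot(all(same_cols))

  combined <- do.call(rbind, monthly)

  # The combined row count must equal the sum of the monthly row counts
  stopifnot(nrow(combined) == sum(vapply(monthly, nrow, integer(1))))
  combined
}

# Stand-in data so the sketch runs on its own (not real trip data)
demo_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"),
                     member_casual = c("member", "casual")),
          file.path(demo_dir, "202101-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "casual"),
          file.path(demo_dir, "202102-divvy-tripdata.csv"), row.names = FALSE)

demo_trips <- read_monthly_files(demo_dir)
dim(demo_trips)
```

With this approach the checks stop the script as soon as a file has unexpected columns, instead of relying on a visual comparison of twelve `colnames()` outputs.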
In this stage, I cleaned the data to identify and correct or remove errors and inconsistencies. The steps included removing rows with missing values, removing duplicate records, dropping rides with invalid start and end times, and deriving date, time, and ride-length columns for analysis. Data cleaning is an important step because it helps ensure that the data is accurate and reliable and that the results of the analysis are meaningful.
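One caveat is worth checking before counting missing values: `read.csv()` turns blank cells into `NA` in numeric columns, but it keeps blank cells in text columns as empty strings, so `is.na()` and `complete.cases()` do not see them. The sketch below illustrates this on a three-row stand-in table. The column names and values are illustrative assumptions, not taken from the project files, and whether blank text cells should count as missing is a judgment call for the analyst.

```r
# Stand-in CSV text with one blank station name and one blank coordinate
csv_text <- "ride_id,start_station_name,end_lat
r1,Clark St,41.9
r2,,41.8
r3,State St,
"

# Default import: the blank station name stays an empty string, not NA
raw_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)
colSums(is.na(raw_default))

# Treat blank cells as missing in every column
raw_blank_na <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = c("", "NA"))
colSums(is.na(raw_blank_na))

# complete.cases() now drops both incomplete rows
complete_rows <- raw_blank_na[complete.cases(raw_blank_na), ]
nrow(complete_rows)
```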
#Count missing (NA) values in each column
colSums(is.na(trip_merge))
#Remove missing
clean_trip <- trip_merge[complete.cases(trip_merge), ]
#Remove duplicates
clean_trip <- distinct(clean_trip)
#Remove na, empty, missing
clean_trip <- drop_na(clean_trip)
clean_trip <- remove_empty(clean_trip)
clean_trip <- remove_missing(clean_trip)
#Keep only rides that start before they end
clean_trip<- clean_trip %>%
filter(started_at < ended_at)
#Renaming column for better context
clean_trip <- rename(clean_trip, costumer_type = member_casual,
bike_type = rideable_type)
#Separate date in date, day, month, year for better analysis
clean_trip$date <- as.Date(clean_trip$started_at)
clean_trip$week_day <- format(as.Date(clean_trip$date), "%A")
clean_trip$month <- format(as.Date(clean_trip$date), "%b_%y")
clean_trip$year <- format(clean_trip$date, "%Y")
#Separate column for time
clean_trip$time <- as.POSIXct(clean_trip$started_at, format = "%Y-%m-%d %H:%M:%S")
clean_trip$time <- format(clean_trip$time, format = "%H:%M")
#Add ride length column
clean_trip$ride_length <- difftime(clean_trip$ended_at,
clean_trip$started_at, units = "mins")
#Select the data we are going to use
clean_trip <- clean_trip %>%
select(bike_type, costumer_type, month, year, time, started_at, week_day, ride_length)
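#Remove rides longer than 24 hours (1440 minutes; possibly stolen or unreturned bikes) and rides shorter than 5 minutes
clean_trip <- clean_trip[!(clean_trip$ride_length > 1440), ]
clean_trip <- clean_trip[!(clean_trip$ride_length < 5), ]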
#Check Cleaned data
colSums(is.na(clean_trip))
View(filter(clean_trip, clean_trip$ride_length > 1440 | clean_trip$ride_length < 5))
#Save the cleaned data
write.csv(clean_trip,file = "CyclisticBike/clean_trip.csv",row.names = FALSE)
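The ride-length calculation and the duration filter can be checked on a few hand-made rides before running them on the full data set. This is a minimal sketch under two assumptions that are not stated in the original script: the timestamps are parsed with an explicit time zone (`America/Chicago` is assumed here), and the ride length is stored as a plain number of minutes rather than a `difftime` object. The timestamps are made up for illustration.

```r
# Stand-in trips (timestamps are made up for illustration)
demo <- data.frame(
  started_at = c("2021-06-01 08:00:00", "2021-06-01 09:00:00",
                 "2021-06-02 10:00:00", "2021-06-03 12:00:00"),
  ended_at   = c("2021-06-01 08:30:00", "2021-06-01 09:02:00",
                 "2021-06-04 10:00:00", "2021-06-03 11:50:00"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps once, with an explicit (assumed) time zone
tz_assumed <- "America/Chicago"
start_time <- as.POSIXct(demo$started_at, format = "%Y-%m-%d %H:%M:%S",
                         tz = tz_assumed)
end_time   <- as.POSIXct(demo$ended_at, format = "%Y-%m-%d %H:%M:%S",
                         tz = tz_assumed)

# Ride length in minutes as a plain number
demo$ride_length <- as.numeric(difftime(end_time, start_time, units = "mins"))

# Keep rides that last between 5 and 1440 minutes; this also drops
# rides with a negative length (end before start)
kept <- demo[demo$ride_length >= 5 & demo$ride_length <= 1440, ]
kept$ride_length
```

Only the first stand-in ride passes the filter: the second is shorter than 5 minutes, the third lasts two days, and the fourth ends before it starts.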
During the data analysis phase, I explored the data to understand its characteristics and patterns. I computed descriptive statistics (minimum, maximum, median, and mean ride length) and compared them across customer types, weekdays, and months. I also created charts and other visualizations to help identify trends. Analyzing the data in this way produced insights that can inform business decisions.
#import the cleaned data
clean_trip <- read_csv("CyclisticBike/clean_trip.csv")
str(clean_trip)
names(clean_trip)
#order the data
clean_trip$month <- ordered(clean_trip$month,levels=c("Jan_21","Feb_21","Mar_21","Apr_21"
,"May_21","Jun_21","Jul_21","Aug_21"
,"Sep_21","Oct_21","Nov_21","Dec_21"))
clean_trip$week_day <- ordered(clean_trip$week_day,levels=c("Sunday", "Monday", "Tuesday",
"Wednesday", "Thursday",
"Friday", "Saturday"))
#Analysis:- min, max, median, average
View(describe(clean_trip$ride_length, fast=TRUE))
#Total number of rides by customer type
View(table(clean_trip$costumer_type))
#Total ride time (in minutes) for each customer type
View(setNames(aggregate(ride_length ~ costumer_type, clean_trip, sum),
c("customer_type", "total_ride_len(mins)")))
#Differences between members and casual riders in terms of length of ride
View(clean_trip %>%
group_by(costumer_type) %>%
summarise(min_length_mins = min(ride_length), max_length_min = max(ride_length),
median_length_mins = median(ride_length), mean_length_min = mean(ride_length)))
#Average ride_length for users by day_of_week and Number of total rides by day_of_week
View(clean_trip %>%
group_by(week_day) %>%
summarise(Avg_length = mean(ride_length),
number_of_ride = n()))
#Average ride_length by month
View(clean_trip %>%
group_by(month) %>%
summarise(Avg_length = mean(ride_length),
number_of_ride = n()))
#Average ride length comparison by each week day according to each customer type
View(aggregate(clean_trip$ride_length ~ clean_trip$costumer_type +
clean_trip$week_day, FUN = mean))
#Average ride length comparison by each month according to each customer type
View(aggregate(clean_trip$ride_length ~ clean_trip$costumer_type +
clean_trip$month, FUN = mean))
#Analyze ride length data by customer type and weekday
View(clean_trip %>%
group_by(costumer_type, week_day) %>%
summarise(number_of_ride = n(),
average_duration = mean(ride_length),
median_duration = median(ride_length),
max_duration = max(ride_length),
min_duration = min(ride_length)))
#Analyze ride length data by customer type and month
View(clean_trip %>%
group_by(costumer_type, month) %>%
summarise(number_of_ride = n(),
average_duration = mean(ride_length),
median_duration = median(ride_length),
max_duration = max(ride_length),
min_duration = min(ride_length)))
#Save the data for data visualization
write.csv(clean_trip,file = "CyclisticBike/trip_tableau.csv",row.names = FALSE)
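The grouped comparisons above can be tried on a small table first to confirm that the grouping behaves as expected. The sketch below uses base R only, so it runs without extra packages. The six rides are made-up values, and the column name `costumer_type` follows the spelling used in the script above. It orders the weekdays and then computes the number of rides and the mean ride length for each customer type and weekday.

```r
# Stand-in rides (values are made up for illustration)
demo <- data.frame(
  costumer_type = c("casual", "casual", "member", "member", "member", "casual"),
  week_day      = c("Saturday", "Saturday", "Monday", "Monday", "Saturday",
                    "Monday"),
  ride_length   = c(40, 20, 10, 14, 12, 30)
)

# Order weekdays so tables and charts do not sort them alphabetically
demo$week_day <- factor(demo$week_day,
                        levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                   "Thursday", "Friday", "Saturday"),
                        ordered = TRUE)

# Number of rides and mean ride length for each customer type and weekday
by_group <- aggregate(ride_length ~ costumer_type + week_day, data = demo,
                      FUN = function(x) c(rides = length(x),
                                          mean_mins = mean(x)))

# aggregate() returns the two statistics as one matrix column;
# flatten it into ordinary columns
by_group <- do.call(data.frame, by_group)
by_group
```

Comparing a hand-calculated mean for one group with the corresponding row of `by_group` is a quick way to confirm the grouping before trusting the full-size tables.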
The analysis successfully identified key behavioral differences between casual and annual riders. By leveraging these insights, Cyclistic can implement marketing strategies to boost annual memberships, optimize station locations, and enhance customer engagement.