Case Study: How Does a Bike-Share Navigate Speedy Success?
Capstone Project for the Google Data Analytics Professional Certificate
The director of marketing of company Cyclist believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, we have to understand how casual riders and annual members use Cyclistic bikes differently.Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.So, we have to find insights about data that will answer to the question “How casual riders and annual members use cyclistic bikes differently?” and provide the recommendations.
1.Ask Phase:
In particular, We have to find:
- What is the total number of trips by rider types?
- What is the proportion of total trips by rider types?
- What is the average ride length for member and casual?
- Are there particular days of the week on which most rides take place by category?
- Is there a preference for the rideable type for members and casuals?
2.Prepare Phase:
The data was located on Cyclistics server, and was organised as separate files by month and year. I have used the data from January 2021 to december 2021.The data was saved as .csv files within .zip folders.During the analysis, I have stored files on my secured drive and worked with the copies of data for analysis so that if the files get corrupted during analysis we have original files stored at secure location.
Our data includes following fields:
- ride_id :— a unique ID per ride
- rideable_type:- the type of bicycle used
- started_at: the date and time at which bicycle was checked out
- ended_at: the date and time at which bicycle was checked in
- start_station_name: the name of the station at the start of the trip
- start_station_id: a unique identifier for the start station
- end_station_name: the name of the station at the end of the trip
- end_station_id : a unique identifier for the end station
- Start_lat: the latitude of the start station
- start_lng: the longitude of the start station
- end_lat: the latitude of the end station
- end_lng: the longitude of the end station
- member_casual: this field indicates whether the rider type is member or a casual
During the analysis I have added more fields to these existing fields for analysis purpose which I am going to tell when we proceed further.
After looking at the data carefully I did notice some kind of errors in collecting data as well as missing values in data.I did notice that some of the ended_at times were before the started_at times, and that some started_at and ended_at fields seemed very close.There is no duplicate data. I have also noticed that the records for which the starting station name and ending station name are missing that records contain error for having ended_at before started_at time.I have also notices the duration of trips in few seconds which I consider are wrongly recorded.Suppose if a casual rider decide to have a testing ride then it will at least take 30 seconds to a minute to complete ride,but there are many records for which the ride length is length than 30 seconds.For analysis I have used ride lengths grater than 0 seconds as I am confused that to include those records for analysis or not.
3.Process Phase:
I haven’t used many tools for the analysis:
Excel:
I have used excel for initial cleaning and some processing for individual .csv file.We can delete and create new column in excel very easily.I have calculated ride length, day of week,count of rides by rider type,average ride length by rider type for each month. But when I decided to choose the excel files for further analysis and visualization the files get corrupted due to large data files.Tableau also takes very much time to combine the data to proceed further for viz,so I decided to use R for my analysis and I have done complete project in R itself with help of some SQL functionality.
RStudio:
I used RStudio for the data manipulation, analysis and visualisation due to its ability to handle a large amount of data and its very logical approach to functions and syntax. I found the sqldf library to be a great resource too, as it allowed me to use SQL queries on the large dataframe.
Firstly, we need to install and read the packages we need for analysis: I have all packages installed except plotrix so I have read all the packages and installed plotrix package and then read it.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) #for data viz
library(dplyr) #for data manipulation
library(skimr) #for summarizing data
library(sqldf) #for using SQL Queries## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
We can read the data stored in our hard disk with help of command read.csv and store them in a varible of our choice.
jan_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202101-divvy-tripdata.csv")
feb_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202102-divvy-tripdata.csv")
mar_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202103-divvy-tripdata.csv")
apr_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202104-divvy-tripdata.csv")
may_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202105-divvy-tripdata.csv")
jun_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202106-divvy-tripdata.csv")
jul_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202107-divvy-tripdata.csv")
aug_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202108-divvy-tripdata.csv")
sep_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202109-divvy-tripdata.csv")
oct_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202110-divvy-tripdata.csv")
nov_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202111-divvy-tripdata.csv")
dec_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202112-divvy-tripdata.csv")We can view our data at any part of analysis with help of command View().We need to combine the data for our analysis.We can do this with help of following command.
df_1 <- do.call("rbind",list(jan_21,feb_21,mar_21,apr_21,may_21,jun_21,jul_21,aug_21,sep_21,oct_21,nov_21,dec_21))
head(df_1) #to check first 6 data units of every column ## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
We can check structure of the data using:
## 'data.frame': 5595063 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
## $ ended_at : chr "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
And summary of data using:
| Name | df_1 |
| Number of rows | 5595063 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5595063 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4677998 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4671372 | 0 |
| start_station_name | 0 | 1 | 0 | 53 | 690809 | 848 | 0 |
| start_station_id | 0 | 1 | 0 | 36 | 690806 | 835 | 0 |
| end_station_name | 0 | 1 | 0 | 53 | 739170 | 845 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 739170 | 833 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
| end_lat | 4771 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▁▇▁ |
| end_lng | 4771 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 | ▁▁▁▁▇ |
As we can see there are many null values present in start_station_name,end_station_name,start_station_id,end_station_id so we have to replace them with N/A so that we do not get any error in our further analysis.
df_1$start_station_name[df_1$start_station_name==""] <- "N/A"
df_1$start_station_id[df_1$start_station_id==""] <- "N/A"
df_1$end_station_name[df_1$end_station_name==""] <- "N/A"
df_1$end_station_id[df_1$end_station_id==""] <- "N/A"As our date is in a customized format we need to convert it into date-time format so that it can help us in analysis:
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "rideDate"
To calculate ride length we have to convert started_at and ended_at also in date-time format:
Creating month and day of week column as we need that for analysis.(I have craeted day and year column also but there is no use of it in our analysis)
df_1$month <- format(df_1$rideDate,"%B")
df_1$day <- format(df_1$rideDate,"%d")
df_1$year <- format(df_1$rideDate,"%Y")
df_1$day_of_week <- format(df_1$rideDate,"%A")Now, we have to create column ride length to find duration of ride
df_1$ride_length <- difftime(df_1$ended_at,df_1$started_at)
summary(df_1$ride_length) #To chech summary## Length Class Mode
## 5595063 difftime numeric
By default the ride length will be in character format and in seconds.We have to convert it into numeric format:
We can see changes we have done in View(). Now, we need to filter the data where duration of ride is less than 0 seconds.
4.Analyze Phase:
In analysis first we have to calculate number of members, casual riders and total.
count_member <- sqldf("SELECT COUNT(member_casual)
FROM biketripdata
WHERE member_casual='member' ")
count_member## COUNT(member_casual)
## 1 3065746
count_casual <- sqldf("SELECT COUNT(member_casual)
FROM biketripdata
WHERE member_casual='casual' ")
count_casual## COUNT(member_casual)
## 1 2528664
## COUNT(member_casual)
## 1 5594410
Mean ride length for the year is:21.9383 minutes
## Average ride length over the year: 21.93831 minutes
Maximum ride length for the year is:932.4025 hours(I think someone haven’t checked in bike for the long time)
## Maximum ride length over the year is: 932.4025 hours
Ride Lengths:
The trips of casual riders were longer on average than those of members often twice.This indicates that casual riders were using the bikes for longer distance travel than those of members.And members were using the bikes for daily purpose.
May was the month with highest average ride length and December had the shortest ride length.So it concludes that longer trips during the warmer months and shorter trips during the colder months.
Busiest weekday:
Over the year Saturday was busiest day for casual riders and Tuesday had the fewest rides for casuals.Where as Wednesday was busiest day for members and Sunday had fewest rides for members.
Rideable type preference:
The analysis of rideable types indicates that classic bikes were preferd by both members and casuals followed by electric bike.Docked bikes are the least preferred bikes as 312334 casual riders and only 1 member used docked bike for their ride.
5.Share Phase:
As the primary goal of our analysis is to convert casual riders to members so it is important to show percentage contribution casuals and members in total rides. Here I have drawn 3D pie chart with the help of data we have analyzed in previous steps:
## [1] 3065746 2528664
colors = c("green","red")
piepercent <- round(100*x / sum(x), 1)
#pie3D(x,labels = piepercent,explode=0.1,col=colors,radius= 1,main = "Yearly Percentage of #Members
# vs Casual Riders")
#legend("bottomright",c("member","casual"),cex=0.5,fill = colors)As we can see the contribution of members in total rides over the year is 54.8% and casual rides contributed 45.2% in total. This indicates that there is significant room for converting casual riders into members.
Now, In order to analyze the behaviour of rider types over the year by months we can plot bar graph sorted by months:
##sorting data by months
biketripdata$month <- ordered(biketripdata$month,levels=c("January", "February", "March", "April","May", "June", "July", "August", "September", "October", "November", "December"))
ggplot(data=biketripdata)+geom_bar(mapping=aes(x=month,fill=member_casual),position = "Dodge") +
labs(title = "Number of Monthly Trips by Rider type",fill="Rider Type") +
coord_flip() There is clear relation between number of trips and seasonality, there are fewer rides during the colder months and more rides in warmer months.
Now, we can see variation of ride lengths over a period of year by rider types sorted by months to justify our analysis:
# biketripdata %>% group_by(member_casual,month) %>%
# summarise(average_ride_length=mean(ride_length)/60) %>% #converted into minutes
# ggplot(aes(x=month,y=average_ride_length,fill=member_casual)) +
# geom_bar(position="Dodge",stat = "identity") +
# labs(title="Average Ride Length of subsribers by month",subtitle="sorted by membership",
# x= "Months",y= "Average Ride Length(in Min)",fill = "Rider Type") This justifies our analysis that casual riders take longer rides than those of members nearly twice the duration of ride.
Now, we can see how different rider types use bikes on each day of week differently over the year:
#sorting data by day of week
biketripdata$day_of_week <- ordered(biketripdata$day_of_week,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
ggplot(data=biketripdata)+geom_bar(mapping=aes(x=day_of_week,fill=member_casual)) +
labs(title = "Yearly Total Rides by Day of Week",x= "Day of Week",y= "Total Rides") +
labs(title = "Yearly Total Rides by Day of Week",x= "Day of Week",y= "Total Rides") +
annotate("text",x=1,y=250000,label="416181",color="black",size=2.5) +
annotate("text",x=2,y=250000,label="465474",color="black",size=2.5) +
annotate("text",x=3,y=250000,label="477117",color="black",size=2.5) +
annotate("text",x=4,y=250000,label="451490",color="black",size=2.5) +
annotate("text",x=5,y=250000,label="446384",color="black",size=2.5) +
annotate("text",x=6,y=250000,label="433014",color="black",size=2.5) +
annotate("text",x=7,y=250000,label="376086",color="black",size=2.5) +
annotate("text",x=1,y=625000,label="286340",color="black",size=2.5) +
annotate("text",x=2,y=625000,label="274357",color="black",size=2.5) +
annotate("text",x=3,y=625000,label="278910",color="black",size=2.5) +
annotate("text",x=4,y=625000,label="286038",color="black",size=2.5) +
annotate("text",x=5,y=625000,label="364037",color="black",size=2.5) +
annotate("text",x=6,y=625000,label="557934",color="black",size=2.5) +
annotate("text",x=7,y=625000,label="481048",color="black",size=2.5) Average ride length per week by rider type:
biketripdata %>% group_by(member_casual,day_of_week) %>%
summarise(average_ride_length=mean(ride_length)/60) %>% #converted into minutes
ggplot(aes(x=member_casual,y=average_ride_length,fill=day_of_week)) +
geom_bar(position="Dodge",stat = "identity") +
labs(title="Average Ride Length over the week",subtitle="sorted by membership",
x = "Rider Type",y= "Average Ride Length(in Min)",fill = "Day of Week")## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Yearly total rides by rideable type:
ggplot(data=biketripdata) + geom_bar(mapping=aes(x=rideable_type,fill=member_casual),position = "Dodge") +
labs(title = "Yearly Total Rides by Rideable type", x="Rideable Type", y = "Total Rides",fill="Rider Type")This justifies our analysis that classic bikes are more preferred after that electric bikes are preferred by both members and casual riders.
6.Act Phase:
Based on our analysis I have following recommendations:
The goal of analysis is correct, we have to increase the number of members as there are 54.8% members in total.As the profitability from members is more than that of casual riders.So we need to advertise campaigns to focus on benefit of using cyclist bikes for working people.
The average ride length for casual riders is more than twice compared to members which is affecting comapany’s profitability therefore we need to increase number of members in program.
Most of rides takes place during saturday and sunday for members as well as casual riders so we have to run advertise campaign on those days such that more casual riders will get to know about the scheme.
As we have seen in analysis hotter months posses more number of riders than colder months so we have to make advertisements in hotter months so that more people will get to know about the scheme.
The preferred bike types by rider are classic bike and electric bike so we need to focus only those bike types in advertisements.