Case Study: How Does a Bike-Share Navigate Speedy Success?

Capstone Project for the Google Data Analytics Professional Certificate

The director of marketing of company Cyclist believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, we have to understand how casual riders and annual members use Cyclistic bikes differently.Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.So, we have to find insights about data that will answer to the question “How casual riders and annual members use cyclistic bikes differently?” and provide the recommendations.

1.Ask Phase:

In particular, We have to find:

What is the total number of trips by rider types?
What is the proportion of total trips by rider types?
What is the average ride length for member and casual?
Are there particular days of the week on which most rides take place by category?
Is there a preference for the rideable type for members and casuals?

2.Prepare Phase:

The data was located on Cyclistics server, and was organised as separate files by month and year. I have used the data from January 2021 to december 2021.The data was saved as .csv files within .zip folders.During the analysis, I have stored files on my secured drive and worked with the copies of data for analysis so that if the files get corrupted during analysis we have original files stored at secure location.

Our data includes following fields:

ride_id :— a unique ID per ride
rideable_type:- the type of bicycle used
started_at: the date and time at which bicycle was checked out
ended_at: the date and time at which bicycle was checked in
start_station_name: the name of the station at the start of the trip
start_station_id: a unique identifier for the start station
end_station_name: the name of the station at the end of the trip
end_station_id : a unique identifier for the end station
Start_lat: the latitude of the start station
start_lng: the longitude of the start station
end_lat: the latitude of the end station
end_lng: the longitude of the end station
member_casual: this field indicates whether the rider type is member or a casual

During the analysis I have added more fields to these existing fields for analysis purpose which I am going to tell when we proceed further.

After looking at the data carefully I did notice some kind of errors in collecting data as well as missing values in data.I did notice that some of the ended_at times were before the started_at times, and that some started_at and ended_at fields seemed very close.There is no duplicate data. I have also noticed that the records for which the starting station name and ending station name are missing that records contain error for having ended_at before started_at time.I have also notices the duration of trips in few seconds which I consider are wrongly recorded.Suppose if a casual rider decide to have a testing ride then it will at least take 30 seconds to a minute to complete ride,but there are many records for which the ride length is length than 30 seconds.For analysis I have used ride lengths grater than 0 seconds as I am confused that to include those records for analysis or not.

3.Process Phase:

I haven’t used many tools for the analysis:

Excel:

I have used excel for initial cleaning and some processing for individual .csv file.We can delete and create new column in excel very easily.I have calculated ride length, day of week,count of rides by rider type,average ride length by rider type for each month. But when I decided to choose the excel files for further analysis and visualization the files get corrupted due to large data files.Tableau also takes very much time to combine the data to proceed further for viz,so I decided to use R for my analysis and I have done complete project in R itself with help of some SQL functionality.

RStudio:

I used RStudio for the data manipulation, analysis and visualisation due to its ability to handle a large amount of data and its very logical approach to functions and syntax. I found the sqldf library to be a great resource too, as it allowed me to use SQL queries on the large dataframe.

Firstly, we need to install and read the packages we need for analysis: I have all packages installed except plotrix so I have read all the packages and installed plotrix package and then read it.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)             #for data viz
library(dplyr)               #for data manipulation
library(skimr)               #for summarizing data
library(sqldf)               #for using SQL Queries

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

library(plotrix)            #for plotting 3d pie chart

We can read the data stored in our hard disk with help of command read.csv and store them in a varible of our choice.

jan_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202101-divvy-tripdata.csv")
feb_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202102-divvy-tripdata.csv")
mar_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202103-divvy-tripdata.csv")
apr_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202104-divvy-tripdata.csv")
may_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202105-divvy-tripdata.csv")
jun_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202106-divvy-tripdata.csv")
jul_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202107-divvy-tripdata.csv")
aug_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202108-divvy-tripdata.csv")
sep_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202109-divvy-tripdata.csv")
oct_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202110-divvy-tripdata.csv")
nov_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202111-divvy-tripdata.csv")
dec_21 <- read.csv("D:\\data Analyst\\data\\case_study_1_data\\202112-divvy-tripdata.csv")

We can view our data at any part of analysis with help of command View().We need to combine the data for our analysis.We can do this with help of following command.

df_1 <- do.call("rbind",list(jan_21,feb_21,mar_21,apr_21,may_21,jun_21,jul_21,aug_21,sep_21,oct_21,nov_21,dec_21))
head(df_1)        #to check first 6 data units of every column

##            ride_id rideable_type          started_at            ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
##           start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St            17660                                
## 2 California Ave & Cortez St            17660                                
## 3 California Ave & Cortez St            17660                                
## 4 California Ave & Cortez St            17660                                
## 5 California Ave & Cortez St            17660                                
## 6 California Ave & Cortez St            17660                                
##   start_lat start_lng end_lat end_lng member_casual
## 1  41.90034 -87.69674   41.89  -87.72        member
## 2  41.90033 -87.69671   41.90  -87.69        member
## 3  41.90031 -87.69664   41.90  -87.70        member
## 4  41.90040 -87.69666   41.92  -87.69        member
## 5  41.90033 -87.69670   41.90  -87.70        casual
## 6  41.90041 -87.69676   41.94  -87.71        casual

We can check structure of the data using:

str(df_1)

## 'data.frame':    5595063 obs. of  13 variables:
##  $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
##  $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
##  $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

And summary of data using:

skim(df_1)

Data summary
Name	df_1
Number of rows	5595063
Number of columns	13
_______________________
Column type frequency:
character	9
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ride_id	1	16	16	0	5595063
rideable_type	1	11	13	0	3
started_at	1	19	19	0	4677998
ended_at	1	19	19	0	4671372
start_station_name	1	0	53	690809	848
start_station_id	1	0	36	690806	835
end_station_name	1	0	53	739170	845
end_station_id	1	0	36	739170	833
member_casual	1	6	6	0	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07	▁▁▇▇▁
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52	▁▁▆▇▁
end_lat	4771	1	41.90	0.05	41.39	41.88	41.90	41.93	42.17	▁▁▁▇▁
end_lng	4771	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.49	▁▁▁▁▇

As we can see there are many null values present in start_station_name,end_station_name,start_station_id,end_station_id so we have to replace them with N/A so that we do not get any error in our further analysis.

df_1$start_station_name[df_1$start_station_name==""] <- "N/A"
df_1$start_station_id[df_1$start_station_id==""] <- "N/A"
df_1$end_station_name[df_1$end_station_name==""] <- "N/A"
df_1$end_station_id[df_1$end_station_id==""] <- "N/A"

As our date is in a customized format we need to convert it into date-time format so that it can help us in analysis:

df_1$rideDate <- as.Date(df_1$started_at)

colnames(df_1)    #To check the column is added or not

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"      "rideDate"

To calculate ride length we have to convert started_at and ended_at also in date-time format:

df_1$started_at <- as_datetime(df_1$started_at)
df_1$ended_at <- as_datetime(df_1$ended_at)

Creating month and day of week column as we need that for analysis.(I have craeted day and year column also but there is no use of it in our analysis)

df_1$month <- format(df_1$rideDate,"%B")
df_1$day <- format(df_1$rideDate,"%d")
df_1$year <- format(df_1$rideDate,"%Y")
df_1$day_of_week <- format(df_1$rideDate,"%A")

Now, we have to create column ride length to find duration of ride

df_1$ride_length <- difftime(df_1$ended_at,df_1$started_at)
summary(df_1$ride_length)      #To chech summary

##   Length    Class     Mode 
##  5595063 difftime  numeric

By default the ride length will be in character format and in seconds.We have to convert it into numeric format:

df_1$ride_length <- as.numeric(df_1$ride_length)

We can see changes we have done in View(). Now, we need to filter the data where duration of ride is less than 0 seconds.

biketripdata <- filter(df_1,ride_length>0)

4.Analyze Phase:

In analysis first we have to calculate number of members, casual riders and total.

count_member <- sqldf("SELECT COUNT(member_casual)
                      FROM biketripdata
                      WHERE member_casual='member' ")
count_member

##   COUNT(member_casual)
## 1              3065746

count_casual <- sqldf("SELECT COUNT(member_casual)
                      FROM biketripdata
                      WHERE member_casual='casual' ")
count_casual

##   COUNT(member_casual)
## 1              2528664

t <- count_member + count_casual
t

##   COUNT(member_casual)
## 1              5594410

Mean ride length for the year is:21.9383 minutes

cat("Average ride length over the year:",mean(biketripdata$ride_length)/60,"minutes")

## Average ride length over the year: 21.93831 minutes

Maximum ride length for the year is:932.4025 hours(I think someone haven’t checked in bike for the long time)

cat("Maximum ride length over the year is:",max(biketripdata$ride_length)/3600,"hours")

## Maximum ride length over the year is: 932.4025 hours

Ride Lengths:

The trips of casual riders were longer on average than those of members often twice.This indicates that casual riders were using the bikes for longer distance travel than those of members.And members were using the bikes for daily purpose.

May was the month with highest average ride length and December had the shortest ride length.So it concludes that longer trips during the warmer months and shorter trips during the colder months.

Busiest weekday:

Over the year Saturday was busiest day for casual riders and Tuesday had the fewest rides for casuals.Where as Wednesday was busiest day for members and Sunday had fewest rides for members.

Rideable type preference:

The analysis of rideable types indicates that classic bikes were preferd by both members and casuals followed by electric bike.Docked bikes are the least preferred bikes as 312334 casual riders and only 1 member used docked bike for their ride.

5.Share Phase:

As the primary goal of our analysis is to convert casual riders to members so it is important to show percentage contribution casuals and members in total rides. Here I have drawn 3D pie chart with the help of data we have analyzed in previous steps:

x <- c(3065746,2528664)
x          #checking

## [1] 3065746 2528664

colors = c("green","red")
piepercent <- round(100*x / sum(x), 1)

#pie3D(x,labels = piepercent,explode=0.1,col=colors,radius= 1,main = "Yearly Percentage of #Members 
#      vs Casual Riders")
#legend("bottomright",c("member","casual"),cex=0.5,fill = colors)

As we can see the contribution of members in total rides over the year is 54.8% and casual rides contributed 45.2% in total. This indicates that there is significant room for converting casual riders into members.

Now, In order to analyze the behaviour of rider types over the year by months we can plot bar graph sorted by months:

##sorting data by months
biketripdata$month <- ordered(biketripdata$month,levels=c("January", "February", "March", "April","May", "June", "July", "August", "September", "October", "November", "December"))

ggplot(data=biketripdata)+geom_bar(mapping=aes(x=month,fill=member_casual),position = "Dodge") +
  labs(title = "Number of Monthly Trips by Rider type",fill="Rider Type") +
  coord_flip()

There is clear relation between number of trips and seasonality, there are fewer rides during the colder months and more rides in warmer months.

Now, we can see variation of ride lengths over a period of year by rider types sorted by months to justify our analysis:

# biketripdata %>% group_by(member_casual,month) %>% 
#   summarise(average_ride_length=mean(ride_length)/60) %>%        #converted into minutes
#   ggplot(aes(x=month,y=average_ride_length,fill=member_casual)) +
#   geom_bar(position="Dodge",stat = "identity") +
#   labs(title="Average Ride Length of subsribers by month",subtitle="sorted by membership",
#        x= "Months",y= "Average Ride Length(in Min)",fill = "Rider Type")

This justifies our analysis that casual riders take longer rides than those of members nearly twice the duration of ride.

Now, we can see how different rider types use bikes on each day of week differently over the year:

#sorting data by day of week
biketripdata$day_of_week <- ordered(biketripdata$day_of_week,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

ggplot(data=biketripdata)+geom_bar(mapping=aes(x=day_of_week,fill=member_casual)) +
  labs(title = "Yearly Total Rides by Day of Week",x= "Day of Week",y= "Total Rides") +
  labs(title = "Yearly Total Rides by Day of Week",x= "Day of Week",y= "Total Rides") +
  annotate("text",x=1,y=250000,label="416181",color="black",size=2.5) +
  annotate("text",x=2,y=250000,label="465474",color="black",size=2.5) +
  annotate("text",x=3,y=250000,label="477117",color="black",size=2.5) +
  annotate("text",x=4,y=250000,label="451490",color="black",size=2.5) +
  annotate("text",x=5,y=250000,label="446384",color="black",size=2.5) +
  annotate("text",x=6,y=250000,label="433014",color="black",size=2.5) +
  annotate("text",x=7,y=250000,label="376086",color="black",size=2.5) +
  annotate("text",x=1,y=625000,label="286340",color="black",size=2.5) +
  annotate("text",x=2,y=625000,label="274357",color="black",size=2.5) +
  annotate("text",x=3,y=625000,label="278910",color="black",size=2.5) +
  annotate("text",x=4,y=625000,label="286038",color="black",size=2.5) +
  annotate("text",x=5,y=625000,label="364037",color="black",size=2.5) +
  annotate("text",x=6,y=625000,label="557934",color="black",size=2.5) +
  annotate("text",x=7,y=625000,label="481048",color="black",size=2.5)

Average ride length per week by rider type:

biketripdata %>% group_by(member_casual,day_of_week) %>% 
  summarise(average_ride_length=mean(ride_length)/60) %>%        #converted into minutes
  ggplot(aes(x=member_casual,y=average_ride_length,fill=day_of_week)) +
  geom_bar(position="Dodge",stat = "identity") +
  labs(title="Average Ride Length over the week",subtitle="sorted by membership",
       x = "Rider Type",y= "Average Ride Length(in Min)",fill = "Day of Week")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Yearly total rides by rideable type:

ggplot(data=biketripdata) + geom_bar(mapping=aes(x=rideable_type,fill=member_casual),position = "Dodge") +
  labs(title = "Yearly Total Rides by Rideable type", x="Rideable Type", y = "Total Rides",fill="Rider Type")

This justifies our analysis that classic bikes are more preferred after that electric bikes are preferred by both members and casual riders.

6.Act Phase:

Based on our analysis I have following recommendations:

The goal of analysis is correct, we have to increase the number of members as there are 54.8% members in total.As the profitability from members is more than that of casual riders.So we need to advertise campaigns to focus on benefit of using cyclist bikes for working people.
The average ride length for casual riders is more than twice compared to members which is affecting comapany’s profitability therefore we need to increase number of members in program.
Most of rides takes place during saturday and sunday for members as well as casual riders so we have to run advertise campaign on those days such that more casual riders will get to know about the scheme.
As we have seen in analysis hotter months posses more number of riders than colder months so we have to make advertisements in hotter months so that more people will get to know about the scheme.
The preferred bike types by rider are classic bike and electric bike so we need to focus only those bike types in advertisements.

Case Study Document : Cyclistic bike-share analysis

Avdhur Yadav

2/18/2022