About the company

In 2016, Cyclistic launched a successful bike-share offering having a fleet of 5,824 bicycles that are tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Riders who have an annual subscription are called members while riders who are single-ride or full-day pass users are considered casual riders.

The director of marketing is looking to maximize the number of annual memberships as they are more profitable than single-ride or full-day passes. This strategy is believed to be the key to future growth.

The following data analysis steps will be followed:

Ask
Prepare
Process
Analyze
Share
Act

Ask

The questions that need to be answered are:

1)How do annual members and casual riders use Cyclistic bikes differently?
2)Why would casual riders buy Cyclistic annual memberships?
3)How can Cyclistic use digital media to influence casual riders to become members?

Prepare The dataset follows the ROCCC Analysis as described below:

Reliable - yes, not biased
Original - yes, can locate the original public data
Comprehensive - yes, not missing important information
Current - yes, updated monthly
Cited - yes

I will be using the public dataset located here. The data has been made available by Motivate International Inc. under this license.

Key Tasks Followed:

Downloaded data and copies have been stored on the computer.
I have downloaded the data for Jan 21-Feb 22 Period.
The data is in CSV (comma-separated values) format, and there are a total of 13 columns.

Installing and loading necessary packages

#install.packages("ggplot2")
#install.packages("dplyr")
#install.packages("tidyverse")

library(ggplot2)
library(dplyr)
library(tidyverse)

Setting File Directory

setwd("D:/Cyclist/data")

Importing data to Rstudio

Jan <- read.csv("202101-divvy-tripdata.csv")
Feb <- read.csv("202102-divvy-tripdata.csv")
Mar <- read.csv("202103-divvy-tripdata.csv")
Apr <- read.csv("202104-divvy-tripdata.csv")
May <- read.csv("202105-divvy-tripdata.csv")
Jun <- read.csv("202106-divvy-tripdata.csv")
Jul <- read.csv("202107-divvy-tripdata.csv")
Aug <- read.csv("202108-divvy-tripdata.csv")
Sep <- read.csv("202109-divvy-tripdata.csv")
Oct <- read.csv("202110-divvy-tripdata.csv")
Nov <- read.csv("202111-divvy-tripdata.csv")
Dec <- read.csv("202112-divvy-tripdata.csv")
Jan22 <- read.csv("202201-divvy-tripdata.csv")
Feb22 <- read.csv("202202-divvy-tripdata.csv")

Merging data into a data frame

triprawdata <- rbind(Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Jan22,Feb22)

Process

Cleaning and Preparing Data

colnames(triprawdata) #column names of data
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
nrow(triprawdata) #no. of rows
## [1] 5814442
ncol(triprawdata) #no. of columns
## [1] 13
sum(is.na(triprawdata)) # sum of NA's in data
## [1] 9868
# total NA values 9868 0.16% of total data
head(triprawdata)   #see first 6 rows of data frame
##            ride_id rideable_type      started_at        ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021/1/23 16:14 2021/1/23 16:24
## 2 DC88F20C2C55F27F electric_bike 2021/1/27 18:43 2021/1/27 18:47
## 3 EC45C94683FE3F27 electric_bike 2021/1/21 22:35 2021/1/21 22:37
## 4 4FA453A75AE377DB electric_bike  2021/1/7 13:31  2021/1/7 13:42
## 5 BE5E8EB4E7263A0B electric_bike  2021/1/23 2:24  2021/1/23 2:24
## 6 5D8969F88C773979 electric_bike  2021/1/9 14:24  2021/1/9 15:17
##           start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St            17660                                
## 2 California Ave & Cortez St            17660                                
## 3 California Ave & Cortez St            17660                                
## 4 California Ave & Cortez St            17660                                
## 5 California Ave & Cortez St            17660                                
## 6 California Ave & Cortez St            17660                                
##   start_lat start_lng end_lat end_lng member_casual
## 1  41.90034 -87.69674   41.89  -87.72        member
## 2  41.90033 -87.69671   41.90  -87.69        member
## 3  41.90031 -87.69664   41.90  -87.70        member
## 4  41.90040 -87.69666   41.92  -87.69        member
## 5  41.90033 -87.69670   41.90  -87.70        casual
## 6  41.90041 -87.69676   41.94  -87.71        casual
str(triprawdata)   #See list of columns and data type
## 'data.frame':    5814442 obs. of  13 variables:
##  $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021/1/23 16:14" "2021/1/27 18:43" "2021/1/21 22:35" "2021/1/7 13:31" ...
##  $ ended_at          : chr  "2021/1/23 16:24" "2021/1/27 18:47" "2021/1/21 22:37" "2021/1/7 13:42" ...
##  $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
summary(triprawdata)  #Statistical summary of data
##    ride_id          rideable_type       started_at          ended_at        
##  Length:5814442     Length:5814442     Length:5814442     Length:5814442    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5814442     Length:5814442     Length:5814442     Length:5814442    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.64   Min.   :-87.84   Min.   :41.39   Min.   :-88.97  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.17   Max.   :-87.49  
##                                   NA's   :4934    NA's   :4934    
##  member_casual     
##  Length:5814442    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# 4934 NA's each in end_lat and end_lng

Removing Rows with NA

triprawdata<- na.omit(triprawdata)

Making consistent Date formats in started_at

# Date format is inconsistent some dates are in YYYY-MM-DD format or MM/DD/YYYY
start<-as.POSIXlt(triprawdata$started_at, tz = "","%m/%d/%Y %H:%M")
start<- data.frame(start)
start<- (start[1:nrow(na.omit(start)),])
start<- data.frame(start)



start1<-as.POSIXlt(triprawdata$started_at, tz = "","%Y-%m-%d %H:%M")
start1<- data.frame(start1)
start1<- (start1[(nrow(start)+1):nrow(start1),])
start1<-data.frame(start1)


names(start)<- "start"
names(start1)<- "start"
start_time<- (rbind(start,start1))
triprawdata$started_at <- NULL
triprawdata['started_at']<- start_time

Making consistent Date formats in ended_at

end<-(as.POSIXlt(triprawdata$ended_at, tz = "","%m/%d/%Y %H:%M"))
end<- data.frame(end)
end<- (end[1:nrow(na.omit(end)),])
end<-data.frame(end)


end1<-(as.POSIXlt(triprawdata$ended_at, tz = "","%Y-%m-%d %H:%M"))
end1<- data.frame(end1)
end1<- (end1[(nrow(end)+1):nrow(end1),])
end1<-data.frame(end1)


names(end)<- "end"
names(end1)<- "end"
end_time<- (rbind(end,end1))
triprawdata$ended_at<- NULL
triprawdata['ended_at']<- end_time

Adding a column for Ride length

ride_length<- difftime(end_time[,], start_time[,], units = "mins")
ride_length<-data.frame(ride_length)
triprawdata <- cbind(triprawdata,ride_length)

Checking consistency of Ride length & Ride ID

# Some Ride lengths are negative,start time and end time are found interchanged
clean_data<- triprawdata%>% filter(ride_length>0)

checking duplicate ride id

sum(duplicated(clean_data$ride_id))
## [1] 0

Adding columns for the day of the week into the data frame

weekday<- weekdays(clean_data$started_at)
weekday<-  factor(weekday, levels =c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday")) 
weekday<- data.frame(weekday)
clean_data<- cbind(clean_data,weekday)

#write.csv(clean_data,"D:/Cyclist/clean_data.csv", row.names = FALSE)

Rounding of coordinates to 4 digits

for (i in 7:10) {
  clean_data[,i]<- round(clean_data[,i],4)
  clean_data[,i]<- as.character(clean_data[,i])
}

Replacing missing station names in start station

start_no_data<- clean_data%>% filter (start_station_name =="")%>% group_by(start_lat,start_lng,start_station_name,member_casual)%>% summarise(n=n())

# writing file to excel for further analysis
#write.csv(start_no_data,"start_no_data.csv",row.names = FALSE)
# filtering top data cordinates and assigning start station name 
#filtering out in excel top 50 cordinates and finding their station name from our database clean_data

start_no_data<-read.csv("start_no_data.csv")

start_data<- clean_data%>% filter (start_station_name !="") %>%group_by(start_lat,start_lng,start_station_name,member_casual)%>% summarise(n=n())

Changing Station names (Some station names has been changed over the time)

# changing the names of 6 stations to their new names in excel 

#DuSable Lake Shore Dr & Monroe St
#Lake Shore Dr & Monroe St

#Lake Shore Dr & Diversey Pkwy
#DuSable Lake Shore Dr & Diversey Pkwy

#DuSable Lake Shore Dr & North Blvd
#Lake Shore Dr & North Blvd

#Dusable Lake Shore Dr & Ohio St    
#Lake Shore Dr & Ohio St

#Lake Shore Dr & Wellington Ave
#Dusable Lake Shore Dr & Wellington Ave


#Lake Shore Dr & Belmont Ave
#Dusable Lake Shore Dr & Belmont Ave


for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & Diversey Pkwy"){
    start_data[i,3]="DuSable Lake Shore Dr & Diversey Pkwy"
  }
}

for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & Monroe St"){
    start_data[i,3]="DuSable Lake Shore Dr & Monroe St"
  }
}

for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & Ohio St"){
    start_data[i,3]="Dusable Lake Shore Dr & Ohio St"
  }
}

for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & North Blvd"){
    start_data[i,3]="DuSable Lake Shore Dr & North Blvd"
  }
}

for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & Wellington Ave"){
    start_data[i,3]="Dusable Lake Shore Dr & Wellington Ave"
  }
}


for ( i in 1:nrow(start_data)){
  if (start_data[i,3]== "Lake Shore Dr & Belmont Ave"){
    start_data[i,3]="DuSable Lake Shore Dr & Belmont Ave"
  }
}

start_data <- data.frame(start_data)
start_data<- rbind(start_data,start_no_data)


  ###########start database created########

Replacing missing station names in end station

end_no_data <- clean_data%>% filter(end_station_name=="")%>% group_by(end_lat,end_lng,end_station_name,member_casual)%>% summarise(n=n())
end_no_data<- na.omit(end_no_data)  

#write.csv(end_no_data,"end_no_data.csv",row.names = FALSE)
# filtering top data cordinates and assigning start station name 
#filtering out in excel top 25 cordinates and finding their station name from our database
end_no_data<- read.csv("end_no_data.csv")

Changing Station names (Some station names has been changed over the time)

end_data_unclean <- clean_data%>% filter(end_station_name!="")%>% group_by(end_lat,end_lng,end_station_name,member_casual)%>% summarise(n=n())


#write.csv(end_data_unclean,"end_data_unclean.csv",row.names = FALSE)

# changing the names of 6 stations to their new names in excel 

end_data_unclean<- read.csv("end_data_unclean.csv")

end_data<- rbind(end_data_unclean,end_no_data)

###########end database created########

Analyze

Key tasks

  1. Aggregate your data so it’s useful and accessible.
  2. Organize and format your data.
  3. Perform calculations.
  4. Identify trends and relationships.

Compare members and casual users

clean_data%>% group_by(member_casual)%>% summarise(n=n())%>%
  mutate(percent = n*100/sum(n))
## # A tibble: 2 x 3
##   member_casual       n percent
##   <chr>           <int>   <dbl>
## 1 casual        1922226    44.6
## 2 member        2389007    55.4
ggplot(data = clean_data,mapping= aes(x= member_casual)) +geom_bar() + labs(title="Member Vs Casual")

Ride Analysis

cat('average duration of rides is : ', mean(clean_data$ride_length), "mins")
## average duration of rides is :  19.94426 mins
cat('maximum duration of rides is : ', round(max(clean_data$ride_length)/60/24), "days")
## maximum duration of rides is :  37 days

Ride behaviour of members and casual riders

bar<-clean_data%>% group_by(member_casual)%>% summarise(avg_ride_length=mean(ride_length))
bar
## # A tibble: 2 x 2
##   member_casual avg_ride_length
##   <chr>         <drtn>         
## 1 casual        28.49893 mins  
## 2 member        13.06106 mins
ggplot(bar, aes(x = member_casual, y = avg_ride_length, fill =member_casual))+geom_col()+ scale_fill_manual(values = c("Purple", "pink"))

Comparing bikes

clean_data%>% group_by(rideable_type) %>% summarise(n=n())%>% mutate(percent = n*100/sum(n))
## # A tibble: 3 x 3
##   rideable_type       n percent
##   <chr>           <int>   <dbl>
## 1 classic_bike  2435347   56.5 
## 2 docked_bike    216447    5.02
## 3 electric_bike 1659439   38.5
ggplot(data = clean_data,mapping= aes(x= rideable_type,fill=rideable_type)) +geom_bar() + labs(title="Bike Types used")

  1. Classic Bike is more preferred than Electric Bike.
  2. Docked Bike is the least preferred Bike.

Bike Rides percent with riders analysis

rideable_type<-clean_data%>% group_by(rideable_type,member_casual) %>% summarise(n=n())%>% 
  mutate(percent = n*100/sum(n))
rideable_type
## # A tibble: 5 x 4
## # Groups:   rideable_type [3]
##   rideable_type member_casual       n percent
##   <chr>         <chr>           <int>   <dbl>
## 1 classic_bike  casual         952630    39.1
## 2 classic_bike  member        1482717    60.9
## 3 docked_bike   casual         216447   100  
## 4 electric_bike casual         753149    45.4
## 5 electric_bike member         906290    54.6
ggplot(data = as.data.frame(rideable_type),mapping= aes(x= rideable_type, y = n, fill =member_casual)) +geom_bar(stat = 'identity') + labs(title="Bike Type with rider")

Choice of Bikes by Riders

member_type<-clean_data%>% group_by(member_casual,rideable_type) %>% summarise(n=n())%>%
  mutate(percent = n*100/sum(n))

member_type
## # A tibble: 5 x 4
## # Groups:   member_casual [2]
##   member_casual rideable_type       n percent
##   <chr>         <chr>           <int>   <dbl>
## 1 casual        classic_bike   952630    49.6
## 2 casual        docked_bike    216447    11.3
## 3 casual        electric_bike  753149    39.2
## 4 member        classic_bike  1482717    62.1
## 5 member        electric_bike  906290    37.9
ggplot(data = as.data.frame(member_type),mapping= aes(x= member_casual, y=n, fill =rideable_type)) +geom_bar(stat = 'identity') + labs(title="Choice of Bike by Riders")

  1. Classic Bike is more preferred by members.
  2. Whereas Electic bike is almost equally preferred by both riders.
  3. Members don’t use Docked Bike.

Weekday Rides (usage on different days)

clean_data$weekday<- factor(clean_data$weekday, levels= c("Monday", "Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
weektable<- clean_data%>% group_by(weekday)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n))

weektable
## # A tibble: 7 x 3
##   weekday        n percent
##   <fct>      <int>   <dbl>
## 1 Monday    557725    12.9
## 2 Tuesday   553005    12.8
## 3 Wednesday 577020    13.4
## 4 Thursday  585778    13.6
## 5 Friday    621732    14.4
## 6 Saturday  758507    17.6
## 7 Sunday    657466    15.3
ggplot(data = clean_data,mapping= aes(x= weekday)) +geom_bar() + labs(title="Usage per day")

Daily rides of members and casual riders

ggplot(data = clean_data,mapping= aes(x= weekday, fill = member_casual)) +geom_bar() +facet_wrap(~member_casual)+theme(axis.text.x = element_text(angle = 60, hjust =1))

  1. Member’s usage is quite similar throughout the week except for Sunday. We can infer that members are mostly working people.
  2. Casual Riders preferred to ride on weekends, especially Saturdays.

Visualization in Tableau

The following calculated fields are created for analysis and visualization:

Month - MONTH([Started At]) , Time of day (Hour) - DATEPART(‘hour’,[Started At])

Number of Rides and Average Ride Length

  1. Clearly, the trip duration of casual riders is greater than members.
  2. The average ride duration of casual riders during weekends also increases with the number of rides.
  3. Bike usage starts increases significantly in 2nd quarter of the year and peaks in 3rd quarter.
  4. Average ride duration of casual riders starts decreasing in 3rd quarter, maybe because of higher surge pricing due to peak demand.

Riders Trend By Month

knitr::include_graphics('/Cyclist/data/Riders Trend by Month.png')

Riders Trend By Hour

knitr::include_graphics('/Cyclist/data/Riders Trend by Hour.png')

Bikes Usage Pattern By Casual Riders

casual_riders<-clean_data%>% filter(member_casual == 'casual')%>%group_by(weekday,rideable_type)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n)) 


ggplot(data= casual_riders, mapping= aes(x= weekday, y=n, fill =rideable_type)) + geom_bar(stat = 'identity') 

Bikes Usage Pattern By Members

members<-clean_data%>% filter(member_casual == 'member')%>%group_by(weekday,rideable_type)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n)) 

ggplot(data= members, mapping= aes(x= weekday, y=n, fill =rideable_type)) + geom_bar(stat = 'identity')

Starting stations used by casual riders and members

knitr::include_graphics('/Cyclist/data/Top 10 Starting Stations.png')

ss<- start_data %>%filter(start_station_name!= "")%>% group_by(start_station_name,member_casual) %>% summarise(n=sum(n)) %>% arrange(desc(n))
p<- ggplot(data =as.data.frame(ss), aes(x = start_station_name, y=n,fill=member_casual))+
  geom_bar(stat = 'identity')+facet_wrap(~ member_casual, ncol =1)+theme(axis.text.x = element_blank(),legend.position="none")

dat_text <- data.frame(
  label = c("Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St", "Millennium Park","Clark St & Elm St","Kingsbury St & Kinzie St","Wells St & Concord Ln" ),
  member_casual   = c('casual','casual','casual','member','member','member'),
  x     = c("Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St", "Millennium Park","Clark St & Elm St","Kingsbury St & Kinzie St","Wells St & Concord Ln"),
  y     = c(60000,37000,35000,30000,30000,30000),
  hjust =c(0,0,1,0.5,0.5,0.5),
  angle=c(0,0,0,0,0,0))

p + geom_text(data = dat_text,  mapping = aes(x = x, y = y, label = label, hjust=hjust, angle=angle), size =2.5)

Ending stations used by casual riders and members

knitr::include_graphics('/Cyclist/data/Top 10 Ending Stations.png')

Most Used Starting and Ending Stations

knitr::include_graphics('/Cyclist/data/Most Used Starting Stations.png')

knitr::include_graphics('/Cyclist/data/MOst Used Ending Stations.png')

  1. Casual users tend to start and end trips from the same station.
  2. The busiest stations used by casual riders are situated at Lakeshore drive. Lake Shore Drive is a multilevel expressway that runs alongside the shoreline of Lake Michigan.

Recommendations

To convert casual riders into annual members, the following marketing strategies can be implemented:

  1. Offer occasional membership discounts to casual riders on summer and weekends.

  2. Increase the renting price of the bikes for the weekend to target casual users into having a membership, especially for classic and electric bikes, since they are preferred more by casual users.

  3. Put banners or special discount advertisements at Lake Shore Drive that would target casual users which might influence them to become members.

Thanks for Reading, Please provide your feedback.

Check Detailed Analysis for [Code] and Visualization.