Cyclistic bike share

Introduction

This capstone project is the final project in the Google Data Analytics Professional Certificate course. I was given the opportunity to use the skills and methods I had been taught by working on a case study project that would summarize my relevant knowledge, following the steps of the data analysis process: ask, prepare, process, analyze, share and act.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

Cyclistic is a bike-share company based in Chicago that launched in 2016. They offer 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. Cyclistic offers 3 different plans, single-ride passes, full-day passes, and annual memberships. Riders who have an annual subscription are called members while riders who are single-ride or full-day pass users are considered casual riders. The main fleet consist of classic bikes, electric bikes and docked bikes.

Business Task To analyze user behaviors on how annual members and casual riders use Cyclistic bikes differently and make recommendations on how to convert casual riders into annual members.

Prepare

The data has been made available by Motivate International Inc., the company which operates the City of Chicago’s Divvy bicycle sharing service. The license to use this public dataset can be found here. Datasets are available here. For this analysis I’m going to focus on 12 months period from January 2021 to December 2021.

#Remove data from the environment

rm(list=ls())

#Read library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)
library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(tidyr)

#Import data set, 12 months from January to December 2021

ds1 <- read.csv("202101-divvy-tripdata.csv")
ds2 <- read.csv("202102-divvy-tripdata.csv")
ds3 <- read.csv("202103-divvy-tripdata.csv")
ds4 <- read.csv("202104-divvy-tripdata.csv")
ds5 <- read.csv("202105-divvy-tripdata.csv")
ds6 <- read.csv("202106-divvy-tripdata.csv")
ds7 <- read.csv("202107-divvy-tripdata.csv")
ds8 <- read.csv("202108-divvy-tripdata.csv")
ds9 <- read.csv("202109-divvy-tripdata.csv")
ds10 <- read.csv("202110-divvy-tripdata.csv")
ds11 <- read.csv("202111-divvy-tripdata.csv")
ds12 <- read.csv("202112-divvy-tripdata.csv")

#Check the columns consistency so it can be combined

colnames(ds1)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds2)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds3)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds4)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds5)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds6)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds7)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds8)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds9)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds10)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds11)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(ds12)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

#Combine 12 files into one data set, get to know the data set

all_trips <- rbind(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)
  head(all_trips)

##            ride_id rideable_type          started_at            ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
##           start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St            17660                                
## 2 California Ave & Cortez St            17660                                
## 3 California Ave & Cortez St            17660                                
## 4 California Ave & Cortez St            17660                                
## 5 California Ave & Cortez St            17660                                
## 6 California Ave & Cortez St            17660                                
##   start_lat start_lng end_lat end_lng member_casual
## 1  41.90034 -87.69674   41.89  -87.72        member
## 2  41.90033 -87.69671   41.90  -87.69        member
## 3  41.90031 -87.69664   41.90  -87.70        member
## 4  41.90040 -87.69666   41.92  -87.69        member
## 5  41.90033 -87.69670   41.90  -87.70        casual
## 6  41.90041 -87.69676   41.94  -87.71        casual

#Remove monthly data to clear environment

remove(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)

#Convert started_at and ended_at char to datetime

all_trips[['started_at']] <- ymd_hms(all_trips[['started_at']])
  all_trips[['ended_at']] <- ymd_hms(all_trips[['ended_at']])
  str(all_trips)

## 'data.frame':    5595063 obs. of  13 variables:
##  $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct, format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct, format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

#New columns were add for days of the week, month, time and duration in minutes

all_trips$week <- format(as.Date(all_trips$started_at),'%a')
  all_trips$month <- format(as.Date(all_trips$started_at),'%b_%y')
  all_trips$time <- format(all_trips$started_at, format = "%H:%M")
  all_trips$time <- as.POSIXct(all_trips$time, format = "%H:%M")
  all_trips$trip_duration <- (as.double(difftime(all_trips$ended_at, all_trips$started_at)))/60
  glimpse(all_trips)

## Rows: 5,595,063
## Columns: 17
## $ ride_id            <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
## $ ended_at           <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
## $ start_station_id   <chr> "17660", "17660", "17660", "17660", "17660", "17660…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "Wood St & Augu…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "657", "13258",…
## $ start_lat          <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
## $ start_lng          <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696…
## $ end_lat            <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4…
## $ end_lng            <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "member", "member", "member", "member", "casual", "…
## $ week               <chr> "Sat", "Wed", "Thu", "Thu", "Sat", "Sat", "Mon", "T…
## $ month              <chr> "Jan_21", "Jan_21", "Jan_21", "Jan_21", "Jan_21", "…
## $ time               <dttm> 2022-09-20 16:14:00, 2022-09-20 18:43:00, 2022-09-…
## $ trip_duration      <dbl> 10.4166667, 4.0666667, 1.3333333, 11.7000000, 0.716…

Process

R programming language will be used for this analysis. It can perform a complete cleaning, analysis, visualizations and documentation on the same platform. I found null data on start and end station name columns, trips duration less than 0, NA’s as well. which consist in 9% of the total data set. For missing data since the data set is huge, I decided ignored it. The total of 4 columns and 1.006.877 rows were removed from the new data frame.

#Remove NA

all_trips_clean <- drop_na(all_trips)
  dim(all_trips_clean)

## [1] 5590292      17

#Remove usefulness columns

all_trips_clean <- all_trips_clean %>%
  select(-c(start_lat, start_lng, end_lat, end_lng))
  dim(all_trips_clean)

## [1] 5590292      13

#Check for trips duration less then 0

nrow(subset(all_trips_clean,trip_duration < 0))

## [1] 146

#Remove trips duration less then 0

all_trips_clean <- all_trips_clean[!(all_trips_clean$trip_duration < 0),]
  dim(all_trips_clean)

## [1] 5590146      13

#Remove start_station_name and end_station_name blank results

all_trips_clean <- all_trips_clean %>%
  filter(!(is.na(start_station_name) | start_station_name == "")) %>% 
  filter(!(is.na(end_station_name) | end_station_name == ""))
  dim(all_trips_clean)

## [1] 4588186      13

Analize

In order to facilitate our analysis and arrive at more insightful conclusions, all the required information are now in a complete data frame ready to identify trends, relationships and farther explorations.

#Total rides by members type

all_trips_clean %>% 
  group_by(member_casual) %>% 
  summarise(ride_count = length(ride_id))

## # A tibble: 2 × 2
##   member_casual ride_count
##   <chr>              <int>
## 1 casual           2048335
## 2 member           2539851

There are more member riders compared to casual rides based on the ride total

#Rides duration by members type

setNames(aggregate(trip_duration ~ member_casual, all_trips_clean, sum), 
  c("customer_type", "trip_duration_mins"))

##   customer_type trip_duration_mins
## 1        casual           66590608
## 2        member           33484772

Casual members have double the average length of a ride than a members

#Total rides by bike type

table(all_trips_clean$rideable_type)

## 
##  classic_bike   docked_bike electric_bike 
##       3241904        312044       1034238

#Statistical summary of rides duration by customer type

all_trips_clean %>% 
  group_by(member_casual) %>%
  summarise(mean_trip_duration = mean(trip_duration), median_trip_duration = median(trip_duration), 
            max_trip_duration = max(trip_duration), min_trip_duration = min(trip_duration))

## # A tibble: 2 × 5
##   member_casual mean_trip_duration median_trip_duration max_trip_durat…¹ min_t…²
##   <chr>                      <dbl>                <dbl>            <dbl>   <dbl>
## 1 casual                      32.5                16.6            55944.       0
## 2 member                      13.2                 9.72            1496.       0
## # … with abbreviated variable names ¹max_trip_duration, ²min_trip_duration

#Corrected order for days of the week and months

all_trips_clean$week <- ordered(all_trips_clean$week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
  all_trips_clean$month <- ordered(all_trips_clean$month, levels=c("Jan_21", "Feb_21", "Mar_21", "Apr_21", "May_21", "Jun_21", "Jul_21", "Aug_21", "Sep_21", "Oct_21", "Nov_21", "Dec_21"))

From here we can gather a better understanding of the customer’s riding habits. The analysis shows a significant portion of the rides taken in 2021 were by members, they had higher number of rides all through the year except in June, July and August on summer season, where casual riders took more rides. Casual riders and members alike both appear to have preference towards classic bikes over electric and docked bike. Casual riders ride for a longer time during the week with the highest ride on the weekends, while members ride at a consistent pace during the week with the highest rides on the weekend. Casual riders take rides over twice the duration of those that members do monthly. These trends may be indicative of members using the bikes for work commute, and casual riders using the bikes for leisurely purposes.

all_trips_clean %>% 
  group_by(member_casual) %>% 
  summarise(ride_count = length(ride_id)) %>% 
  arrange(member_casual)  %>% 
  ggplot(aes(x = member_casual, y = ride_count, fill = member_casual)) +
  labs(title = "Total rides by Members Vs. Casual") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

A comparison of total rides between the two rider groups shows a significant portion of the rides taken in 2021 were by members

ggplot(data = all_trips_clean)+
  geom_bar(mapping=aes(x=member_casual, fill=member_casual))+
  labs(title = "Total rides by Customer type Vs. Bike type")+
  facet_wrap(~rideable_type)

Casual riders and members alike both appear to have preference towards classic bikes over electric and docked bike

#Total rides and average ride time by each day for members and casual riders

all_trips_clean %>% 
  group_by(member_casual, week) %>%  
  summarise(number_of_rides = n(),average_duration_mins = mean(trip_duration)) %>% 
  arrange(member_casual, desc(number_of_rides))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual week  number_of_rides average_duration_mins
##    <chr>         <ord>           <int>                 <dbl>
##  1 casual        Sat            468325                  34.9
##  2 casual        Sun            403762                  37.6
##  3 casual        Fri            290040                  30.9
##  4 casual        Mon            228934                  32.6
##  5 casual        Thu            224207                  28.0
##  6 casual        Wed            218132                  28.3
##  7 casual        Tue            214935                  28.8
##  8 member        Wed            397688                  12.5
##  9 member        Tue            388131                  12.4
## 10 member        Thu            373471                  12.4
## 11 member        Fri            365782                  12.8
## 12 member        Sat            357078                  14.8
## 13 member        Mon            346481                  12.7
## 14 member        Sun            311220                  15.2

all_trips_clean %>%  
  group_by(member_casual, week) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual, week)  %>% 
  ggplot(aes(x = week, y = number_of_rides, fill = member_casual)) +
  labs(title = "Total rides by Customer type Vs. Days of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Casual riders have the highest number of rides on Saturday and Sunday, compared to the other days while members are quite consistent but they have the lowest number of rides on the weekend

#Total rides and average ride time by each month for members and casual riders

unique(all_trips$month)

##  [1] "Jan_21" "Feb_21" "Mar_21" "Apr_21" "May_21" "Jun_21" "Jul_21" "Aug_21"
##  [9] "Sep_21" "Oct_21" "Nov_21" "Dec_21"

all_trips_clean %>% 
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n(),`average_duration_(mins)` = mean(trip_duration)) %>% 
  arrange(member_casual,desc(number_of_rides))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 24 × 4
## # Groups:   member_casual [2]
##    member_casual month  number_of_rides `average_duration_(mins)`
##    <chr>         <ord>            <int>                     <dbl>
##  1 casual        Jul_21          369407                      33.3
##  2 casual        Aug_21          341469                      28.6
##  3 casual        Jun_21          304189                      38.5
##  4 casual        Sep_21          292926                      28.1
##  5 casual        May_21          216829                      39.6
##  6 casual        Oct_21          189117                      26.3
##  7 casual        Apr_21          120420                      38.4
##  8 casual        Mar_21           75641                      38.5
##  9 casual        Nov_21           69958                      22.5
## 10 casual        Dec_21           45076                      24.8
## # … with 14 more rows

all_trips_clean %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Customer type Vs. Month") +
  theme(axis.text.x = element_text(angle = 30)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Members had higher number of rides all through the year except in June, July and August summer season, where casual riders took more rides.

all_trips_clean %>%  
  group_by(member_casual, week) %>% 
  summarise(trip_duration = mean(trip_duration), .groups="drop") %>%
  ggplot(aes(x = week, y = trip_duration, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Rides duration by Customer type Vs. Days of the week")

Casual riders ride for a longer time during the week with the highest ride on the weekends while members ride at a consistent pace during the week with the highest rides on the weekend

all_trips_clean %>%  
  group_by(member_casual, month) %>% 
  summarise(trip_duration = mean(trip_duration), .groups="drop") %>%
  ggplot(aes(x = month, y = trip_duration, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Rides duration by Customer type Vs. Months")

Casual riders take rides over twice the duration of those that members do monthly. These trends may be indicative of members using the bikes for work commute, and casual riders using the bikes for leisurely purposes

Act

To convert casual riders into annual members, the following marketing recommendations for Cyclistic bike share can be implemented:

Frequent rider program that get points for each ride then redeem into discount on sign up for annual membership, get points for feedback on the purpose of their ride also would offer further insights and could encourage customers to get membership. Offer targeted memberships deal to casual riders, sending discount coupon, notification, email reminders showing the price benefits of annual memberships which might influence them to become members. Considerer introducing a monthly pass, the more costumer use the monthly pass, more they feel comfortable getting an annual subscription.

Conclusion

I was presented with the task of analyzing historical data to determine how casual riders and annual members use Cyclistic bikes service differently. Due to data-privacy some limitations were found, rides identification weren’t allowed to be connect to personal information, trips weren’t grouped by users so data couldn’t provide qualitative information. I discovered some differences between casual riders and annual members but was unable to explain why they are different, more data collection are required to conduct a profound analysis and find the root cause of this question.

Resources

RDocumentation, RStudio, Github and Kaggle community.

Cyclistic bike share

Google Data Analytics - Case Study

T.Gabriela

09.2022

Introduction

Scenario

Ask

Prepare

Process

Analize

Act

Conclusion

Resources

Cyclistic bike share

Google Data Analytics - Case Study

T.Gabriela

09.2022

Introduction

Scenario

Ask

Prepare

Process

Analize

Share

Act

Conclusion

Resources