Ask

What is the problem we are trying to solve?

How do annual members and casual members use Cyclistic bikes differently?

How can our insights drive business decisions?

Identifying the main differences between members and casual riders can help us establish who Cyclistic customers are and provide key information that will help develope the business strategy and marketing campaign.

Objective:

Design a marketing campaign to convert casual riders into annual members.

Stakeholders:

  • Cyclistic executive team: In charge of the final decision, they will decide if the recommended marketing strategy is approved.
  • Lily Moreno: Director of marketing, responsible for the development of the marketing campaign and the bike-share initiatives. She is responsible for the success or failure of the marketing campaign based on your data analysis.

Prepare

Data source:

For this analysis, we have access to a primary data source downloaded from Cyclystic trip data. It is licensed under Motivate International Inc. This dataset consists of quantitative measurements collected from bike trackers, they do not gather personal data from users.

Data credibility (ROCC):

  • Reliable: Although the dataset has many incomplete entries, this accounts for less than 0.1% of all data entries, making it reasonable to ignore for the sake of this analyisis.
  • Original: Primary data source, the company Cyclistic gathers this data first-hand.
  • Comprehensive: The dataset has more than 5 million complete data entries.
  • Current: Data is up-to-date as it includes data from the past 12 months.

Process

Tools used for processing data:

Spreadsheets where avoided because of the raw amount of data in this dataset. For this project, R Studio was the preferred tool used for this project. It has a wide range of tools available and can handle massive amounts of data with ease.

Cleaning and processing:

First we prepare the R Studio environment with the right tools by installing the required packages and libraries:

library(tidyverse) #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(janitor) #helps clean data
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) #helps wrangle date attributes
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2) #helps visualize data
library(dplyr) #Count distinct
library(scales) #For better numbering in plots
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Data importing and consistency checking

Then we have to import the data to R Studio and check for consistency, we use ‘colnames’ to verify the that the column names match in each dataframe and ‘str’ to check for consistency in data type per column.

Processing (part 2):

  1. We stack the individual data frames into one big data frame.
  2. We do a check with ‘colnames’, ‘dim’, ‘head’, ‘str’ and ‘summary’.
  3. We check for discrepancies in number of rows and distinct rider id’s, they should be the same amount.
  4. We check for distinct values in member_casual, it should to be 2.
all_rides <- bind_rows(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)

colnames(all_rides)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
n_distinct(all_rides$ride_id) #Count distinct ride id, it should be the same number as rows.
## [1] 5733451
nrow(all_rides)
## [1] 5733451
dim(all_rides)
## [1] 5733451      13
head(all_rides)
##            ride_id rideable_type          started_at            ended_at
## 1 46F8167220E4431F electric_bike 2021-12-07 15:06:07 2021-12-07 15:13:42
## 2 73A77762838B32FD electric_bike 2021-12-11 03:43:29 2021-12-11 04:10:23
## 3 4CF42452054F59C5 electric_bike 2021-12-15 23:10:28 2021-12-15 23:23:14
## 4 3278BA87BF698339  classic_bike 2021-12-26 16:16:10 2021-12-26 16:30:53
## 5 6FF54232576A3B73 electric_bike 2021-12-30 11:31:05 2021-12-30 11:51:21
## 6 93E8D79490E3AB11  classic_bike 2021-12-01 18:28:36 2021-12-01 18:38:03
##             start_station_name start_station_id           end_station_name
## 1     Laflin St & Cullerton St            13307        Morgan St & Polk St
## 2        LaSalle Dr & Huron St     KP1705001026 Clarendon Ave & Leland Ave
## 3 Halsted St & North Branch St     KA1504000117       Broadway & Barry Ave
## 4 Halsted St & North Branch St     KA1504000117      LaSalle Dr & Huron St
## 5     Leavitt St & Chicago Ave            18058     Clark St & Drummond Pl
## 6         Wabash Ave & 16th St           SL-012         Wells St & Polk St
##   end_station_id start_lat start_lng  end_lat   end_lng member_casual
## 1   TA1307000130  41.85483 -87.66366 41.87197 -87.65097        member
## 2   TA1307000119  41.89441 -87.63233 41.96797 -87.65000        casual
## 3          13137  41.89936 -87.64852 41.93758 -87.64410        member
## 4   KP1705001026  41.89939 -87.64854 41.89488 -87.63233        member
## 5   TA1307000142  41.89558 -87.68202 41.93125 -87.64434        member
## 6         SL-011  41.86038 -87.62581 41.87260 -87.63350        member
str(all_rides)
## 'data.frame':    5733451 obs. of  13 variables:
##  $ ride_id           : chr  "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr  "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
##  $ ended_at          : chr  "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
##  $ start_station_name: chr  "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
##  $ start_station_id  : chr  "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
##  $ end_station_name  : chr  "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
##  $ end_station_id    : chr  "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...
summary(all_rides)
##    ride_id          rideable_type       started_at          ended_at        
##  Length:5733451     Length:5733451     Length:5733451     Length:5733451    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5733451     Length:5733451     Length:5733451     Length:5733451    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.64   Min.   :-87.84   Min.   : 0.00   Min.   :-88.14  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.37   Max.   :  0.00  
##                                   NA's   :5874    NA's   :5874    
##  member_casual     
##  Length:5733451    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
n_distinct(all_rides$member_casual) #Count distinct values to check consistency.
## [1] 2

Procccessing (part 3)

  1. We remove empty rows.
all_rides <- janitor::remove_empty(all_rides,which = c("cols")) #removes empty columns
all_rides <- janitor::remove_empty(all_rides,which = c("rows")) #removes empty rows
  1. We use ‘lubridate’ package to transform ‘started_at’ and ‘ended_at’ to ‘ymd_hms’ and to create ‘start_hour’, ‘end_hour’ and ‘day_of_week’.
all_rides$started_at <- lubridate::ymd_hms(all_rides$started_at) #turns 'started_at' to ymd_hms
all_rides$ended_at <- lubridate::ymd_hms(all_rides$ended_at) #turns 'ended_at' to ymd_hms
all_rides$start_hour <- lubridate::hour(all_rides$started_at)
all_rides$end_hour <- lubridate::hour(all_rides$ended_at)
all_rides$day_of_week <- wday(all_rides$started_at)
  1. We add a ride lenght column with ‘difftime’ using ‘ended_at’ and ‘started_at’.
all_rides$ride_length <- difftime(all_rides$ended_at,all_rides$started_at)
  1. Then we create a new dataframe excluding the rows with negative lenght time.
sum(all_rides$ride_length < 0, na.rm=TRUE) #counts number of rows < 0 in 'ride_length)
## [1] 100
all_rides_v2 <- all_rides[!(all_rides$ride_length<0),]
sum(all_rides_v2$ride_length < 0, na.rm=TRUE) #counts number of rows < 0 in 'ride_length)
## [1] 0
  1. Finally we change the ‘day_of_week’ from a number to the name of the day.
all_rides_v2 <- all_rides_v2 %>% 
  mutate(day_of_week = recode(day_of_week
                              ,"1" = "Sunday"
                              ,"2" = "Monday"
                              ,"3" = "Tuesday"
                              ,"4" = "Wednesday"
                              ,"5" = "Thursday"
                              ,"6" = "Friday"
                              ,"7" = "Saturday"))

Analyze

Descriptive analysis on ride_length

  • Average:
mean(all_rides_v2$ride_length) #straight average
## Time difference of 1165.316 secs
  • Midpoint number:
median(all_rides_v2$ride_length) #midpoint number
## Time difference of 618 secs
  • Longest and shortest ride:
max(all_rides_v2$ride_length) #longest ride
## Time difference of 2483235 secs
min(all_rides_v2$ride_length) #shortest ride
## Time difference of 0 secs
  • Descriptive analysis on ‘ride_length’ by user type:
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = mean)
##   all_rides_v2$member_casual all_rides_v2$ride_length
## 1                     casual           1746.5364 secs
## 2                     member            762.5375 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = median)
##   all_rides_v2$member_casual all_rides_v2$ride_length
## 1                     casual                 783 secs
## 2                     member                 530 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = max)
##   all_rides_v2$member_casual all_rides_v2$ride_length
## 1                     casual             2483235 secs
## 2                     member               93594 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = min)
##   all_rides_v2$member_casual all_rides_v2$ride_length
## 1                     casual                   0 secs
## 2                     member                   0 secs
  • Descriptive analysis on ‘ride_length’ by user type and weekday:
all_rides_v2$day_of_week <- ordered(all_rides_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual + all_rides_v2$day_of_week, FUN = mean )
##    all_rides_v2$member_casual all_rides_v2$day_of_week all_rides_v2$ride_length
## 1                      casual                   Sunday           2045.0549 secs
## 2                      member                   Sunday            842.8941 secs
## 3                      casual                   Monday           1753.9351 secs
## 4                      member                   Monday            736.4146 secs
## 5                      casual                  Tuesday           1558.6146 secs
## 6                      member                  Tuesday            728.8662 secs
## 7                      casual                Wednesday           1482.0886 secs
## 8                      member                Wednesday            723.9585 secs
## 9                      casual                 Thursday           1534.3805 secs
## 10                     member                 Thursday            737.9848 secs
## 11                     casual                   Friday           1667.3108 secs
## 12                     member                   Friday            750.0025 secs
## 13                     casual                 Saturday           1951.9865 secs
## 14                     member                 Saturday            849.0422 secs

Analyze ridership data

all_rides_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual,weekday) %>% 
  ggplot(aes(x=weekday,y=number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")+
  scale_y_continuous(labels = comma)+
  ggtitle("Daily rides by user type and weekday:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

all_rides_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual,weekday) %>% 
  ggplot(aes(x=weekday,y=average_duration, fill = member_casual)) +
  geom_col(position = "dodge")+
  scale_y_continuous(labels = comma)+
  ggtitle("Average ride duration by user type and weekday:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

all_rides_v2 %>% 
  group_by(member_casual, start_hour) %>% 
  summarise(number_of_rides=n()) %>% 
  ggplot(aes(x=start_hour,y=number_of_rides, color = member_casual)) +
  geom_line()+
  scale_y_continuous(labels = comma)+
  ggtitle("Rides by time of day and user type:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Further analysis on ‘rideable_type’ by user, weekday, time of day…

We first obtain the total amount of rides divided by annual member and casual rider, further subdivided by bicycle type (classic, electric and docked).

classic_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'classic_bike')
electric_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'electric_bike')
docked_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'docked_bike')

classic_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'classic_bike')
electric_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'electric_bike')
docked_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'docked_bike')

We search for correlations in our data

For example: does the bicycle type usage change over the week?

all_rides_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(rideable_type, weekday) %>% 
  summarise(number_of_rides=n()) %>% 
  ggplot(aes(x=weekday,y=number_of_rides, fill=rideable_type)) +
  geom_col(position = "dodge")+
  scale_y_continuous(labels = comma)+
  ggtitle("Rides by weekday and bicycle type:")
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

Then we analyze for differences between annual members and casual riders with the types of bicycles they use:

all_rides_v2 %>% 
  group_by(rideable_type, member_casual, start_hour) %>% 
  summarise(number_of_rides=n()) %>% 
  ggplot(aes(x=start_hour,y=number_of_rides, color=member_casual, linetype=rideable_type)) +
  geom_smooth()+
  scale_y_continuous(labels = comma)+
  ggtitle("Rides by time of day, user and bicycle type:")
## `summarise()` has grouped output by 'rideable_type', 'member_casual'. You can
## override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Data analysis results

Conclusion of results between casual riders and annual members:

  • Casual riders have a longer ride length average (1746 seconds) than annual members (762 seconds), but annual members have almost double the daily rides than casual riders with saturday and sunday as an exception.
  • Casual riders start riding by 5am and increase almost linearly until 4pm, which it decreases after. Annual members have a spike of usage at 8am and then again at 4pm, with this information we can assume there is a strong relationship with annual members using cyclist as a means to get to work and back home.

Next steps:

We interpreted the results from this analysis as an indicator that our casual riders use cyclistic for leisure, we also determined that our annual members mostly use it for commuting to work and back. We reccommend to further this investigation by surveying our annual members with a direct “yes or no” survey asking if they use cyclistic bikes for commuting to work. If that is not a possibility, then the marketing campaign should focus on inspiring casual riders to become annual members to use cyclistic for their commute to work. Reassuring their intent by offering a free trial, moneyback guarantee, or other tacticts.