The following is a case study analysis of twelve months of Divvy bike share data. Divvy launched in Chicago in 2016 and has quickly grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. Bikes can be unlocked at one station, ridden, and returned to any other station in the network.
Divvy’s pricing plan identifies two types of riders: members and casual riders. Members have purchased an annual membership, while casual riders purchase single rides or full-day passes. With members proving to be more profitable than casual riders, Divvy has set its sights on growth by developing strategies to convert casual riders to members.
The forthcoming analysis identifies distinguishing behaviors between the two types of riders to inform marketing strategies aimed at converting casual riders to members.
The business questions we are tasked with answering are:
Dataset: Publicly available Divvy trip data that includes: start day and time, end day and time, start station, end station, type of bike, and rider type (member or casual). Each row of data is one individual ride.
For this analysis we are using 12 months of data, from January 2021 through December 2021. The compiled dataset contains 5,594,452 rows. The dataset can be downloaded here.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(waffle)
library(reshape2)
january_2021 <- read_csv("1. Cyclistic_January_2021_tripdata.csv")
february_2021 <- read_csv("2. Cyclistic_February_2021_tripdata.csv")
march_2021 <- read_csv("3. Cyclistic_March_2021_tripdata.csv")
april_2021 <- read_csv("4. Cyclistic_April_2021_tripdata.csv")
may_2021 <- read_csv("5. Cyclistic_May_2021_tripdata.csv")
june_2021 <- read_csv("6. Cyclistic_June_2021_tripdata.csv")
july_2021 <- read_csv("7. Cyclistic_July_2021_tripdata.csv")
august_2021 <- read_csv("8. Cyclistic_August_2021_tripdata.csv")
september_2021 <- read_csv("9. Cyclistic_September_2021_tripdata.csv")
october_2021 <- read_csv("10. Cyclistic_October_2021_tripdata.csv")
november_2021 <- read_csv("11. Cyclistic_November_2021_tripdata.csv")
december_2021 <- read_csv("12. Cyclistic_December_2021_tripdata.csv")
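The twelve separate reads above could also be done in one pass. The following is a minimal sketch of that idea, not the method used in this analysis. It uses base R only and writes two tiny toy files to a temporary folder so that it runs on its own; the folder name `tripdata_demo` and the toy columns are assumptions made for illustration.

```r
# Sketch: read every monthly CSV in a folder and stack them into one data frame.
# The temporary folder and the two toy files below are stand-ins for the real
# "N. Cyclistic_<Month>_2021_tripdata.csv" files.
data_dir <- file.path(tempdir(), "tripdata_demo")
dir.create(data_dir, showWarnings = FALSE)

toy_jan <- data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual"))
toy_feb <- data.frame(ride_id = "b1", member_casual = "member")
write.csv(toy_jan, file.path(data_dir, "1. Cyclistic_January_2021_tripdata.csv"), row.names = FALSE)
write.csv(toy_feb, file.path(data_dir, "2. Cyclistic_February_2021_tripdata.csv"), row.names = FALSE)

# Find all matching files, read each one, then bind the pieces row-wise.
files <- list.files(data_dir, pattern = "_2021_tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
all_rides <- do.call(rbind, monthly)

nrow(all_rides)
```

With the real files, `data_dir` would point at the download folder, and `read_csv()` could replace `read.csv()` so that the date-time columns are parsed on import. Keeping one object per month, as done above, has the advantage that each month can be inspected separately before combining.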
It is important that we check that the column names match before we combine the data:
colnames(january_2021)
colnames(february_2021)
colnames(march_2021)
colnames(april_2021)
colnames(may_2021)
colnames(june_2021)
colnames(july_2021)
colnames(august_2021)
colnames(september_2021)
colnames(october_2021)
colnames(november_2021)
colnames(december_2021)
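Comparing twelve printed lists of column names by eye is error-prone, so the same check can be done programmatically. This is a small sketch under the assumption that the monthly data frames are collected in a named list; the two toy frames stand in for the real ones.

```r
# Sketch: confirm that every monthly data frame has identical column names.
# The two toy frames stand in for january_2021 ... december_2021.
monthly <- list(
  january_2021  = data.frame(ride_id = "a1", started_at = "2021-01-04 08:00:00"),
  february_2021 = data.frame(ride_id = "b1", started_at = "2021-02-01 09:00:00")
)

# Compare each frame's column names (and their order) against the first frame.
reference_cols <- colnames(monthly[[1]])
cols_match <- vapply(
  monthly,
  function(df) identical(colnames(df), reference_cols),
  logical(1)
)

cols_match        # named logical vector, one entry per month
all(cols_match)   # TRUE only if every frame matches the first
```

Any month that reports `FALSE` would need its columns renamed or reordered before the data is combined.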
We should also check the structure of our data sets:
str(january_2021)
str(february_2021)
str(march_2021)
str(april_2021)
str(may_2021)
str(june_2021)
str(july_2021)
str(august_2021)
str(september_2021)
str(october_2021)
str(november_2021)
str(december_2021)
q1_2021 <- bind_rows(january_2021, february_2021, march_2021)
q2_2021 <- bind_rows(april_2021, may_2021, june_2021)
q3_2021 <- bind_rows(july_2021, august_2021, september_2021)
q4_2021 <- bind_rows(october_2021, november_2021, december_2021)
data_2021 <- bind_rows(q1_2021, q2_2021, q3_2021, q4_2021)
In order to clean our data effectively, we will create columns for “ride length” and “day of week”:
## Creating "ride_length" column (in seconds)
q1_2021$ride_length <- difftime(q1_2021$ended_at, q1_2021$started_at, units = "secs")
q2_2021$ride_length <- difftime(q2_2021$ended_at, q2_2021$started_at, units = "secs")
q3_2021$ride_length <- difftime(q3_2021$ended_at, q3_2021$started_at, units = "secs")
q4_2021$ride_length <- difftime(q4_2021$ended_at, q4_2021$started_at, units = "secs")
data_2021$ride_length <- difftime(data_2021$ended_at, data_2021$started_at, units = "secs")
## Convert "ride_length" to characters and then to numeric
q1_2021$ride_length <- as.numeric(as.character(q1_2021$ride_length))
q2_2021$ride_length <- as.numeric(as.character(q2_2021$ride_length))
q3_2021$ride_length <- as.numeric(as.character(q3_2021$ride_length))
q4_2021$ride_length <- as.numeric(as.character(q4_2021$ride_length))
data_2021$ride_length <- as.numeric(as.character(data_2021$ride_length))
## Creating day_of_week column
q1_2021$day_of_week <- weekdays(as.Date(q1_2021$started_at))
q2_2021$day_of_week <- weekdays(as.Date(q2_2021$started_at))
q3_2021$day_of_week <- weekdays(as.Date(q3_2021$started_at))
q4_2021$day_of_week <- weekdays(as.Date(q4_2021$started_at))
data_2021$day_of_week <- weekdays(as.Date(data_2021$started_at))
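The two derived columns can be checked on a couple of hand-made rides. The sketch below is illustrative only: the toy timestamps are invented, and building the day names from a fixed English vector is a choice made here so that the result does not depend on the session locale. The tibbles later in this analysis show `day_of_week` as an ordered factor, so the sketch orders the levels Monday through Sunday.

```r
# Sketch: ride length in seconds and an ordered day-of-week factor.
rides <- data.frame(
  started_at = as.POSIXct(c("2021-01-04 08:00:00", "2021-01-09 14:30:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-01-04 08:12:30", "2021-01-09 15:00:00"), tz = "UTC")
)

# Fixing units = "secs" stops difftime() from choosing minutes or hours on its own.
rides$ride_length <- as.numeric(difftime(rides$ended_at, rides$started_at, units = "secs"))

# Build English day names from the numeric weekday (0 = Sunday), then order
# the factor levels Monday through Sunday.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
rides$day_of_week <- factor(
  day_names[as.POSIXlt(rides$started_at, tz = "UTC")$wday + 1],
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
  ordered = TRUE
)

rides
```

An ordered factor keeps tables and plots in calendar order rather than alphabetical order.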
Now we will remove bad data. Our data frames currently have rows with a negative ride length or trips that were actually bikes removed from service for testing. We will create a new “v2” dataframe since we are removing data.
q1_2021_v2 <- q1_2021[!(q1_2021$end_station_id == "Hubbard Bike-checking (LBS-WH-TEST)" & !is.na(q1_2021$start_station_id) & !is.na(q1_2021$start_station_name) & !is.na(q1_2021$end_station_name) & !is.na(q1_2021$end_station_id) | q1_2021$ride_length < 0),]
q2_2021_v2 <- q2_2021[!(q2_2021$end_station_id == "Hubbard Bike-checking (LBS-WH-TEST)" & !is.na(q2_2021$start_station_id) & !is.na(q2_2021$start_station_name) & !is.na(q2_2021$end_station_name) & !is.na(q2_2021$end_station_id) | q2_2021$ride_length < 0),]
q3_2021_v2 <- q3_2021[!(q3_2021$end_station_id == "Hubbard Bike-checking (LBS-WH-TEST)" & !is.na(q3_2021$start_station_id) & !is.na(q3_2021$start_station_name) & !is.na(q3_2021$end_station_name) & !is.na(q3_2021$end_station_id) | q3_2021$ride_length < 0),]
q4_2021_v2 <- q4_2021[!(q4_2021$end_station_id == "Hubbard Bike-checking (LBS-WH-TEST)" & !is.na(q4_2021$start_station_id) & !is.na(q4_2021$start_station_name) & !is.na(q4_2021$end_station_name) & !is.na(q4_2021$end_station_id) | q4_2021$ride_length < 0),]
data_2021_v2 <- data_2021[!(data_2021$end_station_id == "Hubbard Bike-checking (LBS-WH-TEST)" & !is.na(data_2021$start_station_id) & !is.na(data_2021$start_station_name) & !is.na(data_2021$end_station_name) & !is.na(data_2021$end_station_id) | data_2021$ride_length < 0),]
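The filter above is dense, so here is a sketch of the same idea split into named steps on toy data. It is an alternative formulation rather than the code used in this analysis; the station id `"13022"` and the four toy rides are made up, and rows with a missing station are deliberately kept.

```r
# Sketch: drop test-station trips and negative ride lengths.
test_station <- "Hubbard Bike-checking (LBS-WH-TEST)"

rides <- data.frame(
  ride_id        = c("r1", "r2", "r3", "r4"),
  end_station_id = c("13022", test_station, NA, "13022"),
  ride_length    = c(600, 300, 450, -20)
)

# %in% returns FALSE for NA, so rides with a missing station are kept
# instead of turning into all-NA rows.
is_test_trip <- rides$end_station_id %in% test_station
is_negative  <- !is.na(rides$ride_length) & rides$ride_length < 0

rides_v2 <- rides[!(is_test_trip | is_negative), ]
rides_v2$ride_id
```

Naming each condition makes it easy to count how many rows each rule removes, for example with `sum(is_test_trip)` and `sum(is_negative)`.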
Now that our data is cleaned, we will further manipulate it to aid in our analysis by creating a “time of day” column. This column will group our trips into four distinct time periods.
## First we need to create two new columns "start_hour" & "start_minute"
q1_2021_v2$start_hour <- hour(q1_2021_v2$started_at)
q1_2021_v2$start_minute <- minute(q1_2021_v2$started_at)
q2_2021_v2$start_hour <- hour(q2_2021_v2$started_at)
q2_2021_v2$start_minute <- minute(q2_2021_v2$started_at)
q3_2021_v2$start_hour <- hour(q3_2021_v2$started_at)
q3_2021_v2$start_minute <- minute(q3_2021_v2$started_at)
q4_2021_v2$start_hour <- hour(q4_2021_v2$started_at)
q4_2021_v2$start_minute <- minute(q4_2021_v2$started_at)
data_2021_v2$start_hour <- hour(data_2021_v2$started_at)
data_2021_v2$start_minute <- minute(data_2021_v2$started_at)
With our new columns we can now create the “time_of_day” column. Our custom time frames are as follows: AM (4:00AM-9:59AM), MID (10:00AM-2:59PM), PM (3:00PM-7:59PM), and LATE (8:00PM-3:59AM).
DISCLAIMER: These time periods were created to correspond to easily categorized periods of the day, strongly influenced by the M-F work schedule. Analysis will be impacted by this choice. In a professional setting, stakeholders would be involved in this decision.
## Time of day column is being created for each data frame to preserve the ability to analyze by quarter.
q1_2021_v2$time_of_day[q1_2021_v2$start_hour>=20 & q1_2021_v2$start_minute>=0]='LATE'
q1_2021_v2$time_of_day[q1_2021_v2$start_hour<=3 & q1_2021_v2$start_minute<=59] = 'LATE'
q1_2021_v2$time_of_day[q1_2021_v2$start_hour>=4 & q1_2021_v2$start_minute>=0 & q1_2021_v2$start_hour<=9 & q1_2021_v2$start_minute<=59] = 'AM'
q1_2021_v2$time_of_day[q1_2021_v2$start_hour>=10 & q1_2021_v2$start_minute>=0 & q1_2021_v2$start_hour<=14 & q1_2021_v2$start_minute<=59] = 'MID'
q1_2021_v2$time_of_day[q1_2021_v2$start_hour>=15 & q1_2021_v2$start_minute>=0 & q1_2021_v2$start_hour<=19 & q1_2021_v2$start_minute<=59] = 'PM'
q2_2021_v2$time_of_day[q2_2021_v2$start_hour>=20 & q2_2021_v2$start_minute>=0]='LATE'
q2_2021_v2$time_of_day[q2_2021_v2$start_hour<=3 & q2_2021_v2$start_minute<=59] = 'LATE'
q2_2021_v2$time_of_day[q2_2021_v2$start_hour>=4 & q2_2021_v2$start_minute>=0 & q2_2021_v2$start_hour<=9 & q2_2021_v2$start_minute<=59] = 'AM'
q2_2021_v2$time_of_day[q2_2021_v2$start_hour>=10 & q2_2021_v2$start_minute>=0 & q2_2021_v2$start_hour<=14 & q2_2021_v2$start_minute<=59] = 'MID'
q2_2021_v2$time_of_day[q2_2021_v2$start_hour>=15 & q2_2021_v2$start_minute>=0 & q2_2021_v2$start_hour<=19 & q2_2021_v2$start_minute<=59] = 'PM'
q3_2021_v2$time_of_day[q3_2021_v2$start_hour>=20 & q3_2021_v2$start_minute>=0]='LATE'
q3_2021_v2$time_of_day[q3_2021_v2$start_hour<=3 & q3_2021_v2$start_minute<=59] = 'LATE'
q3_2021_v2$time_of_day[q3_2021_v2$start_hour>=4 & q3_2021_v2$start_minute>=0 & q3_2021_v2$start_hour<=9 & q3_2021_v2$start_minute<=59] = 'AM'
q3_2021_v2$time_of_day[q3_2021_v2$start_hour>=10 & q3_2021_v2$start_minute>=0 & q3_2021_v2$start_hour<=14 & q3_2021_v2$start_minute<=59] = 'MID'
q3_2021_v2$time_of_day[q3_2021_v2$start_hour>=15 & q3_2021_v2$start_minute>=0 & q3_2021_v2$start_hour<=19 & q3_2021_v2$start_minute<=59] = 'PM'
q4_2021_v2$time_of_day[q4_2021_v2$start_hour>=20 & q4_2021_v2$start_minute>=0]='LATE'
q4_2021_v2$time_of_day[q4_2021_v2$start_hour<=3 & q4_2021_v2$start_minute<=59] = 'LATE'
q4_2021_v2$time_of_day[q4_2021_v2$start_hour>=4 & q4_2021_v2$start_minute>=0 & q4_2021_v2$start_hour<=9 & q4_2021_v2$start_minute<=59] = 'AM'
q4_2021_v2$time_of_day[q4_2021_v2$start_hour>=10 & q4_2021_v2$start_minute>=0 & q4_2021_v2$start_hour<=14 & q4_2021_v2$start_minute<=59] = 'MID'
q4_2021_v2$time_of_day[q4_2021_v2$start_hour>=15 & q4_2021_v2$start_minute>=0 & q4_2021_v2$start_hour<=19 & q4_2021_v2$start_minute<=59] = 'PM'
data_2021_v2$time_of_day[data_2021_v2$start_hour>=20 & data_2021_v2$start_minute>=0]='LATE'
data_2021_v2$time_of_day[data_2021_v2$start_hour<=3 & data_2021_v2$start_minute<=59] = 'LATE'
data_2021_v2$time_of_day[data_2021_v2$start_hour>=4 & data_2021_v2$start_minute>=0 & data_2021_v2$start_hour<=9 & data_2021_v2$start_minute<=59] = 'AM'
data_2021_v2$time_of_day[data_2021_v2$start_hour>=10 & data_2021_v2$start_minute>=0 & data_2021_v2$start_hour<=14 & data_2021_v2$start_minute<=59] = 'MID'
data_2021_v2$time_of_day[data_2021_v2$start_hour>=15 & data_2021_v2$start_minute>=0 & data_2021_v2$start_hour<=19 & data_2021_v2$start_minute<=59] = 'PM'
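Because every band starts and ends on an hour boundary, `start_hour` alone decides the band. The same classification can therefore be written once as a small helper and reused for each data frame. This is a sketch of an alternative, not the code used above; the function name `classify_time_of_day` is hypothetical.

```r
# Sketch: map a start hour (0-23) to the four custom time-of-day bands.
classify_time_of_day <- function(start_hour) {
  # Lower bounds of the hour bands: 0-3, 4-9, 10-14, 15-19, 20-23.
  band <- findInterval(start_hour, c(0, 4, 10, 15, 20))
  band_labels <- c("LATE", "AM", "MID", "PM", "LATE")
  factor(band_labels[band], levels = c("AM", "MID", "PM", "LATE"), ordered = TRUE)
}

# Check the edges of every band.
classify_time_of_day(c(0, 3, 4, 9, 10, 14, 15, 19, 20, 23))
```

It could then be applied as `data_2021_v2$time_of_day <- classify_time_of_day(data_2021_v2$start_hour)`, and likewise for each quarterly data frame, which avoids repeating the five assignments per frame.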
## Let's rename the values within our rideable_type column
data_2021_v2 <- data_2021_v2 %>%
  mutate(rideable_type = recode(rideable_type,
    classic_bike = 'Classic', docked_bike = 'Docked', electric_bike = 'Electric'))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.75 12.00 21.63 21.77 55944.15
## Ride Type Day of Week Avg Length Min
## 1 member Monday 13.24712
## 2 casual Monday 31.51744
## 3 member Tuesday 12.78499
## 4 casual Tuesday 27.76331
## 5 member Wednesday 12.81457
## 6 casual Wednesday 26.89222
## 7 member Thursday 12.77541
## 8 casual Thursday 27.06034
## 9 member Friday 13.32393
## 10 casual Friday 29.33489
## 11 member Saturday 15.26466
## 12 casual Saturday 34.03548
## 13 member Sunday 15.65791
## 14 casual Sunday 36.72005
## Ride Type Day of Week Median Length Min
## 1 member Monday 9.200000
## 2 casual Monday 15.950000
## 3 member Tuesday 9.133333
## 4 casual Tuesday 14.283333
## 5 member Wednesday 9.216667
## 6 casual Wednesday 13.966667
## 7 member Thursday 9.133333
## 8 casual Thursday 13.783333
## 9 member Friday 9.433333
## 10 casual Friday 14.966667
## 11 member Saturday 10.816667
## 12 casual Saturday 17.816667
## 13 member Sunday 10.866667
## 14 casual Sunday 18.716667
## Ride Type Time of Day Avg Length Min
## 1 member AM 12.31430
## 2 casual AM 25.12575
## 3 member MID 13.70664
## 4 casual MID 33.53091
## 5 member PM 14.10659
## 6 casual PM 30.40294
## 7 member LATE 13.87497
## 8 casual LATE 32.69821
## Ride Type Time of Day Median Length Min
## 1 member AM 8.95000
## 2 casual AM 12.26667
## 3 member MID 9.35000
## 4 casual MID 18.35000
## 5 member PM 10.13333
## 6 casual PM 16.30000
## 7 member LATE 9.55000
## 8 casual LATE 14.40000
From January 2021-December 2021, there were a total of 5,594,452 rides. The distribution between member and casual is as follows:
## Member Rides
## 3065739
## Casual Rides
## 2528713
From January 2021 - December 2021, 55% of all rides were by members and 45% were casual rides.
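The percentages follow directly from the two counts printed above. A short worked example, using only those figures:

```r
# Ride counts taken from the output above.
ride_counts <- c(member = 3065739, casual = 2528713)

total_rides <- sum(ride_counts)
share_pct <- round(100 * ride_counts / total_rides)

total_rides
share_pct
```

On the full data frame, the same shares could be obtained with `prop.table(table(data_2021_v2$member_casual))`.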
## # A tibble: 14 x 3
## # Groups: Ride_Type [2]
## Ride_Type Day_of_Week Total_Rides
## <ord> <ord> <int>
## 1 member Monday 416159
## 2 member Tuesday 465470
## 3 member Wednesday 477122
## 4 member Thursday 451483
## 5 member Friday 446377
## 6 member Saturday 433025
## 7 member Sunday 376103
## 8 casual Monday 286347
## 9 casual Tuesday 274363
## 10 casual Wednesday 278920
## 11 casual Thursday 286045
## 12 casual Friday 364044
## 13 casual Saturday 557940
## 14 casual Sunday 481054
Member rides make up a majority of all rides Monday-Friday. Casual rides make up a majority of all rides Saturday and Sunday.
Member ride distribution is relatively even across all days of the week. Casual rides, on the other hand, are concentrated on Saturday and Sunday, which together account for about 41% of all casual rides, with a smaller increase on Friday.
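The daily counts in the table above can be turned into within-group shares to make the weekday and weekend contrast concrete. This worked example uses only the figures from that table; the variable names are chosen here for illustration.

```r
# Daily ride counts copied from the table above (Monday through Sunday).
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
member <- c(416159, 465470, 477122, 451483, 446377, 433025, 376103)
casual <- c(286347, 274363, 278920, 286045, 364044, 557940, 481054)
names(member) <- days
names(casual) <- days

weekend <- c("Saturday", "Sunday")

# Share of each group's rides that fall on the weekend, in percent.
member_weekend_pct <- round(100 * sum(member[weekend]) / sum(member))
casual_weekend_pct <- round(100 * sum(casual[weekend]) / sum(casual))

# Which group has more rides on each day?
majority <- ifelse(member > casual, "member", "casual")

member_weekend_pct
casual_weekend_pct
majority
```

Weekend rides come to roughly 26% of member rides and 41% of casual rides, and the majority flips from member to casual on Saturday and Sunday.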
We’ve analyzed ride data by day of week. Now let’s shift to time of day. Again, the ranges you will see below are as follows: AM (4:00AM-9:59AM), MID (10:00AM-2:59PM), PM (3:00PM-7:59PM), and LATE (8:00PM-3:59AM).
DISCLAIMER: These time periods were created to correspond to easily categorized periods of the day, strongly influenced by the M-F work schedule. Analysis will be impacted by this choice. In a professional setting, stakeholders would be involved in this decision.
## # A tibble: 4 x 2
## Time_of_Day Total_Rides
## <ord> <int>
## 1 AM 808261
## 2 MID 1572295
## 3 PM 2255092
## 4 LATE 958804
From the above plot we see several distinct behaviors: the PM time of day (3PM-8PM) is the busiest period for both member and casual rides; member ride count in the AM is more than double the casual ride count; and the LATE time of day is majority casual rides.
There is a significant difference in the AM and LATE times of day. 19% of all member rides occur in the AM compared to only 9% of casual rides, while only 14% of all member rides occur in the LATE period compared to 21% of casual rides.
At this point we’ve looked separately at behavioral data by day of week and time of day. Will we see anything different if we look at them together?
This is a clear example of how behavior patterns shift between weekday and weekend. Casual ride count is significantly higher in all times of day, except AM, on Saturday and Sunday.
Having investigated behavioral differences by looking at when rides occur, we will now analyze rides by looking at ride length. This is a different type of behavior, and it is fundamental to understanding the differences between the way members and casual riders use Divvy bikes.
Let’s begin by looking again at an aggregate summary of ride length (minutes):
## Ride length (minutes) by ride type
## Ride Type Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 member 0.000000 5.566667 9.600000 13.632078 16.600000 1559.933333
## 2 casual 0.000000 9.066667 15.966667 31.326875 29.266667 55944.150000
With this simple summary function, we see that the casual average ride length is more than double the member average, while the median ride lengths are much closer. The reason for the discrepancy becomes apparent when looking at the Max. value, where we see an extreme outlier: the longest casual ride lasted 55,944 minutes, or roughly 39 days. Let’s explore this in further detail.
From this plot we see that member ride length is largely consistent Monday - Friday, with a slight increase on Saturday and Sunday. Average member ride length is also significantly lower than casual average ride length across all days. Casual average ride length is highest on Saturday and Sunday and is not consistent throughout the week. Looking at mean by itself, however, can be deceiving.
Let’s put the median and average ride length by day of week side-by-side. We will look at the raw data and then plot it.
## # A tibble: 14 x 4
## # Groups: member_casual [2]
## member_casual day_of_week ride_length ride_min
## <ord> <ord> <dbl> <dbl>
## 1 member Monday 795. 13.2
## 2 member Tuesday 767. 12.8
## 3 member Wednesday 769. 12.8
## 4 member Thursday 767. 12.8
## 5 member Friday 799. 13.3
## 6 member Saturday 916. 15.3
## 7 member Sunday 939. 15.7
## 8 casual Monday 1891. 31.5
## 9 casual Tuesday 1666. 27.8
## 10 casual Wednesday 1614. 26.9
## 11 casual Thursday 1624. 27.1
## 12 casual Friday 1760. 29.3
## 13 casual Saturday 2042. 34.0
## 14 casual Sunday 2203. 36.7
## # A tibble: 14 x 4
## # Groups: member_casual [2]
## member_casual day_of_week ride_length ride_min
## <ord> <ord> <dbl> <dbl>
## 1 member Monday 552 9.2
## 2 member Tuesday 548 9.13
## 3 member Wednesday 553 9.22
## 4 member Thursday 548 9.13
## 5 member Friday 566 9.43
## 6 member Saturday 649 10.8
## 7 member Sunday 652 10.9
## 8 casual Monday 957 16.0
## 9 casual Tuesday 857 14.3
## 10 casual Wednesday 838 14.0
## 11 casual Thursday 827 13.8
## 12 casual Friday 898 15.0
## 13 casual Saturday 1069 17.8
## 14 casual Sunday 1123 18.7
Looking at them side-by-side, the difference in ride length between mean and median is apparent and significant. The mean ride length is being heavily skewed by outliers (lengthier trips), particularly with casual rides. Behaviorally, most casual rides are not double the length of member rides, as you would conclude from looking only at the mean. In general, most rides are shorter than the mean ride length.
Now that we have observed ride length differences by day of week, let’s again introduce time of day to this analysis. We’ll look at the raw data first.
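The effect of outliers on the mean can be seen on a handful of numbers. The ride lengths below are invented for illustration, with one extreme value standing in for a bike that was not returned for days; the cut-off of 1,000 minutes is likewise an arbitrary assumption.

```r
# Toy ride lengths in minutes: nine ordinary rides plus one extreme ride.
ride_min <- c(8, 9, 10, 12, 14, 15, 16, 18, 20, 5000)

mean_with_outlier   <- mean(ride_min)
median_with_outlier <- median(ride_min)

# Dropping the single extreme ride barely moves the median but collapses the mean.
trimmed <- ride_min[ride_min < 1000]
mean_without_outlier   <- mean(trimmed)
median_without_outlier <- median(trimmed)

c(mean = mean_with_outlier, median = median_with_outlier)
c(mean = mean_without_outlier, median = median_without_outlier)
```

One ride out of ten moves the mean from about 13.6 to 512.2 minutes, while the median only moves from 14 to 14.5. This is why the median is the more reliable description of a typical ride here.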
## # A tibble: 8 x 4
## # Groups: member_casual [2]
## member_casual time_of_day ride_length ride_min
## <ord> <ord> <dbl> <dbl>
## 1 member AM 739. 12.3
## 2 member MID 822. 13.7
## 3 member PM 846. 14.1
## 4 member LATE 832. 13.9
## 5 casual AM 1508. 25.1
## 6 casual MID 2012. 33.5
## 7 casual PM 1824. 30.4
## 8 casual LATE 1962. 32.7
## # A tibble: 8 x 4
## # Groups: member_casual [2]
## member_casual time_of_day ride_length ride_min
## <ord> <ord> <dbl> <dbl>
## 1 member AM 537 8.95
## 2 member MID 561 9.35
## 3 member PM 608 10.1
## 4 member LATE 573 9.55
## 5 casual AM 736 12.3
## 6 casual MID 1101 18.4
## 7 casual PM 978 16.3
## 8 casual LATE 864 14.4
Member ride length is consistent across all periods of the day, with slightly lengthier trips in the afternoon and evening. Casual ride length is significantly shorter in the AM time period compared to the others.
Looking at these side-by-side, they tell different stories. While member rides follow a similar pattern between mean and median, casual rides do not. When looking at mean, the longest casual rides take place in the MID time of day, followed by the LATE period. But when looking at median, the PM time of day replaces LATE as the second longest. LATE casual rides are heavily skewed by outliers.
Finally, let’s look at ride length across time of day and day of week. For the below plots we will look at both mean and median again. However, it must be stated that, ultimately, median will be a better approximation of common behavior within each ride group because it discounts outliers.
We can see by the elevated casual ride lengths on Saturday and Sunday that most ride length outliers occur on the weekend.
This shows us that ride length among members is relatively consistent Monday-Friday, with slight elevations in the MID and PM slots on Saturday and Sunday. Casual ride length is consistent Monday-Friday with an exception in the MID slot on Mondays, where length is noticeably longer. Ride length across all times of day is elevated on Saturdays and Sundays.
The below charts will examine the number of rides that fall into four duration categories:
The number of member rides under 15 minutes is far greater than the number of casual rides under 15 minutes. The number of casual rides is greater for all other categories.
A strong majority of member rides are under 15 minutes. Only 29% of member rides are longer than 15 minutes, while 53% of casual rides are longer than 15 minutes. Only 8% of member rides are longer than 30 minutes while 24% of casual rides are over 30 minutes.
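Grouping rides into duration bands and comparing the share of each band per rider type could be sketched as follows. The band edges used here (15, 30, and 45 minutes) and the six toy rides are assumptions for illustration; they are not necessarily the exact categories used for the charts.

```r
# Sketch: bucket ride lengths (minutes) into duration bands and compare groups.
rides <- data.frame(
  member_casual = c("member", "member", "member", "casual", "casual", "casual"),
  ride_min      = c(6, 12, 20, 14, 35, 50)
)

# right = FALSE makes each band include its lower bound, e.g. [15, 30).
rides$duration_band <- cut(
  rides$ride_min,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("Under 15", "15-30", "30-45", "Over 45"),
  right = FALSE
)

band_counts <- table(rides$member_casual, rides$duration_band)

# Row-wise percentages: share of each rider type's rides in each band.
band_pct <- round(100 * prop.table(band_counts, margin = 1))

band_counts
band_pct
```

Using row-wise percentages rather than raw counts keeps the comparison fair when one rider type has more rides overall.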
Ride length incentives have been a major factor in sales and marketing strategies at Divvy. The current annual membership includes free rides up to 45 minutes. It will be beneficial to look more closely at casual rides over 45 minutes:
Count of casual rides that last:
## 45-50 minutes
## 49294
## 45-55 minutes
## 89338
## 45-60 minutes
## 121970
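Each window above starts at 45 minutes, so the counts are cumulative. Counts of this kind could be produced with a small helper like the one sketched below. The helper name `count_between`, the toy ride lengths, and the choice of an inclusive lower bound with an exclusive upper bound are all assumptions; the original cut-offs are not shown.

```r
# Toy casual ride lengths in minutes.
ride_min <- c(10, 46, 49, 52, 58, 61, 90)

count_between <- function(x, lower, upper) {
  # Lower bound inclusive, upper bound exclusive (an assumption about the cut-offs).
  sum(x >= lower & x < upper, na.rm = TRUE)
}

window_counts <- c(
  "45-50 minutes" = count_between(ride_min, 45, 50),
  "45-55 minutes" = count_between(ride_min, 45, 55),
  "45-60 minutes" = count_between(ride_min, 45, 60)
)

window_counts
```

Subtracting neighboring windows gives the count for each five-minute slice. With the figures reported above, 89,338 - 49,294 = 40,044 casual rides fall between 50 and 55 minutes, and 121,970 - 89,338 = 32,632 fall between 55 and 60 minutes.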
Of all casual rides over 45 minutes:
The morning belongs to members:
Late night belongs to casual riders:
The weekends belong to casual riders:
Member rides are shorter:
Casual rides are longer:
Member ride behavior appears influenced by the workday. They likely use Divvy bikes to travel to and from work, which would account for the greater ride count between 4AM-10AM and consistent ride length across the week. Casual riders use Divvy bikes more frequently after 10AM and on the weekends, likely for leisure activities. This is corroborated by higher ride count and longer ride length on Saturdays and Sundays.
In designing a campaign aimed at converting casual riders to members, success may be found in developing membership incentives surrounding weekend trips, evening/late trips, and longer trips.
Current membership plans include the following benefits as of March 2022:
The following changes to membership benefits may convert casual riders:
Another possibility is a second type of annual membership with benefits aimed at casual riders. For example, a unique “Weekend Warrior” membership - a reduced-price annual membership that receives discounts for late-night rides, rides on Saturdays and Sundays, and extra discounts on rides over “x” minutes in length.
Forthcoming analysis will focus on comparing months and quarters, as the majority of all rides occur in the warmer months. Being able to isolate behavior by season would be very beneficial to understanding Divvy customers.
This concludes our R analysis of behavioral differences between member and casual rides from January 2021 - December 2021.