Business task:
The goal of this analysis is to understand how annual members and casual riders use Cyclistic bikes differently, and to identify any insights that can inform a marketing strategy aimed at converting casual riders to annual members. By understanding the differences between these two customer segments, we can better target our marketing efforts to ultimately drive future growth for the company.
Key stakeholders:
Description of data source:
We will use Cyclistic’s historical bike trip data to analyze and identify trends. We will download the previous 12 months of Cyclistic trip data (January 2022 - December 2022) located here. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable us to address the key business task. The data has been made available by Motivate International Inc under this license).
The 12 monthly datasets are organized as separate files in comma-delimited (.CSV) format and stored on my local drive. Each file consists of 13 columns (listed below) that contain information related to each bike trip that was recorded during that particular month.
In order to understand how casual riders and annual members use Cyclistic bikes differently, we will need to analyze and compare various usage metrics between the two groups, including:
After an initial review of the data, we will need to add some calculated fields for length and distance of rides, and also extract from start times the month of year, day of week, and hour of day to provide us with more opportunities to aggregate the data.
Before conducting any analysis, however, we first need to verify the data’s integrity! In the process phase, we will perform several pre-cleaning activities to ensure the overall accuracy, consistency, and completeness of the data.
For this particular case study, I decided to use R programming in RStudio as we are working with extremely large data sets (+5M rows of data, +1 GB in size), which spreadsheets might not be able to handle as well. The primary advantage for using R is that it provides an accessible language to organize, modify, and clean data frames, and also create insightful data visualizations. RStudio, an integrated development environment (IDE), makes it easy to reproduce your work on different data sets. For instance, when you input your code, it’s simple to just load a new data set and run your scripts again.
Documentation of data cleaning and manipulation:
Install required packages:
install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("ggmap")
install.packages("scales")
install.packages("geosphere")
Load packages:
library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggmap)
library(scales)
library(geosphere)
Import the 12 monthly data sets (csv files) into Rstudio:
m1 <- read_csv("cyclistic_data/202201-divvy-tripdata.csv")
m2 <- read_csv("cyclistic_data/202202-divvy-tripdata.csv")
m3 <- read_csv("cyclistic_data/202203-divvy-tripdata.csv")
m4 <- read_csv("cyclistic_data/202204-divvy-tripdata.csv")
m5 <- read_csv("cyclistic_data/202205-divvy-tripdata.csv")
m6 <- read_csv("cyclistic_data/202206-divvy-tripdata.csv")
m7 <- read_csv("cyclistic_data/202207-divvy-tripdata.csv")
m8 <- read_csv("cyclistic_data/202208-divvy-tripdata.csv")
m9 <- read_csv("cyclistic_data/202209-divvy-tripdata.csv")
m10 <- read_csv("cyclistic_data/202210-divvy-tripdata.csv")
m11 <- read_csv("cyclistic_data/202211-divvy-tripdata.csv")
m12 <- read_csv("cyclistic_data/202212-divvy-tripdata.csv")
Organize individual monthly data sets into one big data frame. Make sure column names in each file match perfectly before we can use a command to join them into one file:
tripdata <- bind_rows(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12)
Rename columns for easier understanding and analysis:
tripdata <- rename(tripdata,
bike_type = rideable_type,
customer_type = member_casual)
Inspect the new df:
colnames(tripdata) #check column names
## [1] "ride_id" "bike_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "customer_type"
str(tripdata) #check data types for each column
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ bike_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ customer_type : chr [1:5667717] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Add columns that list the date, month, day of week, and start time of each ride that will allow us to perform our analysis:
tripdata$date <- as.Date(tripdata$started_at) #extract date (YYYY-MM-DD)
tripdata$month <- format(as.Date(tripdata$date), "%B") #extract month (full name)
tripdata$day_of_week <- format(as.Date(tripdata$date), "%A") #extract day of week (full name)
tripdata$start_time <- strftime(tripdata$started_at, "%H") #extract start time in decimal hours (24 hour)
Add a calculated column ‘ride length’ for all trips (in seconds):
tripdata$ride_length <- difftime(tripdata$ended_at, tripdata$started_at)
is.factor(tripdata$ride_length)
## [1] FALSE
tripdata$ride_length <- as.numeric(as.character(tripdata$ride_length))
is.numeric(tripdata$ride_length)
## [1] TRUE
Add a calculated column ‘ride distance’ for all trips (in km):
tripdata$ride_distance <- distGeo(matrix(c(tripdata$start_lng, tripdata$start_lat), ncol = 2), matrix(c(tripdata$end_lng, tripdata$end_lat), ncol = 2))
tripdata$ride_distance <- tripdata$ride_distance/1000 #distance in km
Check summary statistics of added columns:
summary(tripdata)
## ride_id bike_type started_at
## Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00
## Mode :character Mode :character Median :2022-07-22 15:03:59.00
## Mean :2022-07-20 07:21:18.74
## 3rd Qu.:2022-09-16 07:21:29.00
## Max. :2022-12-31 23:59:26.00
##
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5667717 Length:5667717
## 1st Qu.:2022-05-28 19:43:07.00 Class :character Class :character
## Median :2022-07-22 15:24:44.00 Mode :character Mode :character
## Mean :2022-07-20 07:40:45.33
## 3rd Qu.:2022-09-16 07:39:03.00
## Max. :2023-01-02 04:56:45.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng customer_type date
## Min. : 0.00 Min. :-88.14 Length:5667717 Min. :2022-01-01
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-05-28
## Median :41.90 Median :-87.64 Mode :character Median :2022-07-22
## Mean :41.90 Mean :-87.65 Mean :2022-07-19
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2022-09-16
## Max. :42.37 Max. : 0.00 Max. :2022-12-31
## NA's :5858 NA's :5858
## month day_of_week start_time ride_length
## Length:5667717 Length:5667717 Length:5667717 Min. :-621201
## Class :character Class :character Class :character 1st Qu.: 349
## Mode :character Mode :character Mode :character Median : 617
## Mean : 1167
## 3rd Qu.: 1108
## Max. :2483235
##
## ride_distance
## Min. : 0.000
## 1st Qu.: 0.873
## Median : 1.575
## Mean : 2.140
## 3rd Qu.: 2.781
## Max. :9817.319
## NA's :5858
There are few “bad data” problems that will need to be cleaned:
tripdata_clean <- tripdata %>%
filter(ride_length > 60 & ride_length < 32400) %>%
filter(!grepl("repair", start_station_name, ignore.case = TRUE)) %>%
filter(!grepl("repair", end_station_name, ignore.case = TRUE)) %>%
filter(!is.na(ride_distance)) %>%
filter(ride_distance > 0.1 & ride_distance < 100)
Check summary statistics of clean df:
summary(tripdata_clean)
## ride_id bike_type started_at
## Length:5232703 Length:5232703 Min. :2022-01-01 00:01:00.00
## Class :character Class :character 1st Qu.:2022-05-29 00:56:11.00
## Mode :character Mode :character Median :2022-07-22 20:38:59.00
## Mean :2022-07-20 15:59:30.03
## 3rd Qu.:2022-09-16 16:55:10.00
## Max. :2022-12-31 23:59:26.00
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:04:39.0 Length:5232703 Length:5232703
## 1st Qu.:2022-05-29 01:16:09.5 Class :character Class :character
## Median :2022-07-22 20:56:14.0 Mode :character Mode :character
## Mean :2022-07-20 16:14:58.7
## 3rd Qu.:2022-09-16 17:11:26.0
## Max. :2023-01-01 03:06:36.0
## end_station_name end_station_id start_lat start_lng
## Length:5232703 Length:5232703 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
## end_lat end_lng customer_type date
## Min. :41.55 Min. :-88.14 Length:5232703 Min. :2022-01-01
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-05-29
## Median :41.90 Median :-87.64 Mode :character Median :2022-07-22
## Mean :41.90 Mean :-87.65 Mean :2022-07-20
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2022-09-16
## Max. :42.37 Max. :-87.30 Max. :2022-12-31
## month day_of_week start_time ride_length
## Length:5232703 Length:5232703 Length:5232703 Min. : 61.0
## Class :character Class :character Class :character 1st Qu.: 369.0
## Mode :character Mode :character Mode :character Median : 626.0
## Mean : 928.8
## 3rd Qu.: 1089.0
## Max. :32386.0
## ride_distance
## Min. : 0.100
## 1st Qu.: 1.032
## Median : 1.694
## Mean : 2.297
## 3rd Qu.: 2.926
## Max. :42.383
Now that the Cyclistic data is stored appropriately and has been prepared for analysis, let’s start putting it to work. Key tasks in the analysis phase include:
Let’s compare total number of trips and percentage distribution by customer type:
tripdata_clean %>%
group_by(customer_type) %>%
summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(tripdata_clean)) * 100);
## # A tibble: 2 × 3
## customer_type ride_count ride_percentage
## <chr> <int> <dbl>
## 1 casual 2079861 39.7
## 2 member 3152842 60.3
Cyclistic members made up 60.3% of total trips while casual riders made up 39.7%. Cyclistic members made about ~20% more trips than casual riders in the 2022 calendar year.
Let’s analyze total number of trips via e-bike and percentage distribution of all trips:
electric_trips <- tripdata_clean %>%
filter(bike_type == "electric_bike")
(nrow(electric_trips) / nrow(tripdata_clean)) * 100
## [1] 50.71633
Check summary statistics of ride length:
summary(tripdata_clean$ride_length);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 369.0 626.0 928.8 1089.0 32386.0
Compare summary statistics of ride length by customer type:
tripdata_clean %>%
group_by(customer_type) %>%
summarise(min_ride_length = min(ride_length)/60,
median_ride_length = median(ride_length)/60,
avg_ride_length = mean(ride_length)/60,
max_ride_length = max(ride_length)/60);
## # A tibble: 2 × 5
## customer_type min_ride_length median_ride_length avg_ride_length max_ride_le…¹
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 casual 1.02 13.0 20.2 540.
## 2 member 1.02 9.05 12.3 540.
## # … with abbreviated variable name ¹max_ride_length
The average ride length of casual riders was 20.2 minutes (2022) – nearly double that of annual members at 12.3 minutes. We can hypothesize that casual riders use Cyclistic for leisure, such as exploring the city, whereas members use Cyclistic for non-recreational activities, such as commuting to work or school, running errands, etc.
Let’s analyze our usage metrics by days of the week for both customer types. First order the days of the week from Monday to Sunday:
tripdata_clean$day_of_week <- ordered(tripdata_clean$day_of_week,
levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"));
Compare total trips by customer type in each day of the week:
tripdata_clean %>%
group_by(day_of_week, customer_type) %>%
summarise(number_of_rides = n()) %>%
arrange(day_of_week, customer_type);
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: day_of_week [7]
## day_of_week customer_type number_of_rides
## <ord> <chr> <int>
## 1 Monday casual 245924
## 2 Monday member 445756
## 3 Tuesday casual 237535
## 4 Tuesday member 490873
## 5 Wednesday casual 247997
## 6 Wednesday member 495727
## 7 Thursday casual 279847
## 8 Thursday member 503149
## 9 Friday casual 302127
## 10 Friday member 440151
## 11 Saturday casual 423211
## 12 Saturday member 415613
## 13 Sunday casual 343220
## 14 Sunday member 361573
Compare average ride length by customer type in each day of the week:
tripdata_clean %>%
group_by(day_of_week, customer_type) %>%
summarise(avg_ride_length = mean(ride_length)) %>%
arrange(day_of_week, customer_type);
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: day_of_week [7]
## day_of_week customer_type avg_ride_length
## <ord> <chr> <dbl>
## 1 Monday casual 1225.
## 2 Monday member 712.
## 3 Tuesday casual 1078.
## 4 Tuesday member 702.
## 5 Wednesday casual 1047.
## 6 Wednesday member 708.
## 7 Thursday casual 1081.
## 8 Thursday member 718.
## 9 Friday casual 1151.
## 10 Friday member 731.
## 11 Saturday casual 1378.
## 12 Saturday member 826.
## 13 Sunday casual 1383.
## 14 Sunday member 819.
Let’s order the months of the year from January to December:
tripdata_clean$month <- ordered(tripdata_clean$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"));
Now compare total rides by customer type in each month of the year:
tripdata_clean %>%
group_by(customer_type, month) %>%
summarise(number_of_rides = n(), .groups="drop") %>%
arrange(customer_type, month) %>%
print(n = 24);
## # A tibble: 24 × 3
## customer_type month number_of_rides
## <chr> <ord> <int>
## 1 casual January 16685
## 2 casual February 19184
## 3 casual March 79258
## 4 casual April 112284
## 5 casual May 248043
## 6 casual June 330262
## 7 casual July 361847
## 8 casual August 322034
## 9 casual September 268749
## 10 casual October 189134
## 11 casual November 91476
## 12 casual December 40905
## 13 member January 80322
## 14 member February 88026
## 15 member March 181928
## 16 member April 227024
## 17 member May 332793
## 18 member June 379145
## 19 member July 394190
## 20 member August 402996
## 21 member September 382292
## 22 member October 330212
## 23 member November 224789
## 24 member December 129125
The tibble shows us that:
Let’s compare average ride length by customer type in each month of the year:
tripdata_clean %>%
group_by(month, customer_type) %>%
summarise(avg_ride_length = mean(ride_length), .groups="drop") %>%
arrange(month, customer_type) %>%
print(n = 24);
## # A tibble: 24 × 3
## month customer_type avg_ride_length
## <ord> <chr> <dbl>
## 1 January casual 935.
## 2 January member 647.
## 3 February casual 1068.
## 4 February member 650.
## 5 March casual 1290.
## 6 March member 695.
## 7 April casual 1265.
## 8 April member 681.
## 9 May casual 1389.
## 10 May member 782.
## 11 June casual 1295.
## 12 June member 814.
## 13 July casual 1302.
## 14 July member 808.
## 15 August casual 1207.
## 16 August member 784.
## 17 September casual 1127.
## 18 September member 754.
## 19 October casual 1053.
## 20 October member 689.
## 21 November casual 871.
## 22 November member 646.
## 23 December casual 747.
## 24 December member 621.
Create summary table for top 5 start stations by customer type:
summary_stations <- tripdata_clean %>%
mutate(station = start_station_name) %>%
drop_na(start_station_name) %>%
group_by(start_station_name, customer_type) %>%
summarise(number_of_trips = n()) %>%
arrange(customer_type, desc(number_of_trips)) %>%
group_by(customer_type) %>%
filter(rank(desc(number_of_trips)) <= 5) %>%
ungroup();
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
Top 5 start stations for casual riders:
Top 5 start stations for annual members: * (1): Kingsbury St & Kinzie St (24,293 trips) * (2): Clark St & Elm St (21,083 trips) * (3): Wells St & Concord Ln (20,667 trips) * (4): Clinton St & Washington Blvd (19,268 trips) * (5): University Ave & 57th St (18,974 trips)
Now let’s create a new data frame that contains information about the most popular bike routes (> 300 trips), categorized by starting and ending coordinates, customer type, and bike type:
popular_routes <- tripdata_clean %>%
filter(start_lat != end_lat & start_lng != end_lng) %>%
group_by(start_lat, start_lng, end_lat, end_lng, customer_type, bike_type) %>%
summarise(total_rides = n(), .groups ="drop") %>%
filter(total_rides > 300);
And create two separate data frames, one for each customer type, that we will use to visualize popular routes over the Chicago map:
casual_riders <- popular_routes %>% filter(customer_type == "casual");
annual_members <- popular_routes %>% filter(customer_type == "member");
Set up ggmap and store map of Chicago (bbox, stamen map):
chicago <- c(left = -87.70, bottom = 41.77, right = -87.55, top = 41.97);
chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain");
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
After analyzing Cyclistic’s historical trip data in 2022, it is clear that annual members and casual riders use Cyclistic bikes differently. Annual members made up a majority of all trips (60%) and they tend to use the bikes more during weekdays, especially during commuting hours. Casual riders, on the other hand, prefer to ride on the weekends, and they tend to use the bikes more for leisure and recreation.
Another key difference between the two customer types is that casual riders tend to ride for longer periods and distances, which suggests that they are more likely to use the bikes for sightseeing or exploring the city. This behavior is also more noticeable in late spring/early summer when casual riders tend to ride for longer periods and distances.
The data also shows that bike usage is positively correlated with temperature in Chicago, with more trips being made during the warmer summer months and fewer trips during the colder winter months. This finding suggests that Cyclistic may want to adjust its marketing strategy to promote bike usage during the late spring and early summer months.
Furthermore, the data shows that casual riders tend to make more trips in downtown Chicago (Millenium Park, Navy Pier) and in Hyde Park, whereas annual members are more scattered throughout the city. This insight may be useful in developing targeted marketing campaigns to convert casual riders into annual members, such as advertising in docking stations frequently used by casual riders.
Deliverable:
My top three recommendations based on my analysis:
(1): Offer a weekend-only annual membership at a lower rate than the annual membership: Annual members are more profitable for Cyclistic, but they are also more cost-effective for frequent riders. For example, an annual membership provides unlimited 45-minute rides for a flat fee, while casual riders pay per ride or per day. And since a majority of casual riders tend to ride the bikes mostly on the weekends, a weekend-only annual membership offered at a lower price point would encourage casual riders to switch to an annual plan while still reaping the cost-saving benefits.
(2): Offer targeted promotions or discounts to frequent riders: Since casual riders are already familiar with the Cyclistic program, offering targeted promotions and discounts to frequent riders can encourage them to upgrade to an annual membership. For example, Cyclistic could provide a discount on annual memberships to casual riders who have taken a certain number of rides in a given month, or offer a free month of membership to those who sign up during a specific promotion period. Additionally, Cyclistic should use social media, email, and other online channels to target casual riders, preferably right before spring break (when there is a spike in casual riders) and during the peak summer season.
(3): Implement a referral program: A referral program can be an effective way to attract new customers while also retaining current ones. Cyclistic could offer current members incentives for referring casual riders to join the program, such as free ride credits or even discounts for the usage of electric bikes, which made up over 50% of total trips in 2022. Based on our analysis, both casual riders and annual members are also more likely to ride bikes during the summer months. Cyclistic could leverage this insight by increasing its social media presence during the summer months to promote the program and reach potential customers.
Note: Additional data we could use to expand on our findings
All ride ids in the data set are unique so we cannot conclude if the same person made multiple trips. Pricing details for casual riders and annual members in order to optimize cost structure and provide discounts without affecting the margin. Data on bike availability and demand levels across stations. Based on the analysis, casual riders tend to make more trips in downtown Chicago and Hyde Park, while annual members are more scattered throughout the city. Cyclistic could consider redistributing bikes from less used stations to high demand areas to ensure that bikes are available when and where they are needed most. Cyclistic could also consider expanding to other areas within the state of Illinois or even expanding to other states, depending on the demand for bike-share programs in those areas.