You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Design marketing strategies targeted at converting casual riders into annual members.
Identify the difference in ways casual riders and annual riders at Cyclistic use bikes.
Lily Moreno, Marketing Director
Cyclistic Executive Team
The Cyclistic Marketing Analytics Team
Cyclistic aims to design marketing strategies targeted at converting casual riders into annual members. In order to do this, the following questions must be analyzed:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
In this analysis, we have been tasked by the Director of Marketing, Lily Moreno to answer only the first question listed above.
This case study uses Cyclistic’s historical trip data (previous 12 months) to analyze and identify trends.
The data has been made available by Motivate International Inc. under and open license. The data can be downloaded here.
This data is reliable, original, comprehensive and current as it is internally collected and stored safely by Cyclistic from August 2021 to July 2022.
Personally identifiable information such as credit card numbers has been removed because of data-privacy issues.
The data selected for use covers the last 12 months from August 2021 to July 2022.
Each month has a separate dataset. The datasets are organized in tabular format and have 13 identical columns.
Combined, the datasets have 4073561 rows.
The member_casual column will allow me to group, aggregate and compare trends between casual riders and member riders
Reliability - The data is unbiased and complete.
Original - The source of the data is first party.
Comprehensive - There is data on ‘Usertype’, and ‘Start time’ and ‘End time’ of trips as well as other fields to perform a sufficient analysis.
Current - The data is the latest 12 months available.
Cited -The data has been made available by Motivate International Inc. under this license.
install.packages(“tidyverse”)
install.packages(“modeest”)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readxl)
library(readr)
library(modeest)
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
library(tidyr)
aug_2021 <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2021 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2021 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_2021 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2021 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan_2022 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb_2022 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2022 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are no data type inconsistencies among each of the 12 dataframes, so it is ready to be aggregated.
The R code below bind all rows of each dataframe into 1 large data frame, ready for processing.
all_trips_raw <- bind_rows(aug_2021,sep_2021,oct_2021,nov_2021,dec_2021,jan_2022,feb_2022,mar_2022,apr_2022,may_2022,jun_2022,jul_2022)
Adding columns to separate started_at column values into date, day, month, year, and day of week.
all_trips_raw$date <- as.Date(all_trips_raw$started_at)
all_trips_raw$month <- format(as.Date(all_trips_raw$date), "%m")
all_trips_raw$day <- format(as.Date(all_trips_raw$date), "%d")
all_trips_raw$year <- format(as.Date(all_trips_raw$date), "%Y")
all_trips_raw$day_of_week <- format(as.Date(all_trips_raw$date), "%A")
Changed the column names to more fitting names, to improve ease of analysis.
all_trips_raw <- all_trips_raw %>%
rename(trip_id = ride_id, ride_type = rideable_type, start_time = started_at, end_time = ended_at, usertype = member_casual)
Removed unnecessary columns for our analysis (start_lat,end_lat,start_lng,end_lng), and also added column, ride_length, that calculates the length of a ride in seconds.
all_trips_raw$ride_length <- as.numeric(difftime(all_trips_raw$end_time,all_trips_raw$start_time))
all_trips_raw <- all_trips_raw %>%
select(-c(start_lat,start_lng,end_lat,end_lng))
Removing all NA row
all_trips_drop_na <- drop_na(all_trips_raw)
all_trips <- all_trips_drop_na
Inspecting the data frame
summary(all_trips)
## trip_id ride_type start_time
## Length:4629230 Length:4629230 Min. :2021-08-01 00:00:04.00
## Class :character Class :character 1st Qu.:2021-09-24 08:17:45.75
## Mode :character Mode :character Median :2022-02-11 17:14:55.00
## Mean :2022-01-30 14:23:23.00
## 3rd Qu.:2022-06-06 01:10:14.00
## Max. :2022-07-31 23:59:29.00
## end_time start_station_name start_station_id
## Min. :2021-08-01 00:03:30.00 Length:4629230 Length:4629230
## 1st Qu.:2021-09-24 08:30:59.25 Class :character Class :character
## Median :2022-02-11 17:25:54.50 Mode :character Mode :character
## Mean :2022-01-30 14:41:53.27
## 3rd Qu.:2022-06-06 01:48:22.50
## Max. :2022-08-01 13:26:03.00
## end_station_name end_station_id usertype date
## Length:4629230 Length:4629230 Length:4629230 Min. :2021-08-01
## Class :character Class :character Class :character 1st Qu.:2021-09-24
## Mode :character Mode :character Mode :character Median :2022-02-11
## Mean :2022-01-29
## 3rd Qu.:2022-06-06
## Max. :2022-07-31
## month day year day_of_week
## Length:4629230 Length:4629230 Length:4629230 Length:4629230
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## ride_length
## Min. : -7745
## 1st Qu.: 384
## Median : 673
## Mean : 1110
## 3rd Qu.: 1211
## Max. :2497750
Creating a new version of the data frame by removing all negative values.
all_trips_2 <- all_trips[!(all_trips$ride_length < 0),]
summary(all_trips_2)
## trip_id ride_type start_time
## Length:4629119 Length:4629119 Min. :2021-08-01 00:00:04.00
## Class :character Class :character 1st Qu.:2021-09-24 08:18:19.50
## Mode :character Mode :character Median :2022-02-11 17:19:14.00
## Mean :2022-01-30 14:26:00.58
## 3rd Qu.:2022-06-06 01:16:12.50
## Max. :2022-07-31 23:59:29.00
## end_time start_station_name start_station_id
## Min. :2021-08-01 00:03:30.00 Length:4629119 Length:4629119
## 1st Qu.:2021-09-24 08:31:09.50 Class :character Class :character
## Median :2022-02-11 17:29:58.00 Mode :character Mode :character
## Mean :2022-01-30 14:44:30.91
## 3rd Qu.:2022-06-06 01:54:02.50
## Max. :2022-08-01 13:26:03.00
## end_station_name end_station_id usertype date
## Length:4629119 Length:4629119 Length:4629119 Min. :2021-08-01
## Class :character Class :character Class :character 1st Qu.:2021-09-24
## Mode :character Mode :character Mode :character Median :2022-02-11
## Mean :2022-01-29
## 3rd Qu.:2022-06-06
## Max. :2022-07-31
## month day year day_of_week
## Length:4629119 Length:4629119 Length:4629119 Length:4629119
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## ride_length
## Min. : 0
## 1st Qu.: 384
## Median : 673
## Mean : 1110
## 3rd Qu.: 1211
## Max. :2497750
Identifying the outliers and saving them to their own data frame
outliers <- boxplot(all_trips_2$ride_length, plot = FALSE)$out
This data frame will be excluded from our dataset.
The which() function tells us the rows in which the
outliers exist, these rows are then removed from our dataset. However,
before removing them, we store all_trips_2 in a variable to
ensure we don’t lose the dataset.
x <- all_trips_2
x <- x[-which(x$ride_length %in% outliers),]
All significant outliers is now removed of and we can now proceed to our analysis
summary(x$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 365.0 622.0 766.8 1042.0 2451.0
Comparing members and casual users
aggregate(x$ride_length/60 ~ x$usertype, FUN = mean)
aggregate(x$ride_length/60 ~ x$usertype, FUN = median)
Using the statistical mode function, we see that the most common day for renting bikes is Saturday for casual riders and Wednesday for annual members.
aggregate(x$day_of_week ~ x$usertype, FUN = mfv)
Observing average ride length by each day for member vs casual user types.
#Outputs day_of_week column in proper order
x$day_of_week <- ordered(x$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Outputs average ride_length based on usertype and day of the week
aggregate(x$ride_length/60 ~ x$usertype + x$day_of_week, FUN = mean)
Now, we analyze the number of rides by user type and weekday and visualize it
x %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = usertype)) + geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
`summarize()` has grouped output by ‘usertype’. You can override using the `.groups` argument.
Analyzing average duration by user type and weekday
x %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = usertype)) + geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
`summarize()` has grouped output by ‘usertype’. You can override using the `.groups` argument.
Casual riders average longer rides than annual members on any given day of the week. This is most likely explained by the assumption that annual members use Cyclistic rentals for shorter, direct commutes to work or school, whereas casual riders will use bikes recreational.
Casual riders use Cyclistic rentals much more on weekends (Saturday, Sunday) compared to weekdays, and vice-versa for annual members, further supporting the first observation made.
Annual members rent most often on weekdays, most likely due to work or school commutes. This is supported by the small variation in average ride duration for annual members on weekdays. (Table 2)
Regarding casual riders, as seen in Table 1, casual rides are relatively low on weekdays, with a small increase on Friday and a large peak on Saturday and Sunday. This suggests most casual rides are done for leisure and recreational reasons. Table 2 further supports our assumptions due to the significantly longer ride duration among casual riders, especially on weekends.
The following three recommendations for a marketing strategy aimed at converting casual riders to annual members are as follows:
We recommend more marketing campaigns and advertisements to be done in summer since it’s peak usage season for both members and casual riders.
Most of rides takes place during saturday and sunday for members as well as casual riders so we have to run advertise campaign on those days such that more casual riders will get to know about the program.
Offer sightseeing routes, locations of interest and much tourist appeal for annual members. Market the benefits to appeal to all riders.
Partner with local vendors for discounts on their products or services if proof is shown of annual membership. Select vendors related to activities and behaviors most relevant to leisure and recreation on weekends.