In today’s competitive urban mobility sector, long-term customer retention is a key success factor. For Cyclistic, a leading bike-share company in Chicago, the greatest opportunity for future growth lies in converting occasional users into profitable annual members.
This data analysis was initiated to answer the fundamental question: How does the use of Cyclistic bikes differ between annual members and occasional riders?
Based on the vision of Marketing Director Lily Moreno, this project examines historical ride data from the last twelve months. Our goal is to gain clear, data-driven insights into the riding habits, time patterns, and preferred routes of both user groups. Identifying these differences is the first and essential step in developing a targeted marketing strategy that can effectively convince occasional riders of the benefits of an annual membership.
The following analysis follows the structured process of data analysis – Ask, Prepare & Process, Analyze, Share, and Act.
We need to understand how the use of Cyclistic bikes differs between annual members and occasional riders. This insight will form the basis of a marketing campaign aimed at converting occasional riders into members.
Lily Moreno, Marketing Director, commissioned the analysis for the marketing strategy. The Cyclistic executive team must approve the budget and campaign based on the analysis results.
Identify differences in bike usage behavior between annual members and occasional riders by analyzing historical ride data. The results will be used to develop targeted marketing initiatives aimed at converting occasional riders into annual members.
The Cyclistic ride data for 2019, i.e., from 12 months ago, is available. This data is publicly available and is provided by Motivate International Inc.
The data is organized in monthly CSV files.
We have used Cyclistic’s historical trip data to analyze and identify trends. For this we have make use of the information contained in the following ZIP files (available here):
ZIP file names Date Modified Size Divvy_Trips_2019_Q1.zip Nov 8th 2021, 04:05:26 pm 9.57 MB Divvy_Trips_2019_Q2.zip Nov 8th 2021, 04:06:12 pm 28.72 MB Divvy_Trips_2019_Q3.zip Jan 24th 2020, 10:08:06 am 42.36 MB Divvy_Trips_2019_Q4.zip Jan 24th 2020, 10:08:07 am 18.40 MB
The raw data comprises a total of approximately 3.8 million data records.
Attribut Format Description
ride_id numeric A unique identifier for each ride. started_at POSIXct Timestamp for the start of the ride. ended_at POSIXct Timestamp for the end of the ride. rideable_type numeric The type of bike (classic, electric, etc.) tripduration numeric The trip duration of the ride. start_station_id numeric The Unique identifier for the start station. start_station_name character The name of the start station. end_station_id numeric The Unique identifier for the end station. end_station_name character The name of the end stations. member_casual character The key field that distinguishes between ‘Subscriber’ (annual member) and ‘Customer’ (casual). gender character The gender of the rider. birthyear numeric The birthyear of the rider.
Reliable The data comes from a first-party source (the company’s own system). It is original data from the bicycle rental system.
Comprehensive It contains the required fields for analysis for the user types annual member and casual. It includes data from the entire year 2019.
Current The data is no longer entirely up to date, and the coronavirus pandemic may have led to changes in driver behaviour (e.g. working from home).
Cited The data is sourced from Motivate International Inc. under licence. There are two attributes related to individuals: year of birth and gender. The combination of year of birth (and age derived from this), gender and other attributes such as exact timestamps of journeys and exact start/end stations can override the pseudonymisation of the data and lead to re-identification. To avoid a data protection issue, the gender attribute is removed and the year of birth attribute is anonymised by transforming it into an age group.
Limitations The user behaviour of individual drivers cannot be tracked due to a lack of ID. This means it is not possible to determine whether casual users have purchased tickets repeatedly.
The programming language R is used for this analysis due to its powerful data manipulation and visualisation libraries. The volume of data also speaks in favour of using R.
The following packages of the programming language R are used below:
Library Description
tidyverse Easily Install and Load the ‘Tidyverse’ dplyr A Grammar of Data Manipulation lubridate Make Dealing with Dates a Little Easier conflicted An Alternative Conflict Resolution Strategy ggplot2 Create Elegant Data Visualisations Using the Grammar of Graphics caret Classification and Regression Training rpart Recursive Partitioning and Regression Trees rpart.plot Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’ ggpubr ‘ggplot2’ Based Publication Ready Plots
library(tidyverse)
library(dplyr)
library(lubridate)
library(conflicted)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(ggpubr)
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
Define the work-space.
## [1] "Divvy_Trips_2019_Q1.csv" "Divvy_Trips_2019_Q2.csv"
## [3] "Divvy_Trips_2019_Q3.csv" "Divvy_Trips_2019_Q4.csv"
## [5] "Divvy_Trips_2020_Q1.csv"
The path contains five CSV files, the first four of which are imported for the year 2019.
The four CSV files are merged into a single DataFrame.
Unfortunately, different column names were used in the CSV files, so harmonisation is necessary.
colnames(df_q1)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(df_q2)
## [1] "01 - Rental Details Rental ID"
## [2] "01 - Rental Details Local Start Time"
## [3] "01 - Rental Details Local End Time"
## [4] "01 - Rental Details Bike ID"
## [5] "01 - Rental Details Duration In Seconds Uncapped"
## [6] "03 - Rental Start Station ID"
## [7] "03 - Rental Start Station Name"
## [8] "02 - Rental End Station ID"
## [9] "02 - Rental End Station Name"
## [10] "User Type"
## [11] "Member Gender"
## [12] "05 - Member Details Member Birthday Year"
colnames(df_q3)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(df_q4)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
Let us start by renaming the columns of the df_q1, df_q2, df_q3 and df_q4 data-frames.
df_q1 <- rename(df_q1,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
# q1_2020 does not have tripdurarion, gender and birthyear columns, so we use the names in q4_2019
df_q2 <- rename(df_q2,
ride_id = "01 - Rental Details Rental ID",
rideable_type = "01 - Rental Details Bike ID",
tripduration = "01 - Rental Details Duration In Seconds Uncapped",
started_at = "01 - Rental Details Local Start Time",
ended_at = "01 - Rental Details Local End Time",
start_station_name = "03 - Rental Start Station Name",
start_station_id = "03 - Rental Start Station ID",
end_station_name = "02 - Rental End Station Name",
end_station_id = "02 - Rental End Station ID",
member_casual = "User Type",
gender = "Member Gender",
birthyear = "05 - Member Details Member Birthday Year")
df_q3 <- rename(df_q3,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
df_q4 <- rename(df_q4,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
## [1] "The columns of the four data frames are harmonised so that a merge of raw data can be performed."
Merge into a data.frame ‘df_trips’
df_trips <- rbind(df_q1, df_q2, df_q3, df_q4)
rm(df_q1)
rm(df_q2)
rm(df_q3)
rm(df_q4)
Properties of the data frame ‘df_trips’
dim_trips_raw <- dim(df_trips)
## The data-frame “df_trips” contains 3818004 data records with 12 attributes.
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "tripduration" "start_station_id"
## [7] "start_station_name" "end_station_id" "end_station_name"
## [10] "member_casual" "gender" "birthyear"
str(df_trips)
## spc_tbl_ [3,818,004 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : num [1:3818004] 21742443 21742444 21742445 21742446 21742447 ...
## $ started_at : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
## $ ended_at : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
## $ rideable_type : num [1:3818004] 2167 4386 1524 252 1170 ...
## $ tripduration : num [1:3818004] 390 441 829 1783 364 ...
## $ start_station_id : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
## $ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
## $ end_station_id : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
## $ end_station_name : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
## $ member_casual : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr [1:3818004] "Male" "Female" "Female" "Male" ...
## $ birthyear : num [1:3818004] 1989 1990 1994 1993 1994 ...
## - attr(*, "spec")=
## .. cols(
## .. trip_id = col_double(),
## .. start_time = col_datetime(format = ""),
## .. end_time = col_datetime(format = ""),
## .. bikeid = col_double(),
## .. tripduration = col_number(),
## .. from_station_id = col_double(),
## .. from_station_name = col_character(),
## .. to_station_id = col_double(),
## .. to_station_name = col_character(),
## .. usertype = col_character(),
## .. gender = col_character(),
## .. birthyear = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
First, the attribute “gender” is removed to comply with data protection regulations.
df_trips <- df_trips %>% select(-gender)
Since the birthday date is used to form age groups, the missing values are filled in with the median birthday for the identical connection from the start station to the end station with a date, so that the data records are retained for analysis.
df_trips <- df_trips %>%
group_by(start_station_id, end_station_id) %>%
mutate(
birthyear = ifelse(
is.na(birthyear),
median(birthyear, na.rm = TRUE),
birthyear
)
) %>%
ungroup()
All other missing values result in the data record being deleted.
df_trips <- df_trips %>%
na.omit()
df_trips <- df_trips %>%
mutate(
age = year(started_at) - birthyear,
age_group = case_when(
age < 18 ~ "17_under",
age >= 18 & age < 36 ~ "18-35",
age >= 36 & age < 51 ~ "36-50",
age >= 51 & age < 66 ~ "51-65",
age >= 66 & age < 76 ~ "66-75",
age >= 76 & age < 91 ~ "76-90",
age >= 91 ~ "91_plus",
TRUE ~ "NA"
)
) %>%
select(c(-birthyear, -age))
Then the attribute “age_group” is generated based on the age from the year of birth. The columns “birthyear” and “age” are removed.
Status
## Of the raw data with 3818004 there are still 3805811 data sets available for analysis.
colnames(df_trips)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "tripduration" "start_station_id"
## [7] "start_station_name" "end_station_id" "end_station_name"
## [10] "member_casual" "age_group"
Check of the target variable ‘member_casual’:
df_trips <- df_trips %>%
mutate(
member_casual = as.factor(member_casual)
)
if (sum(!is.na(df_trips$member_casual)) == dim(df_trips)[1]){
cat("The target variable ‘member_casual’ contains no missing values and the values are 'Subscriber' und 'Customer'.")
} else {
cat("There are missing values in the variable ‘member_casual’!")
}
## The target variable ‘member_casual’ contains no missing values and the values are 'Subscriber' und 'Customer'.
with(df_trips, table(member_casual, useNA = "always"))
## member_casual
## Customer Subscriber <NA>
## 868524 2937287 0
Check missing values
No more missing values are found.
sum(is.na(df_trips$started_at)) +
sum(is.na(df_trips$ended_at)) +
sum(is.na(df_trips$start_station_name)) +
sum(is.na(df_trips$start_station_id)) +
sum(is.na(df_trips$end_station_name)) +
sum(is.na(df_trips$end_station_id)) +
sum(is.na(df_trips$tripduration))
## [1] 0
Check invalid data
Checking whether the return date is before the start date of a trip.
df_trips_err <- df_trips %>%
filter( started_at > ended_at )
cat("There are", dim(df_trips_err)[1], "data entry errors that will be deleted!")
## There are 12 data entry errors that will be deleted!
df_trips <- df_trips %>%
filter( started_at < ended_at )
Check outliers
summary(df_trips$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61 410 707 1360 1276 9056633
## The maximum value of 9056633 for the attribute 'tripduration' implies that an outlier analysis is necessary!
detect_iqr_outliers <- function(x, multiplier = 1.5) {
qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_val <- IQR(x, na.rm = TRUE)
lower_bound <- qnt[1] - multiplier * iqr_val
upper_bound <- qnt[2] + multiplier * iqr_val
list(
outliers = x < lower_bound | x > upper_bound,
bounds = c(lower = lower_bound, upper = upper_bound),
method = paste0("IQR_", multiplier)
)
}
# weak outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 1.5)
df_trips_outliers <- df_trips[iqr_result$outliers,]
cat("There are", dim(df_trips_outliers)[1], "outliers detected for the attribute 'tripduration'!")
## There are 263144 outliers detected for the attribute 'tripduration'!
table(df_trips_outliers$member_casual)
##
## Customer Subscriber
## 227661 35483
summary(df_trips_outliers$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2576 3085 3952 8704 5782 9056633
If the loan lasts for more than one day, then these entries, i.e. day trips, do not fit into the population of loans by annual members.
df_trips_over_days <- df_trips_outliers %>%
filter(date(as.POSIXct(started_at)) != date(as.POSIXct(ended_at))) %>%
arrange(desc(tripduration)) %>%
select(c(ride_id, started_at, ended_at, tripduration, member_casual))
cat("There are", dim(df_trips_over_days)[1], "outliers were detected over several days for the attribute 'tripduration'!")
## There are 7977 outliers were detected over several days for the attribute 'tripduration'!
head(df_trips_over_days)
## # A tibble: 6 × 5
## ride_id started_at ended_at tripduration member_casual
## <dbl> <dttm> <dttm> <dbl> <fct>
## 1 23798837 2019-07-16 18:17:58 2019-10-29 14:01:52 9056633 Subscriber
## 2 23889502 2019-07-22 12:09:36 2019-10-25 10:56:54 8203637 Subscriber
## 3 23710210 2019-07-12 13:08:10 2019-10-12 10:32:18 7939448 Customer
## 4 24107150 2019-08-01 18:45:47 2019-10-29 12:36:57 7667469 Customer
## 5 23661214 2019-07-10 10:25:13 2019-10-06 11:26:25 7606871 Customer
## 6 23801970 2019-07-16 20:44:13 2019-10-10 18:01:25 7420632 Customer
df_trips <- df_trips %>%
filter(date(as.POSIXct(started_at)) == date(as.POSIXct(ended_at)))
dim(df_trips)
## [1] 3790489 11
summary(df_trips$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61 409 704 1091 1269 82385
## Since the maximum value for trip duration is still very high at 82385 minutes, all strong outliers for the attribute ‘tripduration’ are identified and removed for further investigation!
# strong outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 3.0)
df_trips_outliers_strong <- df_trips[iqr_result$outliers,]
df_trips <- df_trips[!iqr_result$outliers,]
df_trips_outliers_strong %>%
arrange(desc(tripduration))
## # A tibble: 130,260 × 11
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 24820137 2019-09-07 00:51:15 2019-09-07 23:44:20 3771 82385
## 2 24478982 2019-08-20 00:41:42 2019-08-20 23:33:39 239 82317
## 3 23011586 2019-06-06 01:24:32 2019-06-06 23:47:21 6240 80569
## 4 22224947 2019-04-06 01:21:11 2019-04-06 23:40:51 5648 80380
## 5 22224950 2019-04-06 01:21:40 2019-04-06 23:40:23 965 80323
## 6 25632227 2019-11-04 01:22:09 2019-11-04 23:28:13 608 79564
## 7 24692637 2019-08-31 01:31:37 2019-08-31 23:03:40 4831 77522
## 8 21959434 2019-02-24 00:15:13 2019-02-24 21:41:28 6141 77175
## 9 24443879 2019-08-18 01:38:35 2019-08-18 23:03:33 1449 77098
## 10 22745218 2019-05-21 02:15:29 2019-05-21 23:34:14 264 76725
## # ℹ 130,250 more rows
## # ℹ 6 more variables: start_station_id <dbl>, start_station_name <chr>,
## # end_station_id <dbl>, end_station_name <chr>, member_casual <fct>,
## # age_group <chr>
summary(df_trips$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 401.0 680.0 892.2 1184.0 3849.0
# weak outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 1.5)
df_trips_outliers_weak <- df_trips[iqr_result$outliers,]
df_trips_outliers_weak %>%
arrange(desc(tripduration))
## # A tibble: 169,311 × 11
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 22230184 2019-04-06 14:11:07 2019-04-06 15:15:16 5104 3849
## 2 22232802 2019-04-06 16:32:27 2019-04-06 17:36:36 3964 3849
## 3 22321327 2019-04-15 17:35:58 2019-04-15 18:40:07 4466 3849
## 4 22383452 2019-04-21 13:29:52 2019-04-21 14:34:01 6185 3849
## 5 22388265 2019-04-21 16:35:03 2019-04-21 17:39:12 6432 3849
## 6 22525053 2019-05-04 12:40:22 2019-05-04 13:44:31 4250 3849
## 7 22541373 2019-05-05 13:47:15 2019-05-05 14:51:24 4776 3849
## 8 22929432 2019-06-01 12:51:39 2019-06-01 13:55:48 935 3849
## 9 22979848 2019-06-04 14:11:59 2019-06-04 15:16:08 3745 3849
## 10 23071649 2019-06-08 19:31:52 2019-06-08 20:36:01 5830 3849
## # ℹ 169,301 more rows
## # ℹ 6 more variables: start_station_id <dbl>, start_station_name <chr>,
## # end_station_id <dbl>, end_station_name <chr>, member_casual <fct>,
## # age_group <chr>
table(df_trips_outliers_weak$member_casual)
##
## Customer Subscriber
## 125256 44055
The maximum value of approximately 40 minutes does not seem very unusual for a journey.
summary(df_trips$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 401.0 680.0 892.2 1184.0 3849.0
Add calculated fields to facilitate analysis.
# Kategorische Variablen in Faktoren umwandeln
df_trips <- df_trips %>%
mutate(
member_casual = as.factor(member_casual),
age_group = as.factor(age_group),
month = month(started_at),
month_name = case_when(
month == 01 ~ paste0("01 ", month.name[month]),
month == 02 ~ paste0("02 ", month.name[month]),
month == 03 ~ paste0("03 ", month.name[month]),
month == 04 ~ paste0("04 ", month.name[month]),
month == 05 ~ paste0("05 ", month.name[month]),
month == 06 ~ paste0("06 ", month.name[month]),
month == 07 ~ paste0("07 ", month.name[month]),
month == 08 ~ paste0("08 ", month.name[month]),
month == 09 ~ paste0("09 ", month.name[month]),
month == 10 ~ paste0("10 ", month.name[month]),
month == 11 ~ paste0("11 ", month.name[month]),
month == 12 ~ paste0("12 ", month.name[month]),
TRUE ~ "NA"
),
month = as.factor(month),
month_name = as.factor(month_name),
weekday_number = wday(as.POSIXct(ended_at)),
weekday_name = wday(as.POSIXct(ended_at), label = TRUE, abbr = FALSE),
weekday_name = as.factor(weekday_name),
started_at_hour = hour(floor_date(as.POSIXct(started_at), unit = "hours")),
started_at_hour = as.factor(started_at_hour),
ended_at_hour = hour(floor_date(as.POSIXct(ended_at), unit = "hours")),
ended_at_hour = as.factor(ended_at_hour),
tripduration = tripduration / 60,
start_station_id = as.factor(start_station_id),
end_station_id = as.factor(end_station_id),
start_station_name = as.factor(start_station_name),
end_station_name = as.factor(end_station_name)
)
colnames(df_trips)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "tripduration" "start_station_id"
## [7] "start_station_name" "end_station_id" "end_station_name"
## [10] "member_casual" "age_group" "month"
## [13] "month_name" "weekday_number" "weekday_name"
## [16] "started_at_hour" "ended_at_hour"
The following attributes are also generated:
Attribute Format Description
age_group character An age group based on sporting activities, e.g., cycling. Special needs can usually be assigned to age groups.
ended_at_hour character Time when a trip ends (derived from the end date of a trip)
month numeric Month when the trip took place (derived from the end
date of a trip)
month_name charcter Month name for the month
weekday_number numeric Day number (01=Sunday, 02=Monday, 03=Tuesday, 04=Wednesday, 05=Thursday, 06=Friday, 07=Saturday) weekday_name character Name for the day number
To obtain an overview of the status of the attributes after the data cleansing and transformation process, statistical properties as well as the relationship to the membership type (“customer” or “subscriber”) are documented.
mtyp <- summary(df_trips$member_casual)
mtyp
## Customer Subscriber
## 738182 2922047
## The ratio of trips by ‘customers’ to ‘subscribers’ is 100 : 396
with(df_trips, table(member_casual, age_group, useNA = "always"))
## age_group
## member_casual 17_under 18-35 36-50 51-65 66-75 76-90 91_plus <NA>
## Customer 1191 612073 102672 21152 1005 48 41 0
## Subscriber 1473 1835940 723862 327758 31168 858 988 0
## <NA> 0 0 0 0 0 0 0 0
## The 18-35 age group cycles the most.
summary(df_trips$started_at)
## Min. 1st Qu. Median
## "2019-01-01 00:04:37" "2019-05-28 15:15:19" "2019-07-25 18:42:48"
## Mean 3rd Qu. Max.
## "2019-07-19 20:40:10" "2019-09-16 08:53:17" "2019-12-31 23:57:17"
with(df_trips, table(member_casual, started_at_hour, useNA = "always"))
## started_at_hour
## member_casual 0 1 2 3 4 5 6 7 8
## Customer 7011 4641 2830 1637 995 2307 5466 11624 18953
## Subscriber 15785 8963 5294 3662 6595 33073 101939 224524 283519
## <NA> 0 0 0 0 0 0 0 0 0
## started_at_hour
## member_casual 9 10 11 12 13 14 15 16 17
## Customer 22802 34605 47076 56285 61114 64342 66658 71081 74364
## Subscriber 134679 100203 119226 135947 131239 127183 162533 292038 389527
## <NA> 0 0 0 0 0 0 0 0 0
## started_at_hour
## member_casual 18 19 20 21 22 23 <NA>
## Customer 59917 44398 30412 22132 18943 8589 0
## Subscriber 246245 157709 99356 70957 48363 23488 0
## <NA> 0 0 0 0 0 0 0
summary(df_trips$ended_at)
## Min. 1st Qu. Median
## "2019-01-01 00:11:07" "2019-05-28 15:33:01" "2019-07-25 19:00:39"
## Mean 3rd Qu. Max.
## "2019-07-19 20:55:02" "2019-09-16 09:05:11" "2019-12-31 23:59:18"
with(df_trips, table(member_casual, ended_at_hour, useNA = "always"))
## ended_at_hour
## member_casual 0 1 2 3 4 5 6 7 8
## Customer 4832 5138 3694 1916 1241 1855 4255 9271 16782
## Subscriber 13381 9800 6052 3737 5695 27226 85443 194857 292986
## <NA> 0 0 0 0 0 0 0 0 0
## ended_at_hour
## member_casual 9 10 11 12 13 14 15 16 17
## Customer 19156 28495 41231 52864 58998 62783 66455 69698 76235
## Subscriber 158805 100177 113508 134604 132664 125997 151709 256421 390692
## <NA> 0 0 0 0 0 0 0 0 0
## ended_at_hour
## member_casual 18 19 20 21 22 23 <NA>
## Customer 66109 50659 36086 24757 20659 15013 0
## Subscriber 274699 172660 109979 76174 52981 31800 0
## <NA> 0 0 0 0 0 0 0
summary(df_trips$rideable_type)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1730 3454 3382 5050 6946
summary(df_trips$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 6.683 11.333 14.870 19.733 64.150
head(summary(df_trips$start_station_name), 15)
## Streeter Dr & Grand Ave Canal St & Adams St
## 58987 53977
## Clinton St & Madison St Clinton St & Washington Blvd
## 49446 47869
## Lake Shore Dr & Monroe St Columbus Dr & Randolph St
## 43493 37243
## Michigan Ave & Washington St Franklin St & Monroe St
## 35026 33881
## Daley Center Plaza Kingsbury St & Kinzie St
## 33298 32866
## Lake Shore Dr & North Blvd Michigan Ave & Oak St
## 32225 30424
## Theater on the Lake Canal St & Madison St
## 29966 29515
## Millennium Park
## 29178
head(summary(df_trips$end_station_name), 15)
## Streeter Dr & Grand Ave Clinton St & Washington Blvd
## 69959 50480
## Canal St & Adams St Clinton St & Madison St
## 49853 46938
## Lake Shore Dr & North Blvd Michigan Ave & Washington St
## 39408 38166
## Lake Shore Dr & Monroe St Theater on the Lake
## 34397 33078
## Daley Center Plaza Michigan Ave & Oak St
## 32740 32493
## Millennium Park Kingsbury St & Kinzie St
## 32475 32303
## Franklin St & Monroe St Canal St & Madison St
## 29271 28285
## Clark St & Elm St
## 26288
with(df_trips, table(member_casual, weekday_name, useNA = "always"))
## weekday_name
## member_casual Sonntag Montag Dienstag Mittwoch Donnerstag Freitag Samstag
## Customer 140995 84950 75597 76940 86429 102134 171137
## Subscriber 254119 457027 495098 492425 484644 454526 284208
## <NA> 0 0 0 0 0 0 0
## weekday_name
## member_casual <NA>
## Customer 0
## Subscriber 0
## <NA> 0
## Casual riders tend to ride on weekends (Saturday and Sunday), while annual members use their bikes more frequently on weekdays.
with(df_trips, table(member_casual, month_name, useNA = "always"))
## month_name
## member_casual 01 January 02 February 03 March 04 April 05 May 06 June 07 July
## Customer 4018 2384 13889 39443 66920 107815 144988
## Subscriber 98137 93068 149187 216586 284254 342809 378973
## <NA> 0 0 0 0 0 0 0
## month_name
## member_casual 08 August 09 September 10 October 11 November 12 December <NA>
## Customer 155723 109691 61926 16918 14467 0
## Subscriber 400999 362231 299682 157911 138210 0
## <NA> 0 0 0 0 0 0
## In quarters 2 and 3, both casual riders and annual members take significantly more trips than in quarters 1 and 4.
How do the other attributes relate to the target variable ‘member_casual’?
The connection can be illustrated using a decision tree.
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
set.seed(123)
# 70:30 train : test
train_index <- createDataPartition(df_trips$member_casual, p = 0.7, list = FALSE)
train_data <- df_trips[train_index, ]
test_data <- df_trips[-train_index, ]
# decision tree with all features
tree_model <- rpart(
member_casual ~ started_at + ended_at + rideable_type + tripduration + start_station_name + end_station_name + age_group + month_name + weekday_name + started_at_hour + ended_at_hour,
data = train_data,
method = "class",
control = rpart.control(
minsplit = 20,
minbucket = 7,
cp = 0.01,
maxdepth = 10
)
)
# relevant attributs
var_importance <- tree_model$variable.importance
print(round(sort(var_importance, decreasing = TRUE)))
## tripduration end_station_name start_station_name started_at_hour
## 130163 61938 44473 506
## ended_at_hour age_group
## 441 23
Interpretation
Findings from the decision tree model:
This indicates fundamental differences in usage: members are likely to use the bikes for shorter, pragmatic trips, while casual riders undertake longer leisure trips.
This suggests different usage patterns in terms of start and destination locations. Members are likely to commute between fixed stations, while casual riders use more flexible routes.
Shows different usage times – members likely to commute during rush hour, casual riders during leisure time
Age plays a minor role compared to behavior patterns.
# predictions on test data
predictions <- predict(tree_model, test_data, type = "class")
# Konfusionsmatrix und Metriken
confusion_matrix <- confusionMatrix(predictions, test_data$member_casual)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Customer Subscriber
## Customer 92964 38922
## Subscriber 128490 837692
##
## Accuracy : 0.8475
## 95% CI : (0.8469, 0.8482)
## No Information Rate : 0.7983
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4422
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.41979
## Specificity : 0.95560
## Pos Pred Value : 0.70488
## Neg Pred Value : 0.86701
## Prevalence : 0.20168
## Detection Rate : 0.08466
## Detection Prevalence : 0.12011
## Balanced Accuracy : 0.68769
##
## 'Positive' Class : Customer
##
Of interest are the journeys made by ‘Customers’ but which, when predicting the decision tree, led to a classification as ‘Subscribers’.
df_pred_subsciber <- test_data[predictions == "Subscriber",]
df_pred_subsciber <- df_pred_subsciber %>%
filter( member_casual == "Customer" )
summary(df_pred_subsciber)
## ride_id started_at ended_at
## Min. :21742456 Min. :2019-01-01 00:23:43 Min. :2019-01-01 00:33:05
## 1st Qu.:23308385 1st Qu.:2019-06-22 17:03:47 1st Qu.:2019-06-22 17:19:58
## Median :24124065 Median :2019-08-02 15:27:17 Median :2019-08-02 15:42:27
## Mean :24059772 Mean :2019-07-30 04:03:18 Mean :2019-07-30 04:21:11
## 3rd Qu.:24847000 3rd Qu.:2019-09-08 16:52:59 3rd Qu.:2019-09-08 17:09:04
## Max. :25962784 Max. :2019-12-31 21:22:16 Max. :2019-12-31 21:28:08
##
## rideable_type tripduration start_station_id
## Min. : 1 Min. : 1.017 35 : 3222
## 1st Qu.:1685 1st Qu.:10.583 268 : 2504
## Median :3386 Median :16.150 177 : 1848
## Mean :3321 Mean :17.886 85 : 1771
## 3rd Qu.:4963 3rd Qu.:23.800 43 : 1341
## Max. :6946 Max. :43.750 90 : 1237
## (Other):116567
## start_station_name end_station_id
## Streeter Dr & Grand Ave : 3222 35 : 1326
## Lake Shore Dr & North Blvd : 2504 268 : 1287
## Theater on the Lake : 1848 334 : 1210
## Michigan Ave & Oak St : 1771 177 : 1193
## Michigan Ave & Washington St: 1341 69 : 1146
## Millennium Park : 1237 313 : 1059
## (Other) :116567 (Other):121269
## end_station_name member_casual age_group
## Streeter Dr & Grand Ave : 1326 Customer :128490 17_under: 165
## Lake Shore Dr & North Blvd : 1287 Subscriber: 0 18-35 :104818
## Lake Shore Dr & Belmont Ave : 1210 36-50 : 19727
## Theater on the Lake : 1193 51-65 : 3595
## Damen Ave & Pierce Ave : 1146 66-75 : 172
## Lakeview Ave & Fullerton Pkwy: 1059 76-90 : 9
## (Other) :121269 91_plus : 4
## month month_name weekday_number weekday_name
## 8 :26980 08 August :26980 Min. :1.000 Sonntag :23556
## 7 :24544 07 July :24544 1st Qu.:2.000 Montag :14969
## 9 :20242 09 September:20242 Median :4.000 Dienstag :14003
## 6 :18024 06 June :18024 Mean :4.154 Mittwoch :14564
## 10 :11950 10 October :11950 3rd Qu.:6.000 Donnerstag:15877
## 5 :10761 05 May :10761 Max. :7.000 Freitag :18001
## (Other):15989 (Other) :15989 Samstag :27520
## started_at_hour ended_at_hour
## 17 :13949 17 :13844
## 16 :12006 18 :12159
## 18 :11367 16 :11306
## 15 : 9961 15 : 9625
## 14 : 9173 19 : 9577
## 13 : 8705 14 : 8987
## (Other):63329 (Other):62992
Interpretation
summary(df_pred_subsciber$age_group)
## 17_under 18-35 36-50 51-65 66-75 76-90 91_plus
## 165 104818 19727 3595 172 9 4
The 18-35 age group predominates.
summary(df_pred_subsciber$tripduration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 10.583 16.150 17.886 23.800 43.750
The median usage time is approximately 16 minutes, which is as low as for annual members, meaning that only specific trips are taken.
summary(df_pred_subsciber$ended_at_hour)
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 1201 1085 851 441 293 478 1114 2305 4077 3999 4876 6412 7897
## 13 14 15 16 17 18 19 20 21 22 23
## 8579 8987 9625 11306 13844 12159 9577 6857 4937 4462 3128
The frequency of use falls within the time range from 3:00 p.m. to 7:00 p.m., i.e., a time range of business activity.
summary(df_pred_subsciber$weekday_name)
## Sonntag Montag Dienstag Mittwoch Donnerstag Freitag Samstag
## 23556 14969 14003 14564 15877 18001 27520
It is noticeable that trips on weekends (Saturday and Sunday) predominate, but the total number across all working days (Monday to Friday) is still significant.
summary(df_pred_subsciber$month_name)
## 01 January 02 February 03 March 04 April 05 May 06 June
## 773 497 2274 6182 10761 18024
## 07 July 08 August 09 September 10 October 11 November 12 December
## 24544 26980 20242 11950 3528 2735
Trips are primarily undertaken in the second and third quarters. October is also included. This is understandable due to weather conditions, but since the trips are mostly short, there could also be demand in winter.
Most trips end in commercial, government, or mixed commercial areas.
head(summary(df_pred_subsciber$end_station_name), 15)
## Streeter Dr & Grand Ave Lake Shore Dr & North Blvd
## 1326 1287
## Lake Shore Dr & Belmont Ave Theater on the Lake
## 1210 1193
## Damen Ave & Pierce Ave Lakeview Ave & Fullerton Pkwy
## 1146 1059
## Clark St & Elm St LaSalle St & Illinois St
## 1022 1010
## Columbus Dr & Randolph St Michigan Ave & Washington St
## 989 974
## Broadway & Barry Ave Lake Shore Dr & Wellington Ave
## 970 967
## St. Clair St & Erie St Wilton Ave & Belmont Ave
## 950 932
## Millennium Park
## 922
Street Classification Analysis for Chicago Locations
Streeter Dr & Grand Av Classification: Commercial Area Reasoning: In Near North Side, near Navy Pier and tourist area
Lake Shore Dr & North Blvd Classification: Mixed Use Reasoning: In Lincoln Park, primarily residential area with park facilities
Lake Shore Dr & Belmont Ave Classification: Mixed Use Reasoning: In Lakeview, residential area with commercial influences
Theater on the Lake Classification: Service Area Reasoning: Cultural and recreational facility on the lakefront
Damen Ave & Pierce Ave Classification: Residential Area Reasoning: In Lincoln Park, primarily residential area
Lakeview Ave & Fullerton Pkwy Classification: Mixed Use Reasoning: Transition area with residential and commercial use
Clark St & Elm St Classification: Commercial Area Reasoning: In Near North Side, mixed commercial use
LaSalle St & Illinois St Classification: Commercial Area Reasoning: In the financial district, near Chicago River
Columbus Dr & Randolph St Classification: Administrative Area Reasoning: Near Millennium Park, government buildings and cultural institutions
Michigan Ave & Washington St Classification: Commercial Area Reasoning: On the Magnificent Mile, primary retail district
Kendall’s tau is particularly suitable for analyzing monotonous trends over time—i.e., whether rising values in one variable tend to be accompanied by rising or falling values in another variable.
df_trips_pro_day <- df_trips %>%
mutate(
tag = as.Date(ended_at, format = "%d.%m.%Y %H:%M:%S")) %>%
# nach dem Tag, der Startstation und der Endstation
group_by(tag, start_station_id, start_station_name, end_station_id, end_station_name) %>%
# Fahrten in jeder Gruppe
summarise(anzahl_verbindungen = n(), .groups = 'drop')
df_changes_pro_tag <- df_trips_pro_day %>%
#nach Stationen gruppiert und nach Datum sortiert
arrange(start_station_id, end_station_id, tag) %>%
group_by(start_station_id, end_station_id) %>%
# Differenz zum Vortag berechnen
mutate(
anzahl_vortag = lag(anzahl_verbindungen),
differenz = case_when(
is.na(anzahl_vortag) ~ 0, # Erster Tag bekommt 0
anzahl_verbindungen > anzahl_vortag ~ anzahl_verbindungen - anzahl_vortag, # Positive Differenz
anzahl_verbindungen < anzahl_vortag ~ anzahl_verbindungen - anzahl_vortag, # Negative Differenz
TRUE ~ 0 # Keine Veränderung = 0
),
trend = case_when(
differenz > 0 ~ 1,
differenz < 0 ~ -1,
TRUE ~ 0
),
trend_txt = case_when(
differenz > 0 ~ "erhöht",
differenz < 0 ~ "verringert",
TRUE ~ "unverändert"
)
) %>%
select(-anzahl_vortag) %>% # Hilfsspalte entfernen
ungroup()
head(df_changes_pro_tag, 15)
## # A tibble: 15 × 9
## tag start_station_id start_station_name end_station_id
## <date> <fct> <fct> <fct>
## 1 2019-01-01 2 Buckingham Fountain (Temp) 2
## 2 2019-01-04 2 Buckingham Fountain (Temp) 2
## 3 2019-01-07 2 Buckingham Fountain (Temp) 2
## 4 2019-03-03 2 Buckingham Fountain (Temp) 2
## 5 2019-03-08 2 Buckingham Fountain (Temp) 2
## 6 2019-03-11 2 Buckingham Fountain (Temp) 2
## 7 2019-03-12 2 Buckingham Fountain (Temp) 2
## 8 2019-03-13 2 Buckingham Fountain (Temp) 2
## 9 2019-03-14 2 Buckingham Fountain (Temp) 2
## 10 2019-03-16 2 Buckingham Fountain (Temp) 2
## 11 2019-03-17 2 Buckingham Fountain (Temp) 2
## 12 2019-03-19 2 Buckingham Fountain (Temp) 2
## 13 2019-03-23 2 Buckingham Fountain (Temp) 2
## 14 2019-03-27 2 Buckingham Fountain (Temp) 2
## 15 2019-03-28 2 Buckingham Fountain (Temp) 2
## # ℹ 5 more variables: end_station_name <fct>, anzahl_verbindungen <int>,
## # differenz <dbl>, trend <dbl>, trend_txt <chr>
colnames(df_changes_pro_tag)
## [1] "tag" "start_station_id" "start_station_name"
## [4] "end_station_id" "end_station_name" "anzahl_verbindungen"
## [7] "differenz" "trend" "trend_txt"
library(ggplot2)
library(ggpubr)
ggscatter(df_changes_pro_tag, x = "tag", y = "differenz",
xlab = "Day", ylab = "Increase or Decrease",
cor.method = "kendall", cor.coef = FALSE,
title = "Kendall Correlation")
A scatter plot is created in which each observation in the df_changes_pro_tag data set is represented as a point. The values of the tag variable are plotted on the x-axis and the values of the difference variable on the y-axis
Correlation calculation: With the argument cor.method = “kendall”, the strength and direction of the monotonic relationship between tag and difference is calculated using Kendall’s tau.
The horizontal trend indicates that there is no overall trend suggesting monotonous growth or decline. Locally, there may be increases, but there may also be decreases.
To strategically boost business development and sustainably increase profitability, we propose the following three data-driven measures, which are specifically aimed at converting occasional drivers into annual members.
Proposal: Develop a hybrid membership model that combines leisure use at weekends with discounted commuting options during the week. This offer should be positioned as an entry-level model tailored specifically to the needs of occasional riders.
Rationale: Since casual riders primarily use bikes on weekends and for longer leisure trips, a modular system offers the opportunity to pick up on their existing behaviour patterns and gradually communicate the advantages of cheaper membership rates during the week. The lower entry barrier increases the likelihood of conversion.
Proposal: Targeted marketing campaigns encouraging casual riders to try out cycling for commuting. This includes free trial weeks for commuting with station offers specifically tailored to business districts.
Rationale: Data shows that annual members mainly use the system for commuting. By enabling test rides during rush hour, casual riders can experience the practical benefits for their commute, which is one of the strongest motivators for switching to annual membership.
Proposal: Develop an annual membership with integrated pause options for the winter months or holiday periods, combined with discounted weekend and holiday use.
Rationale: This measure directly addresses casual riders’ concerns about year-round commitment and creates a compromise that respects their leisure usage preferences while retaining the benefits of membership for their main usage periods.
Data-based analysis provides the crucial foundation for a successful conversion strategy. The identified behaviour patterns and preferences of both user groups offer clear starting points for targeted marketing measures and operational optimisations. We are grateful for your trust and constructive cooperation and are passing on the results to marketing and management for the development of data-supported strategies, so that concrete measures to increase annual memberships can be implemented on this basis.