Cyclistic Bike-Share Analysis

Introduction

In today’s competitive urban mobility sector, long-term customer retention is a key success factor. For Cyclistic, a leading bike-share company in Chicago, the greatest opportunity for future growth lies in converting occasional users into profitable annual members.

This data analysis was initiated to answer the fundamental question: How does the use of Cyclistic bikes differ between annual members and occasional riders?

Based on the vision of Marketing Director Lily Moreno, this project examines historical ride data from the last twelve months. Our goal is to gain clear, data-driven insights into the riding habits, time patterns, and preferred routes of both user groups. Identifying these differences is the first and essential step in developing a targeted marketing strategy that can effectively convince occasional riders of the benefits of an annual membership.

The following analysis follows the structured process of data analysis – Ask, Prepare & Process, Analyze, Share, and Act.

01 The business challenge

We need to understand how the use of Cyclistic bikes differs between annual members and occasional riders. This insight will form the basis of a marketing campaign aimed at converting occasional riders into members.

Stakeholders

Lily Moreno, Marketing Director, commissioned the analysis for the marketing strategy. The Cyclistic executive team must approve the budget and campaign based on the analysis results.

Business task

Identify differences in bike usage behavior between annual members and occasional riders by analyzing historical ride data. The results will be used to develop targeted marketing initiatives aimed at converting occasional riders into annual members.

02 Data sources and description

Data source

The Cyclistic ride data for 2019, i.e., from 12 months ago, is available. This data is publicly available and is provided by Motivate International Inc.

Organization

The data is organized in monthly CSV files.

We have used Cyclistic’s historical trip data to analyze and identify trends. For this we have make use of the information contained in the following ZIP files (available here):

ZIP file names Date Modified Size Divvy_Trips_2019_Q1.zip Nov 8th 2021, 04:05:26 pm 9.57 MB Divvy_Trips_2019_Q2.zip Nov 8th 2021, 04:06:12 pm 28.72 MB Divvy_Trips_2019_Q3.zip Jan 24th 2020, 10:08:06 am 42.36 MB Divvy_Trips_2019_Q4.zip Jan 24th 2020, 10:08:07 am 18.40 MB

The raw data comprises a total of approximately 3.8 million data records.

Raw Data

Attribut Format Description

ride_id numeric A unique identifier for each ride. started_at POSIXct Timestamp for the start of the ride. ended_at POSIXct Timestamp for the end of the ride. rideable_type numeric The type of bike (classic, electric, etc.) tripduration numeric The trip duration of the ride. start_station_id numeric The Unique identifier for the start station. start_station_name character The name of the start station. end_station_id numeric The Unique identifier for the end station. end_station_name character The name of the end stations. member_casual character The key field that distinguishes between ‘Subscriber’ (annual member) and ‘Customer’ (casual). gender character The gender of the rider. birthyear numeric The birthyear of the rider.

Data integrity

Reliable The data comes from a first-party source (the company’s own system). It is original data from the bicycle rental system.

Comprehensive It contains the required fields for analysis for the user types annual member and casual. It includes data from the entire year 2019.

Current The data is no longer entirely up to date, and the coronavirus pandemic may have led to changes in driver behaviour (e.g. working from home).

Cited The data is sourced from Motivate International Inc. under licence. There are two attributes related to individuals: year of birth and gender. The combination of year of birth (and age derived from this), gender and other attributes such as exact timestamps of journeys and exact start/end stations can override the pseudonymisation of the data and lead to re-identification. To avoid a data protection issue, the gender attribute is removed and the year of birth attribute is anonymised by transforming it into an age group.

Limitations The user behaviour of individual drivers cannot be tracked due to a lack of ID. This means it is not possible to determine whether casual users have purchased tickets repeatedly.

03 Data cleansing and preparation

Tool selection

The programming language R is used for this analysis due to its powerful data manipulation and visualisation libraries. The volume of data also speaks in favour of using R.

The following packages of the programming language R are used below:

Library Description

tidyverse Easily Install and Load the ‘Tidyverse’ dplyr A Grammar of Data Manipulation lubridate Make Dealing with Dates a Little Easier conflicted An Alternative Conflict Resolution Strategy ggplot2 Create Elegant Data Visualisations Using the Grammar of Graphics caret Classification and Regression Training rpart Recursive Partitioning and Regression Trees rpart.plot Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’ ggpubr ‘ggplot2’ Based Publication Ready Plots

library(tidyverse)
library(dplyr)
library(lubridate)
library(conflicted)
library(ggplot2)

library(caret)
library(rpart)
library(rpart.plot)
library(ggpubr)

conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")

Import raw data

Define the work-space.

## [1] "Divvy_Trips_2019_Q1.csv" "Divvy_Trips_2019_Q2.csv"
## [3] "Divvy_Trips_2019_Q3.csv" "Divvy_Trips_2019_Q4.csv"
## [5] "Divvy_Trips_2020_Q1.csv"

The path contains five CSV files, the first four of which are imported for the year 2019.

Steps for data cleansing

Combine data

The four CSV files are merged into a single DataFrame.

Unfortunately, different column names were used in the CSV files, so harmonisation is necessary.

colnames(df_q1)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(df_q2)

##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"

colnames(df_q3)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(df_q4)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

Let us start by renaming the columns of the df_q1, df_q2, df_q3 and df_q4 data-frames.

df_q1 <- rename(df_q1,
                ride_id = trip_id,
                rideable_type = bikeid,
                started_at = start_time,
                ended_at = end_time,
                start_station_name = from_station_name,
                start_station_id = from_station_id,
                end_station_name = to_station_name,
                end_station_id = to_station_id,
                member_casual = usertype)

# q1_2020 does not have tripdurarion, gender and birthyear columns, so we use the names in q4_2019
df_q2 <- rename(df_q2,
                ride_id = "01 - Rental Details Rental ID",
                rideable_type = "01 - Rental Details Bike ID",
                tripduration = "01 - Rental Details Duration In Seconds Uncapped", 
                started_at = "01 - Rental Details Local Start Time",
                ended_at = "01 - Rental Details Local End Time",
                start_station_name = "03 - Rental Start Station Name",
                start_station_id = "03 - Rental Start Station ID",
                end_station_name = "02 - Rental End Station Name",
                end_station_id = "02 - Rental End Station ID",
                member_casual = "User Type",
                gender = "Member Gender",
                birthyear = "05 - Member Details Member Birthday Year")

df_q3 <- rename(df_q3,
                ride_id = trip_id,
                rideable_type = bikeid,
                started_at = start_time,
                ended_at = end_time,
                start_station_name = from_station_name,
                start_station_id = from_station_id,
                end_station_name = to_station_name,
                end_station_id = to_station_id,
                member_casual = usertype)

df_q4 <- rename(df_q4,
                ride_id = trip_id,
                rideable_type = bikeid,
                started_at = start_time,
                ended_at = end_time,
                start_station_name = from_station_name,
                start_station_id = from_station_id,
                end_station_name = to_station_name,
                end_station_id = to_station_id,
                member_casual = usertype)

## [1] "The columns of the four data frames are harmonised so that a merge of raw data can be performed."

Merge into a data.frame ‘df_trips’

df_trips <- rbind(df_q1, df_q2, df_q3, df_q4)

rm(df_q1)
rm(df_q2)
rm(df_q3)
rm(df_q4)

Properties of the data frame ‘df_trips’

dim_trips_raw <- dim(df_trips)

## The data-frame “df_trips” contains 3818004 data records with 12 attributes.

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"

str(df_trips)

## spc_tbl_ [3,818,004 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : num [1:3818004] 21742443 21742444 21742445 21742446 21742447 ...
##  $ started_at        : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
##  $ ended_at          : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
##  $ rideable_type     : num [1:3818004] 2167 4386 1524 252 1170 ...
##  $ tripduration      : num [1:3818004] 390 441 829 1783 364 ...
##  $ start_station_id  : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
##  $ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
##  $ end_station_id    : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
##  $ end_station_name  : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
##  $ member_casual     : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender            : chr [1:3818004] "Male" "Female" "Female" "Male" ...
##  $ birthyear         : num [1:3818004] 1989 1990 1994 1993 1994 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

First, the attribute “gender” is removed to comply with data protection regulations.

df_trips <- df_trips %>% select(-gender)

Since the birthday date is used to form age groups, the missing values are filled in with the median birthday for the identical connection from the start station to the end station with a date, so that the data records are retained for analysis.

df_trips <- df_trips %>%
  group_by(start_station_id, end_station_id) %>%
  mutate(
    birthyear = ifelse(
      is.na(birthyear), 
      median(birthyear, na.rm = TRUE), 
      birthyear
    )
  ) %>%
  ungroup()

All other missing values result in the data record being deleted.

df_trips <- df_trips %>%
  na.omit()

df_trips <- df_trips %>%
  mutate(
    age = year(started_at) - birthyear, 
    age_group = case_when(
      age < 18 ~ "17_under",
      age >= 18 & age < 36 ~ "18-35",
      age >= 36 & age < 51 ~ "36-50", 
      age >= 51 & age < 66 ~ "51-65", 
      age >= 66 & age < 76 ~ "66-75",
      age >= 76 & age < 91 ~ "76-90",
      age >= 91 ~ "91_plus",
      TRUE ~ "NA"
    )
  ) %>% 
  select(c(-birthyear, -age))

Then the attribute “age_group” is generated based on the age from the year of birth. The columns “birthyear” and “age” are removed.

Status

## Of the raw data with 3818004 there are still 3805811 data sets available for analysis.

colnames(df_trips)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "age_group"

Check data

Check of the target variable ‘member_casual’:

df_trips <- df_trips %>%
  mutate(
    member_casual = as.factor(member_casual)
  )
if (sum(!is.na(df_trips$member_casual)) == dim(df_trips)[1]){
  cat("The target variable ‘member_casual’ contains no missing values and the values are 'Subscriber' und 'Customer'.")
} else {
  cat("There are missing values in the variable ‘member_casual’!")
}

## The target variable ‘member_casual’ contains no missing values and the values are 'Subscriber' und 'Customer'.

with(df_trips, table(member_casual, useNA = "always"))

## member_casual
##   Customer Subscriber       <NA> 
##     868524    2937287          0

Handle missing and invalid data

Check missing values

No more missing values are found.

sum(is.na(df_trips$started_at)) + 
sum(is.na(df_trips$ended_at)) +
sum(is.na(df_trips$start_station_name)) +
sum(is.na(df_trips$start_station_id)) +
sum(is.na(df_trips$end_station_name)) +
sum(is.na(df_trips$end_station_id)) +
sum(is.na(df_trips$tripduration))

## [1] 0

Check invalid data

Checking whether the return date is before the start date of a trip.

df_trips_err <- df_trips %>%
  filter( started_at > ended_at )
cat("There are", dim(df_trips_err)[1], "data entry errors that will be deleted!")

## There are 12 data entry errors that will be deleted!

df_trips <- df_trips %>%
  filter( started_at < ended_at )

Check outliers

summary(df_trips$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      61     410     707    1360    1276 9056633

## The maximum value of 9056633 for the attribute 'tripduration' implies that an outlier analysis is necessary!

detect_iqr_outliers <- function(x, multiplier = 1.5) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr_val <- IQR(x, na.rm = TRUE)
  lower_bound <- qnt[1] - multiplier * iqr_val
  upper_bound <- qnt[2] + multiplier * iqr_val
  list(
    outliers = x < lower_bound | x > upper_bound,
    bounds = c(lower = lower_bound, upper = upper_bound),
    method = paste0("IQR_", multiplier)
  )
}

# weak outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 1.5)
df_trips_outliers <- df_trips[iqr_result$outliers,]

cat("There are", dim(df_trips_outliers)[1], "outliers detected for the attribute 'tripduration'!")

## There are 263144 outliers detected for the attribute 'tripduration'!

table(df_trips_outliers$member_casual)

## 
##   Customer Subscriber 
##     227661      35483

summary(df_trips_outliers$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2576    3085    3952    8704    5782 9056633

If the loan lasts for more than one day, then these entries, i.e. day trips, do not fit into the population of loans by annual members.

df_trips_over_days <- df_trips_outliers %>% 
  filter(date(as.POSIXct(started_at)) != date(as.POSIXct(ended_at))) %>% 
  arrange(desc(tripduration)) %>% 
  select(c(ride_id, started_at, ended_at, tripduration,  member_casual))

cat("There are", dim(df_trips_over_days)[1], "outliers were detected over several days for the attribute 'tripduration'!")

## There are 7977 outliers were detected over several days for the attribute 'tripduration'!

head(df_trips_over_days)

## # A tibble: 6 × 5
##    ride_id started_at          ended_at            tripduration member_casual
##      <dbl> <dttm>              <dttm>                     <dbl> <fct>        
## 1 23798837 2019-07-16 18:17:58 2019-10-29 14:01:52      9056633 Subscriber   
## 2 23889502 2019-07-22 12:09:36 2019-10-25 10:56:54      8203637 Subscriber   
## 3 23710210 2019-07-12 13:08:10 2019-10-12 10:32:18      7939448 Customer     
## 4 24107150 2019-08-01 18:45:47 2019-10-29 12:36:57      7667469 Customer     
## 5 23661214 2019-07-10 10:25:13 2019-10-06 11:26:25      7606871 Customer     
## 6 23801970 2019-07-16 20:44:13 2019-10-10 18:01:25      7420632 Customer

df_trips <- df_trips %>% 
  filter(date(as.POSIXct(started_at)) == date(as.POSIXct(ended_at)))

dim(df_trips)

## [1] 3790489      11

summary(df_trips$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      61     409     704    1091    1269   82385

## Since the maximum value for trip duration is still very high at 82385 minutes, all strong outliers for the attribute ‘tripduration’ are identified and removed for further investigation!

# strong outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 3.0)
df_trips_outliers_strong <- df_trips[iqr_result$outliers,]
df_trips <- df_trips[!iqr_result$outliers,]

df_trips_outliers_strong %>% 
  arrange(desc(tripduration))

## # A tibble: 130,260 × 11
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 24820137 2019-09-07 00:51:15 2019-09-07 23:44:20          3771        82385
##  2 24478982 2019-08-20 00:41:42 2019-08-20 23:33:39           239        82317
##  3 23011586 2019-06-06 01:24:32 2019-06-06 23:47:21          6240        80569
##  4 22224947 2019-04-06 01:21:11 2019-04-06 23:40:51          5648        80380
##  5 22224950 2019-04-06 01:21:40 2019-04-06 23:40:23           965        80323
##  6 25632227 2019-11-04 01:22:09 2019-11-04 23:28:13           608        79564
##  7 24692637 2019-08-31 01:31:37 2019-08-31 23:03:40          4831        77522
##  8 21959434 2019-02-24 00:15:13 2019-02-24 21:41:28          6141        77175
##  9 24443879 2019-08-18 01:38:35 2019-08-18 23:03:33          1449        77098
## 10 22745218 2019-05-21 02:15:29 2019-05-21 23:34:14           264        76725
## # ℹ 130,250 more rows
## # ℹ 6 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <fct>,
## #   age_group <chr>

summary(df_trips$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0   401.0   680.0   892.2  1184.0  3849.0

# weak outlier
iqr_result <- detect_iqr_outliers(df_trips$tripduration, multiplier = 1.5)
df_trips_outliers_weak <- df_trips[iqr_result$outliers,]

df_trips_outliers_weak %>% 
  arrange(desc(tripduration))

## # A tibble: 169,311 × 11
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 22230184 2019-04-06 14:11:07 2019-04-06 15:15:16          5104         3849
##  2 22232802 2019-04-06 16:32:27 2019-04-06 17:36:36          3964         3849
##  3 22321327 2019-04-15 17:35:58 2019-04-15 18:40:07          4466         3849
##  4 22383452 2019-04-21 13:29:52 2019-04-21 14:34:01          6185         3849
##  5 22388265 2019-04-21 16:35:03 2019-04-21 17:39:12          6432         3849
##  6 22525053 2019-05-04 12:40:22 2019-05-04 13:44:31          4250         3849
##  7 22541373 2019-05-05 13:47:15 2019-05-05 14:51:24          4776         3849
##  8 22929432 2019-06-01 12:51:39 2019-06-01 13:55:48           935         3849
##  9 22979848 2019-06-04 14:11:59 2019-06-04 15:16:08          3745         3849
## 10 23071649 2019-06-08 19:31:52 2019-06-08 20:36:01          5830         3849
## # ℹ 169,301 more rows
## # ℹ 6 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <fct>,
## #   age_group <chr>

table(df_trips_outliers_weak$member_casual)

## 
##   Customer Subscriber 
##     125256      44055

The maximum value of approximately 40 minutes does not seem very unusual for a journey.

summary(df_trips$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0   401.0   680.0   892.2  1184.0  3849.0

Create new columns

Add calculated fields to facilitate analysis.

# Kategorische Variablen in Faktoren umwandeln
df_trips <- df_trips %>%
  mutate(
    member_casual = as.factor(member_casual),
    age_group = as.factor(age_group),     
    month = month(started_at),
    month_name = case_when(
      month == 01 ~ paste0("01 ", month.name[month]), 
      month == 02 ~ paste0("02 ", month.name[month]),
      month == 03 ~ paste0("03 ", month.name[month]), 
      month == 04 ~ paste0("04 ", month.name[month]), 
      month == 05 ~ paste0("05 ", month.name[month]), 
      month == 06 ~ paste0("06 ", month.name[month]), 
      month == 07 ~ paste0("07 ", month.name[month]), 
      month == 08 ~ paste0("08 ", month.name[month]), 
      month == 09 ~ paste0("09 ", month.name[month]), 
      month == 10 ~ paste0("10 ", month.name[month]), 
      month == 11 ~ paste0("11 ", month.name[month]), 
      month == 12 ~ paste0("12 ", month.name[month]),
      TRUE ~ "NA"
    ),
    month = as.factor(month),    
    month_name = as.factor(month_name),
    weekday_number = wday(as.POSIXct(ended_at)),       
    weekday_name   = wday(as.POSIXct(ended_at), label = TRUE, abbr = FALSE),
    weekday_name = as.factor(weekday_name),
    started_at_hour = hour(floor_date(as.POSIXct(started_at), unit = "hours")), 
    started_at_hour = as.factor(started_at_hour),
    ended_at_hour = hour(floor_date(as.POSIXct(ended_at), unit = "hours")), 
    ended_at_hour = as.factor(ended_at_hour),
    tripduration = tripduration / 60, 
    start_station_id = as.factor(start_station_id),
    end_station_id  = as.factor(end_station_id), 
    start_station_name = as.factor(start_station_name),
    end_station_name  = as.factor(end_station_name)
  )

colnames(df_trips)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "age_group"          "month"             
## [13] "month_name"         "weekday_number"     "weekday_name"      
## [16] "started_at_hour"    "ended_at_hour"

The following attributes are also generated:

Attribute Format Description

age_group character An age group based on sporting activities, e.g., cycling. Special needs can usually be assigned to age groups.

ended_at_hour character Time when a trip ends (derived from the end date of a trip)

month numeric Month when the trip took place (derived from the end date of a trip)
month_name charcter Month name for the month

weekday_number numeric Day number (01=Sunday, 02=Monday, 03=Tuesday, 04=Wednesday, 05=Thursday, 06=Friday, 07=Saturday) weekday_name character Name for the day number

Check the attributs

To obtain an overview of the status of the attributes after the data cleansing and transformation process, statistical properties as well as the relationship to the membership type (“customer” or “subscriber”) are documented.

mtyp <- summary(df_trips$member_casual)
mtyp

##   Customer Subscriber 
##     738182    2922047

## The ratio of trips by ‘customers’ to ‘subscribers’ is 100 : 396

with(df_trips, table(member_casual, age_group, useNA = "always"))

##              age_group
## member_casual 17_under   18-35   36-50   51-65   66-75   76-90 91_plus    <NA>
##    Customer       1191  612073  102672   21152    1005      48      41       0
##    Subscriber     1473 1835940  723862  327758   31168     858     988       0
##    <NA>              0       0       0       0       0       0       0       0

## The 18-35 age group cycles the most.

summary(df_trips$started_at)

##                  Min.               1st Qu.                Median 
## "2019-01-01 00:04:37" "2019-05-28 15:15:19" "2019-07-25 18:42:48" 
##                  Mean               3rd Qu.                  Max. 
## "2019-07-19 20:40:10" "2019-09-16 08:53:17" "2019-12-31 23:57:17"

with(df_trips, table(member_casual, started_at_hour, useNA = "always"))

##              started_at_hour
## member_casual      0      1      2      3      4      5      6      7      8
##    Customer     7011   4641   2830   1637    995   2307   5466  11624  18953
##    Subscriber  15785   8963   5294   3662   6595  33073 101939 224524 283519
##    <NA>            0      0      0      0      0      0      0      0      0
##              started_at_hour
## member_casual      9     10     11     12     13     14     15     16     17
##    Customer    22802  34605  47076  56285  61114  64342  66658  71081  74364
##    Subscriber 134679 100203 119226 135947 131239 127183 162533 292038 389527
##    <NA>            0      0      0      0      0      0      0      0      0
##              started_at_hour
## member_casual     18     19     20     21     22     23   <NA>
##    Customer    59917  44398  30412  22132  18943   8589      0
##    Subscriber 246245 157709  99356  70957  48363  23488      0
##    <NA>            0      0      0      0      0      0      0

summary(df_trips$ended_at)

##                  Min.               1st Qu.                Median 
## "2019-01-01 00:11:07" "2019-05-28 15:33:01" "2019-07-25 19:00:39" 
##                  Mean               3rd Qu.                  Max. 
## "2019-07-19 20:55:02" "2019-09-16 09:05:11" "2019-12-31 23:59:18"

with(df_trips, table(member_casual, ended_at_hour, useNA = "always"))

##              ended_at_hour
## member_casual      0      1      2      3      4      5      6      7      8
##    Customer     4832   5138   3694   1916   1241   1855   4255   9271  16782
##    Subscriber  13381   9800   6052   3737   5695  27226  85443 194857 292986
##    <NA>            0      0      0      0      0      0      0      0      0
##              ended_at_hour
## member_casual      9     10     11     12     13     14     15     16     17
##    Customer    19156  28495  41231  52864  58998  62783  66455  69698  76235
##    Subscriber 158805 100177 113508 134604 132664 125997 151709 256421 390692
##    <NA>            0      0      0      0      0      0      0      0      0
##              ended_at_hour
## member_casual     18     19     20     21     22     23   <NA>
##    Customer    66109  50659  36086  24757  20659  15013      0
##    Subscriber 274699 172660 109979  76174  52981  31800      0
##    <NA>            0      0      0      0      0      0      0

summary(df_trips$rideable_type)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    1730    3454    3382    5050    6946

summary(df_trips$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017   6.683  11.333  14.870  19.733  64.150

head(summary(df_trips$start_station_name), 15)

##      Streeter Dr & Grand Ave          Canal St & Adams St 
##                        58987                        53977 
##      Clinton St & Madison St Clinton St & Washington Blvd 
##                        49446                        47869 
##    Lake Shore Dr & Monroe St    Columbus Dr & Randolph St 
##                        43493                        37243 
## Michigan Ave & Washington St      Franklin St & Monroe St 
##                        35026                        33881 
##           Daley Center Plaza     Kingsbury St & Kinzie St 
##                        33298                        32866 
##   Lake Shore Dr & North Blvd        Michigan Ave & Oak St 
##                        32225                        30424 
##          Theater on the Lake        Canal St & Madison St 
##                        29966                        29515 
##              Millennium Park 
##                        29178

head(summary(df_trips$end_station_name), 15)

##      Streeter Dr & Grand Ave Clinton St & Washington Blvd 
##                        69959                        50480 
##          Canal St & Adams St      Clinton St & Madison St 
##                        49853                        46938 
##   Lake Shore Dr & North Blvd Michigan Ave & Washington St 
##                        39408                        38166 
##    Lake Shore Dr & Monroe St          Theater on the Lake 
##                        34397                        33078 
##           Daley Center Plaza        Michigan Ave & Oak St 
##                        32740                        32493 
##              Millennium Park     Kingsbury St & Kinzie St 
##                        32475                        32303 
##      Franklin St & Monroe St        Canal St & Madison St 
##                        29271                        28285 
##            Clark St & Elm St 
##                        26288

with(df_trips, table(member_casual, weekday_name, useNA = "always"))

##              weekday_name
## member_casual Sonntag Montag Dienstag Mittwoch Donnerstag Freitag Samstag
##    Customer    140995  84950    75597    76940      86429  102134  171137
##    Subscriber  254119 457027   495098   492425     484644  454526  284208
##    <NA>             0      0        0        0          0       0       0
##              weekday_name
## member_casual   <NA>
##    Customer        0
##    Subscriber      0
##    <NA>            0

## Casual riders tend to ride on weekends (Saturday and Sunday), while annual members use their bikes more frequently on weekdays.

with(df_trips, table(member_casual, month_name, useNA = "always"))

##              month_name
## member_casual 01 January 02 February 03 March 04 April 05 May 06 June 07 July
##    Customer         4018        2384    13889    39443  66920  107815  144988
##    Subscriber      98137       93068   149187   216586 284254  342809  378973
##    <NA>                0           0        0        0      0       0       0
##              month_name
## member_casual 08 August 09 September 10 October 11 November 12 December   <NA>
##    Customer      155723       109691      61926       16918       14467      0
##    Subscriber    400999       362231     299682      157911      138210      0
##    <NA>               0            0          0           0           0      0

## In quarters 2 and 3, both casual riders and annual members take significantly more trips than in quarters 1 and 4.

04 Analyse

How do the other attributes relate to the target variable ‘member_casual’?

The connection can be illustrated using a decision tree.

library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)

set.seed(123)

# 70:30 train : test
train_index <- createDataPartition(df_trips$member_casual, p = 0.7, list = FALSE)
train_data <- df_trips[train_index, ]
test_data <- df_trips[-train_index, ]

# decision tree with all features
tree_model <- rpart(
  member_casual ~ started_at + ended_at + rideable_type + tripduration + start_station_name + end_station_name + age_group + month_name + weekday_name + started_at_hour + ended_at_hour,
  data = train_data,
  method = "class",   
  control = rpart.control(
    minsplit = 20,    
    minbucket = 7,    
    cp = 0.01,        
    maxdepth = 10     
  )
)

# relevant attributs
var_importance <- tree_model$variable.importance
print(round(sort(var_importance, decreasing = TRUE)))

##       tripduration   end_station_name start_station_name    started_at_hour 
##             130163              61938              44473                506 
##      ended_at_hour          age_group 
##                441                 23

Interpretation

Findings from the decision tree model:

Trip duration as the main predictor: Trip duration (130,163) is by far the most important feature for distinguishing between annual members and casual riders.

This indicates fundamental differences in usage: members are likely to use the bikes for shorter, pragmatic trips, while casual riders undertake longer leisure trips.

Station-related characteristics: end_station_name (61.938) and start_station_name (44.473) show significant importance.

This suggests different usage patterns in terms of start and destination locations. Members are likely to commute between fixed stations, while casual riders use more flexible routes.

Temporal usage patterns: started_at_hour (506) and ended_at_hour (441) are less significant but still relevant.

Shows different usage times – members likely to commute during rush hour, casual riders during leisure time

Demographic factors: age_group (23) has the least predictive power

Age plays a minor role compared to behavior patterns.

# predictions on test data
predictions <- predict(tree_model, test_data, type = "class")

# Konfusionsmatrix und Metriken
confusion_matrix <- confusionMatrix(predictions, test_data$member_casual)
print(confusion_matrix)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Customer Subscriber
##   Customer      92964      38922
##   Subscriber   128490     837692
##                                           
##                Accuracy : 0.8475          
##                  95% CI : (0.8469, 0.8482)
##     No Information Rate : 0.7983          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4422          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.41979         
##             Specificity : 0.95560         
##          Pos Pred Value : 0.70488         
##          Neg Pred Value : 0.86701         
##              Prevalence : 0.20168         
##          Detection Rate : 0.08466         
##    Detection Prevalence : 0.12011         
##       Balanced Accuracy : 0.68769         
##                                           
##        'Positive' Class : Customer        
##

Of interest are the journeys made by ‘Customers’ but which, when predicting the decision tree, led to a classification as ‘Subscribers’.

df_pred_subsciber <- test_data[predictions == "Subscriber",]

df_pred_subsciber <- df_pred_subsciber %>%
  filter( member_casual == "Customer" )

summary(df_pred_subsciber)

##     ride_id           started_at                     ended_at                  
##  Min.   :21742456   Min.   :2019-01-01 00:23:43   Min.   :2019-01-01 00:33:05  
##  1st Qu.:23308385   1st Qu.:2019-06-22 17:03:47   1st Qu.:2019-06-22 17:19:58  
##  Median :24124065   Median :2019-08-02 15:27:17   Median :2019-08-02 15:42:27  
##  Mean   :24059772   Mean   :2019-07-30 04:03:18   Mean   :2019-07-30 04:21:11  
##  3rd Qu.:24847000   3rd Qu.:2019-09-08 16:52:59   3rd Qu.:2019-09-08 17:09:04  
##  Max.   :25962784   Max.   :2019-12-31 21:22:16   Max.   :2019-12-31 21:28:08  
##                                                                                
##  rideable_type   tripduration    start_station_id
##  Min.   :   1   Min.   : 1.017   35     :  3222  
##  1st Qu.:1685   1st Qu.:10.583   268    :  2504  
##  Median :3386   Median :16.150   177    :  1848  
##  Mean   :3321   Mean   :17.886   85     :  1771  
##  3rd Qu.:4963   3rd Qu.:23.800   43     :  1341  
##  Max.   :6946   Max.   :43.750   90     :  1237  
##                                  (Other):116567  
##                     start_station_name end_station_id  
##  Streeter Dr & Grand Ave     :  3222   35     :  1326  
##  Lake Shore Dr & North Blvd  :  2504   268    :  1287  
##  Theater on the Lake         :  1848   334    :  1210  
##  Michigan Ave & Oak St       :  1771   177    :  1193  
##  Michigan Ave & Washington St:  1341   69     :  1146  
##  Millennium Park             :  1237   313    :  1059  
##  (Other)                     :116567   (Other):121269  
##                       end_station_name     member_casual       age_group     
##  Streeter Dr & Grand Ave      :  1326   Customer  :128490   17_under:   165  
##  Lake Shore Dr & North Blvd   :  1287   Subscriber:     0   18-35   :104818  
##  Lake Shore Dr & Belmont Ave  :  1210                       36-50   : 19727  
##  Theater on the Lake          :  1193                       51-65   :  3595  
##  Damen Ave & Pierce Ave       :  1146                       66-75   :   172  
##  Lakeview Ave & Fullerton Pkwy:  1059                       76-90   :     9  
##  (Other)                      :121269                       91_plus :     4  
##      month              month_name    weekday_number      weekday_name  
##  8      :26980   08 August   :26980   Min.   :1.000   Sonntag   :23556  
##  7      :24544   07 July     :24544   1st Qu.:2.000   Montag    :14969  
##  9      :20242   09 September:20242   Median :4.000   Dienstag  :14003  
##  6      :18024   06 June     :18024   Mean   :4.154   Mittwoch  :14564  
##  10     :11950   10 October  :11950   3rd Qu.:6.000   Donnerstag:15877  
##  5      :10761   05 May      :10761   Max.   :7.000   Freitag   :18001  
##  (Other):15989   (Other)     :15989                   Samstag   :27520  
##  started_at_hour ended_at_hour  
##  17     :13949   17     :13844  
##  16     :12006   18     :12159  
##  18     :11367   16     :11306  
##  15     : 9961   15     : 9625  
##  14     : 9173   19     : 9577  
##  13     : 8705   14     : 8987  
##  (Other):63329   (Other):62992

Interpretation

Age Group

summary(df_pred_subsciber$age_group)

## 17_under    18-35    36-50    51-65    66-75    76-90  91_plus 
##      165   104818    19727     3595      172        9        4

The 18-35 age group predominates.

Travel times for trips

summary(df_pred_subsciber$tripduration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017  10.583  16.150  17.886  23.800  43.750

The median usage time is approximately 16 minutes, which is as low as for annual members, meaning that only specific trips are taken.

Trips in the daily routine

summary(df_pred_subsciber$ended_at_hour)

##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##  1201  1085   851   441   293   478  1114  2305  4077  3999  4876  6412  7897 
##    13    14    15    16    17    18    19    20    21    22    23 
##  8579  8987  9625 11306 13844 12159  9577  6857  4937  4462  3128

The frequency of use falls within the time range from 3:00 p.m. to 7:00 p.m., i.e., a time range of business activity.

Distribution across the days of the week

summary(df_pred_subsciber$weekday_name)

##    Sonntag     Montag   Dienstag   Mittwoch Donnerstag    Freitag    Samstag 
##      23556      14969      14003      14564      15877      18001      27520

It is noticeable that trips on weekends (Saturday and Sunday) predominate, but the total number across all working days (Monday to Friday) is still significant.

Trips spread throughout the year

summary(df_pred_subsciber$month_name)

##   01 January  02 February     03 March     04 April       05 May      06 June 
##          773          497         2274         6182        10761        18024 
##      07 July    08 August 09 September   10 October  11 November  12 December 
##        24544        26980        20242        11950         3528         2735

Trips are primarily undertaken in the second and third quarters. October is also included. This is understandable due to weather conditions, but since the trips are mostly short, there could also be demand in winter.

Destinations

Most trips end in commercial, government, or mixed commercial areas.

head(summary(df_pred_subsciber$end_station_name), 15)

##        Streeter Dr & Grand Ave     Lake Shore Dr & North Blvd 
##                           1326                           1287 
##    Lake Shore Dr & Belmont Ave            Theater on the Lake 
##                           1210                           1193 
##         Damen Ave & Pierce Ave  Lakeview Ave & Fullerton Pkwy 
##                           1146                           1059 
##              Clark St & Elm St       LaSalle St & Illinois St 
##                           1022                           1010 
##      Columbus Dr & Randolph St   Michigan Ave & Washington St 
##                            989                            974 
##           Broadway & Barry Ave Lake Shore Dr & Wellington Ave 
##                            970                            967 
##         St. Clair St & Erie St       Wilton Ave & Belmont Ave 
##                            950                            932 
##                Millennium Park 
##                            922

Street Classification Analysis for Chicago Locations

Streeter Dr & Grand Av Classification: Commercial Area Reasoning: In Near North Side, near Navy Pier and tourist area

Lake Shore Dr & North Blvd Classification: Mixed Use Reasoning: In Lincoln Park, primarily residential area with park facilities

Lake Shore Dr & Belmont Ave Classification: Mixed Use Reasoning: In Lakeview, residential area with commercial influences

Theater on the Lake Classification: Service Area Reasoning: Cultural and recreational facility on the lakefront

Damen Ave & Pierce Ave Classification: Residential Area Reasoning: In Lincoln Park, primarily residential area

Lakeview Ave & Fullerton Pkwy Classification: Mixed Use Reasoning: Transition area with residential and commercial use

Clark St & Elm St Classification: Commercial Area Reasoning: In Near North Side, mixed commercial use

LaSalle St & Illinois St Classification: Commercial Area Reasoning: In the financial district, near Chicago River

Columbus Dr & Randolph St Classification: Administrative Area Reasoning: Near Millennium Park, government buildings and cultural institutions

Michigan Ave & Washington St Classification: Commercial Area Reasoning: On the Magnificent Mile, primary retail district

Trend

Kendall’s tau is particularly suitable for analyzing monotonous trends over time—i.e., whether rising values in one variable tend to be accompanied by rising or falling values in another variable.

df_trips_pro_day <- df_trips %>%
  mutate(
    tag = as.Date(ended_at, format = "%d.%m.%Y %H:%M:%S")) %>%
    # nach dem Tag, der Startstation und der Endstation
    group_by(tag, start_station_id, start_station_name, end_station_id, end_station_name) %>%
    # Fahrten in jeder Gruppe
    summarise(anzahl_verbindungen = n(), .groups = 'drop')

df_changes_pro_tag <- df_trips_pro_day %>%
  #nach Stationen gruppiert und nach Datum sortiert
  arrange(start_station_id, end_station_id, tag) %>%
  group_by(start_station_id, end_station_id) %>%
  # Differenz zum Vortag berechnen
  mutate(
    anzahl_vortag = lag(anzahl_verbindungen),
    differenz = case_when(
      is.na(anzahl_vortag) ~ 0,  # Erster Tag bekommt 0
      anzahl_verbindungen > anzahl_vortag ~ anzahl_verbindungen - anzahl_vortag,  # Positive Differenz
      anzahl_verbindungen < anzahl_vortag ~ anzahl_verbindungen - anzahl_vortag,  # Negative Differenz
      TRUE ~ 0  # Keine Veränderung = 0
    ),
    trend = case_when(
      differenz > 0 ~ 1,
      differenz < 0 ~ -1, 
      TRUE ~ 0
    ),  
    trend_txt = case_when(
      differenz > 0 ~ "erhöht",
      differenz < 0 ~ "verringert", 
      TRUE ~ "unverändert"
    )
  ) %>%
  select(-anzahl_vortag) %>%  # Hilfsspalte entfernen
  ungroup()

head(df_changes_pro_tag, 15)

## # A tibble: 15 × 9
##    tag        start_station_id start_station_name         end_station_id
##    <date>     <fct>            <fct>                      <fct>         
##  1 2019-01-01 2                Buckingham Fountain (Temp) 2             
##  2 2019-01-04 2                Buckingham Fountain (Temp) 2             
##  3 2019-01-07 2                Buckingham Fountain (Temp) 2             
##  4 2019-03-03 2                Buckingham Fountain (Temp) 2             
##  5 2019-03-08 2                Buckingham Fountain (Temp) 2             
##  6 2019-03-11 2                Buckingham Fountain (Temp) 2             
##  7 2019-03-12 2                Buckingham Fountain (Temp) 2             
##  8 2019-03-13 2                Buckingham Fountain (Temp) 2             
##  9 2019-03-14 2                Buckingham Fountain (Temp) 2             
## 10 2019-03-16 2                Buckingham Fountain (Temp) 2             
## 11 2019-03-17 2                Buckingham Fountain (Temp) 2             
## 12 2019-03-19 2                Buckingham Fountain (Temp) 2             
## 13 2019-03-23 2                Buckingham Fountain (Temp) 2             
## 14 2019-03-27 2                Buckingham Fountain (Temp) 2             
## 15 2019-03-28 2                Buckingham Fountain (Temp) 2             
## # ℹ 5 more variables: end_station_name <fct>, anzahl_verbindungen <int>,
## #   differenz <dbl>, trend <dbl>, trend_txt <chr>

colnames(df_changes_pro_tag)

## [1] "tag"                 "start_station_id"    "start_station_name" 
## [4] "end_station_id"      "end_station_name"    "anzahl_verbindungen"
## [7] "differenz"           "trend"               "trend_txt"

library(ggplot2)
library(ggpubr)

ggscatter(df_changes_pro_tag, x = "tag", y = "differenz",
          xlab = "Day", ylab = "Increase or Decrease",
          cor.method = "kendall", cor.coef = FALSE,
          title = "Kendall Correlation")

A scatter plot is created in which each observation in the df_changes_pro_tag data set is represented as a point. The values of the tag variable are plotted on the x-axis and the values of the difference variable on the y-axis

Correlation calculation: With the argument cor.method = “kendall”, the strength and direction of the monotonic relationship between tag and difference is calculated using Kendall’s tau.

The horizontal trend indicates that there is no overall trend suggesting monotonous growth or decline. Locally, there may be increases, but there may also be decreases.

05 Share Findings

Annual pass holders exhibit distinct commuter behavior, with short, efficient trips mainly on weekdays during rush hour to central business districts, while occasional users tend to use the service for leisure purposes, characterized by longer weekend trips to recreational areas, especially when weather conditions are favorable.

Finding 1 - Trip durations by membership type

Annual members use the bikes for short trips. Casual riders use them for longer, leisure-oriented trips.

## [1] "Trip durations by Custmer (casual riders)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017  13.683  22.400  24.952  32.867  64.150

## [1] "Trip durations by Subscriber (annual members)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017   6.017   9.750  12.322  16.000  64.150

find1_plot <- 
  ggplot(df_trips, aes(x = member_casual, y = tripduration, fill = member_casual)) +
  geom_violin(trim = FALSE, draw_quantiles = c(0.25, 0.5, 0.75)) + 
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(title = "Distribution of Trip durations by membership type", 
       subtitle = "Dotted lines indicate the quartiles (25%, 50%, 75%)",
       x = "Membership Type",
       y = "Trip duration") +
  annotate(geom = "text", x = 1, y = median(df_trips_c$tripduration), label="Median") +
  annotate(geom = "text", x = 2, y = median(df_trips_s$tripduration), label="Median") +
  theme_minimal()

## Warning: The `draw_quantiles` argument of `geom_violin()` is deprecated as of ggplot2
## 4.0.0.
## ℹ Please use the `quantiles.linetype` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

find1_plot

Findings

## Casual riders have an average trip duration of 22.4 minutes (median), while annual members only drive 9.75 minutes on average (median). This strongly suggests that members use the service for short trips, while casual drivers use it for leisure activities.

Finding 2 - Trip durations fall into different times of day.

Annual members mainly use the bikes during rush hour commuting times, while occasional riders show a more consistent pattern of recreational use, concentrated in the afternoon.

## [1] "Daily usage profiles"

##              ended_at_hour
## member_casual      0      1      2      3      4      5      6      7      8
##    Customer     4832   5138   3694   1916   1241   1855   4255   9271  16782
##    Subscriber  13381   9800   6052   3737   5695  27226  85443 194857 292986
##              ended_at_hour
## member_casual      9     10     11     12     13     14     15     16     17
##    Customer    19156  28495  41231  52864  58998  62783  66455  69698  76235
##    Subscriber 158805 100177 113508 134604 132664 125997 151709 256421 390692
##              ended_at_hour
## member_casual     18     19     20     21     22     23
##    Customer    66109  50659  36086  24757  20659  15013
##    Subscriber 274699 172660 109979  76174  52981  31800

find2_plot <- ggplot(df_trips, aes(x = ended_at_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(
    x = "Hour",
    y = "Count",
    fill = "Membership Type",
    title = "Daily usage profiles"
  ) +
  ##scale_x_discrete(labels = c(1:10)) +
  scale_y_continuous(labels = scales::comma) + 
#  coord_flip() +
  theme_minimal()
find2_plot

Findings

Annual members show pronounced commuting peaks. Subscribers show extremely high usage figures during traditional commuting times, with a morning peak at 8 a.m. and an evening peak at 5 p.m. These massive peaks confirm that the service is used purely as a means of commuting. Casual riders follow leisure patterns. Casual riders show a completely different profile with steadily increasing usage figures from morning to afternoon and a peak around 4-5 p.m. The continuous increase throughout the day indicates spontaneous, non-time-bound use for leisure purposes.

Finding 3 - Casual riders prefer weekend use

Annual members mainly use the service during the week for commuting purposes, while occasional users mainly travel at the weekend for leisure activities.

##              weekday_name
## member_casual Sonntag Montag Dienstag Mittwoch Donnerstag Freitag Samstag
##    Customer    140995  84950    75597    76940      86429  102134  171137
##    Subscriber  254119 457027   495098   492425     484644  454526  284208

find_3_plot <- ggplot(df_trips, aes(x = weekday_name, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(
    x = "Weekday",
    y = "Count",
    fill = "Membership Type",
    title = "Trips on the weekdays"
  ) +
  scale_y_continuous(labels = scales::comma) + 
  theme_minimal()

find_3_plot

Findings

Annual members show pronounced commuter behavior. Subscribers show their highest usage figures on weekdays, with peaks of over 495,000 trips on Tuesday. This consistently high level from Monday to Friday indicates primary use for commuting and business purposes.

Casual riders prefer weekend usage. Casual riders show a reverse pattern with the highest usage figures on Saturday and Sunday. The comparatively low figures during the week indicate leisure-oriented usage for weekend activities and tourism.

Finding 4: Popular destination stations show different usage patterns

Annual members predominantly use bikes for commuting to central business districts, while casual riders prefer leisure trips to tourist and recreational areas.

## [1] "Popular end stations across both groups."

##      Streeter Dr & Grand Ave Clinton St & Washington Blvd 
##                        69959                        50480 
##          Canal St & Adams St      Clinton St & Madison St 
##                        49853                        46938 
##   Lake Shore Dr & North Blvd Michigan Ave & Washington St 
##                        39408                        38166 
##    Lake Shore Dr & Monroe St          Theater on the Lake 
##                        34397                        33078 
##           Daley Center Plaza        Michigan Ave & Oak St 
##                        32740                        32493 
##              Millennium Park     Kingsbury St & Kinzie St 
##                        32475                        32303 
##      Franklin St & Monroe St        Canal St & Madison St 
##                        29271                        28285 
##            Clark St & Elm St 
##                        26288

## [1] "Popular end stations across for annual member (Subscriber)"

## Clinton St & Washington Blvd          Canal St & Adams St 
##                        48125                        47269 
##      Clinton St & Madison St           Daley Center Plaza 
##                        44203                        30545 
##     Kingsbury St & Kinzie St Michigan Ave & Washington St 
##                        30152                        27900 
##      Franklin St & Monroe St        Canal St & Madison St 
##                        26719                        26297 
##            Clark St & Elm St    LaSalle St & Jackson Blvd 
##                        22603                        21996 
##   Larrabee St & Kingsbury St         Clinton St & Lake St 
##                        21739                        21667 
##        Dearborn St & Erie St       St. Clair St & Erie St 
##                        20482                        19709 
##          Wells St & Huron St 
##                        19251

find_4_plot <- ggplot(df_vgl_s) +
  geom_bar(aes(x = end_station_name, fill = member_casual), position = "dodge") +
  ##geom_col() +
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(
    x = "Weekday",
    y = "Count",
    fill = "Membership Type",
    title = "Subscriber with Trips on Destination Stations"
  ) +
  scale_y_continuous(labels = scales::comma) + 
  coord_flip() +
  theme_minimal()
find_4_plot

The most popular end stations for annual members are centrally located and are found in neighborhoods classified as commercial, administrative, industrial, or mixed-use areas.

Street Classification Analysis for Chicago Locations (limited of the top 6 Stations)

Clinton St & Washington Blvd Central: Yes Classification: Commercial Area Reasoning: Located in the West Loop, near the Fulton Market District with numerous offices, restaurants, and commercial spaces

Canal St & Adams St Central: Yes Classification: Commercial Area Reasoning: In the financial district, surrounded by office buildings and close to the Chicago Board of Trade

Clinton St & Madison St Central: Yes Classification: Commercial Area Reasoning: In the heart of the West Loop, central business location with numerous companies

Daley Center Plaza Central: Yes Classification: Administrative Area Reasoning: Seat of Cook County Government, directly at the Richard J. Daley Center

Kingsbury St & Kinzie St Central: Yes Classification: Industrial Area Reasoning: In the area north of the Chicago River with mixed commercial-industrial use

Michigan Ave & Washington St Central: Yes Classification: Commercial Area Reasoning: On the Magnificent Mile, primary retail and tourism area

Classification Legend: Central: Within or immediately adjacent to the Chicago Loop/CBD Commercial Area: Predominantly office use, retail, services Industrial Area: Mixed commercial-industrial use, often in transition Administrative Area: Primarily government and administrative buildings Mixed Use: Transition area with both commercial and residential use

print("Popular end stations across for casual rider (Customer)")

## [1] "Popular end stations across for casual rider (Customer)"

v_nam_c <- names(summary(df_trips_c$end_station_name)[1:15])
summary(df_trips_c$end_station_name)[v_nam_c]

##      Streeter Dr & Grand Ave    Lake Shore Dr & Monroe St 
##                        56002                        24558 
##   Lake Shore Dr & North Blvd              Millennium Park 
##                        20418                        20339 
##        Michigan Ave & Oak St          Theater on the Lake 
##                        18499                        16095 
##               Shedd Aquarium Michigan Ave & Washington St 
##                        14773                        10266 
##            Adler Planetarium               Dusable Harbor 
##                         9025                         7799 
##        Michigan Ave & 8th St      Clark St & Armitage Ave 
##                         7516                         6250 
##   Indiana Ave & Roosevelt Rd              Montrose Harbor 
##                         6092                         5886 
##       Clark St & Lincoln Ave 
##                         5842

df_vgl_c <- df_trips %>% 
  filter(end_station_name %in% v_nam_c)
df_vgl_c$end_station_name <- factor(df_vgl_c$end_station_name, levels = rev(v_nam_c))

find_5_plot <- ggplot(df_vgl_c) +
  geom_bar(aes(x = end_station_name, fill = member_casual), position = "dodge") +
  ##geom_col() +
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(
    x = "Weekday",
    y = "Count",
    fill = "Membership Type",
    title = "Customers with Trips on Destination Stations"
  ) +
  scale_y_continuous(labels = scales::comma) + 
  coord_flip() +
  theme_minimal()

find_5_plot

Findings

Annual members (subscribers) show clear commuter behavior. The most popular end stations are concentrated in central business and administrative districts. Occasional users (customers) prefer leisure and tourist locations. The top stations for this group are mainly located in recreational and tourist areas.

Finding 5: Analysis of Cyclistic’s monthly usage patterns

The data underscores the economic superiority of the subscriber model with more stable, predictable usage throughout the year.

##              month_name
## member_casual 01 January 02 February 03 March 04 April 05 May 06 June 07 July
##    Customer         4018        2384    13889    39443  66920  107815  144988
##    Subscriber      98137       93068   149187   216586 284254  342809  378973
##              month_name
## member_casual 08 August 09 September 10 October 11 November 12 December
##    Customer      155723       109691      61926       16918       14467
##    Subscriber    400999       362231     299682      157911      138210

find_6_plot <- ggplot(df_trips, aes(x = month_name, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#42E894", "#E84242"),
                    labels = c("Customer", "Subscriber"), 
                    name = "Membership Type") + 
  labs(
    x = "Month",
    y = "Count",
    fill = "Membership Type",
    title = "Distribution of trips throughout the year"
  ) +
  scale_y_continuous(labels = scales::comma) + 
  theme_minimal()

find_6_plot

Chicago

Findings

Both user groups show a pronounced seasonal effect with peak values in the summer months (June-August). Subscribers show consistently higher usage figures in all months.Customers show more extreme seasonal fluctuations. Both groups show parallel growth from January to August. From Sept ember onwards, the patterns diverge slightly – customers drop off faster than subscribers. The weather could be the dominant factor behind the observed usage patterns. While subscribers use the system as a means of transportation in almost all conditions, customers treat cycling as a leisure activity that is heavily dependent on pleasant weather conditions.

06 Recommendations

To strategically boost business development and sustainably increase profitability, we propose the following three data-driven measures, which are specifically aimed at converting occasional drivers into annual members.

Introduction of a ’Weekender & Leisure Pass

Proposal: Develop a hybrid membership model that combines leisure use at weekends with discounted commuting options during the week. This offer should be positioned as an entry-level model tailored specifically to the needs of occasional riders.

Rationale: Since casual riders primarily use bikes on weekends and for longer leisure trips, a modular system offers the opportunity to pick up on their existing behaviour patterns and gradually communicate the advantages of cheaper membership rates during the week. The lower entry barrier increases the likelihood of conversion.

‘Commuter Discovery Package’

Proposal: Targeted marketing campaigns encouraging casual riders to try out cycling for commuting. This includes free trial weeks for commuting with station offers specifically tailored to business districts.

Rationale: Data shows that annual members mainly use the system for commuting. By enabling test rides during rush hour, casual riders can experience the practical benefits for their commute, which is one of the strongest motivators for switching to annual membership.

Flexible ‘Pause & Play’ membership

Proposal: Develop an annual membership with integrated pause options for the winter months or holiday periods, combined with discounted weekend and holiday use.

Rationale: This measure directly addresses casual riders’ concerns about year-round commitment and creates a compromise that respects their leisure usage preferences while retaining the benefits of membership for their main usage periods.

Conclusion

Data-based analysis provides the crucial foundation for a successful conversion strategy. The identified behaviour patterns and preferences of both user groups offer clear starting points for targeted marketing measures and operational optimisations. We are grateful for your trust and constructive cooperation and are passing on the results to marketing and management for the development of data-supported strategies, so that concrete measures to increase annual memberships can be implemented on this basis.

Cyclistic Bike-Share Analysis

Andreas Paul

2025-10-13

Introduction

01 The business challenge

Stakeholders

Business task

02 Data sources and description

Data source

Organization

Raw Data

Data integrity

03 Data cleansing and preparation

Tool selection

Import raw data

Steps for data cleansing

Combine data

Check data

Handle missing and invalid data

Create new columns

Check the attributs

04 Analyse

Trend

06 Recommendations

Introduction of a ’Weekender & Leisure Pass

‘Commuter Discovery Package’

Flexible ‘Pause & Play’ membership

Conclusion