Case Study Presentation

This is a case study focusing on the DivvyBikes rental service. I am conducting this analysis to apply the skills I acquired through the Google Data Analytics Certificate program hosted on Coursera. The main objective of this study is to identify differences between members and casual users and propose strategies to boost subscription rates for Divvy services.

Analysis Summary

Based on this analysis, here are the differences between members and casual users:

Members: Likely local residents or students, prefer efficient commuting, shorter rides, and downtown/university stations.
Casual Users: Likely tourists, prefer leisurely exploration, with more spread-out station usage and longer trip durations.

To increase conversion rate, I propose these measures:

Introducing promotional pricing around holiday seasons, or special events.
Creating short-term subscription plans, such as a weekly or weekend subscription.
Expanding station networks in tourist-heavy areas.
Promoting Divvy services in these same areas.

Introduction

Company Overview

Divvy is a bicycle rental company located in Chicago, Illinois, United States. Established in 2013, it has steadily gained popularity and currently boasts a fleet of over 6,000 bikes distributed across more than 600 stations. The service is user-friendly: users can register through the mobile app, unlock a bike using their membership card, or pay for a single trip, and conveniently return it to a nearby station upon reaching their destination.

Data Overview

The dataset used for this study is publicly available here. It includes a separate CSV file for each month of the year, with each row representing an individual trip. The dataset contains the following columns:

Trip ID
Bike type
Departure station name
Departure station ID
Arrival station name
Arrival station ID
Departure date and time
Arrival date and time
Departure latitude
Departure longitude
Arrival latitude
Arrival longitude
User type (subscriber or occasional user)

Set up

Setting Up the Workspace

To kickstart the project, I created a dedicated project directory and organized various folders for this study. After establishing my work environment in RStudio, I proceeded to update and load the necessary libraries for the analysis.

if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("skimr", quietly = TRUE)) install.packages("skimr")
if (!requireNamespace("RANN", quietly = TRUE)) install.packages("RANN")
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")
if (!requireNamespace("tibble", quietly = TRUE)) install.packages("tibble")

library(tidyverse)
library(skimr)
library(RANN)
library(patchwork)
library(RColorBrewer)
library(tibble)

Loading the CSV files

After downloading the CSV files of the past 12 months, which have been stored in the Data/Study Data folder, I import and consolidate them into a unified data table within my working environment.

trip_data <- list.files("./Data/Study Data", full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows()

Now that I have a comprehensive data table in place, let’s take a closer look at the resulting columns.

colnames(trip_data)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

The resulting content is what was expected. Now it’s time to start with the cleaning process.

Data Cleaning

Duplicates and factors

To start off, let’s take a look at a summary of the data.

skim_without_charts(trip_data)

Data summary
Name	trip_data
Number of rows	5859061
Number of columns	13
_______________________
Column type frequency:
character	7
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5859061
rideable_type	0	1.00	11	13	3
start_station_name	832009	0.86	3	64	1722
start_station_id	832141	0.86	3	36	1319
end_station_name	889661	0.85	3	64	1741
end_station_id	889802	0.85	3	36	1324
member_casual	0	1.00	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat	5973	1	41.90	0.07	0.00	41.88	41.90	41.93	42.37
end_lng	5973	1	-87.65	0.11	-88.14	-87.66	-87.64	-87.63	0.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-05-01 00:00:06	2023-04-30 23:59:05	2022-08-28 12:44:57	4916326
ended_at	0	1	2022-05-01 00:05:17	2023-05-03 10:37:12	2022-08-28 13:07:09	4930169

There’s a lot to take in, but I will start with some preparatory steps:

I’ll check for and remove any duplicate entries in this dataset.
To facilitate further analysis, I’ll convert the member_casual and rideable_type columns into categorical factors, because these columns contain nominal qualitative values with a limited number of unique categories.

clean_data <- trip_data %>% 
  distinct(ride_id, .keep_all = TRUE)

clean_data$rideable_type <- factor(clean_data$rideable_type)
clean_data$member_casual <- factor(clean_data$member_casual)

NA values and zeros

After this, there are some data inconsistencies that need to be addressed: - I have encountered missing data for trip end coordinates. Since the number of such instances is minimal (around 0.1% of the dataset), I can safely exclude them from the study. - Furthermore, when examining the trip end coordinates, I’ve identified unusual data points. Specifically, some entries have latitudes and longitudes set to 0, which does not align with the expected Chicago coordinates. I will remove these erroneous entries from the dataset as well.

clean_data <- clean_data %>%
  drop_na(end_lat, end_lng)

clean_data <- clean_data %>% 
  filter(end_lat != 0, end_lng != 0)

Unique station names

Moving forward, I’ll be tackling more substantial issues. Firstly, I’ve observed that there are more unique station names than there are unique station IDs, indicating potential typos or discrepancies in station names.

To address this, my initial step will involve creating a table that associates each station ID with the most frequently occurring station name.

station_id <- clean_data %>%
  select(start_station_id, end_station_id) %>% 
  gather(key = "id_type", value = "station_id", start_station_id:end_station_id)

station_name <- clean_data %>% 
  select(start_station_name, end_station_name) %>% 
  gather(key = "name_type", value = "station_name", start_station_name:end_station_name)

true_stations <- cbind(station_id, station_name) %>%
  count(station_id, station_name) %>%
  na.omit() %>% 
  group_by(station_id) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE)

sum(duplicated(true_stations$station_id))

## [1] 0

sum(duplicated(true_stations$station_name))

## [1] 10

I notice the presence of multiple duplicate station names, meaning that some station names are associated with several IDs. To address this issue, I will make sure to keep only one station ID per station name.

true_stations <- true_stations %>% 
  group_by(station_name) %>% 
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>% 
  select(-n)

sum(duplicated(true_stations$station_id))

## [1] 0

sum(duplicated(true_stations$station_name))

## [1] 0

Now, I can proceed with standardizing the station names. I’ll begin by merging this newly created table with the study data using a left_join. Following that, I’ll replace any station names that exhibit variations.

clean_data <- clean_data %>% 
  left_join(true_stations, by = c("start_station_id" = "station_id"))
start_name_change <- !is.na(clean_data$start_station_name) &
  !is.na(clean_data$station_name) &
  clean_data$start_station_name != clean_data$station_name
clean_data$start_station_name[start_name_change] <- clean_data$station_name[start_name_change]
clean_data <- clean_data %>% 
  select(-station_name)

clean_data <- clean_data %>% 
  left_join(true_stations, by = c("end_station_id" = "station_id"))
end_name_change <- !is.na(clean_data$end_station_name) &
  !is.na(clean_data$station_name) &
  clean_data$end_station_name != clean_data$station_name
clean_data$end_station_name[end_name_change] <- clean_data$station_name[end_name_change]
clean_data <- clean_data %>% 
  select(-station_name)

skim_without_charts(clean_data)

Data summary
Name	clean_data
Number of rows	5853080
Number of columns	13
_______________________
Column type frequency:
character	5
factor	2
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5853080
start_station_name	832009	0.86	7	64	1315
start_station_id	832141	0.86	3	36	1319
end_station_name	883688	0.85	9	64	1320
end_station_id	883829	0.85	3	36	1324

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
rideable_type	0	1	FALSE	3	ele: 3045957, cla: 2639317, doc: 167806
member_casual	0	1	FALSE	2	mem: 3500038, cas: 2353042

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat	1	41.90	0.05	41.55	41.88	41.90	41.93	42.37
end_lng	1	-87.65	0.03	-88.14	-87.66	-87.64	-87.63	-87.30

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-05-01 00:00:06	2023-04-30 23:59:05	2022-08-28 13:00:09	4912172
ended_at	0	1	2022-05-01 00:05:17	2023-05-01 08:06:56	2022-08-28 13:20:10	4925682

In the character variable type section, you can see that there are more unique station IDs than unique station names this time, likely due to the multiple duplicates encountered earlier. While I could perform a similar procedure to standardize station IDs, I won’t need them for the analysis phase, so this step is unnecessary.

Missing stations names and IDs

Now, I can address another critical issue. There is a substantial number of missing station IDs and names, accounting for nearly a quarter of the data. Completely removing this data would result in significant data loss, so I’ll explore possible solutions.

First, let’s attempt to identify a common pattern among the missing data points.

na_check <- clean_data[!complete.cases(clean_data), ]
na_chart <- na_check %>% 
  count(member_casual, rideable_type) %>% 
  ggplot(mapping = aes(x = rideable_type, y = n, fill = rideable_type)) +
  geom_col() +
  facet_wrap(vars(member_casual))
na_chart

This graph shows that a majority of the missing data corresponds to trips made on electric bikes. After conducting a brief investigation on the company’s website, I discovered that in addition to parking electric bikes at rental stations, users have the option to park them elsewhere using an anti-theft device included on the bike itself. However, when users choose this option, the bike records an end-of-trip event at its current location without specifying an arrival station. Subsequently, when a user resumes their ride, it initiates a new journey without indicating a starting station. This explains the missing data in the dataset.

To address this issue, I’ll replace the names of the missing stations with the names of the nearest stations based on the recorded coordinates for these journeys. This approach allows me to retain as much usable data as possible while maintaining data consistency.

As mentioned earlier, I won’t be using station IDs for this analysis, so there’s no need to replace missing station IDs. To begin this process, I’ll collect station coordinates and round them to five decimals, ensuring accuracy to the order of 1 meter, and place them into 2 new columns for departure and arrival coordinates, facilitating future data linking.

clean_data <- clean_data %>%
  mutate(across(c('start_lat', 'start_lng', 'end_lat', 'end_lng'), \(x) round(x, digits = 5))) %>%
  mutate(start_location = paste(start_lat,start_lng,sep = ","),
         end_location = paste(end_lat,end_lng,sep = ","))

In order to replace missing station names using trip coordinates, I’ll enhance the true_stations table by incorporating the coordinates of each station. Similar to the approach with station names, I will retain only the most frequently occurring coordinates for each station ID.

station_lat <- clean_data %>% 
  select(start_lat, end_lat) %>% 
  gather(key = "lat_type", value = "station_lat", start_lat:end_lat)

station_lng <- clean_data %>% 
  select(start_lng, end_lng) %>% 
  gather(key = "lng_type", value = "station_lng", start_lng:end_lng)

station_location <- clean_data %>% 
  select(start_location, end_location) %>% 
  gather(key = "location_type", value = "station_location", start_location:end_location)

true_stations <- cbind(station_id, station_name, station_lat, station_lng, station_location) %>%
  count(station_id, station_name, station_lat, station_lng, station_location) %>%
  na.omit() %>% 
  group_by(station_id) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>% 
  group_by(station_name) %>% 
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>% 
  select(-n)

sum(duplicated(true_stations$station_id))

## [1] 0

sum(duplicated(true_stations$station_name))

## [1] 0

Next, I’ll create data subsets containing only the rows with missing station names and the columns essential for linking with true_stations.

start_missing <- clean_data %>% 
  select(ride_id, start_station_name, start_lng, start_lat) %>% 
  rename(start_missing_name = start_station_name) %>% 
  filter(is.na(start_missing_name))

end_missing <- clean_data %>% 
  select(ride_id, end_station_name, end_lng, end_lat) %>% 
  rename(end_missing_name = end_station_name) %>%
  filter(is.na(end_missing_name))

Afterward, I’ll proceed by creating matrices using the latitude and longitude values. I will then use these matrices to calculate the closest stations, which I can then append to the start_missing and end_missing datasets.

start_coords <- as.matrix(start_missing[, c("start_lng", "start_lat")])
end_coords <- as.matrix(end_missing[, c("end_lng", "end_lat")])
true_coords <- as.matrix(true_stations[, c("station_lng", "station_lat")])

start_neighbors <- RANN::nn2(true_coords, start_coords, k = 1)$nn.idx
start_nearest_stations <- true_stations$station_name[start_neighbors]
start_missing$start_missing_name <- start_nearest_stations

end_neighbors <- RANN::nn2(true_coords, end_coords, k = 1)$nn.idx
end_nearest_stations <- true_stations$station_name[end_neighbors]
end_missing$end_missing_name <- end_nearest_stations

Lastly, I’ll consolidate start_missing and end_missing into a single dataset. I’ll then utilize a left_join to link this combined dataset with clean_data, allowing me to replace the missing station names.

filling_stations <- full_join(start_missing, end_missing, by = "ride_id") %>% 
  select(ride_id, start_missing_name, end_missing_name)

clean_data <- clean_data %>%
  left_join(filling_stations, by = "ride_id")
  
clean_data$start_station_name <- ifelse(is.na(clean_data$start_station_name),
                                        clean_data$start_missing_name,
                                        clean_data$start_station_name)

clean_data$end_station_name <- ifelse(is.na(clean_data$end_station_name),
                                      clean_data$end_missing_name,
                                      clean_data$end_station_name)

clean_data <- clean_data %>%
  select(-start_missing_name, -end_missing_name)

skim_without_charts(clean_data)

Data summary
Name	clean_data
Number of rows	5853080
Number of columns	15
_______________________
Column type frequency:
character	7
factor	2
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5853080
start_station_name	0	1.00	7	64	1322
start_station_id	832141	0.86	3	36	1319
end_station_name	0	1.00	9	64	1324
end_station_id	883829	0.85	3	36	1324
start_location	0	1.00	8	18	523817
end_location	0	1.00	8	18	2150

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
rideable_type	0	1	FALSE	3	ele: 3045957, cla: 2639317, doc: 167806
member_casual	0	1	FALSE	2	mem: 3500038, cas: 2353042

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat	1	41.90	0.05	41.55	41.88	41.90	41.93	42.37
end_lng	1	-87.65	0.03	-88.14	-87.66	-87.64	-87.63	-87.30

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-05-01 00:00:06	2023-04-30 23:59:05	2022-08-28 13:00:09	4912172
ended_at	0	1	2022-05-01 00:05:17	2023-05-01 08:06:56	2022-08-28 13:20:10	4925682

Looking at the character variable type section once more, you can see that the issue of missing station names has successfully been solved. It’s worth noting that I now have slightly more unique station names than unique IDs. This discrepancy is expected because I relied on the true_stations table, which encompassed all departure and arrival stations.

While I could address this by completing missing station IDs, it’s important to reiterate that, for the analysis phase of this study, I won’t require station IDs. Therefore, this step is unnecessary.

Unique start coordinates

However, with the introduction of the start_location and end_location columns, I identified a new anomaly in the n_unique column of the summary. There’s an unexpectedly high number of unique starting coordinates compared to arrival coordinates. Further investigation reveals that the coordinates recorded for trip departure correspond to the exact location of the rented bike, while for the arrival, they correspond to the station where the bike is parked.

To streamline data processing, I will standardize this by aligning the number of different starting coordinates with the number of starting stations. To achieve this, I’ll once again utilize the true_stations table, retaining only the essential columns, and merge it with the main dataset using a left_join. With this I’ll be able to correct coordinates as needed.

true_stations <- true_stations %>% 
  select(-station_id)

clean_data <- clean_data %>% 
  left_join(true_stations, by = c("start_station_name" = "station_name"))
start_loc_change <- !is.na(clean_data$start_location) &
  !is.na(clean_data$station_location) &
  clean_data$start_location != clean_data$station_location
clean_data$start_location[start_loc_change] <- clean_data$station_location[start_loc_change]
clean_data$start_lat[start_loc_change] <- clean_data$station_lat[start_loc_change]
clean_data$start_lng[start_loc_change] <- clean_data$station_lng[start_loc_change]
clean_data <- clean_data %>%
  select(-station_lat, -station_lng, -station_location)

skim_without_charts(clean_data)

Data summary
Name	clean_data
Number of rows	5853080
Number of columns	15
_______________________
Column type frequency:
character	7
factor	2
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5853080
start_station_name	0	1.00	7	64	1322
start_station_id	832141	0.86	3	36	1319
end_station_name	0	1.00	9	64	1324
end_station_id	883829	0.85	3	36	1324
start_location	0	1.00	8	18	2161
end_location	0	1.00	8	18	2150

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
rideable_type	0	1	FALSE	3	ele: 3045957, cla: 2639317, doc: 167806
member_casual	0	1	FALSE	2	mem: 3500038, cas: 2353042

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.65	41.88	41.90	41.93	42.06
start_lng	1	-87.65	0.03	-87.83	-87.66	-87.64	-87.63	-87.53
end_lat	1	41.90	0.05	41.55	41.88	41.90	41.93	42.37
end_lng	1	-87.65	0.03	-88.14	-87.66	-87.64	-87.63	-87.30

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-05-01 00:00:06	2023-04-30 23:59:05	2022-08-28 13:00:09	4912172
ended_at	0	1	2022-05-01 00:05:17	2023-05-01 08:06:56	2022-08-28 13:20:10	4925682

Looking at the n_unique column of the character type section, you can see that I effectively reduced the number of unique start_location by a significant amount. While I end up with more unique locations than unique stations, the difference is now way more manageable than before. Some refinement could be further implemented, but I’ve opted to stop at this point, as the dataset has reached a level of coherence that ensures a robust and reliable foundation for analysis.

Data processing

Distance between stations

Now that I am mostly done with the cleaning process, there is still a little bit of preparation that needs to be done. First, I’ll create a new column with the distance between the departure of a trip and its arrival. While this is not the distance traveled by a customer (which would be quite useful), this would still give me some useful data to work with. To do so, I will use the Haversine function to calculate the distance in meters between two coordinates.

start_lat <- clean_data$start_lat
start_lng <- clean_data$start_lng
end_lat <- clean_data$end_lat
end_lng <- clean_data$end_lng

start_lat_rad <- start_lat * pi / 180
start_lng_rad <- start_lng * pi / 180
end_lat_rad <- end_lat * pi / 180
end_lng_rad <- end_lng * pi / 180

delta_lat <- end_lat_rad - start_lat_rad
delta_lng <- end_lng_rad - start_lng_rad

a <- sin(delta_lat / 2)^2 + cos(start_lat_rad) * cos(end_lat_rad) * sin(delta_lng / 2)^2
c <- 2 * atan2(sqrt(a), sqrt(1 - a))

earth_radius <- 6371000
distances <- round((earth_radius * c), digits = -2)
clean_data$distance <- distances
summary(clean_data$distance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     900    1600    2114    2800   44000

Taking a look at the data, there are no negative values, and the maximum value is not dramatically out of bound. Therefore, I will keep this column as it is.

Trip duration

Now that I have the distances, I will need the duration of trips. To achieve this, I will use the ended_at and started_at columns to create a new column trip_duration, measured in minutes since this aligns more closely with the typical trip duration.

clean_data <- clean_data %>% 
  mutate("trip_duration" = round(difftime(ended_at, started_at, units = "mins")))
summary(as.numeric(clean_data$trip_duration))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -10353.00      6.00     10.00     15.82     18.00  32035.00

This time, there are obvious outliers: there are negative values, and the maximum value is way superior to the rest of the values. To fix this, I will simply get rid of negative values, and for the upper bound outliers, I’ll use the Empirical rule.

To do so, I will only keep values beneath 3 times the standard deviation, as it encapsulates most of the data. This is typically used in normally distributed data, but as most of the values seems to be gathered around the mean, this rule should work fine in this case.

duration_mean <- as.numeric(round(mean(clean_data$trip_duration)))
duration_sd <- round(sd(clean_data$trip_duration))
outliers_limit <- as.numeric(round(duration_mean + (duration_sd * 3)))

clean_data <- clean_data %>% 
  filter(trip_duration > 0 & trip_duration <= outliers_limit)
summary(as.numeric(clean_data$trip_duration))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   10.00   14.55   18.00  118.00

This gives a more coherent range for the duration data. After the duration of trips, I’ll also need additional data for time comparisons.

Time values

Given that the mean trip duration is approximately 15 minutes, there is limited value in comparing the departure and arrival times of each trip. Therefore, I’ll use the time values extracted from the started_at column as a reference for time. More precisely, the time values I’ll use for this analysis will be the month, the week, the day of the week and the hour of each trip.

clean_data <- clean_data %>% 
  mutate("month" = month(started_at, label = TRUE , abbr = TRUE),
         "week" = week(started_at),
         "week_day" = wday(started_at, label = TRUE, abbr = TRUE),
         "hour" = hour(started_at))

levels(clean_data$week_day)

## [1] "dim\\." "lun\\." "mar\\." "mer\\." "jeu\\." "ven\\." "sam\\."

levels(clean_data$month)

##  [1] "janv" "févr" "mars" "avr"  "mai"  "juin" "juil" "août" "sept" "oct" 
## [11] "nov"  "déc"

I now have the time data that I need to conduct this analysis. The levels of the month and week_day factors are not ordered properly, and in another language furthermore, as I am using a french version of R. So this will need some fixes.

clean_data$week_day <- fct_relevel(clean_data$week_day, "lun\\.", "mar\\.", "mer\\.", "jeu\\.", "ven\\.", "sam\\.", "dim\\.")
levels(clean_data$week_day) <- c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')
levels(clean_data$month) <- c("Jan","Feb","Mar","Apr","May","Jun",
                            "Jul","Aug","Sep","Oct","Nov","Dec")

Data preparation

Finally, I have all the data, cleaned and processed, that I need for the analysis. I will just get rid of the columns that I won’t use for this study.

processed_data <- clean_data %>% 
  select(-ride_id, -started_at, -ended_at, -start_station_id, -end_station_id, -start_location, -end_location)
head(processed_data, n=10)

## # A tibble: 10 × 14
##    rideable_type start_station_name end_station_name start_lat start_lng end_lat
##    <fct>         <chr>              <chr>                <dbl>     <dbl>   <dbl>
##  1 classic_bike  Wabash Ave & Gran… Halsted St & Ro…      41.9     -87.6    41.9
##  2 classic_bike  DuSable Lake Shor… Field Blvd & So…      41.9     -87.6    41.9
##  3 classic_bike  Clinton St & Madi… Wood St & Milwa…      41.9     -87.6    41.9
##  4 classic_bike  Clinton St & Madi… Clark St & Rand…      41.9     -87.6    41.9
##  5 classic_bike  Clinton St & Madi… Sangamon St & L…      41.9     -87.6    41.9
##  6 classic_bike  Carpenter St & Hu… Sangamon St & W…      41.9     -87.7    41.9
##  7 classic_bike  Noble St & Milwau… Wood St & Augus…      41.9     -87.7    41.9
##  8 docked_bike   Halsted St & Wrig… Southport Ave &…      41.9     -87.6    41.9
##  9 classic_bike  Clinton St & Madi… Clybourn Ave & …      41.9     -87.6    41.9
## 10 electric_bike Southport Ave & W… N Southport Ave…      41.9     -87.7    41.9
## # ℹ 8 more variables: end_lng <dbl>, member_casual <fct>, distance <dbl>,
## #   trip_duration <drtn>, month <ord>, week <dbl>, week_day <ord>, hour <int>

skim_without_charts(processed_data)

Data summary
Name	processed_data
Number of rows	5720421
Number of columns	14
_______________________
Column type frequency:
character	2
difftime	1
factor	4
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
start_station_name	0	1	7	64	0	1322	0
end_station_name	0	1	9	64	0	1324	0

Variable type: difftime

skim_variable	n_missing	complete_rate	min	max	median	n_unique
trip_duration	0	1	1 mins	118 mins	10 mins	118

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
rideable_type	1	FALSE	3	ele: 2971698, cla: 2594923, doc: 153800
member_casual	1	FALSE	2	mem: 3434651, cas: 2285770
month	1	TRUE	12	Jul: 803754, Aug: 767975, Jun: 751154, Sep: 686407
week_day	1	TRUE	7	Sat: 897714, Thu: 859869, Fri: 827711, Wed: 823873

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.65	41.88	41.90	41.93	42.06
start_lng	1	-87.65	0.03	-87.83	-87.66	-87.64	-87.63	-87.53
end_lat	1	41.90	0.05	41.59	41.88	41.90	41.93	42.15
end_lng	1	-87.65	0.03	-88.05	-87.66	-87.64	-87.63	-87.50
distance	1	2144.14	1920.46	0.00	900.00	1600.00	2800.00	43000.00
week	1	28.39	11.56	1.00	21.00	29.00	37.00	53.00
hour	1	14.21	5.03	0.00	11.00	15.00	18.00	23.00

Everything seems to be in order, therefore I will save my dataframe as a RDS file, clear my workspace, as I created a lot of variables and various objects, and I will just have to reload my dataframe to start the analysis.

saveRDS(processed_data, file = "Output/Data/processed_data.rds")

rm(list = ls())

Data analysis

Loading the Data

To start up the analysis, the first thing I need to do is to load the cleaned up dataset.

processed_data <- readRDS(file = "Output/Data/processed_data.rds")

Next up, I will set up a simple theme to standardize the graphs I’ll be creating along the analysis process.

work_theme <- theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Rideable types

Before making the first graph about the bikes, I think it’s important to start off with the difference between the number of casual users and members.

processed_data %>% 
  count(member_casual) %>% 
  pivot_wider(names_from = member_casual, values_from = n)

## # A tibble: 1 × 2
##    casual  member
##     <int>   <int>
## 1 2285770 3434651

So there are about one and a half times more members than casual riders. This is some valuable data for the analysis.

Now, for the first graph, I will be creating a circular diagram to illustrate the usage distribution among the various bike types. Also, from now on, I’ll hide the code chunks of this analysis and only show the graphs for more clarity.

So the first observable thing is : casual riders are the only ones to use docked bikes. This is logical as docked bikes are the ones unlocked from a station terminal, whereas the other ones are unlocked directly with the app of the company.

Furthermore, proportionally, casual riders are more likely to use electric bikes than members.

For the next graph, I’ll create a chart to showcase the difference in mean trip duration, for each type of bike, between casual riders and members.

Here, we can see that, on average, trips last longer with classic bikes than with electric bikes, and that casual riders tend to take longer trips than members. Also, casual riders take significantly longer rides with docked bikes than with other types of bikes.

Next, I’ll take a look at the average distance between the departure station and the arrival station across the different bike types, and user types.

Rides with electric bikes tend to cover larger distances than classic bikes, even though trips with electric bikes are shorter in duration as seen in the last chart. It’s interesting, as it’s clear that the choice of bike has a direct influence on the ride : electric bikes are better suited if you want to cover large distances in the least amount of time.

Stations

To take a look at bike stations, I will compare the favorite stations of members against those of casual riders.

The interesting thing to note here is that the favorite stations are completely different between members and casual riders. Also, as you can see on the map below, the most frequented station for members is located next to the University, which gives some valuable information about the population constituting members, most certainly students.

If you take a look at the mapping of these stations, some conclusions can be drawn : members tend to more actively use stations inside the city, which can correspond to citizens, students or workers of Chicago, while casual riders tend to prefer stations near the lake shore, which leads to thinking about tourists.

Trip duration

To explore the distribution of trip duration across users, I will start by comparing statistical duration data between members and casual riders.

processed_data %>% 
  group_by(member_casual) %>% 
  summarise(mean_duration = mean(trip_duration)) %>% 
  pivot_wider(names_from = member_casual, values_from = mean_duration)

## # A tibble: 1 × 2
##   casual        member       
##   <drtn>        <drtn>       
## 1 18.55165 mins 11.88196 mins

As seen earlier, with the comparison of trip duration among each type of bike, casual users have a longer average trip duration than members. Once correlated with the type of population of each group inferred from the mapping of station use, this makes senses : if members tend to be Chicago citizens, they might be using bikes as a simple and efficient mean of transportation, while if casual riders tend to be tourists, it makes sense that they would take their time to visit the city and wander around.

To further analyze trip duration, I’ll take a look at the distribution of rides across the different possible trip durations.

As seen in the beginning of the analysis, there are more members than casual riders, but the distribution here is interesting. Indeed, beside the fact that members have way more shorter trips, around the 10-minute mark, than casual riders, the rest of the distribution of rides is pretty similar between the two groups.

To more clearly represent this, I’ll group trips by blocks of ten minutes.

Here, I can confirm what I claimed before : members have way more shorter trips than casual users. But I can nuance my point a little bit more, as casuals also have a more uniformly distributed trip duration, opposed to members whose distribution is heavily weighted toward shorter rides.

Another interesting graph to look at, in conjuction with the distribution of trip duration, would be the distribution across different groups of trip distances.

The main conclusion that can be drawn from this graph is similar to the first claim of this analysis : there are more members than casual riders. Indeed, as the distribution across groups of distances is almost the same between members and casuals, the main difference between the two groups here is that there are more members than casuals.

Time distribution

Another important aspect of this analysis is to look at the distribution of rides across different groups of time, whether it is across the hours of the day or the months of the year.

So first I’ll take a look at the number of rides, and the trip duration across the hours of the day.

Right off the bat, I can extract two really interesting points here. First, at noon, in the evening, and during the night, both groups have a similar number of rides. However, around 8 AM and 5 PM, i.e. the start and end of office hours, members significantly outpace casual riders in bike usage. This observation supports the notion that members are likely local residents commuting between home and work. And second, while the average trip duration of members remains relatively consistent, hovering around the 10-minute mark throughout the day, casual users have significantly longer trips during the day, and shorter trips in the morning. Here again, this aligns with the assumption that casual riders are likely tourists on vacation, enjoying their ride during daylight hours.

To further explore the distribution of trips across the hours of day, I’ll take a look at it across the different days of the week.

As you can see, the distribution of rides over the day is quite different between the two groups during the week, but is almost the same during the weekend, on Saturdays and Sundays. This is really interesting as it further consolidates the assumption that members are mainly local residents, commuting to work or to school during the weekdays.

Now, speaking of days of the week, I’ll cover the distribution of trips across the week.

I can extract some useful data from those graphs too. First, I can reiterate that members use bikes more than casual users during the week, but they approximately have the same usage on the weekend. This reinforces once more that members are likely local residents, using bikes for their daily commute.
From the second graph, I can also infer that, similar to daytime patterns, the average trip duration of casual users varies throughout the week. Specifically, casual riders tend to have longer trip durations around the weekend, possibly pointing to a period of the week more adapted to leisure. In contrast, the trip duration of members remains consistently stable throughout the week.

And lastly, let’s create a graph to represent the distribution of data across the year.

Once more, I have valuable data on hand. Once again you can see that, overall, members tend to have more rides than casual users, but here you can see that the distribution varies across the year. Indeed, the disparity between the two groups is more negligible around the summer months, whereas the gap is more pronounced during the winter. This reinforces the hypothesis that casual riders are likely tourists, utilizing the Divvy service primarily during the summer holidays. The second graph further supports this point, as it clearly shows that casual riders also have longer trips around the summer than during the winter, whereas members keep a fairly stable average trip duration across the year.

Mapping

Finally, I will use Tableau to map the data distribution across the city of Chicago. If you want to take a closer look at the graphs that I did, you can directly check them out on my Tableau profile. To do this I first need to create data subsets as CSV files to work with on Tableau. I’ll use three different ones : one with the sheer number of rides, one with the trip duration, and one with the distance between the start and end stations.

With the necessary data subsets in hand, my next step is to derive meaningful insights from geographical data. First, I will visualize the distribution of ride counts across the various districts of Chicago.

Ride Count Heatmap

You can see that, for members, the most frequently used stations are concentrated in downtown and in the University district. This, in conjunction with the map of the favorite stations, reinforces the notion that the typical member is either a local student or resident, utilizing bikes for daily commutes to work or school. On the other hand, the data for casual users, while having a more prominent concentration in downtown, is more spread out. In conjunction with the overall analysis, this could align with the tourist model, suggesting that casual users start or end their journeys at various locations, with a preference for tourist-centric districts like downtown or the lake shores.

My next step will be mapping the distribution of the average trip duration per district.

Average Trip Duration Heatmap

While the difference between casual users and members in this graph is subtle, there is still valuable information to extract. - Firstly, both for members and casual riders, the districts situated further to the west and to the south show longer trip durations, which is to be expected as these areas are farther from downtown. - Second, the map for casual users appears slightly darker than the one for members, indicating that they tend to have longer trips, a pattern previously identified as characteristic of the tourist model. - Third, there is a distinct darker zone for members, running from northwest to southeast, encompassing the downtown area. This suggests that for members, the further they initiate their ride from downtown, the longer the duration of their trip. In contrast, casual users display a more uniform distribution of trip durations, implying that their trips tend to have a relatively consistent duration, regardless of their starting location.

And finally, let’s take a look at the distribution of the average ride distance per district.

Average Ride Distance Heatmap

Looking at this map, you can see that overall it is a bit darker on the members’ side than casual users’, meaning that members tend to ride for longer distances. Combined with the previous graph, this raises an interesting distinction between the two groups : members have shorter trip durations with longer ride distances, while casual users have longer trip durations with shorter ride distances. This coincides well with the distribution of the population in the two main categories : residents and tourists. Members, using bikes for daily commutes to work or school, have a clear destination in mind, prioritizing efficiency and speed. On the other hand, casual riders, using bikes to explore the city, will not have a need for efficiency or a defined destination.

Based on this three graphs, here are the distinctions between members and casual users : - members, likely local residents or students, have a preference for stations near downtown and the University, they have longer ride distances and shorter trip durations, pointing toward efficient commuting. - casual users, likely tourists, have a more spread station usage, and they also have a more uniform distribution of trip durations and ride distances, suggesting a leisurely exploration of the city.

Conclusion

After a comprehensive analysis of Divvy’s data over the past year, I found several key distinctions between casual users and members, based on their preferences and behaviors. Here are the key takeaways:

Rideable Types:

Members outnumber casual riders by about one and a half times.
Casual riders are the only ones to use docked bikes, meaning that they use directly the station terminals to unlock bikes, the others using the Divvy app.
Proportionally, casual riders are more inclined to use electric bikes compared to members.

Stations:

Favorite stations vary significantly between members and casual riders.
Members prefer stations in the city center, indicating likely residents or students.
Casual riders prefer stations near the lake shore, pointing toward a tourist demographic.

Trip Duration:

Casual users consistently have longer average trip durations compared to members.
Trip duration distribution shows members favor shorter rides, while casual users show a more uniform distribution, possibly reflecting leisurely exploration.

Time Distribution:

Bike usage among members peaks during typical commuting hours, suggesting a user base comprised of local residents or students.
Casual users have longer trips during the day, supporting the hypothesis that they are tourists enjoying daylight exploration.

Mapping:

Most members’ rides are in downtown and university districts, pointing toward daily commuting patterns.
Casual users have a more dispersed station usage, aligning with the tourist model, starting or ending journeys at various locations.

Overall Insights:

Members: Likely local residents or students, prefer efficient commuting, shorter rides, and downtown/university stations.
Casual Users: Likely tourists, prefer leisurely exploration, with more spread-out station usage and longer trip durations.

Actionable Steps

Based on these conclusions, there are multiple strategies I can think of that could increase the conversion rate, incentivizing casual users to subscribe to Divvy’s services:

Pricing and Loyalty Programs:

Introducing promotional pricing around holiday seasons or special events, to convert casual users to become members.
Creating short-term subscription plans, such as a weekly or weekend subscription.

Station Placement:

Considering expanding station networks in tourist-heavy areas to enhance convenience for casual riders.

Targeted Marketing:

Developing targeted marketing strategies for tourists to incentivize them into members.
Promoting Divvy services in tourist-centric areas and events.

By tailoring services and marketing efforts to the distinct preferences and behaviors of casual users and members alike, Divvy can optimize its offerings and contribute to increased user satisfaction and service adoption.

Closing note

This concludes this analysis. I hope that this study was insightful to you. This case study represented quite a lot of work for me, so I would like to thank you for accompanying me on this data-driven ride.

Have a great day.

DivvyBikes Case Study

Gabriel Wartelle

2024-01-12

Case Study Presentation

Analysis Summary

Introduction

Company Overview

Data Overview

Set up

Setting Up the Workspace

Loading the CSV files

Data Cleaning

Duplicates and factors

NA values and zeros

Unique station names

Missing stations names and IDs

Unique start coordinates

Data processing

Distance between stations

Trip duration

Time values

Data preparation

Data analysis

Loading the Data

Rideable types

Stations

Trip duration

Time distribution

Mapping

Conclusion

Actionable Steps

Closing note