This project aims to understand the way different users use the public bicyles system in Chicago, namely, Subscribers and Not subscribers.
The idea is to understand how can we convert casual users into subscribers.
##SUMMARY
To convert casual users into subscribers we have to understand how each kind of user uses the bikes and find a way to make subscriptions more appealing to casual users.
Casual user tend to use the bikes mostly on the weekends but there are some casual users that use the bikes for a really really long time (more than 10 days).
Members use the bikes more frequently and use mostly on weekdays while casuals use the bikes for longer periods of time and usually on long weekends, starting on thursdays.
One way to get more subscriptions is by targeting the casual users that stay in the city for more than 10 days. Charging extra for overnight use, making the subscription more atractive to this users.
Another way is by targeting tourist by selling a mini subscription that gives them the chance for using the bike for more than 5 days consecutive that is more attractive than paying for casual rides 2 or 3 times while they stay in the city.
#ANALYSIS
We need packages to read and clean the data, as well to process it. Therefore we are installing:
We are also using Geosphere in order to calculate distances from GPS data.
library(tidyverse)
library(readr)
library(janitor)
library(skimr)
library(geosphere)
We want to load the data and understand the differeces between the datasets.
Trips_2020_original <- read_csv("Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(Trips_2020_original)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Trips_2019_original <- read_csv("Divvy_Trips_2019_Q1.csv")
## Rows: 365069 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): start_time, end_time, from_station_name, to_station_name, usertype,...
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(Trips_2019_original)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
We find that there are several columns that have the same information, but different names. Also we are using janitor to clean the column names and create new datasets that we are going to process.
Trips_2019 <-Trips_2019_original %>%
rename(start_station_id = from_station_id,
start_station_name = from_station_name,
end_station_id = to_station_id,
end_station_name = to_station_name,
ride_id = trip_id,
rideable_type = bikeid)
Trips_2020<-Trips_2020_original %>%
rename(start_time=started_at, end_time=ended_at, usertype = member_casual)
Trips_2019 <-clean_names(Trips_2019)
Trips_2020 <-clean_names(Trips_2020)
We also find that the date format is in string format. We want to change it to a numerical format.
Lets change the column names and the date format
Trips_2019 <- Trips_2019 %>%
mutate(
start_time_num = ymd_hms(start_time),
end_time_num = ymd_hms(end_time),
ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
Trips_2020 <- Trips_2020 %>%
mutate(
start_time_num = ymd_hms(start_time),
end_time_num = ymd_hms(end_time))
We find there is a duration column in the 2019 dataset but it is not in the 2020 dataset. So we decided to calculate in both cases just to be sure.
Trips_2019 <- Trips_2019 %>%
mutate(ride_length = as.numeric((end_time_num - start_time_num)*60))
Trips_2020 <- Trips_2020 %>%
mutate(ride_length = as.numeric((end_time_num - start_time_num)*60))
Lets check if our calculations correspond with the 2019 dataset.
Temp <- Trips_2019 %>%
mutate(difference = tripduration - ride_length)
Now we want to differentiate subscribers forn non subscribers. for this we create a new column that is called is_member in boolean format.
Trips_2019 <- Trips_2019 %>%
mutate(is_member = (usertype == "Subscriber"))
Trips_2020 <- Trips_2020 %>%
mutate(is_member = (usertype == "member"))
We find that the stations in the 2020 datasets have the GPS adress and the 2019 don’t. Since we want to analyze the behaviour of each group we think it is important to know how long are the rides.
To insert the adresses into the 2019 dataset we create an intermediate table that has all the stations and their adresses and do a right join into the 2019 dataset.
Temp<- Trips_2020 %>%
select(start_station_name, start_lat, start_lng)
Stations <- unique(Temp)
Trips_2019_lng <- right_join(Trips_2019, Stations,by = ("start_station_name"))
Temp<- Trips_2020 %>%
select(end_station_name, end_lat, end_lng)
Stations <- unique(Temp)
Trips_2019_lng <- right_join(Trips_2019_lng, Stations,by = ("end_station_name"))
Now we want to aggregate both datasets to work with a single dataset.
all_trips <- bind_rows(Trips_2019_lng, Trips_2020)
We calculate the distance on each trip, using the geosphere package.
punto1 <- all_trips %>%
select(c(start_lng, start_lat))
punto2 <- all_trips %>%
select(c(end_lng, end_lat))
distance <- distHaversine(punto1, punto2)
distance <- as.matrix(distance)
all_trips <- all_trips %>%
mutate(distance_m = distance)
all_trips <- all_trips %>%
mutate(pace = ride_length/distance_m, velocity = distance_m/ride_length)
Now we clean all the data we don´t need. This means, removing unused columns as birthyear or the adresses and removing all the tables that we are not going to use.
all_trips <- all_trips %>%
select(-c(tripduration, start_time, end_time, usertype, birthyear, start_lat, start_lng, end_lat, end_lng, gender))
rm(distance, punto1, punto2, Stations, Temp, Trips_2019, Trips_2019_lng, Trips_2020)
We run a summary to find any more errors.
summary(all_trips)
## ride_id rideable_type start_station_id start_station_name
## Length:757027 Length:757027 Min. : 2.0 Length:757027
## Class :character Class :character 1st Qu.: 77.0 Class :character
## Mode :character Mode :character Median :173.0 Mode :character
## Mean :201.8
## 3rd Qu.:289.0
## Max. :675.0
## NA's :79
## end_station_id end_station_name start_time_num
## Min. : 2.0 Length:757027 Min. :2019-01-01 00:04:37
## 1st Qu.: 77.0 Class :character 1st Qu.:2019-03-04 15:51:59
## Median :173.0 Mode :character Median :2020-01-10 08:00:49
## Mean :201.5 Mean :2019-09-10 11:32:39
## 3rd Qu.:290.0 3rd Qu.:2020-02-21 16:28:50
## Max. :675.0 Max. :2020-03-31 23:51:34
## NA's :80 NA's :79
## end_time_num ride_length is_member
## Min. :2019-01-01 00:11:07 Min. : -33120 Mode :logical
## 1st Qu.:2019-03-04 16:09:14 1st Qu.: 583 FALSE:70297
## Median :2020-01-10 08:12:32 Median : 13440 TRUE :686651
## Mean :2019-09-10 11:52:19 Mean : 45331 NA's :79
## 3rd Qu.:2020-02-21 16:42:57 3rd Qu.: 36900
## Max. :2020-05-19 20:10:34 Max. :563221440
## NA's :79 NA's :79
## distance_m.V1 pace.V1 velocity.V1
## Min. : 0.000000000 Min. : -Inf Min. :0.0000000000000
## 1st Qu.: 850.918332521 1st Qu.: 0.369740282647 1st Qu.:0.0448952355929
## Median : 1368.496124750 Median :15.940102558200 Median :0.0627198220992
## Mean : 1834.765759490 Mean : NaN Mean :1.2389945868400
## 3rd Qu.: 2279.690639580 3rd Qu.:22.269240175500 3rd Qu.:2.7039484789400
## Max. :24453.480183000 Max. : Inf Max. :9.1424591747900
## NA's :80 NA's :173 NA's :173
Now we find that there are negative ride lengths.
We got to check it out using the filter function.
Temp <- all_trips %>%
filter(ride_length < 0)
we find that all this trips have in common they start in the HQ QR, so we remove them and run a new summary.
all_trips <- all_trips[!(all_trips$start_station_name == "HQ QR"),]
summary(all_trips) #Statistical summary of data. Mainly for numeric
## ride_id rideable_type start_station_id start_station_name
## Length:753259 Length:753259 Min. : 2.0 Length:753259
## Class :character Class :character 1st Qu.: 77.0 Class :character
## Mode :character Mode :character Median :173.0 Mode :character
## Mean :199.4
## 3rd Qu.:289.0
## Max. :673.0
## NA's :78
## end_station_id end_station_name start_time_num
## Min. : 2.0 Length:753259 Min. :2019-01-01 00:04:37
## 1st Qu.: 77.0 Class :character 1st Qu.:2019-03-04 07:45:16
## Median :172.0 Mode :character Median :2020-01-09 19:10:57
## Mean :199.1 Mean :2019-09-09 13:53:21
## 3rd Qu.:289.0 3rd Qu.:2020-02-21 07:40:23
## Max. :675.0 Max. :2020-03-31 23:51:34
## NA's :78 NA's :78
## end_time_num ride_length is_member
## Min. :2019-01-01 00:11:07 Min. : 60 Mode :logical
## 1st Qu.:2019-03-04 07:56:27 1st Qu.: 592 FALSE:66531
## Median :2020-01-09 19:24:19 Median : 13620 TRUE :686650
## Mean :2019-09-09 14:13:07 Mean : 45557 NA's :78
## 3rd Qu.:2020-02-21 07:52:05 3rd Qu.: 37080
## Max. :2020-05-19 20:10:34 Max. :563221440
## NA's :78 NA's :78
## distance_m.V1 pace.V1 velocity.V1
## Min. : 0.000000000 Min. : 0.109379761056 Min. :0.0000000000000
## 1st Qu.: 858.078446939 1st Qu.: 0.369104433652 1st Qu.:0.0451300151106
## Median : 1369.254762400 Median :15.872282096000 Median :0.0630029124956
## Mean : 1843.939773290 Mean : Inf Mean :1.2450367124000
## 3rd Qu.: 2287.234366520 3rd Qu.:22.158202197600 3rd Qu.:2.7092603307600
## Max. :24453.480183000 Max. : Inf Max. :9.1424591747900
## NA's :78 NA's :78 NA's :78
We add data for each day to analyze further in terms of days of the week, or months
all_trips$date <- as.Date(all_trips$start_time_num) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
Finally we remove al the NA to be able to do mathematical work without problems.
all_trips_v2 <- na.omit(all_trips)
We note that there are very long rides (in time), we analyze all the rides that are above the 1.440 minutes (one day), 7.200 minutes (5 days) and 144.000 minutes (100 days) to see if there is a pattern.
Temp1 <- all_trips_v2 %>%
filter(ride_length > 1440)%>%
count(is_member)
Temp5 <- all_trips_v2 %>%
filter(ride_length > 7200)%>%
count(is_member)
Temp100 <- all_trips_v2 %>%
filter(ride_length > 144000)%>%
count(is_member)
View(Temp1)
View(Temp5)
View(Temp100)
The percentage of casual users that use the same bicycle for more than a day is 12.7%, for 5 days is 10.8% and for 100 days is 69%
Now we are creating different tables for casual users and members to find more patterns.
members <- all_trips_v2 %>%
filter(is_member == "TRUE")
non_members <- all_trips_v2 %>%
filter(is_member == "FALSE")
summary(members$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 549 12300 25522 33600 337656660
summary(members$distance_m)
## V1
## Min. : 0
## 1st Qu.: 859
## Median : 1363
## Mean : 1816
## 3rd Qu.: 2223
## Max. :18588
summary(members$pace)
## V1
## Min. : 0.1094
## 1st Qu.: 0.3591
## Median :15.3093
## Mean : Inf
## 3rd Qu.:21.3042
## Max. : Inf
summary(members$velocity)
## V1
## Min. :0.00000
## 1st Qu.:0.04694
## Median :0.06532
## Mean :1.31329
## 3rd Qu.:2.78477
## Max. :9.14246
summary(non_members$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61 2376 47100 252331 105000 563221440
summary(non_members$distance_m)
## V1
## Min. : 0.0
## 1st Qu.: 794.7
## Median : 1596.4
## Mean : 2136.2
## 3rd Qu.: 2855.1
## Max. :24453.5
summary(non_members$pace)
## V1
## Min. : 0.1739
## 1st Qu.: 2.1037
## Median :25.6252
## Mean : Inf
## 3rd Qu.:82.6328
## Max. : Inf
summary(non_members$velocity)
## V1
## Min. :0.00000
## 1st Qu.:0.01210
## Median :0.03902
## Mean :0.54061
## 3rd Qu.:0.47535
## Max. :5.75140
rides_ratio = nrow(members)/nrow(non_members)
velocity_ratio = mean(members$velocity)/mean(non_members$velocity)
pace_ratio = mean(members$pace)/mean(non_members$pace)
distance_ratio = mean(members$distance_m)/mean(non_members$distance_m)
ride_length_ratio = mean(members$ride_length)/mean(non_members$ride_length)
Finding: although there are 10x more rides of members, casuals use the bike for 10x the time, members mean velocity is 2.42x the casuals mean velocity, meaning that member use the bikes for commuting and casuals take more relaxed trips, casuals ride 1.17x longer distances than members.
Lets make a plot to visualize better.
df <- data.frame(
Category = c("rides", "velocity", "distance", "time"),
Member = c(nrow(members), mean(members$velocity), mean(members$distance_m), mean(members$ride_length)),
Casual = c(nrow(non_members), mean(non_members$velocity), mean(non_members$distance_m), mean(non_members$ride_length))
)
long_data <- df %>%
pivot_longer(
cols = c(Member, Casual),
names_to = "Metric",
values_to = "Measurement"
)
key_elemnts <- ggplot(long_data, aes(x = Category, y = Measurement, fill = Metric)) +
geom_col(position = "dodge") +
labs(
title = "Comparison of key elements by Subscription",
x = "Category",
y = "Measurement",
fill = "Metric"
)+
labs(tittle = "Comparisson ")+
facet_wrap(~Category, scales = "free") # This is the key line!
ggsave ("key_elemts.png", key_elemnts)
## Saving 7 x 5 in image
There are some round trips hence the pace is 0. Lets take a look at them.
round_trips <- all_trips_v2 %>%
filter(distance_m == 0)
non_member_round_trips <- round_trips %>%
filter(is_member == "FALSE")
member_round_trips <- round_trips %>%
filter(is_member == "TRUE")
round_trip_ratio <- nrow(member_round_trips) / nrow(non_member_round_trips)
member_round_trip_ratio <- nrow(member_round_trips) / nrow(members)
non_member_round_trip_ratio <- nrow(non_member_round_trips) / nrow(non_members)
summary(member_round_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 780 2847 40129 56460 18918000
summary(non_member_round_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61 3306 87180 185429 175500 33890220
round_trip_ride_length_ratio = mean(member_round_trips$ride_length) / mean(non_member_round_trips$ride_length)
Findings: Members do 1.42x round trips the non members do. Although only 2% of members take round trips vs 14% of non-members. Non members round trips take 4.62x times longer
Lets make some plots to visualize better:
df <- data.frame(
Category = c("rides", "time"),
Member = c(nrow(member_round_trips), mean(member_round_trips$ride_length)),
Casual = c(nrow(non_member_round_trips),mean(non_member_round_trips$ride_length))
)
long_data <- df %>%
pivot_longer(
cols = c(Member, Casual),
names_to = "Metric",
values_to = "Measurement"
)
round_trips_comp<- ggplot(long_data, aes(x = Category, y = Measurement, fill = Metric)) +
geom_col(position = "dodge") +
labs(
title = "Comparison of round trips by Subscription",
x = "Category",
y = "Measurement",
fill = "Metric"
)+
labs(tittle = "Comparisson ")+
facet_wrap(~Category, scales = "free")
ggsave("round_trips.png", round_trips_comp)
## Saving 7 x 5 in image
Lets see how many round trips are overnight.
Temp1r <- round_trips %>%
filter(ride_length > 1440)%>%
count(is_member)
Temp100r <- round_trips %>%
filter(ride_length > 144000)%>%
count(is_member)
We find that overnight trips for casual are 49.9% of the total overnight round trips. Also round trips represent the 26% of all trips longer than 100 days.
Also round trips of casual users account for 15.5% of the total casual overnight trips.
Finally lets analyze by weekday
# Define the custom order of weekdays in Spanish
spanish_weekday_order <- c("lunes", "martes", "miércoles", "jueves", "viernes", "sábado", "domingo")
# Convert 'day_of_week' to a factor with the custom order
all_trips_v2_ordered <- all_trips_v2 %>%
mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))
trips_by_day <- all_trips_v2_ordered %>%
group_by(day_of_week) %>%
summarise(
# 'n()' counts the number of rows in each group, giving us the frequency.
frequency = n(),
mean_distance = mean(distance_m, na.rm = TRUE),
mean_length = mean(ride_length, na.rm = TRUE)
)
Now lets plot the key elements for each weekday
long_data <- trips_by_day %>%
pivot_longer(
cols = c(frequency, mean_distance,mean_length ),
names_to = "Metric",
values_to = "Measurement")
daily_data <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
geom_col(position = "dodge") +
labs(
title = "Daily data",
x = "Category",
y = "Measurement",
fill = "Metric"
)+
facet_wrap(~Metric, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("daily_data.png", daily_data)
## Saving 7 x 5 in image
Most of the trips are done during working day. But the longer trips are on Sundays. Meaning that most of people use them for commuting but also there is a tendency to take longer trips on Sunday for leisure.
Now we will apply the same process to members and casual users.
members_ordered <- members %>%
mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))
trips_by_day_member <- members_ordered %>%
group_by(day_of_week) %>%
summarise(
frequency = n(),
mean_distance = mean(distance_m, na.rm = TRUE),
mean_length = mean(ride_length, na.rm = TRUE)
)
long_data <- trips_by_day_member %>%
pivot_longer(
cols = c(frequency, mean_distance,mean_length ),
names_to = "Metric",
values_to = "Measurement")
members_plot <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
geom_col(position = "dodge") +
labs(
title = "Daily data Members",
x = "Category",
y = "Measurement",
fill = "Metric"
)+
facet_wrap(~Metric, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("memebers_plot.png", members_plot)
## Saving 7 x 5 in image
We can see that members use the bikes mostly to commute but they have longer trips on mondays, saturdays and specially sundays.
non_members_ordered <- non_members %>%
mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))
trips_by_day_non_member <- non_members_ordered %>%
group_by(day_of_week) %>%
summarise(
frequency = n(),
mean_distance = mean(distance_m, na.rm = TRUE),
mean_length = mean(ride_length, na.rm = TRUE)
) %>%
arrange(day_of_week)
long_data <- trips_by_day_non_member %>%
pivot_longer(
cols = c(frequency, mean_distance,mean_length ),
names_to = "Metric",
values_to = "Measurement")
casuals_plot <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
geom_col(position = "dodge") +
labs(
title = "Daily data casuals",
x = "Category",
y = "Measurement",
fill = "Metric"
)+
facet_wrap(~Metric, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("casuals_plot.png", casuals_plot)
## Saving 7 x 5 in image
We can see that the days that casuals use the bikes more are Saturdays and Sundays. And surprisingly the longest rides are taken on Thursdays and Fridays, this could be explained by tourist taking a long weekend and using the bikes more on the first days of their staying.
Casual user tend to use the bikes mostly on the weekends but thera are some casual users that use the bikes for a really really long time (more than 10 days).
Members use the bikes more frequently and use mostly on weekdays.
One way to get more subscriptions is by targeting the casual users that stay in the city for more than 10 days. Charging extra for overnight use, making the subscription more atractive to this users.
Another way is by targeting tourist by selling a mini subscription that gives them the chance for using the bike for more than 5 days consecutive that is more attractive than paying for casual rides 2 or 3 times while they stay in the city.