CHICAGO PUBLIC BICYCLE SYSTEM: SUBSCRIPTION ANALYSIS.

This project aims to understand the way different users use the public bicyles system in Chicago, namely, Subscribers and Not subscribers.

The idea is to understand how can we convert casual users into subscribers.

##SUMMARY

To convert casual users into subscribers we have to understand how each kind of user uses the bikes and find a way to make subscriptions more appealing to casual users.

Casual user tend to use the bikes mostly on the weekends but there are some casual users that use the bikes for a really really long time (more than 10 days).

Comparisson of key elements between users

Members use the bikes more frequently and use mostly on weekdays while casuals use the bikes for longer periods of time and usually on long weekends, starting on thursdays.

One way to get more subscriptions is by targeting the casual users that stay in the city for more than 10 days. Charging extra for overnight use, making the subscription more atractive to this users.

Another way is by targeting tourist by selling a mini subscription that gives them the chance for using the bike for more than 5 days consecutive that is more attractive than paying for casual rides 2 or 3 times while they stay in the city.

#ANALYSIS

LOADING PACKAGES

We need packages to read and clean the data, as well to process it. Therefore we are installing:

Tidyverse that allows to process and plot the data.
Skimr which allow us to read data and overview key information
anito that allow us to clean data easily.

We are also using Geosphere in order to calculate distances from GPS data.

library(tidyverse)
library(readr)
library(janitor)
library(skimr)
library(geosphere)

LOADING AND UNDESRSTANDING DATA

We want to load the data and understand the differeces between the datasets.

Trips_2020_original <- read_csv("Divvy_Trips_2020_Q1.csv")

## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(Trips_2020_original)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Trips_2019_original <- read_csv("Divvy_Trips_2019_Q1.csv")

## Rows: 365069 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): start_time, end_time, from_station_name, to_station_name, usertype,...
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(Trips_2019_original)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

CLEANING DATA

We find that there are several columns that have the same information, but different names. Also we are using janitor to clean the column names and create new datasets that we are going to process.

Trips_2019 <-Trips_2019_original %>%
  rename(start_station_id = from_station_id, 
         start_station_name = from_station_name, 
         end_station_id = to_station_id, 
         end_station_name = to_station_name, 
         ride_id = trip_id, 
         rideable_type = bikeid)

Trips_2020<-Trips_2020_original %>%
  rename(start_time=started_at, end_time=ended_at, usertype = member_casual)

Trips_2019 <-clean_names(Trips_2019)
Trips_2020 <-clean_names(Trips_2020)

We also find that the date format is in string format. We want to change it to a numerical format.

Lets change the column names and the date format

Trips_2019 <- Trips_2019 %>%
  mutate(
    start_time_num = ymd_hms(start_time),
    end_time_num = ymd_hms(end_time),
    ride_id = as.character(ride_id)
    ,rideable_type = as.character(rideable_type))
  
Trips_2020 <- Trips_2020 %>%
  mutate(
    start_time_num = ymd_hms(start_time),
    end_time_num = ymd_hms(end_time))

PREPARING THE DATA

We find there is a duration column in the 2019 dataset but it is not in the 2020 dataset. So we decided to calculate in both cases just to be sure.

Trips_2019 <- Trips_2019 %>%
  mutate(ride_length = as.numeric((end_time_num - start_time_num)*60))

Trips_2020 <- Trips_2020 %>%
  mutate(ride_length = as.numeric((end_time_num - start_time_num)*60))

Lets check if our calculations correspond with the 2019 dataset.

Temp <- Trips_2019 %>%
  mutate(difference = tripduration - ride_length)

Now we want to differentiate subscribers forn non subscribers. for this we create a new column that is called is_member in boolean format.

Trips_2019 <- Trips_2019 %>%
  mutate(is_member = (usertype == "Subscriber"))

Trips_2020 <- Trips_2020 %>%
  mutate(is_member = (usertype == "member"))

We find that the stations in the 2020 datasets have the GPS adress and the 2019 don’t. Since we want to analyze the behaviour of each group we think it is important to know how long are the rides.

To insert the adresses into the 2019 dataset we create an intermediate table that has all the stations and their adresses and do a right join into the 2019 dataset.

Temp<- Trips_2020 %>%
  select(start_station_name, start_lat, start_lng)

Stations <- unique(Temp)

Trips_2019_lng <- right_join(Trips_2019, Stations,by = ("start_station_name"))

Temp<- Trips_2020 %>%
  select(end_station_name, end_lat, end_lng)

Stations <- unique(Temp)

Trips_2019_lng <- right_join(Trips_2019_lng, Stations,by = ("end_station_name"))

Now we want to aggregate both datasets to work with a single dataset.

all_trips <- bind_rows(Trips_2019_lng, Trips_2020)

We calculate the distance on each trip, using the geosphere package.

punto1 <- all_trips %>% 
  select(c(start_lng, start_lat))

punto2 <- all_trips %>% 
  select(c(end_lng, end_lat))
               
distance <- distHaversine(punto1, punto2)
distance <- as.matrix(distance)

all_trips <- all_trips %>%
  mutate(distance_m = distance)

all_trips <- all_trips %>%
  mutate(pace = ride_length/distance_m, velocity = distance_m/ride_length)

Now we clean all the data we don´t need. This means, removing unused columns as birthyear or the adresses and removing all the tables that we are not going to use.

all_trips <- all_trips %>%
  select(-c(tripduration, start_time, end_time, usertype, birthyear, start_lat, start_lng, end_lat, end_lng, gender))

rm(distance, punto1, punto2, Stations, Temp, Trips_2019, Trips_2019_lng, Trips_2020)

We run a summary to find any more errors.

summary(all_trips)

##    ride_id          rideable_type      start_station_id start_station_name
##  Length:757027      Length:757027      Min.   :  2.0    Length:757027     
##  Class :character   Class :character   1st Qu.: 77.0    Class :character  
##  Mode  :character   Mode  :character   Median :173.0    Mode  :character  
##                                        Mean   :201.8                      
##                                        3rd Qu.:289.0                      
##                                        Max.   :675.0                      
##                                        NA's   :79                         
##  end_station_id  end_station_name   start_time_num               
##  Min.   :  2.0   Length:757027      Min.   :2019-01-01 00:04:37  
##  1st Qu.: 77.0   Class :character   1st Qu.:2019-03-04 15:51:59  
##  Median :173.0   Mode  :character   Median :2020-01-10 08:00:49  
##  Mean   :201.5                      Mean   :2019-09-10 11:32:39  
##  3rd Qu.:290.0                      3rd Qu.:2020-02-21 16:28:50  
##  Max.   :675.0                      Max.   :2020-03-31 23:51:34  
##  NA's   :80                         NA's   :79                   
##   end_time_num                  ride_length        is_member      
##  Min.   :2019-01-01 00:11:07   Min.   :   -33120   Mode :logical  
##  1st Qu.:2019-03-04 16:09:14   1st Qu.:      583   FALSE:70297    
##  Median :2020-01-10 08:12:32   Median :    13440   TRUE :686651   
##  Mean   :2019-09-10 11:52:19   Mean   :    45331   NA's :79       
##  3rd Qu.:2020-02-21 16:42:57   3rd Qu.:    36900                  
##  Max.   :2020-05-19 20:10:34   Max.   :563221440                  
##  NA's   :79                    NA's   :79                         
##        distance_m.V1                pace.V1                 velocity.V1       
##  Min.   :    0.000000000   Min.   :           -Inf   Min.   :0.0000000000000  
##  1st Qu.:  850.918332521   1st Qu.: 0.369740282647   1st Qu.:0.0448952355929  
##  Median : 1368.496124750   Median :15.940102558200   Median :0.0627198220992  
##  Mean   : 1834.765759490   Mean   :            NaN   Mean   :1.2389945868400  
##  3rd Qu.: 2279.690639580   3rd Qu.:22.269240175500   3rd Qu.:2.7039484789400  
##  Max.   :24453.480183000   Max.   :            Inf   Max.   :9.1424591747900  
##  NA's   :80                NA's   :173               NA's   :173

Now we find that there are negative ride lengths.

We got to check it out using the filter function.

Temp <- all_trips %>%
  filter(ride_length < 0)

we find that all this trips have in common they start in the HQ QR, so we remove them and run a new summary.

all_trips <- all_trips[!(all_trips$start_station_name == "HQ QR"),]
summary(all_trips)  #Statistical summary of data. Mainly for numeric

##    ride_id          rideable_type      start_station_id start_station_name
##  Length:753259      Length:753259      Min.   :  2.0    Length:753259     
##  Class :character   Class :character   1st Qu.: 77.0    Class :character  
##  Mode  :character   Mode  :character   Median :173.0    Mode  :character  
##                                        Mean   :199.4                      
##                                        3rd Qu.:289.0                      
##                                        Max.   :673.0                      
##                                        NA's   :78                         
##  end_station_id  end_station_name   start_time_num               
##  Min.   :  2.0   Length:753259      Min.   :2019-01-01 00:04:37  
##  1st Qu.: 77.0   Class :character   1st Qu.:2019-03-04 07:45:16  
##  Median :172.0   Mode  :character   Median :2020-01-09 19:10:57  
##  Mean   :199.1                      Mean   :2019-09-09 13:53:21  
##  3rd Qu.:289.0                      3rd Qu.:2020-02-21 07:40:23  
##  Max.   :675.0                      Max.   :2020-03-31 23:51:34  
##  NA's   :78                         NA's   :78                   
##   end_time_num                  ride_length        is_member      
##  Min.   :2019-01-01 00:11:07   Min.   :       60   Mode :logical  
##  1st Qu.:2019-03-04 07:56:27   1st Qu.:      592   FALSE:66531    
##  Median :2020-01-09 19:24:19   Median :    13620   TRUE :686650   
##  Mean   :2019-09-09 14:13:07   Mean   :    45557   NA's :78       
##  3rd Qu.:2020-02-21 07:52:05   3rd Qu.:    37080                  
##  Max.   :2020-05-19 20:10:34   Max.   :563221440                  
##  NA's   :78                    NA's   :78                         
##        distance_m.V1                pace.V1                 velocity.V1       
##  Min.   :    0.000000000   Min.   : 0.109379761056   Min.   :0.0000000000000  
##  1st Qu.:  858.078446939   1st Qu.: 0.369104433652   1st Qu.:0.0451300151106  
##  Median : 1369.254762400   Median :15.872282096000   Median :0.0630029124956  
##  Mean   : 1843.939773290   Mean   :            Inf   Mean   :1.2450367124000  
##  3rd Qu.: 2287.234366520   3rd Qu.:22.158202197600   3rd Qu.:2.7092603307600  
##  Max.   :24453.480183000   Max.   :            Inf   Max.   :9.1424591747900  
##  NA's   :78                NA's   :78                NA's   :78

We add data for each day to analyze further in terms of days of the week, or months

all_trips$date <- as.Date(all_trips$start_time_num) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Finally we remove al the NA to be able to do mathematical work without problems.

all_trips_v2 <- na.omit(all_trips)

Analysis

We note that there are very long rides (in time), we analyze all the rides that are above the 1.440 minutes (one day), 7.200 minutes (5 days) and 144.000 minutes (100 days) to see if there is a pattern.

Temp1 <- all_trips_v2 %>%
  filter(ride_length > 1440)%>%
  count(is_member)

Temp5 <- all_trips_v2 %>%
  filter(ride_length > 7200)%>%
  count(is_member)


Temp100 <- all_trips_v2 %>%
  filter(ride_length > 144000)%>%
  count(is_member)

View(Temp1)
View(Temp5)
View(Temp100)

The percentage of casual users that use the same bicycle for more than a day is 12.7%, for 5 days is 10.8% and for 100 days is 69%

Now we are creating different tables for casual users and members to find more patterns.

members <- all_trips_v2 %>%
  filter(is_member == "TRUE")


non_members <- all_trips_v2 %>%
  filter(is_member == "FALSE")

summary(members$ride_length)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##        60       549     12300     25522     33600 337656660

summary(members$distance_m)

##        V1       
##  Min.   :    0  
##  1st Qu.:  859  
##  Median : 1363  
##  Mean   : 1816  
##  3rd Qu.: 2223  
##  Max.   :18588

summary(members$pace)

##        V1         
##  Min.   : 0.1094  
##  1st Qu.: 0.3591  
##  Median :15.3093  
##  Mean   :    Inf  
##  3rd Qu.:21.3042  
##  Max.   :    Inf

summary(members$velocity)

##        V1         
##  Min.   :0.00000  
##  1st Qu.:0.04694  
##  Median :0.06532  
##  Mean   :1.31329  
##  3rd Qu.:2.78477  
##  Max.   :9.14246

summary(non_members$ride_length)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##        61      2376     47100    252331    105000 563221440

summary(non_members$distance_m)

##        V1         
##  Min.   :    0.0  
##  1st Qu.:  794.7  
##  Median : 1596.4  
##  Mean   : 2136.2  
##  3rd Qu.: 2855.1  
##  Max.   :24453.5

summary(non_members$pace)

##        V1         
##  Min.   : 0.1739  
##  1st Qu.: 2.1037  
##  Median :25.6252  
##  Mean   :    Inf  
##  3rd Qu.:82.6328  
##  Max.   :    Inf

summary(non_members$velocity)

##        V1         
##  Min.   :0.00000  
##  1st Qu.:0.01210  
##  Median :0.03902  
##  Mean   :0.54061  
##  3rd Qu.:0.47535  
##  Max.   :5.75140

rides_ratio = nrow(members)/nrow(non_members)
velocity_ratio = mean(members$velocity)/mean(non_members$velocity)
pace_ratio = mean(members$pace)/mean(non_members$pace)
distance_ratio = mean(members$distance_m)/mean(non_members$distance_m)
ride_length_ratio = mean(members$ride_length)/mean(non_members$ride_length)

Finding: although there are 10x more rides of members, casuals use the bike for 10x the time, members mean velocity is 2.42x the casuals mean velocity, meaning that member use the bikes for commuting and casuals take more relaxed trips, casuals ride 1.17x longer distances than members.

Lets make a plot to visualize better.

df <- data.frame(
  Category = c("rides", "velocity", "distance", "time"),
  Member = c(nrow(members), mean(members$velocity),  mean(members$distance_m), mean(members$ride_length)),
  Casual = c(nrow(non_members), mean(non_members$velocity), mean(non_members$distance_m), mean(non_members$ride_length))
)

long_data <- df %>%
  pivot_longer(
    cols = c(Member, Casual),
    names_to = "Metric",
    values_to = "Measurement"
  )

key_elemnts <- ggplot(long_data, aes(x = Category, y = Measurement, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(
    title = "Comparison of key elements by Subscription",
    x = "Category",
    y = "Measurement",
    fill = "Metric"
  )+
  labs(tittle = "Comparisson ")+
  facet_wrap(~Category, scales = "free") # This is the key line!

ggsave ("key_elemts.png", key_elemnts)

## Saving 7 x 5 in image

There are some round trips hence the pace is 0. Lets take a look at them.

round_trips <- all_trips_v2 %>%
  filter(distance_m == 0)

non_member_round_trips <- round_trips %>%
  filter(is_member == "FALSE")

member_round_trips <- round_trips %>%
  filter(is_member == "TRUE")

round_trip_ratio <- nrow(member_round_trips) / nrow(non_member_round_trips)
member_round_trip_ratio <- nrow(member_round_trips) / nrow(members)
non_member_round_trip_ratio <- nrow(non_member_round_trips) / nrow(non_members)

summary(member_round_trips$ride_length)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       60      780     2847    40129    56460 18918000

summary(non_member_round_trips$ride_length)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       61     3306    87180   185429   175500 33890220

round_trip_ride_length_ratio = mean(member_round_trips$ride_length) / mean(non_member_round_trips$ride_length)

Findings: Members do 1.42x round trips the non members do. Although only 2% of members take round trips vs 14% of non-members. Non members round trips take 4.62x times longer

Lets make some plots to visualize better:

df <- data.frame(
  Category = c("rides", "time"),
  Member = c(nrow(member_round_trips), mean(member_round_trips$ride_length)),
  Casual = c(nrow(non_member_round_trips),mean(non_member_round_trips$ride_length))
)

long_data <- df %>%
  pivot_longer(
    cols = c(Member, Casual),
    names_to = "Metric",
    values_to = "Measurement"
  )

round_trips_comp<- ggplot(long_data, aes(x = Category, y = Measurement, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(
    title = "Comparison of round trips by Subscription",
    x = "Category",
    y = "Measurement",
    fill = "Metric"
  )+
  labs(tittle = "Comparisson ")+
  facet_wrap(~Category, scales = "free")

ggsave("round_trips.png", round_trips_comp)

## Saving 7 x 5 in image

Lets see how many round trips are overnight.

Temp1r <- round_trips %>%
  filter(ride_length > 1440)%>%
  count(is_member)

Temp100r <- round_trips %>%
  filter(ride_length > 144000)%>%
  count(is_member)

We find that overnight trips for casual are 49.9% of the total overnight round trips. Also round trips represent the 26% of all trips longer than 100 days.

Also round trips of casual users account for 15.5% of the total casual overnight trips.

Finally lets analyze by weekday

# Define the custom order of weekdays in Spanish
spanish_weekday_order <- c("lunes", "martes", "miércoles", "jueves", "viernes", "sábado", "domingo")

# Convert 'day_of_week' to a factor with the custom order
all_trips_v2_ordered <- all_trips_v2 %>%
  mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))

trips_by_day <- all_trips_v2_ordered %>%
  group_by(day_of_week) %>%
  summarise(
    # 'n()' counts the number of rows in each group, giving us the frequency.
    frequency = n(),
    mean_distance = mean(distance_m, na.rm = TRUE),
    mean_length = mean(ride_length, na.rm = TRUE)
  )

Now lets plot the key elements for each weekday

long_data <- trips_by_day %>%
  pivot_longer(
    cols = c(frequency, mean_distance,mean_length ),
    names_to = "Metric",
    values_to = "Measurement")

daily_data <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
  geom_col(position = "dodge") +
  labs(
    title = "Daily data",
    x = "Category",
    y = "Measurement",
    fill = "Metric"
  )+
  facet_wrap(~Metric, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("daily_data.png", daily_data)

## Saving 7 x 5 in image

Most of the trips are done during working day. But the longer trips are on Sundays. Meaning that most of people use them for commuting but also there is a tendency to take longer trips on Sunday for leisure.

Now we will apply the same process to members and casual users.

members_ordered <- members %>%
  mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))

trips_by_day_member <- members_ordered %>%
  group_by(day_of_week) %>%
  summarise(
    frequency = n(),
    mean_distance = mean(distance_m, na.rm = TRUE),
    mean_length = mean(ride_length, na.rm = TRUE)
  )

long_data <- trips_by_day_member %>%
  pivot_longer(
    cols = c(frequency, mean_distance,mean_length ),
    names_to = "Metric",
    values_to = "Measurement")

members_plot <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
  geom_col(position = "dodge") +
  labs(
    title = "Daily data Members",
    x = "Category",
    y = "Measurement",
    fill = "Metric"
  )+
  facet_wrap(~Metric, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("memebers_plot.png", members_plot)

## Saving 7 x 5 in image

We can see that members use the bikes mostly to commute but they have longer trips on mondays, saturdays and specially sundays.

non_members_ordered <- non_members %>%
  mutate(day_of_week = factor(day_of_week, levels = spanish_weekday_order))

trips_by_day_non_member <- non_members_ordered %>%
  group_by(day_of_week) %>%
  summarise(
    frequency = n(),
    mean_distance = mean(distance_m, na.rm = TRUE),
    mean_length = mean(ride_length, na.rm = TRUE)
  ) %>%
  arrange(day_of_week)

long_data <- trips_by_day_non_member %>%
  pivot_longer(
    cols = c(frequency, mean_distance,mean_length ),
    names_to = "Metric",
    values_to = "Measurement")

casuals_plot <- ggplot(long_data, aes(x = day_of_week, y = Measurement, fill = day_of_week)) +
  geom_col(position = "dodge") +
  labs(
    title = "Daily data casuals",
    x = "Category",
    y = "Measurement",
    fill = "Metric"
  )+
  facet_wrap(~Metric, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("casuals_plot.png", casuals_plot)

## Saving 7 x 5 in image

We can see that the days that casuals use the bikes more are Saturdays and Sundays. And surprisingly the longest rides are taken on Thursdays and Fridays, this could be explained by tourist taking a long weekend and using the bikes more on the first days of their staying.

Conclusions:

Casual user tend to use the bikes mostly on the weekends but thera are some casual users that use the bikes for a really really long time (more than 10 days).

Members use the bikes more frequently and use mostly on weekdays.