1 - Introduction

This case study is the capstone project for the Google Data Analytics Certificate Course on Coursera. Though the scenario and company in the project are fictional, the data sets included are real and provided for public consumption courtesy of Lyft / Divvy / City of Chicago (and Evanston) and all of them can be found here.

The data used is anonymized by Divvy and does not include any personally identifiable information. The data sets are stored on Divvy servers, in archived .csv documents, with each .csv file containing a month’s worth of user trips for all trips after April 2020. Data prior to April 2020 is broken down by yearly quarters. Divvy updates the data once a month.

The Coursera project challenge is to analyze the last 12 months of data in order to identify differences in how casual users and annual members use the Cyclistic bike share service, in order to strategically deploy marketing and grow annual user subscriptions.

I analyzed data from November 2021 through October 2022. All of the data used in this study was downloaded on November 11, 2022. I performed my analysis using R in RStudio and all data cleaning and manipulation was done non-destructively in the RStudio environment.

Below, I describe how I handled importing and processing the data sets, performed analysis, presented the findings, and concluded with recommendations for next steps. I have shown all my code and I’m always open to feedback on how to improve it. Thank you!

2 - Setup

2.1 Tools

R packages I used for the study:

library(tidyverse)  # tidying + wrangling data
library(scales)     # number formatting
library(lubridate)  # date+time formatting
library(hms)        # time formatting
library(leaflet)    # maps!

2.2 Importing the .csv files into RStudio

I batch-imported all .csv files using data.table’s fread into a single data table.

cyc_trips <-
  dir(path =".",
      pattern = "*.csv") %>%
  map(data.table::fread) %>%
  reduce(rbind)

When checking the structure of the imported data, we can see that all trips are sorted by a unique ID, and each trip includes data about a type of bike used, whether the user was casual or an annual member, the date and time the trip started and ended, and the location where the trip started and ended. That’s a good start, but before analyzing further I would need to ensure there are no bad or missing entries which could affect the interpretation.

str(cyc_trips)  # check structure

## Classes 'data.table' and 'data.frame':   5755694 obs. of  13 variables:
##  $ ride_id           : chr  "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct, format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
##  $ ended_at          : POSIXct, format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 42 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ end_lat           : num  42 41.9 42 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
##  - attr(*, ".internal.selfref")=<externalptr>

3 - Cleanup and Manipulation

3.1 Missing Data

I started by checking for missing data. After removing all rows with NA and empty string values, I will be able to analyze the remaining 76.6% of the data.

# replace all "", (empty strings) with NA, which is easier to wrangle
cyc_trips <- cyc_trips %>% 
  mutate_if(is.character, ~na_if(.,""))

# count incomplete cases (missing/NA values)
# divide by all rows to get percentage of remaining observations
# neatly print percentage of complete data
rows_dirty <- sum(!complete.cases(cyc_trips))
rows_clean <- sum(complete.cases(cyc_trips))
rows_all <- nrow(cyc_trips)
paste0("Remaining Rows:", label_percent(accuracy=.01)(rows_clean/rows_all))

## [1] "Remaining Rows:76.63%"

cyc_trips <- na.omit(cyc_trips) # recreate the data.table with NA rows omitted

3.2 Rideable Types

The dataset included three bike types. I confirmed with Divvy’s data team that only two types of rides are available, “classic_bike” and “electric_bike”, and that “docked_bike” is effectively “classic_bike”. I renamed all “docked_bike” entries to “classic_bike” for a more consistent analysis of results.

cyc_trips  %>%
  group_by(rideable_type) %>%
  summarize(number_of_trips = n())

## # A tibble: 3 × 2
##   rideable_type number_of_trips
##   <chr>                   <int>
## 1 classic_bike          2633599
## 2 docked_bike            179635
## 3 electric_bike         1597204

# rename all docked_bike to classic_bike
cyc_trips$rideable_type[cyc_trips$rideable_type == "docked_bike"] <-
  "classic_bike"

cyc_trips  %>%
  group_by(rideable_type) %>%
  summarize(number_of_trips = n())

## # A tibble: 2 × 2
##   rideable_type number_of_trips
##   <chr>                   <int>
## 1 classic_bike          2813234
## 2 electric_bike         1597204

3.3 Times and Dates

The data set combined times and dates into “started_at” and “ended_at” columns. I extracted the values into separate columns for more descriptive analysis and grouping of data. I made extra columns here that I wanted to have available to analysis, but I will remove them as I optimize the code.

started <- cyc_trips$started_at
ended <- cyc_trips$ended_at

# split into groups for easier subsetting
cyc_trips <- cyc_trips %>% mutate(
  started_date = as.Date (started),
  started_day = wday(started, label = TRUE),
  started_month = month(started, label = TRUE),
  started_time = as_hms(started),
  started_hour = hour(started),
  started_min = minute(started),
  ended_time = as_hms(ended),
  ended_hour = hour(ended),
  ended_min = minute(ended),
  trip_dur_sec = duration(as.numeric(ended - started))
) %>%
  relocate(member_casual, .after = rideable_type)

str(cyc_trips)

## Classes 'data.table' and 'data.frame':   4410438 obs. of  23 variables:
##  $ ride_id           : chr  "4CA9676997DAFFF6" "F3E84A230AF2D676" "A1F2C92308007968" "9B871C3B14E9BEC4" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
##  $ started_at        : POSIXct, format: "2021-11-26 10:27:28" "2021-11-15 09:35:03" ...
##  $ ended_at          : POSIXct, format: "2021-11-26 11:22:13" "2021-11-15 09:42:08" ...
##  $ start_station_name: chr  "Michigan Ave & Oak St" "Clark St & Grace St" "Leamington Ave & Hirsch St" "Desplaines St & Kinzie St" ...
##  $ start_station_id  : chr  "13042" "TA1307000127" "307" "TA1306000003" ...
##  $ end_station_name  : chr  "Michigan Ave & Oak St" "Clark St & Leland Ave" "Leamington Ave & Hirsch St" "Desplaines St & Kinzie St" ...
##  $ end_station_id    : chr  "13042" "TA1309000014" "307" "TA1306000003" ...
##  $ start_lat         : num  41.9 42 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.7 -87.8 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.8 -87.6 -87.6 ...
##  $ started_date      : Date, format: "2021-11-26" "2021-11-15" ...
##  $ started_day       : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 6 2 4 3 7 5 3 6 5 1 ...
##  $ started_month     : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ started_time      : 'hms' num  10:27:28 09:35:03 16:27:02 19:51:36 ...
##   ..- attr(*, "units")= chr "secs"
##  $ started_hour      : int  10 9 16 19 19 11 22 16 8 20 ...
##  $ started_min       : int  27 35 27 51 14 58 14 48 40 29 ...
##  $ ended_time        : 'hms' num  11:22:13 09:42:08 17:04:28 20:11:17 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ended_hour        : int  11 9 17 20 19 12 22 16 8 20 ...
##  $ ended_min         : int  22 42 4 11 33 8 44 53 48 33 ...
##  $ trip_dur_sec      :Formal class 'Duration' [package "lubridate"] with 1 slot
##   .. ..@ .Data: num  3285 425 2246 1181 1149 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.4 Duration Edge Cases

With calculable durations, I came across three types of edge scenarios that would impact the results:

Trips over 24 hours long
Trips with negative durations
Trips that were less than a minute long

# there are 220 trips over 24hrs
cyc_trips %>% 
  filter(trip_dur_sec > 86400) %>%
  summarize(n = n())

##     n
## 1 220

# there are 74 trips with negative durations
cyc_trips %>% 
  filter(trip_dur_sec < 0) %>% 
  summarize(n = n())

##    n
## 1 74

# there are 75206 trips less than a minute long
cyc_trips %>%
  filter(trip_dur_sec >= 0 & trip_dur_sec < 60) %>% 
  summarize(n = n())

##       n
## 1 75206

Users that do not return their bikes within 24 hours may be charged a lost or stolen bike fee of $1,200 plus tax (source). There were enough trips started between November 2021 and October 2022 to warrant potentially charging $264,000 in fines, so it may be worth looking at separately.

The amount of negative-duration trips was negligible and I didn’t want to investigate it further but I imagine something with leap years, unix time goofs, etc.

I decided to drop all the rows containing both of these.

Finally, for trips between 0-60 seconds long, Divvy support acknowledged they don’t categorize or track specific reasons for those short trips, and speculated they may be due to users re-docking bikes to make sure they’re secure. As an ex-Chicago biker, I imagine there were also a number of cases where bikes had flats or other damage that has not been serviced yet, and the user only discovered this after taking the bike out and getting on before quickly returning it.

I decided to keep those shorter trips in there since I can also imagine someone downtown Chicago taking a bike out for a quick ride down the block which may take less than a minute, and Divvy does round up each ride up to 1 minute, so there is some potential revenue there.

Dealing with the edge cases still let me proceed with 76.6% of the data, and I felt comfortable with analyzing the rest.

# filtering out all rows that are less than 0 seconds, and longer than 24h (86400 seconds)
cyc_trips <- cyc_trips %>% 
  filter(trip_dur_sec >= 0 & trip_dur_sec < 86400)

#count current clean rows vs original number of rows
rows_clean2 <- nrow(cyc_trips)
paste0("Remaining Rows:", label_percent(accuracy=.01)(rows_clean2/rows_all)) #neatly print percentage

## [1] "Remaining Rows:76.62%"

4 - Annual Members & Casual Users

With a cleaner data set, we can start to look for trends and patterns. The goal of the marketing campaign in this study is to convert casual users into annual members, so I want to examine the habits of both types of users and find when and where there may be opportunities to target advertising or create incentives.

For reference, here’s a current offering of Divvy’s plan options:

Since the publicly provided data is anonymized, it is impossible to tie the number of rides to individual users. It is also impossible to tell which rides by casual users used a day pass. While Divvy likely has more data available internally, it’s still possible to make more general observations and identify the best times and places to deploy marketing.

5 - Analysis: Number of Trips

First, let’s take a look at the number of the rides taken by users and casual members over the last year, broken down by month, day of week, and time of day.

5.1 Month

Casual users started significantly more rides in May than the months prior, and the trip count remained high until October. June through August were the peak months for casual riders and may be the best months to target advertising.

ntrips_monthly_memvcas <- cyc_trips %>%
  group_by(started_month,
           member_casual) %>%
  summarize(n = n())

ntrips_monthly_memvcas %>%
  ggplot(aes(
    x = started_month,
    y = n,
    fill = member_casual,
    group = member_casual
  )) +
  geom_col(width = .7, position = position_dodge(width = .8)) +
  #labels
  labs(title = "Number of Trips Started - Month",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  scale_y_continuous(breaks = breaks_width(50000),
                     expand = expansion(mult = c(0, .1))) +
  #months, reversed scale
  scale_x_discrete(expand = expansion(add = c(1, 1)),
                   limits = rev(
                     c(
                       "Nov",
                       "Dec",
                       "Jan",
                       "Feb",
                       "Mar",
                       "Apr",
                       "May",
                       "Jun",
                       "Jul",
                       "Aug",
                       "Sep",
                       "Oct"
                     )
                   )) +
  geom_text(
    aes(
      x = started_month,
      y = n,
      label = number(n, accuracy = 1, scale_cut = cut_short_scale()),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  theme_minimal() +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  guides(fill = guide_legend(reverse = TRUE),
         # THIS REVERSES THE LEGEND ORDER!
         color = guide_legend(reverse = TRUE)) +
  coord_flip(clip = "off") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium", size = 18),
    axis.line.y = element_line(lineend = "round", color = "#5D5D5D"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.y = element_line(linetype = "dotted", color = "#dfe3ee"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

5.2 Day

We can also see that while annual members ride more Monday - Friday, casual users outnumber members on the weekends. Advertising targeting Saturday and Sunday would catch the most casual users.

ntrips_daily_memvcas <- cyc_trips %>%
  group_by(started_day,
           member_casual) %>%
  summarize(n = n())

ntrips_daily_memvcas %>%
  ggplot(aes(
    x = started_day,
    y = n,
    fill = member_casual,
    group = member_casual
  )) +
  geom_col(width = .7, position = position_dodge(width = .8)) +
  #labels
  labs(title = "Number of Trips Started - Day",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  scale_y_continuous(breaks = breaks_width(50000),
                     expand = expansion(mult = c(0, .1))) +
  scale_x_discrete(expand = expansion(add = c(1, 1)),
                   limits = rev(c(
                     "Mon",
                     "Tue",
                     "Wed",
                     "Thu",
                     "Fri",
                     "Sat",
                     "Sun"
                   ))) +
  geom_text(
    aes(
      x = started_day,
      y = n,
      label = number(n, accuracy = 1, scale_cut = cut_short_scale()),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  theme_minimal() +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  # THIS REVERSES THE LEGEND ORDER!
  guides(fill = guide_legend(reverse = TRUE),
         color = guide_legend(reverse = TRUE)) +
  coord_flip(clip = "off") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line.y = element_line(lineend = "round", color = "#5D5D5D"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.y = element_line(linetype = "dotted", color = "#dfe3ee"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

5.3 Time of Day (Whole Week)

When we look at times of day, we can see a spike of usage for annual members followed by a dip around 8:00 and right after, however casual usage rises gradually throughout the day, peaking at 17:00 (5pm). This is particularly apparent from Monday through Friday.

On Saturday and Sunday, the usage of both groups gradually rises throughout the day and we can see the heaviest usage between 12:00 and 16:00 (4pm), with heaviest usage for casual users around 15:00 (3pm).

This suggests advertising should target mid-day and late-afternoon to reach casual users.

ntrips_hourly_memvcas <- cyc_trips %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(n = n())

ntrips_hourly_memvcas %>%
  ggplot(
    aes(
      x = started_hour,
      y = n,
      color = member_casual,
      fill = member_casual,
      group = member_casual,
    )
  ) +
  geom_col(width = .4, position = position_dodge(width = .5)) +
  geom_text(
    aes(
      x = started_hour,
      y = n,
      label = number(n, accuracy = .1, scale_cut = cut_short_scale()),
      hjust = -.3,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .5),
    size = 3
  ) +
  labs(title = "Number of Trips - Time of Day",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  # THIS REVERSES THE LEGEND ORDER!
  guides(fill = guide_legend(reverse = TRUE),
         color = guide_legend(reverse = TRUE)) +
  coord_flip(clip = "off") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line.y = element_line(lineend = "round", color = "#5D5D5D"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    panel.grid = element_blank(),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

5.4 Time of Day (Mon - Fri)

This suggests advertising should target mid-day and late-afternoon to reach casual users.

ntrips_hourly_memvcas <- cyc_trips %>%
  filter(!(started_day == "Sat" | started_day == "Sun")) %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(n = n())

ntrips_hourly_memvcas %>%
  ggplot(
    aes(
      x = started_hour,
      y = n,
      color = member_casual,
      fill = member_casual,
      group = member_casual,
    )
  ) +
  geom_col(width = .4, position = position_dodge(width = .5)) +
  geom_text(
    aes(
      x = started_hour,
      y = n,
      label = number(n, accuracy = .1, scale_cut = cut_short_scale()),
      hjust = -.3,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .5),
    size = 3
  ) +
  labs(title = "Number of Trips - Time of Day (Mon - Fri)",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  # THIS REVERSES THE LEGEND ORDER!
  guides(fill = guide_legend(reverse = TRUE),
         color = guide_legend(reverse = TRUE)) +
  coord_flip(clip = "off") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line.y = element_line(lineend = "round", color = "#5D5D5D"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    panel.grid = element_blank(),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

5.5 Time of Day (Sat - Sun)

This suggests advertising should target mid-day and late-afternoon to reach casual users.

ntrips_hourly_memvcas <- cyc_trips %>%
  filter(started_day == "Sat" | started_day == "Sun") %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(n = n())

ntrips_hourly_memvcas %>%
  ggplot(
    aes(
      x = started_hour,
      y = n,
      color = member_casual,
      fill = member_casual,
      group = member_casual,
    )
  ) +
  geom_col(width = .4, position = position_dodge(width = .5)) +
  geom_text(
    aes(
      x = started_hour,
      y = n,
      label = number(n, accuracy = .1, scale_cut = cut_short_scale()),
      hjust = -.3,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .5),
    size = 3
  ) +
  labs(title = "Number of Trips - Time of Day (Sat - Sun)",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  # THIS REVERSES THE LEGEND ORDER!
  guides(fill = guide_legend(reverse = TRUE),
         color = guide_legend(reverse = TRUE)) +
  coord_flip(clip = "off") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line.y = element_line(lineend = "round", color = "#5D5D5D"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    panel.grid = element_blank(),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

6 - Analysis: Average Trip Length

Next, let’s see the average trip lengths to find opportunities where casual riders take the longest trip, on average.

6.1 Costs

I wrote a quick calculation for referencing the cost of trips for casual users that you can review as you check out the average trip durations. I’ll summarize the average trip duration for casual users and members here, and the data is broken down further in the following tabs.

avg_trip_duration <- cyc_trips %>% 
  group_by(member_casual) %>% 
  summarize(trip_duration = seconds_to_period(ceiling(mean(trip_dur_sec))))

knitr::kable(avg_trip_duration, col.names=c("Rider Type", "Average Trip Duration"))

Rider Type	Average Trip Duration
casual	23M 55S
member	12M 27S

As a reminder, Annual membership runs $119 for the year and members ride for free for the first 45 minutes.
Casual users pay $1 to unlock the bike and then .16¢/minute thereafter.
A day pass for casual users costs $8 and allows for unlimited trips for up to 3 hours.

We can see that casual users’ average trip length rounds up to 24 minutes, which is costing those users $4.84 per trip.

The average casual user would have to ride 25+ times in one year for the membership to beat the membership price.

\[25 * \$4.84 = \$121\]

The membership also becomes worth it when compared to the cost of 8 Day Passes.

\[8 * \$15 = \$120\]

day_in_mins <- (0:1440)
cost_casual <- (1 + day_in_mins * .16)
cost_member <- (-7.20 + day_in_mins * .16)

cyc_costs <-
  data.frame(
    cost_casual = cost_casual,
    cost_member = cost_member
  ) %>%
  mutate(trip_dur = seconds_to_period(duration(day_in_mins, unit = "min")),.before=cost_casual)
  
cyc_costs$cost_member[cyc_costs$cost_member < 0] <- 0
cyc_costs <- cyc_costs %>%
  mutate(cost_member = label_dollar()(cost_member)) %>%
  mutate(cost_casual = label_dollar()(cost_casual))

rmarkdown::paged_table(cyc_costs)

6.2 Month

By looking at the average ride length, we can identify when casual users take the longest rides and target advertising there, as those users would see the biggest benefit from the membership price incentive of unlimited rides.

It seems that every month, casual users take the longest trips in January, March, and May. By getting a membership earlier on in the year, casual users will reap a larger benefit throughout the rest of the year.

avglength_monthly_memvcas <- cyc_trips %>%
  group_by(started_month,
           member_casual) %>%
  summarize(mean = ceiling(mean(trip_dur_sec)))

avglength_monthly_memvcas %>%
  ggplot(aes(x = mean,
             y = started_month,
             fill = member_casual)) +
  geom_col(aes(fill = member_casual),
           width = .7,
           position = position_dodge(width = .8)) +
  geom_text(
    aes(
      x = mean,
      y = started_month,
      label = seconds_to_period(mean),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  coord_cartesian(clip = "off") +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_x_time(breaks = breaks_width("5 min"),
               #minor_breaks = breaks_width("1 min"),
               expand = expansion(add = c(0, 1))) +
  scale_y_discrete(expand = expansion(add = c(1, 1)),
                   limits = rev(
                     c(
                       "Nov",
                       "Dec",
                       "Jan",
                       "Feb",
                       "Mar",
                       "Apr",
                       "May",
                       "Jun",
                       "Jul",
                       "Aug",
                       "Sep",
                       "Oct"
                     )
                   )) +
  theme_minimal() +
  labs(title = "Average Trip Length - Month",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  xlab(label = "Trip Length") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line = element_line(lineend = "butt", color = "#5D5D5D"),
    axis.line.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

6.3 Day

Somewhat unsurprisingly, casual users ride longer on the weekends. Saturday - Monday rides are 4-6 minutes longer than the other days of the week. Monday seems odd!

avglength_monthly_memvcas <- cyc_trips %>%
  group_by(started_day,
           member_casual) %>%
  summarize(mean = ceiling(mean(trip_dur_sec)))

avglength_monthly_memvcas %>%
  ggplot(aes(x = mean,
             y = started_day,
             fill = member_casual)) +
  geom_col(aes(fill = member_casual),
           width = .7,
           position = position_dodge(width = .8)) +
  geom_text(
    aes(
      x = mean,
      y = started_day,
      label = seconds_to_period(mean),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  coord_cartesian(clip = "off") +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_x_time(breaks = breaks_width("5 min"),
               #minor_breaks = breaks_width("1 min"),
               expand = expansion(add = c(0, 1))) +
  scale_y_discrete(expand = expansion(add = c(1, 1)),
                   limits = rev(c(
                     "Mon",
                     "Tue",
                     "Wed",
                     "Thu",
                     "Fri",
                     "Sat",
                     "Sun"
                   ))) +
  theme_minimal() +
  labs(title = "Average Trip Length - Day",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  xlab(label = "Trip Length") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line = element_line(lineend = "butt", color = "#5D5D5D"),
    axis.line.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.x = element_line(linetype = "solid", color="#e7e7e7"),
    #panel.grid.minor.x = element_line(linetype = "dotted", color="#cfcfcf"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

6.4 Time of Day (Whole Week)

When examining casual trip lengths for the whole week, there are two times where trips are longer than most of the day, around 1:00 and 11:00.

avglength_monthly_memvcas <- cyc_trips %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(mean = ceiling(mean(trip_dur_sec)))

avglength_monthly_memvcas %>%
  ggplot(aes(x = started_hour,
             y = mean,
             fill = member_casual)) +
  geom_col(width = .7,
           position = position_dodge(width = .8)) +
  geom_text(
    aes(
      x = started_hour,
      y = mean,
      label = seconds_to_period(mean),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  coord_flip(clip = "off") +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_y_time(breaks = breaks_width("5 min"),
               #minor_breaks = breaks_width("1 min"),
               expand = expansion(add = c(0, 1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  labs(title = "Average Trip Length - Time of Day (Whole Week)",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  xlab(label = "Trip Length") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line = element_line(lineend = "butt", color = "#5D5D5D"),
    axis.line.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.x = element_line(linetype = "solid", color="#e7e7e7"),
    #panel.grid.minor.x = element_line(linetype = "dotted", color="#cfcfcf"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

6.5 Time of Day (M - F)

Casual users take longer rides on average at 1:00 and 11:00.

avglength_monthly_memvcas <- cyc_trips %>%
  filter(!(started_day == "Sat" | started_day == "Sun")) %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(mean = ceiling(mean(trip_dur_sec)))

avglength_monthly_memvcas %>%
  ggplot(aes(x = started_hour,
             y = mean,
             fill = member_casual)) +
  geom_col(width = .7,
           position = position_dodge(width = .8)) +
  geom_text(
    aes(
      x = started_hour,
      y = mean,
      label = seconds_to_period(mean),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  coord_flip(clip = "off") +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_y_time(breaks = breaks_width("5 min"),
               #minor_breaks = breaks_width("1 min"),
               expand = expansion(add = c(0, 1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  labs(title = "Average Trip Length - Time of Day (Mon - Fri)",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  xlab(label = "Trip Length") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line = element_line(lineend = "butt", color = "#5D5D5D"),
    axis.line.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.x = element_line(linetype = "solid", color="#e7e7e7"),
    #panel.grid.minor.x = element_line(linetype = "dotted", color="#cfcfcf"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

6.6 Time of Day (Sat - Sun)

The average trip gets longer progressively throughout the day, peaking at 14:00 (2pm) until it starts rising again at around midnight.

avglength_monthly_memvcas <- cyc_trips %>%
  filter(started_day == "Sat" | started_day == "Sun") %>%
  group_by(started_hour,
           member_casual) %>%
  summarize(mean = ceiling(mean(trip_dur_sec)))

avglength_monthly_memvcas %>%
  ggplot(aes(x = started_hour,
             y = mean,
             fill = member_casual)) +
  geom_col(width = .7,
           position = position_dodge(width = .8)) +
  geom_text(
    aes(
      x = started_hour,
      y = mean,
      label = seconds_to_period(mean),
      hjust = -.1,
      vjust = .5,
      color = member_casual,
    ),
    position = position_dodge(width = .8)
  ) +
  coord_flip(clip = "off") +
  scale_color_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_fill_manual(
    labels = c("Casual User", "Annual Member"),
    values = c(casual = "#3B984B", member = "#3b5998")
  ) +
  scale_y_time(breaks = breaks_width("5 min"),
               #minor_breaks = breaks_width("1 min"),
               expand = expansion(add = c(0, 1))) +
  scale_x_reverse(
    expand = expansion(add = c(0, 0)),
    # reverse to make it make more sense time-wise
    breaks = breaks_width(-1),
    # USE NEGATIVE VALUE! breakdown DOWN by one
    labels = label_number(suffix = ":00")
  ) +
  theme_minimal() +
  labs(title = "Average Trip Length - Time of Day (Sat - Sun)",
       subtitle = "November 2021 — October 2022",
       caption = "Data downloaded on November 11, 2022") +
  xlab(label = "Trip Length") +
  theme(
    text = element_text(family = "Fira Sans"),
    plot.background = element_rect(fill = "#f7f7f7", color = NA),
    plot.margin = unit(rep(2, 4), "cm"),
    plot.title = element_text(family = "Fira Sans Medium"),
    axis.line = element_line(lineend = "butt", color = "#5D5D5D"),
    axis.line.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color = "#5D5D5D"),
    panel.grid = element_blank(),
    #panel.grid.major.x = element_line(linetype = "solid", color="#e7e7e7"),
    #panel.grid.minor.x = element_line(linetype = "dotted", color="#cfcfcf"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    legend.text = element_text(),
    legend.key.width = unit(1, "cm")
  )

7 - Locations

To help identify the locations to target advertising, I’ve included a list of top 10 locations for both casual users and members, as well as interactive maps that allows you to see the number of trips started and ended for each recorded station.

Zoom in and out to see the locations where the most trips started. The locations will cluster and come apart as you zoom in and out, showing you the total number of trips started in each cluster down to the individual location. This looks crummy on mobile right now and I will clean it up soon! But the data is there and works.

7.1 All Start Locations

start_locs_member <- cyc_trips %>%
  filter(member_casual == "member") %>% 
  group_by(start_station_name) %>%
  summarize(
    member_trip_count = n(),
    start_lng = round(mean(start_lng), digits = 5),
    start_lat = round(mean(start_lat), digits = 5))

start_locs_casual <- cyc_trips %>%
  filter(member_casual == "casual") %>% 
  group_by(start_station_name) %>%
  summarize(
    casual_trip_count = n(),
    start_lng = round(mean(start_lng), digits = 5),
    start_lat = round(mean(start_lat), digits = 5))

start_locs <- full_join(start_locs_casual, start_locs_member, by="start_station_name")
start_locs <- start_locs %>%
  mutate(start_lng = coalesce(start_lng.x, start_lng.y),
         start_lat = coalesce(start_lat.x, start_lat.y))
start_locs <- start_locs[c(-3,-4,-6,-7)] %>% 
  replace(is.na(.),0)


leaflet( height = "600px",
         width = "100%",
         options = leafletOptions(
           zoomControl = TRUE,
           minZoom = 10,
         )) %>% setView(lng = -87.63,
               lat = 41.8781,
               zoom = 14) %>%
  setMaxBounds(
    lng1 = -87.93,
    lat1 = 41.58,
    lng2 = -87.51,
    lat2 = 42.11
  ) %>% 
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  addMarkers(
    data = start_locs,
    lng = ~ start_lng,
    lat = ~ start_lat,
    label = ~ start_station_name,
    options = markerOptions(n = ~ (casual_trip_count + member_trip_count)),
    popupOptions = popupOptions(closePopupOnClick = F),
    popup =
      paste(
        "<b>",
        start_locs$start_station_name,
        "</b>",
        "",
        "<br/>",
        "Casual Users: ",
        prettyNum(start_locs$casual_trip_count, big.mark = ","),
        "<br/>",
        "Members: ",
        prettyNum(start_locs$member_trip_count, big.mark = ",")
      ),
    clusterOptions = markerClusterOptions(
      maxClusterRadius = 120,
      iconCreateFunction = JS(
        "function (cluster) {
    var markers = cluster.getAllChildMarkers();
    var sum = 0;
    for (i = 0; i < markers.length; i++) {
      sum += Number(markers[i].options.n);
      sum += 1;
    }
     var childCount = sum;
     var c = ' marker-cluster-';
     if (childCount < 10000) {
       c += 'small';
     } 
     else if (childCount < 100000) {
       c += 'medium';
     } 
     else {
       c += 'large';
     }
    return new L.DivIcon({ html: '<div><span>' + sum.toLocaleString(\"en-US\") + '</span></div>', 
    className: 'marker-cluster' +c, iconSize: new L.Point(40,40)});
  }"
    )
  ))

7.2 Top 10 Start Locations

Here we can see where casual users like to start, and end their trips the most. It appears that casual users start their trips near popular tourist destinations, e.g. the number one location is Streeter Dr & Grand Ave, which happens to be outside of Navy Pier.

start_locs_member <- cyc_trips %>%
  filter(member_casual == "member") %>%
  group_by(start_station_name) %>%
  summarize(
    member_trip_count = n(),
    start_lng = mean(start_lng),
    start_lat = mean(start_lat)
  ) %>% arrange(desc(member_trip_count))


start_locs_casual <- cyc_trips %>%
  filter(member_casual == "casual") %>%
  group_by(start_station_name) %>%
  summarize(
    casual_trip_count = n(),
    start_lng = mean(start_lng),
    start_lat = mean(start_lat)
  ) %>% arrange(desc(casual_trip_count))

top_start_locs_cas <-
  start_locs_casual[1:10,] %>%  mutate(rank = (1:10), .before = start_station_name)
top_start_locs_mem <-
  start_locs_member[1:10,] %>%  mutate(rank = (1:10), .before = start_station_name)

knitr::kable(
  top_start_locs_cas[1:3],
  col.names = c(" ", "Location Name", "Trips Started"),
  align = "clr",
  caption = "Trips Started by Casual Users"
)

Trips Started by Casual Users
	Location Name	Trips Started
1	Streeter Dr & Grand Ave	55295
2	DuSable Lake Shore Dr & Monroe St	30938
3	Millennium Park	24354
4	Michigan Ave & Oak St	23795
5	DuSable Lake Shore Dr & North Blvd	22275
6	Shedd Aquarium	19578
7	Theater on the Lake	17451
8	Wells St & Concord Ln	15003
9	Dusable Harbor	13438
10	Clark St & Armitage Ave	12920

leaflet(
  height = "600px",
  width = "100%",
  options = leafletOptions(
    zoomControl = FALSE,
    minZoom = 13,
    maxZoom = 13,
    dragging = FALSE
  )
) %>% setView(lng = -87.622,
              lat = 41.897,
              zoom = 13) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  #addProviderTiles(providers$Stamen.TonerLite) %>%
  addCircles(
    data = top_start_locs_cas,
    lng = ~ start_lng,
    lat = ~ start_lat,
    radius = ~ casual_trip_count / 100,
    stroke = F,
    color = "#3B984B",
    opacity = 1,
    fillOpacity = .618,
    label =  paste(
      top_start_locs_mem$rank,
      ": ",
      top_start_locs_mem$start_station_name
    ),
    labelOptions = labelOptions(direction = "top"),
    popup = paste(
      "<b>",
      top_start_locs_cas$start_station_name,
      "</b>",
      "",
      "<br/>",
      "Casual Users: ",
      prettyNum(top_start_locs_cas$casual_trip_count, big.mark = ",")
    ),
    popupOptions = popupOptions(closeOnClick = T)
  )

knitr::kable(
  top_start_locs_mem[1:3],
  col.names = c(" ", "Location Name", "Trips Started"),
  align = "clr",
  caption = "Trips Started by Members"
)

Trips Started by Members
	Location Name	Trips Started
1	Kingsbury St & Kinzie St	24408
2	Clark St & Elm St	21250
3	Wells St & Concord Ln	20310
4	Clinton St & Washington Blvd	19008
5	Clinton St & Madison St	18512
6	Loomis St & Lexington St	18477
7	Wells St & Elm St	18017
8	University Ave & 57th St	17642
9	Ellis Ave & 60th St	17609
10	Streeter Dr & Grand Ave	16312

leaflet(
  height = "600px",
  width = "100%",
  options = leafletOptions(
    zoomControl = FALSE,
    minZoom = 12,
    maxZoom = 12,
    dragging = FALSE
  )
) %>% setView(lng = -87.621,
              lat = 41.85,
              zoom = 11) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  #addProviderTiles(providers$Stamen.TonerLite) %>%
  addCircles(
    data = top_start_locs_mem,
    lng = ~ start_lng,
    lat = ~ start_lat,
    radius = ~ member_trip_count / 50,
    stroke = F,
    color = "#3b5998",
    opacity = 1,
    fillOpacity = .618,
    label =  paste(
      top_start_locs_mem$rank,
      ": ",
      top_start_locs_mem$start_station_name
    ),
    labelOptions = labelOptions(direction = "top"),
    popup = paste(
      "<b>",
      top_start_locs_mem$start_station_name,
      "</b>",
      "",
      "<br/>",
      "Members: ",
      prettyNum(top_start_locs_mem$member_trip_count, big.mark = ",")
    ),
    popupOptions = popupOptions(closeOnClick = T)
  )

7.3 All End Locations

end_locs_member <- cyc_trips %>%
  filter(member_casual == "member") %>% 
  group_by(end_station_name) %>%
  summarize(
    member_trip_count = n(),
    end_lng = round(mean(end_lng), digits = 5),
    end_lat = round(mean(end_lat), digits = 5))

end_locs_casual <- cyc_trips %>%
  filter(member_casual == "casual") %>% 
  group_by(end_station_name) %>%
  summarize(
    casual_trip_count = n(),
    end_lng = round(mean(end_lng), digits = 5),
    end_lat = round(mean(end_lat), digits = 5))

end_locs <- full_join(end_locs_casual, end_locs_member, by="end_station_name")
end_locs <- end_locs %>%
  mutate(end_lng = coalesce(end_lng.x, end_lng.y),
         end_lat = coalesce(end_lat.x, end_lat.y))
end_locs <- end_locs[c(-3,-4,-6,-7)] %>% 
  replace(is.na(.),0)


leaflet( height = "600px",
         width = "100%",
         options = leafletOptions(
           zoomControl = FALSE,
           minZoom = 10,
         )) %>% setView(lng = -87.63,
               lat = 41.8781,
               zoom = 14) %>%
  setMaxBounds(
    lng1 = -87.93,
    lat1 = 41.58,
    lng2 = -87.51,
    lat2 = 42.11
  ) %>% 
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  #addProviderTiles(providers$Stamen.TonerLite) %>%
  addMarkers(
    data = end_locs,
    lng = ~ end_lng,
    lat = ~ end_lat,
    label =  paste(top_start_locs_mem$rank,
                   ": ",
                   top_start_locs_mem$start_station_name),
    options = markerOptions(n = ~ (casual_trip_count + member_trip_count)),
    popupOptions = popupOptions(closePopupOnClick = F),
    popup =
      paste("<b>", 
        end_locs$end_station_name,
        "</b>",
        "",
        "<br/>",
        "Casual Users: ",
        prettyNum(end_locs$casual_trip_count, big.mark = ","),
        "<br/>",
        "Members: ",
        prettyNum(end_locs$member_trip_count, big.mark = ",")
      ),
    clusterOptions = markerClusterOptions(
      maxClusterRadius = 120,
      iconCreateFunction = JS(
        "function (cluster) {
    var markers = cluster.getAllChildMarkers();
    var sum = 0;
    for (i = 0; i < markers.length; i++) {
      sum += Number(markers[i].options.n);
      sum += 1;
    }
     var childCount = sum;
     var c = ' marker-cluster-';
     if (childCount < 10000) {
       c += 'small';
     } 
     else if (childCount < 100000) {
       c += 'medium';
     } 
     else {
       c += 'large';
     }
    return new L.DivIcon({ html: '<div><span>' + sum.toLocaleString(\"en-US\") + '</span></div>', 
    className: 'marker-cluster' +c, iconSize: new L.Point(40,40)});
  }"
    )
  ))

7.4 Top 10 End Locations

end_locs_member <- cyc_trips %>%
  filter(member_casual == "member") %>%
  group_by(end_station_name) %>%
  summarize(
    member_trip_count = n(),
    end_lng = mean(end_lng),
    end_lat = mean(end_lat)
  ) %>% arrange(desc(member_trip_count))


end_locs_casual <- cyc_trips %>%
  filter(member_casual == "casual") %>%
  group_by(end_station_name) %>%
  summarize(
    casual_trip_count = n(),
    end_lng = mean(end_lng),
    end_lat = mean(end_lat)
  ) %>% arrange(desc(casual_trip_count))

top_end_locs_cas <-
  end_locs_casual[1:10,] %>%  mutate(rank = (1:10), .before = end_station_name)
top_end_locs_mem <-
  end_locs_member[1:10,] %>%  mutate(rank = (1:10), .before = end_station_name)

knitr::kable(
  top_end_locs_cas[1:3],
  col.names = c(" ", "Location Name", "Trips ended"),
  align = "clr",
  caption = "Trips Ended by Casual Users"
)

Trips Ended by Casual Users
	Location Name	Trips ended
1	Streeter Dr & Grand Ave	58182
2	DuSable Lake Shore Dr & Monroe St	28929
3	Millennium Park	26171
4	Michigan Ave & Oak St	25494
5	DuSable Lake Shore Dr & North Blvd	25409
6	Theater on the Lake	18789
7	Shedd Aquarium	18244
8	Wells St & Concord Ln	14555
9	Clark St & Armitage Ave	13190
10	Clark St & Lincoln Ave	12991

leaflet(
  height = "600px",
  width = "100%",
  options = leafletOptions(
    zoomControl = FALSE,
    minZoom = 13,
    maxZoom = 13,
    dragging = FALSE
  )
) %>% setView(lng = -87.622,
              lat = 41.897,
              zoom = 13) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  #addProviderTiles(providers$Stamen.TonerLite) %>%
  addCircles(
    data = top_end_locs_cas,
    lng = ~ end_lng,
    lat = ~ end_lat,
    radius = ~ casual_trip_count / 100,
    stroke = F,
    color = "#3B984B",
    opacity = 1,
    fillOpacity = .618,
    label =  paste(
      top_end_locs_mem$rank,
      ": ",
      top_end_locs_mem$end_station_name
    ),
    labelOptions = labelOptions(direction = "top"),
    popup = paste(
      "<b>",
      top_end_locs_cas$end_station_name,
      "</b>",
      "",
      "<br/>",
      "Casual Users: ",
      prettyNum(top_end_locs_cas$casual_trip_count, big.mark = ",")
    ),
    popupOptions = popupOptions(closeOnClick = F)
  )

knitr::kable(
  top_end_locs_mem[1:3],
  col.names = c(" ", "Location Name", "Trips ended"),
  align = "clr",
  caption = "Trips Ended by Members"
)

Trips Ended by Members
	Location Name	Trips ended
1	Kingsbury St & Kinzie St	23918
2	Clark St & Elm St	21556
3	Wells St & Concord Ln	20862
4	Clinton St & Washington Blvd	19673
5	Clinton St & Madison St	18954
6	University Ave & 57th St	18748
7	Loomis St & Lexington St	18081
8	Wells St & Elm St	17854
9	Ellis Ave & 60th St	17363
10	Broadway & Barry Ave	16742

leaflet(
  height = "600px",
  width = "100%",
  options = leafletOptions(
    zoomControl = FALSE,
    minZoom = 12,
    maxZoom = 12,
    dragging = FALSE
  )
) %>% setView(lng = -87.621,
              lat = 41.85,
              zoom = 11) %>%
  addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
  addProviderTiles(providers$CartoDB.PositronOnlyLabels) %>%
  #addProviderTiles(providers$Stamen.TonerLite) %>%
  addCircles(
    data = top_end_locs_mem,
    lng = ~ end_lng,
    lat = ~ end_lat,
    radius = ~ member_trip_count / 50,
    stroke = F,
    color = "#3b5998",
    opacity = 1,
    fillOpacity = .618,
    label =  paste(
      top_end_locs_mem$rank,
      ": ",
      top_end_locs_mem$end_station_name
    ),
    labelOptions = labelOptions(direction = "top"),
    popup = paste(
      "<b>",
      top_end_locs_mem$end_station_name,
      "</b>",
      "",
      "<br/>",
      "Members: ",
      prettyNum(top_end_locs_mem$member_trip_count, big.mark = ",")
    ),
    popupOptions = popupOptions(closeOnClick = F)
  )

8 - Conclusion

Analyzing nearly 5 million individual trip entries allowed us to discover some insights into casual user and member habits that could help us target casual users and drive conversion to annual membership.

An average user would see benefits to the memberships after taking 25 average individual trips or if they plan on using more than 8 day passes in a 12 month period.

We discovered that most trips got started June through August. We also saw that the weekends were the the most popular days for casual users to ride, and we identified casual usage gradually increased throughout those days, peaking at around 15:00, or 3:00pm.

Casual users started their trips near popular tourist destinations, such as Navy Pier and Shedd Aquarium, along the lake front, and near major transportation hubs, like Merchandise Mart, and Ogilvie and Union stations.

Partnerships with those popular destinations, as well as Metra/Pace/CTA, that could produce benefits for users could drive conversion to memberships. Campaigns that begin in Spring that advertise easier access to popular events and festivals, such as Taste of Chicago or Lollapalooza could also drive increased interest.

For a digital marketing strategy, the company could use internal data to show in-app content to users that have purchased around 8 day passes in the last 12 month period, and to users whose first few rides, or several subsequent rides, end up lasting around the average length of 24 minutes.

9 - Citations

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.1, https://CRAN.R-project.org/package=scales.

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.

Müller K (2022). hms: Pretty Time of Day. R package version 1.1.2, https://CRAN.R-project.org/package=hms.

Cheng J, Karambelkar B, Xie Y (2022). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 2.1.1, https://CRAN.R-project.org/package=leaflet.

Cyclistic Summary

Kamil J. Krawczyk

2022-12-07

1 - Introduction

2 - Setup

2.1 Tools

2.2 Importing the .csv files into RStudio

3 - Cleanup and Manipulation

3.1 Missing Data

3.2 Rideable Types

3.3 Times and Dates

3.4 Duration Edge Cases

4 - Annual Members & Casual Users

5 - Analysis: Number of Trips

5.1 Month

5.2 Day

5.3 Time of Day (Whole Week)

5.4 Time of Day (Mon - Fri)

5.5 Time of Day (Sat - Sun)

6 - Analysis: Average Trip Length

6.1 Costs

6.2 Month

6.3 Day

6.4 Time of Day (Whole Week)

6.5 Time of Day (M - F)

6.6 Time of Day (Sat - Sun)

7 - Locations

7.1 All Start Locations

7.2 Top 10 Start Locations

7.3 All End Locations

7.4 Top 10 End Locations

8 - Conclusion

9 - Citations