copyright: "© Wojciech Winogrodzki, 2023. All Rights Reserved." 
email: "wwinogrodzki@gmail.com"
location: "Mexico City"

Context

The present work is a realization of “Google Data Analytics Capstone: Complete a Case Study”, the final part of Google Data Analytics Professional Certificate.

I’ll be applying Google’s data analytics methodology of Ask - Prepare - Process - Analyze - Share - Act.

The tool of my choice is R programming language. In one case, I will use SQL to check integrity of bike stations.

Introduction

Since 2016, a fictional company, Cyclistic, is a bike-share company in Chicago. It features more than 5,800 bicycles and 600 docking stations.

Cyclistic’s users are more likely to ride for leisure, but about 30% of them commute to work each day.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans:

Stakeholders

Hypothesis

As concluded by Cyclistic’s finance analysts, annual members are much more profitable than casual riders. Therefore, Moreno believes that:

  1. Maximizing the number of annual members will be key to future growth.
  2. There is a very good chance to convert casual riders into members because casual riders:
    1. Are already aware of the Cyclistic program; and
    2. Have already chosen Cyclistic for their mobility needs.

Ask

Business task

The business task, as defined by Moreno, is:

Design marketing strategies aimed at converting casual riders into annual members

The following three questions will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?

  2. Why would casual riders buy Cyclistic annual memberships?

  3. How can Cyclistic use digital media to influence casual riders to become members?

Question

I have been assigned to work on the first question:

How do annual members and casual riders use Cyclistic bikes differently?

The question will be answered with data, trying to identify usage trends in bike rides, broken down into two types:

  • Casual (single-ride or daily pass)
  • Member (annual membership)

Prepare

Original dataset

Location

Since Cyclictics is a fictional company, the data has been made available by Motivate International Inc. under this license. I have choosen to work with just one quarter of a year, Q1 of 2020, contained in this zip archive, which is sufficient to elaborate the present case study.

Organization of original data

First, let’s setup the R environment and load the original data into a data frame identified as df:

The original data has 426,887 observations in 13 variables. It does not contain any personally identifiable information. There is 1 observation with missing ride data.

skim_without_charts(df)
Data summary
Name df
Number of rows 426887
Number of columns 13
_______________________
Column type frequency:
character 5
numeric 6
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 426887 0
rideable_type 0 1 11 11 0 1 0
start_station_name 0 1 5 43 0 607 0
end_station_name 1 1 5 43 0 602 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_station_id 0 1 209.80 163.22 2.00 77.00 176.00 298.00 675.00
end_station_id 1 1 209.34 163.20 2.00 77.00 175.00 297.00 675.00
start_lat 0 1 41.90 0.04 41.74 41.88 41.89 41.92 42.06
start_lng 0 1 -87.64 0.02 -87.77 -87.66 -87.64 -87.63 -87.55
end_lat 1 1 41.90 0.04 41.74 41.88 41.89 41.92 42.06
end_lng 1 1 -87.64 0.02 -87.77 -87.66 -87.64 -87.63 -87.55

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-01-01 00:04:44 2020-03-31 23:51:34 2020-02-17 05:01:27 399265
ended_at 0 1 2020-01-01 00:10:54 2020-05-19 20:10:34 2020-02-17 05:48:58 399532

Process

Findings

Start dates do fit within Q1 2020

## # A tibble: 1 × 2
##   min_started_at      max_started_at     
##   <dttm>              <dttm>             
## 1 2020-01-01 00:04:44 2020-03-31 23:51:34

Some end dates do not fit Q1 2020

There are 22 rides that end beyond Q1 of 2020 (like on May, 19th):

## # A tibble: 1 × 2
##   min_ended_at        max_ended_at       
##   <dttm>              <dttm>             
## 1 2020-01-01 00:10:54 2020-05-19 20:10:34
## # A tibble: 1 × 1
##   count
##   <int>
## 1    22

Rides of large durations

There are rides with durations longer than one day. In some cases, they exceed three months.

## # A tibble: 562 × 2
##    ride_id          duration_days
##    <chr>                    <int>
##  1 279F7DB076FD4444           109
##  2 377B59F0A27BB4CB           100
##  3 ABA59A8FED82A85C            81
##  4 F4555B8C917E55D3            65
##  5 63F1D0FB38E8B158            62
##  6 B02926396CBED16D            62
##  7 DF955AFE21AA2526            55
##  8 37E250880094B435            52
##  9 646E9350D169D138            48
## 10 2F2661DA74E96A4F            46
## # ℹ 552 more rows

Short or negative durations of rides

There are three rides with negative durations and 6,601 rides with duration of zero minutes.

## # A tibble: 6,604 × 2
##    ride_id          duration_mins
##    <chr>                    <int>
##  1 6FABADDD595AF922            -9
##  2 13BCB74BAFDADB21            -2
##  3 867C195C5650C884            -1
##  4 CEB3A25A22460197             0
##  5 89D3D51310A2D7ED             0
##  6 8681DB6781B16DCD             0
##  7 CBEFEF554E477BDC             0
##  8 898A8C2F8CD1FAA0             0
##  9 954D4BE8EB33C979             0
## 10 E84B4127E076BFE7             0
## # ℹ 6,594 more rows

Integrity check for start and end stations

Each station id corresponds to only one station name. Nevertheless, it is not enough to give it as good. Further on, I will construct the master catalog of stations with SQL.

The two queries (below) show that there are no duplicates (they return zero rows):

df %>%
  group_by(start_station_id) %>%
  summarise(ux = n_distinct(start_station_name)) %>%
  filter(ux > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: start_station_id <dbl>, ux <int>
df %>%
  group_by(end_station_id) %>%
  summarise(ux = n_distinct(end_station_name)) %>%
  filter(ux > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: end_station_id <dbl>, ux <int>

Missing observations

There is one incomplete observation in the original dataset. It is ‘ride_id’ = ‘157EAA4C4A3C8D36’ and misses ride data.

df[!complete.cases(df), ] %>% select(ride_id, end_station_id, end_station_name)
## # A tibble: 1 × 3
##   ride_id          end_station_id end_station_name
##   <chr>                     <dbl> <chr>           
## 1 157EAA4C4A3C8D36             NA <NA>

Integrity check for ‘member_casual’ column

This column must contain only two values: ‘member’ and ‘casual’, and it does.

unique(df$member_casual)
## [1] "member" "casual"

Integrity check for ‘ride_id’ column

This column must contain unique values.

df %>%
  group_by(ride_id) %>%
  summarise(qty = n()) %>%
  filter(qty > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: ride_id <chr>, qty <int>

There is no ‘ride_id’ that appears more than once.

Column ‘rideable_type’ is all the same

All observations in the variable ‘rideable_type’ contain the same value ‘docked_bike’, rendering the variable meaningless.

unique(df$rideable_type)
## [1] "docked_bike"

Cleaning

First, I will show the cleaning command and then I will explain what it did:

Cleaning command

cl <- df %>% 
  drop_na() %>% 
  select(-rideable_type) %>% 
  filter(ended_at < ymd_hms('2020-04-01 00:00:00')) %>%
  mutate(
    start_station_id = as.character(start_station_id),
    end_station_id = as.character(end_station_id),
    duration_mins = as.integer(round(difftime(ended_at, started_at, units = "mins"))),
    distance_miles = round(
      geodist_vec(
        start_lng,
        start_lat,
        end_lng,
        end_lat,
        paired = TRUE,
        sequential = TRUE, 
        measure = "cheap") * 0.000621371, 2),
    start_weekday = wday(started_at, label = TRUE),
    start_hour = hour(started_at)
    ) %>% 
  filter(duration_mins >= 5) %>% 
  filter(duration_mins < 1440) %>% 
  select (-ended_at)

drop_na()

Incomplete observations must be discarded. There is only one.

select(-rideable_type)

Discard variable ‘rideable_type’ because it contains all the same value.

filter(ended_at < ymd_hms(‘2020-04-01 00:00:00’))

Include only rides that ended within Q1 2020.

Note: ymd_hms() drops time zone information of the machine where this code is executed. If omitted, results would be skewed because original data frame contains no time zone information, therefore, comparison would be incorrect.

start_station_id = as.character(start_station_id)

The same for ‘end_station_id’. It tells R not to calculate statistics on these variables. They look like integers, but they aren’t.

start_weekday = wday(started_at, label = TRUE),

Create discrete variable containing the day of week for each ride.

start_hour = hour(started_at)

Create discrete variable containing the start hour for each ride (0 .. 23).

duration_mins = as.integer(…

Create discrete variable containing duration of rides in whole minutes.

distance_miles = round(geodist_vec(…

Calculate distance for each ride. Initially I was to include geo information, but I didn’t because I ran out of time.

filter(duration_mins >= 5)

Leave only rides that took 5 minutes or more.

filter(duration_mins < 1440)

Leave only rides that took less than one day.

select (-ended_at)

Discard ‘ended-at’ variable: it will not be necessary any more.

Organization of cleaned data

Cleaned data are stored in the data frame cl.

  1. The original data had 426,887 observations in 13 variables.
  2. The cleaned data had 352,688 observations in 15 variables.
    1. There are 5 variables of geo data which were not used in present study, but I decided to leave them for future use.
  3. Cleaned data represents 82.62 % of original data.
skim_without_charts(cl)
Data summary
Name cl
Number of rows 352688
Number of columns 15
_______________________
Column type frequency:
character 6
factor 1
numeric 7
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 352688 0
start_station_name 0 1 5 43 0 605 0
start_station_id 0 1 1 3 0 605 0
end_station_name 0 1 5 43 0 601 0
end_station_id 0 1 1 3 0 601 0
member_casual 0 1 6 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
start_weekday 0 1 TRUE 7 Tue: 61446, Wed: 57353, Thu: 54169, Mon: 54031

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.04 41.74 41.88 41.89 41.91 42.06
start_lng 0 1 -87.64 0.02 -87.77 -87.65 -87.64 -87.63 -87.55
end_lat 0 1 41.90 0.04 41.74 41.88 41.89 41.92 42.06
end_lng 0 1 -87.64 0.02 -87.77 -87.65 -87.64 -87.63 -87.55
duration_mins 0 1 16.76 36.96 5.00 7.00 11.00 18.00 1436.00
distance_miles 0 1 1.32 1.01 0.00 0.68 1.01 1.63 14.34
start_hour 0 1 13.28 4.58 0.00 9.00 14.00 17.00 23.00

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-01-01 00:04:44 2020-03-31 23:49:42 2020-02-17 12:35:06 332570

Master catalog of stations with SQL

In order to ensure data integrity of stations, I have performed, on the clean dataset, certain steps in SQL that consisted in creating a master table for all stations and counts of start and end events per ride type.

  1. Create a table for all stations (start and end), with station id and station name as primary key, and counts of start and end events per user type, station type and total.
  2. Check counts of primary keys. There must be just one ocurrence of each.
  3. Check integrity of summarized columns against clean data.

The procedure is not listed here because it turned out it was never used for analysis.

Analyze

Just to recall the question:

How do annual members and casual riders use Cyclistic bikes differently?

  1. Casual rides have only 12 % share in the analyzed dataset.

  2. Casual rides happen mostly on weekends while member rides happen mostly on weekdays.

  3. Median casual ride duration is of 41 minutes while that of a member ride is of 13 minutes.

  4. Casual rides peak on Wednesdays, Saturdays and Sundays. Member rides do not have a pronounced peak: they are of 10 minutes on weekdays and Saturday and of just 12 minutes on Sundays.

  5. Casual rides start mostly on Saturdays and Sundays between 10 am and 6 pm.

  6. Member rides start mostly on weekdays around 8 am and end around 5 pm.

  7. When compared, casual rides happen mostly on weekends and member rides happen mostly on weekdays.

  8. Start or end stations do not overlap for casual or member rides.

  9. Start or end stations for casual riders are near shore. For members, there are near business center.

Share

Visualizations

Fig. 1: Converting casual rides to member rides offers a very limited growth promise.

Fig. 1: Converting casual rides to member rides offers a very limited growth promise.


Fig. 2: Casual rides peek on weekends, member peek on weekdays.

Fig. 2: Casual rides peek on weekends, member peek on weekdays.


Fig. 3: Median duration of a casual ride is more than 3 times than that of a member ride.

Fig. 3: Median duration of a casual ride is more than 3 times than that of a member ride.


Fig. 4: Median duration for member rides is of 10 minutes from Monday to Sunday. For casual rides, it is more than 3 times longer, with peaks on Wed., Sat. and Sun.

Fig. 4: Median duration for member rides is of 10 minutes from Monday to Sunday. For casual rides, it is more than 3 times longer, with peaks on Wed., Sat. and Sun.


Fig. 5: Casual rides generally start on weekends between 10 am and 6 pm.

Fig. 5: Casual rides generally start on weekends between 10 am and 6 pm.


Fig. 6: Member rides generally start on weekdays at about 8 am and 5 pm.

Fig. 6: Member rides generally start on weekdays at about 8 am and 5 pm.


Fig. 7: Comparison of start hours per day of week per ride type.

Fig. 7: Comparison of start hours per day of week per ride type.


Fig. 8: Top 16 starts stations for casual and member rides do not overlap.

Fig. 8: Top 16 starts stations for casual and member rides do not overlap.


Fig. 9: Top 16 end stations for casual and member rides do not overlap.

Fig. 9: Top 16 end stations for casual and member rides do not overlap.

Key findings

  1. Casual rides have only 12 % share in the analyzed dataset, but it may be due to winter months, because:

  2. Casual rides ride for leisure. Members commute to work.

  3. Casual riders move near shore, members move near business centers.

Act

Top three recommendations

  1. Presential campaign shall be launched on Saturdays and Sundays; between 10 am and 6 pm.; on start and/or end stations preferred by casual riders. During the campaign, a survey can be conducted with just a few questions, like:

  2. Today, do you ride with (select all that apply):

    1. Significant Other
    2. Kids
    3. Coworkers
    4. Friends and acquaintances outside my work
  3. Do you also conmute to work (select one):

    1. No
    2. Yes
  4. The same survey can be placed on Cyclistic’s web site and emailed to casual riders (changing “Today” with “For leisure”).

  5. A campaingn directed to casual riders can be launched leveraging advantages of annual membership over casual rides. Of course, there must be some, which is beyond scope of this case study.

Other recommendations

The ride registration system must be improved, addressing the issues like:

  1. There are rides with negative or zero durations.

  2. Justify zero (or unreasonably short) rides, especially if they start and end at the same station.

  3. There are rides that span over 3 months of duration.

  4. For data registration puropses, a distinction between single-ride passes and full-day passes shall be implemented in the system.