copyright: "© Wojciech Winogrodzki, 2023. All Rights Reserved."
email: "wwinogrodzki@gmail.com"
location: "Mexico City"
The present work is a realization of “Google Data Analytics Capstone: Complete a Case Study”, the final part of Google Data Analytics Professional Certificate.
I’ll be applying Google’s data analytics methodology of Ask - Prepare - Process - Analyze - Share - Act.
The tool of my choice is R programming language. In one case, I will use SQL to check integrity of bike stations.
Since 2016, a fictional company, Cyclistic, is a bike-share company in Chicago. It features more than 5,800 bicycles and 600 docking stations.
Cyclistic’s users are more likely to ride for leisure, but about 30% of them commute to work each day.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans:
As concluded by Cyclistic’s finance analysts, annual members are much more profitable than casual riders. Therefore, Moreno believes that:
The business task, as defined by Moreno, is:
Design marketing strategies aimed at converting casual riders into annual members
The following three questions will guide the future marketing program:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
I have been assigned to work on the first question:
How do annual members and casual riders use Cyclistic bikes differently?
The question will be answered with data, trying to identify usage trends in bike rides, broken down into two types:
Since Cyclictics is a fictional company, the data has been made available by Motivate International Inc. under this license. I have choosen to work with just one quarter of a year, Q1 of 2020, contained in this zip archive, which is sufficient to elaborate the present case study.
First, let’s setup the R environment and load the original data into a data frame identified as df:
The original data has 426,887 observations in 13 variables. It does not contain any personally identifiable information. There is 1 observation with missing ride data.
skim_without_charts(df)
| Name | df |
| Number of rows | 426887 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 6 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 426887 | 0 |
| rideable_type | 0 | 1 | 11 | 11 | 0 | 1 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 607 | 0 |
| end_station_name | 1 | 1 | 5 | 43 | 0 | 602 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 209.80 | 163.22 | 2.00 | 77.00 | 176.00 | 298.00 | 675.00 |
| end_station_id | 1 | 1 | 209.34 | 163.20 | 2.00 | 77.00 | 175.00 | 297.00 | 675.00 |
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.92 | 42.06 |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.55 |
| end_lat | 1 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.92 | 42.06 |
| end_lng | 1 | 1 | -87.64 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.55 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-01-01 00:04:44 | 2020-03-31 23:51:34 | 2020-02-17 05:01:27 | 399265 |
| ended_at | 0 | 1 | 2020-01-01 00:10:54 | 2020-05-19 20:10:34 | 2020-02-17 05:48:58 | 399532 |
## # A tibble: 1 × 2
## min_started_at max_started_at
## <dttm> <dttm>
## 1 2020-01-01 00:04:44 2020-03-31 23:51:34
There are 22 rides that end beyond Q1 of 2020 (like on May, 19th):
## # A tibble: 1 × 2
## min_ended_at max_ended_at
## <dttm> <dttm>
## 1 2020-01-01 00:10:54 2020-05-19 20:10:34
## # A tibble: 1 × 1
## count
## <int>
## 1 22
There are rides with durations longer than one day. In some cases, they exceed three months.
## # A tibble: 562 × 2
## ride_id duration_days
## <chr> <int>
## 1 279F7DB076FD4444 109
## 2 377B59F0A27BB4CB 100
## 3 ABA59A8FED82A85C 81
## 4 F4555B8C917E55D3 65
## 5 63F1D0FB38E8B158 62
## 6 B02926396CBED16D 62
## 7 DF955AFE21AA2526 55
## 8 37E250880094B435 52
## 9 646E9350D169D138 48
## 10 2F2661DA74E96A4F 46
## # ℹ 552 more rows
There are three rides with negative durations and 6,601 rides with duration of zero minutes.
## # A tibble: 6,604 × 2
## ride_id duration_mins
## <chr> <int>
## 1 6FABADDD595AF922 -9
## 2 13BCB74BAFDADB21 -2
## 3 867C195C5650C884 -1
## 4 CEB3A25A22460197 0
## 5 89D3D51310A2D7ED 0
## 6 8681DB6781B16DCD 0
## 7 CBEFEF554E477BDC 0
## 8 898A8C2F8CD1FAA0 0
## 9 954D4BE8EB33C979 0
## 10 E84B4127E076BFE7 0
## # ℹ 6,594 more rows
Each station id corresponds to only one station name. Nevertheless, it is not enough to give it as good. Further on, I will construct the master catalog of stations with SQL.
The two queries (below) show that there are no duplicates (they return zero rows):
df %>%
group_by(start_station_id) %>%
summarise(ux = n_distinct(start_station_name)) %>%
filter(ux > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: start_station_id <dbl>, ux <int>
df %>%
group_by(end_station_id) %>%
summarise(ux = n_distinct(end_station_name)) %>%
filter(ux > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: end_station_id <dbl>, ux <int>
There is one incomplete observation in the original dataset. It is ‘ride_id’ = ‘157EAA4C4A3C8D36’ and misses ride data.
df[!complete.cases(df), ] %>% select(ride_id, end_station_id, end_station_name)
## # A tibble: 1 × 3
## ride_id end_station_id end_station_name
## <chr> <dbl> <chr>
## 1 157EAA4C4A3C8D36 NA <NA>
This column must contain only two values: ‘member’ and ‘casual’, and it does.
unique(df$member_casual)
## [1] "member" "casual"
This column must contain unique values.
df %>%
group_by(ride_id) %>%
summarise(qty = n()) %>%
filter(qty > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: ride_id <chr>, qty <int>
There is no ‘ride_id’ that appears more than once.
All observations in the variable ‘rideable_type’ contain the same value ‘docked_bike’, rendering the variable meaningless.
unique(df$rideable_type)
## [1] "docked_bike"
First, I will show the cleaning command and then I will explain what it did:
cl <- df %>%
drop_na() %>%
select(-rideable_type) %>%
filter(ended_at < ymd_hms('2020-04-01 00:00:00')) %>%
mutate(
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id),
duration_mins = as.integer(round(difftime(ended_at, started_at, units = "mins"))),
distance_miles = round(
geodist_vec(
start_lng,
start_lat,
end_lng,
end_lat,
paired = TRUE,
sequential = TRUE,
measure = "cheap") * 0.000621371, 2),
start_weekday = wday(started_at, label = TRUE),
start_hour = hour(started_at)
) %>%
filter(duration_mins >= 5) %>%
filter(duration_mins < 1440) %>%
select (-ended_at)
Incomplete observations must be discarded. There is only one.
Discard variable ‘rideable_type’ because it contains all the same value.
Include only rides that ended within Q1 2020.
Note: ymd_hms() drops time zone information of the machine where this code is executed. If omitted, results would be skewed because original data frame contains no time zone information, therefore, comparison would be incorrect.
The same for ‘end_station_id’. It tells R not to calculate statistics on these variables. They look like integers, but they aren’t.
Create discrete variable containing the day of week for each ride.
Create discrete variable containing the start hour for each ride (0 .. 23).
Create discrete variable containing duration of rides in whole minutes.
Calculate distance for each ride. Initially I was to include geo information, but I didn’t because I ran out of time.
Leave only rides that took 5 minutes or more.
Leave only rides that took less than one day.
Discard ‘ended-at’ variable: it will not be necessary any more.
Cleaned data are stored in the data frame cl.
skim_without_charts(cl)
| Name | cl |
| Number of rows | 352688 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| factor | 1 |
| numeric | 7 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 352688 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 605 | 0 |
| start_station_id | 0 | 1 | 1 | 3 | 0 | 605 | 0 |
| end_station_name | 0 | 1 | 5 | 43 | 0 | 601 | 0 |
| end_station_id | 0 | 1 | 1 | 3 | 0 | 601 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| start_weekday | 0 | 1 | TRUE | 7 | Tue: 61446, Wed: 57353, Thu: 54169, Mon: 54031 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.91 | 42.06 |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.77 | -87.65 | -87.64 | -87.63 | -87.55 |
| end_lat | 0 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.92 | 42.06 |
| end_lng | 0 | 1 | -87.64 | 0.02 | -87.77 | -87.65 | -87.64 | -87.63 | -87.55 |
| duration_mins | 0 | 1 | 16.76 | 36.96 | 5.00 | 7.00 | 11.00 | 18.00 | 1436.00 |
| distance_miles | 0 | 1 | 1.32 | 1.01 | 0.00 | 0.68 | 1.01 | 1.63 | 14.34 |
| start_hour | 0 | 1 | 13.28 | 4.58 | 0.00 | 9.00 | 14.00 | 17.00 | 23.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-01-01 00:04:44 | 2020-03-31 23:49:42 | 2020-02-17 12:35:06 | 332570 |
In order to ensure data integrity of stations, I have performed, on the clean dataset, certain steps in SQL that consisted in creating a master table for all stations and counts of start and end events per ride type.
The procedure is not listed here because it turned out it was never used for analysis.
Just to recall the question:
How do annual members and casual riders use Cyclistic bikes differently?
Casual rides have only 12 % share in the analyzed dataset.
Casual rides happen mostly on weekends while member rides happen mostly on weekdays.
Median casual ride duration is of 41 minutes while that of a member ride is of 13 minutes.
Casual rides peak on Wednesdays, Saturdays and Sundays. Member rides do not have a pronounced peak: they are of 10 minutes on weekdays and Saturday and of just 12 minutes on Sundays.
Casual rides start mostly on Saturdays and Sundays between 10 am and 6 pm.
Member rides start mostly on weekdays around 8 am and end around 5 pm.
When compared, casual rides happen mostly on weekends and member rides happen mostly on weekdays.
Start or end stations do not overlap for casual or member rides.
Start or end stations for casual riders are near shore. For members, there are near business center.
Presential campaign shall be launched on Saturdays and Sundays; between 10 am and 6 pm.; on start and/or end stations preferred by casual riders. During the campaign, a survey can be conducted with just a few questions, like:
Today, do you ride with (select all that apply):
Do you also conmute to work (select one):
The same survey can be placed on Cyclistic’s web site and emailed to casual riders (changing “Today” with “For leisure”).
A campaingn directed to casual riders can be launched leveraging advantages of annual membership over casual rides. Of course, there must be some, which is beyond scope of this case study.
The ride registration system must be improved, addressing the issues like:
There are rides with negative or zero durations.
Justify zero (or unreasonably short) rides, especially if they start and end at the same station.
There are rides that span over 3 months of duration.
For data registration puropses, a distinction between single-ride passes and full-day passes shall be implemented in the system.