I. Context

Cyclistic is a bike-share company that aims to conduct a marketing campaign to increase profitability, and has recently contacted me to perform a case study using their historical data. Cyclistic has discovered that annual memberships are more profitable than casual ones, which means that the campaign could rather targets these casual riders, with the aim of converting them into annual memberships.

Customers who purchase single-ride or full-day passes are referred to as casual riders, whereas those who purchase annual memberships are Cyclistic members.

There are 3 main questions that the marketing analyst team wants to answer.

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

As a data analyst, I was assigned to find the answer to the first question : How do annual members and casual riders use Cyclistic bikes differently?

Let me now present in this report all the steps that I took in my analysis.

II. Imports

We will mainly use the great package tidyverse, so let’s import it!

library(tidyverse)
library(lubridate)
library(hms)

III. Step 1 : Ask - What is the business task?

The problem I am trying to solve is the lack of business knowledge with regards to the differentiation between casual and annual riders in terms of behaviour.

So, the business task is to identify key differences in rider behaviour through the analysis of historical data.

The key stakeholders are the executive team (who will decide whether to approve the recommended marketing campaign or not), the director of marketing (responsible for conducting marketing campaigns) and eventually the marketing analytics team.

The insights that I will derive from this analysis will help the stakeholders to have a solid understanding of the behaviour of the casual riders, and most importantly, how they differ from the annual members. This will enable to make data-driven decisions about the best way to approach the population of interest for the campaign, because we would be able to understand how they behave, who they exactly are, and what incentives they are more likely to follow.


IV. Step 2 : Prepare - Accessing Cyclistic historical trip data

The next step in the process is to get the data from the company’s database. Access the data source via this link.

There are a total of 12 files, each containing data for each month from Nov 2020 to Nov 2021, so one year in total.

This data is protected under a license by Motivate International Inc. By accessing or using any of the Data, I agree to all of the terms and conditions of the Data License Agreement, and in particular, it is forbidden to :

  • Attempt to correlate the Data with names, addresses, or other information of customers or Members of Motivate.

Which means that data privacy concerns prevents me from using riders’ personal data to identify them via purchases or other means.


Let’s import the data and then have a glance at the structure of the files.

nov_20 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202011-divvy-tripdata.csv")
head(nov_20)
str(nov_20)
## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
##  $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
##  $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Only focusing on one of the 12 files (November 2020), we can see that it contains 13 columns with the following information :

  • ride_id = the identification number for a certain ride.
  • rideable_type = the type of bicycle used.
  • started_at and ended_at = the time when the trip started and ended.
  • start_station_name and id and end_station_name and id = the name and identification number of the station the trip started and ended.
  • start_lat and lng and end_lat and lng = geolocation of the place where the trip started and ended.
  • member_casual = binary variable taking the value casual or member.

Now, it is time to assess the data reliability : is the data credible and reliable? Or does this data have an inherent bias or any problem that may question its credibility?

Well, it is difficult to answer to those questions, since we do not really have information about how the data was collected. Also, we don’t have much demographic data about the riders : we only know what member type each person is (casual or member), and with only that information we cannot conclude whether the sample is representative of the population, or if a certain group has been underrrepresented. Nevertheless, if the data was taken on an ongoing basis (or randomly) and for every rider, no matter its origins or its demographic information, then we can clearly state that the data accurately represents the company’s population during one full year.


Furthermore, what we can directly confirm, is that the data is :

  • Original : we are using first-party data, which is always the best thing.

  • Comprehensive : the data is enough in order to answer the business question. Indeed, we can calculate ride time by rider type and also discover which stations are more frequented by casual or annual riders. These elements can help us get insights on how different casual and annual riders are.

  • Current : the data was generated during last year, so it is up-to-date and fresh data.

  • Cited : the dataset is provided under a license by Motivate International Inc., which we can consider to be a credible organization which updates the dataset periodically.

Now, let’s group all the files into a single location.

dec_20 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202011-divvy-tripdata.csv")

jan_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202101-divvy-tripdata.csv")

feb_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202102-divvy-tripdata.csv")

mar_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202103-divvy-tripdata.csv")

apr_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202104-divvy-tripdata.csv")

may_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202105-divvy-tripdata.csv")

jun_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202106-divvy-tripdata.csv")

jul_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202107-divvy-tripdata.csv")

aug_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202108-divvy-tripdata.csv")

sep_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202109-divvy-tripdata.csv")

oct_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202110-divvy-tripdata.csv")

nov_21 <- read_csv("~/DATA SCIENCE/Online Courses/Google Certificate/CASE STUDY 1/202111-divvy-tripdata.csv")
str(nov_20)
## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
##  $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
##  $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(jan_21)
## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(feb_21)
## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
##  $ ended_at          : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
##  $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49622] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:49622] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(mar_21)
## spec_tbl_df [228,496 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
##  $ ended_at          : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
##  $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:228496] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:228496] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(apr_21)
## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
##  $ rideable_type     : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
##  $ ended_at          : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
##  $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
##  $ start_station_id  : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
##  $ end_station_name  : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
##  $ end_station_id    : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
##  $ start_lat         : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ start_lng         : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ end_lng           : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:337230] "member" "casual" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(may_21)
## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
##  $ rideable_type     : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
##  $ ended_at          : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
##  $ start_station_name: chr [1:531633] NA NA NA NA ...
##  $ start_station_id  : chr [1:531633] NA NA NA NA ...
##  $ end_station_name  : chr [1:531633] NA NA NA NA ...
##  $ end_station_id    : chr [1:531633] NA NA NA NA ...
##  $ start_lat         : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:531633] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(jun_21)
## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
##  $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
##  $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
##  $ start_station_name: chr [1:729595] NA NA NA NA ...
##  $ start_station_id  : chr [1:729595] NA NA NA NA ...
##  $ end_station_name  : chr [1:729595] NA NA NA NA ...
##  $ end_station_id    : chr [1:729595] NA NA NA NA ...
##  $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(jul_21)
## spec_tbl_df [822,410 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
##  $ rideable_type     : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
##  $ ended_at          : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
##  $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
##  $ end_station_name  : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
##  $ end_station_id    : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
##  $ start_lat         : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:822410] "casual" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(aug_21)
## spec_tbl_df [804,352 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
##  $ rideable_type     : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
##  $ ended_at          : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
##  $ start_station_name: chr [1:804352] NA NA NA NA ...
##  $ start_station_id  : chr [1:804352] NA NA NA NA ...
##  $ end_station_name  : chr [1:804352] NA NA NA NA ...
##  $ end_station_id    : chr [1:804352] NA NA NA NA ...
##  $ start_lat         : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ start_lng         : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ end_lng           : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:804352] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sep_21)
## spec_tbl_df [756,147 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
##  $ ended_at          : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
##  $ start_station_name: chr [1:756147] NA NA NA NA ...
##  $ start_station_id  : chr [1:756147] NA NA NA NA ...
##  $ end_station_name  : chr [1:756147] NA NA NA NA ...
##  $ end_station_id    : chr [1:756147] NA NA NA NA ...
##  $ start_lat         : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:756147] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(oct_21)
## spec_tbl_df [631,226 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:631226] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:631226] NA NA NA NA ...
##  $ end_station_id    : chr [1:631226] NA NA NA NA ...
##  $ start_lat         : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:631226] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(nov_21)
## spec_tbl_df [359,978 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
##  $ rideable_type     : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:359978], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
##  $ ended_at          : POSIXct[1:359978], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
##  $ start_station_name: chr [1:359978] NA NA NA NA ...
##  $ start_station_id  : chr [1:359978] NA NA NA NA ...
##  $ end_station_name  : chr [1:359978] NA NA NA NA ...
##  $ end_station_id    : chr [1:359978] NA NA NA NA ...
##  $ start_lat         : num [1:359978] 41.9 42 42 41.9 41.9 ...
##  $ start_lng         : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ end_lat           : num [1:359978] 42 41.9 42 41.9 41.9 ...
##  $ end_lng           : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ member_casual     : chr [1:359978] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Upon inspection of the data types, we encounter an issue for the November 2020 data, because the variables start_station_id and end_station_id are in double type. They must be converted to character so that they are consistent with the rest of the variables.

nov_20 <- nov_20 %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))

Now, we can safely combine all the data into a single location.

all_data <- bind_rows(nov_20, jan_21, feb_21, mar_21, apr_21, may_21, jun_21, jul_21, aug_21, sep_21, oct_21, nov_21)

The data is properly stored, checked for errors, and ready for further analysis!


V. Step 3 : Process - Cleaning the data

One big source of bad quality data are missing values. So first, let’s inspect the mising values.

NAs <- sum(is.na(all_data))

n <- dim(all_data)[1]

We have 2762988 missing values, out of a total of 5607239 observations, which represents 49.2753742 % of the data, so that A LOT! Therefore, we will not eliminate the rows with missing values, otherwise we may lose a great deal of information. We will assume that the data is “Missing at Random”, so without any systematic source of missingness.

Furthermore, I noticed that sometimes, when the ID for the stations are missing, we get at least their latitude and longitude, and so it does not really matter to have those missing values.


Finally, before moving to the analysis phase, we could create two new variables. This is called feature engineering.

  • ride-length will contain the length of each ride by subtracting the “started_at” column from the “ended_at” column. That way, we can get rid of those two columns and replace them by a single one, which is more informative.

  • day_of_week will allow us to work with weekdays instead of a too detailed time measure.

all_data <- all_data %>% mutate(ended_at = if_else(started_at > ended_at, ended_at + 86400, ended_at), ride_length = ended_at - started_at) 

all_data$ride_length <- hms::as_hms(all_data$ride_length)
all_data <- all_data %>% mutate(day_of_week = wday(started_at)) %>% 
  select(-c(started_at, ended_at))

Just to bear in mind that the number 1 means Monday, and the number 7 means Sunday.

In order to avoid this confusing notation, let us convert it into actual names.

all_data$day_of_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")[all_data$day_of_week]

Now the data shines, and is ready for analysis!


VI. Step 4 : Analyze - Drawing conclusions!

Now it is time to draw conclusions about the differences between both rider types.

Let’s start with how they differ in terms of average riding time.

max_length <- all_data %>% group_by(member_casual) %>% mutate(minutes = (day(ride_length)-1)*1440 + hour(ride_length)*60 + minute(ride_length)) %>% summarise(max(minutes))

min_length <- all_data %>% group_by(member_casual) %>% mutate(minutes = (day(ride_length)-1)*1440 + hour(ride_length)*60 + minute(ride_length)) %>% summarise(min(minutes))

all_data %>% group_by(member_casual) %>% summarise(mean_ride_time = as_hms(round(mean(ride_length), 1))) %>% ggplot() + geom_col(mapping = aes(x=member_casual, y = mean_ride_time)) + labs(title = "Difference between member type in average ride time", caption = paste0("The maximum and minimum for casual are ", max_length[1,2], " and ", min_length[1,2], " minutes respectively, and for members ", max_length[2,2], " and ", min_length[2,2], " minutes")) + xlab("Member Type") + ylab("Average Ride Time")

We can clearly see that casual riders spend, on average, around two times more time on their rides than members.

library(pander)

min_max_table <- inner_join(max_length, min_length)

colnames(min_max_table) <- c("Member Type", "Maximum Ride Length (min)", "Minimum Ride Length (min)")

pander(min_max_table)
Member Type Maximum Ride Length (min) Minimum Ride Length (min)
casual 41645 0
member 1559 0

This table summarises the maximum and minimum ride time per member type. This further confirms that casual riders spend more time using the bicicles.


Now, what about the stations? Where does each rider type go?

library(cowplot)

all_data_start_st <- all_data %>% select(member_casual, start_station_name) %>% group_by(start_station_name, member_casual) %>% mutate(count_day = n()) %>% distinct() %>% data.frame()

all_data_end_st <- all_data %>% select(member_casual, end_station_name) %>% group_by(end_station_name, member_casual) %>% mutate(count_day = n()) %>% distinct() %>% data.frame()


# Plots for start stations
plot1 <- all_data_start_st %>% filter(member_casual == "casual") %>% drop_na() %>% arrange(desc(count_day)) %>% slice(1:10) %>% ggplot(aes(reorder(start_station_name, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Top 10 start stations for casual") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White") 

plot2 <- all_data_start_st %>% filter(member_casual == "member") %>% drop_na() %>% arrange(desc(count_day)) %>% slice(1:10) %>% ggplot(aes(reorder(start_station_name, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Top 10 start stations for members") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White")



# Plots for end stations
plot3 <- all_data_end_st %>% filter(member_casual == "casual") %>% drop_na() %>% arrange(desc(count_day)) %>% slice(1:10) %>% ggplot(aes(reorder(end_station_name, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Top 10 end stations for casual") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White") 

plot4 <- all_data_end_st %>% filter(member_casual == "member") %>% drop_na() %>% arrange(desc(count_day)) %>% slice(1:10) %>% ggplot(aes(reorder(end_station_name, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Top 10 end stations for members") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White")



# Plotting
plot_grid(plot1, plot2, plot3, plot4, nrow = 2, ncol = 2)

Streeter Dr & Grand Ave is definitely the most popular station for casual riders, and Clark St & Elm St for members.


And finally, let us now learn how the day-of-week distribution looks like per membership type. Bar charts would be of great help here!

options(scipen=999)

library(cowplot)

all_data_new_2 <- all_data %>% select(member_casual, day_of_week) %>% group_by(member_casual, day_of_week) %>% mutate(count_day = n()) %>% distinct() %>% data.frame() %>% mutate(day_of_week = factor(day_of_week)) 


plot1 <- all_data_new_2 %>% filter(member_casual == "casual") %>% ggplot(aes(reorder(day_of_week, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Casual") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White") 

plot2 <- all_data_new_2 %>% filter(member_casual == "member") %>% ggplot(aes(reorder(day_of_week, -count_day), count_day)) + geom_bar(stat = "identity") + xlab("") + ylab("Count") + labs(title = "Members") + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + geom_text(aes(label=count_day), vjust=0.3, hjust = 1.1, colour = "White")

plot_grid(plot1, plot2)

We now understand our customers much better!

In fact, the casual riders are more active during the weekends, whereas the members use our services more during the working days, which makes sense because they pay us for more convenience (e.g. commuting to the workplace).

VII. Summing up…

We can now answer to our business question (Which are the key differences in rider behaviour through the analysis of historical data?) and state some key differences between casual and member riders :

  • 1- Casual riders spend much more time using bikes, which means that we really have a marketing opportunity to attract awareness of our services through targeted campaigns.

  • 2- We have identified the actual stations where each customer type is more likely to go. This will tremendously help to know where to put ads, posters and other marketing materials.

  • 3- Casual riders are more likely to use our services during weekends, while members ride during the working days.


And of course, now we are much more informed to conduct a successful marketing campaign… Thanks to unleashing the power of data!