Reading time: 28 minutes

1 PROJECT SUMMARY

This is one of the two capstone projects of the 6-month extensive Google Data Analytics Professional Certificate program. This project was about analysing big data for a fictional bike-sharing company in Chicago on how best to convert casual riders into annual members.

1 multi-layers interactive map, 4 multi-layers ggplots (scatter, line, and bar plots), 1 interactive table, and 7 basic statistics tables were coded and achieved insightful results. 12 data sets that hosting the total of 3.7 millions of rows and 13 variables were combined, cleaned, manipulated and analysed using R programming language. Data integrity of this project was assessed. Data processing steps were documented.

Results revealed that casual riders contributed 51% of the total riding duration in the given 12 months data to analyse. Casual riders rided the most on Saturday (23%), doubling the size in week days (11%), however, 4 days in a week has 11% indicating a steady demand of rides and targeting casual riders on these days may actually achieve higher conversion rate. Casual riders love spring and summer but decreased dramatically during months in Autumn and achieve near-zero ride per day in winter. Winter has temperatures below zero degrees Celsius. Casual riders concentrate in regions near coasts and city center. They have adoped 2 newly introduced ride-able bikes and has replaced the usual docked bike in 2021 as their primary and secondary bikes.

2 R PACKAGES

R Packages loaded to perform functions required in this project: tidyverse series (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, kableExtra, lubridate, leaflet, scales, hms, ggrepel and DT.

library(tidyverse)
library(skimr)
library(kableExtra)
library(lubridate)
library(leaflet)
library(scales)
library(hms)
library(ggrepel)
library(DT)

3 SCENARIO AND BUSINESS TASKS

Cyclistic, is a fictional bike-share company designed by this Google data analytic certification course. In the scenario, the company features more than 5824 bicycles and 692 docking stations. The company also offers many types of bicycle to account for different group of riders, including standard two-wheeled bikes, reclining bikes, hand tricycles, and cargo bikes. Cyclistic also offers two types of memberships, a casual and an annual type. Casual type is designed for single-ride and full-day passes.

Cyclistic

Cyclistic recently found that converting more casual riders into annual members is more profitable. Therefore, they will need a tailored marketing strategy for casual riders, supported data analysis. My task is to analyse the available data to answer following business questions with compelling data insights and professional data visualisation.

  • How are casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

4 DATA PREPARATION

4.1 Data sources

I have been provided with Cyclistic’s previous 12 months trip data, starting from May 2020 to April 2021. These data were stored in 12 spreadsheets in a monthly basis. Thus, they are additive to each other by date. All of them have the same number of columns and column names.

4.2 Data import

The course introduced that a data set is considered small if it has less than 100k rows and spreadsheet could be used for analysis and basic analysis. However, after inspecting the first of the twelve spreadsheets, it has already 200k rows of data. I am expecting after combining all the 12 spreadsheets, I would have a data frame exceeding 1 million rows of data and beyond the scale a spreadsheet can handle. Therefore, I am using R for this data analysis project.

# Import individual csv spreadsheet
may20 <- read_csv("202005-divvy-tripdata.csv") # 200,274 rows | 13 columns
jun20 <- read_csv("202006-divvy-tripdata.csv") # 343,005 rows | 13 columns
jul20 <- read_csv("202007-divvy-tripdata.csv") # 551,480 rows | 13 columns
aug20 <- read_csv("202008-divvy-tripdata.csv") # 622,361 rows | 13 columns
sep20 <- read_csv("202009-divvy-tripdata.csv") # 532,958 rows | 13 columns
oct20 <- read_csv("202010-divvy-tripdata.csv") # 388,653 rows | 13 columns
nov20 <- read_csv("202011-divvy-tripdata.csv") # 259,716 rows | 13 columns
dec20 <- read_csv("202012-divvy-tripdata.csv") # 131,573 rows | 13 columns
jan21 <- read_csv("202101-divvy-tripdata.csv") # 96,834 rows | 13 columns
feb21 <- read_csv("202102-divvy-tripdata.csv") # 49,622 rows | 13 columns
mar21 <- read_csv("202103-divvy-tripdata.csv") # 228,496 rows | 13 columns
apr21 <- read_csv("202104-divvy-tripdata.csv") # 337,230 rows | 13 columns

# combine all csv files into one table
mydata <- rbind(may20, jun20, jul20, aug20, sep20, oct20, nov20, dec20, jan21, feb21, mar21, apr21) 

As expected, after importing all the spreadsheets and combining them into one, the overall data set has nearly 3.7 millions of rows. A specialized data analysis software such as R is needed to process the data effectively.

4.3 Data description

This table describes the basic details of the data set.

Variables <- names(mydata)
Description <- c("ID of individual ride", 
                 "There are 3 types of rideable bike which are classic bike, docked bike, and eletric bike",
                 "Date and time that the individual riding starting at",
                 "Date and time that the individual riding ended at",
                 "Associated starting station of this riding",
                 "Station id of the starting station",
                 "Associated ending station of this riding",
                 "Station id of this ending station",
                 "Ride starting latitude",
                 "Ride starting longitude",
                 "Ride ending latitude",
                 "Ride ending latitude",
                 "Membership status of the rider, classified into annual membership or casual membership")

des <- data.frame(Variables, Description)

des %>% 
  kbl(align = "l",
      caption = "Table 4.3 Data set description") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "striped",
                                      "hover",
                                      "condensed"))  
Table 4.3 Data set description
Variables Description
ride_id ID of individual ride
rideable_type There are 3 types of rideable bike which are classic bike, docked bike, and eletric bike
started_at Date and time that the individual riding starting at
ended_at Date and time that the individual riding ended at
start_station_name Associated starting station of this riding
start_station_id Station id of the starting station
end_station_name Associated ending station of this riding
end_station_id Station id of this ending station
start_lat Ride starting latitude
start_lng Ride starting longitude
end_lat Ride ending latitude
end_lng Ride ending latitude
member_casual Membership status of the rider, classified into annual membership or casual membership

4.4 Data exploration

Following tables show that this combined data set has 3,742,202 rows of observations and 13 variables. 7 out of the 13 variables are in character type, 4 columns are numeric, and the remaining 2 are date-time (or known as POSIXct format in R).

The complete_rate tells us the completeness of each variable, or how many they have missing values in their column, which is associated with the n_missing before it. We can clearly see that there are 6 variables have missing data:

  • start_station_name
  • start_station_id
  • end_station_name
  • end_station_id
  • end_late
  • end_lng

Data summary
Name Piped data
Number of rows 3742202
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 3741993 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 148231 0.96 10 53 0 711 0
start_station_id 148857 0.96 1 35 0 1265 0
end_station_name 171317 0.95 10 53 0 712 0
end_station_id 171778 0.95 1 35 0 1266 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.08
start_lng 0 1 -87.64 0.03 -87.87 -87.66 -87.64 -87.63 -87.52
end_lat 4906 1 41.90 0.04 41.54 41.88 41.90 41.93 42.16
end_lng 4906 1 -87.65 0.03 -88.07 -87.66 -87.64 -87.63 -87.44

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-05-01 00:02:07 2021-04-30 23:59:53 2020-09-09 15:51:13 3257750
ended_at 0 1 2020-05-01 00:13:03 2021-05-05 22:14:39 2020-09-09 16:08:31 3245372

I have also identified some character variables that can be used for data grouping, which is helpful in later analysis. To use them in this purpose, I will convert them into factor type in data manipulation section. These variables are:

  • rideable_type
  • start_station_name
  • end_station_name
  • member_casual

Another way to view how the type of our 13 variables are being assigned by R:

glimpse(mydata)
## Rows: 3,742,202
## Columns: 13
## $ ride_id            <chr> "02668AD35674B983", "7A50CCAF1EDDB28F", "2FFCDFDB91~
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke~
## $ started_at         <dttm> 2020-05-27 10:03:52, 2020-05-25 10:47:11, 2020-05-~
## $ ended_at           <dttm> 2020-05-27 10:16:49, 2020-05-25 11:05:40, 2020-05-~
## $ start_station_name <chr> "Franklin St & Jackson Blvd", "Clark St & Wrightwoo~
## $ start_station_id   <chr> "36", "340", "260", "251", "261", "206", "261", "18~
## $ end_station_name   <chr> "Wabash Ave & Grand Ave", "Clark St & Leland Ave", ~
## $ end_station_id     <chr> "199", "326", "260", "157", "206", "22", "261", "18~
## $ start_lat          <dbl> 41.8777, 41.9295, 41.9296, 41.9680, 41.8715, 41.847~
## $ start_lng          <dbl> -87.6353, -87.6431, -87.7079, -87.6500, -87.6699, -~
## $ end_lat            <dbl> 41.8915, 41.9671, 41.9296, 41.9367, 41.8472, 41.869~
## $ end_lng            <dbl> -87.6268, -87.6674, -87.7079, -87.6368, -87.6468, -~
## $ member_casual      <chr> "member", "casual", "casual", "casual", "member", "~

A helpful table describing some common R data type we would encounter in this project.

Type <- c("<chr>", "<dbl>", "<fctr>", "<date>", "<dttm>")
Explanation <- c("Character, for text data.", 
                 "Double, for number with decimal places.",
                 "factor, also text data but are usually used to grouping.",
                 "date, as its name would suggest.",
                 "date-time, combination of date and time.")

r_datatype <- data.frame(Type, Explanation)

r_datatype %>% 
  kbl(table.attr = "style = 'width: 60%;'",
      caption = "Table 4.4 Common R data types in this project.") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "striped",
                                      "hover",
                                      "condensed"))
Table 4.4 Common R data types in this project.
Type Explanation
<chr> Character, for text data.
<dbl> Double, for number with decimal places.
<fctr> factor, also text data but are usually used to grouping.
<date> date, as its name would suggest.
<dttm> date-time, combination of date and time.

5 DATA INTEGRITY ASSESSMENT

Negatives

  • I was introduced that Cyclistic has standard two-wheeled bikes, reclining bikes, hand tricycles, and cargo bikes. However, only 1 column named “rideable_type” was provided in the data sets and showing 3 different type of bikes that were not mentioned in the introduction - classic bike, docked-bike and electric bike. No further details were provided to help clarifying what these new type of bikes are and the relationship between them and previously mentioned 4 type of bikes.

  • I was provided with a column classifying each ride into casual and annual membership, which is good. However, it would be better if the two types of the casual membership, single-ride and full-day passes, introduced in the introduction section are also being included in the data set, for a better complete analysis.

  • Some other useful variables were also not included in the data set such as weather of the day, seasons, public holidays, temperatures, wind speed, and if available, rider ages and gender can also be helpful in designing marketing strategy.

Positives

  • Fairness - The data was system-generated, base on its size of being millions rows with individual ride IDs. The data should be quite fair and free from collection biases and erroneous.

  • Inclusivity: Apart from lacking some variables, the data was quite inclusive. The Data was in the range of 24 hours and was provided to me as a whole. It is the entire population data of the requested 12-month period, instead of sub-set data sampled from the mother data set. This eliminate the issue of sampling bias.

  • All data was collected by the original company, Cyclistic, instead of from second- or third party sources. Cyclistic has to proper system to collect the big data. All provided data were also up-to-date and current. Comprehensiveness can be argued for lacking some useful variables, but the data set still have many other important variables to synthese desired outcome. In real world case, stakeholders will be approached about this issue and be informed.

6 DATA CLEANING AND MANIPULATION

6.1 Data type conversion

6.1.1 Character to factor

In this section, following variables are being converted from character to factor. They are string variables with distinctive levels that are useful for data grouping and analysis. Please note that numerical variables can also have distinctive levels and be converted to factor, but it is not in this case.

  • rideable_type
  • start_station_name
  • end_station_name
  • member_casual
chr_to_fct_variables <- c("rideable_type", 
                          "start_station_name", 
                          "end_station_name", 
                          "member_casual")

mydata_cleaned <- mydata

mydata_cleaned[,chr_to_fct_variables] <- lapply(mydata_cleaned[,chr_to_fct_variables], factor)

After conversion, we are able to use R to identify their distinctive levels and how many are there.

#---- to find level information of these variables----#

# rideable_type
nlevels(mydata_cleaned$rideable_type)
levels(mydata_cleaned$rideable_type) # Explore it as the number of levels is small and suitable for grouping

# start_station_name
nlevels(mydata_cleaned$start_station_name)

# end_station_name
nlevels(mydata_cleaned$end_station_name)

# member_casual
nlevels(mydata_cleaned$member_casual) # Explore it as the number of levels is small and suitable for grouping
levels(mydata_cleaned$member_casual)

head(levels(mydata_cleaned$start_station_name)) # to find the first 6 station names
head(levels(mydata_cleaned$end_station_name)) # to find the first 6 station names
#---- Creating the table ----#

Variable <- c("rideable_type", "start_station_name", "end_station_name", "member_casual")
Level_count <- c("3", "711", "712", "2")
Levels <- c("classic_bike, docked_bike, electric_bike",
            "2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd...",
            "2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd...",
            "Casual, member")

data.frame(Variable, Level_count, Levels) %>% 
  kbl(caption = "Table 6.1.1 Distinctive levels of grouping variables") %>% 
  kable_classic()
Table 6.1.1 Distinctive levels of grouping variables
Variable Level_count Levels
rideable_type 3 classic_bike, docked_bike, electric_bike
start_station_name 711 2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd…
end_station_name 712 2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd…
member_casual 2 Casual, member

6.1.2 Date

We shall only start using date or date-time for our analysis when they are in the correct format, as being recognized and labeled as “date” or “date-time”. Date or date-time are commonly labelled as a “character” variable if their arrangement of year, month, date, hour, minute, and second do not match with the date format of R in the form of year-month-date_hour-minute-second.

However, the date in our dataset is perfectly to go. They match the date format of R and have been recognised as date-time () by R.

Our date-time variables: started_at and ended_at:

glimpse(mydata_cleaned)
## Rows: 3,742,202
## Columns: 13
## $ ride_id            <chr> "02668AD35674B983", "7A50CCAF1EDDB28F", "2FFCDFDB91~
## $ rideable_type      <fct> docked_bike, docked_bike, docked_bike, docked_bike,~
## $ started_at         <dttm> 2020-05-27 10:03:52, 2020-05-25 10:47:11, 2020-05-~
## $ ended_at           <dttm> 2020-05-27 10:16:49, 2020-05-25 11:05:40, 2020-05-~
## $ start_station_name <fct> Franklin St & Jackson Blvd, Clark St & Wrightwood A~
## $ start_station_id   <chr> "36", "340", "260", "251", "261", "206", "261", "18~
## $ end_station_name   <fct> Wabash Ave & Grand Ave, Clark St & Leland Ave, Kedz~
## $ end_station_id     <chr> "199", "326", "260", "157", "206", "22", "261", "18~
## $ start_lat          <dbl> 41.8777, 41.9295, 41.9296, 41.9680, 41.8715, 41.847~
## $ start_lng          <dbl> -87.6353, -87.6431, -87.7079, -87.6500, -87.6699, -~
## $ end_lat            <dbl> 41.8915, 41.9671, 41.9296, 41.9367, 41.8472, 41.869~
## $ end_lng            <dbl> -87.6268, -87.6674, -87.7079, -87.6368, -87.6468, -~
## $ member_casual      <fct> member, casual, casual, casual, member, member, mem~

6.2 Handling the missing values

There are not much values missing in this data set but only a small proportion (compared to overall data set) in these variables:

  • Start station name (start_station_name) - 148231 missing, accounts for 4% (0.96)
  • Start station id (start_station_id) - 148857 missing, accounts for 4% (0.96)
  • End station name (end_station_name) - 171317 missing, accounts for 4.6% (0.954)
  • End station id (end_station_id) - 171778 missing, accounts for 4.6% (0.954)
skim_without_charts(mydata_cleaned)
Data summary
Name mydata_cleaned
Number of rows 3742202
Number of columns 13
_______________________
Column type frequency:
character 3
factor 4
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 3741993 0
start_station_id 148857 0.96 1 35 0 1265 0
end_station_id 171778 0.95 1 35 0 1266 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
rideable_type 0 1.00 FALSE 3 doc: 2498407, ele: 709303, cla: 534492
start_station_name 148231 0.96 FALSE 711 Str: 38524, Cla: 34227, Lak: 33560, The: 32243
end_station_name 171317 0.95 FALSE 712 Str: 40552, Cla: 34022, The: 33725, Lak: 32733
member_casual 0 1.00 FALSE 2 mem: 2198853, cas: 1543349

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.08
start_lng 0 1 -87.64 0.03 -87.87 -87.66 -87.64 -87.63 -87.52
end_lat 4906 1 41.90 0.04 41.54 41.88 41.90 41.93 42.16
end_lng 4906 1 -87.65 0.03 -88.07 -87.66 -87.64 -87.63 -87.44

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-05-01 00:02:07 2021-04-30 23:59:53 2020-09-09 15:51:13 3257750
ended_at 0 1 2020-05-01 00:13:03 2021-05-05 22:14:39 2020-09-09 16:08:31 3245372

Missing values can usually be filled up with valid values that is either retrieved by adjacent related variables, or looking for median or mean, or getting inputs from authorised stakeholders. However, the nature of these missing values in these locational variables could not be retried by these methods.

Since these missing values only contribute less than 5% of the data set, I will drop them in this project.

6.3 Data manipulation

This section synthesise new variables base on current data set and internet sources.

6.3.1 Create “duration”

I calculate the traveling times of each trip and stored it in a column named duration.min. The “min” stands for “minute”, and the column is relocated after the column ended_at.

mydata_cleaned <- mydata_cleaned %>% 
  mutate(duration.min = ended_at - started_at) %>% 
  relocate(duration.min, .after = ended_at) %>% 
  mutate(duration.min = minute(seconds_to_period(duration.min)))

6.3.2 Create “year, month, day, hour, minute, date, weekday, season, workingday”

This section will synthesise these new variables to help achieve better insights.

New_variables <- c("year",
                   "month",
                   "day", 
                   "hour",
                   "minute",
                   "date",
                   "weekday",
                   "season", 
                   "workingday")
Levels <- c("Based on the date-time column - started_at.",
            "Based on the date-time column - started_at.",
            "Based on the date-time column - started_at.",
            "Based on the date-time column - started_at.",
            "Based on the date-time column - started_at.",
            "A combination of the year-month-day.",
            "Monday, tuesday, wednesday, thursday, friday, saturday and sunday.",
            "Spring, summer, fall, winter. ",
            "Yes: include Monday, Tuesday, Wednesday, Thursday and Friday. \n No: include Saturday, Sunday, and public holidays.")
Reasons <- c("To aid data grouping during analysis.",
             "To aid data grouping during analysis.",
             "To aid data grouping during analysis.",
             "To aid data grouping during analysis.",
             "To aid data grouping during analysis.",
             "Prepare in advance for whenever needed.",
             "To account for differences between individual weekday.",
             "As temperature may affect ridering experience, particularly as a casual rider.",
             "Interesting to see how are riders count differs between working day and non-working day.")

df_newvar <- data.frame(New_variables, Levels, Reasons)

df_newvar %>% kbl(caption = "Table 6.3.2.1 New variables adding to the existing table.") %>% 
  kable_styling(bootstrap_options = c("border", "hover", "condensed"))
Table 6.3.2.1 New variables adding to the existing table.
New_variables Levels Reasons
year Based on the date-time column - started_at. To aid data grouping during analysis.
month Based on the date-time column - started_at. To aid data grouping during analysis.
day Based on the date-time column - started_at. To aid data grouping during analysis.
hour Based on the date-time column - started_at. To aid data grouping during analysis.
minute Based on the date-time column - started_at. To aid data grouping during analysis.
date A combination of the year-month-day. Prepare in advance for whenever needed.
weekday Monday, tuesday, wednesday, thursday, friday, saturday and sunday. To account for differences between individual weekday.
season Spring, summer, fall, winter. As temperature may affect ridering experience, particularly as a casual rider.
workingday Yes: include Monday, Tuesday, Wednesday, Thursday and Friday. No: include Saturday, Sunday, and public holidays. Interesting to see how are riders count differs between working day and non-working day.

Creating year, month, day, hour, minute and weekday

Important note: variables - “year”, “month”, and “day” are created using data from the “started_at” column. Click the right side code to view my codes.

mydata_cleaned <- mydata_cleaned %>% 
  mutate(year = year(started_at),          
         month = month(started_at),
         day = day(started_at),
         hour = hour(started_at),
         minute = minute(started_at),
         date = make_date(year, month, day),
         weekday = weekdays(started_at)) 

Creating monthly average temperature (Celsius) and seasons

Following information are extracted from Chicago climatic table on Wikipedia. Note that Chicago has shorter spring and autumn but longer winter and summer.

month <- 1:12
mean_temp_oC <- c(-3.2, -1.2, 4.4, 10.5, 16.6, 22.2, 24.8, 23.9,
                         19.9, 12.9, 5.8, -0.3)
season <- c(rep("winter", 3),
            rep("spring", 2),
            rep("summer", 4),
            rep("autumn", 2),
            "winter"
            )

data.frame(month, mean_temp_oC, season) %>% 
  kbl(alig = "l",
      table.attr = "style  ='width: 60%;'",
      caption = "Table 6.3.2.2 Monthly average temperature (Celsius) with seasons") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "hover",
                                      "striped"))
Table 6.3.2.2 Monthly average temperature (Celsius) with seasons
month mean_temp_oC season
1 -3.2 winter
2 -1.2 winter
3 4.4 winter
4 10.5 spring
5 16.6 spring
6 22.2 summer
7 24.8 summer
8 23.9 summer
9 19.9 summer
10 12.9 autumn
11 5.8 autumn
12 -0.3 winter

Adding this table to our cleaned data set by right-hand Code .

season_table <- data.frame(month, mean_temp_oC, season)
mydata_cleaned <- merge(mydata_cleaned, season_table, by = "month") %>% 
  relocate(month, .after = year)

Creating workingday

Working day, or “workingday” in the data set has 2 values, either “yes” or “no”. The “no” meaning it is not a working day which will include Saturday, Sunday and public holidays. Following public holiday days data are extracted from Chicago Government Website.

month <- c(1, 1, 2, 2, 3, 5, 7, 9, 10, 11, 11, 12, 12)
day <- c(1, 18, 12, 15, 1, 31, 5, 6, 11, 11, 25, 24, 31)
festival <- c("New Year's Day",
              "Dr. Martin Luther King, Jr.'s Birthday",
              "Lincoln's Birthday",
              "President's Day",
              "Pulaski Day",
              "Memorial Day", 
              "Independence Day",
              " Labor Day",
              "Columbus Day",
              "Veteran's Day",
              "Thanksgiving Day",
              "Christmas Day",
              "New Year's Day"
              )

data.frame(month, day, festival) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 60%;'",
      caption = "Table 6.3.2.4 Chicago public holidays") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "striped",
                                      "hover"))
Table 6.3.2.4 Chicago public holidays
month day festival
1 1 New Year’s Day
1 18 Dr. Martin Luther King, Jr.’s Birthday
2 12 Lincoln’s Birthday
2 15 President’s Day
3 1 Pulaski Day
5 31 Memorial Day
7 5 Independence Day
9 6 Labor Day
10 11 Columbus Day
11 11 Veteran’s Day
11 25 Thanksgiving Day
12 24 Christmas Day
12 31 New Year’s Day

Adding this table to our cleaned data set by right-hand Code .

mydata_cleaned <- mydata_cleaned %>%
  mutate(workingday = case_when(
    weekday == "Sunday" ~ "No",
    weekday == "Saturday" ~ "No",
    month == 1 & day == 1 ~ "No",
    month == 1 & day == 18 ~ "No",
    month == 2 & day == 12 ~ "No",
    month == 2 & day == 15 ~ "No",
    month == 3 & day == 1 ~ "No",
    month == 5 & day == 31 ~ "No",
    month == 7 & day == 5 ~ "No",
    month == 9 & day == 6 ~ "No",
    month == 10 & day == 11 ~ "No",
    month == 11 & day == 11 ~ "No",
    month == 11 & day == 25 ~ "No",
    month == 12 & day == 24 ~ "No",
    month == 12 & day == 31 ~ "No",
    TRUE ~ "Yes"
  )) 

6.4 Data processing documentation

The name of original data set in R is called “mydata”. The data set is now called “mydata_cleaned” and stored in a different R object.

Summary of cleaning and processing I have done to the original dataset:

  • Converted variables rideable_type, start_station_name , end_station_name, and member_casual from character to factor.
  • 11 new variables (columns) were added to the original data set. They are:
head(mydata_cleaned[,c(5, 15:24)]) %>%  # showing only top 6 observations of these new variables
  kbl(align = "c",
      caption = "The first 6 rows of the data set") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "hover",
                                      "striped"))
The first 6 rows of the data set
duration.min year month day hour minute date weekday mean_temp_oC season workingday
10 2021 1 23 16 14 2021-01-23 Saturday -3.2 winter No
4 2021 1 27 18 43 2021-01-27 Wednesday -3.2 winter Yes
1 2021 1 21 22 35 2021-01-21 Thursday -3.2 winter Yes
11 2021 1 7 13 31 2021-01-07 Thursday -3.2 winter Yes
0 2021 1 23 2 24 2021-01-23 Saturday -3.2 winter No
53 2021 1 9 14 24 2021-01-09 Saturday -3.2 winter No

7 SUMMARY STATISTICS

Summarizing some introductory statistics of our data set to quick familiarising ourselves with some of the patterns in the data.

7.1 Casual riders ride the most!

We have annual members 15% higher than casual riders, the proportion of casual riders is larger than I thought (Table 7.1.1). More importantly, casual riders that has a 15% lower in the population even ride slightly longer than annual members at a 2% different in the total duration (minutes) in the 12 month period (Table 7.1.2).

On average riding minute per person, casual rider ride 7 minutes more than the annual members, it is a 20% higher riding duration than the annual members (Table 7.1.3).

mydata_cleaned %>% 
  group_by(member_casual) %>% 
  summarise(Riding_frequency = n()) %>% 
  mutate(Percentage = paste0(round(Riding_frequency/sum(Riding_frequency)*100), '%')) %>% 
  relocate(Percentage, .after = Riding_frequency) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 40%;'",
      caption = "Table 7.1.1 Casual vs member Frequency") %>% 
    kable_styling(bootstrap_options = c("bordered",
                                      "hover")) %>% 
  row_spec(2, color = "red")
Table 7.1.1 Casual vs member Frequency
member_casual Riding_frequency Percentage
casual 1543349 41%
member 2198853 59%
mydata_cleaned %>% 
  group_by(member_casual) %>% 
  summarise(total_riding_duration_min = sum(duration.min)) %>%  
  mutate(Percentage = total_riding_duration_min / sum(total_riding_duration_min),
         Percentage = paste0(round(Percentage*100), "%") ) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 40%;'",
      caption = "Table 7.1.2 Casual vs member total riding duration in minutes") %>% 
    kable_styling(bootstrap_options = c("bordered",
                                      "hover")) %>% 
  row_spec(1, color = "red")
Table 7.1.2 Casual vs member total riding duration in minutes
member_casual total_riding_duration_min Percentage
casual 31976253 51%
member 30323629 49%
mydata_cleaned %>% 
  group_by(member_casual) %>% 
  summarise(average_riding_duration_min = mean(duration.min)) %>%  
  mutate(Percentage = average_riding_duration_min / sum(average_riding_duration_min),
         Percentage = paste0(round(Percentage*100), "%") ) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 40%;'",
      caption = "Table 7.1.3 Casual vs member average riding duration in minutes") %>% 
    kable_styling(bootstrap_options = c("bordered",
                                      "hover")) %>% 
  row_spec(1, color = "red")
Table 7.1.3 Casual vs member average riding duration in minutes
member_casual average_riding_duration_min Percentage
casual 20.71874 60%
member 13.79066 40%

These statistics suggest that casual riders are very important group of riders and they have a great demand of our bikes.

7.2 All riders love Summer

This section incorporate years, months, seasons, and temperatures with the member types. A visualisation in EDA section will better explain this table.

However, I have highlighted the top 5 rows that has the highest numbers of riders in following table (Table 7.2). The table suggest that riders love riding in the summer, which might be the best season for our marketing activities.

mydata_cleaned %>% 
  group_by(year, month, season, mean_temp_oC, member_casual) %>% 
    summarise(riding_frequency = n(),
            total_riding_duration_min = sum(duration.min),
            average_riding_duration_min = round(mean(duration.min),2)) %>% 
  kbl(align = "c",
      table.attr = "style = 'width: 100%;'",
      caption = "Table 7.2 Year + Month + Season + Average temp (oC) + Membership type") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "hover",
                                      "striped")) %>% 
  row_spec(5:8, color = "red", background = "yellow") %>% 
  row_spec(10, color = "red", background = "yellow")
Table 7.2 Year + Month + Season + Average temp (oC) + Membership type
year month season mean_temp_oC member_casual riding_frequency total_riding_duration_min average_riding_duration_min
2020 5 spring 16.6 casual 86909 2053309 23.63
2020 5 spring 16.6 member 113365 1940318 17.12
2020 6 summer 22.2 casual 154718 3524288 22.78
2020 6 summer 22.2 member 188287 3068354 16.30
2020 7 summer 24.8 casual 269296 6138007 22.79
2020 7 summer 24.8 member 282184 4372428 15.49
2020 8 summer 23.9 casual 289661 6190766 21.37
2020 8 summer 23.9 member 332700 4856986 14.60
2020 9 summer 19.9 casual 230692 4599016 19.94
2020 9 summer 19.9 member 302266 4095711 13.55
2020 10 autumn 12.9 casual 145012 2616028 18.04
2020 10 autumn 12.9 member 243641 2986650 12.26
2020 11 autumn 5.8 casual 88099 1614182 18.32
2020 11 autumn 5.8 member 171617 2095926 12.21
2020 12 winter -0.3 casual 30080 483810 16.08
2020 12 winter -0.3 member 101493 1144087 11.27
2021 1 winter -3.2 casual 18117 275598 15.21
2021 1 winter -3.2 member 78717 875103 11.12
2021 2 winter -1.2 casual 10131 184282 18.19
2021 2 winter -1.2 member 39491 502914 12.73
2021 3 winter 4.4 casual 84033 1654605 19.69
2021 3 winter 4.4 member 144463 1805391 12.50
2021 4 spring 10.5 casual 136601 2642362 19.34
2021 4 spring 10.5 member 200629 2579761 12.86

7.3 Statistics of the three ride-able types

I wanted to compare which of the “classic_bike” “docked_bike” and “electric_bike” are the most popular bike by simply looking at numbers of riders for each bike. However, this approach ceased because 3 of these different bikes are initiated by Cyclistic in different dates (see blue highlighted in following table 7.3).

It will be a unfair comparison if simply looking at proportion of riders in each bike category. I will visualise it in the EDA section, and it will tell a very good story!

mydata_cleaned %>% 
  group_by(year, month, rideable_type) %>% 
  summarise(riding_frequency = n(),
            total_riding_duration_min = sum(duration.min),
            average_riding_duration_min = round(mean(duration.min),2)) %>% 
  kbl(align = "l",
      table.attr = "style = 'width: 70%;'",
      caption = "Table 7.3 Blue strips are when a rideable type was first introduced") %>% 
  kable_styling(bootstrap_options = c("bordered",
                                      "hover",
                                      "striped")) %>% 
  row_spec(c(1, 4, 13), color = "white", background = "blue") 
Table 7.3 Blue strips are when a rideable type was first introduced
year month rideable_type riding_frequency total_riding_duration_min average_riding_duration_min
2020 5 docked_bike 200274 3993627 19.94
2020 6 docked_bike 343005 6592642 19.22
2020 7 docked_bike 549545 10479497 19.07
2020 7 electric_bike 1935 30938 15.99
2020 8 docked_bike 556166 9970464 17.93
2020 8 electric_bike 66195 1077288 16.27
2020 9 docked_bike 404606 6771325 16.74
2020 9 electric_bike 128352 1923402 14.99
2020 10 docked_bike 236477 3593983 15.20
2020 10 electric_bike 152176 2008695 13.20
2020 11 docked_bike 151582 2295337 15.14
2020 11 electric_bike 108134 1414771 13.08
2020 12 classic_bike 70616 886180 12.55
2020 12 docked_bike 13004 185868 14.29
2020 12 electric_bike 47953 555849 11.59
2021 1 classic_bike 61700 740644 12.00
2021 1 docked_bike 2106 45543 21.63
2021 1 electric_bike 33028 364514 11.04
2021 2 classic_bike 35012 493072 14.08
2021 2 docked_bike 1271 30458 23.96
2021 2 electric_bike 13339 163666 12.27
2021 3 classic_bike 152545 2248432 14.74
2021 3 docked_bike 15657 395910 25.29
2021 3 electric_bike 60294 815654 13.53
2021 4 classic_bike 214619 3235106 15.07
2021 4 docked_bike 24714 624303 25.26
2021 4 electric_bike 97897 1362714 13.92

7.4 Casual riders ride the most in non-working days

53% of casual riders love riding in non-working day that include public holiday, Saturday and Sunday, and 47% of casual riders love riding during working day.

Base on the data, it clearly suggest that nearly half of the casual riders ride during working day, it would imply a steady demand, and they need to write our bike constantly. Targeting these riders would have high casual-annual membership conversion.

Table 7.4 Non-working day (Sat + Sun + Pub.Holi) vs working day
member_casual workingday riding_frequency total_riding_duration_min average_riding_duration_min average_riding_duration_percentage
casual No 646691 14202547 21.96 53%
casual Yes 896658 17773706 19.82 47%
member No 643214 9665673 15.03 53%
member Yes 1555639 20657956 13.28 47%

7.5 Top 10 most frequently visited riding station

There are total of 712 riding station in the data set for 3,742,262 times of riding over the given 12-month period from May-2020 to April-2020. It would be more practical just to quick viewing the top 10 most frequently visited riding station. So, we can target marketing in these areas. Following statistics are based on the variable start_station_name, the column recording stations that riders started their rides.

mydata_cleaned %>% 
  group_by(member_casual, start_station_name) %>% 
  summarise(count = n()) %>% 
  filter(!is.na(start_station_name)) %>% 
  arrange(member_casual, desc(count)) %>% 
    top_n(10) %>% 
  kbl(align = "c",
      table.attr = "style = 'width: 40%'",
      caption = "Table 7.5.1 Top 10 most frequently visited stations among 712 stations, by casual riders and annual members.") %>% 
  kable_styling(bootstrap_options = c("hover", "striped", "bordered"))
Table 7.5.1 Top 10 most frequently visited stations among 712 stations, by casual riders and annual members.
member_casual start_station_name count
casual Streeter Dr & Grand Ave 28562
casual Lake Shore Dr & Monroe St 23593
casual Millennium Park 21554
casual Theater on the Lake 16346
casual Michigan Ave & Oak St 15326
casual Indiana Ave & Roosevelt Rd 14700
casual Lake Shore Dr & North Blvd 14518
casual Michigan Ave & Lake St 12791
casual Clark St & Elm St 12718
casual Michigan Ave & Washington St 11709
member Clark St & Elm St 21509
member Broadway & Barry Ave 16648
member Wells St & Concord Ln 16354
member Dearborn St & Erie St 16351
member St. Clair St & Erie St 16235
member Theater on the Lake 15897
member Kingsbury St & Kinzie St 15525
member Wells St & Elm St 15286
member Wells St & Huron St 15156
member Clark St & Armitage Ave 14606

8 EXPLORATORY DATA ANALYSIS (EDA)

Since casual rider is the only group of interest, my EDA will mainly focus on studying and understanding this group of riders in order to design a specific recommendation for marketing purposes. However, annual riders would occasionally appear in the graphs when found that it is useful for comparison.

8.1 Which stations have most casual riders?

Since we have Geo-tagged data (with longitude and latitude), we are able to view them on a map to explore how are popular stations distributed.

temp1 <- mydata_cleaned %>% 
  filter(member_casual == "casual") %>%  # 3.7 mil riders to 1.5 mil of casual riders
  group_by(start_station_name) %>% 
  summarise(frequency = n()) 

temp2 <- mydata_cleaned %>% 
  filter(member_casual == "casual") %>% 
  group_by(start_station_name, start_lat, start_lng) %>% 
  summarise(frequency = n()) %>% 
  group_by(start_station_name) %>% 
  filter(row_number() == 1) 

df8.1 <- merge(temp1, temp2, by = "start_station_name") %>% 
  select(-frequency.y) %>% 
  rename("frequency" = "frequency.x")    

# df for the map - create "lab", "pop-up-info" and "category"

df8.1 <- df8.1 %>% 
  mutate(lab = paste0(start_station_name, " (", frequency, ")"),
         pop_up_info = paste0(start_station_name, "<br/>",
                              "Casual rider counts: ", frequency, "<br/>",
                              "Lat: ", round(start_lat, 5), "<br/>",
                              "Lat: ", round(start_lng, 5), "<br/>"
                              )) %>% 
  filter(!is.na(start_station_name)) %>%      #remove NA
  mutate(category = cut(frequency,
                        c(0, 500, 1000, 5000, 10000, 20000, 30000),
                        labels = c("0-500", "500-1000", "1000-5000", "5000-10000", "10000-20000", ">20000")))

# define a color to each group of the "category"

palette <- colorFactor(palette = c("yellow", "green", "orange", "red", "black"), domain = df8.1$category)

# Creating 6 layers for each of the category.

Layer1 <- df8.1 %>% filter(frequency < 500)

Layer2 <- df8.1 %>% filter(frequency > 500, frequency < 1000)

Layer3 <- df8.1 %>% filter(frequency > 1000, frequency < 5000)

Layer4 <- df8.1 %>% filter(frequency > 5000, frequency < 10000)

Layer5 <- df8.1 %>% filter(frequency > 10000, frequency < 20000)

Layer6 <- df8.1 %>% filter(frequency > 20000)



leaflet() %>% 
  addTiles(group = "OpenStreetMap (Default)") %>% 
  addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
  addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>% 
  setView(lng = -87.6298, lat = 41.8781, zoom = 12) %>%  
  addCircleMarkers(group = "Yellow: 0-500",
                   data = Layer1,
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~ "Yellow",
                   radius = 5,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>% 
  addCircleMarkers(group = "Gold: 500 - 1000",
                   data = Layer2, 
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~"Gold",
                   radius = 10,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>% 
  addCircleMarkers(group = "Green: 1000-5000",
                   data = Layer3,
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~"Green",
                   radius = 15,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>% 
  addCircleMarkers(group = "Orange: 5000 - 10000",
                   data = Layer4, 
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~"Orange",
                   radius = 20 ,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>%
  addCircleMarkers(group = "Red: 10000-20000",
                   data = Layer5,
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~"Red",
                   radius = 25 ,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>% 
  addCircleMarkers(group = "Black: >20000",
                   data = Layer6,
                   lat = ~ start_lat,
                   lng = ~ start_lng,
                   label = ~ lab,
                   popup = ~ pop_up_info,
                   color = ~ "Black",
                   radius = 30,
                   stroke = T,
                   weight = 1,
                   opacity = 1) %>% 
  addLabelOnlyMarkers(group = "Casual-rider count",
                      data = df8.1,
                      lat = ~ start_lat,
                      lng = ~ start_lng,
                      label = ~ as.character(frequency),
                      labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, alpha = 0.5)) %>% 
  addLayersControl(
    baseGroups = c("Esri", "OpenStreetMap", "Toner Lite"), 
    overlayGroups = c("Yellow: 0-500", "Gold: 500 - 1000", "Green: 1000-5000", 
                      "Orange: 5000 - 10000", "Red: 10000-20000", "Black: >20000", "Casual-rider count"),
    options = layersControlOptions(collapsed = F)
  )

Have Orange, Red and Black category ticked, we can clearly see that most casual riders are concentrated in the coaster regions, where the central city is. It is likely linked with more choices of entertainment in the area.

Use this interactive table to filter out the top three bike station that hosting most casual riders.

df8.1DT <- df8.1 %>% 
  select(category, start_station_name, frequency) %>% 
  rename("Frequency grouping" = "category",
         "Station name" = "start_station_name",
         "Casual Rider Frequency" = "frequency")

datatable(df8.1DT, filter = "top", rownames = FALSE)

The top 3 ride stations are:

    1. Streeter Dr & Grand Ave
    1. Lake Shore Dr & Monroe St
    1. Millennium Park

Please be aware that these are station named given by Cyclistic, it may be slightly different from the actual adjacent street names.

8.2 Riding frequency of casual riders

8.2.1 Chicago Temperatures

Prior to going deeper analysis, it is good to check on the temperatures of Chicago in different seasons as it would affect the experience of riding. For example, it might be too cold to be outdoor during winter in Chicago, when temperatures go below 0 Celsius.

df <- mydata_cleaned %>% 
  select(month, mean_temp_oC, season) %>% 
  group_by(month, mean_temp_oC, season) %>% 
  summarise(count = n()) %>% 
  select(-count)

ggplot(df, aes(x = month, y = mean_temp_oC, colour = season, group = "mean_temp_oC")) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey") + 
  geom_vline(xintercept = 3, linetype = "dashed", colour = "grey") +
  geom_vline(xintercept = 5, linetype = "dashed", colour = "grey") +
  geom_vline(xintercept = 9, linetype = "dashed", colour = "grey") +
  geom_vline(xintercept = 11, linetype = "dashed", colour = "grey") +
  geom_text(label = "Winter", colour = "blue", show.legend = F, x = 1.5, y = 35) +
  geom_text(label = "Spring", colour = "green4", show.legend = F, x = 4 , y = 35) +
  geom_text(label = "Summer", colour = "red", show.legend = F, x = 7, y = 35) +
  geom_text(label = "Autumn", colour = "darkgreen", show.legend = F, x = 10, y = 35) +
  geom_text(label = "Winter", colour = "blue", show.legend = F, x = 12, y = 35) +
  geom_point(shape = 21, size = 4) +
  geom_path() +
  geom_label(aes(label = mean_temp_oC)) +
  scale_y_continuous(limits = c(-5, 40)) +
  scale_x_continuous(limits = c(1, 12), breaks = seq(1, 12, 1)) +
  theme_classic() +
  scale_colour_manual(breaks = c("winter", "spring", "summer", "autumn"),
                      values = c("blue", "green4", "red", "darkgreen")) +
  theme(legend.position = "none",
        plot.title = element_text(vjust = 2),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y= element_text(margin = margin(0, 10, 0, 0)),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(title = "Average Monthly Temperatures (Celsius) of Chicago",
       caption = "Kar Ng",
       x = "Month",
       y = "Average Temperature (oC)") 

8.2.2 Casual riders per day

Following figure shows that,

  • We had most casual riders in summer, and they disappeared during winter.
  • Casual riders have taken up electric bike and classic bike when they were introduced in July 2020 and Dec 2020. Base on the figure, there is a trend anticipating that these type of bikes might would affect the popularity of docked bike in the fugure.
table6.2 <- mydata_cleaned %>% 
  group_by(date, year, rideable_type, member_casual, mean_temp_oC, season, workingday) %>% 
  summarise(daily_frequency = n())

table6.2$member_casual <- factor(table6.2$member_casual, levels = c("member", "casual"))


table6.2 %>% ggplot(aes(x = date, y = daily_frequency, 
                colour = rideable_type, 
                alpha = member_casual)) +
    annotate("rect", fill = "yellow", alpha = 0.1,
             xmin = as_date("2020-04-01"), xmax = as_date("2020-05-31"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "red", alpha = 0.1,
             xmin = as_date("2020-05-31"), xmax = as_date("2020-09-30"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "darkgreen", alpha = 0.1,
             xmin = as_date("2020-09-30"), xmax = as_date("2020-11-30"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "blue", alpha = 0.1,
             xmin = as_date("2020-11-30"), xmax = as_date("2021-03-31"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "yellow", alpha = 0.1,
             xmin = as_date("2021-03-31"), xmax = as_date("2021-05-31"),
             ymin = -Inf, ymax = Inf) + 
    geom_text(label = "Spring", colour = "green4", show.legend = F,
              x = as_date("2020-05-01"),
              y = 15000) +
    geom_text(label = "Summer", colour = "red", show.legend = F,
              x = as_date("2020-08-01"),
              y = 15000) +
    geom_text(label = "Autumn", colour = "darkgreen", show.legend = F,
              x = as_date("2020-11-01"),
              y = 15000) +
    geom_text(label = "Winter", colour = "blue", show.legend = F,
              x = as_date("2021-02-01"),
              y = 15000) +
    geom_text(label = "Spring", colour = "green4", show.legend = F,
              x = as_date("2021-05-01"),
              y = 15000) +
  geom_point() +
  geom_line(stat = "smooth", size = 1) +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y %b") +
  theme_bw() + 
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 25, vjust = 0.7),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        panel.grid = element_blank()) +
    labs(x = "Date",
         y = "Daily Riding Frequency",
         title = "Casual rider daily riding frequency",
         subtitle = "Transparent points are member rider's frequency and follow the similar major trends",
         caption = "Kar Ng",
         colour = "Ride-able Type",
         alpha = "Membership type")

8.3 Daily total riding hours of casual riders

Interestingly, in the last summer, casual rider has the longer riding duration than member riders. This figure can be another supporting evidence indicating casual riders are valuable riders, and it worth our focus on converting them into annual members.

table6.3 <- mydata_cleaned %>% 
  group_by(date, year, rideable_type, member_casual, mean_temp_oC, season, workingday) %>% 
  summarise(daily_duration_hr = sum(duration.min)/60)

table6.3$member_casual <- factor(table6.3$member_casual, levels = c("member", "casual"))

table6.3 %>% ggplot(aes(x = date, y = daily_duration_hr, 
                colour = rideable_type, 
                alpha = member_casual)) +
    annotate("rect", fill = "yellow", alpha = 0.1,
             xmin = as_date("2020-04-01"), xmax = as_date("2020-05-31"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "red", alpha = 0.1,
             xmin = as_date("2020-05-31"), xmax = as_date("2020-09-30"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "darkgreen", alpha = 0.1,
             xmin = as_date("2020-09-30"), xmax = as_date("2020-11-30"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "blue", alpha = 0.1,
             xmin = as_date("2020-11-30"), xmax = as_date("2021-03-31"),
             ymin = -Inf, ymax = Inf) +
    annotate("rect", fill = "yellow", alpha = 0.1,
             xmin = as_date("2021-03-31"), xmax = as_date("2021-05-31"),
             ymin = -Inf, ymax = Inf) + 
    geom_text(label = "Spring", colour = "green4", show.legend = F,
              x = as_date("2020-05-01"),
              y = 6000) +
    geom_text(label = "Summer", colour = "red", show.legend = F,
              x = as_date("2020-08-01"),
              y = 6000) +
    geom_text(label = "Autumn", colour = "darkgreen", show.legend = F,
              x = as_date("2020-11-01"),
              y = 6000) +
    geom_text(label = "Winter", colour = "blue", show.legend = F,
              x = as_date("2021-02-01"),
              y = 6000) +
    geom_text(label = "Spring", colour = "green4", show.legend = F,
              x = as_date("2021-05-01"),
              y = 6000) +
  geom_point(shape = 8) +
  geom_line(stat = "smooth", size = 1) +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y %b") +
  theme_bw() + 
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 25, vjust = 0.7),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        panel.grid = element_blank()) +
    labs(x = "Date",
         y = "Daily Riding Frequency",
         title = "Daily riding HOURS of Casual rider",
         subtitle = "Transparent points are member rider's frequency and follow the similar major trends",
         caption = "Kar Ng",
         colour = "Ride-able Type",
         alpha = "Membership type")

8.4 What is the best time on a day to advertise casual rider?

Across all four seasons, most casual riders love to ride during off-rush hours (see following graph), and again summer has the highest ride frequency per min during off-rush hours. Any physical marketing activities would be at the highest effectiveness during this period, as we would has the highest probability of meeting casual riders.

If physical marketing activities are to be happened after 6pm, do it in the evening of spring and summer. According to following graph, many riders love evening-ride during spring and summer but not during autumn and winter.

According to Chicago goverment webiste, rush hours of Chicago are categorised into morning at between 7:00am to 9:30am and 4:00pm to 6:00pm.

# Setting the object for this section
table6.4 <- mydata_cleaned %>% 
  mutate(time = paste0(hour, ":" ,minute),
         time2 = as.POSIXct(parse_hm(time))) %>%    #syntheses a column of time (hour:min)
  group_by(date, year, time, time2, rideable_type, member_casual, mean_temp_oC, season, workingday) %>% 
  summarise(frequency = n()) %>% 
  filter(member_casual == "casual")

table6.4$member_casual <- factor(table6.4$member_casual, levels = c("member",
                                                                    "casual"))

# Ploting the graph
ggplot(table6.4, aes(x = time2, y = frequency,
                     colour = season)) +
  annotate("rect", fill = "gold", alpha = 0.5,
           xmin = as.POSIXct(parse_hm("7:00")), xmax = as.POSIXct(parse_hm("9:30")),
           ymin = -Inf, ymax = Inf) +
  annotate("rect", fill = "gold", alpha = 0.5,
           xmin = as.POSIXct(parse_hm("16:00")), xmax = as.POSIXct(parse_hm("18:00")),
           ymin = -Inf, ymax = Inf) +
  geom_point(alpha = 0.2) +
  geom_line(stat = "smooth", colour = "black") +
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "2 hour") +
  facet_wrap(~season) +
  theme_classic() +
  theme(legend.position = "none",
        plot.margin = unit(c(1,1,1,1), "cm"),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        strip.text = element_text(size = 15)) +
  labs(title = "Total rides of Casual Riders per minute per day between may-2020 to april-2021 ",
       subtitle = "Shaded regions are rush periods between 7:00am-9:30am and 4:00pm-6:00pm", 
       caption = "Kar Ng",
       x = "Time",
       y = "Riding frequency per min")

8.5 Casual riders love Saturday

High amount of casual riders in the weekend especially on Saturday (23%) than in the weekdays (mostly 11%).

table6.5 <- mydata_cleaned %>% 
  mutate(weekday = factor(weekday),
         weekday = factor(weekday, 
                          levels = c("Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday", "Saturday", 
                                     "Sunday"))) %>% 
  mutate(member_casual = factor(member_casual, 
                                levels = c("member", "casual")),
         member_casual = fct_recode(member_casual,
                                    "Member riders" = "member",
                                    "Casual riders" = "casual")) %>% 
  group_by(weekday, member_casual) %>% 
  summarise(wkday_frequency = n()) 

table6.5.pct <- table6.5 %>% 
  group_by(member_casual) %>% 
  mutate(proportion = wkday_frequency/sum(wkday_frequency),
         percentage = paste0(round(proportion*100), "%"))

ggplot(table6.5, aes(x = weekday, y = wkday_frequency, alpha = member_casual)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  geom_line(data = table6.5, aes(x = weekday, y = wkday_frequency, 
                                 alpha = member_casual, group = member_casual)) +
  geom_text(data = table6.5, aes(label = wkday_frequency), vjust = -0.25) +
  geom_text(data = table6.5.pct, aes(label = paste0("(", percentage, ")")), vjust = -2, hjust = 0.3, colour = "blue") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 25, vjust = 0.7),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  scale_y_continuous(limits = c(0, 400000) ,
                     labels = function(x) paste0({x/1000}, "k")) +
  labs(title = "Constant casual riders in the weekday and spike on Weekend",
       subtitle = "Highlighting casual riders with member riders faded at the back", 
       caption = "Kar Ng",
       x = "Days of the week",
       y = "Ride frequency")

However, the conversion rate of casual riders riding bike in the working week days into annual membership may be higher.

The number of rides by casual riders in the weekday is relatively constant, these casual riders are probably the same persons riding our bikes for their day-to-day activities in the week day such as work.

They have a constant demand compare to those casual riders in the weekend. They are likely just riding it once in a while and less in repeatedly. If resources are limited, I suggest to consider to invest advertisements on weekdays.

Related statistics tables:

table6.5.pct %>% 
  select(weekday, wkday_frequency, percentage) %>% 
  arrange(member_casual) %>% 
  rename("Membership type" = "member_casual",
         "Day" = "weekday",
         "Ride counts" = "wkday_frequency",
         "Percentage" = "percentage") %>% 
  kbl(align = "c",
      table.attr = "style = 'width: 60%;'") %>% 
  kable_styling(bootstrap_options = c("bordered", "hover", "striped")) %>% 
  row_spec(c(6,7, 13, 14), bold = T) %>% 
  pack_rows("Group 1:", 1, 7, color = "brown") %>% 
  pack_rows("Group 2:", 8, 14, color = "brown")
Membership type Day Ride counts Percentage
Group 1:
Member riders Monday 287763 13%
Member riders Tuesday 308324 14%
Member riders Wednesday 324548 15%
Member riders Thursday 320143 15%
Member riders Friday 336004 15%
Member riders Saturday 342098 16%
Member riders Sunday 279973 13%
Group 2:
Casual riders Monday 164809 11%
Casual riders Tuesday 162381 11%
Casual riders Wednesday 168883 11%
Casual riders Thursday 176583 11%
Casual riders Friday 229550 15%
Casual riders Saturday 359561 23%
Casual riders Sunday 281582 18%

9 CONCLUSION AND RECOMMENDATION

How are casual riders use Cyclistic bikes?

  • Casual riders ride the most in regions near coastal regions, especially closer to where the city central is. The top 3 stations casual riders ride the most were Streeter Dr & Grand Ave, Lake Shore Dr & Monroe St, Millennium Park. Three of these stations accounts for 5% of 710 stations.

  • Allocate more advertisement resources to electric bike and classic bike, as casual riders have taken them up and replace docked bike since January 2021 to the latest available data in April 2021.

  • Best Time for advertisement can be between 12pm - 5:30pm across 4 seasons but summer and spring can go beyond 5:30pm and up to 10pm. However, advertisement in autumn and winter is not recommended, at least to avoid large scale investment, because casual riders lost their interest in riding during cool temperatures in these seasons.

Why would casual riders buy Cyclistic annual memberships?

  • Saturday has the highest casual rides at 23%, it is twice the number of riders in the weekday that has 11% across Monday to Thursday and 15% on Friday. However, I will argue advertisement that only targeting Saturday riders. Constant riding rates on weekdays may imply a constant demand of our bikes by casual riders. These casual riders in the week days are more likely to be converted into annual membership.

How can Cyclistic use digital media to influence casual riders to become members?

  • Focus advertisement during spring and summer. Winter has minimum riders as it might be too cold to ride outside for casual purposes.

  • Digital media best performed during spring and summer, advertise on popular medias such as Facebook, Twitter, Google, or Youtube during breakfast, lunch and dinner time ONLY in the weekdays, but morning and evening during weekend as current casual riders may be riding our bikes in the day time of weekend. Cyclistic can promote stations naar coastal regions if resources is limited (or best to promote all stations near coastal regions). Advertisements should focus on electric bike and classic bike.

10 LEGALITY

This is a personal project created and designed for non-commercial use only. The data set of this personal project was prepared by Google data analytic certificate course and was made available for students as a capstone project to publish online as online data analytic project to show experiences.

However, this project is a personal project that extending the scope of the course, it is not submitted and graded by Google. Google and I, nor any stakeholders relevant to the creation of this project held any responsibility for any outcomes suffered from using the results of this project. This project is only created for educational and skill demonstration purposes.

This personal project is slightly altered and is slightly different from the original content. It is for the purpose of showing some of my specific analytuc skills outside the range of the scope of the course. The content of this project should not be perceived as the original project content of the Google data analytic course.

Thank you for reading this project.

11 REFERENCES

City Holidays n.d., viewed 24 May 2021, https://www.chicago.gov/city/en/narr/misc/city-holidays.html

Wikipedia Chicago n.d., Chicago, viewed 25 May 2021, https://en.wikipedia.org/wiki/Chicago

Claffey M, Hofer S, 2020, City Officials Will Enforce AM and PM Rush Hour Restrictions on New Chicago and Western Avenue Bus Only Lanes, viewed 28 May 2021, https://www.chicago.gov/city/en/depts/cdot/provdrs/traffic_signals_andstreetlights/news/2020/january/city-officials-will-enforce-am-and-pm-rush-hour-restrictions-on-.html