Reading time: 28 minutes
This is one of the two capstone projects of the 6-month extensive Google Data Analytics Professional Certificate program. This project was about analysing big data for a fictional bike-sharing company in Chicago on how best to convert casual riders into annual members.
1 multi-layers interactive map, 4 multi-layers ggplots (scatter, line, and bar plots), 1 interactive table, and 7 basic statistics tables were coded and achieved insightful results. 12 data sets that hosting the total of 3.7 millions of rows and 13 variables were combined, cleaned, manipulated and analysed using R programming language. Data integrity of this project was assessed. Data processing steps were documented.
Results revealed that casual riders contributed 51% of the total riding duration in the given 12 months data to analyse. Casual riders rided the most on Saturday (23%), doubling the size in week days (11%), however, 4 days in a week has 11% indicating a steady demand of rides and targeting casual riders on these days may actually achieve higher conversion rate. Casual riders love spring and summer but decreased dramatically during months in Autumn and achieve near-zero ride per day in winter. Winter has temperatures below zero degrees Celsius. Casual riders concentrate in regions near coasts and city center. They have adoped 2 newly introduced ride-able bikes and has replaced the usual docked bike in 2021 as their primary and secondary bikes.
R Packages loaded to perform functions required in this project: tidyverse series (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, kableExtra, lubridate, leaflet, scales, hms, ggrepel and DT.
library(tidyverse)
library(skimr)
library(kableExtra)
library(lubridate)
library(leaflet)
library(scales)
library(hms)
library(ggrepel)
library(DT)
Cyclistic, is a fictional bike-share company designed by this Google data analytic certification course. In the scenario, the company features more than 5824 bicycles and 692 docking stations. The company also offers many types of bicycle to account for different group of riders, including standard two-wheeled bikes, reclining bikes, hand tricycles, and cargo bikes. Cyclistic also offers two types of memberships, a casual and an annual type. Casual type is designed for single-ride and full-day passes.
Cyclistic
Cyclistic recently found that converting more casual riders into annual members is more profitable. Therefore, they will need a tailored marketing strategy for casual riders, supported data analysis. My task is to analyse the available data to answer following business questions with compelling data insights and professional data visualisation.
I have been provided with Cyclistic’s previous 12 months trip data, starting from May 2020 to April 2021. These data were stored in 12 spreadsheets in a monthly basis. Thus, they are additive to each other by date. All of them have the same number of columns and column names.
The course introduced that a data set is considered small if it has less than 100k rows and spreadsheet could be used for analysis and basic analysis. However, after inspecting the first of the twelve spreadsheets, it has already 200k rows of data. I am expecting after combining all the 12 spreadsheets, I would have a data frame exceeding 1 million rows of data and beyond the scale a spreadsheet can handle. Therefore, I am using R for this data analysis project.
# Import individual csv spreadsheet
may20 <- read_csv("202005-divvy-tripdata.csv") # 200,274 rows | 13 columns
jun20 <- read_csv("202006-divvy-tripdata.csv") # 343,005 rows | 13 columns
jul20 <- read_csv("202007-divvy-tripdata.csv") # 551,480 rows | 13 columns
aug20 <- read_csv("202008-divvy-tripdata.csv") # 622,361 rows | 13 columns
sep20 <- read_csv("202009-divvy-tripdata.csv") # 532,958 rows | 13 columns
oct20 <- read_csv("202010-divvy-tripdata.csv") # 388,653 rows | 13 columns
nov20 <- read_csv("202011-divvy-tripdata.csv") # 259,716 rows | 13 columns
dec20 <- read_csv("202012-divvy-tripdata.csv") # 131,573 rows | 13 columns
jan21 <- read_csv("202101-divvy-tripdata.csv") # 96,834 rows | 13 columns
feb21 <- read_csv("202102-divvy-tripdata.csv") # 49,622 rows | 13 columns
mar21 <- read_csv("202103-divvy-tripdata.csv") # 228,496 rows | 13 columns
apr21 <- read_csv("202104-divvy-tripdata.csv") # 337,230 rows | 13 columns
# combine all csv files into one table
mydata <- rbind(may20, jun20, jul20, aug20, sep20, oct20, nov20, dec20, jan21, feb21, mar21, apr21)
As expected, after importing all the spreadsheets and combining them into one, the overall data set has nearly 3.7 millions of rows. A specialized data analysis software such as R is needed to process the data effectively.
This table describes the basic details of the data set.
Variables <- names(mydata)
Description <- c("ID of individual ride",
"There are 3 types of rideable bike which are classic bike, docked bike, and eletric bike",
"Date and time that the individual riding starting at",
"Date and time that the individual riding ended at",
"Associated starting station of this riding",
"Station id of the starting station",
"Associated ending station of this riding",
"Station id of this ending station",
"Ride starting latitude",
"Ride starting longitude",
"Ride ending latitude",
"Ride ending latitude",
"Membership status of the rider, classified into annual membership or casual membership")
des <- data.frame(Variables, Description)
des %>%
kbl(align = "l",
caption = "Table 4.3 Data set description") %>%
kable_styling(bootstrap_options = c("bordered",
"striped",
"hover",
"condensed"))
| Variables | Description |
|---|---|
| ride_id | ID of individual ride |
| rideable_type | There are 3 types of rideable bike which are classic bike, docked bike, and eletric bike |
| started_at | Date and time that the individual riding starting at |
| ended_at | Date and time that the individual riding ended at |
| start_station_name | Associated starting station of this riding |
| start_station_id | Station id of the starting station |
| end_station_name | Associated ending station of this riding |
| end_station_id | Station id of this ending station |
| start_lat | Ride starting latitude |
| start_lng | Ride starting longitude |
| end_lat | Ride ending latitude |
| end_lng | Ride ending latitude |
| member_casual | Membership status of the rider, classified into annual membership or casual membership |
Following tables show that this combined data set has 3,742,202 rows of observations and 13 variables. 7 out of the 13 variables are in character type, 4 columns are numeric, and the remaining 2 are date-time (or known as POSIXct format in R).
The complete_rate tells us the completeness of each variable, or how many they have missing values in their column, which is associated with the n_missing before it. We can clearly see that there are 6 variables have missing data:
| Name | Piped data |
| Number of rows | 3742202 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 3741993 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 148231 | 0.96 | 10 | 53 | 0 | 711 | 0 |
| start_station_id | 148857 | 0.96 | 1 | 35 | 0 | 1265 | 0 |
| end_station_name | 171317 | 0.95 | 10 | 53 | 0 | 712 | 0 |
| end_station_id | 171778 | 0.95 | 1 | 35 | 0 | 1266 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.08 |
| start_lng | 0 | 1 | -87.64 | 0.03 | -87.87 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 4906 | 1 | 41.90 | 0.04 | 41.54 | 41.88 | 41.90 | 41.93 | 42.16 |
| end_lng | 4906 | 1 | -87.65 | 0.03 | -88.07 | -87.66 | -87.64 | -87.63 | -87.44 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-05-01 00:02:07 | 2021-04-30 23:59:53 | 2020-09-09 15:51:13 | 3257750 |
| ended_at | 0 | 1 | 2020-05-01 00:13:03 | 2021-05-05 22:14:39 | 2020-09-09 16:08:31 | 3245372 |
I have also identified some character variables that can be used for data grouping, which is helpful in later analysis. To use them in this purpose, I will convert them into factor type in data manipulation section. These variables are:
Another way to view how the type of our 13 variables are being assigned by R:
glimpse(mydata)
## Rows: 3,742,202
## Columns: 13
## $ ride_id <chr> "02668AD35674B983", "7A50CCAF1EDDB28F", "2FFCDFDB91~
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke~
## $ started_at <dttm> 2020-05-27 10:03:52, 2020-05-25 10:47:11, 2020-05-~
## $ ended_at <dttm> 2020-05-27 10:16:49, 2020-05-25 11:05:40, 2020-05-~
## $ start_station_name <chr> "Franklin St & Jackson Blvd", "Clark St & Wrightwoo~
## $ start_station_id <chr> "36", "340", "260", "251", "261", "206", "261", "18~
## $ end_station_name <chr> "Wabash Ave & Grand Ave", "Clark St & Leland Ave", ~
## $ end_station_id <chr> "199", "326", "260", "157", "206", "22", "261", "18~
## $ start_lat <dbl> 41.8777, 41.9295, 41.9296, 41.9680, 41.8715, 41.847~
## $ start_lng <dbl> -87.6353, -87.6431, -87.7079, -87.6500, -87.6699, -~
## $ end_lat <dbl> 41.8915, 41.9671, 41.9296, 41.9367, 41.8472, 41.869~
## $ end_lng <dbl> -87.6268, -87.6674, -87.7079, -87.6368, -87.6468, -~
## $ member_casual <chr> "member", "casual", "casual", "casual", "member", "~
A helpful table describing some common R data type we would encounter in this project.
Type <- c("<chr>", "<dbl>", "<fctr>", "<date>", "<dttm>")
Explanation <- c("Character, for text data.",
"Double, for number with decimal places.",
"factor, also text data but are usually used to grouping.",
"date, as its name would suggest.",
"date-time, combination of date and time.")
r_datatype <- data.frame(Type, Explanation)
r_datatype %>%
kbl(table.attr = "style = 'width: 60%;'",
caption = "Table 4.4 Common R data types in this project.") %>%
kable_styling(bootstrap_options = c("bordered",
"striped",
"hover",
"condensed"))
| Type | Explanation |
|---|---|
| <chr> | Character, for text data. |
| <dbl> | Double, for number with decimal places. |
| <fctr> | factor, also text data but are usually used to grouping. |
| <date> | date, as its name would suggest. |
| <dttm> | date-time, combination of date and time. |
Negatives
I was introduced that Cyclistic has standard two-wheeled bikes, reclining bikes, hand tricycles, and cargo bikes. However, only 1 column named “rideable_type” was provided in the data sets and showing 3 different type of bikes that were not mentioned in the introduction - classic bike, docked-bike and electric bike. No further details were provided to help clarifying what these new type of bikes are and the relationship between them and previously mentioned 4 type of bikes.
I was provided with a column classifying each ride into casual and annual membership, which is good. However, it would be better if the two types of the casual membership, single-ride and full-day passes, introduced in the introduction section are also being included in the data set, for a better complete analysis.
Some other useful variables were also not included in the data set such as weather of the day, seasons, public holidays, temperatures, wind speed, and if available, rider ages and gender can also be helpful in designing marketing strategy.
Positives
Fairness - The data was system-generated, base on its size of being millions rows with individual ride IDs. The data should be quite fair and free from collection biases and erroneous.
Inclusivity: Apart from lacking some variables, the data was quite inclusive. The Data was in the range of 24 hours and was provided to me as a whole. It is the entire population data of the requested 12-month period, instead of sub-set data sampled from the mother data set. This eliminate the issue of sampling bias.
All data was collected by the original company, Cyclistic, instead of from second- or third party sources. Cyclistic has to proper system to collect the big data. All provided data were also up-to-date and current. Comprehensiveness can be argued for lacking some useful variables, but the data set still have many other important variables to synthese desired outcome. In real world case, stakeholders will be approached about this issue and be informed.
In this section, following variables are being converted from character to factor. They are string variables with distinctive levels that are useful for data grouping and analysis. Please note that numerical variables can also have distinctive levels and be converted to factor, but it is not in this case.
chr_to_fct_variables <- c("rideable_type",
"start_station_name",
"end_station_name",
"member_casual")
mydata_cleaned <- mydata
mydata_cleaned[,chr_to_fct_variables] <- lapply(mydata_cleaned[,chr_to_fct_variables], factor)
After conversion, we are able to use R to identify their distinctive levels and how many are there.
#---- to find level information of these variables----#
# rideable_type
nlevels(mydata_cleaned$rideable_type)
levels(mydata_cleaned$rideable_type) # Explore it as the number of levels is small and suitable for grouping
# start_station_name
nlevels(mydata_cleaned$start_station_name)
# end_station_name
nlevels(mydata_cleaned$end_station_name)
# member_casual
nlevels(mydata_cleaned$member_casual) # Explore it as the number of levels is small and suitable for grouping
levels(mydata_cleaned$member_casual)
head(levels(mydata_cleaned$start_station_name)) # to find the first 6 station names
head(levels(mydata_cleaned$end_station_name)) # to find the first 6 station names
#---- Creating the table ----#
Variable <- c("rideable_type", "start_station_name", "end_station_name", "member_casual")
Level_count <- c("3", "711", "712", "2")
Levels <- c("classic_bike, docked_bike, electric_bike",
"2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd...",
"2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd...",
"Casual, member")
data.frame(Variable, Level_count, Levels) %>%
kbl(caption = "Table 6.1.1 Distinctive levels of grouping variables") %>%
kable_classic()
| Variable | Level_count | Levels |
|---|---|---|
| rideable_type | 3 | classic_bike, docked_bike, electric_bike |
| start_station_name | 711 | 2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd… |
| end_station_name | 712 | 2112 W Peterson Ave, 63rd St Beach, 900 W Harrison St, Aberdeen St & Jackson Blvd… |
| member_casual | 2 | Casual, member |
We shall only start using date or date-time for our analysis when they are in the correct format, as being recognized and labeled as “date” or “date-time”. Date or date-time are commonly labelled as a “character” variable if their arrangement of year, month, date, hour, minute, and second do not match with the date format of R in the form of year-month-date_hour-minute-second.
However, the date in our dataset is perfectly to go. They match the date format of R and have been recognised as date-time (
Our date-time variables: started_at and ended_at:
glimpse(mydata_cleaned)
## Rows: 3,742,202
## Columns: 13
## $ ride_id <chr> "02668AD35674B983", "7A50CCAF1EDDB28F", "2FFCDFDB91~
## $ rideable_type <fct> docked_bike, docked_bike, docked_bike, docked_bike,~
## $ started_at <dttm> 2020-05-27 10:03:52, 2020-05-25 10:47:11, 2020-05-~
## $ ended_at <dttm> 2020-05-27 10:16:49, 2020-05-25 11:05:40, 2020-05-~
## $ start_station_name <fct> Franklin St & Jackson Blvd, Clark St & Wrightwood A~
## $ start_station_id <chr> "36", "340", "260", "251", "261", "206", "261", "18~
## $ end_station_name <fct> Wabash Ave & Grand Ave, Clark St & Leland Ave, Kedz~
## $ end_station_id <chr> "199", "326", "260", "157", "206", "22", "261", "18~
## $ start_lat <dbl> 41.8777, 41.9295, 41.9296, 41.9680, 41.8715, 41.847~
## $ start_lng <dbl> -87.6353, -87.6431, -87.7079, -87.6500, -87.6699, -~
## $ end_lat <dbl> 41.8915, 41.9671, 41.9296, 41.9367, 41.8472, 41.869~
## $ end_lng <dbl> -87.6268, -87.6674, -87.7079, -87.6368, -87.6468, -~
## $ member_casual <fct> member, casual, casual, casual, member, member, mem~
There are not much values missing in this data set but only a small proportion (compared to overall data set) in these variables:
skim_without_charts(mydata_cleaned)
| Name | mydata_cleaned |
| Number of rows | 3742202 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 4 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 3741993 | 0 |
| start_station_id | 148857 | 0.96 | 1 | 35 | 0 | 1265 | 0 |
| end_station_id | 171778 | 0.95 | 1 | 35 | 0 | 1266 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| rideable_type | 0 | 1.00 | FALSE | 3 | doc: 2498407, ele: 709303, cla: 534492 |
| start_station_name | 148231 | 0.96 | FALSE | 711 | Str: 38524, Cla: 34227, Lak: 33560, The: 32243 |
| end_station_name | 171317 | 0.95 | FALSE | 712 | Str: 40552, Cla: 34022, The: 33725, Lak: 32733 |
| member_casual | 0 | 1.00 | FALSE | 2 | mem: 2198853, cas: 1543349 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.08 |
| start_lng | 0 | 1 | -87.64 | 0.03 | -87.87 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 4906 | 1 | 41.90 | 0.04 | 41.54 | 41.88 | 41.90 | 41.93 | 42.16 |
| end_lng | 4906 | 1 | -87.65 | 0.03 | -88.07 | -87.66 | -87.64 | -87.63 | -87.44 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-05-01 00:02:07 | 2021-04-30 23:59:53 | 2020-09-09 15:51:13 | 3257750 |
| ended_at | 0 | 1 | 2020-05-01 00:13:03 | 2021-05-05 22:14:39 | 2020-09-09 16:08:31 | 3245372 |
Missing values can usually be filled up with valid values that is either retrieved by adjacent related variables, or looking for median or mean, or getting inputs from authorised stakeholders. However, the nature of these missing values in these locational variables could not be retried by these methods.
Since these missing values only contribute less than 5% of the data set, I will drop them in this project.
This section synthesise new variables base on current data set and internet sources.
I calculate the traveling times of each trip and stored it in a column named duration.min. The “min” stands for “minute”, and the column is relocated after the column ended_at.
mydata_cleaned <- mydata_cleaned %>%
mutate(duration.min = ended_at - started_at) %>%
relocate(duration.min, .after = ended_at) %>%
mutate(duration.min = minute(seconds_to_period(duration.min)))
This section will synthesise these new variables to help achieve better insights.
New_variables <- c("year",
"month",
"day",
"hour",
"minute",
"date",
"weekday",
"season",
"workingday")
Levels <- c("Based on the date-time column - started_at.",
"Based on the date-time column - started_at.",
"Based on the date-time column - started_at.",
"Based on the date-time column - started_at.",
"Based on the date-time column - started_at.",
"A combination of the year-month-day.",
"Monday, tuesday, wednesday, thursday, friday, saturday and sunday.",
"Spring, summer, fall, winter. ",
"Yes: include Monday, Tuesday, Wednesday, Thursday and Friday. \n No: include Saturday, Sunday, and public holidays.")
Reasons <- c("To aid data grouping during analysis.",
"To aid data grouping during analysis.",
"To aid data grouping during analysis.",
"To aid data grouping during analysis.",
"To aid data grouping during analysis.",
"Prepare in advance for whenever needed.",
"To account for differences between individual weekday.",
"As temperature may affect ridering experience, particularly as a casual rider.",
"Interesting to see how are riders count differs between working day and non-working day.")
df_newvar <- data.frame(New_variables, Levels, Reasons)
df_newvar %>% kbl(caption = "Table 6.3.2.1 New variables adding to the existing table.") %>%
kable_styling(bootstrap_options = c("border", "hover", "condensed"))
| New_variables | Levels | Reasons |
|---|---|---|
| year | Based on the date-time column - started_at. | To aid data grouping during analysis. |
| month | Based on the date-time column - started_at. | To aid data grouping during analysis. |
| day | Based on the date-time column - started_at. | To aid data grouping during analysis. |
| hour | Based on the date-time column - started_at. | To aid data grouping during analysis. |
| minute | Based on the date-time column - started_at. | To aid data grouping during analysis. |
| date | A combination of the year-month-day. | Prepare in advance for whenever needed. |
| weekday | Monday, tuesday, wednesday, thursday, friday, saturday and sunday. | To account for differences between individual weekday. |
| season | Spring, summer, fall, winter. | As temperature may affect ridering experience, particularly as a casual rider. |
| workingday | Yes: include Monday, Tuesday, Wednesday, Thursday and Friday. No: include Saturday, Sunday, and public holidays. | Interesting to see how are riders count differs between working day and non-working day. |
Creating year, month, day, hour, minute and weekday
Important note: variables - “year”, “month”, and “day” are created using data from the “started_at” column. Click the right side code to view my codes.
mydata_cleaned <- mydata_cleaned %>%
mutate(year = year(started_at),
month = month(started_at),
day = day(started_at),
hour = hour(started_at),
minute = minute(started_at),
date = make_date(year, month, day),
weekday = weekdays(started_at))
Creating monthly average temperature (Celsius) and seasons
Following information are extracted from Chicago climatic table on Wikipedia. Note that Chicago has shorter spring and autumn but longer winter and summer.
month <- 1:12
mean_temp_oC <- c(-3.2, -1.2, 4.4, 10.5, 16.6, 22.2, 24.8, 23.9,
19.9, 12.9, 5.8, -0.3)
season <- c(rep("winter", 3),
rep("spring", 2),
rep("summer", 4),
rep("autumn", 2),
"winter"
)
data.frame(month, mean_temp_oC, season) %>%
kbl(alig = "l",
table.attr = "style ='width: 60%;'",
caption = "Table 6.3.2.2 Monthly average temperature (Celsius) with seasons") %>%
kable_styling(bootstrap_options = c("bordered",
"hover",
"striped"))
| month | mean_temp_oC | season |
|---|---|---|
| 1 | -3.2 | winter |
| 2 | -1.2 | winter |
| 3 | 4.4 | winter |
| 4 | 10.5 | spring |
| 5 | 16.6 | spring |
| 6 | 22.2 | summer |
| 7 | 24.8 | summer |
| 8 | 23.9 | summer |
| 9 | 19.9 | summer |
| 10 | 12.9 | autumn |
| 11 | 5.8 | autumn |
| 12 | -0.3 | winter |
Adding this table to our cleaned data set by right-hand Code .
season_table <- data.frame(month, mean_temp_oC, season)
mydata_cleaned <- merge(mydata_cleaned, season_table, by = "month") %>%
relocate(month, .after = year)
Creating workingday
Working day, or “workingday” in the data set has 2 values, either “yes” or “no”. The “no” meaning it is not a working day which will include Saturday, Sunday and public holidays. Following public holiday days data are extracted from Chicago Government Website.
month <- c(1, 1, 2, 2, 3, 5, 7, 9, 10, 11, 11, 12, 12)
day <- c(1, 18, 12, 15, 1, 31, 5, 6, 11, 11, 25, 24, 31)
festival <- c("New Year's Day",
"Dr. Martin Luther King, Jr.'s Birthday",
"Lincoln's Birthday",
"President's Day",
"Pulaski Day",
"Memorial Day",
"Independence Day",
" Labor Day",
"Columbus Day",
"Veteran's Day",
"Thanksgiving Day",
"Christmas Day",
"New Year's Day"
)
data.frame(month, day, festival) %>%
kbl(align = "l",
table.attr = "style = 'width: 60%;'",
caption = "Table 6.3.2.4 Chicago public holidays") %>%
kable_styling(bootstrap_options = c("bordered",
"striped",
"hover"))
| month | day | festival |
|---|---|---|
| 1 | 1 | New Year’s Day |
| 1 | 18 | Dr. Martin Luther King, Jr.’s Birthday |
| 2 | 12 | Lincoln’s Birthday |
| 2 | 15 | President’s Day |
| 3 | 1 | Pulaski Day |
| 5 | 31 | Memorial Day |
| 7 | 5 | Independence Day |
| 9 | 6 | Labor Day |
| 10 | 11 | Columbus Day |
| 11 | 11 | Veteran’s Day |
| 11 | 25 | Thanksgiving Day |
| 12 | 24 | Christmas Day |
| 12 | 31 | New Year’s Day |
Adding this table to our cleaned data set by right-hand Code .
mydata_cleaned <- mydata_cleaned %>%
mutate(workingday = case_when(
weekday == "Sunday" ~ "No",
weekday == "Saturday" ~ "No",
month == 1 & day == 1 ~ "No",
month == 1 & day == 18 ~ "No",
month == 2 & day == 12 ~ "No",
month == 2 & day == 15 ~ "No",
month == 3 & day == 1 ~ "No",
month == 5 & day == 31 ~ "No",
month == 7 & day == 5 ~ "No",
month == 9 & day == 6 ~ "No",
month == 10 & day == 11 ~ "No",
month == 11 & day == 11 ~ "No",
month == 11 & day == 25 ~ "No",
month == 12 & day == 24 ~ "No",
month == 12 & day == 31 ~ "No",
TRUE ~ "Yes"
))
The name of original data set in R is called “mydata”. The data set is now called “mydata_cleaned” and stored in a different R object.
Summary of cleaning and processing I have done to the original dataset:
head(mydata_cleaned[,c(5, 15:24)]) %>% # showing only top 6 observations of these new variables
kbl(align = "c",
caption = "The first 6 rows of the data set") %>%
kable_styling(bootstrap_options = c("bordered",
"hover",
"striped"))
| duration.min | year | month | day | hour | minute | date | weekday | mean_temp_oC | season | workingday |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2021 | 1 | 23 | 16 | 14 | 2021-01-23 | Saturday | -3.2 | winter | No |
| 4 | 2021 | 1 | 27 | 18 | 43 | 2021-01-27 | Wednesday | -3.2 | winter | Yes |
| 1 | 2021 | 1 | 21 | 22 | 35 | 2021-01-21 | Thursday | -3.2 | winter | Yes |
| 11 | 2021 | 1 | 7 | 13 | 31 | 2021-01-07 | Thursday | -3.2 | winter | Yes |
| 0 | 2021 | 1 | 23 | 2 | 24 | 2021-01-23 | Saturday | -3.2 | winter | No |
| 53 | 2021 | 1 | 9 | 14 | 24 | 2021-01-09 | Saturday | -3.2 | winter | No |
Summarizing some introductory statistics of our data set to quick familiarising ourselves with some of the patterns in the data.
We have annual members 15% higher than casual riders, the proportion of casual riders is larger than I thought (Table 7.1.1). More importantly, casual riders that has a 15% lower in the population even ride slightly longer than annual members at a 2% different in the total duration (minutes) in the 12 month period (Table 7.1.2).
On average riding minute per person, casual rider ride 7 minutes more than the annual members, it is a 20% higher riding duration than the annual members (Table 7.1.3).
mydata_cleaned %>%
group_by(member_casual) %>%
summarise(Riding_frequency = n()) %>%
mutate(Percentage = paste0(round(Riding_frequency/sum(Riding_frequency)*100), '%')) %>%
relocate(Percentage, .after = Riding_frequency) %>%
kbl(align = "l",
table.attr = "style = 'width: 40%;'",
caption = "Table 7.1.1 Casual vs member Frequency") %>%
kable_styling(bootstrap_options = c("bordered",
"hover")) %>%
row_spec(2, color = "red")
| member_casual | Riding_frequency | Percentage |
|---|---|---|
| casual | 1543349 | 41% |
| member | 2198853 | 59% |
mydata_cleaned %>%
group_by(member_casual) %>%
summarise(total_riding_duration_min = sum(duration.min)) %>%
mutate(Percentage = total_riding_duration_min / sum(total_riding_duration_min),
Percentage = paste0(round(Percentage*100), "%") ) %>%
kbl(align = "l",
table.attr = "style = 'width: 40%;'",
caption = "Table 7.1.2 Casual vs member total riding duration in minutes") %>%
kable_styling(bootstrap_options = c("bordered",
"hover")) %>%
row_spec(1, color = "red")
| member_casual | total_riding_duration_min | Percentage |
|---|---|---|
| casual | 31976253 | 51% |
| member | 30323629 | 49% |
mydata_cleaned %>%
group_by(member_casual) %>%
summarise(average_riding_duration_min = mean(duration.min)) %>%
mutate(Percentage = average_riding_duration_min / sum(average_riding_duration_min),
Percentage = paste0(round(Percentage*100), "%") ) %>%
kbl(align = "l",
table.attr = "style = 'width: 40%;'",
caption = "Table 7.1.3 Casual vs member average riding duration in minutes") %>%
kable_styling(bootstrap_options = c("bordered",
"hover")) %>%
row_spec(1, color = "red")
| member_casual | average_riding_duration_min | Percentage |
|---|---|---|
| casual | 20.71874 | 60% |
| member | 13.79066 | 40% |
These statistics suggest that casual riders are very important group of riders and they have a great demand of our bikes.
This section incorporate years, months, seasons, and temperatures with the member types. A visualisation in EDA section will better explain this table.
However, I have highlighted the top 5 rows that has the highest numbers of riders in following table (Table 7.2). The table suggest that riders love riding in the summer, which might be the best season for our marketing activities.
mydata_cleaned %>%
group_by(year, month, season, mean_temp_oC, member_casual) %>%
summarise(riding_frequency = n(),
total_riding_duration_min = sum(duration.min),
average_riding_duration_min = round(mean(duration.min),2)) %>%
kbl(align = "c",
table.attr = "style = 'width: 100%;'",
caption = "Table 7.2 Year + Month + Season + Average temp (oC) + Membership type") %>%
kable_styling(bootstrap_options = c("bordered",
"hover",
"striped")) %>%
row_spec(5:8, color = "red", background = "yellow") %>%
row_spec(10, color = "red", background = "yellow")
| year | month | season | mean_temp_oC | member_casual | riding_frequency | total_riding_duration_min | average_riding_duration_min |
|---|---|---|---|---|---|---|---|
| 2020 | 5 | spring | 16.6 | casual | 86909 | 2053309 | 23.63 |
| 2020 | 5 | spring | 16.6 | member | 113365 | 1940318 | 17.12 |
| 2020 | 6 | summer | 22.2 | casual | 154718 | 3524288 | 22.78 |
| 2020 | 6 | summer | 22.2 | member | 188287 | 3068354 | 16.30 |
| 2020 | 7 | summer | 24.8 | casual | 269296 | 6138007 | 22.79 |
| 2020 | 7 | summer | 24.8 | member | 282184 | 4372428 | 15.49 |
| 2020 | 8 | summer | 23.9 | casual | 289661 | 6190766 | 21.37 |
| 2020 | 8 | summer | 23.9 | member | 332700 | 4856986 | 14.60 |
| 2020 | 9 | summer | 19.9 | casual | 230692 | 4599016 | 19.94 |
| 2020 | 9 | summer | 19.9 | member | 302266 | 4095711 | 13.55 |
| 2020 | 10 | autumn | 12.9 | casual | 145012 | 2616028 | 18.04 |
| 2020 | 10 | autumn | 12.9 | member | 243641 | 2986650 | 12.26 |
| 2020 | 11 | autumn | 5.8 | casual | 88099 | 1614182 | 18.32 |
| 2020 | 11 | autumn | 5.8 | member | 171617 | 2095926 | 12.21 |
| 2020 | 12 | winter | -0.3 | casual | 30080 | 483810 | 16.08 |
| 2020 | 12 | winter | -0.3 | member | 101493 | 1144087 | 11.27 |
| 2021 | 1 | winter | -3.2 | casual | 18117 | 275598 | 15.21 |
| 2021 | 1 | winter | -3.2 | member | 78717 | 875103 | 11.12 |
| 2021 | 2 | winter | -1.2 | casual | 10131 | 184282 | 18.19 |
| 2021 | 2 | winter | -1.2 | member | 39491 | 502914 | 12.73 |
| 2021 | 3 | winter | 4.4 | casual | 84033 | 1654605 | 19.69 |
| 2021 | 3 | winter | 4.4 | member | 144463 | 1805391 | 12.50 |
| 2021 | 4 | spring | 10.5 | casual | 136601 | 2642362 | 19.34 |
| 2021 | 4 | spring | 10.5 | member | 200629 | 2579761 | 12.86 |
I wanted to compare which of the “classic_bike” “docked_bike” and “electric_bike” are the most popular bike by simply looking at numbers of riders for each bike. However, this approach ceased because 3 of these different bikes are initiated by Cyclistic in different dates (see blue highlighted in following table 7.3).
It will be a unfair comparison if simply looking at proportion of riders in each bike category. I will visualise it in the EDA section, and it will tell a very good story!
mydata_cleaned %>%
group_by(year, month, rideable_type) %>%
summarise(riding_frequency = n(),
total_riding_duration_min = sum(duration.min),
average_riding_duration_min = round(mean(duration.min),2)) %>%
kbl(align = "l",
table.attr = "style = 'width: 70%;'",
caption = "Table 7.3 Blue strips are when a rideable type was first introduced") %>%
kable_styling(bootstrap_options = c("bordered",
"hover",
"striped")) %>%
row_spec(c(1, 4, 13), color = "white", background = "blue")
| year | month | rideable_type | riding_frequency | total_riding_duration_min | average_riding_duration_min |
|---|---|---|---|---|---|
| 2020 | 5 | docked_bike | 200274 | 3993627 | 19.94 |
| 2020 | 6 | docked_bike | 343005 | 6592642 | 19.22 |
| 2020 | 7 | docked_bike | 549545 | 10479497 | 19.07 |
| 2020 | 7 | electric_bike | 1935 | 30938 | 15.99 |
| 2020 | 8 | docked_bike | 556166 | 9970464 | 17.93 |
| 2020 | 8 | electric_bike | 66195 | 1077288 | 16.27 |
| 2020 | 9 | docked_bike | 404606 | 6771325 | 16.74 |
| 2020 | 9 | electric_bike | 128352 | 1923402 | 14.99 |
| 2020 | 10 | docked_bike | 236477 | 3593983 | 15.20 |
| 2020 | 10 | electric_bike | 152176 | 2008695 | 13.20 |
| 2020 | 11 | docked_bike | 151582 | 2295337 | 15.14 |
| 2020 | 11 | electric_bike | 108134 | 1414771 | 13.08 |
| 2020 | 12 | classic_bike | 70616 | 886180 | 12.55 |
| 2020 | 12 | docked_bike | 13004 | 185868 | 14.29 |
| 2020 | 12 | electric_bike | 47953 | 555849 | 11.59 |
| 2021 | 1 | classic_bike | 61700 | 740644 | 12.00 |
| 2021 | 1 | docked_bike | 2106 | 45543 | 21.63 |
| 2021 | 1 | electric_bike | 33028 | 364514 | 11.04 |
| 2021 | 2 | classic_bike | 35012 | 493072 | 14.08 |
| 2021 | 2 | docked_bike | 1271 | 30458 | 23.96 |
| 2021 | 2 | electric_bike | 13339 | 163666 | 12.27 |
| 2021 | 3 | classic_bike | 152545 | 2248432 | 14.74 |
| 2021 | 3 | docked_bike | 15657 | 395910 | 25.29 |
| 2021 | 3 | electric_bike | 60294 | 815654 | 13.53 |
| 2021 | 4 | classic_bike | 214619 | 3235106 | 15.07 |
| 2021 | 4 | docked_bike | 24714 | 624303 | 25.26 |
| 2021 | 4 | electric_bike | 97897 | 1362714 | 13.92 |
53% of casual riders love riding in non-working day that include public holiday, Saturday and Sunday, and 47% of casual riders love riding during working day.
Base on the data, it clearly suggest that nearly half of the casual riders ride during working day, it would imply a steady demand, and they need to write our bike constantly. Targeting these riders would have high casual-annual membership conversion.
| member_casual | workingday | riding_frequency | total_riding_duration_min | average_riding_duration_min | average_riding_duration_percentage |
|---|---|---|---|---|---|
| casual | No | 646691 | 14202547 | 21.96 | 53% |
| casual | Yes | 896658 | 17773706 | 19.82 | 47% |
| member | No | 643214 | 9665673 | 15.03 | 53% |
| member | Yes | 1555639 | 20657956 | 13.28 | 47% |
There are total of 712 riding station in the data set for 3,742,262 times of riding over the given 12-month period from May-2020 to April-2020. It would be more practical just to quick viewing the top 10 most frequently visited riding station. So, we can target marketing in these areas. Following statistics are based on the variable start_station_name, the column recording stations that riders started their rides.
mydata_cleaned %>%
group_by(member_casual, start_station_name) %>%
summarise(count = n()) %>%
filter(!is.na(start_station_name)) %>%
arrange(member_casual, desc(count)) %>%
top_n(10) %>%
kbl(align = "c",
table.attr = "style = 'width: 40%'",
caption = "Table 7.5.1 Top 10 most frequently visited stations among 712 stations, by casual riders and annual members.") %>%
kable_styling(bootstrap_options = c("hover", "striped", "bordered"))
| member_casual | start_station_name | count |
|---|---|---|
| casual | Streeter Dr & Grand Ave | 28562 |
| casual | Lake Shore Dr & Monroe St | 23593 |
| casual | Millennium Park | 21554 |
| casual | Theater on the Lake | 16346 |
| casual | Michigan Ave & Oak St | 15326 |
| casual | Indiana Ave & Roosevelt Rd | 14700 |
| casual | Lake Shore Dr & North Blvd | 14518 |
| casual | Michigan Ave & Lake St | 12791 |
| casual | Clark St & Elm St | 12718 |
| casual | Michigan Ave & Washington St | 11709 |
| member | Clark St & Elm St | 21509 |
| member | Broadway & Barry Ave | 16648 |
| member | Wells St & Concord Ln | 16354 |
| member | Dearborn St & Erie St | 16351 |
| member | St. Clair St & Erie St | 16235 |
| member | Theater on the Lake | 15897 |
| member | Kingsbury St & Kinzie St | 15525 |
| member | Wells St & Elm St | 15286 |
| member | Wells St & Huron St | 15156 |
| member | Clark St & Armitage Ave | 14606 |
Since casual rider is the only group of interest, my EDA will mainly focus on studying and understanding this group of riders in order to design a specific recommendation for marketing purposes. However, annual riders would occasionally appear in the graphs when found that it is useful for comparison.
Since we have Geo-tagged data (with longitude and latitude), we are able to view them on a map to explore how are popular stations distributed.
temp1 <- mydata_cleaned %>%
filter(member_casual == "casual") %>% # 3.7 mil riders to 1.5 mil of casual riders
group_by(start_station_name) %>%
summarise(frequency = n())
temp2 <- mydata_cleaned %>%
filter(member_casual == "casual") %>%
group_by(start_station_name, start_lat, start_lng) %>%
summarise(frequency = n()) %>%
group_by(start_station_name) %>%
filter(row_number() == 1)
df8.1 <- merge(temp1, temp2, by = "start_station_name") %>%
select(-frequency.y) %>%
rename("frequency" = "frequency.x")
# df for the map - create "lab", "pop-up-info" and "category"
df8.1 <- df8.1 %>%
mutate(lab = paste0(start_station_name, " (", frequency, ")"),
pop_up_info = paste0(start_station_name, "<br/>",
"Casual rider counts: ", frequency, "<br/>",
"Lat: ", round(start_lat, 5), "<br/>",
"Lat: ", round(start_lng, 5), "<br/>"
)) %>%
filter(!is.na(start_station_name)) %>% #remove NA
mutate(category = cut(frequency,
c(0, 500, 1000, 5000, 10000, 20000, 30000),
labels = c("0-500", "500-1000", "1000-5000", "5000-10000", "10000-20000", ">20000")))
# define a color to each group of the "category"
palette <- colorFactor(palette = c("yellow", "green", "orange", "red", "black"), domain = df8.1$category)
# Creating 6 layers for each of the category.
Layer1 <- df8.1 %>% filter(frequency < 500)
Layer2 <- df8.1 %>% filter(frequency > 500, frequency < 1000)
Layer3 <- df8.1 %>% filter(frequency > 1000, frequency < 5000)
Layer4 <- df8.1 %>% filter(frequency > 5000, frequency < 10000)
Layer5 <- df8.1 %>% filter(frequency > 10000, frequency < 20000)
Layer6 <- df8.1 %>% filter(frequency > 20000)
leaflet() %>%
addTiles(group = "OpenStreetMap (Default)") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>%
setView(lng = -87.6298, lat = 41.8781, zoom = 12) %>%
addCircleMarkers(group = "Yellow: 0-500",
data = Layer1,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~ "Yellow",
radius = 5,
stroke = T,
weight = 1,
opacity = 1) %>%
addCircleMarkers(group = "Gold: 500 - 1000",
data = Layer2,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~"Gold",
radius = 10,
stroke = T,
weight = 1,
opacity = 1) %>%
addCircleMarkers(group = "Green: 1000-5000",
data = Layer3,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~"Green",
radius = 15,
stroke = T,
weight = 1,
opacity = 1) %>%
addCircleMarkers(group = "Orange: 5000 - 10000",
data = Layer4,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~"Orange",
radius = 20 ,
stroke = T,
weight = 1,
opacity = 1) %>%
addCircleMarkers(group = "Red: 10000-20000",
data = Layer5,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~"Red",
radius = 25 ,
stroke = T,
weight = 1,
opacity = 1) %>%
addCircleMarkers(group = "Black: >20000",
data = Layer6,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ lab,
popup = ~ pop_up_info,
color = ~ "Black",
radius = 30,
stroke = T,
weight = 1,
opacity = 1) %>%
addLabelOnlyMarkers(group = "Casual-rider count",
data = df8.1,
lat = ~ start_lat,
lng = ~ start_lng,
label = ~ as.character(frequency),
labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, alpha = 0.5)) %>%
addLayersControl(
baseGroups = c("Esri", "OpenStreetMap", "Toner Lite"),
overlayGroups = c("Yellow: 0-500", "Gold: 500 - 1000", "Green: 1000-5000",
"Orange: 5000 - 10000", "Red: 10000-20000", "Black: >20000", "Casual-rider count"),
options = layersControlOptions(collapsed = F)
)
Have Orange, Red and Black category ticked, we can clearly see that most casual riders are concentrated in the coaster regions, where the central city is. It is likely linked with more choices of entertainment in the area.
Use this interactive table to filter out the top three bike station that hosting most casual riders.
df8.1DT <- df8.1 %>%
select(category, start_station_name, frequency) %>%
rename("Frequency grouping" = "category",
"Station name" = "start_station_name",
"Casual Rider Frequency" = "frequency")
datatable(df8.1DT, filter = "top", rownames = FALSE)
The top 3 ride stations are:
Please be aware that these are station named given by Cyclistic, it may be slightly different from the actual adjacent street names.
Prior to going deeper analysis, it is good to check on the temperatures of Chicago in different seasons as it would affect the experience of riding. For example, it might be too cold to be outdoor during winter in Chicago, when temperatures go below 0 Celsius.
df <- mydata_cleaned %>%
select(month, mean_temp_oC, season) %>%
group_by(month, mean_temp_oC, season) %>%
summarise(count = n()) %>%
select(-count)
ggplot(df, aes(x = month, y = mean_temp_oC, colour = season, group = "mean_temp_oC")) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey") +
geom_vline(xintercept = 3, linetype = "dashed", colour = "grey") +
geom_vline(xintercept = 5, linetype = "dashed", colour = "grey") +
geom_vline(xintercept = 9, linetype = "dashed", colour = "grey") +
geom_vline(xintercept = 11, linetype = "dashed", colour = "grey") +
geom_text(label = "Winter", colour = "blue", show.legend = F, x = 1.5, y = 35) +
geom_text(label = "Spring", colour = "green4", show.legend = F, x = 4 , y = 35) +
geom_text(label = "Summer", colour = "red", show.legend = F, x = 7, y = 35) +
geom_text(label = "Autumn", colour = "darkgreen", show.legend = F, x = 10, y = 35) +
geom_text(label = "Winter", colour = "blue", show.legend = F, x = 12, y = 35) +
geom_point(shape = 21, size = 4) +
geom_path() +
geom_label(aes(label = mean_temp_oC)) +
scale_y_continuous(limits = c(-5, 40)) +
scale_x_continuous(limits = c(1, 12), breaks = seq(1, 12, 1)) +
theme_classic() +
scale_colour_manual(breaks = c("winter", "spring", "summer", "autumn"),
values = c("blue", "green4", "red", "darkgreen")) +
theme(legend.position = "none",
plot.title = element_text(vjust = 2),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y= element_text(margin = margin(0, 10, 0, 0)),
plot.margin = unit(c(1,1,1,1), "cm")) +
labs(title = "Average Monthly Temperatures (Celsius) of Chicago",
caption = "Kar Ng",
x = "Month",
y = "Average Temperature (oC)")
Following figure shows that,
table6.2 <- mydata_cleaned %>%
group_by(date, year, rideable_type, member_casual, mean_temp_oC, season, workingday) %>%
summarise(daily_frequency = n())
table6.2$member_casual <- factor(table6.2$member_casual, levels = c("member", "casual"))
table6.2 %>% ggplot(aes(x = date, y = daily_frequency,
colour = rideable_type,
alpha = member_casual)) +
annotate("rect", fill = "yellow", alpha = 0.1,
xmin = as_date("2020-04-01"), xmax = as_date("2020-05-31"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "red", alpha = 0.1,
xmin = as_date("2020-05-31"), xmax = as_date("2020-09-30"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "darkgreen", alpha = 0.1,
xmin = as_date("2020-09-30"), xmax = as_date("2020-11-30"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "blue", alpha = 0.1,
xmin = as_date("2020-11-30"), xmax = as_date("2021-03-31"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "yellow", alpha = 0.1,
xmin = as_date("2021-03-31"), xmax = as_date("2021-05-31"),
ymin = -Inf, ymax = Inf) +
geom_text(label = "Spring", colour = "green4", show.legend = F,
x = as_date("2020-05-01"),
y = 15000) +
geom_text(label = "Summer", colour = "red", show.legend = F,
x = as_date("2020-08-01"),
y = 15000) +
geom_text(label = "Autumn", colour = "darkgreen", show.legend = F,
x = as_date("2020-11-01"),
y = 15000) +
geom_text(label = "Winter", colour = "blue", show.legend = F,
x = as_date("2021-02-01"),
y = 15000) +
geom_text(label = "Spring", colour = "green4", show.legend = F,
x = as_date("2021-05-01"),
y = 15000) +
geom_point() +
geom_line(stat = "smooth", size = 1) +
scale_x_date(date_breaks = "1 month", date_labels = "%Y %b") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 25, vjust = 0.7),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
panel.grid = element_blank()) +
labs(x = "Date",
y = "Daily Riding Frequency",
title = "Casual rider daily riding frequency",
subtitle = "Transparent points are member rider's frequency and follow the similar major trends",
caption = "Kar Ng",
colour = "Ride-able Type",
alpha = "Membership type")
Interestingly, in the last summer, casual rider has the longer riding duration than member riders. This figure can be another supporting evidence indicating casual riders are valuable riders, and it worth our focus on converting them into annual members.
table6.3 <- mydata_cleaned %>%
group_by(date, year, rideable_type, member_casual, mean_temp_oC, season, workingday) %>%
summarise(daily_duration_hr = sum(duration.min)/60)
table6.3$member_casual <- factor(table6.3$member_casual, levels = c("member", "casual"))
table6.3 %>% ggplot(aes(x = date, y = daily_duration_hr,
colour = rideable_type,
alpha = member_casual)) +
annotate("rect", fill = "yellow", alpha = 0.1,
xmin = as_date("2020-04-01"), xmax = as_date("2020-05-31"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "red", alpha = 0.1,
xmin = as_date("2020-05-31"), xmax = as_date("2020-09-30"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "darkgreen", alpha = 0.1,
xmin = as_date("2020-09-30"), xmax = as_date("2020-11-30"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "blue", alpha = 0.1,
xmin = as_date("2020-11-30"), xmax = as_date("2021-03-31"),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "yellow", alpha = 0.1,
xmin = as_date("2021-03-31"), xmax = as_date("2021-05-31"),
ymin = -Inf, ymax = Inf) +
geom_text(label = "Spring", colour = "green4", show.legend = F,
x = as_date("2020-05-01"),
y = 6000) +
geom_text(label = "Summer", colour = "red", show.legend = F,
x = as_date("2020-08-01"),
y = 6000) +
geom_text(label = "Autumn", colour = "darkgreen", show.legend = F,
x = as_date("2020-11-01"),
y = 6000) +
geom_text(label = "Winter", colour = "blue", show.legend = F,
x = as_date("2021-02-01"),
y = 6000) +
geom_text(label = "Spring", colour = "green4", show.legend = F,
x = as_date("2021-05-01"),
y = 6000) +
geom_point(shape = 8) +
geom_line(stat = "smooth", size = 1) +
scale_x_date(date_breaks = "1 month", date_labels = "%Y %b") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 25, vjust = 0.7),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
panel.grid = element_blank()) +
labs(x = "Date",
y = "Daily Riding Frequency",
title = "Daily riding HOURS of Casual rider",
subtitle = "Transparent points are member rider's frequency and follow the similar major trends",
caption = "Kar Ng",
colour = "Ride-able Type",
alpha = "Membership type")
Across all four seasons, most casual riders love to ride during off-rush hours (see following graph), and again summer has the highest ride frequency per min during off-rush hours. Any physical marketing activities would be at the highest effectiveness during this period, as we would has the highest probability of meeting casual riders.
If physical marketing activities are to be happened after 6pm, do it in the evening of spring and summer. According to following graph, many riders love evening-ride during spring and summer but not during autumn and winter.
According to Chicago goverment webiste, rush hours of Chicago are categorised into morning at between 7:00am to 9:30am and 4:00pm to 6:00pm.
# Setting the object for this section
table6.4 <- mydata_cleaned %>%
mutate(time = paste0(hour, ":" ,minute),
time2 = as.POSIXct(parse_hm(time))) %>% #syntheses a column of time (hour:min)
group_by(date, year, time, time2, rideable_type, member_casual, mean_temp_oC, season, workingday) %>%
summarise(frequency = n()) %>%
filter(member_casual == "casual")
table6.4$member_casual <- factor(table6.4$member_casual, levels = c("member",
"casual"))
# Ploting the graph
ggplot(table6.4, aes(x = time2, y = frequency,
colour = season)) +
annotate("rect", fill = "gold", alpha = 0.5,
xmin = as.POSIXct(parse_hm("7:00")), xmax = as.POSIXct(parse_hm("9:30")),
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "gold", alpha = 0.5,
xmin = as.POSIXct(parse_hm("16:00")), xmax = as.POSIXct(parse_hm("18:00")),
ymin = -Inf, ymax = Inf) +
geom_point(alpha = 0.2) +
geom_line(stat = "smooth", colour = "black") +
scale_x_datetime(date_labels = "%H:%M", date_breaks = "2 hour") +
facet_wrap(~season) +
theme_classic() +
theme(legend.position = "none",
plot.margin = unit(c(1,1,1,1), "cm"),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
strip.text = element_text(size = 15)) +
labs(title = "Total rides of Casual Riders per minute per day between may-2020 to april-2021 ",
subtitle = "Shaded regions are rush periods between 7:00am-9:30am and 4:00pm-6:00pm",
caption = "Kar Ng",
x = "Time",
y = "Riding frequency per min")
High amount of casual riders in the weekend especially on Saturday (23%) than in the weekdays (mostly 11%).
table6.5 <- mydata_cleaned %>%
mutate(weekday = factor(weekday),
weekday = factor(weekday,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))) %>%
mutate(member_casual = factor(member_casual,
levels = c("member", "casual")),
member_casual = fct_recode(member_casual,
"Member riders" = "member",
"Casual riders" = "casual")) %>%
group_by(weekday, member_casual) %>%
summarise(wkday_frequency = n())
table6.5.pct <- table6.5 %>%
group_by(member_casual) %>%
mutate(proportion = wkday_frequency/sum(wkday_frequency),
percentage = paste0(round(proportion*100), "%"))
ggplot(table6.5, aes(x = weekday, y = wkday_frequency, alpha = member_casual)) +
geom_bar(stat = "identity", position = "dodge", width = 0.5) +
geom_line(data = table6.5, aes(x = weekday, y = wkday_frequency,
alpha = member_casual, group = member_casual)) +
geom_text(data = table6.5, aes(label = wkday_frequency), vjust = -0.25) +
geom_text(data = table6.5.pct, aes(label = paste0("(", percentage, ")")), vjust = -2, hjust = 0.3, colour = "blue") +
theme_bw() +
theme(legend.position = "top",
axis.text.x = element_text(angle = 25, vjust = 0.7),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
plot.margin = unit(c(1,1,1,1), "cm")) +
scale_y_continuous(limits = c(0, 400000) ,
labels = function(x) paste0({x/1000}, "k")) +
labs(title = "Constant casual riders in the weekday and spike on Weekend",
subtitle = "Highlighting casual riders with member riders faded at the back",
caption = "Kar Ng",
x = "Days of the week",
y = "Ride frequency")
However, the conversion rate of casual riders riding bike in the working week days into annual membership may be higher.
The number of rides by casual riders in the weekday is relatively constant, these casual riders are probably the same persons riding our bikes for their day-to-day activities in the week day such as work.
They have a constant demand compare to those casual riders in the weekend. They are likely just riding it once in a while and less in repeatedly. If resources are limited, I suggest to consider to invest advertisements on weekdays.
Related statistics tables:
table6.5.pct %>%
select(weekday, wkday_frequency, percentage) %>%
arrange(member_casual) %>%
rename("Membership type" = "member_casual",
"Day" = "weekday",
"Ride counts" = "wkday_frequency",
"Percentage" = "percentage") %>%
kbl(align = "c",
table.attr = "style = 'width: 60%;'") %>%
kable_styling(bootstrap_options = c("bordered", "hover", "striped")) %>%
row_spec(c(6,7, 13, 14), bold = T) %>%
pack_rows("Group 1:", 1, 7, color = "brown") %>%
pack_rows("Group 2:", 8, 14, color = "brown")
| Membership type | Day | Ride counts | Percentage |
|---|---|---|---|
| Group 1: | |||
| Member riders | Monday | 287763 | 13% |
| Member riders | Tuesday | 308324 | 14% |
| Member riders | Wednesday | 324548 | 15% |
| Member riders | Thursday | 320143 | 15% |
| Member riders | Friday | 336004 | 15% |
| Member riders | Saturday | 342098 | 16% |
| Member riders | Sunday | 279973 | 13% |
| Group 2: | |||
| Casual riders | Monday | 164809 | 11% |
| Casual riders | Tuesday | 162381 | 11% |
| Casual riders | Wednesday | 168883 | 11% |
| Casual riders | Thursday | 176583 | 11% |
| Casual riders | Friday | 229550 | 15% |
| Casual riders | Saturday | 359561 | 23% |
| Casual riders | Sunday | 281582 | 18% |
How are casual riders use Cyclistic bikes?
Casual riders ride the most in regions near coastal regions, especially closer to where the city central is. The top 3 stations casual riders ride the most were Streeter Dr & Grand Ave, Lake Shore Dr & Monroe St, Millennium Park. Three of these stations accounts for 5% of 710 stations.
Allocate more advertisement resources to electric bike and classic bike, as casual riders have taken them up and replace docked bike since January 2021 to the latest available data in April 2021.
Best Time for advertisement can be between 12pm - 5:30pm across 4 seasons but summer and spring can go beyond 5:30pm and up to 10pm. However, advertisement in autumn and winter is not recommended, at least to avoid large scale investment, because casual riders lost their interest in riding during cool temperatures in these seasons.
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Focus advertisement during spring and summer. Winter has minimum riders as it might be too cold to ride outside for casual purposes.
Digital media best performed during spring and summer, advertise on popular medias such as Facebook, Twitter, Google, or Youtube during breakfast, lunch and dinner time ONLY in the weekdays, but morning and evening during weekend as current casual riders may be riding our bikes in the day time of weekend. Cyclistic can promote stations naar coastal regions if resources is limited (or best to promote all stations near coastal regions). Advertisements should focus on electric bike and classic bike.
This is a personal project created and designed for non-commercial use only. The data set of this personal project was prepared by Google data analytic certificate course and was made available for students as a capstone project to publish online as online data analytic project to show experiences.
However, this project is a personal project that extending the scope of the course, it is not submitted and graded by Google. Google and I, nor any stakeholders relevant to the creation of this project held any responsibility for any outcomes suffered from using the results of this project. This project is only created for educational and skill demonstration purposes.
This personal project is slightly altered and is slightly different from the original content. It is for the purpose of showing some of my specific analytuc skills outside the range of the scope of the course. The content of this project should not be perceived as the original project content of the Google data analytic course.
Thank you for reading this project.
City Holidays n.d., viewed 24 May 2021, https://www.chicago.gov/city/en/narr/misc/city-holidays.html
Wikipedia Chicago n.d., Chicago, viewed 25 May 2021, https://en.wikipedia.org/wiki/Chicago
Claffey M, Hofer S, 2020, City Officials Will Enforce AM and PM Rush Hour Restrictions on New Chicago and Western Avenue Bus Only Lanes, viewed 28 May 2021, https://www.chicago.gov/city/en/depts/cdot/provdrs/traffic_signals_andstreetlights/news/2020/january/city-officials-will-enforce-am-and-pm-rush-hour-restrictions-on-.html