This is a Google Data Analytics Professional Certificate Case Study-01:
How Does a Bike-Share Navigate Speedy Success?
Cyclistic is a bike-share company in Chicago. The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently.
The source of data can be found here. The data is provided by Motivate International Inc. under the license.
In our analysis, we looked into different aspects of bike usage. We considered demand for different days in a week, seasonal or monthly change of demand, priorities of bike choices etc. by different customer categories (member or casual).
Some analysis was done using spreadsheet and SQL initially (for details see Appendix). The number of information is so large that it is not feasible to analyze in spreadsheet. Therefore R is utilized for the analysis.
# Clear plots
if(!is.null(dev.list())) dev.off()
# Clean workspace
rm(list=ls())
# Clear console
cat("\014")
library(tidyverse)
folder_0 <- 'resources'
data_files=list.files(path=folder_0, pattern="*.csv", full.names=TRUE)
In primary analysis so many blank/null were found for start station name. But, the latitudes of those stations were always available. First unique stations name was found using pivot table and later using excel VLOOKUP function the first latitude linked to each stations was found. Then the latitudes were arranged in descending mode to get the highest latitude at the top. Then the middle distance between two consecutive station’s latitude was calculated and was assigned to the station having upper latitude as lat_range.
Suppose, the latitude of a unknown station is x. Then, if x>=43.8499443, then the station name will be Pawel Bialowas - Test- PBSC charging station; if 43.8499443>x>=43.8499442, them the station name will be Lincolnwood Dr & Central St and so on. This assumption was made to fill up the blank start stations name.
station_range=read_csv(data_files[13])
glimpse(station_range)
## Rows: 758
## Columns: 5
## $ station_ID <chr> "Pawel Bialowas - Test- PBSC charging station", "E0…
## $ station_lat <dbl> 45.63503, 42.06485, 42.06431, 42.06386, 42.05824, 4…
## $ lat_range <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…
## $ ...5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
station_range=station_range[c(3,4)]
glimpse(station_range)
## Rows: 758
## Columns: 2
## $ lat_range <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…
station_range_len <- nrow(station_range)
## Separate Data Files ----------------------------------------------
data_files=data_files[1:12]
data_files
## [1] "202201-divvy-tripdata.csv" "202202-divvy-tripdata.csv"
## [3] "202203-divvy-tripdata.csv" "202204-divvy-tripdata.csv"
## [5] "202205-divvy-tripdata.csv" "202206-divvy-tripdata.csv"
## [7] "202207-divvy-tripdata.csv" "202208-divvy-tripdata.csv"
## [9] "202209-divvy-publictripdata.csv" "202210-divvy-tripdata.csv"
## [11] "202211-divvy-tripdata.csv" "202212-divvy-tripdata.csv"
## Read 12 months data ----------------------------------------------
print('Reading data for 12 months')
## [1] "Reading data for 12 months"
data <- data_files %>%
map_df(~read_csv(.))
We have 5667717 obs (rows) with 13 columns (fields) for all the data in 2022.
## Read column names ----------------------------------------------
get_col <- function(data) {
colmn=colnames(data)
print(colmn)
return (colmn)
}
colmn <- get_col(data)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
We have the names of the columns now.
Arranged data according to start_station_name, so that the blank names will be at the tail.
There are lots of data NA at the tail for start_station_name and some other columns.
The function takes the organized data column, having blank data grouped at the bottom and then returns first blank index, total number of index and total number of blank index in the data.
get_blank_row_index <- function(data_){
rows <- nrow(data_)
rows_str <- as.character(rows)
len_str <- nchar(rows_str)
len_str <- len_str-1
#rows_str <- strtoi(lenstr, base = 0L)
initial=0
while(len_str>=0){
x <- 1
while(TRUE){
index=x*10^len_str+initial
if(is.na(data_[index,])){x <- x-1;break}
x <- x+1
}
initial <- x*10^len_str+initial
len_str <- len_str-1
}
return (c(initial+1, rows, rows-initial))
} # End get_blank_row_index # return blnk ind, tot row, num of blank
After knowing the blank rows (for start_station_name column-Column 5) indexes, make a copy of those data.
start_station_names ->column 5 & latitude -> column 9.
The function read the latitude and suggest nearby station name.
lat_to_station <- function(data_lat){
res <- station_range_len
for (i in 1:station_range_len){
if (data_lat >= station_range[i,1]){# station_range[row, col(1-Lat, 2-Name)]
res <- i
break
} # End If
} # End For
return (station_range[res,2]) } # End lat_to_station
for (i in 1:len){
station <- lat_to_station(lat_data[i,])
st_name[i,1] <- station
} # End For
write.csv(st_name,'station.csv',row.names = FALSE)
data[blank_index[1]:blank_index[2],5] <- st_name
write.csv(data,'data_no_blank.csv',row.names = FALSE)
data <- mutate(data,duration=ended_at-started_at,
month=format(started_at, format = "%m"),
day=format(started_at, format = "%A"))
data <- data[order(data$started_at),]
glimpse(data) colmn <- get_col(data)
write.csv(data,'data_no_blank_calc.csv',row.names = FALSE)
data_imp=data[,c('month','day','member_casual','rideable_type', 'start_station_name','duration')]
data_imp %>% filter(duration<0)
write.csv(data_imp,'zdata\data_imp_negDur.csv',row.names = FALSE)
data_imp <- data_imp %>% filter(duration>0)
dir.create('zdata') # create directory
write.csv(data_imp,'zdata\data_imp.csv',row.names = FALSE)
head(data_imp)
## Memory clean from unnecessary variables ----------------------------------
rm(list=ls())## Load data from files ----------------------------------------------
data <- read_csv("resources\\zdata\\data_imp.csv")
## Rows: 5667186 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): month, day, member_casual, rideable_type, start_station_name
## dbl (1): duration
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Get max min time of member and casual
mem_max_min <- data %>% filter(member_casual=='member') %>%
summarise(max_dur=max(duration), min_dur=min(duration))
casual_max_min <- data %>% filter(member_casual=='casual') %>%
summarise(max_dur=max(duration), min_dur=min(duration))
station_MemCas <- data %>%
group_by(member_casual, start_station_name) %>%
summarize(tot_sec=sum(duration),count=length(duration))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
member_used_statons_duration <- station_MemCas %>% filter(member_casual=='member') %>%
arrange(-tot_sec) %>% head()
head(member_used_statons_duration)
## # A tibble: 6 Ă— 4
## # Groups: member_casual [1]
## member_casual start_station_name tot_sec count
## <chr> <chr> <dbl> <int>
## 1 member Central Park Ave & Ohio St 34395619 41733
## 2 member Central Ave & Parker Ave 30856583 46418
## 3 member Augusta Blvd & Laramie Ave 30830046 38380
## 4 member Fullerton & Monitor 29873719 41866
## 5 member Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member Mozart St & Jackson Blvd 24797098 32881
member_used_statons_count <- station_MemCas %>% filter(member_casual=='member') %>%
arrange(-count) %>% head()
head(member_used_statons_count)
## # A tibble: 6 Ă— 4
## # Groups: member_casual [1]
## member_casual start_station_name tot_sec count
## <chr> <chr> <dbl> <int>
## 1 member Central Ave & Parker Ave 30856583 46418
## 2 member Fullerton & Monitor 29873719 41866
## 3 member Central Park Ave & Ohio St 34395619 41733
## 4 member Augusta Blvd & Laramie Ave 30830046 38380
## 5 member Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member Albany Ave & Belmont Ave 24445808 35500
casual_used_statons_duration <- station_MemCas %>% filter(member_casual=='casual') %>%
arrange(-tot_sec) %>% head()
head(casual_used_statons_duration)
## # A tibble: 6 Ă— 4
## # Groups: member_casual [1]
## member_casual start_station_name tot_sec count
## <chr> <chr> <dbl> <int>
## 1 casual Streeter Dr & Grand Ave 166414998 58083
## 2 casual DuSable Lake Shore Dr & Monroe St 90361333 31860
## 3 casual Millennium Park 88609124 25524
## 4 casual Michigan Ave & Oak St 59244349 25264
## 5 casual Shedd Aquarium 54625840 20265
## 6 casual DuSable Lake Shore Dr & North Blvd 52172829 23655
casual_used_statons_count <- station_MemCas %>% filter(member_casual=='casual') %>%
arrange(-count) %>% head()
head(casual_used_statons_count)
## # A tibble: 6 Ă— 4
## # Groups: member_casual [1]
## member_casual start_station_name tot_sec count
## <chr> <chr> <dbl> <int>
## 1 casual Streeter Dr & Grand Ave 166414998 58083
## 2 casual Fullerton & Monitor 28737694 34612
## 3 casual Central Ave & Parker Ave 26483756 33992
## 4 casual Bloomingdale Ave & Harlem Ave 29750921 32943
## 5 casual Central Park Ave & Ohio St 31360192 32472
## 6 casual DuSable Lake Shore Dr & Monroe St 90361333 31860
ls()
## [1] "casual_max_min" "casual_used_statons_count"
## [3] "casual_used_statons_duration" "data"
## [5] "day_duration" "day_duration_MemCas"
## [7] "mc_duration" "mem_max_min"
## [9] "member_used_statons_count" "member_used_statons_duration"
## [11] "month_duration" "month_duration_MemCas"
## [13] "most_used_statons_count" "most_used_statons_duration"
## [15] "rtype_duration" "rtype_duration_MemCas"
## [17] "station_MemCas" "time_MemCas"
rm(data)
save.image('zdata\\analysis_result.RData')
rm(list=ls())
load('zdata\\analysis_result.RData')
ls()
## [1] "casual_max_min" "casual_used_statons_count"
## [3] "casual_used_statons_duration" "day_duration"
## [5] "day_duration_MemCas" "mc_duration"
## [7] "mem_max_min" "member_used_statons_count"
## [9] "member_used_statons_duration" "month_duration"
## [11] "month_duration_MemCas" "most_used_statons_count"
## [13] "most_used_statons_duration" "rtype_duration"
## [15] "rtype_duration_MemCas" "station_MemCas"
## [17] "time_MemCas"
Create function to convert second to thousand hour or multiply any variable with a constant and return target decimal value.
round_fun <- function(data,constant=1/3600/1000, decimal=2){
temp <- data*constant
return (round(temp,decimal))
}
As normal rules, the rate for casual bikers is usually higher than the members. The hours usage by the casual bikers is comparatively so high. Therefore if the casual bikers become members, the revenue received from hourly rate will be decreased. But, the membership fee is unknown to me. Therefore overall financial analysis could not be done.
The demand changes a lot though out the year. Therefore, providing the resources solely could cost more. Because of low utilization factor with having large capacity of supply, outsourcing resources during high demand could be a solution. If the casual riders become members, the supply needs to be ensured always although their usage usually drops at the begin and end of the year a lot.
If casual bikers are to be inspired to be converted into members than all types of bikes needs to be provided (electric, classic, docked). The target station for advertisement could be:
Streeter Dr & Grand Ave
DuSable Lake ShoreDr & Monroe St
Millenium Park &
DuSable Lake Shore Dr & Norh Blvd
For more information on this paper, you may contact here.
The code was implemented online, in BigQuery sandbox. The project name was:
capstone-project-01-bike-share and the table names were (YearMonth): 202201, 202202, 202203 ….. 202212. Then merged all the table into one: merged_month_t
CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_t` AS
--Select all the month files
SELECT * FROM `capstone-project-01-bike-share.bike_share.202201`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202202`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202203`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202204`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202205`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202206`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202207`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202208`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202209`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202210`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202211`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202212`
ORDER BY started_at;
Calculate duration of each ride and assign days according to the date of started_at.
CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_2022` AS
SELECT
*,
ended_at - started_at AS duration,
EXTRACT(MONTH FROM started_at) AS month,
CASE
WHEN EXTRACT(DAYOFWEEK FROM started_at)=1 THEN 'Sunday'
WHEN EXTRACT(DAYOFWEEK FROM started_at)=2 THEN 'Monday'
WHEN EXTRACT(DAYOFWEEK FROM started_at)=3 THEN 'Tueday'
WHEN EXTRACT(DAYOFWEEK FROM started_at)=4 THEN 'Wednesday'
WHEN EXTRACT(DAYOFWEEK FROM started_at)=5 THEN 'Thursday'
WHEN EXTRACT(DAYOFWEEK FROM started_at)=6 THEN 'Friday'
ELSE 'Saturday'
END AS day --[1,7] with Sunday as the first day of the week
FROM `capstone-project-01-bike-share.bike_share.merged_month_t`
ORDER BY started_at;
Calculate average, total, minimum, maximum and number of each categorized by rider type (member or casual) as well as month.
CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
member_casual as Mem_Cas,
month as Month,
AVG(duration) as Avg_Dur,
SUM(duration) as Sum_Dur,
MIN(duration) as Min_Dur,
MAX(duration) as Max_Dur,
COUNT(duration) as Count_Dur
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month
ORDER BY month, member_casual
Calculate average, total, minimum, maximum and number of each categorized by biker type (member or casual) , month and rideable_type..
--CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
member_casual as Mem_Cas,
rideable_type as Ride_Type,
month as Month,
AVG(duration) as Avg_Dur,
SUM(duration) as Sum_Dur,
MIN(duration) as Min_Dur,
MAX(duration) as Max_Dur,
COUNT(duration) as Count_Dur
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month, rideable_type
ORDER BY month, member_casual, rideable_type
Find out member and casual riders usage in different days in a week.
CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_day_2022` AS
SELECT
member_casual as Mem_Cas,
day as Day,
AVG(duration) as Avg_Dur,
SUM(duration) as Sum_Dur,
MIN(duration) as Min_Dur,
MAX(duration) as Max_Dur,
COUNT(duration) as Count_Dur
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, day
ORDER BY member_casual, day
Find out preference of ride type (classic/ electric/ docked bike)
CREATE TABLE `capstone-project-01-bike-share.bike_share.ride_type_2022` AS
SELECT
member_casual as Mem_Cas,
rideable_type as Ride_Type,
AVG(duration) as Avg_Dur,
SUM(duration) as Sum_Dur,
MIN(duration) as Min_Dur,
MAX(duration) as Max_Dur,
COUNT(duration) as Count_Dur
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, rideable_type
ORDER BY member_casual, rideable_type
At first, January 2022 data was analysed for data cleaning.
It was found that in some cases the start and end station same. Needs to clarify whether it is okay.
The lattitude abd longitude datatype was found to be both text and number. Later, all of them were converted into number.
The number of blank cells for start_station_name is 16260 (5715 alone), end_station_name is 17927 (7382 alone), and for both 10545. As a result, Unique number blank spaces is 23642. Therefore deleting these lines was found to be unwise. To solve the issue, at first the lattitude and longitude of the stations were found. Later, the blank spaces were filled with the name according to the lattitude.
Two new field was introduced to show the duration of ride and denote the day of the week.
The spreadsheet was found to be too slow to deal with this large amount of data. Therefore, for all the data, to fascilitate data cleaning and analyse SQL and R are selected.
In Jan 2022, the service was mostly used by members.
The casual bikers had he maximum duration of riding. The minimum duration as zero signifies the possibility of data entry error and requirement of cleaning.
In this time period, the service demand was so high on Saturday, Thursday and Wednesday. But, there were five Saturday, Sunday and Monday. Therefore, Thursday and Wednesday can be considered as the busiest day.
Although the number was low but the average usage was higher for casual riders.
Classic bikes were found to be most popular among the riders.
Members didn’t use the docked bikes.
Although the overall contribution was low but those who used docked bike, they utilized it for longer periods.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10 purrr_1.0.1
## [5] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0
## [9] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.36 bslib_0.4.2
## [4] haven_2.5.1 gargle_1.2.1 colorspace_2.0-3
## [7] vctrs_0.5.1 generics_0.1.3 htmltools_0.5.4
## [10] yaml_2.3.6 utf8_1.2.2 rlang_1.0.6
## [13] jquerylib_0.1.4 pillar_1.8.1 withr_2.5.0
## [16] glue_1.6.2 DBI_1.1.3 dbplyr_2.3.0
## [19] readxl_1.4.1 modelr_0.1.10 lifecycle_1.0.3
## [22] munsell_0.5.0 gtable_0.3.1 cellranger_1.1.0
## [25] rvest_1.0.3 evaluate_0.20 knitr_1.41
## [28] tzdb_0.3.0 fastmap_1.1.0 fansi_1.0.3
## [31] broom_1.0.2 backports_1.4.1 scales_1.2.1
## [34] googlesheets4_1.0.1 cachem_1.0.6 jsonlite_1.8.4
## [37] fs_1.5.2 hms_1.1.2 digest_0.6.31
## [40] stringi_1.7.12 grid_4.2.2 cli_3.6.0
## [43] tools_4.2.2 magrittr_2.0.3 sass_0.4.4
## [46] crayon_1.5.2 pkgconfig_2.0.3 ellipsis_0.3.2
## [49] xml2_1.3.3 reprex_2.0.2 googledrive_2.0.0
## [52] lubridate_1.9.0 timechange_0.2.0 assertthat_0.2.1
## [55] rmarkdown_2.19 httr_1.4.4 rstudioapi_0.14
## [58] R6_2.5.1 compiler_4.2.2