This is a Google Data Analytics Professional Certificate Case Study-01:

How Does a Bike-Share Navigate Speedy Success?

Details

Cyclistic is a bike-share company in Chicago. The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently.

The source of data can be found here. The data is provided by Motivate International Inc. under the license.

In our analysis, we looked into different aspects of bike usage. We considered demand for different days in a week, seasonal or monthly change of demand, priorities of bike choices etc. by different customer categories (member or casual).

Some analysis was done using spreadsheet and SQL initially (for details see Appendix). The number of information is so large that it is not feasible to analyze in spreadsheet. Therefore R is utilized for the analysis.

Data Clean up and get important calculated summaries

Clear Work place

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clean workspace
rm(list=ls())
# Clear console
cat("\014") 

Load necessary library and read file names

library(tidyverse)
folder_0 <- 'resources'
data_files=list.files(path=folder_0, pattern="*.csv", full.names=TRUE)

Get latitudes of start station from previously saved excel file.

In primary analysis so many blank/null were found for start station name. But, the latitudes of those stations were always available. First unique stations name was found using pivot table and later using excel VLOOKUP function the first latitude linked to each stations was found. Then the latitudes were arranged in descending mode to get the highest latitude at the top. Then the middle distance between two consecutive station’s latitude was calculated and was assigned to the station having upper latitude as lat_range.

Suppose, the latitude of a unknown station is x. Then, if x>=43.8499443, then the station name will be Pawel Bialowas - Test- PBSC charging station; if 43.8499443>x>=43.8499442, them the station name will be Lincolnwood Dr & Central St and so on. This assumption was made to fill up the blank start stations name.

station_range=read_csv(data_files[13])
glimpse(station_range)
## Rows: 758
## Columns: 5
## $ station_ID         <chr> "Pawel Bialowas - Test- PBSC charging station", "E0…
## $ station_lat        <dbl> 45.63503, 42.06485, 42.06431, 42.06386, 42.05824, 4…
## $ lat_range          <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…
## $ ...5               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
station_range=station_range[c(3,4)]
glimpse(station_range)
## Rows: 758
## Columns: 2
## $ lat_range          <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…
station_range_len <- nrow(station_range)

Load data for 12 months

## Separate Data Files ----------------------------------------------
data_files=data_files[1:12]
data_files
##  [1] "202201-divvy-tripdata.csv"       "202202-divvy-tripdata.csv"      
##  [3] "202203-divvy-tripdata.csv"       "202204-divvy-tripdata.csv"      
##  [5] "202205-divvy-tripdata.csv"       "202206-divvy-tripdata.csv"      
##  [7] "202207-divvy-tripdata.csv"       "202208-divvy-tripdata.csv"      
##  [9] "202209-divvy-publictripdata.csv" "202210-divvy-tripdata.csv"      
## [11] "202211-divvy-tripdata.csv"       "202212-divvy-tripdata.csv"
## Read 12 months data ----------------------------------------------
print('Reading data for 12 months')
## [1] "Reading data for 12 months"
data <- data_files  %>% 
  map_df(~read_csv(.))

We have 5667717 obs (rows) with 13 columns (fields) for all the data in 2022.

## Read column names ---------------------------------------------- 
get_col <- function(data) {
  colmn=colnames(data)
  print(colmn)
  return (colmn)
}
colmn <- get_col(data)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

We have the names of the columns now.

Arrange data

Arranged data according to start_station_name, so that the blank names will be at the tail.

There are lots of data NA at the tail for start_station_name and some other columns.

Create function to find out blank indexes

The function takes the organized data column, having blank data grouped at the bottom and then returns first blank index, total number of index and total number of blank index in the data.

get_blank_row_index <- function(data_){
  rows <- nrow(data_)
  rows_str <- as.character(rows)
  len_str <- nchar(rows_str)
  len_str <- len_str-1
  #rows_str <- strtoi(lenstr, base = 0L)
  
  
  initial=0
  while(len_str>=0){
    x <- 1
    while(TRUE){
      index=x*10^len_str+initial
      if(is.na(data_[index,])){x <- x-1;break}
      x <- x+1
    }
    initial <- x*10^len_str+initial
    len_str <- len_str-1
  }
return (c(initial+1, rows, rows-initial))
} # End get_blank_row_index # return blnk ind, tot row, num of blank 

Copy blank data

After knowing the blank rows (for start_station_name column-Column 5) indexes, make a copy of those data.

Separate Name and latitude

start_station_names ->column 5 & latitude -> column 9.

Create Function to Fill Blank in the start station name

The function read the latitude and suggest nearby station name.

lat_to_station <- function(data_lat){
  res <- station_range_len
  for (i in 1:station_range_len){
    if (data_lat >= station_range[i,1]){# station_range[row, col(1-Lat, 2-Name)]
      res <- i
      break
    } # End If 
  } # End For
  return (station_range[res,2])  } # End lat_to_station

Fill Blank in the start station name

  • The for loop written below took more than 6 hours to execute therefore shown below and not executed in rmarkdown. The names of the stations were saved in st_name.
for (i in 1:len){

station <- lat_to_station(lat_data[i,])

st_name[i,1] <- station

} # End For
  • Later, save the data for further use later.
write.csv(st_name,'station.csv',row.names = FALSE)

data[blank_index[1]:blank_index[2],5] <- st_name

write.csv(data,'data_no_blank.csv',row.names = FALSE)
  • Calculate the duration and find out the month of the date & day of the week
data <- mutate(data,duration=ended_at-started_at,

month=format(started_at, format = "%m"),

day=format(started_at, format = "%A"))
  • Sort data according to started_at (date time)
data <- data[order(data$started_at),]

glimpse(data) colmn <- get_col(data)

write.csv(data,'data_no_blank_calc.csv',row.names = FALSE)
  • Separate important columns ———————————————-
data_imp=data[,c('month','day','member_casual','rideable_type', 'start_station_name','duration')]
  • Check negative values in duration, could be due to swap of start & end time or some other causes
data_imp %>% filter(duration<0)

write.csv(data_imp,'zdata\data_imp_negDur.csv',row.names = FALSE)
  • Found some negative vaues is duration, therefore excluding them
data_imp <- data_imp %>% filter(duration>0)

dir.create('zdata') # create directory

write.csv(data_imp,'zdata\data_imp.csv',row.names = FALSE)
  • Check dataset
head(data_imp)

After primary clean remove all variable and read the variable with important data only

## Memory clean from unnecessary variables ----------------------------------
rm(list=ls())## Load  data from files ----------------------------------------------
data <- read_csv("resources\\zdata\\data_imp.csv")
## Rows: 5667186 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): month, day, member_casual, rideable_type, start_station_name
## dbl (1): duration
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

From here separate/calculate necessary data for future analysis.

## Get max min time of member and casual
mem_max_min <- data %>% filter(member_casual=='member') %>% 
  summarise(max_dur=max(duration), min_dur=min(duration))

casual_max_min <- data %>% filter(member_casual=='casual') %>% 
  summarise(max_dur=max(duration), min_dur=min(duration))

Get station used by member_casual ------------------------------------------

station_MemCas <- data %>% 
  group_by(member_casual, start_station_name) %>%
  summarize(tot_sec=sum(duration),count=length(duration)) 
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
member_used_statons_duration <- station_MemCas %>% filter(member_casual=='member') %>%
  arrange(-tot_sec) %>% head()
head(member_used_statons_duration)
## # A tibble: 6 Ă— 4
## # Groups:   member_casual [1]
##   member_casual start_station_name             tot_sec count
##   <chr>         <chr>                            <dbl> <int>
## 1 member        Central Park Ave & Ohio St    34395619 41733
## 2 member        Central Ave & Parker Ave      30856583 46418
## 3 member        Augusta Blvd & Laramie Ave    30830046 38380
## 4 member        Fullerton & Monitor           29873719 41866
## 5 member        Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member        Mozart St & Jackson Blvd      24797098 32881
member_used_statons_count <- station_MemCas %>% filter(member_casual=='member') %>%
  arrange(-count) %>% head()
head(member_used_statons_count)
## # A tibble: 6 Ă— 4
## # Groups:   member_casual [1]
##   member_casual start_station_name             tot_sec count
##   <chr>         <chr>                            <dbl> <int>
## 1 member        Central Ave & Parker Ave      30856583 46418
## 2 member        Fullerton & Monitor           29873719 41866
## 3 member        Central Park Ave & Ohio St    34395619 41733
## 4 member        Augusta Blvd & Laramie Ave    30830046 38380
## 5 member        Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member        Albany Ave & Belmont Ave      24445808 35500
casual_used_statons_duration <- station_MemCas %>% filter(member_casual=='casual') %>%
  arrange(-tot_sec) %>% head()
head(casual_used_statons_duration)
## # A tibble: 6 Ă— 4
## # Groups:   member_casual [1]
##   member_casual start_station_name                   tot_sec count
##   <chr>         <chr>                                  <dbl> <int>
## 1 casual        Streeter Dr & Grand Ave            166414998 58083
## 2 casual        DuSable Lake Shore Dr & Monroe St   90361333 31860
## 3 casual        Millennium Park                     88609124 25524
## 4 casual        Michigan Ave & Oak St               59244349 25264
## 5 casual        Shedd Aquarium                      54625840 20265
## 6 casual        DuSable Lake Shore Dr & North Blvd  52172829 23655
casual_used_statons_count <- station_MemCas %>% filter(member_casual=='casual') %>%
  arrange(-count) %>% head()
head(casual_used_statons_count)
## # A tibble: 6 Ă— 4
## # Groups:   member_casual [1]
##   member_casual start_station_name                  tot_sec count
##   <chr>         <chr>                                 <dbl> <int>
## 1 casual        Streeter Dr & Grand Ave           166414998 58083
## 2 casual        Fullerton & Monitor                28737694 34612
## 3 casual        Central Ave & Parker Ave           26483756 33992
## 4 casual        Bloomingdale Ave & Harlem Ave      29750921 32943
## 5 casual        Central Park Ave & Ohio St         31360192 32472
## 6 casual        DuSable Lake Shore Dr & Monroe St  90361333 31860

Save all important calculated variables

ls()
##  [1] "casual_max_min"               "casual_used_statons_count"   
##  [3] "casual_used_statons_duration" "data"                        
##  [5] "day_duration"                 "day_duration_MemCas"         
##  [7] "mc_duration"                  "mem_max_min"                 
##  [9] "member_used_statons_count"    "member_used_statons_duration"
## [11] "month_duration"               "month_duration_MemCas"       
## [13] "most_used_statons_count"      "most_used_statons_duration"  
## [15] "rtype_duration"               "rtype_duration_MemCas"       
## [17] "station_MemCas"               "time_MemCas"
rm(data)
save.image('zdata\\analysis_result.RData')
rm(list=ls())

Data cleaning completed --------------------------------

Start Analysis ----------------------------------------------------

Load saved data for analysis --------------------------------

load('zdata\\analysis_result.RData')
ls()
##  [1] "casual_max_min"               "casual_used_statons_count"   
##  [3] "casual_used_statons_duration" "day_duration"                
##  [5] "day_duration_MemCas"          "mc_duration"                 
##  [7] "mem_max_min"                  "member_used_statons_count"   
##  [9] "member_used_statons_duration" "month_duration"              
## [11] "month_duration_MemCas"        "most_used_statons_count"     
## [13] "most_used_statons_duration"   "rtype_duration"              
## [15] "rtype_duration_MemCas"        "station_MemCas"              
## [17] "time_MemCas"

Create function to convert second to thousand hour or multiply any variable with a constant and return target decimal value.

round_fun <- function(data,constant=1/3600/1000, decimal=2){
  temp <- data*constant
  return (round(temp,decimal))
}

Recommendation

  1. As normal rules, the rate for casual bikers is usually higher than the members. The hours usage by the casual bikers is comparatively so high. Therefore if the casual bikers become members, the revenue received from hourly rate will be decreased. But, the membership fee is unknown to me. Therefore overall financial analysis could not be done.

  2. The demand changes a lot though out the year. Therefore, providing the resources solely could cost more. Because of low utilization factor with having large capacity of supply, outsourcing resources during high demand could be a solution. If the casual riders become members, the supply needs to be ensured always although their usage usually drops at the begin and end of the year a lot.

  3. If casual bikers are to be inspired to be converted into members than all types of bikes needs to be provided (electric, classic, docked). The target station for advertisement could be:

    1. Streeter Dr & Grand Ave

    2. DuSable Lake ShoreDr & Monroe St

    3. Millenium Park &

    4. DuSable Lake Shore Dr & Norh Blvd

For more information on this paper, you may contact here.

Appendix

SQL sample code

The code was implemented online, in BigQuery sandbox. The project name was:

capstone-project-01-bike-share and the table names were (YearMonth): 202201, 202202, 202203 ….. 202212. Then merged all the table into one: merged_month_t

CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_t` AS
--Select all the month files
SELECT * FROM `capstone-project-01-bike-share.bike_share.202201`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202202`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202203`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202204`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202205`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202206`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202207`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202208`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202209`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202210`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202211`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202212`
ORDER BY started_at;

Calculate duration of each ride and assign days according to the date of started_at.

CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_2022` AS

SELECT
    *,
    ended_at - started_at AS duration,
    EXTRACT(MONTH FROM started_at) AS month,
    
    CASE
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=1 THEN 'Sunday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=2 THEN 'Monday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=3 THEN 'Tueday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=4 THEN 'Wednesday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=5 THEN 'Thursday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=6 THEN 'Friday'
        ELSE 'Saturday'
    END AS day --[1,7] with Sunday as the first day of the week

FROM `capstone-project-01-bike-share.bike_share.merged_month_t`
ORDER BY started_at;

Calculate average, total, minimum, maximum and number of each categorized by rider type (member or casual) as well as month.

CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
      member_casual as Mem_Cas,
      month as Month,

      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month
ORDER BY month, member_casual

Calculate average, total, minimum, maximum and number of each categorized by biker type (member or casual) , month and rideable_type..

--CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
      member_casual as Mem_Cas,
      rideable_type as Ride_Type,
      month as Month,

      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month, rideable_type
ORDER BY month, member_casual, rideable_type

Find out member and casual riders usage in different days in a week.

CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_day_2022` AS
SELECT
      member_casual as Mem_Cas,
      day as Day,
      
      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, day 
ORDER BY member_casual, day

Find out preference of ride type (classic/ electric/ docked bike)

CREATE TABLE `capstone-project-01-bike-share.bike_share.ride_type_2022` AS
SELECT
      member_casual as Mem_Cas,
      rideable_type as Ride_Type,
      
      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, rideable_type 
ORDER BY member_casual, rideable_type

Excel Analysis

Datalog

  1. At first, January 2022 data was analysed for data cleaning.

  2. It was found that in some cases the start and end station same. Needs to clarify whether it is okay.

  3. The lattitude abd longitude datatype was found to be both text and number. Later, all of them were converted into number.

  4. The number of blank cells for start_station_name is 16260 (5715 alone), end_station_name is 17927 (7382 alone), and for both 10545. As a result, Unique number blank spaces is 23642. Therefore deleting these lines was found to be unwise. To solve the issue, at first the lattitude and longitude of the stations were found. Later, the blank spaces were filled with the name according to the lattitude.

  5. Two new field was introduced to show the duration of ride and denote the day of the week.

  6. The spreadsheet was found to be too slow to deal with this large amount of data. Therefore, for all the data, to fascilitate data cleaning and analyse SQL and R are selected.

Pivot charts

In Jan 2022, the service was mostly used by members.

The casual bikers had he maximum duration of riding. The minimum duration as zero signifies the possibility of data entry error and requirement of cleaning.

In this time period, the service demand was so high on Saturday, Thursday and Wednesday. But, there were five Saturday, Sunday and Monday. Therefore, Thursday and Wednesday can be considered as the busiest day.

Although the number was low but the average usage was higher for casual riders.

Classic bikes were found to be most popular among the riders.

Members didn’t use the docked bikes.

Although the overall contribution was low but those who used docked bike, they utilized it for longer periods.

Session Information

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.2   stringr_1.5.0   dplyr_1.0.10    purrr_1.0.1    
## [5] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0    xfun_0.36           bslib_0.4.2        
##  [4] haven_2.5.1         gargle_1.2.1        colorspace_2.0-3   
##  [7] vctrs_0.5.1         generics_0.1.3      htmltools_0.5.4    
## [10] yaml_2.3.6          utf8_1.2.2          rlang_1.0.6        
## [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
## [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.3.0       
## [19] readxl_1.4.1        modelr_0.1.10       lifecycle_1.0.3    
## [22] munsell_0.5.0       gtable_0.3.1        cellranger_1.1.0   
## [25] rvest_1.0.3         evaluate_0.20       knitr_1.41         
## [28] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.3        
## [31] broom_1.0.2         backports_1.4.1     scales_1.2.1       
## [34] googlesheets4_1.0.1 cachem_1.0.6        jsonlite_1.8.4     
## [37] fs_1.5.2            hms_1.1.2           digest_0.6.31      
## [40] stringi_1.7.12      grid_4.2.2          cli_3.6.0          
## [43] tools_4.2.2         magrittr_2.0.3      sass_0.4.4         
## [46] crayon_1.5.2        pkgconfig_2.0.3     ellipsis_0.3.2     
## [49] xml2_1.3.3          reprex_2.0.2        googledrive_2.0.0  
## [52] lubridate_1.9.0     timechange_0.2.0    assertthat_0.2.1   
## [55] rmarkdown_2.19      httr_1.4.4          rstudioapi_0.14    
## [58] R6_2.5.1            compiler_4.2.2