Details

Cyclistic is a bike-share company in Chicago. The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently.

The source of data can be found here. The data is provided by Motivate International Inc. under the license.

In our analysis, we looked into different aspects of bike usage. We considered demand for different days in a week, seasonal or monthly change of demand, priorities of bike choices etc. by different customer categories (member or casual).

Some analysis was done using spreadsheet and SQL initially (for details see Appendix). The number of information is so large that it is not feasible to analyze in spreadsheet. Therefore R is utilized for the analysis.

Data Clean up and get important calculated summaries

Clear Work place

# Clear plots
if(!is.null(dev.list())) dev.off()
# Clean workspace
rm(list=ls())
# Clear console
cat("\014")

Load necessary library and read file names

library(tidyverse)
folder_0 <- 'resources'
data_files=list.files(path=folder_0, pattern="*.csv", full.names=TRUE)

Get latitudes of start station from previously saved excel file.

In primary analysis so many blank/null were found for start station name. But, the latitudes of those stations were always available. First unique stations name was found using pivot table and later using excel VLOOKUP function the first latitude linked to each stations was found. Then the latitudes were arranged in descending mode to get the highest latitude at the top. Then the middle distance between two consecutive station’s latitude was calculated and was assigned to the station having upper latitude as lat_range.

Suppose, the latitude of a unknown station is x. Then, if x>=43.8499443, then the station name will be Pawel Bialowas - Test- PBSC charging station; if 43.8499443>x>=43.8499442, them the station name will be Lincolnwood Dr & Central St and so on. This assumption was made to fill up the blank start stations name.

station_range=read_csv(data_files[13])
glimpse(station_range)

## Rows: 758
## Columns: 5
## $ station_ID         <chr> "Pawel Bialowas - Test- PBSC charging station", "E0…
## $ station_lat        <dbl> 45.63503, 42.06485, 42.06431, 42.06386, 42.05824, 4…
## $ lat_range          <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…
## $ ...5               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

station_range=station_range[c(3,4)]
glimpse(station_range)

## Rows: 758
## Columns: 2
## $ lat_range          <dbl> 43.84994, 43.84994, 42.06458, 42.06409, 42.06105, 4…
## $ start_station_name <chr> "Pawel Bialowas - Test- PBSC charging station", "Li…

station_range_len <- nrow(station_range)

Load data for 12 months

## Separate Data Files ----------------------------------------------
data_files=data_files[1:12]
data_files

##  [1] "202201-divvy-tripdata.csv"       "202202-divvy-tripdata.csv"      
##  [3] "202203-divvy-tripdata.csv"       "202204-divvy-tripdata.csv"      
##  [5] "202205-divvy-tripdata.csv"       "202206-divvy-tripdata.csv"      
##  [7] "202207-divvy-tripdata.csv"       "202208-divvy-tripdata.csv"      
##  [9] "202209-divvy-publictripdata.csv" "202210-divvy-tripdata.csv"      
## [11] "202211-divvy-tripdata.csv"       "202212-divvy-tripdata.csv"

## Read 12 months data ----------------------------------------------
print('Reading data for 12 months')

## [1] "Reading data for 12 months"

data <- data_files  %>% 
  map_df(~read_csv(.))

We have 5667717 obs (rows) with 13 columns (fields) for all the data in 2022.

## Read column names ---------------------------------------------- 
get_col <- function(data) {
  colmn=colnames(data)
  print(colmn)
  return (colmn)
}
colmn <- get_col(data)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

We have the names of the columns now.

Arrange data

Arranged data according to start_station_name, so that the blank names will be at the tail.

There are lots of data NA at the tail for start_station_name and some other columns.

Create function to find out blank indexes

The function takes the organized data column, having blank data grouped at the bottom and then returns first blank index, total number of index and total number of blank index in the data.

get_blank_row_index <- function(data_){
  rows <- nrow(data_)
  rows_str <- as.character(rows)
  len_str <- nchar(rows_str)
  len_str <- len_str-1
  #rows_str <- strtoi(lenstr, base = 0L)
  
  
  initial=0
  while(len_str>=0){
    x <- 1
    while(TRUE){
      index=x*10^len_str+initial
      if(is.na(data_[index,])){x <- x-1;break}
      x <- x+1
    }
    initial <- x*10^len_str+initial
    len_str <- len_str-1
  }
return (c(initial+1, rows, rows-initial))
} # End get_blank_row_index # return blnk ind, tot row, num of blank

Copy blank data

After knowing the blank rows (for start_station_name column-Column 5) indexes, make a copy of those data.

Separate Name and latitude

start_station_names ->column 5 & latitude -> column 9.

Create Function to Fill Blank in the start station name

The function read the latitude and suggest nearby station name.

lat_to_station <- function(data_lat){
  res <- station_range_len
  for (i in 1:station_range_len){
    if (data_lat >= station_range[i,1]){# station_range[row, col(1-Lat, 2-Name)]
      res <- i
      break
    } # End If 
  } # End For
  return (station_range[res,2])  } # End lat_to_station

Fill Blank in the start station name

The for loop written below took more than 6 hours to execute therefore shown below and not executed in rmarkdown. The names of the stations were saved in st_name.

for (i in 1:len){

station <- lat_to_station(lat_data[i,])

st_name[i,1] <- station

} # End For

Later, save the data for further use later.

write.csv(st_name,'station.csv',row.names = FALSE)

data[blank_index[1]:blank_index[2],5] <- st_name

write.csv(data,'data_no_blank.csv',row.names = FALSE)

Calculate the duration and find out the month of the date & day of the week

data <- mutate(data,duration=ended_at-started_at,

month=format(started_at, format = "%m"),

day=format(started_at, format = "%A"))

Sort data according to started_at (date time)

data <- data[order(data$started_at),]

glimpse(data) colmn <- get_col(data)

write.csv(data,'data_no_blank_calc.csv',row.names = FALSE)

Separate important columns ———————————————-

data_imp=data[,c('month','day','member_casual','rideable_type', 'start_station_name','duration')]

Check negative values in duration, could be due to swap of start & end time or some other causes

data_imp %>% filter(duration<0)

write.csv(data_imp,'zdata\data_imp_negDur.csv',row.names = FALSE)

Found some negative vaues is duration, therefore excluding them

data_imp <- data_imp %>% filter(duration>0)

dir.create('zdata') # create directory

write.csv(data_imp,'zdata\data_imp.csv',row.names = FALSE)

Check dataset

head(data_imp)

After primary clean remove all variable and read the variable with important data only

## Memory clean from unnecessary variables ----------------------------------
rm(list=ls())## Load  data from files ----------------------------------------------
data <- read_csv("resources\\zdata\\data_imp.csv")

## Rows: 5667186 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): month, day, member_casual, rideable_type, start_station_name
## dbl (1): duration
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

From here separate/calculate necessary data for future analysis.

## Get max min time of member and casual
mem_max_min <- data %>% filter(member_casual=='member') %>% 
  summarise(max_dur=max(duration), min_dur=min(duration))

casual_max_min <- data %>% filter(member_casual=='casual') %>% 
  summarise(max_dur=max(duration), min_dur=min(duration))

Get information related to station ----------------------------------------

station_MemCas <- data %>% 
  group_by(start_station_name) %>%
  summarize(tot_sec=sum(duration),count=length(duration)) 


most_used_statons_duration <- station_MemCas %>% arrange(-tot_sec) %>% head()
head(most_used_statons_duration)

## # A tibble: 6 × 3
##   start_station_name                   tot_sec count
##   <chr>                                  <dbl> <int>
## 1 Streeter Dr & Grand Ave            188091791 75222
## 2 DuSable Lake Shore Dr & Monroe St  103162791 41276
## 3 Millennium Park                     98951894 34999
## 4 Michigan Ave & Oak St               76366857 39657
## 5 DuSable Lake Shore Dr & North Blvd  70059189 40087
## 6 Central Park Ave & Ohio St          65755811 74205

most_used_statons_count <- station_MemCas %>% arrange(-count) %>% head()
head(most_used_statons_count)

## # A tibble: 6 × 3
##   start_station_name              tot_sec count
##   <chr>                             <dbl> <int>
## 1 Central Ave & Parker Ave       57340339 80410
## 2 Fullerton & Monitor            58611413 76478
## 3 Streeter Dr & Grand Ave       188091791 75222
## 4 Central Park Ave & Ohio St     65755811 74205
## 5 Bloomingdale Ave & Harlem Ave  58628973 71138
## 6 Augusta Blvd & Laramie Ave     55414271 65760

Get station used by member_casual ------------------------------------------

station_MemCas <- data %>% 
  group_by(member_casual, start_station_name) %>%
  summarize(tot_sec=sum(duration),count=length(duration))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

member_used_statons_duration <- station_MemCas %>% filter(member_casual=='member') %>%
  arrange(-tot_sec) %>% head()
head(member_used_statons_duration)

## # A tibble: 6 × 4
## # Groups:   member_casual [1]
##   member_casual start_station_name             tot_sec count
##   <chr>         <chr>                            <dbl> <int>
## 1 member        Central Park Ave & Ohio St    34395619 41733
## 2 member        Central Ave & Parker Ave      30856583 46418
## 3 member        Augusta Blvd & Laramie Ave    30830046 38380
## 4 member        Fullerton & Monitor           29873719 41866
## 5 member        Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member        Mozart St & Jackson Blvd      24797098 32881

member_used_statons_count <- station_MemCas %>% filter(member_casual=='member') %>%
  arrange(-count) %>% head()
head(member_used_statons_count)

## # A tibble: 6 × 4
## # Groups:   member_casual [1]
##   member_casual start_station_name             tot_sec count
##   <chr>         <chr>                            <dbl> <int>
## 1 member        Central Ave & Parker Ave      30856583 46418
## 2 member        Fullerton & Monitor           29873719 41866
## 3 member        Central Park Ave & Ohio St    34395619 41733
## 4 member        Augusta Blvd & Laramie Ave    30830046 38380
## 5 member        Bloomingdale Ave & Harlem Ave 28878052 38195
## 6 member        Albany Ave & Belmont Ave      24445808 35500

casual_used_statons_duration <- station_MemCas %>% filter(member_casual=='casual') %>%
  arrange(-tot_sec) %>% head()
head(casual_used_statons_duration)

## # A tibble: 6 × 4
## # Groups:   member_casual [1]
##   member_casual start_station_name                   tot_sec count
##   <chr>         <chr>                                  <dbl> <int>
## 1 casual        Streeter Dr & Grand Ave            166414998 58083
## 2 casual        DuSable Lake Shore Dr & Monroe St   90361333 31860
## 3 casual        Millennium Park                     88609124 25524
## 4 casual        Michigan Ave & Oak St               59244349 25264
## 5 casual        Shedd Aquarium                      54625840 20265
## 6 casual        DuSable Lake Shore Dr & North Blvd  52172829 23655

casual_used_statons_count <- station_MemCas %>% filter(member_casual=='casual') %>%
  arrange(-count) %>% head()
head(casual_used_statons_count)

## # A tibble: 6 × 4
## # Groups:   member_casual [1]
##   member_casual start_station_name                  tot_sec count
##   <chr>         <chr>                                 <dbl> <int>
## 1 casual        Streeter Dr & Grand Ave           166414998 58083
## 2 casual        Fullerton & Monitor                28737694 34612
## 3 casual        Central Ave & Parker Ave           26483756 33992
## 4 casual        Bloomingdale Ave & Harlem Ave      29750921 32943
## 5 casual        Central Park Ave & Ohio St         31360192 32472
## 6 casual        DuSable Lake Shore Dr & Monroe St  90361333 31860

Get information related to time, ride type ---------------------------------

time_MemCas <- data %>% 
  group_by(month,day,member_casual, rideable_type) %>%
  summarize(tot_sec=sum(duration),count=length(duration)) %>% 
  arrange(month,day,member_casual,rideable_type)

## `summarise()` has grouped output by 'month', 'day', 'member_casual'. You can
## override using the `.groups` argument.

head(time_MemCas)

## # A tibble: 6 × 6
## # Groups:   month, day, member_casual [3]
##   month day    member_casual rideable_type tot_sec count
##   <chr> <chr>  <chr>         <chr>           <dbl> <int>
## 1 01    Friday casual        classic_bike  1725977   966
## 2 01    Friday casual        docked_bike    892955   114
## 3 01    Friday casual        electric_bike 1000486  1377
## 4 01    Friday member        classic_bike  5152800  6619
## 5 01    Friday member        electric_bike 2974482  4733
## 6 01    Monday casual        classic_bike  1856753   867

## ridable_type related

rtype_duration_MemCas <- time_MemCas %>%
  group_by(member_casual,rideable_type) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(member_casual,rideable_type)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

print(rtype_duration_MemCas)

## # A tibble: 5 × 4
## # Groups:   member_casual [2]
##   member_casual rideable_type    tot_sec   count
##   <chr>         <chr>              <dbl>   <int>
## 1 casual        classic_bike  1537839767  891406
## 2 casual        docked_bike   1306686571  177468
## 3 casual        electric_bike 1216015387 1252895
## 4 member        classic_bike  1427080400 1709682
## 5 member        electric_bike 1125109773 1635735

rtype_duration <- time_MemCas %>% group_by(rideable_type) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(-tot_sec)
print(rtype_duration)

## # A tibble: 3 × 3
##   rideable_type    tot_sec   count
##   <chr>              <dbl>   <int>
## 1 classic_bike  2964920167 2601088
## 2 electric_bike 2341125160 2888630
## 3 docked_bike   1306686571  177468

## Day related
day_duration_MemCas <- time_MemCas %>% group_by(member_casual,day) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(member_casual,day)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

print(day_duration_MemCas)

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual day         tot_sec  count
##    <chr>         <chr>         <dbl>  <int>
##  1 casual        Friday    563181312 334667
##  2 casual        Monday    486269038 277649
##  3 casual        Saturday  925949735 473130
##  4 casual        Sunday    794934926 388981
##  5 casual        Thursday  474159670 309297
##  6 casual        Tuesday   408617443 263706
##  7 casual        Wednesday 407429601 274339
##  8 member        Friday    351174606 467051
##  9 member        Monday    348472287 473305
## 10 member        Saturday  376075296 443246
## 11 member        Sunday    325980593 387180
## 12 member        Thursday  392571950 532215
## 13 member        Tuesday   377434312 518584
## 14 member        Wednesday 380481129 523836

day_duration <- time_MemCas %>% group_by(day) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(-tot_sec)
print(day_duration)

## # A tibble: 7 × 3
##   day          tot_sec  count
##   <chr>          <dbl>  <int>
## 1 Saturday  1302025031 916376
## 2 Sunday    1120915519 776161
## 3 Friday     914355918 801718
## 4 Thursday   866731620 841512
## 5 Monday     834741325 750954
## 6 Wednesday  787910730 798175
## 7 Tuesday    786051755 782290

Month related -------------------------------------------------------

month_duration_MemCas <- time_MemCas %>% group_by(member_casual,month) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(member_casual,month)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

print(month_duration_MemCas)

## # A tibble: 24 × 4
## # Groups:   member_casual [2]
##    member_casual month   tot_sec  count
##    <chr>         <chr>     <dbl>  <int>
##  1 casual        01     33755428  18517
##  2 casual        02     34319291  21414
##  3 casual        03    175930305  89874
##  4 casual        04    224004050 126398
##  5 casual        05    519378142 280387
##  6 casual        06    710770113 369022
##  7 casual        07    713313485 406013
##  8 casual        08    631205441 358886
##  9 casual        09    498187101 296664
## 10 casual        10    330880918 208961
## # … with 14 more rows

month_duration <- time_MemCas %>% group_by(month) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(-tot_sec)
print(month_duration)

## # A tibble: 12 × 3
##    month    tot_sec  count
##    <chr>      <dbl>  <int>
##  1 07    1056904598 823416
##  2 06    1046876874 769138
##  3 08     974114987 785855
##  4 09     813218753 701267
##  5 05     803641645 634810
##  6 10     581794538 558620
##  7 04     392826547 371218
##  8 03     315240053 284024
##  9 11     287018318 337677
## 10 12     147276181 181792
## 11 02      98778553 115604
## 12 01      95040851 103765

## month related
mc_duration <- time_MemCas %>% group_by(member_casual) %>%
  summarize(tot_sec=sum(tot_sec),count=sum(count)) %>% 
  arrange(-tot_sec)
print(mc_duration)

## # A tibble: 2 × 3
##   member_casual    tot_sec   count
##   <chr>              <dbl>   <int>
## 1 casual        4060541725 2321769
## 2 member        2552190173 3345417

Save all important calculated variables

ls()

##  [1] "casual_max_min"               "casual_used_statons_count"   
##  [3] "casual_used_statons_duration" "data"                        
##  [5] "day_duration"                 "day_duration_MemCas"         
##  [7] "mc_duration"                  "mem_max_min"                 
##  [9] "member_used_statons_count"    "member_used_statons_duration"
## [11] "month_duration"               "month_duration_MemCas"       
## [13] "most_used_statons_count"      "most_used_statons_duration"  
## [15] "rtype_duration"               "rtype_duration_MemCas"       
## [17] "station_MemCas"               "time_MemCas"

rm(data)
save.image('zdata\\analysis_result.RData')
rm(list=ls())

Data cleaning completed --------------------------------

Start Analysis ----------------------------------------------------

Load saved data for analysis --------------------------------

load('zdata\\analysis_result.RData')
ls()

##  [1] "casual_max_min"               "casual_used_statons_count"   
##  [3] "casual_used_statons_duration" "day_duration"                
##  [5] "day_duration_MemCas"          "mc_duration"                 
##  [7] "mem_max_min"                  "member_used_statons_count"   
##  [9] "member_used_statons_duration" "month_duration"              
## [11] "month_duration_MemCas"        "most_used_statons_count"     
## [13] "most_used_statons_duration"   "rtype_duration"              
## [15] "rtype_duration_MemCas"        "station_MemCas"              
## [17] "time_MemCas"

Create function to convert second to thousand hour or multiply any variable with a constant and return target decimal value.

round_fun <- function(data,constant=1/3600/1000, decimal=2){
  temp <- data*constant
  return (round(temp,decimal))
}

Biker type related analysis --------------------------------------

#mc_duration
temp0 <- sum(mc_duration$tot_sec)
temp1 <- sum(mc_duration$count)
temp <- mc_duration %>%
  mutate(t_hour= round_fun(tot_sec), pc_hour = round_fun(tot_sec,100/temp0),
         t_count =round_fun(count,1/1000),pc_count = round_fun(count,100/temp1))

temp %>% 
  ggplot(mapping=aes(x='',y=pc_hour, fill=member_casual)) +
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = paste0(member_casual,'\n',t_hour,' thousand hours\n',pc_hour, "% ")),
            position = position_stack(vjust = 0.5))+
  coord_polar(theta="y")+
  theme_void()+
  labs(title= "Service used by different bikers in 2022")

temp %>% 
  ggplot(mapping=aes(x='',y=pc_count, fill=member_casual)) +
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = paste0(member_casual,'\n',t_count,' thousand counts\n',pc_count, "% ")),
            position = position_stack(vjust = 0.5))+
  coord_polar(theta="y")+
  theme_void()+
  labs(title= "Number of different bikers in 2022")

It is found that 40.97% customer are casual but they receive 61.4% service hour.

## Station related analysis ------------------------------------

thm_v0 <-  theme(axis.text.x = element_blank())

# duration by stations
#head(casual_used_statons_duration)
temp <- casual_used_statons_duration %>% mutate(t_hour=round_fun(tot_sec))


temp %>% 
  ggplot() +
  geom_col(mapping=aes(x=reorder(start_station_name, -t_hour),y=t_hour,
                       fill=reorder(start_station_name, -t_hour)))+
  labs(title= "Casual bikers' top durations for start stations in 2022",
       x='Start station name', y = 'Thousand Hours') + thm_v0 +
  geom_text(aes(x =start_station_name , y = t_hour, label = t_hour,
                group =start_station_name),position = position_stack(vjust =.5))

We have the list for top start station choices for long rides of the casual bikers here.

#head(member_used_statons_duration)

temp <- member_used_statons_duration %>% mutate(t_hour=round_fun(tot_sec))

temp %>% 
  ggplot() +
  geom_col(mapping=aes(x=reorder(start_station_name, -t_hour),y=t_hour,
                       fill=reorder(start_station_name, -t_hour)))+
  labs(title= "Member bikers' top durations for start stations in 2022",
       x='Start station name', y = 'Thousand Hours') + thm_v0+
  geom_text(aes(x =start_station_name , y = t_hour, label = t_hour,
                group =start_station_name),position = position_stack(vjust =.5))

Here we see the top start station choices for long rides of the member bikers.

#head(casual_used_statons_count)
temp <- casual_used_statons_count %>% mutate(t_count=round_fun(count,1/1000))
temp %>% 
  ggplot() +
  geom_col(mapping=aes(x=reorder(start_station_name, -t_count),y=t_count,
                       fill=reorder(start_station_name, -t_count)))+
  labs(title= "Casual bikers' top counts for start stations in 2022",
       x='Start station name', y = 'Thousand Counts') + thm_v0 +
  geom_text(aes(x =start_station_name , y = t_count, label = t_count,
                group =start_station_name),position = position_stack(vjust =.5))

Most casual bikers traffic happens here.

# count by stations
#head(member_used_statons_count)
temp <- member_used_statons_count %>% mutate(t_count=round_fun(count,1/1000))
temp %>% 
  ggplot() +
  geom_col(mapping=aes(x=reorder(start_station_name, -t_count),y=t_count,
                       fill=reorder(start_station_name, -t_count)))+
  labs(title= "Member bikers' top counts for start stations in 2022",
       x='Start station name', y = 'Thousand Counts') + thm_v0 +
  geom_text(aes(x =start_station_name , y = t_count, label = t_count,
                group =start_station_name),position = position_stack(vjust =.5))

Most member bikers traffic happens here.

Ride types related analysis ------------------------------------

temp <- as.double(sum(rtype_duration$tot_sec))
rtype_duration <- rtype_duration %>% mutate(perc=round_fun(tot_sec,100/temp),
                                            hour=round_fun(tot_sec))

rtype_duration %>% 
  ggplot(mapping=aes(x='',y=perc, fill=rideable_type)) +
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = paste0(rideable_type,'\n',hour,' thousand hours\n',perc, "%")),
            position = position_stack(vjust = 0.5))+
  coord_polar(theta="y")+
  theme_void()+
  labs(title= "Usage of different types of rides in 2022")

It seems classic bike is most popular among the riders.

#rtype_duration_MemCas

temp <- as.double(sum(rtype_duration_MemCas$tot_sec))
rtype_duration_MemCas <- rtype_duration_MemCas %>% mutate(perc=round_fun(tot_sec,100/temp),
                                            t_hour=round_fun(tot_sec))


temp <- rtype_duration_MemCas[,c(1,2,6,5)]

## Calculate ride preference
temp_cas <- temp %>% filter(member_casual=='casual')
temp_mem <- temp %>% filter(member_casual=='member')


temp <- as.double(sum(temp_cas$t_hour))
temp_cas<- temp_cas %>% mutate(perc=round_fun(t_hour,100/temp,2))

temp <- as.double(sum(temp_mem$t_hour))
temp_mem<- temp_mem %>% mutate(perc=round_fun(t_hour,100/temp,2))

temp_cas %>% 
  ggplot(mapping=aes(x='',y=perc, fill=rideable_type)) +
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = paste0(rideable_type,'\n',t_hour,' thousand hours\n',perc, "%")),
            position = position_stack(vjust = 0.5))+
  coord_polar(theta="y")+ theme_void() +
  labs(title= "Ride preferance of casual bikers in 2022")

Casual bikers interest spreads almost equally to all types of bikes.

temp_mem %>% 
  ggplot(mapping=aes(x='',y=perc, fill=rideable_type)) +
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = paste0(rideable_type,'\n',t_hour,' thousand hours\n', perc, "%")),
            position = position_stack(vjust = 0.5))+
  coord_polar(theta="y")+ theme_void() +
  labs(title= "Ride preferance of member bikers in 2022")

Member bikers don’t use docker bike at all. It explains the over all low percentage interest in docker bikes among the bikers. The rent tariff for the docker bike including others needs to be known to find any relation with the price.

Weekday related analysis ---------------------------------------

#day_duration
temp <- as.double(sum(day_duration$tot_sec))
day_duration <- day_duration %>% mutate(perc=round_fun(tot_sec,100/temp),
                                                          t_hour=round_fun(tot_sec))
order <- c(1,2,7,6,3,5,4)
temp <- day_duration[,c(1,4,5)]
temp <- temp %>% mutate(order)
temp %>% 
  ggplot() + geom_col(mapping=aes(x=reorder(day,order),y=t_hour, fill=day))+
  theme(legend.position="none")+
  labs(title= "Bikes usage in different weekdays in 2022",
       x='', y = 'Thousand hours')+
  geom_text(aes(x =day , y = t_hour, label = paste0(perc, "%")), position = position_stack(vjust = .5))

It seems that on Saturday and Sunday the service requirement is quiet high.

#day_duration_MemCas
temp <- day_duration_MemCas %>% mutate(t_hour=round_fun(tot_sec))

temp$day <- ordered(temp$day, levels=c("Saturday", "Sunday","Monday",
                   "Tuesday", "Wednesday", "Thursday", "Friday"))

temp %>% 
  ggplot() + geom_col(mapping=aes(x=member_casual,y=t_hour, fill=member_casual))+
  facet_wrap(~day)+ theme(legend.position="none")+
  labs(title= "Bikes usage in different weekdays by different types of bikers in 2022",
       x='', y = 'Thousand hours')+
  geom_text(aes(x =member_casual , y = t_hour, label = t_hour,
                group = member_casual), position = position_stack(vjust = .5))

The chart shows that the usage of member in throughout the week is somewhat similar but varies a lot for casual bikers; which is so high on Saturday and Sunday.

Month related analysis ------------------------------------------------

#month_duration
temp <- month_duration
temp$month <- as.integer(temp$month)
temp <- temp %>% arrange(month)
temp1 <- as.double(sum(month_duration$tot_sec))
temp <- temp %>% mutate(perc=round_fun(tot_sec,100/temp1),
                        t_hour=round_fun(tot_sec), name=month.name)
temp <- temp[,c(1,4,5,6)]
temp %>% 
  ggplot() + geom_col(mapping=aes(x=reorder(name,month),y=t_hour, fill=name))+
  theme(legend.position="none",axis.text.x = element_text(angle = 60))+
  labs(title= "Bikes usage in different months in 2022",
       x='', y = 'Thousand hours')+
  geom_text(aes(x =reorder(name,month),
                y = t_hour, label = paste0(t_hour,'\n',perc, "%")),
            position = position_stack(vjust = .5))

The chart shows that in the middle of the year the service requirement is so high but at the begin or end of the year it is quiet low. The difference between the highest and lowest monthly demands is too high.

#month_duration_MemCas
temp <- month_duration_MemCas %>%
  mutate(t_hour=round_fun(tot_sec), name=month.name)
temp$month <- as.integer(temp$month)


temp %>% 
  ggplot()+ geom_col(mapping=aes(x=member_casual,y=t_hour, fill=member_casual))+
  facet_wrap(~reorder(name,month))+ theme(legend.position="none")+
  labs(title= "Bikes usage in different months by different types of bikers in 2022",
       x='', y = 'Thousand hours')+
  geom_text(aes(x =member_casual , y = t_hour, label = t_hour,
                group = member_casual), position = position_stack(vjust = 0.5))

cas <- temp %>% filter(member_casual=='casual')
print('Standard deviation (thousand hours) of casual riders demand along different months:')

## [1] "Standard deviation (thousand hours) of casual riders demand along different months:"

sd(cas$t_hour)

## [1] 73.46798

mem <- temp %>% filter(member_casual=='member')
print('Standard deviation (thousand hours) of member riders demand along different months:')

## [1] "Standard deviation (thousand hours) of member riders demand along different months:"

sd(mem$t_hour)

## [1] 31.03158

The casual riders’ monthly demand rose from lowest (9.38 thousand hours) to highest (198.14 thousand hours). The standard deviation seems to be too high for casual riders .

Recommendation

As normal rules, the rate for casual bikers is usually higher than the members. The hours usage by the casual bikers is comparatively so high. Therefore if the casual bikers become members, the revenue received from hourly rate will be decreased. But, the membership fee is unknown to me. Therefore overall financial analysis could not be done.
The demand changes a lot though out the year. Therefore, providing the resources solely could cost more. Because of low utilization factor with having large capacity of supply, outsourcing resources during high demand could be a solution. If the casual riders become members, the supply needs to be ensured always although their usage usually drops at the begin and end of the year a lot.
If casual bikers are to be inspired to be converted into members than all types of bikes needs to be provided (electric, classic, docked). The target station for advertisement could be:
1. Streeter Dr & Grand Ave
2. DuSable Lake ShoreDr & Monroe St
3. Millenium Park &
4. DuSable Lake Shore Dr & Norh Blvd

For more information on this paper, you may contact here.

Appendix

SQL sample code

The code was implemented online, in BigQuery sandbox. The project name was:

capstone-project-01-bike-share and the table names were (YearMonth): 202201, 202202, 202203 ….. 202212. Then merged all the table into one: merged_month_t

CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_t` AS
--Select all the month files
SELECT * FROM `capstone-project-01-bike-share.bike_share.202201`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202202`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202203`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202204`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202205`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202206`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202207`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202208`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202209`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202210`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202211`
UNION ALL
SELECT * FROM `capstone-project-01-bike-share.bike_share.202212`
ORDER BY started_at;

Calculate duration of each ride and assign days according to the date of started_at.

CREATE TABLE `capstone-project-01-bike-share.bike_share.merged_month_2022` AS

SELECT
    *,
    ended_at - started_at AS duration,
    EXTRACT(MONTH FROM started_at) AS month,
    
    CASE
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=1 THEN 'Sunday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=2 THEN 'Monday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=3 THEN 'Tueday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=4 THEN 'Wednesday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=5 THEN 'Thursday'
        WHEN EXTRACT(DAYOFWEEK FROM started_at)=6 THEN 'Friday'
        ELSE 'Saturday'
    END AS day --[1,7] with Sunday as the first day of the week

FROM `capstone-project-01-bike-share.bike_share.merged_month_t`
ORDER BY started_at;

Calculate average, total, minimum, maximum and number of each categorized by rider type (member or casual) as well as month.

CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
      member_casual as Mem_Cas,
      month as Month,

      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month
ORDER BY month, member_casual

Calculate average, total, minimum, maximum and number of each categorized by biker type (member or casual) , month and rideable_type..

--CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_mon_det_2022` AS
SELECT
      member_casual as Mem_Cas,
      rideable_type as Ride_Type,
      month as Month,

      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, month, rideable_type
ORDER BY month, member_casual, rideable_type

Find out member and casual riders usage in different days in a week.

CREATE TABLE `capstone-project-01-bike-share.bike_share.dur_day_2022` AS
SELECT
      member_casual as Mem_Cas,
      day as Day,
      
      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, day 
ORDER BY member_casual, day

Find out preference of ride type (classic/ electric/ docked bike)

CREATE TABLE `capstone-project-01-bike-share.bike_share.ride_type_2022` AS
SELECT
      member_casual as Mem_Cas,
      rideable_type as Ride_Type,
      
      AVG(duration) as Avg_Dur,
      SUM(duration) as Sum_Dur,
      MIN(duration) as Min_Dur,
      MAX(duration) as Max_Dur,
      COUNT(duration) as Count_Dur
      
FROM `capstone-project-01-bike-share.bike_share.merged_month_2022`
GROUP BY member_casual, rideable_type 
ORDER BY member_casual, rideable_type

Excel Analysis

Datalog

At first, January 2022 data was analysed for data cleaning.
It was found that in some cases the start and end station same. Needs to clarify whether it is okay.
The lattitude abd longitude datatype was found to be both text and number. Later, all of them were converted into number.
The number of blank cells for start_station_name is 16260 (5715 alone), end_station_name is 17927 (7382 alone), and for both 10545. As a result, Unique number blank spaces is 23642. Therefore deleting these lines was found to be unwise. To solve the issue, at first the lattitude and longitude of the stations were found. Later, the blank spaces were filled with the name according to the lattitude.
Two new field was introduced to show the duration of ride and denote the day of the week.
The spreadsheet was found to be too slow to deal with this large amount of data. Therefore, for all the data, to fascilitate data cleaning and analyse SQL and R are selected.

Pivot charts

In Jan 2022, the service was mostly used by members.

The casual bikers had he maximum duration of riding. The minimum duration as zero signifies the possibility of data entry error and requirement of cleaning.

In this time period, the service demand was so high on Saturday, Thursday and Wednesday. But, there were five Saturday, Sunday and Monday. Therefore, Thursday and Wednesday can be considered as the busiest day.

Although the number was low but the average usage was higher for casual riders.

Classic bikes were found to be most popular among the riders.

Members didn’t use the docked bikes.

Although the overall contribution was low but those who used docked bike, they utilized it for longer periods.

Session Information

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.2   stringr_1.5.0   dplyr_1.0.10    purrr_1.0.1    
## [5] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0    xfun_0.36           bslib_0.4.2        
##  [4] haven_2.5.1         gargle_1.2.1        colorspace_2.0-3   
##  [7] vctrs_0.5.1         generics_0.1.3      htmltools_0.5.4    
## [10] yaml_2.3.6          utf8_1.2.2          rlang_1.0.6        
## [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
## [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.3.0       
## [19] readxl_1.4.1        modelr_0.1.10       lifecycle_1.0.3    
## [22] munsell_0.5.0       gtable_0.3.1        cellranger_1.1.0   
## [25] rvest_1.0.3         evaluate_0.20       knitr_1.41         
## [28] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.3        
## [31] broom_1.0.2         backports_1.4.1     scales_1.2.1       
## [34] googlesheets4_1.0.1 cachem_1.0.6        jsonlite_1.8.4     
## [37] fs_1.5.2            hms_1.1.2           digest_0.6.31      
## [40] stringi_1.7.12      grid_4.2.2          cli_3.6.0          
## [43] tools_4.2.2         magrittr_2.0.3      sass_0.4.4         
## [46] crayon_1.5.2        pkgconfig_2.0.3     ellipsis_0.3.2     
## [49] xml2_1.3.3          reprex_2.0.2        googledrive_2.0.0  
## [52] lubridate_1.9.0     timechange_0.2.0    assertthat_0.2.1   
## [55] rmarkdown_2.19      httr_1.4.4          rstudioapi_0.14    
## [58] R6_2.5.1            compiler_4.2.2

Cyclistic Bikers 2022

Md Shahriar Forhad

2023-01-20