1. Project Background

This project is a requirement of the Google Data Analytics Certificate, designed explicitly for novices with no prior expertise. The case study seeks to consolidate the analytical skills developed by applying typical procedures and methods through a systematic data analytics workflow and by using the most appropriate tools. The dataset, with key usage variables, is collected by a third party and made publicly available.

Cyclistic is a fictitious company that runs the Divvy bike sharing business. Its marketing department has identified a potential business opportunity to increase revenue and engages the data analytics team to uncover evidence and insights from the available bike sharing usage data, ultimately to support data-driven decision making. At the heart of the project is cleaning, analyzing, summarizing and visualizing the results in the most effective way. Inferential statistics (hypothesis testing) is beyond the scope of this project.

A great deal of time has been spent delving and diving deep to master the grammar of data manipulation, the grammar of graphics and the art of data visualization, only to realize at the end that I have just scratched the surface of R. The challenge ahead on my journey as an aspiring data analyst lies in mastering string manipulation and getting to grips with functional programming before I become an R power user.

Figure: docking station

2. The Cyclistic Business Profile at a Glance

Cyclistic is a bicycle rental business in Chicago with an innovative bike-sharing concept. It delivers great value to its customer base:

  • Micromobility solution with the health benefits of cycling on a regular basis

  • Targeting Chicagoan commuters to work, school and social encounters

  • Visitors looking to unlock Chicago

  • Runs a fleet of over 5,800 bikes

  • A scattered network of 682 stations/kiosks where bikes can be unlocked and returned

Divvy Bikes, Chicago’s Bike Sharing System

Cyclistic essentially offers three bike types with distinct features:

  • Bikes are rented, shared and charged on an hourly basis

  • A bike is unlocked at one station and can be returned at any other station at the customer's convenience

  • There are two customer segments:

  • Casual riders - day pass for a single trip or multiple trips

  • Subscribers - annual membership

3. Business Case Identification

The company is setting out to increase overall business performance, especially the revenue from non-subscribed casual riders. It is believed that this customer segment offers a great deal of potential to add value to the company, and an effort should be made to bring as many of them as possible on board as annual members. However, there is an information gap concerning the critical factors/variables on how the casual group and the subscribed members use the bike sharing service. Here lies the challenge: designing a solid, well-founded strategy to persuade casual riders to adhere to the annual membership package. With that challenge comes the need to gather evidence and meaningful, quality information to back up decision making for impactful results. This project report uses data collected in 2022, analyzes the data to uncover key insights, patterns and trends in bike rental service usage, and communicates the findings in an easy, engaging and persuasive way to the key stakeholders.

4. Assuring Data Quality

The quality of the data largely determines the quality of the findings. The goal of having high quality data is to make empowered, informed and data-driven decisions in order to achieve improved business performance. To restrain costly mistakes and false conclusions, we have to ensure that the data meets data quality standards; as is commonly cited in data analytics and data science resources, "garbage in, garbage out". We assess the quality of our dataset against the measures or dimensions of data quality. Each dimension plays a critical role in ensuring the overall quality of the data.

Figure: data quality dimensions

4.1 Reliability

Our data is reliable and accurate, being collected by a credible source using advanced data collection technology.

4.2 Originality

Our dataset was accessed from the original source, as first-recorded data points; therefore there are no reasonable data quality concerns on this ground.

4.3 Comprehensiveness

We have fair, relatively comprehensive and useful information. With the available data we are able to compare the two groups as far as usage is concerned, but not the preferences and choices that lead them to adhere to and buy one product/service or another; these are by and large determined by other variables, such as socio-economic and demographic factors. So, well-founded and substantiated recommendations would be hard to draw.

4.4 Currency

This is a time dimension (also referred to as timeliness). Our data is sufficiently up to date for its intended use.

4.5 Cited

This criterion is satisfied. In addition, we remain mindful that the dataset sample should meet and clear the sampling requirements. Our records appear to cover the entire population, and after some data cleaning chores we should be left with a dataset that does not compromise representativeness and assures unbiasedness.

5. Sourcing and Preparing The Data

5.1 Preliminary Data Preparation

We primarily relied on the great features offered by MS Power Query to get familiarized with the dataset. For each file we easily carried out the following preliminary transformations:

  • column profile (count, distinct, unique, min, max, empty string)

  • column quality (valid, error, percentage)

  • ride date, bike_type, ride_type, station name

  • splitting the datetime (time stamp) columns into separate date and time columns

  • computing the trip duration from the ended-at and started-at variables
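As a cross-check of that last step, the same trip-duration computation can be reproduced in R once the data is imported; a minimal sketch on two hypothetical timestamps (the real started_at/ended_at columns are loaded in section 7):

library(dplyr)
library(lubridate)

# two hypothetical trips with start and end timestamps
trips <- tibble::tibble(
  started_at = ymd_hm(c("2022-07-19 13:58", "2022-09-21 22:49")),
  ended_at   = ymd_hm(c("2022-07-19 14:20", "2022-09-21 22:51"))
)

# trip duration in minutes, mirroring the Power Query calculation
trips %>%
  mutate(ride_minutes = as.numeric(difftime(ended_at, started_at, units = "mins")))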

5.2 Dataset Conformity with the Tidy Format

Tidy data is a concept from Hadley Wickham's 2014 paper Tidy Data. Tidy datasets are easy to manipulate, model and visualize; they are rectangular data with a specific structure:

  • every row is an observation,

  • every column represents a variable, and

  • every cell of the data frame holds a value.

It is worth quoting: "Happy families are all alike, but every unhappy family is unhappy in its own way." ~ Leo Tolstoy. "Tidy datasets are all alike, but every messy dataset is messy in its own way." ~ Hadley Wickham

Figure: tidy data structure
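As a toy illustration of the structure above (hypothetical counts, not from the Divvy data), a "messy" wide table with months spread across columns can be reshaped into the tidy format:

library(dplyr)
library(tidyr)

messy <- tibble::tibble(
  station = c("Streeter Dr & Grand Ave", "Millennium Park"),
  Jan = c(120, 80),
  Feb = c(150, 95)
)

# one row per observation, one column per variable, one value per cell
tidy <- messy %>%
  pivot_longer(cols = c(Jan, Feb), names_to = "month", values_to = "rides")
tidy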

6. Data Analysis Workflow

Google offers a well-defined data workflow which acts as a set of guidelines for planning, organizing, and implementing data analytics projects. The Google six steps are:

  1. Ask

  2. Prepare

  3. Process

  4. Analyze

  5. Share

  6. Act

This workflow is self-explanatory. Notably, the process as outlined is linear; however, in many cases the process is iterative, requiring multiple stages to be repeated and revisited. Our project is guided by the workflow below.

Figure: typical data analytics workflow

7. Getting into R Programming

R programming is the data analysis tool of our choice and one of the world's most popular languages for statistical analysis. R is an open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. We import and load the data into RStudio, where we set up an R project.

7.1 Loading The Packages

Below we load the collection of packages that will be used throughout the entire data analysis workflow.

knitr::opts_chunk$set( warning=FALSE, message=FALSE)
library(knitr)
library(dplyr)
library(readr) 
library(purrr)
library(magrittr)
library(gt)
library(scales)
library(ggthemes)
library(ggplot2)
library(tidyr)
library(forcats)
library(lubridate)
library(pandoc)
library(skimr)
library(summarytools)
library(RColorBrewer)
library(viridis)
library(flextable)
library(janitor)
library(hablar)
library(styler)
library(patchwork)
library(cowplot)

7.2 Setting the Theme for the Plots

theme_set(theme_clean() +
  theme(
    plot.subtitle = element_text(
      hjust = 0.5,
      size = 14,
      color = "skyblue",
      face = "bold",
      family = "Tahoma"
    ),
    plot.caption = element_text(
      hjust = 1,
      size = 14, color = "grey",
      face = "italic"
    ),
    plot.title = element_text(
      hjust = 0.5,
      size = 16,
      color = "skyblue",
      face = "bold",
      family = "Tahoma"
    ),
    plot.tag = element_text(
      size = 14,
      color = "grey",
      face = "bold"
    ),
    axis.title = element_text(
      color = "steelblue",
      face = "bold",
      size = 15
    ),
    axis.line = element_line(linewidth = 1.5, color = "darkgrey"),
    axis.text = element_text(face = "bold", 
                             color = "#993333", 
                             size = 12),
    legend.title = element_blank(),
    legend.position = "top"
  ))

7.3 Loading The Data

The dataset comes in 12 separate files, one per month, containing individual Divvy bike sharing trips, including the origin and destination dock stations and time stamps (start and end) for each trip. Source: https://divvy-tripdata.s3.amazonaws.com/index.html .

We import all the files at once programmatically instead of one at a time. There is a claim that if a task is to be repeated, then it is exactly the kind of task the computer is designed to perform, and it does so efficiently. There are many ways to skin a cat, and likewise we find many ways to import multiple files into R. The fastest method is often said to be fread() from the data.table package. As a matter of taste, we will go with read_csv() from the tidyverse ecosystem, as it provides an argument to specify the column types. This should free us from the tedious work of converting the date variables from character to the POSIXct and POSIXlt date types.

We list all the files with the list.files() function and combine them into a single data frame with map_df() from the purrr package.

library(tidyverse)
 my_dir <- "C:/Users/USER/Documents/DataAnalytics/Google_Capstone_Project/data/Data_RStudio"
 all_files <- list.files(path = my_dir, pattern = "*.csv", full.names = TRUE) 
bs_01 <- all_files %>% map_df(~ read_csv(.,
  col_names = TRUE,
  col_types = cols(
    started_at = col_datetime(format = "%m/%d/%Y %H:%M"),
    ended_at = col_datetime(format = "%m/%d/%Y %H:%M"),
    started_ride_date = col_date(format = "%m/%d/%Y"),
    ended_ride_date = col_date(format = "%m/%d/%Y")
  )
))

read_csv() would have guessed the variable classes and assigned the character type to the date variables, so we have specified the date-class variables using the col_types argument.

Moving on, our bike share dataset has been loaded as an R object bs_01 with dimensions of 5,667,717 rows and 19 columns, i.e. (nr rows, nr columns) = (5667717, 19).

7.3.1 Checking and Validating the Data Types

Data types matter a great deal. The operations we can perform on a column depend very much on its "type" (numeric, character/string). Let's revisit the common types of operators:

  • arithmetic operators - perform calculations on values

  • relational operators - enable comparisons to be made; they are used in condition testing

  • logical operators - combine relational expressions to build more complex decisions
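A quick illustration of the three operator families on a single ride duration value:

ride_minutes <- 12.5

ride_minutes * 60                      # arithmetic: convert minutes to seconds
ride_minutes >= 5                      # relational: comparison used in condition testing
ride_minutes >= 5 & ride_minutes < 60  # logical: combining two comparisons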

We can get a glimpse of the columns' data types with summary.default():

bs1 <- bs_01
summary.default(bs1)
##                    Length  Class   Mode     
## ride_id            5667717 -none-  character
## rideable_type      5667717 -none-  character
## started_at         5667717 POSIXct numeric  
## ended_at           5667717 POSIXct numeric  
## start_station_name 5667717 -none-  character
## start_station_id   5667717 -none-  character
## end_station_name   5667717 -none-  character
## end_station_id     5667717 -none-  character
## start_lat          5667717 -none-  numeric  
## start_lng          5667717 -none-  numeric  
## end_lat            5667717 -none-  numeric  
## end_lng            5667717 -none-  numeric  
## member_casual      5667717 -none-  character
## started_ride_date  5667717 Date    numeric  
## started_ride_time  5667717 hms     numeric  
## ended_ride_date    5667717 Date    numeric  
## ended_ride_time    5667717 hms     numeric  
## ride_length        5667717 -none-  numeric  
## ride_minutes       5667717 -none-  numeric

8. Data Exploration

Exploratory data analysis (EDA) is a process widely used to analyze and investigate datasets and summarize their main characteristics, often employing data visualization methods. Originally developed by American mathematician John Tukey in the 1970s, it is quite helpful in:

  • identifying obvious errors,

  • better understanding patterns within the data,

  • detecting outliers or anomalous events,

  • finding interesting relations among the variables.

Before diving deep into our journey of understanding the data, in this first stage we explore the data mainly column-wise, fix a few issues and get the data ready for cleaning in the following stage.

8.1 Quick Sense of the Dataset at a Glance

With View(bs1) we get a display of our dataset in tabular format. We can visually scan, filter and get familiarized with the dataset.

In addition, with head(bs1) and tail(bs1) we get access to the top rows (head) and last rows (tail), in this way confirming that our dataset has been imported completely.
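For instance, a quick look at the first and last three rows:

head(bs1, 3)  # first rows
tail(bs1, 3)  # last rows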

Additionally, we can draw a random sample of n observations from the dataset.

# random sample, n = 10
  bs1 %>% 
  sample_n(10) %>% 
  gt(caption =  "Sample of the Dataset with 10 Random Observations") 
Sample of the Dataset with 10 Random Observations
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual started_ride_date started_ride_time ended_ride_date ended_ride_time ride_length ride_minutes
18CB2761A890BCA3 classic_bike 2022-07-19 13:58:00 2022-07-19 14:20:00 Sheridan Rd & Loyola Ave RP-009 Kedzie Ave & Bryn Mawr Ave KA1504000167 42.00104 -87.66120 41.98223 -87.70889 member 2022-07-19 13:58:45 2022-07-19 14:20:36 0.015173611 21.85
A07A17579FD564B9 electric_bike 2022-09-21 22:49:00 2022-09-21 22:51:00 Broadway & Belmont Ave 13277 Wilton Ave & Belmont Ave TA1307000134 41.94010 -87.64545 41.94023 -87.65294 member 2022-09-21 22:49:50 2022-09-21 22:51:46 0.001342593 1.93
466635A9028203ED classic_bike 2022-10-22 12:11:00 2022-10-22 12:50:00 Southport Ave & Clark St TA1308000047 Clark St & Berwyn Ave KA1504000146 41.95708 -87.66420 41.97803 -87.66856 casual 2022-10-22 12:11:18 2022-10-22 12:50:09 0.026979167 38.85
639C0D36743DAE63 electric_bike 2022-06-05 09:37:00 2022-06-05 10:01:00 NA NA NA NA 41.89000 -87.72000 41.94000 -87.72000 member 2022-06-05 09:37:19 2022-06-05 10:01:09 0.016550926 23.83
293715AE4CF08CA9 electric_bike 2022-06-28 03:43:00 2022-06-28 03:52:00 Logan Blvd & Elston Ave TA1308000031 NA NA 41.92940 -87.68424 41.93000 -87.72000 casual 2022-06-28 03:43:12 2022-06-28 03:52:26 0.006412037 9.23
DD0A3551EBD04D88 classic_bike 2022-05-21 17:27:00 2022-05-21 17:33:00 Desplaines St & Kinzie St TA1306000003 Orleans St & Hubbard St 636 41.88872 -87.64445 41.89003 -87.63662 casual 2022-05-21 17:27:40 2022-05-21 17:33:12 0.003842593 5.53
D9BE3DE710D66304 classic_bike 2022-05-22 15:54:00 2022-05-22 16:09:00 Michigan Ave & Washington St 13001 Columbus Dr & Randolph St 13263 41.88398 -87.62468 41.88473 -87.61952 casual 2022-05-22 15:54:19 2022-05-22 16:09:38 0.010636574 15.32
9C4BEBB694C180AF classic_bike 2022-06-04 15:49:00 2022-06-04 16:02:00 Jefferson St & Monroe St WL-011 Clinton St & Roosevelt Rd WL-008 41.88033 -87.64275 41.86712 -87.64109 casual 2022-06-04 15:49:07 2022-06-04 16:02:04 0.008993056 12.95
D1588A71286D96F0 electric_bike 2022-06-03 11:20:00 2022-06-03 11:32:00 St. Clair St & Erie St 13016 Columbus Dr & Randolph St 13263 41.89441 -87.62270 41.88473 -87.61952 member 2022-06-03 11:20:25 2022-06-03 11:32:51 0.008634259 12.43
1937B9007C59F991 classic_bike 2022-01-14 11:30:00 2022-01-14 11:50:00 Streeter Dr & Grand Ave 13022 Rush St & Cedar St KA1504000133 41.89228 -87.61204 41.90231 -87.62769 casual 2022-01-14 11:30:00 2022-01-14 11:50:00 0.013888889 20.00

This looks quite pretty!

8.3 Converting Character Type to Factor Data Type Objects

This is especially useful for both ordinal and nominal variables with limited categories, since it makes it easy to hunt for and spot entry-related errors and eases data verification and validation. The variables bike type and customer type are suitable candidates.

  • Converting with convert() from hablar and cross-checking with levels()
library(hablar)

bs2 <- bs1 %>% 
  convert(fct(rideable_type, member_casual))

levels(bs2$rideable_type)
## [1] "classic_bike"  "docked_bike"   "electric_bike"
levels(bs2$member_casual)      
## [1] "casual" "member"

8.4 Parsing Date Time

We have already touched on this step of converting the date columns to the Date class. The Date type object class is tricky: stored in program memory (under the hood) as numeric, it is often imported as character. Working with dates and times in base R can be daunting, a bit cumbersome and clumsy. lubridate adopts a more intuitive approach to parsing dates and times; some useful functions to coerce date characters to the Date class include ymd(), dmy() and ydm(). We found it handy to do this with the Power Query tool before importing the dataset.
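For reference, a few lubridate parsers applied to hypothetical strings:

library(lubridate)

ymd("2022-07-19")           # year-month-day string to Date
dmy("19/07/2022")           # day-month-year string to Date
mdy_hm("07/19/2022 13:58")  # month-day-year with time, to POSIXct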

8.5 Extracting Date Components From Datetime Variable

bs2 <- bs2 %>%
  mutate(
    ride_year = lubridate::year(started_ride_date),
    ride_month = lubridate::month(started_ride_date, label = TRUE, abbr = TRUE), # month in year
    ride_weekday = lubridate::wday(started_ride_date, label = TRUE, abbr = TRUE) # day of week (name)
  ) 
# cross checking
levels(bs2$ride_month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
levels(bs2$ride_weekday)
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

8.6 Renaming and Reordering the Variables

For better syntax and good readability we replace the old column names with new ones using the rename() function. We focus on the variables of interest that we will be working with on our journey of revealing interesting variations, patterns and trends.

  • Syntax: my_dataframe %>% rename(new_column_name = old_column_name)
bs3 <- bs2 %>%
  dplyr::rename(
    "departure_station" = "start_station_name",
    "arrival_station" = "end_station_name",
    "departure_time" = "started_ride_time",
    "arrival_time" = "ended_ride_time",
    "ride_bike" = "rideable_type",
    "ride_segment" = "member_casual",
    "ride_date" = "started_ride_date",
    "ride_time" = "ride_minutes"
  ) %>% 
  select(ride_id, ride_bike, ride_segment, ride_time, ride_date,departure_station, arrival_station, everything()) 
# cross checking
bs3 %>% head( 2) %>% 
  gt(caption = "Cross checking new variable names")
Cross checking new variable names
ride_id ride_bike ride_segment ride_time ride_date departure_station arrival_station started_at ended_at start_station_id end_station_id start_lat start_lng end_lat end_lng departure_time ended_ride_date arrival_time ride_length ride_year ride_month ride_weekday
550CF7EFEAE0C618 electric_bike casual 7.52 2022-08-07 NA NA 2022-08-07 21:34:00 2022-08-07 21:41:00 NA NA 41.93 -87.69 41.94 -87.72 21:34:15 2022-08-07 21:41:46 0.005219907 2022 Aug Sun
DAD198F405F9C5F5 electric_bike casual 14.03 2022-08-08 NA NA 2022-08-08 14:39:00 2022-08-08 14:53:00 NA NA 41.89 -87.64 41.92 -87.64 14:39:21 2022-08-08 14:53:23 0.009745370 2022 Aug Mon

Voilà, this is awesome!

9. Data Cleaning

This is one of the most crucial and intricate stages of the data analytics workflow. Once we have aligned the data object types and the desired variable names, we move on to exploring and addressing data entries (observations) at row level, in order to get the dataset into the correct and desirable format for efficient processing. This consists essentially of identifying, correcting or removing inaccurate, mislabeled, incorrectly formatted or incomplete raw data for downstream analysis. In addition, we deal with the typical data issues:

  • duplicated records

  • recording errors

  • missing values

  • unusual values, better known as outliers

What data cleaning typically means is partly illustrated below. The bottom line is to get the data formatted in a way the computer reads, understands and computes, so that we get consistent outcomes.

Figure: illustration of data cleaning

9.1 Overview of the Dataset

It is widely claimed that plotting the data, especially the numeric continuous variables, should be the very first step when getting a new dataset. It potentially offers a better picture of the data distribution and variation, and a quick catch of potential outliers. In our case it does not give a pretty picture, due to little variation and the sheer size of the dataset. But before visual exploration, let's get a glance or glimpse at our dataset with a summary table. There is a bunch of helpful summary-table functions and packages, from base R's summary() and psych's describe() to skimr's skim(), and packages such as dlookr, mice, mosaic, summarytools, naniar and janitor, just to mention a few. The choice is a matter of taste, but also of the level of detail we can get from each.

9.2 Describing data with Summary Table

This is an analytical way to get an overview of the entire data frame. We can use the function dfSummary() from the package summarytools. The columns are summarized by class/type, such as character, factor and numeric. We select 6 variables of interest:

  • ride_segment

  • ride_bike

  • ride_time

  • departure_station

  • arrival_station

  • ride_date.

We will also include the identifier ride_id, just to control for dupes (duplicated entries).

Using dfSummary() to summarize and describe the dataset structure with the selected variables:
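A sketch of the call behind the summary below (a hedged reconstruction: the rendered table reflects the 10-observation random sample drawn earlier, stored here under the illustrative name diagnosis):

library(summarytools)

# selected variables from a 10-row random sample, as in section 8.1
diagnosis <- bs3 %>%
  sample_n(10) %>%
  select(ride_id, ride_segment, ride_bike, ride_time,
         departure_station, arrival_station, ride_date)

print(dfSummary(diagnosis, graph.magnif = 0.75), method = "render")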

Data Frame Summary

diagnosis

Dimensions: 10 x 7
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 ride_id [character]
1. 18CB2761A890BCA3
2. 1937B9007C59F991
3. 293715AE4CF08CA9
4. 466635A9028203ED
5. 639C0D36743DAE63
6. 9C4BEBB694C180AF
7. A07A17579FD564B9
8. D1588A71286D96F0
9. D9BE3DE710D66304
10. DD0A3551EBD04D88
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
10 (100.0%) 0 (0.0%)
2 ride_segment [factor]
1. casual
2. member
6(60.0%)
4(40.0%)
10 (100.0%) 0 (0.0%)
3 ride_bike [factor]
1. classic_bike
2. electric_bike
6(60.0%)
4(40.0%)
10 (100.0%) 0 (0.0%)
4 ride_time [numeric]
Mean (sd) : 16.2 (10.6)
min ≤ med ≤ max:
1.9 ≤ 14.1 ≤ 38.9
IQR (CV) : 11.4 (0.7)
1.93:1(10.0%)
5.53:1(10.0%)
9.23:1(10.0%)
12.43:1(10.0%)
12.95:1(10.0%)
15.32:1(10.0%)
20.00:1(10.0%)
21.85:1(10.0%)
23.83:1(10.0%)
38.85:1(10.0%)
10 (100.0%) 0 (0.0%)
5 departure_station [character]
1. Broadway & Belmont Ave
2. Desplaines St & Kinzie St
3. Jefferson St & Monroe St
4. Logan Blvd & Elston Ave
5. Michigan Ave & Washington
6. Sheridan Rd & Loyola Ave
7. Southport Ave & Clark St
8. St. Clair St & Erie St
9. Streeter Dr & Grand Ave
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
9 (90.0%) 1 (10.0%)
6 arrival_station [character]
1. Clark St & Berwyn Ave
2. Clinton St & Roosevelt Rd
3. Columbus Dr & Randolph St
4. Kedzie Ave & Bryn Mawr Av
5. Orleans St & Hubbard St
6. Rush St & Cedar St
7. Wilton Ave & Belmont Ave
1(12.5%)
1(12.5%)
2(25.0%)
1(12.5%)
1(12.5%)
1(12.5%)
1(12.5%)
8 (80.0%) 2 (20.0%)
7 ride_date [Date]
1. 2022-01-14
2. 2022-05-21
3. 2022-05-22
4. 2022-06-03
5. 2022-06-04
6. 2022-06-05
7. 2022-06-28
8. 2022-07-19
9. 2022-09-21
10. 2022-10-22
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
10 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17

From the output, we get an account of the following data issues:

  1. duplicates

  2. distinct values for the numerical variables

  3. missing values

  4. distribution (sketch for bar charts and histogram)

  5. frequency (categorical variables)

  6. summary statistics (five numbers) for the numeric continuous variables

This looks pretty; we only miss an explicit account of outliers, but the range offers some valuable hints.

In a nutshell, from the summary output we spotted issues with three variables:

  1. ride_time with implausible extreme values

  2. departure station with missing values

  3. arrival station with missing values

9.4 Dealing with the Messy Data

9.4.1 Duplicated records

Duplicate observations should be identified and removed. Based on the summary table, there are no duplicated records: the total number of unique observations/cases matches the total number of observations in the dataset, as the quick check below confirms.

  • Total unique records 5667717

  • Total number of observations 5667717
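A minimal sketch of that check, run on the identifier column while it is still present in the dataset:

# equal counts of distinct ride identifiers and rows mean no duplicated records
dplyr::n_distinct(bs3$ride_id)
nrow(bs3)

# alternatively, list any duplicated identifiers
# janitor::get_dupes(bs3, ride_id)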

9.4.2 Flawed Data Entry on Categorical Variables

We use frequency tables to examine or diagnose possible inconsistencies that can result from incorrect entries or typos. Our explicitly categorical variables are trouble-free, as shown in the summary table above.

9.4.2.1 Docking Stations

This is not worth setting as a categorical variable given that, as a nominal variable, each value is treated as a character/string and cannot be meaningfully grouped into levels. We will examine the frequency (how many times the same value occurs) and seek to catch some pattern. How many unique bike departure stations do we have, and how many are expected? Can we match them? All in all, the network consists of 692 docking stations. Is there a meaningful way to group the stations?

  • Distinct dock stations
total_dock_stations <- bs3 %>%
  summarize(distinct_points = n_distinct(departure_station)) # single count of distinct departure stations
dock_stations <- bs3 %>% select(departure_station) %>% 
  drop_na(departure_station)%>% 
  table()  

There are 1675 distinct departure dock stations. It seems that we have limited options to scan for and detect irregularities in character variables with no levels.

We could look at the frequencies and consider the possibility of excluding data entries of lower frequency, such as below 1%.

Data entries with lower frequencies are suspicious, and on business grounds they add no value; these observations are possibly caused by data recording errors. We might also have data with mixed cases (lower and upper), in which case R treats them as different values. We could explore the text data using functions like filter(), str_detect() and count() to screen for and catch data entries based on a partial match or common pattern. This is beyond the scope of this project and we will not delve into it, though a flavour of such a check is sketched below.
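Only to give a flavour of such a screening (the 1% threshold and the "lake" pattern are illustrative choices, not part of the analysis):

library(dplyr)
library(stringr)

# departure stations appearing in less than 1% of the trips
rare_stations <- bs3 %>%
  count(departure_station, sort = TRUE) %>%
  mutate(share = n / sum(n)) %>%
  filter(share < 0.01)

# case-insensitive partial match on station names
bs3 %>%
  filter(str_detect(departure_station, regex("lake", ignore_case = TRUE))) %>%
  count(departure_station, sort = TRUE)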

9.4.3 Dealing with Missing Values

Missing values have to be identified and examined, and a decision made on whether they are kept, omitted/removed or replaced, and, if replaced, which method to use.

9.4.3.1 Where are the missing values?

From our summary table, we have missing values in two variables. Cross-checking with the is.na() and sum() functions, we get 833064 missing values for departure_station and 892742 for arrival_station.
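The cross-check itself is a one-liner per column:

sum(is.na(bs3$departure_station))  # missing departure stations
sum(is.na(bs3$arrival_station))    # missing arrival stations
colSums(is.na(bs3))                # missing values across all columns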

9.4.3.2 What to do with the missing values

From the outset it is important to understand how the missing values are coded and how they differ from other coded values like "none", empty strings or NaN. In R, missing values are represented by the symbol NA (not available). On the other hand we have NULL, which represents a value that simply does not exist, rather than one that exists but is unknown. In dealing with missing values, Greg Martin suggests that they be examined to understand the pattern of their occurrence (the missingness). Are missing values occurring systematically or randomly? Are they associated with other variables? In our case, we might expect the missing values to be associated with long trip durations, suggesting that the bikes went missing (stolen); in that event, the ride end time should also be missing (unreported). But this does not hold at all, as we have complete records for the start and end time variables.

All we find is that missing values in the start station are associated with missing values in the start station id.

library(dplyr)
 miss_values <- bs3 %>%
   select(departure_station, arrival_station, start_station_id, end_station_id, start_lat, end_lat, start_lng, end_lng) %>%
   dplyr::filter(is.na(departure_station) | is.na(arrival_station)) %>%
   count()
 miss_values
## # A tibble: 1 × 1
##         n
##     <int>
## 1 1298357

We could drop or remove the missing values, but we decide to keep them to avoid losing data points from variables of interest (customer group, date, ride time) associated with those rows where they occur.

9.4.4 Dealing with Implausible Out-of-Bounds Values

Observations lying far away from the others are unusual and often considered outliers. Dealing with outliers is a highly contentious issue among statisticians, data scientists and data analysts. It is worth highlighting why outliers matter a great deal and attract such hot debate: insofar as the methods and techniques used in descriptive and inferential statistics hinge on distributional assumptions, outliers may negatively impact results and lead to misleading or flawed findings. We don't take outliers lightly. As a matter of fact, in a business environment the so-called outliers could be a niche or market segment with specific needs to be catered for, deserving special attention; it could be a case of the well-known 80:20 principle. We will explore the distribution and pattern of the numerical continuous variables through scatter plots, box plots and histograms, and match them against Cyclistic's core business: renting bikes on a shared basis for micromobility.

9.4.4.1 Values out of the expected range
 bs3 %>%
  select(ride_segment, ride_time) %>%
  group_by(ride_segment) %>%
  dplyr::summarize(
    min = min(ride_time),
    max = max(ride_time)
  ) %>%
  ungroup() 
## # A tibble: 2 × 3
##   ride_segment     min    max
##   <fct>          <dbl>  <dbl>
## 1 casual         -137. 41387.
## 2 member       -10353.  1560.

The maximum values are way beyond a typical trip duration in a micromobility service; this is supported by the literature. Cyclistic rents bikes for up to 45 minutes for both casual riders and subscribed members. The negative values are indisputably implausible.

Ideally a bike trip duration should be:

  • greater than or equal to (>=) 5 minutes

  • less than 60 minutes for commuting purposes

  • between 60 and 180 minutes for leisure and sport

Notably, the bikes, just like any commercial product, are designed and manufactured with intrinsic features to suit their end-use purpose; they are customized and market-oriented.
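Those bands translate directly into a categorical variable; a minimal sketch (purely illustrative, the classification is not used in the cleaning steps that follow):

bs3 %>%
  mutate(trip_purpose = case_when(
    ride_time <= 0   ~ "implausible (non-positive)",
    ride_time < 5    ~ "suspiciously short",
    ride_time < 60   ~ "commuting",
    ride_time <= 180 ~ "leisure / sport",
    TRUE             ~ "implausibly long"
  )) %>%
  count(trip_purpose)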

bs4 <- bs3 %>% filter(ride_time > 0) # dropping negative trip duration records
summary(bs4$ride_time) # verification
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.02     5.83    10.28    19.45    18.47 41387.25
9.4.4.2 Visual Exploration - Extremely low values

Trips that last less than 60 seconds are suspicious and worth some attention. In fact, they typically occur where a bike is checked out and returned to the same station, often because the bike has some problem and the trip cannot be started.

 bs4 %>%
  dplyr::select(ride_segment, ride_time) %>%
  filter(ride_time < 5) %>%
  ggplot(aes(x = ride_time, col = ride_segment)) +
  geom_freqpoly(size = .75, binwidth = 0.02) +
  ggtitle("Visual Exploration of Data")

Note: we spot the expected relationship, namely the longer the trip, the fewer the returns to the same station. We set a threshold of 1.0 minute as the minimum ride length accepted to travel from one station to another.

bs5 <- bs4 %>% 
  filter(ride_time > 1.0)
summary(bs5$ride_time)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.07    10.52    19.87    18.75 41387.25
9.4.4.3 Visual Exploration of Extremely high values with scatter plot

Scatter plots help to visualize and easily spot the pattern and trend of the data points. geom_jitter() offers greater visibility (less overplotting) by adding random noise to both the width and height of each point.

expl_scatter <- bs5 %>%
  select(ride_segment, ride_date, ride_time) %>%
  ggplot(aes(x = ride_date, y = ride_time, col = ride_segment)) +
  geom_jitter(size = 1.5, shape = 21, width = 0.15, height = 0.2, alpha = .5) +
  facet_wrap(~ride_segment, nrow = 2, scales = "free") +
   labs(title = "exploratory scatter plot",
        subtitle =" scattered potential outlier data points",
        caption = "Source: divvy-tripdata")

expl_scatter + 
  scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))

What emerges from the scatter plot?

  • clear points falling away from the most common values

  • casual riders with ride times above 1000 minutes

  • annual members with ride times above 500 minutes

9.4.4.4 Visual Exploration of Extremely high values with histogram

A histogram is a popular visual tool in which a numeric variable is typically mapped to the x-axis. The x-axis is divided into equally sized sections, which we call "bins" or "buckets", ranging from the lowest to the highest value of the variable; we then count the number of observations in each bin and plot a bar for each bin. The height of the bar corresponds to the count of observations in that bin.

freqpol <- bs5 %>%
   select(ride_segment, ride_time) %>%
   ggplot(aes(x = ride_time, fill = ride_segment, color = ride_segment)) +
   geom_freqpoly(size = 0.75) +
   labs(
     title = "Exploring the Data Distribution  log transform",
     subtitle = "Decteting outboundaries data points",
     caption = "Source: divvy-tripdata"
   ) +
  theme(panel.grid.major.y =   element_blank(),
        legend.title = element_blank()) +
   scale_x_log10(
     n.breaks = 8,
     labels = number_format(scale = 1e-3, suffix = " k")
   )
freqpol <- freqpol+ scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K", big.mark = ","))
library(plotly)
ggplotly(freqpol)
9.4.4.5 Removing potential outliers

Ideally outliers are set to missing values instead of being dropped. Our dataset is sufficiently large to go the other way around.

Based on our distribution, between the 25th and 75th percentiles the riding time is less than 30 minutes. However, there is a significant portion of riders beyond that interval, and we may want to consider them in our analysis for business decision making. We take advantage of the plotly interactivity and pick a ride time value as the outlier threshold. We find the value of 2.3835861 on the logarithmic scale (about 242 minutes) as the upper limit, beyond which values are typically out of bounds and of limited frequency.

density_plot <- bs5 %>%
  select(ride_segment, ride_time) %>%
  ggplot(aes(x = ride_time, color = ride_segment)) +
  geom_histogram(aes(y = ..density..), fill = "white", alpha = 0.2, position = "dodge") +
  geom_density(lwd = 1.2, linetype = 2) +
  theme(panel.grid.major.y =   element_blank(),
        legend.title = element_blank()) +
  
  scale_x_log10(limits = c(10^0, 10^2.3835861)) +
  labs(
    title = "Distribution of riders with log transform",
    subtitle = "Limiting the trip duration to 241.8723 minutes",
    x = "Trip duration (minutes)",
    caption = "Source: divvy-tripdata"
  )

library(plotly)
ggplotly(density_plot)
bs6 <- bs5 %>% 
  filter(ride_time <= 10^2.3835861)
summary(bs6$ride_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.02    6.05   10.50   15.86   18.65  241.82

9.4.5 Examining the ride date variable

summary(bs6$started_at)
summary(bs6$ended_ride_date)
bs7 <- bs6 %>% filter(ended_ride_date <= as.Date("2022-12-31"))
summary(bs7$ended_ride_date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2022-01-01" "2022-05-28" "2022-07-22" "2022-07-19" "2022-09-16" "2022-12-31"

9.5 Final verification of the dataset

9.5.1 Keeping the variables of interest

bs8 <- bs7 %>% select(-c(ride_id, started_at, ended_at,start_station_id,
end_station_id, start_lat, start_lng,  end_lat, end_lng, ride_year, ride_length )) 
names(bs8)
##  [1] "ride_bike"         "ride_segment"      "ride_time"        
##  [4] "ride_date"         "departure_station" "arrival_station"  
##  [7] "departure_time"    "ended_ride_date"   "arrival_time"     
## [10] "ride_month"        "ride_weekday"

Deal! Pretty cool

9.5.2 Verification and validation

verify <- bs8 %>% select(ride_segment, ride_bike, ride_time, departure_station,
                         arrival_station, ride_date)
print(dfSummary(verify, graph.magnif = 0.75), 
method = 'render')

Data Frame Summary

verify

Dimensions: 10 x 6
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 ride_segment [factor]
1. casual
2. member
6(60.0%)
4(40.0%)
10 (100.0%) 0 (0.0%)
2 ride_bike [factor]
1. classic_bike
2. electric_bike
6(60.0%)
4(40.0%)
10 (100.0%) 0 (0.0%)
3 ride_time [numeric]
Mean (sd) : 16.2 (10.6)
min ≤ med ≤ max:
1.9 ≤ 14.1 ≤ 38.9
IQR (CV) : 11.4 (0.7)
1.93:1(10.0%)
5.53:1(10.0%)
9.23:1(10.0%)
12.43:1(10.0%)
12.95:1(10.0%)
15.32:1(10.0%)
20.00:1(10.0%)
21.85:1(10.0%)
23.83:1(10.0%)
38.85:1(10.0%)
10 (100.0%) 0 (0.0%)
4 departure_station [character]
1. Broadway & Belmont Ave
2. Desplaines St & Kinzie St
3. Jefferson St & Monroe St
4. Logan Blvd & Elston Ave
5. Michigan Ave & Washington
6. Sheridan Rd & Loyola Ave
7. Southport Ave & Clark St
8. St. Clair St & Erie St
9. Streeter Dr & Grand Ave
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
1(11.1%)
9 (90.0%) 1 (10.0%)
5 arrival_station [character]
1. Clark St & Berwyn Ave
2. Clinton St & Roosevelt Rd
3. Columbus Dr & Randolph St
4. Kedzie Ave & Bryn Mawr Av
5. Orleans St & Hubbard St
6. Rush St & Cedar St
7. Wilton Ave & Belmont Ave
1(12.5%)
1(12.5%)
2(25.0%)
1(12.5%)
1(12.5%)
1(12.5%)
1(12.5%)
8 (80.0%) 2 (20.0%)
6 ride_date [Date]
1. 2022-01-14
2. 2022-05-21
3. 2022-05-22
4. 2022-06-03
5. 2022-06-04
6. 2022-06-05
7. 2022-06-28
8. 2022-07-19
9. 2022-09-21
10. 2022-10-22
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
1(10.0%)
10 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17

9.5.3 Dimensions of the Final Dataset

Our cleaned dataset consists of 5,531,187 rows and 11 columns.

10. Data Wrangling

In this stage we perform some relevant data transformations ahead of the data analysis stage. Very often the transformations performed include grouping a continuous variable into categories with bins or buckets.

In our case, we introduce additional variables based on existing ones. This will allow us to drill down and uncover some valuable insights in our dataset.
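As an aside, binning a continuous variable such as ride_time into buckets is straightforward with cut(); a minimal sketch with illustrative break points (not used elsewhere in the analysis):

bs8 %>%
  mutate(ride_length_band = cut(
    ride_time,
    breaks = c(0, 5, 15, 30, 60, Inf),
    labels = c("<5 min", "5-15 min", "15-30 min", "30-60 min", "60+ min")
  )) %>%
  count(ride_length_band)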

10.1 Grouping days into categories

weekends <- c("Sun", "Sat")
 bs9 <- bs8 %>%
   mutate(ride_days = as.factor(if_else(
     ride_weekday %in% weekends, "weekend", "businday"
   )), .after = ride_weekday) 
 
levels(bs9$ride_weekday) # verification
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
bs10 <-  bs9 %>%
   mutate(ride_days = as.factor(if_else(
     ride_weekday %in% weekends, "weekend", "busnday"
   )), .after = ride_weekday)
levels(bs10$ride_days)
## [1] "busnday" "weekend"

10.2 Adding Seasons Variable

  • In Chicago, the summers are warm, humid, and wet;
  • the winters are freezing, snowy, and windy; and it is partly cloudy year round.
  • January tends to be the coldest month, but also the snowiest month of the year.
Spr <- c("Mar", "Apr", "May")
  Sum <- c("Jun", "Jul", "Aug")
  Aut <- c("Sep", "Oct", "Nov")
  winter <- c("Dec", "Jan", "Feb" )
bs10 <- bs10 %>%
  mutate(us_seasons = as.factor(case_when(
    ride_month %in% Spr ~ "spring",
    ride_month %in% Sum ~ "summer",
    ride_month %in% Aut ~ "fall",
    TRUE ~ "winter"
  )), .after = ride_month)
levels(bs10$us_seasons)
## [1] "fall"   "spring" "summer" "winter"
dim(bs10)
## [1] 5531187      13

11 Data Summarizing

In this stage we summarize our dataset based on the variable type (numeric continuous or qualitative).

For the numerical continuous variables we summarize the data with descriptive statistics. For the categorical variables we summarize using frequency tables or cross tabulations.

11.1 Numerical variables

11.1.1 Key descriptive indicators

library(dplyr)
bs10 %>%
  # dplyr::select(ride_segment, ride_time) %>%
  dplyr::group_by(ride_segment, ride_days) %>%
  dplyr::summarise(
    Total_rides_mints = scales::comma(n()),
    rides_hours = n() / 60,
    Min_time = min(ride_time),
    Quantile_1 = quantile(ride_time, probs = 1 / 4),
    Average_time = format(mean(ride_time), digits = 2, nsmall = 2),
    Median = median(ride_time),
    Quantile_3 = quantile(ride_time, probs = 3 / 4),
    Max_time = max(ride_time)
  ) %>%
  mutate(Share = scales::percent(rides_hours / sum(rides_hours))) %>%
  gt(caption = " Tab.1 Numerical Variable Summary groupedby segments and ride days")
Tab.1 Numerical Variable Summary groupedby segments and ride days
ride_days Total_rides_mints rides_hours Min_time Quantile_1 Average_time Median Quantile_3 Max_time Share
casual
busnday 1,423,547 23725.78 1.02 7.15 19.36 12.20 22.10 241.82 63%
weekend 839,496 13991.60 1.02 8.63 23.62 15.25 27.92 241.78 37%
member
busnday 2,458,213 40970.22 1.02 5.20 11.90 8.77 14.90 241.82 75%
weekend 809,931 13498.85 1.02 5.72 13.69 9.93 17.32 241.72 25%

11.2 Summarizing Categorical Variables

  • Tabulation ride_segment and bike type
#library(flextable)
 bs10 %>%
  select(ride_segment, ride_bike) %>%
  janitor::tabyl(ride_segment, ride_bike) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>%
  gt(caption = " Tab.2 Categorical Variable Summary")
Tab.2 Categorical Variable Summary
ride_segment classic_bike docked_bike electric_bike
casual 34.18% (873,153) 100.00% (171,129) 43.44% (1,218,761)
member 65.82% (1,681,345) 0.00% (0) 56.56% (1,586,799)

11.2.1 Ride segment and ride days

# library(flextable)
  bs10 %>% 
  select(ride_segment, ride_days) %>% 
  janitor::tabyl(ride_segment, ride_days) %>%
    adorn_percentages("col") %>%
    adorn_pct_formatting(digits = 2) %>%
    adorn_ns() %>% 
    gt(caption = " Tab. 3 Cross Tab ride segment and ride days")
Tab. 3 Cross Tab ride segment and ride days
ride_segment busnday weekend
casual 36.67% (1,423,547) 50.90% (839,496)
member 63.33% (2,458,213) 49.10% (809,931)

11.2.2 Seasons performance

#library(flextable)
bs10 %>% janitor::tabyl(ride_segment, us_seasons) %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(digits = 2) %>%
    adorn_ns() %>% 
   gt(caption = " Tab.4 Cross table seanos")
Tab.4 Cross table seanos
ride_segment fall spring summer winter
casual 26.10% (590,610) 21.42% (484,832) 48.84% (1,105,292) 3.64% (82,309)
member 29.60% (967,249) 23.78% (777,090) 37.21% (1,216,003) 9.42% (307,802)

11.2.3 Ranking departure stations

library(dplyr)

 bs10 %>%
   dplyr::select(ride_segment, ride_time, departure_station) %>%
   drop_na(departure_station) %>%
   group_by(ride_segment, departure_station) %>%
   dplyr::summarise(
     Total_cnt = n(),
     trip_duration = sum(ride_time), .groups = "drop"
   ) %>%
   dplyr::mutate(
     Top_rides = round((Total_cnt / sum(Total_cnt) * 100), 2),
     Top_trips = round((trip_duration / sum(trip_duration) * 100), 2)
   ) %>%
   # arrange(desc(Top_stations)) %>%
   slice_max(order_by = Top_trips, n = 10) %>%
   gt(caption = " Tab.5 Ranking Departure Stations")
Tab.5 Ranking Departure Stations
ride_segment departure_station Total_cnt trip_duration Top_rides Top_trips
casual Streeter Dr & Grand Ave 56712 1963689.6 1.20 2.55
casual DuSable Lake Shore Dr & Monroe St 31117 1061685.5 0.66 1.38
casual Millennium Park 24849 924781.1 0.53 1.20
casual Michigan Ave & Oak St 24735 854982.7 0.52 1.11
casual DuSable Lake Shore Dr & North Blvd 23051 642064.3 0.49 0.83
casual Shedd Aquarium 19769 572465.8 0.42 0.74
casual Theater on the Lake 18126 523935.2 0.38 0.68
casual Dusable Harbor 13762 454351.5 0.29 0.59
casual Montrose Harbor 12293 426148.4 0.26 0.55
casual Indiana Ave & Roosevelt Rd 13385 404501.2 0.28 0.53

11.2.4 Ranking arrival stations

Tab.6 Ranking Arrival Stations
ride_segment arrival_station Total_cnt Total_time Top_rides Top_trips
casual Streeter Dr & Grand Ave 58781 58781 1.25 1.25
casual DuSable Lake Shore Dr & Monroe St 29036 29036 0.62 0.62
casual Michigan Ave & Oak St 26024 26024 0.56 0.56
casual Millennium Park 26192 26192 0.56 0.56
casual DuSable Lake Shore Dr & North Blvd 25735 25735 0.55 0.55
member Kingsbury St & Kinzie St 24224 24224 0.52 0.52
member Clark St & Elm St 21988 21988 0.47 0.47
member Wells St & Concord Ln 21580 21580 0.46 0.46
member Clinton St & Washington Blvd 20108 20108 0.43 0.43
member University Ave & 57th St 20160 20160 0.43 0.43

11.3 Data Distribution based on BoxPlot

fig1 <- bs10 %>%
  select(ride_segment, ride_time) %>%
  ggplot(aes(x = ride_segment, y = ride_time, fill = ride_segment, col = ride_segment)) +
  geom_boxplot(alpha = 0.5) +
  geom_pointrange(
    mapping = aes(x = ride_segment, y = ride_time),
    stat = "summary",
    fun = mean  # point at the group mean
  ) +
  labs(
    tag = "Fig.1",
    title = "Casual riders have wider trip distance interval",
    y = "Trip duration (minutes)",
    caption = "Source: divvy-tripdata"
  ) +
  theme(
    legend.title = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "none"
  ) +
  scale_x_discrete(labels = c("Casual\n  riders", "Subscriber\n riders")) +
  scale_y_continuous(limits = c(0, 50), breaks = c(0, 10, 15, 20, 30))

fig1 <- fig1 + scale_fill_viridis_d()

fig1

11.4 Data Distribution and Shape

fig2 <- bs10 %>%
   ggplot(aes(x = ride_time, fill = ride_segment, col = ride_segment)) +
   geom_histogram(alpha = .75, binwidth = 0.05) +
   scale_x_log10() +
   theme(
     legend.position = c(.8, .90), 
     legend.title = element_blank(),
     axis.text = element_text(face = "bold"),
     panel.grid.major.y = element_blank()
   ) +
   labs(
     tag = "Fig. 2",
     title = " ",
     subtitle = "Normal Data Distribution with log10 Transformation ",
     caption = "Source: divvy-tripdata",
     y = "Frequency of riders",
     x = "Ride time duration (minutes))"
   ) +
   scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))

fig2

12 Communicating and sharing the findings

The following 7 questions should enable us to get some key insights out of the data we just analysed and summarized.

  1. What proportion of trips did each group complete?

  2. What proportion of the riding time did each group cycle?

  3. Which bikes rank highest?

  4. How does the ride time vary daily?

  5. How does the ride time vary monthly?

  6. What is the riding pattern/behaviour on business days and weekends?

  7. How is the riding pattern impacted by seasonal variations?

12.1 Who actually is riding more frequently?

  • Key insight: out of the total completed rides, 60% were done by the subscribed members.
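A minimal sketch of how that share can be computed from the cleaned dataset:

bs10 %>%
  count(ride_segment) %>%
  mutate(share = scales::percent(n / sum(n)))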

12.2 Who is riding longer (trip duration) ?

Key insights

  1. Casual riders show the highest median trip duration: 50% of casual riders experience trips longer than 12 minutes on business days and 15 minutes on weekends.
  2. Subscribed members exhibit a similar riding pattern on both business days and weekends.

12.3 What are the Most Preferred Bikes

Key insight

  • Classic bikes rank first, being most preferred by members with 30% of total bike usage. It is noticeable that casual riders go more for electric bikes (22%) compared to classic bikes (16%). Docked bikes are only used by casual riders, with marginal usage of as little as 3%.

12.4 Can We Spot any Consistent Riding Pattern?

Key insight

  • The riding frequency shows high variation across the months. It increases steadily from January, reaches its peak during July and August, then falls through December.
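A minimal sketch of the monthly breakdown behind this observation (counts of rides per month and segment):

bs10 %>%
  count(ride_month, ride_segment) %>%
  ggplot(aes(x = ride_month, y = n, fill = ride_segment)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K")) +
  labs(x = "Month", y = "Number of rides")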

12.5 What is the Riding Preference Between Business Days and Weekends?

key insights

  • The subscribed members ride more often during business days, from Monday to Friday. On weekends, casual riders and members have a similar ride frequency.

12.6 What is the Peak Time Range?

  • Key insights
  1. Subscribed members experience two periods of high demand, one around 09:00 and the highest around 18:00.

  2. Casual riders show a steadily increasing demand for bikes, reaching its peak at 18:00.

  3. On weekends, both groups exhibit a similar trend.

12.7 How Seasonal Changes Impact Bike Demand

  • Key insight: ridership is highest in summer for both groups, declining through fall and spring and reaching its low in winter.

12.8 Ranking Departure Stations

12.9 Ranking Arrival Stations

13. Takeaway insights from the customer segment comparison

  1. Subscribed members use bikes more frequently than casual riders under the same circumstances.

  2. The casual riders cycle longer on average than the annual subscribed members.

  3. Annual subscribed members show substantially high demand for bikes in the morning around 09:00 and late in the afternoon around 18:00; 18:00 is the peak time for both ride segments.

  4. In the summer season both segments experience the highest demand for bikes, most likely associated with the favourable weather conditions.

  5. Our analysis has limited socio-economic indicators to draw on for evidence-based recommendations and sound data-driven business decisions.

  6. Our analysis is merely descriptive and cannot be conclusive unless supported and validated by sound statistical analysis.

14. Recommendations for an effective marketing strategy targeting casual riders

Price is a decisive factor in consumer choice: what to buy and how much of it to buy.

Provided the pricing is set correctly, some hints to consider would include:

  1. Cater for the group's needs with differentiated offers aligned with trip duration (short and long rides)

  2. Design offers in line with the daily bike demand

  3. Match bike supply with demand based on the results of the dock station ranking