Divvy Bike Share Capstone Project

1. Project Background

This project is a requirement to Google Data Analytics Certificate explicitly for novices with no prior expertise. The case study seeks to consolidate the analytic skills developed by applying the typical procedures and methods trough a systematic data analytic workflow and using the most appropriate tools. The dataset with key usage variables is collected by third party and made available publicly.

Cyclistic is a fictitious company that runs the Divvy bike sharing business. Its marketing department identifies a potential business opportunity to increase the business revenues and engages the data analytics team to uncover and reveals evidences and insights on the available bike sharing usage data to ultimately support a data driven decision making. At the heart ofthe project is to clean, analysis, summarize and visualize the results in the most effective way. Inferential statistics (hypothesis testing) is beyond the scope of this project.

A great deal of time has been spent delving and diving deep to master the grammar of data manipulation, the grammar of graphics and the art of data visualization. At the end only to realize that I just scratched the surface of R. The challenge ahead on my journey as aspiring data analyst lies on clinging to string manipulation and getting grips at functional programming before I become a R Power User.

docking station

2. The Cyclistic Business Profile at a Glance

Cyclistic is a bicycle renting business in Chicago , with an innovative concept of bike sharing. It delivers a great value to the customer basis serving.

Micromobility solution with healthy associated benefits from cycling in a regular basis
Targeting Chicagoan commuters to work, school, social encounter
Visitors to unlock Chicago
Runs a fleet of Over 5 800 bikes
Scattered network of 682 stations/kiosks to unlock the bikes from and return

Divvy Bikes, Chicago’s Bike Sharing System

Cyclist offer essentially 3 bikes with distinct features

Bikes rented, shared and charged on hourly basis
Bike unlocked at a station and returned at another station at the customer’s convenience
The are two customers segments
Casual riders - day pass-single one or multiple trips
Subscribers - annual membership

3. Business Case Identification

The company is setting to increase the overall business performance specially the revenue from non subscribed casual riders. It is believed that this customer segment has key customer offering a great deal of potential to add value to the company and there should be endeavored to bring a board as much as possible annual based membership. However, there is information gap concerning the records about the critical factors/variables on how the casual group and subscribed members uses the bike sharing service. Here comes the challenge to design a solid well founded strategy to target the casual riders to adhere to the annual based package. With that challenge comes a long the need to gather evidences, meaningful and quality information to back up the decision making for impactful results. This project report uses data collected in 2022, analysis data to uncover key insights, pattern and trends on the bike rental service usage and communicate the findings in an easy, engaging and persuasive way to the key stakeholder.

4. Assuring Data Quality

The quality of data by large determine the quality of the findings. The goal to have high quality data is to make empowered, informed and data driven decisions in order to achieve improved business performance. In order to restrain costly mistakes and false conclusions we have to ensure that data meets the data quality standards. Like it is commonly cited in data analytics and data science resources “garbage in garbage out”. We assess the quality of our dataset against the measures or dimensions of data quality. Each dimension plays a critical role in ensuring the overall quality of data.

data quality

4.1 Reliability

Our data is reliable and accurate being collected by a credible source using advanced data collection technology

4.2 Originality

Our dataset was accessed from the original source, as first recorded data points, therefore no reasonable data quality issues on this ground

4.3 Comprehensiveness

We have fair and relatively comprehensive and useful information . With available data we are able to compare the two groups as far as the usage is concerned. But not the preferences and choices they make to adhere and buy one or another product/service. This is by far determined and controlled by other variables like socio economic and demographics. So, well founded and substantiated recommendations would be hard to be drawn.

4.4 Currency

This is a time dimension (also referred to as timeliness) .. Our data is sufficiently up to date for its intended use.

4.5 Cited

This criteria is satisfied. In addition, we also keep abreast that the dataset sample should meet and clear the requirements. It seems that our records covers the entire population and after some data cleaning chores we should be left of dataset that does not compromises the representativity and assure unbiasness.

5. Sourcing and Preparing The Data

5.1 Preliminary Data Preparation

We primarily relied on the great features offered by MS Power Query to get familiarized with the dataset. For each file we easily carried out the following preliminary transformations: - column profile(count, distinct, unique, min, max, empty string) - Column quality (validity, error, percentage) - ride Date, bike_type, ride_type, station name - Splitting datetime (time stamp) into date and time columns - Computing the trip time duration based on the ended time and started time variables

5.2 Data set conformity with the tidy format

Tidy data is a concept from Hadley Wickham’s 2014 paper Tidy Data Tidy datasets are easy to manipulate, model and visualize, are rectangular data and have a specific structure: - Every row is an observation, - Every column represents variables and - Every entry into the cells of the data frame are values It is worthy quoting “Happy families are all alike, but every unhappy family is unhappy in its own way.” ~ Leo Tolstoy “Tidy datasets are all alike, but every messy dataset is messy in its own way.” ~ Hadley Wickham

Tidy Data

6. Data Analysis Workflow

Google offers a well-defined data workflow which acts as a set of guidelines for planning, organizing, and implementing data analytics projects. The Google six steps are:

Ask
Prepare.
Process
Analyze
Share.
Act

This workflow is self explanatory. Noticeably, the process outlined is linear, however, in many cases the process is iterative, requiring multiple stages to be repeated and revisited. Our Project is being guided by the flow bellow.

typical data analytics workflow

7. Getting into R Programming

R programming is the data analysis tool of our choice, the world’s most popular programming language for statistical analysis. R is an open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. We import and load the data into RStudio where we set a R project.

7.1 Loading The Packages

Bellow we load the package collections we will be used during the entire data analysis workflow - useful packages

knitr::opts_chunk$set( warning=FALSE, message=FALSE)
library(knitr)
library(dplyr)
library(readr) 
library(purrr)
library(magrittr)
library(knitr)
library(gt)
library(scales)
library(ggthemes)
library(ggplot2)
library(tidyr)
library(forcats)
library(lubridate)
library(pandoc)
library(skimr)
library(summarytools)
library(RColorBrewer)
library(viridis)
library(flextable)
library(janitor)
library(hablar)
library(styler)
library(patchwork)
library(cowplot)

7.2 Setting the Theme for the Plots

theme_set(theme_clean() +
  theme(
    plot.subtitle = element_text(
      hjust = 0.5,
      size = 14,
      color = "skyblue",
      face = "bold",
      family = "Tahoma"
    ),
    plot.caption = element_text(
      hjust = 1,
      size = 14, color = "grey",
      face = "italic"
    ),
    plot.title = element_text(
      hjust = 0.5,
      size = 16,
      color = "skyblue",
      face = "bold",
      family = "Tahoma"
    ),
    plot.tag = element_text(
      size = 14,
      color = "grey",
      face = "bold"
    ),
    axis.title = element_text(
      color = "steelblue",
      face = "bold",
      size = 15
    ),
    axis.line = element_line(linewidth = 1.5, color = "darkgrey"),
    axis.text = element_text(face = "bold", 
                             color = "#993333", 
                             size = 12),
    legend.title = element_blank(),
    legend.position = "top"
  ))

7.3 Loading The Data

The data set comes in 12 separated files on month basis, with individual Divvy bike sharing trips , including the dock station origin and destination, and time stamps for each trip(start and end). Source: https://divvy-tripdata.s3.amazonaws.com/index.html .

We import all file at once grammatically instead of one at time. There is claim holding that if a task is to be repeat, then it is absolutely the right task the computer is designed to perform and it is does efficiently. There are many ways to get the cat skinned, akin we find may ways to import multiple files into R. The best method so far seems to be fread() from data.tablepackage. As a matter of taste, we will go with read_csv() from Tidyverse ecosystem, as it gives an argument to specify the column types. This should free us up from the tedious work of converting the date variables from character type to POSIXct and POSIXlt date type.

We list all files with list.files() function and we get them combined into a single file with the mp_df() from purrr package.

library(tidyverse)
 my_dir <- "C:/Users/USER/Documents/DataAnalytics/Google_Capstone_Project/data/Data_RStudio"
 all_files <- list.files(path = my_dir, pattern = "*.csv", full.names = TRUE) 
bs_01 <- all_files %>% map_df(~ read_csv(.,
  col_names = TRUE,
  col_types = cols(
    started_at = col_datetime(format = "%m/%d/%Y %H:%M"),
    ended_at = col_datetime(format = "%m/%d/%Y %H:%M"),
    started_ride_date = col_date(format = "%m/%d/%Y"),
    ended_ride_date = col_date(format = "%m/%d/%Y")
  )
))

read_csv() would have guessed the variable classes and character data type to dates variables, so we have specified the date class vriables using the col_typesargument.

Moving on, our bike share dataset has been loaded as an R object bs_01 with a dimension of :

5667717 rows ,

19 columns or simply

dataset dimension (nr rows , nr columns = 5667717, 19 .

7.2.1 Checking and validating the Data Type

Data types matters a great deal. The operations we can perform on a column depend so much on its “type” (numeric, character/string). Lets revisit the common type of operators. - arithmetic operators - calculations using Arithmetic operators to be performed on values. - relational operators - for assignment and enable comparisons to be made. They are used in condition testing - logical operators - Logical operators are used to combine relational operators to give more complex decisions.

We can get a glimpse of the columns data type with the summary.default()

bs1 <- bs_01
summary.default(bs1)

##                    Length  Class   Mode     
## ride_id            5667717 -none-  character
## rideable_type      5667717 -none-  character
## started_at         5667717 POSIXct numeric  
## ended_at           5667717 POSIXct numeric  
## start_station_name 5667717 -none-  character
## start_station_id   5667717 -none-  character
## end_station_name   5667717 -none-  character
## end_station_id     5667717 -none-  character
## start_lat          5667717 -none-  numeric  
## start_lng          5667717 -none-  numeric  
## end_lat            5667717 -none-  numeric  
## end_lng            5667717 -none-  numeric  
## member_casual      5667717 -none-  character
## started_ride_date  5667717 Date    numeric  
## started_ride_time  5667717 hms     numeric  
## ended_ride_date    5667717 Date    numeric  
## ended_ride_time    5667717 hms     numeric  
## ride_length        5667717 -none-  numeric  
## ride_minutes       5667717 -none-  numeric

8. Data Exploration

Exploratory data analysis (EDA) is a technique/method/process widely used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. Originally developed by American mathematician John Tukey in the 1970s, it is quite helpfull in :

identify obvious errors,
better understand patterns within the data
detect outliers or anomalous events
find interesting relations among the variables.

Before diving deep in our journey to understanding the data, in this first stage we are exploring data mainly column wise, fixing few issues and getting data ready for cleaning in the following stage.

8.1 Quick Sense of the Dataset at a Glance

With the View(bs1) we have the display of our dataset in a tabular format. We can do an ocular scanning, filter and get familiarized with the dataset.

In addition, with the head(bs1) and the tail(bs1) we get access to the top row (head) and last row (tail), in this way making our dataset has been imported completed .

Additionally, we can get a sample size of the dataset of n dimension.

# randomm sample , n = 10
  bs1 %>% 
  sample_n(10) %>% 
  gt(caption =  "Sample of the Dataset with 10 Random Observations")

Sample of the Dataset with 10 Random Observations
ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual	started_ride_date	started_ride_time	ended_ride_date	ended_ride_time	ride_length	ride_minutes
18CB2761A890BCA3	classic_bike	2022-07-19 13:58:00	2022-07-19 14:20:00	Sheridan Rd & Loyola Ave	RP-009	Kedzie Ave & Bryn Mawr Ave	KA1504000167	42.00104	-87.66120	41.98223	-87.70889	member	2022-07-19	13:58:45	2022-07-19	14:20:36	0.015173611	21.85
A07A17579FD564B9	electric_bike	2022-09-21 22:49:00	2022-09-21 22:51:00	Broadway & Belmont Ave	13277	Wilton Ave & Belmont Ave	TA1307000134	41.94010	-87.64545	41.94023	-87.65294	member	2022-09-21	22:49:50	2022-09-21	22:51:46	0.001342593	1.93
466635A9028203ED	classic_bike	2022-10-22 12:11:00	2022-10-22 12:50:00	Southport Ave & Clark St	TA1308000047	Clark St & Berwyn Ave	KA1504000146	41.95708	-87.66420	41.97803	-87.66856	casual	2022-10-22	12:11:18	2022-10-22	12:50:09	0.026979167	38.85
639C0D36743DAE63	electric_bike	2022-06-05 09:37:00	2022-06-05 10:01:00	NA	NA	NA	NA	41.89000	-87.72000	41.94000	-87.72000	member	2022-06-05	09:37:19	2022-06-05	10:01:09	0.016550926	23.83
293715AE4CF08CA9	electric_bike	2022-06-28 03:43:00	2022-06-28 03:52:00	Logan Blvd & Elston Ave	TA1308000031	NA	NA	41.92940	-87.68424	41.93000	-87.72000	casual	2022-06-28	03:43:12	2022-06-28	03:52:26	0.006412037	9.23
DD0A3551EBD04D88	classic_bike	2022-05-21 17:27:00	2022-05-21 17:33:00	Desplaines St & Kinzie St	TA1306000003	Orleans St & Hubbard St	636	41.88872	-87.64445	41.89003	-87.63662	casual	2022-05-21	17:27:40	2022-05-21	17:33:12	0.003842593	5.53
D9BE3DE710D66304	classic_bike	2022-05-22 15:54:00	2022-05-22 16:09:00	Michigan Ave & Washington St	13001	Columbus Dr & Randolph St	13263	41.88398	-87.62468	41.88473	-87.61952	casual	2022-05-22	15:54:19	2022-05-22	16:09:38	0.010636574	15.32
9C4BEBB694C180AF	classic_bike	2022-06-04 15:49:00	2022-06-04 16:02:00	Jefferson St & Monroe St	WL-011	Clinton St & Roosevelt Rd	WL-008	41.88033	-87.64275	41.86712	-87.64109	casual	2022-06-04	15:49:07	2022-06-04	16:02:04	0.008993056	12.95
D1588A71286D96F0	electric_bike	2022-06-03 11:20:00	2022-06-03 11:32:00	St. Clair St & Erie St	13016	Columbus Dr & Randolph St	13263	41.89441	-87.62270	41.88473	-87.61952	member	2022-06-03	11:20:25	2022-06-03	11:32:51	0.008634259	12.43
1937B9007C59F991	classic_bike	2022-01-14 11:30:00	2022-01-14 11:50:00	Streeter Dr & Grand Ave	13022	Rush St & Cedar St	KA1504000133	41.89228	-87.61204	41.90231	-87.62769	casual	2022-01-14	11:30:00	2022-01-14	11:50:00	0.013888889	20.00

This looks quite pretty!

8. 3 Converting character type to factor data type objects

This is specially useful for both ordinal and nominal variables with limited categories since it is easy to hunt and spot entry related errors. easy data verification and validation. The variables bike type and customer type are suitable candidates.

Converting using R base and cross checking with levels()

library(hablar)

bs2 <- bs1 %>% 
  convert(fct(rideable_type, member_casual))

levels(bs2$rideable_type)

## [1] "classic_bike"  "docked_bike"   "electric_bike"

levels(bs2$member_casual)

## [1] "casual" "member"

8.4 Parsing Date Time

We have touched this step of converting date columns to Date class Date type object class is tricky. Stored in the program memory (under the hoody) as numeric, often is imported as character. working with dates and times in base R, can be daunting, a bit cumbersome and clumsy. lubridate adopts a more intuitive approach to parsing dates and times. Some useful functions to coerce date character to date class include ymd(), dmy() and ydm(). We found it handy with Power Query tool before importing the dataset

8.5 Extracting Date Components From Datetime Variable

bs2 <- bs2 %>%
  mutate(
    ride_year = lubridate::year(started_ride_date),
    ride_month = lubridate::month(started_ride_date, label = TRUE, abbr = TRUE), # month in year
    ride_weekday = lubridate::wday(started_ride_date, label = TRUE, abbr = TRUE) # day of week (name)
  ) 
# cross checking
levels(bs2$ride_month)

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

levels(bs2$ride_weekday)

## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

8.6 Renaming and Reordering the Variables

For a better syntax and good readability we change/update old column name with new column name using rename() function. We focus on those variables of interested that we will be playing around with them in our journey of revealing interesting variation pattern and trends.

Syntax: my_dataframe %>% rename(new_column_name = old_column_name)

bs3 <- bs2 %>%
  dplyr::rename(
    "departure_station" = "start_station_name",
    "arrival_station" = "end_station_name",
    "departure_time" = "started_ride_time",
    "arrival_time" = "ended_ride_time",
    "ride_bike" = "rideable_type",
    "ride_segment" = "member_casual",
    "ride_date" = "started_ride_date",
    "ride_time" = "ride_minutes"
  ) %>% 
  select(ride_id, ride_bike, ride_segment, ride_time, ride_date,departure_station, arrival_station, everything()) 
# cross checking
bs3 %>% head( 2) %>% 
  gt(caption = "Cross checking new variable names")

Cross checking new variable names
ride_id	ride_bike	ride_segment	ride_time	ride_date	departure_station	arrival_station	started_at	ended_at	start_station_id	end_station_id	start_lat	start_lng	end_lat	end_lng	departure_time	ended_ride_date	arrival_time	ride_length	ride_year	ride_month	ride_weekday
550CF7EFEAE0C618	electric_bike	casual	7.52	2022-08-07	NA	NA	2022-08-07 21:34:00	2022-08-07 21:41:00	NA	NA	41.93	-87.69	41.94	-87.72	21:34:15	2022-08-07	21:41:46	0.005219907	2022	Aug	Sun
DAD198F405F9C5F5	electric_bike	casual	14.03	2022-08-08	NA	NA	2022-08-08 14:39:00	2022-08-08 14:53:00	NA	NA	41.89	-87.64	41.92	-87.64	14:39:21	2022-08-08	14:53:23	0.009745370	2022	Aug	Mon

Voila, This is awesome!

9. Data Cleaning

This is one of the crucial and intricate stage of data analytical workflow. Once we have aligned the data object types and the desirable variable names, we move to explore and address data entry (observations) at row level in order to get the dataset in the correct and desirable format for efficient processing. This consists essentially of identifying, correcting, or removing inaccurate, mislabeled, incorretly formatted, incomplete raw data for downstream analysis purposes. In addition, we deal with the typical data issues - Duplicated records - Recording - Missing values - Unusual values most known as outliers. What typically data cleaning means is partly illustrated bellow. The bottom line is to get data formatted in the way the computer reads, understands, computes and we get consistent outcome.

Illutrastion of data cleaning

9.1 Overview of the Dataset

It is widely claimed that plotting the data, specially the numeric continuous should be the very first step when getting a new dataset. It potentially offers a better picture of data distribution, variation and a quick catch of potential outlier. In our case does not give a pretty picture due to little variation and the excessive size of the dataset. But before visual exploration,let’s get a glance or glimpse at our data set with a summary table. There is a bunch of helpful packages with summary table functions from R base describe(), dplyr summary (), dlookr, Mice, Smir skim(), psych, Mosaico, summarytools, naniar to janitor just to mention fewer. The choice is a matter of taste, but also the detailed information we can get from.

9.2 Describing data with Summary Table

This is an analytical way to get an overview of the entire dataframe. We can use the function dfsummary() from the package summarytools . The Columns are summarized by class/type such as character, factor and numeric. We select 6 variables of interest:

ride_segment
ride_bike
ride_time
departure_station
arrival_station
ride_date.

We will include the identifier ride id just to control for dupes (duplicated entries)

using dfsummary() to summarize and describe the dataset structure with selected variables

Data Frame Summary

diagnosis

Dimensions: 10 x 7
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

ride_id [character]

1. 18CB2761A890BCA3

2. 1937B9007C59F991

3. 293715AE4CF08CA9

4. 466635A9028203ED

5. 639C0D36743DAE63

6. 9C4BEBB694C180AF

7. A07A17579FD564B9

8. D1588A71286D96F0

9. D9BE3DE710D66304

10. DD0A3551EBD04D88

1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)

10 (100.0%)

0 (0.0%)

2

ride_segment [factor]

1. casual

2. member

6	(	60.0%	)
4	(	40.0%	)

10 (100.0%)

0 (0.0%)

3

ride_bike [factor]

1. classic_bike

2. electric_bike

6	(	60.0%	)
4	(	40.0%	)

10 (100.0%)

0 (0.0%)

4

ride_time [numeric]

Mean (sd) : 16.2 (10.6)

min ≤ med ≤ max:

1.9 ≤ 14.1 ≤ 38.9

IQR (CV) : 11.4 (0.7)

1.93	:	1	(	10.0%	)
5.53	:	1	(	10.0%	)
9.23	:	1	(	10.0%	)
12.43	:	1	(	10.0%	)
12.95	:	1	(	10.0%	)
15.32	:	1	(	10.0%	)
20.00	:	1	(	10.0%	)
21.85	:	1	(	10.0%	)
23.83	:	1	(	10.0%	)
38.85	:	1	(	10.0%	)

10 (100.0%)

0 (0.0%)

5

departure_station [character]

1. Broadway & Belmont Ave

2. Desplaines St & Kinzie St

3. Jefferson St & Monroe St

4. Logan Blvd & Elston Ave

5. Michigan Ave & Washington

6. Sheridan Rd & Loyola Ave

7. Southport Ave & Clark St

8. St. Clair St & Erie St

9. Streeter Dr & Grand Ave

1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)

9 (90.0%)

1 (10.0%)

6

arrival_station [character]

1. Clark St & Berwyn Ave

2. Clinton St & Roosevelt Rd

3. Columbus Dr & Randolph St

4. Kedzie Ave & Bryn Mawr Av

5. Orleans St & Hubbard St

6. Rush St & Cedar St

7. Wilton Ave & Belmont Ave

1	(	12.5%	)
1	(	12.5%	)
2	(	25.0%	)
1	(	12.5%	)
1	(	12.5%	)
1	(	12.5%	)
1	(	12.5%	)

8 (80.0%)

2 (20.0%)

7

ride_date [Date]

1. 2022-01-14
2. 2022-05-21
3. 2022-05-22
4. 2022-06-03
5. 2022-06-04
6. 2022-06-05
7. 2022-06-28
8. 2022-07-19
9. 2022-09-21
10. 2022-10-22

1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)

10 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17

From the output, we get the account for the following data issues:

duplicates
distinct values for the numerical variables
missing values
distribution (sketch for bar charts and histogram)
frequency (cat variables)
summary statistics (five numbers) for the numeric continuous variables

This looks pretty, we only miss the explicit account of outliers, but the range offers some valuable hints.

In a nutshell, from the summary output we spotted issues with three variables:

ride time with implausible extreme values-
departure station with missing values
arrival station with missing values

9.4 Dealing with the messy

9.4.1 Duplicated records

Duplicate observations should be identified and removed. Based on the table summary, there are no duplicated records. Total unique observations/cases match and corresponds to the total number of observations in the dataset .

Total unique records 5667717
Total number of observations 5667717

9.4.2 Flawed Data Entry on Categorical Variables

We use frequency tables to examine or diagnosis the possible inconsistencies that can result from incorrect entries or typo. Our explicitly categorical variables are trouble free as given in the above summary table.

9.4.2.1 Docking Stations

This is not worthy to set as a categorical variable given that as a nominal variable each value is treated as character/string and cant be meaningfully grouped. We will examine the frequency( how many times the same value occur) and seek to catch some pattern. How may unique bike departure station we have and are expected? Can we match it? All in all the network consist of 692 docking stations. Is there a meaningful way to group the stations?

Distinct dock station

total_dock_stations <- bs3 %>%
  group_by(departure_station) %>%
  summarize(distinct_points = n_distinct(departure_station))

dock_stations <- bs3 %>% select(departure_station) %>% 
  drop_na(departure_station)%>% 
  table()

There are distinct 1675 departure dock stations. It seems that we have limited options to scan and detect irregularities for the character variables with no levels.

We could look at the frequencies and consider the possibility of excluding data entries of lower frequency, such as bellow 1%

Data entry of lower frequencies are suspicious and on business grounds they add no value. These observations are possibly caused by data recording errors. We might have the data with mixed cases (lower and upper), in which case R treat them differently. We could explore text data, using functions like filter(), str_detect() and count() to screen and catch data entry based on partial match or common pattern. This is beyond the scope of this project and we will not delve into.

9.4.3 Dealing with missing value

Missing values have to be identified, examined and take decision if they are kept, omitted/removed or replaced and if replaced what method to be used.

9.4.3.1 Where are missing values

From our summary table, we have missing values on two variables. Cross checking with is.na() and sum() functions we get 833064 missing values for departure_station and 892742 for arrival_station

9.4.3.2 What to do with the missing values

From the outset it important to understand how the missing values are coded and remarkably differentiated from other coded values like none, empty or NAN. In R, missing values are represented by the symbol NA (not available). On the other hand, we have NULL, which represents that the value in question simply doesn’t exist, rather than being existent but unknown. In dealing with missing values, Greg Martin suggests that they are examined, trying to understand the pattern of its occurrence (the missiness). Are missing values taking place systematically or randomly? Are they associated with other variables? In our case, we should expect that the missing values are associated with long trip duration suggesting that the bikes went missing (stolen). In that event, the ride end time should also be missing (unreported). But this de not hold at all, as we have complete records for the variable start and end time.

All what we find is that missing values at start station are associated with missing values of start station id.

library(dplyr)
 miss_values <- bs3 %>%
   select(departure_station, arrival_station, start_station_id, end_station_id, start_lat, end_lat, start_lng, end_lng) %>%
   dplyr::filter(is.na(departure_station) | is.na(arrival_station)) %>%
   count()
 miss_values

## # A tibble: 1 × 1
##         n
##     <int>
## 1 1298357

We could drop or remove the missing values, but we decide to keep them to avoid losing data points from variables of interest (customer group, date, ride time) associated with those rows where they occur.

9.4.4 Dealing with implausible out of boundaries values

observations lying far away from others are typically unusual and often considered outliers. Dealing with outliers is a highly contentions issue among statisticians , data scientists and data analyst. It is worthy to highlight the reason why outliers matter a great deal and attract hot debate. As far as the methods and techniques used in descriptive and inferential statistics hinge and grounded on the distribution assumptions, outliers may impact negatively on results and misleading or flawed findings. We don’t take outliers lightly. As a matter of fact, in business environment the considered outliers could be a niche or market segment with specific needs to be catered and deserving especial attention. It could be the of the known principle of 80:20 . We will explore the distribution and pattern of the numerical continuous variables trough the scatterplot, boxplot and histogram and match with the cyclistic core business - renting bikes on shared basis for micro mobility.

9.4.4.1 Values out of the expected range

 bs3 %>%
  select(ride_segment, ride_time) %>%
  group_by(ride_segment) %>%
  dplyr::summarize(
    min = min(ride_time),
    max = max(ride_time)
  ) %>%
  ungroup()

## # A tibble: 2 × 3
##   ride_segment     min    max
##   <fct>          <dbl>  <dbl>
## 1 casual         -137. 41387.
## 2 member       -10353.  1560.

The maximum values are way out for a typical trip duration in a micro mobility service . This is supported by the literature reviews. Cyclistic rent bikes up to 45 mints for both casual and subscriber members. The negative values are indisputably implausible.

Ideally abike trip duration should be:

greater or igual ( >=) to 5 mints
less than 60 mints for commutting purpose
between 60 to 180 mints for leisure and sport

Notably, the bikes just like any commercial product are designed and manufactured with intrinsic features to suit the end use purpose. They are customized and market oriented.

bs4 <- bs3 %>% filter(ride_time > 0) # dropping negative trip duration records
summary(bs4$ride_time) # verification

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.02     5.83    10.28    19.45    18.47 41387.25

9.4.4.2 Visual Exploration - Extremely low values

Trips that last less than 60 seconds should be suspicious and worthy some attention. In fact they should occur where the bikes are checked out and returned back to the same station, being bikes with some problems and unable to start a trip.

 bs4 %>%
  dplyr::select(ride_segment, ride_time) %>%
  filter(ride_time < 5) %>%
  ggplot(aes(x = ride_time, col = ride_segment)) +
  geom_freqpoly(size = .75, binwidth = 0.02) +
  ggtitle("Visual Exploration of Data")

Note: The expected relationship is spotted, the longer the trip, there are lesser returns to the same station. We set the threshold of 1.0 minute as the minimum ride length accepted to travel from one station to another.

bs5 <- bs4 %>% 
  filter(ride_time > 1.0)
summary(bs5$ride_time)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.07    10.52    19.87    18.75 41387.25

9.4.4.3 Visual Exploration of Extremely high values with scatter plot

Scatter points helps to visualize and easily spot the pattern and trend of data points. The geom_jitter() offer a greater visibility (less over plotting) by adding random noise to both the width and height of each point.

expl_scatter <- bs5 %>%
  select(ride_segment, ride_date, ride_time) %>%
  ggplot(aes(x = ride_date, y = ride_time, col = ride_segment)) +
  geom_jitter(size = 1.5, shape = 21, width = 0.15, height = 0.2, alpha = .5) +
  facet_wrap(~ride_segment, nrow = 2, scales = "free") +
   labs(title = "exploratory scatter plot",
        subtitle =" scattered potential outlier data points",
        caption = "Source: divvy-tripdata")

expl_scatter + 
  scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))

What panes from the scatter plot?

clear points falling away from the most common values .
casual riders with rid time above 1000 mints
annual members with ride time above 500 mints

9.4.4.4 Visual Exploration of Extremely high values with histogram

Histogram is a popular visual tool in which a numeric variable is typically mapped to the x-axis. Then, the x-axis is divided up into equally sized sections, which we call “bins or buckets”, ranging from the lowest to the highest value for the variable, then we count the number of observations in each bin and plot a bar for each bin. The length of the bin corresponds to the count of the observations in that bin.

freqpol <- bs5 %>%
   select(ride_segment, ride_time) %>%
   ggplot(aes(x = ride_time, fill = ride_segment, color = ride_segment)) +
   geom_freqpoly(size = 0.75) +
   labs(
     title = "Exploring the Data Distribution  log transform",
     subtitle = "Decteting outboundaries data points",
     caption = "Source: divvy-tripdata"
   ) +
  theme(panel.grid.major.y =   element_blank(),
        legend.title = element_blank()) +
   scale_x_log10(
     n.breaks = 8,
     labels = number_format(scale = 1e-3, suffix = " k")
   )
freqpol <- freqpol+ scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K", big.mark = ","))
library(plotly)
ggplotly(freqpol)

9.4.4.5 Removing potential outliers

Ideally outliers are set to missing values instead of dropping them. Our datset is sufficiently larger to play other way around.

Based on our distribution, between 25% and 75% of our observations, the riding time is less than 30 mints. However, there is significant portion of riders beyond that interval and we may want to consider them in our analysis for business decision making. We take the advantage of the plotly() interactivity feature and we pick the value of the ride time as the outlier threshold. We find the value of 2.3835861 mints on the logaritm scale as the left tail limit where values beyond are typically out of the boundaries with limited frequency

density_plot <- bs5 %>%
  select(ride_segment, ride_time) %>%
  ggplot(aes(x = ride_time, color = ride_segment)) +
  geom_histogram(aes(y = ..density..), fill = "white", alpha = 0.2, position = "dodge") +
  geom_density(lwd = 1.2, linetype = 2) +
  theme(panel.grid.major.y =   element_blank(),
        legend.title = element_blank()) +
  
  scale_x_log10(limits = c(10^0, 10^2.3835861)) +
  labs(
    title = "Distribution of riders with log transform",
    subtitle = "Limiting the trip duration to 241.8723 minutes",
    x = "Trip duration (minutes)",
    caption = "Source: divvy-tripdata"
  )

library(plotly)
ggplotly(density_plot)

bs6 <- bs5 %>% 
  filter(ride_time <= 10^2.3835861)
summary(bs6$ride_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.02    6.05   10.50   15.86   18.65  241.82

9.4.5 Examining the ride date variable

summary(bs6$started_at)
summary(bs6$ended_ride_date)

bs7 <- bs6 %>% filter(ended_ride_date <= as.Date("2022-12-31"))
summary(bs7$ended_ride_date)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2022-01-01" "2022-05-28" "2022-07-22" "2022-07-19" "2022-09-16" "2022-12-31"

9.5 Final verification of the dataset

9.5.1 Keeping the variables of interest

bs8 <- bs7 %>% select(-c(ride_id, started_at, ended_at,start_station_id,
end_station_id, start_lat, start_lng,  end_lat, end_lng, ride_year, ride_length )) 
names(bs8)

##  [1] "ride_bike"         "ride_segment"      "ride_time"        
##  [4] "ride_date"         "departure_station" "arrival_station"  
##  [7] "departure_time"    "ended_ride_date"   "arrival_time"     
## [10] "ride_month"        "ride_weekday"

Deal! Pretty cool

9.5.2 Verification and validation

verify <- bs8 %>% select(ride_segment, ride_bike, ride_time, departure_station, arrival_station, 
                       departure_station, arrival_station, ride_date)
print(dfSummary(verify, graph.magnif = 0.75), 
method = 'render')

Data Frame Summary

verify

Dimensions: 10 x 6
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

ride_segment [factor]

1. casual

2. member

6	(	60.0%	)
4	(	40.0%	)

10 (100.0%)

0 (0.0%)

2

ride_bike [factor]

1. classic_bike

2. electric_bike

6	(	60.0%	)
4	(	40.0%	)

10 (100.0%)

0 (0.0%)

3

ride_time [numeric]

Mean (sd) : 16.2 (10.6)

min ≤ med ≤ max:

1.9 ≤ 14.1 ≤ 38.9

IQR (CV) : 11.4 (0.7)

1.93	:	1	(	10.0%	)
5.53	:	1	(	10.0%	)
9.23	:	1	(	10.0%	)
12.43	:	1	(	10.0%	)
12.95	:	1	(	10.0%	)
15.32	:	1	(	10.0%	)
20.00	:	1	(	10.0%	)
21.85	:	1	(	10.0%	)
23.83	:	1	(	10.0%	)
38.85	:	1	(	10.0%	)

10 (100.0%)

0 (0.0%)

4

departure_station [character]

1. Broadway & Belmont Ave

2. Desplaines St & Kinzie St

3. Jefferson St & Monroe St

4. Logan Blvd & Elston Ave

5. Michigan Ave & Washington

6. Sheridan Rd & Loyola Ave

7. Southport Ave & Clark St

8. St. Clair St & Erie St

9. Streeter Dr & Grand Ave

1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)
1	(	11.1%	)

9 (90.0%)

1 (10.0%)

5

arrival_station [character]

1. Clark St & Berwyn Ave

2. Clinton St & Roosevelt Rd

3. Columbus Dr & Randolph St

4. Kedzie Ave & Bryn Mawr Av

5. Orleans St & Hubbard St

6. Rush St & Cedar St

7. Wilton Ave & Belmont Ave

1	(	12.5%	)
1	(	12.5%	)
2	(	25.0%	)
1	(	12.5%	)
1	(	12.5%	)
1	(	12.5%	)
1	(	12.5%	)

8 (80.0%)

2 (20.0%)

6

ride_date [Date]

1. 2022-01-14
2. 2022-05-21
3. 2022-05-22
4. 2022-06-03
5. 2022-06-04
6. 2022-06-05
7. 2022-06-28
8. 2022-07-19
9. 2022-09-21
10. 2022-10-22

1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)
1	(	10.0%	)

10 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17

9.5.3 Dimensions of the final datase

Our cleaned dataset consist of 5531187, 11

10. Data Wrangling

In this stage we do some relevant data transformations ahead of the data analysis stage. Very often transformations performed include grouping continuous variable into categorical with bins or buckets.

In our case, we introduce additional variables based on existing ones. This will allows us to drill down and uncover some valuable insights in our dataset.

10.1 Grouping days into categories

weekends <- c("Sun", "Sat")
 bs9 <- bs8 %>%
   mutate(ride_days = as.factor(if_else(
     ride_weekday %in% weekends, "weekend", "businday"
   )), .after = ride_weekday) 
 
levels(bs9$ride_weekday) # verification

## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

bs10 <-  bs9 %>%
   mutate(ride_days = as.factor(if_else(
     ride_weekday %in% weekends, "weekend", "busnday"
   )), .after = ride_weekday)
levels(bs10$ride_days)

## [1] "busnday" "weekend"

10.2 Adding Seasons Variable

In Chicago, the summers are warm, humid, and wet;
the winters are freezing, snowy, and windy; and it is partly cloudy year round.
January tends to be the coldest month, but also the snowiest month of the year.

Spr <- c("Mar", "Apr", "May")
  Sum <- c("Jun", "Jul", "Aug")
  Aut <- c("Sep", "Oct", "Nov")
  winter <- c("Dec", "Jan", "Feb" )

bs10 <- bs10 %>%
  mutate(us_seasons = as.factor(case_when(
    ride_month %in% Spr ~ "spring",
    ride_month %in% Sum ~ "summer",
    ride_month %in% Aut ~ "fall",
    TRUE ~ "winter"
  )), .after = ride_month)
levels(bs10$us_seasons)

## [1] "fall"   "spring" "summer" "winter"

dim(bs10)

## [1] 5531187      13

11 Data Summarizing

In this stage we summarize our dataset based on the variable type (numeric continuous or qualitative).

For the numerical continuous variables we summarize the data with the summary descriptive statistics. With the categorical variables we summarize using the frequency or cross tabulation.

11.1 Numerical variables

11.1.1 Key descriptive indicators

library(dplyr)
bs10 %>%
  # dplyr::select(ride_segment, ride_time) %>%
  dplyr::group_by(ride_segment, ride_days) %>%
  dplyr::summarise(
    Total_rides_mints = scales::comma(n()),
    rides_hours = n() / 60,
    Min_time = min(ride_time),
    Quantile_1 = quantile(ride_time, probs = 1 / 4),
    Average_time = format(mean(ride_time), digits = 2, nsmall = 2),
    Median = median(ride_time),
    Quantile_3 = quantile(ride_time, probs = 3 / 4),
    Max_time = max(ride_time)
  ) %>%
  mutate(Share = scales::percent(rides_hours / sum(rides_hours))) %>%
  gt(caption = " Tab.1 Numerical Variable Summary groupedby segments and ride days")

Tab.1 Numerical Variable Summary groupedby segments and ride days
ride_days	Total_rides_mints	rides_hours	Min_time	Quantile_1	Average_time	Median	Quantile_3	Max_time	Share
casual
busnday	1,423,547	23725.78	1.02	7.15	19.36	12.20	22.10	241.82	63%
weekend	839,496	13991.60	1.02	8.63	23.62	15.25	27.92	241.78	37%
member
busnday	2,458,213	40970.22	1.02	5.20	11.90	8.77	14.90	241.82	75%
weekend	809,931	13498.85	1.02	5.72	13.69	9.93	17.32	241.72	25%

11.2 Summarizing Categorical Variables

Tabulation ride_segment and bike type

#library(flextable)
 bs10 %>%
  select(ride_segment, ride_bike) %>%
  janitor::tabyl(ride_segment, ride_bike) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>%
  gt(caption = " Tab.2 Categorical Variable Summary")

Tab.2 Categorical Variable Summary
ride_segment	classic_bike	docked_bike	electric_bike
casual	34.18% (873,153)	100.00% (171,129)	43.44% (1,218,761)
member	65.82% (1,681,345)	0.00% (0)	56.56% (1,586,799)

11.2.1 bike type and ride days

# library(flextable)
  bs10 %>% 
  select(ride_segment, ride_days) %>% 
  janitor::tabyl(ride_segment, ride_days) %>%
    adorn_percentages("col") %>%
    adorn_pct_formatting(digits = 2) %>%
    adorn_ns() %>% 
    gt(caption = " Tab. 3 Cross Tab ride segment and ride days")

Tab. 3 Cross Tab ride segment and ride days
ride_segment	busnday	weekend
casual	36.67% (1,423,547)	50.90% (839,496)
member	63.33% (2,458,213)	49.10% (809,931)

11.2.2 Seasons performance

#library(flextable)
bs10 %>% janitor::tabyl(ride_segment, us_seasons) %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(digits = 2) %>%
    adorn_ns() %>% 
   gt(caption = " Tab.4 Cross table seanos")

Tab.4 Cross table seanos
ride_segment	fall	spring	summer	winter
casual	26.10% (590,610)	21.42% (484,832)	48.84% (1,105,292)	3.64% (82,309)
member	29.60% (967,249)	23.78% (777,090)	37.21% (1,216,003)	9.42% (307,802)

11.2.3 Ranking departure stations

library(dplyr)

 bs10 %>%
   dplyr::select(ride_segment, ride_time, departure_station) %>%
   drop_na(departure_station) %>%
   group_by(ride_segment, departure_station) %>%
   dplyr::summarise(
     Total_cnt = n(),
     trip_duration = sum(ride_time), .groups = "drop"
   ) %>%
   dplyr::mutate(
     Top_rides = round((Total_cnt / sum(Total_cnt) * 100), 2),
     Top_trips = round((trip_duration / sum(trip_duration) * 100), 2)
   ) %>%
   # arrange(desc(Top_stations)) %>%
   slice_max(order_by = Top_trips, n = 10) %>%
   gt(caption = " Tab.5 Ranking Departure Stations")

Tab.5 Ranking Departure Stations
ride_segment	departure_station	Total_cnt	trip_duration	Top_rides	Top_trips
casual	Streeter Dr & Grand Ave	56712	1963689.6	1.20	2.55
casual	DuSable Lake Shore Dr & Monroe St	31117	1061685.5	0.66	1.38
casual	Millennium Park	24849	924781.1	0.53	1.20
casual	Michigan Ave & Oak St	24735	854982.7	0.52	1.11
casual	DuSable Lake Shore Dr & North Blvd	23051	642064.3	0.49	0.83
casual	Shedd Aquarium	19769	572465.8	0.42	0.74
casual	Theater on the Lake	18126	523935.2	0.38	0.68
casual	Dusable Harbor	13762	454351.5	0.29	0.59
casual	Montrose Harbor	12293	426148.4	0.26	0.55
casual	Indiana Ave & Roosevelt Rd	13385	404501.2	0.28	0.53

11.2.3 Ranking arrival stations

Tab.6 Ranking Arrival Stations
ride_segment	arrival_station	Total_cnt	Total_time	Top_rides	Top_trips
casual	Streeter Dr & Grand Ave	58781	58781	1.25	1.25
casual	DuSable Lake Shore Dr & Monroe St	29036	29036	0.62	0.62
casual	Michigan Ave & Oak St	26024	26024	0.56	0.56
casual	Millennium Park	26192	26192	0.56	0.56
casual	DuSable Lake Shore Dr & North Blvd	25735	25735	0.55	0.55
member	Kingsbury St & Kinzie St	24224	24224	0.52	0.52
member	Clark St & Elm St	21988	21988	0.47	0.47
member	Wells St & Concord Ln	21580	21580	0.46	0.46
member	Clinton St & Washington Blvd	20108	20108	0.43	0.43
member	University Ave & 57th St	20160	20160	0.43	0.43

11.3 Data Distribution based on BoxPlot

fig1 <- bs10 %>%
  select(ride_segment, ride_time) %>%
  ggplot(aes(x = ride_segment, y = ride_time, fill = ride_segment, col = ride_segment)) +
  geom_boxplot(alpha = 0.5) +
  geom_pointrange(
    mapping = aes(x = ride_segment, y = ride_time),
    stat = "summary",
    fun.mean = mean
  ) +
  scale_y_log10() +
  labs(
    tag = "Fig.1",
    title = "Casual riders have wider trip distance interval",
    y = "Trip duration (minutes)",
    caption = "Source: divvy-tripdata"
  ) +
  theme(
    legend.title = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "none"
  ) +
  scale_x_discrete(labels = c("Casual\n  riders", "Subscriber\n riders")) +
  scale_y_continuous(limits = c(0, 50), breaks = c(0, 10, 15, 20, 30))

fig1 <- fig1 + scale_fill_viridis_d()

fig1

11.4 Data Distribution and Shape

fig2 <- bs10 %>%
   ggplot(aes(x = ride_time, fill = ride_segment, col = ride_segment)) +
   geom_histogram(alpha = .75, binwidth = 0.05) +
   scale_x_log10() +
   theme(
     legend.position = c(.8, .90), 
     legend.title = element_blank(),
     axis.text = element_text(face = "bold"),
     panel.grid.major.y = element_blank()
   ) +
   labs(
     tag = "Fig. 2",
     title = " ",
     subtitle = "Normal Data Distribution with log10 Transformation ",
     caption = "Source: divvy-tripdata",
     y = "Frequency of riders",
     x = "Ride time duration (minutes))"
   ) +
   scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))

fig2

12 Communicating and sharing the findings

The following 7 questions should enable us to get some key insights out of the data we just analysed and summarized.

what is the proportion of trips each group finished

what is the proportion of trips each group cycled

What are bikes rank high

How the ride time varies daily

How the ride time varies monthly

What is the riding pattern/behaviour on business days and weekend

How the riding pattern is impacted by the season variations

12.1 Who actually is riding more frequently?

key insight: Out of total completed rides, 60% was done by the subscribed members

12.2 Who is riding longer (trip duration) ?

Key insights

Casual riders show the highest median value of trip duration . 50% of the casual riders experience trip duration longer than 12 mints during business days and 15minutes on weekends.
Subscribed members exhibit similar pattern of riding for both business days and weekends.

12.3 What are the Most Preferred Bikes

Key insight

classic bikes ranks first, being most preferred by the members with 30% of total bike usage. It is noticeable that the causal rides go more for the electric bikes (22%) compared to classic bikes (16%). Docked bikes only used by casual riders with marginal usage of as little as 3%.

12. 4 Can we Spot any Consistent Riding Pattern?

Key insight

The riding frequency shows higher variation along the months. It increases steadly from January and riches the pic during July and August, then falls until December

12.5 What is the Riding Preference Between Business and Week Days

key insights

The subscribed members rides more often during the business days from Monday to Friday. On weekends both casual and members have similar ride frequency

12. 6 What is the Pick Time Range

key insights

subscribed members experience two period of high demand, around 09:00 and the highest around 18:00
Casual readers show a steady demand for bikes and it reaches the pick at 18:00
On weekends, both groups exhibit a similar trend.

12.7 How the Seasonal Changes Impacts on the Bike Demand

key insight: Rideship is Highest in summer for both groups,declining through fall,spring and ultimately the winter

12.8 Ranking Departure Stations

12.9 Ranking Arrival Stations

13. Take away insights from the customer segment comparison

Subscribed member use bikes more frequently than the casual riders under the same circumstances
The casual riders cycle longer on average than the annual subscribed members.
Annual subscribed members experience a substantially high demand for bikes in morning around 09:00 am and late in the evening at 18:00 am. This is the pick time for both ride segments.
The summer season both segments experience the highest demand for the bikes and this most likely to be associated with the favorable whether conditions.
Our analysis has limited socio economic indicators to draw on evidence based recommendations for sound data driven business making
Our analysis is merely descriptive and it can not be conclusive unless supported and validated with sound statistical analysis.

14. Recommendations for effective marketing strategy targeted to casual riders

Price is a decisive factor in consumer choice (what to buy ) and how much of it to buy .

Provided they get the pricing correctly some hints to be considered would include :

To cater the group needs with differentiated offers aligned with trip duration (short and long riders )
Design offers in line with the daily bike demand
Matching bike supply with the demand based on the results of the dock station ranking

Divvy Bike Share Capstone Project

Osorio O. Matucurane

2023-05-08

1. Project Background

2. The Cyclistic Business Profile at a Glance

3. Business Case Identification

4. Assuring Data Quality

4.1 Reliability

4.2 Originality

4.3 Comprehensiveness

4.4 Currency

4.5 Cited

5. Sourcing and Preparing The Data

5.1 Preliminary Data Preparation

5.2 Data set conformity with the tidy format

6. Data Analysis Workflow

7. Getting into R Programming

7.1 Loading The Packages

7.2 Setting the Theme for the Plots

7.3 Loading The Data

7.2.1 Checking and validating the Data Type

8. Data Exploration

8.1 Quick Sense of the Dataset at a Glance

8. 3 Converting character type to factor data type objects

8.4 Parsing Date Time

8.5 Extracting Date Components From Datetime Variable

8.6 Renaming and Reordering the Variables

9. Data Cleaning

9.1 Overview of the Dataset

9.2 Describing data with Summary Table

Data Frame Summary

diagnosis

9.4 Dealing with the messy

9.4.1 Duplicated records

9.4.2 Flawed Data Entry on Categorical Variables

9.4.2.1 Docking Stations

9.4.3 Dealing with missing value

9.4.3.1 Where are missing values

9.4.3.2 What to do with the missing values

9.4.4 Dealing with implausible out of boundaries values

9.4.4.1 Values out of the expected range

9.4.4.2 Visual Exploration - Extremely low values

9.4.4.3 Visual Exploration of Extremely high values with scatter plot

9.4.4.4 Visual Exploration of Extremely high values with histogram

9.4.4.5 Removing potential outliers

9.4.5 Examining the ride date variable

9.5 Final verification of the dataset

9.5.1 Keeping the variables of interest

9.5.2 Verification and validation

Data Frame Summary

verify

9.5.3 Dimensions of the final datase

10. Data Wrangling

10.1 Grouping days into categories

10.2 Adding Seasons Variable

11 Data Summarizing

11.1 Numerical variables

11.1.1 Key descriptive indicators

11.2 Summarizing Categorical Variables

11.2.1 bike type and ride days

11.2.2 Seasons performance

11.2.3 Ranking departure stations

11.2.3 Ranking arrival stations

11.3 Data Distribution based on BoxPlot

11.4 Data Distribution and Shape

12 Communicating and sharing the findings

12.1 Who actually is riding more frequently?

12.2 Who is riding longer (trip duration) ?

12.3 What are the Most Preferred Bikes

12. 4 Can we Spot any Consistent Riding Pattern?

12.5 What is the Riding Preference Between Business and Week Days

12. 6 What is the Pick Time Range

12.7 How the Seasonal Changes Impacts on the Bike Demand

12.8 Ranking Departure Stations

12.9 Ranking Arrival Stations

13. Take away insights from the customer segment comparison

14. Recommendations for effective marketing strategy targeted to casual riders

15. Relevant Resources