This project is a requirement of the Google Data Analytics Certificate, designed explicitly for novices with no prior expertise. The case study seeks to consolidate the analytical skills developed by applying typical procedures and methods through a systematic data analytics workflow and using the most appropriate tools. The dataset, with key usage variables, is collected by a third party and made publicly available.
Cyclistic is a fictitious company that runs the Divvy bike sharing business. Its marketing department identifies a potential business opportunity to increase revenues and engages the data analytics team to uncover evidence and insights from the available bike sharing usage data, ultimately to support data-driven decision making. At the heart of the project is cleaning, analyzing, summarizing and visualizing the results in the most effective way. Inferential statistics (hypothesis testing) is beyond the scope of this project.
A great deal of time has been spent diving deep to master the grammar of data manipulation, the grammar of graphics and the art of data visualization, only to realize at the end that I have just scratched the surface of R. The challenge ahead on my journey as an aspiring data analyst lies in mastering string manipulation and getting to grips with functional programming before I become an R power user.
docking station
Cyclistic is a bicycle renting business in Chicago, with an innovative concept of bike sharing. It delivers great value to the customer base it serves:
A micromobility solution with the health benefits associated with cycling on a regular basis
Targeting Chicagoan commuters to work, school and social encounters
Visitors looking to unlock Chicago
Runs a fleet of over 5,800 bikes
A scattered network of 682 stations/kiosks from which to unlock and return the bikes
Divvy Bikes, Chicago’s Bike Sharing System
Cyclistic offers essentially three bike types with distinct features
Bikes are rented, shared and charged on an hourly basis
A bike is unlocked at one station and returned at another at the customer's convenience
There are two customer segments:
Casual riders - day pass for a single trip or multiple trips
Subscribers - annual membership
The company is setting out to increase overall business performance, especially the revenue from non-subscribed casual riders. It is believed that this customer segment offers a great deal of potential to add value to the company, and efforts should be made to bring as many of them as possible on board with an annual membership. However, there is an information gap concerning the critical factors/variables on how the casual group and the subscribed members use the bike sharing service. Here lies the challenge: to design a solid, well-founded strategy to persuade casual riders to adopt the annual package. With that challenge comes the need to gather evidence and meaningful, quality information to back up decision making for impactful results. This project report uses data collected in 2022, analyzes the data to uncover key insights, patterns and trends in the bike rental service usage, and communicates the findings in an easy, engaging and persuasive way to the key stakeholders.
The quality of the data by and large determines the quality of the findings. The goal of having high quality data is to make empowered, informed and data-driven decisions in order to achieve improved business performance. To avoid costly mistakes and false conclusions we have to ensure that the data meets data quality standards. As is commonly cited in data analytics and data science resources, "garbage in, garbage out". We assess the quality of our dataset against the measures or dimensions of data quality. Each dimension plays a critical role in ensuring the overall quality of the data.
data quality
Our data is reliable and accurate, being collected by a credible source using advanced data collection technology.
Our dataset was accessed from the original source, as first-recorded data points, so there are no reasonable data quality concerns on this ground.
We have fairly comprehensive and useful information. With the available data we are able to compare the two groups as far as usage is concerned, but not the preferences and choices that lead them to buy one or another product/service. Those are largely determined by other variables such as socio-economic and demographic factors, so well-founded and substantiated recommendations will be hard to draw.
This is the time dimension (also referred to as timeliness). Our data is sufficiently up to date for its intended use.
This criterion is satisfied. In addition, we keep in mind that the dataset sample should meet the requirements. It seems that our records cover the entire population, and after some data cleaning chores we should be left with a dataset that does not compromise representativeness and assures unbiasedness.
We primarily relied on the features offered by MS Power Query to get familiarized with the dataset. For each file we carried out the following preliminary transformations:
Column profile (count, distinct, unique, min, max, empty string)
Column quality (validity, error, percentage) for ride date, bike_type, ride_type and station name
Splitting the datetime (time stamp) into date and time columns
Computing the trip duration based on the ended time and started time variables
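For reference, a minimal R equivalent of the date/time split and trip duration steps could look like the sketch below; it assumes raw started_at and ended_at POSIXct columns, and the two example timestamps are taken from the dataset sample shown later.

library(dplyr)
library(lubridate)

trips <- data.frame(
  started_at = ymd_hms(c("2022-01-14 11:30:00", "2022-05-21 17:27:40")),
  ended_at   = ymd_hms(c("2022-01-14 11:50:00", "2022-05-21 17:33:12"))
)

trips %>%
  mutate(
    started_ride_date = as_date(started_at),      # split out the date part
    started_ride_time = hms::as_hms(started_at),  # split out the time part
    ride_minutes = as.numeric(difftime(ended_at, started_at, units = "mins"))
  )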
Tidy data is a concept from Hadley Wickham's 2014 paper Tidy Data. Tidy datasets are easy to manipulate, model and visualize; they are rectangular data with a specific structure:
Every row is an observation
Every column represents a variable
Every entry in the cells of the data frame is a value
It is worth quoting:
"Happy families are all alike, but every unhappy family is unhappy in its own way." ~ Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way." ~ Hadley Wickham
Tidy Data
Google offers a well-defined data workflow which acts as a set of guidelines for planning, organizing, and implementing data analytics projects. Google's six steps are:
Ask
Prepare
Process
Analyze
Share
Act
This workflow is self-explanatory. Noticeably, the process outlined is linear; however, in many cases the process is iterative, requiring stages to be repeated and revisited. Our project is guided by the workflow below.
typical data analytics workflow
R programming is our data analysis tool of choice, one of the world's most popular languages for statistical analysis. R is an open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. We import and load the data into RStudio, where we set up an R project.
Below we load the collection of packages that will be used throughout the data analysis workflow - useful packages.
knitr::opts_chunk$set( warning=FALSE, message=FALSE)
library(knitr)
library(dplyr)
library(readr)
library(purrr)
library(magrittr)
library(gt)
library(scales)
library(ggthemes)
library(ggplot2)
library(tidyr)
library(forcats)
library(lubridate)
library(pandoc)
library(skimr)
library(summarytools)
library(RColorBrewer)
library(viridis)
library(flextable)
library(janitor)
library(hablar)
library(styler)
library(patchwork)
library(cowplot)
theme_set(theme_clean() +
theme(
plot.subtitle = element_text(
hjust = 0.5,
size = 14,
color = "skyblue",
face = "bold",
family = "Tahoma"
),
plot.caption = element_text(
hjust = 1,
size = 14, color = "grey",
face = "italic"
),
plot.title = element_text(
hjust = 0.5,
size = 16,
color = "skyblue",
face = "bold",
family = "Tahoma"
),
plot.tag = element_text(
size = 14,
color = "grey",
face = "bold"
),
axis.title = element_text(
color = "steelblue",
face = "bold",
size = 15
),
axis.line = element_line(linewidth = 1.5, color = "darkgrey"),
axis.text = element_text(face = "bold",
color = "#993333",
size = 12),
legend.title = element_blank(),
legend.position = "top"
))
The dataset comes in 12 separate monthly files of individual Divvy bike sharing trips, including the origin and destination dock stations and time stamps for each trip (start and end). Source: https://divvy-tripdata.s3.amazonaws.com/index.html .
We import all files at once programmatically instead of one at a time. There is a claim holding that if a task is to be repeated, then it is exactly the kind of task the computer is designed to perform, and it does so efficiently. There are many ways to skin the cat, and likewise we find many ways to import multiple files into R. The best method so far seems to be fread() from the data.table package. As a matter of taste, we will go with read_csv() from the Tidyverse ecosystem, as it provides an argument to specify the column types. This should free us from the tedious work of converting the date variables from character type to the POSIXct and POSIXlt date types. We list all files with the list.files() function and combine them into a single data frame with map_df() from the purrr package.
library(tidyverse)
my_dir <- "C:/Users/USER/Documents/DataAnalytics/Google_Capstone_Project/data/Data_RStudio"
all_files <- list.files(path = my_dir, pattern = "*.csv", full.names = TRUE)
bs_01 <- all_files %>% map_df(~ read_csv(.,
col_names = TRUE,
col_types = cols(
started_at = col_datetime(format = "%m/%d/%Y %H:%M"),
ended_at = col_datetime(format = "%m/%d/%Y %H:%M"),
started_ride_date = col_date(format = "%m/%d/%Y"),
ended_ride_date = col_date(format = "%m/%d/%Y")
)
))
read_csv() would have guessed the variable classes and assigned the character data type to the date variables, so we have specified the date classes explicitly using the col_types argument.
Moving on, our bike share dataset has been loaded as an R object bs_01 with dimensions of 5667717 rows and 19 columns, or simply: dataset dimension (nr rows, nr columns) = (5667717, 19).
Data types matter a great deal. The operations we can perform on a column depend very much on its type (numeric, character/string, etc.). Let's revisit the common types of operators:
Arithmetic operators - perform calculations on values
Relational operators - enable comparisons to be made; they are used in condition testing
Logical operators - combine relational expressions to express more complex decisions
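A quick illustration of the three operator families (the values are made up for illustration only):

# Arithmetic: compute a new value
ride_minutes <- 21.85
ride_hours <- ride_minutes / 60 # 0.3641667

# Relational: compare values (used in condition testing)
is_long_ride <- ride_minutes >= 20 # TRUE

# Logical: combine comparisons into a more complex decision
ride_minutes > 5 & !is_long_ride # FALSE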
We can get a glimpse of each column's data type with summary.default().
bs1 <- bs_01
summary.default(bs1)
## Length Class Mode
## ride_id 5667717 -none- character
## rideable_type 5667717 -none- character
## started_at 5667717 POSIXct numeric
## ended_at 5667717 POSIXct numeric
## start_station_name 5667717 -none- character
## start_station_id 5667717 -none- character
## end_station_name 5667717 -none- character
## end_station_id 5667717 -none- character
## start_lat 5667717 -none- numeric
## start_lng 5667717 -none- numeric
## end_lat 5667717 -none- numeric
## end_lng 5667717 -none- numeric
## member_casual 5667717 -none- character
## started_ride_date 5667717 Date numeric
## started_ride_time 5667717 hms numeric
## ended_ride_date 5667717 Date numeric
## ended_ride_time 5667717 hms numeric
## ride_length 5667717 -none- numeric
## ride_minutes 5667717 -none- numeric
Exploratory data analysis (EDA) is a process widely used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. Originally developed by the American mathematician John Tukey in the 1970s, it is quite helpful for:
identifying obvious errors
better understanding patterns within the data
detecting outliers or anomalous events
finding interesting relations among the variables
Before diving deep into our journey of understanding the data, in this first stage we explore the data mainly column-wise, fix a few issues and get the data ready for cleaning in the following stage.
With View(bs1) we get a display of our dataset in a tabular format; we can do an ocular scan, filter and get familiarized with the dataset. In addition, with head(bs1) and tail(bs1) we get access to the top rows (head) and last rows (tail), in this way confirming that our dataset has been imported completely. Additionally, we can draw a random sample of n observations from the dataset.
# random sample, n = 10
bs1 %>%
sample_n(10) %>%
gt(caption = "Sample of the Dataset with 10 Random Observations")
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | started_ride_date | started_ride_time | ended_ride_date | ended_ride_time | ride_length | ride_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18CB2761A890BCA3 | classic_bike | 2022-07-19 13:58:00 | 2022-07-19 14:20:00 | Sheridan Rd & Loyola Ave | RP-009 | Kedzie Ave & Bryn Mawr Ave | KA1504000167 | 42.00104 | -87.66120 | 41.98223 | -87.70889 | member | 2022-07-19 | 13:58:45 | 2022-07-19 | 14:20:36 | 0.015173611 | 21.85 |
| A07A17579FD564B9 | electric_bike | 2022-09-21 22:49:00 | 2022-09-21 22:51:00 | Broadway & Belmont Ave | 13277 | Wilton Ave & Belmont Ave | TA1307000134 | 41.94010 | -87.64545 | 41.94023 | -87.65294 | member | 2022-09-21 | 22:49:50 | 2022-09-21 | 22:51:46 | 0.001342593 | 1.93 |
| 466635A9028203ED | classic_bike | 2022-10-22 12:11:00 | 2022-10-22 12:50:00 | Southport Ave & Clark St | TA1308000047 | Clark St & Berwyn Ave | KA1504000146 | 41.95708 | -87.66420 | 41.97803 | -87.66856 | casual | 2022-10-22 | 12:11:18 | 2022-10-22 | 12:50:09 | 0.026979167 | 38.85 |
| 639C0D36743DAE63 | electric_bike | 2022-06-05 09:37:00 | 2022-06-05 10:01:00 | NA | NA | NA | NA | 41.89000 | -87.72000 | 41.94000 | -87.72000 | member | 2022-06-05 | 09:37:19 | 2022-06-05 | 10:01:09 | 0.016550926 | 23.83 |
| 293715AE4CF08CA9 | electric_bike | 2022-06-28 03:43:00 | 2022-06-28 03:52:00 | Logan Blvd & Elston Ave | TA1308000031 | NA | NA | 41.92940 | -87.68424 | 41.93000 | -87.72000 | casual | 2022-06-28 | 03:43:12 | 2022-06-28 | 03:52:26 | 0.006412037 | 9.23 |
| DD0A3551EBD04D88 | classic_bike | 2022-05-21 17:27:00 | 2022-05-21 17:33:00 | Desplaines St & Kinzie St | TA1306000003 | Orleans St & Hubbard St | 636 | 41.88872 | -87.64445 | 41.89003 | -87.63662 | casual | 2022-05-21 | 17:27:40 | 2022-05-21 | 17:33:12 | 0.003842593 | 5.53 |
| D9BE3DE710D66304 | classic_bike | 2022-05-22 15:54:00 | 2022-05-22 16:09:00 | Michigan Ave & Washington St | 13001 | Columbus Dr & Randolph St | 13263 | 41.88398 | -87.62468 | 41.88473 | -87.61952 | casual | 2022-05-22 | 15:54:19 | 2022-05-22 | 16:09:38 | 0.010636574 | 15.32 |
| 9C4BEBB694C180AF | classic_bike | 2022-06-04 15:49:00 | 2022-06-04 16:02:00 | Jefferson St & Monroe St | WL-011 | Clinton St & Roosevelt Rd | WL-008 | 41.88033 | -87.64275 | 41.86712 | -87.64109 | casual | 2022-06-04 | 15:49:07 | 2022-06-04 | 16:02:04 | 0.008993056 | 12.95 |
| D1588A71286D96F0 | electric_bike | 2022-06-03 11:20:00 | 2022-06-03 11:32:00 | St. Clair St & Erie St | 13016 | Columbus Dr & Randolph St | 13263 | 41.89441 | -87.62270 | 41.88473 | -87.61952 | member | 2022-06-03 | 11:20:25 | 2022-06-03 | 11:32:51 | 0.008634259 | 12.43 |
| 1937B9007C59F991 | classic_bike | 2022-01-14 11:30:00 | 2022-01-14 11:50:00 | Streeter Dr & Grand Ave | 13022 | Rush St & Cedar St | KA1504000133 | 41.89228 | -87.61204 | 41.90231 | -87.62769 | casual | 2022-01-14 | 11:30:00 | 2022-01-14 | 11:50:00 | 0.013888889 | 20.00 |
This looks quite pretty!
Converting to factors is especially useful for both ordinal and nominal variables with a limited number of categories, since it makes entry-related errors easy to hunt down and spot, and simplifies data verification and validation. The variables bike type and customer type are suitable candidates; we convert them with hablar's convert() and inspect the categories with levels().
library(hablar)
bs2 <- bs1 %>%
convert(fct(rideable_type, member_casual))
levels(bs2$rideable_type)
## [1] "classic_bike" "docked_bike" "electric_bike"
levels(bs2$member_casual)
## [1] "casual" "member"
We have already touched on this step of converting date columns to the Date class. The Date object class is tricky: stored in memory (under the hood) as numeric, it is often imported as character. Working with dates and times in base R can be daunting, a bit cumbersome and clumsy. lubridate adopts a more intuitive approach to parsing dates and times. Some useful functions to coerce date characters to the date class include ymd(), dmy() and ydm(). We found it handy to do this with the Power Query tool before importing the dataset.
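For illustration, a few of these parsers in action (the date strings are made-up examples):

library(lubridate)

ymd("2022-07-19") # year-month-day -> Date "2022-07-19"
dmy("19-07-2022") # day-month-year -> Date "2022-07-19"
ydm("2022-19-07") # year-day-month -> Date "2022-07-19"
mdy_hm("07/19/2022 13:58") # datetime stamp -> POSIXct "2022-07-19 13:58:00 UTC"

With the dates already parsed, we derive the year, month and weekday components: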
bs2 <- bs2 %>%
mutate(
ride_year = lubridate::year(started_ride_date),
ride_month = lubridate::month(started_ride_date, label = TRUE, abbr = TRUE), # month in year
ride_weekday = lubridate::wday(started_ride_date, label = TRUE, abbr = TRUE) # day of week (name)
)
# cross checking
levels(bs2$ride_month)
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
levels(bs2$ride_weekday)
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
For better syntax and readability we update the old column names with new ones using the rename() function. We focus on those variables of interest that we will be working with on our journey of revealing interesting variation, patterns and trends.
bs3 <- bs2 %>%
dplyr::rename(
"departure_station" = "start_station_name",
"arrival_station" = "end_station_name",
"departure_time" = "started_ride_time",
"arrival_time" = "ended_ride_time",
"ride_bike" = "rideable_type",
"ride_segment" = "member_casual",
"ride_date" = "started_ride_date",
"ride_time" = "ride_minutes"
) %>%
select(ride_id, ride_bike, ride_segment, ride_time, ride_date,departure_station, arrival_station, everything())
# cross checking
bs3 %>% head( 2) %>%
gt(caption = "Cross checking new variable names")
| ride_id | ride_bike | ride_segment | ride_time | ride_date | departure_station | arrival_station | started_at | ended_at | start_station_id | end_station_id | start_lat | start_lng | end_lat | end_lng | departure_time | ended_ride_date | arrival_time | ride_length | ride_year | ride_month | ride_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 550CF7EFEAE0C618 | electric_bike | casual | 7.52 | 2022-08-07 | NA | NA | 2022-08-07 21:34:00 | 2022-08-07 21:41:00 | NA | NA | 41.93 | -87.69 | 41.94 | -87.72 | 21:34:15 | 2022-08-07 | 21:41:46 | 0.005219907 | 2022 | Aug | Sun |
| DAD198F405F9C5F5 | electric_bike | casual | 14.03 | 2022-08-08 | NA | NA | 2022-08-08 14:39:00 | 2022-08-08 14:53:00 | NA | NA | 41.89 | -87.64 | 41.92 | -87.64 | 14:39:21 | 2022-08-08 | 14:53:23 | 0.009745370 | 2022 | Aug | Mon |
Voilà, this is awesome!
This is one of the crucial and intricate stages of the data analytics workflow. Once we have aligned the data object types and the desirable variable names, we move on to explore and address data entries (observations) at row level in order to get the dataset into the correct and desirable format for efficient processing. This consists essentially of identifying, correcting or removing inaccurate, mislabeled, incorrectly formatted or incomplete raw data for downstream analysis purposes. In addition, we deal with the typical data issues:
Duplicated records
Recording errors
Missing values
Unusual values, most commonly known as outliers
What data cleaning typically means is partly illustrated below. The bottom line is to get the data formatted in a way the computer reads, understands and computes, so that we get consistent outcomes.
Illustration of data cleaning
It is widely claimed that plotting the data, especially the numeric continuous variables, should be the very first step when getting a new dataset. It potentially offers a better picture of the data distribution and variation, and a quick catch of potential outliers. In our case it does not give a pretty picture, due to the little variation and the excessive size of the dataset. But before visual exploration, let's get a glance or glimpse at our dataset with a summary table. There is a bunch of helpful packages with summary table functions, from base R's summary() and psych's describe() to dlookr, mice, skimr's skim(), mosaic, summarytools, naniar and janitor, just to mention a few. The choice is a matter of taste, but also of the detailed information we can get from it.
This is an analytical way to get an overview of the entire data frame. We can use the dfSummary() function from the summarytools package. The columns are summarized by class/type, such as character, factor and numeric. We select six variables of interest:
ride_segment
ride_bike
ride_time
departure_station
arrival_station
ride_date.
We also include the identifier ride_id just to control for dupes (duplicated entries), and use dfSummary() to summarize and describe the dataset structure with the selected variables. A sketch of the call is shown below, followed by the rendered table.
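A minimal sketch of the call, mirroring the dfSummary() pattern used later in the cleaning verification; the object name overview and the 10-observation sample are assumptions inferred from the rendered output below.

library(summarytools)

overview <- bs3 %>%
  select(ride_id, ride_segment, ride_bike, ride_time,
         departure_station, arrival_station, ride_date) %>%
  sample_n(10) # small random sample keeps the rendered summary lightweight

print(dfSummary(overview, graph.magnif = 0.75), method = "render")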
| No | Variable | Type | Valid | Missing |
|---|---|---|---|---|
| 1 | ride_id | character | 10 (100.0%) | 0 (0.0%) |
| 2 | ride_segment | factor | 10 (100.0%) | 0 (0.0%) |
| 3 | ride_bike | factor | 10 (100.0%) | 0 (0.0%) |
| 4 | ride_time | numeric | 10 (100.0%) | 0 (0.0%) |
| 5 | departure_station | character | 9 (90.0%) | 1 (10.0%) |
| 6 | arrival_station | character | 8 (80.0%) | 2 (20.0%) |
| 7 | ride_date | Date | 10 (100.0%) | 0 (0.0%) |

The Stats / Values, Freqs and Graph columns of the rendered dfSummary output are not reproduced here.
Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17
From the output, we get an account of the following data issues:
duplicates
distinct values for the numerical variables
missing values
distribution (sketch for bar charts and histogram)
frequency (cat variables)
summary statistics (five numbers) for the numeric continuous variables
This looks pretty; we only miss an explicit account of outliers, but the range offers some valuable hints.
In a nutshell, from the summary output we spotted issues with three variables:
ride time, with implausible extreme values
departure station with missing values
arrival station with missing values
Duplicate observations should be identified and removed. Based on the summary table, there are no duplicated records: the total number of unique observations/cases matches the total number of observations in the dataset.
Total unique records 5667717
Total number of observations 5667717
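A quick way to cross-check this is sketched below, comparing the number of distinct ride identifiers with the number of rows:

# no duplicates when every ride_id value is unique
n_distinct(bs3$ride_id) == nrow(bs3)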
We use frequency tables to examine and diagnose possible inconsistencies that can result from incorrect entries or typos. Our explicitly categorical variables are trouble free, as shown in the summary table above.
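For instance, simple frequency counts make stray categories easy to spot; a minimal sketch:

bs3 %>% count(ride_bike, sort = TRUE)    # frequency of each bike type
bs3 %>% count(ride_segment, sort = TRUE) # frequency of each customer segment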
This is not worth setting as a categorical variable, given that as a nominal variable each value is treated as a character/string and cannot be meaningfully grouped. We will examine the frequency (how many times the same value occurs) and seek to catch some pattern. How many unique bike departure stations do we have, and how many are expected? Can we match them? All in all, the network consists of 692 docking stations. Is there a meaningful way to group the stations?
total_dock_stations <- bs3 %>%
  summarize(distinct_stations = n_distinct(departure_station, na.rm = TRUE)) # count distinct departure stations
dock_stations <- bs3 %>% select(departure_station) %>%
drop_na(departure_station)%>%
table()
There are 1675 distinct departure dock stations. It seems we have limited options to scan and detect irregularities for character variables with no levels.
We could look at the frequencies and consider excluding data entries with very low frequency, such as below 1%.
Data entries with very low frequencies are suspicious, and on business grounds they add no value. These observations are possibly caused by data recording errors. We might also have data with mixed cases (lower and upper), in which case R treats them as different values. We could explore the text data using functions like filter(), str_detect() and count() to screen and catch data entries based on partial matches or common patterns. This is beyond the scope of this project, so we will not delve into it, but a brief illustration is sketched below.
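Even so, a brief sketch of what such screening might look like; the search pattern "streeter" is just an example.

library(stringr)

bs3 %>%
  drop_na(departure_station) %>%
  count(departure_station, sort = TRUE) %>%
  filter(str_detect(str_to_lower(departure_station), "streeter")) # partial, case-insensitive match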
Missing values have to be identified and examined, and a decision taken on whether they are kept, omitted/removed or replaced, and if replaced, what method is to be used.
From our summary table, we have missing values on two variables.
Cross checking with is.na() and sum()
functions we get 833064 missing values for
departure_station and 892742 for
arrival_station
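A per-column cross-check of the kind quoted above:

sum(is.na(bs3$departure_station)) # 833064 missing departure stations
sum(is.na(bs3$arrival_station))   # 892742 missing arrival stations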
From the outset it is important to understand how the missing values are coded and how they differ from other coded values like none, empty or NaN. In R, missing values are represented by the symbol NA (not available). On the other hand, we have NULL, which represents a value that simply doesn't exist, rather than one that exists but is unknown. In dealing with missing values, Greg Martin suggests that they be examined, trying to understand the pattern of their occurrence (the missingness). Are missing values occurring systematically or randomly? Are they associated with other variables? In our case, we might expect that the missing values are associated with long trip durations, suggesting that the bikes went missing (stolen). In that event, the ride end time should also be missing (unreported). But this does not hold at all, as we have complete records for the start and end time variables.
All we find is that missing values in the start station are associated with missing values in the start station id.
library(dplyr)
miss_values <- bs3 %>%
select(departure_station, arrival_station, start_station_id, end_station_id, start_lat, end_lat, start_lng, end_lng) %>%
dplyr::filter(is.na(departure_station) | is.na(arrival_station)) %>%
count()
miss_values
## # A tibble: 1 × 1
## n
## <int>
## 1 1298357
We could drop or remove the missing values, but we decide to keep them to avoid losing data points from variables of interest (customer group, date, ride time) in those rows where they occur.
Observations lying far away from the others are typically unusual and often considered outliers. Dealing with outliers is a highly contentious issue among statisticians, data scientists and data analysts. It is worth highlighting why outliers matter a great deal and attract hot debate: insofar as the methods and techniques used in descriptive and inferential statistics hinge on distributional assumptions, outliers may negatively impact the results and lead to misleading or flawed findings. We don't take outliers lightly. As a matter of fact, in a business environment the observations considered outliers could be a niche or market segment with specific needs to be catered for and deserving of special attention. It could be a case of the well-known 80:20 principle. We will explore the distribution and pattern of the numerical continuous variables through the scatterplot, boxplot and histogram, and match them against Cyclistic's core business - renting bikes on a shared basis for micromobility.
bs3 %>%
select(ride_segment, ride_time) %>%
group_by(ride_segment) %>%
dplyr::summarize(
min = min(ride_time),
max = max(ride_time)
) %>%
ungroup()
## # A tibble: 2 × 3
## ride_segment min max
## <fct> <dbl> <dbl>
## 1 casual -137. 41387.
## 2 member -10353. 1560.
The maximum values are way beyond a typical trip duration in a micromobility service, which is supported by the literature. Cyclistic rents bikes for up to 45 minutes for both casual riders and subscriber members. The negative values are indisputably implausible.
Ideally, a bike trip duration should be:
greater than or equal to (>=) 5 minutes
less than 60 minutes for commuting purposes
between 60 and 180 minutes for leisure and sport
Notably, bikes, just like any commercial product, are designed and manufactured with intrinsic features to suit their end-use purpose. They are customized and market oriented.
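As an illustration only (these duration bands are not part of the original dataset), such bands could be encoded with case_when():

bs3 %>%
  mutate(trip_purpose = case_when(
    ride_time < 5 ~ "very short / suspicious",
    ride_time < 60 ~ "commuting range",
    ride_time <= 180 ~ "leisure / sport range",
    TRUE ~ "unusually long"
  )) %>%
  count(trip_purpose)

We start the actual cleaning by dropping the implausible non-positive durations: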
bs4 <- bs3 %>% filter(ride_time > 0) # dropping records with non-positive trip duration
summary(bs4$ride_time) # verification
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02 5.83 10.28 19.45 18.47 41387.25
Trips that last less than 60 seconds are suspicious and worth some attention. In fact, they typically occur when bikes are checked out and returned to the same station, being bikes with some problem that prevents them from starting a trip.
bs4 %>%
dplyr::select(ride_segment, ride_time) %>%
filter(ride_time < 5) %>%
ggplot(aes(x = ride_time, col = ride_segment)) +
geom_freqpoly(size = .75, binwidth = 0.02) +
ggtitle("Visual Exploration of Data")
Note: The expected relationship is spotted: the longer the trip, the fewer the returns to the same station. We set a threshold of 1.0 minute as the minimum ride length accepted to travel from one station to another.
bs5 <- bs4 %>%
filter(ride_time > 1.0)
summary(bs5$ride_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 6.07 10.52 19.87 18.75 41387.25
Scatter plots help to visualize and easily spot the pattern and trend of the data points. geom_jitter() offers greater visibility (less overplotting) by adding random noise to both the width and height of each point.
expl_scatter <- bs5 %>%
select(ride_segment, ride_date, ride_time) %>%
ggplot(aes(x = ride_date, y = ride_time, col = ride_segment)) +
geom_jitter(size = 1.5, shape = 21, width = 0.15, height = 0.2, alpha = .5) +
facet_wrap(~ride_segment, nrow = 2, scales = "free") +
labs(title = "exploratory scatter plot",
subtitle =" scattered potential outlier data points",
caption = "Source: divvy-tripdata")
expl_scatter +
scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))
What stands out from the scatter plot?
Clear points falling away from the most common values
Casual riders with ride times above 1000 minutes
Annual members with ride times above 500 minutes
A histogram is a popular visual tool in which a numeric variable is typically mapped to the x-axis. The x-axis is divided into equally sized sections, which we call bins or buckets, ranging from the lowest to the highest value of the variable; we then count the number of observations in each bin and plot a bar for each bin. The height of the bar corresponds to the count of the observations in that bin.
freqpol <- bs5 %>%
select(ride_segment, ride_time) %>%
ggplot(aes(x = ride_time, fill = ride_segment, color = ride_segment)) +
geom_freqpoly(size = 0.75) +
labs(
title = "Exploring the Data Distribution log transform",
subtitle = "Decteting outboundaries data points",
caption = "Source: divvy-tripdata"
) +
theme(panel.grid.major.y = element_blank(),
legend.title = element_blank()) +
scale_x_log10(
n.breaks = 8,
labels = number_format(scale = 1e-3, suffix = " k")
)
freqpol <- freqpol+ scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K", big.mark = ","))
library(plotly)
ggplotly(freqpol)
Ideally, outliers are set to missing values instead of being dropped. Our dataset is sufficiently large to go the other way around.
Based on our distribution, between the 25th and 75th percentiles the riding time is less than 30 minutes. However, there is a significant portion of riders beyond that interval and we may want to consider them in our analysis for business decision making. We take advantage of the plotly interactivity feature and pick the ride time value to use as the outlier threshold. We find the value of 2.3835861 on the logarithm scale as the upper limit, beyond which values are typically out of bounds and of limited frequency.
density_plot <- bs5 %>%
select(ride_segment, ride_time) %>%
ggplot(aes(x = ride_time, color = ride_segment)) +
geom_histogram(aes(y = ..density..), fill = "white", alpha = 0.2, position = "dodge") +
geom_density(lwd = 1.2, linetype = 2) +
theme(panel.grid.major.y = element_blank(),
legend.title = element_blank()) +
scale_x_log10(limits = c(10^0, 10^2.3835861)) +
labs(
title = "Distribution of riders with log transform",
subtitle = "Limiting the trip duration to 241.8723 minutes",
x = "Trip duration (minutes)",
caption = "Source: divvy-tripdata"
)
library(plotly)
ggplotly(density_plot)
bs6 <- bs5 %>%
filter(ride_time <= 10^2.3835861)
summary(bs6$ride_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 6.05 10.50 15.86 18.65 241.82
summary(bs6$started_at)
summary(bs6$ended_ride_date)
bs7 <- bs6 %>% filter(ended_ride_date <= as.Date("2022-12-31"))
summary(bs7$ended_ride_date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2022-01-01" "2022-05-28" "2022-07-22" "2022-07-19" "2022-09-16" "2022-12-31"
bs8 <- bs7 %>% select(-c(ride_id, started_at, ended_at,start_station_id,
end_station_id, start_lat, start_lng, end_lat, end_lng, ride_year, ride_length ))
names(bs8)
## [1] "ride_bike" "ride_segment" "ride_time"
## [4] "ride_date" "departure_station" "arrival_station"
## [7] "departure_time" "ended_ride_date" "arrival_time"
## [10] "ride_month" "ride_weekday"
Deal! Pretty cool
verify <- bs8 %>% select(ride_segment, ride_bike, ride_time, departure_station,
                         arrival_station, ride_date)
print(dfSummary(verify, graph.magnif = 0.75),
method = 'render')
| No | Variable | Type | Valid | Missing |
|---|---|---|---|---|
| 1 | ride_segment | factor | 10 (100.0%) | 0 (0.0%) |
| 2 | ride_bike | factor | 10 (100.0%) | 0 (0.0%) |
| 3 | ride_time | numeric | 10 (100.0%) | 0 (0.0%) |
| 4 | departure_station | character | 9 (90.0%) | 1 (10.0%) |
| 5 | arrival_station | character | 8 (80.0%) | 2 (20.0%) |
| 6 | ride_date | Date | 10 (100.0%) | 0 (0.0%) |

The Stats / Values, Freqs and Graph columns of the rendered dfSummary output are not reproduced here.
Generated by summarytools 1.0.1 (R version 4.2.0)
2023-05-17
Our cleaned dataset consists of 5531187 rows and 11 columns.
In this stage we perform some relevant data transformations ahead of the data analysis stage. Very often the transformations performed include grouping a continuous variable into categories with bins or buckets.
In our case, we introduce additional variables based on existing ones. This will allow us to drill down and uncover some valuable insights in our dataset.
weekends <- c("Sun", "Sat")
bs9 <- bs8 %>%
mutate(ride_days = as.factor(if_else(
ride_weekday %in% weekends, "weekend", "businday"
)), .after = ride_weekday)
levels(bs9$ride_weekday) # verification
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
bs10 <- bs9 %>%
mutate(ride_days = as.factor(if_else(
ride_weekday %in% weekends, "weekend", "busnday"
)), .after = ride_weekday)
levels(bs10$ride_days)
## [1] "busnday" "weekend"
Spr <- c("Mar", "Apr", "May")
Sum <- c("Jun", "Jul", "Aug")
Aut <- c("Sep", "Oct", "Nov")
winter <- c("Dec", "Jan", "Feb" )
bs10 <- bs10 %>%
mutate(us_seasons = as.factor(case_when(
ride_month %in% Spr ~ "spring",
ride_month %in% Sum ~ "summer",
ride_month %in% Aut ~ "fall",
TRUE ~ "winter"
)), .after = ride_month)
levels(bs10$us_seasons)
## [1] "fall" "spring" "summer" "winter"
dim(bs10)
## [1] 5531187 13
In this stage we summarize our dataset based on the variable type (numeric continuous or qualitative).
For the numerical continuous variables we summarize the data with descriptive summary statistics. For the categorical variables we summarize using frequency or cross tabulation.
library(dplyr)
bs10 %>%
# dplyr::select(ride_segment, ride_time) %>%
dplyr::group_by(ride_segment, ride_days) %>%
dplyr::summarise(
Total_rides_mints = scales::comma(n()),
rides_hours = n() / 60,
Min_time = min(ride_time),
Quantile_1 = quantile(ride_time, probs = 1 / 4),
Average_time = format(mean(ride_time), digits = 2, nsmall = 2),
Median = median(ride_time),
Quantile_3 = quantile(ride_time, probs = 3 / 4),
Max_time = max(ride_time)
) %>%
mutate(Share = scales::percent(rides_hours / sum(rides_hours))) %>%
gt(caption = " Tab.1 Numerical Variable Summary groupedby segments and ride days")
| ride_days | Total_rides_mints | rides_hours | Min_time | Quantile_1 | Average_time | Median | Quantile_3 | Max_time | Share |
|---|---|---|---|---|---|---|---|---|---|
| casual | |||||||||
| busnday | 1,423,547 | 23725.78 | 1.02 | 7.15 | 19.36 | 12.20 | 22.10 | 241.82 | 63% |
| weekend | 839,496 | 13991.60 | 1.02 | 8.63 | 23.62 | 15.25 | 27.92 | 241.78 | 37% |
| member | |||||||||
| busnday | 2,458,213 | 40970.22 | 1.02 | 5.20 | 11.90 | 8.77 | 14.90 | 241.82 | 75% |
| weekend | 809,931 | 13498.85 | 1.02 | 5.72 | 13.69 | 9.93 | 17.32 | 241.72 | 25% |
#library(flextable)
bs10 %>%
select(ride_segment, ride_bike) %>%
janitor::tabyl(ride_segment, ride_bike) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
gt(caption = " Tab.2 Categorical Variable Summary")
| ride_segment | classic_bike | docked_bike | electric_bike |
|---|---|---|---|
| casual | 34.18% (873,153) | 100.00% (171,129) | 43.44% (1,218,761) |
| member | 65.82% (1,681,345) | 0.00% (0) | 56.56% (1,586,799) |
# library(flextable)
bs10 %>%
select(ride_segment, ride_days) %>%
janitor::tabyl(ride_segment, ride_days) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
gt(caption = " Tab. 3 Cross Tab ride segment and ride days")
| ride_segment | busnday | weekend |
|---|---|---|
| casual | 36.67% (1,423,547) | 50.90% (839,496) |
| member | 63.33% (2,458,213) | 49.10% (809,931) |
#library(flextable)
bs10 %>% janitor::tabyl(ride_segment, us_seasons) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
gt(caption = " Tab.4 Cross table seanos")
| ride_segment | fall | spring | summer | winter |
|---|---|---|---|---|
| casual | 26.10% (590,610) | 21.42% (484,832) | 48.84% (1,105,292) | 3.64% (82,309) |
| member | 29.60% (967,249) | 23.78% (777,090) | 37.21% (1,216,003) | 9.42% (307,802) |
library(dplyr)
bs10 %>%
dplyr::select(ride_segment, ride_time, departure_station) %>%
drop_na(departure_station) %>%
group_by(ride_segment, departure_station) %>%
dplyr::summarise(
Total_cnt = n(),
trip_duration = sum(ride_time), .groups = "drop"
) %>%
dplyr::mutate(
Top_rides = round((Total_cnt / sum(Total_cnt) * 100), 2),
Top_trips = round((trip_duration / sum(trip_duration) * 100), 2)
) %>%
# arrange(desc(Top_stations)) %>%
slice_max(order_by = Top_trips, n = 10) %>%
gt(caption = " Tab.5 Ranking Departure Stations")
| ride_segment | departure_station | Total_cnt | trip_duration | Top_rides | Top_trips |
|---|---|---|---|---|---|
| casual | Streeter Dr & Grand Ave | 56712 | 1963689.6 | 1.20 | 2.55 |
| casual | DuSable Lake Shore Dr & Monroe St | 31117 | 1061685.5 | 0.66 | 1.38 |
| casual | Millennium Park | 24849 | 924781.1 | 0.53 | 1.20 |
| casual | Michigan Ave & Oak St | 24735 | 854982.7 | 0.52 | 1.11 |
| casual | DuSable Lake Shore Dr & North Blvd | 23051 | 642064.3 | 0.49 | 0.83 |
| casual | Shedd Aquarium | 19769 | 572465.8 | 0.42 | 0.74 |
| casual | Theater on the Lake | 18126 | 523935.2 | 0.38 | 0.68 |
| casual | Dusable Harbor | 13762 | 454351.5 | 0.29 | 0.59 |
| casual | Montrose Harbor | 12293 | 426148.4 | 0.26 | 0.55 |
| casual | Indiana Ave & Roosevelt Rd | 13385 | 404501.2 | 0.28 | 0.53 |
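The same ranking can be produced for the arrival stations. Below is a sketch of the analogous code, assuming the same column conventions as the departure ranking above; the resulting table follows.

bs10 %>%
  dplyr::select(ride_segment, ride_time, arrival_station) %>%
  drop_na(arrival_station) %>%
  group_by(ride_segment, arrival_station) %>%
  dplyr::summarise(
    Total_cnt = n(),
    Total_time = sum(ride_time), .groups = "drop"
  ) %>%
  dplyr::mutate(
    Top_rides = round((Total_cnt / sum(Total_cnt) * 100), 2),
    Top_trips = round((Total_time / sum(Total_time) * 100), 2)
  ) %>%
  slice_max(order_by = Top_rides, n = 10) %>%
  gt(caption = " Tab.6 Ranking Arrival Stations")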
| ride_segment | arrival_station | Total_cnt | Total_time | Top_rides | Top_trips |
|---|---|---|---|---|---|
| casual | Streeter Dr & Grand Ave | 58781 | 58781 | 1.25 | 1.25 |
| casual | DuSable Lake Shore Dr & Monroe St | 29036 | 29036 | 0.62 | 0.62 |
| casual | Michigan Ave & Oak St | 26024 | 26024 | 0.56 | 0.56 |
| casual | Millennium Park | 26192 | 26192 | 0.56 | 0.56 |
| casual | DuSable Lake Shore Dr & North Blvd | 25735 | 25735 | 0.55 | 0.55 |
| member | Kingsbury St & Kinzie St | 24224 | 24224 | 0.52 | 0.52 |
| member | Clark St & Elm St | 21988 | 21988 | 0.47 | 0.47 |
| member | Wells St & Concord Ln | 21580 | 21580 | 0.46 | 0.46 |
| member | Clinton St & Washington Blvd | 20108 | 20108 | 0.43 | 0.43 |
| member | University Ave & 57th St | 20160 | 20160 | 0.43 | 0.43 |
fig1 <- bs10 %>%
select(ride_segment, ride_time) %>%
ggplot(aes(x = ride_segment, y = ride_time, fill = ride_segment, col = ride_segment)) +
geom_boxplot(alpha = 0.5) +
geom_pointrange(
mapping = aes(x = ride_segment, y = ride_time),
stat = "summary",
fun.mean = mean
) +
scale_y_log10() +
labs(
tag = "Fig.1",
title = "Casual riders have wider trip distance interval",
y = "Trip duration (minutes)",
caption = "Source: divvy-tripdata"
) +
theme(
legend.title = element_blank(),
axis.title.x = element_blank(),
legend.position = "none"
) +
scale_x_discrete(labels = c("Casual\n riders", "Subscriber\n riders")) +
scale_y_continuous(limits = c(0, 50), breaks = c(0, 10, 15, 20, 30))
fig1 <- fig1 + scale_fill_viridis_d()
fig1
fig2 <- bs10 %>%
ggplot(aes(x = ride_time, fill = ride_segment, col = ride_segment)) +
geom_histogram(alpha = .75, binwidth = 0.05) +
scale_x_log10() +
theme(
legend.position = c(.8, .90),
legend.title = element_blank(),
axis.text = element_text(face = "bold"),
panel.grid.major.y = element_blank()
) +
labs(
tag = "Fig. 2",
title = " ",
subtitle = "Normal Data Distribution with log10 Transformation ",
caption = "Source: divvy-tripdata",
y = "Frequency of riders",
x = "Ride time duration (minutes))"
) +
scale_y_continuous(labels = number_format(scale = 1e-3, suffix = " K"))
fig2
The following seven questions should enable us to get some key insights out of the data we have just analysed and summarized.
What proportion of trips did each group finish?
What proportion of trips did each group cycle?
Which bikes rank highest?
How does the ride time vary daily?
How does the ride time vary monthly?
What is the riding pattern/behaviour on business days and weekends?
How is the riding pattern impacted by seasonal variations?
Key insights
Subscribed members experience two periods of high demand, one around 09:00 and the highest around 18:00
Casual riders show a steady demand for bikes that reaches its peak at 18:00
On weekends, both groups exhibit a similar trend
Subscribed members use bikes more frequently than casual riders under the same circumstances
Casual riders cycle longer on average than the annual subscribed members
Annual subscribed members show a substantially higher demand for bikes in the morning around 09:00 and late in the afternoon around 18:00; the latter is the peak time for both ride segments
In the summer season both segments experience the highest demand for bikes, most likely associated with the favorable weather conditions
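The hourly demand pattern behind these observations can be derived from the departure time stamp; below is a sketch, assuming departure_time is an hms value (stored as seconds since midnight).

bs10 %>%
  mutate(ride_hour = as.numeric(departure_time) %/% 3600) %>% # hour of day from seconds since midnight
  count(ride_segment, ride_days, ride_hour) %>%
  ggplot(aes(x = ride_hour, y = n, color = ride_segment)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ride_days, nrow = 2, scales = "free_y") +
  labs(title = "Hourly demand for bikes by rider segment",
       x = "Hour of day", y = "Number of rides",
       caption = "Source: divvy-tripdata")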
Our analysis has limited socio-economic indicators to draw on for evidence-based recommendations and sound data-driven business decision making
Our analysis is merely descriptive and cannot be conclusive unless supported and validated with sound statistical analysis
Price is a decisive factor in consumer choice: what to buy and how much of it to buy.
Provided the pricing is right, some hints to be considered would include:
Catering to each group's needs with differentiated offers aligned with trip duration (short and long riders)
Designing offers in line with the daily bike demand
Matching bike supply with demand based on the results of the docking station ranking
R for Data Science https://r4ds.had.co.nz
R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics http://www.cookbook-r.com/
Base R Color Palette Guide: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf