\(~\)

Introduction

Welcome to my capstone project for the Google Data Analytics Certificate course! As I journey through the various steps of this project, I will tackle real-world tasks of a data analyst whereby I get to demonstrate my knowledge, skills, and thought process.

About the company

  • Cyclistic is a fictional bike-share program that features 5,824 bicycles and 692 docking stations.
  • While users are more likely to ride for leisure, about 30% use them to commute to work each day.
  • 3 flexible pricing plans are offered: single-ride passes, full-day passes, and annual memberships.
  • Single-ride or full-day pass customers are casual riders.
  • Annual membership customers are known as member riders.

Scenario

  • I am a junior data analyst on the marketing analytics team at Cyclistic.
  • Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.
  • The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.
  • The marketing strategy goal is aimed at converting casual riders into annual members.

This case study follows the six-step data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

\(~\)

Step 1. Ask questions and define the problem

This step involves asking the right questions to gather enough initial information to set the project on the right course. It’s important to make sure you understand the business task and identify all key stakeholders so that everyone stays in the loop as the project moves along.

\(~\)

1.1 Ask the right questions

Three questions will guide the future marketing program:

  1. How do annual member riders and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

This case study primarily focuses on the first question.

\(~\)

1.2 Identify the business task

The business task is to understand how casual riders and member riders use Cyclistic differently with the goal of designing a new marketing strategy aimed at converting casual riders into annual members.

\(~\)

1.3 Identify key stakeholders

  • The director of marketing, who is responsible for the development of campaigns and initiatives to promote the bike-share program.
  • The Cyclistic executive team, which is notoriously detail-oriented and will decide whether to approve the recommended marketing program.
  • The Cyclistic marketing analytics team, a team of data analysts responsible for collecting, analyzing, and reporting data that helps guide marketing strategies.

\(~\)

Step 2: Prepare the data by collecting and storing information

This step includes identifying and collecting the data from its location; determining its integrity, credibility and accessibility; and storing the data in its new location.

\(~\)

2.1 Data Location

  • The data used for this project is provided by Divvy, a bike-share company in Chicago.
  • All data comes from Divvy’s public data link: Divvy Trip Data

\(~\)

2.2 Data Organization

  • Divvy’s public historical trip data is organized by month and year and saved as a zip file.
  • Each csv file is structured utilizing rows and columns.

\(~\)

2.3 Credibility of the data / Data bias / Does this data ROCCC?

  • This data ROCCCs! Taken from Divvy’s own public historical trip data, it is Reliable, Original, Comprehensive, Current, and Cited, making it credible and largely free of bias.

\(~\)

2.4 Licensing, Privacy, Security, and Accessibility

  • The data is maintained and made available by Motivate International Inc. under this license.
  • Divvy’s trip data for public use adheres to data-privacy laws and is therefore anonymized; it does not contain any personally identifiable information.
  • Divvy’s trip data for public use is released on a monthly schedule and is accessible to all.

\(~\)

2.5 Download the data and store it appropriately

  • For this project, I selected data files for the 12 months from November 2021 through October 2022.
  • Each file was downloaded, saved as a .csv file and named consistently to keep the files organized and easy to identify.

\(~\)

2.6 Sort and filter the data in Excel

  • Opened each monthly file in Excel
  • Noted the number of records in each file
  • Checked each file for duplicate records
  • Checked each file for blank/NA records
  • Identified station names and ids containing “test” - I’ll dive deeper in later steps to catch them all

\(~\)

Summary of my initial review of the data:

\(~\)

2.7 Utilize R and RStudio Cloud

The files are too large to handle in Excel, so I chose the R programming language and RStudio Cloud to continue my work on this project.

\(~\)

2.8 Install and load R packages to use for this project

  • tidyverse for data import and wrangling
  • lubridate for date functions
  • ggplot2 for visualization
  • skimr for a broad overview of a data frame

\(~\)

Install the packages

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Load the packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(skimr)

\(~\)

2.9 Upload the data files and create data frames

Data frames are the starting point for analyzing data in R, so I’ll read the 12 csv files that I’ve uploaded into RStudio and name them “data1”, “data2”, “data3”, … “data12”. I can immediately confirm the number of rows and columns of each data frame against my initial summary table, where I noted the rows and columns of each file.

\(~\)

data1 <- read_csv("202111_divvy_tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data2 <- read_csv("202112_divvy_tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data3 <- read_csv("202201_divvy_tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data4 <- read_csv("202202_divvy_tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data5 <- read_csv("202203_divvy_tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data6 <- read_csv("202204_divvy_tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data7 <- read_csv("202205_divvy_tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data8 <- read_csv("202206_divvy_tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data9 <- read_csv("202207_divvy_tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data10 <- read_csv("202208_divvy_tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data11 <- read_csv("202209_divvy_tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data12 <- read_csv("202210_divvy_tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

\(~\)

2.10 Check for column consistency in all 12 data frames

Use colnames() on each new data frame to make sure all have the same 13 columns. Only data1 is shown here; a quick check across all 12 follows the output.

colnames(data1)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

\(~\)

2.11 Combine 12 data frames into one data frame

Use rbind() to combine the 12 data frames into a single data frame named bike_rides.

bike_rides <- rbind(data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11, data12)
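
For reference, readr 2.0+ can also read a vector of file paths and row-bind them in a single call. A more concise alternative (just a sketch, assuming the 12 monthly files are the only files matching the pattern in the working directory):

# Read and combine all monthly files in one step
files <- list.files(pattern = "_divvy_tripdata\\.csv$")
bike_rides_alt <- read_csv(files, show_col_types = FALSE)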

\(~\)

After combining the 12 data frames into one, I used the rm() function to remove the 12 individual data frames from the environment and free up RAM.

rm(data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11, data12)

\(~\)

2.12 Inspect the new data frame

I’ll use class(), dim(), colnames(), and colSums(is.na()) to perform an initial inspection to ensure my data has remained intact up to now. At this point, I’ll use my initial observations from Excel to confirm the following:

  • My new dataset is recognized as a data frame
  • 5,755,694 rows of data
  • 13 columns of data
  • 878,177 instances where start_station_name and start_station_id are blank
  • 940,010 instances where end_station_name and end_station_id are blank
  • 5,835 instances where end_lat and end_lng are blank

\(~\)

confirm dataset is a data frame

class(bike_rides)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

\(~\)

confirm # of rows and columns

dim(bike_rides)
## [1] 5755694      13

\(~\)

confirm column names

colnames(bike_rides)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

\(~\)

confirm blank data fields

colSums(is.na(bike_rides))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             878177             878177             940010             940010 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5835               5835 
##      member_casual 
##                  0

\(~\)

2.13 I’ll begin exploring the data

Getting to know the data is a crucial step that should not be skipped. It’s important to always take enough time to get a good mental sense of the data. Know how big the data is, understand the structure of the data, the data types, the parameters, the dimensions, the variables, the characteristics of those variables, and so on.

Let’s continue on with these additional functions: select(), nrow(), ncol(), length(), head(), tail(), glimpse(), str(), summary(), names(), rownames(), skim_without_charts(), View()

select(bike_rides)
## # A tibble: 5,755,694 × 0
nrow(bike_rides)
## [1] 5755694
ncol(bike_rides)
## [1] 13
length(bike_rides)
## [1] 13
head(bike_rides)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 7C00A93E10556… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
## 2 90854840DFD50… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
## 3 0A7D10CDD1440… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
## 4 2F3BE33085BCF… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
## 5 D67B4781A1992… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
## 6 02F85C2C3C5F7… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042  
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
tail(bike_rides)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 DA551F0A9C0DB… classi… 2022-10-24 17:45:38 2022-10-24 17:48:02 Sedgwi… TA1307…
## 2 BC3BFA659C9AB… classi… 2022-10-30 01:41:29 2022-10-30 01:57:16 Clifto… TA1307…
## 3 ACD65450291CF… classi… 2022-10-30 01:41:54 2022-10-30 01:57:09 Clifto… TA1307…
## 4 4AAC03D1438E9… classi… 2022-10-15 09:34:11 2022-10-15 10:03:21 Sedgwi… TA1307…
## 5 8E6F3F29785E5… classi… 2022-10-09 10:21:34 2022-10-09 10:43:45 Sedgwi… TA1307…
## 6 8D14CBE672431… docked… 2022-10-22 13:17:13 2022-10-22 13:46:14 Clark … 13146  
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
glimpse(bike_rides)
## Rows: 5,755,694
## Columns: 13
## $ ride_id            <chr> "7C00A93E10556E47", "90854840DFD508BA", "0A7D10CDD1…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-11-27 13:27:38, 2021-11-27 13:38:25, 2021-11-…
## $ ended_at           <dttm> 2021-11-27 13:46:38, 2021-11-27 13:56:10, 2021-11-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, "Michigan Ave & Oak St", NA, NA…
## $ start_station_id   <chr> NA, NA, NA, NA, NA, "13042", NA, NA, NA, NA, NA, NA…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.93000, 41.96000, 41.96000, 41.94000, 41.90000, 4…
## $ start_lng          <dbl> -87.72000, -87.70000, -87.70000, -87.79000, -87.630…
## $ end_lat            <dbl> 41.96, 41.92, 41.96, 41.93, 41.88, 41.90, 41.80, 41…
## $ end_lng            <dbl> -87.73, -87.70, -87.70, -87.79, -87.62, -87.63, -87…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
str(bike_rides)
## spc_tbl_ [5,755,694 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5755694] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
##  $ rideable_type     : chr [1:5755694] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5755694], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
##  $ ended_at          : POSIXct[1:5755694], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
##  $ start_station_name: chr [1:5755694] NA NA NA NA ...
##  $ start_station_id  : chr [1:5755694] NA NA NA NA ...
##  $ end_station_name  : chr [1:5755694] NA NA NA NA ...
##  $ end_station_id    : chr [1:5755694] NA NA NA NA ...
##  $ start_lat         : num [1:5755694] 41.9 42 42 41.9 41.9 ...
##  $ start_lng         : num [1:5755694] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ end_lat           : num [1:5755694] 42 41.9 42 41.9 41.9 ...
##  $ end_lng           : num [1:5755694] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ member_casual     : chr [1:5755694] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(bike_rides)
##    ride_id          rideable_type        started_at                    
##  Length:5755694     Length:5755694     Min.   :2021-11-01 00:00:14.00  
##  Class :character   Class :character   1st Qu.:2022-04-27 16:40:09.00  
##  Mode  :character   Mode  :character   Median :2022-06-30 18:31:03.00  
##                                        Mean   :2022-06-13 23:04:32.59  
##                                        3rd Qu.:2022-08-24 19:52:19.75  
##                                        Max.   :2022-10-31 23:59:33.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2021-11-01 00:04:06.00   Length:5755694     Length:5755694    
##  1st Qu.:2022-04-27 16:51:40.25   Class :character   Class :character  
##  Median :2022-06-30 18:49:28.00   Mode  :character   Mode  :character  
##  Mean   :2022-06-13 23:23:58.99                                        
##  3rd Qu.:2022-08-24 20:10:05.75                                        
##  Max.   :2022-11-07 04:53:58.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5755694     Length:5755694     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5755694    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :-87.30                     
##  NA's   :5835    NA's   :5835
names(bike_rides)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
skim_without_charts(bike_rides)
Data summary

  Name                   bike_rides
  Number of rows         5755694
  Number of columns      13
  Column type frequency  character: 7, numeric: 4, POSIXct: 2
  Group variables        None

Variable type: character

  skim_variable       n_missing  complete_rate  min  max  empty  n_unique  whitespace
  ride_id                     0           1.00   16   16      0   5755694           0
  rideable_type               0           1.00   11   13      0         3           0
  start_station_name     878177           0.85    7   64      0      1639           0
  start_station_id       878177           0.85    3   44      0      1306           0
  end_station_name       940010           0.84    9   64      0      1663           0
  end_station_id         940010           0.84    3   44      0      1314           0
  member_casual               0           1.00    6    6      0         2           0

Variable type: numeric

  skim_variable  n_missing  complete_rate    mean    sd      p0     p25     p50     p75    p100
  start_lat              0              1   41.90  0.05   41.64   41.88   41.90   41.93   45.64
  start_lng              0              1  -87.65  0.03  -87.84  -87.66  -87.64  -87.63  -73.80
  end_lat             5835              1   41.90  0.05   41.39   41.88   41.90   41.93   42.37
  end_lng             5835              1  -87.65  0.03  -88.97  -87.66  -87.64  -87.63  -87.30

Variable type: POSIXct

  skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
  started_at             0              1  2021-11-01 00:00:14  2022-10-31 23:59:33  2022-06-30 18:31:03   4824622
  ended_at               0              1  2021-11-01 00:04:06  2022-11-07 04:53:58  2022-06-30 18:49:28   4836310

\(~\)

2.14 Identify missing data, limitations and other data problems

I’ll use the is.na() function on each variable/column in my data frame to get a sense of what is going on where I have missing data. Let’s get started!

I can pair is.na() with View() to create a spreadsheet-like table and get a better look at all rows to see if I notice any patterns. For example: View(bike_rides[is.na(start_station_name),])

attach(bike_rides)
bike_rides[is.na(start_station_name),]
## # A tibble: 878,177 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
##  2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
##  3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
##  4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
##  5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
##  6 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA>    <NA>   
##  7 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA>    <NA>   
##  8 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA>    <NA>   
##  9 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA>    <NA>   
## 10 A2A194358CA6… electr… 2021-11-27 12:49:10 2021-11-27 12:52:47 <NA>    <NA>   
## # … with 878,167 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(start_station_id),]
## # A tibble: 878,177 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
##  2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
##  3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
##  4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
##  5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
##  6 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA>    <NA>   
##  7 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA>    <NA>   
##  8 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA>    <NA>   
##  9 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA>    <NA>   
## 10 A2A194358CA6… electr… 2021-11-27 12:49:10 2021-11-27 12:52:47 <NA>    <NA>   
## # … with 878,167 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(end_station_name),]
## # A tibble: 940,010 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
##  2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
##  3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
##  4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
##  5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
##  6 02F85C2C3C5F… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042  
##  7 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA>    <NA>   
##  8 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA>    <NA>   
##  9 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA>    <NA>   
## 10 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA>    <NA>   
## # … with 940,000 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(end_station_id),]
## # A tibble: 940,010 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
##  2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
##  3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
##  4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
##  5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
##  6 02F85C2C3C5F… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042  
##  7 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA>    <NA>   
##  8 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA>    <NA>   
##  9 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA>    <NA>   
## 10 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA>    <NA>   
## # … with 940,000 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(start_lat),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## #   ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
bike_rides[is.na(start_lng),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## #   ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
bike_rides[is.na(end_lat),]
## # A tibble: 5,835 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 D66FB7A5034C… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307  
##  2 214DC891AC5E… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530  
##  3 4409AA46BDD7… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950  
##  4 C4A464C2818D… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008  
##  5 E58A224FA0AC… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008  
##  6 D893434A1E5B… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544  
##  7 6653EC1EF2B9… classi… 2021-11-12 10:00:16 2021-11-13 11:00:02 Univer… KA1503…
##  8 B346545D5F95… docked… 2021-11-01 18:49:55 2021-11-02 07:48:18 Shedd … 15544  
##  9 4FCD69A37C7B… classi… 2021-11-06 16:24:11 2021-11-07 16:24:06 Wester… 13241  
## 10 DCAA625495CA… classi… 2021-11-20 00:47:58 2021-11-21 01:47:53 Broadw… 13074  
## # … with 5,825 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(end_lng),]
## # A tibble: 5,835 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 D66FB7A5034C… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307  
##  2 214DC891AC5E… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530  
##  3 4409AA46BDD7… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950  
##  4 C4A464C2818D… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008  
##  5 E58A224FA0AC… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008  
##  6 D893434A1E5B… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544  
##  7 6653EC1EF2B9… classi… 2021-11-12 10:00:16 2021-11-13 11:00:02 Univer… KA1503…
##  8 B346545D5F95… docked… 2021-11-01 18:49:55 2021-11-02 07:48:18 Shedd … 15544  
##  9 4FCD69A37C7B… classi… 2021-11-06 16:24:11 2021-11-07 16:24:06 Wester… 13241  
## 10 DCAA625495CA… classi… 2021-11-20 00:47:58 2021-11-21 01:47:53 Broadw… 13074  
## # … with 5,825 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
bike_rides[is.na(member_casual),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## #   ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
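
A side note: attach() makes the column names available without the bike_rides$ prefix, but it can cause confusion if other objects share those names. An attach-free equivalent (a sketch) filters on the column directly:

# Same rows as bike_rides[is.na(start_station_name), ], without attach()
bike_rides %>% filter(is.na(start_station_name))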

\(~\)

Summary around missing data, limitations and other problems:

Upon review of the data, I got a deeper look into missing data and recognized other problems and limitations as follows:

  • Due to the protection of user privacy, this public dataset has limitations
  • We can’t determine if a bike user is local or a visitor
  • We don’t have any user demographic data
  • We don’t know how many customers there are
  • We can’t map bike trips to a customer, so we have no idea how many bike trips may be associated with one customer
  • We can’t determine if a casual rider’s bike trip is associated with a single-ride pass or a full-day pass

Additionally…

  • Some bike trips appear to be associated with “testing”
  • Start and end latitude and longitude values are not recorded to a consistent precision; they range from 2 to 6 decimal places

These data columns have missing values which we’ll address further along in the process:

  • start_station_name
  • start_station_id
  • end_station_name
  • end_station_id
  • end_lat, end_lng

\(~\)

2.15 Ability to address the business task

Divvy’s historical trip data is appropriate data to answer the business questions. The data provided will help us to understand how the casual riders and member riders use bikes differently. While the data files provide consistent columns of data, we can also expand upon the data by using calculations and functions to draw deeper insights.

While the data has some limitations and problems, overall, there aren’t any major issues that render the data unusable. We have a great dataset from which we can gather many insights such as:

  • Bike type usage by customer type
  • Number of bike trips by customer type and time of the day, day of the week, season
  • Length of bike trips by customer type and time of the day, day of the week, season

\(~\)

Step 3: Process the data by cleaning and checking the information

This step includes selecting appropriate tools for the amount of data you will be working with. It also involves checking for data errors, cleaning the data, transforming the data by adding, renaming and removing data and, finally, verifying that the data is clean and ready for analysis. It’s important to follow a structured process and document all your steps so that teammates can follow along and perform checks on your work.

\(~\)

3.1 Select your tools for the project

Since the combined dataset is very large with 5.7 million rows, R and RStudio Cloud have been chosen as the tools for data manipulation, cleaning, aggregation, analysis, and visualization.

\(~\)

3.2 Transforming the data to work with it more effectively

\(~\)

3.2.1 Renaming columns

bike_rides <- rename(bike_rides, "bike_type" = "rideable_type", "user_type" = "member_casual")

\(~\)

3.2.2 Ensure datetime format is consistent throughout the started_at and ended_at columns.

bike_rides$started_at <- ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- ymd_hms(bike_rides$ended_at)

\(~\)

3.2.3 Adding columns

Adding a new column called “ride_length_min”.

Using the difftime() function to calculate the length of each trip in minutes, rounded to two decimals.

bike_rides$ride_length_min <- round(as.numeric(difftime(bike_rides$ended_at, bike_rides$started_at, units = "mins")), 2)

\(~\)

Verify that R recognizes my new variable as numeric so that I can perform calculations.

class(bike_rides$ride_length_min)
## [1] "numeric"

\(~\)

Adding columns for: date, month, day, year, day_of_week, and hour

bike_rides$date <- as.Date(bike_rides$started_at)   
bike_rides$month <- format(as.Date(bike_rides$date), "%B")
bike_rides$day <- format(as.Date(bike_rides$date), "%d")
bike_rides$year <- format(as.Date(bike_rides$date), "%Y")
bike_rides$day_of_week <- format(as.Date(bike_rides$date), "%A")
bike_rides$hour <- lubridate::hour(bike_rides$started_at)

\(~\)

Adding a column for season

bike_rides <- bike_rides %>% mutate(season = recode(month,
                        December = "Winter",
                        January = "Winter",
                        February = "Winter",
                        March = "Spring",
                        April = "Spring",
                        May = "Spring",
                        June = "Summer",
                        July = "Summer",
                        August = "Summer",
                        September = "Fall",
                        October = "Fall",
                        November = "Fall"))

\(~\)

Adding a column for “time_of_day” by using the case_when() function

bike_rides <- bike_rides %>% mutate(time_of_day = case_when(
                        hour >= 6 & hour < 9 ~ "Early Morning",
                        hour >= 9 & hour < 12 ~ "Mid Morning",
                        hour >= 12 & hour < 18  ~ "Afternoon",
                        hour >= 18 & hour <= 23  ~ "Evening",
                        hour >= 0 & hour < 3  ~ "Early Night",
                        hour >= 3 & hour < 6  ~ "Late Night"))

\(~\)

3.2.4 Exploring our newly created column, ride_length_min

\(~\)

Let’s count the rides where ride_length_min is greater than 1,440 minutes (24 hours)

sum(bike_rides$ride_length_min > 1440)
## [1] 5364

\(~\)

Let’s calculate where ride_length_min is less than 1 minute (or 60 seconds)

sum(bike_rides$ride_length_min < 1)
## [1] 118453

\(~\)

Let’s count the rides where ride_length_min is negative (less than 0). I’ll use three different approaches to arrive at the same answer.

sum(bike_rides$ride_length_min < 0)
## [1] 112
sum(bike_rides$started_at > bike_rides$ended_at)
## [1] 112
length(which(bike_rides$started_at > bike_rides$ended_at)) 
## [1] 112

\(~\)

Just curious to see where ride_length_min is greater than 6 hours (or 360 minutes)

sum(bike_rides$ride_length_min > 360)
## [1] 10130

\(~\)

3.3 Taking a closer look at missing values

Let’s check if the missing start station id codes pertain to a particular bike type or user type. We can see that the problem lies entirely with electric bikes and does not pertain to a particular user type.

bike_rides %>% filter(is.na(start_station_id)) %>% 
  count(start_station_id, start_station_name, bike_type, user_type)
## # A tibble: 2 × 5
##   start_station_id start_station_name bike_type     user_type      n
##   <chr>            <chr>              <chr>         <chr>      <int>
## 1 <NA>             <NA>               electric_bike casual    363963
## 2 <NA>             <NA>               electric_bike member    514214

\(~\)

The missing end station ids pertain to all three bike types and both user types, but again, the majority of the problem is with electric bikes.

bike_rides %>% filter(is.na(end_station_id)) %>% 
  count(end_station_id, end_station_name, bike_type, user_type)
## # A tibble: 5 × 5
##   end_station_id end_station_name bike_type     user_type      n
##   <chr>          <chr>            <chr>         <chr>      <int>
## 1 <NA>           <NA>             classic_bike  casual      2805
## 2 <NA>           <NA>             classic_bike  member      1533
## 3 <NA>           <NA>             docked_bike   casual      2570
## 4 <NA>           <NA>             electric_bike casual    422751
## 5 <NA>           <NA>             electric_bike member    510351

\(~\)

Let’s take a look at what else is going on where start station id is missing. In particular, I want to see what is going on with start station name, start_lat & start_lng. This shows me that start_station_name is also missing and that both start_lat & start_lng only go out two decimal places.

Adding the View() function will provide a table that you can then also filter on.

bike_rides %>% filter(is.na(start_station_id)) %>% 
  count(start_station_id, start_station_name, start_lat, start_lng)
## # A tibble: 621 × 5
##    start_station_id start_station_name start_lat start_lng     n
##    <chr>            <chr>                  <dbl>     <dbl> <int>
##  1 <NA>             <NA>                    41.6     -87.6     1
##  2 <NA>             <NA>                    41.6     -87.5     3
##  3 <NA>             <NA>                    41.6     -87.6    22
##  4 <NA>             <NA>                    41.6     -87.6     2
##  5 <NA>             <NA>                    41.6     -87.6     8
##  6 <NA>             <NA>                    41.6     -87.6     1
##  7 <NA>             <NA>                    41.6     -87.6     3
##  8 <NA>             <NA>                    41.6     -87.6    20
##  9 <NA>             <NA>                    41.6     -87.5    21
## 10 <NA>             <NA>                    41.6     -87.5     5
## # … with 611 more rows

\(~\)

Let’s take a look at what else is going on where end station id is missing. In particular, I want to see what is going on with end station name, end_lat & end_lng. This shows me that end_station_name is also missing and that both end_lat & end_lng only go out two decimal places.

Adding the View() function will provide a table that you can then also filter on.

From this view, I can also see that there are 5,835 instances where both end_lat & end_lng are missing.

bike_rides %>% filter(is.na(end_station_id)) %>% 
  count(end_station_id, end_station_name, end_lat, end_lng)
## # A tibble: 891 × 5
##    end_station_id end_station_name end_lat end_lng     n
##    <chr>          <chr>              <dbl>   <dbl> <int>
##  1 <NA>           <NA>                41.4   -89.0     1
##  2 <NA>           <NA>                41.5   -87.6     1
##  3 <NA>           <NA>                41.6   -87.3     4
##  4 <NA>           <NA>                41.6   -87.7     2
##  5 <NA>           <NA>                41.6   -87.6     1
##  6 <NA>           <NA>                41.6   -87.7     1
##  7 <NA>           <NA>                41.6   -87.6     1
##  8 <NA>           <NA>                41.6   -87.7     1
##  9 <NA>           <NA>                41.6   -87.6     1
## 10 <NA>           <NA>                41.6   -87.6     1
## # … with 881 more rows

\(~\)

3.3.1 Address missing values

To deal with the missing values for both start & end station names & ids, I’m going to create four new columns to show start_lat, start_lng, end_lat & end_lng all rounded to 2 decimal places. I’ll then use the new rounded lat & lng to impute the missing start & end station names.

My logic: while not perfect, this will give me station names that are close enough distance-wise to perform the analysis for this project. I don’t need exact docking locations to analyze rider habits in the context of this project.

\(~\)

Creating four new columns to show start_lat, start_lng, end_lat & end_lng all rounded to 2 decimal places.

bike_rides <- bike_rides %>% 
  mutate(start_lat_round = round(start_lat, digits = 2),
        start_lng_round = round(start_lng, digits = 2),
        end_lat_round = round(end_lat, digits = 2),
        end_lng_round = round(end_lng, digits = 2))

\(~\)

Impute missing start station names

bike_rides <- bike_rides %>% 
  group_by(start_lat_round, start_lng_round) %>% 
  tidyr::fill(start_station_name, .direction = "downup") %>% 
  ungroup()

\(~\)

Impute missing end station names

bike_rides <- bike_rides %>% 
  group_by(end_lat_round, end_lng_round) %>% 
  tidyr::fill(end_station_name, .direction = "downup") %>% 
  ungroup()

\(~\)

Impute missing start_station_id

bike_rides <- bike_rides %>% 
  group_by(start_station_name) %>% 
  tidyr::fill(start_station_id, .direction = "downup") %>% 
  ungroup()

\(~\)

Impute missing end_station_id

bike_rides <- bike_rides %>% 
  group_by(end_station_name) %>% 
  tidyr::fill(end_station_id, .direction = "downup") %>% 
  ungroup()

\(~\)

Now that we’ve imputed a lot of the missing data, let’s check missing values by column again and see what’s still missing.

colSums(is.na(bike_rides))
##            ride_id          bike_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##              12097              12097              24394              24394 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5835               5835 
##          user_type    ride_length_min               date              month 
##                  0                  0                  0                  0 
##                day               year        day_of_week               hour 
##                  0                  0                  0                  0 
##             season        time_of_day    start_lat_round    start_lng_round 
##                  0                  0                  0                  0 
##      end_lat_round      end_lng_round 
##               5835               5835

\(~\)

Where end_lat & end_lng are missing, we don’t have an end_station_name or end_station_id either, so the missing data cannot be imputed from those fields. Judging from the sample rows shown earlier, these tend to be very long, roughly day-long trips, which suggests the bikes were never properly re-docked.

Since these rides have no end_lat & end_lng at all, I’m assuming they are flukes and I am going to remove them. Percentage-wise, these 5,835 rows are immaterial and removing them will not impact the integrity of my dataset.

bike_rides %>% filter(is.na(end_lat)) %>% 
  count(end_station_name, end_station_id, end_lat, end_lng, bike_type)
## # A tibble: 2 × 6
##   end_station_name end_station_id end_lat end_lng bike_type        n
##   <chr>            <chr>            <dbl>   <dbl> <chr>        <int>
## 1 <NA>             <NA>                NA      NA classic_bike  3265
## 2 <NA>             <NA>                NA      NA docked_bike   2570

\(~\)

Where start_station_name is missing, we don’t have a start_station_id to impute the missing data from. I should be able to impute it from rows with matching start_lat & start_lng coordinates, but that would be trickier because we may have multiple station names where the lat & lng coordinates only go out 2 decimal places.

For time’s sake, and because the amount of data is immaterial, I’m going to leave it as is and move on (I could also simply remove it). Adding the View() function will provide a table that you can then also filter on.
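
To see why matching on the rounded coordinates would be ambiguous, a quick check (a sketch) counts how many distinct station names share each rounded start coordinate pair; any value greater than 1 means the match would not be unique:

# Distribution of distinct station names per rounded coordinate pair
bike_rides %>%
  filter(!is.na(start_station_name)) %>%
  group_by(start_lat_round, start_lng_round) %>%
  summarise(n_station_names = n_distinct(start_station_name), .groups = "drop") %>%
  count(n_station_names)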

bike_rides %>% filter(is.na(start_station_name)) %>% 
  count(start_station_name, start_station_id, start_lat, start_lng, bike_type)
## # A tibble: 112 × 6
##    start_station_name start_station_id start_lat start_lng bike_type         n
##    <chr>              <chr>                <dbl>     <dbl> <chr>         <int>
##  1 <NA>               <NA>                  41.6     -87.6 electric_bike     1
##  2 <NA>               <NA>                  41.6     -87.5 electric_bike     3
##  3 <NA>               <NA>                  41.6     -87.6 electric_bike     2
##  4 <NA>               <NA>                  41.6     -87.6 electric_bike     1
##  5 <NA>               <NA>                  41.6     -87.6 electric_bike     3
##  6 <NA>               <NA>                  41.6     -87.5 electric_bike     5
##  7 <NA>               <NA>                  41.7     -87.6 electric_bike    13
##  8 <NA>               <NA>                  41.7     -87.6 electric_bike    24
##  9 <NA>               <NA>                  41.7     -87.6 electric_bike     6
## 10 <NA>               <NA>                  41.7     -87.6 electric_bike     3
## # … with 102 more rows

\(~\)

Where end_station_name is missing, we don’t have an end_station_id to impute the missing data from, but I should be able to impute it from rows with matching end_lat & end_lng coordinates. That would be trickier because we may have multiple station names where the lat & lng coordinates only go out 2 decimal places.

For time’s sake, and because the amount of data is immaterial, I’m going to leave it as is and move on (I could also simply remove it). Adding the View() function will provide a table that you can then also filter on.

bike_rides %>% filter(is.na(end_station_name)) %>% 
  count(end_station_name, end_station_id, end_lat, end_lng, bike_type)
## # A tibble: 394 × 6
##    end_station_name end_station_id end_lat end_lng bike_type         n
##    <chr>            <chr>            <dbl>   <dbl> <chr>         <int>
##  1 <NA>             <NA>              41.4   -89.0 electric_bike     1
##  2 <NA>             <NA>              41.5   -87.6 electric_bike     1
##  3 <NA>             <NA>              41.6   -87.3 electric_bike     4
##  4 <NA>             <NA>              41.6   -87.7 electric_bike     2
##  5 <NA>             <NA>              41.6   -87.6 electric_bike     1
##  6 <NA>             <NA>              41.6   -87.7 electric_bike     1
##  7 <NA>             <NA>              41.6   -87.6 electric_bike     1
##  8 <NA>             <NA>              41.6   -87.7 electric_bike     1
##  9 <NA>             <NA>              41.6   -87.6 electric_bike     1
## 10 <NA>             <NA>              41.6   -87.6 electric_bike     1
## # … with 384 more rows

\(~\)

3.5 Removing data

After a thorough review of the data, I’ll remove the following:

  • Rides less than 60 seconds in length as they are potentially false starts or users trying to re-dock a bike to ensure it was secure per Divvy website: Divvy System Data
  • Rides with a negative ride_length are considered invalid since the trip start time cannot be greater than the trip end time
  • Rides with a ride_length greater than 24 hrs are considered invalid outliers for purposes of this project
  • Rides where end_lat & end_lng are both missing; in these cases we don’t have an end_station_name or end_station_id either, so the missing data cannot be imputed. Percentage-wise, these 5,835 rows are immaterial, and since they have no end_lat & end_lng at all, I’m assuming they are flukes.
  • Rides where start_station_id or end_station_id are related to “testing” as identified earlier

\(~\)

I’ll use the select() function to create a new data frame, bike_rides_v2, that keeps only the columns needed for analysis (dropping started_at, ended_at, day, and the temporary rounded lat/lng columns).

bike_rides_v2 <- select(bike_rides, c(1, 2, 5:16, 18:22))
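
Selecting by position works, but it is fragile if the column order ever changes. An equivalent alternative (a sketch that yields the same 19 columns) drops the unneeded columns by name instead:

# Keep everything except the raw timestamps, day, and the temporary rounded columns
bike_rides_v2 <- bike_rides %>%
  select(-started_at, -ended_at, -day, -ends_with("_round"))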

\(~\)

Remove rides less than 1 minute (60 seconds) or greater than 24 hours (1,440 minutes) in length. This also removes all rides with a negative ride_length.

bike_rides_v2 <- bike_rides_v2 %>% 
  filter(ride_length_min >= 1 & ride_length_min <= 1440)

\(~\)

Remove rides where end_lat & end_lng are both missing

bike_rides_v2 <- bike_rides_v2 %>%
  filter(!is.na(end_lat) & !is.na(end_lng))

\(~\)

Remove rides related to test/repair stations

bike_rides_v2 <- bike_rides_v2 %>% 
  filter(!start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
bike_rides_v2 <- bike_rides_v2 %>%
  filter(!end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
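
Since the same list of test/repair station ids is used twice, defining it once keeps the two filters in sync. An equivalent version (a sketch) of the two filters above:

# Station ids associated with testing/repair activity (same list as above)
test_stations <- c("DIVVY 001", "DIVVY 001 - Warehouse test station",
                   "Hubbard Bike-checking (LBS-WH-TEST)",
                   "Pawel Bialowas - Test- PBSC charging station",
                   "DIVVY CASSETTE REPAIR MOBILE STATION",
                   "2059 Hastings Warehouse Station", "Hastings WH 2",
                   "Throop/Hastings Mobile Station")

bike_rides_v2 <- bike_rides_v2 %>%
  filter(!start_station_id %in% test_stations,
         !end_station_id %in% test_stations)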

\(~\)

3.5.1 Confirm the data removal

Confirms removal where ride_length_min is greater than 1,440 minutes (24 hours)

sum(bike_rides_v2$ride_length_min > 1440)
## [1] 0

\(~\)

Confirms removal where ride_length_min is less than 1 minute (or 60 seconds)

sum(bike_rides_v2$ride_length_min < 1)
## [1] 0

\(~\)

Confirms removal of rides related to test/repair stations

bike_rides_v2 %>% filter(start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
  count(start_station_id)
## # A tibble: 0 × 2
## # … with 2 variables: start_station_id <chr>, n <int>

\(~\)

Confirms removal of rides related to test/repair stations

bike_rides_v2 %>% filter(end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>% 
  count(end_station_id)
## # A tibble: 0 × 2
## # … with 2 variables: end_station_id <chr>, n <int>

\(~\)

Confirms removal where end_lat & end_lng were both missing

colSums(is.na(bike_rides_v2))
##            ride_id          bike_type start_station_name   start_station_id 
##                  0                  0              11750              11750 
##   end_station_name     end_station_id          start_lat          start_lng 
##              17899              17899                  0                  0 
##            end_lat            end_lng          user_type    ride_length_min 
##                  0                  0                  0                  0 
##               date              month               year        day_of_week 
##                  0                  0                  0                  0 
##               hour             season        time_of_day 
##                  0                  0                  0

\(~\)

3.6 Inspect the new data frame

Let’s get a row count and take a quick glimpse of our new data frame, bike_rides_v2. After cleaning the data, we have 5,621,147 rows, which means we removed a total of 134,547 rows (5,755,694 - 5,621,147).

nrow(bike_rides_v2)
## [1] 5621147
glimpse(bike_rides_v2)
## Rows: 5,621,147
## Columns: 19
## $ ride_id            <chr> "7C00A93E10556E47", "90854840DFD508BA", "0A7D10CDD1…
## $ bike_type          <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ start_station_name <chr> "Kosciuszko Park", "California Ave & Montrose Ave",…
## $ start_station_id   <chr> "15643", "15622", "15622", "397", "TA1306000011", "…
## $ end_station_name   <chr> "Keystone Ave & Montrose Ave", "Humboldt Blvd & Arm…
## $ end_station_id     <chr> "KA1504000164", "15651", "15622", "314", "13300", "…
## $ start_lat          <dbl> 41.93000, 41.96000, 41.96000, 41.94000, 41.90000, 4…
## $ start_lng          <dbl> -87.72000, -87.70000, -87.70000, -87.79000, -87.630…
## $ end_lat            <dbl> 41.96, 41.92, 41.96, 41.93, 41.88, 41.90, 41.80, 41…
## $ end_lng            <dbl> -87.73, -87.70, -87.70, -87.79, -87.62, -87.63, -87…
## $ user_type          <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length_min    <dbl> 19.00, 17.75, 2.37, 5.02, 21.22, 18.70, 6.00, 5.05,…
## $ date               <date> 2021-11-27, 2021-11-27, 2021-11-26, 2021-11-27, 20…
## $ month              <chr> "November", "November", "November", "November", "No…
## $ year               <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20…
## $ day_of_week        <chr> "Saturday", "Saturday", "Friday", "Saturday", "Frid…
## $ hour               <int> 13, 13, 22, 9, 19, 18, 13, 9, 16, 12, 13, 11, 16, 1…
## $ season             <chr> "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fa…
## $ time_of_day        <chr> "Afternoon", "Afternoon", "Evening", "Mid Morning",…

\(~\)

We still have some missing start and end station names and IDs, but we have the start and end latitude/longitude for these rides, so we can work with this data if needed. Removing these rows would not be a big deal since they make up a small percentage of the data, but for now we’ll keep them.

colSums(is.na(bike_rides_v2))
##            ride_id          bike_type start_station_name   start_station_id 
##                  0                  0              11750              11750 
##   end_station_name     end_station_id          start_lat          start_lng 
##              17899              17899                  0                  0 
##            end_lat            end_lng          user_type    ride_length_min 
##                  0                  0                  0                  0 
##               date              month               year        day_of_week 
##                  0                  0                  0                  0 
##               hour             season        time_of_day 
##                  0                  0                  0

\(~\)

3.7 Verify the data is clean and ready to analyze

Before moving on to the analysis phase, I’ll run through a quick checklist to verify the data is now clean, accurate, consistent, and complete. A small sketch of a few of these checks in code follows the list.

  • Checked for duplicates: the data does not have duplicate values
  • Checked for missing values: removed some data with missing values
  • Checked for outliers: removed outliers (ride length > 24 hrs)
  • Checked for data accuracy: after cleaning, the remaining data is accurate and intact
  • Checked for data completeness: the data for the 12 months is complete
  • Checked for data consistency: after cleaning up the data, the data for the 12 months remained consistent
  • Checked for data relevance: dataset for the past 12 months is current and relevant
  • Checked for appropriateness: the data is appropriate to answer the business questions
  • Checked for data formats: The columns are accurate in format
  • Date-Time format consistency: the columns with date and time are consistent
  • Checked for meaningful column names: changed and added a few columns to ensure all are clear and meaningful
  • Overall sense of the data: this is a strong set of data that is consistent, formatted well and makes sense.
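
As referenced above, here is a minimal sketch of a few of these checks in R (assuming dplyr and the cleaned bike_rides_v2 data frame are already loaded; the exact checks are illustrative, not part of the original pipeline):

# Duplicates: every ride_id should be unique
n_distinct(bike_rides_v2$ride_id) == nrow(bike_rides_v2)

# Outliers: ride lengths should fall between 1 and 1,440 minutes
range(bike_rides_v2$ride_length_min)

# Completeness: all 12 months should be present
n_distinct(bike_rides_v2$month)

# Consistency: user_type should only contain the two expected values
all(bike_rides_v2$user_type %in% c("casual", "member"))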

\(~\)

Step 4: Analyzing and Visualizing the Data

At this step, I’ll work with the data to analyze and look at it in different ways. I’ll use functions to help me explore relationships and trends while being mindful to not stray from my objective which is to analyze how annual member riders and casual riders use Cyclistic bikes differently.

\(~\)

4.1 Summary of the data at the start of analysis

summary(bike_rides_v2)
##    ride_id           bike_type         start_station_name start_station_id  
##  Length:5621147     Length:5621147     Length:5621147     Length:5621147    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5621147     Length:5621147     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng        user_type         ride_length_min  
##  Min.   :41.39   Min.   :-88.97   Length:5621147     Min.   :   1.00  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:   6.08  
##  Median :41.90   Median :-87.64   Mode  :character   Median :  10.55  
##  Mean   :41.90   Mean   :-87.65                      Mean   :  16.63  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:  18.82  
##  Max.   :42.37   Max.   :-87.30                      Max.   :1439.85  
##       date               month               year           day_of_week       
##  Min.   :2021-11-01   Length:5621147     Length:5621147     Length:5621147    
##  1st Qu.:2022-04-27   Class :character   Class :character   Class :character  
##  Median :2022-06-30   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2022-06-13                                                           
##  3rd Qu.:2022-08-24                                                           
##  Max.   :2022-10-31                                                           
##       hour          season          time_of_day       
##  Min.   : 0.00   Length:5621147     Length:5621147    
##  1st Qu.:11.00   Class :character   Class :character  
##  Median :15.00   Mode  :character   Mode  :character  
##  Mean   :14.22                                        
##  3rd Qu.:18.00                                        
##  Max.   :23.00

\(~\)

4.2 Casual rider vs member rider

Let’s get a count and percentage breakdown of our two user types: the casual rider vs the member rider

bike_rides_v2 %>% 
  group_by(user_type) %>% 
  summarise(count = n(), Percentage = n()/nrow(bike_rides_v2)*100)
## # A tibble: 2 × 3
##   user_type   count Percentage
##   <chr>       <int>      <dbl>
## 1 casual    2295882       40.8
## 2 member    3325265       59.2

\(~\)

Visualizing total rides by user type

ggplot(bike_rides_v2, aes(user_type, fill=user_type)) +
  geom_bar() +
  labs(x="User Type", y="Count", title = "Total Rides by User Type: Casual vs Member") +
  annotate("text",x=1,y=2000000,label="2,295,882 / (41%)",color="black",size=3.5) +
  annotate("text",x=2,y=3000000,label="3,325,265 / (59%)",color="black",size=3.5)

Key insight:

  • 3,325,265 (or 59%) of riders are member riders, while 2,295,882 (or 41%) are casual riders
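
As a side note, the annotation labels in the chart above are hard-coded. A sketch of computing them on the fly instead (assuming ggplot2, dplyr, and the scales package are available) might look like this:

# Sketch: compute the count/percentage labels instead of hard-coding them
bike_rides_v2 %>%
  count(user_type) %>%
  mutate(pct = n / sum(n),
         label = paste0(scales::comma(n), " (", scales::percent(pct, accuracy = 1), ")")) %>%
  ggplot(aes(x = user_type, y = n, fill = user_type)) +
  geom_col() +
  geom_text(aes(label = label), vjust = -0.5, size = 3.5) +
  labs(x = "User Type", y = "Count", title = "Total Rides by User Type: Casual vs Member")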

\(~\)

Visualizing total rides by user type and bike type

bike_rides_v2 %>% 
  group_by(user_type, bike_type) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=user_type, y=count, fill=bike_type)) + 
  geom_bar(stat="identity", width = 0.3) +
  labs(x="Bike Type", y="Number of Rides", title = "Total Rides by user type and bike type")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insight:

  • Member and casual riders use both the classic and electric bike, while only casual users use docked bikes.

\(~\)

4.3 Analyzing Ride Length

Let’s spend some time looking at the length of each trip (in minutes)

summary(bike_rides_v2$ride_length_min)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.08   10.55   16.63   18.82 1439.85

Key insights:

  • Average of all ride lengths is 16.63 minutes
  • Minimum of all ride lengths is 1 minute
  • Maximum of all ride lengths is just under 24 hours (1,440 minutes)
  • (this aligns with our data cleaning, which removed rides < 1 minute and > 24 hours)

\(~\)

Average ride length of each trip (in minutes) by user type

aggregate(bike_rides_v2$ride_length_min ~ bike_rides_v2$user_type, FUN = mean)
##   bike_rides_v2$user_type bike_rides_v2$ride_length_min
## 1                  casual                      22.37841
## 2                  member                      12.65979
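
The same averages can also be computed with a dplyr pipeline, which keeps the style consistent with the rest of the analysis. A small sketch:

# Sketch: dplyr equivalent of the aggregate() call above
bike_rides_v2 %>%
  group_by(user_type) %>%
  summarize(avg_ride_length_min = mean(ride_length_min))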

\(~\)

Taking a deeper dive into ride lengths

Here, I’m summarizing the data by various ride lengths to see if anything stands out. The resulting table shows that 5,453,815 of the rides are 60 minutes or less, which makes this an important area to focus on.

bike_rides_v2 %>% 
   group_by(user_type) %>% 
  summarize("<=5min" = sum(ride_length_min <=5),
            "<=15min" = sum(ride_length_min <=15),
            "<=30min" = sum(ride_length_min <=30),
            "<=45min" = sum(ride_length_min <=45),
            "<=60min" = sum(ride_length_min <=60),
            ">2hrs" = sum(ride_length_min >120),
            ">4hrs" = sum(ride_length_min >240),
            ">6hrs" = sum(ride_length_min >360),)
## # A tibble: 2 × 9
##   user_type `<=5min` `<=15min` `<=30min` <=45m…¹ <=60m…² `>2hrs` `>4hrs` `>6hrs`
##   <chr>        <int>     <int>     <int>   <int>   <int>   <int>   <int>   <int>
## 1 casual      253461   1266259   1865460 2063967 2151575   34139    5800    3260
## 2 member      751888   2451121   3109671 3269886 3302240    5735    2058    1244
## # … with abbreviated variable names ¹​`<=45min`, ²​`<=60min`

\(~\)

Supporting analysis in Excel

I created a table in Excel to get a better breakdown. Let’s take a look:

Key insights:

  • 97% of all rides are 60 minutes or less
  • 79% of all rides are 20 minutes or less
  • 56% of all rides are less than 12 minutes
  • In my opinion, this range is the crucial area for gathering more data and developing a marketing strategy.

\(~\)

Let’s look strictly at ride lengths under 12 minutes

Key insights:

  • 67% of these rides are by member riders
  • That is roughly double the number of casual rides in this category

\(~\)

Let’s look strictly at ride lengths <= 20 minutes

Key insights:

  • 64% of these rides are by member riders
  • As the ride length increases, the breakdown shifts slightly toward casual riders

\(~\)

I’ll reproduce some of this analysis in R.

I’ll create a table to show the breakdown I displayed in Excel. This table shows us that 56% of the rides are less than 12 minutes and 79% of the rides are 20 minutes or less.

bike_rides_v2 %>% 
   group_by(user_type) %>% 
  summarize("<12 min" = sum(ride_length_min <11.99),
            "12-20 min" = sum(ride_length_min >=12 & ride_length_min <=20.99),
            "21-30 min" = sum(ride_length_min >=21 & ride_length_min <=30.99),
            "31-60 min" = sum(ride_length_min >=31 & ride_length_min <=60),
            "61-120 min" = sum(ride_length_min >=60.01 & ride_length_min <=120.99),
            "121-240 min" = sum(ride_length_min >=121 & ride_length_min <=240.99),
            "241+min" = sum(ride_length_min >=241))  
## # A tibble: 2 × 8
##   user_type `<12 min` `12-20 min` `21-30 min` 31-60 mi…¹ 61-12…² 121-2…³ 241+m…⁴
##   <chr>         <int>       <int>       <int>      <int>   <int>   <int>   <int>
## 1 casual      1025555      565694      294472     265854  110806   27750    5751
## 2 member      2125827      711239      290623     174551   17365    3608    2052
## # … with abbreviated variable names ¹​`31-60 min`, ²​`61-120 min`,
## #   ³​`121-240 min`, ⁴​`241+min`
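
To cross-check the Excel percentages against the data in R, here is a quick sketch computing the share of all rides under each cutoff (the cutoffs mirror the Excel breakdown):

# Sketch: share of all rides under each cutoff, to verify the Excel figures
bike_rides_v2 %>%
  summarize(pct_under_12_min = mean(ride_length_min < 12) * 100,
            pct_under_20_min = mean(ride_length_min <= 20) * 100,
            pct_under_60_min = mean(ride_length_min <= 60) * 100)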

\(~\)

Adding a column to create ride length categories to get a better visual in R

bike_rides_v2 <- bike_rides_v2 %>% mutate(ride_length_cat = case_when(
                        ride_length_min <11.99 ~ "< 12 min",
                        ride_length_min >=12 & ride_length_min <=20.99 ~ "12-20 min",
                        ride_length_min >=21 & ride_length_min <=30.99  ~ "21-30 min",
                        ride_length_min >=31 & ride_length_min <=60.99  ~ "31-60 min",
                        ride_length_min >=61 & ride_length_min <=120.99  ~ "61-120 min",
                        ride_length_min >=121 & ride_length_min <=240.99  ~ "121-240 min",
                        ride_length_min >=241  ~ "241+ min"))
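
As an alternative sketch, cut() with explicit breaks can build the same buckets in one call and returns a factor whose levels are already in order (the ride_length_cat2 column name is just illustrative, not part of the original pipeline):

# Sketch: the same categories built with cut(); right = FALSE makes intervals
# closed on the left, e.g. [0, 12), [12, 21), ...
bike_rides_v2 <- bike_rides_v2 %>%
  mutate(ride_length_cat2 = cut(ride_length_min,
                                breaks = c(0, 12, 21, 31, 61, 121, 241, Inf),
                                labels = c("< 12 min", "12-20 min", "21-30 min", "31-60 min",
                                           "61-120 min", "121-240 min", "241+ min"),
                                right = FALSE))

Because the result is a factor with ordered levels, the manual factor() reordering used in the plot below would not be needed.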

\(~\)

Visualizing total rides by user type and ride length category

I used factor() with an explicit level vector to get my x-axis labels to appear in the correct order. Without it, the category labels starting with a “1” showed up first, followed by “2” and “3”.

bike_rides_v2 %>% 
  group_by(user_type, ride_length_cat) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(ride_length_cat, level = c("< 12 min", "12-20 min", "21-30 min", "31-60 min",
      "61-120 min", "121-240 min", "241+ min")), y=count, fill=user_type)) + 
  geom_col(position = "dodge") +
  labs(x="Ride Length Category", y="Number of Rides", title = "Total Rides by user type and ride length category")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insight:

  • We can clearly see how the majority of rides fall into the first two categories

\(~\)

Average ride length (in minutes) of rides in the “< 12 min” category

bike_rides_v2 %>% filter(ride_length_cat == "< 12 min") %>% 
  group_by(user_type) %>% 
  summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
##   user_type avg_ride_length
##   <chr>               <dbl>
## 1 casual               7.17
## 2 member               6.44

\(~\)

Average ride length (in minutes) of rides that are 20 minutes or less (the “< 12 min” and “12-20 min” categories combined)

bike_rides_v2 %>% filter(ride_length_cat == "< 12 min" | ride_length_cat== "12-20 min") %>% 
  group_by(user_type) %>% 
  summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
##   user_type avg_ride_length
##   <chr>               <dbl>
## 1 casual              10.3 
## 2 member               8.77

\(~\)

Visualizing average ride length of each trip (in minutes) by user type and hour of day

bike_rides_v2 %>% 
  group_by(user_type, hour) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length_min)) %>% 
  arrange(user_type, hour) %>% 
  ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=average_ride_length, fill=user_type)) + 
  geom_col(position = "dodge") +
  labs(x="Hour of day", y="Ride Length (in minutes)", title = "Average ride length by user type and hour of day")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insight:

  • Casual riders average longer rides than member riders, peaking between 10 AM and 2 PM

\(~\)

Visualizing average ride length of each trip (in minutes) by user type and time of day

axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "Late Night \n3am-6am")

bike_rides_v2 %>% 
  group_by(user_type, time_of_day) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length_min)) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=average_ride_length, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) +
  labs(x="Time of Day", y="Ride Length (in minutes)", title = "Average ride length by user type and time of day") +
  scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insights:

  • Ride length for casual riders peaks mid morning through afternoon
  • Ride length for member riders remains more steady throughout the day
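
Since this level ordering is repeated in several plots, the levels could instead be set once on the data frame. Here is a small sketch (this recoding is an assumption on my part, not something in the original pipeline):

# Sketch: fix the time_of_day factor levels once so later plots sort correctly
# without repeating factor(..., level = ...) in every ggplot call
bike_rides_v2 <- bike_rides_v2 %>%
  mutate(time_of_day = factor(time_of_day,
                              levels = c("Early Morning", "Mid Morning", "Afternoon",
                                         "Evening", "Early Night", "Late Night")))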

\(~\)

Average ride length of each trip (in minutes) by user type and day of the week

aggregate(bike_rides_v2$ride_length_min ~ bike_rides_v2$user_type + bike_rides_v2$day_of_week, FUN = mean)
##    bike_rides_v2$user_type bike_rides_v2$day_of_week
## 1                   casual                    Friday
## 2                   member                    Friday
## 3                   casual                    Monday
## 4                   member                    Monday
## 5                   casual                  Saturday
## 6                   member                  Saturday
## 7                   casual                    Sunday
## 8                   member                    Sunday
## 9                   casual                  Thursday
## 10                  member                  Thursday
## 11                  casual                   Tuesday
## 12                  member                   Tuesday
## 13                  casual                 Wednesday
## 14                  member                 Wednesday
##    bike_rides_v2$ride_length_min
## 1                       20.94996
## 2                       12.43993
## 3                       22.87547
## 4                       12.23843
## 5                       24.97164
## 6                       14.13372
## 7                       25.55735
## 8                       14.00323
## 9                       19.90410
## 10                      12.20314
## 11                      20.14299
## 12                      12.05269
## 13                      19.37565
## 14                      12.04317

\(~\)

Visualizing average ride length of each trip (in minutes) by user type and day of the week

Note: I’m not using x = day_of_week directly because it would plot the days of the week out of order (alphabetically).

bike_rides_v2 %>% 
  group_by(user_type, day_of_week) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length_min)) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=average_ride_length, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day of Week", y="Ride Length (in minutes)", title = "Average ride length by user type and day of the week")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insight:

  • Both rider types take longer rides on weekends

\(~\)

4.4 Mid-point recap of my analysis on riders and ride length

At this point, I’m going to pause and summarize some key points on riders and ride length. While my analysis is not yet complete, I don’t want to lose sight of key points up to now.

\(~\)

  • My data set consists of 5,621,147 rides
  • 2,295,882 (41%) of these rides are casual riders
  • 3,325,265 (59%) of these rides are member riders
  • Both casual and member riders tend to use the classic and electric bikes
  • Member riders take more rides
  • Casual riders average longer rides

\(~\)

  • Average of all ride lengths is 16.63 minutes
  • Average ride length for all casual riders is 22.38 minutes
  • Average ride length for all member riders is 12.66 minutes

\(~\)

  • 5,453,815 (97%) of all rides are <= 60 min
  • 4,428,315 (79%) of all rides are <= 20 min
  • 3,151,382 (56%) of all rides are < 12 min

\(~\)

  • When focused on ride lengths under 12 minutes:
  • 67% of these riders are member riders
  • Average ride length for casual riders in the “<12 min” category is 7.17 minutes
  • Average ride length for member riders in the “<12 min” category is 6.44 minutes

\(~\)

  • When focused on ride lengths <= 20 minutes:
  • 64% of these riders are member riders
  • Average ride length for casual riders in the “<=20 min” category is 10.27 minutes
  • Average ride length for member riders in the “<=20 min” category is 8.77 minutes

\(~\)

Let’s continue on with additional analysis & visuals

\(~\)

4.5 Analyzing total rides by user type and hour of the day

Total rides by user type and by hour of the day

bike_rides_v2 %>% 
  group_by(user_type, hour) %>% 
  summarise(count = n()) %>%  
  arrange(user_type, hour) %>% 
  ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=count, fill=user_type)) + 
  geom_col(position = "dodge") +
  labs(x="Hour of day", y="Number of Rides", title = "Total Rides by user type and hour of day")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insights:

  • Member riders show strong usage beginning at 6 AM, peaking at 8 AM, and peaking again from 4 PM to 6 PM.
  • Casual rides also peak from 4 PM to 6 PM.

\(~\)

Total rides by user type and by time of day

bike_rides_v2 %>% 
   group_by(user_type) %>% 
  summarize("Early Morning" = sum(time_of_day == "Early Morning"),
            "Mid Morning" = sum(time_of_day == "Mid Morning"),
            "Afternoon" = sum(time_of_day == "Afternoon"),
            "Evening" = sum(time_of_day == "Evening"),
            "Early Night" = sum(time_of_day == "Early Night"),
            "Late Night" = sum(time_of_day == "Late Night"))
## # A tibble: 2 × 7
##   user_type `Early Morning` `Mid Morning` Afternoon Evening Early Nigh…¹ Late …²
##   <chr>               <int>         <int>     <int>   <int>        <int>   <int>
## 1 casual             146823        284655   1041548  698034        93924   30898
## 2 member             462623        442804   1416436  884161        69950   49291
## # … with abbreviated variable names ¹​`Early Night`, ²​`Late Night`

\(~\)

Visualizing total rides by user type and by time of day

bike_rides_v2 %>% 
  group_by(user_type, time_of_day) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Time of Day", y="Number of Rides", title = "Total Rides by user type and time of day") +
  scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insights:

  • 76% of early morning rides are taken by member riders.
  • Rides peak in the afternoon and evening for both riders.

\(~\)

Another visual for total rides by user type and by time of day

bike_rides_v2 %>% 
  group_by(user_type, time_of_day) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, color=user_type)) + 
  geom_point() + geom_line(aes(group = user_type)) +
  labs(x="Time of Day", y="Number of Rides", title = "Total Rides by user type and time of day") + ylim(0, NA) +
  scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

\(~\)

4.6 Analyzing total rides by user type and day of the week

\(~\)

Total rides by user type and day of the week

bike_rides_v2 %>% 
   group_by(user_type) %>% 
  summarize("Monday" = sum(day_of_week == "Monday"),
            "Tuesday" = sum(day_of_week == "Tuesday"),
            "Wednesday" = sum(day_of_week == "Wednesday"),
            "Thursday" = sum(day_of_week == "Thursday"),
            "Friday" = sum(day_of_week == "Friday"),
            "Saturday" = sum(day_of_week == "Saturday"),
            "Sunday" = sum(day_of_week == "Sunday"))
## # A tibble: 2 × 8
##   user_type Monday Tuesday Wednesday Thursday Friday Saturday Sunday
##   <chr>      <int>   <int>     <int>    <int>  <int>    <int>  <int>
## 1 casual    278126  257888    268822   299600 330659   473447 387340
## 2 member    478883  512242    517929   520772 465845   443287 386307

\(~\)

Visualizing total rides by user type and day of the week

bike_rides_v2 %>% 
  group_by(user_type, day_of_week) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insight:

  • Member riders take the most rides Monday through Thursday, declining a little over the weekend
  • Casual riders take the most rides on the weekend.

\(~\)

Another visual of total rides by user type and day of the week

bike_rides_v2 %>% 
  group_by(user_type, day_of_week) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, color=user_type)) + 
  geom_point() + geom_line(aes(group = user_type)) +
  labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week") +
  ylim(0, NA)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

\(~\)

4.7 Analyzing total rides by user type and season

\(~\)

Visualizing total rides by user type and season

bike_rides_v2 %>% 
  group_by(user_type, season) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(season, level= c("Spring", "Summer", "Fall", "Winter")), y=count, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Season of Year", y="Number of Rides", title = "Total Rides by user type and season of the year")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

Key insights:

  • In each season we see more member rides.
  • Both rider types peak in summer and decline in winter.
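
For reference, here is a small sketch of the counts behind this plot (using tidyr::pivot_wider to spread the seasons into columns):

# Sketch: the underlying ride counts by user type and season
bike_rides_v2 %>%
  count(user_type, season) %>%
  tidyr::pivot_wider(names_from = season, values_from = n)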

\(~\)

4.8 Analyzing top five starting and ending stations by user types

\(~\)

Top five starting stations for casual riders

bike_rides_v2 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   start_station_name                 count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            61168
## 2 DuSable Lake Shore Dr & Monroe St  32478
## 3 Millennium Park                    26660
## 4 Michigan Ave & Oak St              26633
## 5 DuSable Lake Shore Dr & North Blvd 26398

\(~\)

Visualizing top five starting stations for casual riders. The fct_reorder() function reorders the station names by ride count in ascending order.

The coordinates are flipped with coord_flip() so the station names run along the vertical axis, keeping the long names readable.

bike_rides_v2 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for casual riders")
## Selecting by count

\(~\)

Top five ending stations for casual riders

bike_rides_v2 %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(user_type == "casual") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   end_station_name                   count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            63730
## 2 DuSable Lake Shore Dr & Monroe St  30864
## 3 Millennium Park                    28075
## 4 Michigan Ave & Oak St              27578
## 5 DuSable Lake Shore Dr & North Blvd 27091

Key insight:

  • The top 5 starting and ending stations are the same for casual riders
  • The top starting and ending station for casual riders is Streeter Dr & Grand Ave - situated near a park and shoreline sightseeing - vacationers are likely visiting this area.

\(~\)

Top five starting stations for member riders

bike_rides_v2 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   start_station_name       count
##   <chr>                    <int>
## 1 Ellis Ave & 60th St      35543
## 2 Ellis Ave & 55th St      28910
## 3 Kingsbury St & Kinzie St 26156
## 4 University Ave & 57th St 24859
## 5 Wells St & Concord Ln    24136

\(~\)

Visualizing top five starting stations for member riders

bike_rides_v2 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for member riders")
## Selecting by count

\(~\)

Top five ending stations for member riders

bike_rides_v2 %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(user_type == "member") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   end_station_name         count
##   <chr>                    <int>
## 1 Ellis Ave & 60th St      36767
## 2 University Ave & 57th St 36523
## 3 Kingsbury St & Kinzie St 26479
## 4 Wells St & Concord Ln    25035
## 5 Clark St & Elm St        24107

Key insight:

  • The top 5 starting and ending stations are the same for member riders except for one station
  • The top starting and ending station for member riders is Ellis Ave & 60th St, situated on the University of Chicago campus - this suggests that students, faculty, and staff are likely using the system
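
The four top-five queries in this subsection repeat the same pattern, so a small helper function could reduce the duplication. A hedged sketch (top_stations() is a hypothetical helper, not part of the original analysis):

# Sketch: hypothetical helper that returns the top n stations for a rider type
top_stations <- function(data, station_col, rider, n = 5) {
  data %>%
    filter(user_type == rider, !is.na({{ station_col }})) %>%
    count({{ station_col }}, sort = TRUE, name = "count") %>%
    slice_head(n = n)
}

top_stations(bike_rides_v2, start_station_name, "casual")
top_stations(bike_rides_v2, end_station_name, "member")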

\(~\)

4.9 Export summary file for further analysis

\(~\)

I’ll export the cleaned data frame to a CSV file for further analysis or visualization in other tools

write_csv(bike_rides_v2, file = "bike_rides_v2.csv")
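
If a lighter-weight file is preferred for sharing, a summary table could be exported instead of the full data frame. A quick sketch (the summary columns and file name are just illustrative):

# Sketch: export a smaller summary table instead of all 5.6M rows
ride_summary <- bike_rides_v2 %>%
  group_by(user_type, day_of_week) %>%
  summarize(rides = n(),
            avg_ride_length_min = mean(ride_length_min),
            .groups = "drop")

write_csv(ride_summary, file = "ride_summary_by_user_and_day.csv")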

\(~\)

Step 5 : Share your findings with your audience

For purposes of this project, I will knit this Rmd file to HTML and then publish to RPubs by RStudio. A link will be provided to fellow course mates as well as fellow data analysts.

\(~\)

Step 6: Act on the data and use the analysis results

This step involves summarizing conclusions from the analysis and providing recommendations to answer the business questions.

\(~\)

6.1 Conclusions from the analysis

  • 97% of all rides are 60 minutes or less
  • 79% of all rides are 20 minutes or less
  • 56% of all rides are less than 12 minutes

\(~\)

  • Looking strictly at ride lengths under 12 minutes, 67% of these rides are by member riders, roughly double the number of casual rides in this category.
  • Looking strictly at ride lengths <= 20 minutes, 64% of these rides are by member riders.

\(~\)

  • Member riders make up the majority of rides, take more and shorter rides, and their ride count peaks Monday through Thursday, declining a little over the weekend.
    • 76% of early morning rides are taken by member riders.
    • The top starting station for member riders is Ellis Ave & 60th St - situated within the University of Chicago - students, faculty and staff are likely using this system

\(~\)

  • Casual riders average longer rides, take more rides on the weekends and their rides peak in summer.
    • The top starting station for casual riders is Streeter Dr & Grand Ave - situated near a park and shoreline sightseeing - vacationers are likely visiting this area.

\(~\)

  • In each season we see more member rides, but both rider types peak in summer and decline in winter.

\(~\)

6.2 Additional data analysis to consider

Due to the protection of user privacy, the public dataset used for this project has limitations. Ideally, the marketing team would use the full dataset to carry out the following additional analysis:

  • Determine which casual riders are locals and focus on the riding habits of your local casual riders
  • Break out your single ride and day pass riders
  • How many do you have of each category?
  • What are their riding habits? Who’s riding multiple times a day? Where are they going?
  • Analyze the number of trips and length of trips that local casual riders take

\(~\)

  • Take a closer look at the riding habits of your member riders
  • Number of trips, length of trips, who’s riding multiple times a day, and where are they going?

\(~\)

  • What does the surrounding landscape look like for member riders and local casual riders?
    • Do they live in a densely populated area?
    • Are there major employers around?
    • Are there educational or medical institutions nearby?
    • Get a sense of whether the surrounding infrastructure promotes riding to work, school, grocery stores, eateries, shopping, coffee shops, and social gatherings

\(~\)

  • Gather socioeconomic and cultural demographics to learn how these factors influence the riding habits of these two groups, and use that information to tailor the marketing strategy.

\(~\)

6.3 Recommendations based on data insights

Marketing is an ongoing process. As a first phase, consider a plan focused on your top ride length categories: < 12 minutes and <= 20 minutes. Also, consider which areas around your bike stations best support rides of this length.

  • Raise awareness of the ease of use, affordability, accessibility, green options, and common uses for bike rides <= 20 minutes
  • Market to people whose home area and daily travel needs align well with your bike system, focusing on dense, walkable areas and partnering with schools and companies.
    • Use social media to showcase member riders who are already using bikes to go to work, school, the gym, to meet friends.
    • Partner with schools and companies situated within your bike network, where the system can facilitate 12- to 20-minute rides, to promote sponsored memberships for students and employees.
  • Additionally, there’s an opportunity to market around pricing. Average out the usage and costs per rider to see who would benefit from rolling into the annual membership plan. For example:
    • Unique Day Pass Riders who use the Day Pass more than 8 days in a year
    • Unique Single Riders who ride more than 34 Single Rides in a year or who average at least 3 Single Rides per month, each month
    • Reach out to casual riders whose spending on fees and overage costs means they would clearly save money by switching to an annual membership. While you may lose some revenue on such a member in the short run, you gain loyalty, and this customer is likely to promote your services.
  • For member riders already commuting to work or school, offer a reward for recruiting a friend to ride to work or school.

Conclusion

Thank you for taking the time to review my capstone project! This project helped me walk through the data analysis process from start to finish using real-world data and business questions. Completing it in R and RStudio was very satisfying, as I am completely new to working with big data, R, and RStudio. I’m truly excited and look forward to growing in the field of data analysis.