\(~\)
Welcome to my capstone project for the Google Data Analytics Certificate course! As I journey through the various steps of this project, I will tackle real-world tasks of a data analyst whereby I get to demonstrate my knowledge, skills, and thought process.
This case study will follow a six step data analysis process: Ask, Prepare, Process, Analyze, Share, Act
\(~\)
This step includes the critical steps of asking the right questions to gather enough initial information to direct the course of the project in the right direction. It’s important to make sure you understand the business task and identify all key stakeholders to keep everyone in the loop as you move along.
\(~\)
Three questions will guide the future marketing program:
This case study primarily focuses on the first question.
\(~\)
The business task is to understand how casual riders and member riders use Cyclistic differently with the goal of designing a new marketing strategy aimed at converting casual riders into annual members.
\(~\)
\(~\)
This step includes identifying and collecting the data from its location; determining its integrity, credibility and accessibility; and storing the data in its new location.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
The files are too large to handle in Excel so I chose to use the R programming language and RStudio Cloud to continue my work for this project.
\(~\)
\(~\)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(skimr)
\(~\)
Data frames are the starting point for analyzing data in R so I’ll read the 12 csv files that I’ve uploaded into RStudio and name them “data1”, “data2”, “data3”,…“data12”. I can immediately confirm the number of rows and columns for each data frame by referring to my initial summary table where I notated the row and columns of each file.
\(~\)
data1 <- read_csv("202111_divvy_tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data2 <- read_csv("202112_divvy_tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data3 <- read_csv("202201_divvy_tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data4 <- read_csv("202202_divvy_tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data5 <- read_csv("202203_divvy_tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data6 <- read_csv("202204_divvy_tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data7 <- read_csv("202205_divvy_tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data8 <- read_csv("202206_divvy_tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data9 <- read_csv("202207_divvy_tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data10 <- read_csv("202208_divvy_tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data11 <- read_csv("202209_divvy_tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data12 <- read_csv("202210_divvy_tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
\(~\)
Use colnames() on each new data frame to make sure all have the same 13 columns.
colnames(data1)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
\(~\)
Use rbind to combine the 12 dataframes into one dataframe and name it, bike_rides
bike_rides <- rbind(data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11, data12)
\(~\)
After combining the 12 dataframes into one dataframe, I used the function rm() to remove the 12 individual dataframes from the environment to free up RAM
rm(data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11, data12)
\(~\)
I’ll use class(), dim(), colnames(), and colSums(is.na()) to perform an initial inspection to ensure my data has remained intact up to now. At this point, I’ll use my initial observations from Excel to confirm the following:
\(~\)
confirm dataset is a data frame
class(bike_rides)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
\(~\)
confirm # of rows and columns
dim(bike_rides)
## [1] 5755694 13
\(~\)
confirm column names
colnames(bike_rides)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
\(~\)
confirm blank data fields
colSums(is.na(bike_rides))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 878177 878177 940010 940010
## start_lat start_lng end_lat end_lng
## 0 0 5835 5835
## member_casual
## 0
\(~\)
Getting to know the data is a crucial step that should not be skipped. It’s important to always take enough time to get a good mental sense of the data. Know how big the data is, understand the structure of the data, the data types, the parameters, the dimensions, the variables, the characteristics of those variables, and so on.
Let’s continue on with these additional functions: select(), n(row), ncol(), length(), head(), tail(), glimpse(), str(), summary(), names(), rownames(), skim_without_charts(), View()
select(bike_rides)
## # A tibble: 5,755,694 × 0
nrow(bike_rides)
## [1] 5755694
ncol(bike_rides)
## [1] 13
length(bike_rides)
## [1] 13
head(bike_rides)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7C00A93E10556… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
## 2 90854840DFD50… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
## 3 0A7D10CDD1440… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
## 4 2F3BE33085BCF… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
## 5 D67B4781A1992… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
## 6 02F85C2C3C5F7… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹rideable_type,
## # ²start_station_name, ³start_station_id
tail(bike_rides)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 DA551F0A9C0DB… classi… 2022-10-24 17:45:38 2022-10-24 17:48:02 Sedgwi… TA1307…
## 2 BC3BFA659C9AB… classi… 2022-10-30 01:41:29 2022-10-30 01:57:16 Clifto… TA1307…
## 3 ACD65450291CF… classi… 2022-10-30 01:41:54 2022-10-30 01:57:09 Clifto… TA1307…
## 4 4AAC03D1438E9… classi… 2022-10-15 09:34:11 2022-10-15 10:03:21 Sedgwi… TA1307…
## 5 8E6F3F29785E5… classi… 2022-10-09 10:21:34 2022-10-09 10:43:45 Sedgwi… TA1307…
## 6 8D14CBE672431… docked… 2022-10-22 13:17:13 2022-10-22 13:46:14 Clark … 13146
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹rideable_type,
## # ²start_station_name, ³start_station_id
glimpse(bike_rides)
## Rows: 5,755,694
## Columns: 13
## $ ride_id <chr> "7C00A93E10556E47", "90854840DFD508BA", "0A7D10CDD1…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-11-27 13:27:38, 2021-11-27 13:38:25, 2021-11-…
## $ ended_at <dttm> 2021-11-27 13:46:38, 2021-11-27 13:56:10, 2021-11-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, "Michigan Ave & Oak St", NA, NA…
## $ start_station_id <chr> NA, NA, NA, NA, NA, "13042", NA, NA, NA, NA, NA, NA…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.93000, 41.96000, 41.96000, 41.94000, 41.90000, 4…
## $ start_lng <dbl> -87.72000, -87.70000, -87.70000, -87.79000, -87.630…
## $ end_lat <dbl> 41.96, 41.92, 41.96, 41.93, 41.88, 41.90, 41.80, 41…
## $ end_lng <dbl> -87.73, -87.70, -87.70, -87.79, -87.62, -87.63, -87…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
str(bike_rides)
## spc_tbl_ [5,755,694 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5755694] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr [1:5755694] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5755694], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
## $ ended_at : POSIXct[1:5755694], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
## $ start_station_name: chr [1:5755694] NA NA NA NA ...
## $ start_station_id : chr [1:5755694] NA NA NA NA ...
## $ end_station_name : chr [1:5755694] NA NA NA NA ...
## $ end_station_id : chr [1:5755694] NA NA NA NA ...
## $ start_lat : num [1:5755694] 41.9 42 42 41.9 41.9 ...
## $ start_lng : num [1:5755694] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num [1:5755694] 42 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:5755694] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr [1:5755694] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(bike_rides)
## ride_id rideable_type started_at
## Length:5755694 Length:5755694 Min. :2021-11-01 00:00:14.00
## Class :character Class :character 1st Qu.:2022-04-27 16:40:09.00
## Mode :character Mode :character Median :2022-06-30 18:31:03.00
## Mean :2022-06-13 23:04:32.59
## 3rd Qu.:2022-08-24 19:52:19.75
## Max. :2022-10-31 23:59:33.00
##
## ended_at start_station_name start_station_id
## Min. :2021-11-01 00:04:06.00 Length:5755694 Length:5755694
## 1st Qu.:2022-04-27 16:51:40.25 Class :character Class :character
## Median :2022-06-30 18:49:28.00 Mode :character Mode :character
## Mean :2022-06-13 23:23:58.99
## 3rd Qu.:2022-08-24 20:10:05.75
## Max. :2022-11-07 04:53:58.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5755694 Length:5755694 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5755694
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. :-87.30
## NA's :5835 NA's :5835
names(bike_rides)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
skim_without_charts(bike_rides)
| Name | bike_rides |
| Number of rows | 5755694 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5755694 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 878177 | 0.85 | 7 | 64 | 0 | 1639 | 0 |
| start_station_id | 878177 | 0.85 | 3 | 44 | 0 | 1306 | 0 |
| end_station_name | 940010 | 0.84 | 9 | 64 | 0 | 1663 | 0 |
| end_station_id | 940010 | 0.84 | 3 | 44 | 0 | 1314 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5835 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.37 |
| end_lng | 5835 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.30 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-11-01 00:00:14 | 2022-10-31 23:59:33 | 2022-06-30 18:31:03 | 4824622 |
| ended_at | 0 | 1 | 2021-11-01 00:04:06 | 2022-11-07 04:53:58 | 2022-06-30 18:49:28 | 4836310 |
\(~\)
I’ll use the is.na() function on each variable/column in my data frame to get a sense of what is going on where I have missing data. Let’s get started!
I can pair is.na() with View() to create a spreadsheet like table and get a better look at all rows and see if I notice any patterns. For example: View(bike_rides[is.na(start_station_name),])
attach(bike_rides)
bike_rides[is.na(start_station_name),]
## # A tibble: 878,177 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
## 2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
## 3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
## 4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
## 5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
## 6 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA> <NA>
## 7 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA> <NA>
## 8 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA> <NA>
## 9 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA> <NA>
## 10 A2A194358CA6… electr… 2021-11-27 12:49:10 2021-11-27 12:52:47 <NA> <NA>
## # … with 878,167 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(start_station_id),]
## # A tibble: 878,177 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
## 2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
## 3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
## 4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
## 5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
## 6 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA> <NA>
## 7 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA> <NA>
## 8 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA> <NA>
## 9 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA> <NA>
## 10 A2A194358CA6… electr… 2021-11-27 12:49:10 2021-11-27 12:52:47 <NA> <NA>
## # … with 878,167 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(end_station_name),]
## # A tibble: 940,010 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
## 2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
## 3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
## 4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
## 5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
## 6 02F85C2C3C5F… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042
## 7 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA> <NA>
## 8 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA> <NA>
## 9 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA> <NA>
## 10 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA> <NA>
## # … with 940,000 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(end_station_id),]
## # A tibble: 940,010 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7C00A93E1055… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
## 2 90854840DFD5… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
## 3 0A7D10CDD144… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
## 4 2F3BE33085BC… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
## 5 D67B4781A199… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
## 6 02F85C2C3C5F… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042
## 7 EF780B807EF7… electr… 2021-11-27 13:31:12 2021-11-27 13:37:12 <NA> <NA>
## 8 17069CC74912… electr… 2021-11-27 14:33:56 2021-11-27 14:34:38 <NA> <NA>
## 9 93FC4662BDC5… electr… 2021-11-27 09:14:33 2021-11-27 09:19:36 <NA> <NA>
## 10 B06B064398A3… electr… 2021-11-27 16:13:31 2021-11-27 16:22:50 <NA> <NA>
## # … with 940,000 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(start_lat),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## # ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
bike_rides[is.na(start_lng),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## # ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
bike_rides[is.na(end_lat),]
## # A tibble: 5,835 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 D66FB7A5034C… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307
## 2 214DC891AC5E… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530
## 3 4409AA46BDD7… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950
## 4 C4A464C2818D… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008
## 5 E58A224FA0AC… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008
## 6 D893434A1E5B… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544
## 7 6653EC1EF2B9… classi… 2021-11-12 10:00:16 2021-11-13 11:00:02 Univer… KA1503…
## 8 B346545D5F95… docked… 2021-11-01 18:49:55 2021-11-02 07:48:18 Shedd … 15544
## 9 4FCD69A37C7B… classi… 2021-11-06 16:24:11 2021-11-07 16:24:06 Wester… 13241
## 10 DCAA625495CA… classi… 2021-11-20 00:47:58 2021-11-21 01:47:53 Broadw… 13074
## # … with 5,825 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(end_lng),]
## # A tibble: 5,835 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 D66FB7A5034C… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307
## 2 214DC891AC5E… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530
## 3 4409AA46BDD7… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950
## 4 C4A464C2818D… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008
## 5 E58A224FA0AC… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008
## 6 D893434A1E5B… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544
## 7 6653EC1EF2B9… classi… 2021-11-12 10:00:16 2021-11-13 11:00:02 Univer… KA1503…
## 8 B346545D5F95… docked… 2021-11-01 18:49:55 2021-11-02 07:48:18 Shedd … 15544
## 9 4FCD69A37C7B… classi… 2021-11-06 16:24:11 2021-11-07 16:24:06 Wester… 13241
## 10 DCAA625495CA… classi… 2021-11-20 00:47:58 2021-11-21 01:47:53 Broadw… 13074
## # … with 5,825 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
bike_rides[is.na(member_casual),]
## # A tibble: 0 × 13
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## # ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
\(~\)
Summary around missing data, limitations and other problems:
Upon review of the data, I got a deeper look into missing data and recognized other problems and limitations as follows:
Additionally…
These data columns have missing values which we’ll address further along in the process:
\(~\)
Divvy’s historical trip data is appropriate data to answer the business questions. The data provided will help us to understand how the casual riders and member riders use bikes differently. While the data files provide consistent columns of data, we can also expand upon the data by using calculations and functions to draw deeper insights.
While the data has some limitations and problems, overall, there aren’t any major issues that render the data unusable. We have a great dataset from which we can gather many insights such as:
\(~\)
This step includes selecting appropriate tools for the amount of data you will be working with. It also involves checking for data errors, cleaning the data, transforming the data by adding, renaming and removing data and, finally, verifying that the data is clean and ready for analysis. It’s important to follow a structured process and document all your steps so that teammates can follow along and perform checks on your work.
\(~\)
Since the combined dataset is very large with 5.7 million rows, R and RStudio Cloud has been chosen as the tool for data manipulation, cleaning, aggregation, analysis and visualization.
\(~\)
\(~\)
bike_rides <- rename(bike_rides, "bike_type" = "rideable_type", "user_type" = "member_casual")
\(~\)
bike_rides$started_at <- ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- ymd_hms(bike_rides$ended_at)
\(~\)
Adding a new column called, “ride_length_min”.
Using the difftime() function to calculate the length of each trip in minutes, rounded to two decimals.
bike_rides$ride_length_min <- round(as.numeric(difftime(bike_rides$ended_at, bike_rides$started_at, units = "mins")), 2)
\(~\)
Verify that R recognizes my new variable as numeric so that I can perform calculations.
class(bike_rides$ride_length_min)
## [1] "numeric"
\(~\)
Adding columns for: date, month, day, year, day_of_week, and hour
bike_rides$date <- as.Date(bike_rides$started_at)
bike_rides$month <- format(as.Date(bike_rides$date), "%B")
bike_rides$day <- format(as.Date(bike_rides$date), "%d")
bike_rides$year <- format(as.Date(bike_rides$date), "%Y")
bike_rides$day_of_week <- format(as.Date(bike_rides$date), "%A")
bike_rides$hour <- lubridate::hour(bike_rides$started_at)
\(~\)
Adding a column for season
bike_rides <- bike_rides %>% mutate(season = recode(month,
December = "Winter",
January = "Winter",
February = "Winter",
March = "Spring",
April = "Spring",
May = "Spring",
June = "Summer",
July = "Summer",
August = "Summer",
September = "Fall",
October = "Fall",
November = "Fall"))
\(~\)
Adding a column for “time_of_day” by using the case_when() function
bike_rides <- bike_rides %>% mutate(time_of_day = case_when(
hour >= 6 & hour < 9 ~ "Early Morning",
hour >= 9 & hour < 12 ~ "Mid Morning",
hour >= 12 & hour < 18 ~ "Afternoon",
hour >= 18 & hour <= 23 ~ "Evening",
hour >= 0 & hour < 3 ~ "Early Night",
hour >= 3 & hour < 6 ~ "Late Night"))
\(~\)
\(~\)
Let’s calculate where ride_legth_min is greater than 1,440 minutes (or 24 hours)
sum(bike_rides$ride_length_min > 1440)
## [1] 5364
\(~\)
Let’s calculate where ride_length_min is less than 1 minute (or 60 seconds)
sum(bike_rides$ride_length_min < 1)
## [1] 118453
\(~\)
Let’s calculate where ride_length_min is less than 0 or a negative number. I’ll use different code to arrive at the same solution.
sum(bike_rides$ride_length_min < 0)
## [1] 112
sum(bike_rides$started_at > bike_rides$ended_at)
## [1] 112
length(which(bike_rides$started_at > bike_rides$ended_at))
## [1] 112
\(~\)
Just curious to see where ride_length_min is greater than 6 hours (or 360 minutes)
sum(bike_rides$ride_length_min > 360)
## [1] 10130
\(~\)
Let’s check if the missing start station id codes pertain to a particular bike type or user type. We can see that the majority of the problem is with electric bikes and does not pertain to a particulare user type.
bike_rides %>% filter(is.na(start_station_id)) %>%
count(start_station_id, start_station_name, bike_type, user_type)
## # A tibble: 2 × 5
## start_station_id start_station_name bike_type user_type n
## <chr> <chr> <chr> <chr> <int>
## 1 <NA> <NA> electric_bike casual 363963
## 2 <NA> <NA> electric_bike member 514214
\(~\)
The missing end station id pertains to the three bike types and both user types, but again, the majority of the problem is with electric bikes.
bike_rides %>% filter(is.na(end_station_id)) %>%
count(end_station_id, end_station_name, bike_type, user_type)
## # A tibble: 5 × 5
## end_station_id end_station_name bike_type user_type n
## <chr> <chr> <chr> <chr> <int>
## 1 <NA> <NA> classic_bike casual 2805
## 2 <NA> <NA> classic_bike member 1533
## 3 <NA> <NA> docked_bike casual 2570
## 4 <NA> <NA> electric_bike casual 422751
## 5 <NA> <NA> electric_bike member 510351
\(~\)
Let’s take a look at what else is going on where start station id is missing. In particular, I want to see what is going on with start station name, start_lat & start_lng. This shows me that start_station name is missing and both the start_lat & start_lng only go out two decimals points.
Adding the View() function will provide a table that you can then also filter on.
bike_rides %>% filter(is.na(start_station_id)) %>%
count(start_station_id, start_station_name, start_lat, start_lng)
## # A tibble: 621 × 5
## start_station_id start_station_name start_lat start_lng n
## <chr> <chr> <dbl> <dbl> <int>
## 1 <NA> <NA> 41.6 -87.6 1
## 2 <NA> <NA> 41.6 -87.5 3
## 3 <NA> <NA> 41.6 -87.6 22
## 4 <NA> <NA> 41.6 -87.6 2
## 5 <NA> <NA> 41.6 -87.6 8
## 6 <NA> <NA> 41.6 -87.6 1
## 7 <NA> <NA> 41.6 -87.6 3
## 8 <NA> <NA> 41.6 -87.6 20
## 9 <NA> <NA> 41.6 -87.5 21
## 10 <NA> <NA> 41.6 -87.5 5
## # … with 611 more rows
\(~\)
Let’s take a look at what else is going on where end station id is missing. In particular, I want to see what is going on with end station name, end_lat & end_lng. This shows me that end_station name is missing and both the start_lat & start_lng only go out two decimals points.
Adding the View() function will provide a table that you can then also filter on.
From this view, I can also see that there are 5,835 instances where both end_lat & end_lng are missing.
bike_rides %>% filter(is.na(end_station_id)) %>%
count(end_station_id, end_station_name, end_lat, end_lng)
## # A tibble: 891 × 5
## end_station_id end_station_name end_lat end_lng n
## <chr> <chr> <dbl> <dbl> <int>
## 1 <NA> <NA> 41.4 -89.0 1
## 2 <NA> <NA> 41.5 -87.6 1
## 3 <NA> <NA> 41.6 -87.3 4
## 4 <NA> <NA> 41.6 -87.7 2
## 5 <NA> <NA> 41.6 -87.6 1
## 6 <NA> <NA> 41.6 -87.7 1
## 7 <NA> <NA> 41.6 -87.6 1
## 8 <NA> <NA> 41.6 -87.7 1
## 9 <NA> <NA> 41.6 -87.6 1
## 10 <NA> <NA> 41.6 -87.6 1
## # … with 881 more rows
\(~\)
To deal with the missing values for both start & end station names & ids, I’m going to create four new columns to show start_lat, start_lng, end_lat & end_lng all rounded to 2 decimal places. I’ll then use the new rounded lat & lng to impute the missing start & end station names.
My logic: This is ok to do because while not perfect, it will get me station names that are close enough distance-wise to perform the analysis for this project. I don’t need exact docking locations to analyze the rider habits for the context of this project.
\(~\)
Creating four new columns to show start_lat, start_lng, end_lat & end_lng all rounded to 2 decimal places.
bike_rides <- bike_rides %>%
mutate(start_lat_round = round(start_lat, digits = 2),
start_lng_round = round(start_lng, digits = 2),
end_lat_round = round(end_lat, digits = 2),
end_lng_round = round(end_lng, digits = 2))
\(~\)
Impute missing start station names
bike_rides <- bike_rides %>%
group_by(start_lat_round, start_lng_round) %>%
tidyr::fill(start_station_name, .direction = "downup") %>%
ungroup()
\(~\)
Impute missing end station names
bike_rides <- bike_rides %>%
group_by(end_lat_round, end_lng_round) %>%
tidyr::fill(end_station_name, .direction = "downup") %>%
ungroup()
\(~\)
Impute missing start_station_id
bike_rides <- bike_rides %>%
group_by(start_station_name) %>%
tidyr::fill(start_station_id, .direction = "downup") %>%
ungroup()
\(~\)
Impute missing end_station_id
bike_rides <- bike_rides %>%
group_by(end_station_name) %>%
tidyr::fill(end_station_id, .direction = "downup") %>%
ungroup()
\(~\)
Now that we’ve imputed alot of the missing data, let’s check missing values by column, again, and see what’s still missing.
colSums(is.na(bike_rides))
## ride_id bike_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 12097 12097 24394 24394
## start_lat start_lng end_lat end_lng
## 0 0 5835 5835
## user_type ride_length_min date month
## 0 0 0 0
## day year day_of_week hour
## 0 0 0 0
## season time_of_day start_lat_round start_lng_round
## 0 0 0 0
## end_lat_round end_lng_round
## 5835 5835
\(~\)
Where end_lat & end_lng are missing, we don’t have an end_station_name or end_station_id so the missing data cannot be imputed using those fields. These are all 1 minute rides with the exception of one 2 minute ride.
Since these are 1 minute rides with missing end-lat & end_lng I’m assuming they are flukes and I am going to remove them. Percentage-wise 5,835 of missing data is considered immaterial and will not impact the integrity of my dataset.
bike_rides %>% filter(is.na(end_lat)) %>%
count(end_station_name, end_station_id, end_lat, end_lng, bike_type)
## # A tibble: 2 × 6
## end_station_name end_station_id end_lat end_lng bike_type n
## <chr> <chr> <dbl> <dbl> <chr> <int>
## 1 <NA> <NA> NA NA classic_bike 3265
## 2 <NA> <NA> NA NA docked_bike 2570
\(~\)
Where start_station_name is missing, we don’t have a start_station_id to impute the missing data. I should be able to impute from data with matching end_lat & end_lng coordinates. That would be trickier because, we may have multiple station names where the lat & lng coordinates only go out 2 decimal places.
For timesake and because the amount of data is immaterial, I’m going to leave as is and move on. I could remove this data as it is immaterial. Adding the View() function will provide a table that you can then also filter on.
bike_rides %>% filter(is.na(start_station_name)) %>%
count(start_station_name, start_station_id, start_lat, start_lng, bike_type)
## # A tibble: 112 × 6
## start_station_name start_station_id start_lat start_lng bike_type n
## <chr> <chr> <dbl> <dbl> <chr> <int>
## 1 <NA> <NA> 41.6 -87.6 electric_bike 1
## 2 <NA> <NA> 41.6 -87.5 electric_bike 3
## 3 <NA> <NA> 41.6 -87.6 electric_bike 2
## 4 <NA> <NA> 41.6 -87.6 electric_bike 1
## 5 <NA> <NA> 41.6 -87.6 electric_bike 3
## 6 <NA> <NA> 41.6 -87.5 electric_bike 5
## 7 <NA> <NA> 41.7 -87.6 electric_bike 13
## 8 <NA> <NA> 41.7 -87.6 electric_bike 24
## 9 <NA> <NA> 41.7 -87.6 electric_bike 6
## 10 <NA> <NA> 41.7 -87.6 electric_bike 3
## # … with 102 more rows
\(~\)
Where end_station_name is missing, we don’t have a end_station_id to impute the missing data, but I should be able to impute from data with matching end_lat & end_lng coordinates. That would be trickier because, we may have multiple station names where the lat & lng coordinates only go out 2 decimal places.
For timesake and because the amount of data is immaterial, I’m going to leave as is and move on. I could remove this data as it is immaterial. Adding the View() function will provide a table that you can then also filter on.
bike_rides %>% filter(is.na(end_station_name)) %>%
count(end_station_name, end_station_id, end_lat, end_lng, bike_type)
## # A tibble: 394 × 6
## end_station_name end_station_id end_lat end_lng bike_type n
## <chr> <chr> <dbl> <dbl> <chr> <int>
## 1 <NA> <NA> 41.4 -89.0 electric_bike 1
## 2 <NA> <NA> 41.5 -87.6 electric_bike 1
## 3 <NA> <NA> 41.6 -87.3 electric_bike 4
## 4 <NA> <NA> 41.6 -87.7 electric_bike 2
## 5 <NA> <NA> 41.6 -87.6 electric_bike 1
## 6 <NA> <NA> 41.6 -87.7 electric_bike 1
## 7 <NA> <NA> 41.6 -87.6 electric_bike 1
## 8 <NA> <NA> 41.6 -87.7 electric_bike 1
## 9 <NA> <NA> 41.6 -87.6 electric_bike 1
## 10 <NA> <NA> 41.6 -87.6 electric_bike 1
## # … with 384 more rows
\(~\)
After a thorough review of the data, I’ll remove the following:
\(~\)
I’ll use the select() function to create a new data frame with only selected columns.
bike_rides_v2 <- select(bike_rides, c(1,2,5, 6:16, 13:16, 18:22))
\(~\)
Remove rides less than 60 seconds (or 1 minute) and greater than 24 hrs (or 1440 minutes) in length. This will remove all rides with a negative ride_length
bike_rides_v2 <- bike_rides_v2 %>%
filter(ride_length_min >= 1 & ride_length_min <= 1440)
\(~\)
Remove rides where end_lat & end_lng are both missing
bike_rides_v2 <- bike_rides_v2 %>%
filter(!is.na(end_lat) & !is.na(end_lng))
\(~\)
Remove rides related to test/repair stations
bike_rides_v2 <- bike_rides_v2 %>%
filter(!start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
bike_rides_v2 <- bike_rides_v2 %>%
filter(!end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
\(~\)
Confirms removal where ride_legth_min is greater than 1,440 minutes (24 hours)
sum(bike_rides_v2$ride_length_min > 1440)
## [1] 0
\(~\)
Confirms removal where ride_length_min is less than 1 minute (or 60 seconds)
sum(bike_rides_v2$ride_length_min < 1)
## [1] 0
\(~\)
Confirms removal of rides related to test/repair stations
bike_rides_v2 %>% filter(start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
count(start_station_id)
## # A tibble: 0 × 2
## # … with 2 variables: start_station_id <chr>, n <int>
\(~\)
Confirms removal of rides related to test/repair stations
bike_rides_v2 %>% filter(end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
count(end_station_id)
## # A tibble: 0 × 2
## # … with 2 variables: end_station_id <chr>, n <int>
\(~\)
Confirms removal where end_lat & end_lng were both missing
colSums(is.na(bike_rides_v2))
## ride_id bike_type start_station_name start_station_id
## 0 0 11750 11750
## end_station_name end_station_id start_lat start_lng
## 17899 17899 0 0
## end_lat end_lng user_type ride_length_min
## 0 0 0 0
## date month year day_of_week
## 0 0 0 0
## hour season time_of_day
## 0 0 0
\(~\)
Let’s get a row count and take a quick glimpse of our new data frame, bike_rides_v2. After cleaning the data, we have 5,621,147 rows of data which means we removed a total of 134,547 rows.
nrow(bike_rides_v2)
## [1] 5621147
glimpse(bike_rides_v2)
## Rows: 5,621,147
## Columns: 19
## $ ride_id <chr> "7C00A93E10556E47", "90854840DFD508BA", "0A7D10CDD1…
## $ bike_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ start_station_name <chr> "Kosciuszko Park", "California Ave & Montrose Ave",…
## $ start_station_id <chr> "15643", "15622", "15622", "397", "TA1306000011", "…
## $ end_station_name <chr> "Keystone Ave & Montrose Ave", "Humboldt Blvd & Arm…
## $ end_station_id <chr> "KA1504000164", "15651", "15622", "314", "13300", "…
## $ start_lat <dbl> 41.93000, 41.96000, 41.96000, 41.94000, 41.90000, 4…
## $ start_lng <dbl> -87.72000, -87.70000, -87.70000, -87.79000, -87.630…
## $ end_lat <dbl> 41.96, 41.92, 41.96, 41.93, 41.88, 41.90, 41.80, 41…
## $ end_lng <dbl> -87.73, -87.70, -87.70, -87.79, -87.62, -87.63, -87…
## $ user_type <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length_min <dbl> 19.00, 17.75, 2.37, 5.02, 21.22, 18.70, 6.00, 5.05,…
## $ date <date> 2021-11-27, 2021-11-27, 2021-11-26, 2021-11-27, 20…
## $ month <chr> "November", "November", "November", "November", "No…
## $ year <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20…
## $ day_of_week <chr> "Saturday", "Saturday", "Friday", "Saturday", "Frid…
## $ hour <int> 13, 13, 22, 9, 19, 18, 13, 9, 16, 12, 13, 11, 16, 1…
## $ season <chr> "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fa…
## $ time_of_day <chr> "Afternoon", "Afternoon", "Evening", "Mid Morning",…
\(~\)
We still have some missing start and end station names & ids, but we have the start & end lat & lng for these so we can work with this data if needed. If we wanted to remove it, it would not be a big deal because it is a small percent of the data. For now we’ll keep it.
colSums(is.na(bike_rides_v2))
## ride_id bike_type start_station_name start_station_id
## 0 0 11750 11750
## end_station_name end_station_id start_lat start_lng
## 17899 17899 0 0
## end_lat end_lng user_type ride_length_min
## 0 0 0 0
## date month year day_of_week
## 0 0 0 0
## hour season time_of_day
## 0 0 0
\(~\)
Before moving on to the analysis phase I’ll run through a quick check off to verify the data is now clean, accurate, consistent and complete.
\(~\)
At this step, I’ll work with the data to analyze and look at it in different ways. I’ll use functions to help me explore relationships and trends while being mindful to not stray from my objective which is to analyze how annual member riders and casual riders use Cyclistic bikes differently.
\(~\)
summary(bike_rides_v2)
## ride_id bike_type start_station_name start_station_id
## Length:5621147 Length:5621147 Length:5621147 Length:5621147
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## end_station_name end_station_id start_lat start_lng
## Length:5621147 Length:5621147 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
## end_lat end_lng user_type ride_length_min
## Min. :41.39 Min. :-88.97 Length:5621147 Min. : 1.00
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 6.08
## Median :41.90 Median :-87.64 Mode :character Median : 10.55
## Mean :41.90 Mean :-87.65 Mean : 16.63
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 18.82
## Max. :42.37 Max. :-87.30 Max. :1439.85
## date month year day_of_week
## Min. :2021-11-01 Length:5621147 Length:5621147 Length:5621147
## 1st Qu.:2022-04-27 Class :character Class :character Class :character
## Median :2022-06-30 Mode :character Mode :character Mode :character
## Mean :2022-06-13
## 3rd Qu.:2022-08-24
## Max. :2022-10-31
## hour season time_of_day
## Min. : 0.00 Length:5621147 Length:5621147
## 1st Qu.:11.00 Class :character Class :character
## Median :15.00 Mode :character Mode :character
## Mean :14.22
## 3rd Qu.:18.00
## Max. :23.00
\(~\)
Let’s get a count and percentage breakdown of our two user types: the casual rider vs the member rider
bike_rides_v2 %>%
group_by(user_type) %>%
summarise(count = n(), Percentage = n()/nrow(bike_rides_v2)*100)
## # A tibble: 2 × 3
## user_type count Percentage
## <chr> <int> <dbl>
## 1 casual 2295882 40.8
## 2 member 3325265 59.2
\(~\)
Visualizing total rides by user type
ggplot(bike_rides_v2, aes(user_type, fill=user_type)) +
geom_bar() +
labs(x="User Type", y="Count", title = "Total Rides by User Type: Casual vs Member") +
annotate("text",x=1,y=2000000,label="2,295,882 / (41%)",color="black",size=3.5) +
annotate("text",x=2,y=3000000,label="3,325,265 / (59%)",color="black",size=3.5)
Key insight:
\(~\)
Visualizing total rides by user type and bike type
bike_rides_v2 %>%
group_by(user_type, bike_type) %>%
summarise(count = n()) %>%
ggplot(aes(x=user_type, y=count, fill=bike_type)) +
geom_bar(stat="identity", width = 0.3) +
labs(x="Bike Type", y="Number of Rides", title = "Total Rides by user type and bike type")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insight:
\(~\)
Let’s spend some time looking at the length of each trip (in minutes)
summary(bike_rides_v2$ride_length_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.08 10.55 16.63 18.82 1439.85
Key insights:
\(~\)
Average ride length of each trip (in minutes) by user type
aggregate(bike_rides_v2$ride_length_min ~ bike_rides_v2$user_type, FUN = mean)
## bike_rides_v2$user_type bike_rides_v2$ride_length_min
## 1 casual 22.37841
## 2 member 12.65979
\(~\)
Taking a deeper dive into ride lengths
Here, I’m summarizing the data by various ride lengths to see if anything stands out. The table produced shows us that 5,453,815 of the rides are 60 minutes or less. This is an important area to focus on.
bike_rides_v2 %>%
group_by(user_type) %>%
summarize("<=5min" = sum(ride_length_min <=5),
"<=15min" = sum(ride_length_min <=15),
"<=30min" = sum(ride_length_min <=30),
"<=45min" = sum(ride_length_min <=45),
"<=60min" = sum(ride_length_min <=60),
">2hrs" = sum(ride_length_min >120),
">4hrs" = sum(ride_length_min >240),
">6hrs" = sum(ride_length_min >360),)
## # A tibble: 2 × 9
## user_type `<=5min` `<=15min` `<=30min` <=45m…¹ <=60m…² `>2hrs` `>4hrs` `>6hrs`
## <chr> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 253461 1266259 1865460 2063967 2151575 34139 5800 3260
## 2 member 751888 2451121 3109671 3269886 3302240 5735 2058 1244
## # … with abbreviated variable names ¹`<=45min`, ²`<=60min`
\(~\)
Supporting analysis in Excel
I created a table in Excel to get a better breakdown. Let’s take a
look:
Key insights:
\(~\)
Let’s look strictly at ride lengths under 12 minutes
Key insights:
\(~\)
Let’s look strictly at ride lengths <= 20 minutes
Key insights:
\(~\)
I’ll reproduce some of this analysis in R.
I’ll create a table to show the breakdown I displayed in Excel. This tables shows us that 56% of the rides are less than 12 minutes and 79% of the rides are 20 minutes or less.
bike_rides_v2 %>%
group_by(user_type) %>%
summarize("<12 min" = sum(ride_length_min <11.99),
"12-20 min" = sum(ride_length_min >=12 & ride_length_min <=20.99),
"21-30 min" = sum(ride_length_min >=21 & ride_length_min <=30.99),
"31-60 min" = sum(ride_length_min >=31 & ride_length_min <=60),
"61-120 min" = sum(ride_length_min >=60.01 & ride_length_min <=120.99),
"121-240 min" = sum(ride_length_min >=121 & ride_length_min <=240.99),
"241+min" = sum(ride_length_min >=241))
## # A tibble: 2 × 8
## user_type `<12 min` `12-20 min` `21-30 min` 31-60 mi…¹ 61-12…² 121-2…³ 241+m…⁴
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 1025555 565694 294472 265854 110806 27750 5751
## 2 member 2125827 711239 290623 174551 17365 3608 2052
## # … with abbreviated variable names ¹`31-60 min`, ²`61-120 min`,
## # ³`121-240 min`, ⁴`241+min`
\(~\)
Adding a column to create ride length categories to get a better visual in R
bike_rides_v2 <- bike_rides_v2 %>% mutate(ride_length_cat = case_when(
ride_length_min <11.99 ~ "< 12 min",
ride_length_min >=12 & ride_length_min <=20.99 ~ "12-20 min",
ride_length_min >=21 & ride_length_min <=30.99 ~ "21-30 min",
ride_length_min >=31 & ride_length_min <=60.99 ~ "31-60 min",
ride_length_min >=60 & ride_length_min <=120.99 ~ "61-120 min",
ride_length_min >=121 & ride_length_min <=240.99 ~ "121-240 min",
ride_length_min >=241 ~ "241+ min"))
\(~\)
Visualizing total rides by user type and ride length category
I had to use a factor/vector to get my x-axis labels to appear in the correct order. Without the factor/vector category labels starting with a “1” showed up first, followed by “2” and “3”.
bike_rides_v2 %>%
group_by(user_type, ride_length_cat) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(ride_length_cat, level = c("< 12 min", "12-20 min", "21-30 min", "31-60 min",
"61-120 min", "121-240 min", "241+ min")), y=count, fill=user_type)) +
geom_col(position = "dodge") +
labs(x="Ride Length Category", y="Number of Rides", title = "Total Rides by user type and ride length category")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insight:
\(~\)
Average Ride length in minutes of those rides in the “< 12 min” category.
bike_rides_v2 %>% filter(ride_length_cat == "< 12 min") %>%
group_by(user_type) %>%
summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
## user_type avg_ride_length
## <chr> <dbl>
## 1 casual 7.17
## 2 member 6.44
\(~\)
Average Ride length in minutes of those rides in the “<= 20 min” category.
bike_rides_v2 %>% filter(ride_length_cat == "< 12 min" | ride_length_cat== "12-20 min") %>%
group_by(user_type) %>%
summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
## user_type avg_ride_length
## <chr> <dbl>
## 1 casual 10.3
## 2 member 8.77
\(~\)
Visualizing average ride length of each trip (in minutes) by user type and hour of day
bike_rides_v2 %>%
group_by(user_type, hour) %>%
summarise(count = n(), average_ride_length=mean(ride_length_min)) %>%
arrange(user_type, hour) %>%
ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=average_ride_length, fill=user_type)) +
geom_col(position = "dodge") +
labs(x="Hour of day", y="Ride Length (in minutes)", title = "Average ride length by user type and hour of day")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insight:
\(~\)
Visualizing average ride length of each trip (in minutes) by user type and time of day
axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "Wee Night \n3am-6am")
bike_rides_v2 %>%
group_by(user_type, time_of_day) %>%
summarise(count = n(), average_ride_length=mean(ride_length_min)) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=average_ride_length, fill=user_type)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Time of Day", y="Ride Length (in minutes)", title = "Average ride length by user type and time of day") +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insights:
\(~\)
Average ride length of each trip (in minutes) by user type and day of the week
aggregate(bike_rides_v2$ride_length_min ~ bike_rides_v2$user_type + bike_rides_v2$day_of_week, FUN = mean)
## bike_rides_v2$user_type bike_rides_v2$day_of_week
## 1 casual Friday
## 2 member Friday
## 3 casual Monday
## 4 member Monday
## 5 casual Saturday
## 6 member Saturday
## 7 casual Sunday
## 8 member Sunday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Tuesday
## 12 member Tuesday
## 13 casual Wednesday
## 14 member Wednesday
## bike_rides_v2$ride_length_min
## 1 20.94996
## 2 12.43993
## 3 22.87547
## 4 12.23843
## 5 24.97164
## 6 14.13372
## 7 25.55735
## 8 14.00323
## 9 19.90410
## 10 12.20314
## 11 20.14299
## 12 12.05269
## 13 19.37565
## 14 12.04317
\(~\)
Visualizing average ride length of each trip (in minutes) by user type and day of the week
note: I’m not using x=day_of_week because it results in the days of week completely out of order
bike_rides_v2 %>%
group_by(user_type, day_of_week) %>%
summarise(count = n(), average_ride_length=mean(ride_length_min)) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=average_ride_length, fill=user_type)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Day of Week", y="Ride Length (in minutes)", title = "Average ride length by user type and day of the week")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insight:
\(~\)
At this point, I’m going to pause and summarize some key points on riders and ride length. While my analysis is not yet complete, I don’t want to lose sight of key points up to now.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
Let’s continue on with additional analysis & visuals
\(~\)
Total rides by user type and by hour of the day
bike_rides_v2 %>%
group_by(user_type, hour) %>%
summarise(count = n()) %>%
arrange(user_type, hour) %>%
ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=count, fill=user_type)) +
geom_col(position = "dodge") +
labs(x="Hour of day", y="Number of Rides", title = "Total Rides by user type and hour of day")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insights:
\(~\)
Total rides by user type and by time of day
bike_rides_v2 %>%
group_by(user_type) %>%
summarize("Early Morning" = sum(time_of_day == "Early Morning"),
"Mid Morning" = sum(time_of_day == "Mid Morning"),
"Afternoon" = sum(time_of_day == "Afternoon"),
"Evening" = sum(time_of_day == "Evening"),
"Early Night" = sum(time_of_day == "Early Night"),
"Late Night" = sum(time_of_day == "Late Night"))
## # A tibble: 2 × 7
## user_type `Early Morning` `Mid Morning` Afternoon Evening Early Nigh…¹ Late …²
## <chr> <int> <int> <int> <int> <int> <int>
## 1 casual 146823 284655 1041548 698034 93924 30898
## 2 member 462623 442804 1416436 884161 69950 49291
## # … with abbreviated variable names ¹`Early Night`, ²`Late Night`
\(~\)
Visualizing total rides by user type and by time of day
bike_rides_v2 %>%
group_by(user_type, time_of_day) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, fill=user_type)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Time of Day", y="Number of Rides", title = "Total Rides by user type and time of day") +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insights:
\(~\)
Another visual for total rides by user type and by time of day
bike_rides_v2 %>%
group_by(user_type, time_of_day) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, color=user_type)) +
geom_point() + geom_line(aes(group = user_type)) +
labs(x="Time of Day", y="Number of Rides", title = "Total Rides by user type and time of day") + ylim(0, NA) +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
\(~\)
\(~\)
Total rides by user type and day of the week
bike_rides_v2 %>%
group_by(user_type) %>%
summarize("Monday" = sum(day_of_week == "Monday"),
"Tuesday" = sum(day_of_week == "Tuesday"),
"Wednesday" = sum(day_of_week == "Wednesday"),
"Thursday" = sum(day_of_week == "Thursday"),
"Friday" = sum(day_of_week == "Friday"),
"Saturday" = sum(day_of_week == "Saturday"),
"Sunday" = sum(day_of_week == "Sunday"))
## # A tibble: 2 × 8
## user_type Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 278126 257888 268822 299600 330659 473447 387340
## 2 member 478883 512242 517929 520772 465845 443287 386307
\(~\)
Visualizing total rides by user type and day of the week
bike_rides_v2 %>%
group_by(user_type, day_of_week) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, fill=user_type)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insight:
\(~\)
Another visual of total rides by user type and day of the week
bike_rides_v2 %>%
group_by(user_type, day_of_week) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, color=user_type)) +
geom_point() + geom_line(aes(group = user_type)) +
labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week") +
ylim(0, NA)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
\(~\)
\(~\)
Visualizing total rides by user type and season
bike_rides_v2 %>%
group_by(user_type, season) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(season, level= c("Spring", "Summer", "Fall", "Winter")), y=count, fill=user_type)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Season of Year", y="Number of Rides", title = "Total Rides by user type and season of the year")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Key insights:
\(~\)
\(~\)
Top five starting stations for casual riders
bike_rides_v2 %>%
filter(!(is.na(start_station_name))) %>%
filter(user_type == "casual") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5)
## Selecting by count
## # A tibble: 5 × 2
## start_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 61168
## 2 DuSable Lake Shore Dr & Monroe St 32478
## 3 Millennium Park 26660
## 4 Michigan Ave & Oak St 26633
## 5 DuSable Lake Shore Dr & North Blvd 26398
\(~\)
Visualizing top five starting stations for casual riders. The fct_reorder() function sorts the data in the ascending order of value_variable
The coordinates are flipped so that horizontal becomes vertical, and vertical, horizontal
bike_rides_v2 %>%
filter(!(is.na(start_station_name))) %>%
filter(user_type == "casual") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5) %>%
mutate(start_station_name= fct_reorder(start_station_name, count)) %>%
ggplot(aes(x=start_station_name, y=count, fill=count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for casual riders")
## Selecting by count
\(~\)
Top five ending stations for casual riders
bike_rides_v2 %>%
filter(!(is.na(end_station_name))) %>%
filter(user_type == "casual") %>%
group_by(end_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5)
## Selecting by count
## # A tibble: 5 × 2
## end_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 63730
## 2 DuSable Lake Shore Dr & Monroe St 30864
## 3 Millennium Park 28075
## 4 Michigan Ave & Oak St 27578
## 5 DuSable Lake Shore Dr & North Blvd 27091
Key insight:
\(~\)
Top five starting stations for member riders
bike_rides_v2 %>%
filter(!(is.na(start_station_name))) %>%
filter(user_type == "member") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5)
## Selecting by count
## # A tibble: 5 × 2
## start_station_name count
## <chr> <int>
## 1 Ellis Ave & 60th St 35543
## 2 Ellis Ave & 55th St 28910
## 3 Kingsbury St & Kinzie St 26156
## 4 University Ave & 57th St 24859
## 5 Wells St & Concord Ln 24136
\(~\)
Visualizing top five starting stations for member riders
bike_rides_v2 %>%
filter(!(is.na(start_station_name))) %>%
filter(user_type == "member") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5) %>%
mutate(start_station_name= fct_reorder(start_station_name, count)) %>%
ggplot(aes(x=start_station_name, y=count, fill=count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for member riders")
## Selecting by count
\(~\)
Top five ending stations for member riders
bike_rides_v2 %>%
filter(!(is.na(end_station_name))) %>%
filter(user_type == "member") %>%
group_by(end_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(5)
## Selecting by count
## # A tibble: 5 × 2
## end_station_name count
## <chr> <int>
## 1 Ellis Ave & 60th St 36767
## 2 University Ave & 57th St 36523
## 3 Kingsbury St & Kinzie St 26479
## 4 Wells St & Concord Ln 25035
## 5 Clark St & Elm St 24107
Key insight:
\(~\)
\(~\)
I’ll create a csv file that can be exported
write_csv(bike_rides_v2, file = "bike_rides_v2.csv")
\(~\)
This step involves summarizing conclusions from the analysis and providing recommendations to answer the business questions.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
Due to the protection of user privacy, the public dataset used for this project has limitations. Ideally, the marketing team would use the full dataset to carry out the following additional analysis:
\(~\)
\(~\)
\(~\)
\(~\)
Marketing is an ongoing process. As a first phase, consider a plan focused on your top ride length categories: < 12 minutes and <= 20 min. Also, consider which areas around your bike stations best complement/support/facilitate this ride length.
Thank you for your time and interest to review my capstone project! This project helped me to walk through the data analysis process from start to finish using real-world data and business questions. Accomplishing this in R and RStudio is very satisfying as I am completely new to working with big data, R and RStudio. I’m truly excited and look forward to growing in the field of data analysis.