Title: Analysis and Visualization for Divvy Bike Share
Author: Sanji
Date: 26 April 2022
output:
pdf_document: default
word_document: default
html_document: default
In 2016, a fictional company called Cyclistic launched a bike-share program which has grown over the years across Chicago. The company has a fleet of over 5,000 bicycles that are geotracked and locked into a network of over 600 stations. Cyclistic uses a marketing strategy of flexible pricing plans to attract broad customer segments. These pricing plans include: single-ride passes, full-day passes, and annual memberships. Single-ride or full-day pass customers are considered casual riders while annual membership customers are considered Cyclistic members.
Finance analysts find that annual memberships are more profitable than casual riderships and the new business task is to convert casual riders into members. In order to do so, the marketing analysis team need a better understanding by asking the following three questions that will guide future marketing:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
The aim of this analysis is to focus on the first question: How do annual members and casual riders use Cyclistic bikes differently? Insights from this analysis will inform marketing strategies aimed at converting casual riders into annual members. This action will maximize the number of annual members and secure future growth of Cyclistic bike share program. The primary stakeholders are: Cyclistic executive team and its Director of Marketing. Secondary stakeholders are Cyclistic marketing analytics team. The key deliverable is an analysis and visualization of the differences between annual members and casual riders use of the bike share program and inform marketing strategies for bringing in more annual members to Cyclistic bike share program.
The real world data for this analysis is collected and owned by the city of Chicago, and is made available to the public by a partnership agreement between the city of Chicago and Motivate International Inc., which also operates the Cyclistic bike share program in the city. The historical data sets can be found in a list available at the following html link. The licensing, privacy, security, and accessibility of the data are also managed by Motivate #International Inc. This is a link to the license agreement: link.
The next sections of this report will describe the five other steps in the data analysis process of: prepare, process, analyze, share, and act, used in the analysis.
In preparation for this analysis, twelve data zipped CSV files were downloaded onto a PC where files could be extracted in .csv format. Individually each file was uploaded to the Rstudio global environment for data analysis after installing and loading r packages.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rmarkdown)
library(knitr)
Set Work Directory.
knitr::opts_knit$set(root.dir = "F:/Capstone Project Case Study 1 (Jan.11)/data")
getwd()
## [1] "F:/Capstone Project Case Study 1 (Jan.11)/data"
Use of “here” package creates paths to project root directory and data sets, refer to link.
X202101_tripdata <- read_csv("202101-tripdata.csv")
## Rows: 96834 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202102_tripdata <- read_csv("202102-tripdata.csv")
## Rows: 49622 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202103_tripdata <- read_csv("202103-tripdata.csv")
## Rows: 228496 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202104_tripdata <- read_csv("202104-tripdata.csv")
## Rows: 337230 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202105_tripdata <- read_csv("202105-tripdata.csv")
## Rows: 531633 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202106_tripdata <- read_csv("202106-tripdata.csv")
## Rows: 729595 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202107_tripdata <- read_csv("202107-tripdata.csv")
## Rows: 822410 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202108_tripdata <- read_csv("202108-tripdata.csv")
## Rows: 804352 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202109_tripdata <- read_csv("202109-tripdata.csv")
## Rows: 756147 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202110_tripdata <- read_csv("202110-tripdata.csv")
## Rows: 631226 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202111_tripdata <- read_csv("202111-tripdata.csv")
## Rows: 359978 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202112_tripdata <- read_csv("202112-tripdata.csv")
## Rows: 247540 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Suggested script for using R, click here link. Data frames inspected for incongruencies
str(X202101_tripdata)
## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:96834] "2021-01-23 16:14" "2021-01-27 18:43" "2021-01-21 22:35" "2021-01-07 13:31" ...
## $ ended_at : chr [1:96834] "2021-01-23 16:24" "2021-01-27 18:47" "2021-01-21 22:37" "2021-01-07 13:42" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202102_tripdata)
## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr [1:49622] "2021-02-12 16:14" "2021-02-14 17:52" "2021-02-09 19:10" "2021-02-02 17:49" ...
## $ ended_at : chr [1:49622] "2021-02-12 16:21" "2021-02-14 18:12" "2021-02-09 19:19" "2021-02-02 17:54" ...
## $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:49622] "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ start_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ end_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:49622] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202103_tripdata)
## spec_tbl_df [228,496 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
## $ rideable_type : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:228496] "2021-03-16 8:32" "2021-03-28 1:26" "2021-03-11 21:17" "2021-03-11 13:26" ...
## $ ended_at : chr [1:228496] "2021-03-16 8:36" "2021-03-28 1:36" "2021-03-11 21:33" "2021-03-11 13:55" ...
## $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
## $ start_station_id : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
## $ end_station_name : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
## $ start_lat : num [1:228496] 41.9 41.9 41.8 42 42 ...
## $ start_lng : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
## $ end_lng : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:228496] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202104_tripdata)
## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
## $ rideable_type : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
## $ started_at : chr [1:337230] "2021-04-12 18:25" "2021-04-27 17:27" "2021-04-03 12:42" "2021-04-17 9:17" ...
## $ ended_at : chr [1:337230] "2021-04-12 18:56" "2021-04-27 18:31" "2021-04-07 11:40" "2021-04-17 9:42" ...
## $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
## $ start_station_id : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
## $ end_station_name : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
## $ end_station_id : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
## $ start_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ start_lng : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ end_lng : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:337230] "member" "casual" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202105_tripdata)
## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
## $ rideable_type : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:531633] "2021-05-30 11:58" "2021-05-30 11:29" "2021-05-30 14:24" "2021-05-30 14:25" ...
## $ ended_at : chr [1:531633] "2021-05-30 12:10" "2021-05-30 12:14" "2021-05-30 14:25" "2021-05-30 14:41" ...
## $ start_station_name: chr [1:531633] NA NA NA NA ...
## $ start_station_id : chr [1:531633] NA NA NA NA ...
## $ end_station_name : chr [1:531633] NA NA NA NA ...
## $ end_station_id : chr [1:531633] NA NA NA NA ...
## $ start_lat : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
## $ end_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:531633] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202106_tripdata)
## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
## $ rideable_type : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:729595] "2021-06-13 14:31" "2021-06-04 11:18" "2021-06-04 9:49" "2021-06-03 19:56" ...
## $ ended_at : chr [1:729595] "2021-06-13 14:34" "2021-06-04 11:24" "2021-06-04 9:55" "2021-06-03 20:21" ...
## $ start_station_name: chr [1:729595] NA NA NA NA ...
## $ start_station_id : chr [1:729595] NA NA NA NA ...
## $ end_station_name : chr [1:729595] NA NA NA NA ...
## $ end_station_id : chr [1:729595] NA NA NA NA ...
## $ start_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:729595] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202107_tripdata)
## spec_tbl_df [822,410 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:822410] "2021-07-02 14:44" "2021-07-07 16:57" "2021-07-25 11:30" "2021-07-08 22:08" ...
## $ ended_at : chr [1:822410] "2021-07-02 15:19" "2021-07-07 17:16" "2021-07-25 11:48" "2021-07-08 22:23" ...
## $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:822410] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202108_tripdata)
## spec_tbl_df [804,352 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:804352] "2021-08-10 17:15" "2021-08-10 17:23" "2021-08-21 2:34" "2021-08-21 6:52" ...
## $ ended_at : chr [1:804352] "2021-08-10 17:22" "2021-08-10 17:39" "2021-08-21 2:50" "2021-08-21 7:08" ...
## $ start_station_name: chr [1:804352] NA NA NA NA ...
## $ start_station_id : chr [1:804352] NA NA NA NA ...
## $ end_station_name : chr [1:804352] NA NA NA NA ...
## $ end_station_id : chr [1:804352] NA NA NA NA ...
## $ start_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:804352] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202109_tripdata)
## spec_tbl_df [756,147 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:756147] "2021-09-28 16:07" "2021-09-28 14:24" "2021-09-28 0:20" "2021-09-28 14:51" ...
## $ ended_at : chr [1:756147] "2021-09-28 16:09" "2021-09-28 14:40" "2021-09-28 0:23" "2021-09-28 15:00" ...
## $ start_station_name: chr [1:756147] NA NA NA NA ...
## $ start_station_id : chr [1:756147] NA NA NA NA ...
## $ end_station_name : chr [1:756147] NA NA NA NA ...
## $ end_station_id : chr [1:756147] NA NA NA NA ...
## $ start_lat : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:756147] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202110_tripdata)
## spec_tbl_df [631,226 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:631226] "2021-10-22 12:46" "2021-10-21 9:12" "2021-10-16 16:28" "2021-10-16 16:17" ...
## $ ended_at : chr [1:631226] "2021-10-22 12:49" "2021-10-21 9:14" "2021-10-16 16:36" "2021-10-16 16:19" ...
## $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:631226] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:631226] NA NA NA NA ...
## $ end_station_id : chr [1:631226] NA NA NA NA ...
## $ start_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:631226] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202111_tripdata)
## spec_tbl_df [359,978 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:359978] "2021-11-27 13:27" "2021-11-27 13:38" "2021-11-26 22:03" "2021-11-27 9:56" ...
## $ ended_at : chr [1:359978] "2021-11-27 13:46" "2021-11-27 13:56" "2021-11-26 22:05" "2021-11-27 10:01" ...
## $ start_station_name: chr [1:359978] NA NA NA NA ...
## $ start_station_id : chr [1:359978] NA NA NA NA ...
## $ end_station_name : chr [1:359978] NA NA NA NA ...
## $ end_station_id : chr [1:359978] NA NA NA NA ...
## $ start_lat : num [1:359978] 41.9 42 42 41.9 41.9 ...
## $ start_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num [1:359978] 42 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr [1:359978] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202112_tripdata)
## spec_tbl_df [247,540 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:247540] "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr [1:247540] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr [1:247540] "2021-12-07 15:06" "2021-12-11 3:43" "2021-12-15 23:10" "2021-12-26 16:16" ...
## $ ended_at : chr [1:247540] "2021-12-07 15:13" "2021-12-11 4:10" "2021-12-15 23:23" "2021-12-26 16:30" ...
## $ start_station_name: chr [1:247540] "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr [1:247540] "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr [1:247540] "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr [1:247540] "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num [1:247540] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:247540] -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:247540] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:247540] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:247540] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Convert to Posixct date-time class, for start time. Then create new
column with calculation of difference between end time and start time.
```r
X202101_tripdata$ended_at <- as.POSIXct(X202101_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202102_tripdata$ended_at <- as.POSIXct(X202102_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202103_tripdata$ended_at <- as.POSIXct(X202103_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202104_tripdata$ended_at <- as.POSIXct(X202104_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202105_tripdata$ended_at <- as.POSIXct(X202105_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202106_tripdata$ended_at <- as.POSIXct(X202106_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202107_tripdata$ended_at <- as.POSIXct(X202107_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202108_tripdata$ended_at <- as.POSIXct(X202108_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202109_tripdata$ended_at <- as.POSIXct(X202109_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202110_tripdata$ended_at <- as.POSIXct(X202110_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202111_tripdata$ended_at <- as.POSIXct(X202111_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
X202112_tripdata$ended_at <- as.POSIXct(X202112_tripdata$ended_at, format = "%Y-%m-%d %H:%M", scientific = FALSE)
```
Stack individual month data frames into one big data frame
all_trips<-bind_rows(X202101_tripdata,X202102_tripdata,X202103_tripdata,X202104_tripdata,X202105_tripdata,X202106_tripdata,X202107_tripdata,X202108_tripdata,X202109_tripdata,X202110_tripdata,X202111_tripdata,X202112_tripdata)
all_trips<-all_trips%>%
select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id,start_lat, start_lng, end_lat, end_lng))
Suggested script, click here to link. “Cleaning Data” tutorial by Greg Martin, click here link . Reference manipulating data tutorial by Greg Martin, click here link
colnames(all_trips)
## [1] "rideable_type" "started_at" "ended_at" "member_casual"
nrow(all_trips)
## [1] 5595063
dim(all_trips)
## [1] 5595063 4
head(all_trips)
## # A tibble: 6 x 4
## rideable_type started_at ended_at member_casual
## <chr> <chr> <dttm> <chr>
## 1 electric_bike 2021-01-23 16:14 2021-01-23 16:24:00 member
## 2 electric_bike 2021-01-27 18:43 2021-01-27 18:47:00 member
## 3 electric_bike 2021-01-21 22:35 2021-01-21 22:37:00 member
## 4 electric_bike 2021-01-07 13:31 2021-01-07 13:42:00 member
## 5 electric_bike 2021-01-23 2:24 2021-01-23 02:24:00 casual
## 6 electric_bike 2021-01-09 14:24 2021-01-09 15:17:00 casual
tail(all_trips)
## # A tibble: 6 x 4
## rideable_type started_at ended_at member_casual
## <chr> <chr> <dttm> <chr>
## 1 electric_bike 2021-12-24 15:42 2021-12-24 19:29:00 casual
## 2 electric_bike 2021-12-12 13:36 2021-12-12 13:56:00 casual
## 3 electric_bike 2021-12-06 19:37 2021-12-06 19:44:00 member
## 4 electric_bike 2021-12-02 8:57 2021-12-02 09:05:00 member
## 5 electric_bike 2021-12-13 9:00 2021-12-13 09:14:00 member
## 6 classic_bike 2021-12-13 8:45 2021-12-13 08:49:00 member
str(all_trips)
## tibble [5,595,063 x 4] (S3: tbl_df/tbl/data.frame)
## $ rideable_type: chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:5595063] "2021-01-23 16:14" "2021-01-27 18:43" "2021-01-21 22:35" "2021-01-07 13:31" ...
## $ ended_at : POSIXct[1:5595063], format: "2021-01-23 16:24:00" "2021-01-27 18:47:00" ...
## $ member_casual: chr [1:5595063] "member" "member" "member" "member" ...
summary(all_trips)
## rideable_type started_at ended_at
## Length:5595063 Length:5595063 Min. :2021-01-01 00:08:00
## Class :character Class :character 1st Qu.:2021-06-07 00:44:00
## Mode :character Mode :character Median :2021-08-01 02:21:00
## Mean :2021-07-29 08:10:31
## 3rd Qu.:2021-09-24 16:54:00
## Max. :2022-01-03 17:32:00
## member_casual
## Length:5595063
## Class :character
## Mode :character
##
##
##
Remove variable “docked bike” from column “rideable_type”, as these bikes were docked and presumed not in use by definition.
table(all_trips$rideable_type)
##
## classic_bike docked_bike electric_bike
## 3251028 312343 2031692
all_trips <- subset(all_trips,rideable_type != "docked_bike")
Check data frame no longer contains docked_bike
table(all_trips$rideable_type)
##
## classic_bike electric_bike
## 3251028 2031692
Add columns that list date, month, day of week
all_trips$date<-as.Date(all_trips$started_at)
all_trips$month<-format(as.Date(all_trips$date), "%m")
all_trips$day<-format(as.Date(all_trips$date), "%d")
all_trips$day_of_week<-format(as.Date(all_trips$date), "%A")
Give order to day of week
all_trips$day_of_week <- ordered(all_trips$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
Create new column for ride_length calculation in minutes: Refer to online resource, click herelink
all_trips <-all_trips %>% mutate(all_trips, ride_length = difftime(all_trips$ended_at, all_trips$started_at, tz="America/Chicago", units = "mins"))
Ensure ride_length is numeric
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
Remove ride_length <=0, as these do not reflect bike rides in use
all_trips_v2<-all_trips[!(all_trips$ride_length<=0),]
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 127.0 131.0 138.4 140.0 1621.0
Compare member and casual riders
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 145.0265
## 2 member 133.6329
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 135
## 2 member 130
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1621
## 2 member 1620
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 2
## 2 member 6
Create tables: Refer to Describe and Summarize data by Greg Martin, click here link .
First table = number of riders by month:
all_trips_v2 %>%
group_by(member_casual, month) %>%
summarise(number = n(), average_ride = mean(ride_length)) %>% #count of riders and average ride for each month
pivot_wider(names_from = member_casual,
values_from = number) #Create table with columns: month, average ride, member counts, casual counts
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## # A tibble: 24 x 4
## month average_ride casual member
## <chr> <dbl> <int> <int>
## 1 01 140. 16012 NA
## 2 02 153. 8860 NA
## 3 03 148. 68376 NA
## 4 04 148. 111887 NA
## 5 05 148. 213563 NA
## 6 06 147. 318965 NA
## 7 07 146. 384358 NA
## 8 08 145. 367606 NA
## 9 09 144. 328553 NA
## 10 10 142. 234358 NA
## # ... with 14 more rows
Second table = number of riders by day of week:
all_trips_v2 %>%
group_by(member_casual, day_of_week) %>%
summarise(number = n(), average_ride = mean(ride_length)) %>% #count of riders for each day of week
pivot_wider(names_from = member_casual,
values_from = number) #Create table with columns: day of week, member counts, casual counts
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## # A tibble: 14 x 4
## day_of_week average_ride casual member
## <ord> <dbl> <int> <int>
## 1 Sunday 149. 410707 NA
## 2 Monday 145. 249931 NA
## 3 Tuesday 143. 244764 NA
## 4 Wednesday 142. 251321 NA
## 5 Thursday 142. 258390 NA
## 6 Friday 143. 321985 NA
## 7 Saturday 147. 479565 NA
## 8 Sunday 136. NA 376142
## 9 Monday 133. NA 416212
## 10 Tuesday 133. NA 465513
## 11 Wednesday 133. NA 477191
## 12 Thursday 133. NA 451524
## 13 Friday 133. NA 446428
## 14 Saturday 135. NA 433047
Data visualization for monthly use shows casual riders used Cyclistic bike share program more than members, overall, within the 2021 year. Greater use by casual riders occurred over seven months between February to October or late Winter, through Spring/Summer and into early Fall. Data visualization for daily use shows that casual riders overall used the bike share program more often during a three day period between Saturday, Sunday and Monday compared to member riders. Data suggests to focus marketing strategies on casual riders during the months of May to October and expecially on Saturdays, Sundays and Mondays.