Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process.
Cyclistic: A Chicago bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Analyze historical trip data to identify how Cyclistic members and casual riders behave differently, and use those insights to shape a marketing strategy aimed at converting more casual riders into members.
Cyclistic’s historical trip data was used to analyze and identify trends. The previous 12 months of Cyclistic trip data were downloaded from here. (Note: The datasets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license.) This is public data that can be used to explore how different customer types are using Cyclistic bikes. Data-privacy rules prohibit the use of riders’ personally identifiable information, so it was not possible to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or have purchased multiple single passes.
The analysis began with installing the packages required to conduct it in R. Installed and loaded tidyverse for data import and wrangling, lubridate for date functions, ggplot2 for visualization, and rmarkdown for creating R notebooks. Also installed and loaded packages for cleaning, including janitor, skimr, and here.
install.packages("tidyverse",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
install.packages("rmarkdown",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'rmarkdown' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(rmarkdown)
install.packages("lubridate",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\yeiro\AppData\Local\R\win-
## library\4.2\00LOCK\lubridate\libs\x64\lubridate.dll to C:
## \Users\yeiro\AppData\Local\R\win-library\4.2\lubridate\libs\x64\lubridate.dll:
## Permission denied
## Warning: restored 'lubridate'
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
install.packages("janitor",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
install.packages("dplyr",repos = "http://cran.us.r-project.org")
## Warning: package 'dplyr' is in use and will not be installed
library(dplyr)
install.packages("skimr",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(skimr)
install.packages("here",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/yeiro/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'here' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yeiro\AppData\Local\Temp\RtmpEZ8a0q\downloaded_packages
library(here)
## here() starts at C:/Users/yeiro/OneDrive/Desktop/Case Study 1 Trip Data/Case Study 1 datasets from past year
getwd()
## [1] "C:/Users/yeiro/OneDrive/Desktop/Case Study 1 Trip Data/Case Study 1 datasets from past year"
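The log above shows two side effects of reinstalling packages on every run: lubridate could not be replaced because its DLL was locked, and dplyr was skipped because it was already in use. A minimal sketch of one way to avoid this is to install only what is missing and then attach everything. The helper names `missing_packages()` and `ensure_packages()` are hypothetical and not part of the original analysis, and the CRAN mirror URL is just one common choice.

```r
# Return the subset of `pkgs` that cannot be loaded in this R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only the missing packages, then attach all of them.
# (Hypothetical helper; the original notebook calls install.packages()
# and library() once per package instead.)
ensure_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  # library() needs character.only = TRUE when given names as strings.
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage, with the packages named in the text above:
# ensure_packages(c("tidyverse", "lubridate", "janitor", "skimr", "here", "rmarkdown"))
```

Because ggplot2 and dplyr are attached by `library(tidyverse)`, they do not need to be listed separately in such a call.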
Twelve monthly CSV files covering August 2021 through July 2022 were downloaded and then combined into a larger data frame for analysis. Column names and structure were checked first to make sure they are consistent across files.
aug_2021 <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2021 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2021 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(aug_2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(aug_2021)
## spec_tbl_df [804,352 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
## $ ended_at : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
## $ start_station_name: chr [1:804352] NA NA NA NA ...
## $ start_station_id : chr [1:804352] NA NA NA NA ...
## $ end_station_name : chr [1:804352] NA NA NA NA ...
## $ end_station_id : chr [1:804352] NA NA NA NA ...
## $ start_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:804352] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(sep_2021)
## spec_tbl_df [756,147 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
## $ ended_at : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
## $ start_station_name: chr [1:756147] NA NA NA NA ...
## $ start_station_id : chr [1:756147] NA NA NA NA ...
## $ end_station_name : chr [1:756147] NA NA NA NA ...
## $ end_station_id : chr [1:756147] NA NA NA NA ...
## $ start_lat : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:756147] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(oct_2021)
## spec_tbl_df [631,226 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
## $ ended_at : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
## $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:631226] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:631226] NA NA NA NA ...
## $ end_station_id : chr [1:631226] NA NA NA NA ...
## $ start_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:631226] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
q1 <- bind_rows(aug_2021, sep_2021, oct_2021)
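Comparing the `colnames()` and `str()` printouts by eye works, but the same check can be done programmatically. The sketch below uses base R only and toy data frames with a reduced, illustrative set of columns; `same_schema()` is a hypothetical helper, not something used in the original analysis.

```r
# Toy stand-ins for two monthly tables (illustrative values only).
month_a <- data.frame(
  ride_id = "A1",
  started_at = as.POSIXct("2021-08-10 17:15:49", tz = "UTC"),
  member_casual = "member"
)
month_b <- data.frame(
  ride_id = "B1",
  started_at = as.POSIXct("2021-09-28 16:07:10", tz = "UTC"),
  member_casual = "casual"
)

# First class of every column, e.g. "character" or "POSIXct".
column_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# TRUE when every data frame has the same column names, in the same
# order, with the same column types as the first one.
same_schema <- function(dfs) {
  ref_names <- names(dfs[[1]])
  ref_types <- column_types(dfs[[1]])
  all(vapply(dfs, function(df) {
    identical(names(df), ref_names) && identical(column_types(df), ref_types)
  }, logical(1)))
}

# A table with a renamed column should fail the check.
renamed <- month_b
names(renamed)[3] <- "usertype"

same_schema(list(month_a, month_b))  # consistent schemas
same_schema(list(month_a, renamed))  # mismatched column name
```

In the notebook itself the call would be along the lines of `same_schema(list(aug_2021, sep_2021, oct_2021))` before `bind_rows()`. The janitor package, already loaded here, provides `compare_df_cols()` for a more detailed column-by-column report.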
nov_2021 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2021 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan_2022 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(nov_2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_2021)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(nov_2021)
## spec_tbl_df [359,978 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:359978], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
## $ ended_at : POSIXct[1:359978], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
## $ start_station_name: chr [1:359978] NA NA NA NA ...
## $ start_station_id : chr [1:359978] NA NA NA NA ...
## $ end_station_name : chr [1:359978] NA NA NA NA ...
## $ end_station_id : chr [1:359978] NA NA NA NA ...
## $ start_lat : num [1:359978] 41.9 42 42 41.9 41.9 ...
## $ start_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num [1:359978] 42 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr [1:359978] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(dec_2021)
## spec_tbl_df [247,540 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:247540] "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr [1:247540] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:247540], format: "2021-12-07 15:06:07" "2021-12-11 03:43:29" ...
## $ ended_at : POSIXct[1:247540], format: "2021-12-07 15:13:42" "2021-12-11 04:10:23" ...
## $ start_station_name: chr [1:247540] "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr [1:247540] "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr [1:247540] "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr [1:247540] "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num [1:247540] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:247540] -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:247540] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:247540] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:247540] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(jan_2022)
## spec_tbl_df [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:103770] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:103770] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:103770] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
q2 <- bind_rows(nov_2021, dec_2021, jan_2022)
feb_2022 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(feb_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(feb_2022)
## spec_tbl_df [115,609 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:115609] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:115609] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:115609], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:115609], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:115609] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:115609] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:115609] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:115609] "13179" "TA1307000113" "13011" "13323" ...
## $ start_lat : num [1:115609] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:115609] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:115609] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:115609] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:115609] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(mar_2022)
## spec_tbl_df [284,042 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:284042] "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr [1:284042] "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:284042], format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" ...
## $ ended_at : POSIXct[1:284042], format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" ...
## $ start_station_name: chr [1:284042] "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr [1:284042] "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr [1:284042] "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr [1:284042] "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr [1:284042] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(apr_2022)
## spec_tbl_df [371,249 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:371249] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
## $ rideable_type : chr [1:371249] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:371249], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
## $ ended_at : POSIXct[1:371249], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
## $ start_station_name: chr [1:371249] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
## $ start_station_id : chr [1:371249] "515" "13075" "TA1307000121" "13075" ...
## $ end_station_name : chr [1:371249] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
## $ end_station_id : chr [1:371249] "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
## $ start_lat : num [1:371249] 42 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:371249] 42.1 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:371249] "member" "member" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
q3 <- bind_rows(feb_2022, mar_2022, apr_2022)
may_2022 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(may_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(may_2022)
## spec_tbl_df [634,858 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:634858] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
## $ rideable_type : chr [1:634858] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:634858], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" ...
## $ ended_at : POSIXct[1:634858], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" ...
## $ start_station_name: chr [1:634858] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
## $ start_station_id : chr [1:634858] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
## $ end_station_name : chr [1:634858] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
## $ end_station_id : chr [1:634858] "TA1309000025" "15534" "13221" "TA1305000030" ...
## $ start_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:634858] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:634858] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr [1:634858] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(jun_2022)
## spec_tbl_df [769,204 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:769204] "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ...
## $ rideable_type : chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:769204], format: "2022-06-30 17:27:53" "2022-06-30 18:39:52" ...
## $ ended_at : POSIXct[1:769204], format: "2022-06-30 17:35:15" "2022-06-30 18:47:28" ...
## $ start_station_name: chr [1:769204] NA NA NA NA ...
## $ start_station_id : chr [1:769204] NA NA NA NA ...
## $ end_station_name : chr [1:769204] NA NA NA NA ...
## $ end_station_id : chr [1:769204] NA NA NA NA ...
## $ start_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
## $ start_lng : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
## $ end_lng : num [1:769204] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:769204] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(jul_2022)
## spec_tbl_df [823,488 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:823488] "954144C2F67B1932" "292E027607D218B6" "57765852588AD6E0" "B5B6BE44314590E6" ...
## $ rideable_type : chr [1:823488] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:823488], format: "2022-07-05 08:12:47" "2022-07-26 12:53:38" ...
## $ ended_at : POSIXct[1:823488], format: "2022-07-05 08:24:32" "2022-07-26 12:55:31" ...
## $ start_station_name: chr [1:823488] "Ashland Ave & Blackhawk St" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" ...
## $ start_station_id : chr [1:823488] "13224" "15541" "15541" "15541" ...
## $ end_station_name : chr [1:823488] "Kingsbury St & Kinzie St" "Michigan Ave & 8th St" "Michigan Ave & 8th St" "Woodlawn Ave & 55th St" ...
## $ end_station_id : chr [1:823488] "KA1503000043" "623" "623" "TA1307000164" ...
## $ start_lat : num [1:823488] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:823488] -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:823488] 41.9 41.9 41.9 41.8 41.9 ...
## $ end_lng : num [1:823488] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:823488] "member" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
q4 <- bind_rows(may_2022, jun_2022, jul_2022)
all_trips <- bind_rows(q1, q2, q3, q4)
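The month-by-month reads and the column-name checks above could also be written as a loop. The following is a minimal sketch rather than the method used in this analysis: it uses base R `read.csv()` and `rbind()` in place of `read_csv()` and `bind_rows()`, and the folder, file names, and values are hypothetical stand-ins created only for the example.

```r
# Hypothetical stand-ins for the monthly CSV files: two tiny files in a temp folder.
tmp_dir <- file.path(tempdir(), "divvy_sketch")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(tmp_dir, "202205-sketch-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(tmp_dir, "202206-sketch-tripdata.csv"), row.names = FALSE)

# List the files, read each one, and confirm the column names match before stacking.
files <- sort(list.files(tmp_dir, pattern = "tripdata\\.csv$", full.names = TRUE))
monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
same_columns <- all(vapply(monthly,
                           function(d) identical(names(d), names(monthly[[1]])),
                           logical(1)))
stopifnot(same_columns)

# Stack the monthly tables into one data frame.
combined <- do.call(rbind, monthly)
nrow(combined)
```

Checking the column names inside the loop plays the same role as the repeated `colnames()` calls: the tables are only stacked once every file is known to share the same layout.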
Next, the formatting is corrected and columns are added for the date, month, day, year, and day of the week of each ride so that rides can be grouped by those values. The ride_length column is calculated from the start and end timestamps. Finally, bad data is removed: entries for bikes taken out for quality inspections and entries with a negative ride length.
str(all_trips)
## spec_tbl_df [5,901,463 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5901463] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:5901463] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5901463], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
## $ ended_at : POSIXct[1:5901463], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
## $ start_station_name: chr [1:5901463] NA NA NA NA ...
## $ start_station_id : chr [1:5901463] NA NA NA NA ...
## $ end_station_name : chr [1:5901463] NA NA NA NA ...
## $ end_station_id : chr [1:5901463] NA NA NA NA ...
## $ start_lat : num [1:5901463] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:5901463] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5901463] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:5901463] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5901463] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(all_trips)
## [1] 5901463
dim(all_trips)
## [1] 5901463 13
head(all_trips)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at start_station_n…
## <chr> <chr> <dttm> <dttm> <chr>
## 1 99103B… electric_bike 2021-08-10 17:15:49 2021-08-10 17:22:44 <NA>
## 2 EAFCCC… electric_bike 2021-08-10 17:23:14 2021-08-10 17:39:24 <NA>
## 3 9EF4F4… electric_bike 2021-08-21 02:34:23 2021-08-21 02:50:36 <NA>
## 4 5834D3… electric_bike 2021-08-21 06:52:55 2021-08-21 07:08:13 <NA>
## 5 CD825C… electric_bike 2021-08-19 11:55:29 2021-08-19 12:04:11 <NA>
## 6 612F12… electric_bike 2021-08-19 12:41:12 2021-08-19 12:47:47 <NA>
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>
summary(all_trips)
## ride_id rideable_type started_at
## Length:5901463 Length:5901463 Min. :2021-08-01 00:00:04.00
## Class :character Class :character 1st Qu.:2021-09-27 12:35:12.50
## Mode :character Mode :character Median :2022-02-14 14:10:08.00
## Mean :2022-01-31 21:50:42.24
## 3rd Qu.:2022-06-05 15:29:40.50
## Max. :2022-07-31 23:59:58.00
##
## ended_at start_station_name start_station_id
## Min. :2021-08-01 00:03:11.00 Length:5901463 Length:5901463
## 1st Qu.:2021-09-27 12:54:02.50 Class :character Class :character
## Median :2022-02-14 14:20:23.00 Mode :character Mode :character
## Mean :2022-01-31 22:10:35.61
## 3rd Qu.:2022-06-05 15:54:48.00
## Max. :2022-08-04 13:53:01.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5901463 Length:5901463 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5901463
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. :-87.50
## NA's :5590 NA's :5590
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at)
str(all_trips)
## spec_tbl_df [5,901,463 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5901463] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:5901463] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5901463], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
## $ ended_at : POSIXct[1:5901463], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
## $ start_station_name: chr [1:5901463] NA NA NA NA ...
## $ start_station_id : chr [1:5901463] NA NA NA NA ...
## $ end_station_name : chr [1:5901463] NA NA NA NA ...
## $ end_station_id : chr [1:5901463] NA NA NA NA ...
## $ start_lat : num [1:5901463] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:5901463] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5901463] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:5901463] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5901463] "member" "member" "member" "member" ...
## $ date : Date[1:5901463], format: "2021-08-10" "2021-08-10" ...
## $ month : chr [1:5901463] "08" "08" "08" "08" ...
## $ day : chr [1:5901463] "10" "10" "21" "21" ...
## $ year : chr [1:5901463] "2021" "2021" "2021" "2021" ...
## $ day_of_week : chr [1:5901463] "Tuesday" "Tuesday" "Saturday" "Saturday" ...
## $ ride_length : 'difftime' num [1:5901463] 415 970 973 918 ...
## ..- attr(*, "units")= chr "secs"
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
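By default, `difftime()` chooses its units from the data, so the ride_length values above came out in seconds only because of the values involved. A minimal sketch of requesting seconds explicitly is shown here; the two timestamps pairs are illustrative and the variable names are assumptions, not part of the analysis.

```r
# Two illustrative rides: one lasting 415 seconds, one lasting exactly 3 hours.
started <- as.POSIXct(c("2021-08-10 17:15:49", "2021-08-10 09:00:00"), tz = "UTC")
ended   <- as.POSIXct(c("2021-08-10 17:22:44", "2021-08-10 12:00:00"), tz = "UTC")

# Request seconds explicitly instead of relying on difftime()'s automatic unit choice.
ride_length <- difftime(ended, started, units = "secs")

# Convert to a plain numeric vector of seconds.
ride_seconds <- as.numeric(ride_length, units = "secs")
ride_seconds
```

Fixing the unit in the call keeps later steps, such as dividing by 60 to get minutes, valid regardless of which rides are in the data.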
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
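One detail of the filter above is worth illustrating. In R, comparing `NA` with a string yields `NA`, and indexing rows by `NA` returns a row filled with `NA` values. This appears consistent with the 860,759 `NA` values reported in the summaries that follow, which matches the number of rides with no start station name. The sketch below uses made-up toy data and hypothetical object names to show the behavior and one NA-aware alternative; it is not the filter used to produce the results in this report.

```r
# Toy data standing in for all_trips (values are made up for illustration).
trips <- data.frame(
  ride_id            = c("r1", "r2", "r3", "r4"),
  start_station_name = c("Millennium Park", NA, "HQ QR", "Clark St & Elm St"),
  ride_length        = c(600, 420, 300, -15),
  stringsAsFactors   = FALSE
)

# Original pattern: the NA station name makes the condition NA for ride r2,
# so that ride comes back as a row of NA values.
v2_original <- trips[!(trips$start_station_name == "HQ QR" | trips$ride_length < 0), ]

# NA-aware variant: %in% never returns NA, so rides without a station name
# are kept unless their ride_length is negative.
is_bad <- trips$start_station_name %in% "HQ QR" | trips$ride_length < 0
v2_na_safe <- trips[!is_bad, ]

sum(is.na(v2_original$ride_id))  # rides turned into all-NA rows
v2_na_safe$ride_id               # rides kept by the NA-aware filter
```

With the NA-aware variant, rides that lack a station name would keep their ride length and rider type, so they could still contribute to the duration and ride-count summaries.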
A descriptive analysis of ride_length follows, first in seconds and then in minutes. Summary statistics (mean, median, maximum, and minimum) of ride duration are then computed separately for casual users and members.
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 381 671 1251 1210 2497750 860759
all_trips_v2$ride_minutes <- all_trips_v2$ride_length/60
summary(all_trips_v2$ride_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 6.3 11.2 20.9 20.2 41629.2 860759
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1878.3275
## 2 member 783.5433
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 894
## 2 member 550
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 2497750
## 2 member 93594
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 0
## 2 member 0
aggregate(all_trips_v2$ride_minutes ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_minutes
## 1 casual 31.30546
## 2 member 13.05906
aggregate(all_trips_v2$ride_minutes ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_minutes
## 1 casual 14.900000
## 2 member 9.166667
aggregate(all_trips_v2$ride_minutes ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_minutes
## 1 casual 41629.17
## 2 member 1559.90
aggregate(all_trips_v2$ride_minutes ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_minutes
## 1 casual 0
## 2 member 0
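The eight `aggregate()` calls above can be condensed. The sketch below, on made-up toy values with hypothetical object names, returns several statistics per rider type from a single call and then computes the casual-to-member ratio of the means.

```r
# Toy ride lengths in seconds (made-up values), standing in for all_trips_v2.
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(300, 900, 2400, 240, 540, 780)
)

# One aggregate() call returning several statistics per rider type.
stats <- aggregate(
  ride_length ~ member_casual,
  data = rides,
  FUN  = function(x) c(mean = mean(x), median = median(x), max = max(x), min = min(x))
)
stats

# Ratio of the mean ride length, casual relative to member.
means <- tapply(rides$ride_length, rides$member_casual, mean)
unname(means["casual"] / means["member"])
```

Applied to the means reported above (31.30546 and 13.05906 minutes), the same ratio works out to roughly 2.4.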
The days of the week are put in order, and the average ride time per day is computed for each user type, first in seconds and then in minutes.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 2169.5583
## 2 member Sunday 890.1932
## 3 casual Monday 1917.1631
## 4 member Monday 761.8988
## 5 casual Tuesday 1641.1829
## 6 member Tuesday 734.6194
## 7 casual Wednesday 1609.5215
## 8 member Wednesday 735.8187
## 9 casual Thursday 1688.4972
## 10 member Thursday 751.1084
## 11 casual Friday 1767.1743
## 12 member Friday 762.3827
## 13 casual Saturday 2030.8356
## 14 member Saturday 880.5102
aggregate(all_trips_v2$ride_minutes ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week
## 1 casual Sunday
## 2 member Sunday
## 3 casual Monday
## 4 member Monday
## 5 casual Tuesday
## 6 member Tuesday
## 7 casual Wednesday
## 8 member Wednesday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Friday
## 12 member Friday
## 13 casual Saturday
## 14 member Saturday
## all_trips_v2$ride_minutes
## 1 36.15930
## 2 14.83655
## 3 31.95272
## 4 12.69831
## 5 27.35305
## 6 12.24366
## 7 26.82536
## 8 12.26364
## 9 28.14162
## 10 12.51847
## 11 29.45291
## 12 12.70638
## 13 33.84726
## 14 14.67517
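The `ordered()` call is what makes the tables above run from Sunday to Saturday rather than alphabetically. A small sketch on a hypothetical vector of day names shows the difference:

```r
# Day names as plain character strings sort alphabetically.
days <- c("Saturday", "Monday", "Sunday", "Wednesday")
sort(days)

# As an ordered factor, they sort in the order given by `levels`.
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
days_ordered <- ordered(days, levels = week_levels)
as.character(sort(days_ordered))
```

Because grouping and plotting functions follow factor levels, setting the order once on the column carries through to every later table and chart that uses it.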
The tables below give the number of rides and the average ride duration for each user type and weekday, first in seconds and then in minutes.
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 15 × 4
## # Groups: member_casual [3]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 414615 2170.
## 2 casual Mon 254214 1917.
## 3 casual Tue 228934 1641.
## 4 casual Wed 236061 1610.
## 5 casual Thu 265674 1688.
## 6 casual Fri 293488 1767.
## 7 casual Sat 460166 2031.
## 8 member Sun 354674 890.
## 9 member Mon 405003 762.
## 10 member Tue 450765 735.
## 11 member Wed 449966 736.
## 12 member Thu 446736 751.
## 13 member Fri 395349 762.
## 14 member Sat 384910 881.
## 15 <NA> <NA> 860759 NA
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n() ,average_duration = mean(ride_minutes)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 15 × 4
## # Groups: member_casual [3]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 414615 36.2
## 2 casual Mon 254214 32.0
## 3 casual Tue 228934 27.4
## 4 casual Wed 236061 26.8
## 5 casual Thu 265674 28.1
## 6 casual Fri 293488 29.5
## 7 casual Sat 460166 33.8
## 8 member Sun 354674 14.8
## 9 member Mon 405003 12.7
## 10 member Tue 450765 12.2
## 11 member Wed 449966 12.3
## 12 member Thu 446736 12.5
## 13 member Fri 395349 12.7
## 14 member Sat 384910 14.7
## 15 <NA> <NA> 860759 NA
all_trips_ride_totals <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Amount of Rides per Day: Casual User vs Member")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_trips_ride_totals
The plot above shows the total number of rides on each day of the week for members vs casual users. Casual users take more rides per day on weekends than on weekdays. Members take more rides per day on weekdays than on weekends, but day of the week has less of an impact on the totals for members than it does for casual users.
all_trips_ride_totals_bike <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday, rideable_type) %>%
drop_na() %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
facet_wrap(~rideable_type) +
labs(title = "Number of Rides per Day: Casual User vs Member", subtitle = "Amount of Rides by Rideable Type")
## `summarise()` has grouped output by 'member_casual', 'weekday'. You can
## override using the `.groups` argument.
all_trips_ride_totals_bike
Here we can see that there are more classic bike rides each day than electric bike rides. This could be an interesting trend to look into if more information becomes available, such as the total number of bikes of each type. It could also help in choosing which membership benefits to advertise to casual riders.
The first plot below shows the average ride duration in seconds, and the second shows it in minutes.
all_trips_avg_trip_time <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average Ride Duration: Casual User vs. Member", subtitle = "Average Bike Ride in Seconds Each Day of the Week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_trips_avg_trip_time
## Warning: Removed 1 rows containing missing values (geom_col).
all_trips_avg_trip_time_minutes <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_minutes)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average Ride Duration: Casual User vs. Member", subtitle = "Average Bike Ride in Minutes Each Day of the Week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_trips_avg_trip_time_minutes
## Warning: Removed 1 rows containing missing values (geom_col).
The visualizations show that casual riders have considerably longer average rides than members on every day of the week. Both casual users and members have longer average ride durations on weekends than on weekdays. As with the number of rides each day, the change from weekdays to weekends is larger for casual users, while day of the week has little effect on the ride behavior of members.
For this plot, the measurement used for time was minutes.
all_trips_avg_trip_time_minutes_bike <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday, rideable_type) %>%
drop_na() %>%
summarise(number_of_rides = n(),average_duration = mean(ride_minutes)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
facet_wrap(~rideable_type) +
labs(title = "Average Ride Duration: Casual User vs. Member", subtitle = "Average Bike Ride in Minutes Each Day of the Week by Rideable Type")
## `summarise()` has grouped output by 'member_casual', 'weekday'. You can
## override using the `.groups` argument.
all_trips_avg_trip_time_minutes_bike
all_trips_v2 %>% count(start_station_name, sort = TRUE)
## # A tibble: 1,382 × 2
## start_station_name n
## <chr> <int>
## 1 <NA> 860759
## 2 Streeter Dr & Grand Ave 80413
## 3 DuSable Lake Shore Dr & North Blvd 45413
## 4 Michigan Ave & Oak St 43199
## 5 DuSable Lake Shore Dr & Monroe St 43113
## 6 Wells St & Concord Ln 42447
## 7 Millennium Park 38696
## 8 Clark St & Elm St 38441
## 9 Wells St & Elm St 36214
## 10 Theater on the Lake 36191
## # … with 1,372 more rows
all_trips_v2 %>% count(end_station_name, sort = TRUE)
## # A tibble: 1,335 × 2
## end_station_name n
## <chr> <int>
## 1 <NA> 1272195
## 2 Streeter Dr & Grand Ave 78413
## 3 DuSable Lake Shore Dr & North Blvd 47844
## 4 Michigan Ave & Oak St 41958
## 5 DuSable Lake Shore Dr & Monroe St 40247
## 6 Wells St & Concord Ln 39992
## 7 Millennium Park 37604
## 8 Clark St & Elm St 35830
## 9 Theater on the Lake 35181
## 10 Wells St & Elm St 33418
## # … with 1,325 more rows
The output above shows the start and end stations with the most rides. The most popular named stations are the same for both start and end stations, and their order is nearly identical: only Wells St & Elm St and Theater on the Lake swap places. The data is limited by the number of ride entries with NA instead of a station name.
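The size of that limitation can be estimated from counts already printed in this report. The sketch below assumes that the total number of rows in all_trips_v2 equals the sum of the number_of_rides column in the weekday table shown earlier; the object names are hypothetical.

```r
# Counts copied from the weekday table and the station counts printed above.
group_counts <- c(
  casual  = 414615 + 254214 + 228934 + 236061 + 265674 + 293488 + 460166,
  member  = 354674 + 405003 + 450765 + 449966 + 446736 + 395349 + 384910,
  unknown = 860759   # rows where member_casual is NA
)
total_rows <- sum(group_counts)

missing_start <- 860759    # rows with NA start_station_name
missing_end   <- 1272195   # rows with NA end_station_name

# Share of rows without a station name, in percent.
round(100 * missing_start / total_rows, 1)
round(100 * missing_end / total_rows, 1)
```

Under that assumption, roughly 15% of rows have no start station name and roughly 22% have no end station name, so the station rankings describe only the rides for which a station was recorded.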
count_station_type <- all_trips_v2 %>% count(member_casual, start_station_name, sort = TRUE) %>% drop_na()
count_station_type
## # A tibble: 2,590 × 3
## member_casual start_station_name n
## <chr> <chr> <int>
## 1 casual Streeter Dr & Grand Ave 62983
## 2 casual DuSable Lake Shore Dr & Monroe St 33183
## 3 casual Millennium Park 29219
## 4 casual Michigan Ave & Oak St 28210
## 5 casual DuSable Lake Shore Dr & North Blvd 27312
## 6 member Kingsbury St & Kinzie St 26428
## 7 member Clark St & Elm St 23548
## 8 member Wells St & Concord Ln 23498
## 9 casual Shedd Aquarium 21709
## 10 member Wells St & Elm St 20787
## # … with 2,580 more rows
The table above shows the most popular start stations when grouping by member vs casual. NA values were dropped here so that the table reflects only stations that can actually be identified. This gives better insight into where marketing could be placed to reach the areas with the highest user traffic.
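To go from the combined ranking to the single busiest station for each rider type, the counts can be sorted within each group. This is a minimal base R sketch on made-up toy records with hypothetical object names, not the code used for the table above.

```r
# Toy start-station records (made-up), standing in for all_trips_v2.
rides <- data.frame(
  member_casual = c("casual", "casual", "casual",
                    "member", "member", "member", "member", NA),
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Millennium Park", "Clark St & Elm St",
                         "Clark St & Elm St", "Clark St & Elm St",
                         "Millennium Park", NA),
  stringsAsFactors = FALSE
)

# Drop rows with a missing rider type or station, then count rides per pair.
known <- rides[complete.cases(rides), ]
counts <- as.data.frame(table(known$member_casual, known$start_station_name),
                        stringsAsFactors = FALSE)
names(counts) <- c("member_casual", "start_station_name", "n")
counts <- counts[counts$n > 0, ]

# Sort by rider type and descending count, then keep the first row per rider type.
counts <- counts[order(counts$member_casual, -counts$n), ]
top_station <- counts[!duplicated(counts$member_casual), ]
top_station
```

Keeping more than one row per rider type would give a separate top-N list for casual users and for members, which is the view most relevant to placing advertising.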
Casual Users: On weekends, casual users averaged longer trips and took more rides per day than during the week.
Members: Members took more rides per day during the week than on the weekend, although their average trips were slightly longer on weekends. Members showed less variation than casual users in both average ride length and number of rides between weekdays and the weekend.
Members vs Casual: Casual users had longer average ride
durations than members every day of the week, even though
their opposing day of the week trends might have suggested otherwise.
Rideable type made no notable difference when comparing the number of
rides for members vs casual users.
Popular Stations: The most popular stations were largely the
same whether they were counted as start stations or as end stations. When
grouping by members and casual users, the most popular stations were
fairly similar, though ranked differently. Casual users had a greater
number of rides starting at their most popular stations than members
did.
The analysis of historical trip data from August 2021 to July 2022 revealed that casual users took more rides per day on weekends than on weekdays. This insight is useful for designing marketing strategies aimed at converting casual riders into annual members because it shows which days casual riders are most active. The first recommendation is to focus marketing on the weekend, since those days are likely to have the most casual users on the platform.
Another recommendation based on this analysis would be to provide incentives for longer rides to any user signing up for a new annual membership. Casual riders tended to take longer rides, so marketing could run a promotion offering some free 30 or 45 minute rides to users who sign up for membership, since casual users averaged around 31 minutes a ride over the year. Given that the average casual ride was about 2.4 times as long as the average member ride, casual users who become members could help offset the promotional costs through the extra length of their trips.
A third recommendation based on this analysis would be to target the areas around the most popular stations when considering where to advertise to prospective members. The analysis identified the most popular stations overall and the most popular stations for each user type, so we know which stations are most popular for casual users specifically. Since casual users had a greater number of rides starting at their most popular stations than members did, casual users may tend to frequent the same stations more than members do. This would make location more of a factor for casual riders, an important consideration for marketing design.