For my Google Data Analytics capstone project, I completed a case study for a fictional bike-share company in which I analyzed 12 months of data to understand the difference between casual riders and annual members. From these insights, the marketing team can develop a strategy to convert casual riders into annual members. To accomplish my task, I applied the data analysis process including ask, prepare, process, analyze, share, and act.
Cyclistic is a bike-share company in Chicago with a fleet of 5,824 geotracked bicycles and a network of 692 stations. They offer 3 pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders and customers who purchase annual memberships are referred to as Cyclistic members. The company determined that annual members are more profitable than casual riders and is aiming to create a marketing campaign to maximize the number of annual members by converting casual riders.
Key stakeholder: Lily Moreno, the director of marketing responsible for developing campaigns to promote bike-share program.
Business task: How do annual members and casual riders use Cyclistic bikes differently?
Since Cyclistic is a fictional company, we want to use proxy data from a similar bike-share company which can be downloaded from : https://divvy-tripdata.s3.amazonaws.com/index.html
After downloading 12 ZIP files, each containing a CSV file with data
for each month between June 2021 and May 2022, install and load required
packages, then import and bind all 12 data sets into a single data
frame. read_csv will automatically extract ZIP files.
#Install and load required packages.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(rcartocolor)
# Suppress summarise additional informative output.
options(dplyr.summarise.inform = FALSE)
#Import and bind 12 datasets into single data frame.
bikedata1 <- list.files("data", full.names = TRUE, pattern = "*.csv") %>%
lapply(read_csv) %>%
bind_rows()
Rows: 729595 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 822410 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 804352 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 756147 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 631226 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 359978 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 247540 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 103770 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 115609 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 284042 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 371249 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 634858 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Information
There are 13 columns and 5860776 rows.
Data types include 2 datetime, 4 double, and 7 character.
There are missing values in start_station_name, start_station_id, end_station_name, end_station_id
#Preview data for immediate insights
str(bikedata1)
spec_tbl_df [5,860,776 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ride_id : chr [1:5860776] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
$ rideable_type : chr [1:5860776] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
$ started_at : POSIXct[1:5860776], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
$ ended_at : POSIXct[1:5860776], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
$ start_station_name: chr [1:5860776] NA NA NA NA ...
$ start_station_id : chr [1:5860776] NA NA NA NA ...
$ end_station_name : chr [1:5860776] NA NA NA NA ...
$ end_station_id : chr [1:5860776] NA NA NA NA ...
$ start_lat : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
$ start_lng : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
$ end_lat : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
$ end_lng : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
$ member_casual : chr [1:5860776] "member" "member" "member" "member" ...
- attr(*, "spec")=
.. cols(
.. ride_id = col_character(),
.. rideable_type = col_character(),
.. started_at = col_datetime(format = ""),
.. ended_at = col_datetime(format = ""),
.. start_station_name = col_character(),
.. start_station_id = col_character(),
.. end_station_name = col_character(),
.. end_station_id = col_character(),
.. start_lat = col_double(),
.. start_lng = col_double(),
.. end_lat = col_double(),
.. end_lng = col_double(),
.. member_casual = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(bikedata1)
glimpse(bikedata1)
Rows: 5,860,776
Columns: 13
$ ride_id <chr> "99FEC93BA843FB20", "06048DCFC8520CAF", "9598066F68…
$ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
$ started_at <dttm> 2021-06-13 14:31:28, 2021-06-04 11:18:02, 2021-06-…
$ ended_at <dttm> 2021-06-13 14:34:11, 2021-06-04 11:24:19, 2021-06-…
$ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Michigan Ave &…
$ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "13042", NA, NA…
$ start_lat <dbl> 41.80, 41.79, 41.80, 41.78, 41.80, 41.78, 41.79, 41…
$ start_lng <dbl> -87.59, -87.59, -87.60, -87.58, -87.59, -87.58, -87…
$ end_lat <dbl> 41.80000, 41.80000, 41.79000, 41.80000, 41.79000, 4…
$ end_lng <dbl> -87.6000, -87.6000, -87.5900, -87.6000, -87.5900, -…
$ member_casual <chr> "member", "member", "member", "member", "member", "…
colnames(bikedata1)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
ride_id
rideable_type
started_at
ended_at
start_station_name
start_station_id
end_station_name
end_station_id
start_lat
start_lng
end_lat
end_lng
member_casual
We want to separate the date and add individual columns for the
month, day, year, and day of the week. This will allow us to aggregate
data and perform calculations easier. We also want to find the length of
each ride using the difftime() function, then convert it to
a format that is easy to read (HH:MM:SS).
bikedata1$date <- as_date(bikedata1$started_at)
#Add column for month
bikedata1$month <- format(as_date(bikedata1$date), "%m")
#Add column for day
bikedata1$day <- format(as_date(bikedata1$date), "%d")
#Add column for year
bikedata1$year <- format(as_date(bikedata1$date), "%Y")
#Add Column for day of the week
bikedata1$day_of_week <- format(as_date(bikedata1$started_at), "%A")
#Add column for ride length
bikedata1$ride_length <- difftime(bikedata1$ended_at, bikedata1$started_at) %>% as.numeric()
We want to remove columns start_lat,
start_lng, end_lat, and end_lng
as the data was dropped starting in 2020. We also need to decide how to
handle missing values which were primarily in
start_station_name, start_station_id,
end_station_name, and end_station_id. We could
try to find the missing stations using the latitude/longitude but the
numbers are not accurate to definitively pinpoint the correct station.
That leaves us with either removing the variable (column) or observation
(row). I decided to remove the rows with missing values to minimize
inaccurate data because the missing values may have influenced the
start/end times. For example, a bike could have been docked incorrectly
and the time continued to run.
#Remove lat/lng columns
bikedata1 <- bikedata1 %>% select(-start_lat, -start_lng, -end_lat, -end_lng)
#Remove rows with missing values
bikedata1 <- na.omit(bikedata1)
We want to verify the data and look for any inconsistent or inaccurate data. We confirmed that there are no duplicates, each ride is either member or casual, and the data is within the correct date range. However, there were 3 ride types (electric_bike, classic_bike, and docked_bike) which was not expected. After reviewing the types of bikes available on the Divvy Bike website, there there is no mention of a docked bike. I decided to remove rows with docked bikes to minimize the risk of inaccurate date.
We also found rows with ride_length less than or equal to 0 which is not accurate and therefore removed.
#Check for duplicates
bikedata1 %>% get_dupes(ride_id)
No duplicate combinations found of: ride_id
#Check ride type
bikedata1 %>% count(rideable_type)
#Check member type
bikedata1 %>% count(member_casual)
#Check for data outside date range
bikedata1 %>% filter(started_at <= as_date("2021-05-31") | started_at >= as_date("2022-06-01"))
#Check for ride length less than or equal to 0
bikedata1 %>% select(ride_length) %>% filter(ride_length <= 0)
#Remove rows with docked bike
bikedata1 <- bikedata1 %>% filter(!rideable_type == "docked_bike")
#Remove rows with less than or equal to 0 ride lengths
bikedata1 <- bikedata1 %>% filter(!ride_length <= 0)
summary(as.numeric(bikedata1$ride_length))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 383 665 1020 1173 89994
aggregate(list(ride_length_mean = bikedata1$ride_length), list(member_type = bikedata1$member_casual), mean)
aggregate(list(ride_length_median = bikedata1$ride_length), list(member_type = bikedata1$member_casual), median)
aggregate(list(ride_length_max = bikedata1$ride_length), list(member_type = bikedata1$member_casual), max)
aggregate(list(ride_length_min = bikedata1$ride_length), list(member_type = bikedata1$member_casual), min)
#Order week with Sunday as first day of the week
bikedata1$day_of_week <- ordered(bikedata1$day_of_week, levels = c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'))
#Aggregate data to average ride duration vs. member type/day of week
aggregate(list(ride_length_mean = bikedata1$ride_length), list(member_type = bikedata1$member_casual, day_of_week = bikedata1$day_of_week), mean)
#Plot average ride duration vs. day of week for members and casual riders.
bikedata1 %>%
group_by(day_of_week, member_casual) %>%
arrange(day_of_week) %>%
summarise(rides = n(), average_duration = mean(ride_length)) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) + geom_col(position = "dodge") + scale_fill_carto_d(palette = "Vivid") + labs(x = "Day of Week", y = "Average Ride Duration (Seconds)", title = "Average Ride Duration vs. Day of Week for Members and Casual Riders") + theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())
#Plot number of rides vs. day of week for members and casual riders.
bikedata1 %>%
group_by(day_of_week, member_casual) %>%
arrange(day_of_week) %>%
summarise(rides = n(), average_duration = mean(ride_length)) %>%
ggplot(aes(x = day_of_week, y = rides, fill = member_casual)) + geom_col(position = "dodge") + scale_fill_carto_d(palette = "Vivid") + labs(x = "Day of Week", y = "Number of Rides", title = "Number of Rides vs. Day of Week for Members and Casual Riders") + theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())
#Plot average ride duration vs. month for members and casual riders
bikedata1 %>%
mutate(month = format(as_date(started_at), "%b") %>% factor(levels = month.abb)) %>%
group_by(member_casual, month) %>%
arrange(month) %>%
summarise(rides = n(), average_duration = mean(ride_length)) %>%
ggplot(aes(x = month, y = average_duration, fill = member_casual)) + geom_col(position = "dodge") + scale_fill_carto_d(palette = "Vivid") + labs(x = "Month", y = "Average Ride Duration (Seconds)", title = "Average Ride Duration vs. Month for Members and Casual Riders") + theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())
#Plot number of rides vs. month for members and casual riders
bikedata1 %>%
mutate(month = format(as_date(started_at), "%b") %>% factor(levels = month.abb)) %>%
group_by(member_casual, month) %>%
arrange(month) %>%
summarise(rides = n(), average_duration = mean(ride_length)) %>%
ggplot(aes(x = month, y = rides, fill = member_casual)) + geom_col(position = "dodge") + scale_fill_carto_d(palette = "Vivid") + labs(x = "Month", y = "Number of Rides", title = "Number of Rides vs. Month for Members and Casual Riders") + theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())
Share Data
Act