Case Study: How Does a Bike-Share Navigate Speedy Success?

Introduction

Thank you for taking the time to read my Cyclistic bike-share analysis case study! As a junior data analyst, I will perform many real-world tasks in this case study. My job involves working for a fictional company, Cyclistic, and meeting various characters. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

Scenario

My role is junior data analyst in the marketing analyst team at Cyclistic, a Chicago-based bike-sharing company. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

1. Identify the business task.
Design marketing strategies aimed at converting casual riders into annual member. Three questions will guide the future marketing program:.
How do annual members and casual riders use Cyclistic bikes differently?.
Why would casual riders buy Cyclistic annual memberships?.
How can Cyclistic use digital media to influence casual riders to become members?.

2. Consider key stakeholders.
Lily Moreno: She is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Riders: The majority of riders opt for traditional bikes; about 8% of riders use the assistive options (reclining bikes, hand tricycles, and cargo bikes). Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

Prepare

Download the previous 12 months of Cyclistic trip data here.
Identify how it’s organized Cyclistic trip data is in comma-delimited (.CSV) format
Sort and filter the data I took data for the most recent 12 months from Sep 2021 to Aug 2022.

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages

install.packages("lubridate", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'lubridate' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'lubridate'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\lubridate\libs\x64\lubridate.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\lubridate\libs\x64\lubridate.dll:
## Permission denied

## Warning: restored 'lubridate'

## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages

install.packages("skimr", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'skimr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages

install.packages("dplyr", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'dplyr' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'dplyr'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\DELL\AppData\Local\R\win-library\4.2\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\DELL\AppData\Local\R\win-library\4.2\dplyr\libs\x64\dplyr.dll:
## Permission denied

## Warning: restored 'dplyr'

## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages

install.packages("geosphere", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'geosphere' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'geosphere'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\geosphere\libs\x64\geosphere.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\geosphere\libs\x64\geosphere.dll:
## Permission denied

## Warning: restored 'geosphere'

## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages

library(skimr)#skim data
library(dplyr) # group,  sort data

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse) # data import and tidy

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --

## v ggplot2 3.3.6     v purrr   0.3.5
## v tibble  3.1.8     v stringr 1.4.1
## v tidyr   1.2.1     v forcats 0.5.2
## v readr   2.1.3     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate) # functions set for date, time

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2) # helps visualize data
library(geosphere) # calculate distance

sep_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202109-divvy-tripdata.csv")
oct_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202110-divvy-tripdata.csv")
nov_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202111-divvy-tripdata.csv")
dec_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202112-divvy-tripdata.csv")
jan_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202201-divvy-tripdata.csv")
feb_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202202-divvy-tripdata.csv")
mar_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202203-divvy-tripdata.csv")
apr_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202204-divvy-tripdata.csv")
may_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202205-divvy-tripdata.csv")
jun_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202206-divvy-tripdata.csv")
jul_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202207-divvy-tripdata.csv")
aug_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202208-divvy-tripdata.csv")
# Import data

check_col_df <- data.frame(c1 = colnames(sep_21), 
                           c2 = colnames(oct_21),
                           c3 = colnames(nov_21),
                           c4 = colnames(dec_21),
                           c5 = colnames(jan_22),
                           c6 = colnames(feb_22),
                           c7 = colnames(mar_22),
                           c8 = colnames(apr_22),
                           c9 = colnames(may_22),
                           c10 = colnames(jan_22),
                           c11 = colnames(jul_22),
                           c12 = colnames(aug_22))
view(check_col_df)
# Compare column names each of the files
## There is no difference in column names
trip_data <- rbind(sep_21, oct_21, nov_21, dec_21, jan_22, feb_22, mar_22, apr_22, may_22, jun_22, jul_22,aug_22)
# Merge tables

str(trip_data)

## 'data.frame':    5883043 obs. of  13 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
##  $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...

skim_without_charts(trip_data)

Data summary
Name	trip_data
Number of rows	5883043
Number of columns	13
_______________________
Column type frequency:
character	9
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ride_id	1	16	16	0	5883043
rideable_type	1	11	13	0	3
started_at	1	19	19	0	4912339
ended_at	1	19	19	0	4919944
start_station_name	1	0	64	884365	1439
start_station_id	1	0	44	884363	1273
end_station_name	1	0	64	946303	1453
end_station_id	1	0	44	946303	1282
member_casual	1	6	6	0	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	45.64
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-73.80
end_lat	5727	1	41.90	0.05	41.39	41.88	41.90	41.93	42.37
end_lng	5727	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.50

head(trip_data)

##            ride_id rideable_type          started_at            ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.89
## 2                                                                         41.94
## 3                                                                         41.81
## 4                                                                         41.80
## 5                                                                         41.88
## 6                                                                         41.87
##   start_lng end_lat end_lng member_casual
## 1    -87.68   41.89  -87.67        casual
## 2    -87.64   41.98  -87.67        casual
## 3    -87.72   41.80  -87.72        casual
## 4    -87.72   41.81  -87.72        casual
## 5    -87.74   41.88  -87.71        casual
## 6    -87.75   41.88  -87.74        casual

# Overview data

trip_data_drop_na <- trip_data %>% drop_na()
# Removed rows with NA values
## 1,322,897 obs have been deleted
trip_data_id <- trip_data_drop_na %>% distinct(ride_id, .keep_all = TRUE)
# Check duplicates 
## No duplicates removed
unique(trip_data_id[c("rideable_type")])

##      rideable_type
## 1    electric_bike
## 60    classic_bike
## 1218   docked_bike

unique(trip_data_id[c("member_casual")])

##    member_casual
## 1         casual
## 39        member

# Finding the unique values
str(trip_data_id)

## 'data.frame':    5877316 obs. of  13 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
##  $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...

trip_data_id$started_at <- ymd_hms(trip_data_id$started_at)
trip_data_id$ended_at <- ymd_hms(trip_data_id$ended_at)
# Convert started_at and ended_at column from character to timestamp
trip_data_id <- trip_data_id %>% 
  mutate(time_cycled = ended_at - started_at,  #calculate time spent in seconds on each trip
         day_of_week = wday(started_at, label = FALSE), #assign day with number, 1 = Sunday
         distance_cycled = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))*0.001) #calculate distance cycled in km

head(trip_data_id)

##            ride_id rideable_type          started_at            ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.89
## 2                                                                         41.94
## 3                                                                         41.81
## 4                                                                         41.80
## 5                                                                         41.88
## 6                                                                         41.87
##   start_lng end_lat end_lng member_casual time_cycled day_of_week
## 1    -87.68   41.89  -87.67        casual    164 secs           3
## 2    -87.64   41.98  -87.67        casual    914 secs           3
## 3    -87.72   41.80  -87.72        casual    221 secs           3
## 4    -87.72   41.81  -87.72        casual    529 secs           3
## 5    -87.74   41.88  -87.71        casual    632 secs           3
## 6    -87.75   41.88  -87.74        casual    404 secs           3
##   distance_cycled
## 1       0.8286936
## 2       5.0984601
## 3       1.1131949
## 4       1.1131949
## 5       2.4864698
## 6       1.3878971

summary(trip_data_id$time_cycled)

##   Length    Class     Mode 
##  5877316 difftime  numeric

trip_data_id$time_cycled <- as.numeric(trip_data_id$time_cycled) 
#Convert difftime to numeric
final_data <- filter(trip_data_id, time_cycled > 9 & distance_cycled < 1000 & distance_cycled > 0.5)
summary(final_data$time_cycled)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10     404     669    1019    1147 2442301

summary(final_data$distance_cycled)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   1.126   1.822   2.436   3.082 114.512

Analyze

str(final_data)

## 'data.frame':    5155323 obs. of  16 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct, format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
##  $ ended_at          : POSIXct, format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
##  $ time_cycled       : num  164 914 221 529 632 ...
##  $ day_of_week       : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ distance_cycled   : num  0.829 5.098 1.113 1.113 2.486 ...

final_data %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(average_time = mean(time_cycled)) %>% 
  ggplot(aes(x = day_of_week, y = average_time, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average time spent of riders by weekend by member casual", x = "Days of the week", y = "Average time spent in sec")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Average ride_length for member and casual
final_data %>% 
  group_by(member_casual) %>% 
  summarise(totals = n()) %>% 
  ggplot(aes(x =  "", y = totals, fill = member_casual)) +
  geom_bar(stat = "identity", width = 1) +
  labs(title = "Number of riders by member casual", x = "Ride type", y = "Numbers of riders")

# Calculate the average ride_length by day_of_week
final_data %>% 
  group_by(member_casual, day_of_week) %>% 
  summarize(average_duration = mean(distance_cycled)) %>% 
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average duration of riders by weekend by member casual", x = "Days of the week", y = "Average duration in kilometers")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Share

final_data %>%
  group_by(member_casual, rideable_type) %>% 
  summarise(totals = n()) %>% 
  ggplot(aes(x = member_casual, y = totals, fill = rideable_type)) +
  geom_col() +
  labs(title = "Number of riders by member casual by ride type", y = "Total")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Bike type usage by rider type during a week

final_data$hour_start = hour(round_date(final_data$started_at, 'hour'))
summary(final_data$hour_start)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   15.00   14.49   18.00   23.00

final_data %>% 
  ggplot(aes(hour_start, fill = member_casual)) +
  geom_bar() +
  labs(title = "Number of riders by the hours of the day by member casual", x = "The hours of the day", y = "Totals")

Act

Compare casual vs member rider.
For casual rider: Cycling peak hours is 3 p.m - 7 p.m, preferring classic bikes and docked bikes.
For member rider: Cycling peak hours are 8 a.m and 4 p.m - 7 p.m, preferring classic bikes and electric bikes.
Recommendations.
For member riders, bicycles are the main means of transportation, but not for casual riders.
Next steps.
What steps can be taken to make bicycles a main mode of transport for casual riders?.
What benefits will casual riders get from using bicycles as their primary means of transportation?.

Case Study: How Does a Bike-Share Navigate Speedy Success?

Nguyen Thi Viet Ha

2022-10-09

Introduction

Scenario

Ask

Prepare

Analyze

Act