Thank you for taking the time to read my Cyclistic bike-share analysis case study! As a junior data analyst, I will perform many real-world tasks in this case study. My job involves working for a fictional company, Cyclistic, and meeting various characters. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
My role is junior data analyst in the marketing analyst team at Cyclistic, a Chicago-based bike-sharing company. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.
1. Identify the business task.
Design marketing strategies aimed at converting casual riders
into annual member. Three questions will guide the future
marketing program:.
How do annual members and casual riders use Cyclistic bikes
differently?.
Why would casual riders buy Cyclistic annual memberships?.
How can Cyclistic use digital media to influence casual riders to
become members?.
2. Consider key stakeholders.
Lily Moreno: She is responsible for the development of campaigns
and initiatives to promote the bike-share program. These may
include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are
responsible for collecting, analyzing, and reporting
data that helps guide Cyclistic marketing strategy. Cyclistic
executive team: The notoriously detail-oriented executive team will
decide whether to approve the recommended marketing
program.
Riders: The majority of riders opt for traditional
bikes; about 8% of riders use the assistive options (reclining
bikes, hand tricycles, and cargo bikes). Cyclistic users are more likely
to ride for leisure, but about 30% use them to commute to work each
day.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("lubridate", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\lubridate\libs\x64\lubridate.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\lubridate\libs\x64\lubridate.dll:
## Permission denied
## Warning: restored 'lubridate'
##
## The downloaded binary packages are in
## C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("skimr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\DELL\AppData\Local\R\win-library\4.2\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\DELL\AppData\Local\R\win-library\4.2\dplyr\libs\x64\dplyr.dll:
## Permission denied
## Warning: restored 'dplyr'
##
## The downloaded binary packages are in
## C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("geosphere", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'geosphere' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'geosphere'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\geosphere\libs\x64\geosphere.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\geosphere\libs\x64\geosphere.dll:
## Permission denied
## Warning: restored 'geosphere'
##
## The downloaded binary packages are in
## C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
library(skimr)#skim data
library(dplyr) # group, sort data
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse) # data import and tidy
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.5
## v tibble 3.1.8 v stringr 1.4.1
## v tidyr 1.2.1 v forcats 0.5.2
## v readr 2.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) # functions set for date, time
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) # helps visualize data
library(geosphere) # calculate distance
sep_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202109-divvy-tripdata.csv")
oct_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202110-divvy-tripdata.csv")
nov_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202111-divvy-tripdata.csv")
dec_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202112-divvy-tripdata.csv")
jan_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202201-divvy-tripdata.csv")
feb_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202202-divvy-tripdata.csv")
mar_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202203-divvy-tripdata.csv")
apr_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202204-divvy-tripdata.csv")
may_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202205-divvy-tripdata.csv")
jun_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202206-divvy-tripdata.csv")
jul_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202207-divvy-tripdata.csv")
aug_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202208-divvy-tripdata.csv")
# Import data
check_col_df <- data.frame(c1 = colnames(sep_21),
c2 = colnames(oct_21),
c3 = colnames(nov_21),
c4 = colnames(dec_21),
c5 = colnames(jan_22),
c6 = colnames(feb_22),
c7 = colnames(mar_22),
c8 = colnames(apr_22),
c9 = colnames(may_22),
c10 = colnames(jan_22),
c11 = colnames(jul_22),
c12 = colnames(aug_22))
view(check_col_df)
# Compare column names each of the files
## There is no difference in column names
trip_data <- rbind(sep_21, oct_21, nov_21, dec_21, jan_22, feb_22, mar_22, apr_22, may_22, jun_22, jul_22,aug_22)
# Merge tables
str(trip_data)
## 'data.frame': 5883043 obs. of 13 variables:
## $ ride_id : chr "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
## $ ended_at : chr "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
skim_without_charts(trip_data)
| Name | trip_data |
| Number of rows | 5883043 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5883043 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4912339 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4919944 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 884365 | 1439 | 0 |
| start_station_id | 0 | 1 | 0 | 44 | 884363 | 1273 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 946303 | 1453 | 0 |
| end_station_id | 0 | 1 | 0 | 44 | 946303 | 1282 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5727 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.37 |
| end_lng | 5727 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.50 |
head(trip_data)
## ride_id rideable_type started_at ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.89
## 2 41.94
## 3 41.81
## 4 41.80
## 5 41.88
## 6 41.87
## start_lng end_lat end_lng member_casual
## 1 -87.68 41.89 -87.67 casual
## 2 -87.64 41.98 -87.67 casual
## 3 -87.72 41.80 -87.72 casual
## 4 -87.72 41.81 -87.72 casual
## 5 -87.74 41.88 -87.71 casual
## 6 -87.75 41.88 -87.74 casual
# Overview data
trip_data_drop_na <- trip_data %>% drop_na()
# Removed rows with NA values
## 1,322,897 obs have been deleted
trip_data_id <- trip_data_drop_na %>% distinct(ride_id, .keep_all = TRUE)
# Check duplicates
## No duplicates removed
unique(trip_data_id[c("rideable_type")])
## rideable_type
## 1 electric_bike
## 60 classic_bike
## 1218 docked_bike
unique(trip_data_id[c("member_casual")])
## member_casual
## 1 casual
## 39 member
# Finding the unique values
str(trip_data_id)
## 'data.frame': 5877316 obs. of 13 variables:
## $ ride_id : chr "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
## $ ended_at : chr "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
trip_data_id$started_at <- ymd_hms(trip_data_id$started_at)
trip_data_id$ended_at <- ymd_hms(trip_data_id$ended_at)
# Convert started_at and ended_at column from character to timestamp
trip_data_id <- trip_data_id %>%
mutate(time_cycled = ended_at - started_at, #calculate time spent in seconds on each trip
day_of_week = wday(started_at, label = FALSE), #assign day with number, 1 = Sunday
distance_cycled = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))*0.001) #calculate distance cycled in km
head(trip_data_id)
## ride_id rideable_type started_at ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.89
## 2 41.94
## 3 41.81
## 4 41.80
## 5 41.88
## 6 41.87
## start_lng end_lat end_lng member_casual time_cycled day_of_week
## 1 -87.68 41.89 -87.67 casual 164 secs 3
## 2 -87.64 41.98 -87.67 casual 914 secs 3
## 3 -87.72 41.80 -87.72 casual 221 secs 3
## 4 -87.72 41.81 -87.72 casual 529 secs 3
## 5 -87.74 41.88 -87.71 casual 632 secs 3
## 6 -87.75 41.88 -87.74 casual 404 secs 3
## distance_cycled
## 1 0.8286936
## 2 5.0984601
## 3 1.1131949
## 4 1.1131949
## 5 2.4864698
## 6 1.3878971
summary(trip_data_id$time_cycled)
## Length Class Mode
## 5877316 difftime numeric
trip_data_id$time_cycled <- as.numeric(trip_data_id$time_cycled)
#Convert difftime to numeric
final_data <- filter(trip_data_id, time_cycled > 9 & distance_cycled < 1000 & distance_cycled > 0.5)
summary(final_data$time_cycled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 404 669 1019 1147 2442301
summary(final_data$distance_cycled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 1.126 1.822 2.436 3.082 114.512
str(final_data)
## 'data.frame': 5155323 obs. of 16 variables:
## $ ride_id : chr "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
## $ ended_at : POSIXct, format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
## $ time_cycled : num 164 914 221 529 632 ...
## $ day_of_week : num 3 3 3 3 3 3 3 3 3 3 ...
## $ distance_cycled : num 0.829 5.098 1.113 1.113 2.486 ...
final_data %>%
group_by(member_casual, day_of_week) %>%
summarise(average_time = mean(time_cycled)) %>%
ggplot(aes(x = day_of_week, y = average_time, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average time spent of riders by weekend by member casual", x = "Days of the week", y = "Average time spent in sec")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
# Average ride_length for member and casual
final_data %>%
group_by(member_casual) %>%
summarise(totals = n()) %>%
ggplot(aes(x = "", y = totals, fill = member_casual)) +
geom_bar(stat = "identity", width = 1) +
labs(title = "Number of riders by member casual", x = "Ride type", y = "Numbers of riders")
# Calculate the average ride_length by day_of_week
final_data %>%
group_by(member_casual, day_of_week) %>%
summarize(average_duration = mean(distance_cycled)) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average duration of riders by weekend by member casual", x = "Days of the week", y = "Average duration in kilometers")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
# Share
final_data %>%
group_by(member_casual, rideable_type) %>%
summarise(totals = n()) %>%
ggplot(aes(x = member_casual, y = totals, fill = rideable_type)) +
geom_col() +
labs(title = "Number of riders by member casual by ride type", y = "Total")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
# Bike type usage by rider type during a week
final_data$hour_start = hour(round_date(final_data$started_at, 'hour'))
summary(final_data$hour_start)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 15.00 14.49 18.00 23.00
final_data %>%
ggplot(aes(hour_start, fill = member_casual)) +
geom_bar() +
labs(title = "Number of riders by the hours of the day by member casual", x = "The hours of the day", y = "Totals")
Compare casual vs member rider.
For casual rider: Cycling peak hours is 3 p.m - 7 p.m, preferring
classic bikes and docked bikes.
For member rider: Cycling peak hours are 8 a.m and 4 p.m - 7 p.m,
preferring classic bikes and electric bikes.
Recommendations.
For member riders, bicycles are the main means of transportation, but
not for casual riders.
Next steps.
What steps can be taken to make bicycles a main mode of transport for
casual riders?.
What benefits will casual riders get from using bicycles as their
primary means of transportation?.