Introduction

Thank you for taking the time to read my Cyclistic bike-share analysis case study! As a junior data analyst, I will perform many real-world tasks in this case study. My job involves working for a fictional company, Cyclistic, and meeting various characters. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

Scenario

My role is junior data analyst in the marketing analyst team at Cyclistic, a Chicago-based bike-sharing company. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

1. Identify the business task.
Design marketing strategies aimed at converting casual riders into annual member. Three questions will guide the future marketing program:.
How do annual members and casual riders use Cyclistic bikes differently?.
Why would casual riders buy Cyclistic annual memberships?.
How can Cyclistic use digital media to influence casual riders to become members?.

2. Consider key stakeholders.
Lily Moreno: She is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.  Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Riders: The majority of riders opt for traditional bikes; about 8% of riders use the assistive options (reclining bikes, hand tricycles, and cargo bikes). Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

Prepare

  1. Download the previous 12 months of Cyclistic trip data here.
  2. Identify how it’s organized Cyclistic trip data is in comma-delimited (.CSV) format
  3. Sort and filter the data I took data for the most recent 12 months from Sep 2021 to Aug 2022.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("lubridate", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\lubridate\libs\x64\lubridate.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\lubridate\libs\x64\lubridate.dll:
## Permission denied
## Warning: restored 'lubridate'
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("skimr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\DELL\AppData\Local\R\win-library\4.2\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\DELL\AppData\Local\R\win-library\4.2\dplyr\libs\x64\dplyr.dll:
## Permission denied
## Warning: restored 'dplyr'
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
install.packages("geosphere", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'geosphere' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'geosphere'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\DELL\AppData\Local\R\win-
## library\4.2\00LOCK\geosphere\libs\x64\geosphere.dll to C:
## \Users\DELL\AppData\Local\R\win-library\4.2\geosphere\libs\x64\geosphere.dll:
## Permission denied
## Warning: restored 'geosphere'
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpE1E37J\downloaded_packages
library(skimr)#skim data
library(dplyr) # group,  sort data
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse) # data import and tidy
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.5
## v tibble  3.1.8     v stringr 1.4.1
## v tidyr   1.2.1     v forcats 0.5.2
## v readr   2.1.3     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate) # functions set for date, time
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2) # helps visualize data
library(geosphere) # calculate distance
sep_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202109-divvy-tripdata.csv")
oct_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202110-divvy-tripdata.csv")
nov_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202111-divvy-tripdata.csv")
dec_21 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202112-divvy-tripdata.csv")
jan_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202201-divvy-tripdata.csv")
feb_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202202-divvy-tripdata.csv")
mar_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202203-divvy-tripdata.csv")
apr_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202204-divvy-tripdata.csv")
may_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202205-divvy-tripdata.csv")
jun_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202206-divvy-tripdata.csv")
jul_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202207-divvy-tripdata.csv")
aug_22 <- read.csv("D:/Viet Ha/Case_study_1/Case_study_1/202208-divvy-tripdata.csv")
# Import data
check_col_df <- data.frame(c1 = colnames(sep_21), 
                           c2 = colnames(oct_21),
                           c3 = colnames(nov_21),
                           c4 = colnames(dec_21),
                           c5 = colnames(jan_22),
                           c6 = colnames(feb_22),
                           c7 = colnames(mar_22),
                           c8 = colnames(apr_22),
                           c9 = colnames(may_22),
                           c10 = colnames(jan_22),
                           c11 = colnames(jul_22),
                           c12 = colnames(aug_22))
view(check_col_df)
# Compare column names each of the files
## There is no difference in column names
trip_data <- rbind(sep_21, oct_21, nov_21, dec_21, jan_22, feb_22, mar_22, apr_22, may_22, jun_22, jul_22,aug_22)
# Merge tables
str(trip_data)
## 'data.frame':    5883043 obs. of  13 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
##  $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
skim_without_charts(trip_data)
Data summary
Name trip_data
Number of rows 5883043
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5883043 0
rideable_type 0 1 11 13 0 3 0
started_at 0 1 19 19 0 4912339 0
ended_at 0 1 19 19 0 4919944 0
start_station_name 0 1 0 64 884365 1439 0
start_station_id 0 1 0 44 884363 1273 0
end_station_name 0 1 0 64 946303 1453 0
end_station_id 0 1 0 44 946303 1282 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5727 1 41.90 0.05 41.39 41.88 41.90 41.93 42.37
end_lng 5727 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.50
head(trip_data)
##            ride_id rideable_type          started_at            ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.89
## 2                                                                         41.94
## 3                                                                         41.81
## 4                                                                         41.80
## 5                                                                         41.88
## 6                                                                         41.87
##   start_lng end_lat end_lng member_casual
## 1    -87.68   41.89  -87.67        casual
## 2    -87.64   41.98  -87.67        casual
## 3    -87.72   41.80  -87.72        casual
## 4    -87.72   41.81  -87.72        casual
## 5    -87.74   41.88  -87.71        casual
## 6    -87.75   41.88  -87.74        casual
# Overview data
trip_data_drop_na <- trip_data %>% drop_na()
# Removed rows with NA values
## 1,322,897 obs have been deleted
trip_data_id <- trip_data_drop_na %>% distinct(ride_id, .keep_all = TRUE)
# Check duplicates 
## No duplicates removed
unique(trip_data_id[c("rideable_type")])
##      rideable_type
## 1    electric_bike
## 60    classic_bike
## 1218   docked_bike
unique(trip_data_id[c("member_casual")])
##    member_casual
## 1         casual
## 39        member
# Finding the unique values
str(trip_data_id)
## 'data.frame':    5877316 obs. of  13 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
##  $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
trip_data_id$started_at <- ymd_hms(trip_data_id$started_at)
trip_data_id$ended_at <- ymd_hms(trip_data_id$ended_at)
# Convert started_at and ended_at column from character to timestamp
trip_data_id <- trip_data_id %>% 
  mutate(time_cycled = ended_at - started_at,  #calculate time spent in seconds on each trip
         day_of_week = wday(started_at, label = FALSE), #assign day with number, 1 = Sunday
         distance_cycled = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))*0.001) #calculate distance cycled in km
head(trip_data_id)
##            ride_id rideable_type          started_at            ended_at
## 1 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54
## 2 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05
## 3 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57
## 4 78D1DE133B3DBF55 electric_bike 2021-09-28 14:51:17 2021-09-28 15:00:06
## 5 E03D4ACDCAEF6E00 electric_bike 2021-09-28 09:53:12 2021-09-28 10:03:44
## 6 346DE323A2677DC0 electric_bike 2021-09-28 01:53:18 2021-09-28 02:00:02
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.89
## 2                                                                         41.94
## 3                                                                         41.81
## 4                                                                         41.80
## 5                                                                         41.88
## 6                                                                         41.87
##   start_lng end_lat end_lng member_casual time_cycled day_of_week
## 1    -87.68   41.89  -87.67        casual    164 secs           3
## 2    -87.64   41.98  -87.67        casual    914 secs           3
## 3    -87.72   41.80  -87.72        casual    221 secs           3
## 4    -87.72   41.81  -87.72        casual    529 secs           3
## 5    -87.74   41.88  -87.71        casual    632 secs           3
## 6    -87.75   41.88  -87.74        casual    404 secs           3
##   distance_cycled
## 1       0.8286936
## 2       5.0984601
## 3       1.1131949
## 4       1.1131949
## 5       2.4864698
## 6       1.3878971
summary(trip_data_id$time_cycled)
##   Length    Class     Mode 
##  5877316 difftime  numeric
trip_data_id$time_cycled <- as.numeric(trip_data_id$time_cycled) 
#Convert difftime to numeric
final_data <- filter(trip_data_id, time_cycled > 9 & distance_cycled < 1000 & distance_cycled > 0.5)
summary(final_data$time_cycled)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10     404     669    1019    1147 2442301
summary(final_data$distance_cycled)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   1.126   1.822   2.436   3.082 114.512

Analyze

str(final_data)
## 'data.frame':    5155323 obs. of  16 variables:
##  $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct, format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
##  $ ended_at          : POSIXct, format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
##  $ time_cycled       : num  164 914 221 529 632 ...
##  $ day_of_week       : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ distance_cycled   : num  0.829 5.098 1.113 1.113 2.486 ...
final_data %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(average_time = mean(time_cycled)) %>% 
  ggplot(aes(x = day_of_week, y = average_time, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average time spent of riders by weekend by member casual", x = "Days of the week", y = "Average time spent in sec")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Average ride_length for member and casual
final_data %>% 
  group_by(member_casual) %>% 
  summarise(totals = n()) %>% 
  ggplot(aes(x =  "", y = totals, fill = member_casual)) +
  geom_bar(stat = "identity", width = 1) +
  labs(title = "Number of riders by member casual", x = "Ride type", y = "Numbers of riders")

# Calculate the average ride_length by day_of_week
final_data %>% 
  group_by(member_casual, day_of_week) %>% 
  summarize(average_duration = mean(distance_cycled)) %>% 
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average duration of riders by weekend by member casual", x = "Days of the week", y = "Average duration in kilometers")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Share

final_data %>%
  group_by(member_casual, rideable_type) %>% 
  summarise(totals = n()) %>% 
  ggplot(aes(x = member_casual, y = totals, fill = rideable_type)) +
  geom_col() +
  labs(title = "Number of riders by member casual by ride type", y = "Total")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Bike type usage by rider type during a week
final_data$hour_start = hour(round_date(final_data$started_at, 'hour'))
summary(final_data$hour_start)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   15.00   14.49   18.00   23.00
final_data %>% 
  ggplot(aes(hour_start, fill = member_casual)) +
  geom_bar() +
  labs(title = "Number of riders by the hours of the day by member casual", x = "The hours of the day", y = "Totals")

Act

Compare casual vs member rider.
For casual rider: Cycling peak hours is 3 p.m - 7 p.m, preferring classic bikes and docked bikes.
For member rider: Cycling peak hours are 8 a.m and 4 p.m - 7 p.m, preferring classic bikes and electric bikes.
Recommendations.
For member riders, bicycles are the main means of transportation, but not for casual riders.
Next steps.
What steps can be taken to make bicycles a main mode of transport for casual riders?.
What benefits will casual riders get from using bicycles as their primary means of transportation?.