This case study was completed as the final capstone project for the Google Analytics Certificate. The purpose of the project is to answer key business questions for a fictional company, Cyclistic, by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
Cyclistic is a bike-share company in Chicago, and the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the goal of the project is to understand how casual riders and annual members use Cyclistic bikes differently.
Identify key differences in how annual members and causual riders use Cyclistic.
Previous 12 months of historical trip data will be analyzed to identify trends. The required data is made available to public by Motivate International Inc. under this license. As of the date of this case study, the most current 12 months of data (October 2021 - September 2022) can be downloaded from here.
Original data is stored as .CSV files for each month. Details of each unique bike trip is recorded in 13 columns, namely:
[1] “ride_id” (unique trip id)
[2] “rideable_type” (type of bike)
[3,4] “started_at” “ended_at” (trip start time and end time)
[5,6,7,8] “start_station_name” “start_station_id” “end_station_name”
“end_station_id” (start/end station info)
[9,10,11,12] “start_lat” “start_lng” “end_lat” “end_lng” (start/end
latitude and longitude)
[13] “member_casual” (type of membership)
We will use RStudio for this analysis, and data will be stored locally.
Set up my R environment by installing necessary packages and libraries.
#install.packages('tidyverse')
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
library(lubridate)
library(hms)
Import all data files and merge them into one data frame.
Note: Initially, I have imported each data file as an individual data
frame, and verified that they are all structured the same way and
columns are named consistently, so that they can be stacked into a
single data frame.
setwd("C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/Data")
df_all <- list.files() %>%
lapply(read_csv, show_col_types = FALSE) %>%
bind_rows
We verify that all data files imported correctly.
# Summary of the combined data frame
summary(df_all)
## ride_id rideable_type started_at
## Length:5828235 Length:5828235 Min. :2021-10-01 00:00:09.00
## Class :character Class :character 1st Qu.:2022-02-28 19:21:08.50
## Mode :character Mode :character Median :2022-06-08 06:41:28.00
## Mean :2022-05-06 21:39:18.18
## 3rd Qu.:2022-08-02 11:26:01.00
## Max. :2022-09-30 23:59:56.00
##
## ended_at start_station_name start_station_id
## Min. :2021-10-01 00:03:11.0 Length:5828235 Length:5828235
## 1st Qu.:2022-02-28 19:34:02.5 Class :character Class :character
## Median :2022-06-08 06:55:07.0 Mode :character Mode :character
## Mean :2022-05-06 21:58:54.2
## 3rd Qu.:2022-08-02 11:46:26.0
## Max. :2022-10-05 19:53:11.0
##
## end_station_name end_station_id start_lat start_lng
## Length:5828235 Length:5828235 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5828235
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. :-87.30
## NA's :5844 NA's :5844
Next, we check to see if there are duplicate entries and null values.
# Remove duplicates
df_all <- df_all %>% distinct(ride_id, .keep_all = TRUE)
# Columns containing null values
colnames(df_all)[colSums(is.na(df_all))>0]
## [1] "start_station_name" "start_station_id" "end_station_name"
## [4] "end_station_id" "end_lat" "end_lng"
There are some missing values in start station and end station information. Since it should not affect our analysis of trip duration, ridership type, time of day, etc, they will be included for such calculations. However, they should be removed for further analysis concerning starting and ending stations.
# Detecting outliers in start_lat and start_lng
summary(df_all$start_lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41.64 41.88 41.90 41.90 41.93 45.64
summary(df_all$start_lng) # lat,long 45.64, -73.80 lies in Quebec, Canada
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -87.84 -87.66 -87.64 -87.65 -87.63 -73.80
View(filter(df_all,start_lat>44)) # ride_id=3327172413547F64 is an outlier and should be removed
Next, we create two new columns in our data frame, namely “ride_length” and “day_of_week”.
# "ride_length" is the duration of each trip, calculated by subtracting "started_at" from "ended_at"
# "day_of_week" gives us the day of the week when the ride took place
df_all <- df_all %>%
mutate(ride_length = as_hms(difftime(ended_at, started_at))) %>%
mutate(day_of_week = wday(started_at, label=TRUE, week_start=1))
# Inspect the structure of the modified columns
str(df_all)
## tibble [5,828,235 × 15] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5828235] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:5828235] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5828235], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
## $ ended_at : POSIXct[1:5828235], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
## $ start_station_name: chr [1:5828235] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:5828235] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:5828235] NA NA NA NA ...
## $ end_station_id : chr [1:5828235] NA NA NA NA ...
## $ start_lat : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5828235] "member" "member" "member" "member" ...
## $ ride_length : 'hms' num [1:5828235] 00:03:08 00:01:37 00:07:47 00:01:15 ...
## ..- attr(*, "units")= chr "secs"
## $ day_of_week : Ord.factor w/ 7 levels "Mon"<"Tue"<"Wed"<..: 5 4 6 6 3 4 4 3 4 3 ...
We will remove the entries where trip duration is either negative or
zero, since there is not a way to verify if those were simply input
errors.
Multi-day trips, although realistically do not reflect actual ride
duration, could signify when a renter chose to keep the bike for
multiple days. It could also mean a difference in how the bike is used,
for example, work commute vs. leisurely ride.
# Number of trips with zero or negative duration
nrow(filter(df_all, ride_length <= 0)) # 571 "bad" entries, not a significant percent of data set
## [1] 571
# Number of trips with over 24 hours duration (86400 seconds)
nrow(filter(df_all, ride_length >= 86400)) # 5397 entries, also not a significant percent of overall data
## [1] 5397
# Create new data frame after dropping unwanted data
df_final <- df_all %>%
filter(ride_length > 0, ride_id != "3327172413547F64")
summary(df_final)
## ride_id rideable_type started_at
## Length:5827663 Length:5827663 Min. :2021-10-01 00:00:09.00
## Class :character Class :character 1st Qu.:2022-02-28 19:26:37.50
## Mode :character Mode :character Median :2022-06-08 06:43:13.00
## Mean :2022-05-06 21:41:36.58
## 3rd Qu.:2022-08-02 11:25:39.50
## Max. :2022-09-30 23:59:56.00
##
## ended_at start_station_name start_station_id
## Min. :2021-10-01 00:03:11.00 Length:5827663 Length:5827663
## 1st Qu.:2022-02-28 19:40:51.50 Class :character Class :character
## Median :2022-06-08 06:57:04.00 Mode :character Mode :character
## Mean :2022-05-06 22:01:12.96
## 3rd Qu.:2022-08-02 11:46:07.00
## Max. :2022-10-05 19:53:11.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5827663 Length:5827663 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual ride_length
## Min. :41.39 Min. :-88.97 Length:5827663 Length:5827663
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character Class1:hms
## Median :41.90 Median :-87.64 Mode :character Class2:difftime
## Mean :41.90 Mean :-87.65 Mode :numeric
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. :-87.30
## NA's :5844 NA's :5844
## day_of_week
## Mon:752789
## Tue:817229
## Wed:820099
## Thu:837172
## Fri:843901
## Sat:957928
## Sun:798545
In order to identify differences between casual riders and annual members, we will compare and find trends in a number of factors.
# Investigate bike types and membership columns
table(df_final$rideable_type, df_final$member_casual)
##
## casual member
## classic_bike 941387 1798983
## docked_bike 192464 0
## electric_bike 1267139 1627690
We find that annual members rode more trips than casual riders overall (approximately 3.4 million vs 2.4 million rides).
# Compare number of rides for each bike type
df_final %>%
group_by(rideable_type,member_casual) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=member_casual, y=num_rides, fill=rideable_type)) +
geom_col(position="dodge") +
labs(title = "Ride count by bike type", x = "Membership", y = "Ride count (Thousands)")
Members prefer to use classic bikes to electric bikes, although not by a significant margin, whereas casual riders clearly prefer to use electric bikes. Docked bikes, however, appear to be exclusively used by casual riders.
# compare average ride duration on different days of the week
df_final %>%
group_by(member_casual,day_of_week) %>%
summarise(duration=mean(ride_length)/60) %>%
ggplot(aes(x=day_of_week, y=duration, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Average ride duration by day of the week", x = "Day of the week", y = "Ride duration (Minutes)")
# Find the mode of day of week
table(df_final$day_of_week, df_final$member_casual)
##
## casual member
## Mon 279762 473027
## Tue 275745 541484
## Wed 281640 538459
## Thu 306662 530510
## Fri 352465 491436
## Sat 499739 458189
## Sun 404977 393568
# compare number of rides on different days of the week
df_final %>%
group_by(member_casual,day_of_week) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=day_of_week, y=num_rides, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Ride count by day of the week", x = "Day of the week", y = "Ride count (Thousands)")
# Compare number of rides in different months
df_final %>%
group_by(member_casual,month=month(started_at,label=TRUE)) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=month, y=num_rides, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Ride count by month", x = "Month", y = "Ride count (Thousands)")
# Compare number of rides by time of day
df_final %>%
group_by(member_casual,Hour=hour(started_at)) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Ride count by time of day", x = "Time of day (Hour)", y = "Ride count (Thousands)")
It appears that ridership for members tends to peak at two different times of day, coinciding with morning and evening work commute hours. Non-members’ ride count increases steadily throughout the day until evening hours. In order to further investigate this trend, we will split the plot into weekdays vs weekends.
# Compare number of rides by time of day on weekdays
df_final %>%
filter(day_of_week != "Sat" & day_of_week != "Sun") %>%
group_by(member_casual,Hour=hour(started_at)) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Ride count by time of day on weekdays", x = "Time of day (Hour)", y = "Ride count (Thousands)")
# Compare number of rides by time of day on weekends
df_final %>%
filter(day_of_week == "Sat" | day_of_week == "Sun") %>%
group_by(member_casual,Hour=hour(started_at)) %>%
summarise(num_rides = n()/1000) %>%
ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
geom_col(position="dodge") +
labs(title = "Ride count by time of day on weekends", x = "Time of day (Hour)", y = "Ride count (Thousands)")
Next, we export the names of the top 10 busiest start stations and end stations for both members and casuals.
# Investigate the top 10 busiest start stations
df_start <- df_final %>%
filter(start_station_name != "") %>%
group_by(member_casual,start_station_name) %>%
summarise(num_rides=n()) %>%
arrange(member_casual,desc(num_rides)) %>%
top_n(10)
write.csv(df_start, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_start.csv')
# Investigate the top 10 busiest end stations
df_end <- df_final %>%
filter(end_station_name != "") %>%
group_by(member_casual,end_station_name) %>%
summarise(num_rides=n()) %>%
arrange(member_casual,desc(num_rides)) %>%
top_n(10)
write.csv(df_end, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_end.csv')
=========================================================================
Thank you for reading. Feedbacks are welcome and much
appreciated
=========================================================================