Introduction

This case study was completed as the final capstone project for the Google Analytics Certificate. The purpose of the project is to answer key business questions for a fictional company, Cyclistic, by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

Scenario

Cyclistic is a bike-share company in Chicago, and the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the goal of the project is to understand how casual riders and annual members use Cyclistic bikes differently.

Statement of business task

Identify key differences in how annual members and causual riders use Cyclistic.

Prepare

About the data

Previous 12 months of historical trip data will be analyzed to identify trends. The required data is made available to public by Motivate International Inc. under this license. As of the date of this case study, the most current 12 months of data (October 2021 - September 2022) can be downloaded from here.

Original data is stored as .CSV files for each month. Details of each unique bike trip is recorded in 13 columns, namely:

[1] “ride_id” (unique trip id)
[2] “rideable_type” (type of bike)
[3,4] “started_at” “ended_at” (trip start time and end time)
[5,6,7,8] “start_station_name” “start_station_id” “end_station_name” “end_station_id” (start/end station info)
[9,10,11,12] “start_lat” “start_lng” “end_lat” “end_lng” (start/end latitude and longitude)
[13] “member_casual” (type of membership)

We will use RStudio for this analysis, and data will be stored locally.

Setting up R environment

Set up my R environment by installing necessary packages and libraries.

#install.packages('tidyverse')
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
library(lubridate)
library(hms)

Import data

Import all data files and merge them into one data frame.
Note: Initially, I have imported each data file as an individual data frame, and verified that they are all structured the same way and columns are named consistently, so that they can be stacked into a single data frame.

setwd("C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/Data")
df_all <-  list.files() %>%
  lapply(read_csv, show_col_types = FALSE) %>% 
  bind_rows

We verify that all data files imported correctly.

# Summary of the combined data frame
summary(df_all)
##    ride_id          rideable_type        started_at                    
##  Length:5828235     Length:5828235     Min.   :2021-10-01 00:00:09.00  
##  Class :character   Class :character   1st Qu.:2022-02-28 19:21:08.50  
##  Mode  :character   Mode  :character   Median :2022-06-08 06:41:28.00  
##                                        Mean   :2022-05-06 21:39:18.18  
##                                        3rd Qu.:2022-08-02 11:26:01.00  
##                                        Max.   :2022-09-30 23:59:56.00  
##                                                                        
##     ended_at                     start_station_name start_station_id  
##  Min.   :2021-10-01 00:03:11.0   Length:5828235     Length:5828235    
##  1st Qu.:2022-02-28 19:34:02.5   Class :character   Class :character  
##  Median :2022-06-08 06:55:07.0   Mode  :character   Mode  :character  
##  Mean   :2022-05-06 21:58:54.2                                        
##  3rd Qu.:2022-08-02 11:46:26.0                                        
##  Max.   :2022-10-05 19:53:11.0                                        
##                                                                       
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5828235     Length:5828235     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5828235    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :-87.30                     
##  NA's   :5844    NA's   :5844

Process

Cleaning and wrangling

Next, we check to see if there are duplicate entries and null values.

# Remove duplicates
df_all <- df_all %>% distinct(ride_id, .keep_all = TRUE)

# Columns containing null values
colnames(df_all)[colSums(is.na(df_all))>0]
## [1] "start_station_name" "start_station_id"   "end_station_name"  
## [4] "end_station_id"     "end_lat"            "end_lng"

There are some missing values in start station and end station information. Since it should not affect our analysis of trip duration, ridership type, time of day, etc, they will be included for such calculations. However, they should be removed for further analysis concerning starting and ending stations.

# Detecting outliers in start_lat and start_lng
summary(df_all$start_lat) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41.64   41.88   41.90   41.90   41.93   45.64
summary(df_all$start_lng) # lat,long 45.64, -73.80 lies in Quebec, Canada
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -87.84  -87.66  -87.64  -87.65  -87.63  -73.80
View(filter(df_all,start_lat>44)) # ride_id=3327172413547F64 is an outlier and should be removed

Next, we create two new columns in our data frame, namely “ride_length” and “day_of_week”.

# "ride_length" is the duration of each trip, calculated by subtracting "started_at" from "ended_at"
# "day_of_week" gives us the day of the week when the ride took place
df_all <- df_all %>% 
  mutate(ride_length = as_hms(difftime(ended_at, started_at))) %>%
  mutate(day_of_week = wday(started_at, label=TRUE, week_start=1)) 
# Inspect the structure of the modified columns
str(df_all)
## tibble [5,828,235 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5828235] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:5828235] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5828235], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:5828235], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:5828235] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:5828235] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:5828235] NA NA NA NA ...
##  $ end_station_id    : chr [1:5828235] NA NA NA NA ...
##  $ start_lat         : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5828235] "member" "member" "member" "member" ...
##  $ ride_length       : 'hms' num [1:5828235] 00:03:08 00:01:37 00:07:47 00:01:15 ...
##   ..- attr(*, "units")= chr "secs"
##  $ day_of_week       : Ord.factor w/ 7 levels "Mon"<"Tue"<"Wed"<..: 5 4 6 6 3 4 4 3 4 3 ...

Removing “bad” data

We will remove the entries where trip duration is either negative or zero, since there is not a way to verify if those were simply input errors.
Multi-day trips, although realistically do not reflect actual ride duration, could signify when a renter chose to keep the bike for multiple days. It could also mean a difference in how the bike is used, for example, work commute vs. leisurely ride.

# Number of trips with zero or negative duration
nrow(filter(df_all, ride_length <= 0))  # 571 "bad" entries, not a significant percent of data set
## [1] 571
# Number of trips with over 24 hours duration (86400 seconds)
nrow(filter(df_all, ride_length >= 86400)) # 5397 entries, also not a significant percent of overall data
## [1] 5397
# Create new data frame after dropping unwanted data
df_final <- df_all %>%
  filter(ride_length > 0, ride_id != "3327172413547F64")
summary(df_final)
##    ride_id          rideable_type        started_at                    
##  Length:5827663     Length:5827663     Min.   :2021-10-01 00:00:09.00  
##  Class :character   Class :character   1st Qu.:2022-02-28 19:26:37.50  
##  Mode  :character   Mode  :character   Median :2022-06-08 06:43:13.00  
##                                        Mean   :2022-05-06 21:41:36.58  
##                                        3rd Qu.:2022-08-02 11:25:39.50  
##                                        Max.   :2022-09-30 23:59:56.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2021-10-01 00:03:11.00   Length:5827663     Length:5827663    
##  1st Qu.:2022-02-28 19:40:51.50   Class :character   Class :character  
##  Median :2022-06-08 06:57:04.00   Mode  :character   Mode  :character  
##  Mean   :2022-05-06 22:01:12.96                                        
##  3rd Qu.:2022-08-02 11:46:07.00                                        
##  Max.   :2022-10-05 19:53:11.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5827663     Length:5827663     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng       member_casual      ride_length      
##  Min.   :41.39   Min.   :-88.97   Length:5827663     Length:5827663   
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class1:hms       
##  Median :41.90   Median :-87.64   Mode  :character   Class2:difftime  
##  Mean   :41.90   Mean   :-87.65                      Mode  :numeric   
##  3rd Qu.:41.93   3rd Qu.:-87.63                                       
##  Max.   :42.37   Max.   :-87.30                                       
##  NA's   :5844    NA's   :5844                                         
##  day_of_week 
##  Mon:752789  
##  Tue:817229  
##  Wed:820099  
##  Thu:837172  
##  Fri:843901  
##  Sat:957928  
##  Sun:798545

Descriptive Analysis

In order to identify differences between casual riders and annual members, we will compare and find trends in a number of factors.

# Investigate bike types and membership columns
table(df_final$rideable_type, df_final$member_casual)
##                
##                  casual  member
##   classic_bike   941387 1798983
##   docked_bike    192464       0
##   electric_bike 1267139 1627690

We find that annual members rode more trips than casual riders overall (approximately 3.4 million vs 2.4 million rides).

# Compare number of rides for each bike type
df_final %>% 
    group_by(rideable_type,member_casual) %>%
    summarise(num_rides = n()/1000) %>%
    ggplot(aes(x=member_casual, y=num_rides, fill=rideable_type)) +
    geom_col(position="dodge") +
    labs(title = "Ride count by bike type", x = "Membership", y = "Ride count (Thousands)")

Members prefer to use classic bikes to electric bikes, although not by a significant margin, whereas casual riders clearly prefer to use electric bikes. Docked bikes, however, appear to be exclusively used by casual riders.

# compare average ride duration on different days of the week
df_final %>% 
  group_by(member_casual,day_of_week) %>%
  summarise(duration=mean(ride_length)/60) %>%
  ggplot(aes(x=day_of_week, y=duration, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Average ride duration by day of the week", x = "Day of the week", y = "Ride duration (Minutes)")

# Find the mode of day of week
table(df_final$day_of_week, df_final$member_casual)
##      
##       casual member
##   Mon 279762 473027
##   Tue 275745 541484
##   Wed 281640 538459
##   Thu 306662 530510
##   Fri 352465 491436
##   Sat 499739 458189
##   Sun 404977 393568
# compare number of rides on different days of the week
df_final %>% 
  group_by(member_casual,day_of_week) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=day_of_week, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by day of the week", x = "Day of the week", y = "Ride count (Thousands)")

# Compare number of rides in different months
df_final %>% 
  group_by(member_casual,month=month(started_at,label=TRUE)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=month, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +  
  labs(title = "Ride count by month", x = "Month", y = "Ride count (Thousands)")

# Compare number of rides by time of day
df_final %>% 
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day", x = "Time of day (Hour)", y = "Ride count (Thousands)")

It appears that ridership for members tends to peak at two different times of day, coinciding with morning and evening work commute hours. Non-members’ ride count increases steadily throughout the day until evening hours. In order to further investigate this trend, we will split the plot into weekdays vs weekends.

# Compare number of rides by time of day on weekdays
df_final %>% 
  filter(day_of_week != "Sat" & day_of_week != "Sun") %>%
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day on weekdays", x = "Time of day (Hour)", y = "Ride count (Thousands)")

# Compare number of rides by time of day on weekends
df_final %>% 
  filter(day_of_week == "Sat" | day_of_week == "Sun") %>%
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day on weekends", x = "Time of day (Hour)", y = "Ride count (Thousands)")

Next, we export the names of the top 10 busiest start stations and end stations for both members and casuals.

# Investigate the top 10 busiest start stations
df_start <- df_final %>% 
  filter(start_station_name != "") %>% 
  group_by(member_casual,start_station_name) %>% 
  summarise(num_rides=n()) %>% 
  arrange(member_casual,desc(num_rides)) %>%
  top_n(10)
write.csv(df_start, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_start.csv')
# Investigate the top 10 busiest end stations
df_end <- df_final %>% 
  filter(end_station_name != "") %>% 
  group_by(member_casual,end_station_name) %>% 
  summarise(num_rides=n()) %>% 
  arrange(member_casual,desc(num_rides)) %>%
  top_n(10)
write.csv(df_end, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_end.csv')

Share - Summary of Analysis

  1. Annual members show no preference between bike types. Casual users prefer electric to traditional bicycles.
  2. Members took approximately 40% more trips than casuals (3.4M vs 2.4M).
  3. On average, members’ ride duration is consistent around 12-14 minutes. Casuals tend to ride longer, from 25 minutes on weekdays to over 30 minutes on weekends.
  4. For members, more rides happen on weekdays than weekends. The opposite trend is observed for casuals, number of rides go up towards the end of the week, peaking on Saturdays.
  5. Generally, colder months attract less users of both types. We do see a difference in that members’ usage remain fairly high for more months, whereas for the casuals, usage fall of quickly after summer months.
  6. Tracking trip counts by time of day show an interesting contrast between members and casuals. For the members, more rides take place during morning and evening commute hours on weekdays. For non-members, most rides take place in the afternoon hours 3-5pm on both weekdays and weekends. This is also consistent with the earlier observation that usage for non-member fall sharply after summer with reduced daylight.
  7. We will output a summary file (.csv) of the top 10 busiest start and end stations for both members and casuals for further analysis.

Act - Key Findings and Recommendations

=========================================================================
Thank you for reading. Feedbacks are welcome and much appreciated
=========================================================================