Introduction
- Scenario
- Statement of business task
Prepare
Process
- Cleaning and wrangling
- Removing “bad” data
Descriptive Analysis
Share - Summary of Analysis
Act - Key Findings and Recommendations

Introduction

This case study was completed as the final capstone project for the Google Analytics Certificate. The purpose of the project is to answer key business questions for a fictional company, Cyclistic, by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

Scenario

Cyclistic is a bike-share company in Chicago, and the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the goal of the project is to understand how casual riders and annual members use Cyclistic bikes differently.

Statement of business task

Identify key differences in how annual members and causual riders use Cyclistic.

Prepare

About the data

Previous 12 months of historical trip data will be analyzed to identify trends. The required data is made available to public by Motivate International Inc. under this license. As of the date of this case study, the most current 12 months of data (October 2021 - September 2022) can be downloaded from here.

Original data is stored as .CSV files for each month. Details of each unique bike trip is recorded in 13 columns, namely:

[1] “ride_id” (unique trip id)
[2] “rideable_type” (type of bike)
[3,4] “started_at” “ended_at” (trip start time and end time)
[5,6,7,8] “start_station_name” “start_station_id” “end_station_name” “end_station_id” (start/end station info)
[9,10,11,12] “start_lat” “start_lng” “end_lat” “end_lng” (start/end latitude and longitude)
[13] “member_casual” (type of membership)

We will use RStudio for this analysis, and data will be stored locally.

Setting up R environment

Set up my R environment by installing necessary packages and libraries.

#install.packages('tidyverse')
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

library(lubridate)
library(hms)

Import data

Import all data files and merge them into one data frame.
Note: Initially, I have imported each data file as an individual data frame, and verified that they are all structured the same way and columns are named consistently, so that they can be stacked into a single data frame.

setwd("C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/Data")
df_all <-  list.files() %>%
  lapply(read_csv, show_col_types = FALSE) %>% 
  bind_rows

We verify that all data files imported correctly.

# Summary of the combined data frame
summary(df_all)

##    ride_id          rideable_type        started_at                    
##  Length:5828235     Length:5828235     Min.   :2021-10-01 00:00:09.00  
##  Class :character   Class :character   1st Qu.:2022-02-28 19:21:08.50  
##  Mode  :character   Mode  :character   Median :2022-06-08 06:41:28.00  
##                                        Mean   :2022-05-06 21:39:18.18  
##                                        3rd Qu.:2022-08-02 11:26:01.00  
##                                        Max.   :2022-09-30 23:59:56.00  
##                                                                        
##     ended_at                     start_station_name start_station_id  
##  Min.   :2021-10-01 00:03:11.0   Length:5828235     Length:5828235    
##  1st Qu.:2022-02-28 19:34:02.5   Class :character   Class :character  
##  Median :2022-06-08 06:55:07.0   Mode  :character   Mode  :character  
##  Mean   :2022-05-06 21:58:54.2                                        
##  3rd Qu.:2022-08-02 11:46:26.0                                        
##  Max.   :2022-10-05 19:53:11.0                                        
##                                                                       
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5828235     Length:5828235     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5828235    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :-87.30                     
##  NA's   :5844    NA's   :5844

Process

Cleaning and wrangling

Next, we check to see if there are duplicate entries and null values.

# Remove duplicates
df_all <- df_all %>% distinct(ride_id, .keep_all = TRUE)

# Columns containing null values
colnames(df_all)[colSums(is.na(df_all))>0]

## [1] "start_station_name" "start_station_id"   "end_station_name"  
## [4] "end_station_id"     "end_lat"            "end_lng"

There are some missing values in start station and end station information. Since it should not affect our analysis of trip duration, ridership type, time of day, etc, they will be included for such calculations. However, they should be removed for further analysis concerning starting and ending stations.

# Detecting outliers in start_lat and start_lng
summary(df_all$start_lat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41.64   41.88   41.90   41.90   41.93   45.64

summary(df_all$start_lng) # lat,long 45.64, -73.80 lies in Quebec, Canada

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -87.84  -87.66  -87.64  -87.65  -87.63  -73.80

View(filter(df_all,start_lat>44)) # ride_id=3327172413547F64 is an outlier and should be removed

Next, we create two new columns in our data frame, namely “ride_length” and “day_of_week”.

# "ride_length" is the duration of each trip, calculated by subtracting "started_at" from "ended_at"
# "day_of_week" gives us the day of the week when the ride took place
df_all <- df_all %>% 
  mutate(ride_length = as_hms(difftime(ended_at, started_at))) %>%
  mutate(day_of_week = wday(started_at, label=TRUE, week_start=1))

# Inspect the structure of the modified columns
str(df_all)

## tibble [5,828,235 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5828235] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:5828235] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5828235], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:5828235], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:5828235] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:5828235] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:5828235] NA NA NA NA ...
##  $ end_station_id    : chr [1:5828235] NA NA NA NA ...
##  $ start_lat         : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5828235] "member" "member" "member" "member" ...
##  $ ride_length       : 'hms' num [1:5828235] 00:03:08 00:01:37 00:07:47 00:01:15 ...
##   ..- attr(*, "units")= chr "secs"
##  $ day_of_week       : Ord.factor w/ 7 levels "Mon"<"Tue"<"Wed"<..: 5 4 6 6 3 4 4 3 4 3 ...

Removing “bad” data

We will remove the entries where trip duration is either negative or zero, since there is not a way to verify if those were simply input errors.
Multi-day trips, although realistically do not reflect actual ride duration, could signify when a renter chose to keep the bike for multiple days. It could also mean a difference in how the bike is used, for example, work commute vs. leisurely ride.

# Number of trips with zero or negative duration
nrow(filter(df_all, ride_length <= 0))  # 571 "bad" entries, not a significant percent of data set

## [1] 571

# Number of trips with over 24 hours duration (86400 seconds)
nrow(filter(df_all, ride_length >= 86400)) # 5397 entries, also not a significant percent of overall data

## [1] 5397

# Create new data frame after dropping unwanted data
df_final <- df_all %>%
  filter(ride_length > 0, ride_id != "3327172413547F64")
summary(df_final)

##    ride_id          rideable_type        started_at                    
##  Length:5827663     Length:5827663     Min.   :2021-10-01 00:00:09.00  
##  Class :character   Class :character   1st Qu.:2022-02-28 19:26:37.50  
##  Mode  :character   Mode  :character   Median :2022-06-08 06:43:13.00  
##                                        Mean   :2022-05-06 21:41:36.58  
##                                        3rd Qu.:2022-08-02 11:25:39.50  
##                                        Max.   :2022-09-30 23:59:56.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2021-10-01 00:03:11.00   Length:5827663     Length:5827663    
##  1st Qu.:2022-02-28 19:40:51.50   Class :character   Class :character  
##  Median :2022-06-08 06:57:04.00   Mode  :character   Mode  :character  
##  Mean   :2022-05-06 22:01:12.96                                        
##  3rd Qu.:2022-08-02 11:46:07.00                                        
##  Max.   :2022-10-05 19:53:11.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5827663     Length:5827663     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng       member_casual      ride_length      
##  Min.   :41.39   Min.   :-88.97   Length:5827663     Length:5827663   
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class1:hms       
##  Median :41.90   Median :-87.64   Mode  :character   Class2:difftime  
##  Mean   :41.90   Mean   :-87.65                      Mode  :numeric   
##  3rd Qu.:41.93   3rd Qu.:-87.63                                       
##  Max.   :42.37   Max.   :-87.30                                       
##  NA's   :5844    NA's   :5844                                         
##  day_of_week 
##  Mon:752789  
##  Tue:817229  
##  Wed:820099  
##  Thu:837172  
##  Fri:843901  
##  Sat:957928  
##  Sun:798545

Descriptive Analysis

In order to identify differences between casual riders and annual members, we will compare and find trends in a number of factors.

# Investigate bike types and membership columns
table(df_final$rideable_type, df_final$member_casual)

##                
##                  casual  member
##   classic_bike   941387 1798983
##   docked_bike    192464       0
##   electric_bike 1267139 1627690

We find that annual members rode more trips than casual riders overall (approximately 3.4 million vs 2.4 million rides).

# Compare number of rides for each bike type
df_final %>% 
    group_by(rideable_type,member_casual) %>%
    summarise(num_rides = n()/1000) %>%
    ggplot(aes(x=member_casual, y=num_rides, fill=rideable_type)) +
    geom_col(position="dodge") +
    labs(title = "Ride count by bike type", x = "Membership", y = "Ride count (Thousands)")

Members prefer to use classic bikes to electric bikes, although not by a significant margin, whereas casual riders clearly prefer to use electric bikes. Docked bikes, however, appear to be exclusively used by casual riders.

# compare average ride duration on different days of the week
df_final %>% 
  group_by(member_casual,day_of_week) %>%
  summarise(duration=mean(ride_length)/60) %>%
  ggplot(aes(x=day_of_week, y=duration, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Average ride duration by day of the week", x = "Day of the week", y = "Ride duration (Minutes)")

# Find the mode of day of week
table(df_final$day_of_week, df_final$member_casual)

##      
##       casual member
##   Mon 279762 473027
##   Tue 275745 541484
##   Wed 281640 538459
##   Thu 306662 530510
##   Fri 352465 491436
##   Sat 499739 458189
##   Sun 404977 393568

# compare number of rides on different days of the week
df_final %>% 
  group_by(member_casual,day_of_week) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=day_of_week, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by day of the week", x = "Day of the week", y = "Ride count (Thousands)")

# Compare number of rides in different months
df_final %>% 
  group_by(member_casual,month=month(started_at,label=TRUE)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=month, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +  
  labs(title = "Ride count by month", x = "Month", y = "Ride count (Thousands)")

# Compare number of rides by time of day
df_final %>% 
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day", x = "Time of day (Hour)", y = "Ride count (Thousands)")

It appears that ridership for members tends to peak at two different times of day, coinciding with morning and evening work commute hours. Non-members’ ride count increases steadily throughout the day until evening hours. In order to further investigate this trend, we will split the plot into weekdays vs weekends.

# Compare number of rides by time of day on weekdays
df_final %>% 
  filter(day_of_week != "Sat" & day_of_week != "Sun") %>%
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day on weekdays", x = "Time of day (Hour)", y = "Ride count (Thousands)")

# Compare number of rides by time of day on weekends
df_final %>% 
  filter(day_of_week == "Sat" | day_of_week == "Sun") %>%
  group_by(member_casual,Hour=hour(started_at)) %>%
  summarise(num_rides = n()/1000) %>%
  ggplot(aes(x=Hour, y=num_rides, fill=member_casual)) +
  geom_col(position="dodge") +
  labs(title = "Ride count by time of day on weekends", x = "Time of day (Hour)", y = "Ride count (Thousands)")

Next, we export the names of the top 10 busiest start stations and end stations for both members and casuals.

# Investigate the top 10 busiest start stations
df_start <- df_final %>% 
  filter(start_station_name != "") %>% 
  group_by(member_casual,start_station_name) %>% 
  summarise(num_rides=n()) %>% 
  arrange(member_casual,desc(num_rides)) %>%
  top_n(10)
write.csv(df_start, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_start.csv')

# Investigate the top 10 busiest end stations
df_end <- df_final %>% 
  filter(end_station_name != "") %>% 
  group_by(member_casual,end_station_name) %>% 
  summarise(num_rides=n()) %>% 
  arrange(member_casual,desc(num_rides)) %>%
  top_n(10)
write.csv(df_end, file = 'C:/Users/chitn/Data_Analytics/Coursera_Case_Study/bikeshare/top_end.csv')

Act - Key Findings and Recommendations

The key takeaway from our analysis is that annual members utilize the service mainly for commuting, meanwhile, non-members use it primarily for recreational purposes.
Categorically, casual riders can be segmented into two groups, and we recommend that the marketing campaign adopt a tailored approach to each group:
- group A: workers and students who are likely to commute
- group B: weekend-warriors and out-of-towners who only ride for fun and recreation
Specifically, those in group A should be offered special incentives for converting, e.g. discounted membership pricing after having used the service for longer than 3 months, or if riding more than 3 times per week. This approach aligns with our observation that such users are already familiar with our service, and may already be riding for commute occasionally.
Group B should be attracted by different kinds of promotions. For example, tiered loyalty programs can be introduced for weekends or summer only. We recommend creating advertisements highlighting the cost-effectiveness of membership for those who take longer rides.
Finally, further analysis is also recommended, especially focusing on the busiest station locations where casual riders are most active. Offering member exclusive perks in those areas, such as free coffee from local cafes, is one of the ways we can attempt to increase membership conversion.

=========================================================================
Thank you for reading. Feedbacks are welcome and much appreciated
=========================================================================

Case Study: Bike-share Data Analysis in R

Nick Linn

2022-12-27

Introduction

Scenario

Statement of business task

Prepare

About the data

Setting up R environment

Import data

Process

Cleaning and wrangling

Removing “bad” data

Descriptive Analysis

Act - Key Findings and Recommendations

Case Study: Bike-share Data Analysis in R

Nick Linn

2022-12-27

Introduction

Scenario

Statement of business task

Prepare

About the data

Setting up R environment

Import data

Process

Cleaning and wrangling

Removing “bad” data

Descriptive Analysis

Share - Summary of Analysis

Act - Key Findings and Recommendations