Cyclistic is a bike-share program that features more than 5,800 bicycles and 700 docking stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic’s marketing strategy has relied on building general awareness and appealing to broad customer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders; customers who purchase annual memberships are Cyclistic members.
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, as data analysts, our job is to find and analyze patterns and trends in Cyclistic’s historical bike trip data to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, we can create a new marketing strategy to convert casual riders into annual members.
Identify patterns and trends in Cyclistic’s historical bike trip data to understand how casual riders and annual members use Cyclistic bikes differently, so we can create a new marketing strategy to convert casual riders into annual members.
The dataset I use was acquired from Divvy Tripdata. The data has been made available to the public by Motivate International Inc. under this license. For this capstone project, I use data from October 2021 to September 2022 (12 months). I use R to combine and clean the dataset, which contains more than 5 million rows, far more than a spreadsheet can handle.
First, we need to load the libraries.
#Load the libraries
library(tidyverse)
library(lubridate)
library(summarytools)
library(data.table)
library(hms)
#Set the working directory
setwd('D:/DatSci & Analyst/Google Data Analytics/CAPSTONE Project')
Then, we combine the Cyclistic bike-share trip data.
#Combine the Cyclistic Bike Dataset from October 2021 to September 2022
filenames <- list.files(path='D:/DatSci & Analyst/Google Data Analytics/CAPSTONE Project/Dataset', full.names=TRUE)
#Read all csv files from filenames
cyclist<-rbindlist(lapply(filenames,fread))
Before we clean the data, we need summary statistics for all variables in the data frame. We use the dfSummary function from the summarytools library and set plain.ascii to FALSE for better printing. The summary table is useful for checking data types, validity, and missing data.
print(dfSummary(cyclist, plain.ascii = FALSE, graph.magnif = 0.75), method = 'render')
| No | Variable | Type | Distinct Values | Valid | Missing |
|----|----------|------|-----------------|-------|---------|
| 1 | ride_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 2 | rideable_type | character | | 5828235 (100.0%) | 0 (0.0%) |
| 3 | started_at | POSIXct, POSIXt | 4875181 | 5828235 (100.0%) | 0 (0.0%) |
| 4 | ended_at | POSIXct, POSIXt | 4884765 | 5828235 (100.0%) | 0 (0.0%) |
| 5 | start_station_name | character | | 5828235 (100.0%) | 0 (0.0%) |
| 6 | start_station_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 7 | end_station_name | character | | 5828235 (100.0%) | 0 (0.0%) |
| 8 | end_station_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 9 | start_lat | numeric | 601238 | 5828235 (100.0%) | 0 (0.0%) |
| 10 | start_lng | numeric | 569536 | 5828235 (100.0%) | 0 (0.0%) |
| 11 | end_lat | numeric | 217893 | 5822391 (99.9%) | 5844 (0.1%) |
| 12 | end_lng | numeric | 207375 | 5822391 (99.9%) | 5844 (0.1%) |
| 13 | member_casual | character | | 5828235 (100.0%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1), 2022-10-19
We can see from the dfSummary table that there are 5844 NA values in end_lat and end_lng. We have to validate them first.
sum(is.na(cyclist))
## [1] 11688
colSums(is.na(cyclist))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 5844 5844
## member_casual
## 0
It is true that there are 5844 NA values in each of end_lat and end_lng. After validating the data, we remove the NA values and check for NA values again.
#Remove the rows with NA values in the end_lat and end_lng columns
cyclist <- na.omit(cyclist)
#Check the NA values again
sum(is.na(cyclist))
## [1] 0
We have successfully removed the NA values.
After removing the NA values, we remove duplicate rows from the dataset. First, we check the dataset’s total number of rows; then we remove the duplicates and check the total again.
#See total rows
dim(cyclist)
## [1] 5822391 13
#Remove duplicates (assign the result back so the removal persists)
cyclist <- cyclist %>% distinct()
#See total rows after removing duplicates
dim(cyclist)
## [1] 5822391 13
We see that no rows were removed, which means there is no duplicate data in the dataset.
We create date variables from the started_at and ended_at columns. We also create time-of-day columns.
#Create date variables
cyclist <- cyclist %>% mutate(start_year = year(started_at),
start_month = month(started_at),
start_day = weekdays(started_at),
start_hour = hour(started_at),
start_time_of_day = case_when(start_hour>= 5 & start_hour <=12 ~ "Morning",
start_hour>=13 & start_hour <=17 ~ "Afternoon",
start_hour>=18 & start_hour <=22 ~ "Evening",
start_hour>= 0 & start_hour <=4 | start_hour ==23 ~ "Night"),
end_year = year(ended_at),
end_month = month(ended_at),
end_day = weekdays(ended_at),
end_hour = hour(ended_at),
end_time_of_day = case_when(end_hour>= 5 & end_hour <=12 ~ "Morning",
end_hour>=13 & end_hour <=17 ~ "Afternoon",
end_hour>=18 & end_hour <=22 ~ "Evening",
end_hour>= 0 & end_hour <=4 | end_hour ==23 ~ "Night")
)
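As a quick, optional spot-check (an addition, not part of the original code), we can preview a few of the new columns to confirm the mapping looks right:
#Preview the new date variables
cyclist %>% select(started_at, start_day, start_hour, start_time_of_day) %>% head()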
We create another variable to show how long a person uses a Cyclistic bike.
#Create ride length
cyclist <- cyclist %>% mutate(ride_length_mins = difftime(ended_at, started_at, units="mins"))
Next, we check whether there are negative time values in the data.
#Check the negative time values count
cyclist %>% filter(ride_length_mins<0) %>% count()
## n
## 1: 108
We found 108 negative time values, so we subset them and inspect their start and end times.
negative_time <- cyclist %>% filter(ride_length_mins<0) %>% select(started_at, ended_at)
head(negative_time, 10)
## started_at ended_at
## 1: 2021-11-07 01:40:02 2021-11-07 01:05:46
## 2: 2021-11-07 01:52:53 2021-11-07 01:05:22
## 3: 2021-11-07 01:40:13 2021-11-07 01:00:29
## 4: 2021-11-07 01:34:03 2021-11-07 01:17:13
## 5: 2021-11-07 01:54:25 2021-11-07 01:03:44
## 6: 2021-11-07 01:54:04 2021-11-07 01:25:57
## 7: 2021-11-07 01:51:52 2021-11-07 01:22:53
## 8: 2021-11-07 01:54:12 2021-11-07 01:05:09
## 9: 2021-11-07 01:54:36 2021-11-07 01:03:11
## 10: 2021-11-07 01:51:21 2021-11-07 01:07:59
Negative time values occur when a ride’s end time is earlier than its start time (ended_at < started_at), which suggests these rows were input incorrectly (the start and end times are swapped). So we create a new calculation for those rows: difftime(started_at, ended_at).
#Recompute ride length, swapping the difference where the timestamps are reversed
cyclist <- cyclist %>% mutate(ride_length_mins = ifelse(ended_at > started_at,
                                                        difftime(ended_at, started_at, units='mins'),
                                                        difftime(started_at, ended_at, units='mins')))
Then, we check the count of negative time values again.
cyclist %>% filter(ride_length_mins<0) %>% count()
## n
## 1: 0
All the negative time values have been successfully converted to positive time values.
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population; in a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal (NIST/SEMATECH e-Handbook). We will use the interquartile range (IQR) method to remove the outliers. First, we need to create the upper and lower bounds.
#Removing outliers in the ride_length_mins column with the IQR method
Q1 <- quantile(cyclist$ride_length_mins, .25)
Q3 <- quantile(cyclist$ride_length_mins, .75)
iqr <- IQR(cyclist$ride_length_mins)  #named iqr to avoid masking stats::IQR
upper_bound <- Q3 + 1.5*iqr
lower_bound <- Q1 - 1.5*iqr
Then, we remove the outliers. Any values that fall outside the lower and upper bounds are considered outliers.
cyclist_clean <- cyclist %>% subset(ride_length_mins>lower_bound & ride_length_mins<upper_bound)
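As a sanity check (an addition, not in the original code), we can count how many rows the IQR filter removed:
#Count the rows dropped as outliers
nrow(cyclist) - nrow(cyclist_clean)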
Now that the dataset is clean, we are ready to visualize and analyze the data. The clean Cyclistic data will be exported as a CSV file and visualized in Tableau.
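A minimal export sketch using fwrite from data.table (the file name is a placeholder):
#Export the cleaned dataset for Tableau (file name is a placeholder)
fwrite(cyclist_clean, 'cyclist_clean.csv')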
First, we create a data frame containing the most-used start station coordinates, then data frames containing the most-used start station names for members and casual riders. These data frames will be exported as CSV files, and we will manually input the coordinates for the top 5 start station names. The dfSummary table showed blank station names, so we also need to exclude those.
#Create coordinate data start station
start_station_coord<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_lat, start_lng, start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count))
#Create top 5 start station for member
start_station_name_member<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="member")
top_5_start_station_name_member<-start_station_name_member %>% head(5)
#Create top 5 start station for casual
start_station_name_casual<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="casual")
top_5_start_station_name_casual<-start_station_name_casual %>% head(5)
We apply the same method to the end stations.
#Create coordinate data end station
end_station_coord<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_lat, end_lng, end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count))
#Create top 5 end station for member
end_station_name_member<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="member")
top_5_end_station_name_member<-end_station_name_member %>% head(5)
#Create top 5 end station for casual
end_station_name_casual<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="casual")
top_5_end_station_name_casual<-end_station_name_casual %>% head(5)
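The station data frames above are the ones exported as CSV files for Tableau; a minimal sketch using fwrite (the file names are placeholders):
#Export the station summaries for Tableau (file names are placeholders)
fwrite(start_station_coord, 'start_station_coord.csv')
fwrite(top_5_start_station_name_member, 'top_5_start_station_member.csv')
fwrite(top_5_start_station_name_casual, 'top_5_start_station_casual.csv')
fwrite(end_station_coord, 'end_station_coord.csv')
fwrite(top_5_end_station_name_member, 'top_5_end_station_member.csv')
fwrite(top_5_end_station_name_casual, 'top_5_end_station_casual.csv')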
Tableau was used to visualize the data, and now it’s time to analyze! We have to identify patterns and trends in the data visualizations and draw insights from them. The insights will help us solve the business task.
Let us take a look at the dashboards that have already been created in Tableau.
The main dashboard contains the daily and monthly charts of the number of rides and average ride length for casual and annual members. We also have a pie chart comparing annual members vs. casual riders and rideable type. The next dashboard, Maps, contains the top 5 start and end station names, including Google Street View. Last, the Final Report dashboard contains the business task, key takeaways, and business strategies to convert casual riders to annual members.
- Cyclistic has more annual members than casual riders (61.45% vs. 38.55%).
- Casual riders, on average, use Cyclistic bikes longer than annual members.
- Casual riders are more active on weekends than annual members, which suggests casual riders are more likely to ride for leisure.
- Annual members tend to use Cyclistic bikes on weekdays, which suggests they are more likely to ride to commute to work.
- Annual members tend to use Cyclistic bikes in the morning and afternoon, which indicates they are more likely to commute from home to the office (morning) and from the office back home (afternoon).
- The busiest (top 5) stations are in close proximity, suggesting that casual riders and annual members usually ride around the same area.
- Casual riders mainly use Cyclistic bikes from June to August.
- Annual members mainly use Cyclistic bikes from May to October.
- Bike-share usage by both casual riders and annual members drops significantly around January and February.
- Docked bikes are rarely used, and never by annual members.
- Casual riders prefer electric bikes, while annual members use electric and classic bikes almost evenly.
Here are some recommendations that can be implemented to convert casual riders to annual members:
- Offer regular membership discounts to casual riders, especially from June to August.
- Run promotions during the weekends to reach more casual riders.
- Provide and promote additional perks of a Cyclistic membership, such as member-only events and prizes.
- Run promotions at the top 5 stations that casual riders use most often.
- Create informative promotions and banners showing casual riders how cost-effective it is to use Cyclistic bikes as an annual member for commuting to work on weekdays.
- Increase the bike rental prices for casual riders during the weekends, especially for electric bikes.
-Muhammad Hafidz Roihan, 2022.