This case study is a capstone project done in pursuance of the Google Data Analytics Certificate. The dataset was provided by the Cyclistic Marketing Analytics team. Cyclistic is a bike-share program that features over 5,800 bicycles and 600 docking station. Cyclistics offers various options ranging from reclining bikes to hand tricycles and cargo bikes to make bike-sharing more inclusive for people with disabilities and riders who can’t use a standard two-wheeled bike.
The business task is to create a marketing strategy aimed at getting more casual riders who pay for a full-day or single-ride pass to buy the membership subscription (a plan that gives annual access to the Cyclistics bike-share services). To do this the marketing analytics team need to provide a detailed analysis and report their findings to show how casual riders and members use cyclistic rides differently and why casual riders would buy the annual membership plan. As part of this task I will be analyzing and visualizing insights from a 12-month time period of the historical bike trip data to show how members and casual riders differ and the trends within that time.
For this project, I used the R Studio for the analysis process. I have chosen the R Studio because it affords me the chance to collect, clean, manipulate, visualize and create report in one environment. With the R Markdown document, I can also include code chunks that were used in the analysis process.
To carry out the various data analysis process in R Studio, I will need to load some packages that will enable me load, clean, manipulate and visualize the data. Some of these packages include
Jan_rides <- read.csv('Jan-2021.csv')
Feb_rides <- read.csv('Feb-2021.csv')
Mar_rides <-read.csv('Mar-2021.csv')
Apr_rides<-read.csv('Apr-2021.csv')
May_rides<-read.csv('May-2021.csv')
Jun_rides<-read.csv('Jun-2021.csv')
Jul_rides<- read.csv('Jul-2021.csv')
Aug_rides<- read.csv('Aug-2021.csv')
Sep_rides<-read.csv('Sep-2021.csv')
Oct_rides<-read.csv('Oct-2021.csv')
Nov_rides<-read.csv('Nov-2021.csv')
Dec_rides<-read.csv('Dec-2021.csv')
Bike_rides<-rbind(Jan_rides,Feb_rides,Mar_rides,Apr_rides,May_rides,Jun_rides,Jul_rides,Aug_rides,Sep_rides,Oct_rides,Nov_rides,Dec_rides)
head(Bike_rides)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
*The head() function returns a result showing the first 6 rows of the dataset and metadata such as the column names and the datatype.
Now that I have successfully loaded the dataset, I will go through it to inspect for anomalies. Some things I need to do at this stage of the analysis process inlclude
To check for duplicate columns I will use the distinct() function from the dplyr package. To this, I will run the code below after installing the package then run the distinct() function.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The syntax distinct(Bike_rides) returned the same number of rows, therefore there were no duplicate data.
I have left out the query results due to the magnitude of the data
To rename the columns, I will use the rename() function. I will also use the assignment operator (<-) to create a new data frame.
Bike_rides_2021<-rename(Bike_rides, 'Bike_type'= 'rideable_type', 'Subscription'= 'member_casual', 'Start_time'= 'started_at', 'End_time'= 'ended_at')
I will use to head() function to check the new data frame
head(Bike_rides_2021)
## ride_id Bike_type Start_time End_time
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng Subscription
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
For this analysis process, I will not need all the columns so I will delete certain columns, some of this columns include; * started_at * ended_at * start_station_id * end_station_id * start_lat * start_lng * end_lat * end_lng To delete columns, I used the syntax below
Bike_rides_2021 <-Bike_rides_2021 %>% select(-start_station_id,-end_station_id,-start_lat,-start_lng,-end_lat,-end_lng)
‘%>%’ in the syntax above is called a ‘pipe operator’.
I will run the head(Bike_rides_2021) syntax again to confirm that the columns have been deleted
head(Bike_rides_2021)
## ride_id Bike_type Start_time End_time
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name end_station_name Subscription
## 1 California Ave & Cortez St member
## 2 California Ave & Cortez St member
## 3 California Ave & Cortez St member
## 4 California Ave & Cortez St member
## 5 California Ave & Cortez St casual
## 6 California Ave & Cortez St casual
For us to easily manipulate the data and carry out calculations on the date and time column (‘Start_time’ & ‘End_time’), we will need to change the data type from ‘CHAR’ to ‘DATETIME’. To do this we will need to use the mutate() function in the lubridate package to convert both columns to POSIXct. I have added the code chunk used below.
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.2.1
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Bike_rides_2021 <- Bike_rides_2021 %>% mutate(Start_time=ymd_hms(Start_time))%>%
relocate(Start_time) %>%
mutate(End_time = ymd_hms(End_time)) %>%
relocate(End_time) %>%
relocate(ride_id)
head(Bike_rides_2021)
## ride_id End_time Start_time Bike_type
## 1 E19E6F1B8D4C42ED 2021-01-23 16:24:44 2021-01-23 16:14:19 electric_bike
## 2 DC88F20C2C55F27F 2021-01-27 18:47:12 2021-01-27 18:43:08 electric_bike
## 3 EC45C94683FE3F27 2021-01-21 22:37:14 2021-01-21 22:35:54 electric_bike
## 4 4FA453A75AE377DB 2021-01-07 13:42:55 2021-01-07 13:31:13 electric_bike
## 5 BE5E8EB4E7263A0B 2021-01-23 02:24:45 2021-01-23 02:24:02 electric_bike
## 6 5D8969F88C773979 2021-01-09 15:17:54 2021-01-09 14:24:07 electric_bike
## start_station_name end_station_name Subscription
## 1 California Ave & Cortez St member
## 2 California Ave & Cortez St member
## 3 California Ave & Cortez St member
## 4 California Ave & Cortez St member
## 5 California Ave & Cortez St casual
## 6 California Ave & Cortez St casual
We have changed the datatype for the ‘Start_time’ and ‘End_time’ columns, now we can carry out calculations the columns to analyze travel time.
We will also create a new column for Month and Day, this will help us monitor trends that took place with the days of the week and the months. To do this we will use the mutate() function again.
Bike_rides_2021 <- Bike_rides_2021 %>% mutate(Months = month(Start_time, label = TRUE))
Bike_rides_2021 <- Bike_rides_2021 %>% mutate(Days_of_the_week = weekdays(Start_time))
Now we have modified the dataframe, we will examine it with the head()function
head(Bike_rides_2021)
## ride_id End_time Start_time Bike_type
## 1 E19E6F1B8D4C42ED 2021-01-23 16:24:44 2021-01-23 16:14:19 electric_bike
## 2 DC88F20C2C55F27F 2021-01-27 18:47:12 2021-01-27 18:43:08 electric_bike
## 3 EC45C94683FE3F27 2021-01-21 22:37:14 2021-01-21 22:35:54 electric_bike
## 4 4FA453A75AE377DB 2021-01-07 13:42:55 2021-01-07 13:31:13 electric_bike
## 5 BE5E8EB4E7263A0B 2021-01-23 02:24:45 2021-01-23 02:24:02 electric_bike
## 6 5D8969F88C773979 2021-01-09 15:17:54 2021-01-09 14:24:07 electric_bike
## start_station_name end_station_name Subscription Months
## 1 California Ave & Cortez St member Jan
## 2 California Ave & Cortez St member Jan
## 3 California Ave & Cortez St member Jan
## 4 California Ave & Cortez St member Jan
## 5 California Ave & Cortez St casual Jan
## 6 California Ave & Cortez St casual Jan
## Days_of_the_week
## 1 Saturday
## 2 Wednesday
## 3 Thursday
## 4 Thursday
## 5 Saturday
## 6 Saturday
Looking closely at the query results above, we can see that there are two new columns ‘Months’ and ‘Days_of_the_week’. Now that we have these new columns and we have changed the datatype of the ‘Start_time’ and ‘End_time’ columns, we can now find trends like
Now that we have cleaned and organized our data, we will now do some data aggregation. Some data aggregation we will be carrying out in this analysis include;
Some function we will be using for the aggregation include
Bike_rides_2021 %>%
summarise(count = n())
## count
## 1 5595063
From the query results we see that Cyclistic recorded a total of 5,595,063 trip from January to December 2021.
Monthly_Trips <- Bike_rides_2021 %>%
count(Months)%>%
arrange(Months)
Bike_rides_2021 %>%
count(Months)%>%
arrange(Months)
## Months n
## 1 Jan 96834
## 2 Feb 49622
## 3 Mar 228496
## 4 Apr 337230
## 5 May 531633
## 6 Jun 729595
## 7 Jul 822410
## 8 Aug 804352
## 9 Sep 756147
## 10 Oct 631226
## 11 Nov 359978
## 12 Dec 247540
Bike_rides_2021 %>%
group_by(Bike_type) %>%
summarise(count = n())%>%
arrange(desc(count))
## # A tibble: 3 × 2
## Bike_type count
## <chr> <int>
## 1 classic_bike 3251028
## 2 electric_bike 2031692
## 3 docked_bike 312343
This result shows us that majority of Cyclistic customers used the Classic Bikes for their rides in 2021. About 3 million users used that option. One reason could be that they are more familiar with it and therefore more comfortable using that option.
Bike_rides_2021 %>%
group_by(Subscription) %>%
summarise(count = n())%>%
arrange(desc(count))
## # A tibble: 2 × 2
## Subscription count
## <chr> <int>
## 1 member 3066058
## 2 casual 2529005
From the result we can tell that majority of the customers that used the Cyclistic Bike share services in 2021 are customers who have subscribed to the membership plan. Let’s take a look at the total number of trips the different type of subscribers monthly. To get this information from the dataset we will run the syntax below
Monthly_Rides_Subscription<-Bike_rides_2021 %>%
group_by(Subscription) %>%
count(Months)%>%
arrange(Months)
This query result shows us the breakdown of rides taken by ‘Member’ and ‘Casual’ subscribers monthly in the year. A visualization will help us make more sense of this information.
Bike_rides_2021 %>%
group_by (Subscription) %>%
count (Days_of_the_week)%>%
arrange (Days_of_the_week)
## # A tibble: 14 × 3
## # Groups: Subscription [2]
## Subscription Days_of_the_week n
## <chr> <chr> <int>
## 1 casual Friday 364080
## 2 member Friday 446428
## 3 casual Monday 286376
## 4 member Monday 416212
## 5 casual Saturday 558000
## 6 member Saturday 433047
## 7 casual Sunday 481143
## 8 member Sunday 376142
## 9 casual Thursday 286064
## 10 member Thursday 451524
## 11 casual Tuesday 274392
## 12 member Tuesday 465513
## 13 casual Wednesday 278950
## 14 member Wednesday 477192
In the visualization section of this document we will see the trends in these result clearly.
Now that we have seen done some aggregation on the data to see the Total trips in within various time frames, we will go on to calculate the averages
Bike_rides_2021 %>% group_by(Subscription) %>%
summarise(Travel_time = mean(minute(seconds_to_period(End_time - Start_time))))
## # A tibble: 2 × 2
## Subscription Travel_time
## <chr> <dbl>
## 1 casual 18.0
## 2 member 12.0
We can see that casual riders had an average travel time of 18 minutes. This could mean that they rode longer or traveled farther distances
Bike_rides_2021 %>% group_by(Bike_type) %>%
summarise(Travel_time = mean(minute(seconds_to_period(End_time - Start_time))))%>%
arrange(Travel_time)
## # A tibble: 3 × 2
## Bike_type Travel_time
## <chr> <dbl>
## 1 electric_bike 13.4
## 2 classic_bike 14.6
## 3 docked_bike 24.4
The query results show that electric bikes have the shortest travel time, which means that it could be the fastest of the 3 options in the Cyclistic Bike-Sharing System. This could be as a result of the fact that electric bikes have an electric motor that aids propulsion.
We want to tell what the travel time of the different subscribers was in the various month
Bike_rides_2021 %>% group_by(Months,Subscription) %>%
summarise(Avg_travel_time=mean(minute(seconds_to_period(End_time - Start_time))))
## `summarise()` has grouped output by 'Months'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups: Months [12]
## Months Subscription Avg_travel_time
## <ord> <chr> <dbl>
## 1 Jan casual 15.2
## 2 Jan member 11.1
## 3 Feb casual 18.2
## 4 Feb member 12.7
## 5 Mar casual 19.7
## 6 Mar member 12.5
## 7 Apr casual 19.3
## 8 Apr member 12.9
## 9 May casual 19.7
## 10 May member 12.9
## # … with 14 more rows
Avg_travel_time_sub <- Bike_rides_2021 %>% group_by(Months,Subscription) %>%
summarise(Avg_travel_time=mean(minute(seconds_to_period(End_time - Start_time))))
## `summarise()` has grouped output by 'Months'. You can override using the
## `.groups` argument.
We have done some cleaning and data aggregation, we now need to come up with visualization of some insights give the rest of the team a better understanding of the data and help them make decision from the findings. We will be plotting our visualizations using the ggplot() function from the ggplot2 package.
Note that packages have been pre-loaded
Some visualizations we will be creating include;
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.1
options(scipen = 999)
Monthly_Trips %>%
ggplot(aes(x=Months, y = n,fill = n))+
geom_col(position = 'dodge')+
labs(x= 'Month', y = 'Total trips', title = 'Total trips per month', fill = 'Trip Count')+
scale_y_continuous(breaks = c(100000,200000,300000,400000,500000,600000,700000,800000))+
scale_x_discrete(limits = c(Monthly_Trips$Months))
The visualization shows that the highest number of rides were recorded during the months of June to September.
ggplot(data = Bike_rides_2021)+
geom_bar(mapping = aes(x=Subscription, fill= Subscription))+
labs(x= 'Subscription Type', y = 'Total trips', title = 'Total trips by Subscription Type', fill = 'Subscription Type')+
scale_y_continuous(breaks = c(500000,1000000,1500000,2000000,2500000,3000000,3500000))
Monthly_Rides_Subscription %>%
ggplot(aes(x=Months, y = n, fill = Subscription))+
geom_col(position = 'dodge')+
labs(x= 'Months', y= 'Total trips', fill = 'Subscription Type', title = 'Total Monthly trips by Subscription Type')+
scale_y_continuous(breaks = c(50000,100000,150000,200000,250000,300000,350000,400000,450000))+
scale_x_discrete(limits =c(Monthly_Rides_Subscription$Months))
The graph above shows us the breakdown of the total trips taken each month according to the subscription type.
Bike_rides_2021 %>%
ggplot(aes(x=Bike_type,fill= Bike_type))+
geom_bar(position = 'dodge')+
labs(x= 'Bike Type', y = 'Total trips', title = 'Total trips by bike types')+
scale_y_continuous(breaks = c (500000,1000000,1500000,2000000,2500000,3000000))
Bike_rides_2021 %>%
ggplot(aes(x=Bike_type,fill= Subscription))+
geom_bar(position = 'dodge')+
labs(x= 'Bike Type', y = 'Total trips', title = 'Total trips by bike types and subscribers')+
scale_y_continuous(breaks = c (500000,1000000,1500000,2000000,2500000,3000000))
The graph shows us that majority of Cyclistic customers opted for the classic bike with about 3 million riders using the option and about 2 million of these riders subscribed were subscribed as members. All riders that opted for the docked bike options were casual subscribers.
Now that we have visualized the total trips and compared trends based on different categories. We look at the average values to examine trends.
Avg_travel_time <-Bike_rides_2021 %>% group_by(Subscription,Days_of_the_week) %>%
summarise(Avg_travel_time=mean(minute(seconds_to_period(End_time - Start_time))))
## `summarise()` has grouped output by 'Subscription'. You can override using the
## `.groups` argument.
Avg_travel_time %>%
ggplot(aes(x = Days_of_the_week, y = Avg_travel_time, color= Subscription, shape = Subscription)) + geom_point(position = "dodge") +
labs(x= 'Day of Week', y='Average Travel Time', title='Daily Average Travel time by Subcription type', fill = 'Subscription type') +
scale_y_continuous(breaks = c(5, 10, 15, 20), labels = c("5mins", "10mins", "15mins", "20mins")) +
scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
## Warning: Width not defined. Set with `position_dodge(width = ?)`
The result of the visualization above shows us that casual riders had an average travel time of about 20 mins per day and member riders had an average travel time of about 15 minutes per day.
We will also take a look at the daily demand per hour to notice trends and observe peak periods.
To do this we will need to determine the daily demand. We will need to run some syntax using the mutate() function and some other functions in the dplyr package.
First we will create an hour column from the ‘Start_time’ column.
Bike_rides_2021<-Bike_rides_2021%>%
mutate(hours = hour(Start_time))
head(Bike_rides_2021)
## ride_id End_time Start_time Bike_type
## 1 E19E6F1B8D4C42ED 2021-01-23 16:24:44 2021-01-23 16:14:19 electric_bike
## 2 DC88F20C2C55F27F 2021-01-27 18:47:12 2021-01-27 18:43:08 electric_bike
## 3 EC45C94683FE3F27 2021-01-21 22:37:14 2021-01-21 22:35:54 electric_bike
## 4 4FA453A75AE377DB 2021-01-07 13:42:55 2021-01-07 13:31:13 electric_bike
## 5 BE5E8EB4E7263A0B 2021-01-23 02:24:45 2021-01-23 02:24:02 electric_bike
## 6 5D8969F88C773979 2021-01-09 15:17:54 2021-01-09 14:24:07 electric_bike
## start_station_name end_station_name Subscription Months
## 1 California Ave & Cortez St member Jan
## 2 California Ave & Cortez St member Jan
## 3 California Ave & Cortez St member Jan
## 4 California Ave & Cortez St member Jan
## 5 California Ave & Cortez St casual Jan
## 6 California Ave & Cortez St casual Jan
## Days_of_the_week hours
## 1 Saturday 16
## 2 Wednesday 18
## 3 Thursday 22
## 4 Thursday 13
## 5 Saturday 2
## 6 Saturday 14
Now that we have created the hour column, we will find the daily demand.
Daily_demands<- Bike_rides_2021%>%
group_by(Subscription, hours)%>%
dplyr::summarise(Start_time)%>%
as.data.frame()
## `summarise()` has grouped output by 'Subscription', 'hours'. You can override
## using the `.groups` argument.
Next we will group the daily demand by the hour
Daily_demands_hour <- Daily_demands%>%
group_by(hours)%>%
count(Subscription)
To help identify the trends easily, we ill plot the results on a line graph.
options(scipen = 999)
The syntax above helps to convert the scale of axes from scientific numbers to standard notation
Daily_demands_hour%>%
ggplot(aes(x = hours, y = n , color = Subscription, group = Subscription))+
geom_line(position = 'dodge')+
labs(x = 'Hours', y = 'Number of trips', title = 'Daily demand per hour')+
scale_y_continuous(breaks = c(50000, 100000, 150000, 200000, 250000, 300000, 350000))+
scale_x_discrete(limits = c(Daily_demands_hour$hours))
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
## Warning: Width not defined. Set with `position_dodge(width = ?)`
We have identified different trends from this analysis process of the Cyclistic Bike-share Data. From these trends around the travel time and the number of trips we can come to the following conclusions
Most users prefer the classic bike type, the member subscribers (users who subscribed to the annual payment plan) never used the docked bike.
Based on the the data, July and August recorded the highest number of trips, this is the peak of summer and is a possible reason for the spike at that period.
The lowest number of trips are recorded within the months of December and March, these months are the winter months and customers won’t be able to do a lot of riding due to the temperature and weather conditions.
We also see that casual users more time on bike trips; especially on Saturdays and Sundays. This could mean that casual users ride bikes for leisure or exercise.
The data also shows a spike in demand especially by the member subscribers around the hours of 7am - 8am and 4pm - 6pm daily; these periods are the resumption and closing times respectively. This could mean that member subscribers use the Cyclistic Bike-Share as an alternative means of transportation to work to avoid traffic congestion and other delays.
From the analysis and conclusions, here are some suggestions to help the Cyclistics Marketing Analytics team design a new marketing strategy to convert casual subscribers to members.
Since casual users ride picks up on weekends, the marketing team can include a weekend only annual subscription plan that includes certain incentives not available for casual subscribers at a lower price.
Modifications to subscription plans that see users paying relatively less as they ride longer.
Rewards and bonuses for users who subscribe to the annual plan; some of these rewards could be vouchers or subscription discounts.
Marketing campaigns and promos that will give discount to first time subscribers should be launched between June to August as the number of trips and demand starts to build up at that time. This will entice casual users and make them opt for the annual plan.