Introduction

In this analysis I have gained some business insights and provided certain recommendation to the Bike Share Program Manager to increase the revenues and reduce overall costs. This Analysis is based on several attributes provided in the database. Based on each attribute analysis is conducted and appropriate recommendations are provided.

This analysis can alo be viewed using this Shiny tool. https://abhishekgaggar-dataanalysis.shinyapps.io/r_webpp/

Analysis

This analysis is based on below attributes:
1. Number of Trips
2. Duration of Trips
3. Number of Bikes
4. Traffic

Lets start by installing the required libraries and importing the data set

library(caret)
library(shiny)
library(lubridate)
library(stringr)
library(plotly)
library(forecast)
library(TTR)
library(ggplot2)
library(dplyr)
library(gridExtra)

data <- read.csv("C:/Users/abhis/Desktop/LDA Test/Dataset/GDA DATA.csv")
options(header=FALSE, stringsAsFactors = FALSE)

Once data set is added lets add additional colums for our analysis. For time series analysis additional columns for Month, Start Date, Start Hour are added.

data$Start_Date <- str_split_fixed(data$Start.Date, " ", 2)[,1]
data$Start_Time <- str_split_fixed(data$Start.Date, " ", 2)[,2]
data$Start_Date <- as.Date(data$Start_Date, "%m/%d/%Y")
data$Start_Hour <- strftime(strptime(data$Start_Time, format="%H"),"%H")
data$Start_Hour <- as.numeric(data$Start_Hour)
data <- data %>%
  mutate(Month = format(data$Start_Date, "%m"))

Number of Trips

Lets find the Overall Trend over time

## Aggregate the data wrt to no of trips per hour and no of trips daily

data_startDate <- aggregate(Trip.ID ~ Start_Date, data = data, FUN = length)

## Total Number of bike trips per day

a <- ggplot(data = data_startDate, aes(Start_Date, Trip.ID)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Trips Each Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Date")

plot(a)

This graph represents total number of bikes booked from March to August 2014.

Observation

This graph reflects that overall trend in the bike usage have been increased.
It provides an overall KPI for the Bike Share Program Manager.
Also we can see an intresting trend that some values are much lower than other values. There has to be certain trend which we need to figure out. Lets split our analysis by considering days of the week

Number of bikes per day of week

## Total no if trips by day of the week

data_week <- as.data.frame(table(wday(data$Start_Date, label = TRUE)))

b <- ggplot(data = data_week, aes(x = Var1, Freq)) +
  geom_col() + 
  ggtitle("Total Number of Trips Per Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Day of the Week")
plot(b)

data_startDate <- mutate(data_startDate, 
                    weekend = (wday(data_startDate$Start_Date) == 1 | wday(data_startDate$Start_Date) == 7))  

data_startDate$weekend <- factor(data_startDate$weekend, labels = c("Weekday", "Weekend"))

c <- ggplot(data = data_startDate, aes(Start_Date, Trip.ID)) +
  geom_point(aes(color = weekend), size = 3, alpha = 0.65) +
  ggtitle("Total Number of Trips Per Day") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Date")

plot(c)

Observation

This graph represents the overall trips of bike over days of the week. It can be inferred that number of trips are more on weekdays and less on weekends.
Bike Share Program Manager can refer this plot to understand the distribution of bike rentals over a week. This still gives an overall picture but can be useful to provide higher level insights.

Lets expand out analysis to find out the peak times during the day.

Peak time per day

th <- mdy_hm(data$Start.Date) 
tm <- hour(th) + minute(th)/60
data$time <- tm
remove(th,tm, ths)


d <- ggplot(data, aes(time)) +
  geom_histogram(binwidth = 0.25) + #Every fifteen minutes = binwidth 
  geom_vline(xintercept = 8.75, color = 'orange')+
  geom_vline(xintercept = 17, color = 'red', alpha = 0.7) +
  annotate("text", x = 8, y = 27000, label = "8:45 AM", color = "orange",
           size = 7) +
  annotate("text", x = 17, y = 27000, label = "5:00 PM", color = "red", 
           size = 7) +
  xlab("Time of day on 24 hour clock") +
  ylab("Total number of bicycle trips")

plot(d)

This plot reflects the peak time during the day.

Observation

There are two peak time during the day. One in the morning around 8:45 AM and one in the evening around 5:00 PM. Bike renting during these times are very high as compared to other time.

Recommendation

  • Bike Share Program Manager needs to ensure that there are sufficient number of bikes present at these timings to cater the needs and to ensure increase in revenues.
  • There are offtimes during from 3:00 AM to 5:00 AM when there are no trips and could be used efficiently to transport bike to ensure proper supply of bikes and without affecting normal traffic.

Lets understand how Type of customers affect the rental of bikes.

Rentals affected by customer type

ggplot(data = data, aes(Start_Date)) +
  geom_bar(aes(color = Subscriber.Type), stat = "count") +
  ggtitle("Customer Vs. Subscriber") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Trend Across Time")

Observation

Bike Share Program Manager can observe the overall trend in bike rentals by customers as well as subscribers. Total number of trips fluctuates during the week because of number of trips in weekdays as well as weekends.

One important point to observe is that number of customers are very less as compared to subscribers. Lets study the effects of weekdays vs weekends on subscriber type

Rentals affected by customer type and weekends

data <- mutate(data, weekend = (wday(data$Start_Date) == 1 | wday(data$Start_Date) == 7))  
data$weekend <- factor(data$weekend, labels = c("Weekday", "Weekend"))


e <- ggplot(data = data, aes(Start_Date)) +
  geom_bar(aes(color = Subscriber.Type), stat = "count", 
           position = "stack") +
  ggtitle("Customer Vs. Subscriber on Weekends and Weekdays") +
  ylab("Total Number of Bicycle Trips") +
  xlab("Trend Across Time") +
  facet_grid(~weekend) +
  theme(axis.text.x = element_blank())
plot(e)

Observation

With the above plot Bike Share Program Manager can observe and infer below points.

  • Observation 1: Number of Subscribers on weekdays are way higher than number of customers.
  • Observation 2: On weekends this trend changes and both customer and subscribers are in same proportion.

It could be inferred that subscribers use bicycle on daily basis to commute to their office whereas customers are mainly tourists who like to visit the city on weekends. We still need to validate this claim which we will do in our next plots.

From the aove plot Bike Share Program Manager could target these tourists by offering them discounts and group deals to increase the number of trips during weekends and weekdays.

Lets extend our analysis and consider duration of each trip.

Duration of Trip

Lets find overall duration of trip

summary(data$Duration/60)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.000     5.750     8.617    18.880    12.680 11940.000

Program Manager can observe that the median duration of trip is 8.617 minutes but maximum it more than 11940 mins more than one week, so there are certain outliers in the dataset which we have to identify.

Identifying Outliers

# Trips less than 1 hour
data <- mutate(data, hour_trip = (Duration <= 60*60))

# Trips less than one day
data <- mutate(data, day_trip = (Duration <= 24*60*60))

lengthdata <- nrow(data)
# Total no of trips less than 1 hour is:
(sum(data$hour_trip)/lengthdata)*100
## [1] 96.44396
# Total no of trips less than 1 day is:
(sum(data$day_trip)/lengthdata)*100
## [1] 99.95751
hour_trips <- subset(data, hour_trip == TRUE)

f <- ggplot(hour_trips, aes(Duration/60, group = Subscriber.Type, color = Subscriber.Type)) +
  geom_histogram(binwidth = 0.5) + 
  xlab("Time in minutes") +
  ylab("Total number of bicycle trips") +
  geom_vline(xintercept = 20,color = "Red") + 
  annotate("text", x = 20, y = 6000, label = "20 Mins", color = "Red",
           size = 7)

plot(f)

day_trips <- subset(data, day_trip == TRUE)

g <- ggplot(day_trips, aes(Duration/60/60, group = Subscriber.Type, color = Subscriber.Type)) +
  geom_histogram(binwidth = 0.25) + 
  xlab("Time in Hours") +
  ylab("Total number of bicycle trips") + 
  geom_vline(xintercept = 1,color = "Red") + 
  annotate("text", x = 3, y = 60000, label = "1 Hour", color = "Red",
           size = 7)
plot(g)

Observations

  • 96.44% of trips are less than one Hour
  • Majority of the trips are less than 20 mins
  • 99.95% of the trips are less than one day so we safely neglect other trips as outliers
  • Average turn around time of the bike is 9 mins for all customer type
  • Customers have higher average trip duration which verifies the claim that customers are tourists who uses bikes for a longer duration of time

Recommendation to the Manager

  • Program Manager needs to ensure that sufficient number of bikes need to be present considering the above turn around time to cater peak time demands as mentioned above.
  • Program manager should target these customers to increase the overall average trip duration of customers and increase revenue.
  • Very less number of trips are over one day but Program Manager needs to identify what caused this behavior.
    • Dis his bike got stolen and hence he didnt returned
    • Did customer forgot to return the bike

Duration of Trips wrt to Customer Type

data_custtype <- aggregate(Duration ~ Start_Hour + weekend + Subscriber.Type, data = hour_trips, FUN = mean)

h <- ggplot(data = data_custtype, aes(x = Start_Hour, y = Duration/60, group = weekend, color = weekend)) +
  geom_line() +
  geom_point() +
  facet_grid(~Subscriber.Type) +
  xlab("Time of day on 24 hour clock") + 
  ylab("Mean duration, mins.") + 
  ggtitle("Average Trip Duration")
  
plot(h)

Observation:

  • Average duration of trips by customers are higher and almost double than that of subscribers.
  • It can be concluded from earlier findings that subscribers uses bikes to travel to offices whereas customers are tourists who visit city during weekends and have higher average bike times.
  • Both customers as well as subscribers have average trip duration higher during weekends.
  • Average trip time during day for subscribers is 8-9 mins and is almost same throughout the day.

Recommendation to the Manager

  • Bike share program manager needs to provide more offers on weekends trips to increase duration of trip time by
    • Partering up with local tourist operator for sight seeing
    • Offer discounts on city group tours or bulk bookings
  • Program Manager can target more subscribers by advertising average wait time for bike is 9 mins if there are no bikes available.

Lets try to understand the effect of number of bikes on overall trips.

Number of Bikes

Average Number of Trips per Bike

# Average Number of bike trips during weekdays and weekends, I am only considering trips shorter than one day 

data_bike <- aggregate(Trip.ID ~ Bike.. + weekend + Month, data = day_trips, FUN = length )

data_bike_mean <- aggregate(data_bike[, 3:4], list(data_bike$Bike.., data_bike$weekend), mean)

i <- ggplot(data = data_bike_mean, aes(Group.1, Trip.ID)) +
  geom_point()+
  geom_line() +
  ggtitle("Average No of Trips per Bike") +
  ylab("Average Trips per bike per Month") +
  xlab("Bike No") +
  facet_grid(~Group.2) +
  theme(axis.text.x = element_blank())
plot(i)

Observations

  • Average number of trips per bike has varies greatly during weekday and weekends which is in sync with that we observed earlier
  • On an average, there are 60-70 trips taken per month per bike during weekdays whereas its around 10 trips per bike per month during weekends.

Recommendations

  • Bike share program manager can derive what is the average number of trips per bike and can calculate the total profitability per bike.
  • Manager can calculate number of bikes needs to be increased or decreased to cater demand.
  • One interesting thing to note is that certain bikes have very less number of trips as compared to other bikes. This trend is uniform for weekends and weekdays. It could be due to:
    • New Bikes and hence average number of trips are less
    • Bikes are damaged so no one prefers to use these bikes
    • These bikes must be at certain stations where there is very less traffic
    • Ratio of utilization of these bikes is less on weekdays whereas its high on weekends, it may be possible that these are some specialized bikes with some extra features that tourist use and not the daily office goers.
  • Manager needs to know these details to efficiently target those customers

Average ride time per bike

data_bike_duration <- aggregate(Duration ~ Bike.., data = hour_trips, FUN = mean)

j <- ggplot(data = data_bike_duration, aes(Bike.., Duration/60)) +
  geom_point()+
  geom_line() +
  ggtitle("Average Duration of Bike in Mins") +
  ylab("Average Duration per min") +
  xlab("Bike No") +
  theme(axis.text.x = element_blank())

plot(j)

Observation

  • This graph represents the mean trip duration for all bikes. We have observed earlier that mean trip duration is 9 mins which can be concluded from this figure.

Lets try to find Busiest Station.

Traffic

Top origin destinations.

# Lets try to identify busiest Origin station among different user type

length_hourtrip <- dim(hour_trips)[1]

data_user_station <- data.frame(table(hour_trips$Subscriber.Type, 
                                      hour_trips$Start.Station)/length_hourtrip)

data_user_station$Percent <- data_user_station$Freq*100

names(head(sort(table(hour_trips$Start.Station[hour_trips$Subscriber.Type == 'Customer']), decreasing = TRUE), 10))
##  [1] "Embarcadero at Sansome"                  
##  [2] "Harry Bridges Plaza (Ferry Building)"    
##  [3] "Market at 4th"                           
##  [4] "Powell Street BART"                      
##  [5] "2nd at Townsend"                         
##  [6] "Embarcadero at Vallejo"                  
##  [7] "Powell at Post (Union Square)"           
##  [8] "San Francisco Caltrain (Townsend at 4th)"
##  [9] "Steuart at Market"                       
## [10] "Embarcadero at Bryant"
names(head(sort(table(hour_trips$Start.Station[hour_trips$Subscriber.Type == 'Subscriber']), decreasing = TRUE), 10))
##  [1] "San Francisco Caltrain (Townsend at 4th)"     
##  [2] "San Francisco Caltrain 2 (330 Townsend)"      
##  [3] "Temporary Transbay Terminal (Howard at Beale)"
##  [4] "Market at Sansome"                            
##  [5] "2nd at Townsend"                              
##  [6] "Harry Bridges Plaza (Ferry Building)"         
##  [7] "Steuart at Market"                            
##  [8] "Townsend at 7th"                              
##  [9] "Market at 10th"                               
## [10] "2nd at South Park"

Observation

  • Busiest station amongst customer and subscriber are differnet

Recommendation

  • Subscribers are monthly pass holders whereas customers are daily ticket takers. Also we know that number of custoemrs are significantly less as compared to the number of subscribers. To attract more customers to become subscribers and buy monthy pass we should offer them some kind of attractive deals and discounts.
  • Manager needs to target these 10 stations and can place big banners to attraact more custoemrs as well as subscribers.
  • These are the best places to advertise average turn around time of 9 mins and attract more subscribers to buy monthly passes.

Busiest Corridors

data_tofrom_stations <- aggregate(Trip.ID ~ Start.Station + End.Station, data = hour_trips, FUN = length)


k <- ggplot(data_tofrom_stations, aes(Start.Station, End.Station)) +
  geom_tile(aes(fill = Trip.ID), color = "white") +
  scale_fill_gradient(low = "white", high = "Red") +
  ylab("Destination") +
  xlab("Origin") +
  theme(legend.title = element_text(size = 10),
        legend.text = element_text(size = 12),
        plot.title = element_text(size=16),
        axis.title=element_text(size=14,face="bold"),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = "No of Trips")

plot (k)

Recommmendation

  • Bike share program manager needs to know and identify the busiest corridors to place number of bikes during peak time.
  • Also he can place advertisements in these locations to attract more customers.

Exploratory Data Analysis have been done and intersting facts and findings are reported to the Bike Share Program Manager for Business Actions. Lets try to predict Number of bike trips and their duration.

Prediction

Predicting Number of trips

data_reg <- aggregate(Duration ~ Start.Station + Subscriber.Type + weekend + Month, data = hour_trips, FUN = mean)
data_reg[,1:3] <- lapply(data_reg[,1:3], FUN = as.factor)
data_reg$Month <- as.numeric(data_reg$Month)
reg <- lm(Duration/60 ~ Start.Station + Subscriber.Type + weekend + Month, data = data_reg)
summary(reg)
m1 <- ggplot(data = data_reg, aes(reg$residuals)) + 
  geom_histogram(binwidth = 1) + 
  xlab("Residuals")
plot(m1)

# Call:
#   lm(formula = Duration ~ Start.Station + Subscriber.Type + weekend + 
#        Month, data = data_reg)

# Residual standard error: 253.9 on 1514 degrees of freedom
# Multiple R-squared:  0.6186,  Adjusted R-squared:  0.6004 
# F-statistic:  34.1 on 72 and 1514 DF,  p-value: < 2.2e-16

Analysis

  • Output have been supressed as its very large, only significant elements are included in the output
  • From the aabove linear regression we got Adjusted R square value of 0.6004 is this model expalins 60% of the overall variation
  • Month is non significant as its p- value is very high so lets discard month and re-run the regression
reg_wmnth <- lm(Duration/60 ~ Start.Station + Subscriber.Type + weekend, data = data_reg)
summary(reg_wmnth)

m <- ggplot(data = reg_wmnth, aes(reg_wmnth$residuals)) +
  geom_histogram(binwidth = 1) +
  xlab("Residuals")
plot (m)

# Call:
#   lm(formula = Duration ~ Start.Station + Subscriber.Type + weekend, 
#      data = data_reg)
# Residual standard error: 254 on 1515 degrees of freedom
# Multiple R-squared:  0.618,   Adjusted R-squared:  0.6001 
# F-statistic: 34.51 on 71 and 1515 DF,  p-value: < 2.2e-16

Analysis

  • From the above analysis we can observe that histogram of residues are almost normal with few outliers, we have already understood why there are outliers in the dataset and have ignored trips which are more than 1 day.
  • Duration of trip is dependend on Starting Station, customer type (Subscriber or customer) and weekends. Some stations are non significant in the analysis but some are very significant and leads to determination of trip duration time.
  • Mean Duration time will increase by (915 - 570) secs or 3.45 mins if there is an increase in one subscriber without any customer and with addition of one customer mean trip duration will increase by 915 secs ie 15 mins
  • Mean duration time will increase by (915 + 103) seconds or around 17 mins if bike is rented on weekends.
  • On any given weekend if the trip is made from university and emerson from a customer it will lead to longest trip duration ie (568.214 + 103.9 + 915.7 = 1587.8 secs) or roughly 26 mins
  • Whereas on any weekday if trip is made from Rengstorff Avenue / California Street from a subscriber than duration would be around (144.9 - 570.46 + 915.7 = 490.14 secs) or 8.16 mins

Predicting Number of Trips

hour_trips$timeofday <- ifelse((hour_trips$time >= 0 & hour_trips$time < 6), "Night",
                               ifelse((hour_trips$time >= 6 & hour_trips$time < 12), "Morning",
                                      ifelse((hour_trips$time >= 12 & hour_trips$time < 18), "Afternoon",
                                             ifelse((hour_trips$time >= 18 & hour_trips$time < 24), "Evening", "NA"
                                                    
                                             )  
                                      )
                               )
)

data_reg2 <- aggregate(Trip.ID ~ Start.Station + Subscriber.Type + weekend + 
                         timeofday, data = hour_trips, FUN = length)

data_reg2[,1:4] <- lapply(data_reg2[,1:4], FUN = as.factor)


reg2 <- lm(log(Trip.ID) ~ ., data = data_reg2)
summary(reg2)

n <- ggplot(data = reg2, aes(reg2$residuals)) +
  geom_histogram(binwidth = 0.10) +
  xlab("Residuals")

plot(n)

# Residual standard error: 0.8555 on 937 degrees of freedom
# Multiple R-squared:  0.8032,  Adjusted R-squared:  0.7876 
# F-statistic: 51.67 on 74 and 937 DF,  p-value: < 2.2e-16

Analysis

  • I have divided time into 4 parts - Morning, Afternoon, Evening and Night
  • Adjusted R square of the model is 0.78 which explains around 78% of the overall variations in data
  • I have considered log(Number of Trips) as the data is not linear and we need to take log of trips to fit the data better
  • This analysis is based on Starting Station, Type of Customer, Weekend and time of day
  • Type of customer, weekends, time of day are very significant and leads to determination of number of trips
  • One addition in subscriber will lead to increase in exp(5.16 + 1.12) which is roughly 534 trips where as one addition in customer will lead to increase in exp(5.16 + 0) which is roughly 174 number of trips. This result is in sync with the data that there are more number of trips by subscribers than by customers.
  • If a new subscriber is added he will make exp(1.12 - 0.596 + 5.16 - 3.78) 7 trips from Redwood City Public Library on weekdays.

Recommendation to stakeholders

Detailed analysis of this data set have been conducted and below are some of the key recommendations to the stakeholders.