In this analysis I have gained some business insights and provided certain recommendation to the Bike Share Program Manager to increase the revenues and reduce overall costs. This Analysis is based on several attributes provided in the database. Based on each attribute analysis is conducted and appropriate recommendations are provided.
This analysis can alo be viewed using this Shiny tool. https://abhishekgaggar-dataanalysis.shinyapps.io/r_webpp/
This analysis is based on below attributes:
1. Number of Trips
2. Duration of Trips
3. Number of Bikes
4. Traffic
Lets start by installing the required libraries and importing the data set
library(caret)
library(shiny)
library(lubridate)
library(stringr)
library(plotly)
library(forecast)
library(TTR)
library(ggplot2)
library(dplyr)
library(gridExtra)
data <- read.csv("C:/Users/abhis/Desktop/LDA Test/Dataset/GDA DATA.csv")
options(header=FALSE, stringsAsFactors = FALSE)
Once data set is added lets add additional colums for our analysis. For time series analysis additional columns for Month, Start Date, Start Hour are added.
data$Start_Date <- str_split_fixed(data$Start.Date, " ", 2)[,1]
data$Start_Time <- str_split_fixed(data$Start.Date, " ", 2)[,2]
data$Start_Date <- as.Date(data$Start_Date, "%m/%d/%Y")
data$Start_Hour <- strftime(strptime(data$Start_Time, format="%H"),"%H")
data$Start_Hour <- as.numeric(data$Start_Hour)
data <- data %>%
mutate(Month = format(data$Start_Date, "%m"))
## Aggregate the data wrt to no of trips per hour and no of trips daily
data_startDate <- aggregate(Trip.ID ~ Start_Date, data = data, FUN = length)
## Total Number of bike trips per day
a <- ggplot(data = data_startDate, aes(Start_Date, Trip.ID)) +
geom_point() +
geom_smooth() +
ggtitle("Trips Each Day") +
ylab("Total Number of Bicycle Trips") +
xlab("Date")
plot(a)
This graph represents total number of bikes booked from March to August 2014.
This graph reflects that overall trend in the bike usage have been increased.
It provides an overall KPI for the Bike Share Program Manager.
Also we can see an intresting trend that some values are much lower than other values. There has to be certain trend which we need to figure out. Lets split our analysis by considering days of the week
## Total no if trips by day of the week
data_week <- as.data.frame(table(wday(data$Start_Date, label = TRUE)))
b <- ggplot(data = data_week, aes(x = Var1, Freq)) +
geom_col() +
ggtitle("Total Number of Trips Per Day") +
ylab("Total Number of Bicycle Trips") +
xlab("Day of the Week")
plot(b)
data_startDate <- mutate(data_startDate,
weekend = (wday(data_startDate$Start_Date) == 1 | wday(data_startDate$Start_Date) == 7))
data_startDate$weekend <- factor(data_startDate$weekend, labels = c("Weekday", "Weekend"))
c <- ggplot(data = data_startDate, aes(Start_Date, Trip.ID)) +
geom_point(aes(color = weekend), size = 3, alpha = 0.65) +
ggtitle("Total Number of Trips Per Day") +
ylab("Total Number of Bicycle Trips") +
xlab("Date")
plot(c)
This graph represents the overall trips of bike over days of the week. It can be inferred that number of trips are more on weekdays and less on weekends.
Bike Share Program Manager can refer this plot to understand the distribution of bike rentals over a week. This still gives an overall picture but can be useful to provide higher level insights.
Lets expand out analysis to find out the peak times during the day.
th <- mdy_hm(data$Start.Date)
tm <- hour(th) + minute(th)/60
data$time <- tm
remove(th,tm, ths)
d <- ggplot(data, aes(time)) +
geom_histogram(binwidth = 0.25) + #Every fifteen minutes = binwidth
geom_vline(xintercept = 8.75, color = 'orange')+
geom_vline(xintercept = 17, color = 'red', alpha = 0.7) +
annotate("text", x = 8, y = 27000, label = "8:45 AM", color = "orange",
size = 7) +
annotate("text", x = 17, y = 27000, label = "5:00 PM", color = "red",
size = 7) +
xlab("Time of day on 24 hour clock") +
ylab("Total number of bicycle trips")
plot(d)
This plot reflects the peak time during the day.
There are two peak time during the day. One in the morning around 8:45 AM and one in the evening around 5:00 PM. Bike renting during these times are very high as compared to other time.
Lets understand how Type of customers affect the rental of bikes.
ggplot(data = data, aes(Start_Date)) +
geom_bar(aes(color = Subscriber.Type), stat = "count") +
ggtitle("Customer Vs. Subscriber") +
ylab("Total Number of Bicycle Trips") +
xlab("Trend Across Time")
Bike Share Program Manager can observe the overall trend in bike rentals by customers as well as subscribers. Total number of trips fluctuates during the week because of number of trips in weekdays as well as weekends.
One important point to observe is that number of customers are very less as compared to subscribers. Lets study the effects of weekdays vs weekends on subscriber type
data <- mutate(data, weekend = (wday(data$Start_Date) == 1 | wday(data$Start_Date) == 7))
data$weekend <- factor(data$weekend, labels = c("Weekday", "Weekend"))
e <- ggplot(data = data, aes(Start_Date)) +
geom_bar(aes(color = Subscriber.Type), stat = "count",
position = "stack") +
ggtitle("Customer Vs. Subscriber on Weekends and Weekdays") +
ylab("Total Number of Bicycle Trips") +
xlab("Trend Across Time") +
facet_grid(~weekend) +
theme(axis.text.x = element_blank())
plot(e)
With the above plot Bike Share Program Manager can observe and infer below points.
It could be inferred that subscribers use bicycle on daily basis to commute to their office whereas customers are mainly tourists who like to visit the city on weekends. We still need to validate this claim which we will do in our next plots.
From the aove plot Bike Share Program Manager could target these tourists by offering them discounts and group deals to increase the number of trips during weekends and weekdays.
Lets extend our analysis and consider duration of each trip.
summary(data$Duration/60)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.750 8.617 18.880 12.680 11940.000
Program Manager can observe that the median duration of trip is 8.617 minutes but maximum it more than 11940 mins more than one week, so there are certain outliers in the dataset which we have to identify.
# Trips less than 1 hour
data <- mutate(data, hour_trip = (Duration <= 60*60))
# Trips less than one day
data <- mutate(data, day_trip = (Duration <= 24*60*60))
lengthdata <- nrow(data)
# Total no of trips less than 1 hour is:
(sum(data$hour_trip)/lengthdata)*100
## [1] 96.44396
# Total no of trips less than 1 day is:
(sum(data$day_trip)/lengthdata)*100
## [1] 99.95751
hour_trips <- subset(data, hour_trip == TRUE)
f <- ggplot(hour_trips, aes(Duration/60, group = Subscriber.Type, color = Subscriber.Type)) +
geom_histogram(binwidth = 0.5) +
xlab("Time in minutes") +
ylab("Total number of bicycle trips") +
geom_vline(xintercept = 20,color = "Red") +
annotate("text", x = 20, y = 6000, label = "20 Mins", color = "Red",
size = 7)
plot(f)
day_trips <- subset(data, day_trip == TRUE)
g <- ggplot(day_trips, aes(Duration/60/60, group = Subscriber.Type, color = Subscriber.Type)) +
geom_histogram(binwidth = 0.25) +
xlab("Time in Hours") +
ylab("Total number of bicycle trips") +
geom_vline(xintercept = 1,color = "Red") +
annotate("text", x = 3, y = 60000, label = "1 Hour", color = "Red",
size = 7)
plot(g)
data_custtype <- aggregate(Duration ~ Start_Hour + weekend + Subscriber.Type, data = hour_trips, FUN = mean)
h <- ggplot(data = data_custtype, aes(x = Start_Hour, y = Duration/60, group = weekend, color = weekend)) +
geom_line() +
geom_point() +
facet_grid(~Subscriber.Type) +
xlab("Time of day on 24 hour clock") +
ylab("Mean duration, mins.") +
ggtitle("Average Trip Duration")
plot(h)
Lets try to understand the effect of number of bikes on overall trips.
# Average Number of bike trips during weekdays and weekends, I am only considering trips shorter than one day
data_bike <- aggregate(Trip.ID ~ Bike.. + weekend + Month, data = day_trips, FUN = length )
data_bike_mean <- aggregate(data_bike[, 3:4], list(data_bike$Bike.., data_bike$weekend), mean)
i <- ggplot(data = data_bike_mean, aes(Group.1, Trip.ID)) +
geom_point()+
geom_line() +
ggtitle("Average No of Trips per Bike") +
ylab("Average Trips per bike per Month") +
xlab("Bike No") +
facet_grid(~Group.2) +
theme(axis.text.x = element_blank())
plot(i)
data_bike_duration <- aggregate(Duration ~ Bike.., data = hour_trips, FUN = mean)
j <- ggplot(data = data_bike_duration, aes(Bike.., Duration/60)) +
geom_point()+
geom_line() +
ggtitle("Average Duration of Bike in Mins") +
ylab("Average Duration per min") +
xlab("Bike No") +
theme(axis.text.x = element_blank())
plot(j)
Lets try to find Busiest Station.
# Lets try to identify busiest Origin station among different user type
length_hourtrip <- dim(hour_trips)[1]
data_user_station <- data.frame(table(hour_trips$Subscriber.Type,
hour_trips$Start.Station)/length_hourtrip)
data_user_station$Percent <- data_user_station$Freq*100
names(head(sort(table(hour_trips$Start.Station[hour_trips$Subscriber.Type == 'Customer']), decreasing = TRUE), 10))
## [1] "Embarcadero at Sansome"
## [2] "Harry Bridges Plaza (Ferry Building)"
## [3] "Market at 4th"
## [4] "Powell Street BART"
## [5] "2nd at Townsend"
## [6] "Embarcadero at Vallejo"
## [7] "Powell at Post (Union Square)"
## [8] "San Francisco Caltrain (Townsend at 4th)"
## [9] "Steuart at Market"
## [10] "Embarcadero at Bryant"
names(head(sort(table(hour_trips$Start.Station[hour_trips$Subscriber.Type == 'Subscriber']), decreasing = TRUE), 10))
## [1] "San Francisco Caltrain (Townsend at 4th)"
## [2] "San Francisco Caltrain 2 (330 Townsend)"
## [3] "Temporary Transbay Terminal (Howard at Beale)"
## [4] "Market at Sansome"
## [5] "2nd at Townsend"
## [6] "Harry Bridges Plaza (Ferry Building)"
## [7] "Steuart at Market"
## [8] "Townsend at 7th"
## [9] "Market at 10th"
## [10] "2nd at South Park"
data_tofrom_stations <- aggregate(Trip.ID ~ Start.Station + End.Station, data = hour_trips, FUN = length)
k <- ggplot(data_tofrom_stations, aes(Start.Station, End.Station)) +
geom_tile(aes(fill = Trip.ID), color = "white") +
scale_fill_gradient(low = "white", high = "Red") +
ylab("Destination") +
xlab("Origin") +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 12),
plot.title = element_text(size=16),
axis.title=element_text(size=14,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "No of Trips")
plot (k)
Exploratory Data Analysis have been done and intersting facts and findings are reported to the Bike Share Program Manager for Business Actions. Lets try to predict Number of bike trips and their duration.
data_reg <- aggregate(Duration ~ Start.Station + Subscriber.Type + weekend + Month, data = hour_trips, FUN = mean)
data_reg[,1:3] <- lapply(data_reg[,1:3], FUN = as.factor)
data_reg$Month <- as.numeric(data_reg$Month)
reg <- lm(Duration/60 ~ Start.Station + Subscriber.Type + weekend + Month, data = data_reg)
summary(reg)
m1 <- ggplot(data = data_reg, aes(reg$residuals)) +
geom_histogram(binwidth = 1) +
xlab("Residuals")
plot(m1)
# Call:
# lm(formula = Duration ~ Start.Station + Subscriber.Type + weekend +
# Month, data = data_reg)
# Residual standard error: 253.9 on 1514 degrees of freedom
# Multiple R-squared: 0.6186, Adjusted R-squared: 0.6004
# F-statistic: 34.1 on 72 and 1514 DF, p-value: < 2.2e-16
reg_wmnth <- lm(Duration/60 ~ Start.Station + Subscriber.Type + weekend, data = data_reg)
summary(reg_wmnth)
m <- ggplot(data = reg_wmnth, aes(reg_wmnth$residuals)) +
geom_histogram(binwidth = 1) +
xlab("Residuals")
plot (m)
# Call:
# lm(formula = Duration ~ Start.Station + Subscriber.Type + weekend,
# data = data_reg)
# Residual standard error: 254 on 1515 degrees of freedom
# Multiple R-squared: 0.618, Adjusted R-squared: 0.6001
# F-statistic: 34.51 on 71 and 1515 DF, p-value: < 2.2e-16
hour_trips$timeofday <- ifelse((hour_trips$time >= 0 & hour_trips$time < 6), "Night",
ifelse((hour_trips$time >= 6 & hour_trips$time < 12), "Morning",
ifelse((hour_trips$time >= 12 & hour_trips$time < 18), "Afternoon",
ifelse((hour_trips$time >= 18 & hour_trips$time < 24), "Evening", "NA"
)
)
)
)
data_reg2 <- aggregate(Trip.ID ~ Start.Station + Subscriber.Type + weekend +
timeofday, data = hour_trips, FUN = length)
data_reg2[,1:4] <- lapply(data_reg2[,1:4], FUN = as.factor)
reg2 <- lm(log(Trip.ID) ~ ., data = data_reg2)
summary(reg2)
n <- ggplot(data = reg2, aes(reg2$residuals)) +
geom_histogram(binwidth = 0.10) +
xlab("Residuals")
plot(n)
# Residual standard error: 0.8555 on 937 degrees of freedom
# Multiple R-squared: 0.8032, Adjusted R-squared: 0.7876
# F-statistic: 51.67 on 74 and 937 DF, p-value: < 2.2e-16
Detailed analysis of this data set have been conducted and below are some of the key recommendations to the stakeholders.