Dataset Description: A CSV of bicycle rental transactions from the Bay Area Bike Share Data Challenge.


Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

bicycles <- read.csv('bike_trip_data.csv')
names(bicycles)
##  [1] "Trip.ID"           "Duration"          "Start.Date"       
##  [4] "Start.Station"     "Start.Terminal"    "End.Date"         
##  [7] "End.Station"       "End.Terminal"      "Bike.."           
## [10] "Subscription.Type" "Zip.Code"
summary(bicycles$Duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     349     531    1231     797  722200
avg_duration <- mean(bicycles$Duration)
The average total time in minutes was 1230.9101413 minutes.

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.

Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring “Start.Date” and “End.Date” columns.

#needed for aggregate functions later on
bicycles["start_count"] = 1; bicycles["end_count"] = 1

#Aggregate the number of bikes leaving station by day
start_columns <- c("Day","Start.Station","start_count")
dfstart <- bicycles[start_columns]
#rename to station for later merge
names(dfstart)[2] <- "Station"
aggdata <- aggregate(start_count ~.,data = dfstart, sum)

#Aggregate the number of bikes arriving back at the station by day
end_columns <- c("Day","End.Station","end_count")
dfend <- bicycles[end_columns]
#rename to station for later merge
names(dfend)[2] <- "Station"
aggdata2 <- aggregate(end_count ~.,data = dfend, sum)

#Merge the two dataframes
total <- merge(aggdata,aggdata2,by=c("Day","Station"))
total$difference <- total$end_count - total$start_count
#bikes available at start of the day - (add 30)
total$bikes_net30 <- ave(total$difference, total$Station, FUN=function(x) cumsum(c(0, head(x, -1))))
#sort to view by station and date to see what bikes_net30 shows
total <- total[order(total$Station,total$Day),]
final <- total[total$bikes_net30 == -30,] 
#sort ascending by date to find the earliest date that there was a net -30 balance of bikes at any station
final <- final[order(final$Day),]
rebalance_day <- final$Day[1]
#2013-09-05
I chose the easier option this time (it is my first R project this year :) ). The day that the bikes will first need to be rebalanced is 2013-09-05