Dataset Description: A CSV of bicycle rental transactions from the Bay Area Bike Share Data Challenge.
Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
bicycles <- read.csv('bike_trip_data.csv')
names(bicycles)
## [1] "Trip.ID" "Duration" "Start.Date"
## [4] "Start.Station" "Start.Terminal" "End.Date"
## [7] "End.Station" "End.Terminal" "Bike.."
## [10] "Subscription.Type" "Zip.Code"
summary(bicycles$Duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 349 531 1231 797 722200
avg_duration <- mean(bicycles$Duration)
The average total time in minutes was 1230.9101413 minutes.
Problem 2: What was the most popular day by trip frequency in this dataset?
bicycles["DayofWeek"] <- as.Date(bicycles$Start.Date, format = "%m/%d/%y %H:%M")
bicycles["DayofWeek"] <- weekdays(bicycles$Day)
table(bicycles["DayofWeek"])
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 24283 23138 12413 10587 25265 24678 23651
#max(table(bicycles["DayofWeek"]))
popular_dayofweek <- sort((table(bicycles["DayofWeek"])),decreasing = TRUE)[1]
#Most popular day of the week
popweekday <- names(popular_dayofweek)
bicycles["Day"] <- as.Date(bicycles$Start.Date, format = "%m/%d/%y %H:%M")
popular_day <- sort((table(bicycles["Day"])),decreasing = TRUE)[1]
#Most popular date
popdate <- names(popular_day)
#2013-09-25 was most popular day to start a rental
If you look at it by day of week, then Thursday was the most popular day of the week. If you look at it by the date, then the most popular day in which bike rentals were started was 2013-09-25
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring “Start.Date” and “End.Date” columns.
#needed for aggregate functions later on
bicycles["start_count"] = 1; bicycles["end_count"] = 1
#Aggregate the number of bikes leaving station by day
start_columns <- c("Day","Start.Station","start_count")
dfstart <- bicycles[start_columns]
#rename to station for later merge
names(dfstart)[2] <- "Station"
aggdata <- aggregate(start_count ~.,data = dfstart, sum)
#Aggregate the number of bikes arriving back at the station by day
end_columns <- c("Day","End.Station","end_count")
dfend <- bicycles[end_columns]
#rename to station for later merge
names(dfend)[2] <- "Station"
aggdata2 <- aggregate(end_count ~.,data = dfend, sum)
#Merge the two dataframes
total <- merge(aggdata,aggdata2,by=c("Day","Station"))
total$difference <- total$end_count - total$start_count
#bikes available at start of the day - (add 30)
total$bikes_net30 <- ave(total$difference, total$Station, FUN=function(x) cumsum(c(0, head(x, -1))))
#sort to view by station and date to see what bikes_net30 shows
total <- total[order(total$Station,total$Day),]
final <- total[total$bikes_net30 == -30,]
#sort ascending by date to find the earliest date that there was a net -30 balance of bikes at any station
final <- final[order(final$Day),]
rebalance_day <- final$Day[1]
#2013-09-05
I chose the easier option this time (it is my first R project this year :) ). The day that the bikes will first need to be rebalanced is 2013-09-05