Bike Trip Presentation

Dataset Description: A CSV of bicycle rental transactions from the Bay Area Bike Share Data Challenge.

Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

bicycles <- read.csv('bike_trip_data.csv')
names(bicycles)

##  [1] "Trip.ID"           "Duration"          "Start.Date"       
##  [4] "Start.Station"     "Start.Terminal"    "End.Date"         
##  [7] "End.Station"       "End.Terminal"      "Bike.."           
## [10] "Subscription.Type" "Zip.Code"

summary(bicycles$Duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     349     531    1231     797  722200

avg_duration <- mean(bicycles$Duration)

The average total time in minutes was 1230.9101413 minutes.

Problem 2: What was the most popular day by trip frequency in this dataset?

bicycles["DayofWeek"] <- as.Date(bicycles$Start.Date, format = "%m/%d/%y %H:%M")
bicycles["DayofWeek"] <- weekdays(bicycles$Day)
table(bicycles["DayofWeek"])

## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     24283     23138     12413     10587     25265     24678     23651

#max(table(bicycles["DayofWeek"]))
popular_dayofweek <- sort((table(bicycles["DayofWeek"])),decreasing = TRUE)[1]
#Most popular day of the week
popweekday <- names(popular_dayofweek)

bicycles["Day"] <- as.Date(bicycles$Start.Date, format = "%m/%d/%y %H:%M")
popular_day <- sort((table(bicycles["Day"])),decreasing = TRUE)[1]
#Most popular date
popdate <- names(popular_day) 
#2013-09-25 was most popular day to start a rental

If you look at it by day of week, then Thursday was the most popular day of the week. If you look at it by the date, then the most popular day in which bike rentals were started was 2013-09-25

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.

Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring “Start.Date” and “End.Date” columns.

#needed for aggregate functions later on
bicycles["start_count"] = 1; bicycles["end_count"] = 1

#Aggregate the number of bikes leaving station by day
start_columns <- c("Day","Start.Station","start_count")
dfstart <- bicycles[start_columns]
#rename to station for later merge
names(dfstart)[2] <- "Station"
aggdata <- aggregate(start_count ~.,data = dfstart, sum)

#Aggregate the number of bikes arriving back at the station by day
end_columns <- c("Day","End.Station","end_count")
dfend <- bicycles[end_columns]
#rename to station for later merge
names(dfend)[2] <- "Station"
aggdata2 <- aggregate(end_count ~.,data = dfend, sum)

#Merge the two dataframes
total <- merge(aggdata,aggdata2,by=c("Day","Station"))
total$difference <- total$end_count - total$start_count
#bikes available at start of the day - (add 30)
total$bikes_net30 <- ave(total$difference, total$Station, FUN=function(x) cumsum(c(0, head(x, -1))))
#sort to view by station and date to see what bikes_net30 shows
total <- total[order(total$Station,total$Day),]
final <- total[total$bikes_net30 == -30,] 
#sort ascending by date to find the earliest date that there was a net -30 balance of bikes at any station
final <- final[order(final$Day),]
rebalance_day <- final$Day[1]
#2013-09-05

Bike Trip Presentation

Jas Sohi

Tuesday, January 13, 2015

Bike Trip Presentation

Jas Sohi

Tuesday, January 13, 2015

Dataset Description: A CSV of bicycle rental transactions from the Bay Area Bike Share Data Challenge.

Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

The average total time in minutes was 1230.9101413 minutes.

Problem 2: What was the most popular day by trip frequency in this dataset?

If you look at it by day of week, then Thursday was the most popular day of the week. If you look at it by the date, then the most popular day in which bike rentals were started was 2013-09-25

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.

Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring “Start.Date” and “End.Date” columns.

I chose the easier option this time (it is my first R project this year :) ). The day that the bikes will first need to be rebalanced is 2013-09-05