The dataset would be download from this [1]. Suppose the file was downloaded in the machine. Let’s have a brief look about the data.
#read the file
df = read.csv("bike_trip_data.csv", stringsAsFactor=FALSE)
#checking
dim(df) #dimension of the data
## [1] 144015 11
head(df)
## Trip.ID Duration Start.Date Start.Station Start.Terminal
## 1 4576 63 8/29/13 14:13 South Van Ness at Market 66
## 2 4607 70 8/29/13 14:42 San Jose City Hall 10
## 3 4130 71 8/29/13 10:16 Mountain View City Hall 27
## 4 4251 77 8/29/13 11:29 San Jose City Hall 10
## 5 4299 83 8/29/13 12:02 South Van Ness at Market 66
## 6 4927 103 8/29/13 18:54 Golden Gate at Polk 59
## End.Date End.Station End.Terminal Bike..
## 1 8/29/13 14:14 South Van Ness at Market 66 520
## 2 8/29/13 14:43 San Jose City Hall 10 661
## 3 8/29/13 10:17 Mountain View City Hall 27 48
## 4 8/29/13 11:30 San Jose City Hall 10 26
## 5 8/29/13 12:04 Market at 10th 67 319
## 6 8/29/13 18:56 Golden Gate at Polk 59 527
## Subscription.Type Zip.Code
## 1 Subscriber 94127
## 2 Subscriber 95138
## 3 Subscriber 97214
## 4 Subscriber 95060
## 5 Subscriber 94103
## 6 Subscriber 94109
summary(df)
## Trip.ID Duration Start.Date Start.Station
## Min. : 4069 Min. : 60 Length:144015 Length:144015
## 1st Qu.: 52138 1st Qu.: 349 Class :character Class :character
## Median : 99969 Median : 531 Mode :character Mode :character
## Mean :100968 Mean : 1231
## 3rd Qu.:150090 3rd Qu.: 797
## Max. :198775 Max. :722236
## Start.Terminal End.Date End.Station End.Terminal
## Min. : 2.0 Length:144015 Length:144015 Min. : 2.0
## 1st Qu.:50.0 Class :character Class :character 1st Qu.:50.0
## Median :61.0 Mode :character Mode :character Median :61.0
## Mean :57.3 Mean :57.4
## 3rd Qu.:70.0 3rd Qu.:70.0
## Max. :83.0 Max. :83.0
## Bike.. Subscription.Type Zip.Code
## Min. : 9 Length:144015 Length:144015
## 1st Qu.:349 Class :character Class :character
## Median :446 Mode :character Mode :character
## Mean :437
## 3rd Qu.:546
## Max. :717
The dataset contains 144015 rows and 11 columns. The columns are listed as follows:
Let’s check the number of unique start-station and unique end-station. Furthermore, check if they contain the same set of the stations.
length(unique(df$Start.Station))
## [1] 69
length(unique(df$End.Station))
## [1] 69
#check if they are with the same set of the station
any(sort(unique(df$Start.Station)) != sort(unique(df$End.Station)))
## [1] FALSE
There are total 69 bike stations. In addition, the start-stations and end-stations are using the same set of stations.
There are 3 questions that we need to answer. I listed the questions as follows:
To answer this question, we only need to get the mean of the column “Duration”.
avgDuration = mean(df$Duration) / 60 #unit in minute
The average total time (in minutes) used by a bicycle in the data: 20.5152 minutes
Refer to this question, I consider the most popular day by trip frequency is the most number of people who rented a bicycle in one day.
For people who rented a bicycle with a duration more than 24 hours, I only considered the frequency as 1 for the start day.
#create a new data frame with the day only
t = as.data.frame(as.Date(df$Start.Date, "%m/%d/%y"))
names(t)[1] = "day"
idx = which.max(table(t))
popularDay = table(t)[idx]
As you can see, the most popular day by trip frequency is 2013-09-25 which have 1264 number of people who rented a bicycle.
I tried to tackle the harder part of this question.
I created a function getRebalanceDate to obtain the rebalance time for a particular station (id). IDEA of the function:
getRebalanceDate = function(df, id){
#create data frame - out
outDF = as.data.frame(df[which(df$Start.Terminal == id), c("Start.Date")])
outDF$state = -1
names(outDF)[1] = "time"
#create data frame - in
inDF = as.data.frame(df[which(df$End.Terminal == id), c("End.Date")])
inDF$state = 1
names(inDF)[1] = "time"
#combine two data frame
DF = rbind(outDF, inDF)
#sort by date time
DF = DF[order(as.POSIXlt(DF$time, format="%m/%d/%y %H:%M")), ]
#at the begining with 30 bicycle
DF[1, "state"] = DF[1, "state"] + 30
#calculate the acculated sum
csum = cumsum(DF$state)
#check which index get the value 0
idxs = which(csum == 0)
#get the date time
as.character(DF[idxs[1], "time"])
}
To obtain the FIRST date and time need to rebalance the bikes, we only need to use lapply to call the getRebalanceDate function. Then, store the result in a data frame and sorted it by the time accordingly. The first element will be the FIRST date and time that needs rebalance.
#create a record data frame
record = as.data.frame(as.character(unique(df$Start.Station)))
record$id = as.character(unique(df$Start.Terminal)) #get the ids
#get the rebalance time
l = unlist(lapply(record$id, function(x){
getRebalanceDate(df, x)
}))
#update the record
record$time = l
#sort the data with respect to the time
record = record[order(as.POSIXlt(record$time, format="%m/%d/%y %H:%M")), ]
The station Embarcadero at Bryant first need to re-balanced on 9/2/13 11:33.
There are a number of station required to rebalance also. We can get this information by using the following code.
stationNeedRebalance = record[!is.na(record$time), ]
stationNeedRebalance
## as.character(unique(df$Start.Station)) id time
## 46 Embarcadero at Bryant 54 9/2/13 11:33
## 40 Market at Sansome 77 9/3/13 16:54
## 27 2nd at Folsom 62 9/4/13 18:26
## 17 San Francisco City Hall 58 9/5/13 12:27
## 20 Temporary Transbay Terminal (Howard at Beale) 55 9/5/13 17:57
## 25 Commercial at Montgomery 45 9/9/13 17:03
## 11 Clay at Battery 41 9/10/13 17:49
## 12 Post at Kearney 47 9/11/13 8:13
## 36 Mechanics Plaza (Market at Battery) 75 9/11/13 11:23
## 34 Grant Avenue at Columbus Avenue 73 9/11/13 14:18
## 9 Mountain View Caltrain Station 28 9/12/13 9:46
## 18 Beale at Market 56 9/12/13 14:12
## 50 Embarcadero at Folsom 51 9/14/13 11:13
## 1 South Van Ness at Market 66 9/15/13 13:10
## 32 Powell at Post (Union Square) 71 9/19/13 9:53
## 13 Market at 10th 67 9/20/13 20:00
## 39 Harry Bridges Plaza (Ferry Building) 50 9/23/13 11:38
## 16 Redwood City Caltrain Station 22 9/24/13 8:29
## 2 San Jose City Hall 10 9/25/13 17:10
## 53 San Jose Diridon Caltrain Station 2 10/1/13 7:46
## 4 Golden Gate at Polk 59 10/1/13 8:49
## 52 Howard at 2nd 63 10/1/13 9:09
## 5 Santa Clara at Almaden 4 10/22/13 13:40
## 54 Paseo de San Antonio 7 11/10/13 10:21
## 49 Cowper at University 37 11/15/13 12:48
## 45 Palo Alto Caltrain Station 34 12/4/13 16:29
The total number of station need rebalance: 26.
Note:
When I publish this document to Rpub, I encountered the errors about SSH and readBin. I refer to [2-3] to solve these two problems.