Background

The dataset would be download from this [1]. Suppose the file was downloaded in the machine. Let’s have a brief look about the data.

#read the file
df = read.csv("bike_trip_data.csv", stringsAsFactor=FALSE)

#checking
dim(df) #dimension of the data
## [1] 144015     11
head(df)
##   Trip.ID Duration    Start.Date            Start.Station Start.Terminal
## 1    4576       63 8/29/13 14:13 South Van Ness at Market             66
## 2    4607       70 8/29/13 14:42       San Jose City Hall             10
## 3    4130       71 8/29/13 10:16  Mountain View City Hall             27
## 4    4251       77 8/29/13 11:29       San Jose City Hall             10
## 5    4299       83 8/29/13 12:02 South Van Ness at Market             66
## 6    4927      103 8/29/13 18:54      Golden Gate at Polk             59
##        End.Date              End.Station End.Terminal Bike..
## 1 8/29/13 14:14 South Van Ness at Market           66    520
## 2 8/29/13 14:43       San Jose City Hall           10    661
## 3 8/29/13 10:17  Mountain View City Hall           27     48
## 4 8/29/13 11:30       San Jose City Hall           10     26
## 5 8/29/13 12:04           Market at 10th           67    319
## 6 8/29/13 18:56      Golden Gate at Polk           59    527
##   Subscription.Type Zip.Code
## 1        Subscriber    94127
## 2        Subscriber    95138
## 3        Subscriber    97214
## 4        Subscriber    95060
## 5        Subscriber    94103
## 6        Subscriber    94109
summary(df)
##     Trip.ID          Duration       Start.Date        Start.Station     
##  Min.   :  4069   Min.   :    60   Length:144015      Length:144015     
##  1st Qu.: 52138   1st Qu.:   349   Class :character   Class :character  
##  Median : 99969   Median :   531   Mode  :character   Mode  :character  
##  Mean   :100968   Mean   :  1231                                        
##  3rd Qu.:150090   3rd Qu.:   797                                        
##  Max.   :198775   Max.   :722236                                        
##  Start.Terminal   End.Date         End.Station         End.Terminal 
##  Min.   : 2.0   Length:144015      Length:144015      Min.   : 2.0  
##  1st Qu.:50.0   Class :character   Class :character   1st Qu.:50.0  
##  Median :61.0   Mode  :character   Mode  :character   Median :61.0  
##  Mean   :57.3                                         Mean   :57.4  
##  3rd Qu.:70.0                                         3rd Qu.:70.0  
##  Max.   :83.0                                         Max.   :83.0  
##      Bike..    Subscription.Type    Zip.Code        
##  Min.   :  9   Length:144015      Length:144015     
##  1st Qu.:349   Class :character   Class :character  
##  Median :446   Mode  :character   Mode  :character  
##  Mean   :437                                        
##  3rd Qu.:546                                        
##  Max.   :717

The dataset contains 144015 rows and 11 columns. The columns are listed as follows:

Let’s check the number of unique start-station and unique end-station. Furthermore, check if they contain the same set of the stations.

length(unique(df$Start.Station))
## [1] 69
length(unique(df$End.Station))
## [1] 69
#check if they are with the same set of the station
any(sort(unique(df$Start.Station)) != sort(unique(df$End.Station)))
## [1] FALSE

There are total 69 bike stations. In addition, the start-stations and end-stations are using the same set of stations.

Solution to 3 problems

There are 3 questions that we need to answer. I listed the questions as follows:

  1. What was the average total time (in minutes) used by a bicycle in the data?
  2. What was the most popular day by trip frequency in this dataset?
  3. (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
  4. (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring “Start.Date” and “End.Date” columns.

1. What was the average total time (in minutes) used by a bicycle in the data?

To answer this question, we only need to get the mean of the column “Duration”.

avgDuration = mean(df$Duration) / 60 #unit in minute

The average total time (in minutes) used by a bicycle in the data: 20.5152 minutes

3. Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.

I tried to tackle the harder part of this question.

I created a function getRebalanceDate to obtain the rebalance time for a particular station (id). IDEA of the function:

  • get all data frame (df) and the station id (id) from the parameters
  • extract the rows (with columns: Start.Date only) which a bike was borrowed from the corrsponding id and store to a data frame (outDF)
  • add a new column named state which all the entries with value -1
  • extract the rows (with columns: End.Date only) which a bike was returned to the corrsponding id and store to a data frame (inDF)
  • add a new column named state which all the entries with value 1
  • combine two data frames together and store in a new data frame named DF
  • sort the DF with respect to the date time in ascending order
  • reset the state in first row by using the formula: 30 + the current state value
  • calcuate the acculated sum and check the first location with state value as ZERO
  • if so, return such date time; otherwise, return NA
getRebalanceDate = function(df, id){
  #create data frame - out
  outDF = as.data.frame(df[which(df$Start.Terminal == id), c("Start.Date")])
  outDF$state = -1
  names(outDF)[1] = "time"
  #create data frame - in
  inDF = as.data.frame(df[which(df$End.Terminal == id), c("End.Date")])
  inDF$state = 1
  names(inDF)[1] = "time"
  
  #combine two data frame
  DF = rbind(outDF, inDF)
  #sort by date time
  DF = DF[order(as.POSIXlt(DF$time, format="%m/%d/%y %H:%M")), ]
  #at the begining with 30 bicycle
  DF[1, "state"] = DF[1, "state"] + 30
  #calculate the acculated sum
  csum = cumsum(DF$state)
  #check which index get the value 0
  idxs = which(csum == 0)
  #get the date time
  as.character(DF[idxs[1], "time"])
}

To obtain the FIRST date and time need to rebalance the bikes, we only need to use lapply to call the getRebalanceDate function. Then, store the result in a data frame and sorted it by the time accordingly. The first element will be the FIRST date and time that needs rebalance.

#create a record data frame
record = as.data.frame(as.character(unique(df$Start.Station)))
record$id = as.character(unique(df$Start.Terminal)) #get the ids

#get the rebalance time
l = unlist(lapply(record$id, function(x){
        getRebalanceDate(df, x)
      }))
#update the record
record$time = l

#sort the data with respect to the time
record = record[order(as.POSIXlt(record$time, format="%m/%d/%y %H:%M")), ]

The station Embarcadero at Bryant first need to re-balanced on 9/2/13 11:33.

There are a number of station required to rebalance also. We can get this information by using the following code.

stationNeedRebalance = record[!is.na(record$time), ]
stationNeedRebalance
##           as.character(unique(df$Start.Station)) id           time
## 46                         Embarcadero at Bryant 54   9/2/13 11:33
## 40                             Market at Sansome 77   9/3/13 16:54
## 27                                 2nd at Folsom 62   9/4/13 18:26
## 17                       San Francisco City Hall 58   9/5/13 12:27
## 20 Temporary Transbay Terminal (Howard at Beale) 55   9/5/13 17:57
## 25                      Commercial at Montgomery 45   9/9/13 17:03
## 11                               Clay at Battery 41  9/10/13 17:49
## 12                               Post at Kearney 47   9/11/13 8:13
## 36           Mechanics Plaza (Market at Battery) 75  9/11/13 11:23
## 34               Grant Avenue at Columbus Avenue 73  9/11/13 14:18
## 9                 Mountain View Caltrain Station 28   9/12/13 9:46
## 18                               Beale at Market 56  9/12/13 14:12
## 50                         Embarcadero at Folsom 51  9/14/13 11:13
## 1                       South Van Ness at Market 66  9/15/13 13:10
## 32                 Powell at Post (Union Square) 71   9/19/13 9:53
## 13                                Market at 10th 67  9/20/13 20:00
## 39          Harry Bridges Plaza (Ferry Building) 50  9/23/13 11:38
## 16                 Redwood City Caltrain Station 22   9/24/13 8:29
## 2                             San Jose City Hall 10  9/25/13 17:10
## 53             San Jose Diridon Caltrain Station  2   10/1/13 7:46
## 4                            Golden Gate at Polk 59   10/1/13 8:49
## 52                                 Howard at 2nd 63   10/1/13 9:09
## 5                         Santa Clara at Almaden  4 10/22/13 13:40
## 54                          Paseo de San Antonio  7 11/10/13 10:21
## 49                          Cowper at University 37 11/15/13 12:48
## 45                    Palo Alto Caltrain Station 34  12/4/13 16:29

The total number of station need rebalance: 26.

Note:

When I publish this document to Rpub, I encountered the errors about SSH and readBin. I refer to [2-3] to solve these two problems.

Reference

[1] - http://mandrillapp.com/track/click/30315607/s3-us-west-1.amazonaws.com?p=eyJzIjoiRjBqdTctUHl3LTlMY1FkZU9SVVl3ZDRoV29VIiwidiI6MSwicCI6IntcInVcIjozMDMxNTYwNyxcInZcIjoxLFwidXJsXCI6XCJodHRwczpcXFwvXFxcL3MzLXVzLXdlc3QtMS5hbWF6b25hd3MuY29tXFxcL2RhdGF5ZWFyXFxcL2Jpa2VfdHJpcF9kYXRhLmNzdlwiLFwiaWRcIjpcIjFlMTRlMDVmYzNmNjRiNzE4NTEzYzQ1ZmNhNzI0OTA4XCIsXCJ1cmxfaWRzXCI6W1wiMTVlYzMzNWM1NDRlMTM1ZDI0YjAwODE4ZjI5YTdkMmFkZjU2NWQ2MVwiXX0ifQ

[2] - http://stackoverflow.com/questions/22537180/error-while-publishing-in-r-pubs

[3] - https://support.rstudio.com/hc/communities/public/questions/202448223-Publish-to-rpubs-Error-in-readBin-conn-what-raw-n-contentLength-invalid-n-argument