Introduction

This project analyses Boulder B-cycle data to identify and document usage patterns from 2013 to early 2016. The analysis is presented through summaries and visualizations. The project also applies machine learning to the numerical data to try to make predictions.

How is the analysis divided?

The analysis is divided into four main sections:

  1. Summary of Data

  2. Data Visualizations

  3. Machine Learning to predict the Pass Type

  4. Conclusion

1. Summary of Data

The structure of the dataset

There are 13 variables and 248,544 observations. The variables include the checkout/return stations, checkout/return times, type of pass, day of the week, trip duration, bike number and rider/operator number. A separate location dataset provides the latitude and longitude of the checkout/return stations, along with other station information.

##   Rider.Home.System Rider.or.Operator.Number Entry.Pass.Type Bike.Number
## 1   Boulder B-cycle                 R1011535         24-hour         548
## 2   Boulder B-cycle                 R1011722         24-hour         742
## 3   Boulder B-cycle                 R1008367          Annual         578
## 4   Boulder B-cycle                 R1010650         24-hour         616
## 5   Boulder B-cycle                 R1008367          Annual         578
## 6   Boulder B-cycle                 R1055681          Annual         601
##   Checkout.Date Checkout.Day.of.Week Checkout.Time  Checkout.Station
## 1     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 2     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 3     5/20/2011               Friday    9:33:00 AM Broadway & Alpine
## 4     5/20/2011               Friday    9:34:00 AM Broadway & Alpine
## 5     5/20/2011               Friday    9:36:00 AM Broadway & Alpine
## 6     5/20/2011               Friday    9:39:00 AM UCAR Center Green
##   Return.Date Return.Day.of.Week Return.Time    Return.Station
## 1   5/20/2011             Friday  9:40:00 AM      26th @ Pearl
## 2   5/20/2011             Friday  9:54:00 AM      15th & Pearl
## 3   5/20/2011             Friday  9:36:00 AM Broadway & Alpine
## 4   5/20/2011             Friday  9:37:00 AM Broadway & Alpine
## 5   5/20/2011             Friday  9:39:00 AM Broadway & Alpine
## 6   5/20/2011             Friday  9:42:00 AM UCAR Center Green
##   Trip.Duration..Minutes.
## 1                      16
## 2                      30
## 3                       3
## 4                       3
## 5                       3
## 6                       3
##                Rider.Home.System  Rider.or.Operator.Number
##  Boulder B-cycle        :243333   M9999957:  9499         
##  Denver B-cycle         :  4666   M9999950:  5684         
##  Madison B-cycle        :   201   M9999952:  5538         
##  Houston B-cycle        :   113   R1028713:  4006         
##  Indy - Pacers Bikeshare:    74   M9999943:  3077         
##  GREENbike              :    38   M9999998:  2835         
##  (Other)                :   119   (Other) :217905         
##            Entry.Pass.Type    Bike.Number       Checkout.Date   
##  24-hour           : 83642   411    :  1821   6/25/2015:   703  
##  7-day             :  5585   584    :  1755   8/2/2015 :   650  
##  Annual            :113041   666    :  1613   8/8/2015 :   639  
##  Maintenance       : 37337   744    :  1608   7/28/2015:   635  
##  Semester (150-day):  8939   665    :  1607   6/26/2015:   621  
##                              699    :  1596   8/5/2015 :   621  
##                              (Other):238544   (Other)  :244675  
##  Checkout.Day.of.Week     Checkout.Time    Checkout.Station  
##  Friday   :39020      12:16:00 PM:   467   Length:248544     
##  Monday   :35182      12:26:00 PM:   455   Class :character  
##  Saturday :36603      12:45:00 PM:   447   Mode  :character  
##  Sunday   :28767      4:12:00 PM :   434                     
##  Thursday :38079      5:05:00 PM :   433                     
##  Tuesday  :34903      12:12:00 PM:   432                     
##  Wednesday:35990      (Other)    :245876                     
##     Return.Date     Return.Day.of.Week      Return.Time    
##  6/25/2015:   706   Friday   :39026    12:04:00 AM:   495  
##  8/2/2015 :   651   Thursday :37881    1:13:00 PM :   451  
##  8/8/2015 :   637   Saturday :36322    12:12:00 PM:   441  
##  7/28/2015:   629   Wednesday:36042    1:51:00 PM :   439  
##  6/26/2015:   624   Monday   :35362    12:15:00 PM:   437  
##  7/11/2015:   624   Tuesday  :34944    12:52:00 PM:   436  
##  (Other)  :244673   (Other)  :28967    (Other)    :245845  
##  Return.Station     Trip.Duration..Minutes.
##  Length:248544      Min.   :    -2.00      
##  Class :character   1st Qu.:     5.00      
##  Mode  :character   Median :    12.00      
##                     Mean   :    63.36      
##                     3rd Qu.:    26.00      
##                     Max.   :181607.00      
## 
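
Listings like the one above are what head() and summary() print for a data frame. The sketch below shows the calls on a small made-up data frame; the object name trips and all values are assumptions for illustration, and only the column names are taken from the report.

```r
# Toy stand-in for the trip data: column names follow the report, values are made up.
trips <- data.frame(
  Entry.Pass.Type = factor(c("24-hour", "Annual", "Annual", "7-day")),
  Checkout.Station = c("15th & Pearl", "Broadway & Alpine", "15th & Pearl", "The Village"),
  Trip.Duration..Minutes. = c(16, 3, 30, 12),
  stringsAsFactors = FALSE
)

dim(trips)      # number of observations and number of variables
head(trips)     # first rows of the data
summary(trips)  # level counts for factors, quartiles for numeric columns
str(trips)      # type of each column
```

Factor columns are summarised as counts per level and numeric columns as quartiles, which is why the pass type and the trip duration look so different in the summary.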

Corrections to the dataset

There are some errors in the “Rider.Home.System” column. The data is supposed to be for Boulder, but in some cases the home system was set to Denver or Houston, which is not correct. This is not a big issue: the column is meant to be a constant, so it adds no value to the analysis.

NOTE: Corrections were made to the Checkout/Return Station “RTD”, which is really “14th & Canyon” but was entered incorrectly as “RTD”. This error was found late in the analysis, but the correction is applied early on, before any of the results below are produced.
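
A minimal sketch of such a station-name correction, assuming the station columns are stored as character vectors (as the summary above suggests). The data frame here is made up and the loop is only one possible way to apply the fix.

```r
# Toy data with the mislabelled station; values are made up for illustration.
trips <- data.frame(
  Checkout.Station = c("RTD", "15th & Pearl", "RTD"),
  Return.Station   = c("15th & Pearl", "RTD", "26th @ Pearl"),
  stringsAsFactors = FALSE
)

# Apply the same correction to both station columns.
station_cols <- c("Checkout.Station", "Return.Station")
for (col in station_cols) {
  trips[[col]][trips[[col]] == "RTD"] <- "14th & Canyon"
}

trips
```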

2. Data Visualizations

This section contains many visualizations. It combines univariate and multivariate plots, focusing on one variable at a time.

Riders or Operators

Riders are users who use a pass, and Operators are B-cycle employees who do maintenance. The visualizations in this section focus on ‘Rider.or.Operator.Number’.

NOTE: Many riders had between 1 and 200 rides, so to make the patterns easier to see, the data was subset to riders with 200 or more rides.
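
The subsetting described in the note can be sketched as follows. The rider IDs are made up, and the threshold is lowered to 3 so that the toy data is not filtered away entirely; the report uses 200.

```r
# Toy ride log: one row per ride, identified by rider/operator number (made-up IDs).
riders <- c(rep("R1", 5), rep("R2", 2), rep("M1", 4), "R3")
trips <- data.frame(Rider.or.Operator.Number = riders, stringsAsFactors = FALSE)

min_rides <- 3  # the report uses 200; lowered here for the toy data

# Count rides per rider, then keep only the rides of riders at or above the threshold.
ride_counts <- table(trips$Rider.or.Operator.Number)
frequent_ids <- names(ride_counts)[ride_counts >= min_rides]
frequent <- trips[trips$Rider.or.Operator.Number %in% frequent_ids, , drop = FALSE]

# Ride counts of the remaining riders, busiest first.
sort(table(frequent$Rider.or.Operator.Number), decreasing = TRUE)
```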

Fig1: Rider/Operator Count

Fig2: Rider/Operator Count separated by Pass Type

The following trends can be noted from the plots in this section from the subset data:

  1. Some riders use B-cycle very heavily (Fig1). Faceting by pass type (Fig2) shows which passes they use. Unsurprisingly, the Annual pass dominates among frequent riders, but one rider made a little more than 200 rides on the 24-hour pass, which is surprising given the other pass types available.

  2. The rides on the Maintenance pass are concentrated among a few users, each with a large number of rides. This indicates that these were operators who regularly used and fixed bikes.

Pass Type

There are four major pass types (Annual, 24-hour, 7-day and Semester) and a Maintenance pass type.

Fig3: Pass Type Count

Fig4: Pass Type Count separated by Day of the Week

  1. The 5 pass types can be seen in Fig3. The Annual pass is clearly the most popular, followed by the 24-hour pass. The Semester (150-day) and 7-day passes pale in comparison. Maintenance also has relatively high use compared to the 7-day and Semester (150-day) types.

  2. What stands out in Fig4 is that the 24-hour pass is used far more than the Annual pass on weekends, whereas on weekdays the Annual pass is still the most widely used. Semester and 7-day pass usage is comparatively very low.

NOTE: The Semester pass type was introduced only in 2014, so its lower usage makes sense.
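
The weekday/weekend comparison behind Fig4 can be checked numerically with a cross-tabulation. This is a sketch on made-up rows; the Is.Weekend column is a helper introduced here for illustration and is not part of the original dataset.

```r
# Calendar order for the day factor (the default ordering would be alphabetical).
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Toy rides; values are made up for illustration.
trips <- data.frame(
  Entry.Pass.Type = c("Annual", "Annual", "24-hour", "24-hour", "24-hour", "Annual"),
  Checkout.Day.of.Week = c("Monday", "Tuesday", "Saturday", "Sunday", "Saturday", "Saturday"),
  stringsAsFactors = FALSE
)
trips$Checkout.Day.of.Week <- factor(trips$Checkout.Day.of.Week, levels = day_levels)

# Hypothetical helper column flagging weekend rides.
trips$Is.Weekend <- trips$Checkout.Day.of.Week %in% c("Saturday", "Sunday")

# Rides per pass type and day, and per pass type on weekdays vs weekends.
by_day <- table(trips$Entry.Pass.Type, trips$Checkout.Day.of.Week)
by_weekend <- table(trips$Entry.Pass.Type, trips$Is.Weekend)

# Share of each pass type within weekdays (FALSE) and weekends (TRUE).
prop.table(by_weekend, margin = 2)
```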

Bike Numbers

I was not expecting any trends in the bike numbers, but surprisingly there are some.

Bikes with ID numbers in the middle range seem to be the most used for the Annual, 24-hour and Maintenance pass types (this is not true for the 7-day and Semester pass types). This might be related to the stations the bikes are at, since some stations are more popular than others, as we will see below.

Fig5: Bike# Count separated by Pass Type

Fig6: Bike# Count separated by Day of the Week

Fig7: Bike# Count separated by Pass Type & Day of the Week

Day of the week

NOTE: This section only uses Checkout.Day.of.Week for analysis, as the points below are mostly the same for Return.Day.of.Week.

Fig8: Day of the Week Count

Fig9: Day of the Week Count separated by Pass Type

  1. Friday is the most popular day of the week overall, followed by Thursday (surprising) and then Saturday. Monday, Tuesday and Wednesday usage is very close, whereas Sunday usage is markedly lower than on other days (Fig8).

  2. When the data is faceted by pass type (Fig9), Annual pass holders mostly use their passes on weekdays (the distribution across the week is almost bell-shaped). The opposite holds for 24-hour pass users, who prefer riding on weekends (as noted in the Pass Type section).

  3. Maintenance rides are common on weekdays, with Thursday rather than Friday (the busiest day overall, per point 1) seeing the most maintenance.

  4. Semester pass holders mostly use their pass on weekdays, with Tuesday being the most popular.

  5. For the 7-day pass there is no visible trend, but Thursday is the most popular day.

Trip Duration

This section also uses a subset of the data. Trip duration had many outliers, so the data was subset to rides of 60 minutes or less.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     -2.00      5.00     12.00     63.36     26.00 181600.00

Fig10: Box plots of Trip Duration

Fig11: Trip Duration Distribution

Fig12: Trip Duration Distribution separated by Pass Type

Fig13: Trip Duration Distribution separated by Day of the Week

Fig14: Trip Duration Distribution separated by Pass Type and Day of the Week

  1. Overall (Fig11), the trip duration distribution has a single peak at 4-6 minutes, falls steeply after the peak and levels off from about the 32-minute mark.

  2. Faceted by pass type (Fig12), the most common trip duration is 12-13 minutes for the 24-hour pass, 10-13 minutes for the 7-day pass, 4-6 minutes for Annual, 1 minute for Maintenance (quick maintenance rides!) and 6-8 minutes for the Semester type.

  3. Looking at the plots by day of the week (Fig13), the 4-6 minute range is still the most popular on weekdays but not on weekends, when 10-12 minute trips seem to be more popular. Maybe this is because people are not in a hurry on weekends.

  4. Combining pass type and day of the week (Fig14), the annual pass holders’ trip duration pattern doesn’t change much, and point 1 still holds. For the 24-hour pass type, trip durations tend to be in the upper ranges, at 10+ minutes. The 7-day pass type shows no clear pattern in the plots. One minute seems to be the most common turnaround time for maintenance. A trip duration of 6-8 minutes is the most common for the Semester pass type.
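
The 60-minute subset and the "most common trip duration per pass type" readings above can be approximated numerically. The sketch below uses made-up rows, and modal_duration() is a hypothetical helper introduced for illustration; the report itself reads these values off the plots.

```r
# Toy rides with one extreme outlier; values are made up for illustration.
trips <- data.frame(
  Entry.Pass.Type = c("Annual", "Annual", "Annual", "24-hour",
                      "24-hour", "24-hour", "Maintenance"),
  Trip.Duration..Minutes. = c(5, 5, 400, 12, 12, 30, 1),
  stringsAsFactors = FALSE
)

# Keep rides of 60 minutes or less, as in the report.
short <- subset(trips, Trip.Duration..Minutes. <= 60)

# Hypothetical helper: the most frequent value of a numeric vector
# (the first one in sorted order if several values tie).
modal_duration <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Most common trip duration for each pass type.
modes <- tapply(short$Trip.Duration..Minutes., short$Entry.Pass.Type, modal_duration)
modes
```

Comparing summary(trips$Trip.Duration..Minutes.) before and after the subset shows how strongly a handful of very long trips pulls the mean away from the median.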

Checkout Station

Fig15: Checkout Station Count

Fig16: Checkout Station Count separated by Day of the Week

Fig17: Checkout Station Count separated by Pass Type

Fig18: Trip Duration Distribution separated by Checkout Station

  1. 15th & Pearl and 13th and Spruce are the 2 most popular checkout stations in Boulder (Fig15). 11th and Pearl and the Municipal Building are nearly tied. Greenhouse and Gunbarrel North are the least used stations. The 14th and Walnut office entry might be an error, as this location has no latitude or longitude listed.

  2. Faceted by day of the week (Fig16), 15th & Pearl is still the most popular checkout station. 13th and Spruce and the Municipal Building are the next most popular checkout stations from Mon-Thu, and 11th & Pearl from Fri-Sun.

  3. Analyzed by pass type (Fig17), 15th & Pearl is still the most popular checkout station for all pass types except the Semester pass. For the 24-hour pass type, 11th and Pearl is the 2nd most popular checkout station, followed by 19th @ Broadway. The Village seems to be the 2nd most popular station for the 7-day pass type. The distribution for the Annual pass type is similar to the overall pattern, since this is the most used pass.

  4. One thing to note is the spikes in maintenance (Fig16 & Fig17) at locations like The Village and 26th @ Pearl, which are not in line with the overall checkout station popularity pattern. This might indicate that the bikes at those stations were subject to rougher use, or that a batch of bikes had a few defects.

  5. Faceting the trip duration by checkout station (Fig18) shows no major surprises: ride times were in the 6-10 minute range, and the overall pattern is the same across the popular stations.

Return Station

There are no surprises in the Return Station analysis: most, if not all, of the points from the previous section apply to this section as well.

Fig19: Return Station Count separated by Pass Type

Fig20: Return Station Count separated by Day of the Week

Fig21: Return Station Count separated by Pass Type

Fig22: Trip Duration Distribution separated by Return Station

Checkout Date

As noted in the Pass Type section, the Semester pass type was introduced only in early 2014. This becomes obvious in the charts faceted by pass type.

Fig23: Checkout Date Distribution

Fig24: Checkout Date Distribution separated by Pass Type

Fig25: Checkout Date Distribution separated by Day of the Week

Fig26: Checkout Date Distribution separated by Pass Type and Day of the Week

  1. The number of checkouts has increased progressively over the years from 2013 to 2016 (Fig23). Usage is seasonal: the summer months (May-August) see an increase in checkouts, with a dip on either side. This makes sense, as people tend to ride less in the winter months. Among the popular summer months, July and August have the most checkouts across the years.

  2. Viewing the plots by pass type (Fig24), we can see that all pass types have seen an increase in usage since Boulder B-cycle was introduced. The 7-day pass saw a big increase in the summer of 2015, and the Semester pass has also seen a big increase since it was introduced in early 2014.

  3. Among the annual pass holders, October 2015 (Fig24) had more checkouts than any of the warmer months. This is surprising; perhaps October was unusually warm, or there were a lot of events in Boulder that month.

  4. Maintenance generally follows the same trend, with more instances of maintenance in the summer months and fewer in the colder months. One anomaly is that April 2015 had the most instances of maintenance for that year, even though it wasn’t the most popular month in terms of ridership. This might indicate that Boulder B-cycle was preparing in advance for the popular summer months. The guess is plausible because maintenance was lower in the months following April 2015.

  5. When checkouts are divided by day of the week (Fig25), only Tuesday and Wednesday deviate from the general trend that August is the most popular month, followed by July. For Tuesday and Wednesday, the roles of July and August are reversed.

  6. A multivariate analysis (Fig26) shows finer trends in popular days across months and pass types, but it adds no new points beyond the ones already documented.
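
The monthly patterns above can be cross-checked by counting checkouts per calendar month. This sketch assumes dates in the month/day/year form shown in the data listing; the rows are made up, and Checkout.Month is a helper column introduced here for illustration.

```r
# Toy checkout dates in the m/d/Y form used by the dataset; values are made up.
trips <- data.frame(
  Checkout.Date = c("6/25/2015", "6/26/2015", "8/2/2015",
                    "10/1/2015", "10/15/2015", "10/20/2015"),
  stringsAsFactors = FALSE
)

# Parse the text dates and derive a year-month label (hypothetical helper column).
trips$Checkout.Date <- as.Date(trips$Checkout.Date, format = "%m/%d/%Y")
trips$Checkout.Month <- format(trips$Checkout.Date, "%Y-%m")

# Checkouts per month and the busiest month in the toy data.
monthly <- table(trips$Checkout.Month)
busiest <- names(monthly)[which.max(monthly)]
monthly
busiest
```

Adding Entry.Pass.Type as a second argument to table() would give the per-pass-type monthly counts that correspond to Fig24.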

Checkout/Return Time(Part 1)

Fig27: Checkout Time Distribution

Fig28: Return Time Distribution

Fig29: Checkout Time Distribution separated by Pass Type

Fig30: Return Time Distribution separated by Pass Type

  1. Overall (Fig27/28), 19:00 is the most popular time for riding B-cycles, which might indicate that people like to use the service to go to dinner in Boulder. Other popular times are 18:30, 19:30, 20:00 and, surprisingly, 23:30! There are some rides early in the morning, but usage is negligible until 11:00.

  2. Faceting the plots by pass type (Fig29/30) shows that there is no common pattern across pass types. The Annual pass type sees peaks and dips throughout the day, with the highest peak at 23:30. This has to be one of the most surprising finds of the analysis.

  3. The 24-hour pass type has a roughly bell-shaped distribution, with the peak at 20:00 and a gradual decrease on either side.

  4. Maintenance is highest between 14:30 and 23:00, with big dips on either side of that time range.

  5. The Semester and 7-day pass types don’t have any visible patterns. The Semester pass is used the most at 15:00, whereas the 7-day pass is used the most at 00:00 (another surprising find).

Checkout/Return Time(Part 2)

Fig31: Checkout Time Distribution separated by Day of the Week

Fig32: Return Time Distribution separated by Day of the Week

Fig33: Checkout Time Distribution separated by Pass Type & Day of the Week

Fig34: Return Time Distribution separated by Pass Type & Day of the Week

  1. The weekend and weekday Checkout/Return times (Fig31/32) have contrasting patterns. Weekdays see peaks and dips throughout the day, with the highest peak at 23:30 on Mon, Wed and Thu and at 19:00 on Tue and Fri (but both peaks are close). On weekends the distribution is roughly bell-shaped, with peaks at 19:00 on Sun and 19:30 on Sat.

  2. In the multivariate analysis (Fig33/34), point 1 of this section and the points from the previous section all hold true, and some finer points can also be seen.
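
The peak times quoted in the two Checkout/Return Time sections fall on half-hour marks, which suggests the times were grouped into 30-minute bins. The sketch below shows one way to do such binning for 12-hour clock strings like those in the raw data listing; the bin width and the helper to_half_hour() are assumptions, not the author's code.

```r
# Hypothetical helper: map "h:mm:ss AM/PM" strings to 24-hour half-hour bins ("HH:00" or "HH:30").
to_half_hour <- function(time_str) {
  parts <- strsplit(time_str, "[: ]")  # -> hour, minute, second, AM/PM
  vapply(parts, function(p) {
    hour <- as.integer(p[1]) %% 12L                 # 12 AM and 12 PM both become 0 here
    if (toupper(p[4]) == "PM") hour <- hour + 12L   # shift afternoon hours to 12-23
    minute <- if (as.integer(p[2]) < 30L) 0L else 30L
    sprintf("%02d:%02d", hour, minute)
  }, character(1))
}

# Example times in the format used by the Checkout.Time and Return.Time columns.
bins <- to_half_hour(c("9:24:00 AM", "12:16:00 PM", "12:04:00 AM", "11:45:00 PM"))
bins

# Counting rides per bin gives the distribution that a plot like Fig27 would show.
table(bins)
```

Splitting the string by hand avoids strptime()'s "%p" format, whose AM/PM handling depends on the locale.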

Heat Map of Stations

Fig35: Heat Map of Checkout Stations from 2013-2016

Fig36: Heat Map of Return Stations from 2013-2016

The size of each circle represents the overall number of checkouts/returns per station since B-cycle started. From the two maps (Fig35 & Fig36) it is clear that the downtown stations are the most frequently used. The stations just outside downtown and in or near the University are second to the downtown stations in terms of usage.
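
The data behind such a map is a per-station count joined to the station coordinates. The sketch below shows that preparation step only; the location column names and the coordinates are placeholders made up for illustration, and scaling the radius by the square root of the count is a design choice assumed here, not taken from the report.

```r
# Toy checkouts and a toy location table; names and coordinates are placeholders.
trips <- data.frame(
  Checkout.Station = c("15th & Pearl", "15th & Pearl", "13th & Spruce", "15th & Pearl"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  Station   = c("15th & Pearl", "13th & Spruce"),
  Latitude  = c(40.019, 40.020),      # made-up values
  Longitude = c(-105.275, -105.279),  # made-up values
  stringsAsFactors = FALSE
)

# Count checkouts per station.
counts <- as.data.frame(table(trips$Checkout.Station), stringsAsFactors = FALSE)
names(counts) <- c("Station", "Checkouts")

# Attach coordinates; stations with no checkouts keep a count of zero.
station_map <- merge(locations, counts, by = "Station", all.x = TRUE)
station_map$Checkouts[is.na(station_map$Checkouts)] <- 0

# Radius proportional to sqrt(count), so that circle area tracks the count.
station_map$Radius <- sqrt(station_map$Checkouts)
station_map
```

A station that appears in the trip data but not in the location table (such as the one without latitude and longitude mentioned in the Checkout Station section) would be dropped by this merge, which is one way such entries get noticed.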

3. Machine Learning to predict the Pass Type

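# Set this condition to TRUE to run the machine learning section. It is disabled intentionally to speed up the knitHTML process.
if (FALSE)
{
  # caret provides createDataPartition(), trainControl(), train(), resamples() and confusionMatrix()
  library(caret)

  # Add new variables for Checkout.Hour and Return.Hour for machine learning analysis.
  # Characters 12-13 are the hour only if the time columns hold date-time strings
  # such as "2011-05-20 09:24:00" (assumed to have been converted earlier).
  dataset$Checkout.Hour <- as.factor(substr(dataset$Checkout.Time, 12, 13))
  dataset$Return.Hour <- as.factor(substr(dataset$Return.Time, 12, 13))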
  
  # Remove 'Maintenance' Pass type for classification(as this is not a user pass)
  mlsubset <- droplevels(subset(dataset, dataset$Entry.Pass.Type != 'Maintenance'))
  
  # Subset the data keeping the necessary variables
  mlsubset <- mlsubset[c(3, 4, 6, 8, 10, 13, 14)]
  
  # Create partition
  trainIndex <- createDataPartition(mlsubset$Entry.Pass.Type, p = 0.8, list = FALSE, times = 1)
  trainingset <- mlsubset[trainIndex, ]
  testset <- mlsubset[-trainIndex, ]
  
  # Get the necessary variables for analysis
  # Split the data set for 10-fold cross validation, train on 9, test on 1 for all combinations
  # (named 'control' so the object does not mask caret's trainControl() function)
  control <- trainControl(method = "cv", number = 10)
  metric <- "Accuracy"
  
  # Evaluate 4 different algorithms, make sure the same seed is used
  # Linear Discriminant Analysis
  set.seed(7)
  fit.lda <- train(Entry.Pass.Type~., data = trainingset, method = "lda", 
                   metric = metric, trControl = control)
  # Classification and Regression Tree
  set.seed(7)
  fit.cart <- train(Entry.Pass.Type~., data = trainingset, method = "rpart", 
                    metric = metric, trControl = control)
  # Naive Bayes
  set.seed(7)
  fit.nb <- train(Entry.Pass.Type~., data = trainingset, method = "nb", 
                  metric = metric, trControl = control)
  # Random Forest (ensemble of bagged decision trees)
  set.seed(7)
  fit.rf <- train(Entry.Pass.Type~., data = trainingset, method = "rf", 
                  ntree = 100, metric = metric, trControl = control)
  
  # Summarize accuracy of models
  results <- resamples(list(lda = fit.lda, cart = fit.cart, nb = fit.nb, rf = fit.rf))
  summary(results)
  
  # Dot plot of the results
  dotplot(results)
  
  # Compare against the test set
  predictions <- predict(fit.rf, testset)
  confusionMatrix(predictions, testset$Entry.Pass.Type)
}

This section explored whether Machine Learning algorithms could use the numeric data (trip duration) to predict the pass type with high accuracy. Four algorithms were tested (LDA, CART, Naive Bayes and Random Forest). Of the four, Random Forest gave the best results, but its accuracy was still low (<75%). Since trip duration and the derived Checkout/Return hour variables were the only numeric data, the algorithms did not achieve high accuracy even after the ‘Maintenance’ pass type data was removed.

If Boulder B-cycle had provided another numeric variable, such as the distance covered during each trip, it would probably have helped improve the classification accuracy.
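
To put the accuracy figure in context, it helps to compare it with the accuracy of always predicting the most common pass type. The sketch below computes that baseline from the pass type counts in the Summary of Data section, with Maintenance excluded. It assumes the class proportions in the test set are close to those in the full dataset, which is reasonable for a split stratified on the pass type.

```r
# Pass type counts from the dataset summary, Maintenance excluded.
pass_counts <- c("24-hour" = 83642, "7-day" = 5585,
                 "Annual" = 113041, "Semester (150-day)" = 8939)

# Accuracy of a classifier that always predicts the most common class.
majority_class <- names(which.max(pass_counts))
baseline_accuracy <- max(pass_counts) / sum(pass_counts)

majority_class
round(baseline_accuracy, 3)
```

The baseline comes out at roughly 53.5%, so any model accuracy is better judged by how far it rises above that value than by its distance from 100%.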

4. Conclusion

Many observations were made in this document; listed here are the major findings from the dataset.