This project analyses Boulder B-cycle data to understand and document usage patterns from 2013 to early 2016. The analysis is presented through summaries and visualizations. The project also applies machine learning to the numerical data to try to predict the pass type.
The analysis is divided into 4 big sections:
Summary of Data
Data Visualizations
Machine Learning to predict the Pass Type
Conclusion
There are 13 variables with 248544 observations. The variables include Checkout/Return Station, Checkout/Return Time, Type of Pass, Day of the Week, Trip Duration, Bike Number and Rider/Operator Number. Also included is a location dataset with latitude and longitude information, along with other information about the Checkout/Return stations.
## Rider.Home.System Rider.or.Operator.Number Entry.Pass.Type Bike.Number
## 1 Boulder B-cycle R1011535 24-hour 548
## 2 Boulder B-cycle R1011722 24-hour 742
## 3 Boulder B-cycle R1008367 Annual 578
## 4 Boulder B-cycle R1010650 24-hour 616
## 5 Boulder B-cycle R1008367 Annual 578
## 6 Boulder B-cycle R1055681 Annual 601
## Checkout.Date Checkout.Day.of.Week Checkout.Time Checkout.Station
## 1 5/20/2011 Friday 9:24:00 AM 15th & Pearl
## 2 5/20/2011 Friday 9:24:00 AM 15th & Pearl
## 3 5/20/2011 Friday 9:33:00 AM Broadway & Alpine
## 4 5/20/2011 Friday 9:34:00 AM Broadway & Alpine
## 5 5/20/2011 Friday 9:36:00 AM Broadway & Alpine
## 6 5/20/2011 Friday 9:39:00 AM UCAR Center Green
## Return.Date Return.Day.of.Week Return.Time Return.Station
## 1 5/20/2011 Friday 9:40:00 AM 26th @ Pearl
## 2 5/20/2011 Friday 9:54:00 AM 15th & Pearl
## 3 5/20/2011 Friday 9:36:00 AM Broadway & Alpine
## 4 5/20/2011 Friday 9:37:00 AM Broadway & Alpine
## 5 5/20/2011 Friday 9:39:00 AM Broadway & Alpine
## 6 5/20/2011 Friday 9:42:00 AM UCAR Center Green
## Trip.Duration..Minutes.
## 1 16
## 2 30
## 3 3
## 4 3
## 5 3
## 6 3
## Rider.Home.System Rider.or.Operator.Number
## Boulder B-cycle :243333 M9999957: 9499
## Denver B-cycle : 4666 M9999950: 5684
## Madison B-cycle : 201 M9999952: 5538
## Houston B-cycle : 113 R1028713: 4006
## Indy - Pacers Bikeshare: 74 M9999943: 3077
## GREENbike : 38 M9999998: 2835
## (Other) : 119 (Other) :217905
## Entry.Pass.Type Bike.Number Checkout.Date
## 24-hour : 83642 411 : 1821 6/25/2015: 703
## 7-day : 5585 584 : 1755 8/2/2015 : 650
## Annual :113041 666 : 1613 8/8/2015 : 639
## Maintenance : 37337 744 : 1608 7/28/2015: 635
## Semester (150-day): 8939 665 : 1607 6/26/2015: 621
## 699 : 1596 8/5/2015 : 621
## (Other):238544 (Other) :244675
## Checkout.Day.of.Week Checkout.Time Checkout.Station
## Friday :39020 12:16:00 PM: 467 Length:248544
## Monday :35182 12:26:00 PM: 455 Class :character
## Saturday :36603 12:45:00 PM: 447 Mode :character
## Sunday :28767 4:12:00 PM : 434
## Thursday :38079 5:05:00 PM : 433
## Tuesday :34903 12:12:00 PM: 432
## Wednesday:35990 (Other) :245876
## Return.Date Return.Day.of.Week Return.Time
## 6/25/2015: 706 Friday :39026 12:04:00 AM: 495
## 8/2/2015 : 651 Thursday :37881 1:13:00 PM : 451
## 8/8/2015 : 637 Saturday :36322 12:12:00 PM: 441
## 7/28/2015: 629 Wednesday:36042 1:51:00 PM : 439
## 6/26/2015: 624 Monday :35362 12:15:00 PM: 437
## 7/11/2015: 624 Tuesday :34944 12:52:00 PM: 436
## (Other) :244673 (Other) :28967 (Other) :245845
## Return.Station Trip.Duration..Minutes.
## Length:248544 Min. : -2.00
## Class :character 1st Qu.: 5.00
## Mode :character Median : 12.00
## Mean : 63.36
## 3rd Qu.: 26.00
## Max. :181607.00
##
There are some errors in the “Rider.Home.System” column. The data is supposed to be for Boulder, but in some cases the column is set to other systems such as Denver and Houston, which is not correct. This is not a big issue: the column is effectively a constant and adds no value to the analysis.
NOTE: Corrections were made to the Checkout/Return Station “RTD”, which is really “14th & Canyon” but was entered incorrectly as “RTD”. This error was found later in the analysis, but the correction is applied early in the workflow.
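The station correction described in the note can be sketched as follows. This is a minimal illustration on a made-up data frame (`trips` and its rows are assumptions, not the project's real data); only the column names and the two station names come from the dataset.

```r
# Toy stand-in for the trip data (rows are invented for illustration).
trips <- data.frame(
  Checkout.Station = c("RTD", "15th & Pearl", "RTD"),
  Return.Station   = c("15th & Pearl", "RTD", "26th @ Pearl"),
  stringsAsFactors = FALSE
)

# Replace the mislabelled station name in both station columns.
station_cols <- c("Checkout.Station", "Return.Station")
for (col in station_cols) {
  trips[[col]][trips[[col]] == "RTD"] <- "14th & Canyon"
}

print(trips)
```

Looping over both columns keeps the checkout and return station names consistent, so later counts per station do not split one physical station into two labels.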
This section involves a lot of visualizations. It’s a combination of univariate and multivariate plots, with the focus on one variable at a time.
Riders are users who use a pass, and Operators are B-cycle employees who do maintenance. This section has visualizations with ‘Rider.or.Operator.Number’ in focus.
NOTE: There were a lot of riders with 1-200 rides, so to understand any patterns better, the data was subset to riders with 200 or more rides.
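One way to build the subset of frequent riders is shown below. It is a sketch on invented IDs and ride counts (the `trips` data frame and the names `ride_counts`, `frequent_ids` and `frequent_trips` are assumptions for illustration); only the 200-ride threshold and the column name come from the text.

```r
# Toy data: four made-up rider/operator IDs with different ride counts.
trips <- data.frame(
  Rider.or.Operator.Number = c(rep("R1", 250), rep("R2", 3), rep("M9", 200), rep("R3", 199)),
  stringsAsFactors = FALSE
)

# Count rides per rider/operator ID.
ride_counts <- table(trips$Rider.or.Operator.Number)

# Keep IDs with 200 or more rides, then keep only their trips.
frequent_ids <- names(ride_counts)[ride_counts >= 200]
frequent_trips <- trips[trips$Rider.or.Operator.Number %in% frequent_ids, , drop = FALSE]

print(sort(table(frequent_trips$Rider.or.Operator.Number), decreasing = TRUE))
```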
Fig1: Rider/Operator Count
Fig2: Rider/Operator Count separated by Pass Type
The following trends can be noted from the plots in this section from the subset data:
Some riders really like to use B-cycle for their rides (Fig1). Faceting by pass type (Fig2) gives a better understanding of which passes they use. The Annual pass is the clear winner among people who use the bikes often (not surprising), but one rider took a little more than 200 rides using the 24-hour pass (surprising that this person didn’t consider the other available pass types).
The number of rides made with the Maintenance pass is very interesting: there are a lot of rides by a few users. This indicates that these were operators who regularly used and fixed bikes.
There are four major pass types (Annual, 24-hour, 7-day and Semester) and a Maintenance pass type.
Fig3: Pass Type Count
Fig4: Pass Type Count separated by Day of the Week
The 5 pass types can be seen in the plot above (Fig3). The Annual pass is clearly the most popular, followed by the 24-hour pass. The Semester (150-day) and 7-day passes pale in comparison. Maintenance also has relatively high use compared to the 7-day and Semester (150-day) types.
One thing that stands out from Fig4 is that the 24-hour pass is used far more than the Annual pass on weekends, whereas on weekdays the Annual pass is still the most widely used. Semester and 7-day pass usage is comparatively very low.
NOTE: The Semester pass type was introduced only in 2014, so its lower usage makes sense.
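The weekday versus weekend comparison between pass types can be checked with a simple cross-tabulation. The sketch below uses a handful of invented rows (the `trips` data frame and the `weekend_share` name are assumptions); the column names match the dataset.

```r
# Toy data: a few made-up rides.
trips <- data.frame(
  Entry.Pass.Type      = c("Annual", "Annual", "24-hour", "24-hour", "24-hour", "7-day"),
  Checkout.Day.of.Week = c("Monday", "Tuesday", "Saturday", "Sunday", "Saturday", "Friday"),
  stringsAsFactors = FALSE
)

# Order days Monday..Sunday so tables and plots do not fall back to alphabetical order.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
trips$Checkout.Day.of.Week <- factor(trips$Checkout.Day.of.Week, levels = day_levels)

# Rides per pass type and day.
counts <- table(trips$Entry.Pass.Type, trips$Checkout.Day.of.Week)

# Share of each pass type's rides that fall on a weekend.
weekend_share <- rowSums(counts[, c("Saturday", "Sunday"), drop = FALSE]) / rowSums(counts)

print(counts)
print(round(weekend_share, 2))
```

Setting the factor levels explicitly matters because the raw summary lists the days alphabetically, which makes weekday patterns harder to read.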
I was not expecting any trends when analyzing bike numbers, but surprisingly there are some trends.
The bike ID numbers in the middle range seem to be most used for Annual, 24 hour and Maintenance pass types (this is not true for the 7-day and Semester pass types). This might be related to the stations they are at, as there are stations which are more popular than others, as we will see below.
Fig5: Bike# Count separated by Pass Type
Fig6: Bike# Count separated by Day of the Week
Fig7: Bike# Count separated by Pass Type & Day of the Week
NOTE: This section only uses Checkout.Day.of.Week for analysis, as the points below are mostly the same for Return.Day.of.Week.
Fig8: Day of the Week Count
Fig9: Day of the Week Count separated by Pass Type
Friday is the most popular day of the week for ridership overall, followed by Thursday (surprising) and then Saturday. Monday, Tuesday and Wednesday usage is very close, whereas Sunday usage is markedly lower than on other days (Fig8).
Looking at the data faceted by pass type (Fig9), Annual pass holders mostly use their passes on weekdays (the distribution is almost Gaussian-like). It is the complete opposite for users of the 24-hour pass, who like riding on weekends (as noted in the Pass Type section).
Maintenance rides are common on weekdays, with Thursday rather than Friday seeing the most maintenance (compare point 1).
Semester pass holders like to use their pass on the weekdays with Tuesday being the most popular.
In the case of the 7-day pass, there is no visible trend but Thursday is most popular.
This section also uses a subset of the data. There were a lot of outliers in trip duration, so the data was subset to rides of 60 minutes or less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00 5.00 12.00 63.36 26.00 181600.00
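The summary above shows why a cutoff helps: the mean (63.36) is far above the median (12.00) because of a few extreme values. A minimal sketch of the 60-minute filter follows, using invented durations. Dropping negative durations as well is an added assumption, not something the original analysis states.

```r
# Made-up trip durations in minutes, including outliers like those in the summary.
durations <- c(-2, 3, 5, 12, 26, 60, 61, 181607)

# Keep rides of 60 minutes or less.
# ASSUMPTION: negative durations are treated as data errors and dropped too.
kept <- durations[durations >= 0 & durations <= 60]

print(summary(durations))
print(summary(kept))
```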
Fig10: Box plots of Trip Duration
Fig11: Trip Duration Distribution
Fig12: Trip Duration Distribution separated by Pass Type
Fig13: Trip Duration Distribution separated by Day of the Week
Fig14: Trip Duration Distribution separated by Pass Type and Day of the Week
Overall (Fig11), the trip duration distribution looks roughly Gaussian near its peak at 4-6 minutes, then falls steeply and levels off from about the 32-minute mark.
Faceting by pass type (Fig12), the most common trip duration is 12-13 minutes for the 24-hour pass, 10-13 minutes for the 7-day pass, 4-6 minutes for Annual, 1 minute for Maintenance (quick maintenance rides!) and 6-8 minutes for the Semester pass.
Looking at the plots by day of the week (Fig13), the 4-6 minute range is still the most popular on weekdays but not on weekends, where 10-12 minute trips seem to be more popular. Maybe this is because people are not in a hurry on the weekends.
Combining pass type and day of the week (Fig14), the Annual pass holders’ trip duration pattern doesn’t change much; point 1 still holds true. For the 24-hour pass, trip durations tend toward the upper ranges, 10+ minutes. The 7-day pass doesn’t show a clear pattern in the plots. For Maintenance, 1 minute seems to be the most common turnaround time. For the Semester pass, 6-8 minutes is the most common trip duration.
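The "most common trip duration" per pass type can also be read off numerically rather than from the plots. The sketch below uses invented rows, and the helper `most_common` is a hypothetical name introduced for illustration.

```r
# Toy data: made-up durations for three pass types.
trips <- data.frame(
  Entry.Pass.Type = c("Annual", "Annual", "Annual", "24-hour", "24-hour", "24-hour",
                      "Maintenance", "Maintenance"),
  Trip.Duration..Minutes. = c(5, 5, 9, 12, 12, 30, 1, 1),
  stringsAsFactors = FALSE
)

# Most frequent value in a numeric vector (the smallest value wins on ties).
most_common <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Modal trip duration for each pass type.
modal_duration <- tapply(trips$Trip.Duration..Minutes., trips$Entry.Pass.Type, most_common)
print(modal_duration)
```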
Fig15: Checkout Station Count
Fig16: Checkout Station Count separated by Day of the Week
Fig17: Checkout Station Count separated by Pass Type
Fig18: Trip Duration Distribution separated by Checkout Station
15th & Pearl and 13th and Spruce are the 2 most popular checkout stations in Boulder (Fig15). There is a close tie between the 11th and Pearl and Municipal Building stations. Greenhouse and Gunbarrel North are the least used stations. 14th and Walnut office might be an error, as this location doesn’t have a latitude and longitude listed.
Faceting by day of the week (Fig16), 15th & Pearl is still the most popular checkout station. 13th and Spruce and Municipal Building are the 2nd most popular checkout stations from Mon-Thu, and 11th & Pearl from Fri-Sun.
Analyzing the checkout stations by pass type (Fig17), 15th & Pearl is still the most popular checkout station for all pass types except the Semester pass. For the 24-hour pass, 11th and Pearl is the 2nd most popular checkout station, followed by 19th @ Broadway. The Village seems to be the 2nd most popular station for the 7-day pass. The distribution for the Annual pass closely follows the overall pattern, since this is the most used pass.
One thing to note is the spikes in maintenance (Fig16 & Fig17) at locations like The Village and 26th @ Pearl, which are not in line with the overall checkout station popularity pattern. This might indicate that the bikes at those stations were subject to rougher use or that a batch of bikes had a few defects.
Faceting the trip duration by checkout station (Fig18), there are no major surprises: ride times were in the 6-10 minute range, and the overall pattern across the popular stations is the same.
No surprises from the Return Station analysis: most, if not all, of the points from the previous section apply to this section as well.
Fig19: Return Station Count separated by Pass Type
Fig20: Return Station Count separated by Day of the Week
Fig21: Return Station Count separated by Pass Type
Fig22: Trip Duration Distribution separated by Return Station
As noted in the Pass Type section, the Semester pass type was started only in early 2014. This becomes obvious in the charts faceted by pass type.
Fig23: Checkout Date Distribution
Fig24: Checkout Date Distribution separated by Pass Type
Fig25: Checkout Date Distribution separated by Day of the Week
Fig26: Checkout Date Distribution separated by Pass Type and Day of the Week
The number of checkouts has progressively increased over the years from 2013 to 2016 (Fig23). There is a seasonal pattern in usage: the summer months (May-August) see an increase in checkouts, with a dip on either side. This makes sense, as people tend to ride less in the winter months. Among the summer months, July and August have the most checkouts across the years.
Viewing the plots by the type of the pass (Fig24), we can see that all pass types have seen an increase in usage since Boulder B-cycle was introduced. 7-day pass saw a big increase in the summer of 2015 and the Semester type pass also saw a big increase since it was introduced in early 2014.
Among the Annual pass holders, October of 2015 (Fig24) had more users than any of the warmer months. This is surprising; perhaps October was warm, or there were a lot of events in Boulder that month.
Maintenance generally follows the trend of an increase in the number of instances of maintenance in the summer months and a decrease in the colder months. One anomaly was that April of 2015 had the highest instances of maintenance for that year but it wasn’t the most popular month in terms of ridership. This might indicate that Boulder B-cycle was preparing in advance for the popular summer months. This might be a good guess because the maintenance was lower in the months following April for 2015 across all pass types.
Analysing checkouts by day of the week (Fig25), only Tue-Wed deviate from the general trend that August is the most popular month followed by July. For Tue-Wed, the roles of July and August are reversed.
The multivariate analysis (Fig26) shows finer trends in popular days across months and across pass types, but there are no new points (other than the ones already documented) to note.
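The monthly trends discussed above depend on turning the date strings into real dates first. A minimal sketch, assuming the month/day/year text format seen in the data preview (the sample dates are invented and the variable names are illustrative):

```r
# Made-up checkout dates in the m/d/Y text format used by the dataset.
checkout_dates <- c("5/20/2013", "6/25/2015", "6/26/2015", "8/2/2015", "8/8/2015", "1/3/2016")

# Parse month/day/year strings into Date objects.
parsed <- as.Date(checkout_dates, format = "%m/%d/%Y")

# Label each checkout with its year-month and count checkouts per month.
year_month <- format(parsed, "%Y-%m")
monthly_counts <- table(year_month)

print(monthly_counts)
```

Formatting as `YYYY-MM` keeps the months in chronological order when tabulated, which a plain month name would not.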
Fig27: Checkout Time Distribution
Fig28: Return Time Distribution
Fig29: Checkout Time Distribution separated by Pass Type
Fig30: Return Time Distribution separated by Pass Type
Overall (Fig27/28), 19:00 is the most popular time for riding B-cycles, which might indicate that people like to use the service to grab dinner in Boulder. Other popular times are 18:30, 19:30, 20:00 and, surprisingly, 23:30! There are some rides early in the morning, with negligible usage until 11:00.
Faceting the plots by pass type (Fig29/30) we can observe that there is no common pattern across pass types. Annual Pass type sees peaks and dips throughout the day, with highest peak at 23:30. This has to be one of the most surprising finds from the analysis.
The 24-hour pass type has a Gaussian-like distribution with the peak at 20:00 and a gradual decrease on either side.
Maintenance is highest between 14:30 and 23:00, with big dips on either side of that time range.
Semester and 7-day pass types don’t have any visible patterns. The Semester pass is used the most at 15:00, whereas the 7-day pass is most used at 00:00 (another surprising find).
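The times reported above fall on half-hour marks, which suggests the checkout and return times were grouped into 30-minute bins. That grouping is an assumption here, and the sketch below shows one way to derive such bins from the raw "H:MM:SS AM/PM" strings (sample times are invented; variable names are illustrative).

```r
# Made-up checkout times in the 12-hour text format used by the dataset.
checkout_times <- c("9:24:00 AM", "12:16:00 PM", "12:04:00 AM", "7:05:00 PM", "11:45:00 PM")

# Split "H:MM:SS AM/PM" into its parts without relying on locale-dependent %p parsing.
pattern <- "^([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)$"
parts   <- regmatches(checkout_times, regexec(pattern, checkout_times))
hour12  <- as.integer(sapply(parts, `[`, 2))
minute  <- as.integer(sapply(parts, `[`, 3))
is_pm   <- sapply(parts, `[`, 5) == "PM"

# Convert to 24-hour time: 12 AM becomes hour 0, 12 PM stays 12.
hour24 <- (hour12 %% 12L) + ifelse(is_pm, 12L, 0L)

# Round down to the start of the half-hour bin, e.g. 19:05 -> "19:00".
half_hour_bin <- sprintf("%02d:%02d", hour24, ifelse(minute < 30, 0L, 30L))

print(table(half_hour_bin))
```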
Fig31: Checkout Time Distribution separated by Day of the Week
Fig32: Return Time Distribution separated by Day of the Week
Fig33: Checkout Time Distribution separated by Pass Type & Day of the Week
Fig34: Return Time Distribution separated by Pass Type & Day of the Week
The weekend and weekday Checkout/Return times (Fig31/32) have contrasting patterns. Weekdays see peaks and dips throughout the day, with the highest peak at 23:30 for Mon-Wed-Thu and 19:00 for Tue-Fri (but both peaks are close). On weekends the distribution is Gaussian-like, with peaks at 19:00 on Sun and 19:30 on Sat.
When doing a multivariate analysis (Fig33/34) point 1 in this section and points from the previous section all hold true and there are finer points which can be noted down.
Fig35: Heat Map of Checkout Stations from 2013-2016
Fig36: Heat Map of Return Stations from 2013-2016
The size of the circle represents the overall number of checkouts/returns per station since B-cycle started. From the two maps (Fig35 & Fig36) it is clear that the downtown stations are the most frequently used. The stations just outside downtown and in/near the University are second to the downtown stations in terms of usage.
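Before circles can be drawn on a map, the per-station counts have to be joined to the location dataset. The sketch below shows that preparation step only, not the map itself. The coordinates are hypothetical placeholders, and the `stations` table layout (`Station`, `Latitude`, `Longitude`) is an assumption about the location dataset.

```r
# Toy checkout data (rows invented for illustration).
trips <- data.frame(
  Checkout.Station = c("15th & Pearl", "15th & Pearl", "13th & Spruce", "14th & Walnut office"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL coordinates, for illustration only (not the real station locations).
stations <- data.frame(
  Station   = c("15th & Pearl", "13th & Spruce"),
  Latitude  = c(40.01, 40.02),
  Longitude = c(-105.27, -105.28),
  stringsAsFactors = FALSE
)

# Checkouts per station; this count would drive the circle size.
station_counts <- as.data.frame(table(Station = trips$Checkout.Station),
                                responseName = "Checkouts", stringsAsFactors = FALSE)

# Left join onto the location table; stations without coordinates get NA.
mapped <- merge(station_counts, stations, by = "Station", all.x = TRUE)
missing_coords <- mapped$Station[is.na(mapped$Latitude)]

print(mapped)
print(missing_coords)
```

Using a left join makes stations with no listed coordinates visible as `NA` rows instead of silently dropping them.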
# Set this condition to TRUE to run the machine learning section; it is disabled intentionally to speed up the knitHTML process
if (FALSE)
{
  # caret provides createDataPartition(), trainControl(), train(), resamples() and confusionMatrix()
  library(caret)

  # Add new variables for Checkout.Hour and Return.Hour for machine learning analysis
  # (assumes the time columns are date-time strings such as "2015-06-25 09:24:00", where characters 12-13 hold the hour)
  dataset$Checkout.Hour <- as.factor(substr(dataset$Checkout.Time, 12, 13))
  dataset$Return.Hour <- as.factor(substr(dataset$Return.Time, 12, 13))

  # Remove 'Maintenance' Pass type for classification (as this is not a user pass)
  mlsubset <- droplevels(subset(dataset, dataset$Entry.Pass.Type != 'Maintenance'))

  # Subset the data keeping the necessary variables
  mlsubset <- mlsubset[c(3, 4, 6, 8, 10, 13, 14)]

  # Create an 80/20 train/test partition, stratified by pass type
  trainIndex <- createDataPartition(mlsubset$Entry.Pass.Type, p = 0.8, list = FALSE, times = 1)
  trainingset <- mlsubset[trainIndex, ]
  testset <- mlsubset[-trainIndex, ]

  # Split the training set for 10-fold cross validation, train on 9, test on 1 for all combinations
  # (named cvControl so the object does not shadow caret's trainControl() function)
  cvControl <- trainControl(method = "cv", number = 10)
  metric <- "Accuracy"

  # Evaluate 4 different algorithms, make sure the same seed is used
  # Linear Discriminant Analysis
  set.seed(7)
  fit.lda <- train(Entry.Pass.Type ~ ., data = trainingset, method = "lda",
                   metric = metric, trControl = cvControl)

  # Classification and Regression Tree
  set.seed(7)
  fit.cart <- train(Entry.Pass.Type ~ ., data = trainingset, method = "rpart",
                    metric = metric, trControl = cvControl)

  # Naive Bayes
  set.seed(7)
  fit.nb <- train(Entry.Pass.Type ~ ., data = trainingset, method = "nb",
                  metric = metric, trControl = cvControl)

  # Random Forest (Bagged Decision tree)
  set.seed(7)
  fit.rf <- train(Entry.Pass.Type ~ ., data = trainingset, method = "rf",
                  ntree = 100, metric = metric, trControl = cvControl)

  # Summarize accuracy of models
  results <- resamples(list(lda = fit.lda, cart = fit.cart, nb = fit.nb, rf = fit.rf))
  summary(results)

  # Dot plot of the results
  dotplot(results)

  # Compare the best model (random forest) against the held-out test set
  predictions <- predict(fit.rf, testset)
  confusionMatrix(predictions, testset$Entry.Pass.Type)
}
This section explored whether machine learning algorithms could use the numeric data (trip duration and the derived Checkout/Return hour variables) to predict the pass type with high accuracy. Four algorithms were tested (LDA, CART, Naive Bayes and Random Forest). Among the 4, Random Forest had the best results, but its accuracy was still low (<75%). Because trip duration and the derived hour variables were the only numeric data, the algorithms did not reach high accuracy even after the ‘Maintenance’ pass type data was removed.
Had Boulder B-cycle provided another numeric variable, such as the distance covered during each trip, it would probably have helped improve the classification accuracy.
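Accuracy on its own can look better than it is when one pass type dominates, which is why caret's `confusionMatrix()` also reports Cohen's kappa. As a sketch of how the two numbers relate, the base R example below computes both from a confusion matrix. The matrix values are invented for illustration and are not results from this project.

```r
# MADE-UP confusion matrix: rows are predictions, columns are the true pass types.
cm <- matrix(c(50, 10,  5,
               15, 40,  5,
                5,  5, 15),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("24-hour", "Annual", "7-day"),
                             Actual    = c("24-hour", "Annual", "7-day")))

# Accuracy: share of predictions on the diagonal.
n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = kappa_stat), 3))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, so it is the stricter of the two measures for imbalanced classes.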
A lot of observations were made in this document; the major finds from the dataset are listed here.
The Annual pass is the most used pass type on weekdays, and the 24-hour pass on weekends.
Overall, Friday is the most popular day for ridership.
4-6 minutes is the most common trip duration on weekdays, and 10-12 minutes on weekends. Maintenance rides are usually quick (1-2 minutes).
Downtown Stations are most frequently used in Boulder, followed by the stations in/near CU Boulder and just outside downtown.
The summer months are most popular for usage (October of 2015 was an anomaly). The winter months see considerably lower usage.
Boulder B-cycle usage saw a growth in Boulder since 2013, with 2016 seeing a big increase in usage.
The Checkout/Return times threw a surprise, with late-night peaks in Checkouts/Returns: 23:30 among Annual pass holders and 00:00 among 7-day pass holders.
The Machine Learning algorithm (Random Forest) accuracy is < 75% with Kappa < 50%. The algorithm’s performance could be improved if other numerical data, like the distance covered during each trip, were available.