Executive Summary
This project analyzes the Nice Ride MN bike share program's 2016-2017 seasons alongside historical daily weather data in order to predict daily ridership. The primary output of the analysis - a reliable estimate of daily bikeshare use based on historical behavior and its correlation with weather - applies to the recurring operational problems of when to redistribute bikes among docking stations and how to schedule maintenance around user behavior. Addressing these issues can lead to greater operational efficiency, lower costs, and higher revenue for Nice Ride MN.
Questions covered in the analysis:
- How does weather - temperature, precipitation, humidity, and wind speed - affect bike use?
- What are the busiest docking stations and how is this represented from a geospatial perspective?
- How does bike use vary between weekdays and weekends?
- How does use behavior differ between members and nonmembers? Behavioral questions include the mean, median, mode, and range for ride length, ride distance, and day of the week.
Potential business applications for this research:
- Optimization model for maintaining bike stock at docking stations based on demand forecasting
- Predictive model for volume of bike use based on a combination of weather variables
- Exploration of shifting from flat rate pricing structure to dynamic pricing for revenue optimization
- Predictive model for best future locations of bike docking stations
The project analysis occurs in three stages: (1) data wrangling, (2) exploratory/statistical data analysis, and (3) predictive modeling. Findings from this analysis include:
- Riders with membership subscriptions comprise the majority of daily use, tend to utilize the bikeshare during work commute hours, and have a higher tolerance for unfavorable weather conditions.
- Bikeshare volume density is greatest in downtown Minneapolis and less utilized in St. Paul.
- Bike ride volume is comparable between the two seasons analyzed; however, outside factors (road construction) might have contributed to minor differences between the seasons.
- Daily average temperature has the largest influence on bike use, followed by humidity, precipitation, and wind speed.
Data Wrangling
Datasets and Descriptions
Nice Ride MN provides annual datasets for all bike rental activity and dock station characteristics.
Ride Activity Data Preview - Nice_Ride_trip_history_2017_season.csv
| Column | Description |
|---|---|
| Start date | Date and time the rental began |
| Start station | Descriptive name for the station where the rental began; changes if the station is moved |
| Start terminal | Logical name for the station/terminal where the rental began; does not change when the station is moved |
| End date | Date and time the rental ended |
| End station | Descriptive name for the station where the rental ended |
| End terminal | Logical name for the station/terminal where the rental ended |
| Total duration | Total length of the rental, in seconds |
| Account type | Values are Member or Casual; Members are users who have an account with Nice Ride, Casuals are walk-up users who purchased a pass at the station in half-hour increments |
Station Location Characteristics - Nice_Ride_2017_station_locations.csv
| Column | Description |
|---|---|
| Terminal | Logical name of station; matches Start terminal / End terminal in trip history |
| Station | Station name used on maps, XML feed, and station poster frame; matches Start station / End station in trip history |
| Latitude | Station location, decimal latitude |
| Longitude | Station location, decimal longitude |
| Nb Docks | Total number of bike docking points at the station; indicates station size |
Local climatological data are available from the National Centers for Environmental Information’s Integrated Surface Data (ISD) dataset (NOAA Weather Data Library).
2017 Local Climatological Data, Daily Averages
| Column | Description |
|---|---|
| Station | Station identification number |
| Station name | Name of station |
| Elevation | Station elevation |
| Latitude | Latitude of station |
| Longitude | Longitude of station |
| Date | Date of recorded observations |
| Report type | Reporting method characteristics |
| Average daily dry bulb temp F | Dry bulb measured temperature in degrees Fahrenheit |
| Average daily relative humidity | Humidity level |
| Average daily wind speed | Wind speed in miles per hour |
| Average daily precipitation | Precipitation in inches |
Key Identifications
Dataset Strengths: Joining the three datasets described above produces a rich dataset for understanding public bike share behavior under two differing price structures - memberships and casual rides. Exploration of the data shows differing use behavior for casual and member customers based on day of the week, time of day, and responses to weather scenarios. Analyzing bike use behavior and its correlation with weather scenarios yields insights for optimal maintenance scheduling and for potential price restructuring aimed at revenue and growth.
Dataset Limitations/Caveats: Limitations of the dataset include the effects of historical city construction projects. The data cover two bike seasons, 2016 and 2017, each spanning early April to early November. During exploratory data analysis, 2017 showed greater biking volume than 2016. Research and domain knowledge suggest there was a greater volume of city construction projects in 2016 than in 2017; this is an uncontrollable variable in the analysis.
The dataset contains over 800,000 bike ride observations. During the analysis, roughly 1,500 observations were identified as outliers on the trip duration variable. They represent less than one percent of total observations and were removed prior to conducting statistical tests.
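As a rough illustration of this filtering step (the exact outlier rule used in the project is not restated here), a 1.5 * IQR upper fence on trip duration could be applied as follows, assuming a hypothetical trips data frame:

```r
# Illustrative outlier screen on trip duration; the cutoff rule is an
# assumption (1.5 * IQR upper fence), not the project's documented rule.
dur <- trips$Total.duration
upper.fence <- quantile(dur, 0.75, na.rm = TRUE) + 1.5 * IQR(dur, na.rm = TRUE)
trips <- trips[!is.na(dur) & dur <= upper.fence, ]
```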
Data Preparation Summary
The Nice Ride MN bikeshare datasets required little data wrangling beyond renaming a few columns, formatting the date and time columns to match the weather data, and performing a full join to attach the dock station data to the trip history data. Joining bikeshare volume with weather averages proved to be a more challenging task. Initially the data were formatted around hourly weather and riding observations; however, the hourly weather data contained multiple observations per hour, causing excessive noise. A more reliable and consistent dataset was created by joining daily weather averages with the ride observations.
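A minimal sketch of this wrangling flow is shown below; the object and column names (trips, stations, weather.daily, Start.date) are illustrative and may differ from the actual project files:

```r
library(dplyr)
library(lubridate)

# Attach dock station characteristics to every trip via the start terminal
trips.full <- trips %>%
  full_join(stations, by = c("Start.terminal" = "Terminal"))

# Reduce trip timestamps to a calendar date so they can match daily weather
trips.full <- trips.full %>%
  mutate(Ride.date = as.Date(mdy_hm(Start.date)))

# Join daily weather averages onto each ride by date
trips.weather <- trips.full %>%
  left_join(weather.daily, by = c("Ride.date" = "Date"))
```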
Predictive Modeling: Machine Learning Application
Modeling Strategies Considered:
The main question for this project is: how does a given daily weather scenario affect bike share use by casual and member riders? Framing this as a machine learning problem means modeling the relationship between weather and ridership from historical data so the model can be applied to future weather scenarios in the next riding season.
We start the predictive modeling under the premise that this problem is supervised and takes the form of a regression. Second, we consider hierarchical clustering as a preliminary step, knowing that daily bike ride density varies throughout the system. It might be best to cluster stations based on the mean distance between all stations and run a regression for each cluster. Initial attempts at this approach are likely to yield poor results due to the previously observed heteroskedasticity of the weather variables in relation to daily counts.
A third approach is considered: random forests. Applying random forests to the full dataset, with the cluster group included as a variable, might provide the best outcome under the current scope of this project. Random forest models draw random samples from the dataset with replacement and use a regression tree approach in place of the linear form. The model converges on an optimal regression tree based on the dominant outcome across the numerous random samples. Using the train function from the ‘caret’ package, the model tunes the number of predictor variables considered at each split (mtry), selecting the value that yields the highest Rsquared predictive power while guarding against overfitting.
Main features (predictors) based on EDA results
For the first phase, clustering, the important predictors are the geospatial data points - latitude and longitude. Stations are grouped around centroid points based on the distance properties derived from these coordinates.
For the second phase, random forests, the cluster variable is considered alongside time variables, the four main daily weather variables, and the ride account type:
- Year, Month, and Day
- Day of the Week
- Account Type
- Temperature
- Wind Speed
- Precipitation
- Humidity
Properties of Model Evaluation
A random sampling of 80% of the dataset is used for training the model and the remainder is used to test it. Rsquared and the root mean squared error (RMSE) serve as measures of the accuracy of the random forest regression modeling. Maximizing Rsquared and minimizing the RMSE as a proportion of the dependent variable's mean will yield a reliable model for predicting daily bike volume.
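The split and the RMSE-to-mean ratio can be sketched as follows, assuming the NiceRide.Counts modeling data frame assembled later in this report, with daily count column n:

```r
library(caret)

set.seed(123)                                        # reproducible sampling
train.idx <- createDataPartition(NiceRide.Counts$n, p = 0.80, list = FALSE)
NR.train  <- NiceRide.Counts[train.idx, ]
NR.test   <- NiceRide.Counts[-train.idx, ]

# RMSE expressed as a proportion of the observed mean daily count
rmse.prop <- function(pred, obs) sqrt(mean((pred - obs)^2)) / mean(obs)
```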
Modeling Caveat
This phase of the analysis began with the intention of producing a model for each bike station cluster; however, due to time constraints, and because a model for the full dataset is satisfactory in itself, the clustering serves as a feature rather than a separate modeling path.
Load Modeling Data
Geospatial Exploration
Observe the busiest starting stations for Minneapolis-St. Paul in order to estimate the number of clusters (k) to use for the additional variable.
Reviewing the density of bike station use on the city map, it is clear that the majority of activity occurs in Minneapolis, while activity in St. Paul concentrates along the most popular residential streets and downtown. We estimate the optimal clustering at somewhere between 6 and 10 clusters.
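One way to produce such a density map is with the leaflet package; this is a sketch only, and the station.counts summary frame and its columns (Latitude, Longitude, Station, rides) are hypothetical:

```r
library(leaflet)

# Circle size scales with start-station ride volume
leaflet(station.counts) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~Longitude, lat = ~Latitude,
                   radius = ~sqrt(rides) / 10,
                   label  = ~Station,
                   fillOpacity = 0.6, stroke = FALSE)
```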
Calculate distance data frame from start station coordinates for clustering and plot on same map
Using the geocoordinates for each bike station, we calculate the pairwise distances between every possible combination of bike stations. We determine a candidate number of clusters from this output by applying the mean distance between station pairs as our cutoff point.
Clustering the stations based on the mean distance between stations resulted in eight clusters.
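A sketch of this clustering step follows, assuming a stations data frame with Latitude and Longitude columns; the hierarchical clustering call and the mean-distance cutoff reflect the approach described above:

```r
# Pairwise distances between stations from their coordinates
station.xy   <- stations[, c("Latitude", "Longitude")]
station.dist <- dist(station.xy)

# Hierarchical clustering, cut at the mean pairwise distance
hc <- hclust(station.dist)
stations$Cluster <- factor(cutree(hc, h = mean(as.numeric(station.dist))))

table(stations$Cluster)   # in this analysis, eight clusters
```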
We will use the cluster station IDs to join the clustering to our original dataset. There is an argument that the cluster variable should not be included in the modeling. We argue for its inclusion because clusters with greater daily bike count density carry higher weight in the modeling process. High-density stations are laid out more closely in the city and are likely to be areas of greater operational concern, as they harbor the most use.
Daily ride counts
Add a column for the count (n) per day by account type - member or casual. This is our dependent variable for modeling. We clean up the dataset prior to modeling by dropping all variables except the cluster, date/time, account type, daily weather averages, and daily counts.
Check dataset structure prior to modeling
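A sketch of this aggregation, assuming the hypothetical trips.weather and stations objects from the earlier sketches and illustrative weather/date column names:

```r
library(dplyr)

NiceRide.Counts <- trips.weather %>%
  # attach the station cluster ID via the start terminal
  left_join(select(stations, Terminal, Cluster),
            by = c("Start.terminal" = "Terminal")) %>%
  # daily ride count per cluster and account type: the dependent variable n
  count(Cluster, Ride.date, Account.type) %>%
  # re-attach daily weather averages and derive calendar features
  left_join(weather.daily, by = c("Ride.date" = "Date")) %>%
  mutate(Weekday = lubridate::wday(Ride.date, label = TRUE))
```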
This step ensures factor variables are correct.
Random Forest Modeling
We will run a random forest model on 80% of the NiceRide.Counts data and test for prediction accuracy with the remaining 20%:
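A minimal sketch of the kind of caret call behind this first model (object names assumed from the earlier sketches; model 1 holds mtry at 2 and uses 50 trees):

```r
library(caret)

set.seed(123)
start <- Sys.time()
rf.model1 <- train(n ~ .,
                   data     = NR.train,
                   method   = "rf",
                   ntree    = 50,
                   tuneGrid = data.frame(mtry = 2))   # fix mtry at 2
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
paste("This took", round(elapsed, 1), "seconds")
rf.model1   # prints the resampling summary shown below
```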
## [1] "This took 33.1 seconds"
## Random Forest
##
## 5437 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 108.5483 0.7864496 68.13897
##
## Tuning parameter 'mtry' was held constant at a value of 2
Model 1 has an Rsquared of 0.79 and an RMSE of 108.5 - a decent first model. However, the RMSE will be more interpretable if expressed as a proportion of the mean daily ride count (n). We make this conversion for the predictions on our test dataset.
Test model 1 predictive ability on test dataset
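A sketch of this test-set scoring, assuming the rf.model1 and NR.test objects from the earlier sketches:

```r
# Predict daily counts on the held-out 20% and compute RMSE
pred.1 <- predict(rf.model1, newdata = NR.test)
RMSE.1 <- sqrt(mean((pred.1 - NR.test$n)^2))
RMSE.1
```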
## [1] 100.495
The root mean squared error is ~100 bike rides per day; convert this to a percentage of the mean of (n) to better interpret model accuracy.
Divide by mean of daily ride count(n) for interpretation as percentage of the mean
## [1] 0.8017891
This is a poor outcome: the error rate amounts to 80% of the test set mean. Re-run the model and allow the ‘caret’ package to choose how many predictors to consider at each split (mtry):
Run model 2 removing tuneGrid component
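A sketch of the change for model 2 - the same call with the tuneGrid argument removed so caret tunes mtry itself (model 3 later repeats this call with ntree = 100):

```r
set.seed(123)
rf.model2 <- train(n ~ ., data = NR.train, method = "rf", ntree = 50)
rf.model2   # caret evaluates several mtry values and keeps the best
```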
## [1] "This took 371.6 seconds"
## Random Forest
##
## 5437 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 108.05501 0.7913703 67.79845
## 11 58.09365 0.9084784 30.24016
## 21 59.40336 0.9041902 30.52005
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
Model 2 produced a much better outcome once caret was allowed to determine mtry automatically. After a similar review of the model's predictive ability, let's consider model 3 with double the ntree parameter.
Test model 2 predictive ability on test dataset
## [1] 48.25269
## [1] -0.51985
Allowing ‘caret’ to find the optimal mtry value reduced the RMSE by 52%. Next, check RMSE in proportion to the mean of NR.test$n:
Divide by mean of daily ride count(n) for interpretation as percentage of the mean
## [1] 0.384979
The result for RMSE.2 is much smaller in proportion to the mean of NR.test$n. We run one final model with 100 trees instead of 50:
Run model 3 doubling ntree parameter
## [1] "This took 786.7 seconds"
## Random Forest
##
## 5437 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 107.48863 0.7990741 67.49143
## 11 57.66042 0.9099156 29.97403
## 21 59.19493 0.9048651 30.42767
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
Test model 3 predictive ability on test dataset
## [1] 48.67531
## [1] 0.0087584
Model 3 provided only a slight improvement in Rsquared at mtry = 11, and on the test set RMSE.3 is actually just under 1% higher than RMSE.2. Not much improvement for considerably more processing time.
Divide by mean of daily ride count(n) for interpretation as percentage of the mean
## [1] 0.3883508
A slightly less desirable RMSE-to-mean proportion for model 3 compared to model 2.
We will consider model 2 our final model and plot the error rate in relation to the number of trees:
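One way to draw this plot (a sketch; finalModel is the randomForest object caret stores inside the train result):

```r
# Error against number of trees for the final model
plot(rf.model2$finalModel, main = "Final model: error vs. number of trees")
```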
This plot confirms that not much improvement occurs when ntree = 100 vs. ntree = 50.
Plot variables in relation to RMSE contribution
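A sketch of this plot using caret's varImp():

```r
# Relative importance of each predictor in the final random forest
plot(varImp(rf.model2))
```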
The variable importance plot points to a limitation of the random forest approach when applied to the entire dataset rather than to subsets based on clustering: the train function assigns importance to the cluster groups that outweighs the contribution of the weather variables.
Conclusion
The machine learning application for this business problem originally began with the intention of clustering the dataset based on the mean distance between bike stations and applying a linear regression to each cluster. The approach shifted to random forests after observing heteroskedasticity in the relationship between the weather variables and the dependent daily ride count, which caused poor linear regression performance. The random forest model with 50 trees and mtry = 11 provided a much higher Rsquared (0.90 in place of ~0.70). However, it might be ideal to apply eight separate models to data subsets based on the clustering, giving higher consideration to the weather variables and their influence on daily ride counts within each cluster rather than across the whole system. Modeling at this level is beyond the scope of the current application and can be considered in future analysis.
Regardless, this analysis has provided several important insights:
- Rider behavior shows statistically significant differences that ought to be considered in any planned re-evaluation of subscription types and pricing, especially in light of RFPs for dockless systems.
- Temperature plays a large role in daily bike count volume. Nice Ride MN might consider promotional strategies around changes in temperature, humidity, precipitation, and wind speed.
- This model could be used to build a dashboard interface for predicting daily ridership from combinations of weather variables and thresholds. From this data product, Nice Ride MN promotional efforts could be planned, such as discounted riding during certain parts of the season.
- More work would need to be done to provide insight and/or support for daily operational activities such as bike station rebalancing. This is the biggest limitation of this analysis; running models on each cluster might improve the outcome.