April 01, 2018

Questions to Consider

  • How do temperature, precipitation, humidity, and wind speed affect daily bike use?
  • What are the busiest docking stations and how is this represented from a geospatial perspective?
  • How does bike use vary between weekdays and weekends?
  • How does user behavior differ between members and casual riders?

Introduction

Main Objective:

  • This project analyzes the 2016 and 2017 seasons of the Nice Ride MN bike share program alongside historical daily weather data to predict daily ridership.

  • The results of this analysis can inform a strategy for redistributing bikes among docking stations and for timing maintenance, leading to lower costs and higher revenue.

Project Components:

  • The project analysis occurs in three stages: (1) data wrangling, (2) exploratory/statistical data analysis, and (3) predictive modeling.

Executive Summary

General Discoveries:

  • Members account for the majority of daily use, with bikes used largely for commuting to and from work
  • Members have a higher tolerance than casual riders for unfavorable weather conditions
  • Ride density is greatest in downtown Minneapolis and less in St. Paul
  • Ride volume is comparable between the two seasons; however, outside factors (such as road construction) might contribute to seasonal variance
  • Daily average temperature is the largest weather influence on bike use, followed by humidity, precipitation, and wind speed

Background and Description of Bikeshare Data Sets

Nice Ride MN History:

  • Nice Ride MN was formed in 2008 from a City of Minneapolis initiative, the Twin Cities Bike Share Project. Since 2015, the system has included over 1700 bikes and 190 stations with over 450,000 annual rides. Source: Nice Ride MN - About

Bike Share Datasets: available online from Nice Ride MN

  • Ride history data includes: start and end date/time, start and end stations, total duration of trip, and account type
  • Bike station data includes: terminal number, station description, longitude, latitude, and number of docks

Description of Weather Data Sets

Daily Average Weather Dataset: available online from NOAA Weather Data Library

  • Daily weather data includes averages for: temperature (°F), relative humidity, wind speed (mph), and precipitation (inches)

Data Processing

  • This step matched each ride to its station details and joined the daily weather measurements to every day of ride observations
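The matching step described above can be sketched as two lookups followed by a daily count. This is a minimal illustration in Python rather than the R used for the original analysis; the record layout, field names, and sample values are all hypothetical and are not taken from the Nice Ride MN or NOAA files.

```python
from collections import Counter

# Hypothetical ride records: (start_date, start_terminal, account_type)
rides = [
    ("2016-04-04", "30000", "Member"),
    ("2016-04-04", "30000", "Casual"),
    ("2016-04-04", "30001", "Member"),
    ("2016-04-05", "30001", "Member"),
    ("2016-04-05", "39999", "Member"),  # terminal missing from the station table
]

# Hypothetical station details keyed by terminal number
stations = {
    "30000": {"latitude": 44.98, "longitude": -93.27, "docks": 15},
    "30001": {"latitude": 44.95, "longitude": -93.10, "docks": 19},
}

# Hypothetical daily weather keyed by date
weather = {
    "2016-04-04": {"avg_temp_f": 38, "avg_wind_mph": 12.1, "precip_in": 0.0},
    "2016-04-05": {"avg_temp_f": 45, "avg_wind_mph": 9.4, "precip_in": 0.2},
}


def build_daily_table(rides, stations, weather):
    """Count rides per (date, account type) and attach that day's weather."""
    counts = Counter()
    for ride_date, terminal, account in rides:
        # Keep only rides that match both a known station and a weather day
        if terminal in stations and ride_date in weather:
            counts[(ride_date, account)] += 1
    return [
        {"date": d, "account_type": a, "n": n, **weather[d]}
        for (d, a), n in sorted(counts.items())
    ]


daily = build_daily_table(rides, stations, weather)
for row in daily:
    print(row)
```

Rides that fail either lookup are dropped here for simplicity; a real pipeline would more likely log or inspect them before discarding.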

Monthly Trip Volume Variance

Trip distribution shows the greatest year-over-year variance during the summer months.

Trip Distribution by Day of the Week

Member riders use the bikeshare mainly for commuting. Casual riders are more likely to use the bikeshare for leisure.
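One simple way to check a commuting-versus-leisure pattern is to compare the share of each account type's rides that fall on a weekend. The sketch below is in Python for illustration (the original analysis appears to have been done in R), and the ride list is made up; only the fact that 2016-04-04 was a Monday comes from the dataset preview in this report.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rides: (start date, account type)
sample_rides = [
    (date(2016, 4, 4), "Member"),   # Monday
    (date(2016, 4, 5), "Member"),   # Tuesday
    (date(2016, 4, 9), "Member"),   # Saturday
    (date(2016, 4, 6), "Casual"),   # Wednesday
    (date(2016, 4, 9), "Casual"),   # Saturday
    (date(2016, 4, 10), "Casual"),  # Sunday
]


def weekend_share(rides):
    """Return the fraction of each account type's rides taken on Sat/Sun."""
    totals = defaultdict(int)
    weekend = defaultdict(int)
    for ride_date, account in rides:
        totals[account] += 1
        if ride_date.weekday() >= 5:  # Monday is 0, so 5 and 6 are Sat/Sun
            weekend[account] += 1
    return {account: weekend[account] / totals[account] for account in totals}


shares = weekend_share(sample_rides)
print(shares)
```

A markedly higher weekend share for casual riders than for members would be consistent with the leisure-versus-commute reading above.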

Yearly Trip Variance by Account Type

Member rides dominate casual rides year-over-year, with peak volume in the summer months. Various casual ride outliers suggest increased use on holidays and event days.

Temperature Influence On Ride Volume

An increase in average daily temperature is related to an increase in daily bike use.

Wind Speed Influence on Bike Count

As average daily wind speed increases, bike use declines.

Relative Humidity Influence On Bike Count

As average daily relative humidity increases, bike use declines.
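The direction of each weather relationship described above can be summarized with a correlation coefficient. The sketch below computes Pearson's r in plain Python (not the R used in the original analysis); the weather and ride values are invented for illustration and are not the project's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily averages and ride counts
avg_temp_f = [40, 50, 60, 70, 80]
avg_wind_mph = [14, 12, 9, 8, 6]
daily_rides = [300, 900, 1500, 2400, 3000]

r_temp = pearson(avg_temp_f, daily_rides)
r_wind = pearson(avg_wind_mph, daily_rides)
print(f"temperature vs rides: {r_temp:+.2f}")
print(f"wind speed vs rides:  {r_wind:+.2f}")
```

A positive coefficient matches "use rises with temperature" and a negative one matches "use declines with wind or humidity"; the coefficient only captures linear association, so it complements the plots rather than replacing them.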

EDA Summary

Compared with casual riders, member riders appear more willing to ride in less favorable weather conditions:

  • Lower temperatures
  • Higher wind speed
  • Higher precipitation
  • Higher relative humidity

Geospatial Considerations Before Modeling

  • Mapping the density of bike station use makes it clear that certain areas carry greater ride density. Including this grouping variable in the modeling might improve predictive ability.

Finding the optimal station grouping for predictive modeling:

Using the average distance between all possible bike station pairings results in the following grouping:
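One reading of "average distance between all possible station pairings" is average-linkage hierarchical clustering, where two groups are merged when the mean distance over all cross-group station pairs is smallest. The source does not confirm that this is the exact method used, so the Python sketch below is an assumption; the coordinates are hypothetical, and the choice of three groups mirrors the three `clust` values in the dataset preview.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def average_linkage(points, k):
    """Merge clusters by smallest mean pairwise distance until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dists = [haversine_km(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical station coordinates (lat, lon)
station_coords = [
    (44.977, -93.265), (44.980, -93.270),  # two stations close together
    (44.973, -93.235), (44.975, -93.230),  # a second nearby pair
    (44.950, -93.095), (44.947, -93.090),  # a distant pair
]
labels = average_linkage(station_coords, k=3)
print(labels)
```

This brute-force version recomputes every pairwise distance on each pass, which is fine for a couple of hundred stations but would need a cached distance matrix for larger inputs.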

Refined dataset for modeling:

## # A tibble: 6 x 11
## # Groups:   clust [3]
##   clust Start_DoWeek Start_Year Start_Month Start_Day Account_Type
##   <int> <chr>             <int>       <int>     <int> <chr>       
## 1     1 Mon                2016           4         4 Casual      
## 2     1 Mon                2016           4         4 Member      
## 3     2 Mon                2016           4         4 Casual      
## 4     2 Mon                2016           4         4 Member      
## 5     3 Mon                2016           4         4 Casual      
## 6     3 Mon                2016           4         4 Member      
## # ... with 5 more variables: Avg_Temp <int>, Avg_Wind <dbl>, Precip <dbl>,
## #   Rel_Humidity <dbl>, n <int>

Random Forests Modeling

  • The modeling approach builds many decision trees, each fit to a random sample of the data
  • Predictions from the individual trees are combined (averaged, for a numeric outcome such as daily ride count) to produce the model's prediction
  • 80% of the dataset is used to train the model
  • Model performance is tested on the remaining 20%; both sets are selected at random to reduce the chance of biased outcomes
  • Three RF models are run with the goal of producing the greatest R-squared (predictive fit) and the lowest Root Mean Squared Error (prediction error)
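The random 80/20 split and the two evaluation metrics named above can be sketched as follows. This is a plain-Python illustration, not the R workflow behind the summaries in this report; the function names and the seed are hypothetical, and the R-squared shown is the common 1 - SS_res/SS_tot form, which may differ from the definition a given modeling package reports.

```python
import random
from math import sqrt


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root Mean Squared Error: typical size of a prediction miss."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of variance in the actual values explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


train, test = train_test_split(range(100))
print(len(train), len(test))

# Made-up actual and predicted daily ride counts
actual = [120, 340, 560, 410]
predicted = [100, 360, 520, 450]
print(round(rmse(actual, predicted), 2), round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the units of the outcome (rides per row), so it is easiest to interpret next to the typical size of the daily counts.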

Model 1 Summary:

## Random Forest 
## 
## 5437 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   108.5483  0.7864496  68.13897
## 
## Tuning parameter 'mtry' was held constant at a value of 2
  • Model 1 has a predictive fit (R-squared) of about 79% on the resampled training data, with two predictors considered at each tree split (mtry = 2)
## [1] 0.8017891
  • Model 1 is a poor performer, with an error rate of about 80% for the test dataset prediction
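In a random forest, `mtry` is the number of predictors randomly drawn as candidates at each split, so `mtry = 2` means each split chooses between only two of the available predictors. The Python sketch below illustrates just that sampling step; the predictor names are taken from the dataset preview in this report, while the function itself is hypothetical and is not part of any library.

```python
import random

# Predictor columns as listed in the refined dataset preview
PREDICTORS = [
    "clust", "Start_DoWeek", "Start_Year", "Start_Month", "Start_Day",
    "Account_Type", "Avg_Temp", "Avg_Wind", "Precip", "Rel_Humidity",
]


def split_candidates(predictors, mtry, rng):
    """Draw mtry distinct predictors to be considered at one tree split."""
    return rng.sample(predictors, mtry)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for node in range(3):
    print(f"split {node}: candidates = {split_candidates(PREDICTORS, 2, rng)}")
```

With so few candidates per split, a strong predictor is often unavailable when a split is chosen, which is one plausible reason a small `mtry` can underperform.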

Model 2 Summary:

## Random Forest 
## 
## 5437 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE     
##    2    108.05501  0.7913703  67.79845
##   11     58.09365  0.9084784  30.24016
##   21     59.40336  0.9041902  30.52005
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
  • Model 2 has a predictive fit (R-squared) of over 90% on the resampled training data, a great improvement over Model 1
## [1] 0.384979
  • The error rate for the test prediction of Model 2 (about 0.38) is less than half that of Model 1 (about 0.80)

Model 3 Summary:

## Random Forest 
## 
## 5437 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5437, 5437, 5437, 5437, 5437, 5437, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE     
##    2    107.48863  0.7990741  67.49143
##   11     57.66042  0.9099156  29.97403
##   21     59.19493  0.9048651  30.42767
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
## [1] 0.3883508
  • Model 3 improves the predictive fit (R-squared) by only about 0.0014 over Model 2 (0.9099 vs. 0.9085) and takes much longer to process
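The comparison between Models 2 and 3 can be reproduced directly from the resampling tables above. The short Python sketch below copies the RMSE and R-squared values from those tables, selects the `mtry` with the smallest RMSE (the selection rule stated in the output), and computes the change between the two models; the variable names are illustrative only.

```python
# mtry -> (RMSE, Rsquared), copied from the Model 2 and Model 3 summaries
model_2 = {2: (108.05501, 0.7913703), 11: (58.09365, 0.9084784), 21: (59.40336, 0.9041902)}
model_3 = {2: (107.48863, 0.7990741), 11: (57.66042, 0.9099156), 21: (59.19493, 0.9048651)}


def best_mtry(results):
    """Pick the mtry with the smallest RMSE."""
    return min(results, key=lambda m: results[m][0])


best_2 = best_mtry(model_2)
best_3 = best_mtry(model_3)

rmse_drop = model_2[best_2][0] - model_3[best_3][0]
r2_gain = model_3[best_3][1] - model_2[best_2][1]

print(f"Model 2 best mtry: {best_2}, Model 3 best mtry: {best_3}")
print(f"RMSE reduction: {rmse_drop:.5f}")
print(f"R-squared gain: {r2_gain:.7f}")
```

Both models select `mtry = 11`, and the gain from Model 3 is small relative to the gap between `mtry = 2` and `mtry = 11` within either model.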

Error Rate in Relation to the Number of Trees:

This plot confirms that little improvement occurs when the number of trees is increased from 50 to 100.

Predictor Variable Ranking

  • Account type is the most important predictor, followed by the dense station groups and average daily temperature
  • Other weather predictors play less of a role

Conclusions

  • Random forest modeling handles wide variation in the weather variables well
  • Linear regression was the originally intended approach, but it would not work well for this dataset

Applications of modeling outcome:

  • Account type is a significant predictor: user behavior differs between members and casual riders
  • Different days of the week can influence bike use
  • Weather variables are more influential on casual riders
  • Consider all of the above and the project results when re-evaluating pricing and/or promotional efforts to increase ridership

Further research opportunities:

  • Modeling each station grouping individually is likely to yield more reliable predictions to support bike rebalancing and maintenance

Files for Capstone Project are available online: