Question and Background:

Underlying Question: Can we predict how many people will rent a bike on a given day based on weather factors?

The Data

The data we are using to create a model to predict the amount of use a bike share will get on a certain day includes factors such as day of the week, weather, temperature, humidity, wind speed, etc. Weather data about the days that people used bikes in Washington, D.C. help to draw connections between specific factors and the amount of business a bike share company is getting. From the perspective of a bike share company attempting to effectively place bikes in Washington, D.C., this model could be helpful in understanding what level of demand for bikes there is in different seasons, on different days of the week, or just on different weather days. With this information a company could optimize income while efficiently supplying communities with the most appropriate number of bikes.

Previous Analyses

Multiple analyses have been done on the effect of weather on bike share use. Bean et. al. found that bike share use is highly dependent on the day of the week and the amount of precipitation, as well as the temperature. Similarly, Quach and Malekian found that the most important variables were temperature and precipitation, using a clustering strategy to determine that there were significant differences in usage between the different weather conditions.

Sources:
Richard Bean, Dorina Pojani, Jonathan Corcoran, “How does weather affect bike share use? A comparative analysis of forty cities across climate zones”, Journal of Transport Geography, Volume 95, 2021, 103155, ISSN 0966-6923, https://doi.org/10.1016/j.jtrangeo.2021.103155. (https://www.sciencedirect.com/science/article/pii/S0966692321002088)
Quach, Jessica, and Reza Malekian. “Exploring the Weather Impact on Bike Sharing Usage through a Clustering Analysis.” ArXiv.org, 17 Aug. 2020, https://arxiv.org/abs/2008.07249.

Approach

We decided to approach this question using a decision tree model to simplify the decision slightly and attempt to determine if the bike share usage was above or below the 75th percentile of daily users on any given day based on the factors presented. Using this type of model as opposed to something that has already been used to analyze the effect of weather on bike share usage might give us even more insight into the important factors that would help a company determine how many bikes to have in certain locations on certain days.

Exploratory Data Analysis

##      Date               Season           Hour          Holiday       
##  Length:17379       Min.   :1.000   Min.   : 0.00   Min.   :0.00000  
##  Class :character   1st Qu.:1.000   1st Qu.: 6.00   1st Qu.:0.00000  
##  Mode  :character   Median :2.000   Median :12.00   Median :0.00000  
##                     Mean   :2.485   Mean   :11.55   Mean   :0.02877  
##                     3rd Qu.:3.000   3rd Qu.:18.00   3rd Qu.:0.00000  
##                     Max.   :4.000   Max.   :23.00   Max.   :1.00000  
##  Day.of.the.Week  Working.Day      Weather.Type   Temperature.F   
##  Min.   :0.000   Min.   :0.0000   Min.   :1.000   Min.   : 17.60  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.: 45.20  
##  Median :3.000   Median :1.0000   Median :1.000   Median : 59.00  
##  Mean   :3.004   Mean   :0.6827   Mean   :1.425   Mean   : 58.78  
##  3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.: 72.80  
##  Max.   :6.000   Max.   :1.0000   Max.   :4.000   Max.   :102.20  
##  Temperature.Feels.F    Humidity        Wind.Speed     Casual.Users   
##  Min.   :  3.20      Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.: 42.80      1st Qu.: 48.00   1st Qu.: 7.00   1st Qu.:  4.00  
##  Median : 60.80      Median : 63.00   Median :13.00   Median : 17.00  
##  Mean   : 59.72      Mean   : 62.72   Mean   :12.74   Mean   : 35.68  
##  3rd Qu.: 77.00      3rd Qu.: 78.00   3rd Qu.:17.00   3rd Qu.: 48.00  
##  Max.   :122.00      Max.   :100.00   Max.   :57.00   Max.   :367.00  
##  Registered.Users  Total.Users   
##  Min.   :  0.0    Min.   :  1.0  
##  1st Qu.: 34.0    1st Qu.: 40.0  
##  Median :115.0    Median :142.0  
##  Mean   :153.8    Mean   :189.5  
##  3rd Qu.:220.0    3rd Qu.:281.0  
##  Max.   :886.0    Max.   :977.0
##                     vars     n   mean     sd median trimmed    mad  min   max
## Date*                  1 17379 367.67 210.52  369.0  367.91 269.83  1.0 731.0
## Season                 2 17379   2.49   1.12    2.0    2.48   1.48  1.0   4.0
## Hour                   3 17379  11.55   6.91   12.0   11.56   8.90  0.0  23.0
## Holiday                4 17379   0.03   0.17    0.0    0.00   0.00  0.0   1.0
## Day.of.the.Week        5 17379   3.00   2.01    3.0    3.00   2.97  0.0   6.0
## Working.Day            6 17379   0.68   0.47    1.0    0.73   0.00  0.0   1.0
## Weather.Type           7 17379   1.43   0.64    1.0    1.30   0.00  1.0   4.0
## Temperature.F          8 17379  58.78  16.62   59.0   58.77  20.46 17.6 102.2
## Temperature.Feels.F    9 17379  59.72  20.42   60.8   59.93  24.02  3.2 122.0
## Humidity              10 17379  62.72  19.29   63.0   63.02  22.24  0.0 100.0
## Wind.Speed            11 17379  12.74   8.20   13.0   12.32   8.90  0.0  57.0
## Casual.Users          12 17379  35.68  49.31   17.0   25.13  23.72  0.0 367.0
## Registered.Users      13 17379 153.79 151.36  115.0  129.26 131.95  0.0 886.0
## Total.Users           14 17379 189.46 181.39  142.0  162.04 166.05  1.0 977.0
##                     range  skew kurtosis   se
## Date*               730.0 -0.01    -1.20 1.60
## Season                3.0  0.02    -1.36 0.01
## Hour                 23.0 -0.01    -1.20 0.05
## Holiday               1.0  5.64    29.78 0.00
## Day.of.the.Week       6.0  0.00    -1.26 0.02
## Working.Day           1.0 -0.79    -1.38 0.00
## Weather.Type          3.0  1.23     0.35 0.00
## Temperature.F        84.6 -0.01    -0.94 0.13
## Temperature.Feels.F 118.8 -0.09    -0.85 0.15
## Humidity            100.0 -0.11    -0.83 0.15
## Wind.Speed           57.0  0.57     0.59 0.06
## Casual.Users        367.0  2.50     7.57 0.37
## Registered.Users    886.0  1.56     2.75 1.15
## Total.Users         976.0  1.28     1.42 1.38
##                           Season         Hour      Holiday Day.of.the.Week
## Season               1.000000000  0.004931139  0.055947939    -0.003163450
## Hour                 0.004931139  1.000000000  0.000479136    -0.003497739
## Holiday              0.055947939  0.000479136  1.000000000    -0.102087791
## Day.of.the.Week     -0.003163450 -0.003497739 -0.102087791     1.000000000
## Working.Day         -0.036158734  0.002284998 -0.252471370     0.035955071
## Weather.Type         0.040452288 -0.020202528 -0.017036113     0.003310740
## Temperature.F       -0.470806327  0.137625946 -0.027356343    -0.001805613
## Temperature.Feels.F -0.469271254  0.133758276 -0.030974740    -0.008817003
## Humidity             0.014750149 -0.276497828 -0.010588465    -0.037158268
## Wind.Speed          -0.038741686  0.137253208  0.003984692     0.011504125
## Casual.Users        -0.227260165  0.301201730  0.031563628     0.032721415
## Registered.Users    -0.099585576  0.374140710 -0.047345424     0.021577888
## Total.Users         -0.144872483  0.394071498 -0.030927303     0.026899860
##                      Working.Day Weather.Type Temperature.F Temperature.Feels.F
## Season              -0.036158734   0.04045229  -0.470806327        -0.469271254
## Hour                 0.002284998  -0.02020253   0.137625946         0.133758276
## Holiday             -0.252471370  -0.01703611  -0.027356343        -0.030974740
## Day.of.the.Week      0.035955071   0.00331074  -0.001805613        -0.008817003
## Working.Day          1.000000000   0.04467222   0.055396228         0.054665178
## Weather.Type         0.044672224   1.00000000  -0.102600649        -0.105570718
## Temperature.F        0.055396228  -0.10260065   1.000000000         0.987677449
## Temperature.Feels.F  0.054665178  -0.10557072   0.987677449         1.000000000
## Humidity             0.015687512   0.41813033  -0.069889709        -0.051935510
## Wind.Speed          -0.011831470   0.02622604  -0.023115427        -0.062325722
## Casual.Users        -0.300942486  -0.15262788   0.459626269         0.454088895
## Registered.Users     0.134325791  -0.12096552   0.335373166         0.332565807
## Total.Users          0.030284368  -0.14242614   0.404785441         0.400937689
##                        Humidity   Wind.Speed Casual.Users Registered.Users
## Season               0.01475015 -0.038741686  -0.22726017      -0.09958558
## Hour                -0.27649783  0.137253208   0.30120173       0.37414071
## Holiday             -0.01058846  0.003984692   0.03156363      -0.04734542
## Day.of.the.Week     -0.03715827  0.011504125   0.03272142       0.02157789
## Working.Day          0.01568751 -0.011831470  -0.30094249       0.13432579
## Weather.Type         0.41813033  0.026226043  -0.15262788      -0.12096552
## Temperature.F       -0.06988971 -0.023115427   0.45962627       0.33537317
## Temperature.Feels.F -0.05193551 -0.062325722   0.45408890       0.33256581
## Humidity             1.00000000 -0.290108894  -0.34702809      -0.27393312
## Wind.Speed          -0.29010889  1.000000000   0.09029235       0.08232535
## Casual.Users        -0.34702809  0.090292353   1.00000000       0.50661770
## Registered.Users    -0.27393312  0.082325350   0.50661770       1.00000000
## Total.Users         -0.32291074  0.093239057   0.69456408       0.97215073
##                     Total.Users
## Season              -0.14487248
## Hour                 0.39407150
## Holiday             -0.03092730
## Day.of.the.Week      0.02689986
## Working.Day          0.03028437
## Weather.Type        -0.14242614
## Temperature.F        0.40478544
## Temperature.Feels.F  0.40093769
## Humidity            -0.32291074
## Wind.Speed           0.09323906
## Casual.Users         0.69456408
## Registered.Users     0.97215073
## Total.Users          1.00000000

KNN

## [1] "matrix" "array"

## [1] 1923  579
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 1923   94
##          1   10  579
##                                           
##                Accuracy : 0.9601          
##                  95% CI : (0.9519, 0.9673)
##     No Information Rate : 0.7417          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8914          
##                                           
##  Mcnemar's Test P-Value : 3.992e-16       
##                                           
##             Sensitivity : 0.8603          
##             Specificity : 0.9948          
##          Pos Pred Value : 0.9830          
##          Neg Pred Value : 0.9534          
##               Precision : 0.9830          
##                  Recall : 0.8603          
##                      F1 : 0.9176          
##              Prevalence : 0.2583          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2260          
##       Balanced Accuracy : 0.9276          
##                                           
##        'Positive' Class : 1               
## 
########### FINAL MODEL ON TEST ###########
df_prob_1 <- tibble(attr(df_5NN, "prob"))

final_model <- tibble(k_prob=df_prob_1$`attr(df_5NN, "prob")`,pred=df_5NN,target=test$Total.Users_f) #TARGET SHOULD BE FROM TEST DATA I ASSUME?

#Need to convert this to the likelihood to be in the poss class.
final_model$pos_prec <- ifelse(final_model$pred == 0, 1-final_model$k_prob, final_model$k_prob)

#Needs to be a factor to be correctly  
final_model$target <- as.factor(final_model$target)

# Confusion Matrix
confusionMatrix(final_model$pred, final_model$target, positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 1521  446
##          1  412  227
##                                           
##                Accuracy : 0.6708          
##                  95% CI : (0.6523, 0.6888)
##     No Information Rate : 0.7417          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1262          
##                                           
##  Mcnemar's Test P-Value : 0.2599          
##                                           
##             Sensitivity : 0.33730         
##             Specificity : 0.78686         
##          Pos Pred Value : 0.35524         
##          Neg Pred Value : 0.77326         
##              Prevalence : 0.25825         
##          Detection Rate : 0.08711         
##    Detection Prevalence : 0.24520         
##       Balanced Accuracy : 0.56208         
##                                           
##        'Positive' Class : 1               
## 
# GOAL: would rather have less false negatives (predict slow rental day when actually busy) -- decrease FNR

#FPR:
446/(446+1521) #0.2267
## [1] 0.2267412
#FNR:
412/(412+227) #0.6447
## [1] 0.6447574
# adjust threshold function
adjust_thres <- function(x, y, z) {
  #x=pred_probablities, y=threshold, z=tune_outcome
  thres <- as.factor(ifelse(x > y, 1,0))
  confusionMatrix(thres, z, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
}

# ADJUST THRESHOLD
adjust_thres(final_model$pos_prec,.9,as.factor(final_model$target))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 1573  466
##          1  360  207
##                                           
##                Accuracy : 0.683           
##                  95% CI : (0.6648, 0.7009)
##     No Information Rate : 0.7417          
##     P-Value [Acc > NIR] : 1.0000000       
##                                           
##                   Kappa : 0.1279          
##                                           
##  Mcnemar's Test P-Value : 0.0002588       
##                                           
##             Sensitivity : 0.30758         
##             Specificity : 0.81376         
##          Pos Pred Value : 0.36508         
##          Neg Pred Value : 0.77146         
##               Precision : 0.36508         
##                  Recall : 0.30758         
##                      F1 : 0.33387         
##              Prevalence : 0.25825         
##          Detection Rate : 0.07943         
##    Detection Prevalence : 0.21757         
##       Balanced Accuracy : 0.56067         
##                                           
##        'Positive' Class : 1               
## 
# GOAL: would rather have less false negatives (predict slow rental day when actually busy) -- decrease FNR

# THRESHOLD: 0.75
# FNR: 
387 / (387+216) # 0.6417
## [1] 0.641791
# THRESHOLD: 0.9
# FNR:
360 / (360+208) # 0.6338
## [1] 0.6338028
# THRESHOLD 0.99
#FNR:
360 / (360+208) #0.6338
## [1] 0.6338028

DECISION TREE

## [1] 13904    18
## [1] 3475   18
######## BUILD DECISION TREE ############
set.seed(1984)
bike_tree <- train(Total.Users_f~., #model formula everything used to classify outcome
                   data=train_data, #use the training data
                   method='rpart',# indicates the use of tree based model
                   na.action = na.omit)#omitting the missing values
bike_tree 
## CART 
## 
## 13904 samples
##    17 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13904, 13904, 13904, 13904, 13904, 13904, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01698218  0.7814706  0.3061705
##   0.03646993  0.7714562  0.2792619
##   0.06069042  0.7547520  0.1410574
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01698218.
xx <- tibble(bike_tree$resample)
xx
## # A tibble: 25 × 3
##    Accuracy Kappa Resample  
##       <dbl> <dbl> <chr>     
##  1    0.782 0.282 Resample09
##  2    0.777 0.323 Resample05
##  3    0.778 0.317 Resample01
##  4    0.777 0.335 Resample10
##  5    0.782 0.304 Resample06
##  6    0.789 0.331 Resample02
##  7    0.782 0.303 Resample11
##  8    0.779 0.268 Resample07
##  9    0.777 0.278 Resample03
## 10    0.786 0.275 Resample12
## # … with 15 more rows
mean(xx$Accuracy)
## [1] 0.7814706
bike_tree$finalModel$variable.importance
## Temperature.Feels.F       Temperature.F            Season_2            Humidity 
##         580.9036156         498.6016755         295.6065574         147.8692095 
##           Year_2011           Year_2012      Hour_Afternoon          Hour_Night 
##         104.2242152         104.2242152          51.8270344          27.1138708 
##          Wind.Speed        Hour_Morning            Season_4 
##          21.8181929           0.8955894           0.4477947
coul <- brewer.pal(5, "Set2")
barplot(bike_tree$finalModel$variable.importance, col=coul)

bike_tree$finalModel
## n= 13904 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13904 3592 0 (0.7416571 0.2583429)  
##    2) Temperature.Feels.F< 0.5984848 9638 1585 0 (0.8355468 0.1644532) *
##    3) Temperature.Feels.F>=0.5984848 4266 2007 0 (0.5295359 0.4704641)  
##      6) Humidity>=0.555 2194  753 0 (0.6567912 0.3432088) *
##      7) Humidity< 0.555 2072  818 1 (0.3947876 0.6052124)  
##       14) Year_2011>=0.5 931  400 0 (0.5703545 0.4296455) *
##       15) Year_2011< 0.5 1141  287 1 (0.2515337 0.7484663) *
rpart.plot::rpart.plot(bike_tree$finalModel, type=4,extra=101)

####### PREDICTIONS ON TEST DATA ###########
bike_eval <-(predict(bike_tree,newdata = test))

bike_eval_prob <- predict(bike_tree, newdata = test, type = "prob")#this gives us the predicted prob, we will need these later for the fairness evaluation
table(bike_eval, test$Total.Users_f)#essential the confusion matrix, though we can make a fancy one using caret built in functions
##          
## bike_eval    0    1
##         0 2515  688
##         1   62  210
confusionMatrix(bike_eval, test$Total.Users_f, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##                                           
##                Accuracy : 0.7842          
##                  95% CI : (0.7701, 0.7978)
##     No Information Rate : 0.7416          
##     P-Value [Acc > NIR] : 2.868e-09       
##                                           
##                   Kappa : 0.2714          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23385         
##             Specificity : 0.97594         
##          Pos Pred Value : 0.77206         
##          Neg Pred Value : 0.78520         
##               Precision : 0.77206         
##                  Recall : 0.23385         
##                      F1 : 0.35897         
##              Prevalence : 0.25842         
##          Detection Rate : 0.06043         
##    Detection Prevalence : 0.07827         
##       Balanced Accuracy : 0.60490         
##                                           
##        'Positive' Class : 1               
## 
adjust_thres <- function(x, y, z) {
  thres <- as.factor(ifelse(x > y, 1,0))
  confusionMatrix(thres, z, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
}

#THRESHOLD: 0.7 -- less false negatives
adjust_thres(bike_eval_prob$`1`,.7, test$Total.Users_f) 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##                                           
##                Accuracy : 0.7842          
##                  95% CI : (0.7701, 0.7978)
##     No Information Rate : 0.7416          
##     P-Value [Acc > NIR] : 2.868e-09       
##                                           
##                   Kappa : 0.2714          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23385         
##             Specificity : 0.97594         
##          Pos Pred Value : 0.77206         
##          Neg Pred Value : 0.78520         
##               Precision : 0.77206         
##                  Recall : 0.23385         
##                      F1 : 0.35897         
##              Prevalence : 0.25842         
##          Detection Rate : 0.06043         
##    Detection Prevalence : 0.07827         
##       Balanced Accuracy : 0.60490         
##                                           
##        'Positive' Class : 1               
## 
bike_eval_prob$test <- test$Total.Users_f
########## ROC/AUC ##############
bike_eval <- tibble(pred_class=bike_eval, pred_prob=bike_eval_prob$`1`,target=test$Total.Users_f)

index_poss <- apply(bike_eval,2, function(x) bike_eval$pred_class==bike_eval$target)

pred <- prediction(bike_eval$pred_prob,bike_eval$target)

tree_perf <- performance(pred,"tpr","fpr")


plot(tree_perf, colorize=TRUE)
abline(a=0, b= 1)

tree_perf_AUC <- performance(pred,"auc")

print(tree_perf_AUC@y.values)
## [[1]]
## [1] 0.6858666

CONCLUSION

Fairness Assessment

We could potentially use the equation for proportional parity in this instance, which is just the sum of the true positives and false positives divided by the sum of true and false positives and also true and false negatives. If we calucuated this metric for each subgroup, we could compare and find the proportional parity. The problem, however, is that in this data set we do not have any protected classes. The variables regarding bike users, like registered users and casual users, are numeric variables. In the example we worked through in class a few weeks ago, we checked the proportional parity of the bank loan data looking at gender, a binary variable. The same argument can be used for why we are not able to calculate the equity of odds, which is calculated by dividing the true positive rate by the sum of the true positive and false negatives (also known as sensitivity). When comparing the sensitivity rates for different sub groups or protected classes, we are able to obtain the equity of odds. Like before, we do not have a protected class in this dataset and are not able to run this diagnostic. The same can be said for predictive rate parity, which just compares the specificity rate across different subgroups.

Conclusion

Having said that, our model definitely isn’t perfect, so we have to decide where those problems are occurring and what we can do to fix this issue in the future. One possibility is that the variables in our dataset aren’t correlated too well with our target. While playing around with the models and doing things like selecting different numbers for K, as well as adjusting the threshold, not much changed, suggesting we might simply need more data.

One big issue from this model is that it is overfit, meaning our model is starting to memorize our training and tuning data. To account for this, we’d simply need more data. All things considered, we believe this model to be a good starting point. In practice, we would need to take these results with a grain of salt, and potentially increase the number of bikes we make available to the public incrementally, starting small in the beginning with maybe 5 bikes and increasing that number as our model improves as we continue to feed more data back into it.

Future Work

Overall, there’s a lot of work to be done with this model but we thought this was a good start for our company to get a better understanding of how to maxmize our inventory to benefit our profits, as well as potentially aid in an important social issue during this time.