Question and Background:
Underlying Question: Can we predict how many people will rent a bike
on a given day based on weather factors?
The Data
The data we use to model daily bike share demand includes factors such
as day of the week, weather conditions, temperature, humidity, and wind
speed. Pairing weather data with daily ridership in Washington, D.C.
helps draw connections between specific factors and the amount of
business a bike share company receives. For a bike share company trying
to place bikes effectively in Washington, D.C., this model could help
clarify the level of demand for bikes across different seasons, days of
the week, and weather conditions. With this information, a company could
optimize income while efficiently supplying communities with the most
appropriate number of bikes.
Previous Analyses
Multiple analyses have examined the effect of weather on bike share
use. Bean et al. found that bike share use depends strongly on the day
of the week, the amount of precipitation, and the temperature.
Similarly, Quach and Malekian found that the most important variables
were temperature and precipitation, using a clustering strategy to show
that usage differed significantly across weather conditions.
Sources:
Quach, Jessica, and Reza Malekian. “Exploring the Weather Impact on
Bike Sharing Usage through a Clustering Analysis.” ArXiv.org, 17
Aug. 2020, https://arxiv.org/abs/2008.07249.
Approach
We decided to simplify the question into a classification problem:
instead of predicting an exact count, we attempt to determine whether
bike share usage on a given day is above or below the 75th percentile of
daily users, based on the factors presented. We fit a k-nearest
neighbors (KNN) classifier and then a decision tree model. Because these
differ from the methods already applied to the effect of weather on bike
share usage, they might give us even more insight into the important
factors that would help a company determine how many bikes to have in
certain locations on certain days.
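As a minimal sketch of this setup (the data frame bike and the raw daily
count column Total.Users are assumed names; the derived factor
Total.Users_f is the outcome used throughout the analysis below):
# Sketch: binarize daily usage at the 75th percentile (names assumed).
cutoff <- quantile(bike$Total.Users, probs = 0.75)
bike$Total.Users_f <- factor(ifelse(bike$Total.Users > cutoff, 1, 0))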
KNN
## [1] "matrix" "array"

## [1] 1923 579
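The code that produced the KNN predictions is not echoed here; a minimal
sketch of what it plausibly looked like, assuming class::knn with k = 5
(as the name df_5NN suggests) and assumed names for the feature matrices
and labels:
# Sketch of the (not shown) KNN fit: class::knn with k = 5, per the name
# df_5NN; train_x, tune_x, and train_y are assumed names.
library(class)
df_5NN <- knn(train = train_x,   # training features
              test = tune_x,     # features of the set being predicted
              cl = train_y,      # training labels (0/1)
              k = 5,
              prob = TRUE)       # keep the winning-class vote share
With prob = TRUE, attr(df_5NN, "prob") stores the vote share of the
winning class, which is why the code further below flips it for days
predicted 0.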
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1923   94
##          1   10  579
##
## Accuracy : 0.9601
## 95% CI : (0.9519, 0.9673)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8914
##
## Mcnemar's Test P-Value : 3.992e-16
##
## Sensitivity : 0.8603
## Specificity : 0.9948
## Pos Pred Value : 0.9830
## Neg Pred Value : 0.9534
## Precision : 0.9830
## Recall : 0.8603
## F1 : 0.9176
## Prevalence : 0.2583
## Detection Rate : 0.2222
## Detection Prevalence : 0.2260
## Balanced Accuracy : 0.9276
##
## 'Positive' Class : 1
##
########### FINAL MODEL ON TEST ###########
# Collect the KNN vote shares, predicted classes, and the true labels
final_model <- tibble(k_prob = attr(df_5NN, "prob"),
                      pred = df_5NN,
                      target = test$Total.Users_f)
# knn() reports the vote share of the *winning* class, so flip it where the
# prediction is 0 to get the probability of the positive class (1)
final_model$pos_prob <- ifelse(final_model$pred == 0, 1 - final_model$k_prob, final_model$k_prob)
# confusionMatrix() needs the target as a factor
final_model$target <- as.factor(final_model$target)
# Confusion matrix on the test set
confusionMatrix(final_model$pred, final_model$target, positive = "1",
                dnn = c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1521  446
##          1  412  227
##
## Accuracy : 0.6708
## 95% CI : (0.6523, 0.6888)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.1262
##
## Mcnemar's Test P-Value : 0.2599
##
## Sensitivity : 0.33730
## Specificity : 0.78686
## Pos Pred Value : 0.35524
## Neg Pred Value : 0.77326
## Prevalence : 0.25825
## Detection Rate : 0.08711
## Detection Prevalence : 0.24520
## Balanced Accuracy : 0.56208
##
## 'Positive' Class : 1
##
# GOAL: we would rather have fewer false negatives (predicting a slow rental
# day when it is actually busy) -- i.e. decrease FNR = FN / (FN + TP).
# With rows = Prediction and columns = Actual, FN = 446 and FP = 412.
# FPR = FP / (FP + TN):
412/(412+1521) # 0.2131
## [1] 0.2131402
# FNR = FN / (FN + TP):
446/(446+227) # 0.6627
## [1] 0.6627043
# adjust threshold function
adjust_thres <- function(x, y, z) {
  # x = predicted probabilities, y = threshold, z = true outcome (factor)
  thres <- as.factor(ifelse(x > y, 1, 0))
  confusionMatrix(thres, z, positive = "1", dnn = c("Prediction", "Actual"), mode = "everything")
}
# ADJUST THRESHOLD (target is already a factor after the conversion above)
adjust_thres(final_model$pos_prob, .9, final_model$target)
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1573  466
##          1  360  207
##
## Accuracy : 0.683
## 95% CI : (0.6648, 0.7009)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 1.0000000
##
## Kappa : 0.1279
##
## Mcnemar's Test P-Value : 0.0002588
##
## Sensitivity : 0.30758
## Specificity : 0.81376
## Pos Pred Value : 0.36508
## Neg Pred Value : 0.77146
## Precision : 0.36508
## Recall : 0.30758
## F1 : 0.33387
## Prevalence : 0.25825
## Detection Rate : 0.07943
## Detection Prevalence : 0.21757
## Balanced Accuracy : 0.56067
##
## 'Positive' Class : 1
##
# GOAL: we would rather have fewer false negatives (predicting a slow rental
# day when it is actually busy) -- decrease FNR = FN / (FN + TP)
# THRESHOLD: 0.75 (run not shown above; taking TP = 216 from that run and
# 673 actual positives, FN = 673 - 216 = 457)
# FNR:
457 / (457+216) # 0.679
## [1] 0.679049
# THRESHOLD: 0.9
# FNR:
466 / (466+207) # 0.6924
## [1] 0.692422
# THRESHOLD: 0.99 (identical to 0.9: with k = 5 the vote shares take only a few
# discrete values, so both cutoffs keep only days with predicted probability 1)
# FNR:
466 / (466+207) # 0.6924
## [1] 0.692422
# Note that raising the cutoff raises the FNR; to reduce false negatives we
# would need to lower it below the default 0.5 instead.
DECISION TREE
# Dimensions of the training and test sets (rows, columns):
## [1] 13904    18
## [1]  3475    18
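The split itself is not echoed; since 13904 of the 17379 rows land in
training, it corresponds to an 80/20 partition, plausibly something like
this sketch (the data frame name bike is assumed):
# Sketch of the (not shown) train/test split; 13904/3475 rows imply an
# 80/20 partition. The data frame name `bike` is assumed.
library(caret)
set.seed(1984)
split_index <- createDataPartition(bike$Total.Users_f, p = 0.8, list = FALSE)
train_data <- bike[split_index, ]
test <- bike[-split_index, ]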
######## BUILD DECISION TREE ############
set.seed(1984)
bike_tree <- train(Total.Users_f ~ .,       # classify the outcome from all other columns
                   data = train_data,       # fit on the training data
                   method = "rpart",        # CART decision tree
                   na.action = na.omit)     # drop rows with missing values
bike_tree
## CART
##
## 13904 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13904, 13904, 13904, 13904, 13904, 13904, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01698218 0.7814706 0.3061705
## 0.03646993 0.7714562 0.2792619
## 0.06069042 0.7547520 0.1410574
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01698218.
xx <- tibble(bike_tree$resample)
xx
## # A tibble: 25 × 3
## Accuracy Kappa Resample
## <dbl> <dbl> <chr>
## 1 0.782 0.282 Resample09
## 2 0.777 0.323 Resample05
## 3 0.778 0.317 Resample01
## 4 0.777 0.335 Resample10
## 5 0.782 0.304 Resample06
## 6 0.789 0.331 Resample02
## 7 0.782 0.303 Resample11
## 8 0.779 0.268 Resample07
## 9 0.777 0.278 Resample03
## 10 0.786 0.275 Resample12
## # … with 15 more rows
mean(xx$Accuracy)
## [1] 0.7814706
bike_tree$finalModel$variable.importance
## Temperature.Feels.F Temperature.F Season_2 Humidity
## 580.9036156 498.6016755 295.6065574 147.8692095
## Year_2011 Year_2012 Hour_Afternoon Hour_Night
## 104.2242152 104.2242152 51.8270344 27.1138708
## Wind.Speed Hour_Morning Season_4
## 21.8181929 0.8955894 0.4477947
coul <- brewer.pal(5, "Set2")  # color palette from RColorBrewer
barplot(bike_tree$finalModel$variable.importance, col = coul)

[Figure: bar plot of the decision tree's variable importance scores]
bike_tree$finalModel
## n= 13904
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13904 3592 0 (0.7416571 0.2583429)
## 2) Temperature.Feels.F< 0.5984848 9638 1585 0 (0.8355468 0.1644532) *
## 3) Temperature.Feels.F>=0.5984848 4266 2007 0 (0.5295359 0.4704641)
## 6) Humidity>=0.555 2194 753 0 (0.6567912 0.3432088) *
## 7) Humidity< 0.555 2072 818 1 (0.3947876 0.6052124)
## 14) Year_2011>=0.5 931 400 0 (0.5703545 0.4296455) *
## 15) Year_2011< 0.5 1141 287 1 (0.2515337 0.7484663) *
rpart.plot::rpart.plot(bike_tree$finalModel, type = 4, extra = 101)

[Figure: the fitted decision tree drawn with rpart.plot]
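Reading the tree: the only path to a predicted high-usage day requires a
feels-like temperature above 0.598 (scaled), humidity below 0.555, and
the year 2012; every other leaf predicts a below-75th-percentile day.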
####### PREDICTIONS ON TEST DATA ###########
bike_eval <- predict(bike_tree, newdata = test)  # predicted classes
bike_eval_prob <- predict(bike_tree, newdata = test, type = "prob")  # predicted probabilities, needed later for the threshold adjustment and ROC/AUC
table(bike_eval, test$Total.Users_f)  # essentially the confusion matrix, though caret can build a fancier one
##
##
## bike_eval    0    1
##         0 2515  688
##         1   62  210
confusionMatrix(bike_eval, test$Total.Users_f, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##
## Accuracy : 0.7842
## 95% CI : (0.7701, 0.7978)
## No Information Rate : 0.7416
## P-Value [Acc > NIR] : 2.868e-09
##
## Kappa : 0.2714
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23385
## Specificity : 0.97594
## Pos Pred Value : 0.77206
## Neg Pred Value : 0.78520
## Precision : 0.77206
## Recall : 0.23385
## F1 : 0.35897
## Prevalence : 0.25842
## Detection Rate : 0.06043
## Detection Prevalence : 0.07827
## Balanced Accuracy : 0.60490
##
## 'Positive' Class : 1
##
# Reuse the adjust_thres() helper defined in the KNN section above.
# THRESHOLD: 0.7 -- aiming for fewer false negatives
adjust_thres(bike_eval_prob$`1`, .7, test$Total.Users_f)
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##
## Accuracy : 0.7842
## 95% CI : (0.7701, 0.7978)
## No Information Rate : 0.7416
## P-Value [Acc > NIR] : 2.868e-09
##
## Kappa : 0.2714
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23385
## Specificity : 0.97594
## Pos Pred Value : 0.77206
## Neg Pred Value : 0.78520
## Precision : 0.77206
## Recall : 0.23385
## F1 : 0.35897
## Prevalence : 0.25842
## Detection Rate : 0.06043
## Detection Prevalence : 0.07827
## Balanced Accuracy : 0.60490
##
## 'Positive' Class : 1
##
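# NOTE: identical to the default cutoff. The tree has only four leaves, with
# class-1 probabilities of roughly 0.16, 0.34, 0.43, and 0.75, so any cutoff
# between about 0.43 and 0.75 flags exactly the same days as 0.5 does; to
# actually reduce false negatives we would need a cutoff below 0.43.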
bike_eval_prob$test <- test$Total.Users_f
########## ROC/AUC ##############
# Gather classes, probabilities, and labels; prediction() and performance() are from ROCR
bike_eval <- tibble(pred_class = bike_eval, pred_prob = bike_eval_prob$`1`, target = test$Total.Users_f)
pred <- prediction(bike_eval$pred_prob, bike_eval$target)
tree_perf <- performance(pred, "tpr", "fpr")  # true positive rate vs. false positive rate
plot(tree_perf, colorize = TRUE)
abline(a = 0, b = 1)  # no-skill reference line

[Figure: ROC curve for the decision tree with the diagonal reference line]
tree_perf_AUC <- performance(pred,"auc")
print(tree_perf_AUC@y.values)
## [[1]]
## [1] 0.6858666
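An AUC of roughly 0.69 is well above the 0.5 of random guessing but
indicates only a modest ability to separate busy days from slow ones.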
CONCLUSION
Fairness Assessment
We could potentially use the equation for proportional parity in
this instance, which is the sum of true positives and false positives
divided by the total of all predictions, (TP + FP) / (TP + FP + TN + FN),
i.e. the rate at which a group is predicted positive. If we calculated
this metric for each subgroup, we could compare the values to assess
proportional parity. The problem, however, is that this data set does
not contain any protected classes. The variables regarding bike users,
like registered users and casual users, are numeric counts. In the
example we worked through in class a few weeks ago, we checked the
proportional parity of the bank loan data by gender, a binary variable.
The same argument explains why we are not able to calculate equalized
odds, which compares the sensitivity, TP / (TP + FN), the true positives
divided by the sum of true positives and false negatives, across
subgroups or protected classes. As before, we do not have a protected
class in this dataset, so we cannot run this diagnostic. The same can be
said for predictive rate parity, which compares the precision (positive
predictive value) across subgroups.
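For illustration only, here is a minimal sketch of how these group-level
metrics could be computed if the data had a protected attribute; the
data frame res and its group column are hypothetical, since this data
set has no such variable:
# Hypothetical sketch: group-wise fairness metrics from model predictions.
# Assumes a data frame `res` with 0/1 columns pred and target plus a
# protected attribute `group` -- which this bike data set does not have.
library(dplyr)
res %>%
  group_by(group) %>%
  summarize(
    pred_pos_rate = mean(pred == 1),                        # proportional parity
    tpr = sum(pred == 1 & target == 1) / sum(target == 1),  # equalized odds (sensitivity)
    ppv = sum(pred == 1 & target == 1) / sum(pred == 1)     # predictive rate parity (precision)
  )
Comparing these columns across the levels of group would show whether
the model treats the subgroups comparably.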
Conclusion
Based on our decision tree model, the most important variables were
the feels-like temperature and the actual temperature, followed by
season (with spring being the most popular season), humidity, year, and
time of day (afternoon and night), with wind speed mattering least. We
can use this information to predict the times when people are most
likely to rent our bikes, and in turn optimize our inventory and
resources and save money.
That said, our model is certainly not perfect, so we have to
determine where the problems are occurring and what we can do to fix
them in the future. One possibility is that the variables in our dataset
are only weakly correlated with our target. While experimenting with the
models, trying different values of k and adjusting the classification
threshold, performance changed very little, suggesting we might simply
need more data.
One big issue is that the model is overfit: the KNN model, for
example, reaches 96% accuracy on the data it was tuned on but only 67%
on the held-out test set, a sign that it is memorizing the training and
tuning data rather than learning generalizable patterns. To account for
this, we would above all need more data. All things considered, we
believe this model is a good starting point. In practice, we would take
these results with a grain of salt and increase the number of bikes we
make available to the public incrementally, starting small with perhaps
5 bikes and raising that number as the model improves as we feed more
data back into it.
Future Work
Considering that public transportation is so closely tied to social
inequality and social mobility, we are dedicated to creating a model
that not only optimizes our income as a company, but also serves the
public in a meaningful way that helps close the social inequality gap we
see in our country today. As part of a larger problem, the United States
scores poorly on social mobility compared with other developed
countries. According to the World Economic Forum, which ranked 82
countries on their social mobility scores (based on factors like
healthcare, education, technology access, work opportunities, and social
protection), the US came in 27th place, behind countries like Lithuania,
Malta, and Portugal. Obviously, access to public transportation is not
the only thing that contributes to a country's social mobility, but it
is a factor, and we have the opportunity to bring affordable,
accessible, and eco-friendly transportation to all kinds of people and
communities. Going forward, we plan to incorporate data that could give
us insight into the communities that need our help the most. We
currently have no insight into where our bikes are rented or for what
purpose (for example, riding to work, tourism, or as a main method of
transportation). With that kind of data we could direct bikes to the
communities that need them most, and, although we would have to tread
very carefully since we would be relying on proxy measures, we could
also run a fairness evaluation on our model to better understand its
real-world effects.
Overall, there is a lot of work still to be done on this model, but
we think it is a good start toward understanding how to optimize our
inventory to benefit our profits, as well as potentially aid in an
important social issue.