Question and Background:
Underlying Question: Can we predict how many people will rent a bike
on a given day based on weather factors?
The Data
The data we use to model daily bike share demand includes factors such
as day of the week, weather conditions, temperature, humidity, and wind
speed. Pairing weather data with daily ridership in Washington, D.C.
helps draw connections between specific factors and the amount of
business a bike share company receives. For a bike share company trying
to place bikes effectively in Washington, D.C., this model could help
clarify the level of demand for bikes across different seasons, days of
the week, and weather conditions. With this information, a company could
optimize income while efficiently supplying communities with the most
appropriate number of bikes.
Previous Analyses
Multiple analyses have examined the effect of weather on bike share
use. Bean et al. found that bike share use depends strongly on the day
of the week, the amount of precipitation, and the temperature.
Similarly, Quach and Malekian found that the most important variables
were temperature and precipitation, using a clustering strategy to show
that usage differed significantly across weather conditions.
Sources:
Quach, Jessica, and Reza Malekian. “Exploring the Weather Impact on
Bike Sharing Usage through a Clustering Analysis.” ArXiv.org, 17
Aug. 2020, https://arxiv.org/abs/2008.07249.
Approach
We decided to simplify the question into a classification problem:
instead of predicting an exact count, we attempt to determine whether
bike share usage on a given day is above or below the 75th percentile of
daily users, based on the factors presented. We fit a k-nearest
neighbors (KNN) classifier and then a decision tree model. Because these
differ from the methods already applied to the effect of weather on bike
share usage, they might give us even more insight into the important
factors that would help a company determine how many bikes to have in
certain locations on certain days.
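As a minimal sketch of this setup (the data frame bike and the raw daily
count column Total.Users are assumed names; the derived factor
Total.Users_f is the outcome used throughout the analysis below):
# Sketch: binarize daily usage at the 75th percentile (names assumed).
cutoff <- quantile(bike$Total.Users, probs = 0.75)
bike$Total.Users_f <- factor(ifelse(bike$Total.Users > cutoff, 1, 0))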
KNN
## [1] "matrix" "array"

## [1] 1923 579
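The code that produced the KNN predictions is not echoed here; a minimal
sketch of what it plausibly looked like, assuming class::knn with k = 5
(as the name df_5NN suggests) and assumed names for the feature matrices
and labels:
# Sketch of the (not shown) KNN fit: class::knn with k = 5, per the name
# df_5NN; train_x, tune_x, and train_y are assumed names.
library(class)
df_5NN <- knn(train = train_x,   # training features
              test = tune_x,     # features of the set being predicted
              cl = train_y,      # training labels (0/1)
              k = 5,
              prob = TRUE)       # keep the winning-class vote share
With prob = TRUE, attr(df_5NN, "prob") stores the vote share of the
winning class, which is why the code further below flips it for days
predicted 0.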
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1923   94
##          1   10  579
##
## Accuracy : 0.9601
## 95% CI : (0.9519, 0.9673)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8914
##
## Mcnemar's Test P-Value : 3.992e-16
##
## Sensitivity : 0.8603
## Specificity : 0.9948
## Pos Pred Value : 0.9830
## Neg Pred Value : 0.9534
## Precision : 0.9830
## Recall : 0.8603
## F1 : 0.9176
## Prevalence : 0.2583
## Detection Rate : 0.2222
## Detection Prevalence : 0.2260
## Balanced Accuracy : 0.9276
##
## 'Positive' Class : 1
##
########### FINAL MODEL ON TEST ###########
# Collect the KNN vote shares, predicted classes, and the true labels
final_model <- tibble(k_prob = attr(df_5NN, "prob"),
                      pred = df_5NN,
                      target = test$Total.Users_f)
# knn() reports the vote share of the *winning* class, so flip it where the
# prediction is 0 to get the probability of the positive class (1)
final_model$pos_prob <- ifelse(final_model$pred == 0, 1 - final_model$k_prob, final_model$k_prob)
# confusionMatrix() needs the target as a factor
final_model$target <- as.factor(final_model$target)
# Confusion matrix on the test set
confusionMatrix(final_model$pred, final_model$target, positive = "1",
                dnn = c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1521  446
##          1  412  227
##
## Accuracy : 0.6708
## 95% CI : (0.6523, 0.6888)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.1262
##
## Mcnemar's Test P-Value : 0.2599
##
## Sensitivity : 0.33730
## Specificity : 0.78686
## Pos Pred Value : 0.35524
## Neg Pred Value : 0.77326
## Prevalence : 0.25825
## Detection Rate : 0.08711
## Detection Prevalence : 0.24520
## Balanced Accuracy : 0.56208
##
## 'Positive' Class : 1
##
# GOAL: we would rather have fewer false negatives (predicting a slow rental
# day when it is actually busy) -- i.e. decrease FNR = FN / (FN + TP).
# With rows = Prediction and columns = Actual, FN = 446 and FP = 412.
# FPR = FP / (FP + TN):
412/(412+1521) # 0.2131
## [1] 0.2131402
# FNR = FN / (FN + TP):
446/(446+227) # 0.6627
## [1] 0.6627043
# adjust threshold function
adjust_thres <- function(x, y, z) {
  # x = predicted probabilities, y = threshold, z = true outcome (factor)
  thres <- as.factor(ifelse(x > y, 1, 0))
  confusionMatrix(thres, z, positive = "1", dnn = c("Prediction", "Actual"), mode = "everything")
}
# ADJUST THRESHOLD (target is already a factor after the conversion above)
adjust_thres(final_model$pos_prob, .9, final_model$target)
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 1573  466
##          1  360  207
##
## Accuracy : 0.683
## 95% CI : (0.6648, 0.7009)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 1.0000000
##
## Kappa : 0.1279
##
## Mcnemar's Test P-Value : 0.0002588
##
## Sensitivity : 0.30758
## Specificity : 0.81376
## Pos Pred Value : 0.36508
## Neg Pred Value : 0.77146
## Precision : 0.36508
## Recall : 0.30758
## F1 : 0.33387
## Prevalence : 0.25825
## Detection Rate : 0.07943
## Detection Prevalence : 0.21757
## Balanced Accuracy : 0.56067
##
## 'Positive' Class : 1
##
# GOAL: we would rather have fewer false negatives (predicting a slow rental
# day when it is actually busy) -- decrease FNR = FN / (FN + TP)
# THRESHOLD: 0.75 (run not shown above; taking TP = 216 from that run and
# 673 actual positives, FN = 673 - 216 = 457)
# FNR:
457 / (457+216) # 0.679
## [1] 0.679049
# THRESHOLD: 0.9
# FNR:
466 / (466+207) # 0.6924
## [1] 0.692422
# THRESHOLD: 0.99 (identical to 0.9: with k = 5 the vote shares take only a few
# discrete values, so both cutoffs keep only days with predicted probability 1)
# FNR:
466 / (466+207) # 0.6924
## [1] 0.692422
# Note that raising the cutoff raises the FNR; to reduce false negatives we
# would need to lower it below the default 0.5 instead.
DECISION TREE
# Dimensions of the training and test sets (rows, columns):
## [1] 13904    18
## [1]  3475    18
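The split itself is not echoed; since 13904 of the 17379 rows land in
training, it corresponds to an 80/20 partition, plausibly something like
this sketch (the data frame name bike is assumed):
# Sketch of the (not shown) train/test split; 13904/3475 rows imply an
# 80/20 partition. The data frame name `bike` is assumed.
library(caret)
set.seed(1984)
split_index <- createDataPartition(bike$Total.Users_f, p = 0.8, list = FALSE)
train_data <- bike[split_index, ]
test <- bike[-split_index, ]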
######## BUILD DECISION TREE ############
set.seed(1984)
bike_tree <- train(Total.Users_f ~ .,       # classify the outcome from all other columns
                   data = train_data,       # fit on the training data
                   method = "rpart",        # CART decision tree
                   na.action = na.omit)     # drop rows with missing values
bike_tree
## CART
##
## 13904 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13904, 13904, 13904, 13904, 13904, 13904, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01698218 0.7814706 0.3061705
## 0.03646993 0.7714562 0.2792619
## 0.06069042 0.7547520 0.1410574
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01698218.
xx <- tibble(bike_tree$resample)
xx
## # A tibble: 25 × 3
## Accuracy Kappa Resample
## <dbl> <dbl> <chr>
## 1 0.782 0.282 Resample09
## 2 0.777 0.323 Resample05
## 3 0.778 0.317 Resample01
## 4 0.777 0.335 Resample10
## 5 0.782 0.304 Resample06
## 6 0.789 0.331 Resample02
## 7 0.782 0.303 Resample11
## 8 0.779 0.268 Resample07
## 9 0.777 0.278 Resample03
## 10 0.786 0.275 Resample12
## # … with 15 more rows
mean(xx$Accuracy)
## [1] 0.7814706
bike_tree$finalModel$variable.importance
## Temperature.Feels.F Temperature.F Season_2 Humidity
## 580.9036156 498.6016755 295.6065574 147.8692095
## Year_2011 Year_2012 Hour_Afternoon Hour_Night
## 104.2242152 104.2242152 51.8270344 27.1138708
## Wind.Speed Hour_Morning Season_4
## 21.8181929 0.8955894 0.4477947
coul <- brewer.pal(5, "Set2")  # color palette from RColorBrewer
barplot(bike_tree$finalModel$variable.importance, col = coul)

[Figure: bar plot of the decision tree's variable importance scores]
bike_tree$finalModel
## n= 13904
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13904 3592 0 (0.7416571 0.2583429)
## 2) Temperature.Feels.F< 0.5984848 9638 1585 0 (0.8355468 0.1644532) *
## 3) Temperature.Feels.F>=0.5984848 4266 2007 0 (0.5295359 0.4704641)
## 6) Humidity>=0.555 2194 753 0 (0.6567912 0.3432088) *
## 7) Humidity< 0.555 2072 818 1 (0.3947876 0.6052124)
## 14) Year_2011>=0.5 931 400 0 (0.5703545 0.4296455) *
## 15) Year_2011< 0.5 1141 287 1 (0.2515337 0.7484663) *
rpart.plot::rpart.plot(bike_tree$finalModel, type = 4, extra = 101)

[Figure: the fitted decision tree drawn with rpart.plot]
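Reading the tree: the only path to a predicted high-usage day requires a
feels-like temperature above 0.598 (scaled), humidity below 0.555, and
the year 2012; every other leaf predicts a below-75th-percentile day.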
####### PREDICTIONS ON TEST DATA ###########
bike_eval <- predict(bike_tree, newdata = test)  # predicted classes
bike_eval_prob <- predict(bike_tree, newdata = test, type = "prob")  # predicted probabilities, needed later for the threshold adjustment and ROC/AUC
table(bike_eval, test$Total.Users_f)  # essentially the confusion matrix, though caret can build a fancier one
##
##
## bike_eval    0    1
##         0 2515  688
##         1   62  210
confusionMatrix(bike_eval, test$Total.Users_f, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##
## Accuracy : 0.7842
## 95% CI : (0.7701, 0.7978)
## No Information Rate : 0.7416
## P-Value [Acc > NIR] : 2.868e-09
##
## Kappa : 0.2714
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23385
## Specificity : 0.97594
## Pos Pred Value : 0.77206
## Neg Pred Value : 0.78520
## Precision : 0.77206
## Recall : 0.23385
## F1 : 0.35897
## Prevalence : 0.25842
## Detection Rate : 0.06043
## Detection Prevalence : 0.07827
## Balanced Accuracy : 0.60490
##
## 'Positive' Class : 1
##
# Reuse the adjust_thres() helper defined in the KNN section above.
# THRESHOLD: 0.7 -- aiming for fewer false negatives
adjust_thres(bike_eval_prob$`1`, .7, test$Total.Users_f)
## Confusion Matrix and Statistics
##
##           Actual
## Prediction    0    1
##          0 2515  688
##          1   62  210
##
## Accuracy : 0.7842
## 95% CI : (0.7701, 0.7978)
## No Information Rate : 0.7416
## P-Value [Acc > NIR] : 2.868e-09
##
## Kappa : 0.2714
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23385
## Specificity : 0.97594
## Pos Pred Value : 0.77206
## Neg Pred Value : 0.78520
## Precision : 0.77206
## Recall : 0.23385
## F1 : 0.35897
## Prevalence : 0.25842
## Detection Rate : 0.06043
## Detection Prevalence : 0.07827
## Balanced Accuracy : 0.60490
##
## 'Positive' Class : 1
##
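# NOTE: identical to the default cutoff. The tree has only four leaves, with
# class-1 probabilities of roughly 0.16, 0.34, 0.43, and 0.75, so any cutoff
# between about 0.43 and 0.75 flags exactly the same days as 0.5 does; to
# actually reduce false negatives we would need a cutoff below 0.43.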
bike_eval_prob$test <- test$Total.Users_f
########## ROC/AUC ##############
# Gather classes, probabilities, and labels; prediction() and performance() are from ROCR
bike_eval <- tibble(pred_class = bike_eval, pred_prob = bike_eval_prob$`1`, target = test$Total.Users_f)
pred <- prediction(bike_eval$pred_prob, bike_eval$target)
tree_perf <- performance(pred, "tpr", "fpr")  # true positive rate vs. false positive rate
plot(tree_perf, colorize = TRUE)
abline(a = 0, b = 1)  # no-skill reference line

[Figure: ROC curve for the decision tree with the diagonal reference line]
tree_perf_AUC <- performance(pred,"auc")
print(tree_perf_AUC@y.values)
## [[1]]
## [1] 0.6858666
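An AUC of roughly 0.69 is well above the 0.5 of random guessing but
indicates only a modest ability to separate busy days from slow ones.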
CONCLUSION
Fairness Assessment
We could potentially use the equation for proportional parity in
this instance, which is the sum of true positives and false positives
divided by the total of all predictions, (TP + FP) / (TP + FP + TN + FN),
i.e. the rate at which a group is predicted positive. If we calculated
this metric for each subgroup, we could compare the values to assess
proportional parity. The problem, however, is that this data set does
not contain any protected classes. The variables regarding bike users,
like registered users and casual users, are numeric counts. In the
example we worked through in class a few weeks ago, we checked the
proportional parity of the bank loan data by gender, a binary variable.
The same argument explains why we are not able to calculate equalized
odds, which compares the sensitivity, TP / (TP + FN), the true positives
divided by the sum of true positives and false negatives, across
subgroups or protected classes. As before, we do not have a protected
class in this dataset, so we cannot run this diagnostic. The same can be
said for predictive rate parity, which compares the precision (positive
predictive value) across subgroups.
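For illustration only, here is a minimal sketch of how these group-level
metrics could be computed if the data had a protected attribute; the
data frame res and its group column are hypothetical, since this data
set has no such variable:
# Hypothetical sketch: group-wise fairness metrics from model predictions.
# Assumes a data frame `res` with 0/1 columns pred and target plus a
# protected attribute `group` -- which this bike data set does not have.
library(dplyr)
res %>%
  group_by(group) %>%
  summarize(
    pred_pos_rate = mean(pred == 1),                        # proportional parity
    tpr = sum(pred == 1 & target == 1) / sum(target == 1),  # equalized odds (sensitivity)
    ppv = sum(pred == 1 & target == 1) / sum(pred == 1)     # predictive rate parity (precision)
  )
Comparing these columns across the levels of group would show whether
the model treats the subgroups comparably.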
Conclusion
Based on our decision tree model, the most important variables were
the feels-like temperature and the actual temperature, followed by
season (with spring being the most popular season), humidity, year, and
time of day (afternoon and night), with wind speed mattering least. We
can use this information to predict the times when people are most
likely to rent our bikes, and in turn optimize our inventory and
resources and save money.
That said, our model is certainly not perfect, so we have to
determine where the problems are occurring and what we can do to fix
them in the future. One possibility is that the variables in our dataset
are only weakly correlated with our target. While experimenting with the
models, trying different values of k and adjusting the classification
threshold, performance changed very little, suggesting we might simply
need more data.
One big issue is that the model is overfit: the KNN model, for
example, reaches 96% accuracy on the data it was tuned on but only 67%
on the held-out test set, a sign that it is memorizing the training and
tuning data rather than learning generalizable patterns. To account for
this, we would above all need more data. All things considered, we
believe this model is a good starting point. In practice, we would take
these results with a grain of salt and increase the number of bikes we
make available to the public incrementally, starting small with perhaps
5 bikes and raising that number as the model improves as we feed more
data back into it.
Future Work
Considering that public transportation is so closely tied to social
inequality and social mobility, we are dedicated to creating a model
that not only optimizes our income as a company, but also serves the
public in a meaningful way that helps close the social inequality gap we
see in our country today. As part of a larger problem, the United States
scores poorly on social mobility compared with other developed
countries. According to the World Economic Forum, which ranked 82
countries on their social mobility scores (based on factors like
healthcare, education, technology access, work opportunities, and social
protection), the US came in 27th place, behind countries like Lithuania,
Malta, and Portugal. Obviously, access to public transportation is not
the only thing that contributes to a country's social mobility, but it
is a factor, and we have the opportunity to bring affordable,
accessible, and eco-friendly transportation to all kinds of people and
communities. Going forward, we plan to incorporate data that could give
us insight into the communities that need our help the most. We
currently have no insight into where our bikes are rented or for what
purpose (for example, riding to work, tourism, or as a main method of
transportation). With that kind of data we could direct bikes to the
communities that need them most, and, although we would have to tread
very carefully since we would be relying on proxy measures, we could
also run a fairness evaluation on our model to better understand its
real-world effects.
Overall, there is a lot of work still to be done on this model, but
we think it is a good start toward understanding how to optimize our
inventory to benefit our profits, as well as potentially aid in an
important social issue.