Jad Shaheen - Final Exam

Honor Pledge

I pledge on my honor that I neither gave nor received help while completing this final exam.

  • Jad Shaheen

Exercise 1: Exploratory Data Analysis (EDA)

Part 1: Data Manipulation with {dplyr}

Question 1: How do bike usage patterns vary by season and student activity in terms of average usage ratio, effective capacity, and peak rental demand?

bikeshare |> 
  group_by(season, student_activity) |> 
  summarize(avg_usage_ratio = mean(usage_ratio, na.rm = TRUE), 
            avg_effective_capacity = mean(effective_capacity), 
            peak_hour_rental = max(rental)) |> 
  arrange(desc(avg_usage_ratio)) |> 
  rename("Season" = season, "Student Activity" = student_activity, "Average Usage Ratio" = avg_usage_ratio, "Average Effective Capacity" = avg_effective_capacity, "Peak Hourly Rental Demand" = peak_hour_rental) 
`summarise()` has grouped output by 'season'. You can override using the
`.groups` argument.
# A tibble: 20 × 5
# Groups:   Season [4]
   Season `Student Activity` `Average Usage Ratio` `Average Effective Capacity`
   <fct>  <fct>                              <dbl>                        <dbl>
 1 fall   heavy_class_rush                  0.242                         4118.
 2 fall   normal_class_rush                 0.232                         4089.
 3 spring heavy_class_rush                  0.229                         3938.
 4 spring normal_class_rush                 0.224                         3881.
 5 winter heavy_class_rush                  0.205                         3521.
 6 winter normal_class_rush                 0.202                         3492.
 7 fall   class_end_rush                    0.196                         4166.
 8 spring class_end_rush                    0.192                         4000.
 9 summer regular                           0.135                         4070.
10 winter class_end_rush                    0.135                         3608.
11 fall   lunch_rush                        0.119                         4158.
12 summer nightlife                         0.119                         4432.
13 spring lunch_rush                        0.116                         3982.
14 spring nightlife                         0.108                         4221.
15 spring regular                           0.102                         3675.
16 fall   regular                           0.101                         3772.
17 fall   nightlife                         0.0873                        4260.
18 winter lunch_rush                        0.0803                        3562.
19 winter regular                           0.0612                        3328.
20 winter nightlife                         0.0561                        4000 
# ℹ 1 more variable: `Peak Hourly Rental Demand` <dbl>
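
The `summarise()` message in the output above is informational: after summarizing over two grouping variables, dplyr keeps the result grouped by the first one. A minimal sketch of the `.groups` argument follows, assuming a small made-up tibble (`toy` and its values are illustrative, not the bikeshare data):

```r
library(dplyr)

# Toy data standing in for bikeshare; all values are invented for illustration
toy <- tibble(
  season           = c("fall", "fall", "winter", "winter"),
  student_activity = c("regular", "nightlife", "regular", "regular"),
  usage_ratio      = c(0.10, 0.08, 0.06, NA)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
toy_summary <- toy |>
  group_by(season, student_activity) |>
  summarize(avg_usage_ratio = mean(usage_ratio, na.rm = TRUE), .groups = "drop")

is_grouped_df(toy_summary)
```

Dropping the groups is mostly a matter of tidiness here, since `arrange()` ignores grouping by default.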

Question 2: How does precipitation affect bike rentals across student activity levels?

bikeshare <- bikeshare |> 
  mutate(there_is_precip = ifelse(precip > 0, 1, 0))

bikeshare |>
  filter(student_activity != "nightlife") |> 
  group_by(there_is_precip, student_activity) |>
  summarize(mean_rental = mean(rental))  |>
  pivot_wider(names_from = there_is_precip, values_from = mean_rental, names_prefix = "precip_") |> 
  mutate("Difference in Average Rentals" = precip_1 - precip_0) |>
  rename("Precipitation" = precip_1, "No Precipitation" = precip_0, "Student Activity" = student_activity) |> 
  arrange(`Difference in Average Rentals`)  # backticks refer to the column; a quoted string is a constant and does not sort
`summarise()` has grouped output by 'there_is_precip'. You can override using
the `.groups` argument.
# A tibble: 5 × 4
  `Student Activity` `No Precipitation` Precipitation Difference in Average Re…¹
  <fct>                           <dbl>         <dbl>                      <dbl>
1 normal_class_rush                754.         193.                       -561.
2 heavy_class_rush                 768.         299.                       -469.
3 class_end_rush                   639.         220.                       -419.
4 lunch_rush                       418.          99.8                      -318.
5 regular                          398.         159.                       -239.
# ℹ abbreviated name: ¹​`Difference in Average Rentals`
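
When sorting by a column whose name contains spaces, the name has to be wrapped in backticks inside `arrange()`. A quoted string is treated as a constant, so the row order is left unchanged. The sketch below shows the difference on a made-up tibble (`toy` and its values are illustrative only):

```r
library(dplyr)

# Toy data; values are invented for illustration
toy <- tibble(
  activity = c("a", "b", "c"),
  `Difference in Average Rentals` = c(-10, -30, -20)
)

# Quoted string: sorts by a constant, so the rows stay in their original order
by_string <- toy |> arrange("Difference in Average Rentals")

# Backticks: sorts by the column itself, most negative difference first
by_column <- toy |> arrange(`Difference in Average Rentals`)

by_string$activity
by_column$activity
```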

Part 2: Data Visualization with {ggplot2}

Plot 1: What is the effect of temperature greater than 70°F on bike rentals across seasons?

# Step 1: Ensure dataset has no missing values
bikeshare_clean <- bikeshare |> 
  filter(!is.na(temperature), !is.na(rental))

# Step 2: Manually create a piecewise temperature variable
bikeshare_clean <- bikeshare_clean |> 
  mutate(temp_above_70 = pmax(0, temperature - 70))  # 0 if temp <= 70, else (temperature - 70)

# Step 3: Fit a basic linear model with the piecewise term
lm_model <- lm(rental ~ temperature + temp_above_70, data = bikeshare_clean)

# Step 4: Predict values using the manually defined break at 70°F
bikeshare_clean <- bikeshare_clean |> 
  mutate(predicted_rentals = predict(lm_model))

# Step 5: Create the visualization
ggplot(bikeshare_clean, aes(x = temperature, y = rental, color = season)) +
  geom_point(alpha = 0.4) +  # Add transparency to reduce clutter
  geom_line(aes(y = predicted_rentals), color = "red", linewidth = 1.2) +  # Add segmented regression line (linewidth replaces the deprecated size)
  facet_wrap(~season) +  # Facet by season
  labs(
    title = "Seasonal Effect of Temperature on Bike Rentals (Piecewise Regression at 70°F)",
    x = "Temperature (°F)",
    y = "Number of Rentals",
    caption = "Data: Capital Bikeshare, 2016-2018"
  ) +
  theme_minimal()
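
To make the piecewise term concrete: in a model of the form rental ~ temperature + temp_above_70, the coefficient on temperature is the slope at or below 70°F, and the slope above 70°F is the sum of the two coefficients. The sketch below checks this on noise-free toy data with a known kink (the data and the coefficients 10, 2, and -3 are invented for illustration and are not estimates from the bikeshare data):

```r
# Noise-free toy data with a known kink at 70 (values are invented)
temperature <- seq(30, 100, by = 5)
rental <- 10 + 2 * temperature - 3 * pmax(0, temperature - 70)

toy <- data.frame(
  temperature   = temperature,
  rental        = rental,
  temp_above_70 = pmax(0, temperature - 70)  # 0 at or below 70, else (temperature - 70)
)

fit <- lm(rental ~ temperature + temp_above_70, data = toy)

# Slope at or below 70 is the temperature coefficient alone
slope_below_70 <- unname(coef(fit)["temperature"])

# Slope above 70 is the sum of both coefficients
slope_above_70 <- slope_below_70 + unname(coef(fit)["temp_above_70"])

c(below = slope_below_70, above = slope_above_70)
```

With real data the same arithmetic applies to the fitted coefficients, so a negative temp_above_70 coefficient would mean the temperature effect flattens or reverses past 70°F.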

Plot 2: How does the peak hourly rental demand compare to the average effective capacity across different temperature ranges?

# Create a temperature bucket variable: freezing or below, then up to roughly the
# first quartile (46F), the median (60F), the third quartile (75F), and above
bikeshare <- bikeshare |> 
  mutate(temp_bucket = case_when(
    temperature <= 32  ~ "32 or below",
    temperature <= 46  ~ "33-46",
    temperature <= 59  ~ "47-59",
    temperature <= 74  ~ "60-74",
    TRUE               ~ "75 and above"
  ))

bikeshare |> 
  group_by(hour, temp_bucket) |>
  mutate(max_rentals_per_hour_in_a_day = max(rental),
         avg_effective_capacity = mean(effective_capacity)) |> 
  ggplot(aes(x = hour)) +
  geom_point(aes(y = max_rentals_per_hour_in_a_day, color = "Max Rentals Per Hour")) +
  geom_point(aes(y = avg_effective_capacity, color = "Average Effective Capacity")) +
  geom_smooth(aes(y = max_rentals_per_hour_in_a_day, color = "Max Rentals Per Hour"), linewidth = 0.5, se = FALSE) + 
  facet_wrap(vars(temp_bucket)) + 
  labs(
    title = "Peak Hourly Bike Rentals vs. Effective Capacity Across Temperature Ranges",
    x = "Hour of the Day",
    y = "Count",
    color = "Metric",
    caption = "Data: Capital Bikeshare, 2016-2018"
  ) + 
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
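
As an alternative sketch of the temperature bucketing used for this plot, base R's `cut()` can produce the same bins as an ordered factor, which keeps the facets in temperature order regardless of how the labels sort alphabetically. The `temperature` vector below is made up for illustration:

```r
# Illustrative temperatures chosen to sit on both sides of each break
temperature <- c(20, 32, 33, 46, 47, 59, 60, 74, 75, 90)

# Intervals are right-closed by default, matching the `<=` conditions in case_when()
temp_bucket <- cut(
  temperature,
  breaks = c(-Inf, 32, 46, 59, 74, Inf),
  labels = c("32 or below", "33-46", "47-59", "60-74", "75 and above"),
  ordered_result = TRUE
)

data.frame(temperature, temp_bucket)
```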

Exercise 2: Modeling

Part 1: Multiple Regression

Predictor Variable Choice

Usage ratio is the ratio of rentals to effective capacity. Four variables were chosen to create a multiple regression model to predict usage ratio.

  1. Traffic Index
  2. Temperature (Fahrenheit)
  3. Precipitation (Yes/No indicator)
  4. Humidity

Traffic Index measures traffic congestion on a 0 to 100 scale, with 100 being fully congested. Capital Bikeshare’s typical rental demand likely comes from two types of customers: people riding for pleasure and people riding as a method of transportation. The traffic index should be a good predictor for the latter group. The hypothesis is that as congestion increases, so will demand for bikes. When the roads become too congested for driving, bikesharing becomes a potentially more practical way to get from point A to point B.

Temperature was chosen as a predictor of usage ratio because riders must ride outdoors. Those riding for pleasure are more likely to choose bikesharing when the weather is warm rather than cold. Similarly, bikesharing as a method of transportation is more likely to be considered when the weather is pleasant. How warm or cold it is outside may be a deciding factor in whether customers rent a bike or choose a mode of transportation that is not outdoors.

Similar to temperature, the precipitation indicator was chosen because it is another representation of current weather conditions. The hypothesis is that rain reduces rentals more strongly than effective capacity responds, so the usage ratio falls. With rain, customers get wet as they ride, which is not ideal, particularly for commuters. Rain also increases the chance that riders slip, which is another reason to predict lower rentals and therefore a lower usage ratio.

Humidity was chosen as another measure of outdoor conditions. The hypothesis is that as humidity rises, usage ratio falls. As humidity rises, people become more uncomfortable because the body has a harder time shedding excess heat through sweat and evaporation. More humid conditions may deter both pleasure and commuting riders from renting.

Model

lm_model <- lm(usage_ratio * 100 ~ traffic_index + temperature + there_is_precip + humidity, data = bikeshare)
summary(lm_model)

Call:
lm(formula = usage_ratio * 100 ~ traffic_index + temperature + 
    there_is_precip + humidity, data = bikeshare)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.676  -4.476  -0.783   3.088  38.978 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -3.102894   0.264909  -11.71   <2e-16 ***
traffic_index    0.269005   0.002353  114.34   <2e-16 ***
temperature      0.233262   0.003025   77.12   <2e-16 ***
there_is_precip -4.930729   0.322372  -15.29   <2e-16 ***
humidity        -0.118748   0.002850  -41.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.993 on 17461 degrees of freedom
Multiple R-squared:  0.5712,    Adjusted R-squared:  0.5711 
F-statistic:  5814 on 4 and 17461 DF,  p-value: < 2.2e-16

Model Interpretation

A multiple regression model was built using the four variables listed above to predict usage ratio. The usage ratio was multiplied by 100 so that it is expressed in percent and each coefficient reads as a change in percentage points.

The intercept, -3.1%, is not interpretable. The model is saying that a traffic index of 0, a temperature of 0F, no precipitation, and humidity of 0 would result in a usage ratio of -3.1%. First, the usage ratio cannot drop below 0, since it is impossible to have negative rentals. Second, the minimum temperature in the data is 9F, so extrapolating to 0F would be unreliable.

The coefficient on the Traffic Index is 0.27, statistically significant at the 0.05 level with a p-value <2e-16. This means that for every 1-point increase in the traffic index, the usage ratio is predicted to increase by 0.27 percentage points, all else held equal. This supports the earlier hypothesis that as congestion in an area rises, people are more willing to turn to bikesharing as a mode of transportation.

The coefficient on Temperature is 0.23, statistically significant at the 0.05 level with a p-value <2e-16. This states that for every 1 degree (F) increase in temperature, the usage ratio is predicted to increase by 0.23 percentage points, all else held equal. The hypothesis that people are more likely to rent bikes as the temperature rises is supported. The effect on rentals may be understated here because Capital Bikeshare’s effective capacity also increases with temperature. A positive coefficient therefore means that, as the temperature rises, rental demand outpaces the bikes added to effective capacity.

The coefficient on the precipitation indicator is -4.93, statistically significant at the 0.05 level with a p-value <2e-16. This means that if there is any precipitation during an hour, the usage ratio is predicted to drop by 4.93 percentage points, holding all else equal. The hypothesis is supported: precipitation decreases bike rental demand substantially, particularly since effective capacity can’t be changed easily on an hourly basis.

The coefficient on Humidity is -0.12, statistically significant at the 0.05 level with a p-value <2e-16. The model predicts that as humidity increases by 1%, the usage ratio decreases by 0.12 percentage points, holding all else equal. The hypothesis is supported: as humidity rises, all types of customers are likely less inclined to rent bikes, which lowers the usage ratio.
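
As a worked illustration of how the coefficients combine, the sketch below plugs the estimates from the summary output into the regression equation. The helper `predict_usage_pct()` and the input values (traffic index 50, 70F, humidity 50) are hypothetical and chosen only for the arithmetic:

```r
# Coefficients copied from the summary() output above
coefs <- c(
  intercept       = -3.102894,
  traffic_index   =  0.269005,
  temperature     =  0.233262,
  there_is_precip = -4.930729,
  humidity        = -0.118748
)

# Hypothetical helper: predicted usage ratio in percent for one hour
predict_usage_pct <- function(traffic_index, temperature, there_is_precip, humidity) {
  coefs[["intercept"]] +
    coefs[["traffic_index"]]   * traffic_index +
    coefs[["temperature"]]     * temperature +
    coefs[["there_is_precip"]] * there_is_precip +
    coefs[["humidity"]]        * humidity
}

dry <- predict_usage_pct(traffic_index = 50, temperature = 70, there_is_precip = 0, humidity = 50)
wet <- predict_usage_pct(traffic_index = 50, temperature = 70, there_is_precip = 1, humidity = 50)

round(c(dry = dry, wet = wet, difference = dry - wet), 2)
```

For these inputs the predicted usage ratio is about 20.7% without precipitation and about 15.8% with it. The gap equals the precipitation coefficient because every other input is held fixed.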

Recommendations

The results of the multiple regression model can be summarized into two points:

  1. As congestion increases, the usage ratio increases.
  2. As weather conditions improve, the usage ratio improves.

To handle increased demand during more congested times, Capital Bikeshare could deploy or relocate bikes near major roads or public transport hubs. Relocating bikes to these high-traffic areas would help capture the extra demand that comes with higher congestion. If the usage ratio gets too high, due to both the congestion itself and the demand captured by relocation, Capital Bikeshare could increase effective capacity during these hours to absorb the strong demand. This would only be necessary if Capital Bikeshare is targeting a low usage ratio. In the previous analysis (see Data Visualization Plot 2), the maximum rentals in an hour were about half the effective capacity. So even in the highest-demand scenario, Capital Bikeshare has had much more effective capacity than rental demand.

Capital Bikeshare could also implement dynamic pricing on rainy or humid days to encourage rentals. When it is rainy or humid during a given hour, the usage ratio tends to be lower. Offering dynamic-pricing discounts during these times would be a low-labor way to boost demand when the effective capacity is already in place.

Part 2: Regression Tree

Model

# Step 1: Define the tree model specification
set.seed(5)

tree_spec <- 
  decision_tree(
    cost_complexity = 0.01,  # Pruning parameter to prevent overfitting
    min_n = 20  # Minimum observations per node
  ) |> 
  set_engine("rpart") |> 
  set_mode("regression")

# Step 2: Create the recipe (preprocessing steps)
tree_recipe <- 
  recipe(usage_ratio ~ temperature + traffic_index + effective_capacity + there_is_precip + has_event, 
         data = bikeshare) |> 
  step_dummy(all_nominal_predictors())  # Convert categorical variables into dummy variables

# Step 3: Create the workflow
tree_wflow <- workflow() |> 
  add_recipe(tree_recipe) |> 
  add_model(tree_spec)

# Step 4: Fit the model on the full dataset
tree_fit <- tree_wflow |> 
  fit(data = bikeshare)

# Step 5: Extract the fitted tree model
bikeshare_tree <- tree_fit |> 
  extract_fit_engine()

# Step 6: Visualize the decision tree
rpart.plot(
  bikeshare_tree, 
  type = 3, 
  extra = 101, 
  tweak = 1.2, 
  box.palette = "Blues", 
  roundint = FALSE,  # the extracted fit does not carry the training data, so integer rounding of splits is turned off
  main = "Regression Tree for Bikeshare Usage Ratio"
)

Most Important Split & Interpretation of the Tree

The most important split in the regression tree is the first one: traffic index < 19 vs. traffic index >= 19. Traffic appears to be a major contributing factor in predicting usage ratio. The tree splits on traffic index, effective capacity, temperature, and whether there is an event.
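
One way to read the root split and the variable importance directly from a fitted tree, rather than from the plot, is sketched below with `rpart` on made-up data. The `toy` grid and its built-in jump at a traffic index of 19 are invented for illustration. The `cp` and `minsplit` values mirror the `cost_complexity` and `min_n` settings used in the model specification above:

```r
library(rpart)

# Toy grid: usage jumps when traffic_index crosses 19, with a small temperature effect
toy <- expand.grid(
  traffic_index = seq(0, 100, by = 5),
  temperature   = seq(10, 90, by = 10)
)
toy$usage_pct <- ifelse(toy$traffic_index < 19, 2, 15) + 0.05 * toy$temperature

toy_tree <- rpart(
  usage_pct ~ traffic_index + temperature,
  data    = toy,
  method  = "anova",
  control = rpart.control(cp = 0.01, minsplit = 20)
)

# The first row of the frame describes the root node, including its split variable
root_variable <- as.character(toy_tree$frame$var[1])
root_variable

# Importance aggregates each variable's contribution across all splits
toy_tree$variable.importance
```

The root split is the single split that most reduces the error on the full data. Variable importance sums contributions over the whole tree, so the two usually agree on the top variable but measure different things.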

The left branch of the tree contains all observations with a traffic index < 19, so low congestion signals a relatively lower usage ratio. Within this branch, the next split is whether effective capacity is less than 3300. Usage ratio is the ratio of rentals to effective capacity, so when effective capacity drops one might expect the usage ratio to rise. However, effective capacity is reduced to under 3300 on average during the hours of 12am-5am. So effective capacity below 3300 corresponds to the early hours of the day, when bike rentals are close to zero, and the usage ratio is predicted to be 1.5%. The right side of this branch splits on temperature. If there is low congestion, it is not the early morning, and the temperature is less than 78 degrees, the usage ratio is predicted to be 5%. If those same conditions are met but the temperature is greater than 78 degrees, the predicted usage ratio is 12%.

The right branch of the tree, where the traffic index is greater than or equal to 19, splits next on whether the temperature is below or above 60F. The left side of that split then splits on a traffic index of 56%, a considerably higher level of congestion. The tree predicts that if the traffic index is between 19% and 56% and the temperature is less than 60F, the usage ratio will be 9.4%, which is below the average usage ratio of 11%. On the right side of the split, a traffic index greater than 56% and temperature above 60F indicates a usage ratio of 18%, which is well above average. This is to be expected, as the multiple regression highlighted that warmer temperatures and higher traffic index scores were predictors of a higher usage ratio.

If the traffic index is greater than 19% and the temperature is greater than 60F, the next split is again on traffic index. A traffic index greater than 52% with a temperature greater than 60F results in a predicted usage ratio of 25%. This, coupled with the other prediction that a temperature greater than 60F and a traffic index greater than 56% result in a predicted usage ratio of 18%, suggests that when it is warm (above 60F), a traffic index of around 50 puts the most stress on Capital Bikeshare’s system. In this warm scenario, as the traffic index grows above 56, the predicted usage ratio tapers down from 25% to 18%. One potential explanation is that when congestion is too high, the bike lanes get filled up and squeezed by the additional cars on the road.

If the traffic index is less than 52%, the next split is whether there is an event. If there is an event, the traffic index is between 19% and 52%, and the temperature is above 60F, then the usage ratio is predicted to be 24%. An event, especially in warm conditions with moderate traffic, appears to be an ideal situation for high bike rental demand.

If there is no event, the final split is whether the temperature is below or above 78F. If the temperature is between 60F and 78F, there is no event, and the traffic index is between 19% and 52%, the tree predicts a usage ratio of 14%, slightly above average. This makes sense: moderately warm temperatures with slight to moderate traffic result in a slightly above-average usage ratio. If the traffic index is between 19% and 52%, there is no event, and the temperature is warmer than 78F, the predicted usage ratio is 19%. Like the other predicted usage ratios above 18%, these days with higher-than-average usage ratios are marked by warm weather, moderate traffic, and the potential of an event.

Recommendations

These recommendations are based on the assumption that Capital Bikeshare wants to increase overall demand while keeping usage ratio around their current average of 10-12%.

Expanding on the recommendations from the multiple regression section:

  1. Capital Bikeshare could relocate bikes to highly congested areas, particularly when the temperature is above 60F and the traffic index is moderate (40-60%). Data Visualization Plot 2 shows that, across temperature ranges and hours, effective capacity varies relatively little while rental demand fluctuates throughout the day. Reorganizing the fleet during what is predicted to be the peak stress time on the system would allow Capital Bikeshare to capture more rental demand while keeping the same level of investment in bikes.
  2. If Capital Bikeshare is worried about the usage ratio getting too high during this time, it could use dynamic pricing to raise prices during these warm, rush-hour situations so that price rises with demand. This would allow it to capture similar value, keep the usage ratio within the current norm, and avoid drastically expanding effective capacity for these smaller peaks throughout the day.
  3. Capital Bikeshare could increase effective capacity on days when there is an event, in particular when the weather is warm and traffic is moderate. If Capital Bikeshare knows that an event is coming up, it can increase effective capacity in those areas to match the heightened demand. This is especially needed if the temperature is above 60F and traffic is moderate, between 19% and 52% congestion.
  4. Capital Bikeshare could again use dynamic pricing when there is low traffic (a traffic index below 19%) and the temperature is below 78F. In these scenarios the usage ratio is only around 5%, which is less than half of the average. Offering discounts or promotions during these times would make better use of fixed-capacity bikes that are not getting much usage. These lowered prices could help bring more customers into the bike ecosystem during low-volume portions of the day.