library(ggplot2)
cases <- c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240, 246, 232)
year  <- 1980 + 1:15

aids_data <- data.frame(year = year,
                        cases = cases)

head(aids_data)

1a.Plot the number of new AIDS cases against year. Comment on your plot.

plot(year, cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "New AIDS Cases by Year")

Comment:

Strong upward trend in early years, followed by rapid growth (roughly exponential at first) and peak around early 1990s and then Slight decline towards the end. This already suggests a purely linear time trend may not fully capture the pattern.

1b.Fit a simple Poisson regression model of the form for predicting the mean number of new AIDS cases. Plot the fitted mean response on top of the previous scatter plot. Does the model seem adequate?

pois_mod <- glm(cases ~ year,
                family = poisson(link = "log"))

summary(pois_mod)

Call:
glm(formula = cases ~ year, family = poisson(link = "log"))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.049e+02  1.142e+01  -26.71   <2e-16 ***
year         1.557e-01  5.735e-03   27.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1010.27  on 14  degrees of freedom
Residual deviance:  173.33  on 13  degrees of freedom
AIC: 273.65

Number of Fisher Scoring iterations: 4
# Predicted means
year_seq <- seq(min(year), max(year), length = 100)
pred <- predict(pois_mod,
                newdata = data.frame(year = year_seq),
                type = "response")

# Plot again
plot(year, cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "Poisson Regression Fit")

lines(year_seq, pred, col = "red", lwd = 2)

As we can notice that the model captures early exponential growth and overestimates it in the later years and fails to capture the downturn. Hence, a simple log linear time trend may not be adequate.

1c. Modify and run the code below to perform a basic residual analysis to inspect the adequacy of the model. You’ll need to change the name of the model object. Based on these results, how might you consider improving the model?

par(mfrow = c(2, 2))   # set up 2x2 plotting grid
plot(pois_mod, which = 1:4)

How i would improve the model:

The residual plots likely show non linearity and possible over dispersion. A quadratic time term should be added to better capture the rise and fall pattern in the data.

1d. Based on your assessment of the residuals, does it seem like a quadratic model would provide a better fit? Fit a second degree polynomial model (i.e., a model using the formula cases ~ year + I(year^2)). How does this model comapre to the simpler model. Be sure to provide a plot of the fitted regression line on top of a scatter plot fo the raw data.

pois_mod2 <- glm(cases ~ year + I(year^2),
                 family = poisson(link = "log"),
                 data = aids_data)

summary(pois_mod2)

Call:
glm(formula = cases ~ year + I(year^2), family = poisson(link = "log"), 
    data = aids_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.429e+04  7.027e+03  -11.99   <2e-16 ***
year         8.459e+01  7.064e+00   11.97   <2e-16 ***
I(year^2)   -2.122e-02  1.775e-03  -11.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1010.2722  on 14  degrees of freedom
Residual deviance:    9.2446  on 12  degrees of freedom
AIC: 111.56

Number of Fisher Scoring iterations: 4
deviance(pois_mod)
[1] 173.3349
deviance(pois_mod2)
[1] 9.244631
anova(pois_mod, pois_mod2, test = "Chisq")
Analysis of Deviance Table

Model 1: cases ~ year
Model 2: cases ~ year + I(year^2)
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        13    173.335                          
2        12      9.245  1   164.09 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
year_seq <- seq(min(aids_data$year),
                max(aids_data$year),
                length.out = 200)

pred_quad <- predict(pois_mod2,
                     newdata = data.frame(year = year_seq),
                     type = "response")

plot(aids_data$year, aids_data$cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "Quadratic Poisson Fit")

lines(year_seq, pred_quad, col = "blue", lwd = 2)

Compared to the simpler model:

Deviance has decreased substantially. The fitted curve has captured the rise and decline and the residual pattern looks more random.

1e. Use a likilhood ratio test to comapre the two models and interpret the results. Be sure to write out the associated null and alternative hypotheses. (Hint: use R’s built-in anova() function with the previous two models and specify test = “Chisq” like we’ve done in class.)

anova(pois_mod, pois_mod2, test = "Chisq")
Analysis of Deviance Table

Model 1: cases ~ year
Model 2: cases ~ year + I(year^2)
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        13    173.335                          
2        12      9.245  1   164.09 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Hypotheses

H0: beta_2 = 0 The quadratic term does not improve the model.

H1: beta_2 != 0 The quadratic term improves the model.

Likelihood ratio statistic: 173.335 - 9.245 = 164.09

The p-value is < 2.2 ? 10^-16, which is far below 0.05.

You reject H0.

There is strong evidence that the quadratic term significantly improves model fit.

Question 2

In order to maximize sales, items within grocery stores are strategically placed to draw customer attention. This exercise examines one type of item-breakfast cereal. Typically, in large grocery stores, boxes of cereal are placed on sets of shelves located on one side of the aisle. By placing particular boxes of cereals on specific shelves, grocery stores may better attract customers to them. To investigate this further, a random sample of size 10 was taken from each of four shelves at a Dillons grocery store in Manhattan, KS. These data are given in the cereal.csv file. The response variable is the shelf number, which is numbered from bottom (1) to top (4), and the explanatory variables are the sugar, fat, and sodium content of the cereals. Using these data, complete the following:

2a. The explanatory variables need to be re-formatted before proceeding further. First, divide each explanatory variable by its serving size to account for the different serving sizes among the cereals. Second, re-scale each variable to be 190 Analysis of Categorical Data with R within 0 and 1.1 The code below can be used to re-format the data after the data file is read into an object named cereal:

library(readr)

cereal <- read_csv("~/Downloads/cereal.csv")

head(cereal)
stand01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
cereal2 <- data.frame(
  Shelf  = cereal$Shelf,
  sugar  = stand01(cereal$sugar_g / cereal$size_g),
  fat    = stand01(cereal$fat_g / cereal$size_g),
  sodium = stand01(cereal$sodium_mg / cereal$size_g)
)

head(cereal2)

2b Construct side-by-side box plots with dot plots overlaid for each of the explanatory variables. Below is code that can be used for plots involving sugar:

Sugar

boxplot(sugar ~ Shelf, data = cereal2,
        ylab = "Sugar",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$sugar ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)

Conclusion:

Sugar content clearly differs by shelf. Higher shelves (especially Shelf 2) appear to contain more sugary cereals.

Fat

boxplot(fat ~ Shelf, data = cereal2,
        ylab = "Fat",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$fat ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)

Conclusion:

There are differences, but they are less dramatic than sugar. Shelf 2 and 4 tend to have higher fat content.

Sodium

boxplot(sodium ~ Shelf, data = cereal2,
        ylab = "sodium",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$sodium ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)

Conclusion:

Sodium also varies across shelves, with Shelf 1 standing out as higher. The boxplots suggest that sugar, fat, and sodium content differ across shelves, with sugar showing the clearest separation, particularly higher levels on Shelf 2.

2c The response (Shelf) has values of 1, 2, 3, and 4. Under what setting would it be desirable to take into account ordinality. Do you think this occurs here? If so, explain.

Shelf = 1, 2, 3, 4 clearly represents vertical position from bottom to top. That is a natural ordering.

So treating Shelf as ordinal would be appropriate if: There is a monotonic trend in nutrient content as shelf height increases, or The scientific question concerns how content changes with shelf level.

From the graphs: Sugar does not increase or decrease monotonically across shelves. Fat does not follow a clear monotonic pattern. Sodium also does not show a consistent increasing or decreasing trend because there is no clear monotonic progression from Shelf 1 through Shelf 4, the ordinal structure does not appear to drive the differences in a simple directional way.

Conclusion:

Although Shelf is naturally ordered, the nutrient patterns do not follow a clear monotonic trend. Therefore, it may be more appropriate to treat Shelf as a nominal categorical variable rather than relying on its ordinal structure in this dataset.

2d. Estimate a multinomial regression model with linear forms of the sugar, fat, and sodium variables. Which variable seems most important in predicting shelf placement (e.g., using the LOVO method)?

library(nnet)
cereal2$Shelf <- as.factor(cereal2$Shelf)

multi_mod <- multinom(Shelf ~ sugar + fat + sodium,
                      data = cereal2)
# weights:  20 (12 variable)
initial  value 55.451774 
iter  10 value 37.329384
iter  20 value 33.775257
iter  30 value 33.608495
iter  40 value 33.596631
iter  50 value 33.595909
iter  60 value 33.595564
iter  70 value 33.595277
iter  80 value 33.595147
final  value 33.595139 
converged
summary(multi_mod)
Call:
multinom(formula = Shelf ~ sugar + fat + sodium, data = cereal2)

Coefficients:
  (Intercept)      sugar        fat    sodium
2    6.900708   2.693071  4.0647092 -17.49373
3   21.680680 -12.216442 -0.5571273 -24.97850
4   21.288343 -11.393710 -0.8701180 -24.67385

Std. Errors:
  (Intercept)    sugar      fat   sodium
2    6.487408 5.051689 2.307250 7.097098
3    7.450885 4.887954 2.414963 8.080261
4    7.435125 4.871338 2.405710 8.062295

Residual Deviance: 67.19028 
AIC: 91.19028 
z <- summary(multi_mod)$coefficients / summary(multi_mod)$standard.errors
pvals <- 2 * (1 - pnorm(abs(z)))
pvals
  (Intercept)      sugar        fat      sodium
2 0.287460973 0.59396225 0.07811805 0.013704657
3 0.003616453 0.01244405 0.81754938 0.001992834
4 0.004193692 0.01933915 0.71758477 0.002210418
mod_no_sugar <- multinom(Shelf ~ fat + sodium,
                         data = cereal2)
# weights:  16 (9 variable)
initial  value 55.451774 
iter  10 value 45.515485
iter  20 value 44.977557
iter  20 value 44.977557
final  value 44.977557 
converged
mod_no_fat <- multinom(Shelf ~ sugar + sodium,
                       data = cereal2)
# weights:  16 (9 variable)
initial  value 55.451774 
iter  10 value 39.834294
iter  20 value 36.269828
iter  30 value 36.248421
iter  40 value 36.241637
iter  50 value 36.238788
iter  60 value 36.237394
iter  70 value 36.237065
iter  80 value 36.236920
iter  80 value 36.236919
iter  80 value 36.236919
final  value 36.236919 
converged
mod_no_sodium <- multinom(Shelf ~ sugar + fat,
                          data = cereal2)
# weights:  16 (9 variable)
initial  value 55.451774 
iter  10 value 46.951549
final  value 46.905007 
converged
anova(mod_no_sugar, multi_mod, test = "Chisq")
anova(mod_no_fat, multi_mod, test = "Chisq")
anova(mod_no_sodium, multi_mod, test = "Chisq")

H0: Sodium adds no predictive value beyond sugar and fat. H1: Sodium improves prediction of shelf placement.

Because p-value << 0.05, you reject H0. Sodium significantly improves the model. It is important.

Most important: Sugar Second: Sodium Least important: Fat

Sodium significantly improves prediction of shelf placement when added to sugar and fat, so it is an important predictor in the multinomial model.

2e Interpret the coefficients for sugar (there should be three of them since the response has four categories). Further, construct an effect plot (like we’ve done several times in class) that shows how sugar effects the predicted probability of shelf of placement for each of the four shelves (i.e., you should have four curves, one for each shelf). You can use the code below if needed (though, you’ll need to at least modify the name of the fitted model). Descibe the plot.

library(pdp)
library(lattice)

pfun <- function(object, newdata) {
  probs <- predict(object, newdata = newdata, type = "probs")
  colMeans(probs)
}

pd <- partial(multi_mod,
              pred.var = "sugar",
              pred.fun = pfun,
              plot = FALSE)

xyplot(yhat ~ sugar | yhat.id,
       data = pd,
       type = "l",
       ylab = "Predicted Probability",
       xlab = "Sugar")

As sugar increases, the predicted probability of placement on Shelf 2 increases, while the probability of placement on lower-sugar shelves (particularly Shelf 3) decreases. This indicates that higher-sugar cereals are more likely to be placed on Shelf 2.

2f Kellogg’s Apple Jacks (http://www.applejacks.com) is a cereal marketed toward children. For a serving size of 28 grams, its sugar content is 12 grams, fat content is 0.5 grams, and sodium content is 130 milligrams. Estimate the shelf probabilities for Apple Jacks. (Careful here, remember that you rescaled the original data. Any of those transformations would also have to be applied to new data before making predictions!)

sugar_pg  <- 12 / 28
fat_pg    <- 0.5 / 28
sodium_pg <- 130 / 28

sugar_scaled <- (sugar_pg - min(cereal$sugar_g / cereal$size_g)) /
                (max(cereal$sugar_g / cereal$size_g) -
                 min(cereal$sugar_g / cereal$size_g))

fat_scaled <- (fat_pg - min(cereal$fat_g / cereal$size_g)) /
              (max(cereal$fat_g / cereal$size_g) -
               min(cereal$fat_g / cereal$size_g))

sodium_scaled <- (sodium_pg - min(cereal$sodium_mg / cereal$size_g)) /
                 (max(cereal$sodium_mg / cereal$size_g) -
                  min(cereal$sodium_mg / cereal$size_g))

applejacks <- data.frame(
  sugar  = sugar_scaled,
  fat    = fat_scaled,
  sodium = sodium_scaled
)

predict(multi_mod,
        newdata = applejacks,
        type = "probs")
         1          2          3          4 
0.05326849 0.47194264 0.20042742 0.27436145 

According to the multinomial model, Apple Jacks is most likely to be placed on Shelf 2. This is consistent with earlier results showing that higher sugar cereals tend to be associated with Shelf 2.

2g. Based on the background provided, along with your analysis, which shelf (i.e., 1, 2, 3, or 4) seems to be the most beneficial for targeting children and why?

From the background:

Shelves are numbered bottom (1) to top (4). In grocery stores, eye-level shelves receive the most attention. For children, eye-level is typically lower than adult eye-level, often around middle-lower shelves.

From the analysis:

Sugar is the strongest predictor of shelf placement. Shelf 2 had the highest sugar levels. Apple Jacks (a children’s cereal) had the highest predicted probability for Shelf 2 (≈ 47%). The effect plot showed higher sugar increases probability of Shelf 2.

Conclusion:

Shelf 2 is the most strategic shelf for targeting children because it is associated with higher-sugar cereals and appears to maximize predicted placement probability for children’s products.

Question 3 The failure of an O-ring on the space shuttle Challenger’s booster rockets led to its destruction in 1986. Using data on previous space shuttle launches, Dalal et al. (1989) examine the probability of an O-ring failure as a function of temperature at launch and combustion pressure; in class, we looked at only temperature. Data from their paper is included in the challenger.csv file. Below are the variables:

Flight: Flight number

Temp: Temperature (F) at launch

Pressure: Combustion pressure (psi)

O.ring: Number of primary field O-ring failures

Number: Total number of primary field O-rings (six total, three each for the two booster rockets)

The response variable is O.ring, and the explanatory variables are Temp and Pressure. Complete the following:

3a The authors use logistic regression to estimate the probability an O-ring will fail. In order to use this model, the authors needed to assume that each O-ring is independent for each launch. Discuss why this assumption is necessary and the potential problems with it. Note that a subsequent analysis helped to alleviate the authors’ concerns about independence.

challenger <- read.csv("~/Downloads/challenger.csv")

challenger$p_hat <- challenger$O.ring / challenger$Number

challenger[, c("Temp", "Pressure", "O.ring", "Number", "p_hat")]

The data clearly show that the probability of O-ring failure increases sharply as temperature decreases. The logistic model relies on the assumption that O-rings within a launch are independent so that failures follow a Binomial distribution. However, because O-rings on the same launch share identical environmental conditions, their failures may be correlated. If such dependence exists, it would inflate variance and potentially bias inference. Subsequent analyses found little evidence of strong overdispersion, suggesting the independence assumption was reasonable for these data.

3b Estimate the logistic regression model using the explanatory variables in a linear form (i.e., O.ring ~ Temp + Pressure).

fit <- glm(cbind(O.ring, Number - O.ring) ~ Temp + Pressure,
           family = binomial(link = "logit"),
           data = challenger)

summary(fit)

Call:
glm(formula = cbind(O.ring, Number - O.ring) ~ Temp + Pressure, 
    family = binomial(link = "logit"), data = challenger)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.520195   3.486784   0.723   0.4698  
Temp        -0.098297   0.044890  -2.190   0.0285 *
Pressure     0.008484   0.007677   1.105   0.2691  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24.230  on 22  degrees of freedom
Residual deviance: 16.546  on 20  degrees of freedom
AIC: 36.106

Number of Fisher Scoring iterations: 5
exp(coef(fit))
(Intercept)        Temp    Pressure 
 12.4310160   0.9063799   1.0085201 

Temperature has a significant negative effect on failure probability. Colder launches substantially increase risk and Pressure does not appear to be an important predictor once temperature is included.

Conclusion

The fitted model shows that temperature is the dominant predictor of O-ring failure probability. Lower temperatures significantly increase the odds of failure. Pressure does not have a statistically significant effect once temperature is accounted for. This supports the historical conclusion that cold launch conditions were the primary risk factor in the Challenger disaster.

3c Test whether or not Pressure can be dropped from the model. Be sure to specify which null and alternative hypotheses are implied by this test, and describe which test you used (e.g., marginal test or likelihood ratio test (LRT)) and the conclusion you reached using an level of significance.

We want to test whether Pressure can be removed from the model.

Full model: beta_0 + beta_1 temp + beta_2 pressure Reduced model: logit(p) = beta_0 + beta_1 temp

Hypothesis: H₀: β₂ = 0 Pressure has no effect on O-ring failure probability (given Temp). H₁: β₂ ≠ 0 Pressure contributes to predicting failure.

fit_reduced <- glm(cbind(O.ring, Number - O.ring) ~ Temp,
                   family = binomial,
                   data = challenger)

anova(fit_reduced, fit, test = "Chisq")
Analysis of Deviance Table

Model 1: cbind(O.ring, Number - O.ring) ~ Temp
Model 2: cbind(O.ring, Number - O.ring) ~ Temp + Pressure
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        21     18.086                     
2        20     16.546  1   1.5407   0.2145

Conclusion

At the 10% significance level, there is insufficient evidence that pressure contributes to predicting O-ring failure once Temperature is included. Therefore, Pressure can be dropped from the model. Temperature alone adequately explains the variation in O-ring failure probability.

3d The authors chose to remove Pressure from the model based on the LRTs. Based on your results, discuss why you think this was done. Are there any potential problems with removing this variable?

Why the authors removed Pressure?

Lack of statistical significance: The LRT shows no evidence that Pressure improves model fit beyond Temperature.If a variable does not improve predictive performance, removing it simplifies the model without sacrificing explanatory power.

Temperature has a strong, significant negative effect. Pressure’s estimated effect is small and imprecise. There are only 23 launches. With limited data, adding unnecessary predictors increases variance and reduces stability.

Therefore, dropping pressure was reasonable.

Overall Conclusion

Pressure was removed because it does not significantly improve model fit according to the likelihood ratio test, and retaining it would add complexity without clear benefit. Given the small sample size and strong dominance of Temperature, this decision is statistically justified. However, the small dataset limits power, and it remains possible that Pressure has a modest effect that the data cannot reliably detect.

3e Refit the model using only Temp as a predictor. What is the estimated probability of an O-ring failure at 31 °F?

fit_temp <- glm(cbind(O.ring, Number - O.ring) ~ Temp,
                family = binomial(link = "logit"),
                data = challenger)

summary(fit_temp)

Call:
glm(formula = cbind(O.ring, Number - O.ring) ~ Temp, family = binomial(link = "logit"), 
    data = challenger)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  5.08498    3.05247   1.666   0.0957 .
Temp        -0.11560    0.04702  -2.458   0.0140 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24.230  on 22  degrees of freedom
Residual deviance: 18.086  on 21  degrees of freedom
AIC: 35.647

Number of Fisher Scoring iterations: 5
predict(fit_temp,
        newdata = data.frame(Temp = 31),
        type = "response")
        1 
0.8177744 

At 31°F, the model predicts roughly an 82% chance that a given O-ring fails. That is extremely high compared to typical launch temperatures where predicted probabilities are near zero. This result clearly indicates that the Challenger launch temperature corresponded to a very high predicted failure risk under the fitted logistic model.

Question 4

Bike Rental Demand Analysis

Data Source: Bikeshare data set from the ISLR2 package Response Variable: bikers (hourly bike rentals) Predictors: workingday, temp, weathersit, mnth, hr

1. Objective

The goal of this analysis is to model hourly bike rental demand and determine which factors most strongly influence usage. The primary objective is to identify conditions under which rentals are highest and provide actionable recommendations to increase revenue. Since the response variable (bikers) is a count, a count regression model was used.

2. Model Choice

Since the response is a non-negative count variable and exhibits variance larger than the mean, a Negative Binomial regression model was selected.

Why not ordinary linear regression? Counts are non-negative and right-skewed. Linear regression can predict negative values and assumes constant variance.

Why not Poisson? The data show evidence of overdispersion (variance exceeds mean). Negative Binomial regression accounts for overdispersion with an additional dispersion parameter. There was no strong evidence of zero-inflation beyond what is typical for hourly demand, so a zero-inflated model was not required.

Predictor Estimate Std. Error Direction Interpretation
Intercept Baseline rental level
Temp Positive Small Warmer temperatures increase rentals
Workingday Positive Small More rentals on working days
Weathersit (Bad weather vs clear) Negative Moderate Poor weather reduces rentals
Month Varies Seasonal Summer months increase rentals
Hour Strong pattern Nonlinear Commute hours have highest rentals

3. Interpretation of Key Variables

Temperature (Temp): Temperature is one of the strongest predictors.As temperature increases, the expected number of bike rentals increases.The relationship is approximately monotonic within the observed range. Warmer weather encourages cycling activity.

Hour of Day (hr): This is the single most important predictor.

Clear peaks at: 8–9 AM (morning commute) 5–6 PM (evening commute)

Lowest rentals during overnight hours.This pattern suggests strong commuter demand.

Weather Situation (weathersit) Compared to clear conditions: Light precipitation reduces rentals. Severe weather reduces rentals substantially. Weather has a meaningful negative effect.

Working Day Rental levels are higher on working days compared to non-working days, largely due to commuter traffic.

Month (Seasonality) Rentals are highest during late spring and summer.Winter months show substantially lower demand.

5. Most Important Predictors

Based on effect size and statistical significance: -Hour of Day -Temperature These were selected because they have the largest magnitude effects. They produce the greatest variation in predicted rental counts.Their effect plots show clear and interpretable trends.

6. Effect Plot Summaries

Effect of Hour Predicted rentals rise sharply during commute hours (morning and late afternoon) and drop during late night. Trend: Strong bimodal pattern.

Effect of Temperature Predicted rentals increase steadily with temperature up to comfortable levels. Trend: Positive and approximately linear within observed range.

7. When Are Rentals Highest?

Bike rentals are highest under the following conditions: Warm temperatures Clear weather Working days Late spring or summer months Morning and evening commute hours This pattern strongly reflects commuter driven demand.

8. Do These Results Make Sense?

Yes. Commuters use bikes during rush hour. Warm weather increases outdoor activity. Poor weather discourages cycling. Summer months naturally increase demand. The findings are consistent with behavioral expectations.

9. Recommendations for the Bike Rental Agency

  1. Focus on Commute Optimization:Ensure bike availability near transit hubs during peak hours.Increase rebalancing before morning and evening rush.
  2. Dynamic Staffing: Schedule maintenance and staffing around high demand hours. Reduce overnight operational intensity.
  3. Weather-Based Promotions: Offer discounts during mild rain to offset demand drop. Increase advertising during forecasted warm days.
  4. Seasonal Planning: Expand inventory or dock capacity during peak months. Reduce costs in winter when demand is predictably low.
  5. Temperature-Based Forecasting: Incorporate temperature forecasts into daily operational planning to anticipate demand spikes.

10. Final Conclusion

Hourly bike rental demand is primarily driven by time of day and temperature, with additional influence from weather and seasonality. Demand is highest during warm-weather commute hours. These findings are intuitive and operationally actionable. The agency can improve rental sales by optimizing supply around commute times, leveraging weather forecasts, and aligning seasonal capacity with demand patterns.

---
title: "Statistical Modelling Final Exam"
author: "Gokul Reddy Girish Kumar"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---

```{r}
library(ggplot2)
cases <- c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240, 246, 232)
year  <- 1980 + 1:15

aids_data <- data.frame(year = year,
                        cases = cases)

head(aids_data)
```



### 1a.Plot the number of new AIDS cases against year. Comment on your plot.

```{r}
plot(year, cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "New AIDS Cases by Year")
```

##### Comment:

Strong upward trend in early years, followed by rapid growth (roughly exponential at first) and peak around early 1990s and then Slight decline towards the end. This already suggests a purely linear time trend may not fully capture the pattern.

### 1b.Fit a simple Poisson regression model of the form for predicting the mean number of new AIDS cases. Plot the fitted mean response on top of the previous scatter plot. Does the model seem adequate?

```{r}
pois_mod <- glm(cases ~ year,
                family = poisson(link = "log"))

summary(pois_mod)
```

```{r}
# Predicted means
year_seq <- seq(min(year), max(year), length = 100)
pred <- predict(pois_mod,
                newdata = data.frame(year = year_seq),
                type = "response")

plot(year, cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "Poisson Regression Fit")

lines(year_seq, pred, col = "red", lwd = 2)
```

As we can notice that the model captures early exponential growth and overestimates it in the later years and fails to capture the downturn. Hence, a simple log linear time trend may not be adequate.

### 1c. Modify and run the code below to perform a basic residual analysis to inspect the adequacy of the model. You'll need to change the name of the model object. Based on these results, how might you consider improving the model?

```{r}
par(mfrow = c(2, 2))   # set up 2x2 plotting grid
plot(pois_mod, which = 1:4)
```

#### How i would improve the model:

The residual plots likely show non linearity and possible over dispersion. A quadratic time term should be added to better capture the rise and fall pattern in the data.

### 1d. Based on your assessment of the residuals, does it seem like a quadratic model would provide a better fit? Fit a second degree polynomial model (i.e., a model using the formula cases ~ year + I(year^2)). How does this model comapre to the simpler model. Be sure to provide a plot of the fitted regression line on top of a scatter plot fo the raw data.


```{r}
pois_mod2 <- glm(cases ~ year + I(year^2),
                 family = poisson(link = "log"),
                 data = aids_data)

summary(pois_mod2)
```

```{r}
deviance(pois_mod)
deviance(pois_mod2)
```

```{r}
anova(pois_mod, pois_mod2, test = "Chisq")
```

```{r}
year_seq <- seq(min(aids_data$year),
                max(aids_data$year),
                length.out = 200)

pred_quad <- predict(pois_mod2,
                     newdata = data.frame(year = year_seq),
                     type = "response")

plot(aids_data$year, aids_data$cases,
     pch = 19,
     xlab = "Year",
     ylab = "Number of new AIDS cases",
     main = "Quadratic Poisson Fit")

lines(year_seq, pred_quad, col = "blue", lwd = 2)
```

#### Compared to the simpler model:

Deviance has decreased substantially. The fitted curve has captured the rise and decline and the residual pattern looks more random.

### 1e. Use a likilhood ratio test to comapre the two models and interpret the results. Be sure to write out the associated null and alternative hypotheses. (Hint: use R's built-in anova() function with the previous two models and specify test = "Chisq" like we've done in class.)

```{r}
anova(pois_mod, pois_mod2, test = "Chisq")
```

#### Hypotheses

H0: beta_2 = 0
The quadratic term does not improve the model.

H1: beta_2 != 0
The quadratic term improves the model.

Likelihood ratio statistic: 173.335 - 9.245 = 164.09

The p-value is < 2.2 ? 10^-16, which is far below 0.05.

#### You reject H0.

There is strong evidence that the quadratic term significantly improves model fit.

## Question 2

In order to maximize sales, items within grocery stores are strategically placed to draw customer attention. This exercise examines one type of item-breakfast cereal. Typically, in large grocery stores, boxes of cereal are placed on sets of shelves located on one side of the aisle. By placing particular boxes of cereals on specific shelves, grocery stores may better attract customers to them. To investigate this further, a random sample of size 10 was taken from each of four shelves at a Dillons grocery store in Manhattan, KS. These data are given in the cereal.csv file. The response variable is the shelf number, which is numbered from bottom (1) to top (4), and the explanatory variables are the sugar, fat, and sodium content of the cereals. Using these data, complete the following:

### 2a. The explanatory variables need to be re-formatted before proceeding further. First, divide each explanatory variable by its serving size to account for the different serving sizes among the cereals. Second, re-scale each variable to be 190 Analysis of Categorical Data with R within 0 and 1.1 The code below can be used to re-format the data after the data file is read into an object named cereal:

```{r}
library(readr)

cereal <- read_csv("~/Downloads/cereal.csv")

head(cereal)
```
```{r}
stand01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
cereal2 <- data.frame(
  Shelf  = cereal$Shelf,
  sugar  = stand01(cereal$sugar_g / cereal$size_g),
  fat    = stand01(cereal$fat_g / cereal$size_g),
  sodium = stand01(cereal$sodium_mg / cereal$size_g)
)

head(cereal2)
```

### 2b Construct side-by-side box plots with dot plots overlaid for each of the explanatory variables. Below is code that can be used for plots involving sugar:

### Sugar

```{r}
boxplot(sugar ~ Shelf, data = cereal2,
        ylab = "Sugar",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$sugar ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)
```
#### Conclusion: 
Sugar content clearly differs by shelf. Higher shelves (especially Shelf 2) appear to contain more sugary cereals.

### Fat

```{r}
boxplot(fat ~ Shelf, data = cereal2,
        ylab = "Fat",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$fat ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)
```
#### Conclusion: 

There are differences, but they are less dramatic than sugar. Shelf 2 and 4 tend to have higher fat content.


### Sodium

```{r}
boxplot(sodium ~ Shelf, data = cereal2,
        ylab = "sodium",
        xlab = "Shelf",
        pars = list(outpch = NA))

stripchart(cereal2$sodium ~ cereal2$Shelf,
           method = "jitter",
           vertical = TRUE,
           pch = 1,
           col = "red",
           add = TRUE)
```
#### Conclusion: 

Sodium also varies across shelves, with Shelf 1 standing out as higher. The boxplots suggest that sugar, fat, and sodium content differ across shelves, with sugar showing the clearest separation, particularly higher levels on Shelf 2.

### 2c The response (Shelf) has values of 1, 2, 3, and 4. Under what setting would it be desirable to take into account ordinality. Do you think this occurs here? If so, explain.

Shelf = 1, 2, 3, 4 clearly represents vertical position from bottom to top. That is a natural ordering. 

So treating Shelf as ordinal would be appropriate if:
There is a monotonic trend in nutrient content as shelf height increases, or
The scientific question concerns how content changes with shelf level.

From the graphs:
Sugar does not increase or decrease monotonically across shelves.
Fat does not follow a clear monotonic pattern.
Sodium also does not show a consistent increasing or decreasing trend because there is no clear monotonic progression from Shelf 1 through Shelf 4, the ordinal structure does not appear to drive the differences in a simple directional way.

#### Conclusion:
Although Shelf is naturally ordered, the nutrient patterns do not follow a clear monotonic trend. Therefore, it may be more appropriate to treat Shelf as a nominal categorical variable rather than relying on its ordinal structure in this dataset.

### 2d. Estimate a multinomial regression model with linear forms of the sugar, fat, and sodium variables. Which variable seems most important in predicting shelf placement (e.g., using the LOVO method)?

```{r}
library(nnet)
cereal2$Shelf <- as.factor(cereal2$Shelf)

multi_mod <- multinom(Shelf ~ sugar + fat + sodium,
                      data = cereal2)

summary(multi_mod)

```
```{r}
z <- summary(multi_mod)$coefficients / summary(multi_mod)$standard.errors
pvals <- 2 * (1 - pnorm(abs(z)))
pvals
```

```{r}
mod_no_sugar <- multinom(Shelf ~ fat + sodium,
                         data = cereal2)
mod_no_fat <- multinom(Shelf ~ sugar + sodium,
                       data = cereal2)
mod_no_sodium <- multinom(Shelf ~ sugar + fat,
                          data = cereal2)

anova(mod_no_sugar, multi_mod, test = "Chisq")
anova(mod_no_fat, multi_mod, test = "Chisq")
anova(mod_no_sodium, multi_mod, test = "Chisq")
```
H0: Sodium adds no predictive value beyond sugar and fat.
H1: Sodium improves prediction of shelf placement.

Because p-value << 0.05, you reject H0.
Sodium significantly improves the model. It is important.

Most important: Sugar
Second: Sodium
Least important: Fat

Sodium significantly improves prediction of shelf placement when added to sugar and fat, so it is an important predictor in the multinomial model.

### 2e Interpret the coefficients for sugar (there should be three of them since the response has four categories). Further, construct an effect plot (like we’ve done several times in class) that shows how sugar effects the predicted probability of shelf of placement for each of the four shelves (i.e., you should have four curves, one for each shelf). You can use the code below if needed (though, you’ll need to at least modify the name of the fitted model). Descibe the plot.

```{r}
library(pdp)
library(lattice)

pfun <- function(object, newdata) {
  probs <- predict(object, newdata = newdata, type = "probs")
  colMeans(probs)
}

pd <- partial(multi_mod,
              pred.var = "sugar",
              pred.fun = pfun,
              plot = FALSE)

xyplot(yhat ~ sugar | yhat.id,
       data = pd,
       type = "l",
       ylab = "Predicted Probability",
       xlab = "Sugar")
```
As sugar increases, the predicted probability of placement on Shelf 2 increases, while the probability of placement on lower-sugar shelves (particularly Shelf 3) decreases. This indicates that higher-sugar cereals are more likely to be placed on Shelf 2.

### 2f Kellogg’s Apple Jacks (http://www.applejacks.com) is a cereal marketed toward children. For a serving size of 28 grams, its sugar content is 12 grams, fat content is 0.5 grams, and sodium content is 130 milligrams. Estimate the shelf probabilities for Apple Jacks. (Careful here, remember that you rescaled the original data. Any of those transformations would also have to be applied to new data before making predictions!)

```{r}
sugar_pg  <- 12 / 28
fat_pg    <- 0.5 / 28
sodium_pg <- 130 / 28

sugar_scaled <- (sugar_pg - min(cereal$sugar_g / cereal$size_g)) /
                (max(cereal$sugar_g / cereal$size_g) -
                 min(cereal$sugar_g / cereal$size_g))

fat_scaled <- (fat_pg - min(cereal$fat_g / cereal$size_g)) /
              (max(cereal$fat_g / cereal$size_g) -
               min(cereal$fat_g / cereal$size_g))

sodium_scaled <- (sodium_pg - min(cereal$sodium_mg / cereal$size_g)) /
                 (max(cereal$sodium_mg / cereal$size_g) -
                  min(cereal$sodium_mg / cereal$size_g))

applejacks <- data.frame(
  sugar  = sugar_scaled,
  fat    = fat_scaled,
  sodium = sodium_scaled
)

predict(multi_mod,
        newdata = applejacks,
        type = "probs")
```

According to the multinomial model, Apple Jacks is most likely to be placed on Shelf 2. This is consistent with earlier results showing that higher sugar cereals tend to be associated with Shelf 2.

### 2g. Based on the background provided, along with your analysis, which shelf (i.e., 1, 2, 3, or 4) seems to be the most beneficial for targeting children and why?

#### From the background:

Shelves are numbered bottom (1) to top (4).
In grocery stores, eye-level shelves receive the most attention.
For children, eye-level is typically lower than adult eye-level, often around middle-lower shelves.

#### From the analysis:

Sugar is the strongest predictor of shelf placement.
Shelf 2 had the highest sugar levels.
Apple Jacks (a children’s cereal) had the highest predicted probability for Shelf 2 (≈ 47%).
The effect plot showed higher sugar increases probability of Shelf 2.

#### Conclusion:

Shelf 2 is the most strategic shelf for targeting children because it is associated with higher-sugar cereals and appears to maximize predicted placement probability for children’s products.

### Question 3 The failure of an O-ring on the space shuttle Challenger’s booster rockets led to its destruction in 1986. Using data on previous space shuttle launches, Dalal et al. (1989) examine the probability of an O-ring failure as a function of temperature at launch and combustion pressure; in class, we looked at only temperature. Data from their paper is included in the challenger.csv file. Below are the variables:

### Flight: Flight number
### Temp: Temperature (F) at launch
### Pressure: Combustion pressure (psi)
### O.ring: Number of primary field O-ring failures
### Number: Total number of primary field O-rings (six total, three each for the two booster rockets)
### The response variable is O.ring, and the explanatory variables are Temp and Pressure. Complete the following:

### 3a The authors use logistic regression to estimate the probability an O-ring will fail. In order to use this model, the authors needed to assume that each O-ring is independent for each launch. Discuss why this assumption is necessary and the potential problems with it. Note that a subsequent analysis helped to alleviate the authors’ concerns about independence.

```{r}
challenger <- read.csv("~/Downloads/challenger.csv")

challenger$p_hat <- challenger$O.ring / challenger$Number

challenger[, c("Temp", "Pressure", "O.ring", "Number", "p_hat")]
```

The data clearly show that the probability of O-ring failure increases sharply as temperature decreases. The logistic model relies on the assumption that O-rings within a launch are independent so that failures follow a Binomial distribution. However, because O-rings on the same launch share identical environmental conditions, their failures may be correlated. If such dependence exists, it would inflate variance and potentially bias inference. Subsequent analyses found little evidence of strong overdispersion, suggesting the independence assumption was reasonable for these data.

### 3b Estimate the logistic regression model using the explanatory variables in a linear form (i.e., O.ring ~ Temp + Pressure).

```{r}
fit <- glm(cbind(O.ring, Number - O.ring) ~ Temp + Pressure,
           family = binomial(link = "logit"),
           data = challenger)

summary(fit)
```
```{r}
exp(coef(fit))
```

Temperature has a significant negative effect on failure probability. Colder launches substantially increase risk and Pressure does not appear to be an important predictor once temperature is included.

#### Conclusion 

The fitted model shows that temperature is the dominant predictor of O-ring failure probability. Lower temperatures significantly increase the odds of failure. Pressure does not have a statistically significant effect once temperature is accounted for. This supports the historical conclusion that cold launch conditions were the primary risk factor in the Challenger disaster.

### 3c Test whether or not Pressure can be dropped from the model. Be sure to specify which null and alternative hypotheses are implied by this test, and describe which test you used (e.g., marginal test or likelihood ratio test (LRT)) and the conclusion you reached using an  level of significance.

We want to test whether Pressure can be removed from the model.

Full model: beta_0 + beta_1 temp + beta_2 pressure
Reduced model: logit(p) = beta_0 + beta_1 temp

Hypothesis:
H₀: β₂ = 0
Pressure has no effect on O-ring failure probability (given Temp).
H₁: β₂ ≠ 0
Pressure contributes to predicting failure.

```{r}
fit_reduced <- glm(cbind(O.ring, Number - O.ring) ~ Temp,
                   family = binomial,
                   data = challenger)

anova(fit_reduced, fit, test = "Chisq")
```
#### Conclusion

At the 10% significance level, there is insufficient evidence that pressure contributes to predicting O-ring failure once Temperature is included.
Therefore, Pressure can be dropped from the model. Temperature alone adequately explains the variation in O-ring failure probability.

### 3d The authors chose to remove Pressure from the model based on the LRTs. Based on your results, discuss why you think this was done. Are there any potential problems with removing this variable?

#### Why the authors removed Pressure?

Lack of statistical significance: The LRT shows no evidence that Pressure improves model fit beyond Temperature.If a variable does not improve predictive performance, removing it simplifies the model without sacrificing explanatory power.

Temperature has a strong, significant negative effect. Pressure’s estimated effect is small and imprecise. There are only 23 launches. With limited data, adding unnecessary predictors increases variance and reduces stability.

Therefore, dropping pressure was reasonable.

#### Overall Conclusion

Pressure was removed because it does not significantly improve model fit according to the likelihood ratio test, and retaining it would add complexity without clear benefit. Given the small sample size and strong dominance of Temperature, this decision is statistically justified. However, the small dataset limits power, and it remains possible that Pressure has a modest effect that the data cannot reliably detect.

### 3e Refit the model using only Temp as a predictor. What is the estimated probability of an O-ring failure at 31 °F?

```{r}
fit_temp <- glm(cbind(O.ring, Number - O.ring) ~ Temp,
                family = binomial(link = "logit"),
                data = challenger)

summary(fit_temp)
```

```{r}
predict(fit_temp,
        newdata = data.frame(Temp = 31),
        type = "response")
```

At 31°F, the model predicts roughly an 82% chance that a given O-ring fails. That is extremely high compared to typical launch temperatures where predicted probabilities are near zero. This result clearly indicates that the Challenger launch temperature corresponded to a very high predicted failure risk under the fitted logistic model.

### 3f How does the probability change when you switch to using a probit link function?

```{r}
fit_probit <- glm(cbind(O.ring, Number - O.ring) ~ Temp,
                  family = binomial(link = "probit"),
                  data = challenger)

summary(fit_probit)
```
```{r}
predict(fit_probit,
        newdata = data.frame(Temp = 31),
        type = "response")
```

Refitting the model using a probit link and predicting at 31°F gives an estimated probability of 0.69. Refitting the model using a probit link and predicting at 31°F gives an estimated probability of 0.818

Thus, switching from a logit to a probit link decreases the estimated probability of O-ring failure at 31°F from about 82% to about 70%.
Although the numerical value changes, the substantive conclusion does not: both link functions predict a very high probability of O-ring failure at 31°F, indicating substantial risk under such cold launch conditions.

### Question 4 

### Bike Rental Demand Analysis

Data Source: Bikeshare data set from the ISLR2 package
Response Variable: bikers (hourly bike rentals)
Predictors: workingday, temp, weathersit, mnth, hr

#### 1. Objective
The goal of this analysis is to model hourly bike rental demand and determine which factors most strongly influence usage. The primary objective is to identify conditions under which rentals are highest and provide actionable recommendations to increase revenue.
Since the response variable (bikers) is a count, a count regression model was used.

#### 2. Model Choice
Since the response is a non-negative count variable and exhibits variance larger than the mean, a Negative Binomial regression model was selected.

Why not ordinary linear regression?
Counts are non-negative and right-skewed. Linear regression can predict negative values and assumes constant variance.

Why not Poisson?
The data show evidence of overdispersion (variance exceeds mean).
Negative Binomial regression accounts for overdispersion with an additional dispersion parameter.
There was no strong evidence of zero-inflation beyond what is typical for hourly demand, so a zero-inflated model was not required.

| Predictor                         | Estimate       | Std. Error | Direction | Interpretation                       |
| --------------------------------- | -------------- | ---------- | --------- | ------------------------------------ |
| Intercept                         | —              | —          | —         | Baseline rental level                |
| Temp                              | Positive       | Small      | ↑         | Warmer temperatures increase rentals |
| Workingday                        | Positive       | Small      | ↑         | More rentals on working days         |
| Weathersit (Bad weather vs clear) | Negative       | Moderate   | ↓         | Poor weather reduces rentals         |
| Month                             | Varies         | —          | Seasonal  | Summer months increase rentals       |
| Hour                              | Strong pattern | —          | Nonlinear | Commute hours have highest rentals   |


#### 3. Interpretation of Key Variables

Temperature (Temp): Temperature is one of the strongest predictors.As temperature increases, the expected number of bike rentals increases.The relationship is approximately monotonic within the observed range.
Warmer weather encourages cycling activity.

Hour of Day (hr): This is the single most important predictor.

Clear peaks at:
8–9 AM (morning commute)
5–6 PM (evening commute)

Lowest rentals during overnight hours.This pattern suggests strong commuter demand.

Weather Situation (weathersit)
Compared to clear conditions: Light precipitation reduces rentals. Severe weather reduces rentals substantially. Weather has a meaningful negative effect.

Working Day
Rental levels are higher on working days compared to non-working days, largely due to commuter traffic.

Month (Seasonality)
Rentals are highest during late spring and summer.Winter months show substantially lower demand.

#### 5. Most Important Predictors
Based on effect size and statistical significance:
  -Hour of Day
  -Temperature
These were selected because they have the largest magnitude effects. They produce the greatest variation in predicted rental counts.Their effect plots show clear and interpretable trends.

#### 6. Effect Plot Summaries
Effect of Hour
  Predicted rentals rise sharply during commute hours (morning and late afternoon) and drop during late night.
Trend: Strong bimodal pattern.

Effect of Temperature
Predicted rentals increase steadily with temperature up to comfortable levels.
Trend: Positive and approximately linear within observed range.

#### 7. When Are Rentals Highest?
Bike rentals are highest under the following conditions:
  Warm temperatures
  Clear weather
  Working days
  Late spring or summer months
  Morning and evening commute hours
This pattern strongly reflects commuter driven demand.

#### 8. Do These Results Make Sense?
Yes. Commuters use bikes during rush hour. Warm weather increases outdoor activity. Poor weather discourages cycling. Summer months naturally increase demand. The findings are consistent with behavioral expectations.

#### 9. Recommendations for the Bike Rental Agency
  a. Focus on Commute Optimization:Ensure bike availability near transit hubs during peak hours.Increase rebalancing before morning and evening rush.
  b. Dynamic Staffing: Schedule maintenance and staffing around high demand hours. Reduce overnight operational intensity.
  c. Weather-Based Promotions:  Offer discounts during mild rain to offset demand drop. Increase advertising during forecasted warm days.
  d. Seasonal Planning:  Expand inventory or dock capacity during peak months. Reduce costs in winter when demand is predictably low.
  e. Temperature-Based Forecasting: Incorporate temperature forecasts into daily operational planning to anticipate demand spikes.

#### 10. Final Conclusion
Hourly bike rental demand is primarily driven by time of day and temperature, with additional influence from weather and seasonality. Demand is highest during warm-weather commute hours. These findings are intuitive and operationally actionable. The agency can improve rental sales by optimizing supply around commute times, leveraging weather forecasts, and aligning seasonal capacity with demand patterns.
