Introduction

This analysis explores the factors behind high-demand periods in a bike sharing system. I’ll fit a logistic regression model to see which conditions make peak usage more likely.

Data Preparation

# Load the dataset
bike_sharing_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")

# Create our binary outcome variable: High Demand Periods
# We'll define "high demand" as periods where total rentals (cnt) exceed the 75th percentile
demand_threshold <- quantile(bike_sharing_data$cnt, 0.75)
bike_sharing_data$high_demand <- as.numeric(bike_sharing_data$cnt > demand_threshold)

# Display the distribution of our binary outcome
table(bike_sharing_data$high_demand)
## 
##     0     1 
## 13053  4326
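
As a quick check on the table above, the counts can be converted to proportions. Because the rule uses a strict greater-than comparison, slightly fewer than 25% of hours are flagged, which suggests some hours sit exactly at the threshold. This is a minimal sketch that hard-codes the two printed counts rather than reloading the data; the variable names are illustrative.

```r
# Counts copied from the table() output above (not recomputed from hour.csv)
demand_counts <- c(normal = 13053, high = 4326)

# Share of hours in each class
demand_share <- demand_counts / sum(demand_counts)
print(round(demand_share, 3))
```

About a quarter of the hours fall in the positive class, so the outcome is unbalanced but not rare.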

Binary Variable Selection Rationale

I chose to create a binary variable for “high demand” periods (1 = high demand, 0 = normal/low demand) because:

  • It helps identify peak usage periods that require additional resource allocation
  • It’s directly relevant to operational decision-making
  • Understanding what drives high-demand periods can improve service quality

Building the Logistic Regression Model

I selected three key explanatory variables:

  1. Temperature (temp) - normalized temperature values
  2. Hour of day (hr) - to capture daily patterns
  3. Working day indicator (workingday) - to account for commuting patterns

# Fit logistic regression model
model <- glm(high_demand ~ temp + hr + workingday, 
            data = bike_sharing_data, 
            family = binomial(link = "logit"))

# Get model summary
model_summary <- summary(model)

# Display model summary
print(model_summary)
## 
## Call:
## glm(formula = high_demand ~ temp + hr + workingday, family = binomial(link = "logit"), 
##     data = bike_sharing_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.386229   0.081101  -54.08  < 2e-16 ***
## temp         4.458206   0.111067   40.14  < 2e-16 ***
## hr           0.080942   0.002969   27.26  < 2e-16 ***
## workingday  -0.196880   0.041272   -4.77 1.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19504  on 17378  degrees of freedom
## Residual deviance: 16541  on 17375  degrees of freedom
## AIC: 16549
## 
## Number of Fisher Scoring iterations: 5

# Get coefficients table using broom for easier extraction
library(broom)  # provides tidy()
coef_table <- tidy(model)
print(coef_table)
## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -4.39     0.0811     -54.1  0        
## 2 temp          4.46     0.111       40.1  0        
## 3 hr            0.0809   0.00297     27.3  1.17e-163
## 4 workingday   -0.197    0.0413      -4.77 1.84e-  6
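
The deviance figures in the summary give a rough sense of overall fit. The sketch below hard-codes the reported null and residual deviances and derives a likelihood-ratio statistic, McFadden's pseudo-R-squared, and the AIC. The variable names are illustrative, and the pseudo-R-squared is an extra diagnostic that the analysis above does not compute.

```r
# Values copied from the summary() output above
null_deviance     <- 19504
residual_deviance <- 16541
n_predictors      <- 3   # temp, hr, workingday

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_statistic <- null_deviance - residual_deviance
lr_p_value   <- pchisq(lr_statistic, df = n_predictors, lower.tail = FALSE)

# McFadden's pseudo-R-squared; for a 0/1 outcome, deviance = -2 * log-likelihood
mcfadden_r2 <- 1 - residual_deviance / null_deviance

# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_check <- residual_deviance + 2 * (n_predictors + 1)

print(c(lr_statistic = lr_statistic,
        mcfadden_r2  = round(mcfadden_r2, 3),
        aic          = aic_check))
```

The reconstructed AIC matches the 16549 reported in the summary. A pseudo-R-squared of about 0.15 suggests the three predictors improve clearly on the intercept-only model while leaving much of the variation unexplained.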

Interpreting the Coefficients

Let’s interpret each coefficient:

# Extract coefficients and standard errors
temp_coef <- coef_table$estimate[coef_table$term == "temp"]
temp_se <- coef_table$std.error[coef_table$term == "temp"]
hr_coef <- coef_table$estimate[coef_table$term == "hr"]
workingday_coef <- coef_table$estimate[coef_table$term == "workingday"]

# Calculate odds ratios
temp_odds <- exp(temp_coef)
hr_odds <- exp(hr_coef)
workingday_odds <- exp(workingday_coef)

  1. Temperature (temp):
    • Coefficient: 4.458
    • Interpretation: Holding hour and working-day status constant, each one-unit increase in normalized temperature raises the log odds of high demand by 4.458
    • In terms of odds: exp(4.458) = 86.33, meaning the odds of high demand are multiplied by 86.33 per unit increase. Because temp is normalized to roughly a 0-to-1 scale, a full unit covers nearly its whole range, which is why the factor looks so large
  2. Hour (hr):
    • Coefficient: 0.081
    • Interpretation: Holding the other predictors constant, each hour later in the day raises the log odds of high demand by 0.081
    • In odds terms: exp(0.081) = 1.084, an 8.4% increase in the odds of high demand per hour. Since hr enters as a single linear term, the model assumes this increase is the same at every hour
  3. Working Day (workingday):
    • Coefficient: -0.197
    • Interpretation: Holding temperature and hour constant, the log odds of high demand are 0.197 lower on working days than on non-working days
    • In odds terms: exp(-0.197) = 0.821, meaning the odds of high demand are 17.9% lower on working days
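
A full unit of normalized temperature is a very large change, so it can help to express the same coefficients over more practical increments. The sketch below hard-codes the estimates from the model output. The increments (0.1 temperature units, 6 hours) and the helper function odds_ratio are illustrative choices, not part of the analysis above.

```r
# Estimates copied from the model summary above
coefs <- c(temp = 4.458206, hr = 0.080942, workingday = -0.196880)

# Odds ratio for a change of `delta` units in a predictor: exp(beta * delta)
odds_ratio <- function(beta, delta = 1) exp(beta * delta)

or_temp_tenth <- odds_ratio(coefs[["temp"]], delta = 0.1)  # 0.1 on the normalized scale
or_hr_six     <- odds_ratio(coefs[["hr"]], delta = 6)      # six hours later in the day
or_workingday <- odds_ratio(coefs[["workingday"]])         # working vs non-working day

print(round(c(temp_0.1   = or_temp_tenth,
              hr_6       = or_hr_six,
              workingday = or_workingday), 3))
```

On this scale, a 0.1 rise in normalized temperature multiplies the odds of high demand by roughly 1.56, which is easier to relate to day-to-day weather changes than the per-unit factor of 86.33.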

Confidence Interval Analysis

Let’s focus on the temperature coefficient, as it shows a strong effect:

# Calculate 95% CI for temperature coefficient
temp_ci_lower <- temp_coef - 1.96 * temp_se
temp_ci_upper <- temp_coef + 1.96 * temp_se

# Convert to odds ratios for easier interpretation
temp_ci_odds_lower <- exp(temp_ci_lower)
temp_ci_odds_upper <- exp(temp_ci_upper)

Interpretation of Confidence Interval:

  • We are 95% confident that the true temperature coefficient lies between 4.241 and 4.676 on the log-odds scale (a Wald interval: the estimate ± 1.96 standard errors)
  • In terms of odds ratios, we’re 95% confident that a one-unit increase in temperature multiplies the odds of high demand by a factor between 69.44 and 107.33
  • The confidence interval does not include 0 (or 1 for odds ratios), indicating a statistically significant effect
  • Even at the lower bound, temperature has a substantial positive effect on high demand probability
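
The same Wald calculation extends to the other two coefficients. The sketch below hard-codes the estimates and standard errors from the summary and uses qnorm(0.975) in place of the rounded 1.96. The data frame and its column names are illustrative.

```r
# Estimates and standard errors copied from the model summary above
wald <- data.frame(
  term      = c("temp", "hr", "workingday"),
  estimate  = c(4.458206, 0.080942, -0.196880),
  std_error = c(0.111067, 0.002969, 0.041272)
)

z_crit <- qnorm(0.975)  # about 1.96 for a 95% interval

# Interval on the log-odds scale, then exponentiated to the odds-ratio scale
wald$ci_lower <- wald$estimate - z_crit * wald$std_error
wald$ci_upper <- wald$estimate + z_crit * wald$std_error
wald$or_lower <- exp(wald$ci_lower)
wald$or_upper <- exp(wald$ci_upper)

print(wald, digits = 4)
```

With the fitted model in memory, confint.default(model) returns these Wald intervals directly, while confint(model) gives profile-likelihood intervals that can differ slightly.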

Insights and Significance

  1. Temperature Effect:
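    • Strongest predictor in the model by z-statistic (40.14, versus 27.26 for hr and -4.77 for workingday); the raw coefficients are not directly comparable because the predictors are on different scales
    • Strong positive relationship with the probability of high demand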
    • Effect remains significant even at the lower confidence bound
    • Crucial for capacity planning during warmer periods
  2. Time of Day Impact:
    • The positive hr coefficient indicates that the odds of high demand rise steadily from early morning to late evening
    • Helps identify peak usage periods
    • Useful for staff scheduling and bike redistribution
  3. Working Day Effect:
    • Working days have somewhat lower odds of high demand than non-working days once temperature and hour are accounted for
    • Important for resource allocation strategies
    • Suggests need for different management approaches on working vs non-working days
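
To connect these effects to planning decisions, the coefficients can be converted into predicted probabilities for specific conditions. The sketch below hard-codes the fitted coefficients. The two scenarios (a cool morning and a warm late afternoon, both on a working day) are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the model summary above
beta <- c(intercept = -4.386229, temp = 4.458206,
          hr = 0.080942, workingday = -0.196880)

# Hypothetical scenarios; temp is on the normalized scale used in the data
scenarios <- data.frame(
  label      = c("cool morning", "warm late afternoon"),
  temp       = c(0.3, 0.7),
  hr         = c(8, 17),
  workingday = c(1, 1)
)

# Linear predictor (log odds), then inverse logit to get a probability
log_odds <- beta[["intercept"]] +
  beta[["temp"]] * scenarios$temp +
  beta[["hr"]] * scenarios$hr +
  beta[["workingday"]] * scenarios$workingday
scenarios$p_high_demand <- plogis(log_odds)

print(scenarios)
```

With the fitted model in memory, predict(model, newdata = scenarios, type = "response") should give the same probabilities without retyping the coefficients.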

Further Questions to Investigate

  1. Interaction Effects:
    • How does temperature impact vary between working and non-working days?
    • Are there specific hours where temperature effects are stronger?
  2. Threshold Effects:
    • Is there an optimal temperature range for rentals?
    • Are there temperature thresholds where demand patterns change dramatically?
  3. Model Improvements:
    • Would including weather conditions improve predictions?
    • How does seasonality modify these relationships?
  4. Operational Implications:
    • How can these insights be translated into specific operational guidelines?
    • What thresholds should trigger changes in bike distribution?
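
The interaction and threshold questions could be explored by extending the model formula. The sketch below runs on simulated data (not the bike sharing dataset) with made-up coefficients, purely to show how such models might be specified and compared. Whatever it prints says nothing about real rental demand, and the object names are illustrative.

```r
set.seed(42)

# Simulated stand-in data; column names mirror the real dataset
n <- 2000
sim <- data.frame(
  temp       = runif(n),
  hr         = sample(0:23, n, replace = TRUE),
  workingday = rbinom(n, 1, 0.7)
)
eta <- -4 + 4 * sim$temp + 0.08 * sim$hr - 0.2 * sim$workingday
sim$high_demand <- rbinom(n, 1, plogis(eta))

# Baseline model, matching the structure used in this analysis
base_fit <- glm(high_demand ~ temp + hr + workingday,
                data = sim, family = binomial)

# Does the temperature effect differ between working and non-working days?
interaction_fit <- glm(high_demand ~ temp * workingday + hr,
                       data = sim, family = binomial)

# Is the temperature effect curved, e.g. levelling off at high values?
quadratic_fit <- glm(high_demand ~ temp + I(temp^2) + hr + workingday,
                     data = sim, family = binomial)

# Likelihood-ratio test for the interaction, and an AIC comparison
lr_test   <- anova(base_fit, interaction_fit, test = "Chisq")
aic_table <- AIC(base_fit, interaction_fit, quadratic_fit)

print(lr_test)
print(aic_table)
```

Applied to the real data, the same formulas would replace sim with bike_sharing_data. Treating hr as factor(hr) is another option for capturing non-linear daily patterns.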