Week 10 | Data Dive

Introduction

This analysis explores factors influencing high-demand periods in a bike sharing system. I’ll build a logistic regression model to understand what drives peak usage periods.

Data Preparation

# Load the dataset
bike_sharing_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")

# Create our binary outcome variable: High Demand Periods
# We'll define "high demand" as periods where total rentals (cnt) exceed the 75th percentile
demand_threshold <- quantile(bike_sharing_data$cnt, 0.75)
bike_sharing_data$high_demand <- as.numeric(bike_sharing_data$cnt > demand_threshold)

# Display the distribution of our binary outcome
table(bike_sharing_data$high_demand)

## 
##     0     1 
## 13053  4326
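
To make the thresholding step concrete, here is a minimal sketch on a small made-up vector of counts (the values and the `toy_` names are hypothetical, not taken from hour.csv). It illustrates why a strict `>` comparison against the 75th percentile flags at most about a quarter of the observations, consistent with the 4326 of 17379 rows (about 24.9%) shown above.

```r
# Hypothetical hourly counts, for illustration only (not from hour.csv)
toy_cnt <- c(5, 12, 40, 88, 120, 150, 210, 305)

# Same rule as above: strictly greater than the 75th percentile
toy_threshold <- quantile(toy_cnt, 0.75)
toy_high <- as.numeric(toy_cnt > toy_threshold)

toy_threshold     # the cut-off value
table(toy_high)   # counts of 0s and 1s
mean(toy_high)    # share of observations flagged as high demand
```

Rows whose count equals the threshold exactly are coded 0, which is why the flagged share can fall slightly below 25% when the data contain ties.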

Binary Variable Selection Rationale

I chose to create a binary variable for “high demand” periods (1 = high demand, 0 = normal/low demand) because:
- It helps identify peak usage periods that require additional resource allocation
- It’s directly relevant to operational decision-making
- Understanding what drives high-demand periods can improve service quality

Building the Logistic Regression Model

I selected three key explanatory variables:
1. Temperature (temp) - normalized temperature values
2. Hour of day (hr) - to capture daily patterns
3. Working day indicator (workingday) - to account for commuting patterns

# Fit logistic regression model
model <- glm(high_demand ~ temp + hr + workingday, 
            data = bike_sharing_data, 
            family = binomial(link = "logit"))

# Get model summary
model_summary <- summary(model)

# Display model summary
print(model_summary)

## 
## Call:
## glm(formula = high_demand ~ temp + hr + workingday, family = binomial(link = "logit"), 
##     data = bike_sharing_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.386229   0.081101  -54.08  < 2e-16 ***
## temp         4.458206   0.111067   40.14  < 2e-16 ***
## hr           0.080942   0.002969   27.26  < 2e-16 ***
## workingday  -0.196880   0.041272   -4.77 1.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19504  on 17378  degrees of freedom
## Residual deviance: 16541  on 17375  degrees of freedom
## AIC: 16549
## 
## Number of Fisher Scoring iterations: 5

# Get coefficients table using broom for easier extraction
library(broom)  # provides tidy()
coef_table <- tidy(model)
print(coef_table)

## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -4.39     0.0811     -54.1  0        
## 2 temp          4.46     0.111       40.1  0        
## 3 hr            0.0809   0.00297     27.3  1.17e-163
## 4 workingday   -0.197    0.0413      -4.77 1.84e-  6
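
As a rough illustration of how these coefficients turn into a predicted probability, the sketch below plugs them into the logistic formula by hand. The coefficients are copied from the summary above; the scenario (temp = 0.7, hr = 17, working day) is a hypothetical example chosen for illustration.

```r
# Coefficients copied from the model summary above
b <- c(intercept = -4.386229, temp = 4.458206, hr = 0.080942, workingday = -0.196880)

# Hypothetical scenario: warm hour (temp = 0.7), 5 pm, working day
# The leading 1 multiplies the intercept
x <- c(1, 0.7, 17, 1)

eta <- sum(b * x)      # linear predictor (log odds)
p_hat <- plogis(eta)   # inverse logit: 1 / (1 + exp(-eta))

eta
p_hat
```

With the fitted object available, `predict(model, newdata = data.frame(temp = 0.7, hr = 17, workingday = 1), type = "response")` should give the same probability up to rounding of the coefficients.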

Interpreting the Coefficients

Let’s interpret each coefficient:

# Extract coefficients and standard errors
temp_coef <- coef_table$estimate[coef_table$term == "temp"]
temp_se <- coef_table$std.error[coef_table$term == "temp"]
hr_coef <- coef_table$estimate[coef_table$term == "hr"]
workingday_coef <- coef_table$estimate[coef_table$term == "workingday"]

# Calculate odds ratios
temp_odds <- exp(temp_coef)
hr_odds <- exp(hr_coef)
workingday_odds <- exp(workingday_coef)

Temperature (temp):
- Coefficient: 4.458
- Interpretation: Holding hr and workingday constant, each one-unit increase in normalized temperature raises the log odds of high demand by 4.458
- In terms of odds: exp(4.458) = 86.33, so the odds of high demand are multiplied by 86.33 per unit increase. Because temp is normalized to a 0-1 scale in this dataset, one unit spans the full temperature range; a 0.1 increase multiplies the odds by exp(0.4458) = 1.56
Hour (hr):
- Coefficient: 0.081
- Interpretation: Holding the other predictors constant, each hour later in the day raises the log odds of high demand by 0.081
- In odds terms: exp(0.081) = 1.084, an 8.4% increase in the odds of high demand for each additional hour
Working Day:
- Coefficient: -0.197
- Interpretation: Holding temp and hr constant, the log odds of high demand are 0.197 lower on working days than on non-working days
- In odds terms: exp(-0.197) = 0.821, meaning the odds of high demand are 17.9% lower on working days
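
The conversions from coefficient to percent change in odds can be wrapped in a small helper. This is only a sketch: `pct_change_odds` is a hypothetical function introduced here for illustration, and the coefficients are copied from the model summary rather than extracted from `coef_table`.

```r
# Coefficients copied from the model summary above
temp_coef <- 4.458206
hr_coef <- 0.080942
workingday_coef <- -0.196880

# Percent change in odds for an increase of `delta` units in a predictor
# (hypothetical helper, not part of the original analysis)
pct_change_odds <- function(coef, delta = 1) {
  (exp(coef * delta) - 1) * 100
}

pct_change_odds(temp_coef, delta = 0.1)   # 0.1 step on the normalized temp scale
pct_change_odds(hr_coef)                  # one hour later
pct_change_odds(workingday_coef)          # working day vs non-working day
```

The `delta` argument is useful for temp in particular, since a full one-unit change on a normalized scale is rarely the comparison of practical interest.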

Confidence Interval Analysis

Let’s focus on the temperature coefficient, as it shows a strong effect:

# Calculate 95% CI for temperature coefficient
temp_ci_lower <- temp_coef - 1.96 * temp_se
temp_ci_upper <- temp_coef + 1.96 * temp_se

# Convert to odds ratios for easier interpretation
temp_ci_odds_lower <- exp(temp_ci_lower)
temp_ci_odds_upper <- exp(temp_ci_upper)

Interpretation of Confidence Interval:

- We are 95% confident that the true temperature coefficient lies between 4.241 and 4.676 (in log odds)
- In terms of odds ratios, we’re 95% confident that a one-unit increase in temperature multiplies the odds of high demand by a factor between 69.44 and 107.33
- The confidence interval does not include 0 (or 1 for odds ratios), indicating a statistically significant effect
- Even at the lower bound, temperature has a substantial positive effect on the probability of high demand
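
The manual calculation above is a Wald interval (estimate plus or minus 1.96 standard errors). As a cross-check, base R's `confint.default()` computes the same kind of interval directly from a fitted model. The sketch below uses simulated data, not the bike sharing data, so the numbers will differ from those reported here; `sim` and `fit` are hypothetical names.

```r
set.seed(42)

# Simulated data, for illustration only (not the bike sharing data)
n <- 2000
sim <- data.frame(temp = runif(n))
sim$high_demand <- rbinom(n, 1, plogis(-3 + 4 * sim$temp))

fit <- glm(high_demand ~ temp, data = sim, family = binomial(link = "logit"))

# Manual Wald interval, same formula as above (estimate +/- 1.96 * SE)
est <- coef(summary(fit))["temp", "Estimate"]
se  <- coef(summary(fit))["temp", "Std. Error"]
manual_ci <- c(est - 1.96 * se, est + 1.96 * se)

# Built-in Wald interval; uses qnorm(0.975) rather than the rounded 1.96
wald_ci <- confint.default(fit, "temp", level = 0.95)

manual_ci
wald_ci
exp(wald_ci)   # the same interval on the odds-ratio scale
```

Calling `confint()` on a glm returns profile-likelihood intervals instead, which can differ slightly from the Wald interval, especially in small samples.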

Insights and Significance

Temperature Effect:
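- Strongest predictor in the model by z-statistic (40.14); raw coefficients are not directly comparable because the predictors are on different scales
- Strong positive relationship with the probability of high demand
- Effect remains large even at the lower confidence bound (odds ratio of 69.44)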
- Crucial for capacity planning during warmer periods
Time of Day Impact:
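- Odds of high demand rise by about 8.4% per hour, so high demand becomes more likely later in the day
- A single linear hr term cannot represent separate morning and evening peaks, so it captures an overall trend rather than specific peak hours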
- Useful for staff scheduling and bike redistribution
Working Day Effect:
- Odds of high demand are about 17.9% lower on working days, after accounting for temperature and hour
- Important for resource allocation strategies
- Suggests need for different management approaches on working vs non-working days
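
One way to check whether a linear hr term is adequate is to compare it against a model that treats hour as a categorical predictor. The sketch below does this on simulated data with artificial commute-hour peaks (hours 8, 17, and 18 are assumptions made for the simulation, not findings from the bike sharing data).

```r
set.seed(1)

# Simulated data with commute-hour peaks, for illustration only
n <- 5000
sim <- data.frame(hr = sample(0:23, n, replace = TRUE))
peak <- sim$hr %in% c(8, 17, 18)
sim$high_demand <- rbinom(n, 1, ifelse(peak, 0.7, 0.15))

# hr as a single linear term (as in the model above)
fit_linear <- glm(high_demand ~ hr, data = sim, family = binomial)

# hr as a categorical predictor: one coefficient per hour
fit_factor <- glm(high_demand ~ factor(hr), data = sim, family = binomial)

AIC(fit_linear, fit_factor)   # lower AIC indicates the better trade-off
```

On the real data, the same comparison would use `high_demand ~ temp + factor(hr) + workingday`; whether it improves on the linear term there is an open question that this simulation does not answer.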

Further Questions to Investigate

Interaction Effects:
- How does the impact of temperature vary between working and non-working days?
- Are there specific hours where temperature effects are stronger?
Threshold Effects:
- Is there an optimal temperature range for rentals?
- Are there temperature thresholds where demand patterns change dramatically?
Model Improvements:
- Would including weather conditions improve predictions?
- How does seasonality modify these relationships?
Operational Implications:
- How can these insights be translated into specific operational guidelines?
- What thresholds should trigger changes in bike distribution?
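
The interaction question above could be approached with a likelihood ratio test between a main-effects model and one that adds a temp-by-workingday interaction. The sketch below runs on simulated data in which the temperature effect is deliberately made weaker on working days; the effect sizes and the names `sim`, `fit_main`, and `fit_int` are assumptions for illustration, not results from the bike sharing data.

```r
set.seed(7)

# Simulated data, for illustration only: the temperature effect is
# deliberately made weaker on working days
n <- 5000
sim <- data.frame(
  temp = runif(n),
  workingday = rbinom(n, 1, 0.7)
)
eta <- -3 + 5 * sim$temp - 0.2 * sim$workingday - 2 * sim$temp * sim$workingday
sim$high_demand <- rbinom(n, 1, plogis(eta))

fit_main <- glm(high_demand ~ temp + workingday, data = sim, family = binomial)
fit_int  <- glm(high_demand ~ temp * workingday, data = sim, family = binomial)

# Likelihood ratio test: does the interaction term improve the fit?
lrt <- anova(fit_main, fit_int, test = "Chisq")
lrt
```

A small p-value in the second row of the table would suggest that the temperature effect differs between working and non-working days. Applied to the real data, the formula would become `high_demand ~ temp * workingday + hr`.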

Week 10 | Data Dive — GLMs

Aniket Shirsat

2024-10-31