This analysis explores the factors that drive high-demand periods in a bike sharing system. I’ll fit a logistic regression model to the hourly rental data to understand what distinguishes peak usage periods from normal ones.
# Load the dataset
bike_sharing_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")
# Create our binary outcome variable: High Demand Periods
# We'll define "high demand" as periods where total rentals (cnt) exceed the 75th percentile
demand_threshold <- quantile(bike_sharing_data$cnt, 0.75)
bike_sharing_data$high_demand <- as.numeric(bike_sharing_data$cnt > demand_threshold)
# Display the distribution of our binary outcome
table(bike_sharing_data$high_demand)
##
## 0 1
## 13053 4326
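For context, high-demand hours make up about a quarter of the data. A quick check of the class balance (a minimal sketch; the proportions follow directly from the counts above):
# Proportion of periods in each class
# From the counts above: 13053/17379 ≈ 0.751 and 4326/17379 ≈ 0.249
prop.table(table(bike_sharing_data$high_demand))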
I chose to create a binary variable for “high demand” periods (1 = high demand, 0 = normal/low demand) because:

* It helps identify peak usage periods that require additional resource allocation
* It’s directly relevant to operational decision-making
* Understanding what drives high-demand periods can improve service quality
I selected three key explanatory variables:

1. Temperature (temp) - normalized temperature values
2. Hour of day (hr) - to capture daily patterns
3. Working day indicator (workingday) - to account for commuting patterns
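Before fitting the model, a quick sanity check on these columns confirms their scales (a minimal sketch; in the standard hour.csv schema, temp is normalized to [0, 1], hr runs 0-23, and workingday is a 0/1 indicator):
# Inspect the ranges of the three predictors
summary(bike_sharing_data[, c("temp", "hr", "workingday")])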
# Fit logistic regression model
model <- glm(high_demand ~ temp + hr + workingday,
             data = bike_sharing_data,
             family = binomial(link = "logit"))
# Get model summary
model_summary <- summary(model)
# Display model summary
print(model_summary)
##
## Call:
## glm(formula = high_demand ~ temp + hr + workingday, family = binomial(link = "logit"),
## data = bike_sharing_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.386229 0.081101 -54.08 < 2e-16 ***
## temp 4.458206 0.111067 40.14 < 2e-16 ***
## hr 0.080942 0.002969 27.26 < 2e-16 ***
## workingday -0.196880 0.041272 -4.77 1.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19504 on 17378 degrees of freedom
## Residual deviance: 16541 on 17375 degrees of freedom
## AIC: 16549
##
## Number of Fisher Scoring iterations: 5
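As a rough gauge of fit, the null and residual deviances above imply a McFadden pseudo-R-squared of about 0.15 (a supplementary sketch, not part of the original output):
# McFadden's pseudo-R-squared: 1 - residual deviance / null deviance
# With the deviances above: 1 - 16541/19504 ≈ 0.152
1 - model$deviance / model$null.deviance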
# Get coefficients table using broom for easier extraction
library(broom)
coef_table <- tidy(model)
print(coef_table)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.39 0.0811 -54.1 0
## 2 temp 4.46 0.111 40.1 0
## 3 hr 0.0809 0.00297 27.3 1.17e-163
## 4 workingday -0.197 0.0413 -4.77 1.84e- 6
To interpret each coefficient, first extract the estimates and convert them to odds ratios:
# Extract coefficients and standard errors
temp_coef <- coef_table$estimate[coef_table$term == "temp"]
temp_se <- coef_table$std.error[coef_table$term == "temp"]
hr_coef <- coef_table$estimate[coef_table$term == "hr"]
workingday_coef <- coef_table$estimate[coef_table$term == "workingday"]
# Calculate odds ratios
temp_odds <- exp(temp_coef)
hr_odds <- exp(hr_coef)
workingday_odds <- exp(workingday_coef)
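Printing the odds ratios makes the effects easier to read (the values below follow from the coefficient estimates above):
# Odds ratios implied by the estimates:
# exp(4.458) ≈ 86.3 (temp), exp(0.0809) ≈ 1.08 (hr), exp(-0.197) ≈ 0.82 (workingday)
round(c(temp = temp_odds, hr = hr_odds, workingday = workingday_odds), 3)
Holding the other predictors fixed, a one-unit increase in normalized temperature (i.e., moving across the full temperature range) multiplies the odds of a high-demand period by roughly 86; each later hour of the day multiplies the odds by about 1.08; and working days carry about 18% lower odds of high demand than non-working days.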
Let’s focus on the temperature coefficient, which shows by far the strongest effect, and quantify its uncertainty with a 95% confidence interval:
# Calculate 95% CI for temperature coefficient
temp_ci_lower <- temp_coef - 1.96 * temp_se
temp_ci_upper <- temp_coef + 1.96 * temp_se
# Convert to odds ratios for easier interpretation
temp_ci_odds_lower <- exp(temp_ci_lower)
temp_ci_odds_upper <- exp(temp_ci_upper)
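Printing the interval (from the estimate and standard error above, the odds-ratio bounds work out to roughly 69 and 107):
# 95% Wald CI for the temperature odds ratio
# exp(4.458 - 1.96 * 0.111) ≈ 69.4; exp(4.458 + 1.96 * 0.111) ≈ 107.3
c(lower = temp_ci_odds_lower, upper = temp_ci_odds_upper)
Because the entire interval lies well above 1, the positive association between temperature and high demand is estimated with considerable precision.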