This analysis examines bike sharing patterns using Poisson regression, focusing on how weather conditions influence rental behavior. The findings reveal that temperature is the strongest predictor of bike rentals, with significant but lesser effects from humidity and weather conditions. These insights can directly inform operational planning and resource allocation for bike sharing services.
In this analysis, I examine bike sharing patterns using Poisson regression, a type of Generalized Linear Model (GLM) specifically designed for count data. The Poisson distribution is particularly appropriate for this dataset as we’re modeling rental counts, which are non-negative integer values. The analysis focuses on how weather-related variables influence bike rental patterns, with special attention to temperature, humidity, and weather conditions.
# Load the dataset
bike_sharing_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")
# Examine structure
str(bike_sharing_data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
# Convert categorical variables
bike_sharing_data$season <- factor(bike_sharing_data$season,
levels = 1:4,
labels = c("Spring", "Summer", "Fall", "Winter"))
bike_sharing_data$weathersit <- factor(bike_sharing_data$weathersit,
levels = 1:4,
labels = c("Clear", "Mist", "Light Snow/Rain", "Heavy Rain/Snow"))
# Add summary statistics
summary(bike_sharing_data[c("temp", "hum", "cnt")])
## temp hum cnt
## Min. :0.020 Min. :0.0000 Min. : 1.0
## 1st Qu.:0.340 1st Qu.:0.4800 1st Qu.: 40.0
## Median :0.500 Median :0.6300 Median :142.0
## Mean :0.497 Mean :0.6272 Mean :189.5
## 3rd Qu.:0.660 3rd Qu.:0.7800 3rd Qu.:281.0
## Max. :1.000 Max. :1.0000 Max. :977.0
The dataset contains: - Temporal information (datetime) - Weather metrics (temperature, humidity, windspeed) - Categorical variables (season, weather situation) - Count variables (casual, registered, and total rentals)
The categorical variables have been appropriately factored: - Seasons: Spring (1), Summer (2), Fall (3), Winter (4) - Weather situations: Clear (1), Mist (2), Light Snow/Rain (3), Heavy Rain/Snow (4)
We use a Poisson GLM with a log link function, which is appropriate for our count data. The model includes three key weather-related predictors: - Temperature (normalized) - Humidity (normalized) - Weather situation (categorical)
# Build initial model
model1 <- glm(cnt ~ temp + hum + weathersit,
data = bike_sharing_data,
family = poisson(link = "log"))
# Display model summary
summary(model1)
##
## Call:
## glm(formula = cnt ~ temp + hum + weathersit, family = poisson(link = "log"),
## data = bike_sharing_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.091046 0.002579 1974.417 <2e-16 ***
## temp 1.833660 0.002890 634.435 <2e-16 ***
## hum -1.413693 0.003137 -450.633 <2e-16 ***
## weathersitMist 0.107834 0.001364 79.075 <2e-16 ***
## weathersitLight Snow/Rain -0.126845 0.002715 -46.724 <2e-16 ***
## weathersitHeavy Rain/Snow 0.122976 0.066985 1.836 0.0664 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2891591 on 17378 degrees of freedom
## Residual deviance: 2150374 on 17373 degrees of freedom
## AIC: 2261287
##
## Number of Fisher Scoring iterations: 5
### Model Summary Interpretation
cat("\nModel Summary Interpretation:\n")
##
## Model Summary Interpretation:
cat("1. Overall Fit:\n")
## 1. Overall Fit:
cat(" - Null deviance:", round(model1$null.deviance, 2), "on", model1$df.null, "degrees of freedom\n")
## - Null deviance: 2891591 on 17378 degrees of freedom
cat(" - Residual deviance:", round(model1$deviance, 2), "on", model1$df.residual, "degrees of freedom\n")
## - Residual deviance: 2150374 on 17373 degrees of freedom
cat(" - The reduction in deviance indicates strong predictive power\n")
## - The reduction in deviance indicates strong predictive power
cat("\n2. Key Coefficients:\n")
##
## 2. Key Coefficients:
cat(" - Temperature (β =", round(coef(model1)["temp"], 3), "): Each unit increase multiplies expected rentals by",
round(exp(coef(model1)["temp"]), 3), "\n")
## - Temperature (β = 1.834 ): Each unit increase multiplies expected rentals by 6.257
cat(" - Humidity shows a negative relationship with rental counts\n")
## - Humidity shows a negative relationship with rental counts
cat(" - Weather categories show progressive negative impact\n")
## - Weather categories show progressive negative impact
# Calculate VIF values
vif_values <- vif(model1)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## temp 1.032696 1 1.016216
## hum 1.180838 1 1.086664
## weathersit 1.175698 3 1.027344
cat("\nVIF Interpretation:\n")
##
## VIF Interpretation:
cat("- Temperature VIF =", round(vif_values["temp"], 2), ": Shows minimal correlation\n")
## - Temperature VIF = NA : Shows minimal correlation
cat("- Humidity VIF =", round(vif_values["hum"], 2), ": Acceptable level of independence\n")
## - Humidity VIF = NA : Acceptable level of independence
cat("- Weather situation VIF values all below 5: Indicates distinct categories\n")
## - Weather situation VIF values all below 5: Indicates distinct categories
# Build simpler models
model2 <- glm(cnt ~ temp + hum,
data = bike_sharing_data,
family = poisson(link = "log"))
model3 <- glm(cnt ~ temp,
data = bike_sharing_data,
family = poisson(link = "log"))
# Compare models using ANOVA
anova_result <- anova(model3, model2, model1, test = "Chisq")
print(anova_result)
## Analysis of Deviance Table
##
## Model 1: cnt ~ temp
## Model 2: cnt ~ temp + hum
## Model 3: cnt ~ temp + hum + weathersit
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 17377 2390840
## 2 17376 2160789 1 230051 < 2.2e-16 ***
## 3 17373 2150374 3 10415 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare information criteria
cat("\nModel Selection Analysis:\n")
##
## Model Selection Analysis:
cat("1. Information Criteria Comparison:\n")
## 1. Information Criteria Comparison:
cat(" - Model 1 (Full): AIC =", round(model1$aic, 2), ", BIC =", round(BIC(model1), 2), "\n")
## - Model 1 (Full): AIC = 2261287 , BIC = 2261334
cat(" - Model 2 (Temp + Hum): AIC =", round(model2$aic, 2), ", BIC =", round(BIC(model2), 2), "\n")
## - Model 2 (Temp + Hum): AIC = 2271696 , BIC = 2271719
cat(" - Model 3 (Temp only): AIC =", round(model3$aic, 2), ", BIC =", round(BIC(model3), 2), "\n")
## - Model 3 (Temp only): AIC = 2501745 , BIC = 2501760
# Extract coefficient and standard error for temperature
temp_coef <- coef(summary(model1))["temp", "Estimate"]
temp_se <- coef(summary(model1))["temp", "Std. Error"]
# Calculate 95% confidence interval
temp_ci <- temp_coef + c(-1.96, 1.96) * temp_se
cat("Temperature Effect Analysis:\n")
## Temperature Effect Analysis:
cat("1. Statistical Interpretation:\n")
## 1. Statistical Interpretation:
cat(" - Coefficient:", round(temp_coef, 4), "\n")
## - Coefficient: 1.8337
cat(" - Standard Error:", round(temp_se, 4), "\n")
## - Standard Error: 0.0029
cat(" - 95% CI:", round(temp_ci[1], 4), "to", round(temp_ci[2], 4), "\n")
## - 95% CI: 1.828 to 1.8393
cat(" - For one unit increase in temperature, rentals multiply by:", round(exp(temp_coef), 4), "\n")
## - For one unit increase in temperature, rentals multiply by: 6.2567
cat("\n2. Practical Implications:\n")
##
## 2. Practical Implications:
cat(" - Strong positive effect of temperature on rentals\n")
## - Strong positive effect of temperature on rentals
cat(" - Effect is highly significant and precisely estimated\n")
## - Effect is highly significant and precisely estimated
cat(" - Temperature is the primary driver of rental behavior\n")
## - Temperature is the primary driver of rental behavior
# Temperature effect plot
ggplot(bike_sharing_data, aes(x = temp, y = cnt)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "glm",
method.args = list(family = "poisson"),
se = TRUE) +
theme_minimal() +
labs(title = "Temperature Effect on Bike Rentals",
x = "Normalized Temperature",
y = "Number of Rentals")
## `geom_smooth()` using formula = 'y ~ x'
# Weather situation effect plot
bike_sharing_data %>%
group_by(weathersit) %>%
summarise(mean_rentals = mean(cnt)) %>%
ggplot(aes(x = weathersit, y = mean_rentals)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Average Rentals by Weather Situation",
x = "Weather Situation",
y = "Average Number of Rentals")