Last week, we built a simple linear regression model examining the relationship between temperature and bike rentals:
# Load the dataset
bike_sharing_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")
# Display the first few rows
head(bike_sharing_data)
# Previous simple linear regression model
base_model <- lm(cnt ~ temp, data = bike_sharing_data)
# Visualize base relationship
base_plot <- ggplot(bike_sharing_data, aes(x = temp, y = cnt)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Base Model: Temperature vs Bike Rentals",
x = "Temperature (normalized)",
y = "Number of Rentals") +
theme_minimal()
# Display base model summary and plot
print("Base Model Summary:")
## [1] "Base Model Summary:"
summary(base_model)
##
## Call:
## lm(formula = cnt ~ temp, data = bike_sharing_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.37 -110.23 -32.86 76.77 744.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0356 3.4827 -0.01 0.992
## temp 381.2949 6.5344 58.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 165.9 on 17377 degrees of freedom
## Multiple R-squared: 0.1638, Adjusted R-squared: 0.1638
## F-statistic: 3405 on 1 and 17377 DF, p-value: < 2.2e-16
base_plot
# Create rush hour binary variable
bike_sharing_data$rush_hour <- ifelse(bike_sharing_data$hr %in% c(7,8,9,16,17,18), 1, 0)
# Analyze rush hour impact
rush_stats <- bike_sharing_data %>%
group_by(rush_hour) %>%
summarise(
mean_rentals = mean(cnt),
sd_rentals = sd(cnt),
n = n()
)
# Create rush hour model
rush_model <- lm(cnt ~ temp + rush_hour, data = bike_sharing_data)
# Calculate correlation
rush_cor <- cor(bike_sharing_data$rush_hour, bike_sharing_data$temp)
# Visualize rush hour effect
rush_plot <- ggplot(bike_sharing_data, aes(x = temp, y = cnt, color = factor(rush_hour))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
labs(title = "Temperature vs Rentals by Rush Hour",
x = "Temperature (normalized)",
y = "Number of Rentals",
color = "Rush Hour") +
theme_minimal()
# Display results
print("Rush Hour Analysis:")
## [1] "Rush Hour Analysis:"
print(rush_stats)
## # A tibble: 2 × 4
## rush_hour mean_rentals sd_rentals n
## <dbl> <dbl> <dbl> <int>
## 1 0 142. 141. 13010
## 2 1 332. 212. 4369
print("Correlation with temperature:")
## [1] "Correlation with temperature:"
print(rush_cor)
## [1] 0.0251314
rush_plot
Explanation for usage of Rush Hour: - Clear theoretical basis: Commuting patterns affect bike usage - Statistical evidence: - Mean difference between rush/non-rush: 189.94 rentals - Low correlation with temperature: r = 0.025 - Significant improvement in R²: 0.197 - Decision: Include due to strong theoretical basis and statistical support
# Analyze weather impact
weather_stats <- bike_sharing_data %>%
group_by(weathersit) %>%
summarise(
mean_rentals = mean(cnt),
sd_rentals = sd(cnt),
n = n()
)
# Create weather model
weather_model <- lm(cnt ~ temp + weathersit, data = bike_sharing_data)
# Check multicollinearity
weather_vif <- vif(weather_model)
# Visualize weather effect
weather_plot <- ggplot(bike_sharing_data, aes(x = temp, y = cnt, color = factor(weathersit))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
labs(title = "Temperature vs Rentals by Weather",
x = "Temperature (normalized)",
y = "Number of Rentals",
color = "Weather Situation") +
theme_minimal()
# Display results
print("Weather Analysis:")
## [1] "Weather Analysis:"
print(weather_stats)
## # A tibble: 4 × 4
## weathersit mean_rentals sd_rentals n
## <int> <dbl> <dbl> <int>
## 1 1 205. 189. 11413
## 2 2 175. 165. 4544
## 3 3 112. 134. 1419
## 4 4 74.3 77.9 3
print("VIF Values:")
## [1] "VIF Values:"
print(weather_vif)
## temp weathersit
## 1.010647 1.010647
weather_plot
Explanation for usage of Weather: - Strong theoretical basis: Weather directly affects biking decisions - Statistical evidence: - Significant variation across weather conditions - Acceptable VIF values (< 5) - Improves R² by 0.01 - Decision: Include based on strong predictive power and theoretical importance
# Create interaction model
interaction_model <- lm(cnt ~ temp * weathersit, data = bike_sharing_data)
# Check multicollinearity
vif_interaction <- vif(interaction_model, type = "predictor")
# Compare models
anova(weather_model, interaction_model)
print("Interaction Model VIF:")
## [1] "Interaction Model VIF:"
print(vif_interaction)
## GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
## temp 1 3 1 weathersit --
## weathersit 1 3 1 temp --
Explanation for usage of Interaction Term: - Theoretical basis: Weather effects might vary with temperature - Statistical evidence: - High VIF values indicate problematic multicollinearity - Minimal improvement in model fit - Decision: Exclude due to multicollinearity and minimal benefit
# Build final model with selected variables
final_model <- lm(cnt ~ temp + rush_hour + weathersit, data = bike_sharing_data)
summary(final_model)
##
## Call:
## lm(formula = cnt ~ temp + rush_hour + weathersit, data = bike_sharing_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416.69 -94.20 -16.38 68.45 607.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.403 4.100 2.05 0.0404 *
## temp 360.011 5.689 63.28 <2e-16 ***
## rush_hour 186.711 2.512 74.32 <2e-16 ***
## weathersit -31.432 1.713 -18.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 143.6 on 17375 degrees of freedom
## Multiple R-squared: 0.3733, Adjusted R-squared: 0.3732
## F-statistic: 3450 on 3 and 17375 DF, p-value: < 2.2e-16
vif_final <- vif(final_model, type = "predictor")
print("Final Model VIF:")
## [1] "Final Model VIF:"
print(vif_final)
## [1] 1.011386 1.001020 1.011039
# Create all diagnostic plots
par(mfrow = c(3,2))
plot(final_model, which = 1:5)
# 1. Linearity Assessment
residual_pattern <- data.frame(
fitted = fitted(final_model),
residuals = residuals(final_model)
)
# Visualize residual pattern
ggplot(residual_pattern, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess") +
labs(title = "Residuals vs Fitted with Trend",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
# 2. Normality Tests
# Take a random sample of 5000 residuals for Shapiro-Wilk test
set.seed(123) # for reproducibility
sampled_residuals <- sample(residuals(final_model), min(5000, length(residuals(final_model))))
normality_test <- shapiro.test(sampled_residuals)
print("Normality Test Results (on sample):")
## [1] "Normality Test Results (on sample):"
print(normality_test)
##
## Shapiro-Wilk normality test
##
## data: sampled_residuals
## W = 0.95633, p-value < 2.2e-16
# 3. Heteroscedasticity Test
bp_test <- bptest(final_model)
print("Heteroscedasticity Test Results:")
## [1] "Heteroscedasticity Test Results:"
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: final_model
## BP = 1691.1, df = 3, p-value < 2.2e-16
# 4. Influential Points
# Calculate Cook's distances
cooks_d <- cooks.distance(final_model)
influential_points <- which(cooks_d > 4/length(cooks_d))
print("Number of influential points (Cook's D > 4/n):")
## [1] "Number of influential points (Cook's D > 4/n):"
print(length(influential_points))
## [1] 1160
# Summary of influential observations
influential_summary <- data.frame(
observation = influential_points,
cooks_distance = cooks_d[influential_points]
)
print("Most influential observations:")
## [1] "Most influential observations:"
print(head(influential_summary[order(-influential_summary$cooks_distance),]))
## observation cooks_distance
## 13766 13766 0.002312720
## 15229 15229 0.002181225
## 13238 13238 0.001747619
## 14077 14077 0.001659436
## 12422 12422 0.001580814
## 14269 14269 0.001459551
# Residuals vs Temperature
ggplot(bike_sharing_data, aes(x = temp, y = residuals(final_model))) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", color = "red") +
labs(title = "Residuals vs Temperature",
x = "Temperature",
y = "Residuals") +
theme_minimal()
Residuals vs Temperature - The residuals show a
slight curved pattern rather than random scatter around zero. - This
indicates that the relationship between temperature and bike rentals may
not be fully linear. - Although the current model assumes linearity,
this pattern suggests that adding a polynomial term (e.g.,
temp^2) or using a more flexible model could improve
fit.
# Select numeric variables
numeric_data <- bike_sharing_data %>%
select(cnt, temp, hum, windspeed, rush_hour)
cor_matrix <- cor(numeric_data)
ggcorrplot(cor_matrix, lab = TRUE, title = "Correlation Heatmap")
Correlation Heatmap - Temperature has a moderate positive correlation with bike rentals, which matches the regression results. - Humidity shows a moderate negative correlation with bike rentals. - Windspeed has a weak relationship with the outcome. - The predictors do not appear highly correlated with each other, which supports the VIF results and suggests multicollinearity is not a major issue.
# Histogram of residuals
ggplot(data.frame(residuals = residuals(final_model)), aes(x = residuals)) +
geom_histogram(bins = 40, fill = "skyblue", color = "black") +
labs(title = "Histogram of Residuals",
x = "Residuals",
y = "Frequency") +
theme_minimal()
Histogram of Residuals - The residuals are roughly bell-shaped, but with noticeable skew and heavier tails. - This is consistent with the Q-Q plot, which also showed some deviation from normality. - Because the dataset is large, this is not a major problem for the overall model fit, but it does show that the residuals are not perfectly normal.
# Calculate model performance metrics
r2 <- summary(final_model)$r.squared
adj_r2 <- summary(final_model)$adj.r.squared
# Effect sizes
standardized_coef <- coef(lm(scale(cnt) ~ scale(temp) + rush_hour + weathersit,
data = bike_sharing_data))
# Create performance summary
cat("Model Performance Metrics:\n")
## Model Performance Metrics:
cat("R-squared:", round(r2, 3), "\n")
## R-squared: 0.373
cat("Adjusted R-squared:", round(adj_r2, 3), "\n")
## Adjusted R-squared: 0.373
The final model successfully incorporates key predictors while maintaining interpretability. Despite some assumption violations, it provides valuable insights for operational planning. Future work should focus on addressing non-linearity and heteroscedasticity.