In this week’s data dive, I build upon the previous data dive where I explored the relation between temperature and rented bikes per hour in the Seoul Bikeshare data set. I will increase the complexity of the model by adding the following variables:
Then, I will evaluate the model using diagnostic plots.
The original regression model showed the relationship between temperature and rented bikes at 8pm.I will use the R^2 value to compare models, so first I will attain the R^2 value of temperature and bikes rented per hour.
model <- lm(rented_bikes ~ temp_c + temp_c,data=df)
summary(model)$r.squared
## [1] 0.3166765
The R^2 value of 0.32 means that 32% of the variability in the data can be explained by temperature.
In this iteration, I will use hour of the day as an interaction term with temperature. Using hour may be problematic due to the cyclical nature and the fluctuations of demand throughout the day based on time. First, let’s examine demand throughout the day.
df |>
group_by(hour) |>
summarize(mean_bikes_rented = mean(rented_bikes))|>
ggplot(aes(x=hour,y=mean_bikes_rented))+
geom_line() +
labs(title = "Hourly Bike Rentals",
x = "Hour",
y = "Bike Rentals")
Instead of holding hour constant, I converted hour to a categorical variable and added it as a predictor variable that is an interaction term with temperature.
#Make a categorical column for time of day
df <- df |>
mutate(hour_group = case_when(
hour >= 0 & hour < 6 ~ "Night",
hour >= 6 & hour < 12 ~ "Morning",
hour >= 12 & hour < 18 ~ "Afternoon",
hour >= 18 & hour < 24 ~ "Evening"
)) |>
mutate(hour_group = factor(hour_group, levels = c("Night", "Morning", "Afternoon", "Evening")))
model <- lm(rented_bikes ~ temp_c + hour_group + temp_c*hour_group,data=df)
tidy(model,conf.int = TRUE) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 8 Ă— 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 166. 12.7 13.1 1.43e- 38 141. 191.
## 2 temp_c 13 0.823 15.9 8.73e- 56 11.4 14.7
## 3 hour_groupMorning 254. 18.1 14.0 3.82e- 44 218. 289.
## 4 hour_groupAfternoon 256 20.4 12.6 7.13e- 36 216. 296.
## 5 hour_groupEvening 275. 19.0 14.5 5.17e- 47 238. 313.
## 6 temp_c:hour_groupMo… 6.1 1.13 5.41 6.45e- 8 3.91 8.36
## 7 temp_c:hour_groupAf… 14.8 1.14 13.0 1.95e- 38 12.6 17.0
## 8 temp_c:hour_groupEv… 37 1.15 32.4 1.34e-216 34.8 39.3
summary(model)$r.squared
## [1] 0.5424425
The regression model uses “Night” as a baseline. All variables in this model have an extremely small p-value, which means we reject the null hypothesis that the coefficients are equal to 0. The R^2 is 0.54, which is a large increase from the base model.
The intercept shows a baseline of 163 rented bikes per hour, where every degree Celsius increase corresponds to 13 additional bikes rented. The time of day has strong effects on ridership, and evenings have the a larger effect on rented bikes than the morning and afternoon compared to the baseline night hours.
The model shows a clear interaction between time of day and temperature on bikes rented. In the morning, there is only an additional 6 bikes rented per hour for every increase in degrees Celsius on top of the baseline 13 bikes rented per hour. The effect of time of day and temperature is very strong in the evening.
What this means is that an equal increase in temperature has a much larger effect in after 6pm then in the morning. Practically speaking, temperature changes in the evening will have large effects on demand and could effect how much staffing is needed depending on the forecasted change.
Here is this information plotted against real data.
ggplot(data = df, aes(x = temp_c, y = rented_bikes, color = hour_group)) +
geom_point(alpha = 0.15) +
geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
labs(title = "Bike Rentals vs. Temperature by Time of Day",
x = "Temperature (°C)",
y = "Bike Rentals",
color = "Time of Day") +
theme(legend.position = "top")
## `geom_smooth()` using formula = 'y ~ x'
There is a lot of error above the modeled lines, but stratifying by time of day groupings increased model fit to the data.
We can see this more clearly using predicted values to show the regression lines.
# Create a sequence of temperature values and rainfall amounts.
temp_seq <- seq(min(df$temp_c, na.rm = TRUE), max(df$temp_c, na.rm = TRUE), length.out = 100)
# Generate new data for predictions
new_data <- expand.grid(
temp_c = temp_seq,
hour_group = levels(df$hour_group))
# Predict bike rentals based on the updated model
new_data$predicted_rentals <- predict(model, newdata = new_data)
ggplot(new_data, aes(x = temp_c, y = predicted_rentals, color = hour_group)) +
geom_line(size = 1.2) +
labs(title = "Effect of Rainfall on Bike Rentals by Time of Day",
x = "Temperature (Celsius)",
y = "Predicted Bike Rentals",
color = "Time of Day",
linetype = "Time of Day") +
scale_color_manual(values = c("Night" = "blue", "Morning" = "green", "Afternoon" = "orange", "Evening" = "red")) +
theme(legend.position = "top")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The graphs above clearly demonstrate how time of day effects the coefficients of temperature and bikes rented per hour.
Now let’s see what effect rainfall has on this model. First I will assess rain as a continuous variable, then see the effects of a binary rainfall column.
model <- lm(rented_bikes ~ temp_c + hour_group + temp_c*hour_group + rainfall_mm,data=df)
tidy(model,conf.int = TRUE) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 9 Ă— 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 171. 12.3 13.9 2.18e- 43 147. 196.
## 2 temp_c 13.5 0.798 16.9 2.61e- 63 11.9 15.1
## 3 hour_groupMorning 259. 17.5 14.8 8.04e- 49 225. 294.
## 4 hour_groupAfternoon 261. 19.7 13.2 1.29e- 39 223. 300.
## 5 hour_groupEvening 278. 18.4 15.1 9.93e- 51 242. 314.
## 6 rainfall_mm -95.9 4.08 -23.5 1.34e-118 -104. -87.9
## 7 temp_c:hour_groupMo… 6.2 1.10 5.67 1.50e- 8 4.07 8.38
## 8 temp_c:hour_groupAf… 14.5 1.10 13.2 3.38e- 39 12.3 16.7
## 9 temp_c:hour_groupEv… 37.3 1.11 33.6 5.90e-233 35.1 39.5
summary(model)$r.squared
## [1] 0.5705517
Every millimeter of rainfall corresponded to about 96 less riders per hour, and this value is statistically significant. Adding rainfall increased the r-squared value from 0.54 to 0.57. We are already seeing diminishing returns in explanatory power from adding variables to the model.
Let’s compare this result to a binary version of rainfall to see if we can simplify the model.
df = df |>
mutate(rainfall = rainfall_mm > 0)
model <- lm(rented_bikes ~ temp_c + hour_group + temp_c*hour_group + rainfall,data=df)
tidy(model) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 9 Ă— 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 193. 11.8 16.3 4.01e- 59
## 2 temp_c 14.9 0.764 19.5 6.02e- 83
## 3 hour_groupMorning 259. 16.8 15.4 4.82e- 53
## 4 hour_groupAfternoon 259. 18.9 13.7 1.59e- 42
## 5 hour_groupEvening 273. 17.6 15.5 1.84e- 53
## 6 rainfallTRUE -689. 18.4 -37.5 4.29e-285
## 7 temp_c:hour_groupMorning 5.6 1.05 5.33 1.02e- 7
## 8 temp_c:hour_groupAfternoon 13.2 1.05 12.6 8.04e- 36
## 9 temp_c:hour_groupEvening 36.5 1.06 34.4 1.09e-242
summary(model)$r.squared
## [1] 0.6077389
The binary version increased the R^2 value from 0.54 to 0.61, which is 0.04 higher than if rainfall was used as a continuous variable! This version will be kept.
If it is raining, the intercept is decreased by almost 690 bikes rented per hour. The p-value is incredibly small, showing the result is statistically significant.
Let’s see the effect of this model on predicted and real data with or without rainfall.
# Create a sequence of temperature values for smooth plotting
temp_seq <- seq(min(df$temp_c, na.rm = TRUE), max(df$temp_c, na.rm = TRUE), length.out = 100)
# Generate new data for prediction, for both 'rainfall' and 'no rainfall'
# Rainfall
new_data_rain <- expand.grid(
temp_c = temp_seq,
hour_group = levels(df$hour_group),
rainfall = TRUE
)
# No Rainfall
new_data_no_rain <- expand.grid(
temp_c = temp_seq,
hour_group = levels(df$hour_group),
rainfall = FALSE
)
# Predict bike rentals based on the model for both scenarios
new_data_rain$predicted_rentals <- predict(model, newdata = new_data_rain)
new_data_no_rain$predicted_rentals <- predict(model, newdata = new_data_no_rain)
#Combine
new_data_combined <- bind_rows(new_data_rain, new_data_no_rain)
ggplot(new_data_combined, aes(x = temp_c, y = predicted_rentals, color = hour_group, linetype = rainfall)) +
geom_point(data = df, aes(x = temp_c, y = rented_bikes, color = rainfall), alpha = 0.3) +
geom_line(size = 1.2) +
theme_minimal() +
labs(title = "Effect of Temperature and Rainfall on Bike Rentals by Time of Day",
x = "Temperature (°C)",
y = "Bike Rentals",
color = "Rainfall",
linetype = "Rainfall") +
scale_color_manual(values = c("TRUE" = "black", "FALSE" = "gray",
"Night" = "blue", "Morning" = "green",
"Afternoon" = "orange", "Evening" = "red")) + # Points use black/gray, lines use hour_group colors
scale_linetype_manual(values = c("TRUE" = "dashed", "FALSE" = "solid")) +
theme(legend.position = "top")
This graph shows our model broken out by time of day and whether it was raining overlaid on the real data points colored black for rain and gray for no rain.
The model captures the drop in ridership on rainy days well, but seems to over-penalize for rain at night. The line falls well below even the lowest ridership numbers when modeling raining at night.
I used four diagnostic plots to evaluate the model: - Residuals vs Fitted - Q-Q Residuals - Scale-Location - Residuals vs Leverage
The Model: - Predictor: Time of Day (Categorical), Temperature (Continuous), Rainfall (Binary) - Response: Rented Bikes (Continuous) - Interactions Terms: Time of Day and Temperature
par(mfrow = c(2, 3))
plot(model)
The residuals vs fitted plot tests non-linearity and outliers. The model
contains many outliers. The residuals don’t bounce around the 0 line
randomly, which indicates that there is a non-linear component to our
data, which makes sense to the cyclical nature of demand over the course
of the day.
The Q-Q plot shows whether the residuals are normally distributed. In this case, the residuals are not normally distributed because the quantiles do not follow the line well, especially at the right tail.
Scale-location tests for unequal variance in the data. Below 500 on the x-axis, the plot indicates that variance is unequal due to the pattern and non-randomness.
Residuals vs Leverage looks for values that are extreme enough to influence the linear regression line. The plot shows high clustering, which indicates that there are not any values that have a drastic leverage on the fit of the model.
Based on the diagnostic plots of the model, the non-linearity of the model suggests a data transformation or different modeling approach may be appropriate.