Regression Diagnostics

In this week’s data dive, I build on the previous data dive, where I explored the relationship between temperature and rented bikes per hour in the Seoul Bikeshare data set. I will increase the complexity of the model by adding two variables: time of day (as an interaction with temperature) and rainfall.

Then, I will evaluate the model using diagnostic plots.

Recap

The original regression model showed the relationship between temperature and rented bikes at 8pm. I will use the R^2 value to compare models, so first I will obtain the R^2 value for the simple regression of bikes rented per hour on temperature.

model <- lm(rented_bikes ~ temp_c, data = df)
summary(model)$r.squared
## [1] 0.3166765

The R^2 value of 0.32 means that roughly 32% of the variability in hourly bike rentals can be explained by temperature alone.

Time of Day

In this iteration, I will use hour of the day as an interaction term with temperature. Using hour as a numeric predictor may be problematic because demand is cyclical and fluctuates over the course of the day. First, let’s examine demand throughout the day.

df |>
  group_by(hour) |>
  summarize(mean_bikes_rented = mean(rented_bikes))|>
  ggplot(aes(x=hour,y=mean_bikes_rented))+
  geom_line() +
  labs(title = "Hourly Bike Rentals",
       x = "Hour",
       y = "Bike Rentals")

Instead of using hour as a numeric predictor, I converted it to a categorical variable (time-of-day groups) and added it as a predictor that interacts with temperature.

#Make a categorical column for time of day
df <- df |>
  mutate(hour_group = case_when(
    hour >= 0 & hour < 6  ~ "Night",
    hour >= 6 & hour < 12 ~ "Morning",
    hour >= 12 & hour < 18 ~ "Afternoon",
    hour >= 18 & hour < 24 ~ "Evening"
  )) |>
  mutate(hour_group = factor(hour_group, levels = c("Night", "Morning", "Afternoon", "Evening")))
 

# temp_c * hour_group expands to temp_c + hour_group + temp_c:hour_group
model <- lm(rented_bikes ~ temp_c * hour_group, data = df)

tidy(model,conf.int = TRUE) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 8 Ă— 7
##   term                 estimate std.error statistic   p.value conf.low conf.high
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)             166.     12.7       13.1  1.43e- 38   141.      191.  
## 2 temp_c                   13       0.823     15.9  8.73e- 56    11.4      14.7 
## 3 hour_groupMorning       254.     18.1       14.0  3.82e- 44   218.      289.  
## 4 hour_groupAfternoon     256      20.4       12.6  7.13e- 36   216.      296.  
## 5 hour_groupEvening       275.     19.0       14.5  5.17e- 47   238.      313.  
## 6 temp_c:hour_groupMo…      6.1     1.13       5.41 6.45e-  8     3.91      8.36
## 7 temp_c:hour_groupAf…     14.8     1.14      13.0  1.95e- 38    12.6      17.0 
## 8 temp_c:hour_groupEv…     37       1.15      32.4  1.34e-216    34.8      39.3
summary(model)$r.squared
## [1] 0.5424425

The regression model uses “Night” as the baseline. All terms in this model have extremely small p-values, which means we reject the null hypothesis that their coefficients are equal to 0. The R^2 is 0.54, a large increase over the base model.
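
To confirm that the time-of-day terms add explanatory power beyond temperature alone, one option is a nested-model F-test. Here is a minimal sketch, assuming the temperature-only model is refit under the hypothetical name base_model:

# Refit the temperature-only baseline and the interaction model on the same data
base_model <- lm(rented_bikes ~ temp_c, data = df)
interaction_model <- lm(rented_bikes ~ temp_c * hour_group, data = df)

# Nested-model F-test: a small p-value means the hour_group terms improve the fit
anova(base_model, interaction_model)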

The intercept shows a baseline of about 166 rented bikes per hour during night hours at 0°C, and every degree Celsius of temperature increase corresponds to roughly 13 additional bikes rented. Time of day has strong effects on ridership, and relative to the baseline night hours, evenings have a larger effect on rented bikes than mornings and afternoons.

The model shows a clear interaction between time of day and temperature. In the morning, each additional degree Celsius adds only about 6 more bikes rented per hour on top of the baseline slope of 13. The interaction is much stronger in the evening, where each degree adds about 37 more bikes on top of the baseline.

In other words, an equal increase in temperature has a much larger effect after 6pm than in the morning. Practically speaking, temperature changes in the evening will have large effects on demand and could affect how much staffing is needed, depending on the forecast.
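
To make the interaction concrete, each interaction coefficient can be added to the baseline temperature slope to get the implied bikes-per-degree slope for each time of day. A rough sketch, assuming the fitted interaction model above is still stored as model:

# Implied temperature slope (bikes per degree Celsius) for each time of day:
# the Night (baseline) slope, plus each group's interaction coefficient
coefs <- coef(model)
slopes <- c(
  Night     = unname(coefs["temp_c"]),
  Morning   = unname(coefs["temp_c"] + coefs["temp_c:hour_groupMorning"]),
  Afternoon = unname(coefs["temp_c"] + coefs["temp_c:hour_groupAfternoon"]),
  Evening   = unname(coefs["temp_c"] + coefs["temp_c:hour_groupEvening"])
)
# Roughly 13, 19, 28, and 50 bikes per degree, based on the table above
round(slopes, 1)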

Here is this information plotted against real data.

ggplot(data = df, aes(x = temp_c, y = rented_bikes, color = hour_group)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
  labs(title = "Bike Rentals vs. Temperature by Time of Day",
       x = "Temperature (°C)",
       y = "Bike Rentals",
       color = "Time of Day") +
  theme(legend.position = "top")
## `geom_smooth()` using formula = 'y ~ x'

There is a lot of scatter above the fitted lines, but stratifying by time-of-day group increased the model's fit to the data.

We can see this more clearly using predicted values to show the regression lines.

# Create a sequence of temperature values for smooth plotting
temp_seq <- seq(min(df$temp_c, na.rm = TRUE), max(df$temp_c, na.rm = TRUE), length.out = 100)

# Generate new data for predictions
new_data <- expand.grid(
  temp_c = temp_seq,
  hour_group = levels(df$hour_group))

# Predict bike rentals based on the updated model
new_data$predicted_rentals <- predict(model, newdata = new_data)
ggplot(new_data, aes(x = temp_c, y = predicted_rentals, color = hour_group)) +
  geom_line(linewidth = 1.2) +
  labs(title = "Effect of Rainfall on Bike Rentals by Time of Day",
       x = "Temperature (Celsius)",
       y = "Predicted Bike Rentals",
       color = "Time of Day",
       linetype = "Time of Day") +
  scale_color_manual(values = c("Night" = "blue", "Morning" = "green", "Afternoon" = "orange", "Evening" = "red")) +
  theme(legend.position = "top")

The graphs above clearly demonstrate how time of day affects the relationship between temperature and bikes rented per hour.

Rainfall

Now let’s see what effect rainfall has on this model. First I will assess rainfall as a continuous variable, then compare it to a binary rainfall indicator.

model <- lm(rented_bikes ~ temp_c * hour_group + rainfall_mm, data = df)

tidy(model,conf.int = TRUE) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 9 Ă— 7
##   term                 estimate std.error statistic   p.value conf.low conf.high
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)             171.     12.3       13.9  2.18e- 43   147.      196.  
## 2 temp_c                   13.5     0.798     16.9  2.61e- 63    11.9      15.1 
## 3 hour_groupMorning       259.     17.5       14.8  8.04e- 49   225.      294.  
## 4 hour_groupAfternoon     261.     19.7       13.2  1.29e- 39   223.      300.  
## 5 hour_groupEvening       278.     18.4       15.1  9.93e- 51   242.      314.  
## 6 rainfall_mm             -95.9     4.08     -23.5  1.34e-118  -104.      -87.9 
## 7 temp_c:hour_groupMo…      6.2     1.10       5.67 1.50e-  8     4.07      8.38
## 8 temp_c:hour_groupAf…     14.5     1.10      13.2  3.38e- 39    12.3      16.7 
## 9 temp_c:hour_groupEv…     37.3     1.11      33.6  5.90e-233    35.1      39.5
summary(model)$r.squared
## [1] 0.5705517

Every millimeter of rainfall corresponded to about 96 fewer riders per hour, and this value is statistically significant. Adding rainfall increased the R^2 value from 0.54 to 0.57. We are already seeing diminishing returns in explanatory power from adding variables to the model.
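
Because R^2 can only increase as predictors are added, adjusted R^2 is a fairer way to track these diminishing returns. It can be pulled from the same summary object:

# Adjusted R-squared penalizes each added predictor, so it is a better
# comparison between models of different sizes than plain R-squared
summary(model)$adj.r.squared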

Let’s compare this result to a binary version of rainfall to see if we can simplify the model.

# Binary indicator: TRUE for any hour with recorded rainfall
df <- df |>
  mutate(rainfall = rainfall_mm > 0)

model <- lm(rented_bikes ~ temp_c * hour_group + rainfall, data = df)

tidy(model) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 9 Ă— 5
##   term                       estimate std.error statistic   p.value
##   <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)                   193.     11.8       16.3  4.01e- 59
## 2 temp_c                         14.9     0.764     19.5  6.02e- 83
## 3 hour_groupMorning             259.     16.8       15.4  4.82e- 53
## 4 hour_groupAfternoon           259.     18.9       13.7  1.59e- 42
## 5 hour_groupEvening             273.     17.6       15.5  1.84e- 53
## 6 rainfallTRUE                 -689.     18.4      -37.5  4.29e-285
## 7 temp_c:hour_groupMorning        5.6     1.05       5.33 1.02e-  7
## 8 temp_c:hour_groupAfternoon     13.2     1.05      12.6  8.04e- 36
## 9 temp_c:hour_groupEvening       36.5     1.06      34.4  1.09e-242
summary(model)$r.squared
## [1] 0.6077389

The binary version increased the R^2 value from 0.54 to 0.61, about 0.04 higher than when rainfall was used as a continuous variable. This version will be kept.
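
To back this choice with a criterion that accounts for model complexity, the two rainfall specifications can also be compared by AIC. A sketch, refitting both versions under hypothetical names since model has been overwritten:

# Refit both rainfall specifications and compare them by AIC (lower is better)
model_rain_mm     <- lm(rented_bikes ~ temp_c * hour_group + rainfall_mm, data = df)
model_rain_binary <- lm(rented_bikes ~ temp_c * hour_group + rainfall, data = df)
AIC(model_rain_mm, model_rain_binary)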

If it is raining, predicted rentals drop by almost 690 bikes per hour, holding temperature and time of day constant. The p-value is extremely small, so the result is statistically significant.

Let’s see the effect of this model on predicted and real data with or without rainfall.

# Create a sequence of temperature values for smooth plotting
temp_seq <- seq(min(df$temp_c, na.rm = TRUE), max(df$temp_c, na.rm = TRUE), length.out = 100)

# Generate new data for prediction, for both 'rainfall' and 'no rainfall'
# Rainfall
new_data_rain <- expand.grid(
  temp_c = temp_seq,
  hour_group = levels(df$hour_group),
  rainfall = TRUE
)

# No Rainfall
new_data_no_rain <- expand.grid(
  temp_c = temp_seq,
  hour_group = levels(df$hour_group),
  rainfall = FALSE
)

# Predict bike rentals based on the model for both scenarios
new_data_rain$predicted_rentals <- predict(model, newdata = new_data_rain)
new_data_no_rain$predicted_rentals <- predict(model, newdata = new_data_no_rain)

#Combine
new_data_combined <- bind_rows(new_data_rain, new_data_no_rain)

ggplot(new_data_combined, aes(x = temp_c, y = predicted_rentals, color = hour_group, linetype = rainfall)) +
  geom_point(data = df, aes(x = temp_c, y = rented_bikes, color = rainfall), alpha = 0.3) +
  geom_line(linewidth = 1.2) +
  theme_minimal() +
  labs(title = "Effect of Temperature and Rainfall on Bike Rentals by Time of Day",
       x = "Temperature (°C)",
       y = "Bike Rentals",
       color = "Rainfall",
       linetype = "Rainfall") +
  scale_color_manual(values = c("TRUE" = "black", "FALSE" = "gray", 
                                "Night" = "blue", "Morning" = "green", 
                                "Afternoon" = "orange", "Evening" = "red")) +  # Points use black/gray, lines use hour_group colors
  scale_linetype_manual(values = c("TRUE" = "dashed", "FALSE" = "solid")) +
  theme(legend.position = "top")

This graph shows our model broken out by time of day and by whether it was raining, overlaid on the real data points, which are colored black for rain and gray for no rain.

The model captures the drop in ridership on rainy days well, but seems to over-penalize for rain at night. The predicted line for rainy nights falls well below even the lowest observed ridership numbers.
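
One way to quantify that over-penalty is to compare observed and predicted rentals for rainy night hours directly. A rough sketch using the same df and fitted model:

# Compare observed vs. predicted rentals for rainy night-time hours
rainy_nights <- df |>
  filter(rainfall, hour_group == "Night")

rainy_nights |>
  mutate(predicted = predict(model, newdata = rainy_nights)) |>
  summarize(
    mean_observed  = mean(rented_bikes),
    mean_predicted = mean(predicted),
    n_hours        = n()
  )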

Diagnostics

I used four diagnostic plots to evaluate the model:

- Residuals vs Fitted
- Q-Q Residuals
- Scale-Location
- Residuals vs Leverage

The Model:

- Predictors: Time of Day (categorical), Temperature (continuous), Rainfall (binary)
- Response: Rented Bikes (continuous)
- Interaction term: Time of Day and Temperature

par(mfrow = c(2, 2))
plot(model)

The residuals vs. fitted plot checks for non-linearity and outliers. The model contains many outliers, and the residuals do not bounce randomly around the 0 line, which indicates that there is a non-linear component to the data. This makes sense given the cyclical nature of demand over the course of the day.

The Q-Q plot shows whether the residuals are normally distributed. In this case, the residuals are not normally distributed because the quantiles do not follow the line well, especially at the right tail.
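
As a supplementary check on the same point, a histogram of the residuals makes the shape of their distribution easier to see than the Q-Q plot alone:

# Histogram of residuals as another view of the non-normality
hist(resid(model), breaks = 50,
     main = "Histogram of Residuals",
     xlab = "Residual")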

The scale-location plot checks for unequal variance (heteroscedasticity). Below fitted values of about 500, the points show a clear pattern rather than random scatter, indicating that the variance is not constant.
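
A formal check for this unequal variance is the Breusch-Pagan test. This is a sketch assuming the lmtest package is available:

# Breusch-Pagan test: a small p-value indicates non-constant residual variance
library(lmtest)
bptest(model)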

The residuals vs. leverage plot looks for observations extreme enough to influence the regression fit. The points are tightly clustered, which indicates that no single observation has outsized leverage on the model.
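
To double-check that conclusion numerically, Cook's distance can be computed directly; a common rule of thumb flags observations above 4/n. A minimal sketch:

# Count observations with Cook's distance above the common 4/n threshold
cooks_d <- cooks.distance(model)
sum(cooks_d > 4 / length(cooks_d))

# Index of the single most influential observation
which.max(cooks_d)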

Based on the diagnostic plots, the non-linearity and unequal variance suggest that a data transformation or a different modeling approach may be appropriate.
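
As one example of such a transformation, a log-transformed response often tames right-skewed count data like hourly rentals; log1p() handles hours with zero rentals. This is a sketch to re-run the diagnostics against, not a definitive respecification:

# Example transformation: model log(1 + rentals) instead of raw counts,
# then re-check the same diagnostic plots
log_model <- lm(log1p(rented_bikes) ~ temp_c * hour_group + rainfall, data = df)
summary(log_model)$r.squared
par(mfrow = c(2, 2))
plot(log_model)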