Generalized Linear Models

Could you predict if it was raining using the hourly bikes rented, temperature, time of day, and the day of the week?

I built a logistic regression model to predict if it was precipitating (raining or snowing) based on bikes rented per hour and other variables.

Then, I interpreted the coefficients and built confidence intervals of one variable.

I created a binary column with the value “1” if it is raining or snowing and examined the sample size of each group.

df <- df |>
  mutate(precip = case_when(
    snowfall_cm > 0 ~ 1,
    rainfall_mm > 0 ~ 1,
    snowfall_cm == 0 ~ 0,
    rainfall_mm == 0 ~ 0
  ))

print(paste("It rained or snowed", round(mean(df$precip)*100,1), "percent of the time (",sum(df$precip),"hours /",count(df), "hours)."))
## [1] "It rained or snowed 11 percent of the time ( 931 hours / 8465 hours)."

There are two important conditions for logistic regression models: 1. Each outcome is independent. 2. Each predictor is linearly related to logit, holding the other predictors constant. (Diez et. al. 2019, p. 376)

The condition of independence holds because rain at any given hour is independent of rain earlier or later in the day.

The second condition will be tested later.

Feature Engineering

model <- glm(precip ~ rented_bikes+hour_group+temp_c+day_of_week,
             data = df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + temp_c + day_of_week, 
##     family = binomial(link = "logit"), data = df)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.7818177  0.1304967  -5.991 2.08e-09 ***
## rented_bikes         -0.0072250  0.0002786 -25.929  < 2e-16 ***
## hour_groupMorning     1.2735506  0.1087860  11.707  < 2e-16 ***
## hour_groupAfternoon   1.4593640  0.1212293  12.038  < 2e-16 ***
## hour_groupEvening     1.4861034  0.1205245  12.330  < 2e-16 ***
## temp_c                0.0390451  0.0037956  10.287  < 2e-16 ***
## day_of_weekMonday    -0.2845915  0.1522723  -1.869 0.061628 .  
## day_of_weekSaturday  -0.6024394  0.1632425  -3.690 0.000224 ***
## day_of_weekSunday    -0.7176407  0.1555054  -4.615 3.93e-06 ***
## day_of_weekThursday   0.0685840  0.1490385   0.460 0.645390    
## day_of_weekTuesday    0.0417009  0.1540893   0.271 0.786677    
## day_of_weekWednesday  0.3385007  0.1460965   2.317 0.020506 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5865.9  on 8464  degrees of freedom
## Residual deviance: 4028.6  on 8453  degrees of freedom
## AIC: 4052.6
## 
## Number of Fisher Scoring iterations: 8

The only statistically significant days of the week are Saturday and Sunday. This tells me that this data would be more predictive as a binary “weekend” feature.

I created a confusion matrix to evaluate the models accuracy of predicting of predicting rain. This was done by calculating the probability of rain for each observation than rounding to create a binary, where 1 indicates rain.

set.seed(1)

# Sample 80% of the data for the training set
train_index <- sample(1:nrow(df), size = 0.8 * nrow(df))

# Create the training and testing datasets
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

model <- glm(precip ~ rented_bikes+hour_group+temp_c+day_of_week,
             data = train_data, family = binomial(link = 'logit'))

# Get predicted probabilities on the test data
test_data$prediction_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
test_data$prediction <- ifelse(test_data$prediction_probs > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Predicted = test_data$prediction, Actual = test_data$precip)
print(conf_matrix)
##          Actual
## Predicted    0    1
##         0 1477  144
##         1   19   53
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4)))   
## [1] "Accuracy: 0.9037"

The model predicted:

No precipitation when there actually was precipitation (False Negative) 144/197 rainy hours.

Precipitation when there was none (False Positive) 19/1496 clear hours.

Precipitation correctly (True Positive) 53/197 rainy hours.

No precipitation correctly (True Negative) 1477/1496 clear hours.

Precipitation only occurs in 11% of hours within the data, which means a broken model that only outputted clear would be right 89% of the time. This model was correct for 90% of the data: 27% accuracy when predicting precipitation, and 99% accuracy when predicting clear skies. Can this be improved?

First I’ll test the effect of a binary weekend column.

weekend <- c('Saturday','Sunday')
df <- df |>
  mutate(weekend = case_when(
    day_of_week %in% weekend  ~ 1,
    !(day_of_week %in% weekend) ~ 0,
  ))


model <- glm(precip ~ rented_bikes+hour_group+temp_c+weekend,
             data = df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + temp_c + weekend, 
##     family = binomial(link = "logit"), data = df)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.7559794  0.0838331  -9.018  < 2e-16 ***
## rented_bikes        -0.0071312  0.0002743 -25.996  < 2e-16 ***
## hour_groupMorning    1.2634485  0.1084378  11.651  < 2e-16 ***
## hour_groupAfternoon  1.4513399  0.1208577  12.009  < 2e-16 ***
## hour_groupEvening    1.4780641  0.1201196  12.305  < 2e-16 ***
## temp_c               0.0381198  0.0037811  10.082  < 2e-16 ***
## weekend             -0.6976811  0.0944084  -7.390 1.47e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5865.9  on 8464  degrees of freedom
## Residual deviance: 4048.4  on 8458  degrees of freedom
## AIC: 4062.4
## 
## Number of Fisher Scoring iterations: 8

All features are statistically significant now with minuscule p-values.

# Sample 80% of the data for the training set
train_index <- sample(1:nrow(df), size = 0.8 * nrow(df))

# Create the training and testing datasets
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

model <- glm(precip ~ rented_bikes+hour_group+temp_c+weekend,
             data = train_data, family = binomial(link = 'logit'))

# Get predicted probabilities on the test data
test_data$prediction_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
test_data$prediction <- ifelse(test_data$prediction_probs > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Predicted = test_data$prediction, Actual = test_data$precip)
print(conf_matrix)
##          Actual
## Predicted    0    1
##         0 1486  136
##         1   19   52
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4)))   
## [1] "Accuracy: 0.9084"

The model’s total accuracy did not change, but False Positives decreased while False Negatives increased.

I hypothesized that the model would perform better if it was split in two: one model that predicts precipitation above 0 degrees Celsius and one that predicts at or below freezing. Rain and snow likely have different effects on rider behavior. Rain does occur below 0 degrees Celsius in the data, so using 0 degrees is somewhat arbitrary. However, this cutoff is helpful for the Box-Tidwell test, which will be used later to test the second linearity condition. The test requires a log transformation, so all temperature values being above 0 means that the function will not be undefined for any points, which causes missing values.

over_df <- df |>
  filter(temp_c>0)
model <- glm(precip ~ rented_bikes+hour_group+temp_c+weekend,
             data = over_df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + temp_c + weekend, 
##     family = binomial(link = "logit"), data = over_df)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.5079652  0.1255907 -12.007   <2e-16 ***
## rented_bikes        -0.0073849  0.0002947 -25.057   <2e-16 ***
## hour_groupMorning    1.2826033  0.1377925   9.308   <2e-16 ***
## hour_groupAfternoon  1.6286050  0.1500881  10.851   <2e-16 ***
## hour_groupEvening    2.0972576  0.1601947  13.092   <2e-16 ***
## temp_c               0.0742849  0.0058948  12.602   <2e-16 ***
## weekend             -0.2406727  0.1146528  -2.099   0.0358 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4163.1  on 7010  degrees of freedom
## Residual deviance: 2542.5  on 7004  degrees of freedom
## AIC: 2556.5
## 
## Number of Fisher Scoring iterations: 8

This model indicated that weekend is no longer statistically significant when only using non-freezing temperatures.

over_df <- df |>
  filter(temp_c>0)

set.seed(1)

# Sample 80% of the data for the training set
train_index <- sample(1:nrow(over_df), size = 0.8 * nrow(over_df))

# Create the training and testing datasets
train_data <- over_df[train_index, ]
test_data <- over_df[-train_index, ]

model <- glm(precip ~ rented_bikes+hour_group+temp_c+weekend,
             data = train_data, family = binomial(link = 'logit'))

# Get predicted probabilities on the test data
test_data$prediction_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
test_data$prediction <- ifelse(test_data$prediction_probs > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Predicted = test_data$prediction, Actual = test_data$precip)
print(conf_matrix)
##          Actual
## Predicted    0    1
##         0 1259   68
##         1   12   64
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4))) 
## [1] "Accuracy: 0.943"

A large improvement! The model’s accuracy for predicting rain increased to 94% by splitting the data on the freezing point. The model can now predict rain with 48% accuracy and clear skies with 99% accuracy.

I used the Box-Tidwell Test to test the second condition that the independent continuous variable temperature has a linear relationship with the log-odds of the dependent variable. First I log-transformed the temperature feature by taking the natural log of temperature and multiplying it by the temperature feature. Then, I added the log transformed temperature feature to the model.

# Log transformation
over_df$temp_c_log <- log(over_df$temp_c) * over_df$temp_c

#Logistic Regression model
model <- glm(precip ~ rented_bikes + hour_group + temp_c_log + day_of_week,
             data = over_df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + temp_c_log + 
##     day_of_week, family = binomial(link = "logit"), data = over_df)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.0382351  0.1679220  -6.183  6.3e-10 ***
## rented_bikes         -0.0074506  0.0003007 -24.778  < 2e-16 ***
## hour_groupMorning     1.2624093  0.1382820   9.129  < 2e-16 ***
## hour_groupAfternoon   1.5861607  0.1496417  10.600  < 2e-16 ***
## hour_groupEvening     2.0864029  0.1595931  13.073  < 2e-16 ***
## temp_c_log            0.0209723  0.0016448  12.751  < 2e-16 ***
## day_of_weekMonday    -0.0897094  0.1829202  -0.490  0.62383    
## day_of_weekSaturday  -0.5646977  0.2028230  -2.784  0.00537 ** 
## day_of_weekSunday    -0.4456590  0.1829425  -2.436  0.01485 *  
## day_of_weekThursday  -0.1627122  0.1873160  -0.869  0.38504    
## day_of_weekTuesday   -0.6478399  0.2189589  -2.959  0.00309 ** 
## day_of_weekWednesday -0.5407806  0.2018617  -2.679  0.00738 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4163.1  on 7010  degrees of freedom
## Residual deviance: 2532.5  on 6999  degrees of freedom
## AIC: 2556.5
## 
## Number of Fisher Scoring iterations: 8

The p-value of the interaction terms for temperature indicates the log-transformed temperature feature is statistically significant. This means that we have not met the linearity assumption of the logit and temperature. This means the model may not capture the true effect of temperature on the target variable. I omitted temperature from the following models, and instead looked towards the month feature which has an association with temperature.

over_df <- df |>
  filter(temp_c>0)
model <- glm(precip ~ rented_bikes+hour_group+month_name+weekend,
             data = over_df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + month_name + 
##     weekend, family = binomial(link = "logit"), data = over_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.341101   0.408014  -8.189 2.64e-16 ***
## rented_bikes        -0.007416   0.000300 -24.721  < 2e-16 ***
## hour_groupMorning    1.475886   0.144735  10.197  < 2e-16 ***
## hour_groupAfternoon  2.164776   0.172024  12.584  < 2e-16 ***
## hour_groupEvening    2.523894   0.178619  14.130  < 2e-16 ***
## month_nameFebruary   0.581110   0.470864   1.234 0.217152    
## month_nameMarch      0.895765   0.451462   1.984 0.047240 *  
## month_nameApril      2.501431   0.424105   5.898 3.68e-09 ***
## month_nameMay        3.340204   0.424212   7.874 3.44e-15 ***
## month_nameJune       3.922506   0.449065   8.735  < 2e-16 ***
## month_nameJuly       3.564229   0.436740   8.161 3.32e-16 ***
## month_nameAugust     3.737545   0.427428   8.744  < 2e-16 ***
## month_nameSeptember  3.072106   0.451568   6.803 1.02e-11 ***
## month_nameOctober    2.622317   0.454114   5.775 7.71e-09 ***
## month_nameNovember   3.292297   0.429380   7.668 1.75e-14 ***
## month_nameDecember   2.562835   0.422385   6.068 1.30e-09 ***
## weekend             -0.471095   0.123805  -3.805 0.000142 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4163.1  on 7010  degrees of freedom
## Residual deviance: 2318.2  on 6994  degrees of freedom
## AIC: 2352.2
## 
## Number of Fisher Scoring iterations: 8

All months except for February are statistically significant, so I will include month in the model.

set.seed(1)
# Sample 80% of the data for the training set
train_index <- sample(1:nrow(over_df), size = 0.8 * nrow(over_df))

# Create the training and testing datasets
train_data <- over_df[train_index, ]
test_data <- over_df[-train_index, ]

model <- glm(precip ~ rented_bikes+hour_group+weekend+month_name,
             data = train_data, family = binomial(link = 'logit'))

# Get predicted probabilities on the test data
test_data$prediction_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
test_data$prediction <- ifelse(test_data$prediction_probs > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Predicted = test_data$prediction, Actual = test_data$precip)
print(conf_matrix)
##          Actual
## Predicted    0    1
##         0 1253   63
##         1   18   69
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4))) 
## [1] "Accuracy: 0.9423"

The model became better at predicting rain, and can do so with greater than 50% accuracy. However, there are more false negatives.

Finally, I examined the model for at or below freezing temperatures.

under_df <- df |>
  filter(temp_c<=0)
model <- glm(precip ~ rented_bikes+hour_group+weekend,
             data = under_df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + weekend, family = binomial(link = "logit"), 
##     data = under_df)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.4008467  0.1436534  -2.790 0.005265 ** 
## rented_bikes        -0.0074046  0.0008584  -8.626  < 2e-16 ***
## hour_groupMorning    1.1498379  0.1908928   6.023 1.71e-09 ***
## hour_groupAfternoon  1.2297966  0.2325025   5.289 1.23e-07 ***
## hour_groupEvening    0.7727075  0.2193868   3.522 0.000428 ***
## weekend             -1.5627569  0.1959300  -7.976 1.51e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1524.9  on 1453  degrees of freedom
## Residual deviance: 1365.6  on 1448  degrees of freedom
## AIC: 1377.6
## 
## Number of Fisher Scoring iterations: 5

All features are statistically significant.

set.seed(1)
# Sample 80% of the data for the training set
train_index <- sample(1:nrow(under_df), size = 0.8 * nrow(under_df))

# Create the training and testing datasets
train_data <- under_df[train_index, ]
test_data <- under_df[-train_index, ]

model <- glm(precip ~ rented_bikes+hour_group+weekend,
             data = train_data, family = binomial(link = 'logit'))

# Get predicted probabilities on the test data
test_data$prediction_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
test_data$prediction <- ifelse(test_data$prediction_probs > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(Predicted = test_data$prediction, Actual = test_data$precip)
print(conf_matrix)
##          Actual
## Predicted   0   1
##         0 235  45
##         1   4   7
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4))) 
## [1] "Accuracy: 0.8316"

The accuracy fell to 83%. The sample size decreased and the range of temperatures in the below freezing data set is smaller than the above freezing data. Snow seems to be more difficult to predict than rain.

Moving forward, the above freezing model will be used.

Coefficients

model <- glm(precip ~ rented_bikes+hour_group+month_name+weekend,
             data = over_df, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = precip ~ rented_bikes + hour_group + month_name + 
##     weekend, family = binomial(link = "logit"), data = over_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.341101   0.408014  -8.189 2.64e-16 ***
## rented_bikes        -0.007416   0.000300 -24.721  < 2e-16 ***
## hour_groupMorning    1.475886   0.144735  10.197  < 2e-16 ***
## hour_groupAfternoon  2.164776   0.172024  12.584  < 2e-16 ***
## hour_groupEvening    2.523894   0.178619  14.130  < 2e-16 ***
## month_nameFebruary   0.581110   0.470864   1.234 0.217152    
## month_nameMarch      0.895765   0.451462   1.984 0.047240 *  
## month_nameApril      2.501431   0.424105   5.898 3.68e-09 ***
## month_nameMay        3.340204   0.424212   7.874 3.44e-15 ***
## month_nameJune       3.922506   0.449065   8.735  < 2e-16 ***
## month_nameJuly       3.564229   0.436740   8.161 3.32e-16 ***
## month_nameAugust     3.737545   0.427428   8.744  < 2e-16 ***
## month_nameSeptember  3.072106   0.451568   6.803 1.02e-11 ***
## month_nameOctober    2.622317   0.454114   5.775 7.71e-09 ***
## month_nameNovember   3.292297   0.429380   7.668 1.75e-14 ***
## month_nameDecember   2.562835   0.422385   6.068 1.30e-09 ***
## weekend             -0.471095   0.123805  -3.805 0.000142 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4163.1  on 7010  degrees of freedom
## Residual deviance: 2318.2  on 6994  degrees of freedom
## AIC: 2352.2
## 
## Number of Fisher Scoring iterations: 8

The intercept of the model represents the log-odds of the rainfall when all features are zero. This means the base expectation of the model is that it is not raining.

The coefficients suggest that every bike rented is negatively associated with the likelihood it is raining. This is intuitive because rain deters riders. If bikes rented per hour is lower all else being equal, the likelihood of rain is higher.

The categorical feature time of day suggests that as the day progresses, the log-odds of rain increases. This result is due to the multicollinearity between time of day and bikes rented. Bikes rented per hour increases throughout the, so the effect is that time of day normalizes the daily demand cycle due to the negative coefficient on rented_bikes.

The coefficients for month show large increases in log-odds for rain starting in April. Month is associated with temperature, and because temperature effects bikes rented per month, certain months have higher coefficients than others.

Finally, the log-odd is decreased on the weekends. Again, since rain is negatively associate with bikes rented, any measure that increases bikes will decrease the probability of rain, such as if it is the weekend.

Confidence Intervals

The confidence interval for rented_bikes is

upper_ci <- -0.007416 + 1.96 * 0.0003
lower_ci <- -0.007416 - 1.96 * 0.0003

print(paste("The 95% Confidence Interval ranges from", round(lower_ci, 5), "to", round(upper_ci, 5)))
## [1] "The 95% Confidence Interval ranges from -0.008 to -0.00683"

This means that I can be 95% confident that the true value of the coefficient for rented bikes lies within this range.

Since the entire confidence interval is negative, it suggests that the effect of rented bikes on the likelihood of precipitation is consistently negative, supporting the earlier interpretation that more rented bikes are associated with a lower probability of precipitation.