Logistic Regression


Bike Sharing System


You are working as a data scientist for the city of Washington D.C. government. Currently, Washington D.C. has a bike sharing system. People can rent a bike from one location and return it to a different place. You are given a historical usage pattern with weather data contained in Excel workbook bike.csv. You are asked to forecast bike rental demand in the capital bike share program.

Data Dictionary

Data Source: The data is from Kaggle at https://www.kaggle.com/c/bike-sharing-demand, and contains the following columns:

Variable Data Type Description Constraints/Rules
datetime Datetime Hourly date and timestamp Must follow a standard datetime format (YYYY-MM-DD HH:MM:SS)
season Integer Season of the year Categorical: {1 = Spring, 2 = Summer, 3 = Fall, 4 = Winter}
holiday Binary (Integer) Whether the day is a holiday {0 = No, 1 = Yes}
workingday Binary (Integer) Whether the day is a working day (not a weekend/holiday) {0 = No, 1 = Yes}
weather Integer Weather condition Categorical: {1 = Clear/Few Clouds, 2 = Mist/Cloudy, 3 = Light Snow/Rain, 4 = Heavy Rain/Snow/Fog}
temp Numeric Temperature in Celsius Continuous values, typically within a range of [-10, 45]°C
atemp Numeric “Feels like” temperature in Celsius Continuous values, typically within a range of [-10, 50]°C
humidity Numeric Relative humidity (%) Ranges from 0 to 100
windspeed Numeric Wind speed (m/s) Non-negative values (windspeed ≥ 0)
casual Integer Number of rentals by non-registered users Non-negative integer (casual ≥ 0)
registered Integer Number of rentals by registered users Non-negative integer (registered ≥ 0)
count Integer Total number of rentals (casual + registered) Non-negative integer (count = casual + registered)

Question 1:

Build a linear model to forecast number of total rentals (count) using potential predictors, season, holiday, workingday, weather, atemp, and registered.

Load the dataset

bike.df <- read.csv("data/Bike.csv")

Display first few rows

head(bike.df)
##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1

Display dimension of the dataframe

dim(bike.df)
## [1] 10886    12

Display column names

colnames(bike.df)
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

Convert datetime to Date format if needed

bike.df$datetime <- as.POSIXct(bike.df$datetime, format = "%Y-%m-%d %H:%M:%S")

Convert season categorical variables to factors

bike.df$season = factor(bike.df$season,
                        levels = c(1, 2, 3, 4),
                        labels = c("Spring", "Summer", "Fall", "Winter")
)

Convert holiday categorical variables to factors

bike.df$holiday <- factor(bike.df$holiday, 
                          levels = c(0,1), 
                          labels = c("No", "Yes")
)

Convert workingday categorical variables to factors

bike.df$workingday <- factor(bike.df$workingday,
                             levels = c(0,1), 
                             labels = c("No", "Yes")
)

Convert weather categorical variables to factors

bike.df$weather <- factor(bike.df$weather,
                          levels = c(1, 2, 3, 4),
                          labels = c("Clear", "Misty_cloudy",
                                     "Light_snow", "Heavy_rain")
)

Build linear model for count prediction

linear_model <- lm(count ~ season + holiday + workingday +
                     weather + atemp + registered,
                   data = bike.df)

Display the statistical summary of the model

summary(linear_model)
## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     atemp + registered, data = bike.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.215  -20.295   -3.242   14.058  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.742071   1.296078  -4.430 9.50e-06 ***
## seasonSummer         -5.372699   1.192074  -4.507 6.64e-06 ***
## seasonFall          -18.014488   1.448610 -12.436  < 2e-16 ***
## seasonWinter         -8.639163   1.003388  -8.610  < 2e-16 ***
## holidayYes          -11.827273   2.086374  -5.669 1.47e-08 ***
## workingdayYes       -41.745729   0.751281 -55.566  < 2e-16 ***
## weatherMisty_cloudy  -4.962998   0.780422  -6.359 2.11e-10 ***
## weatherLight_snow    -8.562471   1.276496  -6.708 2.07e-11 ***
## weatherHeavy_rain     3.030360  35.080964   0.086    0.931    
## atemp                 2.475376   0.065339  37.885  < 2e-16 ***
## registered            1.141296   0.002414 472.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 2.795e+04 on 10 and 10875 DF,  p-value: < 2.2e-16

Generate predictions using the model and round the predictions

predictions <- predict(linear_model, bike.df)
rounded_predictions <- round(predictions)

Show first 15 rounded predictions

head(rounded_predictions, 15)
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
##  45  65  59  41  31  22  28  28  38  44  70  70 110  99 127

The model effectively explains bike rental patterns, showing that seasonality, holidays, and working days have substantial effects. Weather conditions, particularly misty/cloudy and snowy conditions, decrease rentals, while temperature has a positive impact. Surprisingly, heavy rain does not show a significant impact, which may warrant further investigation. Finally, the number of registered users is the strongest predictor, reinforcing the idea that bike-sharing systems heavily rely on frequent users rather than occasional riders.


Question 2:

Perform best subset selection using bestglm() function based on BIC. What’s the best model based on BIC?

Prepare data for bestglm (needs to be a dataframe with only predictors and response)

model_data <- model.matrix(~ season + holiday + workingday + weather + 
                             atemp + registered + count, data = bike.df)

Remove the first column from the model_data dataset

model_data <- model_data[,-1]

Convert model_data to dataframe

model_data.df <- data.frame(model_data)

Find the best model based on the BIC.

best_bic_model <- bestglm(model_data.df, IC = "BIC", family = gaussian)

Display the statistical summary of the best model of the best_bic_model

summary(best_bic_model$BestModel)
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.216  -20.295   -3.242   14.057  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.740447   1.295882  -4.430 9.53e-06 ***
## seasonSummer         -5.373357   1.191995  -4.508 6.62e-06 ***
## seasonFall          -18.014895   1.448537 -12.437  < 2e-16 ***
## seasonWinter         -8.640114   1.003281  -8.612  < 2e-16 ***
## holidayYes          -11.827253   2.086279  -5.669 1.47e-08 ***
## workingdayYes       -41.745324   0.751232 -55.569  < 2e-16 ***
## weatherMisty_cloudy  -4.963428   0.780370  -6.360 2.09e-10 ***
## weatherLight_snow    -8.562900   1.276428  -6.708 2.06e-11 ***
## atemp                 2.475328   0.065334  37.888  < 2e-16 ***
## registered            1.141297   0.002414 472.785  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10876 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 3.106e+04 on 9 and 10876 DF,  p-value: < 2.2e-16

This model suggests strong predictive performance, with seasonal, weather, holiday, and temperature-related factors all significantly influencing the outcome variable. The model’s high R – squared and significance levels indicate it is likely capturing the key drivers in the data effectively.


Question 3:

Compute the test error of the best model based on BIC using LOOCV.

Test error using LOOCV

loocv_control <- trainControl(method = "LOOCV")
loocv_model <- train(
  count ~ season + holiday + workingday + weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = loocv_control
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

Print the loocv_model results

print(loocv_model)
## Linear Regression 
## 
## 10886 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 10885, 10885, 10885, 10885, 10885, 10885, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   35.09127  0.9624692  23.88103
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Access the Root Mean Squared Error (RMSE) value from the results of a Leave-One-Out Cross-Validation (LOOCV) model

loocv_model$results$RMSE
## [1] 35.09127

Question 4:

Calculate the test error of the best model based on BIC using 10-fold CV.

Test error using 10-fold CV

cv_control <- trainControl(method = "cv", number = 10)

cv_model <- train(
  count ~ season + holiday + workingday +
    weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = cv_control
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

Print the results of the cv_model

print(cv_model)
## Linear Regression 
## 
## 10886 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9797, 9796, 9796, 9798, 9798, 9798, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   35.06708  0.9625144  23.88112
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Access the Root Mean Squared Error (RMSE) value from the results of a Cross-Validation model

cv_model$results$RMSE
## [1] 35.06708

Note Qn3 & Qn4:
An RMSE value of 35 indicates that, on average, the model’s predictions deviate from the actual values by 35 units. In general, lower RMSE values signify better model performance, as they indicate smaller discrepancies between predicted and actual values.


Question 5:

Perform best subset selection using bestglm() function based on CV. What’s the best model based on CV?

Best subset selection based on CV

best_cv_model <- bestglm(
  Xy = model_data.df,
  family = gaussian,
  IC = "CV",
  CVArgs = list(Method = "HTF", K = 10, REP = 1)
)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

Print the results of the best_cv_model

print(best_cv_model)
## CV(K = 10, REP = 1)
## No BICq equivalent
## Best Model:
##                 Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)    -5.334389 1.117091227  -4.775249 1.818258e-06
## workingdayYes -40.685205 0.736932772 -55.208842 0.000000e+00
## atemp           1.967543 0.042399253  46.405140 0.000000e+00
## registered      1.144745 0.002395396 477.894018 0.000000e+00

Display the statistical summary of the best model of the best_cv_model

summary(best_cv_model$BestModel)
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.476  -20.471   -3.179   14.285  270.954 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.334389   1.117091  -4.775 1.82e-06 ***
## workingdayYes -40.685205   0.736933 -55.209  < 2e-16 ***
## atemp           1.967543   0.042399  46.405  < 2e-16 ***
## registered      1.144745   0.002395 477.894  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.58 on 10882 degrees of freedom
## Multiple R-squared:  0.9614, Adjusted R-squared:  0.9614 
## F-statistic: 9.042e+04 on 3 and 10882 DF,  p-value: < 2.2e-16

The model’s high R-squared value and significant predictors indicate a good fit, with meaningful relationships between the predictors and the outcome. However, certain predictors, such as workingdayYes and seasonFall, show large effects that warrant careful interpretation to ensure they align with domain expectations and aren’t driven by multicollinearity. Furthermore, the broad range of residuals suggests that while the model is generally accurate, there may be outliers or variations not fully captured by the current predictors.


Question 6:

Perform the backward stepwise selection using stepAIC() function. What’s the best model?

Full model for count prediction

full_model <- lm(count ~ season + holiday + workingday + 
                   weather + atemp + registered, data = bike.df)

Backward stepwise selection using stepAIC

stepwise_model <- stepAIC(full_model, direction = "backward")
## Start:  AIC=77462.43
## count ~ season + holiday + workingday + weather + atemp + registered
## 
##              Df Sum of Sq       RSS    AIC
## <none>                     13376322  77462
## - holiday     1     39527  13415849  77493
## - weather     3     89177  13465499  77529
## - season      3    263668  13639990  77669
## - atemp       1   1765415  15141737  78810
## - workingday  1   3797752  17174074  80181
## - registered  1 274906725 288283047 110885

Display the statistical summary of the best model

summary(stepwise_model)
## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     atemp + registered, data = bike.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.215  -20.295   -3.242   14.058  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.742071   1.296078  -4.430 9.50e-06 ***
## seasonSummer         -5.372699   1.192074  -4.507 6.64e-06 ***
## seasonFall          -18.014488   1.448610 -12.436  < 2e-16 ***
## seasonWinter         -8.639163   1.003388  -8.610  < 2e-16 ***
## holidayYes          -11.827273   2.086374  -5.669 1.47e-08 ***
## workingdayYes       -41.745729   0.751281 -55.566  < 2e-16 ***
## weatherMisty_cloudy  -4.962998   0.780422  -6.359 2.11e-10 ***
## weatherLight_snow    -8.562471   1.276496  -6.708 2.07e-11 ***
## weatherHeavy_rain     3.030360  35.080964   0.086    0.931    
## atemp                 2.475376   0.065339  37.885  < 2e-16 ***
## registered            1.141296   0.002414 472.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 2.795e+04 on 10 and 10875 DF,  p-value: < 2.2e-16

The analysis used backward stepwise selection to identify the best predictive model for bike rentals, retaining all variables as their removal increased the Akaike Information Criterion (AIC).