Abstract

This study analyzes hourly bike-sharing demand using the UCI Bike Sharing Dataset, comprising 17,379 hourly observations from the Capital Bikeshare system in Washington D.C. (2011-2012). The analysis examines how temporal, calendar, and meteorological variables influence total hourly rentals (cnt). Two predictive models are compared: multiple linear regression and a regression tree. The regression tree outperforms linear regression (R-squared = 0.738 vs. 0.632), capturing nonlinear interactions among time, season, temperature, and weather. Peak demand occurs during evening commute hours, with registered users accounting for 81.2% of all rentals. Hour of day, season, temperature, and weather condition are identified as the primary demand drivers.

Keywords: bike-sharing, demand forecasting, regression tree, exploratory data analysis, urban mobility


1 Background and Problem Definition

Bike-sharing systems have emerged as a sustainable urban mobility solution, offering flexible, station-based bicycle rentals that complement existing public transit infrastructure. These systems produce rich operational data that can be leveraged to understand user behavior and improve service planning. The Capital Bikeshare system in Washington D.C. – from which this dataset originates – is one of the most extensively studied networks in North America.

The central challenge in bike-sharing operations is demand uncertainty. Operators must deploy and rebalance bicycles across stations based on anticipated usage, making accurate demand forecasting operationally critical. The primary research question of this study is: how accurately can hourly rental demand be predicted using calendar, temporal, and meteorological variables, and which factors are most influential?

The response variable is cnt – total hourly bike rentals. The sub-components casual and registered are used for descriptive purposes only, as their sum equals cnt and including them in predictive models would constitute data leakage. The dataset used is the UCI Bike Sharing Dataset (hour.csv), which records hourly rentals from the Capital Bikeshare system across 2011 and 2012.
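The leakage argument can be illustrated with a minimal sketch. The toy rows below are illustrative values only (not taken from the analysis), and the names toy, leaky_fit, identity_holds, and max_abs_resid are hypothetical. Because cnt is an exact sum of casual and registered, a regression on those two columns reproduces cnt with essentially zero error, which says nothing about real forecasting ability.

```r
# Toy rows mimicking the structure of hour.csv (illustrative values only)
toy <- data.frame(
  casual     = c(3, 8, 5, 3, 0),
  registered = c(13, 32, 27, 10, 1)
)
toy$cnt <- toy$casual + toy$registered

# cnt is an exact sum of its two sub-components
identity_holds <- all(toy$casual + toy$registered == toy$cnt)

# A model that sees the sub-components recovers cnt almost perfectly,
# which is leakage rather than predictive skill
leaky_fit     <- lm(cnt ~ casual + registered, data = toy)
max_abs_resid <- max(abs(residuals(leaky_fit)))

print(identity_holds)
print(max_abs_resid)
```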

Table 1: Dataset Overview
Measure Value
Total observations 17379
Variables 17
Missing values 0
First date 2011-01-01
Last date 2012-12-31
Mean cnt 189.46
Median cnt 142
Maximum cnt 977

As shown in Table 1, the dataset contains 17,379 hourly observations with no missing values. The mean hourly demand is 189.46 rentals, with a maximum of 977 in a single hour. Key predictors include season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, and windspeed.

2 Data Wrangling, Munging, and Cleaning

Effective data preparation is a prerequisite for reliable analysis. This section describes the steps taken to transform the raw dataset into an analytically ready form, including type conversion, factor encoding, and feature engineering.

2.1 Preprocessing Steps

The following transformations were applied to the raw data:

  • Numeric codes for season, weather, year, weekday, and working day were converted to descriptive factor labels to improve interpretability
  • The dteday column was parsed from string to Date format to enable chronological operations
  • Three new variables were engineered to capture behavioral and structural patterns not explicitly represented in the original features
  • All columns were verified to contain zero missing values, confirming data integrity

2.2 Feature Engineering

library(dplyr)  # provides %>%, mutate(), and case_when()

bike_clean <- bike %>%
  mutate(
    # Parse the date string into a Date object
    date = as.Date(dteday),
    # Replace numeric codes with descriptive factor labels
    season_name = factor(season, levels = c(1, 2, 3, 4),
                         labels = c("Spring", "Summer", "Fall", "Winter")),
    weather_name = factor(weathersit, levels = c(1, 2, 3, 4),
                          labels = c("Clear/Partly Cloudy", "Mist/Cloudy",
                                     "Light Rain/Snow", "Heavy Rain/Snow")),
    year_name = factor(yr, levels = c(0, 1),
                       labels = c("2011", "2012")),
    weekday_name = factor(weekday, levels = 0:6,
                          labels = c("Sunday", "Monday", "Tuesday",
                                     "Wednesday", "Thursday",
                                     "Friday", "Saturday")),
    workingday_name = factor(workingday, levels = c(0, 1),
                             labels = c("Non-working day", "Working day")),
    # Engineered feature 1: weekend indicator (0 = Sunday, 6 = Saturday)
    weekend = ifelse(weekday %in% c(0, 6), 1, 0),
    # Engineered feature 2: five-period grouping of the 24 hours
    part_of_day = case_when(
      hr <= 5  ~ "Late night",
      hr <= 11 ~ "Morning",
      hr <= 16 ~ "Afternoon",
      hr <= 20 ~ "Evening",
      TRUE     ~ "Night"
    ),
    # Engineered feature 3: demand tercile based on the 33rd and 66th percentiles
    demand_group = case_when(
      cnt < quantile(cnt, 0.33) ~ "Low",
      cnt < quantile(cnt, 0.66) ~ "Medium",
      TRUE                      ~ "High"
    )
  )

Three engineered variables enhance the analytical framework. weekend is a binary indicator (1 for Saturday and Sunday, 0 otherwise) separating leisure-oriented weekend riding from weekday commuting behavior. part_of_day groups the 24 hours into five periods – Late night (0-5), Morning (6-11), Afternoon (12-16), Evening (17-20), and Night (21-23) – enabling coarser-grained temporal analysis. demand_group classifies each hour into Low, Medium, or High demand based on the 33rd and 66th percentiles of cnt, facilitating categorical demand analysis.
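As a quick check on the hour grouping, the sketch below reproduces the same cut-offs in base R with cut() instead of case_when(). It runs on the bare hours 0 to 23 rather than on the dataset, so the counts are hours per period, not observations. The names hr_toy and part_of_day_toy are hypothetical.

```r
# Hours 0-23 assigned to the five periods using the cut-offs from the text
hr_toy <- 0:23

# Right-closed intervals: (-Inf, 5], (5, 11], (11, 16], (16, 20], (20, Inf)
part_of_day_toy <- cut(
  hr_toy,
  breaks = c(-Inf, 5, 11, 16, 20, Inf),
  labels = c("Late night", "Morning", "Afternoon", "Evening", "Night")
)

# Number of clock hours that fall into each period
hours_per_period <- table(part_of_day_toy)
print(hours_per_period)
```

The periods are deliberately unequal in length: the evening commute gets its own four-hour window, while the late-night period spans six hours.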

## Missing values after cleaning: 0

The cleaned dataset retains all 17,379 observations with zero missing values across all original and engineered variables.

3 Exploratory Data Analysis

Exploratory data analysis (EDA) is conducted to uncover distributional properties, temporal patterns, and relationships among variables before modeling. This section examines demand from multiple perspectives: overall distribution, user composition, hourly and seasonal patterns, weather effects, and continuous variable correlations.

3.1 Demand Distribution

Understanding the shape of the response variable distribution is fundamental. A skewed or multimodal distribution can violate assumptions of linear models and motivate the use of more flexible approaches.

Descriptive Statistics – Hourly Rentals (cnt)
Statistic Value
Minimum 1.00
Q1 40.00
Median 142.00
Mean 189.46
Q3 281.00
Maximum 977.00
Std.Dev 181.39

The distribution of cnt is right-skewed: the mean (189.46) substantially exceeds the median (142), indicating that a minority of high-demand commute hours pull the average upward. The large standard deviation (181.39) relative to the mean reflects high hour-to-hour variability. This skewness motivates the use of flexible, nonparametric models alongside linear regression.
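The skew and variability described above can be quantified directly from the summary table. The sketch below uses only the reported statistics. The coefficient of variation and the Bowley (quartile) skewness are measures chosen here for illustration and are not part of the original analysis.

```r
# Summary statistics copied from the descriptive table for cnt
q1 <- 40
med <- 142
q3 <- 281
mean_cnt <- 189.46
sd_cnt <- 181.39

# Coefficient of variation: spread relative to the mean
cv <- sd_cnt / mean_cnt

# Bowley (quartile) skewness: positive values indicate a longer upper tail
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

print(round(c(cv = cv, bowley = bowley), 3))
```

A coefficient of variation of about 0.96 and a positive quartile skewness of about 0.15 are both consistent with the right-skewed, highly variable distribution described in the text.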

Distribution of hourly bike rentals. The pronounced right skew reflects that most hours have low-to-moderate demand, while a smaller number of peak commute hours generate very high rental counts.

3.2 User Type Composition

Total Rides by User Type
User Type Total Rides Share (%)
Casual 620017 18.8
Registered 2672662 81.2
All users 3292679 100.0

Registered users account for 81.2% of all rentals, confirming that commuter behavior dominates the aggregate demand signal. This distinction is analytically important because registered users (daily commuters) and casual users (tourists, recreational riders) respond differently to time-of-day, weather, and day-type factors.
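The shares follow directly from the ride totals in the table. A short arithmetic check, using hypothetical variable names:

```r
# Ride totals copied from the user type table
casual_rides     <- 620017
registered_rides <- 2672662
total_rides      <- casual_rides + registered_rides

# Percentage share of each user type, rounded to one decimal place
pct_casual     <- round(100 * casual_rides / total_rides, 1)
pct_registered <- round(100 * registered_rides / total_rides, 1)

print(c(total = total_rides, casual = pct_casual, registered = pct_registered))
```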

3.3 Temporal Patterns

Temporal variables – particularly hour of day – are expected to be the strongest predictors of demand given the structured daily routines of commuter users.

3.3.1 Demand by Hour of Day

Top 5 Hours by Average Total Rentals
Hour Avg. Total Avg. Casual Avg. Registered
17 461.45 74.27 387.18
18 425.51 61.12 364.39
8 359.01 21.68 337.33
16 311.98 73.75 238.24
19 311.52 48.77 262.75

The top five hours by average demand fall in the commute windows (8 AM and 4-7 PM), with registered users driving the majority of rentals during these peaks. At 5 PM, for example, registered users average 387.18 of the 461.45 total rentals. This pattern is consistent with bike-sharing functioning primarily as a first/last-mile commuting solution for the registered user segment.

Average hourly rentals by user type. Registered users show sharp bimodal peaks at 8 AM and 5-6 PM, consistent with commuting. Casual users show a broad midday plateau reflecting recreational use.

3.3.2 Demand by Working Day

Separating working days from non-working days clarifies whether demand is primarily driven by commuting routines or recreational activity.

Boxplot of hourly rentals on working days vs. non-working days. Working days show a higher and more variable upper range driven by concentrated commute-hour peaks.

Working days display a wider interquartile range and a more pronounced upper tail compared to non-working days, consistent with large demand spikes during commute hours that are absent on weekends and public holidays.

3.4 Seasonal and Weather Effects

Season and weather influence demand through their effect on the comfort and practicality of cycling. Understanding these patterns is important for operational planning across the calendar year.

Average Hourly Rentals by Season
Season Avg. Rentals
Spring 111.11
Summer 208.34
Fall 236.02
Winter 198.87

Average Hourly Rentals by Weather Condition
Weather Condition Avg. Rentals
Clear/Partly Cloudy 204.87
Mist/Cloudy 175.17
Light Rain/Snow 111.58
Heavy Rain/Snow 74.33

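Fall records the highest average demand, likely due to favorable temperatures and stable conditions. Spring has the lowest demand. Note that the dataset's season code 1, labeled Spring here, appears to begin with the January records, so this category likely covers the coldest months of the year, and low temperatures are a more plausible explanation than weather variability. Clear weather produces substantially higher demand than misty or wet conditions, with average demand declining monotonically from 204.87 to 74.33 rentals per hour as conditions worsen.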

Average hourly rentals by season. Fall records the highest demand while spring has the lowest.

Average hourly rentals by weather condition. Demand declines monotonically as weather deteriorates from clear skies to heavy precipitation.

3.5 Continuous Variables and Correlations

Examining linear relationships between continuous predictors and cnt identifies the most informative numeric features and detects multicollinearity issues that could affect model estimation.

Pearson Correlation Matrix – Continuous Variables
Variable temp atemp hum windspeed cnt
temp 1.000 0.988 -0.070 -0.023 0.405
atemp 0.988 1.000 -0.052 -0.062 0.401
hum -0.070 -0.052 1.000 -0.290 -0.323
windspeed -0.023 -0.062 -0.290 1.000 0.093
cnt 0.405 0.401 -0.323 0.093 1.000

Temperature shows the strongest positive correlation with demand (r = 0.405), while humidity is negatively correlated (r = -0.323). Wind speed has only a weak positive association (r = 0.093). Notably, temp and atemp are highly collinear (r = 0.988), so their individual coefficients cannot be interpreted reliably when both enter a linear model.
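A collinearity screen of this kind can be sketched from the reported correlation matrix alone. The 0.9 cut-off below is an illustrative assumption, not a threshold used in the original analysis, and the object names are hypothetical.

```r
# Correlation matrix copied from the table above
vars <- c("temp", "atemp", "hum", "windspeed", "cnt")
cor_mat <- matrix(
  c( 1.000,  0.988, -0.070, -0.023,  0.405,
     0.988,  1.000, -0.052, -0.062,  0.401,
    -0.070, -0.052,  1.000, -0.290, -0.323,
    -0.023, -0.062, -0.290,  1.000,  0.093,
     0.405,  0.401, -0.323,  0.093,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep predictors only; the response cnt is not part of the screen
predictors <- setdiff(vars, "cnt")
pred_cor   <- cor_mat[predictors, predictors]

# Flag predictor pairs above the (assumed) threshold; the upper triangle
# is used so that each pair is reported once
high <- which(abs(pred_cor) > 0.9 & upper.tri(pred_cor), arr.ind = TRUE)
collinear_pairs <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r    = pred_cor[high]
)
print(collinear_pairs)
```

Only the temp and atemp pair is flagged; every other predictor pair has an absolute correlation below 0.3.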

Normalized temperature vs. hourly rentals with a LOESS curve. Demand increases with temperature up to approximately 0.65 (normalized), then plateaus, indicating a nonlinear relationship.

The LOESS curve confirms a nonlinear relationship: demand rises sharply with temperature up to a moderate level, then flattens at very high temperatures. This pattern motivates the use of a regression tree, which captures such nonlinearity without explicit feature transformation.
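The way a LOESS smoother exposes such a plateau can be sketched on synthetic data. Nothing below uses the bike-sharing data: the response is simulated to rise linearly and then flatten at 0.65, and the span of 0.5 is an arbitrary choice for illustration.

```r
set.seed(1)

# Synthetic data: demand rises with temperature, then flattens above 0.65
temp_toy <- runif(300, min = 0, max = 1)
cnt_toy  <- 300 * pmin(temp_toy, 0.65) + rnorm(300, sd = 20)

# Local regression of demand on temperature (stats::loess)
loess_fit <- loess(cnt_toy ~ temp_toy, span = 0.5)

# Evaluate the smooth at two points below and two above the plateau onset
grid     <- data.frame(temp_toy = c(0.2, 0.4, 0.7, 0.9))
smoothed <- predict(loess_fit, newdata = grid)

# Average slope of the smooth in the rising and in the flat region
slope_low  <- (smoothed[2] - smoothed[1]) / 0.2
slope_high <- (smoothed[4] - smoothed[3]) / 0.2
print(c(slope_low = slope_low, slope_high = slope_high))
```

A much smaller slope in the upper range than in the lower range is the numerical counterpart of the flattening curve seen in the figure.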

4 Predictive Modeling

Two modeling approaches are employed: multiple linear regression as a parametric baseline, and a regression tree as a flexible nonparametric alternative. Both models use the same predictors and are evaluated on a held-out test set using identical performance metrics.

4.1 Train/Test Split and Model Formula

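The dataset is partitioned chronologically – the first 80% of rows forms the training set and the remaining 20% forms the test set. This assumes the rows are already ordered by date and hour. Chronological splitting prevents future information from leaking into model training, which would inflate apparent performance. Because the split is made by row position, the boundary falls partway through 2012-08-07, so that date appears in both the training and test periods.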

train_index <- 1:floor(0.80 * nrow(bike_clean))  
train_data  <- bike_clean[train_index, ]  
test_data   <- bike_clean[-train_index, ]  

cat("Training observations:", nrow(train_data), "\n")  
## Training observations: 13903
cat("Test observations:    ", nrow(test_data),  "\n")  
## Test observations:     3476
cat("Training period:", as.character(min(train_data$date)),  
    "to", as.character(max(train_data$date)), "\n")  
## Training period: 2011-01-01 to 2012-08-07
cat("Test period:    ", as.character(min(test_data$date)),  
    "to", as.character(max(test_data$date)), "\n")  
## Test period:     2012-08-07 to 2012-12-31

# Shared formula for both models; casual and registered are deliberately omitted
model_formula <- cnt ~ factor(season) + factor(yr) + factor(mnth) +
  factor(hr) + factor(holiday) + factor(weekday) +
  factor(workingday) + factor(weathersit) +
  temp + atemp + hum + windspeed +
  factor(weekend) + factor(part_of_day)

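The model formula includes all calendar, temporal, and meteorological predictors. casual and registered are excluded to prevent data leakage. Several terms are redundant: temp and atemp are nearly collinear (r = 0.988), and weekend and part_of_day are deterministic functions of weekday and hr. This redundancy does not prevent prediction, but it makes individual linear regression coefficients unreliable to interpret.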

4.2 Linear Regression

Multiple linear regression models cnt as a weighted linear combination of predictors, serving as an interpretable parametric baseline. Its key limitation is the assumption of linearity, which the EDA suggests may not fully hold for this data.

linear_model      <- lm(model_formula, data = train_data)  
linear_prediction <- predict(linear_model, newdata = test_data)  
linear_prediction[linear_prediction < 0] <- 0  

Predicted values below zero are floored to zero since negative rental counts are not physically meaningful.

4.3 Regression Tree

A regression tree recursively partitions the predictor space into regions based on splits that most reduce prediction error. Each leaf predicts the mean cnt of its training observations. Trees inherently handle nonlinear effects and interactions, making them well-suited to the demand patterns observed in the EDA.

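library(rpart)  # recursive partitioning trees

# Fit an anova (regression) tree with explicit complexity controls
tree_model <- rpart(model_formula,
                    data    = train_data,
                    method  = "anova",
                    control = rpart.control(cp        = 0.001,
                                            minbucket = 35,
                                            maxdepth  = 8))
# Predict hourly rentals for the held-out test period
tree_prediction <- predict(tree_model, newdata = test_data)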

The complexity parameter cp = 0.001 acts as a stopping rule: a split is attempted only if it improves the overall fit (the R-squared of the tree) by at least 0.001. minbucket = 35 ensures a minimum of 35 observations per leaf, and maxdepth = 8 limits tree depth to prevent overfitting.
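The core step of recursive partitioning, choosing the single split that most reduces squared error, can be sketched in a few lines of base R. This is a simplified illustration on toy values, not rpart's implementation: it handles one numeric predictor, ignores minbucket and cp, and the helper names sse and best_split are hypothetical.

```r
# Sum of squared errors around the mean of a node
sse <- function(y) sum((y - mean(y))^2)

# Find the threshold on a numeric predictor x that minimises the
# combined SSE of the two child nodes
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  child_sse <- vapply(candidates, function(t) {
    sse(y[x <= t]) + sse(y[x > t])
  }, numeric(1))
  i <- which.min(child_sse)
  list(threshold = candidates[i], reduction = sse(y) - child_sse[i])
}

# Toy data: low demand at night, high demand in commute hours
hr_toy  <- c(2, 3, 4, 8, 9, 17, 18)
cnt_toy <- c(10, 8, 12, 300, 280, 450, 430)

result <- best_split(hr_toy, cnt_toy)
print(result)
```

On these toy values the best threshold separates the night hours from the rest. A full tree repeats the same search within each child node until a stopping rule such as cp, minbucket, or maxdepth applies.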

Fitted regression tree. Hour of day appears at the root split, confirming its dominance as the primary demand predictor. Each leaf shows the predicted cnt value and training observation count.

5 Results and Discussion

5.1 Model Performance

Three metrics evaluate model performance on the held-out test set. RMSE penalizes large errors more heavily. MAE gives the average absolute prediction error and is more robust to outliers. R-squared measures the proportion of variance explained by the model.
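The three metrics can be written as short functions. The source does not show how they were computed, so the definitions below are assumptions. In particular, R-squared is taken here as one minus the ratio of residual to total sum of squares, whereas some tools report the squared correlation between actual and predicted values instead. The toy vectors are illustrative only.

```r
# Root mean squared error: penalises large errors more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the prediction error
mae <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared as 1 - SSE / SST (assumed definition)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy example with errors of -10, 10, -20, and 20
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 320, 380)

metrics <- c(RMSE = rmse(actual, predicted),
             MAE  = mae(actual, predicted),
             R2   = r_squared(actual, predicted))
print(round(metrics, 3))
```

Under the same assumption, the functions would be applied to the test set as rmse(test_data$cnt, tree_prediction), and likewise for the other two metrics and for linear_prediction.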

Predictive Performance on the Test Set
Model RMSE MAE R-squared
Linear Regression 133.819 98.691 0.632
Regression Tree 100.758 70.291 0.738

The regression tree outperforms linear regression on all three metrics: R-squared rises from 0.632 to 0.738, RMSE falls from 133.8 to 100.8, and MAE falls from 98.7 to 70.3. The tree therefore makes smaller prediction errors on average, although the predicted-versus-actual plot shows that it still tends to underpredict the highest-demand hours.

Regression tree predicted vs. actual rentals on the test set. Points near the dashed 45-degree line indicate accurate predictions. Scatter at high values reflects moderate underprediction of peak-hour demand.

5.2 Variable Importance

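Variable importance measures the total reduction in residual sum of squares attributable to each predictor across the splits in the tree. In rpart, a predictor is credited for splits where it is the primary variable and also, with a reduced weight, for splits where it serves as a surrogate. Closely related predictors, such as hr and part_of_day or temp and atemp, can therefore both score highly. Higher scores indicate variables that contribute more to reducing prediction error.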
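A minimal sketch of how such scores can be read from a fitted rpart object is shown below. It uses synthetic data with one informative predictor and one pure-noise predictor. The data, the default control settings, and the object names are illustrative assumptions and are unrelated to the fitted tree reported here.

```r
library(rpart)
set.seed(42)

# Synthetic data: demand depends on hour only; 'noise' is irrelevant
n   <- 500
toy <- data.frame(hr    = sample(0:23, n, replace = TRUE),
                  noise = rnorm(n))
toy$cnt <- ifelse(toy$hr %in% c(8, 17, 18), 400, 100) + rnorm(n, sd = 20)

# Fit a regression tree with rpart's default control settings
toy_tree <- rpart(cnt ~ hr + noise, data = toy, method = "anova")

# Named numeric vector of importance scores, sorted in decreasing order
importance <- toy_tree$variable.importance
print(importance)
```

For the model in this study, the equivalent call would be tree_model$variable.importance, with the first ten entries giving a top-10 ranking.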

Top 10 Variables by Importance (Regression Tree)
Variable Importance Score
factor(hr) 215859440
factor(part_of_day) 134597350
temp 52783480
atemp 52267602
factor(mnth) 41908259
factor(season) 36479999
factor(yr) 29757537
factor(workingday) 27495713
factor(weekday) 24915130
factor(weekend) 24784002

Top 10 variable importance scores. Hour of day dominates by a wide margin, followed by part of day, temperature, apparent temperature, month, and season.

5.3 Discussion

The regression tree’s superior performance reflects its ability to capture nonlinear relationships and interactions automatically. For example, the effect of hour of day on demand differs substantially between working and non-working days – an interaction the tree discovers through recursive partitioning but that linear regression requires explicit interaction terms to model. Similarly, the nonlinear temperature-demand relationship identified in the EDA is naturally encoded in the tree’s split structure.

Hour of day is the single most important predictor, consistent with the sharp commute peaks in the EDA. Part of day, temperature, month, season, and year follow in the importance ranking. Weather condition and humidity do not appear among the top ten variables, although the EDA shows lower demand in wet weather and a negative correlation between humidity and demand (r = -0.323). The importance of the year variable is consistent with system-wide ridership growth from 2011 to 2012. The remaining unexplained variance (~26%) likely reflects local events, station-level constraints, and real-time weather fluctuations not captured in the dataset.

6 Conclusion

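This study analyzed hourly bike-sharing demand using calendar, temporal, and meteorological variables across 17,379 observations. The key findings are:

  • The regression tree outperforms linear regression on the held-out test set (R-squared = 0.738 vs. 0.632), with lower RMSE and MAE
  • Hour of day is the dominant predictor, with average demand peaking at 8 AM and between 4 PM and 7 PM
  • Registered users account for 81.2% of all rentals
  • Average demand is highest in fall and in clear weather, and rises with temperature before reaching a plateau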

These results provide practical guidance for fleet deployment and station rebalancing. Future work should explore ensemble methods such as random forests or gradient boosting, incorporate station-level spatial features, and apply temporal models such as ARIMA or LSTM networks to further improve forecasting accuracy.