Abstract
This study analyzes hourly bike-sharing demand using the UCI Bike Sharing Dataset, comprising 17,379 hourly observations from the Capital Bikeshare system in Washington D.C. (2011-2012). The analysis examines how temporal, calendar, and meteorological variables influence total hourly rentals (cnt). Two predictive models are compared: multiple linear regression and a regression tree. The regression tree outperforms linear regression (R-squared = 0.791 vs. 0.632), capturing nonlinear interactions among time, season, temperature, and weather. Peak demand occurs during evening commute hours, with registered users accounting for 81.2% of all rentals. Hour of day, season, temperature, and weather condition are identified as the primary demand drivers.
Keywords: bike-sharing, demand forecasting, regression tree, exploratory data analysis, urban mobility
Bike-sharing systems have emerged as a sustainable urban mobility solution, offering flexible, station-based bicycle rentals that complement existing public transit infrastructure. These systems produce rich operational data that can be leveraged to understand user behavior and improve service planning. The Capital Bikeshare system in Washington D.C. – from which this dataset originates – is one of the most extensively studied networks in North America.
The central challenge in bike-sharing operations is demand uncertainty. Operators must deploy and rebalance bicycles across stations based on anticipated usage, making accurate demand forecasting operationally critical. The primary research question of this study is: how accurately can hourly rental demand be predicted using calendar, temporal, and meteorological variables, and which factors are most influential?
The dataset used is the UCI Bike Sharing Dataset (hour.csv), which
records hourly rentals from the Capital Bikeshare system across 2011 and
2012. The response variable is cnt, the total number of bike rentals in
each hour. The sub-components casual and registered are used for
descriptive purposes only: their sum equals cnt, so including them in
predictive models would constitute data leakage.
| Measure | Value |
|---|---|
| Total observations | 17379 |
| Variables | 17 |
| Missing values | 0 |
| First date | 2011-01-01 |
| Last date | 2012-12-31 |
| Mean cnt | 189.46 |
| Median cnt | 142 |
| Maximum cnt | 977 |
As shown in Table 1, the dataset contains 17,379 hourly observations
with no missing values. The mean hourly demand is 189.46 rentals, with a
maximum of 977 in a single hour. Key predictors include
season, yr, mnth,
hr, holiday, weekday,
workingday, weathersit, temp,
atemp, hum, and windspeed.
Effective data preparation is a prerequisite for reliable analysis. This section describes the steps taken to transform the raw dataset into an analytically ready form, including type conversion, factor encoding, and feature engineering.
The following transformations were applied to the raw data. The dteday column was parsed from string to Date format to enable chronological operations, the coded categorical variables (season, weathersit, yr, weekday, workingday) were converted to labeled factors, and three variables were engineered (weekend, part_of_day, demand_group).

```r
library(dplyr)  # provides %>%, mutate(), and case_when()

# bike: the raw data frame read from hour.csv (loading step not shown)
bike_clean <- bike %>%
  mutate(
    # Parse the date string so rows can be ordered and filtered by date
    date = as.Date(dteday),
    # Replace integer codes with labeled factors
    season_name = factor(season, levels = c(1, 2, 3, 4),
                         labels = c("Spring", "Summer", "Fall", "Winter")),
    weather_name = factor(weathersit, levels = c(1, 2, 3, 4),
                          labels = c("Clear/Partly Cloudy", "Mist/Cloudy",
                                     "Light Rain/Snow", "Heavy Rain/Snow")),
    year_name = factor(yr, levels = c(0, 1),
                       labels = c("2011", "2012")),
    weekday_name = factor(weekday, levels = 0:6,
                          labels = c("Sunday", "Monday", "Tuesday",
                                     "Wednesday", "Thursday",
                                     "Friday", "Saturday")),
    workingday_name = factor(workingday, levels = c(0, 1),
                             labels = c("Non-working day", "Working day")),
    # Engineered features
    weekend = ifelse(weekday %in% c(0, 6), 1, 0),
    part_of_day = case_when(
      hr <= 5  ~ "Late night",
      hr <= 11 ~ "Morning",
      hr <= 16 ~ "Afternoon",
      hr <= 20 ~ "Evening",
      TRUE     ~ "Night"
    ),
    demand_group = case_when(
      cnt < quantile(cnt, 0.33) ~ "Low",
      cnt < quantile(cnt, 0.66) ~ "Medium",
      TRUE                      ~ "High"
    )
  )
```

Three engineered variables enhance the analytical framework.
weekend is a binary indicator separating leisure-oriented
weekend riding from weekday commuting behavior. part_of_day
groups the 24 hours into five periods (Late night, Morning,
Afternoon, Evening, and Night), enabling coarser-grained temporal
analysis. demand_group classifies each hour as Low,
Medium, or High demand based on the 33rd and 66th percentiles of
cnt, facilitating categorical demand analysis.
```
## Missing values after cleaning: 0
```
The cleaned dataset retains all 17,379 observations with zero missing values across all original and engineered variables.
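As a cross-check on the engineered variables, the hour-to-period mapping and the percentile-based demand groups can be reproduced in base R. The sketch below is illustrative only: the helper names `label_part_of_day` and `label_demand_group` and the toy count vector are assumptions, not part of the original pipeline.

```r
# Hypothetical helpers mirroring the case_when() rules used during cleaning.
label_part_of_day <- function(hr) {
  # Right-closed bins: 0-5, 6-11, 12-16, 17-20, 21-23
  cut(hr,
      breaks = c(-1, 5, 11, 16, 20, 23),
      labels = c("Late night", "Morning", "Afternoon", "Evening", "Night"))
}

label_demand_group <- function(cnt) {
  lo <- quantile(cnt, 0.33)
  hi <- quantile(cnt, 0.66)
  ifelse(cnt < lo, "Low", ifelse(cnt < hi, "Medium", "High"))
}

# Every hour of the day should fall into exactly one period.
periods <- label_part_of_day(0:23)
print(table(periods))

# Toy counts (not real data) to show the three demand groups.
toy_cnt <- c(5, 12, 40, 90, 142, 200, 281, 350, 500, 977)
print(data.frame(cnt = toy_cnt, group = label_demand_group(toy_cnt)))
```

Because the cutoffs are computed from the data being labeled, the group boundaries would shift if the same rule were applied to a different sample.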
Exploratory data analysis (EDA) is conducted to uncover distributional properties, temporal patterns, and relationships among variables before modeling. This section examines demand from multiple perspectives: overall distribution, user composition, hourly and seasonal patterns, weather effects, and continuous variable correlations.
Understanding the shape of the response variable distribution is fundamental. A skewed or multimodal distribution can violate assumptions of linear models and motivate the use of more flexible approaches.
| Statistic | Value |
|---|---|
| Minimum | 1.00 |
| Q1 | 40.00 |
| Median | 142.00 |
| Mean | 189.46 |
| Q3 | 281.00 |
| Maximum | 977.00 |
| Std.Dev | 181.39 |
The distribution of cnt is right-skewed: the mean
(189.46) substantially exceeds the median (142), indicating that a
minority of high-demand commute hours pull the average upward. The large
standard deviation (181.39) relative to the mean reflects high
hour-to-hour variability. This skewness motivates the use of flexible,
nonparametric models alongside linear regression.
Distribution of hourly bike rentals. The pronounced right skew reflects that most hours have low-to-moderate demand, while a smaller number of peak commute hours generate very high rental counts.
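The degree of right skew can also be summarized numerically. The following is a minimal sketch using a moment-based skewness coefficient; the function name `sample_skewness` is hypothetical, and the counts are simulated stand-ins rather than the actual cnt values.

```r
# Moment-based sample skewness: mean cubed deviation over cubed std. deviation.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Simulated right-skewed counts standing in for cnt (an assumption, not real data).
toy_cnt <- round(rexp(5000, rate = 1 / 190)) + 1

cat("Mean:    ", round(mean(toy_cnt), 1), "\n")
cat("Median:  ", median(toy_cnt), "\n")
cat("Skewness:", round(sample_skewness(toy_cnt), 2), "\n")
```

A positive coefficient together with a mean above the median is the same signature described for cnt above.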
| Casual Rides | Registered Rides | Total Rides | % Casual | % Registered |
|---|---|---|---|---|
| 620017 | 2672662 | 3292679 | 18.8 | 81.2 |
Registered users account for 81.2% of all rentals, confirming that commuter behavior dominates the aggregate demand signal. This distinction is analytically important because registered users (daily commuters) and casual users (tourists, recreational riders) respond differently to time-of-day, weather, and day-type factors.
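The shares in the table follow directly from the two ride totals. A short arithmetic check, using only the figures reported above:

```r
# Ride totals as reported in the user-composition table.
casual     <- 620017
registered <- 2672662
total      <- casual + registered

# Percentage share of each user type, rounded to one decimal place.
share <- round(100 * c(casual = casual, registered = registered) / total, 1)
print(total)
print(share)
```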
Temporal variables – particularly hour of day – are expected to be the strongest predictors of demand given the structured daily routines of commuter users.
| Hour | Avg. Total | Avg. Casual | Avg. Registered |
|---|---|---|---|
| 17 | 461.45 | 74.27 | 387.18 |
| 18 | 425.51 | 61.12 | 364.39 |
| 8 | 359.01 | 21.68 | 337.33 |
| 16 | 311.98 | 73.75 | 238.24 |
| 19 | 311.52 | 48.77 | 262.75 |
The top five hours by average demand are concentrated in the commute window (8 AM and 4-7 PM), with registered users driving the majority of rentals during these peaks. This is consistent with bike-sharing functioning primarily as a first/last-mile commuting solution for the registered user segment.
Average hourly rentals by user type. Registered users show sharp bimodal peaks at 8 AM and 5-6 PM, consistent with commuting. Casual users show a broad midday plateau reflecting recreational use.
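A table of hourly averages like the one above can be built with `aggregate()`. The sketch below uses simulated records, since the loading step for hour.csv is not shown; the column names match the dataset, but the values are assumptions chosen to mimic commute and midday peaks.

```r
set.seed(42)
# Toy hourly records (simulated, not the real data): registered riders peak at
# 8, 17 and 18; casual riders peak around midday.
toy <- data.frame(hr = rep(0:23, times = 30))
toy$registered <- rpois(nrow(toy), lambda = 20 + 300 * (toy$hr %in% c(8, 17, 18)))
toy$casual     <- rpois(nrow(toy), lambda = 5 + 40 * (toy$hr %in% 11:15))
toy$cnt        <- toy$casual + toy$registered

# Mean of each count column within every hour of the day.
hourly_avg <- aggregate(cbind(cnt, casual, registered) ~ hr, data = toy, FUN = mean)

# Five busiest hours by average total rentals.
top5 <- head(hourly_avg[order(-hourly_avg$cnt), ], 5)
print(round(top5, 2))
```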
Separating working days from non-working days clarifies whether demand is primarily driven by commuting routines or recreational activity.
Boxplot of hourly rentals on working days vs. non-working days. Working days show a higher and more variable upper range driven by concentrated commute-hour peaks.
Working days display a wider interquartile range and a more pronounced upper tail compared to non-working days, consistent with large demand spikes during commute hours that are absent on weekends and public holidays.
Season and weather influence demand through their effect on the comfort and practicality of cycling. Understanding these patterns is important for operational planning across the calendar year.
| Season | Avg. Rentals |
|---|---|
| Spring | 111.11 |
| Summer | 208.34 |
| Fall | 236.02 |
| Winter | 198.87 |
| Weather Condition | Avg. Rentals |
|---|---|
| Clear/Partly Cloudy | 204.87 |
| Mist/Cloudy | 175.17 |
| Light Rain/Snow | 111.58 |
| Heavy Rain/Snow | 74.33 |
Fall records the highest average demand, likely due to favorable temperatures and stable conditions. Spring has the lowest demand, possibly reflecting greater weather variability. Clear weather produces substantially higher demand than any form of precipitation, with demand declining monotonically as conditions worsen.
Average hourly rentals by season. Fall records the highest demand while spring has the lowest.
Average hourly rentals by weather condition. Demand declines monotonically as weather deteriorates from clear skies to heavy precipitation.
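The monotonic decline can be verified, and the size of each drop expressed relative to clear conditions, from the averages in the weather table. This is a small arithmetic sketch using only those reported values:

```r
# Average hourly rentals by weather condition, as reported in the table.
avg_by_weather <- c("Clear/Partly Cloudy" = 204.87,
                    "Mist/Cloudy"         = 175.17,
                    "Light Rain/Snow"     = 111.58,
                    "Heavy Rain/Snow"     = 74.33)

# Strictly decreasing means every step-to-step difference is negative.
is_monotone_decreasing <- all(diff(avg_by_weather) < 0)

# Percentage drop relative to clear conditions.
pct_drop <- round(100 * (1 - avg_by_weather / avg_by_weather[1]), 1)

print(is_monotone_decreasing)
print(pct_drop)
```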
Examining linear relationships between continuous predictors and
cnt identifies the most informative numeric features and
detects multicollinearity issues that could affect model estimation.
| | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|
| temp | 1.000 | 0.988 | -0.070 | -0.023 | 0.405 |
| atemp | 0.988 | 1.000 | -0.052 | -0.062 | 0.401 |
| hum | -0.070 | -0.052 | 1.000 | -0.290 | -0.323 |
| windspeed | -0.023 | -0.062 | -0.290 | 1.000 | 0.093 |
| cnt | 0.405 | 0.401 | -0.323 | 0.093 | 1.000 |
Temperature shows the strongest positive correlation with demand (r =
0.405), while humidity is negatively correlated (r = -0.323). Wind speed
has a weak positive association (r = 0.093). Notably, temp
and atemp are highly collinear (r = 0.988) and should not
be included simultaneously in a linear model.
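One way to gauge how severe this collinearity is would be the variance inflation factor (VIF). The sketch below covers only the simplified two-predictor case, where the VIF depends solely on the pairwise correlation; the full model contains many more predictors, so this is an illustration rather than the VIF of the fitted model.

```r
# Correlation between temp and atemp, taken from the correlation table.
r_temp_atemp <- 0.988

# With only these two predictors, regressing one on the other gives R^2 = r^2,
# so the variance inflation factor is 1 / (1 - r^2).
vif_two_predictors <- 1 / (1 - r_temp_atemp^2)
print(round(vif_two_predictors, 1))
```

Common rules of thumb treat VIF values above 5 or 10 as a sign of problematic collinearity, so a value of this size supports dropping one of the two variables from the linear model.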
Normalized temperature vs. hourly rentals with a LOESS curve. Demand increases with temperature up to approximately 0.65 (normalized), then plateaus, indicating a nonlinear relationship.
The LOESS curve confirms a nonlinear relationship: demand rises sharply with temperature up to a moderate level, then flattens at very high temperatures. This pattern motivates the use of a regression tree, which captures such nonlinearity without explicit feature transformation.
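A LOESS fit of this kind can be sketched with `loess()` from base R. The data below are simulated to rise and then plateau at a normalized temperature of 0.65; both the data and the `span` value are assumptions for illustration, not the settings used for the figure.

```r
set.seed(7)
# Simulated data (an assumption): demand rises with normalized temperature
# up to 0.65 and is flat beyond it, plus noise.
toy <- data.frame(temp = runif(500, 0.05, 1))
toy$cnt <- 400 * pmin(toy$temp, 0.65) + rnorm(500, sd = 20)

# Local regression; span controls the share of points used in each local fit.
fit <- loess(cnt ~ temp, data = toy, span = 0.5)

# Fitted values at a few temperatures on either side of the plateau.
grid <- data.frame(temp = c(0.3, 0.6, 0.7, 0.9))
grid$fitted <- predict(fit, newdata = grid)
print(round(grid, 1))
```

The fitted values climb between 0.3 and 0.6 and change little between 0.7 and 0.9, which is the shape a single linear temperature coefficient cannot reproduce.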
Two modeling approaches are employed: multiple linear regression as a parametric baseline, and a regression tree as a flexible nonparametric alternative. Both models use the same predictors and are evaluated on a held-out test set using identical performance metrics.
The dataset is partitioned chronologically – the first 80% forms the training set and the remaining 20% forms the test set. Chronological splitting prevents future information from leaking into model training, which would inflate apparent performance.
```r
# Chronological 80/20 split (assumes rows are already in time order)
train_index <- 1:floor(0.80 * nrow(bike_clean))
train_data  <- bike_clean[train_index, ]
test_data   <- bike_clean[-train_index, ]

cat("Training observations:", nrow(train_data), "\n")
## Training observations: 13903
cat("Test observations:", nrow(test_data), "\n")
## Test observations: 3476
cat("Training period:", as.character(min(train_data$date)),
    "to", as.character(max(train_data$date)), "\n")
## Training period: 2011-01-01 to 2012-08-07
cat("Test period: ", as.character(min(test_data$date)),
    "to", as.character(max(test_data$date)), "\n")
## Test period: 2012-08-07 to 2012-12-31
```
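The split sizes follow from the row count alone. The sketch below wraps the same logic in a hypothetical helper, `chrono_split`, and applies it to a stand-in data frame with 17,379 rows; it assumes the rows are sorted in time order and that the training fraction leaves at least one row on each side.

```r
# Hypothetical helper: first `frac` of rows for training, remainder for testing.
# Assumes the rows are already sorted in time order.
chrono_split <- function(df, frac = 0.80) {
  n_train <- floor(frac * nrow(df))
  list(train = df[seq_len(n_train), , drop = FALSE],
       test  = df[-seq_len(n_train), , drop = FALSE])
}

# Stand-in data frame with the same number of rows as the hourly dataset.
toy <- data.frame(instant = 1:17379)
parts <- chrono_split(toy)

cat("Train rows:", nrow(parts$train), "\n")
cat("Test rows: ", nrow(parts$test), "\n")
# No test row precedes any training row.
print(max(parts$train$instant) < min(parts$test$instant))
```

Because the cut falls on a row index rather than a date boundary, the last training hour and the first test hour can share a calendar date.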
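```r
model_formula <- cnt ~ factor(season) + factor(yr) + factor(mnth) +
  factor(hr) + factor(holiday) + factor(weekday) +
  factor(workingday) + factor(weathersit) +
  temp + atemp + hum + windspeed +
  factor(weekend) + factor(part_of_day)
```

The model formula includes all calendar, temporal, and meteorological
predictors. casual and registered are excluded
to prevent data leakage. Both temp and atemp enter the formula despite
their near-perfect correlation, which mainly affects the interpretability
of the individual linear coefficients rather than predictive accuracy.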
Multiple linear regression models cnt as a weighted
linear combination of predictors, serving as an interpretable parametric
baseline. Its key limitation is the assumption of linearity, which the
EDA suggests may not fully hold for this data.
```r
linear_model <- lm(model_formula, data = train_data)
linear_prediction <- predict(linear_model, newdata = test_data)
# Floor negative predictions at zero
linear_prediction[linear_prediction < 0] <- 0
```

Predicted values below zero are floored to zero since negative rental counts are not physically meaningful.
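Some predictors in the formula are exact functions of others (part_of_day is determined by hr, and weekend by weekday), so their dummy columns carry no additional information for a linear model. The toy example below, on simulated data, shows how `lm()` reports such aliased terms as NA coefficients. It is an illustration of the mechanism, not output from the fitted model.

```r
set.seed(3)
# Simulated data: part_of_day is derived from hr, as in the cleaning step.
toy <- data.frame(hr = rep(0:23, times = 20))
toy$part_of_day <- cut(toy$hr, breaks = c(-1, 5, 11, 16, 20, 23),
                       labels = c("Late night", "Morning", "Afternoon",
                                  "Evening", "Night"))
toy$cnt <- 50 + 10 * toy$hr + rnorm(nrow(toy), sd = 5)  # simulated response

fit <- lm(cnt ~ factor(hr) + part_of_day, data = toy)

# Coefficients that lm() could not estimate because of exact collinearity.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```

Predictions are unaffected by aliased terms, but `predict()` may warn about a rank-deficient fit, and the NA coefficients should not be interpreted.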
A regression tree recursively partitions the predictor space into
regions based on splits that most reduce prediction error. Each leaf
predicts the mean cnt of its training observations. Trees
inherently handle nonlinear effects and interactions, making them
well-suited to the demand patterns observed in the EDA.
```r
library(rpart)  # recursive partitioning for regression trees

tree_model <- rpart(model_formula,
                    data = train_data,
                    method = "anova",
                    control = rpart.control(cp = 0.001,
                                            minbucket = 35,
                                            maxdepth = 8))
tree_prediction <- predict(tree_model, newdata = test_data)
```

The complexity parameter cp = 0.001 means a split is kept only if it
improves the overall fit (relative R-squared) by at least 0.001, that is,
0.1% of the root-node error. minbucket = 35 ensures
a minimum of 35 observations per leaf, and maxdepth = 8
limits tree depth to prevent overfitting.
Fitted regression tree. Hour of day appears at the root split, confirming its dominance as the primary demand predictor. Each leaf shows the predicted cnt value and training observation count.
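The effect of the control parameters can be inspected on a fitted tree. The sketch below fits `rpart()` with the same control settings to simulated data (an assumption, since the real training set is not reproduced here) and checks the leaf sizes and node depths recorded in the tree's `frame` table.

```r
library(rpart)

set.seed(11)
# Simulated hourly data (an assumption): a commute-hour effect that only
# applies on working days, i.e. an interaction.
toy <- data.frame(hr = sample(0:23, 3000, replace = TRUE),
                  workingday = rbinom(3000, 1, 0.7))
peak <- toy$hr %in% c(8, 17, 18) & toy$workingday == 1
toy$cnt <- 100 + 300 * peak + rnorm(3000, sd = 30)

fit <- rpart(cnt ~ hr + workingday, data = toy, method = "anova",
             control = rpart.control(cp = 0.001, minbucket = 35, maxdepth = 8))

# Leaf sizes and node depths implied by the fitted tree.
# rpart numbers nodes so that depth = floor(log2(node number)), root = depth 0.
is_leaf    <- fit$frame$var == "<leaf>"
leaf_sizes <- fit$frame$n[is_leaf]
node_depth <- floor(log2(as.numeric(rownames(fit$frame))))

cat("Leaves:", sum(is_leaf), "\n")
cat("Smallest leaf:", min(leaf_sizes), "\n")
cat("Deepest node:", max(node_depth), "\n")
```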
Three metrics evaluate model performance on the held-out test set. RMSE penalizes large errors more heavily. MAE gives the average absolute prediction error and is more robust to outliers. R-squared measures the proportion of variance explained by the model.
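The three metrics can be written as short functions. This is a minimal sketch: the function names are hypothetical, and R-squared is defined here as one minus the ratio of residual to total sum of squares, which is an assumption because the report does not state which definition was used.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
# R-squared as 1 - SSE/SST; this definition is an assumption, since the
# report does not state how the statistic was computed.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Small worked example with made-up numbers.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

cat("RMSE:     ", round(rmse(actual, predicted), 3), "\n")
cat("MAE:      ", mae(actual, predicted), "\n")
cat("R-squared:", r_squared(actual, predicted), "\n")
```

In the worked example the errors are 2, -2, 3, and -3, so the squared errors sum to 26 against a total sum of squares of 500.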
| Model | RMSE | MAE | R-squared |
|---|---|---|---|
| Linear Regression | 133.819 | 98.691 | 0.632 |
| Regression Tree | 100.758 | 70.291 | 0.791 |
The regression tree outperforms linear regression on all three metrics. The improvement in R-squared from 0.632 to 0.791 is substantial, and the lower RMSE (100.8 vs. 133.8) and MAE (70.3 vs. 98.7) confirm that the tree makes smaller prediction errors overall.
Regression tree predicted vs. actual rentals on the test set. Points near the dashed 45-degree line indicate accurate predictions. Scatter at high values reflects moderate underprediction of peak-hour demand.
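Variable importance measures the total reduction in residual sum of squares attributable to each predictor across all splits in the tree. In rpart, the score also credits a variable for splits where it serves as a surrogate, which is why closely related predictors such as part_of_day and atemp score highly alongside hr and temp. Higher scores indicate variables that contribute more to reducing prediction error and are most valuable for demand forecasting.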
| Variable | Importance Score |
|---|---|
| factor(hr) | 215859440 |
| factor(part_of_day) | 134597350 |
| temp | 52783480 |
| atemp | 52267602 |
| factor(mnth) | 41908259 |
| factor(season) | 36479999 |
| factor(yr) | 29757537 |
| factor(workingday) | 27495713 |
| factor(weekday) | 24915130 |
| factor(weekend) | 24784002 |
Top 10 variable importance scores. Hour of day dominates by a wide margin, followed by part of day, temperature (temp and atemp), month, and season.
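Importance scores of this kind are stored on the fitted rpart object as `variable.importance`. The sketch below uses simulated data (an assumption) with one informative predictor, a noisy copy of it, and an unrelated noise variable, to show how a near-duplicate predictor can earn a score through surrogate splits.

```r
library(rpart)

set.seed(5)
# Simulated data (an assumption): x1 drives the response, x1_copy is a noisy
# duplicate of x1, and x2 is unrelated noise.
n   <- 1000
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$x1_copy <- toy$x1 + rnorm(n, sd = 0.02)
toy$y <- 10 * (toy$x1 > 0.5) + rnorm(n)

fit <- rpart(y ~ x1 + x1_copy + x2, data = toy, method = "anova")

# Named numeric vector of scores, sorted from most to least important.
importance <- sort(fit$variable.importance, decreasing = TRUE)
print(round(importance))
```

The scores are in units of sum-of-squares reduction, so they are best read as a ranking or rescaled to percentages rather than compared across datasets.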
The regression tree’s superior performance reflects its ability to capture nonlinear relationships and interactions automatically. For example, the effect of hour of day on demand differs substantially between working and non-working days – an interaction the tree discovers through recursive partitioning but that linear regression requires explicit interaction terms to model. Similarly, the nonlinear temperature-demand relationship identified in the EDA is naturally encoded in the tree’s split structure.
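To illustrate what an explicit interaction term adds to a linear model, the sketch below compares an additive fit with one that interacts hour and working day. The data are simulated (an assumption) so that the commute peak exists only on working days; this is not a refit of the models reported above.

```r
set.seed(21)
# Simulated data (an assumption): a commute peak that exists only on working days.
toy <- expand.grid(hr = 0:23, workingday = 0:1, rep = 1:40)
peak <- toy$hr %in% c(8, 17, 18) & toy$workingday == 1
toy$cnt <- 100 + 300 * peak + rnorm(nrow(toy), sd = 30)

# Additive model: the hour effect is forced to be the same on both day types.
additive <- lm(cnt ~ factor(hr) + factor(workingday), data = toy)
# Interaction model: each hour gets its own effect per day type.
with_interaction <- lm(cnt ~ factor(hr) * factor(workingday), data = toy)

cat("R-squared, additive:   ", round(summary(additive)$r.squared, 3), "\n")
cat("R-squared, interaction:", round(summary(with_interaction)$r.squared, 3), "\n")

# F-test for whether the interaction terms improve the fit.
print(anova(additive, with_interaction))
```

The interaction model needs 23 extra coefficients to express what the tree can represent with a few splits, which is the trade-off described above.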
Hour of day is the single most important predictor, consistent with the sharp commute peaks in the EDA. Temperature, month, season, and year follow as the next most influential factors in the importance ranking. Weather condition, although clearly related to demand in the EDA, does not appear among the top ten scores. The year variable captures strong system-wide ridership growth from 2011 to 2012. Humidity suppresses demand, particularly at high levels. The remaining unexplained variance (~21%) likely reflects local events, station-level constraints, and real-time weather fluctuations not captured in the dataset.
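This study analyzed hourly bike-sharing demand using calendar, temporal, and meteorological variables across 17,379 observations. The key findings are that the regression tree outperforms linear regression on RMSE, MAE, and R-squared; that hour of day is the dominant predictor, with peak demand in the evening commute hours; that registered users account for 81.2% of all rentals; and that demand rises with temperature up to a plateau and declines as weather conditions worsen.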
These results provide practical guidance for fleet deployment and station rebalancing. Future work should explore ensemble methods such as random forests or gradient boosting, incorporate station-level spatial features, and apply temporal models such as ARIMA or LSTM networks to further improve forecasting accuracy.