In this project, I aim to explore and analyze a dataset to predict the number of bikes rented based on various environmental and temporal factors. The dataset includes features such as temperature, humidity, wind speed, visibility, solar radiation, rainfall, snowfall, time of day, season, and whether the day is a holiday.
The analysis begins with comprehensive data visualization techniques, including histograms, scatter plots, boxplots, and correlation heatmaps, to uncover underlying patterns and relationships among variables. Following this, data transformation and partitioning are conducted to prepare the dataset for predictive modeling.
In the predictive analysis phase, I apply and compare three machine learning models: Linear Regression, K-Nearest Neighbors (KNN), and Naïve Bayes. Each is evaluated for its performance and its suitability for estimating the number of bikes rented.
| BikeRented | Hour | Temperature_C | Humidity… | Wind.speed_MS | Visibility_10m | Solar.Radiation_mj | Rainfall_MM | Snowfall_cm | Seasons | Holiday | Holiday.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 254 | 0 | -5.2 | 37 | 2.2 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 204 | 1 | -5.5 | 38 | 0.8 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 173 | 2 | -6.0 | 39 | 1.0 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 107 | 3 | -6.2 | 40 | 0.9 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 78 | 4 | -6.0 | 36 | 2.3 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 100 | 5 | -6.4 | 37 | 1.5 | 2000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
## Training set dimensions: 6240 rows x 12 columns
## Testing set dimensions: 2520 rows x 12 columns
To evaluate the performance of my regression model, I partitioned the dataset into training and testing sets using an approximately 70/30 split: 6,240 rows (about 71%) for training and the remaining 2,520 rows (about 29%) for testing. This approach ensures that the model is trained on a substantial portion of the data while still allowing for an unbiased assessment of its predictive performance on unseen data.
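The partitioning step can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code behind the report; the function name `train_test_split`, the seed, and the use of a plain random shuffle are all assumptions. An exact 70% cut of 8,760 rows gives 6,132 training rows and 2,628 test rows, which differs slightly from the 6,240 / 2,520 sizes reported above, so the original split was presumably produced in a somewhat different way.

```python
import random


def train_test_split(n_rows, train_fraction=0.7, seed=42):
    """Randomly partition the row indices 0..n_rows-1 into train and test sets.

    Hypothetical helper: the name, seed, and shuffle-based approach are
    assumptions made for illustration only.
    """
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)               # random order, so the split is not chronological
    n_train = round(train_fraction * n_rows)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = train_test_split(8760)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, which is what lets the test set act as unseen data.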
This section looks at the plots and summary tables.
The correlation heatmap reveals that temperature and hour of the day are
positively linked to bike rentals, with warmer weather and specific
hours seeing more rentals. In contrast, humidity, rainfall, and snowfall
show negative correlations, indicating that poor weather conditions
discourage bike usage. Solar radiation and visibility are also
positively correlated with rentals.
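Each cell of a correlation heatmap is a Pearson correlation coefficient between two columns. The sketch below shows that calculation in plain Python on the six sample rows from the table at the top of this report. Six rows from one winter night are far too few to reproduce the heatmap's values, so this only illustrates the computation; the function name `pearson` is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# First six rows of the dataset, as shown in the sample table.
temperature_c = [-5.2, -5.5, -6.0, -6.2, -6.0, -6.4]
bike_rented = [254, 204, 173, 107, 78, 100]

r = pearson(temperature_c, bike_rented)
print(f"Temperature vs. rentals on the sample rows: r = {r:.3f}")
```

A value near +1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.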
The summary statistics provide an overview of the bike rental dataset. Each of the 8,760 rows is one hourly record, so the average of 704.6 is the number of bikes rented per hour, and the large standard deviation of 645.0 indicates significant variability. The dataset spans a range of values for several weather and time-related factors, including temperature (mean of 12.88°C, ranging from -17.8°C to 39.4°C), humidity (mean of 58.2%), and wind speed (mean of 1.7 m/s, ranging from 0 to 7.4 m/s). Visibility tends to be high, with a mean of 1,436.8 against a maximum of 2,000 (in the units of the Visibility_10m column). Solar radiation and rainfall are generally low, with averages of 0.57 MJ and 0.15 mm, respectively. Snowfall is minimal, with an average of just 0.075 cm. The dataset also includes a binary holiday indicator, Holiday.1, which the sample rows show is coded 1 for “No Holiday”. Its mean of 0.951 therefore means that about 95% of records fall on non-holidays.
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| BikeRented | 8,760 | 704.602 | 644.997 | 0 | 3,556 |
| Hour | 8,760 | 11.500 | 6.923 | 0 | 23 |
| Temperature_C | 8,760 | 12.883 | 11.945 | -17.800 | 39.400 |
| Humidity… | 8,760 | 58.226 | 20.362 | 0 | 98 |
| Wind.speed_MS | 8,760 | 1.725 | 1.036 | 0.000 | 7.400 |
| Visibility_10m | 8,760 | 1,436.826 | 608.299 | 27 | 2,000 |
| Solar.Radiation_mj | 8,760 | 0.569 | 0.869 | 0.000 | 3.520 |
| Rainfall_MM | 8,760 | 0.149 | 1.128 | 0.000 | 35.000 |
| Snowfall_cm | 8,760 | 0.075 | 0.437 | 0.000 | 8.800 |
| Holiday.1 | 8,760 | 0.951 | 0.217 | 0 | 1 |
| | BikeRented | Hour | Temperature_C | Humidity… | Wind.speed_MS | Visibility_10m | Solar.Radiation_mj | Rainfall_MM | Snowfall_cm | Seasons | Holiday | Holiday.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 254 | 0 | -5.2 | 37 | 2.2 | 2,000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 2 | 204 | 1 | -5.5 | 38 | 0.8 | 2,000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 3 | 173 | 2 | -6 | 39 | 1 | 2,000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 4 | 107 | 3 | -6.2 | 40 | 0.9 | 2,000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| 5 | 78 | 4 | -6 | 36 | 2.3 | 2,000 | 0 | 0 | 0 | Winter | No Holiday | 1 |
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| Temperature (°C) | 8,760 | 12.9 | 11.9 | -17.8 | 39.4 |
| Humidity (%) | 8,760 | 58.2 | 20.4 | 0 | 98 |
| Wind Speed (m/s) | 8,760 | 1.7 | 1.0 | 0.0 | 7.4 |
Columns
## [,1]
## [1,] "BikeRented"
## [2,] "Hour"
## [3,] "Temperature_C"
## [4,] "Humidity..."
## [5,] "Wind.speed_MS"
## [6,] "Visibility_10m"
## [7,] "Solar.Radiation_mj"
## [8,] "Rainfall_MM"
## [9,] "Snowfall_cm"
## [10,] "Seasons"
## [11,] "Holiday"
## [12,] "Holiday.1"
## [1] 8760 12
## BikeRented Hour Temperature_C Humidity... Wind.speed_MS Visibility_10m
## 1 254 0 -5.2 37 2.2 2000
## 2 204 1 -5.5 38 0.8 2000
## 3 173 2 -6.0 39 1.0 2000
## Solar.Radiation_mj Rainfall_MM Snowfall_cm Seasons Holiday Holiday.1
## 1 0 0 0 Winter No Holiday 1
## 2 0 0 0 Winter No Holiday 1
## 3 0 0 0 Winter No Holiday 1
| Statistic | Value |
|---|---|
| ME | -15.1 |
| RMSE | 462.6 |
| MAE | 340.8 |
| MPE | NaN |
| MAPE | Inf |
| Metric | Value |
|---|---|
| ME | -5.5 |
| RMSE | 461.4 |
| MAE | 339.6 |
| MPE | NaN |
| MAPE | Inf |
The root mean squared error (RMSE) is 461, and the mean absolute error (MAE) is 340, which means the model’s predictions typically deviate from the actual values by roughly 340 to 460 bikes. Against an average of about 705 rentals per hour and a standard deviation of 645, this is a moderate to large prediction error. The mean percentage error (MPE) and mean absolute percentage error (MAPE) could not be calculated properly, returning NaN and Inf. This likely happened because some actual bike rental values in the test set were zero, making percentage-based errors undefined or infinite. The mean error (ME) of -5.5 is small relative to the scale of rentals, so the model’s predictions are close to unbiased on average, but the error margins around individual predictions are relatively large.
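The five metrics, and the way zero rental counts break the percentage-based ones, can be illustrated with a short Python sketch. It assumes the error is defined as actual minus predicted and that division by zero follows floating-point rules (as it does in R); the helper names and the four data points are hypothetical and chosen only to trigger the NaN and Inf cases.

```python
import math


def safe_ratio(num, den):
    """Divide the way IEEE floating point does: x/0 is +/-inf and 0/0 is nan."""
    if den != 0:
        return num / den
    if num == 0:
        return math.nan
    return math.copysign(math.inf, num)


def mean(values):
    """Plain running-sum mean, so inf and nan propagate as in floating point."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def error_metrics(actual, predicted):
    """ME, RMSE, MAE, MPE and MAPE, with error = actual - predicted (assumed)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    pct_errors = [100 * safe_ratio(e, a) for e, a in zip(errors, actual)]
    return {
        "ME": mean(errors),
        "RMSE": math.sqrt(mean([e * e for e in errors])),
        "MAE": mean([abs(e) for e in errors]),
        "MPE": mean(pct_errors),
        "MAPE": mean([abs(p) for p in pct_errors]),
    }


# Hypothetical values: two hours with zero rentals, one over- and one under-predicted.
actual = [0, 0, 100, 200]
predicted = [10, -5, 110, 190]
metrics = error_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```

With zeros in the actual values, the percentage errors contain both +Inf and -Inf. Their average is NaN (the MPE), while the average of their absolute values is Inf (the MAPE), matching the pattern in the tables above. ME, RMSE and MAE are unaffected because they never divide by the actual value.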
## k-Nearest Neighbors
##
## 8760 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 7884, 7883, 7884, 7884, 7884, 7884, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 437 0.545 299
## 7 432 0.553 297
## 9 432 0.553 298
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
## [1] 143 128 131
For each prediction, the model found the k = 9 closest points in the training data (based on features like temperature, humidity, and hour) and averaged their bike rental counts to produce the estimates shown above.
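The averaging step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fitted model: it uses only two features (hour and temperature) from the six sample rows shown at the top of the report, k = 3 instead of 9 because there are only six rows, unscaled Euclidean distance, and a hypothetical query point.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


# (Hour, Temperature_C) and BikeRented for the six sample rows.
train_x = [(0, -5.2), (1, -5.5), (2, -6.0), (3, -6.2), (4, -6.0), (5, -6.4)]
train_y = [254, 204, 173, 107, 78, 100]

# Hypothetical query: 1 a.m. at -5.6 degrees C.
prediction = knn_predict(train_x, train_y, query=(1, -5.6), k=3)
print(round(prediction, 1))
```

Because the distance is computed on raw values, features with large numeric ranges (such as visibility) would dominate the neighbor search unless the data is scaled first. The output above reports “No pre-processing”, so this may have affected the fitted model.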
## Reference
## Prediction High Low
## High 1121 191
## Low 193 1123
##
## **Accuracy of the model:** 85.4 %
The confusion matrix gives an accuracy of 85.4%: 1,121 + 1,123 = 2,244 correct predictions out of 2,628 test records. This means the model correctly classifies whether bike rentals are “High” or “Low” about 85% of the time, which is quite solid for a basic classifier like k-NN.
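The accuracy figure can be checked directly from the four cells of the confusion matrix. The short Python sketch below does that arithmetic; the dictionary layout is only an illustrative choice.

```python
# Confusion matrix cells keyed as (prediction, reference), copied from the output above.
confusion = {
    ("High", "High"): 1121,
    ("High", "Low"): 191,
    ("Low", "High"): 193,
    ("Low", "Low"): 1123,
}

# Correct predictions sit on the diagonal, where prediction equals reference.
correct = sum(n for (pred, ref), n in confusion.items() if pred == ref)
total = sum(confusion.values())
accuracy = correct / total

print(f"{correct} of {total} correct: {accuracy:.1%}")
```

The two kinds of error are almost equal (191 versus 193), so the classifier is not noticeably biased toward either class.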
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## High Low
## 0.5 0.5
##
## Conditional probabilities:
## Hour
## Y [,1] [,2]
## High 13.67 6.23
## Low 9.33 6.90
##
## Temperature_C
## Y [,1] [,2]
## High 19.7 8.81
## Low 6.1 10.76
##
## Humidity...
## Y [,1] [,2]
## High 54.5 17.5
## Low 61.9 22.2
##
## Wind.speed_MS
## Y [,1] [,2]
## High 1.82 0.972
## Low 1.63 1.089
##
## Visibility_10m
## Y [,1] [,2]
## High 1538 533
## Low 1335 660
##
## Solar.Radiation_mj
## Y [,1] [,2]
## High 0.901 1.013
## Low 0.237 0.514
##
## Rainfall_MM
## Y [,1] [,2]
## High 0.0149 0.362
## Low 0.2825 1.542
##
## Snowfall_cm
## Y [,1] [,2]
## High 0.00354 0.0679
## Low 0.14660 0.6056
##
## Seasons
## Y Autumn Spring Summer Winter
## High 0.3153 0.2774 0.3817 0.0256
## Low 0.1833 0.2267 0.1224 0.4676
##
## Holiday
## Y Holiday No Holiday
## High 0.0315 0.9685
## Low 0.0671 0.9329
##
## Holiday.1
## Y [,1] [,2]
## High 0.968 0.175
## Low 0.933 0.250
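This output provides details about a Naive Bayes classifier for predicting bike rentals (“High” or “Low”) based on various features. The a-priori probabilities show that the two classes are equally likely, at 50% each. For each numeric predictor, the two columns give the class-conditional mean and standard deviation; for example, High-rental hours have a mean temperature of 19.7°C compared with 6.1°C for Low-rental hours. For categorical predictors, the table gives the share of each category within a class: only 2.6% of High records fall in winter, against 46.8% of Low records.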
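To show how these numbers turn into a prediction, here is a minimal Gaussian Naive Bayes sketch in Python. It uses only two of the predictors (Hour and Temperature_C) with the means and standard deviations printed above, so its output will not match the full model; the observation (6 p.m. at 25°C) and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Priors and (mean, sd) pairs copied from the model output above.
PRIORS = {"High": 0.5, "Low": 0.5}
PARAMS = {
    "Hour": {"High": (13.67, 6.23), "Low": (9.33, 6.90)},
    "Temperature_C": {"High": (19.7, 8.81), "Low": (6.1, 10.76)},
}


def posterior(observation):
    """Posterior class probabilities, treating the features as independent."""
    scores = {}
    for label, prior in PRIORS.items():
        score = prior
        for feature, value in observation.items():
            mean, sd = PARAMS[feature][label]
            score *= normal_pdf(value, mean, sd)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical observation: 6 p.m. on a 25 degree C day.
post = posterior({"Hour": 18, "Temperature_C": 25.0})
print({label: round(p, 3) for label, p in post.items()})
```

The “naive” part is the multiplication of per-feature densities, which assumes the features are independent within each class. That assumption is questionable here, since hour, temperature, and solar radiation are clearly related.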
| Prediction | Reference | Freq |
|---|---|---|
| High | High | 4163 |
| Low | High | 217 |
| High | Low | 2092 |
| Low | Low | 2288 |
4163 true positives (correctly predicted “High”)
2288 true negatives (correctly predicted “Low”)
2092 false positives (predicted “High” but it was actually “Low”)
217 false negatives (predicted “Low” but it was actually “High”)
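The model is very good at identifying High bike rentals (sensitivity of 95%), but it struggles more with identifying Low rentals (specificity of 52.2%). It has moderate precision: about two thirds of predicted Highs are actually High. Overall accuracy is 73.6%, well above the 50% expected from guessing between two equally likely classes. Note that the four counts sum to 8,760, the size of the full dataset rather than of the test set.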
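These rates follow directly from the four counts listed above, treating “High” as the positive class. A short Python check of the arithmetic:

```python
# Counts from the Naive Bayes confusion matrix, with "High" as the positive class.
tp, fn = 4163, 217    # actual High: predicted High / predicted Low
fp, tn = 2092, 2288   # actual Low:  predicted High / predicted Low

sensitivity = tp / (tp + fn)                 # share of actual Highs that were caught
specificity = tn / (tn + fp)                 # share of actual Lows that were caught
precision = tp / (tp + fp)                   # share of predicted Highs that were right
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that were right

for name, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision", precision),
    ("Accuracy", accuracy),
]:
    print(f"{name}: {value:.1%}")
```

The gap between sensitivity and specificity shows that the model leans toward predicting “High”: it labels 6,255 of the 8,760 records as High even though only 4,380 actually are.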
In this analysis, I explored bike rental behavior using visualizations, linear regression, KNN, and Naive Bayes. Visualizations highlighted trends like higher rentals with moderate temperatures and solar radiation. The linear regression model helped quantify the impact of variables such as temperature, with an RMSE of about 461 and an MAE of about 340 bikes. For KNN, cross-validation selected k = 9, and the High/Low classification reached an accuracy of 85.4%. Finally, the Naive Bayes classifier separated high from low rental periods with 73.6% accuracy, with hour, season, and humidity among the features that differ between the two classes. Together, these methods provided useful insight into the factors influencing bike rentals.