Introduction

In this project, I aim to explore and analyze a dataset to predict the number of bikes rented based on various environmental and temporal factors. The dataset includes features such as temperature, humidity, wind speed, visibility, solar radiation, rainfall, snowfall, time of day, season, and whether the day is a holiday.

The analysis begins with comprehensive data visualization techniques, including histograms, scatter plots, boxplots, and correlation heatmaps, to uncover underlying patterns and relationships among variables. Following this, data transformation and partitioning are conducted to prepare the dataset for predictive modeling.

In the predictive analysis phase, I apply and compare three machine learning models, namely Linear Regression, K-Nearest Neighbors (KNN), and Naive Bayes, to estimate bike rentals either as a count or as a High/Low class, and I evaluate each model’s performance and suitability for the task.


First 6 Rows of Bike Dataset

BikeRented  Hour  Temperature_C  Humidity…  Wind.speed_MS  Visibility_10m  Solar.Radiation_mj  Rainfall_MM  Snowfall_cm  Seasons  Holiday     Holiday.1
254         0     -5.2           37         2.2            2000            0                   0            0            Winter   No Holiday  1
204         1     -5.5           38         0.8            2000            0                   0            0            Winter   No Holiday  1
173         2     -6.0           39         1.0            2000            0                   0            0            Winter   No Holiday  1
107         3     -6.2           40         0.9            2000            0                   0            0            Winter   No Holiday  1
78          4     -6.0           36         2.3            2000            0                   0            0            Winter   No Holiday  1
100         5     -6.4           37         1.5            2000            0                   0            0            Winter   No Holiday  1

Partition

## Training set dimensions: 6240 rows x 12 columns
## Testing set dimensions: 2520 rows x 12 columns

To evaluate the regression model, I partitioned the dataset into training and testing sets using a nominal 70/30 split: roughly 70% of the data trains the model and the remaining 30% is reserved for testing. The resulting sets contain 6,240 and 2,520 rows, which is about 71% and 29% of the 8,760 observations. This approach trains the model on a substantial portion of the data while still allowing an unbiased assessment of its predictive performance on unseen data.
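
A split of this kind can be sketched in base R as follows. This is a minimal illustration, not the code used for this report: the data frame `bike` is a synthetic stand-in, and the seed and the use of `sample()` are assumptions. A plain 70% draw from 8,760 rows yields 6,132 training rows, so the dimensions reported above suggest the original partition was produced somewhat differently.

```r
# Synthetic stand-in for the bike data (hypothetical values, same row count)
set.seed(42)  # assumed seed, only to make this sketch reproducible
n <- 8760
bike <- data.frame(
  BikeRented    = rpois(n, lambda = 700),
  Hour          = rep(0:23, times = n / 24),
  Temperature_C = round(rnorm(n, mean = 12.9, sd = 11.9), 1)
)

# Draw 70% of the row indices at random for training; the rest is for testing
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- bike[train_idx, ]
test_set  <- bike[-train_idx, ]

cat("Training set dimensions:", nrow(train_set), "rows x", ncol(train_set), "columns\n")
cat("Testing set dimensions:", nrow(test_set), "rows x", ncol(test_set), "columns\n")
```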

Visualization

This section examines the data through boxplots, histograms, scatterplots, and a correlation heatmap.

BOXPLOT

HISTOGRAMS

SCATTERPLOTS

The correlation heatmap shows that temperature and hour of the day are positively correlated with bike rentals: warmer weather and later hours tend to see more rentals. In contrast, humidity, rainfall, and snowfall show negative correlations, which suggests that poor weather conditions discourage bike usage. Solar radiation and visibility are also positively correlated with rentals.
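
The coefficients behind such a heatmap can be computed with base R's `cor()`. The sketch below uses a small synthetic data frame (hypothetical values, not the bike data) in which rentals are constructed to rise with temperature and fall with humidity, so the signs of the coefficients mirror the pattern described above.

```r
# Synthetic data: rentals increase with temperature and decrease with humidity
set.seed(1)
n <- 500
temperature <- rnorm(n, mean = 13, sd = 12)
humidity    <- runif(n, min = 20, max = 98)
rented      <- 700 + 25 * temperature - 5 * humidity + rnorm(n, sd = 200)

toy <- data.frame(
  BikeRented    = rented,
  Temperature_C = temperature,
  Humidity      = humidity
)

# Pairwise Pearson correlations; a heatmap is a coloured view of this matrix
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Correlation of each predictor with the response
cor_matrix["BikeRented", c("Temperature_C", "Humidity")]
```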

StarGazer


The summary statistics provide an overview of the bike rental dataset. The data are hourly (the Hour column runs from 0 to 23), so the average of 704.6 is rentals per hour rather than per day; the large standard deviation of 644.997 indicates significant variability. The dataset spans a range of values for several weather and time-related factors, including temperature (mean of 12.88°C, ranging from -17.8°C to 39.4°C), humidity (mean of 58.2%), and wind speed (mean of 1.7 m/s, ranging from 0 to 7.4 m/s). Visibility tends to be high, with a mean of 1,436.8 against a maximum of 2,000 (the column name Visibility_10m suggests units of 10 m). Solar radiation and rainfall are generally low, with averages of 0.57 MJ and 0.15 mm, respectively. Snowfall is minimal, with an average of just 0.075 cm. The dataset also includes a binary indicator, Holiday.1. In the sample rows, “No Holiday” is coded as 1, so the mean of 0.951 indicates that about 95% of observations fall on non-holidays.

Summary Statistics

Statistic           N      Mean       St. Dev.  Min      Max
BikeRented          8,760  704.602    644.997   0        3,556
Hour                8,760  11.500     6.923     0        23
Temperature_C       8,760  12.883     11.945    -17.800  39.400
Humidity…           8,760  58.226     20.362    0        98
Wind.speed_MS       8,760  1.725      1.036     0.000    7.400
Visibility_10m      8,760  1,436.826  608.299   27       2,000
Solar.Radiation_mj  8,760  0.569      0.869     0.000    3.520
Rainfall_MM         8,760  0.149      1.128     0.000    35.000
Snowfall_cm         8,760  0.075      0.437     0.000    8.800
Holiday.1           8,760  0.951      0.217     0        1

Bike Data (First 5 Rows)

   BikeRented  Hour  Temperature_C  Humidity…  Wind.speed_MS  Visibility_10m  Solar.Radiation_mj  Rainfall_MM  Snowfall_cm  Seasons  Holiday     Holiday.1
1  254         0     -5.2           37         2.2            2,000           0                   0            0            Winter   No Holiday  1
2  204         1     -5.5           38         0.8            2,000           0                   0            0            Winter   No Holiday  1
3  173         2     -6             39         1              2,000           0                   0            0            Winter   No Holiday  1
4  107         3     -6.2           40         0.9            2,000           0                   0            0            Winter   No Holiday  1
5  78          4     -6             36         2.3            2,000           0                   0            0            Winter   No Holiday  1

Descriptive Statistics

Statistic         N      Mean  St. Dev.  Min    Max
Temperature (°C)  8,760  12.9  11.9      -17.8  39.4
Humidity (%)      8,760  58.2  20.4      0      98
Wind Speed (m/s)  8,760  1.7   1.0       0.0    7.4
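
How the `Holiday.1` indicator is coded can be checked by cross-tabulating it against the text column `Holiday`. The rows below are a hypothetical stand-in: the "No Holiday" rows coded 1 match the sample rows shown in this report, while the "Holiday" row coded 0 is an assumption.

```r
# Hypothetical rows; only the "No Holiday" = 1 pairing appears in the report
toy <- data.frame(
  Holiday   = c("No Holiday", "No Holiday", "No Holiday", "Holiday"),
  Holiday.1 = c(1, 1, 1, 0)
)

# Rows are the text labels, columns are the numeric codes
coding <- table(toy$Holiday, toy$Holiday.1)
print(coding)

# The mean of a 0/1 indicator is the share of rows coded 1
mean(toy$Holiday.1)
```

On the full data, a mean of 0.951 for an indicator coded this way would mean that about 95% of the hourly records fall on non-holidays.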

Linear Regression

Columns

##       [,1]                
##  [1,] "BikeRented"        
##  [2,] "Hour"              
##  [3,] "Temperature_C"     
##  [4,] "Humidity..."       
##  [5,] "Wind.speed_MS"     
##  [6,] "Visibility_10m"    
##  [7,] "Solar.Radiation_mj"
##  [8,] "Rainfall_MM"       
##  [9,] "Snowfall_cm"       
## [10,] "Seasons"           
## [11,] "Holiday"           
## [12,] "Holiday.1"
## [1] 8760   12
##   BikeRented Hour Temperature_C Humidity... Wind.speed_MS Visibility_10m
## 1        254    0          -5.2          37           2.2           2000
## 2        204    1          -5.5          38           0.8           2000
## 3        173    2          -6.0          39           1.0           2000
##   Solar.Radiation_mj Rainfall_MM Snowfall_cm Seasons    Holiday Holiday.1
## 1                  0           0           0  Winter No Holiday         1
## 2                  0           0           0  Winter No Holiday         1
## 3                  0           0           0  Winter No Holiday         1
Model Accuracy Metrics

Statistic  Value
ME         -15.1
RMSE       462.6
MAE        340.8
MPE        NaN
MAPE       Inf

Model Accuracy

Metric  Value
ME      -5.5
RMSE    461.4
MAE     339.6
MPE     NaN
MAPE    Inf

The root mean squared error (RMSE) is 461 and the mean absolute error (MAE) is 340. On average, the model’s predictions miss the actual value by about 340 bikes, and the higher RMSE shows that some errors are considerably larger. Against a mean of about 705 rentals and a standard deviation of about 645, this is a moderate level of prediction error. The mean percentage error (MPE) and mean absolute percentage error (MAPE) could not be calculated, returning NaN and Inf. This likely happened because some actual bike rental values in the test set were zero, which makes percentage-based errors undefined or infinite. The mean error (ME) of -5.5 is small relative to the RMSE, so the model provides reasonably unbiased predictions despite the relatively large error margins.
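
The effect of zero actual values on the percentage-based metrics can be reproduced with a few lines of base R. This is a sketch on hypothetical numbers, and it assumes the error is defined as actual minus predicted; the report does not show which function produced its metrics.

```r
# Hypothetical actual and predicted hourly rentals; two actual values are zero
actual    <- c(0, 0, 120, 300)
predicted <- c(10, -5, 100, 330)

err <- actual - predicted   # assumed sign convention: actual minus predicted
pct <- 100 * err / actual   # dividing by a zero actual gives -Inf and Inf here

metrics <- c(
  ME   = mean(err),
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  MPE  = mean(pct),         # -Inf and Inf in the same average give NaN
  MAPE = mean(abs(pct))     # a single Inf makes the average Inf
)
print(metrics)
```

ME, RMSE, and MAE stay finite because they do not divide by the actual value. An MPE of NaN rather than plus or minus Inf is consistent with zero-rental hours being both over- and under-predicted.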

KNN

## k-Nearest Neighbors 
## 
## 8760 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 7884, 7883, 7884, 7884, 7884, 7884, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE  Rsquared  MAE
##   5  437   0.545     299
##   7  432   0.553     297
##   9  432   0.553     298
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

## [1] 143 128 131

The three values above are predicted rental counts. For each one, the model found the k = 9 closest points in the training data (based on features such as temperature, humidity, and hour) and averaged their bike rental counts to produce the estimate.
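
The averaging step can be sketched in base R. This is an illustration under stated assumptions rather than the fitted model's internals: the function `knn_predict`, the toy data, and the choice of Euclidean distance on unscaled features are all hypothetical.

```r
# Average the response of the k nearest training rows (Euclidean distance, assumed)
knn_predict <- function(train_x, train_y, new_x, k) {
  # Subtract the new point from every training row, column by column
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]   # indices of the k smallest distances
  mean(train_y[nearest])
}

# Hypothetical training data: five hourly observations
train_x <- data.frame(
  Hour          = c(8, 9, 17, 18, 3),
  Temperature_C = c(10, 12, 20, 21, 2)
)
train_y <- c(400, 450, 900, 950, 50)

# Predict rentals for a new hour from its 2 nearest neighbours
new_x <- c(Hour = 17.5, Temperature_C = 20.5)
knn_predict(train_x, train_y, new_x, k = 2)
```

With unscaled features, as in the "No pre-processing" run above, variables with large ranges (visibility goes up to 2,000) dominate a Euclidean distance, so centering and scaling the predictors is worth trying.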

##           Reference
## Prediction High  Low
##       High 1121  191
##       Low   193 1123
## 
## **Accuracy of the model:** 85.4 %

The confusion matrix gives an accuracy of 85.4%: 2,244 of the 2,628 test observations are classified correctly. In other words, the model correctly classifies whether bike rentals are “High” or “Low” about 85% of the time, which is quite solid for a basic classifier like k-NN.

Naive Bayes

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## High  Low 
##  0.5  0.5 
## 
## Conditional probabilities:
##       Hour
## Y       [,1] [,2]
##   High 13.67 6.23
##   Low   9.33 6.90
## 
##       Temperature_C
## Y      [,1]  [,2]
##   High 19.7  8.81
##   Low   6.1 10.76
## 
##       Humidity...
## Y      [,1] [,2]
##   High 54.5 17.5
##   Low  61.9 22.2
## 
##       Wind.speed_MS
## Y      [,1]  [,2]
##   High 1.82 0.972
##   Low  1.63 1.089
## 
##       Visibility_10m
## Y      [,1] [,2]
##   High 1538  533
##   Low  1335  660
## 
##       Solar.Radiation_mj
## Y       [,1]  [,2]
##   High 0.901 1.013
##   Low  0.237 0.514
## 
##       Rainfall_MM
## Y        [,1]  [,2]
##   High 0.0149 0.362
##   Low  0.2825 1.542
## 
##       Snowfall_cm
## Y         [,1]   [,2]
##   High 0.00354 0.0679
##   Low  0.14660 0.6056
## 
##       Seasons
## Y      Autumn Spring Summer Winter
##   High 0.3153 0.2774 0.3817 0.0256
##   Low  0.1833 0.2267 0.1224 0.4676
## 
##       Holiday
## Y      Holiday No Holiday
##   High  0.0315     0.9685
##   Low   0.0671     0.9329
## 
##       Holiday.1
## Y       [,1]  [,2]
##   High 0.968 0.175
##   Low  0.933 0.250

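This output describes a Naive Bayes classifier that predicts bike rentals as “High” or “Low” from the features. The a-priori probabilities show that the two classes are equally likely, at 50% each. For each numeric predictor, the first column is the class mean and the second is the class standard deviation; for the categorical predictors (Seasons and Holiday), the entries are conditional probabilities given the class. For example, the mean temperature is 19.7°C for High and 6.1°C for Low, and only about 2.6% of High observations fall in winter.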
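
To show how these class means and standard deviations turn into a prediction, the sketch below applies Bayes' rule to temperature alone, using the Temperature_C values and priors from the output above. Using a single predictor is a simplification for illustration, and the function name `posterior_from_temp` is hypothetical; the fitted model multiplies the likelihoods of all predictors.

```r
# Class means and standard deviations for Temperature_C, taken from the output above
temp_mean <- c(High = 19.7, Low = 6.1)
temp_sd   <- c(High = 8.81, Low = 10.76)
prior     <- c(High = 0.5, Low = 0.5)

# Posterior class probabilities using temperature alone (illustrative only)
posterior_from_temp <- function(temp_c) {
  # Gaussian density of the observed temperature under each class
  likelihood <- dnorm(temp_c, mean = temp_mean, sd = temp_sd)
  unnormalised <- prior * likelihood
  unnormalised / sum(unnormalised)
}

posterior_from_temp(25)   # a warm hour: "High" is the more probable class
posterior_from_temp(-5)   # a cold hour: "Low" is the more probable class
```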


Confusion Matrix

Prediction  Reference  Freq
High        High       4163
Low         High       217
High        Low        2092
Low         Low        2288

4163 true positives (correctly predicted “High”)

2288 true negatives (correctly predicted “Low”)

2092 false positives (predicted “High” but it was actually “Low”)

217 false negatives (predicted “Low” but it was actually “High”)

The model is very good at identifying High bike rentals (sensitivity 95.0%), but it struggles with identifying Low rentals (specificity 52.2%). It has moderate precision: about two thirds of predicted Highs are actually High. Overall accuracy is 73.6%. The four counts sum to 8,760, so this matrix covers the full dataset rather than only a held-out test set.
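
These figures follow directly from the four counts in the confusion matrix. A short check in base R, treating “High” as the positive class (variable names are illustrative):

```r
# Counts from the Naive Bayes confusion matrix, with "High" as the positive class
tp <- 4163   # predicted High, actually High
fn <- 217    # predicted Low,  actually High
fp <- 2092   # predicted High, actually Low
tn <- 2288   # predicted Low,  actually Low

nb_metrics <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp),
  accuracy    = (tp + tn) / (tp + fn + fp + tn)
)

# Percentages rounded to one decimal place: 95.0, 52.2, 66.6, 73.6
round(100 * nb_metrics, 1)
```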

Conclusion

In this analysis, I explored bike rental behavior using visualizations, linear regression, KNN, and Naive Bayes. Visualizations highlighted trends such as higher rentals with warmer temperatures and more solar radiation. Linear regression helped quantify the impact of variables such as temperature, with performance evaluated using RMSE (about 461) and MAE (about 340). KNN was used in two ways: as a regression model, where cross-validation selected k = 9, and as a High/Low classifier, which reached 85.4% accuracy. Finally, the Naive Bayes classifier predicted high or low rental periods with 73.6% accuracy, pointing to factors such as hour, season, and humidity. These methods provided insight into the factors influencing bike rentals.