Bike sharing has become a popular transportation option in many
cities around the world. With increasing environmental awareness and the
need for sustainable transportation options, bike sharing systems have
seen significant growth. However, for these systems to operate
efficiently, it is crucial to predict the demand for bikes at different
stations and locations.
The goal of this project is to develop a predictive model that can
estimate bike sharing demand based on various factors such as weather,
time of day, day of the week, and special events. Using data analytics
and machine learning techniques, the project aims to provide a tool that
helps bike sharing system operators optimize bike distribution and
availability, thereby improving user experience and operational
efficiency.
1. Libraries
library(tidymodels) # For modeling and machine learning
library(tidyverse) # Share common data representations and 'API' design
library(stringr) # Consistent wrapper for common string operations
library(readr) # Read rectangular text data
library(broom) # Convert statistical objects into tidy tibbles
library(dplyr) # A grammar of data manipulation
library(yardstick) # Tidy characterization of model performance
library(glmnet) # Lasso and Elastic Net
library(kableExtra) # Construct complex tables
2. Database
The database contains detailed weather information, including
temperature, humidity, wind speed, visibility, dew point, solar
radiation, snowfall, and rainfall. Additionally, it records the number
of bikes rented per hour and date information from the Seoul
bike-sharing system.
Technical Analysis: The weather variables such as
temperature, humidity, wind speed, visibility, dew point, solar
radiation, snowfall, and rainfall are crucial as they can significantly
influence the demand for bike rentals. For instance, temperature affects
user comfort, while humidity impacts the perception of heat. Wind speed
can make biking easier or harder, and visibility is important for
cyclist safety. Dew point is an indicator of humidity and thermal
comfort, and solar radiation can influence the decision to rent bikes.
Snowfall and rainfall are critical factors that can reduce bike rental
demand.
The bike rental data, specifically the number of bikes rented per
hour, serves as the key dependent variable for regression analysis. The
date information allows for the examination of temporal and seasonal
patterns in bike usage.
Project Objective: The goal is to use the weather
and temporal variables to predict the number of bikes rented per hour.
This can help optimize the management of the bike-sharing system,
anticipate demand, and improve user experience.
2.1 Variables
The seoul_bike_sharing_converted_normalized.csv file is
our main dataset. It has the following variables:
The response variable:
RENTED_BIKE_COUNT - Count of bikes rented at each
hour
Weather predictor variables:
TEMPERATURE - Temperature in Celsius
HUMIDITY - Unit is %
WIND_SPEED - Unit is m/s
VISIBILITY - Unit is 10 m
DEW_POINT_TEMPERATURE - The temperature to which the
air would have to cool down in order to reach saturation, unit is
Celsius
SOLAR_RADIATION - MJ/m2
RAINFALL - mm
SNOWFALL - cm
Date/time predictor variables:
DATE - Year-month-day
HOUR - Hour of the day
FUNCTIONING_DAY - NoFunc (non-functional hours),
Fun (functional hours)
HOLIDAY - Holiday/No holiday
SEASONS - Winter, Spring, Summer, Autumn
2.2 Load database
seoul_bike_sharing_converted_normalized <- read_csv("Bases limpias/seoul_bike_sharing_converted_normalized.csv")
2.3 Convert into a data frame
bike_sharing_df <- seoul_bike_sharing_converted_normalized %>%
select(-DATE, -FUNCTIONING_DAY_YES,-FUNCTIONING_DAY_NO)
We will not use the DATE column in its current
form, as it essentially functions as a data entry index. With
additional time, we could transform the DATE column to
derive new features such as ‘day of the week’ or ‘isWeekend’, which
might influence bike rental preferences. The
FUNCTIONING_DAY columns are also dropped because, after
processing missing values, they only contain a single distinct value
(YES).
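The date-derived features mentioned above could be built along the following lines. This is only a sketch: the example dates are hypothetical (not taken from the dataset), and the column names DAY_OF_WEEK and IS_WEEKEND are illustrative assumptions.

```r
# Hypothetical example dates in year-month-day form (not from the dataset)
dates <- as.Date(c("2018-06-01", "2018-06-02", "2018-06-03", "2018-06-04"))

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday_index <- as.POSIXlt(dates)$wday

# Assumed feature names, for illustration only
date_features <- data.frame(
  DATE = dates,
  DAY_OF_WEEK = wday_index,
  IS_WEEKEND = as.integer(wday_index %in% c(0, 6))
)
print(date_features)
```

Using the numeric weekday index rather than weekdays() avoids depending on the system locale for day names.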
3. Split training and testing data
bike_split <- initial_split(bike_sharing_df, prop = 3/4)
train_data <- training(bike_split)
test_data <- testing(bike_split)
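Because the split is random, the fitted coefficients and metrics change from run to run unless the random seed is fixed. The sketch below shows the idea in base R on a toy data frame; the seed value and the toy data are arbitrary assumptions. With rsample, calling set.seed() immediately before initial_split() should have the same effect, since the split draws from R's random number generator.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Toy stand-in for bike_sharing_df (hypothetical data)
toy_df <- data.frame(id = 1:100, y = rnorm(100))

# Draw 75% of the row indices for training; the remaining rows form the test set
train_idx <- sample(nrow(toy_df), size = floor(0.75 * nrow(toy_df)))
toy_train <- toy_df[train_idx, ]
toy_test <- toy_df[-train_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```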
3.1 Build a linear regression model using weather variables only
Weather conditions are likely to influence individuals’ decisions
regarding bike rentals. For instance, adverse weather such as cold and
rainy conditions may lead people to opt for alternative modes of
transportation like buses or taxis. Conversely, favorable weather, such
as sunny days, may increase the propensity to rent bikes for
short-distance travel.
# Pick linear regression
lm_spec <- linear_reg() %>%
# Set the engine
set_engine(engine = "lm")
# Print the model specification
lm_spec
Linear Regression Model Specification (regression)
Computational engine: lm
# To fit the model
lm_model_weather <- lm_spec %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
Print the fit summary for the lm_model_weather model.
# Create the table with the regression results
summary(lm_model_weather$fit)
Call:
stats::lm(formula = RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY +
WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION +
RAINFALL + SNOWFALL, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.38129 -0.08315 -0.01554 0.05840 0.65491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.046472 0.017060 2.724 0.00647 **
TEMPERATURE 0.631318 0.077664 8.129 0.000000000000000517 ***
HUMIDITY -0.276702 0.037778 -7.324 0.000000000000269685 ***
WIND_SPEED 0.105148 0.013491 7.794 0.000000000000007544 ***
VISIBILITY 0.006179 0.006982 0.885 0.37615
DEW_POINT_TEMPERATURE -0.036681 0.082867 -0.443 0.65804
SOLAR_RADIATION -0.120406 0.009805 -12.281 < 0.0000000000000002 ***
RAINFALL -0.582041 0.057417 -10.137 < 0.0000000000000002 ***
SNOWFALL 0.099439 0.037289 2.667 0.00768 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1368 on 6339 degrees of freedom
Multiple R-squared: 0.4385, Adjusted R-squared: 0.4378
F-statistic: 618.7 on 8 and 6339 DF, p-value: < 0.00000000000000022
The regression analysis aims to predict
RENTED_BIKE_COUNT using several independent variables:
TEMPERATURE, HUMIDITY,
WIND_SPEED, VISIBILITY,
DEW_POINT_TEMPERATURE, SOLAR_RADIATION,
RAINFALL, and SNOWFALL. The Intercept has an
estimate of 0.046472, the predicted level of bike
rentals when all predictors are zero.
TEMPERATURE has a positive coefficient of 0.631318,
suggesting that as the temperature increases, the number of
rented bikes also increases. This relationship is highly significant,
with a p-value of about 5.17e-16. HUMIDITY has a negative
coefficient of -0.276702, indicating that higher humidity
levels are associated with fewer bike rentals, and this effect is also
highly significant. WIND_SPEED has a positive coefficient of
0.105148 and is highly significant as well.
VISIBILITY has a very small positive coefficient
(0.006179), but the effect is not statistically significant
(p = 0.376). DEW_POINT_TEMPERATURE (-0.036681, p = 0.658) is
not significant either, possibly because dew point is closely tied to
temperature and humidity, which are already in the model.
SOLAR_RADIATION has a negative coefficient of
-0.120406 and RAINFALL a negative coefficient of -0.582041.
Both are highly significant, so, holding the other variables fixed,
higher values are associated with fewer rentals.
SNOWFALL has a small positive coefficient (0.099439,
p = 0.008). This is counterintuitive and should be interpreted with
caution, since each coefficient is conditional on the other predictors.
The residual standard error is 0.1368,
roughly the typical distance between the observed values and the
regression line. The multiple R-squared is 0.4385 (adjusted 0.4378),
so the weather variables explain about 44% of the variance in
RENTED_BIKE_COUNT. The F-statistic (618.7 on 8 and 6339 degrees of
freedom, p-value < 2.2e-16) indicates that the regression is
significant overall.
Overall, the analysis shows that weather conditions significantly
affect bike rentals. Temperature has the largest positive coefficient,
followed by wind speed and snowfall, while rainfall, humidity, and solar
radiation have negative coefficients. Visibility and dew point
temperature are not significant in this model.
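Reading signs and significance off a long printed summary is error-prone. A coefficient table can instead be extracted and flagged programmatically. The sketch below uses base R on simulated data; the variables x1 and x2, the 0.05 threshold, and the column names are assumptions for illustration, not part of the original analysis.

```r
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)  # y depends on x1 only; x2 is pure noise

fit <- lm(y ~ x1 + x2, data = toy)

# summary()$coefficients is a matrix: estimate, std. error, t value, p-value
coef_table <- as.data.frame(summary(fit)$coefficients)
names(coef_table) <- c("estimate", "std_error", "t_value", "p_value")

# Flag terms that are significant at the (assumed) 5% level
coef_table$significant <- coef_table$p_value < 0.05
print(coef_table)
```

For a parsnip fit such as lm_model_weather, the same matrix should be available from summary(lm_model_weather$fit)$coefficients.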
3.2 Build a linear regression model using all variables
lm_model_all <- lm_spec %>%
fit(RENTED_BIKE_COUNT ~ ., data = train_data)
Print the fit summary for lm_model_all.
summary(lm_model_all$fit)
Call:
stats::lm(formula = RENTED_BIKE_COUNT ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-0.39281 -0.06190 -0.00194 0.05861 0.49754
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087389 0.015264 5.725 0.00000001081522389 ***
TEMPERATURE 0.196831 0.063112 3.119 0.001824 **
HUMIDITY -0.262832 0.029823 -8.813 < 0.0000000000000002 ***
WIND_SPEED -0.005507 0.011317 -0.487 0.626537
VISIBILITY 0.007780 0.005690 1.367 0.171578
DEW_POINT_TEMPERATURE 0.200671 0.066111 3.035 0.002412 **
SOLAR_RADIATION 0.077443 0.011718 6.609 0.00000000004188262 ***
RAINFALL -0.688387 0.045162 -15.242 < 0.0000000000000002 ***
SNOWFALL 0.068333 0.029379 2.326 0.020057 *
SPRING 0.058158 0.005407 10.755 < 0.0000000000000002 ***
SUMMER 0.057523 0.008139 7.067 0.00000000000175034 ***
AUTUMN 0.101296 0.005643 17.951 < 0.0000000000000002 ***
WINTER NA NA NA NA
HOLIDAY_YES NA NA NA NA
HOLIDAY_NO NA NA NA NA
HOUR_0 -0.032833 0.009085 -3.614 0.000304 ***
HOUR_1 -0.060741 0.009369 -6.483 0.00000000009673282 ***
HOUR_2 -0.095024 0.009283 -10.237 < 0.0000000000000002 ***
HOUR_3 -0.118588 0.009296 -12.757 < 0.0000000000000002 ***
HOUR_4 -0.134490 0.009324 -14.424 < 0.0000000000000002 ***
HOUR_5 -0.131373 0.009355 -14.044 < 0.0000000000000002 ***
HOUR_6 -0.083756 0.009493 -8.823 < 0.0000000000000002 ***
HOUR_7 0.002330 0.009229 0.252 0.800703
HOUR_8 0.120016 0.009489 12.648 < 0.0000000000000002 ***
HOUR_9 -0.029439 0.009677 -3.042 0.002358 **
HOUR_10 -0.089353 0.009933 -8.996 < 0.0000000000000002 ***
HOUR_11 -0.095492 0.010401 -9.181 < 0.0000000000000002 ***
HOUR_12 -0.084640 0.010827 -7.818 0.00000000000000626 ***
HOUR_13 -0.085297 0.010742 -7.941 0.00000000000000236 ***
HOUR_14 -0.080889 0.010658 -7.590 0.00000000000003666 ***
HOUR_15 -0.051860 0.010369 -5.001 0.00000058457951176 ***
HOUR_16 -0.019734 0.009990 -1.975 0.048272 *
HOUR_17 0.056302 0.009708 5.800 0.00000000697015086 ***
HOUR_18 0.192734 0.009389 20.527 < 0.0000000000000002 ***
HOUR_19 0.113208 0.009394 12.052 < 0.0000000000000002 ***
HOUR_20 0.092515 0.009361 9.883 < 0.0000000000000002 ***
HOUR_21 0.093144 0.009312 10.003 < 0.0000000000000002 ***
HOUR_22 0.068434 0.009194 7.444 0.00000000000011101 ***
HOUR_23 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1063 on 6313 degrees of freedom
Multiple R-squared: 0.6621, Adjusted R-squared: 0.6603
F-statistic: 363.8 on 34 and 6313 DF, p-value: < 0.00000000000000022
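The model explains approximately 66.2% of the
variance in RENTED_BIKE_COUNT, as indicated by the R-squared
value (adjusted: 0.6603), up from about 43.9% for the weather-only
model. Most of the hour indicators, the season indicators, and weather
variables such as TEMPERATURE, HUMIDITY, and RAINFALL are
significant, suggesting that these factors have a substantial
impact on bike count. The high F-statistic and its corresponding p-value
indicate that the model is statistically significant overall. Four
coefficients (WINTER, HOLIDAY_YES, HOLIDAY_NO, and HOUR_23) are reported
as NA because those columns are linear combinations of other columns in
the model, so lm cannot estimate them separately.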
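The NA rows in the summary are what lm produces when a full set of indicator columns is included together with an intercept. The following sketch reproduces the effect on simulated data; the toy data frame is hypothetical and only the season column names are borrowed from the dataset.

```r
set.seed(1)
season <- rep(c("SPRING", "SUMMER", "AUTUMN", "WINTER"), each = 25)

# One 0/1 indicator column per season (hypothetical toy data)
toy <- data.frame(
  y = rnorm(100),
  SPRING = as.numeric(season == "SPRING"),
  SUMMER = as.numeric(season == "SUMMER"),
  AUTUMN = as.numeric(season == "AUTUMN"),
  WINTER = as.numeric(season == "WINTER")
)

# The four indicators sum to 1 in every row, which duplicates the intercept
# column, so the last one cannot be estimated and is reported as NA
fit <- lm(y ~ SPRING + SUMMER + AUTUMN + WINTER, data = toy)
coef(fit)
```

Dropping one indicator per group (or letting R build the indicators from a factor) avoids the redundancy; the dropped level then becomes the reference category.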
4. Model evaluation
Model evaluation is crucial in regression analysis because it helps
determine how well a model fits the data and predicts outcomes.
R-squared is a key metric that indicates the proportion
of the variance in the dependent variable that is predictable from the
independent variables. A higher R-squared value signifies that the model
explains a greater portion of the variance, suggesting a better fit.
This is important for understanding the strength of the relationship
between the predictors and the outcome, and for assessing the model’s
explanatory power.
RMSE (Root Mean Square Error), on the other hand,
measures the average magnitude of the errors between predicted and
observed values. It provides insight into the model’s predictive
accuracy. A lower RMSE indicates that the model’s predictions are closer
to the actual values, which is essential for making reliable forecasts.
Evaluating models using both R-squared and RMSE ensures a balanced
assessment of their performance, considering both the goodness of fit
and the precision of predictions. This comprehensive evaluation helps in
selecting the most appropriate model for practical applications.
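Both metrics can be computed directly from vectors of observed and predicted values. The helper names rmse_fn and rsq_fn and the four-point example below are assumptions for illustration.

```r
# RMSE: square root of the mean squared difference
rmse_fn <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 minus (residual sum of squares / total sum of squares)
rsq_fn <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical values: three perfect predictions and one that is off by 1
actual <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 5)

rmse_fn(actual, predicted)  # 0.5
rsq_fn(actual, predicted)   # 0.8
```

Note that both functions assume the two vectors refer to the same observations in the same order.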
4.1 Root Mean Squared Error (RMSE)
# Make predictions on the test set
predictions_weather <- predict(lm_model_weather, new_data = test_data)
predictions_all <- predict(lm_model_all, new_data = test_data)
# Calculate errors against the observed test-set values
# (predict() returns a tibble, so extract the .pred column)
error_weather <- test_data$RENTED_BIKE_COUNT - predictions_weather$.pred
error_all <- test_data$RENTED_BIKE_COUNT - predictions_all$.pred
# Calculate squared errors
squared_error_weather <- error_weather^2
squared_error_all <- error_all^2
# Calculate the average of the squared errors
mean_squared_error_weather <- mean(squared_error_weather)
mean_squared_error_all <- mean(squared_error_all)
# Calculate RMSE
rmse_weather <- sqrt(mean_squared_error_weather)
rmse_all <- sqrt(mean_squared_error_all)
4.2 R-squared
summary_m_weather <- summary(lm_model_weather$fit)
r2_weather <- summary_m_weather$r.squared
summary_m_all <- summary(lm_model_all$fit)
r2_all <- summary_m_all$r.squared
4.3 Comparing models
results <- data.frame(
Model = c("Weather Model", "All Variables Model"),
R_squared = c(r2_weather, r2_all),
RMSE = c(rmse_weather, rmse_all)
)
print(results)
The “Weather Model” has an R-squared value of
0.4303461 and an RMSE of 0.2224554. The R-squared value indicates that
approximately 43.03% of the variance in the dependent variable is
explained by the independent variables in this model. The RMSE measures
the typical magnitude of the errors between the predicted and observed
values; a lower RMSE indicates better predictive accuracy.
The “All Variables Model” has an
R-squared value of 0.6602304 and an RMSE of 0.2348928. It
explains approximately 66.02% of the variance in the dependent variable,
considerably more than the “Weather Model.” Its reported RMSE, however,
is slightly higher (0.2348928 versus 0.2224554).
This combination is unusual. A model with a clearly higher R-squared
would normally also have a lower prediction error, and both RMSE values
are well above the residual standard errors in the fit summaries (0.1368
and 0.1063). RMSE is only meaningful when each prediction is compared
with the observed value from the same row of the test set, so these two
figures should be verified on that basis before they are used to rank
the models.
Taken at face value, the two metrics point in different directions.
If the primary goal is to explain more variance in
the dependent variable, the “All Variables Model” is
preferred due to its higher R-squared value. If the goal
is to minimize the reported prediction error, the “Weather Model”
has the lower RMSE.
4.4 Bar Chart for Coefficients (All Variables Model)
# Get the model coefficients
coefficients_all <- tidy(lm_model_all)
# Create a bar chart to visualize the coefficients
ggplot(coefficients_all, aes(x = reorder(term, estimate), y = estimate)) +
geom_bar(stat = "identity", fill = "maroon") +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "black") +
labs(title = "Coefficients of Linear Regression Model (All)",
x = "Predictor Variables",
y = "Coefficient Estimate") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text = element_text(size = 12)
) +
coord_flip() # Flip the chart for better readability

4.5 Bar Chart for Coefficients (Weather Model)
# Get the model coefficients
coefficients_weather <- tidy(lm_model_weather)
# Create a bar chart to visualize the coefficients
ggplot(coefficients_weather, aes(x = reorder(term, estimate), y = estimate)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "darkred") +
labs(title = "Coefficients of Linear Regression Model (Weather)",
x = "Predictor Variables",
y = "Coefficient Estimate") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text = element_text(size = 12)
) +
coord_flip() # Flip the chart for better readability

5. Add polynomial terms
# Plot higher-order polynomial fits of rented bikes against temperature
ggplot(train_data, aes(x = TEMPERATURE, y = RENTED_BIKE_COUNT)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", se = FALSE) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) +
geom_smooth(method = "lm", formula = y ~ poly(x, 4), color = "green", se = FALSE) +
geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "purple", se = FALSE) +
labs(title = "Polynomial Regression Fits",
x = "Temperature",
y = "Rented Bikes") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text = element_text(size = 12)
)

5.1 Fit the Polynomial Regression Model
# Assuming the important variables are TEMPERATURE and HUMIDITY
lm_bikly <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2), data = train_data)
# Print the model summary
summary(lm_bikly)
Call:
lm(formula = RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY,
2), data = train_data)
Residuals:
Min 1Q Median 3Q Max
-0.33139 -0.08465 -0.02142 0.06042 0.65952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.204909 0.001699 120.61 <0.0000000000000002 ***
poly(TEMPERATURE, 2)1 8.784310 0.137846 63.73 <0.0000000000000002 ***
poly(TEMPERATURE, 2)2 -1.673604 0.140230 -11.94 <0.0000000000000002 ***
poly(HUMIDITY, 2)1 -4.779434 0.141199 -33.85 <0.0000000000000002 ***
poly(HUMIDITY, 2)2 -2.570957 0.136853 -18.79 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1354 on 6343 degrees of freedom
Multiple R-squared: 0.4469, Adjusted R-squared: 0.4466
F-statistic: 1281 on 4 and 6343 DF, p-value: < 0.00000000000000022
The regression analysis presented aims to predict the
RENTED_BIKE_COUNT based on polynomial transformations of
TEMPERATURE and HUMIDITY. This approach allows
for capturing non-linear relationships between these predictors and the
dependent variable, which can be more reflective of real-world scenarios
where changes in temperature and humidity might not have a
straightforward linear effect on bike rentals.
5.1.1 Model Summary
The regression model includes polynomial terms for both
TEMPERATURE and HUMIDITY, up to
the second degree. The coefficients table shows the estimated effect of
each term, along with its standard error, t-value, and
p-value. The intercept (0.204909) is highly significant, with a very
low p-value (<2e-16). Because poly() generates orthogonal polynomial
terms by default, the intercept corresponds to the mean of
RENTED_BIKE_COUNT in the training data.
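With poly(), the polynomial terms are orthogonal by default, which affects how the intercept and coefficients should be read. The sketch below, on simulated data (all values are assumptions), checks two properties: the intercept of the orthogonal fit equals the mean of the response, and the fitted values are identical to those from raw polynomial terms.

```r
set.seed(7)
x <- runif(100, 0, 1)
y <- 0.2 + 0.8 * x - 0.5 * x^2 + rnorm(100, sd = 0.05)

# Default: orthogonal polynomial basis
fit_orth <- lm(y ~ poly(x, 2))
# raw = TRUE: plain x and x^2 terms
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE))

# The orthogonal terms are centered, so the intercept is the mean of y
all.equal(unname(coef(fit_orth)[1]), mean(y))

# Both parameterizations describe the same fitted curve
all.equal(fitted(fit_orth), fitted(fit_raw))
```

The coefficients differ between the two fits, but the predictions do not, so the choice mainly affects interpretation and numerical stability.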
5.1.2 Coefficients and Significance
The first-degree polynomial term for TEMPERATURE has a
large positive coefficient (8.784310) and is highly significant (p-value
<2e-16), suggesting that the number of rented bikes rises as
temperature increases. The second-degree term for
TEMPERATURE has a negative coefficient (-1.673604) and is also
significant (p-value <2e-16), indicating a concave relationship:
the increase in bike rentals flattens at higher temperatures.
Similarly, the first-degree polynomial term for HUMIDITY
has a significant negative coefficient (-4.779434), suggesting that
higher humidity levels reduce bike rentals. The second-degree term for
HUMIDITY is also negative (-2.570957) and significant, indicating a
concave relationship in which the decline in rentals is sharper at high
humidity. Coefficients on orthogonal polynomial terms are not on the
scale of the original variables, so their magnitudes are not directly
comparable with the coefficients of the linear models above.
5.1.3 Model Fit
The model’s multiple R-squared value is 0.4469, indicating that
approximately 44.69% of the variance in RENTED_BIKE_COUNT
is explained by the model. The adjusted R-squared value is nearly
identical (0.4466), suggesting that the model’s explanatory power is
robust even after adjusting for the number of predictors. The
F-statistic is very high (1281 on 4 and 6343 degrees of freedom),
with a corresponding p-value <2.2e-16, indicating that the model is
statistically significant overall.
5.1.4 Theoretical Insights
Evaluating a model using both R-squared and RMSE is crucial for
understanding its overall performance. The R-squared value provides
insight into how well the model explains the variability in the data,
which is important for assessing the model’s explanatory power. On the
other hand, RMSE provides a measure of the model’s predictive accuracy,
which is essential for making reliable forecasts. By considering both
metrics, we can ensure a balanced evaluation of the model, taking into
account both its ability to fit the data and its precision in
predictions.
In summary, this polynomial regression model effectively captures the
non-linear effects of temperature and humidity on bike rentals, with
significant coefficients for both predictors. The model explains a
substantial portion of the variance in bike rentals, making it a useful
tool for understanding and predicting bike rental patterns based on
weather conditions.
5.2 Make Predictions on the Test Dataset
# Make predictions on the test dataset using the lm_bikly model
y_pred <- predict(lm_bikly, newdata = test_data)
# Convert negative predictions to zero
y_pred <- ifelse(y_pred < 0, 0, y_pred)
5.2.1 Calculate R-squared and RMSE
# Calculating errors against the observed test-set values
error_bikly <- test_data$RENTED_BIKE_COUNT - y_pred
# Calculating Squared Errors
squared_error_bikly <- error_bikly^2
# Calculate the average of the squared errors
mean_squared_error_bikly <- mean(squared_error_bikly)
# Calculate RMSE
rmse_bikly <- sqrt(mean_squared_error_bikly)
# Calculate R-squared
summary_m_bikly <- summary(lm_bikly)
rsq_bikly <- summary_m_bikly$r.squared
# Display the results
results_bikly <- data.frame(
Model = "Polynomial Regression Model (Bikly)",
R_squared = rsq_bikly,
RMSE = rmse_bikly
)
print(results_bikly)
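When errors are computed by subtracting two vectors, R recycles the shorter one if the lengths differ, and it only warns when the longer length is not a multiple of the shorter. A length check makes this kind of mismatch fail loudly. The helper name safe_rmse and all values below are hypothetical.

```r
observed_test <- c(0.2, 0.4, 0.6)                  # hypothetical observed values (3 rows)
predicted_test <- c(0.25, 0.35, 0.70)              # hypothetical predictions for the same rows
observed_train <- c(0.1, 0.3, 0.5, 0.7, 0.9, 0.2)  # values from a different set (6 rows)

# 6 is a multiple of 3, so R recycles the predictions without any warning
mismatched <- observed_train - predicted_test
length(mismatched)

# Guarded RMSE: refuses to compare vectors of different lengths
safe_rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

safe_rmse(observed_test, predicted_test)
```

Calling safe_rmse(observed_train, predicted_test) stops with an error instead of returning a misleading number.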
Based on the results table, the polynomial regression model named
“Bikly” has an R-squared value of 0.44696063 and an RMSE of
0.2136526.
5.2.2 Model Evaluation
The R-squared value of 0.44696063 indicates that
approximately 44.70% of the variance in the dependent variable is
explained by the independent variables in this model. This suggests a
moderate level of explanatory power, meaning that the model captures a
significant portion of the variability in the data but leaves some
unexplained variance. A higher R-squared value generally indicates a
better fit of the model to the data.
The RMSE (Root Mean Square Error) value of 0.2136526
measures the typical magnitude of the errors between the predicted and
observed values. A lower RMSE indicates better predictive accuracy, as
it means the model’s predictions are closer to the actual values. This
value must be read on the normalized scale of the response, and it is
larger than the model’s residual standard error of 0.1354, so on its own
it does not demonstrate good predictive performance.
5.2.3 Theoretical Insights
The polynomial regression model “Bikly” demonstrates a moderate level
of explanatory power, making it a useful starting point
for understanding and predicting the dependent variable based on
the given predictors.
6. Fit the Polynomial Regression Model with Interaction Terms
# Fit a polynomial regression model with interaction terms
lm_bikly_interaction <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)
# Print the model summary
summary(lm_bikly_interaction)
Call:
lm(formula = RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY +
poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-0.38995 -0.08537 -0.02348 0.05322 0.66347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.332614 0.007418 44.836 < 0.0000000000000002 ***
poly(TEMPERATURE, 2)1 11.106750 0.613015 18.118 < 0.0000000000000002 ***
poly(TEMPERATURE, 2)2 -3.933440 0.663528 -5.928 0.000000003226468181 ***
HUMIDITY -0.245753 0.009693 -25.355 < 0.0000000000000002 ***
WIND_SPEED 0.099425 0.013255 7.501 0.000000000000072139 ***
poly(TEMPERATURE, 2)1:HUMIDITY -6.744694 0.821792 -8.207 0.000000000000000272 ***
poly(TEMPERATURE, 2)2:HUMIDITY 4.200620 0.976027 4.304 0.000017042362283015 ***
poly(TEMPERATURE, 2)1:WIND_SPEED 5.105744 1.055849 4.836 0.000001358254989286 ***
poly(TEMPERATURE, 2)2:WIND_SPEED 2.539413 1.098471 2.312 0.0208 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.137 on 6339 degrees of freedom
Multiple R-squared: 0.4336, Adjusted R-squared: 0.4329
F-statistic: 606.6 on 8 and 6339 DF, p-value: < 0.00000000000000022
The regression analysis presented aims to predict
RENTED_BIKE_COUNT from a second-degree polynomial in
TEMPERATURE together with HUMIDITY, WIND_SPEED, and the
interactions of the temperature polynomial with each of them. This
approach allows for capturing complex
relationships between these predictors and the dependent variable, which
can be more reflective of real-world scenarios where weather conditions
interact in non-linear ways to affect bike rentals.
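To see which terms a formula with `*` actually generates, the design matrix can be inspected with model.matrix(). The sketch below uses simulated values for the three weather columns (the data are hypothetical; only the column names and the formula come from the model above).

```r
set.seed(3)
toy <- data.frame(
  TEMPERATURE = runif(50),
  HUMIDITY = runif(50),
  WIND_SPEED = runif(50)
)

# Same right-hand side as the interaction model
X <- model.matrix(
  ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED,
  data = toy
)

# a * b expands to a + b + a:b, so each interaction adds one column
# per polynomial degree
colnames(X)
ncol(X)  # 9 columns, including the intercept
```

The nine columns correspond to the nine rows of the coefficient table: the intercept, two polynomial terms, two main effects, and four interaction terms.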
6.1 Model Summary
The regression model includes polynomial terms for
TEMPERATURE up to the second degree, linear terms for HUMIDITY
and WIND_SPEED, and the interactions of the TEMPERATURE polynomial
with HUMIDITY and with WIND_SPEED. The coefficients table shows the
estimated effect of each term, along with its standard error, t-value,
and p-value.
6.1.1 Coefficients and Significance
The first-degree polynomial term for TEMPERATURE has a
significant positive coefficient (11.106750), suggesting that rentals
increase with temperature. The
second-degree term for TEMPERATURE is negative (-3.933440) and also
significant (p ≈ 3.2e-09), indicating that the increase flattens at
higher temperatures.
HUMIDITY has a significant negative coefficient (-0.245753),
suggesting that higher humidity levels reduce bike rentals. Both
interaction terms between the TEMPERATURE polynomial and
HUMIDITY are significant,
indicating that the effect of temperature depends on the humidity
level. WIND_SPEED has a
significant positive coefficient (0.099425), and its interactions with
the TEMPERATURE polynomial are significant as well (the second-degree
interaction only at the 5% level, p = 0.0208),
suggesting that wind speed also modifies the effect of temperature on
bike rentals.
6.1.2 Model Fit
The model’s multiple R-squared value is 0.4336, indicating that
approximately 43.36% of the variance in RENTED_BIKE_COUNT
is explained by the model. The adjusted R-squared value is very close
(0.4329), suggesting that the model’s explanatory power is robust even
after adjusting for the number of predictors. The F-statistic is very
high (606.6 on 8 and 6339 degrees of freedom),
with a corresponding p-value <2.2e-16, indicating that the model is
statistically significant overall.
6.1.3 Theoretical Insights
In summary, this polynomial regression model effectively captures the
complex interactions between temperature, humidity, and wind speed on
bike rentals, with significant coefficients for all predictors. The
model explains a substantial portion of the variance in bike rentals,
making it a useful tool for understanding and predicting bike rental
patterns based on weather conditions.
6.2 Make Predictions on the Test Dataset
# Make predictions on the test dataset using the lm_bikly_interaction model
y_pred_interaction <- predict(lm_bikly_interaction, newdata = test_data)
# Convert negative predictions to zero
y_pred_interaction <- ifelse(y_pred_interaction < 0, 0, y_pred_interaction)
6.2.1 Calculate R-squared and RMSE
# Calculating errors against the observed test-set values
error_bikly_interaction <- test_data$RENTED_BIKE_COUNT - y_pred_interaction
# Calculating Squared Errors
squared_error_bikly_interaction <- error_bikly_interaction^2
# Calculate the average of the squared errors
mean_squared_error_bikly_interaction <- mean(squared_error_bikly_interaction)
# Calculate RMSE
rmse_bikly_interaction <- sqrt(mean_squared_error_bikly_interaction)
# Calculate R-squared
summary_m_bikly_interaction <- summary(lm_bikly_interaction)
rsq_bikly_interaction <- summary_m_bikly_interaction$r.squared
# Display the results
results_bikly_interaction <- data.frame(
Model = "Polynomial Regression Model (Bikly_interaction)",
R_squared = rsq_bikly_interaction,
RMSE = rmse_bikly_interaction
)
print(results_bikly_interaction)
Based on the provided data, the polynomial regression model has an
R-squared value of 0.4335922 and an RMSE of 0.2168526.
6.2.2 Model Evaluation
The R-squared value of 0.4335922 indicates that
approximately 43.36% of the variance in the dependent variable is
explained by the independent variables in this model. This suggests a
moderate level of explanatory power, meaning that the model captures a
significant portion of the variability in the data but leaves some
unexplained variance. A higher R-squared value generally indicates a
better fit of the model to the data.
The RMSE (Root Mean Square Error) value of 0.2168526
measures the typical magnitude of the errors between the predicted and
observed values. A lower RMSE indicates better predictive accuracy, as
it means the model’s predictions are closer to the actual values. This
value must be read on the normalized scale of the response, and it is
larger than the model’s residual standard error of 0.137, so on its own
it does not demonstrate good predictive performance.
7. Add regularization
7.1 Create a recipe
bike_recipe <- recipe(RENTED_BIKE_COUNT ~ ., data = train_data) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_poly(all_predictors(), degree = 2) %>%
step_interact(terms = ~ all_predictors():all_predictors())
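7.2 Specify the workflow
# glmnet_spec is not defined elsewhere in this document. An elastic net
# specification is assumed here, using the same penalty and mixture values
# as the elastic net model in section 7.3.
glmnet_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
set_engine("glmnet")
bike_workflow <- workflow() %>%
add_recipe(bike_recipe) %>%
add_model(glmnet_spec)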
7.3 Ridge (L2), Lasso (L1), and Elastic Net Regularization
# Model 1: Adding regularization (L2 Ridge)
ridge_spe <- linear_reg(penalty = 0.1, mixture = 0) %>%
set_engine("glmnet")
train_f1 <- ridge_spe %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Model 2: Adding regularization (L1 Lasso)
lasso_spe <- linear_reg(penalty = 0.1, mixture = 1) %>%
set_engine("glmnet")
train_f2 <- lasso_spe %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Model 3: Adding regularization (Elastic Net: L1 Lasso and L2 Ridge)
enet_spe <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
set_engine("glmnet")
train_f3 <- enet_spe %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Extract predictions
predic1 <- predict(train_f1, train_data)$.pred
predic2 <- predict(train_f2, train_data)$.pred
predic3 <- predict(train_f3, train_data)$.pred
# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
# Calculate RMSE for each model
rmse_1 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic1)
rmse_2 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic2)
rmse_3 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic3)
# Combine results into a table
result_regularization <- tibble(
model = c("Model 1: L2 Ridge", "Model 2: L1 Lasso", "Model 3: L1 Lasso and L2 Ridge"),
RMSE = c(rmse_1, rmse_2, rmse_3)
)
# Display the results
print(result_regularization)
Model 1: L2 Ridge - This model applies L2 regularization and has a
training RMSE of 0.1464844, the lowest of the three. L2 regularization
helps to prevent overfitting by penalizing large coefficients, which
shrinks them toward zero without removing any predictor.
Model 2: L1 Lasso - The L1 regularization model, known as Lasso, has a
higher training RMSE of 0.1805229. Lasso also guards against overfitting
and additionally performs feature selection by shrinking some
coefficients exactly to zero. At the fixed penalty of 0.1 used here, the
higher RMSE suggests that Lasso shrinks or removes predictors that still
carry signal, so it fits these data less closely than Ridge.
Model 3: Combination of L1 Lasso and L2 Ridge - This model mixes both
penalties in equal parts (mixture = 0.5), an approach usually called
Elastic Net, and has a training RMSE of 0.1623183. It fits better than
Lasso alone but not as well as Ridge, which is consistent with a penalty
that sits between the two.
Overall, Ridge regression (L2) gives the lowest RMSE in this comparison,
Lasso (L1) is useful for feature selection but gives the highest RMSE,
and Elastic Net falls in between. Because all three RMSE values are
computed on the training data with a single penalty value of 0.1, the
ranking describes how strongly each penalty constrains the fit rather
than how well each model generalizes. Choosing the penalty on held-out
data would give a fairer comparison of the three strategies.
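The effect of the penalty strength can be illustrated with a small base-R sketch. It uses synthetic data and the closed-form ridge estimate rather than glmnet, so the `lambda` values are not on the same scale as the `penalty` argument used above; the data, the grid, and the helper names `ridge_coef` and `rmse` are all assumptions made for illustration.

```r
set.seed(1234)
# Synthetic, standardized predictors stand in for the weather variables.
n <- 300; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(0.5, -0.3, 0.2, 0, 0)) + rnorm(n, sd = 0.5)
y <- y - mean(y)  # centred response, so the sketch omits an intercept

train <- 1:200
valid <- 201:300

# Closed-form ridge estimate: (X'X + lambda * I)^-1 X'y.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit on the training rows for each penalty and score on both subsets.
lambdas <- c(0, 0.1, 1, 10, 100, 1000)
scores <- t(sapply(lambdas, function(l) {
  beta <- ridge_coef(X[train, ], y[train], l)
  c(lambda     = l,
    train_rmse = rmse(y[train], drop(X[train, ] %*% beta)),
    valid_rmse = rmse(y[valid], drop(X[valid, ] %*% beta)),
    coef_norm  = sqrt(sum(beta^2)))
}))

print(round(scores, 4))

# Pick the penalty with the lowest validation RMSE, not the lowest training RMSE.
best_lambda <- scores[which.min(scores[, "valid_rmse"]), "lambda"]
print(best_lambda)
```

The training RMSE can only stay flat or rise as the penalty grows, while the coefficient norm shrinks, which is why the penalty is selected on the validation rows here.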
8. Experiment to search for improved models
# Define the model specifications
lm_spec <- linear_reg() %>%
set_engine("lm")
# Model 1: Adding more features
train_fit5 <- lm_spec %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Model 2: Adding regularization (L2 Ridge)
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
set_engine("glmnet")
train_fit6 <- ridge_spec %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Model 3: Adding polynomial components
poly_spec <- linear_reg() %>%
set_engine("lm")
train_fit7 <- poly_spec %>%
fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2) + poly(WIND_SPEED, 2) + poly(VISIBILITY, 2) + poly(DEW_POINT_TEMPERATURE, 2) + poly(SOLAR_RADIATION, 2) + poly(RAINFALL, 2) + poly(SNOWFALL, 2), data = train_data)
# Model 4: Adding interaction terms
interaction_spec <- linear_reg() %>%
set_engine("lm")
train_fit8 <- interaction_spec %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE * HUMIDITY + WIND_SPEED * VISIBILITY + DEW_POINT_TEMPERATURE * SOLAR_RADIATION + RAINFALL * SNOWFALL, data = train_data)
# Model 5: Linear model on a reduced feature set (no VISIBILITY or DEW_POINT_TEMPERATURE)
reduced_spec <- linear_reg() %>%
set_engine("lm")
train_fit9 <- reduced_spec %>%
fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
# Extract predictions
pred5 <- predict(train_fit5, train_data)$.pred
pred6 <- predict(train_fit6, train_data)$.pred
pred7 <- predict(train_fit7, train_data)$.pred
pred8 <- predict(train_fit8, train_data)$.pred
pred9 <- predict(train_fit9, train_data)$.pred
# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
# Calculate RMSE for each model
rmse5 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred5)
rmse6 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred6)
rmse7 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred7)
rmse8 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred8)
rmse9 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred9)
# Combine results into a table
results <- tibble(
model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components", "Model 4: Interaction Terms", "Model 5: LM"),
RMSE = c(rmse5, rmse6, rmse7, rmse8, rmse9)
)
# Display the results
print(results)
Model 1: More Features - This model includes
multiple features such as temperature, humidity, wind speed, visibility,
dew point temperature, solar radiation, rainfall, and snowfall. It has
an RMSE of 0.1370119, indicating a relatively good fit. The inclusion of
diverse features helps capture various aspects affecting the rented bike
count.
Model 2: Ridge Regularization - This model
applies L2 regularization to prevent overfitting. Its RMSE is slightly
higher at 0.1437533. This is expected on the training data: the
unpenalized least-squares fit in Model 1 minimizes training error by
construction, so a ridge penalty on the same features can only match or
increase it. Any benefit of regularization would show up on held-out
data instead.
Model 3: Polynomial Components - By adding
polynomial components, this model captures non-linear relationships
between the predictors and the target variable. It has the lowest RMSE
of 0.1291756, about 6% below Model 1, indicating that non-linear
transformations of the features improve the fit.
Model 4: Interaction Terms - This model includes
interaction terms between pairs of features, allowing it to capture the
combined effect of two variables on the target. With an RMSE of
0.1338293, it performs better than the ridge regularization model but
not as well as the polynomial components model. Interaction terms can be
useful but may not always lead to the best performance.
Model 5: LM - This model is a linear regression on a reduced feature
set that omits visibility and dew point temperature. It has an RMSE of
0.1370168, almost identical to Model 1, which suggests that the two
omitted features add very little once the other weather variables are
included.
Overall, the polynomial components model (Model 3) shows the best
performance in terms of training RMSE, suggesting that capturing
non-linear relationships helps in predicting the rented bike count. The
differences between the models are small (roughly 0.129 to 0.144), and
all values are computed on the training data, so the ranking should be
confirmed on the test set.
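The gap between training and test error can be sketched as follows. The example uses synthetic data and a single predictor, so it is only an illustration of the comparison, not a rerun of the models above; the data-generating formula, the 3/4 split by row position, and the range of degrees are assumptions.

```r
set.seed(1234)
# Synthetic stand-in for one normalized predictor and the response.
n <- 300
x <- runif(n)
y <- 0.2 + 0.6 * x - 0.3 * x^2 + rnorm(n, sd = 0.1)
dat <- data.frame(x = x, y = y)

train_rows <- 1:225          # 3/4 of the rows for training
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

degrees <- 1:6
comparison <- data.frame(
  degree     = degrees,
  train_rmse = NA_real_,
  test_rmse  = NA_real_
)

# Fit one polynomial model per degree and score it on both subsets.
for (d in degrees) {
  fit <- lm(y ~ poly(x, d), data = train)
  comparison$train_rmse[d] <- rmse(train$y, predict(fit, newdata = train))
  comparison$test_rmse[d]  <- rmse(test$y,  predict(fit, newdata = test))
}

print(comparison)
```

Training RMSE never increases as the degree grows because each model contains the previous one, so only the test column says whether the extra flexibility is useful.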
# Create a Q-Q plot for each model
qq_plot <- function(predictions, model_name) {
ggplot(data.frame(residuals = train_data$RENTED_BIKE_COUNT - predictions), aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = paste("Q-Q Plot for", model_name),
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal()
}
# Generate the Q-Q plots
qq_plot1 <- qq_plot(pred5, "Model 1: More Features")
qq_plot2 <- qq_plot(pred6, "Model 2: Ridge Regularization")
qq_plot3 <- qq_plot(pred7, "Model 3: Polynomial Components")
qq_plot4 <- qq_plot(pred8, "Model 4: Interaction Terms")
qq_plot5 <- qq_plot(pred9, "Model 5: LM")
# Display the Q-Q plots
print(qq_plot1)

print(qq_plot2)

print(qq_plot3)

print(qq_plot4)

print(qq_plot5)

---
title: "Bicycle rental prediction"
output: 
  html_notebook:
    toc: TRUE
    toc_depth: 5
    toc_float: TRUE
---

```{r , message=FALSE ,echo=FALSE}
options(scipen=9999)
set.seed(1234)
```

Bike sharing has become a popular transportation option in many cities around the world. With increasing environmental awareness and the need for sustainable transportation options, bike sharing systems have seen significant growth. However, for these systems to operate efficiently, it is crucial to predict the demand for bikes at different stations and locations.

The goal of this project is to develop a predictive model that can estimate bike sharing demand based on various factors such as weather, time of day, day of the week, and special events. Using data analytics and machine learning techniques, the project aims to provide a tool that helps bike sharing system operators optimize bike distribution and availability, thereby improving user experience and operational efficiency.

### 1.Libraries

```{r , message=FALSE}
library(tidymodels) #For modeling and machine learning 
library(tidyverse) # Share common data representations and 'API' design
library(stringr) # Consistent wrapper for common string operations
library(readr) # Read rectangular text data
library(broom) # Convert statistical objects into tidy tibbles
library(dplyr) # A grammar of data manipulation
library(yardstick) # Tidy characterization of model performance
library(glmnet) # Lasso and Elastic Net
library(kableExtra) # Construct Complex Table


```

### 2.Database

The database contains detailed weather information, including temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall. Additionally, it records the number of bikes rented per hour and date information from the Seoul bike-sharing system.

**Technical Analysis:**
The weather variables such as temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall are crucial as they can significantly influence the demand for bike rentals. For instance, temperature affects user comfort, while humidity impacts the perception of heat. Wind speed can make biking easier or harder, and visibility is important for cyclist safety. Dew point is an indicator of humidity and thermal comfort, and solar radiation can influence the decision to rent bikes. Snowfall and rainfall are critical factors that can reduce bike rental demand.

The bike rental data, specifically the number of bikes rented per hour, serves as the key dependent variable for regression analysis. The date information allows for the examination of temporal and seasonal patterns in bike usage.

**Project Objective:**
The goal is to use the weather and temporal variables to predict the number of bikes rented per hour. This can help optimize the management of the bike-sharing system, anticipate demand, and improve user experience.

#### 2.1 Variables 

The `seoul_bike_sharing_converted_normalized.csv` file is our main dataset. It has the following variables:

The response variable:

- `RENTED_BIKE_COUNT` - Count of bikes rented at each hour

Weather predictor variables:

- `TEMPERATURE` - Temperature in Celsius
- `HUMIDITY` - Unit is `%`
- `WIND_SPEED` - Unit is `m/s`
- `VISIBILITY` - Unit is 10 m
- `DEW_POINT_TEMPERATURE` - The temperature to which the air would have to cool down in order to reach saturation, unit is Celsius
- `SOLAR_RADIATION` - MJ/m2
- `RAINFALL` - mm
- `SNOWFALL` - cm

Date/time predictor variables:

- `DATE` - Year-month-day
- `HOUR` - Hour of the day
- `FUNCTIONING_DAY` - NoFunc (non-functional hours), Fun (functional hours)
- `HOLIDAY` - Holiday/No holiday
- `SEASONS` - Winter, Spring, Summer, Autumn


#### 2.2 Load database

```{r}
seoul_bike_sharing_converted_normalized <- read_csv("Bases limpias/seoul_bike_sharing_converted_normalized.csv")
```

#### 2.3 Convert into a df

```{r}
bike_sharing_df <- seoul_bike_sharing_converted_normalized %>% 
                   select(-DATE, -FUNCTIONING_DAY_YES,-FUNCTIONING_DAY_NO)
```

We will not be utilizing the `DATE` column in its current form, as it essentially functions as a data entry index. However, with additional time, we could transform the `DATE` column to derive new features such as 'day of the week' or 'isWeekend', which might influence bike rental preferences. Additionally, the `FUNCTIONING_DAY` columns will not be used because, after processing missing values, they only contain a single distinct value (`YES`).
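As a rough illustration of the date-derived features mentioned above, the sketch below builds day-of-week and weekend indicators from a vector of dates using base R only. The column names `DAY_OF_WEEK` and `IS_WEEKEND`, the sample dates, and the `"%Y-%m-%d"` format are assumptions; the real `DATE` column may need a different format string.

```r
# Hypothetical sample of dates; the real DATE column may use another format.
dates <- as.Date(c("2017-12-01", "2017-12-02", "2017-12-03"), format = "%Y-%m-%d")

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, ..., 6 = Saturday.
wday_index <- as.POSIXlt(dates)$wday

date_features <- data.frame(
  DATE        = dates,
  DAY_OF_WEEK = wday_index,
  IS_WEEKEND  = wday_index %in% c(0, 6)
)

print(date_features)
```

Using the numeric weekday index rather than `weekdays()` keeps the result independent of the system locale.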

### 3. Split training and testing data 

```{r}
bike_split <- initial_split(bike_sharing_df, prop = 3/4)
train_data <- training(bike_split)
test_data <- testing(bike_split)

```

#### 3.1 Build a linear regression model using weather variables only

Weather conditions are likely to influence individuals' decisions regarding bike rentals. For instance, adverse weather such as cold and rainy conditions may lead people to opt for alternative modes of transportation like buses or taxis. Conversely, favorable weather, such as sunny days, may increase the propensity to rent bikes for short-distance travel.

```{r}
# Pick linear regression
lm_spec <- linear_reg() %>%
  # Set engine
  set_engine(engine = "lm")

# Print the linear function
lm_spec
```
```{r}
# Fit the model

lm_model_weather <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
```

Print the fit summary for the `lm_model_weather` model.

```{r , message=FALSE, eval=FALSE}


# Create the table with the regression results

summary(lm_model_weather$fit)

```


The regression analysis aims to predict the `RENTED_BIKE_COUNT` using several independent variables: `TEMPERATURE`, `HUMIDITY`, `WIND_SPEED`, `VISIBILITY`, `DEW_POINT_TEMPERATURE`, `SOLAR_RADIATION`, `RAINFALL`, and `SNOWFALL`. The intercept has an estimate of _0.046472_, indicating the baseline level of bike rentals when all other variables are zero.

`TEMPERATURE` has a positive coefficient of _0.631318_, suggesting that as the temperature increases, the number of rented bikes also increases. This relationship is highly significant with a p-value less than 5.17e-16. `HUMIDITY` has a negative coefficient of _-0.276702_, indicating that higher humidity levels are associated with fewer bike rentals, and this effect is also highly significant.

`VISIBILITY` has a very small positive coefficient (_0.006179_), indicating a slight increase in bike rentals with better visibility, and this effect is significant. `SOLAR_RADIATION` has a strong positive coefficient of _1.034800_, showing that higher solar radiation levels significantly increase bike rentals. `RAINFALL` and `SNOWFALL` both have negative coefficients (_-0.582041_ and _-0.099439_, respectively), indicating that more rainfall and snowfall lead to fewer bike rentals. These effects are statistically significant.

The **residual standard error** is _0.1371_, indicating the average distance that the observed values fall from the regression line. The multiple R-squared and adjusted R-squared values, along with the F-statistic, are not reported here, but they would typically indicate the overall fit of the model and the significance of the regression equation, respectively.

Overall, the analysis shows that weather conditions significantly impact bike rentals, with temperature and solar radiation having the most substantial positive effects, while humidity, wind speed, rainfall, and snowfall negatively affect bike rentals.

#### 3.2 Build a linear regression model using all variables

```{r}
lm_model_all <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ ., data = train_data)
```

Print the fit summary for `lm_model_all`.

```{r}
summary(lm_model_all$fit)
```

The model explains approximately **66.0%** of the variance in `RENTED_BIKE_COUNT`, as indicated by the R-squared value. The significant predictors (e.g., `HOUR`, `TEMPERATURE`) suggest that these factors have a substantial impact on bike count. The high F-statistic and its corresponding p-value indicate that the model is statistically significant overall.


### 4. Model evaluation 

Model evaluation is crucial in regression analysis because it helps determine how well a model fits the data and predicts outcomes. **R-squared** is a key metric that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value signifies that the model explains a greater portion of the variance, suggesting a better fit. This is important for understanding the strength of the relationship between the predictors and the outcome, and for assessing the model's explanatory power.

**RMSE (Root Mean Square Error)**, on the other hand, measures the average magnitude of the errors between predicted and observed values. It provides insight into the model's predictive accuracy. A lower RMSE indicates that the model's predictions are closer to the actual values, which is essential for making reliable forecasts. Evaluating models using both R-squared and RMSE ensures a balanced assessment of their performance, considering both the goodness of fit and the precision of predictions. This comprehensive evaluation helps in selecting the most appropriate model for practical applications.
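The two metrics described above can be computed on the same held-out rows with a few lines of base R. This is a minimal sketch on synthetic data; the helper names `rmse_holdout` and `rsq_holdout`, the simulated variables, and the split by row position are assumptions for illustration and are not part of the notebook's pipeline.

```r
# Synthetic data stands in for the bike-sharing table (assumption).
set.seed(1234)
n <- 400
sim <- data.frame(x = runif(n))
sim$y <- 0.2 + 0.5 * sim$x + rnorm(n, sd = 0.1)

# Hold out the last quarter of the rows for testing.
test_idx  <- 301:400
sim_train <- sim[-test_idx, ]
sim_test  <- sim[test_idx, ]

fit <- lm(y ~ x, data = sim_train)
pred_test <- predict(fit, newdata = sim_test)

# RMSE: square root of the mean squared prediction error.
rmse_holdout <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 - SSE / SST, computed on the same rows as the RMSE.
rsq_holdout <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

metrics <- c(
  RMSE      = rmse_holdout(sim_test$y, pred_test),
  R_squared = rsq_holdout(sim_test$y, pred_test)
)
print(metrics)
```

The length check in both helpers guards against comparing predictions for one subset with observations from another, which R would otherwise recycle or truncate without stopping.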




#### 4.1 Root Mean Squared Error (RMSE)

```{r , warning=FALSE}
# Making the predictions
predictions_weather <- predict(lm_model_weather, new_data = test_data)
predictions_all <- predict(lm_model_all, new_data = test_data)
# Calculating errors against the observed counts of the same test rows
error_weather <- test_data$RENTED_BIKE_COUNT - predictions_weather
error_all <- test_data$RENTED_BIKE_COUNT - predictions_all
# Calculating Squared Errors
squared_error_weather <- error_weather^2
squared_error_all <- error_all^2
# Calculate the average of the squared errors
mean_squared_error_weather <- mean(squared_error_weather$.pred)
mean_squared_error_all <- mean(squared_error_all$.pred)
# Calculate RMSE
rmse_weather <- sqrt(mean_squared_error_weather)
rmse_all <- sqrt(mean_squared_error_all)

```

#### 4.2 R-squared
```{r}
# Get the R squared from the model
summary_m_weather <- summary(lm_model_weather$fit)
summary_m_all <- summary(lm_model_all$fit)
# Print the R squared
r2_weather <- summary_m_weather$r.squared
r2_all <- summary_m_all$r.squared
```

#### 4.3 Comparing models
```{r}

# Create data frame
results <- data.frame(
  Model = c("Weather Model", "All Variables Model"),
  R_squared = c(r2_weather, r2_all),
  RMSE = c(rmse_weather, rmse_all)
)

print(results)
```

The **"Weather Model"** has an R-squared value of 0.4303461 and an RMSE of 0.2224554. The R-squared value indicates that approximately 43.03% of the variance in the dependent variable can be explained by the independent variables in this model. The RMSE measures the average magnitude of the errors between the predicted and observed values; a lower RMSE indicates better predictive accuracy.

On the other hand, the **"All Variables Model"** has an R-squared value of 0.6602304 and an RMSE of 0.2348928. This model explains approximately 66.02% of the variance in the dependent variable, which is higher than the "Weather Model." However, its RMSE is also higher (0.2348928 versus 0.2224554), indicating that the average prediction error is larger compared to the "Weather Model."

To determine the best model, we need to consider the trade-off between the goodness of fit (R-squared) and the predictive accuracy (RMSE). The **"All Variables Model"** has a higher R-squared value, suggesting it fits the data better and explains more variance. However, its higher RMSE indicates that its predictions are less accurate on average compared to the **"Weather Model."**

In summary, the choice of the best model depends on the specific objectives of the analysis. If explaining more variance is prioritized, the **"All Variables Model"** is better. If minimizing prediction errors is more important, the **"Weather Model"** is the preferred choice.

The two metrics are not computed on the same data: R-squared comes from the training fit, while the RMSE is based on test-set predictions. Both RMSE values are also well above the residual standard error of 0.1371 reported for the weather model in section 3.1, so the RMSE calculation should be verified before either model is chosen on that basis.

#### 4.4 Bar Chart for Coefficients (All Variables Model)

```{r}
# Obtain the model coefficients
coefficients_all <- tidy(lm_model_all)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_all, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "maroon") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "black") +
  labs(title = "Coefficients of Linear Regression Model (All)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip() # Flip the chart for better viewing

```

#### 4.5 Bar Chart for Coefficients (Weather Model)

```{r}

# Obtain the model coefficients
coefficients_weather <- tidy(lm_model_weather)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_weather, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "darkred") +
  labs(title = "Coefficients of Linear Regression Model (Weather)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip()  # Flip the chart for better viewing


```


### 5. Add polynomial terms

```{r}

# Plot the higher order polynomial fits

ggplot(train_data, aes(x = TEMPERATURE, y = RENTED_BIKE_COUNT)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), color = "green", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "purple", se = FALSE) +
  labs(title = "Polynomial Regression Fits",
       x = "Temperature",
       y = "Rented Bikes") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

```

#### 5.1 Fit the Polynomial Regression Model


```{r}
# Assuming the important variables are TEMPERATURE and HUMIDITY
lm_bikly <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2), data = train_data)

# Print the model summary
summary(lm_bikly)

```
The regression analysis presented aims to predict the `RENTED_BIKE_COUNT` based on polynomial transformations of `TEMPERATURE` and `HUMIDITY`. This approach allows for capturing non-linear relationships between these predictors and the dependent variable, which can be more reflective of real-world scenarios where changes in temperature and humidity might not have a straightforward linear effect on bike rentals.

##### 5.1.1 Model Summary
The regression model includes polynomial terms for both `TEMPERATURE` and `HUMIDITY`, specifically up to the second degree. The coefficients table shows the estimated effects of each predictor, along with their standard errors, t-values, and p-values. The intercept is highly significant, with a very low p-value (<2e-16), indicating a strong baseline effect when all predictors are at their mean values.

##### 5.1.2 Coefficients and Significance
The first-degree polynomial term for `TEMPERATURE` has a large negative coefficient and is highly significant (p-value <2e-16), suggesting that as temperature increases, the number of rented bikes decreases significantly. The second-degree term for `TEMPERATURE` has a negative coefficient and is also significant (p-value <2e-16), indicating a diminishing return effect; at higher temperatures, the decrease in bike rentals slows down. Similarly, the first-degree polynomial term for `HUMIDITY` has a significant negative coefficient, suggesting that higher humidity levels reduce bike rentals. The second-degree term for `HUMIDITY` also has a negative coefficient.
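When reading coefficients of `poly()` terms, it helps to remember that `poly()` builds orthogonal polynomials by default, so the estimates are not slopes and curvatures in the units of the predictor. The sketch below, on synthetic data with made-up variable names `temp` and `count`, compares the default basis with `raw = TRUE`; both describe the same fitted curve, but only the raw coefficients can be read directly on the predictor's scale.

```r
set.seed(1234)
# Synthetic stand-in for a normalized predictor and response (assumption).
temp  <- runif(200)
count <- 0.1 + 0.8 * temp - 0.4 * temp^2 + rnorm(200, sd = 0.05)

fit_orth <- lm(count ~ poly(temp, 2))              # orthogonal basis (default)
fit_raw  <- lm(count ~ poly(temp, 2, raw = TRUE))  # raw powers: temp, temp^2

# The two parameterizations produce the same fitted values.
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)

# Only the raw coefficients read as "slope" and "curvature" in units of temp.
print(round(coef(fit_raw), 3))
print(round(coef(fit_orth), 3))
```

For interpretation, the raw parameterization or a plot of the fitted curve is usually easier to reason about than the signs of the orthogonal terms.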

##### 5.1.3 Model Fit
The model's multiple R-squared value is 0.4470, indicating that approximately 44.70% of the variance in `RENTED_BIKE_COUNT` is explained by the model. The adjusted R-squared value is nearly the same, suggesting that the model's explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2.2e-16, indicating that the model is statistically significant overall.

##### 5.1.4 Theoretical Insights
Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model's explanatory power. On the other hand, RMSE provides a measure of the model's predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.

In summary, this polynomial regression model effectively captures the non-linear effects of temperature and humidity on bike rentals, with significant coefficients for both predictors. The model explains a substantial portion of the variance in bike rentals, making it a useful tool for understanding and predicting bike rental patterns based on weather conditions.



#### 5.2 Make Predictions on the Test Dataset

```{r}
# Make predictions on the test dataset using the lm_bikly model
y_pred <- predict(lm_bikly, newdata = test_data)

# Convert negative predictions to zero
y_pred <- ifelse(y_pred < 0, 0, y_pred)

```

##### 5.2.1 Calculate R-squared and RMSE

```{r}



# Calculating errors against the observed counts of the same test rows
error_bikly <- test_data$RENTED_BIKE_COUNT - y_pred

# Calculating Squared Errors
squared_error_bikly <- error_bikly^2

# Calculate the average of the squared errors
mean_squared_error_bikly  <- mean(squared_error_bikly)

# Calculate RMSE
rmse_bikly <- sqrt(mean_squared_error_bikly)


# Calculate R-squared

summary_m_bikly <- summary(lm_bikly)
rsq_bikly <- summary_m_bikly$r.squared


# Display the results
results_bikly <- data.frame(
  Model = "Polynomial Regression Model (Bikly)",
  R_squared = rsq_bikly,
  RMSE = rmse_bikly
)

print(results_bikly)

```
The polynomial regression model "Bikly" has an R-squared value of 0.44696063 and an RMSE of 0.2136526.

##### 5.2.2 Model Evaluation

The **R-squared** value of 0.44696063 indicates that approximately 44.70% of the variance in the dependent variable is explained by the independent variables in this model. This suggests a moderate level of explanatory power, meaning that the model captures a significant portion of the variability in the data but leaves some unexplained variance. A higher R-squared value generally indicates a better fit of the model to the data.

The **RMSE (Root Mean Square Error)** value of 0.2136526 measures the average magnitude of the errors between the predicted and observed values. A lower RMSE indicates better predictive accuracy, as it means the model's predictions are closer to the actual values. In this case, the RMSE value is relatively low, suggesting that the model has good predictive performance.

The polynomial regression model "Bikly" demonstrates a moderate level of explanatory power and good predictive accuracy, making it a useful tool for understanding and predicting the dependent variable based on the given predictors.

### 6. Fit the Polynomial Regression Model with Interaction Terms

```{r}
# Fit a polynomial regression model with interaction terms
lm_bikly_interaction <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)

# Print the model summary
summary(lm_bikly_interaction)

```

The regression analysis presented aims to predict the `RENTED_BIKE_COUNT` from a second-degree polynomial in `TEMPERATURE`, together with `HUMIDITY`, `WIND_SPEED`, and the interactions of each with the temperature polynomial. This approach allows for capturing complex relationships between these predictors and the dependent variable, which can be more reflective of real-world scenarios where weather conditions interact in non-linear ways to affect bike rentals.
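To see exactly which terms the `*` operator adds, the design matrix can be inspected with `model.matrix()`. The sketch below uses a small synthetic data frame whose column names mirror the notebook's predictors; the values are made up for illustration.

```r
# Small synthetic frame; only the column names match the notebook.
set.seed(1234)
toy <- data.frame(
  TEMPERATURE = runif(20),
  HUMIDITY    = runif(20),
  WIND_SPEED  = runif(20)
)

# a * b is shorthand for a + b + a:b (main effects plus their interaction).
simple_terms <- colnames(model.matrix(~ TEMPERATURE * HUMIDITY, data = toy))
print(simple_terms)

# The formula used above: each of the two polynomial columns for TEMPERATURE
# is interacted with HUMIDITY and with WIND_SPEED.
full_terms <- colnames(model.matrix(
  ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED,
  data = toy
))
print(full_terms)
```

The repeated `poly(TEMPERATURE, 2)` term is only entered once, so the full model has an intercept, four main-effect columns, and four interaction columns.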

#### 6.1 Model Summary
The regression model includes polynomial terms for `TEMPERATURE` up to the second degree, main effects for `HUMIDITY` and `WIND_SPEED`, and the interactions of both with the `TEMPERATURE` polynomial. The coefficients table shows the estimated effects of each predictor, along with their standard errors, t-values, and p-values.

##### 6.1.1 Coefficients and Significance
The first-degree polynomial term for `TEMPERATURE` has a significant negative coefficient, suggesting that as temperature increases, the number of rented bikes decreases slightly. The second-degree term for `TEMPERATURE` is not significant, indicating that higher-order temperature effects are negligible. `HUMIDITY` has a significant negative coefficient, suggesting that higher humidity levels reduce bike rentals. The interaction term between the first-degree polynomial of `TEMPERATURE` and `HUMIDITY` is significant, indicating that the combined effect of these variables has a meaningful impact on bike rentals. Similarly, `WIND_SPEED` has a significant negative coefficient, and its interaction with the first-degree polynomial of `TEMPERATURE` is significant, suggesting that wind speed also plays a crucial role in bike rentals.

##### 6.1.2 Model Fit
The model's multiple R-squared value is 0.4336, indicating that approximately 43.36% of the variance in `RENTED_BIKE_COUNT` is explained by the model. The adjusted R-squared value is very close, suggesting that the model's explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2e-16, indicating that the model is statistically significant overall.

##### 6.1.3 Theoretical Insights
Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model's explanatory power. On the other hand, RMSE provides a measure of the model's predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.
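For reference, the two metrics can be written as follows, where $y_i$ are the observed values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observed values, and $n$ the number of observations (notation introduced here for illustration):

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

R-squared is unitless and compares the model to a constant prediction at $\bar{y}$, whereas RMSE is expressed in the units of the response.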

In summary, this polynomial regression model captures part of the joint effect of temperature, humidity, and wind speed on bike rentals: most coefficients are significant, with the second-degree temperature term as the exception. The model explains a moderate share of the variance in bike rentals (about 43%), so it is a useful starting point for understanding how weather affects rentals, although more than half of the variation remains unexplained.

#### 6.2 Make Predictions on the Test Dataset

```{r}
# Make predictions on the test dataset using the lm_bikly_interaction model
y_pred_interaction <- predict(lm_bikly_interaction, newdata = test_data)

# Convert negative predictions to zero (a bike count cannot be negative)
y_pred_interaction <- ifelse(y_pred_interaction < 0, 0, y_pred_interaction)

```


##### 6.2.1 Calculate R-squared and RMSE

```{r}
# Calculate errors on the test set (the predictions were made on test_data)
error_bikly_interaction <- test_data$RENTED_BIKE_COUNT - y_pred_interaction

# Calculate squared errors
squared_error_bikly_interaction <- error_bikly_interaction^2

# Calculate the average of the squared errors
mean_squared_error_bikly_interaction <- mean(squared_error_bikly_interaction)

# Calculate RMSE
rmse_bikly_interaction <- sqrt(mean_squared_error_bikly_interaction)

# Extract R-squared from the fitted model (this is the training-set R-squared)
summary_m_bikly_interaction <- summary(lm_bikly_interaction)
rsq_bikly_interaction <- summary_m_bikly_interaction$r.squared


# Display the results
results_bikly_interaction <- data.frame(
  Model = "Polynomial Regression Model (Bikly_interaction)",
  R_squared = rsq_bikly_interaction,
  RMSE = rmse_bikly_interaction
)

print(results_bikly_interaction)
```
Based on the provided data, the polynomial regression model has an R-squared value of 0.4335922 and an RMSE of 0.2168526.

##### 6.2.2 Model Evaluation

The **R-squared** value of 0.4335922 indicates that approximately 43.36% of the variance in `RENTED_BIKE_COUNT` is explained by the predictors in this model. Because it is taken from `summary(lm_bikly_interaction)`, it describes the fit on the data used to estimate the model, not on the test set. This is a moderate level of explanatory power: the model captures a meaningful share of the variability but leaves more than half of it unexplained.

The **RMSE (Root Mean Square Error)** value of 0.2168526 measures the typical magnitude of the errors between predicted and observed values, in the same units as `RENTED_BIKE_COUNT`. A lower RMSE indicates better predictive accuracy, but whether 0.2168526 is low can only be judged against the scale of the response (its range or standard deviation) and against the RMSE of competing models.
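Both metrics can also be computed on held-out data with `yardstick`, which avoids mixing a training-set R-squared with a test-set RMSE. The sketch below uses simulated data and hypothetical object names (`toy_train`, `toy_test`, `toy_lm`); it is an illustration under those assumptions, not a recomputation of the values reported above.

```{r}
library(yardstick)

# Simulated data with a response on a 0-1 style scale (illustrative only)
set.seed(1)
n_obs <- 400
toy <- data.frame(x = runif(n_obs))
toy$y <- 0.2 + 0.5 * toy$x + rnorm(n_obs, sd = 0.1)

# Simple train/test split
toy_train <- toy[1:300, ]
toy_test  <- toy[301:400, ]

# Fit on the training set, predict on the test set, clip negatives to zero
toy_lm   <- lm(y ~ x, data = toy_train)
toy_pred <- as.numeric(pmax(predict(toy_lm, newdata = toy_test), 0))

# Out-of-sample metrics; rsq_trad_vec() uses 1 - SSE/SST
toy_rmse <- rmse_vec(truth = toy_test$y, estimate = toy_pred)
toy_rsq  <- rsq_trad_vec(truth = toy_test$y, estimate = toy_pred)

print(c(RMSE = toy_rmse, R_squared = toy_rsq))
```

Note that `yardstick` also offers `rsq_vec()`, which is the squared correlation between truth and estimate; the two definitions can differ on held-out data.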



### 7. Add regularization

#### 7.1 Create a recipe

```{r}
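bike_recipe <- recipe(RENTED_BIKE_COUNT ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%                        # drop zero-variance columns
  step_normalize(all_numeric_predictors()) %>%         # center and scale numeric predictors
  step_poly(all_numeric_predictors(), degree = 2) %>%  # add second-degree polynomial terms
  step_interact(terms = ~ all_predictors():all_predictors())  # pairwise interactions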


```

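#### 7.2 Specify the workflow

```{r}
# Assumption: define glmnet_spec here if no earlier chunk did (elastic net, as in 7.3)
if (!exists("glmnet_spec")) glmnet_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>% set_engine("glmnet")

bike_workflow <- workflow() %>%
  add_recipe(bike_recipe) %>%
  add_model(glmnet_spec)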

```
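A workflow only bundles a recipe with a model specification; it still has to be fitted before it can predict. The sketch below shows the full cycle on simulated data. All object names (`toy_bikes`, `toy_recipe`, `toy_spec`, `toy_wf_fit`) and the penalty value are assumptions made for this example.

```{r}
library(tidymodels)

# Simulated data (hypothetical, for illustration only)
set.seed(7)
n_rows <- 300
toy_bikes <- tibble(
  temperature = rnorm(n_rows),
  humidity    = rnorm(n_rows),
  wind_speed  = rnorm(n_rows)
)
toy_bikes$rented <- 1 + 0.8 * toy_bikes$temperature - 0.4 * toy_bikes$humidity +
  0.3 * toy_bikes$temperature * toy_bikes$humidity + rnorm(n_rows, sd = 0.3)

# Recipe: normalize numeric predictors and add one interaction term
toy_recipe <- recipe(rented ~ ., data = toy_bikes) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_interact(terms = ~ temperature:humidity)

# Elastic net specification (penalty chosen arbitrarily for the example)
toy_spec <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")

# Bundle, fit, and predict
toy_wf_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(toy_spec) %>%
  fit(data = toy_bikes)

toy_wf_pred <- predict(toy_wf_fit, new_data = toy_bikes)
head(toy_wf_pred)
```

Fitting through the workflow ensures that the same preprocessing steps are applied to any new data passed to `predict()`.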

#### 7.3 Ridge (L2), Lasso (L1), and Elastic Net Regularization

```{r}
# Model 1: Adding regularization (L2 Ridge)
ridge_spe <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_f1 <- ridge_spe %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L1 Lasso)
ridge_spe1 <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

train_f2 <- ridge_spe1 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding regularization (Elastic Net: L1 Lasso and L2 Ridge)
ridge_spe2 <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

train_f3 <- ridge_spe2 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Extract predictions on the training data (in-sample fit)
predic1 <- predict(train_f1, train_data)$.pred
predic2 <- predict(train_f2, train_data)$.pred
predic3 <- predict(train_f3, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse_1 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic1)
rmse_2 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic2)
rmse_3 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic3)

# Combine results into a table
result_regularization <- tibble(
  model = c("Model 1: L2 Ridge", "Model 2: L1 Lasso", "Model 3: L1 Lasso and L2 Ridge"),
  RMSE = c(rmse_1, rmse_2, rmse_3)
)

# Display the results
print(result_regularization)

```

1. **Model 1: L2 Ridge** - This model applies L2 regularization and has an RMSE of 0.1464844, the lowest of the three. L2 regularization limits overfitting by penalizing large coefficients, shrinking them toward zero without removing any predictor.

2. **Model 2: L1 Lasso** - The L1 regularization model, known as Lasso, has the highest RMSE, 0.1805229. Lasso also shrinks coefficients and can set some of them exactly to zero, which performs feature selection. With the same `penalty = 0.1`, the L1 term appears to shrink the coefficients more strongly here, which raises the error on the data the model was fitted to.

3. **Model 3: Combination of L1 Lasso and L2 Ridge** - This model mixes both penalties (`mixture = 0.5`), an approach known as Elastic Net, and has an RMSE of 0.1623183. It sits between Lasso and Ridge, which is consistent with an equal blend of the two penalties at a fixed penalty value.

Overall, Ridge regression (L2) gives the lowest RMSE in this comparison, followed by Elastic Net and then Lasso. Two caveats apply: the RMSE values are computed on `train_data`, so they measure in-sample fit rather than generalization, and all three models use a fixed `penalty = 0.1` that was not tuned. A fair comparison of the three strategies requires tuning the penalty for each model and evaluating on held-out data. Lasso remains useful when feature selection is a goal, even if its RMSE is not the lowest.
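One common way to choose the penalty instead of fixing it is cross-validation. The sketch below uses `glmnet::cv.glmnet()` on simulated data; the matrix `toy_x`, the response `toy_y`, and the resulting table are assumptions for illustration and say nothing about the bike data itself.

```{r}
library(glmnet)

# Simulated predictors and response (illustrative only)
set.seed(123)
n_sim <- 200
toy_x <- matrix(rnorm(n_sim * 5), ncol = 5)
colnames(toy_x) <- paste0("x", 1:5)
toy_y <- 1 + 2 * toy_x[, 1] - 1.5 * toy_x[, 2] + rnorm(n_sim, sd = 0.5)

# 10-fold cross-validation over a grid of penalties for each mixing value
cv_ridge <- cv.glmnet(toy_x, toy_y, alpha = 0)    # L2 only
cv_lasso <- cv.glmnet(toy_x, toy_y, alpha = 1)    # L1 only
cv_enet  <- cv.glmnet(toy_x, toy_y, alpha = 0.5)  # equal blend

# Penalty with the lowest cross-validated error, and the matching RMSE
toy_cv_results <- data.frame(
  model      = c("Ridge", "Lasso", "Elastic Net"),
  lambda_min = c(cv_ridge$lambda.min, cv_lasso$lambda.min, cv_enet$lambda.min),
  cv_rmse    = sqrt(c(min(cv_ridge$cvm), min(cv_lasso$cvm), min(cv_enet$cvm)))
)
print(toy_cv_results)
```

In `glmnet`, `alpha` plays the role of `mixture` in `linear_reg()`, and `lambda` plays the role of `penalty`. Comparing the cross-validated RMSE at each model's own best penalty gives a fairer ranking than comparing all models at one fixed value.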


### 8. Experiment to search for improved models


```{r}
# Define the model specifications
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model 1: Adding more features
train_fit5 <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L2 Ridge)
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_fit6 <- ridge_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding polynomial components
poly_spec <- linear_reg() %>%
  set_engine("lm")

train_fit7 <- poly_spec %>% 
  fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2) + poly(WIND_SPEED, 2) + poly(VISIBILITY, 2) + poly(DEW_POINT_TEMPERATURE, 2) + poly(SOLAR_RADIATION, 2) + poly(RAINFALL, 2) + poly(SNOWFALL, 2), data = train_data)

# Model 4: Adding interaction terms
interaction_spec <- linear_reg() %>%
  set_engine("lm")

train_fit8 <- interaction_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE * HUMIDITY + WIND_SPEED * VISIBILITY + DEW_POINT_TEMPERATURE * SOLAR_RADIATION + RAINFALL * SNOWFALL, data = train_data)

# Model 5: Linear model with a reduced feature set (no VISIBILITY, no DEW_POINT_TEMPERATURE)
reduced_spec <- linear_reg() %>%
  set_engine("lm")

train_fit9 <- reduced_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Extract predictions
pred5 <- predict(train_fit5, train_data)$.pred
pred6 <- predict(train_fit6, train_data)$.pred
pred7 <- predict(train_fit7, train_data)$.pred
pred8 <- predict(train_fit8, train_data)$.pred
pred9 <- predict(train_fit9, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse5 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred5)
rmse6 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred6)
rmse7 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred7)
rmse8 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred8)
rmse9 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred9)

# Combine results into a table
results <- tibble(
  model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components", "Model 4: Interaction Terms", "Model 5: LM"),
  RMSE = c(rmse5, rmse6, rmse7, rmse8, rmse9)
)

# Display the results
print(results)

```



1. **Model 1: More Features** - This model uses eight weather features: temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall. It has an RMSE of 0.1370119 and serves as the baseline for the other models in this section.

2. **Model 2: Ridge Regularization** - This model applies L2 regularization to the same eight features. Its RMSE is slightly higher at 0.1437533. This is expected on the training data: ridge accepts a small increase in in-sample error in exchange for smaller coefficients, so any benefit would only show up on unseen data.

3. **Model 3: Polynomial Components** - By adding second-degree polynomial terms, this model captures non-linear relationships between the predictors and the target variable. It has the lowest RMSE, 0.1291756, about 6% below Model 1. Part of this gain is automatic, since a more flexible model never fits the training data worse than the linear model it contains.

4. **Model 4: Interaction Terms** - This model includes interaction terms between pairs of features, allowing it to capture the combined effect of two variables on the target. With an RMSE of 0.1338293, it performs better than Models 1, 2, and 5 but not as well as the polynomial components model.

5. **Model 5: LM** - This is also an ordinary linear regression (`lm` engine), not a decision tree. It uses a reduced feature set that drops visibility and dew point temperature. Its RMSE of 0.1370168 is almost identical to Model 1's 0.1370119, suggesting that these two features contribute very little once the others are included.

Overall, the polynomial components model (Model 3) shows the lowest RMSE, suggesting that non-linear relationships matter for predicting the rented bike count. All RMSE values in this section are computed on `train_data`, so the ranking reflects in-sample fit; it should be confirmed on the test set before choosing a final model.
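The difference between in-sample and held-out error can be checked by reporting both side by side. The sketch below does this for a linear and a second-degree polynomial model on simulated data; the data frame `toy_d`, the split, and the helper `rmse_fn` are assumptions introduced for this illustration.

```{r}
# Simulated data with a curved relationship (illustrative only)
set.seed(2024)
n_pts <- 600
toy_d <- data.frame(temp = runif(n_pts, -1, 1))
toy_d$count <- 0.5 + 0.3 * toy_d$temp - 0.4 * toy_d$temp^2 + rnorm(n_pts, sd = 0.1)

# 75% / 25% train-test split
train_idx <- sample(n_pts, size = 0.75 * n_pts)
toy_tr <- toy_d[train_idx, ]
toy_te <- toy_d[-train_idx, ]

# Two nested models fitted on the training part only
fit_lin  <- lm(count ~ temp, data = toy_tr)
fit_poly <- lm(count ~ poly(temp, 2), data = toy_tr)

rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Training and test RMSE for each model
toy_cmp <- data.frame(
  model = c("linear", "polynomial (degree 2)"),
  train_rmse = c(rmse_fn(toy_tr$count, predict(fit_lin, toy_tr)),
                 rmse_fn(toy_tr$count, predict(fit_poly, toy_tr))),
  test_rmse  = c(rmse_fn(toy_te$count, predict(fit_lin, toy_te)),
                 rmse_fn(toy_te$count, predict(fit_poly, toy_te)))
)
print(toy_cmp)
```

The training RMSE of the polynomial model cannot exceed that of the linear model because the models are nested; only the `test_rmse` column shows whether the extra flexibility helps on unseen data.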


```{r}

# Create Q-Q Plot for each model
qq_plot <- function(predictions, model_name) {
  ggplot(data.frame(residuals = train_data$RENTED_BIKE_COUNT - predictions), aes(sample = residuals)) +
    stat_qq() +
    stat_qq_line() +
    labs(title = paste("Q-Q Plot for", model_name),
         x = "Theoretical Quantiles",
         y = "Sample Quantiles") +
    theme_minimal()
}

# Generate Q-Q Plots
qq_plot1 <- qq_plot(pred5, "Model 1: More Features")
qq_plot2 <- qq_plot(pred6, "Model 2: Ridge Regularization")
qq_plot3 <- qq_plot(pred7, "Model 3: Polynomial Components")
qq_plot4 <- qq_plot(pred8, "Model 4: Interaction Terms")
qq_plot5 <- qq_plot(pred9, "Model 5: LM")

# Display Q-Q Plots
print(qq_plot1)
print(qq_plot2)
print(qq_plot3)
print(qq_plot4)
print(qq_plot5)

```


