Bike sharing has become a popular transportation option in many cities around the world. With increasing environmental awareness and the need for sustainable transportation options, bike sharing systems have seen significant growth. However, for these systems to operate efficiently, it is crucial to predict the demand for bikes at different stations and locations.

The goal of this project is to develop a predictive model that can estimate bike sharing demand based on various factors such as weather, time of day, day of the week, and special events. Using data analytics and machine learning techniques, the project aims to provide a tool that helps bike sharing system operators optimize bike distribution and availability, thereby improving user experience and operational efficiency.

1. Libraries

library(tidymodels) #For modeling and machine learning 
library(tidyverse) # Share common data representations and 'API' design
library(stringr) # Consistent wrapper for common string operations
library(readr) # Read rectangular text data
library(broom) # Convert statistical objects into tidy tibbles
library(dplyr) # A grammar of data manipulation
library(yardstick) # Tidy characterization of model performance
library(glmnet) # Lasso and elastic-net regularized models
library(kableExtra) # Construct complex tables

2. Database

The database contains detailed weather information, including temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall. Additionally, it records the number of bikes rented per hour and date information from the Seoul bike-sharing system.

Technical Analysis: The weather variables such as temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall are crucial as they can significantly influence the demand for bike rentals. For instance, temperature affects user comfort, while humidity impacts the perception of heat. Wind speed can make biking easier or harder, and visibility is important for cyclist safety. Dew point is an indicator of humidity and thermal comfort, and solar radiation can influence the decision to rent bikes. Snowfall and rainfall are critical factors that can reduce bike rental demand.

The bike rental data, specifically the number of bikes rented per hour, serves as the key dependent variable for regression analysis. The date information allows for the examination of temporal and seasonal patterns in bike usage.

Project Objective: The goal is to use the weather and temporal variables to predict the number of bikes rented per hour. This can help optimize the management of the bike-sharing system, anticipate demand, and improve user experience.

2.1 Variables

The seoul_bike_sharing_converted_normalized.csv file is our main dataset. It has the following variables:

The response variable:

  • RENTED_BIKE_COUNT - Count of bikes rented at each hour

Weather predictor variables:

  • TEMPERATURE - Temperature in Celsius
  • HUMIDITY - Unit is %
  • WIND_SPEED - Unit is m/s
  • VISIBILITY - Unit is 10 m
  • DEW_POINT_TEMPERATURE - The temperature to which the air would have to cool down in order to reach saturation, unit is Celsius
  • SOLAR_RADIATION - MJ/m2
  • RAINFALL - mm
  • SNOWFALL - cm

Date/time predictor variables:

  • DATE - Year-month-day
  • HOUR - Hour of the day
  • FUNCTIONING_DAY - NoFunc (non-functional hours), Fun (functional hours)
  • HOLIDAY - Holiday/No holiday
  • SEASONS - Winter, Spring, Summer, Autumn
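The file name suggests that the columns have already been converted and normalized, and the model summaries later in this report show indicator columns such as SPRING or HOUR_0 rather than the raw categories listed above. The sketch below illustrates one way such a conversion could be done on a toy data frame. It assumes min-max scaling and one indicator column per category, which may differ from the preprocessing actually applied; the toy values and the min_max helper are hypothetical.

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy data in the raw (unconverted) form described in the variable list
toy <- data.frame(
  TEMPERATURE = c(-5, 10, 25),
  SEASONS = c("Winter", "Spring", "Summer")
)

# Normalize the numeric column
toy$TEMPERATURE <- min_max(toy$TEMPERATURE)

# Build one 0/1 indicator column per season (no intercept column)
indicators <- model.matrix(~ SEASONS - 1, data = toy)

print(toy$TEMPERATURE)
print(indicators)
```

Each row of the indicator matrix contains exactly one 1, which is why a full set of indicators is perfectly collinear with a model intercept.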

2.2 Load database

seoul_bike_sharing_converted_normalized <- read_csv("Bases limpias/seoul_bike_sharing_converted_normalized.csv")

2.3 Convert into a df

bike_sharing_df <- seoul_bike_sharing_converted_normalized %>% 
                   select(-DATE, -FUNCTIONING_DAY_YES,-FUNCTIONING_DAY_NO)

We do not use the DATE column in its current form, because it essentially acts as a row index. With more time, DATE could be transformed into new features such as ‘day of the week’ or ‘isWeekend’, which might influence bike rental behavior. The FUNCTIONING_DAY_YES and FUNCTIONING_DAY_NO columns are also dropped because, after missing values were processed, the functioning-day variable contains a single distinct value (YES).
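As a minimal sketch of the DATE transformation mentioned above, the following base R code derives a day-of-week number and a weekend flag. It assumes DATE is stored as year-month-day text, as described in the variable list; the toy dates and the column names DAY_OF_WEEK and IS_WEEKEND are hypothetical. In this project the features would have to be derived before DATE is dropped.

```r
# Toy dates in year-month-day format (hypothetical values)
toy <- data.frame(DATE = c("2017-12-01", "2017-12-02", "2017-12-03"))

# Parse the text into Date objects
toy$DATE <- as.Date(toy$DATE, format = "%Y-%m-%d")

# ISO day of week: 1 = Monday, ..., 7 = Sunday
toy$DAY_OF_WEEK <- as.integer(format(toy$DATE, "%u"))

# Saturday (6) and Sunday (7) count as weekend
toy$IS_WEEKEND <- toy$DAY_OF_WEEK >= 6

print(toy)
```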

3. Split training and testing data

set.seed(1234) # Fix the random seed so the split is reproducible (the value itself is arbitrary)
bike_split <- initial_split(bike_sharing_df, prop = 3/4)
train_data <- training(bike_split)
test_data <- testing(bike_split)

3.1 Build a linear regression model using weather variables only

Weather conditions are likely to influence individuals’ decisions regarding bike rentals. For instance, adverse weather such as cold and rainy conditions may lead people to opt for alternative modes of transportation like buses or taxis. Conversely, favorable weather, such as sunny days, may increase the propensity to rent bikes for short-distance travel.

# Pick linear regression
lm_spec <- linear_reg() %>%
  # Set the engine
  set_engine(engine = "lm")

# Print the model specification
lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 
# Fit the model

lm_model_weather <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

Print the fit summary for the lm_model_weather model.



# Create the table with the regression results

summary(lm_model_weather$fit)

Call:
stats::lm(formula = RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + 
    WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + 
    RAINFALL + SNOWFALL, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38129 -0.08315 -0.01554  0.05840  0.65491 

Coefficients:
                       Estimate Std. Error t value             Pr(>|t|)    
(Intercept)            0.046472   0.017060   2.724              0.00647 ** 
TEMPERATURE            0.631318   0.077664   8.129 0.000000000000000517 ***
HUMIDITY              -0.276702   0.037778  -7.324 0.000000000000269685 ***
WIND_SPEED             0.105148   0.013491   7.794 0.000000000000007544 ***
VISIBILITY             0.006179   0.006982   0.885              0.37615    
DEW_POINT_TEMPERATURE -0.036681   0.082867  -0.443              0.65804    
SOLAR_RADIATION       -0.120406   0.009805 -12.281 < 0.0000000000000002 ***
RAINFALL              -0.582041   0.057417 -10.137 < 0.0000000000000002 ***
SNOWFALL               0.099439   0.037289   2.667              0.00768 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1368 on 6339 degrees of freedom
Multiple R-squared:  0.4385,    Adjusted R-squared:  0.4378 
F-statistic: 618.7 on 8 and 6339 DF,  p-value: < 0.00000000000000022

The regression predicts RENTED_BIKE_COUNT from eight weather variables: TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, and SNOWFALL. The intercept estimate of 0.046472 is the predicted (normalized) rental count when all predictors are zero.

TEMPERATURE has a positive coefficient of 0.631318, so rentals increase with temperature; the effect is highly significant (p-value of about 5.17e-16). HUMIDITY has a negative coefficient of -0.276702, so higher humidity is associated with fewer bike rentals, and this effect is also highly significant. WIND_SPEED has a positive, highly significant coefficient of 0.105148.

VISIBILITY has a very small positive coefficient (0.006179) that is not statistically significant (p = 0.376), and DEW_POINT_TEMPERATURE (-0.036681, p = 0.658) is not significant either. SOLAR_RADIATION has a negative coefficient of -0.120406 and RAINFALL a negative coefficient of -0.582041; both are highly significant. SNOWFALL has a positive coefficient of 0.099439, significant at the 1% level. The signs for solar radiation and snowfall run against intuition. One possible explanation is correlation among the weather predictors (for example, between temperature and dew point), in which case each coefficient should be read as an effect with the other variables held fixed.

The residual standard error is 0.1368 on 6339 degrees of freedom, which is the typical distance between the observed values and the fitted regression. The multiple R-squared is 0.4385 (adjusted 0.4378), so the weather variables explain about 44% of the variance in the training data, and the F-statistic of 618.7 (p-value < 2.2e-16) indicates that the model is statistically significant overall.

Overall, the analysis shows that weather conditions significantly affect bike rentals: temperature has the largest positive coefficient, followed by wind speed and snowfall, while rainfall, humidity, and solar radiation have negative coefficients.
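Reading signs and significance levels off a printed summary is easy to get wrong, so it can help to extract them programmatically. The sketch below does this for a small simulated data set; the data, the variable names, and the 0.05 threshold are assumptions for illustration and are not taken from the bike-sharing data. For a parsnip fit such as lm_model_weather, the same table would come from coef(summary(lm_model_weather$fit)).

```r
set.seed(7)

# Simulated toy data: rentals rise with temperature and fall with rainfall
n <- 300
toy <- data.frame(
  temperature = runif(n),
  rainfall = runif(n),
  noise = runif(n)
)
toy$rentals <- 0.1 + 0.6 * toy$temperature - 0.5 * toy$rainfall +
  rnorm(n, sd = 0.1)

fit <- lm(rentals ~ temperature + rainfall + noise, data = toy)

# Coefficient matrix as a data frame, one row per term
coef_table <- as.data.frame(coef(summary(fit)))

# Add the direction of each effect and a significance flag at the 5% level
coef_table$direction <- ifelse(coef_table$Estimate > 0, "positive", "negative")
coef_table$significant <- coef_table[["Pr(>|t|)"]] < 0.05

print(coef_table[, c("Estimate", "direction", "significant")])
```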

3.2 Build a linear regression model using all variables

lm_model_all <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ ., data = train_data)

Print the fit summary for lm_model_all.

summary(lm_model_all$fit)

Call:
stats::lm(formula = RENTED_BIKE_COUNT ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.39281 -0.06190 -0.00194  0.05861  0.49754 

Coefficients: (4 not defined because of singularities)
                       Estimate Std. Error t value             Pr(>|t|)    
(Intercept)            0.087389   0.015264   5.725  0.00000001081522389 ***
TEMPERATURE            0.196831   0.063112   3.119             0.001824 ** 
HUMIDITY              -0.262832   0.029823  -8.813 < 0.0000000000000002 ***
WIND_SPEED            -0.005507   0.011317  -0.487             0.626537    
VISIBILITY             0.007780   0.005690   1.367             0.171578    
DEW_POINT_TEMPERATURE  0.200671   0.066111   3.035             0.002412 ** 
SOLAR_RADIATION        0.077443   0.011718   6.609  0.00000000004188262 ***
RAINFALL              -0.688387   0.045162 -15.242 < 0.0000000000000002 ***
SNOWFALL               0.068333   0.029379   2.326             0.020057 *  
SPRING                 0.058158   0.005407  10.755 < 0.0000000000000002 ***
SUMMER                 0.057523   0.008139   7.067  0.00000000000175034 ***
AUTUMN                 0.101296   0.005643  17.951 < 0.0000000000000002 ***
WINTER                       NA         NA      NA                   NA    
HOLIDAY_YES                  NA         NA      NA                   NA    
HOLIDAY_NO                   NA         NA      NA                   NA    
HOUR_0                -0.032833   0.009085  -3.614             0.000304 ***
HOUR_1                -0.060741   0.009369  -6.483  0.00000000009673282 ***
HOUR_2                -0.095024   0.009283 -10.237 < 0.0000000000000002 ***
HOUR_3                -0.118588   0.009296 -12.757 < 0.0000000000000002 ***
HOUR_4                -0.134490   0.009324 -14.424 < 0.0000000000000002 ***
HOUR_5                -0.131373   0.009355 -14.044 < 0.0000000000000002 ***
HOUR_6                -0.083756   0.009493  -8.823 < 0.0000000000000002 ***
HOUR_7                 0.002330   0.009229   0.252             0.800703    
HOUR_8                 0.120016   0.009489  12.648 < 0.0000000000000002 ***
HOUR_9                -0.029439   0.009677  -3.042             0.002358 ** 
HOUR_10               -0.089353   0.009933  -8.996 < 0.0000000000000002 ***
HOUR_11               -0.095492   0.010401  -9.181 < 0.0000000000000002 ***
HOUR_12               -0.084640   0.010827  -7.818  0.00000000000000626 ***
HOUR_13               -0.085297   0.010742  -7.941  0.00000000000000236 ***
HOUR_14               -0.080889   0.010658  -7.590  0.00000000000003666 ***
HOUR_15               -0.051860   0.010369  -5.001  0.00000058457951176 ***
HOUR_16               -0.019734   0.009990  -1.975             0.048272 *  
HOUR_17                0.056302   0.009708   5.800  0.00000000697015086 ***
HOUR_18                0.192734   0.009389  20.527 < 0.0000000000000002 ***
HOUR_19                0.113208   0.009394  12.052 < 0.0000000000000002 ***
HOUR_20                0.092515   0.009361   9.883 < 0.0000000000000002 ***
HOUR_21                0.093144   0.009312  10.003 < 0.0000000000000002 ***
HOUR_22                0.068434   0.009194   7.444  0.00000000000011101 ***
HOUR_23                      NA         NA      NA                   NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1063 on 6313 degrees of freedom
Multiple R-squared:  0.6621,    Adjusted R-squared:  0.6603 
F-statistic: 363.8 on 34 and 6313 DF,  p-value: < 0.00000000000000022

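The model explains approximately 66.2% of the variance in RENTED_BIKE_COUNT on the training data, as indicated by the R-squared value (adjusted R-squared: 0.6603). Most of the HOUR indicators, the season indicators, HUMIDITY, and RAINFALL are highly significant. WIND_SPEED, which was significant in the weather-only model, and VISIBILITY are not significant here. Four coefficients (WINTER, HOLIDAY_YES, HOLIDAY_NO, and HOUR_23) are reported as NA because those columns are linear combinations of other columns in the model, so lm() cannot estimate them separately. The high F-statistic and its p-value indicate that the model is statistically significant overall.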
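The "not defined because of singularities" note in the summary can be reproduced on a small example. The sketch below uses simulated data with hypothetical season indicators: when a complete set of 0/1 indicator columns is used together with an intercept, the columns sum to the intercept column and lm() leaves the last one without an estimate.

```r
set.seed(1)

# Four seasons, 50 observations each (simulated)
season_levels <- c("SPRING", "SUMMER", "AUTUMN", "WINTER")
season <- factor(rep(season_levels, each = 50), levels = season_levels)

# One 0/1 indicator column per season, named after the season
dummies <- as.data.frame(model.matrix(~ season - 1))
names(dummies) <- levels(season)

# Response unrelated to season; only the design matrix matters here
toy <- data.frame(y = rnorm(200), dummies)

# Intercept plus all four indicators: the design matrix is rank deficient
fit <- lm(y ~ ., data = toy)
print(coef(fit))
```

Dropping one indicator per group (or letting R build the indicators from a factor) removes the redundancy without changing the fitted values.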

4. Model evaluation

Model evaluation is crucial in regression analysis because it helps determine how well a model fits the data and predicts outcomes. R-squared is a key metric that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value signifies that the model explains a greater portion of the variance, suggesting a better fit. This is important for understanding the strength of the relationship between the predictors and the outcome, and for assessing the model’s explanatory power.

RMSE (Root Mean Square Error), on the other hand, measures the average magnitude of the errors between predicted and observed values. It provides insight into the model’s predictive accuracy. A lower RMSE indicates that the model’s predictions are closer to the actual values, which is essential for making reliable forecasts. Evaluating models using both R-squared and RMSE ensures a balanced assessment of their performance, considering both the goodness of fit and the precision of predictions. This comprehensive evaluation helps in selecting the most appropriate model for practical applications.
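The two metrics described above can be written as short functions. This is a minimal sketch in base R; the function names rmse_manual and rsq_manual are hypothetical, and the R-squared shown is the 1 - SS_res / SS_tot form, which can differ from a squared-correlation definition when applied to held-out data.

```r
# Root mean squared error between observed and predicted values
rmse_manual <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))  # guard against misaligned inputs
  sqrt(mean((truth - estimate)^2))
}

# R-squared as 1 minus residual sum of squares over total sum of squares
rsq_manual <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))
  ss_res <- sum((truth - estimate)^2)
  ss_tot <- sum((truth - mean(truth))^2)
  1 - ss_res / ss_tot
}

# Tiny worked example: three exact predictions and one that is off by 2
truth <- c(1, 2, 3, 4)
estimate <- c(1, 2, 3, 6)
rmse_manual(truth, estimate)  # sqrt(4 / 4) = 1
rsq_manual(truth, estimate)   # 1 - 4 / 5 = 0.2
```

The length check is deliberate: comparing predictions with responses from a different set of rows silently produces a number that is not a prediction error.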

4.1 Root Mean Squared Error (RMSE)

# Make predictions on the test set (predict() returns a tibble with a .pred column)
predictions_weather <- predict(lm_model_weather, new_data = test_data)
predictions_all <- predict(lm_model_all, new_data = test_data)
# Calculate errors against the test-set responses (same rows as the predictions)
error_weather <- test_data$RENTED_BIKE_COUNT - predictions_weather$.pred
error_all <- test_data$RENTED_BIKE_COUNT - predictions_all$.pred
# Calculate squared errors
squared_error_weather <- error_weather^2
squared_error_all <- error_all^2
# Calculate the average of the squared errors
mean_squared_error_weather <- mean(squared_error_weather)
mean_squared_error_all <- mean(squared_error_all)
# Calculate RMSE
rmse_weather <- sqrt(mean_squared_error_weather)
rmse_all <- sqrt(mean_squared_error_all)
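RMSE is only meaningful when each prediction is compared with the response from the same row. The simulation below is a hedged illustration of what happens otherwise; the data and the names toy_train, toy_test, rmse_aligned, and rmse_misaligned are hypothetical and unrelated to the bike-sharing data.

```r
set.seed(123)

# Simulated data with a simple linear signal
n <- 400
x <- runif(n)
toy <- data.frame(x = x, y = 0.2 + 0.6 * x + rnorm(n, sd = 0.1))

# 3/4 training, 1/4 testing
idx <- sample(n, size = 300)
toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

fit <- lm(y ~ x, data = toy_train)
pred <- predict(fit, newdata = toy_test)

# Correct: test predictions against test responses
rmse_aligned <- sqrt(mean((toy_test$y - pred)^2))

# Incorrect: test predictions against unrelated training responses
rmse_misaligned <- sqrt(mean((toy_train$y[seq_along(pred)] - pred)^2))

c(aligned = rmse_aligned, misaligned = rmse_misaligned)
```

The misaligned value is inflated because it mixes the spread of the response with the spread of the predictions instead of measuring prediction error.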

4.2 R-squared


summary_m_weather <- summary(lm_model_weather$fit)
r2_weather <- summary_m_weather$r.squared


summary_m_all <- summary(lm_model_all$fit)
r2_all <- summary_m_all$r.squared

4.3 Comparing models


results <- data.frame(
  Model = c("Weather Model", "All Variables Model"),
  R_squared = c(r2_weather, r2_all),
  RMSE = c(rmse_weather, rmse_all)
)

print(results)

The “Weather Model” has an R-squared value of 0.4303461 and an RMSE of 0.2224554. The R-squared value, which is computed on the training data, indicates that approximately 43.03% of the variance in RENTED_BIKE_COUNT is explained by the weather predictors.

The “All Variables Model” has an R-squared value of 0.6602304 and an RMSE of 0.2348928. It explains approximately 66.02% of the variance, considerably more than the “Weather Model”, yet its reported RMSE is slightly higher.

These RMSE figures should be treated with caution. Both are well above the residual standard errors reported in the model summaries (0.1368 and 0.1063), and they appear to come from a run in which test-set predictions were subtracted from training-set responses, so the errors were not computed on matching rows. RMSE has to be computed against test_data$RENTED_BIKE_COUNT before the two models can be compared on predictive accuracy.

In general, the choice of model depends on the trade-off between goodness of fit (R-squared) and predictive accuracy (RMSE on held-out data), where a lower RMSE means that predictions are closer to the observed values. On the evidence that is reliable here, the “All Variables Model” fits the training data substantially better. A final choice should wait for test-set RMSE values computed on aligned rows.

4.4 Bar Chart for Coefficients (All Variables Model)


# Get the model coefficients
coefficients_all <- tidy(lm_model_all)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_all, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "maroon") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "black") +
  labs(title = "Coefficients of Linear Regression Model (All)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip()  # Flip the chart for better readability

4.5 Bar Chart for Coefficients (Weather Model)


# Get the model coefficients
coefficients_weather <- tidy(lm_model_weather)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_weather, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "darkred") +
  labs(title = "Coefficients of Linear Regression Model (Weather)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip()  # Flip the chart for better readability


5. Add polynomial terms


# Plot the higher order polynomial fits

# TEMPERATURE is the predictor (x) and RENTED_BIKE_COUNT the response (y)
ggplot(train_data, aes(x = TEMPERATURE, y = RENTED_BIKE_COUNT)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), color = "green", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "purple", se = FALSE) +
  labs(title = "Polynomial Regression Fits",
       x = "Temperature",
       y = "Rented Bikes") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  )
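The overlay above compares polynomial degrees visually. A numeric comparison can complement it; the sketch below fits degrees 1 to 5 to simulated data (the data and variable names are hypothetical) and tabulates R-squared and adjusted R-squared for each degree.

```r
set.seed(99)

# Simulated toy data with a quadratic relationship
n <- 500
temperature <- runif(n)
rentals <- 0.1 + 0.9 * temperature - 0.6 * temperature^2 + rnorm(n, sd = 0.05)
toy <- data.frame(temperature, rentals)

degrees <- 1:5

# Fit one polynomial model per degree and collect the fit statistics
fits <- lapply(degrees, function(d) {
  summary(lm(rentals ~ poly(temperature, d), data = toy))
})
r2 <- sapply(fits, function(s) s$r.squared)
adj_r2 <- sapply(fits, function(s) s$adj.r.squared)

comparison <- data.frame(degree = degrees, r_squared = r2, adj_r_squared = adj_r2)
print(comparison)
```

Training R-squared can only stay the same or increase as the degree grows, because the models are nested; adjusted R-squared or a held-out RMSE is a fairer guide to whether an extra degree is worth keeping.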

5.1 Fit the Polynomial Regression Model

# Assuming the important variables are TEMPERATURE and HUMIDITY
lm_bikly <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2), data = train_data)

# Print the model summary
summary(lm_bikly)

Call:
lm(formula = RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 
    2), data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33139 -0.08465 -0.02142  0.06042  0.65952 

Coefficients:
                       Estimate Std. Error t value            Pr(>|t|)    
(Intercept)            0.204909   0.001699  120.61 <0.0000000000000002 ***
poly(TEMPERATURE, 2)1  8.784310   0.137846   63.73 <0.0000000000000002 ***
poly(TEMPERATURE, 2)2 -1.673604   0.140230  -11.94 <0.0000000000000002 ***
poly(HUMIDITY, 2)1    -4.779434   0.141199  -33.85 <0.0000000000000002 ***
poly(HUMIDITY, 2)2    -2.570957   0.136853  -18.79 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1354 on 6343 degrees of freedom
Multiple R-squared:  0.4469,    Adjusted R-squared:  0.4466 
F-statistic:  1281 on 4 and 6343 DF,  p-value: < 0.00000000000000022

The regression analysis presented aims to predict the RENTED_BIKE_COUNT based on polynomial transformations of TEMPERATURE and HUMIDITY. This approach allows for capturing non-linear relationships between these predictors and the dependent variable, which can be more reflective of real-world scenarios where changes in temperature and humidity might not have a straightforward linear effect on bike rentals.

5.1.1 Model Summary

The regression model includes polynomial terms up to the second degree for both TEMPERATURE and HUMIDITY. The coefficients table shows the estimated effect of each term, along with its standard error, t-value, and p-value. The intercept is highly significant, with an estimate of 0.204909 and a very low p-value (<2e-16). Because poly() builds orthogonal polynomial terms by default, the intercept equals the mean of RENTED_BIKE_COUNT in the training data.

5.1.2 Coefficients and Significance

The first-degree polynomial term for TEMPERATURE has a large positive coefficient (8.784310) and is highly significant (p-value <2e-16), suggesting that bike rentals increase as temperature rises. The second-degree term for TEMPERATURE has a negative coefficient (-1.673604) and is also significant (p-value <2e-16), indicating a concave relationship: the increase in rentals flattens, and may reverse, at higher temperatures. The first-degree term for HUMIDITY has a significant negative coefficient (-4.779434), suggesting that higher humidity levels reduce bike rentals. The second-degree term for HUMIDITY is also negative (-2.570957) and significant, so the relationship with humidity is concave as well. Because the terms are orthogonal polynomials rather than raw powers, the sizes of these coefficients are not on the scale of the original variables and should not be compared directly with those of the linear models.
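The point about coefficient scale can be checked on a small example. In the sketch below (simulated data, hypothetical names), the default orthogonal basis and the raw-power basis give different coefficients but identical fitted values, and the intercept of the orthogonal fit equals the mean of the response.

```r
set.seed(42)

# Simulated toy data with a concave relationship
n <- 100
x <- runif(n)
toy <- data.frame(x = x, y = 0.2 + 1.5 * x - 1.0 * x^2 + rnorm(n, sd = 0.05))

# Default: orthogonal polynomial basis
fit_orth <- lm(y ~ poly(x, 2), data = toy)

# Raw powers x and x^2
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = toy)

# Different parameterizations of the same model
print(coef(fit_orth))
print(coef(fit_raw))

# Fitted values agree up to numerical tolerance
same_fit <- isTRUE(all.equal(fitted(fit_orth), fitted(fit_raw)))

# The orthogonal-basis intercept is the mean of the response
intercept_is_mean <- isTRUE(all.equal(coef(fit_orth)[["(Intercept)"]], mean(toy$y)))

c(same_fit = same_fit, intercept_is_mean = intercept_is_mean)
```

If coefficients on the scale of the original variables are needed for interpretation, raw = TRUE is one option, at the cost of more strongly correlated terms.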

5.1.3 Model Fit

The model’s multiple R-squared value is 0.4469, indicating that approximately 44.69% of the variance in RENTED_BIKE_COUNT is explained by the model. The adjusted R-squared value (0.4466) is almost identical, suggesting that the model’s explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2.2e-16, indicating that the model is statistically significant overall.

5.1.4 Theoretical Insights

Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model’s explanatory power. On the other hand, RMSE provides a measure of the model’s predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.

In summary, this polynomial regression model effectively captures the non-linear effects of temperature and humidity on bike rentals, with significant coefficients for both predictors. The model explains a substantial portion of the variance in bike rentals, making it a useful tool for understanding and predicting bike rental patterns based on weather conditions.

5.2 Make Predictions on the Test Dataset

# Make predictions on the test dataset using the lm_bikly model
y_pred <- predict(lm_bikly, newdata = test_data)

# Convert negative predictions to zero
y_pred <- ifelse(y_pred < 0, 0, y_pred)

5.2.1 Calculate R-squared and RMSE

# Calculate errors against the test-set responses (same rows as the predictions)
error_bikly <- test_data$RENTED_BIKE_COUNT - y_pred
# Calculating Squared Errors
squared_error_bikly <- error_bikly^2

# Calculate the average of the squared errors
mean_squared_error_bikly  <- mean(squared_error_bikly)

# Calculate RMSE
rmse_bikly <- sqrt(mean_squared_error_bikly)


# Calculate R-squared

summary_m_bikly <- summary(lm_bikly)
rsq_bikly <- summary_m_bikly$r.squared


# Display the results
results_bikly <- data.frame(
  Model = "Polynomial Regression Model (Bikly)",
  R_squared = rsq_bikly,
  RMSE = rmse_bikly
)

print(results_bikly)

Based on this output, the polynomial regression model lm_bikly has an R-squared value of 0.44696063 and an RMSE of 0.2136526.

5.2.2 Model Evaluation

The R-squared value of 0.44696063 indicates that approximately 44.70% of the variance in RENTED_BIKE_COUNT is explained by the model on the training data. This is a moderate level of explanatory power: the model captures a sizeable share of the variability but leaves more than half of it unexplained.

The RMSE (Root Mean Square Error) value of 0.2136526 measures the typical magnitude of the errors between predicted and observed values; a lower RMSE indicates better predictive accuracy. This figure should be read with caution. It is well above the model’s residual standard error of 0.1354, and it appears to have been computed against training-set responses rather than the test-set rows that were predicted. If the response is scaled to the range 0 to 1, as the file name and the coefficient magnitudes suggest, an RMSE above 0.2 would not indicate strong predictive performance in any case.

5.2.3 Theoretical Insights

The polynomial regression model lm_bikly demonstrates a moderate level of explanatory power. Its predictive accuracy should be confirmed with an RMSE computed on the test set before it is used to predict bike rentals from temperature and humidity.

6. Fit the Polynomial Regression Model with Interaction Terms

# Fit a polynomial regression model with interaction terms
lm_bikly_interaction <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)

# Print the model summary
summary(lm_bikly_interaction)

Call:
lm(formula = RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY + 
    poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38995 -0.08537 -0.02348  0.05322  0.66347 

Coefficients:
                                  Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                       0.332614   0.007418  44.836 < 0.0000000000000002 ***
poly(TEMPERATURE, 2)1            11.106750   0.613015  18.118 < 0.0000000000000002 ***
poly(TEMPERATURE, 2)2            -3.933440   0.663528  -5.928 0.000000003226468181 ***
HUMIDITY                         -0.245753   0.009693 -25.355 < 0.0000000000000002 ***
WIND_SPEED                        0.099425   0.013255   7.501 0.000000000000072139 ***
poly(TEMPERATURE, 2)1:HUMIDITY   -6.744694   0.821792  -8.207 0.000000000000000272 ***
poly(TEMPERATURE, 2)2:HUMIDITY    4.200620   0.976027   4.304 0.000017042362283015 ***
poly(TEMPERATURE, 2)1:WIND_SPEED  5.105744   1.055849   4.836 0.000001358254989286 ***
poly(TEMPERATURE, 2)2:WIND_SPEED  2.539413   1.098471   2.312               0.0208 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.137 on 6339 degrees of freedom
Multiple R-squared:  0.4336,    Adjusted R-squared:  0.4329 
F-statistic: 606.6 on 8 and 6339 DF,  p-value: < 0.00000000000000022

The regression predicts RENTED_BIKE_COUNT from a second-degree polynomial in TEMPERATURE, the linear terms HUMIDITY and WIND_SPEED, and the interactions of the temperature polynomial with each of them. This approach allows for capturing complex relationships between these predictors and the dependent variable, which can be more reflective of real-world scenarios where weather conditions interact in non-linear ways to affect bike rentals.

6.1 Model Summary

The regression model includes polynomial terms for TEMPERATURE up to the second degree, linear terms for HUMIDITY and WIND_SPEED, and the interactions of both temperature terms with HUMIDITY and with WIND_SPEED. The coefficients table shows the estimated effect of each term, along with its standard error, t-value, and p-value.
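The set of terms in this model follows from how R expands the * operator in a formula: a * b stands for a + b + a:b. As a small check, the sketch below expands the model formula with terms() without fitting anything; no data are needed for this step.

```r
# The model formula used for lm_bikly_interaction
interaction_formula <- RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY +
  poly(TEMPERATURE, 2) * WIND_SPEED

# Expand the formula into its individual terms
term_labels <- attr(terms(interaction_formula), "term.labels")
print(term_labels)
```

The duplicated poly(TEMPERATURE, 2) main effect is listed once, followed by the two linear terms and the two interactions, which matches the grouping of rows in the coefficients table.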

6.1.1 Coefficients and Significance

The first-degree polynomial term for TEMPERATURE has a large, highly significant positive coefficient (11.106750), suggesting that rentals increase with temperature. The second-degree term for TEMPERATURE is negative (-3.933440) and also highly significant, so the temperature effect is concave rather than negligible at higher order. HUMIDITY has a significant negative coefficient (-0.245753), suggesting that higher humidity levels reduce bike rentals. WIND_SPEED has a significant positive coefficient (0.099425). All four interaction terms are significant: both temperature terms interact with HUMIDITY (p-values below 0.001), and both interact with WIND_SPEED, although the second-degree interaction with WIND_SPEED is significant only at the 5% level (p = 0.0208). Because of the interactions, the effect of temperature on rentals depends on the humidity and wind speed at which it is evaluated, so the main-effect coefficients should not be interpreted in isolation.

6.1.2 Model Fit

The model’s multiple R-squared value is 0.4336, indicating that approximately 43.36% of the variance in RENTED_BIKE_COUNT is explained by the model. The adjusted R-squared value (0.4329) is very close, suggesting that the model’s explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2.2e-16, indicating that the model is statistically significant overall.

6.1.3 Theoretical Insights

In summary, this model captures interactions between temperature, humidity, and wind speed, and all of its coefficients are significant. However, its R-squared (0.4336) is slightly lower than that of the quadratic model in section 5.1 (0.4469) and of the linear weather model (0.4385), so the interaction terms alone do not improve the fit.

6.2 Make Predictions on the Test Dataset

# Make predictions on the test dataset using the lm_bikly_interaction model
y_pred_interaction <- predict(lm_bikly_interaction, newdata = test_data)

# Convert negative predictions to zero
y_pred_interaction <- ifelse(y_pred_interaction < 0, 0, y_pred_interaction)

6.2.1 Calculate R-squared and RMSE

# Calculate errors against the test-set responses (same rows as the predictions)
error_bikly_interaction <- test_data$RENTED_BIKE_COUNT - y_pred_interaction

# Calculating Squared Errors
squared_error_bikly_interaction <- error_bikly_interaction^2

# Calculate the average of the squared errors
mean_squared_error_bikly_interaction  <- mean(squared_error_bikly_interaction)

# Calculate RMSE
rmse_bikly_interaction <- sqrt(mean_squared_error_bikly_interaction)


# Calculate R-squared

summary_m_bikly_interaction <- summary(lm_bikly_interaction)
rsq_bikly_interaction <- summary_m_bikly_interaction$r.squared


# Display the results
results_bikly_interaction <- data.frame(
  Model = "Polynomial Regression Model (Bikly_interaction)",
  R_squared = rsq_bikly_interaction,
  RMSE = rmse_bikly_interaction
)

print(results_bikly_interaction)

Based on this output, the interaction model lm_bikly_interaction has an R-squared value of 0.4335922 and an RMSE of 0.2168526.

6.2.2 Model Evaluation

The R-squared value of 0.4335922 indicates that approximately 43.36% of the variance in RENTED_BIKE_COUNT is explained by the model on the training data. This is a moderate level of explanatory power that leaves more than half of the variability unexplained.

The RMSE (Root Mean Square Error) value of 0.2168526 measures the typical magnitude of the errors between predicted and observed values; a lower RMSE indicates better predictive accuracy. As with the previous model, this figure should be read with caution: it is well above the residual standard error of 0.137, and it appears to have been computed against training-set responses rather than the test-set rows that were predicted. It should be recomputed on the test set before the model’s predictive performance is judged.

6.2.3 Theoretical Insights

Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model’s explanatory power. On the other hand, RMSE provides a measure of the model’s predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.

7. Add regularization

7.1 Create a recipe

bike_recipe <- recipe(RENTED_BIKE_COUNT ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_poly(all_predictors(), degree = 2) %>%
  step_interact(terms = ~ all_predictors():all_predictors())

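7.2 Specify the workflow and fit the model

# Assumption: glmnet_spec is not defined in the notebook; an elastic net spec is used here
glmnet_spec <- linear_reg(penalty = 0.1, mixture = 0.5) %>% set_engine("glmnet")

bike_workflow <- workflow() %>%
  add_recipe(bike_recipe) %>%
  add_model(glmnet_spec)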

7.3 Ridge (L2), Lasso (L1), and Elastic Net Regularization

# Model 1: Adding regularization (L2 Ridge)
ridge_spe <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_f1 <- ridge_spe %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L1 Lasso)
ridge_spe1 <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

train_f2 <- ridge_spe1 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY+ SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding regularization (Elastic Net: L1 Lasso and L2 Ridge)
ridge_spe2 <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

train_f3 <- ridge_spe2 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Extract predictions
predic1 <- predict(train_f1, train_data)$.pred
predic2 <- predict(train_f2, train_data)$.pred
predic3 <- predict(train_f3, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse_1 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic1)
rmse_2 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic2)
rmse_3 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic3)

# Combine results into a table
result_regularization <- tibble(
  model = c("Model 1: L2 Ridge", "Model 2: L1 Lasso", "Model 3: L1 Lasso and L2 Ridge"),
  RMSE = c(rmse_1, rmse_2, rmse_3)
)

# Display the results
print(result_regularization)
  1. Model 1: L2 Ridge - This model applies L2 regularization and has an RMSE of 0.1464844, the lowest of the three. L2 regularization guards against overfitting by penalizing large coefficients, shrinking them toward zero without removing any predictor.

  2. Model 2: L1 Lasso - The L1 regularization model, known as Lasso, has a higher RMSE of 0.1805229. Lasso also guards against overfitting and additionally performs feature selection by shrinking some coefficients exactly to zero. The higher RMSE suggests that, at a penalty of 0.1, Lasso shrinks or drops predictors that carry useful signal.

  3. Model 3: Combination of L1 Lasso and L2 Ridge - This model combines both penalties (mixture = 0.5), an approach usually called Elastic Net, and has an RMSE of 0.1623183. It performs better than Lasso alone but not as well as Ridge regression.

Overall, Ridge regression (L2) gives the lowest RMSE in this comparison, Lasso (L1) the highest, and the Elastic Net combination falls in between. These RMSE values are computed on the training data with a single fixed penalty of 0.1, so they describe in-sample fit at that penalty rather than generalization to new data. The appropriate regularization technique therefore still depends on the characteristics of the data and the modeling goals: Lasso remains useful when feature selection matters, even if it does not yield the lowest RMSE here.
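
The effect of the penalty strength on in-sample error can be illustrated with a minimal ridge sketch in base R. Everything here is an assumption made for illustration: the data are simulated, `ridge_coef()` and `train_rmse()` are hypothetical helpers, and the closed-form `lambda` is not on the same scale as the `penalty` argument passed to glmnet, which scales its objective differently.

```r
# Simulated data (hypothetical; not the Seoul bike data)
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), ncol = 3)
y <- drop(x %*% c(0.6, -0.3, 0.1)) + rnorm(n, sd = 0.2)

# Standardize predictors and center the response so no intercept is needed
xs <- scale(x)
yc <- y - mean(y)

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(xs, yc, lambda) {
  p <- ncol(xs)
  solve(crossprod(xs) + lambda * diag(p), crossprod(xs, yc))
}

# RMSE on the same data that was used for fitting
train_rmse <- function(lambda) {
  beta <- ridge_coef(xs, yc, lambda)
  sqrt(mean((yc - xs %*% beta)^2))
}

lambdas <- c(0, 1, 10, 100)
rmse_by_lambda <- sapply(lambdas, train_rmse)
coef_norm_by_lambda <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(xs, yc, l)^2)))

print(data.frame(lambda = lambdas,
                 train_rmse = rmse_by_lambda,
                 coef_norm = coef_norm_by_lambda))
```

As the penalty grows, the coefficients shrink and the training RMSE can only go up, which is why a comparison on training data tends to favor the weakest penalty. Choosing the penalty on held-out data or by cross-validation avoids that bias.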

8. Experiment to search for improved models

# Define the model specifications
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model 1: Adding more features
train_fit5 <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L2 Ridge)
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_fit6 <- ridge_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding polynomial components
poly_spec <- linear_reg() %>%
  set_engine("lm")

train_fit7 <- poly_spec %>% 
  fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2) + poly(WIND_SPEED, 2) + poly(VISIBILITY, 2) + poly(DEW_POINT_TEMPERATURE, 2) + poly(SOLAR_RADIATION, 2) + poly(RAINFALL, 2) + poly(SNOWFALL, 2), data = train_data)

# Model 4: Adding interaction terms
interaction_spec <- linear_reg() %>%
  set_engine("lm")

train_fit8 <- interaction_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE * HUMIDITY + WIND_SPEED * VISIBILITY + DEW_POINT_TEMPERATURE * SOLAR_RADIATION + RAINFALL * SNOWFALL, data = train_data)

# Model 5: Linear model with a reduced feature set (no VISIBILITY or DEW_POINT_TEMPERATURE)
reduced_spec <- linear_reg() %>%
  set_engine("lm")

train_fit9 <- reduced_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Extract predictions
pred5 <- predict(train_fit5, train_data)$.pred
pred6 <- predict(train_fit6, train_data)$.pred
pred7 <- predict(train_fit7, train_data)$.pred
pred8 <- predict(train_fit8, train_data)$.pred
pred9 <- predict(train_fit9, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse5 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred5)
rmse6 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred6)
rmse7 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred7)
rmse8 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred8)
rmse9 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred9)

# Combine results into a table
results <- tibble(
  model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components", "Model 4: Interaction Terms", "Model 5: LM"),
  RMSE = c(rmse5, rmse6, rmse7, rmse8, rmse9)
)

# Display the results
print(results)
  1. Model 1: More Features - This model includes multiple features such as temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall. It has an RMSE of 0.1370119, indicating a relatively good fit. The inclusion of diverse features helps capture various aspects affecting the rented bike count.

  2. Model 2: Ridge Regularization - This model applies L2 regularization to guard against overfitting. Its RMSE is slightly higher at 0.1437533. This is expected: the RMSE is measured on the training data, where an unpenalized least-squares fit of the same features (Model 1) always achieves the lowest possible error, so the benefit of regularization can only show up on unseen data.

  3. Model 3: Polynomial Components - By adding polynomial components, this model captures non-linear relationships between the predictors and the target variable. It has the lowest RMSE of 0.1291756, indicating that non-linear transformations of the features significantly improve the model’s predictive accuracy.

  4. Model 4: Interaction Terms - This model includes interaction terms between pairs of features, allowing it to capture the combined effect of two variables on the target. With an RMSE of 0.1338293, it performs better than the ridge regularization model but not as well as the polynomial components model. Interaction terms can be useful but may not always lead to the best performance.

  5. Model 5: LM - This model is an ordinary linear regression (the specification uses `linear_reg()` with the `lm` engine, not a decision tree) fitted on a reduced feature set: temperature, humidity, wind speed, solar radiation, rainfall, and snowfall. Its RMSE of 0.1370168 is nearly identical to that of Model 1, suggesting that the dropped predictors (visibility and dew point temperature) add little in-sample accuracy.

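Overall, the polynomial components model (Model 3) has the lowest RMSE, suggesting that capturing non-linear relationships helps predict the rented bike count. Because every RMSE above is computed on the training data, the more flexible models are expected to score at least as well as the simpler ones they extend; confirming the ranking requires evaluating the same models on the test data. Interaction terms also lower the RMSE relative to Model 1, while ridge regularization raises it slightly.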
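
A comparison on held-out data could be sketched as follows. This is a minimal base R illustration on simulated data, not the notebook's pipeline: the `sim` data frame, its `temp` and `rentals` columns, and the three candidate models are hypothetical.

```r
# Simulated data with a curved relationship (hypothetical)
set.seed(42)
n <- 400
sim <- data.frame(temp = runif(n))
sim$rentals <- 0.2 + 0.8 * sim$temp - 0.6 * sim$temp^2 + rnorm(n, sd = 0.05)

# 75/25 split, mirroring the proportion used in the notebook
idx <- sample(n, size = 0.75 * n)
train_sim <- sim[idx, ]
test_sim <- sim[-idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Nested models of increasing flexibility, all fitted on the training rows only
fits <- list(
  linear = lm(rentals ~ temp, data = train_sim),
  quadratic = lm(rentals ~ poly(temp, 2), data = train_sim),
  degree_10 = lm(rentals ~ poly(temp, 10), data = train_sim)
)

# Score every model on both the training rows and the held-out rows
comparison <- data.frame(
  model = names(fits),
  train_rmse = sapply(fits, function(f) rmse(train_sim$rentals, predict(f, newdata = train_sim))),
  test_rmse = sapply(fits, function(f) rmse(test_sim$rentals, predict(f, newdata = test_sim)))
)
print(comparison)
```

The training RMSE never increases as terms are added to a nested least-squares model, so only the `test_rmse` column says whether the extra flexibility helps on new data.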


# Create a Q-Q plot for each model
qq_plot <- function(predictions, model_name) {
  ggplot(data.frame(residuals = train_data$RENTED_BIKE_COUNT - predictions), aes(sample = residuals)) +
    stat_qq() +
    stat_qq_line() +
    labs(title = paste("Q-Q Plot for", model_name),
         x = "Theoretical Quantiles",
         y = "Sample Quantiles") +
    theme_minimal()
}

# Generate the Q-Q plots
qq_plot1 <- qq_plot(pred5, "Model 1: More Features")
qq_plot2 <- qq_plot(pred6, "Model 2: Ridge Regularization")
qq_plot3 <- qq_plot(pred7, "Model 3: Polynomial Components")
qq_plot4 <- qq_plot(pred8, "Model 4: Interaction Terms")
qq_plot5 <- qq_plot(pred9, "Model 5: LM")

# Display the Q-Q plots
print(qq_plot1)

print(qq_plot2)

print(qq_plot3)

print(qq_plot4)

print(qq_plot5)

---
title: "Bicycle rental prediction"
output: 
  html_notebook:
    toc: TRUE
    toc_depth: 5
    toc_float: TRUE
---

```{r , message=FALSE ,echo=FALSE}
options(scipen=9999)
set.seed(1234)
```

Bike sharing has become a popular transportation option in many cities around the world. With increasing environmental awareness and the need for sustainable transportation options, bike sharing systems have seen significant growth. However, for these systems to operate efficiently, it is crucial to predict the demand for bikes at different stations and locations.

The goal of this project is to develop a predictive model that can estimate bike sharing demand based on various factors such as weather, time of day, day of the week, and special events. Using data analytics and machine learning techniques, the project aims to provide a tool that helps bike sharing system operators optimize bike distribution and availability, thereby improving user experience and operational efficiency.

### 1.Libraries

```{r , message=FALSE}
library(tidymodels) #For modeling and machine learning 
library(tidyverse) # Share common data representations and 'API' design
library(stringr) # Consistent wrapper for common string operations
library(readr) # Read rectangular text data
library(broom) # Convert statistical objects into tidy tibbles
library(dplyr) # A grammar of data manipulation
library(yardstick) # Tidy characterization of model performance
library(glmnet) # Lasso and Elastic Net
library(kableExtra) # Construct Complex Table


```

### 2.Database

The database contains detailed weather information, including temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall. Additionally, it records the number of bikes rented per hour and date information from the Seoul bike-sharing system.

**Technical Analysis:**
The weather variables such as temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, and rainfall are crucial as they can significantly influence the demand for bike rentals. For instance, temperature affects user comfort, while humidity impacts the perception of heat. Wind speed can make biking easier or harder, and visibility is important for cyclist safety. Dew point is an indicator of humidity and thermal comfort, and solar radiation can influence the decision to rent bikes. Snowfall and rainfall are critical factors that can reduce bike rental demand.

The bike rental data, specifically the number of bikes rented per hour, serves as the key dependent variable for regression analysis. The date information allows for the examination of temporal and seasonal patterns in bike usage.

**Project Objective:**
The goal is to use the weather and temporal variables to predict the number of bikes rented per hour. This can help optimize the management of the bike-sharing system, anticipate demand, and improve user experience.

#### 2.1 Variables 

The `seoul_bike_sharing_converted_normalized.csv` file will be our main dataset, which has the following variables:

The response variable:

- `RENTED_BIKE_COUNT` - Count of bikes rented at each hour

Weather predictor variables:

- `TEMPERATURE` - Temperature in Celsius
- `HUMIDITY` - Unit is `%`
- `WIND_SPEED` - Unit is `m/s`
- `VISIBILITY` - Multiplied by 10m
- `DEW_POINT_TEMPERATURE` - The temperature to which the air would have to cool down in order to reach saturation, unit is Celsius
- `SOLAR_RADIATION` - MJ/m2
- `RAINFALL` - mm
- `SNOWFALL` - cm

Date/time predictor variables:

- `DATE` - Year-month-day
- `HOUR` - Hour of the day
- `FUNCTIONAL DAY` - NoFunc(Non Functional Hours), Fun(Functional hours)
- `HOLIDAY` - Holiday/No holiday
- `SEASONS` - Winter, Spring, Summer, Autumn


#### 2.2 Load database

```{r}
seoul_bike_sharing_converted_normalized <- read_csv("Bases limpias/seoul_bike_sharing_converted_normalized.csv")
```

#### 2.3 Convert into a df

```{r}
bike_sharing_df <- seoul_bike_sharing_converted_normalized %>% 
                   select(-DATE, -FUNCTIONING_DAY_YES,-FUNCTIONING_DAY_NO)
```

We will not be utilizing the `DATE` column in its current form, as it essentially functions as a data entry index. However, with additional time, we could transform the `DATE` column to derive new features such as 'day of the week' or 'isWeekend', which might influence bike rental preferences. Additionally, the `FUNCTIONAL DAY` column will not be used because, after processing missing values, it only contains a single distinct value (`YES`).
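
Deriving such features from `DATE` could look like the sketch below. It is a minimal base R illustration that assumes the column parses with `as.Date()`; the example dates and the `DAY_OF_WEEK` and `IS_WEEKEND` column names are hypothetical.

```r
# Hypothetical example dates; the real DATE column is assumed to parse with as.Date()
dates <- as.Date(c("2017-12-01", "2017-12-02", "2017-12-03", "2017-12-04"))

# as.POSIXlt()$wday does not depend on the locale: 0 = Sunday, ..., 6 = Saturday
wday_index <- as.POSIXlt(dates)$wday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

date_features <- data.frame(
  DATE = dates,
  DAY_OF_WEEK = factor(day_names[wday_index + 1], levels = day_names),
  IS_WEEKEND = wday_index %in% c(0, 6)
)
print(date_features)
```

Using the numeric weekday index instead of `weekdays()` keeps the result independent of the machine's language settings.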

### 3. Split training and testing data 

```{r}
bike_split <- initial_split(bike_sharing_df, prop = 3/4)
train_data <- training(bike_split)
test_data <- testing(bike_split)

```

#### 3.1 Build a linear regression model using weather variables only

Weather conditions are likely to influence individuals' decisions regarding bike rentals. For instance, adverse weather such as cold and rainy conditions may lead people to opt for alternative modes of transportation like buses or taxis. Conversely, favorable weather, such as sunny days, may increase the propensity to rent bikes for short-distance travel.

```{r}
# Pick linear regression
lm_spec <- linear_reg() %>%
  # Set engine
  set_engine(engine = "lm")

# Print the linear function
lm_spec
```
```{r}
# To  fit the model 

lm_model_weather <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)
```

Print the fit summary for the `lm_model_weather` model.

```{r , message=FALSE, eval=FALSE}


# Create the table with the regression results

summary(lm_model_weather$fit)

```


The regression analysis aims to predict the `RENTED_BIKE_COUNT` using several independent variables: `TEMPERATURE`, `HUMIDITY`, `WIND_SPEED`, `VISIBILITY`, `DEW_POINT_TEMPERATURE`, `SOLAR_RADIATION`, `RAINFALL`, and `SNOWFALL`. The intercept has an estimate of _0.046472_, indicating the baseline level of bike rentals when all other variables are zero.

`TEMPERATURE` has a positive coefficient of _0.631318_, suggesting that as the temperature increases, the number of rented bikes also increases. This relationship is highly significant with a p-value less than 5.17e-16. `HUMIDITY` has a negative coefficient of _-0.276702_, indicating that higher humidity levels are associated with fewer bike rentals, and this effect is also highly significant.

`VISIBILITY` has a very small positive coefficient (_0.006179_), indicating a slight increase in bike rentals with better visibility, and this effect is significant. `SOLAR_RADIATION` has a strong positive coefficient of _1.034800_, showing that higher solar radiation levels significantly increase bike rentals. `RAINFALL` and `SNOWFALL` both have negative coefficients (_-0.582041_ and _-0.099439_, respectively), indicating that more rainfall and snowfall lead to fewer bike rentals. These effects are statistically significant.

The **residual standard error** is _0.1371_, which approximates the typical distance between the observed values and the regression line. The multiple R-squared, adjusted R-squared, and F-statistic are not quoted here; they appear in the same summary output and indicate the overall fit of the model and the significance of the regression equation.

Overall, the analysis shows that weather conditions significantly impact bike rentals, with temperature and solar radiation having the most substantial positive effects, while humidity, wind speed, rainfall, and snowfall negatively affect bike rentals.

#### 3.2 Build a linear regression model using all variables

```{r}
lm_model_all <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ ., data = train_data)
```

Print the fit summary for `lm_model_all`.

```{r}
summary(lm_model_all$fit)
```

The model explains approximately **66.0%** of the variance in `RENTED_BIKE_COUNT`, as indicated by the R-squared value. The significant predictors (e.g., `HOUR`, `TEMPERATURE`) suggest that these factors have a substantial impact on bike count. The high F-statistic and its corresponding p-value indicate that the model is statistically significant overall.


### 4. Model evaluation 

Model evaluation is crucial in regression analysis because it helps determine how well a model fits the data and predicts outcomes. **R-squared** is a key metric that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value signifies that the model explains a greater portion of the variance, suggesting a better fit. This is important for understanding the strength of the relationship between the predictors and the outcome, and for assessing the model's explanatory power.

**RMSE (Root Mean Square Error)**, on the other hand, measures the average magnitude of the errors between predicted and observed values. It provides insight into the model's predictive accuracy. A lower RMSE indicates that the model's predictions are closer to the actual values, which is essential for making reliable forecasts. Evaluating models using both R-squared and RMSE ensures a balanced assessment of their performance, considering both the goodness of fit and the precision of predictions. This comprehensive evaluation helps in selecting the most appropriate model for practical applications.
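
The two metrics described above can be written out directly. The sketch below is a minimal base R illustration with made-up numbers: `rmse_manual()` follows the usual RMSE definition, while `rsq_manual()` and the `actual` and `predicted` vectors are hypothetical.

```r
# RMSE: square root of the mean squared difference between observed and predicted values
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 minus the ratio of residual to total sum of squares
rsq_manual <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Made-up values on a 0 to 1 scale, for illustration only
actual <- c(0.10, 0.25, 0.40, 0.55)
predicted <- c(0.15, 0.20, 0.45, 0.50)

rmse_value <- rmse_manual(actual, predicted)  # 0.05
rsq_value <- rsq_manual(actual, predicted)    # about 0.911
print(c(RMSE = rmse_value, R_squared = rsq_value))
```

Both functions take the observed and predicted values for the same rows, so applying them to test-set targets and test-set predictions gives two metrics that are measured on the same data.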




#### 4.1 Root Mean Squared Error (RMSE)

```{r , warning=FALSE}
# Making the predictions on the test set
predictions_weather <- predict(lm_model_weather, new_data = test_data)
predictions_all <- predict(lm_model_all, new_data = test_data)
# Calculating errors against the test-set targets (predict() returns a tibble with a .pred column)
error_weather <- test_data$RENTED_BIKE_COUNT - predictions_weather$.pred
error_all <- test_data$RENTED_BIKE_COUNT - predictions_all$.pred
# Calculating Squared Errors
squared_error_weather <- error_weather^2
squared_error_all <- error_all^2
# Calculate the average of the squared errors
mean_squared_error_weather <- mean(squared_error_weather)
mean_squared_error_all <- mean(squared_error_all)
# Calculate RMSE
rmse_weather <- sqrt(mean_squared_error_weather)
rmse_all <- sqrt(mean_squared_error_all)

```

#### 4.2 R-squared
```{r}
# Get the R squared from the model
summary_m_weather <- summary(lm_model_weather$fit)
summary_m_all <- summary(lm_model_all$fit)
# Print the R squared
r2_weather <- summary_m_weather$r.squared
r2_all <- summary_m_all$r.squared
```

#### 4.3 Comparing models
```{r}

# Create data frame
results <- data.frame(
  Model = c("Weather Model", "All Variables Model"),
  R_squared = c(r2_weather, r2_all),
  RMSE = c(rmse_weather, rmse_all)
)

print(results)
```

The **"Weather Model"** has an R-squared value of 0.4303461 and an RMSE of 0.2224554. The R-squared value indicates that approximately 43.03% of the variance in the dependent variable can be explained by the independent variables in this model. The RMSE value represents the root mean square error, which measures the average magnitude of the errors between the predicted and observed values. A lower RMSE indicates better predictive accuracy.

On the other hand, the **"All Variables Model"** has an R-squared value of 0.6602304 and an RMSE of 0.2348928. This model explains approximately 66.02% of the variance in the dependent variable, which is higher than the "Weather Model." However, its RMSE is also higher (0.2348928 versus 0.2224554), indicating that the average prediction error is larger compared to the "Weather Model."

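To determine the best model, we need to consider the trade-off between the goodness of fit (R-squared) and the predictive accuracy (RMSE). Note that the R-squared values come from the training fit (via `summary()`), while the RMSE is based on predictions for the test data, so the two metrics are not measured on the same observations. The **"All Variables Model"** has a higher R-squared value, suggesting it fits the data better and explains more variance. However, its higher RMSE indicates that its predictions are less accurate on average compared to the **"Weather Model."**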

If the primary goal is to have a model that explains more variance in the dependent variable, the **"All Variables Model"** would be preferred due to its higher R-squared value. Conversely, if the goal is to minimize prediction errors, the **"Weather Model"** would be better due to its lower RMSE.

In summary, the choice of the best model depends on the specific objectives of the analysis. If explaining more variance is prioritized, the **"All Variables Model"** is better. If minimizing prediction errors is more important, the **"Weather Model"** is the preferred choice.

#### 4.4 Bar Chart for Coefficients (All Variables Model)

```{r}
# Obtain the model coefficients
coefficients_all <- tidy(lm_model_all)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_all, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "maroon") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "black") +
  labs(title = "Coefficients of Linear Regression Model (All)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip() # Flip the chart for better viewing

```

#### 4.5 Bar Chart for Coefficients (Weather Model)

```{r}

# Obtain the model coefficients
coefficients_weather <- tidy(lm_model_weather)

# Create a bar chart to visualize the coefficients
ggplot(coefficients_weather, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2, color = "darkred") +
  labs(title = "Coefficients of Linear Regression Model (Weather)",
       x = "Predictor Variables",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  ) +
  coord_flip()  # Flip the chart for better viewing


```


### 5. Add polynomial terms

```{r}

# Plot the higher order polynomial fits

ggplot(train_data, aes(x = TEMPERATURE, y = RENTED_BIKE_COUNT)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), color = "green", se = FALSE) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "purple", se = FALSE) +
  labs(title = "Polynomial Regression Fits",
       x = "Temperature",
       y = "Rented Bikes") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

```

#### 5.1 Fit the Polynomial Regression Model


```{r}
# Assuming the important variables are TEMPERATURE and HUMIDITY
lm_bikly <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2), data = train_data)

# Print the model summary
summary(lm_bikly)

```
The regression analysis presented aims to predict the `RENTED_BIKE_COUNT` based on polynomial transformations of `TEMPERATURE` and `HUMIDITY`. This approach allows for capturing non-linear relationships between these predictors and the dependent variable, which can be more reflective of real-world scenarios where changes in temperature and humidity might not have a straightforward linear effect on bike rentals.

##### 5.1.1 Model Summary
The regression model includes polynomial terms for both `TEMPERATURE` and `HUMIDITY`, specifically up to the second degree. The coefficients table shows the estimated effects of each predictor, along with their standard errors, t-values, and p-values. The intercept is highly significant, with a very low p-value (<2e-16), indicating a strong baseline effect when all predictors are at their mean values.

##### 5.1.2 Coefficients and Significance
The first-degree polynomial term for `TEMPERATURE` has a large negative coefficient and is highly significant (p-value <2e-16), suggesting that as temperature increases, the number of rented bikes decreases significantly. The second-degree term for `TEMPERATURE` has a negative coefficient and is also significant (p-value <2e-16), indicating a diminishing return effect; at higher temperatures, the decrease in bike rentals slows down. Similarly, the first-degree polynomial term for `HUMIDITY` has a significant negative coefficient, suggesting that higher humidity levels reduce bike rentals. The second-degree term for `HUMIDITY` also has a negative coefficient. Because `poly()` builds orthogonal polynomial terms by default, these coefficients describe the fitted curve on a transformed basis rather than per-unit changes in the raw variables.
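
When reading coefficients of `poly()` terms, it helps to remember that `poly()` returns orthogonal polynomials unless `raw = TRUE` is set. The sketch below uses simulated data (the `sim` data frame and its columns are hypothetical) to show that both parameterizations give the same fitted curve but different coefficients.

```r
# Simulated data with a quadratic relationship (hypothetical)
set.seed(7)
sim <- data.frame(temp = runif(100))
sim$rentals <- 0.1 + 0.9 * sim$temp - 0.5 * sim$temp^2 + rnorm(100, sd = 0.05)

# Default: orthogonal polynomial basis
fit_orth <- lm(rentals ~ poly(temp, 2), data = sim)

# raw = TRUE: plain temp and temp^2 columns
fit_raw <- lm(rentals ~ poly(temp, 2, raw = TRUE), data = sim)

# Same fitted values, because both bases span the same space
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)

# Different coefficients, because the basis columns differ
print(coef(fit_orth))
print(coef(fit_raw))
```

With the orthogonal basis the intercept equals the mean of the response, and the sign of a single term does not by itself give the direction of the effect on the original scale. Plotting the fitted curve, or refitting with `raw = TRUE`, is a safer basis for that kind of statement.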

##### 5.1.3 Model Fit
The model's multiple R-squared value is about 0.4470, indicating that approximately 44.70% of the variance in `RENTED_BIKE_COUNT` is explained by the model. The adjusted R-squared value is the same, suggesting that the model's explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2.2e-16, indicating that the model is statistically significant overall.

##### 5.1.4 Theoretical Insights
Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model's explanatory power. On the other hand, RMSE provides a measure of the model's predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.

In summary, this polynomial regression model effectively captures the non-linear effects of temperature and humidity on bike rentals, with significant coefficients for both predictors. The model explains a substantial portion of the variance in bike rentals, making it a useful tool for understanding and predicting bike rental patterns based on weather conditions.



#### 5.2 Make Predictions on the Test Dataset

```{r}
# Make predictions on the test dataset using the lm_bikly model
y_pred <- predict(lm_bikly, newdata = test_data)

# Convert negative predictions to zero
y_pred <- ifelse(y_pred < 0, 0, y_pred)

```

##### 5.2.1 Calculate R-squared and RMSE

```{r}



# Calculating errors against the test-set targets, since y_pred was computed on test_data
error_bikly <- test_data$RENTED_BIKE_COUNT - y_pred

# Calculating Squared Errors
squared_error_bikly <- error_bikly^2

# Calculate the average of the squared errors
mean_squared_error_bikly  <- mean(squared_error_bikly)

# Calculate RMSE
rmse_bikly <- sqrt(mean_squared_error_bikly)


# Calculate R-squared

summary_m_bikly <- summary(lm_bikly)
rsq_bikly <- summary_m_bikly$r.squared


# Display the results
results_bikly <- data.frame(
  Model = "Polynomial Regression Model (Bikly)",
  R_squared = rsq_bikly,
  RMSE = rmse_bikly
)

print(results_bikly)

```
Based on the provided data, the polynomial regression model named "Bikly" has an R-squared value of 0.44696063 and an RMSE of 0.2136526.

##### 5.2.2 Model Evaluation

The **R-squared** value of 0.44696063 indicates that approximately 44.70% of the variance in the dependent variable is explained by the independent variables in this model. This suggests a moderate level of explanatory power, meaning that the model captures a significant portion of the variability in the data but leaves some unexplained variance. A higher R-squared value generally indicates a better fit of the model to the data.

The **RMSE (Root Mean Square Error)** value of 0.2136526 measures the average magnitude of the errors between the predicted and observed values. A lower RMSE indicates better predictive accuracy, as it means the model's predictions are closer to the actual values. In this case, the RMSE value is relatively low, suggesting that the model has good predictive performance.

##### 5.2.3 Theoretical Insights

Evaluating a model using both R-squared and RMSE is crucial for understanding its overall performance. The R-squared value provides insight into how well the model explains the variability in the data, which is important for assessing the model's explanatory power. On the other hand, RMSE provides a measure of the model's predictive accuracy, which is essential for making reliable forecasts. By considering both metrics, we can ensure a balanced evaluation of the model, taking into account both its ability to fit the data and its precision in predictions.

The polynomial regression model "Bikly" demonstrates a moderate level of explanatory power and good predictive accuracy, making it a useful tool for understanding and predicting the dependent variable based on the given predictors.

### 6. Fit the Polynomial Regression Model with Interaction Terms

```{r}
# Fit a polynomial regression model with interaction terms
lm_bikly_interaction <- lm(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED, data = train_data)

# Print the model summary
summary(lm_bikly_interaction)

```

The regression analysis presented aims to predict `RENTED_BIKE_COUNT` from a second-degree polynomial in `TEMPERATURE`, the main effects of `HUMIDITY` and `WIND_SPEED`, and the interactions of the temperature polynomial with each of these two variables. This approach allows the effect of temperature on rentals to vary with humidity and wind speed, which can be more reflective of real-world scenarios where weather conditions interact in non-linear ways to affect bike rentals.

#### 6.1 Model Summary
The regression model includes polynomial terms for `TEMPERATURE` up to the second degree, the main effects of `HUMIDITY` and `WIND_SPEED`, and the interactions of the `TEMPERATURE` polynomial with both `HUMIDITY` and `WIND_SPEED`. The coefficients table shows the estimated effects of each term, along with their standard errors, t-values, and p-values.
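
To see which terms the formula generates, the right-hand side can be expanded without fitting anything. The sketch below uses a small simulated data frame (`toy_weather` is an assumed name, and its values are random placeholders, not the Seoul data) and lists the term labels and design-matrix columns that R derives from the `*` operator.

```r
# Sketch on placeholder data; only the column names mirror the project.
set.seed(2)
toy_weather <- data.frame(
  TEMPERATURE = runif(50),
  HUMIDITY = runif(50),
  WIND_SPEED = runif(50)
)

# Right-hand side of the model formula used in this section
rhs <- ~ poly(TEMPERATURE, 2) * HUMIDITY + poly(TEMPERATURE, 2) * WIND_SPEED

# `a * b` expands to `a + b + a:b`; repeated terms are kept only once
term_labels <- attr(terms(rhs), "term.labels")
print(term_labels)

# Each polynomial term contributes two columns, so each interaction does too
design <- model.matrix(rhs, data = toy_weather)
print(colnames(design))
```

Under these assumptions the formula yields five terms and nine design-matrix columns including the intercept, which is why the coefficient table contains separate rows for the first-degree and second-degree interaction terms.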

##### 6.1.1 Coefficients and Significance
The first-degree polynomial term for `TEMPERATURE` has a significant negative coefficient, suggesting that as temperature increases, the number of rented bikes decreases slightly. The second-degree term for `TEMPERATURE` is not significant, indicating that higher-order temperature effects are negligible. `HUMIDITY` has a significant negative coefficient, suggesting that higher humidity levels reduce bike rentals. The interaction term between the first-degree polynomial of `TEMPERATURE` and `HUMIDITY` is significant, indicating that the combined effect of these variables has a meaningful impact on bike rentals. Similarly, `WIND_SPEED` has a significant negative coefficient, and its interaction with the first-degree polynomial of `TEMPERATURE` is significant, suggesting that wind speed also plays a role in bike rentals. Because the model contains interactions, each main-effect coefficient describes the effect of that predictor when the variables it interacts with are zero, so these signs should be read together with the interaction terms.

##### 6.1.2 Model Fit
The model's multiple R-squared value is 0.4336, indicating that approximately 43.36% of the variance in `RENTED_BIKE_COUNT` is explained by the model. The adjusted R-squared value is very close, suggesting that the model's explanatory power is robust even after adjusting for the number of predictors. The F-statistic is very high, with a corresponding p-value <2e-16, indicating that the model is statistically significant overall.
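
The relationship between these summary quantities can be checked by hand. The following sketch fits a small model on simulated data (`toy_fit_data` and its columns are assumptions for illustration) and recomputes the adjusted R-squared and the overall F-test p-value from the components stored in the `summary()` object.

```r
# Sketch on simulated data; names are illustrative assumptions.
set.seed(3)
n <- 120
toy_fit_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy_fit_data$y <- 1 + 0.5 * toy_fit_data$x1 - 0.3 * toy_fit_data$x2 + rnorm(n)

fit <- lm(y ~ poly(x1, 2) * x2, data = toy_fit_data)
s <- summary(fit)

# Number of estimated slopes (all coefficients except the intercept)
p <- length(coef(fit)) - 1

# Adjusted R-squared penalizes R-squared for the number of slopes
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Overall F-test p-value from the stored statistic and degrees of freedom
f_stat <- s$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

print(c(r_squared = s$r.squared, adj_manual = adj_manual,
        adj_reported = s$adj.r.squared, p_value = p_value))
```

With many observations and few terms, the correction factor `(n - 1) / (n - p - 1)` is close to one, which is why the adjusted value stays close to the multiple R-squared.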

##### 6.1.3 Theoretical Insights
In summary, this polynomial regression model captures interactions between temperature, humidity, and wind speed, with significant coefficients for all terms discussed above except the second-degree `TEMPERATURE` term. The model explains about 43% of the variance in bike rentals, which makes it useful for understanding how weather conditions relate to rental patterns while leaving more than half of the variance unexplained.

#### 6.2 Make Predictions on the Test Dataset

```{r}
# Make predictions on the test dataset using the lm_bikly_interaction model
y_pred_interaction <- predict(lm_bikly_interaction, newdata = test_data)

# Convert negative predictions to zero (rental counts cannot be negative)
y_pred_interaction <- ifelse(y_pred_interaction < 0, 0, y_pred_interaction)

```


##### 6.2.1 Calculate R-squared and RMSE

```{r}
# Calculate errors on the test set (the predictions were made on test_data)
error_bikly_interaction <- test_data$RENTED_BIKE_COUNT - y_pred_interaction

# Calculate squared errors
squared_error_bikly_interaction <- error_bikly_interaction^2

# Calculate the average of the squared errors
mean_squared_error_bikly_interaction  <- mean(squared_error_bikly_interaction)

# Calculate RMSE
rmse_bikly_interaction <- sqrt(mean_squared_error_bikly_interaction)


# Calculate R-squared

summary_m_bikly_interaction <- summary(lm_bikly_interaction)
rsq_bikly_interaction <- summary_m_bikly_interaction$r.squared


# Display the results
results_bikly_interaction <- data.frame(
  Model = "Polynomial Regression Model (Bikly_interaction)",
  R_squared = rsq_bikly_interaction,
  RMSE = rmse_bikly_interaction
)

print(results_bikly_interaction)
```
Based on the provided data, the polynomial regression model has an R-squared value of 0.4335922 and an RMSE of 0.2168526.

##### 6.2.2 Model Evaluation

The **R-squared** value of 0.4335922 indicates that approximately 43.36% of the variance in the dependent variable is explained by the independent variables in this model. This suggests a moderate level of explanatory power, meaning that the model captures a significant portion of the variability in the data but leaves some unexplained variance. A higher R-squared value generally indicates a better fit of the model to the data.

The **RMSE (Root Mean Square Error)** value of 0.2168526 measures the typical magnitude of the errors between the predicted and observed values, expressed in the same units as `RENTED_BIKE_COUNT`. A lower RMSE indicates better predictive accuracy, as it means the model's predictions are closer to the actual values. Whether this value counts as low depends on the scale of the response, so it is best judged against the range of `RENTED_BIKE_COUNT`.

##### 6.2.3 Theoretical Insights

As with the previous model, R-squared and RMSE are read together. Here both metrics point the same way: compared with the model "Bikly" from Section 5, the interaction model has a slightly lower R-squared (0.4335922 versus 0.44696063) and a slightly higher RMSE (0.2168526 versus 0.2136526). On these figures, adding the interaction terms does not improve on the earlier polynomial model.


### 7. Add regularization

#### 7.1 Create a recipe

```{r}
bike_recipe <- recipe(RENTED_BIKE_COUNT ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_poly(all_predictors(), degree = 2) %>%
  step_interact(terms = ~ all_predictors():all_predictors())


```

#### 7.2 Specify the workflow and fit the model

```{r}
bike_workflow <- workflow() %>%
  add_recipe(bike_recipe) %>%
  add_model(glmnet_spec)

```

#### 7.3 Elastic Net Regularization (L1 and L2)

```{r}
# Model 1: Adding regularization (L2 Ridge)
ridge_spe <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_f1 <- ridge_spe %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L1 Lasso)
ridge_spe1 <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

train_f2 <- ridge_spe1 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding regularization (Elastic Net: L1 Lasso and L2 Ridge)
ridge_spe2 <- linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

train_f3 <- ridge_spe2 %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Extract predictions on the training data (in-sample fit)
predic1 <- predict(train_f1, train_data)$.pred
predic2 <- predict(train_f2, train_data)$.pred
predic3 <- predict(train_f3, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse_1 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic1)
rmse_2 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic2)
rmse_3 <- rmse_manual(train_data$RENTED_BIKE_COUNT, predic3)

# Combine results into a table
result_regularization <- tibble(
  model = c("Model 1: L2 Ridge", "Model 2: L1 Lasso", "Model 3: L1 Lasso and L2 Ridge"),
  RMSE = c(rmse_1, rmse_2, rmse_3)
)

# Display the results
print(result_regularization)

```

1. **Model 1: L2 Ridge** - This model, which applies L2 regularization (`mixture = 0`), has an RMSE of 0.1464844 on the training data. L2 regularization helps to prevent overfitting by penalizing large coefficients, shrinking them toward zero without removing any predictor. This is the lowest RMSE of the three regularized models.

2. **Model 2: L1 Lasso** - The L1 regularization model (`mixture = 1`), known as Lasso, has a higher RMSE of 0.1805229. Lasso regularization not only helps in preventing overfitting but also performs feature selection by shrinking some coefficients exactly to zero. The higher RMSE is consistent with the Lasso penalty removing or strongly shrinking predictors at `penalty = 0.1`, which costs in-sample accuracy compared to Ridge regression in this context.

3. **Model 3: Combination of L1 Lasso and L2 Ridge** - This model combines both L1 and L2 regularization (`mixture = 0.5`), resulting in an RMSE of 0.1623183. This approach, often referred to as Elastic Net, aims to leverage the strengths of both regularization methods. The RMSE value indicates that the combination model fits better than Lasso alone but not as well as Ridge regression.

Overall, the results highlight the importance of selecting the regularization technique based on the characteristics of the data and the modeling goals. Ridge regression (L2) gives the lowest RMSE in this scenario, Lasso (L1) the highest, and the Elastic Net combination falls in between. Two caveats apply: the penalty is fixed at 0.1 for all three models rather than tuned, and the RMSE is computed on `train_data`, so the ranking reflects in-sample fit at one penalty value rather than out-of-sample accuracy.
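
The effect of the penalty on in-sample fit can be illustrated with the closed-form ridge solution. The sketch below is written in base R on simulated data and is not how `glmnet` computes its fit: the helper `ridge_coef()` and the data `X_toy` and `y_toy` are assumptions introduced for illustration, and `glmnet` scales its objective differently, so `lambda` here is not numerically comparable to `penalty = 0.1`.

```r
# Sketch on simulated data; ridge_coef() is a hypothetical helper.
set.seed(4)
n <- 100
X_toy <- scale(matrix(rnorm(n * 3), ncol = 3))  # standardized predictors
y_toy <- drop(X_toy %*% c(0.5, -0.3, 0.2)) + rnorm(n, sd = 0.5)
y_toy <- y_toy - mean(y_toy)                    # centered, so no intercept is needed

ridge_coef <- function(X, y, lambda) {
  # Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

lambdas <- c(0, 1, 10, 100)

# Training RMSE and coefficient size for each penalty value
train_rmse <- sapply(lambdas, function(l) {
  rmse_manual(y_toy, drop(X_toy %*% ridge_coef(X_toy, y_toy, l)))
})
coef_norm <- sapply(lambdas, function(l) {
  sqrt(sum(ridge_coef(X_toy, y_toy, l)^2))
})

print(data.frame(lambda = lambdas, train_rmse, coef_norm))
```

As the penalty grows, the coefficients shrink and the training RMSE cannot decrease. Any gain from regularization therefore has to be looked for on held-out data, ideally with the penalty chosen by cross-validation.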


### 8. Experiment to search for improved models


```{r}
# Define the model specifications
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model 1: Adding more features
train_fit5 <- lm_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 2: Adding regularization (L2 Ridge)
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_fit6 <- ridge_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)

# Model 3: Adding polynomial components
poly_spec <- linear_reg() %>%
  set_engine("lm")

train_fit7 <- poly_spec %>% 
  fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + poly(HUMIDITY, 2) + poly(WIND_SPEED, 2) + poly(VISIBILITY, 2) + poly(DEW_POINT_TEMPERATURE, 2) + poly(SOLAR_RADIATION, 2) + poly(RAINFALL, 2) + poly(SNOWFALL, 2), data = train_data)

# Model 4: Adding interaction terms
interaction_spec <- linear_reg() %>%
  set_engine("lm")

train_fit8 <- interaction_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE * HUMIDITY + WIND_SPEED * VISIBILITY + DEW_POINT_TEMPERATURE * SOLAR_RADIATION + RAINFALL * SNOWFALL, data = train_data)

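# Model 5: Linear model with a reduced feature set
# (drops VISIBILITY and DEW_POINT_TEMPERATURE from Model 1;
#  despite its name, tree_spec is a linear regression, not a decision tree)
tree_spec <- linear_reg() %>%
  set_engine("lm")

train_fit9 <- tree_spec %>% 
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + SOLAR_RADIATION + RAINFALL + SNOWFALL, data = train_data)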

# Extract predictions
pred5 <- predict(train_fit5, train_data)$.pred
pred6 <- predict(train_fit6, train_data)$.pred
pred7 <- predict(train_fit7, train_data)$.pred
pred8 <- predict(train_fit8, train_data)$.pred
pred9 <- predict(train_fit9, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse5 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred5)
rmse6 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred6)
rmse7 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred7)
rmse8 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred8)
rmse9 <- rmse_manual(train_data$RENTED_BIKE_COUNT, pred9)

# Combine results into a table
results <- tibble(
  model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components", "Model 4: Interaction Terms", "Model 5: LM"),
  RMSE = c(rmse5, rmse6, rmse7, rmse8, rmse9)
)

# Display the results
print(results)

```



1. **Model 1: More Features** - This model includes multiple features such as temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall. It has an RMSE of 0.1370119, indicating a relatively good fit. The inclusion of diverse features helps capture various aspects affecting the rented bike count.

2. **Model 2: Ridge Regularization** - This model applies L2 regularization to the same eight predictors as Model 1. Its RMSE is slightly higher at 0.1437533. Because the RMSE is computed on the training data, this is expected: ordinary least squares minimizes the training error, so a penalized fit on the same predictors cannot score lower in-sample. Any benefit of regularization would appear on held-out data.

3. **Model 3: Polynomial Components** - By adding second-degree polynomial components, this model captures non-linear relationships between the predictors and the target variable. It has the lowest RMSE of 0.1291756. Since this model contains Model 1 as a special case, its training RMSE cannot be higher than Model 1's; the size of the drop from 0.1370119 suggests that the non-linear terms capture real structure, although this should be confirmed on the test set.

4. **Model 4: Interaction Terms** - This model includes interaction terms between pairs of features, allowing it to capture the combined effect of two variables on the target. With an RMSE of 0.1338293, it performs better than the ridge regularization model but not as well as the polynomial components model. Interaction terms can be useful but may not always lead to the best performance.

5. **Model 5: LM** - This model is a linear regression on a reduced feature set: it uses the predictors of Model 1 without visibility and dew point temperature. It has an RMSE of 0.1370168, almost identical to Model 1 (0.1370119), which suggests that the two dropped features add very little once the remaining weather variables are included.

Overall, the polynomial components model (Model 3) shows the best performance in terms of RMSE, suggesting that capturing non-linear relationships is important for predicting the rented bike count accurately. Interaction terms (Model 4) also lower the RMSE relative to Model 1, while ridge regularization raises it slightly. All five RMSE values are computed on `train_data`, so they measure in-sample fit; the ranking should be confirmed on `test_data` before selecting a final model.
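
The difference between in-sample and out-of-sample RMSE can be illustrated by fitting polynomials of increasing degree to a held-out split. The following is a sketch on simulated data, with `toy_curve`, `curve_train`, and `curve_test` as assumed names; it is meant to show the evaluation pattern, not to reproduce the project's results.

```r
# Sketch on simulated data; all names are illustrative assumptions.
set.seed(5)
n <- 300
toy_curve <- data.frame(x = runif(n, -1, 1))
toy_curve$y <- sin(2 * toy_curve$x) + rnorm(n, sd = 0.3)

# Hold out 100 rows for testing
idx <- sample(n, size = 200)
curve_train <- toy_curve[idx, ]
curve_test <- toy_curve[-idx, ]

rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Fit polynomials of degree 1 to 6 and record both RMSE values
comparison <- do.call(rbind, lapply(1:6, function(d) {
  fit <- lm(y ~ poly(x, d), data = curve_train)
  data.frame(
    degree = d,
    train_rmse = rmse_manual(curve_train$y, predict(fit, newdata = curve_train)),
    test_rmse = rmse_manual(curve_test$y, predict(fit, newdata = curve_test))
  )
}))

print(comparison)
```

The training RMSE can only go down as the degree increases, because each model contains the previous one. The test RMSE has no such guarantee, which is why it is the more informative column for choosing between models.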


```{r}

# Create Q-Q Plot for each model
qq_plot <- function(predictions, model_name) {
  ggplot(data.frame(residuals = train_data$RENTED_BIKE_COUNT - predictions), aes(sample = residuals)) +
    stat_qq() +
    stat_qq_line() +
    labs(title = paste("Q-Q Plot for", model_name),
         x = "Theoretical Quantiles",
         y = "Sample Quantiles") +
    theme_minimal()
}

# Generate Q-Q Plots
qq_plot1 <- qq_plot(pred5, "Model 1: More Features")
qq_plot2 <- qq_plot(pred6, "Model 2: Ridge Regularization")
qq_plot3 <- qq_plot(pred7, "Model 3: Polynomial Components")
qq_plot4 <- qq_plot(pred8, "Model 4: Interaction Terms")
qq_plot5 <- qq_plot(pred9, "Model 5: LM")

# Display Q-Q Plots
print(qq_plot1)
print(qq_plot2)
print(qq_plot3)
print(qq_plot4)
print(qq_plot5)

```


