1 INTRODUCTION

1.1 Problem Statement

A bike-sharing provider in the Korea, has been facing a significant decline in its revenues due to the COVID-19 pandemic. To ensure their business’s sustainability and future growth, the company is proactively devising a strategic plan. This plan aims to boost revenue once the current lockdowns and restrictions are lifted, and the economy returns to normal.

As part of their strategy, the company is keen on understanding the post-pandemic demand for shared bikes among the korean population. This understanding will enable them to meet the needs of customers effectively and distinguish themselves from competitors, ultimately leading to substantial profits.

To gain insights into the factors influencing the demand for shared bikes, the company has engaged a consulting firm. They seek answers to two crucial questions:

Which variables play a significant role in predicting the demand for shared bikes? How effectively do these variables explain the fluctuations in bike demand? To answer these questions comprehensively, the company has collected a vast dataset that includes daily bike demand information from across the korean market. This data is based on various factors, including meteorological surveys and people’s lifestyle choices.

In summary, the company is taking proactive steps to prepare for a post-pandemic resurgence in demand for shared bikes in the korean market. They are leveraging data analysis and consulting expertise to identify the key variables driving this demand and to ensure their business is well-positioned to thrive in the future economic landscape.

1.2 Bussines Goal

Our task is to create a demand model for shared bikes using the available independent variables. This model will serve as a valuable tool for the management team to gain insights into how demand for shared bikes fluctuates in response to various features or factors. By analyzing this model, the management can make informed decisions and adjust their business strategy to align with changing demand levels, ultimately meeting customer expectations effectively. Additionally, this model will be instrumental in providing an in-depth understanding of demand dynamics in new markets, helping the management expand their operations with confidence.

2 DATA PREPARATION

library(dplyr)
library(GGally)
library(ggplot2)
library(gridExtra)
library(caret)
library(lmtest)
library(scales)
library(car)
library(psych)
library(magrittr)
library(tidyr)
library(infotheo)
library(TSA)
library(forecast)
library(chron)
library(caTools)

2.1 Load dataset

df <- read.csv("data_input/bike.csv", check.names = F, na.strings = c("", "NA", "N/A", "na"))

df <- setNames(df, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))

rmarkdown::paged_table(df)

The dataset contains 8760 rows and 14 columns. Almost all columns are either of double or integer data types.

Upon initial examination, it appears that some of the fields in the dataset have categorical characteristics, despite being represented as integers or double, also Date should be in date format. We will conduct a thorough analysis to determine whether it is appropriate to convert these columns into categorical variables or if they should continue to be treated as numerical values.

2.2 Data Quality Check

Check for NULL/MISSING values

colSums(is.na(df))

#>            date      bike_count            hour            temp        humidity 
#>               0               0               0               0               0 
#>      wind_speed      visibility  dew_point_temp solar_radiation        rainfall 
#>               0               0               0               0               0 
#>        snowfall         seasons         holiday functioning_day 
#>               0               0               0               0

There are no missing / Null values either in columns or rows.

Check Duplicated data

df_distic <- distinct(df)
dim(df_distic)

#> [1] 8760   14

There is no data reduction after using the distinct function, this means there is no duplication of data.

2.3 Data Cleaning

Based on a high-level examination of the data and the data dictionary, several variables can be safely removed from further analysis:

Date: Since the dataset already contains separate columns for ‘holiday’ and ‘Functioning Day,’ the ‘Date’ column, which represents the date, may not be needed for the analysis. It can be excluded.
season, Functioning day and Holiday: These columns are of ‘integer’ type, so we have to change them to the ‘category’ data type

The resulting dataset, after removing and change these columns, can be saved as ‘df_clean.’ This ensures that the original dataset is preserved for potential future analysis or validation.

df_clean <- df %>% 
  select(-date) %>% 
  mutate_at(vars(seasons, holiday, functioning_day, hour), as.factor)

glimpse(df_clean)

#> Rows: 8,760
#> Columns: 13
#> $ bike_count      <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, …
#> $ hour            <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ temp            <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, …
#> $ humidity        <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25…
#> $ wind_speed      <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,…
#> $ visibility      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
#> $ dew_point_temp  <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3…
#> $ solar_radiation <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, …
#> $ rainfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ snowfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ seasons         <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter…
#> $ holiday         <fct> No Holiday, No Holiday, No Holiday, No Holiday, No Hol…
#> $ functioning_day <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…

3 Exploratory Data Analysis

Exploratory data analysis (EDA) is a stage in data analysis during which we thoroughly examine the dataset’s variables, aiming to uncover any potential patterns or relationships that may indicate correlations between these variables.

3.1 correlation analysis

ggcorr(df_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The correlaton matrix offers a clear depiction of multicollinearity among variables, as well as indicating which variables exhibit strong collinearity with the target variable. We will use this heatmap iteratively during the linear model-building process to validate correlations, alongside metrics like Variance Inflation Factor (VIF) and p-values. This approach will help us make informed decisions about selecting or eliminating variables in the model. temp and atemp variables have a large positive correlation, for that we will only use the temp variable.

3.2 Numeric Variables Analysis

Let’s make a plot of all the numeric variables.

ggpairs(df_clean, 
        columns = c("temp", "humidity", "wind_speed","visibility", "dew_point_temp", "solar_radiation", "rainfall", "snowfall"),
        title = "Pairplot of Data")

The above Pair-Plot tells us that there is a LINEAR RELATION between ‘Temperature’ and ‘Dew.point.temperature’.

3.3 Catagorical Variables analysis

plot1 <- ggplot(df_clean, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)

plot2 <- ggplot(df_clean, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)

plot3 <- ggplot(df_clean, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)

plot4 <- ggplot(df_clean, aes(x = as.factor(hour), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)



grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)

An overwhelming 95% of bike bookings occurred when it was not a holiday, indicating a significant bias in the data. This implies that ’holiday’may not be a strong predictor for the dependent variable. The bikes will not be available to rent during non functioning days and therefore, bike count will be zero on non functioning day as no one will be able to rent the bicycles as seen in the boxplot. Therefore, non-functioning day data points are deleted from the dataset as this is not relevent to our purpose. The dataset is now left with 8465 data points.

Bar plots showed a few outliers in the dataset for each category. However, these will not be removed as they might not be actual outliers but instead, can be explained by other factors such as weather conditions and whether it is holiday or not, etc. These will be removed in later modelling stage if they seem to negatively affect the model assumptions. These observations provide valuable insights into the relationships between categorical variables and the dependent variable, which can guide the feature selection process for building predictive models.

4 Modeling

4.1 Train-Test Split

Prior to constructing the model, it’s essential to divide the dataset into two separate subsets: the training dataset and the test dataset. The training dataset will serve as the basis for training the linear regression model, while the test dataset will be employed for comparison purposes. This helps assess whether the model becomes overly specialized (overfit) to the training data and is unable to make accurate predictions on new, unseen data.

The plan is to allocate 70% of the data to the training dataset, while the remaining 30% will be designated as the testing dataset. This division ensures that a significant portion of the data is used for model training, while a distinct portion remains reserved for evaluating the model’s performance on unseen data.

# Deleting rows when it is non-functioning day
df_clean<-df_clean[!(df_clean$functioning_day=="No"),]
# removing unused columns
df_clean <- subset(df_clean, select = -c(functioning_day))
summary(df_clean)

#>    bike_count          hour           temp           humidity    
#>  Min.   :   2.0   7      : 353   Min.   :-17.80   Min.   : 0.00  
#>  1st Qu.: 214.0   8      : 353   1st Qu.:  3.00   1st Qu.:42.00  
#>  Median : 542.0   9      : 353   Median : 13.50   Median :57.00  
#>  Mean   : 729.2   10     : 353   Mean   : 12.77   Mean   :58.15  
#>  3rd Qu.:1084.0   11     : 353   3rd Qu.: 22.70   3rd Qu.:74.00  
#>  Max.   :3556.0   12     : 353   Max.   : 39.40   Max.   :98.00  
#>                   (Other):6347                                   
#>    wind_speed      visibility   dew_point_temp    solar_radiation 
#>  Min.   :0.000   Min.   :  27   Min.   :-30.600   Min.   :0.0000  
#>  1st Qu.:0.900   1st Qu.: 935   1st Qu.: -5.100   1st Qu.:0.0000  
#>  Median :1.500   Median :1690   Median :  4.700   Median :0.0100  
#>  Mean   :1.726   Mean   :1434   Mean   :  3.945   Mean   :0.5679  
#>  3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 15.200   3rd Qu.:0.9300  
#>  Max.   :7.400   Max.   :2000   Max.   : 27.200   Max.   :3.5200  
#>                                                                   
#>     rainfall          snowfall         seasons           holiday    
#>  Min.   : 0.0000   Min.   :0.00000   Autumn:1937   Holiday   : 408  
#>  1st Qu.: 0.0000   1st Qu.:0.00000   Spring:2160   No Holiday:8057  
#>  Median : 0.0000   Median :0.00000   Summer:2208                    
#>  Mean   : 0.1491   Mean   :0.07769   Winter:2160                    
#>  3rd Qu.: 0.0000   3rd Qu.:0.00000                                  
#>  Max.   :35.0000   Max.   :8.80000                                  
#>

set.seed(123)
samplesize <- round(0.7 * nrow(df_clean), 0)
index <- sample(seq_len(nrow(df_clean)), size = samplesize)

data_train <- df_clean[index, ]
data_test <- df_clean[-index, ]

4.2 Linear Regression

Now we will try to model the linear regression using ‘cnt’ as the target variable and all other cariable as the predictors.

4.2.1 Model with all Predictors

set.seed(123)
lm_all <- lm(bike_count ~ ., data = data_train)

summary(lm_all)

#> 
#> Call:
#> lm(formula = bike_count ~ ., data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1394.01  -219.07    -9.01   201.74  1769.15 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        1.126e+03  1.034e+02  10.890  < 2e-16 ***
#> hour1             -1.155e+02  3.436e+01  -3.363 0.000777 ***
#> hour2             -2.234e+02  3.437e+01  -6.501 8.62e-11 ***
#> hour3             -3.114e+02  3.442e+01  -9.046  < 2e-16 ***
#> hour4             -3.694e+02  3.441e+01 -10.736  < 2e-16 ***
#> hour5             -3.720e+02  3.374e+01 -11.028  < 2e-16 ***
#> hour6             -2.026e+02  3.435e+01  -5.898 3.89e-09 ***
#> hour7              9.524e+01  3.441e+01   2.768 0.005658 ** 
#> hour8              4.872e+02  3.399e+01  14.336  < 2e-16 ***
#> hour9              7.601e+00  3.519e+01   0.216 0.829000    
#> hour10            -2.339e+02  3.604e+01  -6.491 9.24e-11 ***
#> hour11            -2.351e+02  3.726e+01  -6.309 3.01e-10 ***
#> hour12            -2.032e+02  3.879e+01  -5.238 1.68e-07 ***
#> hour13            -2.014e+02  3.919e+01  -5.139 2.85e-07 ***
#> hour14            -1.914e+02  3.855e+01  -4.965 7.05e-07 ***
#> hour15            -1.175e+02  3.767e+01  -3.118 0.001829 ** 
#> hour16             5.112e+01  3.637e+01   1.406 0.159903    
#> hour17             3.237e+02  3.498e+01   9.255  < 2e-16 ***
#> hour18             7.875e+02  3.528e+01  22.323  < 2e-16 ***
#> hour19             5.360e+02  3.528e+01  15.194  < 2e-16 ***
#> hour20             4.774e+02  3.479e+01  13.725  < 2e-16 ***
#> hour21             4.522e+02  3.479e+01  12.998  < 2e-16 ***
#> hour22             3.277e+02  3.389e+01   9.669  < 2e-16 ***
#> hour23             9.670e+01  3.419e+01   2.828 0.004694 ** 
#> temp               9.687e+00  3.864e+00   2.507 0.012208 *  
#> humidity          -1.040e+01  1.072e+00  -9.698  < 2e-16 ***
#> wind_speed        -2.104e+00  5.594e+00  -0.376 0.706808    
#> visibility         8.033e-03  1.050e-02   0.765 0.444165    
#> dew_point_temp     1.559e+01  4.010e+00   3.887 0.000102 ***
#> solar_radiation    8.305e+01  1.199e+01   6.929 4.70e-12 ***
#> rainfall          -7.106e+01  4.817e+00 -14.751  < 2e-16 ***
#> snowfall           4.059e+01  1.115e+01   3.641 0.000274 ***
#> seasonsSpring     -1.737e+02  1.486e+01 -11.691  < 2e-16 ***
#> seasonsSummer     -1.860e+02  1.829e+01 -10.166  < 2e-16 ***
#> seasonsWinter     -3.556e+02  2.086e+01 -17.042  < 2e-16 ***
#> holidayNo Holiday  1.341e+02  2.321e+01   5.776 8.04e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 375 on 5890 degrees of freedom
#> Multiple R-squared:  0.6666, Adjusted R-squared:  0.6647 
#> F-statistic: 336.5 on 35 and 5890 DF,  p-value: < 2.2e-16

We get a lot more coefficient than variables that we have. This is because the categorical variable are transformed into dummy variables.

The summary of the ‘lm_all’ model provides a wealth of information, but at this stage, our primary focus should be on the ‘Pr(>|t|)’ (p-value) column. This column indicates the significance level of each variable with respect to the model. When the ‘Pr(>|t|)’ value is below 0.05, we can confidently conclude that the variable has a significant impact on the model. In other words, the estimated coefficients for these variables are statistically different from zero, and they play a crucial role in the model’s predictive power. Conversely, variables with p-values greater than 0.05 are considered not significant and may be candidates for removal from the model. The coefficient value shows the estimated contribution of the predictor to the target or in our case it can be interpreted, for example, every 1 point increase in the rainfall unit value will contribute to a reduction in the number of bicycle rentals by 54.88149.

4.2.2 Model with only Significant predictors

df_sig <- df_clean %>% select(-c(visibility, wind_speed))

data_train_sig <- df_sig[index, ]
data_test_sig <- df_sig[-index, ]

set.seed(123)
lm_sig <- lm(bike_count ~ ., data = data_train_sig)

summary(lm_sig)

#> 
#> Call:
#> lm(formula = bike_count ~ ., data = data_train_sig)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1390.20  -218.24    -9.49   202.30  1764.78 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       1149.728     96.813  11.876  < 2e-16 ***
#> hour1             -115.612     34.349  -3.366 0.000768 ***
#> hour2             -223.341     34.351  -6.502 8.60e-11 ***
#> hour3             -311.316     34.396  -9.051  < 2e-16 ***
#> hour4             -369.524     34.384 -10.747  < 2e-16 ***
#> hour5             -371.601     33.702 -11.026  < 2e-16 ***
#> hour6             -202.084     34.314  -5.889 4.09e-09 ***
#> hour7               95.153     34.373   2.768 0.005654 ** 
#> hour8              486.926     33.937  14.348  < 2e-16 ***
#> hour9                6.754     35.148   0.192 0.847635    
#> hour10            -235.152     35.976  -6.536 6.84e-11 ***
#> hour11            -236.758     37.192  -6.366 2.09e-10 ***
#> hour12            -205.138     38.717  -5.298 1.21e-07 ***
#> hour13            -203.823     39.069  -5.217 1.88e-07 ***
#> hour14            -194.215     38.367  -5.062 4.27e-07 ***
#> hour15            -120.634     37.380  -3.227 0.001257 ** 
#> hour16              47.918     36.027   1.330 0.183551    
#> hour17             320.631     34.584   9.271  < 2e-16 ***
#> hour18             784.792     34.964  22.446  < 2e-16 ***
#> hour19             533.908     35.013  15.249  < 2e-16 ***
#> hour20             475.591     34.602  13.745  < 2e-16 ***
#> hour21             451.237     34.725  12.994  < 2e-16 ***
#> hour22             327.064     33.863   9.658  < 2e-16 ***
#> hour23              96.144     34.173   2.813 0.004918 ** 
#> temp                 9.559      3.857   2.479 0.013219 *  
#> humidity           -10.598      1.037 -10.218  < 2e-16 ***
#> dew_point_temp      15.749      4.004   3.934 8.46e-05 ***
#> solar_radiation     81.934     11.835   6.923 4.89e-12 ***
#> rainfall           -71.208      4.814 -14.793  < 2e-16 ***
#> snowfall            40.583     11.146   3.641 0.000274 ***
#> seasonsSpring     -176.898     14.332 -12.343  < 2e-16 ***
#> seasonsSummer     -185.812     18.173 -10.225  < 2e-16 ***
#> seasonsWinter     -358.826     20.462 -17.536  < 2e-16 ***
#> holidayNo Holiday  134.107     23.205   5.779 7.89e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 374.9 on 5892 degrees of freedom
#> Multiple R-squared:  0.6666, Adjusted R-squared:  0.6647 
#> F-statistic:   357 on 33 and 5892 DF,  p-value: < 2.2e-16

We try to create a new model using only significant predictors, adjusted R-squared can be a reference to see the differences between the two models. The initial model that uses all predictors has an adjusted R-squared of 0.6647, this means the model can explain 66.47% of the target variables. while the second model has the same adjusted R-squared of 0.6647 this means we can use a model that only uses significant predictors.

5 Evaluation

5.1 Model Preformance

The performance of our model (how well our model predict the target variable) can be calculated using root mean squared error: RMSE is better than MAE or mean absolute error, because RMSE squared the difference between the actual values and the predicted values, meaning that prediction with higher error will be penalized greatly. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the ‘RMSE ()’ functions from ‘caret’ package. Below is the first model (with complete variables) performance. \[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}{\sum_{i=1}^{n} (y_i-\hat{y}_i)^2}} \]

pred_sig <- predict(lm_sig, newdata = data_test_sig %>% select(-bike_count))

# RMSE of train dataset
RMSE(pred = lm_sig$fitted.values, obs = data_train_sig$bike_count)

#> [1] 373.8531

#RMSE of test dataset
RMSE(pred = pred_sig, obs = data_test_sig$bike_count)

#> [1] 374.1233

the error of the training dataset is lower than the test dataset, suggesting that our model may be overfit.

5.2 Asumptions

Linear regression is considered a parametric model because it relies on certain classical assumptions to establish its model equation. If linear regression doesn’t adhere to these assumptions, the model can yield misleading results or have biased estimators. In this section, we will specifically focus on the second model, which is the one with some variables removed.

5.2.1 Linearity

The linear regression model is built on the assumption that there exists a linear relationship between the predictor variables and the response variable. When the true relationship deviates significantly from linearity, it raises doubts about the validity of the conclusions drawn from the model. Moreover, the predictive accuracy of the model can be notably compromised.

To detect non-linearity in the model, residual plots serve as a valuable graphical tool. If these plots exhibit a discernible pattern, it suggests that there is room for improvement in the model or that it does not conform to the linearity assumption. These plots illustrate the connection between the residuals (errors) and the predicted (fitted) values.

resact <- data.frame(residual = lm_sig$residuals, fitted = lm_sig$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

There is a pattern in the data, with the residuals becoming more negative as the posted value increases but then falls again. The pattern suggests that our model may not be linear enough.

5.2.2 Normality Test

The second assumption in linear regression pertains to the normality of the residuals. This assumption suggests that the residuals should follow a normal distribution. We can assess this by employing the Shapiro-Wilk normality test, which helps determine whether the residuals conform to a normal distribution.

shapiro.test(lm_sig$residuals[0:5000])

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  lm_sig$residuals[0:5000]
#> W = 0.98536, p-value < 2.2e-16

P-value < 0.05 so H0 is rejected or the residual does not follow a normal distribution.

5.2.3 Autocorrelation

The standard errors computed for estimated regression coefficients or fitted values rely on the assumption of uncorrelated error terms, meaning there is no autocorrelation. However, if there is indeed correlation among the error terms, the estimated standard errors are likely to underestimate the true standard errors. This leads to confidence and prediction intervals appearing narrower than they should be. For instance, a 95% confidence interval may have a lower probability than 0.95 of containing the true parameter value. Additionally, p-values associated with the model may be lower than they should be, potentially leading to incorrect conclusions about the statistical significance of parameters. In essence, if error terms exhibit correlation, it can result in an unjustified sense of confidence in the model.

Autocorrelation can be identified through the Durbin-Watson test, which tests the null hypothesis that there is no autocorrelation.

durbinWatsonTest(lm_sig)

#>  lag Autocorrelation D-W Statistic p-value
#>    1     -0.00168817      2.003258     0.9
#>  Alternative hypothesis: rho != 0

The result shows that the null hypothesis is rejected, meaning that our residual has autocorrelation in it.

5.2.4 Heterocedasticity

Heteroscedasticity refers to a situation where the variances of the error terms are not consistent or constant across different levels of the predictor variables. It can be detected by observing a funnel-shaped pattern in the residual plot, similar to how linearity issues can be identified.

bptest(lm_sig)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm_sig
#> BP = 1721.7, df = 33, p-value < 2.2e-16

resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

We can observe that the residuals are concentrated around the value 0 approaching larger values. The second way to detect heteroscedasticity is to use the Breusch-Pagan test, with the null hypothesis that heteroscedasticity does not occur. With a p-value <0.05, we can conclude that there is heteroscedasticity in our model.

5.2.5 Multicollinearity

Multicollinearity arises when there is a correlation or high degree of association between the independent variables or predictors in a regression model. To assess multicollinearity, one commonly used metric is the Variance Inflation Factor (VIF). As a general guideline, a VIF value exceeding 5 or 10 is indicative of a significant level of collinearity, which can be problematic for the regression analysis.

vif(lm_sig)

#>                       GVIF Df GVIF^(1/(2*Df))
#> hour              4.109813 23        1.031203
#> temp             93.077456  1        9.647666
#> humidity         19.165637  1        4.377858
#> dew_point_temp  120.175736  1       10.962469
#> solar_radiation   4.522325  1        2.126576
#> rainfall          1.120921  1        1.058735
#> snowfall          1.118603  1        1.057640
#> seasons           5.111938  3        1.312495
#> holiday           1.025383  1        1.012612

Dew.point.temp’s GVIF value is greater than 10 and showing presense of multicollinearity. This will be removed from the model.

6 Model Improvement

In the evaluation, our model shows that it cannot meet several assumptions, including linearity, heteroscedasticity and autocorrelation. To improve the model so that the model is more linear, we can transform several variables, both target and predictor.

6.1 Model Tuning

One approach that can be taken is to automatically select predictor variables using step-wise regression with the backward elimination method.

6.1.1 Stepwise Regression

lm_back <- step(lm_all, direction = "backward")

#> Start:  AIC=70280.95
#> bike_count ~ hour + temp + humidity + wind_speed + visibility + 
#>     dew_point_temp + solar_radiation + rainfall + snowfall + 
#>     seasons + holiday
#> 
#>                   Df Sum of Sq        RSS   AIC
#> - wind_speed       1     19896  828180164 70279
#> - visibility       1     82335  828242603 70280
#> <none>                          828160268 70281
#> - temp             1    883601  829043870 70285
#> - snowfall         1   1864134  830024402 70292
#> - dew_point_temp   1   2124695  830284963 70294
#> - holiday          1   4690744  832851012 70312
#> - solar_radiation  1   6749735  834910003 70327
#> - humidity         1  13225132  841385401 70373
#> - rainfall         1  30596042  858756310 70494
#> - seasons          3  66191249  894351517 70731
#> - hour            23 494589413 1322749681 73010
#> 
#> Step:  AIC=70279.09
#> bike_count ~ hour + temp + humidity + visibility + dew_point_temp + 
#>     solar_radiation + rainfall + snowfall + seasons + holiday
#> 
#>                   Df Sum of Sq        RSS   AIC
#> - visibility       1     74134  828254298 70278
#> <none>                          828180164 70279
#> - temp             1    889903  829070067 70283
#> - snowfall         1   1868554  830048718 70290
#> - dew_point_temp   1   2127534  830307698 70292
#> - holiday          1   4704268  832884431 70311
#> - solar_radiation  1   6792715  834972879 70326
#> - humidity         1  13237899  841418063 70371
#> - rainfall         1  30627799  858807963 70492
#> - seasons          3  67210918  895391082 70735
#> - hour            23 516392361 1344572525 73105
#> 
#> Step:  AIC=70277.62
#> bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation + 
#>     rainfall + snowfall + seasons + holiday
#> 
#>                   Df Sum of Sq        RSS   AIC
#> <none>                          828254298 70278
#> - temp             1    863575  829117872 70282
#> - snowfall         1   1863586  830117884 70289
#> - dew_point_temp   1   2175236  830429533 70291
#> - holiday          1   4694872  832949170 70309
#> - solar_radiation  1   6737663  834991961 70324
#> - humidity         1  14676337  842930635 70380
#> - rainfall         1  30762003  859016301 70492
#> - seasons          3  68694443  896948741 70744
#> - hour            23 516318243 1344572541 73103

The step-wise regression method will produce an optimal formula based on the lowest AIC (Akaike Information Criterion) value, where a lower AIC value indicates a better fit, with fewer unexplained observations.

summary(lm_back)

#> 
#> Call:
#> lm(formula = bike_count ~ hour + temp + humidity + dew_point_temp + 
#>     solar_radiation + rainfall + snowfall + seasons + holiday, 
#>     data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1390.20  -218.24    -9.49   202.30  1764.78 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       1149.728     96.813  11.876  < 2e-16 ***
#> hour1             -115.612     34.349  -3.366 0.000768 ***
#> hour2             -223.341     34.351  -6.502 8.60e-11 ***
#> hour3             -311.316     34.396  -9.051  < 2e-16 ***
#> hour4             -369.524     34.384 -10.747  < 2e-16 ***
#> hour5             -371.601     33.702 -11.026  < 2e-16 ***
#> hour6             -202.084     34.314  -5.889 4.09e-09 ***
#> hour7               95.153     34.373   2.768 0.005654 ** 
#> hour8              486.926     33.937  14.348  < 2e-16 ***
#> hour9                6.754     35.148   0.192 0.847635    
#> hour10            -235.152     35.976  -6.536 6.84e-11 ***
#> hour11            -236.758     37.192  -6.366 2.09e-10 ***
#> hour12            -205.138     38.717  -5.298 1.21e-07 ***
#> hour13            -203.823     39.069  -5.217 1.88e-07 ***
#> hour14            -194.215     38.367  -5.062 4.27e-07 ***
#> hour15            -120.634     37.380  -3.227 0.001257 ** 
#> hour16              47.918     36.027   1.330 0.183551    
#> hour17             320.631     34.584   9.271  < 2e-16 ***
#> hour18             784.792     34.964  22.446  < 2e-16 ***
#> hour19             533.908     35.013  15.249  < 2e-16 ***
#> hour20             475.591     34.602  13.745  < 2e-16 ***
#> hour21             451.237     34.725  12.994  < 2e-16 ***
#> hour22             327.064     33.863   9.658  < 2e-16 ***
#> hour23              96.144     34.173   2.813 0.004918 ** 
#> temp                 9.559      3.857   2.479 0.013219 *  
#> humidity           -10.598      1.037 -10.218  < 2e-16 ***
#> dew_point_temp      15.749      4.004   3.934 8.46e-05 ***
#> solar_radiation     81.934     11.835   6.923 4.89e-12 ***
#> rainfall           -71.208      4.814 -14.793  < 2e-16 ***
#> snowfall            40.583     11.146   3.641 0.000274 ***
#> seasonsSpring     -176.898     14.332 -12.343  < 2e-16 ***
#> seasonsSummer     -185.812     18.173 -10.225  < 2e-16 ***
#> seasonsWinter     -358.826     20.462 -17.536  < 2e-16 ***
#> holidayNo Holiday  134.107     23.205   5.779 7.89e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 374.9 on 5892 degrees of freedom
#> Multiple R-squared:  0.6666, Adjusted R-squared:  0.6647 
#> F-statistic:   357 on 33 and 5892 DF,  p-value: < 2.2e-16

The results of the experiment in making a model using the step-wise regression method failed to increase the adjusted R-Squared. For this reason, we will try to create a model using average daily weather conditions. For this model, bike demand for each day and average daily weather conditions are calculated.

6.1.2 Average Daily Model

# creating dataframe for bike count by date
df_by_day <- df[!(df$functioning_day=="No"),]

df_by_day$seasons <- factor(df_by_day$seasons, levels = c("Winter", "Spring", "Summer", "Autumn"))
df_by_day$seasons <- as.numeric(df_by_day$seasons)

df_by_day$holiday <- factor(df_by_day$holiday, levels = c("No Holiday", "Holiday"))
df_by_day$holiday <- as.numeric(df_by_day$holiday)

df_by_day <- df_by_day %>% 
  group_by(date) %>% 
  summarise(sum_bike_count = sum(bike_count), 
            mean_temp = mean(temp), 
            mean_humidity = mean(humidity), 
            mean_wind_speed = mean(wind_speed), 
            mean_visibility = mean(visibility), 
            mean_dew_point_temp = mean(dew_point_temp), 
            mean_solar_radiation = mean(solar_radiation), 
            mean_rainfall = mean(rainfall), 
            mean_snowfall = mean(snowfall), 
            seasons = mean(as.numeric(seasons)), 
            holiday = mean(as.numeric(holiday)))

#Changing the factors back to original values 
df_by_day$holiday <- as.factor(df_by_day$holiday)
df_by_day$seasons <- as.factor(df_by_day$seasons)
levels(df_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(df_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(df_by_day)

#> # A tibble: 6 × 12
#>   date    sum_bike_count mean_temp mean_humidity mean_wind_speed mean_visibility
#>   <chr>            <int>     <dbl>         <dbl>           <dbl>           <dbl>
#> 1 01/01/…           4290     -1.28          39.3            1.45           1895.
#> 2 01/02/…           5377     -3.87          44              1.61           1924.
#> 3 01/03/…           5132      0.45          64.2            3.55           1084 
#> 4 01/04/…          17388     15.2           68.9            1.57            832.
#> 5 01/05/…          26820     20.3           72.8            1.44            456.
#> 6 01/06/…          31928     23.7           50.1            1.95           1598.
#> # ℹ 6 more variables: mean_dew_point_temp <dbl>, mean_solar_radiation <dbl>,
#> #   mean_rainfall <dbl>, mean_snowfall <dbl>, seasons <fct>, holiday <fct>

Below is the histograms of how the continuous variables look like after transforming the data. The sum_bike_count shows signs of having a bimodal distribution and temperature shows more of flatter distribution compared to the original data. The rest of columns seem to be not too different from the original data distribution.

#Creating histograms 
multi.hist(df_by_day[,sapply(df_by_day, is.numeric)], global = F)

smp_size <- floor(0.7 * nrow(df_by_day))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(df_by_day)), size = smp_size)

df_test_day <- df_by_day[ -trainIndex,]
df_train_day <- df_by_day[ trainIndex,]

lm_day = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=df_train_day)

summary(lm_day)

#> 
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
#>     mean_rainfall + mean_snowfall + seasons + holiday, data = df_train_day)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -10664  -2811    326   2415  11413 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          29503.9833 10654.9667   2.769  0.00607 ** 
#> mean_temp             -609.0105   407.7464  -1.494  0.13663    
#> mean_humidity         -271.0193   121.8328  -2.225  0.02707 *  
#> mean_wind_speed      -1246.6589   476.9793  -2.614  0.00954 ** 
#> mean_visibility         -0.3715     0.7231  -0.514  0.60785    
#> mean_dew_point_temp   1000.6835   433.3629   2.309  0.02181 *  
#> mean_solar_radiation 12678.6386  1431.3165   8.858  < 2e-16 ***
#> mean_rainfall        -3586.3001   684.3712  -5.240 3.58e-07 ***
#> mean_snowfall         -774.9305   752.6909  -1.030  0.30429    
#> seasonsSpring         3186.1014  1110.6161   2.869  0.00450 ** 
#> seasonsSummer         5384.1030  1621.8989   3.320  0.00105 ** 
#> seasonsAutumn         8085.2979  1105.2357   7.315 4.07e-12 ***
#> holidayHoliday       -2815.4830  1159.9954  -2.427  0.01597 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3975 on 234 degrees of freedom
#> Multiple R-squared:  0.8457, Adjusted R-squared:  0.8378 
#> F-statistic: 106.9 on 12 and 234 DF,  p-value: < 2.2e-16

This model can explain 0.8378 and is statistically significant as p-value is < 0.05. However, it shows that there are some factors that are not statistically significant. Stepwise AIC is used again to calculate an optimum model for prediction.

lm_day_back <- step(lm_day, direction = "back")

#> Start:  AIC=4106.77
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
#>     mean_rainfall + mean_snowfall + seasons + holiday
#> 
#>                        Df  Sum of Sq        RSS    AIC
#> - mean_visibility       1    4171109 3700908872 4105.0
#> - mean_snowfall         1   16745380 3713483143 4105.9
#> <none>                               3696737763 4106.8
#> - mean_temp             1   35242929 3731980692 4107.1
#> - mean_humidity         1   78176227 3774913990 4109.9
#> - mean_dew_point_temp   1   84235051 3780972814 4110.3
#> - holiday               1   93067076 3789804839 4110.9
#> - mean_wind_speed       1  107919286 3804657049 4111.9
#> - mean_rainfall         1  433823069 4130560832 4132.2
#> - seasons               3 1210781976 4907519740 4170.7
#> - mean_solar_radiation  1 1239586080 4936323843 4176.2
#> 
#> Step:  AIC=4105.05
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
#>     mean_snowfall + seasons + holiday
#> 
#>                        Df  Sum of Sq        RSS    AIC
#> - mean_snowfall         1   17130713 3718039585 4104.2
#> <none>                               3700908872 4105.0
#> - mean_temp             1   32044146 3732953018 4105.2
#> - mean_humidity         1   74992318 3775901190 4108.0
#> - mean_dew_point_temp   1   80125920 3781034792 4108.3
#> - holiday               1   93152837 3794061709 4109.2
#> - mean_wind_speed       1  120863500 3821772372 4111.0
#> - mean_rainfall         1  461304613 4162213485 4132.1
#> - seasons               3 1293347178 4994256050 4173.1
#> - mean_solar_radiation  1 1244522333 4945431205 4174.6
#> 
#> Step:  AIC=4104.19
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
#>     seasons + holiday
#> 
#>                        Df  Sum of Sq        RSS    AIC
#> <none>                               3718039585 4104.2
#> - mean_temp             1   37696739 3755736323 4104.7
#> - mean_humidity         1   91373371 3809412956 4108.2
#> - holiday               1   91558270 3809597855 4108.2
#> - mean_dew_point_temp   1   91918258 3809957843 4108.2
#> - mean_wind_speed       1  120160943 3838200528 4110.0
#> - mean_rainfall         1  447306983 4165346567 4130.2
#> - seasons               3 1293215616 5011255201 4171.9
#> - mean_solar_radiation  1 1233207265 4951246850 4172.9

summary(lm_day_back)

#> 
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
#>     seasons + holiday, data = df_train_day)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10754.0  -2829.3    315.3   2470.8  11443.2 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           29086.4     9847.3   2.954  0.00346 ** 
#> mean_temp              -615.9      398.2  -1.547  0.12324    
#> mean_humidity          -271.4      112.7  -2.408  0.01680 *  
#> mean_wind_speed       -1290.4      467.3  -2.762  0.00620 ** 
#> mean_dew_point_temp    1015.9      420.6   2.415  0.01648 *  
#> mean_solar_radiation  12625.9     1427.1   8.847  < 2e-16 ***
#> mean_rainfall         -3567.6      669.5  -5.328 2.31e-07 ***
#> seasonsSpring          3324.3     1102.7   3.015  0.00285 ** 
#> seasonsSummer          5275.5     1607.0   3.283  0.00118 ** 
#> seasonsAutumn          7980.7     1074.2   7.430 1.99e-12 ***
#> holidayHoliday        -2792.0     1158.1  -2.411  0.01669 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3969 on 236 degrees of freedom
#> Multiple R-squared:  0.8448, Adjusted R-squared:  0.8382 
#> F-statistic: 128.5 on 10 and 236 DF,  p-value: < 2.2e-16

The stepwise regression results show a slight increase in adjusted R-Squared to 0.8382 by reducing predictors mean_visibility and mean_snowfall.

# multicollinearity
car::vif(lm_day_back)

#>                            GVIF Df GVIF^(1/(2*Df))
#> mean_temp            310.007401  1       17.607027
#> mean_humidity         41.456335  1        6.438659
#> mean_wind_speed        1.172024  1        1.082601
#> mean_dew_point_temp  428.283969  1       20.695023
#> mean_solar_radiation   3.139801  1        1.771948
#> mean_rainfall          1.645654  1        1.282831
#> seasons                5.974708  3        1.347057
#> holiday                1.048544  1        1.023985

multicollinearity check shows that the predictors mean_dew_point temp, mean_temp and mean humidity have a VIF above 10.

# Removing columns with multicollinearity and stepwise

lm_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=df_train_day)

#summary
summary(lm_day_model_final)

#> 
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
#>     mean_solar_radiation + mean_rainfall + seasons + holiday, 
#>     data = df_train_day)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10405.5  -2878.9    162.2   2675.8  11690.2 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           5695.242   1804.410   3.156 0.001805 ** 
#> mean_temp              336.384     56.312   5.974 8.44e-09 ***
#> mean_humidity           -6.515     26.227  -0.248 0.804040    
#> mean_wind_speed      -1249.830    471.686  -2.650 0.008598 ** 
#> mean_solar_radiation 12368.306   1437.522   8.604 1.07e-15 ***
#> mean_rainfall        -4020.502    649.273  -6.192 2.59e-09 ***
#> seasonsSpring         3064.422   1108.547   2.764 0.006152 ** 
#> seasonsSummer         5492.566   1620.812   3.389 0.000822 ***
#> seasonsAutumn         7929.562   1084.881   7.309 4.10e-12 ***
#> holidayHoliday       -2462.738   1161.768  -2.120 0.035061 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4009 on 237 degrees of freedom
#> Multiple R-squared:  0.841,  Adjusted R-squared:  0.8349 
#> F-statistic: 139.2 on 9 and 237 DF,  p-value: < 2.2e-16

# multicollinearity
car::vif(lm_day_model_final)

#>                          GVIF Df GVIF^(1/(2*Df))
#> mean_temp            6.076400  1        2.465035
#> mean_humidity        2.200467  1        1.483397
#> mean_wind_speed      1.170508  1        1.081900
#> mean_solar_radiation 3.122264  1        1.766993
#> mean_rainfall        1.516602  1        1.231504
#> seasons              5.613827  3        1.333142
#> holiday              1.034021  1        1.016868

The final model can explain 83.49% of the data although there is a slight decrease from the previous results but there is no multicollinearity in the final model. besides that, the model is also statistically significant at the 95% confidence interval because the p value is <0.05. The p-value of mean_humidity is greater than 0.5 and therefore, not statistically significant.

par(mfrow=c(2,2))
plot(lm_day_model_final , which=1:4)

There are no outliers in the dataset, but the resituals shows a bit of hetroscedasticity.

# Evaluate homoscedasticity
bptest(lm_day_model_final)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm_day_model_final
#> BP = 50.356, df = 9, p-value = 9.234e-08

p-value = 9.234e-08 < 0.05 so we have to reject H0. Residuals are not distributed evenly (heteroscedasticity)

# Test for Autocorrelated Errors
durbinWatsonTest(lm_day_model_final)

#>  lag Autocorrelation D-W Statistic p-value
#>    1      0.02524088      1.942352   0.628
#>  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

6.1.3 box-cox transformation

We will try to transform the data using box-cox transformation as it does not pass some tests.

# Making all the values positive
df_by_day_BC <- df_by_day %>% 
  select(-mean_humidity)

for (i in colnames(df_by_day_BC)){
  if(class(df_by_day_BC[[i]]) == "integer" | class(df_by_day_BC[[i]]) == "numeric"){
  df_by_day_BC[i] <- round(df_by_day_BC[i] + (abs(min(df_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
    mean_solar_radiation 
  ,data = df_by_day_BC)

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = (data^lambda_val)-1/lambda_val
  return(data)
}

for (i in colnames(df_by_day_BC)){
  if(class(df_by_day_BC[[i]]) == "integer" | class(df_by_day_BC[[i]]) == "numeric"){
    df_by_day_BC[i] <- transform_boxcox(df_by_day_BC[i])
  }
}

6.1.4 Final Model

smp_size_bc <- floor(0.7 * nrow(df_by_day_BC))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex_bc <- sample(seq_len(nrow(df_by_day_BC)), size = smp_size)

df_test_day_bc <- df_by_day_BC[ -trainIndex,]
df_train_day_bc <- df_by_day_BC[ trainIndex,]

lm_day_final_BC = lm(sum_bike_count ~ mean_temp +  mean_wind_speed + 
    mean_solar_radiation + mean_rainfall + seasons + holiday, 
    data = df_train_day_bc)
  
summary(lm_day_final_BC)

#> 
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_wind_speed + mean_solar_radiation + 
#>     mean_rainfall + seasons + holiday, data = df_train_day_bc)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -31.462  -9.079   1.488   8.033  30.139 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            85.507     14.450   5.917 1.13e-08 ***
#> mean_temp              12.848      1.684   7.631 5.58e-13 ***
#> mean_wind_speed       -16.903      5.912  -2.859 0.004628 ** 
#> mean_solar_radiation   97.230     10.224   9.510  < 2e-16 ***
#> mean_rainfall         -49.827      6.002  -8.302 7.71e-15 ***
#> seasonsSpring          13.927      3.588   3.881 0.000135 ***
#> seasonsSummer          21.705      4.922   4.410 1.57e-05 ***
#> seasonsAutumn          29.584      3.497   8.460 2.73e-15 ***
#> holidayHoliday         -9.779      3.640  -2.686 0.007737 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12.55 on 238 degrees of freedom
#> Multiple R-squared:  0.871,  Adjusted R-squared:  0.8666 
#> F-statistic: 200.8 on 8 and 238 DF,  p-value: < 2.2e-16

The final model can explain 87% of the data and it is statistically significant as the p-value of the model is less than 0.05. Non of the variables’s p-values are greater than 0.05 and therefore, statistically significant.

par(mfrow=c(2,2))
plot(lm_day_final_BC, which=1:4)

There are not outliers detected.

6.2 Performance

RMSE

pred_day_final_bc <- predict(lm_day_final_BC, newdata = df_test_day_bc %>% select(-sum_bike_count))

# RMSE of train dataset
RMSE(pred = lm_day_final_BC$fitted.values, obs = df_test_day_bc$sum_bike_count)

#> [1] 48.83893

#RMSE of test dataset
RMSE(pred = pred_day_final_bc, obs = df_test_day_bc$sum_bike_count)

#> [1] 14.87956

Our model experiences an increase in R-squared to 87 %. The model also no longer seems to overfit, this is shown by the RMSE of the training data set being 48.83893 while the testing data set has an RMSE of 14.87956

6.3 Asumptions

6.3.1 Linearity

resact_final <- data.frame(residual = lm_day_final_BC$residuals, fitted = lm_day_final_BC$fitted.values)

resact_final %>% ggplot(aes(residual, fitted)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

6.3.2 Normality Test

# Test for Normally Distributed Errors
shapiro.test(lm_day_final_BC$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  lm_day_final_BC$residuals
#> W = 0.99242, p-value = 0.2368

P-value of the Shapiro-Wilk is greater than 0.05 therefore, we can conclude that our residuals are normally distributed.

6.3.3 Autocorrelation

set.seed(1)
durbinWatsonTest(lm_day_final_BC)

#>  lag Autocorrelation D-W Statistic p-value
#>    1     0.003288265      1.987583   0.896
#>  Alternative hypothesis: rho != 0

With p-value > 0.05, we can conclude that autocorrelation is not present.

6.3.4 Heterocedasticity

bptest(lm_day_final_BC)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm_day_final_BC
#> BP = 32.245, df = 8, p-value = 8.419e-05

p-value < 0.05, then we will reject H0 (heteroscedacity).

6.3.5 Multicollinearity

vif(lm_day_final_BC)

#>                          GVIF Df GVIF^(1/(2*Df))
#> mean_temp            5.133564  1        2.265737
#> mean_wind_speed      1.200817  1        1.095818
#> mean_solar_radiation 2.410500  1        1.552579
#> mean_rainfall        1.530961  1        1.237320
#> seasons              5.187576  3        1.315712
#> holiday              1.036265  1        1.017971

there are no multicollinearity variables

7 Conclusion

Variables that are useful for describing bicycle rentals are temperature, wind speed, solar radiation, mean rainfall, seasons, holidays. Our final model satisfies a number of assumptions including Linearity, normality test, autocorrelation and multicollinearity vif test, but fails the Homoscedasticity of Residuals assumption. The R-squared of the model is high, where 87% of the variables can explain bicycle rental. The accuracy of the model in predicting is measured by RMSE, with the training data having an RMSE of 48.83893 while the testing data set has an RMSE of 14.87956, this shows that our model does not overfit the training dataset.

Bike-sharing Multiple Linear Regression

Jayadi Butar Butar

2023-11-03