1 INTRODUCTION
1.1 Problem Statement
A bike-sharing provider in South Korea has been facing a significant decline in revenue due to the COVID-19 pandemic. To ensure the business's sustainability and future growth, the company is proactively devising a strategic plan. This plan aims to boost revenue once the current lockdowns and restrictions are lifted and the economy returns to normal.
As part of this strategy, the company is keen to understand the post-pandemic demand for shared bikes among the Korean population. This understanding will enable it to meet customers' needs effectively and distinguish itself from competitors, ultimately leading to substantial profits.
To gain insights into the factors influencing the demand for shared bikes, the company has engaged a consulting firm. They seek answers to two crucial questions:
Which variables play a significant role in predicting the demand for shared bikes? How effectively do these variables explain the fluctuations in bike demand? To answer these questions comprehensively, the company has collected a large dataset of daily bike demand across the Korean market, based on various factors including meteorological surveys and people's lifestyle choices.
In summary, the company is taking proactive steps to prepare for a post-pandemic resurgence in demand for shared bikes in the Korean market. It is leveraging data analysis and consulting expertise to identify the key variables driving this demand and to ensure the business is well positioned to thrive in the future economic landscape.
1.2 Business Goal
Our task is to create a demand model for shared bikes using the available independent variables. This model will serve as a valuable tool for the management team to gain insights into how demand for shared bikes fluctuates in response to various features or factors. By analyzing this model, the management can make informed decisions and adjust their business strategy to align with changing demand levels, ultimately meeting customer expectations effectively. Additionally, this model will be instrumental in providing an in-depth understanding of demand dynamics in new markets, helping the management expand their operations with confidence.
2 DATA PREPARATION
library(dplyr)
library(GGally)
library(ggplot2)
library(gridExtra)
library(caret)
library(lmtest)
library(scales)
library(car)
library(psych)
library(magrittr)
library(tidyr)
library(infotheo)
library(TSA)
library(forecast)
library(chron)
library(caTools)
2.1 Load dataset
df <- read.csv("data_input/bike.csv", check.names = F, na.strings = c("", "NA", "N/A", "na"))
df <- setNames(df, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))
rmarkdown::paged_table(df)
The dataset contains 8760 rows and 14 columns. Almost all columns are of double or integer data types.
Upon initial examination, some of the fields appear to have categorical characteristics despite being represented as integers or doubles, and the date column should be in a date format. We will conduct a thorough analysis to determine whether it is appropriate to convert these columns into categorical variables or whether they should continue to be treated as numerical values.
2.2 Data Quality Check
Check for NULL/MISSING values
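The chunk producing the output below is hidden in the rendered report; it is presumably a base-R call along these lines:
colSums(is.na(df))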
#> date bike_count hour temp humidity
#> 0 0 0 0 0
#> wind_speed visibility dew_point_temp solar_radiation rainfall
#> 0 0 0 0 0
#> snowfall seasons holiday functioning_day
#> 0 0 0 0
There are no missing or null values in any column or row.
Check Duplicated data
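The hidden chunk presumably compares the dimensions after dropping duplicates, e.g. with dplyr:
dim(distinct(df))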
#> [1] 8760 14
The dimensions are unchanged after applying the distinct function, which means there are no duplicated rows.
2.3 Data Cleaning
Based on a high-level examination of the data and the data dictionary, several variables can be adjusted before further analysis:
Date: since the dataset already contains separate 'holiday' and 'functioning_day' columns, the 'date' column may not be needed for the analysis and can be excluded.
seasons, holiday, functioning_day and hour: these columns have categorical characteristics, so we convert them to the factor data type.
The resulting dataset, after removing and converting these columns, is saved as 'df_clean'. This preserves the original dataset for potential future analysis or validation.
df_clean <- df %>%
select(-date) %>%
mutate_at(vars(seasons, holiday, functioning_day, hour), as.factor)
glimpse(df_clean)
#> Rows: 8,760
#> Columns: 13
#> $ bike_count <int> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, …
#> $ hour <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ temp <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, …
#> $ humidity <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25…
#> $ wind_speed <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,…
#> $ visibility <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
#> $ dew_point_temp <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3…
#> $ solar_radiation <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, …
#> $ rainfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ snowfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ seasons <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter…
#> $ holiday <fct> No Holiday, No Holiday, No Holiday, No Holiday, No Hol…
#> $ functioning_day <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
3 Exploratory Data Analysis
Exploratory data analysis (EDA) is a stage in data analysis during which we thoroughly examine the dataset’s variables, aiming to uncover any potential patterns or relationships that may indicate correlations between these variables.
3.1 Correlation Analysis
The correlation matrix offers a clear depiction of multicollinearity among variables, as well as indicating which variables exhibit strong collinearity with the target variable. We will use this heatmap iteratively during the linear model-building process to validate correlations, alongside metrics like the Variance Inflation Factor (VIF) and p-values. This approach will help us make informed decisions about selecting or eliminating variables in the model. The temp and dew_point_temp variables have a large positive correlation, which we will revisit when checking for multicollinearity.
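The chunk that draws the heatmap is hidden in the rendered output; a minimal sketch of how it could be produced with GGally (the exact call is our assumption):
ggcorr(df_clean %>% select(where(is.numeric)), label = TRUE, label_round = 2)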
3.2 Numeric Variables Analysis
Let’s make a plot of all the numeric variables.
ggpairs(df_clean,
columns = c("temp", "humidity", "wind_speed","visibility", "dew_point_temp", "solar_radiation", "rainfall", "snowfall"),
        title = "Pairplot of Data")
The pair plot above shows a strong linear relationship between temp and dew_point_temp.
3.3 Categorical Variables Analysis
plot1 <- ggplot(df_clean, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)
plot2 <- ggplot(df_clean, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)
plot3 <- ggplot(df_clean, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)
plot4 <- ggplot(df_clean, aes(x = as.factor(hour), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
An overwhelming 95% of bike bookings occurred on non-holidays, indicating a significant imbalance in the data. This implies that 'holiday' may not be a strong predictor for the dependent variable. Bikes cannot be rented during non-functioning days, so the bike count is zero on those days, as seen in the boxplot. Therefore, non-functioning-day data points are deleted from the dataset, as they are not relevant to our purpose. The dataset is now left with 8465 data points.
The box plots also show a few outliers for each category. These will not be removed for now, as they might not be true outliers but could instead be explained by other factors such as weather conditions or whether the day is a holiday. They will be removed at the modelling stage if they appear to negatively affect the model assumptions. These observations provide valuable insights into the relationships between the categorical variables and the dependent variable, which can guide feature selection when building predictive models.
4 Modeling
4.1 Train-Test Split
Prior to constructing the model, it’s essential to divide the dataset into two separate subsets: the training dataset and the test dataset. The training dataset will serve as the basis for training the linear regression model, while the test dataset will be employed for comparison purposes. This helps assess whether the model becomes overly specialized (overfit) to the training data and is unable to make accurate predictions on new, unseen data.
The plan is to allocate 70% of the data to the training dataset, while the remaining 30% will be designated as the testing dataset. This division ensures that a significant portion of the data is used for model training, while a distinct portion remains reserved for evaluating the model’s performance on unseen data.
# Deleting rows when it is non-functioning day
df_clean <- df_clean[!(df_clean$functioning_day == "No"), ]
# removing unused columns
df_clean <- subset(df_clean, select = -c(functioning_day))
summary(df_clean)
#>    bike_count        hour           temp           humidity
#> Min. : 2.0 7 : 353 Min. :-17.80 Min. : 0.00
#> 1st Qu.: 214.0 8 : 353 1st Qu.: 3.00 1st Qu.:42.00
#> Median : 542.0 9 : 353 Median : 13.50 Median :57.00
#> Mean : 729.2 10 : 353 Mean : 12.77 Mean :58.15
#> 3rd Qu.:1084.0 11 : 353 3rd Qu.: 22.70 3rd Qu.:74.00
#> Max. :3556.0 12 : 353 Max. : 39.40 Max. :98.00
#> (Other):6347
#> wind_speed visibility dew_point_temp solar_radiation
#> Min. :0.000 Min. : 27 Min. :-30.600 Min. :0.0000
#> 1st Qu.:0.900 1st Qu.: 935 1st Qu.: -5.100 1st Qu.:0.0000
#> Median :1.500 Median :1690 Median : 4.700 Median :0.0100
#> Mean :1.726 Mean :1434 Mean : 3.945 Mean :0.5679
#> 3rd Qu.:2.300 3rd Qu.:2000 3rd Qu.: 15.200 3rd Qu.:0.9300
#> Max. :7.400 Max. :2000 Max. : 27.200 Max. :3.5200
#>
#> rainfall snowfall seasons holiday
#> Min. : 0.0000 Min. :0.00000 Autumn:1937 Holiday : 408
#> 1st Qu.: 0.0000 1st Qu.:0.00000 Spring:2160 No Holiday:8057
#> Median : 0.0000 Median :0.00000 Summer:2208
#> Mean : 0.1491 Mean :0.07769 Winter:2160
#> 3rd Qu.: 0.0000 3rd Qu.:0.00000
#> Max. :35.0000 Max. :8.80000
#>
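The train-test split itself is produced by a hidden chunk. Below is a minimal sketch of how the row indices `index` and the `data_train`/`data_test` objects used in the models that follow could have been created; the seed and the exact sampling call are assumptions:
# 70/30 random split; 'index' is reused later for the reduced-predictor model
set.seed(123)
index <- sample(nrow(df_clean), size = round(0.7 * nrow(df_clean)))
data_train <- df_clean[index, ]
data_test <- df_clean[-index, ]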
4.2 Linear Regression
Now we will model the linear regression using bike_count as the target variable and all other variables as predictors.
4.2.1 Model with all Predictors
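The fitting chunk is hidden in the rendered output; based on the Call shown below and the later reference to 'lm_all', it is presumably:
lm_all <- lm(bike_count ~ ., data = data_train)
summary(lm_all)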
#>
#> Call:
#> lm(formula = bike_count ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1394.01 -219.07 -9.01 201.74 1769.15
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.126e+03 1.034e+02 10.890 < 2e-16 ***
#> hour1 -1.155e+02 3.436e+01 -3.363 0.000777 ***
#> hour2 -2.234e+02 3.437e+01 -6.501 8.62e-11 ***
#> hour3 -3.114e+02 3.442e+01 -9.046 < 2e-16 ***
#> hour4 -3.694e+02 3.441e+01 -10.736 < 2e-16 ***
#> hour5 -3.720e+02 3.374e+01 -11.028 < 2e-16 ***
#> hour6 -2.026e+02 3.435e+01 -5.898 3.89e-09 ***
#> hour7 9.524e+01 3.441e+01 2.768 0.005658 **
#> hour8 4.872e+02 3.399e+01 14.336 < 2e-16 ***
#> hour9 7.601e+00 3.519e+01 0.216 0.829000
#> hour10 -2.339e+02 3.604e+01 -6.491 9.24e-11 ***
#> hour11 -2.351e+02 3.726e+01 -6.309 3.01e-10 ***
#> hour12 -2.032e+02 3.879e+01 -5.238 1.68e-07 ***
#> hour13 -2.014e+02 3.919e+01 -5.139 2.85e-07 ***
#> hour14 -1.914e+02 3.855e+01 -4.965 7.05e-07 ***
#> hour15 -1.175e+02 3.767e+01 -3.118 0.001829 **
#> hour16 5.112e+01 3.637e+01 1.406 0.159903
#> hour17 3.237e+02 3.498e+01 9.255 < 2e-16 ***
#> hour18 7.875e+02 3.528e+01 22.323 < 2e-16 ***
#> hour19 5.360e+02 3.528e+01 15.194 < 2e-16 ***
#> hour20 4.774e+02 3.479e+01 13.725 < 2e-16 ***
#> hour21 4.522e+02 3.479e+01 12.998 < 2e-16 ***
#> hour22 3.277e+02 3.389e+01 9.669 < 2e-16 ***
#> hour23 9.670e+01 3.419e+01 2.828 0.004694 **
#> temp 9.687e+00 3.864e+00 2.507 0.012208 *
#> humidity -1.040e+01 1.072e+00 -9.698 < 2e-16 ***
#> wind_speed -2.104e+00 5.594e+00 -0.376 0.706808
#> visibility 8.033e-03 1.050e-02 0.765 0.444165
#> dew_point_temp 1.559e+01 4.010e+00 3.887 0.000102 ***
#> solar_radiation 8.305e+01 1.199e+01 6.929 4.70e-12 ***
#> rainfall -7.106e+01 4.817e+00 -14.751 < 2e-16 ***
#> snowfall 4.059e+01 1.115e+01 3.641 0.000274 ***
#> seasonsSpring -1.737e+02 1.486e+01 -11.691 < 2e-16 ***
#> seasonsSummer -1.860e+02 1.829e+01 -10.166 < 2e-16 ***
#> seasonsWinter -3.556e+02 2.086e+01 -17.042 < 2e-16 ***
#> holidayNo Holiday 1.341e+02 2.321e+01 5.776 8.04e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 375 on 5890 degrees of freedom
#> Multiple R-squared: 0.6666, Adjusted R-squared: 0.6647
#> F-statistic: 336.5 on 35 and 5890 DF, p-value: < 2.2e-16
We get many more coefficients than variables because each categorical variable is expanded into dummy variables.
The summary of the 'lm_all' model provides a wealth of information, but at this stage our primary focus should be on the 'Pr(>|t|)' (p-value) column. This column indicates the significance level of each variable with respect to the model. When the 'Pr(>|t|)' value is below 0.05, we can confidently conclude that the variable has a significant impact on the model. In other words, the estimated coefficients for these variables are statistically different from zero, and they play a crucial role in the model's predictive power. Conversely, variables with p-values greater than 0.05 are considered not significant and may be candidates for removal from the model. The coefficient value shows the estimated contribution of the predictor to the target; in our case it can be interpreted, for example, as every 1-unit increase in rainfall reducing the number of bicycle rentals by about 71, holding the other predictors constant.
4.2.2 Model with Only Significant Predictors
# drop the predictors that were not significant in lm_all
df_sig <- df_clean %>% select(-c(visibility, wind_speed))
# reuse the row indices from the earlier 70/30 split
data_train_sig <- df_sig[index, ]
data_test_sig <- df_sig[-index, ]
lm_sig <- lm(bike_count ~ ., data = data_train_sig)
summary(lm_sig)
#>
#> Call:
#> lm(formula = bike_count ~ ., data = data_train_sig)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1390.20 -218.24 -9.49 202.30 1764.78
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1149.728 96.813 11.876 < 2e-16 ***
#> hour1 -115.612 34.349 -3.366 0.000768 ***
#> hour2 -223.341 34.351 -6.502 8.60e-11 ***
#> hour3 -311.316 34.396 -9.051 < 2e-16 ***
#> hour4 -369.524 34.384 -10.747 < 2e-16 ***
#> hour5 -371.601 33.702 -11.026 < 2e-16 ***
#> hour6 -202.084 34.314 -5.889 4.09e-09 ***
#> hour7 95.153 34.373 2.768 0.005654 **
#> hour8 486.926 33.937 14.348 < 2e-16 ***
#> hour9 6.754 35.148 0.192 0.847635
#> hour10 -235.152 35.976 -6.536 6.84e-11 ***
#> hour11 -236.758 37.192 -6.366 2.09e-10 ***
#> hour12 -205.138 38.717 -5.298 1.21e-07 ***
#> hour13 -203.823 39.069 -5.217 1.88e-07 ***
#> hour14 -194.215 38.367 -5.062 4.27e-07 ***
#> hour15 -120.634 37.380 -3.227 0.001257 **
#> hour16 47.918 36.027 1.330 0.183551
#> hour17 320.631 34.584 9.271 < 2e-16 ***
#> hour18 784.792 34.964 22.446 < 2e-16 ***
#> hour19 533.908 35.013 15.249 < 2e-16 ***
#> hour20 475.591 34.602 13.745 < 2e-16 ***
#> hour21 451.237 34.725 12.994 < 2e-16 ***
#> hour22 327.064 33.863 9.658 < 2e-16 ***
#> hour23 96.144 34.173 2.813 0.004918 **
#> temp 9.559 3.857 2.479 0.013219 *
#> humidity -10.598 1.037 -10.218 < 2e-16 ***
#> dew_point_temp 15.749 4.004 3.934 8.46e-05 ***
#> solar_radiation 81.934 11.835 6.923 4.89e-12 ***
#> rainfall -71.208 4.814 -14.793 < 2e-16 ***
#> snowfall 40.583 11.146 3.641 0.000274 ***
#> seasonsSpring -176.898 14.332 -12.343 < 2e-16 ***
#> seasonsSummer -185.812 18.173 -10.225 < 2e-16 ***
#> seasonsWinter -358.826 20.462 -17.536 < 2e-16 ***
#> holidayNo Holiday 134.107 23.205 5.779 7.89e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 374.9 on 5892 degrees of freedom
#> Multiple R-squared: 0.6666, Adjusted R-squared: 0.6647
#> F-statistic: 357 on 33 and 5892 DF, p-value: < 2.2e-16
We create a new model using only the significant predictors; the adjusted R-squared serves as a reference for comparing the two models. The initial model that uses all predictors has an adjusted R-squared of 0.6647, meaning it explains 66.47% of the variance in the target variable. The second model has the same adjusted R-squared of 0.6647, so we can use the simpler model with only significant predictors.
5 Evaluation
5.1 Model Performance
The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error. RMSE is preferable to MAE (mean absolute error) because it squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package. Below is the performance of the reduced model (lm_sig). \[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}{\sum_{i=1}^{n} (y_i-\hat{y}_i)^2}} \]
pred_sig <- predict(lm_sig, newdata = data_test_sig %>% select(-bike_count))
# RMSE of train dataset
RMSE(pred = lm_sig$fitted.values, obs = data_train_sig$bike_count)
#> [1] 373.8531
# RMSE of test dataset (presumed call; only the output was shown in the original)
RMSE(pred = pred_sig, obs = data_test_sig$bike_count)
#> [1] 374.1233
The training error (373.85) is only marginally lower than the test error (374.12), so the model does not appear to overfit, although the overall error is still large.
5.2 Assumptions
Linear regression is considered a parametric model because it relies on certain classical assumptions to establish its model equation. If linear regression doesn’t adhere to these assumptions, the model can yield misleading results or have biased estimators. In this section, we will specifically focus on the second model, which is the one with some variables removed.
5.2.1 Linearity
The linear regression model is built on the assumption that there exists a linear relationship between the predictor variables and the response variable. When the true relationship deviates significantly from linearity, it raises doubts about the validity of the conclusions drawn from the model. Moreover, the predictive accuracy of the model can be notably compromised.
To detect non-linearity in the model, residual plots serve as a valuable graphical tool. If these plots exhibit a discernible pattern, it suggests that there is room for improvement in the model or that it does not conform to the linearity assumption. These plots illustrate the connection between the residuals (errors) and the predicted (fitted) values.
resact <- data.frame(residual = lm_sig$residuals, fitted = lm_sig$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
There is a pattern in the residuals: they become more negative as the fitted value increases and then rise again. This pattern suggests that our model may not be sufficiently linear.
5.2.2 Normality Test
The second assumption in linear regression pertains to the normality of the residuals. This assumption suggests that the residuals should follow a normal distribution. We can assess this by employing the Shapiro-Wilk normality test, which helps determine whether the residuals conform to a normal distribution.
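The test is run in a hidden chunk; since shapiro.test() accepts at most 5000 observations, the residuals are subset first. The call matching the output below is presumably:
shapiro.test(lm_sig$residuals[0:5000])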
#>
#> Shapiro-Wilk normality test
#>
#> data: lm_sig$residuals[0:5000]
#> W = 0.98536, p-value < 2.2e-16
The p-value is < 0.05, so H0 is rejected: the residuals do not follow a normal distribution.
5.2.3 Autocorrelation
The standard errors computed for estimated regression coefficients or fitted values rely on the assumption of uncorrelated error terms, meaning there is no autocorrelation. However, if there is indeed correlation among the error terms, the estimated standard errors are likely to underestimate the true standard errors. This leads to confidence and prediction intervals appearing narrower than they should be. For instance, a 95% confidence interval may have a lower probability than 0.95 of containing the true parameter value. Additionally, p-values associated with the model may be lower than they should be, potentially leading to incorrect conclusions about the statistical significance of parameters. In essence, if error terms exhibit correlation, it can result in an unjustified sense of confidence in the model.
Autocorrelation can be identified through the Durbin-Watson test, which tests the null hypothesis that there is no autocorrelation.
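The hidden chunk presumably uses durbinWatsonTest() from the car package, matching the output below:
durbinWatsonTest(lm_sig)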
#> lag Autocorrelation D-W Statistic p-value
#> 1 -0.00168817 2.003258 0.9
#> Alternative hypothesis: rho != 0
With a p-value of 0.9 (> 0.05), the null hypothesis is not rejected, meaning there is no evidence of autocorrelation in the residuals.
5.2.4 Heteroscedasticity
Heteroscedasticity refers to a situation where the variances of the error terms are not consistent or constant across different levels of the predictor variables. It can be detected by observing a funnel-shaped pattern in the residual plot, similar to how linearity issues can be identified.
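The hidden chunk presumably runs the Breusch-Pagan test via bptest() from the lmtest package, matching the output below:
bptest(lm_sig)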
#>
#> studentized Breusch-Pagan test
#>
#> data: lm_sig
#> BP = 1721.7, df = 33, p-value < 2.2e-16
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))
We can observe that the spread of the residuals around 0 widens as the fitted values grow larger. The second way to detect heteroscedasticity is the Breusch-Pagan test, whose null hypothesis is that heteroscedasticity does not occur. With a p-value < 0.05, we conclude that there is heteroscedasticity in our model.
5.2.5 Multicollinearity
Multicollinearity arises when there is a correlation or high degree of association between the independent variables or predictors in a regression model. To assess multicollinearity, one commonly used metric is the Variance Inflation Factor (VIF). As a general guideline, a VIF value exceeding 5 or 10 is indicative of a significant level of collinearity, which can be problematic for the regression analysis.
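The output below presumably comes from vif() in the car package (GVIF is reported because the model contains factor predictors):
vif(lm_sig)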
#> GVIF Df GVIF^(1/(2*Df))
#> hour 4.109813 23 1.031203
#> temp 93.077456 1 9.647666
#> humidity 19.165637 1 4.377858
#> dew_point_temp 120.175736 1 10.962469
#> solar_radiation 4.522325 1 2.126576
#> rainfall 1.120921 1 1.058735
#> snowfall 1.118603 1 1.057640
#> seasons 5.111938 3 1.312495
#> holiday 1.025383 1 1.012612
temp and dew_point_temp both have GVIF values well above 10, showing the presence of multicollinearity. Since the two are strongly correlated, dew_point_temp will be removed from the model.
6 Model Improvement
The evaluation shows that our model fails several assumptions, including linearity, normality of the residuals, and homoscedasticity. To make the model more linear, we can transform several variables, both target and predictors.
6.1 Model Tuning
One approach that can be taken is to automatically select predictor variables using step-wise regression with the backward elimination method.
6.1.1 Stepwise Regression
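The stepwise run is hidden in the rendered output; it is presumably backward elimination from the full model via step() (the object name lm_step is ours):
lm_step <- step(lm_all, direction = "backward")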
#> Start: AIC=70280.95
#> bike_count ~ hour + temp + humidity + wind_speed + visibility +
#> dew_point_temp + solar_radiation + rainfall + snowfall +
#> seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> - wind_speed 1 19896 828180164 70279
#> - visibility 1 82335 828242603 70280
#> <none> 828160268 70281
#> - temp 1 883601 829043870 70285
#> - snowfall 1 1864134 830024402 70292
#> - dew_point_temp 1 2124695 830284963 70294
#> - holiday 1 4690744 832851012 70312
#> - solar_radiation 1 6749735 834910003 70327
#> - humidity 1 13225132 841385401 70373
#> - rainfall 1 30596042 858756310 70494
#> - seasons 3 66191249 894351517 70731
#> - hour 23 494589413 1322749681 73010
#>
#> Step: AIC=70279.09
#> bike_count ~ hour + temp + humidity + visibility + dew_point_temp +
#> solar_radiation + rainfall + snowfall + seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> - visibility 1 74134 828254298 70278
#> <none> 828180164 70279
#> - temp 1 889903 829070067 70283
#> - snowfall 1 1868554 830048718 70290
#> - dew_point_temp 1 2127534 830307698 70292
#> - holiday 1 4704268 832884431 70311
#> - solar_radiation 1 6792715 834972879 70326
#> - humidity 1 13237899 841418063 70371
#> - rainfall 1 30627799 858807963 70492
#> - seasons 3 67210918 895391082 70735
#> - hour 23 516392361 1344572525 73105
#>
#> Step: AIC=70277.62
#> bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation +
#> rainfall + snowfall + seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> <none> 828254298 70278
#> - temp 1 863575 829117872 70282
#> - snowfall 1 1863586 830117884 70289
#> - dew_point_temp 1 2175236 830429533 70291
#> - holiday 1 4694872 832949170 70309
#> - solar_radiation 1 6737663 834991961 70324
#> - humidity 1 14676337 842930635 70380
#> - rainfall 1 30762003 859016301 70492
#> - seasons 3 68694443 896948741 70744
#> - hour 23 516318243 1344572541 73103
The step-wise regression method produces an optimal formula based on the lowest AIC (Akaike Information Criterion) value, where a lower AIC indicates a better trade-off between goodness of fit and model complexity.
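The summary below is presumably printed from the stepwise result (using the lm_step name assumed above):
summary(lm_step)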
#>
#> Call:
#> lm(formula = bike_count ~ hour + temp + humidity + dew_point_temp +
#> solar_radiation + rainfall + snowfall + seasons + holiday,
#> data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1390.20 -218.24 -9.49 202.30 1764.78
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1149.728 96.813 11.876 < 2e-16 ***
#> hour1 -115.612 34.349 -3.366 0.000768 ***
#> hour2 -223.341 34.351 -6.502 8.60e-11 ***
#> hour3 -311.316 34.396 -9.051 < 2e-16 ***
#> hour4 -369.524 34.384 -10.747 < 2e-16 ***
#> hour5 -371.601 33.702 -11.026 < 2e-16 ***
#> hour6 -202.084 34.314 -5.889 4.09e-09 ***
#> hour7 95.153 34.373 2.768 0.005654 **
#> hour8 486.926 33.937 14.348 < 2e-16 ***
#> hour9 6.754 35.148 0.192 0.847635
#> hour10 -235.152 35.976 -6.536 6.84e-11 ***
#> hour11 -236.758 37.192 -6.366 2.09e-10 ***
#> hour12 -205.138 38.717 -5.298 1.21e-07 ***
#> hour13 -203.823 39.069 -5.217 1.88e-07 ***
#> hour14 -194.215 38.367 -5.062 4.27e-07 ***
#> hour15 -120.634 37.380 -3.227 0.001257 **
#> hour16 47.918 36.027 1.330 0.183551
#> hour17 320.631 34.584 9.271 < 2e-16 ***
#> hour18 784.792 34.964 22.446 < 2e-16 ***
#> hour19 533.908 35.013 15.249 < 2e-16 ***
#> hour20 475.591 34.602 13.745 < 2e-16 ***
#> hour21 451.237 34.725 12.994 < 2e-16 ***
#> hour22 327.064 33.863 9.658 < 2e-16 ***
#> hour23 96.144 34.173 2.813 0.004918 **
#> temp 9.559 3.857 2.479 0.013219 *
#> humidity -10.598 1.037 -10.218 < 2e-16 ***
#> dew_point_temp 15.749 4.004 3.934 8.46e-05 ***
#> solar_radiation 81.934 11.835 6.923 4.89e-12 ***
#> rainfall -71.208 4.814 -14.793 < 2e-16 ***
#> snowfall 40.583 11.146 3.641 0.000274 ***
#> seasonsSpring -176.898 14.332 -12.343 < 2e-16 ***
#> seasonsSummer -185.812 18.173 -10.225 < 2e-16 ***
#> seasonsWinter -358.826 20.462 -17.536 < 2e-16 ***
#> holidayNo Holiday 134.107 23.205 5.779 7.89e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 374.9 on 5892 degrees of freedom
#> Multiple R-squared: 0.6666, Adjusted R-squared: 0.6647
#> F-statistic: 357 on 33 and 5892 DF, p-value: < 2.2e-16
The experiment with step-wise regression failed to increase the adjusted R-squared. For this reason, we will try to build a model on average daily weather conditions: bike demand is summed for each day, and the daily mean of each weather variable is calculated.
6.1.2 Average Daily Model
# creating dataframe for bike count by date
df_by_day <- df[!(df$functioning_day=="No"),]
df_by_day$seasons <- factor(df_by_day$seasons, levels = c("Winter", "Spring", "Summer", "Autumn"))
df_by_day$seasons <- as.numeric(df_by_day$seasons)
df_by_day$holiday <- factor(df_by_day$holiday, levels = c("No Holiday", "Holiday"))
df_by_day$holiday <- as.numeric(df_by_day$holiday)
df_by_day <- df_by_day %>%
group_by(date) %>%
summarise(sum_bike_count = sum(bike_count),
mean_temp = mean(temp),
mean_humidity = mean(humidity),
mean_wind_speed = mean(wind_speed),
mean_visibility = mean(visibility),
mean_dew_point_temp = mean(dew_point_temp),
mean_solar_radiation = mean(solar_radiation),
mean_rainfall = mean(rainfall),
mean_snowfall = mean(snowfall),
seasons = mean(as.numeric(seasons)),
            holiday = mean(as.numeric(holiday)))
# Changing the factors back to original values
df_by_day$holiday <- as.factor(df_by_day$holiday)
df_by_day$seasons <- as.factor(df_by_day$seasons)
levels(df_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(df_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(df_by_day)
#> # A tibble: 6 × 12
#> date sum_bike_count mean_temp mean_humidity mean_wind_speed mean_visibility
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 01/01/… 4290 -1.28 39.3 1.45 1895.
#> 2 01/02/… 5377 -3.87 44 1.61 1924.
#> 3 01/03/… 5132 0.45 64.2 3.55 1084
#> 4 01/04/… 17388 15.2 68.9 1.57 832.
#> 5 01/05/… 26820 20.3 72.8 1.44 456.
#> 6 01/06/… 31928 23.7 50.1 1.95 1598.
#> # ℹ 6 more variables: mean_dew_point_temp <dbl>, mean_solar_radiation <dbl>,
#> # mean_rainfall <dbl>, mean_snowfall <dbl>, seasons <fct>, holiday <fct>
Below are histograms of the continuous variables after aggregating the data. sum_bike_count shows signs of a bimodal distribution, and temperature shows a flatter distribution compared to the original data. The remaining columns do not look very different from the original distributions.
smp_size <- floor(0.7 * nrow(df_by_day))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(df_by_day)), size = smp_size)
df_test_day <- df_by_day[-trainIndex, ]
df_train_day <- df_by_day[trainIndex, ]

lm_day <- lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data = df_train_day)
summary(lm_day)
#>
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_visibility + mean_dew_point_temp + mean_solar_radiation +
#> mean_rainfall + mean_snowfall + seasons + holiday, data = df_train_day)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10664 -2811 326 2415 11413
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 29503.9833 10654.9667 2.769 0.00607 **
#> mean_temp -609.0105 407.7464 -1.494 0.13663
#> mean_humidity -271.0193 121.8328 -2.225 0.02707 *
#> mean_wind_speed -1246.6589 476.9793 -2.614 0.00954 **
#> mean_visibility -0.3715 0.7231 -0.514 0.60785
#> mean_dew_point_temp 1000.6835 433.3629 2.309 0.02181 *
#> mean_solar_radiation 12678.6386 1431.3165 8.858 < 2e-16 ***
#> mean_rainfall -3586.3001 684.3712 -5.240 3.58e-07 ***
#> mean_snowfall -774.9305 752.6909 -1.030 0.30429
#> seasonsSpring 3186.1014 1110.6161 2.869 0.00450 **
#> seasonsSummer 5384.1030 1621.8989 3.320 0.00105 **
#> seasonsAutumn 8085.2979 1105.2357 7.315 4.07e-12 ***
#> holidayHoliday -2815.4830 1159.9954 -2.427 0.01597 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3975 on 234 degrees of freedom
#> Multiple R-squared: 0.8457, Adjusted R-squared: 0.8378
#> F-statistic: 106.9 on 12 and 234 DF, p-value: < 2.2e-16
This model explains 83.78% of the variance (adjusted R-squared = 0.8378) and is statistically significant, as the p-value is < 0.05. However, some predictors are not statistically significant. Stepwise AIC is used again to find an optimal model for prediction.
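As before, the stepwise chunk is hidden; presumably (the name lm_day_step is ours):
lm_day_step <- step(lm_day, direction = "backward")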
#> Start: AIC=4106.77
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_visibility + mean_dew_point_temp + mean_solar_radiation +
#> mean_rainfall + mean_snowfall + seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> - mean_visibility 1 4171109 3700908872 4105.0
#> - mean_snowfall 1 16745380 3713483143 4105.9
#> <none> 3696737763 4106.8
#> - mean_temp 1 35242929 3731980692 4107.1
#> - mean_humidity 1 78176227 3774913990 4109.9
#> - mean_dew_point_temp 1 84235051 3780972814 4110.3
#> - holiday 1 93067076 3789804839 4110.9
#> - mean_wind_speed 1 107919286 3804657049 4111.9
#> - mean_rainfall 1 433823069 4130560832 4132.2
#> - seasons 3 1210781976 4907519740 4170.7
#> - mean_solar_radiation 1 1239586080 4936323843 4176.2
#>
#> Step: AIC=4105.05
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
#> mean_snowfall + seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> - mean_snowfall 1 17130713 3718039585 4104.2
#> <none> 3700908872 4105.0
#> - mean_temp 1 32044146 3732953018 4105.2
#> - mean_humidity 1 74992318 3775901190 4108.0
#> - mean_dew_point_temp 1 80125920 3781034792 4108.3
#> - holiday 1 93152837 3794061709 4109.2
#> - mean_wind_speed 1 120863500 3821772372 4111.0
#> - mean_rainfall 1 461304613 4162213485 4132.1
#> - seasons 3 1293347178 4994256050 4173.1
#> - mean_solar_radiation 1 1244522333 4945431205 4174.6
#>
#> Step: AIC=4104.19
#> sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
#> seasons + holiday
#>
#> Df Sum of Sq RSS AIC
#> <none> 3718039585 4104.2
#> - mean_temp 1 37696739 3755736323 4104.7
#> - mean_humidity 1 91373371 3809412956 4108.2
#> - holiday 1 91558270 3809597855 4108.2
#> - mean_dew_point_temp 1 91918258 3809957843 4108.2
#> - mean_wind_speed 1 120160943 3838200528 4110.0
#> - mean_rainfall 1 447306983 4165346567 4130.2
#> - seasons 3 1293215616 5011255201 4171.9
#> - mean_solar_radiation 1 1233207265 4951246850 4172.9
#>
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
#> seasons + holiday, data = df_train_day)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10754.0 -2829.3 315.3 2470.8 11443.2
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 29086.4 9847.3 2.954 0.00346 **
#> mean_temp -615.9 398.2 -1.547 0.12324
#> mean_humidity -271.4 112.7 -2.408 0.01680 *
#> mean_wind_speed -1290.4 467.3 -2.762 0.00620 **
#> mean_dew_point_temp 1015.9 420.6 2.415 0.01648 *
#> mean_solar_radiation 12625.9 1427.1 8.847 < 2e-16 ***
#> mean_rainfall -3567.6 669.5 -5.328 2.31e-07 ***
#> seasonsSpring 3324.3 1102.7 3.015 0.00285 **
#> seasonsSummer 5275.5 1607.0 3.283 0.00118 **
#> seasonsAutumn 7980.7 1074.2 7.430 1.99e-12 ***
#> holidayHoliday -2792.0 1158.1 -2.411 0.01669 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3969 on 236 degrees of freedom
#> Multiple R-squared: 0.8448, Adjusted R-squared: 0.8382
#> F-statistic: 128.5 on 10 and 236 DF, p-value: < 2.2e-16
The stepwise regression results show a slight increase in adjusted R-squared to 0.8382 after removing the predictors mean_visibility and mean_snowfall.
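The multicollinearity check below is presumably vif() from car applied to the stepwise model (using the lm_day_step name assumed above):
vif(lm_day_step)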
#> GVIF Df GVIF^(1/(2*Df))
#> mean_temp 310.007401 1 17.607027
#> mean_humidity 41.456335 1 6.438659
#> mean_wind_speed 1.172024 1 1.082601
#> mean_dew_point_temp 428.283969 1 20.695023
#> mean_solar_radiation 3.139801 1 1.771948
#> mean_rainfall 1.645654 1 1.282831
#> seasons 5.974708 3 1.347057
#> holiday 1.048544 1 1.023985
The multicollinearity check shows that the predictors mean_dew_point_temp, mean_temp and mean_humidity have GVIF values above 10.
# Removing mean_dew_point_temp, flagged for multicollinearity, from the stepwise-selected predictors
lm_day_model_final <- lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data = df_train_day)
# summary
summary(lm_day_model_final)
#>
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
#> mean_solar_radiation + mean_rainfall + seasons + holiday,
#> data = df_train_day)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10405.5 -2878.9 162.2 2675.8 11690.2
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5695.242 1804.410 3.156 0.001805 **
#> mean_temp 336.384 56.312 5.974 8.44e-09 ***
#> mean_humidity -6.515 26.227 -0.248 0.804040
#> mean_wind_speed -1249.830 471.686 -2.650 0.008598 **
#> mean_solar_radiation 12368.306 1437.522 8.604 1.07e-15 ***
#> mean_rainfall -4020.502 649.273 -6.192 2.59e-09 ***
#> seasonsSpring 3064.422 1108.547 2.764 0.006152 **
#> seasonsSummer 5492.566 1620.812 3.389 0.000822 ***
#> seasonsAutumn 7929.562 1084.881 7.309 4.10e-12 ***
#> holidayHoliday -2462.738 1161.768 -2.120 0.035061 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4009 on 237 degrees of freedom
#> Multiple R-squared: 0.841, Adjusted R-squared: 0.8349
#> F-statistic: 139.2 on 9 and 237 DF, p-value: < 2.2e-16
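The GVIF check below is presumably:
vif(lm_day_model_final)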
#> GVIF Df GVIF^(1/(2*Df))
#> mean_temp 6.076400 1 2.465035
#> mean_humidity 2.200467 1 1.483397
#> mean_wind_speed 1.170508 1 1.081900
#> mean_solar_radiation 3.122264 1 1.766993
#> mean_rainfall 1.516602 1 1.231504
#> seasons 5.613827 3 1.333142
#> holiday 1.034021 1 1.016868
The final model explains 83.49% of the variance, a slight decrease from the previous result, but there is no multicollinearity in the final model. The model is also statistically significant at the 95% confidence level because its p-value is < 0.05. However, the p-value of mean_humidity is greater than 0.05, so that predictor is not statistically significant.
There are no outliers in the dataset, but the residuals show some heteroscedasticity.
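The test below is presumably:
bptest(lm_day_model_final)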
#>
#> studentized Breusch-Pagan test
#>
#> data: lm_day_model_final
#> BP = 50.356, df = 9, p-value = 9.234e-08
The p-value (9.234e-08) is < 0.05, so we have to reject H0: the residuals are not evenly distributed (heteroscedasticity).
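The Durbin-Watson output below presumably comes from:
durbinWatsonTest(lm_day_model_final)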
#> lag Autocorrelation D-W Statistic p-value
#> 1 0.02524088 1.942352 0.628
#> Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
6.1.3 Box-Cox Transformation
We will transform the data using a Box-Cox transformation, since the model still fails some of the assumption tests.
# Making all the values positive
df_by_day_BC <- df_by_day %>%
select(-mean_humidity)
for (i in colnames(df_by_day_BC)){
if(class(df_by_day_BC[[i]]) == "integer" | class(df_by_day_BC[[i]]) == "numeric"){
df_by_day_BC[i] <- round(df_by_day_BC[i] + (abs(min(df_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
}
}
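The chunk that actually applies the Box-Cox transformation is not shown in the rendered output. Below is a minimal sketch, assuming caret::preProcess() was used on the now strictly positive numeric columns (the object name bc_trans is ours):
# estimate a Box-Cox lambda for each positive numeric column and apply it
bc_trans <- preProcess(as.data.frame(df_by_day_BC), method = "BoxCox")
df_by_day_BC <- predict(bc_trans, as.data.frame(df_by_day_BC))
6.1.4 Final Model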
smp_size_bc <- floor(0.7 * nrow(df_by_day_BC))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex_bc <- sample(seq_len(nrow(df_by_day_BC)), size = smp_size_bc)
df_test_day_bc <- df_by_day_BC[-trainIndex_bc, ]
df_train_day_bc <- df_by_day_BC[trainIndex_bc, ]

lm_day_final_BC <- lm(sum_bike_count ~ mean_temp + mean_wind_speed +
                        mean_solar_radiation + mean_rainfall + seasons + holiday,
                      data = df_train_day_bc)
summary(lm_day_final_BC)
#>
#> Call:
#> lm(formula = sum_bike_count ~ mean_temp + mean_wind_speed + mean_solar_radiation +
#> mean_rainfall + seasons + holiday, data = df_train_day_bc)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -31.462 -9.079 1.488 8.033 30.139
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 85.507 14.450 5.917 1.13e-08 ***
#> mean_temp 12.848 1.684 7.631 5.58e-13 ***
#> mean_wind_speed -16.903 5.912 -2.859 0.004628 **
#> mean_solar_radiation 97.230 10.224 9.510 < 2e-16 ***
#> mean_rainfall -49.827 6.002 -8.302 7.71e-15 ***
#> seasonsSpring 13.927 3.588 3.881 0.000135 ***
#> seasonsSummer 21.705 4.922 4.410 1.57e-05 ***
#> seasonsAutumn 29.584 3.497 8.460 2.73e-15 ***
#> holidayHoliday -9.779 3.640 -2.686 0.007737 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12.55 on 238 degrees of freedom
#> Multiple R-squared: 0.871, Adjusted R-squared: 0.8666
#> F-statistic: 200.8 on 8 and 238 DF, p-value: < 2.2e-16
The final model explains about 87% of the variance in the (transformed) target and is statistically significant, as the p-value of the model is less than 0.05. None of the variables' p-values are greater than 0.05, so all predictors are statistically significant.
No outliers were detected.
6.2 Performance
RMSE
pred_day_final_bc <- predict(lm_day_final_BC, newdata = df_test_day_bc %>% select(-sum_bike_count))
# RMSE of train dataset
RMSE(pred = lm_day_final_BC$fitted.values, obs = df_train_day_bc$sum_bike_count)
#> [1] 48.83893
# RMSE of test dataset
RMSE(pred = pred_day_final_bc, obs = df_test_day_bc$sum_bike_count)
#> [1] 14.87956
Our model's adjusted R-squared increases to about 87%. Note that these RMSE values are on the Box-Cox-transformed scale, so they are not directly comparable with the earlier model's errors; with a test RMSE (14.88) no worse than the training RMSE (48.84), the model shows no sign of overfitting.
6.3 Assumptions
6.3.1 Linearity
resact_final <- data.frame(residual = lm_day_final_BC$residuals, fitted = lm_day_final_BC$fitted.values)
resact_final %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
  geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
6.3.2 Normality Test
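The test below is presumably:
shapiro.test(lm_day_final_BC$residuals)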
#>
#> Shapiro-Wilk normality test
#>
#> data: lm_day_final_BC$residuals
#> W = 0.99242, p-value = 0.2368
The p-value of the Shapiro-Wilk test is greater than 0.05; therefore, we can conclude that the residuals are normally distributed.
6.3.3 Autocorrelation
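The hidden chunk presumably repeats the Durbin-Watson test for the final model:
durbinWatsonTest(lm_day_final_BC)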
#> lag Autocorrelation D-W Statistic p-value
#> 1 0.003288265 1.987583 0.896
#> Alternative hypothesis: rho != 0
With p-value > 0.05, we can conclude that autocorrelation is not present.
6.3.4 Heteroscedasticity
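The hidden chunk presumably repeats the Breusch-Pagan test for the final model:
bptest(lm_day_final_BC)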
#>
#> studentized Breusch-Pagan test
#>
#> data: lm_day_final_BC
#> BP = 32.245, df = 8, p-value = 8.419e-05
With a p-value < 0.05, we reject H0: heteroscedasticity is still present.
6.3.5 Multicollinearity
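The hidden chunk presumably repeats the GVIF check for the final model:
vif(lm_day_final_BC)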
#> GVIF Df GVIF^(1/(2*Df))
#> mean_temp 5.133564 1 2.265737
#> mean_wind_speed 1.200817 1 1.095818
#> mean_solar_radiation 2.410500 1 1.552579
#> mean_rainfall 1.530961 1 1.237320
#> seasons 5.187576 3 1.315712
#> holiday 1.036265 1 1.017971
No predictor shows problematic multicollinearity; all GVIF values are below the usual thresholds.
7 Conclusion
The variables useful for describing bicycle rentals are temperature, wind speed, solar radiation, rainfall, season, and holiday. Our final model satisfies a number of assumptions, including linearity, normality of the residuals, absence of autocorrelation, and the multicollinearity (VIF) check, but fails the homoscedasticity-of-residuals assumption. The adjusted R-squared of the model is high: it explains about 87% of the variance in (transformed) bicycle rentals. Prediction accuracy measured by RMSE (on the transformed scale) is 48.84 on the training data and 14.88 on the test data, which shows that our model does not overfit the training dataset.
Variables that are useful for describing bicycle rentals are temperature, wind speed, solar radiation, mean rainfall, seasons, holidays. Our final model satisfies a number of assumptions including Linearity, normality test, autocorrelation and multicollinearity vif test, but fails the Homoscedasticity of Residuals assumption. The R-squared of the model is high, where 87% of the variables can explain bicycle rental. The accuracy of the model in predicting is measured by RMSE, with the training data having an RMSE of 48.83893 while the testing data set has an RMSE of 14.87956, this shows that our model does not overfit the training dataset.