The purpose of this project is to accurately predict rental bike demand on each day on factors such as weather and holiday season. This will help the public lessen the waiting time and enhance the comfort in urban cities. Several other factors such as workday and holiday also affect bike demand other than the weather conditions (Kim, 2018) and therefore, we’re interested in how integrating different variables will change the prediction accuracy. Accurately predicting the bike counts required for stable supply is a major concern as convenience plays a crucial role in people continuing using the services (Sathishkumar, Jangwoo, and Yongyun, 2020). Therefore, the project aims to predict demand for bike rental demands during am hours and pm hours as well as hourly and daily demands. This will help rental companies to have an idea of how many bikes should be available to meet the demand of the public on any given day. It will also help the company schedule bike maintenance times just in time for the demand.
Chen (2003) stated that regression based algorithms cannot predict the bike demand accurately. However, regression algorithms are comparatively faster at predicting large datasets than most other machine learning methods. Linear regression also does not assume normality of the dataset. As our data is very skewed in some factors, it is still valuable to see how the model performs. Therefore, we aim to predict bike demand using different regression algorithms to test the prediction accuracy and limitations.
This Dataset is taken from kaggle open datasets: Seoul Bike Sharing Demand Dataset at https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand. The dataset has 8760 data points and each of them are representing weather and holiday information which are possible major factors in hiring a bike. The time span of the dataset is 365 days and weather information and bike demand data is given for each hour.
Features of the dataset:
The dataset contains 14 attributes. 10 numerical and 3 categorical and 1 time series values.
Target values: Rented Bike count - Count of bikes rented at each hour
Feature attributes: Date : year-month-day Hour - Hour of he day Temperature-Temperature in Celsius Humidity - % Wind Speed - m/s Visibility - 10m Dew point temperature - Celsius Solar radiation - MJ/m2 Rainfall - mm Snowfall - cm Seasons - Winter, Spring, Summer, Autumn Holiday - Holiday/No holiday Functional Day - No(Non Functional Hours), Yes(Functional hours)
Some assumptions are made during handling of the Bike Dataset. It is assumed that there will be more demand on holidays despite bad weather conditions. There will be less demand on very cold and very hot days. Wind speed will be inversely proportional to the demand. There will be more demand during hotter seasons such as summer and spring. There will be considerably less demand on rainy days. There will be more demand during week-end than during weekdays. These will be tested during the data exploration stage. The data is splitted to 80-20 to minimise overfitting and to test the prediction accuracy of the model.
There are 4 regression models in this report predicting bike rental demands.
Model 1 - Multiple linear regression model predicting bike rental demand for every hour using hourly weather conditions. Model 2 - Multiple linear regression model predicting bike rental demand using average daily weather conditions. Model 3 - Multiple linear regression model predicting bike rental demand during am hours and pm hours. Model 4 - Multiple linear regression model predicting bike rental demand for every day with weekend column and using daily average weather conditions.
The dataset is already clean and therefore, data preprocessing steps such as removal of outliers and handling of missing values are not performed before fitting into the models. As regression models do not assume normality of the data, the dataset is used as it is without any transformation. Categorical columns, hours, seasons, holiday and functional day are set as factors before fitting into the models. All the feature attributes of the dataset, except functioning day, is used to predict bike demand in all the models.
For each of the models, the dataset is splitted and then fit into the regression model. Then the optimum model is calculated using the stepwise regression method and VIF regression to test multicollinearity. Then statistical tests such as the non-constant error variance test to check homoscedasticity of the model, Cook’s distance to check outliers, Durbin Watson test for autocorrelation and Shapiro-Wilk tests for normality, are performed to check if the model is adequate. If the model fails the tests, especially for homoscedasticity, box-cox transformation is performed on the model. Then outliers are removed and the data is then fitted to model again and tested for adequacy. Shapiro-Wilk test is performed only on datasets less than 5000 points.
# Importing libraries
library(ggplot2)
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(infotheo)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(TSA)
## Registered S3 methods overwritten by 'TSA':
## method from
## fitted.Arima forecast
## plot.Arima forecast
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(forecast)
library(chron)
library(caTools)
library(caret)
## Loading required package: lattice
# import dataset
setwd("/Users/April/Desktop/DS/Regression/Regression_Project")
bike <- read.csv("SeoulBikeData.csv",check.names = F)
#set new column names
bike <- setNames(bike, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))
#Setting categorical values
bike$hour <- factor(bike$hour)
bike$seasons <- factor(bike$seasons)
bike$holiday <- factor(bike$holiday)
bike$functioning_day <- factor(bike$functioning_day)
head(bike)
## date bike_count hour temp humidity wind_speed visibility dew_point_temp
## 1 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6
## 2 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6
## 3 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7
## 4 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6
## 5 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6
## 6 01/12/2017 100 5 -6.4 37 1.5 2000 -18.7
## solar_radiation rainfall snowfall seasons holiday functioning_day
## 1 0 0 0 Winter No Holiday Yes
## 2 0 0 0 Winter No Holiday Yes
## 3 0 0 0 Winter No Holiday Yes
## 4 0 0 0 Winter No Holiday Yes
## 5 0 0 0 Winter No Holiday Yes
## 6 0 0 0 Winter No Holiday Yes
summary(bike$functioning_day)
## No Yes
## 295 8465
According to the summary statistics of functioning_day, there are 295 non-functioning days in the dataset. These day will have zero bike counts.
fig1 <- ggplot(bike, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 1. Bike Demand by functining day") + xlab("Functioning Day") + ylab("Bike Count")
fig1 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
The bikes will not be available to rent during non functioning days and therefore, bike count will be zero on non functioning day as no one will be able to rent the bicycles as seen in Fig 1. Therefore, non-functioning day data points are deleted from the dataset as this is not relevent to our purpose. The dataset is now left with 8465 data points.
# Deleting rows when it is non-functioning day
bike<-bike[!(bike$functioning_day=="No"),]
# removing unused columns
bike <- subset(bike, select = - c(functioning_day))
summary(bike)
## date bike_count hour temp
## 01/01/2018: 24 Min. : 2.0 7 : 353 Min. :-17.80
## 01/02/2018: 24 1st Qu.: 214.0 8 : 353 1st Qu.: 3.00
## 01/03/2018: 24 Median : 542.0 9 : 353 Median : 13.50
## 01/04/2018: 24 Mean : 729.2 10 : 353 Mean : 12.77
## 01/05/2018: 24 3rd Qu.:1084.0 11 : 353 3rd Qu.: 22.70
## 01/06/2018: 24 Max. :3556.0 12 : 353 Max. : 39.40
## (Other) :8321 (Other):6347
## humidity wind_speed visibility dew_point_temp
## Min. : 0.00 Min. :0.000 Min. : 27 Min. :-30.600
## 1st Qu.:42.00 1st Qu.:0.900 1st Qu.: 935 1st Qu.: -5.100
## Median :57.00 Median :1.500 Median :1690 Median : 4.700
## Mean :58.15 Mean :1.726 Mean :1434 Mean : 3.945
## 3rd Qu.:74.00 3rd Qu.:2.300 3rd Qu.:2000 3rd Qu.: 15.200
## Max. :98.00 Max. :7.400 Max. :2000 Max. : 27.200
##
## solar_radiation rainfall snowfall seasons
## Min. :0.0000 Min. : 0.0000 Min. :0.00000 Autumn:1937
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.00000 Spring:2160
## Median :0.0100 Median : 0.0000 Median :0.00000 Summer:2208
## Mean :0.5679 Mean : 0.1491 Mean :0.07769 Winter:2160
## 3rd Qu.:0.9300 3rd Qu.: 0.0000 3rd Qu.:0.00000
## Max. :3.5200 Max. :35.0000 Max. :8.80000
##
## holiday
## Holiday : 408
## No Holiday:8057
##
##
##
##
##
Summary statistics of the dataset shows general idea of how the data without non-functioning days look like. This will be further analysed with bar plots and histograms.
Fig 2. Hourly Bike Count Bar Plot indicates the numbers of bikes rented by hour. The mean demand is lower than 500 bike counts for hours between 1 am and 6 am. We can see huge spike in boxplot at 8 which is 8 am in the morning, and we can see slowly increasing parretn in boxplot throughout daytime reaching huge peak at 18 which is 6pm. we can see bike renting starts to decrease after 18 and reached lowest at 4. All the am hours except 8 am and 7 am has deman of lower than 1000 bike counts. Therefore, we take am hours as low deman hours and pm hours as high demand hours.
fig2<- ggplot(bike, aes(x = as.factor(hour) , y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 2. Hourly Bike Count Bar Plot") + xlab("Hour") + ylab("Bike Count")
fig2 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
From Fig 3. Seasonal Bike Count Bar Plot, we can see the comparisons of bike renting by seasons. we can easily infer that during better weather times such as summer, people usually prefer to cycle more and in winters bikes are rented lowest with well lower than 500 bike counts in demand. This may be due to cold weather as well as snow during winter. There are a few outliers with higher density than the rest of the seasons where bike demand in Winter is over 500. This may be some underlying reasons such as the day having better weather than other days in winter. In Autumn, bike demand is higher than Spring though we were expecting the demand to be higher in Spring than in Autumn. This may be due to having more rainy days in Spring than in Autumn.
fig3 <- ggplot(bike, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 3. Seasonal Bike Count Bar Plot") + xlab("Seasons") + ylab("Bike Count")
fig3 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
According to figure 4, we can determine that the demand for bike is higher on non holiday days, which means mostly our users might be renting bikes for other reasons than for recreational purposes.
fig4 <- ggplot(bike, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 4. Holiday Bike Count Bar Plot") + xlab("Holiday") + ylab("Bike Count")
fig4 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
Bar plots showed a few outliers in the dataset for each category. However, these will not be removed as they might not be actual outliers but instead, can be explained by other factors such as weather conditions and whether it is holiday or not, etc. These will be removed in later modelling stage if they seem to negatively affect the model assumptions.
We are performing correlation matrix to understand relationship between our variables which is explained in greater detail below.
# correlation matrix of dataset
cor_df <- data.frame(bike$bike_count, as.integer(bike$hour), bike$temp, bike$humidity, bike$wind_speed, bike$visibility, bike$dew_point, bike$solar_radiation, bike$rainfall, bike$snowfall)
cor_df <- data.frame(cor(cor_df))
cor_df[order(-cor_df[,1]), ]
## bike.bike_count as.integer.bike.hour. bike.temp
## bike.bike_count 1.0000000 0.425255882 0.56274017
## bike.temp 0.5627402 0.122741816 1.00000000
## as.integer.bike.hour. 0.4252559 1.000000000 0.12274182
## bike.dew_point 0.4002628 0.004691479 0.91446695
## bike.solar_radiation 0.2738616 0.144658482 0.35484355
## bike.visibility 0.2123228 0.103869318 0.02826206
## bike.wind_speed 0.1250219 0.287779819 -0.03848085
## bike.rainfall -0.1286261 0.014344645 0.05214889
## bike.snowfall -0.1516108 -0.022082499 -0.21774581
## bike.humidity -0.2019727 -0.235937180 0.16642522
## bike.humidity bike.wind_speed bike.visibility
## bike.bike_count -0.2019727 0.125021946 0.21232278
## bike.temp 0.1664252 -0.038480848 0.02826206
## as.integer.bike.hour. -0.2359372 0.287779819 0.10386932
## bike.dew_point 0.5394024 -0.177170126 -0.18258645
## bike.solar_radiation -0.4572727 0.326221868 0.15304614
## bike.visibility -0.5485418 0.180427649 1.00000000
## bike.wind_speed -0.3373524 1.000000000 0.18042765
## bike.rainfall 0.2369169 -0.024931327 -0.17035180
## bike.snowfall 0.1101265 -0.003789344 -0.12285973
## bike.humidity 1.0000000 -0.337352380 -0.54854183
## bike.dew_point bike.solar_radiation bike.rainfall
## bike.bike_count 0.400262829 0.27386155 -0.128626093
## bike.temp 0.914466952 0.35484355 0.052148891
## as.integer.bike.hour. 0.004691479 0.14465848 0.014344645
## bike.dew_point 1.000000000 0.09852498 0.126812453
## bike.solar_radiation 0.098524979 1.00000000 -0.074157271
## bike.visibility -0.182586445 0.15304614 -0.170351798
## bike.wind_speed -0.177170126 0.32622187 -0.024931327
## bike.rainfall 0.126812453 -0.07415727 1.000000000
## bike.snowfall -0.149759793 -0.07337987 0.008604092
## bike.humidity 0.539402446 -0.45727269 0.236916912
## bike.snowfall
## bike.bike_count -0.151610753
## bike.temp -0.217745811
## as.integer.bike.hour. -0.022082499
## bike.dew_point -0.149759793
## bike.solar_radiation -0.073379874
## bike.visibility -0.122859731
## bike.wind_speed -0.003789344
## bike.rainfall 0.008604092
## bike.snowfall 1.000000000
## bike.humidity 0.110126502
From the correlation matrix we can determine all the below points.
1.) the target value bike_count’s correlation can be seen in order of temp, hour, dew_point, solar_radiation and so on. 2.) dew_point has high correlation with temp which may be problematic in constructing good models. 3.) bike_count has low correlation with each individual factors showing predicting bike demand is affected by more than one factor.
We will be doing further analysis of this in each model’s multicollinearity section.
Below is the histograms of continuous data to determine the shape of the variables. It can be seen that bike_count values are right-skewed. There are a lot of hours with temperature below zero, which we expect these hours to have lower demand for bike rentals. The humidity histogram shows that the data is more of normally distributed compared to the rest of the factors. Windspeed is low with very few extreme speeds. Visibility good on most of the days. Dew point temperature shows similar distribution to temperature. Solar radiation is 0 on most hours with very few hours with extrememe values. Rainfall and snowfall data is extremely skewed to the right with very few hours with high levels of rain and snow. The temperature shows signs of having a bimodal distribution.
#Creating histograms
multi.hist(bike[,sapply(bike, is.numeric)])
Model 1 is a multiple linear regression model predicting bike rental demand for every hour using hourly weather conditions. For this model, the dataset is used as it is without changing anything. Data is splitted t 80-20 to build the model.
Data is splitted to 80% train set and 20% test set. This is to minimise overfitting. Data is splitted by random selection using sample() method.
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike)), size = smp_size)
model1_prediction <- bike[ -trainIndex,]
model1 <- bike[ trainIndex,]
After splitting the data, train dataset is fitted to the model.
# regression model with all columns original data
row.names(model1) <- NULL
original_full = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model1)
#summary
summary(original_full)
##
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + wind_speed +
## visibility + dew_point_temp + solar_radiation + rainfall +
## snowfall + seasons + holiday, data = model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1396.44 -219.81 -9.44 201.57 1945.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+03 9.868e+01 11.763 < 2e-16 ***
## hour1 -1.006e+02 3.207e+01 -3.138 0.001711 **
## hour2 -2.244e+02 3.172e+01 -7.076 1.64e-12 ***
## hour3 -3.032e+02 3.210e+01 -9.444 < 2e-16 ***
## hour4 -3.751e+02 3.215e+01 -11.669 < 2e-16 ***
## hour5 -3.616e+02 3.183e+01 -11.361 < 2e-16 ***
## hour6 -1.864e+02 3.176e+01 -5.868 4.61e-09 ***
## hour7 1.243e+02 3.182e+01 3.908 9.40e-05 ***
## hour8 4.876e+02 3.207e+01 15.205 < 2e-16 ***
## hour9 1.802e+01 3.289e+01 0.548 0.583730
## hour10 -2.215e+02 3.393e+01 -6.529 7.11e-11 ***
## hour11 -2.263e+02 3.528e+01 -6.414 1.52e-10 ***
## hour12 -2.048e+02 3.666e+01 -5.587 2.40e-08 ***
## hour13 -1.815e+02 3.699e+01 -4.908 9.42e-07 ***
## hour14 -1.757e+02 3.606e+01 -4.873 1.12e-06 ***
## hour15 -9.281e+01 3.536e+01 -2.625 0.008691 **
## hour16 5.468e+01 3.405e+01 1.606 0.108374
## hour17 3.379e+02 3.297e+01 10.248 < 2e-16 ***
## hour18 7.940e+02 3.273e+01 24.256 < 2e-16 ***
## hour19 5.406e+02 3.277e+01 16.499 < 2e-16 ***
## hour20 4.682e+02 3.244e+01 14.433 < 2e-16 ***
## hour21 4.597e+02 3.253e+01 14.134 < 2e-16 ***
## hour22 3.393e+02 3.178e+01 10.676 < 2e-16 ***
## hour23 1.117e+02 3.201e+01 3.491 0.000484 ***
## temp 7.741e+00 3.704e+00 2.090 0.036674 *
## humidity -1.099e+01 1.029e+00 -10.676 < 2e-16 ***
## wind_speed -3.156e+00 5.266e+00 -0.599 0.549061
## visibility 8.024e-03 9.851e-03 0.815 0.415382
## dew_point_temp 1.750e+01 3.850e+00 4.546 5.57e-06 ***
## solar_radiation 8.194e+01 1.133e+01 7.231 5.32e-13 ***
## rainfall -6.346e+01 4.269e+00 -14.864 < 2e-16 ***
## snowfall 3.497e+01 1.075e+01 3.255 0.001140 **
## seasonsSpring -1.675e+02 1.388e+01 -12.067 < 2e-16 ***
## seasonsSummer -1.766e+02 1.713e+01 -10.312 < 2e-16 ***
## seasonsWinter -3.545e+02 1.951e+01 -18.176 < 2e-16 ***
## holidayNo Holiday 1.383e+02 2.182e+01 6.338 2.47e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 376.5 on 6736 degrees of freedom
## Multiple R-squared: 0.6614, Adjusted R-squared: 0.6597
## F-statistic: 376 on 35 and 6736 DF, p-value: < 2.2e-16
Then stepwise AIC is used to calculate an optimum model for prediction.
# possible subsets regression original data
library(MASS)
stepReg=MASS::stepAIC(original_full, direction="both")
## Start: AIC=80364.21
## bike_count ~ hour + temp + humidity + wind_speed + visibility +
## dew_point_temp + solar_radiation + rainfall + snowfall +
## seasons + holiday
##
## Df Sum of Sq RSS AIC
## - wind_speed 1 50894 954887654 80363
## - visibility 1 94041 954930801 80363
## <none> 954836760 80364
## - temp 1 619060 955455820 80367
## - snowfall 1 1501693 956338453 80373
## - dew_point_temp 1 2928905 957765666 80383
## - holiday 1 5694656 960531416 80402
## - solar_radiation 1 7412253 962249013 80415
## - humidity 1 16155129 970991890 80476
## - rainfall 1 31319043 986155804 80581
## - seasons 3 73932170 1028768930 80863
## - hour 23 564054506 1518891266 83462
##
## Step: AIC=80362.57
## bike_count ~ hour + temp + humidity + visibility + dew_point_temp +
## solar_radiation + rainfall + snowfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## - visibility 1 79525 954967180 80361
## <none> 954887654 80363
## + wind_speed 1 50894 954836760 80364
## - temp 1 627843 955515498 80365
## - snowfall 1 1504623 956392277 80371
## - dew_point_temp 1 2932423 957820077 80381
## - holiday 1 5718410 960606064 80401
## - solar_radiation 1 7400209 962287863 80413
## - humidity 1 16171684 971059339 80474
## - rainfall 1 31387164 986274819 80580
## - seasons 3 75107146 1029994800 80869
## - hour 23 588532596 1543420251 83568
##
## Step: AIC=80361.14
## bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation +
## rainfall + snowfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## <none> 954967180 80361
## + visibility 1 79525 954887654 80363
## + wind_speed 1 36378 954930801 80363
## - temp 1 603611 955570790 80363
## - snowfall 1 1497042 956464221 80370
## - dew_point_temp 1 2996946 957964125 80380
## - holiday 1 5705782 960672961 80399
## - solar_radiation 1 7343094 962310274 80411
## - humidity 1 17870682 972837862 80485
## - rainfall 1 31492353 986459532 80579
## - seasons 3 76754753 1031721932 80879
## - hour 23 588488290 1543455469 83566
stepReg$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## bike_count ~ hour + temp + humidity + wind_speed + visibility +
## dew_point_temp + solar_radiation + rainfall + snowfall +
## seasons + holiday
##
## Final Model:
## bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation +
## rainfall + snowfall + seasons + holiday
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 6736 954836760 80364.21
## 2 - wind_speed 1 50894.07 6737 954887654 80362.57
## 3 - visibility 1 79525.24 6738 954967180 80361.14
# multicollinearity test
car::vif(original_full)
## GVIF Df GVIF^(1/(2*Df))
## hour 4.806960 23 1.034721
## temp 96.163368 1 9.806292
## humidity 21.287807 1 4.613871
## wind_speed 1.415588 1 1.189785
## visibility 1.722859 1 1.312577
## dew_point_temp 124.716403 1 11.167650
## solar_radiation 4.666642 1 2.160241
## rainfall 1.114160 1 1.055538
## snowfall 1.116854 1 1.056813
## seasons 5.553691 3 1.330751
## holiday 1.023949 1 1.011903
Stepwise method produced a final regression model without visibility, and windspeed. These will be removed from the final model. Dew_point_temp’s GVIF value is greater than 10 and showing presense of multicollinearity. This will be removed from the model.
# Removing columns with multicollinearity and stepwise
original_final = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1)
#summary
summary(original_final)
##
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + solar_radiation +
## rainfall + snowfall + seasons + holiday, data = model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1378.6 -219.7 -10.4 203.1 2018.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 795.2381 38.1578 20.841 < 2e-16 ***
## hour1 -101.0909 32.1155 -3.148 0.001653 **
## hour2 -224.8470 31.7478 -7.082 1.56e-12 ***
## hour3 -302.2156 32.1248 -9.408 < 2e-16 ***
## hour4 -373.5116 32.1679 -11.611 < 2e-16 ***
## hour5 -360.1377 31.8366 -11.312 < 2e-16 ***
## hour6 -187.5019 31.7687 -5.902 3.76e-09 ***
## hour7 126.4071 31.8235 3.972 7.20e-05 ***
## hour8 489.3452 32.0661 15.261 < 2e-16 ***
## hour9 23.4321 32.8732 0.713 0.475993
## hour10 -214.8377 33.8889 -6.339 2.45e-10 ***
## hour11 -221.7090 35.2506 -6.290 3.38e-10 ***
## hour12 -203.2819 36.6409 -5.548 3.00e-08 ***
## hour13 -184.6526 36.9468 -4.998 5.95e-07 ***
## hour14 -184.3223 35.9467 -5.128 3.02e-07 ***
## hour15 -103.6727 35.1285 -2.951 0.003176 **
## hour16 42.1817 33.7517 1.250 0.211429
## hour17 326.7397 32.6444 10.009 < 2e-16 ***
## hour18 784.9803 32.4882 24.162 < 2e-16 ***
## hour19 533.7577 32.5840 16.381 < 2e-16 ***
## hour20 465.1283 32.3425 14.381 < 2e-16 ***
## hour21 458.2095 32.5283 14.086 < 2e-16 ***
## hour22 339.0684 31.8091 10.659 < 2e-16 ***
## hour23 111.4768 32.0440 3.479 0.000507 ***
## temp 24.1372 0.8797 27.439 < 2e-16 ***
## humidity -6.8340 0.3117 -21.927 < 2e-16 ***
## solar_radiation 71.5660 11.0287 6.489 9.25e-11 ***
## rainfall -65.9748 4.2413 -15.555 < 2e-16 ***
## snowfall 31.1165 10.7279 2.901 0.003737 **
## seasonsSpring -174.1587 13.3956 -13.001 < 2e-16 ***
## seasonsSummer -170.6949 17.0033 -10.039 < 2e-16 ***
## seasonsWinter -359.3512 19.1826 -18.733 < 2e-16 ***
## holidayNo Holiday 137.6383 21.8440 6.301 3.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 377 on 6739 degrees of freedom
## Multiple R-squared: 0.6603, Adjusted R-squared: 0.6587
## F-statistic: 409.4 on 32 and 6739 DF, p-value: < 2.2e-16
T-value for snowfall, hour 9 and hour 16 show that they are greater than 0.05 therefore, the predictions for these variables are not statistically significant. However, the final model shows that it can explain 65.87% of the dataset and it is statistically significant as p-value is less than 0.05.
par(mfrow=c(2,2))
plot(original_final , which=1:4)
Cook’s distance shows that there is an outlier. This will be removed from the model.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(original_final)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1508.533, Df = 1, p = < 2.22e-16
P-value is less than 0.05, implying that the constant error variance assumption is violated. Therefore, box-cox transformation will be performed on the model.
# Test for Autocorrelated Errors
durbinWatsonTest(original_final)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.004870399 2.009702 0.646
## Alternative hypothesis: rho != 0
P-value is more than 0.05 and therefore, uncorrelated error assumption is not violated.
# removing One outlier
model1_BC <- model1[-c(6040), ] # removing Outlier
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- model1_BC[i] + (abs(min(bike[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(bike_count ~ ( temp + humidity), data = model1_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- transform_boxcox(model1_BC[i])
}
}
# fitting new model
model1_BC_fit = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1_BC
)
summary(model1_BC_fit)
##
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + solar_radiation +
## rainfall + snowfall + seasons + holiday, data = model1_BC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.882 -15.014 0.253 15.629 120.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.7483 3.5814 12.215 < 2e-16 ***
## hour1 -8.6268 2.2049 -3.912 9.23e-05 ***
## hour2 -19.8489 2.1796 -9.107 < 2e-16 ***
## hour3 -28.6157 2.2053 -12.976 < 2e-16 ***
## hour4 -37.0104 2.2081 -16.761 < 2e-16 ***
## hour5 -36.2146 2.1851 -16.573 < 2e-16 ***
## hour6 -18.4712 2.1806 -8.471 < 2e-16 ***
## hour7 7.6510 2.1856 3.501 0.000467 ***
## hour8 32.2375 2.2088 14.595 < 2e-16 ***
## hour9 1.5362 2.2798 0.674 0.500430
## hour10 -18.1290 2.3629 -7.673 1.92e-14 ***
## hour11 -18.2114 2.4605 -7.402 1.51e-13 ***
## hour12 -16.8484 2.5564 -6.591 4.71e-11 ***
## hour13 -16.2825 2.5760 -6.321 2.77e-10 ***
## hour14 -16.0612 2.5079 -6.404 1.61e-10 ***
## hour15 -9.7617 2.4511 -3.983 6.89e-05 ***
## hour16 0.2894 2.3489 0.123 0.901961
## hour17 19.9810 2.2594 8.844 < 2e-16 ***
## hour18 48.9441 2.2358 21.891 < 2e-16 ***
## hour19 35.0344 2.2370 15.661 < 2e-16 ***
## hour20 31.1849 2.2202 14.046 < 2e-16 ***
## hour21 31.2509 2.2332 13.994 < 2e-16 ***
## hour22 23.5139 2.1838 10.767 < 2e-16 ***
## hour23 7.8049 2.2000 3.548 0.000391 ***
## temp 7.6162 0.2622 29.047 < 2e-16 ***
## humidity -2.5621 0.1207 -21.235 < 2e-16 ***
## solar_radiation 14.0901 1.4850 9.488 < 2e-16 ***
## rainfall -21.2329 0.7989 -26.579 < 2e-16 ***
## snowfall 2.1753 1.4817 1.468 0.142131
## seasonsSpring -14.9415 0.9187 -16.264 < 2e-16 ***
## seasonsSummer -10.9605 1.1220 -9.769 < 2e-16 ***
## seasonsWinter -30.2731 1.3436 -22.531 < 2e-16 ***
## holidayNo Holiday 12.3153 1.5010 8.205 2.75e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.89 on 6738 degrees of freedom
## Multiple R-squared: 0.7155, Adjusted R-squared: 0.7141
## F-statistic: 529.5 on 32 and 6738 DF, p-value: < 2.2e-16
The final model can explain 71.55% of the data and it is statistically significant as the p-value of the model is less than 0.05. Non of the variables’s p-values are greater than 0.05 and therefore, statistically significant. However, after transformation, factors such as snowfall, hour 9 and hour 16 are showing that they’re not statistically significant.
# Checking multicollinearity in the final Model
car::vif(model1_BC_fit)
## GVIF Df GVIF^(1/(2*Df))
## hour 4.440614 23 1.032939
## temp 5.233746 1 2.287738
## humidity 2.011847 1 1.418396
## solar_radiation 4.804818 1 2.191989
## rainfall 1.123093 1 1.059761
## snowfall 1.123577 1 1.059989
## seasons 4.780295 3 1.297904
## holiday 1.025185 1 1.012514
Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.
par(mfrow=c(2,2))
plot(model1_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(model1_BC_fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 610.9344, Df = 1, p = < 2.22e-16
P-value is less than 0.05, implying that the constant error variance assumption is still violated after box-cox transformation.
# Test for Autocorrelated Errors
durbinWatsonTest(model1_BC_fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.005014278 2.010001 0.718
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
#shapiro.test(model1_BC_fit$residuals)
# Error in shapiro.test(model1_BC_fit$residuals) : sample size must be between 3 and 5000
P-value is less than 0.05 therefore, it can be assumed that the residuals are not normally distributed for this model.
Model 1 did not turn out to pass homoscedasticity test and therefore, deemed to not be able to predict the bike demand accurately. We will now test if Model 2 can predict it better.
Model 2 is a multiple linear regression model predicting bike rental demand using average daily weather conditions. For this model, bike demand for each day and average daily weather conditions are calculated. Data will be splitted to 80-20 train and test dataset to build the model.
# creating dataframe for bike count by date
bike_by_day <- bike
levels(bike_by_day$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_by_day$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_by_day <- bike_by_day %>% group_by(date) %>% summarise(sum_bike_count = sum(bike_count), mean_temp = mean(temp), mean_humidity = mean(humidity), mean_wind_speed = mean(wind_speed), mean_visibility = mean(visibility), mean_dew_point_temp = mean(dew_point_temp), mean_solar_radiation = mean(solar_radiation), mean_rainfall = mean(rainfall), mean_snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
#Changing the factors back to original values
bike_by_day$holiday <- as.factor(bike_by_day$holiday)
bike_by_day$seasons <- as.factor(bike_by_day$seasons)
levels(bike_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(bike_by_day)
## # A tibble: 6 x 12
## date sum_bike_count mean_temp mean_humidity mean_wind_speed mean_visibility
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 01/01/… 4290 -1.28 39.3 1.45 1895.
## 2 01/02/… 5377 -3.87 44 1.61 1924.
## 3 01/03/… 5132 0.45 64.2 3.55 1084
## 4 01/04/… 17388 15.2 68.9 1.57 832.
## 5 01/05/… 26820 20.3 72.8 1.44 456.
## 6 01/06/… 31928 23.7 50.1 1.95 1598.
## # … with 6 more variables: mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>
Below is the histograms of how the continuous variables look like after transforming the data. The sum_bike_count shows signs of having a bimodal distribution and temperature shows more of flatter distribution compared to the original data. The rest of columns seem to be not too different from the original data distribution.
#Creating histograms
multi.hist(bike_by_day[,sapply(bike_by_day, is.numeric)])
Data is splitted to 80% train set and 20% test set. This is to minimise overfitting. Data is splitted by random selection using sample() method.
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.80 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model2_prediction <- bike_by_day[ -trainIndex,]
model2 <- bike_by_day[ trainIndex,]
After splitting the data, train dataset is fitted to the model.
# regression model by day
bike_by_day_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model)
##
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday, data = model2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10732.9 -2611.3 380.5 2530.7 11448.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.297e+04 9.992e+03 3.299 0.00110 **
## mean_temp -5.287e+02 3.790e+02 -1.395 0.16417
## mean_humidity -2.535e+02 1.130e+02 -2.244 0.02568 *
## mean_wind_speed -1.273e+03 4.415e+02 -2.884 0.00424 **
## mean_visibility -6.274e-02 6.641e-01 -0.094 0.92480
## mean_dew_point_temp 9.295e+02 4.042e+02 2.300 0.02222 *
## mean_solar_radiation 1.220e+04 1.364e+03 8.944 < 2e-16 ***
## mean_rainfall -3.570e+03 6.402e+02 -5.577 5.96e-08 ***
## mean_snowfall -5.857e+02 6.705e+02 -0.874 0.38313
## seasonsSpring -4.810e+03 7.525e+02 -6.392 7.19e-10 ***
## seasonsSummer -2.820e+03 9.234e+02 -3.054 0.00249 **
## seasonsAutumn -8.009e+03 1.027e+03 -7.795 1.42e-13 ***
## holidayHoliday 2.761e+03 1.103e+03 2.503 0.01291 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3940 on 269 degrees of freedom
## Multiple R-squared: 0.8501, Adjusted R-squared: 0.8434
## F-statistic: 127.1 on 12 and 269 DF, p-value: < 2.2e-16
This model can explain 0.8434 and is statistically significant as p-value is < 0.05. However, it shows that there are some factors that are not statistically significant. Stepwise AIC is used to calculate an optimum model for prediction.
# possible subsets regression
stepReg=MASS::stepAIC(bike_by_day_model, direction="both")
## Start: AIC=4682.07
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## - mean_visibility 1 138612 4176899119 4680.1
## - mean_snowfall 1 11848940 4188609447 4680.9
## <none> 4176760507 4682.1
## - mean_temp 1 30215864 4206976371 4682.1
## - mean_humidity 1 78153669 4254914175 4685.3
## - mean_dew_point_temp 1 82133290 4258893797 4685.6
## - holiday 1 97276886 4274037393 4686.6
## - mean_wind_speed 1 129189506 4305950013 4688.7
## - mean_rainfall 1 482894786 4659655293 4710.9
## - mean_solar_radiation 1 1242133949 5418894456 4753.5
## - seasons 3 1372262290 5549022797 4756.2
##
## Step: AIC=4680.08
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
## mean_snowfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## - mean_snowfall 1 12005670 4188904789 4678.9
## <none> 4176899119 4680.1
## - mean_temp 1 30530456 4207429575 4680.1
## + mean_visibility 1 138612 4176760507 4682.1
## - mean_dew_point_temp 1 84204705 4261103824 4683.7
## - mean_humidity 1 85890314 4262789433 4683.8
## - holiday 1 97275442 4274174561 4684.6
## - mean_wind_speed 1 137322948 4314222067 4687.2
## - mean_rainfall 1 504105669 4681004788 4710.2
## - mean_solar_radiation 1 1242797146 5419696265 4751.5
## - seasons 3 1473953173 5650852292 4759.3
##
## Step: AIC=4678.89
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
## seasons + holiday
##
## Df Sum of Sq RSS AIC
## <none> 4188904789 4678.9
## - mean_temp 1 35239864 4224144653 4679.3
## + mean_snowfall 1 12005670 4176899119 4680.1
## + mean_visibility 1 295342 4188609447 4680.9
## - mean_dew_point_temp 1 94497682 4283402471 4683.2
## - holiday 1 96775043 4285679832 4683.3
## - mean_humidity 1 100475636 4289380425 4683.6
## - mean_wind_speed 1 138369580 4327274369 4686.1
## - mean_rainfall 1 493107696 4682012484 4708.3
## - mean_solar_radiation 1 1235664818 5424569607 4749.8
## - seasons 3 1476936963 5665841752 4758.1
stepReg$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday
##
## Final Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
## seasons + holiday
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 269 4176760507 4682.072
## 2 - mean_visibility 1 138612 270 4176899119 4680.082
## 3 - mean_snowfall 1 12005670 271 4188904789 4678.891
# multicollinearity
car::vif(bike_by_day_model)
## GVIF Df GVIF^(1/(2*Df))
## mean_temp 332.934319 1 18.246488
## mean_humidity 49.016289 1 7.001163
## mean_wind_speed 1.263369 1 1.123997
## mean_visibility 2.050572 1 1.431982
## mean_dew_point_temp 463.892705 1 21.538169
## mean_solar_radiation 3.264403 1 1.806766
## mean_rainfall 1.704366 1 1.305514
## mean_snowfall 1.170149 1 1.081734
## seasons 7.575015 3 1.401406
## holiday 1.042355 1 1.020958
mean_temperature and mean_dew_point_temp GVIF values are greater than 10 and therefore, there is multicollinearity. The stepwise AIC removed visibility and snowfall in the final model. These variables will be removed from the final model.
# Removing columns with multicollinearity and stepwise
bike_by_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model_final)
##
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_solar_radiation + mean_rainfall + seasons + holiday,
## data = model2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10443.0 -2558.4 196.8 2565.9 11704.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11836.98 2198.00 5.385 1.57e-07 ***
## mean_temp 346.09 53.14 6.512 3.56e-10 ***
## mean_humidity -14.96 24.24 -0.617 0.53759
## mean_wind_speed -1241.65 433.81 -2.862 0.00453 **
## mean_solar_radiation 11933.36 1369.45 8.714 2.93e-16 ***
## mean_rainfall -3944.21 605.06 -6.519 3.43e-10 ***
## seasonsSpring -4977.03 705.83 -7.051 1.46e-11 ***
## seasonsSummer -2571.73 921.82 -2.790 0.00565 **
## seasonsAutumn -8036.42 1011.77 -7.943 5.24e-14 ***
## holidayHoliday 2506.71 1106.13 2.266 0.02422 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3968 on 272 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8412
## F-statistic: 166.4 on 9 and 272 DF, p-value: < 2.2e-16
# multicollinearity
car::vif(bike_by_day_model_final)
## GVIF Df GVIF^(1/(2*Df))
## mean_temp 6.454181 1 2.540508
## mean_humidity 2.223462 1 1.491128
## mean_wind_speed 1.202904 1 1.096770
## mean_solar_radiation 3.244988 1 1.801385
## mean_rainfall 1.500984 1 1.225147
## seasons 6.040279 3 1.349510
## holiday 1.033729 1 1.016725
The final model can explain 84.18% of the data and is statistically significant at 95% confidence interval as p-value < 0.05. There is no multicollinearity in the final model. The p-value of mean_humidity is greater than 0.5 and therefore, is not statistically significant.
par(mfrow=c(2,2))
plot(bike_by_day_model_final , which=1:4)
There are no outliers in the dataset but the resituals shows a bit of hetroscedasticity.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_model_final)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 27.39761, Df = 1, p = 1.6564e-07
P-value is less than 0.05, implying that the constant error variance assumption is violated.
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_model_final)
## lag Autocorrelation D-W Statistic p-value
## 1 0.01148254 1.970562 0.798
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
# Test for Normally Distributed Errors
shapiro.test(bike_by_day_model_final$residuals)
##
## Shapiro-Wilk normality test
##
## data: bike_by_day_model_final$residuals
## W = 0.99635, p-value = 0.7624
P-value of the Shapiro-Wilk is greater than 0.05 therefore, it can be concluded that normality error assumption is not violated. Data will be transformed using box-cox transformation as it did not pass all the tests.
# Making all the values positive
bike_by_day_BC <- bike_by_day
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- round(bike_by_day_BC[i] + (abs(min(bike_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
mean_humidity +
mean_solar_radiation
,data = bike_by_day_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- transform_boxcox(bike_by_day_BC[i])
}
}
# fitting new model
bike_by_day_BC_fit = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
mean_solar_radiation + mean_rainfall + seasons + holiday,
data = bike_by_day_BC)
summary(bike_by_day_BC_fit)
##
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_solar_radiation + mean_rainfall + seasons + holiday,
## data = bike_by_day_BC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.329 -14.109 1.105 14.262 54.485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.8573 18.6492 7.875 4.54e-14 ***
## mean_temp 16.3622 1.9991 8.185 5.41e-15 ***
## mean_humidity -0.7034 1.7476 -0.403 0.687541
## mean_wind_speed -19.0108 7.2987 -2.605 0.009597 **
## mean_solar_radiation 138.3179 15.0689 9.179 < 2e-16 ***
## mean_rainfall -67.4140 7.5264 -8.957 < 2e-16 ***
## seasonsSpring -26.8193 3.3871 -7.918 3.38e-14 ***
## seasonsSummer -18.7019 4.0442 -4.624 5.33e-06 ***
## seasonsAutumn -48.7661 4.8991 -9.954 < 2e-16 ***
## holidayHoliday 20.1854 5.2618 3.836 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.81 on 343 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8563
## F-statistic: 234.1 on 9 and 343 DF, p-value: < 2.2e-16
The final model can explain 0.86% of the data and it is statistically significant as the p-value of the model is less than 0.05. Non of the variables’s p-values are greater than 0.05 and therefore, statistically significant.
# Checking multicollinearity in the final Model
car::vif(bike_by_day_BC_fit)
## GVIF Df GVIF^(1/(2*Df))
## mean_temp 6.314855 1 2.512937
## mean_humidity 2.395446 1 1.547723
## mean_wind_speed 1.274952 1 1.129138
## mean_solar_radiation 3.309453 1 1.819190
## mean_rainfall 1.609907 1 1.268821
## seasons 5.734895 3 1.337892
## holiday 1.034576 1 1.017141
Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.
par(mfrow=c(2,2))
plot(bike_by_day_BC_fit , which=1:4)
There are not outliers detected.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_BC_fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.579538, Df = 1, p = 0.018172
P-value is still less than 0.05, implying that the constant error variance assumption is violated even after transformation.
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_BC_fit)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0320337 1.935827 0.5
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
shapiro.test(bike_by_day_BC_fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: bike_by_day_BC_fit$residuals
## W = 0.99623, p-value = 0.5709
P-value is less than 0.05 therefore, it can be assumed that the residuals are not normally distributed for this model.
Model 3 is a multiple linear regression model predicting bike rental demand during am hours and pm hours. For this, sum of bike counts for am hours and pm hours are calculated and average values of the am and pm weather data.
# Dividing to am and pm
bike_AMPM <- bike
bike_AMPM$hour <- as.numeric(bike_AMPM$hour)
bike_AMPM['hour'] [bike_AMPM['hour'] <=12] <- -1
bike_AMPM['hour'] [bike_AMPM['hour'] >12] <- 1
bike_AMPM$hour <- factor(bike_AMPM$hour)
head(bike_AMPM)
## date bike_count hour temp humidity wind_speed visibility dew_point_temp
## 1 01/12/2017 254 -1 -5.2 37 2.2 2000 -17.6
## 2 01/12/2017 204 -1 -5.5 38 0.8 2000 -17.6
## 3 01/12/2017 173 -1 -6.0 39 1.0 2000 -17.7
## 4 01/12/2017 107 -1 -6.2 40 0.9 2000 -17.6
## 5 01/12/2017 78 -1 -6.0 36 2.3 2000 -18.6
## 6 01/12/2017 100 -1 -6.4 37 1.5 2000 -18.7
## solar_radiation rainfall snowfall seasons holiday
## 1 0 0 0 Winter No Holiday
## 2 0 0 0 Winter No Holiday
## 3 0 0 0 Winter No Holiday
## 4 0 0 0 Winter No Holiday
## 5 0 0 0 Winter No Holiday
## 6 0 0 0 Winter No Holiday
# creating dataframe for bike count by am pm
levels(bike_AMPM$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_AMPM$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_AMPM <- bike_AMPM %>% group_by(date,hour) %>% summarise(bike_count = sum(bike_count), temp = mean(temp), humidity = mean(humidity), wind_speed = mean(wind_speed), visibility = mean(visibility), dew_point_temp = mean(dew_point_temp), solar_radiation = mean(solar_radiation), rainfall = mean(rainfall), snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
bike_AMPM
## # A tibble: 706 x 13
## # Groups: date [353]
## date hour bike_count temp humidity wind_speed visibility dew_point_temp
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01/01/… -1 1520 -3.57 44.5 0.992 1896. -14.2
## 2 01/01/… 1 2770 1 34.2 1.92 1894. -13.6
## 3 01/02/… -1 1962 -6.51 53.9 1.31 1891. -14.5
## 4 01/02/… 1 3415 -1.23 34.1 1.91 1956. -15.6
## 5 01/03/… -1 1361 1.63 85.8 2.82 778. -0.767
## 6 01/03/… 1 3771 -0.733 42.7 4.29 1390. -12.0
## 7 01/04/… -1 4227 12.9 65.8 0.967 974. 6.52
## 8 01/04/… 1 13161 17.5 72 2.17 690. 12.2
## 9 01/05/… -1 7847 19.1 77.9 0.842 374. 15.0
## 10 01/05/… 1 18973 21.6 67.7 2.03 539. 15.3
## # … with 696 more rows, and 5 more variables: solar_radiation <dbl>,
## # rainfall <dbl>, snowfall <dbl>, seasons <dbl>, holiday <dbl>
#Changing the factors back to original values
bike_AMPM$holiday <- as.factor(bike_AMPM$holiday)
bike_AMPM$seasons <- as.factor(bike_AMPM$seasons)
levels(bike_AMPM$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_AMPM$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
#Drop date column
bike_AMPM <- subset(bike_AMPM, select = -c(date))
head(bike_AMPM)
## # A tibble: 6 x 12
## hour bike_count temp humidity wind_speed visibility dew_point_temp
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1 1520 -3.57 44.5 0.992 1896. -14.2
## 2 1 2770 1 34.2 1.92 1894. -13.6
## 3 -1 1962 -6.51 53.9 1.31 1891. -14.5
## 4 1 3415 -1.23 34.1 1.91 1956. -15.6
## 5 -1 1361 1.63 85.8 2.82 778. -0.767
## 6 1 3771 -0.733 42.7 4.29 1390. -12.0
## # … with 5 more variables: solar_radiation <dbl>, rainfall <dbl>,
## # snowfall <dbl>, seasons <fct>, holiday <fct>
Below is the histograms of how the continuous variables look like after transforming the data. The sum_bike_count shows signs of having a trimodal distribution and temperature shows bimodal distribution compared to the original data. The rest of columns seem to be not too different from the original data distribution.
#Creating histograms
multi.hist(bike_AMPM[,sapply(bike_AMPM, is.numeric)])
Data is splitted to 80% train set and 20% test set. This is to minimise overfitting. Data is splitted by random selection using sample() method.
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_AMPM))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_AMPM)), size = smp_size)
model3_prediction <- bike_AMPM[ -trainIndex,]
model3 <- bike_AMPM[ trainIndex,]
# regression model with all columns except functioning day
am_pm_model = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model3)
#summary
summary(am_pm_model)
##
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + wind_speed +
## visibility + dew_point_temp + solar_radiation + rainfall +
## snowfall + seasons + holiday, data = model3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7580.5 -1774.8 -110.7 1658.1 9173.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10803.0578 3438.4154 3.142 0.00177 **
## hour1 2603.6307 334.3744 7.787 3.44e-14 ***
## temp -70.8103 132.5580 -0.534 0.59343
## humidity -88.9478 38.1577 -2.331 0.02011 *
## wind_speed -366.3750 183.0305 -2.002 0.04581 *
## visibility -0.1240 0.2969 -0.418 0.67626
## dew_point_temp 281.2093 139.9159 2.010 0.04494 *
## solar_radiation 6282.3114 566.6650 11.086 < 2e-16 ***
## rainfall -1575.4707 233.7714 -6.739 4.03e-11 ***
## snowfall -3.7379 319.7514 -0.012 0.99068
## seasonsSpring -2792.0417 375.2958 -7.440 3.91e-13 ***
## seasonsSummer -2463.7911 459.3894 -5.363 1.21e-07 ***
## seasonsAutumn -3864.3881 512.7259 -7.537 1.99e-13 ***
## holidayHoliday 1423.5079 561.4535 2.535 0.01151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2798 on 550 degrees of freedom
## Multiple R-squared: 0.8051, Adjusted R-squared: 0.8005
## F-statistic: 174.8 on 13 and 550 DF, p-value: < 2.2e-16
#Building ANOVA table with full dataset (all columns) against intercept only model
anova(am_pm_model)
## Analysis of Variance Table
##
## Response: bike_count
## Df Sum Sq Mean Sq F value Pr(>F)
## hour 1 5152074809 5152074809 658.1438 < 2.2e-16 ***
## temp 1 8164062729 8164062729 1042.9056 < 2.2e-16 ***
## humidity 1 1620661660 1620661660 207.0289 < 2.2e-16 ***
## wind_speed 1 2706774 2706774 0.3458 0.55676
## visibility 1 4452068 4452068 0.5687 0.45109
## dew_point_temp 1 25404155 25404155 3.2452 0.07218 .
## solar_radiation 1 1445028910 1445028910 184.5930 < 2.2e-16 ***
## rainfall 1 474552468 474552468 60.6210 3.458e-14 ***
## snowfall 1 20816 20816 0.0027 0.95889
## seasons 3 845211047 281737016 35.9901 < 2.2e-16 ***
## holiday 1 50321496 50321496 6.4282 0.01151 *
## Residuals 550 4305504463 7828190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# possible subsets regression
library(MASS)
stepReg=MASS::stepAIC(am_pm_model, direction="both")
## Start: AIC=8966.33
## bike_count ~ hour + temp + humidity + wind_speed + visibility +
## dew_point_temp + solar_radiation + rainfall + snowfall +
## seasons + holiday
##
## Df Sum of Sq RSS AIC
## - snowfall 1 1070 4305505533 8964.3
## - visibility 1 1366394 4306870856 8964.5
## - temp 1 2233786 4307738249 8964.6
## <none> 4305504463 8966.3
## - wind_speed 1 31366505 4336870968 8968.4
## - dew_point_temp 1 31621783 4337126246 8968.5
## - humidity 1 42537159 4348041622 8969.9
## - holiday 1 50321496 4355825959 8970.9
## - rainfall 1 355548738 4661053200 9009.1
## - hour 1 474628752 4780133215 9023.3
## - seasons 3 845815569 5151320032 9061.5
## - solar_radiation 1 962160506 5267664969 9078.1
##
## Step: AIC=8964.33
## bike_count ~ hour + temp + humidity + wind_speed + visibility +
## dew_point_temp + solar_radiation + rainfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## - visibility 1 1365404 4306870937 8962.5
## - temp 1 2260638 4307766171 8962.6
## <none> 4305505533 8964.3
## + snowfall 1 1070 4305504463 8966.3
## - wind_speed 1 31365939 4336871471 8966.4
## - dew_point_temp 1 32132277 4337637810 8966.5
## - humidity 1 43793132 4349298665 8968.0
## - holiday 1 50335767 4355841299 8968.9
## - rainfall 1 357746412 4663251945 9007.3
## - hour 1 475162772 4780668305 9021.4
## - seasons 3 845887370 5151392902 9059.5
## - solar_radiation 1 964779517 5270285050 9076.4
##
## Step: AIC=8962.51
## bike_count ~ hour + temp + humidity + wind_speed + dew_point_temp +
## solar_radiation + rainfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## - temp 1 1951238 4308822174 8960.8
## <none> 4306870937 8962.5
## + visibility 1 1365404 4305505533 8964.3
## + snowfall 1 80 4306870856 8964.5
## - dew_point_temp 1 31127896 4337998833 8964.6
## - wind_speed 1 34342550 4341213487 8965.0
## - humidity 1 42728660 4349599596 8966.1
## - holiday 1 50334035 4357204972 8967.1
## - rainfall 1 363503334 4670374271 9006.2
## - hour 1 494415635 4801286572 9021.8
## - seasons 3 873130692 5180001629 9060.6
## - solar_radiation 1 965276021 5272146957 9074.6
##
## Step: AIC=8960.77
## bike_count ~ hour + humidity + wind_speed + dew_point_temp +
## solar_radiation + rainfall + seasons + holiday
##
## Df Sum of Sq RSS AIC
## <none> 4308822174 8960.8
## + temp 1 1951238 4306870937 8962.5
## + visibility 1 1056004 4307766171 8962.6
## + snowfall 1 18823 4308803352 8962.8
## - wind_speed 1 33797487 4342619662 8963.2
## - holiday 1 50067049 4358889224 8965.3
## - humidity 1 207643653 4516465827 8985.3
## - rainfall 1 397393139 4706215314 9008.5
## - hour 1 493236863 4802059037 9019.9
## - dew_point_temp 1 522431104 4831253278 9023.3
## - seasons 3 871530804 5180352978 9058.7
## - solar_radiation 1 987955361 5296777535 9075.2
stepReg$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## bike_count ~ hour + temp + humidity + wind_speed + visibility +
## dew_point_temp + solar_radiation + rainfall + snowfall +
## seasons + holiday
##
## Final Model:
## bike_count ~ hour + humidity + wind_speed + dew_point_temp +
## solar_radiation + rainfall + seasons + holiday
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 550 4305504463 8966.332
## 2 - snowfall 1 1069.786 551 4305505533 8964.332
## 3 - visibility 1 1365404.194 552 4306870937 8962.511
## 4 - temp 1 1951237.637 553 4308822174 8960.766
# multicollinearity
car::vif(am_pm_model)
## GVIF Df GVIF^(1/(2*Df))
## hour 2.013430 1 1.418954
## temp 183.172662 1 13.534130
## humidity 34.606618 1 5.882739
## wind_speed 1.581910 1 1.257740
## visibility 1.849322 1 1.359898
## dew_point_temp 247.930596 1 15.745812
## solar_radiation 4.019861 1 2.004959
## rainfall 1.430928 1 1.196214
## snowfall 1.148384 1 1.071627
## seasons 6.907353 3 1.380020
## holiday 1.035202 1 1.017449
Stepwise method produced a final regression model without temperature, snowfall and visibility. dew_point_temp and temp GVIF values are greater than 10 and therefore, there is multicollinearity. Therefore, temperature visibility and snowfall will be removed from the final model.
# Removing columns with multicollinearity and stepwise
am_pm_model_final = lm(bike_count ~ hour + humidity + wind_speed + dew_point_temp +
solar_radiation + rainfall + seasons + holiday, data=model3)
#summary
summary(am_pm_model_final)
##
## Call:
## lm(formula = bike_count ~ hour + humidity + wind_speed + dew_point_temp +
## solar_radiation + rainfall + seasons + holiday, data = model3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7597.7 -1775.7 -83.3 1634.7 9298.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8820.03 1054.90 8.361 5.07e-16 ***
## hour1 2607.90 327.78 7.956 1.01e-14 ***
## humidity -68.07 13.19 -5.162 3.41e-07 ***
## wind_speed -375.22 180.16 -2.083 0.0377 *
## dew_point_temp 207.75 25.37 8.188 1.84e-15 ***
## solar_radiation 6226.55 552.96 11.260 < 2e-16 ***
## rainfall -1609.85 225.42 -7.142 2.92e-12 ***
## seasonsSpring -2752.29 354.07 -7.773 3.76e-14 ***
## seasonsSummer -2463.34 458.11 -5.377 1.12e-07 ***
## seasonsAutumn -3794.97 497.44 -7.629 1.04e-13 ***
## holidayHoliday 1419.42 559.95 2.535 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2791 on 553 degrees of freedom
## Multiple R-squared: 0.8049, Adjusted R-squared: 0.8014
## F-statistic: 228.2 on 10 and 553 DF, p-value: < 2.2e-16
The final model can explain 80.49% of the data and it is statistically significant as the p-value of the model is less than 0.05. Non of the variables’s p-values are greater than 0.05 and therefore, statistically significant.
# Checking multicollinearity in the final Model
car::vif(am_pm_model_final)
## GVIF Df GVIF^(1/(2*Df))
## hour 1.943833 1 1.394214
## humidity 4.151456 1 2.037512
## wind_speed 1.539901 1 1.240928
## dew_point_temp 8.190588 1 2.861920
## solar_radiation 3.845713 1 1.961049
## rainfall 1.336744 1 1.156176
## seasons 5.829589 3 1.341548
## holiday 1.034496 1 1.017102
Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.
par(mfrow=c(2,2))
plot(am_pm_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
library(car)
ncvTest(am_pm_model_final)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 72.38561, Df = 1, p = < 2.22e-16
P-value is less than 0.05, implying that the constant error variance assumption is violated.
# Test for Autocorrelated Errors
durbinWatsonTest(am_pm_model_final)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04232994 2.082806 0.336
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
shapiro.test(am_pm_model_final$residuals)
##
## Shapiro-Wilk normality test
##
## data: am_pm_model_final$residuals
## W = 0.99099, p-value = 0.001627
P-value is less than 0.05 therefore, it can be assumed that the residuals are not normally distributed for this model. Therefore, box-cox transformation is performed again on the model.
bike_AMPM_BC <- model3[-c(185), ]
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- round(bike_AMPM_BC[i] + (abs(min(bike_AMPM[i])))+1 , digits=3) # adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_AMPM_BC$bike_count ~
bike_AMPM_BC$humidity +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = data^lambda_val
return(data)
}
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- transform_boxcox(bike_AMPM_BC[i])
}
}
# fitting new model
bike_AMPM_BC_fit = lm(bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour +
bike_AMPM_BC$humidity +
bike_AMPM_BC$wind_speed +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall +
bike_AMPM_BC$seasons +
bike_AMPM_BC$holiday)
summary(bike_AMPM_BC_fit)
##
## Call:
## lm(formula = bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour + bike_AMPM_BC$humidity +
## bike_AMPM_BC$wind_speed + bike_AMPM_BC$dew_point_temp + bike_AMPM_BC$solar_radiation +
## bike_AMPM_BC$rainfall + bike_AMPM_BC$seasons + bike_AMPM_BC$holiday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9794834 -2027571 -338710 1946797 13767376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1745897.9 1163242.2 -1.501 0.133956
## bike_AMPM_BC$hour1 2762832.5 395822.2 6.980 8.50e-12 ***
## bike_AMPM_BC$humidity -1043.3 454.6 -2.295 0.022101 *
## bike_AMPM_BC$wind_speed -157128.9 58269.7 -2.697 0.007219 **
## bike_AMPM_BC$dew_point_temp 7682.6 1615.6 4.755 2.53e-06 ***
## bike_AMPM_BC$solar_radiation 3305837.4 259130.7 12.757 < 2e-16 ***
## bike_AMPM_BC$rainfall -368076.0 105967.4 -3.473 0.000554 ***
## bike_AMPM_BC$seasonsSpring -2828765.0 442537.9 -6.392 3.48e-10 ***
## bike_AMPM_BC$seasonsSummer -2370556.0 611605.0 -3.876 0.000119 ***
## bike_AMPM_BC$seasonsAutumn -3317203.4 579843.7 -5.721 1.74e-08 ***
## bike_AMPM_BC$holidayHoliday 839983.5 693100.6 1.212 0.226061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3464000 on 552 degrees of freedom
## Multiple R-squared: 0.7179, Adjusted R-squared: 0.7128
## F-statistic: 140.5 on 10 and 552 DF, p-value: < 2.2e-16
After transforming the data, the model can explain 71.79% of the data and it is statistically significant as the p-value of the model is less than 0.05. Non of the variables’s p-values are greater than 0.05 and therefore, it is statistically significant.
# Checking multicollinearity in the final Model
car::vif(bike_AMPM_BC_fit)
## GVIF Df GVIF^(1/(2*Df))
## bike_AMPM_BC$hour 1.837746 1 1.355635
## bike_AMPM_BC$humidity 3.497117 1 1.870058
## bike_AMPM_BC$wind_speed 1.432754 1 1.196977
## bike_AMPM_BC$dew_point_temp 7.285334 1 2.699136
## bike_AMPM_BC$solar_radiation 3.337718 1 1.826942
## bike_AMPM_BC$rainfall 1.299225 1 1.139835
## bike_AMPM_BC$seasons 6.190769 3 1.355057
## bike_AMPM_BC$holiday 1.029240 1 1.014515
Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the transformed model.
par(mfrow=c(2,2))
plot(bike_AMPM_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_AMPM_BC_fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 161.5261, Df = 1, p = < 2.22e-16
P-value is still less than 0.05, implying that the constant error variance assumption is violated even after transformation.
# Test for Autocorrelated Errors
durbinWatsonTest(bike_AMPM_BC_fit)
## lag Autocorrelation D-W Statistic p-value
## 1 6.005578e-05 1.998875 0.926
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
shapiro.test(bike_AMPM_BC_fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: bike_AMPM_BC_fit$residuals
## W = 0.96879, p-value = 1.42e-09
P-value is less than 0.05 therefore, it can be assumed that the residuals are not normally distributed for this model. Therefore, this model cannot adequately predict the bike demand.
Model 4 is a multiple linear regression model predicting bike rental demand for every day with weekend column and using daily average weather conditions. We expect bike demand will be higher during weekends as more people will be hiring them for recreational purposes and this will be one of the factors.
#Creating weekend colum
bike_by_day$date <- as.Date(bike_by_day$date, "%d/%m/%y")
bike_by_day$weekend = chron::is.weekend(bike_by_day$date)
bike_by_day$weekend <- as.factor(bike_by_day$weekend)
bike_by_day
## # A tibble: 353 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <int> <dbl> <dbl> <dbl>
## 1 2020-01-01 4290 -1.28 39.3 1.45
## 2 2020-02-01 5377 -3.87 44 1.61
## 3 2020-03-01 5132 0.45 64.2 3.55
## 4 2020-04-01 17388 15.2 68.9 1.57
## 5 2020-05-01 26820 20.3 72.8 1.44
## 6 2020-06-01 31928 23.7 50.1 1.95
## 7 2020-07-01 3231 22.0 93.5 0.783
## 8 2020-08-01 20712 33.4 55.8 1.68
## 9 2020-09-01 26010 25.5 61.2 1.21
## 10 2020-10-01 27909 15.4 54.2 2.82
## # … with 343 more rows, and 8 more variables: mean_visibility <dbl>,
## # mean_dew_point_temp <dbl>, mean_solar_radiation <dbl>, mean_rainfall <dbl>,
## # mean_snowfall <dbl>, seasons <fct>, holiday <fct>, weekend <fct>
As expected, Fig 5. shows that bike demand is slightly higher during the weekends.
# Bar plot for weekend and weekday bike demand
fig5 <- ggplot(bike_by_day, aes(x = as.factor(weekend), y = sum_bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 5. Weekend and Weekday Bike Count Bar Plot") + xlab("Weekend") + ylab("Bike Count")
fig5 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
Data is splitted to 90% train set and 10% test set instead of usual 8-20 due to being smaller dataset. Data is splitted by random selection using sample() method.
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.90 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(500)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model4_prediction <- bike_by_day[ -trainIndex,]
model4 <- bike_by_day[ trainIndex,]
# regression model by day with Weekend
bike_weekend_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday+ weekend, data=model4)
#summary
summary(bike_weekend_model)
##
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday + weekend,
## data = model4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12217.1 -2425.6 119.3 2364.1 11546.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29313.3932 9717.1960 3.017 0.002772 **
## mean_temp -441.9317 368.9779 -1.198 0.231963
## mean_humidity -202.6930 108.6237 -1.866 0.063006 .
## mean_wind_speed -1414.7777 466.2044 -3.035 0.002617 **
## mean_visibility -0.2196 0.6901 -0.318 0.750541
## mean_dew_point_temp 795.2753 390.2534 2.038 0.042434 *
## mean_solar_radiation 12763.3986 1369.6733 9.319 < 2e-16 ***
## mean_rainfall -3548.4095 632.1151 -5.614 4.48e-08 ***
## mean_snowfall -690.1610 666.3953 -1.036 0.301185
## seasonsSpring -5204.0355 768.5960 -6.771 6.64e-11 ***
## seasonsSummer -3615.2394 926.7999 -3.901 0.000118 ***
## seasonsAutumn -8336.2703 1054.0449 -7.909 4.89e-14 ***
## holidayHoliday 2690.5144 1091.4530 2.465 0.014253 *
## weekendTRUE 1292.1261 531.6491 2.430 0.015662 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4164 on 303 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.8244
## F-statistic: 115.1 on 13 and 303 DF, p-value: < 2.2e-16
# possible subsets regression (by day and weekend)
library(MASS)
stepReg=MASS::stepAIC(bike_weekend_model, direction="both")
## Start: AIC=5297.59
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday + weekend
##
## Df Sum of Sq RSS AIC
## - mean_visibility 1 1755800 5255588899 5295.7
## - mean_snowfall 1 18598184 5272431282 5296.7
## - mean_temp 1 24873862 5278706961 5297.1
## <none> 5253833098 5297.6
## - mean_humidity 1 60375619 5314208717 5299.2
## - mean_dew_point_temp 1 72007074 5325840172 5299.9
## - weekend 1 102421987 5356255086 5301.7
## - holiday 1 105364470 5359197569 5301.9
## - mean_wind_speed 1 159682489 5413515587 5305.1
## - mean_rainfall 1 546397538 5800230637 5327.0
## - mean_solar_radiation 1 1505678886 6759511984 5375.5
## - seasons 3 1651376972 6905210070 5378.2
##
## Step: AIC=5295.7
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
## mean_snowfall + seasons + holiday + weekend
##
## Df Sum of Sq RSS AIC
## - mean_snowfall 1 19052170 5274641068 5294.8
## - mean_temp 1 23519634 5279108533 5295.1
## <none> 5255588899 5295.7
## - mean_humidity 1 60164675 5315753574 5297.3
## + mean_visibility 1 1755800 5253833098 5297.6
## - mean_dew_point_temp 1 70253609 5325842508 5297.9
## - weekend 1 100675609 5356264508 5299.7
## - holiday 1 104629406 5360218305 5299.9
## - mean_wind_speed 1 178897927 5434486826 5304.3
## - mean_rainfall 1 578004202 5833593101 5326.8
## - mean_solar_radiation 1 1524264427 6779853325 5374.4
## - seasons 3 1752780932 7008369831 5380.9
##
## Step: AIC=5294.85
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
## seasons + holiday + weekend
##
## Df Sum of Sq RSS AIC
## - mean_temp 1 29087942 5303729010 5294.6
## <none> 5274641068 5294.8
## + mean_snowfall 1 19052170 5255588899 5295.7
## + mean_visibility 1 2209786 5272431282 5296.7
## - mean_humidity 1 76552577 5351193646 5297.4
## - mean_dew_point_temp 1 82548216 5357189284 5297.8
## - holiday 1 100763194 5375404263 5298.8
## - weekend 1 100875940 5375517008 5298.9
## - mean_wind_speed 1 177510423 5452151492 5303.3
## - mean_rainfall 1 561169565 5835810633 5324.9
## - mean_solar_radiation 1 1512359155 6787000224 5372.8
## - seasons 3 1771964675 7046605743 5380.7
##
## Step: AIC=5294.59
## sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp +
## mean_solar_radiation + mean_rainfall + seasons + holiday +
## weekend
##
## Df Sum of Sq RSS AIC
## <none> 5303729010 5294.6
## + mean_temp 1 29087942 5274641068 5294.8
## + mean_snowfall 1 24620477 5279108533 5295.1
## + mean_visibility 1 542453 5303186557 5296.6
## - holiday 1 101874317 5405603327 5298.6
## - weekend 1 102027671 5405756681 5298.6
## - mean_humidity 1 128646104 5432375115 5300.2
## - mean_wind_speed 1 165044922 5468773932 5302.3
## - mean_rainfall 1 687009406 5990738416 5331.2
## - mean_dew_point_temp 1 694143519 5997872529 5331.6
## - mean_solar_radiation 1 1485298995 6789028005 5370.9
## - seasons 3 1750784334 7054513345 5379.0
stepReg$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
## mean_visibility + mean_dew_point_temp + mean_solar_radiation +
## mean_rainfall + mean_snowfall + seasons + holiday + weekend
##
## Final Model:
## sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp +
## mean_solar_radiation + mean_rainfall + seasons + holiday +
## weekend
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 303 5253833098 5297.593
## 2 - mean_visibility 1 1755800 304 5255588899 5295.699
## 3 - mean_snowfall 1 19052170 305 5274641068 5294.846
## 4 - mean_temp 1 29087942 306 5303729010 5294.589
# multicollinearity
car::vif(bike_weekend_model)
## GVIF Df GVIF^(1/(2*Df))
## mean_temp 343.949542 1 18.545877
## mean_humidity 47.599843 1 6.899264
## mean_wind_speed 1.337044 1 1.156306
## mean_visibility 2.145153 1 1.464634
## mean_dew_point_temp 470.144400 1 21.682813
## mean_solar_radiation 3.486644 1 1.867256
## mean_rainfall 1.746116 1 1.321407
## mean_snowfall 1.207067 1 1.098666
## seasons 7.971927 3 1.413385
## holiday 1.043767 1 1.021649
## weekend 1.036276 1 1.017976
mean_temp and mean_dew_point_temp GVIF values are greater than 10 and therefore, there is multicollinearity. mean_temp will be taken out from the final model. The final model from stepwise AIC did not include mean_visibility and mean_snowfall. These will be taken out from our final Weekend Model.
# Final regression model by day with Weekend
bike_weekend_model_final = lm(sum_bike_count ~ mean_dew_point_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=model4)
#summary
summary(bike_weekend_model_final)
##
## Call:
## lm(formula = sum_bike_count ~ mean_dew_point_temp + mean_humidity +
## mean_wind_speed + mean_solar_radiation + mean_rainfall +
## seasons + holiday + weekend, data = model4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12506.6 -2441.1 113.6 2504.3 11762.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18603.28 2514.06 7.400 1.33e-12 ***
## mean_dew_point_temp 343.11 54.22 6.328 8.79e-10 ***
## mean_humidity -88.29 32.41 -2.724 0.006813 **
## mean_wind_speed -1386.87 449.43 -3.086 0.002215 **
## mean_solar_radiation 12428.26 1342.56 9.257 < 2e-16 ***
## mean_rainfall -3729.03 592.30 -6.296 1.06e-09 ***
## seasonsSpring -5131.80 718.68 -7.141 6.80e-12 ***
## seasonsSummer -3630.47 924.20 -3.928 0.000106 ***
## seasonsAutumn -8200.36 1022.85 -8.017 2.31e-14 ***
## holidayHoliday 2641.38 1089.50 2.424 0.015913 *
## weekendTRUE 1276.78 526.25 2.426 0.015835 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4163 on 306 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.8244
## F-statistic: 149.4 on 10 and 306 DF, p-value: < 2.2e-16
# multicollinearity
car::vif(bike_weekend_model_final)
## GVIF Df GVIF^(1/(2*Df))
## mean_dew_point_temp 9.077950 1 3.012964
## mean_humidity 4.238293 1 2.058711
## mean_wind_speed 1.243064 1 1.114928
## mean_solar_radiation 3.351304 1 1.830657
## mean_rainfall 1.533711 1 1.238431
## seasons 6.509081 3 1.366428
## holiday 1.040455 1 1.020027
## weekend 1.015724 1 1.007831
The final model with weekend can explain 82.36% of the data and is statistically accurate at 95% with p-value < 0.05. There is no mulicollinearity in the final model.
par(mfrow=c(2,2))
plot(bike_weekend_model_final , which=1:4)
There is no outliers in the model’s residuals according to the Cook’s distance plot.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_weekend_model_final)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 35.62784, Df = 1, p = 2.3885e-09
P-value is less than 0.05, implying that the constant error variance assumption is violated.
# Test for Autocorrelated Errors
durbinWatsonTest(bike_weekend_model_final)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1132151 2.224318 0.056
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
# Test for Normally Distributed Errors
shapiro.test(bike_weekend_model_final$residuals)
##
## Shapiro-Wilk normality test
##
## data: bike_weekend_model_final$residuals
## W = 0.99468, p-value = 0.3409
P-value of the normality test is greater than 0.05 and therefore, the residuals of the data is normally distributed. However, as the model failed the homoscedascity test, the model will be transformed using box-cox transformation method.
bike_by_dayBC <- model4
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- bike_by_dayBC[i] + (abs(min(bike_by_day[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- transform_boxcox(bike_by_dayBC[i])
}
}
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
##
## Call:
## lm(formula = bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
## bike_by_dayBC$mean_wind_speed + bike_by_dayBC$mean_dew_point_temp +
## bike_by_dayBC$mean_solar_radiation + bike_by_dayBC$mean_rainfall +
## bike_by_dayBC$seasons + bike_by_dayBC$holiday + bike_by_dayBC$weekend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1709 -0.5515 0.0273 0.5290 2.3787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.4886 1.4987 13.004 < 2e-16 ***
## bike_by_dayBC$mean_humidity -0.8255 0.2858 -2.889 0.004146 **
## bike_by_dayBC$mean_wind_speed -1.0416 0.2930 -3.555 0.000438 ***
## bike_by_dayBC$mean_dew_point_temp 1.0738 0.1442 7.449 9.67e-13 ***
## bike_by_dayBC$mean_solar_radiation 3.4083 0.4194 8.128 1.10e-14 ***
## bike_by_dayBC$mean_rainfall -1.9489 0.2287 -8.520 7.35e-16 ***
## bike_by_dayBC$seasonsSpring -1.0351 0.1472 -7.030 1.35e-11 ***
## bike_by_dayBC$seasonsSummer -0.4055 0.1672 -2.426 0.015849 *
## bike_by_dayBC$seasonsAutumn -2.4737 0.2051 -12.061 < 2e-16 ***
## bike_by_dayBC$holidayHoliday 0.8703 0.2269 3.835 0.000152 ***
## bike_by_dayBC$weekendTRUE 0.2655 0.1098 2.419 0.016153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8644 on 306 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.863
## F-statistic: 200 on 10 and 306 DF, p-value: < 2.2e-16
# multicollinearity
car::vif(bike_by_dayBC_fit)
## GVIF Df GVIF^(1/(2*Df))
## bike_by_dayBC$mean_humidity 3.607174 1 1.899256
## bike_by_dayBC$mean_wind_speed 1.333563 1 1.154800
## bike_by_dayBC$mean_dew_point_temp 5.928370 1 2.434824
## bike_by_dayBC$mean_solar_radiation 3.391377 1 1.841569
## bike_by_dayBC$mean_rainfall 1.693317 1 1.301275
## bike_by_dayBC$seasons 4.499259 3 1.284863
## bike_by_dayBC$holiday 1.046936 1 1.023199
## bike_by_dayBC$weekend 1.025189 1 1.012516
The final transformed model with weekend can explain 86.3% of the data and is statistically accurate at 95% with p-value < 0.05. There is no mulicollinearity in the final model.
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
Cook’s distance shows that there are some outliers (152, 99 and 39) and therfore, they will be removed from the dataset before fitting it into the model again.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 6.343005, Df = 1, p = 0.011785
P-value is less than 0.05, implying that the constant error variance assumption is violated if we do not remove the outliers. Therefore, these will be removed and then the model will be tested again.
bike_by_dayBC <- bike_by_dayBC[-c(152,99,39),]
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
##
## Call:
## lm(formula = bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
## bike_by_dayBC$mean_wind_speed + bike_by_dayBC$mean_dew_point_temp +
## bike_by_dayBC$mean_solar_radiation + bike_by_dayBC$mean_rainfall +
## bike_by_dayBC$seasons + bike_by_dayBC$holiday + bike_by_dayBC$weekend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.80853 -0.54398 0.02451 0.50972 1.71834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.6022 1.4305 13.703 < 2e-16 ***
## bike_by_dayBC$mean_humidity -1.0221 0.2780 -3.676 0.000280 ***
## bike_by_dayBC$mean_wind_speed -0.9514 0.2799 -3.399 0.000766 ***
## bike_by_dayBC$mean_dew_point_temp 1.2492 0.1493 8.366 2.21e-15 ***
## bike_by_dayBC$mean_solar_radiation 3.3637 0.4012 8.383 1.96e-15 ***
## bike_by_dayBC$mean_rainfall -1.7471 0.2222 -7.863 6.63e-14 ***
## bike_by_dayBC$seasonsSpring -1.1297 0.1421 -7.949 3.76e-14 ***
## bike_by_dayBC$seasonsSummer -0.5501 0.1618 -3.400 0.000765 ***
## bike_by_dayBC$seasonsAutumn -2.3815 0.2006 -11.871 < 2e-16 ***
## bike_by_dayBC$holidayHoliday 1.0753 0.2242 4.797 2.54e-06 ***
## bike_by_dayBC$weekendTRUE 0.3408 0.1055 3.229 0.001377 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8242 on 303 degrees of freedom
## Multiple R-squared: 0.8769, Adjusted R-squared: 0.8728
## F-statistic: 215.8 on 10 and 303 DF, p-value: < 2.2e-16
The final model with weekend can explain 87.69% of the data and is statistically accurate at 95% with p-value < 0.05. There is no mulicollinearity in the final model. This is by far a model with the most prediction accuracy.
# multicollinearity
car::vif(bike_by_dayBC_fit)
## GVIF Df GVIF^(1/(2*Df))
## bike_by_dayBC$mean_humidity 3.644442 1 1.909042
## bike_by_dayBC$mean_wind_speed 1.308730 1 1.143998
## bike_by_dayBC$mean_dew_point_temp 6.386343 1 2.527121
## bike_by_dayBC$mean_solar_radiation 3.373157 1 1.836616
## bike_by_dayBC$mean_rainfall 1.676241 1 1.294697
## bike_by_dayBC$seasons 4.820642 3 1.299723
## bike_by_dayBC$holiday 1.056871 1 1.028042
## bike_by_dayBC$weekend 1.031407 1 1.015582
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.58038, Df = 1, p = 0.1082
After removing outliers, the P-value of non-constant error variance test shows that it is greater than 0.05, implying that the constant error variance assumption is not violated.
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_dayBC_fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04184489 2.07716 0.522
## Alternative hypothesis: rho != 0
P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.
# Test for Normally Distributed Errors
shapiro.test(bike_by_dayBC_fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: bike_by_dayBC_fit$residuals
## W = 0.9897, p-value = 0.02599
P-value of the normality test is less than 0.05 and therefore residuals are not normally distributed.
To select final model we will use criteria like final models accuracy with r2, model’s residual analysis, adaquecy test’s results of each model.
Due to non constant variance we will reject this model
Due to non constant variance we will reject this model because it fails the models confidence in prediction.
Due to non constant variance and not normally distribution we will reject this model because it fails the models confidence in prediction.
Model 4’s P-value of the normality test is less than 0.05 and therefore, the residuals of the data is not normally distributed. However, as the dataset is not too small, the p-value is not too much smaller than 0.05, and the model pass the rest of the tests, we conclude that Model 4 can adequately predict the bike demand and is by far, the best model.
Due to this reasons we will select model 4 as out best model and will predicte the data using model 4.
We have selected model 4 as our best model and to predict values using this model we will be using the model 4’s prediction dataset which is model4_prediction.
# we will transform this series to positive by adding absulute smallest value to whole dataset by their columns to pe
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- model4_prediction[i] + (abs(min(bike_by_day[i])))+1
}
}
# Box-cox transformatino using lambda
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- transform_boxcox(model4_prediction[i])
}
}
head(model4_prediction)
## # A tibble: 6 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-01 15.0 4.63 6.56 0.948
## 2 2020-12-01 17.8 3.07 5.63 1.25
## 3 2020-04-02 20.4 4.46 6.07 1.41
## 4 2020-08-02 20.3 5.07 5.81 1.25
## 5 2020-09-04 21.5 4.68 6.19 1.47
## 6 2020-12-04 17.5 3.28 5.79 1.82
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
lm(sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=bike_by_dayBC)
From the boxcox transformation we get lambda value of 0.128451. We will use this to transforme the data from the model4_prediction
# creating function to predict the bike counts
pred_bike_count <- function(mean_humidity, mean_wind_speed, mean_dew_point_temp, mean_solar_radiation, mean_rainfall, seasons, holiday, weekend){
seasonsSummer = 0
seasonsAutumn = 0
seasonsSpring = 0
holidayHoliday = 0
weekendTRUE = 0
# Assinging value as 1 if they exist in data
if(seasons == "Summer"){
seasonsSummer = 1
} else if(seasons == "Autumn"){
seasonsAutumn = 1
} else{
seasonsSpring = 1
}
if(holiday == "Holiday"){
holidayHoliday = 1
}
if(weekendTRUE == "TRUE"){
weekendTRUE = 1
}
# fitted eqN of model 4
y = 19.4886 - 0.8255 * mean_humidity - 1.0416 * mean_wind_speed + 1.0738 * mean_dew_point_temp + 3.4083 * mean_solar_radiation - 1.9489 * mean_rainfall - 1.0351 * seasonsSpring - 0.4055 * seasonsSummer -2.4737 * seasonsAutumn + 0.8703 * holidayHoliday + 0.2655 * weekendTRUE
return(as.numeric(y))
}
model4_prediction[1,] # first row of model 4 prediction
## # A tibble: 1 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-01 15.0 4.63 6.56 0.948
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
prediction1 <- pred_bike_count(
model4_prediction[1,"mean_humidity"],
model4_prediction[1,"mean_wind_speed"],
model4_prediction[1,"mean_dew_point_temp"],
model4_prediction[1,"mean_solar_radiation"],
model4_prediction[1,"mean_rainfall"],
model4_prediction[1,"seasons"],
model4_prediction[1,"holiday"],
model4_prediction[1,"weekend"]
)
prediction1
## [1] 16.20181
model4_prediction[2,] # second row of model 4 prediction
## # A tibble: 1 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-12-01 17.8 3.07 5.63 1.25
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
prediction2 <- pred_bike_count(
model4_prediction[2,"mean_humidity"],
model4_prediction[2,"mean_wind_speed"],
model4_prediction[2,"mean_dew_point_temp"],
model4_prediction[2,"mean_solar_radiation"],
model4_prediction[2,"mean_rainfall"],
model4_prediction[2,"seasons"],
model4_prediction[2,"holiday"],
model4_prediction[2,"weekend"]
)
prediction2
## [1] 16.27909
model4_prediction[3,] # third row of model 4 prediction
## # A tibble: 1 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-04-02 20.4 4.46 6.07 1.41
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
prediction3 <- pred_bike_count(
model4_prediction[3,"mean_humidity"],
model4_prediction[3,"mean_wind_speed"],
model4_prediction[3,"mean_dew_point_temp"],
model4_prediction[3,"mean_solar_radiation"],
model4_prediction[3,"mean_rainfall"],
model4_prediction[3,"seasons"],
model4_prediction[3,"holiday"],
model4_prediction[3,"weekend"]
)
prediction3
## [1] 19.60029
model4_prediction[4,] # four row of model 4 prediction
## # A tibble: 1 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-08-02 20.3 5.07 5.81 1.25
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
prediction4 <- pred_bike_count(
model4_prediction[4,"mean_humidity"],
model4_prediction[4,"mean_wind_speed"],
model4_prediction[4,"mean_dew_point_temp"],
model4_prediction[4,"mean_solar_radiation"],
model4_prediction[4,"mean_rainfall"],
model4_prediction[4,"seasons"],
model4_prediction[4,"holiday"],
model4_prediction[4,"weekend"]
)
prediction4
## [1] 21.44509
model4_prediction[5,] # five row of model 4 prediction
## # A tibble: 1 x 13
## date sum_bike_count mean_temp mean_humidity mean_wind_speed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-09-04 21.5 4.68 6.19 1.47
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## # mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## # seasons <fct>, holiday <fct>, weekend <fct>
prediction5 <- pred_bike_count(
model4_prediction[5,"mean_humidity"],
model4_prediction[5,"mean_wind_speed"],
model4_prediction[5,"mean_dew_point_temp"],
model4_prediction[5,"mean_solar_radiation"],
model4_prediction[5,"mean_rainfall"],
model4_prediction[5,"seasons"],
model4_prediction[5,"holiday"],
model4_prediction[5,"weekend"]
)
prediction5
## [1] 20.33717
actual_bikecount <- c(model4_prediction[1,]$sum_bike_count, model4_prediction[2,]$sum_bike_count, model4_prediction[3,]$sum_bike_count, model4_prediction[4,]$sum_bike_count, model4_prediction[5,]$sum_bike_count)
predicted_bikecount <- c(prediction1, prediction2, prediction3, prediction4, prediction5)
df_prediction <- data.frame(actual_bikecount, predicted_bikecount)
df_prediction # final prediction table
## actual_bikecount predicted_bikecount
## 1 14.95501 16.20181
## 2 17.79364 16.27909
## 3 20.42867 19.60029
## 4 20.28287 21.44509
## 5 21.54331 20.33717
Out of all the models, Model 4 is shown to be the most reliable and accurate at predicting bike demand by passing all the requirements except normality of residuals. As expected, ‘weekend’ plays a factor in the prediction of bike demand and Model 4: predicting bike rental demand for everyday with added ‘weekend’ feature gives the best prediction of bike demand, giving the accuracy of 87.69%. For normality assumption of regression we will consider that since the dataset is not too small, the p-value is not too much smaller than 0.05 and from QQ-plot we can see that most of the residuals are on the dotted lines. the model 4 passes the rest of the tests, we conclude that Model 4 can adequately predict the bike demand. It may also be valuable to add on other transformation methods such as min-max transformation after box-cox transformation on the model to see whether models that failed adequacy tests will give adequate models.
Even though regression models could not predict hourly bike demand as accurately as other machine learning models and deep learning models, the accuracy for predicting daily bike demand is considerably strong. However, this model can only predict daily bike demand instead of hourly, which may be more valuable for the bike hire companies. This means other machine learning techniques may be more suitable for the purposes of the bike hire companies. However, predicting daily bike demand is still valuable as this can help in activities such as scheduling bike servicing scheduling, seasonal bike stockings and profit optimisation. We did not test for adding the weekend factor for the hourly bike demand prediction model. This may increase the accuracy and adequacy of the model therefore, should be further tested. Recent studies have also found that other factors such as infrastructure of the bike paths and accessibility of the stations (El-Assi, Mahmoud, & Habib, 2017). If these factors are also accounted for in the model, we may be able to improve the prediction accuracy.
Chen, T. (2003). A fuzzy back propagation network for output time prediction in a wafer fab. Applied Soft Computing, 2(3), 211–222. doi:10.1016/S1568-4946(02)00066-2agation network for output time prediction in a wafer fab. Applied Soft Computing, 2(3), 211–222. doi:10.1016/S1568-4946(02)00066-2
El-Assi, W., Mahmoud, M.S., & Habib, K.N. (2017). Effects of built environment and weather on bike sharing demand: A station level analysis of commercial bike sharing in Toronto. Transportation, 44(3), 589–613. doi:10.1007/s11116-015-9669-z
Kim, K. (2018). Investigation on the effects of weather and calendar events on bike-sharing according to the trip patterns of bike rentals of stations. Journal of Transport Geography, 66, 309–320. doi:10.1016/j.jtrangeo.2018.01.001
Sathishkumar, V. E., Jangwoo, P., and Yongyun, C., 2020. ‘Using data mining techniques for bike sharing demand prediction in metropolitan city.’ Computer Communications, Vol.153, pp.353-366.
# Importing libraries
library(ggplot2)
library(magrittr)
library(dplyr)
library(tidyr)
library(infotheo)
library(forecast)
library(MASS)
library(psych)
library(car)
library(TSA)
library(forecast)
library(chron)
library(caTools)
library(caret)
# import dataset
setwd("/Users/April/Desktop/DS/Regression/Regression_Project")
bike <- read.csv("SeoulBikeData.csv",check.names = F)
#set new column names
bike <- setNames(bike, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))
#Setting categorical values
bike$hour <- factor(bike$hour)
bike$seasons <- factor(bike$seasons)
bike$holiday <- factor(bike$holiday)
bike$functioning_day <- factor(bike$functioning_day)
head(bike)
summary(bike$functioning_day)
fig1 <- ggplot(bike, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 1. Bike Demand by functining day") + xlab("Functioning Day") + ylab("Bike Count")
fig1 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
# Deleting rows when it is non-functioning day
bike<-bike[!(bike$functioning_day=="No"),]
# removing unused columns
bike <- subset(bike, select = - c(functioning_day))
summary(bike)
fig2<- ggplot(bike, aes(x = as.factor(hour) , y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 2. Hourly Bike Count Bar Plot") + xlab("Hour") + ylab("Bike Count")
fig2 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig3 <- ggplot(bike, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 3. Seasonal Bike Count Bar Plot") + xlab("Seasons") + ylab("Bike Count")
fig3 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig4 <- ggplot(bike, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 4. Holiday Bike Count Bar Plot") + xlab("Holiday") + ylab("Bike Count")
fig4 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
# correlation matrix of dataset
cor_df <- data.frame(bike$bike_count, as.integer(bike$hour), bike$temp, bike$humidity, bike$wind_speed, bike$visibility, bike$dew_point, bike$solar_radiation, bike$rainfall, bike$snowfall)
cor_df <- data.frame(cor(cor_df))
cor_df[order(-cor_df[,1]), ]
#Creating histograms
multi.hist(bike[,sapply(bike, is.numeric)])
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike)), size = smp_size)
model1_prediction <- bike[ -trainIndex,]
model1 <- bike[ trainIndex,]
# regression model with all columns original data
row.names(model1) <- NULL
original_full = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model1)
#summary
summary(original_full)
# possible subsets regression original data
library(MASS)
stepReg=MASS::stepAIC(original_full, direction="both")
stepReg$anova
# multicollinearity test
car::vif(original_full)
# Removing columns with multicollinearity and stepwise
original_final = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1)
#summary
summary(original_final)
par(mfrow=c(2,2))
plot(original_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(original_final)
# Test for Autocorrelated Errors
durbinWatsonTest(original_final)
# removing One outlier
model1_BC <- model1[-c(6040), ] # removing Outlier
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- model1_BC[i] + (abs(min(bike[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(bike_count ~ ( temp + humidity), data = model1_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- transform_boxcox(model1_BC[i])
}
}
# fitting new model
model1_BC_fit = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1_BC
)
summary(model1_BC_fit)
# Checking multicollinearity in the final Model
car::vif(model1_BC_fit)
par(mfrow=c(2,2))
plot(model1_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(model1_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(model1_BC_fit)
#shapiro.test(model1_BC_fit$residuals)
# Error in shapiro.test(model1_BC_fit$residuals) : sample size must be between 3 and 5000
# creating dataframe for bike count by date
bike_by_day <- bike
levels(bike_by_day$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_by_day$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_by_day <- bike_by_day %>% group_by(date) %>% summarise(sum_bike_count = sum(bike_count), mean_temp = mean(temp), mean_humidity = mean(humidity), mean_wind_speed = mean(wind_speed), mean_visibility = mean(visibility), mean_dew_point_temp = mean(dew_point_temp), mean_solar_radiation = mean(solar_radiation), mean_rainfall = mean(rainfall), mean_snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
#Changing the factors back to original values
bike_by_day$holiday <- as.factor(bike_by_day$holiday)
bike_by_day$seasons <- as.factor(bike_by_day$seasons)
levels(bike_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(bike_by_day)
#Creating histograms
multi.hist(bike_by_day[,sapply(bike_by_day, is.numeric)])
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.80 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model2_prediction <- bike_by_day[ -trainIndex,]
model2 <- bike_by_day[ trainIndex,]
# regression model by day
bike_by_day_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model)
# possible subsets regression
stepReg=MASS::stepAIC(bike_by_day_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(bike_by_day_model)
# Removing columns with multicollinearity and stepwise
bike_by_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model_final)
# multicollinearity
car::vif(bike_by_day_model_final)
par(mfrow=c(2,2))
plot(bike_by_day_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_model_final)
# Test for Normally Distributed Errors
shapiro.test(bike_by_day_model_final$residuals)
# Making all the values positive
bike_by_day_BC <- bike_by_day
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- round(bike_by_day_BC[i] + (abs(min(bike_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
mean_humidity +
mean_solar_radiation
,data = bike_by_day_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- transform_boxcox(bike_by_day_BC[i])
}
}
# fitting new model
bike_by_day_BC_fit = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
mean_solar_radiation + mean_rainfall + seasons + holiday,
data = bike_by_day_BC)
summary(bike_by_day_BC_fit)
# Checking multicollinearity in the final Model
car::vif(bike_by_day_BC_fit)
par(mfrow=c(2,2))
plot(bike_by_day_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_BC_fit)
shapiro.test(bike_by_day_BC_fit$residuals)
# Dividing to am and pm
bike_AMPM <- bike
bike_AMPM$hour <- as.numeric(bike_AMPM$hour)
bike_AMPM['hour'] [bike_AMPM['hour'] <=12] <- -1
bike_AMPM['hour'] [bike_AMPM['hour'] >12] <- 1
bike_AMPM$hour <- factor(bike_AMPM$hour)
head(bike_AMPM)
# creating dataframe for bike count by am pm
levels(bike_AMPM$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_AMPM$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_AMPM <- bike_AMPM %>% group_by(date,hour) %>% summarise(bike_count = sum(bike_count), temp = mean(temp), humidity = mean(humidity), wind_speed = mean(wind_speed), visibility = mean(visibility), dew_point_temp = mean(dew_point_temp), solar_radiation = mean(solar_radiation), rainfall = mean(rainfall), snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
bike_AMPM
#Changing the factors back to original values
bike_AMPM$holiday <- as.factor(bike_AMPM$holiday)
bike_AMPM$seasons <- as.factor(bike_AMPM$seasons)
levels(bike_AMPM$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_AMPM$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
#Drop date column
bike_AMPM <- subset(bike_AMPM, select = -c(date))
head(bike_AMPM)
#Creating histograms
multi.hist(bike_AMPM[,sapply(bike_AMPM, is.numeric)])
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_AMPM))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_AMPM)), size = smp_size)
model3_prediction <- bike_AMPM[ -trainIndex,]
model3 <- bike_AMPM[ trainIndex,]
# regression model with all columns except functioning day
am_pm_model = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model3)
#summary
summary(am_pm_model)
#Building ANOVA table with full dataset (all columns) against intercept only model
anova(am_pm_model)
# possible subsets regression
library(MASS)
stepReg=MASS::stepAIC(am_pm_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(am_pm_model)
# Removing columns with multicollinearity and stepwise
am_pm_model_final = lm(bike_count ~ hour + humidity + wind_speed + dew_point_temp +
solar_radiation + rainfall + seasons + holiday, data=model3)
#summary
summary(am_pm_model_final)
# Checking multicollinearity in the final Model
car::vif(am_pm_model_final)
par(mfrow=c(2,2))
plot(am_pm_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
library(car)
ncvTest(am_pm_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(am_pm_model_final)
shapiro.test(am_pm_model_final$residuals)
bike_AMPM_BC <- model3[-c(185), ]
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- round(bike_AMPM_BC[i] + (abs(min(bike_AMPM[i])))+1 , digits=3) # adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_AMPM_BC$bike_count ~
bike_AMPM_BC$humidity +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = data^lambda_val
return(data)
}
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- transform_boxcox(bike_AMPM_BC[i])
}
}
# fitting new model
bike_AMPM_BC_fit = lm(bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour +
bike_AMPM_BC$humidity +
bike_AMPM_BC$wind_speed +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall +
bike_AMPM_BC$seasons +
bike_AMPM_BC$holiday)
summary(bike_AMPM_BC_fit)
# Checking multicollinearity in the final Model
car::vif(bike_AMPM_BC_fit)
par(mfrow=c(2,2))
plot(bike_AMPM_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_AMPM_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_AMPM_BC_fit)
shapiro.test(bike_AMPM_BC_fit$residuals)
#Creating weekend colum
bike_by_day$date <- as.Date(bike_by_day$date, "%d/%m/%y")
bike_by_day$weekend = chron::is.weekend(bike_by_day$date)
bike_by_day$weekend <- as.factor(bike_by_day$weekend)
bike_by_day
# Bar plot for weekend and weekday bike demand
fig5 <- ggplot(bike_by_day, aes(x = as.factor(weekend), y = sum_bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 5. Weekend and Weekday Bike Count Bar Plot") + xlab("Weekend") + ylab("Bike Count")
fig5 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.90 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(500)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model4_prediction <- bike_by_day[ -trainIndex,]
model4 <- bike_by_day[ trainIndex,]
# regression model by day with Weekend
bike_weekend_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday+ weekend, data=model4)
#summary
summary(bike_weekend_model)
# possible subsets regression (by day and weekend)
library(MASS)
stepReg=MASS::stepAIC(bike_weekend_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(bike_weekend_model)
# Final regression model by day with Weekend
bike_weekend_model_final = lm(sum_bike_count ~ mean_dew_point_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=model4)
#summary
summary(bike_weekend_model_final)
# multicollinearity
car::vif(bike_weekend_model_final)
par(mfrow=c(2,2))
plot(bike_weekend_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_weekend_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_weekend_model_final)
# Test for Normally Distributed Errors
shapiro.test(bike_weekend_model_final$residuals)
bike_by_dayBC <- model4
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- bike_by_dayBC[i] + (abs(min(bike_by_day[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- transform_boxcox(bike_by_dayBC[i])
}
}
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
# multicollinearity
car::vif(bike_by_dayBC_fit)
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
bike_by_dayBC <- bike_by_dayBC[-c(152,99,39),]
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
# multicollinearity
car::vif(bike_by_dayBC_fit)
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_dayBC_fit)
# Test for Normally Distributed Errors
shapiro.test(bike_by_dayBC_fit$residuals)
# we will transform this series to positive by adding absulute smallest value to whole dataset by their columns to pe
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- model4_prediction[i] + (abs(min(bike_by_day[i])))+1
}
}
# Box-cox transformatino using lambda
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- transform_boxcox(model4_prediction[i])
}
}
head(model4_prediction)
# creating function to predict the bike counts
pred_bike_count <- function(mean_humidity, mean_wind_speed, mean_dew_point_temp, mean_solar_radiation, mean_rainfall, seasons, holiday, weekend){
seasonsSummer = 0
seasonsAutumn = 0
seasonsSpring = 0
holidayHoliday = 0
weekendTRUE = 0
# Assinging value as 1 if they exist in data
if(seasons == "Summer"){
seasonsSummer = 1
} else if(seasons == "Autumn"){
seasonsAutumn = 1
} else{
seasonsSpring = 1
}
if(holiday == "Holiday"){
holidayHoliday = 1
}
if(weekendTRUE == "TRUE"){
weekendTRUE = 1
}
# fitted eqN of model 4
y = 19.4886 - 0.8255 * mean_humidity - 1.0416 * mean_wind_speed + 1.0738 * mean_dew_point_temp + 3.4083 * mean_solar_radiation - 1.9489 * mean_rainfall - 1.0351 * seasonsSpring - 0.4055 * seasonsSummer -2.4737 * seasonsAutumn + 0.8703 * holidayHoliday + 0.2655 * weekendTRUE
return(as.numeric(y))
}
model4_prediction[1,] # first row of model 4 prediction
prediction1 <- pred_bike_count(
model4_prediction[1,"mean_humidity"],
model4_prediction[1,"mean_wind_speed"],
model4_prediction[1,"mean_dew_point_temp"],
model4_prediction[1,"mean_solar_radiation"],
model4_prediction[1,"mean_rainfall"],
model4_prediction[1,"seasons"],
model4_prediction[1,"holiday"],
model4_prediction[1,"weekend"]
)
prediction1
model4_prediction[2,] # second row of model 4 prediction
prediction2 <- pred_bike_count(
model4_prediction[2,"mean_humidity"],
model4_prediction[2,"mean_wind_speed"],
model4_prediction[2,"mean_dew_point_temp"],
model4_prediction[2,"mean_solar_radiation"],
model4_prediction[2,"mean_rainfall"],
model4_prediction[2,"seasons"],
model4_prediction[2,"holiday"],
model4_prediction[2,"weekend"]
)
prediction2
model4_prediction[3,] # third row of model 4 prediction
prediction3 <- pred_bike_count(
model4_prediction[3,"mean_humidity"],
model4_prediction[3,"mean_wind_speed"],
model4_prediction[3,"mean_dew_point_temp"],
model4_prediction[3,"mean_solar_radiation"],
model4_prediction[3,"mean_rainfall"],
model4_prediction[3,"seasons"],
model4_prediction[3,"holiday"],
model4_prediction[3,"weekend"]
)
prediction3
model4_prediction[4,] # four row of model 4 prediction
prediction4 <- pred_bike_count(
model4_prediction[4,"mean_humidity"],
model4_prediction[4,"mean_wind_speed"],
model4_prediction[4,"mean_dew_point_temp"],
model4_prediction[4,"mean_solar_radiation"],
model4_prediction[4,"mean_rainfall"],
model4_prediction[4,"seasons"],
model4_prediction[4,"holiday"],
model4_prediction[4,"weekend"]
)
prediction4
model4_prediction[5,] # five row of model 4 prediction
prediction5 <- pred_bike_count(
model4_prediction[5,"mean_humidity"],
model4_prediction[5,"mean_wind_speed"],
model4_prediction[5,"mean_dew_point_temp"],
model4_prediction[5,"mean_solar_radiation"],
model4_prediction[5,"mean_rainfall"],
model4_prediction[5,"seasons"],
model4_prediction[5,"holiday"],
model4_prediction[5,"weekend"]
)
prediction5
actual_bikecount <- c(model4_prediction[1,]$sum_bike_count, model4_prediction[2,]$sum_bike_count, model4_prediction[3,]$sum_bike_count, model4_prediction[4,]$sum_bike_count, model4_prediction[5,]$sum_bike_count)
predicted_bikecount <- c(prediction1, prediction2, prediction3, prediction4, prediction5)
df_prediction <- data.frame(actual_bikecount, predicted_bikecount)
df_prediction # final prediction table
# Importing libraries
library(ggplot2)
library(magrittr)
library(dplyr)
library(tidyr)
library(infotheo)
library(forecast)
library(MASS)
library(psych)
library(car)
library(TSA)
library(forecast)
library(chron)
library(caTools)
library(caret)
# import dataset
bike <- read.csv("~/OneDrive - RMIT University/sem4/regression/project/SeoulBikeData.csv",check.names = F)
#set new column names
bike <- setNames(bike, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))
#Setting categorical values
bike$hour <- factor(bike$hour)
bike$seasons <- factor(bike$seasons)
bike$holiday <- factor(bike$holiday)
bike$functioning_day <- factor(bike$functioning_day)
head(bike)
# Data Analysis
summary(bike$functioning_day)
fig1 <- ggplot(bike, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 1. Bike Demand by functining day") + xlab("Functioning Day") + ylab("Bike Count")
fig1 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
# Deleting rows when it is non-functioning day
bike<-bike[!(bike$functioning_day=="No"),]
# removing unused columns
bike <- subset(bike, select = - c(functioning_day))
summary(bike)
fig2<- ggplot(bike, aes(x = as.factor(hour) , y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 2. Hourly Bike Count Bar Plot") + xlab("Hour") + ylab("Bike Count")
fig2 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig3 <- ggplot(bike, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 3. Seasonal Bike Count Bar Plot") + xlab("Seasons") + ylab("Bike Count")
fig3 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig4 <- ggplot(bike, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 4. Holiday Bike Count Bar Plot") + xlab("Holiday") + ylab("Bike Count")
fig4 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
# correlation matrix of dataset
cor_df <- data.frame(bike$bike_count, as.integer(bike$hour), bike$temp, bike$humidity, bike$wind_speed, bike$visibility, bike$dew_point, bike$solar_radiation, bike$rainfall, bike$snowfall)
cor_df <- data.frame(cor(cor_df))
cor_df[order(-cor_df[,1]), ]
#Creating histograms
multi.hist(bike[,sapply(bike, is.numeric)])
# Modelling
## 1) Model 1
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike)), size = smp_size)
model1_prediction <- bike[ -trainIndex,]
model1 <- bike[ trainIndex,]
# regression model with all columns original data
row.names(model1) <- NULL
original_full = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model1)
#summary
summary(original_full)
# possible subsets regression original data
library(MASS)
stepReg=MASS::stepAIC(original_full, direction="both")
stepReg$anova
# multicollinearity test
car::vif(original_full)
# Removing columns with multicollinearity and stepwise
original_final = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1)
#summary
summary(original_final)
par(mfrow=c(2,2))
plot(original_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(original_final)
# Test for Autocorrelated Errors
durbinWatsonTest(original_final)
# removing One outlier
model1_BC <- model1[-c(6040), ]
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- model1_BC[i] + (abs(min(bike[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(bike_count ~ ( temp + humidity), data = model1_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(model1_BC)){
if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
model1_BC[i] <- transform_boxcox(model1_BC[i])
}
}
# fitting new model
model1_BC_fit = lm(bike_count ~ hour + temp + humidity + solar_radiation +
rainfall + snowfall + seasons + holiday, data=model1_BC
)
summary(model1_BC_fit)
# Checking multicollinearity in the final Model
car::vif(model1_BC_fit)
par(mfrow=c(2,2))
plot(model1_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(model1_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(model1_BC_fit)
#shapiro.test(model1_BC_fit$residuals)
# Error in shapiro.test(model1_BC_fit$residuals) : sample size must be between 3 and 5000
## 2) Model 2
# creating dataframe for bike count by date
bike_by_day <- bike
levels(bike_by_day$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_by_day$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_by_day <- bike_by_day %>% group_by(date) %>% summarise(sum_bike_count = sum(bike_count), mean_temp = mean(temp), mean_humidity = mean(humidity), mean_wind_speed = mean(wind_speed), mean_visibility = mean(visibility), mean_dew_point_temp = mean(dew_point_temp), mean_solar_radiation = mean(solar_radiation), mean_rainfall = mean(rainfall), mean_snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
#Changing the factors back to original values
bike_by_day$holiday <- as.factor(bike_by_day$holiday)
bike_by_day$seasons <- as.factor(bike_by_day$seasons)
levels(bike_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(bike_by_day)
#Creating histograms
multi.hist(bike_by_day[,sapply(bike_by_day, is.numeric)])
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.80 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model2_prediction <- bike_by_day[ -trainIndex,]
model2 <- bike_by_day[ trainIndex,]
# regression model by day
bike_by_day_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model)
# possible subsets regression
stepReg=MASS::stepAIC(bike_by_day_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(bike_by_day_model)
# Removing columns with multicollinearity and stepwise
bike_by_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=model2)
#summary
summary(bike_by_day_model_final)
# multicollinearity
car::vif(bike_by_day_model_final)
par(mfrow=c(2,2))
plot(bike_by_day_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_model_final)
# Test for Normally Distributed Errors
shapiro.test(bike_by_day_model_final$residuals)
# Making all the values positive
bike_by_day_BC <- bike_by_day
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- round(bike_by_day_BC[i] + (abs(min(bike_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
mean_humidity +
mean_solar_radiation
,data = bike_by_day_BC)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = (data^lambda_val)-1/lambda_val
return(data)
}
for (i in colnames(bike_by_day_BC)){
if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
bike_by_day_BC[i] <- transform_boxcox(bike_by_day_BC[i])
}
}
# fitting new model
bike_by_day_BC_fit = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed +
mean_solar_radiation + mean_rainfall + seasons + holiday,
data = bike_by_day_BC)
summary(bike_by_day_BC_fit)
# Checking multicollinearity in the final Model
car::vif(bike_by_day_BC_fit)
par(mfrow=c(2,2))
plot(bike_by_day_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_BC_fit)
shapiro.test(bike_by_day_BC_fit$residuals)
## 3) Model 3
# Dividing to am and pm
bike_AMPM <- bike
bike_AMPM$hour <- as.numeric(bike_AMPM$hour)
bike_AMPM['hour'] [bike_AMPM['hour'] <=12] <- -1
bike_AMPM['hour'] [bike_AMPM['hour'] >12] <- 1
bike_AMPM$hour <- factor(bike_AMPM$hour)
head(bike_AMPM)
# creating dataframe for bike count by am pm
levels(bike_AMPM$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_AMPM$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)
bike_AMPM <- bike_AMPM %>% group_by(date,hour) %>% summarise(bike_count = sum(bike_count), temp = mean(temp), humidity = mean(humidity), wind_speed = mean(wind_speed), visibility = mean(visibility), dew_point_temp = mean(dew_point_temp), solar_radiation = mean(solar_radiation), rainfall = mean(rainfall), snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
bike_AMPM
#Changing the factors back to original values
bike_AMPM$holiday <- as.factor(bike_AMPM$holiday)
bike_AMPM$seasons <- as.factor(bike_AMPM$seasons)
levels(bike_AMPM$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_AMPM$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
#Drop date column
bike_AMPM <- subset(bike_AMPM, select = -c(date))
head(bike_AMPM)
#Randomise our dataset
## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_AMPM))
## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_AMPM)), size = smp_size)
model3_prediction <- bike_AMPM[ -trainIndex,]
model3 <- bike_AMPM[ trainIndex,]
# regression model with all columns except functioning day
am_pm_model = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model3)
#summary
summary(am_pm_model)
#Building ANOVA table with full dataset (all columns) against intercept only model
anova(am_pm_model)
# possible subsets regression
library(MASS)
stepReg=MASS::stepAIC(am_pm_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(am_pm_model)
# Removing columns with multicollinearity and stepwise
am_pm_model_final = lm(bike_count ~ hour + humidity + wind_speed + dew_point_temp +
solar_radiation + rainfall + seasons + holiday, data=model3)
#summary
summary(am_pm_model_final)
# Checking multicollinearity in the final Model
car::vif(am_pm_model_final)
par(mfrow=c(2,2))
plot(am_pm_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
library(car)
ncvTest(am_pm_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(am_pm_model_final)
shapiro.test(am_pm_model_final$residuals)
bike_AMPM_BC <- model3[-c(185), ]
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- round(bike_AMPM_BC[i] + (abs(min(bike_AMPM[i])))+1 , digits=3) # adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_AMPM_BC$bike_count ~
bike_AMPM_BC$humidity +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = data^lambda_val
return(data)
}
for (i in colnames(bike_AMPM_BC)){
if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
bike_AMPM_BC[i] <- transform_boxcox(bike_AMPM_BC[i])
}
}
# fitting new model
bike_AMPM_BC_fit = lm(bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour +
bike_AMPM_BC$humidity +
bike_AMPM_BC$wind_speed +
bike_AMPM_BC$dew_point_temp +
bike_AMPM_BC$solar_radiation +
bike_AMPM_BC$rainfall +
bike_AMPM_BC$seasons +
bike_AMPM_BC$holiday)
summary(bike_AMPM_BC_fit)
# Checking multicollinearity in the final Model
car::vif(bike_AMPM_BC_fit)
par(mfrow=c(2,2))
plot(bike_AMPM_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_AMPM_BC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_AMPM_BC_fit)
shapiro.test(bike_AMPM_BC_fit$residuals)
## 4) Model 4
#Creating weekend colum
bike_by_day$date <- as.Date(bike_by_day$date, "%d/%m/%y")
bike_by_day$weekend = chron::is.weekend(bike_by_day$date)
bike_by_day$weekend <- as.factor(bike_by_day$weekend)
bike_by_day
# Bar plot for weekend and weekday bike demand
fig5 <- ggplot(bike_by_day, aes(x = as.factor(weekend), y = sum_bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 5. Weekend and Weekday Bike Count Bar Plot") + xlab("Weekend") + ylab("Bike Count")
fig5 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
#Randomise our dataset
## 75% of the sample size
smp_size <- floor(0.90 * nrow(bike_by_day))
## set the seed to make your partition reproducible
set.seed(500)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)
model4_prediction <- bike_by_day[ -trainIndex,]
model4 <- bike_by_day[ trainIndex,]
# regression model by day with Weekend
bike_weekend_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday+ weekend, data=model4)
#summary
summary(bike_weekend_model)
# possible subsets regression (by day and weekend)
library(MASS)
stepReg=MASS::stepAIC(bike_weekend_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(bike_weekend_model)
# Final regression model by day with Weekend
bike_weekend_model_final = lm(sum_bike_count ~ mean_dew_point_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=model4)
#summary
summary(bike_weekend_model_final)
# multicollinearity
car::vif(bike_weekend_model_final)
par(mfrow=c(2,2))
plot(bike_weekend_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_weekend_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_weekend_model_final)
# Test for Normally Distributed Errors
shapiro.test(bike_weekend_model_final$residuals)
bike_by_dayBC <- model4
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- bike_by_dayBC[i] + (abs(min(bike_by_day[i])))+1 #adding minimum value to make dataset positive
}
}
bc1 <-boxTidwell(
bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall
)
lambda_val=bc1$result[1]
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(bike_by_dayBC)){
if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
bike_by_dayBC[i] <- transform_boxcox(bike_by_dayBC[i])
}
}
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
# multicollinearity
car::vif(bike_by_dayBC_fit)
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
bike_by_dayBC <- bike_by_dayBC[-c(152,99,39),]
# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity +
bike_by_dayBC$mean_wind_speed +
bike_by_dayBC$mean_dew_point_temp +
bike_by_dayBC$mean_solar_radiation +
bike_by_dayBC$mean_rainfall +
bike_by_dayBC$seasons +
bike_by_dayBC$holiday +
bike_by_dayBC$weekend)
summary(bike_by_dayBC_fit)
# multicollinearity
car::vif(bike_by_dayBC_fit)
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_dayBC_fit)
# Test for Normally Distributed Errors
shapiro.test(bike_by_dayBC_fit$residuals)
# Prediction
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- model4_prediction[i] + (abs(min(bike_by_day[i])))+1
}
}
transform_boxcox <- function(data)
{
# box-cox transformation
data = ((data^lambda_val)-1)/lambda_val
return(data)
}
for (i in colnames(model4_prediction)){
if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
model4_prediction[i] <- transform_boxcox(model4_prediction[i])
}
}
head(model4_prediction)
#### Model 4's equation
# creating function to predict the bike counts
pred_bike_count <- function(mean_humidity, mean_wind_speed, mean_dew_point_temp, mean_solar_radiation, mean_rainfall, seasons, holiday, weekend){
seasonsSummer = 0
seasonsAutumn = 0
seasonsSpring = 0
holidayHoliday = 0
weekendTRUE = 0
if(seasons == "Summer"){
seasonsSummer = 1
} else if(seasons == "Autumn"){
seasonsAutumn = 1
} else{
seasonsSpring = 1
}
if(holiday == "Holiday"){
holidayHoliday = 1
}
if(weekendTRUE == "TRUE"){
weekendTRUE = 1
}
y = 19.4886 - 0.8255 * mean_humidity - 1.0416 * mean_wind_speed + 1.0738 * mean_dew_point_temp + 3.4083 * mean_solar_radiation - 1.9489 * mean_rainfall - 1.0351 * seasonsSpring - 0.4055 * seasonsSummer -2.4737 * seasonsAutumn + 0.8703 * holidayHoliday + 0.2655 * weekendTRUE
return(as.numeric(y))
}
# Predicting values
model4_prediction[1,]
prediction1 <- pred_bike_count(
model4_prediction[1,"mean_humidity"],
model4_prediction[1,"mean_wind_speed"],
model4_prediction[1,"mean_dew_point_temp"],
model4_prediction[1,"mean_solar_radiation"],
model4_prediction[1,"mean_rainfall"],
model4_prediction[1,"seasons"],
model4_prediction[1,"holiday"],
model4_prediction[1,"weekend"]
)
prediction1
model4_prediction[2,]
prediction2 <- pred_bike_count(
model4_prediction[2,"mean_humidity"],
model4_prediction[2,"mean_wind_speed"],
model4_prediction[2,"mean_dew_point_temp"],
model4_prediction[2,"mean_solar_radiation"],
model4_prediction[2,"mean_rainfall"],
model4_prediction[2,"seasons"],
model4_prediction[2,"holiday"],
model4_prediction[2,"weekend"]
)
prediction2
model4_prediction[3,]
prediction3 <- pred_bike_count(
model4_prediction[3,"mean_humidity"],
model4_prediction[3,"mean_wind_speed"],
model4_prediction[3,"mean_dew_point_temp"],
model4_prediction[3,"mean_solar_radiation"],
model4_prediction[3,"mean_rainfall"],
model4_prediction[3,"seasons"],
model4_prediction[3,"holiday"],
model4_prediction[3,"weekend"]
)
prediction3
model4_prediction[4,]
prediction4 <- pred_bike_count(
model4_prediction[4,"mean_humidity"],
model4_prediction[4,"mean_wind_speed"],
model4_prediction[4,"mean_dew_point_temp"],
model4_prediction[4,"mean_solar_radiation"],
model4_prediction[4,"mean_rainfall"],
model4_prediction[4,"seasons"],
model4_prediction[4,"holiday"],
model4_prediction[4,"weekend"]
)
prediction4
model4_prediction[5,]
prediction5 <- pred_bike_count(
model4_prediction[5,"mean_humidity"],
model4_prediction[5,"mean_wind_speed"],
model4_prediction[5,"mean_dew_point_temp"],
model4_prediction[5,"mean_solar_radiation"],
model4_prediction[5,"mean_rainfall"],
model4_prediction[5,"seasons"],
model4_prediction[5,"holiday"],
model4_prediction[5,"weekend"]
)
prediction5
actual_bikecount <- c(model4_prediction[1,]$sum_bike_count, model4_prediction[2,]$sum_bike_count, model4_prediction[3,]$sum_bike_count, model4_prediction[4,]$sum_bike_count, model4_prediction[5,]$sum_bike_count)
predicted_bikecount <- c(prediction1, prediction2, prediction3, prediction4, prediction5)
df_prediction <- data.frame(actual_bikecount, predicted_bikecount)
df_prediction