student 1: Yonn April s3727210

student 2: Milind Shaileshkumar Parvatia s3806853

Introduction & Objective:

The purpose of this project is to accurately predict daily rental bike demand from factors such as the weather and holiday periods. This will help the public by reducing waiting times and improving convenience in urban cities. Factors other than the weather conditions, such as working days and holidays, also affect bike demand (Kim, 2018), and we are therefore interested in how integrating different variables changes the prediction accuracy. Accurately predicting the number of bikes required for a stable supply is a major concern, as convenience plays a crucial role in people continuing to use the service (Sathishkumar, Jangwoo, and Yongyun, 2020). The project therefore aims to predict bike rental demand for am hours and pm hours as well as hourly and daily demand. This will give rental companies an idea of how many bikes should be available to meet public demand on any given day, and will also help them schedule bike maintenance just in time for the demand.

Chen (2003) stated that regression-based algorithms cannot predict bike demand accurately. However, regression algorithms are comparatively faster on large datasets than most other machine learning methods, and linear regression does not assume that the data are normally distributed. As our data is heavily skewed in some factors, it is still valuable to see how these models perform. We therefore aim to predict bike demand using different regression algorithms and to assess their prediction accuracy and limitations.

Dataset & Source:

This dataset is taken from an open data repository, the UCI Machine Learning Repository: Seoul Bike Sharing Demand Dataset at https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand. The dataset has 8760 data points, each representing the weather and holiday information that are possible major factors in hiring a bike. The dataset spans 365 days, with weather information and bike demand recorded for every hour.

Features of the dataset:

The dataset contains 14 attributes: 10 numerical, 3 categorical and 1 date (time series) attribute.

Target variable: Rented Bike Count - the number of bikes rented in each hour.

Feature attributes:
Date - year-month-day
Hour - hour of the day
Temperature - Celsius
Humidity - %
Wind speed - m/s
Visibility - 10 m
Dew point temperature - Celsius
Solar radiation - MJ/m2
Rainfall - mm
Snowfall - cm
Seasons - Winter, Spring, Summer, Autumn
Holiday - Holiday / No Holiday
Functioning Day - Yes (functional hours) / No (non-functional hours)

Method

Some assumptions are made while handling the bike dataset: there will be more demand on holidays despite bad weather conditions; there will be less demand on very cold and very hot days; wind speed will be inversely proportional to demand; there will be more demand during warmer seasons such as summer and spring; there will be considerably less demand on rainy days; and there will be more demand on weekends than on weekdays. These assumptions will be tested during the data exploration stage. The data is split 80-20 to minimise overfitting and to test the prediction accuracy of the models.

There are four regression models in this report, each predicting bike rental demand.

Model 1 - Multiple linear regression model predicting bike rental demand for every hour using hourly weather conditions.
Model 2 - Multiple linear regression model predicting daily bike rental demand using average daily weather conditions.
Model 3 - Multiple linear regression model predicting bike rental demand during am hours and pm hours.
Model 4 - Multiple linear regression model predicting bike rental demand for every day with a weekend column, using daily average weather conditions.

The dataset is already clean; therefore, data preprocessing steps such as removal of outliers and handling of missing values are not performed before fitting the models. As regression models do not assume normality of the data, the dataset is used as it is, without any transformation. The categorical columns hour, seasons, holiday and functioning day are set as factors before fitting the models. All feature attributes of the dataset, except functioning day, are used to predict bike demand in all the models.

For each of the models, the dataset is split and then fitted to the regression model. The optimum model is then selected using stepwise regression, and VIF is used to test for multicollinearity. Statistical tests are then performed to check whether the model is adequate: the non-constant error variance test for homoscedasticity, Cook's distance for outliers, the Durbin-Watson test for autocorrelation and the Shapiro-Wilk test for normality. If the model fails the tests, especially for homoscedasticity, a Box-Cox transformation is performed. Outliers are then removed, and the data is fitted to the model again and retested for adequacy. The Shapiro-Wilk test is performed only on datasets with fewer than 5000 points.
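A minimal sketch of these adequacy checks is shown below, assuming a fitted linear model object named fit and that the car package is available (it is loaded in the next section). The helper check_model is introduced here for illustration only and is not part of the original analysis.

# Sketch: adequacy checks for a fitted lm object 'fit' (illustration only)
check_model <- function(fit) {
  print(car::vif(fit))               # multicollinearity (GVIF)
  print(car::ncvTest(fit))           # non-constant error variance test (homoscedasticity)
  print(car::durbinWatsonTest(fit))  # Durbin-Watson test for autocorrelation
  plot(fit, which = 4)               # Cook's distance for influential outliers
  res <- residuals(fit)
  if (length(res) < 5000) print(shapiro.test(res))  # Shapiro-Wilk, small samples only
}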

Importing Libraries

# Importing libraries
library(ggplot2)
library(magrittr) 
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(infotheo)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
library(TSA)
## Registered S3 methods overwritten by 'TSA':
##   method       from    
##   fitted.Arima forecast
##   plot.Arima   forecast
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
library(forecast)
library(chron)
library(caTools)
library(caret)
## Loading required package: lattice

Importing Data

# import dataset
setwd("/Users/April/Desktop/DS/Regression/Regression_Project")
bike <- read.csv("SeoulBikeData.csv",check.names = F)

#set new column names
bike <- setNames(bike, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))

#Setting categorical values

bike$hour <- factor(bike$hour)
bike$seasons <- factor(bike$seasons)
bike$holiday <- factor(bike$holiday)
bike$functioning_day <- factor(bike$functioning_day)

head(bike)
##         date bike_count hour temp humidity wind_speed visibility dew_point_temp
## 1 01/12/2017        254    0 -5.2       37        2.2       2000          -17.6
## 2 01/12/2017        204    1 -5.5       38        0.8       2000          -17.6
## 3 01/12/2017        173    2 -6.0       39        1.0       2000          -17.7
## 4 01/12/2017        107    3 -6.2       40        0.9       2000          -17.6
## 5 01/12/2017         78    4 -6.0       36        2.3       2000          -18.6
## 6 01/12/2017        100    5 -6.4       37        1.5       2000          -18.7
##   solar_radiation rainfall snowfall seasons    holiday functioning_day
## 1               0        0        0  Winter No Holiday             Yes
## 2               0        0        0  Winter No Holiday             Yes
## 3               0        0        0  Winter No Holiday             Yes
## 4               0        0        0  Winter No Holiday             Yes
## 5               0        0        0  Winter No Holiday             Yes
## 6               0        0        0  Winter No Holiday             Yes

Data Analysis

summary(bike$functioning_day)
##   No  Yes 
##  295 8465

According to the summary statistics of functioning_day, there are 295 non-functioning hours (rows) in the dataset. These hours will have zero bike counts.

fig1 <- ggplot(bike, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 1. Bike Demand by Functioning Day") +  xlab("Functioning Day") + ylab("Bike Count")
fig1 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))

Bikes are not available to rent during non-functioning hours, and therefore the bike count is zero whenever functioning day is "No", as no one is able to rent a bicycle, as seen in Fig 1. These non-functioning data points are deleted from the dataset as they are not relevant to our purpose. The dataset is now left with 8465 data points.

# Deleting rows when it is non-functioning day
bike<-bike[!(bike$functioning_day=="No"),]

# removing unused columns
bike <- subset(bike, select = - c(functioning_day))

summary(bike)
##          date        bike_count          hour           temp       
##  01/01/2018:  24   Min.   :   2.0   7      : 353   Min.   :-17.80  
##  01/02/2018:  24   1st Qu.: 214.0   8      : 353   1st Qu.:  3.00  
##  01/03/2018:  24   Median : 542.0   9      : 353   Median : 13.50  
##  01/04/2018:  24   Mean   : 729.2   10     : 353   Mean   : 12.77  
##  01/05/2018:  24   3rd Qu.:1084.0   11     : 353   3rd Qu.: 22.70  
##  01/06/2018:  24   Max.   :3556.0   12     : 353   Max.   : 39.40  
##  (Other)   :8321                    (Other):6347                   
##     humidity       wind_speed      visibility   dew_point_temp   
##  Min.   : 0.00   Min.   :0.000   Min.   :  27   Min.   :-30.600  
##  1st Qu.:42.00   1st Qu.:0.900   1st Qu.: 935   1st Qu.: -5.100  
##  Median :57.00   Median :1.500   Median :1690   Median :  4.700  
##  Mean   :58.15   Mean   :1.726   Mean   :1434   Mean   :  3.945  
##  3rd Qu.:74.00   3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 15.200  
##  Max.   :98.00   Max.   :7.400   Max.   :2000   Max.   : 27.200  
##                                                                  
##  solar_radiation     rainfall          snowfall         seasons    
##  Min.   :0.0000   Min.   : 0.0000   Min.   :0.00000   Autumn:1937  
##  1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.00000   Spring:2160  
##  Median :0.0100   Median : 0.0000   Median :0.00000   Summer:2208  
##  Mean   :0.5679   Mean   : 0.1491   Mean   :0.07769   Winter:2160  
##  3rd Qu.:0.9300   3rd Qu.: 0.0000   3rd Qu.:0.00000                
##  Max.   :3.5200   Max.   :35.0000   Max.   :8.80000                
##                                                                    
##        holiday    
##  Holiday   : 408  
##  No Holiday:8057  
##                   
##                   
##                   
##                   
## 

The summary statistics give a general idea of what the data look like without the non-functioning days. This will be further analysed with box plots and histograms.

Fig 2. Hourly Bike Count Box Plot shows the number of bikes rented by hour. The mean demand is lower than 500 bikes for the hours between 1 am and 6 am. There is a large spike at 8 (8 am), and demand then increases slowly throughout the daytime, reaching its highest peak at 18 (6 pm). Bike rentals start to decrease after 18 and reach their lowest point at 4 am. All the am hours except 7 am and 8 am have a demand lower than 1000 bikes. We therefore take the am hours as low-demand hours and the pm hours as high-demand hours.

fig2<- ggplot(bike, aes(x = as.factor(hour) , y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 2. Hourly Bike Count Box Plot") +  xlab("Hour") + ylab("Bike Count")
fig2 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))

Fig 3. Seasonal Bike Count Box Plot compares bike rentals across seasons. We can see that during better weather, such as summer, people prefer to cycle more, while in winter rentals are lowest, with demand well below 500 bikes. This may be due to the cold weather as well as snow during winter. There are a few outliers where bike demand in winter is over 500; these may have underlying explanations, such as those days having better weather than other winter days. In autumn, bike demand is higher than in spring, although we expected the opposite. This may be due to there being more rainy days in spring than in autumn.

fig3 <- ggplot(bike, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 3. Seasonal Bike Count Box Plot") +  xlab("Seasons") + ylab("Bike Count")
fig3 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))

According to Figure 4, the demand for bikes is higher on non-holiday days, which suggests that most users rent bikes for reasons other than recreation.

fig4 <- ggplot(bike, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 4. Holiday Bike Count Box Plot") +  xlab("Holiday") + ylab("Bike Count")
fig4 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))

The box plots show a few outliers in each category. However, these are not removed at this stage, as they might not be genuine outliers and may instead be explained by other factors such as the weather conditions or whether the day is a holiday. They will be removed at the modelling stage if they appear to negatively affect the model assumptions.

Correlation Matrix

A correlation matrix is computed to understand the relationships between the variables; the results are discussed in greater detail below.

# correlation matrix of dataset
cor_df <- data.frame(bike$bike_count, as.integer(bike$hour), bike$temp, bike$humidity, bike$wind_speed, bike$visibility, bike$dew_point, bike$solar_radiation, bike$rainfall, bike$snowfall)
cor_df <- data.frame(cor(cor_df))

cor_df[order(-cor_df[,1]), ]
##                       bike.bike_count as.integer.bike.hour.   bike.temp
## bike.bike_count             1.0000000           0.425255882  0.56274017
## bike.temp                   0.5627402           0.122741816  1.00000000
## as.integer.bike.hour.       0.4252559           1.000000000  0.12274182
## bike.dew_point              0.4002628           0.004691479  0.91446695
## bike.solar_radiation        0.2738616           0.144658482  0.35484355
## bike.visibility             0.2123228           0.103869318  0.02826206
## bike.wind_speed             0.1250219           0.287779819 -0.03848085
## bike.rainfall              -0.1286261           0.014344645  0.05214889
## bike.snowfall              -0.1516108          -0.022082499 -0.21774581
## bike.humidity              -0.2019727          -0.235937180  0.16642522
##                       bike.humidity bike.wind_speed bike.visibility
## bike.bike_count          -0.2019727     0.125021946      0.21232278
## bike.temp                 0.1664252    -0.038480848      0.02826206
## as.integer.bike.hour.    -0.2359372     0.287779819      0.10386932
## bike.dew_point            0.5394024    -0.177170126     -0.18258645
## bike.solar_radiation     -0.4572727     0.326221868      0.15304614
## bike.visibility          -0.5485418     0.180427649      1.00000000
## bike.wind_speed          -0.3373524     1.000000000      0.18042765
## bike.rainfall             0.2369169    -0.024931327     -0.17035180
## bike.snowfall             0.1101265    -0.003789344     -0.12285973
## bike.humidity             1.0000000    -0.337352380     -0.54854183
##                       bike.dew_point bike.solar_radiation bike.rainfall
## bike.bike_count          0.400262829           0.27386155  -0.128626093
## bike.temp                0.914466952           0.35484355   0.052148891
## as.integer.bike.hour.    0.004691479           0.14465848   0.014344645
## bike.dew_point           1.000000000           0.09852498   0.126812453
## bike.solar_radiation     0.098524979           1.00000000  -0.074157271
## bike.visibility         -0.182586445           0.15304614  -0.170351798
## bike.wind_speed         -0.177170126           0.32622187  -0.024931327
## bike.rainfall            0.126812453          -0.07415727   1.000000000
## bike.snowfall           -0.149759793          -0.07337987   0.008604092
## bike.humidity            0.539402446          -0.45727269   0.236916912
##                       bike.snowfall
## bike.bike_count        -0.151610753
## bike.temp              -0.217745811
## as.integer.bike.hour.  -0.022082499
## bike.dew_point         -0.149759793
## bike.solar_radiation   -0.073379874
## bike.visibility        -0.122859731
## bike.wind_speed        -0.003789344
## bike.rainfall           0.008604092
## bike.snowfall           1.000000000
## bike.humidity           0.110126502

From the correlation matrix we can determine the following points.

1.) The target variable bike_count is most strongly correlated with temp, followed by hour, dew_point, solar_radiation and so on.
2.) dew_point has a high correlation with temp, which may be problematic when constructing good models.
3.) bike_count has a low correlation with each individual factor, showing that bike demand is affected by more than one factor.

We will be doing further analysis of this in each model’s multicollinearity section.

Below are histograms of the continuous variables, used to determine their distributions. The bike_count values are right-skewed. There are many hours with temperatures below zero, and we expect these hours to have lower demand for bike rentals. The humidity histogram is closer to a normal distribution than the other factors. Wind speed is low, with very few extreme speeds. Visibility is good on most days. Dew point temperature shows a similar distribution to temperature. Solar radiation is 0 in most hours, with very few extreme values. The rainfall and snowfall data are extremely right-skewed, with very few hours of heavy rain or snow. Temperature shows signs of a bimodal distribution.

#Creating histograms 
multi.hist(bike[,sapply(bike, is.numeric)])

Modelling

1) Model 1

Model 1 is a multiple linear regression model predicting bike rental demand for every hour using hourly weather conditions. For this model, the dataset is used as it is, without any changes. The data is split 80-20 to build the model.

Data splitting

The data is split into an 80% training set and a 20% test set to minimise overfitting. The split is made by random selection using the sample() method.

#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike)), size = smp_size)

model1_prediction <- bike[ -trainIndex,]
model1 <- bike[ trainIndex,]

After splitting the data, the model is fitted to the training set.

# regression model with all columns original data
row.names(model1) <- NULL

original_full = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model1)

#summary
summary(original_full)
## 
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + wind_speed + 
##     visibility + dew_point_temp + solar_radiation + rainfall + 
##     snowfall + seasons + holiday, data = model1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1396.44  -219.81    -9.44   201.57  1945.23 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.161e+03  9.868e+01  11.763  < 2e-16 ***
## hour1             -1.006e+02  3.207e+01  -3.138 0.001711 ** 
## hour2             -2.244e+02  3.172e+01  -7.076 1.64e-12 ***
## hour3             -3.032e+02  3.210e+01  -9.444  < 2e-16 ***
## hour4             -3.751e+02  3.215e+01 -11.669  < 2e-16 ***
## hour5             -3.616e+02  3.183e+01 -11.361  < 2e-16 ***
## hour6             -1.864e+02  3.176e+01  -5.868 4.61e-09 ***
## hour7              1.243e+02  3.182e+01   3.908 9.40e-05 ***
## hour8              4.876e+02  3.207e+01  15.205  < 2e-16 ***
## hour9              1.802e+01  3.289e+01   0.548 0.583730    
## hour10            -2.215e+02  3.393e+01  -6.529 7.11e-11 ***
## hour11            -2.263e+02  3.528e+01  -6.414 1.52e-10 ***
## hour12            -2.048e+02  3.666e+01  -5.587 2.40e-08 ***
## hour13            -1.815e+02  3.699e+01  -4.908 9.42e-07 ***
## hour14            -1.757e+02  3.606e+01  -4.873 1.12e-06 ***
## hour15            -9.281e+01  3.536e+01  -2.625 0.008691 ** 
## hour16             5.468e+01  3.405e+01   1.606 0.108374    
## hour17             3.379e+02  3.297e+01  10.248  < 2e-16 ***
## hour18             7.940e+02  3.273e+01  24.256  < 2e-16 ***
## hour19             5.406e+02  3.277e+01  16.499  < 2e-16 ***
## hour20             4.682e+02  3.244e+01  14.433  < 2e-16 ***
## hour21             4.597e+02  3.253e+01  14.134  < 2e-16 ***
## hour22             3.393e+02  3.178e+01  10.676  < 2e-16 ***
## hour23             1.117e+02  3.201e+01   3.491 0.000484 ***
## temp               7.741e+00  3.704e+00   2.090 0.036674 *  
## humidity          -1.099e+01  1.029e+00 -10.676  < 2e-16 ***
## wind_speed        -3.156e+00  5.266e+00  -0.599 0.549061    
## visibility         8.024e-03  9.851e-03   0.815 0.415382    
## dew_point_temp     1.750e+01  3.850e+00   4.546 5.57e-06 ***
## solar_radiation    8.194e+01  1.133e+01   7.231 5.32e-13 ***
## rainfall          -6.346e+01  4.269e+00 -14.864  < 2e-16 ***
## snowfall           3.497e+01  1.075e+01   3.255 0.001140 ** 
## seasonsSpring     -1.675e+02  1.388e+01 -12.067  < 2e-16 ***
## seasonsSummer     -1.766e+02  1.713e+01 -10.312  < 2e-16 ***
## seasonsWinter     -3.545e+02  1.951e+01 -18.176  < 2e-16 ***
## holidayNo Holiday  1.383e+02  2.182e+01   6.338 2.47e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 376.5 on 6736 degrees of freedom
## Multiple R-squared:  0.6614, Adjusted R-squared:  0.6597 
## F-statistic:   376 on 35 and 6736 DF,  p-value: < 2.2e-16

Then stepwise AIC is used to select an optimum model for prediction.

# possible subsets regression original data
library(MASS)
stepReg=MASS::stepAIC(original_full, direction="both")
## Start:  AIC=80364.21
## bike_count ~ hour + temp + humidity + wind_speed + visibility + 
##     dew_point_temp + solar_radiation + rainfall + snowfall + 
##     seasons + holiday
## 
##                   Df Sum of Sq        RSS   AIC
## - wind_speed       1     50894  954887654 80363
## - visibility       1     94041  954930801 80363
## <none>                          954836760 80364
## - temp             1    619060  955455820 80367
## - snowfall         1   1501693  956338453 80373
## - dew_point_temp   1   2928905  957765666 80383
## - holiday          1   5694656  960531416 80402
## - solar_radiation  1   7412253  962249013 80415
## - humidity         1  16155129  970991890 80476
## - rainfall         1  31319043  986155804 80581
## - seasons          3  73932170 1028768930 80863
## - hour            23 564054506 1518891266 83462
## 
## Step:  AIC=80362.57
## bike_count ~ hour + temp + humidity + visibility + dew_point_temp + 
##     solar_radiation + rainfall + snowfall + seasons + holiday
## 
##                   Df Sum of Sq        RSS   AIC
## - visibility       1     79525  954967180 80361
## <none>                          954887654 80363
## + wind_speed       1     50894  954836760 80364
## - temp             1    627843  955515498 80365
## - snowfall         1   1504623  956392277 80371
## - dew_point_temp   1   2932423  957820077 80381
## - holiday          1   5718410  960606064 80401
## - solar_radiation  1   7400209  962287863 80413
## - humidity         1  16171684  971059339 80474
## - rainfall         1  31387164  986274819 80580
## - seasons          3  75107146 1029994800 80869
## - hour            23 588532596 1543420251 83568
## 
## Step:  AIC=80361.14
## bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation + 
##     rainfall + snowfall + seasons + holiday
## 
##                   Df Sum of Sq        RSS   AIC
## <none>                          954967180 80361
## + visibility       1     79525  954887654 80363
## + wind_speed       1     36378  954930801 80363
## - temp             1    603611  955570790 80363
## - snowfall         1   1497042  956464221 80370
## - dew_point_temp   1   2996946  957964125 80380
## - holiday          1   5705782  960672961 80399
## - solar_radiation  1   7343094  962310274 80411
## - humidity         1  17870682  972837862 80485
## - rainfall         1  31492353  986459532 80579
## - seasons          3  76754753 1031721932 80879
## - hour            23 588488290 1543455469 83566
stepReg$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## bike_count ~ hour + temp + humidity + wind_speed + visibility + 
##     dew_point_temp + solar_radiation + rainfall + snowfall + 
##     seasons + holiday
## 
## Final Model:
## bike_count ~ hour + temp + humidity + dew_point_temp + solar_radiation + 
##     rainfall + snowfall + seasons + holiday
## 
## 
##           Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                               6736  954836760 80364.21
## 2 - wind_speed  1 50894.07      6737  954887654 80362.57
## 3 - visibility  1 79525.24      6738  954967180 80361.14
# multicollinearity test
car::vif(original_full)
##                       GVIF Df GVIF^(1/(2*Df))
## hour              4.806960 23        1.034721
## temp             96.163368  1        9.806292
## humidity         21.287807  1        4.613871
## wind_speed        1.415588  1        1.189785
## visibility        1.722859  1        1.312577
## dew_point_temp  124.716403  1       11.167650
## solar_radiation   4.666642  1        2.160241
## rainfall          1.114160  1        1.055538
## snowfall          1.116854  1        1.056813
## seasons           5.553691  3        1.330751
## holiday           1.023949  1        1.011903

The stepwise method produced a final regression model without visibility and wind speed, so these are removed from the final model. The GVIF value for dew_point_temp is greater than 10, showing the presence of multicollinearity, so it is also removed from the model.

# Removing columns with multicollinearity and stepwise
original_final = lm(bike_count ~ hour + temp + humidity + solar_radiation + 
    rainfall + snowfall + seasons + holiday, data=model1)

#summary
summary(original_final)
## 
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + solar_radiation + 
##     rainfall + snowfall + seasons + holiday, data = model1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1378.6  -219.7   -10.4   203.1  2018.5 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        795.2381    38.1578  20.841  < 2e-16 ***
## hour1             -101.0909    32.1155  -3.148 0.001653 ** 
## hour2             -224.8470    31.7478  -7.082 1.56e-12 ***
## hour3             -302.2156    32.1248  -9.408  < 2e-16 ***
## hour4             -373.5116    32.1679 -11.611  < 2e-16 ***
## hour5             -360.1377    31.8366 -11.312  < 2e-16 ***
## hour6             -187.5019    31.7687  -5.902 3.76e-09 ***
## hour7              126.4071    31.8235   3.972 7.20e-05 ***
## hour8              489.3452    32.0661  15.261  < 2e-16 ***
## hour9               23.4321    32.8732   0.713 0.475993    
## hour10            -214.8377    33.8889  -6.339 2.45e-10 ***
## hour11            -221.7090    35.2506  -6.290 3.38e-10 ***
## hour12            -203.2819    36.6409  -5.548 3.00e-08 ***
## hour13            -184.6526    36.9468  -4.998 5.95e-07 ***
## hour14            -184.3223    35.9467  -5.128 3.02e-07 ***
## hour15            -103.6727    35.1285  -2.951 0.003176 ** 
## hour16              42.1817    33.7517   1.250 0.211429    
## hour17             326.7397    32.6444  10.009  < 2e-16 ***
## hour18             784.9803    32.4882  24.162  < 2e-16 ***
## hour19             533.7577    32.5840  16.381  < 2e-16 ***
## hour20             465.1283    32.3425  14.381  < 2e-16 ***
## hour21             458.2095    32.5283  14.086  < 2e-16 ***
## hour22             339.0684    31.8091  10.659  < 2e-16 ***
## hour23             111.4768    32.0440   3.479 0.000507 ***
## temp                24.1372     0.8797  27.439  < 2e-16 ***
## humidity            -6.8340     0.3117 -21.927  < 2e-16 ***
## solar_radiation     71.5660    11.0287   6.489 9.25e-11 ***
## rainfall           -65.9748     4.2413 -15.555  < 2e-16 ***
## snowfall            31.1165    10.7279   2.901 0.003737 ** 
## seasonsSpring     -174.1587    13.3956 -13.001  < 2e-16 ***
## seasonsSummer     -170.6949    17.0033 -10.039  < 2e-16 ***
## seasonsWinter     -359.3512    19.1826 -18.733  < 2e-16 ***
## holidayNo Holiday  137.6383    21.8440   6.301 3.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 377 on 6739 degrees of freedom
## Multiple R-squared:  0.6603, Adjusted R-squared:  0.6587 
## F-statistic: 409.4 on 32 and 6739 DF,  p-value: < 2.2e-16

The p-values for hour 9 and hour 16 are greater than 0.05; therefore these coefficients are not statistically significant. However, the final model explains 65.87% of the variance (adjusted R-squared) and is statistically significant overall, as its p-value is less than 0.05.

par(mfrow=c(2,2))
plot(original_final , which=1:4)

Cook's distance shows that there is one influential outlier. This observation will be removed from the data.

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(original_final)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1508.533, Df = 1, p = < 2.22e-16

The p-value is less than 0.05, implying that the constant error variance assumption is violated. Therefore, a Box-Cox transformation will be performed on the model.

# Test for Autocorrelated Errors
durbinWatsonTest(original_final)
##  lag Autocorrelation D-W Statistic p-value
##    1    -0.004870399      2.009702   0.646
##  Alternative hypothesis: rho != 0

P-value is more than 0.05 and therefore, uncorrelated error assumption is not violated.

BoxCox transformation

# removing One outlier
model1_BC <- model1[-c(6040), ] # removing Outlier


for (i in colnames(model1_BC)){
  if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
    model1_BC[i] <- model1_BC[i] + (abs(min(bike[i])))+1 #adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(bike_count ~ ( temp + humidity), data = model1_BC)

lambda_val=bc1$result[1]
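# Note: bc1$result[1] is the estimated power (lambda) for the first predictor, temp,
# from boxTidwell(); this single lambda is applied to every numeric column below.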

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = (data^lambda_val)-1/lambda_val
  return(data)
}

for (i in colnames(model1_BC)){
  if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
    model1_BC[i] <- transform_boxcox(model1_BC[i])
  }
}

Final Model 1

# fitting new model
model1_BC_fit = lm(bike_count ~ hour + temp + humidity + solar_radiation + 
    rainfall + snowfall + seasons + holiday, data=model1_BC
    )
  
summary(model1_BC_fit)
## 
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + solar_radiation + 
##     rainfall + snowfall + seasons + holiday, data = model1_BC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.882  -15.014    0.253   15.629  120.232 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        43.7483     3.5814  12.215  < 2e-16 ***
## hour1              -8.6268     2.2049  -3.912 9.23e-05 ***
## hour2             -19.8489     2.1796  -9.107  < 2e-16 ***
## hour3             -28.6157     2.2053 -12.976  < 2e-16 ***
## hour4             -37.0104     2.2081 -16.761  < 2e-16 ***
## hour5             -36.2146     2.1851 -16.573  < 2e-16 ***
## hour6             -18.4712     2.1806  -8.471  < 2e-16 ***
## hour7               7.6510     2.1856   3.501 0.000467 ***
## hour8              32.2375     2.2088  14.595  < 2e-16 ***
## hour9               1.5362     2.2798   0.674 0.500430    
## hour10            -18.1290     2.3629  -7.673 1.92e-14 ***
## hour11            -18.2114     2.4605  -7.402 1.51e-13 ***
## hour12            -16.8484     2.5564  -6.591 4.71e-11 ***
## hour13            -16.2825     2.5760  -6.321 2.77e-10 ***
## hour14            -16.0612     2.5079  -6.404 1.61e-10 ***
## hour15             -9.7617     2.4511  -3.983 6.89e-05 ***
## hour16              0.2894     2.3489   0.123 0.901961    
## hour17             19.9810     2.2594   8.844  < 2e-16 ***
## hour18             48.9441     2.2358  21.891  < 2e-16 ***
## hour19             35.0344     2.2370  15.661  < 2e-16 ***
## hour20             31.1849     2.2202  14.046  < 2e-16 ***
## hour21             31.2509     2.2332  13.994  < 2e-16 ***
## hour22             23.5139     2.1838  10.767  < 2e-16 ***
## hour23              7.8049     2.2000   3.548 0.000391 ***
## temp                7.6162     0.2622  29.047  < 2e-16 ***
## humidity           -2.5621     0.1207 -21.235  < 2e-16 ***
## solar_radiation    14.0901     1.4850   9.488  < 2e-16 ***
## rainfall          -21.2329     0.7989 -26.579  < 2e-16 ***
## snowfall            2.1753     1.4817   1.468 0.142131    
## seasonsSpring     -14.9415     0.9187 -16.264  < 2e-16 ***
## seasonsSummer     -10.9605     1.1220  -9.769  < 2e-16 ***
## seasonsWinter     -30.2731     1.3436 -22.531  < 2e-16 ***
## holidayNo Holiday  12.3153     1.5010   8.205 2.75e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.89 on 6738 degrees of freedom
## Multiple R-squared:  0.7155, Adjusted R-squared:  0.7141 
## F-statistic: 529.5 on 32 and 6738 DF,  p-value: < 2.2e-16

The final model explains 71.55% of the variance (multiple R-squared) and is statistically significant overall, as the p-value of the model is less than 0.05. However, even after the transformation, snowfall, hour 9 and hour 16 remain statistically insignificant, with p-values greater than 0.05.

# Checking multicollinearity in the final Model
car::vif(model1_BC_fit)
##                     GVIF Df GVIF^(1/(2*Df))
## hour            4.440614 23        1.032939
## temp            5.233746  1        2.287738
## humidity        2.011847  1        1.418396
## solar_radiation 4.804818  1        2.191989
## rainfall        1.123093  1        1.059761
## snowfall        1.123577  1        1.059989
## seasons         4.780295  3        1.297904
## holiday         1.025185  1        1.012514

Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.

par(mfrow=c(2,2))
plot(model1_BC_fit , which=1:4)

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(model1_BC_fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 610.9344, Df = 1, p = < 2.22e-16

The p-value is less than 0.05, implying that the constant error variance assumption is still violated after the Box-Cox transformation.

# Test for Autocorrelated Errors
durbinWatsonTest(model1_BC_fit)
##  lag Autocorrelation D-W Statistic p-value
##    1    -0.005014278      2.010001   0.718
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

#shapiro.test(model1_BC_fit$residuals)
# Error in shapiro.test(model1_BC_fit$residuals) : sample size must be between 3 and 5000

The Shapiro-Wilk test cannot be applied here because the training set has more than 5000 observations (see Method), so the normality of the residuals is instead judged from the Normal Q-Q plot in the diagnostic plots above.
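One possible workaround, not applied in this report, would be to run the Shapiro-Wilk test on a random subsample of at most 5000 residuals:

# Hypothetical check: Shapiro-Wilk test on a random subsample of the residuals
set.seed(123)
res <- residuals(model1_BC_fit)
shapiro.test(sample(res, min(length(res), 5000)))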

Model 1 did not pass the homoscedasticity test and is therefore deemed unable to predict the bike demand accurately. We will now test whether Model 2 can predict it better.
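Before moving on, note that the held-out 20% split created earlier (model1_prediction) could already be used to quantify prediction accuracy. Below is a minimal sketch, not part of the original analysis, assuming RMSE as the accuracy metric and using the untransformed final model original_final:

# Hypothetical evaluation of the untransformed final model on the held-out test split
preds <- predict(original_final, newdata = model1_prediction)
sqrt(mean((model1_prediction$bike_count - preds)^2))  # RMSE in bike counts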

2) Model 2

Model 2 is a multiple linear regression model predicting daily bike rental demand using average daily weather conditions. For this model, the total bike demand for each day and the average daily weather conditions are calculated. The data is split 80-20 into training and test sets to build the model.

# creating dataframe for bike count by date
bike_by_day <- bike
levels(bike_by_day$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_by_day$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)


bike_by_day <- bike_by_day %>% group_by(date) %>% summarise(sum_bike_count = sum(bike_count), mean_temp = mean(temp), mean_humidity = mean(humidity), mean_wind_speed = mean(wind_speed), mean_visibility = mean(visibility), mean_dew_point_temp = mean(dew_point_temp), mean_solar_radiation = mean(solar_radiation), mean_rainfall = mean(rainfall), mean_snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
#Changing the factors back to original values 
bike_by_day$holiday <- as.factor(bike_by_day$holiday)
bike_by_day$seasons <- as.factor(bike_by_day$seasons)
levels(bike_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(bike_by_day)
## # A tibble: 6 x 12
##   date    sum_bike_count mean_temp mean_humidity mean_wind_speed mean_visibility
##   <fct>            <int>     <dbl>         <dbl>           <dbl>           <dbl>
## 1 01/01/…           4290     -1.28          39.3            1.45           1895.
## 2 01/02/…           5377     -3.87          44              1.61           1924.
## 3 01/03/…           5132      0.45          64.2            3.55           1084 
## 4 01/04/…          17388     15.2           68.9            1.57            832.
## 5 01/05/…          26820     20.3           72.8            1.44            456.
## 6 01/06/…          31928     23.7           50.1            1.95           1598.
## # … with 6 more variables: mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>

Below are histograms of the continuous variables after aggregating the data by day. The sum_bike_count variable shows signs of a bimodal distribution, and temperature shows a flatter distribution compared to the original data. The remaining columns do not look very different from the original distributions.

#Creating histograms 
multi.hist(bike_by_day[,sapply(bike_by_day, is.numeric)])

Data splitting

The data is split into an 80% training set and a 20% test set to minimise overfitting. The split is made by random selection using the sample() method.

#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_by_day))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)

model2_prediction <- bike_by_day[ -trainIndex,]
model2 <- bike_by_day[ trainIndex,]

After splitting the data, the model is fitted to the training set.

# regression model by day 
bike_by_day_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=model2)

#summary
summary(bike_by_day_model)
## 
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday, data = model2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10732.9  -2611.3    380.5   2530.7  11448.5 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.297e+04  9.992e+03   3.299  0.00110 ** 
## mean_temp            -5.287e+02  3.790e+02  -1.395  0.16417    
## mean_humidity        -2.535e+02  1.130e+02  -2.244  0.02568 *  
## mean_wind_speed      -1.273e+03  4.415e+02  -2.884  0.00424 ** 
## mean_visibility      -6.274e-02  6.641e-01  -0.094  0.92480    
## mean_dew_point_temp   9.295e+02  4.042e+02   2.300  0.02222 *  
## mean_solar_radiation  1.220e+04  1.364e+03   8.944  < 2e-16 ***
## mean_rainfall        -3.570e+03  6.402e+02  -5.577 5.96e-08 ***
## mean_snowfall        -5.857e+02  6.705e+02  -0.874  0.38313    
## seasonsSpring        -4.810e+03  7.525e+02  -6.392 7.19e-10 ***
## seasonsSummer        -2.820e+03  9.234e+02  -3.054  0.00249 ** 
## seasonsAutumn        -8.009e+03  1.027e+03  -7.795 1.42e-13 ***
## holidayHoliday        2.761e+03  1.103e+03   2.503  0.01291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3940 on 269 degrees of freedom
## Multiple R-squared:  0.8501, Adjusted R-squared:  0.8434 
## F-statistic: 127.1 on 12 and 269 DF,  p-value: < 2.2e-16

This model explains 84.34% of the variance (adjusted R-squared) and is statistically significant, as the p-value is less than 0.05. However, some factors are not statistically significant. Stepwise AIC is therefore used to select an optimum model for prediction.

# possible subsets regression 
stepReg=MASS::stepAIC(bike_by_day_model, direction="both")
## Start:  AIC=4682.07
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday
## 
##                        Df  Sum of Sq        RSS    AIC
## - mean_visibility       1     138612 4176899119 4680.1
## - mean_snowfall         1   11848940 4188609447 4680.9
## <none>                               4176760507 4682.1
## - mean_temp             1   30215864 4206976371 4682.1
## - mean_humidity         1   78153669 4254914175 4685.3
## - mean_dew_point_temp   1   82133290 4258893797 4685.6
## - holiday               1   97276886 4274037393 4686.6
## - mean_wind_speed       1  129189506 4305950013 4688.7
## - mean_rainfall         1  482894786 4659655293 4710.9
## - mean_solar_radiation  1 1242133949 5418894456 4753.5
## - seasons               3 1372262290 5549022797 4756.2
## 
## Step:  AIC=4680.08
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
##     mean_snowfall + seasons + holiday
## 
##                        Df  Sum of Sq        RSS    AIC
## - mean_snowfall         1   12005670 4188904789 4678.9
## <none>                               4176899119 4680.1
## - mean_temp             1   30530456 4207429575 4680.1
## + mean_visibility       1     138612 4176760507 4682.1
## - mean_dew_point_temp   1   84204705 4261103824 4683.7
## - mean_humidity         1   85890314 4262789433 4683.8
## - holiday               1   97275442 4274174561 4684.6
## - mean_wind_speed       1  137322948 4314222067 4687.2
## - mean_rainfall         1  504105669 4681004788 4710.2
## - mean_solar_radiation  1 1242797146 5419696265 4751.5
## - seasons               3 1473953173 5650852292 4759.3
## 
## Step:  AIC=4678.89
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
##     seasons + holiday
## 
##                        Df  Sum of Sq        RSS    AIC
## <none>                               4188904789 4678.9
## - mean_temp             1   35239864 4224144653 4679.3
## + mean_snowfall         1   12005670 4176899119 4680.1
## + mean_visibility       1     295342 4188609447 4680.9
## - mean_dew_point_temp   1   94497682 4283402471 4683.2
## - holiday               1   96775043 4285679832 4683.3
## - mean_humidity         1  100475636 4289380425 4683.6
## - mean_wind_speed       1  138369580 4327274369 4686.1
## - mean_rainfall         1  493107696 4682012484 4708.3
## - mean_solar_radiation  1 1235664818 5424569607 4749.8
## - seasons               3 1476936963 5665841752 4758.1
stepReg$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday
## 
## Final Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
##     seasons + holiday
## 
## 
##                Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                     269 4176760507 4682.072
## 2 - mean_visibility  1   138612       270 4176899119 4680.082
## 3   - mean_snowfall  1 12005670       271 4188904789 4678.891
# multicollinearity
car::vif(bike_by_day_model)
##                            GVIF Df GVIF^(1/(2*Df))
## mean_temp            332.934319  1       18.246488
## mean_humidity         49.016289  1        7.001163
## mean_wind_speed        1.263369  1        1.123997
## mean_visibility        2.050572  1        1.431982
## mean_dew_point_temp  463.892705  1       21.538169
## mean_solar_radiation   3.264403  1        1.806766
## mean_rainfall          1.704366  1        1.305514
## mean_snowfall          1.170149  1        1.081734
## seasons                7.575015  3        1.401406
## holiday                1.042355  1        1.020958

The GVIF values for mean_temp and mean_dew_point_temp are greater than 10, indicating multicollinearity. Stepwise AIC removed mean_visibility and mean_snowfall from the model. These variables, together with mean_dew_point_temp, are excluded from the final model.

# Removing columns with multicollinearity and stepwise
bike_by_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=model2)

#summary
summary(bike_by_day_model_final)
## 
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_solar_radiation + mean_rainfall + seasons + holiday, 
##     data = model2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10443.0  -2558.4    196.8   2565.9  11704.5 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          11836.98    2198.00   5.385 1.57e-07 ***
## mean_temp              346.09      53.14   6.512 3.56e-10 ***
## mean_humidity          -14.96      24.24  -0.617  0.53759    
## mean_wind_speed      -1241.65     433.81  -2.862  0.00453 ** 
## mean_solar_radiation 11933.36    1369.45   8.714 2.93e-16 ***
## mean_rainfall        -3944.21     605.06  -6.519 3.43e-10 ***
## seasonsSpring        -4977.03     705.83  -7.051 1.46e-11 ***
## seasonsSummer        -2571.73     921.82  -2.790  0.00565 ** 
## seasonsAutumn        -8036.42    1011.77  -7.943 5.24e-14 ***
## holidayHoliday        2506.71    1106.13   2.266  0.02422 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3968 on 272 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.8412 
## F-statistic: 166.4 on 9 and 272 DF,  p-value: < 2.2e-16
# multicollinearity
car::vif(bike_by_day_model_final)
##                          GVIF Df GVIF^(1/(2*Df))
## mean_temp            6.454181  1        2.540508
## mean_humidity        2.223462  1        1.491128
## mean_wind_speed      1.202904  1        1.096770
## mean_solar_radiation 3.244988  1        1.801385
## mean_rainfall        1.500984  1        1.225147
## seasons              6.040279  3        1.349510
## holiday              1.033729  1        1.016725

The final model explains 84.12% of the variance (adjusted R-squared) and is statistically significant at the 95% confidence level, as the p-value is less than 0.05. There is no multicollinearity in the final model. The p-value of mean_humidity is greater than 0.05, so it is not statistically significant.

par(mfrow=c(2,2))
plot(bike_by_day_model_final , which=1:4)

There are no outliers in the dataset, but the residuals show a degree of heteroscedasticity.

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_model_final)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 27.39761, Df = 1, p = 1.6564e-07

P-value is less than 0.05, implying that the constant error variance assumption is violated.

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_model_final)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.01148254      1.970562   0.798
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

# Test for Normally Distributed Errors
shapiro.test(bike_by_day_model_final$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_by_day_model_final$residuals
## W = 0.99635, p-value = 0.7624

The p-value of the Shapiro-Wilk test is greater than 0.05; therefore, it can be concluded that the error normality assumption is not violated. The data will still be transformed using a Box-Cox transformation, as the model did not pass all of the tests.

BoxCox transformation

# Making all the values positive
bike_by_day_BC <- bike_by_day

for (i in colnames(bike_by_day_BC)){
  if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
    bike_by_day_BC[i] <- round(bike_by_day_BC[i] + (abs(min(bike_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
  }
}
bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
    mean_humidity + 
    mean_solar_radiation 
  ,data = bike_by_day_BC)

lambda_val=bc1$result[1]
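# Note: as for Model 1, bc1$result[1] is the estimated lambda for the first predictor
# (mean_temp); the same lambda is applied to all numeric columns below.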
transform_boxcox <- function(data)
{
  # box-cox transformation
  data = (data^lambda_val)-1/lambda_val
  return(data)
}

for (i in colnames(bike_by_day_BC)){
  if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
    bike_by_day_BC[i] <- transform_boxcox(bike_by_day_BC[i])
  }
}

Final Model 2

# fitting new model
bike_by_day_BC_fit = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
    mean_solar_radiation + mean_rainfall + seasons + holiday, 
    data = bike_by_day_BC)
  
summary(bike_by_day_BC_fit)
## 
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_solar_radiation + mean_rainfall + seasons + holiday, 
##     data = bike_by_day_BC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.329 -14.109   1.105  14.262  54.485 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          146.8573    18.6492   7.875 4.54e-14 ***
## mean_temp             16.3622     1.9991   8.185 5.41e-15 ***
## mean_humidity         -0.7034     1.7476  -0.403 0.687541    
## mean_wind_speed      -19.0108     7.2987  -2.605 0.009597 ** 
## mean_solar_radiation 138.3179    15.0689   9.179  < 2e-16 ***
## mean_rainfall        -67.4140     7.5264  -8.957  < 2e-16 ***
## seasonsSpring        -26.8193     3.3871  -7.918 3.38e-14 ***
## seasonsSummer        -18.7019     4.0442  -4.624 5.33e-06 ***
## seasonsAutumn        -48.7661     4.8991  -9.954  < 2e-16 ***
## holidayHoliday        20.1854     5.2618   3.836 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.81 on 343 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.8563 
## F-statistic: 234.1 on 9 and 343 DF,  p-value: < 2.2e-16

The final model explains 86% of the variance (multiple R-squared) and is statistically significant, as the p-value of the model is less than 0.05. All variables are statistically significant except mean_humidity, whose p-value is greater than 0.05.

# Checking multicollinearity in the final Model
car::vif(bike_by_day_BC_fit)
##                          GVIF Df GVIF^(1/(2*Df))
## mean_temp            6.314855  1        2.512937
## mean_humidity        2.395446  1        1.547723
## mean_wind_speed      1.274952  1        1.129138
## mean_solar_radiation 3.309453  1        1.819190
## mean_rainfall        1.609907  1        1.268821
## seasons              5.734895  3        1.337892
## holiday              1.034576  1        1.017141

Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.

par(mfrow=c(2,2))
plot(bike_by_day_BC_fit , which=1:4)

No outliers are detected.

# Evaluate homoscedasticity
# non-constant error variance test

ncvTest(bike_by_day_BC_fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5.579538, Df = 1, p = 0.018172

P-value is still less than 0.05, implying that the constant error variance assumption is violated even after transformation.

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_BC_fit)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0320337      1.935827     0.5
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

shapiro.test(bike_by_day_BC_fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_by_day_BC_fit$residuals
## W = 0.99623, p-value = 0.5709

The p-value is greater than 0.05; therefore, the residual normality assumption is not violated for this model.

3) Model 3

Model 3 is a multiple linear regression model predicting bike rental demand during am hours and pm hours. For this, the sum of the bike counts and the averages of the weather variables are calculated separately for the am and pm hours of each day.

# Dividing to am and pm
bike_AMPM <- bike
bike_AMPM$hour <- as.numeric(bike_AMPM$hour)
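# Note: as.numeric() on the hour factor returns the level index (1-24), so values
# <= 12 correspond to hours 0-11 (am) and values > 12 correspond to hours 12-23 (pm)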

bike_AMPM['hour'] [bike_AMPM['hour'] <=12] <- -1
bike_AMPM['hour'] [bike_AMPM['hour'] >12] <- 1
bike_AMPM$hour <- factor(bike_AMPM$hour)


head(bike_AMPM)
##         date bike_count hour temp humidity wind_speed visibility dew_point_temp
## 1 01/12/2017        254   -1 -5.2       37        2.2       2000          -17.6
## 2 01/12/2017        204   -1 -5.5       38        0.8       2000          -17.6
## 3 01/12/2017        173   -1 -6.0       39        1.0       2000          -17.7
## 4 01/12/2017        107   -1 -6.2       40        0.9       2000          -17.6
## 5 01/12/2017         78   -1 -6.0       36        2.3       2000          -18.6
## 6 01/12/2017        100   -1 -6.4       37        1.5       2000          -18.7
##   solar_radiation rainfall snowfall seasons    holiday
## 1               0        0        0  Winter No Holiday
## 2               0        0        0  Winter No Holiday
## 3               0        0        0  Winter No Holiday
## 4               0        0        0  Winter No Holiday
## 5               0        0        0  Winter No Holiday
## 6               0        0        0  Winter No Holiday
# creating dataframe for bike count by am pm
levels(bike_AMPM$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_AMPM$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)

bike_AMPM <- bike_AMPM %>% group_by(date,hour) %>% summarise(bike_count = sum(bike_count), temp = mean(temp), humidity = mean(humidity), wind_speed = mean(wind_speed), visibility = mean(visibility), dew_point_temp = mean(dew_point_temp), solar_radiation = mean(solar_radiation), rainfall = mean(rainfall), snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
bike_AMPM
## # A tibble: 706 x 13
## # Groups:   date [353]
##    date    hour  bike_count   temp humidity wind_speed visibility dew_point_temp
##    <fct>   <fct>      <int>  <dbl>    <dbl>      <dbl>      <dbl>          <dbl>
##  1 01/01/… -1          1520 -3.57      44.5      0.992      1896.        -14.2  
##  2 01/01/… 1           2770  1         34.2      1.92       1894.        -13.6  
##  3 01/02/… -1          1962 -6.51      53.9      1.31       1891.        -14.5  
##  4 01/02/… 1           3415 -1.23      34.1      1.91       1956.        -15.6  
##  5 01/03/… -1          1361  1.63      85.8      2.82        778.         -0.767
##  6 01/03/… 1           3771 -0.733     42.7      4.29       1390.        -12.0  
##  7 01/04/… -1          4227 12.9       65.8      0.967       974.          6.52 
##  8 01/04/… 1          13161 17.5       72        2.17        690.         12.2  
##  9 01/05/… -1          7847 19.1       77.9      0.842       374.         15.0  
## 10 01/05/… 1          18973 21.6       67.7      2.03        539.         15.3  
## # … with 696 more rows, and 5 more variables: solar_radiation <dbl>,
## #   rainfall <dbl>, snowfall <dbl>, seasons <dbl>, holiday <dbl>
#Changing the factors back to original values 
bike_AMPM$holiday <- as.factor(bike_AMPM$holiday)
bike_AMPM$seasons <- as.factor(bike_AMPM$seasons)
levels(bike_AMPM$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_AMPM$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')

#Drop date column
bike_AMPM <- subset(bike_AMPM, select = -c(date))

head(bike_AMPM)
## # A tibble: 6 x 12
##   hour  bike_count   temp humidity wind_speed visibility dew_point_temp
##   <fct>      <int>  <dbl>    <dbl>      <dbl>      <dbl>          <dbl>
## 1 -1          1520 -3.57      44.5      0.992      1896.        -14.2  
## 2 1           2770  1         34.2      1.92       1894.        -13.6  
## 3 -1          1962 -6.51      53.9      1.31       1891.        -14.5  
## 4 1           3415 -1.23      34.1      1.91       1956.        -15.6  
## 5 -1          1361  1.63      85.8      2.82        778.         -0.767
## 6 1           3771 -0.733     42.7      4.29       1390.        -12.0  
## # … with 5 more variables: solar_radiation <dbl>, rainfall <dbl>,
## #   snowfall <dbl>, seasons <fct>, holiday <fct>

Below are the histograms of the continuous variables after transforming the data. The summed bike_count shows signs of a trimodal distribution and temperature appears bimodal compared to the original data. The remaining columns do not look very different from the original distributions.

#Creating histograms 
multi.hist(bike_AMPM[,sapply(bike_AMPM, is.numeric)])

Data splitting

Data is split into an 80% training set and a 20% test set to minimise overfitting. The split is made by random selection using the sample() function.

#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_AMPM))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_AMPM)), size = smp_size)

model3_prediction <- bike_AMPM[ -trainIndex,]
model3 <- bike_AMPM[ trainIndex,]
# regression model with all columns except functioning day
am_pm_model = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model3)

#summary
summary(am_pm_model)
## 
## Call:
## lm(formula = bike_count ~ hour + temp + humidity + wind_speed + 
##     visibility + dew_point_temp + solar_radiation + rainfall + 
##     snowfall + seasons + holiday, data = model3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7580.5 -1774.8  -110.7  1658.1  9173.9 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10803.0578  3438.4154   3.142  0.00177 ** 
## hour1            2603.6307   334.3744   7.787 3.44e-14 ***
## temp              -70.8103   132.5580  -0.534  0.59343    
## humidity          -88.9478    38.1577  -2.331  0.02011 *  
## wind_speed       -366.3750   183.0305  -2.002  0.04581 *  
## visibility         -0.1240     0.2969  -0.418  0.67626    
## dew_point_temp    281.2093   139.9159   2.010  0.04494 *  
## solar_radiation  6282.3114   566.6650  11.086  < 2e-16 ***
## rainfall        -1575.4707   233.7714  -6.739 4.03e-11 ***
## snowfall           -3.7379   319.7514  -0.012  0.99068    
## seasonsSpring   -2792.0417   375.2958  -7.440 3.91e-13 ***
## seasonsSummer   -2463.7911   459.3894  -5.363 1.21e-07 ***
## seasonsAutumn   -3864.3881   512.7259  -7.537 1.99e-13 ***
## holidayHoliday   1423.5079   561.4535   2.535  0.01151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2798 on 550 degrees of freedom
## Multiple R-squared:  0.8051, Adjusted R-squared:  0.8005 
## F-statistic: 174.8 on 13 and 550 DF,  p-value: < 2.2e-16
#Building ANOVA table with full dataset (all columns) against intercept only model
anova(am_pm_model) 
## Analysis of Variance Table
## 
## Response: bike_count
##                  Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## hour              1 5152074809 5152074809  658.1438 < 2.2e-16 ***
## temp              1 8164062729 8164062729 1042.9056 < 2.2e-16 ***
## humidity          1 1620661660 1620661660  207.0289 < 2.2e-16 ***
## wind_speed        1    2706774    2706774    0.3458   0.55676    
## visibility        1    4452068    4452068    0.5687   0.45109    
## dew_point_temp    1   25404155   25404155    3.2452   0.07218 .  
## solar_radiation   1 1445028910 1445028910  184.5930 < 2.2e-16 ***
## rainfall          1  474552468  474552468   60.6210 3.458e-14 ***
## snowfall          1      20816      20816    0.0027   0.95889    
## seasons           3  845211047  281737016   35.9901 < 2.2e-16 ***
## holiday           1   50321496   50321496    6.4282   0.01151 *  
## Residuals       550 4305504463    7828190                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# possible subsets regression
library(MASS)
stepReg=MASS::stepAIC(am_pm_model, direction="both")
## Start:  AIC=8966.33
## bike_count ~ hour + temp + humidity + wind_speed + visibility + 
##     dew_point_temp + solar_radiation + rainfall + snowfall + 
##     seasons + holiday
## 
##                   Df Sum of Sq        RSS    AIC
## - snowfall         1      1070 4305505533 8964.3
## - visibility       1   1366394 4306870856 8964.5
## - temp             1   2233786 4307738249 8964.6
## <none>                         4305504463 8966.3
## - wind_speed       1  31366505 4336870968 8968.4
## - dew_point_temp   1  31621783 4337126246 8968.5
## - humidity         1  42537159 4348041622 8969.9
## - holiday          1  50321496 4355825959 8970.9
## - rainfall         1 355548738 4661053200 9009.1
## - hour             1 474628752 4780133215 9023.3
## - seasons          3 845815569 5151320032 9061.5
## - solar_radiation  1 962160506 5267664969 9078.1
## 
## Step:  AIC=8964.33
## bike_count ~ hour + temp + humidity + wind_speed + visibility + 
##     dew_point_temp + solar_radiation + rainfall + seasons + holiday
## 
##                   Df Sum of Sq        RSS    AIC
## - visibility       1   1365404 4306870937 8962.5
## - temp             1   2260638 4307766171 8962.6
## <none>                         4305505533 8964.3
## + snowfall         1      1070 4305504463 8966.3
## - wind_speed       1  31365939 4336871471 8966.4
## - dew_point_temp   1  32132277 4337637810 8966.5
## - humidity         1  43793132 4349298665 8968.0
## - holiday          1  50335767 4355841299 8968.9
## - rainfall         1 357746412 4663251945 9007.3
## - hour             1 475162772 4780668305 9021.4
## - seasons          3 845887370 5151392902 9059.5
## - solar_radiation  1 964779517 5270285050 9076.4
## 
## Step:  AIC=8962.51
## bike_count ~ hour + temp + humidity + wind_speed + dew_point_temp + 
##     solar_radiation + rainfall + seasons + holiday
## 
##                   Df Sum of Sq        RSS    AIC
## - temp             1   1951238 4308822174 8960.8
## <none>                         4306870937 8962.5
## + visibility       1   1365404 4305505533 8964.3
## + snowfall         1        80 4306870856 8964.5
## - dew_point_temp   1  31127896 4337998833 8964.6
## - wind_speed       1  34342550 4341213487 8965.0
## - humidity         1  42728660 4349599596 8966.1
## - holiday          1  50334035 4357204972 8967.1
## - rainfall         1 363503334 4670374271 9006.2
## - hour             1 494415635 4801286572 9021.8
## - seasons          3 873130692 5180001629 9060.6
## - solar_radiation  1 965276021 5272146957 9074.6
## 
## Step:  AIC=8960.77
## bike_count ~ hour + humidity + wind_speed + dew_point_temp + 
##     solar_radiation + rainfall + seasons + holiday
## 
##                   Df Sum of Sq        RSS    AIC
## <none>                         4308822174 8960.8
## + temp             1   1951238 4306870937 8962.5
## + visibility       1   1056004 4307766171 8962.6
## + snowfall         1     18823 4308803352 8962.8
## - wind_speed       1  33797487 4342619662 8963.2
## - holiday          1  50067049 4358889224 8965.3
## - humidity         1 207643653 4516465827 8985.3
## - rainfall         1 397393139 4706215314 9008.5
## - hour             1 493236863 4802059037 9019.9
## - dew_point_temp   1 522431104 4831253278 9023.3
## - seasons          3 871530804 5180352978 9058.7
## - solar_radiation  1 987955361 5296777535 9075.2
stepReg$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## bike_count ~ hour + temp + humidity + wind_speed + visibility + 
##     dew_point_temp + solar_radiation + rainfall + snowfall + 
##     seasons + holiday
## 
## Final Model:
## bike_count ~ hour + humidity + wind_speed + dew_point_temp + 
##     solar_radiation + rainfall + seasons + holiday
## 
## 
##           Step Df    Deviance Resid. Df Resid. Dev      AIC
## 1                                   550 4305504463 8966.332
## 2   - snowfall  1    1069.786       551 4305505533 8964.332
## 3 - visibility  1 1365404.194       552 4306870937 8962.511
## 4       - temp  1 1951237.637       553 4308822174 8960.766
# multicollinearity
car::vif(am_pm_model)
##                       GVIF Df GVIF^(1/(2*Df))
## hour              2.013430  1        1.418954
## temp            183.172662  1       13.534130
## humidity         34.606618  1        5.882739
## wind_speed        1.581910  1        1.257740
## visibility        1.849322  1        1.359898
## dew_point_temp  247.930596  1       15.745812
## solar_radiation   4.019861  1        2.004959
## rainfall          1.430928  1        1.196214
## snowfall          1.148384  1        1.071627
## seasons           6.907353  3        1.380020
## holiday           1.035202  1        1.017449

The stepwise method produced a final regression model without temperature, snowfall and visibility. The GVIF values of dew_point_temp and temp are greater than 10, indicating multicollinearity. Therefore, temperature, visibility and snowfall will be removed from the final model.

# Removing columns with multicollinearity and stepwise
am_pm_model_final = lm(bike_count ~ hour + humidity + wind_speed + dew_point_temp + 
    solar_radiation + rainfall + seasons + holiday, data=model3)

#summary
summary(am_pm_model_final)
## 
## Call:
## lm(formula = bike_count ~ hour + humidity + wind_speed + dew_point_temp + 
##     solar_radiation + rainfall + seasons + holiday, data = model3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7597.7 -1775.7   -83.3  1634.7  9298.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8820.03    1054.90   8.361 5.07e-16 ***
## hour1            2607.90     327.78   7.956 1.01e-14 ***
## humidity          -68.07      13.19  -5.162 3.41e-07 ***
## wind_speed       -375.22     180.16  -2.083   0.0377 *  
## dew_point_temp    207.75      25.37   8.188 1.84e-15 ***
## solar_radiation  6226.55     552.96  11.260  < 2e-16 ***
## rainfall        -1609.85     225.42  -7.142 2.92e-12 ***
## seasonsSpring   -2752.29     354.07  -7.773 3.76e-14 ***
## seasonsSummer   -2463.34     458.11  -5.377 1.12e-07 ***
## seasonsAutumn   -3794.97     497.44  -7.629 1.04e-13 ***
## holidayHoliday   1419.42     559.95   2.535   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2791 on 553 degrees of freedom
## Multiple R-squared:  0.8049, Adjusted R-squared:  0.8014 
## F-statistic: 228.2 on 10 and 553 DF,  p-value: < 2.2e-16

The final model explains 80.49% of the variance and is statistically significant, as the p-value of the model is less than 0.05. None of the variables' p-values exceed 0.05, so every predictor is statistically significant.
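
As an additional check, the held-out 20% split created above (model3_prediction) can be scored with predict(); a minimal sketch of an out-of-sample error estimate:

pred_am_pm <- predict(am_pm_model_final, newdata = model3_prediction)
sqrt(mean((model3_prediction$bike_count - pred_am_pm)^2))  # out-of-sample RMSE for am/pm totals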

# Checking multicollinearity in the final Model
car::vif(am_pm_model_final)
##                     GVIF Df GVIF^(1/(2*Df))
## hour            1.943833  1        1.394214
## humidity        4.151456  1        2.037512
## wind_speed      1.539901  1        1.240928
## dew_point_temp  8.190588  1        2.861920
## solar_radiation 3.845713  1        1.961049
## rainfall        1.336744  1        1.156176
## seasons         5.829589  3        1.341548
## holiday         1.034496  1        1.017102

Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the final Model.
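
One note on reading this table: for terms with more than one degree of freedom (seasons here), a common convention is to compare the GVIF^(1/(2*Df)) column against the square root of the chosen VIF cutoff rather than the cutoff itself; with a cutoff of 10 that threshold is:

sqrt(10)  # threshold for the GVIF^(1/(2*Df)) column when using a VIF cutoff of 10
## [1] 3.162278

All values in that column are well below this, so the conclusion is unchanged.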

par(mfrow=c(2,2))
plot(am_pm_model_final , which=1:4)

# Evaluate homoscedasticity
# non-constant error variance test
library(car)

ncvTest(am_pm_model_final)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 72.38561, Df = 1, p = < 2.22e-16

P-value is less than 0.05, implying that the constant error variance assumption is violated.
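
A common alternative remedy for heteroscedasticity, not used in this report, is to keep the model but report heteroscedasticity-robust (sandwich) standard errors; a minimal sketch, assuming the lmtest and sandwich packages are installed:

library(lmtest)
library(sandwich)
coeftest(am_pm_model_final, vcov. = vcovHC(am_pm_model_final, type = "HC3"))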

# Test for Autocorrelated Errors
durbinWatsonTest(am_pm_model_final)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04232994      2.082806   0.336
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

shapiro.test(am_pm_model_final$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  am_pm_model_final$residuals
## W = 0.99099, p-value = 0.001627

P-value is less than 0.05; therefore, it can be assumed that the residuals are not normally distributed for this model. A Box-Cox transformation is therefore performed on this model as well.

Box-Cox transformation

bike_AMPM_BC <- model3[-c(185), ]

for (i in colnames(bike_AMPM_BC)){
  if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
    bike_AMPM_BC[i] <- round(bike_AMPM_BC[i] + (abs(min(bike_AMPM[i])))+1 , digits=3) # adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(
  bike_AMPM_BC$bike_count ~ 
    bike_AMPM_BC$humidity + 
    bike_AMPM_BC$dew_point_temp + 
    bike_AMPM_BC$solar_radiation + 
    bike_AMPM_BC$rainfall
  )

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = data^lambda_val
  return(data)
}

for (i in colnames(bike_AMPM_BC)){
  if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
    bike_AMPM_BC[i] <- transform_boxcox(bike_AMPM_BC[i])
  }
}

Final Model 3

# fitting new model
bike_AMPM_BC_fit = lm(bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour + 
    bike_AMPM_BC$humidity + 
    bike_AMPM_BC$wind_speed + 
    bike_AMPM_BC$dew_point_temp + 
    bike_AMPM_BC$solar_radiation + 
    bike_AMPM_BC$rainfall + 
    bike_AMPM_BC$seasons +
    bike_AMPM_BC$holiday)
  
summary(bike_AMPM_BC_fit)
## 
## Call:
## lm(formula = bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour + bike_AMPM_BC$humidity + 
##     bike_AMPM_BC$wind_speed + bike_AMPM_BC$dew_point_temp + bike_AMPM_BC$solar_radiation + 
##     bike_AMPM_BC$rainfall + bike_AMPM_BC$seasons + bike_AMPM_BC$holiday)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9794834 -2027571  -338710  1946797 13767376 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1745897.9  1163242.2  -1.501 0.133956    
## bike_AMPM_BC$hour1            2762832.5   395822.2   6.980 8.50e-12 ***
## bike_AMPM_BC$humidity           -1043.3      454.6  -2.295 0.022101 *  
## bike_AMPM_BC$wind_speed       -157128.9    58269.7  -2.697 0.007219 ** 
## bike_AMPM_BC$dew_point_temp      7682.6     1615.6   4.755 2.53e-06 ***
## bike_AMPM_BC$solar_radiation  3305837.4   259130.7  12.757  < 2e-16 ***
## bike_AMPM_BC$rainfall         -368076.0   105967.4  -3.473 0.000554 ***
## bike_AMPM_BC$seasonsSpring   -2828765.0   442537.9  -6.392 3.48e-10 ***
## bike_AMPM_BC$seasonsSummer   -2370556.0   611605.0  -3.876 0.000119 ***
## bike_AMPM_BC$seasonsAutumn   -3317203.4   579843.7  -5.721 1.74e-08 ***
## bike_AMPM_BC$holidayHoliday    839983.5   693100.6   1.212 0.226061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3464000 on 552 degrees of freedom
## Multiple R-squared:  0.7179, Adjusted R-squared:  0.7128 
## F-statistic: 140.5 on 10 and 552 DF,  p-value: < 2.2e-16

After transforming the data, the model explains 71.79% of the variance and is statistically significant, as the p-value of the model is less than 0.05. All predictors are statistically significant except holiday, whose p-value (0.226) is greater than 0.05.

# Checking multicollinearity in the final Model
car::vif(bike_AMPM_BC_fit)
##                                  GVIF Df GVIF^(1/(2*Df))
## bike_AMPM_BC$hour            1.837746  1        1.355635
## bike_AMPM_BC$humidity        3.497117  1        1.870058
## bike_AMPM_BC$wind_speed      1.432754  1        1.196977
## bike_AMPM_BC$dew_point_temp  7.285334  1        2.699136
## bike_AMPM_BC$solar_radiation 3.337718  1        1.826942
## bike_AMPM_BC$rainfall        1.299225  1        1.139835
## bike_AMPM_BC$seasons         6.190769  3        1.355057
## bike_AMPM_BC$holiday         1.029240  1        1.014515

Since all the GVIF values are less than 10, it can be assumed that there is no multicollinearity for the transformed model.

par(mfrow=c(2,2))
plot(bike_AMPM_BC_fit , which=1:4)

# Evaluate homoscedasticity
# non-constant error variance test

ncvTest(bike_AMPM_BC_fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 161.5261, Df = 1, p = < 2.22e-16

P-value is still less than 0.05, implying that the constant error variance assumption is violated even after transformation.

# Test for Autocorrelated Errors
durbinWatsonTest(bike_AMPM_BC_fit)
##  lag Autocorrelation D-W Statistic p-value
##    1    6.005578e-05      1.998875   0.926
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

shapiro.test(bike_AMPM_BC_fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_AMPM_BC_fit$residuals
## W = 0.96879, p-value = 1.42e-09

P-value is less than 0.05 therefore, it can be assumed that the residuals are not normally distributed for this model. Therefore, this model cannot adequately predict the bike demand.

4) Model 4

Model 4 is a multiple linear regression model predicting bike rental demand for every day, with an added weekend column, using daily average weather conditions. We expect bike demand to be higher during weekends, as more people hire bikes for recreational purposes, so this should be a useful factor.

#Creating weekend column
bike_by_day$date <- as.Date(bike_by_day$date, "%d/%m/%y")
bike_by_day$weekend = chron::is.weekend(bike_by_day$date)
bike_by_day$weekend <- as.factor(bike_by_day$weekend)
bike_by_day
## # A tibble: 353 x 13
##    date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##    <date>              <int>     <dbl>         <dbl>           <dbl>
##  1 2020-01-01           4290     -1.28          39.3           1.45 
##  2 2020-02-01           5377     -3.87          44             1.61 
##  3 2020-03-01           5132      0.45          64.2           3.55 
##  4 2020-04-01          17388     15.2           68.9           1.57 
##  5 2020-05-01          26820     20.3           72.8           1.44 
##  6 2020-06-01          31928     23.7           50.1           1.95 
##  7 2020-07-01           3231     22.0           93.5           0.783
##  8 2020-08-01          20712     33.4           55.8           1.68 
##  9 2020-09-01          26010     25.5           61.2           1.21 
## 10 2020-10-01          27909     15.4           54.2           2.82 
## # … with 343 more rows, and 8 more variables: mean_visibility <dbl>,
## #   mean_dew_point_temp <dbl>, mean_solar_radiation <dbl>, mean_rainfall <dbl>,
## #   mean_snowfall <dbl>, seasons <fct>, holiday <fct>, weekend <fct>

As expected, Fig 5. shows that bike demand is slightly higher during the weekends.

# Bar plot for weekend and weekday bike demand
fig5 <- ggplot(bike_by_day, aes(x = as.factor(weekend), y = sum_bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 5. Weekend and Weekday Bike Count Bar Plot") +  xlab("Weekend") + ylab("Bike Count")
fig5 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
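
The visual impression from Fig 5 can also be checked numerically; a quick sketch using the same bike_by_day frame:

# Mean and median daily counts by weekend flag
bike_by_day %>%
  group_by(weekend) %>%
  summarise(mean_daily_count = mean(sum_bike_count),
            median_daily_count = median(sum_bike_count))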

Data splitting

Data is split into a 90% training set and a 10% test set, instead of the usual 80-20, because this is a smaller dataset. The split is made by random selection using the sample() function.

#Randomise our dataset

## 90% of the sample size
smp_size <- floor(0.90 * nrow(bike_by_day))

## set the seed to make your partition reproducible
set.seed(500)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)

model4_prediction <- bike_by_day[ -trainIndex,]
model4 <- bike_by_day[ trainIndex,]
# regression model by day with Weekend
bike_weekend_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday+ weekend, data=model4)

#summary
summary(bike_weekend_model)
## 
## Call:
## lm(formula = sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday + weekend, 
##     data = model4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12217.1  -2425.6    119.3   2364.1  11546.1 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          29313.3932  9717.1960   3.017 0.002772 ** 
## mean_temp             -441.9317   368.9779  -1.198 0.231963    
## mean_humidity         -202.6930   108.6237  -1.866 0.063006 .  
## mean_wind_speed      -1414.7777   466.2044  -3.035 0.002617 ** 
## mean_visibility         -0.2196     0.6901  -0.318 0.750541    
## mean_dew_point_temp    795.2753   390.2534   2.038 0.042434 *  
## mean_solar_radiation 12763.3986  1369.6733   9.319  < 2e-16 ***
## mean_rainfall        -3548.4095   632.1151  -5.614 4.48e-08 ***
## mean_snowfall         -690.1610   666.3953  -1.036 0.301185    
## seasonsSpring        -5204.0355   768.5960  -6.771 6.64e-11 ***
## seasonsSummer        -3615.2394   926.7999  -3.901 0.000118 ***
## seasonsAutumn        -8336.2703  1054.0449  -7.909 4.89e-14 ***
## holidayHoliday        2690.5144  1091.4530   2.465 0.014253 *  
## weekendTRUE           1292.1261   531.6491   2.430 0.015662 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4164 on 303 degrees of freedom
## Multiple R-squared:  0.8316, Adjusted R-squared:  0.8244 
## F-statistic: 115.1 on 13 and 303 DF,  p-value: < 2.2e-16
# possible subsets regression (by day and weekend)
library(MASS)
stepReg=MASS::stepAIC(bike_weekend_model, direction="both")
## Start:  AIC=5297.59
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday + weekend
## 
##                        Df  Sum of Sq        RSS    AIC
## - mean_visibility       1    1755800 5255588899 5295.7
## - mean_snowfall         1   18598184 5272431282 5296.7
## - mean_temp             1   24873862 5278706961 5297.1
## <none>                               5253833098 5297.6
## - mean_humidity         1   60375619 5314208717 5299.2
## - mean_dew_point_temp   1   72007074 5325840172 5299.9
## - weekend               1  102421987 5356255086 5301.7
## - holiday               1  105364470 5359197569 5301.9
## - mean_wind_speed       1  159682489 5413515587 5305.1
## - mean_rainfall         1  546397538 5800230637 5327.0
## - mean_solar_radiation  1 1505678886 6759511984 5375.5
## - seasons               3 1651376972 6905210070 5378.2
## 
## Step:  AIC=5295.7
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
##     mean_snowfall + seasons + holiday + weekend
## 
##                        Df  Sum of Sq        RSS    AIC
## - mean_snowfall         1   19052170 5274641068 5294.8
## - mean_temp             1   23519634 5279108533 5295.1
## <none>                               5255588899 5295.7
## - mean_humidity         1   60164675 5315753574 5297.3
## + mean_visibility       1    1755800 5253833098 5297.6
## - mean_dew_point_temp   1   70253609 5325842508 5297.9
## - weekend               1  100675609 5356264508 5299.7
## - holiday               1  104629406 5360218305 5299.9
## - mean_wind_speed       1  178897927 5434486826 5304.3
## - mean_rainfall         1  578004202 5833593101 5326.8
## - mean_solar_radiation  1 1524264427 6779853325 5374.4
## - seasons               3 1752780932 7008369831 5380.9
## 
## Step:  AIC=5294.85
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_dew_point_temp + mean_solar_radiation + mean_rainfall + 
##     seasons + holiday + weekend
## 
##                        Df  Sum of Sq        RSS    AIC
## - mean_temp             1   29087942 5303729010 5294.6
## <none>                               5274641068 5294.8
## + mean_snowfall         1   19052170 5255588899 5295.7
## + mean_visibility       1    2209786 5272431282 5296.7
## - mean_humidity         1   76552577 5351193646 5297.4
## - mean_dew_point_temp   1   82548216 5357189284 5297.8
## - holiday               1  100763194 5375404263 5298.8
## - weekend               1  100875940 5375517008 5298.9
## - mean_wind_speed       1  177510423 5452151492 5303.3
## - mean_rainfall         1  561169565 5835810633 5324.9
## - mean_solar_radiation  1 1512359155 6787000224 5372.8
## - seasons               3 1771964675 7046605743 5380.7
## 
## Step:  AIC=5294.59
## sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp + 
##     mean_solar_radiation + mean_rainfall + seasons + holiday + 
##     weekend
## 
##                        Df  Sum of Sq        RSS    AIC
## <none>                               5303729010 5294.6
## + mean_temp             1   29087942 5274641068 5294.8
## + mean_snowfall         1   24620477 5279108533 5295.1
## + mean_visibility       1     542453 5303186557 5296.6
## - holiday               1  101874317 5405603327 5298.6
## - weekend               1  102027671 5405756681 5298.6
## - mean_humidity         1  128646104 5432375115 5300.2
## - mean_wind_speed       1  165044922 5468773932 5302.3
## - mean_rainfall         1  687009406 5990738416 5331.2
## - mean_dew_point_temp   1  694143519 5997872529 5331.6
## - mean_solar_radiation  1 1485298995 6789028005 5370.9
## - seasons               3 1750784334 7054513345 5379.0
stepReg$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
##     mean_visibility + mean_dew_point_temp + mean_solar_radiation + 
##     mean_rainfall + mean_snowfall + seasons + holiday + weekend
## 
## Final Model:
## sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp + 
##     mean_solar_radiation + mean_rainfall + seasons + holiday + 
##     weekend
## 
## 
##                Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                     303 5253833098 5297.593
## 2 - mean_visibility  1  1755800       304 5255588899 5295.699
## 3   - mean_snowfall  1 19052170       305 5274641068 5294.846
## 4       - mean_temp  1 29087942       306 5303729010 5294.589
# multicollinearity
car::vif(bike_weekend_model)
##                            GVIF Df GVIF^(1/(2*Df))
## mean_temp            343.949542  1       18.545877
## mean_humidity         47.599843  1        6.899264
## mean_wind_speed        1.337044  1        1.156306
## mean_visibility        2.145153  1        1.464634
## mean_dew_point_temp  470.144400  1       21.682813
## mean_solar_radiation   3.486644  1        1.867256
## mean_rainfall          1.746116  1        1.321407
## mean_snowfall          1.207067  1        1.098666
## seasons                7.971927  3        1.413385
## holiday                1.043767  1        1.021649
## weekend                1.036276  1        1.017976

The GVIF values of mean_temp and mean_dew_point_temp are greater than 10, indicating multicollinearity, so mean_temp will be removed from the final model. The final model from stepwise AIC also excluded mean_visibility and mean_snowfall, so these will be removed from our final weekend model as well.

# Final regression model by day with Weekend
bike_weekend_model_final = lm(sum_bike_count ~ mean_dew_point_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=model4)

#summary
summary(bike_weekend_model_final)
## 
## Call:
## lm(formula = sum_bike_count ~ mean_dew_point_temp + mean_humidity + 
##     mean_wind_speed + mean_solar_radiation + mean_rainfall + 
##     seasons + holiday + weekend, data = model4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12506.6  -2441.1    113.6   2504.3  11762.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          18603.28    2514.06   7.400 1.33e-12 ***
## mean_dew_point_temp    343.11      54.22   6.328 8.79e-10 ***
## mean_humidity          -88.29      32.41  -2.724 0.006813 ** 
## mean_wind_speed      -1386.87     449.43  -3.086 0.002215 ** 
## mean_solar_radiation 12428.26    1342.56   9.257  < 2e-16 ***
## mean_rainfall        -3729.03     592.30  -6.296 1.06e-09 ***
## seasonsSpring        -5131.80     718.68  -7.141 6.80e-12 ***
## seasonsSummer        -3630.47     924.20  -3.928 0.000106 ***
## seasonsAutumn        -8200.36    1022.85  -8.017 2.31e-14 ***
## holidayHoliday        2641.38    1089.50   2.424 0.015913 *  
## weekendTRUE           1276.78     526.25   2.426 0.015835 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4163 on 306 degrees of freedom
## Multiple R-squared:   0.83,  Adjusted R-squared:  0.8244 
## F-statistic: 149.4 on 10 and 306 DF,  p-value: < 2.2e-16
# multicollinearity
car::vif(bike_weekend_model_final)
##                          GVIF Df GVIF^(1/(2*Df))
## mean_dew_point_temp  9.077950  1        3.012964
## mean_humidity        4.238293  1        2.058711
## mean_wind_speed      1.243064  1        1.114928
## mean_solar_radiation 3.351304  1        1.830657
## mean_rainfall        1.533711  1        1.238431
## seasons              6.509081  3        1.366428
## holiday              1.040455  1        1.020027
## weekend              1.015724  1        1.007831

The final model with weekend explains 83% of the variance (adjusted R² = 82.44%) and is statistically significant at the 95% confidence level (p-value < 0.05). There is no multicollinearity in the final model.
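
As an additional check before transforming, the untransformed model can be scored on the held-out 10% split (model4_prediction, which is still on its original scale at this point); a minimal sketch:

pred_day <- predict(bike_weekend_model_final, newdata = model4_prediction)
sqrt(mean((model4_prediction$sum_bike_count - pred_day)^2))  # out-of-sample RMSE in daily bike counts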

par(mfrow=c(2,2))
plot(bike_weekend_model_final , which=1:4)

There are no outliers in the model’s residuals according to the Cook’s distance plot.

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_weekend_model_final)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 35.62784, Df = 1, p = 2.3885e-09

P-value is less than 0.05, implying that the constant error variance assumption is violated.

# Test for Autocorrelated Errors
durbinWatsonTest(bike_weekend_model_final)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.1132151      2.224318   0.056
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

# Test for Normally Distributed Errors
shapiro.test(bike_weekend_model_final$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_weekend_model_final$residuals
## W = 0.99468, p-value = 0.3409

The p-value of the normality test is greater than 0.05; therefore, the residuals are normally distributed. However, as the model failed the homoscedasticity test, it will be transformed using the Box-Cox transformation method.

Box-Cox transformation

bike_by_dayBC <- model4

for (i in colnames(bike_by_dayBC)){
  if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
    bike_by_dayBC[i] <- bike_by_dayBC[i] + (abs(min(bike_by_day[i])))+1 #adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(
  bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall
  )

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}

for (i in colnames(bike_by_dayBC)){
  if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
    bike_by_dayBC[i] <- transform_boxcox(bike_by_dayBC[i])
  }
}

# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall +
    bike_by_dayBC$seasons +
    bike_by_dayBC$holiday +
    bike_by_dayBC$weekend)
  
summary(bike_by_dayBC_fit)
## 
## Call:
## lm(formula = bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
##     bike_by_dayBC$mean_wind_speed + bike_by_dayBC$mean_dew_point_temp + 
##     bike_by_dayBC$mean_solar_radiation + bike_by_dayBC$mean_rainfall + 
##     bike_by_dayBC$seasons + bike_by_dayBC$holiday + bike_by_dayBC$weekend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1709 -0.5515  0.0273  0.5290  2.3787 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         19.4886     1.4987  13.004  < 2e-16 ***
## bike_by_dayBC$mean_humidity         -0.8255     0.2858  -2.889 0.004146 ** 
## bike_by_dayBC$mean_wind_speed       -1.0416     0.2930  -3.555 0.000438 ***
## bike_by_dayBC$mean_dew_point_temp    1.0738     0.1442   7.449 9.67e-13 ***
## bike_by_dayBC$mean_solar_radiation   3.4083     0.4194   8.128 1.10e-14 ***
## bike_by_dayBC$mean_rainfall         -1.9489     0.2287  -8.520 7.35e-16 ***
## bike_by_dayBC$seasonsSpring         -1.0351     0.1472  -7.030 1.35e-11 ***
## bike_by_dayBC$seasonsSummer         -0.4055     0.1672  -2.426 0.015849 *  
## bike_by_dayBC$seasonsAutumn         -2.4737     0.2051 -12.061  < 2e-16 ***
## bike_by_dayBC$holidayHoliday         0.8703     0.2269   3.835 0.000152 ***
## bike_by_dayBC$weekendTRUE            0.2655     0.1098   2.419 0.016153 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8644 on 306 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.863 
## F-statistic:   200 on 10 and 306 DF,  p-value: < 2.2e-16
# multicollinearity
car::vif(bike_by_dayBC_fit)
##                                        GVIF Df GVIF^(1/(2*Df))
## bike_by_dayBC$mean_humidity        3.607174  1        1.899256
## bike_by_dayBC$mean_wind_speed      1.333563  1        1.154800
## bike_by_dayBC$mean_dew_point_temp  5.928370  1        2.434824
## bike_by_dayBC$mean_solar_radiation 3.391377  1        1.841569
## bike_by_dayBC$mean_rainfall        1.693317  1        1.301275
## bike_by_dayBC$seasons              4.499259  3        1.284863
## bike_by_dayBC$holiday              1.046936  1        1.023199
## bike_by_dayBC$weekend              1.025189  1        1.012516

The transformed model with weekend explains 86.73% of the variance (adjusted R² = 86.3%) and is statistically significant at the 95% confidence level (p-value < 0.05). There is no multicollinearity in this model.

par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)

Cook’s distance shows that there are some outliers (observations 152, 99 and 39) and therefore, they will be removed from the dataset before fitting the model again.
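
The flagged observations can also be pulled out programmatically rather than read off the plot; for example:

cd <- cooks.distance(bike_by_dayBC_fit)
head(sort(cd, decreasing = TRUE), 3)  # the three most influential observations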

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 6.343005, Df = 1, p = 0.011785

P-value is less than 0.05, implying that the constant error variance assumption is violated if we do not remove the outliers. Therefore, these will be removed and then the model will be tested again.

Final Model 4

bike_by_dayBC <- bike_by_dayBC[-c(152,99,39),]

# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall +
    bike_by_dayBC$seasons +
    bike_by_dayBC$holiday +
    bike_by_dayBC$weekend)
  
summary(bike_by_dayBC_fit)
## 
## Call:
## lm(formula = bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
##     bike_by_dayBC$mean_wind_speed + bike_by_dayBC$mean_dew_point_temp + 
##     bike_by_dayBC$mean_solar_radiation + bike_by_dayBC$mean_rainfall + 
##     bike_by_dayBC$seasons + bike_by_dayBC$holiday + bike_by_dayBC$weekend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.80853 -0.54398  0.02451  0.50972  1.71834 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         19.6022     1.4305  13.703  < 2e-16 ***
## bike_by_dayBC$mean_humidity         -1.0221     0.2780  -3.676 0.000280 ***
## bike_by_dayBC$mean_wind_speed       -0.9514     0.2799  -3.399 0.000766 ***
## bike_by_dayBC$mean_dew_point_temp    1.2492     0.1493   8.366 2.21e-15 ***
## bike_by_dayBC$mean_solar_radiation   3.3637     0.4012   8.383 1.96e-15 ***
## bike_by_dayBC$mean_rainfall         -1.7471     0.2222  -7.863 6.63e-14 ***
## bike_by_dayBC$seasonsSpring         -1.1297     0.1421  -7.949 3.76e-14 ***
## bike_by_dayBC$seasonsSummer         -0.5501     0.1618  -3.400 0.000765 ***
## bike_by_dayBC$seasonsAutumn         -2.3815     0.2006 -11.871  < 2e-16 ***
## bike_by_dayBC$holidayHoliday         1.0753     0.2242   4.797 2.54e-06 ***
## bike_by_dayBC$weekendTRUE            0.3408     0.1055   3.229 0.001377 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8242 on 303 degrees of freedom
## Multiple R-squared:  0.8769, Adjusted R-squared:  0.8728 
## F-statistic: 215.8 on 10 and 303 DF,  p-value: < 2.2e-16

The final model with weekend explains 87.69% of the variance and is statistically significant at the 95% confidence level (p-value < 0.05). There is no multicollinearity in the final model. This is by far the model with the highest explanatory power.

# multicollinearity
car::vif(bike_by_dayBC_fit)
##                                        GVIF Df GVIF^(1/(2*Df))
## bike_by_dayBC$mean_humidity        3.644442  1        1.909042
## bike_by_dayBC$mean_wind_speed      1.308730  1        1.143998
## bike_by_dayBC$mean_dew_point_temp  6.386343  1        2.527121
## bike_by_dayBC$mean_solar_radiation 3.373157  1        1.836616
## bike_by_dayBC$mean_rainfall        1.676241  1        1.294697
## bike_by_dayBC$seasons              4.820642  3        1.299723
## bike_by_dayBC$holiday              1.056871  1        1.028042
## bike_by_dayBC$weekend              1.031407  1        1.015582
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.58038, Df = 1, p = 0.1082

After removing outliers, the P-value of non-constant error variance test shows that it is greater than 0.05, implying that the constant error variance assumption is not violated.

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_dayBC_fit)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04184489       2.07716   0.522
##  Alternative hypothesis: rho != 0

P-value is greater than 0.05 and therefore, we cannot reject H0. The uncorrelated error assumption is not violated.

# Test for Normally Distributed Errors
shapiro.test(bike_by_dayBC_fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_by_dayBC_fit$residuals
## W = 0.9897, p-value = 0.02599

P-value of the normality test is less than 0.05 and therefore residuals are not normally distributed.

Final model selection

To select the final model, we will use criteria such as each model’s fit (R²), its residual analysis, and the results of its adequacy tests.

Model 1 (model1_BC_fit)

  1. It has a multiple R-squared of 71.55%, which can be considered a good fit.
  2. It passes the multicollinearity VIF test.
  3. In the residual plots, the residuals vs. fitted plot shows a slight curve, which indicates a possible non-linear relationship.
  4. The QQ plot looks good.
  5. In the Scale-Location plot, the residuals are not spread equally along the range of the predictors, which suggests non-constant variance, and the ncvTest confirms this.
  6. The Cook’s distance plot looks good.
  7. The Durbin-Watson test shows no autocorrelation in the model.
  8. The Shapiro-Wilk test could not be performed because the training set exceeds the test’s sample size limit (it returned an error in the code).

Due to non-constant variance, we reject this model.

Model 2 (bike_by_day_BC_fit)

  1. It has a multiple R-squared of 86%, which can be considered a good fit.
  2. It passes the multicollinearity VIF test.
  3. In the residual plots, the residuals vs. fitted plot shows a straight line, which indicates no non-linear relationship.
  4. The QQ plot looks good.
  5. The Scale-Location plot is hard to read, but the ncvTest shows that the model fails the constant variance test.
  6. The Cook’s distance plot looks good.
  7. The Durbin-Watson test shows no autocorrelation in the model.
  8. The Shapiro-Wilk test shows that the residuals are normally distributed.

Due to non-constant variance, we reject this model, as this undermines confidence in its predictions.

Model 3 (bike_AMPM_BC_fit)

  1. It has a multiple R-squared of 71.79%, which can be considered a good fit.
  2. It passes the multicollinearity VIF test.
  3. In the residual plots, the residuals vs. fitted plot shows a slight curve, which indicates a possible non-linear relationship.
  4. The QQ plot shows that from quantile 2 onwards the residuals depart from the dotted line, suggesting the model may fail the normality test.
  5. In the Scale-Location plot, the residuals are not spread equally along the range of the predictors, which suggests non-constant variance, and the ncvTest confirms this.
  6. The Cook’s distance plot looks good.
  7. The Durbin-Watson test shows no autocorrelation in the model.
  8. The Shapiro-Wilk test confirms that the residuals are not normally distributed, as suggested by the QQ plot.

Due to non-constant variance and non-normally distributed residuals, we reject this model, as these undermine confidence in its predictions.

Model 4 (bike_by_dayBC_fit)

  1. It has a multiple R-squared of 87.69%, which can be considered a good fit.
  2. It passes the multicollinearity VIF test.
  3. In the residual plots, the residuals vs. fitted plot shows a straight line, which indicates no non-linear relationship.
  4. The QQ plot looks good.
  5. In the Scale-Location plot, the residuals are spread fairly equally along the range of the predictors, suggesting constant variance, and the ncvTest confirms that the model has constant variance.
  6. The Cook’s distance plot looks good.
  7. The Durbin-Watson test shows no autocorrelation in the model.
  8. The Shapiro-Wilk test indicates that the residuals are not normally distributed.

Model 4’s normality test p-value is less than 0.05; therefore, its residuals are not normally distributed. However, as the dataset is not too small, the p-value is only slightly below 0.05, and the model passes the rest of the tests, we conclude that Model 4 can adequately predict the bike demand and is by far the best model.

For these reasons, we select Model 4 as our best model and will predict the data using it.
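
For quick reference, the adequacy results listed above can be collected into a small data frame (the values are simply copied from the model summaries; NA marks the normality test that could not be run for Model 1):

model_summary <- data.frame(
  model              = c("Model 1 (model1_BC_fit)", "Model 2 (bike_by_day_BC_fit)",
                         "Model 3 (bike_AMPM_BC_fit)", "Model 4 (bike_by_dayBC_fit)"),
  r_squared          = c(0.7155, 0.8600, 0.7179, 0.8769),
  constant_variance  = c(FALSE, FALSE, FALSE, TRUE),
  uncorrelated_error = c(TRUE, TRUE, TRUE, TRUE),
  normal_residuals   = c(NA, TRUE, FALSE, FALSE)
)
model_summary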

Prediction

We have selected Model 4 as our best model, and to predict values with it we will use Model 4’s held-out prediction dataset, model4_prediction.

# Transform each numeric column to positive values by adding the absolute value of that column's minimum (from bike_by_day) plus 1, so the Box-Cox transformation can be applied

for (i in colnames(model4_prediction)){
  if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
    model4_prediction[i] <- model4_prediction[i] + (abs(min(bike_by_day[i])))+1
  }
}

# Box-Cox transformation using lambda
transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}

for (i in colnames(model4_prediction)){
  if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
    model4_prediction[i] <- transform_boxcox(model4_prediction[i])
  }
}


head(model4_prediction)
## # A tibble: 6 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-07-01           15.0      4.63          6.56           0.948
## 2 2020-12-01           17.8      3.07          5.63           1.25 
## 3 2020-04-02           20.4      4.46          6.07           1.41 
## 4 2020-08-02           20.3      5.07          5.81           1.25 
## 5 2020-09-04           21.5      4.68          6.19           1.47 
## 6 2020-12-04           17.5      3.28          5.79           1.82 
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>

Model 4’s equation

lm(sum_bike_count ~ mean_humidity + mean_wind_speed + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=bike_by_dayBC)

y = 19.4886 - 0.8255 × mean_humidity - 1.0416 × mean_wind_speed + 1.0738 × mean_dew_point_temp + 3.4083 × mean_solar_radiation - 1.9489 × mean_rainfall - 1.0351 × seasonsSpring - 0.4055 × seasonsSummer - 2.4737 × seasonsAutumn + 0.8703 × holidayHoliday + 0.2655 × weekendTRUE + ε

From the Box-Cox transformation we get a lambda value of 0.128451. This is the value used above to transform the data in model4_prediction.

# creating function to predict the bike counts
pred_bike_count <- function(mean_humidity, mean_wind_speed, mean_dew_point_temp,  mean_solar_radiation, mean_rainfall, seasons, holiday, weekend){
  seasonsSummer = 0
  seasonsAutumn = 0
  seasonsSpring = 0
  holidayHoliday = 0
  weekendTRUE = 0
  
  # Assigning the value 1 to the matching dummy variable (Winter is the reference level, so all stay 0)
  if(seasons == "Summer"){
    seasonsSummer = 1
  } else if(seasons == "Autumn"){
    seasonsAutumn = 1
  } else if(seasons == "Spring"){
    seasonsSpring = 1
  }
  
  if(holiday == "Holiday"){
    holidayHoliday = 1
  }
  
  if(weekend == "TRUE"){
    weekendTRUE = 1
  }
  
  # fitted eqN of model 4
  y =  19.4886 - 0.8255 * mean_humidity - 1.0416 * mean_wind_speed + 1.0738 * mean_dew_point_temp + 3.4083 * mean_solar_radiation - 1.9489 * mean_rainfall - 1.0351 * seasonsSpring - 0.4055 * seasonsSummer -2.4737 * seasonsAutumn + 0.8703 * holidayHoliday + 0.2655 * weekendTRUE
  
  return(as.numeric(y))
}

Predicting values

model4_prediction[1,] # first row of model 4 prediction
## # A tibble: 1 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-07-01           15.0      4.63          6.56           0.948
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>
prediction1 <- pred_bike_count(
  model4_prediction[1,"mean_humidity"],
  model4_prediction[1,"mean_wind_speed"],
  model4_prediction[1,"mean_dew_point_temp"],
  model4_prediction[1,"mean_solar_radiation"],
  model4_prediction[1,"mean_rainfall"],
  model4_prediction[1,"seasons"],
  model4_prediction[1,"holiday"],
  model4_prediction[1,"weekend"]
  )

prediction1
## [1] 16.20181
model4_prediction[2,] # second row of model 4 prediction
## # A tibble: 1 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-12-01           17.8      3.07          5.63            1.25
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>
prediction2 <- pred_bike_count(
  model4_prediction[2,"mean_humidity"],
  model4_prediction[2,"mean_wind_speed"],
  model4_prediction[2,"mean_dew_point_temp"],
  model4_prediction[2,"mean_solar_radiation"],
  model4_prediction[2,"mean_rainfall"],
  model4_prediction[2,"seasons"],
  model4_prediction[2,"holiday"],
  model4_prediction[2,"weekend"]
  )

prediction2
## [1] 16.27909
model4_prediction[3,] # third row of model 4 prediction
## # A tibble: 1 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-04-02           20.4      4.46          6.07            1.41
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>
prediction3 <- pred_bike_count(
  model4_prediction[3,"mean_humidity"],
  model4_prediction[3,"mean_wind_speed"],
  model4_prediction[3,"mean_dew_point_temp"],
  model4_prediction[3,"mean_solar_radiation"],
  model4_prediction[3,"mean_rainfall"],
  model4_prediction[3,"seasons"],
  model4_prediction[3,"holiday"],
  model4_prediction[3,"weekend"]
  )

prediction3
## [1] 19.60029
model4_prediction[4,] # fourth row of model 4 prediction
## # A tibble: 1 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-08-02           20.3      5.07          5.81            1.25
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>
prediction4 <- pred_bike_count(
  model4_prediction[4,"mean_humidity"],
  model4_prediction[4,"mean_wind_speed"],
  model4_prediction[4,"mean_dew_point_temp"],
  model4_prediction[4,"mean_solar_radiation"],
  model4_prediction[4,"mean_rainfall"],
  model4_prediction[4,"seasons"],
  model4_prediction[4,"holiday"],
  model4_prediction[4,"weekend"]
  )

prediction4
## [1] 21.44509
model4_prediction[5,] # fifth row of model 4 prediction
## # A tibble: 1 x 13
##   date       sum_bike_count mean_temp mean_humidity mean_wind_speed
##   <date>              <dbl>     <dbl>         <dbl>           <dbl>
## 1 2020-09-04           21.5      4.68          6.19            1.47
## # … with 8 more variables: mean_visibility <dbl>, mean_dew_point_temp <dbl>,
## #   mean_solar_radiation <dbl>, mean_rainfall <dbl>, mean_snowfall <dbl>,
## #   seasons <fct>, holiday <fct>, weekend <fct>
prediction5 <- pred_bike_count(
  model4_prediction[5,"mean_humidity"],
  model4_prediction[5,"mean_wind_speed"],
  model4_prediction[5,"mean_dew_point_temp"],
  model4_prediction[5,"mean_solar_radiation"],
  model4_prediction[5,"mean_rainfall"],
  model4_prediction[5,"seasons"],
  model4_prediction[5,"holiday"],
  model4_prediction[5,"weekend"]
  )

prediction5
## [1] 20.33717

Prediction table of daily bike demand

actual_bikecount <- c(model4_prediction[1,]$sum_bike_count, model4_prediction[2,]$sum_bike_count, model4_prediction[3,]$sum_bike_count, model4_prediction[4,]$sum_bike_count, model4_prediction[5,]$sum_bike_count)
predicted_bikecount <- c(prediction1, prediction2, prediction3, prediction4, prediction5)

df_prediction <- data.frame(actual_bikecount, predicted_bikecount)

df_prediction # final prediction table
##   actual_bikecount predicted_bikecount
## 1         14.95501            16.20181
## 2         17.79364            16.27909
## 3         20.42867            19.60029
## 4         20.28287            21.44509
## 5         21.54331            20.33717
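
The table above is on the Box-Cox scale. To read the predictions as daily bike counts, the transformation can be inverted; a minimal sketch, assuming the same lambda_val (≈ 0.128451) and the abs(min) + 1 shift applied to sum_bike_count earlier:

# Invert ((y^lambda) - 1) / lambda and remove the positivity shift
inv_boxcox <- function(z, lambda = lambda_val) (lambda * z + 1)^(1 / lambda)
shift <- abs(min(bike_by_day$sum_bike_count)) + 1

df_prediction$predicted_count <- round(inv_boxcox(df_prediction$predicted_bikecount) - shift)
df_prediction$actual_count    <- round(inv_boxcox(df_prediction$actual_bikecount) - shift)
df_prediction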

Discussion

Out of all the models, Model 4 is shown to be the most reliable and accurate at predicting bike demand, passing all the requirements except normality of residuals. As expected, ‘weekend’ plays a role in the prediction of bike demand, and Model 4, which predicts daily bike rental demand with the added ‘weekend’ feature, gives the best fit, explaining 87.69% of the variance. Regarding the normality assumption of regression, we consider that the dataset is not too small, the p-value is only slightly below 0.05, and the QQ plot shows that most of the residuals lie on the dotted line; since Model 4 passes the rest of the tests, we conclude that it can adequately predict the bike demand. It may also be valuable to try other transformation methods, such as a min-max transformation after the Box-Cox transformation, to see whether the models that failed the adequacy tests can be made adequate.
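
For reference, the min-max rescaling mentioned above is a simple linear mapping to the [0, 1] range; an illustrative helper (a sketch only, not used in any of the fitted models):

min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(2, 5, 11))
## [1] 0.0000000 0.3333333 1.0000000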

Conclusion

Even though regression models could not predict hourly bike demand as accurately as other machine learning and deep learning models, the accuracy for predicting daily bike demand is considerably strong. However, this model can only predict daily rather than hourly bike demand, which may be the more valuable quantity for bike hire companies, so other machine learning techniques may be better suited to their purposes. Nevertheless, predicting daily bike demand is still valuable, as it can help with activities such as scheduling bike servicing, seasonal bike stocking and profit optimisation. We did not test adding the weekend factor to the hourly bike demand prediction model; this may improve the accuracy and adequacy of that model and should be tested further. Recent studies have also found that other factors, such as the infrastructure of bike paths and the accessibility of stations, affect bike-sharing demand (El-Assi, Mahmoud, & Habib, 2017). If these factors are also accounted for in the model, we may be able to improve the prediction accuracy.

References:

Chen, T. (2003). A fuzzy back propagation network for output time prediction in a wafer fab. Applied Soft Computing, 2(3), 211–222. doi:10.1016/S1568-4946(02)00066-2

El-Assi, W., Mahmoud, M.S., & Habib, K.N. (2017). Effects of built environment and weather on bike sharing demand: A station level analysis of commercial bike sharing in Toronto. Transportation, 44(3), 589–613. doi:10.1007/s11116-015-9669-z

Kim, K. (2018). Investigation on the effects of weather and calendar events on bike-sharing according to the trip patterns of bike rentals of stations. Journal of Transport Geography, 66, 309–320. doi:10.1016/j.jtrangeo.2018.01.001

Sathishkumar, V. E., Jangwoo, P., & Yongyun, C. (2020). Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, 153, 353–366.

Code Appendix

# Importing libraries
library(ggplot2)
library(magrittr) 
library(dplyr) 
library(tidyr)
library(infotheo)
library(forecast)
library(MASS)
library(psych)
library(car)
library(TSA)
library(chron)
library(caTools)
library(caret)
# import dataset
setwd("/Users/April/Desktop/DS/Regression/Regression_Project")
bike <- read.csv("SeoulBikeData.csv",check.names = F)

#set new column names
bike <- setNames(bike, c("date","bike_count","hour","temp","humidity","wind_speed","visibility","dew_point_temp","solar_radiation","rainfall","snowfall","seasons","holiday","functioning_day"))

#Setting categorical values

bike$hour <- factor(bike$hour)
bike$seasons <- factor(bike$seasons)
bike$holiday <- factor(bike$holiday)
bike$functioning_day <- factor(bike$functioning_day)

head(bike)
summary(bike$functioning_day)
fig1 <- ggplot(bike, aes(x = as.factor(functioning_day), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 1. Bike Demand by functioning day") +  xlab("Functioning Day") + ylab("Bike Count")
fig1 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))

# Deleting rows when it is non-functioning day
bike<-bike[!(bike$functioning_day=="No"),]

# removing unused columns
bike <- subset(bike, select = - c(functioning_day))

summary(bike)
fig2<- ggplot(bike, aes(x = as.factor(hour) , y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 2. Hourly Bike Count Box Plot") +  xlab("Hour") + ylab("Bike Count")
fig2 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig3 <- ggplot(bike, aes(x = as.factor(seasons), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 3. Seasonal Bike Count Box Plot") +  xlab("Seasons") + ylab("Bike Count")
fig3 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
fig4 <- ggplot(bike, aes(x = as.factor(holiday), y = bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 4. Holiday Bike Count Box Plot") +  xlab("Holiday") + ylab("Bike Count")
fig4 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
# correlation matrix of dataset
cor_df <- data.frame(bike$bike_count, as.integer(bike$hour), bike$temp, bike$humidity, bike$wind_speed, bike$visibility, bike$dew_point_temp, bike$solar_radiation, bike$rainfall, bike$snowfall)
cor_df <- data.frame(cor(cor_df))

cor_df[order(-cor_df[,1]), ]
#Creating histograms 
multi.hist(bike[,sapply(bike, is.numeric)])

#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike)), size = smp_size)

model1_prediction <- bike[ -trainIndex,]
model1 <- bike[ trainIndex,]
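# Hedged alternative (sketch only, not the split used above): caret is already
# loaded, so a partition stratified on the response could also be drawn with
# createDataPartition(); the '_alt' objects are illustrative and unused below.
trainIndex_alt <- caret::createDataPartition(bike$bike_count, p = 0.80, list = FALSE)
model1_alt <- bike[trainIndex_alt, ]
model1_prediction_alt <- bike[-trainIndex_alt, ]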
# regression model with all columns original data
row.names(model1) <- NULL

original_full = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model1)

#summary
summary(original_full)
# possible subsets regression original data
library(MASS)
stepReg=MASS::stepAIC(original_full, direction="both")
stepReg$anova

# multicollinearity test
car::vif(original_full)

# Removing columns with multicollinearity and stepwise
original_final = lm(bike_count ~ hour + temp + humidity + solar_radiation + 
    rainfall + snowfall + seasons + holiday, data=model1)

#summary
summary(original_final)
par(mfrow=c(2,2))
plot(original_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(original_final)

# Test for Autocorrelated Errors
durbinWatsonTest(original_final)
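# Hedged addition (not part of the original diagnostics): out-of-sample error
# for Model 1 can be summarised on the held-out rows with predict(); RMSE is
# shown here as one possible metric, and the object names are illustrative.
pred_m1 <- predict(original_final, newdata = model1_prediction)
rmse_m1 <- sqrt(mean((model1_prediction$bike_count - pred_m1)^2))
rmse_m1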

# removing One outlier
model1_BC <- model1[-c(6040), ] # removing Outlier


for (i in colnames(model1_BC)){
  if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
    model1_BC[i] <- model1_BC[i] + (abs(min(bike[i])))+1 #adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(bike_count ~ ( temp + humidity), data = model1_BC)

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}
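# Hedged aside (sketch only): MASS::boxcox() can also profile a lambda for the
# response of a model fitted on the shifted (strictly positive) data; this is
# not the lambda used below, which comes from boxTidwell().
bc_profile <- MASS::boxcox(lm(bike_count ~ temp + humidity, data = model1_BC),
                           lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_alt <- bc_profile$x[which.max(bc_profile$y)]
lambda_alt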

for (i in colnames(model1_BC)){
  if(class(model1_BC[[i]]) == "integer" | class(model1_BC[[i]]) == "numeric"){
    model1_BC[i] <- transform_boxcox(model1_BC[i])
  }
}
# fitting new model
model1_BC_fit = lm(bike_count ~ hour + temp + humidity + solar_radiation + 
    rainfall + snowfall + seasons + holiday, data=model1_BC
    )
  
summary(model1_BC_fit)

# Checking multicollinearity in the final Model
car::vif(model1_BC_fit)
par(mfrow=c(2,2))
plot(model1_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(model1_BC_fit)

# Test for Autocorrelated Errors
durbinWatsonTest(model1_BC_fit)

#shapiro.test(model1_BC_fit$residuals)
# Error in shapiro.test(model1_BC_fit$residuals) : sample size must be between 3 and 5000
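# Hedged alternative (assumption, not used in the report): with more than 5000
# residuals, a Kolmogorov-Smirnov test on the standardised residuals or a
# QQ-plot can stand in for the Shapiro-Wilk test.
ks.test(as.vector(scale(model1_BC_fit$residuals)), "pnorm")
qqnorm(model1_BC_fit$residuals); qqline(model1_BC_fit$residuals)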
# creating dataframe for bike count by date
bike_by_day <- bike
levels(bike_by_day$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_by_day$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)


bike_by_day <- bike_by_day %>% group_by(date) %>% summarise(sum_bike_count = sum(bike_count), mean_temp = mean(temp), mean_humidity = mean(humidity), mean_wind_speed = mean(wind_speed), mean_visibility = mean(visibility), mean_dew_point_temp = mean(dew_point_temp), mean_solar_radiation = mean(solar_radiation), mean_rainfall = mean(rainfall), mean_snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))
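# Hedged alternative (sketch only, assumes dplyr >= 1.0): the numeric means
# above can be written more compactly with across(); seasons and holiday are
# kept as factors via first() here, which differs slightly from the approach above.
bike_by_day_alt <- bike %>%
  group_by(date) %>%
  summarise(sum_bike_count = sum(bike_count),
            across(c(temp, humidity, wind_speed, visibility, dew_point_temp,
                     solar_radiation, rainfall, snowfall),
                   mean, .names = "mean_{.col}"),
            seasons = dplyr::first(seasons), holiday = dplyr::first(holiday),
            .groups = "drop")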
#Changing the factors back to original values 
bike_by_day$holiday <- as.factor(bike_by_day$holiday)
bike_by_day$seasons <- as.factor(bike_by_day$seasons)
levels(bike_by_day$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_by_day$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')
head(bike_by_day)
#Creating histograms 
multi.hist(bike_by_day[,sapply(bike_by_day, is.numeric)])
#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_by_day))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)

model2_prediction <- bike_by_day[ -trainIndex,]
model2 <- bike_by_day[ trainIndex,]
# regression model by day 
bike_by_day_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday, data=model2)

#summary
summary(bike_by_day_model)
# possible subsets regression 
stepReg=MASS::stepAIC(bike_by_day_model, direction="both")
stepReg$anova

# multicollinearity
car::vif(bike_by_day_model)

# Removing columns with multicollinearity and stepwise
bike_by_day_model_final = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday, data=model2)

#summary
summary(bike_by_day_model_final)
# multicollinearity
car::vif(bike_by_day_model_final)

par(mfrow=c(2,2))
plot(bike_by_day_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_day_model_final)

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_model_final)


# Test for Normally Distributed Errors
shapiro.test(bike_by_day_model_final$residuals)
# Making all the values positive
bike_by_day_BC <- bike_by_day

for (i in colnames(bike_by_day_BC)){
  if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
    bike_by_day_BC[i] <- round(bike_by_day_BC[i] + (abs(min(bike_by_day[i])))+1 , digits=3) #adding minimum value to make dataset positive
  }
}
bc1 <-boxTidwell(sum_bike_count ~ mean_temp +
    mean_humidity + 
    mean_solar_radiation 
  ,data = bike_by_day_BC)

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}

for (i in colnames(bike_by_day_BC)){
  if(class(bike_by_day_BC[[i]]) == "integer" | class(bike_by_day_BC[[i]]) == "numeric"){
    bike_by_day_BC[i] <- transform_boxcox(bike_by_day_BC[i])
  }
}
# fitting new model
bike_by_day_BC_fit = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + 
    mean_solar_radiation + mean_rainfall + seasons + holiday, 
    data = bike_by_day_BC)
  
summary(bike_by_day_BC_fit)

# Checking multicollinearity in the final Model
car::vif(bike_by_day_BC_fit)

par(mfrow=c(2,2))
plot(bike_by_day_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test

ncvTest(bike_by_day_BC_fit)

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_day_BC_fit)

shapiro.test(bike_by_day_BC_fit$residuals)
# Dividing to am and pm
bike_AMPM <- bike
bike_AMPM$hour <- as.numeric(bike_AMPM$hour)

bike_AMPM['hour'] [bike_AMPM['hour'] <=12] <- -1
bike_AMPM['hour'] [bike_AMPM['hour'] >12] <- 1
bike_AMPM$hour <- factor(bike_AMPM$hour)
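# Note: as.numeric() on the hour factor returns the level index (1-24 for hours
# 0-23), so indices <= 12 correspond to hours 0-11 (AM) and the rest to PM.
# Hedged alternative with readable labels (sketch only, not used below):
# bike_AMPM$hour <- factor(ifelse(as.numeric(as.character(bike$hour)) < 12, "AM", "PM"))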


head(bike_AMPM)
# creating dataframe for bike count by am pm
levels(bike_AMPM$holiday) <- c("No Holiday" = 1, "Holiday"= 2)
levels(bike_AMPM$seasons) <- c("Winter" = 1, "Spring"= 2, "Summer" = 3, "Autumn" = 4)

bike_AMPM <- bike_AMPM %>% group_by(date,hour) %>% summarise(bike_count = sum(bike_count), temp = mean(temp), humidity = mean(humidity), wind_speed = mean(wind_speed), visibility = mean(visibility), dew_point_temp = mean(dew_point_temp), solar_radiation = mean(solar_radiation), rainfall = mean(rainfall), snowfall = mean(snowfall), seasons = mean(as.numeric(seasons)), holiday = mean(as.numeric(holiday)))

bike_AMPM
#Changing the factors back to original values 
bike_AMPM$holiday <- as.factor(bike_AMPM$holiday)
bike_AMPM$seasons <- as.factor(bike_AMPM$seasons)
levels(bike_AMPM$holiday) <- c('1'= 'No Holiday', '2'='Holiday')
levels(bike_AMPM$seasons) <- c('1'= 'Winter' , '2'='Spring', '3'= 'Summer', '4'='Autumn')

#Drop date column
bike_AMPM <- subset(bike_AMPM, select = -c(date))

head(bike_AMPM)
#Creating histograms 
multi.hist(bike_AMPM[,sapply(bike_AMPM, is.numeric)])
#Randomise our dataset

## 80% of the sample size
smp_size <- floor(0.80 * nrow(bike_AMPM))

## set the seed to make your partition reproducible
set.seed(123)
trainIndex <- sample(seq_len(nrow(bike_AMPM)), size = smp_size)

model3_prediction <- bike_AMPM[ -trainIndex,]
model3 <- bike_AMPM[ trainIndex,]

# regression model with all columns except functioning day
am_pm_model = lm(bike_count ~ hour+temp+humidity+wind_speed+visibility+dew_point_temp+solar_radiation+rainfall+snowfall+seasons+holiday, data=model3)

#summary
summary(am_pm_model)

#Building ANOVA table with full dataset (all columns) against intercept only model
anova(am_pm_model) 
# possible subsets regression
library(MASS)
stepReg=MASS::stepAIC(am_pm_model, direction="both")
stepReg$anova
# multicollinearity
car::vif(am_pm_model)
# Removing columns with multicollinearity and stepwise
am_pm_model_final = lm(bike_count ~ hour + humidity + wind_speed + dew_point_temp + 
    solar_radiation + rainfall + seasons + holiday, data=model3)

#summary
summary(am_pm_model_final)
# Checking multicollinearity in the final Model
car::vif(am_pm_model_final)

par(mfrow=c(2,2))
plot(am_pm_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
library(car)

ncvTest(am_pm_model_final)

# Test for Autocorrelated Errors
durbinWatsonTest(am_pm_model_final)

shapiro.test(am_pm_model_final$residuals)
bike_AMPM_BC <- model3[-c(185), ] # removing one outlier (row 185)

for (i in colnames(bike_AMPM_BC)){
  if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
    bike_AMPM_BC[i] <- round(bike_AMPM_BC[i] + (abs(min(bike_AMPM[i])))+1 , digits=3) # adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(
  bike_AMPM_BC$bike_count ~ 
    bike_AMPM_BC$humidity + 
    bike_AMPM_BC$dew_point_temp + 
    bike_AMPM_BC$solar_radiation + 
    bike_AMPM_BC$rainfall
  )

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # power transformation using the estimated lambda (the -1 shift and 1/lambda scaling of the full Box-Cox form are omitted here)
  data = data^lambda_val
  return(data)
}

for (i in colnames(bike_AMPM_BC)){
  if(class(bike_AMPM_BC[[i]]) == "integer" | class(bike_AMPM_BC[[i]]) == "numeric"){
    bike_AMPM_BC[i] <- transform_boxcox(bike_AMPM_BC[i])
  }
}

# fitting new model
bike_AMPM_BC_fit = lm(bike_AMPM_BC$bike_count ~ bike_AMPM_BC$hour + 
    bike_AMPM_BC$humidity + 
    bike_AMPM_BC$wind_speed + 
    bike_AMPM_BC$dew_point_temp + 
    bike_AMPM_BC$solar_radiation + 
    bike_AMPM_BC$rainfall + 
    bike_AMPM_BC$seasons +
    bike_AMPM_BC$holiday)
  
summary(bike_AMPM_BC_fit)

# Checking multicollinearity in the final Model
car::vif(bike_AMPM_BC_fit)

par(mfrow=c(2,2))
plot(bike_AMPM_BC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test

ncvTest(bike_AMPM_BC_fit)

# Test for Autocorrelated Errors
durbinWatsonTest(bike_AMPM_BC_fit)

shapiro.test(bike_AMPM_BC_fit$residuals)
#Creating weekend column
bike_by_day$date <- as.Date(bike_by_day$date, "%d/%m/%y")
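# Note (assumption about the raw file): "%y" expects a two-digit year; if the
# Date column stores a four-digit year (dd/mm/YYYY), "%d/%m/%Y" should be used
# instead, as "%y" can misread the year.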
bike_by_day$weekend = chron::is.weekend(bike_by_day$date)
bike_by_day$weekend <- as.factor(bike_by_day$weekend)
bike_by_day

# Bar plot for weekend and weekday bike demand
fig5 <- ggplot(bike_by_day, aes(x = as.factor(weekend), y = sum_bike_count)) + geom_boxplot(fill="slateblue", alpha=0.2) + ggtitle("Fig 5. Weekend and Weekday Bike Count Box Plot") +  xlab("Weekend") + ylab("Bike Count")
fig5 + theme(plot.title = element_text(size=14, face="bold", hjust=0.5))


#Randomise our dataset

## 90% of the sample size
smp_size <- floor(0.90 * nrow(bike_by_day))

## set the seed to make your partition reproducible
set.seed(500)
trainIndex <- sample(seq_len(nrow(bike_by_day)), size = smp_size)

model4_prediction <- bike_by_day[ -trainIndex,]
model4 <- bike_by_day[ trainIndex,]
# regression model by day with Weekend
bike_weekend_model = lm(sum_bike_count ~ mean_temp + mean_humidity + mean_wind_speed + mean_visibility + mean_dew_point_temp + mean_solar_radiation + mean_rainfall + mean_snowfall + seasons + holiday+ weekend, data=model4)

#summary
summary(bike_weekend_model)
# possible subsets regression (by day and weekend)
library(MASS)
stepReg=MASS::stepAIC(bike_weekend_model, direction="both")
stepReg$anova

# multicollinearity
car::vif(bike_weekend_model)

# Final regression model by day with Weekend
bike_weekend_model_final = lm(sum_bike_count ~ mean_dew_point_temp + mean_humidity + mean_wind_speed + mean_solar_radiation + mean_rainfall + seasons + holiday + weekend, data=model4)

#summary
summary(bike_weekend_model_final)
# multicollinearity
car::vif(bike_weekend_model_final)

par(mfrow=c(2,2))
plot(bike_weekend_model_final , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_weekend_model_final)
# Test for Autocorrelated Errors
durbinWatsonTest(bike_weekend_model_final)
# Test for Normally Distributed Errors
shapiro.test(bike_weekend_model_final$residuals)

bike_by_dayBC <- model4

for (i in colnames(bike_by_dayBC)){
  if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
    bike_by_dayBC[i] <- bike_by_dayBC[i] + (abs(min(bike_by_day[i])))+1 #adding minimum value to make dataset positive
  }
}

bc1 <-boxTidwell(
  bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall
  )

lambda_val=bc1$result[1]

transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}

for (i in colnames(bike_by_dayBC)){
  if(class(bike_by_dayBC[[i]]) == "integer" | class(bike_by_dayBC[[i]]) == "numeric"){
    bike_by_dayBC[i] <- transform_boxcox(bike_by_dayBC[i])
  }
}

# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall +
    bike_by_dayBC$seasons +
    bike_by_dayBC$holiday +
    bike_by_dayBC$weekend)
  
summary(bike_by_dayBC_fit)

# multicollinearity
car::vif(bike_by_dayBC_fit)
par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)
bike_by_dayBC <- bike_by_dayBC[-c(152,99,39),] # removing three outlying rows (152, 99, 39)

# fitting new model
bike_by_dayBC_fit = lm(bike_by_dayBC$sum_bike_count ~ bike_by_dayBC$mean_humidity + 
    bike_by_dayBC$mean_wind_speed + 
    bike_by_dayBC$mean_dew_point_temp + 
    bike_by_dayBC$mean_solar_radiation + 
    bike_by_dayBC$mean_rainfall +
    bike_by_dayBC$seasons +
    bike_by_dayBC$holiday +
    bike_by_dayBC$weekend)
  
summary(bike_by_dayBC_fit)

# multicollinearity
car::vif(bike_by_dayBC_fit)

par(mfrow=c(2,2))
plot(bike_by_dayBC_fit , which=1:4)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(bike_by_dayBC_fit)

# Test for Autocorrelated Errors
durbinWatsonTest(bike_by_dayBC_fit)

# Test for Normally Distributed Errors
shapiro.test(bike_by_dayBC_fit$residuals)
# Transform the prediction set to positive values by adding each column's absolute minimum (plus one), matching the training-set transformation

for (i in colnames(model4_prediction)){
  if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
    model4_prediction[i] <- model4_prediction[i] + (abs(min(bike_by_day[i])))+1
  }
}

# Box-Cox transformation using the estimated lambda
transform_boxcox <- function(data)
{
  # box-cox transformation
  data = ((data^lambda_val)-1)/lambda_val
  return(data)
}

for (i in colnames(model4_prediction)){
  if(class(model4_prediction[[i]]) == "integer" | class(model4_prediction[[i]]) == "numeric"){
    model4_prediction[i] <- transform_boxcox(model4_prediction[i])
  }
}


head(model4_prediction)
# creating function to predict the bike counts
pred_bike_count <- function(mean_humidity, mean_wind_speed, mean_dew_point_temp,  mean_solar_radiation, mean_rainfall, seasons, holiday, weekend){
  seasonsSummer = 0
  seasonsAutumn = 0
  seasonsSpring = 0
  holidayHoliday = 0
  weekendTRUE = 0
  
  # Assigning the dummy value 1 when the corresponding level is present
  if(seasons == "Summer"){
    seasonsSummer = 1
  } else if(seasons == "Autumn"){
    seasonsAutumn = 1
  } else if(seasons == "Spring"){
    seasonsSpring = 1
  }
  
  if(holiday == "Holiday"){
    holidayHoliday = 1
  }
  
  if(weekend == "TRUE"){
    weekendTRUE = 1
  }
  
  # fitted equation of Model 4
  y =  19.4886 - 0.8255 * mean_humidity - 1.0416 * mean_wind_speed + 1.0738 * mean_dew_point_temp + 3.4083 * mean_solar_radiation - 1.9489 * mean_rainfall - 1.0351 * seasonsSpring - 0.4055 * seasonsSummer -2.4737 * seasonsAutumn + 0.8703 * holidayHoliday + 0.2655 * weekendTRUE
  
  return(as.numeric(y))
}
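# Hedged aside (sketch only, not the approach used above): if the final model
# were refit with a data= argument, predict() could replace the hand-coded
# equation; predictions remain on the Box-Cox transformed scale.
bike_by_dayBC_fit2 <- lm(sum_bike_count ~ mean_humidity + mean_wind_speed +
    mean_dew_point_temp + mean_solar_radiation + mean_rainfall +
    seasons + holiday + weekend, data = bike_by_dayBC)
pred_alt <- predict(bike_by_dayBC_fit2, newdata = model4_prediction)
head(pred_alt)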

model4_prediction[1,] # first row of model 4 prediction

prediction1 <- pred_bike_count(
  model4_prediction[1,"mean_humidity"],
  model4_prediction[1,"mean_wind_speed"],
  model4_prediction[1,"mean_dew_point_temp"],
  model4_prediction[1,"mean_solar_radiation"],
  model4_prediction[1,"mean_rainfall"],
  model4_prediction[1,"seasons"],
  model4_prediction[1,"holiday"],
  model4_prediction[1,"weekend"]
  )

prediction1
model4_prediction[2,] # second row of model 4 prediction

prediction2 <- pred_bike_count(
  model4_prediction[2,"mean_humidity"],
  model4_prediction[2,"mean_wind_speed"],
  model4_prediction[2,"mean_dew_point_temp"],
  model4_prediction[2,"mean_solar_radiation"],
  model4_prediction[2,"mean_rainfall"],
  model4_prediction[2,"seasons"],
  model4_prediction[2,"holiday"],
  model4_prediction[2,"weekend"]
  )

prediction2
model4_prediction[3,] # third row of model 4 prediction

prediction3 <- pred_bike_count(
  model4_prediction[3,"mean_humidity"],
  model4_prediction[3,"mean_wind_speed"],
  model4_prediction[3,"mean_dew_point_temp"],
  model4_prediction[3,"mean_solar_radiation"],
  model4_prediction[3,"mean_rainfall"],
  model4_prediction[3,"seasons"],
  model4_prediction[3,"holiday"],
  model4_prediction[3,"weekend"]
  )

prediction3
model4_prediction[4,] # fourth row of model 4 prediction

prediction4 <- pred_bike_count(
  model4_prediction[4,"mean_humidity"],
  model4_prediction[4,"mean_wind_speed"],
  model4_prediction[4,"mean_dew_point_temp"],
  model4_prediction[4,"mean_solar_radiation"],
  model4_prediction[4,"mean_rainfall"],
  model4_prediction[4,"seasons"],
  model4_prediction[4,"holiday"],
  model4_prediction[4,"weekend"]
  )

prediction4
model4_prediction[5,] # fifth row of model 4 prediction

prediction5 <- pred_bike_count(
  model4_prediction[5,"mean_humidity"],
  model4_prediction[5,"mean_wind_speed"],
  model4_prediction[5,"mean_dew_point_temp"],
  model4_prediction[5,"mean_solar_radiation"],
  model4_prediction[5,"mean_rainfall"],
  model4_prediction[5,"seasons"],
  model4_prediction[5,"holiday"],
  model4_prediction[5,"weekend"]
  )

prediction5

actual_bikecount <- c(model4_prediction[1,]$sum_bike_count, model4_prediction[2,]$sum_bike_count, model4_prediction[3,]$sum_bike_count, model4_prediction[4,]$sum_bike_count, model4_prediction[5,]$sum_bike_count)
predicted_bikecount <- c(prediction1, prediction2, prediction3, prediction4, prediction5)

df_prediction <- data.frame(actual_bikecount, predicted_bikecount)

df_prediction # final prediction table
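# Hedged addition (not part of the original report): a simple summary of the
# five hold-out predictions on the transformed scale.
mape <- mean(abs(df_prediction$actual_bikecount - df_prediction$predicted_bikecount) /
               abs(df_prediction$actual_bikecount)) * 100
mape  # mean absolute percentage error, %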
