Please make an explorative data analysis and build a prediction model for the hourly utilization “cnt” of this data set: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. Your solution will consist of a short analysis (text ~ 1 DIN A4 with additional pages for plots) and the relevant code your report is about (source files).
In your report, present only one model, that you think is most suitable for a business-case. Summarize your reasons for choosing this model. Report the mean absolute deviations.
Assume that the code you are writing is used in production in a daily prediction service and maintained by your colleagues.
As per the instructions, the data was taken from the link above. Two data sets are available at that location: day.csv and hour.csv.
A Readme.txt file is also provided there; it describes the background of the problem, the variables and other data set characteristics.
Since the task is to forecast the number of users per hour, we will try several machine learning algorithms (multiple linear regression, a decision tree and a random forest) to estimate the hourly counts.
Let's start with multiple linear regression.
The first task is to install and load all the required libraries. The function below checks whether a package is already installed; if it is not, the function first installs it and then loads it.
# gc()
# rm(list = ls(all = TRUE))
# Install (if necessary) and load the required packages using the function below.
packages <- function(x) {
  # Capture the unquoted package name passed to the function
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(caret)
packages(car)
packages(caTools)
packages(tree)
packages(ISLR)
packages(rpart)
packages(rpart.plot)
packages(randomForest)
packages(e1071)
packages(tidyverse)
packages(mlbench)
We will load the input files from the local machine. The files are in '.csv' format. First, let's set the working directory to the folder where the input files are stored.
# provide the path of the input files
path <- "C:/Users/Abdul_Yunus/Desktop/Yunus_Personal/Learning/Case Study/JDA"
# Set working directory
setwd(path)
Now import the day.csv and hour.csv files from the local machine.
# Import day.csv file.
day_df <- read.csv('day.csv', header = TRUE)
# Import hour.csv file.
hour_df <- read.csv('hour.csv', header = TRUE)
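For a production service it may be preferable not to hard-code a personal directory. A minimal sketch of reading the same files from a configurable location (the environment variable name BIKE_DATA_DIR is purely illustrative):
# Read the input files from a configurable directory instead of a personal path.
# BIKE_DATA_DIR is an illustrative environment variable; it falls back to the working directory.
data_dir <- Sys.getenv("BIKE_DATA_DIR", unset = ".")
day_df  <- read.csv(file.path(data_dir, "day.csv"), header = TRUE)
hour_df <- read.csv(file.path(data_dir, "hour.csv"), header = TRUE)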
I would like to look at the data first to understand it. We can get an overview of both data sets using the summary and str functions.
Let's check the summary of the data sets.
summary(day_df)
## instant dteday season yr
## Min. : 1.0 2011-01-01: 1 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 2011-01-02: 1 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 2011-01-03: 1 Median :3.000 Median :1.0000
## Mean :366.0 2011-01-04: 1 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 2011-01-05: 1 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 2011-01-06: 1 Max. :4.000 Max. :1.0000
## (Other) :725
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
##
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
##
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
##
summary(hour_df)
## instant dteday season yr
## Min. : 1 2011-01-01: 24 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 2011-01-08: 24 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 2011-01-09: 24 Median :3.000 Median :1.0000
## Mean : 8690 2011-01-10: 24 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 2011-01-13: 24 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 2011-01-15: 24 Max. :4.000 Max. :1.0000
## (Other) :17235
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
##
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
##
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
##
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
##
We can also check the variable types using the str function.
str(day_df)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
str(hour_df)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
Data Analysis
Let's check whether there are any missing values in the data.
# Checking missing value in day_df data
sapply(day_df, function(x) sum(is.na(x)))
## instant dteday season yr mnth holiday
## 0 0 0 0 0 0
## weekday workingday weathersit temp atemp hum
## 0 0 0 0 0 0
## windspeed casual registered cnt
## 0 0 0 0
# Checking missing value in hour_df data
sapply(hour_df, function(x) sum(is.na(x)))
## instant dteday season yr mnth hr
## 0 0 0 0 0 0
## holiday weekday workingday weathersit temp atemp
## 0 0 0 0 0 0
## hum windspeed casual registered cnt
## 0 0 0 0 0
Some variables in the data are categorical but stored as integers. Let's convert these numeric variables into factors in both data sets.
# Convert the numeric variable into factor for day_df data
day_df$season <- as.factor(day_df$season)
day_df$yr <- as.factor(day_df$yr)
day_df$mnth <- as.factor(day_df$mnth)
day_df$holiday <- as.factor(day_df$holiday)
day_df$weekday <- as.factor(day_df$weekday)
day_df$workingday <- as.factor(day_df$workingday)
day_df$weathersit <- as.factor(day_df$weathersit)
# Convert the numeric variable into factor for hour_df data
hour_df$season <- as.factor(hour_df$season)
hour_df$yr <- as.factor(hour_df$yr)
hour_df$mnth <- as.factor(hour_df$mnth)
hour_df$holiday <- as.factor(hour_df$holiday)
hour_df$weekday <- as.factor(hour_df$weekday)
hour_df$workingday <- as.factor(hour_df$workingday)
hour_df$weathersit <- as.factor(hour_df$weathersit)
hour_df$hr <- as.factor(hour_df$hr)
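Since the same conversion is repeated for many columns, a more compact alternative, which may be easier for colleagues to maintain, is sketched below; it produces the same factors:
# Convert all labelled categorical columns to factors in one place.
factor_cols <- c("season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit")
day_df[factor_cols] <- lapply(day_df[factor_cols], as.factor)
hour_df[c(factor_cols, "hr")] <- lapply(hour_df[c(factor_cols, "hr")], as.factor)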
The column 'instant' is simply a row index, so we can remove it from both data sets.
day_df <- day_df[,-1]
hour_df <- hour_df[,-1]
Let's do the exploratory data analysis on the hour_df data.
First, how does the season affect bike rentals? We can plot the average number of bikes rented per hour for each season.
# Get the average count of bikes rented by season and hr.
season_summary_by_hour <- hour_df %>%
  select(season, hr, cnt) %>%
  group_by(season, hr) %>%
  summarise(Count = mean(cnt))
# Let's plot this
ggplot(season_summary_by_hour, aes(x = hr, y = Count, color = as.factor(season), group = as.factor(season))) +
  geom_point() +
  geom_line() +
  ggtitle("Bikes Rent By Season") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_hue('Season', breaks = levels(as.factor(season_summary_by_hour$season)),
                   labels = c('spring', 'summer', 'fall', 'winter'))
From the above plot we can see that, in every season, rentals peak around the commuting hours (roughly 8am and 5-6pm), and that average counts are lowest in season 1 and highest in seasons 2 and 3.
We can check in the same way how the weather affects bike rentals.
This can be done by plotting the average number of bikes rented per hour for each weather situation.
# Get the average count of bikes rented by weather situation and hr.
weather_summary_by_hour <- hour_df %>%
  select(weathersit, hr, cnt) %>%
  group_by(weathersit, hr) %>%
  summarise(Count = mean(cnt))
ggplot(weather_summary_by_hour, aes(x = hr, y = Count, color = as.factor(weathersit), group = as.factor(weathersit))) +
  geom_point() +
  geom_line() +
  ggtitle("Bikes Rent By Weather") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_hue('Weathersit', breaks = levels(as.factor(hour_df$weathersit)),
                   labels = c('Good', 'Normal', 'Bad', 'Very Bad'))
From the above plot we can see that rentals are highest in good weather and drop as the weather worsens; the 'Very Bad' category contains only a handful of observations, so its line is barely visible.
Do working days and non-working days have an impact on bike renting?
We can plot the hourly rental activity separately for working days and non-working days.
# Taking a subset of the hour_df data for working days.
hour_df.wd <- subset(hour_df, workingday == 1)
# Plotting for the subset data
ggplot(hour_df.wd, aes( x = hr, y = cnt))+
geom_point(position = position_jitter(w = 1, h = 0), aes(color = temp))+
scale_color_gradient(low = "#88d8b0", high = "#ff6f69") +
ggtitle("Bike Count on Working Days") +
theme(plot.title = element_text(hjust = 0.5))
From the above plot we can see that, on working days, activity is concentrated around the morning and evening commute, and that higher temperatures (warmer colours) go together with higher counts.
# Taking a subset of the hour_df data for non-working days.
hour_df.nwd <- subset(hour_df, workingday == 0)
# Plotting for the subset data
ggplot(hour_df.nwd, aes( x = hr, y = cnt))+
geom_point(position = position_jitter(w = 1, h = 0), aes(color = temp))+
scale_color_gradient(low = "#88d8b0", high = "#ff6f69") +
ggtitle("Bike Count on Non-Working Days")+
theme(plot.title = element_text(hjust = 0.5))
During non-working days there is a gradual increase in rental activity that peaks between about 1pm and 3pm.
In summary, working days show peak activity during the morning commute (around 8am) and right after work (around 6pm), with some lunchtime activity, whereas non-working days show a steady rise and fall over the afternoon.
First Model - Multiple Linear Regression
Let's start with multiple linear regression, one of my favourite algorithms. We will use the hour_df data for model building, as day_df is simply an aggregation of hour_df.
The data contains several labelled categorical variables; we can one-hot encode them. Let's do this on hour_df, since that is the data set we use for modelling.
# One Hot Encoding for season
hour_df$season_2 <- ifelse(hour_df$season ==2,1,0)
hour_df$season_3 <- ifelse(hour_df$season ==3,1,0)
hour_df$season_4 <- ifelse(hour_df$season ==4,1,0)
# One Hot Encoding for weathersit
hour_df$weathersit_2 <- ifelse(hour_df$weathersit ==2,1,0)
hour_df$weathersit_3 <- ifelse(hour_df$weathersit ==3,1,0)
hour_df$weathersit_4 <- ifelse(hour_df$weathersit ==4,1,0)
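As an aside, the same dummy columns can also be generated automatically; a minimal sketch using base R's model.matrix (note that lm() would in any case expand factor predictors on its own):
# Equivalent dummy columns via model.matrix (the first level is dropped as the reference level)
season_dummies  <- model.matrix(~ season, data = hour_df)[, -1]
weather_dummies <- model.matrix(~ weathersit, data = hour_df)[, -1]
head(season_dummies)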
Since we have created dummy variables for season and weathersit, let's drop the original columns and keep only the respective dummy variables.
# Dropping variables 'season' and 'weather'.
final_df <- subset(hour_df, select = - c(season, weathersit))
Since the date information is already captured by the year, month, weekday and hour columns, we can remove the dteday column. The column cnt is the sum of casual and registered, so to avoid leaking the target we also remove those two variables.
# dropping variables 'dteday', 'casual', 'registered'
final_df <- subset(final_df, select = -c(dteday, casual, registered))
Data partitioning
Before building the model, we partition the data into a training set and a test set. We will use the training data to build the model and validate its performance on the test data.
Let's split the data into 75% for training and 25% for testing. The proportion can be changed if required.
set.seed(123)
id <- sample.split(Y = final_df$cnt, SplitRatio = 0.75)
train_df <- subset(final_df, id == "TRUE")
test_df <- subset(final_df, id == "FALSE")
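Because the business case is a daily prediction service, a chronological split (train on earlier dates, predict later ones) may reflect production behaviour better than a random split. A sketch of that alternative; the cutoff date is an illustrative choice, and since final_df was derived from hour_df without dropping rows, the two are row-aligned:
# Illustrative chronological split: train on everything before October 2012, test on the rest.
cutoff   <- as.Date("2012-10-01")
is_train <- as.Date(hour_df$dteday) < cutoff
train_ts <- final_df[is_train, ]
test_ts  <- final_df[!is_train, ]
The analysis below continues with the random 75/25 split.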
First, we fit a regression model using all the feature variables against the target and check how the model performs.
# Fitting the multiple linear regression model
m1 <- lm(formula = cnt ~., data = train_df)
# Check the summary to interpret the model.
summary(m1)
##
## Call:
## lm(formula = cnt ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.08 -61.36 -7.24 51.98 429.24
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.264 7.770 -10.845 < 2e-16 ***
## yr1 86.667 1.830 47.366 < 2e-16 ***
## mnth2 4.198 4.612 0.910 0.362715
## mnth3 17.874 5.175 3.454 0.000554 ***
## mnth4 10.236 7.656 1.337 0.181208
## mnth5 21.831 8.200 2.662 0.007770 **
## mnth6 6.905 8.421 0.820 0.412256
## mnth7 -12.179 9.421 -1.293 0.196116
## mnth8 11.351 9.189 1.235 0.216763
## mnth9 35.494 8.156 4.352 1.36e-05 ***
## mnth10 23.695 7.566 3.132 0.001742 **
## mnth11 -3.793 7.300 -0.520 0.603391
## mnth12 -1.540 5.807 -0.265 0.790800
## hr1 -17.867 6.267 -2.851 0.004366 **
## hr2 -27.671 6.304 -4.390 1.14e-05 ***
## hr3 -39.201 6.384 -6.141 8.44e-10 ***
## hr4 -42.755 6.298 -6.789 1.18e-11 ***
## hr5 -24.659 6.313 -3.906 9.44e-05 ***
## hr6 34.790 6.296 5.526 3.34e-08 ***
## hr7 173.756 6.278 27.677 < 2e-16 ***
## hr8 315.912 6.246 50.578 < 2e-16 ***
## hr9 160.249 6.327 25.329 < 2e-16 ***
## hr10 105.700 6.287 16.813 < 2e-16 ***
## hr11 135.831 6.368 21.330 < 2e-16 ***
## hr12 171.835 6.454 26.625 < 2e-16 ***
## hr13 167.973 6.494 25.865 < 2e-16 ***
## hr14 148.624 6.504 22.850 < 2e-16 ***
## hr15 156.601 6.492 24.121 < 2e-16 ***
## hr16 223.356 6.472 34.509 < 2e-16 ***
## hr17 386.151 6.420 60.153 < 2e-16 ***
## hr18 351.793 6.344 55.449 < 2e-16 ***
## hr19 237.894 6.327 37.599 < 2e-16 ***
## hr20 157.780 6.314 24.987 < 2e-16 ***
## hr21 108.213 6.292 17.198 < 2e-16 ***
## hr22 71.484 6.273 11.395 < 2e-16 ***
## hr23 31.172 6.270 4.972 6.72e-07 ***
## holiday1 -32.396 5.780 -5.604 2.13e-08 ***
## weekday1 11.510 3.482 3.305 0.000952 ***
## weekday2 13.766 3.427 4.017 5.93e-05 ***
## weekday3 15.752 3.400 4.633 3.65e-06 ***
## weekday4 15.363 3.400 4.518 6.29e-06 ***
## weekday5 19.452 3.412 5.702 1.21e-08 ***
## weekday6 17.814 3.395 5.247 1.57e-07 ***
## workingday1 NA NA NA NA
## temp 116.792 35.055 3.332 0.000866 ***
## atemp 124.540 36.457 3.416 0.000637 ***
## hum -85.279 6.494 -13.132 < 2e-16 ***
## windspeed -34.078 8.240 -4.136 3.56e-05 ***
## season_2 39.267 5.677 6.917 4.84e-12 ***
## season_3 34.038 6.709 5.073 3.96e-07 ***
## season_4 65.715 5.709 11.511 < 2e-16 ***
## weathersit_2 -10.933 2.248 -4.863 1.17e-06 ***
## weathersit_3 -65.392 3.792 -17.247 < 2e-16 ***
## weathersit_4 -62.763 59.788 -1.050 0.293851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.2 on 13023 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6884
## F-statistic: 556.5 on 52 and 13023 DF, p-value: < 2.2e-16
The model summary shows an adjusted R-squared of around 0.69. Note that the coefficient for workingday1 is reported as NA: workingday is fully determined by weekday and holiday, so it is collinear with them and the singularity is harmless. Since we used all the variables in the model, several of them are not significant; we can keep only selected variables and check the model performance again.
We can use the stepAIC function from the MASS package to select the variables for our model.
# Let's load the library MASS
packages(MASS)
var_select <- stepAIC(m1, direction = 'both')
## Start: AIC=121309.5
## cnt ~ yr + mnth + hr + holiday + weekday + workingday + temp +
## atemp + hum + windspeed + season_2 + season_3 + season_4 +
## weathersit_2 + weathersit_3 + weathersit_4
##
##
## Step: AIC=121309.5
## cnt ~ yr + mnth + hr + holiday + weekday + temp + atemp + hum +
## windspeed + season_2 + season_3 + season_4 + weathersit_2 +
## weathersit_3 + weathersit_4
##
## Df Sum of Sq RSS AIC
## - weathersit_4 1 11735 138692843 121309
## <none> 138681108 121309
## - temp 1 118206 138799314 121319
## - atemp 1 124270 138805378 121319
## - windspeed 1 182148 138863256 121325
## - weathersit_2 1 251877 138932985 121331
## - season_3 1 274090 138955198 121333
## - holiday 1 334480 139015588 121339
## - weekday 6 454279 139135387 121340
## - season_2 1 509467 139190575 121355
## - season_4 1 1410924 140092032 121440
## - mnth 11 1971654 140652762 121472
## - hum 1 1836367 140517475 121479
## - weathersit_3 1 3167461 141848569 121603
## - yr 1 23891542 162572650 123386
## - hr 23 155305841 293986949 131088
##
## Step: AIC=121308.6
## cnt ~ yr + mnth + hr + holiday + weekday + temp + atemp + hum +
## windspeed + season_2 + season_3 + season_4 + weathersit_2 +
## weathersit_3
##
## Df Sum of Sq RSS AIC
## <none> 138692843 121309
## + weathersit_4 1 11735 138681108 121309
## - temp 1 118054 138810897 121318
## - atemp 1 124825 138817668 121318
## - windspeed 1 182711 138875554 121324
## - weathersit_2 1 249431 138942274 121330
## - season_3 1 274007 138966850 121332
## - holiday 1 333719 139026562 121338
## - weekday 6 453148 139145991 121339
## - season_2 1 509725 139202568 121355
## - season_4 1 1410902 140103744 121439
## - mnth 11 1974778 140667621 121471
## - hum 1 1852688 140545530 121480
## - weathersit_3 1 3159475 141852318 121601
## - yr 1 23882921 162575764 123384
## - hr 23 155311734 294004577 131087
# MASS masks some dplyr functions (e.g. select), so detach it once we are done; it can be reloaded later if needed.
detach("package:MASS", unload=TRUE)
After running stepAIC in both directions (forward and backward), we obtain the selected variables. The corresponding data can be taken directly from the stepAIC output; we save it separately.
data.selected.variables <- var_select$model
Let's use data.selected.variables to fit our model.
# Fitting the multiple linear regression model using selected variables after using StepAIC
m2 <- lm(formula = cnt ~., data = data.selected.variables)
summary(m2)
##
## Call:
## lm(formula = cnt ~ ., data = data.selected.variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.04 -61.36 -7.24 51.99 429.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.259 7.770 -10.844 < 2e-16 ***
## yr1 86.647 1.830 47.358 < 2e-16 ***
## mnth2 4.359 4.610 0.946 0.344393
## mnth3 18.022 5.173 3.484 0.000496 ***
## mnth4 10.363 7.655 1.354 0.175814
## mnth5 21.962 8.199 2.679 0.007402 **
## mnth6 7.007 8.420 0.832 0.405344
## mnth7 -12.074 9.421 -1.282 0.199992
## mnth8 11.476 9.188 1.249 0.211709
## mnth9 35.642 8.155 4.371 1.25e-05 ***
## mnth10 23.852 7.565 3.153 0.001619 **
## mnth11 -3.628 7.298 -0.497 0.619101
## mnth12 -1.363 5.804 -0.235 0.814360
## hr1 -17.975 6.266 -2.869 0.004131 **
## hr2 -27.662 6.304 -4.388 1.15e-05 ***
## hr3 -39.189 6.384 -6.139 8.54e-10 ***
## hr4 -42.737 6.298 -6.786 1.20e-11 ***
## hr5 -24.639 6.313 -3.903 9.56e-05 ***
## hr6 34.809 6.296 5.529 3.29e-08 ***
## hr7 173.772 6.278 27.680 < 2e-16 ***
## hr8 315.916 6.246 50.579 < 2e-16 ***
## hr9 160.238 6.327 25.327 < 2e-16 ***
## hr10 105.671 6.287 16.809 < 2e-16 ***
## hr11 135.789 6.368 21.324 < 2e-16 ***
## hr12 171.778 6.454 26.617 < 2e-16 ***
## hr13 167.909 6.494 25.856 < 2e-16 ***
## hr14 148.555 6.504 22.840 < 2e-16 ***
## hr15 156.530 6.492 24.112 < 2e-16 ***
## hr16 223.175 6.470 34.493 < 2e-16 ***
## hr17 386.089 6.419 60.145 < 2e-16 ***
## hr18 351.629 6.343 55.440 < 2e-16 ***
## hr19 237.856 6.327 37.594 < 2e-16 ***
## hr20 157.752 6.314 24.983 < 2e-16 ***
## hr21 108.195 6.292 17.195 < 2e-16 ***
## hr22 71.476 6.273 11.394 < 2e-16 ***
## hr23 31.164 6.270 4.970 6.77e-07 ***
## holiday1 -32.359 5.780 -5.598 2.21e-08 ***
## weekday1 11.467 3.482 3.293 0.000993 ***
## weekday2 13.756 3.427 4.014 6.00e-05 ***
## weekday3 15.709 3.400 4.620 3.87e-06 ***
## weekday4 15.348 3.400 4.514 6.43e-06 ***
## weekday5 19.440 3.412 5.698 1.24e-08 ***
## weekday6 17.772 3.395 5.235 1.68e-07 ***
## temp 116.716 35.055 3.330 0.000872 ***
## atemp 124.815 36.456 3.424 0.000620 ***
## hum -85.576 6.488 -13.190 < 2e-16 ***
## windspeed -34.130 8.240 -4.142 3.46e-05 ***
## season_2 39.277 5.677 6.919 4.78e-12 ***
## season_3 34.033 6.709 5.073 3.98e-07 ***
## season_4 65.715 5.709 11.510 < 2e-16 ***
## weathersit_2 -10.877 2.247 -4.840 1.32e-06 ***
## weathersit_3 -65.286 3.790 -17.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.2 on 13024 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6884
## F-statistic: 567.3 on 51 and 13024 DF, p-value: < 2.2e-16
Almost all the variables used in this model are significant, and we get a similar adjusted R-squared (about 0.69).
Let's make predictions on the test data using the predict function.
pred <- predict(object = m2, newdata = test_df)
Since we are predicting counts of users renting bikes, these numbers should be non-negative. Let's check whether all predictions on the test data are positive.
test_df_check <- cbind(test_df,pred)
test_df_check$Validation <- ifelse(test_df_check$pred <0,1,0)
sum(test_df_check$Validation)
## [1] 453
The model predicts 453 negative values, which is around 10% of our test data.
Since the model predicts negative counts, we can transform the target variable: take the natural log of the target and rebuild the model.
Let's try this.
m3 <- lm(formula = log(cnt) ~., data = data.selected.variables)
summary(m3)
##
## Call:
## lm(formula = log(cnt) ~ ., data = data.selected.variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3404 -0.2965 0.0324 0.3749 2.5388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.605497 0.046783 55.693 < 2e-16 ***
## yr1 0.473625 0.011016 42.993 < 2e-16 ***
## mnth2 0.123829 0.027756 4.461 8.21e-06 ***
## mnth3 0.154067 0.031149 4.946 7.66e-07 ***
## mnth4 0.132929 0.046090 2.884 0.00393 **
## mnth5 0.256752 0.049367 5.201 2.01e-07 ***
## mnth6 0.142619 0.050700 2.813 0.00492 **
## mnth7 -0.020115 0.056722 -0.355 0.72288
## mnth8 0.057162 0.055324 1.033 0.30152
## mnth9 0.143738 0.049101 2.927 0.00342 **
## mnth10 0.070128 0.045547 1.540 0.12366
## mnth11 -0.006192 0.043943 -0.141 0.88794
## mnth12 -0.008070 0.034947 -0.231 0.81737
## hr1 -0.629722 0.037730 -16.690 < 2e-16 ***
## hr2 -1.203189 0.037955 -31.701 < 2e-16 ***
## hr3 -1.765851 0.038436 -45.943 < 2e-16 ***
## hr4 -2.052123 0.037919 -54.119 < 2e-16 ***
## hr5 -0.932062 0.038012 -24.520 < 2e-16 ***
## hr6 0.270819 0.037909 7.144 9.55e-13 ***
## hr7 1.263119 0.037800 33.416 < 2e-16 ***
## hr8 1.882329 0.037608 50.051 < 2e-16 ***
## hr9 1.547435 0.038094 40.621 < 2e-16 ***
## hr10 1.211942 0.037852 32.018 < 2e-16 ***
## hr11 1.352827 0.038342 35.283 < 2e-16 ***
## hr12 1.527933 0.038858 39.321 < 2e-16 ***
## hr13 1.505136 0.039101 38.493 < 2e-16 ***
## hr14 1.413821 0.039162 36.102 < 2e-16 ***
## hr15 1.448736 0.039088 37.063 < 2e-16 ***
## hr16 1.723401 0.038957 44.238 < 2e-16 ***
## hr17 2.132386 0.038651 55.170 < 2e-16 ***
## hr18 2.063892 0.038189 54.044 < 2e-16 ***
## hr19 1.763117 0.038096 46.281 < 2e-16 ***
## hr20 1.474724 0.038020 38.789 < 2e-16 ***
## hr21 1.228019 0.037886 32.413 < 2e-16 ***
## hr22 0.983553 0.037772 26.039 < 2e-16 ***
## hr23 0.568499 0.037752 15.059 < 2e-16 ***
## holiday1 -0.161585 0.034804 -4.643 3.47e-06 ***
## weekday1 -0.045083 0.020967 -2.150 0.03155 *
## weekday2 -0.054735 0.020634 -2.653 0.00799 **
## weekday3 -0.037803 0.020471 -1.847 0.06483 .
## weekday4 0.007848 0.020473 0.383 0.70147
## weekday5 0.119542 0.020542 5.819 6.05e-09 ***
## weekday6 0.101080 0.020441 4.945 7.71e-07 ***
## temp 0.367901 0.211067 1.743 0.08135 .
## atemp 1.197141 0.219505 5.454 5.02e-08 ***
## hum -0.278811 0.039064 -7.137 1.00e-12 ***
## windspeed -0.206134 0.049611 -4.155 3.27e-05 ***
## season_2 0.307935 0.034182 9.009 < 2e-16 ***
## season_3 0.383962 0.040397 9.505 < 2e-16 ***
## season_4 0.602964 0.034375 17.541 < 2e-16 ***
## weathersit_2 -0.037556 0.013532 -2.775 0.00552 **
## weathersit_3 -0.581544 0.022821 -25.482 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6213 on 13024 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8258
## F-statistic: 1217 on 51 and 13024 DF, p-value: < 2.2e-16
Let's make predictions on the test data using the predict function. Since the model was fitted on log(cnt), we apply the exponential function to convert the predictions back to the original count scale.
pred <- exp(predict(object = m3, newdata = test_df))
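One caveat when back-transforming: roughly speaking, exp() of the log-scale prediction targets the conditional median rather than the mean count. If mean counts matter for the business case, a simple correction worth considering is Duan's smearing estimator; a minimal sketch using the fitted model m3:
# Duan's smearing estimator: scale the exponentiated predictions by the mean of exp(residuals)
smear    <- mean(exp(residuals(m3)))
pred_adj <- smear * exp(predict(object = m3, newdata = test_df))
The uncorrected predictions are used in the remainder of the analysis.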
We can again do the validation for negative values.
test_df_check <- cbind(test_df,pred)
test_df_check$Validation <- ifelse(test_df_check$pred <0,1,0)
sum(test_df_check$Validation)
## [1] 0
Great: after applying the natural log to the target variable, we no longer get any negative predictions. Let's check the accuracy of this model.
packages(DescTools)
RMSE_MLR_m3 <- RMSE(x = pred, ref = test_df$cnt, na.rm = T)
MAD_MLR_m3 <- sum(abs(test_df_check$pred - mean(test_df_check$pred))/length(test_df_check$pred))
MAPE_MLR_m3 <- mean(abs((test_df_check$cnt - test_df_check$pred)/test_df_check$cnt)*100)
MLR_Accuracy <- data.frame(RMSE_MLR_m3, MAD_MLR_m3,MAPE_MLR_m3)
MLR_Accuracy
## RMSE_MLR_m3 MAD_MLR_m3 MAPE_MLR_m3
## 1 94.12152 121.3354 62.22146
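Note that the MAD computed above measures how much the predictions spread around their own mean. For a forecast, the mean absolute deviation is more commonly taken against the observed values (i.e. the mean absolute error); a minimal sketch of that alternative, using the same objects:
# Mean absolute error of the predictions against the observed hourly counts
MAE_MLR_m3 <- mean(abs(test_df_check$cnt - test_df_check$pred))
MAE_MLR_m3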
We have built this model and checked its accuracy. Let's try some other algorithms on the data.
Second Model - Decision Tree
We will build a decision tree model using the rpart function. Let's first create a base model with default parameters on the training data.
tree1 <- rpart(cnt ~., data = train_df)
Let’s plot the tree
rpart.plot(tree1)
The model is ready; we can generate predictions using the predict function.
pred1 = predict(object = tree1, newdata = test_df)
Let’s check if there is any negative value predicted.
test_df_val = cbind(test_df, pred1)
test_df_val$NegativeVal <- ifelse(test_df_val$pred1 < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(test_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let's calculate RMSE (Root Mean Square Error), MAD (Mean Absolute Deviation) and MAPE (Mean Absolute Percentage Error).
test_df_val$Error <- test_df_val$cnt - test_df_val$pred1
RMSE_Dectree1 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_Dectree1 <- sum(abs(test_df_val$pred1 - mean(test_df_val$pred1))/length(test_df_val$pred1))
MAPE_Dectree1 <- mean(abs((test_df_val$cnt - test_df_val$pred1)/test_df_val$cnt)*100)
Dec_tree_Accuracy <- data.frame(RMSE_Dectree1, MAD_Dectree1, MAPE_Dectree1)
# Model Accuracy for the decision tree
Dec_tree_Accuracy
## RMSE_Dectree1 MAD_Dectree1 MAPE_Dectree1
## 1 89.39187 114.1421 188.2216
Let's compare this with the results of our regression model.
# Model accuracy for the Multiple Linear regression for comparision
MLR_Accuracy
## RMSE_MLR_m3 MAD_MLR_m3 MAPE_MLR_m3
## 1 94.12152 121.3354 62.22146
With the decision tree we get RMSE ~ 89 and MAD ~ 114, both somewhat better than the regression model, although the MAPE is much worse (188 vs 62), which is most likely driven by hours with very small counts.
Decision trees are prone to overfitting, and some textbooks note that a single tree can be a poor fit for continuous targets. We can mitigate this by constraining the tree parameters; a minimal sketch of what that could look like follows, after which we move on to a random forest.
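A sketch of such constraints with rpart (the parameter values are illustrative and not tuned; rpart.control and prune are standard rpart functions):
# A constrained tree: limit depth, require more observations per split, then prune by the cp
# that minimises the cross-validated error (xerror) in the cp table.
tree_ctrl    <- rpart.control(maxdepth = 8, minsplit = 50, cp = 0.001)
tree2        <- rpart(cnt ~ ., data = train_df, control = tree_ctrl)
best_cp      <- tree2$cptable[which.min(tree2$cptable[, "xerror"]), "CP"]
tree2_pruned <- prune(tree2, cp = best_cp)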
Third Model - Random Forest
Let's now build the random forest model on the training data.
set.seed(123)
rf1 = randomForest(cnt ~., data = train_df, mtry = ncol(train_df)-1)
The Out of Bag (OOB) error is randomForest's internal estimate of performance: for classification it is reported as an error rate, while for regression the model reports the mean of squared residuals and the percentage of variance explained. Note that mtry = ncol(train_df) - 1 tries all predictors at every split, which makes this forest essentially a bagged ensemble of trees.
Let's print the model.
print(rf1)
##
## Call:
## randomForest(formula = cnt ~ ., data = train_df, mtry = ncol(train_df) - 1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 16
##
## Mean of squared residuals: 1971.444
## % Var explained: 94.23
We can check the variable importance using the below code.
# Priority of Variables
rf1$importance
## IncNodePurity
## yr 41989315.08
## mnth 16359226.85
## hr 258901477.39
## holiday 681684.81
## weekday 12203623.53
## workingday 28118041.89
## temp 47726926.51
## atemp 9645114.73
## hum 10461059.80
## windspeed 3706120.04
## season_2 556466.59
## season_3 138943.24
## season_4 5246309.07
## weathersit_2 683956.15
## weathersit_3 7040222.23
## weathersit_4 10030.21
Predictions on the training data set can be generated with the code below. Note that these are in-sample predictions, so the error metrics derived from them describe the training fit rather than out-of-sample performance.
pred_rf1 = predict(object = rf1, newdata = train_df)
We can also check whether the model produces any negative output.
train_df_val = cbind(train_df, pred_rf1)
train_df_val$NegativeVal <- ifelse(train_df_val$pred_rf1 < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(train_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let's calculate RMSE (Root Mean Square Error), MAD (Mean Absolute Deviation) and MAPE (Mean Absolute Percentage Error).
train_df_val$Error <- train_df_val$cnt - train_df_val$pred_rf1
# Calculate RMSE
RMSE_train_rf1 <- sum(sqrt(mean((train_df_val$Error)^2)))
# Calculate MAD
MAD_train_rf1 <- sum(abs(train_df_val$pred_rf1 - mean(train_df_val$pred_rf1))/length(train_df_val$pred_rf1))
# Calculate MAPE
MAPE_train_rf1 <- mean(abs((train_df_val$cnt - train_df_val$pred_rf1)/train_df_val$cnt)*100)
Accuracy_table <- data.frame(RMSE_Dectree1, MAD_Dectree1, MAPE_Dectree1, RMSE_train_rf1,MAD_train_rf1, MAPE_train_rf1)
Accuracy_table
## RMSE_Dectree1 MAD_Dectree1 MAPE_Dectree1 RMSE_train_rf1 MAD_train_rf1
## 1 89.39187 114.1421 188.2216 19.74873 140.3387
## MAPE_train_rf1
## 1 15.77177
Prediction on the test data set
pred_rf_test = predict(object = rf1, newdata = test_df)
Check if the model is producing any negative values
test_df_val = cbind(test_df, pred_rf_test)
test_df_val$NegativeVal <- ifelse(test_df_val$pred_rf_test < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(test_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let’s calculate RMSE, MAD and MAPE.
test_df_val$Error <- test_df_val$cnt - test_df_val$pred_rf_test
RMSE_test_rf1 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_test_rf1 <- sum(abs(test_df_val$pred_rf_test- mean(test_df_val$pred_rf_test))/length(test_df_val$pred_rf_test))
MAPE_test_rf1 <- mean(abs((test_df_val$cnt - test_df_val$pred_rf_test)/test_df_val$cnt)*100)
Accuracy_table <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,MAD_Dectree1,MAD_train_rf1,MAD_test_rf1, MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1)
Accuracy_table
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 MAD_Dectree1 MAD_train_rf1
## 1 89.39187 19.74873 41.63328 114.1421 140.3387
## MAD_test_rf1 MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1
## 1 129.1864 188.2216 15.77177 36.93351
**Error rate of the random forest.** Let's plot the RF model to check the OOB error: the plot shows where the OOB error levels off and is therefore also useful for deciding how many trees the random forest needs.
plot(rf1)
From the plot, we can see that after about 100 trees the error is steady and drops no further. Let's tune the model to find a good mtry value, setting ntreeTry = 100 in the tuneRF function.
t = tuneRF(train_df, train_df$cnt,
stepFactor = 1.2,
plot = T,
ntreeTry = 100,
improve = 0.05)
## mtry = 5 OOB error = 91.40781
## Searching left ...
## Searching right ...
## mtry = 6 OOB error = 44.33138
## 0.5150154 0.05
## mtry = 7 OOB error = 23.44444
## 0.4711548 0.05
## mtry = 8 OOB error = 13.76817
## 0.4127317 0.05
## mtry = 9 OOB error = 9.118621
## 0.337703 0.05
## mtry = 10 OOB error = 7.247002
## 0.2052524 0.05
## mtry = 12 OOB error = 5.819201
## 0.1970195 0.05
## mtry = 14 OOB error = 7.646352
## -0.3139864 0.05
After this tuning, the suggested mtry value is 12 (but see the caveat below). Let's put these parameters (mtry = 12 and ntree = 100) into the model and run it again.
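Before relying on that suggestion, note one caveat: the first argument passed to tuneRF above is the full train_df, which still contains the target cnt among the predictors, so the very low OOB errors at large mtry are partly an artefact of that leakage. A corrected call would exclude the target, for example:
# Exclude the target column from the predictor matrix passed to tuneRF
x_train <- train_df[, setdiff(names(train_df), "cnt")]
t2 <- tuneRF(x_train, train_df$cnt,
             stepFactor = 1.2,
             plot = TRUE,
             ntreeTry = 100,
             improve = 0.05)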
set.seed(1234)
rf2 = randomForest(cnt ~., data = train_df,
mtry = 12,
ntree = 100,
importance = T,
proximity = T)
print(rf2)
##
## Call:
## randomForest(formula = cnt ~ ., data = train_df, mtry = 12, ntree = 100, importance = T, proximity = T)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 1989.138
## % Var explained: 94.18
We can check the important variables.
# Variable importance (%IncMSE and IncNodePurity; the Gini index applies only to classification forests)
rf2$importance
## %IncMSE IncNodePurity
## yr 6668.66614 4.066315e+07
## mnth 2430.61801 1.601167e+07
## hr 45062.90318 2.506876e+08
## holiday 70.63313 8.456058e+05
## weekday 3774.70323 1.460731e+07
## workingday 6300.85024 2.503173e+07
## temp 5163.68085 3.562984e+07
## atemp 3414.24245 3.090593e+07
## hum 1880.73628 1.280895e+07
## windspeed 142.01234 3.971474e+06
## season_2 87.05899 5.902376e+05
## season_3 40.72671 2.576685e+05
## season_4 945.32282 4.541537e+06
## weathersit_2 42.97021 7.848554e+05
## weathersit_3 800.50463 6.901134e+06
## weathersit_4 0.00000 8.688723e+03
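For the plot pages of the report, the importances can also be visualized directly:
# Dot chart of %IncMSE and IncNodePurity for the tuned forest
varImpPlot(rf2, main = "Variable importance - tuned random forest")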
Prediction on the training data set
pred_rf2 = predict(object = rf2, newdata = train_df)
Let’s calculate RMSE, MAD and MAPE.
train_df <- cbind(train_df,pred_rf2)
train_df$Error <- train_df$cnt - train_df$pred_rf2
RMSE_train_rf2 <- sum(sqrt(mean((train_df$Error)^2)))
MAD_train_rf2 <- sum(abs(train_df$pred_rf2 - mean(train_df$pred_rf2, na.rm = T))/length(train_df$pred_rf2))
MAPE_train_rf2 <- mean(abs((train_df$cnt - train_df$pred_rf2)/train_df$cnt)*100)
Accuracy_table_RMSE <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,RMSE_train_rf2)
Accuracy_table_MAD <- data.frame(MAD_Dectree1, MAD_train_rf1,MAD_test_rf1, MAD_train_rf2)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2)
# Let's see the tables
Accuracy_table_RMSE
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 RMSE_train_rf2
## 1 89.39187 19.74873 41.63328 20.20037
Accuracy_table_MAD
## MAD_Dectree1 MAD_train_rf1 MAD_test_rf1 MAD_train_rf2
## 1 114.1421 140.3387 129.1864 139.923
Accuracy_table_MAPE
## MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1 MAPE_train_rf2
## 1 188.2216 15.77177 36.93351 16.62618
**Prediction on the test data set**
pred_rf_test = predict(object = rf2, newdata = test_df)
Let’s calculate RMSE, MAD and MAPE.
# Refresh the prediction column with the rf2 predictions (test_df_val still holds the rf1 predictions from the earlier cbind)
test_df_val$pred_rf_test <- pred_rf_test
test_df_val$Error <- test_df_val$cnt - test_df_val$pred_rf_test
RMSE_test_rf2 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_test_rf2 <- sum(abs(test_df_val$pred_rf_test- mean(test_df_val$pred_rf_test))/length(test_df_val$pred_rf_test))
MAPE_test_rf2 <- mean(abs((test_df_val$cnt - test_df_val$pred_rf_test)/test_df_val$cnt)*100)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2)
Accuracy_table_RMSE <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,RMSE_train_rf2,RMSE_test_rf2)
Accuracy_table_MAD <- data.frame(MAD_Dectree1, MAD_train_rf1,MAD_test_rf1, MAD_train_rf2,MAD_test_rf2)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2,MAPE_test_rf2 )
We have now run both the decision tree and the random forest on our data and calculated RMSE, MAD and MAPE, so we can compare the results.
# Let's see the tables
Accuracy_table_RMSE
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 RMSE_train_rf2 RMSE_test_rf2
## 1 89.39187 19.74873 41.63328 20.20037 41.63328
Accuracy_table_MAD
## MAD_Dectree1 MAD_train_rf1 MAD_test_rf1 MAD_train_rf2 MAD_test_rf2
## 1 114.1421 140.3387 129.1864 139.923 129.1864
Accuracy_table_MAPE
## MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1 MAPE_train_rf2 MAPE_test_rf2
## 1 188.2216 15.77177 36.93351 16.62618 36.93351
The final accuracy table on the test data is built below.
ML_Algorithms <- c('Multiple Linear Regression','Decision Tree','Random Forest')
RMSE <- c(round(RMSE_MLR_m3,2), round(RMSE_Dectree1,2), round(RMSE_test_rf2,2))
MAD <- c(round(MAD_MLR_m3,2), round(MAD_Dectree1,2), round(MAD_test_rf2,2))
MAPE <- c(round(MAPE_MLR_m3,2),round(MAPE_Dectree1,2),round(MAPE_test_rf2,2))
Acu_tbl_combine <- data.frame(ML_Algorithms,RMSE, MAD, MAPE)
Let's look at the table below to compare RMSE, MAD and MAPE.
# Comparison table for RMSE, MAPE and MAD
Acu_tbl_combine
## ML_Algorithms RMSE MAD MAPE
## 1 Multiple Linear Regression 94.12 121.34 62.22
## 2 Decision Tree 89.39 114.14 188.22
## 3 Random Forest 41.63 129.19 37.00
Conclusion:
From the table above, if we select based on test RMSE and MAPE, the best of the three models is the random forest after tuning its parameters; its mean absolute deviation on the test data is reported in the same table.
We also tried a KNN algorithm separately on this data; it achieved an RMSE of about 86 on the test data. The code is available on GitHub: https://github.com/abdulyunus/KNN-on-Washingtons-Bike-Sharing-Data.
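Finally, since the task assumes the code runs in a daily prediction service maintained by colleagues, a minimal sketch of how the chosen model could be persisted and reused (file path and function name are illustrative):
# Persist the tuned random forest once, after training (the path is illustrative).
saveRDS(rf2, file = "bike_rf_model.rds")
# Daily prediction job: load the model and score new, already preprocessed hourly data.
# new_hours is assumed to have the same columns and factor levels as train_df (minus cnt).
predict_daily <- function(new_hours, model_path = "bike_rf_model.rds") {
  model <- readRDS(model_path)
  pred  <- predict(model, newdata = new_hours)
  pmax(round(pred), 0)  # counts should be non-negative integers
}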