Please make an explorative data analysis and build a prediction model for the hourly utilization “cnt” of this data set: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. Your solution will consist of a short analysis (text ~ 1 DIN A4 with additional pages for plots) and the relevant code your report is about (source files).
In your report, present only one model, that you think is most suitable for a business-case. Summarize your reasons for choosing this model. Report the mean absolute deviations.
Assume that the code you are writing is used in production in a daily prediction service and maintained by your colleagues.
As per the instructions, the data was taken from the link above. Two data sets are available at that location: day.csv and hour.csv.
A Readme.txt file is also provided there; it describes the background of the problem, the variables and other data set characteristics.
Since the task is to forecast the number of users per hour, we will try several machine learning algorithms (multiple linear regression, a decision tree and a random forest) to estimate the hourly counts.
Let's start with multiple linear regression.
The first task is to install and load all the required libraries. The function below checks whether a package is already installed; if it is not, the function first installs it and then loads it.
# gc()
# rm(list = ls(all = TRUE))
# Install (if necessary) and load the required packages using the function below.
packages <- function(x) {
  # Capture the unquoted package name passed to the function
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(caret)
packages(car)
packages(caTools)
packages(tree)
packages(ISLR)
packages(rpart)
packages(rpart.plot)
packages(randomForest)
packages(e1071)
packages(tidyverse)
packages(mlbench)
We will load the input files from the local machine. The files are in '.csv' format. First, let's set the working directory to the folder where the input files are stored.
# provide the path of the input files
path <- "C:/Users/Abdul_Yunus/Desktop/Yunus_Personal/Learning/Case Study/JDA"
# Set working directory
setwd(path)
Now import the day.csv and hour.csv files from the local machine.
# Import day.csv file.
day_df <- read.csv('day.csv', header = TRUE)
# Import hour.csv file.
hour_df <- read.csv('hour.csv', header = TRUE)
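For a production service it may be preferable not to hard-code a personal directory. A minimal sketch of reading the same files from a configurable location (the environment variable name BIKE_DATA_DIR is purely illustrative):
# Read the input files from a configurable directory instead of a personal path.
# BIKE_DATA_DIR is an illustrative environment variable; it falls back to the working directory.
data_dir <- Sys.getenv("BIKE_DATA_DIR", unset = ".")
day_df  <- read.csv(file.path(data_dir, "day.csv"), header = TRUE)
hour_df <- read.csv(file.path(data_dir, "hour.csv"), header = TRUE)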
I would like to look at the data first to understand it. We can get an overview of both data sets using the summary and str functions.
Let's check the summary of the data sets.
summary(day_df)
## instant dteday season yr
## Min. : 1.0 2011-01-01: 1 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 2011-01-02: 1 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 2011-01-03: 1 Median :3.000 Median :1.0000
## Mean :366.0 2011-01-04: 1 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 2011-01-05: 1 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 2011-01-06: 1 Max. :4.000 Max. :1.0000
## (Other) :725
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
##
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
##
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
##
summary(hour_df)
## instant dteday season yr
## Min. : 1 2011-01-01: 24 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 2011-01-08: 24 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 2011-01-09: 24 Median :3.000 Median :1.0000
## Mean : 8690 2011-01-10: 24 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 2011-01-13: 24 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 2011-01-15: 24 Max. :4.000 Max. :1.0000
## (Other) :17235
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
##
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
##
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
##
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
##
We can also check the variable types using the str function.
str(day_df)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
str(hour_df)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
Data Analysis
Let's check whether there are any missing values in the data.
# Checking missing value in day_df data
sapply(day_df, function(x) sum(is.na(x)))
## instant dteday season yr mnth holiday
## 0 0 0 0 0 0
## weekday workingday weathersit temp atemp hum
## 0 0 0 0 0 0
## windspeed casual registered cnt
## 0 0 0 0
# Checking missing value in hour_df data
sapply(hour_df, function(x) sum(is.na(x)))
## instant dteday season yr mnth hr
## 0 0 0 0 0 0
## holiday weekday workingday weathersit temp atemp
## 0 0 0 0 0 0
## hum windspeed casual registered cnt
## 0 0 0 0 0
Some variables in the data are categorical but stored as integers. Let's convert these numeric variables into factors in both data sets.
# Convert the numeric variable into factor for day_df data
day_df$season <- as.factor(day_df$season)
day_df$yr <- as.factor(day_df$yr)
day_df$mnth <- as.factor(day_df$mnth)
day_df$holiday <- as.factor(day_df$holiday)
day_df$weekday <- as.factor(day_df$weekday)
day_df$workingday <- as.factor(day_df$workingday)
day_df$weathersit <- as.factor(day_df$weathersit)
# Convert the numeric variable into factor for hour_df data
hour_df$season <- as.factor(hour_df$season)
hour_df$yr <- as.factor(hour_df$yr)
hour_df$mnth <- as.factor(hour_df$mnth)
hour_df$holiday <- as.factor(hour_df$holiday)
hour_df$weekday <- as.factor(hour_df$weekday)
hour_df$workingday <- as.factor(hour_df$workingday)
hour_df$weathersit <- as.factor(hour_df$weathersit)
hour_df$hr <- as.factor(hour_df$hr)
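Since the same conversion is repeated for many columns, a more compact alternative, which may be easier for colleagues to maintain, is sketched below; it produces the same factors:
# Convert all labelled categorical columns to factors in one place.
factor_cols <- c("season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit")
day_df[factor_cols] <- lapply(day_df[factor_cols], as.factor)
hour_df[c(factor_cols, "hr")] <- lapply(hour_df[c(factor_cols, "hr")], as.factor)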
The column 'instant' is simply a row index, so we can remove it from both data sets.
day_df <- day_df[,-1]
hour_df <- hour_df[,-1]
Let's do the exploratory data analysis on the hour_df data.
First, how does the season affect bike rentals? We can plot the average number of bikes rented per hour for each season.
# Get the average count of bikes rented by season and hr.
season_summary_by_hour <- hour_df %>%
  select(season, hr, cnt) %>%
  group_by(season, hr) %>%
  summarise(Count = mean(cnt))
# Let's plot this
ggplot(season_summary_by_hour, aes(x = hr, y = Count, color = as.factor(season), group = as.factor(season))) +
  geom_point() +
  geom_line() +
  ggtitle("Bikes Rent By Season") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_hue('Season', breaks = levels(as.factor(season_summary_by_hour$season)),
                   labels = c('spring', 'summer', 'fall', 'winter'))
From the above plot we can see that, in every season, rentals peak around the commuting hours (roughly 8am and 5-6pm), and that average counts are lowest in season 1 and highest in seasons 2 and 3.
We can check in the same way how the weather affects bike rentals.
This can be done by plotting the average number of bikes rented per hour for each weather situation.
# Get the average count of bikes rented by weather situation and hr.
weather_summary_by_hour <- hour_df %>%
  select(weathersit, hr, cnt) %>%
  group_by(weathersit, hr) %>%
  summarise(Count = mean(cnt))
ggplot(weather_summary_by_hour, aes(x = hr, y = Count, color = as.factor(weathersit), group = as.factor(weathersit))) +
  geom_point() +
  geom_line() +
  ggtitle("Bikes Rent By Weather") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_hue('Weathersit', breaks = levels(as.factor(hour_df$weathersit)),
                   labels = c('Good', 'Normal', 'Bad', 'Very Bad'))
From the above plot we can see that rentals are highest in good weather and drop as the weather worsens; the 'Very Bad' category contains only a handful of observations, so its line is barely visible.
Do working days and non-working days have an impact on bike renting?
We can plot the hourly rental activity separately for working days and non-working days.
# Taking a subset of the hour_df data for working days.
hour_df.wd <- subset(hour_df, workingday == 1)
# Plotting for the subset data
ggplot(hour_df.wd, aes( x = hr, y = cnt))+
geom_point(position = position_jitter(w = 1, h = 0), aes(color = temp))+
scale_color_gradient(low = "#88d8b0", high = "#ff6f69") +
ggtitle("Bike Count on Working Days") +
theme(plot.title = element_text(hjust = 0.5))
From the above plot we can see that, on working days, activity is concentrated around the morning and evening commute, and that higher temperatures (warmer colours) go together with higher counts.
# Taking a subset of the hour_df data for non-working days.
hour_df.nwd <- subset(hour_df, workingday == 0)
# Plotting for the subset data
ggplot(hour_df.nwd, aes( x = hr, y = cnt))+
geom_point(position = position_jitter(w = 1, h = 0), aes(color = temp))+
scale_color_gradient(low = "#88d8b0", high = "#ff6f69") +
ggtitle("Bike Count on Non-Working Days")+
theme(plot.title = element_text(hjust = 0.5))
During non-working days there is a gradual increase in rental activity that peaks between about 1pm and 3pm.
In summary, working days show peak activity during the morning commute (around 8am) and right after work (around 6pm), with some lunchtime activity, whereas non-working days show a steady rise and fall over the afternoon.
First Model - Multiple Linear Regression
Let's start with multiple linear regression, one of my favourite algorithms. We will use the hour_df data for model building, as day_df is simply an aggregation of hour_df.
The data contains several labelled categorical variables; we can one-hot encode them. Let's do this on hour_df, since that is the data set we use for modelling.
# One Hot Encoding for season
hour_df$season_2 <- ifelse(hour_df$season ==2,1,0)
hour_df$season_3 <- ifelse(hour_df$season ==3,1,0)
hour_df$season_4 <- ifelse(hour_df$season ==4,1,0)
# One Hot Encoding for weathersit
hour_df$weathersit_2 <- ifelse(hour_df$weathersit ==2,1,0)
hour_df$weathersit_3 <- ifelse(hour_df$weathersit ==3,1,0)
hour_df$weathersit_4 <- ifelse(hour_df$weathersit ==4,1,0)
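As an aside, the same dummy columns can also be generated automatically; a minimal sketch using base R's model.matrix (note that lm() would in any case expand factor predictors on its own):
# Equivalent dummy columns via model.matrix (the first level is dropped as the reference level)
season_dummies  <- model.matrix(~ season, data = hour_df)[, -1]
weather_dummies <- model.matrix(~ weathersit, data = hour_df)[, -1]
head(season_dummies)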
Since we have created dummy variables for season and weathersit, let's drop the original columns and keep only the respective dummy variables.
# Dropping variables 'season' and 'weather'.
final_df <- subset(hour_df, select = - c(season, weathersit))
Since the date information is already captured by the year, month, weekday and hour columns, we can remove the dteday column. The column cnt is the sum of casual and registered, so to avoid leaking the target we also remove those two variables.
# dropping variables 'dteday', 'casual', 'registered'
final_df <- subset(final_df, select = -c(dteday, casual, registered))
Data partitioning
Before building the model, we partition the data into a training set and a test set. We will use the training data to build the model and validate its performance on the test data.
Let's split the data into 75% for training and 25% for testing. The proportion can be changed if required.
set.seed(123)
id <- sample.split(Y = final_df$cnt, SplitRatio = 0.75)
train_df <- subset(final_df, id == "TRUE")
test_df <- subset(final_df, id == "FALSE")
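Because the business case is a daily prediction service, a chronological split (train on earlier dates, predict later ones) may reflect production behaviour better than a random split. A sketch of that alternative; the cutoff date is an illustrative choice, and since final_df was derived from hour_df without dropping rows, the two are row-aligned:
# Illustrative chronological split: train on everything before October 2012, test on the rest.
cutoff   <- as.Date("2012-10-01")
is_train <- as.Date(hour_df$dteday) < cutoff
train_ts <- final_df[is_train, ]
test_ts  <- final_df[!is_train, ]
The analysis below continues with the random 75/25 split.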
First, we fit a regression model using all the feature variables against the target and check how the model performs.
# Fitting the multiple linear regression model
m1 <- lm(formula = cnt ~., data = train_df)
# Check the summary to interpret the model.
summary(m1)
##
## Call:
## lm(formula = cnt ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.08 -61.36 -7.24 51.98 429.24
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.264 7.770 -10.845 < 2e-16 ***
## yr1 86.667 1.830 47.366 < 2e-16 ***
## mnth2 4.198 4.612 0.910 0.362715
## mnth3 17.874 5.175 3.454 0.000554 ***
## mnth4 10.236 7.656 1.337 0.181208
## mnth5 21.831 8.200 2.662 0.007770 **
## mnth6 6.905 8.421 0.820 0.412256
## mnth7 -12.179 9.421 -1.293 0.196116
## mnth8 11.351 9.189 1.235 0.216763
## mnth9 35.494 8.156 4.352 1.36e-05 ***
## mnth10 23.695 7.566 3.132 0.001742 **
## mnth11 -3.793 7.300 -0.520 0.603391
## mnth12 -1.540 5.807 -0.265 0.790800
## hr1 -17.867 6.267 -2.851 0.004366 **
## hr2 -27.671 6.304 -4.390 1.14e-05 ***
## hr3 -39.201 6.384 -6.141 8.44e-10 ***
## hr4 -42.755 6.298 -6.789 1.18e-11 ***
## hr5 -24.659 6.313 -3.906 9.44e-05 ***
## hr6 34.790 6.296 5.526 3.34e-08 ***
## hr7 173.756 6.278 27.677 < 2e-16 ***
## hr8 315.912 6.246 50.578 < 2e-16 ***
## hr9 160.249 6.327 25.329 < 2e-16 ***
## hr10 105.700 6.287 16.813 < 2e-16 ***
## hr11 135.831 6.368 21.330 < 2e-16 ***
## hr12 171.835 6.454 26.625 < 2e-16 ***
## hr13 167.973 6.494 25.865 < 2e-16 ***
## hr14 148.624 6.504 22.850 < 2e-16 ***
## hr15 156.601 6.492 24.121 < 2e-16 ***
## hr16 223.356 6.472 34.509 < 2e-16 ***
## hr17 386.151 6.420 60.153 < 2e-16 ***
## hr18 351.793 6.344 55.449 < 2e-16 ***
## hr19 237.894 6.327 37.599 < 2e-16 ***
## hr20 157.780 6.314 24.987 < 2e-16 ***
## hr21 108.213 6.292 17.198 < 2e-16 ***
## hr22 71.484 6.273 11.395 < 2e-16 ***
## hr23 31.172 6.270 4.972 6.72e-07 ***
## holiday1 -32.396 5.780 -5.604 2.13e-08 ***
## weekday1 11.510 3.482 3.305 0.000952 ***
## weekday2 13.766 3.427 4.017 5.93e-05 ***
## weekday3 15.752 3.400 4.633 3.65e-06 ***
## weekday4 15.363 3.400 4.518 6.29e-06 ***
## weekday5 19.452 3.412 5.702 1.21e-08 ***
## weekday6 17.814 3.395 5.247 1.57e-07 ***
## workingday1 NA NA NA NA
## temp 116.792 35.055 3.332 0.000866 ***
## atemp 124.540 36.457 3.416 0.000637 ***
## hum -85.279 6.494 -13.132 < 2e-16 ***
## windspeed -34.078 8.240 -4.136 3.56e-05 ***
## season_2 39.267 5.677 6.917 4.84e-12 ***
## season_3 34.038 6.709 5.073 3.96e-07 ***
## season_4 65.715 5.709 11.511 < 2e-16 ***
## weathersit_2 -10.933 2.248 -4.863 1.17e-06 ***
## weathersit_3 -65.392 3.792 -17.247 < 2e-16 ***
## weathersit_4 -62.763 59.788 -1.050 0.293851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.2 on 13023 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6884
## F-statistic: 556.5 on 52 and 13023 DF, p-value: < 2.2e-16
The model summary shows an adjusted R-squared of around 0.69. Note that the coefficient for workingday1 is reported as NA: workingday is fully determined by weekday and holiday, so it is collinear with them and the singularity is harmless. Since we used all the variables in the model, several of them are not significant; we can keep only selected variables and check the model performance again.
We can use the stepAIC function from the MASS package to select the variables for our model.
# Let's load the library MASS
packages(MASS)
var_select <- stepAIC(m1, direction = 'both')
## Start: AIC=121309.5
## cnt ~ yr + mnth + hr + holiday + weekday + workingday + temp +
## atemp + hum + windspeed + season_2 + season_3 + season_4 +
## weathersit_2 + weathersit_3 + weathersit_4
##
##
## Step: AIC=121309.5
## cnt ~ yr + mnth + hr + holiday + weekday + temp + atemp + hum +
## windspeed + season_2 + season_3 + season_4 + weathersit_2 +
## weathersit_3 + weathersit_4
##
## Df Sum of Sq RSS AIC
## - weathersit_4 1 11735 138692843 121309
## <none> 138681108 121309
## - temp 1 118206 138799314 121319
## - atemp 1 124270 138805378 121319
## - windspeed 1 182148 138863256 121325
## - weathersit_2 1 251877 138932985 121331
## - season_3 1 274090 138955198 121333
## - holiday 1 334480 139015588 121339
## - weekday 6 454279 139135387 121340
## - season_2 1 509467 139190575 121355
## - season_4 1 1410924 140092032 121440
## - mnth 11 1971654 140652762 121472
## - hum 1 1836367 140517475 121479
## - weathersit_3 1 3167461 141848569 121603
## - yr 1 23891542 162572650 123386
## - hr 23 155305841 293986949 131088
##
## Step: AIC=121308.6
## cnt ~ yr + mnth + hr + holiday + weekday + temp + atemp + hum +
## windspeed + season_2 + season_3 + season_4 + weathersit_2 +
## weathersit_3
##
## Df Sum of Sq RSS AIC
## <none> 138692843 121309
## + weathersit_4 1 11735 138681108 121309
## - temp 1 118054 138810897 121318
## - atemp 1 124825 138817668 121318
## - windspeed 1 182711 138875554 121324
## - weathersit_2 1 249431 138942274 121330
## - season_3 1 274007 138966850 121332
## - holiday 1 333719 139026562 121338
## - weekday 6 453148 139145991 121339
## - season_2 1 509725 139202568 121355
## - season_4 1 1410902 140103744 121439
## - mnth 11 1974778 140667621 121471
## - hum 1 1852688 140545530 121480
## - weathersit_3 1 3159475 141852318 121601
## - yr 1 23882921 162575764 123384
## - hr 23 155311734 294004577 131087
# MASS masks some dplyr functions (e.g. select), so detach it once we are done; it can be reloaded later if needed.
detach("package:MASS", unload=TRUE)
After running stepAIC in both directions (forward and backward), we obtain the selected variables. The corresponding data can be taken directly from the stepAIC output; we save it separately.
data.selected.variables <- var_select$model
Let's use data.selected.variables to fit our model.
# Fitting the multiple linear regression model using selected variables after using StepAIC
m2 <- lm(formula = cnt ~., data = data.selected.variables)
summary(m2)
##
## Call:
## lm(formula = cnt ~ ., data = data.selected.variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.04 -61.36 -7.24 51.99 429.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.259 7.770 -10.844 < 2e-16 ***
## yr1 86.647 1.830 47.358 < 2e-16 ***
## mnth2 4.359 4.610 0.946 0.344393
## mnth3 18.022 5.173 3.484 0.000496 ***
## mnth4 10.363 7.655 1.354 0.175814
## mnth5 21.962 8.199 2.679 0.007402 **
## mnth6 7.007 8.420 0.832 0.405344
## mnth7 -12.074 9.421 -1.282 0.199992
## mnth8 11.476 9.188 1.249 0.211709
## mnth9 35.642 8.155 4.371 1.25e-05 ***
## mnth10 23.852 7.565 3.153 0.001619 **
## mnth11 -3.628 7.298 -0.497 0.619101
## mnth12 -1.363 5.804 -0.235 0.814360
## hr1 -17.975 6.266 -2.869 0.004131 **
## hr2 -27.662 6.304 -4.388 1.15e-05 ***
## hr3 -39.189 6.384 -6.139 8.54e-10 ***
## hr4 -42.737 6.298 -6.786 1.20e-11 ***
## hr5 -24.639 6.313 -3.903 9.56e-05 ***
## hr6 34.809 6.296 5.529 3.29e-08 ***
## hr7 173.772 6.278 27.680 < 2e-16 ***
## hr8 315.916 6.246 50.579 < 2e-16 ***
## hr9 160.238 6.327 25.327 < 2e-16 ***
## hr10 105.671 6.287 16.809 < 2e-16 ***
## hr11 135.789 6.368 21.324 < 2e-16 ***
## hr12 171.778 6.454 26.617 < 2e-16 ***
## hr13 167.909 6.494 25.856 < 2e-16 ***
## hr14 148.555 6.504 22.840 < 2e-16 ***
## hr15 156.530 6.492 24.112 < 2e-16 ***
## hr16 223.175 6.470 34.493 < 2e-16 ***
## hr17 386.089 6.419 60.145 < 2e-16 ***
## hr18 351.629 6.343 55.440 < 2e-16 ***
## hr19 237.856 6.327 37.594 < 2e-16 ***
## hr20 157.752 6.314 24.983 < 2e-16 ***
## hr21 108.195 6.292 17.195 < 2e-16 ***
## hr22 71.476 6.273 11.394 < 2e-16 ***
## hr23 31.164 6.270 4.970 6.77e-07 ***
## holiday1 -32.359 5.780 -5.598 2.21e-08 ***
## weekday1 11.467 3.482 3.293 0.000993 ***
## weekday2 13.756 3.427 4.014 6.00e-05 ***
## weekday3 15.709 3.400 4.620 3.87e-06 ***
## weekday4 15.348 3.400 4.514 6.43e-06 ***
## weekday5 19.440 3.412 5.698 1.24e-08 ***
## weekday6 17.772 3.395 5.235 1.68e-07 ***
## temp 116.716 35.055 3.330 0.000872 ***
## atemp 124.815 36.456 3.424 0.000620 ***
## hum -85.576 6.488 -13.190 < 2e-16 ***
## windspeed -34.130 8.240 -4.142 3.46e-05 ***
## season_2 39.277 5.677 6.919 4.78e-12 ***
## season_3 34.033 6.709 5.073 3.98e-07 ***
## season_4 65.715 5.709 11.510 < 2e-16 ***
## weathersit_2 -10.877 2.247 -4.840 1.32e-06 ***
## weathersit_3 -65.286 3.790 -17.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.2 on 13024 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6884
## F-statistic: 567.3 on 51 and 13024 DF, p-value: < 2.2e-16
Almost all the variables used in this model are significant, and we get a similar adjusted R-squared (about 0.69).
Let's make predictions on the test data using the predict function.
pred <- predict(object = m2, newdata = test_df)
Since we are predicting counts of users renting bikes, these numbers should be non-negative. Let's check whether all predictions on the test data are positive.
test_df_check <- cbind(test_df,pred)
test_df_check$Validation <- ifelse(test_df_check$pred <0,1,0)
sum(test_df_check$Validation)
## [1] 453
The model predicts 453 negative values, which is around 10% of our test data.
Since the model predicts negative counts, we can transform the target variable: take the natural log of the target and rebuild the model.
Let's try this.
m3 <- lm(formula = log(cnt) ~., data = data.selected.variables)
summary(m3)
##
## Call:
## lm(formula = log(cnt) ~ ., data = data.selected.variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3404 -0.2965 0.0324 0.3749 2.5388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.605497 0.046783 55.693 < 2e-16 ***
## yr1 0.473625 0.011016 42.993 < 2e-16 ***
## mnth2 0.123829 0.027756 4.461 8.21e-06 ***
## mnth3 0.154067 0.031149 4.946 7.66e-07 ***
## mnth4 0.132929 0.046090 2.884 0.00393 **
## mnth5 0.256752 0.049367 5.201 2.01e-07 ***
## mnth6 0.142619 0.050700 2.813 0.00492 **
## mnth7 -0.020115 0.056722 -0.355 0.72288
## mnth8 0.057162 0.055324 1.033 0.30152
## mnth9 0.143738 0.049101 2.927 0.00342 **
## mnth10 0.070128 0.045547 1.540 0.12366
## mnth11 -0.006192 0.043943 -0.141 0.88794
## mnth12 -0.008070 0.034947 -0.231 0.81737
## hr1 -0.629722 0.037730 -16.690 < 2e-16 ***
## hr2 -1.203189 0.037955 -31.701 < 2e-16 ***
## hr3 -1.765851 0.038436 -45.943 < 2e-16 ***
## hr4 -2.052123 0.037919 -54.119 < 2e-16 ***
## hr5 -0.932062 0.038012 -24.520 < 2e-16 ***
## hr6 0.270819 0.037909 7.144 9.55e-13 ***
## hr7 1.263119 0.037800 33.416 < 2e-16 ***
## hr8 1.882329 0.037608 50.051 < 2e-16 ***
## hr9 1.547435 0.038094 40.621 < 2e-16 ***
## hr10 1.211942 0.037852 32.018 < 2e-16 ***
## hr11 1.352827 0.038342 35.283 < 2e-16 ***
## hr12 1.527933 0.038858 39.321 < 2e-16 ***
## hr13 1.505136 0.039101 38.493 < 2e-16 ***
## hr14 1.413821 0.039162 36.102 < 2e-16 ***
## hr15 1.448736 0.039088 37.063 < 2e-16 ***
## hr16 1.723401 0.038957 44.238 < 2e-16 ***
## hr17 2.132386 0.038651 55.170 < 2e-16 ***
## hr18 2.063892 0.038189 54.044 < 2e-16 ***
## hr19 1.763117 0.038096 46.281 < 2e-16 ***
## hr20 1.474724 0.038020 38.789 < 2e-16 ***
## hr21 1.228019 0.037886 32.413 < 2e-16 ***
## hr22 0.983553 0.037772 26.039 < 2e-16 ***
## hr23 0.568499 0.037752 15.059 < 2e-16 ***
## holiday1 -0.161585 0.034804 -4.643 3.47e-06 ***
## weekday1 -0.045083 0.020967 -2.150 0.03155 *
## weekday2 -0.054735 0.020634 -2.653 0.00799 **
## weekday3 -0.037803 0.020471 -1.847 0.06483 .
## weekday4 0.007848 0.020473 0.383 0.70147
## weekday5 0.119542 0.020542 5.819 6.05e-09 ***
## weekday6 0.101080 0.020441 4.945 7.71e-07 ***
## temp 0.367901 0.211067 1.743 0.08135 .
## atemp 1.197141 0.219505 5.454 5.02e-08 ***
## hum -0.278811 0.039064 -7.137 1.00e-12 ***
## windspeed -0.206134 0.049611 -4.155 3.27e-05 ***
## season_2 0.307935 0.034182 9.009 < 2e-16 ***
## season_3 0.383962 0.040397 9.505 < 2e-16 ***
## season_4 0.602964 0.034375 17.541 < 2e-16 ***
## weathersit_2 -0.037556 0.013532 -2.775 0.00552 **
## weathersit_3 -0.581544 0.022821 -25.482 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6213 on 13024 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8258
## F-statistic: 1217 on 51 and 13024 DF, p-value: < 2.2e-16
Let's make predictions on the test data using the predict function. Since the model was fitted on log(cnt), we apply the exponential function to convert the predictions back to the original count scale.
pred <- exp(predict(object = m3, newdata = test_df))
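One caveat when back-transforming: roughly speaking, exp() of the log-scale prediction targets the conditional median rather than the mean count. If mean counts matter for the business case, a simple correction worth considering is Duan's smearing estimator; a minimal sketch using the fitted model m3:
# Duan's smearing estimator: scale the exponentiated predictions by the mean of exp(residuals)
smear    <- mean(exp(residuals(m3)))
pred_adj <- smear * exp(predict(object = m3, newdata = test_df))
The uncorrected predictions are used in the remainder of the analysis.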
We can again do the validation for negative values.
test_df_check <- cbind(test_df,pred)
test_df_check$Validation <- ifelse(test_df_check$pred <0,1,0)
sum(test_df_check$Validation)
## [1] 0
Great: after applying the natural log to the target variable, we no longer get any negative predictions. Let's check the accuracy of this model.
packages(DescTools)
RMSE_MLR_m3 <- RMSE(x = pred, ref = test_df$cnt, na.rm = T)
MAD_MLR_m3 <- sum(abs(test_df_check$pred - mean(test_df_check$pred))/length(test_df_check$pred))
MAPE_MLR_m3 <- mean(abs((test_df_check$cnt - test_df_check$pred)/test_df_check$cnt)*100)
MLR_Accuracy <- data.frame(RMSE_MLR_m3, MAD_MLR_m3,MAPE_MLR_m3)
MLR_Accuracy
## RMSE_MLR_m3 MAD_MLR_m3 MAPE_MLR_m3
## 1 94.12152 121.3354 62.22146
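Note that the MAD computed above measures how much the predictions spread around their own mean. For a forecast, the mean absolute deviation is more commonly taken against the observed values (i.e. the mean absolute error); a minimal sketch of that alternative, using the same objects:
# Mean absolute error of the predictions against the observed hourly counts
MAE_MLR_m3 <- mean(abs(test_df_check$cnt - test_df_check$pred))
MAE_MLR_m3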
We have built this model and checked its accuracy. Let's try some other algorithms on the data.
Second Model - Decision Tree
We will build a decision tree model using the rpart function. Let's first create a base model with default parameters on the training data.
tree1 <- rpart(cnt ~., data = train_df)
Let’s plot the tree
rpart.plot(tree1)
The model is ready; we can generate predictions using the predict function.
pred1 = predict(object = tree1, newdata = test_df)
Let’s check if there is any negative value predicted.
test_df_val = cbind(test_df, pred1)
test_df_val$NegativeVal <- ifelse(test_df_val$pred1 < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(test_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let's calculate RMSE (Root Mean Square Error), MAD (Mean Absolute Deviation) and MAPE (Mean Absolute Percentage Error).
test_df_val$Error <- test_df_val$cnt - test_df_val$pred1
RMSE_Dectree1 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_Dectree1 <- sum(abs(test_df_val$pred1 - mean(test_df_val$pred1))/length(test_df_val$pred1))
MAPE_Dectree1 <- mean(abs((test_df_val$cnt - test_df_val$pred1)/test_df_val$cnt)*100)
Dec_tree_Accuracy <- data.frame(RMSE_Dectree1, MAD_Dectree1, MAPE_Dectree1)
# Model Accuracy for the decision tree
Dec_tree_Accuracy
## RMSE_Dectree1 MAD_Dectree1 MAPE_Dectree1
## 1 89.39187 114.1421 188.2216
Let's compare this with the results of our regression model.
# Model accuracy for the Multiple Linear regression for comparision
MLR_Accuracy
## RMSE_MLR_m3 MAD_MLR_m3 MAPE_MLR_m3
## 1 94.12152 121.3354 62.22146
With the decision tree we get RMSE ~ 89 and MAD ~ 114, both somewhat better than the regression model, although the MAPE is much worse (188 vs 62), which is most likely driven by hours with very small counts.
Decision trees are prone to overfitting, and some textbooks note that a single tree can be a poor fit for continuous targets. We can mitigate this by constraining the tree parameters; a minimal sketch of what that could look like follows, after which we move on to a random forest.
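A sketch of such constraints with rpart (the parameter values are illustrative and not tuned; rpart.control and prune are standard rpart functions):
# A constrained tree: limit depth, require more observations per split, then prune by the cp
# that minimises the cross-validated error (xerror) in the cp table.
tree_ctrl    <- rpart.control(maxdepth = 8, minsplit = 50, cp = 0.001)
tree2        <- rpart(cnt ~ ., data = train_df, control = tree_ctrl)
best_cp      <- tree2$cptable[which.min(tree2$cptable[, "xerror"]), "CP"]
tree2_pruned <- prune(tree2, cp = best_cp)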
Third Model - Random Forest
Let's now build the random forest model on the training data.
set.seed(123)
rf1 = randomForest(cnt ~., data = train_df, mtry = ncol(train_df)-1)
The Out of Bag (OOB) error is randomForest's internal estimate of performance: for classification it is reported as an error rate, while for regression the model reports the mean of squared residuals and the percentage of variance explained. Note that mtry = ncol(train_df) - 1 tries all predictors at every split, which makes this forest essentially a bagged ensemble of trees.
Let's print the model.
print(rf1)
##
## Call:
## randomForest(formula = cnt ~ ., data = train_df, mtry = ncol(train_df) - 1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 16
##
## Mean of squared residuals: 1971.444
## % Var explained: 94.23
We can check the variable importance using the below code.
# Priority of Variables
rf1$importance
## IncNodePurity
## yr 41989315.08
## mnth 16359226.85
## hr 258901477.39
## holiday 681684.81
## weekday 12203623.53
## workingday 28118041.89
## temp 47726926.51
## atemp 9645114.73
## hum 10461059.80
## windspeed 3706120.04
## season_2 556466.59
## season_3 138943.24
## season_4 5246309.07
## weathersit_2 683956.15
## weathersit_3 7040222.23
## weathersit_4 10030.21
Predictions on the training data set can be generated with the code below. Note that these are in-sample predictions, so the error metrics derived from them describe the training fit rather than out-of-sample performance.
pred_rf1 = predict(object = rf1, newdata = train_df)
We can also check whether the model produces any negative output.
train_df_val = cbind(train_df, pred_rf1)
train_df_val$NegativeVal <- ifelse(train_df_val$pred_rf1 < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(train_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let's calculate RMSE (Root Mean Square Error), MAD (Mean Absolute Deviation) and MAPE (Mean Absolute Percentage Error).
train_df_val$Error <- train_df_val$cnt - train_df_val$pred_rf1
# Calculate RMSE
RMSE_train_rf1 <- sum(sqrt(mean((train_df_val$Error)^2)))
# Calculate MAD
MAD_train_rf1 <- sum(abs(train_df_val$pred_rf1 - mean(train_df_val$pred_rf1))/length(train_df_val$pred_rf1))
# Calculate MAPE
MAPE_train_rf1 <- mean(abs((train_df_val$cnt - train_df_val$pred_rf1)/train_df_val$cnt)*100)
Accuracy_table <- data.frame(RMSE_Dectree1, MAD_Dectree1, MAPE_Dectree1, RMSE_train_rf1,MAD_train_rf1, MAPE_train_rf1)
Accuracy_table
## RMSE_Dectree1 MAD_Dectree1 MAPE_Dectree1 RMSE_train_rf1 MAD_train_rf1
## 1 89.39187 114.1421 188.2216 19.74873 140.3387
## MAPE_train_rf1
## 1 15.77177
Prediction on the test data set
pred_rf_test = predict(object = rf1, newdata = test_df)
Check if the model is producing any negative values
test_df_val = cbind(test_df, pred_rf_test)
test_df_val$NegativeVal <- ifelse(test_df_val$pred_rf_test < 0,1,0)
# If we take the sum of all the values in the column 'NegativeVal' we can find the number of negative values.
sum(test_df_val$NegativeVal)
## [1] 0
Since the sum of the ‘NegativeVal’ column is zero, it means the algorithm is not predicting any negative value.
Let’s calculate RMSE, MAD and MAPE.
test_df_val$Error <- test_df_val$cnt - test_df_val$pred_rf_test
RMSE_test_rf1 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_test_rf1 <- sum(abs(test_df_val$pred_rf_test- mean(test_df_val$pred_rf_test))/length(test_df_val$pred_rf_test))
MAPE_test_rf1 <- mean(abs((test_df_val$cnt - test_df_val$pred_rf_test)/test_df_val$cnt)*100)
Accuracy_table <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,MAD_Dectree1,MAD_train_rf1,MAD_test_rf1, MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1)
Accuracy_table
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 MAD_Dectree1 MAD_train_rf1
## 1 89.39187 19.74873 41.63328 114.1421 140.3387
## MAD_test_rf1 MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1
## 1 129.1864 188.2216 15.77177 36.93351
**Error rate of the random forest.** Let's plot the RF model to check the OOB error: the plot shows where the OOB error levels off and is therefore also useful for deciding how many trees the random forest needs.
plot(rf1)
From the plot, we can see that after about 100 trees the error is steady and drops no further. Let's tune the model to find a good mtry value, setting ntreeTry = 100 in the tuneRF function.
t = tuneRF(train_df, train_df$cnt,
stepFactor = 1.2,
plot = T,
ntreeTry = 100,
improve = 0.05)
## mtry = 5 OOB error = 91.40781
## Searching left ...
## Searching right ...
## mtry = 6 OOB error = 44.33138
## 0.5150154 0.05
## mtry = 7 OOB error = 23.44444
## 0.4711548 0.05
## mtry = 8 OOB error = 13.76817
## 0.4127317 0.05
## mtry = 9 OOB error = 9.118621
## 0.337703 0.05
## mtry = 10 OOB error = 7.247002
## 0.2052524 0.05
## mtry = 12 OOB error = 5.819201
## 0.1970195 0.05
## mtry = 14 OOB error = 7.646352
## -0.3139864 0.05
After this tuning, the suggested mtry value is 12 (but see the caveat below). Let's put these parameters (mtry = 12 and ntree = 100) into the model and run it again.
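Before relying on that suggestion, note one caveat: the first argument passed to tuneRF above is the full train_df, which still contains the target cnt among the predictors, so the very low OOB errors at large mtry are partly an artefact of that leakage. A corrected call would exclude the target, for example:
# Exclude the target column from the predictor matrix passed to tuneRF
x_train <- train_df[, setdiff(names(train_df), "cnt")]
t2 <- tuneRF(x_train, train_df$cnt,
             stepFactor = 1.2,
             plot = TRUE,
             ntreeTry = 100,
             improve = 0.05)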
set.seed(1234)
rf2 = randomForest(cnt ~., data = train_df,
mtry = 12,
ntree = 100,
importance = T,
proximity = T)
print(rf2)
##
## Call:
## randomForest(formula = cnt ~ ., data = train_df, mtry = 12, ntree = 100, importance = T, proximity = T)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 1989.138
## % Var explained: 94.18
We can check the important variables.
# Variable importance (%IncMSE and IncNodePurity; the Gini index applies only to classification forests)
rf2$importance
## %IncMSE IncNodePurity
## yr 6668.66614 4.066315e+07
## mnth 2430.61801 1.601167e+07
## hr 45062.90318 2.506876e+08
## holiday 70.63313 8.456058e+05
## weekday 3774.70323 1.460731e+07
## workingday 6300.85024 2.503173e+07
## temp 5163.68085 3.562984e+07
## atemp 3414.24245 3.090593e+07
## hum 1880.73628 1.280895e+07
## windspeed 142.01234 3.971474e+06
## season_2 87.05899 5.902376e+05
## season_3 40.72671 2.576685e+05
## season_4 945.32282 4.541537e+06
## weathersit_2 42.97021 7.848554e+05
## weathersit_3 800.50463 6.901134e+06
## weathersit_4 0.00000 8.688723e+03
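For the plot pages of the report, the importances can also be visualized directly:
# Dot chart of %IncMSE and IncNodePurity for the tuned forest
varImpPlot(rf2, main = "Variable importance - tuned random forest")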
Prediction on the training data set
pred_rf2 = predict(object = rf2, newdata = train_df)
Let’s calculate RMSE, MAD and MAPE.
train_df <- cbind(train_df,pred_rf2)
train_df$Error <- train_df$cnt - train_df$pred_rf2
RMSE_train_rf2 <- sum(sqrt(mean((train_df$Error)^2)))
MAD_train_rf2 <- sum(abs(train_df$pred_rf2 - mean(train_df$pred_rf2, na.rm = T))/length(train_df$pred_rf2))
MAPE_train_rf2 <- mean(abs((train_df$cnt - train_df$pred_rf2)/train_df$cnt)*100)
Accuracy_table_RMSE <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,RMSE_train_rf2)
Accuracy_table_MAD <- data.frame(MAD_Dectree1, MAD_train_rf1,MAD_test_rf1, MAD_train_rf2)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2)
# Let's see the tables
Accuracy_table_RMSE
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 RMSE_train_rf2
## 1 89.39187 19.74873 41.63328 20.20037
Accuracy_table_MAD
## MAD_Dectree1 MAD_train_rf1 MAD_test_rf1 MAD_train_rf2
## 1 114.1421 140.3387 129.1864 139.923
Accuracy_table_MAPE
## MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1 MAPE_train_rf2
## 1 188.2216 15.77177 36.93351 16.62618
**Prediction on the test data set**
pred_rf_test = predict(object = rf2, newdata = test_df)
Let’s calculate RMSE, MAD and MAPE.
# Refresh the prediction column with the rf2 predictions (test_df_val still holds the rf1 predictions from the earlier cbind)
test_df_val$pred_rf_test <- pred_rf_test
test_df_val$Error <- test_df_val$cnt - test_df_val$pred_rf_test
RMSE_test_rf2 <- sum(sqrt(mean((test_df_val$Error)^2)))
MAD_test_rf2 <- sum(abs(test_df_val$pred_rf_test- mean(test_df_val$pred_rf_test))/length(test_df_val$pred_rf_test))
MAPE_test_rf2 <- mean(abs((test_df_val$cnt - test_df_val$pred_rf_test)/test_df_val$cnt)*100)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2)
Accuracy_table_RMSE <- data.frame(RMSE_Dectree1, RMSE_train_rf1,RMSE_test_rf1 ,RMSE_train_rf2,RMSE_test_rf2)
Accuracy_table_MAD <- data.frame(MAD_Dectree1, MAD_train_rf1,MAD_test_rf1, MAD_train_rf2,MAD_test_rf2)
Accuracy_table_MAPE <- data.frame(MAPE_Dectree1, MAPE_train_rf1,MAPE_test_rf1,MAPE_train_rf2,MAPE_test_rf2 )
We have now run both the decision tree and the random forest on our data and calculated RMSE, MAD and MAPE, so we can compare the results.
# Let's see the tables
Accuracy_table_RMSE
## RMSE_Dectree1 RMSE_train_rf1 RMSE_test_rf1 RMSE_train_rf2 RMSE_test_rf2
## 1 89.39187 19.74873 41.63328 20.20037 41.63328
Accuracy_table_MAD
## MAD_Dectree1 MAD_train_rf1 MAD_test_rf1 MAD_train_rf2 MAD_test_rf2
## 1 114.1421 140.3387 129.1864 139.923 129.1864
Accuracy_table_MAPE
## MAPE_Dectree1 MAPE_train_rf1 MAPE_test_rf1 MAPE_train_rf2 MAPE_test_rf2
## 1 188.2216 15.77177 36.93351 16.62618 36.93351
The final accuracy table on the test data is built below.
ML_Algorithms <- c('Multiple Linear Regression','Decision Tree','Random Forest')
RMSE <- c(round(RMSE_MLR_m3,2), round(RMSE_Dectree1,2), round(RMSE_test_rf2,2))
MAD <- c(round(MAD_MLR_m3,2), round(MAD_Dectree1,2), round(MAD_test_rf2,2))
MAPE <- c(round(MAPE_MLR_m3,2),round(MAPE_Dectree1,2),round(MAPE_test_rf2,2))
Acu_tbl_combine <- data.frame(ML_Algorithms,RMSE, MAD, MAPE)
Let's look at the table below to compare RMSE, MAD and MAPE.
# Comparison table for RMSE, MAPE and MAD
Acu_tbl_combine
## ML_Algorithms RMSE MAD MAPE
## 1 Multiple Linear Regression 94.12 121.34 62.22
## 2 Decision Tree 89.39 114.14 188.22
## 3 Random Forest 41.63 129.19 37.00
Conclusion:
From the table above, if we select based on test RMSE and MAPE, the best of the three models is the random forest after tuning its parameters; its mean absolute deviation on the test data is reported in the same table.
We also tried a KNN algorithm separately on this data; it achieved an RMSE of about 86 on the test data. The code is available on GitHub: https://github.com/abdulyunus/KNN-on-Washingtons-Bike-Sharing-Data.
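Finally, since the task assumes the code runs in a daily prediction service maintained by colleagues, a minimal sketch of how the chosen model could be persisted and reused (file path and function name are illustrative):
# Persist the tuned random forest once, after training (the path is illustrative).
saveRDS(rf2, file = "bike_rf_model.rds")
# Daily prediction job: load the model and score new, already preprocessed hourly data.
# new_hours is assumed to have the same columns and factor levels as train_df (minus cnt).
predict_daily <- function(new_hours, model_path = "bike_rf_model.rds") {
  model <- readRDS(model_path)
  pred  <- predict(model, newdata = new_hours)
  pmax(round(pred), 0)  # counts should be non-negative integers
}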