Data Set Description and Preparation

Rental bikes have been introduced in many urban cities to enhance mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting times; ensuring the city has a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that this stable supply can be maintained. This project aims to predict the number of bikes rented and to find the best model for the regression task. The data set contains weather information (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall) and the number of bikes rented per hour as the outcome variable.

Let's have a look at the data set.

my_data<-read.csv("SeoulBikeData.csv")
glimpse(my_data)
## Rows: 8,760
## Columns: 14
## $ Date            <chr> "1/12/2017", "1/12/2017", "1/12/2017", "1/12/2017", "1~
## $ rntd_bike_count <dbl> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, ~
## $ hour            <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ~
## $ temperature     <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity        <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25~
## $ windspeed       <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,~
## $ visibility      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dewpointemp     <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3~
## $ solar_rad       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons         <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Win~
## $ holiday         <chr> "No Holiday", "No Holiday", "No Holiday", "No Holiday"~
## $ functioningday  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"~
dim(my_data)
## [1] 8760   14

In total we have 8,760 observations and 14 features, including our outcome variable. The output above shows that we have both integer and character variables; we need to convert the character variables to factors. But first, let's get rid of the date feature, as it is not a meaningful predictor here.

my_data <- my_data[, c(-1)] #removing date feature
names(my_data)
##  [1] "rntd_bike_count" "hour"            "temperature"     "humidity"       
##  [5] "windspeed"       "visibility"      "dewpointemp"     "solar_rad"      
##  [9] "rainfall"        "snowfall"        "seasons"         "holiday"        
## [13] "functioningday"

Let's take a look at the hour variable to see whether the number of rented bikes varies across the hours of the day. We can do this with a boxplot, so let's plot the number of rented bikes against the hour.

boxplot(my_data$rntd_bike_count ~ my_data$hour, xlab = "Hours", ylab="Bike Count")

In the boxplot above we can see that the number of rented bikes does vary with the hour. Demand picks up from 7 a.m. to 9 a.m., stays somewhat flat through the early afternoon, and starts to increase again after 15:00. Based on this we can categorize the hours into three levels: "low" (hours before 7), "Average" (hours 9 through 15) and "High" (hours 7-8 and 16 onwards). The raw hour numbers don't carry much information on their own, so it is more intuitive to work with these three categories instead.

# recode hour: "low" before 7 a.m., "Average" for 9:00-15:00,
# "High" for the peak hours (7-8 a.m. and 16:00 onwards)
my_data$hour <- ifelse(my_data$hour < 7, "low",
                           ifelse(my_data$hour >= 9 & my_data$hour <= 15, "Average", "High"))
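
As a quick sanity check (a sketch; output omitted), we can tabulate the new categories to confirm that every hour was mapped:

# sketch: counts of observations per hour category
table(my_data$hour)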

Before applying various feature selection methods, it's better to quickly scan all the columns and check whether any of them contain missing values.

colSums(is.na(my_data)) %>% sort() 
## rntd_bike_count            hour     temperature        humidity       windspeed 
##               0               0               0               0               0 
##      visibility     dewpointemp       solar_rad        rainfall        snowfall 
##               0               0               0               0               0 
##         seasons         holiday  functioningday 
##               0               0               0

We can see we don’t have any missing values in any of the columns so we can proceed further with our analysis.

We have both categorical and numerical features in our data set, so let's store their names in two separate objects. First we make a list of the categorical (character) variables and convert them to factors; then we make a list of the numeric features and store their names in a vector named bikes_numerical.

bikes_categorical <- sapply(my_data, is.character) %>% which() %>% names()

#converting all categorical variables to factors by applying a loop

for (variable in bikes_categorical) {
  my_data[[variable]] <- as.factor(my_data[[variable]])
}

#list of numeric features in a separate vector 

bikes_numerical <- sapply(my_data, is.numeric) %>% which() %>% names()

glimpse(my_data)
## Rows: 8,760
## Columns: 13
## $ rntd_bike_count <dbl> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, ~
## $ hour            <fct> low, low, low, low, low, low, low, High, High, Average~
## $ temperature     <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity        <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25~
## $ windspeed       <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,~
## $ visibility      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dewpointemp     <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3~
## $ solar_rad       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons         <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter~
## $ holiday         <fct> No Holiday, No Holiday, No Holiday, No Holiday, No Hol~
## $ functioningday  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,~

Data Partitioning

Let's divide the data into training and test samples. There are a few functions that could serve this purpose, but we will use the createDataPartition() function from the caret package.

set.seed(987654321)

data_which_train <- createDataPartition(my_data$rntd_bike_count, # target variable
                                          # share of the training sample
                                          p = 0.7,
                                          # should result be a list?
                                          list = FALSE)

The result is a vector of row indexes: 70% of the observations are selected. createDataPartition() does not itself create the training or test sample; it just selects the observation indexes for the training sample, and we need to apply this index to split the data. Let's create the split, and then check the distribution of the target variable, the number of rented bikes, in each part.

train_data <- my_data[data_which_train, ]
test_data <- my_data[-data_which_train, ]

summary(train_data$rntd_bike_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1   191.0   504.0   704.5  1064.0  3556.0
summary(test_data$rntd_bike_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1   191.0   505.0   704.9  1065.5  3404.0

Apart from the maximum values, the summary statistics are almost identical across the two samples. It's important to keep in mind that any transformation that depends on a variable's distribution must be fitted on the training data only. Let's check the distribution of our outcome variable, the number of bikes rented.

hist(my_data$rntd_bike_count)

We can see in the histogram above that our outcome variable is not normally distributed. Linear regression assumes (approximately) normally distributed errors, so it is better, at least for the linear model, to work with a more symmetric outcome.

hist(log(my_data$rntd_bike_count))

The log transformation does help somewhat, although the distribution is now left-skewed. For the linear regression we will model the logarithm of the outcome and then invert the log, so that RMSE is comparable across all models.
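
To quantify this, a quick sketch (assuming the e1071 package, which is not otherwise used in this project) could compute the skewness before and after the transformation:

# sketch: skewness before and after the log transform (assumes package e1071)
library(e1071)
skewness(my_data$rntd_bike_count)       # positive value => right-skewed raw counts
skewness(log(my_data$rntd_bike_count))  # negative value => left skew after the log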

Initial Feature Selection

Feature selection is a vital step for machine learning algorithms, because including redundant or weakly related variables adds nothing to the model: it can reduce accuracy in a classification task or inflate errors in a regression task. It is therefore highly recommended to apply various feature selection techniques to identify redundant features and remove them. We will start by examining the relationship of the predictors with the outcome variable, and we will also check for strong relationships among the predictors themselves, i.e. collinearity; we must make sure they are not strongly correlated with each other. Since we have both categorical and numeric features, we will treat them separately.

Correlation Among Numeric Features

bikes_correlation <-
  cor(train_data[,bikes_numerical],
      use = "pairwise.complete.obs")

bikes_correlation %>% round(digits = 2)
##                 rntd_bike_count temperature humidity windspeed visibility
## rntd_bike_count            1.00        0.54    -0.20      0.12       0.20
## temperature                0.54        1.00     0.15     -0.05       0.04
## humidity                  -0.20        0.15     1.00     -0.33      -0.54
## windspeed                  0.12       -0.05    -0.33      1.00       0.16
## visibility                 0.20        0.04    -0.54      0.16       1.00
## dewpointemp                0.38        0.91     0.53     -0.18      -0.17
## solar_rad                  0.26        0.35    -0.46      0.32       0.15
## rainfall                  -0.12        0.05     0.23     -0.01      -0.16
## snowfall                  -0.14       -0.22     0.11      0.00      -0.13
##                 dewpointemp solar_rad rainfall snowfall
## rntd_bike_count        0.38      0.26    -0.12    -0.14
## temperature            0.91      0.35     0.05    -0.22
## humidity               0.53     -0.46     0.23     0.11
## windspeed             -0.18      0.32    -0.01     0.00
## visibility            -0.17      0.15    -0.16    -0.13
## dewpointemp            1.00      0.09     0.13    -0.15
## solar_rad              0.09      1.00    -0.07    -0.08
## rainfall               0.13     -0.07     1.00     0.01
## snowfall              -0.15     -0.08     0.01     1.00

This is a large matrix and not easy to read. We could visualize it with the corrplot function, but first we will use an alternative from the caret package: findCorrelation(x = correlation_matrix, cutoff = 0.9), which identifies correlations above an accepted threshold and indicates which variables to delete (by default it returns a list of column numbers). Let's lower the cutoff to 0.75. When two predictors are correlated above the cutoff, the function compares each one's average correlation with all the other predictors and suggests removing the one whose average correlation is higher.

findCorrelation(bikes_correlation,
                cutoff = 0.75,
                names = TRUE) -> bikes_crr 
bikes_crr 
## [1] "dewpointemp"

Only dewpointemp is suggested for removal. We will keep this in mind and remove all redundant features together at the end.

Correlation with the Outcome Variable (Bikes rented)

Let's sort the predictors by their correlation with the outcome variable in decreasing order, and then use this ordering for the rows and columns of the correlation matrix in the plot.

correlation_order <-
  # we take correlations with number of bikes rented 
  bikes_correlation[,"rntd_bike_count"] %>%
  # sort them in the decreasing order
  sort(decreasing = TRUE) %>%
  # and extract just the variables' names
  names()

correlation_order
## [1] "rntd_bike_count" "temperature"     "dewpointemp"     "solar_rad"      
## [5] "visibility"      "windspeed"       "rainfall"        "snowfall"       
## [9] "humidity"

Let's visualize the correlations among the numeric features using corrplot.mixed().

corrplot.mixed(bikes_correlation[correlation_order, correlation_order],
               upper = "square", lower = "number",
               tl.col="black", tl.pos = "lt") 

We see that dew point and temperature are quite strongly correlated with each other; we will take care of that in the final selection of variables.

Let's now look at the relationship between three of the predictors most strongly correlated with the outcome and the (log) number of bikes rented. The following scatter plots demonstrate a clear positive relationship with the outcome variable.

ggplot(train_data,
       aes(x = temperature,
           y = log(rntd_bike_count))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title = "Scatter plot of Temperature vs. log(Bikes_Rented)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

ggplot(train_data,
       aes(x = solar_rad,
           y = log(rntd_bike_count))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title = "Scatter plot of Solar radiation vs. log(Bikes_Rented)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

ggplot(train_data,
       aes(x = visibility,
           y = log(rntd_bike_count))) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title = "Scatter plot of Visibility vs. log(Bikes_Rented)")+
  theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'

Categorical Features vs. the Outcome Variable (Bikes Rented)

As the target variable is quantitative and the explanatory variables are qualitative, we can use analysis of variance (ANOVA). We want to check whether the outcome variable differs across the levels of a predictor, e.g. whether the number of bikes rented differs across the three levels of hour (low, Average, High). The F statistic, or its p-value, can be used to verify this. The hypotheses are:

H0: Hours do not impact the number of bikes rented, i.e. the average number of bikes rented does not differ across the levels of hour.

Ha: Hours do impact the bike rental count.

The higher the F statistic (or the lower its p-value), the more strongly we reject H0. We will write a small function that extracts only the p-value for each of our categorical variables and then sort the results in decreasing order of p-value.

bikes_F_anova <- function(bikes_categorical){
  anova_ <- aov(train_data$rntd_bike_count ~
                  train_data[[bikes_categorical]])

  return(summary(anova_)[[1]][1, 5])
}

sapply(bikes_categorical,
       bikes_F_anova) %>%
  # in addition let's sort them
  # in decreasing order of p-value
  # and store them as an object
  sort(decreasing = TRUE) -> bikes_anova_all_categorical

bikes_anova_all_categorical
##        holiday functioningday        seasons           hour 
##   1.396411e-08   7.096976e-57  2.472680e-317  1.740989e-318

All p-values are below the 0.05 significance level, so we reject the null hypothesis in every case: all of our categorical predictors do impact the outcome variable, the bike rental count.
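
For instance, to inspect the full ANOVA table for a single predictor rather than just its p-value, one could run (a sketch; output omitted):

# sketch: full one-way ANOVA table for a single categorical predictor
summary(aov(rntd_bike_count ~ seasons, data = train_data))

Let's now plot one of them, seasons, to see its relationship with the outcome variable.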

ggplot(train_data,
       aes(x = seasons ,
           y = rntd_bike_count)) +
  geom_boxplot(fill = "blue") +
  theme_bw()+
  labs(title = "Boxplot of Bike Count vs. Seasons")+
  theme(plot.title = element_text(hjust = 0.5))

The box plot above shows that the number of bikes rented does vary across seasons, with summer showing the highest rental counts.

Identifying Linear Combinations

findLinearCombos(data) is a very useful function for identifying linear relationships in the data: it decomposes the data matrix to find groups of variables that are exact linear combinations of other variables. We will treat such variables as redundant, because if they can be determined from the others they bring nothing to the analysis. Let's check the numeric variables.

( findLinearCombos(train_data[, bikes_numerical]) ->
    bikes_linearCombos )
## $linearCombos
## list()
## 
## $remove
## NULL

It suggests we don’t have any linear combinations in our training data that’s why it returns null.

Identifying Features with Zero or Near Zero Variance

We obviously don’t want high concentration of features in a single value issue in data zero variance.to To identify such variables we can use the function nearZeroVar(data, saveMetrics = FALSE) from the caret package.

nearZeroVar(train_data,
            saveMetrics = TRUE) -> bikes_nzv_stats

bikes_nzv_stats %>%
  # we add rownames of the frame
  # (with names of variables)
  # as a new column in the data
  rownames_to_column("variable") %>%
  # and sort it in the decreasing order
  arrange(-zeroVar, -nzv, -freqRatio)
##           variable  freqRatio percentUnique zeroVar   nzv
## 1         snowfall 187.806452    0.78265123   FALSE  TRUE
## 2         rainfall  69.397590    0.89678787   FALSE  TRUE
## 3        solar_rad  34.919540    5.59269526   FALSE  TRUE
## 4   functioningday  29.665000    0.03261047   FALSE  TRUE
## 5          holiday  19.443333    0.03261047   FALSE  TRUE
## 6       visibility  69.782609   26.91994130   FALSE FALSE
## 7  rntd_bike_count  13.333333   31.77890103   FALSE FALSE
## 8             hour   1.420613    0.04891570   FALSE FALSE
## 9      dewpointemp   1.264706    8.95157346   FALSE FALSE
## 10        humidity   1.106195    1.46747106   FALSE FALSE
## 11     temperature   1.071429    8.72330018   FALSE FALSE
## 12         seasons   1.018088    0.06522094   FALSE FALSE
## 13       windspeed   1.006780    1.04353497   FALSE FALSE

The output above shows that 5 variables have near zero variance.
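
For reference, freqRatio is the frequency of the most common value divided by the frequency of the second most common one. As a sketch (output omitted), we can reproduce it by hand for snowfall:

# sketch: reproduce freqRatio for snowfall by hand
snow_counts <- sort(table(train_data$snowfall), decreasing = TRUE)
snow_counts[1] / snow_counts[2]  # should match the freqRatio reported above (~187.8)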

Final Selection of Features

# start with all variables 

selected_variables <- names(my_data)


# Let's take the numeric features ordered by correlation with the outcome
# and the 4 categorical variables ordered by ANOVA p-value

selected_variables <- c(correlation_order,
                               names(bikes_anova_all_categorical)
                                )


# Now let's delete mutually correlated variables as identified by findCorrelation()

selected_variables  <-
  selected_variables [!selected_variables %in%
                         bikes_crr]


# Let's now delete the near zero variance variables
(bikes_variables_nzv <- nearZeroVar(train_data,
                                      names = TRUE))
## [1] "solar_rad"      "rainfall"       "snowfall"       "holiday"       
## [5] "functioningday"
selected_variables  <-
  selected_variables [!selected_variables  %in%
                         bikes_variables_nzv]

selected_variables
## [1] "rntd_bike_count" "temperature"     "visibility"      "windspeed"      
## [5] "humidity"        "seasons"         "hour"
bikes_lm1 <- lm(rntd_bike_count ~ ., 
                 data = train_data %>% 
                   dplyr::select(all_of(selected_variables))) 
vifs <- ols_vif_tol(bikes_lm1)
vifs[which(vifs$Variables %in% bikes_numerical),]
##     Variables Tolerance      VIF
## 1 temperature 0.2315079 4.319507
## 2  visibility 0.6054896 1.651556
## 3   windspeed 0.8215594 1.217198
## 4    humidity 0.5030769 1.987768

In the output above we can see the variance inflation factor (VIF) for each numeric predictor. All values are well below the common rule-of-thumb threshold of 5, so multicollinearity among the remaining predictors is not a concern.
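
The VIF for predictor j is defined as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the remaining predictors. As a sketch (output omitted), we can reproduce the VIF for temperature by hand:

# sketch: VIF for temperature computed by hand
aux <- lm(temperature ~ .,
          data = train_data %>%
            dplyr::select(all_of(setdiff(selected_variables, "rntd_bike_count"))))
1 / (1 - summary(aux)$r.squared)  # should match the VIF reported above (~4.32)

Now let's apply the backward elimination method to find redundant features and remove them.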

bikes_lm1_backward_p <- ols_step_backward_p(bikes_lm1,
                    prem = 0.05,
                    progress = F)

summary(bikes_lm1_backward_p$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1617.18  -253.14    -7.38   235.58  2224.53 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    852.4737    26.8797  31.714  < 2e-16 ***
## temperature     23.8661     0.9731  24.525  < 2e-16 ***
## humidity        -8.6638     0.3060 -28.317  < 2e-16 ***
## seasonsSpring  -66.9339    15.8872  -4.213 2.56e-05 ***
## seasonsSummer  -31.6544    20.0958  -1.575    0.115    
## seasonsWinter -275.4599    23.1408 -11.904  < 2e-16 ***
## hourHigh       429.3132    13.8749  30.942  < 2e-16 ***
## hourlow       -128.1614    16.2537  -7.885 3.70e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 439.1 on 6125 degrees of freedom
## Multiple R-squared:  0.5353, Adjusted R-squared:  0.5347 
## F-statistic:  1008 on 7 and 6125 DF,  p-value: < 2.2e-16
bikes_lm1_backward_p$removed
## [1] "visibility" "windspeed"

We end up with a list of variables suggested for removal by the backward elimination method. Let's remove the identified features and store the remaining ones in selected_final_variables for the regression models.

selected_final_variables <- selected_variables[-which(selected_variables %in% 
                              c("visibility","windspeed"))]

bikes_lm_benchmark <- lm(log(rntd_bike_count)  ~ ., 
                data = train_data%>% 
                  dplyr::select(all_of(selected_final_variables))) 

summary(bikes_lm_benchmark)
## 
## Call:
## lm(formula = log(rntd_bike_count) ~ ., data = train_data %>% 
##     dplyr::select(all_of(selected_final_variables)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4184 -0.2629  0.3067  0.8156  2.4547 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.418267   0.102944  62.347  < 2e-16 ***
## temperature    0.020241   0.003727   5.431 5.82e-08 ***
## humidity      -0.020416   0.001172 -17.423  < 2e-16 ***
## seasonsSpring  0.413061   0.060845   6.789 1.24e-11 ***
## seasonsSummer  0.968415   0.076963  12.583  < 2e-16 ***
## seasonsWinter -0.230709   0.088624  -2.603  0.00926 ** 
## hourHigh       0.405548   0.053138   7.632 2.66e-14 ***
## hourlow       -0.491526   0.062248  -7.896 3.38e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.682 on 6125 degrees of freedom
## Multiple R-squared:  0.2006, Adjusted R-squared:  0.1997 
## F-statistic: 219.5 on 7 and 6125 DF,  p-value: < 2.2e-16

The output above shows the regression estimates: all of our selected_final_variables bear a significant relationship with the outcome variable. Since our purpose is to generate predictions on both the training and test data, let's produce predictions and estimate the RMSE on both.

# back-transform the fitted values to the original scale
# (note: this overwrites fitted.values only; predict() below still returns log-scale values)
bikes_lm_benchmark$fitted.values <- exp(bikes_lm_benchmark$fitted.values)-1
regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bikes_lm_benchmark, 
                                      train_data[, selected_final_variables]))
##        MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 901754.8 949.6077 699.0244 498.9379 18.10552 -1.176323
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bikes_lm_benchmark, 
                                      test_data[, selected_final_variables]))
##        MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 907814.4 952.7929 699.5142 499.9987 18.12239 -1.162723

The RMSE is 949 on the training data and 952 on the test data, so the linear benchmark behaves consistently across both samples. Note, however, that predict() on this model returns log-scale values, and comparing them directly with the raw counts is what produces the negative R² above.
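
Since the model was fit on log(rntd_bike_count), the exact inverse of the transformation is exp(). A minimal sketch of the back-transformed comparison (reusing the regressionMetrics() helper from above; output omitted):

# sketch: exponentiate the log-scale predictions before computing error metrics
log_preds <- predict(bikes_lm_benchmark, test_data[, selected_final_variables])
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = exp(log_preds))

With this caveat noted, let's try out different machine learning algorithms and evaluate whether they help us control or minimize overfitting.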

model.formula <- rntd_bike_count ~.

Decision Trees (Regression)

A simple regression tree is built the same way as a simple classification tree. Like the classification tree, it is rarely used on its own; the bagged, random forest, and gradient boosting methods build on this logic. The first step is to build a full tree, then perform k-fold cross-validation to help select the optimal cost complexity (cp). The only difference here is the rpart() parameter method = "anova", which produces a regression tree.

set.seed(1234)
bikes.tree <- rpart(model.formula,
                   data = train_data,
                   method = "anova")
summary(bikes.tree)
## Call:
## rpart(formula = model.formula, data = train_data, method = "anova")
##   n= 6133 
## 
##            CP nsplit rel error    xerror        xstd
## 1  0.24989438      0 1.0000000 1.0002476 0.021429957
## 2  0.16345940      1 0.7501056 0.7547652 0.017132405
## 3  0.06097695      2 0.5866462 0.5900575 0.014459967
## 4  0.05072879      4 0.4646923 0.4689065 0.011753958
## 5  0.02727066      5 0.4139635 0.4172414 0.011559632
## 6  0.01766738      7 0.3594222 0.3838004 0.010632823
## 7  0.01531200      8 0.3417548 0.3527000 0.010112491
## 8  0.01228374      9 0.3264428 0.3432102 0.009807599
## 9  0.01184149     10 0.3141591 0.3312553 0.009435020
## 10 0.01000000     11 0.3023176 0.3267140 0.009335105
## 
## Variable importance
##    temperature           hour    dewpointemp        seasons      solar_rad 
##             21             17             17             11              9 
##       humidity functioningday       rainfall      windspeed       snowfall 
##              9              6              5              3              2 
##     visibility 
##              1 
## 
## Node number 1: 6133 observations,    complexity param=0.2498944
##   mean=704.4715, MSE=414347.9 
##   left son=2 (2927 obs) right son=3 (3206 obs)
##   Primary splits:
##       temperature < 12.65  to the left,  improve=0.2498944, (0 missing)
##       seasons     splits as  RRRL,       improve=0.1825931, (0 missing)
##       hour        splits as  RRL,        improve=0.1703735, (0 missing)
##       dewpointemp < -2.45  to the left,  improve=0.1318742, (0 missing)
##       solar_rad   < 0.005  to the left,  improve=0.1084771, (0 missing)
##   Surrogate splits:
##       dewpointemp < 5.25   to the left,  agree=0.880, adj=0.748, (0 split)
##       seasons     splits as  RRRL,       agree=0.768, adj=0.515, (0 split)
##       solar_rad   < 0.015  to the left,  agree=0.609, adj=0.181, (0 split)
##       snowfall    < 0.05   to the right, agree=0.573, adj=0.106, (0 split)
##       humidity    < 43.5   to the left,  agree=0.562, adj=0.082, (0 split)
## 
## Node number 2: 2927 observations,    complexity param=0.02727066
##   mean=367.7032, MSE=127769.8 
##   left son=4 (1506 obs) right son=5 (1421 obs)
##   Primary splits:
##       seasons     splits as  RR-L,       improve=0.17526230, (0 missing)
##       temperature < 3.65   to the left,  improve=0.17226590, (0 missing)
##       hour        splits as  RRL,        improve=0.14406380, (0 missing)
##       solar_rad   < 0.005  to the left,  improve=0.07712105, (0 missing)
##       dewpointemp < -10.45 to the left,  improve=0.06816643, (0 missing)
##   Surrogate splits:
##       temperature < 2.05   to the left,  agree=0.837, adj=0.664, (0 split)
##       dewpointemp < -3.75  to the left,  agree=0.799, adj=0.586, (0 split)
##       humidity    < 57.5   to the left,  agree=0.644, adj=0.267, (0 split)
##       windspeed   < 1.55   to the right, agree=0.578, adj=0.131, (0 split)
##       snowfall    < 0.05   to the right, agree=0.567, adj=0.108, (0 split)
## 
## Node number 3: 3206 observations,    complexity param=0.1634594
##   mean=1011.933, MSE=477911.1 
##   left son=6 (1833 obs) right son=7 (1373 obs)
##   Primary splits:
##       hour           splits as  LRL,        improve=0.27110490, (0 missing)
##       humidity       < 80.5   to the right, improve=0.14887800, (0 missing)
##       rainfall       < 0.05   to the right, improve=0.11995590, (0 missing)
##       functioningday splits as  LR,         improve=0.10221200, (0 missing)
##       solar_rad      < 0.045  to the left,  improve=0.07750853, (0 missing)
##   Surrogate splits:
##       solar_rad   < 1.275  to the right, agree=0.621, adj=0.114, (0 split)
##       windspeed   < 2.35   to the left,  agree=0.590, adj=0.042, (0 split)
##       visibility  < 1986.5 to the left,  agree=0.579, adj=0.017, (0 split)
##       dewpointemp < 22.75  to the left,  agree=0.577, adj=0.012, (0 split)
##       temperature < 13.75  to the right, agree=0.575, adj=0.007, (0 split)
## 
## Node number 4: 1506 observations
##   mean=222.344, MSE=22339.8 
## 
## Node number 5: 1421 observations,    complexity param=0.02727066
##   mean=521.7575, MSE=193380.3 
##   left son=10 (568 obs) right son=11 (853 obs)
##   Primary splits:
##       hour      splits as  RRL,        improve=0.26585510, (0 missing)
##       humidity  < 61.5   to the right, improve=0.12874380, (0 missing)
##       solar_rad < 0.005  to the left,  improve=0.12643280, (0 missing)
##       rainfall  < 0.05   to the right, improve=0.06214300, (0 missing)
##       seasons   splits as  RL--,       improve=0.06061931, (0 missing)
##   Surrogate splits:
##       solar_rad   < 0.005  to the left,  agree=0.764, adj=0.408, (0 split)
##       windspeed   < 0.85   to the left,  agree=0.659, adj=0.146, (0 split)
##       humidity    < 62.5   to the right, agree=0.625, adj=0.062, (0 split)
##       dewpointemp < 2.25   to the right, agree=0.616, adj=0.040, (0 split)
##       temperature < 2.55   to the left,  agree=0.614, adj=0.035, (0 split)
## 
## Node number 6: 1833 observations,    complexity param=0.05072879
##   mean=700.4054, MSE=194532.7 
##   left son=12 (908 obs) right son=13 (925 obs)
##   Primary splits:
##       solar_rad      < 0.695  to the left,  improve=0.3615243, (0 missing)
##       hour           splits as  R-L,        improve=0.2728347, (0 missing)
##       humidity       < 56.5   to the right, improve=0.2220229, (0 missing)
##       rainfall       < 0.05   to the right, improve=0.1368647, (0 missing)
##       functioningday splits as  LR,         improve=0.1165556, (0 missing)
##   Surrogate splits:
##       hour        splits as  R-L,        agree=0.932, adj=0.863, (0 split)
##       humidity    < 59.5   to the right, agree=0.830, adj=0.656, (0 split)
##       windspeed   < 1.25   to the left,  agree=0.708, adj=0.411, (0 split)
##       dewpointemp < 9.35   to the right, agree=0.639, adj=0.272, (0 split)
##       temperature < 24.95  to the left,  agree=0.616, adj=0.226, (0 split)
## 
## Node number 7: 1373 observations,    complexity param=0.06097695
##   mean=1427.832, MSE=553694.2 
##   left son=14 (99 obs) right son=15 (1274 obs)
##   Primary splits:
##       rainfall       < 0.15   to the right, improve=0.19939500, (0 missing)
##       humidity       < 86.5   to the right, improve=0.18747820, (0 missing)
##       functioningday splits as  LR,         improve=0.18294860, (0 missing)
##       visibility     < 757.5  to the left,  improve=0.08676154, (0 missing)
##       temperature    < 22.05  to the left,  improve=0.07971517, (0 missing)
##   Surrogate splits:
##       humidity    < 91     to the right, agree=0.966, adj=0.535, (0 split)
##       visibility  < 220.5  to the left,  agree=0.932, adj=0.051, (0 split)
##       dewpointemp < 26.35  to the right, agree=0.930, adj=0.030, (0 split)
## 
## Node number 10: 568 observations
##   mean=243.8956, MSE=34236.05 
## 
## Node number 11: 853 observations,    complexity param=0.01228374
##   mean=706.7816, MSE=213707 
##   left son=22 (189 obs) right son=23 (664 obs)
##   Primary splits:
##       humidity    < 77.5   to the right, improve=0.17123830, (0 missing)
##       rainfall    < 0.1    to the right, improve=0.10051090, (0 missing)
##       visibility  < 1028   to the left,  improve=0.09575222, (0 missing)
##       seasons     splits as  RL--,       improve=0.09573489, (0 missing)
##       dewpointemp < 5.05   to the right, improve=0.08804882, (0 missing)
##   Surrogate splits:
##       dewpointemp < 6.45   to the right, agree=0.871, adj=0.418, (0 split)
##       rainfall    < 0.1    to the right, agree=0.832, adj=0.243, (0 split)
##       visibility  < 352    to the left,  agree=0.822, adj=0.196, (0 split)
##       snowfall    < 2.45   to the right, agree=0.796, adj=0.079, (0 split)
## 
## Node number 12: 908 observations
##   mean=432.7395, MSE=99375.04 
## 
## Node number 13: 925 observations,    complexity param=0.01766738
##   mean=963.152, MSE=148577.5 
##   left son=26 (46 obs) right son=27 (879 obs)
##   Primary splits:
##       functioningday splits as  LR,         improve=0.32667470, (0 missing)
##       dewpointemp    < 19.15  to the right, improve=0.09446312, (0 missing)
##       temperature    < 30.05  to the right, improve=0.08616519, (0 missing)
##       humidity       < 36.5   to the right, improve=0.06290279, (0 missing)
##       solar_rad      < 1.765  to the left,  improve=0.04175090, (0 missing)
## 
## Node number 14: 99 observations
##   mean=235.8788, MSE=127073.7 
## 
## Node number 15: 1274 observations,    complexity param=0.06097695
##   mean=1520.456, MSE=467862.9 
##   left son=30 (65 obs) right son=31 (1209 obs)
##   Primary splits:
##       functioningday splits as  LR,         improve=0.26561910, (0 missing)
##       temperature    < 19.95  to the left,  improve=0.09107253, (0 missing)
##       humidity       < 68.5   to the right, improve=0.04688175, (0 missing)
##       windspeed      < 1.15   to the left,  improve=0.03554759, (0 missing)
##       solar_rad      < 0.045  to the left,  improve=0.03484219, (0 missing)
## 
## Node number 22: 189 observations
##   mean=348.2206, MSE=138102 
## 
## Node number 23: 664 observations
##   mean=808.8419, MSE=188216 
## 
## Node number 26: 46 observations
##   mean=0.1, MSE=1.92593e-34 
## 
## Node number 27: 879 observations
##   mean=1013.551, MSE=105276.3 
## 
## Node number 30: 65 observations
##   mean=0.1, MSE=1.232595e-32 
## 
## Node number 31: 1209 observations,    complexity param=0.015312
##   mean=1602.196, MSE=362062.2 
##   left son=62 (270 obs) right son=63 (939 obs)
##   Primary splits:
##       humidity    < 72.5   to the right, improve=0.08889154, (0 missing)
##       temperature < 16.85  to the left,  improve=0.07816428, (0 missing)
##       visibility  < 752    to the left,  improve=0.05034945, (0 missing)
##       windspeed   < 1.25   to the left,  improve=0.04615068, (0 missing)
##       solar_rad   < 0.045  to the left,  improve=0.03783253, (0 missing)
##   Surrogate splits:
##       visibility  < 550.5  to the left,  agree=0.813, adj=0.163, (0 split)
##       windspeed   < 0.45   to the left,  agree=0.787, adj=0.044, (0 split)
##       dewpointemp < 24.45  to the right, agree=0.786, adj=0.041, (0 split)
## 
## Node number 62: 270 observations
##   mean=1267.637, MSE=266321.2 
## 
## Node number 63: 939 observations,    complexity param=0.01184149
##   mean=1698.395, MSE=348153 
##   left son=126 (312 obs) right son=127 (627 obs)
##   Primary splits:
##       temperature < 19.95  to the left,  improve=0.09204681, (0 missing)
##       windspeed   < 1.25   to the left,  improve=0.04064881, (0 missing)
##       solar_rad   < 0.025  to the left,  improve=0.03132255, (0 missing)
##       dewpointemp < 11.25  to the left,  improve=0.02413678, (0 missing)
##       seasons     splits as  RLR-,       improve=0.01279832, (0 missing)
##   Surrogate splits:
##       dewpointemp < 9.35   to the left,  agree=0.838, adj=0.513, (0 split)
##       seasons     splits as  LLR-,       agree=0.764, adj=0.288, (0 split)
##       humidity    < 26.5   to the left,  agree=0.684, adj=0.048, (0 split)
##       visibility  < 758    to the left,  agree=0.682, adj=0.042, (0 split)
##       windspeed   < 4.15   to the right, agree=0.670, adj=0.006, (0 split)
## 
## Node number 126: 312 observations
##   mean=1444.622, MSE=265882.9 
## 
## Node number 127: 627 observations
##   mean=1824.675, MSE=341098.4

Let's generate the predictions and RMSE on both the training and test data.

regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bikes.tree, 
                                      train_data))
##        MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 125264.7 353.9275 249.1645 172.8956 1.114458 0.6976824
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bikes.tree, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 134647.7 366.9438 257.7226 177.8788 1.215412 0.6792232

Compared with our linear regression benchmark the RMSE has improved, i.e. we now get a lower RMSE. Still, one can see slight overfitting, as the decision tree does not perform quite as well on the test data. Let's apply cross-validation with a tuning grid.

tc      <- trainControl(method = "cv", number = 10)
cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))
set.seed(123456789)
bikes.tree_tuned <- train(model.formula,
                    data = train_data, 
                    method = "rpart", 
                    trControl = tc,
                    tuneGrid = cp.grid)

Let's plot how the RMSE responds to changes in the complexity parameter.

plot(bikes.tree_tuned)

The graph above shows that we get the lowest RMSE for cp = 0.001; for every larger value of the complexity parameter the RMSE increases.
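
We can confirm the selected value directly from the caret object (a sketch; output omitted):

# sketch: inspect the complexity parameter caret selected as optimal
bikes.tree_tuned$bestTune

Now let's generate predictions and calculate RMSE on the training and test data using the tuned model.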

regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bikes.tree_tuned, 
                                      train_data))
##       MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 83506.5 288.9749 197.9861 131.2474 0.591729 0.7984628
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bikes.tree_tuned, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE     MSLE        R2
## 1 106042.5 325.6416 219.3235 145.0231 0.655586 0.7473706

The results above show that we managed to reduce the RMSE on the training data considerably: at 288 it is the lowest so far, compared with the linear regression and the full decision tree. However, this bikes.tree_tuned model also shows the largest gap yet between training RMSE (288) and test RMSE (325), i.e. it is the most overfitted of the three, even though its test RMSE is still the best so far.

Bagging

This time we will use the bagging method by specifying method = "treebag". Bagged CART has no hyperparameters to tune in caret, so we don't need to worry about tuneGrid here.

bikes.tree_bagged <- train(model.formula,
                    data = train_data, 
                    method = "treebag", 
                    trControl = tc)
bikes.tree_bagged
## Bagged CART 
## 
## 6133 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 5518, 5520, 5520, 5519, 5518, 5521, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   342.8511  0.7179281  241.4324
plot(varImp(bikes.tree_bagged), main="Variable Importance with Regression Bagging")

The graph above shows the variable importance, with the most important predictors at the top. Let's now compute the error metrics for the bagged model.

regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bikes.tree_bagged, 
                                      train_data))
##        MSE     RMSE      MAE    MedAE    MSLE        R2
## 1 114038.4 337.6957 237.1042 165.2226 1.17922 0.7247763
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bikes.tree_bagged, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE    MSLE        R2
## 1 123474.4 351.3893 247.2098 170.9239 1.31071 0.7058418

The output above still shows a small gap between training and test performance, so we can conclude that a slight overfitting problem remains.

Random Forest

Let's now apply a random forest with cross-validation.

set.seed(71)
bikes.tree_rf <- train(model.formula,
                    data = train_data, 
                    method = "ranger", 
                    trControl = tc)
bikes.tree_rf
## Random Forest 
## 
## 6133 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 5519, 5520, 5521, 5519, 5519, 5519, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared   MAE     
##    2    variance    315.7303  0.7828877  220.5034
##    2    extratrees  348.1651  0.7393025  244.5772
##    8    variance    274.7520  0.8182531  179.3584
##    8    extratrees  272.6129  0.8215457  178.7928
##   15    variance    278.0289  0.8141102  180.3669
##   15    extratrees  271.6199  0.8226332  177.4247
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 15, splitrule = extratrees
##  and min.node.size = 5.
plot(bikes.tree_rf)

Let's compute the RMSE on both the training and test data and see how it compares with the models above.

regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bikes.tree_rf, 
                                      train_data))
##        MSE     RMSE      MAE    MedAE      MSLE        R2
## 1 18902.43 137.4861 89.31816 53.05743 0.1382353 0.9543803
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bikes.tree_rf, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE      MSLE        R2
## 1 79872.57 282.6174 182.6458 111.0392 0.3403784 0.8097163

The random forest results are quite interesting: it generates the lowest RMSE on the training data among all the models, and in fact the lowest on the test data as well, but the gap between the two (137 vs. 283) is by far the largest, so it overfits the most.

Let's tune this model by trying out different combinations of parameters and see if it helps.

bkes_rf_tuned_1 <- randomForest(model.formula,
               data = train_data,
               ntree = 100,
               mtry = 8,
               # minimum number of obs in the terminal nodes
               nodesize = 100,
               # we also generate predictors importance measures,
               importance = TRUE)
print(bkes_rf_tuned_1)
## 
## Call:
##  randomForest(formula = model.formula, data = train_data, ntree = 100,      mtry = 8, nodesize = 100, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 8
## 
##           Mean of squared residuals: 82604.63
##                     % Var explained: 80.06
regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_1, 
                                      train_data))
##        MSE     RMSE     MAE    MedAE     MSLE        R2
## 1 66457.59 257.7937 171.521 105.4981 0.587629 0.8396092
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_1, 
                                      test_data))
##        MSE     RMSE     MAE    MedAE      MSLE        R2
## 1 87908.72 296.4941 197.826 125.1216 0.6979514 0.7905714

Now let's try mtry = 6.

bkes_rf_tuned_2<- randomForest(model.formula,
               data = train_data,
               ntree = 100,
               mtry = 6,
               # minimum number of obs in the terminal nodes
               nodesize = 100,
               # we also generate predictors importance measures,
               importance = TRUE)
print(bkes_rf_tuned_2)
## 
## Call:
##  randomForest(formula = model.formula, data = train_data, ntree = 100,      mtry = 6, nodesize = 100, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 82875.83
##                     % Var explained: 80
regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_2, 
                                      train_data))
##        MSE     RMSE      MAE    MedAE      MSLE       R2
## 1 67466.62 259.7434 173.1614 106.3531 0.7774465 0.837174
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_2, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE      MSLE        R2
## 1 88255.55 297.0784 198.6865 127.5221 0.8980173 0.7897452
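
Finally, let's have caret search over a grid of mtry values from 2 to 12, again with 10-fold cross-validation.
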
parameters_rf <- expand.grid(mtry = 2:12)
bkes_rf_tuned_3 <-
    train(model.formula,
          data = train_data,
          method = "rf",
          ntree = 100,
          nodesize = 100,
          tuneGrid = parameters_rf,
          trControl = tc)
regressionMetrics(real = train_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_3, 
                                      train_data))
##        MSE     RMSE      MAE    MedAE      MSLE        R2
## 1 66577.27 258.0257 171.7525 103.4091 0.5889308 0.8393204
regressionMetrics(real = test_data$rntd_bike_count,
                  predicted = predict(bkes_rf_tuned_3, 
                                      test_data))
##        MSE     RMSE      MAE    MedAE      MSLE        R2
## 1 88339.78 297.2201 197.8014 126.8731 0.7054815 0.7895445
plot(bkes_rf_tuned_3)

Summary and Conclusions

Let's compare the results of the estimated models: the linear regression benchmark (bikes_lm_benchmark), the decision tree (bikes.tree), the tuned decision tree (bikes.tree_tuned), the bagged model (bikes.tree_bagged) and the random forests. We compute the predicted values for each model and summarize the error measures.

bikes_model_list <- list(bikes_lm_benchmark =bikes_lm_benchmark ,
                          bikes.tree=bikes.tree,
                          bikes.tree_tuned= bikes.tree_tuned,
                          bikes.tree_bagged=bikes.tree_bagged,
                          bikes.tree_rf=bikes.tree_rf,
                         bkes_rf_tuned_1=bkes_rf_tuned_1,
                         bkes_rf_tuned_2=bkes_rf_tuned_2,
                         bkes_rf_tuned_3=bkes_rf_tuned_3)

bikes_fitted <- bikes_model_list %>%
  sapply(function(x) predict(x, newdata=train_data)) %>%
  data.frame()

sapply(bikes_fitted,
       function(x) regressionMetrics(real = train_data$rntd_bike_count, predicted = x
       )) %>%  t()
##                    MSE      RMSE     MAE      MedAE    MSLE      R2       
## bikes_lm_benchmark 901754.8 949.6077 699.0244 498.9379 18.10552  -1.176323
## bikes.tree         125264.7 353.9275 249.1645 172.8956 1.114458  0.6976824
## bikes.tree_tuned   83506.5  288.9749 197.9861 131.2474 0.591729  0.7984628
## bikes.tree_bagged  114038.4 337.6957 237.1042 165.2226 1.17922   0.7247763
## bikes.tree_rf      18902.43 137.4861 89.31816 53.05743 0.1382353 0.9543803
## bkes_rf_tuned_1    66457.59 257.7937 171.521  105.4981 0.587629  0.8396092
## bkes_rf_tuned_2    67466.62 259.7434 173.1614 106.3531 0.7774465 0.837174 
## bkes_rf_tuned_3    66577.27 258.0257 171.7525 103.4091 0.5889308 0.8393204
bikes_fitted <- bikes_model_list %>%
  sapply(function(x) predict(x, newdata=test_data)) %>%
  data.frame()

sapply(bikes_fitted,
       function(x) regressionMetrics(real = test_data$rntd_bike_count, predicted = x
       )) %>%  t()
##                    MSE      RMSE     MAE      MedAE    MSLE      R2       
## bikes_lm_benchmark 907814.4 952.7929 699.5142 499.9987 18.12239  -1.162723
## bikes.tree         134647.7 366.9438 257.7226 177.8788 1.215412  0.6792232
## bikes.tree_tuned   106042.5 325.6416 219.3235 145.0231 0.655586  0.7473706
## bikes.tree_bagged  123474.4 351.3893 247.2098 170.9239 1.31071   0.7058418
## bikes.tree_rf      79872.57 282.6174 182.6458 111.0392 0.3403784 0.8097163
## bkes_rf_tuned_1    87908.72 296.4941 197.826  125.1216 0.6979514 0.7905714
## bkes_rf_tuned_2    88255.55 297.0784 198.6865 127.5221 0.8980173 0.7897452
## bkes_rf_tuned_3    88339.78 297.2201 197.8014 126.8731 0.7054815 0.7895445

We applied several different models. The output above shows that the random forest (bikes.tree_rf) generates the lowest RMSE on the training data, but also the largest gap between training and test RMSE, i.e. it is the most overfitted model. All the models overfit to some degree. Balancing RMSE against overfitting, we select the tuned random forest (bkes_rf_tuned_1): it shows relatively little overfitting together with a low RMSE on both the training (257) and test (296) data. We can therefore expect it to perform comparably well on unseen data.
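
As a closing sketch (reusing the objects above; output omitted), one could tabulate the train-versus-test RMSE gap that motivated this choice:

# sketch: train vs. test RMSE gap for each fitted model
rmse <- function(real, pred) sqrt(mean((real - pred)^2))
gap_table <- data.frame(
  train_rmse = sapply(bikes_model_list,
                      function(m) rmse(train_data$rntd_bike_count,
                                       predict(m, newdata = train_data))),
  test_rmse = sapply(bikes_model_list,
                     function(m) rmse(test_data$rntd_bike_count,
                                      predict(m, newdata = test_data))))
gap_table$gap <- gap_table$test_rmse - gap_table$train_rmse
gap_table[order(gap_table$gap), ]  # smaller gap => less overfitting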