Data Set Description and Preparation
Rental bikes have been introduced in many cities to improve urban mobility. Making rental bikes available and accessible to the public at the right time reduces waiting times, so providing the city with a stable supply of rental bikes is a major concern. The crucial part is predicting the number of bikes required each hour. This project therefore aims to predict the number of bikes rented and to find the best model for the regression task. The dataset contains weather information (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall) and the number of bikes rented per hour as the outcome variable.
Let's have a look at the data set.
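The chunks below assume that the required packages are already attached. The original loading chunk is not shown, so the following is a sketch of the likely setup:
library(dplyr)        # glimpse(), pipes, select()
library(tibble)       # rownames_to_column()
library(caret)        # createDataPartition(), findCorrelation(), nearZeroVar(), train()
library(corrplot)     # corrplot.mixed()
library(ggplot2)      # scatter plots and boxplots
library(olsrr)        # ols_vif_tol(), ols_step_backward_p()
library(rpart)        # rpart() regression trees
library(randomForest) # randomForest()
library(ranger)       # backend for caret's method = "ranger"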
my_data<-read.csv("SeoulBikeData.csv")
glimpse(my_data)
## Rows: 8,760
## Columns: 14
## $ Date <chr> "1/12/2017", "1/12/2017", "1/12/2017", "1/12/2017", "1~
## $ rntd_bike_count <dbl> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, ~
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ~
## $ temperature <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25~
## $ windspeed <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,~
## $ visibility <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dewpointemp <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3~
## $ solar_rad <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Win~
## $ holiday <chr> "No Holiday", "No Holiday", "No Holiday", "No Holiday"~
## $ functioningday <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"~
dim(my_data)
## [1] 8760 14
In total we have 8,760 observations and 14 features, including our outcome variable. The output above shows that we have integer, double, and character variables. We need to convert the character variables to factors, but first let's drop the date feature, as it is not a meaningful predictor for our purpose.
my_data <- my_data[, c(-1)] #removing date feature
names(my_data)
## [1] "rntd_bike_count" "hour" "temperature" "humidity"
## [5] "windspeed" "visibility" "dewpointemp" "solar_rad"
## [9] "rainfall" "snowfall" "seasons" "holiday"
## [13] "functioningday"
Let's take a look at the hour variable to see whether the number of rented bikes varies across hours of the day. We can do this with a boxplot of the number of rented bikes against hours.
boxplot(my_data$rntd_bike_count ~ my_data$hour, xlab = "Hours", ylab="Bike Count")
In the boxplot above we can see that the number of rented bikes does vary with the hour: more bikes are rented from 7 a.m. to 9 a.m., demand stays relatively flat until 15:00, and after 15:00 it starts to increase again. Based on this we can recode the hours into three categories: "low" (hours before 7), "Average" (hours 9 to 15), and "High" (hours 7-8 and 16 onwards). The raw hour numbers carry little information on their own, so three intuitive categories are more useful.
my_data$hour <- ifelse(my_data$hour < 7, "low",
ifelse(my_data$hour >= 9 & my_data$hour <=15, "Average", "High"))
Before applying various feature selection methods, it is better to quickly go through all the columns and check whether any of them contain missing values.
colSums(is.na(my_data)) %>% sort()
## rntd_bike_count hour temperature humidity windspeed
## 0 0 0 0 0
## visibility dewpointemp solar_rad rainfall snowfall
## 0 0 0 0 0
## seasons holiday functioningday
## 0 0 0
We can see we don’t have any missing values in any of the columns so we can proceed further with our analysis.
We have both categorical and numerical features in our data set. Let's first collect the names of the character variables, convert those variables to factors in a loop, and then store the names of the numeric features in a vector named bikes_numerical.
bikes_categorical <- sapply(my_data, is.character) %>% which() %>% names()
#converting all categorical variables to factors by applying a loop
for (variable in bikes_categorical) {
my_data[[variable]] <- as.factor(my_data[[variable]])
}
#list of numeric features in a separate vector
bikes_numerical <- sapply(my_data, is.numeric) %>% which() %>% names()
glimpse(my_data)
## Rows: 8,760
## Columns: 13
## $ rntd_bike_count <dbl> 254, 204, 173, 107, 78, 100, 181, 460, 930, 490, 339, ~
## $ hour <fct> low, low, low, low, low, low, low, High, High, Average~
## $ temperature <dbl> -5.2, -5.5, -6.0, -6.2, -6.0, -6.4, -6.6, -7.4, -7.6, ~
## $ humidity <int> 37, 38, 39, 40, 36, 37, 35, 38, 37, 27, 24, 21, 23, 25~
## $ windspeed <dbl> 2.2, 0.8, 1.0, 0.9, 2.3, 1.5, 1.3, 0.9, 1.1, 0.5, 1.2,~
## $ visibility <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ~
## $ dewpointemp <dbl> -17.6, -17.6, -17.7, -17.6, -18.6, -18.7, -19.5, -19.3~
## $ solar_rad <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, ~
## $ rainfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ snowfall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ seasons <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winter~
## $ holiday <fct> No Holiday, No Holiday, No Holiday, No Holiday, No Hol~
## $ functioningday <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,~
Data Partitioning
Let's divide the data into training and test samples. There are a few functions that could help here, but we will use the createDataPartition() function from the caret package.
set.seed(987654321)
data_which_train <- createDataPartition(my_data$rntd_bike_count, # target variable
# share of the training sample
p = 0.7,
# should result be a list?
list = FALSE)
The result is a vector of numbers: the indexes of the 70% of observations selected for training. createDataPartition() does not create the training or test sample itself; it only selects the observation indexes for the training sample, so we need to apply this index to split the data. Let's create the split, and then check the distribution of the target variable, the number of rented bikes.
train_data <- my_data[data_which_train, ]
test_data <- my_data[-data_which_train, ]
summary(train_data$rntd_bike_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1 191.0 504.0 704.5 1064.0 3556.0
summary(test_data$rntd_bike_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1 191.0 505.0 704.9 1065.5 3404.0
Apart from the maximum values, the summary statistics are almost identical. It is important to keep in mind that all transformations based on variable distributions must be derived from the training data. Let's check the distribution of our outcome variable, the number of bikes rented.
hist(my_data$rntd_bike_count)
The histogram above shows that our outcome variable is not normally distributed. Linear regression assumes normality, so it is better to work with a more normally distributed outcome, at least for the linear regression model.
hist(log(my_data$rntd_bike_count))
The log transformation helps somewhat, although the data are now left-skewed. For linear regression we will therefore model the logarithm of the outcome and then invert the log transform so that RMSE is comparable across all models.
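Because the model will be fit on log(rntd_bike_count), its predictions come back on the log scale and must be exponentiated before computing RMSE against the raw counts. A minimal sketch of that back-transform, with a hypothetical fitted model log_model:
# log_model is hypothetical here; the actual benchmark model is fit later
log_pred   <- predict(log_model, newdata = test_data)  # log-scale predictions
count_pred <- exp(log_pred)                            # back on the count scale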
Initial Feature Selection
Feature selection is a vital part of building machine learning models, because including redundant variables, or variables only weakly correlated with the outcome, adds nothing to the model: it can reduce accuracy in classification tasks and increase errors in regression tasks. It is therefore highly recommended to apply feature selection techniques to identify and remove redundant features. We start by examining the relationship of the predictors with the outcome variable, and we also check for strong relationships among the predictors themselves (collinearity): we must make sure they are not strongly correlated with each other. Since we have both categorical and numeric features, we treat them separately.
Correlation Among Numeric Features
bikes_correlation <-
cor(train_data[,bikes_numerical],
use = "pairwise.complete.obs")
bikes_correlation %>% round(digits = 2)
## rntd_bike_count temperature humidity windspeed visibility
## rntd_bike_count 1.00 0.54 -0.20 0.12 0.20
## temperature 0.54 1.00 0.15 -0.05 0.04
## humidity -0.20 0.15 1.00 -0.33 -0.54
## windspeed 0.12 -0.05 -0.33 1.00 0.16
## visibility 0.20 0.04 -0.54 0.16 1.00
## dewpointemp 0.38 0.91 0.53 -0.18 -0.17
## solar_rad 0.26 0.35 -0.46 0.32 0.15
## rainfall -0.12 0.05 0.23 -0.01 -0.16
## snowfall -0.14 -0.22 0.11 0.00 -0.13
## dewpointemp solar_rad rainfall snowfall
## rntd_bike_count 0.38 0.26 -0.12 -0.14
## temperature 0.91 0.35 0.05 -0.22
## humidity 0.53 -0.46 0.23 0.11
## windspeed -0.18 0.32 -0.01 0.00
## visibility -0.17 0.15 -0.16 -0.13
## dewpointemp 1.00 0.09 0.13 -0.15
## solar_rad 0.09 1.00 -0.07 -0.08
## rainfall 0.13 -0.07 1.00 0.01
## snowfall -0.15 -0.08 0.01 1.00
This is a long matrix and not easy to read. We could visualize it with the corrplot function, but instead we will use the alternative from the caret package: findCorrelation(x = correlation_matrix, cutoff = 0.9), which identifies correlations above the accepted threshold and indicates which variables to delete. By default it returns a list of column numbers to be deleted. Let's lower the cutoff to 0.75. When a pair of variables is correlated above the cutoff, the function returns only one of them: it compares each variable's average correlation with all the other predictors and suggests removing the one whose average correlation is higher.
findCorrelation(bikes_correlation,
cutoff = 0.75,
names = TRUE) -> bikes_crr
bikes_crr
## [1] "dewpointemp"
Only dewpointemp is suggested for removal. We will keep this in mind and remove all the redundant features together later.
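The choice of dewpointemp over temperature follows findCorrelation()'s rule: of the two variables in the offending pair, the one with the larger mean absolute correlation with all other variables is flagged. An informal check (an illustration, not the function's exact algorithm):
# mean absolute correlation of each member of the flagged pair
colMeans(abs(bikes_correlation[, c("temperature", "dewpointemp")]))
dewpointemp comes out higher, which is why it, rather than temperature, is the one suggested for removal.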
Correlation with the Outcome Variable (Bikes rented)
Let's sort the predictors by their correlation with the outcome variable in decreasing order, and then use this ordering for the rows and columns of the correlation matrix in the plot.
correlation_order <-
# we take correlations with number of bikes rented
bikes_correlation[,"rntd_bike_count"] %>%
# sort them in the decreasing order
sort(decreasing = TRUE) %>%
# and extract just the variables' names
names()
correlation_order
## [1] "rntd_bike_count" "temperature" "dewpointemp" "solar_rad"
## [5] "visibility" "windspeed" "rainfall" "snowfall"
## [9] "humidity"
Let's visualize the correlations among the numeric features using the corrplot.mixed() function.
corrplot.mixed(bikes_correlation[correlation_order, correlation_order],
upper = "square", lower = "number",
tl.col="black", tl.pos = "lt")
We see that dew point and temperature are quite strongly correlated with each other. We will take care of that in final selection of variables.
Let's now look at the relationship between three of the most strongly correlated predictors (temperature, solar radiation, and visibility) and the outcome variable. The following scatter plots demonstrate a clear positive relationship with the outcome variable.
ggplot(train_data,
aes(x = temperature,
y = log(rntd_bike_count))) +
geom_point(col = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
labs(title = "Scatter plot of Temperature vs. log(Bikes_Rented)")+
theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'
ggplot(train_data,
aes(x = solar_rad,
y = log(rntd_bike_count))) +
geom_point(col = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
labs(title = "Scatter plot of Solar radiation vs. log(Bikes_Rented)")+
theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'
ggplot(train_data,
aes(x = visibility,
y = log(rntd_bike_count))) +
geom_point(col = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
labs(title = "Scatter plot of Visibility vs. log(Bikes_Rented)")+
theme(plot.title = element_text(hjust = 0.4))
## `geom_smooth()` using formula 'y ~ x'
Relationship between Categorical Predictors and the Outcome Variable
As the target variable is quantitative and the explanatory variables are qualitative, we can use analysis of variance (ANOVA). For example, we want to check whether the number of bikes rented differs across the three levels (low, Average, High) of the hour predictor. The F-statistic, or equivalently its p-value, can be used to verify this hypothesis. The hypotheses are:
H0: Hours do not impact the number of bikes rented, i.e. the average number of bikes rented does not differ across the levels of hour.
Ha: Hours do impact the number of bikes rented.
The higher the F-statistic (or the lower its p-value), the more strongly we reject H0. We will write a small function that extracts the p-value for each of our categorical variables, then sort the results in decreasing order of p-value.
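For a single predictor this is just one aov() call; a quick sketch for hour (the helper below then generalizes it to all categorical variables):
# one-off ANOVA: does the mean bike count differ across the three hour levels?
summary(aov(rntd_bike_count ~ hour, data = train_data))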
bikes_F_anova <- function(bikes_categorical){
anova_ <- aov(train_data$rntd_bike_count ~
train_data[[bikes_categorical]])
return(summary(anova_)[[1]][1, 5])
}
sapply(bikes_categorical,
bikes_F_anova) %>%
# in addition let's sort them
# in decreasing order of p-value
# and store as an object
sort(decreasing = TRUE) -> bikes_anova_all_categorical
bikes_anova_all_categorical
## holiday functioningday seasons hour
## 1.396411e-08 7.096976e-57 2.472680e-317 1.740989e-318
All p-values are below the 0.05 significance level, so we reject the null hypothesis in each case. This means that all of our categorical predictors have an impact on the outcome variable, the bike rent count. Let's plot one of them, seasons, to see its relationship with the outcome variable.
ggplot(train_data,
aes(x = seasons ,
y = rntd_bike_count)) +
geom_boxplot(fill = "blue") +
theme_bw()+
labs(title = "Boxplot of Bike Count vs. Seasons")+
theme(plot.title = element_text(hjust = 0.5))
We can see in the boxplot above that the number of bikes rented varies across seasons, with noticeably more bikes rented in summer.
Identifying Linear Combinations
findLinearCombos(data) is a very useful function for identifying linear relationships in the data: it decomposes the data matrix to find groups of variables that are linear combinations of other variables. We treat such variables as redundant, because anything that can be determined from the others brings nothing to the analysis. Let's check the numeric variables.
( findLinearCombos(train_data[, bikes_numerical] ) ->
bikes_linearCombos )
## $linearCombos
## list()
##
## $remove
## NULL
It suggests we don’t have any linear combinations in our training data that’s why it returns null.
Identifying Features with Zero or Near Zero Variance
We obviously don’t want high concentration of features in a single value issue in data zero variance.to To identify such variables we can use the function nearZeroVar(data, saveMetrics = FALSE) from the caret package.
nearZeroVar(train_data,
saveMetrics = TRUE) -> bikes_nzv_stats
bikes_nzv_stats %>%
# we add rownames of the frame
# (with names of variables)
# as a new column in the data
rownames_to_column("variable") %>%
# and sort it in decreasing order
arrange(-zeroVar, -nzv, -freqRatio)
## variable freqRatio percentUnique zeroVar nzv
## 1 snowfall 187.806452 0.78265123 FALSE TRUE
## 2 rainfall 69.397590 0.89678787 FALSE TRUE
## 3 solar_rad 34.919540 5.59269526 FALSE TRUE
## 4 functioningday 29.665000 0.03261047 FALSE TRUE
## 5 holiday 19.443333 0.03261047 FALSE TRUE
## 6 visibility 69.782609 26.91994130 FALSE FALSE
## 7 rntd_bike_count 13.333333 31.77890103 FALSE FALSE
## 8 hour 1.420613 0.04891570 FALSE FALSE
## 9 dewpointemp 1.264706 8.95157346 FALSE FALSE
## 10 humidity 1.106195 1.46747106 FALSE FALSE
## 11 temperature 1.071429 8.72330018 FALSE FALSE
## 12 seasons 1.018088 0.06522094 FALSE FALSE
## 13 windspeed 1.006780 1.04353497 FALSE FALSE
The above output suggests that we have 5 variables that have near zero variance.
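For reference, freqRatio is the frequency of the most common value divided by the frequency of the second most common one, and nearZeroVar()'s default cutoffs are freqCut = 95/5 = 19 and uniqueCut = 10 (percent of unique values). The snowfall ratio reported above can be reproduced by hand:
# freqRatio by hand for snowfall: most common value vs. second most common
value_counts <- sort(table(train_data$snowfall), decreasing = TRUE)
unname(value_counts[1] / value_counts[2])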
Final Selection of Features
# start with all variables
selected_variables <- names(my_data)
# Let's take all numeric features (ordered by correlation with the outcome) and the 4 categorical variables (ordered by ANOVA p-value)
selected_variables <- c(correlation_order,
names(bikes_anova_all_categorical)
)
# Now let's delete the mutually correlated variables identified by findCorrelation()
selected_variables <-
selected_variables [!selected_variables %in%
bikes_crr]
# Let's now delete the near-zero-variance variables
(bikes_variables_nzv <- nearZeroVar(train_data,
names = TRUE))
## [1] "solar_rad" "rainfall" "snowfall" "holiday"
## [5] "functioningday"
selected_variables <-
selected_variables [!selected_variables %in%
bikes_variables_nzv]
selected_variables
## [1] "rntd_bike_count" "temperature" "visibility" "windspeed"
## [5] "humidity" "seasons" "hour"
bikes_lm1 <- lm(rntd_bike_count ~ .,
data = train_data %>%
dplyr::select(all_of(selected_variables)))
vifs <- ols_vif_tol(bikes_lm1)
vifs[which(vifs$Variables %in% bikes_numerical),]
## Variables Tolerance VIF
## 1 temperature 0.2315079 4.319507
## 2 visibility 0.6054896 1.651556
## 3 windspeed 0.8215594 1.217198
## 4 humidity 0.5030769 1.987768
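As a sanity check, the VIF of a single predictor can be reproduced manually: regress it on the remaining predictors and compute VIF = 1 / (1 - R^2) of that auxiliary regression. A sketch for temperature, which should come out close to the 4.32 reported above:
# auxiliary regression of temperature on the other predictors of bikes_lm1
aux_model <- lm(temperature ~ visibility + windspeed + humidity + seasons + hour,
                data = train_data)
1 / (1 - summary(aux_model)$r.squared)  # VIF for temperature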
We can see the variance inflation factor (VIF) values for the numeric predictors. The VIF is moderate for all of them, so multicollinearity is not a serious concern. Now let's apply the backward elimination method to find and remove redundant features.
bikes_lm1_backward_p <- ols_step_backward_p(bikes_lm1,
prem = 0.05,
progress = F)
summary(bikes_lm1_backward_p$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1617.18 -253.14 -7.38 235.58 2224.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 852.4737 26.8797 31.714 < 2e-16 ***
## temperature 23.8661 0.9731 24.525 < 2e-16 ***
## humidity -8.6638 0.3060 -28.317 < 2e-16 ***
## seasonsSpring -66.9339 15.8872 -4.213 2.56e-05 ***
## seasonsSummer -31.6544 20.0958 -1.575 0.115
## seasonsWinter -275.4599 23.1408 -11.904 < 2e-16 ***
## hourHigh 429.3132 13.8749 30.942 < 2e-16 ***
## hourlow -128.1614 16.2537 -7.885 3.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 439.1 on 6125 degrees of freedom
## Multiple R-squared: 0.5353, Adjusted R-squared: 0.5347
## F-statistic: 1008 on 7 and 6125 DF, p-value: < 2.2e-16
bikes_lm1_backward_p$removed
## [1] "visibility" "windspeed"
The backward elimination method leaves us with a list of variables suggested for removal. Let's remove these features and build the list selected_final_variables to be used for the regression models.
selected_final_variables <- selected_variables[-which(selected_variables %in%
c("visibility","windspeed"))]
bikes_lm_benchmark <- lm(log(rntd_bike_count) ~ .,
data = train_data%>%
dplyr::select(all_of(selected_final_variables)))
summary(bikes_lm_benchmark)
##
## Call:
## lm(formula = log(rntd_bike_count) ~ ., data = train_data %>%
## dplyr::select(all_of(selected_final_variables)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4184 -0.2629 0.3067 0.8156 2.4547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.418267 0.102944 62.347 < 2e-16 ***
## temperature 0.020241 0.003727 5.431 5.82e-08 ***
## humidity -0.020416 0.001172 -17.423 < 2e-16 ***
## seasonsSpring 0.413061 0.060845 6.789 1.24e-11 ***
## seasonsSummer 0.968415 0.076963 12.583 < 2e-16 ***
## seasonsWinter -0.230709 0.088624 -2.603 0.00926 **
## hourHigh 0.405548 0.053138 7.632 2.66e-14 ***
## hourlow -0.491526 0.062248 -7.896 3.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.682 on 6125 degrees of freedom
## Multiple R-squared: 0.2006, Adjusted R-squared: 0.1997
## F-statistic: 219.5 on 7 and 6125 DF, p-value: < 2.2e-16
The output above shows the regression estimates: all of the selected_final_variables bear a significant relationship with the outcome variable. Since our purpose is to generate predictions on both the training and test data, let's compute predictions and RMSE on both.
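Note that regressionMetrics() is not a caret function but a helper used throughout the rest of this analysis. Its definition is not shown in the original, so here is a sketch consistent with the columns it returns (MSE, RMSE, MAE, MedAE, MSLE, R2):
# sketch of the helper; MSLE assumes real and predicted are both > -1
regressionMetrics <- function(real, predicted) {
  MSE   <- mean((real - predicted)^2)            # mean squared error
  RMSE  <- sqrt(MSE)                             # root mean squared error
  MAE   <- mean(abs(real - predicted))           # mean absolute error
  MedAE <- median(abs(real - predicted))         # median absolute error
  MSLE  <- mean((log(1 + real) - log(1 + predicted))^2) # mean squared log error
  R2    <- 1 - sum((real - predicted)^2) / sum((real - mean(real))^2)
  data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
}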
# note: this back-transforms only the stored fitted values; predict() below
# still returns log-scale predictions, which explains the very large errors
# and the negative R2 reported for this benchmark
bikes_lm_benchmark$fitted.values <- exp(bikes_lm_benchmark$fitted.values)-1
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bikes_lm_benchmark,
train_data[, selected_final_variables]))
## MSE RMSE MAE MedAE MSLE R2
## 1 901754.8 949.6077 699.0244 498.9379 18.10552 -1.176323
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bikes_lm_benchmark,
test_data[, selected_final_variables]))
## MSE RMSE MAE MedAE MSLE R2
## 1 907814.4 952.7929 699.5142 499.9987 18.12239 -1.162723
For the training data the RMSE is 949 and for the test data it is 952, so the linear model behaves almost identically on both samples, although the negative R2 shows the absolute fit is poor (a consequence of comparing log-scale predictions with raw counts, as noted above). Let's try different machine learning algorithms and evaluate whether they help us control or minimize overfitting.
model.formula <- rntd_bike_count ~.
Decision Trees (Regression)
A simple regression tree is built the same way as a simple classification tree. Like the classification tree, it is rarely used on its own; the bagged, random forest, and gradient boosting methods build on this logic. The first step is to build a full tree, then perform k-fold cross-validation to help select the optimal cost complexity (cp). The only difference here is the rpart() parameter method = "anova", which produces a regression tree.
set.seed(1234)
bikes.tree <- rpart(model.formula,
data = train_data,
method = "anova")
summary(bikes.tree)
## Call:
## rpart(formula = model.formula, data = train_data, method = "anova")
## n= 6133
##
## CP nsplit rel error xerror xstd
## 1 0.24989438 0 1.0000000 1.0002476 0.021429957
## 2 0.16345940 1 0.7501056 0.7547652 0.017132405
## 3 0.06097695 2 0.5866462 0.5900575 0.014459967
## 4 0.05072879 4 0.4646923 0.4689065 0.011753958
## 5 0.02727066 5 0.4139635 0.4172414 0.011559632
## 6 0.01766738 7 0.3594222 0.3838004 0.010632823
## 7 0.01531200 8 0.3417548 0.3527000 0.010112491
## 8 0.01228374 9 0.3264428 0.3432102 0.009807599
## 9 0.01184149 10 0.3141591 0.3312553 0.009435020
## 10 0.01000000 11 0.3023176 0.3267140 0.009335105
##
## Variable importance
## temperature hour dewpointemp seasons solar_rad
## 21 17 17 11 9
## humidity functioningday rainfall windspeed snowfall
## 9 6 5 3 2
## visibility
## 1
##
## Node number 1: 6133 observations, complexity param=0.2498944
## mean=704.4715, MSE=414347.9
## left son=2 (2927 obs) right son=3 (3206 obs)
## Primary splits:
## temperature < 12.65 to the left, improve=0.2498944, (0 missing)
## seasons splits as RRRL, improve=0.1825931, (0 missing)
## hour splits as RRL, improve=0.1703735, (0 missing)
## dewpointemp < -2.45 to the left, improve=0.1318742, (0 missing)
## solar_rad < 0.005 to the left, improve=0.1084771, (0 missing)
## Surrogate splits:
## dewpointemp < 5.25 to the left, agree=0.880, adj=0.748, (0 split)
## seasons splits as RRRL, agree=0.768, adj=0.515, (0 split)
## solar_rad < 0.015 to the left, agree=0.609, adj=0.181, (0 split)
## snowfall < 0.05 to the right, agree=0.573, adj=0.106, (0 split)
## humidity < 43.5 to the left, agree=0.562, adj=0.082, (0 split)
##
## Node number 2: 2927 observations, complexity param=0.02727066
## mean=367.7032, MSE=127769.8
## left son=4 (1506 obs) right son=5 (1421 obs)
## Primary splits:
## seasons splits as RR-L, improve=0.17526230, (0 missing)
## temperature < 3.65 to the left, improve=0.17226590, (0 missing)
## hour splits as RRL, improve=0.14406380, (0 missing)
## solar_rad < 0.005 to the left, improve=0.07712105, (0 missing)
## dewpointemp < -10.45 to the left, improve=0.06816643, (0 missing)
## Surrogate splits:
## temperature < 2.05 to the left, agree=0.837, adj=0.664, (0 split)
## dewpointemp < -3.75 to the left, agree=0.799, adj=0.586, (0 split)
## humidity < 57.5 to the left, agree=0.644, adj=0.267, (0 split)
## windspeed < 1.55 to the right, agree=0.578, adj=0.131, (0 split)
## snowfall < 0.05 to the right, agree=0.567, adj=0.108, (0 split)
##
## Node number 3: 3206 observations, complexity param=0.1634594
## mean=1011.933, MSE=477911.1
## left son=6 (1833 obs) right son=7 (1373 obs)
## Primary splits:
## hour splits as LRL, improve=0.27110490, (0 missing)
## humidity < 80.5 to the right, improve=0.14887800, (0 missing)
## rainfall < 0.05 to the right, improve=0.11995590, (0 missing)
## functioningday splits as LR, improve=0.10221200, (0 missing)
## solar_rad < 0.045 to the left, improve=0.07750853, (0 missing)
## Surrogate splits:
## solar_rad < 1.275 to the right, agree=0.621, adj=0.114, (0 split)
## windspeed < 2.35 to the left, agree=0.590, adj=0.042, (0 split)
## visibility < 1986.5 to the left, agree=0.579, adj=0.017, (0 split)
## dewpointemp < 22.75 to the left, agree=0.577, adj=0.012, (0 split)
## temperature < 13.75 to the right, agree=0.575, adj=0.007, (0 split)
##
## Node number 4: 1506 observations
## mean=222.344, MSE=22339.8
##
## Node number 5: 1421 observations, complexity param=0.02727066
## mean=521.7575, MSE=193380.3
## left son=10 (568 obs) right son=11 (853 obs)
## Primary splits:
## hour splits as RRL, improve=0.26585510, (0 missing)
## humidity < 61.5 to the right, improve=0.12874380, (0 missing)
## solar_rad < 0.005 to the left, improve=0.12643280, (0 missing)
## rainfall < 0.05 to the right, improve=0.06214300, (0 missing)
## seasons splits as RL--, improve=0.06061931, (0 missing)
## Surrogate splits:
## solar_rad < 0.005 to the left, agree=0.764, adj=0.408, (0 split)
## windspeed < 0.85 to the left, agree=0.659, adj=0.146, (0 split)
## humidity < 62.5 to the right, agree=0.625, adj=0.062, (0 split)
## dewpointemp < 2.25 to the right, agree=0.616, adj=0.040, (0 split)
## temperature < 2.55 to the left, agree=0.614, adj=0.035, (0 split)
##
## Node number 6: 1833 observations, complexity param=0.05072879
## mean=700.4054, MSE=194532.7
## left son=12 (908 obs) right son=13 (925 obs)
## Primary splits:
## solar_rad < 0.695 to the left, improve=0.3615243, (0 missing)
## hour splits as R-L, improve=0.2728347, (0 missing)
## humidity < 56.5 to the right, improve=0.2220229, (0 missing)
## rainfall < 0.05 to the right, improve=0.1368647, (0 missing)
## functioningday splits as LR, improve=0.1165556, (0 missing)
## Surrogate splits:
## hour splits as R-L, agree=0.932, adj=0.863, (0 split)
## humidity < 59.5 to the right, agree=0.830, adj=0.656, (0 split)
## windspeed < 1.25 to the left, agree=0.708, adj=0.411, (0 split)
## dewpointemp < 9.35 to the right, agree=0.639, adj=0.272, (0 split)
## temperature < 24.95 to the left, agree=0.616, adj=0.226, (0 split)
##
## Node number 7: 1373 observations, complexity param=0.06097695
## mean=1427.832, MSE=553694.2
## left son=14 (99 obs) right son=15 (1274 obs)
## Primary splits:
## rainfall < 0.15 to the right, improve=0.19939500, (0 missing)
## humidity < 86.5 to the right, improve=0.18747820, (0 missing)
## functioningday splits as LR, improve=0.18294860, (0 missing)
## visibility < 757.5 to the left, improve=0.08676154, (0 missing)
## temperature < 22.05 to the left, improve=0.07971517, (0 missing)
## Surrogate splits:
## humidity < 91 to the right, agree=0.966, adj=0.535, (0 split)
## visibility < 220.5 to the left, agree=0.932, adj=0.051, (0 split)
## dewpointemp < 26.35 to the right, agree=0.930, adj=0.030, (0 split)
##
## Node number 10: 568 observations
## mean=243.8956, MSE=34236.05
##
## Node number 11: 853 observations, complexity param=0.01228374
## mean=706.7816, MSE=213707
## left son=22 (189 obs) right son=23 (664 obs)
## Primary splits:
## humidity < 77.5 to the right, improve=0.17123830, (0 missing)
## rainfall < 0.1 to the right, improve=0.10051090, (0 missing)
## visibility < 1028 to the left, improve=0.09575222, (0 missing)
## seasons splits as RL--, improve=0.09573489, (0 missing)
## dewpointemp < 5.05 to the right, improve=0.08804882, (0 missing)
## Surrogate splits:
## dewpointemp < 6.45 to the right, agree=0.871, adj=0.418, (0 split)
## rainfall < 0.1 to the right, agree=0.832, adj=0.243, (0 split)
## visibility < 352 to the left, agree=0.822, adj=0.196, (0 split)
## snowfall < 2.45 to the right, agree=0.796, adj=0.079, (0 split)
##
## Node number 12: 908 observations
## mean=432.7395, MSE=99375.04
##
## Node number 13: 925 observations, complexity param=0.01766738
## mean=963.152, MSE=148577.5
## left son=26 (46 obs) right son=27 (879 obs)
## Primary splits:
## functioningday splits as LR, improve=0.32667470, (0 missing)
## dewpointemp < 19.15 to the right, improve=0.09446312, (0 missing)
## temperature < 30.05 to the right, improve=0.08616519, (0 missing)
## humidity < 36.5 to the right, improve=0.06290279, (0 missing)
## solar_rad < 1.765 to the left, improve=0.04175090, (0 missing)
##
## Node number 14: 99 observations
## mean=235.8788, MSE=127073.7
##
## Node number 15: 1274 observations, complexity param=0.06097695
## mean=1520.456, MSE=467862.9
## left son=30 (65 obs) right son=31 (1209 obs)
## Primary splits:
## functioningday splits as LR, improve=0.26561910, (0 missing)
## temperature < 19.95 to the left, improve=0.09107253, (0 missing)
## humidity < 68.5 to the right, improve=0.04688175, (0 missing)
## windspeed < 1.15 to the left, improve=0.03554759, (0 missing)
## solar_rad < 0.045 to the left, improve=0.03484219, (0 missing)
##
## Node number 22: 189 observations
## mean=348.2206, MSE=138102
##
## Node number 23: 664 observations
## mean=808.8419, MSE=188216
##
## Node number 26: 46 observations
## mean=0.1, MSE=1.92593e-34
##
## Node number 27: 879 observations
## mean=1013.551, MSE=105276.3
##
## Node number 30: 65 observations
## mean=0.1, MSE=1.232595e-32
##
## Node number 31: 1209 observations, complexity param=0.015312
## mean=1602.196, MSE=362062.2
## left son=62 (270 obs) right son=63 (939 obs)
## Primary splits:
## humidity < 72.5 to the right, improve=0.08889154, (0 missing)
## temperature < 16.85 to the left, improve=0.07816428, (0 missing)
## visibility < 752 to the left, improve=0.05034945, (0 missing)
## windspeed < 1.25 to the left, improve=0.04615068, (0 missing)
## solar_rad < 0.045 to the left, improve=0.03783253, (0 missing)
## Surrogate splits:
## visibility < 550.5 to the left, agree=0.813, adj=0.163, (0 split)
## windspeed < 0.45 to the left, agree=0.787, adj=0.044, (0 split)
## dewpointemp < 24.45 to the right, agree=0.786, adj=0.041, (0 split)
##
## Node number 62: 270 observations
## mean=1267.637, MSE=266321.2
##
## Node number 63: 939 observations, complexity param=0.01184149
## mean=1698.395, MSE=348153
## left son=126 (312 obs) right son=127 (627 obs)
## Primary splits:
## temperature < 19.95 to the left, improve=0.09204681, (0 missing)
## windspeed < 1.25 to the left, improve=0.04064881, (0 missing)
## solar_rad < 0.025 to the left, improve=0.03132255, (0 missing)
## dewpointemp < 11.25 to the left, improve=0.02413678, (0 missing)
## seasons splits as RLR-, improve=0.01279832, (0 missing)
## Surrogate splits:
## dewpointemp < 9.35 to the left, agree=0.838, adj=0.513, (0 split)
## seasons splits as LLR-, agree=0.764, adj=0.288, (0 split)
## humidity < 26.5 to the left, agree=0.684, adj=0.048, (0 split)
## visibility < 758 to the left, agree=0.682, adj=0.042, (0 split)
## windspeed < 4.15 to the right, agree=0.670, adj=0.006, (0 split)
##
## Node number 126: 312 observations
## mean=1444.622, MSE=265882.9
##
## Node number 127: 627 observations
## mean=1824.675, MSE=341098.4
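The CP table at the top of the summary can also be used to prune the tree explicitly at the cp with the smallest cross-validated error (xerror); a short optional sketch:
# prune at the cp value with the lowest cross-validated error
best_cp <- bikes.tree$cptable[which.min(bikes.tree$cptable[, "xerror"]), "CP"]
bikes.tree_pruned <- prune(bikes.tree, cp = best_cp)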
Let's generate the predictions and RMSE on both the training and test data.
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bikes.tree,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 125264.7 353.9275 249.1645 172.8956 1.114458 0.6976824
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bikes.tree,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 134647.7 366.9438 257.7226 177.8788 1.215412 0.6792232
Compared with our linear regression benchmark, the RMSE has improved considerably. Still, there is slight overfitting, since the decision tree does not perform quite as well on the test data. Let's apply cross-validation with a tuning grid.
tc <- trainControl(method = "cv", number = 10)
cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))
set.seed(123456789)
bikes.tree_tuned <- train(model.formula,
data = train_data,
method = "rpart",
trControl = tc,
tuneGrid = cp.grid)
Let's plot how the RMSE responds to changes in the complexity parameter.
plot(bikes.tree_tuned)
The plot above shows that we get the lowest RMSE for cp = 0.001, while the RMSE increases for every other value. Now let's compute predictions and RMSE values on the training and test data using the tuned model.
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bikes.tree_tuned,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 83506.5 288.9749 197.9861 131.2474 0.591729 0.7984628
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bikes.tree_tuned,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 106042.5 325.6416 219.3235 145.0231 0.655586 0.7473706
The results above show that we managed to reduce the training RMSE drastically, to 288, compared with the linear regression and the full decision tree. The test RMSE of 325 is also lower than for both of those models, but the gap between training and test error is the widest so far, so this bikes.tree_tuned model is the most overfitted one up to this point.
Bagging
This time we will use the bagging method by specifying method = "treebag". We do not need a tuneGrid (or tuneLength) here, since caret exposes no hyperparameters to tune for this model.
bikes.tree_bagged <- train(model.formula,
data = train_data,
method = "treebag",
trControl = tc)
bikes.tree_bagged
## Bagged CART
##
## 6133 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 5518, 5520, 5520, 5519, 5518, 5521, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 342.8511 0.7179281 241.4324
plot(varImp(bikes.tree_bagged), main="Variable Importance with Regression Bagging")
The plot above shows variable importance, with the most important predictors at the top. Let's now compute the error metrics for the bagged model.
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bikes.tree_bagged,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 114038.4 337.6957 237.1042 165.2226 1.17922 0.7247763
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bikes.tree_bagged,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 123474.4 351.3893 247.2098 170.9239 1.31071 0.7058418
The output above still shows a gap between training and test error, so a slight overfitting problem remains.
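One knob that might narrow this gap is the number of bagged trees: treebag exposes no caret tuning parameters, but nbagg (ipred's default is 25) can be passed through train()'s dots. A hypothetical variant, not run here:
# hypothetical: more bagged trees via the nbagg pass-through to ipred::bagging()
bikes.tree_bagged_50 <- train(model.formula,
                              data = train_data,
                              method = "treebag",
                              trControl = tc,
                              nbagg = 50)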
Random Forest
Let's now apply a random forest with cross-validation.
set.seed(71)
bikes.tree_rf <- train(model.formula,
data = train_data,
method = "ranger",
trControl = tc)
bikes.tree_rf
## Random Forest
##
## 6133 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 5519, 5520, 5521, 5519, 5519, 5519, ...
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 315.7303 0.7828877 220.5034
## 2 extratrees 348.1651 0.7393025 244.5772
## 8 variance 274.7520 0.8182531 179.3584
## 8 extratrees 272.6129 0.8215457 178.7928
## 15 variance 278.0289 0.8141102 180.3669
## 15 extratrees 271.6199 0.8226332 177.4247
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 15, splitrule = extratrees
## and min.node.size = 5.
plot(bikes.tree_rf)
Let's compute the RMSE on both training and test data and see how it compares with the models above.
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bikes.tree_rf,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 18902.43 137.4861 89.31816 53.05743 0.1382353 0.9543803
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bikes.tree_rf,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 79872.57 282.6174 182.6458 111.0392 0.3403784 0.8097163
The random forest results are quite interesting: it produces by far the lowest RMSE on the training data, and also the lowest test RMSE so far. However, the gap between its training RMSE (137) and test RMSE (282) is the largest of all the models, which indicates the strongest overfitting.
Let's tune this model and try different parameter combinations to see if that helps. One option would be to stay within caret and pass ranger an explicit grid (see the hypothetical sketch below); here we instead call randomForest() directly with hand-picked parameters.
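For completeness, a hypothetical caret grid for method = "ranger" would look like this (caret expects exactly these three columns):
# hypothetical grid; not run here, we hand-tune randomForest() below instead
ranger_grid <- expand.grid(mtry          = c(2, 8, 15),
                           splitrule     = c("variance", "extratrees"),
                           min.node.size = c(5, 20, 100))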
bkes_rf_tuned_1 <- randomForest(model.formula,
data = train_data,
ntree = 100,
mtry = 8,
# minimum number of obs in the terminal nodes
nodesize = 100,
# we also generate predictors importance measures,
importance = TRUE)
print(bkes_rf_tuned_1)
##
## Call:
## randomForest(formula = model.formula, data = train_data, ntree = 100, mtry = 8, nodesize = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 82604.63
## % Var explained: 80.06
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_1,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 66457.59 257.7937 171.521 105.4981 0.587629 0.8396092
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_1,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 87908.72 296.4941 197.826 125.1216 0.6979514 0.7905714
Now let's try mtry = 6.
bkes_rf_tuned_2<- randomForest(model.formula,
data = train_data,
ntree = 100,
mtry = 6,
# minimum number of obs in the terminal nodes
nodesize = 100,
# we also generate predictors importance measures,
importance = TRUE)
print(bkes_rf_tuned_2)
##
## Call:
## randomForest(formula = model.formula, data = train_data, ntree = 100, mtry = 6, nodesize = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 82875.83
## % Var explained: 80
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_2,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 67466.62 259.7434 173.1614 106.3531 0.7774465 0.837174
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_2,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 88255.55 297.0784 198.6865 127.5221 0.8980173 0.7897452
parameters_rf <- expand.grid(mtry = 2:12)
bkes_rf_tuned_3 <-
train(model.formula,
data = train_data,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = tc)
regressionMetrics(real = train_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_3,
train_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 66577.27 258.0257 171.7525 103.4091 0.5889308 0.8393204
regressionMetrics(real = test_data$rntd_bike_count,
predicted = predict(bkes_rf_tuned_3,
test_data))
## MSE RMSE MAE MedAE MSLE R2
## 1 88339.78 297.2201 197.8014 126.8731 0.7054815 0.7895445
plot(bkes_rf_tuned_3)
Summary and Conclusions
Let's compare the results of the estimated models: the linear regression benchmark (bikes_lm_benchmark), the decision tree (bikes.tree), the tuned decision tree (bikes.tree_tuned), the bagged model (bikes.tree_bagged), and the random forests (bikes.tree_rf, bkes_rf_tuned_1, bkes_rf_tuned_2, bkes_rf_tuned_3). We compute the predicted values for each model and summarize the error measures.
bikes_model_list <- list(bikes_lm_benchmark =bikes_lm_benchmark ,
bikes.tree=bikes.tree,
bikes.tree_tuned= bikes.tree_tuned,
bikes.tree_bagged=bikes.tree_bagged,
bikes.tree_rf=bikes.tree_rf,
bkes_rf_tuned_1=bkes_rf_tuned_1,
bkes_rf_tuned_2=bkes_rf_tuned_2,
bkes_rf_tuned_3=bkes_rf_tuned_3)
bikes_fitted <- bikes_model_list %>%
sapply(function(x) predict(x, newdata=train_data)) %>%
data.frame()
sapply(bikes_fitted,
function(x) regressionMetrics(real = train_data$rntd_bike_count, predicted = x
)) %>% t()
## MSE RMSE MAE MedAE MSLE R2
## bikes_lm_benchmark 901754.8 949.6077 699.0244 498.9379 18.10552 -1.176323
## bikes.tree 125264.7 353.9275 249.1645 172.8956 1.114458 0.6976824
## bikes.tree_tuned 83506.5 288.9749 197.9861 131.2474 0.591729 0.7984628
## bikes.tree_bagged 114038.4 337.6957 237.1042 165.2226 1.17922 0.7247763
## bikes.tree_rf 18902.43 137.4861 89.31816 53.05743 0.1382353 0.9543803
## bkes_rf_tuned_1 66457.59 257.7937 171.521 105.4981 0.587629 0.8396092
## bkes_rf_tuned_2 67466.62 259.7434 173.1614 106.3531 0.7774465 0.837174
## bkes_rf_tuned_3 66577.27 258.0257 171.7525 103.4091 0.5889308 0.8393204
bikes_fitted <- bikes_model_list %>%
sapply(function(x) predict(x, newdata=test_data)) %>%
data.frame()
sapply(bikes_fitted,
function(x) regressionMetrics(real = test_data$rntd_bike_count, predicted = x
)) %>% t()
## MSE RMSE MAE MedAE MSLE R2
## bikes_lm_benchmark 907814.4 952.7929 699.5142 499.9987 18.12239 -1.162723
## bikes.tree 134647.7 366.9438 257.7226 177.8788 1.215412 0.6792232
## bikes.tree_tuned 106042.5 325.6416 219.3235 145.0231 0.655586 0.7473706
## bikes.tree_bagged 123474.4 351.3893 247.2098 170.9239 1.31071 0.7058418
## bikes.tree_rf 79872.57 282.6174 182.6458 111.0392 0.3403784 0.8097163
## bkes_rf_tuned_1 87908.72 296.4941 197.826 125.1216 0.6979514 0.7905714
## bkes_rf_tuned_2 88255.55 297.0784 198.6865 127.5221 0.8980173 0.7897452
## bkes_rf_tuned_3 88339.78 297.2201 197.8014 126.8731 0.7054815 0.7895445
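Since overfitting here is judged by the gap between test and training RMSE, a short sketch that tabulates this gap directly makes the comparison easier to read:
# train-test RMSE gap per model (a larger gap indicates stronger overfitting);
# note the lm benchmark predicts on the log scale, as discussed above
rmse <- function(real, pred) sqrt(mean((real - pred)^2))
train_rmse <- sapply(bikes_model_list, function(m)
  rmse(train_data$rntd_bike_count, predict(m, newdata = train_data)))
test_rmse  <- sapply(bikes_model_list, function(m)
  rmse(test_data$rntd_bike_count, predict(m, newdata = test_data)))
sort(test_rmse - train_rmse)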
We applied several different models. The output above shows that the random forest (bikes.tree_rf) produces the lowest RMSE on both the training and the test data, but it also has by far the largest gap between the two, which makes it the most overfitted model. All the models overfit slightly, but weighing RMSE against overfitting we would select the tuned random forest (bkes_rf_tuned_1): it keeps the RMSE low on both the training (257) and test (296) data while the gap between them is much smaller, so in relative terms we can expect it to perform similarly well on unseen data.