| Dataset Characteristics | No of Observations | Domain | Number of Attributes | Missing Values |
|---|---|---|---|---|
| Multivariate | 17,379 | Business | 12 | - |
Using these Bike Sharing systems, people rent bikes from one location and return it to a different or same place on need basis. People can rent a bike through membership (mostly regular users) or on demand basis (mostly casual users). This process is controlled by a network of automated kiosk across the city.Currently, there are over 500 bike-sharing programs around the world. Inorder to forecast bike share demand in the in Washington, D.C.we need to combine historical usage patterns with weather data.
In this project, we have performed some exploratory analysis on the data set from the kaggle competition, which contains historical data of bike sharing system in Washington, D.C. from the beginning of 2011 to the end of 2012.
The data is saved in a comma separated file (CSV) with 12 attributes. All the data has already been converted to numeric values.
Bike sharing programs are popular around the world. In May 2014, the machine learning competition website kaggle.com opened the competition "Forecast use of a city bikeshare system". In this competition, participants were+ asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
This analysis attempts to generate a machine learning model based on linear regression to predict rental demand by the hour. The training set (http://www.kaggle.com/c/bike-sharing-demand/data) includes hourly rental data for the first 19 days of the month. The goal is to predict the rest of the days for that month.
In the training data set, they have separately given bike demand by registered, casual users and sum of both is given as count.
Training data set has 12 variables (see below) and Test has 9 (excluding registered, casual and count).
The following is a list of the attributes from the dataset, and a brief description of their purpose.
Independent Variables:
datetime : date and hour in "mm/dd/yyyy hh:mm" format
season : Four categories-> 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday : whether the day is a holiday or not (1/0)
workingday : whether the day is neither a weekend nor holiday (1/0)
weather : Four Categories of weather
1-> Clear, Few clouds, Partly cloudy, Partly cloudy
2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp : hourly temperature in Celsius
atemp : "feels like" temperature in Celsius
humidity : relative humidity
windspeed : wind speed
Dependent Variables:
registered : number of registered user
casual : number of non-registered user
count : number of total rentals (registered + casual)
setwd("E:/LIBA_EDBA/Regression and Classification/Bike Sharing Dataset")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
str(test)
## 'data.frame': 6493 obs. of 9 variables:
## $ datetime : Factor w/ 6493 levels "20-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 1 1 1 1 1 1 1 1 1 1 ...
## $ weather : int 1 1 1 1 1 1 1 1 1 2 ...
## $ temp : num 10.7 10.7 10.7 10.7 10.7 ...
## $ atemp : num 11.4 13.6 13.6 12.9 12.9 ...
## $ humidity : int 56 56 56 56 56 60 60 55 55 52 ...
## $ windspeed : num 26 0 0 11 11 ...
str(train)
## 'data.frame': 10886 obs. of 12 variables:
## $ datetime : Factor w/ 10886 levels "01-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
test$registered=0
test$casual=0
test$count=0
combineddata <- rbind(train, test)
str(combineddata)
## 'data.frame': 17379 obs. of 12 variables:
## $ datetime : Factor w/ 17379 levels "01-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : num 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num 13 32 27 10 1 1 0 2 7 6 ...
## $ count : num 16 40 32 13 1 1 2 3 8 14 ...
table(is.na(combineddata))
##
## FALSE
## 208548
combineddata$season=as.factor(combineddata$season)
combineddata$weather=as.factor(combineddata$weather)
combineddata$holiday=as.factor(combineddata$holiday)
combineddata$workingday=as.factor(combineddata$workingday)
Prediction is that there will be High Demand : Peak Hours (school & Office timings) Low Demand : 11pm-5.00 am Not sure about the trend during early mornings and evenings
There is no variable ‘hour’, we only have ‘datetime’. So we can extract it using the datetime column.
combineddata$hour=substr(combineddata$datetime,12,13)
combineddata$hour=as.factor(combineddata$hour)
train=combineddata[as.integer(substr(combineddata$datetime,9,10))<20,]
test=combineddata[as.integer(substr(combineddata$datetime,9,10))>19,]
layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(train$count~train$hour,xlab="hour",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","sienna","royalblue2", "palevioletred1"))
boxplot(train$casual~train$hour, xlab="hour",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(train$registered~train$hour,xlab="hour", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))
From plot the trend of bike demand over hours we can segregate it into three categories:
High Demand : 7-9 and 17-19 hours Average Demand : 10-16 hours Low : 0-6 and 20-24 hours
Also registered users have similar trend as count Whereas, casual users have different trend. So hour is a significant variable.
Prediction is that the Registered users might demand more bike on weekdays as compared to weekend or holiday
date=substr(combineddata$datetime,1,10)
days<-weekdays(as.Date(date))
combineddata$day=days
train=combineddata[as.integer(substr(combineddata$day,1,10))<20,]
test=combineddata[as.integer(substr(combineddata$day,1,10))>19,]
layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(combineddata$count~combineddata$day,xlab="day",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","sienna","royalblue2", "palevioletred1"))
boxplot(combineddata$casual~combineddata$day, xlab="day",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(combineddata$registered~combineddata$day,xlab="day", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))
Can predict more but can roughly say that the demand for casual users increases during weekends
From the datetime column and see the trend of bike demand over year.
layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(combineddata$count~combineddata$year,xlab="year",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","royalblue2", "palevioletred1"))
boxplot(combineddata$casual~combineddata$year, xlab="year",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(combineddata$registered~combineddata$year,xlab="year", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))
Temperature, Windspeed and Humidity: These are continuous variables so we can look at the correlation factor. From the plot we can infer the following:
1. Variable temp - positiv correlation with dependent variables (casual is more compare to registered)
2. Variable atemp - high correlation with temp.
3. Windspeed - low correlation when compared to temp and humidity
Now we are splitting the train and test data from the Combineddata which we used for exploration
train_modified <- combineddata[1:10886,]
test_modified <- combineddata[10887:17379,]
For modelling purpose, lets do a partition of the "train_modified"
set.seed(1143)
Train_Partition <- createDataPartition(1:nrow(train_modified),p=0.5,list=F,times = 1)
Train_Partition_Training <- train_modified[Train_Partition, ]
Train_Partition_Validation <- train_modified[-Train_Partition, ]
Since variable temp has high positive correlation with dependent variables (casual is more compare to registered). We are building a linear model using temp & count variables
mod1<- lm(Train_Partition_Training$count ~ hour+temp, data=Train_Partition_Training)
summary(mod1)
##
## Call:
## lm(formula = Train_Partition_Training$count ~ hour + temp, data = Train_Partition_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -309.15 -101.21 -34.98 59.60 694.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -83.3166 6.5211 -12.78 <2e-16 ***
## hour 9.0497 0.3046 29.71 <2e-16 ***
## temp 7.9823 0.2698 29.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154 on 5441 degrees of freedom
## Multiple R-squared: 0.2737, Adjusted R-squared: 0.2735
## F-statistic: 1025 on 2 and 5441 DF, p-value: < 2.2e-16
predict_mod1_TrainValidation<-round(predict(mod1,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod1_TrainValidation)
## [1] 1.577749
Now lets run model using the test
predict_test_1 <- round(predict(mod1, newdata = test_modified))
head(predict_test_1)
## 10887 10888 10889 10890 10891 10892
## 11 20 29 38 47 50
bikeevaluate(Train_Partition_Validation, predict_test_1)
## [1] 1.78042
Lets do some transformation based on the distribution of the data
descdist(Train_Partition_Training$count, discrete = FALSE, obs.col = "darkblue", boot.col = "yellow",boot=500)
## summary statistics
## ------
## min: 2 max: 949
## median: 141
## mean: 190.9791
## estimated sd: 180.7069
## estimated skewness: 1.231578
## estimated kurtosis: 4.239825
Based on the Cullen and Frey Graph, Model 2, Model 3, Model 4 are built based on the distribution of the data
mod2 <- fitdist(Train_Partition_Training$count, "gamma", method = "mge")
plot(mod2)
predict_mod2_TrainValidation<-round(predict(mod2,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod2_TrainValidation)
Now lets run model using the test
predict_test_2 <- round(predict(mod2, newdata = test_modified))
bikeevaluate(Train_Partition_Validation, predict_test_2)
2.112379
mod3<-fitdist(Train_Partition_Training$count,"norm")
plot(mod3)
form <- formula(Train_Partition_Training$logcount ~ season + holiday + workingday + weather + temp + atemp +
humidity + windspeed + hour + day +year)
mod5 <- rpart(form, data = Train_Partition_Training)
print(mod5)
## n= 5444
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 5444 11014.92000 4.576449
## 2) hour< 7.5 1581 2042.19700 2.877024
## 4) hour< 6.5 1355 1499.77600 2.703309
## 8) hour>=2.5 923 783.27010 2.354324 *
## 9) hour< 2.5 432 363.91620 3.448940
## 18) workingday=1 287 166.78810 3.034163 *
## 19) workingday=0 145 50.02320 4.269913 *
## 5) hour>=6.5 226 256.37390 3.918546
## 10) workingday=0 73 60.45990 2.738760 *
## 11) workingday=1 153 45.82587 4.481450 *
## 3) hour>=7.5 3863 2538.01100 5.271968
## 6) atemp< 16.2875 814 579.73750 4.625301 *
## 7) atemp>=16.2875 3049 1526.99900 5.444611
## 14) year=2011 1463 745.72850 5.183103 *
## 15) year=2012 1586 588.93080 5.685838 *
predictValidation5 <- round(predict(mod5, newdata = Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predictValidation5)
## [1] 3.236562
predict_test_5 <- round(predict(mod5, newdata = test_modified))
head(predict_test_5)
## 10887 10888 10889 10890 10891 10892
## 3 3 2 2 2 2
Evaluate model
bikeevaluate(Train_Partition_Validation, predict_test_5)
## [1] 3.470934
mod6_lm.full <- lm(Train_Partition_Training$count ~season + holiday + workingday + weather + temp + atemp +humidity + windspeed + hour + day +year, data = Train_Partition_Training)
drop1() shows the partial F-test p-value for each variable. The same p-value can be obtained by using anova(model.full, model.one.fewer.var) to compare two models. Remove the least significant variable using update() and do drop1() again until all remaining variables is significant at a pre-specified cutoff (0.1 here).
drop1(mod6_lm.full, test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + holiday + workingday +
## weather + temp + atemp + humidity + windspeed + hour + day +
## year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108173091 53921
## season 3 3974259 112147350 54112 66.4133 < 2.2e-16 ***
## holiday 1 47933 108221024 53922 2.4030 0.1211607
## workingday 1 3591 108176682 53919 0.1800 0.6713612
## weather 3 330016 108503107 53932 5.5149 0.0008866 ***
## temp 1 63209 108236300 53922 3.1688 0.0751117 .
## atemp 1 189974 108363065 53929 9.5239 0.0020384 **
## humidity 1 4858977 113032068 54158 243.5932 < 2.2e-16 ***
## windspeed 1 117 108173208 53919 0.0059 0.9390119
## hour 1 12920212 121093304 54533 647.7240 < 2.2e-16 ***
## day 6 145089 108318180 53916 1.2123 0.2964689
## year 1 8333481 116506572 54323 417.7792 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Holiday is the least significant based on “F” Statistics
drop1(update(mod6_lm.full, ~ . -holiday), test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + workingday + weather +
## temp + atemp + humidity + windspeed + hour + day + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108221024 53922
## season 3 3948456 112169480 54111 65.9651 < 2.2e-16 ***
## workingday 1 36 108221060 53920 0.0018 0.9661072
## weather 3 326729 108547754 53932 5.4585 0.0009601 ***
## temp 1 59876 108280900 53923 3.0010 0.0832725 .
## atemp 1 196780 108417804 53929 9.8625 0.0016959 **
## humidity 1 4854840 113075864 54158 243.3229 < 2.2e-16 ***
## windspeed 1 118 108221142 53920 0.0059 0.9387904
## hour 1 12927738 121148763 54534 647.9337 < 2.2e-16 ***
## day 6 147711 108368735 53917 1.2339 0.2853578
## year 1 8321062 116542086 54323 417.0487 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now Working day is the least significant based on “F” Statistics
drop1(update(mod6_lm.full, ~ . -holiday -workingday), test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + weather + temp + atemp +
## humidity + windspeed + hour + day + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108221060 53920
## season 3 3948429 112169489 54109 65.9768 < 2.2e-16 ***
## weather 3 327259 108548320 53930 5.4684 0.0009468 ***
## temp 1 59885 108280946 53921 3.0020 0.0832192 .
## atemp 1 197311 108418371 53927 9.8910 0.0016699 **
## humidity 1 4856883 113077944 54157 243.4701 < 2.2e-16 ***
## windspeed 1 118 108221178 53918 0.0059 0.9387837
## hour 1 12929914 121150974 54532 648.1620 < 2.2e-16 ***
## day 6 149252 108370312 53915 1.2470 0.2787738
## year 1 8321382 116542442 54321 417.1415 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now temp & day are least significant
drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp), test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity +
## windspeed + hour + day + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108280946 53921
## season 3 3888560 112169506 54107 64.9524 < 2.2e-16 ***
## weather 3 319000 108599945 53931 5.3284 0.001154 **
## atemp 1 7932297 116213243 54303 397.4905 < 2.2e-16 ***
## humidity 1 5031880 113312826 54166 252.1494 < 2.2e-16 ***
## windspeed 1 4697 108285643 53919 0.2354 0.627573
## hour 1 12980638 121261584 54535 650.4648 < 2.2e-16 ***
## day 6 152316 108433261 53916 1.2721 0.266479
## year 1 8394922 116675868 54325 420.6728 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp -day), test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity +
## windspeed + hour + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108433261 53916
## season 3 3876825 112310086 54101 64.7369 < 2.2e-16 ***
## weather 3 327935 108761196 53927 5.4760 0.0009367 ***
## atemp 1 7986334 116419595 54301 400.0780 < 2.2e-16 ***
## humidity 1 5019449 113452710 54161 251.4509 < 2.2e-16 ***
## windspeed 1 4108 108437370 53914 0.2058 0.6500953
## hour 1 12993830 121427091 54530 650.9302 < 2.2e-16 ***
## year 1 8388578 116821840 54320 420.2286 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp -day -windspeed), test = "F")
## Single term deletions
##
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity +
## hour + year
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 108437370 53914
## season 3 3875545 112312915 54100 64.7250 < 2.2e-16 ***
## weather 3 323890 108761260 53925 5.4093 0.001029 **
## atemp 1 7983483 116420853 54299 399.9937 < 2.2e-16 ***
## humidity 1 5639583 114076952 54188 282.5581 < 2.2e-16 ***
## hour 1 13067031 121504401 54532 654.6929 < 2.2e-16 ***
## year 1 8388500 116825870 54318 420.2861 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(update(mod6_lm.full, ~ . -holiday -workingday -temp -day -windspeed))
##
## Call:
## lm(formula = Train_Partition_Training$count ~ season + weather +
## atemp + humidity + hour + year, data = Train_Partition_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.13 -94.34 -29.36 58.26 616.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.7140 10.8413 -1.726 0.08437 .
## season2 20.6311 6.8747 3.001 0.00270 **
## season3 6.3794 8.4991 0.751 0.45293
## season4 69.4004 5.8196 11.925 < 2e-16 ***
## weather2 7.7399 4.6952 1.648 0.09931 .
## weather3 -23.7046 7.6702 -3.090 0.00201 **
## weather4 47.4441 141.4452 0.335 0.73732
## atemp 7.3753 0.3688 20.000 < 2e-16 ***
## humidity -1.9917 0.1185 -16.809 < 2e-16 ***
## hour 7.6002 0.2970 25.587 < 2e-16 ***
## year2012 79.3302 3.8696 20.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.3 on 5433 degrees of freedom
## Multiple R-squared: 0.3899, Adjusted R-squared: 0.3888
## F-statistic: 347.2 on 10 and 5433 DF, p-value: < 2.2e-16
predictValidation6 <- round(predict(mod6_lm.full, newdata = Train_Partition_Validation))
predictValidation6<- round(predictValidation6)
predictValidation6[predictValidation6<0] <- 0
bikeevaluate(Train_Partition_Training, predictValidation6)
## [1] 1.690499
predict_test_6 <- round(predict(mod6_lm.full, newdata = test_modified))
predict_test_6<- round(predict_test_6)
predict_test_6[predict_test_6<0] <- 0
Evaluate model
bikeevaluate(Train_Partition_Validation, predict_test_6)
## [1] 1.985764
mod7<- lm(Train_Partition_Training$count~season + weather + atemp + humidity + windspeed + hour +year,data=Train_Partition_Training,method = 'gbm' )
print(mod7)
##
## Call:
## lm(formula = Train_Partition_Training$count ~ season + weather +
## atemp + humidity + windspeed + hour + year, data = Train_Partition_Training,
## method = "gbm")
##
## Coefficients:
## (Intercept) season2 season3 season4 weather2
## -21.3108 20.6312 6.5023 69.5699 7.5925
## weather3 weather4 atemp humidity windspeed
## -24.2335 47.8730 7.3812 -1.9743 0.1139
## hour year2012
## 7.5922 79.4075
predict_mod7_TrainValidation<-round(predict(mod7,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod7_TrainValidation)
## [1] 1.689909
## [1] 1.989862
using rms::fastbw()
library(rms)
mod8 <- ols(Train_Partition_Training$logcount~season + holiday + workingday + weather + temp + atemp +
humidity + windspeed + hour + day +year, data = Train_Partition_Training)
fastbw(mod8, rule = "p", sls = 0.1)
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
## day 7.66 6 0.2641 7.66 6 0.2641 -4.34 0.499
## windspeed 0.99 1 0.3205 8.65 7 0.2790 -5.35 0.499
## temp 2.95 1 0.0856 11.60 8 0.1699 -4.40 0.499
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept 2.67818 0.0801580 33.4112 0.000e+00
## season=2 0.17785 0.0490785 3.6238 2.903e-04
## season=3 0.10710 0.0606979 1.7645 7.765e-02
## season=4 0.59016 0.0415566 14.2012 0.000e+00
## holiday=1 -0.15059 0.0838447 -1.7960 7.249e-02
## workingday=1 -0.09642 0.0304265 -3.1691 1.529e-03
## weather=2 0.14009 0.0335065 4.1809 2.904e-05
## weather=3 -0.17790 0.0547663 -3.2484 1.161e-03
## weather=4 0.93046 1.0093751 0.9218 3.566e-01
## atemp 0.05077 0.0026339 19.2745 0.000e+00
## humidity -0.01462 0.0008455 -17.2900 0.000e+00
## hour 0.09809 0.0021199 46.2722 0.000e+00
## year=2012 0.41028 0.0276147 14.8574 0.000e+00
##
## Factors in Final Model
##
## [1] season holiday workingday weather atemp humidity
## [7] hour year
predict_mod8_TrainValidation<-round(predict(mod8,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod8_TrainValidation)
## [1] 3.204593
## [1] 3.448292
mod9.aic.backward <- step(mod6_lm.full, direction = "backward", trace = 1)
## Start: AIC=53921.13
## Train_Partition_Training$count ~ season + holiday + workingday +
## weather + temp + atemp + humidity + windspeed + hour + day +
## year
##
## Df Sum of Sq RSS AIC
## - day 6 145089 108318180 53916
## - windspeed 1 117 108173208 53919
## - workingday 1 3591 108176682 53919
## <none> 108173091 53921
## - holiday 1 47933 108221024 53922
## - temp 1 63209 108236300 53922
## - atemp 1 189974 108363065 53929
## - weather 3 330016 108503107 53932
## - season 3 3974259 112147350 54112
## - humidity 1 4858977 113032068 54158
## - year 1 8333481 116506572 54323
## - hour 1 12920212 121093304 54533
##
## Step: AIC=53916.42
## Train_Partition_Training$count ~ season + holiday + workingday +
## weather + temp + atemp + humidity + windspeed + hour + year
##
## Df Sum of Sq RSS AIC
## - windspeed 1 27 108318207 53914
## - workingday 1 9194 108327374 53915
## <none> 108318180 53916
## - holiday 1 50555 108368735 53917
## - temp 1 67106 108385286 53918
## - atemp 1 185941 108504121 53924
## - weather 3 340170 108658350 53927
## - season 3 3963398 112281578 54106
## - humidity 1 4852417 113170597 54153
## - year 1 8323367 116641547 54317
## - hour 1 12921011 121239191 54528
##
## Step: AIC=53914.42
## Train_Partition_Training$count ~ season + holiday + workingday +
## weather + temp + atemp + humidity + hour + year
##
## Df Sum of Sq RSS AIC
## - workingday 1 9188 108327395 53913
## <none> 108318207 53914
## - holiday 1 50551 108368758 53915
## - temp 1 71506 108389712 53916
## - atemp 1 195390 108513597 53922
## - weather 3 342395 108660601 53926
## - season 3 3966312 112284518 54104
## - humidity 1 5251410 113569616 54170
## - year 1 8344859 116663065 54316
## - hour 1 12957438 121275645 54528
##
## Step: AIC=53912.89
## Train_Partition_Training$count ~ season + holiday + weather +
## temp + atemp + humidity + hour + year
##
## Df Sum of Sq RSS AIC
## <none> 108327395 53913
## - holiday 1 42938 108370333 53913
## - temp 1 69419 108396813 53914
## - atemp 1 198426 108525820 53921
## - weather 3 345469 108672863 53924
## - season 3 3963203 112290597 54102
## - humidity 1 5251343 113578738 54169
## - year 1 8346225 116673619 54315
## - hour 1 12974176 121301571 54527
summary(mod9.aic.backward)
##
## Call:
## lm(formula = Train_Partition_Training$count ~ season + holiday +
## weather + temp + atemp + humidity + hour + year, data = Train_Partition_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -317.27 -94.29 -29.26 58.53 614.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.9739 10.8465 -1.657 0.09755 .
## season2 18.8169 6.9372 2.712 0.00670 **
## season3 2.1127 8.8543 0.239 0.81142
## season4 69.2794 5.8256 11.892 < 2e-16 ***
## holiday1 -16.6700 11.3617 -1.467 0.14238
## weather2 7.3267 4.7015 1.558 0.11920
## weather3 -25.2859 7.7059 -3.281 0.00104 **
## weather4 48.8783 141.4036 0.346 0.72961
## temp 3.1289 1.6772 1.866 0.06216 .
## atemp 4.6812 1.4842 3.154 0.00162 **
## humidity -1.9528 0.1204 -16.226 < 2e-16 ***
## hour 7.5790 0.2972 25.504 < 2e-16 ***
## year2012 79.1736 3.8705 20.456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.2 on 5431 degrees of freedom
## Multiple R-squared: 0.3905, Adjusted R-squared: 0.3892
## F-statistic: 290 on 12 and 5431 DF, p-value: < 2.2e-16
predict_mod9_TrainValidation<-round(predict(mod9.aic.backward,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod9_TrainValidation)
## [1] 1.689367
## [1] 1.989146
mod10.aic.forward <- step(mod6_lm.full, direction = "forward", trace = 1,
scope = ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + day +year, data = Train_Partition_Training)
## Start: AIC=53921.13
## Train_Partition_Training$count ~ season + holiday + workingday +
## weather + temp + atemp + humidity + windspeed + hour + day +
## year
summary(mod10.aic.forward)
##
## Call:
## lm(formula = Train_Partition_Training$count ~ season + holiday +
## workingday + weather + temp + atemp + humidity + windspeed +
## hour + day + year, data = Train_Partition_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -327.75 -93.20 -29.45 59.55 617.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.01312 13.44739 -0.968 0.33323
## season2 18.51625 6.97563 2.654 0.00797 **
## season3 1.86676 8.94392 0.209 0.83468
## season4 69.54437 5.87388 11.840 < 2e-16 ***
## holiday1 -18.29361 11.80109 -1.550 0.12116
## workingday1 -1.86474 4.39481 -0.424 0.67136
## weather2 7.49136 4.71242 1.590 0.11196
## weather3 -24.56502 7.81357 -3.144 0.00168 **
## weather4 53.12738 141.48416 0.376 0.70730
## temp 3.08366 1.73228 1.780 0.07511 .
## atemp 4.73044 1.53283 3.086 0.00204 **
## humidity -1.96436 0.12586 -15.607 < 2e-16 ***
## windspeed 0.01983 0.25913 0.077 0.93901
## hour 7.57864 0.29778 25.450 < 2e-16 ***
## dayMonday 7.54843 7.34758 1.027 0.30431
## daySaturday -2.33223 7.25949 -0.321 0.74802
## daySunday -5.89870 7.09447 -0.831 0.40576
## dayThursday -9.89744 7.06004 -1.402 0.16100
## dayTuesday -6.70294 6.99810 -0.958 0.33820
## dayWednesday -6.01473 7.12451 -0.844 0.39858
## year2012 79.22746 3.87617 20.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.2 on 5423 degrees of freedom
## Multiple R-squared: 0.3914, Adjusted R-squared: 0.3892
## F-statistic: 174.4 on 20 and 5423 DF, p-value: < 2.2e-16
predict_mod10_TrainValidation<-round(predict(mod10.aic.forward,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod10_TrainValidation)
## [1] 1.690499
## [1] 1.989862
form11 <- formula(Train_Partition_Training$logcount ~ season + holiday + workingday + weather + temp + atemp +
humidity + windspeed)
mod11 <- rpart(form11, data = Train_Partition_Training)
print(mod11)
## n= 5444
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 5444 11014.9200 4.576449
## 2) atemp< 29.925 3851 7828.4480 4.277787
## 4) temp< 11.89 897 1835.5100 3.613460 *
## 5) temp>=11.89 2954 5476.8550 4.479514
## 10) humidity>=64.5 1662 3324.4740 4.125477 *
## 11) humidity< 64.5 1292 1676.0860 4.934940 *
## 3) atemp>=29.925 1593 2012.5640 5.298449
## 6) humidity>=67.5 403 760.9172 4.580619 *
## 7) humidity< 67.5 1190 973.6652 5.541545 *
predictValidation11 <- round(predict(mod11, newdata = Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predictValidation11)
## [1] 3.170218
predict_test_11 <- round(predict(mod11, newdata = test_modified))
head(predict_test_11)
## 10887 10888 10889 10890 10891 10892
## 4 4 4 4 4 4
Evaluate model
bikeevaluate(Train_Partition_Validation, predict_test_11)
## [1] 3.405507
| Model | Metric(RMSE) |
|---|---|
| RPART | 3.4709 |
| RPART | 3.4055 |
| Stepwise Regression | 3.4482 |
| Transformation | 2.1124 |
| GBM | 1.9898 |
| AIC-based backward | 1.9891 |
| Full to Null lm | 1.6904 |
| LM | 1.7804 |
Using different combinations of models and feature engineering, we found that hour has the most predictive feature for our problem, whereas weather variables play a much smaller, but still noticeable effect in contributing to accurate predictions.
Variable reduction models, Lm Model and GBM models best fits the data
Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg doi:10.1007/s13748-013-0040-3.
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Other resources from internet: http://ww.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture07b_PoissReg.pdf https://www.youtube.com/watch?v=mgIyPRhrVTY https://onlinecourses.science.psu.edu/stat504/node/169 https://rpubs.com/tybyers/25177 http://www.ats.ucla.edu/stat/r/dae/nbreg.htm