EXPLORATORY ANALYSIS ON BIKE SHARE FORECASTING

Dataset Overview:

Dataset Characteristics	No of Observations	Domain	Number of Attributes	Missing Values
Multivariate	17,379	Business	12	-

Background of the Dataset:

Using these Bike Sharing systems, people rent bikes from one location and return it to a different or same place on need basis. People can rent a bike through membership (mostly regular users) or on demand basis (mostly casual users). This process is controlled by a network of automated kiosk across the city.Currently, there are over 500 bike-sharing programs around the world. Inorder to forecast bike share demand in the in Washington, D.C.we need to combine historical usage patterns with weather data.

Synopsis:

In this project, we have performed some exploratory analysis on the data set from the kaggle competition, which contains historical data of bike sharing system in Washington, D.C. from the beginning of 2011 to the end of 2012. 
The data is saved in a comma separated file (CSV) with 12 attributes. All the data has already been converted to numeric values.

Problem Description:

Bike sharing programs are popular around the world. In May 2014, the machine learning competition website kaggle.com opened the competition "Forecast use of a city bikeshare system". In this competition, participants were+ asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
This analysis attempts to generate a machine learning model based on linear     regression to predict rental demand by the hour. The training set (http://www.kaggle.com/c/bike-sharing-demand/data) includes hourly rental data for the first 19 days of the month. The goal is to predict the rest of the days for that month.

Understanding the Data Set:

In the training data set, they have separately given bike demand by registered, casual users and sum of both is given as count.
Training data set has 12 variables (see below) and Test has 9 (excluding registered, casual and count).

The following is a list of the attributes from the dataset, and a brief description of their purpose.

Independent Variables:

datetime    : date and hour in "mm/dd/yyyy hh:mm" format
season      : Four categories-> 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday     : whether the day is a holiday or not (1/0)
workingday  : whether the day is neither a weekend nor holiday (1/0)
weather     : Four Categories of weather
        1-> Clear, Few clouds, Partly cloudy, Partly cloudy
        2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
        3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
        4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp        : hourly temperature in Celsius
atemp       : "feels like" temperature in Celsius
humidity    : relative humidity
windspeed   : wind speed

Dependent Variables:

registered  : number of registered user
casual      : number of non-registered user
count       : number of total rentals (registered + casual)

Importing Data set & Data Exploration:

setwd("E:/LIBA_EDBA/Regression and Classification/Bike Sharing Dataset")

1. Import Train and Test Data Set:

train <- read.csv("train.csv")
test <- read.csv("test.csv")
str(test)

## 'data.frame':    6493 obs. of  9 variables:
##  $ datetime  : Factor w/ 6493 levels "20-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weather   : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ temp      : num  10.7 10.7 10.7 10.7 10.7 ...
##  $ atemp     : num  11.4 13.6 13.6 12.9 12.9 ...
##  $ humidity  : int  56 56 56 56 56 60 60 55 55 52 ...
##  $ windspeed : num  26 0 0 11 11 ...

str(train)

## 'data.frame':    10886 obs. of  12 variables:
##  $ datetime  : Factor w/ 10886 levels "01-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

2. Combine test and train data set

test$registered=0
test$casual=0
test$count=0
combineddata <- rbind(train, test)

3. Structure of the dataset

str(combineddata)

## 'data.frame':    17379 obs. of  12 variables:
##  $ datetime  : Factor w/ 17379 levels "01-01-2011 00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : num  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : num  16 40 32 13 1 1 2 3 8 14 ...

4. Missing Values in the Dataset

table(is.na(combineddata))

## 
##  FALSE 
## 208548

5. Convert discrete variables into factor (season, weather, holiday, workingday)

combineddata$season=as.factor(combineddata$season)
combineddata$weather=as.factor(combineddata$weather)
combineddata$holiday=as.factor(combineddata$holiday)
combineddata$workingday=as.factor(combineddata$workingday)

Hourly Trend

Prediction is that there will be High Demand : Peak Hours (school & Office timings) Low Demand : 11pm-5.00 am Not sure about the trend during early mornings and evenings

There is no variable ‘hour’, we only have ‘datetime’. So we can extract it using the datetime column.

combineddata$hour=substr(combineddata$datetime,12,13)
combineddata$hour=as.factor(combineddata$hour)
train=combineddata[as.integer(substr(combineddata$datetime,9,10))<20,]
test=combineddata[as.integer(substr(combineddata$datetime,9,10))>19,]

layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(train$count~train$hour,xlab="hour",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","sienna","royalblue2", "palevioletred1"))
boxplot(train$casual~train$hour, xlab="hour",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(train$registered~train$hour,xlab="hour", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))

From plot the trend of bike demand over hours we can segregate it into three categories:

High Demand : 7-9 and 17-19 hours Average Demand : 10-16 hours Low : 0-6 and 20-24 hours

Also registered users have similar trend as count Whereas, casual users have different trend. So hour is a significant variable.

Daywise Trend:

Prediction is that the Registered users might demand more bike on weekdays as compared to weekend or holiday

date=substr(combineddata$datetime,1,10)
days<-weekdays(as.Date(date))
combineddata$day=days
train=combineddata[as.integer(substr(combineddata$day,1,10))<20,]
test=combineddata[as.integer(substr(combineddata$day,1,10))>19,]

layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(combineddata$count~combineddata$day,xlab="day",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","sienna","royalblue2", "palevioletred1"))
boxplot(combineddata$casual~combineddata$day, xlab="day",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(combineddata$registered~combineddata$day,xlab="day", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))

Can predict more but can roughly say that the demand for casual users increases during weekends

Time Trend:

From the datetime column and see the trend of bike demand over year.

layout(matrix(c(1,2,3),1,3,byrow=FALSE))
boxplot(combineddata$count~combineddata$year,xlab="year",ylab="OVERALL COUNT(Users)",main="OVERALL COUNT(Users)",col =c("red","royalblue2", "palevioletred1"))
boxplot(combineddata$casual~combineddata$year, xlab="year",ylab="CASUAL(Users)",main="CASUAL(Users)",col=c("palegreen2","plum3", "sienna2","snow3"))
boxplot(combineddata$registered~combineddata$year,xlab="year", ylab="REGISTERED(Users)",main="REGISTERED(Users)",col=c("red","sienna","royalblue2", "palevioletred1"))

Temperature, Windspeed and Humidity: These are continuous variables so we can look at the correlation factor. From the plot we can infer the following:
1. Variable temp      - positiv correlation with dependent variables (casual is more compare to registered)
2. Variable atemp     - high correlation with temp.
3. Windspeed          - low correlation when compared to temp and humidity

Model Building:

Now we are splitting the train and test data from the Combineddata which we used for exploration

train_modified <- combineddata[1:10886,]
test_modified <- combineddata[10887:17379,]

For modelling purpose, lets do a partition of the "train_modified"

set.seed(1143)
Train_Partition <- createDataPartition(1:nrow(train_modified),p=0.5,list=F,times = 1)
Train_Partition_Training <- train_modified[Train_Partition, ]
Train_Partition_Validation <- train_modified[-Train_Partition, ]

Simple Linear Regression

Model 1:

Since variable temp has high positive correlation with dependent variables (casual is more compare to registered). We are building a linear model using temp & count variables

mod1<- lm(Train_Partition_Training$count ~ hour+temp, data=Train_Partition_Training)
summary(mod1)

## 
## Call:
## lm(formula = Train_Partition_Training$count ~ hour + temp, data = Train_Partition_Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -309.15 -101.21  -34.98   59.60  694.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -83.3166     6.5211  -12.78   <2e-16 ***
## hour          9.0497     0.3046   29.71   <2e-16 ***
## temp          7.9823     0.2698   29.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154 on 5441 degrees of freedom
## Multiple R-squared:  0.2737, Adjusted R-squared:  0.2735 
## F-statistic:  1025 on 2 and 5441 DF,  p-value: < 2.2e-16

predict_mod1_TrainValidation<-round(predict(mod1,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod1_TrainValidation)

## [1] 1.577749

Now lets run model using the test

predict_test_1 <- round(predict(mod1, newdata = test_modified))
head(predict_test_1)

## 10887 10888 10889 10890 10891 10892 
##    11    20    29    38    47    50

bikeevaluate(Train_Partition_Validation, predict_test_1)

## [1] 1.78042

Transformation :

Lets do some transformation based on the distribution of the data

descdist(Train_Partition_Training$count, discrete = FALSE, obs.col = "darkblue", boot.col = "yellow",boot=500)

## summary statistics
## ------
## min:  2   max:  949 
## median:  141 
## mean:  190.9791 
## estimated sd:  180.7069 
## estimated skewness:  1.231578 
## estimated kurtosis:  4.239825

Based on the Cullen and Frey Graph, Model 2, Model 3, Model 4 are built based on the distribution of the data

Model 2 : Gamma Distribution

mod2 <- fitdist(Train_Partition_Training$count, "gamma", method = "mge")
plot(mod2)

predict_mod2_TrainValidation<-round(predict(mod2,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod2_TrainValidation)

Now lets run model using the test

predict_test_2 <- round(predict(mod2, newdata = test_modified))

bikeevaluate(Train_Partition_Validation, predict_test_2)

2.112379

Model 3: Normal Distribution

mod3<-fitdist(Train_Partition_Training$count,"norm")
plot(mod3)

MODEL 4: Exponential Distribution

Model 5: RPART

form <- formula(Train_Partition_Training$logcount ~ season + holiday + workingday + weather + temp + atemp +
                  humidity + windspeed + hour + day +year)
mod5 <- rpart(form, data = Train_Partition_Training)
print(mod5)

## n= 5444 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 5444 11014.92000 4.576449  
##    2) hour< 7.5 1581  2042.19700 2.877024  
##      4) hour< 6.5 1355  1499.77600 2.703309  
##        8) hour>=2.5 923   783.27010 2.354324 *
##        9) hour< 2.5 432   363.91620 3.448940  
##         18) workingday=1 287   166.78810 3.034163 *
##         19) workingday=0 145    50.02320 4.269913 *
##      5) hour>=6.5 226   256.37390 3.918546  
##       10) workingday=0 73    60.45990 2.738760 *
##       11) workingday=1 153    45.82587 4.481450 *
##    3) hour>=7.5 3863  2538.01100 5.271968  
##      6) atemp< 16.2875 814   579.73750 4.625301 *
##      7) atemp>=16.2875 3049  1526.99900 5.444611  
##       14) year=2011 1463   745.72850 5.183103 *
##       15) year=2012 1586   588.93080 5.685838 *

predictValidation5 <- round(predict(mod5, newdata = Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predictValidation5)

## [1] 3.236562

predict_test_5 <- round(predict(mod5, newdata = test_modified))
head(predict_test_5)

## 10887 10888 10889 10890 10891 10892 
##     3     3     2     2     2     2

Evaluate model

bikeevaluate(Train_Partition_Validation, predict_test_5)

## [1] 3.470934

Full and Null lm Models

Model 6: Full to Null

mod6_lm.full <- lm(Train_Partition_Training$count ~season + holiday + workingday + weather + temp + atemp +humidity + windspeed + hour + day +year, data = Train_Partition_Training)

drop1() shows the partial F-test p-value for each variable. The same p-value can be obtained by using anova(model.full, model.one.fewer.var) to compare two models. Remove the least significant variable using update() and do drop1() again until all remaining variables is significant at a pre-specified cutoff (0.1 here).

drop1(mod6_lm.full, test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + holiday + workingday + 
##     weather + temp + atemp + humidity + windspeed + hour + day + 
##     year
##            Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                  108173091 53921                       
## season      3   3974259 112147350 54112  66.4133 < 2.2e-16 ***
## holiday     1     47933 108221024 53922   2.4030 0.1211607    
## workingday  1      3591 108176682 53919   0.1800 0.6713612    
## weather     3    330016 108503107 53932   5.5149 0.0008866 ***
## temp        1     63209 108236300 53922   3.1688 0.0751117 .  
## atemp       1    189974 108363065 53929   9.5239 0.0020384 ** 
## humidity    1   4858977 113032068 54158 243.5932 < 2.2e-16 ***
## windspeed   1       117 108173208 53919   0.0059 0.9390119    
## hour        1  12920212 121093304 54533 647.7240 < 2.2e-16 ***
## day         6    145089 108318180 53916   1.2123 0.2964689    
## year        1   8333481 116506572 54323 417.7792 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Holiday is the least significant based on “F” Statistics

drop1(update(mod6_lm.full, ~ . -holiday), test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + workingday + weather + 
##     temp + atemp + humidity + windspeed + hour + day + year
##            Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                  108221024 53922                       
## season      3   3948456 112169480 54111  65.9651 < 2.2e-16 ***
## workingday  1        36 108221060 53920   0.0018 0.9661072    
## weather     3    326729 108547754 53932   5.4585 0.0009601 ***
## temp        1     59876 108280900 53923   3.0010 0.0832725 .  
## atemp       1    196780 108417804 53929   9.8625 0.0016959 ** 
## humidity    1   4854840 113075864 54158 243.3229 < 2.2e-16 ***
## windspeed   1       118 108221142 53920   0.0059 0.9387904    
## hour        1  12927738 121148763 54534 647.9337 < 2.2e-16 ***
## day         6    147711 108368735 53917   1.2339 0.2853578    
## year        1   8321062 116542086 54323 417.0487 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now Working day is the least significant based on “F” Statistics

drop1(update(mod6_lm.full, ~ . -holiday -workingday), test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + weather + temp + atemp + 
##     humidity + windspeed + hour + day + year
##           Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                 108221060 53920                       
## season     3   3948429 112169489 54109  65.9768 < 2.2e-16 ***
## weather    3    327259 108548320 53930   5.4684 0.0009468 ***
## temp       1     59885 108280946 53921   3.0020 0.0832192 .  
## atemp      1    197311 108418371 53927   9.8910 0.0016699 ** 
## humidity   1   4856883 113077944 54157 243.4701 < 2.2e-16 ***
## windspeed  1       118 108221178 53918   0.0059 0.9387837    
## hour       1  12929914 121150974 54532 648.1620 < 2.2e-16 ***
## day        6    149252 108370312 53915   1.2470 0.2787738    
## year       1   8321382 116542442 54321 417.1415 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now temp & day are least significant

drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp), test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity + 
##     windspeed + hour + day + year
##           Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                 108280946 53921                       
## season     3   3888560 112169506 54107  64.9524 < 2.2e-16 ***
## weather    3    319000 108599945 53931   5.3284  0.001154 ** 
## atemp      1   7932297 116213243 54303 397.4905 < 2.2e-16 ***
## humidity   1   5031880 113312826 54166 252.1494 < 2.2e-16 ***
## windspeed  1      4697 108285643 53919   0.2354  0.627573    
## hour       1  12980638 121261584 54535 650.4648 < 2.2e-16 ***
## day        6    152316 108433261 53916   1.2721  0.266479    
## year       1   8394922 116675868 54325 420.6728 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp -day), test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity + 
##     windspeed + hour + year
##           Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                 108433261 53916                       
## season     3   3876825 112310086 54101  64.7369 < 2.2e-16 ***
## weather    3    327935 108761196 53927   5.4760 0.0009367 ***
## atemp      1   7986334 116419595 54301 400.0780 < 2.2e-16 ***
## humidity   1   5019449 113452710 54161 251.4509 < 2.2e-16 ***
## windspeed  1      4108 108437370 53914   0.2058 0.6500953    
## hour       1  12993830 121427091 54530 650.9302 < 2.2e-16 ***
## year       1   8388578 116821840 54320 420.2286 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(update(mod6_lm.full, ~ . -holiday -workingday -temp -day -windspeed), test = "F")

## Single term deletions
## 
## Model:
## Train_Partition_Training$count ~ season + weather + atemp + humidity + 
##     hour + year
##          Df Sum of Sq       RSS   AIC  F value    Pr(>F)    
## <none>                108437370 53914                       
## season    3   3875545 112312915 54100  64.7250 < 2.2e-16 ***
## weather   3    323890 108761260 53925   5.4093  0.001029 ** 
## atemp     1   7983483 116420853 54299 399.9937 < 2.2e-16 ***
## humidity  1   5639583 114076952 54188 282.5581 < 2.2e-16 ***
## hour      1  13067031 121504401 54532 654.6929 < 2.2e-16 ***
## year      1   8388500 116825870 54318 420.2861 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(update(mod6_lm.full, ~ .  -holiday -workingday -temp -day -windspeed))

## 
## Call:
## lm(formula = Train_Partition_Training$count ~ season + weather + 
##     atemp + humidity + hour + year, data = Train_Partition_Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.13  -94.34  -29.36   58.26  616.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.7140    10.8413  -1.726  0.08437 .  
## season2      20.6311     6.8747   3.001  0.00270 ** 
## season3       6.3794     8.4991   0.751  0.45293    
## season4      69.4004     5.8196  11.925  < 2e-16 ***
## weather2      7.7399     4.6952   1.648  0.09931 .  
## weather3    -23.7046     7.6702  -3.090  0.00201 ** 
## weather4     47.4441   141.4452   0.335  0.73732    
## atemp         7.3753     0.3688  20.000  < 2e-16 ***
## humidity     -1.9917     0.1185 -16.809  < 2e-16 ***
## hour          7.6002     0.2970  25.587  < 2e-16 ***
## year2012     79.3302     3.8696  20.501  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.3 on 5433 degrees of freedom
## Multiple R-squared:  0.3899, Adjusted R-squared:  0.3888 
## F-statistic: 347.2 on 10 and 5433 DF,  p-value: < 2.2e-16

predictValidation6 <- round(predict(mod6_lm.full, newdata = Train_Partition_Validation))
predictValidation6<- round(predictValidation6)
predictValidation6[predictValidation6<0] <- 0
bikeevaluate(Train_Partition_Training, predictValidation6)

## [1] 1.690499

predict_test_6 <- round(predict(mod6_lm.full, newdata = test_modified))
predict_test_6<- round(predict_test_6)
predict_test_6[predict_test_6<0] <- 0

Evaluate model

bikeevaluate(Train_Partition_Validation, predict_test_6)

## [1] 1.985764

Model 7 : GBM (Gradient Boosting Algorithm)

mod7<- lm(Train_Partition_Training$count~season + weather + atemp + humidity + windspeed + hour +year,data=Train_Partition_Training,method = 'gbm' )
print(mod7)

## 
## Call:
## lm(formula = Train_Partition_Training$count ~ season + weather + 
##     atemp + humidity + windspeed + hour + year, data = Train_Partition_Training, 
##     method = "gbm")
## 
## Coefficients:
## (Intercept)      season2      season3      season4     weather2  
##    -21.3108      20.6312       6.5023      69.5699       7.5925  
##    weather3     weather4        atemp     humidity    windspeed  
##    -24.2335      47.8730       7.3812      -1.9743       0.1139  
##        hour     year2012  
##      7.5922      79.4075

predict_mod7_TrainValidation<-round(predict(mod7,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod7_TrainValidation)

## [1] 1.689909

## [1] 1.989862

Stepwise Regression:

Model 8: Automated F-test-based backward selection

using rms::fastbw()

library(rms)
mod8 <- ols(Train_Partition_Training$logcount~season + holiday + workingday + weather + temp + atemp +
                  humidity + windspeed + hour + day +year, data = Train_Partition_Training)
fastbw(mod8, rule = "p", sls = 0.1)

## 
##  Deleted   Chi-Sq d.f. P      Residual d.f. P      AIC   R2   
##  day       7.66   6    0.2641  7.66    6    0.2641 -4.34 0.499
##  windspeed 0.99   1    0.3205  8.65    7    0.2790 -5.35 0.499
##  temp      2.95   1    0.0856 11.60    8    0.1699 -4.40 0.499
## 
## Approximate Estimates after Deleting Factors
## 
##                  Coef      S.E.   Wald Z         P
## Intercept     2.67818 0.0801580  33.4112 0.000e+00
## season=2      0.17785 0.0490785   3.6238 2.903e-04
## season=3      0.10710 0.0606979   1.7645 7.765e-02
## season=4      0.59016 0.0415566  14.2012 0.000e+00
## holiday=1    -0.15059 0.0838447  -1.7960 7.249e-02
## workingday=1 -0.09642 0.0304265  -3.1691 1.529e-03
## weather=2     0.14009 0.0335065   4.1809 2.904e-05
## weather=3    -0.17790 0.0547663  -3.2484 1.161e-03
## weather=4     0.93046 1.0093751   0.9218 3.566e-01
## atemp         0.05077 0.0026339  19.2745 0.000e+00
## humidity     -0.01462 0.0008455 -17.2900 0.000e+00
## hour          0.09809 0.0021199  46.2722 0.000e+00
## year=2012     0.41028 0.0276147  14.8574 0.000e+00
## 
## Factors in Final Model
## 
## [1] season     holiday    workingday weather    atemp      humidity  
## [7] hour       year

predict_mod8_TrainValidation<-round(predict(mod8,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod8_TrainValidation)

## [1] 3.204593

## [1] 3.448292

Model 9: AIC-based backward selection

mod9.aic.backward <- step(mod6_lm.full, direction = "backward", trace = 1)

## Start:  AIC=53921.13
## Train_Partition_Training$count ~ season + holiday + workingday + 
##     weather + temp + atemp + humidity + windspeed + hour + day + 
##     year
## 
##              Df Sum of Sq       RSS   AIC
## - day         6    145089 108318180 53916
## - windspeed   1       117 108173208 53919
## - workingday  1      3591 108176682 53919
## <none>                    108173091 53921
## - holiday     1     47933 108221024 53922
## - temp        1     63209 108236300 53922
## - atemp       1    189974 108363065 53929
## - weather     3    330016 108503107 53932
## - season      3   3974259 112147350 54112
## - humidity    1   4858977 113032068 54158
## - year        1   8333481 116506572 54323
## - hour        1  12920212 121093304 54533
## 
## Step:  AIC=53916.42
## Train_Partition_Training$count ~ season + holiday + workingday + 
##     weather + temp + atemp + humidity + windspeed + hour + year
## 
##              Df Sum of Sq       RSS   AIC
## - windspeed   1        27 108318207 53914
## - workingday  1      9194 108327374 53915
## <none>                    108318180 53916
## - holiday     1     50555 108368735 53917
## - temp        1     67106 108385286 53918
## - atemp       1    185941 108504121 53924
## - weather     3    340170 108658350 53927
## - season      3   3963398 112281578 54106
## - humidity    1   4852417 113170597 54153
## - year        1   8323367 116641547 54317
## - hour        1  12921011 121239191 54528
## 
## Step:  AIC=53914.42
## Train_Partition_Training$count ~ season + holiday + workingday + 
##     weather + temp + atemp + humidity + hour + year
## 
##              Df Sum of Sq       RSS   AIC
## - workingday  1      9188 108327395 53913
## <none>                    108318207 53914
## - holiday     1     50551 108368758 53915
## - temp        1     71506 108389712 53916
## - atemp       1    195390 108513597 53922
## - weather     3    342395 108660601 53926
## - season      3   3966312 112284518 54104
## - humidity    1   5251410 113569616 54170
## - year        1   8344859 116663065 54316
## - hour        1  12957438 121275645 54528
## 
## Step:  AIC=53912.89
## Train_Partition_Training$count ~ season + holiday + weather + 
##     temp + atemp + humidity + hour + year
## 
##            Df Sum of Sq       RSS   AIC
## <none>                  108327395 53913
## - holiday   1     42938 108370333 53913
## - temp      1     69419 108396813 53914
## - atemp     1    198426 108525820 53921
## - weather   3    345469 108672863 53924
## - season    3   3963203 112290597 54102
## - humidity  1   5251343 113578738 54169
## - year      1   8346225 116673619 54315
## - hour      1  12974176 121301571 54527

summary(mod9.aic.backward)

## 
## Call:
## lm(formula = Train_Partition_Training$count ~ season + holiday + 
##     weather + temp + atemp + humidity + hour + year, data = Train_Partition_Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.27  -94.29  -29.26   58.53  614.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.9739    10.8465  -1.657  0.09755 .  
## season2      18.8169     6.9372   2.712  0.00670 ** 
## season3       2.1127     8.8543   0.239  0.81142    
## season4      69.2794     5.8256  11.892  < 2e-16 ***
## holiday1    -16.6700    11.3617  -1.467  0.14238    
## weather2      7.3267     4.7015   1.558  0.11920    
## weather3    -25.2859     7.7059  -3.281  0.00104 ** 
## weather4     48.8783   141.4036   0.346  0.72961    
## temp          3.1289     1.6772   1.866  0.06216 .  
## atemp         4.6812     1.4842   3.154  0.00162 ** 
## humidity     -1.9528     0.1204 -16.226  < 2e-16 ***
## hour          7.5790     0.2972  25.504  < 2e-16 ***
## year2012     79.1736     3.8705  20.456  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.2 on 5431 degrees of freedom
## Multiple R-squared:  0.3905, Adjusted R-squared:  0.3892 
## F-statistic:   290 on 12 and 5431 DF,  p-value: < 2.2e-16

predict_mod9_TrainValidation<-round(predict(mod9.aic.backward,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod9_TrainValidation)

## [1] 1.689367

## [1] 1.989146

Model 10: AIC-based forward selection

mod10.aic.forward <- step(mod6_lm.full, direction = "forward", trace = 1, 
                          scope = ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + day +year, data = Train_Partition_Training)

## Start:  AIC=53921.13
## Train_Partition_Training$count ~ season + holiday + workingday + 
##     weather + temp + atemp + humidity + windspeed + hour + day + 
##     year

summary(mod10.aic.forward)

## 
## Call:
## lm(formula = Train_Partition_Training$count ~ season + holiday + 
##     workingday + weather + temp + atemp + humidity + windspeed + 
##     hour + day + year, data = Train_Partition_Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -327.75  -93.20  -29.45   59.55  617.07 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.01312   13.44739  -0.968  0.33323    
## season2       18.51625    6.97563   2.654  0.00797 ** 
## season3        1.86676    8.94392   0.209  0.83468    
## season4       69.54437    5.87388  11.840  < 2e-16 ***
## holiday1     -18.29361   11.80109  -1.550  0.12116    
## workingday1   -1.86474    4.39481  -0.424  0.67136    
## weather2       7.49136    4.71242   1.590  0.11196    
## weather3     -24.56502    7.81357  -3.144  0.00168 ** 
## weather4      53.12738  141.48416   0.376  0.70730    
## temp           3.08366    1.73228   1.780  0.07511 .  
## atemp          4.73044    1.53283   3.086  0.00204 ** 
## humidity      -1.96436    0.12586 -15.607  < 2e-16 ***
## windspeed      0.01983    0.25913   0.077  0.93901    
## hour           7.57864    0.29778  25.450  < 2e-16 ***
## dayMonday      7.54843    7.34758   1.027  0.30431    
## daySaturday   -2.33223    7.25949  -0.321  0.74802    
## daySunday     -5.89870    7.09447  -0.831  0.40576    
## dayThursday   -9.89744    7.06004  -1.402  0.16100    
## dayTuesday    -6.70294    6.99810  -0.958  0.33820    
## dayWednesday  -6.01473    7.12451  -0.844  0.39858    
## year2012      79.22746    3.87617  20.440  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.2 on 5423 degrees of freedom
## Multiple R-squared:  0.3914, Adjusted R-squared:  0.3892 
## F-statistic: 174.4 on 20 and 5423 DF,  p-value: < 2.2e-16

predict_mod10_TrainValidation<-round(predict(mod10.aic.forward,newdata=Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predict_mod10_TrainValidation)

## [1] 1.690499

## [1] 1.989862

Model 11: ANOTHER RPART

form11 <- formula(Train_Partition_Training$logcount ~ season + holiday + workingday + weather + temp + atemp +
                  humidity + windspeed)
mod11 <- rpart(form11, data = Train_Partition_Training)
print(mod11)

## n= 5444 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 5444 11014.9200 4.576449  
##    2) atemp< 29.925 3851  7828.4480 4.277787  
##      4) temp< 11.89 897  1835.5100 3.613460 *
##      5) temp>=11.89 2954  5476.8550 4.479514  
##       10) humidity>=64.5 1662  3324.4740 4.125477 *
##       11) humidity< 64.5 1292  1676.0860 4.934940 *
##    3) atemp>=29.925 1593  2012.5640 5.298449  
##      6) humidity>=67.5 403   760.9172 4.580619 *
##      7) humidity< 67.5 1190   973.6652 5.541545 *

predictValidation11 <- round(predict(mod11, newdata = Train_Partition_Validation))
bikeevaluate(Train_Partition_Training, predictValidation11)

## [1] 3.170218

predict_test_11 <- round(predict(mod11, newdata = test_modified))
head(predict_test_11)

## 10887 10888 10889 10890 10891 10892 
##     4     4     4     4     4     4

Evaluate model

bikeevaluate(Train_Partition_Validation, predict_test_11)

## [1] 3.405507

Summary:

Model	Metric(RMSE)
RPART	3.4709
RPART	3.4055
Stepwise Regression	3.4482
Transformation	2.1124
GBM	1.9898
AIC-based backward	1.9891
Full to Null lm	1.6904
LM	1.7804

Conclusion:

Using different combinations of models and feature engineering, we found that hour has the most predictive feature for our problem, whereas weather variables play a much smaller, but still noticeable effect in contributing to accurate predictions.
Variable reduction models, Lm Model and GBM models best fits the data

Reference Papers:

Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg doi:10.1007/s13748-013-0040-3.
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Other resources from internet: http://ww.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture07b_PoissReg.pdf https://www.youtube.com/watch?v=mgIyPRhrVTY https://onlinecourses.science.psu.edu/stat504/node/169 https://rpubs.com/tybyers/25177 http://www.ats.ucla.edu/stat/r/dae/nbreg.htm

EXPLORATORY ANALYSIS ON BIKE SHARE FORECASTING

Jenifer PK & Srinath KS

15 October 2015

Dataset Overview:

Background of the Dataset:

Synopsis:

Problem Description:

Understanding the Data Set:

Importing Data set & Data Exploration:

1. Import Train and Test Data Set:

2. Combine test and train data set

3. Structure of the dataset

4. Missing Values in the Dataset

5. Convert discrete variables into factor (season, weather, holiday, workingday)

Hourly Trend

Daywise Trend:

Time Trend:

Model Building:

Simple Linear Regression

Model 1:

Transformation :

Model 2 : Gamma Distribution

Model 3: Normal Distribution

MODEL 4: Exponential Distribution

Model 5: RPART

Full and Null lm Models

Model 6: Full to Null

Model 7 : GBM (Gradient Boosting Algorithm)

Stepwise Regression:

Model 8: Automated F-test-based backward selection

Model 9: AIC-based backward selection

Model 10: AIC-based forward selection

Model 11: ANOTHER RPART

Summary:

Conclusion:

Reference Papers: