DATASET : BIKE SHARE FORECASTING
Jenifer PK & Srinath KS
4th Oct 2015
About Dataset:
The data set used in this analysis is a two-year usage log of bike rentals from the bike sharing system in Washington, D.C., USA, known as Capital Bikeshare (CBS). It is relevant for our analysis because it was compiled over a two-year period, which gives us ample data with which to build a model. The data is available in two formats, daily and hourly; this analysis looks only at the hourly data set.
The data is saved in a comma-separated values (CSV) file with 17 attributes. All the data has already been converted to numeric values.
Synopsis and Problem Description:
Bike sharing programs are popular around the world. In May 2014, the machine learning competition website kaggle.com opened the competition “Forecast use of a city bikeshare system”. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. This analysis attempts to build a machine learning model based on linear regression to predict rental demand by the hour. The training set (http://www.kaggle.com/c/bike-sharing-demand/data) includes hourly rental data for the first 19 days of each month. The goal is to predict demand for the remaining days of each month.

**********

## Files
The following is a list of the files in the dataset, with record counts where available.
- sample.csv
- train.csv : Records: 10886
- test.csv : Records: 6493
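The day-of-month split described above can be sketched as follows. This is a minimal illustration on made-up hourly timestamps for a single month, not the real files; the variable names (`stamps`, `is_train`) are hypothetical.

```r
# Hypothetical hourly timestamps spanning one 31-day month (not the real data).
stamps <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
              as.POSIXct("2011-01-31 23:00:00", tz = "UTC"),
              by = "hour")

# Day of the month (1..31) for each timestamp.
day <- as.POSIXlt(stamps)$mday

# Days 1-19 play the role of the training rows; later days are to be predicted.
is_train <- day <= 19

n_train <- sum(is_train)    # hours falling on days 1-19
n_test  <- sum(!is_train)   # hours falling on days 20-31
```

In the real files some hours may be missing, so the actual record counts need not be exact multiples of 24.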
PACKAGES TO BE LOADED
```r
library(ggplot2)
library(caret)
library(MASS)
library(plyr)
library(splines)
library(fitdistrplus)
library(logspline)
library(rms)
library(rpart)
library(faraway)
library(lattice)
library(randomForest)
library(lubridate)
library(Cubist)
library(mlbench)
library(gbm)
```
|  |  |  |  |
| --------------------- | ---------------------- | --------------------- | ---------------------- |
| library(lattice) | library(Hmisc) | library(faraway) | library(randomForest) |
| library(Formula) | library(caret) | library(ggplot2) | library(rms) |
| library(grid) | library(plyr) | library(caTools) | library(logspline) |
| library(survival) | library(splines) | library(MASS) | library(fitdistrplus) |
# Back up of Train and Test Data
FALSE
FALSE
‘data.frame’: 10886 obs. of 12 variables:
$ hour : int 0 1 2 3 4 5 6 7 8 9 …
$ season : int 1 1 1 1 1 1 1 1 1 1 …
$ holiday : int 0 0 0 0 0 0 0 0 0 0 …
$ workingday: int 0 0 0 0 0 0 0 0 0 0 …
$ weather : int 1 1 1 1 1 2 1 1 1 1 …
$ temp : num 9.84 9.02 9.02 9.84 9.84 …
$ atemp : num 14.4 13.6 13.6 14.4 14.4 …
$ humidity : int 81 80 80 75 75 75 80 86 75 76 …
$ windspeed : num 0 0 0 0 0 …
$ casual : int 3 8 5 3 0 0 2 1 1 8 …
$ registered: int 13 32 27 10 1 1 0 2 7 6 …
$ count : int 16 40 32 13 1 1 2 3 8 14 …
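The structure above contains an `hour` column. How it was produced is not shown, so the following is only a sketch, assuming the raw file carries a timestamp column (called `datetime` here, an assumption) from which the hour is extracted. The three rows are illustrative: the counts match the first values listed above, while the dates are placeholders.

```r
# Illustrative rows only: `datetime` and its values are assumptions.
raw <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2011-01-01 01:00:00", "2011-01-01 02:00:00"),
  count    = c(16L, 40L, 32L),
  stringsAsFactors = FALSE
)

# Parse the timestamp, then pull out the hour of day as an integer (0-23).
ts <- as.POSIXct(raw$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
raw$hour <- as.integer(format(ts, "%H"))

# Drop the raw timestamp so that only numeric columns remain.
raw$datetime <- NULL
str(raw)
```

With `lubridate` loaded, `hour(ts)` should give the same values; base R is used here only to keep the sketch free of dependencies.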
# PLOTS
## BASED ON WEATHER
<img src="Figs/Plots-1.png" title="" alt="" width="1152" />
## BASED ON SEASON
<img src="Figs/unnamed-chunk-1-1.png" title="" alt="" width="1152" />
## BASED ON HOLIDAY
<img src="Figs/unnamed-chunk-2-1.png" title="" alt="" width="1152" />
## BASED ON WORKING DAY
<img src="Figs/unnamed-chunk-3-1.png" title="" alt="" width="1152" />
## QPLOTS based on Temperature & FeelsLike
<img src="Figs/gplot1-1.png" title="" alt="" width="1152" /><img src="Figs/gplot1-2.png" title="" alt="" width="1152" />
## QPLOTS based on Windspeed & Humidity
<img src="Figs/gplot2-1.png" title="" alt="" width="1152" /><img src="Figs/gplot2-2.png" title="" alt="" width="1152" />
## Pairs
<img src="Figs/pairs-1.png" title="" alt="" width="1152" />
# Modelling
## Model 1: Simple Linear Regression
Call: `lm(formula = count ~ ., data = Training)`

Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -3.060e-11 | -1.950e-14 | 9.100e-15 | 3.760e-14 | 1.795e-11 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -1.471e-12 | 4.095e-14 | -3.591e+01 | <2e-16 | \*\*\* |
| hour | 1.005e-14 | 1.065e-15 | 9.428e+00 | <2e-16 | \*\*\* |
| season | 1.822e-13 | 6.367e-15 | 2.861e+01 | <2e-16 | \*\*\* |
| holiday | -9.711e-14 | 4.005e-14 | -2.425e+00 | 0.0153 | \* |
| workingday | -2.095e-13 | 1.687e-14 | -1.242e+01 | <2e-16 | \*\*\* |
| weather | -7.742e-14 | 1.153e-14 | -6.716e+00 | 2e-11 | \*\*\* |
| temp | 4.586e-14 | 4.892e-15 | 9.374e+00 | <2e-16 | \*\*\* |
| atemp | -4.935e-15 | 4.501e-15 | -1.096e+00 | 0.2730 | |
| humidity | -2.447e-16 | 4.414e-16 | -5.540e-01 | 0.5793 | |
| windspeed | 5.405e-16 | 8.833e-16 | 6.120e-01 | 0.5406 | |
| casual | 1.000e+00 | 1.946e-16 | 5.137e+15 | <2e-16 | \*\*\* |
| registered | 1.000e+00 | 5.587e-17 | 1.790e+16 | <2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.745e-13 on 7610 degrees of freedom

Multiple R-squared: 1, Adjusted R-squared: 1

F-statistic: 6.971e+31 on 11 and 7610 DF, p-value: < 2.2e-16
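One reading of this output is that the fit is exact because `count` is the sum of `casual` and `registered`: both coefficients are 1, all others are numerically zero, and R-squared is 1. The sketch below checks that identity on the ten values printed in the data structure earlier in this report; it covers those ten rows only and does not verify the full training set.

```r
# First ten values of casual, registered and count, copied from the str() output.
first10 <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0, 2, 1, 1, 8),
  registered = c(13, 32, 27, 10, 1, 1, 0, 2, 7, 6),
  count      = c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14)
)

# Does count equal casual + registered on every listed row?
identity_holds <- all(first10$casual + first10$registered == first10$count)

# Regressing count on its two components recovers slopes of 1 and an
# intercept of (numerically) 0, mirroring the summary above.
fit <- lm(count ~ casual + registered, data = first10)
coefs <- round(coef(fit), 6)
print(identity_holds)
print(coefs)
```

If the identity holds throughout, `casual` and `registered` would not be available when predicting future demand, so a formula such as `count ~ . - casual - registered` may be a more informative baseline. That formula is a suggestion, not something the analysis above used.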
* AIC: $-4.0801288 \times 10^{5}$
* RMSE: $5.949524 \times 10^{-13}$
* BIC: $-4.0792267 \times 10^{5}$
* RD: $1.7053026 \times 10^{-13}$, $3.623768 \times 10^{-13}$, $7.1054274 \times 10^{-13}$, $1.0231815 \times 10^{-12}$
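The AIC, BIC and RMSE figures above can be computed along the following lines. This is a sketch under stated assumptions: it uses the built-in `mtcars` data rather than the bike data, a 70/30 random split, and an RMSE measured on the held-out rows. None of these choices is confirmed by the report itself.

```r
set.seed(1)  # make the random split reproducible

# Assumed 70/30 split into training and held-out rows.
idx      <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
training <- mtcars[idx, ]
testing  <- mtcars[-idx, ]

# Ordinary least squares fit on the training rows.
fit <- lm(mpg ~ wt + hp, data = training)

# Information criteria are computed from the fitted model itself.
aic <- AIC(fit)
bic <- BIC(fit)

# RMSE on the held-out rows: root of the mean squared prediction error.
pred <- predict(fit, newdata = testing)
rmse <- sqrt(mean((testing$mpg - pred)^2))
```

AIC and BIC differ only in the penalty per estimated parameter (2 versus the log of the number of observations), which is why the two values reported above are so close to each other.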
## Model 2: Linear Regression
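Single term deletions

Model: `count ~ hour + season + holiday + workingday + weather + temp + atemp + humidity + windspeed + casual + registered`

| Term | Df | Sum of Sq | RSS | AIC | F value | Pr(>F) | Signif. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `<none>` | | | 0 | -429645 | | | |
| hour | 1 | 0 | 0 | -427023 | 3.1272e+03 | <2e-16 | \*\*\* |
| season | 1 | 0 | 0 | -432662 | -2.4860e+03 | 1 | |
| holiday | 1 | 0 | 0 | -428780 | 9.1654e+02 | <2e-16 | \*\*\* |
| workingday | 1 | 0 | 0 | -429499 | 1.4951e+02 | <2e-16 | \*\*\* |
| weather | 1 | 0 | 0 | -430147 | -4.8257e+02 | 1 | |
| temp | 1 | 0 | 0 | -427389 | 2.6237e+03 | <2e-16 | \*\*\* |
| atemp | 1 | 0 | 0 | -429771 | -1.2285e+02 | 1 | |
| humidity | 1 | 0 | 0 | -423895 | 8.5758e+03 | <2e-16 | \*\*\* |
| windspeed | 1 | 0 | 0 | -430504 | -8.0908e+02 | 1 | |
| casual | 1 | 8709955 | 8709955 | 53690 | 2.6393e+31 | <2e-16 | \*\*\* |
| registered | 1 | 105712469 | 105712469 | 72716 | 3.2033e+32 | <2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1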
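Single term deletions

Model: `count ~ hour + holiday + workingday + weather + temp + atemp + humidity + windspeed + casual + registered`

| Term | Df | Sum of Sq | RSS | AIC | F value | Pr(>F) | Signif. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `<none>` | | | 0 | -432662 | | | |
| hour | 1 | 0 | 0 | -430597 | 2.3700e+03 | <2e-16 | \*\*\* |
| holiday | 1 | 0 | 0 | -431144 | 1.6787e+03 | <2e-16 | \*\*\* |
| workingday | 1 | 0 | 0 | -433745 | -1.0065e+03 | 1 | |
| weather | 1 | 0 | 0 | -432305 | 3.6652e+02 | <2e-16 | \*\*\* |
| temp | 1 | 0 | 0 | -424412 | 1.4858e+04 | <2e-16 | \*\*\* |
| atemp | 1 | 0 | 0 | -430887 | 1.9976e+03 | <2e-16 | \*\*\* |
| humidity | 1 | 0 | 0 | -428905 | 4.8510e+03 | <2e-16 | \*\*\* |
| windspeed | 1 | 0 | 0 | -431058 | 1.7843e+03 | <2e-16 | \*\*\* |
| casual | 1 | 8713096 | 8713096 | 53691 | 3.9215e+31 | <2e-16 | \*\*\* |
| registered | 1 | 108155057 | 108155057 | 72888 | 4.8677e+32 | <2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1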
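Single term deletions

Model: `count ~ hour + holiday + weather + temp + atemp + humidity + windspeed + casual + registered`

| Term | Df | Sum of Sq | RSS | AIC | F value | Pr(>F) | Signif. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `<none>` | | | 0 | -433745 | | | |
| hour | 1 | 0 | 0 | -429414 | 5.8277e+03 | <2e-16 | \*\*\* |
| holiday | 1 | 0 | 0 | -432327 | 1.5584e+03 | <2e-16 | \*\*\* |
| weather | 1 | 0 | 0 | -435982 | -1.9351e+03 | 1 | |
| temp | 1 | 0 | 0 | -435640 | -1.6744e+03 | 1 | |
| atemp | 1 | 0 | 0 | -432079 | 1.8614e+03 | <2e-16 | \*\*\* |
| humidity | 1 | 0 | 0 | -431704 | 2.3393e+03 | <2e-16 | \*\*\* |
| windspeed | 1 | 0 | 0 | -423590 | 2.1244e+04 | <2e-16 | \*\*\* |
| casual | 1 | 11336662 | 11336662 | 55695 | 5.8807e+31 | <2e-16 | \*\*\* |
| registered | 1 | 118669596 | 118669596 | 73594 | 6.1558e+32 | <2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1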
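Single term deletions

Model: `count ~ hour + holiday + weather + atemp + humidity + windspeed + casual + registered`

| Term | Df | Sum of Sq | RSS | AIC | F value | Pr(>F) | Signif. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `<none>` | | | 0 | -435640 | | | |
| hour | 1 | 0 | 0 | -419751 | 5.3622e+04 | < 2.2e-16 | \*\*\* |
| holiday | 1 | 0 | 0 | -429362 | 9.7409e+03 | < 2.2e-16 | \*\*\* |
| weather | 1 | 0 | 0 | -432249 | 4.2694e+03 | < 2.2e-16 | \*\*\* |
| atemp | 1 | 0 | 0 | -426806 | 1.6656e+04 | < 2.2e-16 | \*\*\* |
| humidity | 1 | 0 | 0 | -428820 | 1.1020e+04 | < 2.2e-16 | \*\*\* |
| windspeed | 1 | 0 | 0 | -424323 | 2.5999e+04 | < 2.2e-16 | \*\*\* |
| casual | 1 | 11348042 | 11348042 | 55701 | 7.5476e+31 | < 2.2e-16 | \*\*\* |
| registered | 1 | 118669602 | 118669602 | 73592 | 7.8927e+32 | < 2.2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1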
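The sequence of single-term-deletion tables above follows the pattern of backward elimination. A minimal sketch of such a loop is given below, on the built-in `mtcars` data. The stopping rule used here (drop the term whose removal gives the lowest AIC, stop when no removal lowers AIC) is an assumption; the criterion actually applied in this report is not stated.

```r
# Start from a deliberately over-specified model on built-in data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit  <- full

repeat {
  # drop1() refits the model with each term removed in turn.
  tab  <- drop1(fit, test = "F")
  cand <- tab[-1, , drop = FALSE]    # row 1 is <none>, the current model
  if (nrow(cand) == 0) break         # nothing left to drop

  best <- which.min(cand$AIC)        # removal that yields the lowest AIC
  if (cand$AIC[best] >= tab$AIC[1]) break   # no removal improves on <none>

  # Remove the chosen term and repeat with the smaller model.
  fit <- update(fit, as.formula(paste(". ~ . -", rownames(cand)[best])))
}

print(formula(fit))
```

`MASS::stepAIC()` and `stats::step()` automate the same idea; the explicit loop is shown only to make each step visible.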
Call: `lm(formula = count ~ hour + holiday + weather + atemp + humidity + windspeed + casual + registered, data = Training)`

Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -3.091e-11 | -1.210e-14 | 8.200e-15 | 2.780e-14 | 3.894e-13 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -4.495e-13 | 2.564e-14 | -1.753e+01 | < 2e-16 | \*\*\* |
| hour | 8.269e-15 | 7.189e-16 | 1.150e+01 | < 2e-16 | \*\*\* |
| holiday | 1.338e-13 | 2.609e-14 | 5.127e+00 | 3.02e-07 | \*\*\* |
| weather | 9.059e-14 | 7.761e-15 | 1.167e+01 | < 2e-16 | \*\*\* |
| atemp | 1.046e-14 | 6.036e-16 | 1.733e+01 | < 2e-16 | \*\*\* |
| humidity | -6.531e-15 | 2.893e-16 | -2.258e+01 | < 2e-16 | \*\*\* |
| windspeed | 5.895e-16 | 5.850e-16 | 1.008e+00 | 0.314 | |
| casual | 1.000e+00 | 1.151e-16 | 8.688e+15 | < 2e-16 | \*\*\* |
| registered | 1.000e+00 | 3.559e-17 | 2.809e+16 | < 2e-16 | \*\*\* |

Signif. codes: 0 ‘\*\*\*’ 0.001 ‘\*\*’ 0.01 ‘\*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.878e-13 on 7613 degrees of freedom

Multiple R-squared: 1, Adjusted R-squared: 1

F-statistic: 2.104e+32 on 8 and 7613 DF, p-value: < 2.2e-16
Call: `lm(formula = count ~ hour + holiday + weather + atemp + humidity + windspeed + casual + registered, data = Training)`

Coefficients:

| (Intercept) | hour | holiday | weather | atemp | humidity | windspeed | casual | registered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -4.495e-13 | 8.269e-15 | 1.338e-13 | 9.059e-14 | 1.046e-14 | -6.531e-15 | 5.895e-16 | 1.000e+00 | 1.000e+00 |
| Model | df | AIC | BIC |
| --- | --- | --- | --- |
| lm.full | 13 | -408012.9 | -407922.7 |
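Tables with `df` and `AIC` (or `BIC`) columns like the one above are what `AIC()` and `BIC()` return when given more than one fitted model. The sketch below shows this on the built-in `mtcars` data. The name `lm.reduced` and both formulas are illustrative assumptions, not models from this report.

```r
# Two nested models on built-in data; names and formulas are illustrative.
lm.full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
lm.reduced <- lm(mpg ~ wt + hp, data = mtcars)

# With several models, AIC() and BIC() return a data frame with one row per
# model; df counts the coefficients plus the residual variance.
aic_tab <- AIC(lm.full, lm.reduced)
bic_tab <- BIC(lm.full, lm.reduced)

print(aic_tab)
print(bic_tab)
```

Lower values indicate the preferred model under each criterion, provided the models are fitted to the same rows.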