1. Synopsis

In this project, we performed exploratory analysis on the data set from the Kaggle competition, which contains historical data from the bike sharing system in Washington, D.C. from the beginning of 2011 to the end of 2012. The goal of the competition is to combine past rental patterns with historical weather data in order to forecast bike rental demand. To that end, we built several models, including Poisson regression, quasi-Poisson regression, linear regression (with a log transformation of the response variable) and negative binomial regression; however, none of the models fit the data very well. We could therefore only explore the relationships between bike rentals and the explanatory variables.

2. Background

The following is a short background introduction from the Kaggle competition website:

"Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city."

3. Understanding the Data Set

The bike-sharing rental process is highly correlated with environmental and seasonal conditions. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C., USA, which is publicly available on the Capital Bikeshare website. The data used in this project were aggregated on a daily basis, and the corresponding weather and seasonal information was then added. The weather information was extracted from the i-weather website.

The count of rented bikes is also correlated with events in the city, which are easily traceable via search engines. For instance, a query like “2012-10-30 Washington D.C.” on Google returns results related to Hurricane Sandy. The data can therefore also be used for the identification and validation of anomaly or event detection algorithms.

The original data set contains 16 variables in total.

There are some variables in the data set that fall outside the scope of this project. We therefore removed the irrelevant variables and converted season, holiday, workingday and weathersit into factor variables. The first rows of the cleaned data set are shown below.
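
This cleaning step could be done roughly as follows (a sketch rather than the report's own code; it assumes the raw daily data has already been read into a data frame named bike, the name used in the modeling code later in the report):

# Keep only the variables used in this project and turn the categorical
# ones into factors (sketch; column selection may differ from the report)
bike <- bike[, c("dteday", "season", "holiday", "workingday", "weathersit",
                 "temp", "atemp", "hum", "windspeed", "cnt")]
bike$season     <- as.factor(bike$season)
bike$holiday    <- as.factor(bike$holiday)
bike$workingday <- as.factor(bike$workingday)
bike$weathersit <- as.factor(bike$weathersit)
head(bike)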

      dteday season holiday workingday weathersit      temp    atemp
1 2011-01-01      1       0          0          2 14.110847 18.18125
2 2011-01-02      1       0          0          2 14.902598 17.68695
3 2011-01-03      1       0          1          1  8.050924  9.47025
4 2011-01-04      1       0          1          1  8.200000 10.60610
5 2011-01-05      1       0          1          1  9.305237 11.46350
6 2011-01-06      1       0          1          1  8.378268 11.66045
      hum windspeed  cnt
1 80.5833 10.749882  985
2 69.6087 16.652113  801
3 43.7273 16.636703 1349
4 59.0435 10.739832 1562
5 43.6957 12.522300 1600
6 51.8261  6.000868 1606

4. Exploratory Analysis with Plots

First of all, we want to see how the response variable cnt is distributed. From the histogram, the total number of rented bikes appears to follow a nearly normal distribution. Many studies have shown that the Poisson distribution can be used to analyze count data. The mean and variance of a Poisson distribution are equal, and as the mean grows larger, the Poisson distribution approaches a normal distribution.
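
The histogram and the summary below could be produced along these lines (a sketch; the report's own plotting code is not shown):

# Histogram of daily rental counts, plus mean, variance and their ratio
hist(bike$cnt, xlab = "Daily count of rented bikes", main = "Distribution of cnt")
m <- mean(bike$cnt)
v <- var(bike$cnt)
c(mean = m, var = v, ratio = v / m)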

        mean          var        ratio 
   4504.3488 3752788.2083     833.1478 

We can see a huge difference between the mean and the variance of the cnt variable; clearly, it does not follow a standard Poisson distribution. For count data, a variance larger than the mean indicates overdispersion. There are several ways to deal with this, which we discuss later in the regression analysis part.

Next, we looked at the relationship between the response variable and each explanatory variable, and selected the plots with the most obvious patterns, shown below. The first plot shows the relationship between cnt and date. The overall trend increases over the two-year time span, and within each year there are large numbers of bike rentals during the summer and fall seasons.

The second plot shows the relationship between cnt and season, which confirms the conclusion above: the average number of bike rentals is highest during summer and fall.

The third plot shows the relationship between cnt and holiday. The average number of bike rentals on non-holidays is higher than on holidays, but also shows more variability.

The fourth plot shows the relationship between cnt and weather. There is a clear decreasing trend in bike rentals as the weather conditions grow worse. The last two plots show the relationship between cnt and temperature (temp and atemp). There appears to be a linear relationship, meaning more people rent bikes when it gets warmer; however, the points are widely scattered, so any linear relationship is likely weak.
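
The plots described above could be generated roughly as follows (a sketch using base R graphics; the report's actual plotting code is not shown):

plot(bike$dteday, bike$cnt, type = "l", xlab = "Date", ylab = "cnt")   # two-year trend
boxplot(cnt ~ season,     data = bike, xlab = "season",     ylab = "cnt")
boxplot(cnt ~ holiday,    data = bike, xlab = "holiday",    ylab = "cnt")
boxplot(cnt ~ weathersit, data = bike, xlab = "weathersit", ylab = "cnt")
plot(bike$temp,  bike$cnt, xlab = "temp",  ylab = "cnt")   # cnt vs. temperature
plot(bike$atemp, bike$cnt, xlab = "atemp", ylab = "cnt")   # cnt vs. feeling temperature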

5. Regression Analysis

5.1 Poisson Regression

As mentioned earlier in this report, Poisson regression is well suited to modeling count data. It assumes that the response variable follows a Poisson distribution and that the log of its mean is a linear combination of the explanatory variables.

So, first of all, we used Poisson regression as our default method and fitted a model with all the explanatory variables but no interactions, even though we already know there is overdispersion in the data set.

poi.mod <- glm(cnt ~ dteday + season + holiday + workingday + weathersit + temp + atemp + hum
               + windspeed, family = poisson, data = bike)
exp(poi.mod$coef)
 (Intercept)       dteday      season2      season3      season4 
4.618677e-05 1.000000e+00 1.321074e+00 1.087635e+00 1.221007e+00 
    holiday1  workingday1  weathersit2  weathersit3         temp 
8.430506e-01 1.028818e+00 9.137808e-01 4.878163e-01 1.017446e+00 
       atemp          hum    windspeed 
1.012933e+00 9.965098e-01 9.918340e-01 

The table above shows the exponentiated values of the coefficients.

The model tells us, for instance, that when all the other variables are held constant, the average number of rented bikes during summer (season2) is about 1.321 times that during spring, i.e. an increase of about 32.1%.

As another example, when all the other variables are held constant, the average number of rented bikes under weather condition 3 (weathersit3) is about 0.488 times that under weather condition 1, i.e. a decrease of about 1 - 0.488 = 51.2%. Note that weather condition 3 is much worse than weather condition 1.
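
These multiplicative effects come directly from exponentiating the fitted coefficients; for example:

exp(coef(poi.mod)[c("season2", "weathersit3")])   # 1.321 and 0.488, as in the table above
(exp(coef(poi.mod)["season2"]) - 1) * 100         # about +32.1% relative to spring
(1 - exp(coef(poi.mod)["weathersit3"])) * 100     # about -51.2% relative to weather condition 1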

deviance      d.f 
143921.6    718.0 

Although all the explanatory variables are significant in the model, the deviance of 143921.6 on 718 degrees of freedom suggests that the model does not fit the data well. This is most likely due to the overdispersion in the data set.
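
The deviance and its degrees of freedom come from the fitted model object, and a formal goodness-of-fit check compares the deviance to a chi-squared distribution with the residual degrees of freedom (a sketch):

c(deviance = poi.mod$deviance, d.f = poi.mod$df.residual)
# For a well-fitting Poisson model the deviance should be close to its
# degrees of freedom; 143921.6 on 718 d.f. is far too large.
pchisq(poi.mod$deviance, df = poi.mod$df.residual, lower.tail = FALSE)  # essentially 0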

5.2 Quasi-Poisson Regression

The quasi-Poisson model is a remedy for overdispersion in the Poisson model. In addition to the regression coefficients, it estimates a dispersion (scale) parameter and inflates the estimated standard errors accordingly, which gives more conservative results.
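
A quasi-Poisson model can be fitted with the same call as before, changing only the family (a sketch; the object name qpoi.mod is an assumption). The Poisson and quasi-Poisson coefficient tables are shown below for comparison.

qpoi.mod <- glm(cnt ~ dteday + season + holiday + workingday + weathersit + temp + atemp + hum
                + windspeed, family = quasipoisson, data = bike)
summary(qpoi.mod)$coefficients   # quasi-Poisson coefficient table
summary(qpoi.mod)$dispersion     # estimated dispersion parameter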

Estimated coefficients table and dispersion parameter of the Poisson regression model:

                 Estimate   Std. Error    z value      Pr(>|z|)
(Intercept) -9.982817e+00 4.665580e-02 -213.96733  0.000000e+00
dteday       1.348374e-08 3.491469e-11  386.19105  0.000000e+00
season2      2.784453e-01 2.274414e-03  122.42508  0.000000e+00
season3      8.400579e-02 2.846314e-03   29.51389 1.910120e-191
season4      1.996763e-01 2.066588e-03   96.62123  0.000000e+00
holiday1    -1.707283e-01 3.728532e-03  -45.78969  0.000000e+00
workingday1  2.841018e-02 1.237314e-03   22.96119 1.139257e-116
weathersit2 -9.016454e-02 1.498123e-03  -60.18499  0.000000e+00
weathersit3 -7.178164e-01 5.502657e-03 -130.44906  0.000000e+00
temp         1.729515e-02 5.748669e-04   30.08549 7.502723e-199
atemp        1.284963e-02 5.193170e-04   24.74334 3.657157e-135
hum         -3.496353e-03 5.554426e-05  -62.94715  0.000000e+00
windspeed   -8.199551e-03 1.215504e-04  -67.45806  0.000000e+00
dispersion.parameter 
                   1 

Estimated coefficients table and dispersion parameter of the quasi-Poisson regression model:

                 Estimate   Std. Error    t value      Pr(>|t|)
(Intercept) -9.982817e+00 6.348993e-01 -15.723466  4.309183e-48
dteday       1.348374e-08 4.751244e-10  28.379388 2.143476e-119
season2      2.784453e-01 3.095057e-02   8.996451  2.053811e-18
season3      8.400579e-02 3.873308e-02   2.168839  3.042250e-02
season4      1.996763e-01 2.812244e-02   7.100246  3.000020e-12
holiday1    -1.707283e-01 5.073844e-02  -3.364872  8.065030e-04
workingday1  2.841018e-02 1.683755e-02   1.687311  9.197791e-02
weathersit2 -9.016454e-02 2.038669e-02  -4.422716  1.125689e-05
weathersit3 -7.178164e-01 7.488100e-02  -9.586096  1.461548e-20
temp         1.729515e-02 7.822877e-03   2.210843  2.736118e-02
atemp        1.284963e-02 7.066944e-03   1.818273  6.943897e-02
hum         -3.496353e-03 7.558548e-04  -4.625694  4.428587e-06
windspeed   -8.199551e-03 1.654076e-03  -4.957180  8.931570e-07
dispersion.parameter 
            185.1818 

Comparing the two tables above, we can see that the estimated standard errors in the quasi-Poisson model are much larger than those in the Poisson model. The dispersion parameter estimated by the quasi-Poisson model is 185.1818, which confirms that there is indeed overdispersion in the data set and that the Poisson model underestimated the standard errors.
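
In fact, the quasi-Poisson standard errors are simply the Poisson standard errors inflated by the square root of the dispersion parameter, which can be checked directly (a sketch using the assumed object name qpoi.mod from above):

sqrt(summary(qpoi.mod)$dispersion)   # sqrt(185.1818), roughly 13.6
summary(qpoi.mod)$coefficients[, "Std. Error"] /
  summary(poi.mod)$coefficients[, "Std. Error"]   # every ratio is roughly 13.6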

5.3 Linear Regression with Transformation

We also fitted linear models with a log transformation of the response variable. The atemp variable was insignificant in the first linear model, so we removed it and refitted. All the variables in the second linear model are significant.

y <- log(bike$cnt) # log transformation of the response variable
lin.mod <- lm(y ~ dteday + season + holiday + workingday + weathersit + temp + atemp + hum 
              + windspeed, data = bike)
adv.lin.mod <- lm(y ~ dteday + season + holiday + workingday + weathersit + temp + hum + windspeed,
                  data = bike)
c(adjusted.R.squared = summary(adv.lin.mod)$adj.r.squared)
adjusted.R.squared 
         0.7201124 

The adjusted R-squared values of the two models are essentially identical (0.72), suggesting that about 72% of the variability in the response variable is explained by the linear model. However, the residual plot and the QQ plot show that the residuals have a clear pattern and are not normally distributed, so the linear model does not fit the data well. We should therefore also consider transforming the explanatory variables, using the Box-Tidwell method.
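
The diagnostic plots referred to above can be obtained directly from the fitted linear model (a sketch):

par(mfrow = c(1, 2))
plot(adv.lin.mod, which = 1)  # residuals vs. fitted values
plot(adv.lin.mod, which = 2)  # normal Q-Q plot of the residuals
par(mfrow = c(1, 1))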

There are three numerical explanatory variables in the second linear model: temp, hum and windspeed.
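
The three outputs below are Box-Tidwell score tests applied to each of these variables in turn. One way to obtain such output is the boxTidwell function in the car package (an assumption; the report does not show this code, and the exact call and output layout may differ between car versions). A sketch for the temp variable, with hum and windspeed tested analogously by swapping roles:

library(car)
# Box-Tidwell score test for temp, holding the other predictors of the
# reduced linear model fixed via other.x (sketch)
boxTidwell(log(cnt) ~ temp,
           other.x = ~ dteday + season + holiday + workingday + weathersit + hum + windspeed,
           data = bike)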

 Score Statistic p-value MLE of lambda
       -9.223382       0    -0.2142003

iterations =  8 

The p-value is essentially 0 (< 0.05), so we should transform the temp variable. The MLE of lambda is about -0.21, which is close to 0, suggesting a log transformation.

 Score Statistic p-value MLE of lambda
        -6.77468       0      10.55881

iterations =  13 

The p-value is essentially 0 (< 0.05), so we should transform the hum variable. The MLE of lambda is about 10.6, suggesting a high-order power transformation (we use hum^10).

 Score Statistic   p-value MLE of lambda
       -2.253712 0.0242143      2.033202

iterations =  5 

The p-value is 0.024 < 0.05, so we should also transform the windspeed variable. The MLE of lambda is about 2, suggesting a squared transformation.

After these transformations, we fitted three further models, each including one of the transformed explanatory variables. The adjusted R-squared increased for the models with windspeed and temp transformed, but decreased for the model with hum transformed. Finally, we fitted our final linear model with both temp and windspeed transformed.

tm <- log(bike$temp)    # log-transformed temperature (Box-Tidwell lambda near 0)
hm <- bike$hum^10       # power-transformed humidity (lambda near 10)
ws <- bike$windspeed^2  # squared windspeed (lambda near 2)
lin.mod.trans <- lm(y ~ dteday + season + holiday + workingday + weathersit + temp + hm + windspeed,
                    data = bike)
lin.mod.trans2 <- lm(y ~ dteday + season + holiday + workingday + weathersit + temp + hum + ws,
                     data = bike)
lin.mod.trans3 <- lm(y ~ dteday + season + holiday + workingday + weathersit + tm + hum + windspeed,
                     data = bike)
lin.mod.trans4 <- lm(y ~ dteday + season + holiday + workingday + weathersit + tm + hum + ws,
                     data = bike)
c(adjusted.R.squared = summary(lin.mod.trans4)$adj.r.squared)
adjusted.R.squared 
         0.7423005 

The adjusted R-squared value increased to 0.74, suggesting that about 74% of the variability in the response variable is explained by the final linear model. However, the residual plot still shows a pattern and the residuals are still not normally distributed, so the linear models do not fit the data well.

5.4 Negative Binomial Regression

The negative binomial model is also very useful for modeling count data, especially when the data are overdispersed. Compared with the Poisson model, it contains an extra parameter theta, which can be interpreted as the parameter of a multiplicative random effect; the variance is then mu + mu^2/theta instead of being equal to the mean mu.

Our first negative binomial model showed that the temp variable is insignificant. After removing it and refitting, we obtained the second negative binomial model.

The estimated coefficients table is shown below. All the remaining explanatory variables are significant, and the theta value estimated by the model is about 15.23.

library(MASS)
nb.mod <- glm.nb(cnt ~ dteday + season + holiday + workingday + weathersit + temp + atemp + hum
                 + windspeed, data = bike, link = log)
nb.mod2 <- glm.nb(cnt ~ dteday + season + holiday + workingday + weathersit + atemp + hum + windspeed,
                  data = bike, link = log)
summary(nb.mod2)$coef
                 Estimate   Std. Error    z value      Pr(>|z|)
(Intercept) -1.082840e+01 7.696921e-01 -14.068486  5.932605e-45
dteday       1.408594e-08 5.794981e-10  24.307141 1.647505e-130
season2      2.635771e-01 3.521487e-02   7.484824  7.164321e-14
season3      6.484597e-02 4.544132e-02   1.427026  1.535723e-01
season4      1.901084e-01 3.211021e-02   5.920497  3.209710e-09
holiday1    -1.762376e-01 5.887881e-02  -2.993227  2.760445e-03
workingday1  4.120836e-02 2.120445e-02   1.943383  5.196989e-02
weathersit2 -7.585443e-02 2.581426e-02  -2.938470  3.298369e-03
weathersit3 -6.610451e-01 6.716998e-02  -9.841377  7.468175e-23
atemp        3.356771e-02 2.104071e-03  15.953694  2.685344e-57
hum         -4.849562e-03 9.641796e-04  -5.029729  4.911744e-07
windspeed   -9.600759e-03 2.005338e-03  -4.787602  1.687860e-06
c(theta = summary(nb.mod2)$theta, deviance = nb.mod2$deviance, d.f = nb.mod2$df.residual)
    theta  deviance       d.f 
 15.22945 746.48931 719.00000 

Using the anova function in R to compare the two negative binomial models, we found that the p-value of the chi-squared (likelihood ratio) test is 0.047 < 0.05, which suggests we should include the temp variable in the model. Based on the earlier exploratory analysis, we also believe the temp variable should carry some information; perhaps it interacts with other explanatory variables. So we decided to keep it for now.
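
This comparison can be reproduced with the anova method for negative binomial fits (a sketch):

# Likelihood ratio test between the reduced model (nb.mod2) and the full model (nb.mod)
anova(nb.mod2, nb.mod)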

Likelihood ratio tests of Negative Binomial Models

Response: cnt
                                                                                 Model
1        dteday + season + holiday + workingday + weathersit + atemp + hum + windspeed
2 dteday + season + holiday + workingday + weathersit + temp + atemp + hum + windspeed
     theta Resid. df    2 x log-lik.   Test    df LR stat.   Pr(Chi)
1 15.22945       719       -12221.43                                
2 15.31226       718       -12217.47 1 vs 2     1 3.958592 0.0466327

The estimated coefficients table of our first negative binomial model is also shown below.

                 Estimate   Std. Error     z value      Pr(>|z|)
(Intercept) -1.073855e+01 7.687204e-01 -13.9693839  2.396912e-44
dteday       1.403270e-08 5.783503e-10  24.2633219 4.783718e-130
season2      2.542613e-01 3.535240e-02   7.1921923  6.375913e-13
season3      3.825985e-02 4.693087e-02   0.8152384  4.149359e-01
season4      1.887947e-01 3.202397e-02   5.8954176  3.737349e-09
holiday1    -1.813487e-01 5.877447e-02  -3.0855011  2.032095e-03
workingday1  3.965653e-02 2.115562e-02   1.8745153  6.085943e-02
weathersit2 -7.498544e-02 2.574537e-02  -2.9125797  3.584567e-03
weathersit3 -6.632867e-01 6.702765e-02  -9.8957175  4.344835e-23
temp         1.850296e-02 1.060117e-02   1.7453685  8.092074e-02
atemp        1.767658e-02 9.497677e-03   1.8611478  6.272331e-02
hum         -4.853112e-03 9.618614e-04  -5.0455413  4.522391e-07
windspeed   -1.020153e-02 2.029716e-03  -5.0260903  5.005802e-07
    theta  deviance       d.f 
 15.31226 746.51785 718.00000 

Last but not least, we compared our negative binomial model with the standard Poisson model using a chi-squared test. The result is shown below:

                 p.value
Chi-squared test       1

Unfortunately, the p-value is 1, which indicates that, according to this test, our negative binomial model does not improve on the Poisson model. Nevertheless, the deviance of 746.49 on 719 degrees of freedom shows that the negative binomial model handles the overdispersion quite well.

The regression equation given by our best negative binomial model so far is:

cnt = 1.957e-05 * 1^dteday * 1.292^season2 * 1.043^season3 * 1.207^season4 * 0.834^holiday1 * 1.041^workingday1
* 0.919^weathersit2 * 0.501^weathersit3 * 1.019^temp * 1.017^atemp * 0.996^hum * 0.990^windspeed

From the equation, we can see that season2 has the largest positive effect on the number of bikes rented: on average, summer is associated with about a 29.2% increase compared with spring, when all the other variables are held constant.

We can also see that weathersit3 has the largest negative effect on the number of bikes rented: on average, the worst weather condition is associated with about a 49.9% decrease compared with the best weather condition, when all the other variables are held constant.

Further study should consider possible interactions among the explanatory variables, which were not included in our project due to limited time. From our perspective, variables such as season, weather, temp, hum and windspeed are likely to be highly correlated. More effort should be put into analyzing the relationships among these explanatory variables before we can truly discover the pattern of the response variable.

Citation and Reference

Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Other resources from internet:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture07b_PoissReg.pdf
https://www.youtube.com/watch?v=mgIyPRhrVTY
http://www.instantr.com/2012/12/12/building-a-poisson-regression-model-in-r/
https://onlinecourses.science.psu.edu/stat504/node/169
http://data.princeton.edu/wws509/r/overdispersion.html
http://www.ats.ucla.edu/stat/r/dae/nbreg.htm