Background

Introduction

Bay Area Bike Share offers cheap, fast and on-demand service for the users in San Francisco who need commute to work or school, run errands and explore the city.

How to utilize data to optimize the operation which in turn to streamline the process and supply chain in order to satisfy customers’ demands by analyzing the data that the company has would be the top priority of the company.

We have tracking the usage of the rented bike in the past years which makes it available to analyze by using the historical data.

Objects and Approach

  1. Exploratory analysis:
  • When do people use the bike?

  • How long do people use the bike?

  • Which stations has more usage of bike?

  1. Linear Regression: How does the weather affect the bike usage?

Data

Data comes from Ford GoBike open data source.

Contains 4 tables:

  1. station.csv

  2. trip.csv

  3. weather.csv

  4. status.csv

We use the data from 08/2015 - 08/2016. And in this analysis we did not include the status data, which is the status of each bike station in every minute.

Take look at the data tables:

station.csv

67 records – station ID, name, latitude, longitude, dockcount, city, installation date

trip.csv

Approx. 314,000 records of individual trips

weather.csv

1,830 records of daily weather by city

Exploratory Analysis

Time and bike usage

First of all, with all trips data, we can calculate usage by day and get:

In this plot we can see how does the bike usage affected the dates. With 1 year data:

  • The usage in weekday and weekend significantly different.
  • Seems like that season will affect the usage as well. (which actually is holiday, and we will mention this in regression analysis)

Bike usage among a week

More intuitively, we can calculate total number of trips of each day in a week:

Clearly, weekends has a low usage.

Bike usage in a day

Give the bike usage of each day, one may also curious about how does the usage change in different times in a day:

We can see that peak hours has the most usage. But this might be too general. Will weekday/weekends also affect the usage in different times in a day?

The answer is yes:

So we could conclude that people are likely use bike as a transportation to their office in weekdays, and just riding for fun during the weekend.

Trip duration

Then immediately, one may want to find out if the duration of usage in weekdays and weekends will differ:

The result also confirmed out guess: people are busy in weekdays and mainly use bike to commute, but they have more time enjoy riding in weekends.

Locations

Bike stations are available in 4 cities: San Francisco, San Jose, Mountain View, and Palo Alto. Here are the locations:

Bike usage in a city

Given the bike station locations, and in trip data, we know where each trip started and ended, we can explore which station or what routes are more popular.

Take San Jose as an example:

Regression

In this part we want to know how does the weather affect the bike usage? Give the weather data:

Observations: 1,830
Variables: 24
$ PDT                         <chr> "9/1/2015", "9/2/2015", "9/3/2015"...
$ `Max TemperatureF`          <int> 75, 73, 70, 72, 79, 84, 91, 95, 93...
$ `Mean TemperatureF`         <int> 67, 68, 65, 64, 65, 68, 73, 76, 77...
$ `Min TemperatureF`          <int> 58, 62, 60, 55, 51, 52, 54, 57, 60...
$ `Max Dew PointF`            <int> 58, 59, 57, 52, 53, 53, 56, 54, 58...
$ `MeanDew PointF`            <int> 56, 56, 54, 50, 48, 47, 46, 46, 50...
$ `Min DewpointF`             <int> 54, 54, 50, 48, 44, 31, 37, 38, 44...
$ `Max Humidity`              <int> 84, 78, 84, 77, 89, 72, 67, 67, 73...
$ `Mean Humidity`             <int> 67, 68, 69, 61, 60, 44, 42, 43, 47...
$ `Min Humidity`              <int> 49, 57, 53, 44, 30, 16, 16, 19, 21...
$ `Max Sea Level PressureIn`  <dbl> 29.93, 29.97, 29.94, 29.90, 30.00,...
$ `Mean Sea Level PressureIn` <dbl> 29.89, 29.93, 29.89, 29.86, 29.95,...
$ `Min Sea Level PressureIn`  <dbl> 29.86, 29.90, 29.83, 29.83, 29.90,...
$ `Max VisibilityMiles`       <int> 10, 10, 10, 10, 10, 10, 10, 10, 10...
$ `Mean VisibilityMiles`      <int> 10, 10, 10, 10, 10, 10, 10, 10, 10...
$ `Min VisibilityMiles`       <int> 6, 10, 10, 7, 10, 10, 9, 10, 10, 1...
$ `Max Wind SpeedMPH`         <int> 22, 23, 18, 20, 21, 14, 13, 14, 16...
$ `Mean Wind SpeedMPH`        <int> 9, 12, 12, 10, 7, 5, 5, 6, 5, 6, 8...
$ `Max Gust SpeedMPH`         <int> 26, 26, 25, 24, 24, 17, 15, 17, 18...
$ PrecipitationIn             <chr> "0", "0", "0", "0", "0", "0", "0",...
$ CloudCover                  <int> 2, 5, 2, 2, 2, 3, 1, 0, 0, 2, 4, 7...
$ Events                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ WindDirDegrees              <int> 245, 249, 250, 233, 289, 280, 307,...
$ ZIP                         <int> 94107, 94107, 94107, 94107, 94107,...

More specifically, we want to conduct a multiple linear regression between the number of trips in a day and weather-related variables. In this model we only use San Francisco data, which has the most usage among 4 cities.

1. Variable Selection

The weather data has 24 variables, and many of them are obviously correlated, e.g. Max TemperatureF, Mean TemperatureF and Min TemperatureF.

We select 8 initial predictor variables:

  • weekday
  • mean_temperature_f
  • precipitation_in
  • mean_visibility_miles
  • mean_humidity
  • mean_wind_speed_mph
  • cloud_cover
  • events

and the response variable is num of trips.

Use regsubsets function in leaps package for variable selection:

Initial formula:

num_trips ~ 1 + weekday + mean_temperature_f + precipitation_in + 
    mean_visibility_miles + mean_humidity + mean_wind_speed_mph + 
    cloud_cover + events

Choose the model with the min Mallows’s \(C_{p}\):

            (Intercept)                weekday1      mean_temperature_f 
                   TRUE                    TRUE                    TRUE 
       precipitation_in   mean_visibility_miles           mean_humidity 
                  FALSE                    TRUE                   FALSE 
    mean_wind_speed_mph             cloud_cover          eventsFog-Rain 
                  FALSE                    TRUE                    TRUE 
             eventsNone              eventsRain eventsRain-Thunderstorm 
                   TRUE                    TRUE                    TRUE 

(We can see that here dummy variables are created for categorical variables.)

And we can get the formula whose corresponded model has min \(C_{p}\):

num_trips ~ 1 + weekday1 + mean_temperature_f + mean_visibility_miles + 
    cloud_cover + `eventsFog-Rain` + eventsNone + eventsRain + 
    `eventsRain-Thunderstorm`

2. Residual Diagnostic

Use selected formula fit the model, we can get


Call:
lm(formula = formula2, data = weather_frame %>% mutate(num_trips = weather$num_trips))

Residuals:
    Min      1Q  Median      3Q     Max 
-3032.9  -252.8    68.3   381.9  1653.2 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -727.850    529.758  -1.374  0.17033    
weekday1                  -2534.999     74.576 -33.992  < 2e-16 ***
mean_temperature_f           43.228      5.319   8.127 7.27e-15 ***
mean_visibility_miles       132.872     47.496   2.798  0.00543 ** 
cloud_cover                  31.189     16.970   1.838  0.06691 .  
`eventsFog-Rain`           -706.404    295.505  -2.390  0.01734 *  
eventsNone                 -289.830    156.990  -1.846  0.06569 .  
eventsRain                 -746.711    161.956  -4.611 5.60e-06 ***
`eventsRain-Thunderstorm`  -857.602    395.629  -2.168  0.03084 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 638.5 on 357 degrees of freedom
Multiple R-squared:  0.7904,    Adjusted R-squared:  0.7857 
F-statistic: 168.3 on 8 and 357 DF,  p-value: < 2.2e-16

For residual Diagnostic, we can conclude:

  1. Residual vs fitted has a pattern - need transform on Y

  2. qq plot shows residuals not normal - indicates that the model was affected by extreme values

Same as the bike usage plot we have in the beginning, if we add weather events and precipitation (in inches):

Furthermore, if we check weekdays with low usage (extreme values):

# A tibble: 18 x 2
          pdt num_trips
       <date>     <int>
 1 2015-09-07       465
 2 2015-11-26       121
 3 2015-11-27       181
 4 2015-12-21       545
 5 2015-12-22       943
 6 2015-12-23       956
 7 2015-12-24       155
 8 2015-12-25       167
 9 2015-12-28       751
10 2015-12-29       794
11 2015-12-30       421
12 2015-12-31       323
13 2016-01-01       123
14 2016-01-18       598
15 2016-01-22       966
16 2016-02-15       727
17 2016-05-30       277
18 2016-07-04       429

We can find that many of these days are national holidays or days around national holidays. So separating days by weekday/weekend may not accurate.

Treat national holidays also as weekend, and do \(Y = \sqrt{Y}\) transformation on response variable, and using the same variable selection, then we can get the final model:

\[ \sqrt{num\ of\ trips} = 18.685 - 35.562 weekend + 0.366 mean\ temperature +\\ 1.534mean\ visibility\ miles - 5.91 events\ Rain - 5.767 events\ Rain\ Thunderstorm \] Run diagnostic:


Call:
lm(formula = formula3, data = weather_holiday_adjust_frame %>% 
    mutate(num_trips = weather$num_trips))

Residuals:
     Min       1Q   Median       3Q      Max 
-22.1701  -2.4834   0.3127   3.0150  17.4536 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                18.68452    3.89332   4.799 2.34e-06 ***
weekday1                  -35.56177    0.56699 -62.720  < 2e-16 ***
mean_temperature_f          0.36601    0.04113   8.899  < 2e-16 ***
mean_visibility_miles       1.53434    0.33088   4.637 4.95e-06 ***
eventsRain                 -5.90986    0.78455  -7.533 4.05e-13 ***
`eventsRain-Thunderstorm`  -5.76702    2.95714  -1.950   0.0519 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.008 on 360 degrees of freedom
Multiple R-squared:  0.9241,    Adjusted R-squared:  0.923 
F-statistic: 876.6 on 5 and 360 DF,  p-value: < 2.2e-16

We can see that the pattern of residual ~ fitted is not significant as before. Residual QQ-plot is still shows somewhat long-tail, but also better than before.