Bay Area Bike Share offers a cheap, fast, on-demand service for users in the San Francisco Bay Area who need to commute to work or school, run errands, or explore the city.
The company's top priority is figuring out how to use its data to optimize operations, which in turn streamlines processes and the supply chain to satisfy customer demand.
Usage of the rented bikes has been tracked over the past years, so historical data is available for analysis. Some questions we want to answer:
When do people use the bike?
How long do people use the bike?
Which stations have the most usage?
The data comes from the Ford GoBike open data source and contains 4 tables:
station.csv
trip.csv
weather.csv
status.csv
We use data from 08/2015 - 08/2016. This analysis does not include the status data, which records the status of each bike station every minute.
A look at the data tables:
station.csv: 67 records – station ID, name, latitude, longitude, dockcount, city, installation date
trip.csv: approx. 314,000 records of individual trips
weather.csv: 1,830 records of daily weather by city
First of all, with the trip data, we can calculate usage by day:
This plot shows how bike usage varies by date over one year of data:
More intuitively, we can calculate the total number of trips for each day of the week:
Clearly, weekends have lower usage.
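As a minimal sketch of how these counts could be derived (assuming trip.csv has been read into a data frame named trip whose start_date column parses as a date-time; both names are assumptions about how the data is loaded):

library(dplyr)
library(lubridate)

# Trips per calendar day
daily_usage <- trip %>%
  mutate(date = as_date(start_date)) %>%
  count(date, name = "num_trips")

# Total trips per day of the week
weekday_usage <- daily_usage %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  group_by(day_of_week) %>%
  summarise(total_trips = sum(num_trips))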
Given the bike usage of each day, one may also be curious how usage changes at different times of the day:
We can see that peak hours have the most usage. But this might be too general: does the weekday/weekend split also affect usage at different times of the day?
The answer is yes:
So we can conclude that on weekdays people likely use the bikes as transportation to their offices, while on weekends they mostly ride for fun.
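A hedged sketch of the hour-by-daytype counts behind these plots, reusing the assumed trip frame from above:

# Hourly trip counts, split by weekday vs. weekend
# (wday() returns 1 for Sunday and 7 for Saturday by default)
hourly_usage <- trip %>%
  mutate(
    hour    = hour(start_date),
    daytype = if_else(wday(start_date) %in% c(1, 7), "weekend", "weekday")
  ) %>%
  count(daytype, hour, name = "num_trips")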
Then one may immediately want to find out whether trip durations differ between weekdays and weekends:
The result confirms our guess: people are busy on weekdays and mainly use the bikes to commute, but they have more time to enjoy riding on weekends.
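The comparison could be computed along these lines; the duration column and its unit (seconds) are assumptions about the trip schema:

# Trip duration by daytype, converted from assumed seconds to minutes
trip %>%
  mutate(daytype = if_else(wday(start_date) %in% c(1, 7), "weekend", "weekday")) %>%
  group_by(daytype) %>%
  summarise(
    median_minutes = median(duration / 60),
    mean_minutes   = mean(duration / 60)
  )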
Bike stations are available in 4 cities: San Francisco, San Jose, Mountain View, and Palo Alto. Here are the locations:
Given the bike station locations, and since the trip data records where each trip started and ended, we can explore which stations and routes are most popular.
Take San Jose as an example:
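One way to rank routes is to count start/end station pairs. This sketch assumes the station table from station.csv and start_station_name/end_station_name columns in the trip table, which are guesses about the schema:

# Most popular San Jose routes, as start/end station pairs
san_jose_names <- station %>%
  filter(city == "San Jose") %>%
  pull(name)

trip %>%
  filter(start_station_name %in% san_jose_names,
         end_station_name   %in% san_jose_names) %>%
  count(start_station_name, end_station_name, sort = TRUE) %>%
  head(10)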
In this part we want to know how the weather affects bike usage. Given the weather data:
Observations: 1,830
Variables: 24
$ PDT <chr> "9/1/2015", "9/2/2015", "9/3/2015"...
$ `Max TemperatureF` <int> 75, 73, 70, 72, 79, 84, 91, 95, 93...
$ `Mean TemperatureF` <int> 67, 68, 65, 64, 65, 68, 73, 76, 77...
$ `Min TemperatureF` <int> 58, 62, 60, 55, 51, 52, 54, 57, 60...
$ `Max Dew PointF` <int> 58, 59, 57, 52, 53, 53, 56, 54, 58...
$ `MeanDew PointF` <int> 56, 56, 54, 50, 48, 47, 46, 46, 50...
$ `Min DewpointF` <int> 54, 54, 50, 48, 44, 31, 37, 38, 44...
$ `Max Humidity` <int> 84, 78, 84, 77, 89, 72, 67, 67, 73...
$ `Mean Humidity` <int> 67, 68, 69, 61, 60, 44, 42, 43, 47...
$ `Min Humidity` <int> 49, 57, 53, 44, 30, 16, 16, 19, 21...
$ `Max Sea Level PressureIn` <dbl> 29.93, 29.97, 29.94, 29.90, 30.00,...
$ `Mean Sea Level PressureIn` <dbl> 29.89, 29.93, 29.89, 29.86, 29.95,...
$ `Min Sea Level PressureIn` <dbl> 29.86, 29.90, 29.83, 29.83, 29.90,...
$ `Max VisibilityMiles` <int> 10, 10, 10, 10, 10, 10, 10, 10, 10...
$ `Mean VisibilityMiles` <int> 10, 10, 10, 10, 10, 10, 10, 10, 10...
$ `Min VisibilityMiles` <int> 6, 10, 10, 7, 10, 10, 9, 10, 10, 1...
$ `Max Wind SpeedMPH` <int> 22, 23, 18, 20, 21, 14, 13, 14, 16...
$ `Mean Wind SpeedMPH` <int> 9, 12, 12, 10, 7, 5, 5, 6, 5, 6, 8...
$ `Max Gust SpeedMPH` <int> 26, 26, 25, 24, 24, 17, 15, 17, 18...
$ PrecipitationIn <chr> "0", "0", "0", "0", "0", "0", "0",...
$ CloudCover <int> 2, 5, 2, 2, 2, 3, 1, 0, 0, 2, 4, 7...
$ Events <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ WindDirDegrees <int> 245, 249, 250, 233, 289, 280, 307,...
$ ZIP <int> 94107, 94107, 94107, 94107, 94107,...
More specifically, we want to fit a multiple linear regression of the number of trips per day on weather-related variables. In this model we only use San Francisco data, which has the most usage among the 4 cities.
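A hedged sketch of the data preparation: daily San Francisco trip counts joined to San Francisco weather. The frame and column names are assumptions about the schema, and 94107 is taken as the San Francisco weather ZIP based on the weather sample above:

# Restrict trips to San Francisco stations, count per day
sf_names <- station %>%
  filter(city == "San Francisco") %>%
  pull(name)

sf_daily <- trip %>%
  filter(start_station_name %in% sf_names) %>%
  mutate(date = as_date(start_date)) %>%
  count(date, name = "num_trips")

# San Francisco weather, with PDT parsed from "m/d/Y" strings
weather_sf <- weather %>%
  filter(ZIP == 94107) %>%
  mutate(date = mdy(PDT))

model_frame <- inner_join(weather_sf, sf_daily, by = "date")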
The weather data has 24 variables, and many of them are obviously correlated, e.g. Max TemperatureF, Mean TemperatureF, and Min TemperatureF.
We select 8 initial predictor variables:
weekday
mean_temperature_f
precipitation_in
mean_visibility_miles
mean_humidity
mean_wind_speed_mph
cloud_cover
events
and the response variable is num_trips.
We use the regsubsets function in the leaps package for variable selection; a sketch of the call follows the initial formula below.
Initial formula:
num_trips ~ 1 + weekday + mean_temperature_f + precipitation_in +
mean_visibility_miles + mean_humidity + mean_wind_speed_mph +
cloud_cover + events
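A minimal sketch of the subset search and the minimum-Cp pick, assuming model_frame from the earlier sketch with its weather columns renamed to the snake_case names used in the formula:

# Exhaustive best-subset search with leaps::regsubsets
library(leaps)

subsets <- regsubsets(
  num_trips ~ weekday + mean_temperature_f + precipitation_in +
    mean_visibility_miles + mean_humidity + mean_wind_speed_mph +
    cloud_cover + events,
  data  = model_frame,
  nvmax = 11   # 8 predictors expand to 11 columns after dummy coding
)
subset_summary <- summary(subsets)

best <- which.min(subset_summary$cp)  # model size with minimum Mallows's Cp
subset_summary$which[best, ]          # TRUE/FALSE per (dummy-coded) term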
We choose the model with the minimum Mallows's \(C_{p}\):
(Intercept) weekday1 mean_temperature_f
TRUE TRUE TRUE
precipitation_in mean_visibility_miles mean_humidity
FALSE TRUE FALSE
mean_wind_speed_mph cloud_cover eventsFog-Rain
FALSE TRUE TRUE
eventsNone eventsRain eventsRain-Thunderstorm
TRUE TRUE TRUE
(We can see that dummy variables are created for the categorical variables.)
This gives the formula whose corresponding model has the minimum \(C_{p}\):
num_trips ~ 1 + weekday1 + mean_temperature_f + mean_visibility_miles +
cloud_cover + `eventsFog-Rain` + eventsNone + eventsRain +
`eventsRain-Thunderstorm`
Fitting the model with the selected formula, we get:
Call:
lm(formula = formula2, data = weather_frame %>% mutate(num_trips = weather$num_trips))
Residuals:
Min 1Q Median 3Q Max
-3032.9 -252.8 68.3 381.9 1653.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -727.850 529.758 -1.374 0.17033
weekday1 -2534.999 74.576 -33.992 < 2e-16 ***
mean_temperature_f 43.228 5.319 8.127 7.27e-15 ***
mean_visibility_miles 132.872 47.496 2.798 0.00543 **
cloud_cover 31.189 16.970 1.838 0.06691 .
`eventsFog-Rain` -706.404 295.505 -2.390 0.01734 *
eventsNone -289.830 156.990 -1.846 0.06569 .
eventsRain -746.711 161.956 -4.611 5.60e-06 ***
`eventsRain-Thunderstorm` -857.602 395.629 -2.168 0.03084 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 638.5 on 357 degrees of freedom
Multiple R-squared: 0.7904, Adjusted R-squared: 0.7857
F-statistic: 168.3 on 8 and 357 DF, p-value: < 2.2e-16
From the residual diagnostics (see the sketch after this list), we can conclude:
The residuals-vs-fitted plot shows a pattern, suggesting a transformation of Y.
The Q-Q plot shows the residuals are not normal, indicating that the model was affected by extreme values.
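These diagnostic plots come straight from base R's plot method for lm objects; fit here stands for the fitted model object:

par(mfrow = c(2, 2))  # 2x2 grid: residuals vs. fitted, Q-Q,
plot(fit)             # scale-location, residuals vs. leverage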
Adding weather events and precipitation (in inches) to the bike usage plot from the beginning:
Furthermore, if we check the weekdays with low usage (the extreme values):
# A tibble: 18 x 2
pdt num_trips
<date> <int>
1 2015-09-07 465
2 2015-11-26 121
3 2015-11-27 181
4 2015-12-21 545
5 2015-12-22 943
6 2015-12-23 956
7 2015-12-24 155
8 2015-12-25 167
9 2015-12-28 751
10 2015-12-29 794
11 2015-12-30 421
12 2015-12-31 323
13 2016-01-01 123
14 2016-01-18 598
15 2016-01-22 966
16 2016-02-15 727
17 2016-05-30 277
18 2016-07-04 429
We find that many of these days are national holidays or days around national holidays, so separating days only by weekday/weekend may not be accurate.
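One hedged way to fold those days into the weekend group before refitting; the holiday vector is hand-picked from the table above, and model_frame carries over from the earlier sketches:

# Re-derive the weekday indicator, treating national holidays as weekends
holidays <- as_date(c(
  "2015-09-07", "2015-11-26", "2015-11-27", "2015-12-24", "2015-12-25",
  "2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04"
))

model_frame_adjusted <- model_frame %>%
  mutate(weekday = (wday(date) %in% 2:6) & !(date %in% holidays))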
Treating national holidays as weekends, applying a \(\sqrt{Y}\) transformation to the response variable, and using the same variable selection procedure, we get the final model:
\[ \sqrt{\text{num of trips}} = 18.685 - 35.562\,\text{weekend} + 0.366\,\text{mean temperature} + \\ 1.534\,\text{mean visibility miles} - 5.910\,\text{events Rain} - 5.767\,\text{events Rain-Thunderstorm} \]
Running the diagnostics:
Call:
lm(formula = formula3, data = weather_holiday_adjust_frame %>%
mutate(num_trips = weather$num_trips))
Residuals:
Min 1Q Median 3Q Max
-22.1701 -2.4834 0.3127 3.0150 17.4536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.68452 3.89332 4.799 2.34e-06 ***
weekday1 -35.56177 0.56699 -62.720 < 2e-16 ***
mean_temperature_f 0.36601 0.04113 8.899 < 2e-16 ***
mean_visibility_miles 1.53434 0.33088 4.637 4.95e-06 ***
eventsRain -5.90986 0.78455 -7.533 4.05e-13 ***
`eventsRain-Thunderstorm` -5.76702 2.95714 -1.950 0.0519 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.008 on 360 degrees of freedom
Multiple R-squared: 0.9241, Adjusted R-squared: 0.923
F-statistic: 876.6 on 5 and 360 DF, p-value: < 2.2e-16
We can see that the pattern in the residuals-vs-fitted plot is not as significant as before. The residual Q-Q plot still shows a somewhat long tail, but it is also better than before.