Since 2013 a shared bicycle system known as Citibike has been available in New York City. The benefits of having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical, so long as they are aware of Citibike’s pricing constraints.
There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. (Bikes are also available in Jersey City, New Jersey, but users are not allowed to move bikes across the Hudson River.)
Users can either sign up for an annual membership, currently priced at $169 (with some discounts available), or buy a short-term pass, which costs $12 for one day or $24 for three days.
In order to ensure that bikes are available when and where users need them, the Citibike system is intended for relatively short-duration commuting trips rather than lengthy usage.
For annual members, the first 45 minutes of each trip carry no additional cost, while for short-term users, only the first 30 minutes of each trip are included.
Users incur an additional fee if they keep a bike for longer than the initial time allotment, with annual members incurring an extra charge of $2.50 per 15 minutes after the initial 45 minutes, and short-term users paying $4 for each additional 15 minutes.
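The overage rules above can be sketched as a small function. This is only an illustration: the names `overage_fee`, `"annual"`, and `"short_term"` are our own, and we assume that every started 15-minute block beyond the included time is billed in full, since the pricing description does not say how partial blocks are handled.

```python
import math

# Included minutes and overage rates as described above (fees in US dollars).
INCLUDED_MINUTES = {"annual": 45, "short_term": 30}
OVERAGE_FEE_PER_BLOCK = {"annual": 2.50, "short_term": 4.00}
BLOCK_MINUTES = 15


def overage_fee(trip_seconds: float, user_kind: str) -> float:
    """Return the extra charge for one trip, in dollars.

    Assumption: each started 15-minute block beyond the included time is
    billed in full (partial blocks are rounded up).
    """
    extra_seconds = trip_seconds - INCLUDED_MINUTES[user_kind] * 60
    if extra_seconds <= 0:
        return 0.0  # trip ended within the included time
    blocks = math.ceil(extra_seconds / (BLOCK_MINUTES * 60))
    return blocks * OVERAGE_FEE_PER_BLOCK[user_kind]
```

Under these assumptions, a 60-minute ride would cost an annual member one extra block ($2.50) and a short-term user two extra blocks ($8.00).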
Citibike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.
For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)
For the current year, the number of lines in each monthly datafile is as follows:
| lines | monthly file |
|---|---|
| 967,288 | 201901-citibike-tripdata.csv |
| 943,745 | 201902-citibike-tripdata.csv |
| 1,327,961 | 201903-citibike-tripdata.csv |
| 1,766,095 | 201904-citibike-tripdata.csv |
| 1,924,564 | 201905-citibike-tripdata.csv |
| 2,125,371 | 201906-citibike-tripdata.csv |
| 2,181,065 | 201907-citibike-tripdata.csv |
| 2,344,225 | 201908-citibike-tripdata.csv |
| 2,444,901 | 201909-citibike-tripdata.csv |
| 2,092,574 | 201910-citibike-tripdata.csv |
| 18,117,789 | total |
An example record from these datafiles includes the following features:
| feature name | value |
|---|---|
| tripduration (seconds) | 527 |
| starttime | 10/1/2019 00:00:05.6 |
| stoptime | 10/1/2019 00:08:52.9 |
| start station id | 3746 |
| start station name | 6 Ave & Broome St |
| start station latitude | 40.72430832 |
| start station longitude | -74.00473036 |
| end station id | 223 |
| end station name | W 13 St & 7 Ave |
| end station latitude | 40.73781509 |
| end station longitude | -73.99994661 |
| bikeid | 41750 |
| usertype | Subscriber |
| birth year | 1993 |
| gender | 1 |
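As a quick sanity check on a record like the one above, the sketch below recomputes the elapsed time from `starttime` and `stoptime` and estimates the straight-line distance between the two stations. The function names are our own, and we assume timestamps are formatted exactly as shown in the table; the raw files may use a different format.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

# Assumed timestamp layout, matching "10/1/2019 00:00:05.6" in the table.
TIME_FORMAT = "%m/%d/%Y %H:%M:%S.%f"
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def elapsed_seconds(start: str, stop: str) -> float:
    """Seconds between two timestamps given in TIME_FORMAT."""
    start_dt = datetime.strptime(start, TIME_FORMAT)
    stop_dt = datetime.strptime(stop, TIME_FORMAT)
    return (stop_dt - start_dt).total_seconds()


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

For the example record, the elapsed time works out to 527.3 seconds, which agrees with the recorded `tripduration` of 527. The haversine figure is only a lower bound on the distance actually ridden, since riders follow the street grid.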
We conjecture that one key determinant of how many people will use Citibike on a given day is the weather.
We can therefore obtain data from the NCDC (National Climatic Data Center) of NOAA (National Oceanic and Atmospheric Administration).
Here is a sample of daily weather data which can be requested from their website, https://www.ncdc.noaa.gov/cdo-web/:
| Feature | Description | Random date 1 | Random date 2 |
|---|---|---|---|
| STATION | Station ID number | USW00094728 | USW00094728 |
| NAME | Name of station | NY CITY CENTRAL PARK, NY US | NY CITY CENTRAL PARK, NY US |
| LATITUDE | Latitude of station | 40.77898 | 40.77898 |
| LONGITUDE | Longitude of station | -73.96925 | -73.96925 |
| ELEVATION | Elevation of station | 42.7 | 42.7 |
| DATE | Date of observation | 1/30/2019 | 7/22/2019 |
| AWND | Average Wind Speed | 2.68 | |
| PRCP | Amount of precipitation | 0.01 | 1.66 |
| SNOW | Amount of snowfall | 0.4 | 0 |
| SNWD | Snow Depth | 0 | 0 |
| TAVG | Average temperature | | |
| TMAX | Maximum temperature | 35 | 90 |
| TMIN | Minimum temperature | 6 | 72 |
| WDF2 | Direction of fastest 2-minute wind | 10 | |
| WDF5 | Direction of fastest 5-second wind | 340 | |
| WSF2 | Fastest 2-minute Wind Speed | | 14.1 |
| WSF5 | Fastest 5-second Wind Speed | | 25.1 |
| WT01 | Fog, ice fog, or freezing fog? | 1 | 1 |
| WT02 | Heavy fog or heavy freezing fog? | 1 | |
| WT03 | Thunder? | | 1 |
| WT06 | Glaze or rime? | | |
| WT08 | Smoke or haze? | | |
When it first started, Citibike proved to be popular (from the usage perspective) but a financial challenge. In 2013 the price of an annual membership was only $95 and a daily pass was less than $10. The vendor which provided the initial bikes and software went bankrupt, and the system operator was teetering on the brink before a new investor rescued it.
The annual membership fee was increased, and changes were made to the fees for daily and overtime usage. Citibike is now owned by a subsidiary of Lyft (the ride-hailing app company).
The data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.
Because of weather, ridership is expected to be lower during the winter months, and on foul-weather days during the rest of the year, than on a warm and sunny summer day. Using the weather data we can seek to model the relationship between bicycle ridership and fair/foul or hot/cold weather.
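One possible starting point for this analysis is sketched below: count trips per calendar day, pair each day with a single weather feature such as `TMAX`, and fit a straight line by ordinary least squares. The function names and the `weather_by_date` layout (a dictionary mapping each date to that day's weather features) are hypothetical, and a one-variable line is only a first approximation, not the model we expect to settle on.

```python
from collections import Counter
from datetime import datetime


def daily_trip_counts(start_times):
    """Count trips per calendar date from 'M/D/YYYY HH:MM:SS.f' strings."""
    counts = Counter()
    for stamp in start_times:
        day = datetime.strptime(stamp.split()[0], "%m/%d/%Y").date()
        counts[day] += 1
    return counts


def join_with_weather(counts, weather_by_date, feature="TMAX"):
    """Pair each day's weather value with its trip count, skipping gaps.

    weather_by_date is assumed to map a date to a dict of weather features.
    """
    pairs = []
    for day, trips in sorted(counts.items()):
        value = weather_by_date.get(day, {}).get(feature)
        if value not in (None, ""):  # weather fields are sometimes blank
            pairs.append((float(value), trips))
    return pairs


def fit_line(pairs):
    """Ordinary least squares for trips = intercept + slope * feature."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

A positive fitted slope for `TMAX` would be consistent with the expectation that warmer days see more rides, although precipitation and seasonality would need to be accounted for before drawing conclusions.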
What are the differences in patterns on weekdays (when, presumably, many people are using the bicycles for commuting) vs. on weekends (when most users are presumed to be pursuing leisure)?
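A first step toward the weekday/weekend comparison is simply labelling each trip by the day type of its start time, as in the sketch below. The function names are our own, the timestamp layout is assumed to match the example record, and public holidays that fall on weekdays are not treated specially here.

```python
from collections import Counter
from datetime import datetime


def day_type(start_time: str) -> str:
    """Label a 'M/D/YYYY HH:MM:SS.f' timestamp as 'weekday' or 'weekend'."""
    day = datetime.strptime(start_time.split()[0], "%m/%d/%Y")
    # datetime.weekday(): Monday is 0, so Saturday and Sunday are 5 and 6.
    return "weekend" if day.weekday() >= 5 else "weekday"


def trips_by_day_type(start_times):
    """Count trips falling on weekdays vs. weekends."""
    return Counter(day_type(stamp) for stamp in start_times)
```

Because there are five weekdays for every two weekend days, raw counts would need to be normalized (for example, to average trips per day) before the two groups are compared.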
What are the differences in rental patterns between annual members (presumably, local residents) vs. casual users (presumably, tourists)?
Is there any significant relationship between the age and/or gender of the bicycle renter vs. the rental patterns?
What are the characteristics of trips which incur extra usage charges (i.e., longer than 45 minutes for annual subscribers or 30 minutes for everyone else)? How can such additional usage be encouraged? What changes to the pricing structure might encourage people to keep the bikes longer, thus incurring extra fees?
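To begin characterizing such trips, one could measure what share of each user type's trips exceed the included time, as sketched below. The function name is our own. The label `Subscriber` appears in the example record; the label `Customer` for short-term users is an assumption about how the datafiles encode them.

```python
from collections import Counter

# Included time per user type, in seconds. "Customer" is an assumed label
# for short-term pass holders; "Subscriber" is taken from the example record.
INCLUDED_SECONDS = {"Subscriber": 45 * 60, "Customer": 30 * 60}


def overtime_share(trips):
    """Return the fraction of trips exceeding the included time, per user type.

    trips is an iterable of (usertype, tripduration_seconds) pairs.
    """
    totals, over = Counter(), Counter()
    for usertype, seconds in trips:
        totals[usertype] += 1
        if seconds > INCLUDED_SECONDS[usertype]:
            over[usertype] += 1
    return {usertype: over[usertype] / totals[usertype] for usertype in totals}
```

The flagged trips could then be broken down further, for instance by start station, time of day, or rider age, to see what distinguishes them from trips that finish within the allotment.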
More ridership and longer rides would translate to increased revenue for the company. (However, increased usage would likely also increase the maintenance costs of repairing the bicycles. We don’t have such data available, so we will concern ourselves with just the revenue side of the equation.)
We propose to utilize various regression techniques covered in this course to seek answers to the above.
At present we have not yet determined which approach(es) would work best, so we will have to explore various modeling techniques to determine which (if any) yield a meaningful result.
We do not yet know whether we will be able to answer all of the above questions, or only some of them, but will see where the data leads us.