Abstract

  • Short summary of the research problem and its importance, what you do and what you find

  • A person reading your abstract should get a good sense of what problem you addressed and how you addressed it without having to look at the rest of the paper

Key words

Bikeshare, Weather, Cycling

Introduction

  • What is the general area? What is the exact problem you are addressing?

  • Why is it important? (why should I be interested as a reader?)

  • What are the objectives of the research? What are your hypotheses?

  • How is the paper structured?

Literature review

Looking at these papers/articles:

  • SCHMIDT, Charles Active Travel for All? The Surge in Public Bike-Sharing Programs
  • ZHOU, Xiaolu Understanding Spatiotemporal Patterns of Biking Behavior by Analyzing Massive Bike Sharing Data in Chicago.
  • JIA, Yingnan, … Effects of new dock-less bicycle-sharing programs on cycling: a retrospective study in Shanghai
  • JIA, Yingnan, … Association between innovative dockless bicycle sharing programs and adopting cycling in commuting and non-commuting trips
  • HOSFORD, Kate … Who is in the near market for bicycle sharing? Identifying current, potential, and unlikely users of a public bicycle share program in Vancouver, Canada
  • HOSFORD, Kate … Evaluation of the impact of a public bicycle share program on population bicycling in Vancouver, BC
  • WESTLAND, James … Demand cycles and market segmentation in bicycle sharing
  • DELL’AMICO,… The bike sharing rebalancing problem: Mathematical formulations and benchmark instances
  • DELL’AMICO,… The Bike sharing Rebalancing Problem with Stochastic Demands
  • WANG, Shuai BRAVO: Improving the Rebalancing Operation in Bike Sharing with Rebalancing Range Prediction
  • VOGEL, Patrick, … Strategic and Operational Planning of Bike-Sharing Systems by Data Mining – A Case Study
  • FULLER, Daniel,… Impact of a public transit strike on public bicycle share use: An interrupted time series natural experiment study
  • FULLER, Daniel,… Impact evaluation of a public bicycle share program on cycling: a case example of BIXI in Montreal, Quebec
  • FAGHIH-IMANI, Ahmadreza A finite mixture modeling approach to examine New York City bicycle sharing system (CitiBike) users’ destination preferences
  • AN, Ran, … Weather and cycling in New York: The case of Citibike
  • HEANEY, Alexandra, … Climate Change and Physical Activity: Estimated Impacts of Ambient Temperatures on Bikeshare Usage in New York City

(I think the ones near the bottom of the list may be most promising…)

Methodology

  • Define data collection method

  • Accurate representation of the sample population and coverage issue from the target population

  • Description of Data

  • A complete description of the desired output

  • Data Analysis

  • Describe the instrumentation

  • Describe the analysis plan

  • Describe the scope and limitations of the methodology


  • The data-set is currently composed of XXXXX records and VVV variables.

  • Comment on missing values

  • Comment on cleanup

To obtain as much information as possible, we set aside many of the variables and focused on the following:

  • var1

  • var2

  • var3

  • var4

Experimentation and Results

The Results Section

  • Needs to systematically and clearly articulate the study findings. If the results are unclear, the reviewer must decide whether the analysis of the data was poorly executed or whether the Results section is poorly organized.

From the above, we decided to …

The Discussion Section

  • Should state whether the hypotheses were verified or proven untrue or, if no hypotheses were given, whether the research questions were answered. The authors should also comment on their results in light of previous studies, explain what differences (if any) exist between their findings and those reported by others, and attempt to account for the discrepancies.

The Figures and Graphs

  • Should illustrate the important features of the methods and results.

  • Should allow the reader to understand the figure or graph without having to refer back to the text of the manuscript.

  • Common mistakes made by inexperienced authors are failing to include figures that best depict their findings, writing unclear figure legends, and making poor use of arrows.

Tables

  • Should summarize the data, make the data more easily understandable, and point out important comparisons.

  • Description of the data in the text, if possible, is preferable to the use of a space-consuming table.

Conclusions and Summary

  • Recap briefly what you do in the paper

  • Evaluate the effectiveness of your research and provide recommendations (if applicable)

  • Make sure that all of the questions raised in the introduction and the literature review have been addressed

  • Compare the final results against the original aims and objectives

  • Identify any shortcomings and future research

References

Appendix

Load libraries

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units

1. Data Exploration

Load data

Weather data

## [1] ".//NYC_Weather_Data_2013-2019.csv"
weatherfile <- "NYC_Weather_Data_2013-2019.csv"

## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
  STATION = col_character(),
  NAME = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = "%m/%d/%Y"),   # month/day/year; for ISO dates use col_date(format = "%F"), i.e. "%Y-%m-%d"
  AWND = col_double(),                    # Average Daily Wind Speed
  AWND_ATTRIBUTES = col_character(),
  PGTM = col_double(),                    # Peak Wind-Gust Time
  PGTM_ATTRIBUTES = col_character(),
  PRCP = col_double(),                    # Amount of Precipitation
  PRCP_ATTRIBUTES = col_character(),
  SNOW = col_double(),                    # Amount of Snowfall
  SNOW_ATTRIBUTES = col_character(),
  SNWD = col_double(),                    # Depth of snow on the ground
  SNWD_ATTRIBUTES = col_character(),
  TAVG = col_double(),                    # Average Temperature (not populated)
  TAVG_ATTRIBUTES = col_character(),
  TMAX = col_double(),                    # Maximum temperature for the day
  TMAX_ATTRIBUTES = col_character(),
  TMIN = col_double(),                    # Minimum temperature for the day
  TMIN_ATTRIBUTES = col_character(),
  TSUN = col_double(),                    # Daily Total Sunshine (not populated)
  TSUN_ATTRIBUTES = col_character(),
  WDF2 = col_double(),                    # Direction of fastest 2-minute wind
  WDF2_ATTRIBUTES = col_character(),
  WDF5 = col_double(),                    # Direction of fastest 5-second wind
  WDF5_ATTRIBUTES = col_character(),
  WSF2 = col_double(),                    # Fastest 2-minute wind speed
  WSF2_ATTRIBUTES = col_character(),
  WSF5 = col_double(),                    # Fastest 5-second wind speed
  WSF5_ATTRIBUTES = col_character(),
  WT01 = col_double(),                    # Fog
  WT01_ATTRIBUTES = col_character(),
  WT02 = col_double(),                    # Heavy Fog
  WT02_ATTRIBUTES = col_character(),
  WT03 = col_double(),                    # Thunder
  WT03_ATTRIBUTES = col_character(),
  WT04 = col_double(),                    # Sleet
  WT04_ATTRIBUTES = col_character(),
  WT06 = col_double(),                    # Glaze
  WT06_ATTRIBUTES = col_character(),
  WT08 = col_double(),                    # Smoke or haze
  WT08_ATTRIBUTES = col_character(),
  WT13 = col_double(),                    # Mist
  WT13_ATTRIBUTES = col_character(),
  WT14 = col_double(),                    # Drizzle
  WT14_ATTRIBUTES = col_character(),
  WT16 = col_double(),                    # Rain
  WT16_ATTRIBUTES = col_character(),
  WT18 = col_double(),                    # Snow      
  WT18_ATTRIBUTES = col_character(),
  WT19 = col_double(),                    # Unknown source of precipitation
  WT19_ATTRIBUTES = col_character(),
  WT22 = col_double(),                    # Ice fog
  WT22_ATTRIBUTES = col_character()
)



# load all the daily weather data
weather <- read_csv(weatherfile, col_types = weatherspec)

# extract just 2019 (compare against Date values rather than character strings)
weather2019 <- weather %>%
  filter(DATE >= as.Date("2019-01-01"), DATE <= as.Date("2019-12-31"))

# extract just one month (June 2019)
weather201906 <- weather %>%
  filter(DATE >= as.Date("2019-06-01"), DATE <= as.Date("2019-06-30"))
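
The question raised in the comment above the column specification, whether to rename the columns, could be approached along these lines. This is a minimal sketch in base R: the descriptive names (avg_wind_speed, precipitation, and so on) are illustrative assumptions rather than names used elsewhere in this analysis, and the small data frame only stands in for the real weather table.

```r
# Toy stand-in for the `weather` table; only a few NOAA columns are included.
weather_demo <- data.frame(
  DATE = as.Date(c("2019-06-01", "2019-06-02")),
  AWND = c(4.5, 6.0),
  PRCP = c(0.00, 0.12),
  TMAX = c(78, 81),
  TMIN = c(60, 64)
)

# Hypothetical lookup from NOAA code to descriptive name (names are assumptions).
new_names <- c(
  DATE = "date",
  AWND = "avg_wind_speed",
  PRCP = "precipitation",
  TMAX = "temp_max",
  TMIN = "temp_min"
)

# Rename only the columns that appear in the lookup; leave all others untouched.
hits <- names(weather_demo) %in% names(new_names)
names(weather_demo)[hits] <- unname(new_names[names(weather_demo)[hits]])

names(weather_demo)
```

Keeping the mapping in one named vector means the original NOAA codes stay documented next to the names that replace them.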

List the names of available Citi Bike data files

## [1] 75
##       [,1]                               
##  [1,] "2013-07 - Citi Bike trip data.csv"
##  [2,] "2013-08 - Citi Bike trip data.csv"
##  [3,] "2013-09 - Citi Bike trip data.csv"
##  [4,] "2013-10 - Citi Bike trip data.csv"
##  [5,] "2013-11 - Citi Bike trip data.csv"
##  [6,] "2013-12 - Citi Bike trip data.csv"
##  [7,] "2014-01 - Citi Bike trip data.csv"
##  [8,] "2014-02 - Citi Bike trip data.csv"
##  [9,] "2014-03 - Citi Bike trip data.csv"
## [10,] "2014-04 - Citi Bike trip data.csv"
## [11,] "2014-05 - Citi Bike trip data.csv"
## [12,] "2014-06 - Citi Bike trip data.csv"
## [13,] "2014-07 - Citi Bike trip data.csv"
## [14,] "2014-08 - Citi Bike trip data.csv"
## [15,] "201409-citibike-tripdata.csv"     
## [16,] "201410-citibike-tripdata.csv"     
## [17,] "201411-citibike-tripdata.csv"     
## [18,] "201412-citibike-tripdata.csv"     
## [19,] "201501-citibike-tripdata.csv"     
## [20,] "201502-citibike-tripdata.csv"     
## [21,] "201503-citibike-tripdata.csv"     
## [22,] "201504-citibike-tripdata.csv"     
## [23,] "201505-citibike-tripdata.csv"     
## [24,] "201506-citibike-tripdata.csv"     
## [25,] "201507-citibike-tripdata.csv"     
## [26,] "201508-citibike-tripdata.csv"     
## [27,] "201509-citibike-tripdata.csv"     
## [28,] "201510-citibike-tripdata.csv"     
## [29,] "201511-citibike-tripdata.csv"     
## [30,] "201512-citibike-tripdata.csv"     
## [31,] "201601-citibike-tripdata.csv"     
## [32,] "201602-citibike-tripdata.csv"     
## [33,] "201603-citibike-tripdata.csv"     
## [34,] "201604-citibike-tripdata.csv"     
## [35,] "201605-citibike-tripdata.csv"     
## [36,] "201606-citibike-tripdata.csv"     
## [37,] "201607-citibike-tripdata.csv"     
## [38,] "201608-citibike-tripdata.csv"     
## [39,] "201609-citibike-tripdata.csv"     
## [40,] "201610-citibike-tripdata.csv"     
## [41,] "201611-citibike-tripdata.csv"     
## [42,] "201612-citibike-tripdata.csv"     
## [43,] "201701-citibike-tripdata.csv"     
## [44,] "201702-citibike-tripdata.csv"     
## [45,] "201703-citibike-tripdata.csv"     
## [46,] "201704-citibike-tripdata.csv"     
## [47,] "201705-citibike-tripdata.csv"     
## [48,] "201706-citibike-tripdata.csv"     
## [49,] "201707-citibike-tripdata.csv"     
## [50,] "201708-citibike-tripdata.csv"     
## [51,] "201709-citibike-tripdata.csv"     
## [52,] "201710-citibike-tripdata.csv"     
## [53,] "201711-citibike-tripdata.csv"     
## [54,] "201712-citibike-tripdata.csv"     
## [55,] "201801-citibike-tripdata.csv"     
## [56,] "201802-citibike-tripdata.csv"     
## [57,] "201803-citibike-tripdata.csv"     
## [58,] "201804-citibike-tripdata.csv"     
## [59,] "201805-citibike-tripdata.csv"     
## [60,] "201806-citibike-tripdata.csv"     
## [61,] "201807-citibike-tripdata.csv"     
## [62,] "201808-citibike-tripdata.csv"     
## [63,] "201809-citibike-tripdata.csv"     
## [64,] "201810-citibike-tripdata.csv"     
## [65,] "201811-citibike-tripdata.csv"     
## [66,] "201812-citibike-tripdata.csv"     
## [67,] "201901-citibike-tripdata.csv"     
## [68,] "201902-citibike-tripdata.csv"     
## [69,] "201903-citibike-tripdata.csv"     
## [70,] "201904-citibike-tripdata.csv"     
## [71,] "201905-citibike-tripdata.csv"     
## [72,] "201906-citibike-tripdata.csv"     
## [73,] "201907-citibike-tripdata.csv"     
## [74,] "201908-citibike-tripdata.csv"     
## [75,] "201909-citibike-tripdata.csv"
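
The code that produced this listing is not shown, so the following is only a sketch of one way to obtain it. The directory name is hypothetical, and a temporary folder with three empty files stands in for the real downloads; the pattern is written to match both file-naming schemes seen in the listing.

```r
# Hypothetical directory holding the monthly trip files; a temporary folder
# with three empty files stands in for it so the sketch runs anywhere.
datadir <- file.path(tempdir(), "citibike_demo")
dir.create(datadir, showWarnings = FALSE)
demo_files <- c("2013-07 - Citi Bike trip data.csv",
                "201409-citibike-tripdata.csv",
                "201909-citibike-tripdata.csv")
invisible(file.create(file.path(datadir, demo_files)))

# Match both naming schemes used by the monthly files.
tripfiles <- list.files(datadir,
                        pattern = "(Citi Bike trip data|citibike-tripdata)\\.csv$")

length(tripfiles)     # number of monthly files found
as.matrix(tripfiles)  # one file name per row
```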

glimpse the dataset

## Observations: 2,125,370
## Variables: 15
## $ trip_duration  <dbl> 330, 830, 380, 1155, 1055, 128, 315, 471, 1554, 392, 588, 1906, 2754, 2037, 454, 642, 1371, 13…
## $ s_time         <dttm> 2019-06-01 00:00:01, 2019-06-01 00:00:04, 2019-06-01 00:00:06, 2019-06-01 00:00:06, 2019-06-0…
## $ e_time         <dttm> 2019-06-01 00:05:31, 2019-06-01 00:13:55, 2019-06-01 00:06:26, 2019-06-01 00:19:22, 2019-06-0…
## $ s_station_id   <fct> 3602, 3054, 229, 3771, 441, 3236, 3129, 3467, 379, 467, 2000, 532, 3016, 3285, 3125, 3429, 223…
## $ s_station_name <fct> 31 Ave & 34 St, Greene Ave & Throop Ave, Great Jones St, McKibbin St & Bogart St, E 52 St & 2 …
## $ s_lat          <dbl> 40.76315, 40.68949, 40.72743, 40.70624, 40.75601, 40.75898, 40.75110, 40.72495, 40.74916, 40.6…
## $ s_long         <dbl> -73.92083, -73.94206, -73.99379, -73.93387, -73.96742, -73.99380, -73.94074, -74.00166, -73.99…
## $ e_station_id   <fct> 3570, 3781, 326, 3016, 3159, 495, 3560, 401, 2006, 3368, 366, 3712, 3764, 540, 3221, 157, 450,…
## $ e_station_name <fct> 35 Ave & 37 St, Greene Av & Myrtle Av, E 11 St & 1 Ave, Kent Ave & N 7 St, W 67 St & Broadway,…
## $ e_lat          <dbl> 40.75573, 40.69857, 40.72954, 40.72037, 40.77493, 40.76270, 40.75488, 40.72020, 40.76591, 40.6…
## $ e_long         <dbl> -73.92366, -73.91888, -73.98427, -73.96165, -73.98267, -73.99301, -73.93433, -73.98998, -73.97…
## $ bike_id        <int> 20348, 34007, 20587, 33762, 31290, 25137, 25648, 26972, 32969, 33539, 19770, 16708, 25711, 270…
## $ user_type      <fct> Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber…
## $ birth_year     <int> 1992, 1987, 1990, 1987, 1973, 1989, 1995, 1990, 1970, 1980, 1974, 1993, 1993, 1965, 1989, 1969…
## $ gender         <fct> 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0…

str - structure of the dataset

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2125370 obs. of  15 variables:
##  $ trip_duration : num  330 830 380 1155 1055 ...
##  $ s_time        : POSIXct, format: "2019-06-01 00:00:01" "2019-06-01 00:00:04" "2019-06-01 00:00:06" "2019-06-01 00:00:06" ...
##  $ e_time        : POSIXct, format: "2019-06-01 00:05:31" "2019-06-01 00:13:55" "2019-06-01 00:06:26" "2019-06-01 00:19:22" ...
##  $ s_station_id  : Factor w/ 793 levels "3602","3054",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ s_station_name: Factor w/ 793 levels "31 Ave & 34 St",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ s_lat         : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ s_long        : num  -73.9 -73.9 -74 -73.9 -74 ...
##  $ e_station_id  : Factor w/ 806 levels "3570","3781",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ e_station_name: Factor w/ 806 levels "35 Ave & 37 St",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ e_lat         : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ e_long        : num  -73.9 -73.9 -74 -74 -74 ...
##  $ bike_id       : int  20348 34007 20587 33762 31290 25137 25648 26972 32969 33539 ...
##  $ user_type     : Factor w/ 2 levels "Subscriber","Customer": 1 1 1 1 1 1 1 1 2 1 ...
##  $ birth_year    : int  1992 1987 1990 1987 1973 1989 1995 1990 1970 1980 ...
##  $ gender        : Factor w/ 3 levels "1","2","0": 1 2 2 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_duration = col_double(),
##   ..   s_time = col_datetime(format = ""),
##   ..   e_time = col_datetime(format = ""),
##   ..   s_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   s_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   s_lat = col_double(),
##   ..   s_long = col_double(),
##   ..   e_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   e_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   e_lat = col_double(),
##   ..   e_long = col_double(),
##   ..   bike_id = col_integer(),
##   ..   user_type = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   birth_year = col_integer(),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )

Summary of the dataset

##  trip_duration         s_time                        e_time                     s_station_id    
##  Min.   :     61   Min.   :2019-06-01 00:00:01   Min.   :2019-06-01 00:02:16   519    :  14778  
##  1st Qu.:    395   1st Qu.:2019-06-07 18:46:00   1st Qu.:2019-06-07 19:02:42   514    :  13512  
##  Median :    683   Median :2019-06-15 14:17:56   Median :2019-06-15 14:40:21   426    :  13410  
##  Mean   :   1109   Mean   :2019-06-15 21:45:53   Mean   :2019-06-15 22:04:23   402    :  12191  
##  3rd Qu.:   1199   3rd Qu.:2019-06-23 18:48:22   3rd Qu.:2019-06-23 19:09:09   3255   :  11973  
##  Max.   :3379585   Max.   :2019-06-30 23:59:54   Max.   :2019-07-15 03:53:26   499    :  11680  
##                                                                                (Other):2047826  
##                s_station_name        s_lat           s_long        e_station_id                   e_station_name   
##  Pershing Square North:  14778   Min.   :40.66   Min.   :-74.03   519    :  14696   Pershing Square North:  14696  
##  12 Ave & W 40 St     :  13512   1st Qu.:40.72   1st Qu.:-74.00   426    :  14163   West St & Chambers St:  14163  
##  West St & Chambers St:  13410   Median :40.74   Median :-73.99   514    :  13614   12 Ave & W 40 St     :  13614  
##  Broadway & E 22 St   :  12191   Mean   :40.74   Mean   :-73.98   402    :  12639   Broadway & E 22 St   :  12639  
##  8 Ave & W 31 St      :  11973   3rd Qu.:40.76   3rd Qu.:-73.97   3255   :  12055   8 Ave & W 31 St      :  12055  
##  Broadway & W 60 St   :  11680   Max.   :40.81   Max.   :-73.91   499    :  11310   Broadway & W 60 St   :  11310  
##  (Other)              :2047826                                    (Other):2046893   (Other)              :2046893  
##      e_lat           e_long          bike_id           user_type         birth_year   gender     
##  Min.   :40.66   Min.   :-74.07   Min.   :14529   Subscriber:1752526   Min.   :1885   1:1399108  
##  1st Qu.:40.72   1st Qu.:-74.00   1st Qu.:20961   Customer  : 372844   1st Qu.:1969   2: 527577  
##  Median :40.74   Median :-73.99   Median :28996                        Median :1983   0: 198685  
##  Mean   :40.74   Mean   :-73.98   Mean   :27614                        Mean   :1980              
##  3rd Qu.:40.76   3rd Qu.:-73.97   3rd Qu.:33016                        3rd Qu.:1990              
##  Max.   :40.81   Max.   :-73.91   Max.   :39934                        Max.   :2003              
## 

2. Data Preparation

Check for missing values

Let’s check whether any variables have missing values, i.e., values which are NULL or NA.

## [1] "Number of columns with missing values =  0"
## [1] "Names of columns with missing values =  "

In this data set (a single month, June 2019), no variable has missing values.
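
A check of this kind can be sketched as follows. The toy data frame is an assumption that borrows three column names from the glimpse above; the real check would run on the full trip table.

```r
# Toy stand-in for one month of trip data (column names follow the glimpse above).
trips_demo <- data.frame(
  trip_duration = c(330, 830, 380),
  birth_year    = c(1992L, 1987L, 1990L),
  gender        = factor(c(1, 2, 2))
)

# Count NA entries per column, then keep the names of columns with at least one.
na_counts  <- colSums(is.na(trips_demo))
na_columns <- names(na_counts)[na_counts > 0]

print(paste("Number of columns with missing values = ", length(na_columns)))
print(paste("Names of columns with missing values = ",
            paste(na_columns, collapse = ", ")))
```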

Look for unusual values / outliers

Examine trip_duration

The trip_duration is specified in seconds. The maximum, 3379585 seconds (about 39.1 days), is implausibly high, so the largest values are probably incorrect: nobody would willingly rent a bicycle for that long, given the fees that would be charged.

Histogram of log(trip_duration)
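
The plot itself is not reproduced here. A histogram of this kind could be drawn as in the sketch below, which uses base graphics and simulated log-normal durations; the simulated values are purely illustrative and are not the Citi Bike data.

```r
# Simulated trip durations in seconds (illustrative values only).
set.seed(1)
trip_duration <- round(exp(rnorm(1000, mean = 6.5, sd = 0.8))) + 61

# Taking logs compresses the long right tail so the bulk of trips is visible.
h <- hist(log(trip_duration),
          breaks = 30,
          main   = "Histogram of log(trip_duration)",
          xlab   = "log(trip duration in seconds)")

sum(h$counts)  # every simulated trip falls into some bin
```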

It may be easier to think of trip duration in minutes, hours, or days rather than in seconds, so let’s create such variables. Let’s also confirm that the recorded value (in seconds) is consistent with the difference between the start time and the end time:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0169   0.1098   0.1900   0.3083   0.3333 938.7739
## [1] 2024
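
The summary above appears to be in hours (the maximum of 938.7739 equals 3379585 seconds divided by 3600). The sketch below shows one way to derive the additional units and to compare the recorded duration with the timestamps. The toy rows and the one-second tolerance are assumptions, and how the count printed above was obtained is not shown, so the mismatch rule here is only one possibility.

```r
# Toy trips: the third row has a recorded duration that disagrees with its timestamps.
trips_demo <- data.frame(
  trip_duration = c(330, 830, 500),
  s_time = as.POSIXct(c("2019-06-01 00:00:01", "2019-06-01 00:00:04",
                        "2019-06-01 00:00:06"), tz = "UTC"),
  e_time = as.POSIXct(c("2019-06-01 00:05:31", "2019-06-01 00:13:54",
                        "2019-06-01 00:06:26"), tz = "UTC")
)

# Durations in friendlier units.
trips_demo$trip_minutes <- trips_demo$trip_duration / 60
trips_demo$trip_hours   <- trips_demo$trip_duration / 3600
trips_demo$trip_days    <- trips_demo$trip_duration / 86400

# Elapsed time implied by the timestamps, in seconds.
elapsed <- as.numeric(difftime(trips_demo$e_time, trips_demo$s_time, units = "secs"))

# Hypothetical tolerance of 1 second to allow for timestamp rounding.
mismatch <- abs(elapsed - trips_demo$trip_duration) > 1

summary(trips_demo$trip_hours)
sum(mismatch)  # number of trips whose recorded duration disagrees
```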

Let’s assume that nobody would rent a bicycle for longer than a specified time limit (here, 3 hours), and drop the records that exceed it:

## [1] "Initial number of trips:  2125370"
## [1] "Removed 4003 trips (0.188%) of longer than 3 hours."
## [1] "Remaining number of trips: 2121369"
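
Note that the three counts printed above do not reconcile exactly: 2125370 minus 4003 is 2121367, not 2121369. One way to keep such figures consistent is to derive all of them from a single logical mask, as in the sketch below. The toy durations are assumptions; the 3-hour limit is the one reported above.

```r
# Toy durations in seconds; two of them exceed the limit.
trip_duration <- c(330, 830, 3379585, 1155, 12000)

timelimit_hours <- 3
keep <- trip_duration <= timelimit_hours * 3600   # TRUE for trips within the limit

n_initial <- length(trip_duration)
n_removed <- sum(!keep)
trip_duration <- trip_duration[keep]

print(paste("Initial number of trips: ", n_initial))
print(sprintf("Removed %d trips (%.3f%%) longer than %d hours.",
              n_removed, 100 * n_removed / n_initial, timelimit_hours))
print(paste("Remaining number of trips:", length(trip_duration)))
```

Because the removed and remaining counts both come from keep, they always add up to the initial count.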

Examine birth year

The earliest birth year recorded is 1885, which would make the rider 134 years old in 2019. This is not plausible.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1885    1969    1983    1980    1990    2003
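
Such records could be flagged by converting birth year to an implied age, as in the sketch below. The toy values include the two extremes from the summary above, and the cutoff of 90 years is a purely hypothetical threshold; the analysis does not state what limit, if any, should apply.

```r
# Toy birth years, including the extremes reported in the summary above.
birth_year <- c(1992L, 1987L, 1885L, 2003L, 1969L)

data_year   <- 2019L                  # year in which the trips were taken
implied_age <- data_year - birth_year

# Hypothetical cutoff (an assumption, not a value taken from the analysis).
max_plausible_age <- 90L
implausible <- implied_age > max_plausible_age

max(implied_age)   # age implied by the earliest birth year
sum(implausible)   # number of records that would be flagged
```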

3. Build Models

4. Select Models