Abstract

Short summary of the research problem and its importance, what you do and what you find
A person reading your abstract should get a good sense of what problem you addressed and how you addressed it without having to look at the rest of the paper

Key words

Bikeshare, Weather, Cycling

Introduction

What is the general area? What is the exact problem you are addressing?
Why is it important? (why should I be interested as a reader?)
What are the objectives of the research? What are your hypotheses?
How is the paper structured?

Literature review

Looking at these papers/articles:

SCHMIDT, Charles Active Travel for All? The Surge in Public Bike-Sharing Programs
ZHOU, Xiaolu Understanding Spatiotemporal Patterns of Biking Behavior by Analyzing Massive Bike Sharing Data in Chicago.
JIA, Yingnan, … Effects of new dock-less bicycle-sharing programs on cycling: a retrospective study in Shanghai
JIA, Yingnan, … Association between innovative dockless bicycle sharing programs and adopting cycling in commuting and non-commuting trips
HOSFORD,Kate … Who is in the near market for bicycle sharing? Identifying current, potential, and unlikely users of a public bicycle share program in Vancouver, Canada
HOSFORD,Kate … Evaluation of the impact of a public bicycle share program on population bicycling in Vancouver, BC
WESTLAND, James … Demand cycles and market segmentation in bicycle sharing
DELL’AMICO,… The bike sharing rebalancing problem: Mathematical formulations and benchmark instances
DELL’AMICO,… The Bike sharing Rebalancing Problem with Stochastic Demands
WANG, Shuai BRAVO: Improving the Rebalancing Operation in Bike Sharing with Rebalancing Range Prediction
VOGEL, Patrick, … “Strategic and Operational Planning of Bike-Sharing Systems by Data Mining – A Case Study”
FULLER, Daniel,… Impact of a public transit strike on public bicycle share use: An interrupted time series natural experiment study
FULLER, Daniel,… Impact evaluation of a public bicycle share program on cycling: a caseexample of BIXI in Montreal, Quebec
FAGHIH-IMANI, Ahmadreza A finite mixture modeling approach to examine New York City bicycle sharing system (CitiBike) users’ destination preferences
AN, Ran, … Weather and cycling in New York: The case of Citibike
HEANEY, Alexandra, … Climate Change and Physical Activity: Estimated Impacts of Ambient Temperatures on Bikeshare Usage in New York City

(I think the ones near the bottom of the list may be most promising…)

Methodology

Define data collection method
Accurate representation of the sample population and coverage issue from the target population
Description of Data
A complete description of the desired output
Data Analysis
Describe the instrumentation
Describe the analysis plan
Describe the scope and limitations of the methodology

The data-set is currently composed of XXXXX records and VVV variables.
Comment on missing values
Comment on cleanup

in order to obtain the maximum information possible, we had to discard the use of many variables and put our focus into the following variables:

var1
var2
var3
var4

Experimentation and Results

The Results Section

Needs to systematically and clearly articulate the study findings. If the results are unclear, the reviewer must decide whether the analysis of the data was poorly executed or whether the Results section is poorly organized.

From the above, we decided to …

The Discussion Section

– Should state whether their hypotheses were verified or proven untrue or, if no hypotheses were given, whether their research questions were answered. The authors should also comment on their results in light of previous studies and explain what differences (if any) exist between their findings and those reported by others and attempt to provide an explanation for the discrepancies.

The Figures and Graphs

Should illustrate the important features of the methods and results.
Should allow the reader to understand the figure or graph without having to refer back to the text of the manuscript.
Common mistakes made by inexperienced authors are failing to include figures that best depict their findings, writing unclear figure legends, and making poor use of arrows.

Tables

Should summarize the data, make the data more easily understandable, and point out important comparisons.
Description of the data in the text, if possible, is preferable to the use of a space-consuming table.

Conclusions and Summary

Recap briefly what you do in the paper
Evaluate the effectiveness of your research and provide recommendations (if applicable)
Make sure that all of the questions raised in the introduction and the literature review have been addressed
Compare the final results against the original aims and objectives
Identify any shortcomings and future research

References

Appendix

knitr::opts_chunk$set(echo = TRUE, fig.pos = 'h')
#mydir = "C:/Users/Michael/Dropbox/priv/CUNY/MSDS/201909-Fall/DATA621_Nasrin/20201214_FinalProject/"
mydir = "./"
setwd(mydir)
knitr::opts_knit$set(root.dir = mydir)
options(digits=7,scipen=999,width=120)
datadir = paste0(mydir,"data/")

Load libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(sp)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(readr)

1. Data Exploration

Load data

Weather data

# Weather data is obtained from the  NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool  https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station: 
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"


weatherfilenames=list.files(path="./",pattern = '.csv$', full.names = T)    # ending with .csv ; not .zip
weatherfilenames

## [1] ".//NYC_Weather_Data_2013-2019.csv"

weatherfile <- "NYC_Weather_Data_2013-2019.csv"

## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
  STATION = col_character(),
  NAME = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = "%m/%d/%Y"), #col_date(format = "%F"),          #  readr::parse_datetime() :   "%F" = "%Y-%m-%d"
  AWND = col_double(),                     # Average Daily Wind Speed
  AWND_ATTRIBUTES = col_character(),
  PGTM = col_double(),                    # Peak Wind-Gust Time
  PGTM_ATTRIBUTES = col_character(),
  PRCP = col_double(),                    # Amount of Precipitation
  PRCP_ATTRIBUTES = col_character(),
  SNOW = col_double(),                    # Amount of Snowfall
  SNOW_ATTRIBUTES = col_character(),
  SNWD = col_double(),                    # Depth of snow on the ground
  SNWD_ATTRIBUTES = col_character(),
  TAVG = col_double(),                    # Average Temperature (not populated)
  TAVG_ATTRIBUTES = col_character(),
  TMAX = col_double(),                    # Maximum temperature for the day
  TMAX_ATTRIBUTES = col_character(),
  TMIN = col_double(),                    # Minimum temperature for the day
  TMIN_ATTRIBUTES = col_character(),
  TSUN = col_double(),                    # Daily Total Sunshine (not populated)
  TSUN_ATTRIBUTES = col_character(),
  WDF2 = col_double(),                    # Direction of fastest 2-minute wind
  WDF2_ATTRIBUTES = col_character(),
  WDF5 = col_double(),                    # Direction of fastest 5-second wind
  WDF5_ATTRIBUTES = col_character(),
  WSF2 = col_double(),                    # Fastest 2-minute wind speed
  WSF2_ATTRIBUTES = col_character(),
  WSF5 = col_double(),                    # fastest 5-second wind speed
  WSF5_ATTRIBUTES = col_character(),
  WT01 = col_double(),                    # Fog
  WT01_ATTRIBUTES = col_character(),
  WT02 = col_double(),                    # Heavy Fog
  WT02_ATTRIBUTES = col_character(),
  WT03 = col_double(),                    # Thunder
  WT03_ATTRIBUTES = col_character(),
  WT04 = col_double(),                    # Sleet
  WT04_ATTRIBUTES = col_character(),
  WT06 = col_double(),                    # Glaze
  WT06_ATTRIBUTES = col_character(),
  WT08 = col_double(),                    # Smoke or haze
  WT08_ATTRIBUTES = col_character(),
  WT13 = col_double(),                    # Mist
  WT13_ATTRIBUTES = col_character(),
  WT14 = col_double(),                    # Drizzle
  WT14_ATTRIBUTES = col_character(),
  WT16 = col_double(),                    # Rain
  WT16_ATTRIBUTES = col_character(),
  WT18 = col_double(),                    # Snow      
  WT18_ATTRIBUTES = col_character(),
  WT19 = col_double(),                    # Unknown source of precipitation
  WT19_ATTRIBUTES = col_character(),
  WT22 = col_double(),                    # Ice fog
  WT22_ATTRIBUTES = col_character()
)



# load all the daily weather data
weather <- read_csv(weatherfile,col_types = weatherspec)

# extract just 2019
weather2019 <- weather[(weather$DATE>="2019-01-01" & weather$DATE<="2019-12-31"),]


# extract just one month
weather201906 <- weather[(weather$DATE>="2019-06-01" & weather$DATE<="2019-06-30"),]

Function to load up a citibike datafile

library(readr)
read_CB_data_file = function(f){
  Startloadtime = Sys.time()
  print(paste("reading data file   ", f, " at ", Startloadtime))
  datafile = read_csv(f,skip = 1,
                  col_names=c("trip_duration",            # in seconds
                              "s_time",                   # start date/time
                              "e_time",                   # end date/time
                              "s_station_id",             # station ID for beginning of trip 
                              "s_station_name", 
                              "s_lat",                    # start station latitude
                              "s_long",                   # start station longitude
                              "e_station_id",             # station ID for end of trip
                              "e_station_name",
                              "e_lat",                    # latitude
                              "e_long",                   # longitude
                              "bike_id",                  # every bike has a 5-digit ID number
                              "user_type",                # Annual Subscriber or Daily Customer
                              "birth_year",               # Can infer age from this
                              "gender")                   # 1=Male,2=Female,0=unknown
                  ,col_types = "dTTffddffddifif"    # d=decimal; T=datetime; f=factor; i=integer
                  )
  Endloadtime = Sys.time()
  print(paste("done with data file ",  f, " at ", Endloadtime))
  Totalloadtime = round(Endloadtime - Startloadtime, 2)
  print(paste("Totaltime = ", Totalloadtime))
  print("------------------------------------------------------------")
  return(datafile)
}

List the names of available citibike data files

filenames=list.files("/home/ann/projects/nyc-citibike-data/data", pattern = "data.csv$")   # ending with .csv ; not .zip
length(filenames)

## [1] 75

t(t(filenames))

##       [,1]                               
##  [1,] "2013-07 - Citi Bike trip data.csv"
##  [2,] "2013-08 - Citi Bike trip data.csv"
##  [3,] "2013-09 - Citi Bike trip data.csv"
##  [4,] "2013-10 - Citi Bike trip data.csv"
##  [5,] "2013-11 - Citi Bike trip data.csv"
##  [6,] "2013-12 - Citi Bike trip data.csv"
##  [7,] "2014-01 - Citi Bike trip data.csv"
##  [8,] "2014-02 - Citi Bike trip data.csv"
##  [9,] "2014-03 - Citi Bike trip data.csv"
## [10,] "2014-04 - Citi Bike trip data.csv"
## [11,] "2014-05 - Citi Bike trip data.csv"
## [12,] "2014-06 - Citi Bike trip data.csv"
## [13,] "2014-07 - Citi Bike trip data.csv"
## [14,] "2014-08 - Citi Bike trip data.csv"
## [15,] "201409-citibike-tripdata.csv"     
## [16,] "201410-citibike-tripdata.csv"     
## [17,] "201411-citibike-tripdata.csv"     
## [18,] "201412-citibike-tripdata.csv"     
## [19,] "201501-citibike-tripdata.csv"     
## [20,] "201502-citibike-tripdata.csv"     
## [21,] "201503-citibike-tripdata.csv"     
## [22,] "201504-citibike-tripdata.csv"     
## [23,] "201505-citibike-tripdata.csv"     
## [24,] "201506-citibike-tripdata.csv"     
## [25,] "201507-citibike-tripdata.csv"     
## [26,] "201508-citibike-tripdata.csv"     
## [27,] "201509-citibike-tripdata.csv"     
## [28,] "201510-citibike-tripdata.csv"     
## [29,] "201511-citibike-tripdata.csv"     
## [30,] "201512-citibike-tripdata.csv"     
## [31,] "201601-citibike-tripdata.csv"     
## [32,] "201602-citibike-tripdata.csv"     
## [33,] "201603-citibike-tripdata.csv"     
## [34,] "201604-citibike-tripdata.csv"     
## [35,] "201605-citibike-tripdata.csv"     
## [36,] "201606-citibike-tripdata.csv"     
## [37,] "201607-citibike-tripdata.csv"     
## [38,] "201608-citibike-tripdata.csv"     
## [39,] "201609-citibike-tripdata.csv"     
## [40,] "201610-citibike-tripdata.csv"     
## [41,] "201611-citibike-tripdata.csv"     
## [42,] "201612-citibike-tripdata.csv"     
## [43,] "201701-citibike-tripdata.csv"     
## [44,] "201702-citibike-tripdata.csv"     
## [45,] "201703-citibike-tripdata.csv"     
## [46,] "201704-citibike-tripdata.csv"     
## [47,] "201705-citibike-tripdata.csv"     
## [48,] "201706-citibike-tripdata.csv"     
## [49,] "201707-citibike-tripdata.csv"     
## [50,] "201708-citibike-tripdata.csv"     
## [51,] "201709-citibike-tripdata.csv"     
## [52,] "201710-citibike-tripdata.csv"     
## [53,] "201711-citibike-tripdata.csv"     
## [54,] "201712-citibike-tripdata.csv"     
## [55,] "201801-citibike-tripdata.csv"     
## [56,] "201802-citibike-tripdata.csv"     
## [57,] "201803-citibike-tripdata.csv"     
## [58,] "201804-citibike-tripdata.csv"     
## [59,] "201805-citibike-tripdata.csv"     
## [60,] "201806-citibike-tripdata.csv"     
## [61,] "201807-citibike-tripdata.csv"     
## [62,] "201808-citibike-tripdata.csv"     
## [63,] "201809-citibike-tripdata.csv"     
## [64,] "201810-citibike-tripdata.csv"     
## [65,] "201811-citibike-tripdata.csv"     
## [66,] "201812-citibike-tripdata.csv"     
## [67,] "201901-citibike-tripdata.csv"     
## [68,] "201902-citibike-tripdata.csv"     
## [69,] "201903-citibike-tripdata.csv"     
## [70,] "201904-citibike-tripdata.csv"     
## [71,] "201905-citibike-tripdata.csv"     
## [72,] "201906-citibike-tripdata.csv"     
## [73,] "201907-citibike-tripdata.csv"     
## [74,] "201908-citibike-tripdata.csv"     
## [75,] "201909-citibike-tripdata.csv"

Load up data file or files

#### Load all the data files, notting how much time it takes to load them

Starttime = Sys.time()
print(paste("Start time: ", Starttime))

## [1] "Start time:  2019-12-22 20:27:49"

### loads up the data for all files -- problem is too much data for my computer to handle if all are loaded
#print("About to load multiple datafiles:")
#print(filenames)
#CB <- do.call(rbind,lapply(filenames,read_CB_data_file))



### Problem:  loading up multiple files is too much for my computer to handle !!!!



### load up just a single month of data ("filename")
#### Look at just June 2019
filename = filenames[72]
print(paste("Loading data for ", filename))

## [1] "Loading data for  201906-citibike-tripdata.csv"

CB <- read_CB_data_file(paste0("data/", filename))

## [1] "reading data file    data/201906-citibike-tripdata.csv  at  2019-12-22 20:27:49"
## [1] "done with data file  data/201906-citibike-tripdata.csv  at  2019-12-22 20:28:03"
## [1] "Totaltime =  14.58"
## [1] "------------------------------------------------------------"

Endtime = Sys.time()
print(paste("End time: ", Endtime))

## [1] "End time:  2019-12-22 20:28:03"

Totaltime = round(Endtime - Starttime,2)
print(paste("Totaltime for loading above file(s) = ", Totaltime))

## [1] "Totaltime for loading above file(s) =  14.6"

## Save a copy of the loaded data, in case we need it during manipulations below
save_CB <- CB

`glimpse` the dataset

glimpse(CB)

## Observations: 2,125,370
## Variables: 15
## $ trip_duration  <dbl> 330, 830, 380, 1155, 1055, 128, 315, 471, 1554, 392, 588, 1906, 2754, 2037, 454, 642, 1371, 13…
## $ s_time         <dttm> 2019-06-01 00:00:01, 2019-06-01 00:00:04, 2019-06-01 00:00:06, 2019-06-01 00:00:06, 2019-06-0…
## $ e_time         <dttm> 2019-06-01 00:05:31, 2019-06-01 00:13:55, 2019-06-01 00:06:26, 2019-06-01 00:19:22, 2019-06-0…
## $ s_station_id   <fct> 3602, 3054, 229, 3771, 441, 3236, 3129, 3467, 379, 467, 2000, 532, 3016, 3285, 3125, 3429, 223…
## $ s_station_name <fct> 31 Ave & 34 St, Greene Ave & Throop Ave, Great Jones St, McKibbin St & Bogart St, E 52 St & 2 …
## $ s_lat          <dbl> 40.76315, 40.68949, 40.72743, 40.70624, 40.75601, 40.75898, 40.75110, 40.72495, 40.74916, 40.6…
## $ s_long         <dbl> -73.92083, -73.94206, -73.99379, -73.93387, -73.96742, -73.99380, -73.94074, -74.00166, -73.99…
## $ e_station_id   <fct> 3570, 3781, 326, 3016, 3159, 495, 3560, 401, 2006, 3368, 366, 3712, 3764, 540, 3221, 157, 450,…
## $ e_station_name <fct> 35 Ave & 37 St, Greene Av & Myrtle Av, E 11 St & 1 Ave, Kent Ave & N 7 St, W 67 St & Broadway,…
## $ e_lat          <dbl> 40.75573, 40.69857, 40.72954, 40.72037, 40.77493, 40.76270, 40.75488, 40.72020, 40.76591, 40.6…
## $ e_long         <dbl> -73.92366, -73.91888, -73.98427, -73.96165, -73.98267, -73.99301, -73.93433, -73.98998, -73.97…
## $ bike_id        <int> 20348, 34007, 20587, 33762, 31290, 25137, 25648, 26972, 32969, 33539, 19770, 16708, 25711, 270…
## $ user_type      <fct> Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber…
## $ birth_year     <int> 1992, 1987, 1990, 1987, 1973, 1989, 1995, 1990, 1970, 1980, 1974, 1993, 1993, 1965, 1989, 1969…
## $ gender         <fct> 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0…

`str` - structure of the dataset

str(CB)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2125370 obs. of  15 variables:
##  $ trip_duration : num  330 830 380 1155 1055 ...
##  $ s_time        : POSIXct, format: "2019-06-01 00:00:01" "2019-06-01 00:00:04" "2019-06-01 00:00:06" "2019-06-01 00:00:06" ...
##  $ e_time        : POSIXct, format: "2019-06-01 00:05:31" "2019-06-01 00:13:55" "2019-06-01 00:06:26" "2019-06-01 00:19:22" ...
##  $ s_station_id  : Factor w/ 793 levels "3602","3054",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ s_station_name: Factor w/ 793 levels "31 Ave & 34 St",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ s_lat         : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ s_long        : num  -73.9 -73.9 -74 -73.9 -74 ...
##  $ e_station_id  : Factor w/ 806 levels "3570","3781",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ e_station_name: Factor w/ 806 levels "35 Ave & 37 St",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ e_lat         : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ e_long        : num  -73.9 -73.9 -74 -74 -74 ...
##  $ bike_id       : int  20348 34007 20587 33762 31290 25137 25648 26972 32969 33539 ...
##  $ user_type     : Factor w/ 2 levels "Subscriber","Customer": 1 1 1 1 1 1 1 1 2 1 ...
##  $ birth_year    : int  1992 1987 1990 1987 1973 1989 1995 1990 1970 1980 ...
##  $ gender        : Factor w/ 3 levels "1","2","0": 1 2 2 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_duration = col_double(),
##   ..   s_time = col_datetime(format = ""),
##   ..   e_time = col_datetime(format = ""),
##   ..   s_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   s_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   s_lat = col_double(),
##   ..   s_long = col_double(),
##   ..   e_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   e_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   e_lat = col_double(),
##   ..   e_long = col_double(),
##   ..   bike_id = col_integer(),
##   ..   user_type = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   birth_year = col_integer(),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )

Summary of the dataset

summary(CB)

##  trip_duration         s_time                        e_time                     s_station_id    
##  Min.   :     61   Min.   :2019-06-01 00:00:01   Min.   :2019-06-01 00:02:16   519    :  14778  
##  1st Qu.:    395   1st Qu.:2019-06-07 18:46:00   1st Qu.:2019-06-07 19:02:42   514    :  13512  
##  Median :    683   Median :2019-06-15 14:17:56   Median :2019-06-15 14:40:21   426    :  13410  
##  Mean   :   1109   Mean   :2019-06-15 21:45:53   Mean   :2019-06-15 22:04:23   402    :  12191  
##  3rd Qu.:   1199   3rd Qu.:2019-06-23 18:48:22   3rd Qu.:2019-06-23 19:09:09   3255   :  11973  
##  Max.   :3379585   Max.   :2019-06-30 23:59:54   Max.   :2019-07-15 03:53:26   499    :  11680  
##                                                                                (Other):2047826  
##                s_station_name        s_lat           s_long        e_station_id                   e_station_name   
##  Pershing Square North:  14778   Min.   :40.66   Min.   :-74.03   519    :  14696   Pershing Square North:  14696  
##  12 Ave & W 40 St     :  13512   1st Qu.:40.72   1st Qu.:-74.00   426    :  14163   West St & Chambers St:  14163  
##  West St & Chambers St:  13410   Median :40.74   Median :-73.99   514    :  13614   12 Ave & W 40 St     :  13614  
##  Broadway & E 22 St   :  12191   Mean   :40.74   Mean   :-73.98   402    :  12639   Broadway & E 22 St   :  12639  
##  8 Ave & W 31 St      :  11973   3rd Qu.:40.76   3rd Qu.:-73.97   3255   :  12055   8 Ave & W 31 St      :  12055  
##  Broadway & W 60 St   :  11680   Max.   :40.81   Max.   :-73.91   499    :  11310   Broadway & W 60 St   :  11310  
##  (Other)              :2047826                                    (Other):2046893   (Other)              :2046893  
##      e_lat           e_long          bike_id           user_type         birth_year   gender     
##  Min.   :40.66   Min.   :-74.07   Min.   :14529   Subscriber:1752526   Min.   :1885   1:1399108  
##  1st Qu.:40.72   1st Qu.:-74.00   1st Qu.:20961   Customer  : 372844   1st Qu.:1969   2: 527577  
##  Median :40.74   Median :-73.99   Median :28996                        Median :1983   0: 198685  
##  Mean   :40.74   Mean   :-73.98   Mean   :27614                        Mean   :1980              
##  3rd Qu.:40.76   3rd Qu.:-73.97   3rd Qu.:33016                        3rd Qu.:1990              
##  Max.   :40.81   Max.   :-73.91   Max.   :39934                        Max.   :2003              
##

2. Data Preparation

Check for missing values

Let’s check whether any variables have missing values, i.e., values which are NULL or NA.

## [1] "Number of columns with missing values =  0"

## [1] "Names of columns with missing values =  "

In this data (just a single month), there are no missing items.

Look for unusual values / outliers

Examine trip_duration

The trip_duration is specified in seconds, but there seem to be some outliers which may be incorrect, as the value for Max is quite high: 3379585 seconds, or 39.1155671 days. We can assume that this data is bad, as nobody would willingly rent a bicycle for this period of time, given the fees that would be charged.

Histogram of log(trip_duration)

hist(log(CB$trip_duration),col='lightblue', breaks=100)

It may be easier to think of trip duration in other units (i.e., minutes, hours, or days) rather than in seconds, so lets create such variables. Also, let’s confirm that the value shown (in seconds) is consistent with the difference between the start time and the end time:

#express trip duration in seconds, minutes, hours, days

CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
summary(CB$trip_duration_h)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0169   0.1098   0.1900   0.3083   0.3333 938.7739

sum(CB$trip_duration_h>6)

## [1] 2024

Let’s assume that nobody would rent a bicycle for more than a specified timelimit, and drop the records which exceed this:

total_rows=dim(CB)[1]
print(paste("Initial number of trips: ", total_rows))

## [1] "Initial number of trips:  2125370"

# choose only trips that were at most 3 hrs, as longer trips may reflect an error
# remove long trips from the data set -- something may be wrong (e.g., the system failed to properly record the return of a bike)
longtripthreshold_s = 60 * 60 *3  # 10800 seconds = 180 minutes = 2 hours
longtripthreshold_m = longtripthreshold_s / 60
longtripthreshold_h = longtripthreshold_m / 60

long_trips <- CB %>% filter(trip_duration_s > longtripthreshold_s)
num_long_trips_removed = dim(long_trips)[1]
pct_long_trips_removed = round(100*num_long_trips_removed / total_rows, 3)

CB <- CB %>% filter(trip_duration <= longtripthreshold_s)
reduced_rows = dim(CB)[1]

print(paste0("Removed ", num_long_trips_removed, " trips (", pct_long_trips_removed, "%) of longer than ", longtripthreshold_h, " hours."))

## [1] "Removed 4003 trips (0.188%) of longer than 3 hours."

print(paste0("Remaining number of trips: ", reduced_rows))

## [1] "Remaining number of trips: 2121369"

par(mfrow=c(1,2))
hist(CB$trip_duration_m, col="lightgreen",  xlab="Trip duration, in minutes")
hist(log(CB$trip_duration_m), col="lightgreen",  xlab="log(Trip duration, in minutes)")

Examine birth year

The birth year for some users is as old as 1885, which is not possible.

summary(CB$birth_year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1885    1969    1983    1980    1990    2003

hist(CB$birth_year, col="lightgreen")

Remove trips associated with very old users

# Deduce age from trip date and birth year
#library(lubridate) #loaded above
CB$age <- year(CB$s_time) - CB$birth_year

par(mfrow=c(1,2))
hist(CB$age, col="lightblue",  xlab="User Age, inferred from birth year")
hist(log(CB$age), col="lightblue",  xlab="log(User Age, inferred from birth year)")

# choose only trips where the user was born after a certain year,  as older users may reflect an error
age_threshhold = 90


aged_trips <- CB %>% filter(age > age_threshhold)
num_aged_trips_removed = dim(aged_trips)[1]
pct_aged_trips_removed = round(100*num_aged_trips_removed / total_rows, 3)

CB <- CB %>% filter(age <= age_threshhold)
reduced_rows = dim(CB)[1]

print(paste0("Removed ", num_aged_trips_removed, " trips (", pct_aged_trips_removed, "%) of users older than ", age_threshhold, " years."))

## [1] "Removed 1052 trips (0.049%) of users older than 90 years."

print(paste0("Remaining number of trips: ", reduced_rows))

## [1] "Remaining number of trips: 2120317"

par(mfrow=c(1,2))
hist(CB$age, col="lightgreen",  xlab="User Age, inferred from birth year")
hist(log(CB$age), col="lightgreen",  xlab="log(User Age, inferred from birth year)")

Compute distance between start and end stations

This is straight-line distance – it doesn’t incorporate an actual route.
There are services (e.g., from Google) which can compute and measure a recommended bicycle route between points, but use of such services requires a subscription and incurs a cost.

# Compute the distance between start and end stations
s_lat_long <- CB %>% select(c(s_lat,s_long)) %>%  as.matrix
e_lat_long <- CB %>% select(c(e_lat,e_long)) %>%  as.matrix
#library(sp) # loaded above
CB$distance_km <- spDists(s_lat_long, e_lat_long, longlat=T, diagonal = TRUE)

# There is a time-based usage fee for rides longer than an initial period.
# For user_type=Subscriber, the fee is $2.50 per 15 minutes following an initial free 45 minutes.
# For user_type=Customer, the fee is $4.00 per 15 minutes following an initial free 30 minutes.


CB$trip_fee[CB$user_type=="Subscriber"] <- 2.50 * (ceiling(CB$trip_duration_m[CB$user_type=="Subscriber"]  / 15)-3)  # first 45 minutes are free

## Warning: Unknown or uninitialised column: 'trip_fee'.

CB$trip_fee[CB$user_type=="Customer"]   <- 4.00 * (ceiling(CB$trip_duration_m[CB$user_type=="Customer"]  / 15)-2)  # first 30 minutes are free
CB$trip_fee[CB$trip_fee<0] <- 0   # fee is non-negative

Make a smaller dataset, numeric, without multicollinearities, for correlation calculations

# extract selected fields
CBlite  <- select(CB, c(trip_duration, trip_fee, distance_km, 
                        s_station_id, s_lat, s_long,
                        e_station_id, e_lat, e_long,
                        user_type, gender, age))

#make numeric variables
CBlite$user_type <- as.integer(CBlite$user_type)
CBlite$gender <- as.integer(CBlite$gender)

# function to revert factor back to its numeric levels
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

CBlite$s_station_id <- as.numeric.factor(CBlite$s_station_id)
CBlite$e_station_id <- as.numeric.factor(CBlite$e_station_id)

compute correlations

#library(Hmisc) #loaded above
res2<-rcorr(as.matrix(CBlite))
respearson=rcorr(as.matrix(CBlite),type = "pearson")
resspearman=rcorr(as.matrix(CBlite),type = "spearman")
res3 <- cor(as.matrix(CBlite))

Citibike Data Analysis

Data 621: Final Project - Group 4

Sachid Deshmukh

Michael Yampol

Vishal Arora

Ann Liu-Ferrara

12/13/2019