1. Abstract

Bicycling is an activity which yields many benefits: Riders improve their health through exercise, while traffic congestion is reduced if riders move out of cars, with a corresponding reduction in pollution from carbon emissions [5]. In recent years, Bike Sharing has become popular in a growing list of cities around the world. The NYC “Citibike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013, and has been expanding ever since, both as measured by daily ridership as well as the expanding geographic footprint incorporating a growing number of “docking stations” [9-10] as the system welcomes riders in Brooklyn, Queens, and northern parts of Manhattan which were not previously served. One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system necessitates that more bikes be purchased and put into service to accommodate them. Heavy ridership induces wear on the bikes, requiring for more frequent repairs. However, an increase in the number of trips does not necessarily translate to an increase in revenue because riders who are clever can avoid paying surcharges by keeping the length of each trip below a specified limit (either 30 or 45 minutes, depending on user category.) We seek to examine Citibike ridership data [2-3], joined with daily NYC weather data [4], to study the impact of weather on shared bike usage and generate a predictive model which can estimate the number of trips that would be taken on each day [6-8]. The goal is to estimate future demand which would enable the system operator to make expansion plans. Our finding is that ridership exhibits strong seasonality, with correlation to weather-related variables such as daily temperature and precipitation [6-8]. Additionally, ridership is segmented by user type (annual subscribers use the system much more heavily than casual users), gender (there are many more male users than female) and age (a large number of users are clustered in their late 30s). [11-12]

Keywords

Bikeshare, Weather, Cycling, CitiBike, New York City

2. Introduction

Since 2013 a shared bicycle system known as Citibike has been available in New York City [1]. The benefits to having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling [5]. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of Citibike’ s pricing constraints. There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership to make good investments. The available usage data provides a wealth of information which can be mined to seek trends in usage 2-3]. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.

Because of weather, ridership is expected to be lower during the winter months, and on foul-weather days during the rest of the year, compared to a warm and sunny summer day. Using the weather data, we can seek to model the relationship between bicycle ridership and fair/foul or hot/cold weather [6-8].
What are the differences in rental patterns between annual members (presumably, residents) vs. casual users (presumably, tourists?) [12]
Is there any significant relationship between the age and/or gender of the bicycle renter vs. the rental patterns? [12]
What is the seasonal component of bikeshare pattern exhibited by consumers (weekday vs weekend) [11]

4. Data Collection

We obtained data from two sources:

CitiBike trip dataset

CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.

For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)

There are currently 77 monthly data files for the New York City bikeshare system, spanning July 2013 through November 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips this past summer. The total number of entries was more than 90 million, resulting in 17GB of data. Because of the computational limitations which this presented, we created samples of 1/1000 and 1/100 of the data. The samples were created deterministically, by subsetting the files on each 1000th (or, 100th) row.

Central Park daily weather data

Also we obtained historical weather information for 2013-2019 from the NCDC (National Climatic Data Center) by submitting an online request to https://www.ncdc.noaa.gov/cdo-web/search . Although the weather may vary slightly within New York City, we opted to use just the data associated with the Central Park observations as proxy for the entire city’s weather.

We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.

5. Load Data

Load citibike data

load(file='DATA/CB.RData')
city_bike_df = as.data.frame(CB)
head(city_bike_df)

##   trip_duration              s_time              e_time s_station_id          s_station_name    s_lat    s_long
## 1           634 2013-06-30 21:00:00 2013-06-30 21:10:34          164         E 47 St & 2 Ave 40.75323 -73.97033
## 2           437 2013-07-01 03:54:02 2013-07-01 04:01:19          479         9 Ave & W 45 St 40.76019 -73.99126
## 3          1398 2013-07-01 05:03:38 2013-07-01 05:26:56          157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4          1124 2013-07-01 05:37:40 2013-07-01 05:56:24          496         E 16 St & 5 Ave 40.73726 -73.99239
## 5          1199 2013-07-01 06:16:59 2013-07-01 06:36:58          432       E 7 St & Avenue A 40.72622 -73.98380
## 6           221 2013-07-01 08:50:21 2013-07-01 08:54:02          475     E 16 St & Irving Pl 40.73524 -73.98759
##   e_station_id          e_station_name    e_lat    e_long bike_id  user_type birth_year gender
## 1          504         1 Ave & E 15 St 40.73222 -73.98166   16950   Customer         NA      0
## 2          243 Fulton St & Rockwell Pl 40.68798 -73.97847   16151 Subscriber       1987      1
## 3          375 Mercer St & Bleecker St 40.72679 -73.99695   15997 Subscriber       1987      1
## 4          500      Broadway & W 51 St 40.76229 -73.98336   17750 Subscriber       1959      2
## 5          466         W 25 St & 6 Ave 40.74395 -73.99145   17671 Subscriber       1983      2
## 6          537 Lexington Ave & E 24 St 40.74026 -73.98409   16490 Subscriber       1956      1

nrow(city_bike_df)

## [1] 92565

ncol(city_bike_df)

## [1] 15

Load Weather data

# Weather data is obtained from the  NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool  https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station: 
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"


weatherfilenames=list.files(path="./",pattern = '.csv$', full.names = T)    # ending with .csv ; not .zip
#weatherfilenames
weatherfile <- "DATA/NYC_Weather_Data_2013-2019.csv"

## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
  STATION = col_character(),
  NAME = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = "%F"),          #  readr::parse_datetime() :   "%F" = "%Y-%m-%d"
  #DATE = col_date(format = "%m/%d/%Y"), #col_date(format = "%F")
  AWND = col_double(),                     # Average Daily Wind Speed
  AWND_ATTRIBUTES = col_character(),
  PGTM = col_double(),                    # Peak Wind-Gust Time
  PGTM_ATTRIBUTES = col_character(),
  PRCP = col_double(),                    # Amount of Precipitation
  PRCP_ATTRIBUTES = col_character(),
  SNOW = col_double(),                    # Amount of Snowfall
  SNOW_ATTRIBUTES = col_character(),
  SNWD = col_double(),                    # Depth of snow on the ground
  SNWD_ATTRIBUTES = col_character(),
  TAVG = col_double(),                    # Average Temperature (not populated)
  TAVG_ATTRIBUTES = col_character(),
  TMAX = col_double(),                    # Maximum temperature for the day
  TMAX_ATTRIBUTES = col_character(),
  TMIN = col_double(),                    # Minimum temperature for the day
  TMIN_ATTRIBUTES = col_character(),
  TSUN = col_double(),                    # Daily Total Sunshine (not populated)
  TSUN_ATTRIBUTES = col_character(),
  WDF2 = col_double(),                    # Direction of fastest 2-minute wind
  WDF2_ATTRIBUTES = col_character(),
  WDF5 = col_double(),                    # Direction of fastest 5-second wind
  WDF5_ATTRIBUTES = col_character(),
  WSF2 = col_double(),                    # Fastest 2-minute wind speed
  WSF2_ATTRIBUTES = col_character(),
  WSF5 = col_double(),                    # fastest 5-second wind speed
  WSF5_ATTRIBUTES = col_character(),
  WT01 = col_double(),                    # Fog
  WT01_ATTRIBUTES = col_character(),
  WT02 = col_double(),                    # Heavy Fog
  WT02_ATTRIBUTES = col_character(),
  WT03 = col_double(),                    # Thunder
  WT03_ATTRIBUTES = col_character(),
  WT04 = col_double(),                    # Sleet
  WT04_ATTRIBUTES = col_character(),
  WT06 = col_double(),                    # Glaze
  WT06_ATTRIBUTES = col_character(),
  WT08 = col_double(),                    # Smoke or haze
  WT08_ATTRIBUTES = col_character(),
  WT13 = col_double(),                    # Mist
  WT13_ATTRIBUTES = col_character(),
  WT14 = col_double(),                    # Drizzle
  WT14_ATTRIBUTES = col_character(),
  WT16 = col_double(),                    # Rain
  WT16_ATTRIBUTES = col_character(),
  WT18 = col_double(),                    # Snow      
  WT18_ATTRIBUTES = col_character(),
  WT19 = col_double(),                    # Unknown source of precipitation
  WT19_ATTRIBUTES = col_character(),
  WT22 = col_double(),                    # Ice fog
  WT22_ATTRIBUTES = col_character()
)

# load all the daily weather data
weather <- read_csv(weatherfile, col_types = weatherspec)
weather_df1 = as.data.frame(weather)

# Check the number of rows and columns in weather data frame
nrow(weather_df1)

## [1] 2556

ncol(weather_df1)

## [1] 28

# Select only those columns that are useful for our analysis
weather_df = select(weather_df1, STATION, NAME, DATE, AWND, PRCP, SNOW, SNWD, TMAX, TMIN, WDF2, WDF5, WSF2, WSF5, WT01)

# Check how many columns have empty values
sapply(weather_df, function(x) sum(is.na(x)))

## STATION    NAME    DATE    AWND    PRCP    SNOW    SNWD    TMAX    TMIN    WDF2    WDF5    WSF2    WSF5    WT01 
##       0       0       0     166       0       1       0       0       0     163     179     163     179    1703

# Perform Data Impuation on weather_df, replace empty/blank values with mean values
weather_df$AWND[is.na(weather_df$AWND)] = mean(weather_df1$AWND, na.rm=TRUE)
weather_df$SNOW[is.na(weather_df$SNOW)] = mean(weather_df1$SNOW, na.rm=TRUE)
weather_df$WDF2[is.na(weather_df$WDF2)] = mean(weather_df1$WDF2, na.rm=TRUE)
weather_df$WDF5[is.na(weather_df$WDF5)] = mean(weather_df1$WDF5, na.rm=TRUE)
weather_df$WSF2[is.na(weather_df$WSF2)] = mean(weather_df1$WSF2, na.rm=TRUE)
weather_df$WSF5[is.na(weather_df$WSF5)] = mean(weather_df1$WSF5, na.rm=TRUE)

# Again, check if the imputation removed all empty/blank values with mean values
sapply(weather_df, function(x) sum(is.na(x)))

## STATION    NAME    DATE    AWND    PRCP    SNOW    SNWD    TMAX    TMIN    WDF2    WDF5    WSF2    WSF5    WT01 
##       0       0       0       0       0       0       0       0       0       0       0       0       0    1703

# City bike data number of rows and columns
c(nrow(city_bike_df), ncol(city_bike_df))

## [1] 92565    15

# Weather data number of rows and columns
c(nrow(weather_df), ncol(weather_df))

## [1] 2556   14

# Check the column names of city_bike_df and weather_df
colnames(city_bike_df)

##  [1] "trip_duration"  "s_time"         "e_time"         "s_station_id"   "s_station_name" "s_lat"         
##  [7] "s_long"         "e_station_id"   "e_station_name" "e_lat"          "e_long"         "bike_id"       
## [13] "user_type"      "birth_year"     "gender"

colnames(weather_df)

##  [1] "STATION" "NAME"    "DATE"    "AWND"    "PRCP"    "SNOW"    "SNWD"    "TMAX"    "TMIN"    "WDF2"    "WDF5"   
## [12] "WSF2"    "WSF5"    "WT01"

# Display head of city_bike_df and weather_df
head(city_bike_df)

##   trip_duration              s_time              e_time s_station_id          s_station_name    s_lat    s_long
## 1           634 2013-06-30 21:00:00 2013-06-30 21:10:34          164         E 47 St & 2 Ave 40.75323 -73.97033
## 2           437 2013-07-01 03:54:02 2013-07-01 04:01:19          479         9 Ave & W 45 St 40.76019 -73.99126
## 3          1398 2013-07-01 05:03:38 2013-07-01 05:26:56          157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4          1124 2013-07-01 05:37:40 2013-07-01 05:56:24          496         E 16 St & 5 Ave 40.73726 -73.99239
## 5          1199 2013-07-01 06:16:59 2013-07-01 06:36:58          432       E 7 St & Avenue A 40.72622 -73.98380
## 6           221 2013-07-01 08:50:21 2013-07-01 08:54:02          475     E 16 St & Irving Pl 40.73524 -73.98759
##   e_station_id          e_station_name    e_lat    e_long bike_id  user_type birth_year gender
## 1          504         1 Ave & E 15 St 40.73222 -73.98166   16950   Customer         NA      0
## 2          243 Fulton St & Rockwell Pl 40.68798 -73.97847   16151 Subscriber       1987      1
## 3          375 Mercer St & Bleecker St 40.72679 -73.99695   15997 Subscriber       1987      1
## 4          500      Broadway & W 51 St 40.76229 -73.98336   17750 Subscriber       1959      2
## 5          466         W 25 St & 6 Ave 40.74395 -73.99145   17671 Subscriber       1983      2
## 6          537 Lexington Ave & E 24 St 40.74026 -73.98409   16490 Subscriber       1956      1

head(weather_df)

##       STATION                        NAME       DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 1 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-01 6.93    0    0    0   40   26  310  300 15.0 25.9   NA
## 2 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-02 5.82    0    0    0   33   22  310  340 15.0 21.9   NA
## 3 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-03 4.47    0    0    0   32   24  260  260 13.0 19.9   NA
## 4 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-04 8.05    0    0    0   37   30  290  250 17.9 28.0   NA
## 5 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-05 6.71    0    0    0   42   32  310  310 17.0 25.9   NA
## 6 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-06 6.71    0    0    0   46   34  290  270 13.0 19.9   NA

6. Exploratory Data Analysis

Examine Trip Duration variable

Summary of Trip durations before truncation
	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
supplied_secs	61.0000000	373.0000000	618.0000000	906.8727165	1061.0000000	1688083.0000
calc_secs	60.0000000	373.7270000	618.6000001	907.3520470	1061.9250000	1688083.0000
calc_mins	1.0000000	6.2287833	10.3100000	15.1225341	17.6987500	28134.7167
calc_hours	0.0166667	0.1038131	0.1718333	0.2520422	0.2949792	468.9119
calc_days	0.0006944	0.0043255	0.0071597	0.0105018	0.0122908	19.5380

The above indicates that the duration of the trips (in seconds) includes values in the millions – which likely reflects a trip which failed to be properly closed out.

Delete cases with unreasonable trip_duration values Let’s assume that nobody would rent a bicycle for more than a specified timelimit (say, 3 hours), and drop any records which exceed this:

## [1] "Removed 158 trips (0.171%) of longer than 3 hours."

## [1] "Remaining number of trips: 92407"

Examine birth_year variable

Other inconsistencies concern the collection of birth_year, from which we can infer the age of the participant. There are some months in which this value is omitted, while there are other months in which all values are populated. However, there are a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does have such anomalies. Thus, a substantial amount of time was needed for detecting and cleaning such inconsistencies.

The birth year for some users is as old as

summary(CB$birth_year)["Min."]

## Min. 
## 1885

which is not possible:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1885    1969    1981    1978    1988    2003    5828

Remove trips associated with very old users (age>90) (Also: remove trips associated with missing birth_year)

## [1] "Removed 41 trips (0.044%) of users older than 90 years."

## [1] "Removed 5828 trips (6.296%) of users where age is unknown (birth_year unspecified)."

## [1] "Remaining number of trips: 86538"

Compute distance between start and end stations

This is straight-line distance between (longitude,latitude) points – it doesn’t incorporate an actual bicycle route.

There are services (e.g., from Google) which can compute and measure a recommended bicycle route between points, but use of such services requires a subscription and incurs a cost.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5885  1.0477  1.3090  1.7280 10.6603

In this subset of the data, the maximum distance between stations is

maxdistance

##     Max. 
## 10.66028

km. In the data there are some stations for which the latitude and longitude are zero, which suggests that the distance between such a station and an actual station is many thousands of miles. If such items exist, we will delete them:

Delete unusually long distances

## [1] "No unusually long distances were found in this subset of the data."

Compute usage fee

There is a time-based usage fee for rides longer than an initial period:

For user_type=Subscriber, the fee is $2.50 per 15 minutes following an initial free 45 minutes per ride.
For user_type=Customer, the fee is $4.00 per 15 minutes following an initial free 30 minutes per ride.
There are some cases where the user type is not specified (we have relabeled as “UNKNOWN”, and we do note estimate usage fees for such trips.)

#### Summary of trip durations AFTER censoring/truncation:

#express trip duration in seconds, minutes, hours, days
# note: we needed to fix the November daylight savings problem to eliminate negative trip times

#### Supplied seconds
#print("Supplied Seconds:")
supplied_secs<-summary(CB$trip_duration)

#### Seconds
CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
calc_secs<-summary(CB$trip_duration_s)

#### Minutes
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
calc_mins<-summary(CB$trip_duration_m)

#### Hours
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
calc_hours<-summary(CB$trip_duration_h)

#### Days
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
calc_days <-summary(CB$trip_duration_d)

# library(kableExtra) # loaded above
rbind(supplied_secs, calc_secs, calc_mins, calc_hours, calc_days) %>% 
  kable(caption = "Summary of trip durations - AFTER truncations:") %>%
  kable_styling(c("bordered","striped"),latex_options =  "hold_position")

Summary of trip durations - AFTER truncations:
	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
supplied_secs	61.0000000	363.0000000	592.0000000	775.7573898	994.0000000	10617.0000000
calc_secs	60.0000000	363.0000000	593.0000000	776.2454997	994.7922499	10617.5160000
calc_mins	1.0000000	6.0500000	9.8833333	12.9374250	16.5798708	176.9586000
calc_hours	0.0166667	0.1008333	0.1647222	0.2156237	0.2763312	2.9493100
calc_days	0.0006944	0.0042014	0.0068634	0.0089843	0.0115138	0.1228879

We could have chosen to censor the data, in which case we would not drop observations, but would instead move them to a limiting value, such as three hours (for trip time) or an age of 90 years (for adjusting birth_year).

As there were few such cases, we instead decided to truncate the data by dropping such observations from the dataset.

Limitations and Challenges in uploading and analyzing this data

Data Size

Because there is so much data, it is difficult to analyze the entire universe of trip-by-trip data unless one has high-performance computational resources.

Data formatting inconsistencies from month to month:

Data column names change slightly from month to month.
In some months, CitiBike specifies dates as YYYY-MM-DD, while in other months, dates are MM/DD/YYYY .
In certain months, the timestamps include HH:MM:SS (as well as fractional seconds) while in other months, timestamps only include HH:MM , as seconds are omitted entirely.
We encountered an unusual quirk which manifests itself just once a year, on the first Sunday of November, when clocks are rolled back an hour as Daylight Savings time changes to Standard time:
- The files do not specify whether a timestamp is EDT or EST. On any other date, this is not a problem, but the hour of 1am-2am EDT on that November Sunday is followed by an hour 1am-2am EST.
- If someone rents a bike at, say, 1:55am EDT (before the time change) and then returns it 15 minutes later, the time is now 1:10am (EST).
- The difference in time timestamps suggests that the rental was negative 45 minutes, which is of course impossible!
Sometimes there is an unusually long interval between the start time of a bicycle rental and the time at which the system registers such rental as having concluded.

Correlations of individual trip data features

We can examine the correlations between variables to understand the relationship between variables, and also to help be alert to potential problems of multicollinearity. Here we compute rank correlations (Pearson and Spearman) as well as actual correlations between key variables. Here we compute the correlations between key variables on the individual CitiBike Trip data. (Later we will compute correlations on daily aggregated data which has been joined with the daily weather observations.)

Aggregate and join

Aggregate individual CitiBike trip data by day, and join to daily weather data

We will perform our calculations on an aggregated basis. We will group each day’s rides together, but we will segment by user_type (“Subscriber” or “Customer”) and by gender (“Male or”Female“). For each of these segments, there are some cases where the user_type is not specified, so we have designated that as”Unknown." For gender, there are cases where the CitiBike data set contains a zero, which indicates that the gender of the user was not recorded.

For each day, we will aggregate the following items across each of the above groupings:

mean trip_duration
median trip_duration
sum of distance_km
sum of trip_fee
mean of age
count of number of trips on that day

We will split the aggregated data into a training dataset, consisting of all (grouped, daily) aggregations from 2013-2018, and a test dataset, consisting of (grouped, daily) aggregations from 2019.

We will then join each aggregated CitiBike data element with the corresponding weather obserservation for that date.

#summary(CB)

##CB$user_type[is.na(CB$user_type)] <- "UNKNOWN"    ## should  not be necessary to do this
CB$gender <- recode_factor(CB$gender, '1' = "Male", '2' = "Female", '0' = "UNKNOWN")

# make the training data set
train <- CB %>% 
              mutate(start_date = as.Date(s_time, format="%Y-%m-%d"),
#                     user_type = as.character(user_type),
                     train = 1) %>%
              filter(start_date < '2019-01-01') %>%
              group_by(start_date, user_type, train, gender) %>%
              summarise(
                mean_duration = mean(trip_duration), 
                median_duration = median(trip_duration),
                sum_distance_km = sum(distance_km),
                sum_trip_fee = sum(trip_fee),
                avg_age = mean(age),
                trips = n()
              ) %>%
              ungroup()

train_rows = dim(train)[1]
#summary(train)

# make the test data set
test <- CB %>% 
              mutate(start_date = as.Date(s_time, format="%Y-%m-%d"),
#                     user_type = as.character(user_type),
                     train = 0) %>%
              filter(start_date >= '2019-01-01') %>%
              group_by(start_date, user_type, train, gender) %>%
              summarise(
                mean_duration = mean(trip_duration), 
                median_duration = median(trip_duration),
                sum_distance_km = sum(distance_km),
                sum_trip_fee = sum(trip_fee),
                avg_age = mean(age),
                trips = n()
              ) %>%
              ungroup()
test_rows = dim(test)[1]


# Join train with weather data (there should be no rows with missing values)
train_weather <- weather %>% inner_join(train, by = c("DATE" = "start_date" ))
#dim(train_weather)

# Join test with weather data (there should be no rows with missing values)
test_weather <- weather %>% inner_join(test, by = c("DATE" = "start_date" )) 

df.join = rbind(train_weather, test_weather)
tmp.df <- mice(
  df.join,
  m = 2,
  maxit = 10,
  meth = 'pmm',
  seed = 500,
  print = FALSE
)
df.join <- complete(tmp.df)
naCols <- names(which(sapply(df.join, anyNA)))
df.join = df.join[, (!names(df.join) %in% naCols)]
df.join.agg = df.join %>% dplyr::group_by(DATE, user_type, train, gender) %>% summarise(AWND = max(AWND), PRCP = max(PRCP), SNOW = max(SNOW), SNWD = max(SNWD), TMAX = max(TMAX), TMIN = max(TMIN), WDF2 = max(WDF2), WDF5 = max(WDF5), WSF2 = max(WSF2), WSF5 = max(WSF5), mean_duration = max(mean_duration), median_duration = max(median_duration), sum_distance_km = sum(sum_distance_km), sum_trip_fee = sum(sum_trip_fee), avg_age = max(avg_age), trips = sum(trips))
df.join.agg$user_type = as.character(df.join.agg$user_type)
df.join.agg$user_type[df.join.agg$user_type == "Subscriber"] = 2
df.join.agg$user_type[df.join.agg$user_type == "Customer"] = 1
df.join.agg$user_type[df.join.agg$user_type == "UNKNOWN"] = 0
df.join.agg$user_type = as.integer(df.join.agg$user_type)

df.join.agg$gender = as.character(df.join.agg$gender)
df.join.agg$gender[df.join.agg$gender == "UNKNOWN"] = 0
df.join.agg$gender[df.join.agg$gender == "Female"] = 1
df.join.agg$gender[df.join.agg$gender == "Male"] = 2
df.join.agg$gender = as.integer(df.join.agg$gender)

train = df.join.agg[df.join.agg$train ==1,]
test = df.join.agg[df.join.agg$train !=1,]

Number of records in train dataset

nrow(train)

## [1] 5202

Number ot rows in test dataset

nrow(test)

## [1] 1742

7. Model Building

TS Models

Convert training dataset into timeseries object

train.agg = train %>% dplyr::group_by(DATE) %>% summarise(trips = sum(trips))
train.ts = ts(train.agg$trips, start = c(2013,181), end = c(2018,365), frequency = 365)
test.agg = test %>% group_by(DATE) %>% summarise(trips = sum(trips))
test.ts = ts(test.agg$trips, start = c(2019,1), end = c(2019,365), frequency = 365)
train.ts = tsclean(train.ts)
test.ts = tsclean(test.ts)

Let’s plot the time series

autoplot(train.ts) +
  xlab('Day') +
  ylab('Trips') +
  ggtitle('CitiBike Daily trips')

From the timeseries plot we can clearly see annual seasonality. Bikeshare activity peaks in Summer and slows down in winter. Time seried also reveals increasing trend, where ridership count it increasing over the period of time

Let’s use STL decomposition to confirm our findings

res = stl(train.ts, s.window = 30)
plot(res)

We can see that STL plot above shows strong seasonal pattern. There is positive trend which indicates growth in the citibike bikeshare ridership over the period if time

Above is also a non-stationary time series. Let’s see what order differencing is required to make it stationary

print(paste('Order of seasonal difference required = ', nsdiffs(train.ts)))

## [1] "Order of seasonal difference required =  0"

print(paste('Order of difference required = ', ndiffs(train.ts)))

## [1] "Order of difference required =  1"

We can see that we need 0 order seasonal difference followed by first order regular difference to make this time series stationary

Let’s try following forecating models on this time series and assess which one is better in terms of RMSE. We will use time series cross validation function to calculate RMSE for eash model

Seasonal Naive Model
Random walk model
ETS model
ARIMA model

df.perf <<- NULL
calcRMSE = function(modelName, fcst, actuals, modelType)
{
  RMSE =  sqrt(sum((fcst - actuals) ^2)/length(fcst))
  print(paste('RMSE for model', modelName, '=', RMSE))
  df.perf <<- rbind(df.perf, data.frame(ModelName = modelName, RMSE = RMSE, ModelType = modelType))
  
}

Fit seasonal naive model and calc RMSE

snaive.model = snaive(train.ts, h=365)
fcst = forecast(snaive.model, h = 365)
df.fcst.snaive = data.frame(fcst)
df.fcst.snaive = df.fcst.snaive %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('snaive', df.fcst.snaive$fcst, test.ts, 'Time Series')

## [1] "RMSE for model snaive = 19.6318165022517"

Fit stl forecast method

stl.model = stlf(train.ts, h=365)
fcst = forecast(stl.model, h = 365)
df.fcst.stl = data.frame(fcst)
df.fcst.stl = df.fcst.stl %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('stl', df.fcst.stl$fcst, test.ts, 'Time Series')

## [1] "RMSE for model stl = 20.8565111276799"

Fit Random Walk model and calc RMSE

rwf.model =rwf(train.ts , h=365, drift = TRUE, lambda = BoxCox.lambda(train.ts))
fcst = forecast(rwf.model, h = 365, lambda =BoxCox.lambda(train.ts) )
df.fcst.rwf = data.frame(fcst)
df.fcst.rwf = df.fcst.rwf %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('rwf', df.fcst.rwf$fcst, test.ts, 'Time Series')

## [1] "RMSE for model rwf = 51.8626824385318"

Fit Seasonal ARIMA model and calc RMSE

func <- auto.arima(train.ts, seasonal = TRUE)
fcst = forecast(func, h = 365)
df.fcst.arima = data.frame(fcst)
df.fcst.arima = df.fcst.arima %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('auto arima', df.fcst.arima$fcst, test.ts, 'Time Series')

## [1] "RMSE for model auto arima = 45.8177058461619"

Fit ETS model and calc RMSE

func =  ets(test.ts, model="ZZZ")
fcst = forecast(func, h = 365)
df.fcst.ets = data.frame(fcst)
df.fcst.ets = df.fcst.ets %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('ETS', df.fcst.ets$fcst, test.ts, 'Time Series')

## [1] "RMSE for model ETS = 37.7541272011348"

Fit theta model and calc RMSE

func =  thetaf(train.ts, h= 365)
fcst = forecast(func, h = 365)
df.fcst.theta = data.frame(fcst)
df.fcst.theta = df.fcst.theta %>% dplyr::rename('fcst' = 'Point.Forecast')

calcRMSE('Theta', df.fcst.theta$fcst, test.ts, 'Time Series')

## [1] "RMSE for model Theta = 17.2872284591559"

We can see that theta model gives us best RMSE. Let’s analyze model details

fit = thetaf(train.ts, h=365)

Let’s analyze residuals

checkresiduals(fit)

## Warning in modeldf.default(object): Could not find appropriate degrees of freedom for this model.

Residual plots are not satisfactory. Small value of Ljung-Box test shows that residuals are correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed. Timeseries models alone are not very relevent for Citibike ridership count prediction

Let’s try model ensemble

df.cons = data.frame( cbind(df.fcst.theta$Hi.80, df.fcst.snaive$Hi.80, df.fcst.stl$Hi.80))
names(df.cons) = c("fcst1", "fcst2", "fcst3")
df.cons$fcst = (df.cons$fcst1 + df.cons$fcst2 + df.cons$fcst3) /3

calcRMSE('TS Ensemble', df.cons$fcst, test.ts, 'Time Series')

## [1] "RMSE for model TS Ensemble = 16.5124625909039"

Time Series Model Performance

knitr::kable(df.perf %>% filter(ModelType == 'Time Series') %>% arrange (RMSE))

ModelName	RMSE	ModelType
TS Ensemble	16.51246	Time Series
Theta	17.28723	Time Series
snaive	19.63182	Time Series
stl	20.85651	Time Series
ETS	37.75413	Time Series
auto arima	45.81771	Time Series
rwf	51.86268	Time Series

Even with timeseries ensemble model RMSE is not vastly improved. This indicates that timeseries models alone are not a good choice for forecasting Citibike ridership

Let’s plot the forecast

autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.cons$fcst, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Forecast - Citibike ridership") +
  ylab("Ride Count")

Linear Regression Models

Create training and test split on the data

colY = which(colnames(train) == "trips")

trainX <- train[,-colY]
trainY <- train[, colY]
testX <- test[,-colY]
testY <- test[, colY]

ctrl <- trainControl(method = "cv")

Fit Linear Model

lm.model <- train(trips ~ ., data = train,
                  method = "lm",
                 preProcess = c( "center", "scale"),
                 trControl = ctrl)
lm.pred <- predict(lm.model, newdata = testX)
df.pred = data.frame(pred = lm.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('lm', df.pred$pred, test.ts, 'Linear Model')

## [1] "RMSE for model lm = 5.86958506650074"

autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Forecast - Citibike ridership") +
  ylab("Ride Count")

Fit Partial Least Square model (PLS)

plsTune <- train(trips ~ .,
                 data = train,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:19),
                 trControl = ctrl)
pls.pred <- predict(plsTune, newdata = testX)
df.pred = data.frame(pred = pls.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('pls', df.pred$pred, test.ts, 'Linear Model')

## [1] "RMSE for model pls = 5.87525402411147"

Fit PCR (Principal Components Regression) model

pcrTune <- train(trips ~ .,
                 data = train,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:19),
                 trControl = ctrl)
pcr.pred <- predict(pcrTune, newdata = testX)
df.pred = data.frame(pred = pcr.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('pcr', df.pred$pred, test.ts, 'Linear Model')

## [1] "RMSE for model pcr = 5.86958506650087"

Importance of variables in PLS

plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 35, scales = list(y = list(cex = .95)))

Fit Penalized Models

Ridge Regression

ridgeGrid <- expand.grid(lambda = seq(0, .025, length = 11))
train.m = train[, -nearZeroVar(train)]
ridgeTune <- train(trips ~ .,
                   data = train.m,
                   method = "ridge",
                   #tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProc = c("BoxCox", "center", "scale")
                   )
ridge.pred <- predict(ridgeTune, newdata = testX)
df.pred = data.frame(pred = ridge.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('ridge', df.pred$pred, test.ts, 'Linear Model')

## [1] "RMSE for model ridge = 5.78166711055302"

update(plot(ridgeTune), xlab = "Penalty",main="Ridge results vs. lambda penalty")

Elasticnet

enetGrid <- expand.grid(lambda = c(seq(0,1,length=6)),
                        fraction = seq(.05, 1, length = 20))

enetTune <- train(trips ~ .,
                  data = train.m,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("BoxCox", "center", "scale")
                  )
enet.pred <- predict(enetTune, newdata = testX)
df.pred = data.frame(pred = enet.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('elastic net', df.pred$pred, test.ts, 'Linear Model')

## [1] "RMSE for model elastic net = 5.78143488861333"

Elasticnet Plot

plot(enetTune, main="ElasticNet", sub="Minimum RMSE occurs at lambda=0 and fraction=0.85")

Linear Model Performance

knitr::kable(df.perf %>% filter(ModelType == 'Linear Model') %>% arrange (RMSE))

ModelName	RMSE	ModelType
elastic net	5.781435	Linear Model
ridge	5.781667	Linear Model
lm	5.869585	Linear Model
pcr	5.869585	Linear Model
pls	5.875254	Linear Model

We can see that elastic net provides the best performance amongst linear models

Analyze residuals

checkresiduals(enetTune )

## Warning in modeldf.default(object): Could not find appropriate degrees of freedom for this model.

Residuals from Elastic Net linear model looks much better than timeseries models. Residuals appeared to be normally distributed and ACF plot shows little correlation amongst residuals which is not very significant.

Plot Forecast with Elastic Net model

df.pred = data.frame(pred = enet.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Forecast - Citibike ridership") +
  ylab("Ride Count")

#### Forecast with elastic net model is much more improved as compared to time series models

Non-Linear Models

Fit Neural Network

nnetGrid <- expand.grid(size = c(1:10),
                        decay = c(0, 0.01, 0.1))
nnetModel <- train(
  trips~ .,
  data = train,
  method = "nnet",
  tuneGrid = nnetGrid,
  trControl = ctrl,
  #preProcess = c("BoxCox", "center", "scale", "pca"),
  maxit = 500
)

#plot(nnetModel)
#varImp(nnetModel)

nnetPred <- predict(nnetModel, newdata = testX)
df.pred = data.frame(pred = nnetPred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('Neural Net', df.pred$pred, test.ts, 'Nonlinear Model')

Fit MARS: Multivariate Adaptive Regression Splines

library(earth)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:30)
marsModel <- train(
  trips ~ .,
  data = train,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = ctrl
)

marsPred <- predict(marsModel, newdata = testX)
df.pred = data.frame(pred = marsPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('MARS', df.pred$pred, test.ts, 'Nonlinear Model')

Fit SVM: Support Vector Machines

svmRadial

svmModel <- train(
  trips ~ .,
  data = train,
  method = "svmRadial",
  preProc = c("center", "scale"),
  #tuneLength = 14,
  trControl = ctrl
)
svmPred <- predict(svmModel, newdata = testX)
df.pred = data.frame(pred = svmPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('SVM_Radial', df.pred$pred, test.ts, 'Nonlinear Model')

## [1] "RMSE for model SVM_Radial = 8.93912270538394"

svmPoly

svmModel2 <- train(
  trips ~ .,
  data = train,
  method = "svmPoly",
  preProc = c("center", "scale"),
  #tuneLength = 14,
  trControl = ctrl
)
svmPred2 <- predict(svmModel2, newdata = testX)
df.pred = data.frame(pred = svmPred2, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('SVM_Poly', df.pred$pred, test.ts, 'Nonlinear Model')

## [1] "RMSE for model SVM_Poly = 4.46486569718514"

svmLinear

svmModel3 <- train(
  trips~ .,
  data =train,
  method = "svmLinear",
  preProc = c("center", "scale"),
  tuneGrid = data.frame(.C = c(.25, .5, 1)),
  trControl = ctrl
)

svmPred3 <- predict(svmModel3, newdata = testX)
df.pred = data.frame(pred = svmPred3, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('SVM_Linear', df.pred$pred, test.ts, 'Nonlinear Model')

## [1] "RMSE for model SVM_Linear = 6.00371767358531"

Fit KNN: K-Nearest Neighbors

knnModel <- train(
  trips ~ .,
  data = train,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneGrid = data.frame(.k = 1:10),
  trControl = ctrl
)

knnPred <- predict(knnModel, newdata = testX)
df.pred = data.frame(pred = knnPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('KNN', df.pred$pred, test.ts, 'Nonlinear Model')

## [1] "RMSE for model KNN = 7.90295893180088"

Nonlinear Model Performance

knitr::kable(df.perf %>% filter(ModelType == 'Nonlinear Model') %>% arrange (RMSE))

ModelName	RMSE	ModelType
MARS	4.434103	Nonlinear Model
SVM_Poly	4.464866	Nonlinear Model
SVM_Linear	6.003718	Nonlinear Model
KNN	7.902959	Nonlinear Model
SVM_Radial	8.939123	Nonlinear Model
Neural Net	55.240198	Nonlinear Model

We can see that MARS model provides the best performance amongst non linear models

Analyze residuals

checkresiduals(marsModel)

## Warning in modeldf.default(object): Could not find appropriate degrees of freedom for this model.

Residuals from MARS model looks much better than timeseries/linear models. Residuals appeared to be normally distributed and ACF plot shows little correlation amongst residuals which is not very significant.

Plot Forecast with MARS model

df.pred = data.frame(pred = marsPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Forecast - Citibike ridership") +
  ylab("Ride Count")

#### Forecast with MARS model is much more improved as compared to time series models/ linear models

Tree based ensemble models

Fit Random Forest

library(randomForest)
rf_model <- randomForest(trips ~ ., data = train, ntree = 300)
rf_predicted <- predict(rf_model, testX)
df.pred = data.frame(pred = rf_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('Random Forest', df.pred$pred, test.ts, 'Treebased models')

## [1] "RMSE for model Random Forest = 5.39192947837228"

Variable Importance - Random Forest

library(vip)
rfImp1 <- rf_model$importance 
vip(rf_model, color = 'red', fill='green') + 
  ggtitle('Random Forest Var Imp')

Boosted Trees

gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), n.trees = seq(100, 1000, by = 100), shrinkage = 0.1, n.minobsinnode = 5)
gbm_model <- train(trips ~ ., data = train, tuneGrid = gbmGrid, verbose = FALSE, method = 'gbm' )
gbm_predicted <- predict(gbm_model, testX)
df.pred = data.frame(pred = gbm_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('Boosted Trees', df.pred$pred, test.ts, "Treebased models")

## [1] "RMSE for model Boosted Trees = 4.43923091284317"

Treebased Model Performance

knitr::kable(df.perf %>% filter(ModelType == 'Treebased models') %>% arrange (RMSE))

ModelName	RMSE	ModelType
Boosted Trees	4.439231	Treebased models
Random Forest	5.391929	Treebased models

We can see that Boosted Trees model provides the best performance amongst Tree based models

Analyze residuals

checkresiduals(gbm_model)

## Warning in modeldf.default(object): Could not find appropriate degrees of freedom for this model.

Residuals from Boosted trees models looks most appropriate. Residuals are normally distributed. ACF plots doesn’t show any correlation

Plot Forecast with Boosted trees model

df.pred = data.frame(pred = gbm_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Forecast - Citibike ridership") +
  ylab("Ride Count")

Forecast with boosted trees model is much more improved as compared to time series models/ linear models/ non linear models

Model Performance

Let’s analyze overall model performance

knitr::kable(df.perf %>% arrange (RMSE))

ModelName	RMSE	ModelType
MARS	4.434103	Nonlinear Model
Boosted Trees	4.439231	Treebased models
SVM_Poly	4.464866	Nonlinear Model
Random Forest	5.391929	Treebased models
elastic net	5.781435	Linear Model
ridge	5.781667	Linear Model
lm	5.869585	Linear Model
pcr	5.869585	Linear Model
pls	5.875254	Linear Model
SVM_Linear	6.003718	Nonlinear Model
KNN	7.902959	Nonlinear Model
SVM_Radial	8.939123	Nonlinear Model
TS Ensemble	16.512463	Time Series
Theta	17.287229	Time Series
snaive	19.631816	Time Series
stl	20.856511	Time Series
ETS	37.754127	Time Series
auto arima	45.817706	Time Series
rwf	51.862682	Time Series
Neural Net	55.240198	Nonlinear Model

We can see the MARS model gives best performance amongst all the models

Analyze MARS model feature importance

Variable Importance - MARS

library(vip)
rfImp1 <- marsModel$importance 
vip(marsModel, color = 'red', fill='green') + 
  ggtitle('MARS Variable Importance')

8. Try Model Ensembles

Let’s combine top three models and evaluate performace

final.pred = (gbm_predicted + marsPred + svmPred2)/3

df.pred = data.frame(pred = final.pred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))

calcRMSE('Ensemble (gbm, mars, svm_poly)', df.pred$pred, test.ts, "Top Ensemble Models")

## [1] "RMSE for model Ensemble (gbm, mars, svm_poly) = 4.16579042936282"

Overall model perfomance with ensemble

knitr::kable(df.perf %>% arrange (RMSE))

ModelName	RMSE	ModelType
Ensemble (gbm, mars, svm_poly)	4.165790	Top Ensemble Models
MARS	4.434103	Nonlinear Model
Boosted Trees	4.439231	Treebased models
SVM_Poly	4.464866	Nonlinear Model
Random Forest	5.391929	Treebased models
elastic net	5.781435	Linear Model
ridge	5.781667	Linear Model
lm	5.869585	Linear Model
pcr	5.869585	Linear Model
pls	5.875254	Linear Model
SVM_Linear	6.003718	Nonlinear Model
KNN	7.902959	Nonlinear Model
SVM_Radial	8.939123	Nonlinear Model
TS Ensemble	16.512463	Time Series
Theta	17.287229	Time Series
snaive	19.631816	Time Series
stl	20.856511	Time Series
ETS	37.754127	Time Series
auto arima	45.817706	Time Series
rwf	51.862682	Time Series
Neural Net	55.240198	Nonlinear Model

The best performance can be obtained by making an ensemble of top three models

Boosted Trees
MARS
SVM Poly

Plot forecast with Ensemble models

df.pred = data.frame(pred = final.pred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
  autolayer(test.ts, series = 'Test Data') +
  autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
  ggtitle("Ensemble model Forecast - Citibike ridership") +
  ylab("Ride Count")

Forecast with Ensemble models looks much better than any other individual models

4. Conclusion

The Citybike ridership analysis determined in this paper have many implications. Analysis reveals interesting scientific facts which are statestically significant. Citibike ridership data shows a strong annual sesonality and increasing trend. Ridership count increases in summer and decreases in winter. Increasing trend in the ridership count from 2013 to 2019 shows increasing popularity of Citybike ridership service over the period of time. It is established that most the Citibike riders are young in their 30s and 40s. Citibike customers prefers to use ride share service for making shot distance trips < 2 KM. It is established with statestical significance that weather has a significant impact on Citibike ridfership pattern. Good weather has positive corrleation with Citibike ridership count.

For predicting future Citibike ridership count, statestical timeseries models alone are not sufficient. Based on RMSE and residual analysis, statestical timeseries models alone doesn’t do a better job of forecasting Citibike rideshare count. This is due to the fact that statestical timeseries models doesn’t leverage the rich feature set available for model optimization e.g. weather data. Based on RMSE Linear models do better job of forecasting Citibike ridership count than timeseries models. This is due to the fact that Linear models are multivariate and leverage rich input features for model optimization.

Nonlinear models further improves the RMSE and outerforms Linear models. The fact that Nonlinear models are better suited for CitiBike ridership dataset suggests that there exists a non linear relatioship between input features and predictor variables

The best model performance is derived from tree based models. Boosted tree model outperforms all other models and provides best performance in terms of RMSE. Residual analysis of boosted tree models looks more appropriate where residuals are perfectly normally distributed and ACF plot shows no correlation between residuals. The variable importance of Boosted trees models revals important infomration about what factors significantly imact the Citibike ridership pattern. From the Boosted Trees model variable importance plot we can conclude that following factors are statestically significant in infuencing Citibike ridership pattern

Distance between station
Gender
User Type (customers/ Subscribers)
Trip Duration
Day of week
Avg age of users
Temperature
Wind Speed

Above factors are key input for Citibike system operator for planning increase in the number of bicycles availability and expansion of the geographic footprint of docking stations.

It is imperative that for a complex business operation like Citibike which involves many factors (Internal and External) one model alone won’t be sufficient for ridership forecasting. When we tried Ensemble of top three models (Boosted Trees, MARS and SVM Poly) we got the best performace in terms of RMSE (7 % RMSE improvement). We can conclude that the best modelling strategy for predicting future ridership pattern for Citibike ridershare service is to use model ensembles. Winning model enselbles established in this paper will enable Citibike system operator to make informed business expansion plans resulting into increased ROI and long-term business profitability and viability of Citibike operations

DATA698-Capstone Project

Sachid Vijay Deshmukh, Ann Liu-Ferrara, Ahmed Sajjad

2/21/2020

1. Abstract

Keywords

2. Introduction

4. Data Collection

5. Load Data

Load citibike data

Load Weather data

6. Exploratory Data Analysis

Examine Trip Duration variable

Examine birth_year variable

Compute distance between start and end stations

Delete unusually long distances

Compute usage fee

Limitations and Challenges in uploading and analyzing this data

Data Size

Data formatting inconsistencies from month to month:

Correlations of individual trip data features

Aggregate and join

Number of records in train dataset

Number ot rows in test dataset

7. Model Building

TS Models

Convert training dataset into timeseries object

Let’s plot the time series

From the timeseries plot we can clearly see annual seasonality. Bikeshare activity peaks in Summer and slows down in winter. Time seried also reveals increasing trend, where ridership count it increasing over the period of time

Let’s use STL decomposition to confirm our findings

We can see that STL plot above shows strong seasonal pattern. There is positive trend which indicates growth in the citibike bikeshare ridership over the period if time

Above is also a non-stationary time series. Let’s see what order differencing is required to make it stationary

We can see that we need 0 order seasonal difference followed by first order regular difference to make this time series stationary

Let’s try following forecating models on this time series and assess which one is better in terms of RMSE. We will use time series cross validation function to calculate RMSE for eash model

Fit seasonal naive model and calc RMSE

Fit stl forecast method

Fit Random Walk model and calc RMSE

Fit Seasonal ARIMA model and calc RMSE

Fit ETS model and calc RMSE

Fit theta model and calc RMSE

We can see that theta model gives us best RMSE. Let’s analyze model details

Let’s analyze residuals

Residual plots are not satisfactory. Small value of Ljung-Box test shows that residuals are correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed. Timeseries models alone are not very relevent for Citibike ridership count prediction

Let’s try model ensemble

Time Series Model Performance

Even with timeseries ensemble model RMSE is not vastly improved. This indicates that timeseries models alone are not a good choice for forecasting Citibike ridership

Let’s plot the forecast

Linear Regression Models

Create training and test split on the data

Fit Linear Model

Fit Partial Least Square model (PLS)

Fit PCR (Principal Components Regression) model

Importance of variables in PLS

Fit Penalized Models

Ridge Regression

Elasticnet

Elasticnet Plot

Linear Model Performance

We can see that elastic net provides the best performance amongst linear models

Analyze residuals

Residuals from Elastic Net linear model looks much better than timeseries models. Residuals appeared to be normally distributed and ACF plot shows little correlation amongst residuals which is not very significant.

Plot Forecast with Elastic Net model

Non-Linear Models

Fit Neural Network

Fit MARS: Multivariate Adaptive Regression Splines

Fit SVM: Support Vector Machines

svmRadial

svmPoly

svmLinear

Fit KNN: K-Nearest Neighbors

Nonlinear Model Performance

We can see that MARS model provides the best performance amongst non linear models

Analyze residuals

Residuals from MARS model looks much better than timeseries/linear models. Residuals appeared to be normally distributed and ACF plot shows little correlation amongst residuals which is not very significant.

Plot Forecast with MARS model

Tree based ensemble models

Fit Random Forest

Variable Importance - Random Forest

Boosted Trees

Treebased Model Performance

We can see that Boosted Trees model provides the best performance amongst Tree based models