CUNY606 - Final Project

Part 1 - Introduction:

New York City is the most populous city in the U.S. In fact, it has more people than 40 states. In addition to New York City residents, a considerable number of people commute into the city from neighboring counties and states. New York has one of the largest commuter-adjusted daytime populations, and it also has one of the most developed public transportation systems in the world (it has the largest subway system). The New York City subway delivered an estimated 5.5 million rides on weekdays in 2013. Even so, the New York City commute is considered the worst in the U.S. The average commuting time for New York is 40 minutes, higher than the national average. As anyone who has ridden the 4/5/6 or the 2/3 subway lines (downtown direction) in the morning can attest, the public transportation system is congested at peak times, and using it can be a horrendous experience. Since 2013, an alternate commuting option has been offered in New York City: a paid bike sharing system, “Citibike”.

Riders can rent a bike at various docking stations throughout the city and return it to another docking station. There are 2 main forms of payment: “pay as you go”, meaning payment per ride, or “Annual Subscription”, meaning a flat fee for the year with unlimited rides and a higher time cap per ride. There is a time limit on how long the bike can be in use per ride: 30 minutes for non-subscribers and 45 minutes for subscribers. Financial penalties are applied when a ride exceeds these limits.

As one of the many commuters coming into the city from the neighboring state of New Jersey, I have experienced first-hand the difficulties and frustration of riding the New York City subway system. Also, as a parent working full time and a part-time student in a Master's program, I am finding that juggling the various demands of family, work, and class work leaves little time for anything else, such as exercise, hobbies, and sleep. The possibility of transforming one segment of my commute from a jarring, stressful experience into a healthier, pleasant alternative has high appeal. From the number of bikes missing from the docking stations around Penn Station between 8:00 and 9:30 AM, I would surmise that I am not the only commuter taking advantage of this opportunity. Since the entire data set for every ride logged to this system is available online, and as a student of Data Science, I felt intrigued and compelled to investigate further.

For this study, we are interested in exploring whether there is any relationship between the age of the rider and the ride duration. Furthermore, we would like to know whether the gender of the rider, or whether the ride takes place on a weekday vs. a weekend day, has an impact on the ride duration. Since the data only contains additional information such as birth year and gender for riders that are annual subscribers, we will limit our analysis to this subset. In addition, we will limit the data set to the month of October 2014.

Part 2 - Data:

In this study, we consider New York City bike sharing data. The details of every ride taken on a bike in the bike sharing system are recorded by the docking stations, cleansed, centralized, and made available to the public. Our particular focus for this study is all rides for the month of October 2014 taken by “subscribers” (riders that pay on a yearly basis and for whom additional demographic information is tagged with the ride information captured by the docking stations). This additional information, mainly gender and year of birth, is provided by the subscribers at the time of registration. At this point we do not have a reliable mechanism to validate this data or account for missing entries. This may cause outliers and missing data.

This collection represents our population of interest. In this study we would like to explore the possible relationship between the age of the rider and the ride duration in minutes; both are numerical. In addition, we would like to explore whether other variables, such as gender, day of week, or day category (whether the day of the ride is a weekday or a weekend day/holiday), are related to ride duration. These additional variables are categorical in nature. For the purpose of our study, the ride duration is the response variable and the other variables cited are considered explanatory variables. The study is observational: we are basing our analysis on actual observations. There was no interference in how the data came to be; the data collection is done automatically. We have the entire set of data available, representing every ride logged in the system; however, for practicality (the entire data set is quite large, > 4GB), we are limiting our study to one month's worth of data: October 2014.

Why October 2014? Since the system was launched in May 2013, we considered the first 6 months to be settling months, and we thought that the data collected in these months might not be very representative. Winter months may not be representative either, since ridership goes down in these months. In the summer months (June, July, and August), there might be an influx of riders due to enjoyable weather and an influx of tourists and visitors to New York City.

Since our analysis is based on only one month of data, and ridership is affected by various factors not included in the data (weather and outside temperatures, influx of visitors, lack of personal information for non-subscriber riders, …), the results should not be generalized to other months. Even generalization from a given month to the same month in another year should be done cautiously, since the weather may be different. In this case, since the entire population data is known, it would be easier to run any analysis on the entire population (if possible); however, on the entire population, the modeling would have to account for non-independence of observations.

Since the observations are not independent of each other (the same person most probably rides bikes and makes use of the system multiple times), we will pursue the analysis on a random sample of the population.

Since this study is observational and the design does not allow for random assignment of the observations to one group or another, no causal inference can be made.

variables list:

Name	Description
tripduration	length of ride in seconds
startdate	date of start of ride
startime	start time of ride
stoptime	date and time when ride ends
start.station.id	unique identifier for ride starting station
start.station.name	name of ride starting station
start.station.latitude	latitude of ride starting station
start.station.longitude	longitude of ride starting station
end.station.id	unique identifier for ride ending station
end.station.name	name of ride ending station
end.station.latitude	latitude of ride ending station
end.station.longitude	longitude of ride ending station
bikeid	unique identifier for every bike in system
usertype	type of user; “Customer”, “Subscriber”
birth.year	year of birth of “Subscriber” user
gender	gender of “Subscriber” user; 1=Male, 2=Female
dayofweek	day of week ride took place
ride.year	year ride started
rideday	type of day of ride, weekday = 1, weekend and Holidays = 0

Note: dayofweek is as follows: 1 - Sunday, 2 - Monday, 3 - Tuesday, 4 - Wednesday, 5 - Thursday, 6 - Friday, 7 - Saturday

Part 3 - Exploratory data analysis:

When running summary on the full data set (citibike) and looking at the variables of interest, we observed that some data is missing in the case of birth.year (and therefore age), and some values are highly unlikely in the case of tripduration (and therefore durationminute) and, again, birth.year (age).

(Figure: summary of data issues)

We will determine the best approach to address these issues before starting our analysis.

Missing values

We have 285 observations where the birth.year is missing or the gender has not been specified. Even though these observations belong to “Subscribers”, we will treat them as “Customer” and remove them from the data set.
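The removal described above can be sketched as follows. This is a minimal illustration on a made-up four-row data frame (`citibike_demo` is hypothetical), and it assumes that missing birth years were read in as `NA` and that an unspecified gender is coded 0, as in the provider's field list.

```r
# Hypothetical miniature of the trip data; column names follow the variables list.
citibike_demo <- data.frame(
  tripduration = c(420, 615, 980, 305),
  usertype     = c("Subscriber", "Subscriber", "Subscriber", "Subscriber"),
  birth.year   = c(1980, NA, 1975, 1990),  # assumption: missing values are NA
  gender       = c(1, 2, 0, 2),            # 0 = unknown, 1 = male, 2 = female
  stringsAsFactors = FALSE
)

# Flag rows where the birth year is missing or the gender is unknown.
incomplete <- is.na(citibike_demo$birth.year) | citibike_demo$gender == 0

# Count the affected rows, then keep only the complete ones.
n_incomplete   <- sum(incomplete)
citibike_clean <- citibike_demo[!incomplete, ]
```

On the real data, `n_incomplete` would be the count to compare against the 285 observations reported above.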

Improbable values

Trip duration has values that are highly improbable for a single ride. If a bike is not properly engaged upon its return to a docking station (as indicated by a green light), the system will not record the end of the ride and the fact that the bike has been returned.

For example, in the case where the trip duration was over 90277 minutes, it is highly probable that the bike was not engaged properly at the returning station. Both stations are in Brooklyn, and although there is no information on the path travelled, it is possible that the trip was simply from one station to the other. However, any interpretation of such data is highly suspect and speculative.

(Figure: code extract)

(Figure: map)

There is a monetary penalty for rides longer than 45 minutes for subscribers and 30 minutes for customers, so it is improbable that many rides will be much longer. The system host site indicates that, for distance calculation, trips are capped at a 2 hour limit, and we will use a similar approach.
We will filter out any trips longer than 2 hours.

The data also contains years of birth that are highly suspect; the maximum value of age is 115. This data comes from subscribers entering the information when registering in the system. These suspicious entries may be due to mistakes or to the deliberate entry of erroneous information. In fact, we have no guarantee that the year of birth or gender values are accurate. There are 338 observations with age >= 88.
We will consider these highly suspect and remove them from the data set.
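The two filters described above (trips longer than 2 hours and ages of 88 or more) might look like the sketch below. The six-row data frame `demo` is invented for illustration; only the thresholds come from the text.

```r
# Hypothetical rows: ride duration in minutes and derived rider age.
demo <- data.frame(
  durationminute = c(8, 12, 45, 121, 1505, 30),
  age            = c(34, 88, 27, 41, 52, 115)
)

max_minutes <- 120  # 2 hour cap on trip duration
min_suspect_age <- 88  # ages at or above this value are treated as suspect

# Keep rides within the cap that were taken by riders below the suspect age.
keep     <- demo$durationminute <= max_minutes & demo$age < min_suspect_age
filtered <- demo[keep, ]

# Number of rows dropped by the two rules combined.
n_dropped <- sum(!keep)
```

Applying both rules in one logical vector keeps the row count easy to audit, since `n_dropped` plus `nrow(filtered)` always equals the original number of rows.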

Exploratory Statistics

We will now take a closer look at the variable age by running summary and plotting a histogram.

summary(citibike$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   35.00   37.56   45.00   84.00

ggplot(citibike, aes(x=age)) + geom_histogram (binwidth = 3, fill = "lightblue")

From the histogram, we can observe that the distribution of the age variable is right skewed. There are outliers (age >= 75). The average rider age is 37.56.

summary(citibike$durationminute)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   10.00   11.95   15.00  120.00

ggplot(citibike, aes(x=durationminute)) + geom_histogram (binwidth = 1, fill = "lightblue")

From the histogram, we can observe that the distribution of the ride duration is right skewed. Even though the majority of rides are within 1 to 45 minutes, the presence of rides of up to 120 minutes causes a skew in the distribution. The average ride is 11.95 minutes.

Sample Selection

Since the observations are not independent of each other, we will select a random sample from the data set. We have 757505 observations in the data set. We will select a sample of size 1000, which is well below 10% of the total data set, to support the independence condition.
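A simple random sample of this kind can be drawn as in the sketch below. The seed value and the one-column stand-in `population` are assumptions made for illustration; only the two sizes come from the text.

```r
set.seed(606)  # arbitrary seed (an assumption) so that the draw is reproducible

n_total     <- 757505  # observations in the cleaned data set
sample_size <- 1000

# Fraction of the population being sampled; the condition asks for less than 10%.
sample_fraction <- sample_size / n_total

# Hypothetical stand-in for the cleaned data: one row per ride.
population <- data.frame(ride_id = seq_len(n_total))

# Draw row indices without replacement and subset the data frame.
rows      <- sample(nrow(population), size = sample_size, replace = FALSE)
samp_demo <- population[rows, , drop = FALSE]
```

Sampling without replacement guarantees that no ride appears twice, although several sampled rides may still belong to the same rider.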

We will re-run the previous summary function and graph on the sample:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   36.00   37.87   46.00   80.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   10.00   12.31   15.00  112.00

The average rider age for the sample is 37.87.

We will now explore the potential relationship between these 2 variables by examining their scatter plot.

From the scatter plot, it is not apparent that there is a linear relationship between these 2 variables. We will now explore whether gender introduces some differentiation. Again we will use a scatter plot and differentiate the points by color.

Again, it is not clear that there is a linear relationship between the 2 variables, and gender does not seem to introduce any differentiation.
We will now look at ride duration and type of day (weekday vs. weekend day or holiday); for this graph we will use a jitter plot.

From this graph, it is apparent that there are more rides on weekdays than on weekend days/holidays. We will test whether there is a statistically significant difference in the average ride duration based on the type of day.

Part 4 - Inference:

We will now consider the first two variables of interest: ride duration in minutes and age of rider. From the scatter plot, it is difficult to determine linearity. Let us consider the correlation between these 2 variables. We have selected a random sample with a size well within 10% of the original data set to support independence.

## 
##  Pearson's product-moment correlation
## 
## data:  samp$durationminute and samp$age
## t = 0.21503, df = 998, p-value = 0.8298
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05520983  0.06877061
## sample estimates:
##        cor 
## 0.00680655

## 
## Call:
## lm(formula = durationminute ~ age, data = samp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.470  -6.280  -2.291   2.747  99.725 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.107148   0.997250  12.141   <2e-16 ***
## age          0.005409   0.025155   0.215     0.83    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.325 on 998 degrees of freedom
## Multiple R-squared:  4.633e-05,  Adjusted R-squared:  -0.0009556 
## F-statistic: 0.04624 on 1 and 998 DF,  p-value: 0.8298

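The residual Q-Q plot clearly shows that the distribution of the residuals is not normal, so the conditions for linear regression are not met. In addition, the correlation is close to 0 and the slope for age is not significantly different from 0 (p-value = 0.83). It does not appear that there is a linear relationship between ride duration and age of rider.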
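The correlation test, the regression, and the residual check discussed above can be reproduced along the following lines. This is a sketch on simulated data: `samp_demo` and its distributions are assumptions and will not reproduce the figures shown, but the function calls match the printed output (`cor.test` and `lm(durationminute ~ age)`).

```r
set.seed(1)  # arbitrary seed (an assumption) for a reproducible simulation
n <- 1000

# Simulated stand-in for the sample: ages 16 to 80 and right-skewed durations.
samp_demo <- data.frame(
  age            = sample(16:80, n, replace = TRUE),
  durationminute = pmin(round(rexp(n, rate = 1 / 12)) + 1, 120)
)

# Pearson correlation test between ride duration and age.
ct <- cor.test(samp_demo$durationminute, samp_demo$age)

# Simple linear regression of duration on age.
fit     <- lm(durationminute ~ age, data = samp_demo)
slope_p <- summary(fit)$coefficients["age", "Pr(>|t|)"]

# Normal Q-Q coordinates of the residuals; plot.it = TRUE (the default) draws the plot.
res <- resid(fit)
qq  <- qqnorm(res, plot.it = FALSE)
```

In a simple regression with one predictor, the p-value for the slope equals the p-value of the Pearson correlation test, which is why both outputs above report 0.8298.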

Let us now look at the average ride duration and whether the ride took place on a weekday or a weekend day. We have selected a random sample of size 1000. For this sample there is some skewness in the ride duration variable; however, the sample is quite large, so we will proceed with the analysis. The 2 groups are not paired.

H0: There is no difference in the average ride duration based on the type of day the ride took place.

Ha: There is a difference in the average ride duration based on the type of day the ride took place.

## samp$rideday: 0
## [1] 12.90794
## -------------------------------------------------------- 
## samp$rideday: 1
## [1] 12.03796

## samp$rideday: 0
## [1] 8.945043
## -------------------------------------------------------- 
## samp$rideday: 1
## [1] 9.481516

## 
##  Welch Two Sample t-test
## 
## data:  weekend$durationminute and weekday$durationminute
## t = 1.4016, df = 643.44, p-value = 0.1615
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3488354  2.0887960
## sample estimates:
## mean of x mean of y 
##  12.90794  12.03796

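From the result of the test (p-value = 0.1615, which is greater than 0.05, and a 95% confidence interval of -0.35 to 2.09 that contains 0), we fail to reject H0. The difference in average ride duration between weekdays and weekend days is not statistically significant and could be due to random variation in the data.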
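The group summaries and the Welch test shown above can be obtained with calls like the ones in this sketch. The data frame `samp_demo` is simulated (an assumption), so its numbers will differ from the output above; the 5% significance level is likewise an assumed convention.

```r
set.seed(2)  # arbitrary seed (an assumption) for a reproducible simulation

# Simulated stand-in for the sample: rideday 0 = weekend/holiday, 1 = weekday.
samp_demo <- data.frame(
  rideday        = rep(c(0, 1), times = c(300, 700)),
  durationminute = pmin(round(rexp(1000, rate = 1 / 12)) + 1, 120)
)

# Mean and standard deviation of ride duration for each type of day.
group_means <- tapply(samp_demo$durationminute, samp_demo$rideday, mean)
group_sds   <- tapply(samp_demo$durationminute, samp_demo$rideday, sd)

weekend <- samp_demo[samp_demo$rideday == 0, ]
weekday <- samp_demo[samp_demo$rideday == 1, ]

# t.test() runs the Welch (unequal variance) test by default.
tt <- t.test(weekend$durationminute, weekday$durationminute)

# Decision rule at the assumed 5% level: reject H0 only if the p-value is below alpha.
alpha            <- 0.05
reject_h0        <- tt$p.value < alpha
ci_excludes_zero <- tt$conf.int[1] > 0 || tt$conf.int[2] < 0
```

For a two-sided test, `reject_h0` and `ci_excludes_zero` always agree, so the p-value and the confidence interval can be used to cross-check each other.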

Part 5 - Conclusion:

There does not seem to be a linear relationship between ride duration and the age of the rider. Comparing rides by type of day looked like a more promising avenue: in the sample, rides on weekends (or holidays) were on average slightly longer than rides on weekdays (12.91 vs. 12.04 minutes). However, this difference was not statistically significant (p-value = 0.1615), so we cannot conclude that it holds for the population. In addition, because the observations in our population of interest (month of October 2014) are not independent, any generalization should be made with caution. Additional analysis based on the day and time of rides may prove worthwhile. Also, as future analysis, it would be interesting to explore any geographical relationships among the various bike docking stations.

References:

http://gothamist.com/2015/03/18/ny_commute.php - NYC Worst commute in US

https://www.citibikenyc.com/ - Main site for NYC bike sharing system “citibike”

http://www.newgeography.com/content/004967-commuting-new-york - Commuting in NYC

https://en.wikipedia.org/wiki/Transportation_in_New_York_City - NYC Transportation page

http://www.citylab.com/commute/2013/05/most-important-population-statistic-hardly-ever-gets-talked-about/5747/

http://www.census.gov/hhes/commuting/data/daytimepop.html

Appendix (optional):

Appendix A

In this appendix, we outline the steps necessary to obtain the data from the hosting site and to derive the data set on which the analysis is done.

The data is hosted and can be found at the following site:
http://www.citibikenyc.com/system-data

The downloadable data files are in .zip format and can be found at:
https://s3.amazonaws.com/tripdata/index.html

If you would like to use any aspect of this analysis or the data contained therein, please consult the license agreement found below:
http://www.citibikenyc.com/data-sharing-policy

For the purpose of this analysis, we consider the data for the month of October 2014 (201410-citibike-tripdata.zip). The data was downloaded to a local machine and the following transformations were performed.

The original data only includes trips with the following characteristics:

with a duration of at least 1 minute (shorter trips are excluded)
that begin at publicly available stations (thereby excluding trips that originate at citibike depots for rebalancing or maintenance purposes)

The original data included the following fields:
* Trip Duration (seconds)
* Start Time and Date
* Stop Time and Date
* Start Station Name
* End Station Name
* Station ID
* Station Lat/Long
* Bike ID
* User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
* Gender (Zero=unknown; 1=male; 2=female)
* Year of Birth

After the data was obtained from the provider, it was further transformed for the purpose of this study. All the steps taken for this transformation can be found in the R chunk code in Appendix A and are summarized below:

subset the data to User Type = “Subscriber”
split date and time from startime
derive the age of rider, Age = Year of ride - Year of Birth
weekday/weekend day indicator, rideday 1 = Weekday, 0 = Weekend or Holiday
converted tripduration to minutes, durationminute = tripduration/60, rounded to nearest minute
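The steps listed above can be sketched in base R as follows. The three-row `raw` data frame, the raw column name `starttime`, the timestamp format, and the single holiday date (Columbus Day, 13 October 2014) are all assumptions made for illustration; the derived column names follow the variables list.

```r
# Hypothetical raw rows; the timestamp format is an assumption.
raw <- data.frame(
  tripduration = c(634, 1547, 289),
  starttime    = c("10/1/2014 08:15:42", "10/4/2014 14:02:10", "10/13/2014 09:30:00"),
  usertype     = c("Subscriber", "Customer", "Subscriber"),
  birth.year   = c(1979, NA, 1988),
  stringsAsFactors = FALSE
)

# 1. Subset the data to subscribers.
subs <- raw[raw$usertype == "Subscriber", ]

# 2. Split date and time (timestamps are treated as naive local times).
start          <- as.POSIXct(subs$starttime, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
subs$startdate <- as.Date(format(start, "%Y-%m-%d"))
subs$startime  <- format(start, "%H:%M:%S")

# 3. Age = year of ride - year of birth.
subs$ride.year <- as.integer(format(start, "%Y"))
subs$age       <- subs$ride.year - subs$birth.year

# 4. Day of week coded 1 = Sunday ... 7 = Saturday, then the weekday indicator.
subs$dayofweek <- as.POSIXlt(start)$wday + 1
holidays       <- as.Date("2014-10-13")  # assumption: Columbus Day treated as a holiday
is_day_off     <- subs$dayofweek %in% c(1, 7) | subs$startdate %in% holidays
subs$rideday   <- ifelse(is_day_off, 0, 1)

# 5. Trip duration in minutes, rounded to the nearest minute.
subs$durationminute <- round(subs$tripduration / 60)
```

Keeping the holiday dates in a separate vector makes it easy to reuse the same steps for another month.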

For the purpose of this analysis, the data was then saved as an RData object so that it could be loaded at the beginning of the analysis.
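Saving and reloading an RData object can be sketched as below. The object `citibike_demo` and the temporary file path are hypothetical; the actual file name used for the analysis is not given here.

```r
# Hypothetical two-row stand-in for the transformed data set.
citibike_demo <- data.frame(age = c(35, 26), durationminute = c(11, 5))

# Write the object to an .RData file in a temporary directory (path is an assumption).
rdata_path <- file.path(tempdir(), "citibike_demo.RData")
save(citibike_demo, file = rdata_path)

# Remove the object, then restore it from disk as the analysis script would.
rm(citibike_demo)
loaded_names <- load(rdata_path)  # load() returns the names of the restored objects
```

Because `load()` restores objects under their original names, the analysis code can refer to the data frame directly after loading.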

Appendix B

The following is a list of resources and articles that may be of interest:

New York City is committed to developing its biking infrastructure. To find out more, follow the links from the DOT (Department of Transportation):
http://www.nyc.gov/html/dot/html/bicyclists/bicyclists.shtml
http://www.nyc.gov/html/dot/downloads/pdf/nyc-protected-bike-lanes.pdf

There is a Google group of “private citizens” interested in exploring, developing, and working with New York City bike-related data feeds.
https://groups.google.com/forum/#!aboutgroup/citibike-hackers

Analysis of “citibike system” by Todd Schneider http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

http://www.r-bloggers.com/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
