# Manipulating the raw data requires the following packages
library(dplyr)
library(tidyr)
library(lubridate)

# Load data
# The data was downloaded from the site and stored locally in the directory ./Data
# It was downloaded month by month, one CSV file per month, each renamed to 2014mm_citibike_tripdata.csv

tb_01 <- read.csv(file="./Data/201401_citibike_tripdata.csv",head=TRUE,sep=",")
tb_02 <- read.csv(file="./Data/201402_citibike_tripdata.csv",head=TRUE,sep=",")
tb_03 <- read.csv(file="./Data/201403_citibike_tripdata.csv",head=TRUE,sep=",")
tb_04 <- read.csv(file="./Data/201404_citibike_tripdata.csv",head=TRUE,sep=",")
tb_05 <- read.csv(file="./Data/201405_citibike_tripdata.csv",head=TRUE,sep=",")
tb_06 <- read.csv(file="./Data/201406_citibike_tripdata.csv",head=TRUE,sep=",")
tb_07 <- read.csv(file="./Data/201407_citibike_tripdata.csv",head=TRUE,sep=",")
tb_08 <- read.csv(file="./Data/201408_citibike_tripdata.csv",head=TRUE,sep=",")
tb_09 <- read.csv(file="./Data/201409_citibike_tripdata.csv",head=TRUE,sep=",")
tb_10 <- read.csv(file="./Data/201410_citibike_tripdata.csv",head=TRUE,sep=",")
tb_11 <- read.csv(file="./Data/201411_citibike_tripdata.csv",head=TRUE,sep=",")
tb_12 <- read.csv(file="./Data/201412_citibike_tripdata.csv",head=TRUE,sep=",")

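# Append tables 01-12 (each month listed exactly once)

my_list <- list(tb_01, tb_02, tb_03, tb_04, tb_05, tb_06, tb_07, tb_08, tb_09, tb_10, tb_11, tb_12)

# rbind.fill() comes from plyr, which is not attached above, so it is called by namespace
tb_2014 <- plyr::rbind.fill(my_list)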
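
The twelve near-identical `read.csv` calls and the hand-written list are easy to get wrong (a month listed twice or skipped). A minimal sketch of a loop-based alternative follows. It writes two tiny made-up files to a temporary directory so that it runs on its own; `data_dir`, `tb_demo`, and the demo values are assumptions for illustration, and `do.call(rbind, ...)` assumes every file has identical columns.

```r
# Hypothetical demo directory with two tiny files standing in for the real downloads
data_dir <- file.path(tempdir(), "citibike_demo")
dir.create(data_dir, showWarnings = FALSE)
for (m in c("01", "02")) {
  demo <- data.frame(tripduration = c(300, 600),
                     usertype = c("Subscriber", "Customer"))
  write.csv(demo,
            file.path(data_dir, paste0("2014", m, "_citibike_tripdata.csv")),
            row.names = FALSE)
}

# Build the file names from month numbers so no month is repeated or skipped
months <- sprintf("%02d", 1:2)  # use 1:12 for the full year
files <- file.path(data_dir, paste0("2014", months, "_citibike_tripdata.csv"))

# Read every file, then stack the resulting data frames
monthly <- lapply(files, read.csv, header = TRUE, stringsAsFactors = FALSE)
tb_demo <- do.call(rbind, monthly)  # requires identical columns in every file
nrow(tb_demo)
```

With the real data, `data_dir` would point at `./Data` and `months` would cover `1:12`.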

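# Write a .csv file for reproducible research
# write.csv() always uses a comma separator and ignores sep (with a warning), so sep is omitted
# row.names = FALSE keeps a spurious row-index column out of the file

write.csv(tb_2014, file = "./Data/2014_citibike_tripdata.csv", row.names = FALSE)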


# Please note that the file could not be uploaded to my GitHub repository because of a memory problem, even after zipping it.

Research question

The paid bike-sharing system in NYC as a means of transportation for commuters.
What percentage of rides take place on weekdays, between 7:00 AM and 10:00 AM, starting from a station near a commuter transportation hub?

Since 2013, New York City (NYC) has had a paid bike-sharing system, “Citibike”. Riders can rent a bike at one of the docking stations located throughout the city and return it to another docking station. There are two main forms of payment: “pay as you go”, meaning payment per ride, and “Annual Subscription”, meaning a flat yearly fee with unlimited rides. There is a time limit on how long a bike can be in use per ride: 30 minutes for non-subscribers and 45 minutes for subscribers. Financial penalties apply when a ride exceeds these limits.
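
The time limits above translate into a simple check on trip duration, which the trip files record in seconds. The sketch below uses a small made-up data frame (the values are illustrative, not taken from the 2014 files) and assumes the limits are exactly 30 and 45 minutes as stated above.

```r
# Made-up rides for illustration; column names follow the trip files
rides <- data.frame(
  tripduration = c(600, 1900, 2500, 2800),  # seconds
  usertype     = c("Subscriber", "Customer", "Subscriber", "Subscriber"),
  stringsAsFactors = FALSE
)

# Time limit in seconds: 45 minutes for subscribers, 30 minutes otherwise
limit_sec <- ifelse(rides$usertype == "Subscriber", 45 * 60, 30 * 60)

# TRUE when the ride ran past its limit and would incur a penalty
rides$overtime <- rides$tripduration > limit_sec
rides
```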

The analysis will be done on 2014 data.

Cases

The raw data is a record of every ride in the system for the year 2014 that meets the following criteria:

* the ride lasted more than 1 minute
* the ride began at a publicly available station (which excludes trips that originate at the operator's depots for rebalancing or maintenance purposes)

The data includes the following fields:

* Trip Duration (seconds)
* Start Time and Date
* Stop Time and Date
* Start Station Name
* End Station Name
* Station ID
* Station Lat/Long
* Bike ID
* User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
* Gender (Zero=unknown; 1=male; 2=female)
* Year of Birth
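
Because gender is stored as a numeric code, it can help to turn it into a labelled factor before summarizing. A minimal sketch with made-up values follows; `gender_code` is a hypothetical vector standing in for the `gender` column.

```r
# Made-up codes for illustration: 0 = unknown, 1 = male, 2 = female
gender_code <- c(1, 2, 0, 1)

# Attach readable labels to the numeric codes
gender <- factor(gender_code,
                 levels = c(0, 1, 2),
                 labels = c("unknown", "male", "female"))
table(gender)
```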

For the consolidated data for 2014, there are 8046287 rows.

Data collection

The data was collected by the operator of the system. The data is captured by the docking stations and centralized.

Type of study

This is an observational study: the data records the actual experience of the riders who use the bike-sharing system.

Data Source

The raw data can be found at:
[http://www.citibikenyc.com/system-data]

Response

Number of rides (numerical, discrete)

Explanatory

day of the week (categorical), start time (numerical)

Relevant summary statistics


# Separate date and time from starttime
tb_2014_2 <- separate(tb_2014, starttime, c("startdate", "starttime2"), sep = " ")

# Extract the day of week
tb_2014_2$dayofweek <- wday(as.Date(tb_2014_2$startdate))

# Keep only a subset of the columns, and only rides by subscribers
tb_2014_2 <- tb_2014_2 %>%
  select(tripduration, startdate, starttime2, start.station.id, start.station.name,
         usertype, birth.year, gender, dayofweek) %>%
  filter(usertype == "Subscriber")

# Group the raw data by day of week and start station, then summarize each group
# as.numeric() avoids integer overflow when summing durations in seconds
tb_2014_3 <- tb_2014_2 %>%
  group_by(dayofweek, start.station.id) %>%
  summarize(mean_duration = mean(tripduration),
            total_duration = sum(as.numeric(tripduration)),
            rides = n())

head(tb_2014_3)

summary(tb_2014_3)

For entries dated “09/01/2014”, the date does not follow the same format as the other rows, so the wday function populated dayofweek with NA. This will need to be addressed prior to analysis.
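
One possible way to handle the mixed date formats is to parse with one format first and fall back to a second format where the first fails. The sketch below uses base R and made-up dates; it assumes the other rows use the ISO `YYYY-MM-DD` form and that the problem rows are month/day/year, which should be checked against the actual files.

```r
# Made-up start dates mixing the two assumed formats
startdate <- c("2014-08-31", "9/1/2014", "09/01/2014", "2014-09-02")

# Try the ISO format first; rows in the other format come back as NA
parsed <- as.Date(startdate, format = "%Y-%m-%d")

# Fall back to month/day/year for the rows that failed
needs_fallback <- is.na(parsed)
parsed[needs_fallback] <- as.Date(startdate[needs_fallback], format = "%m/%d/%Y")

# Day of week numbered 1 (Sunday) to 7 (Saturday)
dayofweek <- as.POSIXlt(parsed)$wday + 1
data.frame(startdate, parsed, dayofweek)
```

Any row that matches neither format remains NA, so counting `is.na(parsed)` afterwards shows whether a third format is present.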

Also, with lubridate's default settings, wday numbers the days starting from Sunday, so dayofweek is coded as follows: 1 - Sunday, 2 - Monday, 3 - Tuesday, 4 - Wednesday, 5 - Thursday, 6 - Friday, 7 - Saturday
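
With this coding, weekdays are the values 2 through 6, and the percentage asked for in the research question can be sketched as below. The rides are made up, `hub_ids` is a hypothetical placeholder for the IDs of stations near a commuter hub, and the denominator is assumed to be all rides in the data frame; the research question could also be read with a different denominator.

```r
# Made-up rides; dayofweek uses the coding 1 = Sunday ... 7 = Saturday
rides <- data.frame(
  dayofweek        = c(2, 2, 3, 7, 1, 5),
  starttime2       = c("07:15:00", "12:30:00", "09:59:59",
                       "08:00:00", "09:00:00", "10:00:00"),
  start.station.id = c(519, 519, 72, 519, 72, 519),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder for stations near a commuter hub (an assumption)
hub_ids <- c(519)

# Hour of day is everything before the first colon
start_hour <- as.integer(sub(":.*$", "", rides$starttime2))

is_weekday <- rides$dayofweek %in% 2:6
is_morning <- start_hour >= 7 & start_hour < 10  # 7:00 AM up to, not including, 10:00 AM
is_hub     <- rides$start.station.id %in% hub_ids

# Share of all rides that meet the three conditions, as a percentage
pct_commute <- 100 * mean(is_weekday & is_morning & is_hub)
pct_commute
```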