DATA 606 Data Project Proposal: Capital Bikeshare Data Set

Data Preparation

### The following packages are used to manipulate the raw data.
library(dplyr)
library(tidyr)
library(lubridate)
library(knitr)

# Load data
# I downloaded the data from the website and stored it on my local PC in the directory C:\Dataproject.
# To explore the research question, we will select only the month of October 2018. October 2018 was selected with the aim of providing the most up-to-date and representative data.
# I downloaded the data month by month, one CSV file per month, and renamed each file to 2018mm_capitalbikeshare_tripdata.csv.

tb_01 <- read.csv(file="C:/201801_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_02 <- read.csv(file="C:/201802_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_03 <- read.csv(file="C:/201803_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_04 <- read.csv(file="C:/201804_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_05 <- read.csv(file="C:/201805_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_06 <- read.csv(file="C:/201806_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_07 <- read.csv(file="C:/201807_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_08 <- read.csv(file="C:/201808_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_09 <- read.csv(file="C:/201809_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_10 <- read.csv(file="C:/201810_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_11 <- read.csv(file="C:/201811_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
tb_12 <- read.csv(file="C:/201812_capitalbikeshare_tripdata.csv", header=TRUE, sep=",")
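
The later code refers to a combined table named tb_2018, which the steps above do not create. The following is a minimal sketch of one way it could be built. It assumes the months of interest are October to December and uses small toy tables with hypothetical column names in place of the real monthly files.

```r
# Toy stand-ins for three monthly tables; the real ones come from read.csv().
# The column names are assumptions, chosen to match the later code.
tb_10 <- data.frame(tripduration = c(300, 620), usertype = c("member", "casual"))
tb_11 <- data.frame(tripduration = c(450), usertype = c("member"))
tb_12 <- data.frame(tripduration = c(900, 75), usertype = c("casual", "member"))

# Stack the monthly tables row-wise into a single data frame.
# dplyr::bind_rows(tb_10, tb_11, tb_12) is an equivalent alternative.
tb_2018 <- do.call(rbind, list(tb_10, tb_11, tb_12))

# One row per ride across all selected months
nrow(tb_2018)
```

Row-binding only works cleanly when every monthly file has the same columns, so it may be worth comparing names() across the tables first.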

Research question

Capital Bikeshare (also abbreviated CaBi) is a bicycle sharing system that serves Washington, D.C.; Arlington County, Virginia; the city of Alexandria, Virginia; Montgomery County, Maryland; and Fairfax County, Virginia. It has more than 500 stations and 4,300 bicycles, all owned by these local governments and operated in a public-private partnership with Motivate International. Opened in September 2010, the system was the largest bike sharing service in the United States until New York City’s Citi Bike began operations in May 2013. There are two categories of riders: members, who have a subscription, and casual riders, who ride without a subscription.

In this research I am interested in whether there is a relationship between the type of rider and the duration of rides. In addition, I will investigate the impact of weekends versus weekdays on ride duration. Apart from the rider type, the data set contains no additional information about riders, such as age or gender. My analysis will therefore focus only on the type of rider and its impact on ride duration. Because of the bulk of the data set, I will use only the data for October, November and December of 2018.
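
As a rough illustration of the comparison described above, the sketch below computes mean ride duration by rider type and by day type. The data are made-up toy values and the column names (duration_min, usertype, rideday) are assumptions for illustration, not results from the Capital Bikeshare data.

```r
# Toy data with made-up values, only to show the shape of the comparison
rides <- data.frame(
  duration_min = c(12, 8, 35, 50, 10, 42),
  usertype     = c("member", "member", "casual", "casual", "member", "casual"),
  rideday      = c(1, 0, 1, 0, 1, 0)  # 1 = weekday, 0 = weekend or holiday
)

# Mean ride duration for each rider type
mean_by_type <- tapply(rides$duration_min, rides$usertype, mean)

# Mean ride duration for each combination of rider type and day type
mean_by_type_day <- aggregate(duration_min ~ usertype + rideday,
                              data = rides, FUN = mean)

print(mean_by_type)
print(mean_by_type_day)
```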

Cases

The raw data is a record of every ride in the system (for the months of October, November and December 2018) with the following characteristic:

* a duration of at least 60 seconds

Capital Bikeshare has processed this data to remove trips taken by staff as they service and inspect the system, trips taken to or from any of its “test” stations at its warehouses, and any trips lasting less than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it is secure).
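
A quick sanity check of the 60-second rule could look like the sketch below. The durations are toy values and the vector name duration_sec is an assumption standing in for the Duration field.

```r
# Toy durations in seconds (made-up values standing in for the Duration field)
duration_sec <- c(61, 300, 1250, 60, 4800)

# TRUE when no trip is shorter than 60 seconds
all_at_least_one_minute <- all(duration_sec >= 60)

# Number of trips that would violate the rule
n_short <- sum(duration_sec < 60)

print(all_at_least_one_minute)
print(n_short)
```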

The data includes the following fields:
* Duration - Duration of trip
* Start Date - Includes start date and time
* End Date - Includes end date and time
* Start Station - Includes starting station name and number
* End Station - Includes ending station name and number
* Bike Number - Includes ID number of bike used for the trip

For the purpose of our analysis, we will subset the data to User Type = “Member”.

We will derive a new field, rideday, that records whether the ride occurred on a weekday or on a weekend/holiday: 1 = Weekday, 0 = Weekend or Holiday. A second field, dayofweek, is coded as follows: 1 - Sunday, 2 - Monday, 3 - Tuesday, 4 - Wednesday, 5 - Thursday, 6 - Friday, 7 - Saturday.
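
The derivation of dayofweek and rideday can be sketched as follows. This is a minimal base R version under stated assumptions: the variable name startdate mirrors the later code, and the holiday list is hypothetical (Columbus Day 2018 is used purely as an illustration). With the default settings, lubridate's wday() gives the same 1 = Sunday numbering.

```r
# Example start dates from October 2018
startdate <- as.Date(c("2018-10-05", "2018-10-06", "2018-10-07", "2018-10-08"))

# Day of week coded 1 = Sunday ... 7 = Saturday.
# as.POSIXlt()$wday is 0 = Sunday ... 6 = Saturday, hence the + 1.
dayofweek <- as.POSIXlt(startdate)$wday + 1

# Hypothetical holiday list; a real analysis would need a complete one
holidays <- as.Date("2018-10-08")

# rideday: 1 = weekday, 0 = weekend or holiday
rideday <- ifelse(dayofweek %in% c(1, 7) | startdate %in% holidays, 0, 1)

print(data.frame(startdate, dayofweek, rideday))
```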

I have also decided to convert the ride duration from seconds to minutes, rounded to the nearest whole minute, and to update the data set accordingly.
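
A minimal sketch of the conversion, using toy durations and assumed variable names (duration_sec, duration_min):

```r
# Toy ride durations in seconds
duration_sec <- c(61, 90, 119, 150, 3600)

# Convert to minutes and round to the nearest whole minute
duration_min <- round(duration_sec / 60)

print(duration_min)
```

Note that R's round() sends exact halves to the nearest even number, so round(2.5) gives 2. If halves should always round up, floor(duration_sec / 60 + 0.5) is one alternative.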

Data collection

The data was collected by Capital Bikeshare, which also publishes real-time system data in General Bikeshare Feed Specification format.

Type of study

This is an observational study.

Data Source

The raw data can be found at:
[https://www.capitalbikeshare.com/system-data]

Response

The response variable is the ride duration in minutes (numerical).

Variables

weekday category (categorical): the weekday category (rideday in the data set) is a derived variable (see the R section above) that indicates whether the ride took place on a weekday or a weekend day.

Relevant summary statistics

```{r}
# tb_2018 is assumed to hold the combined monthly trip data.
# Separate the date and the time parts of starttime
tb_2018_2 <- separate(tb_2018, starttime, c("startdate", "starttime2"), sep = " ")

# Extract the day of week (1 = Sunday, ..., 7 = Saturday)
tb_2018_2$dayofweek <- wday(as.Date(tb_2018_2$startdate))

# Select only a subset of the columns and keep member rides
tb_2018_2 <- tb_2018_2 %>%
  select(tripduration, startdate, starttime2, start.station.id,
         start.station.name, usertype, dayofweek) %>%
  filter(usertype == "member")

# Group the raw data by day of week and start station
tb_2018_3 <- tb_2018_2 %>%
  group_by(dayofweek, start.station.id) %>%
  summarize(mean_duration = mean(tripduration),
            total_duration = sum(as.numeric(tripduration)),
            rides = n())

head(tb_2018_3)

summary(tb_2018_3)
```