DATA 606 Data Project Proposal

Data Preparation

# load data for August 2017
library('DATA606') 

cb_master <- read.csv(file="~/documents/201708-citibike-tripdata.csv", header=TRUE, sep = ",", as.is="birth.year")

head(cb_master)

# Subset the columns of interest

citibike <- cb_master[,c("tripduration", "starttime", "bikeid", "usertype","birth.year", "gender")]

# Remove non-subscribers, records with a gender code other than 1 or 2, and erroneous trips of 43,000 seconds (about 717 minutes) or longer

citibike <- citibike[(citibike$usertype %in% c("Subscriber") & citibike$gender %in% c("1", "2") & citibike$tripduration < 43000) ,]

# Rename gender labels (codes: 1 = Male, 2 = Female); explicit levels avoid relying on sort order
citibike$gender <- factor(citibike$gender, levels = c(1, 2), labels = c("Male", "Female"))

# Convert birth year to integer and tripduration to minutes instead of seconds

citibike$birth.year <- as.integer(citibike$birth.year)

## Warning: NAs introduced by coercion

citibike["tripduration.min"] <- citibike$tripduration/60

# Add a new variable with approximate age as of 2018 (the trips are from August 2017, so age at ride time is about one year less)
citibike["age"] <- (2018 - citibike$birth.year)

# Subset selected variables for analysis

cbdata <- citibike[which(citibike$age < 110),c("tripduration", "starttime", "bikeid","birth.year", "gender", "tripduration.min", "age")]

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for

Are age and gender predictive of speed and distance of a rider?

Cases

Each case represents one ride by a Citibike member. There are approximately 1.5 million observations in the data set.

Data collection

Describe the method of data collection.

The data is collected by Citibike NYC for each member ride from the pick-up and drop-off logs. Riders' age and gender are submitted by Citibike members when they open an account.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Data is collected by Citibike NYC and shared in .csv files on their website https://www.citibikenyc.com/system-data

Considering the large amount of data per month, I will perform analysis on one month per season for two years.
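
A minimal sketch of how one month per season over two years might be assembled, assuming the monthly files follow the `YYYYMM-citibike-tripdata.csv` naming used above. The months (February, May, August, November), the years 2016 and 2017, and the helper name `season_files` are illustrative assumptions, not the final sampling plan.

```r
# Hypothetical helper: build file paths for one month per season across several years.
# The months and years passed in below are placeholders, not a final selection.
season_files <- function(years, months, dir = "~/documents") {
  # expand.grid varies its first argument fastest, so listing months first
  # keeps each year's files together
  grid <- expand.grid(month = as.integer(months), year = as.integer(years))
  file.path(dir, sprintf("%d%02d-citibike-tripdata.csv", grid$year, grid$month))
}

paths <- season_files(years = c(2016, 2017), months = c(2, 5, 8, 11))

# Read only the files that are actually present, then stack them.
# rbind() needs identical column names across files; harmonise them first if they differ.
present <- paths[file.exists(paths)]
if (length(present) > 0) {
  cb_all <- do.call(rbind, lapply(present, read.csv, header = TRUE,
                                  stringsAsFactors = FALSE))
}
```

Building the paths separately from reading them makes it easy to check which months are missing before any large file is loaded.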

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variables are speed and distance. Both are quantitative.
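
Speed and distance are not columns in the subset built above, so they would need to be derived. One possible approach, sketched here, is the straight-line (haversine) distance between the start and end stations divided by trip duration. The function names `haversine_km` and `avg_speed_kmh` and the sample coordinates are hypothetical, and the sketch assumes the raw file supplies station latitude and longitude. Straight-line distance understates the route actually ridden, and a round trip to the same station would register as zero distance.

```r
# Great-circle distance in kilometres between two points given in decimal degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Average speed in km/h from distance (km) and duration (seconds)
avg_speed_kmh <- function(distance_km, duration_sec) {
  distance_km / (duration_sec / 3600)
}

# Illustrative coordinates only, not taken from the data set
d <- haversine_km(40.75, -73.99, 40.76, -73.98)
s <- avg_speed_kmh(d, duration_sec = 600)
```

Both functions are vectorised, so they could be applied to whole columns of a data frame in one call.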

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The independent variables are age, which is quantitative, and gender, which is categorical.
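
As a hedged illustration of how a numeric predictor and a two-level factor might enter a model together, the sketch below fits a linear regression on simulated data. The data frame `sim`, the coefficients used to generate it, and the use of trip duration as a stand-in response are assumptions for illustration; nothing here is a result from the Citibike data.

```r
set.seed(606)

# Simulated stand-in data: age is numeric, gender is a two-level factor
n <- 500
sim <- data.frame(
  age = sample(17:75, n, replace = TRUE),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female"))
)

# Made-up relationship, used only to generate a response to fit
sim$tripduration.min <- 10 + 0.05 * sim$age + 2 * (sim$gender == "Female") +
  rnorm(n, sd = 5)

# lm() fits a slope for age and expands gender into a dummy variable (genderFemale),
# with "Male" as the reference level
fit <- lm(tripduration.min ~ age + gender, data = sim)
coef(fit)
```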

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(cbdata$gender)

##    Male  Female 
## 1129259  403179

summary(cbdata$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   29.00   35.00   38.38   46.00  108.00

# Summary statistics for Female riders
summary(cbdata$tripduration.min[cbdata$gender %in% c("Female")])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017   6.967  11.400  14.640  19.020 715.900

# Summary statistics for Male riders
summary(cbdata$tripduration.min[cbdata$gender %in% c("Male")])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.017   5.867   9.600  12.660  16.200 715.100
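
The two per-gender summaries above can also be produced in one call. A minimal sketch using base R's `tapply()` on a small made-up data frame (`toy` is hypothetical; with the real data, `cbdata` would take its place):

```r
# Toy stand-in for cbdata with the same two columns used above
toy <- data.frame(
  gender = factor(c("Male", "Female", "Male", "Female", "Male"),
                  levels = c("Male", "Female")),
  tripduration.min = c(5, 12, 9, 20, 16)
)

# One summary() result per factor level, indexed by gender
by_gender <- tapply(toy$tripduration.min, toy$gender, summary)
by_gender[["Male"]]
by_gender[["Female"]]

# Medians only, as a named numeric vector
medians <- tapply(toy$tripduration.min, toy$gender, median)
```

Grouping by the factor keeps the two summaries in step if the filtering or the factor levels change later.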

library(ggplot2)
#ggplot(citi_trim) + geom_histogram(aes(x=age), stat="count",binwidth =2 )

ggplot(cbdata) + geom_bar(aes(x=age, y=(..count..)/sum(..count..), fill=gender), stat="count",binwidth =2) + xlim(c(15,75))

## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.

## Warning: Removed 2947 rows containing non-finite values (stat_bin).
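
The first warning arises because `geom_bar()` does not take a `binwidth` argument. A possible revision, sketched on simulated ages rather than the Citibike data, uses `geom_histogram()` and filters to the plotted age range beforehand so that the axis limits do not drop rows. It assumes ggplot2 3.3.0 or later for `after_stat()`; the data frame `sim_age` is made up.

```r
library(ggplot2)

set.seed(606)
# Simulated ages and genders standing in for cbdata
sim_age <- data.frame(
  age = sample(17:90, 1000, replace = TRUE),
  gender = factor(sample(c("Male", "Female"), 1000, replace = TRUE),
                  levels = c("Male", "Female"))
)

# Restrict to the age range of interest up front instead of using xlim()
in_range <- sim_age[sim_age$age >= 15 & sim_age$age <= 75, ]

# Two-year bins; each bar segment's height is its share of all rows in in_range
p <- ggplot(in_range, aes(x = age, fill = gender)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 2) +
  labs(y = "Proportion of rides")
p
```

Filtering first changes the denominator to the rides within the plotted range, which may or may not be the proportion intended in the original plot.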