# load data for August 2017
library('DATA606')
cb_master <- read.csv(file="~/documents/201708-citibike-tripdata.csv", header=TRUE, sep = ",", as.is="birth.year")
head(cb_master)
# Subset the columns of interest
citibike <- cb_master[,c("tripduration", "starttime", "bikeid", "usertype","birth.year", "gender")]
# Keep subscribers with a recorded gender (1 or 2) and drop erroneous trips of 43,000 seconds (about 717 minutes) or longer
citibike <- citibike[(citibike$usertype %in% c("Subscriber") & citibike$gender %in% c("1", "2") & citibike$tripduration < 43000) ,]
# Rename Gender labels
citibike$gender <- as.factor(citibike$gender)
levels(citibike$gender) <- c("Male", "Female")
# Convert birth year to integer and tripduration to minutes instead of seconds
citibike$birth.year <- as.integer(citibike$birth.year)
## Warning: NAs introduced by coercion
citibike["tripduration.min"] <- citibike$tripduration/60
# Add a new variable with calculated age
citibike["age"] <- (2018 - citibike$birth.year)
# Subset selected variables for analysis
cbdata <- citibike[which(citibike$age < 110),c("tripduration", "starttime", "bikeid","birth.year", "gender", "tripduration.min", "age")]
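Because the analysis is planned for several monthly files, the cleaning steps above could be wrapped in a reusable helper. The following is a minimal sketch under stated assumptions, not the code used to produce the results in this report: `clean_trips` and its arguments are hypothetical names, the defaults mirror the thresholds used above, and the toy data frame is made up for illustration.

```r
# Hypothetical helper (not part of the original analysis): applies the same
# cleaning rules as above to one month of trip data.
clean_trips <- function(trips, ref_year = 2018, max_seconds = 43000, max_age = 110) {
  keep <- trips$usertype %in% "Subscriber" &
    trips$gender %in% c(1, 2) &
    trips$tripduration < max_seconds
  out <- trips[which(keep), c("tripduration", "starttime", "bikeid", "birth.year", "gender")]
  out$gender <- factor(out$gender, levels = c(1, 2), labels = c("Male", "Female"))
  # Non-numeric birth years (e.g. "NULL") become NA, as in the original coercion
  out$birth.year <- suppressWarnings(as.integer(out$birth.year))
  out$tripduration.min <- out$tripduration / 60
  out$age <- ref_year - out$birth.year
  # which() also drops rows whose age is NA
  out[which(out$age < max_age), ]
}

# Toy input, made up for illustration only
toy_trips <- data.frame(
  tripduration = c(600, 50000, 900, 1200, 300, 1500),
  starttime = rep("2017-08-01 00:00:00", 6),
  bikeid = 1:6,
  usertype = c("Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber", "Subscriber"),
  birth.year = c("1985", "1990", "1970", "NULL", "1890", "1992"),
  gender = c(1, 2, 0, 2, 1, 2),
  stringsAsFactors = FALSE
)
cleaned <- clean_trips(toy_trips)
print(cleaned)
```

Each additional month could then be read with `read.csv()` and passed through the same function, so that every file is filtered with identical rules.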
You should phrase your research question in a way that matches the scope of inference your dataset allows for.
Are age and gender predictive of speed and distance of a rider?
Each case represents one ride by a Citibike member. There are approximately 1.5 million observations in the data set.
Describe the method of data collection.
The data are collected by Citibike NYC for each member ride from the logs of bike pick-ups and drop-offs. Riders' birth year and gender are submitted by Citibike members when they open their accounts.
What type of study is this (observational/experiment)? This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The data are collected by Citibike NYC and shared as .csv files on their website: https://www.citibikenyc.com/system-data
Considering the large amount of data per month, I will perform analysis on one month per season for two years.
What is the response variable? Is it quantitative or qualitative? The response variables are speed and distance. Both are quantitative.
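The columns selected above contain trip duration but no distance or speed, so both response variables would have to be derived. One possible approach, sketched here under assumptions, is the straight-line (haversine) distance between the start and end stations divided by the trip duration. This assumes the raw trip file carries start and end station coordinates; the column names `start.lat`, `start.lon`, `end.lat` and `end.lon`, the helper name `haversine_km`, and the sample values are all placeholders. A straight-line distance understates the route actually ridden and is zero for round trips.

```r
# Great-circle (haversine) distance in kilometres between two points
# given in decimal degrees. Hypothetical helper, not from the original analysis.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates and duration (in seconds) for illustration only
trip <- data.frame(
  start.lat = 40.75, start.lon = -73.99,
  end.lat = 40.76, end.lon = -73.98,
  tripduration = 600
)
trip$distance.km <- haversine_km(trip$start.lat, trip$start.lon, trip$end.lat, trip$end.lon)
trip$speed.kmh <- trip$distance.km / (trip$tripduration / 3600)
print(trip)
```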
You should have two independent variables, one quantitative and one qualitative.
The independent variables are age, which is quantitative, and gender, which is categorical.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(cbdata$gender)
## Male Female
## 1129259 403179
summary(cbdata$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 29.00 35.00 38.38 46.00 108.00
# Summary statistics for Female riders
summary(cbdata$tripduration.min[cbdata$gender %in% c("Female")])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 6.967 11.400 14.640 19.020 715.900
# Summary statistics for Male riders
summary(cbdata$tripduration.min[cbdata$gender %in% c("Male")])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 5.867 9.600 12.660 16.200 715.100
library(ggplot2)
# Proportion of rides by age, filled by gender (geom_bar() counts rows by default and takes no binwidth)
ggplot(cbdata) + geom_bar(aes(x=age, y=(..count..)/sum(..count..), fill=gender)) + xlim(c(15,75))
## Warning: Removed 2947 rows containing non-finite values (stat_bin).
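To look at age and gender together, trip duration could also be summarised within age groups. The sketch below is illustrative only: the 15-year bins are an arbitrary assumption, the `toy_rides` data frame is made up, and the real analysis would apply the same calls to `cbdata`.

```r
# Toy data, made up for illustration; the real analysis would use cbdata
toy_rides <- data.frame(
  age = c(22, 27, 34, 38, 45, 52, 61, 29),
  gender = factor(c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                  levels = c("Male", "Female")),
  tripduration.min = c(8, 12, 10, 14, 9, 16, 11, 6)
)

# Hypothetical 15-year bins; right = FALSE gives intervals such as [15,30)
toy_rides$age.group <- cut(toy_rides$age, breaks = c(15, 30, 45, 60, 75), right = FALSE)

# Median trip duration (minutes) for each age group and gender;
# combinations with no rides come back as NA
med <- tapply(toy_rides$tripduration.min,
              list(toy_rides$age.group, toy_rides$gender),
              median)
print(med)
```

The median is used here rather than the mean because the duration summaries above are strongly right-skewed, with maxima above 700 minutes.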