DATA 606 Data Project Proposal

Data Preparation

library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)


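# read.csv() reads comma-separated files directly; read.csv2() is meant for
# semicolon-separated files that use "," as the decimal mark.
weatherRaw <- read.csv("data/weather_data_nyc_centralpark_2016(1).csv", stringsAsFactors = FALSE)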
head(weatherRaw)

# Keep the date, daily high, and precipitation; recode precipitation as logical.
weather <- weatherRaw %>% 
  select(date, maximum.temperature, precipitation) %>% 
  mutate(date = dmy(as.character(date)),  # raw dates are day-month-year
         # "T" (trace) is not numeric and becomes NA, so it is coded FALSE.
         # Parsing to numeric also avoids comparing text like "0.00" with the number 0.
         precip_amount = suppressWarnings(as.numeric(as.character(precipitation))),
         precipitation = !is.na(precip_amount) & precip_amount > 0) %>% 
  select(date, maximum.temperature, precipitation)

head(weather)
nrow(weather)

# One file per season; each is comma-separated, so read.csv() is used.
springRaw <- read.csv("data/data-8eZnB.csv", stringsAsFactors = FALSE)
spring <- springRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)

summerRaw <- read.csv("data/data-xXpEp.csv", stringsAsFactors = FALSE)
summer <- summerRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)

fallRaw <- read.csv("data/data-gRmSF.csv", stringsAsFactors = FALSE)
fall <- fallRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)

winterRaw <- read.csv("data/data-rreHM.csv", stringsAsFactors = FALSE)
winter <- winterRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)

tripsRaw <- bind_rows(summer, spring, fall, winter)  

# Parse the month/day/year dates and give the trip count a shorter name.
trips <- tripsRaw %>% 
  mutate(date = mdy(as.character(Date))) %>% 
  select(date, n_trips = Trips.over.the.past.24.hours..midnight.to.11.59pm.)

head(trips)
tail(trips)

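# Join on the shared date column explicitly rather than relying on the default key guess.
cases <- inner_join(weather, trips, by = "date") %>% arrange(date)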
head(cases)
tail(cases)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

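How is daily weather (high temperature and precipitation) associated with Manhattan cycling, as measured by the number of Citi Bike trips per day in 2016?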

Cases

What are the cases, and how many are there?

Each case is one date in 2016. There are 366 cases, since 2016 was a leap year.
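The count of 366 can be checked by generating every calendar date in 2016. This is a minimal base R sketch; `all_days` is a name introduced here for illustration and is not part of the pipeline above.

```r
# Build the full sequence of calendar dates in 2016 (illustrative name).
all_days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")

# 2016 is a leap year, so the sequence includes February 29.
length(all_days)                      # number of possible cases
as.Date("2016-02-29") %in% all_days   # is the leap day present?
```

If `nrow(cases)` is smaller than this length, the inner join dropped dates that are missing from one of the two sources.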

Data collection

Describe the method of data collection.

Citi Bike trips are recorded whenever a bike is unlocked and then docked. Each member has a key, which creates a unique identifier for each individual trip. To create the dataset, I have aggregated the number of trips by date, the same daily resolution at which the weather data is collected.
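As a rough illustration of aggregating by date, the sketch below counts trips per day from trip-level records. The `rides` data frame, its `start_time` column, and the timestamps are hypothetical stand-ins; the files read in Data Preparation already report daily totals.

```r
# Hypothetical trip-level records: one row per trip (made-up timestamps).
rides <- data.frame(
  start_time = as.POSIXct(
    c("2016-03-01 08:15:00", "2016-03-01 17:40:00", "2016-03-02 09:05:00"),
    tz = "UTC"
  )
)

# Reduce each timestamp to its calendar date and mark each row as one trip.
rides$date    <- as.Date(rides$start_time, tz = "UTC")
rides$n_trips <- 1L

# Sum the per-row trip markers within each date to get daily counts.
daily <- aggregate(n_trips ~ date, data = rides, FUN = sum)
daily
```

The result has one row per date, which is the shape needed to join against daily weather records.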

Weather data was collected at the Central Park Weather Station by NOAA. The daily high temperature is used as recorded, while precipitation has been categorized as TRUE when a measurable amount (> 0) was recorded and FALSE when the record is 0 or T (trace).
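The coding rule for precipitation can be sketched on a few made-up readings. The vector `precip_raw` and the helper `had_precip` are hypothetical names; the sketch assumes the raw column stores amounts as text, with "T" marking a trace.

```r
# Made-up raw readings: amounts stored as text, with "T" for a trace.
precip_raw <- c("0.00", "T", "0.05", "1.20", "0")

# Hypothetical helper: TRUE only for a measurable amount greater than zero.
had_precip <- function(x) {
  amount <- suppressWarnings(as.numeric(x))  # "T" becomes NA
  !is.na(amount) & amount > 0
}

had_precip(precip_raw)
```

Parsing to numeric first avoids comparing text such as "0.00" against the number 0, which R evaluates as a string comparison with "0" and therefore treats as unequal.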

Type of study

What type of study is this (observational/experiment)?

Observational

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Weather data: Kaggle
Ride data: Citi Bike's website (2016 chosen to match weather data)

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is number of trips, which is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

Qualitative variable: Precipitation
Quantitative variable: High Temperature

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

library(ggplot2)

summary(cases)

##       date            maximum.temperature precipitation      n_trips     
##  Min.   :2016-01-01   Min.   :15.00       Mode :logical   Min.   :    0  
##  1st Qu.:2016-04-01   1st Qu.:50.00       FALSE:140       1st Qu.:24677  
##  Median :2016-07-01   Median :64.50       TRUE :226       Median :38332  
##  Mean   :2016-07-01   Mean   :64.63                       Mean   :37819  
##  3rd Qu.:2016-09-30   3rd Qu.:81.00                       3rd Qu.:51742  
##  Max.   :2016-12-31   Max.   :96.00                       Max.   :69758

boxplot(cases$maximum.temperature ~ cases$precipitation, main = "Max Temp by Precip, 2016")

boxplot(cases$n_trips, main = "Number of Citi Bike Trips per Day, 2016")

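# qplot() is deprecated in recent ggplot2 releases; ggplot() draws the same scatter plot.
ggplot(cases, aes(x = maximum.temperature, y = n_trips, colour = precipitation)) +
  geom_point()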
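Because the response is paired with one qualitative and one quantitative predictor, group-wise means and a correlation may also be informative. The sketch below uses a tiny made-up data frame, `toy_cases`, with the same column names as `cases`; the numbers are illustrative only and are not results from this project.

```r
# Made-up data with the same column names as `cases` (values are illustrative).
toy_cases <- data.frame(
  maximum.temperature = c(40, 55, 70, 85, 60, 75),
  precipitation       = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  n_trips             = c(12000, 30000, 45000, 60000, 25000, 40000)
)

# Mean number of trips on days without and with precipitation.
mean_by_precip <- tapply(toy_cases$n_trips, toy_cases$precipitation, mean)
mean_by_precip

# Strength of the linear association between high temperature and trips.
temp_trip_cor <- cor(toy_cases$maximum.temperature, toy_cases$n_trips)
temp_trip_cor
```

Replacing `toy_cases` with `cases` would apply the same two summaries to the actual 2016 data.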