library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
# The file is comma-separated with "." decimals, which is what read.csv() expects by default
weatherRaw <- read.csv("data/weather_data_nyc_centralpark_2016(1).csv")
head(weatherRaw)
# Keep the three columns of interest, rebuild the day-month-year date, and recode precipitation as a logical
weather <- weatherRaw %>%
  select(date, maximum.temperature, precipitation) %>%
  separate(date, c("day", "month", "year"), sep = "-") %>%
  mutate(DATE = str_c(year, month, day, sep = "-"),
         date = ymd(DATE),
         # A zero reading or a trace ("T") counts as no precipitation; anything else is TRUE
         precipitation = if_else(precipitation == 0 | precipitation == "T", FALSE, TRUE)) %>%
  select(date, maximum.temperature, precipitation)
head(weather)
nrow(weather)
# Daily ridership comes in one file per season; keep the date and the daily trip count from each
springRaw <- read.csv("data/data-8eZnB.csv")
spring <- springRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)
summerRaw <- read.csv("data/data-xXpEp.csv")
summer <- summerRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)
fallRaw <- read.csv("data/data-gRmSF.csv")
fall <- fallRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)
winterRaw <- read.csv("data/data-rreHM.csv")
winter <- winterRaw %>% select(Date, Trips.over.the.past.24.hours..midnight.to.11.59pm.)
# Row order does not matter here because the joined data are sorted by date later
tripsRaw <- bind_rows(summer, spring, fall, winter)
# Rebuild the month/day/year date and give the trip count a shorter name
trips <- tripsRaw %>%
  separate(Date, c("month", "day", "year"), sep = "/") %>%
  mutate(DATE = str_c(year, month, day, sep = "-"),
         date = ymd(DATE),
         n_trips = Trips.over.the.past.24.hours..midnight.to.11.59pm.) %>%
  select(date, n_trips)
head(trips)
tail(trips)
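# Name the join key explicitly rather than relying on the automatic match on shared column names
cases <- inner_join(weather, trips, by = "date") %>% arrange(date)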
head(cases)
tail(cases)
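An inner join keeps only the dates that appear in both tables, so any date missing from one source is dropped without warning. The following is a minimal sketch of how that could be checked, written in base R on made-up stand-ins for the two tables. The names `toy_weather` and `toy_trips` and all values are illustrative assumptions, not the real data.

```r
# Toy stand-ins for the weather and trips tables (values are made up)
toy_weather <- data.frame(
  date = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")),
  maximum.temperature = c(42, 40, 45)
)
toy_trips <- data.frame(
  date = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04")),
  n_trips = c(11009, 14587, 15499)
)

# Dates present in one table but not the other would be dropped by an inner join
weather_only <- toy_weather$date[!toy_weather$date %in% toy_trips$date]
trips_only <- toy_trips$date[!toy_trips$date %in% toy_weather$date]

# merge() with its defaults keeps only matching keys, like an inner join
toy_cases <- merge(toy_weather, toy_trips, by = "date")

print(weather_only)
print(trips_only)
print(nrow(toy_cases))
```

If both `weather_only` and `trips_only` come back empty on the real tables, the join has not discarded any days.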
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
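How is daily weather (high temperature and the presence of precipitation) associated with the amount of cycling in Manhattan, as measured by daily Citi Bike trips in 2016?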
What are the cases, and how many are there?
Each case is one calendar day in 2016. There are 366 cases because 2016 was a leap year.
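The expected number of cases can be derived independently of the data. As a small sketch in base R, the calendar for 2016 is generated and counted; the variable names are illustrative.

```r
# Every calendar day of 2016, generated independently of the data
all_days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
n_days <- length(all_days)

# 2016 is a leap year, so 29 February should be in the sequence
has_feb29 <- as.Date("2016-02-29") %in% all_days

# With the real data, nrow(cases) can be compared against n_days
print(n_days)
print(has_feb29)
```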
Describe the method of data collection.
Citi Bike records a trip whenever a bike is unlocked and later docked. Each member has a key, which creates a unique identifier for each individual trip. To create the dataset, I aggregated the number of trips by date, which matches the daily resolution of the weather data.
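Aggregating trips by date can be sketched as follows. This is only an illustration in base R on a made-up trip-level table: the column names `trip_id` and `start_time` are hypothetical, and the files read in this report already contain daily totals.

```r
# Hypothetical trip-level records: one row per trip (values are made up)
toy_rides <- data.frame(
  trip_id = 1:5,
  start_time = as.POSIXct(
    c("2016-03-01 08:15:00", "2016-03-01 17:40:00", "2016-03-02 09:05:00",
      "2016-03-02 12:30:00", "2016-03-02 18:55:00"),
    tz = "America/New_York"
  )
)

# Reduce each timestamp to its calendar date in the local time zone
toy_rides$date <- as.Date(toy_rides$start_time, tz = "America/New_York")

# Count the trips that fall on each date
daily <- aggregate(trip_id ~ date, data = toy_rides, FUN = length)
names(daily)[2] <- "n_trips"
print(daily)
```

Passing the time zone to `as.Date()` matters for trips near midnight, which would otherwise be assigned to the UTC date.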
Weather data were collected by NOAA at the Central Park weather station. High temperature is used as recorded, while precipitation has been recoded as TRUE or FALSE: days recorded as 0 or as T (trace) are FALSE, and days with any larger recorded amount are TRUE.
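One reading of the precipitation rule, following the data-preparation code, treats both zero and trace ("T") as no precipitation. The sketch below applies that reading in base R to made-up readings. Unlike the string comparisons in the pipeline, it parses the amounts numerically, so "0" and "0.00" are treated alike; that difference is an assumption of this example, not something taken from the source.

```r
# Made-up readings in the style of the raw column: amounts as text, "T" for trace
raw_precip <- c("0", "T", "0.05", "1.2", NA)

# Convert the numeric-looking entries; "T" and NA become NA here
# (suppressWarnings hides the expected coercion warning for "T")
amount <- suppressWarnings(as.numeric(raw_precip))

# TRUE only for a measurable amount; zero and trace both map to FALSE,
# and a genuinely missing reading stays NA
wet <- ifelse(raw_precip == "T", FALSE, amount > 0)
print(wet)
```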
What type of study is this (observational/experiment)?
Observational
If you collected the data, state self-collected. If not, provide a citation/link.
Weather data: Kaggle
Ride data: Citi Bike's website (2016 chosen to match the weather data)
What is the response variable? Is it quantitative or qualitative?
The response variable is number of trips, which is quantitative.
You should have two independent variables, one quantitative and one qualitative.
Qualitative variable: Precipitation
Quantitative variable: High Temperature
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
library(ggplot2)
summary(cases)
##       date            maximum.temperature  precipitation     n_trips
##  Min.   :2016-01-01   Min.   :15.00        Mode :logical   Min.   :    0
##  1st Qu.:2016-04-01   1st Qu.:50.00        FALSE:140       1st Qu.:24677
##  Median :2016-07-01   Median :64.50        TRUE :226       Median :38332
##  Mean   :2016-07-01   Mean   :64.63                        Mean   :37819
##  3rd Qu.:2016-09-30   3rd Qu.:81.00                        3rd Qu.:51742
##  Max.   :2016-12-31   Max.   :96.00                        Max.   :69758
boxplot(cases$maximum.temperature ~ cases$precipitation, main = "Max Temp by Precip, 2016")
boxplot(cases$n_trips, main = "Number of Citi Bike Trips per Day, 2016")
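# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_point() draws the same scatter plot
ggplot(cases, aes(x = maximum.temperature, y = n_trips, colour = precipitation)) + geom_point()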
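The overall summary above does not break the response down by the qualitative predictor. A sketch of how that could be done is given here in base R, on a small made-up sample with the same column names as `cases`. The values are illustrative only and say nothing about the real results.

```r
# Small made-up sample with the same columns as `cases`
toy_cases <- data.frame(
  maximum.temperature = c(35, 48, 60, 72, 85, 90),
  precipitation = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  n_trips = c(9000, 21000, 26000, 45000, 58000, 41000)
)

# Mean and median daily trips within each precipitation group
by_precip <- aggregate(n_trips ~ precipitation, data = toy_cases,
                       FUN = function(x) c(mean = mean(x), median = median(x)))
print(by_precip)

# Strength of the linear association between high temperature and trips
r <- cor(toy_cases$maximum.temperature, toy_cases$n_trips)
print(r)
```

With the real data, `boxplot(n_trips ~ precipitation, data = cases)` would show the same split graphically.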