library(tidyverse)
library(feather)
data = read_feather('nyc_taxi_subset.fth')
data$tpep_pickup_datetime = as.Date(data$tpep_pickup_datetime, format='%Y-%m-%d')
data$tpep_dropoff_datetime = as.Date(data$tpep_dropoff_datetime, format='%Y-%m-%d')
# Subset to a manageable size; the line below was used once to save the subset
# write_feather(data, 'nyc_taxi_subset.fth')
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
# Are trip distance and payment type associated with mean tip amount?
# The payment types are:
# 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
# Trip distance is in miles as reported by taximeter
# tip amount is in dollars
# H0: There is no difference between mean tip amount for the various payment type groups
# HA: There is a difference between mean tip amount for the various payment type groups
# Is there any correlation between trip distance and tip amount? Is it a positive
# or negative correlation?
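The two questions above could be checked roughly as follows. This is a minimal sketch, not the analysis run on the taxi data: the `toy` data frame is hypothetical, simulated data, and one-way ANOVA (`aov`) and a Pearson correlation test (`cor.test`) are assumed here as reasonable choices for the stated hypotheses.

```r
# Hypothetical toy data standing in for the taxi subset (not real TLC records)
set.seed(1)
toy <- data.frame(
  payment_type  = factor(rep(c("Credit card", "Cash"), each = 50)),
  trip_distance = runif(100, 0.5, 10)
)
# Assumption for illustration: card tips grow with distance, cash tips are zero
toy$tip_amount <- ifelse(toy$payment_type == "Credit card",
                         0.5 + 0.4 * toy$trip_distance + rnorm(100, sd = 0.5),
                         0)

# One-way ANOVA: H0 is equal mean tip amount across payment type groups
anova_fit <- aov(tip_amount ~ payment_type, data = toy)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Pearson correlation test: the sign of the estimate gives the direction
cor_fit <- cor.test(toy$trip_distance, toy$tip_amount)
```

A small `anova_p` would be evidence against H0, and the sign of `cor_fit$estimate` answers the positive-or-negative question. With heavily skewed tips, a rank-based alternative such as `kruskal.test` may be worth considering.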
What are the cases, and how many are there?
# The original dataset has 10679307 rows/cases. Each case is a taxi trip
# in New York captured by the Taxi and Limousine Commission (TLC)
# and their technology partners who record the data
# about taxi trips electronically. Trips are from
# 2015-01-01 to 2015-03-04.
# We subset the first 100000 cases from 2015-01-01 to 2015-02-01
# because of script processing time and GitHub file size limits.
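The subsetting step described above might look like the sketch below. The `full` data frame is a hypothetical stand-in for the complete trip table, the date bounds are assumed to be inclusive, and `n` is set to 10 only to keep the example small (the write-up uses 100000).

```r
# Hypothetical stand-in for the full trip table (not real TLC records):
# one row per day from 2015-01-01 to 2015-03-04
full <- data.frame(
  tpep_pickup_datetime = as.Date("2015-01-01") + 0:62,
  tip_amount = (0:62) / 10
)

# Flag trips picked up inside the stated window (bounds assumed inclusive)
in_window <- full$tpep_pickup_datetime >= as.Date("2015-01-01") &
  full$tpep_pickup_datetime <= as.Date("2015-02-01")

# Keep the first n rows of the windowed data
n <- 10  # the write-up uses 100000
subset_data <- head(full[in_window, ], n)
```

Taking the first rows is simple but not random; if the file is ordered by time or vendor, a random draw of row indices with `sample()` would give a more representative subset.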
Describe the method of data collection.
# The data were collected for the NYC Taxi and Limousine Commission by its
# technology providers. When a taxi is dispatched, the technology captures data
# about the trip, such as pick-up date, time, taxi zone location ID, and fare.
# Additional data, such as payment method and drop-off location, are collected
# at the end of the trip.
What type of study is this (observational/experiment)?
# This is an observational study. No experiment is conducted and no treatment
# is assigned; existing trip records are only observed.
If you collected the data, state self-collected. If not, provide a citation/link.
# Original data from
# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# And then processed from
# taxi_preprocessing_example.py script at
# https://github.com/holoviz/datashader/tree/master/examples
What is the response variable? Is it quantitative or qualitative?
# The response variable is tip amount and it is quantitative
You should have two independent variables, one quantitative and one qualitative.
# The independent variables are trip distance, which is quantitative,
# and payment_type, which is qualitative/categorical.
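Because payment_type is stored as numeric codes, R treats it as quantitative unless it is converted. A minimal sketch of the conversion, using hypothetical toy values and the code legend given earlier in this write-up:

```r
# Hypothetical toy codes; the labels follow the legend given in this write-up
toy <- data.frame(payment_type = c(1, 2, 1, 4, 3, 1))

# Convert the numeric codes to a labelled factor
toy$payment_type <- factor(
  toy$payment_type,
  levels = 1:6,
  labels = c("Credit card", "Cash", "No charge",
             "Dispute", "Unknown", "Voided trip")
)

# Frequency table: the appropriate summary for a categorical variable
payment_counts <- table(toy$payment_type)
```

Once the column is a factor, functions such as `summary()`, `boxplot()`, and `aov()` treat it as a grouping variable rather than a number.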
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(data$trip_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.920 1.500 1.887 2.400 99.900
summary(data$payment_type)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.388 2.000 4.000
summary(data$tip_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -58.090 0.000 1.000 1.216 2.000 100.000
IQR(data$trip_distance)
## [1] 1.48
IQR(data$payment_type)
## [1] 1
IQR(data$tip_amount)
## [1] 2
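The mean and IQR of the payment_type codes are hard to interpret, since the codes are labels rather than quantities. A sketch of a per-group summary of the response instead, using hypothetical toy values:

```r
# Hypothetical toy data (not real TLC records)
toy <- data.frame(
  payment_type = factor(c("Credit card", "Credit card", "Credit card",
                          "Cash", "Cash", "Cash")),
  tip_amount   = c(1, 2, 3, 0, 0, 0)
)

# Mean and IQR of tip amount within each payment type group
mean_tip <- tapply(toy$tip_amount, toy$payment_type, mean)
iqr_tip  <- tapply(toy$tip_amount, toy$payment_type, IQR)
```

These group means are the quantities that H0 and HA compare.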
# 1= Credit card 2= Cash 3= No charge 4= Dispute
boxplot(data$trip_distance)
boxplot(data$payment_type)
boxplot(data$tip_amount ~ data$payment_type)
boxplot(data$trip_distance ~ data$payment_type)
# 1= Credit card 2= Cash 3= No charge 4= Dispute
ggplot(data, aes(x=tip_amount)) + geom_density() + facet_wrap(~payment_type)
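For the correlation question, a scatter plot of tip amount against trip distance is the natural companion to the plots above. The sketch below uses hypothetical simulated data; the linear trend line from `geom_smooth(method = "lm")` is an assumed choice, not something taken from the original analysis.

```r
library(ggplot2)

# Hypothetical simulated data (not real TLC records)
set.seed(2)
toy <- data.frame(trip_distance = runif(200, 0.5, 10))
toy$tip_amount <- 0.5 + 0.4 * toy$trip_distance + rnorm(200, sd = 1)

# Scatter plot with a fitted straight line to show the direction of the trend
p <- ggplot(toy, aes(x = trip_distance, y = tip_amount)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)

# Sample Pearson correlation between distance and tip
r <- cor(toy$trip_distance, toy$tip_amount)
```

Printing `p` draws the plot. On the real data, adding `facet_wrap(~payment_type)` would show the relationship separately for each payment type.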