Data Preparation

library(tidyverse)
library(feather)
data = read_feather('nyc_taxi_subset.fth')
data$tpep_pickup_datetime = as.Date(data$tpep_pickup_datetime, format='%Y-%m-%d')
data$tpep_dropoff_datetime = as.Date(data$tpep_dropoff_datetime, format='%Y-%m-%d')
# The data were subset to a manageable size and saved with:
# write_feather(data, 'nyc_taxi_subset.fth')

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

# Are trip distance and payment type associated with mean tip amount?
# The payment types are:
# 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip
# Trip distance is in miles as reported by the taximeter
# Tip amount is in dollars

# H0: There is no difference between mean tip amount for the various payment type groups
# HA: There is a difference between mean tip amount for the various payment type groups

# Is there any correlation between trip distance and tip amount? Is it a positive
# or negative correlation?
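
The two questions above could later be examined along the following lines. This is only a minimal sketch on a made-up data frame (`toy`, with invented values), not an analysis of the taxi data; the choice of a one-way ANOVA and Pearson's correlation is an assumption about how the hypotheses might be tested.

```r
# Toy data with invented values; only the column names mirror the taxi data.
toy <- data.frame(
  trip_distance = c(0.8, 1.2, 2.5, 3.1, 0.5, 1.9, 4.2, 2.2),
  payment_type  = c(1, 1, 1, 1, 2, 2, 2, 2),
  tip_amount    = c(1.0, 1.5, 2.4, 3.0, 0.0, 0.0, 0.0, 0.0)
)

# One-way ANOVA: does mean tip amount differ across payment type groups?
fit <- aov(tip_amount ~ factor(payment_type), data = toy)
summary(fit)

# Pearson correlation between trip distance and tip amount; the sign of the
# estimate indicates a positive or negative linear association.
ct <- cor.test(toy$trip_distance, toy$tip_amount)
ct$estimate
```

Because the study is observational, either result would describe an association rather than a causal effect.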

Cases

What are the cases, and how many are there?

# Originally there are 10679307 rows/cases. Each case is a taxi trip
# in New York recorded electronically by the Taxi and Limousine Commission (TLC)
# and its technology partners. Trips run from
# 2015-01-01 to 2015-03-04.
# We subset the first 100000 cases from 2015-01-01 to 2015-02-01
# because of script processing time and GitHub file size limits.
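
A subset of this kind could be produced roughly as follows. This is a hedged sketch on a hypothetical stand-in table (`trips`, with invented values); the actual subsetting step used for this report is not shown here and may have been done differently.

```r
# Hypothetical stand-in for the full trip table; values are invented.
trips <- data.frame(
  tpep_pickup_datetime = as.Date(c("2015-01-01", "2015-01-15", "2015-02-01",
                                   "2015-02-20", "2015-03-04")),
  tip_amount = c(1.0, 0.0, 2.5, 1.2, 3.0)
)

# Keep trips picked up between 2015-01-01 and 2015-02-01 (inclusive).
in_window <- trips$tpep_pickup_datetime >= as.Date("2015-01-01") &
  trips$tpep_pickup_datetime <= as.Date("2015-02-01")

# Take the first n rows of the filtered table (n would be 100000 for the real data).
n <- 2
trips_subset <- head(trips[in_window, ], n)
nrow(trips_subset)
```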

Data collection

Describe the method of data collection.

# Data are collected for the NYC Taxi and Limousine Commission by its technology providers.
# When a taxi is dispatched, the technology captures data about
# the trip such as pick-up date, time, taxi zone location ID, fare, etc.
# Additional data are collected at the end of the trip, such as payment method
# and drop-off location.

Type of study

What type of study is this (observational/experiment)?

# This is an observational study. No experiment is conducted;
# the trips are only observed as they were recorded.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

# Original data from
# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# and then processed with the
# taxi_preprocessing_example.py script at
# https://github.com/holoviz/datashader/tree/master/examples

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

# The response variable is tip amount and it is quantitative

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

# The independent variables are trip distance, which is quantitative,
# and payment_type, which is qualitative/categorical.
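
Since payment_type is stored as numeric codes, it may help to convert it to a labelled factor before summarising or plotting it. The sketch below uses a short vector of hypothetical codes (`codes`) in place of `data$payment_type`; the labels are taken from the payment type list given under the research question.

```r
# Labels for codes 1 to 6, as listed under the research question.
payment_labels <- c("Credit card", "Cash", "No charge",
                    "Dispute", "Unknown", "Voided trip")

# Hypothetical example codes standing in for data$payment_type.
codes <- c(1, 2, 1, 4, 3, 1)

# Map each numeric code to its label; unused levels are kept with a count of zero.
payment_factor <- factor(codes, levels = 1:6, labels = payment_labels)
table(payment_factor)
```

With a factor, R treats the variable as categorical, so functions such as `summary()` report counts per category instead of quartiles of the codes.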

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(data$trip_distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.920   1.500   1.887   2.400  99.900
summary(data$payment_type)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.388   2.000   4.000
summary(data$tip_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -58.090   0.000   1.000   1.216   2.000 100.000
IQR(data$trip_distance)
## [1] 1.48
IQR(data$payment_type)
## [1] 1
IQR(data$tip_amount)
## [1] 2
# 1= Credit card 2= Cash 3= No charge 4= Dispute
boxplot(data$trip_distance)

boxplot(data$payment_type)

boxplot(data$tip_amount ~ data$payment_type)

boxplot(data$trip_distance ~ data$payment_type)

# 1= Credit card 2= Cash 3= No charge 4= Dispute
ggplot(data, aes(x=tip_amount)) + geom_density() + facet_wrap(~payment_type)
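
Two further summaries relate directly to the research question: tip amount within each payment type group, and a scatter plot of tip amount against trip distance. The following is a minimal sketch on a made-up data frame (`toy`, with invented values) using base R only; the same calls could be pointed at the real `data` columns.

```r
# Toy data with invented values; only the column names mirror the taxi data.
toy <- data.frame(
  trip_distance = c(0.8, 1.2, 2.5, 3.1, 0.5, 1.9),
  payment_type  = c(1, 1, 1, 2, 2, 2),
  tip_amount    = c(1.0, 1.5, 2.4, 0.0, 0.0, 0.0)
)

# Mean and median tip amount within each payment type group.
mean_tip_by_type   <- tapply(toy$tip_amount, toy$payment_type, mean)
median_tip_by_type <- tapply(toy$tip_amount, toy$payment_type, median)
mean_tip_by_type
median_tip_by_type

# Scatter plot of tip amount against trip distance.
plot(toy$trip_distance, toy$tip_amount,
     xlab = "Trip distance (miles)", ylab = "Tip amount (dollars)")
```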