Project 1: Taxi Trips in NYC

Author

Isaac Cuellar

Taxi Trips in NYC 2023

The dataset I will be analyzing covers Yellow Taxi trips in New York City for the year 2023. The data was collected by Technology Service Providers (TSPs). The main variables I will use are passenger count, tip amount, payment type, and the total amount for the taxi trip. Passengers tend to show their drivers gratitude through tips. My focus will be on whether the type of payment affects the amount of tips the drivers receive.

library(readr)
Warning: package 'readr' was built under R version 4.5.3
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.5.3
Taxi <- readr::read_csv("2023_Yellow_Taxi_Trip_Data.csv",n_max=100000)
Rows: 100000 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): tpep_pickup_datetime, tpep_dropoff_datetime, store_and_fwd_flag
dbl (16): VendorID, passenger_count, trip_distance, RatecodeID, PULocationID...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
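# Keep only credit card (1) and cash (2) trips; Taxi_clean is used for the plots and the second model
Taxi_clean <- Taxi %>% filter(payment_type %in% c(1, 2))
# This first model is fit on the full Taxi data, so all four payment types appear in the output
Taxi$payment_type <- as.factor(Taxi$payment_type)
model <- lm(tip_amount ~ payment_type, data = Taxi)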
summary(model)

Call:
lm(formula = tip_amount ~ payment_type, data = Taxi)

Residuals:
    Min      1Q  Median      3Q     Max 
 -4.757  -2.237  -0.137   0.000 206.743 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.75689    0.01487  319.93   <2e-16 ***
payment_type2 -4.75643    0.03015 -157.75   <2e-16 ***
payment_type3 -4.75689    0.14697  -32.37   <2e-16 ***
payment_type4 -4.67826    0.11086  -42.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.047 on 99996 degrees of freedom
Multiple R-squared:  0.2094,    Adjusted R-squared:  0.2094 
F-statistic:  8828 on 3 and 99996 DF,  p-value: < 2.2e-16
Taxi_clean$payment_type<-factor(Taxi_clean$payment_type,
  levels = c(1,2),
  labels = c("Credit Card", "Cash"))

To clean the data, I first limited the number of observations to 100,000 because the original dataset has more than 1 million rows. Passing “n_max=100000” to the read_csv command reads only the first 100,000 rows, which avoids overloading R and puts less strain on my device. I also converted “payment_type” to a categorical variable. Originally, the data used the numbers 1 through 4 to encode the payment type (e.g., 1 = Credit Card, 2 = Cash). I kept only these two forms of payment, as they are the most popular, and then turned them into a labeled factor.
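The filtering and relabeling steps described above can be sketched on a small example. This is only an illustration: the `toy` data frame and its values are made up, and the final count is an optional check I added to confirm that no other payment types remain.

```r
library(dplyr)

# Toy rows standing in for the taxi file; the values are made up for illustration
toy <- data.frame(
  payment_type = c(1, 2, 1, 3, 4, 2),
  tip_amount   = c(3.5, 0, 2.0, 0, 0, 0)
)

# Keep only credit card (1) and cash (2), then attach readable labels
toy_clean <- toy %>%
  filter(payment_type %in% c(1, 2)) %>%
  mutate(payment_type = factor(payment_type,
                               levels = c(1, 2),
                               labels = c("Credit Card", "Cash")))

# Count rows per level to confirm that only the two payment types remain
payment_counts <- count(toy_clean, payment_type)
print(payment_counts)
```

Filtering before creating the factor keeps unused levels (3 and 4) out of later plots and models.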

ggplot(Taxi_clean,aes(x=payment_type,y=tip_amount,fill=payment_type))+
  geom_boxplot()+
  coord_cartesian(ylim=c(0,20))+
  labs(title = "Tip Amount by Payment Type",
       x = "Payment Method",
       y = "Tip Amount ($)",
       fill = "Payment Type",
       caption = "NYC Yellow Taxi Trip Data 2023")+
  theme_classic()

The box plot above shows that recorded tips for credit card payments sit clearly above zero, with a median of $3.50, while recorded tips for cash payments sit near zero. The first regression output agrees: the average recorded tip is about $4.76 for credit card trips and almost $0 for cash trips. The likely explanation is that cash tips are not recorded in the dataset, as drivers prefer to keep those earnings for themselves. Payment type therefore appears to influence recorded tip amounts, but the difference may be an artifact of the way the data was collected.
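To make the link between the box plot and a regression on payment type more concrete, here is a minimal sketch on made-up numbers (the `toy` data frame is hypothetical, not taken from the taxi file). With a single two-level factor as the predictor, the intercept equals the mean tip of the reference level and the coefficient equals the difference between the two group means.

```r
# Toy data; the values are made up for illustration
toy <- data.frame(
  payment_type = factor(
    c("Credit Card", "Credit Card", "Credit Card", "Cash", "Cash", "Cash"),
    levels = c("Credit Card", "Cash")
  ),
  tip_amount = c(3, 4, 5, 0, 0, 0.3)
)

# Mean recorded tip per payment type
group_means <- tapply(toy$tip_amount, toy$payment_type, mean)

# The intercept is the mean of the reference level (Credit Card);
# the other coefficient is the Cash mean minus the Credit Card mean
fit <- lm(tip_amount ~ payment_type, data = toy)
coefs <- coef(fit)

print(group_means)
print(coefs)
```

Reading the model this way shows that a large negative coefficient for cash simply restates the gap in average recorded tips between the two groups.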

model <- lm(tip_amount ~ total_amount + trip_distance, data = Taxi_clean)
summary(model)

Call:
lm(formula = tip_amount ~ total_amount + trip_distance, data = Taxi_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.863  -1.446   0.741   1.403 184.328 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.853e-01  1.635e-02 -11.337   <2e-16 ***
total_amount   1.227e-01  4.061e-04 302.112   <2e-16 ***
trip_distance -4.296e-05  5.269e-05  -0.815    0.415    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.286 on 97874 degrees of freedom
Multiple R-squared:  0.4826,    Adjusted R-squared:  0.4826 
F-statistic: 4.565e+04 on 2 and 97874 DF,  p-value: < 2.2e-16
Taxi_plot<-Taxi_clean %>% 
  sample_n(5000)

ggplot(Taxi_plot, aes(x = total_amount, y = tip_amount)) +
  geom_point(alpha = 0.03) +
  geom_smooth(method = "lm", color = "blue", linewidth = 1.5) +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 20)) +
  labs(
    title = "Relationship Between Total Fare and Tip Amount",
    x = "Total Fare ($)",
    y = "Tip Amount ($)",
    caption = "NYC Yellow Taxi Trip Data 2023"
  ) +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

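The scatter plot above shows a positive relationship between total fare and tip amount: as the cost of the trip goes up, so does the tip. The blue regression line illustrates this trend clearly. The regression output supports it, estimating about $0.12 more in tips for each additional dollar of total amount (R-squared of about 0.48), while trip distance is not a significant predictor once the total amount is included (p = 0.415). To conclude, although there is some variability in the data, the upward trend suggests that the total fare amount is a strong predictor of customer tipping behavior.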
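One caveat may be worth checking. As far as I understand the TLC data dictionary, total_amount already includes the recorded tip_amount, so part of this relationship is built in. The sketch below uses made-up numbers and assumes a pre-tip column such as fare_amount is available (it is not visible in the truncated column list above, so treat the name as an assumption). It shows how the slope changes depending on whether the predictor already contains the tip.

```r
# Toy data: every rider tips exactly 20% of the pre-tip fare (made-up values)
toy <- data.frame(fare_amount = c(10, 20, 30, 40, 50))
toy$tip_amount   <- 0.20 * toy$fare_amount
toy$total_amount <- toy$fare_amount + toy$tip_amount  # the total includes the tip

# Slope against the total, which already contains the tip
slope_total <- unname(coef(lm(tip_amount ~ total_amount, data = toy))["total_amount"])

# Slope against the pre-tip fare
slope_fare <- unname(coef(lm(tip_amount ~ fare_amount, data = toy))["fare_amount"])

print(c(total = slope_total, fare = slope_fare))
```

In this toy case the slope on the total is 0.2 / 1.2, which is lower than the 20% tip rate on the fare. Refitting the model with a pre-tip predictor would be one way to see how much of the reported relationship remains.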