Final Project on Predicting Taxi Ride Duration

Qufei Wang Xuejia Xu

Agenda

NYC Taxi Trip Duration Prediction

resource: https://www.kaggle.com/c/nyc-taxi-trip-duration/data

  • Introduction to dataset
  • Exploratory data analysis
  • Cleaning dataset
  • Primary model
  • Improved model

Our Problem

Have you ever worried about being late for a meeting and are unsure of whether you should take a taxi or not? Knowing the accurate ETA can help developing a software where people can call a taxi and know right away exactly when they will arrive at their destination!

In this project, we will build a model to predict taxi trip duration in New York City!

alt text

Introduction to the Dataset

The training data is consisted of 1.5 million observations and 10 features.

nrow(train_df)
[1] 1458644

The test dataset includes about 600000 trips for us to predict

nrow(test_df)
[1] 459855

Luckily the data is clean!

[1] 0
[1] 0

Features

alt text

EDA--Distribution of Target Variable

It is very important to look at how trip duration is distributed in our dataset.

plot of chunk unnamed-chunk-8

The graph looks wired. Therefore I took another look at the summary statistics of trip duration

summary(train_df$trip_duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1     397     662     959    1075 3526282 

It turns out that trip duration has some extreme outliers such as trips that lasted for almost a day.

Distribution after transformation

After log transformation on trip duration,the distribution looks much more normal plot of chunk unnamed-chunk-10

EDA--Pickup Datetime

The pickup datetime is of data type datetime and is not convenient for modeling. Therefore we reformatted the feature into pickup_month, pickup_day, pickup_hour and pickup_weekday

train_df$pickup_month=month(train_df$pickup_datetime)
train_df$pickup_day=as.Date(train_df$pickup_datetime)
train_df$pickup_hour=hour(train_df$pickup_datetime)
train_df$pickup_weekdays=wday(train_df$pickup_datetime)
  pickup_month pickup_day pickup_hour pickup_weekdays
1            3 2016-03-14          17               2

EDA--Median Trip Duration vs Weekdays and Hour

alt text alt text

It is pretty obvious from the graph below that weekdays, especially Tuesday to Friday have longer trips than weekends and Monday do.Thus we created a dummy variable with 0 being weekends and Monday and 1 being weekdays and included it in our model to test its predicitve power.

train_df$weekday_dv<-ifelse((train_df$pickup_weekdays ==1 |train_df$pickup_weekdays ==2|train_df$pickup_weekdays ==7),0,1)

EDA--Pickup/Dropoff Location

As we can tell from the map below, the three major pickup/dropoff locations are Manhattan, JFK airport and LaGuardia Airport.

alt text

Take away here: -Create two dummy variables indicating whether a trip is from/to the airports. -Use external data to group pickup/dropoff locations into different neighborhoods in NYC

Feature engineering---Distance and bearing

*Calculating distance with the Vincenty (ellipsoid) method and geosphere package

train_df$distance_km<-NA
train_df[,'distance_km']<-distVincentyEllipsoid(matrix(c(train_df$pickup_longitude,train_df$pickup_latitude),ncol=2),
                              matrix(c(train_df$dropoff_longitude,train_df$dropoff_latitude), ncol=2), a=6378137, b=6356752.3142, f=1/298.257223563)/1000

*Calculating bearing which is the direction of the trip using function inside the same package

train_df$bearing<-bearing(matrix(c(train_df$pickup_longitude,train_df$pickup_latitude),
                                 ncol=2),matrix(c(train_df$dropoff_longitude,train_df$dropoff_latitude), ncol=2))

EDA--Vendor_id:

plot of chunk unnamed-chunk-17

We can see that vendor 1 and vendor 2 both have similar pattern for the business of the day od the week. Their trip durations increase from Monday and then reach the highest point on Friday. During the weekend, their trip durations decrease.

Data Cleaning

  • Trips that lasted for over a day
  • Trips that lasted for less than haf a minute but with an unreasonably high speed
  • Trips that are out of boudaries (go into ocean!)
  • Trips that had an unreasonably high speed alt text

Primary Model--Regression

alt text

Evaluation

AIC Value:

[1] 1414115

Validation under full regression model:

*Calculating the Root Mean Squared Logarithmic Error

[1] 0.6255181

*Calculating the Mean Squared Error

[1] 13964780

Primary Model--Reduced Regression


Call:
lm(formula = trip_duration ~ vendor_id + passenger_count + pickup_longitude + 
    weekday_dv + hour_dv1 + hour_dv2 + pickup_latitude + distance_km + 
    pickup_day + bearing, data = taxi1)

Residuals:
   Min     1Q Median     3Q    Max 
 -6835   -341   -157     68  85709 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.055e+05  2.758e+04  -3.825 0.000131 ***
vendor_id         1.891e+02  2.293e+01   8.244  < 2e-16 ***
passenger_count   2.002e+01  8.765e+00   2.284 0.022385 *  
pickup_longitude -1.098e+03  2.899e+02  -3.789 0.000151 ***
weekday_dv        8.455e+01  2.231e+01   3.789 0.000151 ***
hour_dv1         -2.583e+02  5.509e+01  -4.689 2.75e-06 ***
hour_dv2          1.752e+02  2.262e+01   7.744 9.75e-15 ***
pickup_latitude   3.972e+02  4.132e+02   0.961 0.336388    
distance_km       1.407e+02  3.300e+00  42.654  < 2e-16 ***
pickup_day        4.765e-01  2.129e-01   2.238 0.025216 *  
bearing           7.449e-02  1.081e-01   0.689 0.490605    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3006 on 74989 degrees of freedom
Multiple R-squared:  0.03186,   Adjusted R-squared:  0.03173 
F-statistic: 246.7 on 10 and 74989 DF,  p-value: < 2.2e-16

Evaluation

AIC Value:

[1] 1414127

Validation on reduced model *Calculating the Root Mean Squared Logarithmic Error

[1] 0.6282666

*Calculating the Mean Squared Error

[1] 13964201

Primary Model--Decision Tree/Random Forest

plot of chunk unnamed-chunk-26

Validation under the tree model: Calculating the Root Mean Squared Logarithmic Error

[1] 0.7174902

Validation under the random forest model:

[1] 0.5417291

Improved Model---Regression


Call:
lm(formula = trip_duration_log ~ vendor_dv + passenger_count + 
    pickup_longitude + weekday_dv + hour_dv1 + hour_dv2 + pickup_day + 
    pickup_latitude + distance_km + bearing, data = taxi_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.0414  -0.3056   0.0574   0.3656   5.6944 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.067e+02  5.407e+00 -38.224  < 2e-16 ***
vendor_dv         2.174e-02  4.497e-03   4.835 1.34e-06 ***
passenger_count   7.523e-03  1.719e-03   4.378 1.20e-05 ***
pickup_longitude -2.476e+00  5.684e-02 -43.564  < 2e-16 ***
weekday_dv        1.243e-01  4.375e-03  28.419  < 2e-16 ***
hour_dv1         -3.432e-01  1.080e-02 -31.770  < 2e-16 ***
hour_dv2          1.527e-01  4.435e-03  34.439  < 2e-16 ***
pickup_day        5.314e-04  4.175e-05  12.728  < 2e-16 ***
pickup_latitude   4.995e-01  8.102e-02   6.165 7.10e-10 ***
distance_km       1.418e-01  6.470e-04 219.164  < 2e-16 ***
bearing          -8.021e-06  2.119e-05  -0.379    0.705    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5895 on 74989 degrees of freedom
Multiple R-squared:  0.4369,    Adjusted R-squared:  0.4368 
F-statistic:  5819 on 10 and 74989 DF,  p-value: < 2.2e-16

Evaluation

AIC Value

[1] 133575.8

The Mean Squared Error

[1] 32544614

The Root Mean Squared Logarithmic Error

[1] 0.6236186

Improved Model---Decision Tree

plot of chunk unnamed-chunk-33

The mean squared error

[1] 10840382

The Root Mean Squared Logarithmic Error

[1] 0.5115337

Improved Model---Random Forest

set.seed(123)
r_forest<-randomForest::randomForest(trip_duration_log~vendor_dv+ passenger_count+pickup_longitude+weekday_dv+hour_dv1+hour_dv2+pickup_latitude+dropoff_longitude+ dropoff_latitude+ pickup_month+ pickup_hour+ pickup_weekdays+ distance_km+bearing,data=taxi_train,ntree=100,mtry=4,importance=TRUE,na.action=randomForest::na.roughfix,replace=FALSE)

alt text

Evaluation

Validation under the random forest model

The mean squared error

y_forest<- predict(r_forest, newdata = taxi_test)
test.MSE_r_forest<- mean((expm1(y_forest) - taxiData$trip_duration)^2)
test.MSE_r_forest
[1] 10908202

The Root Mean Squared Logarithmic Error

prediction_3=predict(r_forest,val)
error_3=sqrt(sum((prediction_3-val$trip_duration_log)^2)/nrow(val))
error_3
[1] 0.4016112

Conclusion

  • Because the distribution of trip duration is too skewed, the model on trip_duration_log works better to predict trip durtaion
  • Among all models, the random forest has the lowest mean squared error and root mean squared logarithmic error
  • We will be using the random forest model to predict trip duration of the test data
  • Future study: use external data to engineer more variables such as weather condition, traffic condition, and neighborhood.