Qufei Wang Xuejia Xu
NYC Taxi Trip Duration Prediction
resource: https://www.kaggle.com/c/nyc-taxi-trip-duration/data
Have you ever worried about being late for a meeting and are unsure of whether you should take a taxi or not? Knowing the accurate ETA can help developing a software where people can call a taxi and know right away exactly when they will arrive at their destination!
In this project, we will build a model to predict taxi trip duration in New York City!
The training data is consisted of 1.5 million observations and 10 features.
nrow(train_df)
[1] 1458644
The test dataset includes about 600000 trips for us to predict
nrow(test_df)
[1] 459855
Luckily the data is clean!
[1] 0
[1] 0
It is very important to look at how trip duration is distributed in our dataset.
The graph looks wired. Therefore I took another look at the summary statistics of trip duration
summary(train_df$trip_duration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 397 662 959 1075 3526282
It turns out that trip duration has some extreme outliers such as trips that lasted for almost a day.
After log transformation on trip duration,the distribution looks much more normal
The pickup datetime is of data type datetime and is not convenient for modeling. Therefore we reformatted the feature into pickup_month, pickup_day, pickup_hour and pickup_weekday
train_df$pickup_month=month(train_df$pickup_datetime)
train_df$pickup_day=as.Date(train_df$pickup_datetime)
train_df$pickup_hour=hour(train_df$pickup_datetime)
train_df$pickup_weekdays=wday(train_df$pickup_datetime)
pickup_month pickup_day pickup_hour pickup_weekdays
1 3 2016-03-14 17 2
It is pretty obvious from the graph below that weekdays, especially Tuesday to Friday have longer trips than weekends and Monday do.Thus we created a dummy variable with 0 being weekends and Monday and 1 being weekdays and included it in our model to test its predicitve power.
train_df$weekday_dv<-ifelse((train_df$pickup_weekdays ==1 |train_df$pickup_weekdays ==2|train_df$pickup_weekdays ==7),0,1)
As we can tell from the map below, the three major pickup/dropoff locations are Manhattan, JFK airport and LaGuardia Airport.
Take away here: -Create two dummy variables indicating whether a trip is from/to the airports. -Use external data to group pickup/dropoff locations into different neighborhoods in NYC
*Calculating distance with the Vincenty (ellipsoid) method and geosphere package
train_df$distance_km<-NA
train_df[,'distance_km']<-distVincentyEllipsoid(matrix(c(train_df$pickup_longitude,train_df$pickup_latitude),ncol=2),
matrix(c(train_df$dropoff_longitude,train_df$dropoff_latitude), ncol=2), a=6378137, b=6356752.3142, f=1/298.257223563)/1000
*Calculating bearing which is the direction of the trip using function inside the same package
train_df$bearing<-bearing(matrix(c(train_df$pickup_longitude,train_df$pickup_latitude),
ncol=2),matrix(c(train_df$dropoff_longitude,train_df$dropoff_latitude), ncol=2))
We can see that vendor 1 and vendor 2 both have similar pattern for the business of the day od the week. Their trip durations increase from Monday and then reach the highest point on Friday. During the weekend, their trip durations decrease.
AIC Value:
[1] 1414115
Validation under full regression model:
*Calculating the Root Mean Squared Logarithmic Error
[1] 0.6255181
*Calculating the Mean Squared Error
[1] 13964780
Call:
lm(formula = trip_duration ~ vendor_id + passenger_count + pickup_longitude +
weekday_dv + hour_dv1 + hour_dv2 + pickup_latitude + distance_km +
pickup_day + bearing, data = taxi1)
Residuals:
Min 1Q Median 3Q Max
-6835 -341 -157 68 85709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.055e+05 2.758e+04 -3.825 0.000131 ***
vendor_id 1.891e+02 2.293e+01 8.244 < 2e-16 ***
passenger_count 2.002e+01 8.765e+00 2.284 0.022385 *
pickup_longitude -1.098e+03 2.899e+02 -3.789 0.000151 ***
weekday_dv 8.455e+01 2.231e+01 3.789 0.000151 ***
hour_dv1 -2.583e+02 5.509e+01 -4.689 2.75e-06 ***
hour_dv2 1.752e+02 2.262e+01 7.744 9.75e-15 ***
pickup_latitude 3.972e+02 4.132e+02 0.961 0.336388
distance_km 1.407e+02 3.300e+00 42.654 < 2e-16 ***
pickup_day 4.765e-01 2.129e-01 2.238 0.025216 *
bearing 7.449e-02 1.081e-01 0.689 0.490605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3006 on 74989 degrees of freedom
Multiple R-squared: 0.03186, Adjusted R-squared: 0.03173
F-statistic: 246.7 on 10 and 74989 DF, p-value: < 2.2e-16
AIC Value:
[1] 1414127
Validation on reduced model *Calculating the Root Mean Squared Logarithmic Error
[1] 0.6282666
*Calculating the Mean Squared Error
[1] 13964201
Validation under the tree model: Calculating the Root Mean Squared Logarithmic Error
[1] 0.7174902
Validation under the random forest model:
[1] 0.5417291
Call:
lm(formula = trip_duration_log ~ vendor_dv + passenger_count +
pickup_longitude + weekday_dv + hour_dv1 + hour_dv2 + pickup_day +
pickup_latitude + distance_km + bearing, data = taxi_train)
Residuals:
Min 1Q Median 3Q Max
-13.0414 -0.3056 0.0574 0.3656 5.6944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.067e+02 5.407e+00 -38.224 < 2e-16 ***
vendor_dv 2.174e-02 4.497e-03 4.835 1.34e-06 ***
passenger_count 7.523e-03 1.719e-03 4.378 1.20e-05 ***
pickup_longitude -2.476e+00 5.684e-02 -43.564 < 2e-16 ***
weekday_dv 1.243e-01 4.375e-03 28.419 < 2e-16 ***
hour_dv1 -3.432e-01 1.080e-02 -31.770 < 2e-16 ***
hour_dv2 1.527e-01 4.435e-03 34.439 < 2e-16 ***
pickup_day 5.314e-04 4.175e-05 12.728 < 2e-16 ***
pickup_latitude 4.995e-01 8.102e-02 6.165 7.10e-10 ***
distance_km 1.418e-01 6.470e-04 219.164 < 2e-16 ***
bearing -8.021e-06 2.119e-05 -0.379 0.705
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5895 on 74989 degrees of freedom
Multiple R-squared: 0.4369, Adjusted R-squared: 0.4368
F-statistic: 5819 on 10 and 74989 DF, p-value: < 2.2e-16
AIC Value
[1] 133575.8
The Mean Squared Error
[1] 32544614
The Root Mean Squared Logarithmic Error
[1] 0.6236186
The mean squared error
[1] 10840382
The Root Mean Squared Logarithmic Error
[1] 0.5115337
set.seed(123)
r_forest<-randomForest::randomForest(trip_duration_log~vendor_dv+ passenger_count+pickup_longitude+weekday_dv+hour_dv1+hour_dv2+pickup_latitude+dropoff_longitude+ dropoff_latitude+ pickup_month+ pickup_hour+ pickup_weekdays+ distance_km+bearing,data=taxi_train,ntree=100,mtry=4,importance=TRUE,na.action=randomForest::na.roughfix,replace=FALSE)
Validation under the random forest model
The mean squared error
y_forest<- predict(r_forest, newdata = taxi_test)
test.MSE_r_forest<- mean((expm1(y_forest) - taxiData$trip_duration)^2)
test.MSE_r_forest
[1] 10908202
The Root Mean Squared Logarithmic Error
prediction_3=predict(r_forest,val)
error_3=sqrt(sum((prediction_3-val$trip_duration_log)^2)/nrow(val))
error_3
[1] 0.4016112