Introduction

This R Markdown document provides a detailed description of the steps taken to develop models that predict the fare amount using taxi trip data.

Reading and Merging The Data

We have been provided with the data in three formats: csv, xls and json. The script “ReadandCombine” reads each data set individually using the R functions fread(), loadWorkbook() and fromJSON(). The script also checks for duplicate observations and validates that the data sets can be merged. This is done by initially running a key match using “isTRUE(all(dat1[,key] %in% dat2[,key]))”. 50,000 matches were identified. Below is a summary for each data set.

##      key             fare_amount     pickup_datetime    passenger_count
##  Length:50028       Min.   : -5.00   Length:50028       Min.   :0.000  
##  Class :character   1st Qu.:  6.00   Class :character   1st Qu.:1.000  
##  Mode  :character   Median :  8.50   Mode  :character   Median :1.000  
##                     Mean   : 11.36                      Mean   :1.668  
##                     3rd Qu.: 12.50                      3rd Qu.:2.000  
##                     Max.   :200.00                      Max.   :6.000
##      key            pickup_longitude pickup_latitude 
##  Length:50000       Min.   :-75.42   Min.   :-74.01  
##  Class :character   1st Qu.:-73.99   1st Qu.: 40.73  
##  Mode  :character   Median :-73.98   Median : 40.75  
##                     Mean   :-72.51   Mean   : 39.93  
##                     3rd Qu.:-73.97   3rd Qu.: 40.77  
##                     Max.   : 40.78   Max.   :401.08
##      key            dropoff_longitude dropoff_latitude
##  Length:50000       Min.   :-84.65    Min.   :-74.01  
##  Class :character   1st Qu.:-73.99    1st Qu.: 40.73  
##  Mode  :character   Median :-73.98    Median : 40.75  
##                     Mean   :-72.50    Mean   : 39.93  
##                     3rd Qu.:-73.96    3rd Qu.: 40.77  
##                     Max.   : 40.85    Max.   : 43.42

Here we have 7 variables and one key variable. We further noted that the extra 28 observations in the first data set were duplicates, and we removed these from our merged data. To merge the data we used a data.table left join by key.
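
A minimal sketch of the read-and-merge step is shown below. The file paths and object names are illustrative assumptions; the actual code lives in the “ReadandCombine” script.

    library(data.table)
    library(openxlsx)
    library(jsonlite)

    # Read each of the three formats
    dat_csv  <- fread("data/train.csv")
    dat_xls  <- as.data.table(read.xlsx(loadWorkbook("data/pickups.xlsx"), sheet = 1))
    dat_json <- as.data.table(fromJSON("data/dropoffs.json"))

    # Validate that the keys line up before merging
    isTRUE(all(dat_xls$key %in% dat_csv$key))

    # Drop duplicate observations, then left join the sets on the key column
    dat_csv <- unique(dat_csv, by = "key")
    dat     <- merge(dat_csv, dat_xls,  by = "key", all.x = TRUE)
    dat     <- merge(dat,     dat_json, by = "key", all.x = TRUE)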

Initial Exploratory Analysis

We ran an initial exploratory analysis to understand the scope of our data and the characteristics of the dependent variable “fare_amount”. The script that generates the results for this analysis is titled “InitialAnalysis” and is located in the R folder. The figures below show the initial distribution of the trip fare, along with the Box Cox transformation. Statistical models that make normality assumptions are known to produce more accurate results when the data is less skewed. The Box Cox transformation is used in this initial analysis instead of a log transform since we have zero and negative fare values; these values will be removed later. According to the literature, skewness values between 0 and 1 are acceptable.

## [1] "Initial skewness Value 3.562"
## [1] "skewness after BoxCox Transformation 0.124"

Below are geographical plots of the pickup (top) and drop-off (bottom) locations. It is clear that the scope of our data is within the New York region. We will use this information to enrich our data with geotag information from the Google Maps API.

The passenger_count distribution is displayed below. Here we can see that single-passenger trips are the most frequent. We also have 165 trips with zero passengers.

All of the zero-passenger trips do have a trip fare associated with them. This is an interesting finding; however, we do not have enough information to justify removing these data points from our data set.

Derived Data

The derived-data script contains the base code for additional variables that were derived from the data. These variables include:

  1. Pickup Hour
  2. Month
  3. Weekday
  4. Year
  5. Shift
  6. Distance

Pickup Hour, Month, Weekday and Year were all derived from the pickup_datetime variable. We used the lubridate R package to parse pickup_datetime into a date-time and the functions hour(), year(), wday() and month() to extract the variables. The Pickup Hour variable was then used to derive shifts: a pickup that occurred between 18h00 and 23h00 was categorized as shift zero, one that occurred between 00h00 and 05h00 as shift one, and one that occurred between 06h00 and 11h00 as shift two. All other pickups were allocated to shift three. The Pickup Hour, Shift, Weekday, Month and Year distributions are shown below.
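
A minimal sketch of the time-feature derivation, assuming the merged data is a data.table named dat and that pickup_datetime parses with ymd_hms():

    library(lubridate)
    library(data.table)

    dat[, pickup_dt := ymd_hms(pickup_datetime)]   # parse to POSIXct
    dat[, Hour    := hour(pickup_dt)]
    dat[, Month   := month(pickup_dt)]
    dat[, Weekday := wday(pickup_dt)]
    dat[, year    := year(pickup_dt)]

    # Map the hour of day onto the four shifts described above
    dat[, Shift := fifelse(Hour >= 18, 0L,
                   fifelse(Hour <= 5,  1L,
                   fifelse(Hour <= 11, 2L, 3L)))]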

The Shift and Hour variables will allow us to understand whether the time of day impacts fare prices. In some instances, high demand for trips and a low supply of taxi drivers can push fare prices up, and these variables may help explain that variation.

Similarly, there may be certain times of the week when demand for taxi cabs is high, and this might influence pricing.

From the figure above, most rides occur between 18h00 and 23h00, there is a significant decline in rides between 00h00 and 05h00, there is low variation across weekdays, and we have the least data for 2015. We will explore the relationships between variables in later sections.

Below is a plot of the “Distance” variable. The distribution of this variable is heavily right-skewed, with most of the mass at short distances. The variable was derived using the geosphere library: the function distm() calculates the geospatial distance between two points, and we used this to estimate the distance between the pickup and drop-off locations. Distance is an important feature in determining the price of a trip: the larger the distance travelled, the more fuel and the more frequent car maintenance is required. In theory, fare price and trip distance should have a strong relationship.

## [1] "The skewness below removing outliers 18.3731592526613"
## [1] "The skewness after removing outliers Distance < 100km 2.92246957621679"

Data From Google Maps API

The R package ggmap allows one to connect to the Google Maps API to obtain location data. We used this to tag longitude and latitude coordinates with states; for neighborhood data, a REST API call was made. Before we could do this we first had to create a Google Cloud project and use the API key it provides (a sketch of the call is shown after the list below). The new variables added included:

  1. Pickup neighborhood
  2. Drop off neighborhood
  3. Pickup State
  4. Drop off State
  5. Pickup borough
  6. Drop off borough
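
A minimal sketch of a single reverse-geocoding call with ggmap. The key, the chosen row and the parsing of the response are illustrative assumptions; in practice the script iterates over all pickup and drop-off coordinates.

    library(ggmap)
    register_google(key = "YOUR_API_KEY")   # key from the Google Cloud project

    # Reverse geocode one pickup point; output = "all" returns the full response,
    # from which neighborhood, borough and state components can be extracted
    res  <- revgeocode(c(dat$pickup_longitude[1], dat$pickup_latitude[1]),
                       output = "all")
    comp <- res$results[[1]]$address_components
    sapply(comp, function(x) paste(x$types, collapse = ", "))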

This data was collected to explore whether there is any relationship between an area within a city and fare prices. In some instances, taxi cabs that operate within a specific area may be more prestigious and charge higher rates.

From the above we can confirm that the majority of drop-offs and pickups are in New York. Below is the neighborhood plot. Here the majority of pickups (top) and drop-offs (bottom) can clearly be seen occurring in the centre, i.e. Midtown. Manhattan is the borough with the most pickups and drop-offs. A few other states also feature in our data, such as Connecticut, Pennsylvania, New Jersey and Massachusetts.

Weather Data

It is possible that bad weather conditions could discourage people from taking public transport. For example, someone is less likely to want to wait in the rain or in extreme cold when they have the option of the comfort of their own vehicle, thereby reducing demand for taxi cabs. This may in turn encourage taxi companies to apply discounts. By obtaining weather information we can test this hypothesis.

To obtain weather data we used the National Centers for Environmental Information (NCEI, part of NOAA) web service. We accessed their data through its REST API using the rnoaa package. The script that produces this data is “Weather”, found in the R folder. We specifically used the NYC station identifier “GHCND:USW00094728” and the data set id “GHCND”, which is a summary of daily weather conditions. The summaries we obtained using individual API calls (one such call is sketched below the list) included:

  1. PRCP
  2. Minimum Temperature
  3. Maximum Temperature
  4. Snow
  5. Strong winds

PRCP is the daily precipitation recorded at the station and gives an indication of how much rain fell in the area.
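
A minimal sketch of one such rnoaa call for PRCP. The token and the date window are illustrative; the ncdc() endpoint limits each request to roughly a one-year range, so the script loops over years and binds the results.

    library(rnoaa)
    options(noaakey = "YOUR_NOAA_TOKEN")   # token issued by the NCEI web service

    prcp_2014 <- ncdc(datasetid  = "GHCND",
                      stationid  = "GHCND:USW00094728",
                      datatypeid = "PRCP",
                      startdate  = "2014-01-01",
                      enddate    = "2014-12-31",
                      limit      = 1000)
    head(prcp_2014$data)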

The summaries for each weather variable are shown in the plots below.

Our final data set therefore contained the 23 variables listed below. In the next section we take a look at the relationship between fare amount and these variables.

##  [1] "PRCP"              "TMIN"              "TMAX"             
##  [4] "SNOW"              "SNWD"              "fare_amount"      
##  [7] "passenger_count"   "Hour"              "Month"            
## [10] "Weekday"           "neighborhood"      "drop.neighborhood"
## [13] "borough"           "drop.borough"      "State"            
## [16] "Distance"          "Shift"             "pickup_longitude" 
## [19] "pickup_latitude"   "dropoff_longitude" "dropoff_latitude" 
## [22] "Drop_State"        "year"

Numerical Data Relationships

Here we explore linear relationships in the data using correlations, and then plot the highly correlated variables. Before we do this we first split our data into an 80% train set and a 20% test set. We convert all nominal categorical variables to factors before applying the correlation; in this instance Month and Weekday are transformed to factor variables.
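
A minimal sketch of the split and correlation screen, assuming the prepared data is a data.frame named dat; the seed, cutoff and column handling are illustrative. The verbose output further below is the kind produced by caret's findCorrelation().

    library(caret)

    set.seed(123)                                   # illustrative seed
    idx   <- createDataPartition(dat$fare_amount, p = 0.8, list = FALSE)
    train <- dat[idx, ]
    test  <- dat[-idx, ]

    # Nominal categorical variables become factors before the correlation screen
    train$Month   <- as.factor(train$Month)
    train$Weekday <- as.factor(train$Weekday)

    num_vars <- train[, sapply(train, is.numeric)]
    cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
    findCorrelation(cor_mat, cutoff = 0.8, verbose = TRUE)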

The only attribute with a significant correlation to fare amount is distance. The minimum and maximum temperatures were highly correlated, and similarly the latitude and longitude variables are highly correlated with each other. We therefore combined the minimum and maximum temperatures into an average temperature variable.

## Compare row 9  and column  11 with corr  1 
##   Means:  0.239 vs 0.089 so flagging column 9 
## Compare row 11  and column  12 with corr  0.984 
##   Means:  0.176 vs 0.068 so flagging column 11 
## Compare row 12  and column  10 with corr  1 
##   Means:  0.103 vs 0.055 so flagging column 12 
## All correlations <= 0.8

Below is a plot of fare amount vs. Distance. Here we can see evidence of the linear relationship.

Visualising Categorical Variable Relationships with Trip Data

We assess the relationship between fare amount and the categorical variables through visual inspection.

NAs and Missing Data

Many algorithms, such as XGBoost and random forest implementations, do not handle missing data well and simply omit incomplete observations. This becomes a problem when the number of missing data points is large relative to the size of our data. Here we look at the number of missing data points.

## 
##  Variables sorted by number of missings: 
##           Variable      Count
##           Distance 0.14924057
##               year 0.14924057
##       neighborhood 0.02454615
##            borough 0.02454615
##              State 0.02454615
##            Weekday 0.02286298
##  drop.neighborhood 0.02286298
##       drop.borough 0.02286298
##               PRCP 0.00000000
##               SNOW 0.00000000
##               SNWD 0.00000000
##                key 0.00000000
##        fare_amount 0.00000000
##    passenger_count 0.00000000
##               Hour 0.00000000
##              Month 0.00000000
##              Shift 0.00000000
##   pickup_longitude 0.00000000
##    pickup_latitude 0.00000000
##  dropoff_longitude 0.00000000
##   dropoff_latitude 0.00000000
##         Drop_State 0.00000000
##              label 0.00000000
##               TAVG 0.00000000

The majority of our missing data is in the Distance and year variables, i.e. approximately 15% each. For Distance this is due to latitude and longitude values that have been recorded as zeros. In this analysis missing data is ignored.
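
The table above appears to come from a missing-data summary (e.g. the VIM package); an equivalent quick check in base R, assuming the data.frame dat, is:

    # Proportion of missing values per column, highest first
    sort(colMeans(is.na(dat)), decreasing = TRUE)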

Transformations and Preprocessing summary

One-hot encoding of categorical variables: We apply one-hot encoding to nominal categorical variables. This was applied to the Weekday variable.

Outlier removal: Outliers identified during the exploratory analysis were removed from the data. These included distances greater than 100 km, latitude values outside the valid range of -90 to 90, and trip fares less than or equal to zero. This produced a data set of 48,616 observations.

Log transform: This was applied to the fare amount.

Statistical methods such as generalized linear models make assumptions about the shape of the data. To compare these methods effectively we apply the normality transformations above.
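
A minimal sketch of the preprocessing steps above, assuming a data.frame dat with the derived variables; the object names are illustrative.

    library(caret)

    # Remove the outliers identified during the exploratory analysis
    dat <- subset(dat, Distance < 100 &
                       abs(pickup_latitude)  <= 90 &
                       abs(dropoff_latitude) <= 90 &
                       fare_amount > 0)

    # One-hot encode the nominal Weekday variable
    dat$Weekday <- as.factor(dat$Weekday)
    dmy <- dummyVars(~ Weekday, data = dat)
    dat <- cbind(dat, predict(dmy, newdata = dat))

    # Log transform the (now strictly positive) fare amount
    dat$log_fare <- log(dat$fare_amount)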

Variable Selection

Based on the outcome from the exploratory analysis we select the following features for our analysis:

  1. Distance
  2. Hour
  3. Shift
  4. Passenger count
  5. Weekday
  6. Year
  7. TAVG
  8. PRCP

These features show some variation when plotted against fare_amount. Though the variation in mean for Weekday was very weak, we include it for exploratory reasons.

Models

The models trained included random forest, xgboost, GLM elastic net and a neural network. The models were built in R, with the exception of the neural net, which was built in Python. The R model scripts are in the R folder and titled Models. The Python model can be found in the Tensorflow folder.

Random Forest

The random forest model was run using the randomForest package. The best tuned model, after applying 10-fold cross validation with 3 repeats, had an mtry of 4 and 500 trees. Cross validation is a method that helps prevent the model from overfitting.
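
A minimal sketch of tuning the random forest with repeated cross validation via caret; the object names are illustrative, and the actual call used in the Models script is printed further below.

    library(caret)
    library(randomForest)

    control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
    rf_fit  <- train(fare_amount ~ ., data = train,
                     method     = "rf",
                     ntree      = 500,
                     importance = TRUE,
                     trControl  = control)

    rf_fit$bestTune                 # mtry selected by cross validation
    varImpPlot(rf_fit$finalModel)   # variable importance plot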

Below is the variable importance plot. Looking at it, distance, followed by average temperature, makes the most significant splits. Passenger count and shift are the least important features according to this method.

## 
## Call:
##  randomForest(formula = fare_amount ~ ., data = train, importance = TRUE,      ntree = 500, trControl = control) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.07021078
##                     % Var explained: 79.47

The figure below plots test error against out-of-bag (OOB) error at different mtry values with 200 trees. The mtry tuning parameter controls the number of variables considered at each split and is therefore an important parameter for controlling overfitting. The figure shows a large increase in test and OOB error after mtry = 2, which indicates that the model starts overfitting beyond this value.

We evaluate the model on the test data using RMSE and RMSLE.

## [1] "The RMSE is 5.476"
## [1] "The RMSLE is 0.25"

XGBoost

The xgboost model was run using the caret package. To tune this model we created a tuning grid, “xgbTreeGrid”. Tuning grids help find the combination of parameters with the lowest RMSE score on the cross validation samples. We implemented 10-fold cross validation, which improves the model's ability to generalize to unseen data.
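
A minimal sketch of the grid search with caret's xgbTree method. The candidate values in the grid are illustrative; they simply need to include the winning combination reported below.

    library(caret)

    xgbTreeGrid <- expand.grid(nrounds          = c(500, 1000),
                               max_depth        = c(2, 4, 6),
                               eta              = c(0.01, 0.1),
                               gamma            = 0,
                               colsample_bytree = c(0.4, 0.8),
                               min_child_weight = 1,
                               subsample        = c(0.4, 0.8))

    xgb_fit <- train(fare_amount ~ ., data = train,
                     method    = "xgbTree",
                     metric    = "RMSE",
                     tuneGrid  = xgbTreeGrid,
                     trControl = trainControl(method = "cv", number = 10))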

RMSE was used to select the optimal model (smallest value). The final values used for the model were nrounds = 1000, max_depth = 2, eta = 0.01, gamma = 0, colsample_bytree = 0.4, min_child_weight = 1 and subsample = 0.4. The xgboost model has distance as the only important feature; Fridays and Sundays appear to be the most important days when it comes to explaining fare amount.

## [1] "The test RMSE is 5.323"
## [1] "The test RMSLE is 0.251"

Neural Network (TensorFlow)

The Python Jupyter notebook is located in the Tensorflow model folder. The neural network selected for this analysis was a simple MLP (multilayer perceptron). The model contained one hidden layer of 5 neurons, a tanh activation function and a dropout rate of 10%. The model was very quick to overfit; adding dropout regularization improved training, and the learning rate was reduced to 1e-4 using the Adam optimizer. The tanh function is reported in the literature to handle large gradients better and was therefore used in the network. Below is a figure of the training process over 10 epochs with a batch size of 10. Approaching epoch 10 the model starts to overfit, and the final model was run for 10 epochs.
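
For reference, the same architecture can be sketched from R through the keras interface; the feature matrices x_train/x_test and response vectors y_train/y_test are illustrative names, and the actual model lives in the Python notebook.

    library(keras)

    model <- keras_model_sequential() %>%
      layer_dense(units = 5, activation = "tanh",
                  input_shape = ncol(x_train)) %>%   # one hidden layer of 5 neurons
      layer_dropout(rate = 0.1) %>%                  # 10% dropout regularization
      layer_dense(units = 1)                         # single output: fare amount

    model %>% compile(optimizer = optimizer_adam(learning_rate = 1e-4),
                      loss = "mse")

    history <- model %>% fit(x_train, y_train,
                             epochs = 10, batch_size = 10,
                             validation_data = list(x_test, y_test))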

The figure below shows the training process without dropout regularization. Here we can see the model start to overfit earlier, at epoch 4. Overfitting is identified when the test error starts to become larger than the training error.

## [1] "The test RMSE is 9.875"
## [1] "The test RMSLE is 0.482"

GLM Elastic Net

Logistic regression is a generalized linear model used when predicting a binary outcome. Since our outcome is continuous, we instead train a GLM with a Gaussian family. We apply elastic net for regularization and feature selection. The best model selected had alpha = 0.5 and lambda = 0.001.
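
A minimal sketch of the elastic net fit via caret's glmnet method; the tuning grid values are illustrative but include the selected combination.

    library(caret)

    glmnet_fit <- train(fare_amount ~ ., data = train,
                        method    = "glmnet",
                        family    = "gaussian",
                        metric    = "RMSE",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneGrid  = expand.grid(alpha  = c(0, 0.5, 1),
                                                lambda = c(0.001, 0.01, 0.1)))
    glmnet_fit$bestTune   # alpha = 0.5, lambda = 0.001 was selected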

## [1] "The test RMSE is 26.91"
## [1] "The test RMSLE is 0.321"

Final Model Ensemble

The final model is an ensemble of xgboost, gradient boosting (gbm) and three glmnet models (lasso, ridge and elastic net regression), as listed in the output below. Combining the tree-based methods with the regularized linear models produced a model with the following results:

## The following models were ensembled: xgbTree, gbm, glmnet, glmnet.1, glmnet.2 
## They were weighted: 
## -0.2651 0.4832 0.5615 -1.0759 NA 1.1531
## The resulting RMSE is: 0.2639
## The fit for each individual model on the RMSE is: 
##    method      RMSE      RMSESD
##   xgbTree 0.2666084 0.003825753
##       gbm 0.2645727 0.004082647
##    glmnet 0.3286478 0.003332821
##  glmnet.1 0.3286478 0.003332821
##  glmnet.2 0.3305106 0.002846398
## [1] "The test RMSE is 4.932"
## [1] "The test RMSLE is 0.243"

The residuals plot is depicted below.

The residuals plot shows the presence of an outlier. As part of model improvement it is recommended to remove this outlier.

Conclusion

In this assessment I demonstrated how to combine data sets of different formats and enrich data through derived variables and API sources such as Google Maps and climate data. An exploratory analysis was then performed, in which it was discovered that distance had the strongest linear relationship with fare amount, an expected outcome. Other variables did not reveal strong linear dependencies with fare amount. Given this, it was necessary to explore non-parametric methods such as xgboost, random forest and neural networks.

RMSE was used to compare model performance, as it is the most popular measure for evaluating regression models. However, RMSE penalizes large differences heavily, so the alternative measure RMSLE (Root Mean Squared Logarithmic Error) is used to provide a different view; RMSLE reduces the impact of extremely large values.

Our results show that the non-parametric methods perform better than the generalized linear models, with the ensemble model achieving the lowest error rates. Neural networks have a greater capability for detecting and modelling non-linearity, but the power of the neural network approach was not demonstrated in this analysis. This is mainly because we had very few input variables, and neural networks have many configuration parameters that we did not have sufficient time to explore.

An attribute that was missing but would likely have a strong relationship with fare amount is trip duration. I was not able to estimate this, as drop-off time was not provided in the data. The residuals plot showed the presence of an outlier, point 8601 in the training data; as part of model improvement it is recommended to remove this point and investigate it further.