DS3001 Final Project: Model to Predict Rideshare Costs

Author

Peneeta Wojcik

Introduction

Ride share services have become increasingly popular in the past decade, with Uber and Lyft the leading companies in the field. To keep up with demand and increase profitability, these companies use dynamic pricing, which raises fares during busy periods to encourage more drivers to pick up requests (more about this on the Uber blog).

In the following analysis, we will examine trends in ride share data and develop a machine-learning model to predict the price of a ride based on time of day, distance, and other factors. We will use the Uber and Lyft Dataset Boston, MA available on Kaggle to conduct these analyses.

# Setup and Libraries Used
library(tidyverse)  # attaches tidyr, dplyr, readr, and ggplot2
library(mice)

# Import the data
rides <- read_csv("~/R/DS3001/Final/rideshare_kaggle.csv")

Exploratory Data Analysis

Before developing the model, we need to understand the data set. This involves posing a few questions such as the ones below:

1. Which day has the most rides?

2. What time of day has the most rides?

To answer these questions, we’ll first need to clean the data by removing columns that are not relevant to the questions at hand.

Clean the Data

# Remove rows containing missing (NA) values
rides <- rides[complete.cases(rides), ]
sum(is.na(rides))
[1] 0
# Drop unnecessary columns
drop <- c("id", "icon", "timestamp", "datetime", "long_summary", "latitude", "longitude", "windGustTime", "sunriseTime", "sunsetTime", "apparentTemperatureHighTime", "apparentTemperatureLowTime", "temperatureLowTime", "uvIndexTime", "temperatureMinTime", "apparentTemperatureMinTime","temperatureMaxTime", "apparentTemperatureMaxTime", "ozone", "apparentTemperatureMin", "apparentTemperatureMax", "apparentTemperatureHigh", "apparentTemperatureLow", "temperatureHighTime", "timezone", "product_id")
ridesClean <- rides[, !(names(rides) %in% drop)]

# View structure of data (omitted to save space)
#str(ridesClean)

Cab Type: Uber vs Lyft

Here, we will examine how many data points are Uber rides and how many are Lyft rides, to check whether one ride share company is overrepresented in the data set.

# Table results
cabType <- table(ridesClean$cab_type)
cabPercent <- cbind(cabType, round(prop.table(cabType)*100, 2))
colnames(cabPercent) <- c("Count", "Percent")
cabPercent <- as.data.frame(cabPercent)

# Create a pie chart to visualize results
pieLabels <- paste0(c("Lyft", "Uber"), " = ", cabPercent$Percent, "%")
pie(cabPercent$Percent, labels = pieLabels, main = "Percentage of Cabs in Data Set", col=topo.colors(2))

From the results, the percentages of Uber and Lyft data points are approximately 52% and 48% respectively, indicating a near even split.

Day with the Most Rides

First, we will determine the day of the month with the most rides. Since this data set contains only about two months of data, days of the month were examined rather than peak months.

# Create a Data Set with Day and Frequency of Rides
ridesPerDay <- ridesClean 
ridesPerDay <- ridesPerDay %>% group_by(day) %>% summarise(Freq=n())

# Create a Bar Graph
p<-ggplot(data=ridesPerDay, aes(x=day, y=Freq)) +
  geom_bar(stat="identity", fill="dodgerblue1")+
  theme_minimal()
p + labs(x = "Day of the Month", y = "Number of Rides", title = "Frequency of Rides on a Day of the Month")

From this, it appears there are three peaks where rides occur most often. Most rides occurred at the end of the month, and ride frequencies at the beginning and the middle of the month were similar.

Time with the Most Rides

Next, we’ll look at the time with the most rides.

# Create dataset to group frequency of data
ridesTime <- ridesClean 
ridesTime <- ridesTime %>% group_by(hour) %>% summarise(Freq1=n())

# Create a line graph of ride counts by hour
p2 <- ggplot(data=ridesTime, aes(x=hour, y=Freq1)) +
      geom_freqpoly(stat='identity', color='dodgerblue1', lwd=1) +
      theme_minimal()

p2 + labs(x = "Hour", y = "Number of Rides", title = "Frequency of Rides at a Particular Hour")

The general pattern throughout the day is that most rides occur between 21:00 (9:00 pm) and 1:00 (1:00 am), and between 10:00 (10:00 am) and 17:00 (5:00 pm).

How is Ride Frequency Affected by Weather?

p3 <- ggplot(data=ridesClean, aes(short_summary)) +
    labs(x="Weather Description", title="Frequency of Rides Per Weather Description") +
    geom_bar(fill = 'dodgerblue1') + coord_flip() + theme_minimal()

p3


Price Prediction Model

Here, we will develop a price prediction model using several variants of linear regression. Linear regression was chosen because the price of a ride typically depends mainly on the distance traveled, with fares increasing at a roughly constant rate per mile. The variants we’ll be looking at are below:

  • Classic Linear Regression

  • Ridge Linear Regression

  • Lasso Linear Regression

These models will be compared with one another to determine which one is the most accurate in predicting ride share costs; all of them build on the same least-squares objective, shown below.
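
All three start from the same linear model: the predicted price is a weighted sum of the input variables, and classic (ordinary least squares) regression chooses the coefficients that minimize the squared prediction error:

$$\hat{\beta}^{\mathrm{OLS}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2$$

Ridge and lasso regression each add a penalty term to this objective, as described in their sections below.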

First, we’ll use some additional libraries.

library(mltools)
library(data.table)

Next, organize the categorical data into factors and one-hot encode them.

# Create a new df from original
dfModel <- ridesClean

# Turn only character columns to factor columns
dfModel[sapply(dfModel, is.character)] <- lapply(dfModel[sapply(dfModel, is.character)], as.factor)

# One-Hot Encode
dfModel <- one_hot(as.data.table(dfModel), cols = "auto", sparsifyNAs = TRUE, naCols = FALSE, dropCols = TRUE, dropUnusedLevels = TRUE)

# Looks good!  (Omitted to save space)
#str(dfModel)

Let’s split the data into train and test sets. Here we’ll put 80% of the data in the train set and the remaining 20% in the test set.

# Partition 80% of the data
set.seed(1)
part_index_1 <- caret::createDataPartition(dfModel$`price`,
                                           times=1,
                                           p = 0.80,
                                           groups=1,
                                           list=FALSE)

train <- dfModel[part_index_1,]
test <- dfModel[-part_index_1, ]


# Check dimensions
dims <- data.frame("Train Size" = nrow(train), "Test Size" = nrow(test))
dims
  Train.Size Test.Size
1     510381    127595

Basic Linear Regression Model

This linear regression model takes multiple input variables into account: price is modeled as a linear combination of all remaining columns in the data set.

# Creating a simple Linear Regression Model
lm_model <- lm(price ~ ., data = train)
summary(lm_model)

Call:
lm(formula = price ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.113  -1.424  -0.153   1.256  53.711 

Coefficients: (8 not defined because of singularities)
                                        Estimate Std. Error  t value Pr(>|t|)
(Intercept)                           -1.454e+01  8.708e-01  -16.697  < 2e-16
hour                                   9.424e-05  6.382e-04    0.148  0.88262
day                                   -3.005e-03  2.183e-03   -1.377  0.16863
month                                 -5.533e-02  4.987e-02   -1.110  0.26719
`source_Back Bay`                     -5.052e-02  1.710e-02   -2.953  0.00314
`source_Beacon Hill`                  -3.737e-01  1.711e-02  -21.834  < 2e-16
`source_Boston University`            -3.193e-01  2.382e-02  -13.402  < 2e-16
source_Fenway                         -1.305e-01  2.357e-02   -5.539 3.04e-08
`source_Financial District`            3.003e-01  1.716e-02   17.502  < 2e-16
`source_Haymarket Square`              3.506e-01  2.327e-02   15.063  < 2e-16
`source_North End`                     5.123e-01  2.319e-02   22.095  < 2e-16
`source_North Station`                 1.829e-02  1.715e-02    1.067  0.28600
`source_Northeastern University`      -3.362e-01  2.358e-02  -14.259  < 2e-16
`source_South Station`                 1.493e-01  2.317e-02    6.443 1.17e-10
`source_Theatre District`              4.687e-01  1.712e-02   27.377  < 2e-16
`source_West End`                             NA         NA       NA       NA
`destination_Back Bay`                 4.510e-02  1.710e-02    2.638  0.00834
`destination_Beacon Hill`             -2.740e-01  1.714e-02  -15.986  < 2e-16
`destination_Boston University`        2.219e-02  1.799e-02    1.233  0.21748
destination_Fenway                    -2.992e-01  1.779e-02  -16.815  < 2e-16
`destination_Financial District`       4.408e-01  1.715e-02   25.708  < 2e-16
`destination_Haymarket Square`         2.374e-01  1.711e-02   13.878  < 2e-16
`destination_North End`                9.632e-02  1.706e-02    5.646 1.64e-08
`destination_North Station`            2.347e-01  1.716e-02   13.682  < 2e-16
`destination_Northeastern University`  3.437e-02  1.759e-02    1.954  0.05067
`destination_South Station`                   NA         NA       NA       NA
`destination_Theatre District`         2.864e-01  1.711e-02   16.743  < 2e-16
`destination_West End`                        NA         NA       NA       NA
cab_type_Lyft                         -3.713e+00  1.709e-02 -217.305  < 2e-16
cab_type_Uber                                 NA         NA       NA       NA
name_Black                             1.076e+01  1.678e-02  641.065  < 2e-16
`name_Black SUV`                       2.052e+01  1.678e-02 1222.577  < 2e-16
name_Lux                               1.104e+01  1.746e-02  632.478  < 2e-16
`name_Lux Black`                       1.635e+01  1.745e-02  936.745  < 2e-16
`name_Lux Black XL`                    2.559e+01  1.745e-02 1466.609  < 2e-16
name_Lyft                              2.874e+00  1.744e-02  164.766  < 2e-16
`name_Lyft XL`                         8.570e+00  1.745e-02  490.964  < 2e-16
name_Shared                                   NA         NA       NA       NA
name_UberPool                         -1.009e+00  1.677e-02  -60.163  < 2e-16
name_UberX                            -1.992e-03  1.676e-02   -0.119  0.90538
name_UberXL                            5.918e+00  1.677e-02  352.862  < 2e-16
name_WAV                                      NA         NA       NA       NA
distance                               2.887e+00  4.005e-03  720.906  < 2e-16
surge_multiplier                       1.843e+01  3.737e-02  493.270  < 2e-16
temperature                            5.812e-03  1.294e-02    0.449  0.65322
apparentTemperature                   -4.598e-03  4.467e-03   -1.029  0.30335
short_summary_Clear                   -1.031e-01  1.095e-01   -0.942  0.34623
short_summary_Drizzle                  7.845e-02  7.239e-02    1.084  0.27846
short_summary_Foggy                   -1.888e-02  1.088e-01   -0.174  0.86218
`short_summary_Light Rain`            -1.564e-02  4.323e-02   -0.362  0.71755
`short_summary_Mostly Cloudy`          3.766e-03  1.007e-01    0.037  0.97016
short_summary_Overcast                 2.356e-02  9.911e-02    0.238  0.81207
`short_summary_Partly Cloudy`         -5.561e-02  1.042e-01   -0.534  0.59354
`short_summary_Possible Drizzle`       9.663e-03  7.401e-02    0.131  0.89612
short_summary_Rain                            NA         NA       NA       NA
precipIntensity                       -5.399e-02  5.274e-01   -0.102  0.91845
precipProbability                      5.687e-02  8.558e-02    0.665  0.50636
humidity                              -3.631e-02  4.028e-01   -0.090  0.92818
windSpeed                             -1.214e-02  5.283e-03   -2.299  0.02152
windGust                               4.905e-03  2.358e-03    2.080  0.03753
visibility                            -2.177e-03  3.440e-03   -0.633  0.52679
temperatureHigh                       -7.538e-03  6.124e-03   -1.231  0.21836
temperatureLow                         6.300e-04  9.535e-04    0.661  0.50878
dewPoint                               1.106e-03  1.124e-02    0.098  0.92162
pressure                               2.735e-04  5.749e-04    0.476  0.63421
windBearing                            1.025e-04  6.527e-05    1.570  0.11633
cloudCover                            -1.507e-01  5.207e-02   -2.894  0.00381
uvIndex                               -1.354e-02  8.600e-03   -1.575  0.11527
visibility.1                                  NA         NA       NA       NA
moonPhase                             -4.642e-02  5.217e-02   -0.890  0.37355
precipIntensityMax                    -5.439e-02  1.157e-01   -0.470  0.63830
temperatureMin                         3.521e-03  1.724e-03    2.043  0.04107
temperatureMax                         3.133e-03  5.837e-03    0.537  0.59149
                                         
(Intercept)                           ***
hour                                     
day                                      
month                                    
`source_Back Bay`                     ** 
`source_Beacon Hill`                  ***
`source_Boston University`            ***
source_Fenway                         ***
`source_Financial District`           ***
`source_Haymarket Square`             ***
`source_North End`                    ***
`source_North Station`                   
`source_Northeastern University`      ***
`source_South Station`                ***
`source_Theatre District`             ***
`source_West End`                        
`destination_Back Bay`                ** 
`destination_Beacon Hill`             ***
`destination_Boston University`          
destination_Fenway                    ***
`destination_Financial District`      ***
`destination_Haymarket Square`        ***
`destination_North End`               ***
`destination_North Station`           ***
`destination_Northeastern University` .  
`destination_South Station`              
`destination_Theatre District`        ***
`destination_West End`                   
cab_type_Lyft                         ***
cab_type_Uber                            
name_Black                            ***
`name_Black SUV`                      ***
name_Lux                              ***
`name_Lux Black`                      ***
`name_Lux Black XL`                   ***
name_Lyft                             ***
`name_Lyft XL`                        ***
name_Shared                              
name_UberPool                         ***
name_UberX                               
name_UberXL                           ***
name_WAV                                 
distance                              ***
surge_multiplier                      ***
temperature                              
apparentTemperature                      
short_summary_Clear                      
short_summary_Drizzle                    
short_summary_Foggy                      
`short_summary_Light Rain`               
`short_summary_Mostly Cloudy`            
short_summary_Overcast                   
`short_summary_Partly Cloudy`            
`short_summary_Possible Drizzle`         
short_summary_Rain                       
precipIntensity                          
precipProbability                        
humidity                                 
windSpeed                             *  
windGust                              *  
visibility                               
temperatureHigh                          
temperatureLow                           
dewPoint                                 
pressure                                 
windBearing                              
cloudCover                            ** 
uvIndex                                  
visibility.1                             
moonPhase                                
precipIntensityMax                       
temperatureMin                        *  
temperatureMax                           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.49 on 510316 degrees of freedom
Multiple R-squared:  0.9287,    Adjusted R-squared:  0.9287 
F-statistic: 1.039e+05 on 64 and 510316 DF,  p-value: < 2.2e-16

From the summary, we can conclude multiple variables have a statistically significant relationship with the price. Some of the most important are:

  • Distance

  • Surge Multiplier

  • Cab type and service tier (the name_ variables)

Note that hour is not significant here (p ≈ 0.88), despite ride frequency varying by hour.

Now we’ll use the model to predict prices. The quality of the predictions will be evaluated using Root Mean Square Error (RMSE) and the R-squared value for goodness of fit.
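
For reference, these metrics are defined as follows, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $\bar{y}$ the mean actual price:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$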

Let’s write a small helper function to display these results.

eval_metrics <- function(true, predicted){
  
  # Calculate Mean-Squared Error and Root-Mean-Squared Error
  mse <- mean((true-predicted)**2)
  rmse <- sqrt(mse) 
  rmse <- signif(rmse, digits = 5)

  # Calculate R-Squared Value
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  R_square <- signif(R_square, digits = 5)
  
  cat(paste0("RMSE: ", rmse, "\nR-Squared: ", R_square, "\n" ))
}
# Create predictions from model
preds <- predict(lm_model, test)

# Create df with actual and predicted values
modelEval <- cbind(test$price, preds)
colnames(modelEval) <- c('Actual', 'Predicted')
modelEval <- as.data.frame(modelEval)

# Display Results
eval_metrics(modelEval$Actual, modelEval$Predicted)
RMSE: 2.4974
R-Squared: 0.92827

The RMSE for the predictions is ~2.5 and the R-squared is 0.928. An R-squared of 1 would indicate a perfect fit, so anything above 0.9 indicates a good fit.

Let’s visualize this with a plot.

# Create Plot
g <- ggplot(modelEval, aes(x=Actual, y=Predicted)) + geom_point(color = 'dodgerblue1') + labs(x = "Actual Price", y="Predicted Price", title="Predicted vs Actual Values - Linear Regression") + theme_minimal() + geom_abline(intercept = 0, slope = 1)
g

Ideally, we want the points to fall along the line y = x. Here, there is clearly spread around the y = x guide line, and the model appears to underestimate the higher prices.

Ridge Linear Regression Model

Next, we’ll build a ridge linear regression model.

The goal of ridge regression is to introduce a small amount of bias so that the model does not fit the training set quite as closely. This lowers variance when predicting new values and mitigates the risk of overfitting the training set, resulting in better predictions on unseen data.

library(glmnet)

Ridge regression depends on a parameter, lambda, that adds a penalty to the least squares objective. The size of lambda determines how much bias is added to the model, trading a worse fit on the training data for lower variance.
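
Concretely, ridge regression penalizes the squared magnitude of the coefficients (an L2 penalty):

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

With $\lambda = 0$ this reduces to ordinary least squares; as $\lambda$ grows, the coefficients are shrunk toward zero.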

# Specify Training Inputs
x_train <- as.matrix(train[,-'price'])
y_train <- as.matrix(train$price)

# Specify Testing Inputs
x_test <- as.matrix(test[,-'price'])
y_test <- as.matrix(test$price)


# Candidate lambdas on a log scale, from 10^2 down to 10^-3
lambdas <- 10^seq(2, -3, by = -.1)

# Use Cross-Validation to find the Optimal Lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min

# Display output
print(paste0("Optimal Lambda: ", optimal_lambda))
[1] "Optimal Lambda: 0.00125892541179417"

The optimal lambda here is approximately 0.0013. We’ll use this value in our models.

An aside: how does changing lambda affect the coefficient values? The plot below visualizes how the coefficients are altered; as it shows, the choice of lambda is nontrivial.

# Create a ridge regression model with the training set
rrModel <- glmnet(x_train, y_train, alpha=0)

# Plot of how log(lambda) changes coefficients of the model
plot(rrModel, xvar='lambda')

As lambda increases, the coefficients are shrunk toward zero, making the model less sensitive to individual variables.
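
To see the shrinkage numerically, one quick, illustrative check is to compare a few coefficients at the smallest and largest lambdas on the fitted path (the exact values will depend on the fitted model):

# Compare the first few coefficients at the smallest and largest lambda
# on the path fit by glmnet (illustrative sanity check)
round(cbind(small_lambda = coef(rrModel, s = min(rrModel$lambda))[1:6],
            large_lambda = coef(rrModel, s = max(rrModel$lambda))[1:6]), 4)

The coefficients in the large-lambda column should sit noticeably closer to zero.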

Now we’ll use the model to predict values.

# Make predictions
set.seed(1)
predsRidge <- predict(rrModel, x_test, s=optimal_lambda)
  
# Create df with actual and predicted values
modelEval2 <- cbind(test$price, predsRidge)
colnames(modelEval2) <- c('Actual', 'Predicted')
modelEval2 <- as.data.frame(modelEval2)

# Display Results
eval_metrics(modelEval2$Actual, modelEval2$Predicted)
RMSE: 2.5323
R-Squared: 0.92626

The RMSE is around 2.53 and the R-squared is 0.926, meaning ridge performed slightly worse than classical linear regression here.

g <- ggplot(modelEval2, aes(x=Actual, y=Predicted)) + geom_point(color = 'dodgerblue1') + labs(x = "Actual Price", y="Predicted Price", title="Predicted vs Actual Values - Ridge Regression") + theme_minimal() + geom_abline(intercept = 0, slope = 1)
  
g

Lasso Linear Regression Model

Next, let’s try a lasso regression model.
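
Lasso differs from ridge only in the form of the penalty: it penalizes the absolute values of the coefficients (an L1 penalty) rather than their squares, which can shrink some coefficients exactly to zero:

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$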

# Train the model
# (Note: nfolds is an argument to cv.glmnet, not glmnet, so it is dropped here.
# For simplicity we reuse the optimal lambda found during ridge cross-validation.)
lasso_reg <- glmnet(x_train, y_train, alpha = 1, standardize = TRUE)

# Predict from the test set
lassoPredTest <- predict(lasso_reg, s=optimal_lambda, newx = x_test)

# Display results
eval_metrics(y_test, lassoPredTest)
RMSE: 2.4981
R-Squared: 0.92824

The RMSE (2.4981 vs. 2.4974) and R-squared (0.92824 vs. 0.92827) are essentially identical to classical linear regression; the differences are marginal.

Plot the Data

# Create data set with Predicted vs Actual
modelEval3 <- cbind(test$price, lassoPredTest)
colnames(modelEval3) <- c('Actual', 'Predicted')
modelEval3 <- as.data.frame(modelEval3)

# Plot Data
g <- ggplot(modelEval3, aes(x=Actual, y=Predicted)) + geom_point(color = 'dodgerblue1') + labs(x = "Actual Price", y="Predicted Price", title="Predicted vs Actual Values  - Lasso Regression") + theme_minimal() + geom_abline(intercept = 0, slope = 1)
  
g

Conclusions

The purpose of this study was to determine whether there are any significant differences among these types of linear regression when predicting ride share prices. A shared weakness of all the models is that they underestimate higher prices.

Despite their differences, all three models performed very similarly. Classical linear regression (RMSE 2.4974, R-squared 0.92827) and lasso regression (RMSE 2.4981, R-squared 0.92824) were essentially tied for the most accurate, with ridge regression close behind (RMSE 2.5323, R-squared 0.92626). This supports the idea that ride prices increase at a roughly constant rate with distance traveled.

The next step would be to create a tree-based model against which to compare the linear regression models; see the sketch below. I hypothesize that linear regression will remain the best model, given R-squared values already this close to 1.
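
A minimal sketch of that comparison, assuming the rpart package and reusing the train/test split and the eval_metrics() function defined above (rpart with default settings is just one reasonable starting point):

library(rpart)

# Fit a regression tree to the same training data
# (method = "anova" gives a regression rather than classification tree)
tree_model <- rpart(price ~ ., data = train, method = "anova")

# Predict on the test set and evaluate with the same metrics as before
treePreds <- predict(tree_model, test)
eval_metrics(test$price, treePreds)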