DS3001 Final Project: Model to Predict Rideshare Costs
# Setup and Libraries Used
library(tidyverse)  # loads tidyr, dplyr, readr, and ggplot2
library(mice)
# Import the data
rides <- read_csv("~/R/DS3001/Final/rideshare_kaggle.csv")
Introduction
Ride share services have become increasingly popular in the past decade, with Uber and Lyft the leading companies in this field. To balance supply with demand and increase profitability, these companies use dynamic pricing to encourage more drivers to pick up requests during busy periods (more about this on the Uber blog).
In the following analysis, we will examine trends in ride share data and develop a machine-learning model to predict the price of a ride based on peak hours, distance, and other factors. We will use the Uber and Lyft Dataset Boston, MA, available on Kaggle, to conduct these analyses.
Exploratory Data Analysis
Before developing the model, we need to understand the data set. This involves posing a few questions such as the ones below:
1. Which day has the most rides?
2. What time of day has the most rides?
To answer these questions, we first need to clean the data by removing extra columns that do not provide information relevant to the questions we are trying to answer.
Clean the Data
# Get rid of rows with NA values
rides <- rides[complete.cases(rides), ]
sum(is.na(rides))
[1] 0
# Drop unnecessary columns
drop <- c("id", "icon", "timestamp", "datetime", "long_summary", "latitude", "longitude", "windGustTime", "sunriseTime", "sunsetTime", "apparentTemperatureHighTime", "apparentTemperatureLowTime", "temperatureLowTime", "uvIndexTime", "temperatureMinTime", "apparentTemperatureMinTime", "temperatureMaxTime", "apparentTemperatureMaxTime", "ozone", "apparentTemperatureMin", "apparentTemperatureMax", "apparentTemperatureHigh", "apparentTemperatureLow", "temperatureHighTime", "timezone", "product_id")
ridesClean <- rides[, !(names(rides) %in% drop)]
# View structure of data (omitted to save space)
#str(ridesClean)
Cab Type: Uber vs Lyft
Here, we will examine how many data points are Uber and how many are Lyft. This is to check whether one ride-share company outweighs the other in the data set.
# Table results
cabType <- table(ridesClean$cab_type)
cabPercent <- cbind(cabType, round(prop.table(cabType)*100, 2))
colnames(cabPercent) <- c("Count", "Percent")
cabPercent <- as.data.frame(cabPercent)
# Create a pie chart to visualize results
pieLabels <- paste0(c("Lyft", "Uber"), " = ", cabPercent$Percent, "%")
pie(cabPercent$Percent, labels = pieLabels, main = "Percentage of Cabs in Data Set", col = topo.colors(2))
From the results, the percentages of Uber and Lyft data points are approximately 52% and 48%, respectively, indicating a near-even split.
Day with the Most Rides
First, we will determine the day with the most rides. This data set contains only about two months of data, so we examine days of the month rather than peak months.
# Create a Data Set with Day and Frequency of Rides
ridesPerDay <- ridesClean
ridesPerDay <- ridesPerDay %>% group_by(day) %>% summarise(Freq = n())
# Create a Bar Graph
p <- ggplot(data = ridesPerDay, aes(x = day, y = Freq)) +
  geom_bar(stat = "identity", fill = "dodgerblue1") +
  theme_minimal() +
  labs(x = "Day of the Month", y = "Number of Rides", title = "Frequency of Rides on a Day of the Month")
p
From this, it appears there are three peaks where rides occur most often. Most rides occurred at the end of the month, and ride frequencies at the beginning and the middle of the month were similar.
Time with the Most Rides
Next, we’ll look at the time with the most rides.
# Create a data set grouping ride frequency by hour
ridesTime <- ridesClean
ridesTime <- ridesTime %>% group_by(hour) %>% summarise(Freq1 = n())
# Create a Frequency Polygon
p2 <- ggplot(data = ridesTime, aes(x = hour, y = Freq1)) +
  geom_freqpoly(stat = 'identity', color = 'dodgerblue1', lwd = 1) +
  theme_minimal() +
  labs(x = "Hour", y = "Number of Rides", title = "Frequency of Rides at a Particular Hour")
p2
The general pattern throughout the day is that most rides occur between 21:00 (9:00 pm) and 1:00 (1:00 am), and between 10:00 (10:00 am) and 17:00 (5:00 pm).
How is Ride Frequency Affected by Weather?
p3 <- ggplot(data = ridesClean, aes(short_summary)) +
  labs(x = "Weather Description", title = "Frequency of Rides Per Weather Description") +
  geom_bar(fill = 'dodgerblue1') + coord_flip() + theme_minimal()
p3
Price Prediction Model
Here, we will develop a price prediction model using several variants of linear regression. Linear regression was chosen because the price of a ride typically depends mainly on the distance traveled, with fares increasing at a roughly constant rate. The variants we’ll be looking at are below:
Classic Linear Regression
Ridge Linear Regression
Lasso Linear Regression
These different models will be compared with one another to determine which one is the most accurate in predicting ride share costs.
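For reference, the penalized variants will both be fit with the glmnet package later in this analysis; its alpha argument selects the penalty type. A minimal sketch, where x and y stand in for the predictor matrix and response vector built below (x_train, y_train):
library(glmnet)
# alpha = 0 gives ridge (squared-coefficient penalty);
# alpha = 1 gives lasso (absolute-value penalty)
ridge_fit <- glmnet(x, y, alpha = 0)
lasso_fit <- glmnet(x, y, alpha = 1)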
First, we’ll use some additional libraries.
library(mltools)
library(data.table)
Next, organize the categorical data into factors and One-Hot encode.
# Create a new df from original
dfModel <- ridesClean
# Turn only character columns to factor columns
dfModel[sapply(dfModel, is.character)] <- lapply(dfModel[sapply(dfModel, is.character)], as.factor)
# One-Hot Encode
dfModel <- one_hot(as.data.table(dfModel), cols = "auto", sparsifyNAs = TRUE, naCols = FALSE, dropCols = TRUE, dropUnusedLevels = TRUE)
# Looks good! (Omitted to save space)
#str(dfModel)
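To see what one_hot() does, here is a toy illustration on a small hypothetical data.table (not part of the rides data); each factor level becomes its own 0/1 column:
# Hypothetical example: one-hot encoding a two-row data.table
toy <- data.table(cab_type = factor(c("Uber", "Lyft")), price = c(10, 12))
one_hot(toy)
# Expected output (approximately):
#    cab_type_Lyft cab_type_Uber price
# 1:             0             1    10
# 2:             1             0    12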
Let’s split the data into train and test sets. Here we’ll use 80% of the data for the training set and the remaining 20% for the test set.
# Partition 80% of the data
set.seed(1)
part_index_1 <- caret::createDataPartition(dfModel$price,
                                           times = 1,
                                           p = 0.80,
                                           groups = 1,
                                           list = FALSE)
train <- dfModel[part_index_1, ]
test <- dfModel[-part_index_1, ]
# Check dimensions
dims <- data.frame("Train Size" = nrow(train), "Test Size" = nrow(test))
dims
Train.Size Test.Size
1 510381 127595
Basic Linear Regression Model
This model regresses price on all of the remaining input variables at once (multiple linear regression).
# Creating a simple Linear Regression Model
lm_model <- lm(price ~ ., data = train)
summary(lm_model)
Call:
lm(formula = price ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-22.113 -1.424 -0.153 1.256 53.711
Coefficients: (8 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.454e+01 8.708e-01 -16.697 < 2e-16
hour 9.424e-05 6.382e-04 0.148 0.88262
day -3.005e-03 2.183e-03 -1.377 0.16863
month -5.533e-02 4.987e-02 -1.110 0.26719
`source_Back Bay` -5.052e-02 1.710e-02 -2.953 0.00314
`source_Beacon Hill` -3.737e-01 1.711e-02 -21.834 < 2e-16
`source_Boston University` -3.193e-01 2.382e-02 -13.402 < 2e-16
source_Fenway -1.305e-01 2.357e-02 -5.539 3.04e-08
`source_Financial District` 3.003e-01 1.716e-02 17.502 < 2e-16
`source_Haymarket Square` 3.506e-01 2.327e-02 15.063 < 2e-16
`source_North End` 5.123e-01 2.319e-02 22.095 < 2e-16
`source_North Station` 1.829e-02 1.715e-02 1.067 0.28600
`source_Northeastern University` -3.362e-01 2.358e-02 -14.259 < 2e-16
`source_South Station` 1.493e-01 2.317e-02 6.443 1.17e-10
`source_Theatre District` 4.687e-01 1.712e-02 27.377 < 2e-16
`source_West End` NA NA NA NA
`destination_Back Bay` 4.510e-02 1.710e-02 2.638 0.00834
`destination_Beacon Hill` -2.740e-01 1.714e-02 -15.986 < 2e-16
`destination_Boston University` 2.219e-02 1.799e-02 1.233 0.21748
destination_Fenway -2.992e-01 1.779e-02 -16.815 < 2e-16
`destination_Financial District` 4.408e-01 1.715e-02 25.708 < 2e-16
`destination_Haymarket Square` 2.374e-01 1.711e-02 13.878 < 2e-16
`destination_North End` 9.632e-02 1.706e-02 5.646 1.64e-08
`destination_North Station` 2.347e-01 1.716e-02 13.682 < 2e-16
`destination_Northeastern University` 3.437e-02 1.759e-02 1.954 0.05067
`destination_South Station` NA NA NA NA
`destination_Theatre District` 2.864e-01 1.711e-02 16.743 < 2e-16
`destination_West End` NA NA NA NA
cab_type_Lyft -3.713e+00 1.709e-02 -217.305 < 2e-16
cab_type_Uber NA NA NA NA
name_Black 1.076e+01 1.678e-02 641.065 < 2e-16
`name_Black SUV` 2.052e+01 1.678e-02 1222.577 < 2e-16
name_Lux 1.104e+01 1.746e-02 632.478 < 2e-16
`name_Lux Black` 1.635e+01 1.745e-02 936.745 < 2e-16
`name_Lux Black XL` 2.559e+01 1.745e-02 1466.609 < 2e-16
name_Lyft 2.874e+00 1.744e-02 164.766 < 2e-16
`name_Lyft XL` 8.570e+00 1.745e-02 490.964 < 2e-16
name_Shared NA NA NA NA
name_UberPool -1.009e+00 1.677e-02 -60.163 < 2e-16
name_UberX -1.992e-03 1.676e-02 -0.119 0.90538
name_UberXL 5.918e+00 1.677e-02 352.862 < 2e-16
name_WAV NA NA NA NA
distance 2.887e+00 4.005e-03 720.906 < 2e-16
surge_multiplier 1.843e+01 3.737e-02 493.270 < 2e-16
temperature 5.812e-03 1.294e-02 0.449 0.65322
apparentTemperature -4.598e-03 4.467e-03 -1.029 0.30335
short_summary_Clear -1.031e-01 1.095e-01 -0.942 0.34623
short_summary_Drizzle 7.845e-02 7.239e-02 1.084 0.27846
short_summary_Foggy -1.888e-02 1.088e-01 -0.174 0.86218
`short_summary_Light Rain` -1.564e-02 4.323e-02 -0.362 0.71755
`short_summary_Mostly Cloudy` 3.766e-03 1.007e-01 0.037 0.97016
short_summary_Overcast 2.356e-02 9.911e-02 0.238 0.81207
`short_summary_Partly Cloudy` -5.561e-02 1.042e-01 -0.534 0.59354
`short_summary_Possible Drizzle` 9.663e-03 7.401e-02 0.131 0.89612
short_summary_Rain NA NA NA NA
precipIntensity -5.399e-02 5.274e-01 -0.102 0.91845
precipProbability 5.687e-02 8.558e-02 0.665 0.50636
humidity -3.631e-02 4.028e-01 -0.090 0.92818
windSpeed -1.214e-02 5.283e-03 -2.299 0.02152
windGust 4.905e-03 2.358e-03 2.080 0.03753
visibility -2.177e-03 3.440e-03 -0.633 0.52679
temperatureHigh -7.538e-03 6.124e-03 -1.231 0.21836
temperatureLow 6.300e-04 9.535e-04 0.661 0.50878
dewPoint 1.106e-03 1.124e-02 0.098 0.92162
pressure 2.735e-04 5.749e-04 0.476 0.63421
windBearing 1.025e-04 6.527e-05 1.570 0.11633
cloudCover -1.507e-01 5.207e-02 -2.894 0.00381
uvIndex -1.354e-02 8.600e-03 -1.575 0.11527
visibility.1 NA NA NA NA
moonPhase -4.642e-02 5.217e-02 -0.890 0.37355
precipIntensityMax -5.439e-02 1.157e-01 -0.470 0.63830
temperatureMin 3.521e-03 1.724e-03 2.043 0.04107
temperatureMax 3.133e-03 5.837e-03 0.537 0.59149
(Intercept) ***
hour
day
month
`source_Back Bay` **
`source_Beacon Hill` ***
`source_Boston University` ***
source_Fenway ***
`source_Financial District` ***
`source_Haymarket Square` ***
`source_North End` ***
`source_North Station`
`source_Northeastern University` ***
`source_South Station` ***
`source_Theatre District` ***
`source_West End`
`destination_Back Bay` **
`destination_Beacon Hill` ***
`destination_Boston University`
destination_Fenway ***
`destination_Financial District` ***
`destination_Haymarket Square` ***
`destination_North End` ***
`destination_North Station` ***
`destination_Northeastern University` .
`destination_South Station`
`destination_Theatre District` ***
`destination_West End`
cab_type_Lyft ***
cab_type_Uber
name_Black ***
`name_Black SUV` ***
name_Lux ***
`name_Lux Black` ***
`name_Lux Black XL` ***
name_Lyft ***
`name_Lyft XL` ***
name_Shared
name_UberPool ***
name_UberX
name_UberXL ***
name_WAV
distance ***
surge_multiplier ***
temperature
apparentTemperature
short_summary_Clear
short_summary_Drizzle
short_summary_Foggy
`short_summary_Light Rain`
`short_summary_Mostly Cloudy`
short_summary_Overcast
`short_summary_Partly Cloudy`
`short_summary_Possible Drizzle`
short_summary_Rain
precipIntensity
precipProbability
humidity
windSpeed *
windGust *
visibility
temperatureHigh
temperatureLow
dewPoint
pressure
windBearing
cloudCover **
uvIndex
visibility.1
moonPhase
precipIntensityMax
temperatureMin *
temperatureMax
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.49 on 510316 degrees of freedom
Multiple R-squared: 0.9287, Adjusted R-squared: 0.9287
F-statistic: 1.039e+05 on 64 and 510316 DF, p-value: < 2.2e-16
From the summary, we can conclude that multiple variables have a statistically significant relationship with price. Some of the most important are:
Distance
Surge Multiplier
Cab type and service tier (the name_ variables)
Notably, the time variables (hour, day, and month) are not statistically significant in this model.
Now we’ll create predictions of price. The quality of the prediction will be evaluated using Root Mean Square Error (RMSE) and R-squared value for goodness of fit.
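For n predictions, RMSE = sqrt((1/n) * sum((actual_i - predicted_i)^2)), and R-squared = 1 - SSE/SST, where SSE = sum((predicted_i - actual_i)^2) and SST = sum((actual_i - mean(actual))^2). These are exactly the quantities computed in the helper function below.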
Let’s make a function to make it easier to display results.
eval_metrics <- function(true, predicted){
  # Calculate Mean-Squared Error and Root-Mean-Squared Error
  mse <- mean((true - predicted)^2)
  rmse <- sqrt(mse)
  rmse <- signif(rmse, digits = 5)
  # Calculate R-Squared Value
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  R_square <- signif(R_square, digits = 5)
  cat(paste0("RMSE: ", rmse, "\nR-Squared: ", R_square, "\n"))
}
# Create predictions from model
preds <- predict(lm_model, test)
# Create df with actual and predicted values
modelEval <- cbind(test$price, preds)
colnames(modelEval) <- c('Actual', 'Predicted')
modelEval <- as.data.frame(modelEval)
# Display Results
eval_metrics(modelEval$Actual, modelEval$Predicted)
RMSE: 2.4974
R-Squared: 0.92827
The RMSE for the predictions is about 2.50 and the R-squared (goodness of fit) is 0.928. An R-squared of 1 would be a perfect fit, so anything above 0.9 indicates a good fit.
Let’s visualize this with a plot.
# Create Plot
g <- ggplot(modelEval, aes(x = Actual, y = Predicted)) +
  geom_point(color = 'dodgerblue1') +
  labs(x = "Actual Price", y = "Predicted Price", title = "Predicted vs Actual Values - Linear Regression") +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1)
g
Ideally, we want the points to fall along the line y = x. Here, there is clear scatter around the y = x guide line, and the model appears to systematically underestimate the higher prices.
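One quick way to check this underestimation directly is to plot the residuals against the actual prices. A short sketch using the modelEval frame from above (the resulting plot is not shown here):
# Residuals vs actual price; points drifting above zero at high prices
# indicate systematic underestimation
resid_plot <- ggplot(modelEval, aes(x = Actual, y = Actual - Predicted)) +
  geom_point(color = 'dodgerblue1', alpha = 0.2) +
  geom_hline(yintercept = 0) +
  labs(x = "Actual Price", y = "Residual (Actual - Predicted)",
       title = "Residuals vs Actual Price - Linear Regression") +
  theme_minimal()
resid_plot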
Ridge Linear Regression Model
Next, we’ll build a ridge linear regression model.
The goal of ridge regression is to introduce a small amount of bias so that the model does not fit the training set quite as closely. This lowers variance when predicting new values and mitigates the risk of overfitting the training set, resulting in better predictions on unseen data.
library(glmnet)
Ridge regression depends on a parameter, lambda, that adds a penalty to the least squares objective. The size of lambda determines how much bias is added to the model, and hence how much the goodness of fit on the training data is relaxed.
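Concretely, ridge regression chooses coefficients b to minimize RSS + lambda * sum_j(b_j^2), where RSS = sum_i((y_i - x_i·b)^2) is the ordinary least-squares objective; lambda = 0 recovers classical linear regression, and larger lambda shrinks the coefficients.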
# Specify Training Inputs
x_train <- as.matrix(train[, -'price'])
y_train <- as.matrix(train$price)
# Specify Testing Inputs
x_test <- as.matrix(test[, -'price'])
y_test <- as.matrix(test$price)
# Set of possible lambdas by 10^exp
lambdas <- 10^seq(2, -3, by = -.1)
# Use Cross-Validation to find the Optimal Lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
# Display output
print(paste0("Optimal Lambda: ", optimal_lambda))
[1] "Optimal Lambda: 0.00125892541179417"
The optimal lambda here is about 0.00126, near the bottom of our search grid, suggesting only a small penalty is helpful for this data. We’ll use this value in our models.
An aside: How does changing lambda affect the coefficient values? Here is a plot to visualize how the coefficients are altered. The choice of lambda is nontrivial as seen here.
# Create a ridge regression model with the training set
rrModel <- glmnet(x_train, y_train, alpha = 0)
# Plot of how log(lambda) changes coefficients of the model
plot(rrModel, xvar = 'lambda')
As lambda increases, the coefficients shrink toward zero, making the model less sensitive to individual variables.
Now we’ll use the model to predict values.
# Make predictions
set.seed(1)
predsRidge <- predict(rrModel, x_test, s = optimal_lambda)
# Create df with actual and predicted values
modelEval2 <- cbind(test$price, predsRidge)
colnames(modelEval2) <- c('Actual', 'Predicted')
modelEval2 <- as.data.frame(modelEval2)
# Display Results
eval_metrics(modelEval2$Actual, modelEval2$Predicted)
RMSE: 2.5323
R-Squared: 0.92626
The RMSE value is around 2.53 and the R-squared value is 0.926, meaning this model performed slightly worse than classical linear regression.
g <- ggplot(modelEval2, aes(x = Actual, y = Predicted)) +
  geom_point(color = 'dodgerblue1') +
  labs(x = "Actual Price", y = "Predicted Price", title = "Predicted vs Actual Values - Ridge Regression") +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1)
g
Lasso Linear Regression Model
Next, let’s try a lasso regression model. Lasso replaces ridge’s squared-coefficient penalty with an absolute-value penalty (lambda times the sum of |b_j|), which can shrink some coefficients exactly to zero and thus perform variable selection.
# Train the model (nfolds applies to cv.glmnet rather than glmnet, so it is omitted here)
lasso_reg <- glmnet(x_train, y_train, alpha = 1, standardize = TRUE)
# Predict from the test set
lassoPredTest <- predict(lasso_reg, s = optimal_lambda, newx = x_test)
# Display results
eval_metrics(y_test, lassoPredTest)
RMSE: 2.4981
R-Squared: 0.92824
The RMSE and R-squared are essentially the same as for classical linear regression; the differences (an RMSE of 2.4981 vs. 2.4974) are marginal.
Plot the Data
# Create data set with Predicted vs Actual
modelEval3 <- cbind(test$price, lassoPredTest)
colnames(modelEval3) <- c('Actual', 'Predicted')
modelEval3 <- as.data.frame(modelEval3)
# Plot Data
g <- ggplot(modelEval3, aes(x = Actual, y = Predicted)) +
  geom_point(color = 'dodgerblue1') +
  labs(x = "Actual Price", y = "Predicted Price", title = "Predicted vs Actual Values - Lasso Regression") +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1)
g
Conclusions
The purpose of this study was to determine if there are any significant differences in these types of linear regression when predicting ride share prices. A problem with all the models is that they underestimate higher prices.
Despite their differences, all three models performed similarly to one another. Lasso regression and classical linear regression were the most similar and the most accurate, both with an RMSE of about 2.50 and an R-squared of 0.928. This supports the idea that ride prices increase at a roughly constant rate with distance traveled.
The next step would be to create a tree-based model and compare its performance against these linear regression models. I hypothesize that linear regression will remain the best model, given R-squared values this close to 1.
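As a starting point for that comparison, here is a minimal sketch using the rpart package (the choice of rpart is an assumption; any tree-based learner would work), reusing the train/test split and the eval_metrics helper from above:
library(rpart)
# Sketch: fit a regression tree on the same training data and score it
# with the same metrics used for the linear models
tree_model <- rpart(price ~ ., data = train, method = "anova")
treePreds <- predict(tree_model, test)
eval_metrics(test$price, treePreds)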