A ride-sharing company wants to implement a dynamic pricing strategy to optimize fares based on real-time market conditions. The company only uses ride duration to decide ride fares currently. The company aims to leverage data-driven techniques to analyze historical data and develop a predictive model that can dynamically adjust prices in response to changing factors.
A dataset containing historical ride data has been provided. The dataset includes features such as the number of riders, number of drivers, location category, customer loyalty status, number of past rides, average ratings, time of booking, vehicle type, expected ride duration, and historical cost of the rides.
My goal is to build a dynamic pricing model that incorporates the provided features to predict optimal fares for rides in real-time. The model must consider factors such as demand patterns and supply availability.
This data can be found and downloaded from Statso
In this dynamic pricing strategy, I will develope a model that will help maximize revenue and profitability by pricing items at the right level that balances supply and demand changes. This will allow company to adjust prices dynamically based on the following factors like time of day, location of booking, customer segments, and market conditions.
to achieve this i will be using R for my coding.
#installing important packages
#install.packages("DBI")
#install.packages("RSQLite")
library(DBI)
library(ggplot2)
library(visdat)
library(dplyr)
library(plotly)
I have created a database for the data into DBMS. So I will connect my R studio to my DBMS to query the data into R. I will do this by connecting my R studio to the DBMS
#connecting R to SQlite DMBS
cab_fare <- dbConnect(RSQLite::SQLite(), "ride price prediction")
#listing the various tables from the database
dbListTables(cab_fare)
## [1] "customers" "drivers" "historic_data"
Querying the data into my R studio from my BDMS.
#querying the various tables from the DBMS
customers <- DBI::dbGetQuery(cab_fare,
"SELECT * FROM customers;")
customers
#summary statistics
head(customers)
tail(customers)
str(customers)
## 'data.frame': 1000 obs. of 5 variables:
## $ customer_loyalty_status: chr "Silver" "Silver" "Silver" "Regular" ...
## $ time_of_booking : chr "Night" "Evening" "Afternoon" "Afternoon" ...
## $ expected_ride_duration : int 90 43 76 134 149 128 16 47 128 128 ...
## $ average_ratings : num 4.47 4.06 3.99 4.31 3.77 3.51 4.41 3.59 3.74 3.85 ...
## $ location_category : chr "Urban" "Suburban" "Rural" "Rural" ...
vis_dat(customers)
vis_miss(customers)
summary(customers)
## customer_loyalty_status time_of_booking expected_ride_duration
## Length:1000 Length:1000 Min. : 10.00
## Class :character Class :character 1st Qu.: 59.75
## Mode :character Mode :character Median :102.00
## Mean : 99.59
## 3rd Qu.:143.00
## Max. :180.00
## average_ratings location_category
## Min. :3.500 Length:1000
## 1st Qu.:3.870 Class :character
## Median :4.270 Mode :character
## Mean :4.257
## 3rd Qu.:4.633
## Max. :5.000
#finding the location with higher demand
demand_zone <- customers %>%
group_by(location_category) %>%
summarise(total_count= n()) %>%
mutate(Rank = case_when(
total_count >= 340 ~ "1",
total_count == 332 ~ "2",
total_count < 332 ~ "3"
)) %>%
arrange(desc(Rank))
demand_zone
#finding the time with higher demand
time_of_higher_demand = customers %>%
group_by(time_of_booking) %>%
summarise(total_count = n()) %>%
mutate(Rank= case_when(
total_count >= 270 ~ "1",
total_count == 247 ~ "2",
total_count == 246 ~ "3",
total_count < 246 ~ "4"
)) %>%
arrange(desc(Rank))
time_of_higher_demand
#plotting a bar plot to show the location with the highest demand of rides
DZ <- ggplot(demand_zone, aes(x=location_category, y=total_count, fill =factor(Rank) )) +
geom_bar(stat = "identity")+
labs(title = "Location with the highest demand",
x = "Location",
y = "Total count",
fill = "Rank")
ggplotly(DZ)
#plotting a bar plot to show the time with the highest demand of rides
TD <- ggplot(time_of_higher_demand, aes(x=time_of_booking, y=total_count, fill =factor(Rank) )) +
geom_bar(stat = "identity")+
labs(title = "Time with the highest demand",
x = "Time",
y = "Total count",
fill = "Rank")
ggplotly(TD)
This first diagram plotted shows the location where demand for drivers is higher. Urban region has the higher demand for drivers while Suburban has the lowest demand for driver compare to Rural from the consumers dataset.
With the second diagram plotted above show the time period when demand for drivers is higher in the four different time period from the consumers dataset is the Night period, having a higher demand than the other three time period. Afternoon follows as the second highest time period when drivers are demanded, and Morning follows after and lastly Evening.
#querying drivers table
drivers <- DBI::dbGetQuery(cab_fare,
"SELECT * FROM drivers;")
drivers
## 'data.frame': 1000 obs. of 6 variables:
## $ number_of_drivers : int 45 39 31 28 22 35 43 39 14 6 ...
## $ average_ratings : num 4.47 4.06 3.99 4.31 3.77 3.51 4.41 3.59 3.74 3.85 ...
## $ vehicle_type : chr "Premium" "Economy" "Premium" "Premium" ...
## $ number_of_past_rides: int 13 72 0 67 74 83 44 83 71 21 ...
## $ number_of_riders : int 90 58 42 89 78 59 93 62 79 42 ...
## $ location_category : chr "Urban" "Suburban" "Rural" "Rural" ...
## number_of_drivers average_ratings vehicle_type number_of_past_rides
## Min. : 5.00 Min. :3.500 Length:1000 Min. : 0.00
## 1st Qu.:11.00 1st Qu.:3.870 Class :character 1st Qu.: 25.00
## Median :22.00 Median :4.270 Mode :character Median : 51.00
## Mean :27.08 Mean :4.257 Mean : 50.03
## 3rd Qu.:38.00 3rd Qu.:4.633 3rd Qu.: 75.00
## Max. :89.00 Max. :5.000 Max. :100.00
## number_of_riders location_category
## Min. : 20.00 Length:1000
## 1st Qu.: 40.00 Class :character
## Median : 60.00 Mode :character
## Mean : 60.37
## 3rd Qu.: 81.00
## Max. :100.00
#finding the vehicle type and the total number of drivers for it
total_no_premium_eco_drivers <- drivers %>%
group_by(vehicle_type) %>%
summarise(total_drivers = sum(number_of_drivers),
total_riders = sum(number_of_riders))
total_no_premium_eco_drivers
dv <- ggplot(total_no_premium_eco_drivers, aes(x=total_drivers, y=total_riders, fill = vehicle_type)) +
geom_bar(stat = "identity")+
labs(title = "vehicle type and the total number for it drivers",
x = "Total drivers",
y = "Total riders",
fill = "Vehicle type")+
theme_minimal()
ggplotly(dv)
dv2 <- ggplot(total_no_premium_eco_drivers, aes(x=vehicle_type, y=factor(total_riders))) +
geom_bar(stat = "identity", fill = "blue")+
labs(title = "vehicle type and the total number riders",
x = "Vehicle type",
y = "Total riders",
fill = "Vehicle type")+
theme_minimal()
ggplotly(dv2)
In this two different diagram plotted shows the different choice of preferred vehicle type by riders and the vehicle type drivers drive across the various locations. The fist diagram plotted show that, most drivers prefer to drive across the various locations is the Premium vehicle. whiles most riders also prefer to ride in Premium vehicles to Economy.
total_no_premium_eco_location <- drivers %>%
group_by(location_category, vehicle_type) %>%
summarise(total_drivers = sum(number_of_drivers),
total_riders = sum(number_of_riders), .groups = 'drop') %>%
mutate(Percentage =round(total_riders /sum(total_riders) * 100 ))
total_no_premium_eco_location
drivers_supply <- drivers %>%
group_by(location_category) %>%
summarise(total_drivers = sum(number_of_drivers))
drivers_supply
total_p <- ggplot(total_no_premium_eco_location, aes(x=vehicle_type, y= total_riders)) +
geom_bar(stat = "identity", fill ="gold")+
facet_wrap(~location_category, scales = "free") +
labs(title = "vehicle type riders request most at certain location",
x = "Location",
y = "total number of Riders",
fill = "vehicle type")+
theme_minimal()
ggplotly(total_p)
d_s <- ggplot(drivers_supply, aes(x=location_category, y= factor(total_drivers))) +geom_bar(stat = "identity", fill = "blue")+labs(title = "supply of drivers to various location",
x = "Location",
y = "total number of Driver")+
theme_minimal()
ggplotly(d_s)
The first diagram plotted show the distribution of preference from riders in various location on vehicle type they book. from the diagram show that the total number of riders prefer or demand is the Premium vehicle compare to Economy vehicle type. But is does not mean it is so in every location. The table shows that most riders from Rural Location prefer Premium vehicle to Economy, which the margin difference is 4%, where as Urban and Suburban, riders prefer Economy vehicle type over Premium but the margin differences of 1% at Urban location and 0.5% difference at Suburban location.
historic_data <- DBI::dbGetQuery(cab_fare,
"SELECT * FROM historic_data;")
historic_data
#selecting columns from all the tables
customers1 <- customers %>%
select(time_of_booking, location_category, expected_ride_duration)
drivers1 <- drivers %>%
select(number_of_drivers, number_of_riders, number_of_past_rides, vehicle_type)
historic_data1 <- historic_data %>%
select(historical_cost_of_the_rides)
#joining all the selected tables
cab_rides <- bind_cols(drivers1, customers1, historic_data1)
View(cab_rides)
th <- cab_rides %>%
group_by(time_of_booking) %>%
summarise(total_his_ride_cost = round(sum(historical_cost_of_the_rides))) %>%
arrange(desc(total_his_ride_cost)) %>%
mutate(Rank = case_when(
total_his_ride_cost >= 101294 ~ "1",
total_his_ride_cost < 101294 & total_his_ride_cost == 94062 ~ "2",
total_his_ride_cost == 93003 ~ "3",
total_his_ride_cost < 93003 ~ "4"
))
th
#plotting a bar plot to show the time with the highest demand of rides and total cost
th1<- ggplot(th, aes(x=time_of_booking, y=factor(total_his_ride_cost), fill =Rank)) +
geom_bar(stat = "identity")+
labs(title = "Time with the highest demand for ride and total cost for all rides",
x = "Time",
y = "Total ride cost",
fill = "Rank")+
theme_classic()
ggplotly(th1)
The diagram plotted above shows the time period when demand for drivers is higher in the four different time period from the consumers dataset is the Night period, having a higher demand than the other three time period. Afternoon follows as the second highest time period when drivers are demanded, and Morning follows after and lastly Evening. The total cost of rides at Night is 101294, which is the highest cost of rides compare to Afternoon with 94062, Morning with 93003 and Evening with 90000.
tl <- cab_rides %>%
group_by(location_category, vehicle_type,time_of_booking) %>%
summarise(total_his_ride_cost = sum(historical_cost_of_the_rides),
total_drivers = sum(number_of_drivers),
total_of_riders= sum(number_of_riders),
total_expected_duration = sum(expected_ride_duration), .groups = 'drop') %>%
mutate(Rank = rank(total_his_ride_cost))
tl
#plotting a bar plot to show the location with the highest demand of rides and total cost
t1 <- ggplot(tl, aes(x=location_category, y=total_his_ride_cost, fill =time_of_booking )) +geom_boxplot()+ facet_wrap(~time_of_booking)+scale_y_continuous(label = scales::comma) +
labs(title = "Location with the highest total cost for all rides",
x = "Location",
y = "Total ride cost",
fill = "Time")
ggplotly(t1)
t2<- ggplot(tl, aes(x=total_his_ride_cost, y=total_drivers, colour = time_of_booking)) +geom_point(size = 3, alpha = 0.6) + facet_grid(~location_category)+
labs(title = "Total number of drivers with the highest total ride cost at certain time of booking",
x = "Total ride cost",
y = "Total number of drivers",
colour = "Time")
ggplotly(t2)
t3<- ggplot(tl, aes(x=total_his_ride_cost, y=total_of_riders, colour = time_of_booking)) +geom_point(size = 3, alpha = 0.6) + facet_grid(~location_category)+
labs(title = "Total cost for riders booking time",
x = "Total ride cost",
y = "Total number of riders",
colour = "Time")
ggplotly(t3)
t4<- ggplot(tl, aes(x=time_of_booking, y=total_expected_duration, colour = location_category)) +geom_point(size = 3, alpha = 0.6) +
labs(title = "booking time against ride duration",
x = "Time of booking",
y = "Total ride duration",
colour = "Location")
ggplotly(t4)
The first diagram plotted above shows the location with the highest total cost of rides. The Urban location has the highest total cost of rides compare to Suburban and Rural locations. The second diagram plotted shows the total number of drivers at each location with the highest total cost of rides at certain time of booking. The third diagram plotted shows the total number of riders at each location with the highest total cost of rides at certain time of booking. The last diagram plotted shows the relationship between time of booking and expected ride duration across the various locations.
# Calculate the threshold demand for each time_of_booking and location_category
threshold_demand <- cab_rides %>%
group_by(time_of_booking, location_category) %>%
summarise(count = n(), .groups = 'drop') %>%
summarise(mean_count = mean(count))
threshold_demand
group_counts <- cab_rides %>%
group_by(time_of_booking, location_category) %>%
summarise(count = n(), .groups = 'drop')
group_counts
# Calculate the mean count across all groups
mean_count <- mean(group_counts$count)
# Filter groups with count >= mean
filtered_groups <- group_counts %>%
filter(count >= mean_count)
filtered_groups
# Step 1: Calculate counts per time_of_booking and location_category
group_counts <- cab_rides %>%
group_by(time_of_booking, location_category) %>%
summarise(count = n(), .groups = 'drop')
# Step 2: Calculate overall mean
mean_count <- mean(group_counts$count)
# Step 3: Identify high-demand times (above the mean)
high_demand_times <- group_counts %>%
filter(count >= mean_count)
# Step 4: Identify high-demand times (above the mean)
lower_demand_times <- group_counts %>%
filter(count < mean_count)
# Step 5: Create a flag in your main data to mark high-demand times
cab_rides <- cab_rides %>%
left_join(high_demand_times %>%
select(time_of_booking, location_category),
by = c("time_of_booking", "location_category")) %>%
mutate(price_multiplier = ifelse(!is.na(count), 1.2, 1)) # 20% increase during high demand
# Step 6: Create a flag in your main data to mark lower-demand times
cab_rides <- cab_rides %>%
left_join(lower_demand_times %>%
select(time_of_booking, location_category),
by = c("time_of_booking", "location_category")) %>%
mutate(price_multiplier2 = ifelse(!is.na(count), 1.1, 1)) # 10% increase during lower demand
# Step 7: Adjust fares
cab_rides <- cab_rides %>%
mutate(adjusted_fare = historical_cost_of_the_rides * price_multiplier) %>%
mutate(adjusted_fare2 = historical_cost_of_the_rides * price_multiplier2)
View(cab_rides)
# Now, 'adjusted_fare' reflects higher prices during high-demand times as while as lower_demand
The code above calculates the demand for rides at each time of booking and location category. It then creates a flag to indicate high-demand times (above the mean) and lower-demand times (below the mean). The fare is adjusted based on these flags, applying a 20% increase during high demand and a 10% increase during lower demand.
#library(plotly)
#finding profit for all rides.
profit_margin <- cab_rides %>%
reframe(Profit = adjusted_fare - historical_cost_of_the_rides) %>%
mutate(Percentage = round(Profit / 1000 * 100))
profit_margin
profitable_ride <- profit_margin %>%
mutate(
profitability = case_when(
Percentage <= 5~ "low profit",
Percentage > 5~ "high profit",
TRUE ~ "0"
)
)
profitable_ride
profit_percentage <- profitable_ride %>%
group_by(profitability) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
profit_percentage
plot_ly(
data = profit_percentage,
labels = ~profitability,
values = ~Count,
type = 'pie',
hole = 0.5,
textinfo = 'label+percent',
insidetextorientation = 'radial'
) %>%
layout(title = "Total Percentage of profit margin",
showlegend = TRUE,
legend = list(orientation = 'h', xanchor = 'center', x = 0.5, y = -0.2))
The code above calculates the profit margin for all rides by subtracting the historical cost of rides from the adjusted fare. It then categorizes the profitability into “low profit” and “high profit” based on a percentage threshold. Finally, it visualizes the total percentage of profit margin using a pie chart.
library(rpart)
library(rpart.plot)
library(caret)
# Ensure 'ride duration' is numeric
cab_rides <- cab_rides %>%
mutate(
time_of_booking = as.factor(time_of_booking),
location_category = as.factor(location_category),
expected_ride_duration = as.numeric(expected_ride_duration)
)
# Split into training and testing sets
set.seed(123)
train_index <- createDataPartition(cab_rides$adjusted_fare, p = 0.8, list = FALSE)
train_data <- cab_rides[train_index, ]
test_data <- cab_rides[-train_index, ]
# Train the model including 'ride duration'
model <- train(
adjusted_fare ~ time_of_booking + location_category + expected_ride_duration,
data = train_data,
method = "lm"
)
# Predictions
predictions <- predict(model, test_data)
# Evaluate with RMSE
rmse <- sqrt(mean((predictions - test_data$adjusted_fare)^2))
print(paste("RMSE:", rmse))
## [1] "RMSE: 84.5571639058569"
The code above uses the caret package to train a linear
regression model that predicts the adjusted fare based on time of
booking, location category, and expected ride duration. It splits the
data into training and testing sets, trains the model, makes
predictions, and evaluates the model’s performance using RMSE (Root Mean
Square Error).
# Prepare data for plotting
plot_data <- test_data %>%
mutate(predicted_fare = predictions)
# Plot actual vs. predicted fares
prediction <- ggplot(plot_data, aes(x = adjusted_fare, y = predicted_fare)) +
geom_point(alpha = 0.5, colour = "red", size = 2) +
geom_abline(slope = 1, intercept = 0, color = "blue", linetype = "dashed") +
labs(
title = "Actual vs. Predicted Fare",
x = "Actual Fare",
y = "Predicted Fare"
) +
theme_minimal()
ggplotly(prediction)
The code above visualizes the actual fare against the predicted fare from the model. It uses a scatter plot to show how well the model’s predictions align with the actual fares, with a dashed line indicating where the predicted fare would equal the actual fare. The diagram indicates that the model’s predictions are generally close to the actual fares, with some deviations. This diagram is show that the model’s prediction is 85% accurate, which is a good prediction accuracy. The company can use this model to dynamically adjust fares based on real-time market conditions, such as time of booking, location category, and expected ride duration. This will help the company optimize its pricing strategy and maximize revenue.
In the Analysis, I have explored the dataset to understand the demand
patterns and supply availability for the ride-sharing company. I have
identified the locations and times with higher demand, analyzed the
vehicle types preferred by riders, and calculated the profit margins for
rides. I have also developed a predictive model using machine learning
techniques to dynamically adjust fares based on real-time market
conditions. The model considers factors such as time of booking,
location category, and expected ride duration to predict optimal fares
for rides.
This dynamic pricing strategy will help the company maximize revenue and
profitability by pricing rides at the right level that balances supply
and demand changes. The model’s predictions can be used to adjust fares
in real-time, allowing the company to respond to changing market
conditions effectively.
In the future, I plan to enhance the model by incorporating additional features such as customer loyalty status, average ratings, and historical ride data. This will allow for a more comprehensive understanding of customer behavior and preferences, leading to more accurate fare predictions. I will also explore advanced machine learning techniques, such as gradient boosting and neural networks, to improve the model’s performance further. Additionally, I will consider implementing real-time data processing capabilities to enable dynamic fare adjustments based on live market conditions. This will ensure that the company can respond quickly to changes in demand and supply, optimizing its pricing strategy in real-time.
This project demonstrates the potential of data-driven techniques to optimize pricing strategies for ride-sharing companies. By leveraging historical data and predictive modeling, the company can implement a dynamic pricing strategy that adjusts fares based on real-time market conditions, ultimately maximizing revenue and profitability.