You are a Data Scientist at NexaMart, a mid-sized omnichannel retailer. The inventory manager is frustrated: sales forecasts are inaccurate, leading to stock-outs on bestsellers and dead stock on slow movers. They need a better way to predict daily unit sales.
Your task is to build a predictive model. But here is the catch: which model do you choose?
In this project, we will simulate realistic retail data and pit a classic Linear Regression against a cutting-edge Gradient Boosting (XGBoost) ensemble. We will analyze which performs better—and, crucially, why.
Before we dive into the code, let’s understand the hype behind boosting. Boosting is an ensemble technique that builds models sequentially: each new tree is fit to the errors left by the trees that came before it.
In theory, it offers:
- Automatic detection of complex non-linear relationships.
- No need to manually code interaction terms (like price * ad_spend).
- Regularization to fight overfitting.
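To make the "sequential correction" idea concrete, here is a minimal hand-rolled sketch of boosting for squared error. It is not how XGBoost works internally (XGBoost adds regularization, second-order gradient information, and subsampling). The toy data, the shallow `rpart` trees, and the names `toy`, `eta`, and `n_rounds` are assumptions chosen purely for illustration.

```r
library(rpart)  # ships with standard R installations

set.seed(1)
# Toy data (assumption): a smooth non-linear signal plus noise
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, 0, 0.2)

eta      <- 0.1   # learning rate: how much of each tree's correction we keep
n_rounds <- 50    # number of sequential trees

# Start from the simplest possible model: the mean of the target
pred       <- rep(mean(toy$y), nrow(toy))
train_rmse <- numeric(n_rounds)

for (i in seq_len(n_rounds)) {
  # Residuals = what the current ensemble still gets wrong
  toy$resid <- toy$y - pred
  # Fit a shallow tree to those residuals
  tree <- rpart(resid ~ x, data = toy,
                control = rpart.control(maxdepth = 2))
  # Add a damped version of the correction to the running prediction
  pred <- pred + eta * predict(tree, toy)
  train_rmse[i] <- sqrt(mean((toy$y - pred)^2))
}

cat("Training RMSE after round 1: ", round(train_rmse[1], 3), "\n")
cat("Training RMSE after round", n_rounds, ":", round(train_rmse[n_rounds], 3), "\n")
```

The training error keeps falling as rounds are added, which is exactly why boosting needs a held-out set: lower training error alone says nothing about whether the model is learning signal or noise.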
However, theory does not always beat practice. As you will see, the best model is the one that fits the actual shape of your data, not the one with the most impressive name.
To keep this portfolio project reproducible, I am generating synthetic data that mimics NexaMart’s daily sales. I am intentionally building in some “hidden” patterns—like a slight interaction between price and advertising—to see if the models can spot them.
# Load the libraries we need for this journey
library(tidyverse)
library(xgboost)
library(caret)
library(ggplot2)
library(Metrics)
# Set seed for reproducibility (so our "random" data is consistent)
set.seed(2026)
# Simulate 2 years of daily data (730 rows)
n <- 730
sales_data <- tibble(
date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = n),
# Feature 1: Price of the product (Naira) - fluctuates over time
price = round(runif(n, 1500, 3500)),
# Feature 2: Daily advertising spend (in thousands of Naira)
ad_spend = round(runif(n, 5, 50)),
# Feature 3: Competitor's price (Naira) - usually close to ours
comp_price = round(price + rnorm(n, 0, 300)),
# Feature 4: Day of the week (Monday=1 ... Sunday=7)
dow = as.numeric(format(date, "%u")),
# Feature 5: Month number (to capture seasonality)
month = as.numeric(format(date, "%m")),
# Feature 6: Promo flag (1 if there is a promotion, else 0) - 30% of days
promo = rbinom(n, 1, 0.3)
)
# Now, let's generate the "Target": Daily Unit Sales.
# We are secretly creating a specific data reality for our models to discover!
sales_data <- sales_data %>%
mutate(
# A fairly strong linear trend (price negatively affects sales, ads boost it)
linear_sales = 500 - 0.08*price + 1.2*ad_spend - 0.05*comp_price +
15*promo + 20*(dow %in% c(6,7)) + 5*sin(2*pi*month/12),
# Add some random noise (real-world randomness)
noise = rnorm(n, 0, 30),
# A small non-linear interaction term (price and ads together)
interaction_term = 0.3 * price * ad_spend / 1000,
sales_raw = linear_sales + interaction_term + noise,
# Ensure sales are positive integers (you can't sell half a product)
sales = round(pmax(50, sales_raw))
) %>%
select(-linear_sales, -noise, -interaction_term, -sales_raw)
# Let's peek at the first few days
head(sales_data)
## # A tibble: 6 × 8
## date price ad_spend comp_price dow month promo sales
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 2024-01-01 2897 42 2852 1 1 1 225
## 2 2024-01-02 2613 19 2793 2 1 0 188
## 3 2024-01-03 1780 14 2054 3 1 0 253
## 4 2024-01-04 2071 11 2571 4 1 0 233
## 5 2024-01-05 2611 28 2792 5 1 0 170
## 6 2024-01-06 1550 39 1201 6 1 0 411
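The recipe above mixes a linear trend, a price-by-advertising interaction, and noise with a standard deviation of 30. A quick sketch can show how much of that interaction a straight-line model cannot absorb. It assumes unrounded uniform draws for `price` and `ad_spend` and a larger sample than the project uses, so the numbers are approximate; `leftover_sd` and `noise_sd` are illustrative names.

```r
set.seed(2026)
n <- 100000  # larger than the project's 730 rows, to get a stable estimate

# Same ranges as the simulation, without rounding (assumption)
price    <- runif(n, 1500, 3500)
ad_spend <- runif(n, 5, 50)
interaction_term <- 0.3 * price * ad_spend / 1000

# A linear model can soak up most of a product term through its own
# price and ad_spend slopes; the residuals are the truly non-linear part.
linear_part <- lm(interaction_term ~ price + ad_spend)
leftover_sd <- sd(residuals(linear_part))

noise_sd <- 30  # from noise = rnorm(n, 0, 30) in the simulation

cat("SD of the non-linear leftover:", round(leftover_sd, 2), "\n")
cat("SD of the random noise:       ", noise_sd, "\n")
```

For independent uniform features the leftover standard deviation works out analytically to 0.0003 * sd(price) * sd(ad_spend), about 2.25 units. That is more than ten times smaller than the noise, so there is very little non-linear signal for a tree ensemble to gain over a straight line.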
Because we are forecasting time-series data, we must be strict: train on the past, test on the future. We cannot randomly shuffle; that would be cheating (data leakage).
We will use the first 80% of days to train and the remaining 20% to test.
# Calculate the split point
train_size <- floor(0.8 * n)
train <- sales_data[1:train_size, ]
test <- sales_data[(train_size+1):n, ]
# Define our feature columns
features <- c("price", "ad_spend", "comp_price", "dow", "month", "promo")
# XGBoost requires numeric matrices. Let's prepare those.
X_train <- train %>% select(all_of(features)) %>% as.matrix()
y_train <- train$sales
X_test <- test %>% select(all_of(features)) %>% as.matrix()
y_test <- test$sales
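The split above relies on the rows already being in date order. As a defensive sketch, the same idea can be wrapped in a small helper that sorts first and then verifies that no test date precedes a training date. The function `chrono_split()` and the `demo` data frame are hypothetical names introduced for illustration, not part of any package.

```r
# Hypothetical helper: chronological train/test split with a leakage check
chrono_split <- function(df, date_col = "date", train_frac = 0.8) {
  # Sort by date so the split is valid even if rows arrive shuffled
  df  <- df[order(df[[date_col]]), , drop = FALSE]
  cut <- floor(train_frac * nrow(df))
  stopifnot(cut >= 1, cut < nrow(df))

  train <- df[seq_len(cut), , drop = FALSE]
  test  <- df[(cut + 1):nrow(df), , drop = FALSE]

  # Guard against leakage: every training date must precede every test date
  stopifnot(max(train[[date_col]]) < min(test[[date_col]]))
  list(train = train, test = test)
}

# Usage on a deliberately shuffled toy data frame
set.seed(2026)
demo <- data.frame(
  date  = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 730),
  sales = seq_len(730)
)
demo  <- demo[sample(nrow(demo)), ]
parts <- chrono_split(demo)

print(range(parts$train$date))
print(range(parts$test$date))
```

A single split like this yields one estimate of future error. It keeps the project simple, but the estimate depends on which months happen to fall in the final 20%.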
Let’s establish our benchmark. Linear Regression assumes a straight-line relationship between features and sales. It is fast, interpretable, and often surprisingly effective.
# Train the linear model
lm_model <- lm(sales ~ ., data = train %>% select(-date))
# Predict on the unseen test data
lm_pred <- predict(lm_model, newdata = test)
# Calculate the Root Mean Square Error (RMSE)
lm_rmse <- rmse(y_test, lm_pred)
cat("Linear Model RMSE:", round(lm_rmse, 2), "\n")
## Linear Model RMSE: 30.91
The linear model gives us an RMSE of 30.91. This is our number to beat.
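To put that number in context, it helps to recall what RMSE measures and what the best achievable value would be. The sketch below computes RMSE by hand and shows that even an "oracle" that knows the true signal cannot beat the noise level. The function `rmse_manual()` and the stand-in `signal` are assumptions for illustration; the real simulation also rounds sales and floors them at 50, which shifts the floor slightly.

```r
# RMSE by hand: square the errors, average them, take the square root
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(2026)
n      <- 10000
signal <- runif(n, 100, 400)            # stand-in for the predictable part of sales
actual <- signal + rnorm(n, 0, 30)      # same noise SD as the simulation

# An oracle that predicts the signal perfectly still misses the noise
oracle_rmse <- rmse_manual(actual, signal)
cat("Oracle RMSE:", round(oracle_rmse, 2), "\n")
```

Because the simulated noise has a standard deviation of 30, an RMSE of roughly 30 is the floor for any model on this data. The linear model's 30.91 is already close to it, which leaves little room for a more complex model to improve.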
Now, we bring out the heavy artillery: XGBoost. This algorithm is renowned for winning Kaggle competitions. We will set standard hyperparameters: a learning rate of 0.1, tree depth of 5, and 150 boosting rounds.
Note: the code below passes parameters directly to xgboost(), following the package's newer calling style, which avoids the warning messages the older style produces.
# Train XGBoost with modern parameter passing
xgb_model <- xgboost(
x = X_train,
y = y_train,
objective = "reg:squarederror", # We are predicting continuous numbers
booster = "gbtree",
eta = 0.1, # Learning rate
max_depth = 5, # Complexity of each tree
subsample = 0.8, # Prevent overfitting
colsample_bytree = 0.8,
nrounds = 150, # Number of boosting rounds
verbose = 0 # Keep the output clean
)
# Predict on the test set
xgb_pred <- predict(xgb_model, X_test)
# Calculate the RMSE for this powerhouse
xgb_rmse <- rmse(y_test, xgb_pred)
cat("XGBoost Ensemble RMSE:", round(xgb_rmse, 2), "\n")
## XGBoost Ensemble RMSE: 34.99
Let’s compare the numbers side-by-side and look at the actual predictions.
# Create a comparison data frame
comparison <- data.frame(
Model = c("Linear Regression", "Boosting (XGBoost)"),
RMSE = c(lm_rmse, xgb_rmse)
)
# Calculate the difference
difference <- round(xgb_rmse - lm_rmse, 2)
cat("XGBoost performed worse by an RMSE of:", difference, "\n")
## XGBoost performed worse by an RMSE of: 4.09
print(comparison)
## Model RMSE
## 1 Linear Regression 30.90529
## 2 Boosting (XGBoost) 34.99189
# Now, let's plot the actual sales versus our predictions
results <- test %>%
select(date, sales) %>%
mutate(
Linear_Pred = lm_pred,
Boosting_Pred = xgb_pred
) %>%
pivot_longer(cols = c(sales, Linear_Pred, Boosting_Pred),
names_to = "Type", values_to = "Units")
ggplot(results, aes(x = date, y = Units, color = Type)) +
geom_line(alpha = 0.8, linewidth = 0.8) + # linewidth replaces the deprecated size argument for lines
labs(
title = "NexaMart Sales Forecast: The Surprising Winner",
subtitle = paste0("Linear Regression (RMSE: ", round(lm_rmse,2),
") vs XGBoost (RMSE: ", round(xgb_rmse,2), ")"),
y = "Units Sold",
x = "Date"
) +
theme_minimal() +
scale_color_manual(values = c("sales" = "black", "Linear_Pred" = "blue", "Boosting_Pred" = "red"))
Look at the RMSE: 30.91 (Linear) vs 34.99 (XGBoost).
Our complex ensemble model performed worse than the simple linear regression! At first glance, this seems like a failure. But in reality, this is the most valuable lesson in data science.
Why did this happen?
The simulated sales are dominated by linear effects, and the noise (rnorm(n, 0, 30)) was large relative to the interaction effect. XGBoost, trying to find complex patterns, ended up “overfitting” to the random noise in the training set.

What would we do differently in production? If we wanted to use boosting, we would:
- Tune Hyperparameters: Reduce max_depth to 3 or lower eta (for example, to 0.05, usually with more boosting rounds) to make the model less aggressive.
- Early Stopping: Stop the boosting rounds when performance on a validation set stops improving, preventing overfitting.
- Feature Engineering: Give the linear model the interaction term manually, or give XGBoost more relevant features.
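The first two ideas can be sketched together with `xgb.cv()`, which evaluates each boosting round on held-out folds and can stop once the validation error stalls. This is a sketch under stated assumptions, not a tuned configuration: the small simulated data set, the parameter values, and the names `params`, `eval_log`, and `best_round` are illustrative. Note also that `xgb.cv()` assigns rows to folds at random by default, which is tolerable here only because the simulated rows are independent; for real sales history a chronological validation block would be safer.

```r
library(xgboost)

set.seed(2026)
n <- 500
# Simplified stand-in for the project's data: two features, linear signal, noise SD 30
X <- cbind(price = runif(n, 1500, 3500), ad_spend = runif(n, 5, 50))
y <- 500 - 0.08 * X[, "price"] + 1.2 * X[, "ad_spend"] + rnorm(n, 0, 30)
dtrain <- xgb.DMatrix(data = X, label = y)

# A gentler configuration: shallower trees and a smaller learning rate
params <- list(
  objective = "reg:squarederror",
  eta       = 0.05,
  max_depth = 3,
  subsample = 0.8
)

# Cross-validate, stopping when the held-out RMSE has not improved for 20 rounds
cv <- xgb.cv(
  params                = params,
  data                  = dtrain,
  nrounds               = 500,
  nfold                 = 5,
  early_stopping_rounds = 20,
  verbose               = FALSE
)

# Pick the round with the lowest mean held-out RMSE from the evaluation log
eval_log   <- as.data.frame(cv$evaluation_log)
best_round <- eval_log$iter[which.min(eval_log$test_rmse_mean)]
best_rmse  <- min(eval_log$test_rmse_mean)

cat("Best number of rounds:", best_round, "\n")
cat("Held-out RMSE at that round:", round(best_rmse, 2), "\n")
```

The chosen `best_round` would then replace the fixed 150 rounds when refitting on the full training set.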
Even though XGBoost lost, it still provides us with a “Feature Importance” chart. This tells us which factors drive sales the most, according to the model.
# Extract feature importance
importance_matrix <- xgb.importance(model = xgb_model, feature_names = features)
# Plot the top 5 drivers of sales
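# xgb.plot.importance() draws with base graphics, so titles go in via main/sub
# rather than being added with ggplot2's labs()
xgb.plot.importance(importance_matrix, top_n = 5,
main = "Top Drivers of Sales (According to XGBoost)",
sub = "Ad Spend and Price are consistently the biggest levers")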
Notice that ad_spend and price are at the top. This aligns with our linear model. So even though XGBoost's RMSE was worse, the business insight is consistent: NexaMart should focus on advertising and competitive pricing.
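One way to check what the linear model itself considers important is to compare standardized coefficients, which put all features on a common scale. The sketch below uses a simplified re-simulation with three of the project's features rather than the fitted `lm_model`; the names `sim`, `std_coef`, and `ranked` are assumptions for illustration.

```r
set.seed(2026)
n <- 730
# Simplified stand-in for the project's data (three features only)
sim <- data.frame(
  price    = runif(n, 1500, 3500),
  ad_spend = runif(n, 5, 50),
  promo    = rbinom(n, 1, 0.3)
)
sim$sales <- 500 - 0.08 * sim$price + 1.2 * sim$ad_spend +
  15 * sim$promo + rnorm(n, 0, 30)

# Standardize each predictor so every coefficient reads as
# "change in sales per one standard deviation of the feature"
preds  <- c("price", "ad_spend", "promo")
scaled <- sim
scaled[preds] <- lapply(sim[preds], function(v) as.numeric(scale(v)))

fit      <- lm(sales ~ price + ad_spend + promo, data = scaled)
std_coef <- coef(fit)[preds]

# Rank features by the absolute size of their standardized effect
ranked <- sort(abs(std_coef), decreasing = TRUE)
print(round(std_coef, 1))
print(names(ranked))
```

Unlike tree-based importance, these coefficients also carry a sign, so they show that higher prices reduce sales while advertising raises them.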
This project illustrates that a Data Scientist’s value is not in knowing the most complex algorithm, but in knowing which algorithm fits the problem.
For NexaMart:
- The Linear Model is cheaper, faster, more interpretable, and, on this data, more accurate. We would deploy this tomorrow.
- The Boosting Model taught us that if we collect more data or introduce more complex features (like weather, holidays, or economic indicators), we might unlock its power.
By honestly reporting these results, we demonstrate intellectual integrity—a trait far more valuable to an employer than a fabricated “boosting wins” story.
This is what real-world data science looks like. Sometimes, the “smartest” model gets beaten by a straight line.