The Business Problem

You are a Data Scientist at NexaMart, a mid-sized omnichannel retailer. The inventory manager is frustrated: sales forecasts are inaccurate, leading to stock-outs on bestsellers and dead stock on slow movers. They need a better way to predict daily unit sales.

Your task is to build a predictive model. But here is the catch: which model do you choose?

In this project, we will simulate realistic retail data and pit a classic Linear Regression against a cutting-edge Gradient Boosting (XGBoost) ensemble. We will analyze which performs better—and, crucially, why.


Why Ensemble Boosting? (The Theory)

Before we dive into the code, let’s understand the hype behind boosting. Boosting is an ensemble technique that builds models sequentially: each new tree is fit to the errors left by the trees that came before it.

In theory, it offers:

- Automatic detection of complex non-linear relationships.
- No need to manually code interaction terms (like price * ad_spend).
- Regularization to fight overfitting.

However, theory does not always beat practice. As you will see, the best model is the one that fits the actual shape of your data, not the one with the most impressive name.
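
To make the sequential idea concrete, here is a toy sketch of boosting for squared error in base R. It is an illustration under simplifying assumptions, not how XGBoost is implemented: the learners are one-split "stumps" on a single made-up feature, and the helper names fit_stump and predict_stump are hypothetical.

```r
# Toy gradient boosting for squared error, using one-split stumps (base R only)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, 0, 0.2)   # made-up non-linear signal plus noise

# Find the single split on x that best fits the target r (least squares)
fit_stump <- function(x, r) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]  # keep both sides non-empty
  sse <- sapply(candidates, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

eta <- 0.1                        # learning rate (shrinkage)
n_rounds <- 50
pred <- rep(mean(y), length(y))   # start from a constant prediction
train_rmse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residual <- y - pred                          # what the ensemble still gets wrong
  stump <- fit_stump(x, residual)               # the new learner targets those errors
  pred <- pred + eta * predict_stump(stump, x)  # take a small step towards them
  train_rmse[m] <- sqrt(mean((y - pred)^2))
}

# Training error shrinks round by round; that says nothing about test error
print(round(train_rmse[c(1, 10, 50)], 3))
```

The training error can only go down as rounds are added, which is exactly why a boosted model needs a held-out set to tell real signal from memorized noise.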


Step 1: Simulating Realistic Retail Data

To keep this portfolio project reproducible, I am generating synthetic data that mimics NexaMart’s daily sales. I am intentionally building in some “hidden” patterns—like a slight interaction between price and advertising—to see if the models can spot them.

# Load the libraries we need for this journey
library(tidyverse)
library(xgboost)
library(caret)
library(ggplot2)
library(Metrics)

# Set seed for reproducibility (so our "random" data is consistent)
set.seed(2026)

# Simulate 2 years of daily data (730 rows)
n <- 730

sales_data <- tibble(
  date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = n),
  
  # Feature 1: Price of the product (Naira) - fluctuates over time
  price = round(runif(n, 1500, 3500)),
  
  # Feature 2: Daily advertising spend (in thousands of Naira)
  ad_spend = round(runif(n, 5, 50)),
  
  # Feature 3: Competitor's price (Naira) - usually close to ours
  comp_price = round(price + rnorm(n, 0, 300)),
  
  # Feature 4: Day of the week (Monday=1 ... Sunday=7)
  dow = as.numeric(format(date, "%u")),
  
  # Feature 5: Month number (to capture seasonality)
  month = as.numeric(format(date, "%m")),
  
  # Feature 6: Promo flag (1 if there is a promotion, else 0) - 30% of days
  promo = rbinom(n, 1, 0.3)
)

# Now, let's generate the "Target": Daily Unit Sales.
# We are secretly creating a specific data reality for our models to discover!
sales_data <- sales_data %>%
  mutate(
    # A fairly strong linear trend (price negatively affects sales, ads boost it)
    linear_sales = 500 - 0.08*price + 1.2*ad_spend - 0.05*comp_price +
                   15*promo + 20*(dow %in% c(6,7)) + 5*sin(2*pi*month/12),
    
    # Add some random noise (real-world randomness)
    noise = rnorm(n, 0, 30),
    
    # A small non-linear interaction term (price and ads together)
    interaction_term = 0.3 * price * ad_spend / 1000,
    
    sales_raw = linear_sales + interaction_term + noise,
    
    # Ensure sales are positive integers (you can't sell half a product)
    sales = round(pmax(50, sales_raw))
  ) %>%
  select(-linear_sales, -noise, -interaction_term, -sales_raw)

# Let's peek at the first few days
head(sales_data)
## # A tibble: 6 × 8
##   date       price ad_spend comp_price   dow month promo sales
##   <date>     <dbl>    <dbl>      <dbl> <dbl> <dbl> <int> <dbl>
## 1 2024-01-01  2897       42       2852     1     1     1   225
## 2 2024-01-02  2613       19       2793     2     1     0   188
## 3 2024-01-03  1780       14       2054     3     1     0   253
## 4 2024-01-04  2071       11       2571     4     1     0   233
## 5 2024-01-05  2611       28       2792     5     1     0   170
## 6 2024-01-06  1550       39       1201     6     1     0   411

Step 2: Preparing the Data for Modeling

Because we are forecasting time-series data, we must be strict: train on the past, test on the future. We cannot randomly shuffle; that would be cheating (data leakage).

We will use the first 80% of days to train and the remaining 20% to test.

# Calculate the split point
train_size <- floor(0.8 * n)
train <- sales_data[1:train_size, ]
test  <- sales_data[(train_size+1):n, ]

# Define our feature columns
features <- c("price", "ad_spend", "comp_price", "dow", "month", "promo")

# XGBoost requires numeric matrices. Let's prepare those.
X_train <- train %>% select(all_of(features)) %>% as.matrix()
y_train <- train$sales

X_test  <- test %>% select(all_of(features)) %>% as.matrix()
y_test  <- test$sales
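
A quick sanity check helps confirm the split really is chronological. The sketch below rebuilds only the date column so that it runs on its own; the names train_dates and test_dates are introduced here for illustration.

```r
# Rebuild only the date column so the check is self-contained
n <- 730
dates <- seq.Date(as.Date("2024-01-01"), by = "day", length.out = n)

train_size <- floor(0.8 * n)
train_dates <- dates[1:train_size]
test_dates  <- dates[(train_size + 1):n]

# Every training day must precede every test day, and no day may be lost
stopifnot(max(train_dates) < min(test_dates))
stopifnot(length(train_dates) + length(test_dates) == n)

cat("Train:", format(min(train_dates)), "to", format(max(train_dates)), "\n")
cat("Test: ", format(min(test_dates)),  "to", format(max(test_dates)),  "\n")
```

With the objects from this post, the equivalent check would be max(train$date) < min(test$date).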

Step 3: The Baseline - Linear Regression

Let’s establish our benchmark. Linear Regression assumes a straight-line relationship between features and sales. It is fast, interpretable, and often surprisingly effective.

# Train the linear model
lm_model <- lm(sales ~ ., data = train %>% select(-date))

# Predict on the unseen test data
lm_pred <- predict(lm_model, newdata = test)

# Calculate the Root Mean Square Error (RMSE)
lm_rmse <- rmse(y_test, lm_pred)
cat("Linear Model RMSE:", round(lm_rmse, 2), "\n")
## Linear Model RMSE: 30.91

The linear model gives us an RMSE of 30.91. This is our number to beat.
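
An RMSE is easier to judge next to a naive reference point. The sketch below writes the metric out by hand and scores a forecast that always predicts the training mean; rmse_manual and naive_rmse are hypothetical helpers, and the toy numbers are made up rather than taken from the NexaMart data.

```r
# RMSE written out by hand: square the errors, average them, take the root
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical helper: score a forecast that always predicts the training mean
naive_rmse <- function(y_train, y_test) {
  rmse_manual(y_test, rep(mean(y_train), length(y_test)))
}

# Tiny made-up example (not NexaMart data)
toy_train <- c(10, 20, 30)
toy_test  <- c(18, 22)
print(naive_rmse(toy_train, toy_test))
```

Calling naive_rmse(y_train, y_test) on the objects from Step 2 would show how much of the error the linear model actually removes compared with doing nothing clever at all.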


Step 4: The Contender - XGBoost Ensemble

Now, we bring out the heavy artillery: XGBoost. This algorithm is renowned for winning Kaggle competitions. We will set standard hyperparameters: a learning rate of 0.1, tree depth of 5, and 150 boosting rounds.

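Note: The call below uses the newer xgboost R interface, which takes the feature matrix as x, the target as y, and the hyperparameters as direct arguments. Older releases of the package expect data and label plus a params list, so adjust the call if your installed version rejects these arguments.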

# Train XGBoost with modern parameter passing
xgb_model <- xgboost(
  x = X_train,
  y = y_train,
  objective = "reg:squarederror",  # We are predicting continuous numbers
  booster = "gbtree",
  eta = 0.1,           # Learning rate
  max_depth = 5,       # Complexity of each tree
  subsample = 0.8,     # Prevent overfitting
  colsample_bytree = 0.8,
  nrounds = 150,       # Number of boosting rounds
  verbose = 0          # Keep the output clean
)

# Predict on the test set
xgb_pred <- predict(xgb_model, X_test)

# Calculate the RMSE for this powerhouse
xgb_rmse <- rmse(y_test, xgb_pred)
cat("XGBoost Ensemble RMSE:", round(xgb_rmse, 2), "\n")
## XGBoost Ensemble RMSE: 34.99

Step 5: The Showdown - Visualizing the Results

Let’s compare the numbers side-by-side and look at the actual predictions.

# Create a comparison data frame
comparison <- data.frame(
  Model = c("Linear Regression", "Boosting (XGBoost)"),
  RMSE = c(lm_rmse, xgb_rmse)
)

# Calculate the difference
difference <- round(xgb_rmse - lm_rmse, 2)
cat("XGBoost performed worse by an RMSE of:", difference, "\n")
## XGBoost performed worse by an RMSE of: 4.09
print(comparison)
##                Model     RMSE
## 1  Linear Regression 30.90529
## 2 Boosting (XGBoost) 34.99189
# Now, let's plot the actual sales versus our predictions
results <- test %>%
  select(date, sales) %>%
  mutate(
    Linear_Pred = lm_pred,
    Boosting_Pred = xgb_pred
  ) %>%
  pivot_longer(cols = c(sales, Linear_Pred, Boosting_Pred),
               names_to = "Type", values_to = "Units")

ggplot(results, aes(x = date, y = Units, color = Type)) +
  geom_line(alpha = 0.8, linewidth = 0.8) +  # linewidth replaces size for lines in ggplot2 >= 3.4
  labs(
    title = "NexaMart Sales Forecast: The Surprising Winner",
    subtitle = paste0("Linear Regression (RMSE: ", round(lm_rmse,2), 
                     ") vs XGBoost (RMSE: ", round(xgb_rmse,2), ")"),
    y = "Units Sold", 
    x = "Date"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("sales" = "black", "Linear_Pred" = "blue", "Boosting_Pred" = "red"))


Step 6: The Verdict - Why Linear Won (And What It Teaches Us)

Look at the RMSE: 30.91 (Linear) vs 34.99 (XGBoost).

Our complex ensemble model performed worse than the simple linear regression! At first glance, this seems like a failure. But in reality, this is the most valuable lesson in data science.

Why did this happen?

  1. The Data was “Linear-Friendly”: I simulated the data with a strong linear base. While I added a small interaction term, the noise (rnorm(n, 0, 30)) was large relative to the interaction effect. XGBoost, trying to find complex patterns, ended up “overfitting” to the random noise in the training set.
  2. Higher Variance: Ensemble models are more flexible. With the settings used here (150 rounds, depth 5), XGBoost has enough complexity to latch onto spurious patterns that don’t generalize to the test set.
  3. Occam’s Razor: The simplest solution is often the best. Linear regression assumed a straight line, got the general trend right, and ignored the noise.
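
The first point can be checked with a little arithmetic. Most of the price × ad_spend term is absorbed by the additive price and ad_spend effects; only the leftover is truly non-linear. For independent uniform inputs, that leftover has a standard deviation of 0.0003 × sd(price) × sd(ad_spend), which is about 2.25 units against a noise standard deviation of 30. The sketch below estimates the same quantity by simulation; it uses its own seed and sample size and skips the rounding applied in Step 1, so treat it as an approximation.

```r
set.seed(42)     # arbitrary seed, independent of the post's data
m <- 100000      # large sample so the estimate is stable
price    <- runif(m, 1500, 3500)
ad_spend <- runif(m, 5, 50)

interaction_term <- 0.3 * price * ad_spend / 1000

# Part of the interaction that additive price and ad_spend terms cannot absorb
leftover <- resid(lm(interaction_term ~ price + ad_spend))

sd_leftover <- sd(leftover)
noise_sd <- 30   # the noise level used when simulating sales

cat("SD of the purely non-linear part:", round(sd_leftover, 2), "\n")
cat("Noise SD / non-linear SD:", round(noise_sd / sd_leftover, 1), "\n")
```

A signal that small, buried under noise more than ten times larger, leaves a flexible model very little to gain and plenty of noise to memorize.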

What would we do differently in production? If we wanted to use boosting, we would:

- Tune Hyperparameters: Reduce max_depth to 3 or lower eta (for example, to 0.05) to make the model less aggressive.
- Early Stopping: Stop the boosting rounds when performance on a validation set stops improving, preventing overfitting.
- Feature Engineering: Give the linear model the interaction term manually, or give XGBoost more relevant features.
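
As one illustration of the feature-engineering route, the sketch below adds the price × ad_spend term to a linear model by hand. It runs on freshly simulated toy data with only two features (an assumption made to keep the example short), not on the dataset from Step 1, so its numbers are not comparable to the RMSEs reported above.

```r
set.seed(7)  # arbitrary seed; this is fresh toy data, not the post's dataset
n_toy <- 730
toy <- data.frame(
  price    = runif(n_toy, 1500, 3500),
  ad_spend = runif(n_toy, 5, 50)
)
toy$sales <- 500 - 0.08 * toy$price + 1.2 * toy$ad_spend +
  0.3 * toy$price * toy$ad_spend / 1000 + rnorm(n_toy, 0, 30)

fit_additive    <- lm(sales ~ price + ad_spend, data = toy)
fit_interaction <- lm(sales ~ price + ad_spend + price:ad_spend, data = toy)

# In-sample fit can only improve when a term is added; the F-test asks
# whether the improvement is larger than noise alone would produce
print(anova(fit_additive, fit_interaction))
print(coef(fit_interaction)["price:ad_spend"])
```

Because the interaction is weak relative to the noise, the extra coefficient may not be clearly distinguishable from zero at this sample size, so any gain should be confirmed on the held-out test period rather than assumed.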


Step 7: Feature Importance (Even When It Loses)

Even though XGBoost lost, it still provides us with a “Feature Importance” chart. This tells us which features the model relied on most when predicting sales.

# Extract feature importance
importance_matrix <- xgb.importance(model = xgb_model, feature_names = features)

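# Plot the top 5 drivers of sales
# xgb.plot.importance() draws with base graphics, so the title is passed via
# `main` and `sub`; a ggplot2 labs() layer cannot be added to it with `+`
xgb.plot.importance(
  importance_matrix,
  top_n = 5,
  main = "Top Drivers of Sales (According to XGBoost)",
  sub = "Ad Spend and Price are consistently the biggest levers"
)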

Notice that ad_spend and price are at the top. This aligns with our linear model. So even though the numbers were worse, the business insight is consistent: NexaMart should focus on advertising and competitive pricing.


Conclusion: The Art of Choosing Models

This project illustrates that a Data Scientist’s value lies not in knowing the most complex algorithm, but in knowing which algorithm fits the problem.

For NexaMart:

- The Linear Model is cheaper, faster, more interpretable, and (on this data) more accurate. We would deploy this tomorrow.
- The Boosting Model taught us that if we collect more data or introduce more complex features (like weather, holidays, or economic indicators), we might unlock its power.

By honestly reporting these results, we demonstrate intellectual integrity—a trait far more valuable to an employer than a fabricated “boosting wins” story.

This is what real-world data science looks like. Sometimes, the “smartest” model gets beaten by a straight line.