Practical Guide to Understanding XGBoost

Fisher Price Stats Series

Author

Nick Lepore (he/him)

What is XGBoost?

XGBoost (Extreme Gradient Boosting) is one of the most widely used machine learning algorithms for tabular data, known for its high predictive power and flexibility. It is an implementation of gradient-boosted decision trees, a method that builds an ensemble (or forest) of trees, each one trained to fix the mistakes of the previous trees. It does this by minimizing a loss function (such as squared error or log loss) using a technique similar to gradient descent.

Why Use XGBoost?

XGBoost has emerged as one of the most powerful and popular algorithms in applied machine learning, particularly for structured data problems. One of its primary advantages is accuracy. XGBoost consistently performs well in machine learning competitions and real-world applications due to its ability to model complex, non-linear relationships and interactions between features. It does this by iteratively correcting the mistakes of previous models, leading to more refined and accurate predictions with each boosting round.

In addition to its accuracy, XGBoost is known for its speed. The library is engineered for efficiency and optimized to support parallel processing, cache-aware computations, and even distributed training. This means models can be trained on large datasets with less computational overhead than many other ensemble methods.

XGBoost is also highly flexible. It supports a variety of supervised learning tasks, including regression, binary classification, multi-class classification, and ranking. Users can specify different objective functions and evaluation metrics, making it adaptable across a wide range of domains and business use cases.
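
As a concrete illustration, switching tasks is mostly a matter of changing the objective (and, optionally, the evaluation metric) in the parameter list passed to xgb.train(). The sketch below shows illustrative parameter lists for regression and binary classification; the values are arbitrary, and the objective names assume a recent version of the xgboost package (older releases used reg:linear for squared-error regression).

# Illustrative parameter lists for two different tasks (values are arbitrary)
params_reg <- list(objective = "reg:squarederror",  # regression on squared error
                   eta = 0.1, max_depth = 3)

params_clf <- list(objective = "binary:logistic",   # binary classification (log loss)
                   eval_metric = "auc",             # report AUC during training
                   eta = 0.1, max_depth = 3)

# Either list can then be passed to xgb.train(), e.g.
# model <- xgb.train(params = params_reg, data = xgb_train, nrounds = 100)
# (xgb_train is the DMatrix built later in this guide)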

Another key strength of XGBoost is its built-in regularization, including both L1 (Lasso) and L2 (Ridge) penalties. These help the model generalize better to unseen data by discouraging overly complex trees that may overfit the training set. This makes XGBoost more robust than traditional decision tree algorithms, which can easily overfit without pruning or regularization.
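
In the R interface, these penalties are exposed as ordinary parameters: lambda controls the L2 (Ridge) penalty on leaf weights, alpha controls the L1 (Lasso) penalty, and gamma sets the minimum loss reduction required to make a split. The values below are a minimal sketch, not tuned settings.

# Illustrative regularization settings (not tuned values)
params_regularized <- list(
  max_depth = 3,
  eta       = 0.1,  # learning rate
  lambda    = 1,    # L2 (Ridge) penalty on leaf weights
  alpha     = 0.5,  # L1 (Lasso) penalty on leaf weights
  gamma     = 0.1   # minimum loss reduction needed to make a split
)

# model <- xgb.train(params = params_regularized, data = xgb_train, nrounds = 100)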

Together, these features make XGBoost a preferred choice when accuracy, speed, and model flexibility are important—and when explainability can be enhanced through tools like feature importance and SHAP values.

When Should You Use XGBoost?

You should use or consider this methodology when:

  • You’re working with structured/tabular data (rows and columns)

  • You care about high predictive accuracy

  • You need a model that supports missing values natively

  • You want built-in tools for model interpretation (e.g., feature importance, SHAP values)

It’s often used in marketing response prediction, sales forecasting, fraud detection, credit scoring, and countless Kaggle competitions.

What are the limitations of XGBoost?

Despite its widespread popularity and strong predictive performance, XGBoost is not without limitations. One of the primary challenges lies in its interpretability. While XGBoost includes tools like feature importance plots and SHAP values to help explain model behavior, it remains considerably less transparent than simpler models such as linear regression. In a linear model, the relationship between predictors and the outcome is explicitly defined by coefficients, making it easier for non-technical stakeholders to understand how input variables influence predictions. In contrast, XGBoost builds a complex ensemble of decision trees where interactions and non-linearities are learned automatically, resulting in a model often described as a “black box.”

Another limitation concerns computational efficiency. Although XGBoost is highly optimized and faster than many traditional boosting methods, it can become computationally expensive when applied to very large datasets, particularly those with a high number of observations (rows) or features (columns). In such cases, alternative gradient boosting libraries like LightGBM or CatBoost may offer better performance due to their more efficient handling of sparse data, categorical variables, and memory usage.

Lastly, XGBoost models require careful tuning of hyperparameters to achieve optimal results. While the algorithm performs reasonably well with default settings, its true power is realized only through systematic optimization of parameters such as learning rate, tree depth, number of rounds, and regularization terms. This tuning process can be time-consuming and often requires cross-validation, domain knowledge, and computational resources to avoid overfitting or underfitting.
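
As a sketch of what that tuning loop can look like in R, xgb.cv() runs k-fold cross-validation for a candidate parameter set and, combined with early stopping, reports the boosting round where held-out error bottoms out. The snippet below assumes the xgboost package is loaded and a DMatrix like the xgb_train object built later in this guide; the parameter values are illustrative, not recommendations.

# Cross-validated search for the number of rounds under one candidate setting
cv <- xgb.cv(
  params = list(max_depth = 3, eta = 0.1),  # candidate hyperparameter values
  data = xgb_train,                         # a DMatrix (built later in this guide)
  nrounds = 500,
  nfold = 5,                                # 5-fold cross-validation
  early_stopping_rounds = 20,               # stop once held-out RMSE stalls for 20 rounds
  verbose = 0
)

cv$best_iteration   # round with the best cross-validated RMSE
cv$evaluation_log   # per-round mean train/test RMSE across folds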

Classroom Metaphor: Helping to Understand XGBoost

To better understand how XGBoost works, imagine a classroom full of students tasked with learning how to predict home prices. Each student represents a simple decision tree — limited in ability and only capable of making basic rules-based predictions. At the beginning of the lesson, the first student takes a shot at the homework (the training data), using whatever strategy they think works best. Naturally, they make some mistakes — maybe they underestimate how much the number of rooms matters or overemphasize the impact of crime rate.

Now, a teacher steps in. The teacher doesn’t just grade the work — they give feedback by pointing out exactly where and how the student was wrong. The next student in the classroom sees both the original homework and the previous student’s errors. Their job isn’t to start from scratch but to focus specifically on correcting the mistakes of the student before them. Student 2 now makes their own mistakes — often by overcompensating for the last student’s errors. This process repeats again and again: each new student learns from all the errors that came before and tries to patch up the remaining gaps in understanding.

Individually, none of these students are perfect — each is only a shallow learner, capable of seeing part of the picture. But together, through this iterative, error-correcting process, the class becomes collectively smart. Their combined knowledge — the ensemble of decision trees — gets very good at predicting home prices, often better than any one student (or tree) could do alone.

This is how XGBoost operates under the hood. It builds a sequence of decision trees where each one is trained to correct the residuals (errors) made by the ensemble of all previous trees. Over time, the model homes in on the most accurate predictions it can achieve, much like a classroom getting closer to the right answers with each new round of feedback and revision.
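
To connect the metaphor back to code, the sketch below hand-rolls the same idea on the Boston data used later in this guide: start from the average home price, fit a shallow rpart tree to the current residuals, add a damped version of its predictions, and repeat. This is a simplified illustration of boosting for squared error (no regularization or subsampling), not how the xgboost library is implemented internally; the learning rate and tree count are arbitrary.

library(rpart)  # shallow trees play the role of the "students"

boston <- MASS::Boston
X <- subset(boston, select = -medv)
learning_rate <- 0.1  # how much of each student's correction to keep
n_trees <- 50

pred <- rep(mean(boston$medv), nrow(boston))  # student 0: guess the average price

for (i in seq_len(n_trees)) {
  resid_i <- boston$medv - pred                        # the graded mistakes so far
  fit <- rpart(resid_i ~ ., data = cbind(X, resid_i),
               control = rpart.control(maxdepth = 3))  # a shallow learner
  pred <- pred + learning_rate * predict(fit, X)       # patch the remaining gaps
}

sqrt(mean((boston$medv - pred)^2))  # training RMSE shrinks as trees accumulate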

Running an Example Model

Data Splitting and Preparation

This dataset (MASS::Boston) contains 13 predictor variables that we’ll use to predict one response variable called medv, which represents the median value of homes in different census tracts around Boston. The dataset contains 506 observations and 14 total variables.

Before building a predictive model, we split our dataset into two parts: a training set and a test set. The training set is used to fit the model, while the test set helps evaluate how well the model generalizes to new, unseen data. This split is crucial for preventing overfitting and simulating how the model would perform in production.

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(xgboost))

set.seed(666)
data <- MASS::Boston

# split into training (80%) and testing (20%) sets
parts <- createDataPartition(data$medv, p = 0.8, list = FALSE)
train <- data[parts,]
test <- data[-parts,]

After the split, we isolate the predictor variables and the response variable (medv, or median home value) into separate matrices. XGBoost requires numerical matrix input for its internal optimization.

#define predictor and response variables in training set
train_x <- data.matrix(select(train, -medv))
train_y <- train$medv

#define predictor and response variable in testing set
test_x <- data.matrix(select(test, -medv))
test_y <- test$medv

Converting Data into XGBoost Format

XGBoost uses a specific data structure called a ‘DMatrix’, which is optimized for speed and memory efficiency. We convert our training and test matrices into this format to prepare for modeling.

xgb_train <- xgb.DMatrix(data = train_x, label = train_y)
xgb_test <- xgb.DMatrix(data = test_x, label = test_y)

Model Training with Monitoring

To begin training our model, we use the xgb.train() function, which allows for more granular control than the high-level xgboost() wrapper. Here, we specify two key hyperparameters: the maximum tree depth and the number of boosting rounds. Additionally, we define a watchlist—a set of datasets (in this case, training and test sets) for which the model will log performance metrics after each boosting round.

In this example, we set the number of boosting rounds (nrounds) to 100. This value controls how many trees the model will build sequentially. While 100 is sufficient for this relatively small dataset, it is common to use hundreds—or even thousands—of rounds for larger or more complex datasets. However, it’s important to remember that more rounds increase both training time and the risk of overfitting, especially if early stopping or regularization is not used.

We also set max.depth = 3, which constrains how deep each individual tree can grow. A shallow depth (commonly between 2 and 4) is preferred in gradient boosting, as it encourages the model to build a large ensemble of simple learners. These small trees capture general patterns rather than fitting idiosyncrasies in the training data, ultimately improving generalization performance.

#define the watchlist
watchlist <- list(train = xgb_train, test = xgb_test)

model <- xgb.train(data = xgb_train, max.depth = 3, 
                   watchlist = watchlist, 
                   nrounds = 100)
[1] train-rmse:17.074084    test-rmse:17.461590 
[2] train-rmse:12.373532    test-rmse:12.866384 
[3] train-rmse:9.097789 test-rmse:9.658035 
[4] train-rmse:6.804359 test-rmse:7.467660 
[5] train-rmse:5.238225 test-rmse:5.906971 
[6] train-rmse:4.197283 test-rmse:5.010275 
[7] train-rmse:3.490850 test-rmse:4.351104 
[8] train-rmse:3.056421 test-rmse:3.995134 
[9] train-rmse:2.766979 test-rmse:3.847864 
[10]    train-rmse:2.568076 test-rmse:3.729875 
[11]    train-rmse:2.438584 test-rmse:3.661224 
[12]    train-rmse:2.317215 test-rmse:3.622712 
[13]    train-rmse:2.251488 test-rmse:3.543120 
[14]    train-rmse:2.153198 test-rmse:3.488909 
[15]    train-rmse:2.100967 test-rmse:3.440475 
[16]    train-rmse:2.029767 test-rmse:3.429986 
[17]    train-rmse:1.977669 test-rmse:3.380167 
[18]    train-rmse:1.944740 test-rmse:3.374696 
[19]    train-rmse:1.858340 test-rmse:3.307926 
[20]    train-rmse:1.801895 test-rmse:3.265149 
[21]    train-rmse:1.770711 test-rmse:3.253217 
[22]    train-rmse:1.728918 test-rmse:3.244515 
[23]    train-rmse:1.706003 test-rmse:3.217940 
[24]    train-rmse:1.681012 test-rmse:3.206996 
[25]    train-rmse:1.638893 test-rmse:3.196978 
[26]    train-rmse:1.627310 test-rmse:3.187975 
[27]    train-rmse:1.600885 test-rmse:3.175280 
[28]    train-rmse:1.573738 test-rmse:3.147306 
[29]    train-rmse:1.542200 test-rmse:3.117934 
[30]    train-rmse:1.521615 test-rmse:3.106745 
[31]    train-rmse:1.496952 test-rmse:3.092283 
[32]    train-rmse:1.475341 test-rmse:3.083602 
[33]    train-rmse:1.458105 test-rmse:3.081130 
[34]    train-rmse:1.434331 test-rmse:3.067813 
[35]    train-rmse:1.425292 test-rmse:3.063306 
[36]    train-rmse:1.406196 test-rmse:3.061016 
[37]    train-rmse:1.364669 test-rmse:3.062102 
[38]    train-rmse:1.353563 test-rmse:3.063070 
[39]    train-rmse:1.318068 test-rmse:3.046975 
[40]    train-rmse:1.301413 test-rmse:3.057864 
[41]    train-rmse:1.279898 test-rmse:3.065304 
[42]    train-rmse:1.264926 test-rmse:3.068018 
[43]    train-rmse:1.244449 test-rmse:3.053596 
[44]    train-rmse:1.236074 test-rmse:3.061812 
[45]    train-rmse:1.227887 test-rmse:3.060089 
[46]    train-rmse:1.217920 test-rmse:3.059514 
[47]    train-rmse:1.206329 test-rmse:3.055247 
[48]    train-rmse:1.189609 test-rmse:3.054242 
[49]    train-rmse:1.162534 test-rmse:3.041356 
[50]    train-rmse:1.144909 test-rmse:3.036817 
[51]    train-rmse:1.132334 test-rmse:3.029646 
[52]    train-rmse:1.116703 test-rmse:3.027764 
[53]    train-rmse:1.102554 test-rmse:3.028239 
[54]    train-rmse:1.091736 test-rmse:3.037952 
[55]    train-rmse:1.075572 test-rmse:3.042324 
[56]    train-rmse:1.063501 test-rmse:3.043412 
[57]    train-rmse:1.054008 test-rmse:3.040448 
[58]    train-rmse:1.047326 test-rmse:3.031618 
[59]    train-rmse:1.038730 test-rmse:3.026163 
[60]    train-rmse:1.024430 test-rmse:3.013002 
[61]    train-rmse:1.010530 test-rmse:3.016801 
[62]    train-rmse:1.002087 test-rmse:3.011642 
[63]    train-rmse:0.993566 test-rmse:3.010908 
[64]    train-rmse:0.980303 test-rmse:3.011502 
[65]    train-rmse:0.966931 test-rmse:3.005774 
[66]    train-rmse:0.953414 test-rmse:3.018112 
[67]    train-rmse:0.946001 test-rmse:3.007038 
[68]    train-rmse:0.930490 test-rmse:3.002940 
[69]    train-rmse:0.916495 test-rmse:2.995773 
[70]    train-rmse:0.901322 test-rmse:2.986512 
[71]    train-rmse:0.899469 test-rmse:2.985599 
[72]    train-rmse:0.895218 test-rmse:2.991023 
[73]    train-rmse:0.879811 test-rmse:2.996202 
[74]    train-rmse:0.870915 test-rmse:2.999885 
[75]    train-rmse:0.855303 test-rmse:3.002461 
[76]    train-rmse:0.850526 test-rmse:3.006822 
[77]    train-rmse:0.839257 test-rmse:3.009883 
[78]    train-rmse:0.830284 test-rmse:3.006069 
[79]    train-rmse:0.818284 test-rmse:3.010761 
[80]    train-rmse:0.815235 test-rmse:3.014790 
[81]    train-rmse:0.809399 test-rmse:3.015254 
[82]    train-rmse:0.803663 test-rmse:3.011946 
[83]    train-rmse:0.799276 test-rmse:3.012252 
[84]    train-rmse:0.790383 test-rmse:3.010558 
[85]    train-rmse:0.781849 test-rmse:3.014770 
[86]    train-rmse:0.771216 test-rmse:3.012953 
[87]    train-rmse:0.764252 test-rmse:3.019482 
[88]    train-rmse:0.759158 test-rmse:3.018897 
[89]    train-rmse:0.751736 test-rmse:3.019813 
[90]    train-rmse:0.746386 test-rmse:3.019666 
[91]    train-rmse:0.735551 test-rmse:3.020498 
[92]    train-rmse:0.729518 test-rmse:3.021683 
[93]    train-rmse:0.715510 test-rmse:3.013982 
[94]    train-rmse:0.705789 test-rmse:3.013081 
[95]    train-rmse:0.695209 test-rmse:3.009481 
[96]    train-rmse:0.685817 test-rmse:3.011045 
[97]    train-rmse:0.677410 test-rmse:3.012857 
[98]    train-rmse:0.671372 test-rmse:3.011165 
[99]    train-rmse:0.662652 test-rmse:3.011561 
[100]   train-rmse:0.652543 test-rmse:3.009788 

As the model trains, it prints the Root Mean Square Error (RMSE) at each iteration for both the training and test sets. RMSE represents the average magnitude of prediction error in the same units as the outcome (here, thousands of dollars). By monitoring these metrics, we can detect overfitting. In our case, the test RMSE reaches its minimum at iteration 71. After that, the test RMSE begins to increase slightly, while the training RMSE continues to decline. This divergence is a classic sign of overfitting—where the model begins to memorize the training data at the expense of generalizing well to unseen cases.

Based on this observation, we will define our final model using only 71 boosting rounds, which balances predictive accuracy and model robustness.
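
As an aside, xgb.train() can find this stopping point automatically through its early_stopping_rounds argument, which halts training once the last dataset in the watchlist (here, the test set) has failed to improve for a set number of rounds. A minimal sketch, with an arbitrary patience of 20 rounds:

# Let xgb.train() pick the stopping point itself
model_es <- xgb.train(
  data = xgb_train,
  max.depth = 3,
  watchlist = watchlist,
  nrounds = 500,
  early_stopping_rounds = 20,  # stop when the last watchlist entry stops improving
  verbose = 0
)

model_es$best_iteration  # the round with the lowest test RMSE observed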

Tracking RMSE Across Boosting Rounds

To assess how model performance evolves during training, we extract and visualize the RMSE recorded at each boosting round. RMSE is a standard evaluation metric in regression that quantifies the average magnitude of prediction error, expressed in the same units as the outcome variable (in this case, thousands of dollars). The plot below displays RMSE across 100 boosting rounds for both the training and test datasets. Lower RMSE values indicate better predictive accuracy, making this an essential tool for diagnosing model fit and detecting signs of overfitting.

# Extract evaluation log
eval_log <- model$evaluation_log

# Plot RMSE over boosting rounds
eval_log %>%
  pivot_longer(cols = c(train_rmse, test_rmse), names_to = "Dataset", values_to = "RMSE") %>%
  ggplot(aes(x = iter, y = RMSE, color = Dataset)) +
  geom_line(linewidth = 1.2) +
  labs(title = "XGBoost RMSE Over Boosting Rounds",
       x = "Boosting Round", y = "RMSE") +
  theme_minimal()

In the early stages (rounds 1–10), both training and test RMSE decline rapidly. This is typical, as the model quickly learns broad patterns in the data. As training continues, however, the RMSE on the test set begins to flatten out, while the training RMSE continues its downward trend.

The lowest test RMSE is observed at round 71, where it reaches approximately 2.99. After this point, the test RMSE begins to increase gradually, even as the training RMSE continues to decline. This divergence between training and test error indicates the onset of overfitting—the model is beginning to learn noise or idiosyncrasies specific to the training data rather than generalizable patterns. This pattern demonstrates an important lesson in boosting: more trees do not always lead to better performance. In fact, after a certain point, additional complexity can reduce generalization. Thus, in this case, round 71 offers the best tradeoff between model complexity and performance, and is a strong candidate for the optimal stopping point when tuning the model.
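
We can confirm this directly from the evaluation log extracted above rather than scanning the printed output:

# Identify the boosting round with the lowest test RMSE
best_round <- which.min(eval_log$test_rmse)
best_round                      # 71 in this run
eval_log$test_rmse[best_round]  # roughly 2.99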

Fitting Our Final Model

After analyzing the RMSE values across boosting rounds, we observed that the lowest test RMSE was achieved at round 71. Beyond this point, test RMSE began to increase gradually while training RMSE continued to decline—a clear sign that the model was beginning to overfit the training data. Overfitting occurs when a model becomes too closely tailored to the training set, capturing noise and idiosyncrasies rather than general patterns. This can lead to poor performance when applied to new, unseen data.

To mitigate this, we define our final model using only the first 71 rounds of boosting. This decision ensures that the model is trained to a point of optimal generalization, where it performs well not just on the data it has seen but also on the held-out test set.

final <- xgboost(data = xgb_train, max.depth = 3, nrounds = 71, verbose = 0)

Let’s break down this function call:

  • data = xgb_train specifies the training data in DMatrix format, which is optimized for speed and memory efficiency.

  • max.depth = 3 controls the maximum depth of each individual tree in the ensemble. Shallow trees (e.g., depth 2–4) are typically preferred in gradient boosting because they reduce the risk of overfitting and encourage the model to build many small, complementary learners.

  • nrounds = 71 sets the number of boosting iterations to the point where validation performance was best—round 71.

  • verbose = 0 suppresses the console output, making the function call cleaner.

By finalizing our model at 71 rounds, we strike a deliberate balance between underfitting and overfitting. The model is complex enough to capture meaningful structure in the data but restrained enough to remain robust on new inputs. This approach reflects a common best practice in boosting workflows: using evaluation metrics from a validation set to determine an optimal stopping point, rather than training indefinitely.

Model Evaluation on Test Set

After training the final model, the next step is to evaluate its predictive performance on the test set—data that was intentionally held out during model training. This evaluation provides an unbiased estimate of how well the model generalizes to new, unseen data.

We use two common metrics to assess model performance in regression tasks: Root Mean Square Error (RMSE) and R-squared (R²).

# Predict on test set using the final model
preds <- predict(final, test_x)

# Calculate RMSE
rmse <- sqrt(mean((preds - test_y)^2))
cat("Test RMSE:", round(rmse, 3), "\n")
Test RMSE: 2.986 
# Optionally, calculate R-squared
sst <- sum((test_y - mean(test_y))^2)
sse <- sum((preds - test_y)^2)
rsq <- 1 - sse/sst
cat("R-squared:", round(rsq, 3), "\n")
R-squared: 0.901 

Let’s break these metrics down:

  • RMSE measures the average magnitude of prediction errors, penalizing larger errors more heavily due to the squaring. Because it is expressed in the same units as the outcome (in this case, thousands of dollars), RMSE is easy to interpret in practical terms. Our model achieves a test RMSE of 2.986, meaning the typical prediction error is roughly $2,986 in median home value.

  • R² measures the proportion of variance in the outcome variable that is explained by the model. An R² value of 0.901 indicates that approximately 90.1% of the variation in housing prices can be accounted for by the features used in the model. The remaining 9.9% is unexplained, which may be due to random noise or factors not included in the dataset.

Together, these metrics offer a strong assessment of model quality. A relatively low RMSE and a high R² suggest that the XGBoost model has learned a strong, generalizable pattern in the data—one that could be confidently deployed or further explored in production settings.

Feature Importance: Which Variables Did the Model Use?

Before diving into SHAP values, we can start by examining XGBoost’s built-in feature importance scores. For tree models, xgb.importance() reports gain (how much each feature’s splits improved the model), cover, and frequency, and xgb.plot.importance() ranks features by gain by default. While these scores don’t capture the direction of a feature’s influence or its effect on any individual prediction, they provide a useful high-level overview of which variables the model relied on most during training.

importance <- xgb.importance(model = final)
xgb.plot.importance(importance_matrix = importance, top_n = 10)

In the plot, we see that lstat (the percentage of lower-status residents in an area) ranks highest, followed closely by rm (average number of rooms per dwelling) and dis (distance to employment centers). These variables are well known to be strong predictors of housing prices and likely formed the backbone of the model’s decision trees.

Features such as nox (air pollution levels), crim (crime rate), and ptratio (student-teacher ratio) also appeared regularly in tree splits, suggesting that the model considered them meaningful, though somewhat less central. Features like rad (access to highways) and age (proportion of older housing) appeared infrequently and were likely less informative in this particular model. 

It’s important to note, however, that these built-in scores summarize how features are used inside the trees, not how they move individual predictions. A variable can appear in many trees yet have only a modest effect on the final predictions, or it may have strong effects in a few specific cases but not be used often. To truly understand how much a feature contributed—and in which direction—we turn next to SHAP values, which offer a more granular and interpretable view of model behavior.

SHAP Values: Explaining Model Predictions

Once an XGBoost model is trained and evaluated, a natural next step is to understand why the model is making the predictions it does. Traditional feature importance metrics—such as gain, coverage, or frequency—offer limited insight, as they often reflect global averages or structural usage rather than localized influence on predictions.

To address this, we use SHAP values (SHapley Additive exPlanations), a unified approach to feature attribution based on cooperative game theory. SHAP decomposes a prediction into the sum of the contributions from each feature, plus a baseline value. These contributions can be positive or negative, depending on whether a given feature drives the prediction higher or lower. 

In the context of regression, each SHAP value answers the question: 

“For this individual prediction, how much did each feature contribute to shifting the predicted value away from the baseline (i.e., the average prediction)?”

We compute SHAP values using XGBoost’s built-in functionality via the argument predcontrib = TRUE in the predict() function. This returns a matrix with one row per test-set observation, containing a SHAP value for each feature plus a bias term (the average model prediction).

# Compute SHAP values and summarize
shap_values <- predict(final, test_x, predcontrib = TRUE)
shap_df <- as_tibble(shap_values) %>% select(-BIAS)

We then summarize these values by computing the mean absolute SHAP value for each feature. This gives us a measure of global feature importance—that is, how much each feature contributed to the model’s predictions on average, regardless of direction. While SHAP values are measured in the same units as the dependent variable—in this case, thousands of dollars—they may be difficult for non-technical stakeholders to interpret. To increase clarity, we rescale the mean absolute SHAP values to a 0–100 scale, such that: 

  • The most important feature is set to 100.

  • All other feature importances are expressed as a percentage of this maximum. 

This allows stakeholders to easily see, for example, that a given feature is “65% as important as the top feature.”

shap_summary <- shap_df %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "shap") %>%
  group_by(feature) %>%
  summarise(mean_abs_shap = mean(abs(shap))) %>%
  arrange(desc(mean_abs_shap)) %>%
  # Normalize to 0-100 scale
  mutate(scaled_importance = 100 * (mean_abs_shap / max(mean_abs_shap)))

shap_summary
# A tibble: 13 × 3
   feature mean_abs_shap scaled_importance
   <chr>           <dbl>             <dbl>
 1 lstat          3.87              100   
 2 rm             2.51               64.9 
 3 dis            0.838              21.7 
 4 ptratio        0.675              17.5 
 5 nox            0.622              16.1 
 6 age            0.607              15.7 
 7 crim           0.561              14.5 
 8 tax            0.550              14.2 
 9 black          0.334               8.65
10 indus          0.217               5.62
11 rad            0.176               4.55
12 chas           0.0909              2.35
13 zn             0.0527              1.36

The results confirm earlier feature importance plots derived from the XGBoost model. However, SHAP adds interpretability by grounding importance in actual predicted values. Here are several key insights:

  • lstat, the percentage of lower-status individuals in a neighborhood, is the most influential variable, contributing the largest absolute effect on the prediction. It likely captures socioeconomic disadvantage in the housing market.

  • rm, the average number of rooms per dwelling, is also highly predictive, suggesting that larger homes command significantly higher prices.

  • dis, ptratio, and nox round out the next tier of importance, reflecting how location (distance to employment), educational quality, and environmental conditions influence housing values.

To make these contributions visually intuitive, we present the rescaled SHAP values in the bar chart below:

ggplot(shap_summary, aes(x = reorder(feature, scaled_importance), y = scaled_importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "SHAP Summary (Rescaled): Relative Feature Impact",
       x = "Feature", y = "Scaled SHAP Value (0–100)") +
  theme_minimal()

These rankings are intuitive and consistent with domain knowledge of real estate economics, suggesting that the model has successfully learned meaningful patterns.

This horizontal bar chart provides a clear comparison of feature importances, showing not just which features matter most, but also how much more they matter relative to others.
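
The same SHAP matrix also explains individual predictions. For any single test-set row, the feature contributions plus the BIAS term sum (up to floating-point rounding) to the model’s predicted value, so we can see which features pushed a particular home’s predicted price up or down. A brief sketch, using the first test observation as an arbitrary example:

# Explain one prediction: SHAP contributions for the first test-set home
i <- 1
contrib_i <- shap_values[i, colnames(shap_values) != "BIAS"]
sort(contrib_i, decreasing = TRUE)  # features that pushed this price up or down

# Additivity check: contributions + BIAS reproduce the model's prediction
sum(shap_values[i, ])
preds[i]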

SHAP values offer a principled, model-agnostic way to explain individual and global feature importance. When applied to the Boston housing dataset, they reveal a strong influence of socioeconomic status (lstat) and housing size (rm) on median home values. By rescaling the results to a 0–100 scale, we translate complex model internals into a format more accessible to stakeholders, enabling transparent and interpretable decision-making.

Summary

In this write-up, we took a deep dive into implementing and interpreting an XGBoost regression model using the Boston Housing dataset. Let’s recap what we did and what we learned:

What we did:  

  • Prepared the data: We split the data into training and testing sets, converted the features into numeric matrices, and then packaged them in XGBoost’s DMatrix format.

  • Trained an XGBoost model: Using xgb.train() with a watchlist, we trained a model to predict housing prices (medv), monitoring performance at each boosting round.

  • Selected the optimal number of boosting rounds: By comparing RMSE on the training and test sets across 100 rounds, we identified the point at which the model achieved the lowest test error, striking a balance between underfitting and overfitting.

  • Examined feature importance: Using both XGBoost’s built-in importance metrics and SHAP values, we evaluated which predictors most strongly influenced the model’s predictions.

  • Visualized insights: We generated plots to interpret variable importance in a stakeholder-friendly format, including a rescaled 0–100 SHAP summary plot.

What we learned:  

  • XGBoost is powerful but needs careful tuning. Selecting the number of rounds (nrounds) is crucial. Too few rounds underfit; too many can overfit, even with regularization.

  • Interpretability is key. XGBoost models can be opaque, but SHAP values help break down and quantify the individual contribution of each feature to model predictions.

  • Not all “important” variables are equal. SHAP offered a more nuanced picture of feature importance than traditional gain- or frequency-based metrics, allowing us to clearly see that variables like lstat and rm consistently shaped predictions.

  • Rescaling aids communication. Stakeholders may better understand model explanations when impact scores are presented on a common scale (like 0–100), even if the underlying math is more complex.

XGBoost is one of the most powerful tools in a data scientist’s toolkit—offering speed, flexibility, and predictive power. As with any machine learning method, its effectiveness depends on thoughtful modeling decisions, careful validation, and a clear-eyed interpretation of results! Go Forth and Boost!