Introduction

When it comes to understanding football tactics, predicting how dangerous a particular moment might become can be a huge advantage. In my previous tutorial, we explored how multiple linear regression can help us make sense of fan attendance patterns. This time, let’s turn our attention to a machine learning approach you’ll probably encounter often: Random Forest regression.

Random Forests are a favorite among data scientists because they make very few assumptions about your data—they just let the data speak! Our focus here will be on building an “Expected Threat” (xT) model, predicting how likely a team is to score based on the ball’s position and defensive context.

Setting the Scene: What is xT?

Imagine you’re watching a game. The ball’s in play, and you want to know: “What’s the real danger here?” The xT value helps answer that by estimating the chance that this possession will lead to a goal within moments. It’s expressed as a probability, a number between 0 and 1.

For our model, we’ll look at four predictors:

- x_coord: how far up the pitch the ball is (0 = own goal, 100 = opponent’s goal)
- y_coord: the ball’s lateral position (0/100 = sidelines, 50 = the center)
- pressure: the level of defensive pressure on the ball (an ordinal scale from 0 to 3)
- density: how congested the area around the ball is (an ordinal scale from 0 to 5)

Similar to the last tutorial, we will create mock data to simulate game states and their corresponding xT values. This tutorial will follow a 9-step process for predictive analytics:

1. Data Preparation
2. Load required packages in R
3. Data Splitting
4. Model Building (Baseline)
5. Hyperparameter Tuning
6. Model Inspection & Selection
7. Prediction (on Final Model)
8. Evaluation (on Final Model)
9. Interpretation and Results



Step 1: Data Preparation

This is the most complex part. We need to simulate realistic “game states.” We will invent a “ground truth” formula for xT_value that our model will have to learn.

Our formula will say that xT_value is:

- Highest when x_coord is high (near the opponent’s goal).
- Highest when y_coord is central (near 50).
- Lower when pressure is high.
- Lower when density is high.

Our Random Forest model will not know this formula; its job is to discover these relationships from the data.

# 1. CREATE HYPOTHETICAL DATA
set.seed(123)
num_states <- 2000 # 2000 "snapshots" of the ball

# Create predictors
x_coord <- runif(num_states, 0, 100) # 0=own goal, 100=opponent goal
y_coord <- runif(num_states, 0, 100) # 0/100=sideline, 50=center
pressure <- sample(0:3, num_states, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
density <- sample(0:5, num_states, replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))

# Our Formula for the xT Value
# 1. Base threat from zone (highest at x=100, y=50)
base_threat <- plogis(-5 + (x_coord * 0.1) - (abs(y_coord - 50) * 0.05))

# 2. Penalty from pressure/density
penalty <- 1 / (1 + (pressure * 0.5) + (density * 0.2))

# 3. Final xT value
xT_value <- base_threat * penalty
# End of Formula

# Create the final dataframe
xt_data <- data.frame(
  x_coord,
  y_coord,
  pressure = as.factor(pressure), # Treat these as categories
  density = as.factor(density),
  xT_value = xT_value # Our numeric target variable
)

summary(xt_data)
##     x_coord            y_coord          pressure density    xT_value        
##  Min.   : 0.04653   Min.   : 0.006534   0:787    0:413   Min.   :0.0002735  
##  1st Qu.:25.36180   1st Qu.:23.871404   1:629    1:602   1st Qu.:0.0119670  
##  Median :49.49429   Median :50.007570   2:404    2:401   Median :0.1191179  
##  Mean   :49.77672   Mean   :50.129441   3:180    3:182   Mean   :0.2190456  
##  3rd Qu.:74.62122   3rd Qu.:75.694901            4:201   3rd Qu.:0.3815966  
##  Max.   :99.95240   Max.   :99.965101            5:201   Max.   :0.9852367
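To make the formula concrete, here is a quick sanity check, plugging two hypothetical game states into the ground-truth formula above (the values in the comments are approximate):

# Sanity check of the ground-truth formula
# Central ball near goal (x = 90, y = 50), no pressure, no density
plogis(-5 + (90 * 0.1) - (abs(50 - 50) * 0.05)) * (1 / (1 + 0 + 0)) # ~0.98

# Deep, wide ball (x = 10, y = 10), pressure = 3, density = 5
plogis(-5 + (10 * 0.1) - (abs(10 - 50) * 0.05)) * (1 / (1 + (3 * 0.5) + (5 * 0.2))) # ~0.0007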

Step 2: Load required packages in R

Next, we load our toolkit. In a typical R script, you’d do this at the very top, but we’ll do it here as part of our workflow.

randomForest is our main model. caret will help with data splitting and evaluation. dplyr and ggplot2 handle tidy data wrangling and nice plots.

# 2. LOAD REQUIRED PACKAGES
library(randomForest) # Our regression model
library(caret)        # For splitting and evaluation
library(dplyr)        # Data wrangling
library(ggplot2)      # Plotting
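If any of these packages are missing on your machine, a one-time install.packages() call (commented out below so it doesn’t re-run) takes care of it:

# One-time setup, if needed:
# install.packages(c("randomForest", "caret", "dplyr", "ggplot2"))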

Step 3: Data Splitting

We use createDataPartition from caret to split our data. From this point forward, we do not touch test_data until it’s time for the final prediction and evaluation in Steps 7 and 8.

# 3. SPLIT DATA (TRAIN & TEST)
set.seed(123)
train_index <- createDataPartition(
  y = xt_data$xT_value, # The numeric outcome variable
  p = 0.7,
  list = FALSE
)

# Create the training and testing sets
train_data <- xt_data[train_index, ]
test_data <- xt_data[-train_index, ]
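A quick check that the split worked as intended; createDataPartition aims to put roughly 70% of the 2000 rows into training:

# Confirm the split sizes (roughly 1400 train / 600 test)
nrow(train_data)
nrow(test_data)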

Step 4: Model Building (Baseline)

We use the randomForest() function to build our first baseline model using the training data. This gives us a starting point to improve upon.

# 4. BUILD THE BASELINE MODEL
# We only use the training data!
# The formula is `xT_value ~ .` (predict xT from all other columns)
rf_model <- randomForest(
  xT_value ~ .,
  data = train_data,
  ntree = 100,       # 100 trees is fine for this example
  importance = TRUE  # Save importance data
)
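As an optional sanity check, plotting the fitted object shows the OOB error as trees are added; if the curve flattens out well before 100 trees, our choice of ntree is enough:

# Optional: OOB error vs. number of trees (the curve should plateau)
plot(rf_model, main = "OOB Error vs. Number of Trees")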

Step 5: Hyperparameter Tuning

Now we’ll try to optimise our model. We will tune mtry (number of variables randomly sampled at each split) to see if we can find a better configuration.

We use tuneRF, which relies on the “Out-of-Bag” (OOB) error, an internal error estimate that plays a similar role to cross-validation, to find the best mtry without peeking at the test set.

# 5. HYPERPARAMETER TUNING FOR mtry
set.seed(123)

tune_result <- tuneRF(
  x = train_data[, -which(names(train_data) == "xT_value")], # All predictors
  y = train_data$xT_value, # Outcome variable
  stepFactor = 1.5, # Increase/decrease mtry by this factor
  improve = 0.01, # Minimum improvement to continue
  ntreeTry = 100, # Number of trees to try
  plot = TRUE # Plot the results
)
## mtry = 1  OOB error = 0.007018921 
## Searching left ...
## Searching right ...

best_mtry <- tune_result[which.min(tune_result[, 2]), 1]
print(paste("Best mtry:", best_mtry))
## [1] "Best mtry: 1"
# Rebuild the model with the best mtry
rf_model_tuned <- randomForest(
  xT_value ~ .,
  data = train_data,
  ntree = 100,
  mtry = best_mtry,
  importance = TRUE
)

In this case, tuneRF suggests that the default mtry = 1 our baseline model already used was optimal.

Step 6: Model Inspection & Selection

Now we compare our two models using their OOB error estimates (the Mean of squared residuals and % Var explained they print). This helps us decide which model to use for our final test.

# 6. INSPECT AND SELECT THE BEST MODEL
print("Baseline Model (Training Performance)")
## [1] "Baseline Model (Training Performance)"
print(rf_model)
## 
## Call:
##  randomForest(formula = xT_value ~ ., data = train_data, ntree = 100,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.006430948
##                     % Var explained: 88.69
print("Tuned Model (Training Performance)")
## [1] "Tuned Model (Training Performance)"
print(rf_model_tuned)
## 
## Call:
##  randomForest(formula = xT_value ~ ., data = train_data, ntree = 100,      mtry = best_mtry, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.007212656
##                     % Var explained: 87.31

In this specific case, tuneRF confirmed that our original mtry=1 was already the best. The % Var explained is nearly identical (88.69% vs 87.31%). If the tuned model had been significantly better, we would choose that one.
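If you prefer to compare the models programmatically rather than reading the printed summaries, the fitted objects store their OOB errors: the mse element holds the OOB mean squared error after each tree is added, so its last entry matches the Mean of squared residuals above.

# OOB MSE after the final (100th) tree, for each model
tail(rf_model$mse, 1)       # baseline
tail(rf_model_tuned$mse, 1) # tuned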

For this tutorial, we’ll select the rf_model_tuned as our final model (even though it’s very similar to the baseline).

final_model <- rf_model_tuned

Step 7: Prediction (on Final Model)

This is the first time we use our test_data. We take our final_model and use it to make predictions on the unseen data.

# 7. MAKE PREDICTIONS ON THE TEST SET (ONE TIME ONLY)
predictions <- predict(
  final_model, 
  newdata = test_data
)

# We'll also store the "ground truth" for comparison
actual_outcomes <- test_data$xT_value
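Before computing any metrics, it can be reassuring to eyeball a few predictions next to the ground truth:

# Peek at the first few predictions vs. the actual values
head(data.frame(predicted = predictions, actual = actual_outcomes))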

Step 8: Evaluation (on Final Model)

How good were the predictions from Step 7? We must compare them to the actual_outcomes. For a regression model, we use metrics like RMSE and R-squared. This is our final, honest score.

Key Metrics:

- RMSE (Root Mean Squared Error): the typical size of the model’s prediction error, in the same units as xT_value. A lower RMSE is better.
- Rsquared (R²): how much of the variance in xT_value our model successfully explained on the new data. A higher R-squared (closer to 1.0) is better.

# 8. EVALUATE THE FINAL MODEL'S PREDICTIONS
eval_metrics <- caret::postResample(pred = predictions, obs = actual_outcomes)

print("Test Set Performance Metrics (Final Model)")
## [1] "Test Set Performance Metrics (Final Model)"
print(eval_metrics)
##       RMSE   Rsquared        MAE 
## 0.07963433 0.95034555 0.06316457
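These metrics are easy to verify by hand; the RMSE reported by postResample, for example, is just the square root of the mean squared difference between predictions and actuals:

# Manual RMSE check (should match the value above)
sqrt(mean((predictions - actual_outcomes)^2))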
# We can also plot predictions vs actual
# A good model will have points close to the red line
plot_data <- data.frame(predictions, actual_outcomes)
ggplot(plot_data, aes(x = actual_outcomes, y = predictions)) +
  geom_point(alpha = 0.4) +
  geom_abline(color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs. Predicted xT Values (Test Set)",
    x = "Actual (True) xT Value",
    y = "Predicted xT Value"
  ) +
  theme_minimal()

From the plot, we can see how closely our predictions align with the actual xT values: the closer the points sit to the red dashed line, the better the model’s performance. The vast majority of points are crowded in the bottom-left corner (roughly 0.0 to 0.25). The model has plenty of low-threat situations to learn from, so it becomes an expert at predicting them. As you move to the right (actual xT > 0.5), the points get much sparser; these are rare, high-threat situations. Because the model has fewer examples of these high-threat states, its predictions there are noisier. Overall, though, the model does a good job of capturing the general trend.

Step 9: Interpretation and Results

Finally, the final_model can be interpreted to understand what it learned. We can’t plot one simple tree (we built 100!), so we ask the forest which features were most useful for making its predictions.

# 9. VISUALISE VARIABLE IMPORTANCE
varImpPlot(
  final_model,
  main = "Variable Importance for Predicting Expected Threat (xT)"
)
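The numbers behind the plot are also available directly via importance(); because we built the forest with importance = TRUE, it reports %IncMSE and IncNodePurity for each predictor:

# Raw importance scores behind the plot
importance(final_model)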

The plot shows the importance of each predictor variable in estimating xT_value, with x_coord being the most important, followed by pressure, y_coord and density.
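To go one step beyond an importance ranking, randomForest’s partialPlot() sketches the marginal effect of a single predictor. For example, the partial dependence of predicted xT on x_coord should rise steadily toward the opponent’s goal, mirroring the ground-truth formula we invented in Step 1:

# Partial dependence of predicted xT on pitch position
partialPlot(
  final_model,
  pred.data = train_data,
  x.var = "x_coord",
  main = "Partial Dependence of xT on x_coord"
)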

Conclusion

In this tutorial, we built a Random Forest regressor to predict the Expected Threat (xT) of football game states using continuous target values. As you might have noticed, the process is similar to the linear regression tutorial. However, a key difference between linear regression and a random forest regressor lies in the latter’s ability to capture complex, non-linear relationships.

In terms of interpretation, while linear regression provides coefficients that indicate the direction and magnitude of predictor effects, random forests offer variable importance measures that highlight which features are most influential in making predictions. Simply put, if you’re wondering whether Random Forests beat out linear regression, consider that their flexibility comes at the cost of interpretability. But for spotting tricky, non-linear patterns, they’re a powerful tool.