Introduction
When it comes to understanding football tactics, predicting how dangerous a particular moment might become can be a huge advantage. In my previous tutorial, we explored how multiple linear regression can help us make sense of fan attendance patterns. This time, let’s turn our attention to a machine learning approach you’ll probably encounter often: Random Forest regression.
Random Forests are a favorite among data scientists because they make very few assumptions about your data—they just let the data speak! Our focus here will be on building an “Expected Threat” (xT) model, predicting how likely a team is to score based on the ball’s position and defensive context.
Setting the Scene: What is xT?
Imagine you’re watching a game. The ball’s in play, and you want to know: “What’s the real danger here?” The xT value helps answer that by estimating the chance that the current possession will lead to a goal within the next few moments. It’s expressed as a probability, so it is always a number between 0 and 1.
For our model, we’ll look at:
x_coord: Where is the ball horizontally (from own goal = 0 to opponent’s = 100)?
y_coord: Where is it vertically (sideline = 0/100, centre = 50)?
pressure: Number of defenders nearby.
density: How many defenders are between the ball and the goal.
Similar to the last tutorial, we will create mock data to simulate game states and their corresponding xT values. This tutorial will follow a 9-step process for predictive analytics:
1. Data Preparation
2. Load required packages on R
3. Data Splitting
4. Model Building (Baseline)
5. Hyperparameter Tuning
6. Model Inspection & Selection
7. Prediction (on Final Model)
8. Evaluation (on Final Model)
9. Interpretation and Results
Step 1: Data Preparation
This is the most complex part. We need to simulate realistic “game states.” We will invent a “ground truth” formula for xT_value that our model will have to learn.
Our formula will say that xT_value is:
- Highest when x_coord is high (near opponent’s goal).
- Highest when y_coord is central (near 50).
- Lower when pressure is high.
- Lower when density is high.
Our Random Forest model will not know this formula; its job is to discover these relationships from the data.
# 1. CREATE HYPOTHETICAL DATA
set.seed(123)
num_states <- 2000 # 2000 "snapshots" of the ball
# Create predictors
x_coord <- runif(num_states, 0, 100) # 0=own goal, 100=opponent goal
y_coord <- runif(num_states, 0, 100) # 0/100=sideline, 50=center
pressure <- sample(0:3, num_states, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
density <- sample(0:5, num_states, replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# Our Formula for the xT Value
# 1. Base threat from zone (highest at x=100, y=50)
base_threat <- plogis(-5 + (x_coord * 0.1) - (abs(y_coord - 50) * 0.05))
# 2. Penalty from pressure/density
penalty <- 1 / (1 + (pressure * 0.5) + (density * 0.2))
# 3. Final xT value
xT_value <- base_threat * penalty
# End of Formula
# Create the final dataframe
xt_data <- data.frame(
  x_coord,
  y_coord,
  pressure = as.factor(pressure), # Treat these as categories
  density = as.factor(density),
  xT_value = xT_value # Our numeric target variable
)
summary(xt_data)
##     x_coord            y_coord           pressure density    xT_value
##  Min.   : 0.04653   Min.   : 0.006534   0:787    0:413   Min.   :0.0002735
##  1st Qu.:25.36180   1st Qu.:23.871404   1:629    1:602   1st Qu.:0.0119670
##  Median :49.49429   Median :50.007570   2:404    2:401   Median :0.1191179
##  Mean   :49.77672   Mean   :50.129441   3:180    3:182   Mean   :0.2190456
##  3rd Qu.:74.62122   3rd Qu.:75.694901            4:201   3rd Qu.:0.3815966
##  Max.   :99.95240   Max.   :99.965101            5:201   Max.   :0.9852367
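To get a feel for the ground-truth formula before any modelling, it can help to evaluate it by hand for a few game states. The sketch below wraps the same expressions used above in a helper called xt_truth. That name, and the four example states, are introduced here purely for illustration and are not part of the original script.

```r
# Hypothetical helper: the same ground-truth formula as above, wrapped in a function
xt_truth <- function(x_coord, y_coord, pressure, density) {
  base_threat <- plogis(-5 + (x_coord * 0.1) - (abs(y_coord - 50) * 0.05))
  penalty <- 1 / (1 + (pressure * 0.5) + (density * 0.2))
  base_threat * penalty
}

# A few hand-picked game states (illustrative values, not rows from xt_data)
examples <- data.frame(
  x_coord  = c(50, 90, 90, 90),
  y_coord  = c(50, 50, 10, 50),
  pressure = c(0, 0, 0, 3),
  density  = c(0, 0, 0, 5)
)
examples$xT_value <- with(examples, xt_truth(x_coord, y_coord, pressure, density))
print(examples)
```

The first row (centre of the pitch, no defenders) evaluates to plogis(0) = 0.5. Moving close to goal raises the value, while drifting wide (row 3) or adding defenders (row 4) pulls it back down, which is exactly the pattern the forest will have to rediscover.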
Step 2: Load required packages on R
Next, we load our toolkit. In a typical R script, you’d do this at the very top, but we’ll do it here as part of our workflow.
randomForest is our main model. caret will help with data splitting and assessment. dplyr and ggplot2 handle tidy data wrangling and nice plots.
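The loading step itself is not shown, so here is a minimal sketch of what it might look like, assuming all four packages are already installed.

```r
# 2. LOAD REQUIRED PACKAGES
# (install once with install.packages() if any of them is missing)
library(randomForest) # randomForest(), tuneRF(), varImpPlot()
library(caret)        # createDataPartition(), postResample()
library(dplyr)        # data wrangling
library(ggplot2)      # plotting
```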
Step 3: Data Splitting
We use createDataPartition from caret to split our data into train_data and test_data. From this point forward, we do not touch test_data until it’s time for the final prediction and evaluation in Steps 7 and 8.
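The split code is not shown, so the following is a sketch rather than the original chunk. The 80/20 proportion, the seed and the name train_index are assumptions. The small stand-in data frame only exists so the snippet runs on its own, and is skipped when xt_data from Step 1 is available.

```r
library(caret)

# Stand-in so the snippet runs on its own; skipped if xt_data from Step 1 exists
if (!exists("xt_data")) {
  set.seed(123)
  xt_data <- data.frame(x_coord = runif(200, 0, 100), y_coord = runif(200, 0, 100))
  xt_data$xT_value <- plogis(-5 + 0.1 * xt_data$x_coord - 0.05 * abs(xt_data$y_coord - 50))
}

# 3. SPLIT THE DATA (the 80/20 proportion is an assumption)
set.seed(123)
train_index <- createDataPartition(xt_data$xT_value, p = 0.8, list = FALSE)
train_data <- xt_data[train_index, ]
test_data  <- xt_data[-train_index, ]

dim(train_data)
dim(test_data)
```

For a numeric outcome, createDataPartition samples within quantile groups of the target, so both sets should cover a similar range of xT values.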
Step 4: Model Building (Baseline)
We use the randomForest() function to build our first
baseline model using the training data. This gives us a
starting point to improve upon.
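A sketch of the baseline fit follows. The arguments mirror the Call printed in Step 6. The object name rf_model_baseline and the seed are assumptions, and the toy training set is only a stand-in so the snippet runs on its own.

```r
library(randomForest)

# Stand-in so the snippet runs on its own; skipped if train_data from Step 3 exists
if (!exists("train_data")) {
  set.seed(123)
  train_data <- data.frame(x_coord = runif(300, 0, 100), y_coord = runif(300, 0, 100))
  train_data$xT_value <- plogis(-5 + 0.1 * train_data$x_coord - 0.05 * abs(train_data$y_coord - 50))
}

# 4. BUILD THE BASELINE MODEL
# mtry is left at its regression default: floor(number of predictors / 3), at least 1
set.seed(123)
rf_model_baseline <- randomForest(
  xT_value ~ .,
  data = train_data,
  ntree = 100,
  importance = TRUE
)
print(rf_model_baseline)
```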
Step 5: Hyperparameter Tuning
Now we’ll try to optimise our model. We will tune mtry (the number of variables randomly sampled as candidates at each split) to see if we can find a better configuration.
We use tuneRF, which relies on the “Out-of-Bag” (OOB) error. Each tree is scored on the training rows left out of its bootstrap sample, which works like a built-in cross-validation, so we can choose mtry without peeking at the test set.
# 5. HYPERPARAMETER TUNING FOR mtry
set.seed(123)
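tune_result <- tuneRF(
  x = train_data[, -which(names(train_data) == "xT_value")], # All predictors
  y = train_data$xT_value, # Outcome variable
  stepFactor = 1.5, # Increase/decrease mtry by this factor
  improve = 0.01, # Minimum improvement to continue
  ntreeTry = 100, # Number of trees to try
  plot = TRUE # Plot the results
)
## mtry = 1 OOB error = 0.007018921
## Searching left ...
## Searching right ...
# Pick the mtry with the lowest OOB error from the tuneRF result matrix
# (reconstructed step: the original extraction of best_mtry is not shown)
best_mtry <- tune_result[which.min(tune_result[, "OOBError"]), "mtry"]
print(paste("Best mtry:", best_mtry))
## [1] "Best mtry: 1"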
# Rebuild the model with the best mtry
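rf_model_tuned <- randomForest(
  xT_value ~ .,
  data = train_data,
  ntree = 100,
  mtry = best_mtry,
  importance = TRUE
)
In this case, tuneRF returns mtry = 1, but it evaluated only that value. The search starts at the regression default (the number of predictors divided by 3, rounded down, which is 1 for our four predictors), and with stepFactor = 1.5 the next candidate, floor(1 × 1.5), is still 1, so the search stops in both directions. The output therefore shows that the default was kept, not that it beat the alternatives.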
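One way to check a tuning result like this is to compare every candidate mtry directly: fit one forest per value and read off its final OOB mean squared error. The sketch below is not part of the original workflow. The names mtry_grid, oob_mse and mtry_comparison are made up for illustration, and the toy training set is only a stand-in so the snippet runs on its own.

```r
library(randomForest)

# Stand-in so the snippet runs on its own; skipped if train_data from Step 3 exists
if (!exists("train_data")) {
  set.seed(123)
  train_data <- data.frame(x_coord = runif(300, 0, 100), y_coord = runif(300, 0, 100))
  train_data$xT_value <- plogis(-5 + 0.1 * train_data$x_coord - 0.05 * abs(train_data$y_coord - 50))
}

# Candidate values: 1 up to the number of predictors
n_predictors <- ncol(train_data) - 1
mtry_grid <- seq_len(n_predictors)

# Fit one forest per candidate and keep the OOB MSE after the last tree
set.seed(123)
oob_mse <- sapply(mtry_grid, function(m) {
  fit <- randomForest(xT_value ~ ., data = train_data, ntree = 100, mtry = m)
  fit$mse[fit$ntree]
})

mtry_comparison <- data.frame(mtry = mtry_grid, oob_mse = oob_mse)
print(mtry_comparison)
```

The row with the smallest oob_mse is the candidate to carry forward. No output is shown here because the values depend on the data and the seed.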
Step 6: Model Inspection & Selection
Now we compare our two models using their OOB estimates: the Mean of squared residuals and the % Var explained that each model prints. This helps us decide which model to use for our final test.
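The inspection code is not shown, so this is a sketch of how the summaries could be printed, plus a compact table of the same OOB figures. The name rf_model_baseline is an assumption (only rf_model_tuned appears in the code above), oob_summary is made up for illustration, and the toy models are stand-ins so the snippet runs on its own.

```r
library(randomForest)

# Stand-in models so the snippet runs on its own; skipped if both models already exist
if (!exists("rf_model_baseline") || !exists("rf_model_tuned")) {
  set.seed(123)
  toy <- data.frame(x_coord = runif(300, 0, 100), y_coord = runif(300, 0, 100))
  toy$xT_value <- plogis(-5 + 0.1 * toy$x_coord - 0.05 * abs(toy$y_coord - 50))
  rf_model_baseline <- randomForest(xT_value ~ ., data = toy, ntree = 100)
  rf_model_tuned    <- randomForest(xT_value ~ ., data = toy, ntree = 100, mtry = 1)
}

# 6. INSPECT THE MODELS
print("Baseline Model (Training Performance)")
print(rf_model_baseline)
print("Tuned Model (Training Performance)")
print(rf_model_tuned)

# The same OOB figures as numbers, taken after the last tree
oob_summary <- data.frame(
  model = c("baseline", "tuned"),
  oob_mse = c(tail(rf_model_baseline$mse, 1), tail(rf_model_tuned$mse, 1)),
  pct_var_explained = 100 * c(tail(rf_model_baseline$rsq, 1), tail(rf_model_tuned$rsq, 1))
)
print(oob_summary)
```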
## [1] "Baseline Model (Training Performance)"
##
## Call:
## randomForest(formula = xT_value ~ ., data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.006430948
## % Var explained: 88.69
## [1] "Tuned Model (Training Performance)"
##
## Call:
## randomForest(formula = xT_value ~ ., data = train_data, ntree = 100, mtry = best_mtry, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.007212656
## % Var explained: 87.31
Both models use mtry = 1 and 100 trees, so they are effectively the same configuration. The small gap in % Var explained (88.69% vs 87.31%) reflects the randomness of building a forest (bootstrap samples and split candidates) rather than any gain from tuning. If the tuned model had been clearly better, we would choose that one.
For this tutorial, we’ll select rf_model_tuned as our final model (even though it’s very similar to the baseline).
Step 7: Prediction (on Final Model)
This is the first time we use our
test_data. We take our final_model and use it
to make predictions on the unseen data.
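A sketch of the prediction step, using the object names that appear in Steps 8 and 9 (final_model, predictions, actual_outcomes). The original chunk is not shown, so the exact calls are an assumption, and the toy model and test set are stand-ins so the snippet runs on its own.

```r
library(randomForest)

# Stand-in so the snippet runs on its own; skipped if the Step 3 and Step 5 objects exist
if (!exists("rf_model_tuned") || !exists("test_data")) {
  set.seed(123)
  toy <- data.frame(x_coord = runif(300, 0, 100), y_coord = runif(300, 0, 100))
  toy$xT_value <- plogis(-5 + 0.1 * toy$x_coord - 0.05 * abs(toy$y_coord - 50))
  test_data <- toy[241:300, ]
  rf_model_tuned <- randomForest(xT_value ~ ., data = toy[1:240, ], ntree = 100, mtry = 1)
}

# 7. PREDICT ON THE TEST SET
final_model <- rf_model_tuned
predictions <- predict(final_model, newdata = test_data)
actual_outcomes <- test_data$xT_value

# Peek at the first few predictions next to the true values
head(data.frame(predictions, actual_outcomes))
```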
Step 8: Evaluation (on Final Model)
How good were the predictions from Step 7? We must compare them to
the actual_outcomes. For a regression model, we use metrics
like RMSE and R-squared. This is our final, honest score.
Key Metrics:
- RMSE (Root Mean Squared Error): The square root of the average squared error, in the same units as xT_value. Large errors are penalised more heavily. A lower RMSE is better.
- Rsquared (R²): How much of the variance in xT_value did our model successfully explain on the new data? A higher R-squared (closer to 1.0) is better.
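To make these definitions concrete, here is a small worked example that computes the metrics by hand. The two vectors are made-up numbers for illustration only, not results from this tutorial. For regression, caret::postResample reports Rsquared as the squared correlation between predictions and observed values, which is what the sketch mirrors.

```r
# Small made-up vectors purely for illustration (not results from the tutorial)
obs  <- c(0.10, 0.20, 0.40, 0.80)
pred <- c(0.15, 0.10, 0.50, 0.70)

errors <- pred - obs
rmse <- sqrt(mean(errors^2)) # root of the mean squared error
mae  <- mean(abs(errors))    # mean of the absolute errors
r2   <- cor(pred, obs)^2     # squared correlation between predictions and observations

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

Here the MAE is 0.0875 and the RMSE is about 0.090. RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.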
# 8. EVALUATE THE FINAL MODEL'S PREDICTIONS
eval_metrics <- caret::postResample(pred = predictions, obs = actual_outcomes)
print("Test Set Performance Metrics (Final Model)")
## [1] "Test Set Performance Metrics (Final Model)"
print(eval_metrics)
##       RMSE   Rsquared        MAE
## 0.07963433 0.95034555 0.06316457
# We can also plot predictions vs actual
# A good model will have points close to the red line
plot_data <- data.frame(predictions, actual_outcomes)
ggplot(plot_data, aes(x = actual_outcomes, y = predictions)) +
  geom_point(alpha = 0.4) +
  geom_abline(color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs. Predicted xT Values (Test Set)",
    x = "Actual (True) xT Value",
    y = "Predicted xT Value"
  ) +
  theme_minimal()
RMSE (Root Mean Squared Error): 0.0796
This summarises the typical size of the errors your model makes when predicting xT, with larger errors weighted more heavily. Since your xT values range from 0 to 1 (it’s a probability), an RMSE of about 0.08 means a typical prediction is roughly 0.08 units away from the true xT value.
In practical terms: If the actual threat of a situation is 0.20, the model might predict 0.12 or 0.28, so predictions are typically within about 8 percentage points of the correct answer.
Rsquared (R²): 0.95
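R², or the “coefficient of determination,” tells us how much of the variation in xT the model accounts for. With an R² of 0.95, the model does an exceptional job on unseen test data. Note that caret::postResample computes this figure as the squared correlation between predicted and actual values, so it is not directly comparable to the % Var explained printed by randomForest in Step 6, which is based on 1 − MSE / Var(y).
For sports analytics, this is a very strong R² and suggests your predictor variables (like ball position, defensive pressure, etc.) are capturing most factors that determine Expected Threat, which is expected here because the mock xT values were generated from exactly these variables.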
MAE (Mean Absolute Error): 0.0632
MAE is the average of the absolute errors. In other words, it’s the average distance between the predictions and the actual xT values, without squaring the differences as RMSE does.
In simple terms: On average, the predictions are about 0.063 off from the true xT, or roughly 6.3 percentage points, which is small relative to the 0-to-1 range of a probability.
From the plot, we can see how closely our predictions align with the actual xT values. The closer the points are to the red dashed line, the better our model’s performance. The vast majority of the data points are crowded in the bottom-left corner (from 0.0 to 0.25). The model has many examples of low-threat situations to learn from, so it tends to predict them well.
As you move to the right (Actual xT > 0.5), the dots get much sparser. These are rare, high-threat situations. Because the model has fewer examples of these high-threat states, its predictions there are less reliable. Overall, however, the model does a good job of capturing the general trend.
Step 9: Interpretation and Results
Finally, the final_model can be interpreted to
understand what it learned. We can’t plot one simple tree (we
built 100!), so we ask the forest which features were most
useful for making its predictions.
# 9. VISUALISE VARIABLE IMPORTANCE
varImpPlot(
  final_model,
  main = "Variable Importance for Predicting Expected Threat (xT)"
)
The plot shows the importance of each predictor variable in estimating xT_value, with x_coord being the most important, followed by pressure, y_coord and density.
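The numbers behind the plot can also be read as a table with importance() from the randomForest package. This is a sketch that assumes final_model was fitted with importance = TRUE, as rf_model_tuned was in Step 5. The name imp is made up for illustration, and the toy model is a stand-in so the snippet runs on its own.

```r
library(randomForest)

# Stand-in so the snippet runs on its own; skipped if final_model from Step 7 exists
if (!exists("final_model")) {
  set.seed(123)
  toy <- data.frame(x_coord = runif(300, 0, 100), y_coord = runif(300, 0, 100))
  toy$xT_value <- plogis(-5 + 0.1 * toy$x_coord - 0.05 * abs(toy$y_coord - 50))
  final_model <- randomForest(xT_value ~ ., data = toy, ntree = 100, importance = TRUE)
}

# Importance matrix: %IncMSE (permutation based) and IncNodePurity (split based)
imp <- importance(final_model)

# Sort predictors from most to least important by %IncMSE
imp[order(imp[, "%IncMSE"], decreasing = TRUE), , drop = FALSE]
```

%IncMSE measures how much the OOB error grows when a predictor’s values are shuffled, while IncNodePurity sums the reduction in residual sum of squares from splits on that predictor. The two rankings usually agree on the top variables but need not match exactly.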
Conclusion
In this tutorial, we built a Random Forest regressor to predict the Expected Threat (xT) of football game states using continuous target values. As you might have noticed, the process is similar to the linear regression tutorial. However, one of the key differences between linear regression and a random forest regressor is the latter’s ability to capture complex, non-linear relationships.
In terms of interpretation, while linear regression provides coefficients that indicate the direction and magnitude of predictor effects, random forests offer variable importance measures that highlight which features are most influential in making predictions. Simply said, if you’re wondering whether Random Forests beat out linear regression, consider that their flexibility comes at the cost of interpretability. But for spotting tricky, nonlinear patterns, they’re a powerful tool.