In this project, I analyze a dataset of Real Madrid C.F. player statistics from the 2015-2016 season until the 2018-2019 season. I limited the players analyzed to those who were active on the team for each of those seasons. These are: Gareth Bale, Karim Benzema, Dani Carvajal, Casemiro, Isco Alarcon, Toni Kroos, Marcelo, Luka Modric, Nacho, Sergio Ramos, and Raphael Varane. My data source is FBREF.
Real Madrid is a dominant force in European football and it is no surprise that their total wins far outweigh their losses and draws.
These are the 7 teams that Real Madrid scored the most against on average. These teams generally sat toward the bottom of the table in this 4-season stretch. The best of them was Celta Vigo in 2015-2016, a year when they finished 6th in La Liga. That year, Real Madrid’s greatest victory was a 7-1 goal fest at home, led by a poker (four goals) from Ronaldo. The worst of these teams was Granada, which spent 2 of the analyzed years in La Liga and 2 in the Second Division. In 2016 they finished 16th in La Liga, and the following year they finished last.
Madrid’s total record against these teams over the four-year period was 26-3-1.
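A ranking and record like the ones above could be produced roughly as follows. This is a minimal sketch on toy data: the data frame `matches` and its columns (`Opponent`, `GF`, `Result`) are assumed names, not the actual FBREF export, and the record is read as wins-draws-losses.

```r
# Toy match log; column names are assumptions, not FBREF's.
matches <- data.frame(
  Opponent = c("Celta Vigo", "Celta Vigo", "Granada", "Granada", "Barcelona", "Barcelona"),
  GF       = c(7, 2, 4, 6, 1, 0),
  Result   = c("W", "D", "W", "W", "D", "L")
)

# Mean goals scored against each opponent, highest first
avg_gf <- sort(tapply(matches$GF, matches$Opponent, mean), decreasing = TRUE)

# Keep the top n opponents (7 in the report; 2 here because the toy data is small)
top_teams <- names(head(avg_gf, 2))

# Wins-draws-losses record against those opponents
top_results <- factor(matches$Result[matches$Opponent %in% top_teams],
                      levels = c("W", "D", "L"))
record <- table(top_results)
record_string <- paste(record, collapse = "-")
```

Fixing the factor levels to `W`, `D`, `L` keeps the record in a stable order even when one of the outcomes never occurs.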
These are the 7 teams Real Madrid scored the least against on average. Notably, their two biggest rivals, Atlético Madrid and FC Barcelona, are the bottom two teams on this list, with Madrid averaging 1.13 and 1.00 goals per game against them, respectively. In the 4 analyzed seasons, Real Madrid, Atleti, and Barcelona were the top 3 finishers in La Liga, and Barcelona finished 1st in three of those years. Atleti and Barcelona conceded an average of 0.632 and 0.862 goals per game, respectively. Although these two teams provided the hardest goal-scoring tests for Real Madrid, Madrid still scored more against them than they conceded on average.
Madrid’s total record against these teams over the four-year period was 24-20-11.
These are the 7 teams that Real Madrid conceded the most goals to, on average. FC Barcelona leads this group with 2.50 per game. They were also the team Madrid scored the least against. Needless to say, Barcelona dominated the league rivalry in this time period. The biggest surprises are the appearances of Girona and Osasuna, as these are teams that Madrid also scored many goals against. Real Madrid played their best players against Girona in these games but conceded goals because of sub-par man-marking on balls into the box and late-game heroics from Cristian Portu. There were only two games played against Osasuna, both of which were victories, but Madrid should do a better job of containing a team like this.
Madrid’s total record against these teams over the four-year period was 20-9-17.
These are the 7 teams that Real Madrid conceded the fewest goals to on average. Cross-town rivals Atlético Madrid appear in this list, but we have to take into account that Atleti games are generally very low-scoring. Atleti averaged 1.62 goals scored per game over the 4-year period, so Real held them to under half of their average scoring output; even so, Real drew or lost 6 of the 8 analyzed matches against them.
Madrid’s total record against these teams over the four-year period was 35-8-3.
| Match Result | Mean Goals Scored | Mean Goals Conceded |
|---|---|---|
| W | 3.22 | 0.77 |
| D | 1.29 | 1.29 |
| L | 0.63 | 2.32 |
We see here that, as expected, Real Madrid’s goal scoring is strongly associated with whether or not they win the game. The deeper insight is that when they win, they win in convincing fashion, averaging more than three goals. Similarly, when they lose, the opponent scores more than two goals on average. As an opponent, you have to score: a 0-0 draw is unlikely, and even if you score 1 goal, the best you can expect to get out of the match is a draw.
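A table of this shape can be computed with a grouped mean. The sketch below uses made-up rows and assumed column names (`Result`, `GF` for goals scored, `GA` for goals conceded); it only illustrates the aggregation, not the reported numbers.

```r
# Toy match results; values and column names are illustrative assumptions.
matches <- data.frame(
  Result = c("W", "W", "D", "L", "W", "L"),
  GF     = c(3, 4, 1, 0, 2, 1),
  GA     = c(1, 0, 1, 2, 1, 3)
)

# Order the outcomes as W, D, L so the output matches the table layout
matches$Result <- factor(matches$Result, levels = c("W", "D", "L"))

# Mean goals scored and conceded for each match result
by_result <- aggregate(cbind(GF, GA) ~ Result, data = matches, FUN = mean)
```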
| Player | Team Goals When Active | Team Goals When Inactive | Games Played When Active | Games Played When Inactive |
|---|---|---|---|---|
| Bale | 2.28 | 2.34 | 134 | 83 |
| Benzema | 2.31 | 2.24 | 178 | 39 |
| Casemiro | 2.29 | 2.42 | 159 | 58 |
| Isco | 2.30 | 2.28 | 160 | 57 |
| Kroos | 2.20 | 2.68 | 172 | 45 |
| Marcelo | 2.30 | 2.33 | 159 | 58 |
| Modric | 2.11 | 2.95 | 167 | 50 |
Countless factors contribute to how many goals a team scores in a given game, one of the biggest being the opponent, as we saw above. So a comparison of the team’s performance with and without a single player must be taken with a few grains of salt. Madrid performed slightly differently depending on whether each player was active, but the differential for Luka Modric certainly jumps out. Over the four-year period, the team scored 0.84 more goals per game, on average, in the 50 games in which Modric was inactive.
For these measures I took a 10% trimmed mean to factor out some of the landslide victories that occurred when the team could afford to sit Modric against a weaker opponent. I wanted to look at a few of these games to learn more about why the team was scoring more without him. I pulled 3 games in which the goals scored were near 2.95, the opponent was strong, and Real Madrid was motivated to win.
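The active versus inactive comparison with a trimmed mean can be sketched as below. The data are invented, the column names `GF` and `modric_active` are assumptions, and the sketch assumes the trimming was done with R's `mean(x, trim = 0.1)`, which drops 10% of the observations from each end before averaging.

```r
# Toy team goals per game, split by whether the player appeared (assumed columns).
games <- data.frame(
  GF            = c(2, 1, 3, 2, 0, 2, 1, 3, 2, 2,
                    3, 2, 4, 3, 2, 3, 10, 2, 3, 3),
  modric_active = rep(c(TRUE, FALSE), each = 10)
)

# Plain mean: the single 10-goal blowout inflates the "inactive" average
plain <- tapply(games$GF, games$modric_active, mean)

# Trimmed mean: drops the lowest and highest 10% of games in each group first
trimmed <- tapply(games$GF, games$modric_active, mean, trim = 0.1)
```

In this toy data the blowout moves the inactive average from 2.875 (trimmed) to 3.5 (plain), which is the kind of distortion the trimming is meant to dampen.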
The most common theme in these games is that Real Madrid plays with either 4 in the midfield or 4 forward, with two players attacking wide with pace. Whether they cut in to take a shot or pass the ball into the box, they create goal-scoring opportunities from the wide position.
The team averaging fewer goals scored with Modric on the field is not a terrible thing, because the team also concedes few goals when he plays (1.13 per game).
| Player | Team Conceded When Active | Team Conceded When Inactive | Games Played When Active | Games Played When Inactive |
|---|---|---|---|---|
| Carvajal | 1.01 | 1.06 | 139 | 78 |
| Casemiro | 1.01 | 1.04 | 159 | 58 |
| Kroos | 1.06 | 0.89 | 172 | 45 |
| Marcelo | 1.06 | 0.90 | 159 | 58 |
| Modric | 1.13 | 1.07 | 167 | 50 |
| Nacho | 1.04 | 1.00 | 120 | 97 |
| Ramos | 1.17 | 0.92 | 158 | 59 |
| Varane | 0.98 | 1.09 | 150 | 67 |
None of these differences was large enough to justify analyzing whether a player’s absence shifted the playing style in a way that led to more or fewer goals conceded.
Two of the attacking players analyzed struggled for consistent, heavy minutes in the 2019-2020 season under Zidane. Bale and Isco started regularly when they arrived at the club but have fallen down the team rotation in recent years. Real Madrid has used them as substitutes to bolster the attack in the second half. Here’s a quick look at their effectiveness at creating goals while coming off the bench between 2015 and 2019:
| Player | Games | Minutes per Game | Shots on Target per Game | Goals per Game | Crosses per Game | Assists per Game |
|---|---|---|---|---|---|---|
| Bale | 28 | 25.68 | 0.64 | 0.29 | 1.18 | 0.04 |
| Isco | 54 | 21.30 | 0.19 | 0.06 | 0.98 | 0.04 |
Albeit in fewer games and with more minutes per appearance, Gareth Bale was more effective at creating offense off the bench on average. His pace on the wing allows him to put pressure on fatigued full-backs and either put dangerous balls into the box for the other forwards or take a shot himself.
Here’s an overview of the most accurate players looking for goal. Most data points in the scatter plot were concentrated around 0-2 shots, so I applied some jitter. This is why some points appear to be below 0.
Benzema is definitely a high-volume shooter among the players analyzed, and he is deadly in front of goal. He has an expected 1 shot on target for every 2.1 shots taken.
Bale has the lowest accuracy, with an expected 1 shot on target for every 2.66 shots taken. That does not necessarily make him a bad finisher, especially when he takes 5 or more shots a game. Bale sometimes assumes free-kick duty and also takes shots from outside the box. Although the accuracy on these shots is lower, the degree of difficulty is high, and the team benefits from long shots being taken. Shots from distance force the defense to pressure the ball, which opens the box for players like Benzema and Ronaldo to roam more freely.
Isco has a fairly good accuracy for the volume of shots he takes. He has an expected 1 shot on target for every 2.23 shots taken. Most of the points on the plot above are gathered around 0-2 total shots per game. He usually takes very high percentage shots in the box.
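One way to arrive at a figure like "1 shot on target for every N shots" is to fit a straight line of shots on target against total shots and invert the slope. That reading is an assumption on my part (the `geom_smooth()` fits suggest a `y ~ x` line, but the exact calculation is not shown), and the data and column names `Sh` and `SoT` below are invented.

```r
# Toy per-game shooting data (assumed columns: Sh = shots, SoT = shots on target).
shots <- data.frame(
  Sh  = c(0, 1, 2, 2, 3, 4, 4, 5, 6),
  SoT = c(0, 0, 1, 1, 1, 2, 2, 2, 3)
)

# Straight-line fit of shots on target against total shots
fit <- lm(SoT ~ Sh, data = shots)

# The slope is the expected extra shots on target per additional shot
slope <- unname(coef(fit)["Sh"])

# Its reciprocal is the number of shots needed per expected shot on target
shots_per_on_target <- 1 / slope
```

With these toy values the slope is 0.5, so the player would need 2 shots per expected shot on target.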
I ran a Gradient Boosting Machine model to predict the goal scoring output of Karim Benzema in a game. The predictors for this model are the Opponent, Venue (Home vs. Away), Minutes played, Match Result (Win/Draw/Loss), Shots on Target, Goals Scored by Team, and Goals Conceded by Team. The response is the number of goals scored in the game, which I treated as a categorical variable rather than a numerical one. It has 4 levels: (0,1,2,3). In practical terms, there is a vast difference in the impact on the game between those goal levels. Scoring 1 versus 0 could often be the difference between a win and a loss, so I want the model to see these as distinct classes. The training data is all La Liga games played by Benzema between 2015 and 2019, totalling 124 games.
gbm1 = gbm(Gls~ Shots.on.Target + Opponent + Min + Venue + ts + tc + wl,
data=benz_liga, var.monotone=rep(0,7), distribution = "multinomial",
n.trees=5000, shrinkage=0.001, interaction.depth=3, bag.fraction = .5,
train.fraction = 1, n.minobsinnode = 10, cv.folds = 2, keep.data=TRUE, verbose=FALSE)
## Gls = # of goals in the game, factor with 4 levels: (0,1,2,3)
## Min is minutes played in the game
## Venue is a factor with 2 levels: (Home,Away)
## ts is how many goals the team scored in the game
## tc is how many goals the team conceded in the game
## wl is the match result, factor with 3 levels: (D,L,W)
best.iter = gbm.perf(gbm1,method = "cv")
summary(gbm1,n.trees = best.iter)
## var rel.inf
## Opponent Opponent 59.5831126
## Shots.on.Target Shots.on.Target 30.9780278
## ts ts 3.3923876
## Min Min 2.6708452
## Venue Venue 1.9906826
## wl wl 1.1694840
## tc tc 0.2154601
##classification results
cv_dv = gbm1$cv.error[best.iter] ##CV deviance
predict_prob = exp(-gbm1$cv.error[best.iter]/2) ##predictive probability for correct class
Cross-Validation Deviance: 1.0943519
Predictive Probability for Correct Class: 0.5785815
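The conversion from deviance to a "predictive probability" rests on a simple identity, sketched here with invented probabilities. It assumes, as the calculation above does, that the deviance is -2 times the mean log-probability assigned to the true class; whether gbm's multinomial `cv.error` uses exactly that scaling is worth checking against the package documentation.

```r
# Probabilities a hypothetical model assigned to the class that actually occurred
p_correct <- c(0.9, 0.6, 0.4, 0.7)

# Mean deviance under the assumed definition: -2 * average log-probability
deviance <- -2 * mean(log(p_correct))

# exp(-deviance / 2) recovers the geometric mean of those probabilities
geo_mean <- exp(-deviance / 2)
direct   <- prod(p_correct)^(1 / length(p_correct))

# The same transform applied to the cross-validation deviance reported above
reported_prob <- exp(-1.0943519 / 2)
```

Under that assumption, the reported value of about 0.58 is the geometric-mean probability the model gives to the correct goal class, not the share of games it classifies correctly.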
Training Confusion Matrix for each Predictive Class:
| Actual / Predicted | 0 | 1 | 2 | 3 | class.error |
|---|---|---|---|---|---|
| 0 | 72 | 5 | 0 | 0 | 0.0649351 |
| 1 | 8 | 27 | 0 | 0 | 0.2285714 |
| 2 | 3 | 4 | 3 | 0 | 0.7000000 |
| 3 | 0 | 2 | 0 | 0 | 1.0000000 |
Training Confusion Matrix for 0 Goals vs. 1 or More:
| Actual / Predicted | 0 | 1+ | class.error |
|---|---|---|---|
| 0 | 72 | 5 | 0.0649351 |
| 1+ | 11 | 36 | 0.2340426 |
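The `class.error` column and the collapsed 0 vs. 1+ table follow directly from the four-class counts. The sketch below re-enters the training counts from the table above, reading rows as actual goals and columns as predicted goals, and recomputes both; the variable names are my own.

```r
# Training confusion counts from the table above (rows = actual, columns = predicted)
cm <- matrix(
  c(72,  5, 0, 0,
     8, 27, 0, 0,
     3,  4, 3, 0,
     0,  2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = as.character(0:3), predicted = as.character(0:3))
)

# Per-class error: share of each actual class that was predicted as something else
class_error <- 1 - diag(cm) / rowSums(cm)

# Collapse classes 1, 2, 3 into "1+" along both rows and columns
groups <- c("0", "1+", "1+", "1+")
cm2 <- t(rowsum(t(rowsum(cm, groups)), groups))
class_error2 <- 1 - diag(cm2) / rowSums(cm2)
```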
gbm() has a built-in importance measure that sums the reduction in squared error produced by each split on a given variable. Leading the way, by far, is the Opponent. As we saw in the analysis of goals scored against each opponent above, there is great discrepancy in goals scored from team to team. Given this, as I looked at the effects of the predictors, I also considered their interactions. The most prominent is the interaction between the Opponent and the number of goals scored by the team. To look at this effect, I ran the same model but with Goals as a continuous variable instead of a factor. This way I could look at the continuous effects of these variables on an ‘Expected Goals per Game’ response:
gbm2 = gbm(Gls~ Shots.on.Target + Opponent + Min + Venue + ts + tc + wl,
data=benz_liga, var.monotone=rep(0,7), distribution = "gaussian",
n.trees=5000, shrinkage=0.001, interaction.depth=3, bag.fraction = .5,
train.fraction = 1, n.minobsinnode = 10, cv.folds = 2, keep.data=TRUE, verbose=FALSE)
best.iter2 = gbm.perf(gbm2,method = "cv")
plot(gbm2,i.var = c(5,2),n.trees = best.iter2)
The interaction is evident in the way the blue line, which represents the expected goals scored per game, sits higher or lower on the y-axis depending on the team. Furthermore, the effect of goals scored by the team shows in the left-to-right increase of the same blue line.
The unexpected insight came from looking at the interaction between the Opponent and the Match Result:
The y-axis here again represents the expected goals per game. I expected the blue dots to differ greatly in height depending on both the opponent and the match result. Instead, they differ from opponent to opponent, but within each opponent they lie in an almost horizontal line! There are two main takeaways. First, when predicting whether Benzema will score in a game, the opponent matters…a lot. Second, Benzema’s scoring is largely independent of the match result. When the team loses, Benzema’s scoring output is not completely to blame, because his output is routinely similar in wins. By the same token, when the team wins, Benzema does not always deserve all of the credit, because he is expected to produce the same in a losing effort. Other factors contribute more heavily to the match results.
The training data was Benzema La Liga data from the 2015-2016 until the 2018-2019 season. To test the model, I used Benzema’s 2019-2020 La Liga season.
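Scoring a held-out season comes down to turning class probabilities into predicted classes and cross-tabulating them against what actually happened. The sketch below uses an invented probability matrix in place of model output; with gbm, the probabilities would come from something like `predict(gbm1, newdata, n.trees = best.iter, type = "response")`, and the exact shape of that return value is an assumption to verify.

```r
# Hypothetical class probabilities: one row per held-out game, one column per goal class
probs <- matrix(
  c(0.70, 0.20, 0.08, 0.02,
    0.30, 0.55, 0.10, 0.05,
    0.60, 0.30, 0.07, 0.03,
    0.20, 0.30, 0.45, 0.05,
    0.50, 0.40, 0.07, 0.03),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("0", "1", "2", "3"))
)

# Goals actually scored in those games (invented for the example)
actual <- factor(c("0", "1", "1", "2", "0"), levels = colnames(probs))

# Predicted class = the column with the highest probability in each row
best_col  <- max.col(probs, ties.method = "first")
predicted <- factor(colnames(probs)[best_col], levels = colnames(probs))

# Confusion matrix with actual classes in rows, predicted classes in columns
confusion <- table(actual = actual, predicted = predicted)
accuracy  <- sum(diag(confusion)) / sum(confusion)
```

Keeping all four levels on both factors ensures the table stays square even when a class, such as 3 goals, never appears in the test season.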
Confusion Matrix:
| Actual / Predicted | 0 | 1 | 2 | 3 | class.error |
|---|---|---|---|---|---|
| 0 | 17 | 3 | 0 | 0 | 0.1500000 |
| 1 | 7 | 6 | 0 | 0 | 0.5384615 |
| 2 | 2 | 1 | 1 | 0 | 0.7500000 |
| 3 | 0 | 0 | 0 | 0 | 0.0000000 |
The model did an excellent job of predicting when Benzema would not score, and when it misclassified those games, it never predicted 2 or 3 goals. Predicting when he would score exactly 1 was more of a coin flip. Predicting when he would score exactly 2 was inaccurate. Below is a confusion matrix for Benzema scoring 0 vs. scoring 1 or more:
| Actual / Predicted | 0 | 1+ | class.error |
|---|---|---|---|
| 0 | 17 | 3 | 0.1500000 |
| 1+ | 9 | 8 | 0.5294118 |
The model tends to predict that Benzema will score 0 more often than not. I don’t think this is a “broken clock is right twice a day” situation, because on the occasions he scored 2 goals, the model predicted 2 goals 25% of the time and 1 goal 25% of the time. What draws the model to misclassify 1-goal games as 0-goal games is the 2016-2017 season, when Benzema scored 11 goals, and the 2017-2018 season, when he scored 5. These were two powerhouse seasons for Cristiano Ronaldo, who won the Ballon d’Or in 2017. The squad had less need for Benzema’s scoring when an all-time great was in the same attacking side.
Finally, I conducted some lineup analysis by modeling Match Results using whether or not players were active as variables. For this model, I used a Random Forest classifier. Because not every player on the roster during the sample period is included in the analysis, many outside variables that influence the response are left out. For this analysis I am therefore focusing on the individual effects of the variables and on their interactions.
Model:
rf1 = randomForest(`W/L`~ Bale + Benzema + Carvajal + Casemiro + Isco + Kroos + Marcelo + Modric + Ramos + Varane,
data = x1,mtry = 6, ntree = 500, nodesize = 5, importance = TRUE)
## each row in x1 is a different game, and the predictors are 1/0 representing if they played in the game or not
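A game-by-player indicator table of the kind described in the comment above can be built from a long list of appearances. This is a sketch on invented data: the `appearances` and `results` data frames and their columns are assumptions, not the author's actual preprocessing.

```r
# Toy long-format appearance log: one row per player per game (assumed layout)
appearances <- data.frame(
  Game   = c(1, 1, 1, 2, 2, 3, 3, 3),
  Player = c("Ramos", "Modric", "Bale", "Ramos", "Modric", "Bale", "Ramos", "Varane")
)

# Toy match results, one row per game
results <- data.frame(Game = 1:3, `W/L` = factor(c("W", "D", "W")), check.names = FALSE)

# Cross-tabulate games against players, then cap counts at 1 to get 1/0 indicators
counts <- unclass(table(appearances$Game, appearances$Player))
lineup <- as.data.frame(pmin(counts, 1))

# Attach the response so each row is one game: result plus who played
x1 <- cbind(results["W/L"], lineup[as.character(results$Game), ])
```

The resulting `x1` has one 1/0 column per player, which is the shape the `randomForest()` formula above expects.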
Of the analyzed players, the model deemed Ramos the most important to the prediction of the Match Result. MeanDecreaseAccuracy shows how much the accuracy of the model decreases when that variable’s values are randomly permuted, which removes the information it carries. So, scrambling Ramos’s appearances would decrease the accuracy of the forest the most. MeanDecreaseGini shows how much splitting on that variable reduces node impurity, totaled over the nodes that split on it. So splitting on Ramos at the decision nodes produces the purest nodes. A few variables are reordered drastically between these two plots. For example, in the left plot, Kroos is the most important variable, but on the right, he is third least important. To combine the two views, I looked at a multi-way importance plot:
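The idea behind an accuracy-based importance measure can be illustrated without a forest at all. The sketch below is a simplified stand-in, not randomForest's implementation (which works per tree on out-of-bag samples): it shuffles one predictor at a time and records how much a fixed classifier's accuracy drops. All names and data are invented.

```r
set.seed(1)
n <- 200

# Two 1/0 predictors: `signal` determines the outcome, `noise` is unrelated to it
x <- data.frame(signal = rbinom(n, 1, 0.5), noise = rbinom(n, 1, 0.5))
y <- x$signal

# Stand-in for a fitted model's predict(): it only looks at `signal`
predict_fn <- function(d) d$signal

accuracy <- function(d) mean(predict_fn(d) == y)
baseline <- accuracy(x)

# Importance of each variable = accuracy lost when its values are shuffled
perm_importance <- sapply(names(x), function(v) {
  shuffled <- x
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(shuffled)
})
```

Shuffling `noise` costs nothing because the classifier never uses it, while shuffling `signal` breaks the link to the outcome and accuracy falls.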
This plot shows the importance of each individual variable a bit more clearly. The most important variables sit at the top right, scoring highest on both measures. Ramos appears to be the most important, but the p-value for his measures is not significant. The measures for Bale and Marcelo, however, are both significant and high-ranking.
This plot shows the strength of the most frequent interactions in the dataset. This gives insight into the impact each pairing of players has on the match result. The strongest of these interactions are (Ramos,Benzema), (Kroos,Benzema), (Ramos,Isco), and (Marcelo,Isco).
All Football Stats: https://fbref.com/en/
Brandon Greenwell, Bradley Boehmke, Jay Cunningham and GBM Developers (2020). gbm: Generalized Boosted Regression Models. R package version 2.1.8. https://CRAN.R-project.org/package=gbm
A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2020). rmarkdown: Dynamic Documents for R. R package version 2.1. URL https://rmarkdown.rstudio.com.
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2019). dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Matt Dowle and Arun Srinivasan (2019). data.table: Extension of data.frame. R package version 1.12.6. https://CRAN.R-project.org/package=data.table
Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl