#1A
devtools::install_github("JaseZiv/worldfootballR")
## Skipping install of 'worldfootballR' from a github remote, the SHA1 (dc489beb) has not changed since last install.
## Use `force = TRUE` to force installation
#1B
library(worldfootballR)
#1C
#I chose the 'FBref' dataset, which is football/soccer data collected by a company called Sports Reference. I'm an avid football fan, and it's often difficult to make sense of the game. Obviously teams win and lose matches: the team that scores more goals in a match wins the game, and the team that does that most often in a season wins the league. However, with 11 players per team on the pitch, there is nuance to it. The world of football has begun to record all sorts of metrics to try to understand and better predict each match. The more we measure, and the more effective the measures we use, the better we can understand what makes the best players and teams the best.
#2A
#This data comes from a GitHub package I found online. Thank you, JaseZiv! Besides FBref, the package also pulls data from Understat and Transfermarkt. Together these make up the 'worldfootballR' package.
la_liga_passing_stats <- fb_season_team_stats(country = "ESP", gender = "M", season_end_year = 2024, tier = "1st", stat_type = "passing")
#2B
#There are many different functions in worldfootballR; however, for the sake of my own sanity, I chose to focus on the 'fb_season_team_stats' function. On its own this is still too broad, so the function's arguments must be used as filters to produce a reasonable sample. The call above shows this filtering: country = "ESP" and tier = "1st" select LaLiga, the Spanish first division. Gender, season end year and the type of statistic (in this case passing) must be specified too.
#Finally, we set aside a data frame, which contains 31 variables.
la_liga_passing_stats_clean <- subset(la_liga_passing_stats, Team_or_Opponent == "team")
#2C
#I wanted to clean up the data frame slightly. Despite 20 teams playing in LaLiga, the function returned 40 observations. This meant that each team was counted twice, once as a team and once as an opponent. Framing stats in terms of the opposition is an interesting perspective, but I chose to omit the opponent observations, as seen in the subset() call above.
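#A quick way to see the doubling described above is to tabulate the Team_or_Opponent column before subsetting. The sketch below uses a small made-up data frame (toy_stats, with three hypothetical squads) instead of the real download, so its names and numbers are illustrative only.

```r
# Hypothetical stand-in for the downloaded table: 3 squads, each appearing
# once as "team" and once as "opponent" (all values are made up).
toy_stats <- data.frame(
  Squad = rep(c("Squad A", "Squad B", "Squad C"), times = 2),
  Team_or_Opponent = rep(c("team", "opponent"), each = 3),
  Cmp_Total = c(500, 420, 390, 410, 450, 480)
)

# Count rows per perspective; each squad should appear once in each group.
perspective_counts <- table(toy_stats$Team_or_Opponent)
print(perspective_counts)

# Keep only the rows that describe a squad's own passing.
toy_clean <- subset(toy_stats, Team_or_Opponent == "team")
print(nrow(toy_clean))
```

#Running the same table() call on la_liga_passing_stats should show whether the rows split evenly between "team" and "opponent".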
final_standings <- c(
  "Real Madrid" = 1, "Barcelona" = 2, "Girona" = 3, "Atlético Madrid" = 4,
  "Athletic Club" = 5, "Real Sociedad" = 6, "Betis" = 7, "Villarreal" = 8,
  "Valencia" = 9, "Alavés" = 10, "Osasuna" = 11, "Getafe" = 12,
  "Celta Vigo" = 13, "Sevilla" = 14, "Mallorca" = 15, "Las Palmas" = 16,
  "Rayo Vallecano" = 17, "Cádiz" = 18, "Almería" = 19, "Granada" = 20
)
la_liga_passing_stats_clean$final_standing <- final_standings[la_liga_passing_stats_clean$Squad]
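#The lookup above returns NA, without any warning, whenever a name in Squad does not exactly match a name in final_standings (for example, because of an accent or encoding difference). The sketch below shows one way to catch that, using made-up squad names (toy_standings and toy_squads are hypothetical).

```r
# Hypothetical lookup table and squad column (made-up names).
toy_standings <- c("Squad A" = 1, "Squad B" = 2, "Squad C" = 3)
toy_squads <- c("Squad B", "Squad A", "Squad X")  # "Squad X" has no entry

# Indexing a named vector by name returns NA for names it does not contain.
toy_final <- unname(toy_standings[toy_squads])
print(toy_final)

# List any squads that failed to match so the lookup can be corrected.
unmatched <- toy_squads[is.na(toy_final)]
print(unmatched)
```

#Applied to the real data, the equivalent check would be la_liga_passing_stats_clean$Squad[is.na(la_liga_passing_stats_clean$final_standing)], which should come back empty.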
#3A
#The variables observed are pretty extensive. I assume these data points are collected from each game and then aggregated over the season. The variables in this case are passing-specific, yet incredibly detailed. Every team is recorded on every variable, so I would describe this as simple sampling. Short, medium and long passes are broken down into attempted and completed. There is also the total distance of passes, whether passes are progressive, and how many passes were made into the final third of the pitch. It's pretty impressive how many variables are recorded. Statisticians have become prolific in the footballing world, and it doesn't surprise me that companies have full-time employees meticulously collecting these details. It paints a really complete picture of general areas like passing, shooting, defending, etc. Sample sizes are small enough that the necessary data points can be collected for each variable. In the code chunk above, I matched each squad to its final standing in the 2023/24 season; I think this could be useful in assessing whether certain passing statistics correlate with success in LaLiga.
#4A
#Does xA (expected assists) correlate with final league position? Here the outcome variable would be league standing and the explanatory variable would be expected assists.
#Does the number of long passes attempted correlate with a higher percentage of long passes completed? Long passes attempted would be the explanatory variable, while the percentage of long passes completed would be the outcome variable.
#Do key passes (KP) and progressive passes (PrgP) correlate with final league standing, i.e. the more key passes and progressive passes, the higher the league standing? Here the explanatory variables are KP and PrgP and the outcome variable is league standing.
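#One possible way to put a number on questions like these is a correlation coefficient. The sketch below is only an illustration on made-up values (the toy data frame is hypothetical, not the real LaLiga numbers). It uses Spearman's rank correlation, which is my own choice here because final standing is a rank rather than a measured quantity.

```r
# Hypothetical data: six made-up teams with expected assists and a standing.
toy <- data.frame(
  xA_Expected    = c(55, 40, 48, 33, 25, 28),
  final_standing = c(1, 2, 3, 4, 5, 6)
)

# Spearman's rho compares the rank order of the two variables.
# A negative value means higher xA goes with a numerically lower
# (i.e. better) league position.
rho <- cor(toy$xA_Expected, toy$final_standing, method = "spearman")
print(rho)
```

#The same cor() call could be reused for the other questions by swapping in the relevant columns, for example Att_Long with Cmp_percent_Long, or KP and PrgP with final_standing.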
#5A
install.packages("stargazer")
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
LLPS <- la_liga_passing_stats_clean[, !names(la_liga_passing_stats_clean) %in% c("Team_or_Opponent", "Mins_Per_90", "Season_End_Year", "Gender", "Competition_Name", "Country")]
stargazer(LLPS, type = "text", summary = TRUE, digits = 1)
##
## ===============================================================
## Statistic             N   Mean       St. Dev.  Min      Max
## ---------------------------------------------------------------
## Num_Players           20  30.4       4.0       25       40
## Cmp_Total             20  14,677.2   3,441.1   10,091   21,794
## Att_Total             20  18,521.5   3,103.2   14,116   24,761
## Cmp_percent_Total     20  78.5       5.2       70.5     88.3
## TotDist_Total         20  257,823.5  46,818.0  191,648  359,810
## PrgDist_Total         20  95,604.0   10,118.9  80,185   114,447
## Cmp_Short             20  6,854.6    2,098.3   4,443    11,641
## Att_Short             20  7,724.4    2,090.4   5,283    12,391
## Cmp_percent_Short     20  88.0       2.9       84.0     93.9
## Cmp_Medium            20  5,983.4    1,316.5   4,011    8,724
## Att_Medium            20  6,985.3    1,247.9   5,052    9,617
## Cmp_percent_Medium    20  85.1       3.8       77.5     91.0
## Cmp_Long              20  1,462.2    157.9     1,145    1,717
## Att_Long              20  2,809.0    313.6     2,312    3,436
## Cmp_percent_Long      20  52.4       6.2       45.2     65.9
## Ast                   20  34.8       14.6      12       66
## xAG                   20  35.4       10.1      23.3     57.4
## xA_Expected           20  35.1       9.8       24.2     57.9
## A_minus_xAG_Expected  20  -0.6       6.4       -12.6    13.5
## KP                    20  344.0      53.0      267      476
## Final_Third           20  1,143.8    244.6     834      1,702
## PPA                   20  290.0      61.9      212      405
## CrsPA                 20  86.4       21.7      53       134
## PrgP                  20  1,403.7    268.0     1,017    1,935
## final_standing        20  10.5       5.9       1        20
## ---------------------------------------------------------------
#5B
install.packages("ggplot2")
library(ggplot2)
ggplot(LLPS, aes(x = Ast)) + geom_density(fill = "lightblue2", alpha = 0.5) + labs(title = "Density Distribution of Assists", x = "Assists", y = "Density") +
theme_minimal()

#The density of assists peaks around 30 and then gradually declines until a little past 60. It makes sense that most teams sit at the lower end: more assists means more goals, and scoring is difficult.
install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#5C
p <- ggplot(LLPS, aes(x = PrgP, y = final_standing)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "lightskyblue3") +
  scale_y_reverse() +
  labs(title = "League Standing vs. Progressive Passes", x = "Progressive Passes", y = "Final League Standing") +
  theme_minimal()
ggplotly(p, tooltip = c("x", "y"))
## `geom_smooth()` using formula = 'y ~ x'
#6A
#The trend line shows that the more progressive passes a team played, the higher it tended to place in the final league standing.
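#The trend line in the plot comes from geom_smooth(method = "lm"), which fits final_standing ~ PrgP. To read off the slope rather than eyeball it, the same model can be fit directly with lm(). The sketch below does this on made-up numbers (the toy data frame is hypothetical), so only the mechanics carry over, not the values.

```r
# Hypothetical data: six made-up teams (not the real LaLiga values).
toy <- data.frame(
  PrgP           = c(1900, 1700, 1500, 1300, 1200, 1000),
  final_standing = c(1, 3, 2, 5, 4, 6)
)

# Fit the same straight-line model that geom_smooth(method = "lm") draws.
fit <- lm(final_standing ~ PrgP, data = toy)

# A negative slope means more progressive passes go with a numerically
# lower (i.e. better) final standing.
slope <- unname(coef(fit)["PrgP"])
print(slope)

# R-squared gives a rough sense of how much of the variation the line explains.
r_squared <- summary(fit)$r.squared
print(r_squared)
```

#On the real data, the equivalent call would be lm(final_standing ~ PrgP, data = LLPS).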
#6B
#As far as I'm aware, far more than progressive passes is factored into models of passing. As this data set shows, such models include multiple indicators, covering the effectiveness of not only progressive passes but also key passes, passes into the final third, etc., that lead to a higher final league standing. This article, 'https://statsbomb.com/articles/soccer/the-art-of-progression-an-analysis-of-passing-vs-ball-carrying/', is very detailed and tracks many different variables in order to compare several teams and the ways in which they play. It highlights how some teams carry or dribble the ball more often than they pass it.
#6C
#I'm not sure previous work is incomplete; rather, it attempts to explain the game or build a predictive model. You can list tons of different variables and incorporate them into a model that correlates them with success; however, a football match is wild and unpredictable at times. The studies and analyses that have been conducted are likely strong and capture many facets of the game in relation to winning. My interests are similar, but I favor certain facets of the game. Maybe I value a high volume of long passes, or a high volume of attempted dribbles. My goal throughout this project is to see whether the teams and players I like to watch show these attributes in the data, and whether the traits I admire in-game translate to success.