#1A
devtools::install_github("JaseZiv/worldfootballR")
## Skipping install of 'worldfootballR' from a github remote, the SHA1 (dc489beb) has not changed since last install.
##   Use `force = TRUE` to force installation
#1B
library(worldfootballR)

#1C
#I chose the FBref dataset, which is football/soccer data collected by a company called Sports Reference. I'm an avid football fan, and it's often difficult to make sense of the game. On the surface it is simple: the team that scores more goals wins the match, and the team that does that most often over a season wins the league. But with 11 players per team on the pitch, there is a lot of nuance. The football world has begun recording all sorts of metrics to better understand and predict each match. The more we measure, and the more effective the measures we use, the better we can understand what makes the best players and teams the best.
#2A
#The data comes from a GitHub package I downloaded online. Thank you, JaseZiv! Alongside FBref, the package also includes data from Understat and Transfermarkt. Together these make up the 'worldfootballR' package.

la_liga_passing_stats <- fb_season_team_stats(country = "ESP", gender = "M", season_end_year = 2024, tier = "1st", stat_type = "passing")

#2B
#There are many different functions in worldfootballR; however, for the sake of my own sanity, I chose to focus on the 'fb_season_team_stats' function. On its own this is still too broad, so arguments must be passed to the function to produce a reasonable sample. The call above shows this filtering: it selects LaLiga, the Spanish first division. Gender, season end year and the type of statistic, in this case passing, are required too.
#Finally, the result is stored in a data frame with 31 variables.

la_liga_passing_stats_clean <- subset(la_liga_passing_stats, Team_or_Opponent == "team")

#2C
#I wanted to clean up the data frame slightly. Although 20 teams play in LaLiga, the data frame contained 40 observations. This meant that each team was counted twice, once as a team and once as an opponent. Framing stats in terms of the opposition is an interesting perspective, but I chose to omit the opponent observations, as seen in the subset() call above.
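#A quick way to confirm the doubling described above is to count the rows for each value of Team_or_Opponent. This is only a sketch: 'count_by_side' is a hypothetical helper name of mine, not part of worldfootballR, and it assumes the raw data frame is still in memory.
count_by_side <- function(df) {
  # Tabulate how many rows are labelled "team" versus "opponent"
  table(df$Team_or_Opponent)
}

# Only run against the real data if it has already been loaded above
if (exists("la_liga_passing_stats")) {
  print(count_by_side(la_liga_passing_stats))
}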
final_standings <- c("Real Madrid" = 1, "Barcelona" = 2, "Girona" = 3, "Atlético Madrid" = 4, "Athletic Club" = 5, "Real Sociedad" = 6, "Betis" = 7, "Villarreal" = 8, "Valencia" = 9, "Alavés" = 10, "Osasuna" = 11, "Getafe" = 12, "Celta Vigo" = 13, "Sevilla" = 14, "Mallorca" = 15, "Las Palmas" = 16, "Rayo Vallecano" = 17, "Cádiz" = 18, "Almería" = 19, "Granada" = 20)

la_liga_passing_stats_clean$final_standing <- final_standings[la_liga_passing_stats_clean$Squad]
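#Because the lookup above matches on exact squad names (accents included), any spelling mismatch would silently produce NA. A minimal check, assuming the objects above exist; 'unmatched_squads' is a hypothetical helper, not a package function.
unmatched_squads <- function(squads, lookup) {
  # Squad names that have no entry in the named lookup vector
  setdiff(unique(squads), names(lookup))
}

if (exists("la_liga_passing_stats_clean") && exists("final_standings")) {
  # An empty result means every squad received a standing
  print(unmatched_squads(la_liga_passing_stats_clean$Squad, final_standings))
}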
#3A
#The variables observed are pretty extensive. I assume these data points are collected from each game and then totaled across the season. The variables in this case are passing-specific, yet incredibly detailed. Every possible observation is collected for these variables, so I would describe it as simple sampling. Short, medium and long passes are broken down into attempted and completed. The data also covers the total distance of passes, whether passes are progressive, and how many passes were made into the final third of the pitch. It is pretty impressive how many variables are recorded. Statisticians have become prolific in the footballing world, and it doesn't surprise me that there are full-time employees meticulously collecting these details. It paints a really complete picture of general areas like passing, shooting, defending etc. The sample is small enough (20 teams) that the necessary data points can be collected for each variable. In the code chunk above, I mapped each squad to its final standing in the 2023/24 season. I think this could be useful in assessing whether certain passing statistics correlate with success in LaLiga.
#4A 
#Does xA (expected assists) correlate with final league position? Here the outcome variable would be league standing and the explanatory variable would be expected assists.
#Does the number of long passes attempted correlate with a higher percentage of long passes completed? Long passes attempted would be the explanatory variable, while the percentage of long passes completed would be the outcome variable.
#Do key passes (KP) and progressive passes (PrgP) correlate with final league standing, so that the more key passes and progressive passes a team plays, the higher its league standing? Here the explanatory variables are KP and PrgP and the outcome variable is league standing.
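#One way the questions above could be examined is a rank correlation, since league standing is an ordinal position. This is a sketch under assumptions, not the analysis itself: 'standing_correlation' is a hypothetical helper, and it assumes the cleaned data frame from above with its numeric passing columns. Because 1 is the best standing, a negative coefficient would mean that more of a metric goes with a higher finish.
standing_correlation <- function(df, metric, outcome = "final_standing") {
  # Spearman's rho; exact = FALSE avoids warnings when values are tied
  cor.test(df[[metric]], df[[outcome]], method = "spearman", exact = FALSE)
}

if (exists("la_liga_passing_stats_clean")) {
  # Question 1: expected assists against final standing
  print(standing_correlation(la_liga_passing_stats_clean, "xA_Expected"))
  # Question 2: long passes attempted against long-pass completion percentage
  print(standing_correlation(la_liga_passing_stats_clean, "Att_Long",
                             outcome = "Cmp_percent_Long"))
}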
#5A
install.packages("stargazer")
## 
## The downloaded binary packages are in
##  /var/folders/1w/4dk6y8ys0038w511yrg6_zfh0000gn/T//RtmpEPk5xt/downloaded_packages
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
LLPS <- la_liga_passing_stats_clean[, !names(la_liga_passing_stats_clean) %in% c("Team_or_Opponent", "Mins_Per_90", "Season_End_Year", "Gender", "Competition_Name", "Country")]

stargazer(LLPS, type = "text", summary = TRUE, digits = 1)
## 
## ==========================================================
## Statistic            N    Mean    St. Dev.   Min     Max  
## ----------------------------------------------------------
## Num_Players          20   30.4      4.0      25      40   
## Cmp_Total            20 14,677.2  3,441.1  10,091  21,794 
## Att_Total            20 18,521.5  3,103.2  14,116  24,761 
## Cmp_percent_Total    20   78.5      5.2     70.5    88.3  
## TotDist_Total        20 257,823.5 46,818.0 191,648 359,810
## PrgDist_Total        20 95,604.0  10,118.9 80,185  114,447
## Cmp_Short            20  6,854.6  2,098.3   4,443  11,641 
## Att_Short            20  7,724.4  2,090.4   5,283  12,391 
## Cmp_percent_Short    20   88.0      2.9     84.0    93.9  
## Cmp_Medium           20  5,983.4  1,316.5   4,011   8,724 
## Att_Medium           20  6,985.3  1,247.9   5,052   9,617 
## Cmp_percent_Medium   20   85.1      3.8     77.5    91.0  
## Cmp_Long             20  1,462.2   157.9    1,145   1,717 
## Att_Long             20  2,809.0   313.6    2,312   3,436 
## Cmp_percent_Long     20   52.4      6.2     45.2    65.9  
## Ast                  20   34.8      14.6     12      66   
## xAG                  20   35.4      10.1    23.3    57.4  
## xA_Expected          20   35.1      9.8     24.2    57.9  
## A_minus_xAG_Expected 20   -0.6      6.4     -12.6   13.5  
## KP                   20   344.0     53.0     267     476  
## Final_Third          20  1,143.8   244.6     834    1,702 
## PPA                  20   290.0     61.9     212     405  
## CrsPA                20   86.4      21.7     53      134  
## PrgP                 20  1,403.7   268.0    1,017   1,935 
## final_standing       20   10.5      5.9       1      20   
## ----------------------------------------------------------
#5B
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/1w/4dk6y8ys0038w511yrg6_zfh0000gn/T//RtmpEPk5xt/downloaded_packages
library(ggplot2)

ggplot(LLPS, aes(x = Ast)) + geom_density(fill = "lightblue2", alpha = 0.5) + labs(title = "Density Distribution of Assists", x = "Assists", y = "Density") +
theme_minimal()

#The density of assists peaks at around 30 and then gradually declines until a little past 60. It makes sense that lower assist totals are more common, since more assists likely means more goals, which is a difficult task.
install.packages("plotly")
## 
## The downloaded binary packages are in
##  /var/folders/1w/4dk6y8ys0038w511yrg6_zfh0000gn/T//RtmpEPk5xt/downloaded_packages
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#5C
p <- ggplot(LLPS, aes(y = final_standing, x = PrgP)) +
  geom_point(color = "darkblue") +
  labs(title = "League Standing vs. Progressive Passes", x = "Progressive Passes", y = "Final League Standing") +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE, color = "lightskyblue3") +
  scale_y_reverse()

ggplotly(p, tooltip = c("x", "y"))
## `geom_smooth()` using formula = 'y ~ x'
#6A
#The trend line shows that the more progressive passes a team played, the higher that team placed in the final league standing.
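#To put a number on the trend line, the same straight-line fit that geom_smooth(method = "lm") draws can be estimated with lm(). A sketch, assuming LLPS from above; 'fit_standing_model' is a hypothetical helper name. A negative PrgP slope would correspond to more progressive passes going with a better (numerically lower) standing.
fit_standing_model <- function(df) {
  # Simple linear regression of final standing on progressive passes
  lm(final_standing ~ PrgP, data = df)
}

if (exists("LLPS")) {
  print(summary(fit_standing_model(LLPS)))
}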

#6B
#As far as I'm aware, far more than progressive passing is factored into existing work. As seen in this data set, models are set up to include multiple indicators, highlighting the effectiveness of not only progressive passes but also key passes, passes into the final third etc. that lead to a higher final league standing. 'https://statsbomb.com/articles/soccer/the-art-of-progression-an-analysis-of-passing-vs-ball-carrying/' This article is very detailed and tracks many different variables in order to compare several teams and the ways in which they play. It highlights how some teams carry or dribble the ball more often than they pass it.

#6C
#I'm not sure previous work is incomplete; rather, it attempts to explain the game or to build a predictive model. You can list tons of different variables and incorporate them into a model that correlates them with success; however, a football match is wild and unpredictable at times. The studies and analyses that have been conducted are likely strong and capture most facets of the game in relation to winning. My interests are similar, but I favor certain facets of the game. Maybe I value a high volume of long passes being played, or a high volume of attempted dribbles. My goal throughout this project is to see whether the teams and players I like to watch show these attributes in the data, and whether the traits I admire in-game translate to success.