FIFA is a soccer video game series produced by EA that has gained popularity around the world, with over 31 million players on FIFA 21. A new FIFA game comes out every year, in which EA rates every player in the game and gives them stats that are supposed to depict real life. Each player is rated on various attributes, which are then used to give them an overall score. These attributes include Pace, Dribbling, and Shooting. Our group was curious: if we take these values as accurate, what can we learn about the real world?
We believe that answering this question could lead to significant insights because these attributes are as close a representation as we can get to measuring these abilities in the real world. FIFA, as mentioned, is produced by EA, a multi-billion dollar corporation that has not only the resources but also the technical know-how and experience to create accurate values for these measurements.
We found a database online which contains all these attributes for the most recent FIFA game, FIFA 22, along with real-world values such as player wage, value, club, and jersey number. We hope to address the proposed problem statement by understanding the relationship between various in-game attributes and real-world values. For example, how does a player’s overall relate to their national jersey number? We have always been drawn to the famous number 10, but is that actually the jersey the best players wear? Which attribute has led players to the highest payday?
Our proposed analytic technique is to run a linear regression between these in-game and real-world values, using p-values to assess statistical significance. Where a relationship is significant, we can compare the regression coefficients and correlation coefficients to understand its strength and how we could predict or describe real-world values using this game.
The main audiences for our analysis are real soccer players and soccer fanatics. Real soccer players will benefit the most. If I were a soccer player, I would want to learn which of my attributes I could improve to become not only a better player but also a better-paid one. We hope that our analysis can help players direct their workout regimen by showing which attributes are truly considered valuable in soccer and by a club. Soccer fanatics will also benefit, as the analysis will help them understand which characteristics to look for when evaluating potential transfer prospects for their team.
library(tidyverse)
library(dplyr)
At this point in the project, the only packages we have used are tidyverse and dplyr (dplyr is itself part of the tidyverse, so the second library call is redundant but harmless). The two were used together mainly for one task, trimming a string in our data, which is shown later on. The pipe operator made available by dplyr helped in chaining function calls, and the str_sub function from stringr, which is loaded with the tidyverse, did the actual trimming. Both packages were also used elsewhere, for example when renaming variables to something more user-friendly.
We gathered our data from Kaggle, at this link: https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset
The data contains two tables: one with players and one with information on their teams. We will join them in R to obtain a more robust data set. The original players table had 90 variables, including all the attributes for all the players and general real-world information about them such as wage and value. The teams table originally had 14 columns. There isn’t explicit information on when this data was collected; however, we know it was fairly recent, since FIFA 22 came out on October 1st, 2021. It is also worth noting that these statistics are suggested mostly by a group of about 6,000 volunteers led by the Head of Data Collection & Licensing.
All real-world professional soccer players included in the game are, as discussed above, described by a set of statistics and characteristics that determine how good they are in-game. The main indicator of a player’s quality is their overall score, which is derived from all their statistics. There are 34 of these statistics, with values in the 0-99 interval; 5 describe goalkeeping abilities and 29 describe the abilities of an outfield player. All players are described by all 34 stats, but goalkeeping abilities have no impact on an outfield player’s overall and vice versa.
Since the players are presented as cards, it would be difficult to show all the statistics on the face of the card. Thus, there are so-called “Face stats”: the card shows only six statistics to keep it readable. These statistics are linear combinations of the ones already mentioned. For example, instead of displaying both Acceleration and Sprint Speed on the card, the game uses a pace statistic that is a weighted sum of the two. These variables can be found in our data with the suffix Total (for example, PaceTotal).
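To make the idea of a face stat as a weighted sum concrete, here is a minimal sketch. The Acceleration and SprintSpeed values are copied from the first three rows of the players table; the weights 0.45 and 0.55 are our own assumption for illustration and are not taken from the data set or confirmed by EA.

```r
# Raw stats for three players, copied from the players table.
face <- data.frame(
  PlayerID     = c(158023, 188545, 231747),
  Acceleration = c(91, 77, 97),
  SprintSpeed  = c(80, 79, 97)
)

# ASSUMPTION: illustrative weights only, not published in the data set.
w_acceleration <- 0.45
w_sprint_speed <- 0.55

# A face stat sketched as a rounded weighted sum of two raw stats.
face$PaceSketch <- round(w_acceleration * face$Acceleration +
                         w_sprint_speed * face$SprintSpeed)
print(face)
```

With these assumed weights, the rounded result for the three rows agrees with the PaceTotal values reported in the data (85, 78 and 97), but three rows are far too few to confirm the weighting.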
There is also a set of four categorical variables in our data describing players:
Skills - Rated in stars, min 1 star, max 5 stars; determines player’s ability to pull off skill moves, like the famous roulette
Weak foot - Rated in stars, min 1 star, max 5 stars; determines player’s ability to pass, shoot, etc. correctly with his non-preferred foot
Attacking work rate - Can be either High, Medium or Low; determines how much the player runs in attacking zones of the pitch
Defensive work rate - Can be either High, Medium or Low; determines how much the player runs in defensive zones of the pitch
We believe, after careful evaluation, most variable names are self-explanatory except for FKAccuracy, which stands for Free Kick Accuracy - how accurately one can shoot from a set piece. Also, position refers to players’ preferred positions on the pitch and the game recognizes 28 different ones.
We can get a better understanding of our data by taking a look at it in R, so let’s first load in the data:
players <- read.csv("players_fifa22.csv")
teams <- read.csv("teams_fifa22.csv")
Now let’s take a look at the first few rows of the players table:
head(players, 3)
## ID Name FullName Age Height Weight
## 1 158023 L. Messi Lionel Messi 34 170 72
## 2 188545 R. Lewandowski Robert Lewandowski 32 185 81
## 3 231747 K. Mbappé Kylian Mbappé 22 182 73
## PhotoUrl Nationality Overall
## 1 https://cdn.sofifa.com/players/158/023/22_60.png Argentina 93
## 2 https://cdn.sofifa.com/players/188/545/22_60.png Poland 92
## 3 https://cdn.sofifa.com/players/231/747/22_60.png France 91
## Potential Growth TotalStats BaseStats Positions BestPosition
## 1 93 0 2219 462 RW,ST,CF RW
## 2 92 0 2212 460 ST ST
## 3 95 4 2175 470 ST,LW ST
## Club ValueEUR WageEUR ReleaseClause ClubPosition
## 1 Paris Saint-Germain 78000000 320000 144300000 RW
## 2 FC Bayern München 119500000 270000 197200000 ST
## 3 Paris Saint-Germain 194000000 230000 373500000 ST
## ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 1 2023 30 2021 False Argentina RW
## 2 2023 9 2014 False Poland ST
## 3 2022 7 2018 False France LW
## NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1 10 Left 5 4 4
## 2 9 Right 5 4 4
## 3 10 Right 4 4 5
## AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1 Medium Low 85 92 91
## 2 High Medium 78 92 79
## 3 High Low 97 88 80
## DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1 95 34 65 85 95
## 2 85 44 82 71 95
## 3 92 36 77 78 93
## HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1 70 91 88 96 93 94 91
## 2 90 85 89 85 79 85 70
## 3 72 85 83 93 80 69 71
## BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1 96 91 80 91 94 95 86
## 2 88 77 79 77 93 82 90
## 3 91 97 97 92 93 83 86
## Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1 68 72 69 94 44 40 93
## 2 85 76 86 87 81 49 95
## 3 78 88 77 82 62 38 92
## Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1 95 75 96 20 35 24 6
## 2 81 90 88 35 42 19 15
## 3 82 79 88 26 34 32 13
## GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1 11 15 14 8 92 92 93
## 2 6 12 8 10 92 85 88
## 3 5 7 11 6 91 90 90
## CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1 93 93 92 93 93 90 93 69
## 2 88 88 85 89 87 83 87 67
## 3 90 90 90 92 92 84 92 70
## CDMRating RWBRating LBRating CBRating RBRating GKRating
## 1 67 69 64 53 64 22
## 2 69 67 64 63 64 22
## 3 66 70 66 57 66 21
As we can see from the above table, some of the names in our data have loaded weirdly. We can fix this by reloading the data and specifying the encoding explicitly. We can check that this worked by looking at the third row of the players table again:
players <- read.csv("players_fifa22.csv", encoding="UTF-8", stringsAsFactors=FALSE)
teams <- read.csv("teams_fifa22.csv", encoding="UTF-8", stringsAsFactors=FALSE)
players[3,]
## ID Name FullName Age Height Weight
## 3 231747 K. Mbappé Kylian Mbappé 22 182 73
## PhotoUrl Nationality Overall
## 3 https://cdn.sofifa.com/players/231/747/22_60.png France 91
## Potential Growth TotalStats BaseStats Positions BestPosition
## 3 95 4 2175 470 ST,LW ST
## Club ValueEUR WageEUR ReleaseClause ClubPosition
## 3 Paris Saint-Germain 194000000 230000 373500000 ST
## ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 3 2022 7 2018 False France LW
## NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 3 10 Right 4 4 5
## AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 3 High Low 97 88 80
## DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 3 92 36 77 78 93
## HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 3 72 85 83 93 80 69 71
## BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 3 91 97 97 92 93 83 86
## Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 3 78 88 77 82 62 38 92
## Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 3 82 79 88 26 34 32 13
## GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 3 5 7 11 6 91 90 90
## CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 3 90 90 90 92 92 84 92 70
## CDMRating RWBRating LBRating CBRating RBRating GKRating
## 3 66 70 66 57 66 21
As we can see from the output above, specifying the encoding lets the names load correctly. Now that our data looks good, we can join our tables to get one robust data set. We will join on the club name, since we are trying to retrieve club-related information. We will use a left join so that every player is kept and each player’s club is matched to the team-level information where it exists.
fifa <- players %>%
left_join(teams, by=c("Club" = "Name"))
Let’s explore this data structure a little further:
dim(fifa)
## [1] 19276 103
We now know we have 19276 rows and 103 columns. This means we have 19276 players in our database, described in 103 different ways. As previously discussed, the players table had 90 columns and the teams table had 14. Since the column we joined on does not show up twice, 90 + 14 - 1 = 103 columns is exactly what we expect.
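The column arithmetic and the `.x`/`.y` suffixes can be reproduced on a small scale. The sketch below uses two made-up tables (the `players_toy` and `teams_toy` names and all their values are hypothetical) that share `ID` and `Overall` column names, just like our real tables.

```r
library(dplyr)

# Hypothetical miniature versions of the players and teams tables.
players_toy <- data.frame(
  ID      = c(1, 2, 3),
  Name    = c("Player A", "Player B", "Player C"),
  Overall = c(90, 80, 70),
  Club    = c("Club X", "Club Y", "Free agent"),
  stringsAsFactors = FALSE
)
teams_toy <- data.frame(
  ID      = c(10, 20),
  Name    = c("Club X", "Club Y"),
  Overall = c(85, 75),
  League  = c("League One (1)", "League Two (1)"),
  stringsAsFactors = FALSE
)

# Left join: every player is kept; team columns are NA when no club matches.
joined_toy <- players_toy %>%
  left_join(teams_toy, by = c("Club" = "Name"))

dim(joined_toy)    # rows = number of players, cols = 4 + 4 - 1
names(joined_toy)  # clashing names get .x (players) and .y (teams) suffixes
```

The key column from the right-hand table is dropped, which is why the joined table has one column fewer than the two tables combined.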
Now that our data looks good and we have explored the general structure, we can look at the missing values information:
sapply(fifa, function(x) sum(is.na(x)))
## ID.x Name FullName Age
## 0 0 0 0
## Height Weight PhotoUrl Nationality
## 0 0 0 0
## Overall.x Potential Growth TotalStats
## 0 0 0 0
## BaseStats Positions BestPosition Club
## 0 0 0 0
## ValueEUR WageEUR ReleaseClause ClubPosition
## 0 0 0 0
## ContractUntil ClubNumber ClubJoined OnLoad
## 70 70 0 0
## NationalTeam NationalPosition NationalNumber PreferredFoot
## 0 0 18519 0
## IntReputation WeakFoot SkillMoves AttackingWorkRate
## 0 0 0 0
## DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 0 0 0 0
## DribblingTotal DefendingTotal PhysicalityTotal Crossing
## 0 0 0 0
## Finishing HeadingAccuracy ShortPassing Volleys
## 0 0 0 0
## Dribbling Curve FKAccuracy LongPassing
## 0 0 0 0
## BallControl Acceleration SprintSpeed Agility
## 0 0 0 0
## Reactions Balance ShotPower Jumping
## 0 0 0 0
## Stamina Strength LongShots Aggression
## 0 0 0 0
## Interceptions Positioning Vision Penalties
## 0 0 0 0
## Composure Marking StandingTackle SlidingTackle
## 0 0 0 0
## GKDiving GKHandling GKKicking GKPositioning
## 0 0 0 0
## GKReflexes STRating LWRating LFRating
## 0 0 0 0
## CFRating RFRating RWRating CAMRating
## 0 0 0 0
## LMRating CMRating RMRating LWBRating
## 0 0 0 0
## CDMRating RWBRating LBRating CBRating
## 0 0 0 0
## RBRating GKRating ID.y League
## 0 0 124 124
## LeagueId Overall.y Attack Midfield
## 124 124 124 124
## Defence TransferBudget DomesticPrestige IntPrestige
## 124 124 124 124
## Players StartingAverageAge AllTeamAverageAge
## 124 124 124
We can see from the output above that our data is already in good shape, with few missing values. The player columns with missing values are ContractUntil, ClubNumber and NationalNumber, and we believe no action needs to be taken on them right now. There are valid reasons why these data points could be blank: some players’ contract information or club number may not have been released yet. Additionally, most professional soccer players never get the opportunity to play for their national team, which explains why we have so many NA values in the NationalNumber column.
We can also see that some of the data we had joined in our table came out missing. Let’s try to evaluate this:
NoTeamInfo <- fifa[is.na(fifa$Overall.y),]
unique(NoTeamInfo$Club)
## [1] "Free agent" "Burgos CF" "SC East Bengal FC"
We can note an interesting trend here: all of these players fall into one of two categories. They either play for a rather small-market team, “Burgos CF” or “SC East Bengal FC”, or they are a “Free agent”. Although the two actual teams should have team information, we do not see it as helpful to impute values: based on our soccer knowledge these teams are rather below average, so imputed values would likely be faulty. Our analysis is mostly player-centric rather than team-centric, and since we still have important information on the players from these teams, we believe it is important to keep this data. Free agents are also an interesting segment to explore further, and they rightfully should not have any team information since they don’t belong to a team, so we have taken no action on these entries either.
In terms of formatting, we got a decent idea of what the data looks like from the head() function we ran above. One thing that looks odd is the League column: every entry has a (1) after it. We can remove this by using the str_sub function from the stringr package, which is loaded with the tidyverse. We will create a column called TeamLeague with the trimmed league names.
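An alternative way to find clubs with no team information, before or after joining, is an anti join. The sketch below is only an illustration on hypothetical toy tables (`players_toy`, `teams_toy` and their values are made up); it is not the check we ran above.

```r
library(dplyr)

# Hypothetical toy tables; "Club Z" and "Free agent" have no team row.
players_toy <- data.frame(
  Name = c("Player A", "Player B", "Player C", "Player D"),
  Club = c("Club X", "Club Y", "Free agent", "Club Z"),
  stringsAsFactors = FALSE
)
teams_toy <- data.frame(
  Name    = c("Club X", "Club Y"),
  Overall = c(85, 75),
  stringsAsFactors = FALSE
)

# anti_join keeps only the players whose Club has no match in teams_toy.
unmatched <- players_toy %>%
  anti_join(teams_toy, by = c("Club" = "Name")) %>%
  count(Club, name = "n_players")

print(unmatched)
```

Counting players per unmatched club also shows how many rows are affected by each gap, which helps when deciding whether imputation is worth the effort.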
fifa <- fifa %>%
mutate(TeamLeague= str_sub(League, 1, -5))
Now that our league name has been corrected, by creating a new column in R with the correct values, we can remove the previous League column. We have done this in the code below:
fifa <- select(fifa, -(League))
Our team preferred the name League to TeamLeague, so we have changed the name of the created column back to that of the previous column.
fifa <- rename(fifa, League = TeamLeague)
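As a small illustration of how the negative index in str_sub works, the sketch below trims two example strings. The first value appears in our data; the second is made up for illustration. The regex-based `str_remove` variant is an alternative we did not use in the report; it would also cope with a suffix of a different length.

```r
library(stringr)

# Example league strings with the trailing " (1)" tier marker.
league_raw <- c("French Ligue 1 (1)", "Example League (1)")

# Approach used in this report: keep characters 1 up to the 5th from the end,
# which drops exactly the last four characters, " (1)".
trimmed_fixed <- str_sub(league_raw, 1, -5)

# Alternative sketch: remove a trailing " (<digits>)" with a regular expression.
trimmed_regex <- str_remove(league_raw, " \\(\\d+\\)$")

print(trimmed_fixed)
print(trimmed_regex)
```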
Another issue in our data is the naming of the team overall column. After the join it is labeled Overall.y, while the player’s own overall is labeled Overall.x. This naming convention isn’t friendly to a new user of our data, so we rename Overall.y to TeamOverall, a variable which indicates the overall rating a whole team is given, and Overall.x back to Overall.
fifa <- rename(fifa, TeamOverall = Overall.y)
fifa <- rename(fifa, Overall = Overall.x)
As we are renaming columns, we can also rename a few columns to help our readers get a better understanding of units of measure for those metrics.
fifa <- rename(fifa, 'Weight (KG)' = Weight)
fifa <- rename(fifa, 'Height (CM)' = Height)
To finish off the renaming, we can go ahead and rename some of the ID columns to be more explanatory on what exactly they are identifying.
fifa <- rename(fifa, 'TeamID' = ID.y)
fifa <- rename(fifa, 'PlayerID' = ID.x)
There are also some unnecessary columns in our data which won’t further our analysis, so we can remove them to reduce any model’s computational expense. Specifically, we will remove the PhotoUrl and OnLoad variables.
fifa <- select(fifa, -PhotoUrl, -OnLoad)
Finally, we noticed some of the positional titles were a little too specific for our analysis. We hope to conduct analysis on general positional groups, so excessive detail regarding positions is unnecessary for us, especially since some of the titles are not commonly used in typical soccer jargon.
#All the unique positions
unique(fifa$BestPosition)
## [1] "RW" "ST" "GK" "CM" "LW" "CDM" "CF" "LM" "CB" "CAM" "RB" "LB"
## [13] "RM" "LWB" "RWB"
#Edit into more appropriate positional titles
fifa$BestPosition_new <- ifelse(fifa$BestPosition == "GK" , "GK",
ifelse(fifa$BestPosition %in% c("LCB", "CB", "RCB"), "CB",
ifelse(fifa$BestPosition %in% c("LB", "LWB", "RWB", "RB"), "LB/RB",
ifelse(fifa$BestPosition %in% c("LDM", "CDM", "RDM"), "DM",
ifelse(fifa$BestPosition %in% c("LCM", "CM", "RCM"), "CM",
ifelse(fifa$BestPosition %in% c("LM", "RM"), "LM/RM",
ifelse(fifa$BestPosition %in% c("LAM", "CAM", "RAM"), "AM",
ifelse(fifa$BestPosition %in% c("LW", "RW"), "LW/RW",
ifelse(fifa$BestPosition %in% c("LF", "CF", "RF"),"CF",
"ST")))))))))
fifa$BestPosition_new <- factor(fifa$BestPosition_new, levels = c("GK", "CB", "LB/RB", "DM", "CM",
"LM/RM", "AM", "LW/RW", "CF", "ST"))
#check if our change worked
levels(fifa$BestPosition_new)
## [1] "GK" "CB" "LB/RB" "DM" "CM" "LM/RM" "AM" "LW/RW" "CF"
## [10] "ST"
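The same grouping can be written more compactly with `case_when` from dplyr. This is an alternative sketch, not the code we ran; it applies the mapping above to the 15 unique BestPosition values and stores the result in a hypothetical `position_group` vector.

```r
library(dplyr)

# The unique BestPosition values found in the data.
positions <- c("RW", "ST", "GK", "CM", "LW", "CDM", "CF", "LM",
               "CB", "CAM", "RB", "LB", "RM", "LWB", "RWB")

# Conditions are checked top to bottom; the first match wins.
position_group <- case_when(
  positions == "GK"                          ~ "GK",
  positions %in% c("LCB", "CB", "RCB")       ~ "CB",
  positions %in% c("LB", "LWB", "RWB", "RB") ~ "LB/RB",
  positions %in% c("LDM", "CDM", "RDM")      ~ "DM",
  positions %in% c("LCM", "CM", "RCM")       ~ "CM",
  positions %in% c("LM", "RM")               ~ "LM/RM",
  positions %in% c("LAM", "CAM", "RAM")      ~ "AM",
  positions %in% c("LW", "RW")               ~ "LW/RW",
  positions %in% c("LF", "CF", "RF")         ~ "CF",
  TRUE                                       ~ "ST"  # everything else
)

print(data.frame(positions, position_group))
```

Each condition sits next to its label, which makes the mapping easier to audit than nine levels of nested ifelse calls.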
At this point, we should have addressed all issues with our data and should have put ourselves in a good position for further analysis. We can check if all our changes have gone through and if our data looks good now by running the head function again. We will only show the first row this time as it will be representative of our data set while ensuring too much space isn’t taken up by our data.
head(fifa, 1)
## PlayerID Name FullName Age Height (CM) Weight (KG) Nationality
## 1 158023 L. Messi Lionel Messi 34 170 72 Argentina
## Overall Potential Growth TotalStats BaseStats Positions BestPosition
## 1 93 93 0 2219 462 RW,ST,CF RW
## Club ValueEUR WageEUR ReleaseClause ClubPosition ContractUntil
## 1 Paris Saint-Germain 78000000 320000 144300000 RW 2023
## ClubNumber ClubJoined NationalTeam NationalPosition NationalNumber
## 1 30 2021 Argentina RW 10
## PreferredFoot IntReputation WeakFoot SkillMoves AttackingWorkRate
## 1 Left 5 4 4 Medium
## DefensiveWorkRate PaceTotal ShootingTotal PassingTotal DribblingTotal
## 1 Low 85 92 91 95
## DefendingTotal PhysicalityTotal Crossing Finishing HeadingAccuracy
## 1 34 65 85 95 70
## ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing BallControl
## 1 91 88 96 93 94 91 96
## Acceleration SprintSpeed Agility Reactions Balance ShotPower Jumping Stamina
## 1 91 80 91 94 95 86 68 72
## Strength LongShots Aggression Interceptions Positioning Vision Penalties
## 1 69 94 44 40 93 95 75
## Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking
## 1 96 20 35 24 6 11 15
## GKPositioning GKReflexes STRating LWRating LFRating CFRating RFRating
## 1 14 8 92 92 93 93 93
## RWRating CAMRating LMRating CMRating RMRating LWBRating CDMRating RWBRating
## 1 92 93 93 90 93 69 67 69
## LBRating CBRating RBRating GKRating TeamID LeagueId TeamOverall Attack
## 1 64 53 64 22 73 16 86 89
## Midfield Defence TransferBudget DomesticPrestige IntPrestige Players
## 1 83 85 160000000 10 9 33
## StartingAverageAge AllTeamAverageAge League BestPosition_new
## 1 28 25.9 French Ligue 1 LW/RW
As we can see from the result above, our data looks good.
Lastly, we can take a look at the summary for a few of the important variables in our data.
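fifa.summary <- fifa %>%
select(Overall, WageEUR, ValueEUR, PaceTotal, DribblingTotal, ShootingTotal, WeakFoot, SkillMoves) %>%
#across() replaces the deprecated summarise_each()/funs(); columns are named <variable>_<statistic>
summarise(across(everything(),
list(min = min,
q25 = ~ quantile(.x, 0.25, names = FALSE),
median = median,
q75 = ~ quantile(.x, 0.75, names = FALSE),
max = max,
mean = mean)))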
#We can reshape this data to make it look better
fifa.summary_reformat <- fifa.summary %>% gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, q25, median, q75, max, mean)
print(fifa.summary_reformat)
## var min q25 median q75 max mean
## 1 DribblingTotal 26 58 64 69 9.50e+01 6.296675e+01
## 2 Overall 47 61 66 70 9.30e+01 6.576873e+01
## 3 PaceTotal 28 62 68 75 9.70e+01 6.787534e+01
## 4 ShootingTotal 18 44 56 64 9.40e+01 5.346088e+01
## 5 SkillMoves 1 2 2 3 5.00e+00 2.352459e+00
## 6 ValueEUR 0 475000 975000 2000000 1.94e+08 2.851651e+06
## 7 WageEUR 0 1000 3000 8000 3.50e+05 8.849611e+03
## 8 WeakFoot 1 3 3 3 5.00e+00 2.945580e+00
The result above is a consolidated table of summary statistics for some of the variables we consider valuable. We can see that ratings for players’ weak foot and skill moves range from 1 to 5, with the median skill moves rating, 2, sitting below the middle rating of 3. We can also see that wages and player values range widely, with some values as low as 0. Although these 0s may not be completely accurate, it is possible that the information isn’t public or that the true value is just incredibly low, in which case a 0 is used as a placeholder. It is important to remember that some incredibly small teams are included in this game, so players can have a wage and value that are very low compared to those of the best players in the world. We have not removed these rows or values because doing so would remove a lot of important data, and imputing the median would give us an incorrect picture of our data. This is a fact we will have to keep in mind as we approach our analysis. We can also note that the players in the game range widely in overall ability, from 47 to 93.
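The reshaping step can also be done in a single call with `pivot_longer` from tidyr. The sketch below is an alternative to the gather/separate/spread chain, shown on a made-up two-column table (`toy` and its values are hypothetical). It assumes that variable names contain no underscore, since the underscore is used to split variable from statistic.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two numeric variables.
toy <- data.frame(
  Overall  = c(47, 61, 66, 70, 93),
  WeakFoot = c(1, 3, 3, 3, 5)
)

# One row, with columns named <variable>_<statistic>.
summary_wide <- toy %>%
  summarise(across(everything(),
                   list(min = min, median = median, max = max, mean = mean)))

# ".value" tells pivot_longer that the part after "_" names the output column.
summary_long <- summary_wide %>%
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_sep = "_")

print(summary_long)
```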
We hope to uncover new information in the data that is not self-evident by slicing and dicing the data in various ways and by creating various variables to help better understand the data. A great example of this is that we plan to create a continent column based off the player’s respective countries to better understand what players from different continents are like in different categories. We could also create groupings of positions such as defenders, midfielders, and attackers. We hope that these different variables and views will better help us understand some of the hidden trends in our data. We hope to summarize these data slices and trends at the end with various visualizations that compare the different categories/variables we created.
We hope to create many visualizations to help us illustrate the findings of our questions. Here are a few of our ideas:
Histogram to show distribution of players’ statistics, both face stats and “raw” stats. It may give some valuable insight into how certain stats behave generally, maybe they come from a known distribution (like normal distribution).
Correlation plot between various statistics to showcase relationship between significant variables.
Bar plot to show various stats across various continents and different positional groups
Our group feels comfortable moving forward with all of these ideas.
In terms of machine learning techniques, our group plans on using a linear regression to better understand the relationship between independent and dependent variables.
Of course, prior to using a linear regression, we need to understand whether it is even an appropriate technique. We would first need to check for outliers, since those could heavily impact our analysis. We would also need to check that the relationship is linear, an assumption we can inspect with a scatterplot. Additionally, linear regression assumes that the residuals are approximately normally distributed. This assumption can be checked with a histogram, a Q-Q plot, or a Shapiro-Wilk test. Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. It is commonly assessed with three criteria: the correlation matrix of the predictors, tolerance, and the variance inflation factor (VIF).
The last assumption of linear regression is homoscedasticity. A scatter plot of the residuals against the fitted values is a good way to check whether the data is homoscedastic (meaning the variance of the residuals is roughly constant along the regression line). We want a model whose residuals have constant variance and do not form a pattern (such as a trumpet shape). If heteroscedasticity is present, a non-linear correction might fix the problem.
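The checks described above can be sketched in base R. The example below uses the built-in mtcars data set rather than our FIFA data, and the choice of predictors is arbitrary; it is only meant to show the mechanics. The VIF is computed by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others.

```r
# Illustrative model on a built-in data set (not the FIFA data).
predictors <- c("wt", "hp", "disp")
fit <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# Normality of residuals: Shapiro-Wilk test and Q-Q plot.
sw <- shapiro.test(residuals(fit))
qqnorm(residuals(fit))
qqline(residuals(fit))

# Multicollinearity: variance inflation factor for each predictor.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Homoscedasticity: residuals vs fitted values should show no funnel shape.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

print(sw$p.value)
print(vif)
```

A common rule of thumb treats VIF values well above 5 or 10 as a sign of problematic multicollinearity, though the cut-off is a convention rather than a strict test.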
We hope to build a model for each position or position group to better understand the trends in our data.
The other machine learning technique we may use is Principal Component Analysis (PCA). The main goal of PCA is to reduce the number of attributes so that they can be interpreted more easily as groups. We would also like to see whether reducing the number of dimensions to 6 (the number of Face stats) results in a grouping similar to the one proposed by EA. First, we would take all the attributes and scale them, as PCA is easily influenced by magnitude.
Apart from the reduction to 6 dimensions, it would be good to know the optimal number of dimensions according to the eigenvalues. One can use a method which rules out all the components with eigenvalues below 1 (Kaiser criterion), one which takes number of components to describe a certain amount of variance, use MAP test or use parallel analysis.
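Two of these selection rules, the Kaiser criterion and a cumulative variance threshold, can be sketched with base R’s `prcomp`. The example uses the built-in USArrests data set as a stand-in for our attributes, and the 80% threshold is an arbitrary choice for illustration.

```r
# PCA on centred and scaled variables (built-in data, not the FIFA data).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared standard deviations.
eigenvalues <- pca$sdev^2

# Kaiser criterion: keep components with eigenvalue above 1.
k_kaiser <- sum(eigenvalues > 1)

# Variance rule: smallest number of components explaining at least 80%.
cumulative_variance <- cumsum(eigenvalues) / sum(eigenvalues)
k_variance <- which(cumulative_variance >= 0.80)[1]

print(eigenvalues)
print(c(kaiser = k_kaiser, variance_80 = k_variance))
```

Because the variables are scaled, the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more than a single original variable would.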
When it comes to Face stats, we only have 6 variables to begin with but nonetheless we want to reduce their number even more.
Overall, the only machine learning techniques we hope to use are a Linear Regression and/or a PCA to better understand trends in our data dependent on their significance.
To answer this question, we must first subset our data to contain only the players’ attributes in addition to their wage. Since a player’s face stats are weighted sums of other attributes, we will not include them in our analysis, to avoid multicollinearity. Since we are looking for a general understanding of the attributes that make a player earn a higher wage, we will also leave out position-based stats. Let’s subset our data first and save the new table as playerwage, which we will use for our analysis of wage. We can list the columns with the colnames function to confirm that everything we need is in the new data frame.
playerwage <- fifa[,c(17,38:66)]
colnames(playerwage)
## [1] "WageEUR" "Crossing" "Finishing" "HeadingAccuracy"
## [5] "ShortPassing" "Volleys" "Dribbling" "Curve"
## [9] "FKAccuracy" "LongPassing" "BallControl" "Acceleration"
## [13] "SprintSpeed" "Agility" "Reactions" "Balance"
## [17] "ShotPower" "Jumping" "Stamina" "Strength"
## [21] "LongShots" "Aggression" "Interceptions" "Positioning"
## [25] "Vision" "Penalties" "Composure" "Marking"
## [29] "StandingTackle" "SlidingTackle"
Let’s take a look at the distribution of the various attributes in our data. We have not included player wage in this visualization as that operates on a different scale which would make the attributes harder to view.
boxplot(playerwage[,2:30], las=2, main= "Attributes Box Plot", ylab= "Attribute Value")
Let’s further understand the wage variable by looking at its distribution:
hist(playerwage$WageEUR, main= "Player Wage Histogram", xlab = "Player Wage (Euros)")
As we can see here, many players are on the lower end. Our next step is a linear regression, so we will have to make adjustments to these values so that our linear regression gives us an adequate understanding of our data without being skewed by the large amount of 0 and low-end data points.
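One common adjustment for a heavily right-skewed variable with zeros is a `log1p` transform, that is log(1 + x), which is defined at 0. The sketch below uses simulated wages, not our data, and is only one possible option; the regression that follows in this report uses the untransformed wage.

```r
set.seed(1)

# ASSUMPTION: simulated right-skewed wages with a few zeros, for illustration.
wage_toy <- c(rep(0, 5), round(rexp(200, rate = 1 / 9000)))

# log1p(x) = log(1 + x), so zero wages map to 0 instead of -Inf.
log_wage_toy <- log1p(wage_toy)

hist(wage_toy, main = "Simulated Wage", xlab = "Wage (Euros)")
hist(log_wage_toy, main = "Simulated Wage, log1p scale", xlab = "log(1 + Wage)")
```

A model fitted on this scale would have coefficients that describe approximate relative changes in wage rather than changes in euros, so its interpretation differs from that of a model on the raw wage.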
We can set up the linear regression, where wage is the dependent variable and the player attributes are the independent variables.
wage_lr <- lm(WageEUR ~ ., data = playerwage)
wage_lr
##
## Call:
## lm(formula = WageEUR ~ ., data = playerwage)
##
## Coefficients:
## (Intercept) Crossing Finishing HeadingAccuracy
## -65584.100 54.320 16.580 17.616
## ShortPassing Volleys Dribbling Curve
## 79.454 59.731 19.533 67.091
## FKAccuracy LongPassing BallControl Acceleration
## -39.411 -5.233 54.196 4.717
## SprintSpeed Agility Reactions Balance
## 106.186 -94.964 919.638 -14.486
## ShotPower Jumping Stamina Strength
## 164.160 34.077 -115.639 -34.650
## LongShots Aggression Interceptions Positioning
## -191.241 -21.741 -74.522 -104.325
## Vision Penalties Composure Marking
## 127.779 -59.983 182.741 -17.945
## StandingTackle SlidingTackle
## 121.586 -49.441
summary(wage_lr)
##
## Call:
## lm(formula = WageEUR ~ ., data = playerwage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38542 -7478 -2329 3858 305270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65584.100 1376.538 -47.644 < 2e-16 ***
## Crossing 54.320 15.927 3.411 0.000649 ***
## Finishing 16.580 21.266 0.780 0.435626
## HeadingAccuracy 17.616 15.912 1.107 0.268268
## ShortPassing 79.454 30.736 2.585 0.009744 **
## Volleys 59.731 18.223 3.278 0.001048 **
## Dribbling 19.533 26.055 0.750 0.453435
## Curve 67.091 17.682 3.794 0.000149 ***
## FKAccuracy -39.411 15.316 -2.573 0.010085 *
## LongPassing -5.233 21.594 -0.242 0.808511
## BallControl 54.196 31.105 1.742 0.081465 .
## Acceleration 4.717 23.984 0.197 0.844086
## SprintSpeed 106.186 21.342 4.975 6.56e-07 ***
## Agility -94.964 18.121 -5.241 1.62e-07 ***
## Reactions 919.638 20.117 45.714 < 2e-16 ***
## Balance -14.486 15.489 -0.935 0.349687
## ShotPower 164.160 18.152 9.044 < 2e-16 ***
## Jumping 34.077 12.373 2.754 0.005891 **
## Stamina -115.639 14.206 -8.140 4.19e-16 ***
## Strength -34.650 14.829 -2.337 0.019469 *
## LongShots -191.241 19.855 -9.632 < 2e-16 ***
## Aggression -21.741 13.690 -1.588 0.112279
## Interceptions -74.522 21.416 -3.480 0.000503 ***
## Positioning -104.325 20.226 -5.158 2.52e-07 ***
## Vision 127.779 18.185 7.027 2.19e-12 ***
## Penalties -59.983 16.848 -3.560 0.000372 ***
## Composure 182.741 19.257 9.490 < 2e-16 ***
## Marking -17.945 19.568 -0.917 0.359123
## StandingTackle 121.586 30.182 4.028 5.64e-05 ***
## SlidingTackle -49.441 28.659 -1.725 0.084520 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16120 on 19246 degrees of freedom
## Multiple R-squared: 0.3108, Adjusted R-squared: 0.3097
## F-statistic: 299.2 on 29 and 19246 DF, p-value: < 2.2e-16
From the above coefficients we can understand that if all else held constant, reactions is one of the most important variables to an increase in wage. With a coefficient of 919.638 we can understand that an increase by 1 in the reactions section can increase a player’s wage by nearly 919 euros. The other attributes which lead an increase of more than 100 euros in wage with a 1 score increase in attribute rating are: Sprint Speed, Vision, Composure, Shot Power, and Standing Tackle. If I am player looking to increase my wage, these are the attributes that I would be focused on improving.
From the summary of ‘wage_lr’, we can see that the p-value of the overall F-test is less than the significance level (0.05), hence there is a statistically significant relationship between a player’s wage and the independent variables. The R-squared of 0.3108 indicates that the model explains about 31% of the variation in wage, so the attributes are jointly informative but leave most of the variation unexplained.
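The quantities discussed here can also be extracted from the summary object programmatically. The following is a sketch on synthetic data (the `toy` data frame and its columns are hypothetical). It recomputes the overall F-test p-value with `pf()`, since `summary.lm` stores the F statistic and its degrees of freedom but not the p-value itself.

```r
# Synthetic data: y depends on x1 only; x2 is pure noise.
set.seed(2)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

s <- summary(lm(y ~ x1 + x2, data = toy))

# fstatistic holds the F value, numerator df, and denominator df
f <- s$fstatistic
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat("R-squared:", round(s$r.squared, 3), "\n")
cat("Adjusted R-squared:", round(s$adj.r.squared, 3), "\n")
cat("Overall F-test p-value:", format.pval(f_pvalue), "\n")
```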
We will use a similar set of attributes to the ones used above for this analysis; however, instead of wage we will include player value. We will again create a new table, this time called playervalue.
playervalue <- fifa[,c(16,38:66)]
colnames(playervalue)
## [1] "ValueEUR" "Crossing" "Finishing" "HeadingAccuracy"
## [5] "ShortPassing" "Volleys" "Dribbling" "Curve"
## [9] "FKAccuracy" "LongPassing" "BallControl" "Acceleration"
## [13] "SprintSpeed" "Agility" "Reactions" "Balance"
## [17] "ShotPower" "Jumping" "Stamina" "Strength"
## [21] "LongShots" "Aggression" "Interceptions" "Positioning"
## [25] "Vision" "Penalties" "Composure" "Marking"
## [29] "StandingTackle" "SlidingTackle"
We can visualize the attributes here, as done above, using boxplots for the various attributes.
boxplot(playervalue[,2:30], las=2, main= "Attributes Box Plot", ylab= "Attribute Value")
More importantly, however, we can also visualize the distribution of the player values in our dataset by creating a histogram
options(scipen = 999)
hist(playervalue$ValueEUR, main= "Player Value Histogram", xlab = "Player Value (Euros)")
Again, we can notice that a large number of points here are 0. Although this could be realistic, we will need to make adjustments when we enter the next stage of the linear regression to ensure our analysis isn’t skewed.
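One possible adjustment, not applied in the regression that follows, is to measure how many rows have a value of 0 and exclude them before fitting. The sketch below uses a small hypothetical table; whether zero-valued players should be dropped at all is an assumption, not something the data source confirms.

```r
# Toy data: the zeros mimic players with no recorded market value.
toy <- data.frame(
  ValueEUR  = c(0, 0, 500000, 1200000, 3000000, 0, 8000000),
  Reactions = c(70, 65, 58, 64, 71, 80, 85)
)

# Share of rows with a zero value
zero_share <- mean(toy$ValueEUR == 0)

# Keep only rows with a positive value before fitting
toy_nonzero <- subset(toy, ValueEUR > 0)

fit <- lm(ValueEUR ~ Reactions, data = toy_nonzero)
nrow(toy_nonzero)
```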
value_lr <- lm(ValueEUR ~ ., data = playervalue)
value_lr
##
## Call:
## lm(formula = ValueEUR ~ ., data = playervalue)
##
## Coefficients:
## (Intercept) Crossing Finishing HeadingAccuracy
## -25662833.1 -2900.7 30247.4 -10310.2
## ShortPassing Volleys Dribbling Curve
## 54827.8 21324.2 14650.1 20285.5
## FKAccuracy LongPassing BallControl Acceleration
## -12623.9 -7276.1 14405.9 22539.5
## SprintSpeed Agility Reactions Balance
## 53180.8 -47815.2 360453.3 335.9
## ShotPower Jumping Stamina Strength
## 44622.4 -1202.3 -15616.6 -6189.9
## LongShots Aggression Interceptions Positioning
## -76002.6 -21180.4 -37517.5 -47741.2
## Vision Penalties Composure Marking
## 57514.9 -41219.4 53892.6 -10731.5
## StandingTackle SlidingTackle
## 48102.8 -7940.4
summary(value_lr)
##
## Call:
## lm(formula = ValueEUR ~ ., data = playervalue)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12511031 -2772965 -984953 1195349 176149216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25662833.1 554109.4 -46.314 < 0.0000000000000002 ***
## Crossing -2900.7 6411.1 -0.452 0.650949
## Finishing 30247.4 8560.5 3.533 0.000411 ***
## HeadingAccuracy -10310.2 6405.1 -1.610 0.107481
## ShortPassing 54827.8 12372.4 4.431 0.00000941153763968 ***
## Volleys 21324.2 7335.5 2.907 0.003654 **
## Dribbling 14650.1 10487.9 1.397 0.162473
## Curve 20285.5 7117.8 2.850 0.004377 **
## FKAccuracy -12623.9 6165.3 -2.048 0.040617 *
## LongPassing -7276.1 8692.4 -0.837 0.402568
## BallControl 14405.9 12521.1 1.151 0.249938
## Acceleration 22539.5 9654.3 2.335 0.019572 *
## SprintSpeed 53180.8 8590.9 6.190 0.00000000061226851 ***
## Agility -47815.2 7294.2 -6.555 0.00000000005697122 ***
## Reactions 360453.3 8097.9 44.512 < 0.0000000000000002 ***
## Balance 335.9 6235.1 0.054 0.957042
## ShotPower 44622.4 7306.8 6.107 0.00000000103502772 ***
## Jumping -1202.3 4980.7 -0.241 0.809254
## Stamina -15616.6 5718.5 -2.731 0.006322 **
## Strength -6189.9 5969.3 -1.037 0.299771
## LongShots -76002.6 7992.4 -9.509 < 0.0000000000000002 ***
## Aggression -21180.4 5510.7 -3.844 0.000122 ***
## Interceptions -37517.5 8620.8 -4.352 0.00001356097008179 ***
## Positioning -47741.2 8141.7 -5.864 0.00000000459743317 ***
## Vision 57514.9 7320.2 7.857 0.00000000000000414 ***
## Penalties -41219.4 6782.1 -6.078 0.00000000124206784 ***
## Composure 53892.6 7751.7 6.952 0.00000000000370753 ***
## Marking -10731.5 7876.8 -1.362 0.173081
## StandingTackle 48102.8 12149.4 3.959 0.00007545100626097 ***
## SlidingTackle -7940.4 11536.5 -0.688 0.491283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6489000 on 19246 degrees of freedom
## Multiple R-squared: 0.2728, Adjusted R-squared: 0.2717
## F-statistic: 249 on 29 and 19246 DF, p-value: < 0.00000000000000022
From the linear regression results above, holding all else constant, Reactions is again the most valuable attribute. The model suggests that a 1-point increase in the Reactions rating is associated with an increase of over 360 thousand euros in player value. Other attributes with significant positive coefficients include Sprint Speed, Finishing, Short Passing, Volleys, and Acceleration, to name a few. Largely, the attributes associated with higher player value and higher wage are the same, for a fairly obvious reason: if a player is valuable, their wage probably reflects that.
From the summary of ‘value_lr’, we can see that the p-value of the overall F-test is less than the significance level (0.05), hence there is a statistically significant relationship between a player’s value and the independent variables. Because the overall F-statistic is significant, we can reject the null hypothesis that all of the coefficients are zero. The R-squared of 0.2728 indicates that the model explains about 27% of the variation in player value.
To understand this we will create a column which will serve as a binary flag, 1 indicating player is faster than they are strong, and 0 indicating otherwise. By then summing this column we can get a better understanding of how many players fit that mold. We can also divide this number by the total number of players to understand what percent of players value one attribute over the other.
fifa <- fifa %>%
mutate(SpeedVsStrength = if_else(PaceTotal> PhysicalityTotal, 1, 0))
sum(fifa$SpeedVsStrength)/nrow(fifa)
## [1] 0.5974787
We can see from the result above that roughly 60% of soccer players have a higher pace rating than physicality rating. This is an interesting trend for any upcoming soccer player, as it may indicate a need to focus on speed training a little more than strength training to better fit the mold of an average soccer player.
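The share can also be computed without an intermediate 0/1 column, because the mean of a logical vector is the proportion of TRUE values. This is a base R sketch on a hypothetical five-player table.

```r
# Hypothetical ratings for five players
toy <- data.frame(
  PaceTotal        = c(80, 65, 72, 55, 90),
  PhysicalityTotal = c(70, 75, 60, 55, 68)
)

# Logical comparison; a tie (55 vs 55) counts as "not faster"
faster <- toy$PaceTotal > toy$PhysicalityTotal

# mean() of a logical vector equals sum(faster) / nrow(toy)
share_faster <- mean(faster)
share_faster
```

Note that ties fall into the "not faster" group under a strict `>` comparison, so the remaining share is not strictly "stronger than fast".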
To understand this, we will average the overall ratings of all players that belong to a given country. We can then look at the top 10 countries with the highest average player overall ratings.
fifa %>%
group_by(Nationality) %>%
summarize(AvgPlayerOverall = mean(Overall)) %>%
top_n(10, wt= AvgPlayerOverall) %>%
arrange(desc(AvgPlayerOverall))
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 10 x 2
## Nationality AvgPlayerOverall
## <chr> <dbl>
## 1 Tanzania 74
## 2 Libya 73.3
## 3 Mozambique 73
## 4 Central African Republic 72.5
## 5 Egypt 72
## 6 Fiji 72
## 7 Syria 72
## 8 Gabon 71.2
## 9 Brazil 70.8
## 10 Czech Republic 70.7
Our analysis produced a result completely different from what we expected, with Tanzania producing the highest rated players on average. For any soccer fans or scouts, this may be useful information when looking out for up-and-coming players from Tanzania.
The above analysis, however, may be incomplete, as the values could be skewed by a small sample size. To truly understand which country produces the best players, we need to look at countries that consistently put out good players. For this reason, we decided to conduct the analysis again with an added constraint: we only include countries that have more than 50 players in the game (more than 0.2% of the total players).
fifa %>%
group_by(Nationality) %>%
mutate(NumberOfPlayers = n()) %>%
filter(NumberOfPlayers > 50) %>%
summarize(AvgPlayerOverall = mean(Overall)) %>%
top_n(10, wt= AvgPlayerOverall) %>%
arrange(desc(AvgPlayerOverall))
## # A tibble: 10 x 2
## Nationality AvgPlayerOverall
## <chr> <dbl>
## 1 Brazil 70.8
## 2 Czech Republic 70.7
## 3 Algeria 70.6
## 4 Ukraine 70.5
## 5 Italy 70.0
## 6 Portugal 69.8
## 7 Spain 69.5
## 8 Morocco 69.4
## 9 Serbia 69.2
## 10 Croatia 69.1
From the results above, we can see a country more in line with our predictions coming in 1st: Brazil. This analysis indicates more reliably that Brazil on average has the best players and would be a great country for scouts to look at. This list, however, isn’t completely predictable, and we see a few unlikely countries on it, such as Czech Republic, Algeria, Ukraine, and Serbia.
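The effect of the minimum-sample constraint is easy to see on a toy table. The base R sketch below uses hypothetical countries "A", "B", and "C" and a hypothetical threshold of 1 player (the report uses 50). Country "B" has the highest average but only one player, so it drops out once the filter is applied.

```r
# Hypothetical players; country B has a single, highly rated player
toy <- data.frame(
  Nationality = c("A", "A", "A", "B", "C", "C"),
  Overall     = c(70, 72, 74, 90, 60, 64)
)
min_players <- 1  # hypothetical threshold; the report uses 50

n_players   <- table(toy$Nationality)
avg_overall <- tapply(toy$Overall, toy$Nationality, mean)

by_country <- data.frame(
  Nationality      = names(avg_overall),
  NumberOfPlayers  = as.integer(n_players[names(avg_overall)]),
  AvgPlayerOverall = as.numeric(avg_overall)
)

# Apply the sample-size filter, then rank by average rating
ranked <- by_country[by_country$NumberOfPlayers > min_players, ]
ranked <- ranked[order(-ranked$AvgPlayerOverall), ]
ranked
```

Keeping the player count next to the average also lets a reader judge how much weight each row deserves.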
Let’s first look at the distribution of the skill move ratings among the players.
barplot(table(fifa$SkillMoves), main= "Number of Players by Skill Moves Rating", xlab = "Skill Moves Rating")
As we can see from the above visualization, 2 is the most common rating with 3 being the second most common.
We can approach understanding which nation produces the most skillful players by taking the average of the skill move rating for all the players, and then grouping this by nationality to get this information at the country level.
fifa %>%
group_by(Nationality) %>%
summarize(AvgSkillMove = mean(SkillMoves)) %>%
top_n(10, wt= AvgSkillMove) %>%
arrange(desc(AvgSkillMove))
## # A tibble: 19 x 2
## Nationality AvgSkillMove
## <chr> <dbl>
## 1 Fiji 4
## 2 Puerto Rico 3.5
## 3 Liberia 3.33
## 4 Bermuda 3
## 5 Central African Republic 3
## 6 Chad 3
## 7 Chinese Taipei 3
## 8 Eritrea 3
## 9 Estonia 3
## 10 Ethiopia 3
## 11 Iraq 3
## 12 Korea DPR 3
## 13 Kyrgyzstan 3
## 14 Libya 3
## 15 Malaysia 3
## 16 Namibia 3
## 17 Papua New Guinea 3
## 18 Syria 3
## 19 Tanzania 3
Wow, again our results are completely different from what we expected. The country which produces the most skillful players on average is Fiji. I would be curious to learn more about the soccer culture in Fiji and how much they value creativity in their play. If I am a fan or a scout, I am looking out for prospects from Fiji to bring exciting new plays.
This analysis may again be biased towards countries with a small sample size. To get a more holistic view, we can run the analysis again with only the countries that have more than 50 players in the game (more than 0.2% of the total players).
fifa %>%
group_by(Nationality) %>%
mutate(NumberOfPlayers = n()) %>%
filter(NumberOfPlayers > 50) %>%
summarize(AvgSkillMove = mean(SkillMoves)) %>%
top_n(10, wt= AvgSkillMove) %>%
arrange(desc(AvgSkillMove))
## # A tibble: 10 x 2
## Nationality AvgSkillMove
## <chr> <dbl>
## 1 Congo DR 2.86
## 2 Morocco 2.85
## 3 Algeria 2.75
## 4 Nigeria 2.71
## 5 Ghana 2.66
## 6 Portugal 2.66
## 7 Brazil 2.56
## 8 Côte d'Ivoire 2.55
## 9 Mali 2.55
## 10 Spain 2.54
Here we can see a more holistic view of which countries on average have the most skillful players. From this list we can see that players from Congo DR are typically the most skillful, and a scout may look for players from Congo DR if they are looking for an exciting and fun-to-watch player.
We can find the answer to this question through one key function. We can use the n() function to count the rows in each group defined by the preceding group_by.
fifa %>%
group_by(Nationality) %>%
summarise(NumberOfPlayers = n()) %>%
top_n(10, wt= NumberOfPlayers) %>%
arrange(desc(NumberOfPlayers))
## # A tibble: 10 x 2
## Nationality NumberOfPlayers
## <chr> <int>
## 1 England 1717
## 2 Germany 1214
## 3 Spain 1111
## 4 France 984
## 5 Argentina 963
## 6 Brazil 890
## 7 Japan 547
## 8 Netherlands 439
## 9 United States 412
## 10 Poland 403
We can see that England has the most players in this game. The most interesting country on this list, however, is Japan at the 7th position. Our group never really thought of them as a soccer-talent producing nation but we were glad to be proven incorrect.
We can answer this question by averaging players’ overall ratings at the league level.
fifa %>%
group_by(League) %>%
summarize(AvgPlayerOverall = mean(Overall)) %>%
top_n(10, wt = AvgPlayerOverall) %>%
arrange(desc(AvgPlayerOverall))
## # A tibble: 10 x 2
## League AvgPlayerOverall
## <chr> <dbl>
## 1 Spain Primera Division 73.5
## 2 English Premier League 72.5
## 3 Italian Serie A 72.1
## 4 Czech Republic Gambrinus Liga 72.0
## 5 Ukrainian Premier League 71.9
## 6 German 1. Bundesliga 71.3
## 7 Campeonato Brasileiro Série A 71.0
## 8 French Ligue 1 70.9
## 9 Greek Super League 70.8
## 10 Russian Premier League 70.0
We can see that the 1st division in Spain actually has the best average player rating. The most interesting entry on this list, however, is the Czech league. The Czech league is not typically popular, but it is interesting to note the number of high quality players there. As a fan, I may need to start watching more Czech league soccer to learn about great players I wasn’t aware of.
Walk-Out and Boards are both slang terms the FIFA community has given for different subsets of players in the gold rating. We were curious, however, how these subsets composed of great overall players compare in terms of player value. Walk-Outs are the handful of best players in the world, so does their value represent this distinction?
To help us get to the bottom of this question we will first create a column that will do this characterization for us. We can then go ahead and average the player values based off of these characterizations.
fifa %>%
mutate(GoldLevel = if_else(Overall >= 83 & Overall <=85, 'Board', if_else(Overall >= 86, 'Walk-Out', 'Other'))) %>%
group_by(GoldLevel) %>%
summarize(AvgPlayerValue = mean(ValueEUR)) %>%
arrange(desc(AvgPlayerValue))
## # A tibble: 3 x 2
## GoldLevel AvgPlayerValue
## <chr> <dbl>
## 1 Walk-Out 78057971.
## 2 Board 40907692.
## 3 Other 2320304.
We can see from the above values that the distinction between a Walk-Out and a Board is heavily valued in the real world. Players who are characterized in-game as Walk-Outs have an average value of about 78 million euros, whereas Boards have an average value of nearly 41 million euros. Hence, on average, a Walk-Out is valued about 37 million euros higher than a Board.
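The same three-way split can be expressed with base R’s `cut()`, which may read more clearly than nested `if_else` calls once there are more than two tiers. This is a sketch on a hypothetical vector of ratings, and it assumes ratings are whole numbers so that the break at 82 separates 82 from 83.

```r
# Hypothetical overall ratings
overall <- c(70, 82, 83, 85, 86, 91)

# Right-closed intervals: (-Inf, 82], (82, 85], (85, Inf)
gold_level <- cut(
  overall,
  breaks = c(-Inf, 82, 85, Inf),
  labels = c("Other", "Board", "Walk-Out")
)

data.frame(overall, gold_level)
```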
We can understand the answer to this question by grouping our data by the positional group column we had created and then averaging the pace for each player in those respective groups.
fifa %>%
group_by(BestPosition_new) %>%
summarize(AvgPace = mean(PaceTotal)) %>%
arrange(desc(AvgPace))
## # A tibble: 10 x 2
## BestPosition_new AvgPace
## <fct> <dbl>
## 1 LW/RW 79.5
## 2 LM/RM 76.9
## 3 CF 73.2
## 4 LB/RB 73.2
## 5 AM 70.0
## 6 ST 69.5
## 7 GK 65.0
## 8 CM 63.7
## 9 DM 61.3
## 10 CB 59.4
We can note from above that wingers are on average the fastest players, which makes sense given the responsibilities of their position. What is interesting, however, is how low strikers’ pace came in: even lower than the average pace of left and right backs. This may show that pace is important for a striker but is not everything. A slower striker could still be successful if they can compensate with other strong qualities such as finishing. In contrast, a winger needs to be incredibly pacey, and a slower winger may not be able to succeed despite being good in other areas. It is also interesting to note that LB/RB are typically the players guarding LW/RW, which helps explain why their average pace is so high. To guard an extremely quick winger, teams need to use fast defenders on the wing.
We can understand this by looking at the free agent with the highest overall.
fifa %>%
filter(Club == 'Free agent') %>%
top_n(1, wt = Overall)
## PlayerID Name FullName Age Height (CM) Weight (KG)
## 1 184087 T. Alderweireld Toby Alderweireld 32 186 81
## Nationality Overall Potential Growth TotalStats BaseStats Positions
## 1 Belgium 83 83 0 1991 412 CB
## BestPosition Club ValueEUR WageEUR ReleaseClause ClubPosition
## 1 CB Free agent 0 0 0
## ContractUntil ClubNumber ClubJoined NationalTeam NationalPosition
## 1 NA NA 2021 Belgium CB
## NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1 2 Right 3 3 2
## AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1 Medium Medium 58 55 70
## DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1 66 86 77 64 45
## HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1 81 77 38 62 63 59 81
## BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1 75 55 60 54 85 62 78
## Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1 81 76 77 58 79 85 52
## Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1 62 58 86 87 87 84 16
## GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1 6 14 16 14 68 63 65
## CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1 65 65 63 69 69 76 69 80
## CDMRating RWBRating LBRating CBRating RBRating GKRating TeamID LeagueId
## 1 83 80 81 83 81 24 NA NA
## TeamOverall Attack Midfield Defence TransferBudget DomesticPrestige
## 1 NA NA NA NA NA NA
## IntPrestige Players StartingAverageAge AllTeamAverageAge League
## 1 NA NA NA NA <NA>
## BestPosition_new SpeedVsStrength
## 1 CB 0
We can note that Toby Alderweireld is the best Free Agent available, as of the time of this data frame. He is an 83 rated Defender from Belgium. This is interesting to note for any teams who may be looking for a new, highly rated, defender.
We can try to understand this by creating a new variable which divides each player’s overall rating by their age. This will penalize older players while benefiting younger ones. This new metric will help us better identify some up-and-coming players that we may want to look out for in the future.
fifa %>%
mutate(PlayerAgeOverallMetric = Overall/Age) %>%
top_n(10, wt = PlayerAgeOverallMetric) %>%
arrange(desc(PlayerAgeOverallMetric)) %>%
select(Name, Age, Overall, PlayerAgeOverallMetric)
## Name Age Overall PlayerAgeOverallMetric
## 1 Pedri 18 81 4.500000
## 2 E. Haaland 20 88 4.400000
## 3 J. Bellingham 18 79 4.388889
## 4 F. Wirtz 18 78 4.333333
## 5 E. Camavinga 18 78 4.333333
## 6 R. Cherki 17 73 4.294118
## 7 G. Reyna 18 77 4.277778
## 8 Ansu Fati 18 76 4.222222
## 9 A. Hložek 18 76 4.222222
## 10 B. Saka 19 80 4.210526
Above we can see a list of 10 of the most talented and so-far proven young players in soccer. We can see some pretty famous names on this list, such as Haaland, Camavinga, Fati, and Saka. We can also see less well-known players, such as Wirtz and Hložek. As an American, however, I am really excited to see Reyna on this list, an exciting young talent from the USA. As a fan of the sport, I am definitely excited to see these young players grow. As a soccer manager, however, I would definitely be looking to get some of these young players to join my club, as their future looks incredibly promising right now.
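To make the behaviour of the metric concrete, the sketch below recomputes it for three of the players listed above, using the ages and overall ratings shown in the output. It illustrates how dividing by age rewards youth: Haaland has the highest overall rating of the three but ranks below Pedri, who is two years younger.

```r
# Ages and ratings copied from the output above
young <- data.frame(
  Name    = c("B. Saka", "Pedri", "E. Haaland"),
  Age     = c(19, 18, 20),
  Overall = c(80, 81, 88)
)

# Overall rating per year of age
young$PlayerAgeOverallMetric <- young$Overall / young$Age

# Highest metric first
young <- young[order(-young$PlayerAgeOverallMetric), ]
young
```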
We will characterize the best players in the world as players with an overall rating of 86 or higher, as FIFA distinguishes those players as “Walk-Outs”. This is the highest distinction a player can get in FIFA, and hence we believe it is a good way to indicate who the best players truly are. We will do this by first creating a dataframe called Best_Players which is composed solely of players with the Walk-Out designation. We will then create a barplot from this dataframe of the count of players’ club jersey numbers to see if there is any pattern.
Best_Players <- fifa %>%
mutate(GoldLevel = if_else(Overall >= 83 & Overall <=85, 'Board', if_else(Overall >= 86, 'Walk-Out', 'Other'))) %>%
filter(GoldLevel == 'Walk-Out')
barplot(table(Best_Players$ClubNumber), las=2, main= "How many of the world's best players wear each Jersey Number?", xlab = "Jersey Number for Club", ylab = "Number of Soccer Players")
This is extremely unexpected. Usually the jersey number 10 is regarded as the number reserved for the ‘best’ players; however, we can see that this really isn’t the case. In club soccer, most of the best players are actually wearing jersey number 1. Numbers 10 and 7 are both in the running, but number 1 is by far the most popular jersey number among the best players. Number 1 is most commonly worn by goalkeepers, which indicates that there are many goalkeepers with an overall rating of 86+.
Maybe we have the hype all wrong though, maybe the allure to the coveted number 10 jersey is mainly for player’s national team appearances and not necessarily for clubs. Let’s create a similar plot to understand the most common jersey number among the world’s best players for their national teams.
barplot(table(Best_Players$NationalNumber), las=2, main= "How many of the world's best players wear each Jersey Number?", xlab = "Jersey Number for National Team", ylab = "Number of Soccer Players")
In the above visualization we can more clearly understand why so many kids hope to wear the jersey number 10 someday: in national team appearances, the jersey number 10 is the one most commonly worn by the world’s best players.
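The most common number can also be read directly from the frequency table behind the barplot. The sketch below uses a hypothetical vector of national team numbers (not the real Best_Players data), with an NA for a player without a national team number. `table()` drops NA values by default.

```r
# Hypothetical national team numbers; NA = no national team number
national_number <- c(10, 7, 10, 1, 9, 10, 7, NA)

# Frequency of each number, most common first
counts <- sort(table(national_number), decreasing = TRUE)

most_common <- names(counts)[1]
most_common
```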
In FIFA, each player is rated on various attributes which include stats such as: Pace, Dribbling, Shooting. These attributes combined give each player an overall score. We wanted to figure out, utilizing the current data set, what we could understand about soccer and soccer players in the real world using attributes given to players in the game.
We have addressed the proposed problem statement by examining the relationship between various in-game attributes and real-world values. Our analytic technique runs a linear regression between these in-game and real-world values. We summarized each regression to assess statistical significance, that is, to test the relationship between the dependent variable (e.g. player’s wage) and the independent variables (the in-game attributes), using the null hypothesis, F-statistic, and R-squared values. Additionally, we used data manipulation techniques to understand basic geographical and statistical trends in our data for specific demographics and players.
Our analysis was really interesting overall, as it allowed us to understand the game of soccer in ways we had never understood it before. One of the questions we wanted to answer was which positional group is the fastest. You, as I did, may imagine that the fastest positional groups are all strikers looking to run past defenders and score goals. That is, however, only partially correct. Through our analysis we realized that three of our top four paciest positional groups are players that play on the wings (or far edges) of the field. This is interesting as it indicates that the fastest players will typically be on the sides and not in the middle. The 3 groups referenced are: RW/LW, RM/LM, and RB/LB. The RB/LB (defenders on the wing) group was incredibly surprising to us, but makes a lot of sense upon further review. If teams keep their fastest players as attackers on the wings of a field, they must also place fast players on the wings to defend them.
It was also interesting to understand which attributes truly have the strongest positive effect on a player’s value and wage. Our group would never have expected ‘Reactions’ to be the top answer, and this is where our analysis proved to be incredibly insightful. If I were a professional soccer player right now, this is the attribute that I would focus on improving the most. It is easy to see why this attribute is incredibly valuable, as soccer is a fast-paced sport based on acting and reacting. A slow reaction by any given player could easily cost a team the game, and a fast reaction could help a team win one.
Another interesting insight our group gained in this analysis came from looking at the top young players in the world of professional soccer. Some of the most prominent players in the game today started at an incredibly young age. Messi is a great example of this, as he moved with his family to Barcelona at the age of 13 to fulfill his dream and potential. Our group, through the custom metric we created, found that Pedri is the best young player in the world. Interestingly enough, on 11/29 (a few days after we finished our analysis) Pedri was awarded the Kopa Trophy, “an award presented to the best young player in men’s soccer.” This was a great confirmation of the metric we created and of the game’s accuracy in measuring the real world. Another interesting player on our list of the top 10 most promising young players was Gio Reyna. He came in at number 7 in our rankings, which was exciting as he is from the United States of America, a country currently on the rise in the soccer world. Our group is now excited to see how Reyna will lead the American soccer team in the years ahead.
We approached this project and analysis as fans of soccer and FIFA. We hoped our analysis would hence be valuable to soccer fans like ourselves, and also to players. We believe our analysis meets that criterion well by providing facts that we, as soccer fans, thought were incredibly interesting. Additionally, we believe our analysis can be valuable to current players, as they can understand that improving their reaction time can be important in helping increase their wage and value. Upon further review, however, our group realized that the person who may benefit the most from our analysis may actually be a soccer scout or manager. A soccer scout could use the questions we answered to help point them towards some promising young players or countries where they could target scouting efforts.
Although our analysis revealed interesting trends, we believe there are a few limitations which future analysts could explore further. One of the largest limitations of our analysis was the data itself. Most of the findings do not tell us much about the players in real life but rather reflect the methodology employed by EA Sports to assign attributes to players in this game.
Most defenders, especially center-backs, have been given very low pace in comparison to the rest of the outfield positions. Real-life statistics, however, indicate that some center-backs do register very quick sprint speeds, and often outpace strikers and wingers. For example, Leicester City’s Caglar Soyuncu registered a top speed of 37.55 km/h in a game against Crystal Palace in November, but his pace in game is just 76, which is quite low considering how quickly he can run. From our understanding, the developers at EA tend to give defenders low pace because in real life defenders are not expected to run with the ball, and do not usually get many opportunities to showcase their pace while doing so.
Some of the analysis, however, will hold in the real world where the information used is not subjective. These variables include information like wage, club, position, jersey number, nationality, age, etc. Some of the analysis did give us insight into how the subjective attributes are correlated with the objective attributes of the players.
Another limitation of our analysis was our specific approach to answering a broad goal. The question we started with was incredibly broad, and we answered smaller, more specific questions throughout the process. We believe, however, that a different set of these sub-questions could have taken our analysis in a different direction. For example, we could have looked more into the positional groupings. Soccer players have various skills, and each position needs a different set of them: the most important skills for a defender are completely different from those required by an attacker/striker or other positions. A detailed analysis using linear regression modeling could have been applied to identify the most important skills or key attributes for a particular position. This gap was a limitation of our analysis. We encourage other groups to try such an analysis in the future, as it could indicate the similarity between main stats and face stats. This might help us further understand how EA calculates face stats and what weight is given to the individual attributes in each cumulative face stat metric. It also seems intuitive that a principal component analysis (PCA) could show that players can be categorized into groups, such as ‘attacking, technical’, ‘fast, dribbling’, and ‘attacking, physical’.
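As a starting point for the suggested PCA, the sketch below runs `prcomp()` on synthetic data. The four attribute columns are generated from two hypothetical latent traits (`speed` and `tech`), so the grouping it recovers is built in by construction and says nothing about the real FIFA attributes. It only shows the mechanics of fitting the PCA and reading the variance explained and the loadings.

```r
# Synthetic attributes driven by two hypothetical latent traits
set.seed(3)
n <- 200
speed <- rnorm(n)  # hypothetical latent "pace" trait
tech  <- rnorm(n)  # hypothetical latent "technique" trait
toy <- data.frame(
  Acceleration = speed + rnorm(n, sd = 0.3),
  SprintSpeed  = speed + rnorm(n, sd = 0.3),
  Dribbling    = tech  + rnorm(n, sd = 0.3),
  BallControl  = tech  + rnorm(n, sd = 0.3)
)

# Center and scale so every attribute contributes on the same footing
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Loadings of the first two components show which attributes move together
round(pca$rotation[, 1:2], 2)
```

On the real data, the same call could be run on the attribute columns of one positional group at a time, which would connect the PCA idea to the per-position analysis proposed above.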
Overall, our group learned a lot more about soccer and soccer players through our analysis. Historically, video games have been based on real life, however, we thought it was really interesting trying to approach the relationship in the opposite manner. We hope our analysis can be continued and built upon in the future to reveal more hidden secrets about the soccer world.