FIFA is a soccer video game series produced by EA that has gained popularity around the world, with over 31 million players on FIFA 21. A new FIFA game comes out every year, in which EA rates all the players in the game and gives them stats that are supposed to reflect their real-life abilities. Each player is rated on various attributes, which are then used to compute an overall score. These attributes include Pace, Dribbling, and Shooting. Our group was curious: if we take these values as accurate, what can we learn about the real world?
We believe that answering this question could lead to significant insights because these attributes are as close a representation as we can get to measuring players’ abilities in the real world. FIFA, as mentioned above, is produced by EA, a multi-billion dollar corporation that has not only the resources but also the technical know-how and experience to produce accurate measurements.
We found a dataset online that contains all these attributes for the most recent FIFA game, FIFA 22, along with real-world values such as player wage, value, club, and jersey number. We hope to address the proposed problem statement by understanding the relationship between various in-game attributes and real-world values. For example, how does a player’s overall rating relate to their national jersey number? We have always been allured by the famous number 10, but is that actually the jersey the best players wear? Which attribute has led players to the highest payday?
Our current proposed analytic technique is to run a linear regression between these in-game and real-world values, using the p-values of the coefficients to assess statistical significance. Where a relationship is significant, we can compare the regression coefficients and correlation coefficients to understand the strength of the relationship and how well we could predict or describe real-world values using this game.
The main audiences for our analysis are real soccer players and soccer fanatics. Real soccer players will benefit the most. If I were a soccer player, I would want to learn which of my attributes I should improve the most, not only to become a better player but also to raise my salary. We hope that our analysis could help players direct their workout regimen by showing which attributes are truly considered valuable in soccer and by clubs. Soccer fanatics will also benefit, as our analysis will help them understand which characteristics to look for when evaluating potential trade prospects for their team.
library(tidyverse)
library(dplyr)
At this point in the project, the only packages we have used are tidyverse and dplyr (dplyr is itself part of the tidyverse, so loading it separately is redundant but harmless). They were used together mainly to trim a string in our data, as shown later on. The pipe operator (%>%), which dplyr re-exports from magrittr, was helpful for chaining functions, and the str_sub function from stringr (loaded with the tidyverse) did the actual trimming. We also used these packages elsewhere, for example when renaming variables to something more user-friendly.
We gathered our data from Kaggle, specifically from this link: https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset
The data contained two tables, one with players and one with information on their teams. We joined them beforehand in Excel and import the result into R as a single file. The original players table had 90 variables, including all the in-game attributes for all the players and general real-world information about them such as wage and value. The teams table originally had 14 columns, of which 12 were included (all except club name, which would be redundant, and TeamID). There isn’t explicit information on when this data was collected; however, we know it was fairly recent, since FIFA 22 came out on October 1st, 2021. It is also worth noting that these statistics are suggested mostly by a group of 6,000 volunteers led by the Head of Data Collection & Licensing.
Every real-world professional soccer player included in the game is, as discussed above, described by a set of statistics and characteristics that determine how good they are in-game. The main indicator of a player’s quality is their overall score, which is derived from all their statistics. There are 34 statistics with values in the 0-99 interval, of which 5 describe goalkeeping abilities and 29 describe the abilities of an outfield player. All players are described by all 34 stats, but goalkeeping abilities have no impact on an outfield player’s overall and vice versa.
Since the players are presented as cards, it would be difficult to show all the statistics on the face of the card. Thus, there are so-called “Face stats”: the card shows only six statistics to make it more readable. These statistics are linear combinations of the ones already mentioned. For example, instead of displaying both Acceleration and Sprint Speed on the card, the game uses a Pace statistic, which is a weighted sum of those two. These variables can be found in our data with a Total suffix (for example, PaceTotal).
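To make the idea of a weighted sum concrete, here is a minimal sketch using the Acceleration, SprintSpeed and PaceTotal values of the three top-rated players in the dataset. The weights 0.45 and 0.55 are an assumption made for illustration; they are not documented in the dataset. With these assumed weights the rounded sums reproduce the three PaceTotal values, although three rows are far too few to confirm the formula.

```r
# Values copied from the first three rows of the dataset.
face <- data.frame(
  Acceleration = c(91, 77, 97),
  SprintSpeed  = c(80, 79, 97),
  PaceTotal    = c(85, 78, 97)
)

# ASSUMPTION: illustrative weights, not confirmed by the dataset or by EA.
weights <- c(Acceleration = 0.45, SprintSpeed = 0.55)

# Weighted sum of the two raw stats, rounded to a whole rating.
weighted_pace <- drop(as.matrix(face[, names(weights)]) %*% weights)
face$PaceCheck <- round(weighted_pace)

# Compare the reconstructed value with the Face stat stored in the data.
print(face)
```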
There is also a set of four categorical variables in our data describing players:
Skills - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pull off skill moves, like the famous roulette
Weak foot - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pass, shoot, etc. correctly with their non-preferred foot
Attacking work rate - Can be High, Medium or Low; determines how much the player runs in attacking zones of the pitch
Defensive work rate - Can be High, Medium or Low; determines how much the player runs in defensive zones of the pitch
After careful evaluation, we believe most variable names are self-explanatory except for FKAccuracy, which stands for Free Kick Accuracy: how accurately one can shoot from a set piece. Also, position refers to a player’s preferred positions on the pitch, and the game recognizes 28 different ones.
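Because the four categorical variables described above all have a natural order, one option is to store them as ordered factors before modeling. The following is a minimal sketch on a small made-up data frame (the rows are illustrative, not taken from the dataset); only the column names match our data.

```r
# Toy data frame; values are illustrative only.
players <- data.frame(
  AttackingWorkRate = c("Medium", "High", "Low"),
  DefensiveWorkRate = c("Low", "Medium", "High"),
  WeakFoot          = c(4, 4, 2),
  SkillMoves        = c(4, 5, 1),
  stringsAsFactors  = FALSE
)

rate_levels <- c("Low", "Medium", "High")  # lowest to highest
star_levels <- 1:5                         # 1 star to 5 stars

# Ordered factors keep the ranking information that plain strings lose.
players$AttackingWorkRate <- factor(players$AttackingWorkRate, levels = rate_levels, ordered = TRUE)
players$DefensiveWorkRate <- factor(players$DefensiveWorkRate, levels = rate_levels, ordered = TRUE)
players$WeakFoot          <- factor(players$WeakFoot, levels = star_levels, ordered = TRUE)
players$SkillMoves        <- factor(players$SkillMoves, levels = star_levels, ordered = TRUE)

# Ordered factors support comparisons such as "above Low".
above_low <- players$AttackingWorkRate > "Low"
print(above_low)
```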
We can get a better understanding of our data by taking a look at it in R, so let’s first load in the data:
fifa <- read.csv("players_fifa22.csv")
Now let’s take a look at the first few rows:
head(fifa, 3)
## ID Name FullName Age Height Weight
## 1 158023 L. Messi Lionel Messi 34 170 72
## 2 188545 R. Lewandowski Robert Lewandowski 32 185 81
## 3 231747 K. Mbappé Kylian Mbappé 22 182 73
## PhotoUrl Nationality Overall
## 1 https://cdn.sofifa.com/players/158/023/22_60.png Argentina 93
## 2 https://cdn.sofifa.com/players/188/545/22_60.png Poland 92
## 3 https://cdn.sofifa.com/players/231/747/22_60.png France 91
## Potential Growth TotalStats BaseStats Positions BestPosition
## 1 93 0 2219 462 RW,ST,CF RW
## 2 92 0 2212 460 ST ST
## 3 95 4 2175 470 ST,LW ST
## Club ValueEUR WageEUR ReleaseClause ClubPosition
## 1 Paris Saint-Germain 78000000 320000 144300000 RW
## 2 FC Bayern München 119500000 270000 197200000 ST
## 3 Paris Saint-Germain 194000000 230000 373500000 ST
## ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 1 2023 30 2021 FALSE Argentina RW
## 2 2023 9 2014 FALSE Poland ST
## 3 2022 7 2018 FALSE France LW
## NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1 10 Left 5 4 4
## 2 9 Right 5 4 4
## 3 10 Right 4 4 5
## AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1 Medium Low 85 92 91
## 2 High Medium 78 92 79
## 3 High Low 97 88 80
## DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1 95 34 65 85 95
## 2 85 44 82 71 95
## 3 92 36 77 78 93
## HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1 70 91 88 96 93 94 91
## 2 90 85 89 85 79 85 70
## 3 72 85 83 93 80 69 71
## BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1 96 91 80 91 94 95 86
## 2 88 77 79 77 93 82 90
## 3 91 97 97 92 93 83 86
## Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1 68 72 69 94 44 40 93
## 2 85 76 86 87 81 49 95
## 3 78 88 77 82 62 38 92
## Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1 95 75 96 20 35 24 6
## 2 81 90 88 35 42 19 15
## 3 82 79 88 26 34 32 13
## GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1 11 15 14 8 92 92 93
## 2 6 12 8 10 92 85 88
## 3 5 7 11 6 91 90 90
## CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1 93 93 92 93 93 90 93 69
## 2 88 88 85 89 87 83 87 67
## 3 90 90 90 92 92 84 92 70
## CDMRating RWBRating LBRating CBRating RBRating GKRating
## 1 67 69 64 53 64 22
## 2 69 67 64 63 64 22
## 3 66 70 66 57 66 21
## League LeagueId Overall.1 Attack Midfield Defence
## 1 French Ligue 1 (1) 16 86 89 83 85
## 2 German 1. Bundesliga (1) 19 84 92 85 81
## 3 French Ligue 1 (1) 16 86 89 83 85
## TransferBudget DomesticPrestige IntPrestige Players StartingAverageAge
## 1 160000000 10 9 33 28
## 2 100000000 10 10 28 26.6
## 3 160000000 10 9 33 28
## AllTeamAverageAge
## 1 25.9
## 2 24.8
## 3 25.9
As the table above shows, some names containing accented characters may not load correctly. We can fix this by reloading the data with an explicit encoding. To check that this worked, we look at the third row:
fifa <- read.csv("players_fifa22.csv", encoding = "UTF-8", stringsAsFactors = FALSE)
fifa[3,]
## ID Name FullName Age Height Weight
## 3 231747 K. Mbappé Kylian Mbappé 22 182 73
## PhotoUrl Nationality Overall
## 3 https://cdn.sofifa.com/players/231/747/22_60.png France 91
## Potential Growth TotalStats BaseStats Positions BestPosition
## 3 95 4 2175 470 ST,LW ST
## Club ValueEUR WageEUR ReleaseClause ClubPosition
## 3 Paris Saint-Germain 194000000 230000 373500000 ST
## ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 3 2022 7 2018 FALSE France LW
## NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 3 10 Right 4 4 5
## AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 3 High Low 97 88 80
## DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 3 92 36 77 78 93
## HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 3 72 85 83 93 80 69 71
## BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 3 91 97 97 92 93 83 86
## Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 3 78 88 77 82 62 38 92
## Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 3 82 79 88 26 34 32 13
## GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 3 5 7 11 6 91 90 90
## CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 3 90 90 90 92 92 84 92 70
## CDMRating RWBRating LBRating CBRating RBRating GKRating League
## 3 66 70 66 57 66 21 French Ligue 1 (1)
## LeagueId Overall.1 Attack Midfield Defence TransferBudget DomesticPrestige
## 3 16 86 89 83 85 160000000 10
## IntPrestige Players StartingAverageAge AllTeamAverageAge
## 3 9 33 28 25.9
As the output above shows, the names now load correctly. Next, we can look at the number of missing values in each column:
sapply(fifa, function(x) sum(is.na(x)))
## ID Name FullName Age
## 0 0 0 0
## Height Weight PhotoUrl Nationality
## 0 0 0 0
## Overall Potential Growth TotalStats
## 0 0 0 0
## BaseStats Positions BestPosition Club
## 0 0 0 0
## ValueEUR WageEUR ReleaseClause ClubPosition
## 0 0 0 0
## ContractUntil ClubNumber ClubJoined OnLoad
## 70 70 0 0
## NationalTeam NationalPosition NationalNumber PreferredFoot
## 0 0 18491 0
## IntReputation WeakFoot SkillMoves AttackingWorkRate
## 0 0 0 0
## DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 0 0 0 0
## DribblingTotal DefendingTotal PhysicalityTotal Crossing
## 0 0 0 0
## Finishing HeadingAccuracy ShortPassing Volleys
## 0 0 0 0
## Dribbling Curve FKAccuracy LongPassing
## 0 0 0 0
## BallControl Acceleration SprintSpeed Agility
## 0 0 0 0
## Reactions Balance ShotPower Jumping
## 0 0 0 0
## Stamina Strength LongShots Aggression
## 0 0 0 0
## Interceptions Positioning Vision Penalties
## 0 0 0 0
## Composure Marking StandingTackle SlidingTackle
## 0 0 0 0
## GKDiving GKHandling GKKicking GKPositioning
## 0 0 0 0
## GKReflexes STRating LWRating LFRating
## 0 0 0 0
## CFRating RFRating RWRating CAMRating
## 0 0 0 0
## LMRating CMRating RMRating LWBRating
## 0 0 0 0
## CDMRating RWBRating LBRating CBRating
## 0 0 0 0
## RBRating GKRating League LeagueId
## 0 0 0 0
## Overall.1 Attack Midfield Defence
## 0 0 0 0
## TransferBudget DomesticPrestige IntPrestige Players
## 0 0 0 0
## StartingAverageAge AllTeamAverageAge
## 0 0
We can see from the output above that our data is already in good shape, with few missing values. We do have some, but we believe no action is needed for them right now. The columns with missing values are ContractUntil, ClubNumber and NationalNumber. There are valid reasons why these data points could be blank: some players’ contract information or club number may not have been released yet. Additionally, most professional soccer players never get the opportunity to play for their national team, which explains the large number of NA values (18,491) in the NationalNumber column.
In terms of formatting, we were able to get a decent idea of what the data looks like from the head() function we ran above. One thing that looks odd is the League column: every entry has a (1) after it. We can remove this and clean the data with the str_sub function from the stringr package (loaded as part of the tidyverse). We will create a column called TeamLeague with the trimmed league names.
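With roughly a hundred columns, the full listing of NA counts is long. A more compact view keeps only the columns that actually contain missing values and reports their share of rows. This is a minimal sketch on a made-up data frame (the NA pattern is illustrative only); on our data the same calls would be applied to fifa.

```r
# Toy data frame; the NA pattern is illustrative only.
toy <- data.frame(
  Name           = c("A", "B", "C", "D"),
  ClubNumber     = c(30, NA, 7, 9),
  NationalNumber = c(10, NA, NA, NA)
)

# Count missing values per column, then keep only columns that have any.
na_counts   <- colSums(is.na(toy))
na_nonzero  <- na_counts[na_counts > 0]

# Share of rows that are missing in each affected column.
na_share <- round(na_nonzero / nrow(toy), 2)

print(na_nonzero)
print(na_share)
```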
fifa <- fifa %>%
mutate(TeamLeague= str_sub(League, 1, -5))
Now that our league name has been corrected, by creating a new column in R with the correct values, we can remove the previous League column. We have done this in the code below:
fifa <- select(fifa, -(League))
Lastly, our team preferred the name League over TeamLeague, so we have renamed the new column to the name of the previous one.
fifa <- rename(fifa, League = TeamLeague)
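Trimming a fixed number of characters with str_sub works as long as every suffix has exactly the same length. If any league carried a longer or missing suffix, a pattern-based removal would be more robust. The sketch below is an alternative we did not use in the analysis; the first two league names come from our data, while the last three are hypothetical and only there to show the edge cases.

```r
library(stringr)

leagues <- c(
  "French Ligue 1 (1)",
  "German 1. Bundesliga (1)",
  "Hypothetical Second Division (2)",  # hypothetical entry
  "Hypothetical League (10)",          # hypothetical entry with a longer suffix
  "Hypothetical League Without Suffix" # hypothetical entry with no suffix
)

# Remove a trailing " (<digits>)" only when it is present at the end.
cleaned <- str_remove(leagues, " \\(\\d+\\)$")
print(cleaned)

# For comparison: fixed-width trimming always drops the last four characters,
# which only gives the intended result when the suffix is exactly " (d)".
fixed_width <- str_sub(leagues, 1, -5)
```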
The only other issue in our data is the name of the team overall column, which is currently labeled Overall.1. This name isn’t friendly to a new user of our data, so we change it to TeamOverall, a variable that indicates the overall rating given to a whole team.
fifa <- rename(fifa, TeamOverall = Overall.1)
While we are renaming columns, we can also rename a few to make their units of measure clear to our readers.
fifa <- rename(fifa, 'Weight (KG)' = Weight)
fifa <- rename(fifa, 'Height (CM)' = Height)
Next, there are some unnecessary columns in our data that won’t further our analysis, so we can remove them to reduce the computational expense of any model. Specifically, we will remove the PhotoUrl and OnLoad variables.
# Drop by name rather than by position so the step does not depend on column order
fifa <- select(fifa, -PhotoUrl, -OnLoad)
Lastly, we noticed some of the positional titles were a little too specific for our analysis. We hope to analyze general positional groups, so excessive detail regarding positions is unnecessary for us, especially since some of these titles are not commonly used in typical soccer jargon.
#All the unique positions
unique(fifa$BestPosition)
## [1] "RW" "ST" "GK" "CM" "LW" "CDM" "CF" "LM" "CB" "CAM" "RB" "LB"
## [13] "RM" "LWB" "RWB"
#Edit into more appropriate positional titles
fifa$BestPosition_new <- ifelse(fifa$BestPosition == "GK" , "GK",
ifelse(fifa$BestPosition %in% c("LCB", "CB", "RCB"), "CB",
ifelse(fifa$BestPosition %in% c("LB", "LWB", "RWB", "RB"), "LB/RB",
ifelse(fifa$BestPosition %in% c("LDM", "CDM", "RDM"), "DM",
ifelse(fifa$BestPosition %in% c("LCM", "CM", "RCM"), "CM",
ifelse(fifa$BestPosition %in% c("LM", "RM"), "LM/RM",
ifelse(fifa$BestPosition %in% c("LAM", "CAM", "RAM"), "AM",
ifelse(fifa$BestPosition %in% c("LW", "RW"), "LW/RW",
ifelse(fifa$BestPosition %in% c("LF", "CF", "RF"),"CF",
"ST")))))))))
fifa$BestPosition_new <- factor(fifa$BestPosition_new, levels = c("GK", "CB", "LB/RB", "DM", "CM",
"LM/RM", "AM", "LW/RW", "CF", "ST"))
#check if our change worked
levels(fifa$BestPosition_new)
## [1] "GK" "CB" "LB/RB" "DM" "CM" "LM/RM" "AM" "LW/RW" "CF"
## [10] "ST"
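The nested ifelse calls above work, but they are hard to read and easy to mis-parenthesize. As a sketch of an equivalent and more readable formulation (not the code we ran), the same mapping can be written with dplyr's case_when. The vector below holds the 15 unique BestPosition values listed earlier.

```r
library(dplyr)

# The unique BestPosition values found in our data.
best_position <- c("RW", "ST", "GK", "CM", "LW", "CDM", "CF", "LM",
                   "CB", "CAM", "RB", "LB", "RM", "LWB", "RWB")

# Conditions are checked top to bottom; the first match wins.
position_group <- case_when(
  best_position == "GK"                          ~ "GK",
  best_position %in% c("LCB", "CB", "RCB")       ~ "CB",
  best_position %in% c("LB", "LWB", "RWB", "RB") ~ "LB/RB",
  best_position %in% c("LDM", "CDM", "RDM")      ~ "DM",
  best_position %in% c("LCM", "CM", "RCM")       ~ "CM",
  best_position %in% c("LM", "RM")               ~ "LM/RM",
  best_position %in% c("LAM", "CAM", "RAM")      ~ "AM",
  best_position %in% c("LW", "RW")               ~ "LW/RW",
  best_position %in% c("LF", "CF", "RF")         ~ "CF",
  TRUE                                           ~ "ST"  # everything else
)

position_group <- factor(position_group,
                         levels = c("GK", "CB", "LB/RB", "DM", "CM",
                                    "LM/RM", "AM", "LW/RW", "CF", "ST"))
print(table(position_group))
```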
At this point, we should have addressed all issues with our data and should have put ourselves in a good position for further analysis. We can check if all our changes have gone through and if our data looks good now by running the head function again. We will only show the first row this time as it will be representative of our data set while ensuring too much space isn’t taken up by our data.
head(fifa, 1)
## ID Name FullName Age Height (CM) Weight (KG) Nationality Overall
## 1 158023 L. Messi Lionel Messi 34 170 72 Argentina 93
## Potential Growth TotalStats BaseStats Positions BestPosition
## 1 93 0 2219 462 RW,ST,CF RW
## Club ValueEUR WageEUR ReleaseClause ClubPosition ContractUntil
## 1 Paris Saint-Germain 78000000 320000 144300000 RW 2023
## ClubNumber ClubJoined NationalTeam NationalPosition NationalNumber
## 1 30 2021 Argentina RW 10
## PreferredFoot IntReputation WeakFoot SkillMoves AttackingWorkRate
## 1 Left 5 4 4 Medium
## DefensiveWorkRate PaceTotal ShootingTotal PassingTotal DribblingTotal
## 1 Low 85 92 91 95
## DefendingTotal PhysicalityTotal Crossing Finishing HeadingAccuracy
## 1 34 65 85 95 70
## ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing BallControl
## 1 91 88 96 93 94 91 96
## Acceleration SprintSpeed Agility Reactions Balance ShotPower Jumping Stamina
## 1 91 80 91 94 95 86 68 72
## Strength LongShots Aggression Interceptions Positioning Vision Penalties
## 1 69 94 44 40 93 95 75
## Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking
## 1 96 20 35 24 6 11 15
## GKPositioning GKReflexes STRating LWRating LFRating CFRating RFRating
## 1 14 8 92 92 93 93 93
## RWRating CAMRating LMRating CMRating RMRating LWBRating CDMRating RWBRating
## 1 92 93 93 90 93 69 67 69
## LBRating CBRating RBRating GKRating LeagueId TeamOverall Attack Midfield
## 1 64 53 64 22 16 86 89 83
## Defence TransferBudget DomesticPrestige IntPrestige Players
## 1 85 160000000 10 9 33
## StartingAverageAge AllTeamAverageAge League BestPosition_new
## 1 28 25.9 French Ligue 1 LW/RW
As we can see from the result above, our data looks good.
Lastly, we can take a look at the summary for a few of the important variables in our data.
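fifa.summary <- fifa %>%
  select(Overall, WageEUR, ValueEUR, PaceTotal, DribblingTotal, ShootingTotal, WeakFoot, SkillMoves) %>%
  # across() replaces the deprecated summarise_each()/funs();
  # result columns are named <variable>_<statistic>
  summarise(across(everything(),
                   list(min = min,
                        q25 = ~ quantile(.x, 0.25),
                        median = median,
                        q75 = ~ quantile(.x, 0.75),
                        max = max,
                        mean = mean)))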
#We can reshape this data to make it look better
fifa.summary_reformat <- fifa.summary %>% gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, q25, median, q75, max, mean)
print(fifa.summary_reformat)
## var min q25 median q75 max mean
## 1 DribblingTotal 26 58 64 69 9.50e+01 6.296784e+01
## 2 Overall 47 61 66 70 9.30e+01 6.576922e+01
## 3 PaceTotal 28 62 68 75 9.70e+01 6.787994e+01
## 4 ShootingTotal 18 44 56 64 9.40e+01 5.346088e+01
## 5 SkillMoves 1 2 2 3 5.00e+00 2.352816e+00
## 6 ValueEUR 0 475000 975000 2000000 1.94e+08 2.854488e+06
## 7 WageEUR 0 1000 3000 8000 3.50e+05 8.858094e+03
## 8 WeakFoot 1 3 3 3 5.00e+00 2.945501e+00
The result above is a consolidated table of summary statistics for some of the variables we consider valuable. Ratings for players’ weak foot and skill moves range from 1 to 5, and the median skill moves rating is 2, below the middle rating of 3. Wages and player values range widely, with some values as low as 0. Although these 0s may not be completely accurate, it is possible that the information isn’t public or that the true value is just very low. It is important to remember that some very small teams are included in this game, so some players may have a wage and value that are low relative to those of the best players in the world; in such cases a 0 appears to be used as a placeholder. We have not removed these rows or values because we believe doing so would remove a lot of important data, and imputing the median would distort our picture of the data. This is simply a fact we will have to keep in mind as we approach our analysis. We can also note that the players in the game range widely in overall ability, from 47 to 93.
We hope to uncover new information in the data that is not self-evident by slicing and dicing the data in various ways and by creating various variables to help better understand the data. A great example of this is that we plan to create a continent column based off the player’s respective countries to better understand what players from different continents are like in different categories. We could also create groupings of positions such as defenders, midfielders, and attackers. We hope that these different variables and views will better help us understand some of the hidden trends in our data. We hope to summarize these data slices and trends at the end with various visualizations that compare the different categories/variables we created.
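A continent column could be derived by joining the players to a lookup table that maps each nationality to a continent. The sketch below is one possible approach: the three Nationality and Overall values come from the first rows of our data, while continent_lookup is a hypothetical table that we would have to build (or source) ourselves for every country in the dataset.

```r
library(dplyr)

# First three rows of our data (Nationality and Overall only).
players <- data.frame(
  Nationality = c("Argentina", "Poland", "France"),
  Overall     = c(93, 92, 91)
)

# HYPOTHETICAL lookup table; a full version would cover every nationality.
continent_lookup <- data.frame(
  Nationality = c("Argentina", "Poland", "France"),
  Continent   = c("South America", "Europe", "Europe")
)

# left_join keeps every player; unmatched nationalities would get NA.
players <- left_join(players, continent_lookup, by = "Nationality")

# Summarise a statistic per continent.
by_continent <- players %>%
  group_by(Continent) %>%
  summarise(mean_overall = mean(Overall), players = n())

print(by_continent)
```

The same pattern (a lookup table plus a join, or a case_when mapping) would work for broader positional groups such as defenders, midfielders, and attackers.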
We hope to create many visualizations to help us illustrate the findings of our questions. Here are a few of our ideas:
Histogram to show distribution of players’ statistics, both face stats and “raw” stats. It may give some valuable insight into how certain stats behave generally, maybe they come from a known distribution (like normal distribution).
Correlation plot between various statistics to showcase relationship between significant variables.
Bar plot to show various stats across various continents and different positional groups
Our group feels comfortable moving forward with all of these ideas.
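The three plot types listed above can be sketched as follows. The data here is simulated purely for illustration (the numbers and the three continent labels are assumptions, not values from our dataset); only the column names mirror ours. The correlation plot is represented by its underlying correlation matrix, which a plotting package could then display.

```r
library(ggplot2)

# Simulated toy data; NOT drawn from the FIFA dataset.
set.seed(1)
n <- 200
toy <- data.frame(PaceTotal = round(rnorm(n, mean = 68, sd = 10)))
toy$DribblingTotal <- round(0.5 * toy$PaceTotal + rnorm(n, mean = 30, sd = 6))
toy$ShootingTotal  <- round(0.4 * toy$DribblingTotal + rnorm(n, mean = 28, sd = 8))
toy$Continent      <- sample(c("Europe", "South America", "Africa"), n, replace = TRUE)

# 1. Histogram of a single statistic.
hist_plot <- ggplot(toy, aes(x = PaceTotal)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Distribution of PaceTotal (simulated)", x = "PaceTotal", y = "Players")

# 2. Correlation matrix between statistics (input to a correlation plot).
cor_mat <- cor(toy[, c("PaceTotal", "DribblingTotal", "ShootingTotal")])

# 3. Bar plot of a mean statistic per continent.
means <- aggregate(PaceTotal ~ Continent, data = toy, FUN = mean)
bar_plot <- ggplot(means, aes(x = Continent, y = PaceTotal)) +
  geom_col() +
  labs(title = "Mean PaceTotal by continent (simulated)", y = "Mean PaceTotal")

print(round(cor_mat, 2))
```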
In terms of machine learning techniques, our group plans on using a linear regression to better understand the relationship between independent and dependent variables.
Of course, prior to using a linear regression, we need to check whether it is an appropriate technique for our data. We would first check for outliers, since those could heavily impact our analysis. We would also need to check that the relationship between the variables is linear, an assumption we could check with a scatterplot. Additionally, linear regression assumes that the residuals are approximately normally distributed. This assumption can be checked with a histogram, a Q-Q plot, or a Shapiro-Wilk test. Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. It is commonly checked with a correlation matrix, tolerance values, or the variance inflation factor (VIF).
The last assumption of linear regression is homoscedasticity, meaning the variance of the residuals is constant along the regression line. A scatter plot of residuals against fitted values is a good way to check this: we want the residuals to show constant spread and no pattern (such as a trumpet shape). If heteroscedasticity is present, a non-linear transformation of the variables might fix the problem.
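The assumption checks described above can be sketched in base R as follows. The data is simulated for illustration only (the variable LogWage and all coefficients are assumptions), and vif_by_hand is a small helper written for this sketch rather than a function from any package.

```r
# Simulated toy data; NOT drawn from the FIFA dataset.
set.seed(42)
n <- 300
toy <- data.frame(Pace = rnorm(n, 68, 10), Dribbling = rnorm(n, 63, 9))
toy$Shooting <- 0.6 * toy$Dribbling + rnorm(n, 15, 6)  # deliberately correlated
toy$LogWage  <- 2 + 0.03 * toy$Pace + 0.05 * toy$Dribbling +
  0.02 * toy$Shooting + rnorm(n, 0, 0.5)

fit <- lm(LogWage ~ Pace + Dribbling + Shooting, data = toy)

# Normality of residuals: Shapiro-Wilk test and Q-Q plot.
# Note: shapiro.test() accepts at most 5000 observations, so the full
# dataset would need to be sampled first.
normality <- shapiro.test(residuals(fit))
qqnorm(residuals(fit)); qqline(residuals(fit))

# Multicollinearity: VIF = 1 / (1 - R^2), where R^2 comes from regressing
# each predictor on the remaining predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_by_hand(toy, c("Pace", "Dribbling", "Shooting"))
print(round(vifs, 2))

# Homoscedasticity: residuals against fitted values should show
# constant spread around zero.
plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
```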
We hope to build a model for each position or position group to better understand the trends in our data.
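Fitting one model per position group can be sketched by splitting the data and applying lm to each piece. The data below is simulated for illustration (group labels, predictors, and coefficients are assumptions); on our data the split would use the BestPosition_new column.

```r
# Simulated toy data; NOT drawn from the FIFA dataset.
set.seed(7)
toy <- data.frame(
  Group    = rep(c("CB", "CM", "ST"), each = 50),
  Pace     = rnorm(150, 68, 10),
  Shooting = rnorm(150, 55, 12)
)
toy$Overall <- 30 + 0.2 * toy$Pace + 0.3 * toy$Shooting + rnorm(150, 0, 3)

# One linear model per position group.
fits <- lapply(split(toy, toy$Group), function(d) {
  lm(Overall ~ Pace + Shooting, data = d)
})

# Collect coefficients and their p-values into one row per group.
coef_table <- t(sapply(fits, coef))
p_values   <- t(sapply(fits, function(f) summary(f)$coefficients[, "Pr(>|t|)"]))

print(round(coef_table, 3))
print(signif(p_values, 3))
```

Comparing rows of coef_table would then show whether an attribute matters more for one position group than for another.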
The other machine learning technique we may use is Principal Component Analysis (PCA). The main goal of PCA is to reduce the number of attributes so that they can be interpreted more easily as groups. We would also like to see whether reducing the number of dimensions to 6 (the number of Face stats) would result in a grouping similar to that proposed by EA. First, we would take all the attributes and scale them, as PCA is easily influenced by magnitude.
Apart from the reduction to 6 dimensions, it would be good to know the optimal number of dimensions according to the eigenvalues. One can rule out all components with eigenvalues below 1 (Kaiser criterion), keep as many components as are needed to explain a chosen share of the variance, or use the MAP test or parallel analysis.
When it comes to Face stats, we have only 6 variables to begin with, but we would nonetheless like to reduce their number even further.
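A minimal sketch of scaled PCA and two of the selection rules mentioned above (the Kaiser criterion and a variance threshold) is shown below. The data is simulated from two made-up latent factors purely for illustration; the 80% threshold is likewise an arbitrary assumption.

```r
# Simulated toy data driven by two latent factors; NOT the FIFA dataset.
set.seed(123)
n <- 200
latent_speed <- rnorm(n)
latent_tech  <- rnorm(n)
attrs <- data.frame(
  Acceleration = 70 + 8 * latent_speed + rnorm(n, 0, 3),
  SprintSpeed  = 70 + 8 * latent_speed + rnorm(n, 0, 3),
  Dribbling    = 65 + 7 * latent_tech  + rnorm(n, 0, 3),
  BallControl  = 65 + 7 * latent_tech  + rnorm(n, 0, 3),
  ShortPassing = 65 + 6 * latent_tech  + rnorm(n, 0, 3)
)

# Center and scale so that no attribute dominates because of its magnitude.
pca <- prcomp(attrs, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared standard deviations.
eigenvalues <- pca$sdev^2
prop_var    <- eigenvalues / sum(eigenvalues)

# Kaiser criterion: keep components with eigenvalue above 1.
n_kaiser <- sum(eigenvalues > 1)

# Variance threshold: smallest number of components explaining at least 80%.
n_80 <- which(cumsum(prop_var) >= 0.80)[1]

print(round(eigenvalues, 3))
print(c(kaiser = n_kaiser, var80 = n_80))
```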
Overall, the only machine learning techniques we plan to use are linear regression and/or PCA, to better understand the statistically significant trends in our data.