Introduction

FIFA is a soccer video game series produced by EA that has gained popularity around the world, with over 31 million players on FIFA 21. A new FIFA game comes out every year, in which EA rates every player and gives them stats that are meant to reflect their real-life ability. Each player is rated on various attributes, which are then used to compute an overall score. These attributes include Pace, Dribbling, and Shooting. Our group was curious: if we take these values as accurate, what can we learn about the real world?

We believe that answering this question could lead to significant insights, because these attributes are as close a representation as we can get to measuring players’ real-world abilities. FIFA, as mentioned, is produced by EA, a multi-billion-dollar corporation that has both the resources and the technical know-how to create accurate values for these measurements.

We found a dataset online that contains all these attributes for the most recent FIFA game, FIFA 22, along with real-world values such as player wage, value, club, and jersey number. We hope to address our problem statement by understanding the relationship between various in-game attributes and real-world values. For example, how does a player’s overall rating relate to their national jersey number? We have always been drawn to the famous number 10, but is that actually the jersey the best players wear? Which attribute has led to the highest payday?

Our proposed analytic technique is to run linear regressions between these in-game and real-world values, using the p-values of the coefficients to assess statistical significance. Where a relationship is significant, we can compare the regression coefficients and correlation coefficients to understand its strength and how well we could predict or describe real-world values using this game.
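
As a rough illustration of the technique described above, the sketch below fits a simple linear regression and pulls out the coefficient table, the p-value, and R². The data frame toy and its simulated values are assumptions made for the example; they are not taken from the FIFA data.

```r
# ASSUMPTION: simulated stand-in data, not the real FIFA table
set.seed(1)
toy <- data.frame(Overall = round(runif(200, 50, 90)))
toy$WageEUR <- 500 * toy$Overall + rnorm(200, sd = 4000)

# Fit a simple linear regression of wage on overall rating
fit <- lm(WageEUR ~ Overall, data = toy)

# Coefficient table: estimate, standard error, t statistic, p-value
coefs <- summary(fit)$coefficients
p_value <- coefs["Overall", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

print(coefs)
print(r_squared)
```

The same pattern would apply to the real table by swapping toy for the joined data set and choosing the response and predictors of interest.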

The main audience for our analysis is real soccer players as well as soccer fanatics. Real soccer players stand to benefit the most. A player would want to know which of their attributes to work on, not only to improve as a player but also to improve their salary. We hope that our analysis can help players direct their workout regimen by showing which attributes are truly considered valuable in soccer and by clubs. Soccer fanatics will also benefit, as our analysis will help them understand which characteristics to look for when evaluating potential transfer prospects for their team.

Packages

library(tidyverse)
library(dplyr)

At this point in the project, the only packages we have used are tidyverse and dplyr. Note that dplyr is part of the tidyverse, so loading tidyverse alone would be sufficient. We used them together mainly to trim a string in our data, a process shown later on. The pipe operator (%>%), which dplyr makes available, was helpful for chaining functions, and the str_sub function from stringr (loaded with the tidyverse) did the actual trimming. We also used dplyr in several other places, such as renaming variables to something more user-friendly.

Data Preparation

We gathered our data from Kaggle, at this link: https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset

The data contains two tables: one with players and one with information on their teams. We will join these together in R to obtain a more complete data set. The original players table had 90 variables, including all the attributes for all the players and general real-world information about them such as wage and value. The teams table originally had 14 columns. There isn’t explicit information on when this data was collected; however, we do know it was fairly recent, since FIFA 22 came out on October 1st, 2021. It is also worth noting that the statistics are mostly suggested by a group of about 6,000 volunteers led by the Head of Data Collection & Licensing.

All real-world professional soccer players included in the game are, as discussed above, described by a set of statistics and characteristics that determine how good they are in-game. The main indicator of a player’s quality is their overall score, which summarizes all their statistics. There are 34 statistics with values in the 0-99 interval, of which 5 describe goalkeeping abilities and 29 describe the abilities of an outfield player. Every player has all 34 stats, but goalkeeping abilities have no impact on an outfield player’s overall, and vice versa.

Since the players are presented as cards, it would be difficult to show all the statistics on the face of the card. Thus, there are so-called “face stats”: the card shows only six statistics to make it more readable. These statistics are linear combinations of the ones already mentioned. For example, instead of displaying both Acceleration and Sprint Speed on the card, the game uses a pace statistic that is a weighted sum of the two. These variables can be found in our data with a Total suffix (for example, PaceTotal).
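
To make the weighted-sum idea concrete, the sketch below recomputes pace for the three players whose rows appear in the head() output later in this report. The weights 0.45 and 0.55 are an assumption chosen for illustration; the dataset itself does not state the weights EA uses.

```r
# Sub-attributes and PaceTotal for the three players shown in the head() output
face <- data.frame(
  Name = c("L. Messi", "R. Lewandowski", "K. Mbapp\u00e9"),
  Acceleration = c(91, 77, 97),
  SprintSpeed = c(80, 79, 97),
  PaceTotal = c(85, 78, 97),
  stringsAsFactors = FALSE
)

# ASSUMPTION: illustrative weights, not documented in the dataset
w_acc <- 0.45
w_sprint <- 0.55

# Weighted sum of the two sub-attributes, rounded to a whole rating
face$PaceEstimate <- round(w_acc * face$Acceleration + w_sprint * face$SprintSpeed)

# Compare the estimate with the face stat stored in the data
face$Matches <- face$PaceEstimate == face$PaceTotal
print(face)
```

For these three rows the rounded estimate agrees with PaceTotal, but three players are far too few to confirm the weights; a regression of PaceTotal on the two sub-attributes over the full table would be a better check.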

There is also a set of four categorical variables in our data describing players:

Skills - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pull off skill moves, like the famous roulette

Weak foot - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pass, shoot, etc. accurately with their non-preferred foot

Attacking work rate - Can be either High, Medium or Low; determines how much the player runs in attacking zones of the pitch

Defensive work rate - Can be either High, Medium or Low; determines how much the player runs in defensive zones of the pitch

We believe, after careful evaluation, that most variable names are self-explanatory, except for FKAccuracy, which stands for Free Kick Accuracy: how accurately a player can shoot from a set piece. Also, position refers to a player’s preferred positions on the pitch; the game recognizes 28 different ones.

We can get a better understanding of our data by taking a look at it in R, so let’s first load in the data:

players <- read.csv("players_fifa22.csv")
teams <- read.csv("teams_fifa22.csv")

Now let’s take a look at the first few rows of the players table:

head(players, 3)

##       ID           Name           FullName Age Height Weight
## 1 158023       L. Messi       Lionel Messi  34    170     72
## 2 188545 R. Lewandowski Robert Lewandowski  32    185     81
## 3 231747     K. MbappÃ©     Kylian MbappÃ©  22    182     73
##                                           PhotoUrl Nationality Overall
## 1 https://cdn.sofifa.com/players/158/023/22_60.png   Argentina      93
## 2 https://cdn.sofifa.com/players/188/545/22_60.png      Poland      92
## 3 https://cdn.sofifa.com/players/231/747/22_60.png      France      91
##   Potential Growth TotalStats BaseStats Positions BestPosition
## 1        93      0       2219       462  RW,ST,CF           RW
## 2        92      0       2212       460        ST           ST
## 3        95      4       2175       470     ST,LW           ST
##                  Club  ValueEUR WageEUR ReleaseClause ClubPosition
## 1 Paris Saint-Germain  78000000  320000     144300000           RW
## 2  FC Bayern MÃ¼nchen 119500000  270000     197200000           ST
## 3 Paris Saint-Germain 194000000  230000     373500000           ST
##   ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 1          2023         30       2021  False    Argentina               RW
## 2          2023          9       2014  False       Poland               ST
## 3          2022          7       2018  False       France               LW
##   NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1             10          Left             5        4          4
## 2              9         Right             5        4          4
## 3             10         Right             4        4          5
##   AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1            Medium               Low        85            92           91
## 2              High            Medium        78            92           79
## 3              High               Low        97            88           80
##   DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1             95             34               65       85        95
## 2             85             44               82       71        95
## 3             92             36               77       78        93
##   HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1              70           91      88        96    93         94          91
## 2              90           85      89        85    79         85          70
## 3              72           85      83        93    80         69          71
##   BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1          96           91          80      91        94      95        86
## 2          88           77          79      77        93      82        90
## 3          91           97          97      92        93      83        86
##   Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1      68      72       69        94         44            40          93
## 2      85      76       86        87         81            49          95
## 3      78      88       77        82         62            38          92
##   Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1     95        75        96      20             35            24        6
## 2     81        90        88      35             42            19       15
## 3     82        79        88      26             34            32       13
##   GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1         11        15            14          8       92       92       93
## 2          6        12             8         10       92       85       88
## 3          5         7            11          6       91       90       90
##   CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1       93       93       92        93       93       90       93        69
## 2       88       88       85        89       87       83       87        67
## 3       90       90       90        92       92       84       92        70
##   CDMRating RWBRating LBRating CBRating RBRating GKRating
## 1        67        69       64       53       64       22
## 2        69        67       64       63       64       22
## 3        66        70       66       57       66       21

As we can see from the table above, some of the names in our data have loaded incorrectly (for example, “K. MbappÃ©”). We can fix this by reloading the data and specifying the encoding. We can check that this worked by looking at the third row of the players table again:

players <- read.csv("players_fifa22.csv", encoding="UTF-8", stringsAsFactors=FALSE)

teams <- read.csv("teams_fifa22.csv", encoding="UTF-8", stringsAsFactors=FALSE)

players[3,]

##       ID      Name      FullName Age Height Weight
## 3 231747 K. Mbappé Kylian Mbappé  22    182     73
##                                           PhotoUrl Nationality Overall
## 3 https://cdn.sofifa.com/players/231/747/22_60.png      France      91
##   Potential Growth TotalStats BaseStats Positions BestPosition
## 3        95      4       2175       470     ST,LW           ST
##                  Club  ValueEUR WageEUR ReleaseClause ClubPosition
## 3 Paris Saint-Germain 194000000  230000     373500000           ST
##   ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 3          2022          7       2018  False       France               LW
##   NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 3             10         Right             4        4          5
##   AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 3              High               Low        97            88           80
##   DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 3             92             36               77       78        93
##   HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 3              72           85      83        93    80         69          71
##   BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 3          91           97          97      92        93      83        86
##   Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 3      78      88       77        82         62            38          92
##   Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 3     82        79        88      26             34            32       13
##   GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 3          5         7            11          6       91       90       90
##   CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 3       90       90       90        92       92       84       92        70
##   CDMRating RWBRating LBRating CBRating RBRating GKRating
## 3        66        70       66       57       66       21
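
The garbled name seen earlier is what one would expect when UTF-8 bytes are read as if each byte were a single Latin-1 character. The sketch below reproduces that effect on one string; it is only an illustration of the mechanism, and we have not verified that this is exactly what happened on the machine that produced the first output.

```r
# "Mbappé" written with an escape so the script itself is encoding-safe
name_utf8 <- enc2utf8("Mbapp\u00e9")

# UTF-8 stores the accented e as the two bytes c3 a9
name_bytes <- charToRaw(name_utf8)
print(name_bytes)

# Misreading each byte as a Latin-1 character reproduces the garbled name
garbled <- iconv(rawToChar(name_bytes), from = "latin1", to = "UTF-8")
print(garbled)
```

Telling read.csv that the file is UTF-8, as done above, avoids this misinterpretation.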

As we can see from the output above, specifying the encoding fixed the names. Now that our data looks good, we can join our tables into one more complete data set. We will join on the club name, since we are trying to retrieve club-related information. We use a left join so that every player is kept and matched to their club’s team-level information where it exists.

fifa <- players %>%
  left_join(teams, by=c("Club" = "Name"))

Let’s explore this data structure a little further:

dim(fifa)

## [1] 19276   103

We now know our data has 19276 rows and 103 columns. This means we have 19276 players in our data set, each described by 103 variables. As previously discussed, the players table had 90 columns and the teams table had 14. Since the column we joined on does not appear twice, we expect 90 + 14 - 1 = 103 columns, which matches.
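
The column arithmetic, and the .x and .y suffixes that appear in the column names of the joined table, can be seen on a small example. The tables toy_players and toy_teams below are hypothetical and exist only for this sketch; they mimic the real tables by sharing the column names ID and Overall.

```r
library(dplyr)

# ASSUMPTION: hypothetical toy tables, not rows from the real data
toy_players <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Player A", "Player B", "Player C"),
  Overall = c(90, 75, 60),
  Club = c("Club X", "Club Y", "Free agent"),
  stringsAsFactors = FALSE
)
toy_teams <- data.frame(
  ID = c(10, 20),
  Name = c("Club X", "Club Y"),
  Overall = c(85, 70),
  League = c("League 1 (1)", "League 2 (1)"),
  stringsAsFactors = FALSE
)

# Same join pattern as above: the player's Club matched to the team's Name
toy_fifa <- left_join(toy_players, toy_teams, by = c("Club" = "Name"))

# 4 player columns + 4 team columns - 1 join key = 7 columns
print(dim(toy_fifa))

# Columns present in both tables receive the suffixes .x (players) and .y (teams)
print(names(toy_fifa))

# Players whose club has no row in the teams table
unmatched <- anti_join(toy_players, toy_teams, by = c("Club" = "Name"))
print(unmatched)
```

An anti_join like the last step is one way to list unmatched clubs directly instead of inferring them from missing values.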

Now that our data looks good and we have explored the general structure, we can look at the missing values information:

sapply(fifa, function(x) sum(is.na(x)))

##               ID.x               Name           FullName                Age 
##                  0                  0                  0                  0 
##             Height             Weight           PhotoUrl        Nationality 
##                  0                  0                  0                  0 
##          Overall.x          Potential             Growth         TotalStats 
##                  0                  0                  0                  0 
##          BaseStats          Positions       BestPosition               Club 
##                  0                  0                  0                  0 
##           ValueEUR            WageEUR      ReleaseClause       ClubPosition 
##                  0                  0                  0                  0 
##      ContractUntil         ClubNumber         ClubJoined             OnLoad 
##                 70                 70                  0                  0 
##       NationalTeam   NationalPosition     NationalNumber      PreferredFoot 
##                  0                  0              18519                  0 
##      IntReputation           WeakFoot         SkillMoves  AttackingWorkRate 
##                  0                  0                  0                  0 
##  DefensiveWorkRate          PaceTotal      ShootingTotal       PassingTotal 
##                  0                  0                  0                  0 
##     DribblingTotal     DefendingTotal   PhysicalityTotal           Crossing 
##                  0                  0                  0                  0 
##          Finishing    HeadingAccuracy       ShortPassing            Volleys 
##                  0                  0                  0                  0 
##          Dribbling              Curve         FKAccuracy        LongPassing 
##                  0                  0                  0                  0 
##        BallControl       Acceleration        SprintSpeed            Agility 
##                  0                  0                  0                  0 
##          Reactions            Balance          ShotPower            Jumping 
##                  0                  0                  0                  0 
##            Stamina           Strength          LongShots         Aggression 
##                  0                  0                  0                  0 
##      Interceptions        Positioning             Vision          Penalties 
##                  0                  0                  0                  0 
##          Composure            Marking     StandingTackle      SlidingTackle 
##                  0                  0                  0                  0 
##           GKDiving         GKHandling          GKKicking      GKPositioning 
##                  0                  0                  0                  0 
##         GKReflexes           STRating           LWRating           LFRating 
##                  0                  0                  0                  0 
##           CFRating           RFRating           RWRating          CAMRating 
##                  0                  0                  0                  0 
##           LMRating           CMRating           RMRating          LWBRating 
##                  0                  0                  0                  0 
##          CDMRating          RWBRating           LBRating           CBRating 
##                  0                  0                  0                  0 
##           RBRating           GKRating               ID.y             League 
##                  0                  0                124                124 
##           LeagueId          Overall.y             Attack           Midfield 
##                124                124                124                124 
##            Defence     TransferBudget   DomesticPrestige        IntPrestige 
##                124                124                124                124 
##            Players StartingAverageAge  AllTeamAverageAge 
##                124                124                124

We can see from the output above that our data is largely complete. The player-level columns with missing values are ContractUntil and ClubNumber (70 each) and NationalNumber (18519). We believe there are valid reasons for these gaps: some players’ contract information or club number may not have been released yet, and most professional players never play for their national team, which explains why NationalNumber is missing for the large majority of players. For that reason we take no action on these values for now.
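
With over a hundred columns, it can be easier to list only the columns that actually contain missing values, together with their share of rows. The sketch below shows one way to do that on a small hypothetical data frame; the values in toy are made up for the example.

```r
# ASSUMPTION: tiny made-up table with a few missing values
toy <- data.frame(
  ClubNumber = c(30, 9, NA),
  NationalNumber = c(10, NA, NA),
  Overall = c(93, 92, 91)
)

# Count missing values per column (same result as the sapply() call above)
na_counts <- colSums(is.na(toy))

# Keep only the columns that have at least one missing value
na_present <- na_counts[na_counts > 0]
print(na_present)

# Express the counts as a percentage of rows
na_share <- round(100 * na_counts / nrow(toy), 1)
print(na_share)
```

Applied to the real table, the same calculation for NationalNumber would be 18519 out of 19276 rows, or roughly 96%.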

We can also see that some of the data we had joined in our table came out missing. Let’s try to evaluate this:

NoTeamInfo <- fifa[is.na(fifa$Overall.y),]

unique(NoTeamInfo$Club)

## [1] "Free agent"        "Burgos CF"         "SC East Bengal FC"

We can note an interesting pattern here: all of these players fall into one of two categories. They either play for a relatively small-market team (“Burgos CF” or “SC East Bengal FC”) or they are a “Free agent”. Although the two actual teams should in principle have team information, we do not think it would be helpful to impute it: based on our soccer knowledge these teams are below average, so values imputed from the rest of the data would likely be misleading. Our analysis is mostly player-centric rather than team-centric, and since we still have the player-level information for these teams, we believe it is important to keep these rows. Free agents are also an interesting group to explore further, and they rightly have no team information because they do not belong to a team, so we have taken no action on these entries either.

In terms of formatting, one thing that looks odd is the League column: every entry has a “(1)” after the league name. We can remove this and clean the data by using the str_sub function from the stringr package (loaded as part of the tidyverse) to drop the last four characters, that is, the space and the “(1)”. We will create a column called TeamLeague with the trimmed league names.

fifa <- fifa %>%
  mutate(TeamLeague= str_sub(League, 1, -5))
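
Dropping a fixed number of characters works as long as every entry really ends in a four-character suffix. As a hedged alternative, the sketch below removes a trailing parenthesised number with a regular expression, which leaves entries without such a suffix untouched. Apart from “French Ligue 1 (1)”, which corresponds to a value visible later in this report, the league strings are made up for illustration.

```r
# ASSUMPTION: the second and third strings are hypothetical examples
leagues <- c("French Ligue 1 (1)", "Some Other League (2)", "No Suffix League")

# Fixed-width trim: base R equivalent of str_sub(x, 1, -5)
trimmed_fixed <- substr(leagues, 1, nchar(leagues) - 4)
print(trimmed_fixed)

# Pattern-based trim: remove " (digits)" only when it ends the string
trimmed_regex <- sub(" \\([0-9]+\\)$", "", leagues)
print(trimmed_regex)
```

The fixed-width version cuts four characters off the third entry even though it has no suffix, while the pattern-based version leaves it as it is.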

Now that we have a new column with the corrected league names, we can remove the original League column. We have done this in the code below:

fifa <- select(fifa, -(League))

Our team preferred the name League to TeamLeague, so we have changed the name of the new column to that of the original column.

fifa <- rename(fifa, League = TeamLeague)

Another issue in our data is the naming of the overall columns. Because both tables had a column called Overall, the join labeled the team’s rating Overall.y and the player’s rating Overall.x. These names aren’t friendly to a new user of our data, so we rename the team column to TeamOverall, a variable which indicates the overall rating given to a whole team, and restore the player column to Overall.

fifa <- rename(fifa, TeamOverall = Overall.y)
fifa <- rename(fifa, Overall = Overall.x)

As we are renaming columns, we can also rename a few columns to help our readers get a better understanding of units of measure for those metrics.

fifa <- rename(fifa, 'Weight (KG)' = Weight)
fifa <- rename(fifa, 'Height (CM)' = Height)

To finish off the renaming, we can go ahead and rename some of the ID columns to be more explanatory on what exactly they are identifying.

fifa <- rename(fifa, 'TeamID' = ID.y)
fifa <- rename(fifa, 'PlayerID' = ID.x)

Next, there are some unnecessary columns in our data which won’t further our analysis, so we can remove them to reduce the computational expense of any model. Specifically, we will remove the PhotoUrl and OnLoad variables.

#Drop the two columns by name so the code does not depend on column order
fifa <- select(fifa, -PhotoUrl, -OnLoad)

We also noticed that some of the positional titles were a little too specific for our analysis. We hope to conduct analysis on general positional groups, so excessive detail regarding positions is unnecessary for us, especially since some of the titles are not commonly used in typical soccer jargon.

#All the unique positions
unique(fifa$BestPosition)

##  [1] "RW"  "ST"  "GK"  "CM"  "LW"  "CDM" "CF"  "LM"  "CB"  "CAM" "RB"  "LB" 
## [13] "RM"  "LWB" "RWB"

#Edit into more appropriate positional titles
fifa$BestPosition_new <- case_when(
  fifa$BestPosition == "GK" ~ "GK",
  fifa$BestPosition %in% c("LCB", "CB", "RCB") ~ "CB",
  fifa$BestPosition %in% c("LB", "LWB", "RWB", "RB") ~ "LB/RB",
  fifa$BestPosition %in% c("LDM", "CDM", "RDM") ~ "DM",
  fifa$BestPosition %in% c("LCM", "CM", "RCM") ~ "CM",
  fifa$BestPosition %in% c("LM", "RM") ~ "LM/RM",
  fifa$BestPosition %in% c("LAM", "CAM", "RAM") ~ "AM",
  fifa$BestPosition %in% c("LW", "RW") ~ "LW/RW",
  fifa$BestPosition %in% c("LF", "CF", "RF") ~ "CF",
  TRUE ~ "ST"
)
fifa$BestPosition_new <- factor(fifa$BestPosition_new,
                                levels = c("GK", "CB", "LB/RB", "DM", "CM",
                                           "LM/RM", "AM", "LW/RW", "CF", "ST"))

#check if our change worked
levels(fifa$BestPosition_new)

##  [1] "GK"    "CB"    "LB/RB" "DM"    "CM"    "LM/RM" "AM"    "LW/RW" "CF"   
## [10] "ST"

At this point, we should have addressed all the issues with our data and put ourselves in a good position for further analysis. We can check that all our changes have gone through by running the head function again. We will only show the first row this time, as it is enough to see the new structure without taking up too much space.

head(fifa, 1)

##   PlayerID     Name     FullName Age Height (CM) Weight (KG) Nationality
## 1   158023 L. Messi Lionel Messi  34         170          72   Argentina
##   Overall Potential Growth TotalStats BaseStats Positions BestPosition
## 1      93        93      0       2219       462  RW,ST,CF           RW
##                  Club ValueEUR WageEUR ReleaseClause ClubPosition ContractUntil
## 1 Paris Saint-Germain 78000000  320000     144300000           RW          2023
##   ClubNumber ClubJoined NationalTeam NationalPosition NationalNumber
## 1         30       2021    Argentina               RW             10
##   PreferredFoot IntReputation WeakFoot SkillMoves AttackingWorkRate
## 1          Left             5        4          4            Medium
##   DefensiveWorkRate PaceTotal ShootingTotal PassingTotal DribblingTotal
## 1               Low        85            92           91             95
##   DefendingTotal PhysicalityTotal Crossing Finishing HeadingAccuracy
## 1             34               65       85        95              70
##   ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing BallControl
## 1           91      88        96    93         94          91          96
##   Acceleration SprintSpeed Agility Reactions Balance ShotPower Jumping Stamina
## 1           91          80      91        94      95        86      68      72
##   Strength LongShots Aggression Interceptions Positioning Vision Penalties
## 1       69        94         44            40          93     95        75
##   Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking
## 1        96      20             35            24        6         11        15
##   GKPositioning GKReflexes STRating LWRating LFRating CFRating RFRating
## 1            14          8       92       92       93       93       93
##   RWRating CAMRating LMRating CMRating RMRating LWBRating CDMRating RWBRating
## 1       92        93       93       90       93        69        67        69
##   LBRating CBRating RBRating GKRating TeamID LeagueId TeamOverall Attack
## 1       64       53       64       22     73       16          86     89
##   Midfield Defence TransferBudget DomesticPrestige IntPrestige Players
## 1       83      85      160000000               10           9      33
##   StartingAverageAge AllTeamAverageAge         League BestPosition_new
## 1                 28              25.9 French Ligue 1            LW/RW

As we can see from the result above, our data looks good.

Finally, we can take a look at the summary for a few of the important variables in our data.

fifa.summary <- fifa %>%
  select(Overall, WageEUR, ValueEUR, PaceTotal, DribblingTotal, ShootingTotal, WeakFoot, SkillMoves) %>%
  #summarise_each() and funs() are deprecated; across() gives the same "variable_stat" column names
  summarise(across(everything(),
                   list(min = min,
                        q25 = ~ quantile(.x, 0.25),
                        median = median,
                        q75 = ~ quantile(.x, 0.75),
                        max = max,
                        mean = mean)))

#We can reshape this data to make it look better

fifa.summary_reformat <- fifa.summary %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean)

print(fifa.summary_reformat)

##              var min    q25 median     q75      max         mean
## 1 DribblingTotal  26     58     64      69 9.50e+01 6.296675e+01
## 2        Overall  47     61     66      70 9.30e+01 6.576873e+01
## 3      PaceTotal  28     62     68      75 9.70e+01 6.787534e+01
## 4  ShootingTotal  18     44     56      64 9.40e+01 5.346088e+01
## 5     SkillMoves   1      2      2       3 5.00e+00 2.352459e+00
## 6       ValueEUR   0 475000 975000 2000000 1.94e+08 2.851651e+06
## 7        WageEUR   0   1000   3000    8000 3.50e+05 8.849611e+03
## 8       WeakFoot   1      3      3       3 5.00e+00 2.945580e+00

The table above consolidates the summary statistics for some of the variables we consider most valuable. Ratings for players’ weak foot and skill moves range from 1 to 5; the median skill-moves rating is 2, below the middle rating of 3. Wages and player values range widely, with some values as low as 0. These zeros are probably not literally accurate: the game includes some very small teams, so a player’s true wage or value may be unavailable or simply very low compared with the best players in the world, and 0 appears to be used as a placeholder. We have not removed these rows, because doing so would discard a lot of otherwise useful data, and we have not imputed the median, because that would give a misleading picture of the data. This is something we will need to keep in mind during our analysis. Finally, players in the game range widely in overall ability, from 47 to 93.
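
One way to keep the placeholder zeros in mind is to measure how common they are and to see how much a summary statistic moves when they are treated as missing. The sketch below does this on a small made-up data frame; the values in toy are assumptions for the example and do not come from the real table.

```r
# ASSUMPTION: made-up wages and values, including placeholder zeros
toy <- data.frame(
  WageEUR = c(320000, 0, 3000, 1000, 0),
  ValueEUR = c(78000000, 0, 975000, 475000, 0)
)

# Share of rows where each column is exactly 0
zero_share <- colMeans(toy == 0)
print(zero_share)

# A copy in which 0 is treated as missing rather than as a real amount
toy_na <- toy
toy_na[toy_na == 0] <- NA

# Compare medians with and without the placeholder zeros
median_all <- sapply(toy, median)
median_nonzero <- sapply(toy_na, median, na.rm = TRUE)
print(rbind(median_all, median_nonzero))
```

Keeping the original column and working from a copy means no rows are dropped, in line with the decision above not to remove them.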

Proposed Exploratory Data Analysis

We hope to uncover information in the data that is not self-evident by slicing the data in various ways and by creating new variables that help us understand it better. For example, we plan to create a continent column based on each player’s country, to see how players from different continents compare across categories. We could also create groupings of positions such as defenders, midfielders, and attackers. We hope that these variables and views will help us understand some of the hidden trends in our data. We plan to summarize these slices and trends at the end with visualizations that compare the categories and variables we created.
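
A named lookup vector is one simple way to derive such grouping columns. In the sketch below, the assignment of positions to lines and of countries to continents is our own assumption for illustration, and the continent lookup covers only the three countries in the toy data; a full version would need every nationality in the table.

```r
# ASSUMPTION: our own grouping of the 15 BestPosition codes into lines
line_lookup <- c(
  GK = "Goalkeeper",
  CB = "Defender", LB = "Defender", RB = "Defender",
  LWB = "Defender", RWB = "Defender",
  CDM = "Midfielder", CM = "Midfielder", CAM = "Midfielder",
  LM = "Midfielder", RM = "Midfielder",
  LW = "Attacker", RW = "Attacker", CF = "Attacker", ST = "Attacker"
)

# ASSUMPTION: partial country-to-continent lookup for the example only
continent_lookup <- c(Argentina = "South America", Poland = "Europe", France = "Europe")

# Toy rows using the nationality and best position of the first three players
toy <- data.frame(
  Nationality = c("Argentina", "Poland", "France"),
  BestPosition = c("RW", "ST", "ST"),
  stringsAsFactors = FALSE
)

# Index the lookup vectors by name to create the new columns
toy$PositionLine <- unname(line_lookup[toy$BestPosition])
toy$Continent <- unname(continent_lookup[toy$Nationality])
print(toy)
```

Any country or position missing from a lookup comes back as NA, which makes gaps in the mapping easy to spot.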

We hope to create many visualizations to help us illustrate the findings of our questions. Here are a few of our ideas:

Histogram to show distribution of players’ statistics, both face stats and “raw” stats. It may give some valuable insight into how certain stats behave generally, maybe they come from a known distribution (like normal distribution).
Correlation plot between various statistics to showcase relationship between significant variables.
Bar plot to show various stats across various continents and different positional groups
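
The three plot ideas above could be started along the following lines with ggplot2, which is loaded as part of the tidyverse. The data frame toy is simulated for the sketch, and its column names and values are assumptions rather than real results.

```r
library(ggplot2)

# ASSUMPTION: simulated ratings and groups, not the real FIFA data
set.seed(42)
toy <- data.frame(
  Overall = round(rnorm(300, mean = 66, sd = 7)),
  PaceTotal = round(rnorm(300, mean = 68, sd = 10)),
  Group = sample(c("Defender", "Midfielder", "Attacker"), 300, replace = TRUE)
)

# Histogram of one statistic
hist_plot <- ggplot(toy, aes(x = Overall)) +
  geom_histogram(binwidth = 2) +
  labs(title = "Distribution of Overall", x = "Overall", y = "Players")

# Correlation matrix that a correlation plot would be built from
cor_matrix <- cor(toy[, c("Overall", "PaceTotal")])

# Bar plot of the mean of a statistic per group
group_means <- aggregate(Overall ~ Group, data = toy, FUN = mean)
bar_plot <- ggplot(group_means, aes(x = Group, y = Overall)) +
  geom_col() +
  labs(title = "Mean Overall by positional group", x = NULL, y = "Mean Overall")

print(hist_plot)
print(bar_plot)
print(cor_matrix)
```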

Our group feels comfortable moving forward with all of these ideas.

In terms of machine learning techniques, our group plans on using a linear regression to better understand the relationship between independent and dependent variables.

Of course, before using a linear regression, we need to check whether it is an appropriate model. First, we would check for outliers, since they can heavily influence the fit. Second, we need the relationship between the variables to be approximately linear, which we can check with a scatterplot. Third, linear regression assumes the residuals are approximately normally distributed (it is the residuals, not the raw variables, that matter here); this can be checked with a histogram, a Q-Q plot, or a Shapiro-Wilk test of the residuals. Fourth, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. It can be assessed with three criteria:

Correlation matrix – when computing the matrix of Pearson’s bivariate correlations among all independent variables, the correlation coefficients should be well below 1 in absolute value (a common rule of thumb treats values above roughly 0.8 as a warning sign).
Tolerance – the tolerance measures how much of one independent variable is not explained by the other independent variables. It is defined as T = 1 – R², where R² comes from a first-step regression of that variable on all the other independent variables. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly is.
Variance Inflation Factor (VIF) – the variance inflation factor is defined as VIF = 1/T. With VIF > 5 there is an indication that multicollinearity may be present; with VIF > 10 multicollinearity among the variables is very likely.
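
The three criteria above can be sketched in base R as follows. The data are simulated, and the choice to make Acceleration and SprintSpeed strongly correlated is an assumption made purely to show what a high VIF looks like; it is not a finding from the FIFA data.

```r
set.seed(42)
n <- 500
# Simulated attributes (an assumption for illustration): Acceleration and
# SprintSpeed are built to be strongly correlated, Strength is independent.
Acceleration <- rnorm(n, mean = 65, sd = 10)
SprintSpeed  <- Acceleration + rnorm(n, sd = 3)
Strength     <- rnorm(n, mean = 65, sd = 10)
X <- data.frame(Acceleration, SprintSpeed, Strength)

# 1. Pairwise Pearson correlations among the predictors.
round(cor(X), 2)

# 2. Tolerance: regress each predictor on all the others and take 1 - R^2.
tolerance <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 - summary(aux)$r.squared
})

# 3. VIF is the reciprocal of the tolerance.
vif <- 1 / tolerance
round(rbind(tolerance, vif), 3)
```

In this simulation the two speed attributes should show a VIF above 5 while Strength stays close to 1, which is the pattern the thresholds are meant to flag.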

The last assumption of the linear regression analysis is homoscedasticity, meaning that the variance of the residuals is roughly constant along the regression line. A scatter plot of the residuals against the fitted values is a good way to check this: we want residuals with a constant spread that do not form a pattern (such as a trumpet shape). If heteroscedasticity is present, a non-linear transformation of the data might fix the problem.
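
Alongside the plots, the residual checks can also be done numerically. The sketch below runs a Shapiro-Wilk test on the residuals and computes Koenker’s studentized version of the Breusch-Pagan statistic by hand. The data are simulated with noise that grows with x, which is an assumption chosen to produce the trumpet shape described above.

```r
set.seed(1)
n <- 400
x <- runif(n, 40, 95)
# Simulated response (assumption): noise grows with x, giving the "trumpet" shape.
y <- 100 + 20 * x + rnorm(n, sd = 2 * (x - 35))
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk (accepts between 3 and 5000 values).
shapiro.test(residuals(fit))

# Koenker's studentized Breusch-Pagan statistic, computed by hand:
# regress the squared residuals on the predictors; under constant variance
# n * R^2 is approximately chi-square with one degree of freedom per predictor.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value here speaks against constant residual variance. Because shapiro.test is limited to 5000 values, a data set of our size would need a random sample of the residuals or a Q-Q plot instead.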

We hope to build a model for each position or position group to better understand the trends in our data.

The other machine learning technique we may use is Principal Component Analysis (PCA). The main goal of PCA is to reduce the number of attributes so that they can be interpreted more easily as groups. We would also like to see whether reducing the number of dimensions to 6 (the number of face stats) results in a grouping similar to the one proposed by EA. First, we would take all the attributes and scale them, as PCA is easily influenced by the magnitude of the variables.

Apart from the reduction to 6 dimensions, it would be good to know the optimal number of dimensions according to the eigenvalues. One can rule out all components with eigenvalues below 1 (the Kaiser criterion), keep as many components as are needed to explain a chosen share of the variance, or use the MAP test or parallel analysis.
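
A minimal sketch of the scaling step and the first two selection rules, using base R’s prcomp. The six simulated attributes are driven by two made-up latent traits, and the 80% variance target is an arbitrary choice for illustration; both are assumptions rather than properties of the FIFA data.

```r
set.seed(7)
n <- 300
# Simulated ratings (assumption): two latent traits drive six observed attributes.
speed <- rnorm(n)
skill <- rnorm(n)
attrs <- data.frame(
  Acceleration = speed + rnorm(n, sd = 0.4),
  SprintSpeed  = speed + rnorm(n, sd = 0.4),
  Agility      = speed + rnorm(n, sd = 0.4),
  Dribbling    = skill + rnorm(n, sd = 0.4),
  BallControl  = skill + rnorm(n, sd = 0.4),
  ShortPassing = skill + rnorm(n, sd = 0.4)
)

# Center and scale so that every attribute contributes on the same footing.
pca <- prcomp(attrs, center = TRUE, scale. = TRUE)

# On standardized data the eigenvalues are the squared component standard deviations.
eigenvalues <- pca$sdev^2
kaiser_k    <- sum(eigenvalues > 1)               # Kaiser criterion
cum_var     <- cumsum(eigenvalues) / sum(eigenvalues)
var_k       <- which(cum_var >= 0.80)[1]          # smallest k explaining >= 80% of variance

kaiser_k
round(cum_var, 3)
```

With standardized inputs the eigenvalues sum to the number of attributes, so an eigenvalue above 1 marks a component that carries more variance than a single original attribute.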

When it comes to face stats, we only have 6 variables to begin with, but we would nonetheless like to reduce their number even further.

Overall, the only machine learning techniques we plan to use are linear regression and/or PCA, depending on how significant their results are, to better understand trends in our data.

Our 10 Questions:

What attributes are most important to higher wage?

To answer this question, we must first subset our data so that it contains only the players’ individual attributes in addition to their wage. Since a player’s face stats are aggregates of these attributes, we do not include them in our analysis, to avoid multicollinearity. Since we are looking for the general attributes that make a player earn a higher wage, we also leave out position-based stats. Let’s subset our data first and save the new table as playerwage, which we will use for our analysis of wage. We can list the columns with the colnames function to confirm that everything we need is in the new data frame.

playerwage <- fifa[,c(17,38:66)]
colnames(playerwage)

##  [1] "WageEUR"         "Crossing"        "Finishing"       "HeadingAccuracy"
##  [5] "ShortPassing"    "Volleys"         "Dribbling"       "Curve"          
##  [9] "FKAccuracy"      "LongPassing"     "BallControl"     "Acceleration"   
## [13] "SprintSpeed"     "Agility"         "Reactions"       "Balance"        
## [17] "ShotPower"       "Jumping"         "Stamina"         "Strength"       
## [21] "LongShots"       "Aggression"      "Interceptions"   "Positioning"    
## [25] "Vision"          "Penalties"       "Composure"       "Marking"        
## [29] "StandingTackle"  "SlidingTackle"

Let’s take a look at the distribution of the various attributes in our data. We have not included player wage in this visualization as that operates on a different scale which would make the attributes harder to view.

boxplot(playerwage[,2:30], las=2, main= "Attributes Box Plot", ylab= "Attribute Value")

Let’s further understand the wage variable by looking at its distribution:

hist(playerwage$WageEUR, main= "Player Wage Histogram", xlab = "Player Wage (Euros)")

As we can see here, most players are at the lower end of the wage scale. Our next step is a linear regression, so these values will need to be adjusted for the regression to describe our data adequately without being skewed by the large number of zero and low-end data points.
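
One possible adjustment for this kind of skew is to model the logarithm of the wage. The sketch below uses simulated data in which wage grows multiplicatively with Reactions (an assumption made so the effect is visible) and uses log1p so that wages of 0 remain defined. It illustrates the technique and is not the adjustment applied in the regression that follows.

```r
set.seed(3)
n <- 1000
# Simulated data (assumption): wages grow multiplicatively with Reactions,
# which produces a long right tail like the one in the histogram.
Reactions <- round(runif(n, 40, 95))
WageEUR   <- round(exp(2 + 0.1 * Reactions + rnorm(n, sd = 0.5)))

# Simple moment-based skewness; positive values mean a long right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# log1p(x) = log(1 + x) is defined at 0, so zero wages do not break the transform.
LogWage <- log1p(WageEUR)
c(raw = skewness(WageEUR), logged = skewness(LogWage))

raw_fit <- lm(WageEUR ~ Reactions)
log_fit <- lm(LogWage ~ Reactions)
c(raw = summary(raw_fit)$r.squared, logged = summary(log_fit)$r.squared)
```

On the log scale a coefficient describes an approximate percentage change in wage per rating point rather than a change in euros, so the two models are read differently.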

We can set up the linear regression, where wage is the dependent variable and the player attributes are the independent variables.

wage_lr <- lm(WageEUR ~ ., data = playerwage)
wage_lr

## 
## Call:
## lm(formula = WageEUR ~ ., data = playerwage)
## 
## Coefficients:
##     (Intercept)         Crossing        Finishing  HeadingAccuracy  
##      -65584.100           54.320           16.580           17.616  
##    ShortPassing          Volleys        Dribbling            Curve  
##          79.454           59.731           19.533           67.091  
##      FKAccuracy      LongPassing      BallControl     Acceleration  
##         -39.411           -5.233           54.196            4.717  
##     SprintSpeed          Agility        Reactions          Balance  
##         106.186          -94.964          919.638          -14.486  
##       ShotPower          Jumping          Stamina         Strength  
##         164.160           34.077         -115.639          -34.650  
##       LongShots       Aggression    Interceptions      Positioning  
##        -191.241          -21.741          -74.522         -104.325  
##          Vision        Penalties        Composure          Marking  
##         127.779          -59.983          182.741          -17.945  
##  StandingTackle    SlidingTackle  
##         121.586          -49.441

summary(wage_lr)

## 
## Call:
## lm(formula = WageEUR ~ ., data = playerwage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38542  -7478  -2329   3858 305270 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -65584.100   1376.538 -47.644  < 2e-16 ***
## Crossing            54.320     15.927   3.411 0.000649 ***
## Finishing           16.580     21.266   0.780 0.435626    
## HeadingAccuracy     17.616     15.912   1.107 0.268268    
## ShortPassing        79.454     30.736   2.585 0.009744 ** 
## Volleys             59.731     18.223   3.278 0.001048 ** 
## Dribbling           19.533     26.055   0.750 0.453435    
## Curve               67.091     17.682   3.794 0.000149 ***
## FKAccuracy         -39.411     15.316  -2.573 0.010085 *  
## LongPassing         -5.233     21.594  -0.242 0.808511    
## BallControl         54.196     31.105   1.742 0.081465 .  
## Acceleration         4.717     23.984   0.197 0.844086    
## SprintSpeed        106.186     21.342   4.975 6.56e-07 ***
## Agility            -94.964     18.121  -5.241 1.62e-07 ***
## Reactions          919.638     20.117  45.714  < 2e-16 ***
## Balance            -14.486     15.489  -0.935 0.349687    
## ShotPower          164.160     18.152   9.044  < 2e-16 ***
## Jumping             34.077     12.373   2.754 0.005891 ** 
## Stamina           -115.639     14.206  -8.140 4.19e-16 ***
## Strength           -34.650     14.829  -2.337 0.019469 *  
## LongShots         -191.241     19.855  -9.632  < 2e-16 ***
## Aggression         -21.741     13.690  -1.588 0.112279    
## Interceptions      -74.522     21.416  -3.480 0.000503 ***
## Positioning       -104.325     20.226  -5.158 2.52e-07 ***
## Vision             127.779     18.185   7.027 2.19e-12 ***
## Penalties          -59.983     16.848  -3.560 0.000372 ***
## Composure          182.741     19.257   9.490  < 2e-16 ***
## Marking            -17.945     19.568  -0.917 0.359123    
## StandingTackle     121.586     30.182   4.028 5.64e-05 ***
## SlidingTackle      -49.441     28.659  -1.725 0.084520 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16120 on 19246 degrees of freedom
## Multiple R-squared:  0.3108, Adjusted R-squared:  0.3097 
## F-statistic: 299.2 on 29 and 19246 DF,  p-value: < 2.2e-16

From the above coefficients we can see that, with all else held constant, Reactions is the attribute most strongly associated with wage. Its coefficient of 919.638 means that a 1-point increase in Reactions is associated with a wage that is about 920 euros higher. The other attributes for which a 1-point increase is associated with more than 100 euros of additional wage are Sprint Speed, Vision, Composure, Shot Power, and Standing Tackle. A player looking to increase their wage might focus on improving these attributes, although the regression shows association rather than causation.

From the summary of ‘wage_lr’, we can see that the p-value of the overall F-test is below the significance level (0.05), so there is a statistically significant relationship between a player’s wage and the independent variables. The model’s R-squared of 0.31 means that these attributes explain about 31% of the variation in wage.
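
Reading the largest significant coefficients off a long summary by eye is error-prone. As a sketch, the coefficient table can be pulled out of the summary object and filtered programmatically. The data below are simulated, and the three attributes and their effect sizes are assumptions for illustration only.

```r
set.seed(11)
n <- 800
# Simulated attributes and wage (assumption for illustration only).
d <- data.frame(Reactions = rnorm(n, 62, 9),
                Composure = rnorm(n, 58, 12),
                Balance   = rnorm(n, 64, 14))
d$WageEUR <- -60000 + 900 * d$Reactions + 180 * d$Composure + rnorm(n, sd = 8000)

fit <- lm(WageEUR ~ ., data = d)

# summary()$coefficients is a matrix with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coefs <- as.data.frame(summary(fit)$coefficients)
coefs <- coefs[rownames(coefs) != "(Intercept)", ]

# Keep terms significant at the 5% level, largest estimate first.
significant <- coefs[coefs[["Pr(>|t|)"]] < 0.05, ]
significant <- significant[order(-significant$Estimate), ]
significant
```

The same lines applied to a fitted model such as wage_lr would list its significant attributes in order of coefficient size.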

What attributes are most important to an increase in player value?

We will use the same set of attributes as above for this analysis, but with player value in place of wage. We will again create a new table, this time called playervalue.

playervalue <- fifa[,c(16,38:66)]
colnames(playervalue)

##  [1] "ValueEUR"        "Crossing"        "Finishing"       "HeadingAccuracy"
##  [5] "ShortPassing"    "Volleys"         "Dribbling"       "Curve"          
##  [9] "FKAccuracy"      "LongPassing"     "BallControl"     "Acceleration"   
## [13] "SprintSpeed"     "Agility"         "Reactions"       "Balance"        
## [17] "ShotPower"       "Jumping"         "Stamina"         "Strength"       
## [21] "LongShots"       "Aggression"      "Interceptions"   "Positioning"    
## [25] "Vision"          "Penalties"       "Composure"       "Marking"        
## [29] "StandingTackle"  "SlidingTackle"

We can visualize the attributes here, as done above, using boxplots for the various attributes.

boxplot(playervalue[,2:30], las=2, main= "Attributes Box Plot", ylab= "Attribute Value")

More importantly, however, we can also visualize the distribution of the player values in our dataset by creating a histogram.

options(scipen = 999)
hist(playervalue$ValueEUR, main= "Player Value Histogram", xlab = "Player Value (Euros)")

Again, we can notice that a large number of points here are 0. Although this could be realistic, we will need to make adjustments when we enter the next stage of the linear regression to ensure our analysis isn’t skewed.

value_lr <- lm(ValueEUR ~ ., data = playervalue)
value_lr

## 
## Call:
## lm(formula = ValueEUR ~ ., data = playervalue)
## 
## Coefficients:
##     (Intercept)         Crossing        Finishing  HeadingAccuracy  
##     -25662833.1          -2900.7          30247.4         -10310.2  
##    ShortPassing          Volleys        Dribbling            Curve  
##         54827.8          21324.2          14650.1          20285.5  
##      FKAccuracy      LongPassing      BallControl     Acceleration  
##        -12623.9          -7276.1          14405.9          22539.5  
##     SprintSpeed          Agility        Reactions          Balance  
##         53180.8         -47815.2         360453.3            335.9  
##       ShotPower          Jumping          Stamina         Strength  
##         44622.4          -1202.3         -15616.6          -6189.9  
##       LongShots       Aggression    Interceptions      Positioning  
##        -76002.6         -21180.4         -37517.5         -47741.2  
##          Vision        Penalties        Composure          Marking  
##         57514.9         -41219.4          53892.6         -10731.5  
##  StandingTackle    SlidingTackle  
##         48102.8          -7940.4

summary(value_lr)

## 
## Call:
## lm(formula = ValueEUR ~ ., data = playervalue)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -12511031  -2772965   -984953   1195349 176149216 
## 
## Coefficients:
##                    Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)     -25662833.1    554109.4 -46.314 < 0.0000000000000002 ***
## Crossing            -2900.7      6411.1  -0.452             0.650949    
## Finishing           30247.4      8560.5   3.533             0.000411 ***
## HeadingAccuracy    -10310.2      6405.1  -1.610             0.107481    
## ShortPassing        54827.8     12372.4   4.431  0.00000941153763968 ***
## Volleys             21324.2      7335.5   2.907             0.003654 ** 
## Dribbling           14650.1     10487.9   1.397             0.162473    
## Curve               20285.5      7117.8   2.850             0.004377 ** 
## FKAccuracy         -12623.9      6165.3  -2.048             0.040617 *  
## LongPassing         -7276.1      8692.4  -0.837             0.402568    
## BallControl         14405.9     12521.1   1.151             0.249938    
## Acceleration        22539.5      9654.3   2.335             0.019572 *  
## SprintSpeed         53180.8      8590.9   6.190  0.00000000061226851 ***
## Agility            -47815.2      7294.2  -6.555  0.00000000005697122 ***
## Reactions          360453.3      8097.9  44.512 < 0.0000000000000002 ***
## Balance               335.9      6235.1   0.054             0.957042    
## ShotPower           44622.4      7306.8   6.107  0.00000000103502772 ***
## Jumping             -1202.3      4980.7  -0.241             0.809254    
## Stamina            -15616.6      5718.5  -2.731             0.006322 ** 
## Strength            -6189.9      5969.3  -1.037             0.299771    
## LongShots          -76002.6      7992.4  -9.509 < 0.0000000000000002 ***
## Aggression         -21180.4      5510.7  -3.844             0.000122 ***
## Interceptions      -37517.5      8620.8  -4.352  0.00001356097008179 ***
## Positioning        -47741.2      8141.7  -5.864  0.00000000459743317 ***
## Vision              57514.9      7320.2   7.857  0.00000000000000414 ***
## Penalties          -41219.4      6782.1  -6.078  0.00000000124206784 ***
## Composure           53892.6      7751.7   6.952  0.00000000000370753 ***
## Marking            -10731.5      7876.8  -1.362             0.173081    
## StandingTackle      48102.8     12149.4   3.959  0.00007545100626097 ***
## SlidingTackle       -7940.4     11536.5  -0.688             0.491283    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6489000 on 19246 degrees of freedom
## Multiple R-squared:  0.2728, Adjusted R-squared:  0.2717 
## F-statistic:   249 on 29 and 19246 DF,  p-value: < 0.00000000000000022

From the above linear regression results we can see that, with all else held constant, Reactions is again the most valuable attribute. The coefficient suggests that a 1-point increase in Reactions is associated with an increase in player value of over 360 thousand euros. Other important attributes include sprint speed, finishing, short passing, volleys, and acceleration, to name a few. Largely, the attributes associated with higher player value and higher wage are the same, for a fairly obvious reason: if a player is valuable, their wage probably reflects that.

From the summary of ‘value_lr’, we can see that the p-value of the overall F-test is below the significance level (0.05), so we can reject the null hypothesis that none of the attributes is related to player value. In other words, there is a statistically significant relationship between a player’s value and the independent variables.

Are more players pacier than they are physical? (Is their pace rating higher than their physicality rating?)

To understand this, we will create a column that serves as a binary flag, with 1 indicating that a player’s pace rating is higher than their physicality rating and 0 indicating otherwise. By summing this column we can see how many players fit that mold. We can also divide this number by the total number of players to get the share of players who are rated higher on pace than on physicality.

fifa <- fifa %>%
  mutate(SpeedVsStrength = if_else(PaceTotal> PhysicalityTotal, 1, 0))

sum(fifa$SpeedVsStrength)/nrow(fifa)

## [1] 0.5974787

We can see from the result above that roughly 60% of players have a higher pace rating than physicality rating. This is an interesting trend for any upcoming soccer player, as it may indicate a need to focus a little more on speed training than on strength training to fit the mold of an average soccer player.
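
The flag-and-sum calculation can also be written in one step, because R treats TRUE as 1 and FALSE as 0. The five toy players below are made up for illustration.

```r
# Toy ratings (assumption): five hypothetical players.
toy <- data.frame(PaceTotal        = c(80, 60, 75, 90, 55),
                  PhysicalityTotal = c(70, 65, 75, 60, 80))

# A comparison yields TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0,
# so the mean is the proportion of players with pace above physicality.
share_pacier <- mean(toy$PaceTotal > toy$PhysicalityTotal)
share_pacier   # 0.4

# Ties (75 vs 75 here) count as "not pacier" because the comparison is strict.
share_tied <- mean(toy$PaceTotal == toy$PhysicalityTotal)
share_tied     # 0.2
```

Checking the share of ties separately shows how much of the remaining 40% in our result could be players whose two ratings are simply equal.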

What country produces the best players?

To understand this, we will average the overall ratings of all players that belong to a given country. We can then look at the 10 countries with the highest average overall rating.

fifa %>%
  group_by(Nationality) %>%
  summarize(AvgPlayerOverall = mean(Overall)) %>%
  top_n(10, wt= AvgPlayerOverall) %>%
  arrange(desc(AvgPlayerOverall))

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 10 x 2
##    Nationality              AvgPlayerOverall
##    <chr>                               <dbl>
##  1 Tanzania                             74  
##  2 Libya                                73.3
##  3 Mozambique                           73  
##  4 Central African Republic             72.5
##  5 Egypt                                72  
##  6 Fiji                                 72  
##  7 Syria                                72  
##  8 Gabon                                71.2
##  9 Brazil                               70.8
## 10 Czech Republic                       70.7

Our analysis produced a result completely different from what we expected, with Tanzania producing the highest-rated players on average. For soccer fans or scouts, this may be a useful prompt to look out for up-and-coming players from Tanzania.

The above analysis, however, may be incomplete, as the values could be skewed by small sample sizes. To truly understand which country produces the best players, we need to look at countries that consistently produce good players. For this reason, we decided to conduct the analysis again with an added constraint: we only include countries that have more than 50 players in the game (more than 0.2% of the total players).

fifa %>%
  group_by(Nationality) %>%
  mutate(NumberOfPlayers = n()) %>%
  filter(NumberOfPlayers > 50) %>%
  summarize(AvgPlayerOverall = mean(Overall)) %>%
  top_n(10, wt= AvgPlayerOverall) %>%
  arrange(desc(AvgPlayerOverall))

## # A tibble: 10 x 2
##    Nationality    AvgPlayerOverall
##    <chr>                     <dbl>
##  1 Brazil                     70.8
##  2 Czech Republic             70.7
##  3 Algeria                    70.6
##  4 Ukraine                    70.5
##  5 Italy                      70.0
##  6 Portugal                   69.8
##  7 Spain                      69.5
##  8 Morocco                    69.4
##  9 Serbia                     69.2
## 10 Croatia                    69.1

With this constraint, the country in first place, Brazil, is more in line with our predictions. This more robust analysis indicates that Brazil has the best players on average and would be a great country for scouts to look at. The list, however, isn’t completely predictable: it includes a few unexpected countries such as Czech Republic, Algeria, Ukraine, and Serbia.
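
To see why the minimum-count constraint changes the ranking, here is a small base R sketch with made-up data. Country "A" has a single highly rated player, and the threshold of 3 is scaled down from our 50 to suit the toy data; all names and numbers are assumptions.

```r
# Toy data (assumption): country "A" has one highly rated player,
# countries "B" and "C" have several players each.
toy <- data.frame(
  Nationality = c("A", rep("B", 5), rep("C", 4)),
  Overall     = c(74, 72, 70, 69, 71, 68, 66, 67, 65, 66)
)

# Compute the mean rating and the player count per country in one pass.
by_country <- aggregate(Overall ~ Nationality, data = toy,
                        FUN = function(v) c(mean = mean(v), n = length(v)))
# aggregate() stores the two statistics in a matrix column; flatten it.
by_country <- data.frame(Nationality = by_country$Nationality,
                         by_country$Overall)

min_players <- 3   # threshold for the toy data; the analysis above uses 50
ranked <- by_country[by_country$n > min_players, ]
ranked <- ranked[order(-ranked$mean), ]
ranked
```

Without the filter, "A" would top the list on the strength of one player, which is the same effect we saw with the single-digit player counts of the countries in the first table.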

What country produces the most skillful (by skill moves rating) players?

Let’s first look at the distribution of the skill move ratings among the players.

barplot(table(fifa$SkillMoves), main= "Number of Players by Skill Moves Rating", xlab = "Skill Moves Rating")

As we can see from the above visualization, 2 is the most common rating with 3 being the second most common.

We can approach understanding which nation produces the most skillful players by taking the average of the skill move rating for all the players, and then grouping this by nationality to get this information at the country level.

fifa %>%
  group_by(Nationality) %>%
  summarize(AvgSkillMove = mean(SkillMoves)) %>%
  top_n(10, wt= AvgSkillMove) %>%
  arrange(desc(AvgSkillMove))

## # A tibble: 19 x 2
##    Nationality              AvgSkillMove
##    <chr>                           <dbl>
##  1 Fiji                             4   
##  2 Puerto Rico                      3.5 
##  3 Liberia                          3.33
##  4 Bermuda                          3   
##  5 Central African Republic         3   
##  6 Chad                             3   
##  7 Chinese Taipei                   3   
##  8 Eritrea                          3   
##  9 Estonia                          3   
## 10 Ethiopia                         3   
## 11 Iraq                             3   
## 12 Korea DPR                        3   
## 13 Kyrgyzstan                       3   
## 14 Libya                            3   
## 15 Malaysia                         3   
## 16 Namibia                          3   
## 17 Papua New Guinea                 3   
## 18 Syria                            3   
## 19 Tanzania                         3

Again, our results are completely different from what we expected. The country with the most skillful players on average is Fiji. We would be curious to learn more about the soccer culture in Fiji and how much creativity is valued in play there. A fan or scout might look out for prospects from Fiji to bring exciting new plays.
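
The table above has 19 rows even though we asked for 10, because top_n keeps every row that ties with the last requested position. The cutoff logic can be sketched in base R; the averages below are a shortened, illustrative subset and k = 4 is an arbitrary choice.

```r
# Toy averages (assumption): several countries tie on the same value.
avg <- c(Fiji = 4, PuertoRico = 3.5, Liberia = 3.33,
         Chad = 3, Estonia = 3, Iraq = 3, Libya = 3)

k <- 4
# Value of the k-th largest entry; everything tied with it is kept as well.
cutoff    <- sort(avg, decreasing = TRUE)[k]
with_ties <- avg[avg >= cutoff]
length(with_ties)   # 7, not 4, because four countries tie at 3

# Dropping ties instead keeps exactly k entries (which tied entry survives is arbitrary).
exactly_k <- head(sort(avg, decreasing = TRUE), k)
length(exactly_k)   # 4
```

Many ties at a round value such as 3 are themselves a hint that the averages come from very few players per country.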

This analysis, again, may be biased towards countries with a small sample size. To get a more holistic view, we can run it again with only the countries that have more than 50 players in the game (more than 0.2% of the total players).

fifa %>%
  group_by(Nationality) %>% 
  mutate(NumberOfPlayers = n()) %>%
  filter(NumberOfPlayers > 50) %>%
  summarize(AvgSkillMove = mean(SkillMoves)) %>%
  top_n(10, wt= AvgSkillMove) %>%
  arrange(desc(AvgSkillMove))

## # A tibble: 10 x 2
##    Nationality   AvgSkillMove
##    <chr>                <dbl>
##  1 Congo DR              2.86
##  2 Morocco               2.85
##  3 Algeria               2.75
##  4 Nigeria               2.71
##  5 Ghana                 2.66
##  6 Portugal              2.66
##  7 Brazil                2.56
##  8 Côte d'Ivoire         2.55
##  9 Mali                  2.55
## 10 Spain                 2.54

Here we can see a more holistic view of which countries have the most skillful players on average. From this list we can see that players from Congo DR are typically the most skillful, so a scout looking for an exciting and fun-to-watch player may want to look at players from Congo DR.

What country has the most players in FIFA?

We can answer this question with one key function: n(), which counts the rows in each group defined by the preceding group_by.

fifa %>%
  group_by(Nationality) %>%
  summarise(NumberOfPlayers = n()) %>%
  top_n(10, wt= NumberOfPlayers) %>%
  arrange(desc(NumberOfPlayers))

## # A tibble: 10 x 2
##    Nationality   NumberOfPlayers
##    <chr>                   <int>
##  1 England                  1717
##  2 Germany                  1214
##  3 Spain                    1111
##  4 France                    984
##  5 Argentina                 963
##  6 Brazil                    890
##  7 Japan                     547
##  8 Netherlands               439
##  9 United States             412
## 10 Poland                    403

We can see that England has the most players in this game. The most interesting country on this list, however, is Japan in 7th position. Our group never really thought of Japan as a soccer-talent-producing nation, but we were glad to be proven wrong.

What league has the best players?

We can answer this question by averaging players’ overall ratings at the league level.

fifa %>%
  group_by(League) %>%
  summarize(AvgPlayerOverall = mean(Overall)) %>%
  top_n(10, wt = AvgPlayerOverall) %>%
  arrange(desc(AvgPlayerOverall))

## # A tibble: 10 x 2
##    League                        AvgPlayerOverall
##    <chr>                                    <dbl>
##  1 Spain Primera Division                    73.5
##  2 English Premier League                    72.5
##  3 Italian Serie A                           72.1
##  4 Czech Republic Gambrinus Liga             72.0
##  5 Ukrainian Premier League                  71.9
##  6 German 1. Bundesliga                      71.3
##  7 Campeonato Brasileiro Série A             71.0
##  8 French Ligue 1                            70.9
##  9 Greek Super League                        70.8
## 10 Russian Premier League                    70.0

We can see that the first division in Spain has the highest average player rating. The most interesting entry on this list, however, is the Czech league. It is not typically a popular league, so it is interesting to note the number of high-quality players there. As fans, we may need to start watching more Czech league soccer to learn about great players we weren’t aware of.

What is the difference in Average Player Value between a Board (player overall of 83 to 85) and a Walk-Out (player overall of 86 or higher)?

Walk-Out and Board are both slang terms the FIFA community has given to different subsets of players in the gold rating. We were curious how these subsets of great overall players compare in terms of player value. Walk-Outs are the handful of best players in the world, so does their value reflect this distinction?

To help us get to the bottom of this question we will first create a column that will do this characterization for us. We can then go ahead and average the player values based off of these characterizations.

fifa %>%
  mutate(GoldLevel = if_else(Overall >= 83 & Overall <=85, 'Board', if_else(Overall >= 86, 'Walk-Out', 'Other'))) %>%
  group_by(GoldLevel) %>%
  summarize(AvgPlayerValue = mean(ValueEUR)) %>%
  arrange(desc(AvgPlayerValue))

## # A tibble: 3 x 2
##   GoldLevel AvgPlayerValue
##   <chr>              <dbl>
## 1 Walk-Out       78057971.
## 2 Board          40907692.
## 3 Other           2320304.

We can see from the above values that the distinction between a Walk-Out and a Board is reflected heavily in real-world value. Players who are characterized in-game as Walk-Outs have an average value of about 78 million euros, whereas Boards have an average value of nearly 41 million euros. Hence, on average, a Walk-Out is valued about 37 million euros higher than a Board.
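
The nested if_else used for the tiers can also be expressed with base R’s cut, which may be easier to extend if more tiers are added. The sketch below mirrors the 83 and 86 thresholds from the code above; the six players and their values are made up for illustration.

```r
# Toy data (assumption): six hypothetical players.
toy <- data.frame(Overall  = c(90, 86, 85, 83, 82, 70),
                  ValueEUR = c(100e6, 60e6, 45e6, 35e6, 20e6, 1e6))

# Breaks at 83 and 86 mirror the thresholds used in the code above.
# right = FALSE makes intervals closed on the left, so [83, 86) covers 83 to 85.
toy$GoldLevel <- cut(toy$Overall,
                     breaks = c(-Inf, 83, 86, Inf),
                     labels = c("Other", "Board", "Walk-Out"),
                     right  = FALSE)

# Average value per tier and the gap between the two top tiers.
avg_value <- tapply(toy$ValueEUR, toy$GoldLevel, mean)
avg_value
avg_value[["Walk-Out"]] - avg_value[["Board"]]
```

Adding another tier then only requires one more break and one more label.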

What positional group is typically the fastest?

We can understand the answer to this question by grouping our data by the positional group column we had created and then averaging the pace for each player in those respective groups.

fifa %>%
  group_by(BestPosition_new) %>%
  summarize(AvgPace = mean(PaceTotal)) %>%
  arrange(desc(AvgPace))

## # A tibble: 10 x 2
##    BestPosition_new AvgPace
##    <fct>              <dbl>
##  1 LW/RW               79.5
##  2 LM/RM               76.9
##  3 CF                  73.2
##  4 LB/RB               73.2
##  5 AM                  70.0
##  6 ST                  69.5
##  7 GK                  65.0
##  8 CM                  63.7
##  9 DM                  61.3
## 10 CB                  59.4

We can note from the above that wingers are on average the fastest players, which makes sense given the responsibilities of their position. What is interesting, however, is how low the strikers’ average pace is: it is even lower than the average pace of left and right backs. This may show that pace is important for a striker but is not everything. A slower striker can still be successful if they compensate with other strong qualities such as finishing. A winger, in contrast, needs to be very pacey, so a slower winger may struggle to be successful despite being good in other areas. It is also interesting to note that LB/RB are typically the players guarding LW/RW, which helps explain why their average pace is so high. To guard an extremely quick winger, teams need fast defenders on the wing.

Who is the best free agent available?

We can understand this by looking at the free agent with the highest overall.

fifa %>%
  filter(Club == 'Free agent') %>%
  top_n(1, wt = Overall)

##   PlayerID            Name          FullName Age Height (CM) Weight (KG)
## 1   184087 T. Alderweireld Toby Alderweireld  32         186          81
##   Nationality Overall Potential Growth TotalStats BaseStats Positions
## 1     Belgium      83        83      0       1991       412        CB
##   BestPosition       Club ValueEUR WageEUR ReleaseClause ClubPosition
## 1           CB Free agent        0       0             0             
##   ContractUntil ClubNumber ClubJoined NationalTeam NationalPosition
## 1            NA         NA       2021      Belgium               CB
##   NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1              2         Right             3        3          2
##   AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1            Medium            Medium        58            55           70
##   DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1             66             86               77       64        45
##   HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1              81           77      38        62    63         59          81
##   BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1          75           55          60      54        85      62        78
##   Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1      81      76       77        58         79            85          52
##   Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1     62        58        86      87             87            84       16
##   GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1          6        14            16         14       68       63       65
##   CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1       65       65       63        69       69       76       69        80
##   CDMRating RWBRating LBRating CBRating RBRating GKRating TeamID LeagueId
## 1        83        80       81       83       81       24     NA       NA
##   TeamOverall Attack Midfield Defence TransferBudget DomesticPrestige
## 1          NA     NA       NA      NA             NA               NA
##   IntPrestige Players StartingAverageAge AllTeamAverageAge League
## 1          NA      NA                 NA                NA   <NA>
##   BestPosition_new SpeedVsStrength
## 1               CB               0

We can note that Toby Alderweireld is the best free agent available as of the time this data set was compiled. He is an 83-rated defender from Belgium. This is useful to know for any team looking for a new, highly rated defender.
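The full row printed above is hard to read, so a narrower version of the same query may be easier to scan. The sketch below is an illustration only: `toy` is a made-up stand-in for the `fifa` data frame (only the Alderweireld row is taken from the output above, the other rows are hypothetical), and `slice_max()` is swapped in for `top_n()` as an assumption about what a tidier query could look like.

```r
library(dplyr)

# Toy stand-in for `fifa`; only the first row reflects the output above,
# the remaining rows are hypothetical.
toy <- tibble(
  Name         = c("T. Alderweireld", "Hypothetical A", "Hypothetical B"),
  Club         = c("Free agent", "Free agent", "Some Club"),
  Overall      = c(83, 74, 90),
  BestPosition = c("CB", "ST", "CM")
)

best_free_agent <- toy %>%
  filter(Club == "Free agent") %>%       # keep free agents only
  slice_max(Overall, n = 1) %>%          # highest overall; ties are kept
  select(Name, Overall, BestPosition)    # show only the columns of interest

print(best_free_agent)
```

Replacing `toy` with `fifa` should give the same player as the original query, just with fewer columns.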

Who is the best player for their age in FIFA currently?

We can try to understand this by creating a new variable that divides each player's overall rating by the player's age. This penalizes older players and benefits younger ones. The new metric will help us identify up-and-coming players that we may want to look out for in the future.

fifa %>%
  mutate(PlayerAgeOverallMetric = Overall/Age) %>%
  top_n(10, wt = PlayerAgeOverallMetric) %>%
  arrange(desc(PlayerAgeOverallMetric)) %>%
  select(Name, Age, Overall, PlayerAgeOverallMetric)

##             Name Age Overall PlayerAgeOverallMetric
## 1          Pedri  18      81               4.500000
## 2     E. Haaland  20      88               4.400000
## 3  J. Bellingham  18      79               4.388889
## 4       F. Wirtz  18      78               4.333333
## 5   E. Camavinga  18      78               4.333333
## 6      R. Cherki  17      73               4.294118
## 7       G. Reyna  18      77               4.277778
## 8      Ansu Fati  18      76               4.222222
## 9      A. Hložek  18      76               4.222222
## 10       B. Saka  19      80               4.210526

Above we can see a list of 10 of the most talented and so-far proven young players in soccer. We can see some pretty famous names on this list, such as Haaland, Camavinga, Fati, and Saka. We can also see less well-known players, such as Wirtz and Hložek. As an American, however, I am really excited to see Reyna on this list, an exciting young talent from the USA. As a fan of the sport, I am definitely excited to see these young players grow. As a soccer manager, however, I would definitely be looking to bring some of these young players to my club, as their future looks incredibly promising right now.
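To make the metric concrete, here is a small worked example in base R. The first three rows reuse ages and ratings from the table above; the fourth row is a hypothetical veteran added only to show how strongly the division by age penalizes older players.

```r
# Three rows from the table above plus one hypothetical older player.
players <- data.frame(
  Name    = c("Pedri", "E. Haaland", "B. Saka", "Hypothetical veteran"),
  Age     = c(18, 20, 19, 34),
  Overall = c(81, 88, 80, 93)
)

# Same metric as in the analysis: overall rating divided by age.
players$PlayerAgeOverallMetric <- players$Overall / players$Age

# Sort from highest to lowest metric value.
players <- players[order(-players$PlayerAgeOverallMetric), ]
print(players)
```

Even with a much higher overall rating, the hypothetical 34-year-old ends up well below the teenagers, which is the behavior the metric was designed to have.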

What Jersey number is most common amongst the best players in the world?

We will characterize the best players in the world as players with an overall rating of 86 or higher, as FIFA distinguishes those players as “Walk-Outs”. This is the highest distinction a player can get in FIFA, and hence we believe it is a good way to identify who the best players truly are. We will do this by first creating a data frame called Best_Players, which consists solely of players with the walk-out designation. We will then create a bar plot from this data frame of the count of players' club jersey numbers to see if there is any pattern.

Best_Players <- fifa %>%
  mutate(GoldLevel = if_else(Overall >= 83 & Overall <=85, 'Board', if_else(Overall >= 86, 'Walk-Out', 'Other'))) %>%
  filter(GoldLevel == 'Walk-Out')

barplot(table(Best_Players$ClubNumber), las=2, main= "How many of the world's best players wear each Jersey Number?", xlab = "Jersey Number for Club", ylab = "Number of Soccer Players")

This is extremely unexpected. The jersey number 10 is usually regarded as the number reserved for the ‘best’ players; however, we can see that this really isn’t the case. In club soccer, the most common jersey number among the best players is actually number 1. Numbers 10 and 7 are both in the running, but number 1 is by far the most popular. Number 1 is most commonly worn by goalkeepers, which indicates to us that there are many goalkeepers with an overall rating of 86+.

Maybe we have the hype all wrong, though. Maybe the allure of the coveted number 10 jersey comes mainly from players’ national team appearances and not necessarily from their clubs. Let’s create a similar plot to understand the most common jersey number among the world’s best players for their national teams.
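The goalkeeper explanation can be checked directly by cross-tabulating jersey number against position. The sketch below uses a small made-up data frame named `best` in place of Best_Players (all rows are hypothetical); the column names ClubNumber and BestPosition match those in the data set.

```r
# Hypothetical stand-in for Best_Players, for illustration only.
best <- data.frame(
  ClubNumber   = c(1, 1, 1, 10, 10, 7, 9),
  BestPosition = c("GK", "GK", "GK", "CAM", "CF", "ST", "ST")
)

# Cross-tabulate jersey number against best position.
print(table(best$ClubNumber, best$BestPosition))

# Share of number 1 wearers whose best position is goalkeeper.
number_one <- best[best$ClubNumber == 1, ]
gk_share <- mean(number_one$BestPosition == "GK")
print(gk_share)
```

Running the same two steps on the real Best_Players data frame would show whether the number 1 bar is made up almost entirely of goalkeepers, as we assume here.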

barplot(table(Best_Players$NationalNumber), las=2, main= "How many of the world's best players wear each Jersey Number?", xlab = "Jersey Number for National Team", ylab = "Number of Soccer Players")

In the above visualization we can more clearly understand why so many kids hope to wear the jersey number 10 someday: in national team appearances, the number 10 is the jersey most commonly worn by the world’s best players.

Summary

In FIFA, each player is rated on various attributes, including stats such as Pace, Dribbling, and Shooting. These attributes combined give each player an overall score. We wanted to figure out, using the current data set, what we could understand about soccer and soccer players in the real world from the attributes given to players in the game.

We have addressed the proposed problem statement by examining the relationship between various in-game attributes and real-world values. Our analytic technique was to run linear regressions between these in-game and real-world values. We summarized each regression to assess its statistical significance and to measure the relationship between the response variable (e.g., a player’s wage) and the predictor variables, using tests against the null hypothesis, the F-statistic, and R-squared values. Additionally, we used data manipulation techniques to understand basic geographical or statistical trends in our data for specific demographics and players.
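As a reminder of where those quantities come from, the following is a minimal sketch of reading them off a fitted model in R. The data are simulated, not taken from the FIFA data set: the coefficients 500 and 50 and the noise level are arbitrary assumptions, and only the column names (Reactions, PaceTotal, WageEUR) mirror the real data.

```r
set.seed(1)

# Simulated data; the effect sizes below are assumptions for illustration.
n <- 200
sim <- data.frame(
  Reactions = round(runif(n, 40, 95)),
  PaceTotal = round(runif(n, 40, 95))
)
sim$WageEUR <- 500 * sim$Reactions + 50 * sim$PaceTotal + rnorm(n, sd = 5000)

# Wage is the response; the in-game attributes are the predictors.
fit <- lm(WageEUR ~ Reactions + PaceTotal, data = sim)
s <- summary(fit)

coef_table <- s$coefficients   # estimate, std. error, t value, p-value per term
r_squared  <- s$r.squared      # share of variance in wage explained
f_stat     <- s$fstatistic     # F value with its two degrees of freedom

# p-value of the overall F-test (null: all slopes are zero).
f_p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))

print(coef_table)
print(c(r_squared = r_squared, f_p_value = f_p_value))
```

The per-coefficient p-values test each predictor on its own, while the F-test asks whether the predictors jointly explain anything; R-squared describes how much of the variation in the response they account for.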

Our analysis was really interesting overall, as it allowed us to understand the game of soccer in ways we had never understood it before. One of the questions we wanted to answer was which positional group is the fastest. You, as I did, may imagine that the fastest positional groups are all strikers looking to run past defenders and score goals. That is, however, only partially correct. Through our analysis we realized that three of our four paciest positional groups play on the wings (or far edges) of the field: RW/LW, RM/LM, and RB/LB. This indicates to us that the fastest players will typically be on the sides and not in the middle. The RB/LB (defenders on the wing) group was incredibly surprising to us, but it makes a lot of sense upon further review. If teams keep their fastest players as attackers on the wings of the field, opponents must also place fast players on the wings to defend them.

It was also interesting to understand which attributes truly have the strongest positive effect on a player’s value and wage. Our group would never have expected ‘Reactions’ to be the top answer, and this is where our analysis proved to be incredibly insightful. If I were a professional soccer player right now, this is the attribute that I would focus on improving the most. It is easy to see why this attribute is so valuable, as soccer is a fast-paced sport based on acting and reacting. A slow reaction by any given player could easily cost a team the game, and a fast reaction could help a team win one.

Another interesting insight our group gained came from looking at the top young players in the world of professional soccer. Some of the most prominent players in the game today started at an incredibly young age. Messi is a great example of this, as he moved with his family to Barcelona at the age of 13 to fulfill his dream and potential. Our group, through the custom metric we created, found that Pedri is the best player for his age in the world. Interestingly enough, on 11/29 (a few days after we finished our analysis) Pedri was awarded the Kopa Trophy, “an award presented to the best young player in men’s soccer.” This was a great confirmation of the metric we created and of the game’s accuracy in measuring the real world. Another interesting player on our list of the top 10 most promising young players was Gio Reyna. He came in at number 7 in our rankings, which was exciting because he is from the United States of America, a country currently on the rise in the soccer world. Our group is now excited to see how Reyna will lead the American soccer team in the years ahead.

We approached this project and analysis as fans of soccer and FIFA. We hoped our analysis would therefore be valuable to soccer fans like ourselves, and also to players. We believe our analysis meets that goal by providing facts that we, as soccer fans, found incredibly interesting. Additionally, we believe our analysis can be valuable to current players, as it suggests that improving their reactions can help increase their wage and value. Upon further review, however, our group realized that the person who may benefit the most from our analysis and the questions we answered may actually be a soccer scout or manager. A soccer scout could use our analysis to identify promising young players or countries where they could target scouting efforts.

Although our analysis revealed interesting trends, we believe it has a few limitations which future analysts could explore further. One of the largest limitations of our analysis was the data itself. Most of the findings do not tell us much about the players in real life but rather reflect the methodology employed by EA Sports to assign attributes to players in this game.

Most defenders, especially center-backs, have been given very low pace in comparison to the rest of the outfield positions. Real-life statistics, however, indicate that some center-backs do register very quick sprint speeds and often outpace strikers and wingers. For example, Leicester City’s Caglar Soyuncu registered a top speed of 37.55 km/h in a game against Crystal Palace in November, but his in-game pace is just 76, which is quite low considering how quickly he can run. From our understanding, the developers at EA tend to give defenders low pace because in real life defenders are not expected to run with the ball and do not usually get many opportunities to showcase their pace while doing so.

Some of our analysis, however, will hold up in the real world wherever the information used is not subjective. These objective variables include wage, club, position, jersey number, nationality, and age. Some of the analysis also gave us insight into how the subjective attributes correlate with these objective attributes of the players.

Another limitation of our analysis was our specific approach to answering a broad goal. The question we started with was incredibly broad, and we answered smaller, more specific questions throughout the process. We believe, however, that a different set of these sub-questions could have steered our analysis in a different direction. For example, we could have looked more closely at the positional groupings. Soccer players have various skills, and each position needs a different set of them. The most important skills for a defender, for instance, are completely different from those required of an attacker/striker or other positions. A detailed analysis using linear regression modeling could have been applied to identify the most important skills or key attributes for a particular position. This gap was a limitation of our analysis. We encourage other groups to try such an analysis in the future, as it could indicate the similarity between main stats and face stats. This analysis might help us further understand how EA calculates face stats and what weight is given to the individual attributes in each cumulative face stat metric. Either way, it seems intuitive that a principal component analysis (PCA) could show that players can be categorized into groups, such as ‘attacking, technical’, ‘fast, dribbling’, and ‘attacking, physical’.
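For groups that want to pick up the PCA idea, a minimal sketch using base R's `prcomp()` is given below. We did not run this on the FIFA data; the attributes are simulated from two made-up latent traits (`speed` and `technique`), so the grouping it finds is built in by construction and only illustrates how the output would be read.

```r
set.seed(42)

# Two hypothetical latent traits driving four simulated attributes.
n <- 300
speed     <- rnorm(n)
technique <- rnorm(n)
attrs <- data.frame(
  Acceleration = 70 + 8 * speed     + rnorm(n, sd = 2),
  SprintSpeed  = 70 + 8 * speed     + rnorm(n, sd = 2),
  Dribbling    = 65 + 8 * technique + rnorm(n, sd = 2),
  BallControl  = 65 + 8 * technique + rnorm(n, sd = 2)
)

# Center and scale so every attribute contributes on the same footing.
pca <- prcomp(attrs, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(pca$rotation[, 1:2], 2))   # loadings of the first two components
print(round(var_explained, 2))
```

On the real data, one would pass the numeric attribute columns of `fifa` instead of `attrs` and inspect which attributes load together on the leading components before naming any player groups.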

Overall, our group learned a lot more about soccer and soccer players through our analysis. Historically, video games have been based on real life; we thought it was really interesting to approach the relationship in the opposite direction. We hope our analysis can be continued and built upon in the future to reveal more hidden secrets about the soccer world.

FIFA 22 Data: Final Project

Group 1: Mrudul Nekkanti, Hetal Sawant, Rishi Ambani

11/30/2021