Introduction

FIFA is a soccer video game series produced by EA that has gained popularity around the world, with over 31 million players on FIFA 21. A new FIFA game comes out every year, and in each one EA rates every player in the game on a set of attributes meant to reflect real-life ability. These attributes, which include Pace, Dribbling, and Shooting, are then combined into an overall score. Our group was curious: if we take these values as accurate, what can we learn about the real world?

We believe that answering this question could lead to significant insights, because these ratings are arguably the closest available measurement of these attributes in the real world. As mentioned above, FIFA is produced by EA, a multi-billion-dollar corporation that has both the resources and the technical know-how to measure these attributes accurately.

We found a dataset online that contains all of these attributes for the most recent FIFA game, FIFA 22, along with real-world values such as player wage, value, club, and jersey number. We hope to address the proposed problem statement by understanding the relationship between various in-game attributes and real-world values. For example, how does a player’s overall rating relate to their national-team jersey number? We have always been drawn to the famous number 10, but is that actually the jersey the best players wear? Which attribute has led players to the highest payday?

Our current proposed analytic technique is to run linear regressions between these in-game and real-world values, using the p-values of the regression coefficients to assess statistical significance. Where a relationship is significant, we can compare the regression coefficients and correlation coefficients to understand its strength and how we could predict or describe real-world values using the game.
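As a minimal sketch of this regression step, the code below fits a model on a small simulated data frame rather than on the FIFA data. The column names overall and wage, and every number used to simulate them, are invented for illustration only.

```r
# Simulated stand-in for the real data (all values are invented for illustration)
set.seed(1)
toy <- data.frame(overall = round(runif(200, 50, 90)))
toy$wage <- 500 * toy$overall - 20000 + rnorm(200, sd = 3000)

# Fit a simple linear regression of wage on overall rating
fit <- lm(wage ~ overall, data = toy)

# Coefficient table: estimate, standard error, t statistic and p-value
coef_table <- summary(fit)$coefficients
p_value    <- coef_table["overall", "Pr(>|t|)"]
r_squared  <- summary(fit)$r.squared

print(coef_table)
print(r_squared)
```

The p-value in the overall row tests whether the slope differs from zero, and the R-squared value summarizes how much of the variation in wage the model describes.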

The main target audiences for our analysis are real soccer players and soccer fanatics. Real soccer players will benefit the most. A professional player would want to learn which attribute to work on in order to improve not only as a player but also in salary. We hope that our analysis can help players direct their workout regimen by showing which attributes are truly considered valuable in soccer and to a club. Soccer fanatics will also benefit, as the analysis will help them understand which characteristics to look for when evaluating potential transfer prospects for their team.

Packages

library(tidyverse)
library(dplyr)

At this point in the project, the only packages we have used are tidyverse and dplyr (dplyr is itself part of the tidyverse, so loading it separately is optional). We used them together mainly to trim a string in our data, a process shown later on. The pipe operator (%>%), which dplyr re-exports from the magrittr package, helped in chaining functions, and the str_sub function from stringr (part of the tidyverse) did the actual trimming. We also used dplyr elsewhere, for example to rename variables to something more user-friendly.

Data Preparation

We gathered our data from Kaggle, specifically from this link: https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset

The data contained two tables: one with players and one with information on their teams. We joined the two in Excel beforehand and import the result into R as a single file. The original players table had 90 variables, including all the attributes for all the players and general real-world information about them such as wage and value. The teams table originally had 14 columns, of which we kept 12 (all except the club name, which would be redundant, and TeamID). There is no explicit information on when this data was collected; however, we know it is fairly recent, since FIFA 22 came out on October 1st, 2021. It is also worth noting that the statistics are suggested mostly by a group of 6,000 volunteers led by EA's Head of Data Collection & Licensing.

All real-world professional soccer players included in the game are, as discussed above, described by a set of statistics and characteristics that determine how good they are in-game. The main indicator of a player’s quality is their overall score, which aggregates all their statistics. There are 34 statistics, each with values in the 0-99 interval: 5 describe goalkeeping abilities and 29 describe the abilities of an outfield player. Every player has all 34 stats, but goalkeeping abilities have no impact on an outfield player’s overall score, and vice versa.

Since the players are presented as cards, it would be difficult to show all the statistics on the face of the card. Thus, there are so-called “Face stats”: the card shows only six statistics to keep it readable. These statistics are linear combinations of the ones already mentioned. For example, instead of displaying both Acceleration and Sprint Speed, the card shows a pace statistic that is a weighted sum of the two. These variables can be found in our data with a Total suffix (for example, PaceTotal).
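To make the weighted-sum idea concrete, here is a small sketch using the Acceleration, SprintSpeed and PaceTotal values of the first three players in our data. The weights 0.45 and 0.55 are an assumption chosen for illustration; they are not documented in the dataset or confirmed by EA.

```r
# Values copied from the first three rows of our data
players <- data.frame(
  Name         = c("L. Messi", "R. Lewandowski", "K. Mbapp\u00e9"),
  Acceleration = c(91, 77, 97),
  SprintSpeed  = c(80, 79, 97),
  PaceTotal    = c(85, 78, 97)
)

# Assumed weights (hypothetical, for illustration only)
w_accel  <- 0.45
w_sprint <- 0.55

# Weighted sum of the two raw stats, rounded to a whole rating
players$PaceEstimate <- round(w_accel * players$Acceleration +
                              w_sprint * players$SprintSpeed)

print(players)
```

With these assumed weights the estimate agrees with PaceTotal for all three players, although three rows are far too few to confirm the weights in general.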

There is also a set of four categorical variables in our data describing players:

Skills - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pull off skill moves, like the famous roulette

Weak foot - Rated in stars, min 1 star, max 5 stars; determines a player’s ability to pass, shoot, etc. correctly with their non-preferred foot

Attacking work rate - Can be either High, Medium or Low; determines how much the player runs in attacking zones of the pitch

Defensive work rate - Can be either High, Medium or Low; determines how much the player runs in defensive zones of the pitch

We believe, after careful evaluation, that most variable names are self-explanatory, except for FKAccuracy, which stands for Free Kick Accuracy: how accurately one can shoot from a set piece. Also, position refers to a player’s preferred positions on the pitch, and the game recognizes 28 different ones.

We can get a better understanding of our data by taking a look at it in R, so let’s first load in the data:

fifa <- read.csv("players_fifa22.csv")

Now let’s take a look at the first few rows:

head(fifa, 3)

##       ID           Name           FullName Age Height Weight
## 1 158023       L. Messi       Lionel Messi  34    170     72
## 2 188545 R. Lewandowski Robert Lewandowski  32    185     81
## 3 231747     K. MbappÃ©     Kylian MbappÃ©  22    182     73
##                                           PhotoUrl Nationality Overall
## 1 https://cdn.sofifa.com/players/158/023/22_60.png   Argentina      93
## 2 https://cdn.sofifa.com/players/188/545/22_60.png      Poland      92
## 3 https://cdn.sofifa.com/players/231/747/22_60.png      France      91
##   Potential Growth TotalStats BaseStats Positions BestPosition
## 1        93      0       2219       462  RW,ST,CF           RW
## 2        92      0       2212       460        ST           ST
## 3        95      4       2175       470     ST,LW           ST
##                  Club  ValueEUR WageEUR ReleaseClause ClubPosition
## 1 Paris Saint-Germain  78000000  320000     144300000           RW
## 2  FC Bayern MÃ¼nchen 119500000  270000     197200000           ST
## 3 Paris Saint-Germain 194000000  230000     373500000           ST
##   ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 1          2023         30       2021  FALSE    Argentina               RW
## 2          2023          9       2014  FALSE       Poland               ST
## 3          2022          7       2018  FALSE       France               LW
##   NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 1             10          Left             5        4          4
## 2              9         Right             5        4          4
## 3             10         Right             4        4          5
##   AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 1            Medium               Low        85            92           91
## 2              High            Medium        78            92           79
## 3              High               Low        97            88           80
##   DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 1             95             34               65       85        95
## 2             85             44               82       71        95
## 3             92             36               77       78        93
##   HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 1              70           91      88        96    93         94          91
## 2              90           85      89        85    79         85          70
## 3              72           85      83        93    80         69          71
##   BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 1          96           91          80      91        94      95        86
## 2          88           77          79      77        93      82        90
## 3          91           97          97      92        93      83        86
##   Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 1      68      72       69        94         44            40          93
## 2      85      76       86        87         81            49          95
## 3      78      88       77        82         62            38          92
##   Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 1     95        75        96      20             35            24        6
## 2     81        90        88      35             42            19       15
## 3     82        79        88      26             34            32       13
##   GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 1         11        15            14          8       92       92       93
## 2          6        12             8         10       92       85       88
## 3          5         7            11          6       91       90       90
##   CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 1       93       93       92        93       93       90       93        69
## 2       88       88       85        89       87       83       87        67
## 3       90       90       90        92       92       84       92        70
##   CDMRating RWBRating LBRating CBRating RBRating GKRating
## 1        67        69       64       53       64       22
## 2        69        67       64       63       64       22
## 3        66        70       66       57       66       21
##                     League LeagueId Overall.1 Attack Midfield Defence
## 1       French Ligue 1 (1)       16        86     89       83      85
## 2 German 1. Bundesliga (1)       19        84     92       85      81
## 3       French Ligue 1 (1)       16        86     89       83      85
##   TransferBudget DomesticPrestige IntPrestige Players StartingAverageAge
## 1      160000000               10           9      33                 28
## 2      100000000               10          10      28               26.6
## 3      160000000               10           9      33                 28
##   AllTeamAverageAge
## 1              25.9
## 2              24.8
## 3              25.9

As we can see from the table above, some of the names in our data contain garbled characters (for example, the accented letters in row 3), which indicates an encoding problem. We can fix this by reloading the data and specifying the UTF-8 encoding. We can check that this worked by looking at the third row:

fifa <- read.csv("players_fifa22.csv", encoding="UTF-8", stringsAsFactors=FALSE)

fifa[3,]

##       ID      Name      FullName Age Height Weight
## 3 231747 K. Mbappé Kylian Mbappé  22    182     73
##                                           PhotoUrl Nationality Overall
## 3 https://cdn.sofifa.com/players/231/747/22_60.png      France      91
##   Potential Growth TotalStats BaseStats Positions BestPosition
## 3        95      4       2175       470     ST,LW           ST
##                  Club  ValueEUR WageEUR ReleaseClause ClubPosition
## 3 Paris Saint-Germain 194000000  230000     373500000           ST
##   ContractUntil ClubNumber ClubJoined OnLoad NationalTeam NationalPosition
## 3          2022          7       2018  FALSE       France               LW
##   NationalNumber PreferredFoot IntReputation WeakFoot SkillMoves
## 3             10         Right             4        4          5
##   AttackingWorkRate DefensiveWorkRate PaceTotal ShootingTotal PassingTotal
## 3              High               Low        97            88           80
##   DribblingTotal DefendingTotal PhysicalityTotal Crossing Finishing
## 3             92             36               77       78        93
##   HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing
## 3              72           85      83        93    80         69          71
##   BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower
## 3          91           97          97      92        93      83        86
##   Jumping Stamina Strength LongShots Aggression Interceptions Positioning
## 3      78      88       77        82         62            38          92
##   Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving
## 3     82        79        88      26             34            32       13
##   GKHandling GKKicking GKPositioning GKReflexes STRating LWRating LFRating
## 3          5         7            11          6       91       90       90
##   CFRating RFRating RWRating CAMRating LMRating CMRating RMRating LWBRating
## 3       90       90       90        92       92       84       92        70
##   CDMRating RWBRating LBRating CBRating RBRating GKRating             League
## 3        66        70       66       57       66       21 French Ligue 1 (1)
##   LeagueId Overall.1 Attack Midfield Defence TransferBudget DomesticPrestige
## 3       16        86     89       83      85      160000000               10
##   IntPrestige Players StartingAverageAge AllTeamAverageAge
## 3           9      33                 28              25.9

As the output above shows, the accented names now display correctly. Next, we can look at the number of missing values in each column:

sapply(fifa, function(x) sum(is.na(x)))

##                 ID               Name           FullName                Age 
##                  0                  0                  0                  0 
##             Height             Weight           PhotoUrl        Nationality 
##                  0                  0                  0                  0 
##            Overall          Potential             Growth         TotalStats 
##                  0                  0                  0                  0 
##          BaseStats          Positions       BestPosition               Club 
##                  0                  0                  0                  0 
##           ValueEUR            WageEUR      ReleaseClause       ClubPosition 
##                  0                  0                  0                  0 
##      ContractUntil         ClubNumber         ClubJoined             OnLoad 
##                 70                 70                  0                  0 
##       NationalTeam   NationalPosition     NationalNumber      PreferredFoot 
##                  0                  0              18491                  0 
##      IntReputation           WeakFoot         SkillMoves  AttackingWorkRate 
##                  0                  0                  0                  0 
##  DefensiveWorkRate          PaceTotal      ShootingTotal       PassingTotal 
##                  0                  0                  0                  0 
##     DribblingTotal     DefendingTotal   PhysicalityTotal           Crossing 
##                  0                  0                  0                  0 
##          Finishing    HeadingAccuracy       ShortPassing            Volleys 
##                  0                  0                  0                  0 
##          Dribbling              Curve         FKAccuracy        LongPassing 
##                  0                  0                  0                  0 
##        BallControl       Acceleration        SprintSpeed            Agility 
##                  0                  0                  0                  0 
##          Reactions            Balance          ShotPower            Jumping 
##                  0                  0                  0                  0 
##            Stamina           Strength          LongShots         Aggression 
##                  0                  0                  0                  0 
##      Interceptions        Positioning             Vision          Penalties 
##                  0                  0                  0                  0 
##          Composure            Marking     StandingTackle      SlidingTackle 
##                  0                  0                  0                  0 
##           GKDiving         GKHandling          GKKicking      GKPositioning 
##                  0                  0                  0                  0 
##         GKReflexes           STRating           LWRating           LFRating 
##                  0                  0                  0                  0 
##           CFRating           RFRating           RWRating          CAMRating 
##                  0                  0                  0                  0 
##           LMRating           CMRating           RMRating          LWBRating 
##                  0                  0                  0                  0 
##          CDMRating          RWBRating           LBRating           CBRating 
##                  0                  0                  0                  0 
##           RBRating           GKRating             League           LeagueId 
##                  0                  0                  0                  0 
##          Overall.1             Attack           Midfield            Defence 
##                  0                  0                  0                  0 
##     TransferBudget   DomesticPrestige        IntPrestige            Players 
##                  0                  0                  0                  0 
## StartingAverageAge  AllTeamAverageAge 
##                  0                  0

We can see from the output above that our data is already in good shape, with few columns containing missing values. The columns with missing values are ContractUntil, ClubNumber, and NationalNumber, and we believe no action needs to be taken for them right now. There are valid reasons why these data points could be blank: some players’ contract information or club number may not have been released yet. Additionally, most professional soccer players never get the opportunity to play for their national team, which explains why we have so many NA values in the NationalNumber column.
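Because the full listing is long, it can help to show only the columns that actually contain missing values. The sketch below does this on a small invented data frame; the toy values are hypothetical, and applying the same lines to fifa is left as an assumption.

```r
# Toy data frame with the same kind of gaps as our data (values invented)
toy <- data.frame(
  Name           = c("A", "B", "C", "D"),
  ClubNumber     = c(10, NA, 7, 9),
  NationalNumber = c(NA, NA, 10, NA)
)

# Count and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

# Keep only the columns that actually contain missing values
na_summary <- data.frame(
  missing = na_count[na_count > 0],
  share   = round(na_share[na_count > 0], 2)
)

print(na_summary)
```

Reporting the share alongside the count makes it easier to judge whether a column such as NationalNumber is usable at all.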

In terms of formatting, we got a decent idea of what the data looks like from the head() function we ran above. One thing that looks odd is the League column: every entry in it has a (1) after it. We can remove this suffix and clean the data with the str_sub function from stringr (loaded with the tidyverse), which here drops the last four characters of each entry. We will create a column called TeamLeague with the trimmed league names.

fifa <- fifa %>%
  mutate(TeamLeague= str_sub(League, 1, -5))
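The fixed-width trim above relies on the suffix always being exactly four characters long. As a hedged alternative, the sketch below strips a trailing parenthesized number with a regular expression, using base R's sub(). Only the first label comes from our data; the other two are hypothetical variations.

```r
# Example league labels; the second and third are hypothetical variations
league_raw <- c("French Ligue 1 (1)",
                "English League Championship (2)",
                "Some League (10)")

# Remove a trailing " (<digits>)" regardless of how many digits it has
league_clean <- sub(" \\(\\d+\\)$", "", league_raw)

# Fixed-width trim for comparison: always drops the last four characters
league_fixed <- substr(league_raw, 1, nchar(league_raw) - 4)

print(league_clean)
print(league_fixed)
```

For a two-digit suffix such as (10), the fixed-width version leaves a trailing space behind, while the pattern-based version does not.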

Now that the league name has been corrected in a new column, we can remove the previous League column, as done in the code below:

fifa <- select(fifa, -(League))

Our team preferred the name League to TeamLeague, so we have changed the name of the new column to that of the previous column.

fifa <- rename(fifa, League = TeamLeague)

The only other naming issue in our data is the team overall column, which is currently labeled Overall.1. This name isn’t friendly to a new user of our data, so we can change it to TeamOverall, a variable that indicates the overall rating given to a whole team.

fifa <- rename(fifa, TeamOverall = Overall.1)

As we are renaming columns, we can also rename a few columns to help our readers get a better understanding of units of measure for those metrics.

fifa <- rename(fifa, 'Weight (KG)' = Weight)
fifa <- rename(fifa, 'Height (CM)' = Height)

There are also some unnecessary columns in our data that won’t further our analysis, so we can remove them to reduce any model’s computational expense. Specifically, we will remove the PhotoUrl and OnLoad variables.

# Drop by name rather than by position so the code keeps working if the column order changes
fifa <- select(fifa, -PhotoUrl, -OnLoad)

Finally, we noticed that some of the positional titles were too specific for our analysis. We hope to analyze general positional groups, so excessive detail regarding positions is unnecessary for us, especially since some of the titles are not commonly used in typical soccer jargon.

#All the unique positions
unique(fifa$BestPosition)

##  [1] "RW"  "ST"  "GK"  "CM"  "LW"  "CDM" "CF"  "LM"  "CB"  "CAM" "RB"  "LB" 
## [13] "RM"  "LWB" "RWB"

#Edit into more appropriate positional titles
#case_when() checks the conditions in order; anything unmatched falls through to "ST"
fifa <- fifa %>%
  mutate(BestPosition_new = case_when(
    BestPosition == "GK"                           ~ "GK",
    BestPosition %in% c("LCB", "CB", "RCB")        ~ "CB",
    BestPosition %in% c("LB", "LWB", "RWB", "RB")  ~ "LB/RB",
    BestPosition %in% c("LDM", "CDM", "RDM")       ~ "DM",
    BestPosition %in% c("LCM", "CM", "RCM")        ~ "CM",
    BestPosition %in% c("LM", "RM")                ~ "LM/RM",
    BestPosition %in% c("LAM", "CAM", "RAM")       ~ "AM",
    BestPosition %in% c("LW", "RW")                ~ "LW/RW",
    BestPosition %in% c("LF", "CF", "RF")          ~ "CF",
    TRUE                                           ~ "ST"
  ))
fifa$BestPosition_new <- factor(fifa$BestPosition_new, levels = c("GK", "CB", "LB/RB", "DM", "CM",
                                                  "LM/RM", "AM", "LW/RW", "CF", "ST"))

#check if our change worked
levels(fifa$BestPosition_new)

##  [1] "GK"    "CB"    "LB/RB" "DM"    "CM"    "LM/RM" "AM"    "LW/RW" "CF"   
## [10] "ST"

At this point, we should have addressed all the issues with our data and put ourselves in a good position for further analysis. We can check that all our changes have gone through by running the head function again. We will only show the first row this time, as it is representative of our data set and keeps the output short.

head(fifa, 1)

##       ID     Name     FullName Age Height (CM) Weight (KG) Nationality Overall
## 1 158023 L. Messi Lionel Messi  34         170          72   Argentina      93
##   Potential Growth TotalStats BaseStats Positions BestPosition
## 1        93      0       2219       462  RW,ST,CF           RW
##                  Club ValueEUR WageEUR ReleaseClause ClubPosition ContractUntil
## 1 Paris Saint-Germain 78000000  320000     144300000           RW          2023
##   ClubNumber ClubJoined NationalTeam NationalPosition NationalNumber
## 1         30       2021    Argentina               RW             10
##   PreferredFoot IntReputation WeakFoot SkillMoves AttackingWorkRate
## 1          Left             5        4          4            Medium
##   DefensiveWorkRate PaceTotal ShootingTotal PassingTotal DribblingTotal
## 1               Low        85            92           91             95
##   DefendingTotal PhysicalityTotal Crossing Finishing HeadingAccuracy
## 1             34               65       85        95              70
##   ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing BallControl
## 1           91      88        96    93         94          91          96
##   Acceleration SprintSpeed Agility Reactions Balance ShotPower Jumping Stamina
## 1           91          80      91        94      95        86      68      72
##   Strength LongShots Aggression Interceptions Positioning Vision Penalties
## 1       69        94         44            40          93     95        75
##   Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking
## 1        96      20             35            24        6         11        15
##   GKPositioning GKReflexes STRating LWRating LFRating CFRating RFRating
## 1            14          8       92       92       93       93       93
##   RWRating CAMRating LMRating CMRating RMRating LWBRating CDMRating RWBRating
## 1       92        93       93       90       93        69        67        69
##   LBRating CBRating RBRating GKRating LeagueId TeamOverall Attack Midfield
## 1       64       53       64       22       16          86     89       83
##   Defence TransferBudget DomesticPrestige IntPrestige Players
## 1      85      160000000               10           9      33
##   StartingAverageAge AllTeamAverageAge         League BestPosition_new
## 1                 28              25.9 French Ligue 1            LW/RW

As we can see from the result above, our data looks good.

Lastly, we can take a look at the summary for a few of the important variables in our data.

fifa.summary <- fifa %>%
  select(Overall, WageEUR, ValueEUR, PaceTotal, DribblingTotal, ShootingTotal, WeakFoot, SkillMoves) %>%
  # across() replaces the deprecated summarise_each()/funs(); columns are named <variable>_<statistic>
  summarise(across(everything(),
                   list(min = min,
                        q25 = ~ quantile(.x, 0.25),
                        median = median,
                        q75 = ~ quantile(.x, 0.75),
                        max = max,
                        mean = mean)))

#We can reshape this data to make it look better

fifa.summary_reformat <- fifa.summary %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean)

print(fifa.summary_reformat)

##              var min    q25 median     q75      max         mean
## 1 DribblingTotal  26     58     64      69 9.50e+01 6.296784e+01
## 2        Overall  47     61     66      70 9.30e+01 6.576922e+01
## 3      PaceTotal  28     62     68      75 9.70e+01 6.787994e+01
## 4  ShootingTotal  18     44     56      64 9.40e+01 5.346088e+01
## 5     SkillMoves   1      2      2       3 5.00e+00 2.352816e+00
## 6       ValueEUR   0 475000 975000 2000000 1.94e+08 2.854488e+06
## 7        WageEUR   0   1000   3000    8000 3.50e+05 8.858094e+03
## 8       WeakFoot   1      3      3       3 5.00e+00 2.945501e+00

The result above is a consolidated table of summary statistics for some of the variables we consider valuable. Ratings for players’ weak foot and skill moves range from 1 to 5, and the median skill moves rating is 2, below the middle rating of 3. Wages and player values range widely, with some values as low as 0. Although these 0s may not be accurate, it is possible that the information isn’t public or that the true figures are very low: the game includes some very small teams, whose players may have wages and values far below those of the best players in the world. In such cases, a 0 appears to be used as a placeholder. We have not removed these rows or values because doing so would discard a lot of important data, and imputing the median would distort our picture of the data. This is simply a fact we will have to keep in mind as we approach our analysis. Finally, the players in the game range widely in overall ability, from 47 to 93.

Proposed Exploratory Data Analysis

We hope to uncover information in the data that is not self-evident by slicing the data in various ways and by creating new variables that help us understand it. For example, we plan to create a continent column based on each player’s nationality, so that we can compare players from different continents across categories. We could also create groupings of positions such as defenders, midfielders, and attackers. We hope that these variables and views will help us understand some of the hidden trends in our data. At the end, we hope to summarize these slices and trends with visualizations that compare the categories and variables we created.
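One simple way to build such a continent column is a lookup table keyed by nationality. The sketch below is only an illustration: the lookup covers three countries, the player names are placeholders, and "Atlantis" is a made-up nationality included to show what happens when a country is missing from the table.

```r
# Hypothetical lookup table: only a few countries are listed for illustration
continent_lookup <- c(
  Argentina = "South America",
  Poland    = "Europe",
  France    = "Europe"
)

toy <- data.frame(
  Name        = c("Player A", "Player B", "Player C", "Player D"),
  Nationality = c("Argentina", "Poland", "France", "Atlantis"),
  stringsAsFactors = FALSE
)

# Index the named vector by nationality; countries not in the lookup become NA
toy$Continent <- unname(continent_lookup[toy$Nationality])

print(toy)
```

Leaving unmatched countries as NA makes gaps in the lookup easy to spot before any continent-level summaries are computed.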

We hope to create many visualizations to help us illustrate the findings of our questions. Here are a few of our ideas:

Histogram to show the distribution of players’ statistics, both face stats and “raw” stats. It may give valuable insight into how certain stats behave in general, for example whether they follow a known distribution (such as the normal distribution).
Correlation plot between various statistics to showcase the relationships between significant variables.
Bar plot to show various stats across continents and positional groups.
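The numbers behind the first two ideas can be sketched as follows, using simulated ratings in place of the real columns. The means, spreads and the relationship between the simulated stats are all invented for illustration.

```r
# Simulated ratings (all parameters are invented for illustration)
set.seed(42)
n <- 300
pace      <- round(rnorm(n, mean = 68, sd = 10))
dribbling <- round(0.6 * pace + rnorm(n, mean = 22, sd = 6))
shooting  <- round(rnorm(n, mean = 53, sd = 13))
toy <- data.frame(PaceTotal = pace, DribblingTotal = dribbling, ShootingTotal = shooting)

# Pearson correlation matrix, the input to a correlation plot
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Histogram bins for one statistic; drop plot = FALSE to draw the histogram
h <- hist(toy$PaceTotal, plot = FALSE)
print(h$counts)
```

The correlation matrix can be passed to a plotting function of choice, and the histogram call draws directly when plot = FALSE is removed.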

Our group feels comfortable moving forward with all of these ideas.

In terms of machine learning techniques, our group plans on using a linear regression to better understand the relationship between independent and dependent variables.

Of course, before fitting a linear regression, we need to check whether it is even an appropriate model. First, we need to check for outliers, since those could heavily impact our analysis. Second, we need to check that the relationship between the variables is linear, an assumption we can assess with a scatterplot. Third, inference in linear regression assumes that the residuals are approximately normally distributed; this can be checked with a histogram, a Q-Q plot, or a Shapiro-Wilk test. Fourth, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other, and it can be tested with three central criteria:

Correlation matrix – when computing the matrix of Pearson’s bivariate correlations among all independent variables, the correlation coefficients should be well below 1 in absolute value.
Tolerance – the tolerance measures how much of one independent variable is not explained by the other independent variables; it is calculated from an initial linear regression of that variable on all the others. Tolerance is defined as T = 1 – R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly is.
Variance Inflation Factor (VIF) – the variance inflation factor is defined as VIF = 1/T. With VIF > 5 there is an indication that multicollinearity may be present; VIF > 10 (equivalently T < 0.1) is a common threshold for serious multicollinearity among the variables.
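The tolerance and VIF definitions above can be computed directly with base R. The sketch below uses three simulated predictors, of which x2 is deliberately built to be almost collinear with x1; the variable names and data are invented for illustration.

```r
# Simulated predictors (invented): x2 is almost collinear with x1, x3 is unrelated
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Tolerance of a predictor: 1 - R^2 from regressing it on all other predictors
tolerance <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 - summary(fit)$r.squared
})

# Variance inflation factor: the reciprocal of the tolerance
vif <- 1 / tolerance

print(round(rbind(tolerance, vif), 3))
```

In this simulated example x1 and x2 show high VIF values while x3 stays close to 1. Packages such as car also provide a vif() function for fitted models, which could be used instead of the manual loop.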

The last assumption of linear regression is homoscedasticity, meaning the variance of the residuals is constant along the regression line. A scatter plot of the residuals against the fitted values is a good way to check this: we want residuals with constant spread that do not form a pattern (such as a trumpet shape). If heteroscedasticity is present, a non-linear transformation of the variables might fix the problem.
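A rough sketch of the residual checks, on simulated data with constant error variance, is shown below. The data, the variable names and the idea of comparing residual spread between the lower and upper half of the fitted values are illustrative assumptions, not a formal test.

```r
# Simulated data with constant error variance (invented for illustration)
set.seed(3)
x <- runif(150, 50, 90)
y <- 2 * x + rnorm(150, sd = 5)

fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test on the residuals: a large p-value gives no evidence against normality
sw <- shapiro.test(res)

# Informal homoscedasticity check: compare residual spread for low and high fitted values
lower <- fitted(fit) <= median(fitted(fit))
spread_ratio <- sd(res[!lower]) / sd(res[lower])

print(sw$p.value)
print(spread_ratio)

# plot(fitted(fit), res) would draw the usual residuals-versus-fitted scatter plot
```

A spread ratio far from 1, or a trumpet shape in the residual plot, would suggest that the constant-variance assumption does not hold.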

We hope to build a model for each position or position group to better understand the trends in our data.

The other machine learning technique we may use is Principal Component Analysis (PCA). The main goal of PCA is to reduce the number of attributes so that they can be interpreted more easily as groups. We would also like to see whether reducing the number of dimensions to 6 (the number of Face stats) results in a grouping similar to the one proposed by EA. First, we would take all the attributes and scale them, as PCA is sensitive to the magnitude of the variables.

Apart from the reduction to 6 dimensions, it would be good to know the optimal number of dimensions according to the eigenvalues. One can rule out all components with eigenvalues below 1 (the Kaiser criterion), keep as many components as are needed to explain a chosen share of the variance, use the MAP test, or use parallel analysis.
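The scaling step and the Kaiser criterion can be sketched with base R's prcomp(). The four columns below borrow attribute names from our data, but their values are simulated from two invented underlying factors, so the result says nothing about the real ratings.

```r
# Simulated attributes driven by two invented latent factors
set.seed(11)
n     <- 300
speed <- rnorm(n)
skill <- rnorm(n)
toy <- data.frame(
  Acceleration = 70 + 8 * speed + rnorm(n, sd = 2),
  SprintSpeed  = 70 + 8 * speed + rnorm(n, sd = 2),
  Dribbling    = 65 + 9 * skill + rnorm(n, sd = 2),
  BallControl  = 65 + 9 * skill + rnorm(n, sd = 2)
)

# Center and scale the variables so that no attribute dominates through its magnitude
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared component standard deviations
eigenvalues <- pca$sdev^2

# Kaiser criterion: keep the components whose eigenvalue exceeds 1
n_keep <- sum(eigenvalues > 1)

# Share of the total variance explained by each component
explained <- eigenvalues / sum(eigenvalues)

print(round(eigenvalues, 3))
print(n_keep)
print(round(cumsum(explained), 3))
```

The cumulative shares in the last line support the alternative rule of keeping enough components to explain a chosen amount of variance.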

When it comes to Face stats, we only have 6 variables to begin with but nonetheless we want to reduce their number even more.

Overall, the only machine learning techniques we hope to use are a Linear Regression and/or a PCA to better understand trends in our data dependent on their significance.

FIFA 22 Data: Mid-Term Project

Group 1: Mrudul Nekkanti, Hetal Sawant, Rishi Ambani

11/7/2021
