In this document, I will try to group players from FIFA 20 into clusters based on their attributes, using the k-means algorithm in R.

K-means is an unsupervised machine learning algorithm that partitions observations into clusters based on their attributes. The user has to specify how many clusters to create. Given that number, the algorithm randomly picks an initial ‘centroid’ for each cluster and assigns every observation to its nearest centroid. Each centroid is then recomputed as the mean of the observations assigned to it, and the observations are reassigned to whichever centroid is now closest. This repeats until the assignments no longer change from one iteration to the next.
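To make the loop concrete (assign each point to its nearest centroid, recompute each centroid as the mean of its points, repeat until nothing moves), here is a minimal sketch on made-up two-dimensional data. It is only an illustration and not how R's kmeans() works internally, whose default is the Hartigan-Wong algorithm. The toy matrix `x` and all variable names are my own assumptions.

```r
# Toy data: two well-separated groups of 2-D points (made up for illustration)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

k <- 2
# Step 1: pick k observations at random as the starting centroids
centroids <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:100) {
  # Step 2: squared Euclidean distance from every point to every centroid
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
  assignment <- max.col(-d, ties.method = "first")  # index of nearest centroid

  # Step 3: each centroid becomes the mean of the points assigned to it
  # (empty clusters are not handled in this sketch)
  new_centroids <- t(sapply(seq_len(k), function(j)
    colMeans(x[assignment == j, , drop = FALSE])))

  # Step 4: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}

table(assignment)
```

In practice I would still call kmeans() rather than write the loop by hand. The sketch is only meant to show what one iteration does.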

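Because the result depends on the random starting centroids, the algorithm is commonly run several times, and the solution with the least within-cluster variation is kept. In R's kmeans(), the nstart argument sets the number of random starts; its default is 1. Further details of how k-means works are beyond the scope of this document.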
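The effect of repeated random starts can be tried out with the `nstart` argument of kmeans(). The sketch below uses made-up data with four loose groups and compares the total within-cluster sum of squares from a single start with the best of 25 starts. The data and the choice of 25 are assumptions for illustration only.

```r
set.seed(42)
# Toy data with four loose groups (made up for illustration)
centers <- rbind(c(0, 0), c(4, 0), c(0, 4), c(4, 4))
x <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))))

single <- kmeans(x, centers = 4, nstart = 1)   # one random start (the default)
multi  <- kmeans(x, centers = 4, nstart = 25)  # best of 25 random starts

# Lower is better: total within-cluster sum of squares
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

The multi-start value is typically no larger than the single-start value, although on easy data the two often coincide.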

One important question in k-means clustering is how many clusters the observations should be divided into. The trade-off between within-cluster variance and the number of clusters helps determine the optimal number. However, in this document, we will set the number of clusters to three. This is because there are three major roles in football, i.e. goalkeeping, defending and attacking, which is the most important classification of any individual player. (Note that midfield players can be either defending or attacking. Their roles usually overlap, hence I have not made a separate category for them.)
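For reference, the variance versus cluster-number trade-off is often examined with the so-called elbow method: run k-means for a range of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below does this on made-up data with three groups; the data, the range 1 to 8 and `nstart = 10` are assumptions chosen for illustration.

```r
set.seed(7)
# Toy data with three groups (made up for illustration)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

ks  <- 1:8
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

data.frame(k = ks, wss = round(wss, 1))
```

Calling `plot(ks, wss, type = "b")` afterwards draws the curve; the "elbow" is where adding another cluster stops reducing the variation by much.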

The FIFA 20 data set is available on Kaggle.

library(readr)
Fifa20 <- read_csv("fifa-20-complete-player-dataset/players_20.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   player_url = col_character(),
##   short_name = col_character(),
##   long_name = col_character(),
##   dob = col_date(format = ""),
##   nationality = col_character(),
##   club = col_character(),
##   player_positions = col_character(),
##   preferred_foot = col_character(),
##   work_rate = col_character(),
##   body_type = col_character(),
##   real_face = col_character(),
##   player_tags = col_character(),
##   team_position = col_character(),
##   loaned_from = col_character(),
##   joined = col_date(format = ""),
##   nation_position = col_character(),
##   player_traits = col_character(),
##   ls = col_character(),
##   st = col_character(),
##   rs = col_character()
##   # ... with 23 more columns
## )
## See spec(...) for full column specifications.
head(Fifa20,15)
## # A tibble: 15 x 104
##    sofifa_id player_url short_name long_name   age dob        height_cm
##        <dbl> <chr>      <chr>      <chr>     <dbl> <date>         <dbl>
##  1    158023 https://s~ L. Messi   Lionel A~    32 1987-06-24       170
##  2     20801 https://s~ Cristiano~ Cristian~    34 1985-02-05       187
##  3    190871 https://s~ Neymar Jr  Neymar d~    27 1992-02-05       175
##  4    200389 https://s~ J. Oblak   Jan Oblak    26 1993-01-07       188
##  5    183277 https://s~ E. Hazard  Eden Haz~    28 1991-01-07       175
##  6    192985 https://s~ K. De Bru~ Kevin De~    28 1991-06-28       181
##  7    192448 https://s~ M. ter St~ Marc-And~    27 1992-04-30       187
##  8    203376 https://s~ V. van Di~ Virgil v~    27 1991-07-08       193
##  9    177003 https://s~ L. Modric  Luka Mod~    33 1985-09-09       172
## 10    209331 https://s~ M. Salah   Mohamed ~    27 1992-06-15       175
## 11    231747 https://s~ K. Mbappé  Kylian M~    20 1998-12-20       178
## 12    201024 https://s~ K. Koulib~ Kalidou ~    28 1991-06-20       187
## 13    202126 https://s~ H. Kane    Harry Ka~    25 1993-07-28       188
## 14    212831 https://s~ Alisson    Alisson ~    26 1992-10-02       191
## 15    193080 https://s~ De Gea     David De~    28 1990-11-07       192
## # ... with 97 more variables: weight_kg <dbl>, nationality <chr>, club <chr>,
## #   overall <dbl>, potential <dbl>, value_eur <dbl>, wage_eur <dbl>,
## #   player_positions <chr>, preferred_foot <chr>,
## #   international_reputation <dbl>, weak_foot <dbl>, skill_moves <dbl>,
## #   work_rate <chr>, body_type <chr>, real_face <chr>,
## #   release_clause_eur <dbl>, player_tags <chr>, team_position <chr>,
## #   team_jersey_number <dbl>, loaned_from <chr>, joined <date>,
## #   contract_valid_until <dbl>, nation_position <chr>,
## #   nation_jersey_number <dbl>, pace <dbl>, shooting <dbl>, passing <dbl>,
## #   dribbling <dbl>, defending <dbl>, physic <dbl>, gk_diving <dbl>,
## #   gk_handling <dbl>, gk_kicking <dbl>, gk_reflexes <dbl>, gk_speed <dbl>,
## #   gk_positioning <dbl>, player_traits <chr>, attacking_crossing <dbl>,
## #   attacking_finishing <dbl>, attacking_heading_accuracy <dbl>,
## #   attacking_short_passing <dbl>, attacking_volleys <dbl>,
## #   skill_dribbling <dbl>, skill_curve <dbl>, skill_fk_accuracy <dbl>,
## #   skill_long_passing <dbl>, skill_ball_control <dbl>,
## #   movement_acceleration <dbl>, movement_sprint_speed <dbl>,
## #   movement_agility <dbl>, movement_reactions <dbl>, movement_balance <dbl>,
## #   power_shot_power <dbl>, power_jumping <dbl>, power_stamina <dbl>,
## #   power_strength <dbl>, power_long_shots <dbl>, mentality_aggression <dbl>,
## #   mentality_interceptions <dbl>, mentality_positioning <dbl>,
## #   mentality_vision <dbl>, mentality_penalties <dbl>,
## #   mentality_composure <dbl>, defending_marking <dbl>,
## #   defending_standing_tackle <dbl>, defending_sliding_tackle <dbl>,
## #   goalkeeping_diving <dbl>, goalkeeping_handling <dbl>,
## #   goalkeeping_kicking <dbl>, goalkeeping_positioning <dbl>,
## #   goalkeeping_reflexes <dbl>, ls <chr>, st <chr>, rs <chr>, lw <chr>,
## #   lf <chr>, cf <chr>, rf <chr>, rw <chr>, lam <chr>, cam <chr>, ram <chr>,
## #   lm <chr>, lcm <chr>, cm <chr>, rcm <chr>, rm <chr>, lwb <chr>, ldm <chr>,
## #   cdm <chr>, rdm <chr>, rwb <chr>, lb <chr>, lcb <chr>, cb <chr>, rcb <chr>,
## #   rb <chr>
str(Fifa20)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18278 obs. of  104 variables:
##  $ sofifa_id                 : num  158023 20801 190871 200389 183277 ...
##  $ player_url                : chr  "https://sofifa.com/player/158023/lionel-messi/20/159586" "https://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/20/159586" "https://sofifa.com/player/190871/neymar-da-silva-santos-jr/20/159586" "https://sofifa.com/player/200389/jan-oblak/20/159586" ...
##  $ short_name                : chr  "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "J. Oblak" ...
##  $ long_name                 : chr  "Lionel Andrés Messi Cuccittini" "Cristiano Ronaldo dos Santos Aveiro" "Neymar da Silva Santos Junior" "Jan Oblak" ...
##  $ age                       : num  32 34 27 26 28 28 27 27 33 27 ...
##  $ dob                       : Date, format: "1987-06-24" "1985-02-05" ...
##  $ height_cm                 : num  170 187 175 188 175 181 187 193 172 175 ...
##  $ weight_kg                 : num  72 83 68 87 74 70 85 92 66 71 ...
##  $ nationality               : chr  "Argentina" "Portugal" "Brazil" "Slovenia" ...
##  $ club                      : chr  "FC Barcelona" "Juventus" "Paris Saint-Germain" "Atlético Madrid" ...
##  $ overall                   : num  94 93 92 91 91 91 90 90 90 90 ...
##  $ potential                 : num  94 93 92 93 91 91 93 91 90 90 ...
##  $ value_eur                 : num  9.55e+07 5.85e+07 1.06e+08 7.75e+07 9.00e+07 ...
##  $ wage_eur                  : num  565000 405000 290000 125000 470000 370000 250000 200000 340000 240000 ...
##  $ player_positions          : chr  "RW, CF, ST" "ST, LW" "LW, CAM" "GK" ...
##  $ preferred_foot            : chr  "Left" "Right" "Right" "Right" ...
##  $ international_reputation  : num  5 5 5 3 4 4 3 3 4 3 ...
##  $ weak_foot                 : num  4 4 5 3 4 5 4 3 4 3 ...
##  $ skill_moves               : num  4 5 5 1 4 4 1 2 4 4 ...
##  $ work_rate                 : chr  "Medium/Low" "High/Low" "High/Medium" "Medium/Medium" ...
##  $ body_type                 : chr  "Messi" "C. Ronaldo" "Neymar" "Normal" ...
##  $ real_face                 : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ release_clause_eur        : num  1.96e+08 9.65e+07 1.95e+08 1.65e+08 1.84e+08 ...
##  $ player_tags               : chr  "#Dribbler, #Distance Shooter, #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Forward" "#Speedster, #Dribbler, #Distance Shooter, #Acrobat, #Clinical Finisher, #Complete Forward" "#Speedster, #Dribbler, #Playmaker  , #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Midfield"| __truncated__ NA ...
##  $ team_position             : chr  "RW" "LW" "CAM" "GK" ...
##  $ team_jersey_number        : num  10 7 10 13 7 17 1 4 10 11 ...
##  $ loaned_from               : chr  NA NA NA NA ...
##  $ joined                    : Date, format: "2004-07-01" "2018-07-10" ...
##  $ contract_valid_until      : num  2021 2022 2022 2023 2024 ...
##  $ nation_position           : chr  NA "LS" "LW" "GK" ...
##  $ nation_jersey_number      : num  NA 7 10 1 10 7 22 4 NA 10 ...
##  $ pace                      : num  87 90 91 NA 91 76 NA 77 74 93 ...
##  $ shooting                  : num  92 93 85 NA 83 86 NA 60 76 86 ...
##  $ passing                   : num  92 82 87 NA 86 92 NA 70 89 81 ...
##  $ dribbling                 : num  96 89 95 NA 94 86 NA 71 89 89 ...
##  $ defending                 : num  39 35 32 NA 35 61 NA 90 72 45 ...
##  $ physic                    : num  66 78 58 NA 66 78 NA 86 66 74 ...
##  $ gk_diving                 : num  NA NA NA 87 NA NA 88 NA NA NA ...
##  $ gk_handling               : num  NA NA NA 92 NA NA 85 NA NA NA ...
##  $ gk_kicking                : num  NA NA NA 78 NA NA 88 NA NA NA ...
##  $ gk_reflexes               : num  NA NA NA 89 NA NA 90 NA NA NA ...
##  $ gk_speed                  : num  NA NA NA 52 NA NA 45 NA NA NA ...
##  $ gk_positioning            : num  NA NA NA 90 NA NA 88 NA NA NA ...
##  $ player_traits             : chr  "Beat Offside Trap, Argues with Officials, Early Crosser, Finesse Shot, Speed Dribbler (CPU AI Only), 1-on-1 Rus"| __truncated__ "Long Throw-in, Selfish, Argues with Officials, Early Crosser, Speed Dribbler (CPU AI Only), Skilled Dribbling" "Power Free-Kick, Injury Free, Selfish, Early Crosser, Speed Dribbler (CPU AI Only), Crowd Favourite" "Flair, Acrobatic Clearance" ...
##  $ attacking_crossing        : num  88 84 87 13 81 93 18 53 86 79 ...
##  $ attacking_finishing       : num  95 94 87 11 84 82 14 52 72 90 ...
##  $ attacking_heading_accuracy: num  70 89 62 15 61 55 11 86 55 59 ...
##  $ attacking_short_passing   : num  92 83 87 43 89 92 61 78 92 84 ...
##  $ attacking_volleys         : num  88 87 87 13 83 82 14 45 76 79 ...
##  $ skill_dribbling           : num  97 89 96 12 95 86 21 70 87 89 ...
##  $ skill_curve               : num  93 81 88 13 83 85 18 60 85 83 ...
##  $ skill_fk_accuracy         : num  94 76 87 14 79 83 12 70 78 69 ...
##  $ skill_long_passing        : num  92 77 81 40 83 91 63 81 88 75 ...
##  $ skill_ball_control        : num  96 92 95 30 94 91 30 76 92 89 ...
##  $ movement_acceleration     : num  91 89 94 43 94 77 38 74 77 94 ...
##  $ movement_sprint_speed     : num  84 91 89 60 88 76 50 79 71 92 ...
##  $ movement_agility          : num  93 87 96 67 95 78 37 61 92 91 ...
##  $ movement_reactions        : num  95 96 92 88 90 91 86 88 89 92 ...
##  $ movement_balance          : num  95 71 84 49 94 76 43 53 93 88 ...
##  $ power_shot_power          : num  86 95 80 59 82 91 66 81 79 80 ...
##  $ power_jumping             : num  68 95 61 78 56 63 79 90 68 69 ...
##  $ power_stamina             : num  75 85 81 41 84 89 35 75 85 85 ...
##  $ power_strength            : num  68 78 49 78 63 74 78 92 58 73 ...
##  $ power_long_shots          : num  94 93 84 12 80 90 10 64 82 84 ...
##  $ mentality_aggression      : num  48 63 51 34 54 76 43 82 62 63 ...
##  $ mentality_interceptions   : num  40 29 36 19 41 61 22 89 82 55 ...
##  $ mentality_positioning     : num  94 95 87 11 87 88 11 47 79 92 ...
##  $ mentality_vision          : num  94 82 90 65 89 94 70 65 91 84 ...
##  $ mentality_penalties       : num  75 85 90 11 88 79 25 62 82 77 ...
##  $ mentality_composure       : num  96 95 94 68 91 91 70 89 92 91 ...
##  $ defending_marking         : num  33 28 27 27 34 68 25 91 68 38 ...
##  $ defending_standing_tackle : num  37 32 26 12 27 58 13 92 76 43 ...
##  $ defending_sliding_tackle  : num  26 24 29 18 22 51 10 85 71 41 ...
##  $ goalkeeping_diving        : num  6 7 9 87 11 15 88 13 13 14 ...
##  $ goalkeeping_handling      : num  11 11 9 92 12 13 85 10 9 14 ...
##  $ goalkeeping_kicking       : num  15 15 15 78 6 5 88 13 7 9 ...
##  $ goalkeeping_positioning   : num  14 14 15 90 8 10 88 11 14 11 ...
##  $ goalkeeping_reflexes      : num  8 11 11 89 8 13 90 11 9 14 ...
##  $ ls                        : chr  "89+2" "91+3" "84+3" NA ...
##  $ st                        : chr  "89+2" "91+3" "84+3" NA ...
##  $ rs                        : chr  "89+2" "91+3" "84+3" NA ...
##  $ lw                        : chr  "93+2" "89+3" "90+3" NA ...
##  $ lf                        : chr  "93+2" "90+3" "89+3" NA ...
##  $ cf                        : chr  "93+2" "90+3" "89+3" NA ...
##  $ rf                        : chr  "93+2" "90+3" "89+3" NA ...
##  $ rw                        : chr  "93+2" "89+3" "90+3" NA ...
##  $ lam                       : chr  "93+2" "88+3" "90+3" NA ...
##  $ cam                       : chr  "93+2" "88+3" "90+3" NA ...
##  $ ram                       : chr  "93+2" "88+3" "90+3" NA ...
##  $ lm                        : chr  "92+2" "88+3" "89+3" NA ...
##  $ lcm                       : chr  "87+2" "81+3" "82+3" NA ...
##  $ cm                        : chr  "87+2" "81+3" "82+3" NA ...
##  $ rcm                       : chr  "87+2" "81+3" "82+3" NA ...
##  $ rm                        : chr  "92+2" "88+3" "89+3" NA ...
##  $ lwb                       : chr  "68+2" "65+3" "66+3" NA ...
##  $ ldm                       : chr  "66+2" "61+3" "61+3" NA ...
##  $ cdm                       : chr  "66+2" "61+3" "61+3" NA ...
##  $ rdm                       : chr  "66+2" "61+3" "61+3" NA ...
##  $ rwb                       : chr  "68+2" "65+3" "66+3" NA ...
##   [list output truncated]
##  - attr(*, "spec")=
##   .. cols(
##   ..   sofifa_id = col_double(),
##   ..   player_url = col_character(),
##   ..   short_name = col_character(),
##   ..   long_name = col_character(),
##   ..   age = col_double(),
##   ..   dob = col_date(format = ""),
##   ..   height_cm = col_double(),
##   ..   weight_kg = col_double(),
##   ..   nationality = col_character(),
##   ..   club = col_character(),
##   ..   overall = col_double(),
##   ..   potential = col_double(),
##   ..   value_eur = col_double(),
##   ..   wage_eur = col_double(),
##   ..   player_positions = col_character(),
##   ..   preferred_foot = col_character(),
##   ..   international_reputation = col_double(),
##   ..   weak_foot = col_double(),
##   ..   skill_moves = col_double(),
##   ..   work_rate = col_character(),
##   ..   body_type = col_character(),
##   ..   real_face = col_character(),
##   ..   release_clause_eur = col_double(),
##   ..   player_tags = col_character(),
##   ..   team_position = col_character(),
##   ..   team_jersey_number = col_double(),
##   ..   loaned_from = col_character(),
##   ..   joined = col_date(format = ""),
##   ..   contract_valid_until = col_double(),
##   ..   nation_position = col_character(),
##   ..   nation_jersey_number = col_double(),
##   ..   pace = col_double(),
##   ..   shooting = col_double(),
##   ..   passing = col_double(),
##   ..   dribbling = col_double(),
##   ..   defending = col_double(),
##   ..   physic = col_double(),
##   ..   gk_diving = col_double(),
##   ..   gk_handling = col_double(),
##   ..   gk_kicking = col_double(),
##   ..   gk_reflexes = col_double(),
##   ..   gk_speed = col_double(),
##   ..   gk_positioning = col_double(),
##   ..   player_traits = col_character(),
##   ..   attacking_crossing = col_double(),
##   ..   attacking_finishing = col_double(),
##   ..   attacking_heading_accuracy = col_double(),
##   ..   attacking_short_passing = col_double(),
##   ..   attacking_volleys = col_double(),
##   ..   skill_dribbling = col_double(),
##   ..   skill_curve = col_double(),
##   ..   skill_fk_accuracy = col_double(),
##   ..   skill_long_passing = col_double(),
##   ..   skill_ball_control = col_double(),
##   ..   movement_acceleration = col_double(),
##   ..   movement_sprint_speed = col_double(),
##   ..   movement_agility = col_double(),
##   ..   movement_reactions = col_double(),
##   ..   movement_balance = col_double(),
##   ..   power_shot_power = col_double(),
##   ..   power_jumping = col_double(),
##   ..   power_stamina = col_double(),
##   ..   power_strength = col_double(),
##   ..   power_long_shots = col_double(),
##   ..   mentality_aggression = col_double(),
##   ..   mentality_interceptions = col_double(),
##   ..   mentality_positioning = col_double(),
##   ..   mentality_vision = col_double(),
##   ..   mentality_penalties = col_double(),
##   ..   mentality_composure = col_double(),
##   ..   defending_marking = col_double(),
##   ..   defending_standing_tackle = col_double(),
##   ..   defending_sliding_tackle = col_double(),
##   ..   goalkeeping_diving = col_double(),
##   ..   goalkeeping_handling = col_double(),
##   ..   goalkeeping_kicking = col_double(),
##   ..   goalkeeping_positioning = col_double(),
##   ..   goalkeeping_reflexes = col_double(),
##   ..   ls = col_character(),
##   ..   st = col_character(),
##   ..   rs = col_character(),
##   ..   lw = col_character(),
##   ..   lf = col_character(),
##   ..   cf = col_character(),
##   ..   rf = col_character(),
##   ..   rw = col_character(),
##   ..   lam = col_character(),
##   ..   cam = col_character(),
##   ..   ram = col_character(),
##   ..   lm = col_character(),
##   ..   lcm = col_character(),
##   ..   cm = col_character(),
##   ..   rcm = col_character(),
##   ..   rm = col_character(),
##   ..   lwb = col_character(),
##   ..   ldm = col_character(),
##   ..   cdm = col_character(),
##   ..   rdm = col_character(),
##   ..   rwb = col_character(),
##   ..   lb = col_character(),
##   ..   lcb = col_character(),
##   ..   cb = col_character(),
##   ..   rcb = col_character(),
##   ..   rb = col_character()
##   .. )

The file contains a lot of information that we do not need for clustering. Only the players’ attributes, such as pace, strength, passing, finishing and heading, will be kept.

k.Fifa <- Fifa20[,c(32:43,45:78)]
head(k.Fifa,25)
## # A tibble: 25 x 46
##     pace shooting passing dribbling defending physic gk_diving gk_handling
##    <dbl>    <dbl>   <dbl>     <dbl>     <dbl>  <dbl>     <dbl>       <dbl>
##  1    87       92      92        96        39     66        NA          NA
##  2    90       93      82        89        35     78        NA          NA
##  3    91       85      87        95        32     58        NA          NA
##  4    NA       NA      NA        NA        NA     NA        87          92
##  5    91       83      86        94        35     66        NA          NA
##  6    76       86      92        86        61     78        NA          NA
##  7    NA       NA      NA        NA        NA     NA        88          85
##  8    77       60      70        71        90     86        NA          NA
##  9    74       76      89        89        72     66        NA          NA
## 10    93       86      81        89        45     74        NA          NA
## # ... with 15 more rows, and 38 more variables: gk_kicking <dbl>,
## #   gk_reflexes <dbl>, gk_speed <dbl>, gk_positioning <dbl>,
## #   attacking_crossing <dbl>, attacking_finishing <dbl>,
## #   attacking_heading_accuracy <dbl>, attacking_short_passing <dbl>,
## #   attacking_volleys <dbl>, skill_dribbling <dbl>, skill_curve <dbl>,
## #   skill_fk_accuracy <dbl>, skill_long_passing <dbl>,
## #   skill_ball_control <dbl>, movement_acceleration <dbl>,
## #   movement_sprint_speed <dbl>, movement_agility <dbl>,
## #   movement_reactions <dbl>, movement_balance <dbl>, power_shot_power <dbl>,
## #   power_jumping <dbl>, power_stamina <dbl>, power_strength <dbl>,
## #   power_long_shots <dbl>, mentality_aggression <dbl>,
## #   mentality_interceptions <dbl>, mentality_positioning <dbl>,
## #   mentality_vision <dbl>, mentality_penalties <dbl>,
## #   mentality_composure <dbl>, defending_marking <dbl>,
## #   defending_standing_tackle <dbl>, defending_sliding_tackle <dbl>,
## #   goalkeeping_diving <dbl>, goalkeeping_handling <dbl>,
## #   goalkeeping_kicking <dbl>, goalkeeping_positioning <dbl>,
## #   goalkeeping_reflexes <dbl>
str(k.Fifa)
## Classes 'tbl_df', 'tbl' and 'data.frame':    18278 obs. of  46 variables:
##  $ pace                      : num  87 90 91 NA 91 76 NA 77 74 93 ...
##  $ shooting                  : num  92 93 85 NA 83 86 NA 60 76 86 ...
##  $ passing                   : num  92 82 87 NA 86 92 NA 70 89 81 ...
##  $ dribbling                 : num  96 89 95 NA 94 86 NA 71 89 89 ...
##  $ defending                 : num  39 35 32 NA 35 61 NA 90 72 45 ...
##  $ physic                    : num  66 78 58 NA 66 78 NA 86 66 74 ...
##  $ gk_diving                 : num  NA NA NA 87 NA NA 88 NA NA NA ...
##  $ gk_handling               : num  NA NA NA 92 NA NA 85 NA NA NA ...
##  $ gk_kicking                : num  NA NA NA 78 NA NA 88 NA NA NA ...
##  $ gk_reflexes               : num  NA NA NA 89 NA NA 90 NA NA NA ...
##  $ gk_speed                  : num  NA NA NA 52 NA NA 45 NA NA NA ...
##  $ gk_positioning            : num  NA NA NA 90 NA NA 88 NA NA NA ...
##  $ attacking_crossing        : num  88 84 87 13 81 93 18 53 86 79 ...
##  $ attacking_finishing       : num  95 94 87 11 84 82 14 52 72 90 ...
##  $ attacking_heading_accuracy: num  70 89 62 15 61 55 11 86 55 59 ...
##  $ attacking_short_passing   : num  92 83 87 43 89 92 61 78 92 84 ...
##  $ attacking_volleys         : num  88 87 87 13 83 82 14 45 76 79 ...
##  $ skill_dribbling           : num  97 89 96 12 95 86 21 70 87 89 ...
##  $ skill_curve               : num  93 81 88 13 83 85 18 60 85 83 ...
##  $ skill_fk_accuracy         : num  94 76 87 14 79 83 12 70 78 69 ...
##  $ skill_long_passing        : num  92 77 81 40 83 91 63 81 88 75 ...
##  $ skill_ball_control        : num  96 92 95 30 94 91 30 76 92 89 ...
##  $ movement_acceleration     : num  91 89 94 43 94 77 38 74 77 94 ...
##  $ movement_sprint_speed     : num  84 91 89 60 88 76 50 79 71 92 ...
##  $ movement_agility          : num  93 87 96 67 95 78 37 61 92 91 ...
##  $ movement_reactions        : num  95 96 92 88 90 91 86 88 89 92 ...
##  $ movement_balance          : num  95 71 84 49 94 76 43 53 93 88 ...
##  $ power_shot_power          : num  86 95 80 59 82 91 66 81 79 80 ...
##  $ power_jumping             : num  68 95 61 78 56 63 79 90 68 69 ...
##  $ power_stamina             : num  75 85 81 41 84 89 35 75 85 85 ...
##  $ power_strength            : num  68 78 49 78 63 74 78 92 58 73 ...
##  $ power_long_shots          : num  94 93 84 12 80 90 10 64 82 84 ...
##  $ mentality_aggression      : num  48 63 51 34 54 76 43 82 62 63 ...
##  $ mentality_interceptions   : num  40 29 36 19 41 61 22 89 82 55 ...
##  $ mentality_positioning     : num  94 95 87 11 87 88 11 47 79 92 ...
##  $ mentality_vision          : num  94 82 90 65 89 94 70 65 91 84 ...
##  $ mentality_penalties       : num  75 85 90 11 88 79 25 62 82 77 ...
##  $ mentality_composure       : num  96 95 94 68 91 91 70 89 92 91 ...
##  $ defending_marking         : num  33 28 27 27 34 68 25 91 68 38 ...
##  $ defending_standing_tackle : num  37 32 26 12 27 58 13 92 76 43 ...
##  $ defending_sliding_tackle  : num  26 24 29 18 22 51 10 85 71 41 ...
##  $ goalkeeping_diving        : num  6 7 9 87 11 15 88 13 13 14 ...
##  $ goalkeeping_handling      : num  11 11 9 92 12 13 85 10 9 14 ...
##  $ goalkeeping_kicking       : num  15 15 15 78 6 5 88 13 7 9 ...
##  $ goalkeeping_positioning   : num  14 14 15 90 8 10 88 11 14 11 ...
##  $ goalkeeping_reflexes      : num  8 11 11 89 8 13 90 11 9 14 ...
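
Positional indexes such as 32:43 depend on the exact column order of the file. A possible alternative is to select by column name, sketched below on a small made-up stand-in for the player table. The data frame `toy` and its values are hypothetical; only the column names are taken from the output above.

```r
# Toy stand-in for the player table (values and layout are made up)
toy <- data.frame(
  short_name           = c("A", "B"),
  pace                 = c(80, NA),
  gk_diving            = c(NA, 85),
  player_traits        = c("Flair", NA),
  attacking_finishing  = c(90, 12),
  goalkeeping_reflexes = c(8, 88),
  stringsAsFactors = FALSE
)

# Keep every column from `pace` through `goalkeeping_reflexes`,
# then drop the one text column that sits inside that range
first <- match("pace", names(toy))
last  <- match("goalkeeping_reflexes", names(toy))
attrs <- toy[, first:last]
attrs <- attrs[, names(attrs) != "player_traits"]

names(attrs)
```

This keeps the selection readable and fails loudly (with an NA index) if a column is renamed.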

We can see that there are lots of ‘NA’ values in the data set. k-means clustering cannot handle ‘NA’ values. While there are multiple statistical methods for imputing missing values, we will not be using any such technique. Rather, I will replace all the ‘NA’ values with 1. In the next step, I will explain why I chose to replace ‘NA’ with 1 rather than using a statistical method.

Fifa20[1:10,c(3,32:43)]
## # A tibble: 10 x 13
##    short_name  pace shooting passing dribbling defending physic gk_diving
##    <chr>      <dbl>    <dbl>   <dbl>     <dbl>     <dbl>  <dbl>     <dbl>
##  1 L. Messi      87       92      92        96        39     66        NA
##  2 Cristiano~    90       93      82        89        35     78        NA
##  3 Neymar Jr     91       85      87        95        32     58        NA
##  4 J. Oblak      NA       NA      NA        NA        NA     NA        87
##  5 E. Hazard     91       83      86        94        35     66        NA
##  6 K. De Bru~    76       86      92        86        61     78        NA
##  7 M. ter St~    NA       NA      NA        NA        NA     NA        88
##  8 V. van Di~    77       60      70        71        90     86        NA
##  9 L. Modric     74       76      89        89        72     66        NA
## 10 M. Salah      93       86      81        89        45     74        NA
## # ... with 5 more variables: gk_handling <dbl>, gk_kicking <dbl>,
## #   gk_reflexes <dbl>, gk_speed <dbl>, gk_positioning <dbl>

From the table, we can see that J. Oblak and M. ter Stegen have missing values for the first six attributes (pace through physic), whereas the remaining players have missing values for the six gk_ attributes. J. Oblak and M. ter Stegen are goalkeepers, so the data set records no pace, dribbling or shooting ratings for them. Likewise, outfield players such as L. Messi have no gk_diving or gk_positioning ratings, as those skills are not required of them. Since all ratings are out of 100, I am replacing each ‘NA’ with 1 to indicate that the player has a very low level of skill for that attribute.
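
One way to check that the missing values really follow the goalkeeper versus outfield split is to count them per column within each group. The sketch below does this on a small made-up table. The data frame `toy` is hypothetical, and using `team_position == "GK"` to identify goalkeepers is an assumption that may not hold for every row of the real file.

```r
# Toy stand-in: two outfield players and two goalkeepers (values made up)
toy <- data.frame(
  team_position = c("RW", "GK", "CB", "GK"),
  pace          = c(87, NA, 77, NA),
  shooting      = c(92, NA, 60, NA),
  gk_diving     = c(NA, 87, NA, 88)
)

is_gk <- toy$team_position == "GK"

# Number of missing values per attribute, separately for each group
na_gk       <- colSums(is.na(toy[is_gk,  -1]))
na_outfield <- colSums(is.na(toy[!is_gk, -1]))

rbind(goalkeepers = na_gk, outfield = na_outfield)
```

If the pattern holds, goalkeepers show missing values only in the outfield columns and vice versa, which is what justifies treating the gaps as "skill not applicable" rather than as unknown values.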

After replacing the missing values, I will run the k-means clustering. As mentioned above, I will use three clusters, one for each of the three most basic positions in football.

k.Fifa[is.na(k.Fifa)] <- 1
summary(k.Fifa)
##       pace          shooting        passing        dribbling    
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:57.00   1st Qu.:35.00   1st Qu.:46.00   1st Qu.:53.00  
##  Median :67.00   Median :52.00   Median :56.00   Median :62.00  
##  Mean   :60.27   Mean   :46.58   Mean   :50.97   Mean   :55.68  
##  3rd Qu.:74.00   3rd Qu.:62.00   3rd Qu.:63.00   3rd Qu.:69.00  
##  Max.   :96.00   Max.   :93.00   Max.   :92.00   Max.   :96.00  
##    defending         physic        gk_diving       gk_handling    
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:31.00   1st Qu.:55.00   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :52.00   Median :64.00   Median : 1.000   Median : 1.000  
##  Mean   :45.92   Mean   :57.76   Mean   : 8.176   Mean   : 7.923  
##  3rd Qu.:64.00   3rd Qu.:71.00   3rd Qu.: 1.000   3rd Qu.: 1.000  
##  Max.   :90.00   Max.   :90.00   Max.   :90.000   Max.   :92.000  
##    gk_kicking      gk_reflexes        gk_speed      gk_positioning  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 1.000   Median : 1.000   Median : 1.000   Median : 1.000  
##  Mean   : 7.776   Mean   : 8.284   Mean   : 5.099   Mean   : 7.948  
##  3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.: 1.000  
##  Max.   :93.000   Max.   :92.000   Max.   :65.000   Max.   :91.000  
##  attacking_crossing attacking_finishing attacking_heading_accuracy
##  Min.   : 5.00      Min.   : 2.00       Min.   : 5.00             
##  1st Qu.:38.00      1st Qu.:30.00       1st Qu.:44.00             
##  Median :54.00      Median :49.00       Median :56.00             
##  Mean   :49.72      Mean   :45.59       Mean   :52.22             
##  3rd Qu.:64.00      3rd Qu.:62.00       3rd Qu.:64.00             
##  Max.   :93.00      Max.   :95.00       Max.   :93.00             
##  attacking_short_passing attacking_volleys skill_dribbling  skill_curve   
##  Min.   : 7.00           Min.   : 3.00     Min.   : 4.0    Min.   : 6.00  
##  1st Qu.:54.00           1st Qu.:30.00     1st Qu.:50.0    1st Qu.:34.00  
##  Median :62.00           Median :44.00     Median :61.0    Median :49.00  
##  Mean   :58.75           Mean   :42.81     Mean   :55.6    Mean   :47.33  
##  3rd Qu.:68.00           3rd Qu.:56.00     3rd Qu.:68.0    3rd Qu.:62.00  
##  Max.   :92.00           Max.   :90.00     Max.   :97.0    Max.   :94.00  
##  skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
##  Min.   : 4.00     Min.   : 8.00      Min.   : 5.00      Min.   :12.0         
##  1st Qu.:31.00     1st Qu.:43.00      1st Qu.:54.00      1st Qu.:56.0         
##  Median :41.00     Median :56.00      Median :63.00      Median :67.0         
##  Mean   :42.71     Mean   :52.77      Mean   :58.46      Mean   :64.3         
##  3rd Qu.:56.00     3rd Qu.:64.00      3rd Qu.:69.00      3rd Qu.:75.0         
##  Max.   :94.00     Max.   :92.00      Max.   :96.00      Max.   :97.0         
##  movement_sprint_speed movement_agility movement_reactions movement_balance
##  Min.   :11.00         Min.   :11.0     Min.   :21.00      Min.   :12.00   
##  1st Qu.:57.00         1st Qu.:55.0     1st Qu.:56.00      1st Qu.:56.00   
##  Median :67.00         Median :66.0     Median :62.00      Median :66.00   
##  Mean   :64.42         Mean   :63.5     Mean   :61.75      Mean   :63.86   
##  3rd Qu.:75.00         3rd Qu.:74.0     3rd Qu.:68.00      3rd Qu.:74.00   
##  Max.   :96.00         Max.   :96.0     Max.   :96.00      Max.   :97.00   
##  power_shot_power power_jumping   power_stamina   power_strength 
##  Min.   :14.00    Min.   :19.00   Min.   :12.00   Min.   :20.00  
##  1st Qu.:48.00    1st Qu.:58.00   1st Qu.:56.00   1st Qu.:58.00  
##  Median :59.00    Median :66.00   Median :66.00   Median :66.00  
##  Mean   :58.18    Mean   :64.93   Mean   :62.89   Mean   :65.23  
##  3rd Qu.:68.00    3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:74.00  
##  Max.   :95.00    Max.   :95.00   Max.   :97.00   Max.   :97.00  
##  power_long_shots mentality_aggression mentality_interceptions
##  Min.   : 4.00    Min.   : 9.00        Min.   : 3.00          
##  1st Qu.:32.00    1st Qu.:44.00        1st Qu.:25.00          
##  Median :51.00    Median :58.00        Median :52.00          
##  Mean   :46.81    Mean   :55.74        Mean   :46.38          
##  3rd Qu.:62.00    3rd Qu.:69.00        3rd Qu.:64.00          
##  Max.   :94.00    Max.   :95.00        Max.   :92.00          
##  mentality_positioning mentality_vision mentality_penalties mentality_composure
##  Min.   : 2.00         Min.   : 9.00    Min.   : 7.00       Min.   :12.00      
##  1st Qu.:39.00         1st Qu.:44.00    1st Qu.:39.00       1st Qu.:51.00      
##  Median :55.00         Median :55.00    Median :49.00       Median :60.00      
##  Mean   :50.07         Mean   :53.61    Mean   :48.38       Mean   :58.53      
##  3rd Qu.:64.00         3rd Qu.:64.00    3rd Qu.:60.00       3rd Qu.:67.00      
##  Max.   :95.00         Max.   :94.00    Max.   :92.00       Max.   :96.00      
##  defending_marking defending_standing_tackle defending_sliding_tackle
##  Min.   : 1.00     Min.   : 5.00             Min.   : 3.00           
##  1st Qu.:29.00     1st Qu.:27.00             1st Qu.:24.00           
##  Median :52.00     Median :55.00             Median :52.00           
##  Mean   :46.85     Mean   :47.64             Mean   :45.61           
##  3rd Qu.:64.00     3rd Qu.:66.00             3rd Qu.:64.00           
##  Max.   :94.00     Max.   :92.00             Max.   :90.00           
##  goalkeeping_diving goalkeeping_handling goalkeeping_kicking
##  Min.   : 1.00      Min.   : 1.00        Min.   : 1.00      
##  1st Qu.: 8.00      1st Qu.: 8.00        1st Qu.: 8.00      
##  Median :11.00      Median :11.00        Median :11.00      
##  Mean   :16.57      Mean   :16.35        Mean   :16.21      
##  3rd Qu.:14.00      3rd Qu.:14.00        3rd Qu.:14.00      
##  Max.   :90.00      Max.   :92.00        Max.   :93.00      
##  goalkeeping_positioning goalkeeping_reflexes
##  Min.   : 1.00           Min.   : 1.00       
##  1st Qu.: 8.00           1st Qu.: 8.00       
##  Median :11.00           Median :11.00       
##  Mean   :16.37           Mean   :16.71       
##  3rd Qu.:14.00           3rd Qu.:14.00       
##  Max.   :91.00           Max.   :92.00

Now there are no missing values (‘NA’) in any column.

set.seed(100)
cluster <- kmeans(k.Fifa,3)
table(cluster$cluster)
## 
##    1    2    3 
## 2036 7032 9210
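
The call above relies on a single random start, because `kmeans()` in base R uses `nstart = 1` unless told otherwise. The sketch below is not the fit used in this document; on hypothetical toy data (the matrix `toy` is invented for illustration) it shows how `nstart` asks for several random starts and keeps the solution with the lowest total within-cluster sum of squares.

```r
set.seed(100)

# Hypothetical toy data: three loose groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 1), ncol = 2)
)

single <- kmeans(toy, centers = 3)               # one random start (the default)
multi  <- kmeans(toy, centers = 3, nstart = 25)  # best of 25 random starts

# Total within-cluster sum of squares; lower means tighter clusters
single$tot.withinss
multi$tot.withinss

# Cluster sizes of the multi-start solution
table(multi$cluster)
```

On the Fifa data the equivalent change would be `kmeans(k.Fifa, 3, nstart = 25)`, which may number the clusters differently from the output shown here.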

Now we shall see how the clustering turned out. Before checking the result against the positions recorded in the data set, we shall first look at which players were allocated to which cluster.

results <- as.data.frame(cbind(Fifa20[3],cluster=cluster$cluster))
head(results[results$cluster ==1,],10)
##       short_name cluster
## 4       J. Oblak       1
## 7  M. ter Stegen       1
## 14       Alisson       1
## 15        De Gea       1
## 26       Ederson       1
## 29   T. Courtois       1
## 31 S. Handanovic       1
## 32      M. Neuer       1
## 33     H. Lloris       1
## 54      K. Navas       1
head(results[results$cluster ==2,],10)
##           short_name cluster
## 1           L. Messi       2
## 2  Cristiano Ronaldo       2
## 3          Neymar Jr       2
## 5          E. Hazard       2
## 6       K. De Bruyne       2
## 9          L. Modric       2
## 10          M. Salah       2
## 11         K. Mbappé       2
## 13           H. Kane       2
## 18         S. Agüero       2
head(results[results$cluster ==3,],10)
##         short_name cluster
## 8      V. van Dijk       3
## 12    K. Koulibaly       3
## 16        N. Kanté       3
## 17    G. Chiellini       3
## 19    Sergio Ramos       3
## 22 Sergio Busquets       3
## 30           Piqué       3
## 36        D. Godín       3
## 41      A. Laporte       3
## 43        Casemiro       3

(Please note that the observations are ordered by overall rating in Fifa 20, highest first. This will help us in further interpretation.)

A regular football fan can see that cluster 1 consists of goalkeepers. This should be no surprise, as the difference between a goalkeeper and any other position (outfield players) is the most obvious one. A goalkeeper clearly needs a different set of attributes, which we also discussed above.

Attacking players such as L. Messi and Cristiano Ronaldo are in cluster 2. Some players who operate high up the pitch, like K. De Bruyne and M. Salah, are also grouped in this cluster.

Finally, top defenders like V. van Dijk and Sergio Ramos are allocated to cluster 3. Some midfield players who support defensive duties, such as N. Kanté and Sergio Busquets, are also in cluster 3.
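
Besides reading player names, the cluster centres give a quick view of what each cluster stands for. The following is a minimal sketch on hypothetical toy data, not the Fifa fit itself: the three columns borrow attribute names from the data set, but all values are invented. On the real fit, the corresponding object would be `cluster$centers`.

```r
set.seed(100)

# Hypothetical toy players: 30 attacker-like, 30 defender-like, 30 goalkeeper-like
toy <- data.frame(
  shooting           = c(rnorm(30, 75, 5), rnorm(30, 40, 5), rnorm(30, 15, 3)),
  defending          = c(rnorm(30, 35, 5), rnorm(30, 75, 5), rnorm(30, 15, 3)),
  goalkeeping_diving = c(rnorm(30, 10, 2), rnorm(30, 10, 2), rnorm(30, 75, 5))
)

fit <- kmeans(toy, centers = 3, nstart = 10)

# One row per cluster, one column per attribute: the mean attribute value in each cluster
round(fit$centers, 1)
```

A cluster whose centre is high on `goalkeeping_diving` and low on everything else can be read as the goalkeeper cluster, and similarly for the other two.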

Plotting the clusters against a few attributes will help us check this interpretation.

library(ggplot2)
ggplot(k.Fifa, aes(x=defending, y= shooting, col= as.factor(cluster$cluster))) + geom_point()

ggplot(k.Fifa, aes(x=defending_standing_tackle, y= attacking_finishing, col= as.factor(cluster$cluster))) + geom_point()

From the graphs, it becomes clear that cluster 2 contains the players with stronger attacking skills such as shooting and attacking_finishing. The players with stronger defending skills, such as defending itself and defending_standing_tackle, are in cluster 3. Cluster 1, the goalkeepers, sits at the bottom of the graphs, as goalkeeping requires neither of the skills used for plotting.

We can see that cluster 2 and cluster 3 overlap somewhat around the centre of both axes.

Now we shall compare the clusters with the players’ positions given in Fifa 20. Some players have multiple positions; we shall take only the first listed position of each player.

#checking the distinct values of player_positions
unique(Fifa20$player_positions)
##   [1] "RW, CF, ST"   "ST, LW"       "LW, CAM"      "GK"           "LW, CF"      
##   [6] "CAM, CM"      "CB"           "CM"           "RW, ST"       "ST, RW"      
##  [11] "ST"           "CDM, CM"      "CF, ST, LW"   "CAM, RW"      "CM, CDM"     
##  [16] "RW, LW"       "CAM, LM, ST"  "ST, LM"       "LW, LM"       "CB, LB"      
##  [21] "RW, CAM, CM"  "CDM"          "CF, LM"       "CF, ST"       "LB"          
##  [26] "CM, CAM, CDM" "CF, LW, ST"   "LW"           "CB, CDM"      "RB, CM, CDM" 
##  [31] "CAM, CM, LW"  "CF, ST, CAM"  "LW, CM"       "CAM, RM, RW"  "CM, CAM"     
##  [36] "CM, LM, RM"   "LB, CB"       "RB"           "CAM, CF, ST"  "RW, LW, ST"  
##  [41] "LB, LM"       "RM, LM, CM"   "CAM, CM, RM"  "RM, LM"       "CAM, RM"     
##  [46] "CF, LW, CAM"  "CAM, LM, RM"  "LM, RM, LW"   "RM, LM, LW"   "CAM"         
##  [51] "CAM, CM, CF"  "LM"           "CDM, CB"      "RB, CB"       "RM, RW"      
##  [56] "LM, RW, LW"   "RM, CM"       "CAM, LW, ST"  "RW, RM"       "CM, CDM, CAM"
##  [61] "CM, CAM, CF"  "LW, ST, LM"   "LM, ST"       "RM, RW, ST"   "LM, CAM, RM" 
##  [66] "LW, RW"       "CF, LM, LW"   "RM, CAM"      "CF, RM, LM"   "RW, LW, CAM" 
##  [71] "CDM, CM, CAM" "CDM, CB, LB"  "ST, CAM, LW"  "ST, CF"       "RW, CAM"     
##  [76] "LW, LM, RW"   "RW, CAM, LW"  "RM, ST"       "CM, CDM, RM"  "RW, CF"      
##  [81] "RB, RM"       "CAM, LW"      "CF, CAM, CM"  "RB, RM, CM"   "LWB, LM, LB" 
##  [86] "ST, RW, LW"   "CB, LB, RB"   "RM, LM, CF"   "CAM, LM"      "LM, LWB"     
##  [91] "LM, RM"       "RM, RB"       "CM, CDM, LM"  "CM, LW"       "RWB, RM"     
##  [96] "RW"           "CB, RB"       "CM, LM, CDM"  "CAM, CM, LM"  "LW, RW, CAM" 
## [101] "CM, LM"       "CAM, CM, RW"  "LM, LB, CM"   "CM, LB"       "CF, ST, RM"  
## [106] "LB, LWB"      "RM, CAM, RW"  "RB, RW, LW"   "LW, LB"       "CAM, LM, LW" 
## [111] "CF, LW"       "RM, LM, CAM"  "RB, RWB"      "LM, LW"       "RM, ST, LM"  
## [116] "CM, RM"       "CF, RW, LW"   "CAM, CF"      "ST, LW, CAM"  "RM"          
## [121] "RWB, RB, RM"  "LW, CAM, CM"  "LM, ST, RM"   "CM, RB, RM"   "LW, CF, ST"  
## [126] "LM, CAM"      "RB, CDM, CM"  "RM, RW, CAM"  "CF"           "LM, RW"      
## [131] "RM, RW, LM"   "ST, RM"       "CAM, LM, CM"  "CDM, CM, LM"  "RW, ST, LW"  
## [136] "LB, RB"       "RB, LB"       "LW, CF, RW"   "LB, LM, LWB"  "RM, CAM, CM" 
## [141] "LM, LW, RM"   "CDM, CM, CB"  "ST, RW, LM"   "RM, LM, ST"   "LM, CF"      
## [146] "CDM, CB, CM"  "LWB, LB"      "RWB, RM, RB"  "ST, LM, LW"   "LM, LB"      
## [151] "RWB"          "ST, RW, RM"   "ST, CAM"      "CAM, CDM"     "CAM, ST"     
## [156] "CF, CM"       "CF, RW"       "CM, LW, RW"   "RM, CAM, LM"  "RB, RM, CB"  
## [161] "LB, CDM"      "CAM, RM, CM"  "CM, CB"       "CB, RB, LB"   "LM, RM, CAM" 
## [166] "CM, LM, CAM"  "CF, CM, LW"   "RW, ST, RM"   "CF, RW, RM"   "CAM, RM, LM" 
## [171] "LM, LW, CM"   "ST, CF, CAM"  "LM, LW, ST"   "RM, RWB"      "CF, LW, RW"  
## [176] "CDM, CM, RB"  "RB, RW"       "CM, CAM, RW"  "RM, ST, RW"   "CAM, RM, CF" 
## [181] "CM, RM, CAM"  "LW, CAM, LM"  "RB, RWB, RM"  "RB, CM"       "CM, RB"      
## [186] "CM, CAM, RM"  "LM, LWB, LW"  "LM, LW, CF"   "LB, LWB, LM"  "LM, RM, CM"  
## [191] "RM, RWB, LWB" "LM, LW, CAM"  "LW, CAM, RM"  "CAM, CM, CDM" "CM, CAM, LM" 
## [196] "RB, LB, RM"   "LB, CM"       "CAM, RW, LW"  "RM, LM, RW"   "CM, CDM, CF" 
## [201] "CM, CDM, RB"  "LM, CAM, CM"  "ST, CAM, CF"  "RM, CM, RB"   "CM, RW, CAM" 
## [206] "LW, RW, ST"   "CDM, RWB"     "CAM, RW, RM"  "CM, CB, CAM"  "LM, RM, RB"  
## [211] "RM, CF, RW"   "LW, RM, LM"   "CM, RM, CDM"  "ST, CF, RW"   "LM, LB, LWB" 
## [216] "LM, RM, ST"   "CAM, RW, CM"  "LW, LM, CF"   "RWB, RB"      "CDM, CAM, LM"
## [221] "RW, LW, RM"   "CAM, ST, CDM" "LM, CM"       "CM, ST"       "LM, RB"      
## [226] "LB, RM"       "RM, CM, LM"   "LW, RM"       "LW, ST, RW"   "CM, LWB, LM" 
## [231] "CF, CAM, ST"  "RM, LW"       "LW, RW, LM"   "CF, CAM"      "RW, LM, RM"  
## [236] "RW, RM, CAM"  "LB, LM, CAM"  "ST, RM, CAM"  "CM, CDM, RWB" "LM, LW, LB"  
## [241] "ST, CAM, LM"  "RW, CM, RM"   "LM, RW, CF"   "CF, RW, ST"   "CDM, CAM, CM"
## [246] "RM, CM, RW"   "LM, CF, RM"   "CAM, LM, RW"  "CAM, ST, LM"  "LB, LWB, RB" 
## [251] "LWB, LB, LM"  "RB, CB, RM"   "RW, CF, LW"   "LWB, LB, RB"  "LM, CM, CAM" 
## [256] "CAM, CDM, CM" "LW, ST"       "RB, CDM"      "CAM, ST, RW"  "CM, CDM, CB" 
## [261] "CB, CDM, RB"  "RM, CF, LM"   "LWB, RM"      "ST, RM, LM"   "CAM, LW, CM" 
## [266] "LM, CF, CM"   "RW, RM, RB"   "RB, RM, RWB"  "ST, LM, CAM"  "LM, ST, CAM" 
## [271] "ST, LW, RW"   "RM, LM, LB"   "CM, LWB"      "CB, CDM, LB"  "CAM, LW, RW" 
## [276] "LM, CAM, ST"  "RW, RWB"      "RM, RWB, RB"  "LM, CAM, LWB" "LW, RW, RM"  
## [281] "RWB, RB, LWB" "CAM, ST, RM"  "RW, RM, CF"   "RW, RM, LW"   "RW, CAM, ST" 
## [286] "RB, RM, LM"   "CM, CDM, LB"  "CDM, LB, CM"  "LM, CM, LB"   "LB, RB, RM"  
## [291] "LW, CM, CAM"  "LB, RB, CB"   "CAM, ST, LW"  "LWB, LB, CB"  "LWB, LM"     
## [296] "LWB"          "CDM, CB, RB"  "CM, LM, LB"   "RW, LB"       "RB, RM, LB"  
## [301] "RW, RM, ST"   "CM, CAM, ST"  "CAM, CF, RW"  "CAM, RM, RB"  "ST, LM, RM"  
## [306] "ST, RM, LW"   "CDM, CM, RM"  "RM, RW, CM"   "ST, CM, RB"   "RM, RW, LW"  
## [311] "CB, RB, RM"   "CAM, RM, ST"  "RB, CM, CB"   "RW, LW, CM"   "RM, RWB, CAM"
## [316] "RW, RM, CM"   "RM, RB, RWB"  "RB, LW, LB"   "LB, CB, LWB"  "ST, CAM, CM" 
## [321] "LM, RWB"      "RB, LB, CDM"  "CB, LWB"      "CM, RWB, RM"  "RM, CF"      
## [326] "LB, CB, LM"   "LWB, RWB"     "RB, LB, RWB"  "RW, LW, LM"   "LM, CM, LWB" 
## [331] "LM, ST, LW"   "RM, CAM, ST"  "RW, CAM, RM"  "LW, CAM, RW"  "LW, RW, CM"  
## [336] "CAM, ST, CF"  "LB, CAM, LM"  "LB, CB, CDM"  "LM, RM, LB"   "LM, RM, CF"  
## [341] "LB, LW"       "LM, LB, LW"   "ST, CAM, RM"  "LW, ST, CAM"  "ST, CAM, RW" 
## [346] "ST, LW, LM"   "CAM, LM, CF"  "CAM, CF, CM"  "LM, RW, CAM"  "LB, RM, LM"  
## [351] "CF, CAM, LM"  "CAM, RW, CF"  "CB, LB, CDM"  "CB, CDM, CM"  "CB, CM, CDM" 
## [356] "RM, LM, RWB"  "RW, RM, LM"   "ST, LM, RB"   "RB, RM, RW"   "CB, RWB, RB" 
## [361] "RB, CB, ST"   "CDM, CM, LB"  "RW, ST, CF"   "ST, RWB"      "LB, LWB, CDM"
## [366] "RB, RWB, LB"  "RM, CF, CAM"  "RB, LB, CB"   "RM, LW, RW"   "CDM, LB"     
## [371] "CDM, RB, RM"  "CDM, RB"      "LB, CM, LM"   "RM, ST, CAM"  "CM, CDM, LW" 
## [376] "CDM, RM"      "RM, RB, LB"   "LM, LWB, LB"  "CM, RW"       "CM, RM, RB"  
## [381] "CM, LM, RB"   "RM, CAM, CF"  "CAM, CM, ST"  "CM, CF, CAM"  "RM, RB, CM"  
## [386] "LW, LM, LB"   "RWB, RB, LB"  "RM, LM, CDM"  "LB, LM, RM"   "CM, CF"      
## [391] "RB, CB, RWB"  "LW, LM, ST"   "CAM, RB"      "ST, CF, LW"   "LM, LB, CAM" 
## [396] "CF, CM, CAM"  "CB, CDM, CAM" "LM, CAM, LW"  "RM, LM, RB"   "RWB, LWB"    
## [401] "RW, CM"       "CB, CM, RB"   "LB, RB, LW"   "RB, CDM, LB"  "CM, RW, LW"  
## [406] "RWB, RB, CB"  "CF, RW, CAM"  "RB, CB, LB"   "CAM, LW, CF"  "CB, RB, CAM" 
## [411] "LM, LW, RW"   "LB, CB, RB"   "ST, RW, CF"   "CDM, LM, CM"  "ST, LM, CF"  
## [416] "LW, LM, CAM"  "LW, RM, CM"   "RB, CM, RWB"  "ST, CF, LM"   "CDM, CAM"    
## [421] "LM, CDM"      "LB, LM, CB"   "CAM, CF, RM"  "RM, CM, CAM"  "CAM, CF, LW" 
## [426] "LM, CM, ST"   "LM, RM, RW"   "CB, CM"       "LW, LM, RM"   "CB, LB, LWB" 
## [431] "RM, RB, LW"   "LB, LM, CM"   "LM, RM, LWB"  "RB, LM, RM"   "CF, LM, CAM" 
## [436] "LB, LWB, CB"  "RB, CM, RM"   "CM, LB, CDM"  "CAM, CF, LM"  "LW, CAM, CF" 
## [441] "CDM, RB, CB"  "LW, LB, RW"   "LM, CM, RW"   "LM, RB, RWB"  "ST, CM"      
## [446] "RM, CM, RWB"  "LM, ST, LB"   "RB, CB, CM"   "RW, LW, CF"   "CDM, RB, CM" 
## [451] "RM, LM, LWB"  "LM, RW, RM"   "RB, LM"       "CB, LB, LM"   "RB, LB, LWB" 
## [456] "RB, RWB, CB"  "LB, CM, LW"   "CM, RB, CDM"  "RM, LW, LM"   "CAM, ST, CM" 
## [461] "LB, CDM, CM"  "LWB, LM, RWB" "CDM, LWB, CM" "LM, LWB, CM"  "CAM, CDM, CB"
## [466] "RM, LB"       "CDM, RB, LB"  "ST, RM, RW"   "RB, LB, LM"   "LB, CDM, LWB"
## [471] "LB, LM, RB"   "CDM, RM, CM"  "RWB, CB"      "CF, CM, LB"   "CM, CAM, LB" 
## [476] "CDM, LWB"     "RW, CAM, CF"  "RM, RWB, LM"  "ST, LW, RM"   "RB, RM, CDM" 
## [481] "RB, CB, CDM"  "CF, RM, ST"   "CM, ST, CF"   "LM, CAM, CDM" "LW, CM, LB"  
## [486] "LB, LM, LW"   "CM, LM, ST"   "CF, RM"       "RB, LB, RW"   "LM, CM, LW"  
## [491] "LW, RW, CF"   "CDM, LM"      "CM, CB, CDM"  "LB, RB, LM"   "RM, LW, CAM" 
## [496] "CB, LB, CM"   "CM, LW, LWB"  "RM, RWB, ST"  "ST, RW, CAM"  "ST, RB, RM"  
## [501] "LB, RW, CM"   "CF, CAM, RW"  "RM, RW, RB"   "RB, CDM, CB"  "RB, RWB, CDM"
## [506] "LM, LB, RM"   "LM, CF, ST"   "CF, ST, LM"   "LB, RB, LWB"  "RM, ST, RB"  
## [511] "CDM, CAM, RM" "RB, CDM, RM"  "ST, LW, CM"   "CB, RB, CDM"  "LWB, CM"     
## [516] "ST, RWB, LM"  "RM, CM, ST"   "RB, CDM, RWB" "CM, RM, LM"   "LM, CM, RM"  
## [521] "LB, RB, CDM"  "RB, RM, ST"   "CF, LM, RM"   "CM, ST, LM"   "CM, RM, CF"  
## [526] "CB, CM, LB"   "RB, LB, CM"   "LWB, CB, LM"  "CB, LM, LB"   "RM, RWB, CM" 
## [531] "LM, RB, CB"   "RM, ST, RWB"  "CDM, RM, LM"  "RW, CM, CAM"  "CF, CAM, LW" 
## [536] "RM, RB, LM"   "CF, ST, CM"   "LB, LWB, CM"  "CM, RW, LM"   "RB, RWB, RW" 
## [541] "ST, CB, CAM"  "LM, CF, CAM"  "LM, LB, ST"   "RB, RW, RWB"  "RM, RB, RW"  
## [546] "RWB, CB, CM"  "RWB, RM, LM"  "ST, CB"       "CM, LB, LM"   "LW, RB, LB"  
## [551] "LB, RW, LW"   "RW, RB"       "LWB, ST, CF"  "RW, RM, RWB"  "CB, ST"      
## [556] "RWB, LM"      "CM, LM, RW"   "RM, CF, CM"   "LM, LB, CDM"  "CB, LWB, LB" 
## [561] "RM, RW, CF"   "RB, CDM, CAM" "LW, RM, RW"   "CM, RWB"      "RW, RM, LB"  
## [566] "CB, CF"       "RB, ST, RM"   "LM, LW, CDM"  "CB, CAM"      "RM, RB, CDM" 
## [571] "LM, LW, LWB"  "CM, RWB, LWB" "LWB, CB"      "RB, LW"       "CM, CDM, RW" 
## [576] "RB, RWB, CM"  "CB, RWB"      "LB, CM, CDM"  "RM, RB, ST"   "LW, CM, RW"  
## [581] "CB, RM"       "CDM, LB, LM"  "CM, CAM, RB"  "CAM, RM, CDM" "RM, CAM, RWB"
## [586] "RW, LM"       "CB, LM, ST"   "CM, ST, RM"   "CM, CAM, LW"  "RB, ST, RW"  
## [591] "LB, CM, RB"   "CAM, LB"      "RM, RB, CAM"  "RM, LWB, ST"  "ST, CB, RB"  
## [596] "RB, CAM"      "CM, CB, RB"   "CM, CDM, ST"  "RM, RW, RWB"  "RM, CB, RB"  
## [601] "RWB, RB, CDM" "RW, CM, LB"   "RM, CF, RB"   "RM, CM, CF"   "LB, LM, CDM" 
## [606] "CDM, CAM, RB" "CM, LW, ST"   "RM, CB"       "CM, LB, RM"   "LB, CDM, LM" 
## [611] "CDM, LB, RM"  "LM, CM, RB"   "LW, CM, RB"   "RM, LWB, LM"  "LWB, CAM, LM"
## [616] "CM, RB, LM"   "LWB, CB, LB"  "ST, CM, CAM"  "LWB, LW"      "RM, RWB, RW" 
## [621] "RW, CM, ST"   "CAM, ST, RB"  "CDM, LB, RB"  "RWB, CM"      "LB, CB, RM"  
## [626] "CF, RM, CM"   "RWB, LWB, CB" "ST, RM, RWB"  "LM, ST, CM"   "CM, LM, CB"  
## [631] "LWB, LW, ST"  "CM, CF, RB"   "ST, RW, RB"   "RW, LM, CAM"  "RW, RB, LB"  
## [636] "RWB, CDM"     "LW, LWB, LB"  "RB, ST"       "ST, LW, CDM"  "LB, CDM, RB" 
## [641] "CM, RWB, CDM" "LM, CDM, LWB" "RM, ST, CM"
#taking the primary position only 
position <- gsub(",.*$", "", Fifa20$player_positions)
length(unique(position))
## [1] 15
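
To make the extraction step concrete, here is a small sketch on three position strings taken from the listing above. The pattern `",.*$"` deletes everything from the first comma onwards, which leaves the first listed position. The `strsplit()` variant is an alternative added for comparison, not something the original analysis uses, and the variable names are illustrative.

```r
example_positions <- c("RW, CF, ST", "GK", "CDM, CM")

# Same pattern as above: drop the first comma and everything after it
primary <- gsub(",.*$", "", example_positions)
primary
## [1] "RW"  "GK"  "CDM"

# Alternative: split on ", " and keep the first piece of each string
primary_alt <- vapply(strsplit(example_positions, ", "), `[`, character(1), 1)
identical(primary, primary_alt)
## [1] TRUE
```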

There are 15 distinct positions. Using human intuition, I will group them into three categories. The grouping is as follows:

GK - “GK”

Def - “CB”, “LB”, “RB”, “LWB”, “RWB”, “CDM”, “CM”

Att - “RM”, “LM”, “CAM”, “RW”, “LW”, “CF”, “ST”

Adding the three position groups to the data set:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v purrr   0.3.3     v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
results$position <- position

# Position groups; note that "CM" is counted as defensive here
GK  <- "GK"
Def <- c("CB", "LB", "RB", "LWB", "RWB", "CDM", "CM")
Att <- c("RM", "LM", "CAM", "RW", "LW", "CF", "ST")

# Loop over every player and replace the primary position (column 3)
# with one of the three categories
for (i in seq_len(nrow(results))) {
  if (position[i] %in% GK) {
    results[i, 3] <- "GK"
  } else if (position[i] %in% Def) {
    results[i, 3] <- "DEF"
  } else {
    results[i, 3] <- "ATT"   # every remaining position is listed in Att
  }
}
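
The same mapping can also be written without an explicit loop. The sketch below is an alternative formulation on a hypothetical five-player vector, not the code that produced the results in this document; it uses nested `ifelse()` calls, which work on whole vectors at once.

```r
# Hypothetical primary positions for five players
position <- c("GK", "CB", "CM", "ST", "LW")

Def <- c("CB", "LB", "RB", "LWB", "RWB", "CDM", "CM")

# Goalkeepers first, then defenders, everything else is treated as attacking
group <- ifelse(position == "GK", "GK",
         ifelse(position %in% Def, "DEF", "ATT"))
group
## [1] "GK"  "DEF" "DEF" "ATT" "ATT"
```

On the full data this would amount to assigning the result to `results$position`; on roughly 18,000 rows it avoids the row-by-row assignment.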

Finally, we shall cross-tabulate the clusters against the players’ actual position groups.

Note that we have assigned the following position to each cluster, based on the players it contains:

Cluster 1 = Goalkeeper

Cluster 2 = Attacker

Cluster 3 = Defender

table(results$cluster, results$position)
##    
##      ATT  DEF   GK
##   1    0    0 2036
##   2 6333  699    0
##   3  354 8856    0
accuracy = (2036+6333+8856)/18278
accuracy *100
## [1] 94.23898
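
The accuracy above is typed in by hand from the table. As a small sketch, the same figure can be derived from the table itself. The matrix below simply re-enters the counts reported above, and `label` is a hypothetical helper vector encoding the cluster-to-position reading used in this document.

```r
# Counts copied from the cross-tabulation above (rows: cluster, columns: position group)
tab <- matrix(
  c(   0,    0, 2036,
    6333,  699,    0,
     354, 8856,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster  = c("1", "2", "3"),
                  position = c("ATT", "DEF", "GK"))
)

# Which position group each cluster is taken to represent
label <- c("1" = "GK", "2" = "ATT", "3" = "DEF")

# Add up the cells where the cluster's label matches the position group
correct  <- sum(mapply(function(k, g) tab[k, g], names(label), label))
accuracy <- correct / sum(tab)
accuracy * 100
## [1] 94.23898
```

With the real objects, `tab` would be `table(results$cluster, results$position)`, so the numbers never have to be copied manually.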

The table shows how the clustering performed. It should be no surprise that all the goalkeepers are grouped into a single cluster, cluster 1, and nowhere else, since goalkeeper attributes are distinct from those of outfield players. The graphs above also support this result.

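About 10 percent of the players in cluster 2 (699 of 7,032) are listed as defenders, whereas only about 4 percent of the players in cluster 3 (354 of 9,210) are listed as attackers. Seen from the positions' side, roughly 7 percent of all defenders (699 of 9,555) fall into the attacking cluster and roughly 5 percent of all attackers (354 of 6,687) fall into the defending cluster. It is not uncommon for players to have strong attributes that are not typical of their position, which may push some of them into a different cluster.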
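
Shares within a cluster and shares within a position group are easy to mix up when reading such a table. As a minimal sketch, `prop.table()` from base R computes both; the matrix re-enters the counts reported above, and the object names are illustrative.

```r
# Counts copied from the cross-tabulation above (rows: cluster, columns: position group)
tab <- matrix(
  c(   0,    0, 2036,
    6333,  699,    0,
     354, 8856,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster  = c("1", "2", "3"),
                  position = c("ATT", "DEF", "GK"))
)

# Share of each position group within a cluster (each row sums to 1)
by_cluster <- prop.table(tab, margin = 1)
round(by_cluster * 100, 1)

# Share of each cluster within a position group (each column sums to 1)
by_position <- prop.table(tab, margin = 2)
round(by_position * 100, 1)
```

`by_cluster["2", "DEF"]` answers "what fraction of cluster 2 are defenders?", while `by_position["2", "DEF"]` answers "what fraction of defenders ended up in cluster 2?".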

Even so, the overall accuracy is about 94 %. This means that even if EA Sports did not provide the players’ positions in the game, the attribute data alone would let us group similar players together and infer their broad position.

One thing to understand is that this document does not create new knowledge; it rather demonstrates how k-means is used in R. Other models, such as classification trees or logistic regression, may capture the relationship between the variables in the data set better.