In this document, I group players from FIFA 20 into clusters based on their attributes, using the k-means algorithm in R.
K-means is an unsupervised machine learning algorithm that partitions observations into clusters based on their attributes. The user must specify in advance how many clusters to divide the observations into. Given that number, the algorithm randomly picks an initial ‘centroid’ for each cluster. Each point is then allocated to the cluster whose centroid is closest to it, and each centroid is recomputed as the mean of all the points allocated to its cluster. These two steps repeat until every point belongs to exactly one cluster and no allocation changes in the next iteration.
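The assign-and-update loop can be sketched in a few lines of R. This is a minimal illustration on made-up two-dimensional data, not the implementation behind kmeans() (which uses the Hartigan-Wong algorithm by default). The function name simple_kmeans and the toy data are assumptions introduced here.

```r
# Minimal sketch of the k-means loop (illustrative only).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Made-up data: two groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```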
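Because the result depends on the random starting centroids, the algorithm can be run several times, and the solution with the least within-cluster variation is kept. In R's kmeans(), the number of random starts is set with the nstart argument, which defaults to 1. Further details of k-means are beyond the scope of this document.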
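A minimal sketch of the effect of several random starts, using the nstart argument of kmeans() on made-up data (the toy matrix and the choice of 25 starts are assumptions for illustration). With more starts, the total within-cluster sum of squares is typically no larger than with a single start.

```r
# Made-up data: three groups of 50 points each
set.seed(100)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

# One random start versus the best of 25 random starts
single_start <- kmeans(toy, centers = 3, nstart = 1)
multi_start  <- kmeans(toy, centers = 3, nstart = 25)

# Compare the total within-cluster sum of squares of the two fits
c(single = single_start$tot.withinss, multi = multi_start$tot.withinss)
```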
An important question in k-means clustering is how many clusters to divide the observations into. The trade-off between within-cluster variance and the number of clusters helps determine the optimal number of clusters. In this document, however, we fix the number of clusters at three. This is because there are three major roles in football, i.e. goalkeeping, defending and attacking, which is the most important classification of any individual player. (Note that midfield players can be either defending or attacking. Their roles usually overlap, so I have not made a separate category for them.)
The FIFA 20 data set is available on Kaggle.
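One common way to examine this trade-off is an elbow plot: fit k-means for several values of k and plot the total within-cluster sum of squares against k. The sketch below uses made-up data with three groups, not the FIFA 20 data. The range 1 to 6 and nstart = 10 are arbitrary choices for illustration.

```r
# Made-up data: three well-separated groups of 50 points each
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing the variance much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```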
library(readr)
Fifa20 <- read_csv("fifa-20-complete-player-dataset/players_20.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## player_url = col_character(),
## short_name = col_character(),
## long_name = col_character(),
## dob = col_date(format = ""),
## nationality = col_character(),
## club = col_character(),
## player_positions = col_character(),
## preferred_foot = col_character(),
## work_rate = col_character(),
## body_type = col_character(),
## real_face = col_character(),
## player_tags = col_character(),
## team_position = col_character(),
## loaned_from = col_character(),
## joined = col_date(format = ""),
## nation_position = col_character(),
## player_traits = col_character(),
## ls = col_character(),
## st = col_character(),
## rs = col_character()
## # ... with 23 more columns
## )
## See spec(...) for full column specifications.
head(Fifa20,15)
## # A tibble: 15 x 104
## sofifa_id player_url short_name long_name age dob height_cm
## <dbl> <chr> <chr> <chr> <dbl> <date> <dbl>
## 1 158023 https://s~ L. Messi Lionel A~ 32 1987-06-24 170
## 2 20801 https://s~ Cristiano~ Cristian~ 34 1985-02-05 187
## 3 190871 https://s~ Neymar Jr Neymar d~ 27 1992-02-05 175
## 4 200389 https://s~ J. Oblak Jan Oblak 26 1993-01-07 188
## 5 183277 https://s~ E. Hazard Eden Haz~ 28 1991-01-07 175
## 6 192985 https://s~ K. De Bru~ Kevin De~ 28 1991-06-28 181
## 7 192448 https://s~ M. ter St~ Marc-And~ 27 1992-04-30 187
## 8 203376 https://s~ V. van Di~ Virgil v~ 27 1991-07-08 193
## 9 177003 https://s~ L. Modric Luka Mod~ 33 1985-09-09 172
## 10 209331 https://s~ M. Salah Mohamed ~ 27 1992-06-15 175
## 11 231747 https://s~ K. Mbappé Kylian M~ 20 1998-12-20 178
## 12 201024 https://s~ K. Koulib~ Kalidou ~ 28 1991-06-20 187
## 13 202126 https://s~ H. Kane Harry Ka~ 25 1993-07-28 188
## 14 212831 https://s~ Alisson Alisson ~ 26 1992-10-02 191
## 15 193080 https://s~ De Gea David De~ 28 1990-11-07 192
## # ... with 97 more variables: weight_kg <dbl>, nationality <chr>, club <chr>,
## # overall <dbl>, potential <dbl>, value_eur <dbl>, wage_eur <dbl>,
## # player_positions <chr>, preferred_foot <chr>,
## # international_reputation <dbl>, weak_foot <dbl>, skill_moves <dbl>,
## # work_rate <chr>, body_type <chr>, real_face <chr>,
## # release_clause_eur <dbl>, player_tags <chr>, team_position <chr>,
## # team_jersey_number <dbl>, loaned_from <chr>, joined <date>,
## # contract_valid_until <dbl>, nation_position <chr>,
## # nation_jersey_number <dbl>, pace <dbl>, shooting <dbl>, passing <dbl>,
## # dribbling <dbl>, defending <dbl>, physic <dbl>, gk_diving <dbl>,
## # gk_handling <dbl>, gk_kicking <dbl>, gk_reflexes <dbl>, gk_speed <dbl>,
## # gk_positioning <dbl>, player_traits <chr>, attacking_crossing <dbl>,
## # attacking_finishing <dbl>, attacking_heading_accuracy <dbl>,
## # attacking_short_passing <dbl>, attacking_volleys <dbl>,
## # skill_dribbling <dbl>, skill_curve <dbl>, skill_fk_accuracy <dbl>,
## # skill_long_passing <dbl>, skill_ball_control <dbl>,
## # movement_acceleration <dbl>, movement_sprint_speed <dbl>,
## # movement_agility <dbl>, movement_reactions <dbl>, movement_balance <dbl>,
## # power_shot_power <dbl>, power_jumping <dbl>, power_stamina <dbl>,
## # power_strength <dbl>, power_long_shots <dbl>, mentality_aggression <dbl>,
## # mentality_interceptions <dbl>, mentality_positioning <dbl>,
## # mentality_vision <dbl>, mentality_penalties <dbl>,
## # mentality_composure <dbl>, defending_marking <dbl>,
## # defending_standing_tackle <dbl>, defending_sliding_tackle <dbl>,
## # goalkeeping_diving <dbl>, goalkeeping_handling <dbl>,
## # goalkeeping_kicking <dbl>, goalkeeping_positioning <dbl>,
## # goalkeeping_reflexes <dbl>, ls <chr>, st <chr>, rs <chr>, lw <chr>,
## # lf <chr>, cf <chr>, rf <chr>, rw <chr>, lam <chr>, cam <chr>, ram <chr>,
## # lm <chr>, lcm <chr>, cm <chr>, rcm <chr>, rm <chr>, lwb <chr>, ldm <chr>,
## # cdm <chr>, rdm <chr>, rwb <chr>, lb <chr>, lcb <chr>, cb <chr>, rcb <chr>,
## # rb <chr>
str(Fifa20)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18278 obs. of 104 variables:
## $ sofifa_id : num 158023 20801 190871 200389 183277 ...
## $ player_url : chr "https://sofifa.com/player/158023/lionel-messi/20/159586" "https://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/20/159586" "https://sofifa.com/player/190871/neymar-da-silva-santos-jr/20/159586" "https://sofifa.com/player/200389/jan-oblak/20/159586" ...
## $ short_name : chr "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "J. Oblak" ...
## $ long_name : chr "Lionel Andrés Messi Cuccittini" "Cristiano Ronaldo dos Santos Aveiro" "Neymar da Silva Santos Junior" "Jan Oblak" ...
## $ age : num 32 34 27 26 28 28 27 27 33 27 ...
## $ dob : Date, format: "1987-06-24" "1985-02-05" ...
## $ height_cm : num 170 187 175 188 175 181 187 193 172 175 ...
## $ weight_kg : num 72 83 68 87 74 70 85 92 66 71 ...
## $ nationality : chr "Argentina" "Portugal" "Brazil" "Slovenia" ...
## $ club : chr "FC Barcelona" "Juventus" "Paris Saint-Germain" "Atlético Madrid" ...
## $ overall : num 94 93 92 91 91 91 90 90 90 90 ...
## $ potential : num 94 93 92 93 91 91 93 91 90 90 ...
## $ value_eur : num 9.55e+07 5.85e+07 1.06e+08 7.75e+07 9.00e+07 ...
## $ wage_eur : num 565000 405000 290000 125000 470000 370000 250000 200000 340000 240000 ...
## $ player_positions : chr "RW, CF, ST" "ST, LW" "LW, CAM" "GK" ...
## $ preferred_foot : chr "Left" "Right" "Right" "Right" ...
## $ international_reputation : num 5 5 5 3 4 4 3 3 4 3 ...
## $ weak_foot : num 4 4 5 3 4 5 4 3 4 3 ...
## $ skill_moves : num 4 5 5 1 4 4 1 2 4 4 ...
## $ work_rate : chr "Medium/Low" "High/Low" "High/Medium" "Medium/Medium" ...
## $ body_type : chr "Messi" "C. Ronaldo" "Neymar" "Normal" ...
## $ real_face : chr "Yes" "Yes" "Yes" "Yes" ...
## $ release_clause_eur : num 1.96e+08 9.65e+07 1.95e+08 1.65e+08 1.84e+08 ...
## $ player_tags : chr "#Dribbler, #Distance Shooter, #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Forward" "#Speedster, #Dribbler, #Distance Shooter, #Acrobat, #Clinical Finisher, #Complete Forward" "#Speedster, #Dribbler, #Playmaker , #Crosser, #FK Specialist, #Acrobat, #Clinical Finisher, #Complete Midfield"| __truncated__ NA ...
## $ team_position : chr "RW" "LW" "CAM" "GK" ...
## $ team_jersey_number : num 10 7 10 13 7 17 1 4 10 11 ...
## $ loaned_from : chr NA NA NA NA ...
## $ joined : Date, format: "2004-07-01" "2018-07-10" ...
## $ contract_valid_until : num 2021 2022 2022 2023 2024 ...
## $ nation_position : chr NA "LS" "LW" "GK" ...
## $ nation_jersey_number : num NA 7 10 1 10 7 22 4 NA 10 ...
## $ pace : num 87 90 91 NA 91 76 NA 77 74 93 ...
## $ shooting : num 92 93 85 NA 83 86 NA 60 76 86 ...
## $ passing : num 92 82 87 NA 86 92 NA 70 89 81 ...
## $ dribbling : num 96 89 95 NA 94 86 NA 71 89 89 ...
## $ defending : num 39 35 32 NA 35 61 NA 90 72 45 ...
## $ physic : num 66 78 58 NA 66 78 NA 86 66 74 ...
## $ gk_diving : num NA NA NA 87 NA NA 88 NA NA NA ...
## $ gk_handling : num NA NA NA 92 NA NA 85 NA NA NA ...
## $ gk_kicking : num NA NA NA 78 NA NA 88 NA NA NA ...
## $ gk_reflexes : num NA NA NA 89 NA NA 90 NA NA NA ...
## $ gk_speed : num NA NA NA 52 NA NA 45 NA NA NA ...
## $ gk_positioning : num NA NA NA 90 NA NA 88 NA NA NA ...
## $ player_traits : chr "Beat Offside Trap, Argues with Officials, Early Crosser, Finesse Shot, Speed Dribbler (CPU AI Only), 1-on-1 Rus"| __truncated__ "Long Throw-in, Selfish, Argues with Officials, Early Crosser, Speed Dribbler (CPU AI Only), Skilled Dribbling" "Power Free-Kick, Injury Free, Selfish, Early Crosser, Speed Dribbler (CPU AI Only), Crowd Favourite" "Flair, Acrobatic Clearance" ...
## $ attacking_crossing : num 88 84 87 13 81 93 18 53 86 79 ...
## $ attacking_finishing : num 95 94 87 11 84 82 14 52 72 90 ...
## $ attacking_heading_accuracy: num 70 89 62 15 61 55 11 86 55 59 ...
## $ attacking_short_passing : num 92 83 87 43 89 92 61 78 92 84 ...
## $ attacking_volleys : num 88 87 87 13 83 82 14 45 76 79 ...
## $ skill_dribbling : num 97 89 96 12 95 86 21 70 87 89 ...
## $ skill_curve : num 93 81 88 13 83 85 18 60 85 83 ...
## $ skill_fk_accuracy : num 94 76 87 14 79 83 12 70 78 69 ...
## $ skill_long_passing : num 92 77 81 40 83 91 63 81 88 75 ...
## $ skill_ball_control : num 96 92 95 30 94 91 30 76 92 89 ...
## $ movement_acceleration : num 91 89 94 43 94 77 38 74 77 94 ...
## $ movement_sprint_speed : num 84 91 89 60 88 76 50 79 71 92 ...
## $ movement_agility : num 93 87 96 67 95 78 37 61 92 91 ...
## $ movement_reactions : num 95 96 92 88 90 91 86 88 89 92 ...
## $ movement_balance : num 95 71 84 49 94 76 43 53 93 88 ...
## $ power_shot_power : num 86 95 80 59 82 91 66 81 79 80 ...
## $ power_jumping : num 68 95 61 78 56 63 79 90 68 69 ...
## $ power_stamina : num 75 85 81 41 84 89 35 75 85 85 ...
## $ power_strength : num 68 78 49 78 63 74 78 92 58 73 ...
## $ power_long_shots : num 94 93 84 12 80 90 10 64 82 84 ...
## $ mentality_aggression : num 48 63 51 34 54 76 43 82 62 63 ...
## $ mentality_interceptions : num 40 29 36 19 41 61 22 89 82 55 ...
## $ mentality_positioning : num 94 95 87 11 87 88 11 47 79 92 ...
## $ mentality_vision : num 94 82 90 65 89 94 70 65 91 84 ...
## $ mentality_penalties : num 75 85 90 11 88 79 25 62 82 77 ...
## $ mentality_composure : num 96 95 94 68 91 91 70 89 92 91 ...
## $ defending_marking : num 33 28 27 27 34 68 25 91 68 38 ...
## $ defending_standing_tackle : num 37 32 26 12 27 58 13 92 76 43 ...
## $ defending_sliding_tackle : num 26 24 29 18 22 51 10 85 71 41 ...
## $ goalkeeping_diving : num 6 7 9 87 11 15 88 13 13 14 ...
## $ goalkeeping_handling : num 11 11 9 92 12 13 85 10 9 14 ...
## $ goalkeeping_kicking : num 15 15 15 78 6 5 88 13 7 9 ...
## $ goalkeeping_positioning : num 14 14 15 90 8 10 88 11 14 11 ...
## $ goalkeeping_reflexes : num 8 11 11 89 8 13 90 11 9 14 ...
## $ ls : chr "89+2" "91+3" "84+3" NA ...
## $ st : chr "89+2" "91+3" "84+3" NA ...
## $ rs : chr "89+2" "91+3" "84+3" NA ...
## $ lw : chr "93+2" "89+3" "90+3" NA ...
## $ lf : chr "93+2" "90+3" "89+3" NA ...
## $ cf : chr "93+2" "90+3" "89+3" NA ...
## $ rf : chr "93+2" "90+3" "89+3" NA ...
## $ rw : chr "93+2" "89+3" "90+3" NA ...
## $ lam : chr "93+2" "88+3" "90+3" NA ...
## $ cam : chr "93+2" "88+3" "90+3" NA ...
## $ ram : chr "93+2" "88+3" "90+3" NA ...
## $ lm : chr "92+2" "88+3" "89+3" NA ...
## $ lcm : chr "87+2" "81+3" "82+3" NA ...
## $ cm : chr "87+2" "81+3" "82+3" NA ...
## $ rcm : chr "87+2" "81+3" "82+3" NA ...
## $ rm : chr "92+2" "88+3" "89+3" NA ...
## $ lwb : chr "68+2" "65+3" "66+3" NA ...
## $ ldm : chr "66+2" "61+3" "61+3" NA ...
## $ cdm : chr "66+2" "61+3" "61+3" NA ...
## $ rdm : chr "66+2" "61+3" "61+3" NA ...
## $ rwb : chr "68+2" "65+3" "66+3" NA ...
## [list output truncated]
## - attr(*, "spec")=
## .. cols(
## .. sofifa_id = col_double(),
## .. player_url = col_character(),
## .. short_name = col_character(),
## .. long_name = col_character(),
## .. age = col_double(),
## .. dob = col_date(format = ""),
## .. height_cm = col_double(),
## .. weight_kg = col_double(),
## .. nationality = col_character(),
## .. club = col_character(),
## .. overall = col_double(),
## .. potential = col_double(),
## .. value_eur = col_double(),
## .. wage_eur = col_double(),
## .. player_positions = col_character(),
## .. preferred_foot = col_character(),
## .. international_reputation = col_double(),
## .. weak_foot = col_double(),
## .. skill_moves = col_double(),
## .. work_rate = col_character(),
## .. body_type = col_character(),
## .. real_face = col_character(),
## .. release_clause_eur = col_double(),
## .. player_tags = col_character(),
## .. team_position = col_character(),
## .. team_jersey_number = col_double(),
## .. loaned_from = col_character(),
## .. joined = col_date(format = ""),
## .. contract_valid_until = col_double(),
## .. nation_position = col_character(),
## .. nation_jersey_number = col_double(),
## .. pace = col_double(),
## .. shooting = col_double(),
## .. passing = col_double(),
## .. dribbling = col_double(),
## .. defending = col_double(),
## .. physic = col_double(),
## .. gk_diving = col_double(),
## .. gk_handling = col_double(),
## .. gk_kicking = col_double(),
## .. gk_reflexes = col_double(),
## .. gk_speed = col_double(),
## .. gk_positioning = col_double(),
## .. player_traits = col_character(),
## .. attacking_crossing = col_double(),
## .. attacking_finishing = col_double(),
## .. attacking_heading_accuracy = col_double(),
## .. attacking_short_passing = col_double(),
## .. attacking_volleys = col_double(),
## .. skill_dribbling = col_double(),
## .. skill_curve = col_double(),
## .. skill_fk_accuracy = col_double(),
## .. skill_long_passing = col_double(),
## .. skill_ball_control = col_double(),
## .. movement_acceleration = col_double(),
## .. movement_sprint_speed = col_double(),
## .. movement_agility = col_double(),
## .. movement_reactions = col_double(),
## .. movement_balance = col_double(),
## .. power_shot_power = col_double(),
## .. power_jumping = col_double(),
## .. power_stamina = col_double(),
## .. power_strength = col_double(),
## .. power_long_shots = col_double(),
## .. mentality_aggression = col_double(),
## .. mentality_interceptions = col_double(),
## .. mentality_positioning = col_double(),
## .. mentality_vision = col_double(),
## .. mentality_penalties = col_double(),
## .. mentality_composure = col_double(),
## .. defending_marking = col_double(),
## .. defending_standing_tackle = col_double(),
## .. defending_sliding_tackle = col_double(),
## .. goalkeeping_diving = col_double(),
## .. goalkeeping_handling = col_double(),
## .. goalkeeping_kicking = col_double(),
## .. goalkeeping_positioning = col_double(),
## .. goalkeeping_reflexes = col_double(),
## .. ls = col_character(),
## .. st = col_character(),
## .. rs = col_character(),
## .. lw = col_character(),
## .. lf = col_character(),
## .. cf = col_character(),
## .. rf = col_character(),
## .. rw = col_character(),
## .. lam = col_character(),
## .. cam = col_character(),
## .. ram = col_character(),
## .. lm = col_character(),
## .. lcm = col_character(),
## .. cm = col_character(),
## .. rcm = col_character(),
## .. rm = col_character(),
## .. lwb = col_character(),
## .. ldm = col_character(),
## .. cdm = col_character(),
## .. rdm = col_character(),
## .. rwb = col_character(),
## .. lb = col_character(),
## .. lcb = col_character(),
## .. cb = col_character(),
## .. rcb = col_character(),
## .. rb = col_character()
## .. )
The file contains a lot of information that we do not need for clustering. Only player attributes such as pace, strength, passing, finishing and heading are kept.
k.Fifa <- Fifa20[,c(32:43,45:78)]
head(k.Fifa,25)
## # A tibble: 25 x 46
## pace shooting passing dribbling defending physic gk_diving gk_handling
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 87 92 92 96 39 66 NA NA
## 2 90 93 82 89 35 78 NA NA
## 3 91 85 87 95 32 58 NA NA
## 4 NA NA NA NA NA NA 87 92
## 5 91 83 86 94 35 66 NA NA
## 6 76 86 92 86 61 78 NA NA
## 7 NA NA NA NA NA NA 88 85
## 8 77 60 70 71 90 86 NA NA
## 9 74 76 89 89 72 66 NA NA
## 10 93 86 81 89 45 74 NA NA
## # ... with 15 more rows, and 38 more variables: gk_kicking <dbl>,
## # gk_reflexes <dbl>, gk_speed <dbl>, gk_positioning <dbl>,
## # attacking_crossing <dbl>, attacking_finishing <dbl>,
## # attacking_heading_accuracy <dbl>, attacking_short_passing <dbl>,
## # attacking_volleys <dbl>, skill_dribbling <dbl>, skill_curve <dbl>,
## # skill_fk_accuracy <dbl>, skill_long_passing <dbl>,
## # skill_ball_control <dbl>, movement_acceleration <dbl>,
## # movement_sprint_speed <dbl>, movement_agility <dbl>,
## # movement_reactions <dbl>, movement_balance <dbl>, power_shot_power <dbl>,
## # power_jumping <dbl>, power_stamina <dbl>, power_strength <dbl>,
## # power_long_shots <dbl>, mentality_aggression <dbl>,
## # mentality_interceptions <dbl>, mentality_positioning <dbl>,
## # mentality_vision <dbl>, mentality_penalties <dbl>,
## # mentality_composure <dbl>, defending_marking <dbl>,
## # defending_standing_tackle <dbl>, defending_sliding_tackle <dbl>,
## # goalkeeping_diving <dbl>, goalkeeping_handling <dbl>,
## # goalkeeping_kicking <dbl>, goalkeeping_positioning <dbl>,
## # goalkeeping_reflexes <dbl>
str(k.Fifa)
## Classes 'tbl_df', 'tbl' and 'data.frame': 18278 obs. of 46 variables:
## $ pace : num 87 90 91 NA 91 76 NA 77 74 93 ...
## $ shooting : num 92 93 85 NA 83 86 NA 60 76 86 ...
## $ passing : num 92 82 87 NA 86 92 NA 70 89 81 ...
## $ dribbling : num 96 89 95 NA 94 86 NA 71 89 89 ...
## $ defending : num 39 35 32 NA 35 61 NA 90 72 45 ...
## $ physic : num 66 78 58 NA 66 78 NA 86 66 74 ...
## $ gk_diving : num NA NA NA 87 NA NA 88 NA NA NA ...
## $ gk_handling : num NA NA NA 92 NA NA 85 NA NA NA ...
## $ gk_kicking : num NA NA NA 78 NA NA 88 NA NA NA ...
## $ gk_reflexes : num NA NA NA 89 NA NA 90 NA NA NA ...
## $ gk_speed : num NA NA NA 52 NA NA 45 NA NA NA ...
## $ gk_positioning : num NA NA NA 90 NA NA 88 NA NA NA ...
## $ attacking_crossing : num 88 84 87 13 81 93 18 53 86 79 ...
## $ attacking_finishing : num 95 94 87 11 84 82 14 52 72 90 ...
## $ attacking_heading_accuracy: num 70 89 62 15 61 55 11 86 55 59 ...
## $ attacking_short_passing : num 92 83 87 43 89 92 61 78 92 84 ...
## $ attacking_volleys : num 88 87 87 13 83 82 14 45 76 79 ...
## $ skill_dribbling : num 97 89 96 12 95 86 21 70 87 89 ...
## $ skill_curve : num 93 81 88 13 83 85 18 60 85 83 ...
## $ skill_fk_accuracy : num 94 76 87 14 79 83 12 70 78 69 ...
## $ skill_long_passing : num 92 77 81 40 83 91 63 81 88 75 ...
## $ skill_ball_control : num 96 92 95 30 94 91 30 76 92 89 ...
## $ movement_acceleration : num 91 89 94 43 94 77 38 74 77 94 ...
## $ movement_sprint_speed : num 84 91 89 60 88 76 50 79 71 92 ...
## $ movement_agility : num 93 87 96 67 95 78 37 61 92 91 ...
## $ movement_reactions : num 95 96 92 88 90 91 86 88 89 92 ...
## $ movement_balance : num 95 71 84 49 94 76 43 53 93 88 ...
## $ power_shot_power : num 86 95 80 59 82 91 66 81 79 80 ...
## $ power_jumping : num 68 95 61 78 56 63 79 90 68 69 ...
## $ power_stamina : num 75 85 81 41 84 89 35 75 85 85 ...
## $ power_strength : num 68 78 49 78 63 74 78 92 58 73 ...
## $ power_long_shots : num 94 93 84 12 80 90 10 64 82 84 ...
## $ mentality_aggression : num 48 63 51 34 54 76 43 82 62 63 ...
## $ mentality_interceptions : num 40 29 36 19 41 61 22 89 82 55 ...
## $ mentality_positioning : num 94 95 87 11 87 88 11 47 79 92 ...
## $ mentality_vision : num 94 82 90 65 89 94 70 65 91 84 ...
## $ mentality_penalties : num 75 85 90 11 88 79 25 62 82 77 ...
## $ mentality_composure : num 96 95 94 68 91 91 70 89 92 91 ...
## $ defending_marking : num 33 28 27 27 34 68 25 91 68 38 ...
## $ defending_standing_tackle : num 37 32 26 12 27 58 13 92 76 43 ...
## $ defending_sliding_tackle : num 26 24 29 18 22 51 10 85 71 41 ...
## $ goalkeeping_diving : num 6 7 9 87 11 15 88 13 13 14 ...
## $ goalkeeping_handling : num 11 11 9 92 12 13 85 10 9 14 ...
## $ goalkeeping_kicking : num 15 15 15 78 6 5 88 13 7 9 ...
## $ goalkeeping_positioning : num 14 14 15 90 8 10 88 11 14 11 ...
## $ goalkeeping_reflexes : num 8 11 11 89 8 13 90 11 9 14 ...
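The positional indices c(32:43, 45:78) depend on the column order of the file. As a hedged alternative, the same kind of selection could be made by column name. The sketch below uses a made-up five-column stand-in for Fifa20 (the trait strings are placeholders) and keeps a run of columns while skipping player_traits.

```r
# Toy stand-in for a few Fifa20 columns; trait strings are placeholders
toy <- data.frame(
  short_name         = c("L. Messi", "J. Oblak"),
  pace               = c(87, NA),
  gk_diving          = c(NA, 87),
  player_traits      = c("trait A", "trait B"),
  attacking_crossing = c(88, 13),
  stringsAsFactors   = FALSE
)

# Take every column from the first attribute to the last one, by name
first <- match("pace", names(toy))
last  <- match("attacking_crossing", names(toy))

# Drop the one text column that sits inside that range
attribute_cols <- setdiff(names(toy)[first:last], "player_traits")
k_toy <- toy[, attribute_cols]
names(k_toy)
```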
We can see that there are many ‘NA’ values in the dataset. K-means clustering cannot handle ‘NA’ values. While there are several statistical methods for imputing missing values, we will not use any of them. Instead, I will replace all ‘NA’ values with 1. The next table shows why I chose to replace ‘NA’ with 1 rather than use a statistical method.
Fifa20[1:10,c(3,32:43)]
## # A tibble: 10 x 13
## short_name pace shooting passing dribbling defending physic gk_diving
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 L. Messi 87 92 92 96 39 66 NA
## 2 Cristiano~ 90 93 82 89 35 78 NA
## 3 Neymar Jr 91 85 87 95 32 58 NA
## 4 J. Oblak NA NA NA NA NA NA 87
## 5 E. Hazard 91 83 86 94 35 66 NA
## 6 K. De Bru~ 76 86 92 86 61 78 NA
## 7 M. ter St~ NA NA NA NA NA NA 88
## 8 V. van Di~ 77 60 70 71 90 86 NA
## 9 L. Modric 74 76 89 89 72 66 NA
## 10 M. Salah 93 86 81 89 45 74 NA
## # ... with 5 more variables: gk_handling <dbl>, gk_kicking <dbl>,
## # gk_reflexes <dbl>, gk_speed <dbl>, gk_positioning <dbl>
From the table, we can see that J. Oblak and M. ter Stegen have missing values for the first six attributes, whereas the remaining players have missing values for the last six attributes. J. Oblak and M. ter Stegen are goalkeepers, so there is no information about their pace, dribbling or shooting, since goalkeepers do not perform such skills. Likewise, outfield players such as L. Messi have no gk_diving or gk_positioning ratings, as such skills are not required of them. Since all the ratings are out of 100, I replace each ‘NA’ with 1 to indicate that the player has a very low level of skill in that attribute.
After replacing the missing values, I run the k-means clustering. As mentioned above, I use three clusters for the three most basic positions in football.
k.Fifa[is.na(k.Fifa)] <- 1
summary(k.Fifa)
## pace shooting passing dribbling
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:57.00 1st Qu.:35.00 1st Qu.:46.00 1st Qu.:53.00
## Median :67.00 Median :52.00 Median :56.00 Median :62.00
## Mean :60.27 Mean :46.58 Mean :50.97 Mean :55.68
## 3rd Qu.:74.00 3rd Qu.:62.00 3rd Qu.:63.00 3rd Qu.:69.00
## Max. :96.00 Max. :93.00 Max. :92.00 Max. :96.00
## defending physic gk_diving gk_handling
## Min. : 1.00 Min. : 1.00 Min. : 1.000 Min. : 1.000
## 1st Qu.:31.00 1st Qu.:55.00 1st Qu.: 1.000 1st Qu.: 1.000
## Median :52.00 Median :64.00 Median : 1.000 Median : 1.000
## Mean :45.92 Mean :57.76 Mean : 8.176 Mean : 7.923
## 3rd Qu.:64.00 3rd Qu.:71.00 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :90.00 Max. :90.00 Max. :90.000 Max. :92.000
## gk_kicking gk_reflexes gk_speed gk_positioning
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 7.776 Mean : 8.284 Mean : 5.099 Mean : 7.948
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :93.000 Max. :92.000 Max. :65.000 Max. :91.000
## attacking_crossing attacking_finishing attacking_heading_accuracy
## Min. : 5.00 Min. : 2.00 Min. : 5.00
## 1st Qu.:38.00 1st Qu.:30.00 1st Qu.:44.00
## Median :54.00 Median :49.00 Median :56.00
## Mean :49.72 Mean :45.59 Mean :52.22
## 3rd Qu.:64.00 3rd Qu.:62.00 3rd Qu.:64.00
## Max. :93.00 Max. :95.00 Max. :93.00
## attacking_short_passing attacking_volleys skill_dribbling skill_curve
## Min. : 7.00 Min. : 3.00 Min. : 4.0 Min. : 6.00
## 1st Qu.:54.00 1st Qu.:30.00 1st Qu.:50.0 1st Qu.:34.00
## Median :62.00 Median :44.00 Median :61.0 Median :49.00
## Mean :58.75 Mean :42.81 Mean :55.6 Mean :47.33
## 3rd Qu.:68.00 3rd Qu.:56.00 3rd Qu.:68.0 3rd Qu.:62.00
## Max. :92.00 Max. :90.00 Max. :97.0 Max. :94.00
## skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## Min. : 4.00 Min. : 8.00 Min. : 5.00 Min. :12.0
## 1st Qu.:31.00 1st Qu.:43.00 1st Qu.:54.00 1st Qu.:56.0
## Median :41.00 Median :56.00 Median :63.00 Median :67.0
## Mean :42.71 Mean :52.77 Mean :58.46 Mean :64.3
## 3rd Qu.:56.00 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:75.0
## Max. :94.00 Max. :92.00 Max. :96.00 Max. :97.0
## movement_sprint_speed movement_agility movement_reactions movement_balance
## Min. :11.00 Min. :11.0 Min. :21.00 Min. :12.00
## 1st Qu.:57.00 1st Qu.:55.0 1st Qu.:56.00 1st Qu.:56.00
## Median :67.00 Median :66.0 Median :62.00 Median :66.00
## Mean :64.42 Mean :63.5 Mean :61.75 Mean :63.86
## 3rd Qu.:75.00 3rd Qu.:74.0 3rd Qu.:68.00 3rd Qu.:74.00
## Max. :96.00 Max. :96.0 Max. :96.00 Max. :97.00
## power_shot_power power_jumping power_stamina power_strength
## Min. :14.00 Min. :19.00 Min. :12.00 Min. :20.00
## 1st Qu.:48.00 1st Qu.:58.00 1st Qu.:56.00 1st Qu.:58.00
## Median :59.00 Median :66.00 Median :66.00 Median :66.00
## Mean :58.18 Mean :64.93 Mean :62.89 Mean :65.23
## 3rd Qu.:68.00 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00
## Max. :95.00 Max. :95.00 Max. :97.00 Max. :97.00
## power_long_shots mentality_aggression mentality_interceptions
## Min. : 4.00 Min. : 9.00 Min. : 3.00
## 1st Qu.:32.00 1st Qu.:44.00 1st Qu.:25.00
## Median :51.00 Median :58.00 Median :52.00
## Mean :46.81 Mean :55.74 Mean :46.38
## 3rd Qu.:62.00 3rd Qu.:69.00 3rd Qu.:64.00
## Max. :94.00 Max. :95.00 Max. :92.00
## mentality_positioning mentality_vision mentality_penalties mentality_composure
## Min. : 2.00 Min. : 9.00 Min. : 7.00 Min. :12.00
## 1st Qu.:39.00 1st Qu.:44.00 1st Qu.:39.00 1st Qu.:51.00
## Median :55.00 Median :55.00 Median :49.00 Median :60.00
## Mean :50.07 Mean :53.61 Mean :48.38 Mean :58.53
## 3rd Qu.:64.00 3rd Qu.:64.00 3rd Qu.:60.00 3rd Qu.:67.00
## Max. :95.00 Max. :94.00 Max. :92.00 Max. :96.00
## defending_marking defending_standing_tackle defending_sliding_tackle
## Min. : 1.00 Min. : 5.00 Min. : 3.00
## 1st Qu.:29.00 1st Qu.:27.00 1st Qu.:24.00
## Median :52.00 Median :55.00 Median :52.00
## Mean :46.85 Mean :47.64 Mean :45.61
## 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.:64.00
## Max. :94.00 Max. :92.00 Max. :90.00
## goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00 Median :11.00
## Mean :16.57 Mean :16.35 Mean :16.21
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :90.00 Max. :92.00 Max. :93.00
## goalkeeping_positioning goalkeeping_reflexes
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :16.37 Mean :16.71
## 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :91.00 Max. :92.00
Now there are no missing values (‘NA’) in any column.
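A quick way to confirm this programmatically, sketched here on a made-up two-column data frame rather than on k.Fifa itself:

```r
# Made-up data with missing ratings
toy <- data.frame(pace = c(87, NA, 91), gk_diving = c(NA, 87, NA))

# Number of NA values per column before the replacement
colSums(is.na(toy))

# Same replacement as applied to k.Fifa above
toy[is.na(toy)] <- 1

# anyNA() returns TRUE if at least one NA remains
anyNA(toy)
```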
set.seed(100)
cluster <- kmeans(k.Fifa,3)
table(cluster$cluster)
##
## 1 2 3
## 2036 7032 9210
Now we shall see how the clustering turned out. Before comparing the result with the positions recorded in the dataset, we first look at which players are allocated to which cluster.
results <- as.data.frame(cbind(Fifa20[3],cluster=cluster$cluster))
head(results[results$cluster ==1,],10)
## short_name cluster
## 4 J. Oblak 1
## 7 M. ter Stegen 1
## 14 Alisson 1
## 15 De Gea 1
## 26 Ederson 1
## 29 T. Courtois 1
## 31 S. Handanovic 1
## 32 M. Neuer 1
## 33 H. Lloris 1
## 54 K. Navas 1
head(results[results$cluster ==2,],10)
## short_name cluster
## 1 L. Messi 2
## 2 Cristiano Ronaldo 2
## 3 Neymar Jr 2
## 5 E. Hazard 2
## 6 K. De Bruyne 2
## 9 L. Modric 2
## 10 M. Salah 2
## 11 K. Mbappé 2
## 13 H. Kane 2
## 18 S. Agüero 2
head(results[results$cluster ==3,],10)
## short_name cluster
## 8 V. van Dijk 3
## 12 K. Koulibaly 3
## 16 N. Kanté 3
## 17 G. Chiellini 3
## 19 Sergio Ramos 3
## 22 Sergio Busquets 3
## 30 Piqué 3
## 36 D. Godín 3
## 41 A. Laporte 3
## 43 Casemiro 3
(Please note that the observations are ordered by overall rating in FIFA 20, highest first. This will help with the interpretation below.)
A regular football fan can see that cluster 1 consists of goalkeepers. This should be no surprise, as separating goalkeepers from outfield players is the most obvious split. A goalkeeper clearly needs a different set of attributes, as discussed above.
Attacking players such as L. Messi and Cristiano Ronaldo are in cluster 2. Some midfield and wide players who play high up the pitch, like K. De Bruyne and M. Salah, are also grouped in this cluster.
Finally, top defenders like V. van Dijk and Sergio Ramos are allocated to cluster 3. Some midfield players who support defensive duties, such as N. Kanté and Sergio Busquets, are also in cluster 3.
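Besides reading off player names, the cluster centres give a numeric view of what each cluster represents. The sketch below is a toy illustration, not the fit on k.Fifa: the first three rows and the sixth row reuse shooting and defending values shown earlier, and the remaining rows are made up. Note that cluster labels are arbitrary and may differ between runs.

```r
# Toy data: three attackers, two goalkeepers, three defenders
toy <- data.frame(
  shooting  = c(92, 93, 85, 1, 1, 60, 45, 50),
  defending = c(39, 35, 32, 1, 1, 90, 88, 85),
  gk_diving = c(1, 1, 1, 87, 88, 1, 1, 1)
)

set.seed(100)
toy_fit <- kmeans(toy, centers = 3, nstart = 10)

# Each row is one cluster; each column is the mean rating in that cluster
round(toy_fit$centers, 1)
```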
Plotting the clusters helps check this interpretation.
library(ggplot2)
ggplot(k.Fifa, aes(x=defending, y= shooting, col= as.factor(cluster$cluster))) + geom_point()
ggplot(k.Fifa, aes(x=defending_standing_tackle, y= attacking_finishing, col= as.factor(cluster$cluster))) + geom_point()
From the graphs, it becomes clear that cluster 2 contains players with stronger attacking skills such as shooting and attacking_finishing. Players with stronger defending skills, such as defending itself and tackling, are in cluster 3. Cluster 1, the goalkeepers, sits near the bottom left of each graph, as goalkeeping requires neither of the skills plotted.
We can notice some overlap between cluster 2 and cluster 3 around the centre of both axes.
Now we shall compare the clusters with the player positions given in FIFA 20. Some players have multiple positions; we take only the first listed position for each player.
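One possible way to take the first listed position is to drop everything from the first comma onwards. This is a minimal sketch on a handful of values from player_positions, not necessarily the approach used on the full data; the vector names are introduced here for illustration.

```r
# A few example values in the format of Fifa20$player_positions
positions <- c("RW, CF, ST", "ST, LW", "GK", "CB", "CAM, CM")

# Remove the first comma and everything after it
first_position <- sub(",.*$", "", positions)
first_position
```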
# Check the distinct values of player_positions
unique(Fifa20$player_positions)
## [1] "RW, CF, ST" "ST, LW" "LW, CAM" "GK" "LW, CF"
## [6] "CAM, CM" "CB" "CM" "RW, ST" "ST, RW"
## [11] "ST" "CDM, CM" "CF, ST, LW" "CAM, RW" "CM, CDM"
## [16] "RW, LW" "CAM, LM, ST" "ST, LM" "LW, LM" "CB, LB"
## [21] "RW, CAM, CM" "CDM" "CF, LM" "CF, ST" "LB"
## [26] "CM, CAM, CDM" "CF, LW, ST" "LW" "CB, CDM" "RB, CM, CDM"
## [31] "CAM, CM, LW" "CF, ST, CAM" "LW, CM" "CAM, RM, RW" "CM, CAM"
## [36] "CM, LM, RM" "LB, CB" "RB" "CAM, CF, ST" "RW, LW, ST"
## [41] "LB, LM" "RM, LM, CM" "CAM, CM, RM" "RM, LM" "CAM, RM"
## [46] "CF, LW, CAM" "CAM, LM, RM" "LM, RM, LW" "RM, LM, LW" "CAM"
## [51] "CAM, CM, CF" "LM" "CDM, CB" "RB, CB" "RM, RW"
## [56] "LM, RW, LW" "RM, CM" "CAM, LW, ST" "RW, RM" "CM, CDM, CAM"
## [61] "CM, CAM, CF" "LW, ST, LM" "LM, ST" "RM, RW, ST" "LM, CAM, RM"
## [66] "LW, RW" "CF, LM, LW" "RM, CAM" "CF, RM, LM" "RW, LW, CAM"
## [71] "CDM, CM, CAM" "CDM, CB, LB" "ST, CAM, LW" "ST, CF" "RW, CAM"
## [76] "LW, LM, RW" "RW, CAM, LW" "RM, ST" "CM, CDM, RM" "RW, CF"
## [81] "RB, RM" "CAM, LW" "CF, CAM, CM" "RB, RM, CM" "LWB, LM, LB"
## [86] "ST, RW, LW" "CB, LB, RB" "RM, LM, CF" "CAM, LM" "LM, LWB"
## [91] "LM, RM" "RM, RB" "CM, CDM, LM" "CM, LW" "RWB, RM"
## [96] "RW" "CB, RB" "CM, LM, CDM" "CAM, CM, LM" "LW, RW, CAM"
## [101] "CM, LM" "CAM, CM, RW" "LM, LB, CM" "CM, LB" "CF, ST, RM"
## [106] "LB, LWB" "RM, CAM, RW" "RB, RW, LW" "LW, LB" "CAM, LM, LW"
## [111] "CF, LW" "RM, LM, CAM" "RB, RWB" "LM, LW" "RM, ST, LM"
## [116] "CM, RM" "CF, RW, LW" "CAM, CF" "ST, LW, CAM" "RM"
## [121] "RWB, RB, RM" "LW, CAM, CM" "LM, ST, RM" "CM, RB, RM" "LW, CF, ST"
## [126] "LM, CAM" "RB, CDM, CM" "RM, RW, CAM" "CF" "LM, RW"
## [131] "RM, RW, LM" "ST, RM" "CAM, LM, CM" "CDM, CM, LM" "RW, ST, LW"
## [136] "LB, RB" "RB, LB" "LW, CF, RW" "LB, LM, LWB" "RM, CAM, CM"
## [141] "LM, LW, RM" "CDM, CM, CB" "ST, RW, LM" "RM, LM, ST" "LM, CF"
## [146] "CDM, CB, CM" "LWB, LB" "RWB, RM, RB" "ST, LM, LW" "LM, LB"
## [151] "RWB" "ST, RW, RM" "ST, CAM" "CAM, CDM" "CAM, ST"
## [156] "CF, CM" "CF, RW" "CM, LW, RW" "RM, CAM, LM" "RB, RM, CB"
## [161] "LB, CDM" "CAM, RM, CM" "CM, CB" "CB, RB, LB" "LM, RM, CAM"
## [166] "CM, LM, CAM" "CF, CM, LW" "RW, ST, RM" "CF, RW, RM" "CAM, RM, LM"
## [171] "LM, LW, CM" "ST, CF, CAM" "LM, LW, ST" "RM, RWB" "CF, LW, RW"
## [176] "CDM, CM, RB" "RB, RW" "CM, CAM, RW" "RM, ST, RW" "CAM, RM, CF"
## [181] "CM, RM, CAM" "LW, CAM, LM" "RB, RWB, RM" "RB, CM" "CM, RB"
## [186] "CM, CAM, RM" "LM, LWB, LW" "LM, LW, CF" "LB, LWB, LM" "LM, RM, CM"
## [191] "RM, RWB, LWB" "LM, LW, CAM" "LW, CAM, RM" "CAM, CM, CDM" "CM, CAM, LM"
## [196] "RB, LB, RM" "LB, CM" "CAM, RW, LW" "RM, LM, RW" "CM, CDM, CF"
## [201] "CM, CDM, RB" "LM, CAM, CM" "ST, CAM, CF" "RM, CM, RB" "CM, RW, CAM"
## [206] "LW, RW, ST" "CDM, RWB" "CAM, RW, RM" "CM, CB, CAM" "LM, RM, RB"
## [211] "RM, CF, RW" "LW, RM, LM" "CM, RM, CDM" "ST, CF, RW" "LM, LB, LWB"
## [216] "LM, RM, ST" "CAM, RW, CM" "LW, LM, CF" "RWB, RB" "CDM, CAM, LM"
## [221] "RW, LW, RM" "CAM, ST, CDM" "LM, CM" "CM, ST" "LM, RB"
## [226] "LB, RM" "RM, CM, LM" "LW, RM" "LW, ST, RW" "CM, LWB, LM"
## [231] "CF, CAM, ST" "RM, LW" "LW, RW, LM" "CF, CAM" "RW, LM, RM"
## [236] "RW, RM, CAM" "LB, LM, CAM" "ST, RM, CAM" "CM, CDM, RWB" "LM, LW, LB"
## [241] "ST, CAM, LM" "RW, CM, RM" "LM, RW, CF" "CF, RW, ST" "CDM, CAM, CM"
## [246] "RM, CM, RW" "LM, CF, RM" "CAM, LM, RW" "CAM, ST, LM" "LB, LWB, RB"
## [251] "LWB, LB, LM" "RB, CB, RM" "RW, CF, LW" "LWB, LB, RB" "LM, CM, CAM"
## [256] "CAM, CDM, CM" "LW, ST" "RB, CDM" "CAM, ST, RW" "CM, CDM, CB"
## [261] "CB, CDM, RB" "RM, CF, LM" "LWB, RM" "ST, RM, LM" "CAM, LW, CM"
## [266] "LM, CF, CM" "RW, RM, RB" "RB, RM, RWB" "ST, LM, CAM" "LM, ST, CAM"
## [271] "ST, LW, RW" "RM, LM, LB" "CM, LWB" "CB, CDM, LB" "CAM, LW, RW"
## [276] "LM, CAM, ST" "RW, RWB" "RM, RWB, RB" "LM, CAM, LWB" "LW, RW, RM"
## [281] "RWB, RB, LWB" "CAM, ST, RM" "RW, RM, CF" "RW, RM, LW" "RW, CAM, ST"
## [286] "RB, RM, LM" "CM, CDM, LB" "CDM, LB, CM" "LM, CM, LB" "LB, RB, RM"
## [291] "LW, CM, CAM" "LB, RB, CB" "CAM, ST, LW" "LWB, LB, CB" "LWB, LM"
## [296] "LWB" "CDM, CB, RB" "CM, LM, LB" "RW, LB" "RB, RM, LB"
## [301] "RW, RM, ST" "CM, CAM, ST" "CAM, CF, RW" "CAM, RM, RB" "ST, LM, RM"
## [306] "ST, RM, LW" "CDM, CM, RM" "RM, RW, CM" "ST, CM, RB" "RM, RW, LW"
## [311] "CB, RB, RM" "CAM, RM, ST" "RB, CM, CB" "RW, LW, CM" "RM, RWB, CAM"
## [316] "RW, RM, CM" "RM, RB, RWB" "RB, LW, LB" "LB, CB, LWB" "ST, CAM, CM"
## [321] "LM, RWB" "RB, LB, CDM" "CB, LWB" "CM, RWB, RM" "RM, CF"
## [326] "LB, CB, LM" "LWB, RWB" "RB, LB, RWB" "RW, LW, LM" "LM, CM, LWB"
## [331] "LM, ST, LW" "RM, CAM, ST" "RW, CAM, RM" "LW, CAM, RW" "LW, RW, CM"
## [336] "CAM, ST, CF" "LB, CAM, LM" "LB, CB, CDM" "LM, RM, LB" "LM, RM, CF"
## [341] "LB, LW" "LM, LB, LW" "ST, CAM, RM" "LW, ST, CAM" "ST, CAM, RW"
## [346] "ST, LW, LM" "CAM, LM, CF" "CAM, CF, CM" "LM, RW, CAM" "LB, RM, LM"
## [351] "CF, CAM, LM" "CAM, RW, CF" "CB, LB, CDM" "CB, CDM, CM" "CB, CM, CDM"
## [356] "RM, LM, RWB" "RW, RM, LM" "ST, LM, RB" "RB, RM, RW" "CB, RWB, RB"
## [361] "RB, CB, ST" "CDM, CM, LB" "RW, ST, CF" "ST, RWB" "LB, LWB, CDM"
## [366] "RB, RWB, LB" "RM, CF, CAM" "RB, LB, CB" "RM, LW, RW" "CDM, LB"
## [371] "CDM, RB, RM" "CDM, RB" "LB, CM, LM" "RM, ST, CAM" "CM, CDM, LW"
## [376] "CDM, RM" "RM, RB, LB" "LM, LWB, LB" "CM, RW" "CM, RM, RB"
## [381] "CM, LM, RB" "RM, CAM, CF" "CAM, CM, ST" "CM, CF, CAM" "RM, RB, CM"
## [386] "LW, LM, LB" "RWB, RB, LB" "RM, LM, CDM" "LB, LM, RM" "CM, CF"
## [391] "RB, CB, RWB" "LW, LM, ST" "CAM, RB" "ST, CF, LW" "LM, LB, CAM"
## [396] "CF, CM, CAM" "CB, CDM, CAM" "LM, CAM, LW" "RM, LM, RB" "RWB, LWB"
## [401] "RW, CM" "CB, CM, RB" "LB, RB, LW" "RB, CDM, LB" "CM, RW, LW"
## [406] "RWB, RB, CB" "CF, RW, CAM" "RB, CB, LB" "CAM, LW, CF" "CB, RB, CAM"
## [411] "LM, LW, RW" "LB, CB, RB" "ST, RW, CF" "CDM, LM, CM" "ST, LM, CF"
## [416] "LW, LM, CAM" "LW, RM, CM" "RB, CM, RWB" "ST, CF, LM" "CDM, CAM"
## [421] "LM, CDM" "LB, LM, CB" "CAM, CF, RM" "RM, CM, CAM" "CAM, CF, LW"
## [426] "LM, CM, ST" "LM, RM, RW" "CB, CM" "LW, LM, RM" "CB, LB, LWB"
## [431] "RM, RB, LW" "LB, LM, CM" "LM, RM, LWB" "RB, LM, RM" "CF, LM, CAM"
## [436] "LB, LWB, CB" "RB, CM, RM" "CM, LB, CDM" "CAM, CF, LM" "LW, CAM, CF"
## [441] "CDM, RB, CB" "LW, LB, RW" "LM, CM, RW" "LM, RB, RWB" "ST, CM"
## [446] "RM, CM, RWB" "LM, ST, LB" "RB, CB, CM" "RW, LW, CF" "CDM, RB, CM"
## [451] "RM, LM, LWB" "LM, RW, RM" "RB, LM" "CB, LB, LM" "RB, LB, LWB"
## [456] "RB, RWB, CB" "LB, CM, LW" "CM, RB, CDM" "RM, LW, LM" "CAM, ST, CM"
## [461] "LB, CDM, CM" "LWB, LM, RWB" "CDM, LWB, CM" "LM, LWB, CM" "CAM, CDM, CB"
## [466] "RM, LB" "CDM, RB, LB" "ST, RM, RW" "RB, LB, LM" "LB, CDM, LWB"
## [471] "LB, LM, RB" "CDM, RM, CM" "RWB, CB" "CF, CM, LB" "CM, CAM, LB"
## [476] "CDM, LWB" "RW, CAM, CF" "RM, RWB, LM" "ST, LW, RM" "RB, RM, CDM"
## [481] "RB, CB, CDM" "CF, RM, ST" "CM, ST, CF" "LM, CAM, CDM" "LW, CM, LB"
## [486] "LB, LM, LW" "CM, LM, ST" "CF, RM" "RB, LB, RW" "LM, CM, LW"
## [491] "LW, RW, CF" "CDM, LM" "CM, CB, CDM" "LB, RB, LM" "RM, LW, CAM"
## [496] "CB, LB, CM" "CM, LW, LWB" "RM, RWB, ST" "ST, RW, CAM" "ST, RB, RM"
## [501] "LB, RW, CM" "CF, CAM, RW" "RM, RW, RB" "RB, CDM, CB" "RB, RWB, CDM"
## [506] "LM, LB, RM" "LM, CF, ST" "CF, ST, LM" "LB, RB, LWB" "RM, ST, RB"
## [511] "CDM, CAM, RM" "RB, CDM, RM" "ST, LW, CM" "CB, RB, CDM" "LWB, CM"
## [516] "ST, RWB, LM" "RM, CM, ST" "RB, CDM, RWB" "CM, RM, LM" "LM, CM, RM"
## [521] "LB, RB, CDM" "RB, RM, ST" "CF, LM, RM" "CM, ST, LM" "CM, RM, CF"
## [526] "CB, CM, LB" "RB, LB, CM" "LWB, CB, LM" "CB, LM, LB" "RM, RWB, CM"
## [531] "LM, RB, CB" "RM, ST, RWB" "CDM, RM, LM" "RW, CM, CAM" "CF, CAM, LW"
## [536] "RM, RB, LM" "CF, ST, CM" "LB, LWB, CM" "CM, RW, LM" "RB, RWB, RW"
## [541] "ST, CB, CAM" "LM, CF, CAM" "LM, LB, ST" "RB, RW, RWB" "RM, RB, RW"
## [546] "RWB, CB, CM" "RWB, RM, LM" "ST, CB" "CM, LB, LM" "LW, RB, LB"
## [551] "LB, RW, LW" "RW, RB" "LWB, ST, CF" "RW, RM, RWB" "CB, ST"
## [556] "RWB, LM" "CM, LM, RW" "RM, CF, CM" "LM, LB, CDM" "CB, LWB, LB"
## [561] "RM, RW, CF" "RB, CDM, CAM" "LW, RM, RW" "CM, RWB" "RW, RM, LB"
## [566] "CB, CF" "RB, ST, RM" "LM, LW, CDM" "CB, CAM" "RM, RB, CDM"
## [571] "LM, LW, LWB" "CM, RWB, LWB" "LWB, CB" "RB, LW" "CM, CDM, RW"
## [576] "RB, RWB, CM" "CB, RWB" "LB, CM, CDM" "RM, RB, ST" "LW, CM, RW"
## [581] "CB, RM" "CDM, LB, LM" "CM, CAM, RB" "CAM, RM, CDM" "RM, CAM, RWB"
## [586] "RW, LM" "CB, LM, ST" "CM, ST, RM" "CM, CAM, LW" "RB, ST, RW"
## [591] "LB, CM, RB" "CAM, LB" "RM, RB, CAM" "RM, LWB, ST" "ST, CB, RB"
## [596] "RB, CAM" "CM, CB, RB" "CM, CDM, ST" "RM, RW, RWB" "RM, CB, RB"
## [601] "RWB, RB, CDM" "RW, CM, LB" "RM, CF, RB" "RM, CM, CF" "LB, LM, CDM"
## [606] "CDM, CAM, RB" "CM, LW, ST" "RM, CB" "CM, LB, RM" "LB, CDM, LM"
## [611] "CDM, LB, RM" "LM, CM, RB" "LW, CM, RB" "RM, LWB, LM" "LWB, CAM, LM"
## [616] "CM, RB, LM" "LWB, CB, LB" "ST, CM, CAM" "LWB, LW" "RM, RWB, RW"
## [621] "RW, CM, ST" "CAM, ST, RB" "CDM, LB, RB" "RWB, CM" "LB, CB, RM"
## [626] "CF, RM, CM" "RWB, LWB, CB" "ST, RM, RWB" "LM, ST, CM" "CM, LM, CB"
## [631] "LWB, LW, ST" "CM, CF, RB" "ST, RW, RB" "RW, LM, CAM" "RW, RB, LB"
## [636] "RWB, CDM" "LW, LWB, LB" "RB, ST" "ST, LW, CDM" "LB, CDM, RB"
## [641] "CM, RWB, CDM" "LM, CDM, LWB" "RM, ST, CM"
#taking the primary position only
position <- gsub(",.*$", "", Fifa20$player_positions)
length(unique(position))
## [1] 15
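To make the pattern in the `gsub()` call concrete, here is a minimal sketch on a few hand-picked strings in the same format as `player_positions`. The vector `example_positions` is a hypothetical stand-in, not taken from the dataset object. The second approach, splitting on the comma, is only shown as a cross-check.

```r
# Hypothetical sample in the same "POS, POS, ..." format as Fifa20$player_positions
example_positions <- c("RW, CF, ST", "GK", "CB, LB", "CM, CDM")

# ",.*$" matches the first comma and everything after it, so only the
# first listed position survives the replacement
primary <- gsub(",.*$", "", example_positions)
print(primary)
## [1] "RW" "GK" "CB" "CM"

# Cross-check: split each string on ", " and keep the first piece
primary_alt <- vapply(
  strsplit(example_positions, ", ", fixed = TRUE),
  function(parts) parts[1],
  character(1)
)
print(identical(primary, primary_alt))
## [1] TRUE
```

Strings without a comma, such as "GK", are left unchanged because the pattern finds nothing to replace.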
There are 15 distinct positions. Using human intuition, I will group them under three broader positions. The grouping is done as follows:
GK - “GK”
Def - “CB”, “LB”, “RB”, “LWB”, “RWB”, “CDM”, “CM”
Att - “RM”, “LM”, “CAM”, “RW”, “LW”, “CF”, “ST”
Adding the three positions to the dataset:
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v purrr 0.3.3 v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
results$position <- position
GK <- "GK"
Def <- c("CB", "LB", "RB", "LWB", "RWB", "CDM", "CM")
Att <- c("RM", "LM", "CAM", "RW", "LW", "CF", "ST") #The three vectors above define the three categories.
for (i in seq_len(nrow(results))) { #The loop checks every observation and replaces the given position with one of the three categories
  if (position[i] %in% GK) {
    results$position[i] <- "GK"
  } else if (position[i] %in% Def) {
    results$position[i] <- "DEF"
  } else {
    results$position[i] <- "ATT" #every remaining position belongs to Att
  }
}
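The same grouping can also be written without a loop. The sketch below is an alternative, not the code used to produce the results in this document: it stores the grouping in a named lookup vector (`group_lookup`, a name introduced here for illustration) and maps a small hypothetical sample of positions in one step.

```r
# Named lookup vector: primary position -> broad group
# (same grouping as the GK / Def / Att vectors above)
group_lookup <- c(
  GK = "GK",
  CB = "DEF", LB = "DEF", RB = "DEF", LWB = "DEF", RWB = "DEF",
  CDM = "DEF", CM = "DEF",
  RM = "ATT", LM = "ATT", CAM = "ATT", RW = "ATT", LW = "ATT",
  CF = "ATT", ST = "ATT"
)

# Hypothetical sample of primary positions
sample_position <- c("GK", "CM", "ST", "LWB", "CAM")

# Indexing by name maps every element at once
sample_group <- unname(group_lookup[sample_position])
print(sample_group)
## [1] "GK"  "DEF" "ATT" "DEF" "ATT"
```

One difference in behaviour is worth noting: a position missing from the lookup vector comes back as `NA`, whereas the loop would silently label it "ATT".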
Finally, we tabulate the clusters against the players’ real positions.
Note that we have assigned the following position to each cluster, based on the players it contains:
Cluster 1 = Goalkeeper
Cluster 2 = Attacker
Cluster 3 = Defender
table(results$cluster, results$position)
##
## ATT DEF GK
## 1 0 0 2036
## 2 6333 699 0
## 3 354 8856 0
accuracy = (2036+6333+8856)/18278
accuracy *100
## [1] 94.23898
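The accuracy can also be computed from the data instead of typing the matching cells by hand. The sketch below uses a hypothetical six-row data frame `toy` in place of `results`, and a vector `cluster_meaning` (a name introduced here) that holds the cluster-to-position assignment stated above.

```r
# Hypothetical six-row stand-in for `results`
toy <- data.frame(
  cluster  = c(1, 2, 2, 3, 3, 3),
  position = c("GK", "ATT", "DEF", "DEF", "DEF", "ATT"),
  stringsAsFactors = FALSE
)

# Element k is the position that cluster k is taken to represent
cluster_meaning <- c("GK", "ATT", "DEF")

# Translate each cluster number into its assumed position
toy$predicted <- cluster_meaning[toy$cluster]

# Share of rows where the assumed position equals the listed one
toy_accuracy <- mean(toy$predicted == toy$position)
print(round(toy_accuracy * 100, 1))
## [1] 66.7
```

Because k-means numbers its clusters arbitrarily, the assignment in `cluster_meaning` has to be checked again whenever the clustering is rerun.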
The table shows how the clustering performed. It should be no surprise that all the goalkeepers are grouped into a single cluster, cluster 1, and into no other, since goalkeeper attributes are distinct from those of outfield players. The graphs above show the same result.
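About 5 percent of the attackers (354 of 6,687) are placed in the defending cluster, and about 7 percent of the defenders (699 of 9,555) are placed in the attacking cluster. Seen from the cluster side, roughly 10 percent of cluster 2 (699 of 7,032) are listed as defenders and roughly 4 percent of cluster 3 (354 of 9,210) are listed as attackers. It is not uncommon for players to have skills that are not normally a strong attribute for their position, which may cause some of them to show up in a different cluster.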
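The misclassification shares depend on which total is used as the denominator: all players of a position, or all players in a cluster. As a minimal sketch, the table above can be re-entered by hand (the object name `confusion` is introduced here) and both views computed with `prop.table()`.

```r
# The cluster-by-position table from above, typed in by hand
confusion <- matrix(
  c(   0,    0, 2036,
    6333,  699,    0,
     354, 8856,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  position = c("ATT", "DEF", "GK"))
)

# Shares within each position (every column sums to 1)
by_position <- prop.table(confusion, margin = 2)
# Shares within each cluster (every row sums to 1)
by_cluster <- prop.table(confusion, margin = 1)

# Attackers placed in the defending cluster: about 5.3 percent
print(round(100 * by_position["3", "ATT"], 1))
# Defenders placed in the attacking cluster: about 7.3 percent
print(round(100 * by_position["2", "DEF"], 1))
# Share of cluster 2 listed as defenders: about 9.9 percent
print(round(100 * by_cluster["2", "DEF"], 1))
# Share of cluster 3 listed as attackers: about 3.8 percent
print(round(100 * by_cluster["3", "ATT"], 1))
```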
However, the overall accuracy is still about 94%. This means that even if EA Sports did not provide the players’ positions in the game, we could use the attribute data to group similar players together and infer their broad position.
One thing to understand is that this document does not create new knowledge; rather, it demonstrates how k-means is used in R. Supervised models such as classification trees or logistic regression, which learn from the known positions, may classify players more accurately.