I decided to analyse football player statistics scraped from https://sofifa.com/ and published on the https://www.kaggle.com/ platform.
Sofifa publishes data from the FIFA game and updates it several times a year. I chose the most recent statistics from FIFA 20, which were as clean as the earlier updates.
Let’s look at the data:
glimpse(players_full)
## Rows: 18,278
## Columns: 104
## $ sofifa_id <int> 158023, 20801, 190871, 200389, 183277, 1...
## $ player_url <chr> "https://sofifa.com/player/158023/lionel...
## $ short_name <chr> "L. Messi", "Cristiano Ronaldo", "Neymar...
## $ long_name <chr> "Lionel Andrés Messi Cuccittini", "Cris...
## $ age <int> 32, 34, 27, 26, 28, 28, 27, 27, 33, 27, ...
## $ dob <chr> "1987-06-24", "1985-02-05", "1992-02-05"...
## $ height_cm <int> 170, 187, 175, 188, 175, 181, 187, 193, ...
## $ weight_kg <int> 72, 83, 68, 87, 74, 70, 85, 92, 66, 71, ...
## $ nationality <chr> "Argentina", "Portugal", "Brazil", "Slov...
## $ club <chr> "FC Barcelona", "Juventus", "Paris Saint...
## $ overall <int> 94, 93, 92, 91, 91, 91, 90, 90, 90, 90, ...
## $ potential <int> 94, 93, 92, 93, 91, 91, 93, 91, 90, 90, ...
## $ value_eur <int> 95500000, 58500000, 105500000, 77500000,...
## $ wage_eur <int> 565000, 405000, 290000, 125000, 470000, ...
## $ player_positions <chr> "RW, CF, ST", "ST, LW", "LW, CAM", "GK",...
## $ preferred_foot <chr> "Left", "Right", "Right", "Right", "Righ...
## $ international_reputation <int> 5, 5, 5, 3, 4, 4, 3, 3, 4, 3, 3, 3, 3, 3...
## $ weak_foot <int> 4, 4, 5, 3, 4, 5, 4, 3, 4, 3, 4, 3, 4, 3...
## $ skill_moves <int> 4, 5, 5, 1, 4, 4, 1, 2, 4, 4, 5, 2, 3, 1...
## $ work_rate <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ body_type <chr> "Messi", "C. Ronaldo", "Neymar", "Normal...
## $ real_face <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
## $ release_clause_eur <int> 195800000, 96500000, 195200000, 16470000...
## $ player_tags <chr> "#Dribbler, #Distance Shooter, #Crosser,...
## $ team_position <chr> "RW", "LW", "CAM", "GK", "LW", "RCM", "G...
## $ team_jersey_number <int> 10, 7, 10, 13, 7, 17, 1, 4, 10, 11, 7, 2...
## $ loaned_from <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ joined <chr> "2004-07-01", "2018-07-10", "2017-08-03"...
## $ contract_valid_until <int> 2021, 2022, 2022, 2023, 2024, 2023, 2022...
## $ nation_position <chr> NA, "LS", "LW", "GK", "LF", "RCM", "SUB"...
## $ nation_jersey_number <int> NA, 7, 10, 1, 10, 7, 22, 4, NA, 10, 10, ...
## $ pace <int> 87, 90, 91, NA, 91, 76, NA, 77, 74, 93, ...
## $ shooting <int> 92, 93, 85, NA, 83, 86, NA, 60, 76, 86, ...
## $ passing <int> 92, 82, 87, NA, 86, 92, NA, 70, 89, 81, ...
## $ dribbling <int> 96, 89, 95, NA, 94, 86, NA, 71, 89, 89, ...
## $ defending <int> 39, 35, 32, NA, 35, 61, NA, 90, 72, 45, ...
## $ physic <int> 66, 78, 58, NA, 66, 78, NA, 86, 66, 74, ...
## $ gk_diving <int> NA, NA, NA, 87, NA, NA, 88, NA, NA, NA, ...
## $ gk_handling <int> NA, NA, NA, 92, NA, NA, 85, NA, NA, NA, ...
## $ gk_kicking <int> NA, NA, NA, 78, NA, NA, 88, NA, NA, NA, ...
## $ gk_reflexes <int> NA, NA, NA, 89, NA, NA, 90, NA, NA, NA, ...
## $ gk_speed <int> NA, NA, NA, 52, NA, NA, 45, NA, NA, NA, ...
## $ gk_positioning <int> NA, NA, NA, 90, NA, NA, 88, NA, NA, NA, ...
## $ player_traits <chr> "Beat Offside Trap, Argues with Official...
## $ attacking_crossing <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...
## $ ls <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ st <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ rs <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ lw <chr> "93+2", "89+3", "90+3", NA, "89+3", "87+...
## $ lf <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ cf <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ rf <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ rw <chr> "93+2", "89+3", "90+3", NA, "89+3", "87+...
## $ lam <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ cam <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ ram <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ lm <chr> "92+2", "88+3", "89+3", NA, "89+3", "88+...
## $ lcm <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ cm <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ rcm <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ rm <chr> "92+2", "88+3", "89+3", NA, "89+3", "88+...
## $ lwb <chr> "68+2", "65+3", "66+3", NA, "66+3", "77+...
## $ ldm <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ cdm <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ rdm <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ rwb <chr> "68+2", "65+3", "66+3", NA, "66+3", "77+...
## $ lb <chr> "63+2", "61+3", "61+3", NA, "61+3", "73+...
## $ lcb <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ cb <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ rcb <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ rb <chr> "63+2", "61+3", "61+3", NA, "61+3", "73+...
The data contain 18,278 players with 104 attributes. Some of them are IDs, URLs and names, but most are numeric values (0-100 or 1-5) describing specific skills.
Some of the attributes will not be useful for classification, so I create a players dataset with only the useful variables. There is also a challenge: I have to choose the dependent variable (player position) from 3 possible columns: team_position, player_positions and nation_position. Not every player plays for a national team, so nation_position is missing for many players and I will choose between team_position and player_positions.
players <-
players_full %>%
dplyr::select(c(7:8,15,16,18:20,24:25,32:78))
glimpse(players)
## Rows: 18,278
## Columns: 56
## $ height_cm <int> 170, 187, 175, 188, 175, 181, 187, 193, ...
## $ weight_kg <int> 72, 83, 68, 87, 74, 70, 85, 92, 66, 71, ...
## $ player_positions <chr> "RW, CF, ST", "ST, LW", "LW, CAM", "GK",...
## $ preferred_foot <chr> "Left", "Right", "Right", "Right", "Righ...
## $ weak_foot <int> 4, 4, 5, 3, 4, 5, 4, 3, 4, 3, 4, 3, 4, 3...
## $ skill_moves <int> 4, 5, 5, 1, 4, 4, 1, 2, 4, 4, 5, 2, 3, 1...
## $ work_rate <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ player_tags <chr> "#Dribbler, #Distance Shooter, #Crosser,...
## $ team_position <chr> "RW", "LW", "CAM", "GK", "LW", "RCM", "G...
## $ pace <int> 87, 90, 91, NA, 91, 76, NA, 77, 74, 93, ...
## $ shooting <int> 92, 93, 85, NA, 83, 86, NA, 60, 76, 86, ...
## $ passing <int> 92, 82, 87, NA, 86, 92, NA, 70, 89, 81, ...
## $ dribbling <int> 96, 89, 95, NA, 94, 86, NA, 71, 89, 89, ...
## $ defending <int> 39, 35, 32, NA, 35, 61, NA, 90, 72, 45, ...
## $ physic <int> 66, 78, 58, NA, 66, 78, NA, 86, 66, 74, ...
## $ gk_diving <int> NA, NA, NA, 87, NA, NA, 88, NA, NA, NA, ...
## $ gk_handling <int> NA, NA, NA, 92, NA, NA, 85, NA, NA, NA, ...
## $ gk_kicking <int> NA, NA, NA, 78, NA, NA, 88, NA, NA, NA, ...
## $ gk_reflexes <int> NA, NA, NA, 89, NA, NA, 90, NA, NA, NA, ...
## $ gk_speed <int> NA, NA, NA, 52, NA, NA, 45, NA, NA, NA, ...
## $ gk_positioning <int> NA, NA, NA, 90, NA, NA, 88, NA, NA, NA, ...
## $ player_traits <chr> "Beat Offside Trap, Argues with Official...
## $ attacking_crossing <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...
56 columns remain after dropping those that are not suitable as predictors.
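Selecting by position is fragile if the column order in the source file ever changes. A minimal sketch of the same idea using column names, on a made-up data frame (`toy` is an assumption for illustration, not the real `players_full`; base R is used so the block runs without extra packages):

```r
# Hypothetical miniature of players_full with a few of its columns
toy <- data.frame(
  sofifa_id          = c(158023L, 20801L),
  height_cm          = c(170L, 187L),
  weight_kg          = c(72L, 83L),
  overall            = c(94L, 93L),
  attacking_crossing = c(88L, 84L),
  skill_curve        = c(93L, 81L)
)

# Columns named explicitly, plus every column with a chosen skill prefix
by_name   <- c("height_cm", "weight_kg")
by_prefix <- grep("^(attacking|skill)_", names(toy), value = TRUE)
keep      <- c(by_name, by_prefix)

# Fail early if a requested column is missing
stopifnot(all(keep %in% names(toy)))

toy_small <- toy[, keep, drop = FALSE]
names(toy_small)
```

Naming the columns makes the selection self-documenting, at the cost of a longer call than the index ranges.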
Now let’s compare player_positions and team_position:
ggplot(data = players) + geom_bar(mapping = aes(x = team_position))
ggplot(data = players) + geom_bar(mapping = aes(x = player_positions))
team_position looks much cleaner because it has far fewer levels. Unfortunately, its 2 largest levels are SUB and RES, which are not positions on the field. They refer to the player’s status in the team (SUB is a substitute and RES is a reserve). Deleting such a large share of players doesn’t make sense.
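To put a number on how much data the SUB and RES levels would cost, their share can be computed directly. A small sketch on a made-up vector (the values are assumptions standing in for players$team_position, not the real counts):

```r
# Hypothetical stand-in for players$team_position
team_position <- c("RW", "SUB", "SUB", "RES", "GK", "SUB", "RES", "CB", "SUB", "ST")

# Share of each level, largest first
shares <- sort(prop.table(table(team_position)), decreasing = TRUE)

# Combined share of rows that carry a squad status instead of a field position
off_field_share <- mean(team_position %in% c("SUB", "RES"))
off_field_share
```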
player_positions has many levels because it stores every position a player can play instead of only the current one. Some levels may contain the same positions in a different order, so I will sort the positions within each entry alphabetically.
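# Sort each player's comma-separated positions alphabetically so that
# "ST, LW" and "LW, ST" collapse into the same level
players$player_positions <-
  unname(sapply(players$player_positions, function(p) {
    paste(sort(trimws(strsplit(p, ",")[[1]])), collapse = ",")
  }))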
sort(table(players$player_positions))
##
## CAM,CB CAM,CB,CM CAM,CB,RB CAM,CB,ST CAM,CDM,ST CAM,CM,LB CAM,CM,RB
## 1 1 1 1 1 1 1
## CAM,LB CAM,RB,ST CB,CF CB,CM,LM CB,CM,RWB CB,LB,RM CB,LM,RB
## 1 1 1 1 1 1 1
## CB,LM,ST CB,LWB,RWB CDM,CF,CM CDM,CM,LWB CDM,CM,RW CDM,CM,ST CDM,LB,RM
## 1 1 1 1 1 1 1
## CDM,LM,LW CDM,LM,LWB CDM,LW,ST CDM,LWB CF,CM,LB CF,CM,LM CF,CM,RB
## 1 1 1 1 1 1 1
## CF,LM,RW CF,LWB,ST CF,RB,RM CM,LB,LWB CM,LB,RM CM,LW,LWB CM,LW,RB
## 1 1 1 1 1 1 1
## CM,LW,RM CM,LWB,RWB CM,RB,ST CM,RW,ST LB,LW,LWB LB,RM,RW LB,RW
## 1 1 1 1 1 1 1
## LM,RB,RWB LM,RB,ST LM,RWB,ST LW,LWB,ST LW,RB LWB,RB,RWB LWB,RM
## 1 1 1 1 1 1 1
## LWB,RM,RWB LWB,RM,ST RB,ST CAM,CB,CDM CAM,CDM,LM CAM,CDM,RM CAM,LM,LWB
## 1 1 1 2 2 2 2
## CAM,LW,RM CAM,RB,RM CB,LM,LWB CB,RB,ST CDM,CM,LW CDM,CM,RWB CDM,LB,LWB
## 2 2 2 2 2 2 2
## CDM,LM,RM CF,CM,LW CM,LB,LW CM,LB,RW CM,LW,ST CM,RB,RWB LB,RB,RW
## 2 2 2 2 2 2 2
## LM,LWB,RWB LM,RB LW,RB,RM RB,RW,RWB RB,RW,ST RW,RWB RWB,ST
## 2 2 2 2 2 2 2
## CAM,CDM,RB CAM,RM,RWB CB,CM,LB CB,RM CDM,RM CDM,RWB CF,CM,ST
## 3 3 3 3 3 3 3
## CM,LM,LWB CM,LM,RB CM,LM,RW CM,RM,RWB CM,RM,ST LB,LW,RB LB,LW,RW
## 3 3 3 3 3 3 3
## LM,LW,LWB LM,RWB LW,LWB LW,RB,RW LW,RM,ST CAM,LM,RW CDM,LM
## 3 3 3 3 3 4 4
## CDM,RB,RWB CF,CM CF,CM,RM CF,RM,ST CM,LWB LB,LM,ST LM,RW,ST
## 4 4 4 4 4 4 4
## RM,RW,RWB RM,RWB,ST CAM,LB,LM CB,CM,RB CF,LM,LW CF,RM,RW CM,LB,RB
## 4 4 5 5 5 5 5
## CM,LM,ST CM,RM,RW LWB,RWB CAM,RB CB,ST CF,RW CM,LM,LW
## 5 5 5 6 6 6 6
## CM,RWB LB,RM LM,LWB,RM LM,RW LW,RM RB,RM,ST CAM,CF,RW
## 6 6 6 6 6 6 7
## CDM,LB,LM CF,LM CF,LM,ST LB,LWB,RB RB,RM,RW CAM,CF,LW CB,RWB
## 7 7 7 7 7 8 8
## LM,RM,RWB CDM,RB,RM CF,RM CAM,CF,LM CAM,RW,ST CF,LW CF,LW,RW
## 8 9 9 11 11 11 11
## CF,LW,ST RB,RW CAM,CF,RM CAM,LM,LW CDM,CM,LB CM,LW CM,LW,RW
## 11 11 12 12 12 12 12
## CM,ST CB,LWB LB,RB,RWB LM,LWB RM,RWB CB,CDM,LB CB,RB,RWB
## 12 13 13 13 13 14 14
## CF LB,LM,RB CF,RW,ST CM,RW CDM,LB,RB LB,LM,LW CAM,RM,RW
## 14 14 15 15 16 16 17
## CB,LB,LM LB,LW RWB CAM,CM,LW CAM,CM,ST CF,LM,RM LW,RM,RW
## 17 17 17 18 18 18 18
## CAM,CDM CAM,LW,ST CB,LB,LWB LB,LM,RM CDM,LB CDM,CM,LM CM,LB,LM
## 20 20 20 20 21 22 23
## LM,RB,RM LWB CB,RB,RM LM,LW,RW CM,LB CAM,CF,CM CDM,CM,RM
## 23 23 24 24 25 27 27
## CM,RB,RM LM,LW,RM CAM,CM,RW CAM,RM,ST CM,RB CB,CM LB,RB,RM
## 28 30 31 32 32 35 36
## CAM,LW,RW CAM,CF,ST LM,RM,RW RM,RW,ST CAM,RW CAM,LW CDM,CM,RB
## 39 40 40 40 41 43 44
## CAM,LM,ST CDM,RB CB,CDM,RB CAM,CF LM,LW,ST CM,LM,RM RB,RM,RWB
## 46 47 48 52 53 54 56
## LB,LM,LWB CAM,CM,RM CF,ST CB,LB,RB CM,LM LM,LW LW
## 59 71 79 81 86 88 88
## RW CM,RM LW,RW,ST CAM,CM,LM RM,RW CAM,RM RB,RWB
## 91 95 99 102 107 114 115
## CB,CDM,CM RW,ST LB,LWB LW,ST CAM,LM LM,RM,ST CAM,ST
## 122 123 134 135 138 151 153
## LM,ST LW,RW RB,RM CAM,LM,RM RM,ST LB,RB CAM,CDM,CM
## 153 161 162 166 184 190 208
## RM LB,LM LM CAM CB,CDM CB,LB CDM
## 227 238 247 291 294 316 363
## CB,RB CAM,CM LM,RM RB LB CM CDM,CM
## 374 400 430 587 669 786 1413
## ST GK CB
## 1809 2036 2322
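The effect of the sorting step can be checked on a few hand-made strings. In this sketch the function name `normalise_positions` and the input vector are made up for illustration; the body mirrors the split, trim, sort and paste logic used above:

```r
# Hypothetical helper: split on commas, trim spaces, sort, and glue back together
normalise_positions <- function(p) {
  paste(sort(trimws(strsplit(p, ",")[[1]])), collapse = ",")
}

# Two orderings of the same positions, plus a single-position entry
raw <- c("ST, LW", "LW, ST", "GK")

sorted <- unname(vapply(raw, normalise_positions, character(1)))
sorted          # "LW,ST" "LW,ST" "GK"
unique(sorted)  # the two orderings now share one level
```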
There are many levels, but they contain only field positions, so I choose player_positions as my dependent variable. To keep things simple and make the results easier to read, I will group players into 4 main positions: Attacker, Midfielder, Defender and Goalkeeper.
players[which(
players$player_positions=="ST"|
players$player_positions=="LW"|
players$player_positions=="RW"|
players$player_positions=="CF"|
players$player_positions=="RM,ST"|
players$player_positions=="LW,RW"|
players$player_positions=="LM,ST"|
players$player_positions=="CAM,ST"|
players$player_positions=="CF,RW"|
players$player_positions=="LW,ST"|
players$player_positions=="RW,ST"|
players$player_positions=="CF,ST"|
players$player_positions=="CF,LW"|
players$player_positions=="CM,ST"|
players$player_positions=="CAM,CF"|
players$player_positions=="LW,RW,ST"|
players$player_positions=="LM,RM,ST"|
players$player_positions=="LM,LW,ST"|
players$player_positions=="CAM,LM,ST"|
players$player_positions=="RM,RW,ST"|
players$player_positions=="CAM,CF,ST"|
players$player_positions=="CAM,RM,ST"|
players$player_positions=="CAM,LW,ST"|
players$player_positions=="CAM,CM,ST"|
players$player_positions=="CF,RW,ST"|
players$player_positions=="CF,LW,ST"|
players$player_positions=="CF,LW,RW"|
players$player_positions=="CAM,RW,ST"|
players$player_positions=="CAM,CF,CM"|
players$player_positions=="CAM,CF,LW"|
players$player_positions=="CF,LM,ST"|
players$player_positions=="CAM,CF,RW"|
players$player_positions=="CF,LM,LW"|
players$player_positions=="CF,RM,RW"|
players$player_positions=="LM,RW,ST"|
players$player_positions=="CF,RM,ST"|
players$player_positions=="LW,RM,ST"|
players$player_positions=="CF,CM,ST"|
players$player_positions=="CM,LW,ST"|
players$player_positions=="CF,CM,LW"|
players$player_positions=="LW,LWB,ST"|
players$player_positions=="CM,RW,ST"|
players$player_positions=="CF,LWB,ST"|
players$player_positions=="CF,LM,RW"|
players$player_positions=="CDM,LW,ST"),
"player_positions"] <- "Attacker"
players[which(
players$player_positions=="CB"|
players$player_positions=="LB"|
players$player_positions=="RB"|
players$player_positions=="RWB"|
players$player_positions=="LWB"|
players$player_positions=="CB,RB"|
players$player_positions=="CB,LB"|
players$player_positions=="CB,CDM"|
players$player_positions=="LB,RB"|
players$player_positions=="LB,LM"|
players$player_positions=="RB,RM"|
players$player_positions=="LB,LWB"|
players$player_positions=="RB,RWB"|
players$player_positions=="CDM,RB"|
players$player_positions=="CB,CM"|
players$player_positions=="CM,RB"|
players$player_positions=="CM,LB"|
players$player_positions=="LB,LW"|
players$player_positions=="CB,RM"|
players$player_positions=="CB,LWB"|
players$player_positions=="CDM,LB"|
players$player_positions=="RB,RW"|
players$player_positions=="LB,RM"|
players$player_positions=="CAM,RB"|
players$player_positions=="CB,RWB"|
players$player_positions=="CB,CDM,CM"|
players$player_positions=="CB,LB,RB"|
players$player_positions=="LB,LM,LWB"|
players$player_positions=="RB,RM,RWB"|
players$player_positions=="CB,CDM,RB"|
players$player_positions=="CDM,CM,RB"|
players$player_positions=="LB,RB,RM"|
players$player_positions=="LM,LW,RW"|
players$player_positions=="CB,RB,RM"|
players$player_positions=="LM,RB,RM"|
players$player_positions=="CM,LB,LM"|
players$player_positions=="LB,LM,RM"|
players$player_positions=="CB,LB,LWB"|
players$player_positions=="CB,LB,LM"|
players$player_positions=="LB,LM,LW"|
players$player_positions=="CDM,LB,RB"|
players$player_positions=="LB,LM,RB"|
players$player_positions=="CB,RB,RWB"|
players$player_positions=="CB,CDM,LB"|
players$player_positions=="LB,RB,RWB"|
players$player_positions=="CDM,CM,LB"|
players$player_positions=="CDM,RB,RM"|
players$player_positions=="LM,RM,RWB"|
players$player_positions=="RB,RM,RW"|
players$player_positions=="LB,LWB,RB"|
players$player_positions=="CDM,LB,LM"|
players$player_positions=="RB,RM,ST"|
players$player_positions=="CM,LB,RB"|
players$player_positions=="CB,CM,RB"|
players$player_positions=="CAM,LB,LM"|
players$player_positions=="CDM,RB,RWB"|
players$player_positions=="LB,LW,RB"|
players$player_positions=="CB,CM,LB"|
players$player_positions=="LB,RB,RW"|
players$player_positions=="CM,RB,RWB"|
players$player_positions=="CDM,LB,LWB"|
players$player_positions=="CB,LM,LWB"|
players$player_positions=="LWB,RB,RWB"|
players$player_positions=="LM,RB,RWB"|
players$player_positions=="LB,LW,LWB"|
players$player_positions=="CM,LB,LWB"|
players$player_positions=="CB,LWB,RWB"|
players$player_positions=="CB,LM,RB"|
players$player_positions=="CB,LB,RM"|
players$player_positions=="CB,CM,RWB"|
players$player_positions=="CAM,CB,RB"),
"player_positions"] <- "Defender"
players[which(
players$player_positions=="CM"|
players$player_positions=="RM"|
players$player_positions=="LM"|
players$player_positions=="CAM"|
players$player_positions=="CDM"|
players$player_positions=="CDM,CM"|
players$player_positions=="LM,RM"|
players$player_positions=="CAM,CM"|
players$player_positions=="LWB,RM"|
players$player_positions=="CAM,LM"|
players$player_positions=="CAM,RM"|
players$player_positions=="RM,RW"|
players$player_positions=="CM,RM"|
players$player_positions=="LM,LW"|
players$player_positions=="CM,LM"|
players$player_positions=="CDM,RWB"|
players$player_positions=="CDM,RM"|
players$player_positions=="CAM,LW"|
players$player_positions=="CAM,RW"|
players$player_positions=="CF,RM"|
players$player_positions=="CF,LM"|
players$player_positions=="LW,RM"|
players$player_positions=="LM,RW"|
players$player_positions=="CM,LWB"|
players$player_positions=="CF,CM"|
players$player_positions=="CDM,LM"|
players$player_positions=="LW,LWB"|
players$player_positions=="LM,RWB"|
players$player_positions=="CAM,CDM"|
players$player_positions=="LW,RM,RW"|
players$player_positions=="CF,LM,RM"|
players$player_positions=="CAM,CM,LW"|
players$player_positions=="CAM,RM,RW"|
players$player_positions=="CM,RW"|
players$player_positions=="RM,RWB"|
players$player_positions=="LM,LWB"|
players$player_positions=="CM,RWB"|
players$player_positions=="CM,LW"|
players$player_positions=="CM,LW,RW"|
players$player_positions=="CAM,LM,LW"|
players$player_positions=="CAM,CF,RM"|
players$player_positions=="CAM,CF,LM"|
players$player_positions=="LM,LWB,RM"|
players$player_positions=="LM,RM,RW"|
players$player_positions=="CAM,LW,RW"|
players$player_positions=="CAM,CM,RW"|
players$player_positions=="LM,LW,RM"|
players$player_positions=="CM,RB,RM"|
players$player_positions=="CDM,CM,RM"|
players$player_positions=="CAM,CM,RM"|
players$player_positions=="CDM,CM,LM"|
players$player_positions=="CM,LM,LW"|
players$player_positions=="LWB,RWB"|
players$player_positions=="CM,RM,RW"|
players$player_positions=="CM,LM,ST"|
players$player_positions=="RM,RWB,ST"|
players$player_positions=="RM,RW,RWB"|
players$player_positions=="CF,CM,RM"|
players$player_positions=="CAM,LM,RW"|
players$player_positions=="LM,LW,LWB"|
players$player_positions=="CM,RM,ST"|
players$player_positions=="CM,RM,RWB"|
players$player_positions=="CM,LM,RW"|
players$player_positions=="CM,LM,LWB"|
players$player_positions=="CAM,RM,RWB"|
players$player_positions=="LM,LWB,RWB"|
players$player_positions=="CM,LM,RM"|
players$player_positions=="CDM,LM,RM"|
players$player_positions=="CDM,CM,RWB"|
players$player_positions=="CDM,CM,LW"|
players$player_positions=="CAM,LW,RM"|
players$player_positions=="CAM,LM,LWB"|
players$player_positions=="CAM,CDM,RM"|
players$player_positions=="CAM,CDM,LM"|
players$player_positions=="LWB,RM,RWB"|
players$player_positions=="CAM,CDM,CM"|
players$player_positions=="CAM,LM,RM"|
players$player_positions=="LM,RWB,ST"|
players$player_positions=="CM,LWB,RWB"|
players$player_positions=="CM,LW,RM"|
players$player_positions=="CM,LW,LWB"|
players$player_positions=="CF,CM,LM"|
players$player_positions=="CAM,CM,LM"|
players$player_positions=="CDM,LM,LWB"|
players$player_positions=="CDM,LM,LW"|
players$player_positions=="CDM,CM,ST"|
players$player_positions=="CDM,CM,RW"|
players$player_positions=="CDM,CM,LWB"|
players$player_positions=="CDM,CF,CM"),
"player_positions"] <- "Midfielder"
players[which(
players$player_positions=="GK"),
"player_positions"] <- "Goalkeeper"
players[which(
players$player_positions!="Goalkeeper" &
players$player_positions!="Midfielder" &
players$player_positions!="Defender" &
players$player_positions!="Attacker"),
"player_positions"] <- "Others"
sort(table(players$player_positions))
##
## Others Goalkeeper Attacker Midfielder Defender
## 69 2036 3700 6026 6447
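The same grouping can be written more compactly with a named lookup vector. This is only a sketch of an alternative, not the code used for the results above: the three position lists are deliberately shortened and the input vector is made up.

```r
# Hypothetical stand-in for the sorted players$player_positions
positions <- c("ST", "CB,RB", "CDM,CM", "GK", "CAM,CB")

# Shortened example lists; the full analysis assigns many more combinations
attacker   <- c("ST", "LW", "RW", "CF", "LW,ST")
defender   <- c("CB", "LB", "RB", "CB,RB")
midfielder <- c("CM", "CDM", "CAM", "CDM,CM")

# Named vector: names are position strings, values are the four groups
lookup <- c(
  setNames(rep("Attacker",   length(attacker)),   attacker),
  setNames(rep("Defender",   length(defender)),   defender),
  setNames(rep("Midfielder", length(midfielder)), midfielder),
  GK = "Goalkeeper"
)

# Look every player up by name; combinations not listed come back as NA
grouped <- unname(lookup[positions])
grouped[is.na(grouped)] <- "Others"
grouped
```

Keeping the combinations in plain character vectors makes it easier to spot a duplicate or a combination placed in the wrong group.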
Only 69 players are left in the Others category; their position combinations are too mixed to assign to one of the four groups.
I delete Others, because there is no such position on the field and it will not be helpful in the analysis.
# Keep every row that is not in the Others group (safe even if nothing matches)
players <- players[players$player_positions != "Others", ]
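Dropping rows with a negative `which()` index has a known pitfall: when nothing matches, it removes every row instead of none. A small sketch on a made-up data frame (`toy` is an assumption for illustration) compares it with a logical filter:

```r
# Hypothetical data frame with no "Others" rows at all
toy <- data.frame(
  player_positions = c("Attacker", "Defender", "Goalkeeper"),
  weak_foot        = c(4, 3, 3)
)

# Negative indexing: which() returns integer(0), and toy[-integer(0), ] selects no rows
dropped_by_index <- toy[-which(toy$player_positions == "Others"), ]

# Logical filtering keeps every row that does not match
dropped_by_condition <- toy[toy$player_positions != "Others", ]

nrow(dropped_by_index)      # 0
nrow(dropped_by_condition)  # 3
```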
I delete gk_kicking, gk_positioning, gk_diving, gk_handling, gk_reflexes and gk_speed, because they are already represented by other columns: goalkeeping_kicking, goalkeeping_positioning, goalkeeping_reflexes, goalkeeping_diving, goalkeeping_handling and movement_acceleration / movement_sprint_speed. player_tags and player_traits are free-text values that only some players have (other players have no tags or traits), so I delete them as well. I also delete team_position, because it does not provide any additional information about position.
players <-
players %>%
dplyr::select(-c("team_position","gk_kicking","gk_positioning","gk_diving",
"gk_handling","gk_reflexes","gk_speed","player_tags",
"player_traits"))
Now let’s look at the missing values:
players %>%
md.pattern(rotate.names = TRUE)
## height_cm weight_kg player_positions preferred_foot weak_foot skill_moves
## 16173 1 1 1 1 1 1
## 2036 1 1 1 1 1 1
## 0 0 0 0 0 0
## work_rate attacking_crossing attacking_finishing
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## attacking_heading_accuracy attacking_short_passing attacking_volleys
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## skill_dribbling skill_curve skill_fk_accuracy skill_long_passing
## 16173 1 1 1 1
## 2036 1 1 1 1
## 0 0 0 0
## skill_ball_control movement_acceleration movement_sprint_speed
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## movement_agility movement_reactions movement_balance power_shot_power
## 16173 1 1 1 1
## 2036 1 1 1 1
## 0 0 0 0
## power_jumping power_stamina power_strength power_long_shots
## 16173 1 1 1 1
## 2036 1 1 1 1
## 0 0 0 0
## mentality_aggression mentality_interceptions mentality_positioning
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## mentality_vision mentality_penalties mentality_composure
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## defending_marking defending_standing_tackle defending_sliding_tackle
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## 16173 1 1 1
## 2036 1 1 1
## 0 0 0
## goalkeeping_positioning goalkeeping_reflexes pace shooting passing
## 16173 1 1 1 1 1
## 2036 1 1 0 0 0
## 0 0 2036 2036 2036
## dribbling defending physic
## 16173 1 1 1 0
## 2036 0 0 0 6
## 2036 2036 2036 12216
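For a more compact view than the pattern matrix, missing values can also be counted per column with base R. The sketch below runs on a made-up four-row data frame (`toy` and its values are assumptions for illustration), in which goalkeepers lack the outfield summary scores:

```r
# Hypothetical miniature of the players table
toy <- data.frame(
  player_positions   = c("Attacker", "Goalkeeper", "Defender", "Goalkeeper"),
  pace               = c(87, NA, 77, NA),
  shooting           = c(92, NA, 60, NA),
  goalkeeping_diving = c(6, 87, 13, 88)
)

# Number of missing values per column, keeping only columns that have any
na_counts <- colSums(is.na(toy))
na_counts <- na_counts[na_counts > 0]

# Share of rows with at least one missing value
incomplete_share <- mean(!complete.cases(toy))

# Which position groups the incomplete rows belong to
table(toy$player_positions[!complete.cases(toy)])
```

Applied to the real data, the last line would show whether the incomplete rows are concentrated in a single position group.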
There are 2036 rows with missing values in 6 columns: pace, shooting, passing, dribbling, defending and physic. That is more than 10% of the observations (2036 of 18,209, the same as the number of goalkeepers), so I will drop these 6 columns from further analysis. Additionally, these values are already represented by other variables.
players <-
players %>%
dplyr::select(-c("pace","shooting","passing","dribbling","defending",
"physic"))
summary(players)
## height_cm weight_kg player_positions preferred_foot
## Min. :156.0 Min. : 50.00 Length:18209 Length:18209
## 1st Qu.:177.0 1st Qu.: 70.00 Class :character Class :character
## Median :181.0 Median : 75.00 Mode :character Mode :character
## Mean :181.4 Mean : 75.28
## 3rd Qu.:186.0 3rd Qu.: 80.00
## Max. :205.0 Max. :110.00
## weak_foot skill_moves work_rate attacking_crossing
## Min. :1.000 Min. :1.000 Length:18209 Min. : 5.00
## 1st Qu.:3.000 1st Qu.:2.000 Class :character 1st Qu.:38.00
## Median :3.000 Median :2.000 Mode :character Median :54.00
## Mean :2.944 Mean :2.367 Mean :49.68
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:64.00
## Max. :5.000 Max. :5.000 Max. :93.00
## attacking_finishing attacking_heading_accuracy attacking_short_passing
## Min. : 2.00 Min. : 5.0 Min. : 7.00
## 1st Qu.:30.00 1st Qu.:44.0 1st Qu.:54.00
## Median :49.00 Median :56.0 Median :62.00
## Mean :45.55 Mean :52.2 Mean :58.73
## 3rd Qu.:62.00 3rd Qu.:64.0 3rd Qu.:68.00
## Max. :95.00 Max. :93.0 Max. :92.00
## attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## Min. : 3.00 Min. : 4.00 Min. : 6.0 Min. : 4.0
## 1st Qu.:30.00 1st Qu.:50.00 1st Qu.:34.0 1st Qu.:31.0
## Median :44.00 Median :61.00 Median :49.0 Median :41.0
## Mean :42.78 Mean :55.57 Mean :47.3 Mean :42.7
## 3rd Qu.:56.00 3rd Qu.:68.00 3rd Qu.:62.0 3rd Qu.:56.0
## Max. :90.00 Max. :97.00 Max. :94.0 Max. :94.0
## skill_long_passing skill_ball_control movement_acceleration
## Min. : 8.00 Min. : 5.00 Min. :12.00
## 1st Qu.:43.00 1st Qu.:54.00 1st Qu.:56.00
## Median :56.00 Median :63.00 Median :67.00
## Mean :52.76 Mean :58.44 Mean :64.28
## 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:75.00
## Max. :92.00 Max. :96.00 Max. :96.00
## movement_sprint_speed movement_agility movement_reactions movement_balance
## Min. :11.00 Min. :11.00 Min. :21.00 Min. :12.00
## 1st Qu.:57.00 1st Qu.:55.00 1st Qu.:56.00 1st Qu.:56.00
## Median :67.00 Median :66.00 Median :62.00 Median :66.00
## Mean :64.39 Mean :63.49 Mean :61.75 Mean :63.84
## 3rd Qu.:75.00 3rd Qu.:74.00 3rd Qu.:68.00 3rd Qu.:74.00
## Max. :96.00 Max. :96.00 Max. :96.00 Max. :97.00
## power_shot_power power_jumping power_stamina power_strength
## Min. :14.00 Min. :19.00 Min. :12.00 Min. :20.00
## 1st Qu.:48.00 1st Qu.:58.00 1st Qu.:56.00 1st Qu.:58.00
## Median :59.00 Median :66.00 Median :66.00 Median :66.00
## Mean :58.16 Mean :64.92 Mean :62.87 Mean :65.23
## 3rd Qu.:68.00 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00
## Max. :95.00 Max. :95.00 Max. :97.00 Max. :97.00
## power_long_shots mentality_aggression mentality_interceptions
## Min. : 4.00 Min. : 9.00 Min. : 3.00
## 1st Qu.:32.00 1st Qu.:44.00 1st Qu.:25.00
## Median :51.00 Median :58.00 Median :52.00
## Mean :46.78 Mean :55.72 Mean :46.35
## 3rd Qu.:62.00 3rd Qu.:69.00 3rd Qu.:64.00
## Max. :94.00 Max. :95.00 Max. :92.00
## mentality_positioning mentality_vision mentality_penalties mentality_composure
## Min. : 2.00 Min. : 9.00 Min. : 7.00 Min. :12.00
## 1st Qu.:39.00 1st Qu.:44.00 1st Qu.:39.00 1st Qu.:51.00
## Median :55.00 Median :55.00 Median :49.00 Median :60.00
## Mean :50.03 Mean :53.59 Mean :48.36 Mean :58.52
## 3rd Qu.:64.00 3rd Qu.:64.00 3rd Qu.:60.00 3rd Qu.:67.00
## Max. :95.00 Max. :94.00 Max. :92.00 Max. :96.00
## defending_marking defending_standing_tackle defending_sliding_tackle
## Min. : 1.00 Min. : 5.0 Min. : 3.00
## 1st Qu.:29.00 1st Qu.:27.0 1st Qu.:24.00
## Median :52.00 Median :55.0 Median :52.00
## Mean :46.82 Mean :47.6 Mean :45.57
## 3rd Qu.:64.00 3rd Qu.:66.0 3rd Qu.:64.00
## Max. :94.00 Max. :92.0 Max. :90.00
## goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.0 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.0 Median :11.00 Median :11.00
## Mean :16.6 Mean :16.38 Mean :16.23
## 3rd Qu.:14.0 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :90.0 Max. :92.00 Max. :93.00
## goalkeeping_positioning goalkeeping_reflexes
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :16.39 Mean :16.73
## 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :91.00 Max. :92.00
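Most of the numerical variables are integers on a 0-100 scale. I will rescale skill_moves and weak_foot from 1-5 to the same scale (multiplying by 100/5, which gives values from 20 to 100), and express height_cm and weight_kg as a percentage of their maximum, so that all numeric variables are on a comparable scale. This can be helpful in further analysis.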
players$weak_foot <- players$weak_foot*100/5
players$skill_moves <- players$skill_moves*100/5
players$height_cm <- players$height_cm/(max(players$height_cm))*100
players$weight_kg <- players$weight_kg/(max(players$weight_kg))*100
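As a quick check of the rescaling, the first transformed values can be reproduced by hand. The sketch uses the first player's figures from the earlier output (height 170 cm, weight 72 kg, weak_foot 4) together with the maxima reported by summary() (205 cm and 110 kg); the helper name `to_percent_of_max` is made up for illustration:

```r
# Hypothetical helper: express a value as a percentage of a known maximum
to_percent_of_max <- function(x, max_value) x / max_value * 100

height_scaled    <- to_percent_of_max(170, 205)  # height_cm with max 205
weight_scaled    <- to_percent_of_max(72, 110)   # weight_kg with max 110
weak_foot_scaled <- 4 * 100 / 5                  # 1-5 rating mapped onto 20-100

round(c(height_scaled, weight_scaled, weak_foot_scaled), 5)
# 82.92683 65.45455 80.00000
```

Note that dividing by the maximum does not stretch the values over the whole 0-100 range; a min-max transformation would be needed for that.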
glimpse(players)
## Rows: 18,209
## Columns: 41
## $ height_cm <dbl> 82.92683, 91.21951, 85.36585, 91.70732, ...
## $ weight_kg <dbl> 65.45455, 75.45455, 61.81818, 79.09091, ...
## $ player_positions <chr> "Attacker", "Attacker", "Midfielder", "G...
## $ preferred_foot <chr> "Left", "Right", "Right", "Right", "Righ...
## $ weak_foot <dbl> 80, 80, 100, 60, 80, 100, 80, 60, 80, 60...
## $ skill_moves <dbl> 80, 100, 100, 20, 80, 80, 20, 40, 80, 80...
## $ work_rate <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ attacking_crossing <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...
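Note that dividing by the column maximum caps the values at 100 but does not stretch them down to 0, which is why the converted heights and weights above sit mostly between 60 and 100. A minimal sketch contrasting this with a full min-max rescale; the helper names `rescale_by_max` and `rescale_min_max` and the sample heights are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helpers, not part of the original analysis.
rescale_by_max <- function(x) x / max(x, na.rm = TRUE) * 100

rescale_min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * 100
}

# Illustrative heights in cm (assumed values).
heights_cm <- c(156, 170, 187, 205)

# Largest value becomes 100, the others stay well above 0.
# 170 / 205 * 100 = 82.92683, the same figure as the first height in the output above.
rescale_by_max(heights_cm)

# Smallest value becomes 0 and the largest becomes 100.
rescale_min_max(heights_cm)
```

Either version puts the variables on a scale comparable with the 0-100 skill ratings; the min-max variant simply uses the whole range.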
There are 3 character variables in the dataset: the dependent variable player_positions, plus preferred_foot and work_rate. I convert them into factors and create lists of the factor and numerical variables.
players$player_positions <- as.factor(players$player_positions)
players$preferred_foot <- as.factor(players$preferred_foot)
players$work_rate <- as.factor(players$work_rate)
players_numeric_vars <-
  sapply(players, is.numeric) %>%
  which() %>%
  names()
players_factor_vars <-
  sapply(players, is.factor) %>%
  which() %>%
  names()
Now it’s time to divide the data into a training set (70%) and a test set (30%):
set.seed(987654321)
players_which_train <- createDataPartition(players$player_positions,
                                           p = 0.7,
                                           list = FALSE)
players_train <- players[players_which_train,]
players_test <- players[-players_which_train,]
The distribution of the target variable is very similar in both samples:
## Train dataset distribution:
## .
## Attacker Defender Goalkeeper Midfielder
## 0.2031691 0.3540163 0.1118607 0.3309539
## Test dataset distribution:
## .
## Attacker Defender Goalkeeper Midfielder
## 0.2032595 0.3541476 0.1117012 0.3308918
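The class shares match this closely because the split is stratified by the target variable. The sketch below shows the idea in base R on made-up data: it samples 70% of the rows within each class and then compares the class shares. The toy data frame, its class probabilities and the variable names are assumptions; it is a simplified stand-in for a stratified split, not the code used above.

```r
# Toy data standing in for the players table (made-up class shares).
set.seed(1)
toy <- data.frame(
  position = factor(sample(c("Attacker", "Defender", "Goalkeeper", "Midfielder"),
                           size = 1000, replace = TRUE,
                           prob = c(0.20, 0.35, 0.11, 0.34)))
)

# Draw 70% of the row indices within each class (a simple stratified split).
rows_by_class <- split(seq_len(nrow(toy)), toy$position)
train_idx <- unlist(
  lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

# Class shares in both samples should be almost identical.
round(prop.table(table(toy_train$position)), 3)
round(prop.table(table(toy_test$position)), 3)
```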
Now let’s look at the correlations between the numerical variables:
players_correlations <-
  cor(players_train[, players_numeric_vars],
      use = "pairwise.complete.obs")
corrplot(players_correlations,
         method = "color", tl.cex = 0.5)
The goalkeeping variables are very highly correlated with one another, and so are the defending variables. In general, goalkeeping skills seem to be negatively correlated with most other attributes, so we can expect goalkeepers to be predicted very accurately by every model.
I save the most highly correlated variables as candidates to be excluded from the analysis. They add very little information about position and increase the time needed to fit the models.
correlated_variables_90 <- findCorrelation(players_correlations,
                                           cutoff = 0.90,
                                           names = TRUE)
correlated_variables_80 <- findCorrelation(players_correlations,
                                           cutoff = 0.80,
                                           names = TRUE)
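To make the cutoff concrete, the sketch below lists the pairs of variables whose absolute correlation exceeds 0.90 in a small made-up data set. It only shows which pairs trigger the filter; findCorrelation additionally decides which member of each pair to drop (the one with the larger mean absolute correlation). The toy variables `x1`, `x2`, `x3` are assumptions.

```r
# Made-up data: x2 is almost a copy of x1, x3 is independent noise.
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(200, sd = 0.1),
                  x3 = rnorm(200))

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag those above the cutoff.
cutoff <- 0.90
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                         var2 = colnames(cor_mat)[high[, "col"]],
                         correlation = cor_mat[high])
high_pairs
```

Lowering the cutoff to 0.80 flags more pairs, which is why the 0.80 list removes more variables than the 0.90 list.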
Before we start modelling, let’s look at the factor variables:
We can see that there are many more right-footed players. Only 30.92% of players prefer their left foot.
Work rate describes how hard a player works in attack and in defense. For example, High/Low means that the player works hard in attack and does not work hard in defense. It reflects mentality more than the real position on the field, so there are defenders with Low/Low, etc.
We can see that most players have a Medium/Medium work rate. The other groups are smaller, but only Low/Low seems really small and may not add much value to the model. However, it cannot be merged into another group, so I will keep it.
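Since each work_rate value packs two ratings into one string, it can help to see them separately. A minimal sketch that splits the attack and defense parts and cross-tabulates them; the sample vector and the `work_rate_split` name are assumptions, and this split is not used in the models of this analysis.

```r
# Made-up sample of work_rate strings in the "attack/defense" format.
work_rate <- c("Medium/Low", "High/Low", "High/Medium", "Low/Low", "Medium/Medium")

# Split each string on "/" into an attack part and a defense part.
parts <- do.call(rbind, strsplit(work_rate, "/", fixed = TRUE))
levels_ordered <- c("Low", "Medium", "High")

work_rate_split <- data.frame(
  attack  = factor(parts[, 1], levels = levels_ordered, ordered = TRUE),
  defense = factor(parts[, 2], levels = levels_ordered, ordered = TRUE)
)

# Rows are the attack rating, columns the defense rating.
table(attack = work_rate_split$attack, defense = work_rate_split$defense)
```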
Now we are ready to run some models and predict players’ positions on the field.
First, I run a multinomial logit model without the variables with correlation higher than 0.8 and without the preferred_foot variable.
players_mlogit1a <- multinom(player_positions ~ .,
                             data = players_train %>%
                               dplyr::select(-c(all_of(correlated_variables_80), "preferred_foot")))
players_mlogit1a_fitted <- predict(players_mlogit1a)
table(players_mlogit1a_fitted,
      players_train$player_positions)
##
## players_mlogit1a_fitted Attacker Defender Goalkeeper Midfielder
## Attacker 2152 10 0 325
## Defender 14 4036 0 436
## Goalkeeper 0 0 1426 0
## Midfielder 424 467 0 3458
Next, I run a multinomial logit model without the variables with correlation higher than 0.9 and without the preferred_foot variable.
players_mlogit1b <- multinom(player_positions ~ .,
                             data = players_train %>%
                               dplyr::select(-c(all_of(correlated_variables_90), "preferred_foot")))
players_mlogit1b_fitted <- predict(players_mlogit1b)
table(players_mlogit1b_fitted,
      players_train$player_positions)
##
## players_mlogit1b_fitted Attacker Defender Goalkeeper Midfielder
## Attacker 2178 15 0 366
## Defender 12 4065 0 322
## Goalkeeper 0 0 1426 1
## Midfielder 400 433 0 3530
And now I run a multinomial logit model with every variable.
players_mlogit2 <- multinom(player_positions ~ .,
                            data = players_train)
players_mlogit2_fitted <- predict(players_mlogit2)
table(players_mlogit2_fitted,
      players_train$player_positions)
##
## players_mlogit2_fitted Attacker Defender Goalkeeper Midfielder
## Attacker 2191 19 0 390
## Defender 15 4116 0 315
## Goalkeeper 2 8 1426 3
## Midfielder 382 370 0 3511
Likelihood ratio test:
lrtest(players_mlogit1a)[5]
## # weights: 8 (3 variable)
## initial value 17672.480516
## final value 16603.004977
## converged
## Pr(>Chisq)
## 1
## 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(players_mlogit1b)[5]
## # weights: 8 (3 variable)
## initial value 17672.480516
## final value 16603.004977
## converged
## Pr(>Chisq)
## 1
## 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(players_mlogit2)[5]
## # weights: 8 (3 variable)
## initial value 17672.480516
## final value 16603.004977
## converged
## Pr(>Chisq)
## 1
## 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis (that the fitted model is no better than an intercept-only model) can be rejected at the 0.001 level for all three models.
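A likelihood ratio test of this kind compares the log-likelihood of the fitted model with that of an intercept-only model: LR = 2 * (logLik_full - logLik_null), referred to a chi-squared distribution with degrees of freedom equal to the difference in the number of estimated parameters. The sketch below does the calculation by hand on the built-in iris data, which is an assumption made purely for illustration; the analysis itself uses lrtest on the players models.

```r
library(nnet)

# Full model and intercept-only (null) model on a small built-in data set.
fit_full <- multinom(Species ~ ., data = iris, trace = FALSE)
fit_null <- multinom(Species ~ 1, data = iris, trace = FALSE)

ll_full <- logLik(fit_full)
ll_null <- logLik(fit_null)

# Likelihood ratio statistic and its degrees of freedom.
lr_stat <- 2 * (as.numeric(ll_full) - as.numeric(ll_null))
df_diff <- attr(ll_full, "df") - attr(ll_null, "df")

# Upper-tail chi-squared probability: a small value rejects the null model.
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```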
Now I compare the real and predicted values on the training set:
accuracy_multinom(predicted = players_mlogit1a_fitted,
real = players_train$player_positions)
## accuracy balanced_accuracy
## 86.85284 88.62047
## balanced_correctly_predicted
## 89.00282
accuracy_multinom(predicted = players_mlogit1b_fitted,
real = players_train$player_positions)
## accuracy balanced_accuracy
## 87.84907 89.45873
## balanced_correctly_predicted
## 89.58907
accuracy_multinom(predicted = players_mlogit2_fitted,
real = players_train$player_positions)
## accuracy balanced_accuracy
## 88.20207 89.75414
## balanced_correctly_predicted
## 89.57582
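These three measures can be reproduced by hand from a confusion matrix. The sketch below re-enters the model 1a training table shown earlier. Computing `balanced_accuracy` as the mean per-class recall and `balanced_correctly_predicted` as the mean per-class precision is an assumption about how accuracy_multinom works (its definition is not shown here), although both formulas reproduce the figures reported for model 1a.

```r
# Confusion matrix of model 1a on the training set, copied from the table
# shown earlier (rows = predicted, columns = actual).
classes <- c("Attacker", "Defender", "Goalkeeper", "Midfielder")
conf_1a <- matrix(c(2152,   10,    0,  325,
                      14, 4036,    0,  436,
                       0,    0, 1426,    0,
                     424,  467,    0, 3458),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(predicted = classes, actual = classes))

# Share of all players on the diagonal.
overall_accuracy <- sum(diag(conf_1a)) / sum(conf_1a) * 100

# Per-class recall: correct predictions divided by the actual class size.
recall_by_class <- diag(conf_1a) / colSums(conf_1a) * 100

# Per-class precision: correct predictions divided by the predicted class size.
precision_by_class <- diag(conf_1a) / rowSums(conf_1a) * 100

c(accuracy = overall_accuracy,               # 86.85284
  mean_recall = mean(recall_by_class),       # 88.62047
  mean_precision = mean(precision_by_class)) # 89.00282
```

The unweighted class averages are higher than the overall accuracy because goalkeepers, the smallest class, are classified perfectly and count as much as any other class in the average.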
players_test$multinom1a <- predict(players_mlogit1a,
                                   newdata = players_test)
conf_matrix_multinom1a <-
  confusionMatrix(players_test$multinom1a,
                  players_test$player_positions)
players_test$multinom1b <- predict(players_mlogit1b,
                                   newdata = players_test)
conf_matrix_multinom1b <-
  confusionMatrix(players_test$multinom1b,
                  players_test$player_positions)
players_test$multinom2 <- predict(players_mlogit2,
                                  newdata = players_test)
conf_matrix_multinom2 <-
  confusionMatrix(players_test$multinom2,
                  players_test$player_positions)
And now I check the accuracy on the test dataset:
## Accuracy of multinomial model 1a: 0.8731002
##
## Accuracy of multinomial model 1b: 0.8741989
##
## Accuracy of multinomial model 2: 0.8798755
##
## Accuracy of multinomial model 1a by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8732394 0.9034052
## Class: Defender 0.9086564 0.9179641
## Class: Goalkeeper 1.0000000 1.0000000
## Class: Midfielder 0.7971624 0.8669377
##
## Accuracy of multinomial model 1b by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8490393 0.8990569
## Class: Defender 0.9257453 0.9221503
## Class: Goalkeeper 0.9983633 0.9998969
## Class: Midfielder 0.7991632 0.8702551
##
## Accuracy of multinomial model 2 by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8479053 0.9001784
## Class: Defender 0.9318670 0.9310653
## Class: Goalkeeper 0.9854604 0.9990724
## Class: Midfielder 0.8122340 0.8742203
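The code that printed these figures is not shown, so the following is only a sketch of how such numbers can be read from a caret confusionMatrix object: the overall accuracy sits in `$overall` and the per-class measures in `$byClass`. The short `actual` and `predicted` vectors are made up for illustration.

```r
library(caret)

classes <- c("Attacker", "Defender", "Goalkeeper", "Midfielder")

# Made-up labels: 6 of the 8 predictions are correct.
actual    <- factor(c("Attacker", "Attacker", "Defender", "Defender",
                      "Goalkeeper", "Goalkeeper", "Midfielder", "Midfielder"),
                    levels = classes)
predicted <- factor(c("Attacker", "Midfielder", "Defender", "Defender",
                      "Goalkeeper", "Goalkeeper", "Midfielder", "Attacker"),
                    levels = classes)

cm <- confusionMatrix(data = predicted, reference = actual)

# Overall accuracy (0.75 for this toy example).
cm$overall["Accuracy"]

# Per-class precision and balanced accuracy, one row per class.
cm$byClass[, c("Precision", "Balanced Accuracy")]
```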
The multinomial logit model with all variables has the highest overall accuracy on the test set (0.880), but the accuracy by position makes it hard to choose the best model: model 1a has the best precision for attackers and goalkeepers, while model 2 is best for defenders and midfielders. It is also important that model 1a uses considerably fewer predictors than model 2 and gives very similar results. Overall, the accuracy is pretty high, but can it be higher? Let’s find out.
I will start by defining the training controls: 2-fold cross validation and 10-fold cross validation. I will compare the models under both controls.
control_cv2 <- trainControl(method = "cv",
                            number = 2,
                            classProbs = TRUE)
control_cv10 <- trainControl(method = "cv",
                             number = 10,
                             classProbs = TRUE)
Now I compute 4 k-nearest neighbours (KNN) models: two with 2-fold cross validation and two with 10-fold cross validation. For each control, one model uses the full data and the other excludes the variables with correlation higher than 0.9 and the preferred_foot variable.
I try many k values to obtain the highest possible accuracy and scale all variables to the range [0, 1].
set.seed(987654321)
test_k <- data.frame(k = seq(1, 99, 4))
players_train_knn1a <-
  train(player_positions ~ .,
        data = players_train %>%
          dplyr::select(-c(all_of(correlated_variables_90), "preferred_foot")),
        method = "knn",
        trControl = control_cv2,
        tuneGrid = test_k,
        preProcess = c("range"))
players_train_knn1b <-
  train(player_positions ~ .,
        data = players_train,
        method = "knn",
        trControl = control_cv2,
        tuneGrid = test_k,
        preProcess = c("range"))
players_train_knn2a <-
  train(player_positions ~ .,
        data = players_train %>%
          dplyr::select(-c(all_of(correlated_variables_90), "preferred_foot")),
        method = "knn",
        trControl = control_cv10,
        tuneGrid = test_k,
        preProcess = c("range"))
players_train_knn2b <-
  train(player_positions ~ .,
        data = players_train,
        method = "knn",
        trControl = control_cv10,
        tuneGrid = test_k,
        preProcess = c("range"))
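# plot() on a caret train object returns a lattice plot, which ignores
# par(mfrow); gridExtra::grid.arrange() lays the four plots out in a 2x2 grid.
gridExtra::grid.arrange(plot(players_train_knn1a),
                        plot(players_train_knn1b),
                        plot(players_train_knn2a),
                        plot(players_train_knn2b),
                        ncol = 2)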
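KNN classifies by distance, so variables measured on wider ranges would dominate unless everything is first rescaled to [0, 1]. The sketch below shows the same idea of rescaling and scanning a grid of k values on the built-in iris data, using leave-one-out cross validation from the class package instead of the k-fold controls defined above. The data set, the shorter k grid and the helper `rescale01` are assumptions for illustration.

```r
library(class)

# Min-max rescaling to [0, 1], the same idea as preProcess = "range".
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

x <- as.data.frame(lapply(iris[, 1:4], rescale01))
y <- iris$Species

# Candidate numbers of neighbours (a shorter grid than in the analysis).
k_grid <- seq(1, 29, 4)

# Leave-one-out cross-validated accuracy for each k.
set.seed(1)
cv_accuracy <- sapply(k_grid, function(k) {
  mean(knn.cv(train = x, cl = y, k = k) == y)
})

# The k with the highest cross-validated accuracy.
best_k <- k_grid[which.max(cv_accuracy)]

data.frame(k = k_grid, accuracy = round(cv_accuracy, 3))
best_k
```

As with the players models, neighbouring k values often give almost the same accuracy, so the selected k is not a sharp optimum.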
Let’s look at the k values selected in modelling:
## players_train_knn1a k value selected: 13
##
## players_train_knn1b k value selected: 9
##
## players_train_knn2a k value selected: 21
##
## players_train_knn2b k value selected: 17
The models selected k values of 13, 9, 21 and 17, but their accuracies are quite similar.
Let’s look at the accuracy of each model on the test set:
players_test_forecasts <-
  data.frame(players_train_knn1a = predict(players_train_knn1a,
                                           players_test),
             players_train_knn1b = predict(players_train_knn1b,
                                           players_test),
             players_train_knn2a = predict(players_train_knn2a,
                                           players_test),
             players_train_knn2b = predict(players_train_knn2b,
                                           players_test))
sapply(players_test_forecasts,
       function(x) accuracy_multinom(predicted = x,
                                     real = players_test$player_positions))
## players_train_knn1a players_train_knn1b
## accuracy 85.91833 86.85222
## balanced_accuracy 87.39991 88.24109
## balanced_correctly_predicted 88.94343 89.35451
## players_train_knn2a players_train_knn2b
## accuracy 85.69859 87.05365
## balanced_accuracy 87.02629 88.39872
## balanced_correctly_predicted 88.97030 89.78581
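KNN does not seem to give better results than multinomial logistic regression, but we can see that, again, the models with all variables predict better than the reduced ones. Additionally, 10-fold cross validation gives slightly better results for the full model (87.05 vs 86.85), though not for the reduced one (85.70 vs 85.92).
Let’s now try Discriminant Analysis methods and the LogitBoost method. I run 4 methods: Shrinkage Discriminant Analysis (sda), High Dimensional Discriminant Analysis (hdda), Penalized Discriminant Analysis (pda) and LogitBoost.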
set.seed(12345)
m_sda <- train(player_positions ~ .,
               data = players_train,
               method = "sda",
               trControl = control_cv10,
               preProcess = c("center", "scale"))
set.seed(12345)
m_hdda <- train(player_positions ~ .,
                data = players_train,
                method = "hdda",
                trControl = control_cv10,
                preProcess = c("center", "scale"))
set.seed(12345)
m_pda <- train(player_positions ~ .,
               data = players_train,
               method = "pda",
               trControl = control_cv10,
               preProcess = c("center", "scale"))
set.seed(12345)
m_LogitBoost <- train(player_positions ~ .,
                      data = players_train,
                      method = "LogitBoost",
                      trControl = control_cv10,
                      preProcess = c("center", "scale"))
Now I use computed models to predict positions:
players_test$predicted_sda <- predict(m_sda,
                                      newdata = players_test)
## Prediction uses 47 features.
players_test$predicted_hdda <- predict(m_hdda,
                                       newdata = players_test)
players_test$predicted_pda <- predict(m_pda,
                                      newdata = players_test)
players_test$predicted_LogitBoost <- predict(m_LogitBoost,
                                             newdata = players_test)
conf_matrix_sda <-
  confusionMatrix(players_test$predicted_sda,
                  players_test$player_positions)
conf_matrix_hdda <-
  confusionMatrix(players_test$predicted_hdda,
                  players_test$player_positions)
conf_matrix_pda <-
  confusionMatrix(players_test$predicted_pda,
                  players_test$player_positions)
conf_matrix_LogitBoost <-
  confusionMatrix(players_test$predicted_LogitBoost,
                  players_test$player_positions)
Accuracies of each model on the test set:
## Accuracy of Shrinkage Discriminant Analysis model: 0.8835378
##
## Accuracy of High Dimensional Discriminant Analysis model: 0.8359275
##
## Accuracy of Penalized Discriminant Analysis model: 0.8833547
##
## Accuracy of LogitBoost model: 0.8834586
##
##
## Accuracy of Shrinkage Discriminant Analysis model by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8818444 0.8993788
## Class: Defender 0.9369565 0.9292638
## Class: Goalkeeper 0.9983633 0.9998969
## Class: Midfielder 0.7988827 0.8810646
##
## Accuracy of High Dimensional Discriminant Analysis model by position:
## Precision Balanced Accuracy
## Class: Attacker 0.7491961 0.8839660
## Class: Defender 0.9102285 0.9091323
## Class: Goalkeeper 1.0000000 1.0000000
## Class: Midfielder 0.7631430 0.8162129
##
## Accuracy of Penalized Discriminant Analysis model by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8809981 0.8992639
## Class: Defender 0.9369565 0.9292638
## Class: Goalkeeper 0.9983633 0.9998969
## Class: Midfielder 0.7987805 0.8807879
##
## Accuracy of LogitBoost model by position:
## Precision Balanced Accuracy
## Class: Attacker 0.8280802 0.9157331
## Class: Defender 0.9151547 0.9358756
## Class: Goalkeeper 1.0000000 1.0000000
## Class: Midfielder 0.8358503 0.8634438
Let’s look at boxplots of the accuracy across the cross-validation resamples.
resample_results <- resamples(list(PDA=m_pda, SDA=m_sda, HDDA=m_hdda,
KNN=players_train_knn2b,
LogitBoost = m_LogitBoost))
bwplot(resample_results , metric = "Accuracy")
And a density plot of the accuracies:
densityplot(resample_results , metric = "Accuracy" ,auto.key = list(columns = 3))
The accuracies of the Shrinkage Discriminant Analysis, Penalized Discriminant Analysis and LogitBoost models (about 0.883 on the test set) are higher than the best multinomial logistic regression result (0.880). LogitBoost looks the best, but Shrinkage and Penalized Discriminant Analysis also look very good compared to KNN and High Dimensional Discriminant Analysis.
All of the models gave quite good results, so it was difficult to see which performs best. 10-fold cross validation made the modelling somewhat more precise, so it is often worth spending some additional computing power on cross validation.
In this case, the best model was LogitBoost, but it is worth considering a different grouping of player positions (for example defensive midfield, midfield and offensive midfield instead of only midfield).
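One way such a finer grouping could be built is a lookup from the position codes to the new groups. The sketch below is hypothetical: the `position_groups` mapping, the `group_position` helper and the rule of using the first listed position are all assumptions, not the grouping used in this analysis.

```r
# Hypothetical lookup from position codes to finer groups (an assumption,
# not the grouping used in this analysis). Codes missing here return NA.
position_groups <- c(
  GK  = "Goalkeeper",
  CB  = "Defender", LB = "Defender", RB = "Defender",
  CDM = "Defensive midfield",
  CM  = "Midfield", LM = "Midfield", RM = "Midfield",
  CAM = "Offensive midfield",
  LW  = "Attacker", RW = "Attacker", CF = "Attacker", ST = "Attacker"
)

group_position <- function(player_positions) {
  # Use the first listed position as the player's main one.
  first_code <- trimws(sub(",.*$", "", player_positions))
  unname(position_groups[first_code])
}

# Strings in the same "code, code, ..." format as the raw player_positions column.
regrouped <- group_position(c("RW, CF, ST", "LW, CAM", "GK", "CDM, CM"))
regrouped
```

The regrouped labels could then replace the four-class target before refitting the models.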
To sum up, the 3 best models in the analysis were:
LogitBoost
Shrinkage Discriminant Analysis
Penalized Discriminant Analysis