About dataset

I decided to analyse football player statistics scraped from https://sofifa.com/ and published on the https://www.kaggle.com/ platform.

Sofifa publishes data from the FIFA game and updates it several times a year. I chose the most recent statistics from FIFA 20, and the data were as clean as in earlier updates.

Let’s look at the data:

glimpse(players_full)
## Rows: 18,278
## Columns: 104
## $ sofifa_id                  <int> 158023, 20801, 190871, 200389, 183277, 1...
## $ player_url                 <chr> "https://sofifa.com/player/158023/lionel...
## $ short_name                 <chr> "L. Messi", "Cristiano Ronaldo", "Neymar...
## $ long_name                  <chr> "Lionel Andrés Messi Cuccittini", "Cris...
## $ age                        <int> 32, 34, 27, 26, 28, 28, 27, 27, 33, 27, ...
## $ dob                        <chr> "1987-06-24", "1985-02-05", "1992-02-05"...
## $ height_cm                  <int> 170, 187, 175, 188, 175, 181, 187, 193, ...
## $ weight_kg                  <int> 72, 83, 68, 87, 74, 70, 85, 92, 66, 71, ...
## $ nationality                <chr> "Argentina", "Portugal", "Brazil", "Slov...
## $ club                       <chr> "FC Barcelona", "Juventus", "Paris Saint...
## $ overall                    <int> 94, 93, 92, 91, 91, 91, 90, 90, 90, 90, ...
## $ potential                  <int> 94, 93, 92, 93, 91, 91, 93, 91, 90, 90, ...
## $ value_eur                  <int> 95500000, 58500000, 105500000, 77500000,...
## $ wage_eur                   <int> 565000, 405000, 290000, 125000, 470000, ...
## $ player_positions           <chr> "RW, CF, ST", "ST, LW", "LW, CAM", "GK",...
## $ preferred_foot             <chr> "Left", "Right", "Right", "Right", "Righ...
## $ international_reputation   <int> 5, 5, 5, 3, 4, 4, 3, 3, 4, 3, 3, 3, 3, 3...
## $ weak_foot                  <int> 4, 4, 5, 3, 4, 5, 4, 3, 4, 3, 4, 3, 4, 3...
## $ skill_moves                <int> 4, 5, 5, 1, 4, 4, 1, 2, 4, 4, 5, 2, 3, 1...
## $ work_rate                  <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ body_type                  <chr> "Messi", "C. Ronaldo", "Neymar", "Normal...
## $ real_face                  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
## $ release_clause_eur         <int> 195800000, 96500000, 195200000, 16470000...
## $ player_tags                <chr> "#Dribbler, #Distance Shooter, #Crosser,...
## $ team_position              <chr> "RW", "LW", "CAM", "GK", "LW", "RCM", "G...
## $ team_jersey_number         <int> 10, 7, 10, 13, 7, 17, 1, 4, 10, 11, 7, 2...
## $ loaned_from                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ joined                     <chr> "2004-07-01", "2018-07-10", "2017-08-03"...
## $ contract_valid_until       <int> 2021, 2022, 2022, 2023, 2024, 2023, 2022...
## $ nation_position            <chr> NA, "LS", "LW", "GK", "LF", "RCM", "SUB"...
## $ nation_jersey_number       <int> NA, 7, 10, 1, 10, 7, 22, 4, NA, 10, 10, ...
## $ pace                       <int> 87, 90, 91, NA, 91, 76, NA, 77, 74, 93, ...
## $ shooting                   <int> 92, 93, 85, NA, 83, 86, NA, 60, 76, 86, ...
## $ passing                    <int> 92, 82, 87, NA, 86, 92, NA, 70, 89, 81, ...
## $ dribbling                  <int> 96, 89, 95, NA, 94, 86, NA, 71, 89, 89, ...
## $ defending                  <int> 39, 35, 32, NA, 35, 61, NA, 90, 72, 45, ...
## $ physic                     <int> 66, 78, 58, NA, 66, 78, NA, 86, 66, 74, ...
## $ gk_diving                  <int> NA, NA, NA, 87, NA, NA, 88, NA, NA, NA, ...
## $ gk_handling                <int> NA, NA, NA, 92, NA, NA, 85, NA, NA, NA, ...
## $ gk_kicking                 <int> NA, NA, NA, 78, NA, NA, 88, NA, NA, NA, ...
## $ gk_reflexes                <int> NA, NA, NA, 89, NA, NA, 90, NA, NA, NA, ...
## $ gk_speed                   <int> NA, NA, NA, 52, NA, NA, 45, NA, NA, NA, ...
## $ gk_positioning             <int> NA, NA, NA, 90, NA, NA, 88, NA, NA, NA, ...
## $ player_traits              <chr> "Beat Offside Trap, Argues with Official...
## $ attacking_crossing         <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing        <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing    <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys          <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling            <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve                <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy          <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing         <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control         <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration      <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed      <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility           <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions         <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance           <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power           <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping              <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina              <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength             <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots           <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression       <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions    <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning      <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision           <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties        <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure        <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking          <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle  <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle   <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving         <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling       <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking        <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning    <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes       <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...
## $ ls                         <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ st                         <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ rs                         <chr> "89+2", "91+3", "84+3", NA, "83+3", "82+...
## $ lw                         <chr> "93+2", "89+3", "90+3", NA, "89+3", "87+...
## $ lf                         <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ cf                         <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ rf                         <chr> "93+2", "90+3", "89+3", NA, "88+3", "87+...
## $ rw                         <chr> "93+2", "89+3", "90+3", NA, "89+3", "87+...
## $ lam                        <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ cam                        <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ ram                        <chr> "93+2", "88+3", "90+3", NA, "89+3", "88+...
## $ lm                         <chr> "92+2", "88+3", "89+3", NA, "89+3", "88+...
## $ lcm                        <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ cm                         <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ rcm                        <chr> "87+2", "81+3", "82+3", NA, "83+3", "87+...
## $ rm                         <chr> "92+2", "88+3", "89+3", NA, "89+3", "88+...
## $ lwb                        <chr> "68+2", "65+3", "66+3", NA, "66+3", "77+...
## $ ldm                        <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ cdm                        <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ rdm                        <chr> "66+2", "61+3", "61+3", NA, "63+3", "77+...
## $ rwb                        <chr> "68+2", "65+3", "66+3", NA, "66+3", "77+...
## $ lb                         <chr> "63+2", "61+3", "61+3", NA, "61+3", "73+...
## $ lcb                        <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ cb                         <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ rcb                        <chr> "52+2", "53+3", "46+3", NA, "49+3", "66+...
## $ rb                         <chr> "63+2", "61+3", "61+3", NA, "61+3", "73+...

The data contain 18,278 players with 104 attributes. Some of them are IDs, URLs and names, but most are numeric values (0-100 or 1-5) describing specific skills.
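
A quick way to check how many columns are numeric and how many are text is to tabulate the column classes. The sketch below is only an illustration on a toy data frame (`toy` is a hypothetical stand-in built from a few values shown in the glimpse() output, not the real `players_full`):

```r
# Toy stand-in for players_full: two text columns and two integer columns
toy <- data.frame(
  short_name     = c("L. Messi", "Cristiano Ronaldo"),
  overall        = c(94L, 93L),
  weak_foot      = c(4L, 4L),
  preferred_foot = c("Left", "Right"),
  stringsAsFactors = FALSE
)

# Take the first class of every column and count how often each class occurs
class_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
print(class_counts)
```

Applied to the full data frame, the same call would give the split between `<int>` and `<chr>` columns listed by glimpse().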

EDA and data manipulation

Some attributes will not be useful for classification, so I create a players dataset containing only the useful variables. There is also a challenge: I have to choose the dependent variable for player position from 3 possible columns: team_position, player_positions and nation_position. Not every player plays for a national team, so I will choose between team_position and player_positions.

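# Keep columns by index: height/weight (7:8), player_positions and
# preferred_foot (15, 16), weak_foot to work_rate (18:20), player_tags and
# team_position (24:25), and all skill columns from pace onwards (32:78)
players <- 
  players_full %>% 
  dplyr::select(c(7:8,15,16,18:20,24:25,32:78))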

glimpse(players)
## Rows: 18,278
## Columns: 56
## $ height_cm                  <int> 170, 187, 175, 188, 175, 181, 187, 193, ...
## $ weight_kg                  <int> 72, 83, 68, 87, 74, 70, 85, 92, 66, 71, ...
## $ player_positions           <chr> "RW, CF, ST", "ST, LW", "LW, CAM", "GK",...
## $ preferred_foot             <chr> "Left", "Right", "Right", "Right", "Righ...
## $ weak_foot                  <int> 4, 4, 5, 3, 4, 5, 4, 3, 4, 3, 4, 3, 4, 3...
## $ skill_moves                <int> 4, 5, 5, 1, 4, 4, 1, 2, 4, 4, 5, 2, 3, 1...
## $ work_rate                  <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ player_tags                <chr> "#Dribbler, #Distance Shooter, #Crosser,...
## $ team_position              <chr> "RW", "LW", "CAM", "GK", "LW", "RCM", "G...
## $ pace                       <int> 87, 90, 91, NA, 91, 76, NA, 77, 74, 93, ...
## $ shooting                   <int> 92, 93, 85, NA, 83, 86, NA, 60, 76, 86, ...
## $ passing                    <int> 92, 82, 87, NA, 86, 92, NA, 70, 89, 81, ...
## $ dribbling                  <int> 96, 89, 95, NA, 94, 86, NA, 71, 89, 89, ...
## $ defending                  <int> 39, 35, 32, NA, 35, 61, NA, 90, 72, 45, ...
## $ physic                     <int> 66, 78, 58, NA, 66, 78, NA, 86, 66, 74, ...
## $ gk_diving                  <int> NA, NA, NA, 87, NA, NA, 88, NA, NA, NA, ...
## $ gk_handling                <int> NA, NA, NA, 92, NA, NA, 85, NA, NA, NA, ...
## $ gk_kicking                 <int> NA, NA, NA, 78, NA, NA, 88, NA, NA, NA, ...
## $ gk_reflexes                <int> NA, NA, NA, 89, NA, NA, 90, NA, NA, NA, ...
## $ gk_speed                   <int> NA, NA, NA, 52, NA, NA, 45, NA, NA, NA, ...
## $ gk_positioning             <int> NA, NA, NA, 90, NA, NA, 88, NA, NA, NA, ...
## $ player_traits              <chr> "Beat Offside Trap, Argues with Official...
## $ attacking_crossing         <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing        <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing    <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys          <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling            <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve                <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy          <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing         <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control         <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration      <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed      <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility           <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions         <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance           <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power           <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping              <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina              <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength             <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots           <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression       <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions    <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning      <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision           <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties        <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure        <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking          <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle  <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle   <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving         <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling       <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking        <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning    <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes       <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...

56 columns remain after dropping the columns that are not suitable as predictors.

Now let’s compare player_positions and team_position:
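
Selecting by column index works for this exact file, but it silently picks the wrong columns if the column order changes in another update. A minimal sketch of selecting by name instead, on a toy data frame (`toy_full` and the `keep` vector are hypothetical and much shorter than the real selection):

```r
# Toy stand-in for players_full; the real data frame has 104 columns
toy_full <- data.frame(
  sofifa_id        = c(158023L, 20801L),
  height_cm        = c(170L, 187L),
  weight_kg        = c(72L, 83L),
  player_positions = c("RW, CF, ST", "ST, LW"),
  stringsAsFactors = FALSE
)

# Hypothetical list of columns to keep, named explicitly
keep <- c("height_cm", "weight_kg", "player_positions")

# Fail early if a column was renamed or dropped upstream
stopifnot(all(keep %in% names(toy_full)))

toy <- toy_full[, keep, drop = FALSE]
print(names(toy))
```

With dplyr the equivalent selection would be `dplyr::select(dplyr::all_of(keep))`.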

ggplot(data = players) + geom_bar(mapping = aes(x = team_position))

ggplot(data = players) + geom_bar(mapping = aes(x = player_positions))

The team_position column looks much cleaner, because it has far fewer levels. Unfortunately, the 2 largest levels are SUB and RES, which are not positions on the field. They refer to the player's status in the team (SUB is a substitute and RES is a reserve). Deleting such a large share of the players doesn’t make sense.

The player_positions column has many levels, because it stores every position a player can play instead of only the current one. Some levels may contain the same positions in a different order, so I will sort the positions within each value alphabetically.

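# Split each value on commas, trim whitespace, sort the positions
# alphabetically and join them back into one comma-separated string
players$player_positions <- 
  unname(sapply(players$player_positions, function(z) {
    paste(sort(trimws(strsplit(z, ',')[[1]])), collapse = ',')
  }))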
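
To illustrate what this normalisation does, here is a small self-contained sketch on a toy vector (`raw` and `normalise_positions` are names made up for the example):

```r
# Toy input: the same two positions listed in different orders, plus a single position
raw <- c("ST, LW", "LW, ST", "GK")

# Split on commas, trim spaces, sort alphabetically and re-join
normalise_positions <- function(x) {
  parts <- trimws(strsplit(x, ",")[[1]])
  paste(sort(parts), collapse = ",")
}

normalised <- unname(vapply(raw, normalise_positions, character(1)))
print(normalised)
```

Both "ST, LW" and "LW, ST" end up as the same level, so they are counted together in the table.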

sort(table(players$player_positions))
## 
##     CAM,CB  CAM,CB,CM  CAM,CB,RB  CAM,CB,ST CAM,CDM,ST  CAM,CM,LB  CAM,CM,RB 
##          1          1          1          1          1          1          1 
##     CAM,LB  CAM,RB,ST      CB,CF   CB,CM,LM  CB,CM,RWB   CB,LB,RM   CB,LM,RB 
##          1          1          1          1          1          1          1 
##   CB,LM,ST CB,LWB,RWB  CDM,CF,CM CDM,CM,LWB  CDM,CM,RW  CDM,CM,ST  CDM,LB,RM 
##          1          1          1          1          1          1          1 
##  CDM,LM,LW CDM,LM,LWB  CDM,LW,ST    CDM,LWB   CF,CM,LB   CF,CM,LM   CF,CM,RB 
##          1          1          1          1          1          1          1 
##   CF,LM,RW  CF,LWB,ST   CF,RB,RM  CM,LB,LWB   CM,LB,RM  CM,LW,LWB   CM,LW,RB 
##          1          1          1          1          1          1          1 
##   CM,LW,RM CM,LWB,RWB   CM,RB,ST   CM,RW,ST  LB,LW,LWB   LB,RM,RW      LB,RW 
##          1          1          1          1          1          1          1 
##  LM,RB,RWB   LM,RB,ST  LM,RWB,ST  LW,LWB,ST      LW,RB LWB,RB,RWB     LWB,RM 
##          1          1          1          1          1          1          1 
## LWB,RM,RWB  LWB,RM,ST      RB,ST CAM,CB,CDM CAM,CDM,LM CAM,CDM,RM CAM,LM,LWB 
##          1          1          1          2          2          2          2 
##  CAM,LW,RM  CAM,RB,RM  CB,LM,LWB   CB,RB,ST  CDM,CM,LW CDM,CM,RWB CDM,LB,LWB 
##          2          2          2          2          2          2          2 
##  CDM,LM,RM   CF,CM,LW   CM,LB,LW   CM,LB,RW   CM,LW,ST  CM,RB,RWB   LB,RB,RW 
##          2          2          2          2          2          2          2 
## LM,LWB,RWB      LM,RB   LW,RB,RM  RB,RW,RWB   RB,RW,ST     RW,RWB     RWB,ST 
##          2          2          2          2          2          2          2 
## CAM,CDM,RB CAM,RM,RWB   CB,CM,LB      CB,RM     CDM,RM    CDM,RWB   CF,CM,ST 
##          3          3          3          3          3          3          3 
##  CM,LM,LWB   CM,LM,RB   CM,LM,RW  CM,RM,RWB   CM,RM,ST   LB,LW,RB   LB,LW,RW 
##          3          3          3          3          3          3          3 
##  LM,LW,LWB     LM,RWB     LW,LWB   LW,RB,RW   LW,RM,ST  CAM,LM,RW     CDM,LM 
##          3          3          3          3          3          4          4 
## CDM,RB,RWB      CF,CM   CF,CM,RM   CF,RM,ST     CM,LWB   LB,LM,ST   LM,RW,ST 
##          4          4          4          4          4          4          4 
##  RM,RW,RWB  RM,RWB,ST  CAM,LB,LM   CB,CM,RB   CF,LM,LW   CF,RM,RW   CM,LB,RB 
##          4          4          5          5          5          5          5 
##   CM,LM,ST   CM,RM,RW    LWB,RWB     CAM,RB      CB,ST      CF,RW   CM,LM,LW 
##          5          5          5          6          6          6          6 
##     CM,RWB      LB,RM  LM,LWB,RM      LM,RW      LW,RM   RB,RM,ST  CAM,CF,RW 
##          6          6          6          6          6          6          7 
##  CDM,LB,LM      CF,LM   CF,LM,ST  LB,LWB,RB   RB,RM,RW  CAM,CF,LW     CB,RWB 
##          7          7          7          7          7          8          8 
##  LM,RM,RWB  CDM,RB,RM      CF,RM  CAM,CF,LM  CAM,RW,ST      CF,LW   CF,LW,RW 
##          8          9          9         11         11         11         11 
##   CF,LW,ST      RB,RW  CAM,CF,RM  CAM,LM,LW  CDM,CM,LB      CM,LW   CM,LW,RW 
##         11         11         12         12         12         12         12 
##      CM,ST     CB,LWB  LB,RB,RWB     LM,LWB     RM,RWB  CB,CDM,LB  CB,RB,RWB 
##         12         13         13         13         13         14         14 
##         CF   LB,LM,RB   CF,RW,ST      CM,RW  CDM,LB,RB   LB,LM,LW  CAM,RM,RW 
##         14         14         15         15         16         16         17 
##   CB,LB,LM      LB,LW        RWB  CAM,CM,LW  CAM,CM,ST   CF,LM,RM   LW,RM,RW 
##         17         17         17         18         18         18         18 
##    CAM,CDM  CAM,LW,ST  CB,LB,LWB   LB,LM,RM     CDM,LB  CDM,CM,LM   CM,LB,LM 
##         20         20         20         20         21         22         23 
##   LM,RB,RM        LWB   CB,RB,RM   LM,LW,RW      CM,LB  CAM,CF,CM  CDM,CM,RM 
##         23         23         24         24         25         27         27 
##   CM,RB,RM   LM,LW,RM  CAM,CM,RW  CAM,RM,ST      CM,RB      CB,CM   LB,RB,RM 
##         28         30         31         32         32         35         36 
##  CAM,LW,RW  CAM,CF,ST   LM,RM,RW   RM,RW,ST     CAM,RW     CAM,LW  CDM,CM,RB 
##         39         40         40         40         41         43         44 
##  CAM,LM,ST     CDM,RB  CB,CDM,RB     CAM,CF   LM,LW,ST   CM,LM,RM  RB,RM,RWB 
##         46         47         48         52         53         54         56 
##  LB,LM,LWB  CAM,CM,RM      CF,ST   CB,LB,RB      CM,LM      LM,LW         LW 
##         59         71         79         81         86         88         88 
##         RW      CM,RM   LW,RW,ST  CAM,CM,LM      RM,RW     CAM,RM     RB,RWB 
##         91         95         99        102        107        114        115 
##  CB,CDM,CM      RW,ST     LB,LWB      LW,ST     CAM,LM   LM,RM,ST     CAM,ST 
##        122        123        134        135        138        151        153 
##      LM,ST      LW,RW      RB,RM  CAM,LM,RM      RM,ST      LB,RB CAM,CDM,CM 
##        153        161        162        166        184        190        208 
##         RM      LB,LM         LM        CAM     CB,CDM      CB,LB        CDM 
##        227        238        247        291        294        316        363 
##      CB,RB     CAM,CM      LM,RM         RB         LB         CM     CDM,CM 
##        374        400        430        587        669        786       1413 
##         ST         GK         CB 
##       1809       2036       2322

There are many levels, but they contain only field positions, so I choose player_positions as my dependent variable. To simplify it and make the results easier to read, I will group players into 4 main positions: Attacker, Midfielder, Defender and Goalkeeper.

players[which(
  players$player_positions=="ST"|
    players$player_positions=="LW"|
    players$player_positions=="RW"|
    players$player_positions=="CF"|
    players$player_positions=="RM,ST"|
    players$player_positions=="LW,RW"|
    players$player_positions=="LM,ST"|
    players$player_positions=="CAM,ST"|
    players$player_positions=="CF,RW"|
    players$player_positions=="LW,ST"|
    players$player_positions=="RW,ST"|
    players$player_positions=="CF,ST"|
    players$player_positions=="CF,LW"|
    players$player_positions=="CM,ST"|
    players$player_positions=="CAM,CF"|
    players$player_positions=="LW,RW,ST"|
    players$player_positions=="LM,RM,ST"|
    players$player_positions=="LM,LW,ST"|
    players$player_positions=="CAM,LM,ST"|
    players$player_positions=="RM,RW,ST"|
    players$player_positions=="CAM,CF,ST"|
    players$player_positions=="CAM,RM,ST"|
    players$player_positions=="CAM,LW,ST"|
    players$player_positions=="CAM,CM,ST"|
    players$player_positions=="CF,RW,ST"|
    players$player_positions=="CF,LW,ST"|
    players$player_positions=="CF,LW,RW"|
    players$player_positions=="CAM,RW,ST"|
    players$player_positions=="CAM,CF,CM"|
    players$player_positions=="CAM,CF,LW"|
    players$player_positions=="CF,LM,ST"|
    players$player_positions=="CAM,CF,RW"|
    players$player_positions=="CF,LM,LW"|
    players$player_positions=="CF,RM,RW"|
    players$player_positions=="LM,RW,ST"|
    players$player_positions=="CF,RM,ST"|
    players$player_positions=="LW,RM,ST"|
    players$player_positions=="CF,CM,ST"|
    players$player_positions=="CM,LW,ST"|
    players$player_positions=="CF,CM,LW"|
    players$player_positions=="LW,LWB,ST"|
    players$player_positions=="CM,RW,ST"|
    players$player_positions=="CF,LWB,ST"|
    players$player_positions=="CF,LM,RW"|
    players$player_positions=="CDM,LW,ST"),
  "player_positions"]  <- "Attacker"

players[which(
  players$player_positions=="CB"|
    players$player_positions=="LB"|
    players$player_positions=="RB"|
    players$player_positions=="RWB"|
    players$player_positions=="LWB"|
    players$player_positions=="CB,RB"|
    players$player_positions=="CB,LB"|
    players$player_positions=="CB,CDM"|
    players$player_positions=="LB,RB"|
    players$player_positions=="LB,LM"|
    players$player_positions=="RB,RM"|
    players$player_positions=="LB,LWB"|
    players$player_positions=="RB,RWB"|
    players$player_positions=="CDM,RB"|
    players$player_positions=="CB,CM"|
    players$player_positions=="CM,RB"|
    players$player_positions=="CM,LB"|
    players$player_positions=="LB,LW"|
    players$player_positions=="CB,RM"|
    players$player_positions=="CB,LWB"|
    players$player_positions=="CDM,LB"|
    players$player_positions=="RB,RW"|
    players$player_positions=="LB,RM"|
    players$player_positions=="CAM,RB"|
    players$player_positions=="CB,RWB"|
    players$player_positions=="CB,CDM,CM"|
    players$player_positions=="CB,LB,RB"|
    players$player_positions=="LB,LM,LWB"|
    players$player_positions=="RB,RM,RWB"|
    players$player_positions=="CB,CDM,RB"|
    players$player_positions=="CDM,CM,RB"|
    players$player_positions=="LB,RB,RM"|
    players$player_positions=="LM,LW,RW"|
    players$player_positions=="CB,RB,RM"|
    players$player_positions=="LM,RB,RM"|
    players$player_positions=="CM,LB,LM"|
    players$player_positions=="LB,LM,RM"|
    players$player_positions=="CB,LB,LWB"|
    players$player_positions=="CB,LB,LM"|
    players$player_positions=="LB,LM,LW"|
    players$player_positions=="CDM,LB,RB"|
    players$player_positions=="LB,LM,RB"|
    players$player_positions=="CB,RB,RWB"|
    players$player_positions=="CB,CDM,LB"|
    players$player_positions=="LB,RB,RWB"|
    players$player_positions=="CDM,CM,LB"|
    players$player_positions=="CDM,RB,RM"|
    players$player_positions=="LM,RM,RWB"|
    players$player_positions=="RB,RM,RW"|
    players$player_positions=="LB,LWB,RB"|
    players$player_positions=="CDM,LB,LM"|
    players$player_positions=="RB,RM,ST"|
    players$player_positions=="CM,LB,RB"|
    players$player_positions=="CB,CM,RB"|
    players$player_positions=="CAM,LB,LM"|
    players$player_positions=="CDM,RB,RWB"|
    players$player_positions=="LB,LW,RB"|
    players$player_positions=="CB,CM,LB"|
    players$player_positions=="LB,RB,RW"|
    players$player_positions=="CM,RB,RWB"|
    players$player_positions=="CDM,LB,LWB"|
    players$player_positions=="CB,LM,LWB"|
    players$player_positions=="LWB,RB,RWB"|
    players$player_positions=="LM,RB,RWB"|
    players$player_positions=="LB,LW,LWB"|
    players$player_positions=="CM,LB,LWB"|
    players$player_positions=="CB,LWB,RWB"|
    players$player_positions=="CB,LM,RB"|
    players$player_positions=="CB,LB,RM"|
    players$player_positions=="CB,CM,RWB"|
    players$player_positions=="CAM,CB,RB"),
  "player_positions"]  <-  "Defender"

players[which(
  players$player_positions=="CM"|
    players$player_positions=="RM"|
    players$player_positions=="LM"|
    players$player_positions=="CAM"|
    players$player_positions=="CDM"|
    players$player_positions=="CDM,CM"|
    players$player_positions=="LM,RM"|
    players$player_positions=="CAM,CM"|
    players$player_positions=="LWB,RM"|
    players$player_positions=="CAM,LM"|
    players$player_positions=="CAM,RM"|
    players$player_positions=="RM,RW"|
    players$player_positions=="CM,RM"|
    players$player_positions=="LM,LW"|
    players$player_positions=="CM,LM"|
    players$player_positions=="CDM,RWB"|
    players$player_positions=="CDM,RM"|
    players$player_positions=="CAM,LW"|
    players$player_positions=="CAM,RW"|
    players$player_positions=="CF,RM"|
    players$player_positions=="CF,LM"|
    players$player_positions=="LW,RM"|
    players$player_positions=="LM,RW"|
    players$player_positions=="CM,LWB"|
    players$player_positions=="CF,CM"|
    players$player_positions=="CDM,LM"|
    players$player_positions=="LW,LWB"|
    players$player_positions=="LM,RWB"|
    players$player_positions=="CAM,CDM"|
    players$player_positions=="LW,RM,RW"|
    players$player_positions=="CF,LM,RM"|
    players$player_positions=="CAM,CM,LW"|
    players$player_positions=="CAM,RM,RW"|
    players$player_positions=="CM,RW"|
    players$player_positions=="RM,RWB"|
    players$player_positions=="LM,LWB"|
    players$player_positions=="CM,RWB"|
    players$player_positions=="CM,LW"|
    players$player_positions=="CM,LW,RW"|
    players$player_positions=="CAM,LM,LW"|
    players$player_positions=="CAM,CF,RM"|
    players$player_positions=="CAM,CF,LM"|
    players$player_positions=="LM,LWB,RM"|
    players$player_positions=="LM,RM,RW"|
    players$player_positions=="CAM,LW,RW"|
    players$player_positions=="CAM,CM,RW"|
    players$player_positions=="LM,LW,RM"|
    players$player_positions=="CM,RB,RM"|
    players$player_positions=="CDM,CM,RM"|
    players$player_positions=="CAM,CM,RM"|
    players$player_positions=="CDM,CM,LM"|
    players$player_positions=="CM,LM,LW"|
    players$player_positions=="LWB,RWB"|
    players$player_positions=="CM,RM,RW"|
    players$player_positions=="CM,LM,ST"|
    players$player_positions=="RM,RWB,ST"|
    players$player_positions=="RM,RW,RWB"|
    players$player_positions=="CF,CM,RM"|
    players$player_positions=="CAM,LM,RW"|
    players$player_positions=="LM,LW,LWB"|
    players$player_positions=="CM,RM,ST"|
    players$player_positions=="CM,RM,RWB"|
    players$player_positions=="CM,LM,RW"|
    players$player_positions=="CM,LM,LWB"|
    players$player_positions=="CAM,RM,RWB"|
    players$player_positions=="LM,LWB,RWB"|
    players$player_positions=="CM,LM,RM"|
    players$player_positions=="CDM,LM,RM"|
    players$player_positions=="CDM,CM,RWB"|
    players$player_positions=="CDM,CM,LW"|
    players$player_positions=="CAM,LW,RM"|
    players$player_positions=="CAM,LM,LWB"|
    players$player_positions=="CAM,CDM,RM"|
    players$player_positions=="CAM,CDM,LM"|
    players$player_positions=="LWB,RM,RWB"|
    players$player_positions=="CAM,CDM,CM"|
    players$player_positions=="CAM,LM,RM"|
    players$player_positions=="LM,RWB,ST"|
    players$player_positions=="CM,LWB,RWB"|
    players$player_positions=="CM,LW,RM"|
    players$player_positions=="CM,LW,LWB"|
    players$player_positions=="CF,CM,LM"|
    players$player_positions=="CAM,CM,LM"|
    players$player_positions=="CDM,LM,LWB"|
    players$player_positions=="CDM,LM,LW"|
    players$player_positions=="CDM,CM,ST"|
    players$player_positions=="CDM,CM,RW"|
    players$player_positions=="CDM,CM,LWB"|
    players$player_positions=="CDM,CF,CM"),
  "player_positions"]  <- "Midfielder"

players[which(
  players$player_positions=="GK"),
  "player_positions"]  <- "Goalkeeper"

players[which(
  players$player_positions!="Goalkeeper" &
    players$player_positions!="Midfielder" &
    players$player_positions!="Defender" &
    players$player_positions!="Attacker"),
  "player_positions"]  <- "Others"
sort(table(players$player_positions))
## 
##     Others Goalkeeper   Attacker Midfielder   Defender 
##         69       2036       3700       6026       6447
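
The same grouping can be written more compactly with lookup vectors and `%in%`. The sketch below is only an illustration: the lookup vectors are deliberately shortened and `group_position` is a hypothetical helper, so it does not reproduce the counts above.

```r
# Hypothetical, shortened lookup vectors; the full lists are in the code above
attackers   <- c("ST", "LW", "RW", "CF", "CF,ST")
defenders   <- c("CB", "LB", "RB", "CB,RB")
midfielders <- c("CM", "CDM", "CAM", "CDM,CM")

group_position <- function(pos) {
  # Everything that matches no list falls into "Others"
  out <- rep("Others", length(pos))
  out[pos %in% attackers]   <- "Attacker"
  out[pos %in% defenders]   <- "Defender"
  out[pos %in% midfielders] <- "Midfielder"
  out[pos %in% "GK"]        <- "Goalkeeper"
  out
}

toy <- c("ST", "CB,RB", "CDM,CM", "GK", "CAM,CB")
grouped <- group_position(toy)
print(table(grouped))
```

Keeping each group in one vector makes it easier to spot duplicated or misplaced combinations than in a long chain of `==` comparisons.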

Now only 69 players remain in the Others category; their position combinations were too varied to assign to one of the four groups.

I delete Others, because there is no such position on the field and it will not be helpful in the analysis.

# A logical condition keeps all rows even if no "Others" rows exist
players <- players[players$player_positions != "Others", ]
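
Removing rows with a logical condition and with a negated which() index gives the same result here, but the two behave differently when no row matches. A minimal sketch on toy data (`toy` is a made-up data frame without any "Others" rows):

```r
toy <- data.frame(pos = c("Attacker", "Defender"), stringsAsFactors = FALSE)

# No row matches "Others", so which() returns integer(0)
idx <- which(toy$pos == "Others")

# A negated empty index is still empty, so this selects zero rows
dropped_all <- toy[-idx, , drop = FALSE]

# The logical condition keeps every row, as intended
kept_all <- toy[toy$pos != "Others", , drop = FALSE]

print(nrow(dropped_all))
print(nrow(kept_all))
```

In this dataset 69 rows match, so both forms remove exactly those rows.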

I delete gk_kicking, gk_positioning, gk_diving, gk_handling, gk_reflexes and gk_speed, because they are already represented by other columns: goalkeeping_kicking, goalkeeping_positioning, goalkeeping_reflexes, goalkeeping_diving, goalkeeping_handling and movement_acceleration / movement_sprint_speed. The player_tags and player_traits columns are filled in only for some players; other players have no tags or traits, so I will delete them as well. I also delete team_position, because it does not provide any additional information about position.

players <- 
  players %>% 
  dplyr::select(-c("team_position","gk_kicking","gk_positioning","gk_diving",
                   "gk_handling","gk_reflexes","gk_speed","player_tags",
                   "player_traits"))

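The claim that the gk_* columns repeat the goalkeeping_* columns can be checked directly. The sketch below does this on three toy rows copied from the glimpse() output above (`toy` and `pairs` are names made up for the example, and only two of the column pairs are shown):

```r
# One outfield player (gk_* is NA) followed by two goalkeepers
toy <- data.frame(
  gk_diving            = c(NA, 87, 88),
  goalkeeping_diving   = c(6, 87, 88),
  gk_handling          = c(NA, 92, 85),
  goalkeeping_handling = c(11, 92, 85)
)

# Map each short column to the long column it is assumed to duplicate
pairs <- c(gk_diving = "goalkeeping_diving", gk_handling = "goalkeeping_handling")

# Compare the two columns only where the short one is not missing
same_where_present <- vapply(names(pairs), function(short) {
  long    <- pairs[[short]]
  present <- !is.na(toy[[short]])
  all(toy[[short]][present] == toy[[long]][present])
}, logical(1))
print(same_where_present)
```

Running the same comparison on the full data before dropping the columns would confirm that no information is lost.
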
Now let’s look at the missing values:

# md.pattern() comes from the mice package
players %>% 
  md.pattern(rotate.names = TRUE)

##       height_cm weight_kg player_positions preferred_foot weak_foot skill_moves
## 16173         1         1                1              1         1           1
## 2036          1         1                1              1         1           1
##               0         0                0              0         0           0
##       work_rate attacking_crossing attacking_finishing
## 16173         1                  1                   1
## 2036          1                  1                   1
##               0                  0                   0
##       attacking_heading_accuracy attacking_short_passing attacking_volleys
## 16173                          1                       1                 1
## 2036                           1                       1                 1
##                                0                       0                 0
##       skill_dribbling skill_curve skill_fk_accuracy skill_long_passing
## 16173               1           1                 1                  1
## 2036                1           1                 1                  1
##                     0           0                 0                  0
##       skill_ball_control movement_acceleration movement_sprint_speed
## 16173                  1                     1                     1
## 2036                   1                     1                     1
##                        0                     0                     0
##       movement_agility movement_reactions movement_balance power_shot_power
## 16173                1                  1                1                1
## 2036                 1                  1                1                1
##                      0                  0                0                0
##       power_jumping power_stamina power_strength power_long_shots
## 16173             1             1              1                1
## 2036              1             1              1                1
##                   0             0              0                0
##       mentality_aggression mentality_interceptions mentality_positioning
## 16173                    1                       1                     1
## 2036                     1                       1                     1
##                          0                       0                     0
##       mentality_vision mentality_penalties mentality_composure
## 16173                1                   1                   1
## 2036                 1                   1                   1
##                      0                   0                   0
##       defending_marking defending_standing_tackle defending_sliding_tackle
## 16173                 1                         1                        1
## 2036                  1                         1                        1
##                       0                         0                        0
##       goalkeeping_diving goalkeeping_handling goalkeeping_kicking
## 16173                  1                    1                   1
## 2036                   1                    1                   1
##                        0                    0                   0
##       goalkeeping_positioning goalkeeping_reflexes pace shooting passing
## 16173                       1                    1    1        1       1
## 2036                        1                    1    0        0       0
##                             0                    0 2036     2036    2036
##       dribbling defending physic      
## 16173         1         1      1     0
## 2036          0         0      0     6
##            2036      2036   2036 12216

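There are 2036 rows with missing values in 6 columns: pace, shooting, passing, dribbling, defending and physic. The count equals the number of goalkeepers in the table above, which suggests that these attributes are simply not rated for goalkeepers. That is more than 10% of the observations, so I will drop these 6 columns from the further analysis. Additionally, these values are already represented by other variables.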
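
The share of missing values per column can also be checked with base R. The sketch below is only an illustration on a small made-up data frame (`toy` and its values are hypothetical, and the 10% threshold is an assumption taken from the reasoning above); it is not the code used in this analysis.

```r
# Toy data frame (hypothetical values) with one incomplete column
toy <- data.frame(
  height_cm = c(170, 187, 175, 188, 175),
  pace      = c(87, 90, 91, NA, 91)
)

# Share of missing values in every column
na_share <- colMeans(is.na(toy))
print(na_share)

# Keep only the columns whose NA share does not exceed the threshold
threshold <- 0.10
toy_clean <- toy[, na_share <= threshold, drop = FALSE]
print(names(toy_clean))
```

Dropping by a threshold like this removes whole columns, not rows, which is also what the `dplyr::select()` call does for the six incomplete variables.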

players <- 
  players %>% 
  dplyr::select(-c("pace","shooting","passing","dribbling","defending",
                   "physic"))

summary(players)
##    height_cm       weight_kg      player_positions   preferred_foot    
##  Min.   :156.0   Min.   : 50.00   Length:18209       Length:18209      
##  1st Qu.:177.0   1st Qu.: 70.00   Class :character   Class :character  
##  Median :181.0   Median : 75.00   Mode  :character   Mode  :character  
##  Mean   :181.4   Mean   : 75.28                                        
##  3rd Qu.:186.0   3rd Qu.: 80.00                                        
##  Max.   :205.0   Max.   :110.00                                        
##    weak_foot      skill_moves     work_rate         attacking_crossing
##  Min.   :1.000   Min.   :1.000   Length:18209       Min.   : 5.00     
##  1st Qu.:3.000   1st Qu.:2.000   Class :character   1st Qu.:38.00     
##  Median :3.000   Median :2.000   Mode  :character   Median :54.00     
##  Mean   :2.944   Mean   :2.367                      Mean   :49.68     
##  3rd Qu.:3.000   3rd Qu.:3.000                      3rd Qu.:64.00     
##  Max.   :5.000   Max.   :5.000                      Max.   :93.00     
##  attacking_finishing attacking_heading_accuracy attacking_short_passing
##  Min.   : 2.00       Min.   : 5.0               Min.   : 7.00          
##  1st Qu.:30.00       1st Qu.:44.0               1st Qu.:54.00          
##  Median :49.00       Median :56.0               Median :62.00          
##  Mean   :45.55       Mean   :52.2               Mean   :58.73          
##  3rd Qu.:62.00       3rd Qu.:64.0               3rd Qu.:68.00          
##  Max.   :95.00       Max.   :93.0               Max.   :92.00          
##  attacking_volleys skill_dribbling  skill_curve   skill_fk_accuracy
##  Min.   : 3.00     Min.   : 4.00   Min.   : 6.0   Min.   : 4.0     
##  1st Qu.:30.00     1st Qu.:50.00   1st Qu.:34.0   1st Qu.:31.0     
##  Median :44.00     Median :61.00   Median :49.0   Median :41.0     
##  Mean   :42.78     Mean   :55.57   Mean   :47.3   Mean   :42.7     
##  3rd Qu.:56.00     3rd Qu.:68.00   3rd Qu.:62.0   3rd Qu.:56.0     
##  Max.   :90.00     Max.   :97.00   Max.   :94.0   Max.   :94.0     
##  skill_long_passing skill_ball_control movement_acceleration
##  Min.   : 8.00      Min.   : 5.00      Min.   :12.00        
##  1st Qu.:43.00      1st Qu.:54.00      1st Qu.:56.00        
##  Median :56.00      Median :63.00      Median :67.00        
##  Mean   :52.76      Mean   :58.44      Mean   :64.28        
##  3rd Qu.:64.00      3rd Qu.:69.00      3rd Qu.:75.00        
##  Max.   :92.00      Max.   :96.00      Max.   :96.00        
##  movement_sprint_speed movement_agility movement_reactions movement_balance
##  Min.   :11.00         Min.   :11.00    Min.   :21.00      Min.   :12.00   
##  1st Qu.:57.00         1st Qu.:55.00    1st Qu.:56.00      1st Qu.:56.00   
##  Median :67.00         Median :66.00    Median :62.00      Median :66.00   
##  Mean   :64.39         Mean   :63.49    Mean   :61.75      Mean   :63.84   
##  3rd Qu.:75.00         3rd Qu.:74.00    3rd Qu.:68.00      3rd Qu.:74.00   
##  Max.   :96.00         Max.   :96.00    Max.   :96.00      Max.   :97.00   
##  power_shot_power power_jumping   power_stamina   power_strength 
##  Min.   :14.00    Min.   :19.00   Min.   :12.00   Min.   :20.00  
##  1st Qu.:48.00    1st Qu.:58.00   1st Qu.:56.00   1st Qu.:58.00  
##  Median :59.00    Median :66.00   Median :66.00   Median :66.00  
##  Mean   :58.16    Mean   :64.92   Mean   :62.87   Mean   :65.23  
##  3rd Qu.:68.00    3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:74.00  
##  Max.   :95.00    Max.   :95.00   Max.   :97.00   Max.   :97.00  
##  power_long_shots mentality_aggression mentality_interceptions
##  Min.   : 4.00    Min.   : 9.00        Min.   : 3.00          
##  1st Qu.:32.00    1st Qu.:44.00        1st Qu.:25.00          
##  Median :51.00    Median :58.00        Median :52.00          
##  Mean   :46.78    Mean   :55.72        Mean   :46.35          
##  3rd Qu.:62.00    3rd Qu.:69.00        3rd Qu.:64.00          
##  Max.   :94.00    Max.   :95.00        Max.   :92.00          
##  mentality_positioning mentality_vision mentality_penalties mentality_composure
##  Min.   : 2.00         Min.   : 9.00    Min.   : 7.00       Min.   :12.00      
##  1st Qu.:39.00         1st Qu.:44.00    1st Qu.:39.00       1st Qu.:51.00      
##  Median :55.00         Median :55.00    Median :49.00       Median :60.00      
##  Mean   :50.03         Mean   :53.59    Mean   :48.36       Mean   :58.52      
##  3rd Qu.:64.00         3rd Qu.:64.00    3rd Qu.:60.00       3rd Qu.:67.00      
##  Max.   :95.00         Max.   :94.00    Max.   :92.00       Max.   :96.00      
##  defending_marking defending_standing_tackle defending_sliding_tackle
##  Min.   : 1.00     Min.   : 5.0              Min.   : 3.00           
##  1st Qu.:29.00     1st Qu.:27.0              1st Qu.:24.00           
##  Median :52.00     Median :55.0              Median :52.00           
##  Mean   :46.82     Mean   :47.6              Mean   :45.57           
##  3rd Qu.:64.00     3rd Qu.:66.0              3rd Qu.:64.00           
##  Max.   :94.00     Max.   :92.0              Max.   :90.00           
##  goalkeeping_diving goalkeeping_handling goalkeeping_kicking
##  Min.   : 1.0       Min.   : 1.00        Min.   : 1.00      
##  1st Qu.: 8.0       1st Qu.: 8.00        1st Qu.: 8.00      
##  Median :11.0       Median :11.00        Median :11.00      
##  Mean   :16.6       Mean   :16.38        Mean   :16.23      
##  3rd Qu.:14.0       3rd Qu.:14.00        3rd Qu.:14.00      
##  Max.   :90.0       Max.   :92.00        Max.   :93.00      
##  goalkeeping_positioning goalkeeping_reflexes
##  Min.   : 1.00           Min.   : 1.00       
##  1st Qu.: 8.00           1st Qu.: 8.00       
##  Median :11.00           Median :11.00       
##  Mean   :16.39           Mean   :16.73       
##  3rd Qu.:14.00           3rd Qu.:14.00       
##  Max.   :91.00           Max.   :92.00

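Most of the numerical variables are integers in the 0-100 range. I will rescale skill_moves and weak_foot from the 1-5 scale, and height_cm and weight_kg from their original units, so that their maximum is 100. Because the code below only divides by the maximum, the minimum stays above 0 (for example, 20 for the two 1-5 variables). Having all variables on a comparable scale can be helpful in the future analysis.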

players$weak_foot   <- players$weak_foot*100/5
players$skill_moves <- players$skill_moves*100/5
players$height_cm   <- players$height_cm/(max(players$height_cm))*100
players$weight_kg   <- players$weight_kg/(max(players$weight_kg))*100
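
To illustrate what this rescaling does to the range, here is a minimal sketch that contrasts division by the maximum with min-max scaling. Both helper functions and the sample heights are hypothetical and are not part of the analysis.

```r
# Hypothetical helpers, not used elsewhere in the analysis
scale_by_max  <- function(x) x / max(x) * 100
scale_min_max <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

heights <- c(156, 170, 181, 205)  # made-up sample of heights in cm

range(scale_by_max(heights))   # upper bound is 100, lower bound stays well above 0
range(scale_min_max(heights))  # spans the full 0-100 interval
```

Division by the maximum keeps the ratios between values, while min-max scaling stretches every variable over the whole interval. Which one is preferable depends on the model that follows.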

glimpse(players)
## Rows: 18,209
## Columns: 41
## $ height_cm                  <dbl> 82.92683, 91.21951, 85.36585, 91.70732, ...
## $ weight_kg                  <dbl> 65.45455, 75.45455, 61.81818, 79.09091, ...
## $ player_positions           <chr> "Attacker", "Attacker", "Midfielder", "G...
## $ preferred_foot             <chr> "Left", "Right", "Right", "Right", "Righ...
## $ weak_foot                  <dbl> 80, 80, 100, 60, 80, 100, 80, 60, 80, 60...
## $ skill_moves                <dbl> 80, 100, 100, 20, 80, 80, 20, 40, 80, 80...
## $ work_rate                  <chr> "Medium/Low", "High/Low", "High/Medium",...
## $ attacking_crossing         <int> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, ...
## $ attacking_finishing        <int> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, ...
## $ attacking_heading_accuracy <int> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, ...
## $ attacking_short_passing    <int> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, ...
## $ attacking_volleys          <int> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, ...
## $ skill_dribbling            <int> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, ...
## $ skill_curve                <int> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, ...
## $ skill_fk_accuracy          <int> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, ...
## $ skill_long_passing         <int> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, ...
## $ skill_ball_control         <int> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, ...
## $ movement_acceleration      <int> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, ...
## $ movement_sprint_speed      <int> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, ...
## $ movement_agility           <int> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, ...
## $ movement_reactions         <int> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, ...
## $ movement_balance           <int> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, ...
## $ power_shot_power           <int> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, ...
## $ power_jumping              <int> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, ...
## $ power_stamina              <int> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, ...
## $ power_strength             <int> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, ...
## $ power_long_shots           <int> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, ...
## $ mentality_aggression       <int> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, ...
## $ mentality_interceptions    <int> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, ...
## $ mentality_positioning      <int> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, ...
## $ mentality_vision           <int> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, ...
## $ mentality_penalties        <int> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, ...
## $ mentality_composure        <int> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, ...
## $ defending_marking          <int> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, ...
## $ defending_standing_tackle  <int> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, ...
## $ defending_sliding_tackle   <int> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, ...
## $ goalkeeping_diving         <int> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13,...
## $ goalkeeping_handling       <int> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5,...
## $ goalkeeping_kicking        <int> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7...
## $ goalkeeping_positioning    <int> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 1...
## $ goalkeeping_reflexes       <int> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, ...

There are 3 character variables in the dataset: the dependent variable player_positions, preferred_foot and work_rate. I convert them into factors and create lists of the numeric and factor variable names.

players$player_positions <- as.factor(players$player_positions)
players$preferred_foot   <- as.factor(players$preferred_foot)
players$work_rate        <- as.factor(players$work_rate)

players_numeric_vars <- 
  sapply(players, is.numeric) %>% 
  which() %>% 
  names()

players_factor_vars <- 
  sapply(players, is.factor) %>% 
  which() %>% 
  names()

Data division

Now it’s time to divide the data into training and test sets.

set.seed(987654321)

players_which_train <- createDataPartition(players$player_positions,
                                           p = 0.7, 
                                           list = FALSE) 

players_train <- players[players_which_train,]
players_test <- players[-players_which_train,]

The distribution of the target variable is very similar in both samples:

## Train dataset distribution:
## .
##   Attacker   Defender Goalkeeper Midfielder 
##  0.2031691  0.3540163  0.1118607  0.3309539
## Test dataset distribution:
## .
##   Attacker   Defender Goalkeeper Midfielder 
##  0.2032595  0.3541476  0.1117012  0.3308918
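
The code that printed these proportions is not shown. One possible way to produce such a comparison is `table()` combined with `prop.table()`. A minimal sketch on made-up labels follows (`train_labels` and `test_labels` are hypothetical stand-ins for `players_train$player_positions` and `players_test$player_positions`):

```r
# Made-up class labels standing in for the real training and test targets
train_labels <- factor(c("Attacker", "Defender", "Defender", "Goalkeeper",
                         "Midfielder", "Midfielder", "Attacker", "Defender"))
test_labels  <- factor(c("Attacker", "Defender", "Midfielder", "Goalkeeper"))

# Class shares in each sample
train_dist <- prop.table(table(train_labels))
test_dist  <- prop.table(table(test_labels))

# Side-by-side comparison; a stratified split keeps the rows close to each other
round(rbind(train = train_dist, test = test_dist), 3)
```

With a stratified split such as the one made by `createDataPartition()`, the two rows should be almost identical, as they are in the output above.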

Now let’s look at the correlations between variables:

players_correlations <- 
  cor(players_train[,players_numeric_vars],
      use = "pairwise.complete.obs")

corrplot(players_correlations, 
         method = "color",tl.cex = 0.5)

The goalkeeping variables are very highly correlated with one another, and so are the defending variables. In general, goalkeeping skills seem to be negatively correlated with most other attributes, so we can expect goalkeepers to be predicted very accurately by every model.

I save the most highly correlated variables as candidates to be excluded from the analysis. They add very little information about position and increase the time needed to compute the models.

correlated_variables_90 <- findCorrelation(players_correlations,
                cutoff = 0.90,
                names = TRUE)

correlated_variables_80 <- findCorrelation(players_correlations,
                cutoff = 0.80,
                names = TRUE)
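
To make the cutoff more concrete, the base R sketch below lists the variable pairs whose absolute correlation exceeds a threshold. It runs on the built-in `mtcars` data instead of the players data, and it does not reproduce the selection logic of `findCorrelation()`, which additionally decides which variable of each pair to drop. It only shows which pairs a given cutoff flags.

```r
# Correlation matrix of a few numeric columns from a built-in dataset
corr   <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
cutoff <- 0.80

# Positions in the upper triangle (each pair once) above the cutoff
high <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)

# Readable list of the flagged pairs
flagged_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = round(corr[high], 2)
)
print(flagged_pairs)
```

Lowering the cutoff from 0.90 to 0.80 can only flag more pairs, which is why the 0.80 list is the more aggressive candidate set for exclusion.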

Before we start modelling, let’s look at the factor variables:

We can see that many more players prefer their right foot. Only 30.92% of players prefer their left foot.

Work rate describes how hard a player works in attack and in defense. For example, High/Low means that the player works hard in attack and does not work hard in defense. It reflects mentality more than the real position on the field, so there are defenders with Low/Low, etc.

We can see that most players have a Medium/Medium work rate. The other groups are smaller, but only Low/Low seems really small and may not add much value to the model. However, we cannot merge this group into another one, so I will keep it.

Data modelling

Now we are ready to run some models and predict players’ positions on the field.

Logit model

I run a multinomial logit model without the variables flagged at the 0.8 correlation cutoff and without the preferred_foot variable.

players_mlogit1a <- multinom(player_positions ~ .,
                    data = players_train %>% 
                      dplyr::select(-c(all_of(correlated_variables_80),"preferred_foot")))

players_mlogit1a_fitted <- predict(players_mlogit1a) 
table(players_mlogit1a_fitted,
      players_train$player_positions)
##                        
## players_mlogit1a_fitted Attacker Defender Goalkeeper Midfielder
##              Attacker       2152       10          0        325
##              Defender         14     4036          0        436
##              Goalkeeper        0        0       1426          0
##              Midfielder      424      467          0       3458

Now I run a multinomial logit model without the variables flagged at the 0.9 correlation cutoff and without the preferred_foot variable.

players_mlogit1b <- multinom(player_positions ~ .,
                    data = players_train %>% 
                      dplyr::select(-c(all_of(correlated_variables_90),"preferred_foot")))

players_mlogit1b_fitted <- predict(players_mlogit1b) 
table(players_mlogit1b_fitted,
      players_train$player_positions)
##                        
## players_mlogit1b_fitted Attacker Defender Goalkeeper Midfielder
##              Attacker       2178       15          0        366
##              Defender         12     4065          0        322
##              Goalkeeper        0        0       1426          1
##              Midfielder      400      433          0       3530

And now I run a multinomial logit model with every variable.

players_mlogit2 <- multinom(player_positions ~ .,
                    data = players_train)

players_mlogit2_fitted <- predict(players_mlogit2) 
table(players_mlogit2_fitted,
      players_train$player_positions)
##                       
## players_mlogit2_fitted Attacker Defender Goalkeeper Midfielder
##             Attacker       2191       19          0        390
##             Defender         15     4116          0        315
##             Goalkeeper        2        8       1426          3
##             Midfielder      382      370          0       3511

Likelihood ratio test:

lrtest(players_mlogit1a)[5]
## # weights:  8 (3 variable)
## initial  value 17672.480516 
## final  value 16603.004977 
## converged
##   Pr(>Chisq)    
## 1               
## 2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(players_mlogit1b)[5]
## # weights:  8 (3 variable)
## initial  value 17672.480516 
## final  value 16603.004977 
## converged
##   Pr(>Chisq)    
## 1               
## 2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(players_mlogit2)[5]
## # weights:  8 (3 variable)
## initial  value 17672.480516 
## final  value 16603.004977 
## converged
##   Pr(>Chisq)    
## 1               
## 2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis can be rejected at the 0.001 level for all models.
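
Called with a single model, `lrtest()` appears to compare it with its intercept-only version (the "# weights: 8 (3 variable)" lines look like the fit of that null model). The idea behind the test can be sketched by hand. The example below is an illustration only: it uses a binary logit on the built-in `mtcars` data, not the multinomial models from this analysis.

```r
# Full model and intercept-only (null) model on built-in data
full_model <- glm(am ~ wt, data = mtcars, family = binomial)
null_model <- glm(am ~ 1,  data = mtcars, family = binomial)

# Likelihood ratio statistic: twice the gain in log-likelihood
lr_stat <- as.numeric(2 * (logLik(full_model) - logLik(null_model)))

# Degrees of freedom: difference in the number of estimated parameters
df_diff <- attr(logLik(full_model), "df") - attr(logLik(null_model), "df")

# p-value from the chi-squared distribution
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(statistic = lr_stat, df = df_diff, p_value = p_value)
```

A small p-value means that the predictors jointly improve the fit over the null model. It says nothing about which of the compared specifications predicts positions better.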

Now I compare real and predicted values on the training set:

accuracy_multinom(predicted = players_mlogit1a_fitted, 
                  real = players_train$player_positions)
##                     accuracy            balanced_accuracy 
##                     86.85284                     88.62047 
## balanced_correctly_predicted 
##                     89.00282
accuracy_multinom(predicted = players_mlogit1b_fitted, 
                  real = players_train$player_positions)
##                     accuracy            balanced_accuracy 
##                     87.84907                     89.45873 
## balanced_correctly_predicted 
##                     89.58907
accuracy_multinom(predicted = players_mlogit2_fitted, 
                  real = players_train$player_positions)
##                     accuracy            balanced_accuracy 
##                     88.20207                     89.75414 
## balanced_correctly_predicted 
##                     89.57582
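
The definition of `accuracy_multinom()` is not shown in this part of the text. The sketch below assumes that its three outputs are the overall accuracy, the mean per-class recall and the mean per-class precision, and recomputes them from the confusion table of model 1a printed earlier. The names `conf`, `mean_recall` and `mean_precision` are my own.

```r
# Confusion table of model 1a on the training set, copied from the output above
# (rows: predicted class, columns: real class)
classes <- c("Attacker", "Defender", "Goalkeeper", "Midfielder")
conf <- matrix(c(2152,   10,    0,  325,
                   14, 4036,    0,  436,
                    0,    0, 1426,    0,
                  424,  467,    0, 3458),
               nrow = 4, byrow = TRUE,
               dimnames = list(predicted = classes, real = classes))

accuracy       <- 100 * sum(diag(conf)) / sum(conf)
mean_recall    <- 100 * mean(diag(conf) / colSums(conf))  # averaged over real classes
mean_precision <- 100 * mean(diag(conf) / rowSums(conf))  # averaged over predicted classes

round(c(accuracy = accuracy,
        mean_recall = mean_recall,
        mean_precision = mean_precision), 5)
```

Under these assumptions the three values agree with the accuracy, balanced_accuracy and balanced_correctly_predicted reported for model 1a.
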
players_test$multinom1a <- predict(players_mlogit1a, 
                                    newdata = players_test)

conf_matrix_multinom1a  <- 
  confusionMatrix(players_test$multinom1a,
                players_test$player_positions)

players_test$multinom1b <- predict(players_mlogit1b, 
                                    newdata = players_test)

conf_matrix_multinom1b  <-
  confusionMatrix(players_test$multinom1b,
                players_test$player_positions)

players_test$multinom2  <- predict(players_mlogit2, 
                                    newdata = players_test)

conf_matrix_multinom2  <- 
  confusionMatrix(players_test$multinom2,
                players_test$player_positions)

And now I check accuracy on the test dataset:

## Accuracy of multinomial model 1a: 0.8731002
## 
## Accuracy of multinomial model 1b: 0.8741989
## 
## Accuracy of multinomial model 2: 0.8798755
## 
## Accuracy of multinomial model 1a by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8732394         0.9034052
## Class: Defender   0.9086564         0.9179641
## Class: Goalkeeper 1.0000000         1.0000000
## Class: Midfielder 0.7971624         0.8669377
## 
## Accuracy of multinomial model 1b by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8490393         0.8990569
## Class: Defender   0.9257453         0.9221503
## Class: Goalkeeper 0.9983633         0.9998969
## Class: Midfielder 0.7991632         0.8702551
## 
## Accuracy of multinomial model 2 by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8479053         0.9001784
## Class: Defender   0.9318670         0.9310653
## Class: Goalkeeper 0.9854604         0.9990724
## Class: Midfielder 0.8122340         0.8742203

The multinomial logit model with all variables has the highest overall accuracy, but looking at the accuracy by position, it is really hard to choose the best model. Importantly, model 1a uses substantially fewer predictors than model 2 (it excludes every variable flagged at the 0.8 cutoff plus preferred_foot) and gives very similar results. Overall, accuracy is pretty high, but can it be higher? Let’s find out.

KNN model

I will start by defining the training controls: 2-fold cross-validation and 10-fold cross-validation. I will compare the models trained with both controls.

control_cv2 <- trainControl(method = "cv",
                          number = 2,
                          classProbs = TRUE)

control_cv10 <- trainControl(method = "cv",
                          number = 10,
                          classProbs = TRUE)
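
To show what k-fold cross-validation does under the hood, here is a hand-written sketch on the built-in `iris` data. It is an illustration only: it uses `knn()` from the `class` package with a fixed `k = 5` (an arbitrary choice) instead of caret, and the min-max scaling inside each fold mimics the `"range"` preprocessing used later.

```r
library(class)  # recommended package shipped with standard R installations

set.seed(1)
n_folds  <- 10
features <- as.matrix(iris[, 1:4])
labels   <- iris$Species

# Randomly assign every row to one of the folds
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(features)))

fold_accuracy <- sapply(seq_len(n_folds), function(f) {
  in_test <- fold_id == f
  # Min-max scaling parameters are estimated on the training folds only
  mins   <- apply(features[!in_test, ], 2, min)
  ranges <- apply(features[!in_test, ], 2, max) - mins
  train_x <- scale(features[!in_test, ], center = mins, scale = ranges)
  test_x  <- scale(features[in_test, ],  center = mins, scale = ranges)
  # Classify the held-out fold and record the share of correct predictions
  predicted <- knn(train_x, test_x, cl = labels[!in_test], k = 5)
  mean(predicted == labels[in_test])
})

mean(fold_accuracy)  # cross-validated accuracy estimate
```

With 2 folds each model is trained on half of the data, with 10 folds on 90% of it, so the 10-fold estimate is usually closer to the accuracy of a model trained on the full training set, at the cost of five times as many fits.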

Now I compute 4 models - two with 2-fold cross-validation and two with 10-fold cross-validation. For each control, one model uses the full data and the other uses the data with the highly correlated variables (0.9 cutoff) and the preferred_foot variable excluded.

I try k values from 1 to 97 in steps of 4 to obtain the highest possible accuracy, and I scale all variables to the range [0, 1].

set.seed(987654321)

test_k <- data.frame(k = seq(1, 99, 4))

players_train_knn1a <- 
  train(player_positions ~ .,
        data = players_train %>% 
          dplyr::select(-c(all_of(correlated_variables_90),"preferred_foot")),
        method = "knn",
        trControl = control_cv2,
        tuneGrid = test_k,
        preProcess = c("range"))

players_train_knn1b <- 
  train(player_positions ~ .,
        data = players_train,
        method = "knn",
        trControl = control_cv2,
        tuneGrid = test_k,
        preProcess = c("range"))

players_train_knn2a <- 
  train(player_positions ~ .,
        data = players_train %>% 
          dplyr::select(-c(all_of(correlated_variables_90),"preferred_foot")),
        method = "knn",
        trControl = control_cv10,
        tuneGrid = test_k,
        preProcess = c("range"))

players_train_knn2b <- 
  train(player_positions ~ .,
        data = players_train,
        method = "knn",
        trControl = control_cv10,
        tuneGrid = test_k,
        preProcess = c("range"))

par(mfrow=c(2,2))
plot(players_train_knn1a)

plot(players_train_knn1b)

plot(players_train_knn2a)

plot(players_train_knn2b)

Let’s look at the k values selected in modelling:

## players_train_knn1a k value selected: 13
## 
## players_train_knn1b k value selected: 9
## 
## players_train_knn2a k value selected: 21
## 
## players_train_knn2b k value selected: 17

The models selected k values of 13, 9, 21 and 17, but all of them reach quite similar accuracy.

Let’s look at the accuracy of each model on the test set:

players_test_forecasts <- 
  data.frame(players_train_knn1a = predict(players_train_knn1a,
                                         players_test),
             players_train_knn1b = predict(players_train_knn1b,
                                        players_test),
             players_train_knn2a = predict(players_train_knn2a,
                                          players_test),
             players_train_knn2b = predict(players_train_knn2b,
                                           players_test))

sapply(players_test_forecasts,
       function(x) accuracy_multinom(predicted = x,
                                        real = players_test$player_positions))
##                              players_train_knn1a players_train_knn1b
## accuracy                                85.91833            86.85222
## balanced_accuracy                       87.39991            88.24109
## balanced_correctly_predicted            88.94343            89.35451
##                              players_train_knn2a players_train_knn2b
## accuracy                                85.69859            87.05365
## balanced_accuracy                       87.02629            88.39872
## balanced_correctly_predicted            88.97030            89.78581

KNN does not seem to give better results than multinomial logistic regression, but we can see that, again, the model with all variables gives better predictions. Additionally, 10-fold cross-validation gives slightly better results for the full-data model (87.05 vs 86.85), although not for the reduced one (85.70 vs 85.92).

Let’s now try Discriminant Analysis methods and the LogitBoost method. I run 4 methods:

- Shrinkage Discriminant Analysis

set.seed(12345)

m_sda <- train(player_positions~.,
               data=players_train,
               method="sda", 
               trControl=control_cv10,
               preProcess = c("center","scale")) 

- High Dimensional Discriminant Analysis

set.seed(12345)

m_hdda <- train(player_positions~.,
               data=players_train,
               method="hdda", 
               trControl=control_cv10,
               preProcess = c("center","scale")) 

- Penalized Discriminant Analysis

set.seed(12345)
m_pda <- train(player_positions~.,
               data=players_train, 
               method="pda",
               trControl=control_cv10,
               preProcess = c("center", "scale"))

- LogitBoost Model

set.seed(12345)
m_LogitBoost <- train(player_positions~.,
               data=players_train, 
               method="LogitBoost",
               trControl=control_cv10,
               preProcess = c("center", "scale"))

Now I use computed models to predict positions:

players_test$predicted_sda <- predict(m_sda, 
                                    newdata = players_test)
## Prediction uses 47 features.
players_test$predicted_hdda <- predict(m_hdda, 
                                    newdata = players_test)

players_test$predicted_pda <- predict(m_pda, 
                                    newdata = players_test)

players_test$predicted_LogitBoost <- predict(m_LogitBoost, 
                                    newdata = players_test)

conf_matrix_sda <- 
  confusionMatrix(players_test$predicted_sda,
                players_test$player_positions)

conf_matrix_hdda <- 
  confusionMatrix(players_test$predicted_hdda,
                players_test$player_positions)

conf_matrix_pda <- 
  confusionMatrix(players_test$predicted_pda,
                players_test$player_positions)

conf_matrix_LogitBoost <- 
  confusionMatrix(players_test$predicted_LogitBoost,
                players_test$player_positions)

Accuracies of each model:

## Accuracy of Shrinkage Discriminant Analysis model:        0.8835378
## 
## Accuracy of High Dimensional Discriminant Analysis model: 0.8359275
## 
## Accuracy of Penalized Discriminant Analysis model:        0.8833547
## 
## Accuracy of LogitBoost model:                             0.8834586
## 
## 
## Accuracy of Shrinkage Discriminant Analysis model by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8818444         0.8993788
## Class: Defender   0.9369565         0.9292638
## Class: Goalkeeper 0.9983633         0.9998969
## Class: Midfielder 0.7988827         0.8810646
## 
## Accuracy of High Dimensional Discriminant Analysis model by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.7491961         0.8839660
## Class: Defender   0.9102285         0.9091323
## Class: Goalkeeper 1.0000000         1.0000000
## Class: Midfielder 0.7631430         0.8162129
## 
## Accuracy of Penalized Discriminant Analysis model by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8809981         0.8992639
## Class: Defender   0.9369565         0.9292638
## Class: Goalkeeper 0.9983633         0.9998969
## Class: Midfielder 0.7987805         0.8807879
## 
## Accuracy of LogitBoost model by position:
##                   Precision Balanced Accuracy
## Class: Attacker   0.8280802         0.9157331
## Class: Defender   0.9151547         0.9358756
## Class: Goalkeeper 1.0000000         1.0000000
## Class: Midfielder 0.8358503         0.8634438

Let’s look at the accuracy boxplots, based on the accuracy in the cross-validation resamples.

resample_results <- resamples(list(PDA=m_pda, SDA=m_sda, HDDA=m_hdda, 
                                   KNN=players_train_knn2b, 
                                   LogitBoost = m_LogitBoost))
bwplot(resample_results , metric = "Accuracy")

And density plot of accuracies:

densityplot(resample_results , metric = "Accuracy" ,auto.key = list(columns = 3))

The accuracies of the Shrinkage Discriminant Analysis, Penalized Discriminant Analysis and LogitBoost models are higher than the best result of the multinomial logistic regression. LogitBoost looks the best, although the test accuracies of these three models differ by less than 0.0002. Shrinkage and Penalized Discriminant Analysis also look very good compared to KNN and High Dimensional Discriminant Analysis.

Summary

All of the models gave quite good results, which made it harder to see which performs best. In the KNN comparison, 10-fold cross-validation gave slightly better test accuracy than 2-fold cross-validation for the full-data model, so it is often worth using some additional computing power to perform cross-validation.

In this case, the best model was LogitBoost. From the perspective of grouping players’ positions, however, it is worth considering a different grouping (for example defensive midfield, midfield and offensive midfield instead of only midfield).

To sum up, the 3 best models computed in the analysis were: