European_Soccer_Predictions

Exploring the data

My dataset is the European Soccer Database: 25k+ matches, players, and team attributes for European professional football. It is hosted on Kaggle: https://www.kaggle.com/datasets/hugomathien/soccer

This is a soccer database for data analysis and machine learning, and it includes player and team attributes from the EA Sports FIFA games. The creator is Hugo Mathien. We don’t know much about his background, but he is evidently a dedicated soccer fan and a hardworking database engineer. No motivation is listed in the description, but I believe the data was assembled as a historical record of European soccer teams, players, and matches, out of either curiosity or true passion. For that reason, I believe the project is not funded. The data was collected from multiple trusted soccer websites, including a source for EA Sports FIFA attributes. To list a few:

http://football-data.mx-api.enetscores.com/ : scores, lineup, team formation and events
http://sofifa.com/ : players and teams attributes from EA Sports FIFA games. FIFA series and all FIFA assets property of EA Sports.

The dataset offers more than 25,000 matches, over 10,000 real players, and the lead championship league of 11 European countries. For each league, it provides starting line-ups and records of match events. There are also Player, Player_Attributes, Team, and Team_Attributes tables. The two player tables include stats such as overall_rating, preferred_foot, crossing, defensive_work_rate, … The two team tables feature team name/id, buildUp, chanceCreation, defenseTeamWidth, … The data will be useful for evaluating a player’s rating, a team’s strength, a team’s winning chance, and more.

From the evidence gathered above, in my opinion, the database is meant to keep a true historical record of European soccer matches, players, and teams. The data is useful for different purposes. Soccer enthusiasts can look up how often their favorite team won in the past, by how many goals, and with which starting line-up. Teams and coaches can look up players’ ratings and decide whom to pursue in the future. The website also has an interesting description of betting odds, but that table was taken down due to some rule violations. Overall, this data is very useful for keeping track of European soccer in the past.

The data is distributed publicly on the Kaggle website (linked above) under the Open Data Commons Open Database License (ODbL) v1.0. The author collected data covering 2008 until 2016 (the most recent update is 16 Oct 2016). The license allows anyone to use, share, and adapt the data, subject to its attribution and share-alike terms, so how the data is used depends on the users. According to the data’s history, it was well maintained and active through 2016. After 2016, the author stopped working on the project, but the data is still open to the public today.

The data captures the history of European soccer up to 2016. It keeps a true record of matches, players, and teams for users who want to investigate soccer. I believe this is the best systematically collected soccer dataset on Kaggle (it earned a gold medal with 3,000+ upvotes). Unfortunately, there is no update as recent as 2022, but 2016 is close to the peak of the current football generation.

TLDR:

Data from Kaggle: https://www.kaggle.com/datasets/hugomathien/soccer
Data offers more than 25,000 matches, over 10,000 real players, and the lead championship league of 11 European countries
Data was collected by one person, Hugo Mathien, from multiple trusted soccer sites, including a source for EA Sports FIFA attributes.
Data is public and free to use under the ODbL

Summarize the data

Loading library

library(DBI)
library(RSQLite)
library(MASS)
library(class)
library(devtools)

## Loading required package: usethis

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(naivebayes)

## naivebayes 0.9.7 loaded

require(moonBook)

## Loading required package: moonBook

require(ggiraph)

## Loading required package: ggiraph

require(ggiraphExtra)

## Loading required package: ggiraphExtra

## 
## Attaching package: 'ggiraphExtra'

## The following objects are masked from 'package:moonBook':
## 
##     addLabelDf, getMapping

library(rpart)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:moonBook':
## 
##     densityplot

library(rpart.plot)
library(data.tree)
library(caTools)
library(leaps)

Reading data

set.seed(0)

# Read data (this path is local to my machine)
con <- dbConnect(SQLite(), "C:/Users/dvtie/OneDrive/Desktop/UO_STUDY/Spring_2022/Data_Science/Project/database.sqlite")

# Fetch each table into an R data frame.
# dbGetQuery() sends the query, fetches every row, and clears the result set,
# so no result sets are left open on the connection.
Country <- dbGetQuery(con, "SELECT * FROM Country")
League <- dbGetQuery(con, "SELECT * FROM League")
Match <- dbGetQuery(con, "SELECT * FROM Match")
Player <- dbGetQuery(con, "SELECT * FROM Player")
Player_Attributes <- dbGetQuery(con, "SELECT * FROM Player_Attributes")
Team <- dbGetQuery(con, "SELECT * FROM Team")
Team_Attributes <- dbGetQuery(con, "SELECT * FROM Team_Attributes")
# sqlite_sequence is SQLite's internal bookkeeping table, not part of the data.
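
The loading step can also be written as a loop over the table names. The block below is a minimal sketch that uses an in-memory SQLite database with a made-up two-row Country table (an assumption, so that it runs without the Kaggle file); the names demo_con, country_demo, and all_tables are hypothetical.

```r
library(DBI)
library(RSQLite)

# In-memory database so the sketch runs without the Kaggle file.
demo_con <- dbConnect(SQLite(), ":memory:")

# Hypothetical stand-in for the real Country table.
dbWriteTable(demo_con, "Country",
             data.frame(id = c(1L, 2L), name = c("Belgium", "England")))

# dbGetQuery() sends the query, fetches all rows, and clears the result set.
country_demo <- dbGetQuery(demo_con, "SELECT * FROM Country")

# Read every table into a named list in one pass.
tables <- dbListTables(demo_con)
all_tables <- lapply(setNames(tables, tables), function(tbl) {
  dbReadTable(demo_con, tbl)
})

# Release the connection once the data is in memory.
dbDisconnect(demo_con)
```

With the real file, the same loop would return one data frame per table listed by dbListTables(con), including sqlite_sequence, which can simply be ignored.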

Table list

dbListTables(con)

## [1] "Country"           "League"            "Match"            
## [4] "Player"            "Player_Attributes" "Team"             
## [7] "Team_Attributes"   "sqlite_sequence"

Investigating players

Preparing data

# Remove id first, this is just SQLite indexing
Player = subset(Player, select = -1)
Player_Attributes = subset(Player_Attributes, select = -1)

PlayerTable = merge(Player, Player_Attributes, by=c("player_api_id","player_fifa_api_id"))
PlayerTable = na.omit(PlayerTable)  # Removing N/A rows

Let’s start with a question: “How do you evaluate a soccer player?”

—————–## UPDATE ## ——————-

Player table

summary(PlayerTable)

##  player_api_id    player_fifa_api_id player_name          birthday        
##  Min.   :  2625   Min.   :     2     Length:180228      Length:180228     
##  1st Qu.: 35451   1st Qu.:156593     Class :character   Class :character  
##  Median : 80293   Median :183781     Mode  :character   Mode  :character  
##  Mean   :137701   Mean   :166807                                          
##  3rd Qu.:192842   3rd Qu.:200138                                          
##  Max.   :750584   Max.   :234141                                          
##      height          weight          date           overall_rating 
##  Min.   :157.5   Min.   :117.0   Length:180228      Min.   :33.00  
##  1st Qu.:177.8   1st Qu.:159.0   Class :character   1st Qu.:64.00  
##  Median :182.9   Median :168.0   Mode  :character   Median :69.00  
##  Mean   :181.9   Mean   :168.8                      Mean   :68.63  
##  3rd Qu.:185.4   3rd Qu.:179.0                      3rd Qu.:73.00  
##  Max.   :208.3   Max.   :243.0                      Max.   :94.00  
##    potential     preferred_foot     attacking_work_rate defensive_work_rate
##  Min.   :39.00   Length:180228      Length:180228       Length:180228      
##  1st Qu.:69.00   Class :character   Class :character    Class :character   
##  Median :74.00   Mode  :character   Mode  :character    Mode  :character   
##  Mean   :73.48                                                             
##  3rd Qu.:78.00                                                             
##  Max.   :97.00                                                             
##     crossing       finishing     heading_accuracy short_passing  
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00    Min.   : 3.00  
##  1st Qu.:45.00   1st Qu.:34.00   1st Qu.:49.00    1st Qu.:57.00  
##  Median :59.00   Median :53.00   Median :60.00    Median :65.00  
##  Mean   :55.14   Mean   :49.95   Mean   :57.27    Mean   :62.48  
##  3rd Qu.:68.00   3rd Qu.:65.00   3rd Qu.:68.00    3rd Qu.:72.00  
##  Max.   :95.00   Max.   :97.00   Max.   :98.00    Max.   :97.00  
##     volleys        dribbling         curve       free_kick_accuracy
##  Min.   : 1.00   Min.   : 1.00   Min.   : 2.00   Min.   : 1.00     
##  1st Qu.:35.00   1st Qu.:52.00   1st Qu.:41.00   1st Qu.:36.00     
##  Median :52.00   Median :64.00   Median :56.00   Median :50.00     
##  Mean   :49.48   Mean   :59.26   Mean   :52.99   Mean   :49.38     
##  3rd Qu.:64.00   3rd Qu.:72.00   3rd Qu.:67.00   3rd Qu.:63.00     
##  Max.   :93.00   Max.   :97.00   Max.   :94.00   Max.   :97.00     
##   long_passing    ball_control    acceleration   sprint_speed     agility     
##  Min.   : 3.00   Min.   : 5.00   Min.   :10.0   Min.   :12.0   Min.   :11.00  
##  1st Qu.:49.00   1st Qu.:59.00   1st Qu.:61.0   1st Qu.:62.0   1st Qu.:58.00  
##  Median :59.00   Median :67.00   Median :69.0   Median :69.0   Median :68.00  
##  Mean   :57.09   Mean   :63.45   Mean   :67.7   Mean   :68.1   Mean   :65.99  
##  3rd Qu.:67.00   3rd Qu.:73.00   3rd Qu.:77.0   3rd Qu.:77.0   3rd Qu.:75.00  
##  Max.   :97.00   Max.   :97.00   Max.   :97.0   Max.   :97.0   Max.   :96.00  
##    reactions        balance        shot_power       jumping     
##  Min.   :17.00   Min.   :12.00   Min.   : 2.00   Min.   :14.00  
##  1st Qu.:61.00   1st Qu.:58.00   1st Qu.:54.00   1st Qu.:60.00  
##  Median :67.00   Median :67.00   Median :66.00   Median :68.00  
##  Mean   :66.14   Mean   :65.18   Mean   :61.87   Mean   :66.99  
##  3rd Qu.:72.00   3rd Qu.:74.00   3rd Qu.:73.00   3rd Qu.:74.00  
##  Max.   :96.00   Max.   :96.00   Max.   :97.00   Max.   :96.00  
##     stamina         strength       long_shots      aggression   
##  Min.   :10.00   Min.   :10.00   Min.   : 1.00   Min.   : 6.00  
##  1st Qu.:61.00   1st Qu.:60.00   1st Qu.:41.00   1st Qu.:51.00  
##  Median :69.00   Median :69.00   Median :58.00   Median :64.00  
##  Mean   :67.05   Mean   :67.44   Mean   :53.38   Mean   :60.95  
##  3rd Qu.:76.00   3rd Qu.:76.00   3rd Qu.:67.00   3rd Qu.:73.00  
##  Max.   :96.00   Max.   :96.00   Max.   :96.00   Max.   :97.00  
##  interceptions    positioning        vision        penalties    
##  Min.   : 1.00   Min.   : 2.00   Min.   : 1.00   Min.   : 2.00  
##  1st Qu.:34.00   1st Qu.:45.00   1st Qu.:49.00   1st Qu.:45.00  
##  Median :56.00   Median :60.00   Median :60.00   Median :57.00  
##  Mean   :51.91   Mean   :55.73   Mean   :57.86   Mean   :54.93  
##  3rd Qu.:68.00   3rd Qu.:69.00   3rd Qu.:69.00   3rd Qu.:67.00  
##  Max.   :96.00   Max.   :95.00   Max.   :97.00   Max.   :96.00  
##     marking      standing_tackle sliding_tackle    gk_diving    
##  Min.   : 1.00   Min.   : 1.00   Min.   : 2.00   Min.   : 1.00  
##  1st Qu.:25.00   1st Qu.:29.00   1st Qu.:25.00   1st Qu.: 7.00  
##  Median :50.00   Median :56.00   Median :53.00   Median :10.00  
##  Mean   :46.77   Mean   :50.37   Mean   :48.04   Mean   :14.69  
##  3rd Qu.:66.00   3rd Qu.:69.00   3rd Qu.:67.00   3rd Qu.:13.00  
##  Max.   :94.00   Max.   :95.00   Max.   :95.00   Max.   :94.00  
##   gk_handling      gk_kicking    gk_positioning   gk_reflexes   
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 8.00  
##  Median :11.00   Median :12.00   Median :11.00   Median :11.00  
##  Mean   :15.95   Mean   :20.53   Mean   :16.01   Mean   :16.32  
##  3rd Qu.:15.00   3rd Qu.:15.00   3rd Qu.:15.00   3rd Qu.:15.00  
##  Max.   :93.00   Max.   :97.00   Max.   :96.00   Max.   :96.00

Analyzing the Player Table:

There are identification, categorical, and numeric columns
player_api_id, player_fifa_api_id, player_name, birthday, date: these are part of a player’s identity. Identity columns have no effect as predictors; they only identify a player
preferred_foot, attacking_work_rate, defensive_work_rate are categorical predictors
The rest are numeric predictors. The ranges in the summary indicate that height is a positive real value in centimeters (157.5 to 208.3) and weight is a positive real value in pounds (117 to 243). The other numeric values are on a 0 to 100 scale (0 is bad, 100 is perfect)
Among the numeric columns, overall_rating and potential are ratings taken from the EA Sports FIFA games (referred to as FIFA below):
- potential defines the overall_rating a player could reach in the future. If a player is young, the potential might be higher than the current overall_rating. If a player is old, the potential will be lower. This is purely a guess by FIFA. We mostly ignore this value and focus on overall_rating.
- overall_rating is the actual value FIFA gives a player (on official sites, in games, on player cards, ..). This value is based on a player’s position and their main attributes. We don’t know the method FIFA used to derive this number. It could just be the average of all main attributes, or they could use a model themselves. Therefore, our job is to fit a model to predict this value, then compare it with what FIFA gives.

———————-###————————

Looking at the PlayerTable, some columns are just for identification and some are player attributes. Here are some assumptions/observations:

player_api_id, player_fifa_api_id, player_name, birthday, date are identity columns, so we exclude them when choosing predictors
height, weight, preferred_foot, crossing, finishing, shot_power, stamina, … can all be good predictors of a player’s overall_rating
Among the predictors, preferred_foot, attacking_work_rate, and defensive_work_rate are categorical (factor) predictors.

And some hypotheses:

height and weight come from the identity table Player, but we still include them as predictors to see whether either is statistically significant
Most of the attributes look like they will affect a player’s overall_rating, however:
An attacking player will rely more on attacking stats: finishing, volleys, dribbling, shot_power, ..
A defending player will rely more on defensive stats: standing_tackle, sliding_tackle, marking, ..
A goalkeeper mainly relies on goalkeeping stats: gk_diving, gk_handling, gk_positioning, ..

—————–## UPDATE ## ——————-

Having asked the question, we split the analysis into a few strategies:

First, we take a quick look with a linear regression model to identify which predictors are significant
NOTE: The table doesn’t provide a player’s position. So for this project, I labeled the roles of about 30 players myself, based on the official website and common knowledge.
Then, we try different classification models (logistic, naive Bayes, tree/forest) to classify a player’s position (attacker/defender/goalkeeper)

———————-###————————

Multiple linear regression
The PlayerTable has overall_rating followed by multiple player attributes. Let’s try to fit a model that predicts the overall_rating value. Since the goal is to predict a numeric value, we should first try multiple linear regression:

Preparing the data

# Processing data
test2016 = PlayerTable
test2016$date <- substr(test2016$date, 1, 4) # keep only the year, in "20XX" form
# Keep only the 2016 records, because the full table is very large
test2016 = test2016[test2016$date == "2016", ]

# The original data is inconsistent: recode numeric defensive_work_rate values (1-9) as low, medium, high
test2016$defensive_work_rate[test2016$defensive_work_rate == "1" | test2016$defensive_work_rate == "2" | test2016$defensive_work_rate == "3"] <- "low"
test2016$defensive_work_rate[test2016$defensive_work_rate == "4" | test2016$defensive_work_rate == "5" | test2016$defensive_work_rate == "6"] <- "medium"
test2016$defensive_work_rate[test2016$defensive_work_rate == "7" | test2016$defensive_work_rate == "8" | test2016$defensive_work_rate == "9"] <- "high"

# The original data is inconsistent: recode numeric attacking_work_rate values (1-9) as low, medium, high
test2016$attacking_work_rate[test2016$attacking_work_rate == "1" | test2016$attacking_work_rate == "2" | test2016$attacking_work_rate == "3"] <- "low"
test2016$attacking_work_rate[test2016$attacking_work_rate == "4" | test2016$attacking_work_rate == "5" | test2016$attacking_work_rate == "6"] <- "medium"
test2016$attacking_work_rate[test2016$attacking_work_rate == "7" | test2016$attacking_work_rate == "8" | test2016$attacking_work_rate == "9"] <- "high"

# Change from character to factor
test2016$preferred_foot <- as.factor(test2016$preferred_foot)
test2016$attacking_work_rate <- as.factor(test2016$attacking_work_rate)
test2016$defensive_work_rate <- as.factor(test2016$defensive_work_rate)
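
The recoding above repeats the same three rules for both work-rate columns. As a minimal sketch, the rules can be wrapped in one helper; recode_work_rate and demo_rates are hypothetical names introduced here for illustration, and the toy vector is made up.

```r
# Hypothetical helper: map numeric-string work rates onto low/medium/high.
# Values outside "1".."9" (for example "None") are left unchanged.
recode_work_rate <- function(x) {
  x[x %in% c("1", "2", "3")] <- "low"
  x[x %in% c("4", "5", "6")] <- "medium"
  x[x %in% c("7", "8", "9")] <- "high"
  x
}

# Toy input mixing numeric strings and labels.
demo_rates <- c("2", "medium", "9", "None", "5", "high")
recoded <- recode_work_rate(demo_rates)

# Inspect which levels remain after recoding.
table(factor(recoded))
```

Running table() on the recoded column is a quick way to spot leftover values, such as the None level that appears among the regression coefficients.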

Fit multiple linear regression model:

# Fit a linear regression model on all attributes, excluding the identity columns
fit1 = lm(test2016$overall_rating~., data=test2016[ , -which(names(test2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "potential"))] )
# potential is excluded as well because potential and overall_rating are essentially the same kind of value; we predict only one of them
summary(fit1)

## 
## Call:
## lm(formula = test2016$overall_rating ~ ., data = test2016[, -which(names(test2016) %in% 
##     c("player_api_id", "player_fifa_api_id", "player_name", "birthday", 
##         "date", "potential"))])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9327  -1.7845  -0.0733   1.7244  11.2269 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.7991842  1.6522048   2.299 0.021493 *  
## height                     0.0084735  0.0077156   1.098 0.272126    
## weight                     0.0074687  0.0028024   2.665 0.007704 ** 
## preferred_footright       -0.2163025  0.0555863  -3.891 0.000100 ***
## attacking_work_ratelow     1.6528801  0.1320685  12.515  < 2e-16 ***
## attacking_work_ratemedium -0.2213471  0.0566527  -3.907 9.39e-05 ***
## attacking_work_rateNone   -0.3418119  0.2547715  -1.342 0.179734    
## defensive_work_ratehigh    3.4131868  0.7132937   4.785 1.73e-06 ***
## defensive_work_ratelow     3.5370080  0.7130870   4.960 7.13e-07 ***
## defensive_work_ratemedium  3.2706436  0.7122524   4.592 4.43e-06 ***
## crossing                  -0.0088195  0.0031010  -2.844 0.004460 ** 
## finishing                  0.0154217  0.0037002   4.168 3.09e-05 ***
## heading_accuracy           0.0791795  0.0031897  24.824  < 2e-16 ***
## short_passing              0.0749497  0.0060995  12.288  < 2e-16 ***
## volleys                    0.0058049  0.0031380   1.850 0.064358 .  
## dribbling                  0.0020077  0.0051245   0.392 0.695221    
## curve                      0.0224619  0.0031416   7.150 9.11e-13 ***
## free_kick_accuracy         0.0004618  0.0027768   0.166 0.867924    
## long_passing               0.0113399  0.0042340   2.678 0.007409 ** 
## ball_control               0.2214545  0.0066256  33.424  < 2e-16 ***
## acceleration               0.0245989  0.0052352   4.699 2.64e-06 ***
## sprint_speed               0.0552671  0.0047871  11.545  < 2e-16 ***
## agility                   -0.0138669  0.0038769  -3.577 0.000349 ***
## reactions                  0.3663411  0.0043485  84.246  < 2e-16 ***
## balance                    0.0020724  0.0039969   0.518 0.604128    
## shot_power                 0.0210394  0.0036195   5.813 6.28e-09 ***
## jumping                    0.0053470  0.0025274   2.116 0.034393 *  
## stamina                   -0.0111346  0.0029966  -3.716 0.000203 ***
## strength                   0.0360852  0.0036396   9.915  < 2e-16 ***
## long_shots                -0.0211872  0.0036058  -5.876 4.30e-09 ***
## aggression                -0.0011006  0.0025462  -0.432 0.665560    
## interceptions              0.0054994  0.0034546   1.592 0.111433    
## positioning               -0.0646825  0.0035893 -18.021  < 2e-16 ***
## vision                    -0.0054751  0.0035768  -1.531 0.125866    
## penalties                  0.0130128  0.0030049   4.331 1.50e-05 ***
## marking                    0.0178739  0.0044763   3.993 6.56e-05 ***
## standing_tackle            0.0135714  0.0053135   2.554 0.010655 *  
## sliding_tackle            -0.0139389  0.0049896  -2.794 0.005220 ** 
## gk_diving                  0.0739050  0.0069207  10.679  < 2e-16 ***
## gk_handling                0.0720897  0.0068272  10.559  < 2e-16 ***
## gk_kicking                 0.0286646  0.0064623   4.436 9.25e-06 ***
## gk_positioning             0.0634670  0.0068165   9.311  < 2e-16 ***
## gk_reflexes                0.0723199  0.0067605  10.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.734 on 14041 degrees of freedom
## Multiple R-squared:  0.7986, Adjusted R-squared:  0.798 
## F-statistic:  1326 on 42 and 14041 DF,  p-value: < 2.2e-16

From this summary, most predictors have a significant impact on overall_rating. The exceptions are height, dribbling, free_kick_accuracy, balance, aggression, interceptions, and vision (p-value > 0.1), plus the None level of attacking_work_rate. Notice that height is not significant, but weight is. In real life, it is understandable that free_kick_accuracy is unimportant, but balance, aggression, interceptions, and vision should still contribute somewhat to a player’s overall_rating. Nonetheless, let’s drop these predictors and check whether the rest remain statistically significant:

Removing insignificant predictors from lm:

# Refit the linear regression model without the insignificant predictors
fit1 = lm(test2016$overall_rating~.-height- dribbling- free_kick_accuracy- balance- aggression- interceptions- vision, data=test2016[ , -which(names(test2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "potential"))] )
summary(fit1)

## 
## Call:
## lm(formula = test2016$overall_rating ~ . - height - dribbling - 
##     free_kick_accuracy - balance - aggression - interceptions - 
##     vision, data = test2016[, -which(names(test2016) %in% c("player_api_id", 
##     "player_fifa_api_id", "player_name", "birthday", "date", 
##     "potential"))])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1224  -1.7938  -0.0767   1.7195  11.2877 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.268892   0.872198   6.041 1.57e-09 ***
## weight                     0.008249   0.002594   3.180 0.001474 ** 
## preferred_footright       -0.216632   0.055354  -3.914 9.14e-05 ***
## attacking_work_ratelow     1.650829   0.131329  12.570  < 2e-16 ***
## attacking_work_ratemedium -0.217579   0.056464  -3.853 0.000117 ***
## attacking_work_rateNone   -0.341619   0.254428  -1.343 0.179394    
## defensive_work_ratehigh    3.416816   0.712393   4.796 1.63e-06 ***
## defensive_work_ratelow     3.544216   0.712021   4.978 6.51e-07 ***
## defensive_work_ratemedium  3.277539   0.711359   4.607 4.11e-06 ***
## crossing                  -0.008903   0.003037  -2.931 0.003382 ** 
## finishing                  0.014720   0.003655   4.027 5.67e-05 ***
## heading_accuracy           0.079989   0.003032  26.384  < 2e-16 ***
## short_passing              0.073620   0.005993  12.284  < 2e-16 ***
## volleys                    0.005455   0.003121   1.748 0.080460 .  
## curve                      0.022568   0.002923   7.720 1.24e-14 ***
## long_passing               0.011097   0.004086   2.716 0.006621 ** 
## ball_control               0.221598   0.005791  38.266  < 2e-16 ***
## acceleration               0.025092   0.005116   4.904 9.48e-07 ***
## sprint_speed               0.055789   0.004725  11.807  < 2e-16 ***
## agility                   -0.014300   0.003700  -3.865 0.000112 ***
## reactions                  0.366805   0.004266  85.974  < 2e-16 ***
## shot_power                 0.020845   0.003555   5.864 4.62e-09 ***
## jumping                    0.005253   0.002462   2.134 0.032879 *  
## stamina                   -0.011027   0.002970  -3.712 0.000206 ***
## strength                   0.036588   0.003499  10.457  < 2e-16 ***
## long_shots                -0.020960   0.003538  -5.924 3.21e-09 ***
## positioning               -0.065522   0.003482 -18.819  < 2e-16 ***
## penalties                  0.012626   0.002938   4.297 1.74e-05 ***
## marking                    0.019355   0.004371   4.428 9.58e-06 ***
## standing_tackle            0.015184   0.005127   2.961 0.003068 ** 
## sliding_tackle            -0.013275   0.004972  -2.670 0.007598 ** 
## gk_diving                  0.073473   0.006907  10.637  < 2e-16 ***
## gk_handling                0.071357   0.006812  10.475  < 2e-16 ***
## gk_kicking                 0.028233   0.006455   4.374 1.23e-05 ***
## gk_positioning             0.063407   0.006814   9.306  < 2e-16 ***
## gk_reflexes                0.072573   0.006743  10.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.733 on 14048 degrees of freedom
## Multiple R-squared:  0.7985, Adjusted R-squared:  0.798 
## F-statistic:  1591 on 35 and 14048 DF,  p-value: < 2.2e-16

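After the refit, every remaining numeric predictor is significant at the 0.1 level. volleys (0.080) and jumping (0.033) are above 0.01, but we keep using them. The None level of attacking_work_rate (0.179) is still not significant.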
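
The screening step can also be done programmatically instead of reading the table by eye. The block below is a minimal sketch on the built-in mtcars data (a stand-in for test2016, chosen so that it runs anywhere); the 0.1 threshold and the names demo_fit, weak_terms, and reduced_fit are assumptions for illustration. It matches coefficient names to column names, so it only works directly for numeric predictors: factor predictors appear as level names such as attacking_work_rateNone.

```r
# Built-in mtcars data stands in for test2016 (assumption for illustration).
demo_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(demo_fit)$coefficients

# Collect the non-intercept terms whose p-value exceeds the threshold.
threshold <- 0.1
p_values <- coef_table[-1, "Pr(>|t|)"]
weak_terms <- names(p_values)[p_values > threshold]

# Refit without the weak terms; update() edits the original formula.
if (length(weak_terms) > 0) {
  drop_formula <- as.formula(
    paste(". ~ . -", paste(weak_terms, collapse = " - "))
  )
  reduced_fit <- update(demo_fit, drop_formula)
} else {
  reduced_fit <- demo_fit
}

summary(reduced_fit)
```

Dropping terms by p-value in a single pass is a rough screen; the leaps package loaded earlier offers subset selection as a more systematic alternative.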

Comments on the fit1 model

The residual quartiles (1Q,3Q)=(-1.7938, 1.7195) and extremes (Min,Max)=(-12.1224, 11.2877) are roughly symmetric, and the Median=-0.0767 is close to 0. This indicates well-behaved residuals.
Almost all of the predictors are statistically significant (p-value < 0.01)
The Std. Error values are small relative to most coefficient estimates, meaning the coefficients are estimated precisely.
The residual standard error (RSE) is 2.733 on a 100-point scale, which is a low error
Multiple R-squared and Adjusted R-squared are 0.7985 and 0.798, so the model explains about 80% of the variance in overall_rating. The two values are nearly identical, which suggests the number of predictors is not inflating the fit
-> Conclusion from the model fit1: our multiple linear regression model fits the data well.
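
To make these statistics concrete, the sketch below recomputes the residual standard error, R-squared, and adjusted R-squared by hand and compares them with what summary() reports. It uses the built-in mtcars data as a stand-in for test2016 (an assumption so the block runs anywhere); the variable names are hypothetical.

```r
# Built-in mtcars data stands in for test2016 (assumption for illustration).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(demo_fit)
rss <- sum(res^2)                              # residual sum of squares (RSS)
rse <- sqrt(rss / df.residual(demo_fit))       # residual standard error (RSE)

tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r_squared <- 1 - rss / tss

n <- nrow(mtcars)
p <- length(coef(demo_fit)) - 1                # predictors, excluding intercept
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Compare with the values reported by summary().
c(rse = rse, sigma = summary(demo_fit)$sigma)
c(r2 = r_squared, adj_r2 = adj_r_squared)
```

The RSE is the square root of RSS divided by the residual degrees of freedom, so the 2.733 on 14048 degrees of freedom reported for fit1 corresponds to an RSS of roughly 105,000.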

Confirming the predictors with graphs:

overall_rating vs short_passing (p-value < 2e-16) scatter plot:

#boxplot(test2016$overall_rating~test2016$short_passing, data=test2016, main="overall_rating vs short_passing", xlab="short_passing", ylab="overall_rating")

ggplot(test2016,aes(y=overall_rating,x=short_passing))+geom_point()+stat_smooth(method="lm",se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

This looks like a curve at first, with some noise when the short_passing value is low. But when short_passing > 41, overall_rating increases steadily in a linear fashion with very low error. This is understandable: some players have low short_passing but a high overall_rating because their role does not emphasize short passing. There are also only a few points in the low-passing, high-rating region. Nonetheless, short_passing proves to be a statistically significant predictor of overall_rating. We can improve on this in a later section with a non-linear, classification strategy.

Let’s try some other significant predictor:

overall_rating vs jumping (p-value 0.032879) scatter plot:

#boxplot(test2016$overall_rating~test2016$jumping, data=test2016, main="overall_rating vs jumping", xlab="jumping", ylab="overall_rating")

ggplot(test2016,aes(y=overall_rating,x=jumping))+geom_point()+stat_smooth(method="lm",se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

This is surprisingly linear. The noise is low and the points fall close to a straight line. The jumping coefficient in fit1 is 0.005253 ± 0.002462 (estimate ± standard error), so the relationship between jumping and overall_rating is statistically significant at the 5% level, although the effect is small.
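
The slope and standard error quoted above are enough to reproduce the p-value and to judge the size of the effect. The sketch below uses a normal approximation, which is an assumption that is reasonable here because fit1 has 14048 residual degrees of freedom.

```r
est <- 0.005253   # jumping slope quoted in the text
se  <- 0.002462   # its standard error quoted in the text

# Approximate 95% confidence interval for the slope
ci <- est + c(-1, 1) * qnorm(0.975) * se

# Two-sided p-value from the z statistic (normal approximation to the t-test)
z <- est / se
p <- 2 * pnorm(-abs(z))

# Rating change implied by moving jumping across a full 100-point range
effect_full_range <- est * 100

list(ci = ci, z = z, p = p, effect_full_range = effect_full_range)
```

The interval excludes zero and the p-value comes out at about 0.033, in line with the summary, yet a full 100-point swing in jumping shifts the predicted rating by only about half a point.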

overall_rating vs attacking_work_rate boxplot:

boxplot(test2016$overall_rating~test2016$attacking_work_rate, data=test2016, main="overall_rating vs attacking_work_rate", xlab="attacking_work_rate", ylab="overall_rating")

The overall_rating vs attacking_work_rate boxplot shows little spread within each level (roughly ± 5 around the median). The high, low, and medium levels are useful predictors because together they cover a wide range of ratings. None of them is insignificant according to the p-values in summary(fit1).
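
By default R orders factor levels alphabetically (high, low, medium), which makes a boxplot like this harder to read. A small sketch with made-up ratings (the numbers are assumptions for illustration) shows how to set the natural order before plotting:

```r
# Toy values (assumption: not taken from test2016)
work_rate <- c("medium", "high", "low", "high", "medium", "low")
rating    <- c(68, 75, 62, 78, 70, 60)

# Put the levels in their natural order instead of alphabetical order
work_rate <- factor(work_rate, levels = c("low", "medium", "high"))

boxplot(rating ~ work_rate,
        main = "overall_rating vs attacking_work_rate",
        xlab = "attacking_work_rate", ylab = "overall_rating")
```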

Now let’s use the fit1 model to predict on its own training dataset, test2016:

Predictions:

# testing against the training table test2016:
lm.pred = predict(fit1, data.frame(test2016), interval = "confidence")

# Creating result table
lm.table1 = data.frame(lm.pred)
lm.table1 = cbind(lm.table1, true=test2016[, 8:8])
lm.table1 = cbind(lm.table1, relativeError=(lm.table1$fit-lm.table1$true)/lm.table1$true)
lm.table1 = cbind(lm.table1, percentageError=abs(lm.table1$fit-lm.table1$true)*100/lm.table1$true)

A random sample of 20 predictions:

set.seed(3123)
lm.table1[sample(1:nrow(lm.table1), 20), ]

##             fit      lwr      upr true relativeError percentageError
## 65367  73.98439 73.77482 74.19396   73  0.0134848319      1.34848319
## 118630 65.48025 65.20204 65.75847   69 -0.0510108450      5.10108450
## 165843 62.14930 61.90195 62.39665   61  0.0188409668      1.88409668
## 31841  65.66133 65.33096 65.99170   62  0.0590537467      5.90537467
## 86757  74.83475 74.63195 75.03756   73  0.0251335969      2.51335969
## 49697  66.50379 66.20242 66.80516   62  0.0726418012      7.26418012
## 95196  76.96988 76.68440 77.25536   77 -0.0003912117      0.03912117
## 9712   63.95795 63.73327 64.18263   66 -0.0309400968      3.09400968
## 22288  76.37617 76.12393 76.62842   79 -0.0332130098      3.32130098
## 101393 68.77557 68.52817 69.02296   67  0.0265010211      2.65010211
## 88752  74.36687 74.02453 74.70921   79 -0.0586471826      5.86471826
## 158630 63.61043 63.02179 64.19908   68 -0.0645524947      6.45524947
## 106477 63.76991 63.48581 64.05402   64 -0.0035950782      0.35950782
## 183645 76.28578 76.05460 76.51695   80 -0.0464277773      4.64277773
## 32615  75.73059 75.38636 76.07483   77 -0.0164858238      1.64858238
## 140327 68.17510 67.94091 68.40928   68  0.0025749485      0.25749485
## 92552  70.29337 70.07911 70.50762   70  0.0041909366      0.41909366
## 36811  74.22156 74.01087 74.43225   74  0.0029939911      0.29939911
## 180909 70.87995 70.63180 71.12811   71 -0.0016908083      0.16908083
## 66336  65.49948 65.21654 65.78242   68 -0.0367723428      3.67723428

Average percentage Error

mean(lm.table1$percentageError)

## [1] 3.088751

Above is the prediction table for players in 2016. We can draw some conclusions from it:

The average percentage error over all predictions is 3.088751%, which is low.
The fit1 predictions are close to the true values in lm.table1.
The lwr and upr bounds around each fit value are narrow. Because interval = "confidence" bounds the mean prediction rather than an individual player's rating, this shows that the fitted mean is estimated precisely.
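
The lwr and upr columns above come from interval = "confidence", which bounds the average rating of players with a given set of attributes. As a sketch on synthetic data (the variables and numbers below are assumptions, not the player table), the following compares it with interval = "prediction", which bounds a single player's rating and is always wider:

```r
# Synthetic one-predictor model (assumption: for illustration only)
set.seed(2)
n <- 500
x <- runif(n, 20, 90)
y <- 30 + 0.5 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)
new <- data.frame(x = c(40, 60, 80))

conf_int <- predict(fit, new, interval = "confidence")  # bounds on the mean response
pred_int <- predict(fit, new, interval = "prediction")  # bounds on one new observation

ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(ci_width, pi_width)
```

With thousands of rows the confidence band becomes very narrow even when individual errors are several rating points, so the prediction interval is the better guide to per-player accuracy.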

Predicting the training dataset itself is not a fair test, though. We must also evaluate on a separate test set:

Let’s make another prediction with player stats from 2015 (test dataset)

Preparing 2015 data:

# Processing data
test2015 = PlayerTable
test2015$date<- substr(test2015$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2015, because the data is very big
test2015 = test2015[test2015$date == "2015", ]

# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2015$defensive_work_rate[test2015$defensive_work_rate == "1" | test2015$defensive_work_rate == "2" | test2015$defensive_work_rate == "3"] <- "low"
test2015$defensive_work_rate[test2015$defensive_work_rate == "4" | test2015$defensive_work_rate == "5" | test2015$defensive_work_rate == "6"] <- "medium"
test2015$defensive_work_rate[test2015$defensive_work_rate == "7" | test2015$defensive_work_rate == "8" | test2015$defensive_work_rate == "9"] <- "high"

# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2015$attacking_work_rate[test2015$attacking_work_rate == "1" | test2015$attacking_work_rate == "2" | test2015$attacking_work_rate == "3"] <- "low"
test2015$attacking_work_rate[test2015$attacking_work_rate == "4" | test2015$attacking_work_rate == "5" | test2015$attacking_work_rate == "6"] <- "medium"
test2015$attacking_work_rate[test2015$attacking_work_rate == "7" | test2015$attacking_work_rate == "8" | test2015$attacking_work_rate == "9"] <- "high"

# Change from character to factor
test2015$preferred_foot <- as.factor(test2015$preferred_foot)
test2015$attacking_work_rate <- as.factor(test2015$attacking_work_rate)
test2015$defensive_work_rate <- as.factor(test2015$defensive_work_rate)
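
The same recoding is repeated for every yearly subset, so it could be wrapped in a small function. The sketch below is a hypothetical helper (recode_work_rate is not part of the original code) that behaves like the lines above for the codes "1" to "9" and leaves every other value unchanged:

```r
# Hypothetical helper: map numeric codes 1..9 to low / medium / high
recode_work_rate <- function(x) {
  x <- as.character(x)
  x[x %in% c("1", "2", "3")] <- "low"
  x[x %in% c("4", "5", "6")] <- "medium"
  x[x %in% c("7", "8", "9")] <- "high"
  factor(x)  # return a factor, as the original code does afterwards
}

# Toy input mixing both encodings
raw <- c("high", "2", "medium", "8", "5", "low")
recoded <- recode_work_rate(raw)
recoded
```

With such a helper, each column could be prepared in one call, for example test2015$attacking_work_rate <- recode_work_rate(test2015$attacking_work_rate).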

Predictions:

# using fit1 model to predict on test2015 dataset
lm.pred = predict(fit1, data.frame(test2015), interval = "confidence")

# Creating result table
lm.table2 = data.frame(lm.pred)
lm.table2 = cbind(lm.table2, true=test2015[, 8:8])
lm.table2 = cbind(lm.table2, relativeError=(lm.table2$fit-lm.table2$true)/lm.table2$true)
lm.table2 = cbind(lm.table2, percentageError=abs(lm.table2$fit-lm.table2$true)*100/lm.table2$true)

A random sample of 20 predictions:

set.seed(234)
lm.table2[sample(1:nrow(lm.table2), 20), ]

##             fit      lwr      upr true relativeError percentageError
## 92980  68.41145 68.12930 68.69359   70 -0.0226936337      2.26936337
## 111414 66.09703 65.72940 66.46466   73 -0.0945611812      9.45611812
## 7803   73.06123 72.88779 73.23466   69  0.0588583731      5.88583731
## 108270 65.23618 64.95819 65.51416   64  0.0193152552      1.93152552
## 25328  64.06763 63.79879 64.33646   64  0.0010566458      0.10566458
## 51978  74.97734 74.70426 75.25042   78 -0.0387520845      3.87520845
## 165728 67.53733 67.28017 67.79449   71 -0.0487699640      4.87699640
## 84139  71.55618 71.16924 71.94313   70  0.0222311822      2.22311822
## 165359 74.25879 74.05585 74.46173   74  0.0034971561      0.34971561
## 112626 80.06108 79.73491 80.38725   80  0.0007635116      0.07635116
## 21729  67.67575 67.43077 67.92073   65  0.0411653685      4.11653685
## 18604  68.08643 67.55853 68.61434   66  0.0316126146      3.16126146
## 31153  69.37086 69.10348 69.63825   56  0.2387654247     23.87654247
## 31178  68.39452 68.12863 68.66042   69 -0.0087750038      0.87750038
## 492    66.15661 65.91401 66.39920   66  0.0023728566      0.23728566
## 167474 62.89946 62.56740 63.23151   60  0.0483243033      4.83243033
## 123860 70.22731 70.01322 70.44140   66  0.0640501277      6.40501277
## 91007  69.34016 69.00135 69.67898   68  0.0197082726      1.97082726
## 49873  75.20560 74.83030 75.58090   75  0.0027413123      0.27413123
## 151249 70.14120 69.87887 70.40354   71 -0.0120957363      1.20957363

Average percentage Error

mean(lm.table2$percentageError)

## [1] 3.274805

Above is the prediction table for players in 2015, using the fit1 model trained on players from 2016. We can draw some conclusions from it:

The average percentage error over all predictions is 3.274805%, which is good.
The fit1 predictions are close to the true values in lm.table2.
The average error is around 3.3%, but an occasional large miss appears, such as the 23.9% error for row 31153.
The lwr and upr bounds around each fit value are narrow, so the fitted mean is estimated precisely.
-> The errors on the 2015 and 2016 datasets are close to each other (3.27% vs 3.09%), with no significant difference. This suggests the model generalizes well, without obvious overfitting or underfitting, and can be used to make good predictions.
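
Since the mean hides outliers such as the 23% miss, it can help to report a few more summary statistics of the percentage error. The function below is a hypothetical helper (summarise_pct_error is not in the original code), shown on three rows copied from the lm.table2 sample above:

```r
# Hypothetical helper: summarise absolute percentage error
summarise_pct_error <- function(fit, true) {
  pct <- abs(fit - true) * 100 / true
  c(mean   = mean(pct),
    median = median(pct),
    p95    = unname(quantile(pct, 0.95)),
    max    = max(pct))
}

# Rows 92980, 31153 and 112626 from the sample table above
fit_vals  <- c(68.41145, 69.37086, 80.06108)
true_vals <- c(70, 56, 80)
err <- summarise_pct_error(fit_vals, true_vals)
err
```

On the full table the call would be summarise_pct_error(lm.table2$fit, lm.table2$true).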

-> Conclusion: the multiple linear regression model predicts a player’s numeric overall_rating well from a mix of numeric and categorical attributes.

Non-linear regression

One thing missing from the PlayerTable is the role of the player, which matters when evaluating a player. An attacking player tends to have better attacking stats: shooting, dribbling, speed, … A defensive player emphasizes defensive stats: sliding_tackle, standing_tackle, interceptions, aggression, … A goalkeeper differs from both, with dedicated stats: gk_diving, gk_reflexes, … If we can categorize players into roles, we can evaluate them better based on the stats that matter for each role.

In this section, we try to answer the following question: “How do we categorize a player into different roles?” Since no table records roles, we need to build the labels ourselves by looking players up on official websites (FIFA.com, or Wiki.com, will usually list the role if the player is famous enough). The strategy is as follows:

Create a new training dataset by adding a new column named role
Try multiple non-linear regression methods
Predict on a test set. Since we don’t have a labeled test set to compare against either, we can again rely on official sites to check whether a prediction is correct.
Alternatively, use an unsupervised approach: group closely related players together and classify based on those groups.
Pick the best-fitting model and comment on the findings
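
The unsupervised option in the list above could be sketched with k-means clustering. The example below runs on synthetic players whose attribute means are assumptions loosely modelled on the three roles (it does not use the real table), and checks whether three clusters recover the three groups:

```r
# Synthetic players (assumption: group means chosen for illustration)
set.seed(4)
make_group <- function(n, dribbling, interceptions, gk_diving) {
  data.frame(dribbling     = rnorm(n, dribbling, 5),
             interceptions = rnorm(n, interceptions, 5),
             gk_diving     = rnorm(n, gk_diving, 5))
}
players <- rbind(make_group(30, 78, 48, 12),   # attacker-like
                 make_group(30, 57, 83, 12),   # defender-like
                 make_group(30, 16, 24, 80))   # goalkeeper-like
truth <- rep(c("Attacker", "Defender", "Goalkeeper"), each = 30)

# Scale the attributes, then cluster into three groups
km <- kmeans(scale(players), centers = 3, nstart = 20)

# Cross-tabulate the known groups against the discovered clusters
tab <- table(truth, cluster = km$cluster)
tab
```

Cluster numbers are arbitrary, so each cluster still has to be named by inspecting its centers, for example by checking which one has the highest gk_diving.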

Preparing the training dataset:

# Processing data
role2016 = PlayerTable
role2016$date<- substr(role2016$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2016, because the data is very big
role2016 = role2016[role2016$date == "2016", ]

# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
role2016$defensive_work_rate[role2016$defensive_work_rate == "1" | role2016$defensive_work_rate == "2" | role2016$defensive_work_rate == "3"] <- "low"
role2016$defensive_work_rate[role2016$defensive_work_rate == "4" | role2016$defensive_work_rate == "5" | role2016$defensive_work_rate == "6"] <- "medium"
role2016$defensive_work_rate[role2016$defensive_work_rate == "7" | role2016$defensive_work_rate == "8" | role2016$defensive_work_rate == "9"] <- "high"

# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
role2016$attacking_work_rate[role2016$attacking_work_rate == "1" | role2016$attacking_work_rate == "2" | role2016$attacking_work_rate == "3"] <- "low"
role2016$attacking_work_rate[role2016$attacking_work_rate == "4" | role2016$attacking_work_rate == "5" | role2016$attacking_work_rate == "6"] <- "medium"
role2016$attacking_work_rate[role2016$attacking_work_rate == "7" | role2016$attacking_work_rate == "8" | role2016$attacking_work_rate == "9"] <- "high"

# Change from character to factor
role2016$preferred_foot <- as.factor(role2016$preferred_foot)
role2016$attacking_work_rate <- as.factor(role2016$attacking_work_rate)
role2016$defensive_work_rate <- as.factor(role2016$defensive_work_rate)

# Add a role column filled with random placeholder values (overwritten below for the hand-labelled players)
set.seed(2313)
roles <- c("Attacker", "Defender", "Goalkeeper")
role2016$role <- sample(roles, size = nrow(role2016), replace = TRUE)

# Manually selecting some players and give them the correct role:
  # Goalkeepers
p1 = role2016[role2016$player_api_id == 182917, ]
p1["role"] <- "Goalkeeper"
p2 = role2016[role2016$player_api_id == 30717, ]
p2["role"] <- "Goalkeeper"
p3 = role2016[role2016$player_api_id == 27299, ]
p3["role"] <- "Goalkeeper"
p4 = role2016[role2016$player_api_id == 30859, ]
p4["role"] <- "Goalkeeper"
p5 = role2016[role2016$player_api_id == 51949, ]
p5["role"] <- "Goalkeeper"

 # Attackers
p6 = role2016[role2016$player_api_id == 169200, ]
p6["role"] <- "Attacker"
p7 = role2016[role2016$player_api_id == 23354, ]
p7["role"] <- "Attacker"
p8 = role2016[role2016$player_api_id == 286119, ]
p8["role"] <- "Attacker"
p9 = role2016[role2016$player_api_id == 30822, ]
p9["role"] <- "Attacker"
p10 = role2016[role2016$player_api_id == 30853, ]
p10["role"] <- "Attacker"


  # Defenders
p11 = role2016[role2016$player_api_id == 30865, ]
p11["role"] <- "Defender"
p12 = role2016[role2016$player_api_id == 150739, ]
p12["role"] <- "Defender"
p13 = role2016[role2016$player_api_id == 30962, ]
p13["role"] <- "Defender"
p14 = role2016[role2016$player_api_id == 186137, ]
p14["role"] <- "Defender"
p15 = role2016[role2016$player_api_id == 56678, ]
p15["role"] <- "Defender"

  # Add some more because of singularity
p16 = role2016[role2016$player_api_id == 184554, ]
p16["role"] <- "Goalkeeper"
p17 = role2016[role2016$player_api_id == 42422, ]
p17["role"] <- "Goalkeeper"
p18 = role2016[role2016$player_api_id == 26295, ]
p18["role"] <- "Goalkeeper"
p19 = role2016[role2016$player_api_id == 177126, ]
p19["role"] <- "Goalkeeper"
p20 = role2016[role2016$player_api_id == 40604, ]
p20["role"] <- "Goalkeeper"

p21 = role2016[role2016$player_api_id == 363333, ]
p21["role"] <- "Attacker"
p22 = role2016[role2016$player_api_id == 150565, ]
p22["role"] <- "Attacker"
p23 = role2016[role2016$player_api_id == 194165, ]
p23["role"] <- "Attacker"
p24 = role2016[role2016$player_api_id == 184536, ]
p24["role"] <- "Attacker"
p25 = role2016[role2016$player_api_id == 107417, ]
p25["role"] <- "Attacker"

p26 = role2016[role2016$player_api_id == 474589, ]
p26["role"] <- "Defender"
p27 = role2016[role2016$player_api_id == 282674, ]
p27["role"] <- "Defender"
p28 = role2016[role2016$player_api_id == 37762, ]
p28["role"] <- "Defender"
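# NOTE: player 56678 is already labelled above as p15, so these rows appear twice in the training set
p29 = role2016[role2016$player_api_id == 56678, ]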
p29["role"] <- "Defender"
p30 = role2016[role2016$player_api_id == 574200, ]
p30["role"] <- "Defender"

# Creating the labelled training set:
newrole2016 = rbind(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15,
                    p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30)
# Change from character to factor
newrole2016$role <- as.factor(newrole2016$role)
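
The thirty separate p1 to p30 variables could also be replaced by one lookup table joined to the data. The sketch below uses a toy stand-in for role2016 (the dribbling values and the unlabelled ID 99999 are assumptions); the three labelled IDs and roles are taken from the manual assignments above.

```r
# Toy stand-in for role2016 (assumption: the real table has many more columns)
role_toy <- data.frame(
  player_api_id = c(182917, 182917, 169200, 30865, 99999),
  dribbling     = c(15, 16, 80, 55, 60)
)

# Hand-made labels: one row per player
labels <- data.frame(
  player_api_id = c(182917, 169200, 30865),
  role          = c("Goalkeeper", "Attacker", "Defender")
)

# A lookup table makes accidental duplicate IDs easy to detect
stopifnot(!any(duplicated(labels$player_api_id)))

# Inner join: keeps every row of a labelled player, drops unlabelled ones
labelled <- merge(role_toy, labels, by = "player_api_id")
labelled$role <- factor(labelled$role)
labelled
```

Adding a player then only means adding one row to the lookup table.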

Let’s plot some graphs:

ggplot(newrole2016,aes(y=role,x=dribbling))+geom_point()+geom_smooth(method="glm")

## `geom_smooth()` using formula 'y ~ x'

ggplot(newrole2016,aes(y=role,x=gk_diving))+geom_point()+geom_smooth(method="glm")

## `geom_smooth()` using formula 'y ~ x'

ggplot(newrole2016,aes(y=role,x=interceptions))+geom_point()+geom_smooth(method="glm")

## `geom_smooth()` using formula 'y ~ x'

We can draw some conclusions from these graphs on the limited training set:

Dribbling has a strong relationship with attackers, a moderate one with defenders, and little effect for goalkeepers
Interceptions has a strong relationship with defenders, a weak one with attackers, and little effect for goalkeepers
gk_diving has a strong relationship with goalkeepers and little effect for attackers and defenders
-> This matches common sense.

Let’s train on this new training set using logistic regression:

# glm
fit2 = glm(newrole2016$role~., data=newrole2016[ , -which(names(newrole2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "overall_rating", "potential"))], family="binomial")
summary(fit2)

## 
## Call:
## glm(formula = newrole2016$role ~ ., family = "binomial", data = newrole2016[, 
##     -which(names(newrole2016) %in% c("player_api_id", "player_fifa_api_id", 
##         "player_name", "birthday", "date", "overall_rating", 
##         "potential"))])
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.592e-06  -2.381e-06   2.042e-06   2.409e-06   3.344e-06  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -5.291e+02  3.552e+07       0        1
## height                     1.515e+00  1.285e+05       0        1
## weight                     5.371e-01  5.886e+04       0        1
## preferred_footright       -1.526e+01  1.201e+06       0        1
## attacking_work_ratelow    -1.664e+01  2.470e+06       0        1
## attacking_work_ratemedium -1.727e+00  1.753e+06       0        1
## defensive_work_ratelow    -1.986e+01  1.977e+06       0        1
## defensive_work_ratemedium -1.926e+01  1.924e+06       0        1
## crossing                   9.267e-01  7.467e+04       0        1
## finishing                 -8.183e-01  7.152e+04       0        1
## heading_accuracy           8.505e-01  8.479e+04       0        1
## short_passing             -1.209e+00  1.261e+05       0        1
## volleys                   -9.926e-02  2.838e+04       0        1
## dribbling                 -6.849e-01  1.079e+05       0        1
## curve                      5.234e-01  8.391e+04       0        1
## free_kick_accuracy        -8.180e-01  1.127e+05       0        1
## long_passing               1.321e+00  1.122e+05       0        1
## ball_control               4.976e-01  1.008e+05       0        1
## acceleration              -5.071e-01  7.395e+04       0        1
## sprint_speed               8.993e-01  8.126e+04       0        1
## agility                    5.360e-01  5.743e+04       0        1
## reactions                  8.214e-02  4.050e+04       0        1
## balance                    8.314e-01  7.142e+04       0        1
## shot_power                -7.594e-01  5.547e+04       0        1
## jumping                    1.053e+00  5.277e+04       0        1
## stamina                   -2.700e-01  5.265e+04       0        1
## strength                   1.381e-01  6.040e+04       0        1
## long_shots                 9.955e-01  1.378e+05       0        1
## aggression                -6.167e-02  4.377e+04       0        1
## interceptions             -4.286e-01  5.125e+04       0        1
## positioning                2.381e-02  5.971e+04       0        1
## vision                    -1.554e-01  4.146e+04       0        1
## penalties                 -5.940e-01  5.818e+04       0        1
## marking                    9.262e-02  8.587e+04       0        1
## standing_tackle            1.752e+00  1.181e+05       0        1
## sliding_tackle            -1.527e+00  6.217e+04       0        1
## gk_diving                  1.287e+00  1.438e+05       0        1
## gk_handling               -1.860e-01  1.583e+05       0        1
## gk_kicking                -1.470e-01  7.080e+04       0        1
## gk_positioning             9.417e-03  1.195e+05       0        1
## gk_reflexes               -4.414e-01  1.809e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.0364e+02  on 74  degrees of freedom
## Residual deviance: 4.3892e-10  on 34  degrees of freedom
## AIC: 82
## 
## Number of Fisher Scoring iterations: 25

—————–## UPDATE ## ——————-

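Comments on the general multiple logistic model:

The deviance residuals are symmetric, with similar (1Q, 3Q) and (Min, Max) and a Median close to 0. They are all on the order of 1e-06, though, which means the model reproduces the training labels almost exactly rather than fitting well in a useful sense.
The z value and Pr(>|z|) (p-value) are 0 and 1 respectively for every coefficient, so no predictor is statistically significant. The enormous Std. Error values, the residual deviance of 4.3892e-10 and the 25 Fisher Scoring iterations all point to the same cause: with 41 coefficients and only 75 rows, the classes are perfectly separated in the training data and the estimates diverge.
family="binomial" also handles only two outcomes. With a three-level role factor, glm treats the first level (Attacker) as one class and merges Defender and Goalkeeper into the other.
The data is not large or varied enough. Testing with an lda model runs into a similar problem, with the error: Error in lda.default(x, grouping, …) : variable 6 appears to be constant within groups. A model that tries to draw a linear or non-linear boundary through so little data does not work well.
-> This is a major limitation of the logistic model when the labeled training set is small and lacks variety. Although the graphs show good relationships for some factors, we will not use this model further.
In the next section, we switch to Naive Bayes. Naive Bayes works with per-class probabilities rather than fitting a boundary, so let’s hope it works out.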
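
The pattern in the summary above (z value 0, p-value 1, huge standard errors, near-zero residual deviance) is what a logistic regression produces when the training data can be split perfectly. A minimal sketch on ten made-up points (an assumption for illustration, unrelated to the player data) reproduces it:

```r
# Ten toy points that one threshold separates perfectly
x <- 1:10
y <- as.integer(x > 5)

# glm warns that fitted probabilities are numerically 0 or 1; silence that here
fit <- suppressWarnings(glm(y ~ x, family = binomial))

coef(summary(fit))   # very large Std. Error, z near 0, p near 1
deviance(fit)        # residual deviance close to 0
fit$iter             # number of Fisher scoring iterations used
```

The predictor here obviously determines the outcome, yet its p-value is close to 1. In that situation the Wald tests in summary() say nothing about whether a predictor matters.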

———————###————————-

Let’s train on this new train set using Naive Bayes probability method:

Let’s first plot some graphs. We will not call pairs(data[,]) on the whole table because there are too many predictors. Instead we focus on a few relationships that make sense, and test them with the model afterwards:

Checking plot: role-goalkeeper vs gk_diving+gk_handling+gk_kicking+gk_positioning+gk_reflexes:

pairs(~role+gk_diving+gk_handling+gk_kicking+gk_positioning+gk_reflexes, data=newrole2016)

Checking plot: role-attacker vs attacking_work_rate +sprint_speed+volleys+dribbling+shot_power:

pairs(~role+attacking_work_rate +sprint_speed+volleys+dribbling+shot_power, data=newrole2016)

Checking plot: role-defender vs defensive_work_rate+ aggression+ standing_tackle+ sliding_tackle+ interceptions:

pairs(~role+defensive_work_rate +aggression+standing_tackle+sliding_tackle+interceptions, data=newrole2016)

The main idea of these plots is to check how each group of attributes relates to the three roles: attacker, defender and goalkeeper. The first row of each plot shows the relationship between role and the other factors. From these rows, as expected:

The higher gk_diving, gk_handling, gk_kicking, gk_positioning and gk_reflexes are, the more likely the player is a goalkeeper, and the less likely a defender or attacker.
The higher attacking_work_rate, sprint_speed, volleys, dribbling and shot_power are, the more likely the player is an attacker, and the less likely a goalkeeper. However, shot_power and sprint_speed still contribute somewhat to the probability of being a defender.
The higher defensive_work_rate, aggression, standing_tackle, sliding_tackle and interceptions are, the more likely the player is a defender, and the less likely a goalkeeper. However, aggression still contributes somewhat to the probability of being an attacker.
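
For intuition on how a Gaussian Naive Bayes classifier turns such relationships into a role probability, here is a hand-rolled sketch restricted to two attributes. The priors, means and standard deviations are copied from the fit3 output printed later in this section; the function posterior_role and the two example players are assumptions made up for illustration.

```r
# Priors and per-role Gaussian parameters, copied from the fit3 output
prior     <- c(Attacker = 0.4666667, Defender = 0.2533333, Goalkeeper = 0.2800000)
drib_mean <- c(78.571429, 57.263158, 16.047619)
drib_sd   <- c(4.641790, 10.434143, 3.485343)
int_mean  <- c(47.914286, 82.789474, 23.714286)
int_sd    <- c(12.804805, 2.070398, 4.417498)

# Hypothetical helper: posterior over roles from dribbling and interceptions
posterior_role <- function(dribbling, interceptions) {
  # log prior plus the log density of each attribute, treated as independent
  log_post <- log(prior) +
    dnorm(dribbling, drib_mean, drib_sd, log = TRUE) +
    dnorm(interceptions, int_mean, int_sd, log = TRUE)
  post <- exp(log_post - max(log_post))  # subtract the max for numerical stability
  post / sum(post)                       # normalise so the probabilities sum to 1
}

post_a <- posterior_role(dribbling = 80, interceptions = 45)  # made-up player
post_d <- posterior_role(dribbling = 55, interceptions = 84)  # made-up player
round(rbind(post_a, post_d), 4)
```

The full model does the same with every attribute, multiplying one density (or one categorical probability) per column.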

Let’s quantify these relationships with a Naive Bayes probability model:

Fit Naive Bayes model:

# Fit Naive Bayes on all attributes except identifiers, dates and rating columns
fit3 <- naive_bayes(newrole2016$role~., data=newrole2016[ , -which(names(newrole2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "overall_rating", "potential"))])
print(fit3[8])

## $call
## naive_bayes.formula(formula = newrole2016$role ~ ., data = newrole2016[, 
##     -which(names(newrole2016) %in% c("player_api_id", "player_fifa_api_id", 
##         "player_name", "birthday", "date", "overall_rating", 
##         "potential"))])

print(fit3[5])

## $prior
## 
##   Attacker   Defender Goalkeeper 
##  0.4666667  0.2533333  0.2800000

print(fit3[4])

## $tables
## 
## --------------------------------------------------------------------------------- 
##  ::: height (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## height   Attacker   Defender Goalkeeper
##   mean 182.952571 184.216842 190.862857
##   sd     5.560963   5.765323   3.140339
## 
## --------------------------------------------------------------------------------- 
##  ::: weight (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## weight   Attacker   Defender Goalkeeper
##   mean 164.857143 168.000000 190.190476
##   sd    10.541634  11.215069   9.516402
## 
## --------------------------------------------------------------------------------- 
##  ::: preferred_foot (Bernoulli) 
## --------------------------------------------------------------------------------- 
##               
## preferred_foot  Attacker  Defender Goalkeeper
##          left  0.0000000 0.1052632  0.1904762
##          right 1.0000000 0.8947368  0.8095238
## 
## --------------------------------------------------------------------------------- 
##  ::: attacking_work_rate (Categorical) 
## --------------------------------------------------------------------------------- 
##                    
## attacking_work_rate   Attacker   Defender Goalkeeper
##              high   0.74285714 0.52631579 0.00000000
##              low    0.00000000 0.05263158 0.00000000
##              medium 0.25714286 0.42105263 1.00000000
##              None   0.00000000 0.00000000 0.00000000
## 
## --------------------------------------------------------------------------------- 
##  ::: defensive_work_rate (Categorical) 
## --------------------------------------------------------------------------------- 
##                    
## defensive_work_rate  Attacker  Defender Goalkeeper
##              0      0.0000000 0.0000000  0.0000000
##              high   0.3142857 0.4210526  0.0000000
##              low    0.2285714 0.0000000  0.0000000
##              medium 0.4571429 0.5789474  1.0000000
## 
## --------------------------------------------------------------------------------- 
##  ::: crossing (Gaussian) 
## --------------------------------------------------------------------------------- 
##         
## crossing  Attacker  Defender Goalkeeper
##     mean 68.200000 56.105263  14.047619
##     sd    8.376438 15.530954   2.178903
## 
## --------------------------------------------------------------------------------- 
##  ::: finishing (Gaussian) 
## --------------------------------------------------------------------------------- 
##          
## finishing  Attacker  Defender Goalkeeper
##      mean 78.228571 42.684211  12.285714
##      sd    6.235976 11.855859   1.677583
## 
## --------------------------------------------------------------------------------- 
##  ::: heading_accuracy (Gaussian) 
## --------------------------------------------------------------------------------- 
##                 
## heading_accuracy  Attacker  Defender Goalkeeper
##             mean 68.171429 75.684211  14.904762
##             sd   13.443751  9.080214   5.812958
## 
## --------------------------------------------------------------------------------- 
##  ::: short_passing (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## short_passing  Attacker  Defender Goalkeeper
##          mean 73.400000 72.578947  33.714286
##          sd    6.804843  5.294817   6.357223
## 
## --------------------------------------------------------------------------------- 
##  ::: volleys (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## volleys  Attacker  Defender Goalkeeper
##    mean 72.257143 43.421053  13.238095
##    sd    9.450624 12.411087   2.119074
## 
## --------------------------------------------------------------------------------- 
##  ::: dribbling (Gaussian) 
## --------------------------------------------------------------------------------- 
##          
## dribbling  Attacker  Defender Goalkeeper
##      mean 78.571429 57.263158  16.047619
##      sd    4.641790 10.434143   3.485343
## 
## --------------------------------------------------------------------------------- 
##  ::: curve (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## curve   Attacker  Defender Goalkeeper
##   mean 68.942857 48.631579  14.666667
##   sd   10.778363 14.260320   3.812261
## 
## --------------------------------------------------------------------------------- 
##  ::: free_kick_accuracy (Gaussian) 
## --------------------------------------------------------------------------------- 
##                   
## free_kick_accuracy  Attacker  Defender Goalkeeper
##               mean 62.742857 46.578947  14.238095
##               sd   15.293872 11.051977   2.861901
## 
## --------------------------------------------------------------------------------- 
##  ::: long_passing (Gaussian) 
## --------------------------------------------------------------------------------- 
##             
## long_passing  Attacker  Defender Goalkeeper
##         mean 63.000000 63.473684  35.000000
##         sd   10.406446  6.744285   6.024948
## 
## --------------------------------------------------------------------------------- 
##  ::: ball_control (Gaussian) 
## --------------------------------------------------------------------------------- 
##             
## ball_control  Attacker  Defender Goalkeeper
##         mean 78.314286 66.631579  24.095238
##         sd    5.290232 10.683375   5.821553
## 
## --------------------------------------------------------------------------------- 
##  ::: acceleration (Gaussian) 
## --------------------------------------------------------------------------------- 
##             
## acceleration  Attacker  Defender Goalkeeper
##         mean 78.285714 72.157895  50.619048
##         sd   10.714398  9.239187   7.592603
## 
## --------------------------------------------------------------------------------- 
##  ::: sprint_speed (Gaussian) 
## --------------------------------------------------------------------------------- 
##             
## sprint_speed Attacker Defender Goalkeeper
##         mean 77.80000 75.57895   55.33333
##         sd   11.09266  8.82116    4.90238
## 
## --------------------------------------------------------------------------------- 
##  ::: agility (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## agility  Attacker  Defender Goalkeeper
##    mean 74.371429 63.894737  50.142857
##    sd    7.635389 13.755382  10.292161
## 
## --------------------------------------------------------------------------------- 
##  ::: reactions (Gaussian) 
## --------------------------------------------------------------------------------- 
##          
## reactions  Attacker  Defender Goalkeeper
##      mean 78.485714 75.947368  82.333333
##      sd    4.648665  5.522416   3.381321
## 
## --------------------------------------------------------------------------------- 
##  ::: balance (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## balance Attacker Defender Goalkeeper
##    mean 67.28571 62.31579   43.33333
##    sd   11.40839 10.88349    7.83156
## 
## --------------------------------------------------------------------------------- 
##  ::: shot_power (Gaussian) 
## --------------------------------------------------------------------------------- 
##           
## shot_power  Attacker  Defender Goalkeeper
##       mean 79.371429 65.052632  25.095238
##       sd    7.079738  8.714443   5.421298
## 
## --------------------------------------------------------------------------------- 
##  ::: jumping (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## jumping  Attacker  Defender Goalkeeper
##    mean 67.857143 83.578947  75.523810
##    sd    7.643309  4.845568   7.318600
## 
## --------------------------------------------------------------------------------- 
##  ::: stamina (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## stamina  Attacker  Defender Goalkeeper
##    mean 78.285714 77.473684  38.142857
##    sd   11.200240  8.362405   6.513722
## 
## --------------------------------------------------------------------------------- 
##  ::: strength (Gaussian) 
## --------------------------------------------------------------------------------- 
##         
## strength  Attacker  Defender Goalkeeper
##     mean 72.285714 79.789474  71.142857
##     sd    8.570028  4.577468   9.773872
## 
## --------------------------------------------------------------------------------- 
##  ::: long_shots (Gaussian) 
## --------------------------------------------------------------------------------- 
##           
## long_shots  Attacker  Defender Goalkeeper
##       mean 74.857143 49.263158  13.666667
##       sd    5.531423 10.120618   3.214550
## 
## --------------------------------------------------------------------------------- 
##  ::: aggression (Gaussian) 
## --------------------------------------------------------------------------------- 
##           
## aggression  Attacker  Defender Goalkeeper
##       mean 70.742857 84.263158  34.000000
##       sd   15.483089  3.229071   7.930952
## 
## --------------------------------------------------------------------------------- 
##  ::: interceptions (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## interceptions  Attacker  Defender Goalkeeper
##          mean 47.914286 82.789474  23.714286
##          sd   12.804805  2.070398   4.417498
## 
## --------------------------------------------------------------------------------- 
##  ::: positioning (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## positioning  Attacker  Defender Goalkeeper
##        mean 80.657143 42.631579  12.857143
##        sd    3.709629 15.413596   2.007130
## 
## --------------------------------------------------------------------------------- 
##  ::: vision (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## vision  Attacker  Defender Goalkeeper
##   mean 75.828571 46.736842  49.809524
##   sd    6.128457 14.007308  17.670934
## 
## --------------------------------------------------------------------------------- 
##  ::: penalties (Gaussian) 
## --------------------------------------------------------------------------------- 
##          
## penalties  Attacker  Defender Goalkeeper
##      mean 71.400000 49.315789  28.285714
##      sd   12.934678  8.838380   9.011897
## 
## --------------------------------------------------------------------------------- 
##  ::: marking (Gaussian) 
## --------------------------------------------------------------------------------- 
##        
## marking  Attacker  Defender Goalkeeper
##    mean 37.971429 82.368421  12.666667
##    sd   14.878895  2.586515   2.575526
## 
## --------------------------------------------------------------------------------- 
##  ::: standing_tackle (Gaussian) 
## --------------------------------------------------------------------------------- 
##                
## standing_tackle  Attacker  Defender Goalkeeper
##            mean 43.114286 84.000000  12.857143
##            sd   13.908887  1.855921   3.678121
## 
## --------------------------------------------------------------------------------- 
##  ::: sliding_tackle (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## sliding_tackle  Attacker  Defender Goalkeeper
##           mean 41.742857 83.368421  12.714286
##           sd   13.414967  3.515213   2.512824
## 
## --------------------------------------------------------------------------------- 
##  ::: gk_diving (Gaussian) 
## --------------------------------------------------------------------------------- 
##          
## gk_diving  Attacker  Defender Goalkeeper
##      mean 10.742857 10.842105  84.904762
##      sd    3.424332  2.853130   2.567192
## 
## --------------------------------------------------------------------------------- 
##  ::: gk_handling (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## gk_handling  Attacker  Defender Goalkeeper
##        mean 10.257143 10.842105  81.714286
##        sd    2.993550  3.095932   2.722919
## 
## --------------------------------------------------------------------------------- 
##  ::: gk_kicking (Gaussian) 
## --------------------------------------------------------------------------------- 
##           
## gk_kicking  Attacker  Defender Goalkeeper
##       mean  8.714286 10.157895  78.095238
##       sd    2.395724  3.287403   9.032745
## 
## --------------------------------------------------------------------------------- 
##  ::: gk_positioning (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## gk_positioning  Attacker  Defender Goalkeeper
##           mean 11.000000  9.631579  83.904762
##           sd    3.360672  3.336840   4.241518
## 
## --------------------------------------------------------------------------------- 
##  ::: gk_reflexes (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## gk_reflexes  Attacker  Defender Goalkeeper
##        mean 10.600000 11.263158  85.523810
##        sd    2.557572  3.429320   2.600366
## 
## ---------------------------------------------------------------------------------

Interpreting the Naive Bayes model result:

Looking at the prior table, the training set is 46.66% attackers, 25.33% defenders and 28% goalkeepers. These priors are not very informative on their own, but they are plausible. Now let’s take a look at the conditional tables:
If we look at the categorical predictors:
- preferred_foot: a few left-footed players are goalkeepers or defenders; most players are right-footed and spread fairly evenly across roles. Some probabilities look strange here because the training data is limited.
- attacking_work_rate / defensive_work_rate: we see the same strange probabilities. However, notice that a player with a high attacking_work_rate has a good chance of being an attacker, and a player with a high defensive_work_rate has a good chance of being a defender.
If we look at the numeric predictors (besides height and weight, the values are in the range 0-100):
- Goalkeepers are usually taller and heavier than outfield players. Height matters for a goalkeeper.
- There is a good chance a player is a goalkeeper if they have these high stats (>= 75): reactions, strength, jumping, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes. (Strength is borderline: the goalkeeper mean is about 71.)
- There is a good chance a player is an attacker if they have these high stats (>= 75): finishing, dribbling, ball_control, acceleration, sprint_speed, agility, reactions, shot_power, stamina, positioning, vision.
- There is a good chance a player is a defender if they have these high stats (>= 75): heading_accuracy, sprint_speed, reactions, jumping, stamina, strength, aggression, interceptions, marking, standing_tackle, sliding_tackle.
- One predictor, reactions, shows up in all three roles. Either every role has high reactions, or the training data does not have enough variety, so reactions does little to separate the classes.
- -> The predictors give a good chance of predicting the correct role. We can use this model to predict on some test data.
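To make the tables above concrete, here is a minimal sketch of how a Gaussian Naive Bayes posterior can be computed by hand from them. The means, standard deviations and priors are copied from the output above; the player values (gk_diving = 10, marking = 85, positioning = 41) are a hypothetical example, and only three of the predictors are used, so the result will not match what fit3 would return.

```r
# Priors in the order Attacker, Defender, Goalkeeper (from the prior table)
prior <- c(Attacker = 0.4666, Defender = 0.2533, Goalkeeper = 0.28)

# Per-class Gaussian parameters copied from the model output above
params <- list(
  gk_diving   = list(mean = c(10.742857, 10.842105, 84.904762),
                     sd   = c(3.424332, 2.853130, 2.567192)),
  marking     = list(mean = c(37.971429, 82.368421, 12.666667),
                     sd   = c(14.878895, 2.586515, 2.575526)),
  positioning = list(mean = c(80.657143, 42.631579, 12.857143),
                     sd   = c(3.709629, 15.413596, 2.007130))
)

# Hypothetical player (assumption, not a row from the data set)
player <- c(gk_diving = 10, marking = 85, positioning = 41)

# Naive Bayes: log prior + sum of per-feature log densities
log_post <- log(prior)
for (f in names(player)) {
  log_post <- log_post +
    dnorm(player[[f]], mean = params[[f]]$mean, sd = params[[f]]$sd, log = TRUE)
}

# Normalise in a numerically stable way
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)
print(round(posterior, 4))
```

Working in log space avoids underflow when a value is far from a class mean, which is also why the predicted probabilities in this kind of model tend to be almost exactly 0 or 1.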

Predictions:

# newrole2016 is new training set, we can use test2015 as test set
naives_bayes.pred = predict(fit3, test2015, type = 'prob')

Since we don’t have true role labels to compare against, we attach the predicted probabilities to the original data set and make observations:

Building the result table

# Adding Naive Bayes prediction to test2015 test set
resultNB = test2015
resultNB = cbind(resultNB, data.frame(naives_bayes.pred))
# -> resultNB will hold all the role predictions for test2015 test-set
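A possible follow-up step, sketched on a toy data frame rather than on resultNB itself: turn the three probability columns into a single predicted label per row. The column name predicted_role and the toy rows are assumptions made for illustration.

```r
# Toy stand-in for resultNB: only the probability columns matter here
toy <- data.frame(
  player_name = c("A", "B", "C"),
  Attacker    = c(1, 0, 1.4e-46),
  Defender    = c(1.4e-22, 0, 1),
  Goalkeeper  = c(0, 1, 0)
)

prob_cols <- c("Attacker", "Defender", "Goalkeeper")

# max.col() returns, for each row, the index of the largest probability
best <- max.col(as.matrix(toy[, prob_cols]), ties.method = "first")
toy$predicted_role <- prob_cols[best]
print(toy)
```

With a label column in place, the predictions can be tabulated or filtered instead of reading the raw probabilities.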

Select 5 well-known players as a sanity check:

nb1 = resultNB[resultNB$player_api_id == 30893, ]
nb2 = resultNB[resultNB$player_api_id == 30981, ]
nb3 = resultNB[resultNB$player_api_id == 30717, ]
nb4 = resultNB[resultNB$player_api_id == 248453, ]
nb5 = resultNB[resultNB$player_api_id == 39027, ]

resultNB5 = rbind(nb1, nb2, nb3, nb4, nb5)
print(resultNB5)

##        player_api_id player_fifa_api_id       player_name            birthday
## 98150          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98157          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98158          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98160          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98162          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 99708          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99711          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99713          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99717          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99721          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99722          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 99725          30981             158023      Lionel Messi 1987-06-24 00:00:00
## 96315          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96322          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96325          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 67943         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 67944         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 67945         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 67946         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 67947         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 67948         248453             195864        Paul Pogba 1993-03-15 00:00:00
## 130640         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130641         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130642         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130643         39027             139720   Vincent Kompany 1986-04-10 00:00:00
##        height weight date overall_rating potential preferred_foot
## 98150  185.42    176 2015             92        92          right
## 98157  185.42    176 2015             93        93          right
## 98158  185.42    176 2015             92        92          right
## 98160  185.42    176 2015             93        93          right
## 98162  185.42    176 2015             93        93          right
## 99708  170.18    159 2015             93        93           left
## 99711  170.18    159 2015             94        95           left
## 99713  170.18    159 2015             93        95           left
## 99717  170.18    159 2015             93        95           left
## 99721  170.18    159 2015             93        95           left
## 99722  170.18    159 2015             94        94           left
## 99725  170.18    159 2015             94        94           left
## 96315  193.04    201 2015             83        83          right
## 96322  193.04    201 2015             84        84          right
## 96325  193.04    201 2015             84        84          right
## 67943  190.50    185 2015             86        92          right
## 67944  190.50    185 2015             86        92          right
## 67945  190.50    185 2015             86        92          right
## 67946  190.50    185 2015             84        91          right
## 67947  190.50    185 2015             84        91          right
## 67948  190.50    185 2015             83        90          right
## 130640 193.04    187 2015             85        85          right
## 130641 193.04    187 2015             86        86          right
## 130642 193.04    187 2015             86        86          right
## 130643 193.04    187 2015             86        86          right
##        attacking_work_rate defensive_work_rate crossing finishing
## 98150                 high                 low       83        95
## 98157                 high                 low       82        95
## 98158                 high                 low       83        95
## 98160                 high                 low       82        95
## 98162                 high                 low       82        95
## 99708               medium                 low       84        94
## 99711               medium                 low       80        93
## 99713               medium                 low       84        94
## 99717               medium                 low       84        94
## 99721               medium                 low       84        94
## 99722               medium                 low       80        93
## 99725               medium                 low       80        93
## 96315               medium              medium       25        25
## 96322               medium              medium       13        15
## 96325               medium              medium       13        15
## 67943                 high              medium       76        70
## 67944                 high              medium       76        70
## 67945                 high              medium       76        70
## 67946                 high              medium       73        70
## 67947                 high              medium       73        70
## 67948                 high              medium       73        70
## 130640              medium              medium       61        45
## 130641              medium              medium       61        45
## 130642              medium              medium       61        45
## 130643              medium              medium       61        45
##        heading_accuracy short_passing volleys dribbling curve
## 98150                86            82      87        93    88
## 98157                86            81      87        93    88
## 98158                86            82      87        93    88
## 98160                86            81      87        93    88
## 98162                86            81      87        93    88
## 99708                71            89      85        96    89
## 99711                71            88      85        96    89
## 99713                71            89      85        96    89
## 99717                71            89      85        96    89
## 99721                71            89      85        96    89
## 99722                71            88      85        96    89
## 99725                71            88      85        96    89
## 96315                32            37      25        20    25
## 96322                13            37      17        25    20
## 96325                13            37      17        25    20
## 67943                72            85      84        88    82
## 67944                72            85      84        88    82
## 67945                72            85      84        88    82
## 67946                84            85      84        88    83
## 67947                84            85      78        88    78
## 67948                84            85      78        88    78
## 130640               84            65      46        64    61
## 130641               84            80      46        64    61
## 130642               84            80      46        64    61
## 130643               84            80      46        64    61
##        free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 98150                  79           72           92           91           94
## 98157                  77           72           91           91           93
## 98158                  79           72           92           91           94
## 98160                  77           72           91           91           93
## 98162                  77           72           91           91           93
## 99708                  90           76           96           96           90
## 99711                  90           79           96           95           90
## 99713                  90           76           96           96           90
## 99717                  90           76           96           96           90
## 99721                  90           76           96           96           90
## 99722                  90           79           96           95           90
## 99725                  90           79           96           95           90
## 96315                  25           31           28           45           33
## 96322                  13           35           23           49           43
## 96325                  13           35           23           49           43
## 67943                  80           81           89           75           79
## 67944                  80           81           89           75           79
## 67945                  80           81           89           75           79
## 67946                  65           80           90           75           77
## 67947                  65           80           90           75           77
## 67948                  65           80           88           75           77
## 130640                 52           66           68           68           70
## 130641                 52           75           74           68           73
## 130642                 52           75           74           68           73
## 130643                 52           75           74           68           77
##        agility reactions balance shot_power jumping stamina strength long_shots
## 98150       93        90      63         94      94      89       79         93
## 98157       90        92      62         94      94      87       79         93
## 98158       93        90      63         94      94      89       79         93
## 98160       90        92      62         94      94      90       79         93
## 98162       90        92      62         94      94      87       79         93
## 99708       94        94      95         80      73      77       60         88
## 99711       92        92      95         80      68      76       59         88
## 99713       94        94      95         80      73      77       60         88
## 99717       94        94      95         80      73      77       60         88
## 99721       94        94      95         80      73      77       60         88
## 99722       92        92      95         80      68      75       59         88
## 99725       92        92      95         80      68      75       59         88
## 96315       51        74      49         47      75      39       60         25
## 96322       55        76      49         20      75      39       63         13
## 96325       55        76      49         20      75      39       63         13
## 67943       75        86      61         91      85      89       91         91
## 67944       75        86      61         91      85      87       91         91
## 67945       75        86      61         91      85      87       91         91
## 67946       79        79      60         89      87      87       90         91
## 67947       79        78      60         89      87      87       90         88
## 67948       79        78      60         89      87      87       90         88
## 130640      63        84      42         76      73      70       88         55
## 130641      63        84      42         76      73      70       88         55
## 130642      63        84      42         76      73      70       88         55
## 130643      63        84      42         76      73      70       88         67
##        aggression interceptions positioning vision penalties marking
## 98150          63            24          91     81        85      22
## 98157          62            29          93     81        85      22
## 98158          63            24          91     81        85      22
## 98160          62            29          93     81        85      22
## 98162          62            29          93     81        85      22
## 99708          48            22          92     90        74      25
## 99711          48            22          90     90        74      13
## 99713          48            22          92     90        76      25
## 99717          48            22          92     90        76      25
## 99721          48            22          92     90        74      25
## 99722          48            22          90     90        74      13
## 99725          48            22          90     90        74      13
## 96315          34            29          25     25        35      25
## 96322          38            20          14     50        22      10
## 96325          38            20          14     50        22      10
## 67943          80            71          83     86        76      71
## 67944          80            71          83     86        76      71
## 67945          80            71          83     86        76      71
## 67946          82            69          79     85        67      62
## 67947          81            69          79     85        67      62
## 67948          81            69          79     85        67      62
## 130640         81            78          41     59        63      86
## 130641         81            86          41     59        63      84
## 130642         81            87          41     59        63      85
## 130643         78            87          41     59        63      85
##        standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 98150               31             23         7          11         15
## 98157               31             23         7          11         15
## 98158               31             23         7          11         15
## 98160               31             23         7          11         15
## 98162               31             23         7          11         15
## 99708               21             20         6          11         15
## 99711               23             21         6          11         15
## 99713               21             20         6          11         15
## 99717               21             20         6          11         15
## 99721               21             20         6          11         15
## 99722               23             21         6          11         15
## 99725               23             21         6          11         15
## 96315               25             25        84          77         62
## 96322               11             11        85          79         71
## 96325               11             11        85          79         71
## 67943               77             83         5           6          2
## 67944               77             83         5           6          2
## 67945               77             83         5           6          2
## 67946               78             81         5           6          2
## 67947               78             81         5           6          2
## 67948               78             81         5           6          2
## 130640              90             87        10           9          5
## 130641              88             85        10           9          5
## 130642              90             85        10           9          5
## 130643              90             85        10           9          5
##        gk_positioning gk_reflexes     Attacker     Defender Goalkeeper
## 98150              14          11 1.000000e+00 0.000000e+00          0
## 98157              14          11 1.000000e+00 0.000000e+00          0
## 98158              14          11 1.000000e+00 0.000000e+00          0
## 98160              14          11 1.000000e+00 0.000000e+00          0
## 98162              14          11 1.000000e+00 0.000000e+00          0
## 99708              14           8 1.000000e+00 0.000000e+00          0
## 99711              14           8 1.000000e+00 0.000000e+00          0
## 99713              14           8 1.000000e+00 0.000000e+00          0
## 99717              14           8 1.000000e+00 0.000000e+00          0
## 99721              14           8 1.000000e+00 0.000000e+00          0
## 99722              14           8 1.000000e+00 0.000000e+00          0
## 99725              14           8 1.000000e+00 0.000000e+00          0
## 96315              90          80 0.000000e+00 0.000000e+00          1
## 96322              89          83 0.000000e+00 0.000000e+00          1
## 96325              89          83 0.000000e+00 0.000000e+00          1
## 67943               4           3 1.000000e+00 1.406998e-22          0
## 67944               4           3 1.000000e+00 1.628461e-22          0
## 67945               4           3 1.000000e+00 1.628461e-22          0
## 67946               4           3 1.000000e+00 6.818945e-31          0
## 67947               4           3 1.000000e+00 1.344006e-30          0
## 67948               4           3 1.000000e+00 9.268056e-31          0
## 130640              8           6 1.390696e-46 1.000000e+00          0
## 130641              8           6 6.580123e-48 1.000000e+00          0
## 130642              8           6 1.562781e-46 1.000000e+00          0
## 130643              8           6 6.618217e-43 1.000000e+00          0

Above is the prediction table for players in 2015, using the fit3 Naive Bayes model trained on some players from 2016. Because the test set has no role labels, we can only draw conclusions from observation. (Note: some players have multiple rows; these come from different rating updates during the year.)

Cristiano Ronaldo, Lionel Messi and Paul Pogba are classified as attackers. This is right for Ronaldo and Messi; Pogba is a midfielder, so attacker is the closest of the three roles. They all have good attacking stats (shooting, passing, speed, …).
Vincent Kompany is classified as a defender, which is true. He has good defensive stats (standing_tackle, sliding_tackle, marking, interceptions, strength, …).
Gianluigi Buffon is classified as a goalkeeper, which is true. He has high values for all goalkeeping stats (gk_xxx, …).

-> Conclusion: Multiple logistic regression doesn’t seem to perform well on a classification task with a limited dataset. Naive Bayes, a simple yet effective probability-based model, seems to work really well with limited training data and more than two classes. The Naive Bayes model we used was trained on only a few dozen samples, yet it assigns the expected role to every player we spot-checked in the 31779-row test2015 dataset. Without labels we cannot report an accuracy for the whole set.
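Because each player appears several times, one quick check is whether every snapshot of a player receives the same predicted role. The sketch below uses a toy data frame with a hypothetical predicted_role column (not a column of resultNB as built above) to show the idea.

```r
# Toy data: several rating snapshots per player, one predicted label each
toy <- data.frame(
  player_name    = c("P1", "P1", "P1", "P2", "P2"),
  predicted_role = c("Attacker", "Attacker", "Attacker", "Defender", "Attacker")
)

# Number of distinct predicted roles per player
roles_per_player <- tapply(toy$predicted_role, toy$player_name,
                           function(x) length(unique(x)))

# Players whose snapshots disagree with each other
inconsistent <- names(roles_per_player)[roles_per_player > 1]
print(inconsistent)
```

In the table above the five selected players are consistent across all of their rows, so this check would return an empty result for them.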

Exploring Decision tree

Question: Another classification technique available in R is the decision tree. We’d like to answer a question that suits a tree. Given the data above and some of the analysis in the non-linear section, let’s answer this question: how can you tell if a player is an attacking player or not? The strategy is to build a decision tree.

Preparing the test set and training set

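## TRAIN
dt2015 = test2015

# Collapse attacking_work_rate into two classes: "high" stays, everything else becomes "low"
dt2015$attacking_work_rate[dt2015$attacking_work_rate == "low" | dt2015$attacking_work_rate == "medium" | dt2015$attacking_work_rate == "None"] <- "low"
# Same recode for defensive_work_rate (note: this column is not kept by select() below)
dt2015$defensive_work_rate[dt2015$defensive_work_rate == "low" | dt2015$defensive_work_rate == "medium" | dt2015$defensive_work_rate == "0"] <- "low"

# Keep the response (attacking_work_rate) and the candidate predictors: dt2015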
dt2015 <- select(dt2015, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)


## TEST
dt2016 = test2016

dt2016$attacking_work_rate[dt2016$attacking_work_rate == "low" | dt2016$attacking_work_rate == "medium" | dt2016$attacking_work_rate == "None"] <- "low"
dt2016$defensive_work_rate[dt2016$defensive_work_rate == "low" | dt2016$defensive_work_rate == "medium" | dt2016$defensive_work_rate == "0"] <- "low"

# Keep the same columns as the training set: dt2016
dt2016 <- select(dt2016, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
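One side effect of the recode above is worth illustrating. Assuming attacking_work_rate is stored as a factor (the four-level confusion matrix further down suggests it is), overwriting values with "low" leaves the medium and None levels in place with zero rows. The sketch below shows this on a small made-up vector and how droplevels() would remove the unused levels; it is not part of the original pipeline.

```r
# Made-up factor with the same four levels as attacking_work_rate
work_rate <- factor(c("high", "low", "medium", "None", "high"),
                    levels = c("high", "low", "medium", "None"))

# Same idea as the recode above: everything except "high" becomes "low"
work_rate[work_rate %in% c("medium", "None")] <- "low"
print(table(work_rate))   # medium and None are still listed, with count 0

# Dropping unused levels leaves a clean two-class factor
work_rate2 <- droplevels(work_rate)
print(levels(work_rate2))
```

Leaving the empty levels in does not change the fitted tree, but it does produce empty rows and NA statistics in later summaries.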

Build tree

tree <- rpart(attacking_work_rate~., data = dt2015)

Predictions

# predict against test set dt2016
tree.attacking_work_rate.pred <- predict(tree, dt2016, type='class')

Evaluating the model

confusionMatrix(tree.attacking_work_rate.pred, dt2016$attacking_work_rate)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high  low medium None
##     high   1590  861      0    0
##     low    2866 8767      0    0
##     medium    0    0      0    0
##     None      0    0      0    0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7354         
##                  95% CI : (0.728, 0.7426)
##     No Information Rate : 0.6836         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3042         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium Class: None
## Sensitivity               0.3568     0.9106            NA          NA
## Specificity               0.9106     0.3568             1           1
## Pos Pred Value            0.6487     0.7536            NA          NA
## Neg Pred Value            0.7536     0.6487            NA          NA
## Prevalence                0.3164     0.6836             0           0
## Detection Rate            0.1129     0.6225             0           0
## Detection Prevalence      0.1740     0.8260             0           0
## Balanced Accuracy         0.6337     0.6337            NA          NA

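Some comments on the confusion matrix:

The accuracy is 0.7354 (73.54%), only modestly above the no-information rate of 0.6836 that we would get by always predicting "low", and Kappa is 0.3042. This is only a decent model. The medium and None rows are empty because those labels were merged into low; they remain only as unused factor levels.
Looking at the prediction and reference table, with "high" as the positive class:
- There are 1590 correct predictions of an attacking player (true positives) and 8767 correct predictions of a non-attacking player (true negatives). There were 861 predictions of an attacking player where the player is in fact not attacking (false positives), which is fairly low. There were 2866 predictions of a non-attacking player where the player is in fact attacking (false negatives), which is high: the tree finds only 35.68% of the true attacking players (sensitivity 0.3568). The model is therefore much better at recognising non-attacking players than attacking ones.
P-Value [Acc > NIR] : < 2.2e-16, so the accuracy is significantly better than the no-information rate. This test says nothing about the significance of individual predictors.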
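As a worked check, the headline statistics can be recomputed from the four non-zero cells of the confusion matrix. This is a small sketch using only base R; the variable names are my own, and "high" is treated as the positive class.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
tp <- 1590  # predicted high, actually high
fp <- 861   # predicted high, actually low
fn <- 2866  # predicted low,  actually high
tn <- 8767  # predicted low,  actually low
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)              # share of true "high" that is found
specificity <- tn / (tn + fp)              # share of true "low" that is found
nir         <- max(tp + fn, tn + fp) / n   # accuracy of always guessing the majority class

# Cohen's kappa: agreement beyond what the row/column totals alone would give
expected    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, kappa = cohen_kappa), 4))
```

The results should agree with the Accuracy, Sensitivity, Specificity, No Information Rate and Kappa lines in the output above, up to rounding.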

Visualization example:

prp(tree)

Quick comment on the visualization:

The three most important factors for an attacking player are sprint_speed, stamina and positioning (shown as "position" in the plot).
If the player has sprint_speed < 71, then he’s not likely to be an attacking player.
If the player has sprint_speed >= 71 but stamina < 75, then he’s not likely to be an attacking player.
If the player has sprint_speed >= 71 and stamina >= 75 but positioning < 60, then he’s not likely to be an attacking player.
The minimum requirement for an attacking player is sprint_speed >= 71, stamina >= 75 and positioning >= 60. In this case, the player is more likely to be an attacking player.

-> Conclusion: A decision tree can be used for either-or (two-class) problems. In this example, the tree had every attribute available but ended up splitting on only a few of them, and it reaches only 73.54% accuracy. Its advantage is that the plot shows clearly how each decision is made.
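The path through the tree can also be written as a plain rule. The function below is a sketch of that reading, with thresholds taken from the comments above (sprint_speed >= 71, stamina >= 75, positioning >= 60) rather than extracted from the fitted rpart object; the function name and the three example players are made up.

```r
# Hand-written version of the tree's "attacking player" path (illustrative only)
is_attacking <- function(sprint_speed, stamina, positioning) {
  ifelse(sprint_speed >= 71 & stamina >= 75 & positioning >= 60, "high", "low")
}

# Three made-up players: fast and fit, slow, fast but low stamina
rule_pred <- is_attacking(sprint_speed = c(94, 33, 80),
                          stamina      = c(89, 39, 60),
                          positioning  = c(91, 25, 85))
print(rule_pred)
```

A player has to pass all three checks to be labelled "high", which fits the low sensitivity seen in the confusion matrix: many real attacking players fail at least one of them.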

Multiple logistic model

Let’s give the logistic model a redemption arc on a binary (two-class) response and see if it performs better than the decision tree.

Quick test of statistical significance:

# glm
fit4 = glm(dt2015$attacking_work_rate~., data=dt2015, family="binomial")
summary(fit4)

## 
## Call:
## glm(formula = dt2015$attacking_work_rate ~ ., family = "binomial", 
##     data = dt2015)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1529  -0.8396   0.3686   0.7936   1.9685  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          4.539e+00  9.318e-01   4.871 1.11e-06 ***
## height               2.312e-02  4.821e-03   4.796 1.62e-06 ***
## weight               1.643e-03  1.743e-03   0.943 0.345926    
## overall_rating       1.939e-02  5.993e-03   3.235 0.001215 ** 
## preferred_footright  5.594e-03  3.404e-02   0.164 0.869441    
## crossing            -1.939e-02  2.007e-03  -9.659  < 2e-16 ***
## finishing           -1.517e-02  2.330e-03  -6.512 7.43e-11 ***
## heading_accuracy    -5.334e-03  1.957e-03  -2.726 0.006409 ** 
## short_passing        2.327e-02  4.139e-03   5.623 1.88e-08 ***
## volleys              6.747e-03  1.961e-03   3.440 0.000582 ***
## dribbling           -1.712e-02  3.674e-03  -4.660 3.17e-06 ***
## curve               -8.907e-03  1.918e-03  -4.644 3.42e-06 ***
## free_kick_accuracy  -3.458e-05  1.653e-03  -0.021 0.983316    
## long_passing         1.608e-02  2.701e-03   5.953 2.63e-09 ***
## ball_control         8.624e-03  4.903e-03   1.759 0.078615 .  
## acceleration        -2.079e-02  3.766e-03  -5.521 3.36e-08 ***
## sprint_speed        -3.187e-02  3.417e-03  -9.326  < 2e-16 ***
## agility             -4.318e-03  2.730e-03  -1.582 0.113673    
## reactions            3.555e-03  3.268e-03   1.088 0.276581    
## balance              7.585e-03  2.608e-03   2.908 0.003633 ** 
## shot_power          -1.579e-03  2.491e-03  -0.634 0.526210    
## jumping              6.052e-03  1.587e-03   3.813 0.000137 ***
## stamina             -4.396e-02  1.992e-03 -22.061  < 2e-16 ***
## strength            -1.825e-03  2.213e-03  -0.825 0.409651    
## long_shots           2.175e-03  2.398e-03   0.907 0.364343    
## aggression          -8.797e-03  1.508e-03  -5.834 5.40e-09 ***
## interceptions        1.174e-03  2.065e-03   0.568 0.569705    
## positioning         -4.153e-02  2.443e-03 -17.005  < 2e-16 ***
## vision               3.876e-03  2.469e-03   1.570 0.116422    
## penalties            2.689e-03  1.891e-03   1.422 0.154945    
## marking             -6.899e-03  2.638e-03  -2.615 0.008914 ** 
## standing_tackle      6.279e-03  3.102e-03   2.024 0.042938 *  
## sliding_tackle      -7.817e-03  2.942e-03  -2.657 0.007884 ** 
## gk_diving            1.144e-02  4.462e-03   2.565 0.010324 *  
## gk_handling         -2.703e-03  4.481e-03  -0.603 0.546333    
## gk_kicking           1.937e-02  4.465e-03   4.338 1.44e-05 ***
## gk_positioning      -7.210e-03  4.459e-03  -1.617 0.105891    
## gk_reflexes          4.808e-03  4.481e-03   1.073 0.283314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38151  on 31778  degrees of freedom
## Residual deviance: 29417  on 31741  degrees of freedom
## AIC: 29493
## 
## Number of Fisher Scoring iterations: 8

Comments on the findings:

With this amount of data, the z values and p-values separate the predictors clearly.
The deviance residuals look reasonable: 1Q and 3Q are of similar size (-0.84 and 0.79) and the median (0.37) is close to 0.
All predictors are statistically significant at the 0.05 level except: weight, preferred_foot, free_kick_accuracy, agility, reactions, strength, long_shots, interceptions, vision, penalties, shot_power, ball_control, gk_handling, gk_positioning, gk_reflexes. We remove these non-significant predictors from the equation.
Surprisingly, gk_kicking (p-value = 1.44e-05) is considered significant for an attacking player.
With only the significant predictors kept, let’s refit the model and make predictions:

Model:

# glm
fit4 = glm(dt2015$attacking_work_rate~.-weight-preferred_foot-free_kick_accuracy-agility-reactions-strength-long_shots-interceptions-vision-penalties-shot_power-ball_control-gk_handling-gk_positioning-gk_reflexes, data=dt2015, family="binomial")
summary(fit4)

## 
## Call:
## glm(formula = dt2015$attacking_work_rate ~ . - weight - preferred_foot - 
##     free_kick_accuracy - agility - reactions - strength - long_shots - 
##     interceptions - vision - penalties - shot_power - ball_control - 
##     gk_handling - gk_positioning - gk_reflexes, family = "binomial", 
##     data = dt2015)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1939  -0.8414   0.3693   0.7954   2.0060  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       4.508544   0.910824   4.950 7.42e-07 ***
## height            0.024244   0.004452   5.446 5.15e-08 ***
## overall_rating    0.026603   0.004649   5.722 1.05e-08 ***
## crossing         -0.019614   0.001971  -9.950  < 2e-16 ***
## finishing        -0.013916   0.002152  -6.466 1.01e-10 ***
## heading_accuracy -0.005528   0.001905  -2.902 0.003704 ** 
## short_passing     0.026899   0.003870   6.950 3.66e-12 ***
## volleys           0.007729   0.001863   4.147 3.36e-05 ***
## dribbling        -0.014520   0.003127  -4.644 3.42e-06 ***
## curve            -0.007929   0.001710  -4.637 3.54e-06 ***
## long_passing      0.017454   0.002541   6.868 6.49e-12 ***
## acceleration     -0.022649   0.003629  -6.241 4.34e-10 ***
## sprint_speed     -0.033627   0.003347 -10.047  < 2e-16 ***
## balance           0.007051   0.002486   2.837 0.004561 ** 
## jumping           0.005435   0.001550   3.507 0.000454 ***
## stamina          -0.043885   0.001939 -22.637  < 2e-16 ***
## aggression       -0.008781   0.001422  -6.174 6.65e-10 ***
## positioning      -0.039854   0.002352 -16.947  < 2e-16 ***
## marking          -0.007162   0.002571  -2.786 0.005339 ** 
## standing_tackle   0.007126   0.002983   2.389 0.016891 *  
## sliding_tackle   -0.007871   0.002921  -2.695 0.007041 ** 
## gk_diving         0.010076   0.004334   2.325 0.020075 *  
## gk_kicking        0.018515   0.004341   4.265 2.00e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38151  on 31778  degrees of freedom
## Residual deviance: 29434  on 31756  degrees of freedom
## AIC: 29480
## 
## Number of Fisher Scoring iterations: 7
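
Before thresholding the fitted probabilities, it helps to confirm which class they refer to. For a factor response, `glm` with a binomial family treats the first factor level as failure and all other levels as success. The sketch below uses made-up toy data (not the player table) and assumes "high" sorts first, as it does alphabetically against "low".

```r
set.seed(1)
# Toy data (made up): higher stamina makes "high" more likely.
stamina <- runif(500, 30, 95)
p_high  <- plogis((stamina - 65) / 5)
rate    <- factor(ifelse(runif(500) < p_high, "high", "low"))

levels(rate)                 # "high" is the first level, so it counts as failure
toy_fit <- glm(rate ~ stamina, family = "binomial")
coef(toy_fit)["stamina"]     # negative: the model describes P(rate is not "high")

# Fitted values are probabilities of the non-first level ("low" here)
p_low <- predict(toy_fit, data.frame(stamina = c(40, 90)), type = "response")
p_low
```

Under the same assumption, a negative coefficient in the summaries above (for example on stamina or sprint_speed) would point toward a high attacking work rate, and a fitted probability below 0.5 would correspond to an attacking player.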

Predictions:

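# glm treats the first factor level (assumed here to be "high") as failure, so
# the fitted value is the probability of NOT being "high". A probability below
# 0.5 therefore means the player is predicted to be attacking (1).
glm.prob = predict(fit4, dt2016, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
# True labels: 1 = attacking ("high"), 0 = non-attacking ("low")
glm.true = rep(0, length(glm.prob))
glm.true[dt2016$attacking_work_rate == "high"] <- 1
glm.true[dt2016$attacking_work_rate == "low"] <- 0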

# Confusion matrix 
table(glm.pred, glm.true)

##         glm.true
## glm.pred    0    1
##        0 8693 2449
##        1  935 2007

# Accuracy
mean(glm.pred == glm.true)

## [1] 0.7597274

Some comments on the result of multiple logistic regression:

The accuracy rounds to 76%, which is decent. This model is more accurate than the decision tree model (73.54%).
Looking at the table of predictions against true values:
- There are 8693 non-attacking players and 2007 attacking players that are predicted correctly. There are 2449 players predicted as non-attacking who are in fact attacking (false negatives, taking attacking as the positive class). There are 935 players predicted as attacking who are in fact non-attacking (false positives). The false positives are the smallest group, about 6.6% of all 14084 predictions, but the 2449 false negatives mean the model finds only 2007 of the 4456 true attacking players (about 45%).
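
Accuracy alone hides how the errors are split between the two classes. As a sketch, the counts from the confusion matrix above can be re-entered by hand to get recall and precision as well, taking attacking (1) as the positive class. The variable names below are illustrative.

```r
# Counts copied from the confusion matrix above (rows: predicted, columns: true).
cm <- matrix(c(8693, 935, 2449, 2007), nrow = 2,
             dimnames = list(glm.pred = c("0", "1"), glm.true = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- cm["1", "1"] / sum(cm[, "1"])   # share of true attacking players found
precision <- cm["1", "1"] / sum(cm["1", ])   # share of attacking predictions that are right

round(c(accuracy = accuracy, recall = recall, precision = precision), 4)
```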

-> Conclusion: What we can tell about choosing a binary classification method:

The decision tree gives a good visualization and shows which factors contribute (you can read them off the tree), but is less accurate.
Multiple logistic regression performs well on a full dataset with good variety, identifies the statistically significant predictors, and gives better results than the decision tree.
-> If we had to choose a model for binary classification, logistic regression, which is simple yet effective, would be the better choice.

—————–## UPDATE ## ——————-
## Generalization ##

We continue by applying our models to earlier years. We choose 2012 and 2013 because they are the earliest years in which the table is consistent. In earlier years there are inconsistencies in the data, which produce this error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor attacking_work_rate has new levels le, norm, stoc, y

The attacking_work_rate has always been high, medium or low. However, we are unable to explain the values le, norm, stoc and y, because they are not described on the official Kaggle page. Since those values are hard to interpret, let’s analyze 2013 and 2012 and stop there.
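
A quick way to spot such unexpected values before calling `predict` is to tabulate the column and compare it with the known levels. The sketch below uses a made-up character vector in place of the real attacking_work_rate column. Dropping the unknown rows is shown only as one option; it is not what this analysis does.

```r
# Toy stand-in for the attacking_work_rate column (values made up for illustration).
work_rate <- c("high", "medium", "low", "norm", "le",
               "high", "stoc", "y", "None", "medium")

table(work_rate, useNA = "ifany")   # every distinct value with its count

valid_levels <- c("low", "medium", "high")
is_valid     <- work_rate %in% valid_levels

# One option: keep only rows with a known level
clean <- factor(work_rate[is_valid], levels = valid_levels)
table(clean)
```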

Preparing the 2013 test dataset for fit1 and Naive Bayes:

# Processing data
test2013 = PlayerTable
test2013$date<- substr(test2013$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2013, because the data is very big
test2013 = test2013[test2013$date == "2013", ]

# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2013$defensive_work_rate[test2013$defensive_work_rate == "1" | test2013$defensive_work_rate == "2" | test2013$defensive_work_rate == "3"] <- "low"
test2013$defensive_work_rate[test2013$defensive_work_rate == "4" | test2013$defensive_work_rate == "5" | test2013$defensive_work_rate == "6"] <- "medium"
test2013$defensive_work_rate[test2013$defensive_work_rate == "7" | test2013$defensive_work_rate == "8" | test2013$defensive_work_rate == "9"] <- "high"

# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2013$attacking_work_rate[test2013$attacking_work_rate == "1" | test2013$attacking_work_rate == "2" | test2013$attacking_work_rate == "3"] <- "low"
test2013$attacking_work_rate[test2013$attacking_work_rate == "4" | test2013$attacking_work_rate == "5" | test2013$attacking_work_rate == "6"] <- "medium"
test2013$attacking_work_rate[test2013$attacking_work_rate == "7" | test2013$attacking_work_rate == "8" | test2013$attacking_work_rate == "9"] <- "high"

# Change from character to factor
test2013$preferred_foot <- as.factor(test2013$preferred_foot)
test2013$attacking_work_rate <- as.factor(test2013$attacking_work_rate)
test2013$defensive_work_rate <- as.factor(test2013$defensive_work_rate)

Preparing the 2012 test dataset for fit1 and Naive Bayes:

# Processing data
test2012 = PlayerTable
test2012$date<- substr(test2012$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2012, because the data is very big
test2012 = test2012[test2012$date == "2012", ]

# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2012$defensive_work_rate[test2012$defensive_work_rate == "1" | test2012$defensive_work_rate == "2" | test2012$defensive_work_rate == "3"] <- "low"
test2012$defensive_work_rate[test2012$defensive_work_rate == "4" | test2012$defensive_work_rate == "5" | test2012$defensive_work_rate == "6"] <- "medium"
test2012$defensive_work_rate[test2012$defensive_work_rate == "7" | test2012$defensive_work_rate == "8" | test2012$defensive_work_rate == "9"] <- "high"

# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2012$attacking_work_rate[test2012$attacking_work_rate == "1" | test2012$attacking_work_rate == "2" | test2012$attacking_work_rate == "3"] <- "low"
test2012$attacking_work_rate[test2012$attacking_work_rate == "4" | test2012$attacking_work_rate == "5" | test2012$attacking_work_rate == "6"] <- "medium"
test2012$attacking_work_rate[test2012$attacking_work_rate == "7" | test2012$attacking_work_rate == "8" | test2012$attacking_work_rate == "9"] <- "high"

# Change from character to factor
test2012$preferred_foot <- as.factor(test2012$preferred_foot)
test2012$attacking_work_rate <- as.factor(test2012$attacking_work_rate)
test2012$defensive_work_rate <- as.factor(test2012$defensive_work_rate)
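
The two preparation blocks above differ only in the year, so they could be folded into one function. The sketch below is one way to do that under stated assumptions: `recode_work_rate` and `prepare_year` are hypothetical helpers, the work-rate columns are assumed to be character vectors at this point, and the toy table only mimics the relevant column names.

```r
# Hypothetical helpers wrapping the per-year preparation steps shown above.
recode_work_rate <- function(x) {
  x[x %in% c("1", "2", "3")] <- "low"
  x[x %in% c("4", "5", "6")] <- "medium"
  x[x %in% c("7", "8", "9")] <- "high"
  x
}

prepare_year <- function(player_table, year) {
  out <- player_table
  out$date <- substr(out$date, 1, 4)          # keep only the "20XX" part
  out <- out[out$date == year, ]
  out$defensive_work_rate <- as.factor(recode_work_rate(out$defensive_work_rate))
  out$attacking_work_rate <- as.factor(recode_work_rate(out$attacking_work_rate))
  out$preferred_foot      <- as.factor(out$preferred_foot)
  out
}

# Toy table with the same column names (values made up).
toy <- data.frame(
  date = c("2013-02-15 00:00:00", "2012-08-31 00:00:00", "2013-09-20 00:00:00"),
  preferred_foot = c("right", "left", "right"),
  attacking_work_rate = c("high", "medium", "8"),
  defensive_work_rate = c("2", "low", "medium"),
  stringsAsFactors = FALSE
)
toy2013 <- prepare_year(toy, "2013")
toy2013
```

With the real data, the equivalent calls would be `prepare_year(PlayerTable, "2013")` and `prepare_year(PlayerTable, "2012")`.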

Preparing 2013 test for fit4

dt2013 = test2013

dt2013$attacking_work_rate[dt2013$attacking_work_rate == "low" | dt2013$attacking_work_rate == "medium" | dt2013$attacking_work_rate == "None"] <- "low"
dt2013$defensive_work_rate[dt2013$defensive_work_rate == "low" | dt2013$defensive_work_rate == "medium" | dt2013$defensive_work_rate == "0"] <- "low"

# choosing meaningful test data: dt2013
dt2013 <- select(dt2013, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)

Preparing 2012 test for fit4

dt2012 = test2012

dt2012$attacking_work_rate[dt2012$attacking_work_rate == "low" | dt2012$attacking_work_rate == "medium" | dt2012$attacking_work_rate == "None"] <- "low"
dt2012$defensive_work_rate[dt2012$defensive_work_rate == "low" | dt2012$defensive_work_rate == "medium" | dt2012$defensive_work_rate == "0"] <- "low"

#choosing meaningful test data: dt2012
dt2012 <- select(dt2012, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
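
Assigning "low" into an existing factor, as above, keeps the old levels (such as "medium") attached to the column even when no rows use them. A possible alternative is to rebuild the factor with exactly two levels. This is a sketch: `binarize_work_rate` is a hypothetical helper, and unlike the code above it maps every value other than "high" to "low".

```r
# Hypothetical helper: collapse a work-rate column to "high" vs "low".
binarize_work_rate <- function(x) {
  factor(ifelse(as.character(x) == "high", "high", "low"),
         levels = c("high", "low"))
}

# Toy input (made up)
rate <- factor(c("high", "medium", "low", "None", "high"))
bin  <- binarize_work_rate(rate)

levels(bin)   # only the two levels remain
table(bin)
```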

a) Multiple linear regression on predicting earlier years

Let’s make another prediction with player stats from 2013 (test dataset)

Predictions:

# using fit1 model (trained on 2016 dataset) to predict on test2013 dataset
lm.pred3 = predict(fit1, data.frame(test2013), interval = "confidence")

# Creating result table
lm.table3 = data.frame(lm.pred3)
lm.table3 = cbind(lm.table3, true=test2013[, 8:8])
lm.table3 = cbind(lm.table3, relativeError=(lm.table3$fit-lm.table3$true)/lm.table3$true)
lm.table3 = cbind(lm.table3, percentageError=abs(lm.table3$fit-lm.table3$true)*100/lm.table3$true)

Ten randomly sampled predictions:

set.seed(433)
lm.table3[sample(1:nrow(lm.table3), 10), ]

##             fit      lwr      upr true relativeError percentageError
## 45171  70.47573 70.25672 70.69474   68   0.036407806       3.6407806
## 117302 78.45536 78.19492 78.71580   77   0.018900779       1.8900779
## 23280  68.43523 68.20121 68.66924   66   0.036897383       3.6897383
## 13381  63.39589 63.09530 63.69649   62   0.022514408       2.2514408
## 160584 74.40326 74.15603 74.65049   74   0.005449466       0.5449466
## 16828  69.42513 69.13688 69.71338   67   0.036195975       3.6195975
## 102774 65.66362 65.47743 65.84981   67  -0.019945976       1.9945976
## 89873  62.70174 62.39370 63.00978   66  -0.049973686       4.9973686
## 17440  70.09611 69.69008 70.50214   67   0.046210597       4.6210597
## 560    68.65416 68.34410 68.96423   67   0.024689008       2.4689008

Average percentage Error

mean(lm.table3$percentageError)

## [1] 3.601826

Another prediction with player stats from 2012 (test dataset)

Predictions:

# using fit1 model (trained on 2016 dataset) to predict on test2012 dataset
lm.pred2 = predict(fit1, data.frame(test2012), interval = "confidence")

# Creating result table
lm.table2 = data.frame(lm.pred2)
lm.table2 = cbind(lm.table2, true=test2012[, 8:8])
lm.table2 = cbind(lm.table2, relativeError=(lm.table2$fit-lm.table2$true)/lm.table2$true)
lm.table2 = cbind(lm.table2, percentageError=abs(lm.table2$fit-lm.table2$true)*100/lm.table2$true)

Ten randomly sampled predictions:

set.seed(422)
lm.table2[sample(1:nrow(lm.table2), 10), ]

##             fit      lwr      upr true relativeError percentageError
## 153846 76.02586 75.79852 76.25320   76  0.0003402748      0.03402748
## 133959 81.05344 80.81986 81.28701   81  0.0006597032      0.06597032
## 83072  72.44339 72.18256 72.70422   72  0.0061581802      0.61581802
## 182405 67.57592 67.33612 67.81573   67  0.0085958664      0.85958664
## 133608 77.81929 77.59218 78.04641   80 -0.0272588234      2.72588234
## 25597  64.80223 64.50674 65.09772   61  0.0623316337      6.23316337
## 134977 71.93279 71.67677 72.18882   66  0.0898908220      8.98908220
## 86658  68.23215 68.05101 68.41330   68  0.0034140096      0.34140096
## 111432 69.59006 69.29324 69.88688   72 -0.0334713519      3.34713519
## 57260  66.12803 65.81323 66.44283   61  0.0840659851      8.40659851

Average percentage Error

mean(lm.table2$percentageError)

## [1] 3.82785

Conclusion:
The average percentage errors for 2013 and 2012 are 3.601826% and 3.82785%, comparable with the errors for 2016 and 2015: 3.089505% and 3.27604% respectively.
The further back the year, the larger the percentage error: it rises from about 3.09% (2016) to about 3.83% (2012), roughly 0.74 percentage points in total. We can assume that the relationship between the predictors and overall_rating loosens in earlier years. On completely new data the accuracy decreases slightly, but an average error below 4% is still acceptable.
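
The error measure used in the result tables is a mean absolute percentage error. As a small sketch, the helper below (`mape` is a name introduced here) reproduces the per-row figure for the first three sampled 2013 rows, and then compares the four yearly averages reported in this document.

```r
# Mean absolute percentage error, as computed in the result tables above.
mape <- function(fit, true) mean(abs(fit - true) * 100 / true)

# First three sampled 2013 rows from the table above.
fit  <- c(70.47573, 78.45536, 68.43523)
true <- c(68, 77, 66)
abs(fit - true) * 100 / true   # per-row percentage errors
mape(fit, true)                # their average

# Average percentage errors reported in this document, newest year first.
err <- c(`2016` = 3.089505, `2015` = 3.27604, `2013` = 3.601826, `2012` = 3.82785)
diff(err)             # change between consecutive reported years
max(err) - min(err)   # total spread in percentage points
```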

b) Naive Bayes on predicting earlier year:

Let’s make a prediction with player stats from 2013 (test dataset)

Predictions:

# newrole2016 is the training set; we can use test2013 as the test set
naives_bayes.pred = predict(fit3, test2013, type = 'prob')

Since we don’t have true role labels to compare with, we attach the predicted probabilities to the original dataset and make observations:

Building the result table:

# Adding Naive Bayes prediction to test2013 test set
resultNB = test2013
resultNB = cbind(resultNB, data.frame(naives_bayes.pred))
# -> resultNB will hold all the role predictions for test2013 test-set

Select three well-known players as examples:

nb1 = resultNB[resultNB$player_api_id == 30893, ]
nb2 = resultNB[resultNB$player_api_id == 30717, ]
nb3 = resultNB[resultNB$player_api_id == 39027, ]

resultNB3 = rbind(nb1, nb2, nb3)
print(resultNB3)

##        player_api_id player_fifa_api_id       player_name            birthday
## 98153          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98154          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98159          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98165          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98167          30893              20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 96304          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96314          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96326          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96328          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96331          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96332          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96333          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 96334          30717               1179  Gianluigi Buffon 1978-01-28 00:00:00
## 130649         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130650         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130651         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130652         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130653         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130654         39027             139720   Vincent Kompany 1986-04-10 00:00:00
## 130655         39027             139720   Vincent Kompany 1986-04-10 00:00:00
##        height weight date overall_rating potential preferred_foot
## 98153  185.42    176 2013             92        95          right
## 98154  185.42    176 2013             92        94          right
## 98159  185.42    176 2013             92        94          right
## 98165  185.42    176 2013             92        95          right
## 98167  185.42    176 2013             92        95          right
## 96304  193.04    201 2013             87        87          right
## 96314  193.04    201 2013             87        87          right
## 96326  193.04    201 2013             87        87          right
## 96328  193.04    201 2013             85        85          right
## 96331  193.04    201 2013             87        87          right
## 96332  193.04    201 2013             85        85          right
## 96333  193.04    201 2013             87        87          right
## 96334  193.04    201 2013             87        87          right
## 130649 193.04    187 2013             86        88          right
## 130650 193.04    187 2013             86        88          right
## 130651 193.04    187 2013             86        88          right
## 130652 193.04    187 2013             86        88          right
## 130653 193.04    187 2013             86        88          right
## 130654 193.04    187 2013             86        88          right
## 130655 193.04    187 2013             85        88          right
##        attacking_work_rate defensive_work_rate crossing finishing
## 98153                 high                 low       83        92
## 98154                 high                 low       84        92
## 98159                 high                 low       84        92
## 98165                 high                 low       83        92
## 98167                 high                 low       83        92
## 96304               medium              medium       19        16
## 96314               medium              medium       19        16
## 96326               medium              medium       19        16
## 96328               medium              medium       25        25
## 96331               medium              medium       19        16
## 96332               medium              medium       25        25
## 96333               medium              medium       19        16
## 96334               medium              medium       19        16
## 130649              medium              medium       61        45
## 130650              medium              medium       61        45
## 130651              medium              medium       61        45
## 130652              medium              medium       61        45
## 130653              medium              medium       61        45
## 130654              medium              medium       61        45
## 130655              medium              medium       61        45
##        heading_accuracy short_passing volleys dribbling curve
## 98153                86            82      85        90    88
## 98154                87            83      85        90    88
## 98159                87            83      85        90    88
## 98165                86            82      85        90    88
## 98167                86            82      85        90    88
## 96304                28            45      28        27    36
## 96314                28            45      28        27    36
## 96326                28            42      18        21    26
## 96328                25            36      25        21    25
## 96331                28            45      28        27    36
## 96332                25            38      25        21    25
## 96333                28            45      28        27    36
## 96334                28            45      28        27    36
## 130649               84            80      46        70    61
## 130650               81            80      46        70    61
## 130651               81            80      46        70    61
## 130652               81            82      46        70    61
## 130653               81            82      46        70    61
## 130654               81            82      46        70    61
## 130655               80            82      46        70    61
##        free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 98153                  79           72           95           91           94
## 98154                  77           72           95           91           94
## 98159                  79           72           95           91           94
## 98165                  79           72           95           91           94
## 98167                  79           72           95           91           94
## 96304                  14           30           37           59           49
## 96314                  14           30           37           50           40
## 96326                  14           30           37           49           39
## 96328                  25           30           33           47           37
## 96331                  14           30           37           49           39
## 96332                  25           30           32           47           37
## 96333                  14           30           37           49           39
## 96334                  14           30           37           49           39
## 130649                 52           75           79           68           76
## 130650                 52           75           79           68           76
## 130651                 52           75           79           68           76
## 130652                 52           74           79           69           76
## 130653                 52           74           79           69           76
## 130654                 52           74           79           69           77
## 130655                 52           74           79           69           77
##        agility reactions balance shot_power jumping stamina strength long_shots
## 98153       93        90      75         94      94      89       79         93
## 98154       93        88      75         95      94      89       79         93
## 98159       93        88      75         95      94      89       79         93
## 98165       93        90      75         94      94      89       79         93
## 98167       93        90      75         94      94      89       79         93
## 96304       59        74      55         34      75      55       63         20
## 96314       51        74      55         34      75      55       59         20
## 96326       51        74      45         34      75      35       55         20
## 96328       49        80      45         30      75      35       55         25
## 96331       51        74      45         34      75      35       55         20
## 96332       49        80      45         30      75      35       55         25
## 96333       51        74      45         34      75      35       55         20
## 96334       51        74      45         34      75      35       55         20
## 130649      63        84      42         76      73      70       88         67
## 130650      63        84      42         76      69      70       88         67
## 130651      63        84      42         76      69      70       88         67
## 130652      63        84      42         76      69      70       88         67
## 130653      63        84      42         76      69      70       88         67
## 130654      63        84      42         76      69      70       88         67
## 130655      63        84      42         76      69      70       88         67
##        aggression interceptions positioning vision penalties marking
## 98153          63            24          89     81        85      22
## 98154          63            24          89     81        85      22
## 98159          63            24          89     81        85      22
## 98165          63            24          89     81        85      22
## 98167          63            24          89     81        85      22
## 96304          64            44          11     24        36      18
## 96314          40            15           5     15        36      18
## 96326          35            15           5     15        36      18
## 96328          35            25          25     25        36      25
## 96331          35            15           5     15        36      18
## 96332          35            25          25     25        36      25
## 96333          35            15           5     15        36      18
## 96334          35            15           5     15        36      18
## 130649         75            87          41     63        63      84
## 130650         75            87          41     63        63      84
## 130651         75            87          41     63        63      84
## 130652         75            87          41     63        63      84
## 130653         75            87          41     63        63      84
## 130654         75            87          41     63        63      84
## 130655         75            87          41     63        63      84
##        standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 98153               31             23         7          11         15
## 98154               31             23         7          11         15
## 98159               31             23         7          11         15
## 98165               31             23         7          11         15
## 98167               31             23         7          11         15
## 96304               30             26        90          82         66
## 96314               30             26        90          82         66
## 96326               30             26        90          80         66
## 96328               25             25        88          78         66
## 96331               30             26        90          82         66
## 96332               25             25        88          78         66
## 96333               30             26        90          80         66
## 96334               30             26        90          82         66
## 130649              90             85        10           9          5
## 130650              90             85        10           9          5
## 130651              90             85        10           9          5
## 130652              90             85        10           9          5
## 130653              90             85        10           9          5
## 130654              90             85        10           9          5
## 130655              90             85        10           9          5
##        gk_positioning gk_reflexes     Attacker Defender Goalkeeper
## 98153              14          11 1.000000e+00        0          0
## 98154              14          11 1.000000e+00        0          0
## 98159              14          11 1.000000e+00        0          0
## 98165              14          11 1.000000e+00        0          0
## 98167              14          11 1.000000e+00        0          0
## 96304              90          86 0.000000e+00        0          1
## 96314              90          86 0.000000e+00        0          1
## 96326              90          84 0.000000e+00        0          1
## 96328              90          80 0.000000e+00        0          1
## 96331              90          86 0.000000e+00        0          1
## 96332              90          80 0.000000e+00        0          1
## 96333              90          84 0.000000e+00        0          1
## 96334              90          84 0.000000e+00        0          1
## 130649              8           6 3.454576e-39        1          0
## 130650              8           6 3.615070e-38        1          0
## 130651              8           6 3.615070e-38        1          0
## 130652              8           6 4.322264e-38        1          0
## 130653              8           6 4.322264e-38        1          0
## 130654              8           6 4.420183e-38        1          0
## 130655              8           6 4.463737e-38        1          0

Conclusion:
Naive Bayes fits really well when predicting roles for these three players. Cristiano Ronaldo is still an attacker, Buffon is a goalkeeper and Vincent Kompany is a defender. Nothing changes here; the results are as good as expected.
(NOTE: We only test on 2013 this time because there is no table of actual roles to compare with. This also keeps the reading short.)
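
The probability columns can be turned into a single predicted role by taking the column with the largest value in each row. The sketch below copies three rows of the output above (one per player) into a small matrix; `prob` and `predicted_role` are names introduced here.

```r
# Probabilities copied from three rows of the output above (one per player).
prob <- rbind(
  `98153`  = c(Attacker = 1,            Defender = 0, Goalkeeper = 0),
  `96304`  = c(Attacker = 0,            Defender = 0, Goalkeeper = 1),
  `130649` = c(Attacker = 3.454576e-39, Defender = 1, Goalkeeper = 0)
)

# Column with the highest probability in each row
predicted_role <- colnames(prob)[max.col(prob, ties.method = "first")]
data.frame(row = rownames(prob), predicted_role)
```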

c) Logistic regression on predicting attacking/non-attacking players:

fit4 is the model that we trained on dt2015 training dataset. Let’s apply it to different years:

Predictions for the 2013 test dataset:

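# Predicted probabilities from fit4 on the 2013 data.
# Assumption: attacking_work_rate is a factor whose first level is "high",
# so glm models the probability of the other level and a value below 0.5
# is read as "high" (coded 1).
glm.prob = predict(fit4, dt2013, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
# True labels: 1 = "high", 0 = "low"
glm.true = rep(0, length(glm.prob))
glm.true[dt2013$attacking_work_rate == "high"] <- 1
glm.true[dt2013$attacking_work_rate == "low"] <- 0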

# Confusion matrix 
table(glm.pred, glm.true)

##         glm.true
## glm.pred     0     1
##        0 26563  5220
##        1  3074  3942

# Accuracy
mean(glm.pred == glm.true)

## [1] 0.7862316

Predictions for the 2012 dataset (dt2012):

glm.prob = predict(fit4, dt2012, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
glm.true = rep(0, length(glm.prob))
glm.true[dt2012$attacking_work_rate == "high"] <- 1
glm.true[dt2012$attacking_work_rate == "low"] <- 0

# Confusion matrix 
table(glm.pred, glm.true)

##         glm.true
## glm.pred    0    1
##        0 9181 1496
##        1  945  997

# Accuracy
mean(glm.pred == glm.true)

## [1] 0.8065615
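
The same prediction steps are repeated for every year, so they could be wrapped in one function. Below is a minimal sketch, not the code used for the results above. The function name evaluate_year and the toy data (toy, toy_fit) are made up for illustration, since fit4 and dt2013 are not rebuilt here. It assumes attacking_work_rate is a factor whose first level is "high", in which case glm models the probability of the other level and a probability below 0.5 means "high".

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Toy stand-in for the real data: "high" is more likely when x is large.
rate <- factor(ifelse(x + rnorm(n) > 0, "high", "low"),
               levels = c("high", "low"))
toy <- data.frame(x = x, attacking_work_rate = rate)

# Toy stand-in for fit4.
toy_fit <- glm(attacking_work_rate ~ x, data = toy, family = binomial)

# Hypothetical helper: predict, threshold, and score in one call.
evaluate_year <- function(model, data, threshold = 0.5) {
  # With a factor response, glm models P(response is not the first level),
  # which here is P("low").
  prob  <- predict(model, data, type = "response")
  pred  <- as.integer(prob < threshold)                    # 1 = "high"
  truth <- as.integer(data$attacking_work_rate == "high")  # 1 = "high"
  list(confusion = table(pred = pred, truth = truth),
       accuracy  = mean(pred == truth))
}

res <- evaluate_year(toy_fit, toy)
print(res$confusion)
print(res$accuracy)
```

With the real objects, the call would look like evaluate_year(fit4, dt2013) or evaluate_year(fit4, dt2012).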

Conclusion:
The share of correct predictions for 2013 and 2012 is 78.62316% and 80.65615% respectively, which is surprisingly better than the result for 2016: 75.97274%.
-> Despite the change in years (which means a completely new dataset, new size, new attributes, ...), the logistic model works really well on binary classification.
-> CONCLUSION ON GENERALIZATION: The interaction terms or non-linear terms help. Our model continues to generalize to earlier years and works well as long as the data is consistent.
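
The accuracies quoted above can be recomputed from the two confusion matrices, which also lets us look at each class separately. This is only a sketch: the counts are copied from the tables printed above, while the helper name summarize_cm and the extra columns (recall per class and the share of the larger class, for context) are my own additions and not part of the original analysis.

```r
# Confusion matrices copied from the output above
# (rows = predicted class, columns = true class; 1 = "high", 0 = "low").
labels <- list(pred = c("0", "1"), true = c("0", "1"))
cm2013 <- matrix(c(26563, 3074, 5220, 3942), nrow = 2, dimnames = labels)
cm2012 <- matrix(c(9181, 945, 1496, 997), nrow = 2, dimnames = labels)

# Hypothetical helper: summary rates from a 2x2 confusion matrix.
summarize_cm <- function(cm) {
  c(accuracy          = sum(diag(cm)) / sum(cm),
    recall_high       = cm["1", "1"] / sum(cm[, "1"]),
    recall_low        = cm["0", "0"] / sum(cm[, "0"]),
    # Accuracy of always predicting the larger true class, for context.
    majority_baseline = max(colSums(cm)) / sum(cm))
}

results <- rbind(`2013` = summarize_cm(cm2013),
                 `2012` = summarize_cm(cm2012))
print(round(results, 4))
```

The accuracy column should match the values printed by mean(glm.pred == glm.true); the other columns show how the correct predictions are split between "high" and "low" players.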

———————###————————-

Impact

From the work presented above, we can see some uses for prediction on a historical European soccer dataset. If the goal is to estimate the **numeric** overall_rating of a player, we would use a multiple linear regression model. If the goal is to **classify** a player into one of several roles **(>=3)**, we'd want to use **Naive Bayes**, a simple but effective method. If the goal is **binary classification** (either-or attributes), we'd want to use **logistic regression**. The analysis suggests these methods fit best, given the good prediction results on tens of thousands of entries. **The analysis also generalizes really well to data from different years**. However, the data is biased toward the author who provided the data, attributes and ratings. Since this project was made out of passion, the results, even if accurate, would have only entertainment value. But no one would stop you from using data science to make a well-informed **betting** guess. If the results weren't accurate, they would not significantly impact anybody but the bettor. I hope this contribution is helpful to those who'd like to study data science, and provides entertainment value to European soccer fans.

Tien_Dinh

2022-06-06
