My dataset is the European Soccer Database: 25k+ matches, players & teams attributes for European Professional Football. The dataset is hosted on Kaggle: https://www.kaggle.com/datasets/hugomathien/soccer
This is the EA Sports FIFA soccer database for data analysis and machine learning. The creator's name is Hugo Mathien. We don't know much about his history, but the project suggests a dedicated soccer fan and a hardworking database engineer. No motivation is listed in the description, but I believe this data was created as a historical record of European soccer teams, players and matches, out of either curiosity or true passion. Because of that, I believe the project is not funded. The original data was collected from multiple trusted soccer websites, including FIFA's own contributions. To list a few:
The database offers over 25,000 matches, over 10,000 real players, and 11 European countries with their lead championship league. For each league, it records starting line-ups and match events. There are also player, player attribute, team and team attribute tables covering each individual player and team. The two player tables include stats such as overall_rating, preferred_foot, crossing, defensive_work_rate, … The two team tables feature team name/id, buildUp, chanceCreation, defenseTeamWidth, … The data is useful for evaluating a player's rating, a team's strength, a team's chance of winning and more.
From the evidence gathered above, in my opinion, the database is used to keep a true historical record of past European soccer matches, players and teams. The data is useful for different purposes. Soccer enthusiasts can look up how often their favorite team won in the past, by how many goals, and with which starting line-up. Teams and coaches can look up a player's rating and decide whom to pursue in the future. There is also an interesting description of betting odds on the website; however, this table was taken down due to some rule violations. Overall, this data is very useful for keeping track of European soccer in the past.
The data is distributed publicly on the Kaggle website (linked at the top) under the Open Data Commons Open Database License (ODbL) v1.0. The data covers 2008 to 2016 (the most recent update is 16th Oct 2016). The license allows free use, subject to the ODbL's attribution and share-alike terms; beyond that, usage is up to the users. According to the data's history, it was well maintained and active from 2008 until 2016. After 2016, the author stopped working on the project, but the data is still open to the public today.
The data is a historical record of European soccer up to 2016. It keeps a true record of matches, players and teams for users who want to investigate soccer. I believe this is the best systematically collected soccer data set on Kaggle (it earned a gold medal with 3000+ upvotes). Unfortunately, there is no recent 2022 update, but 2016 is close to the peak of the current football generation.
TLDR:
Loading libraries
library(DBI)
library(RSQLite)
library(MASS)
library(class)
library(devtools)
## Loading required package: usethis
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(naivebayes)
## naivebayes 0.9.7 loaded
require(moonBook)
## Loading required package: moonBook
require(ggiraph)
## Loading required package: ggiraph
require(ggiraphExtra)
## Loading required package: ggiraphExtra
##
## Attaching package: 'ggiraphExtra'
## The following objects are masked from 'package:moonBook':
##
## addLabelDf, getMapping
library(rpart)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:moonBook':
##
## densityplot
library(rpart.plot)
library(data.tree)
library(caTools)
library(leaps)
Reading data
set.seed(0)
# Read data (this is local)
con <- dbConnect(SQLite(), "C:/Users/dvtie/OneDrive/Desktop/UO_STUDY/Spring_2022/Data_Science/Project/database.sqlite")
# Fetch to R object
# dbGetQuery() sends the query, fetches all rows and clears the result set in one call
Country <- dbGetQuery(con, "SELECT * FROM Country")
League <- dbGetQuery(con, "SELECT * FROM League")
Match <- dbGetQuery(con, "SELECT * FROM Match")
Player <- dbGetQuery(con, "SELECT * FROM Player")
Player_Attributes <- dbGetQuery(con, "SELECT * FROM Player_Attributes")
Team <- dbGetQuery(con, "SELECT * FROM Team")
Team_Attributes <- dbGetQuery(con, "SELECT * FROM Team_Attributes")
# sqlite_sequence is not really data here, just SQLite procedure.
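The per-table reads above can also be written as a loop over the table list. The following is a minimal sketch, assuming an in-memory SQLite database with two tiny made-up tables in place of the real database.sqlite; the names con_demo, table_names and tables are hypothetical.

```r
library(DBI)
library(RSQLite)

# In-memory database standing in for database.sqlite (demo assumption)
con_demo <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con_demo, "Country",
             data.frame(id = c(1L, 2L), name = c("Country A", "Country B")))
dbWriteTable(con_demo, "League",
             data.frame(id = c(1L, 2L), country_id = c(1L, 2L),
                        name = c("League A", "League B")))

# List the tables, skipping SQLite's internal bookkeeping table if present
table_names <- setdiff(dbListTables(con_demo), "sqlite_sequence")

# Read every table into a named list of data frames
tables <- lapply(setNames(table_names, table_names),
                 function(tbl) dbReadTable(con_demo, tbl))

# Close the connection once the data is in memory
dbDisconnect(con_demo)

# Number of rows per table
sapply(tables, nrow)
```

With the real file, the same loop would give tables$Player, tables$Match and so on, and the connection can be closed as soon as the list is built.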
Table list
dbListTables(con)
## [1] "Country" "League" "Match"
## [4] "Player" "Player_Attributes" "Team"
## [7] "Team_Attributes" "sqlite_sequence"
Preparing data
# Remove id first, this is just SQLite indexing
Player = subset(Player, select = -1)
Player_Attributes = subset(Player_Attributes, select = -1)
PlayerTable = merge(Player, Player_Attributes, by=c("player_api_id","player_fifa_api_id"))
PlayerTable = na.omit(PlayerTable) # Removing N/A rows
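To see what the merge and na.omit steps do to the row count, here is a small sketch on made-up data. The toy frames players and attrs are hypothetical; only the column names follow the real tables.

```r
# Toy stand-ins for Player and Player_Attributes (values are made up)
players <- data.frame(
  player_api_id      = c(1, 2, 3),
  player_fifa_api_id = c(10, 20, 30),
  player_name        = c("A", "B", "C")
)
attrs <- data.frame(
  player_api_id      = c(1, 1, 2, 3),
  player_fifa_api_id = c(10, 10, 20, 31),  # player 3 has a mismatched FIFA id
  date               = c("2015-01-01", "2016-01-01", "2016-01-01", "2016-01-01"),
  overall_rating     = c(70, 72, NA, 65)
)

# merge() is an inner join by default: rows must match on BOTH keys
merged <- merge(players, attrs, by = c("player_api_id", "player_fifa_api_id"))
n_before <- nrow(merged)

# Count missing values per column before dropping anything
colSums(is.na(merged))

# na.omit() drops every row that has at least one NA
cleaned <- na.omit(merged)
n_after <- nrow(cleaned)
```

Here player 3 disappears in the merge because the two ids do not both match, and player 2 disappears in na.omit because the rating is missing. Checking the counts before and after each step shows how much data the cleaning costs.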
Let’s start with a question: “How do you evaluate a soccer player?”
—————–## UPDATE ## ——————-
Player table
summary(PlayerTable)
## player_api_id player_fifa_api_id player_name birthday
## Min. : 2625 Min. : 2 Length:180228 Length:180228
## 1st Qu.: 35451 1st Qu.:156593 Class :character Class :character
## Median : 80293 Median :183781 Mode :character Mode :character
## Mean :137701 Mean :166807
## 3rd Qu.:192842 3rd Qu.:200138
## Max. :750584 Max. :234141
## height weight date overall_rating
## Min. :157.5 Min. :117.0 Length:180228 Min. :33.00
## 1st Qu.:177.8 1st Qu.:159.0 Class :character 1st Qu.:64.00
## Median :182.9 Median :168.0 Mode :character Median :69.00
## Mean :181.9 Mean :168.8 Mean :68.63
## 3rd Qu.:185.4 3rd Qu.:179.0 3rd Qu.:73.00
## Max. :208.3 Max. :243.0 Max. :94.00
## potential preferred_foot attacking_work_rate defensive_work_rate
## Min. :39.00 Length:180228 Length:180228 Length:180228
## 1st Qu.:69.00 Class :character Class :character Class :character
## Median :74.00 Mode :character Mode :character Mode :character
## Mean :73.48
## 3rd Qu.:78.00
## Max. :97.00
## crossing finishing heading_accuracy short_passing
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 3.00
## 1st Qu.:45.00 1st Qu.:34.00 1st Qu.:49.00 1st Qu.:57.00
## Median :59.00 Median :53.00 Median :60.00 Median :65.00
## Mean :55.14 Mean :49.95 Mean :57.27 Mean :62.48
## 3rd Qu.:68.00 3rd Qu.:65.00 3rd Qu.:68.00 3rd Qu.:72.00
## Max. :95.00 Max. :97.00 Max. :98.00 Max. :97.00
## volleys dribbling curve free_kick_accuracy
## Min. : 1.00 Min. : 1.00 Min. : 2.00 Min. : 1.00
## 1st Qu.:35.00 1st Qu.:52.00 1st Qu.:41.00 1st Qu.:36.00
## Median :52.00 Median :64.00 Median :56.00 Median :50.00
## Mean :49.48 Mean :59.26 Mean :52.99 Mean :49.38
## 3rd Qu.:64.00 3rd Qu.:72.00 3rd Qu.:67.00 3rd Qu.:63.00
## Max. :93.00 Max. :97.00 Max. :94.00 Max. :97.00
## long_passing ball_control acceleration sprint_speed agility
## Min. : 3.00 Min. : 5.00 Min. :10.0 Min. :12.0 Min. :11.00
## 1st Qu.:49.00 1st Qu.:59.00 1st Qu.:61.0 1st Qu.:62.0 1st Qu.:58.00
## Median :59.00 Median :67.00 Median :69.0 Median :69.0 Median :68.00
## Mean :57.09 Mean :63.45 Mean :67.7 Mean :68.1 Mean :65.99
## 3rd Qu.:67.00 3rd Qu.:73.00 3rd Qu.:77.0 3rd Qu.:77.0 3rd Qu.:75.00
## Max. :97.00 Max. :97.00 Max. :97.0 Max. :97.0 Max. :96.00
## reactions balance shot_power jumping
## Min. :17.00 Min. :12.00 Min. : 2.00 Min. :14.00
## 1st Qu.:61.00 1st Qu.:58.00 1st Qu.:54.00 1st Qu.:60.00
## Median :67.00 Median :67.00 Median :66.00 Median :68.00
## Mean :66.14 Mean :65.18 Mean :61.87 Mean :66.99
## 3rd Qu.:72.00 3rd Qu.:74.00 3rd Qu.:73.00 3rd Qu.:74.00
## Max. :96.00 Max. :96.00 Max. :97.00 Max. :96.00
## stamina strength long_shots aggression
## Min. :10.00 Min. :10.00 Min. : 1.00 Min. : 6.00
## 1st Qu.:61.00 1st Qu.:60.00 1st Qu.:41.00 1st Qu.:51.00
## Median :69.00 Median :69.00 Median :58.00 Median :64.00
## Mean :67.05 Mean :67.44 Mean :53.38 Mean :60.95
## 3rd Qu.:76.00 3rd Qu.:76.00 3rd Qu.:67.00 3rd Qu.:73.00
## Max. :96.00 Max. :96.00 Max. :96.00 Max. :97.00
## interceptions positioning vision penalties
## Min. : 1.00 Min. : 2.00 Min. : 1.00 Min. : 2.00
## 1st Qu.:34.00 1st Qu.:45.00 1st Qu.:49.00 1st Qu.:45.00
## Median :56.00 Median :60.00 Median :60.00 Median :57.00
## Mean :51.91 Mean :55.73 Mean :57.86 Mean :54.93
## 3rd Qu.:68.00 3rd Qu.:69.00 3rd Qu.:69.00 3rd Qu.:67.00
## Max. :96.00 Max. :95.00 Max. :97.00 Max. :96.00
## marking standing_tackle sliding_tackle gk_diving
## Min. : 1.00 Min. : 1.00 Min. : 2.00 Min. : 1.00
## 1st Qu.:25.00 1st Qu.:29.00 1st Qu.:25.00 1st Qu.: 7.00
## Median :50.00 Median :56.00 Median :53.00 Median :10.00
## Mean :46.77 Mean :50.37 Mean :48.04 Mean :14.69
## 3rd Qu.:66.00 3rd Qu.:69.00 3rd Qu.:67.00 3rd Qu.:13.00
## Max. :94.00 Max. :95.00 Max. :95.00 Max. :94.00
## gk_handling gk_kicking gk_positioning gk_reflexes
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :12.00 Median :11.00 Median :11.00
## Mean :15.95 Mean :20.53 Mean :16.01 Mean :16.32
## 3rd Qu.:15.00 3rd Qu.:15.00 3rd Qu.:15.00 3rd Qu.:15.00
## Max. :93.00 Max. :97.00 Max. :96.00 Max. :96.00
Analyzing the Player Table:
There are identification, categorical and numeric columns:
player_api_id, player_fifa_api_id, player_name, birthday and date are parts of a player's identity. Identity columns are not used as predictors; they only identify a player.
preferred_foot, attacking_work_rate and defensive_work_rate are categorical predictors.
The rest are numeric predictors. Judging by the ranges in the summary above, height is a positive real value in centimeters and weight is a positive real value in pounds. Other numeric values range from 0 to 100 (0 is bad, 100 is perfect).
Among the numeric columns, overall_rating and potential are ratings given by FIFA:
———————-###————————
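The column grouping described above can be written down in code. This is a small sketch on a made-up two-row frame; the unit conversions assume height is stored in centimeters and weight in pounds, as the summary ranges suggest, and the names toy, id_cols, cat_cols and num_cols are hypothetical.

```r
# Made-up rows using a few of the real column names
toy <- data.frame(
  player_name    = c("A", "B"),
  preferred_foot = c("left", "right"),
  height         = c(182.88, 170.18),
  weight         = c(187, 146),
  overall_rating = c(67, 74),
  stringsAsFactors = FALSE
)

# Column groups as described in the text
id_cols  <- c("player_api_id", "player_fifa_api_id", "player_name",
              "birthday", "date")
cat_cols <- c("preferred_foot", "attacking_work_rate", "defensive_work_rate")

# Numeric predictors: every numeric column that is not an identity column
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], id_cols)

# Unit conversions (assumption: height in cm, weight in lb)
toy$height_m  <- toy$height / 100
toy$weight_kg <- toy$weight * 0.45359237
```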
After looking at the PlayerTable, some columns are just for identification and some are player attributes. There are some assumptions/observations:
And some hypotheses:
—————–## UPDATE ## ——————-
After asking the question, we will devise some strategies for analyzing the data:
———————-###————————
Multiple linear regression
The PlayerTable has overall_rating followed by multiple player attributes. Let's try to fit a model that predicts the overall_rating value. Since the goal is to predict a numeric value, we should first try multiple linear regression:
Preparing the data
# Processing data
test2016 = PlayerTable
test2016$date<- substr(test2016$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2016, because the data is very big
test2016 = test2016[test2016$date == "2016", ]
# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2016$defensive_work_rate[test2016$defensive_work_rate == "1" | test2016$defensive_work_rate == "2" | test2016$defensive_work_rate == "3"] <- "low"
test2016$defensive_work_rate[test2016$defensive_work_rate == "4" | test2016$defensive_work_rate == "5" | test2016$defensive_work_rate == "6"] <- "medium"
test2016$defensive_work_rate[test2016$defensive_work_rate == "7" | test2016$defensive_work_rate == "8" | test2016$defensive_work_rate == "9"] <- "high"
# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2016$attacking_work_rate[test2016$attacking_work_rate == "1" | test2016$attacking_work_rate == "2" | test2016$attacking_work_rate == "3"] <- "low"
test2016$attacking_work_rate[test2016$attacking_work_rate == "4" | test2016$attacking_work_rate == "5" | test2016$attacking_work_rate == "6"] <- "medium"
test2016$attacking_work_rate[test2016$attacking_work_rate == "7" | test2016$attacking_work_rate == "8" | test2016$attacking_work_rate == "9"] <- "high"
# Change from character to factor
test2016$preferred_foot <- as.factor(test2016$preferred_foot)
test2016$attacking_work_rate <- as.factor(test2016$attacking_work_rate)
test2016$defensive_work_rate <- as.factor(test2016$defensive_work_rate)
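The same preparation steps are needed for every year, so they can be wrapped in one function. The sketch below is an assumption about how that could look, not part of the original analysis; prepare_year and recode_work_rate are hypothetical names, and the toy frame only carries the columns the function touches.

```r
# Map the numeric codes "1".."9" onto low / medium / high; other values are kept
recode_work_rate <- function(x) {
  x[x %in% c("1", "2", "3")] <- "low"
  x[x %in% c("4", "5", "6")] <- "medium"
  x[x %in% c("7", "8", "9")] <- "high"
  x
}

# Keep one year of a player table and clean its categorical columns
prepare_year <- function(player_table, year) {
  out <- player_table
  out$date <- substr(out$date, 1, 4)          # keep only the "20XX" part
  out <- out[out$date == year, ]
  out$defensive_work_rate <- recode_work_rate(out$defensive_work_rate)
  out$attacking_work_rate <- recode_work_rate(out$attacking_work_rate)
  for (col in c("preferred_foot", "attacking_work_rate", "defensive_work_rate")) {
    out[[col]] <- as.factor(out[[col]])       # character -> factor
  }
  out
}

# Made-up rows to show the behavior
toy <- data.frame(
  date                = c("2016-01-07 00:00:00", "2015-09-21 00:00:00",
                          "2016-03-10 00:00:00"),
  preferred_foot      = c("left", "right", "right"),
  attacking_work_rate = c("high", "medium", "2"),
  defensive_work_rate = c("5", "low", "9"),
  stringsAsFactors = FALSE
)
toy2016 <- prepare_year(toy, "2016")
```

With such a helper, the 2016 and 2015 tables would each come from a single call on PlayerTable, which keeps the recoding rules in one place.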
Fit multiple linear regression model:
# Fit a linear regression model on all columns except the identity columns and potential
fit1 = lm(test2016$overall_rating~., data=test2016[ , -which(names(test2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "potential"))] )
# Note: potential is excluded because it is a FIFA rating closely tied to overall_rating. We only model one of them
summary(fit1)
##
## Call:
## lm(formula = test2016$overall_rating ~ ., data = test2016[, -which(names(test2016) %in%
## c("player_api_id", "player_fifa_api_id", "player_name", "birthday",
## "date", "potential"))])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9327 -1.7845 -0.0733 1.7244 11.2269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7991842 1.6522048 2.299 0.021493 *
## height 0.0084735 0.0077156 1.098 0.272126
## weight 0.0074687 0.0028024 2.665 0.007704 **
## preferred_footright -0.2163025 0.0555863 -3.891 0.000100 ***
## attacking_work_ratelow 1.6528801 0.1320685 12.515 < 2e-16 ***
## attacking_work_ratemedium -0.2213471 0.0566527 -3.907 9.39e-05 ***
## attacking_work_rateNone -0.3418119 0.2547715 -1.342 0.179734
## defensive_work_ratehigh 3.4131868 0.7132937 4.785 1.73e-06 ***
## defensive_work_ratelow 3.5370080 0.7130870 4.960 7.13e-07 ***
## defensive_work_ratemedium 3.2706436 0.7122524 4.592 4.43e-06 ***
## crossing -0.0088195 0.0031010 -2.844 0.004460 **
## finishing 0.0154217 0.0037002 4.168 3.09e-05 ***
## heading_accuracy 0.0791795 0.0031897 24.824 < 2e-16 ***
## short_passing 0.0749497 0.0060995 12.288 < 2e-16 ***
## volleys 0.0058049 0.0031380 1.850 0.064358 .
## dribbling 0.0020077 0.0051245 0.392 0.695221
## curve 0.0224619 0.0031416 7.150 9.11e-13 ***
## free_kick_accuracy 0.0004618 0.0027768 0.166 0.867924
## long_passing 0.0113399 0.0042340 2.678 0.007409 **
## ball_control 0.2214545 0.0066256 33.424 < 2e-16 ***
## acceleration 0.0245989 0.0052352 4.699 2.64e-06 ***
## sprint_speed 0.0552671 0.0047871 11.545 < 2e-16 ***
## agility -0.0138669 0.0038769 -3.577 0.000349 ***
## reactions 0.3663411 0.0043485 84.246 < 2e-16 ***
## balance 0.0020724 0.0039969 0.518 0.604128
## shot_power 0.0210394 0.0036195 5.813 6.28e-09 ***
## jumping 0.0053470 0.0025274 2.116 0.034393 *
## stamina -0.0111346 0.0029966 -3.716 0.000203 ***
## strength 0.0360852 0.0036396 9.915 < 2e-16 ***
## long_shots -0.0211872 0.0036058 -5.876 4.30e-09 ***
## aggression -0.0011006 0.0025462 -0.432 0.665560
## interceptions 0.0054994 0.0034546 1.592 0.111433
## positioning -0.0646825 0.0035893 -18.021 < 2e-16 ***
## vision -0.0054751 0.0035768 -1.531 0.125866
## penalties 0.0130128 0.0030049 4.331 1.50e-05 ***
## marking 0.0178739 0.0044763 3.993 6.56e-05 ***
## standing_tackle 0.0135714 0.0053135 2.554 0.010655 *
## sliding_tackle -0.0139389 0.0049896 -2.794 0.005220 **
## gk_diving 0.0739050 0.0069207 10.679 < 2e-16 ***
## gk_handling 0.0720897 0.0068272 10.559 < 2e-16 ***
## gk_kicking 0.0286646 0.0064623 4.436 9.25e-06 ***
## gk_positioning 0.0634670 0.0068165 9.311 < 2e-16 ***
## gk_reflexes 0.0723199 0.0067605 10.697 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 14041 degrees of freedom
## Multiple R-squared: 0.7986, Adjusted R-squared: 0.798
## F-statistic: 1326 on 42 and 14041 DF, p-value: < 2.2e-16
From this summary, most predictors have a significant impact on overall_rating, excluding height, dribbling, free_kick_accuracy, balance, aggression, interceptions and vision (p-value > 0.1). Notice that height is not significant, but weight is. In real life, it is understandable that free_kick_accuracy matters little, but balance, aggression, interceptions and vision should still contribute somewhat to a player's overall_rating. Nonetheless, let's drop these predictors and check whether the rest remain statistically significant:
Removing insignificant predictors from lm:
# Refit the linear regression model without the insignificant predictors (identity columns and potential are still excluded)
fit1 = lm(test2016$overall_rating~.-height- dribbling- free_kick_accuracy- balance- aggression- interceptions- vision, data=test2016[ , -which(names(test2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "potential"))] )
summary(fit1)
##
## Call:
## lm(formula = test2016$overall_rating ~ . - height - dribbling -
## free_kick_accuracy - balance - aggression - interceptions -
## vision, data = test2016[, -which(names(test2016) %in% c("player_api_id",
## "player_fifa_api_id", "player_name", "birthday", "date",
## "potential"))])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1224 -1.7938 -0.0767 1.7195 11.2877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.268892 0.872198 6.041 1.57e-09 ***
## weight 0.008249 0.002594 3.180 0.001474 **
## preferred_footright -0.216632 0.055354 -3.914 9.14e-05 ***
## attacking_work_ratelow 1.650829 0.131329 12.570 < 2e-16 ***
## attacking_work_ratemedium -0.217579 0.056464 -3.853 0.000117 ***
## attacking_work_rateNone -0.341619 0.254428 -1.343 0.179394
## defensive_work_ratehigh 3.416816 0.712393 4.796 1.63e-06 ***
## defensive_work_ratelow 3.544216 0.712021 4.978 6.51e-07 ***
## defensive_work_ratemedium 3.277539 0.711359 4.607 4.11e-06 ***
## crossing -0.008903 0.003037 -2.931 0.003382 **
## finishing 0.014720 0.003655 4.027 5.67e-05 ***
## heading_accuracy 0.079989 0.003032 26.384 < 2e-16 ***
## short_passing 0.073620 0.005993 12.284 < 2e-16 ***
## volleys 0.005455 0.003121 1.748 0.080460 .
## curve 0.022568 0.002923 7.720 1.24e-14 ***
## long_passing 0.011097 0.004086 2.716 0.006621 **
## ball_control 0.221598 0.005791 38.266 < 2e-16 ***
## acceleration 0.025092 0.005116 4.904 9.48e-07 ***
## sprint_speed 0.055789 0.004725 11.807 < 2e-16 ***
## agility -0.014300 0.003700 -3.865 0.000112 ***
## reactions 0.366805 0.004266 85.974 < 2e-16 ***
## shot_power 0.020845 0.003555 5.864 4.62e-09 ***
## jumping 0.005253 0.002462 2.134 0.032879 *
## stamina -0.011027 0.002970 -3.712 0.000206 ***
## strength 0.036588 0.003499 10.457 < 2e-16 ***
## long_shots -0.020960 0.003538 -5.924 3.21e-09 ***
## positioning -0.065522 0.003482 -18.819 < 2e-16 ***
## penalties 0.012626 0.002938 4.297 1.74e-05 ***
## marking 0.019355 0.004371 4.428 9.58e-06 ***
## standing_tackle 0.015184 0.005127 2.961 0.003068 **
## sliding_tackle -0.013275 0.004972 -2.670 0.007598 **
## gk_diving 0.073473 0.006907 10.637 < 2e-16 ***
## gk_handling 0.071357 0.006812 10.475 < 2e-16 ***
## gk_kicking 0.028233 0.006455 4.374 1.23e-05 ***
## gk_positioning 0.063407 0.006814 9.306 < 2e-16 ***
## gk_reflexes 0.072573 0.006743 10.763 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.733 on 14048 degrees of freedom
## Multiple R-squared: 0.7985, Adjusted R-squared: 0.798
## F-statistic: 1591 on 35 and 14048 DF, p-value: < 2.2e-16
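After refitting, almost all remaining predictors are statistically significant at the 0.01 level. The exceptions are volleys (p = 0.080), jumping (p = 0.033) and the None level of attacking_work_rate (p = 0.179), which we keep using. The adjusted R-squared is unchanged at 0.798, so dropping the seven predictors costs essentially nothing.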
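Dropping predictors one p-value at a time can be cross-checked with a nested-model F-test. The following is a minimal sketch on simulated data, not on the real PlayerTable: the coefficients used to generate sim are made up, and balance is deliberately given no effect.

```r
set.seed(1)
n <- 500

# Simulated attributes; overall_rating depends on two of the three
sim <- data.frame(
  reactions    = rnorm(n, mean = 66, sd = 9),
  ball_control = rnorm(n, mean = 63, sd = 15),
  balance      = rnorm(n, mean = 65, sd = 13)
)
sim$overall_rating <- 20 + 0.4 * sim$reactions + 0.2 * sim$ball_control +
  rnorm(n, sd = 3)

# Full model, then the same model without balance
full    <- lm(overall_rating ~ reactions + ball_control + balance, data = sim)
reduced <- update(full, . ~ . - balance)

# F-test: does the full model explain significantly more than the reduced one?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared of both models side by side
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

A large p-value in the anova table means the dropped predictors add little once the others are in the model, which tests the whole group jointly rather than one coefficient at a time.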
Comments on this fit1 model
Confirming the predictors with graphs:
overall_rating vs short_passing (p-value < 2e-16) scatter plot:
#boxplot(test2016$overall_rating~test2016$short_passing, data=test2016, main="overall_rating vs short_passing", xlab="short_passing", ylab="overall_rating")
ggplot(test2016,aes(y=overall_rating,x=short_passing))+geom_point()+stat_smooth(method="lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
This looks like a curve at first, with some noise when the short_passing value is low. But when short_passing > 41, overall_rating increases steadily in a linear fashion with very low error. This is understandable: some players have low short_passing but a high overall_rating because their role does not emphasize short passing. There are also only a few such low-passing, high-rating players. Nonetheless, short_passing proves to be a statistically significant predictor of overall_rating. We can improve on this in a later section with non-linear and classification strategies.
Let's try some other significant predictors:
overall_rating vs jumping (p-value = 0.032879) scatter plot:
#boxplot(test2016$overall_rating~test2016$jumping, data=test2016, main="overall_rating vs jumping", xlab="jumping", ylab="overall_rating")
ggplot(test2016,aes(y=overall_rating,x=jumping))+geom_point()+stat_smooth(method="lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
This is surprisingly linear. The noise is low and the points can be described by a straight line. In fit1, jumping has a slope of 0.005253 ± 0.002462 (standard error), so its effect is small but positive: it is significant at the 0.05 level (p = 0.033), though not at 0.01.
overall_rating vs attacking_work_rate boxplot:
boxplot(test2016$overall_rating~test2016$attacking_work_rate, data=test2016, main="overall_rating vs attacking_work_rate", xlab="attacking_work_rate", ylab="overall_rating")
The overall_rating vs attacking_work_rate boxplot shows that the spread within each attacking_work_rate level is low (within about ±5). The high, low and medium levels are good predictors because together they cover a wide range of ratings. The None level is not a significant factor, as its p-value in summary(fit1) shows.
Now let's use this fit1 model to predict on the training dataset test2016:
Predictions:
# testing against the training table test2016:
lm.pred = predict(fit1, data.frame(test2016), interval = "confidence")
# Creating result table
lm.table1 = data.frame(lm.pred)
lm.table1 = cbind(lm.table1, true=test2016$overall_rating) # refer to the column by name, not by position
lm.table1 = cbind(lm.table1, relativeError=(lm.table1$fit-lm.table1$true)/lm.table1$true)
lm.table1 = cbind(lm.table1, percentageError=abs(lm.table1$fit-lm.table1$true)*100/lm.table1$true)
A random sample of 20 predictions
set.seed(3123)
lm.table1[sample(1:nrow(lm.table1), 20), ]
## fit lwr upr true relativeError percentageError
## 65367 73.98439 73.77482 74.19396 73 0.0134848319 1.34848319
## 118630 65.48025 65.20204 65.75847 69 -0.0510108450 5.10108450
## 165843 62.14930 61.90195 62.39665 61 0.0188409668 1.88409668
## 31841 65.66133 65.33096 65.99170 62 0.0590537467 5.90537467
## 86757 74.83475 74.63195 75.03756 73 0.0251335969 2.51335969
## 49697 66.50379 66.20242 66.80516 62 0.0726418012 7.26418012
## 95196 76.96988 76.68440 77.25536 77 -0.0003912117 0.03912117
## 9712 63.95795 63.73327 64.18263 66 -0.0309400968 3.09400968
## 22288 76.37617 76.12393 76.62842 79 -0.0332130098 3.32130098
## 101393 68.77557 68.52817 69.02296 67 0.0265010211 2.65010211
## 88752 74.36687 74.02453 74.70921 79 -0.0586471826 5.86471826
## 158630 63.61043 63.02179 64.19908 68 -0.0645524947 6.45524947
## 106477 63.76991 63.48581 64.05402 64 -0.0035950782 0.35950782
## 183645 76.28578 76.05460 76.51695 80 -0.0464277773 4.64277773
## 32615 75.73059 75.38636 76.07483 77 -0.0164858238 1.64858238
## 140327 68.17510 67.94091 68.40928 68 0.0025749485 0.25749485
## 92552 70.29337 70.07911 70.50762 70 0.0041909366 0.41909366
## 36811 74.22156 74.01087 74.43225 74 0.0029939911 0.29939911
## 180909 70.87995 70.63180 71.12811 71 -0.0016908083 0.16908083
## 66336 65.49948 65.21654 65.78242 68 -0.0367723428 3.67723428
Average percentage error
mean(lm.table1$percentageError)
## [1] 3.088751
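The average percentage error above is the mean absolute percentage error (MAPE). It can be reported next to two absolute-scale metrics, MAE and RMSE, which are expressed in rating points. The sketch below uses three made-up prediction/truth pairs; error_metrics is a hypothetical helper name.

```r
# Summarize prediction errors in three common ways
error_metrics <- function(predicted, actual) {
  err <- predicted - actual
  c(
    MAPE = mean(abs(err) / actual) * 100,  # mean absolute percentage error
    MAE  = mean(abs(err)),                 # mean absolute error (rating points)
    RMSE = sqrt(mean(err^2))               # root mean squared error
  )
}

# Made-up predictions and true ratings
m <- error_metrics(predicted = c(74, 65, 62), actual = c(73, 69, 61))
m
```

On the real results, the call would take lm.table1$fit and lm.table1$true as its two arguments.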
Above is the prediction table for players in 2016. We can draw some conclusions from this prediction:
The model predicts its own training dataset well (average error of about 3.1%). However, we must also evaluate it on a separate test set:
Let's make another prediction with player stats from 2015 (test dataset)
Preparing the 2015 data:
# Processing data
test2015 = PlayerTable
test2015$date<- substr(test2015$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2015, because the data is very big
test2015 = test2015[test2015$date == "2015", ]
# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2015$defensive_work_rate[test2015$defensive_work_rate == "1" | test2015$defensive_work_rate == "2" | test2015$defensive_work_rate == "3"] <- "low"
test2015$defensive_work_rate[test2015$defensive_work_rate == "4" | test2015$defensive_work_rate == "5" | test2015$defensive_work_rate == "6"] <- "medium"
test2015$defensive_work_rate[test2015$defensive_work_rate == "7" | test2015$defensive_work_rate == "8" | test2015$defensive_work_rate == "9"] <- "high"
# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2015$attacking_work_rate[test2015$attacking_work_rate == "1" | test2015$attacking_work_rate == "2" | test2015$attacking_work_rate == "3"] <- "low"
test2015$attacking_work_rate[test2015$attacking_work_rate == "4" | test2015$attacking_work_rate == "5" | test2015$attacking_work_rate == "6"] <- "medium"
test2015$attacking_work_rate[test2015$attacking_work_rate == "7" | test2015$attacking_work_rate == "8" | test2015$attacking_work_rate == "9"] <- "high"
# Change from character to factor
test2015$preferred_foot <- as.factor(test2015$preferred_foot)
test2015$attacking_work_rate <- as.factor(test2015$attacking_work_rate)
test2015$defensive_work_rate <- as.factor(test2015$defensive_work_rate)
Predictions:
# using fit1 model to predict on test2015 dataset
lm.pred = predict(fit1, data.frame(test2015), interval = "confidence")
# Creating result table
lm.table2 = data.frame(lm.pred)
lm.table2 = cbind(lm.table2, true=test2015$overall_rating) # refer to the column by name, not by position
lm.table2 = cbind(lm.table2, relativeError=(lm.table2$fit-lm.table2$true)/lm.table2$true)
lm.table2 = cbind(lm.table2, percentageError=abs(lm.table2$fit-lm.table2$true)*100/lm.table2$true)
A random sample of 20 predictions
set.seed(234)
lm.table2[sample(1:nrow(lm.table2), 20), ]
## fit lwr upr true relativeError percentageError
## 92980 68.41145 68.12930 68.69359 70 -0.0226936337 2.26936337
## 111414 66.09703 65.72940 66.46466 73 -0.0945611812 9.45611812
## 7803 73.06123 72.88779 73.23466 69 0.0588583731 5.88583731
## 108270 65.23618 64.95819 65.51416 64 0.0193152552 1.93152552
## 25328 64.06763 63.79879 64.33646 64 0.0010566458 0.10566458
## 51978 74.97734 74.70426 75.25042 78 -0.0387520845 3.87520845
## 165728 67.53733 67.28017 67.79449 71 -0.0487699640 4.87699640
## 84139 71.55618 71.16924 71.94313 70 0.0222311822 2.22311822
## 165359 74.25879 74.05585 74.46173 74 0.0034971561 0.34971561
## 112626 80.06108 79.73491 80.38725 80 0.0007635116 0.07635116
## 21729 67.67575 67.43077 67.92073 65 0.0411653685 4.11653685
## 18604 68.08643 67.55853 68.61434 66 0.0316126146 3.16126146
## 31153 69.37086 69.10348 69.63825 56 0.2387654247 23.87654247
## 31178 68.39452 68.12863 68.66042 69 -0.0087750038 0.87750038
## 492 66.15661 65.91401 66.39920 66 0.0023728566 0.23728566
## 167474 62.89946 62.56740 63.23151 60 0.0483243033 4.83243033
## 123860 70.22731 70.01322 70.44140 66 0.0640501277 6.40501277
## 91007 69.34016 69.00135 69.67898 68 0.0197082726 1.97082726
## 49873 75.20560 74.83030 75.58090 75 0.0027413123 0.27413123
## 151249 70.14120 69.87887 70.40354 71 -0.0120957363 1.20957363
Average percentage error
mean(lm.table2$percentageError)
## [1] 3.274805
Above is the prediction table for players in 2015, using the fit1 model trained on players in 2016. We can draw some conclusions from this prediction:
-> Conclusion: The multiple linear regression model predicts a player's numeric overall_rating well from numeric and categorical attributes. The average percentage error is about 3.3% on the 2015 data, close to the 3.1% on the 2016 training data.
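One detail about the lwr and upr columns in the prediction tables: with interval = "confidence", predict() returns a confidence interval for the mean rating at those attribute values, not a range for an individual player, which is why many true ratings fall outside it. A prediction interval is the wider band meant for individual observations. The sketch below compares the two on simulated data; the single-predictor model and its coefficients are made up for illustration.

```r
set.seed(42)
n <- 1000

# Simulated data: rating depends linearly on reactions plus noise
sim <- data.frame(reactions = runif(n, min = 40, max = 90))
sim$overall_rating <- 30 + 0.6 * sim$reactions + rnorm(n, sd = 3)

train <- sim[1:700, ]
test  <- sim[701:1000, ]

fit <- lm(overall_rating ~ reactions, data = train)

# Interval for the MEAN response vs interval for an INDIVIDUAL observation
conf_int <- predict(fit, newdata = test, interval = "confidence")
pred_int <- predict(fit, newdata = test, interval = "prediction")

# Share of true test values that fall inside each interval
coverage <- function(iv, actual) {
  mean(actual >= iv[, "lwr"] & actual <= iv[, "upr"])
}
cov_conf <- coverage(conf_int, test$overall_rating)
cov_pred <- coverage(pred_int, test$overall_rating)
c(confidence = cov_conf, prediction = cov_pred)
```

If the goal is a plausible range for a single player's rating, interval = "prediction" is the setting to use in the predict() calls.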
Non-linear regression
One thing missing from the PlayerTable is the role of the player. A player's role is important when evaluating a player. An attacking player would have better attacking stats: shooting, dribbling, speed, … A defensive player emphasizes defensive stats: sliding_tackle, standing_tackle, interceptions, aggression, … A goalkeeper differs from the others and has dedicated stats: gk_diving, gk_reflexes, … If we can categorize players into different roles, we can evaluate them better based on the stats that matter for each role.
In this section, we try to answer the following question: “How do we categorize a player into different roles?” Since we don't have roles in any table or relation table, we need to build our own by looking them up on official websites (FIFA.com or a wiki site will usually tell you, if the player is famous enough). The strategy is as follows:
Preparing test dataset:
# Processing data
role2016 = PlayerTable
role2016$date<- substr(role2016$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2016, because the data is very big
role2016 = role2016[role2016$date == "2016", ]
# The original data was inconsistent: recode numeric defensive_work_rate values ("1".."9") to low, medium, high
role2016$defensive_work_rate[role2016$defensive_work_rate %in% c("1", "2", "3")] <- "low"
role2016$defensive_work_rate[role2016$defensive_work_rate %in% c("4", "5", "6")] <- "medium"
role2016$defensive_work_rate[role2016$defensive_work_rate %in% c("7", "8", "9")] <- "high"
# Same recoding for attacking_work_rate
role2016$attacking_work_rate[role2016$attacking_work_rate %in% c("1", "2", "3")] <- "low"
role2016$attacking_work_rate[role2016$attacking_work_rate %in% c("4", "5", "6")] <- "medium"
role2016$attacking_work_rate[role2016$attacking_work_rate %in% c("7", "8", "9")] <- "high"
# Change from character to factor
role2016$preferred_foot <- as.factor(role2016$preferred_foot)
role2016$attacking_work_rate <- as.factor(role2016$attacking_work_rate)
role2016$defensive_work_rate <- as.factor(role2016$defensive_work_rate)
# Adding role column, give random value
set.seed(2313)
roles <- c("Attacker", "Defender", "Goalkeeper")
role2016$role <- sample(roles, size = nrow(role2016), replace = TRUE)
# Manually selecting some players and give them the correct role:
# Goalkeepers
p1 = role2016[role2016$player_api_id == 182917, ]
p1["role"] <- "Goalkeeper"
p2 = role2016[role2016$player_api_id == 30717, ]
p2["role"] <- "Goalkeeper"
p3 = role2016[role2016$player_api_id == 27299, ]
p3["role"] <- "Goalkeeper"
p4 = role2016[role2016$player_api_id == 30859, ]
p4["role"] <- "Goalkeeper"
p5 = role2016[role2016$player_api_id == 51949, ]
p5["role"] <- "Goalkeeper"
# Attackers
p6 = role2016[role2016$player_api_id == 169200, ]
p6["role"] <- "Attacker"
p7 = role2016[role2016$player_api_id == 23354, ]
p7["role"] <- "Attacker"
p8 = role2016[role2016$player_api_id == 286119, ]
p8["role"] <- "Attacker"
p9 = role2016[role2016$player_api_id == 30822, ]
p9["role"] <- "Attacker"
p10 = role2016[role2016$player_api_id == 30853, ]
p10["role"] <- "Attacker"
# Defenders
p11 = role2016[role2016$player_api_id == 30865, ]
p11["role"] <- "Defender"
p12 = role2016[role2016$player_api_id == 150739, ]
p12["role"] <- "Defender"
p13 = role2016[role2016$player_api_id == 30962, ]
p13["role"] <- "Defender"
p14 = role2016[role2016$player_api_id == 186137, ]
p14["role"] <- "Defender"
p15 = role2016[role2016$player_api_id == 56678, ]
p15["role"] <- "Defender"
# Add some more because of singularity
p16 = role2016[role2016$player_api_id == 184554, ]
p16["role"] <- "Goalkeeper"
p17 = role2016[role2016$player_api_id == 42422, ]
p17["role"] <- "Goalkeeper"
p18 = role2016[role2016$player_api_id == 26295, ]
p18["role"] <- "Goalkeeper"
p19 = role2016[role2016$player_api_id == 177126, ]
p19["role"] <- "Goalkeeper"
p20 = role2016[role2016$player_api_id == 40604, ]
p20["role"] <- "Goalkeeper"
p21 = role2016[role2016$player_api_id == 363333, ]
p21["role"] <- "Attacker"
p22 = role2016[role2016$player_api_id == 150565, ]
p22["role"] <- "Attacker"
p23 = role2016[role2016$player_api_id == 194165, ]
p23["role"] <- "Attacker"
p24 = role2016[role2016$player_api_id == 184536, ]
p24["role"] <- "Attacker"
p25 = role2016[role2016$player_api_id == 107417, ]
p25["role"] <- "Attacker"
p26 = role2016[role2016$player_api_id == 474589, ]
p26["role"] <- "Defender"
p27 = role2016[role2016$player_api_id == 282674, ]
p27["role"] <- "Defender"
p28 = role2016[role2016$player_api_id == 37762, ]
p28["role"] <- "Defender"
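# Note: 56678 is the same player_api_id already used for p15, so these rows appear twice in the training set
p29 = role2016[role2016$player_api_id == 56678, ]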
p29["role"] <- "Defender"
p30 = role2016[role2016$player_api_id == 574200, ]
p30["role"] <- "Defender"
# Creating the training set from the hand-labelled players:
newrole2016 = rbind(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15,
p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30)
# Change from character to factor
newrole2016$role <- as.factor(newrole2016$role)
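The manual labelling above could also be written more compactly with a lookup table. This is only a sketch: `players` is a hypothetical toy stand-in for role2016, the ID 99999 is made up to represent an unlabelled player, and only a few of the real IDs are reused:

```r
# Toy stand-in for role2016 (hypothetical values; 99999 is an invented, unlabelled player)
players <- data.frame(
  player_api_id = c(182917, 30717, 169200, 23354, 30865, 150739, 99999),
  dribbling     = c(12, 20, 85, 88, 60, 55, 70)
)

# One vector of hand-picked IDs per role
role_ids <- list(
  Goalkeeper = c(182917, 30717),
  Attacker   = c(169200, 23354),
  Defender   = c(30865, 150739)
)

# stack() turns the named list into a two-column table (values, ind)
lookup <- stack(role_ids)
names(lookup) <- c("player_api_id", "role")

# Inner join: keeps only the hand-labelled players and attaches their role
labelled <- merge(players, lookup, by = "player_api_id")
labelled$role <- factor(labelled$role, levels = c("Attacker", "Defender", "Goalkeeper"))
print(labelled)
```

A lookup table makes it easier to spot an ID that was entered twice, and it avoids assigning random roles that are later overwritten.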
Let’s plot some graphs:
ggplot(newrole2016,aes(y=role,x=dribbling))+geom_point()+geom_smooth(method="glm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(newrole2016,aes(y=role,x=gk_diving))+geom_point()+geom_smooth(method="glm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(newrole2016,aes(y=role,x=interceptions))+geom_point()+geom_smooth(method="glm")
## `geom_smooth()` using formula 'y ~ x'
We can draw some conclusions from these graphs on the limited training set:
Let’s train on this new training set using logistic regression:
# glm
fit2 = glm(newrole2016$role~., data=newrole2016[ , -which(names(newrole2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "overall_rating", "potential"))], family="binomial")
summary(fit2)
##
## Call:
## glm(formula = newrole2016$role ~ ., family = "binomial", data = newrole2016[,
## -which(names(newrole2016) %in% c("player_api_id", "player_fifa_api_id",
## "player_name", "birthday", "date", "overall_rating",
## "potential"))])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.592e-06 -2.381e-06 2.042e-06 2.409e-06 3.344e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.291e+02 3.552e+07 0 1
## height 1.515e+00 1.285e+05 0 1
## weight 5.371e-01 5.886e+04 0 1
## preferred_footright -1.526e+01 1.201e+06 0 1
## attacking_work_ratelow -1.664e+01 2.470e+06 0 1
## attacking_work_ratemedium -1.727e+00 1.753e+06 0 1
## defensive_work_ratelow -1.986e+01 1.977e+06 0 1
## defensive_work_ratemedium -1.926e+01 1.924e+06 0 1
## crossing 9.267e-01 7.467e+04 0 1
## finishing -8.183e-01 7.152e+04 0 1
## heading_accuracy 8.505e-01 8.479e+04 0 1
## short_passing -1.209e+00 1.261e+05 0 1
## volleys -9.926e-02 2.838e+04 0 1
## dribbling -6.849e-01 1.079e+05 0 1
## curve 5.234e-01 8.391e+04 0 1
## free_kick_accuracy -8.180e-01 1.127e+05 0 1
## long_passing 1.321e+00 1.122e+05 0 1
## ball_control 4.976e-01 1.008e+05 0 1
## acceleration -5.071e-01 7.395e+04 0 1
## sprint_speed 8.993e-01 8.126e+04 0 1
## agility 5.360e-01 5.743e+04 0 1
## reactions 8.214e-02 4.050e+04 0 1
## balance 8.314e-01 7.142e+04 0 1
## shot_power -7.594e-01 5.547e+04 0 1
## jumping 1.053e+00 5.277e+04 0 1
## stamina -2.700e-01 5.265e+04 0 1
## strength 1.381e-01 6.040e+04 0 1
## long_shots 9.955e-01 1.378e+05 0 1
## aggression -6.167e-02 4.377e+04 0 1
## interceptions -4.286e-01 5.125e+04 0 1
## positioning 2.381e-02 5.971e+04 0 1
## vision -1.554e-01 4.146e+04 0 1
## penalties -5.940e-01 5.818e+04 0 1
## marking 9.262e-02 8.587e+04 0 1
## standing_tackle 1.752e+00 1.181e+05 0 1
## sliding_tackle -1.527e+00 6.217e+04 0 1
## gk_diving 1.287e+00 1.438e+05 0 1
## gk_handling -1.860e-01 1.583e+05 0 1
## gk_kicking -1.470e-01 7.080e+04 0 1
## gk_positioning 9.417e-03 1.195e+05 0 1
## gk_reflexes -4.414e-01 1.809e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.0364e+02 on 74 degrees of freedom
## Residual deviance: 4.3892e-10 on 34 degrees of freedom
## AIC: 82
##
## Number of Fisher Scoring iterations: 25
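A note on reading this summary. When `glm()` is given a factor response with `family = "binomial"`, R treats the first factor level as failure and all other levels as success, so a three-level role is effectively reduced to "Attacker" versus "not Attacker". The enormous standard errors, the near-zero residual deviance and the 25 Fisher scoring iterations are also typical signs of perfect separation, which is likely here with 41 coefficients and 75 rows. The sketch below checks the first point on hypothetical toy data (`toy` and `x` are invented for illustration):

```r
set.seed(1)

# Hypothetical toy data: a three-level role and one unrelated numeric predictor
toy <- data.frame(
  role = factor(rep(c("Attacker", "Defender", "Goalkeeper"), each = 20)),
  x    = rnorm(60)
)

# Binomial glm with a three-level factor response
fit_factor <- glm(role ~ x, data = toy, family = binomial)

# Explicit two-class version: success means "not an Attacker"
toy$not_attacker <- as.integer(toy$role != "Attacker")
fit_binary <- glm(not_attacker ~ x, data = toy, family = binomial)

# Both fits give the same coefficients
print(all.equal(coef(fit_factor), coef(fit_binary)))
```

For a genuine three-class problem, a multinomial model or a classifier such as Naive Bayes is a better fit than a binomial glm.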
—————–## UPDATE ## ——————-
Comments on the general multiple logistic model:
———————###————————-
Let’s train on this new training set using the Naive Bayes probability method:
Let’s first plot some graphs. We will not use pairs(data[,]) on the whole dataset because there are too many predictors. Instead, we focus on a few relationships that make sense and test them with the model later:
Checking plot: role-goalkeeper vs gk_diving+gk_handling+gk_kicking+gk_positioning+gk_reflexes:
pairs(~role+gk_diving+gk_handling+gk_kicking+gk_positioning+gk_reflexes, data=newrole2016)
Checking plot: role-attacker vs attacking_work_rate +sprint_speed+volleys+dribbling+shot_power:
pairs(~role+attacking_work_rate +sprint_speed+volleys+dribbling+shot_power, data=newrole2016)
Checking plot: role-defender vs defensive_work_rate+ aggression+ standing_tackle+ sliding_tackle+ interceptions:
pairs(~role+defensive_work_rate +aggression+standing_tackle+sliding_tackle+interceptions, data=newrole2016)
The main idea of these plots is to check how the three roles (attacker, defender and goalkeeper) relate to the other attributes. The first row of each plot shows the relationship between role and each of the other variables. Drawing conclusions from these rows, as expected:
Let’s check this with a Naive Bayes probability model:
Fit the Naive Bayes model:
# Fit Naive Bayes on all attributes except identifiers, dates and ratings (naive_bayes() comes from the naivebayes package)
fit3 <- naive_bayes(newrole2016$role~., data=newrole2016[ , -which(names(newrole2016) %in% c("player_api_id","player_fifa_api_id", "player_name", "birthday", "date", "overall_rating", "potential"))])
print(fit3[8])
## $call
## naive_bayes.formula(formula = newrole2016$role ~ ., data = newrole2016[,
## -which(names(newrole2016) %in% c("player_api_id", "player_fifa_api_id",
## "player_name", "birthday", "date", "overall_rating",
## "potential"))])
print(fit3[5])
## $prior
##
## Attacker Defender Goalkeeper
## 0.4666667 0.2533333 0.2800000
print(fit3[4])
## $tables
##
## ---------------------------------------------------------------------------------
## ::: height (Gaussian)
## ---------------------------------------------------------------------------------
##
## height Attacker Defender Goalkeeper
## mean 182.952571 184.216842 190.862857
## sd 5.560963 5.765323 3.140339
##
## ---------------------------------------------------------------------------------
## ::: weight (Gaussian)
## ---------------------------------------------------------------------------------
##
## weight Attacker Defender Goalkeeper
## mean 164.857143 168.000000 190.190476
## sd 10.541634 11.215069 9.516402
##
## ---------------------------------------------------------------------------------
## ::: preferred_foot (Bernoulli)
## ---------------------------------------------------------------------------------
##
## preferred_foot Attacker Defender Goalkeeper
## left 0.0000000 0.1052632 0.1904762
## right 1.0000000 0.8947368 0.8095238
##
## ---------------------------------------------------------------------------------
## ::: attacking_work_rate (Categorical)
## ---------------------------------------------------------------------------------
##
## attacking_work_rate Attacker Defender Goalkeeper
## high 0.74285714 0.52631579 0.00000000
## low 0.00000000 0.05263158 0.00000000
## medium 0.25714286 0.42105263 1.00000000
## None 0.00000000 0.00000000 0.00000000
##
## ---------------------------------------------------------------------------------
## ::: defensive_work_rate (Categorical)
## ---------------------------------------------------------------------------------
##
## defensive_work_rate Attacker Defender Goalkeeper
## 0 0.0000000 0.0000000 0.0000000
## high 0.3142857 0.4210526 0.0000000
## low 0.2285714 0.0000000 0.0000000
## medium 0.4571429 0.5789474 1.0000000
##
## ---------------------------------------------------------------------------------
## ::: crossing (Gaussian)
## ---------------------------------------------------------------------------------
##
## crossing Attacker Defender Goalkeeper
## mean 68.200000 56.105263 14.047619
## sd 8.376438 15.530954 2.178903
##
## ---------------------------------------------------------------------------------
## ::: finishing (Gaussian)
## ---------------------------------------------------------------------------------
##
## finishing Attacker Defender Goalkeeper
## mean 78.228571 42.684211 12.285714
## sd 6.235976 11.855859 1.677583
##
## ---------------------------------------------------------------------------------
## ::: heading_accuracy (Gaussian)
## ---------------------------------------------------------------------------------
##
## heading_accuracy Attacker Defender Goalkeeper
## mean 68.171429 75.684211 14.904762
## sd 13.443751 9.080214 5.812958
##
## ---------------------------------------------------------------------------------
## ::: short_passing (Gaussian)
## ---------------------------------------------------------------------------------
##
## short_passing Attacker Defender Goalkeeper
## mean 73.400000 72.578947 33.714286
## sd 6.804843 5.294817 6.357223
##
## ---------------------------------------------------------------------------------
## ::: volleys (Gaussian)
## ---------------------------------------------------------------------------------
##
## volleys Attacker Defender Goalkeeper
## mean 72.257143 43.421053 13.238095
## sd 9.450624 12.411087 2.119074
##
## ---------------------------------------------------------------------------------
## ::: dribbling (Gaussian)
## ---------------------------------------------------------------------------------
##
## dribbling Attacker Defender Goalkeeper
## mean 78.571429 57.263158 16.047619
## sd 4.641790 10.434143 3.485343
##
## ---------------------------------------------------------------------------------
## ::: curve (Gaussian)
## ---------------------------------------------------------------------------------
##
## curve Attacker Defender Goalkeeper
## mean 68.942857 48.631579 14.666667
## sd 10.778363 14.260320 3.812261
##
## ---------------------------------------------------------------------------------
## ::: free_kick_accuracy (Gaussian)
## ---------------------------------------------------------------------------------
##
## free_kick_accuracy Attacker Defender Goalkeeper
## mean 62.742857 46.578947 14.238095
## sd 15.293872 11.051977 2.861901
##
## ---------------------------------------------------------------------------------
## ::: long_passing (Gaussian)
## ---------------------------------------------------------------------------------
##
## long_passing Attacker Defender Goalkeeper
## mean 63.000000 63.473684 35.000000
## sd 10.406446 6.744285 6.024948
##
## ---------------------------------------------------------------------------------
## ::: ball_control (Gaussian)
## ---------------------------------------------------------------------------------
##
## ball_control Attacker Defender Goalkeeper
## mean 78.314286 66.631579 24.095238
## sd 5.290232 10.683375 5.821553
##
## ---------------------------------------------------------------------------------
## ::: acceleration (Gaussian)
## ---------------------------------------------------------------------------------
##
## acceleration Attacker Defender Goalkeeper
## mean 78.285714 72.157895 50.619048
## sd 10.714398 9.239187 7.592603
##
## ---------------------------------------------------------------------------------
## ::: sprint_speed (Gaussian)
## ---------------------------------------------------------------------------------
##
## sprint_speed Attacker Defender Goalkeeper
## mean 77.80000 75.57895 55.33333
## sd 11.09266 8.82116 4.90238
##
## ---------------------------------------------------------------------------------
## ::: agility (Gaussian)
## ---------------------------------------------------------------------------------
##
## agility Attacker Defender Goalkeeper
## mean 74.371429 63.894737 50.142857
## sd 7.635389 13.755382 10.292161
##
## ---------------------------------------------------------------------------------
## ::: reactions (Gaussian)
## ---------------------------------------------------------------------------------
##
## reactions Attacker Defender Goalkeeper
## mean 78.485714 75.947368 82.333333
## sd 4.648665 5.522416 3.381321
##
## ---------------------------------------------------------------------------------
## ::: balance (Gaussian)
## ---------------------------------------------------------------------------------
##
## balance Attacker Defender Goalkeeper
## mean 67.28571 62.31579 43.33333
## sd 11.40839 10.88349 7.83156
##
## ---------------------------------------------------------------------------------
## ::: shot_power (Gaussian)
## ---------------------------------------------------------------------------------
##
## shot_power Attacker Defender Goalkeeper
## mean 79.371429 65.052632 25.095238
## sd 7.079738 8.714443 5.421298
##
## ---------------------------------------------------------------------------------
## ::: jumping (Gaussian)
## ---------------------------------------------------------------------------------
##
## jumping Attacker Defender Goalkeeper
## mean 67.857143 83.578947 75.523810
## sd 7.643309 4.845568 7.318600
##
## ---------------------------------------------------------------------------------
## ::: stamina (Gaussian)
## ---------------------------------------------------------------------------------
##
## stamina Attacker Defender Goalkeeper
## mean 78.285714 77.473684 38.142857
## sd 11.200240 8.362405 6.513722
##
## ---------------------------------------------------------------------------------
## ::: strength (Gaussian)
## ---------------------------------------------------------------------------------
##
## strength Attacker Defender Goalkeeper
## mean 72.285714 79.789474 71.142857
## sd 8.570028 4.577468 9.773872
##
## ---------------------------------------------------------------------------------
## ::: long_shots (Gaussian)
## ---------------------------------------------------------------------------------
##
## long_shots Attacker Defender Goalkeeper
## mean 74.857143 49.263158 13.666667
## sd 5.531423 10.120618 3.214550
##
## ---------------------------------------------------------------------------------
## ::: aggression (Gaussian)
## ---------------------------------------------------------------------------------
##
## aggression Attacker Defender Goalkeeper
## mean 70.742857 84.263158 34.000000
## sd 15.483089 3.229071 7.930952
##
## ---------------------------------------------------------------------------------
## ::: interceptions (Gaussian)
## ---------------------------------------------------------------------------------
##
## interceptions Attacker Defender Goalkeeper
## mean 47.914286 82.789474 23.714286
## sd 12.804805 2.070398 4.417498
##
## ---------------------------------------------------------------------------------
## ::: positioning (Gaussian)
## ---------------------------------------------------------------------------------
##
## positioning Attacker Defender Goalkeeper
## mean 80.657143 42.631579 12.857143
## sd 3.709629 15.413596 2.007130
##
## ---------------------------------------------------------------------------------
## ::: vision (Gaussian)
## ---------------------------------------------------------------------------------
##
## vision Attacker Defender Goalkeeper
## mean 75.828571 46.736842 49.809524
## sd 6.128457 14.007308 17.670934
##
## ---------------------------------------------------------------------------------
## ::: penalties (Gaussian)
## ---------------------------------------------------------------------------------
##
## penalties Attacker Defender Goalkeeper
## mean 71.400000 49.315789 28.285714
## sd 12.934678 8.838380 9.011897
##
## ---------------------------------------------------------------------------------
## ::: marking (Gaussian)
## ---------------------------------------------------------------------------------
##
## marking Attacker Defender Goalkeeper
## mean 37.971429 82.368421 12.666667
## sd 14.878895 2.586515 2.575526
##
## ---------------------------------------------------------------------------------
## ::: standing_tackle (Gaussian)
## ---------------------------------------------------------------------------------
##
## standing_tackle Attacker Defender Goalkeeper
## mean 43.114286 84.000000 12.857143
## sd 13.908887 1.855921 3.678121
##
## ---------------------------------------------------------------------------------
## ::: sliding_tackle (Gaussian)
## ---------------------------------------------------------------------------------
##
## sliding_tackle Attacker Defender Goalkeeper
## mean 41.742857 83.368421 12.714286
## sd 13.414967 3.515213 2.512824
##
## ---------------------------------------------------------------------------------
## ::: gk_diving (Gaussian)
## ---------------------------------------------------------------------------------
##
## gk_diving Attacker Defender Goalkeeper
## mean 10.742857 10.842105 84.904762
## sd 3.424332 2.853130 2.567192
##
## ---------------------------------------------------------------------------------
## ::: gk_handling (Gaussian)
## ---------------------------------------------------------------------------------
##
## gk_handling Attacker Defender Goalkeeper
## mean 10.257143 10.842105 81.714286
## sd 2.993550 3.095932 2.722919
##
## ---------------------------------------------------------------------------------
## ::: gk_kicking (Gaussian)
## ---------------------------------------------------------------------------------
##
## gk_kicking Attacker Defender Goalkeeper
## mean 8.714286 10.157895 78.095238
## sd 2.395724 3.287403 9.032745
##
## ---------------------------------------------------------------------------------
## ::: gk_positioning (Gaussian)
## ---------------------------------------------------------------------------------
##
## gk_positioning Attacker Defender Goalkeeper
## mean 11.000000 9.631579 83.904762
## sd 3.360672 3.336840 4.241518
##
## ---------------------------------------------------------------------------------
## ::: gk_reflexes (Gaussian)
## ---------------------------------------------------------------------------------
##
## gk_reflexes Attacker Defender Goalkeeper
## mean 10.600000 11.263158 85.523810
## sd 2.557572 3.429320 2.600366
##
## ---------------------------------------------------------------------------------
Interpreting the Naive Bayes model result:
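To make the tables above concrete, here is a minimal sketch of how a Gaussian Naive Bayes classifier turns them into class probabilities. It uses a single feature, gk_diving, with the priors, means and standard deviations copied from the output above. The helper `posterior_one_feature` is hypothetical; the full model multiplies the likelihoods of all features together, under the assumption that they are independent given the role:

```r
# Priors and gk_diving parameters copied from the fit3 output above
prior <- c(Attacker = 0.4666667, Defender = 0.2533333, Goalkeeper = 0.2800000)
mu    <- c(Attacker = 10.742857, Defender = 10.842105, Goalkeeper = 84.904762)
sigma <- c(Attacker = 3.424332,  Defender = 2.853130,  Goalkeeper = 2.567192)

# Hypothetical helper: posterior over roles given one gk_diving value
posterior_one_feature <- function(x) {
  lik    <- dnorm(x, mean = mu, sd = sigma)  # Gaussian likelihood per role
  unnorm <- prior * lik                      # Bayes rule numerator
  unnorm / sum(unnorm)                       # normalise so the values sum to 1
}

post <- posterior_one_feature(85)
print(post)
print(names(which.max(post)))
```

Because the goalkeeper means for the gk_* attributes sit far from the other two roles with small standard deviations, even one such feature pushes the posterior almost entirely to one class.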
Predictions:
# newrole2016 is new training set, we can use test2015 as test set
naives_bayes.pred = predict(fit3, test2015, type = 'prob')
Since we don’t have true role labels for test2015 to compare against, we attach the predicted probabilities to the original dataset and make observations:
Building the result table
# Adding Naive Bayes prediction to test2015 test set
resultNB = test2015
resultNB = cbind(resultNB, data.frame(naives_bayes.pred))
# -> resultNB will hold all the role predictions for test2015 test-set
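The probability columns can be collapsed into a single predicted role per row. The sketch below assumes a probability matrix shaped like `naives_bayes.pred`; the `prob` matrix is a hypothetical three-row example. Calling `predict()` on the fitted model with `type = "class"` should give the labels directly as well.

```r
# Hypothetical probability matrix with one column per role
prob <- matrix(
  c(1.0,     0.0, 0.0,
    0.0,     0.0, 1.0,
    1.4e-46, 1.0, 0.0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Attacker", "Defender", "Goalkeeper"))
)

# For each row, pick the column with the highest probability
predicted_role <- colnames(prob)[max.col(prob, ties.method = "first")]
print(predicted_role)
```

A single label column makes it easier to tabulate the predictions or compare them with hand-labelled players.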
Select five well-known players to check the predictions:
nb1 = resultNB[resultNB$player_api_id == 30893, ]
nb2 = resultNB[resultNB$player_api_id == 30981, ]
nb3 = resultNB[resultNB$player_api_id == 30717, ]
nb4 = resultNB[resultNB$player_api_id == 248453, ]
nb5 = resultNB[resultNB$player_api_id == 39027, ]
resultNB5 = rbind(nb1, nb2, nb3, nb4, nb5)
print(resultNB5)
## player_api_id player_fifa_api_id player_name birthday
## 98150 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98157 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98158 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98160 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98162 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 99708 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99711 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99713 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99717 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99721 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99722 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 99725 30981 158023 Lionel Messi 1987-06-24 00:00:00
## 96315 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96322 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96325 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 67943 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 67944 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 67945 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 67946 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 67947 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 67948 248453 195864 Paul Pogba 1993-03-15 00:00:00
## 130640 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130641 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130642 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130643 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## height weight date overall_rating potential preferred_foot
## 98150 185.42 176 2015 92 92 right
## 98157 185.42 176 2015 93 93 right
## 98158 185.42 176 2015 92 92 right
## 98160 185.42 176 2015 93 93 right
## 98162 185.42 176 2015 93 93 right
## 99708 170.18 159 2015 93 93 left
## 99711 170.18 159 2015 94 95 left
## 99713 170.18 159 2015 93 95 left
## 99717 170.18 159 2015 93 95 left
## 99721 170.18 159 2015 93 95 left
## 99722 170.18 159 2015 94 94 left
## 99725 170.18 159 2015 94 94 left
## 96315 193.04 201 2015 83 83 right
## 96322 193.04 201 2015 84 84 right
## 96325 193.04 201 2015 84 84 right
## 67943 190.50 185 2015 86 92 right
## 67944 190.50 185 2015 86 92 right
## 67945 190.50 185 2015 86 92 right
## 67946 190.50 185 2015 84 91 right
## 67947 190.50 185 2015 84 91 right
## 67948 190.50 185 2015 83 90 right
## 130640 193.04 187 2015 85 85 right
## 130641 193.04 187 2015 86 86 right
## 130642 193.04 187 2015 86 86 right
## 130643 193.04 187 2015 86 86 right
## attacking_work_rate defensive_work_rate crossing finishing
## 98150 high low 83 95
## 98157 high low 82 95
## 98158 high low 83 95
## 98160 high low 82 95
## 98162 high low 82 95
## 99708 medium low 84 94
## 99711 medium low 80 93
## 99713 medium low 84 94
## 99717 medium low 84 94
## 99721 medium low 84 94
## 99722 medium low 80 93
## 99725 medium low 80 93
## 96315 medium medium 25 25
## 96322 medium medium 13 15
## 96325 medium medium 13 15
## 67943 high medium 76 70
## 67944 high medium 76 70
## 67945 high medium 76 70
## 67946 high medium 73 70
## 67947 high medium 73 70
## 67948 high medium 73 70
## 130640 medium medium 61 45
## 130641 medium medium 61 45
## 130642 medium medium 61 45
## 130643 medium medium 61 45
## heading_accuracy short_passing volleys dribbling curve
## 98150 86 82 87 93 88
## 98157 86 81 87 93 88
## 98158 86 82 87 93 88
## 98160 86 81 87 93 88
## 98162 86 81 87 93 88
## 99708 71 89 85 96 89
## 99711 71 88 85 96 89
## 99713 71 89 85 96 89
## 99717 71 89 85 96 89
## 99721 71 89 85 96 89
## 99722 71 88 85 96 89
## 99725 71 88 85 96 89
## 96315 32 37 25 20 25
## 96322 13 37 17 25 20
## 96325 13 37 17 25 20
## 67943 72 85 84 88 82
## 67944 72 85 84 88 82
## 67945 72 85 84 88 82
## 67946 84 85 84 88 83
## 67947 84 85 78 88 78
## 67948 84 85 78 88 78
## 130640 84 65 46 64 61
## 130641 84 80 46 64 61
## 130642 84 80 46 64 61
## 130643 84 80 46 64 61
## free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 98150 79 72 92 91 94
## 98157 77 72 91 91 93
## 98158 79 72 92 91 94
## 98160 77 72 91 91 93
## 98162 77 72 91 91 93
## 99708 90 76 96 96 90
## 99711 90 79 96 95 90
## 99713 90 76 96 96 90
## 99717 90 76 96 96 90
## 99721 90 76 96 96 90
## 99722 90 79 96 95 90
## 99725 90 79 96 95 90
## 96315 25 31 28 45 33
## 96322 13 35 23 49 43
## 96325 13 35 23 49 43
## 67943 80 81 89 75 79
## 67944 80 81 89 75 79
## 67945 80 81 89 75 79
## 67946 65 80 90 75 77
## 67947 65 80 90 75 77
## 67948 65 80 88 75 77
## 130640 52 66 68 68 70
## 130641 52 75 74 68 73
## 130642 52 75 74 68 73
## 130643 52 75 74 68 77
## agility reactions balance shot_power jumping stamina strength long_shots
## 98150 93 90 63 94 94 89 79 93
## 98157 90 92 62 94 94 87 79 93
## 98158 93 90 63 94 94 89 79 93
## 98160 90 92 62 94 94 90 79 93
## 98162 90 92 62 94 94 87 79 93
## 99708 94 94 95 80 73 77 60 88
## 99711 92 92 95 80 68 76 59 88
## 99713 94 94 95 80 73 77 60 88
## 99717 94 94 95 80 73 77 60 88
## 99721 94 94 95 80 73 77 60 88
## 99722 92 92 95 80 68 75 59 88
## 99725 92 92 95 80 68 75 59 88
## 96315 51 74 49 47 75 39 60 25
## 96322 55 76 49 20 75 39 63 13
## 96325 55 76 49 20 75 39 63 13
## 67943 75 86 61 91 85 89 91 91
## 67944 75 86 61 91 85 87 91 91
## 67945 75 86 61 91 85 87 91 91
## 67946 79 79 60 89 87 87 90 91
## 67947 79 78 60 89 87 87 90 88
## 67948 79 78 60 89 87 87 90 88
## 130640 63 84 42 76 73 70 88 55
## 130641 63 84 42 76 73 70 88 55
## 130642 63 84 42 76 73 70 88 55
## 130643 63 84 42 76 73 70 88 67
## aggression interceptions positioning vision penalties marking
## 98150 63 24 91 81 85 22
## 98157 62 29 93 81 85 22
## 98158 63 24 91 81 85 22
## 98160 62 29 93 81 85 22
## 98162 62 29 93 81 85 22
## 99708 48 22 92 90 74 25
## 99711 48 22 90 90 74 13
## 99713 48 22 92 90 76 25
## 99717 48 22 92 90 76 25
## 99721 48 22 92 90 74 25
## 99722 48 22 90 90 74 13
## 99725 48 22 90 90 74 13
## 96315 34 29 25 25 35 25
## 96322 38 20 14 50 22 10
## 96325 38 20 14 50 22 10
## 67943 80 71 83 86 76 71
## 67944 80 71 83 86 76 71
## 67945 80 71 83 86 76 71
## 67946 82 69 79 85 67 62
## 67947 81 69 79 85 67 62
## 67948 81 69 79 85 67 62
## 130640 81 78 41 59 63 86
## 130641 81 86 41 59 63 84
## 130642 81 87 41 59 63 85
## 130643 78 87 41 59 63 85
## standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 98150 31 23 7 11 15
## 98157 31 23 7 11 15
## 98158 31 23 7 11 15
## 98160 31 23 7 11 15
## 98162 31 23 7 11 15
## 99708 21 20 6 11 15
## 99711 23 21 6 11 15
## 99713 21 20 6 11 15
## 99717 21 20 6 11 15
## 99721 21 20 6 11 15
## 99722 23 21 6 11 15
## 99725 23 21 6 11 15
## 96315 25 25 84 77 62
## 96322 11 11 85 79 71
## 96325 11 11 85 79 71
## 67943 77 83 5 6 2
## 67944 77 83 5 6 2
## 67945 77 83 5 6 2
## 67946 78 81 5 6 2
## 67947 78 81 5 6 2
## 67948 78 81 5 6 2
## 130640 90 87 10 9 5
## 130641 88 85 10 9 5
## 130642 90 85 10 9 5
## 130643 90 85 10 9 5
## gk_positioning gk_reflexes Attacker Defender Goalkeeper
## 98150 14 11 1.000000e+00 0.000000e+00 0
## 98157 14 11 1.000000e+00 0.000000e+00 0
## 98158 14 11 1.000000e+00 0.000000e+00 0
## 98160 14 11 1.000000e+00 0.000000e+00 0
## 98162 14 11 1.000000e+00 0.000000e+00 0
## 99708 14 8 1.000000e+00 0.000000e+00 0
## 99711 14 8 1.000000e+00 0.000000e+00 0
## 99713 14 8 1.000000e+00 0.000000e+00 0
## 99717 14 8 1.000000e+00 0.000000e+00 0
## 99721 14 8 1.000000e+00 0.000000e+00 0
## 99722 14 8 1.000000e+00 0.000000e+00 0
## 99725 14 8 1.000000e+00 0.000000e+00 0
## 96315 90 80 0.000000e+00 0.000000e+00 1
## 96322 89 83 0.000000e+00 0.000000e+00 1
## 96325 89 83 0.000000e+00 0.000000e+00 1
## 67943 4 3 1.000000e+00 1.406998e-22 0
## 67944 4 3 1.000000e+00 1.628461e-22 0
## 67945 4 3 1.000000e+00 1.628461e-22 0
## 67946 4 3 1.000000e+00 6.818945e-31 0
## 67947 4 3 1.000000e+00 1.344006e-30 0
## 67948 4 3 1.000000e+00 9.268056e-31 0
## 130640 8 6 1.390696e-46 1.000000e+00 0
## 130641 8 6 6.580123e-48 1.000000e+00 0
## 130642 8 6 1.562781e-46 1.000000e+00 0
## 130643 8 6 6.618217e-43 1.000000e+00 0
The table above shows predictions for players in 2015 from the fit3 Naive Bayes model, which was trained on a handful of players from 2016. Because we have no labelled test set, we can only draw conclusions from observation (Note: some players have multiple rows, which come from ratings recorded at different times of the year):
-> Conclusion: Multiple logistic regression does not seem to perform well on this classification task given such a limited dataset. A simple Naive Bayes model, which predicts from class probabilities, seems to work well with a small training set and more than two classes. The Naive Bayes model was trained on only about 30 hand-labelled players (75 rows), yet it assigned plausible roles to the known players checked above from the 31779-row test2015 dataset.
Question: Another classification technique available in R is the decision tree. Building on the data above and the analysis in the non-linear section, we would like to answer a question that suits a tree: how can you tell whether a player is an attacking player or not? The strategy is to build a decision tree.
Preparing the test set and training set
## TRAIN
dt2015 = test2015
dt2015$attacking_work_rate[dt2015$attacking_work_rate == "low" | dt2015$attacking_work_rate == "medium" | dt2015$attacking_work_rate == "None"] <- "low"
dt2015$defensive_work_rate[dt2015$defensive_work_rate == "low" | dt2015$defensive_work_rate == "medium" | dt2015$defensive_work_rate == "0"] <- "low"
#choosing meaningful training data: dt2015
dt2015 <- select(dt2015, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
## TEST
dt2016 = test2016
dt2016$attacking_work_rate[dt2016$attacking_work_rate == "low" | dt2016$attacking_work_rate == "medium" | dt2016$attacking_work_rate == "None"] <- "low"
dt2016$defensive_work_rate[dt2016$defensive_work_rate == "low" | dt2016$defensive_work_rate == "medium" | dt2016$defensive_work_rate == "0"] <- "low"
#choosing meaningful test data: dt2016
dt2016 <- select(dt2016, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
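The recoding above overwrites values but keeps the original factor levels, which is why empty medium and None rows still appear in the confusion matrix further down. A minimal sketch of one way to end up with a clean two-level factor, on a made-up vector (`collapse_work_rate` and `awr` are illustrative names, not part of the original analysis):

```r
# Hypothetical helper: collapse a work-rate factor to two levels, "high" vs "low".
# Values other than high/low/medium/None become NA, so unexpected codes stay visible.
collapse_work_rate <- function(x) {
  x <- as.character(x)
  x[x %in% c("low", "medium", "None")] <- "low"
  factor(x, levels = c("high", "low"))
}

awr  <- factor(c("high", "medium", "low", "None", "high"))
awr2 <- collapse_work_rate(awr)

levels(awr2)  # only "high" and "low" remain
table(awr2)
```

If something like this were used, it would need to be applied to both the training and the test data so that the two factors share the same levels.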
Build tree
tree <- rpart(attacking_work_rate~., data = dt2015)
Predictions
# predict against test set dt2016
tree.attacking_work_rate.pred <- predict(tree, dt2016, type='class')
Evaluating the model
confusionMatrix(tree.attacking_work_rate.pred, dt2016$attacking_work_rate)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium None
## high 1590 861 0 0
## low 2866 8767 0 0
## medium 0 0 0 0
## None 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.7354
## 95% CI : (0.728, 0.7426)
## No Information Rate : 0.6836
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3042
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: high Class: low Class: medium Class: None
## Sensitivity 0.3568 0.9106 NA NA
## Specificity 0.9106 0.3568 1 1
## Pos Pred Value 0.6487 0.7536 NA NA
## Neg Pred Value 0.7536 0.6487 NA NA
## Prevalence 0.3164 0.6836 0 0
## Detection Rate 0.1129 0.6225 0 0
## Detection Prevalence 0.1740 0.8260 0 0
## Balanced Accuracy 0.6337 0.6337 NA NA
Some comments on the confusion matrix:
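The accuracy is 0.7354 (73.54%), only modestly above the no-information rate of 68.36%, and Kappa is 0.3042, so this is only a decent model.
Looking at the prediction and reference table: the tree labels 8767 of the 9628 low players correctly (sensitivity 0.9106) but only 1590 of the 4456 high players (sensitivity 0.3568), so it misses most attacking players.
P-Value [Acc > NIR] : < 2.2e-16, so the accuracy is significantly better than the no-information rate. This test says nothing about the significance of individual predictors.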
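The headline numbers in the confusion matrix output can be reproduced by hand from the four non-zero counts. The following is a small sketch in base R; the counts are copied from the output above, and the variable names are illustrative only.

```r
# Rows = prediction, columns = reference (counts copied from the output above).
cm <- matrix(c(1590, 2866, 861, 8767), nrow = 2,
             dimnames = list(Prediction = c("high", "low"),
                             Reference  = c("high", "low")))

accuracy  <- sum(diag(cm)) / sum(cm)                  # share of correct predictions
nir       <- max(colSums(cm)) / sum(cm)               # always predict the majority class
sens_high <- cm["high", "high"] / sum(cm[, "high"])   # share of high players found

round(c(accuracy = accuracy, nir = nir, sens_high = sens_high), 4)
```

Comparing `accuracy` against `nir` is what the reported P-Value [Acc > NIR] tests.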
Visualization example:
prp(tree)
Quick comment on visualization:
Multiple logistic model
Let’s give the logistic model a redemption arc on predicting a binary value (2-class classification) and see whether it performs better than the decision tree.
Quick test of statistical significance:
# glm
fit4 = glm(dt2015$attacking_work_rate~., data=dt2015, family="binomial")
summary(fit4)
##
## Call:
## glm(formula = dt2015$attacking_work_rate ~ ., family = "binomial",
## data = dt2015)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1529 -0.8396 0.3686 0.7936 1.9685
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.539e+00 9.318e-01 4.871 1.11e-06 ***
## height 2.312e-02 4.821e-03 4.796 1.62e-06 ***
## weight 1.643e-03 1.743e-03 0.943 0.345926
## overall_rating 1.939e-02 5.993e-03 3.235 0.001215 **
## preferred_footright 5.594e-03 3.404e-02 0.164 0.869441
## crossing -1.939e-02 2.007e-03 -9.659 < 2e-16 ***
## finishing -1.517e-02 2.330e-03 -6.512 7.43e-11 ***
## heading_accuracy -5.334e-03 1.957e-03 -2.726 0.006409 **
## short_passing 2.327e-02 4.139e-03 5.623 1.88e-08 ***
## volleys 6.747e-03 1.961e-03 3.440 0.000582 ***
## dribbling -1.712e-02 3.674e-03 -4.660 3.17e-06 ***
## curve -8.907e-03 1.918e-03 -4.644 3.42e-06 ***
## free_kick_accuracy -3.458e-05 1.653e-03 -0.021 0.983316
## long_passing 1.608e-02 2.701e-03 5.953 2.63e-09 ***
## ball_control 8.624e-03 4.903e-03 1.759 0.078615 .
## acceleration -2.079e-02 3.766e-03 -5.521 3.36e-08 ***
## sprint_speed -3.187e-02 3.417e-03 -9.326 < 2e-16 ***
## agility -4.318e-03 2.730e-03 -1.582 0.113673
## reactions 3.555e-03 3.268e-03 1.088 0.276581
## balance 7.585e-03 2.608e-03 2.908 0.003633 **
## shot_power -1.579e-03 2.491e-03 -0.634 0.526210
## jumping 6.052e-03 1.587e-03 3.813 0.000137 ***
## stamina -4.396e-02 1.992e-03 -22.061 < 2e-16 ***
## strength -1.825e-03 2.213e-03 -0.825 0.409651
## long_shots 2.175e-03 2.398e-03 0.907 0.364343
## aggression -8.797e-03 1.508e-03 -5.834 5.40e-09 ***
## interceptions 1.174e-03 2.065e-03 0.568 0.569705
## positioning -4.153e-02 2.443e-03 -17.005 < 2e-16 ***
## vision 3.876e-03 2.469e-03 1.570 0.116422
## penalties 2.689e-03 1.891e-03 1.422 0.154945
## marking -6.899e-03 2.638e-03 -2.615 0.008914 **
## standing_tackle 6.279e-03 3.102e-03 2.024 0.042938 *
## sliding_tackle -7.817e-03 2.942e-03 -2.657 0.007884 **
## gk_diving 1.144e-02 4.462e-03 2.565 0.010324 *
## gk_handling -2.703e-03 4.481e-03 -0.603 0.546333
## gk_kicking 1.937e-02 4.465e-03 4.338 1.44e-05 ***
## gk_positioning -7.210e-03 4.459e-03 -1.617 0.105891
## gk_reflexes 4.808e-03 4.481e-03 1.073 0.283314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38151 on 31778 degrees of freedom
## Residual deviance: 29417 on 31741 degrees of freedom
## AIC: 29493
##
## Number of Fisher Scoring iterations: 8
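Comments on findings: every predictor with a p-value above 0.05 in the summary above (weight, preferred_foot, free_kick_accuracy, ball_control, agility, reactions, shot_power, strength, long_shots, interceptions, vision, penalties, gk_handling, gk_positioning, gk_reflexes) is dropped, and the model is refit on the remaining predictors.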
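Rather than reading the non-significant predictors off the printout by eye, they can be pulled from the coefficient table. A minimal sketch on simulated data follows; the data frame `d` and its columns are made up for illustration and have nothing to do with the player table.

```r
set.seed(1)
n <- 500
# Simulated predictors: x1 and x2 drive the outcome, noise does not.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - 1 * d$x2))

fit   <- glm(y ~ ., data = d, family = "binomial")
coefs <- summary(fit)$coefficients

# p-values for every term except the intercept
pvals <- coefs[-1, "Pr(>|z|)"]

# terms that would be candidates for removal at the 0.05 level
names(pvals)[pvals >= 0.05]
```

Note that for factor predictors the coefficient names differ from the column names (for example preferred_footright for preferred_foot), so the names need mapping back before editing the formula.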
Model:
# glm
fit4 = glm(dt2015$attacking_work_rate~.-weight-preferred_foot-free_kick_accuracy-agility-reactions-strength-long_shots-interceptions-vision-penalties-shot_power-ball_control-gk_handling-gk_positioning-gk_reflexes, data=dt2015, family="binomial")
summary(fit4)
##
## Call:
## glm(formula = dt2015$attacking_work_rate ~ . - weight - preferred_foot -
## free_kick_accuracy - agility - reactions - strength - long_shots -
## interceptions - vision - penalties - shot_power - ball_control -
## gk_handling - gk_positioning - gk_reflexes, family = "binomial",
## data = dt2015)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1939 -0.8414 0.3693 0.7954 2.0060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.508544 0.910824 4.950 7.42e-07 ***
## height 0.024244 0.004452 5.446 5.15e-08 ***
## overall_rating 0.026603 0.004649 5.722 1.05e-08 ***
## crossing -0.019614 0.001971 -9.950 < 2e-16 ***
## finishing -0.013916 0.002152 -6.466 1.01e-10 ***
## heading_accuracy -0.005528 0.001905 -2.902 0.003704 **
## short_passing 0.026899 0.003870 6.950 3.66e-12 ***
## volleys 0.007729 0.001863 4.147 3.36e-05 ***
## dribbling -0.014520 0.003127 -4.644 3.42e-06 ***
## curve -0.007929 0.001710 -4.637 3.54e-06 ***
## long_passing 0.017454 0.002541 6.868 6.49e-12 ***
## acceleration -0.022649 0.003629 -6.241 4.34e-10 ***
## sprint_speed -0.033627 0.003347 -10.047 < 2e-16 ***
## balance 0.007051 0.002486 2.837 0.004561 **
## jumping 0.005435 0.001550 3.507 0.000454 ***
## stamina -0.043885 0.001939 -22.637 < 2e-16 ***
## aggression -0.008781 0.001422 -6.174 6.65e-10 ***
## positioning -0.039854 0.002352 -16.947 < 2e-16 ***
## marking -0.007162 0.002571 -2.786 0.005339 **
## standing_tackle 0.007126 0.002983 2.389 0.016891 *
## sliding_tackle -0.007871 0.002921 -2.695 0.007041 **
## gk_diving 0.010076 0.004334 2.325 0.020075 *
## gk_kicking 0.018515 0.004341 4.265 2.00e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38151 on 31778 degrees of freedom
## Residual deviance: 29434 on 31756 degrees of freedom
## AIC: 29480
##
## Number of Fisher Scoring iterations: 7
Predictions:
glm.prob = predict(fit4, dt2016, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
glm.true = rep(0, length(glm.prob))
glm.true[dt2016$attacking_work_rate == "high"] <- 1
glm.true[dt2016$attacking_work_rate == "low"] <- 0
# Confusion matrix
table(glm.pred, glm.true)
## glm.true
## glm.pred 0 1
## 0 8693 2449
## 1 935 2007
# Accuracy
mean(glm.pred == glm.true)
## [1] 0.7597274
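The thresholding above maps probabilities below 0.5 to class 1 (high), which can look inverted. When the response of a binomial glm is a factor, R treats the first level as failure and all other levels as success, so with levels ordered high, low, ... the fitted probability is that of not being high. A small sketch on simulated data (all names here are illustrative):

```r
set.seed(2)
n <- 400
x <- rnorm(n)
# "high" is more likely when x is large; "high" is the FIRST factor level
y <- factor(ifelse(runif(n) < plogis(2 * x), "high", "low"),
            levels = c("high", "low"))

fit <- glm(y ~ x, family = "binomial")

# type = "response" gives P(y is not the first level), i.e. P("low") here
p_low <- predict(fit, data.frame(x = x), type = "response")
pred  <- ifelse(p_low < 0.5, "high", "low")

coef(fit)["x"]    # negative: larger x lowers P("low")
mean(pred == y)   # in-sample accuracy
```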
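Some comments on the result of multiple logistic regression: the model reaches 75.97% accuracy on dt2016, slightly above the decision tree's 73.54% on the same data.
-> Conclusion: for this binary classification task, logistic regression performs at least as well as the decision tree.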
—————–## UPDATE ## ——————-
## Generalization ##
We continue by applying our models to earlier years. We choose 2012 and 2013 because they are the earliest years for which the table is still consistent. In earlier years the data is inconsistent, and prediction fails with this error: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor attacking_work_rate has new levels le, norm, stoc, y. The attacking_work_rate has always been high, medium or low. We are unable to explain the values le, norm, stoc and y, because the dataset's Kaggle page does not describe them. Since those values are hard to interpret, let’s analyze 2013 and 2012 and stop there.
Preparing 2013 test dataset for fit1 and Naive Bayes:
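The error comes from factor values in the new data that the model never saw during training. One way to spot them before calling predict is to compare the new values against the training levels. This is a sketch on made-up vectors, and dropping the affected rows is only one possible choice, not necessarily what the analysis should do:

```r
# Levels the model was trained on (illustrative)
train_levels <- c("high", "low", "medium")

# Values found in an earlier year, including the unexplained codes
new_values <- c("high", "medium", "norm", "le", "low", "stoc", "y")

# Values that would trigger the "has new levels" error
unseen <- setdiff(unique(new_values), train_levels)
unseen

# Keep only rows whose value the model knows about
keep <- new_values %in% train_levels
new_values[keep]
```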
# Processing data
test2013 = PlayerTable
test2013$date<- substr(test2013$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2013, because the data is very big
test2013 = test2013[test2013$date == "2013", ]
# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2013$defensive_work_rate[test2013$defensive_work_rate == "1" | test2013$defensive_work_rate == "2" | test2013$defensive_work_rate == "3"] <- "low"
test2013$defensive_work_rate[test2013$defensive_work_rate == "4" | test2013$defensive_work_rate == "5" | test2013$defensive_work_rate == "6"] <- "medium"
test2013$defensive_work_rate[test2013$defensive_work_rate == "7" | test2013$defensive_work_rate == "8" | test2013$defensive_work_rate == "9"] <- "high"
# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2013$attacking_work_rate[test2013$attacking_work_rate == "1" | test2013$attacking_work_rate == "2" | test2013$attacking_work_rate == "3"] <- "low"
test2013$attacking_work_rate[test2013$attacking_work_rate == "4" | test2013$attacking_work_rate == "5" | test2013$attacking_work_rate == "6"] <- "medium"
test2013$attacking_work_rate[test2013$attacking_work_rate == "7" | test2013$attacking_work_rate == "8" | test2013$attacking_work_rate == "9"] <- "high"
# Change from character to factor
test2013$preferred_foot <- as.factor(test2013$preferred_foot)
test2013$attacking_work_rate <- as.factor(test2013$attacking_work_rate)
test2013$defensive_work_rate <- as.factor(test2013$defensive_work_rate)
Preparing 2012 test dataset for fit1 and Naive Bayes:
# Processing data
test2012 = PlayerTable
test2012$date<- substr(test2012$date, 1, 4) # change year to only "20XX" form
# Selecting only year 2012, because the data is very big
test2012 = test2012[test2012$date == "2012", ]
# Original data was inconsistent. Changing some value in defensive_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2012$defensive_work_rate[test2012$defensive_work_rate == "1" | test2012$defensive_work_rate == "2" | test2012$defensive_work_rate == "3"] <- "low"
test2012$defensive_work_rate[test2012$defensive_work_rate == "4" | test2012$defensive_work_rate == "5" | test2012$defensive_work_rate == "6"] <- "medium"
test2012$defensive_work_rate[test2012$defensive_work_rate == "7" | test2012$defensive_work_rate == "8" | test2012$defensive_work_rate == "9"] <- "high"
# Original data was inconsistent. Changing some value in attacking_work_rate for consistency (low, medium, high) not (1,2,3...9)
test2012$attacking_work_rate[test2012$attacking_work_rate == "1" | test2012$attacking_work_rate == "2" | test2012$attacking_work_rate == "3"] <- "low"
test2012$attacking_work_rate[test2012$attacking_work_rate == "4" | test2012$attacking_work_rate == "5" | test2012$attacking_work_rate == "6"] <- "medium"
test2012$attacking_work_rate[test2012$attacking_work_rate == "7" | test2012$attacking_work_rate == "8" | test2012$attacking_work_rate == "9"] <- "high"
# Change from character to factor
test2012$preferred_foot <- as.factor(test2012$preferred_foot)
test2012$attacking_work_rate <- as.factor(test2012$attacking_work_rate)
test2012$defensive_work_rate <- as.factor(test2012$defensive_work_rate)
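The 2013 and 2012 preparation steps are identical apart from the year, so they could be wrapped in one function. The sketch below assumes a table shaped like PlayerTable; `recode_work_rate`, `prepare_year` and the toy data frame are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: map numeric codes 1..9 to low/medium/high, leave other values alone.
recode_work_rate <- function(x) {
  x <- as.character(x)
  x[x %in% c("1", "2", "3")] <- "low"
  x[x %in% c("4", "5", "6")] <- "medium"
  x[x %in% c("7", "8", "9")] <- "high"
  x
}

# Hypothetical helper: filter one year and apply the same cleaning as above.
prepare_year <- function(player_table, year) {
  out <- player_table
  out$date <- substr(out$date, 1, 4)   # keep only the "20XX" part
  out <- out[out$date == year, ]
  out$defensive_work_rate <- as.factor(recode_work_rate(out$defensive_work_rate))
  out$attacking_work_rate <- as.factor(recode_work_rate(out$attacking_work_rate))
  out$preferred_foot      <- as.factor(out$preferred_foot)
  out
}

# Toy stand-in for PlayerTable
toy <- data.frame(
  date = c("2013-02-15 00:00:00", "2012-08-31 00:00:00", "2013-09-20 00:00:00"),
  preferred_foot = c("right", "left", "right"),
  attacking_work_rate = c("high", "medium", "low"),
  defensive_work_rate = c("2", "5", "8"),
  stringsAsFactors = FALSE
)

toy2013 <- prepare_year(toy, "2013")
toy2013[, c("date", "attacking_work_rate", "defensive_work_rate")]
```

With such a helper, each additional year would be a single call, which also removes the risk of stale comments copied from one year's block to the next.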
Preparing 2013 test for fit4
dt2013 = test2013
dt2013$attacking_work_rate[dt2013$attacking_work_rate == "low" | dt2013$attacking_work_rate == "medium" | dt2013$attacking_work_rate == "None"] <- "low"
dt2013$defensive_work_rate[dt2013$defensive_work_rate == "low" | dt2013$defensive_work_rate == "medium" | dt2013$defensive_work_rate == "0"] <- "low"
#choosing meaningful test data: dt2013
dt2013 <- select(dt2013, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
Preparing 2012 test for fit4
dt2012 = test2012
dt2012$attacking_work_rate[dt2012$attacking_work_rate == "low" | dt2012$attacking_work_rate == "medium" | dt2012$attacking_work_rate == "None"] <- "low"
dt2012$defensive_work_rate[dt2012$defensive_work_rate == "low" | dt2012$defensive_work_rate == "medium" | dt2012$defensive_work_rate == "0"] <- "low"
#choosing meaningful test data: dt2012
dt2012 <- select(dt2012, height, weight, overall_rating, preferred_foot, attacking_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes)
a) Multiple linear regression on predicting earlier years
Let’s make another prediction with player stats from 2013 (test dataset)
Predictions:
# using fit1 model (trained on 2016 dataset) to predict on test2013 dataset
lm.pred3 = predict(fit1, data.frame(test2013), interval = "confidence")
# Creating result table
lm.table3 = data.frame(lm.pred3)
lm.table3 = cbind(lm.table3, true=test2013[, 8:8])
lm.table3 = cbind(lm.table3, relativeError=(lm.table3$fit-lm.table3$true)/lm.table3$true)
lm.table3 = cbind(lm.table3, percentageError=abs(lm.table3$fit-lm.table3$true)*100/lm.table3$true)
Ten randomly sampled predictions
set.seed(433)
lm.table3[sample(1:nrow(lm.table3), 10), ]
## fit lwr upr true relativeError percentageError
## 45171 70.47573 70.25672 70.69474 68 0.036407806 3.6407806
## 117302 78.45536 78.19492 78.71580 77 0.018900779 1.8900779
## 23280 68.43523 68.20121 68.66924 66 0.036897383 3.6897383
## 13381 63.39589 63.09530 63.69649 62 0.022514408 2.2514408
## 160584 74.40326 74.15603 74.65049 74 0.005449466 0.5449466
## 16828 69.42513 69.13688 69.71338 67 0.036195975 3.6195975
## 102774 65.66362 65.47743 65.84981 67 -0.019945976 1.9945976
## 89873 62.70174 62.39370 63.00978 66 -0.049973686 4.9973686
## 17440 70.09611 69.69008 70.50214 67 0.046210597 4.6210597
## 560 68.65416 68.34410 68.96423 67 0.024689008 2.4689008
Average percentage Error
mean(lm.table3$percentageError)
## [1] 3.601826
Another prediction with player stats from 2012 (test dataset)
Predictions:
# using fit1 model (trained on 2016 dataset) to predict on test2012 dataset
lm.pred2 = predict(fit1, data.frame(test2012), interval = "confidence")
# Creating result table
lm.table2 = data.frame(lm.pred2)
lm.table2 = cbind(lm.table2, true=test2012[, 8:8])
lm.table2 = cbind(lm.table2, relativeError=(lm.table2$fit-lm.table2$true)/lm.table2$true)
lm.table2 = cbind(lm.table2, percentageError=abs(lm.table2$fit-lm.table2$true)*100/lm.table2$true)
Ten randomly sampled predictions
set.seed(422)
lm.table2[sample(1:nrow(lm.table2), 10), ]
## fit lwr upr true relativeError percentageError
## 153846 76.02586 75.79852 76.25320 76 0.0003402748 0.03402748
## 133959 81.05344 80.81986 81.28701 81 0.0006597032 0.06597032
## 83072 72.44339 72.18256 72.70422 72 0.0061581802 0.61581802
## 182405 67.57592 67.33612 67.81573 67 0.0085958664 0.85958664
## 133608 77.81929 77.59218 78.04641 80 -0.0272588234 2.72588234
## 25597 64.80223 64.50674 65.09772 61 0.0623316337 6.23316337
## 134977 71.93279 71.67677 72.18882 66 0.0898908220 8.98908220
## 86658 68.23215 68.05101 68.41330 68 0.0034140096 0.34140096
## 111432 69.59006 69.29324 69.88688 72 -0.0334713519 3.34713519
## 57260 66.12803 65.81323 66.44283 61 0.0840659851 8.40659851
Average percentage Error
mean(lm.table2$percentageError)
## [1] 3.82785
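The percentageError column is the absolute difference between the fitted and the true rating, expressed as a percentage of the true rating, and the figure above is its mean over all rows. As a small worked check, here is the same formula applied to the first three rows of the 2012 sample (values copied from the table, so they carry its rounding):

```r
# fit and true values copied from rows 153846, 133959 and 83072 above
fit_vals  <- c(76.02586, 81.05344, 72.44339)
true_vals <- c(76, 81, 72)

# same formula as the percentageError column
pct_error <- abs(fit_vals - true_vals) * 100 / true_vals

round(pct_error, 4)
mean(pct_error)   # mean over these three rows only, not the full-table average
```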
b) Naive Bayes on predicting earlier years:
Let’s make a prediction with player stats from 2013 (test dataset)
Predictions:
# fit3 was trained on newrole2016; here test2013 serves as the test set
naives_bayes.pred = predict(fit3, test2013, type = 'prob')
Since we have no true role labels to compare against, we attach the predicted probabilities to the original dataset and make observations:
Making table result
# Adding Naive Bayes prediction to test2013 test set
resultNB = test2013
resultNB = cbind(resultNB, data.frame(naives_bayes.pred))
# -> resultNB will hold all the role predictions for test2013 test-set
Select 3 well-known players as examples:
nb1 = resultNB[resultNB$player_api_id == 30893, ]
nb2 = resultNB[resultNB$player_api_id == 30717, ]
nb3 = resultNB[resultNB$player_api_id == 39027, ]
resultNB3 = rbind(nb1, nb2, nb3)
print(resultNB3)
## player_api_id player_fifa_api_id player_name birthday
## 98153 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98154 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98159 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98165 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 98167 30893 20801 Cristiano Ronaldo 1985-02-05 00:00:00
## 96304 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96314 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96326 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96328 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96331 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96332 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96333 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 96334 30717 1179 Gianluigi Buffon 1978-01-28 00:00:00
## 130649 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130650 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130651 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130652 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130653 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130654 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## 130655 39027 139720 Vincent Kompany 1986-04-10 00:00:00
## height weight date overall_rating potential preferred_foot
## 98153 185.42 176 2013 92 95 right
## 98154 185.42 176 2013 92 94 right
## 98159 185.42 176 2013 92 94 right
## 98165 185.42 176 2013 92 95 right
## 98167 185.42 176 2013 92 95 right
## 96304 193.04 201 2013 87 87 right
## 96314 193.04 201 2013 87 87 right
## 96326 193.04 201 2013 87 87 right
## 96328 193.04 201 2013 85 85 right
## 96331 193.04 201 2013 87 87 right
## 96332 193.04 201 2013 85 85 right
## 96333 193.04 201 2013 87 87 right
## 96334 193.04 201 2013 87 87 right
## 130649 193.04 187 2013 86 88 right
## 130650 193.04 187 2013 86 88 right
## 130651 193.04 187 2013 86 88 right
## 130652 193.04 187 2013 86 88 right
## 130653 193.04 187 2013 86 88 right
## 130654 193.04 187 2013 86 88 right
## 130655 193.04 187 2013 85 88 right
## attacking_work_rate defensive_work_rate crossing finishing
## 98153 high low 83 92
## 98154 high low 84 92
## 98159 high low 84 92
## 98165 high low 83 92
## 98167 high low 83 92
## 96304 medium medium 19 16
## 96314 medium medium 19 16
## 96326 medium medium 19 16
## 96328 medium medium 25 25
## 96331 medium medium 19 16
## 96332 medium medium 25 25
## 96333 medium medium 19 16
## 96334 medium medium 19 16
## 130649 medium medium 61 45
## 130650 medium medium 61 45
## 130651 medium medium 61 45
## 130652 medium medium 61 45
## 130653 medium medium 61 45
## 130654 medium medium 61 45
## 130655 medium medium 61 45
## heading_accuracy short_passing volleys dribbling curve
## 98153 86 82 85 90 88
## 98154 87 83 85 90 88
## 98159 87 83 85 90 88
## 98165 86 82 85 90 88
## 98167 86 82 85 90 88
## 96304 28 45 28 27 36
## 96314 28 45 28 27 36
## 96326 28 42 18 21 26
## 96328 25 36 25 21 25
## 96331 28 45 28 27 36
## 96332 25 38 25 21 25
## 96333 28 45 28 27 36
## 96334 28 45 28 27 36
## 130649 84 80 46 70 61
## 130650 81 80 46 70 61
## 130651 81 80 46 70 61
## 130652 81 82 46 70 61
## 130653 81 82 46 70 61
## 130654 81 82 46 70 61
## 130655 80 82 46 70 61
## free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 98153 79 72 95 91 94
## 98154 77 72 95 91 94
## 98159 79 72 95 91 94
## 98165 79 72 95 91 94
## 98167 79 72 95 91 94
## 96304 14 30 37 59 49
## 96314 14 30 37 50 40
## 96326 14 30 37 49 39
## 96328 25 30 33 47 37
## 96331 14 30 37 49 39
## 96332 25 30 32 47 37
## 96333 14 30 37 49 39
## 96334 14 30 37 49 39
## 130649 52 75 79 68 76
## 130650 52 75 79 68 76
## 130651 52 75 79 68 76
## 130652 52 74 79 69 76
## 130653 52 74 79 69 76
## 130654 52 74 79 69 77
## 130655 52 74 79 69 77
## agility reactions balance shot_power jumping stamina strength long_shots
## 98153 93 90 75 94 94 89 79 93
## 98154 93 88 75 95 94 89 79 93
## 98159 93 88 75 95 94 89 79 93
## 98165 93 90 75 94 94 89 79 93
## 98167 93 90 75 94 94 89 79 93
## 96304 59 74 55 34 75 55 63 20
## 96314 51 74 55 34 75 55 59 20
## 96326 51 74 45 34 75 35 55 20
## 96328 49 80 45 30 75 35 55 25
## 96331 51 74 45 34 75 35 55 20
## 96332 49 80 45 30 75 35 55 25
## 96333 51 74 45 34 75 35 55 20
## 96334 51 74 45 34 75 35 55 20
## 130649 63 84 42 76 73 70 88 67
## 130650 63 84 42 76 69 70 88 67
## 130651 63 84 42 76 69 70 88 67
## 130652 63 84 42 76 69 70 88 67
## 130653 63 84 42 76 69 70 88 67
## 130654 63 84 42 76 69 70 88 67
## 130655 63 84 42 76 69 70 88 67
## aggression interceptions positioning vision penalties marking
## 98153 63 24 89 81 85 22
## 98154 63 24 89 81 85 22
## 98159 63 24 89 81 85 22
## 98165 63 24 89 81 85 22
## 98167 63 24 89 81 85 22
## 96304 64 44 11 24 36 18
## 96314 40 15 5 15 36 18
## 96326 35 15 5 15 36 18
## 96328 35 25 25 25 36 25
## 96331 35 15 5 15 36 18
## 96332 35 25 25 25 36 25
## 96333 35 15 5 15 36 18
## 96334 35 15 5 15 36 18
## 130649 75 87 41 63 63 84
## 130650 75 87 41 63 63 84
## 130651 75 87 41 63 63 84
## 130652 75 87 41 63 63 84
## 130653 75 87 41 63 63 84
## 130654 75 87 41 63 63 84
## 130655 75 87 41 63 63 84
## standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 98153 31 23 7 11 15
## 98154 31 23 7 11 15
## 98159 31 23 7 11 15
## 98165 31 23 7 11 15
## 98167 31 23 7 11 15
## 96304 30 26 90 82 66
## 96314 30 26 90 82 66
## 96326 30 26 90 80 66
## 96328 25 25 88 78 66
## 96331 30 26 90 82 66
## 96332 25 25 88 78 66
## 96333 30 26 90 80 66
## 96334 30 26 90 82 66
## 130649 90 85 10 9 5
## 130650 90 85 10 9 5
## 130651 90 85 10 9 5
## 130652 90 85 10 9 5
## 130653 90 85 10 9 5
## 130654 90 85 10 9 5
## 130655 90 85 10 9 5
## gk_positioning gk_reflexes Attacker Defender Goalkeeper
## 98153 14 11 1.000000e+00 0 0
## 98154 14 11 1.000000e+00 0 0
## 98159 14 11 1.000000e+00 0 0
## 98165 14 11 1.000000e+00 0 0
## 98167 14 11 1.000000e+00 0 0
## 96304 90 86 0.000000e+00 0 1
## 96314 90 86 0.000000e+00 0 1
## 96326 90 84 0.000000e+00 0 1
## 96328 90 80 0.000000e+00 0 1
## 96331 90 86 0.000000e+00 0 1
## 96332 90 80 0.000000e+00 0 1
## 96333 90 84 0.000000e+00 0 1
## 96334 90 84 0.000000e+00 0 1
## 130649 8 6 3.454576e-39 1 0
## 130650 8 6 3.615070e-38 1 0
## 130651 8 6 3.615070e-38 1 0
## 130652 8 6 4.322264e-38 1 0
## 130653 8 6 4.322264e-38 1 0
## 130654 8 6 4.420183e-38 1 0
## 130655 8 6 4.463737e-38 1 0
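The last three columns of the table are posterior probabilities, one per role. To turn them into a single predicted role per row, one option is to take the column with the largest probability. This is a sketch using three rows copied from the output above; the object names are illustrative.

```r
# One row each for Cristiano Ronaldo, Gianluigi Buffon and Vincent Kompany
post <- matrix(c(1,            0, 0,
                 0,            0, 1,
                 3.454576e-39, 1, 0),
               ncol = 3, byrow = TRUE,
               dimnames = list(c("98153", "96304", "130649"),
                               c("Attacker", "Defender", "Goalkeeper")))

# Index of the largest probability in each row, mapped back to the role name
role <- colnames(post)[max.col(post, ties.method = "first")]

data.frame(row = rownames(post), role = role)
```

Because several rows belong to the same player, the per-row roles could then be tabulated by player_api_id to check that a player gets the same role across the year.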
c) Logistic regression on predicting attacking/non-attacking players:
fit4 is the model that we trained on the dt2015 training dataset. Let’s apply it to other years:
Predictions for test dataset 2013:
glm.prob = predict(fit4, dt2013, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
glm.true = rep(0, length(glm.prob))
glm.true[dt2013$attacking_work_rate == "high"] <- 1
glm.true[dt2013$attacking_work_rate == "low"] <- 0
# Confusion matrix
table(glm.pred, glm.true)
## glm.true
## glm.pred 0 1
## 0 26563 5220
## 1 3074 3942
# Accuracy
mean(glm.pred == glm.true)
## [1] 0.7862316
Predictions for test dataset 2012:
glm.prob = predict(fit4, dt2012, type = "response")
glm.pred = rep(0, length(glm.prob))
glm.pred[glm.prob < 0.5] <- 1
glm.pred[glm.prob >= 0.5] <- 0
glm.true = rep(0, length(glm.prob))
glm.true[dt2012$attacking_work_rate == "high"] <- 1
glm.true[dt2012$attacking_work_rate == "low"] <- 0
# Confusion matrix
table(glm.pred, glm.true)
## glm.true
## glm.pred 0 1
## 0 9181 1496
## 1 945 997
# Accuracy
mean(glm.pred == glm.true)
## [1] 0.8065615
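Raw accuracy depends on how many low players each year contains, so it helps to set it against the accuracy of always predicting the majority class. The sketch below does this for the three glm confusion tables in this report; the counts are copied from those tables, and the column names are made up for illustration.

```r
# tn = pred 0 & true 0, fn = pred 0 & true 1, fp = pred 1 & true 0, tp = pred 1 & true 1
counts <- data.frame(
  year = c(2016, 2013, 2012),
  tn   = c(8693, 26563, 9181),
  fn   = c(2449, 5220, 1496),
  fp   = c(935, 3074, 945),
  tp   = c(2007, 3942, 997)
)

counts$total    <- counts$tn + counts$fn + counts$fp + counts$tp
counts$accuracy <- (counts$tn + counts$tp) / counts$total
# accuracy of always predicting the more common true class
counts$baseline <- pmax(counts$tn + counts$fp, counts$fn + counts$tp) / counts$total
counts$gain     <- counts$accuracy - counts$baseline

round(counts[, c("year", "accuracy", "baseline", "gain")], 4)
```

On these counts the margin over the baseline is smaller for 2013 and 2012 than for 2016, which suggests that part of the higher raw accuracy in earlier years comes from a larger share of low players rather than from better predictions.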
Conclusion:
The share of correct predictions for 2013 and 2012 is 78.62316% and 80.65615% respectively, which is surprisingly better than the result for 2016: 75.97274%.
-> Despite the change in years (which means a completely new dataset, new size, new attribute values, ...), the logistic model holds up well on binary classification.
-> CONCLUSION ON GENERALIZATION: The interaction terms and non-linear terms help. Our models continue to generalize to earlier years and work well as long as the data is consistent.
———————###————————-
From the work presented above, we can see some uses for prediction on a historical European soccer dataset. If the goal is to estimate the **numeric** overall_rating of a player, we would use a multiple linear regression model. If the goal is to **classify** a player into one of several roles **(>=3)**, we'd want to use **Naive Bayes**, a simple but effective method. If the goal is **binary classification** (either-or attributes), we'd want to use **logistic regression**. The analysis suggests these methods fit best, given the good prediction results on tens of thousands of entries. **The analysis also generalizes well to data from different years**. However, the data is biased toward the judgment of whoever provided the attributes and ratings. Since this project was made out of passion, the results, even if accurate, offer mainly entertainment value. Still, nothing stops you from using data science to make a well-informed **betting** guess. If the results turn out to be inaccurate, they would not significantly affect anybody but the bettor. I hope this contribution is helpful to those who'd like to study data science, and provides some entertainment value to European soccer fans.