data21 <- read.table("./players_21.csv", fill=TRUE, header=TRUE, sep=",")
data <- data21[,c(3,5,9,13,14,15,18,34,35,36,37,38,39)]
data[data == ''] <- NA
#Dropping rows containing missing values
data <- data %>% drop_na()
#Convert character objects to integer objects / factors
data$short_name <- factor(data$short_name)
data$age <- as.integer(data$age)
data$overall <- as.integer(data$overall)
data$nationality <- factor(data$nationality)
data$potential <- as.integer(data$potential)
data$value_eur <- as.integer(data$value_eur)
data$preferred_foot <- factor(data$preferred_foot)
data$pace <- as.integer(data$pace)
data$shooting <- as.integer(data$shooting)
data$passing <- as.integer(data$passing)
data$dribbling <- as.integer(data$dribbling)
data$defending <- as.integer(data$defending)
data$physic <- as.integer(data$physic)
colnames(data) <- c("Name", "Age", "Nationality", "Overall", "Potential", "Value", "PreferredFoot", "Pace", "Shooting", "Passing", "Dribbling", "Defending", "Physic")
data$Value <- data$Value/1000000
# Selecting only player from Spain (smaller sample size)
mydata <- subset(data, Nationality == "Spain")
mydata <- mydata[,c(1,2,5,7,8,10,13)]
head(mydata)
## Name Age Potential PreferredFoot Pace Passing Physic
## 11 Sergio Ramos 34 89 Right 71 76 85
## 23 Pedro 32 80 Right 73 76 50
## 26 Illarramendi 30 80 Right 52 78 69
## 29 Escudero 30 80 Left 72 75 71
## 30 Mario Gaspar 29 80 Right 75 71 77
## 37 Bartra 29 80 Right 72 72 74
Name: Name of a player
Age: Age of a player in years
Potential: Potential rating (1-100)
PreferredFoot: The natural preference of player’s left or right foot
Pace: Pace rating (1-100)
Passing: Passing rating (1-100)
Physic: Physic rating (1-100)
round(stat.desc(mydata[,c(-1,-4)]),2)
## Age Potential Pace Passing Physic
## nbr.val 493.00 493.00 493.00 493.00 493.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 17.00 59.00 30.00 31.00 33.00
## max 40.00 90.00 96.00 82.00 85.00
## range 23.00 31.00 66.00 51.00 52.00
## sum 12770.00 36550.00 32882.00 30759.00 31350.00
## median 25.00 74.00 68.00 63.00 65.00
## mean 25.90 74.14 66.70 62.39 63.59
## SE.mean 0.20 0.21 0.51 0.41 0.44
## CI.mean.0.95 0.39 0.42 1.00 0.81 0.86
## var 19.13 22.20 127.93 83.58 94.01
## std.dev 4.37 4.71 11.31 9.14 9.70
## coef.var 0.17 0.06 0.17 0.15 0.15
The term player potential in FIFA refers to the upward ceiling of the card of a particular footballer and indicates how high it can reach.
Based on our data we will show the relationship between potential and listed independent variables and determine if it is statistically significant.
scatterplotMatrix(mydata[,c(3,2,5,6,7)], smooth = FALSE)
According to the scaterplot matrix, almost all independent variables (pace, passing and physic rating) are positively correlated, age being the only independent variable negatively correlated.
reg_model <- lm(Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
data = mydata)
summary(reg_model)
##
## Call:
## lm(formula = Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1011 -2.3675 -0.0701 2.0496 12.9847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.48176 1.81700 34.387 <2e-16 ***
## Age -0.73540 0.04555 -16.145 <2e-16 ***
## Pace 0.02528 0.01475 1.713 0.0873 .
## Passing 0.28381 0.01906 14.891 <2e-16 ***
## Physic 0.17027 0.01865 9.131 <2e-16 ***
## PreferredFootRight 0.69953 0.34524 2.026 0.0433 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.508 on 487 degrees of freedom
## Multiple R-squared: 0.4513, Adjusted R-squared: 0.4457
## F-statistic: 80.13 on 5 and 487 DF, p-value: < 2.2e-16
vif(reg_model)
## Age Pace Passing Physic PreferredFoot
## 1.586951 1.113011 1.213755 1.306848 1.018179
mean(vif(reg_model))
## [1] 1.247749
There are no strong relationships between explanatory variables - Multicolinearity assumption is met.
VIF < 5
mean(VIF) > 1
# Standardized residuals (finding outliers)
mydata$StdResid <- round(rstandard(reg_model), 3)
hist(mydata$StdResid, main = "Distribution of Standardized residuals")
# SHAPIRO-WILK NORMALITY TEST
# H0: Variable is normally distributed
# H1: Variable is NOT normally distributed
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.9953, p-value = 0.1421
There is normal distribution (We cannot reject H0, p > 0.05). Regression model does not violate the normality assumption.
mydata$StdFittedValue <- scale(fitted(reg_model))
scatterplot(mydata$StdResid, mydata$StdFittedValue,
regLine = TRUE,
smooth = FALSE,
xlab = "Standardized residuals",
ylab = "Standardized fitted values")
Based on the scatterplot, we can assume that the linearity assumption is met.
# Test for homoskedasticity
ols_test_breusch_pagan(reg_model)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : Potential
## Variables: fitted values of Potential
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.609884
## Prob > Chi2 = 0.03178819
Assumption of homoscedasticity is not met (We can reject H0, p = 0.032).
#Calculating cooks distances (units with high impact)
mydata$CooksD <- round(cooks.distance(reg_model), 3)
hist(mydata$CooksD, main = "Distribution of Cooks distances")
head(mydata[order(-mydata$CooksD),c(1,10)],10)
## Name CooksD
## 169 Jorge Molina 0.043
## 11 Sergio Ramos 0.041
## 2222 Nino 0.036
## 124 Juan Mata 0.033
## 1920 Rubén Castro 0.027
## 859 Llorente 0.025
## 23 Pedro 0.021
## 461 Ansu Fati 0.021
## 585 Diego Castro 0.021
## 920 Brahim 0.021
We can detect some outliers.
# Removing outliers and units with high impact
mydata <- mydata[!(mydata$StdResid < -3),]
mydata <- mydata[!(mydata$StdResid > 3),]
mydata <- mydata[!(mydata$CooksD > 0.021),]
head(mydata[order(-mydata$CooksD),c(1,10)])
## Name CooksD
## 23 Pedro 0.021
## 461 Ansu Fati 0.021
## 585 Diego Castro 0.021
## 920 Brahim 0.021
## 7846 Tamayo 0.019
## 2910 Hugo Duro 0.018
reg_model <- lm(Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
data = mydata)
vif(reg_model)
## Age Pace Passing Physic PreferredFoot
## 1.620006 1.102971 1.217990 1.349680 1.016350
mean(vif(reg_model))
## [1] 1.2614
There are no strong relationships between explanatory variables - Multicolinearity assumption is met.
VIF < 5
mean(VIF) > 1
# Standardized residuals (finding outliers)
mydata$StdResid <- round(rstandard(reg_model), 3)
hist(mydata$StdResid, main = "Distribution of Standardized residuals")
# SHAPIRO-WILK NORMALITY TEST
# H0: Variable is normally distributed
# H1: Variable is NOT normally distributed
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.9964, p-value = 0.3443
There is normal distribution (We cannot reject H0: Normal distribution, p > 0.05). Regression model does not violate the normality assumption.
mydata$StdFittedValue <- scale(fitted(reg_model))
scatterplot(mydata$StdResid, mydata$StdFittedValue,
regLine = TRUE,
smooth = FALSE,
xlab = "Standardized residuals",
ylab = "Standardized fitted values")
Based on the scatterplot, we can assume that the linearity assumption is met.
# Test for homoskedasticity
ols_test_breusch_pagan(reg_model)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : Potential
## Variables: fitted values of Potential
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2.466219
## Prob > Chi2 = 0.1163174
Assumption of homoscedasticity is now met (We cannot reject H0, p > 0.05).
summary(reg_model)
##
## Call:
## lm(formula = Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0378 -2.2619 0.0081 1.9161 9.9033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.06174 1.73839 36.276 <2e-16 ***
## Age -0.81977 0.04540 -18.056 <2e-16 ***
## Pace 0.03053 0.01420 2.149 0.0321 *
## Passing 0.28696 0.01824 15.737 <2e-16 ***
## Physic 0.18530 0.01818 10.195 <2e-16 ***
## PreferredFootRight 0.70533 0.32869 2.146 0.0324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.326 on 480 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4916
## F-statistic: 94.78 on 5 and 480 DF, p-value: < 2.2e-16
\[ Potential = 63.1 - 0.82 Age + 0.03 Pace + 0.29 Passing + 0.19 Physic + 0.71 PreferredFootRight \]
Age - If age increases by 1 year, potential rating decrease on average by 0.82, assuming everything else remains unchanged.
Pace - If pace rating increases by 1, potential rating increase on average by 0.03, assuming everything else remains unchanged.
Passing - If passing rating increases by 1, potential rating increase on average by 0.29, assuming everything else remains unchanged.
Physic - If physic rating increases by 1, potential rating increase on average by 0.19, assuming everything else remains unchanged.
PrefferedFoot - Given the values of all other variables, players who prefer right foot have on average 0.71 more potential then those, who prefer left.
The adjusted R-squared indicates that 49.2% of the variability of dependent variable, is explained by linear effect of all explanatory variables.
sqrt(summary(reg_model)$r.squared)
## [1] 0.7048447
\[ sqrt(R^{2}) = 0.70 \]
The maximum degree of linear relationship between the variable Potential, and independent variables is 0.70 - fairly strong relationship.
H0: \(ρ^{2} = 0\)
H1: \(ρ^{2} > 0\)
We can reject H0 at p < 0.001.
We can conclude that our model is valid because we proved a relationship between the dependent variable and at least one explanatory variable.