The FIFA series is a football video game series that was officially renamed EA Sports FC from the 2023/24 season (starting with EA Sports FC 24), following the end of the collaboration with FIFA.
This project uses the database tool PostgreSQL and the statistical language R to analyze the FIFA players dataset available at this link. The dataset includes detailed information about players, such as age, overall and potential attributes, club and league name, positions, and many others, from the latest 10 editions of the game (FIFA 15 through EAFC 24). Before addressing specific questions, the dataset is cleaned and manipulated in PostgreSQL to make sure it is suitable for data analysis.
The first part investigates, for each country with more than 1,000 players, the relationship between the players' average potential and factors such as their average overall rating and the number of players from that country.
The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.
library(tidyverse)
library(htmltools)
library(vtable)
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")
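The country-level table loaded above has one row per nationality, with the columns cnt, avgrating and avgpot, and was prepared in PostgreSQL. A minimal dplyr sketch of the same kind of aggregation, run on made-up toy rows, might look like the following. The column names overall and potential, the toy data and the toy threshold are assumptions for illustration; the project's actual query is not shown here.

```r
library(dplyr)

# Toy player-level rows (hypothetical values, NOT the real dataset)
players <- tibble(
  nationality_name = c("A", "A", "A", "B", "B", "C"),
  overall          = c(70, 72, 68, 60, 64, 80),
  potential        = c(75, 74, 73, 70, 66, 85)
)

# The project keeps countries with more than 1000 players;
# the toy data uses a much smaller threshold.
min_players <- 2

country_summary <- players %>%
  group_by(nationality_name) %>%
  summarise(
    cnt       = n(),             # number of players per country
    avgrating = mean(overall),   # average overall rating
    avgpot    = mean(potential)  # average potential
  ) %>%
  filter(cnt > min_players)

print(country_summary)
```

Only country A has more than two players in the toy data, so it is the only row kept.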
We fit a multiple linear regression model, including an interaction between average rating and player count, to the country-level EAFC data in R.
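Written out, the model fitted by the lm() call in this section can be sketched as follows, where i indexes the 41 countries in the table. The symbols β and ε are notation introduced here for illustration and do not appear in the original code.

```latex
\[
\text{avgpot}_i = \beta_0 + \beta_1\,\text{avgrating}_i + \beta_2\,\text{cnt}_i
  + \beta_3\,(\text{avgrating}_i \times \text{cnt}_i) + \varepsilon_i,
\qquad i = 1, \dots, 41
\]
```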
In multiple regression, a t-test is used to assess the significance of each predictor. It tests whether the coefficient of a particular predictor is significantly different from zero, given the other predictors in the model. The decision is made by comparing the p-value with the significance level α = .05: if the absolute t-value of a predictor is sufficiently large (equivalently, if its p-value is below α), the predictor contributes significantly to the model.
attach(country.and.pot)
summary(country.and.pot)
##  nationality_name        cnt          avgrating         avgpot
##  Length:41          Min.   : 1004   Min.   :58.97   Min.   :62.48
##  Class :character   1st Qu.: 1524   1st Qu.:63.47   1st Qu.:69.48
##  Mode  :character   Median : 3040   Median :65.58   Median :70.70
##                     Mean   : 3841   Mean   :65.47   Mean   :70.59
##                     3rd Qu.: 3829   3rd Qu.:67.46   3rd Qu.:72.60
##                     Max.   :16415   Max.   :70.68   Max.   :74.62
fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
data = country.and.pot)
summary(fit)
##
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.90000 -0.34841  0.02377  0.68255  1.37060
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.491e+00  5.319e+00   0.656   0.5157
## avgrating      1.023e+00  8.122e-02  12.600 5.93e-15 ***
## cnt            1.986e-03  1.051e-03   1.890   0.0667 .
## avgrating:cnt -2.988e-05  1.599e-05  -1.869   0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8599 on 37 degrees of freedom
## Multiple R-squared: 0.9023, Adjusted R-squared: 0.8944
## F-statistic: 113.9 on 3 and 37 DF, p-value: < 2.2e-16
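As a quick sanity check on the coefficient table above, the t value and p-value for cnt can be reproduced from its estimate and standard error. The sketch below uses the rounded figures printed by summary(fit), so it should agree with the reported values only up to rounding.

```r
# Values copied from the summary(fit) output for the cnt coefficient
estimate <- 1.986e-03
std_error <- 1.051e-03
df_resid <- 37            # residual degrees of freedom reported by summary(fit)

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 37 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

print(t_stat)   # should be close to the reported t value of 1.890
print(p_value)  # should be close to the reported p-value of 0.0667
```

Since the p-value is above .05, cnt is not significant at the 5% level, although it is below the 10% level marked by "." in the output.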
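# Average potential against average overall rating, one point per country
plot(country.and.pot$avgrating, country.and.pot$avgpot,
main = "Overall versus Potential",
xlab = "Overall", ylab = "Potential")
# Number of players against average overall rating, one point per country
plot(country.and.pot$avgrating, country.and.pot$cnt,
main = "Overall versus number of players from the country or region",
xlab = "Overall", ylab = "Count")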
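# 95% confidence interval for the avgrating coefficient:
# estimate +/- t(0.975, df = 37) * standard error, with values taken from summary(fit)
t_value = qt(.975 , df = 37)
se = 8.122e-02
beta_hat = 1.023e+00 #avgrating
c( beta_hat - t_value * se ,
beta_hat + t_value * se )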
## [1] 0.8584326 1.1875674
# 95% confidence interval for the avgrating:cnt interaction coefficient
se2 = 1.599e-05
beta_hat2 = -2.988e-05 #avgrating:cnt
c( beta_hat2 - t_value * se2,
beta_hat2 + t_value * se2 )
## [1] -6.227882e-05 2.518817e-06
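The intervals above are built by hand from values copied out of the summary table. As a cross-check, R's confint() computes the same t-based intervals directly from a fitted lm object. The sketch below uses the built-in mtcars data purely as a stand-in, since the country-level CSV is not reproduced here; only the shape of the model formula mirrors the one in this section.

```r
# Stand-in data: mtcars ships with R and is NOT the EAFC dataset
fit_demo <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Manual intervals: estimate +/- critical t value * standard error
coefs <- summary(fit_demo)$coefficients
t_crit <- qt(0.975, df = df.residual(fit_demo))
manual_ci <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)

# Built-in equivalent
builtin_ci <- confint(fit_demo, level = 0.95)

print(manual_ci)
print(builtin_ci)
```

On the project data, confint(fit) would avoid retyping rounded estimates and standard errors.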
The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors, and the reduced model omits the group of predictors being tested. The null hypothesis is that all omitted coefficients are zero; a small p-value indicates that at least one of them is non-zero, so the reduced model fits significantly worse than the full model.
# Full model: both main effects and their interaction
fit_full = lm(avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
# Reduced model without the cnt main effect
fit_reduced = lm(avgpot ~ avgrating + cnt:avgrating, data = country.and.pot)
anova(fit_reduced, fit_full)
# Reduced model without avgrating and the interaction
fit_reduced1 = lm(avgpot ~ cnt, data = country.and.pot)
anova(fit_reduced1, fit_full)
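The F statistic reported by anova() for two nested models can also be computed by hand from the residual sums of squares, which makes the comparison explicit. The sketch below again uses the built-in mtcars data as a stand-in for the country-level table; the names ending in _demo are hypothetical and not part of the project code.

```r
# Stand-in data: mtcars is NOT the EAFC dataset
fit_full_demo <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
fit_reduced_demo <- lm(mpg ~ wt, data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss_full <- deviance(fit_full_demo)
rss_reduced <- deviance(fit_reduced_demo)
df_full <- df.residual(fit_full_demo)
df_reduced <- df.residual(fit_reduced_demo)

# Partial F statistic: drop in RSS per dropped parameter,
# relative to the full model's residual variance
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# Built-in comparison of the same two models
anova_table <- anova(fit_reduced_demo, fit_full_demo)
print(c(F = f_stat, p = p_value))
print(anova_table)
```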
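We now plot the standardized residuals against the fitted values, together with a histogram of the standardized residuals, the residuals-versus-fitted plot and a normal Q-Q plot, to check the assumptions of this regression model.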
fitted_values = fitted(fit)
residual_values = resid(fit)
sresidual_values = rstandard(fit)
plot(fitted_values, sresidual_values,
main = "EAFC 24: fitted versus residual values",
xlab = "Fitted",
ylab = "Standardized residuals")
hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")
plot(fit, which=1)
plot(fit, which = 2)
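The standardized residuals used in the plots above are the raw residuals divided by an estimate of their standard deviation, which depends on each observation's leverage. A minimal sketch of that relationship, using the built-in mtcars data as a stand-in for the country-level table, is:

```r
# Stand-in data: mtcars is NOT the EAFC dataset
fit_demo <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

sigma_hat <- summary(fit_demo)$sigma  # residual standard error
leverage <- hatvalues(fit_demo)       # diagonal of the hat matrix

# Standardized residual: e_i / (sigma_hat * sqrt(1 - h_ii))
manual_std <- resid(fit_demo) / (sigma_hat * sqrt(1 - leverage))

# Largest absolute difference from rstandard() (expected to be numerically zero)
print(max(abs(manual_std - rstandard(fit_demo))))
```

Because they are on a common scale, standardized residuals beyond roughly ±2 are a common rule-of-thumb flag for observations worth a closer look.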
TBD