Introduction

FIFA is a series of football video games that was officially renamed EA Sports FC from the 2023/24 season, starting with EA Sports FC 24, after the collaboration with FIFA ended.

This project uses the database tool PostgreSQL and the statistical language R to analyze the FIFA players dataset available at this link. The dataset contains detailed information about players, such as age, overall and potential ratings, club and league name, positions, and many other attributes, from the latest 10 editions of the game (FIFA 15 to EAFC 24). Before addressing specific questions, the dataset is cleaned and reshaped in PostgreSQL so that it is suitable for data analysis.

The first part investigates, for each country with more than 1000 players, the relationship between the average potential of its players and factors such as their average overall rating and the number of players from that country.

The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.

Load data

library(tidyverse)
library(htmltools)
library(vtable)
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")

Regression model choice

We fit a multiple linear regression model to the country-level EAFC data in R, with average potential as the response and average overall rating, player count, and their interaction as predictors.
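Written out, the model that corresponds to the formula `avgpot ~ avgrating + cnt + cnt:avgrating` used in the `lm` call in this section can be sketched as follows. The symbols β and ε are notation assumed here for illustration, and i indexes the 41 countries in the data.

```latex
\[
\text{avgpot}_i = \beta_0
  + \beta_1\,\text{avgrating}_i
  + \beta_2\,\text{cnt}_i
  + \beta_3\,(\text{avgrating}_i \times \text{cnt}_i)
  + \varepsilon_i ,
\qquad i = 1, \dots, 41 .
\]
```

Here β₃ is the coefficient of the interaction term, reported by R as `avgrating:cnt`.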

t-Test

In multiple regression, a t-test is applied to assess the significance of each predictor variable. It tests whether the coefficient of a particular predictor differs significantly from zero, given the other predictors in the model. In practice, we compare the p-value with the significance level α = .05: a t-value that is sufficiently large in absolute value gives a p-value below α, which suggests that the predictor contributes significantly to the model.

attach(country.and.pot)
summary(country.and.pot)

##  nationality_name        cnt          avgrating         avgpot     
##  Length:41          Min.   : 1004   Min.   :58.97   Min.   :62.48  
##  Class :character   1st Qu.: 1524   1st Qu.:63.47   1st Qu.:69.48  
##  Mode  :character   Median : 3040   Median :65.58   Median :70.70  
##                     Mean   : 3841   Mean   :65.47   Mean   :70.59  
##                     3rd Qu.: 3829   3rd Qu.:67.46   3rd Qu.:72.60  
##                     Max.   :16415   Max.   :70.68   Max.   :74.62

fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
        data = country.and.pot)
summary(fit)

## 
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90000 -0.34841  0.02377  0.68255  1.37060 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.491e+00  5.319e+00   0.656   0.5157    
## avgrating      1.023e+00  8.122e-02  12.600 5.93e-15 ***
## cnt            1.986e-03  1.051e-03   1.890   0.0667 .  
## avgrating:cnt -2.988e-05  1.599e-05  -1.869   0.0695 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8599 on 37 degrees of freedom
## Multiple R-squared:  0.9023, Adjusted R-squared:  0.8944 
## F-statistic: 113.9 on 3 and 37 DF,  p-value: < 2.2e-16
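The `t value` and `Pr(>|t|)` columns above can be reproduced by hand from the estimate and its standard error. The sketch below does this for `cnt`, using the rounded values printed in the summary, so the result may differ slightly in the last digit from what R computes at full precision. The variable names are chosen here for illustration.

```r
# Estimate and standard error for 'cnt', copied from the summary(fit) output
estimate  <- 1.986e-03
std_error <- 1.051e-03
df_resid  <- 37      # residual degrees of freedom reported by summary(fit)
alpha     <- 0.05    # significance level used in this report

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

c(t = t_stat, p = p_value)
p_value < alpha   # FALSE: 'cnt' is not significant at the 5% level
```

The same calculation applies to every row of the coefficient table.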

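# Average potential against average overall rating, one point per country
plot(country.and.pot$avgrating, country.and.pot$avgpot,
     main = "Overall versus Potential",
     xlab = "Overall", ylab = "Potential")

# Number of players against average overall rating; axis labels match the plotted variables
plot(country.and.pot$avgrating, country.and.pot$cnt,
     main = "Overall versus number of players from the country or region",
     xlab = "Overall", ylab = "Count")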

Average potential is roughly linear in average rating across the 41 countries, with a fitted slope of 1.023.
The points lie close to a straight line parallel to y = x, where y is potential and x is overall rating; the line sits about 5 points above y = x (mean potential 70.59 versus mean rating 65.47).
The second plot shows no clear pattern between player count and average overall rating.
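Because the fitted model includes an interaction term, the slope of average potential with respect to average rating depends on the player count rather than being a single number. As a short worked example, using the rounded estimates from the summary output and the median count of 3040 (the β notation is assumed here for illustration):

```latex
\[
\frac{\partial\,\widehat{\text{avgpot}}}{\partial\,\text{avgrating}}
  = \hat\beta_1 + \hat\beta_3\,\text{cnt}
  = 1.023 - 2.988 \times 10^{-5} \times 3040
  \approx 0.932 .
\]
```

Since the interaction coefficient is not significant at the 5% level, this dependence on the count should be read with caution.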

Confidence intervals for ‘avgrating’ and ‘avgrating:cnt’

t_value = qt(.975 , df = 37)
se = 8.122e-02
beta_hat = 1.023e+00 #avgrating
c( beta_hat - t_value * se ,
beta_hat + t_value * se )

## [1] 0.8584326 1.1875674

se2 = 1.599e-05
beta_hat2 = -2.988e-05 #avgrating:cnt
c( beta_hat2 - t_value * se2,
beta_hat2 + t_value * se2 )

## [1] -6.227882e-05  2.518817e-06

The p-value for ‘avgrating’ (5.93e-15) is far below the .05 level and its 95% confidence interval (0.8584326, 1.1875674) excludes 0, so there is strong evidence against the null hypothesis that its coefficient is zero; any reduced model must keep ‘avgrating’.
For “avgrating:cnt” no such conclusion can be drawn: its p-value (0.0695) is higher than .05 and its 95% confidence interval (-6.227882e-05, 2.518817e-06) contains 0, so we cannot reject the null hypothesis that its coefficient is zero.
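The two interval calculations above follow the same pattern and could be wrapped in a small function. The sketch below is one way to do it; `wald_ci` and `contains_zero` are hypothetical helper names introduced here, not part of the original analysis, and the inputs are the rounded values from the summary output.

```r
# Hypothetical helper: two-sided t-based confidence interval
# computed from an estimate and its standard error.
wald_ci <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = df)
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Hypothetical helper: TRUE when the interval contains 0
contains_zero <- function(ci) unname(ci["lower"] <= 0 && ci["upper"] >= 0)

ci_avgrating   <- wald_ci(1.023e+00, 8.122e-02, df = 37)
ci_interaction <- wald_ci(-2.988e-05, 1.599e-05, df = 37)

contains_zero(ci_avgrating)    # FALSE
contains_zero(ci_interaction)  # TRUE
```

When the fitted model object is available, `confint(fit)` from the stats package computes t-based intervals for all coefficients at full precision, which avoids copying rounded numbers by hand.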

F Test on multiple linear regression

The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors; the reduced model omits the group of predictors being tested. The null hypothesis is that the coefficients of all omitted predictors are zero, and a small p-value indicates that at least one of them is non-zero.
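For reference, the statistic behind this comparison of nested models can be written as follows. The notation is assumed here for illustration: RSS is the residual sum of squares, df the residual degrees of freedom, and the subscripts R and F denote the reduced and full models.

```latex
\[
F = \frac{(\mathrm{RSS}_R - \mathrm{RSS}_F) / (\mathrm{df}_R - \mathrm{df}_F)}
         {\mathrm{RSS}_F / \mathrm{df}_F}
\;\sim\; F_{\,\mathrm{df}_R - \mathrm{df}_F,\ \mathrm{df}_F}
\quad \text{under the null hypothesis.}
\]
```

For the full model in this report, the residual degrees of freedom are 37, as shown in the summary output.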

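# Full model: both main effects plus their interaction
fit_full = lm(avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
# Reduced model without the main effect of cnt
fit_reduced = lm(avgpot ~ avgrating + cnt:avgrating, data = country.and.pot)
anova(fit_reduced, fit_full)

# Reduced model without avgrating and its interaction with cnt
fit_reduced1 = lm(avgpot ~ cnt, data = country.and.pot)
anova(fit_reduced1, fit_full)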

The F-test leads to the same conclusion as the t-test: there is strong evidence against the null hypothesis that the coefficient of ‘avgrating’ is 0.

Regression diagnostics

We now plot the standardized residuals against the fitted values, a histogram of the standardized residuals, and R's built-in residuals-versus-fitted and normal Q-Q plots for this regression fit.

fitted_values = fitted(fit)
residual_values = resid(fit)

sresidual_values = rstandard(fit)
     
plot(fitted_values, sresidual_values, 
     main = "EAFC 24: fitted versus residual values", 
     xlab = "Fitted", 
     ylab = "Standardized residuals")

hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")

plot(fit, which=1)

plot(fit, which = 2)

Conclusions

TBD

Analyzing the EAFC 24 players

Yue Yu