Dataset overview

The FIFA series is a football video game series that was officially renamed EA Sports FC from the 2023/24 season, starting with EA Sports FC 24, after the collaboration with FIFA ended.

This project uses the database tool PostgreSQL and the statistical language R to analyze the FIFA players dataset available at this link. The dataset includes detailed information about players, such as age, overall and potential ratings, club and league name, positions, and many others, from the latest 10 editions of the game (from FIFA 15 to EAFC 24).

Introduction of the project

The first part investigates the relationship between the average potential of football players from each country with more than 1000 players and factors such as their average overall rating and the number of players from that country. The dataset used is queried from this source.

The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.

Load data

library(tidyverse)
library(htmltools)
library(vtable)
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")

Regression model choice

We choose a multiple linear regression model for this data set and fit it in R, with ‘avgpot’ as the response and ‘avgrating’, ‘cnt’, and their interaction as predictors.

Data cleaning

The dataset is cleaned and prepared in PostgreSQL so that it is suitable for data analysis and contains no missing values.
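The cleaning itself was done in PostgreSQL and the query is not shown here. As a rough sketch of an equivalent step in base R, the snippet below drops incomplete rows from a toy player-level table and aggregates it to one row per country. The `players` data frame, its column names, and its values are hypothetical stand-ins, not the real EAFC data.

```r
# Toy player-level rows; names and values are hypothetical stand-ins.
players <- data.frame(
  country   = c("A", "A", "B", "B", "B", "C"),
  overall   = c(70, 74, 65, NA, 69, 80),
  potential = c(75, 78, 70, 72, 71, 82)
)

# Drop every row that has a missing value before aggregating.
clean <- players[complete.cases(players), ]

# One row per country: average potential, average rating, player count.
country_tbl <- data.frame(
  country   = sort(unique(clean$country)),
  avgpot    = as.numeric(tapply(clean$potential, clean$country, mean)),
  avgrating = as.numeric(tapply(clean$overall, clean$country, mean)),
  cnt       = as.integer(tapply(clean$overall, clean$country, length))
)

# Confirm that the aggregated table has no missing values.
anyNA(country_tbl)
```

The result has the same three numeric columns as the country-level table analyzed in this report.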

t-Test

In multiple regression, a t-test is applied to assess the significance of each predictor variable. The t-test measures whether the coefficient of a particular predictor is significantly different from zero, given the other predictors in the model. The decision is made by comparing the p-value with the significance level α = .05. If the t-value of a predictor is sufficiently large in absolute value, so that its p-value falls below α, the predictor contributes significantly to the model.
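As a minimal sketch of how the t-value and p-value relate, the snippet below recomputes both from an estimate and its standard error. The inputs are the rounded ‘avgrating’ and ‘cnt’ entries of the coefficient table reported in this document, with its 184 residual degrees of freedom; the helper name `t_test_coef` is hypothetical.

```r
# Hypothetical helper: two-sided t-test of H0: coefficient = 0.
t_test_coef <- function(estimate, se, df, alpha = 0.05) {
  t_value <- estimate / se                     # test statistic
  p_value <- 2 * pt(-abs(t_value), df = df)    # two-sided p-value
  list(t = t_value, p = p_value, significant = p_value < alpha)
}

# Rounded values copied from the lm() summary of this report.
res_rating <- t_test_coef(6.225e-01, 4.588e-02, df = 184)
res_cnt    <- t_test_coef(-2.263e-03, 1.487e-03, df = 184)

res_rating$significant
res_cnt$significant
```

Because the inputs are rounded, the recomputed values can differ from the printed summary in the last digit.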

Descriptive statistics and the fitted model

summary(country.and.pot)

##    country              avgpot        avgrating          cnt         
##  Length:188         Min.   :58.00   Min.   :54.00   Min.   :    1.0  
##  Class :character   1st Qu.:67.88   1st Qu.:63.27   1st Qu.:   20.5  
##  Mode  :character   Median :69.90   Median :65.60   Median :   91.0  
##                     Mean   :69.55   Mean   :65.10   Mean   :  957.6  
##                     3rd Qu.:71.72   3rd Qu.:67.50   3rd Qu.:  795.8  
##                     Max.   :74.60   Max.   :70.70   Max.   :16415.0

fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
        data = country.and.pot)
summary(fit)

## 
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9651 -1.0966  0.1905  1.0176  8.9024 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.886e+01  2.985e+00   9.668   <2e-16 ***
## avgrating      6.225e-01  4.588e-02  13.568   <2e-16 ***
## cnt           -2.263e-03  1.487e-03  -1.522    0.130    
## avgrating:cnt  3.704e-05  2.262e-05   1.638    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.891 on 184 degrees of freedom
## Multiple R-squared:  0.5769, Adjusted R-squared:   0.57 
## F-statistic: 83.64 on 3 and 184 DF,  p-value: < 2.2e-16

‘avgpot’ is larger than ‘avgrating’ in every summary statistic (minimum, quartiles, mean, and maximum).
Both ‘avgpot’ and ‘avgrating’ have outliers at the low end of their distributions, while ‘cnt’ is strongly right-skewed (mean 957.6 versus median 91.0) and has its outliers near the maximum value.
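The report does not say which rule was used to identify outliers. As one possible check, and only as an assumption, the sketch below applies Tukey's 1.5 × IQR fences (the rule behind boxplot whiskers) to the quartiles printed in the summary above. The helper name `tukey_fences` is hypothetical.

```r
# Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from summary(country.and.pot).
f_pot    <- tukey_fences(67.88, 71.72)
f_rating <- tukey_fences(63.27, 67.50)
f_cnt    <- tukey_fences(20.5, 795.8)

# Compare the reported minima and maxima with the fences.
low_outlier_pot    <- 58.00 < f_pot[["lower"]]
low_outlier_rating <- 54.00 < f_rating[["lower"]]
high_outlier_cnt   <- 16415.0 > f_cnt[["upper"]]
```

Under this rule the minima of ‘avgpot’ and ‘avgrating’ and the maximum of ‘cnt’ all fall outside the fences, which agrees with the description above.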

Data visualization (selected)

# Create a scatterplot with a fitted linear reference line
country.and.pot %>% ggplot( aes(x = avgrating, y = avgpot)) +
  geom_point() +  # Add points for the data
  # The formula must use the aesthetic names x and y, not the column names
  geom_smooth(method='lm', formula= y ~ x) +
  #geom_abline(intercept = intercept, slope = slope, color = "red") +  #  Add the automatic reference line
  labs(title = "Scatterplot with Automatic Reference Line",x = 'overall', y = 'pot')

country.and.pot %>% ggplot(aes( x = avgrating)) + geom_boxplot(fill = "dark gray" )+ labs(title= "Boxplot of avgrating" , x = 'overall')

country.and.pot %>% ggplot(aes( x = avgrating)) + geom_histogram(fill = "dark gray" )+ labs(title= "Distribution of avgrating" , x = 'overall')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

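There is a roughly linear relationship, with slope about 0.62, between the average rating and the average potential of players across countries.
Setting the ‘cnt’ terms aside, the fitted line is approximately y = 28.86 + 0.62x, where y is average potential and x is average overall rating, so it is flatter than the line y = x.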
It is difficult to find a clear pattern between potential and count.

Confidence Interval of ‘avgrating’

t_value = qt(.975 , df = 184)
se = 4.588e-02
beta_hat = 6.225e-01 #avgrating
c( beta_hat - t_value * se ,
beta_hat + t_value * se )

## [1] 0.5319815 0.7130185

se2 =   2.262e-05
beta_hat2 = 3.704e-05 #avgrating:cnt
c( beta_hat2 - t_value * se2,
beta_hat2 + t_value * se2 )

## [1] -7.587915e-06  8.166792e-05
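The hand calculation above can also be delegated to `confint()`, which avoids copying rounded values from the printed summary. The sketch below checks that both routes agree; it uses the built-in `mtcars` data purely as a stand-in with the same model structure (two predictors plus their interaction), not the EAFC data.

```r
# mtcars is a built-in stand-in data set, not the EAFC data.
fit_demo <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Manual 95% intervals: estimate +/- t quantile * standard error.
est     <- coef(fit_demo)
se      <- sqrt(diag(vcov(fit_demo)))
t_value <- qt(0.975, df = df.residual(fit_demo))
ci_manual <- cbind(lower = est - t_value * se,
                   upper = est + t_value * se)

# Built-in equivalent.
ci_builtin <- confint(fit_demo, level = 0.95)

# Largest absolute difference between the two results.
max(abs(ci_manual - ci_builtin))
```

For the model in this report, the analogous call would be `confint(fit)`.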

‘avgrating’ should be kept in the model: its p-value (< 2e-16) is far below the level .05 and its 95% confidence interval (0.532, 0.713) excludes 0, so there is strong evidence against the null hypothesis that its coefficient is 0.
For ‘avgrating:cnt’ we fail to reject the null hypothesis: its p-value (0.103) is higher than .05 and its 95% C.I. (-7.587915e-06, 8.166792e-05) contains 0.

F Test on multiple linear regression

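The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors, while the reduced model drops the predictors being tested. The null hypothesis is that all dropped coefficients are 0, and the alternative is that at least one of them is non-zero. The test compares the residual sums of squares of the two models: a large F value means the dropped predictors explain a significant share of the variation.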

fit_full = lm ( avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot )
fit_reduced = lm (avgpot ~ avgrating + cnt:avgrating, data = country.and.pot )
anova ( fit_reduced , fit_full)

fit_reduced1 = lm (avgpot ~  cnt + cnt:avgrating , data = country.and.pot )
anova ( fit_reduced1 , fit_full)
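To make explicit what `anova()` computes when it compares a reduced model with a full model, the sketch below rebuilds the partial F statistic from the residual sums of squares. It uses the built-in `mtcars` data as a stand-in for the EAFC data; the model formulas only mirror the structure used above.

```r
# Stand-in models on mtcars, mirroring the full/reduced structure above.
full    <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
reduced <- lm(mpg ~ wt + wt:hp, data = mtcars)  # drops the hp main effect

# Residual sums of squares and residual degrees of freedom.
rss_full <- sum(resid(full)^2)
rss_red  <- sum(resid(reduced)^2)
df_full  <- df.residual(full)
df_red   <- df.residual(reduced)

# Partial F statistic and its p-value.
f_manual <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_manual <- pf(f_manual, df_red - df_full, df_full, lower.tail = FALSE)

# The same quantities from anova(), and the t value of the dropped term.
tab  <- anova(reduced, full)
t_hp <- summary(full)$coefficients["hp", "t value"]
```

When only one coefficient is dropped, the F statistic equals the square of that coefficient's t value in the full model, which is why the F-test and the t-test lead to the same conclusion.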

The F-test gives the same result as the t-test: there is strong evidence against the null hypothesis that the coefficient of ‘avgrating’ is 0.

Regression diagnostics

We now plot the residuals versus the fitted values, a QQ plot, and other diagnostic plots for this regression model fit.

It is important to check the regression modeling assumptions. If they fail to hold, the coefficient estimates may be biased, and the standard errors, t-tests, and confidence intervals that rely on constant variance and normally distributed errors may be unreliable.

fitted_values = fitted(fit)
residual_values = resid(fit)

sresidual_values = rstandard(fit)
     
plot(fitted_values, sresidual_values, 
     main = "EAFC 24: fitted versus residual values", 
     xlab = "Fitted", 
     ylab = "Standardized residuals")

hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")

plot(fit, which=1)

plot(fit, which = 2)

The residual plot looks like a null plot with no systematic pattern: the reference line is approximately horizontal and passes through 0, so the standardized residuals appear to have constant mean 0, roughly unit variance, and an approximately normal N(0, 1) distribution.
The QQ plot also follows approximately a straight line, except at the two ends.
No further action is required, as there is no strong evidence against the assumptions.
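For comparison, the sketch below shows what standardized residuals look like when the assumptions hold by construction. The data are simulated and purely illustrative; the coefficients, sample size, and noise level are arbitrary assumptions and are unrelated to the EAFC fit.

```r
# Simulated data in which the linear model assumptions hold by construction.
set.seed(1)
n <- 200
x <- runif(n, 55, 70)
y <- 30 + 0.6 * x + rnorm(n, sd = 2)

fit_sim <- lm(y ~ x)
z <- rstandard(fit_sim)   # standardized residuals

# Numeric counterparts of the visual checks: mean near 0, standard deviation
# near 1, and most values within two units of 0.
diag_summary <- c(mean = mean(z), sd = sd(z), share_within_2 = mean(abs(z) < 2))
diag_summary
```

Applying the same three summaries to `rstandard(fit)` gives a quick numeric complement to the plots above.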

Analyzing the EAFC 24 players

Yue Yu