Dataset overview

The FIFA series is a football video game series that was officially renamed EA Sports FC from the 2023/24 season, starting with EA Sports FC 24, following the termination of the collaboration with FIFA.

This project aims to use the database tool PostgreSQL and the statistical analysis language R to analyze the FIFA players dataset available at this link. The dataset includes detailed information about players, such as age, overall and potential attributes, club and league name, positions, and many others, from the latest 10 editions of the game (from FIFA 15 to EAFC 24).

Introduction of the project

The first part investigates the relationship between football players’ average potential, for each country with more than 1000 players, and factors such as their average overall rating and the number of players from the country. The dataset used was queried from this source.

The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.

Load data

library(tidyverse)
library(htmltools)
library(vtable)
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")

Regression model choice

We choose a multiple linear regression model for the dataset and fit it to the country-level EAFC data using R.
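Written out, and assuming the predictors used in the `lm()` call later in this report (avgrating, cnt, and their interaction), the model for country i would take roughly this form. The symbols β and ε are notation introduced here for illustration, not taken from the original analysis.

```latex
\[
\text{avgpot}_i = \beta_0 + \beta_1\,\text{avgrating}_i + \beta_2\,\text{cnt}_i
  + \beta_3\,(\text{avgrating}_i \times \text{cnt}_i) + \varepsilon_i
\]
```

Here ε_i is the error term, which linear regression assumes to be independent with mean zero and constant variance.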

Data cleaning

The dataset is cleaned and manipulated in PostgreSQL to make sure it is appropriate for data analysis and contains no missing values.
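The absence of missing values can also be double-checked on the R side. The sketch below uses a small hypothetical data frame, `toy`, with made-up values and the same column names as the loaded table. On the real data the same calls would be applied to `country.and.pot`.

```r
# Hypothetical stand-in for the cleaned table (values are made up for illustration).
toy <- data.frame(
  country   = c("A", "B", "C"),
  avgpot    = c(70.1, 68.4, 72.0),
  avgrating = c(65.2, 63.9, 67.5),
  cnt       = c(120L, 15L, 900L)
)

# Count missing values per column; every count should be zero after cleaning.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Stop with an error if any missing value slipped through.
stopifnot(!anyNA(toy))
```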

t-Test

In multiple regression, a t-test is applied to assess the significance of each predictor variable. The test measures whether the coefficient of a particular predictor is significantly different from zero, by comparing its p-value with the significance level α = .05. If the absolute t-value associated with a predictor is sufficiently large (equivalently, if its p-value is below α), the predictor contributes significantly to the model given the other predictors.
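As a minimal sketch of how these t-values and p-values arise, the snippet below recomputes them from the estimates and standard errors printed in the `summary(fit)` output of this report. The numbers are copied by hand, so small rounding differences from the printed table are expected. The variable names are chosen here for illustration.

```r
# Estimates and standard errors copied from the summary(fit) output in this report.
est <- c(avgrating = 6.225e-01, cnt = -2.263e-03, "avgrating:cnt" = 3.704e-05)
se  <- c(avgrating = 4.588e-02, cnt =  1.487e-03, "avgrating:cnt" = 2.262e-05)
df_resid <- 184   # residual degrees of freedom reported by summary(fit)
alpha    <- 0.05  # significance level

# t statistic: estimate divided by its standard error.
t_value <- est / se

# Two-sided p-value from the t distribution with df_resid degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(data.frame(t_value, p_value, significant = p_value < alpha))
```

With these inputs only `avgrating` falls below α, which matches the significance stars in the model summary.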

Descriptive statistics and the fitted model

summary(country.and.pot)
##    country              avgpot        avgrating          cnt         
##  Length:188         Min.   :58.00   Min.   :54.00   Min.   :    1.0  
##  Class :character   1st Qu.:67.88   1st Qu.:63.27   1st Qu.:   20.5  
##  Mode  :character   Median :69.90   Median :65.60   Median :   91.0  
##                     Mean   :69.55   Mean   :65.10   Mean   :  957.6  
##                     3rd Qu.:71.72   3rd Qu.:67.50   3rd Qu.:  795.8  
##                     Max.   :74.60   Max.   :70.70   Max.   :16415.0
fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
        data = country.and.pot)
summary(fit)
## 
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9651 -1.0966  0.1905  1.0176  8.9024 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.886e+01  2.985e+00   9.668   <2e-16 ***
## avgrating      6.225e-01  4.588e-02  13.568   <2e-16 ***
## cnt           -2.263e-03  1.487e-03  -1.522    0.130    
## avgrating:cnt  3.704e-05  2.262e-05   1.638    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.891 on 184 degrees of freedom
## Multiple R-squared:  0.5769, Adjusted R-squared:   0.57 
## F-statistic: 83.64 on 3 and 184 DF,  p-value: < 2.2e-16
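Because the model contains an interaction, the slope of `avgrating` is not a single number: it is the `avgrating` coefficient plus the interaction coefficient times `cnt`. The sketch below evaluates that expression with the coefficients printed above at a few `cnt` values taken from the data summary. The helper name `slope_at` is hypothetical, and since the interaction term is not significant at the .05 level (p = 0.103), the variation in slope should be read with caution.

```r
# Coefficients copied from the summary(fit) output above.
b_avgrating   <- 6.225e-01
b_interaction <- 3.704e-05

# Hypothetical helper: estimated change in avgpot per unit of avgrating at a given cnt.
slope_at <- function(cnt) b_avgrating + b_interaction * cnt

# Evaluate at cnt = 0, the median, the third quartile and the maximum of cnt.
cnt_values <- c(0, 91, 795.8, 16415)
print(data.frame(cnt = cnt_values, slope = slope_at(cnt_values)))
```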

Data visualization (selected)

# Create a scatterplot with the automatic reference line
country.and.pot %>% ggplot(aes(x = avgrating, y = avgpot)) +
  geom_point() +  # Add points for the data
  geom_smooth(method = 'lm', formula = y ~ x) +  # formula refers to the x and y aesthetics, not to column names
  #geom_abline(intercept = intercept, slope = slope, color = "red") +  #  Add the automatic reference line
  labs(title = "Scatterplot with Automatic Reference Line", x = 'overall', y = 'pot')

country.and.pot %>% ggplot(aes( x = avgrating)) + geom_boxplot(fill = "dark gray" )+ labs(title= "Boxplot of avgrating" , x = 'overall')

country.and.pot %>% ggplot(aes( x = avgrating)) + geom_histogram(fill = "dark gray" )+ labs(title= "Distribution of avgrating" , x = 'overall')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Confidence intervals of ‘avgrating’ and ‘avgrating:cnt’

t_value = qt(.975 , df = 184)
se = 4.588e-02
beta_hat = 6.225e-01 #avgrating
c( beta_hat - t_value * se ,
beta_hat + t_value * se )
## [1] 0.5319815 0.7130185
se2 =   2.262e-05
beta_hat2 = 3.704e-05 #avgrating:cnt
c( beta_hat2 - t_value * se2,
beta_hat2 + t_value * se2 ) 
## [1] -7.587915e-06  8.166792e-05
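The same intervals can be obtained without copying numbers by hand. As a sketch, the snippet below uses R's built-in `mtcars` data (a stand-in, not the EAFC data) to show that the manual formula, estimate ± t × standard error, agrees with `confint()`. On the model in this report the equivalent call would be `confint(fit, level = 0.95)`.

```r
# Stand-in model on a built-in dataset so the example runs on its own.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Manual 95% intervals: estimate +/- critical t value * standard error.
coefs  <- summary(fit_demo)$coefficients
t_crit <- qt(0.975, df = df.residual(fit_demo))
manual_ci <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)

# Built-in equivalent.
builtin_ci <- confint(fit_demo, level = 0.95)

# Both approaches should give the same numbers.
print(all.equal(unname(manual_ci), unname(builtin_ci)))
```

Using `confint()` avoids rounding error from retyping the coefficients and keeps the intervals in sync if the model changes.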

F Test on multiple linear regression

The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors; the reduced model is obtained by removing the predictors under test. The F-test compares the two fits: the null hypothesis is that the coefficients of the removed predictors are all zero, and a small p-value indicates that at least one of them is non-zero.

# Full model: both main effects and their interaction
fit_full = lm(avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
# Reduced model without the main effect of cnt
fit_reduced = lm(avgpot ~ avgrating + cnt:avgrating, data = country.and.pot)
anova(fit_reduced, fit_full)
# Reduced model without the main effect of avgrating
fit_reduced1 = lm(avgpot ~ cnt + cnt:avgrating, data = country.and.pot)
anova(fit_reduced1, fit_full)
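To make the comparison concrete, the sketch below computes the F statistic by hand from the residual sums of squares of a reduced and a full model and checks it against `anova()`. It uses the built-in `mtcars` data as a stand-in for the EAFC data, and the model formulas are chosen only for illustration.

```r
# Stand-in nested models on a built-in dataset.
full    <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sums of squares and residual degrees of freedom.
rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
df_full     <- df.residual(full)
df_reduced  <- df.residual(reduced)

# F = (drop in RSS per removed parameter) / (residual variance of the full model).
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_val  <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# anova() reports the same statistic and p-value in its second row.
tab <- anova(reduced, full)
print(c(manual_F = f_stat, anova_F = tab$F[2]))
print(c(manual_p = p_val, anova_p = tab$`Pr(>F)`[2]))
```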

Regression diagnostics

We now plot the standardized residuals against the fitted values, a histogram of the standardized residuals, and the residuals-versus-fitted and normal QQ plots for this regression model fit.

It is important to check the regression modeling assumptions. If they fail to hold, the coefficient estimates may be biased, the error variance may not be constant, and the residuals may not be normally distributed, which makes the tests and confidence intervals unreliable.

fitted_values = fitted(fit)
residual_values = resid(fit)

sresidual_values = rstandard(fit)
     
plot(fitted_values, sresidual_values, 
     main = "EAFC 24: fitted versus residual values", 
     xlab = "Fitted", 
     ylab = "Standardized residuals")

hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")

plot(fit, which=1)

plot(fit, which = 2)
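The plots can be complemented with a couple of numeric checks. The sketch below is not part of the original analysis: it counts standardized residuals beyond ±2 and runs a Shapiro-Wilk normality test (`shapiro.test()` from base R's stats package) on a stand-in model fitted to the built-in `mtcars` data. On the model in this report, `fit` would replace `fit_demo`.

```r
# Stand-in model on a built-in dataset so the example runs on its own.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals, as used in the plots above.
std_res <- rstandard(fit_demo)

# Observations with |standardized residual| > 2 deserve a closer look.
n_large <- sum(abs(std_res) > 2)
print(n_large)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(std_res)
print(sw$p.value)
```

A formal test like this is sensitive to sample size, so it is best read together with the QQ plot instead of on its own.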