Dataset overview

The FIFA series is a long-running line of football video games; it was officially renamed EA Sports FC 24 from the 2023/24 season onwards after the collaboration with FIFA ended.

This project leverages the database tool PostgreSQL and the statistical analysis language R to analyze the FIFA players dataset available at this link. The dataset includes detailed information about players, such as age, overall and potential ratings, club and league names, positions, and many other attributes, drawn from the latest 10 editions of FIFA (from FIFA 15 to EAFC 24).

Introduction to the project

The first part investigates the relationship between the average potential of football players in each country and various factors such as their average overall rating and the count of players from that country. The dataset used here is queried from this source.

The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.

Load data

library(tidyverse)   # loads readr, dplyr, and ggplot2, all used below
library(htmltools)
# install.packages('car')
# Note: the 'car' package failed to install in this environment, so the variance
# inflation factor is computed manually later instead of using car::vif().
country.and.pot <- read_csv("/home/jovyan/EAFC24/country and pot.csv")

Regression model choice

We choose a linear regression model for this data set and fit the relevant EAFC data using R:
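
Concretely, the model fitted later (see the lm() call in the Data summary section) has the form

avgpot = β0 + β1·avgrating + β2·cnt + β3·(avgrating × cnt) + ε,

where avgpot and avgrating are a country's average potential and average overall rating, cnt is the count of players from that country, and ε is the error term.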

Data cleaning

The dataset is cleaned and manipulated in PostgreSQL to make sure it is appropriate for data analysis, ensuring there are no missing values.
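
As a quick cross-check in R (a minimal sketch, not part of the original PostgreSQL pipeline), the absence of missing values can be confirmed after loading the file:

# Count missing values per column; all counts should be zero after the PostgreSQL cleaning
colSums(is.na(country.and.pot))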

Data summary

Descriptive statistics of the data and the fitted model

summary(country.and.pot)
##    country              avgpot        avgrating          cnt         
##  Length:188         Min.   :58.00   Min.   :54.00   Min.   :    1.0  
##  Class :character   1st Qu.:67.88   1st Qu.:63.27   1st Qu.:   20.5  
##  Mode  :character   Median :69.90   Median :65.60   Median :   91.0  
##                     Mean   :69.55   Mean   :65.10   Mean   :  957.6  
##                     3rd Qu.:71.72   3rd Qu.:67.50   3rd Qu.:  795.8  
##                     Max.   :74.60   Max.   :70.70   Max.   :16415.0
fit <- lm(avgpot ~ avgrating + cnt + cnt:avgrating,
          data = country.and.pot)
summary(fit)
## 
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9651 -1.0966  0.1905  1.0176  8.9024 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.886e+01  2.985e+00   9.668   <2e-16 ***
## avgrating      6.225e-01  4.588e-02  13.568   <2e-16 ***
## cnt           -2.263e-03  1.487e-03  -1.522    0.130    
## avgrating:cnt  3.704e-05  2.262e-05   1.638    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.891 on 184 degrees of freedom
## Multiple R-squared:  0.5769, Adjusted R-squared:   0.57 
## F-statistic: 83.64 on 3 and 184 DF,  p-value: < 2.2e-16

  • avgpot is larger than avgrating at every point of the five-number summary of these two variables.
  • Outliers appear on the lower (left) side of the distributions of avgpot and avgrating, while the outlier in cnt lies near the maximum value.

Data visualization (selected)

# Create a scatterplot with a fitted least-squares reference line
country.and.pot %>% ggplot(aes(x = avgrating, y = avgpot)) +
  geom_point() +                                  # add points for the data
  geom_smooth(method = 'lm', formula = y ~ x) +   # geom_smooth expects the formula in terms of the x and y aesthetics
  labs(title = "Scatterplot with Fitted Reference Line", x = 'overall', y = 'pot')

country.and.pot %>% ggplot(aes(x = avgrating)) + geom_boxplot(fill = "darkgray") + labs(title = "Boxplot of avgrating", x = 'overall')

country.and.pot %>% ggplot(aes(x = avgrating)) + geom_histogram(fill = "darkgray") + labs(title = "Distribution of avgrating", x = 'overall')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • There is roughly a linear relationship, with slope 0.622 (the avgrating coefficient in the fitted model), between the average overall rating and the average potential of players across countries.
  • The scatterplot approximately follows a straight line close to y = x, where y is potential and x is overall rating, with potential generally above rating.
  • It is difficult to find a clear pattern between average potential and player count (a quick plot of this relationship is sketched below).
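
Since no plot of potential against player count appears above, here is a minimal sketch (not part of the original analysis) for inspecting that relationship directly; cnt is heavily right-skewed, so a log scale on the x axis makes the spread easier to see:

country.and.pot %>% ggplot(aes(x = cnt, y = avgpot)) +
  geom_point() +
  scale_x_log10() +   # player counts range from 1 to 16415, so use a log axis
  labs(title = "Average potential versus player count", x = 'count of players', y = 'pot')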

Regression diagnostics

We now plot the residuals versus the fitted values, a QQ plot, and other diagnostics for this regression fit.

It is important to check the regression modeling assumptions. If the assumptions fail to hold, the coefficient estimates may be biased, their variances unreliable, and inference that relies on normally distributed errors invalid.

fitted_values <- fitted(fit)
residual_values <- resid(fit)

sresidual_values <- rstandard(fit)   # standardized residuals

plot(fitted_values, sresidual_values,
     main = "EAFC 24: fitted versus residual values",
     xlab = "Fitted",
     ylab = "Standardized residuals")

hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")

plot(fit, which = 1)   # residuals versus fitted values

plot(fit, which = 2)   # normal QQ plot

  • The residual plot looks like a null plot (no systematic pattern): the standardized residuals are centered around 0 with roughly constant variance and an approximately N(0, 1) shape, as indicated by the roughly horizontal reference line through 0.
  • The QQ plot also lies approximately on a straight line, apart from slight deviations at the two tails.
  • No further action is required, as there is no strong evidence against the assumptions.

t-Test

In multiple regression, the t-test is applied to assess the significance of each predictor variable: it tests whether the coefficient of a particular predictor is significantly different from zero. This is done by comparing the p-value with the significance level α = .05. If the t statistic associated with a predictor is sufficiently large (equivalently, its p-value is small), it suggests that the predictor contributes significantly to the model.
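
For reference, these t statistics and p-values can be read directly off the coefficient table of the fitted model; a minimal sketch (the values themselves already appear in the summary(fit) output above):

summary(fit)$coefficients                        # estimate, std. error, t value and p-value per term
summary(fit)$coefficients[, "Pr(>|t|)"] < 0.05   # TRUE where a coefficient is significant at the .05 level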

95% confidence intervals for all parameters

confint(fit, level = 0.95)
##                       2.5 %       97.5 %
## (Intercept)    2.297235e+01 3.475207e+01
## avgrating      5.319492e-01 7.129809e-01
## cnt           -5.196464e-03 6.700380e-04
## avgrating:cnt -7.579950e-06 8.166582e-05

  • The reduced model must keep 'avgrating': there is strong evidence to reject the null hypothesis that its coefficient is zero (i.e., that a model without avgrating is sufficient), because its p-value is far below the .05 level.
  • We draw no conclusion (fail to reject the null hypothesis) for avgrating:cnt and cnt, as their p-values are higher than 0.05. This sample also yielded 95% confidence intervals for these coefficients that contain 0. We are 95% confident that each interval is one of the 95% of all such intervals that capture the true slope; the true slope may be 0, but this cannot be verified for certain.
In conclusion, the t-test is not enough to decide whether to drop these predictors; therefore, we may need other tests.

F Test on multiple linear regression

The F-test assesses the overall significance of a group of predictor variables in a multiple regression model. The full model contains all predictors; removing the predictors under test yields the reduced model. The test evaluates whether at least one of the removed predictors has a non-zero coefficient.
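
Concretely, for a reduced model obtained by dropping q predictors from the full model, anova() computes the standard partial F statistic

F = [ (RSS_reduced − RSS_full) / q ] / [ RSS_full / df_full ],

where df_full is the residual degrees of freedom of the full model (184 here). Under the null hypothesis that the dropped coefficients are zero, F follows an F(q, df_full) distribution, so a large F (equivalently, a small p-value) is evidence against the reduced model.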

fit_full <- lm(avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)   # full model
fit_reduced <- lm(avgpot ~ avgrating + cnt:avgrating, data = country.and.pot)      # drops the cnt main effect
anova(fit_reduced, fit_full)
fit_reduced1 <- lm(avgpot ~ avgrating + cnt, data = country.and.pot)               # drops the interaction
anova(fit_reduced1, fit_full)
fit_reduced2 <- lm(avgpot ~ avgrating, data = country.and.pot)                     # keeps avgrating only
anova(fit_reduced2, fit_full)

  • The F-test gives the same result as the t-test: there is strong evidence to reject the null hypothesis that the coefficient of 'avgrating' is 0.
  • In terms of goodness of fit (RSS), the reduced models are not preferred, because they have a higher RSS than the full model.

AIC/BIC

The RSS only reflects the "goodness" of the fit, which implies that larger models always have a smaller RSS at the price of increased model complexity. To judge a model more fairly, we penalize its complexity with a criterion of the form {goodness of fit to the data} + {model complexity penalty}.
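
In this criterion, AIC charges a penalty of 2 per estimated parameter while BIC charges log(n) per parameter:

AIC = −2·log L + 2·p,    BIC = −2·log L + p·log(n),

where L is the maximized likelihood and p is the number of estimated parameters. Passing k = log(n) to extractAIC() below therefore yields a BIC-type criterion (up to an additive constant that is the same for every model fitted to this data set).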

AIC(fit, k = 2)   # classical AIC of the full model
## [1] 779.1045
n <- 188                       # number of countries (rows) in the data
extractAIC(fit, k = log(n))    # k = log(n) gives the BIC-type penalty
## [1]   4.0000 256.5294
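
For a linear model, extractAIC() reports n·log(RSS/n) plus the complexity penalty k times the number of coefficients; a minimal sketch of that calculation (using the RSS of the fitted model) should reproduce the 256.53 above:

rss <- sum(resid(fit)^2)        # residual sum of squares of the full model
p <- length(coef(fit))          # 4 estimated coefficients (intercept + 3 terms)
n * log(rss / n) + log(n) * p   # should match the second value returned by extractAIC(fit, k = log(n))
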
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 4))   # residuals vs fitted, QQ, scale-location, Cook's distance

stepfit <- step(fit, direction = "backward", k = log(n))   # backward selection under the BIC penalty
## Start:  AIC=256.53
## avgpot ~ avgrating + cnt + cnt:avgrating
## 
##                 Df Sum of Sq    RSS    AIC
## - avgrating:cnt  1    9.5958 667.82 254.01
## <none>                       658.22 256.53
## 
## Step:  AIC=254.01
## avgpot ~ avgrating + cnt
## 
##             Df Sum of Sq     RSS    AIC
## <none>                    667.82 254.01
## - cnt        1     25.05  692.87 255.70
## - avgrating  1    832.39 1500.21 400.93
summary(stepfit)
## 
## Call:
## lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7697 -1.1776  0.1661  0.9451  9.1811 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.706e+01  2.787e+00   9.709  < 2e-16 ***
## avgrating   6.502e-01  4.282e-02  15.185  < 2e-16 ***
## cnt         1.695e-04  6.436e-05   2.634  0.00915 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.9 on 185 degrees of freedom
## Multiple R-squared:  0.5708, Adjusted R-squared:  0.5661 
## F-statistic:   123 on 2 and 185 DF,  p-value: < 2.2e-16
# Check the final model assumptions
par(mfrow = c(2, 2))
plot(stepfit, which = c(1, 2, 3, 4))

AIC(lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot), k=2)
## [1] 779.8254
extractAIC(lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot), k = log(n))
## [1]   3.0000 254.0139

The summary shows that the best model according to BIC, found by backward stepwise selection, contains only the main-effect predictors cnt and avgrating. The coefficients of both independent variables are statistically significant, since their t-test p-values are small, rejecting the hypothesis that the coefficient of either predictor is zero.

From the summary table we can see that the multiple and adjusted R-squared are both above 0.5, a rough cutoff suggesting a relatively strong relationship between the predictors and the response. The field of study, football, has an inherently large amount of unexplained variation behind these predictors, including differences in socioeconomic status among countries/regions, differences in football culture, and the lack of real-world player data due to the limited set of leagues included in this video game.

The backward stepwise selection process does not itself check the assumptions of the model. Checking the assumptions before and after the selection shows no clear violations, apart from some problematic observations. Multicollinearity may be another issue to deal with.

Multicollinearity

# Regress one predictor (avgrating) on the other model terms; the R-squared of this
# regression is what enters the variance inflation factor
fit_avgrating <- lm(avgrating ~ cnt + avgrating:cnt, data = country.and.pot)

summary(fit_avgrating)$r.squared
## [1] 0.1428252
1/(1 - summary(fit_avgrating)$r.squared)   # VIF = 1 / (1 - R^2)
## [1] 1.166623

The variance inflation factor (VIF) quantifies multicollinearity between the predictors, here the average rating and the count of players per country. R², the coefficient of determination from regressing one predictor on the others, is 0.1428, giving a VIF of 1.167. There is therefore no strong evidence of multicollinearity, as the usual VIF threshold is 5.
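
If the car package were available (its installation failed in this environment, as noted in the Load data section), the same diagnostic could be obtained directly; a sketch, not run here:

# Hypothetical alternative using car::vif(), which reports the VIF of each predictor in a fitted model
library(car)
vif(stepfit)   # VIFs for avgrating and cnt in the BIC-selected model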

Problematic observations

Problematic observations are observations that do not align with the general pattern. There are different possible explanations for such data points, including data-entry errors or genuinely rare cases among the observations.

leverage_values <- hatvalues(stepfit)
threshold1 <- 2*(2+1)/188                          # high-leverage cutoff 2(p+1)/n with p = 2, n = 188
high_leverage_indices <- which(leverage_values > threshold1)
print(leverage_values[high_leverage_indices])
##          7         23         24         53         61         65         80 
## 0.09301918 0.07153576 0.05693224 0.28627873 0.09198274 0.10719047 0.05907345 
##         81         85        147        153        158        159 
## 0.03629447 0.04187019 0.06757683 0.05693224 0.06535729 0.11274654
sresid <- rstandard(stepfit)
cutoff <- 2                                        # flag standardized residuals above 2 as outliers
sresid_indices <- which(sresid > cutoff)
print(sresid[sresid_indices])
##       24       81      106      153      163 
## 3.350029 2.457688 2.261802 4.975974 2.103109
cooks_distance <- cooks.distance(stepfit)
threshold2 <- qf(0.5, 2, 185)                      # median of the F distribution used as the influence cutoff
influential_indices <- which(cooks_distance > threshold2)
print(cooks_distance[influential_indices])
## named numeric(0)

Besides the coefficient of determination, problematic observations should be addressed in a regression analysis. From the residual plot above we can visually spot outliers, including observations 24 and 153, among others. Computing the standardized residuals also flags observations 81, 106, and 163, which have higher residuals than the other observations. By a similar approach, observations 7, 23, 53, 61, 65, 80, 81, 85, 147, 153, 158, and 159 are identified as high-leverage points using the cutoff 2(p+1)/n.

Also, in the fourth plot of the plot matrix (Cook's distance), observations 24, 118, and 153 stand out from the others. However, comparing the Cook's distances against the median of the relevant F distribution, there is no influential point by that criterion. We can therefore conclude that observations 81 and 153 are both outliers and high-leverage points, which need not coincide in general.

However, even after identifying and classifying the problematic observations, we should still be cautious about removing them: such observations may simply be rare and unique cases. Observation 153 is a good example, being both a high-leverage point and an outlier with a relatively high Cook's distance, a combination that need not occur in general. This data point represents Singapore, the only country with a single player included in EAFC 24. A data point built from such an extremely small sample has a high chance of being biased: the two features represented by the predictors cannot be a general reflection of the country's football level, only a snapshot of one individual's situation. That explains its high residual, high leverage, and relatively high Cook's distance among all data points. However, it would not be right to remove Singapore from our global map, as doing so goes against the goal of this project: to explore players' current ability and potential future ability on a global scale.