The FIFA series is a football video game series. From the 2023/24 season it has been officially renamed EA Sports FC, beginning with EA Sports FC 24, following the end of the collaboration with FIFA.
This project uses the PostgreSQL database system and the statistical language R to analyze the FIFA players dataset available at this. The dataset includes detailed information about players, such as age, overall and potential attributes, club and league names, positions, and many others, from the latest 10 editions of the game (FIFA 15 through EAFC 24).
The first part investigates, for each country with more than 1000 players, the relationship between the average potential of its players and factors such as their average overall rating and the number of players from that country. The dataset used is queried from this.
The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.
library(tidyverse)
library(htmltools)
library(vtable)
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")
We fit a linear regression model to the relevant EAFC data using R.
The dataset is cleaned and manipulated in PostgreSQL beforehand to make sure it is appropriate for data analysis and contains no missing values.
In multiple regression, a t-test is applied to assess the significance of each predictor variable. The t-test measures whether the coefficient of a particular predictor is significantly different from zero, given the other predictors in the model. The decision is made by comparing the p-value with the significance level α = .05. If the absolute t-value associated with a predictor is sufficiently large, the p-value falls below α, which suggests that the predictor contributes significantly to the model.
summary(country.and.pot)
##    country              avgpot        avgrating          cnt
##  Length:188         Min.   :58.00   Min.   :54.00   Min.   :    1.0
##  Class :character   1st Qu.:67.88   1st Qu.:63.27   1st Qu.:   20.5
##  Mode  :character   Median :69.90   Median :65.60   Median :   91.0
##                     Mean   :69.55   Mean   :65.10   Mean   :  957.6
##                     3rd Qu.:71.72   3rd Qu.:67.50   3rd Qu.:  795.8
##                     Max.   :74.60   Max.   :70.70   Max.   :16415.0
fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
data = country.and.pot)
summary(fit)
##
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9651 -1.0966  0.1905  1.0176  8.9024
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.886e+01  2.985e+00   9.668   <2e-16 ***
## avgrating      6.225e-01  4.588e-02  13.568   <2e-16 ***
## cnt           -2.263e-03  1.487e-03  -1.522    0.130
## avgrating:cnt  3.704e-05  2.262e-05   1.638    0.103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.891 on 184 degrees of freedom
## Multiple R-squared: 0.5769, Adjusted R-squared: 0.57
## F-statistic: 83.64 on 3 and 184 DF, p-value: < 2.2e-16
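The t value and Pr(>|t|) columns of the coefficient table can be reproduced by hand, which makes the t-test described earlier concrete. The sketch below copies the rounded estimates and standard errors from the summary output above and uses the 184 residual degrees of freedom reported there. The variable names are illustrative, and the results agree with the table only up to rounding.

```r
# Estimates and standard errors copied (rounded) from the summary output above
estimate  = c("avgrating" = 6.225e-01, "cnt" = -2.263e-03, "avgrating:cnt" = 3.704e-05)
std_error = c("avgrating" = 4.588e-02, "cnt" =  1.487e-03, "avgrating:cnt" = 2.262e-05)
df_resid  = 184   # residual degrees of freedom reported by summary(fit)
alpha     = .05   # significance level

# t statistic: estimate divided by its standard error
t_stat = estimate / std_error

# Two-sided p-value from the t distribution
p_value = 2 * pt(-abs(t_stat), df = df_resid)

# A predictor is significant at level alpha when its p-value is below alpha
data.frame(t_stat, p_value, significant = p_value < alpha)
```

Only avgrating falls below α = .05 here, which matches the significance stars in the table.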
# Create a scatterplot with a fitted regression line
country.and.pot %>% ggplot(aes(x = avgrating, y = avgpot)) +
geom_point() + # Add points for the data
geom_smooth(method = 'lm', formula = y ~ x) + # formula refers to the x and y aesthetics, not to column names
labs(title = "Scatterplot with Automatic Reference Line", x = 'overall', y = 'pot')
country.and.pot %>% ggplot(aes( x = avgrating)) + geom_boxplot(fill = "dark gray" )+ labs(title= "Boxplot of avgrating" , x = 'overall')
country.and.pot %>% ggplot(aes( x = avgrating)) + geom_histogram(fill = "dark gray" )+ labs(title= "Distribution of avgrating" , x = 'overall')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
t_value = qt(.975 , df = 184)
se = 4.588e-02
beta_hat = 6.225e-01 #avgrating
c( beta_hat - t_value * se ,
beta_hat + t_value * se )
## [1] 0.5319815 0.7130185
se2 = 2.262e-05
beta_hat2 = 3.704e-05 #avgrating:cnt
c( beta_hat2 - t_value * se2,
beta_hat2 + t_value * se2 )
## [1] -7.587915e-06 8.166792e-05
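The intervals above are built from values copied by hand out of the summary table, so they inherit its rounding. As a hedged sketch of the same construction done programmatically, the example below uses the built-in mtcars data as a stand-in (an assumption, since the EAFC file is not reproduced here) and checks that the manual formula agrees with R's confint(). Names such as demo_fit are illustrative.

```r
# Stand-in data: the built-in mtcars data set, not the EAFC data
demo_fit = lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Coefficient table with estimates and standard errors
coefs  = summary(demo_fit)$coefficients
t_crit = qt(.975, df = df.residual(demo_fit))   # critical value for a 95% interval

# Manual interval: estimate plus or minus critical value times standard error
manual_ci = cbind(lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
                  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])

# Built-in equivalent
builtin_ci = confint(demo_fit, level = 0.95)

# Compare the two results, ignoring row and column labels
all.equal(unname(manual_ci), unname(builtin_ci))
```

Applied to the model in this report, the equivalent call would be confint(fit, level = 0.95), which avoids retyping rounded values.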
The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors, while the reduced model omits the predictors under test. The null hypothesis is that the coefficients of all omitted predictors are zero; a small p-value indicates that at least one of them is non-zero.
fit_full = lm ( avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot )
fit_reduced = lm (avgpot ~ avgrating + cnt:avgrating, data = country.and.pot )
anova ( fit_reduced , fit_full)
fit_reduced1 = lm (avgpot ~ cnt + cnt:avgrating , data = country.and.pot )
anova ( fit_reduced1 , fit_full)
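The statistic behind these anova() comparisons can be computed directly from the residual sums of squares of the two models. The following is a minimal sketch on the built-in mtcars data, used as a stand-in for the EAFC data (an assumption); the model and variable names are illustrative.

```r
# Stand-in data: built-in mtcars, not the EAFC data
demo_full    = lm(mpg ~ wt + hp + wt:hp, data = mtcars)
demo_reduced = lm(mpg ~ wt + wt:hp, data = mtcars)   # drops the hp main effect

# Residual sums of squares and residual degrees of freedom
rss_full    = sum(resid(demo_full)^2)
rss_reduced = sum(resid(demo_reduced)^2)
df_full     = df.residual(demo_full)
df_reduced  = df.residual(demo_reduced)

# Partial F statistic: drop in RSS per omitted predictor, relative to the
# residual variance of the full model
f_stat  = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_value = pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# anova() reports the same statistic in the second row of its table
demo_anova = anova(demo_reduced, demo_full)
c(manual = f_stat, anova = demo_anova$F[2])
```

When the reduced model omits a single predictor, the F statistic equals the square of that predictor's t value in the full model, so the two tests give the same p-value.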
We now plot the residuals against the fitted values, together with a QQ plot and other diagnostic plots for this regression fit.
It is important to check the regression modeling assumptions. If they fail to hold, the coefficient estimates may be biased, and the standard errors, t-tests, and confidence intervals, which rely on constant variance and normally distributed errors, may be unreliable.
fitted_values = fitted(fit)
residual_values = resid(fit)
sresidual_values = rstandard(fit)
plot(fitted_values, sresidual_values,
main = "EAFC 24: standardized residuals versus fitted values",
xlab = "Fitted",
ylab = "Standardized residuals")
hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")
plot(fit, which=1)
plot(fit, which = 2)
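The standardized residuals used in the plots above rescale each raw residual by its estimated standard deviation. Below is a minimal sketch of that calculation, again on the built-in mtcars data as a stand-in for the EAFC data (an assumption), with illustrative names.

```r
# Stand-in data: built-in mtcars, not the EAFC data
demo_fit = lm(mpg ~ wt + hp + wt:hp, data = mtcars)

sigma_hat = summary(demo_fit)$sigma   # residual standard error
leverage  = hatvalues(demo_fit)       # diagonal of the hat matrix

# Standardized residual: raw residual over its estimated standard deviation
manual_std = resid(demo_fit) / (sigma_hat * sqrt(1 - leverage))

# Compare with R's built-in version
all.equal(manual_std, rstandard(demo_fit))
```

Because each residual is scaled to roughly unit variance, values far outside about ±2 in the plots above are the ones worth inspecting.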