The FIFA series is a football video game series that was officially renamed EA Sports FC, starting with EA Sports FC 24 for the 2023/24 season, following the end of the collaboration with FIFA.
This project uses the database tool PostgreSQL and the statistical language R to analyze the FIFA players dataset available at this link. The dataset includes detailed information about players, such as age, overall and potential attributes, club and league names, positions, and many others, from the latest 10 editions of the game (FIFA 15 through EAFC 24).
The first part investigates the relationship between the average potential of football players from each country with more than 1000 players and factors such as their average overall rating and the number of players from that country. The data used are queried from this source.
The next step is the analysis of the players dataset, looking for insights about clubs, leagues and players.
library(tidyverse)
library(htmltools)
install.packages('car')
## Warning: dependency 'pbkrtest' is not available
## Warning in install.packages("car"): installation of package 'car' had non-zero
## exit status
library(dplyr)
country.and.pot=read_csv("/home/jovyan/EAFC24/country and pot.csv")
We choose a linear regression model for this dataset and fit it to the relevant EAFC data using R.
The dataset was cleaned and prepared in PostgreSQL to make sure it is appropriate for data analysis and contains no missing values.
summary(country.and.pot)
##    country              avgpot        avgrating          cnt
##  Length:188         Min.   :58.00   Min.   :54.00   Min.   :    1.0
##  Class :character   1st Qu.:67.88   1st Qu.:63.27   1st Qu.:   20.5
##  Mode  :character   Median :69.90   Median :65.60   Median :   91.0
##                     Mean   :69.55   Mean   :65.10   Mean   :  957.6
##                     3rd Qu.:71.72   3rd Qu.:67.50   3rd Qu.:  795.8
##                     Max.   :74.60   Max.   :70.70   Max.   :16415.0
fit = lm(avgpot ~ avgrating + cnt + cnt:avgrating ,
data = country.and.pot)
summary(fit)
##
## Call:
## lm(formula = avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9651 -1.0966 0.1905 1.0176 8.9024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.886e+01 2.985e+00 9.668 <2e-16 ***
## avgrating 6.225e-01 4.588e-02 13.568 <2e-16 ***
## cnt -2.263e-03 1.487e-03 -1.522 0.130
## avgrating:cnt 3.704e-05 2.262e-05 1.638 0.103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.891 on 184 degrees of freedom
## Multiple R-squared: 0.5769, Adjusted R-squared: 0.57
## F-statistic: 83.64 on 3 and 184 DF, p-value: < 2.2e-16
# Create a scatterplot with a fitted linear reference line
country.and.pot %>% ggplot(aes(x = avgrating, y = avgpot)) +
geom_point() + # Add points for the data
geom_smooth(method = 'lm', formula = y ~ x) + # The formula must use the aesthetics x and y, not the column names
labs(title = "Scatterplot with Automatic Reference Line", x = 'overall', y = 'pot')
country.and.pot %>% ggplot(aes( x = avgrating)) + geom_boxplot(fill = "dark gray" )+ labs(title= "Boxplot of avgrating" , x = 'overall')
country.and.pot %>% ggplot(aes( x = avgrating)) + geom_histogram(fill = "dark gray" )+ labs(title= "Distribution of avgrating" , x = 'overall')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We now plot the residuals against the fitted values, along with a QQ plot and other diagnostic plots for this regression fit.
It is important to check the regression modeling assumptions. If they fail to hold, the coefficient estimates may be biased, their variance may be unpredictable, and the residuals may not be normally distributed, which makes the tests and intervals below unreliable.
fitted_values = fitted(fit)
residual_values = resid(fit)
sresidual_values = rstandard(fit)
plot(fitted_values, sresidual_values,
main = "EAFC 24: fitted versus residual values",
xlab = "Fitted",
ylab = "Standardized residuals")
hist(rstandard(fit), xlab = "Standardized residuals", main = "Standardized residuals histogram")
plot(fit, which=1)
plot(fit, which = 2)
In multiple regression, a t-test is applied to assess the significance of each predictor variable. It tests whether the coefficient of a particular predictor is significantly different from zero, given the other predictors in the model. The decision is made by comparing the p-value with the significance level α = 0.05. If the absolute t-value of a predictor is sufficiently large (equivalently, its p-value is below α), the predictor contributes significantly to the model.
confint(fit, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 2.297235e+01 3.475207e+01
## avgrating 5.319492e-01 7.129809e-01
## cnt -5.196464e-03 6.700380e-04
## avgrating:cnt -7.579950e-06 8.166582e-05
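As a worked illustration, the t statistic and the 95% interval for avgrating can be reproduced by hand from the estimate and standard error in the coefficient table above. The critical value of about 1.973 for 184 degrees of freedom is an assumption taken from standard t tables; it is not printed in the output.

```latex
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad
t_{\text{avgrating}} = \frac{0.6225}{0.04588} \approx 13.57

\hat{\beta}_j \pm t_{0.975,\,184}\,\mathrm{SE}(\hat{\beta}_j)
\approx 0.6225 \pm 1.973 \times 0.04588 \approx (0.532,\ 0.713)
```

Both values agree with the t value of 13.568 and the confint() bounds reported for avgrating.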
The F-test assesses the joint significance of a group of predictor variables in a multiple regression model. The full model contains all predictors, while the reduced model drops the group of predictors being tested. The test checks whether at least one of the dropped predictors has a non-zero coefficient by comparing the fit of the reduced model with that of the full model.
fit_full = lm ( avgpot ~ avgrating + cnt + cnt:avgrating, data = country.and.pot )
fit_reduced = lm (avgpot ~ avgrating + cnt:avgrating, data = country.and.pot )
anova ( fit_reduced , fit_full)
fit_reduced1 = lm (avgpot ~ avgrating + cnt, data = country.and.pot )
anova ( fit_reduced1 , fit_full)
fit_reduced2 = lm (avgpot ~ avgrating , data = country.and.pot )
anova ( fit_reduced2 , fit_full)
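The statistic behind these comparisons can be sketched as follows. The worked numbers are for fit_reduced1 against fit_full and use the RSS values reported in the backward selection output later in this report (667.82 for the model without the interaction, 658.22 for the full model) together with the residual degrees of freedom 185 and 184. This is a hand check, not output produced by the anova() calls above.

```latex
F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}) / (df_{\text{reduced}} - df_{\text{full}})}
         {\mathrm{RSS}_{\text{full}} / df_{\text{full}}}
  = \frac{(667.82 - 658.22) / (185 - 184)}{658.22 / 184} \approx 2.68
```

Because only one term is dropped, this F value equals the square of the t value for avgrating:cnt (1.638² ≈ 2.68), so the two tests lead to the same conclusion.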
The RSS reflects only the goodness of fit: larger models always have a smaller (or equal) RSS, but at the cost of increased model complexity. To judge models more fairly, we penalize complexity with a criterion of the form {goodness of fit to the data} + {model complexity penalty}.
AIC(fit, k=2)
## [1] 779.1045
n = 188
extractAIC(fit, k = log(n))
## [1] 4.0000 256.5294
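The two calls above report values on different scales (779.10 versus 256.53), so they should not be compared with each other. For a linear model, extractAIC() computes n·log(RSS/n) + k·edf, while AIC() and BIC() use the full Gaussian log-likelihood and also count the error variance as a parameter. With n = 188, RSS = 658.22 (from the selection output below) and 4 coefficients, 188·log(658.22/188) + 4·log(188) ≈ 256.53, matching the value above. The sketch below checks the same relationship on the built-in mtcars data, which is an illustrative stand-in and not the EAFC data; all object names in it are hypothetical.

```r
# Illustrative check on the built-in mtcars data (not the EAFC data).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
n_demo <- nrow(mtcars)
rss <- sum(resid(demo_fit)^2)
edf <- length(coef(demo_fit))  # number of estimated coefficients

# extractAIC() for lm objects: n * log(RSS / n) + k * edf
bic_manual <- n_demo * log(rss / n_demo) + log(n_demo) * edf
bic_extract <- extractAIC(demo_fit, k = log(n_demo))[2]

# BIC() uses the full log-likelihood and counts sigma as one more parameter,
# so it differs from extractAIC() by a constant that depends only on n.
bic_full <- BIC(demo_fit)
offset <- n_demo * (1 + log(2 * pi)) + log(n_demo)

print(c(manual = bic_manual,
        extractAIC = bic_extract,
        BIC_minus_offset = bic_full - offset))
```

Since the offset is the same for every model fitted to the same data, both scales rank candidate models identically.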
par ( mfrow = c (2 , 2) )
plot ( fit , which = c(1 , 2 , 3 , 4) )
stepfit = step ( fit , direction = "backward" , k = log (n) )
## Start: AIC=256.53
## avgpot ~ avgrating + cnt + cnt:avgrating
##
##                 Df Sum of Sq    RSS    AIC
## - avgrating:cnt  1    9.5958 667.82 254.01
## <none>                       658.22 256.53
##
## Step: AIC=254.01
## avgpot ~ avgrating + cnt
##
##             Df Sum of Sq     RSS    AIC
## <none>                    667.82 254.01
## - cnt        1     25.05  692.87 255.70
## - avgrating  1    832.39 1500.21 400.93
summary ( stepfit )
##
## Call:
## lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7697 -1.1776 0.1661 0.9451 9.1811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.706e+01 2.787e+00 9.709 < 2e-16 ***
## avgrating 6.502e-01 4.282e-02 15.185 < 2e-16 ***
## cnt 1.695e-04 6.436e-05 2.634 0.00915 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.9 on 185 degrees of freedom
## Multiple R-squared: 0.5708, Adjusted R-squared: 0.5661
## F-statistic: 123 on 2 and 185 DF, p-value: < 2.2e-16
# Check the final model assumptions
par ( mfrow = c (2 , 2) )
plot ( stepfit , which = c (1 , 2 , 3 , 4) )
AIC(lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot), k=2)
## [1] 779.8254
extractAIC(lm(formula = avgpot ~ avgrating + cnt, data = country.and.pot), k = log(n))
## [1] 3.0000 254.0139
The summary shows that the best model according to BIC under backward stepwise selection contains only the main effects cnt and avgrating. The coefficients of both predictors are statistically significant, since their t-test p-values are small, so we reject the hypothesis that either coefficient is zero.
From the summary table, the multiple and adjusted R-squared values are both above 0.5, which we take as a rough cutoff suggesting a relatively strong relationship between the predictors and the response. Football as a field of study has an inherently large amount of variation that these predictors cannot explain, including differences in socioeconomic status among countries and regions, differences in football culture, and the lack of real-world player data due to the limited set of leagues included in the video game.
Backward stepwise selection does not check the model assumptions. Checking the assumptions before and after the selection, we find no clear violations apart from some problematic observations. Multicollinearity may be another issue to deal with.
fit_avgrating = lm( avgrating ~ cnt + avgrating:cnt,
data = country.and.pot )
summary(fit_avgrating)$r.squared
## [1] 0.1428252
1/(1 - summary(fit_avgrating)$r.squared)
## [1] 1.166623
The variance inflation factor (VIF) measures the degree of multicollinearity between the predictors, here the average rating and the count of players per country. R², the coefficient of determination from regressing avgrating on the other model terms, is 0.1428252, and the VIF is 1.167. There is therefore no strong evidence of multicollinearity, since the VIF is well below the common threshold of 5.
Problematic observations are observations that do not align with the general pattern of the data. Such points can arise for different reasons, including data entry errors or genuinely rare cases.
leverage_values <-hatvalues(stepfit)
threshold1 <- 2*(2+1)/188 # Set the threshold
high_leverage_indices <- which(leverage_values > threshold1)
print(leverage_values[high_leverage_indices])
## 7 23 24 53 61 65 80
## 0.09301918 0.07153576 0.05693224 0.28627873 0.09198274 0.10719047 0.05907345
## 81 85 147 153 158 159
## 0.03629447 0.04187019 0.06757683 0.05693224 0.06535729 0.11274654
sresid <-rstandard(stepfit)
cutoff <- 2 # Set the threshold
sresid_indices <- which(sresid > cutoff) # Flags large positive residuals only; abs(sresid) > cutoff would also flag large negative ones
print(sresid[sresid_indices])
## 24 81 106 153 163
## 3.350029 2.457688 2.261802 4.975974 2.103109
cooks_distance <-cooks.distance(stepfit)
threshold2 <- qf(0.5, 3, 185) # Set the threshold: median of F(p + 1, n - p - 1) with p = 2 predictors and n = 188
influencial_indices <- which(cooks_distance > threshold2)
print(cooks_distance[influencial_indices])
## named numeric(0)
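The three checks above can also be collected into a single table, which makes it easier to see which observations trip more than one rule. The sketch below uses the built-in mtcars data as an illustrative stand-in for the EAFC data, and every object name in it is hypothetical. It assumes the common conventions of 2(p+1)/n for leverage, an absolute standardized residual above 2 for outliers, and the median of F(p+1, n-p-1) for Cook's distance.

```r
# Illustrative sketch on the built-in mtcars data (not the EAFC data).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
n_obs <- nrow(mtcars)
n_coef <- length(coef(demo_fit))  # p + 1, intercept included

diagnostics <- data.frame(
  leverage  = hatvalues(demo_fit),
  std_resid = rstandard(demo_fit),
  cooks_d   = cooks.distance(demo_fit)
)

# Apply the three rules of thumb; abs() catches residuals of either sign.
diagnostics$high_leverage <- diagnostics$leverage > 2 * n_coef / n_obs
diagnostics$outlier       <- abs(diagnostics$std_resid) > 2
diagnostics$influential   <- diagnostics$cooks_d > qf(0.5, n_coef, n_obs - n_coef)

# Keep only the rows flagged by at least one rule.
flagged <- diagnostics[diagnostics$high_leverage |
                         diagnostics$outlier |
                         diagnostics$influential, ]
print(flagged)
```

Replacing demo_fit with the selected model would give the same kind of table for the country data, with one row per flagged country.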
Besides the coefficient of determination, problematic observations should be addressed in a regression analysis. From the residual plot above, we can spot outliers such as observations 24 and 153. Computing the standardized residuals reveals less visible ones as well: observations 81, 106, and 163. All of these have larger residuals than the other observations. By a similar approach, we find that observations 7, 23, 24, 53, 61, 65, 80, 81, 85, 147, 153, 158, and 159 are high leverage points under the cutoff 2(p+1)/n.
Also, in the fourth plot of the plot matrix, observations 24, 118, and 153 stand out from the others. However, comparing the Cook's distances with the median of the F distribution shows that no observation qualifies as an influential point. We can conclude that observations 24, 81, and 153 are both outliers and high leverage points, which is not necessarily the case in general.
Even after identifying and classifying the problematic observations, we should be cautious about removing them, as they may simply be rare and unique cases. Observation 153 is a good example: it is both an outlier and a high leverage point, with a relatively high Cook's distance. This data point represents Singapore, a country with only one player included in EAFC 24. A data point built from such a small sample is very likely to be unrepresentative. In this case, the two predictors cannot reflect the country's general football level; they only describe one individual. This explains its large residual, high leverage, and relatively high Cook's distance among all data points. However, we do not consider it appropriate to remove Singapore from our global map, as that would go against the goal of this project, which is to explore players' current ability and possible future ability on a global scale.