In this report on point estimation and confidence intervals, I focus on data about players from the widely known game FIFA 22. The data comes from https://www.kaggle.com/, specifically from Stefano Leone (https://www.kaggle.com/stefanoleone992/fut-22-fifa-ultimate-team-players-and-prices). I will estimate and compare the population mean of the players' overall rating for two leagues, namely LaLiga Santander and the Premier League. First, let's examine histograms of the populations from both leagues.
Both histograms seem to be moderately left-skewed. Their means will probably be pretty close to each other. To get better insight, let's draw a boxplot.
From the boxplot we can see that the two distributions are centered very close to each other (a standard boxplot shows medians rather than means). Their IQRs also seem to be pretty similar.
First, we prepare vectors to hold the sample means and sample standard deviations for both leagues.
samp_laliga_mean <- rep(NA, 100)
samp_premier_mean <- rep(NA, 100)
samp_laliga_sd <- rep(NA, 100)
samp_premier_sd <- rep(NA, 100)
Now I will calculate the population parameters for both leagues. We will use them to determine how large a sample must be to achieve a margin of error (delta) of 2.1. With that margin, the estimate of the population mean is off by at most about 2.8% of the mean, and the full width of the confidence interval is about 5.7% of the mean (given that the mean is ~74).
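# Population sizes and (approximate) population SDs of the overall rating.
# Note: the exact population SD is sd(x) * sqrt((n - 1) / n); the factor
# (n - 1) / n used here differs from it by less than 0.1% at n ~ 600.
n_laliga = length(laliga_players$overall)
sd_laliga_pop = sd(laliga_players$overall) * ((n_laliga - 1) / n_laliga)
n_premier = length(premier_players$overall)
sd_premier_pop = sd(premier_players$overall) * ((n_premier - 1) / n_premier)
# Population means, kept for comparison with the estimates later on
avg_laliga_pop = mean(laliga_players$overall)
avg_premier_pop = mean(premier_players$overall)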
sample.size.mean(2.1, sd_laliga_pop, n_laliga, 0.95)
##
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=613, precision e=2.1 and standard deviation S=6.7628
##
## Sample size needed: 38
sample.size.mean(2.1, sd_premier_pop, n_premier, 0.95)
##
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=633, precision e=2.1 and standard deviation S=8.5761
##
## Sample size needed: 59
samp_size = 59
t_value = 2.0017
We need a sample size of 59 to achieve this level of accuracy. In fact, LaLiga needs only n=38, but I will stick to 59 for both leagues for ease of calculation. This is not a problem, since the sample size does not exceed 10% of either population. I assume in the calculations that the population SD is unknown, so I use the t-distribution for the CI. At 58 degrees of freedom and a 95% confidence level (alpha = 0.05), the critical value is 2.0017.
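The sample sizes reported above can be checked by hand. The sketch below is a minimal base R version, assuming the usual normal-approximation formula with finite population correction, n = 1 / (e^2 / (z^2 S^2) + 1 / N). That this is what `sample.size.mean` computes internally is an assumption, although it reproduces both printed results. The helper name `sample_size_fpc` is hypothetical, and the inputs are the rounded values shown in the printed outputs.

```r
# Hypothetical helper: sample size for estimating a mean with margin of
# error e, population SD S, population size N and confidence level `level`.
# Assumes the normal approximation with finite population correction.
sample_size_fpc <- function(e, S, N, level = 0.95) {
  z <- qnorm((1 + level) / 2)    # two-sided normal critical value
  n0 <- (z * S / e)^2            # sample size for an infinite population
  ceiling(1 / (1 / n0 + 1 / N))  # shrink for a finite population, round up
}

n_needed_laliga  <- sample_size_fpc(2.1, 6.7628, 613)
n_needed_premier <- sample_size_fpc(2.1, 8.5761, 633)
print(c(laliga = n_needed_laliga, premier = n_needed_premier))
```

The larger standard deviation in the Premier League is what pushes its requirement up to 59.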
Now I will use the sample size of 59 to draw 100 samples from both LaLiga and the Premier League.
for (i in 1:100) {
  # Draw a simple random sample (without replacement) of samp_size players per league
  samp_laliga <- laliga_players[sample(1:n_laliga, samp_size), ]
  samp_premier <- premier_players[sample(1:n_premier, samp_size), ]
  # Store the mean and standard deviation of each sample
  samp_laliga_mean[i] <- mean(samp_laliga$overall)
  samp_premier_mean[i] <- mean(samp_premier$overall)
  samp_laliga_sd[i] <- sd(samp_laliga$overall)
  samp_premier_sd[i] <- sd(samp_premier$overall)
}
Now we compute the lower and upper bounds of each CI and the estimate of the population mean, based on a sample size of 59 and 100 samples.
lower_laliga <- samp_laliga_mean - t_value*samp_laliga_sd / sqrt(samp_size)
upper_laliga <- samp_laliga_mean + t_value*samp_laliga_sd / sqrt(samp_size)
lower_premier <- samp_premier_mean - t_value*samp_premier_sd / sqrt(samp_size)
upper_premier <- samp_premier_mean + t_value*samp_premier_sd / sqrt(samp_size)
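Each interval above is the textbook t-interval, mean ± t · s / sqrt(n). As a minimal sketch, the snippet below applies the same formula to one simulated sample (the ratings drawn with `rnorm` are made up for illustration and are not the FIFA data) and checks the result against base R's `t.test`.

```r
set.seed(1)                             # reproducible illustration
x <- rnorm(59, mean = 74, sd = 7)       # simulated ratings, NOT the FIFA data
n <- length(x)

t_crit <- qt(0.975, df = n - 1)         # critical value for 58 degrees of freedom
half_width <- t_crit * sd(x) / sqrt(n)  # margin of error of this sample
ci_manual <- c(mean(x) - half_width, mean(x) + half_width)

# Same interval computed by the built-in one-sample t-test
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
print(rbind(ci_manual, ci_builtin))     # the two rows should agree
```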
laliga_estimated_mean = mean(samp_laliga_mean)
premier_estimated_mean = mean(samp_premier_mean)
laliga_estimated_mean
## [1] 73.85305
avg_laliga_pop
## [1] 73.94127
premier_estimated_mean
## [1] 73.57339
avg_premier_pop
## [1] 73.59084
And finally our CI plots.
plotCI(1:100,samp_laliga_mean,ui = upper_laliga, li=lower_laliga)
abline(h=laliga_estimated_mean,lwd=2)
plotCI(1:100,samp_premier_mean,ui = upper_premier, li=lower_premier)
abline(h=premier_estimated_mean,lwd=2)
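A plot like these is usually read by counting how many intervals contain the true population mean; at a 95% confidence level, roughly 95 of 100 should. The sketch below does that count on a simulated finite population (its size and ratings are illustrative assumptions, not the FIFA data), using the same sampling scheme and interval formula as the report.

```r
set.seed(42)
pop <- rnorm(613, mean = 74, sd = 7)  # simulated population, NOT the FIFA data
pop_mean <- mean(pop)                 # the "true" mean the intervals should cover
samp_size <- 59
t_value <- qt(0.975, df = samp_size - 1)

# For each of 100 samples, record whether its t-interval covers pop_mean
covered <- replicate(100, {
  s <- sample(pop, samp_size)                      # sampling without replacement
  half_width <- t_value * sd(s) / sqrt(samp_size)
  abs(mean(s) - pop_mean) <= half_width
})

coverage <- mean(covered)             # share of intervals containing pop_mean
print(coverage)
```

Drawing the horizontal line at the population mean (for example `avg_laliga_pop`) instead of the estimated mean would let the same count be read directly off the plot.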