library(tidyr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
volley_data <- read.csv("C:\\Users\\brian\\Downloads\\bvb_matches_2022.csv")
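n_samples <- 5
# Each sample is half the size of the full data set
samp_size <- round(nrow(volley_data) * 0.5)
for (i in 1:n_samples) {
  # Draw rows at random, with replacement
  samp_data <- volley_data[sample(nrow(volley_data), size = samp_size, replace = TRUE), ]
  # Store the sample as df_1, df_2, ... in the global environment
  assign(paste0("df_", i), samp_data, envir = .GlobalEnv)
}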
This code draws five samples, each half the size of the full data set, and stores them in separate data frames (df_1 through df_5) so that we can evaluate each one on its own. The rows were chosen at random with replacement, so the same match can appear more than once in a sample.
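The same sampling step could also be sketched with a list instead of assign(), which keeps the samples together and avoids writing to the global environment. This is only an illustration under stated assumptions: toy_data is a made-up stand-in for volley_data, samples is a hypothetical name, and the set.seed() call is an addition so that the draws can be reproduced.

```r
# Toy stand-in for volley_data (assumption: the real file is not reproduced here).
set.seed(42)
toy_data <- data.frame(
  circuit  = sample(c("AVP", "FIVB"), 200, replace = TRUE),
  country  = sample(c("United States", "Brazil", "Korea"), 200, replace = TRUE),
  w_p1_age = runif(200, min = 18, max = 40)
)

n_samples <- 5
samp_size <- round(nrow(toy_data) * 0.5)

# Draw each sample with replacement and keep all of them in one named list.
samples <- lapply(seq_len(n_samples), function(i) {
  toy_data[sample(nrow(toy_data), size = samp_size, replace = TRUE), ]
})
names(samples) <- paste0("df_", seq_len(n_samples))

# Individual samples are then available as samples$df_1, samples$df_2, ...
sapply(samples, nrow)
```

With a list, the same summary can be applied to every sample with lapply() rather than being copied five times.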
df_summary1 <- df_1 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary1
## # A tibble: 24 × 3
## # Groups: circuit [2]
## circuit country n
## <chr> <chr> <int>
## 1 FIVB Korea 5
## 2 FIVB France 19
## 3 FIVB Australia 29
## 4 FIVB Austria 35
## 5 FIVB Lithuania 35
## 6 FIVB Spain 35
## 7 FIVB Slovenia 37
## 8 FIVB Hungary 38
## 9 FIVB Thailand 38
## 10 FIVB Belgium 39
## # ℹ 14 more rows
df_summary2 <- df_2 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary2
## # A tibble: 24 × 3
## # Groups: circuit [2]
## circuit country n
## <chr> <chr> <int>
## 1 FIVB Korea 16
## 2 FIVB France 26
## 3 FIVB Lithuania 29
## 4 FIVB Australia 40
## 5 FIVB Spain 40
## 6 FIVB Latvia 42
## 7 FIVB Slovenia 42
## 8 FIVB Thailand 42
## 9 FIVB Czech Republic 43
## 10 FIVB Germany 43
## # ℹ 14 more rows
df_summary3 <- df_3 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary3
## # A tibble: 24 × 3
## # Groups: circuit [2]
## circuit country n
## <chr> <chr> <int>
## 1 FIVB Korea 8
## 2 FIVB France 12
## 3 FIVB Thailand 31
## 4 FIVB Czech Republic 32
## 5 FIVB Lithuania 32
## 6 FIVB Australia 36
## 7 FIVB Belgium 38
## 8 FIVB Hungary 39
## 9 FIVB Slovenia 40
## 10 FIVB Spain 40
## # ℹ 14 more rows
df_summary4 <- df_4 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary4
## # A tibble: 24 × 3
## # Groups: circuit [2]
## circuit country n
## <chr> <chr> <int>
## 1 FIVB Korea 11
## 2 FIVB France 15
## 3 FIVB Thailand 29
## 4 FIVB Lithuania 30
## 5 FIVB Spain 31
## 6 FIVB Latvia 35
## 7 FIVB Slovenia 35
## 8 FIVB Czech Republic 36
## 9 FIVB Hungary 37
## 10 FIVB Belgium 41
## # ℹ 14 more rows
df_summary5 <- df_5 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary5
## # A tibble: 24 × 3
## # Groups: circuit [2]
## circuit country n
## <chr> <chr> <int>
## 1 FIVB Korea 6
## 2 FIVB France 17
## 3 FIVB Thailand 33
## 4 FIVB Slovenia 34
## 5 FIVB Australia 36
## 6 FIVB Spain 37
## 7 FIVB Lithuania 39
## 8 FIVB Austria 41
## 9 FIVB Hungary 42
## 10 FIVB Belgium 45
## # ℹ 14 more rows
We could look at many different variables to assess the differences between these samples, but I chose to count the matches for each country. The country counts are interesting because the order from fewest to most matches is very different in each sample. I grouped by circuit first: the United States is the only country with AVP circuit matches, and it also has the most matches in every sample, which is not surprising given that it has by far the most matches in the full data set. Korea has the fewest matches in every sample. Aside from these two, the order changes a lot from sample to sample. This shows that a single sample is not always a great representation of the full data set, but as we take more and more samples, the average across them usually gives a good summary.
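One way to put the five sets of counts side by side is to stack the samples and compare the average count per country with the count we would expect from the full data set. The sketch below uses a made-up toy_data in place of volley_data, and the names samples, comparison, and expected_n are hypothetical.

```r
library(dplyr)

# Toy stand-in for volley_data with an uneven mix of countries (assumption).
set.seed(1)
toy_data <- data.frame(
  country = sample(c("United States", "Brazil", "Korea"), 400,
                   replace = TRUE, prob = c(0.6, 0.3, 0.1))
)

samp_size <- round(nrow(toy_data) * 0.5)
samples <- lapply(1:5, function(i) {
  # drop = FALSE keeps a data frame even though there is only one column
  toy_data[sample(nrow(toy_data), size = samp_size, replace = TRUE), , drop = FALSE]
})
names(samples) <- paste0("df_", 1:5)

# One row per sample and country, with the number of matches in that sample.
sample_counts <- bind_rows(samples, .id = "sample") |>
  count(sample, country)

# Average the counts across samples and compare with the full data set,
# scaled to the sample size.
comparison <- sample_counts |>
  group_by(country) |>
  summarise(mean_n = mean(n), min_n = min(n), max_n = max(n)) |>
  left_join(count(toy_data, country, name = "full_n"), by = "country") |>
  mutate(expected_n = full_n * samp_size / nrow(toy_data))

comparison
```

The gap between min_n and max_n shows how much a single sample can move, while mean_n should sit closer to expected_n. A country that is missing from a sample gets no row rather than a zero, so rare countries would need extra care.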
df_sum1 <- df_1 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum1
## # A tibble: 24 × 2
## country meanage
## <chr> <dbl>
## 1 Czech Republic 30.6
## 2 United States 30.5
## 3 Latvia 29.1
## 4 Switzerland 29.0
## 5 Mexico 29.0
## 6 Germany 28.8
## 7 Qatar 28.6
## 8 Brazil 27.4
## 9 Thailand 27.4
## 10 Türkiye 27.4
## # ℹ 14 more rows
df_sum2 <- df_2 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum2
## # A tibble: 24 × 2
## country meanage
## <chr> <dbl>
## 1 United States 30.7
## 2 Mexico 29.6
## 3 Qatar 28.8
## 4 Switzerland 28.7
## 5 Lithuania 28.6
## 6 Latvia 28.6
## 7 Czech Republic 28.6
## 8 Germany 28.5
## 9 Türkiye 28.3
## 10 Spain 27.5
## # ℹ 14 more rows
df_sum3 <- df_3 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum3
## # A tibble: 24 × 2
## country meanage
## <chr> <dbl>
## 1 United States 30.0
## 2 Mexico 29.5
## 3 Germany 29.2
## 4 Czech Republic 28.5
## 5 Türkiye 28.3
## 6 Latvia 28.2
## 7 Qatar 28.1
## 8 Switzerland 27.8
## 9 Lithuania 27.7
## 10 Brazil 27.7
## # ℹ 14 more rows
df_sum4 <- df_4 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum4
## # A tibble: 24 × 2
## country meanage
## <chr> <dbl>
## 1 United States 30.4
## 2 Switzerland 29.7
## 3 Mexico 29.5
## 4 Germany 28.9
## 5 Qatar 28.8
## 6 Brazil 28.7
## 7 Czech Republic 28.7
## 8 Latvia 28.0
## 9 Türkiye 27.8
## 10 Thailand 27.5
## # ℹ 14 more rows
df_sum5 <- df_5 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum5
## # A tibble: 24 × 2
## country meanage
## <chr> <dbl>
## 1 United States 30.4
## 2 Czech Republic 29.8
## 3 Latvia 29.3
## 4 Switzerland 29.2
## 5 Germany 29.1
## 6 Mexico 29.1
## 7 Brazil 28.6
## 8 Qatar 28.5
## 9 Türkiye 27.9
## 10 Lithuania 27.5
## # ℹ 14 more rows
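I also decided to look at age, since that is something I have been looking into in past data dives. Here the mean of w_p1_age, the age of one player per match, is computed for each country. It is interesting how much these mean ages differ between samples: the Czech Republic, for example, ranks first at 30.6 in sample 1 but seventh at 28.6 in sample 2. I was able to compare these to the whole data set, and some of the sample means are quite different from the means of the entire data set.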
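The comparison with the whole data set could be sketched as a join between the per-country means of one sample and those of the full data. The data below are made up: toy_data stands in for volley_data, one_sample for a sample such as df_1, and the names age_comparison and diff are hypothetical.

```r
library(dplyr)

# Toy stand-in for volley_data, with a few missing ages (assumption).
set.seed(2)
toy_data <- data.frame(
  country  = sample(c("United States", "Brazil", "Korea"), 300, replace = TRUE),
  w_p1_age = rnorm(300, mean = 28, sd = 4)
)
toy_data$w_p1_age[sample(300, 15)] <- NA

# Mean age per country in the full data set.
full_means <- toy_data |>
  group_by(country) |>
  summarise(full_mean = mean(w_p1_age, na.rm = TRUE))

# One 50% sample drawn with replacement, summarised the same way.
one_sample <- toy_data[sample(nrow(toy_data), size = 150, replace = TRUE), ]
sample_means <- one_sample |>
  group_by(country) |>
  summarise(sample_mean = mean(w_p1_age, na.rm = TRUE))

# Difference between the sample and the full data, largest gaps first.
age_comparison <- full_means |>
  left_join(sample_means, by = "country") |>
  mutate(diff = sample_mean - full_mean) |>
  arrange(desc(abs(diff)))

age_comparison
```

Sorting by the absolute difference puts the countries where the sample is least representative at the top.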
ggplot(df_1, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title="Age Distribution by Country") +
  theme_minimal()
## Warning: Removed 49 rows containing non-finite values (`stat_boxplot()`).
ggplot(df_2, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title="Age Distribution by Country") +
  theme_minimal()
## Warning: Removed 54 rows containing non-finite values (`stat_boxplot()`).
ggplot(df_3, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title="Age Distribution by Country") +
  theme_minimal()
## Warning: Removed 63 rows containing non-finite values (`stat_boxplot()`).
ggplot(df_4, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title="Age Distribution by Country") +
  theme_minimal()
## Warning: Removed 58 rows containing non-finite values (`stat_boxplot()`).
ggplot(df_5, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title="Age Distribution by Country") +
  theme_minimal()
## Warning: Removed 44 rows containing non-finite values (`stat_boxplot()`).
In sample 1, we see many anomalies. They appear on the graph as individual points that are not connected to the box plots. In this sample, the most obvious anomaly is the player in the United States who is almost 50.
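The points drawn separately on a box plot can also be listed as rows. A minimal sketch, assuming the default 1.5 × IQR rule that geom_boxplot() uses for its outlier points; toy_sample is a made-up stand-in for a sample such as df_1, and the two extreme ages are planted for illustration.

```r
library(dplyr)

# Toy stand-in for one sample (assumption, not the real data).
set.seed(4)
toy_sample <- data.frame(
  country  = rep(c("United States", "Belgium"), each = 50),
  w_p1_age = c(rnorm(50, mean = 30, sd = 3), rnorm(50, mean = 26, sd = 3))
)
# Plant two extreme ages so that there is something to find.
toy_sample$w_p1_age[1]  <- 49
toy_sample$w_p1_age[51] <- 55

# Flag ages more than 1.5 * IQR below the first or above the third quartile,
# computed separately for each country.
flagged <- toy_sample |>
  filter(!is.na(w_p1_age)) |>
  group_by(country) |>
  mutate(
    q1 = quantile(w_p1_age, 0.25, names = FALSE),
    q3 = quantile(w_p1_age, 0.75, names = FALSE),
    is_outlier = w_p1_age < q1 - 1.5 * (q3 - q1) |
      w_p1_age > q3 + 1.5 * (q3 - q1)
  ) |>
  ungroup() |>
  filter(is_outlier)

flagged
```

A table like this makes it easier to trace each flagged point back to its row in the original data.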
In sample 2, there is a player from Belgium who is about 55, which is definitely an outlier relative to the rest of the data, especially since the mean age in Belgium is about 26.
There is a high value from Australia in the box plot for sample 3, and we still see the higher values in Belgium and the United States. Compared to the mean values, these are all anomalies and values that we would doubt. France also has a very low mean, which is off from its mean in the total data set.
Samples 4 and 5 also contain many anomalies, like the other samples. We can see all of them on the box plots, and these are points we would want to look into to make sure that they are in our original data set and that they make sense. The data set includes players' birthdays, so we may want to use them to verify that these points are correct.
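A birthday check of this kind might look like the sketch below. The column names match_date and w_p1_birthdate are assumptions made for illustration and may not match the real file, and the two rows are invented.

```r
# Hypothetical columns: match_date and w_p1_birthdate are assumed names.
toy_matches <- data.frame(
  match_date     = as.Date(c("2022-06-01", "2022-06-01")),
  w_p1_birthdate = as.Date(c("1992-06-01", "1972-03-15")),
  w_p1_age       = c(30.0, 55.0)
)

# Recompute the age in years from the two dates.
toy_matches$age_check <-
  as.numeric(toy_matches$match_date - toy_matches$w_p1_birthdate) / 365.25

# Flag rows where the recorded age is more than half a year off.
toy_matches$mismatch <-
  abs(toy_matches$w_p1_age - toy_matches$age_check) > 0.5

toy_matches
```

In this toy example the second row is flagged, because the recorded age of 55 does not agree with a 1972 birth date and a 2022 match.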
This data dive showed that samples of a population are not always great representations of the entire population. The samples are not too far off, but we should still not form conclusions based on a single sample. As we collect more samples, we can find averages of the samples that represent our entire data set better, but a single sample could be misleading if we try to draw conclusions about all of the data.
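The idea that averaging over many samples gets closer to the full data set can be illustrated with a small simulation. This is a sketch on made-up ages, not on the volleyball data, and the numbers of draws are arbitrary choices.

```r
# Made-up "population" of ages (assumption, for illustration only).
set.seed(3)
ages <- rnorm(1000, mean = 28, sd = 4)
full_mean <- mean(ages)

# Draw 500 samples of half the data, with replacement, and keep each mean.
sample_means <- replicate(
  500,
  mean(sample(ages, size = 500, replace = TRUE))
)

# How far the individual sample means range from each other...
range(sample_means)

# ...versus how far the average of all sample means is from the full mean.
abs(mean(sample_means) - full_mean)
```

Individual sample means scatter around the full mean, while their average lands very close to it, which is the pattern described above.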