Data Dive 4

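# Load the packages used below: dplyr for the summaries, ggplot2 for the plots
library(dplyr)
library(ggplot2)

volley_data <- read.csv("C:\\Users\\brian\\Downloads\\bvb_matches_2022.csv")

n_samples <- 5
samp_size <- round(nrow(volley_data) * 0.5)  # each sample is half the size of the data

# Draw each sample with replacement and store it as df_1, ..., df_5
for (i in seq_len(n_samples)) {
  samp_data <- volley_data[sample(nrow(volley_data), size = samp_size, replace = TRUE), ]
  assign(paste0("df_", i), samp_data, envir = .GlobalEnv)
}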

This code draws five random samples, each half the size of the full data set, with rows chosen with replacement. Each sample is stored in its own data frame (df_1 through df_5) so that we can easily evaluate them one at a time.
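As a side note, the same sampling step can be made reproducible and kept in a single list instead of five global variables. The sketch below is only an illustration: `toy_data`, its values, and the seed are made up for the example and are not the real match data.

```r
# Toy stand-in for volley_data (hypothetical values, not the real data set)
set.seed(42)  # assumed seed so the samples can be reproduced
toy_data <- data.frame(
  country  = sample(c("Korea", "France", "United States"), 200, replace = TRUE),
  w_p1_age = round(runif(200, 18, 40), 1)
)

n_samples <- 5
samp_size <- round(nrow(toy_data) * 0.5)

# Draw each sample with replacement and keep all of them in one named list
samples <- lapply(seq_len(n_samples), function(i) {
  toy_data[sample(nrow(toy_data), size = samp_size, replace = TRUE), ]
})
names(samples) <- paste0("df_", seq_len(n_samples))

# Number of rows in each sample
sapply(samples, nrow)
```

With a list, the same summary can be applied to every sample with `lapply(samples, ...)` rather than repeating the code five times.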

df_summary1 <- df_1 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary1

## # A tibble: 24 × 3
## # Groups:   circuit [2]
##    circuit country       n
##    <chr>   <chr>     <int>
##  1 FIVB    Korea         5
##  2 FIVB    France       19
##  3 FIVB    Australia    29
##  4 FIVB    Austria      35
##  5 FIVB    Lithuania    35
##  6 FIVB    Spain        35
##  7 FIVB    Slovenia     37
##  8 FIVB    Hungary      38
##  9 FIVB    Thailand     38
## 10 FIVB    Belgium      39
## # ℹ 14 more rows

df_summary2 <- df_2 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary2

## # A tibble: 24 × 3
## # Groups:   circuit [2]
##    circuit country            n
##    <chr>   <chr>          <int>
##  1 FIVB    Korea             16
##  2 FIVB    France            26
##  3 FIVB    Lithuania         29
##  4 FIVB    Australia         40
##  5 FIVB    Spain             40
##  6 FIVB    Latvia            42
##  7 FIVB    Slovenia          42
##  8 FIVB    Thailand          42
##  9 FIVB    Czech Republic    43
## 10 FIVB    Germany           43
## # ℹ 14 more rows

df_summary3 <- df_3 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary3

## # A tibble: 24 × 3
## # Groups:   circuit [2]
##    circuit country            n
##    <chr>   <chr>          <int>
##  1 FIVB    Korea              8
##  2 FIVB    France            12
##  3 FIVB    Thailand          31
##  4 FIVB    Czech Republic    32
##  5 FIVB    Lithuania         32
##  6 FIVB    Australia         36
##  7 FIVB    Belgium           38
##  8 FIVB    Hungary           39
##  9 FIVB    Slovenia          40
## 10 FIVB    Spain             40
## # ℹ 14 more rows

df_summary4 <- df_4 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary4

## # A tibble: 24 × 3
## # Groups:   circuit [2]
##    circuit country            n
##    <chr>   <chr>          <int>
##  1 FIVB    Korea             11
##  2 FIVB    France            15
##  3 FIVB    Thailand          29
##  4 FIVB    Lithuania         30
##  5 FIVB    Spain             31
##  6 FIVB    Latvia            35
##  7 FIVB    Slovenia          35
##  8 FIVB    Czech Republic    36
##  9 FIVB    Hungary           37
## 10 FIVB    Belgium           41
## # ℹ 14 more rows

df_summary5 <- df_5 |>
  group_by(circuit) |>
  count(country) |>
  arrange(n)
df_summary5

## # A tibble: 24 × 3
## # Groups:   circuit [2]
##    circuit country       n
##    <chr>   <chr>     <int>
##  1 FIVB    Korea         6
##  2 FIVB    France       17
##  3 FIVB    Thailand     33
##  4 FIVB    Slovenia     34
##  5 FIVB    Australia    36
##  6 FIVB    Spain        37
##  7 FIVB    Lithuania    39
##  8 FIVB    Austria      41
##  9 FIVB    Hungary      42
## 10 FIVB    Belgium      45
## # ℹ 14 more rows

We could compare many different variables across these samples, but I chose to count the matches from each country. The country counts are interesting because the order from fewest to most matches is very different in each sample. I grouped by circuit first. The US is the only country with AVP circuit matches, and it also has the most matches in every sample, which is not surprising given that it has by far the most matches in the total data set. Korea has the fewest matches in every sample. Aside from these two, the order changes a lot from sample to sample. This shows that a single sample is not always a great representation of the total data set, but as we take more and more samples, the average across them usually gives a good summary.
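The idea that the average across samples tracks the full data set can be checked directly. The sketch below uses a made-up `toy_data` with four countries (the proportions are assumptions, not the real ones) and compares the mean count per country across five samples with the count we would expect in a half-size sample.

```r
set.seed(1)  # assumed seed for reproducibility
toy_data <- data.frame(
  country = sample(c("Korea", "France", "Spain", "United States"),
                   size = 1000, replace = TRUE,
                   prob = c(0.05, 0.15, 0.30, 0.50))
)
# A factor keeps every country in every table, even if a sample misses one
toy_data$country <- factor(toy_data$country)

n_samples <- 5
samp_size <- round(nrow(toy_data) * 0.5)

# One column of counts per sample (rows are countries)
counts <- sapply(seq_len(n_samples), function(i) {
  idx <- sample(nrow(toy_data), size = samp_size, replace = TRUE)
  table(toy_data$country[idx])
})
colnames(counts) <- paste0("df_", seq_len(n_samples))

# Expected count in a half-size sample: full-data count scaled by the sampling fraction
expected <- as.vector(table(toy_data$country)) * samp_size / nrow(toy_data)

comparison <- data.frame(
  country       = rownames(counts),
  mean_sample_n = rowMeans(counts),
  expected_n    = expected
)
comparison
```

Individual columns of `counts` can reorder the middle countries, while `mean_sample_n` should sit closer to `expected_n` than most single samples do.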

df_sum1 <- df_1 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum1

## # A tibble: 24 × 2
##    country        meanage
##    <chr>            <dbl>
##  1 Czech Republic    30.6
##  2 United States     30.5
##  3 Latvia            29.1
##  4 Switzerland       29.0
##  5 Mexico            29.0
##  6 Germany           28.8
##  7 Qatar             28.6
##  8 Brazil            27.4
##  9 Thailand          27.4
## 10 Türkiye           27.4
## # ℹ 14 more rows

df_sum2 <- df_2 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum2

## # A tibble: 24 × 2
##    country        meanage
##    <chr>            <dbl>
##  1 United States     30.7
##  2 Mexico            29.6
##  3 Qatar             28.8
##  4 Switzerland       28.7
##  5 Lithuania         28.6
##  6 Latvia            28.6
##  7 Czech Republic    28.6
##  8 Germany           28.5
##  9 Türkiye           28.3
## 10 Spain             27.5
## # ℹ 14 more rows

df_sum3 <- df_3 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum3

## # A tibble: 24 × 2
##    country        meanage
##    <chr>            <dbl>
##  1 United States     30.0
##  2 Mexico            29.5
##  3 Germany           29.2
##  4 Czech Republic    28.5
##  5 Türkiye           28.3
##  6 Latvia            28.2
##  7 Qatar             28.1
##  8 Switzerland       27.8
##  9 Lithuania         27.7
## 10 Brazil            27.7
## # ℹ 14 more rows

df_sum4 <- df_4 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum4

## # A tibble: 24 × 2
##    country        meanage
##    <chr>            <dbl>
##  1 United States     30.4
##  2 Switzerland       29.7
##  3 Mexico            29.5
##  4 Germany           28.9
##  5 Qatar             28.8
##  6 Brazil            28.7
##  7 Czech Republic    28.7
##  8 Latvia            28.0
##  9 Türkiye           27.8
## 10 Thailand          27.5
## # ℹ 14 more rows

df_sum5 <- df_5 |>
  group_by(country) |>
  summarise(meanage = mean(w_p1_age, na.rm = TRUE)) |>
  arrange(desc(meanage))
df_sum5

## # A tibble: 24 × 2
##    country        meanage
##    <chr>            <dbl>
##  1 United States     30.4
##  2 Czech Republic    29.8
##  3 Latvia            29.3
##  4 Switzerland       29.2
##  5 Germany           29.1
##  6 Mexico            29.1
##  7 Brazil            28.6
##  8 Qatar             28.5
##  9 Türkiye           27.9
## 10 Lithuania         27.5
## # ℹ 14 more rows

I also decided to look at age, since that is something I have been looking into in past data dives. It is interesting how much the mean age for one player (w_p1_age) differs by country from sample to sample. I compared these sample means to the whole data set, and some of them are quite different from the mean of the entire data set.
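A compact way to make this comparison is to subtract the full-data mean for each country from the sample mean. This is a minimal sketch on made-up data: `toy_data`, the three countries, and the age distribution are all assumptions for illustration.

```r
set.seed(7)  # assumed seed for reproducibility
toy_data <- data.frame(
  country  = sample(c("Brazil", "Germany", "United States"), 600, replace = TRUE),
  w_p1_age = rnorm(600, mean = 28, sd = 4)
)
toy_data$w_p1_age[sample(600, 30)] <- NA  # mimic the missing ages in the real data

# Mean age per country in the full toy data set
full_means <- tapply(toy_data$w_p1_age, toy_data$country, mean, na.rm = TRUE)

# The same summary for one half-size sample drawn with replacement
samp <- toy_data[sample(nrow(toy_data), size = 300, replace = TRUE), ]
samp_means <- tapply(samp$w_p1_age, samp$country, mean, na.rm = TRUE)

# Positive values mean the sample overstates the mean age for that country
diffs <- round(samp_means[names(full_means)] - full_means, 2)
diffs
```

Repeating the last three steps for each sample gives one row of differences per sample, which makes it easy to see which countries move the most.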

ggplot(df_1, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Country") +
  theme_minimal()

## Warning: Removed 49 rows containing non-finite values (`stat_boxplot()`).

ggplot(df_2, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Country") +
  theme_minimal()

## Warning: Removed 54 rows containing non-finite values (`stat_boxplot()`).

ggplot(df_3, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Country") +
  theme_minimal()

## Warning: Removed 63 rows containing non-finite values (`stat_boxplot()`).

ggplot(df_4, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Country") +
  theme_minimal()

## Warning: Removed 58 rows containing non-finite values (`stat_boxplot()`).

ggplot(df_5, aes(x = country, y = w_p1_age)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Country") +
  theme_minimal()

## Warning: Removed 44 rows containing non-finite values (`stat_boxplot()`).

In sample 1, we see many anomalies. They appear on the graph as points that are not connected to the box plots. In this sample, the most obvious anomaly is a United States player who is almost 50.

In sample 2, a player from Belgium who is about 55 is definitely an outlier relative to the rest of the data, especially since the mean age in Belgium is about 26.

The box plot for sample 3 has a high value from Australia, and we still see the higher values in Belgium and the United States. Compared with the mean values, these are all anomalies and values that we would doubt. France also has a very low mean, well below its mean in the total data set.

Samples 4 and 5 also contain many anomalies, like the other samples. We can see all of them on the box plots, and these are points we would want to look into to make sure that they were in our original data set and that they make sense. The data set includes players' birthdays, so we may want to use them to verify that these points are correct.
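Both checks can be sketched in code. The example below flags outliers within each country using the 1.5 × IQR rule (the default rule `geom_boxplot()` uses to draw separate points) and then recomputes age from a birth date. The data are made up, and the column names `birthdate` and `match_date` are hypothetical; the real columns may be named differently.

```r
# Toy data: made-up ages and dates; `birthdate` and `match_date` are hypothetical names
toy <- data.frame(
  country    = c(rep("Belgium", 6), rep("United States", 6)),
  w_p1_age   = c(24, 25, 26, 27, 26, 55, 29, 30, 31, 30, 32, 49),
  birthdate  = as.Date(c("1998-03-01", "1997-05-10", "1996-02-20", "1995-07-04",
                         "1996-01-15", "1967-06-30", "1993-04-12", "1992-08-08",
                         "1991-03-03", "1992-02-02", "1990-05-05", "1973-01-20")),
  match_date = rep(as.Date("2022-07-01"), 12)
)

# TRUE for values more than 1.5 * IQR beyond the first or third quartile
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Apply the rule within each country (ave() returns 0/1, so compare with 1)
toy$outlier <- ave(toy$w_p1_age, toy$country, FUN = is_outlier) == 1

# Recompute age in years from the dates and compare it with the recorded age
toy$age_from_birthdate <- as.numeric(toy$match_date - toy$birthdate, units = "days") / 365.25
toy$age_consistent <- abs(toy$age_from_birthdate - toy$w_p1_age) < 1

toy[toy$outlier, c("country", "w_p1_age", "age_from_birthdate", "age_consistent")]
```

An outlier whose recorded age agrees with its birth date is probably a real older player rather than a data-entry error.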

This data dive showed that samples of a population are not always great representations of the entire population. The samples are not too far off, but we should still not draw conclusions from a single sample. As we collect more samples, their averages represent the entire data set better, but a single sample could be misleading if we use it to draw conclusions about all of the data.
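To illustrate the last point, the sketch below tracks how the running average of sample means behaves as more samples are collected. The `population` vector is simulated (an assumption standing in for the ages in the full data set), and 200 samples is an arbitrary choice.

```r
set.seed(123)  # assumed seed for reproducibility
population <- rnorm(2000, mean = 28, sd = 4)  # simulated ages, not the real data
samp_size <- round(length(population) * 0.5)

# Mean of one half-size sample drawn with replacement
one_sample_mean <- function() {
  mean(sample(population, size = samp_size, replace = TRUE))
}

sample_means <- replicate(200, one_sample_mean())

# Running average of the sample means after 1, 2, ..., 200 samples
running_avg <- cumsum(sample_means) / seq_along(sample_means)

# Distance from the population mean after 1, 5, 50, and 200 samples
abs(running_avg[c(1, 5, 50, 200)] - mean(population))
```

The distance does not have to shrink at every step, but after many samples the running average should sit very close to the population mean.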

2024-09-19
