Before going to the analysis let’s udnerstand our data, by checking the structer.
## see the structure of the data
str(country_results_df)
## spc_tbl_ [3,780 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                     : num [1:3780] 2024 2024 2024 2024 2024 ...
##  $ country                  : chr [1:3780] "United States of America" "People's Republic of China" "Republic of Korea" "India" ...
##  $ team_size_all            : num [1:3780] 6 6 6 6 6 6 6 6 6 6 ...
##  $ team_size_male           : num [1:3780] 5 6 6 6 6 6 6 6 6 5 ...
##  $ team_size_female         : num [1:3780] 1 0 0 0 0 0 0 0 0 1 ...
##  $ p1                       : num [1:3780] 42 42 42 42 42 42 42 42 42 38 ...
##  $ p2                       : num [1:3780] 41 42 37 34 30 37 33 37 25 37 ...
##  $ p3                       : num [1:3780] 19 31 18 11 10 7 8 16 5 5 ...
##  $ p4                       : num [1:3780] 40 40 42 42 42 42 42 36 42 42 ...
##  $ p5                       : num [1:3780] 35 22 7 28 36 29 31 23 35 12 ...
##  $ p6                       : num [1:3780] 15 13 22 10 5 5 6 1 2 17 ...
##  $ p7                       : logi [1:3780] NA NA NA NA NA NA ...
##  $ awards_gold              : num [1:3780] 5 5 2 4 4 1 2 2 1 2 ...
##  $ awards_silver            : num [1:3780] 1 1 4 1 0 5 3 3 4 2 ...
##  $ awards_bronze            : num [1:3780] 0 0 0 0 2 0 1 1 1 2 ...
##  $ awards_honorable_mentions: num [1:3780] 0 0 0 1 0 0 0 0 0 0 ...
##  $ leader                   : chr [1:3780] "John Berman" "Liang Xiao" "Suyoung Choi" "Krishnan Sivasubramanian" ...
##  $ deputy_leader            : chr [1:3780] "Carl Schildkraut" "Yijun Yao" "Hwajong Yoo" "Rijul Saini" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   country = col_character(),
##   ..   team_size_all = col_double(),
##   ..   team_size_male = col_double(),
##   ..   team_size_female = col_double(),
##   ..   p1 = col_double(),
##   ..   p2 = col_double(),
##   ..   p3 = col_double(),
##   ..   p4 = col_double(),
##   ..   p5 = col_double(),
##   ..   p6 = col_double(),
##   ..   p7 = col_logical(),
##   ..   awards_gold = col_double(),
##   ..   awards_silver = col_double(),
##   ..   awards_bronze = col_double(),
##   ..   awards_honorable_mentions = col_double(),
##   ..   leader = col_character(),
##   ..   deputy_leader = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Comment: The country_results_df dataset provides a detailed view of the International Mathematical Olympiad (IMO), highlighting team compositions, individual scores, and awards across countries and years.

Now we understand the data, lets convert to data table and do some descriptive analysis
##  Convert to data.table
dt <- as.data.table(country_results_df)

 

After converting let’s do varius analysis, one by one. Above each chunk code, you will find two hashtags (##) and then a description what will be runing. After the output, when needed i will provide a short comment/interpretation
#  Lets see Top 3 countries with golden awards
top_gold <- dt[, .(total_gold = sum(awards_gold)), by = country][order(-total_gold)][1:3]
top_gold
##                       country total_gold
##                        <char>      <num>
## 1: People's Republic of China        185
## 2:   United States of America        151
## 3:         Russian Federation        106

As we can see, the table shows the top three countries with golden awards, led by People’s Republic of China with 185 golden awards.

## Now let's see Top 3 countries with silver awards
top_silver <- dt[, .(total_silver = sum(awards_silver)), by = country][order(-total_silver)][1:3]
top_silver
##     country total_silver
##      <char>        <num>
## 1:  Hungary          174
## 2:  Romania          158
## 3: Bulgaria          130

Now we see the top three countries with silver awards, led by Hungary with 174 silver awards.

 

## Now lets aggregating data and see total awards by year
awards_by_year <- dt[, .(
  total_gold = sum(awards_gold),
  total_silver = sum(awards_silver),
  total_bronze = sum(awards_bronze),
  total_honorable_mentions = sum(awards_honorable_mentions)
), by = year]
awards_by_year
##      year total_gold total_silver total_bronze total_honorable_mentions
##     <num>      <num>        <num>        <num>                    <num>
##  1:  2024         54          121          145                      170
##  2:  2023         49           89          170                      192
##  3:  2022         41           98          140                      210
##  4:  2021         52          103          148                       98
##  5:  2020         49          112          155                      173
##  6:  2019         52           94          156                      144
##  7:  2018         48           98          143                      138
##  8:  2017         48           90          153                      222
##  9:  2016         44          101          135                      162
## 10:  2015         39          100          143                      126
## 11:  2014         49          113          133                      151
## 12:  2013         45           92          141                      141
## 13:  2012         51           88          137                      148
## 14:  2011         54           90          137                      120
## 15:  2010         NA           NA           NA                       NA
## 16:  2009         49           98          135                       96
## 17:  2008         47          100          120                      103
## 18:  2007         39           83          131                      149
## 19:  2006         42           89          122                      139
## 20:  2005         42           79          128                       68
## 21:  2004         45           78          120                       78
## 22:  2003         37           69          104                      116
## 23:  2002         39           73          120                       66
## 24:  2001         39           81          122                       37
## 25:  2000         39           71          119                       51
## 26:  1999         38           70          118                       12
## 27:  1998         37           66          102                       56
## 28:  1997         39           70          122                       78
## 29:  1996         35           66           99                       22
## 30:  1995         30           71          100                       99
## 31:  1994         30           64           98                       92
## 32:  1993         35           66           97                       12
## 33:  1992         26           57           87                       34
## 34:  1991         NA           NA           NA                       NA
## 35:  1990         23           56           76                       32
## 36:  1989         20           55           72                       64
## 37:  1988         17           48           65                       33
## 38:  1987         22           42           56                       NA
## 39:  1986         18           41           48                       NA
## 40:  1985         14           35           52                       NA
## 41:  1984         14           35           49                       NA
## 42:  1983          9           27           57                       NA
## 43:  1982         10           20           31                       NA
## 44:  1981         36           37           30                       NA
## 45:  1979          8           32           42                       NA
## 46:  1978          5           20           38                       NA
## 47:  1977         13           29           35                       NA
## 48:  1976          9           28           45                       NA
## 49:  1975          8           25           36                       NA
## 50:  1974         10           24           37                       NA
## 51:  1973          5           15           48                       NA
## 52:  1972          8           16           30                       NA
## 53:  1971          7           12           29                       NA
## 54:  1970          7           11           40                       NA
## 55:  1969          3           20           21                       NA
## 56:  1968         22           22           20                       NA
## 57:  1967         11           14           26                       NA
## 58:  1966         13           15           11                       NA
## 59:  1965          8           12           17                       NA
## 60:  1964          7            9           19                       NA
## 61:  1963          7           11           17                       NA
## 62:  1962          4           12           15                       NA
## 63:  1961          3            4            4                       NA
## 64:  1960          4            4            4                       NA
## 65:  1959          3            3            5                       NA
##      year total_gold total_silver total_bronze total_honorable_mentions

The above table captures the evolving landscape of mathematical excellence and competition at the IMO, reflecting both historical consistency and growth in recognition over the years. But, lets create the same one in a beautiful and nice format

# Same table but in the beutiful and proper format
kable(awards_by_year, 
     caption = "Awards by Year", 
     align = "c", 
     digits = 0) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), 
               full_width = F) 
Awards by Year
year total_gold total_silver total_bronze total_honorable_mentions
2024 54 121 145 170
2023 49 89 170 192
2022 41 98 140 210
2021 52 103 148 98
2020 49 112 155 173
2019 52 94 156 144
2018 48 98 143 138
2017 48 90 153 222
2016 44 101 135 162
2015 39 100 143 126
2014 49 113 133 151
2013 45 92 141 141
2012 51 88 137 148
2011 54 90 137 120
2010 NA NA NA NA
2009 49 98 135 96
2008 47 100 120 103
2007 39 83 131 149
2006 42 89 122 139
2005 42 79 128 68
2004 45 78 120 78
2003 37 69 104 116
2002 39 73 120 66
2001 39 81 122 37
2000 39 71 119 51
1999 38 70 118 12
1998 37 66 102 56
1997 39 70 122 78
1996 35 66 99 22
1995 30 71 100 99
1994 30 64 98 92
1993 35 66 97 12
1992 26 57 87 34
1991 NA NA NA NA
1990 23 56 76 32
1989 20 55 72 64
1988 17 48 65 33
1987 22 42 56 NA
1986 18 41 48 NA
1985 14 35 52 NA
1984 14 35 49 NA
1983 9 27 57 NA
1982 10 20 31 NA
1981 36 37 30 NA
1979 8 32 42 NA
1978 5 20 38 NA
1977 13 29 35 NA
1976 9 28 45 NA
1975 8 25 36 NA
1974 10 24 37 NA
1973 5 15 48 NA
1972 8 16 30 NA
1971 7 12 29 NA
1970 7 11 40 NA
1969 3 20 21 NA
1968 22 22 20 NA
1967 11 14 26 NA
1966 13 15 11 NA
1965 8 12 17 NA
1964 7 9 19 NA
1963 7 11 17 NA
1962 4 12 15 NA
1961 3 4 4 NA
1960 4 4 4 NA
1959 3 3 5 NA

 

Now let’s start creating some plots and visualise the analyse.
## Total gold awards by year
plot1 <- ggplot(awards_by_year, aes(x = year, y = total_gold)) +
  geom_line(color = "#2c7fb8", linewidth = 1) +
  labs(title = "Total Gold Awards by Year", x = "Year", y = "Gold Awards") +
  theme_minimal()
plot1

The plot of total gold awards by year shows the trend of gold medals awarded at the International Mathematical Olympiad over time

 

## Now, let's see Top 5 countries in gold award section
## Summarizing total gold awards by country
top_countries_gold <- dt[, .(total_gold = sum(awards_gold, na.rm = TRUE)), by = country]

## Selecting the top 5 countries with the most gold awards
top_5_countries_gold <- top_countries_gold[order(-total_gold)][1:5]

## Filtering the dataset for these top 5 countries
dt_top_5_countries <- dt[country %in% top_5_countries_gold$country]
## Plot: Gold Awards by Top 5 Countries
plot_gold <- ggplot(top_5_countries_gold, aes(x = reorder(country, -total_gold), y = total_gold, fill = country)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Top 5 Countries by Total Gold Awards",
    x = "Country",
    y = "Total Gold Awards"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_gold

The plot of the Top 5 Countries by Total Gold Awards visualizes the distribution of gold medals among the top-performing countries. And we can see that People’s Republic of China is the first one.

 

## Now, lets see top silver awards by country, but in horizontal bar chart.
## Summarizing total silver awards by country
top_countries_silver <- dt[, .(total_silver = sum(awards_silver, na.rm = TRUE)), by = country]
## Selecting the top 5 countries with the most silver awards
top_5_countries_silver <- top_countries_silver[order(-total_silver)][1:5]
## Silver Awards by Top 5 Countries
plot_silver <- ggplot(top_5_countries_silver, aes(x = reorder(country, total_silver), y = total_silver, fill = country)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Top 5 Countries by Total Silver Awards",
    x = "Country",
    y = "Total Silver Awards"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 12, angle = 45, hjust = 1)) +
  coord_flip()  
plot_silver

Now we can see that Hungary leads with over 150 silver awards, while the USA is in 5th place.

 

## Now let's see the proportion of Total Bronze Awards by Top 5 Countries
# Summarize total bronze awards by country
top_countries_bronze <- dt[, .(total_bronze = sum(awards_bronze, na.rm = TRUE)), by = country]

## Select the top 5 countries with the most bronze awards
top_5_countries_bronze <- top_countries_bronze[order(-total_bronze)][1:5]

## Create a pie chart
plot_bronze_pie <- ggplot(top_5_countries_bronze, aes(x = "", y = total_bronze, fill = country)) +
  geom_bar(stat = "identity", width = 1, color = "#636363") + 
  coord_polar(theta = "y") + 
  labs(
    title = "Proportion of Total Bronze Awards by Top 5 Countries",
    x = NULL,
    y = NULL
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1") +
  theme(axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = scales::percent(total_bronze / sum(total_bronze))), 
            position = position_stack(vjust = 0.5), color = "#636363")

plot_bronze_pie

As we can see above, the pie chart for the Proportion of Total Bronze Awards by Top 5 Countries displays the relative share of bronze awards won by each of the top 5 countries. Each slice of the pie represents a country’s contribution to the total number of bronze awards, offering a clear view of how the bronze awards are distributed among the leading countries.

 

## Now let's create a group bar chart and see who are top five countries in 2024.
## Filter data for the year 2024
data_2024 <- dt[year == 2024]

## Summarize gold, silver, and bronze awards by country
awards_2024 <- data_2024[, .(
  total_gold = sum(awards_gold, na.rm = TRUE),
  total_silver = sum(awards_silver, na.rm = TRUE),
  total_bronze = sum(awards_bronze, na.rm = TRUE)
), by = country]

## Add a column for the total awards (gold + silver + bronze)
awards_2024[, total_awards := total_gold + total_silver + total_bronze]

## Select the top 5 countries based on total awards
top_5_awards_2024 <- awards_2024[order(-total_awards)][1:5]

## Melt the data to long format for plotting
awards_2024_long <- melt(top_5_awards_2024[, .(country, total_gold, total_silver, total_bronze)], 
                         id.vars = "country", 
                         variable.name = "award_type", 
                         value.name = "total_awards")

## Create a grouped bar chart for top 5 countries
plot_2024_top5_awards <- ggplot(awards_2024_long, aes(x = country, y = total_awards, fill = award_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Gold, Silver, and Bronze Awards by Top 5 Countries (2024)",
    x = "Country",
    y = "Number of Awards",
    fill = "Award Type"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_2024_top5_awards

The grouped bar chart for Gold, Silver, and Bronze Awards by Top 5 Countries (2024) shows the number of each type of award won by the top 5 countries in the year 2024.

 

## Now let's focus on distriboution by gender. Let's see Distribution of Participation by Gender Over the Years.
## Aggregate the team size by gender (male and female) for each country and year
gender_distribution_by_year <- dt[, .(
  total_male = sum(team_size_male, na.rm = TRUE),
  total_female = sum(team_size_female, na.rm = TRUE)
), by = year]

## Reshape the data to have gender types in one column
gender_distribution_long <- melt(gender_distribution_by_year, id.vars = "year", 
                                 variable.name = "gender", value.name = "total_participants")

# Create a stacked bar chart to show the distribution of male and female participants by year
plot_gender_distribution <- ggplot(gender_distribution_long, aes(x = factor(year), y = total_participants, fill = gender)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("total_male" = "#1f77b4", "total_female" = "#ff7f0e")) +
  labs(
    title = "Distribution of Participation by Gender Over the Years",
    x = "Year",
    y = "Total Participants",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.x = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16)
  )
# Print the plot
print(plot_gender_distribution)

Now, the plot of the Distribution of Participation by Gender Over the Years visualizes the number of male and female participants at the International Mathematical Olympiad each year. The stacked bar chart shows the total number of participants each year, with the height of each bar segment representing the number of male and female participants respectively.

 

## Now let's focus in individual country. Let's for example Kosovo.
# Filter data for Kosovo
kosovo_data <- dt[country == "Kosovo"]

# Summarize total awards (gold, silver, bronze, honorable mentions) by year
kosovo_awards_by_year <- kosovo_data[, .(
  total_gold = sum(awards_gold, na.rm = TRUE),
  total_silver = sum(awards_silver, na.rm = TRUE),
  total_bronze = sum(awards_bronze, na.rm = TRUE),
  total_honorable_mentions = sum(awards_honorable_mentions, na.rm = TRUE)
), by = year]

# Calculate total awards for each year by summing all award types
kosovo_awards_by_year[, total_awards := total_gold + total_silver + total_bronze + total_honorable_mentions]

# Create a plot to show the total awards by year for Kosovo
plot_kosovo_awards_by_year <- ggplot(kosovo_awards_by_year, aes(x = year, y = total_awards)) +
  geom_line(color = "#0073e6", linewidth = 1.2) +  # Blue line for total awards
  geom_point(color = "#ff5733", linewidth = 3) +  # Red points to highlight the years
  labs(
    title = "Total Awards Won by Kosovo by Year",
    x = "Year",
    y = "Total Awards"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.title.x = element_text(face = "bold", size = 14),
    axis.title.y = element_text(face = "bold", size = 14)
  )

# Display the plot
plot_kosovo_awards_by_year

As we can see above, the plot of Total Awards Won by Kosovo by Year shows the total number of awards won by Kosovo each year at the International Mathematical Olympiad. The plot uses a line graph with points to display the trend in awards over time.

 

Lets do more comprehensive analysis and load the next data set (individual results) and merge with the first data set which was country_results
individual_results_df <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-24/individual_results_df.csv', show_col_types = FALSE)

 

# Next step is to merge, so let's do by the following code. I will merge bon 'year and country'
merged_df <- merge(country_results_df, individual_results_df, by = c("year", "country"))

 

## Let's see Individual vs Country Performance
# Scatter plot for individual total scores vs country awards
individual_vs_country <- merged_df %>% 
  group_by(country) %>% 
  summarize(
    avg_individual_score = mean(total, na.rm = TRUE),
    total_country_awards = sum(awards_gold + awards_silver + awards_bronze, na.rm = TRUE)
  )

# Plot
ggplot(individual_vs_country, aes(x = avg_individual_score, y = total_country_awards, label = country)) +
  geom_point(size = 3, color = "#7fcdbb") +
  geom_text(size = 3, hjust = -0.1) +
  labs(
    title = "Individual vs Country Performance",
    x = "Average Individual Score",
    y = "Total Country Awards"
  ) +
  theme_minimal()

As we can see the scatter plot for Individual vs Country Performance allows for observation of a correlation between a country’s average individual scores and its total country awards. The plot shows each country represented by a point, with its horizontal position representing the average individual score of all participants from that country, and its vertical position representing the total number of awards (gold, silver, and bronze) won by that country.

 

## Now, let's see the Distribution of Total Score by Country, focusing in Albania and Kosovo
# Set up a dynamic color palette using colorRampPalette
color_palette <- colorRampPalette(brewer.pal(9, "Set3"))(2)  # Only two countries: Albania and Kosovo

# Filter data for Albania and Kosovo
merged_df_filtered <- merged_df %>%
  filter(country %in% c("Albania", "Kosovo"))

# Create the box plot
ggplot(merged_df_filtered, aes(x = country, y = total, fill = country)) +
  geom_boxplot(outlier.size = 2, outlier.colour = "#de2d26", alpha = 0.7) +   # Box plot with outliers highlighted
  scale_fill_manual(values = color_palette) +  # Dynamically apply the color palette
  labs(
    title = "Distribution of Total Scores by Country",
    subtitle = "Comparing the spread of scores between Albania and Kosovo",
    x = "Country",
    y = "Total Score",
    caption = "Data from International Mathematical Olympiad"
  ) +
  theme_minimal(base_size = 14) +  # Apply a minimal theme with adjusted base size for readability
  theme(
    legend.position = "none",  # No legend needed for this plot
    axis.text.x = element_text(angle = 90, hjust = 1)  # Rotate x-axis labels for better readability
  )

Now, we see the box plot of the Distribution of Total Scores by Country compares the spread of total scores between Albania and Kosovo. Based on this, Kosovo’s scores appear to have a wider range, as indicated by the larger box and more spread-out outliers. This suggests there is more variability in individual performance of Kosovo participants. Albania’s scores are more tightly clustered, with less variation around the median. The outliers are also less dispersed compared to Kosovo. This suggests more consistency in the scores achieved by Albanian participants.