STAT 360, Project 5

Published

December 12, 2025

Introduction

In this report, we analyze different aspects of an Olympic athlete dataset. The dataset is comprised of roughly 271,000 unique athlete-event-year tuples, as well as whether or not this athlete (in a given year and event) had received a medal. Olympic games from 1896 to 2016 are considered in this analysis.

Here is a sample of the first few columns of the first few entries of the data:

Code
library(ggplot2)
library(dplyr)
library(readxl)

oly.host <- read.csv("/home/user/School/STAT360/Project 5 (Olympics)/olym.csv")
oly <- read.csv("/home/user/School/STAT360/Project 5 (Olympics)/athlete_events.csv")

oly$Host <- ifelse(paste0(oly$NOC, oly$Year) %in% paste0(oly.host$NOC, oly.host$Year), TRUE, FALSE)

oly[1:10, 2:8]
                       Name Sex Age Height Weight           Team NOC
1                 A Dijiang   M  24    180     80          China CHN
2                  A Lamusi   M  23    170     60          China CHN
3       Gunnar Nielsen Aaby   M  24     NA     NA        Denmark DEN
4      Edgar Lindenau Aabye   M  34     NA     NA Denmark/Sweden DEN
5  Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
6  Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
7  Christine Jacoba Aaftink   F  25    185     82    Netherlands NED
8  Christine Jacoba Aaftink   F  25    185     82    Netherlands NED
9  Christine Jacoba Aaftink   F  27    185     82    Netherlands NED
10 Christine Jacoba Aaftink   F  27    185     82    Netherlands NED
Code
colnames(oly)
 [1] "ID"     "Name"   "Sex"    "Age"    "Height" "Weight" "Team"   "NOC"   
 [9] "Games"  "Year"   "Season" "City"   "Sport"  "Event"  "Medal"  "Host"  

1. Medal Distribution

Which countries have historically dominated the Olympics in terms of medals won?

Some key countries consistently win medals in specific events. This may be because of geography, national/social interest, lots of things:

Distribution of Olympic medal awards by country

Code
# Calculate summary statistics
count.countries <- oly |>
  group_by(NOC) |>
  summarize(
    gold = sum(Medal == "Gold", na.rm = TRUE),
    silver = sum(Medal == "Silver", na.rm = TRUE),
    bronze = sum(Medal == "Bronze", na.rm = TRUE)
  ) |>
  mutate(total = gold + silver + bronze) |>
  arrange(desc(total))

# Create stacked barplot
medals.bp <- barplot(
  height = t(as.matrix(count.countries[1:25, -(-4:-2)])),
  col = c("darkorange4", "grey", "gold"),
  main = "Olympic medals by country",
  xlab = "Country",
  ylab = "Medals"
)
legend(
  x = "topright",
  fill = c("gold", "grey", "darkorange4"),
  legend = c("Gold", "Silver", "Bronze"),
  bty = "n"
)
text(
  x = medals.bp,
  y = -300,
  labels = count.countries$NOC[1:25],
  adj = 1,
  srt = 60,
  xpd = TRUE
)
text(
  x = median(medals.bp),
  y = 6000,
  labels = "Top 25 countries",
  xpd = TRUE
)

It is well-known that the USA typically dominates the Olympics in terms of raw counts of medals won. The USA has accumulated 5637 medals: 2638 gold, 1641 silver, and 1358 bronze. Second place is, in fact, the USSR (wow!), with a total of 2503 medals. Then, Germany with 2165, Great Britain with 2068, and so on:

Code
as.data.frame(count.countries)
    NOC gold silver bronze total
1   USA 2638   1641   1358  5637
2   URS 1082    732    689  2503
3   GER  745    674    746  2165
4   GBR  678    739    651  2068
5   FRA  501    610    666  1777
6   ITA  575    531    531  1637
7   SWE  479    522    535  1536
8   CAN  463    438    451  1352
9   AUS  348    455    517  1320
10  RUS  390    367    408  1165
11  HUN  432    332    371  1135
12  NED  287    340    413  1040
13  NOR  378    361    294  1033
14  GDR  397    327    281  1005
15  CHN  350    347    292   989
16  JPN  247    309    357   913
17  FIN  198    270    432   900
18  SUI  175    248    268   691
19  ROU  161    200    292   653
20  KOR  221    232    185   638
21  DEN  179    241    177   597
22  FRG  159    194    233   586
23  POL  117    195    253   565
24  ESP  110    243    136   489
25  TCH   81    225    182   488
26  BRA  109    175    191   475
27  BEL   98    197    173   468
28  AUT  108    186    156   450
29  CUB  164    129    116   409
30  YUG  130    167     93   390
31  BUL   54    144    144   342
32  EUN  127     71     81   279
33  ARG   91     92     91   274
34  GRE   62    109     84   255
35  NZL   90     56     82   228
36  UKR   47     52    100   199
37  IND  138     19     40   197
38  JAM   38     75     44   157
39  CRO   58     54     37   149
40  CZE   42     36     66   144
41  BLR   24     44     71   139
42  RSA   32     47     52   131
43  PAK   42     45     34   121
44  MEX   30     26     54   110
45  KEN   34     41     31   106
46  NGR   23     30     46    99
47  TUR   40     27     28    95
48  SRB   15     29     41    85
49  KAZ   20     25     32    77
50  IRI   18     21     29    68
51  PRK   16     16     35    67
52  SCG   12     26     26    64
53  URU   31      2     30    63
54  LTU    6      7     48    61
55  ETH   22      9     22    53
56  EST   13     12     25    50
57  TPE    3     28     18    49
58  SLO    8     13     27    48
59  SVK   15     19     13    47
60  AZE    7     12     25    44
61  INA   11     17     13    41
62  POR    4     11     26    41
63  BAH   14     11     15    40
64  IRL    9     13     13    35
65  LAT    3     19     13    35
66  UZB   10      7     17    34
67  CHI    3      9     20    32
68  GEO    8      6     18    32
69  TTO    7      8     17    32
70  THA    9      8     13    30
71  ANZ   20      4      5    29
72  COL    5      9     14    28
73  EGY    7      8     12    27
74  MGL    2     10     14    26
75  GHA    0      1     22    23
76  MAR    6      5     12    23
77  CMR   20      1      1    22
78  ZIM   17      4      1    22
79  ALG    5      4      8    17
80  ISL    0     15      2    17
81  PAR    0     17      0    17
82  ARM    2      5      9    16
83  MAS    0     11      5    16
84  PER    1     14      0    15
85  VEN    2      3     10    15
86  MNE    0     14      0    14
87  FIJ   13      0      0    13
88  TUN    3      3      7    13
89  BOH    0      1     11    12
90  PHI    0      3      7    10
91  ISR    1      1      7     9
92  LIE    2      2      5     9
93  PUR    1      2      6     9
94  SGP    1      4      4     9
95  LUX    4      4      0     8
96  MDA    0      3      5     8
97  DOM    3      2      2     7
98  HAI    1      1      5     7
99  UGA    2      3      2     7
100 KSA    0      1      5     6
101 IOA    1      1      3     5
102 QAT    0      1      4     5
103 WIF    0      0      5     5
104 CRC    1      1      2     4
105 HKG    1      2      1     4
106 LIB    0      2      2     4
107 NAM    0      4      0     4
108 TJK    1      1      2     4
109 VIE    1      3      0     4
110 BRN    1      1      1     3
111 CIV    1      1      1     3
112 KGZ    0      1      2     3
113 PAN    1      0      2     3
114 SYR    1      1      1     3
115 AFG    0      0      2     2
116 BDI    1      1      0     2
117 ECU    1      1      0     2
118 GRN    1      1      0     2
119 KUW    0      0      2     2
120 MOZ    1      0      1     2
121 NIG    0      1      1     2
122 SRI    0      2      0     2
123 SUR    1      0      1     2
124 TAN    0      2      0     2
125 UAE    1      0      1     2
126 UAR    0      1      1     2
127 ZAM    0      1      1     2
128 AHO    0      1      0     1
129 BAR    0      0      1     1
130 BER    0      0      1     1
131 BOT    0      1      0     1
132 CYP    0      1      0     1
133 DJI    0      0      1     1
134 ERI    0      0      1     1
135 GAB    0      1      0     1
136 GUA    0      1      0     1
137 GUY    0      0      1     1
138 IRQ    0      0      1     1
139 ISV    0      1      0     1
140 JOR    1      0      0     1
141 KOS    1      0      0     1
142 MKD    0      0      1     1
143 MON    0      0      1     1
144 MRI    0      0      1     1
145 NEP    1      0      0     1
146 SEN    0      1      0     1
147 SUD    0      1      0     1
148 TGA    0      1      0     1
149 TOG    0      0      1     1
150 ALB    0      0      0     0
151 AND    0      0      0     0
152 ANG    0      0      0     0
153 ANT    0      0      0     0
154 ARU    0      0      0     0
155 ASA    0      0      0     0
156 BAN    0      0      0     0
157 BEN    0      0      0     0
158 BHU    0      0      0     0
159 BIH    0      0      0     0
160 BIZ    0      0      0     0
161 BOL    0      0      0     0
162 BRU    0      0      0     0
163 BUR    0      0      0     0
164 CAF    0      0      0     0
165 CAM    0      0      0     0
166 CAY    0      0      0     0
167 CGO    0      0      0     0
168 CHA    0      0      0     0
169 COD    0      0      0     0
170 COK    0      0      0     0
171 COM    0      0      0     0
172 CPV    0      0      0     0
173 CRT    0      0      0     0
174 DMA    0      0      0     0
175 ESA    0      0      0     0
176 FSM    0      0      0     0
177 GAM    0      0      0     0
178 GBS    0      0      0     0
179 GEQ    0      0      0     0
180 GUI    0      0      0     0
181 GUM    0      0      0     0
182 HON    0      0      0     0
183 IVB    0      0      0     0
184 KIR    0      0      0     0
185 LAO    0      0      0     0
186 LBA    0      0      0     0
187 LBR    0      0      0     0
188 LCA    0      0      0     0
189 LES    0      0      0     0
190 MAD    0      0      0     0
191 MAL    0      0      0     0
192 MAW    0      0      0     0
193 MDV    0      0      0     0
194 MHL    0      0      0     0
195 MLI    0      0      0     0
196 MLT    0      0      0     0
197 MTN    0      0      0     0
198 MYA    0      0      0     0
199 NBO    0      0      0     0
200 NCA    0      0      0     0
201 NFL    0      0      0     0
202 NRU    0      0      0     0
203 OMA    0      0      0     0
204 PLE    0      0      0     0
205 PLW    0      0      0     0
206 PNG    0      0      0     0
207 RHO    0      0      0     0
208 ROT    0      0      0     0
209 RWA    0      0      0     0
210 SAA    0      0      0     0
211 SAM    0      0      0     0
212 SEY    0      0      0     0
213 SKN    0      0      0     0
214 SLE    0      0      0     0
215 SMR    0      0      0     0
216 SOL    0      0      0     0
217 SOM    0      0      0     0
218 SSD    0      0      0     0
219 STP    0      0      0     0
220 SWZ    0      0      0     0
221 TKM    0      0      0     0
222 TLS    0      0      0     0
223 TUV    0      0      0     0
224 UNK    0      0      0     0
225 VAN    0      0      0     0
226 VIN    0      0      0     0
227 VNM    0      0      0     0
228 YAR    0      0      0     0
229 YEM    0      0      0     0
230 YMD    0      0      0     0

The average number of medals/country is about 173, the median is 2 (heavily unbalanced), and there are 81 countries that have earned zero medals (as of the 2016 Olympics).

2. Medals by sport

Which sports tend to have the most medals awarded?

Depending on the type of the sport (individual or team), the age of the sport, and the sport’s popularity, not all Olympic sports have awarded the same counts of medals over the years. So, we ask, what does the distribution look like?

Medals awarded by sport

Code
# Calculate summary statistics
count.games <- oly |>
  group_by(Sport, Season) |>
  summarize(
    total = sum(!is.na(Medal))
  ) |>
  arrange(desc(total)) |>
  mutate(
    color = case_when(
      Season == "Summer" ~ "khaki",
      Season == "Winter" ~ "skyblue"
    )
  )

par(oma = c(2, 0, 0, 0))

# Create barplot
games.bp <- barplot(
  height = count.games$total[1:25],
  col = count.games$color[1:25],
  #  col = colorRampPalette(c("grey", "skyblue3"))(25),
  main = "Olympic medals by sport",
  xlab = "",
  ylab = "Medals (absolute count)"
)
text(
  x = games.bp,
  y = -250,
  labels = c(count.games$Sport[1:19], "CC Skiing", count.games$Sport[21:25]),
  cex = 0.8,
  adj = 1,
  srt = 60,
  xpd = TRUE
)
text(
  x = median(medals.bp),
  y = 4225,
  labels = "Top 25 medal-giving games",
  xpd = TRUE
)
legend(
  x = "topright",
  fill = c("khaki", "skyblue"),
  legend = c("Summer", "Winter"),
  title = "Event season",
  cex = 0.8,
  bty = "n"
)
mtext(
  "Games",
  side = 1,
  outer = TRUE
)

Code
par(oma = c(0, 0, 0, 0))

Athletics includes the track-and-field events (things like pole-vaulting, xx-meter sprints, hurdles, shotput, and more). As such, given the diversity of events present within “Athletics”, it makes sense to see so many awarded medals! Generally, however, much of the above graphic can be easily explained by the total number of Olympic games that the sport was present in.

Here is an adjustment for medals/game, so that we can see more clearly which games give many medals and which give few:

Code
# Calculate summary statistics
count.gamesadj <- oly |>
  group_by(Sport, Season) |>
  summarize(
    total = sum(!is.na(Medal)),
    years = length(unique(Year))
    ) |>
  mutate(peryear = total/years) |>
  arrange(desc(peryear)) |>
  mutate(
    color = case_when(
      Season == "Summer" ~ "khaki",
      Season == "Winter" ~ "skyblue"
    )
  )

par(oma = c(2, 0, 0, 0))

# Create barplot
gamesadj.bp <- barplot(
  height = count.gamesadj$peryear[1:25],
  col = count.gamesadj$color[1:25],
  main = "Olympic medals by sport",
  xlab = "",
  ylab = "Medals (per occurance)",
  cex.axis = 0.86
)
text(
  x = gamesadj.bp,
  y = -7,
  labels = c(count.gamesadj$Sport[1:20], "ST Speed Skating", count.gamesadj$Sport[22:25]),
  cex = 0.8,
  adj = 1,
  srt = 60,
  xpd = TRUE
)
text(
  x = median(gamesadj.bp),
  y = 145,
  labels = "Top 25 medal-giving games, (adjusted for sport presence)",
  cex = 0.8,
  xpd = TRUE
)
legend(
  x = "topright",
  fill = c("khaki", "skyblue"),
  legend = c("Summer", "Winter"),
  title = "Event season",
  cex = 0.8,
  bty = "n"
)
mtext(
  "Games",
  side = 1,
  outer = TRUE
)

Code
par(oma = c(0, 0, 0, 0))

Now we see what we expected. Athletics still gives many of medals each time it is present in the Olympics (again, lots of events in “Athletics”), but the team sports now show many awarded medals each year. Notably, there are few winter games among the top medal-giving games! This is likely because winter games tend to be more solitary than -team-based, with exceptions such as hockey and curling.

3. Athlete ages

4. Athlete sexes

How has the sex/gender distribution of athletes changed over the years?

Sexes of Olympic athletes

Code
# Calculate summary statistics
count.sex <- oly |>
  group_by(Year, ID) |>
  slice(1) |>
  ungroup() |>
  group_by(Year, Sex, Season) |>
  summarize(total = n())

count.sex.winter <- count.sex |>
  filter(Season == "Winter")

count.sex.summer <- count.sex |>
  filter(Season == "Summer")

# Plot over time
ggplot(data = count.sex.summer, aes(x = Year, y = total, fill = Sex)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = c(M = "skyblue", F = alpha("hotpink2", alpha = 0.75))) +
  labs(
    title = "Male vs. Female athletes",
    subtitle = "Summer Olympics",
    y = "Athletes (count)"
  ) +
  theme_classic()

Code
ggplot(data = count.sex.winter, aes(x = Year, y = total, fill = Sex)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = c(M = "skyblue", F = alpha("hotpink2", alpha = 0.75))) +
  labs(
    title = "Male vs. Female athletes",
    subtitle = "Winter Olympics",
    y = "Athletes (count)"
  ) +
  theme_classic()

We can see that the relative proportions of male to female athletes has always been large on the male side, holding a constant majority of athlete pools year-by-year (for both summer and winter Olympic games). The number of female athletes is increasing relative to men, and nearing a 50-50 split, especially in the summer games!

Additionally, it’s interesting that the summer games see far more athletes in general than the winter games do. This aligns with earlier conclusions about medal distributions, as the winter games typically has a greater number of solitary events than the summer games.

5. Host country performance

Does the performance of a country change when they are hosting the Olympics?

Code
# Find Canada host years
can.hostyears <- oly.host |>
  filter(NOC == "CAN") |>
  pull(Year)

can.presyears <- oly |>
  filter(NOC == "CAN") |>
  group_by(Year) |>
  slice(1) |>
  pull(Year)

# Find awarded medals by year
count.years <- oly |>
  filter(!is.na(Medal)) |>
  group_by(Year) |>
  summarize(total = n()) |>
  filter(Year %in% can.presyears)

# Find awarded medals (to Canada) by year
can.perf <- oly |>
  filter(NOC == "CAN", !is.na(Medal)) |>
  group_by(Year) |>
  summarize(total = n())

# Find proportions of all medals won by USA, mark host year
can.perf <- can.perf |>
  mutate(
    prop = total/count.years$total,
    host = ifelse(Year %in% can.hostyears, TRUE, FALSE)
  )

# Share results
results <- data.frame(
  c(
    paste0("Mean proportion of medals won (hosting): ", trunc(mean(can.perf |> filter(host == TRUE) |> pull(prop)*10000))/100, "%"),
    paste0("Mean proportion of medals won (not hosting): ", trunc(mean(can.perf |> filter(host == FALSE) |> pull(prop)*10000))/100, "%")
  )
)

colnames(results) <- "Results (Canada)"
results
                                    Results (Canada)
1     Mean proportion of medals won (hosting): 6.85%
2 Mean proportion of medals won (not hosting): 4.57%
Code
# Make plot
ggplot(data = can.perf, aes(x = Year, y = prop, fill = host)) +
  geom_col() +
  scale_fill_manual(
    name = "Host status",
    values = c("TRUE" = "royalblue", "FALSE" = "grey50")
  ) +
  labs(
    title = "Proportion of total medals won",
    subtitle = "Performance of Canada by year",
    y = "Proportion"
  ) +
  theme_classic()

Clearly, for Canada at least, they win a higher proportion of the medals (on average) while hosting than while not hosting. Note that there may be some explanation in that Canada tends to perform well in the winter games, and 2/3 of their hosting years were winter games years.

We can perform the same analysis with France as well:

Code
# Find France host years
fra.hostyears <- oly.host |>
  filter(NOC == "FRA") |>
  pull(Year)

fra.presyears <- oly |>
  filter(NOC == "FRA") |>
  group_by(Year) |>
  slice(1) |>
  pull(Year)

# Find awarded medals by year
count.years <- oly |>
  filter(!is.na(Medal)) |>
  group_by(Year) |>
  summarize(total = n()) |>
  filter(Year %in% fra.presyears)

# Find awarded medals (to France) by year
fra.perf <- oly |>
  filter(NOC == "FRA", !is.na(Medal)) |>
  group_by(Year) |>
  summarize(total = n())

# Find proportions of all medals won by USA, mark host year
fra.perf <- fra.perf |>
  mutate(
    prop = total/count.years$total,
    host = ifelse(Year %in% fra.hostyears, TRUE, FALSE)
  )

# Share results
results <- data.frame(
  c(
    paste0("Mean proportion of medals won (hosting): ", trunc(mean(fra.perf |> filter(host == TRUE) |> pull(prop)*10000))/100, "%"),
    paste0("Mean proportion of medals won (not hosting): ", trunc(mean(fra.perf |> filter(host == FALSE) |> pull(prop)*10000))/100, "%")
  )
)

colnames(results) <- "Results (France)"
results
                                    Results (France)
1    Mean proportion of medals won (hosting): 14.41%
2 Mean proportion of medals won (not hosting): 4.09%
Code
# Make plot
ggplot(data = fra.perf, aes(x = Year, y = prop, fill = host)) +
  geom_col() +
  scale_fill_manual(
    name = "Host status",
    values = c("TRUE" = "red4", "FALSE" = "grey50")
  ) +
  labs(
    title = "Proportion of total medals won",
    subtitle = "Performance of France by year",
    y = "Proportion"
  ) +
  theme_classic()

This one may be more thoroughly explained by the fact that, in the early modern Olympics (~1900s), there were far fewer countries present in the games, and France was one of them. By year, after that, the proportion of medals won by France seems consistent, regardless of whether or not they are hosting the games.

We can perform the same analysis with France as well:

Code
# Find France host years
ita.hostyears <- oly.host |>
  filter(NOC == "ITA") |>
  pull(Year)

ita.presyears <- oly |>
  filter(NOC == "ITA") |>
  group_by(Year) |>
  slice(1) |>
  pull(Year)

# Find awarded medals by year
count.years <- oly |>
  filter(!is.na(Medal)) |>
  group_by(Year) |>
  summarize(total = n()) |>
  filter(Year %in% ita.presyears)

# Find awarded medals (to France) by year
ita.perf <- oly |>
  filter(NOC == "ITA") |>
  group_by(Year) |>
  summarize(total = sum(!is.na(Medal)))

# Find proportions of all medals won by USA, mark host year
ita.perf <- ita.perf |>
  mutate(
    prop = total/count.years$total,
    host = ifelse(Year %in% ita.hostyears, TRUE, FALSE)
  )

# Share results
results <- data.frame(
  c(
    paste0("Mean proportion of medals won (hosting): ", trunc(mean(ita.perf |> filter(host == TRUE) |> pull(prop)*10000))/100, "%"),
    paste0("Mean proportion of medals won (not hosting): ", trunc(mean(ita.perf |> filter(host == FALSE) |> pull(prop)*10000))/100, "%")
  )
)

colnames(results) <- "Results (Italy)"
results
                                    Results (Italy)
1    Mean proportion of medals won (hosting): 6.14%
2 Mean proportion of medals won (not hosting): 4.2%
Code
# Make plot
ggplot(data = ita.perf, aes(x = Year, y = prop, fill = host)) +
  geom_col() +
  scale_fill_manual(
    name = "Host status",
    values = c("TRUE" = "forestgreen", "FALSE" = "grey50")
  ) +
  labs(
    title = "Proportion of total medals won",
    subtitle = "Performance of Italy by year",
    y = "Proportion"
  ) +
  theme_classic()

For Italy, there is somewhat of a better case for hosting being an indicator of athlete success.

In total, we can can conclude (informally) that whether or not a country is hosting the games may impact performance of the country, in terms of proportions of total medals earned; but not greatly. There are typically too few instances of each country hosting to draw real generalizations from, with the vast majority of countries (that have hosted) hosting less than 5 times total.