Using the Olympic data I will apply the methods of the \(\chi^2\) distribution. Let’s find a nice contingency table to look at.
table(data$Medal)
##
## Bronze Gold Silver
## 13295 13372 13116
Okay so let’s ask if there is any difference in the number of medals of each type being awarded. Of course we assume that the number of medals should be equal. (I actually know this may not be the case, for example in boxing two bronze medals are always awarded) Here is my null and alternative hypothesis: \[ H_0: \text{The number of medals awarded are equal}\\ H_A: \text{The number of medals awarded are not all equal}. \] I could have also expressed this in terms of observed and expected. Something along the lines that the number of expected is equal to the number of observed.
test = chisq.test(table(data$Medal), p = c(1,1,1)/3)
test
##
## Chi-squared test for given probabilities
##
## data: table(data$Medal)
## X-squared = 2.6018, df = 2, p-value = 0.2723
While I had evidence to suggest these numbers might be different, the statistics did not bear that out. I will fail to reject my null hypothesis here.
The expected values for the number of medals follows
test$expected
## Bronze Gold Silver
## 13261 13261 13261
Here is my bar chart
barplot(table(data$Medal))
Not a terribly interesting plot. Maybe now I see why I failed to reject that these had any differences!
Let’s expand on this and ask a little different question, are the number of medals awarded dependent on the sport you play?
table(data$Sport,data$Medal)
##
## Bronze Gold Silver
## Aeronautics 0 1 0
## Alpine Skiing 141 143 144
## Alpinism 0 25 0
## Archery 98 135 120
## Art Competitions 51 49 56
## Athletics 1296 1339 1334
## Badminton 60 54 54
## Baseball 112 112 112
## Basketball 356 365 359
## Basque Pelota 0 2 0
## Beach Volleyball 24 24 24
## Biathlon 135 136 137
## Bobsleigh 136 133 129
## Boxing 441 252 251
## Canoeing 389 388 388
## Cricket 0 12 12
## Croquet 2 4 2
## Cross Country Skiing 263 256 257
## Curling 49 50 53
## Cycling 414 424 425
## Diving 143 142 142
## Equestrianism 319 324 322
## Fencing 566 594 583
## Figure Skating 127 131 128
## Football 543 515 513
## Freestyle Skiing 34 34 34
## Golf 16 15 15
## Gymnastics 719 791 746
## Handball 353 349 358
## Hockey 512 518 498
## Ice Hockey 507 508 515
## Jeu De Paume 1 1 1
## Judo 274 137 136
## Lacrosse 12 24 24
## Luge 60 62 58
## Military Ski Patrol 4 4 4
## Modern Pentathlon 62 62 62
## Motorboating 0 7 0
## Nordic Combined 55 55 55
## Polo 20 22 25
## Racquets 4 3 3
## Rhythmic Gymnastics 44 44 44
## Roque 1 1 1
## Rowing 990 978 977
## Rugby 16 67 79
## Rugby Sevens 25 25 24
## Sailing 367 450 415
## Shooting 407 410 411
## Short Track Speed Skating 94 95 95
## Skeleton 10 10 10
## Ski Jumping 68 69 70
## Snowboarding 30 30 30
## Softball 59 60 60
## Speed Skating 190 193 197
## Swimming 956 1099 993
## Synchronized Swimming 70 74 71
## Table Tennis 60 54 54
## Taekwondo 64 40 40
## Tennis 128 106 106
## Trampolining 10 10 10
## Triathlon 10 10 10
## Tug-Of-War 29 43 43
## Volleyball 325 322 322
## Water Polo 360 350 347
## Weightlifting 216 217 213
## Wrestling 468 413 415
Here we see my note about boxing come to light, there are twice as many bronze as other medals in this sport. Most of the rest do not seem as pronounced so perhaps we won’t see an effect. Let’s set up the hypothesis test. \[ H_O: \text{The number of medals is independent of sport}\\ H_A: \text{The number of medals is dependent on the sport} \]
Now we test!
test = chisq.test(table(data$Sport,data$Medal))
## Warning in chisq.test(table(data$Sport, data$Medal)): Chi-squared approximation
## may be incorrect
test
##
## Pearson's Chi-squared test
##
## data: table(data$Sport, data$Medal)
## X-squared = 325.88, df = 130, p-value < 2.2e-16
Here we are able to reject our null hypothesis. We have evidence to suggest that the number of medals is dependent on the sport. We should note that it is not that there are more medals in a particular sport but that they are not distributed in a uniform way, eg. boxing.
Esitmates to follow
test$expected
##
## Bronze Gold Silver
## Aeronautics 0.3341880 0.3361235 0.3296886
## Alpine Skiing 143.0324510 143.8608451 141.1067039
## Alpinism 8.3546992 8.4030867 8.2422140
## Archery 117.9683533 118.6515848 116.3800618
## Art Competitions 52.1333233 52.4352613 51.4314154
## Athletics 1326.3920519 1334.0740517 1308.5338964
## Badminton 56.1435789 56.4687429 55.3876782
## Baseball 112.2871578 112.9374859 110.7753563
## Basketball 360.9230073 363.0133474 356.0636453
## Basque Pelota 0.6683759 0.6722469 0.6593771
## Beach Volleyball 24.0615338 24.2008898 23.7375764
## Biathlon 136.3486917 137.1383757 134.5129327
## Bobsleigh 133.0068120 133.7771410 131.2160471
## Boxing 315.4734434 317.3005555 311.2260011
## Canoeing 389.3289847 391.5838423 384.0871729
## Cricket 8.0205113 8.0669633 7.9125255
## Croquet 2.6735038 2.6889878 2.6375085
## Cross Country Skiing 259.3298645 260.8318126 255.8383229
## Curling 50.7965714 51.0907674 50.1126612
## Cycling 422.0794058 424.5239424 416.3966518
## Diving 142.6982631 143.5247216 140.7770153
## Equestrianism 322.4913908 324.3591484 318.1494608
## Fencing 582.4896312 585.8632079 574.6471608
## Figure Skating 128.9965563 129.7436594 127.2597843
## Football 525.0093005 528.0499711 517.9407285
## Freestyle Skiing 34.0871729 34.2845939 33.6282332
## Golf 15.3726466 15.4616796 15.1656738
## Gymnastics 753.9280597 758.2945479 743.7773924
## Handball 354.2392479 356.2908780 349.4698741
## Hockey 510.6392178 513.5966619 503.7641204
## Ice Hockey 511.3075937 514.2689088 504.4234975
## Jeu De Paume 1.0025639 1.0083704 0.9890657
## Judo 182.8008194 183.8595380 180.3396426
## Lacrosse 20.0512782 20.1674082 19.7813136
## Luge 60.1538346 60.5022246 59.3439409
## Military Ski Patrol 4.0102556 4.0334816 3.9562627
## Modern Pentathlon 62.1589624 62.5189654 61.3220722
## Motorboating 2.3393158 2.3528643 2.3078199
## Nordic Combined 55.1410150 55.4603725 54.3986125
## Polo 22.3905940 22.5202725 22.0891335
## Racquets 3.3418797 3.3612347 3.2968856
## Rhythmic Gymnastics 44.1128120 44.3682980 43.5188900
## Roque 1.0025639 1.0083704 0.9890657
## Rowing 984.1835709 989.8836186 970.9328105
## Rugby 54.1384511 54.4520021 53.4095468
## Rugby Sevens 24.7299098 24.8731368 24.3969535
## Sailing 411.7195787 414.1041148 406.1763065
## Shooting 410.3828268 412.7596209 404.8575522
## Short Track Speed Skating 94.9093834 95.4590654 93.6315512
## Skeleton 10.0256391 10.0837041 9.8906568
## Ski Jumping 69.1769097 69.5775583 68.2455320
## Snowboarding 30.0769173 30.2511123 29.6719704
## Softball 59.8196466 60.1661011 59.0142523
## Speed Skating 193.8290224 194.9516125 191.2193651
## Swimming 1018.6049318 1024.5043360 1004.8907322
## Synchronized Swimming 71.8504135 72.2665460 70.8830405
## Table Tennis 56.1435789 56.4687429 55.3876782
## Taekwondo 48.1230676 48.4017797 47.4751527
## Tennis 113.6239097 114.2819797 112.0941105
## Trampolining 10.0256391 10.0837041 9.8906568
## Triathlon 10.0256391 10.0837041 9.8906568
## Tug-Of-War 38.4316165 38.6541990 37.9141845
## Volleyball 323.8281427 325.7036423 319.4682151
## Water Polo 353.2366840 355.2825076 348.4808084
## Weightlifting 215.8854284 217.1357615 212.9788100
## Wrestling 433.1076088 435.6160169 427.2763743
Lastly let’s try to visulaize the contingency table.
mosaicplot(table(data$Sport,data$Medal))
I did it but damn that is ugly! Let me try it with ggplot.
library(ggmosaic)
ggplot(data = data)+
geom_mosaic( aes(x = product( Sport, Sex),fill = Medal),na.rm = TRUE) +
labs(x = "Sex",y = "Sport", title = "Medals Awarded by Sport")
Not convinced that is all that much better but maybe you can recreate it and get something you like. I added Sex in there too so you could see how many fewer women have been awarded medals.