Olympic Medals Data Exploration

I discovered this data set at https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results. In order to create my data file, I clicked the icon to upload to Google sheets, published it to the web as a csv and copied a link. Some data will be possible to upload to github but if the file is large github will not do it. You need the data to be in a raw form, with csv.

summary(data)
##        ID             Name               Sex                 Age       
##  Min.   :     1   Length:271116      Length:271116      Min.   :10.00  
##  1st Qu.: 34643   Class :character   Class :character   1st Qu.:21.00  
##  Median : 68205   Mode  :character   Mode  :character   Median :24.00  
##  Mean   : 68249                                         Mean   :25.56  
##  3rd Qu.:102097                                         3rd Qu.:28.00  
##  Max.   :135571                                         Max.   :97.00  
##                                                         NA's   :9474   
##      Height          Weight          Team               NOC           
##  Min.   :127.0   Min.   : 25.0   Length:271116      Length:271116     
##  1st Qu.:168.0   1st Qu.: 60.0   Class :character   Class :character  
##  Median :175.0   Median : 70.0   Mode  :character   Mode  :character  
##  Mean   :175.3   Mean   : 70.7                                        
##  3rd Qu.:183.0   3rd Qu.: 79.0                                        
##  Max.   :226.0   Max.   :214.0                                        
##  NA's   :60171   NA's   :62875                                        
##     Games                Year         Season              City          
##  Length:271116      Min.   :1896   Length:271116      Length:271116     
##  Class :character   1st Qu.:1960   Class :character   Class :character  
##  Mode  :character   Median :1988   Mode  :character   Mode  :character  
##                     Mean   :1978                                        
##                     3rd Qu.:2002                                        
##                     Max.   :2016                                        
##                                                                         
##     Sport              Event              Medal          
##  Length:271116      Length:271116      Length:271116     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

I wonder what sports have had the most athletes of all time.

ggplot(data, aes(Sport)) +
  geom_bar()

table(data$Sport)
## 
##               Aeronautics             Alpine Skiing                  Alpinism 
##                         1                      8829                        25 
##                   Archery          Art Competitions                 Athletics 
##                      2334                      3578                     38624 
##                 Badminton                  Baseball                Basketball 
##                      1457                       894                      4536 
##             Basque Pelota          Beach Volleyball                  Biathlon 
##                         2                       564                      4893 
##                 Bobsleigh                    Boxing                  Canoeing 
##                      3058                      6047                      6171 
##                   Cricket                   Croquet      Cross Country Skiing 
##                        24                        19                      9133 
##                   Curling                   Cycling                    Diving 
##                       463                     10859                      2842 
##             Equestrianism                   Fencing            Figure Skating 
##                      6344                     10735                      2298 
##                  Football          Freestyle Skiing                      Golf 
##                      6745                       937                       247 
##                Gymnastics                  Handball                    Hockey 
##                     26707                      3665                      5417 
##                Ice Hockey              Jeu De Paume                      Judo 
##                      5516                        11                      3801 
##                  Lacrosse                      Luge       Military Ski Patrol 
##                        60                      1479                        24 
##         Modern Pentathlon              Motorboating           Nordic Combined 
##                      1677                        17                      1344 
##                      Polo                  Racquets       Rhythmic Gymnastics 
##                        95                        12                       658 
##                     Roque                    Rowing                     Rugby 
##                         4                     10595                       162 
##              Rugby Sevens                   Sailing                  Shooting 
##                       299                      6586                     11448 
## Short Track Speed Skating                  Skeleton               Ski Jumping 
##                      1534                       199                      2401 
##              Snowboarding                  Softball             Speed Skating 
##                       936                       478                      5613 
##                  Swimming     Synchronized Swimming              Table Tennis 
##                     23195                       909                      1955 
##                 Taekwondo                    Tennis              Trampolining 
##                       606                      2862                       152 
##                 Triathlon                Tug-Of-War                Volleyball 
##                       529                       170                      3404 
##                Water Polo             Weightlifting                 Wrestling 
##                      3846                      3937                      7154

The table gave me a better representation. Clearly I could not tell in the bar chart what the games were. Let me try one more way and include relative frequency.

#options(digits=2)
freq <- rbind(table(data$Sport),table(data$Sport)/length(data$Sport))
row.names(freq)<-c('Frequency','Relative Frequency')
freq
##                     Aeronautics Alpine Skiing     Alpinism      Archery
## Frequency          1.000000e+00   8.82900e+03 2.500000e+01 2.334000e+03
## Relative Frequency 3.688458e-06   3.25654e-02 9.221145e-05 8.608861e-03
##                    Art Competitions   Athletics    Badminton     Baseball
## Frequency               3.57800e+03 3.86240e+04 1.457000e+03 8.940000e+02
## Relative Frequency      1.31973e-02 1.42463e-01 5.374083e-03 3.297482e-03
##                      Basketball Basque Pelota Beach Volleyball     Biathlon
## Frequency          4.536000e+03  2.000000e+00      5.64000e+02 4.893000e+03
## Relative Frequency 1.673085e-02  7.376916e-06      2.08029e-03 1.804763e-02
##                      Bobsleigh       Boxing     Canoeing      Cricket
## Frequency          3.05800e+03 6.047000e+03 6.171000e+03 2.400000e+01
## Relative Frequency 1.12793e-02 2.230411e-02 2.276147e-02 8.852299e-05
##                        Croquet Cross Country Skiing      Curling      Cycling
## Frequency          1.90000e+01         9.133000e+03 4.630000e+02 1.085900e+04
## Relative Frequency 7.00807e-05         3.368669e-02 1.707756e-03 4.005297e-02
##                         Diving Equestrianism     Fencing Figure Skating
## Frequency          2.84200e+03  6.344000e+03 1.07350e+04   2.298000e+03
## Relative Frequency 1.04826e-02  2.339958e-02 3.95956e-02   8.476077e-03
##                        Football Freestyle Skiing         Golf   Gymnastics
## Frequency          6.745000e+03     9.370000e+02 2.470000e+02 2.670700e+04
## Relative Frequency 2.487865e-02     3.456085e-03 9.110491e-04 9.850765e-02
##                       Handball       Hockey   Ice Hockey Jeu De Paume
## Frequency          3.66500e+03 5.417000e+03 5.516000e+03 1.100000e+01
## Relative Frequency 1.35182e-02 1.998038e-02 2.034553e-02 4.057304e-05
##                            Judo     Lacrosse         Luge Military Ski Patrol
## Frequency          3.801000e+03 6.000000e+01 1.479000e+03        2.400000e+01
## Relative Frequency 1.401983e-02 2.213075e-04 5.455229e-03        8.852299e-05
##                    Modern Pentathlon Motorboating Nordic Combined         Polo
## Frequency               1.677000e+03 1.700000e+01    1.344000e+03 9.500000e+01
## Relative Frequency      6.185544e-03 6.270379e-05    4.957288e-03 3.504035e-04
##                       Racquets Rhythmic Gymnastics        Roque       Rowing
## Frequency          1.20000e+01        6.580000e+02 4.000000e+00 1.059500e+04
## Relative Frequency 4.42615e-05        2.427005e-03 1.475383e-05 3.907921e-02
##                           Rugby Rugby Sevens      Sailing     Shooting
## Frequency          1.620000e+02 2.990000e+02 6.586000e+03 1.144800e+04
## Relative Frequency 5.975302e-04 1.102849e-03 2.429218e-02 4.222547e-02
##                    Short Track Speed Skating     Skeleton  Ski Jumping
## Frequency                       1.534000e+03 1.990000e+02 2.401000e+03
## Relative Frequency              5.658095e-03 7.340032e-04 8.855988e-03
##                    Snowboarding     Softball Speed Skating     Swimming
## Frequency          9.360000e+02 4.780000e+02  5.613000e+03 2.319500e+04
## Relative Frequency 3.452397e-03 1.763083e-03  2.070332e-02 8.555379e-02
##                    Synchronized Swimming Table Tennis    Taekwondo       Tennis
## Frequency                   9.090000e+02 1.955000e+03 6.060000e+02 2.862000e+03
## Relative Frequency          3.352808e-03 7.210936e-03 2.235206e-03 1.055637e-02
##                    Trampolining    Triathlon   Tug-Of-War   Volleyball
## Frequency          1.520000e+02 5.290000e+02 1.700000e+02 3.404000e+03
## Relative Frequency 5.606456e-04 1.951194e-03 6.270379e-04 1.255551e-02
##                      Water Polo Weightlifting    Wrestling
## Frequency          3.846000e+03  3.937000e+03 7.154000e+03
## Relative Frequency 1.418581e-02  1.452146e-02 2.638723e-02

meh, that is still not great. One more time?

percent <- function(x, digits = 2, format = "f", ...) {      # Create user-defined function
  paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
} 
data %>%
  group_by(Sport) %>%
  summarize(Frequency = length(Sport),RelativeFrequency = percent(length(Sport)/271116))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 66 x 3
##    Sport            Frequency RelativeFrequency
##    <chr>                <int> <chr>            
##  1 Aeronautics              1 0.00%            
##  2 Alpine Skiing         8829 3.26%            
##  3 Alpinism                25 0.01%            
##  4 Archery               2334 0.86%            
##  5 Art Competitions      3578 1.32%            
##  6 Athletics            38624 14.25%           
##  7 Badminton             1457 0.54%            
##  8 Baseball               894 0.33%            
##  9 Basketball            4536 1.67%            
## 10 Basque Pelota            2 0.00%            
## # … with 56 more rows

Such a nicer, usable table! I needed the tidyverse library to do this but I think it was worth it. You should also notice that I created a function for doing the percent too! Take a minute and try to understand what that function is doing because somethings will need to be modified to make it work for your data!

Let’s look at a couple of the sports. Athletics, cycling and rowing are of interest to me.

sports <- c('Athletics','Cycling', 'Rowing')

ggplot(subset(data, Sport %in% sports), aes(Sport,fill = Sex)) +
  geom_bar()

Wow so many more men than women! Let’s see if the number of medals is as far apart.

table(data$Medal,data$Sex)
##         
##             F    M
##   Bronze 3771 9524
##   Gold   3747 9625
##   Silver 3735 9381

Clearly there have been way more men’s medals awarded then women’s medals. Let’s see if we can figure out what year women start competing at the same level as the men.

table(data$Sex,data$Year)
##    
##      1896  1900  1904  1906  1908  1912  1920  1924  1928  1932  1936  1948
##   F     0    33    16    11    47    87   134   261   437   369   549   761
##   M   380  1903  1285  1722  3054  3953  4158  5432  5137  2952  6852  6719
##    
##      1952  1956  1960  1964  1968  1972  1976  1980  1984  1988  1992  1994
##   F  1682  1139  1730  1752  2193  2608  2606  2186  2983  4223  5178  1105
##   M  7676  5295  7505  7728  8286  9351  7896  6751  8605 10453 11235  2055
##    
##      1996  1998  2000  2002  2004  2006  2008  2010  2012  2014  2016
##   F  5008  1384  5431  1582  5546  1757  5816  1847  5815  2023  6223
##   M  8772  2221  8390  2527  7897  2625  7786  2555  7105  2868  7465

Interestingly the female contingent is still smaller than the male. I wonder if there are less medals awarded too.

year16 <- data[which(data$Year == 2016),]
table(year16$Sex,year16$Medal)
##    
##     Bronze Gold Silver
##   F    331  318    320
##   M    372  347    335

Yes, less medals all the way around in the most recent year.

I guess I should do nationality counts and look at the number of medals won too!

data %>%
  group_by(Medal,Team,Sex) %>%
  summarize(Medals = length(Medal))
## `summarise()` regrouping output by 'Medal', 'Team' (override with `.groups` argument)
## # A tibble: 2,353 x 4
## # Groups:   Medal, Team [1,715]
##    Medal  Team                  Sex   Medals
##    <chr>  <chr>                 <chr>  <int>
##  1 Bronze A North American Team M          4
##  2 Bronze Afghanistan           M          2
##  3 Bronze Algeria               F          1
##  4 Bronze Algeria               M          7
##  5 Bronze Ali-Baba II           M          5
##  6 Bronze Amstel Amsterdam      M          4
##  7 Bronze Antwerpia V           M          5
##  8 Bronze Aphrodite             M          3
##  9 Bronze Argentina             F         37
## 10 Bronze Argentina             M         54
## # … with 2,343 more rows

Way too much data there to synthesize!

Let’s look at age and Team.

data %>%
  group_by(Team) %>%
  summarize(AveargeAge = mean(Age, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,184 x 2
##    Team                  AveargeAge
##    <chr>                      <dbl>
##  1 30. Februar                 33.5
##  2 A North American Team       41.3
##  3 Acipactli                   47.3
##  4 Acturus                     27  
##  5 Afghanistan                 23.5
##  6 Akatonbo                    40.3
##  7 Alain IV                    48  
##  8 Albania                     25.3
##  9 Alcaid                      30  
## 10 Alcyon-6                   NaN  
## # … with 1,174 more rows

Okay well again there is just too many teams! I wonder if there is a gender difference in age?

data %>%
  group_by(Sex) %>%
  summarize(AverageAge = mean(Age, na.rm = TRUE), StandardDeviation = sd(Age, na.rm = TRUE), SampleSize = length(Age))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   Sex   AverageAge StandardDeviation SampleSize
##   <chr>      <dbl>             <dbl>      <int>
## 1 F           23.7              5.80      74522
## 2 M           26.3              6.47     196594

I bet those are statistically significant differences! I’ll state my hypotheses as \[ H_0:\ \mu_M = \mu_W\\ H_A:\ \mu_M\neq \mu_W. \] I almost always just examine not equal. If I go by formula

serror <- sqrt(5.795252^2/74522+6.474972^2/196594)
degrees <- 74521
xbarMen <- 26.27756
xbarFe <- 23.73288

t <- (xbarMen - xbarFe)/serror
alpha = .05 
thalfalpha = qt(1-alpha/2, df=degrees) 
c(-thalfalpha, thalfalpha) 
## [1] -1.959996  1.959996
t
## [1] 98.75797

I am way outside the confidence interval! Let’s get a \(p\) value too!

2*(1-pt(t,degrees))
## [1] 0

Well that was expected! The \(p\) value is zero because there is no way those means could be equal!

I’ll repeat this test using the built in function. If your data is loaded into the program you should just use the built-ins. They will give you a more accurate result!

t.test(data$Age~data$Sex)
## 
##  Welch Two Sample t-test
## 
## data:  data$Age by data$Sex
## t = -97.815, df = 150727, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.595670 -2.493691
## sample estimates:
## mean in group F mean in group M 
##        23.73288        26.27756

We see that yes indeed the differences in ages are significant! Lets visualize this too!

ggplot(data = data,aes(x= Age, y =Sex))+
  geom_boxplot()
## Warning: Removed 9474 rows containing non-finite values (stat_boxplot).

Seriously who are all these ancient olympians?

outliers <- boxplot.stats(data$Age)$out #finds the outliers
oldies <- data[which(data$Age %in% c(outliers)),] #grabs all the data from the outliers and renames it

oldies %>%
  group_by(Sport) %>%
  summarize(AverageAge = mean(Age), MaxAge = max(Age), MinAge = min(Age))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 58 x 4
##    Sport            AverageAge MaxAge MinAge
##    <chr>                 <dbl>  <int>  <int>
##  1 Alpine Skiing          42.3     55     39
##  2 Alpinism               46.8     57     39
##  3 Archery                46.3     71     39
##  4 Art Competitions       52.1     97     39
##  5 Athletics              41.0     52     39
##  6 Badminton              42       44     40
##  7 Baseball               40.4     44     39
##  8 Basketball             39.2     40     39
##  9 Beach Volleyball       39.5     41     39
## 10 Biathlon               39.9     45     39
## # … with 48 more rows

Well the 97 year old participated in ‘Art Competitions’… This table is misleading in that I am only looking at the outliers! There is also a 10 year old gymnast!

I should have a histogram over the age too.

ggplot(data = data, aes(x= Age)) +
  geom_histogram(binwidth = 1) #I set the binwidth because of an error message
## Warning: Removed 9474 rows containing non-finite values (stat_bin).

Since I am interested in how Sex plays, I will include that in my histogram too.

ggplot(data = data, aes(x= Age, color = Sex)) +
  geom_histogram(binwidth = 1) #I set the binwidth because of an error message
## Warning: Removed 9474 rows containing non-finite values (stat_bin).

ggplot(data,aes(sample = Age)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 9474 rows containing non-finite values (stat_qq).
## Warning: Removed 9474 rows containing non-finite values (stat_qq_line).

The QQ Plot gives us an indication of whether the data is normal. Normal data will fall on the line, this is clearly not normally distributed!

Okay last exploration I promise!

ggplot(data, aes(x = Year, color = Sex)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Anybody know why there wasn’t olympics held in to two dips? The one really tall bar was when they held the summer and winter olympics in the same year.

Lastly I thought maybe I should show the normal distribution and how to get R to do it.

pnorm(2)#probability of z=2 everything to the left!
## [1] 0.9772499
pnorm(2,lower.tail = FALSE) #everything to the right
## [1] 0.02275013
qnorm(.975) #z value for when p =.975 this is the z critical for a two tailed 95% CI
## [1] 1.959964

I think you can do a test too. Proportions are normal. I’ll check out gender!

usathletes = data[which(data$Team=='United States'),]
frathletes = data[which(data$Team=='France'),]

frx = sum(frathletes$Sex=="M",na.rm = TRUE)
usx = sum(usathletes$Sex=="M",na.rm = TRUE)

frn = length(frathletes$Sex)
usn = length(usathletes$Sex)

prop.test(c(frx,usx),c(frn,usn))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(frx, usx) out of c(frn, usn)
## X-squared = 187.26, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.06180676 0.08212664
## sample estimates:
##    prop 1    prop 2 
## 0.7711879 0.6992212

This was harder than I thought. I know we haven’t covered this but I am comparing two proportions, french and us proportion of males. The french have a statistically higher proportion of males.