I discovered this data set at https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results. In order to create my data file, I clicked the icon to upload to Google sheets, published it to the web as a csv and copied a link. Some data will be possible to upload to github but if the file is large github will not do it. You need the data to be in a raw form, with csv.
summary(data)
## ID Name Sex Age
## Min. : 1 Length:271116 Length:271116 Min. :10.00
## 1st Qu.: 34643 Class :character Class :character 1st Qu.:21.00
## Median : 68205 Mode :character Mode :character Median :24.00
## Mean : 68249 Mean :25.56
## 3rd Qu.:102097 3rd Qu.:28.00
## Max. :135571 Max. :97.00
## NA's :9474
## Height Weight Team NOC
## Min. :127.0 Min. : 25.0 Length:271116 Length:271116
## 1st Qu.:168.0 1st Qu.: 60.0 Class :character Class :character
## Median :175.0 Median : 70.0 Mode :character Mode :character
## Mean :175.3 Mean : 70.7
## 3rd Qu.:183.0 3rd Qu.: 79.0
## Max. :226.0 Max. :214.0
## NA's :60171 NA's :62875
## Games Year Season City
## Length:271116 Min. :1896 Length:271116 Length:271116
## Class :character 1st Qu.:1960 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1978
## 3rd Qu.:2002
## Max. :2016
##
## Sport Event Medal
## Length:271116 Length:271116 Length:271116
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
I wonder what sports have had the most athletes of all time.
ggplot(data, aes(Sport)) +
geom_bar()
table(data$Sport)
##
## Aeronautics Alpine Skiing Alpinism
## 1 8829 25
## Archery Art Competitions Athletics
## 2334 3578 38624
## Badminton Baseball Basketball
## 1457 894 4536
## Basque Pelota Beach Volleyball Biathlon
## 2 564 4893
## Bobsleigh Boxing Canoeing
## 3058 6047 6171
## Cricket Croquet Cross Country Skiing
## 24 19 9133
## Curling Cycling Diving
## 463 10859 2842
## Equestrianism Fencing Figure Skating
## 6344 10735 2298
## Football Freestyle Skiing Golf
## 6745 937 247
## Gymnastics Handball Hockey
## 26707 3665 5417
## Ice Hockey Jeu De Paume Judo
## 5516 11 3801
## Lacrosse Luge Military Ski Patrol
## 60 1479 24
## Modern Pentathlon Motorboating Nordic Combined
## 1677 17 1344
## Polo Racquets Rhythmic Gymnastics
## 95 12 658
## Roque Rowing Rugby
## 4 10595 162
## Rugby Sevens Sailing Shooting
## 299 6586 11448
## Short Track Speed Skating Skeleton Ski Jumping
## 1534 199 2401
## Snowboarding Softball Speed Skating
## 936 478 5613
## Swimming Synchronized Swimming Table Tennis
## 23195 909 1955
## Taekwondo Tennis Trampolining
## 606 2862 152
## Triathlon Tug-Of-War Volleyball
## 529 170 3404
## Water Polo Weightlifting Wrestling
## 3846 3937 7154
The table gave me a better representation. Clearly I could not tell in the bar chart what the games were. Let me try one more way and include relative frequency.
#options(digits=2)
freq <- rbind(table(data$Sport),table(data$Sport)/length(data$Sport))
row.names(freq)<-c('Frequency','Relative Frequency')
freq
## Aeronautics Alpine Skiing Alpinism Archery
## Frequency 1.000000e+00 8.82900e+03 2.500000e+01 2.334000e+03
## Relative Frequency 3.688458e-06 3.25654e-02 9.221145e-05 8.608861e-03
## Art Competitions Athletics Badminton Baseball
## Frequency 3.57800e+03 3.86240e+04 1.457000e+03 8.940000e+02
## Relative Frequency 1.31973e-02 1.42463e-01 5.374083e-03 3.297482e-03
## Basketball Basque Pelota Beach Volleyball Biathlon
## Frequency 4.536000e+03 2.000000e+00 5.64000e+02 4.893000e+03
## Relative Frequency 1.673085e-02 7.376916e-06 2.08029e-03 1.804763e-02
## Bobsleigh Boxing Canoeing Cricket
## Frequency 3.05800e+03 6.047000e+03 6.171000e+03 2.400000e+01
## Relative Frequency 1.12793e-02 2.230411e-02 2.276147e-02 8.852299e-05
## Croquet Cross Country Skiing Curling Cycling
## Frequency 1.90000e+01 9.133000e+03 4.630000e+02 1.085900e+04
## Relative Frequency 7.00807e-05 3.368669e-02 1.707756e-03 4.005297e-02
## Diving Equestrianism Fencing Figure Skating
## Frequency 2.84200e+03 6.344000e+03 1.07350e+04 2.298000e+03
## Relative Frequency 1.04826e-02 2.339958e-02 3.95956e-02 8.476077e-03
## Football Freestyle Skiing Golf Gymnastics
## Frequency 6.745000e+03 9.370000e+02 2.470000e+02 2.670700e+04
## Relative Frequency 2.487865e-02 3.456085e-03 9.110491e-04 9.850765e-02
## Handball Hockey Ice Hockey Jeu De Paume
## Frequency 3.66500e+03 5.417000e+03 5.516000e+03 1.100000e+01
## Relative Frequency 1.35182e-02 1.998038e-02 2.034553e-02 4.057304e-05
## Judo Lacrosse Luge Military Ski Patrol
## Frequency 3.801000e+03 6.000000e+01 1.479000e+03 2.400000e+01
## Relative Frequency 1.401983e-02 2.213075e-04 5.455229e-03 8.852299e-05
## Modern Pentathlon Motorboating Nordic Combined Polo
## Frequency 1.677000e+03 1.700000e+01 1.344000e+03 9.500000e+01
## Relative Frequency 6.185544e-03 6.270379e-05 4.957288e-03 3.504035e-04
## Racquets Rhythmic Gymnastics Roque Rowing
## Frequency 1.20000e+01 6.580000e+02 4.000000e+00 1.059500e+04
## Relative Frequency 4.42615e-05 2.427005e-03 1.475383e-05 3.907921e-02
## Rugby Rugby Sevens Sailing Shooting
## Frequency 1.620000e+02 2.990000e+02 6.586000e+03 1.144800e+04
## Relative Frequency 5.975302e-04 1.102849e-03 2.429218e-02 4.222547e-02
## Short Track Speed Skating Skeleton Ski Jumping
## Frequency 1.534000e+03 1.990000e+02 2.401000e+03
## Relative Frequency 5.658095e-03 7.340032e-04 8.855988e-03
## Snowboarding Softball Speed Skating Swimming
## Frequency 9.360000e+02 4.780000e+02 5.613000e+03 2.319500e+04
## Relative Frequency 3.452397e-03 1.763083e-03 2.070332e-02 8.555379e-02
## Synchronized Swimming Table Tennis Taekwondo Tennis
## Frequency 9.090000e+02 1.955000e+03 6.060000e+02 2.862000e+03
## Relative Frequency 3.352808e-03 7.210936e-03 2.235206e-03 1.055637e-02
## Trampolining Triathlon Tug-Of-War Volleyball
## Frequency 1.520000e+02 5.290000e+02 1.700000e+02 3.404000e+03
## Relative Frequency 5.606456e-04 1.951194e-03 6.270379e-04 1.255551e-02
## Water Polo Weightlifting Wrestling
## Frequency 3.846000e+03 3.937000e+03 7.154000e+03
## Relative Frequency 1.418581e-02 1.452146e-02 2.638723e-02
meh, that is still not great. One more time?
percent <- function(x, digits = 2, format = "f", ...) { # Create user-defined function
paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
}
data %>%
group_by(Sport) %>%
summarize(Frequency = length(Sport),RelativeFrequency = percent(length(Sport)/271116))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 66 x 3
## Sport Frequency RelativeFrequency
## <chr> <int> <chr>
## 1 Aeronautics 1 0.00%
## 2 Alpine Skiing 8829 3.26%
## 3 Alpinism 25 0.01%
## 4 Archery 2334 0.86%
## 5 Art Competitions 3578 1.32%
## 6 Athletics 38624 14.25%
## 7 Badminton 1457 0.54%
## 8 Baseball 894 0.33%
## 9 Basketball 4536 1.67%
## 10 Basque Pelota 2 0.00%
## # … with 56 more rows
Such a nicer, usable table! I needed the tidyverse library to do this but I think it was worth it. You should also notice that I created a function for doing the percent too! Take a minute and try to understand what that function is doing because somethings will need to be modified to make it work for your data!
Let’s look at a couple of the sports. Athletics, cycling and rowing are of interest to me.
sports <- c('Athletics','Cycling', 'Rowing')
ggplot(subset(data, Sport %in% sports), aes(Sport,fill = Sex)) +
geom_bar()
Wow so many more men than women! Let’s see if the number of medals is as far apart.
table(data$Medal,data$Sex)
##
## F M
## Bronze 3771 9524
## Gold 3747 9625
## Silver 3735 9381
Clearly there have been way more men’s medals awarded then women’s medals. Let’s see if we can figure out what year women start competing at the same level as the men.
table(data$Sex,data$Year)
##
## 1896 1900 1904 1906 1908 1912 1920 1924 1928 1932 1936 1948
## F 0 33 16 11 47 87 134 261 437 369 549 761
## M 380 1903 1285 1722 3054 3953 4158 5432 5137 2952 6852 6719
##
## 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 1994
## F 1682 1139 1730 1752 2193 2608 2606 2186 2983 4223 5178 1105
## M 7676 5295 7505 7728 8286 9351 7896 6751 8605 10453 11235 2055
##
## 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016
## F 5008 1384 5431 1582 5546 1757 5816 1847 5815 2023 6223
## M 8772 2221 8390 2527 7897 2625 7786 2555 7105 2868 7465
Interestingly the female contingent is still smaller than the male. I wonder if there are less medals awarded too.
year16 <- data[which(data$Year == 2016),]
table(year16$Sex,year16$Medal)
##
## Bronze Gold Silver
## F 331 318 320
## M 372 347 335
Yes, less medals all the way around in the most recent year.
I guess I should do nationality counts and look at the number of medals won too!
data %>%
group_by(Medal,Team,Sex) %>%
summarize(Medals = length(Medal))
## `summarise()` regrouping output by 'Medal', 'Team' (override with `.groups` argument)
## # A tibble: 2,353 x 4
## # Groups: Medal, Team [1,715]
## Medal Team Sex Medals
## <chr> <chr> <chr> <int>
## 1 Bronze A North American Team M 4
## 2 Bronze Afghanistan M 2
## 3 Bronze Algeria F 1
## 4 Bronze Algeria M 7
## 5 Bronze Ali-Baba II M 5
## 6 Bronze Amstel Amsterdam M 4
## 7 Bronze Antwerpia V M 5
## 8 Bronze Aphrodite M 3
## 9 Bronze Argentina F 37
## 10 Bronze Argentina M 54
## # … with 2,343 more rows
Way too much data there to synthesize!
Let’s look at age and Team.
data %>%
group_by(Team) %>%
summarize(AveargeAge = mean(Age, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,184 x 2
## Team AveargeAge
## <chr> <dbl>
## 1 30. Februar 33.5
## 2 A North American Team 41.3
## 3 Acipactli 47.3
## 4 Acturus 27
## 5 Afghanistan 23.5
## 6 Akatonbo 40.3
## 7 Alain IV 48
## 8 Albania 25.3
## 9 Alcaid 30
## 10 Alcyon-6 NaN
## # … with 1,174 more rows
Okay well again there is just too many teams! I wonder if there is a gender difference in age?
data %>%
group_by(Sex) %>%
summarize(AverageAge = mean(Age, na.rm = TRUE), StandardDeviation = sd(Age, na.rm = TRUE), SampleSize = length(Age))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
## Sex AverageAge StandardDeviation SampleSize
## <chr> <dbl> <dbl> <int>
## 1 F 23.7 5.80 74522
## 2 M 26.3 6.47 196594
I bet those are statistically significant differences! I’ll state my hypotheses as \[ H_0:\ \mu_M = \mu_W\\ H_A:\ \mu_M\neq \mu_W. \] I almost always just examine not equal. If I go by formula
serror <- sqrt(5.795252^2/74522+6.474972^2/196594)
degrees <- 74521
xbarMen <- 26.27756
xbarFe <- 23.73288
t <- (xbarMen - xbarFe)/serror
alpha = .05
thalfalpha = qt(1-alpha/2, df=degrees)
c(-thalfalpha, thalfalpha)
## [1] -1.959996 1.959996
t
## [1] 98.75797
I am way outside the confidence interval! Let’s get a \(p\) value too!
2*(1-pt(t,degrees))
## [1] 0
Well that was expected! The \(p\) value is zero because there is no way those means could be equal!
I’ll repeat this test using the built in function. If your data is loaded into the program you should just use the built-ins. They will give you a more accurate result!
t.test(data$Age~data$Sex)
##
## Welch Two Sample t-test
##
## data: data$Age by data$Sex
## t = -97.815, df = 150727, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.595670 -2.493691
## sample estimates:
## mean in group F mean in group M
## 23.73288 26.27756
We see that yes indeed the differences in ages are significant! Lets visualize this too!
ggplot(data = data,aes(x= Age, y =Sex))+
geom_boxplot()
## Warning: Removed 9474 rows containing non-finite values (stat_boxplot).
Seriously who are all these ancient olympians?
outliers <- boxplot.stats(data$Age)$out #finds the outliers
oldies <- data[which(data$Age %in% c(outliers)),] #grabs all the data from the outliers and renames it
oldies %>%
group_by(Sport) %>%
summarize(AverageAge = mean(Age), MaxAge = max(Age), MinAge = min(Age))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 58 x 4
## Sport AverageAge MaxAge MinAge
## <chr> <dbl> <int> <int>
## 1 Alpine Skiing 42.3 55 39
## 2 Alpinism 46.8 57 39
## 3 Archery 46.3 71 39
## 4 Art Competitions 52.1 97 39
## 5 Athletics 41.0 52 39
## 6 Badminton 42 44 40
## 7 Baseball 40.4 44 39
## 8 Basketball 39.2 40 39
## 9 Beach Volleyball 39.5 41 39
## 10 Biathlon 39.9 45 39
## # … with 48 more rows
Well the 97 year old participated in ‘Art Competitions’… This table is misleading in that I am only looking at the outliers! There is also a 10 year old gymnast!
I should have a histogram over the age too.
ggplot(data = data, aes(x= Age)) +
geom_histogram(binwidth = 1) #I set the binwidth because of an error message
## Warning: Removed 9474 rows containing non-finite values (stat_bin).
Since I am interested in how Sex plays, I will include that in my histogram too.
ggplot(data = data, aes(x= Age, color = Sex)) +
geom_histogram(binwidth = 1) #I set the binwidth because of an error message
## Warning: Removed 9474 rows containing non-finite values (stat_bin).
ggplot(data,aes(sample = Age)) +
stat_qq() +
stat_qq_line()
## Warning: Removed 9474 rows containing non-finite values (stat_qq).
## Warning: Removed 9474 rows containing non-finite values (stat_qq_line).
The QQ Plot gives us an indication of whether the data is normal. Normal data will fall on the line, this is clearly not normally distributed!
Okay last exploration I promise!
ggplot(data, aes(x = Year, color = Sex)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Anybody know why there wasn’t olympics held in to two dips? The one really tall bar was when they held the summer and winter olympics in the same year.
Lastly I thought maybe I should show the normal distribution and how to get R to do it.
pnorm(2)#probability of z=2 everything to the left!
## [1] 0.9772499
pnorm(2,lower.tail = FALSE) #everything to the right
## [1] 0.02275013
qnorm(.975) #z value for when p =.975 this is the z critical for a two tailed 95% CI
## [1] 1.959964
I think you can do a test too. Proportions are normal. I’ll check out gender!
usathletes = data[which(data$Team=='United States'),]
frathletes = data[which(data$Team=='France'),]
frx = sum(frathletes$Sex=="M",na.rm = TRUE)
usx = sum(usathletes$Sex=="M",na.rm = TRUE)
frn = length(frathletes$Sex)
usn = length(usathletes$Sex)
prop.test(c(frx,usx),c(frn,usn))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(frx, usx) out of c(frn, usn)
## X-squared = 187.26, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.06180676 0.08212664
## sample estimates:
## prop 1 prop 2
## 0.7711879 0.6992212
This was harder than I thought. I know we haven’t covered this but I am comparing two proportions, french and us proportion of males. The french have a statistically higher proportion of males.