I want to finish the olympic project with some analysis using the non-parametric methods. I am going to try to repeat many of the analysis I have already made (except for Matched Pairs).
I will compare the median (NOT mean!) ages between the genders. My hypotheses are \[ H_0: m_M = m_W\\ H_a: m_M \neq m_W \]
Visualizing the data first:
ggplot(data = data,aes(x= Age, y =Sex))+
geom_boxplot()
## Warning: Removed 9474 rows containing non-finite values (stat_boxplot).
wilcox.test(Age~Sex, data = data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Age by Sex
## W = 5295841686, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
So the median ages are also different! I reject the null hypothesis, there is evidence to suggest that the median age of male and female olympians is different.
I doubt many of you will be able to do this test with the data you have but figured I should give it a go! I am going to examine the number of medals awarded each year and ask are men being awarded more medals than women. First I’ll need to create this data.
data %>%
group_by(Year,Sex)%>%
filter(!is.na(Medal))%>%
count(Medal %in% c("Bronze","Silver","Gold"), name = "Medals")%>%
spread(Sex,Medals)%>%
mutate(diff = M-F)
## # A tibble: 35 x 5
## # Groups: Year [35]
## Year `Medal %in% c("Bronze", "Silver", "Gold")` F M diff
## <int> <lgl> <int> <int> <int>
## 1 1896 TRUE NA 143 NA
## 2 1900 TRUE 13 591 578
## 3 1904 TRUE 10 476 466
## 4 1906 TRUE 6 452 446
## 5 1908 TRUE 16 815 799
## 6 1912 TRUE 30 911 881
## 7 1920 TRUE 44 1264 1220
## 8 1924 TRUE 52 910 858
## 9 1928 TRUE 103 720 617
## 10 1932 TRUE 68 671 603
## # … with 25 more rows
This took me an hour to create, so a little tricky! If you look at my diff column you’ll notice there is NEVER a year that women earn more medals than the men. We will clearly see that the number of medals earned in a year is different by Sex but let’s do the test anyway. I should state my hypothesis! The median difference in the number of medals earned each year between the sexes is 0. So let \(m\) be the median difference, then \[ H_0: m=0\\ H_a: m \neq 0 \]
data %>%
group_by(Year,Sex)%>%
filter(!is.na(Medal))%>%
count(Medal %in% c("Bronze","Silver","Gold"), name = "Medals")%>%
spread(Sex,Medals)%>%
mutate(diff = M-F)%>%
{wilcox.test(.$diff)}
## Warning in wilcox.test.default(.$diff): cannot compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: .$diff
## V = 595, p-value = 3.814e-07
## alternative hypothesis: true location is not equal to 0
So the new line there passes the diff column into the wilcox test. This was a pain but I wanted to practice with the wrangling! We reject the null hypothesis, we have evidence to suggest that the median difference in number of medals awarded is not zero.
To do the Kruskal-Wallis Test, we’ll want to do something more like ANOVA where we have multiple levels. We’ll be asking are the medians over these multiple treatments the same. Let’s ask the same question as before, does the sport you play effect your height?
a <- kruskal.test(Height ~ Sport , data = data)
a
##
## Kruskal-Wallis rank sum test
##
## data: Height by Sport
## Kruskal-Wallis chi-squared = 56101, df = 58, p-value < 2.2e-16
Our results are very similar as when we ran the ANOVA, Sport does have effect on the median of the height. We reject the null hypothesis that all medians are the same and have evidence to suggest that at least two are different. Let’s visualize!
ggplot(data = data, aes(Height,Sport)) +
geom_boxplot()
## Warning: Removed 60171 rows containing non-finite values (stat_boxplot).
Not exactly the easiest visualization to read but you can see that basketball players are tall and gymnasts are short.
For the last of the non-parametric, we’ll look at correlation! Let’s do height and weight again!
ggplot(data = data, aes(Weight, Height, color = Sex))+
geom_point()
## Warning: Removed 64263 rows containing missing values (geom_point).
I through Sex in because I could in the visualization…
The hypothesis will be about \(\rho_S\) the Spearman Rank correlation. \[ H_0: \rho_S = 0\\ H_a: \rho_S \neq 0 \]
cor.test(data$Height, data$Weight, method = "spearman")
## Warning in cor.test.default(data$Height, data$Weight, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$Height and data$Weight
## S = 2.5501e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.827131
We can reject our null hypothesis that the correlation is zero, thus we have evidence to suggest that the correlation is not zero. Remember here to that we will pick up non-linear correlation so it may be that the data fits a nice curve.
Okay that is it! My report is done! Nice work this semester!