2016 Rio Olympics

I will be using the 2016 Olympic Games in Rio de Janeiro dataset I found on Kaggle (https://www.kaggle.com/rio2016/olympic-games#athletes.csv). The outcome is the number of gold medals an athlete won (treated here as a continuous dependent variable), and the predictor is gender. The two levels are the individual Olympic athletes (level 1) and the nationalities they represent (level 2), since different nations may offer their athletes better training and preparation.

Loading in packages

library(nlme)
library(sjmisc)
library(dplyr)
library(magrittr)
library(tidyr)
library(lmerTest)
library(ggplot2)

Loading in data

There are 11,538 athletes in this dataset.

library(readr)
athletes1 <- read_csv("C:/Users/abbys/Downloads/athletes.csv")
head(athletes1)
dim(athletes1)
[1] 11538    11

Recoding gender

Male=0, Female=1.

athletes2 <- athletes1 %>%
  mutate(NewSex = ifelse(sex == "male", 0,
                         ifelse(sex == "female", 1, NA)))
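
A quick way to confirm a recode like this is to cross-tabulate the old and new columns. The sketch below uses a small made-up data frame (toy_athletes is hypothetical, not the Kaggle file) and base R only. It assumes the sex column holds exactly the lowercase strings "male" and "female", as the recode above does.

```r
# Toy data standing in for athletes1 (hypothetical values, not from the dataset)
toy_athletes <- data.frame(
  sex = c("male", "female", "female", "male", NA),
  stringsAsFactors = FALSE
)

# Same recode logic as above: male = 0, female = 1, anything else = NA
toy_athletes$NewSex <- ifelse(toy_athletes$sex == "male", 0,
                              ifelse(toy_athletes$sex == "female", 1, NA))

# Cross-tabulate original and recoded values, keeping NAs visible
recode_check <- table(toy_athletes$sex, toy_athletes$NewSex, useNA = "ifany")
print(recode_check)
```

Any value that is neither "male" nor "female" (including a differently capitalised spelling) would show up in the NA column of this table.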

Number of Nationalities and Grouping by Nationality

I have 207 different nationalities in my dataset.

length(unique(athletes2$nationality))
[1] 207
athletes2 %>%
  group_by(nationality) %>%
  summarise(n_athletes = n())

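mean_g is the mean number of gold medals won per athlete within each nationality. mean_s is the mean of NewSex (male=0, female=1) within each nationality, i.e. the proportion of that nationality's athletes who are female.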

nationality_d <- athletes2 %>% 
  group_by(nationality) %>% 
  summarise(mean_g = mean(gold, na.rm = TRUE), mean_s = mean(NewSex, na.rm = TRUE))
head(nationality_d)

Ecological Regression

National level analysis. The coefficient on mean_s is positive (.0067) but not statistically significant (p = .71): moving from a delegation that is entirely male (mean_s = 0) to one that is entirely female (mean_s = 1) is associated with an increase of about .0067 in the mean number of gold medals per athlete. The intercept shows that an all-male delegation is predicted to win an average of .016 gold medals per athlete. An ecological fallacy would be to say that this national-level relationship is also true at the individual level.

ecoreg <- lm(mean_g ~ mean_s, data = nationality_d)
summary(ecoreg)

Call:
lm(formula = mean_g ~ mean_s, data = nationality_d)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.022384 -0.019023 -0.017903 -0.005132  0.225895 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.01566    0.00818   1.915   0.0569 .
mean_s       0.00672    0.01825   0.368   0.7131  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04213 on 205 degrees of freedom
Multiple R-squared:  0.0006608, Adjusted R-squared:  -0.004214 
F-statistic: 0.1356 on 1 and 205 DF,  p-value: 0.7131
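
To see why a group-level slope cannot be read as an individual-level one, here is a minimal base-R sketch on made-up data (the toy data frame and its groups "A", "B", "C" are hypothetical and unrelated to the Olympic data). It is built so that the regression on group means has the opposite sign from the regressions within each group.

```r
# Toy data (hypothetical): three groups whose means rise together,
# while within each group y falls as x rises
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 3),
  x     = c(0, 1, 2,  2, 3, 4,  4, 5, 6)
)
group_mean_x <- ave(toy$x, toy$group)          # group mean of x, repeated per row
toy$y <- 3 * group_mean_x - (toy$x - group_mean_x)

# Group-level (ecological) regression on the group means
group_means <- aggregate(cbind(x, y) ~ group, data = toy, FUN = mean)
between_slope <- coef(lm(y ~ x, data = group_means))[["x"]]

# Individual-level regressions fitted separately within each group
within_slopes <- sapply(split(toy, toy$group),
                        function(d) coef(lm(y ~ x, data = d))[["x"]])

between_slope   # positive: groups with higher mean x have higher mean y
within_slopes   # negative in every group: the opposite sign
```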

Complete Pooling

In the complete pooling model there is no nationality difference: all Olympic athletes are treated as one group regardless of nationality. The NewSex coefficient is the effect of being female (a one-unit change from male=0 to female=1) on the estimated number of gold medals won. Statistical significance aside (p = .20), females won about .006 more gold medals than males on average. The intercept shows that males won an average of .055 gold medals.

cpooling <- lm(gold ~ NewSex, data = athletes2)
summary(cpooling)

Call:
lm(formula = gold ~ NewSex, data = athletes2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.0611 -0.0611 -0.0550 -0.0550  4.9450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.054950   0.003216  17.088   <2e-16 ***
NewSex      0.006145   0.004788   1.283    0.199    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2559 on 11536 degrees of freedom
Multiple R-squared:  0.0001428, Adjusted R-squared:  5.61e-05 
F-statistic: 1.647 on 1 and 11536 DF,  p-value: 0.1994
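
With a single 0/1 predictor, the intercept of lm() is the mean of the group coded 0 and the slope is the difference between the two group means. The base-R sketch below checks this on a made-up data frame (toy is hypothetical, not the athletes data).

```r
# Toy data (hypothetical): four males (NewSex = 0) and four females (NewSex = 1)
toy <- data.frame(
  NewSex = c(0, 0, 0, 0, 1, 1, 1, 1),
  gold   = c(0, 0, 1, 0, 0, 1, 1, 0)
)

fit <- lm(gold ~ NewSex, data = toy)

male_mean   <- mean(toy$gold[toy$NewSex == 0])  # 0.25
female_mean <- mean(toy$gold[toy$NewSex == 1])  # 0.50

coef(fit)[["(Intercept)"]]  # matches male_mean
coef(fit)[["NewSex"]]       # matches female_mean - male_mean
```

Applied to the output above, the female average is .054950 + .006145, or about .0611, which agrees with the residual of -0.0611 shown for female athletes who won no gold medal.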

No-pooling model - The Intercept:

The histogram shows, for each nationality, the average number of gold medals won by males (the intercept, where NewSex=0). In at least one nationality, males won more than .3 gold medals on average. The mode is 0, which means that for most nationalities the male average is 0, i.e. their male athletes did not win any gold medals in the 2016 Rio Olympics.

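# Fit a separate regression within each nationality (no pooling)
dcoef <- athletes2 %>%
  group_by(nationality) %>%
  do(mod = lm(gold ~ NewSex, data = .))
# Extract each nationality's intercept (named int_df to avoid masking stats::coef)
int_df <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(int_df, aes(x = intc)) + geom_histogram()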

No-pooling model - The slope:

The slope is the difference in the average number of gold medals between females and males within each nationality. The mode of the difference is 0, but the variation shows that in many nationalities being female is associated with fewer gold medals (as low as -.35), while in others it is associated with more (up to about .2).

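# Fit a separate regression within each nationality (no pooling)
dcoef <- athletes2 %>%
  group_by(nationality) %>%
  do(mod = lm(gold ~ NewSex, data = .))
# Extract each nationality's NewSex slope (named slope_df to avoid masking stats::coef)
slope_df <- dcoef %>% do(data.frame(NSc = coef(.$mod)[2]))
ggplot(slope_df, aes(x = NSc)) + geom_histogram()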
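
One thing to keep in mind with per-nationality regressions: if all of a nationality's athletes in the data are of one sex, NewSex does not vary within that group and lm() returns NA for its slope, so that nationality would drop out of the slope histogram. A minimal base-R sketch with made-up data (the toy data frame and nationalities "A" and "B" are hypothetical):

```r
# Toy data (hypothetical): nationality A has both sexes, B has males only
toy <- data.frame(
  nationality = c("A", "A", "A", "A", "B", "B", "B"),
  NewSex      = c(0, 0, 1, 1, 0, 0, 0),
  gold        = c(0, 1, 1, 1, 0, 0, 1)
)

# Fit gold ~ NewSex separately within each nationality and stack the coefficients
fits    <- lapply(split(toy, toy$nationality),
                  function(d) coef(lm(gold ~ NewSex, data = d)))
no_pool <- do.call(rbind, fits)

no_pool                           # one row per nationality
sum(is.na(no_pool[, "NewSex"]))   # number of nationalities with no estimable slope
```

Counting the NA slopes in this way shows how many nationalities the slope histogram actually represents.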

Random intercept:

This model does not allow the gender difference in the number of gold medals won to differ between nationalities. Females won .00086 more gold medals than males, on average. The intercept shows that males won .032 gold medals on average. The standard deviation of the intercept (the male average) across nationalities is about .044.

m1_lme <- lme(gold ~ NewSex, data = athletes2, random = ~1|nationality, method = "ML")
summary(m1_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: athletes2 

Random effects:
 Formula: ~1 | nationality
        (Intercept)  Residual
StdDev:  0.04414681 0.2484611

Fixed effects: gold ~ NewSex 
 Correlation: 
       (Intr)
NewSex -0.379

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3 
-0.94293909 -0.26869324 -0.11883108 -0.05871213 
        Max 
19.18438999 

Number of Observations: 11538
Number of Groups: 207 

Random Slope:

This model does allow the gender difference in the number of gold medals won to differ between nationalities. For the random slope model, females won .005 fewer gold medals than males on average across nationalities, and this gender difference varies between nationalities with a standard deviation of .053. Males won .034 gold medals on average, with a between-nationality standard deviation of .050.

m2_lme <- lme(gold ~ NewSex, data = athletes2, random = ~ NewSex|nationality, method = "ML")
summary(m2_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: athletes2 

Random effects:
 Formula: ~NewSex | nationality
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr  
(Intercept) 0.04977801 (Intr)
NewSex      0.05275568 -0.496
Residual    0.24679620       

Fixed effects: gold ~ NewSex 
 Correlation: 
       (Intr)
NewSex -0.56 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3 
-1.07495066 -0.26008904 -0.10369982 -0.05785941 
        Max 
19.45892497 

Number of Observations: 11538
Number of Groups: 207 

Model selection

Based on the lowest AIC, the best model is the random slope model (m2_lme).

AIC(cpooling, m1_lme, m2_lme)
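
For reference, AIC is -2 times the log-likelihood plus 2 times the number of estimated parameters, so a lower value means a better trade-off between fit and complexity. The sketch below verifies that relationship in base R using the built-in cars dataset (an illustration only, not the Olympic models). The lm log-likelihood is a maximum-likelihood one, which is presumably why the mixed models above were fitted with method = "ML" before being compared with cpooling.

```r
# Two nested linear models on the built-in cars data (illustration only)
fit0 <- lm(dist ~ 1, data = cars)       # intercept only
fit1 <- lm(dist ~ speed, data = cars)   # adds one predictor

ll <- logLik(fit1)
k  <- attr(ll, "df")                    # parameters: intercept, slope, residual variance

# AIC computed by hand from the log-likelihood
manual_aic <- -2 * as.numeric(ll) + 2 * k

manual_aic          # same value as AIC(fit1)
AIC(fit0, fit1)     # the model with the lower AIC is preferred
```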

Intra-class Correlation

Is the number of golds won by an athlete an individual-level or a national-level achievement?

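The intraclass correlation is the share of the total variance that lies between nationalities. Because lme reports standard deviations, they must be squared first: 0.04415819^2 / (0.04415819^2 + 0.248461^2) = 0.0306. This means that ~3.06% of the total variation in an athlete’s number of gold medals won can be attributed to national-level influences, and the remaining ~96.94% can be attributed to the individual athlete level.

intraclass <- 0.04415819^2 / (0.04415819^2 + 0.248461^2)
intraclass
[1] 0.03061965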
Micc_lme <- lme(gold ~ 1, random = ~1|nationality, data = athletes2, method = "ML")
summary(Micc_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: athletes2 

Random effects:
 Formula: ~1 | nationality
        (Intercept) Residual
StdDev:  0.04415819 0.248461

Fixed effects: gold ~ 1 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3 
-0.94133080 -0.26964621 -0.11783737 -0.05907515 
        Max 
19.18254840 

Number of Observations: 11538
Number of Groups: 207 
intervals(Micc_lme)
Approximate 95% confidence intervals

 Fixed effects:
                 lower       est.      upper
(Intercept) 0.02252695 0.03211522 0.04170349
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: nationality 
                     lower       est.      upper
sd((Intercept)) 0.03783781 0.04415819 0.05153431

 Within-group standard error:
    lower      est.     upper 
0.2452645 0.2484610 0.2516992 
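
The intraclass correlation is defined on variances, while summary() of an lme fit prints standard deviations, so the printed values need squaring before they are combined. A minimal sketch, with icc_from_sd as a hypothetical helper name and the two inputs copied from the summary(Micc_lme) output above:

```r
# Hypothetical helper: ICC from the two standard deviations reported by lme
icc_from_sd <- function(sd_between, sd_within) {
  var_between <- sd_between^2   # between-nationality variance
  var_within  <- sd_within^2    # residual (within-nationality) variance
  var_between / (var_between + var_within)
}

# Standard deviations copied from the summary(Micc_lme) output
icc <- icc_from_sd(sd_between = 0.04415819, sd_within = 0.248461)
round(icc, 4)  # 0.0306
```

Using the standard deviations directly, without squaring, gives a much larger ratio (about 0.151). On the fitted model itself, nlme's VarCorr() lists the variances next to the standard deviations, which avoids copying numbers by hand.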

Conclusion

In summary, none of the models shows a large gender gap, and the gender coefficients in the ecological regression and complete pooling models are not statistically significant. The ecological regression (national level), complete pooling (individual level), and random intercept models all estimate that females won slightly more gold medals than males in the 2016 Rio Olympics, while the random slope model (the best fit by AIC) estimates that females won slightly fewer. The intraclass correlation, computed from variances (squared standard deviations), shows that ~3.06% of the total variation in an athlete’s number of gold medals won can be attributed to national-level influences, and the rest is attributed to the individual level.

