1 Introduction to the Olympics:

The modern Olympic Games or Olympics are leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world’s foremost sports competition with more than 200 nations participating. The Olympic Games are held every four years, with the Summer and Winter Games alternating by occurring every four years but two years apart.

The evolution of the Olympic Movement during the 20th and 21st centuries has resulted in several changes to the Olympic Games. Some of these adjustments include the creation of the Winter Olympic Games for snow and ice sports, the Paralympic Games for athletes with a disability, the Youth Olympic Games for athletes aged 14 to 18, the five Continental games (Pan American, African, Asian, European, and Pacific), and the World Games for sports that are not contested in the Olympic Games. The Deaflympics and Special Olympics are also endorsed by the IOC. The IOC has had to adapt to a variety of economic, political, and technological advancements. As a result, the Olympics has shifted away from pure amateurism, as envisioned by Coubertin, to allowing participation of professional athletes. The growing importance of mass media created the issue of corporate sponsorship and commercialisation of the Games. World wars led to the cancellation of the 1916, 1940, and 1944 Games. Large boycotts during the Cold War limited participation in the 1980 and 1984 Games. The latter, however, attracted 140 National Olympic Committees, which was a record at the time.

For more information refer Olympics’ official page

2 Introduction about the Data:

This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.

3 Content:

The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

  1. ID - Unique number for each athlete
  2. Name - Athlete’s name
  3. Sex - M or F
  4. Age - Integer
  5. Height - In centimeters
  6. Weight - In kilograms
  7. Team - Team name
  8. NOC - National Olympic Committee 3-letter code
  9. Games - Year and season
  10. Year - Integer
  11. Season - Summer or Winter
  12. City - Host city
  13. Sport - Sport
  14. Event - Event
  15. Medal - Gold, Silver, Bronze, or NA

4 Loading the dataset:

atheletes <- read.csv("athlete_events.csv", stringsAsFactors = F)
regions <- read.csv("noc_regions.csv", stringsAsFactors = F)

5 Loading the required libraries:

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.4
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rvest)
## Warning: package 'rvest' was built under R version 3.5.1
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.5.1
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.5.1
## 
## Attaching package: 'ggmap'
## The following object is masked from 'package:magrittr':
## 
##     inset
library(stringr)

6 Initial Exploration of the dataset:

glimpse(atheletes)
## Observations: 271,116
## Variables: 15
## $ ID     <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7...
## $ Name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar ...
## $ Sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", ...
## $ Age    <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31,...
## $ Height <int> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 18...
## $ Weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75,...
## $ Team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherl...
## $ NOC    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED",...
## $ Games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summ...
## $ Year   <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1...
## $ Season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Wint...
## $ City   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary",...
## $ Sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed ...
## $ Event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightw...
## $ Medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA,...
glimpse(regions)
## Observations: 230
## Variables: 3
## $ NOC    <chr> "AFG", "AHO", "ALB", "ALG", "AND", "ANG", "ANT", "ANZ",...
## $ region <chr> "Afghanistan", "Curacao", "Albania", "Algeria", "Andorr...
## $ notes  <chr> "", "Netherlands Antilles", "", "", "", "", "Antigua an...

6.1 Analysis by sex:

6.1.1 Rough Numbers:

The first part of our analysis will be helpful in understanding how the Olympics has evolved over the years. How have men and women been allowed to participate in the competitions overtime.

df <- atheletes %>%
  group_by(Season, Sex) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count*100 / sum(Count))) 
  
df %>%   
ggplot(aes(x=Season, y=Percentage, fill = Sex)) + geom_bar(stat='identity',position=position_dodge()) +
        ggtitle("Total number of transfers in a window(2007-2017)") +
        geom_label(label=df$Percentage, position = position_dodge(0.9)) +
        theme_minimal() 

During the winter and the summer the percentage and women have remained the same.

6.1.2 By trend:

6.1.2.1 Overall trend:

As the olymics progressed through the ages the number of people participating in it would have increased, which meant that the number of men and women participating in the olympics grew.

atheletes %>%
  group_by(Year, Season) %>%
  summarise(NumberOfParticipants = n()) %>%
  ggplot(aes(x = Year, y = NumberOfParticipants, group = Season)) +
  geom_line(aes(color = Season)) +
  geom_point(aes(color = Season)) +
  labs(x = "Year", y = "Number of Participants", title = "Number of participants in the olympics overtime")+
  theme_minimal()

The number of participants in the olympics have grown overtime. It is also obvious that the number of participants in the summer olympics are more than that of the winter olympics.

6.1.2.2 Trend of sex ratio:

groupMale <- atheletes %>%
              filter(Sex == "M") %>%
              group_by(Year, Season) %>%
              summarise(Number_Of_Men = n())

groupFemale <- atheletes %>%
              filter(Sex == "F") %>%
              group_by(Year, Season) %>%
              summarise(Number_Of_Women = n())

group <- groupMale %>%
          left_join(groupFemale) %>%
          mutate(Sex_Ratio = Number_Of_Men/Number_Of_Women)
## Joining, by = c("Year", "Season")
group$Sex_Ratio[is.na(group$Sex_Ratio)] <- 175

p1 <- group %>%
    ggplot(aes(x = Year, y= Sex_Ratio, group = Season)) +
    geom_line(aes(color = Season)) +
    geom_point(aes(color = Season)) +
    labs(x = "Year", y = "Sex Ratio", title = "Sex Ratio in Olympics through the years") +
    theme_minimal()

p2 <- group %>%
    filter(Year>1927) %>%
    ggplot(aes(x = Year, y= Sex_Ratio, group = Season)) +
    geom_line(aes(color = Season)) +
    geom_point(aes(color = Season)) +
    labs(x = "Year", y = "Sex Ratio", title = "Sex Ratio in Olympics through the years after 1927") +
    theme_minimal()

cowplot::plot_grid(p1,p2, ncol = 1, 
          align = 'h', axis = 'l')

When the olympics started no women participated in the Olympics. In 1900 women started participating in the olympics. As years passed the sex ratio i.e. the ratio of men to women became smaller. After 2000 the ratio started to move towards 1, which means that the olympics now are more diverse than they used to be, which great.

6.2 Analysis by Age

Age is something might have changed from the olymipics started. Hypothesis: The participants during the 1900 of the olympics had a median age greater than the median age of the participants during the 2000’s.

6.2.1 Age Density plots

atheletes$Age[is.na(atheletes$Age)] <- median(atheletes$Age, na.rm = T)
cat("The median age of the athletes in the modern olympics is", median(atheletes$Age))
## The median age of the athletes in the modern olympics is 24
cat("The median age of the male athletes in the modern olympics is", median(atheletes$Age[atheletes$Sex == "M"]))
## The median age of the male athletes in the modern olympics is 25
cat("The median age of the female athletes in the modern olympics is", median(atheletes$Age[atheletes$Sex == "F"]))
## The median age of the female athletes in the modern olympics is 23

The median age for female athletes was lesser than that of median age of male athletes.

# Filling the missing ages with median values.

p1 <- atheletes %>%
      ggplot(aes(x = Age)) +
      geom_density(color = "black", fill = "tomato") +
      labs(x = "Age", title = "Distribution of Age") +
      theme_minimal()

p2 <- atheletes %>%
      ggplot(aes(x=Age, fill=Sex)) +
      geom_density(alpha=0.4) +
      labs(x = "Age", title = "Distribution of Age by Sex") +
      theme_minimal()

cowplot::plot_grid(p1,p2, ncol = 1, 
          align = 'h', axis = 'l')

6.2.2 Age of athletes over the years.

atheletes %>%
  group_by(Year, Sex) %>%
  summarise(Median_Age = median(Age)) %>%
  ggplot(aes(x = Year, y = Median_Age, Group = Sex)) +
  geom_line(aes(color = Sex)) +
  geom_point(aes(color = Sex)) + 
  labs( x = "Year", y = "Medain age of Athletes", title = "Median age of Male and Female athletes over the years")+
  theme_minimal()

The median age of men and women participating in the olymipics has increased a bit after the 1980’s.

6.3 Analysis by Team:

Teams here refer to the countries and the different atletic clubs that have participated in the olympics over the years.

cat("The total number of teams that have paricipated in the olympics are", length(unique(atheletes$Team)))
## The total number of teams that have paricipated in the olympics are 1184
atheletes <- atheletes %>%
              left_join(regions, by = "NOC")

We have now joined athletes dataset with the regions dataset. I would like to analyze the dataset based on the National Olympics Committee rather than the teams.

  cat("The total number of National Olympics Committees that have paricipated in the olympics are", length(unique(atheletes$region)))
## The total number of National Olympics Committees that have paricipated in the olympics are 206

6.3.1 The 1976 and 1980 Olympic Boycott.

atheletes %>%
  group_by(Year, Season) %>%
  summarise(NoOfCountries = length(unique(region))) %>%
  ggplot(aes(x = Year, y = NoOfCountries, group = Season)) +
  geom_line(aes(color = Season)) +
  geom_point(aes(color = Season)) +
  labs(x = "Year", y = "Number of countries participated", title = "Number of countries that participated in the Olympics") +
  theme_minimal()

The number of countries that participated in the olymics have seen a steday increase over time. But, in 1976 and 1980 the number has seen a sharp decrease. Why is that?

6.3.1.1 The 1976 Boycott

The Montreal 1976 Olympics was boycotted by 29 countries due to the refusal of the IOC to ban New Zealand, after the New Zealand national rugby union team had toured South Africa earlier in 1976. The boycott was led by Congolese official Jean Claude Ganga. Some of the boycotting nations (including Morocco, Cameroon and Egypt) had already participated, however, and withdrew after the first few days. Senegal and Ivory Coast were the only African countries that competed throughout the duration of the Games. Elsewhere, both Iraq and Guyana also opted to join the Congolese-led boycott. South Africa had been banned from the Olympics since 1964 due to its apartheid policies. Other countries, such as El Salvador and Zaire, did not participate in Montreal because of economic reasons. The republic of China also boycotted the olympics.

Image Source: Wikipedia

6.3.1.2 The 1980 Boycott

The 1980 Summer Olympics boycott was one part of a number of actions initiated by the United States to protest the Soviet invasion of Afghanistan. Only 80 nations participated in the event. For more information on the 1980 Olympics Boycott

Image Source: Wikipedia

6.3.2 Medal Winners

6.3.2.1 By Numbers

atheletes %>%
  filter(Medal != "<NA>") %>%
  group_by(region) %>%
  summarise(Medal_Tally = length(Medal))%>%
  arrange(desc(Medal_Tally)) %>%
  ungroup() %>%
  mutate(region = reorder(region,Medal_Tally)) %>%
  top_n(10) %>%
  ggplot(aes(x = region,y = Medal_Tally)) +
    geom_bar(stat='identity',colour="white", fill = "tomato") +
    geom_text(aes(x = region, y = .1, label = paste0("(",round(Medal_Tally,2),")",sep="")),
              hjust=0, vjust=.5, size = 4, colour = 'black',
              fontface = 'bold') +
    theme(plot.title = element_text(size=10),
          axis.title = element_text(size=10),
          axis.text = element_text(size=10)) +
    labs(x = 'Country', 
         y = 'Number of Medals'
         ) +
    coord_flip() + 
    theme_bw()
## Selecting by Medal_Tally

USA has the most amount of medal winners in the history of the modern olympics followed by Russia, Germany, UK etc.

6.3.2.2 USA, Russia, Germany, UK and France through the years.

Gold <-  atheletes %>%
                filter(Medal == "Gold")%>%
                group_by(Year, Season, region) %>%
                summarise(Gold = n())

Silver <-  atheletes %>%
                filter(Medal == "Silver")%>%
                group_by(Year, Season, region) %>%
                summarise(Silver = n())

Bronze <-  atheletes %>%
                filter(Medal == "Bronze")%>%
                group_by(Year, Season, region) %>%
                summarise(Bronze = n())

Total <-  atheletes %>%
                filter(Medal != "<NA>")%>%
                group_by(Year, Season, region) %>%
                summarise(Total = n())

Total <- Total %>%
          left_join(Gold) 
## Joining, by = c("Year", "Season", "region")
Total <- Total %>%
          left_join(Silver) 
## Joining, by = c("Year", "Season", "region")
Total <- Total %>%
          left_join(Bronze)
## Joining, by = c("Year", "Season", "region")
Total$Gold[is.na(Total$Gold)] <- 0
Total$Silver[is.na(Total$Silver)] <- 0
Total$Bronze[is.na(Total$Bronze)] <- 0
Total$Total[is.na(Total$Total)] <- 0

Total <- Total %>%
          filter(region %in% c("USA","Russia","Germany","France")) %>%
          filter(Season == "Summer")

p1 <- Total %>%
  ggplot(aes(x = Year, y = Total, group = region)) +
  geom_line(aes(color = region)) +
  geom_point(aes(color = region)) +
  theme_minimal()

p2 <- Total %>%
  ggplot(aes(x = Year, y = Gold, group = region)) +
  geom_line(aes(color = region)) +
  geom_point(aes(color = region)) +
  theme_minimal()

cowplot::plot_grid(p1,p2, ncol = 1, 
          align = 'h', axis = 'l')

6.3.3 Gold, Silver and Bronze Medals:

Gold_Winners <- atheletes %>%
                     filter(Medal == "Gold") %>%
                     group_by(region) %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))

Silver_Winners <- atheletes %>%
                     filter(Medal == "Silver") %>%
                     group_by(region) %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))

Bronze_Winners <- atheletes %>%
                     filter(Medal == "Bronze") %>%
                     group_by(region) %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))

AllMedals <- atheletes %>%
                     filter(Medal != "<NA>") %>% 
                     group_by(region) %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))

All <- atheletes %>%
                      group_by(region) %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
                     filter(!region %in% AllMedals$region) %>%
                     mutate(Medal_Tally = "No Medal")

AllMedals$Medal_Tally <- "Medal Winners"

Medal_Tally <- rbind(AllMedals, All)

map.world <- map_data("world")
## Warning: package 'maps' was built under R version 3.5.1
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
#as.factor(medal_winners$region) %>% levels()

map.world_joined <- left_join(map.world, Medal_Tally, by ='region')

map.world_joined$Medal_Tally[is.na(map.world_joined$Medal_Tally)] <- "No Participation/No Data"

ggplot() +
  geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = Medal_Tally)) +
  labs(x = " ", y = " ", title = 'Medal winners in the world') +
  theme_minimal() +
  theme(legend.position="bottom")

The map shows the countries that have won atleast one medal(in red) and have not won medals(in green).

AllSilver <- atheletes %>%
                      group_by(region) %>%
                      filter(Medal == "Silver") %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
                     filter(!region %in% Gold_Winners$region) %>%
                     mutate(Medal_Tally = "Silver")

AllBronze <- atheletes %>%
                      group_by(region) %>%
                      filter(Medal == "Bronze") %>%
                     summarise(Medal_Tally = length(Medal)) %>%
                     arrange(desc(Medal_Tally)) %>% 
                     mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
                     filter(!(region %in% Gold_Winners$region & region %in% Silver_Winners$region) )%>%
                     mutate(Medal_Tally = "Bronze")

Gold_Winners$Medal_Tally <- "Gold"
Gold_Winners <- Gold_Winners[1:99,]

Medal_Tally <- rbind(Gold_Winners, AllSilver)
Medal_Tally <- rbind(Medal_Tally, AllBronze)

map.world <- map_data("world")

#as.factor(medal_winners$region) %>% levels()

map.world_joined <- left_join(map.world, Medal_Tally, by ='region')

map.world_joined$Medal_Tally[is.na(map.world_joined$Medal_Tally)] <- "No Medal"

ggplot() +
  geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = Medal_Tally)) +
  scale_fill_manual(values = c("#cd7f32", "#FFDF00",  "#ff6347", "#808080")) +
  labs(x = " ", y = " ", title = 'Medal winners in the world') +
  theme_minimal() +
  theme(legend.position="bottom")

The countries in yellow are countries that have won atleast one gold medal, the countries in grey are countries that have atleast one gold but have not won a gold and the countries in bronze are countries which have won atleast one bronze medal, but no gold or silver medals.

7 Deeper Analysis

7.1 Aggregation

Count <- atheletes %>%
  group_by(Year, Season, region) %>%
  summarise(NumberOfAthltes = n())

Gold_Winners <- atheletes %>%
                filter(Medal != "<NA>")%>%
                group_by(Year, Season, region) %>%
                summarise(NumberOfMedals = n())

Aggregated <- Count %>% left_join(Gold_Winners, by = c("Year", "Season", "region"))

groupMale <- atheletes %>%
              filter(Sex == "M") %>%
              group_by(Year, Season, region) %>%
              summarise(Number_Of_Men = n())

groupFemale <- atheletes %>%
              filter(Sex == "F") %>%
              group_by(Year, Season, region) %>%
              summarise(Number_Of_Women = n())

group <- groupMale %>%
          left_join(groupFemale) %>%
          mutate(Sex_Ratio = Number_Of_Men/Number_Of_Women)
## Joining, by = c("Year", "Season", "region")
group$Sex_Ratio[is.na(group$Sex_Ratio)] <- 236

Aggregated <- Aggregated %>%
              left_join(group, by = c("Year", "Season", "region"))

AgeAgg <- atheletes %>%
                group_by(Year, Season, region) %>%
                summarise(MedianAge = median(Age, na.rm = T))

HeightAgg <- atheletes %>%
                group_by(Year, Season, region) %>%
                summarise(MedianHeight = median(Height, na.rm = T))

WeightAgg <- atheletes %>%
                group_by(Year, Season, region) %>%
                summarise(MedianWeight = median(Weight, na.rm = T))

Aggregated <- Aggregated %>%
              left_join(AgeAgg, by = c("Year", "Season", "region"))
Aggregated <- Aggregated %>%
              left_join(HeightAgg, by = c("Year", "Season", "region"))
Aggregated <- Aggregated %>%
              left_join(WeightAgg, by = c("Year", "Season", "region"))

Aggregated$NumberOfMedals[is.na(Aggregated$NumberOfMedals)] <- 0
Aggregated$Sex_Ratio[is.na(Aggregated$Sex_Ratio)] <- 0

7.2 Influences on Medal won

7.2.1 Total Number of Athletes

Aggregated %>% 
        ggplot(aes(x=NumberOfAthltes, y=NumberOfMedals)) +
        geom_point(col="blue")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Number of Athletes", y = "Number of Medals")

It is obvious that as the number of athletes who participate increases so would the number of Medals. But what about the Sex, does that play a part in how nations win medals?

7.2.2 Sex and its Impacts

p1 <- Aggregated %>% 
        filter(!is.na(Number_Of_Women)) %>%
        ggplot(aes(x=Number_Of_Women, y=NumberOfMedals)) +
        geom_point(col="darkblue")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Number of Female Athletes", y = "Number of Medals")
        
p2 <- Aggregated %>% 
        filter(!is.na(Number_Of_Men)) %>%
        ggplot(aes(x=Number_Of_Men, y=NumberOfMedals)) +
        geom_point(col="red")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Number of Male Athletes", y = "Number of Medals")

cowplot::plot_grid(p1,p2, ncol = 1, 
          align = 'h', axis = 'l')

Number of men and number of women athletes are correlated with the number of medals won by a nation. Number of women athletes has a smaller correlation with the number of medals won than the number of male athletes. By what about the ratio of men and women, does that have an impact on the the number of medals won?

7.2.3 Sex Ratio

      Aggregated %>% 
        filter(!is.na(Sex_Ratio)) %>%
        ggplot(aes(x=Sex_Ratio, y=NumberOfMedals)) +
        geom_point(col="darkblue")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Sex Ratio", y = "Number of Medals")

So the equality in sex does not influence the amount of medals that a country wins. A nation’s contingent can be made completely of men and it will not matter. But, the catch here is if the contingent is made only of men then it is quite obvious the nation would lose out on opportunities to win medals in female athletics. It does not hurt to have a mix of both men and women in the contingent.

7.2.4 Median Age

Aggregated %>% 
        filter(!is.na(MedianAge)) %>%
        ggplot(aes(x=MedianAge, y=NumberOfMedals)) +
        geom_point(col="darkblue")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Median Age of the Contingent", y = "Number of Medals")

Age also does not affect the number of medals a contingent wins.

7.2.5 Median Height and Weight

p1 <- Aggregated %>% 
        filter(!is.na(MedianHeight)) %>%
        ggplot(aes(x=MedianHeight, y=NumberOfMedals)) +
        geom_point(col="steelblue")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Median Height of the Contingent", y = "Number of Medals")
        
p2 <- Aggregated %>% 
        filter(!is.na(MedianWeight)) %>%
        ggplot(aes(x=MedianWeight, y=NumberOfMedals)) +
        geom_point(col="tomato")  + geom_smooth(method = "lm", se=TRUE, color="black", aes(group=1)) +
        theme_minimal() +
        labs(x = "Median Weight of the Contingent", y = "Number of Medals")

cowplot::plot_grid(p1,p2, ncol = 1, 
          align = 'h', axis = 'l')

Again, height and weights do not play a part in the number of medals the Contingent wins.

7.2.6 Caution: To self and the Contingents

We have come to the conclusion that the more number of athletes in your contingent, the higher the number of medals your nation wins. This is mere correlation. For example a nation cannot send 100 athletes and expex to win 70-80 medal. That is not we should conclude from here. The larger the contingent more the odds are in your favor. But, winning it depends on the quality of the athletes and their performance. As we do not have the data on the quality of the athletes we will continue working with what we have as a proxy.

** Caution: Number of medals and number of athletes in a contingent are merely correlated.**

More Analysis and a Power BI Dashboard Coming Soon. Meanwhile check the correctly rendered version on rpubs.