Introduction to the Olympics:
The modern Olympic Games or Olympics are leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world’s foremost sports competition with more than 200 nations participating. The Olympic Games are held every four years, with the Summer and Winter Games alternating by occurring every four years but two years apart.
The evolution of the Olympic Movement during the 20th and 21st centuries has resulted in several changes to the Olympic Games. Some of these adjustments include the creation of the Winter Olympic Games for snow and ice sports, the Paralympic Games for athletes with a disability, the Youth Olympic Games for athletes aged 14 to 18, the five Continental games (Pan American, African, Asian, European, and Pacific), and the World Games for sports that are not contested in the Olympic Games. The Deaflympics and Special Olympics are also endorsed by the IOC. The IOC has had to adapt to a variety of economic, political, and technological advancements. As a result, the Olympics has shifted away from pure amateurism, as envisioned by Coubertin, to allowing participation of professional athletes. The growing importance of mass media created the issue of corporate sponsorship and commercialisation of the Games. World wars led to the cancellation of the 1916, 1940, and 1944 Games. Large boycotts during the Cold War limited participation in the 1980 and 1984 Games. The latter, however, attracted 140 National Olympic Committees, which was a record at the time.
For more information refer Olympics’ official page
Content:
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:
- ID - Unique number for each athlete
- Name - Athlete’s name
- Sex - M or F
- Age - Integer
- Height - In centimeters
- Weight - In kilograms
- Team - Team name
- NOC - National Olympic Committee 3-letter code
- Games - Year and season
- Year - Integer
- Season - Summer or Winter
- City - Host city
- Sport - Sport
- Event - Event
- Medal - Gold, Silver, Bronze, or NA
Loading the required libraries:
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.4
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Warning: package 'rvest' was built under R version 3.5.1
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.5.1
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
## Warning: package 'ggmap' was built under R version 3.5.1
##
## Attaching package: 'ggmap'
## The following object is masked from 'package:magrittr':
##
## inset
Initial Exploration of the dataset:
## Observations: 271,116
## Variables: 15
## $ ID <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7...
## $ Name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar ...
## $ Sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", ...
## $ Age <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31,...
## $ Height <int> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 18...
## $ Weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75,...
## $ Team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherl...
## $ NOC <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED",...
## $ Games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summ...
## $ Year <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1...
## $ Season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Wint...
## $ City <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary",...
## $ Sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed ...
## $ Event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightw...
## $ Medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## Observations: 230
## Variables: 3
## $ NOC <chr> "AFG", "AHO", "ALB", "ALG", "AND", "ANG", "ANT", "ANZ",...
## $ region <chr> "Afghanistan", "Curacao", "Albania", "Algeria", "Andorr...
## $ notes <chr> "", "Netherlands Antilles", "", "", "", "", "Antigua an...
Analysis by sex:
Rough Numbers:
The first part of our analysis will be helpful in understanding how the Olympics has evolved over the years. How have men and women been allowed to participate in the competitions overtime.
df <- atheletes %>%
group_by(Season, Sex) %>%
summarise(Count = n()) %>%
mutate(Percentage = round(Count*100 / sum(Count)))
df %>%
ggplot(aes(x=Season, y=Percentage, fill = Sex)) + geom_bar(stat='identity',position=position_dodge()) +
ggtitle("Total number of transfers in a window(2007-2017)") +
geom_label(label=df$Percentage, position = position_dodge(0.9)) +
theme_minimal()

During the winter and the summer the percentage and women have remained the same.
By trend:
Overall trend:
As the olymics progressed through the ages the number of people participating in it would have increased, which meant that the number of men and women participating in the olympics grew.
atheletes %>%
group_by(Year, Season) %>%
summarise(NumberOfParticipants = n()) %>%
ggplot(aes(x = Year, y = NumberOfParticipants, group = Season)) +
geom_line(aes(color = Season)) +
geom_point(aes(color = Season)) +
labs(x = "Year", y = "Number of Participants", title = "Number of participants in the olympics overtime")+
theme_minimal()

The number of participants in the olympics have grown overtime. It is also obvious that the number of participants in the summer olympics are more than that of the winter olympics.
Trend of sex ratio:
groupMale <- atheletes %>%
filter(Sex == "M") %>%
group_by(Year, Season) %>%
summarise(Number_Of_Men = n())
groupFemale <- atheletes %>%
filter(Sex == "F") %>%
group_by(Year, Season) %>%
summarise(Number_Of_Women = n())
group <- groupMale %>%
left_join(groupFemale) %>%
mutate(Sex_Ratio = Number_Of_Men/Number_Of_Women)
## Joining, by = c("Year", "Season")
group$Sex_Ratio[is.na(group$Sex_Ratio)] <- 175
p1 <- group %>%
ggplot(aes(x = Year, y= Sex_Ratio, group = Season)) +
geom_line(aes(color = Season)) +
geom_point(aes(color = Season)) +
labs(x = "Year", y = "Sex Ratio", title = "Sex Ratio in Olympics through the years") +
theme_minimal()
p2 <- group %>%
filter(Year>1927) %>%
ggplot(aes(x = Year, y= Sex_Ratio, group = Season)) +
geom_line(aes(color = Season)) +
geom_point(aes(color = Season)) +
labs(x = "Year", y = "Sex Ratio", title = "Sex Ratio in Olympics through the years after 1927") +
theme_minimal()
cowplot::plot_grid(p1,p2, ncol = 1,
align = 'h', axis = 'l')

When the olympics started no women participated in the Olympics. In 1900 women started participating in the olympics. As years passed the sex ratio i.e. the ratio of men to women became smaller. After 2000 the ratio started to move towards 1, which means that the olympics now are more diverse than they used to be, which great.
Analysis by Age
Age is something might have changed from the olymipics started. Hypothesis: The participants during the 1900 of the olympics had a median age greater than the median age of the participants during the 2000’s.
Age Density plots
atheletes$Age[is.na(atheletes$Age)] <- median(atheletes$Age, na.rm = T)
cat("The median age of the athletes in the modern olympics is", median(atheletes$Age))
## The median age of the athletes in the modern olympics is 24
cat("The median age of the male athletes in the modern olympics is", median(atheletes$Age[atheletes$Sex == "M"]))
## The median age of the male athletes in the modern olympics is 25
cat("The median age of the female athletes in the modern olympics is", median(atheletes$Age[atheletes$Sex == "F"]))
## The median age of the female athletes in the modern olympics is 23
The median age for female athletes was lesser than that of median age of male athletes.
# Filling the missing ages with median values.
p1 <- atheletes %>%
ggplot(aes(x = Age)) +
geom_density(color = "black", fill = "tomato") +
labs(x = "Age", title = "Distribution of Age") +
theme_minimal()
p2 <- atheletes %>%
ggplot(aes(x=Age, fill=Sex)) +
geom_density(alpha=0.4) +
labs(x = "Age", title = "Distribution of Age by Sex") +
theme_minimal()
cowplot::plot_grid(p1,p2, ncol = 1,
align = 'h', axis = 'l')

Age of athletes over the years.
atheletes %>%
group_by(Year, Sex) %>%
summarise(Median_Age = median(Age)) %>%
ggplot(aes(x = Year, y = Median_Age, Group = Sex)) +
geom_line(aes(color = Sex)) +
geom_point(aes(color = Sex)) +
labs( x = "Year", y = "Medain age of Athletes", title = "Median age of Male and Female athletes over the years")+
theme_minimal()

The median age of men and women participating in the olymipics has increased a bit after the 1980’s.
Analysis by Team:
Teams here refer to the countries and the different atletic clubs that have participated in the olympics over the years.
cat("The total number of teams that have paricipated in the olympics are", length(unique(atheletes$Team)))
## The total number of teams that have paricipated in the olympics are 1184
atheletes <- atheletes %>%
left_join(regions, by = "NOC")
We have now joined athletes dataset with the regions dataset. I would like to analyze the dataset based on the National Olympics Committee rather than the teams.
cat("The total number of National Olympics Committees that have paricipated in the olympics are", length(unique(atheletes$region)))
## The total number of National Olympics Committees that have paricipated in the olympics are 206
The 1976 and 1980 Olympic Boycott.
atheletes %>%
group_by(Year, Season) %>%
summarise(NoOfCountries = length(unique(region))) %>%
ggplot(aes(x = Year, y = NoOfCountries, group = Season)) +
geom_line(aes(color = Season)) +
geom_point(aes(color = Season)) +
labs(x = "Year", y = "Number of countries participated", title = "Number of countries that participated in the Olympics") +
theme_minimal()

The number of countries that participated in the olymics have seen a steday increase over time. But, in 1976 and 1980 the number has seen a sharp decrease. Why is that?
The 1976 Boycott
The Montreal 1976 Olympics was boycotted by 29 countries due to the refusal of the IOC to ban New Zealand, after the New Zealand national rugby union team had toured South Africa earlier in 1976. The boycott was led by Congolese official Jean Claude Ganga. Some of the boycotting nations (including Morocco, Cameroon and Egypt) had already participated, however, and withdrew after the first few days. Senegal and Ivory Coast were the only African countries that competed throughout the duration of the Games. Elsewhere, both Iraq and Guyana also opted to join the Congolese-led boycott. South Africa had been banned from the Olympics since 1964 due to its apartheid policies. Other countries, such as El Salvador and Zaire, did not participate in Montreal because of economic reasons. The republic of China also boycotted the olympics.
Image Source: Wikipedia
The 1980 Boycott
The 1980 Summer Olympics boycott was one part of a number of actions initiated by the United States to protest the Soviet invasion of Afghanistan. Only 80 nations participated in the event. For more information on the 1980 Olympics Boycott
Image Source: Wikipedia
Medal Winners
By Numbers
atheletes %>%
filter(Medal != "<NA>") %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal))%>%
arrange(desc(Medal_Tally)) %>%
ungroup() %>%
mutate(region = reorder(region,Medal_Tally)) %>%
top_n(10) %>%
ggplot(aes(x = region,y = Medal_Tally)) +
geom_bar(stat='identity',colour="white", fill = "tomato") +
geom_text(aes(x = region, y = .1, label = paste0("(",round(Medal_Tally,2),")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
theme(plot.title = element_text(size=10),
axis.title = element_text(size=10),
axis.text = element_text(size=10)) +
labs(x = 'Country',
y = 'Number of Medals'
) +
coord_flip() +
theme_bw()
## Selecting by Medal_Tally

USA has the most amount of medal winners in the history of the modern olympics followed by Russia, Germany, UK etc.
USA, Russia, Germany, UK and France through the years.
Gold <- atheletes %>%
filter(Medal == "Gold")%>%
group_by(Year, Season, region) %>%
summarise(Gold = n())
Silver <- atheletes %>%
filter(Medal == "Silver")%>%
group_by(Year, Season, region) %>%
summarise(Silver = n())
Bronze <- atheletes %>%
filter(Medal == "Bronze")%>%
group_by(Year, Season, region) %>%
summarise(Bronze = n())
Total <- atheletes %>%
filter(Medal != "<NA>")%>%
group_by(Year, Season, region) %>%
summarise(Total = n())
Total <- Total %>%
left_join(Gold)
## Joining, by = c("Year", "Season", "region")
Total <- Total %>%
left_join(Silver)
## Joining, by = c("Year", "Season", "region")
Total <- Total %>%
left_join(Bronze)
## Joining, by = c("Year", "Season", "region")
Total$Gold[is.na(Total$Gold)] <- 0
Total$Silver[is.na(Total$Silver)] <- 0
Total$Bronze[is.na(Total$Bronze)] <- 0
Total$Total[is.na(Total$Total)] <- 0
Total <- Total %>%
filter(region %in% c("USA","Russia","Germany","France")) %>%
filter(Season == "Summer")
p1 <- Total %>%
ggplot(aes(x = Year, y = Total, group = region)) +
geom_line(aes(color = region)) +
geom_point(aes(color = region)) +
theme_minimal()
p2 <- Total %>%
ggplot(aes(x = Year, y = Gold, group = region)) +
geom_line(aes(color = region)) +
geom_point(aes(color = region)) +
theme_minimal()
cowplot::plot_grid(p1,p2, ncol = 1,
align = 'h', axis = 'l')

Gold, Silver and Bronze Medals:
Gold_Winners <- atheletes %>%
filter(Medal == "Gold") %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))
Silver_Winners <- atheletes %>%
filter(Medal == "Silver") %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))
Bronze_Winners <- atheletes %>%
filter(Medal == "Bronze") %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))
AllMedals <- atheletes %>%
filter(Medal != "<NA>") %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally))
All <- atheletes %>%
group_by(region) %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
filter(!region %in% AllMedals$region) %>%
mutate(Medal_Tally = "No Medal")
AllMedals$Medal_Tally <- "Medal Winners"
Medal_Tally <- rbind(AllMedals, All)
map.world <- map_data("world")
## Warning: package 'maps' was built under R version 3.5.1
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
#as.factor(medal_winners$region) %>% levels()
map.world_joined <- left_join(map.world, Medal_Tally, by ='region')
map.world_joined$Medal_Tally[is.na(map.world_joined$Medal_Tally)] <- "No Participation/No Data"
ggplot() +
geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = Medal_Tally)) +
labs(x = " ", y = " ", title = 'Medal winners in the world') +
theme_minimal() +
theme(legend.position="bottom")

The map shows the countries that have won atleast one medal(in red) and have not won medals(in green).
AllSilver <- atheletes %>%
group_by(region) %>%
filter(Medal == "Silver") %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
filter(!region %in% Gold_Winners$region) %>%
mutate(Medal_Tally = "Silver")
AllBronze <- atheletes %>%
group_by(region) %>%
filter(Medal == "Bronze") %>%
summarise(Medal_Tally = length(Medal)) %>%
arrange(desc(Medal_Tally)) %>%
mutate(region = str_trim(region),Medal_Tally = str_trim(Medal_Tally)) %>%
filter(!(region %in% Gold_Winners$region & region %in% Silver_Winners$region) )%>%
mutate(Medal_Tally = "Bronze")
Gold_Winners$Medal_Tally <- "Gold"
Gold_Winners <- Gold_Winners[1:99,]
Medal_Tally <- rbind(Gold_Winners, AllSilver)
Medal_Tally <- rbind(Medal_Tally, AllBronze)
map.world <- map_data("world")
#as.factor(medal_winners$region) %>% levels()
map.world_joined <- left_join(map.world, Medal_Tally, by ='region')
map.world_joined$Medal_Tally[is.na(map.world_joined$Medal_Tally)] <- "No Medal"
ggplot() +
geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = Medal_Tally)) +
scale_fill_manual(values = c("#cd7f32", "#FFDF00", "#ff6347", "#808080")) +
labs(x = " ", y = " ", title = 'Medal winners in the world') +
theme_minimal() +
theme(legend.position="bottom")

The countries in yellow are countries that have won atleast one gold medal, the countries in grey are countries that have atleast one gold but have not won a gold and the countries in bronze are countries which have won atleast one bronze medal, but no gold or silver medals.