Olympic Data Analysis - Final Project Report

0.1 Introduction
0.2 Domain Problem Characterization
0.3 Data/operation abstraction design
0.4 Encoding/Interaction design
0.5 Algorithmic design
0.6 User evaluation
0.7 Future work
0.8 Appendix
0.9 References

0.1 Introduction

The ancient Olympic Games were held at Olympia, Greece, from 776 BC through 393 AD. They returned after 1503 years: Baron Pierre de Coubertin presented the idea in 1894, and the first modern Olympics was held in Athens, Greece, in 1896. In this report, the ‘modern Olympics’ comprises all the Games from Athens 1896 to Rio 2016.

There are two long periods without any Games, 1912-1920 and 1936-1948, corresponding to WWI and WWII.

Perhaps the most significant benefit of visual analytics is that it eases the understanding of complex data while representing it in a correct, concise, and appropriate way. This report presents a visual analysis of the Olympic Games from 1896 to 2016, organized around four levels of design. These four levels are:

Characterize the tasks and data in the vocabulary of the problem domain,
Abstract them into operations and data types,
Design visual encoding and interaction techniques,
Create algorithms to execute these techniques efficiently.

The application uses concrete analysis examples and aims to provide an efficient, effective, functional, and convenient tool for users. Discussion, conclusions, and future work are summarized, together with the limitations and the improvement suggestions gathered in the user evaluation section.

The primary goal of the dashboard is to let users learn about the Olympics through visual representations. The user interface therefore matters: it is meant to be simple and clear to use while informing users about the Olympics. The evaluation should at least consider whether the product meets its specific requirements and whether it is efficient and effective.

Goal

Primary Question: Which countries are dominant in each sport?
Sub Question 1: How has participation changed over time?
Sub Question 2: Which nations own the most medals in various disciplines?
Sub Question 3: How many female vs. male athletes have attended the Olympics?

Data Characteristics

events (summer and winter)
sport-level data
athlete-level data
art competitions excluded (focus on athletics)

0.2 Domain Problem Characterization

The choice of an empirical dataset has a great impact on the quality of the project. After reviewing many datasets, the 120 Years of Olympic Games dataset was chosen for representation. Its advantages are not limited to its scale, which spans 1896 to 2016. One useful feature of the application is that it can show medal counts over the years both for popular sports (such as basketball) and for orphan sports (such as lacrosse). Last but not least, the dataset lends itself to visual representation and to a good user experience, which made it worth studying.
The application aims to attract users who are interested in the Olympics and to address their curiosity in a clear way. Although the app provides detailed information, the primary research question concerns the dominant countries in each sport. The dataset was taken from Kaggle and consists of two distinct tables, athlete_events and noc_regions, which together address the primary question.
The app consists of five tabs: information about the Olympics, world, host city, top countries, and top athletes. One way to evaluate the Shiny dashboard would be to set up an experiment in which users are asked which country won the most medals in different sports, and then to check their answers against the ground-truth data.
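
The evaluation experiment described above needs a ground-truth table to score answers against. The following is only a minimal sketch of that idea: the `medals` rows, the `answers` vector, and the name `top_country_by_sport` are hypothetical stand-ins, and medals are counted per row rather than per event as in the report's own code.

```r
library(dplyr)

# Toy stand-in for the medal rows of athlete_events (hypothetical values)
medals <- tibble(
  Sport = c("Lacrosse", "Lacrosse", "Lacrosse",
            "Badminton", "Badminton", "Badminton"),
  NOC   = c("CAN", "CAN", "GBR", "CHN", "CHN", "INA")
)

# Ground truth: the NOC with the most medal rows in each sport
top_country_by_sport <- medals %>%
  count(Sport, NOC, name = "Medals") %>%
  group_by(Sport) %>%
  slice_max(Medals, n = 1, with_ties = FALSE) %>%
  ungroup()

# Hypothetical user answers, keyed by sport
answers <- c(Lacrosse = "CAN", Badminton = "INA")

# Share of answers that match the ground truth
truth <- setNames(top_country_by_sport$NOC, top_country_by_sport$Sport)
accuracy <- mean(unname(answers[names(truth)] == truth))
```

With real survey answers, `accuracy` would give the fraction of sports for which a user named the correct country.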

0.3 Data/operation abstraction design

Two data frames were used to create the dashboard. After exploring the dataset, we decided not to remove any data; instead, the data were manipulated and transformed with the dplyr package to make them usable.
Some variables (athlete’s age, weight, and height) were not used in the Shiny dashboard but will be incorporated in future work.
The geospatial representation is drawn on a world map.
Detailed observations such as name, sex, age, weight, team’s region, game type, location, season, medals, and the year of the Olympics were also examined.
Both categorical (name, sex, team, region, year, city, and medal type) and numerical (athlete’s age, weight, and height) variables were utilized.

1. Athlete Events data

library("plotly")
library("tidyverse")
library("data.table")
library("gridExtra")
library("knitr")
library("gganimate")


# Load athlete_events data
data <- read_csv("Data/athlete_events.csv",
                 col_types = cols(
                   ID = col_character(),
                   Name = col_character(),
                   Sex = col_factor(levels = c("M","F")),
                   Age =  col_integer(),
                   Height = col_double(),
                   Weight = col_double(),
                   Team = col_character(),
                   NOC = col_character(),
                   Games = col_character(),
                   Year = col_integer(),
                   Season = col_factor(levels = c("Summer","Winter")),
                   City = col_character(),
                   Sport = col_character(),
                   Event = col_character(),
                   Medal = col_factor(levels = c("Gold","Silver","Bronze"))
                 )
)

glimpse(data)

## Observations: 271,116
## Variables: 15
## $ ID     <chr> "1", "2", "3", "4", "5", "5", "5", "5", "5", "5", "6", "6…
## $ Name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Li…
## $ Sex    <fct> M, M, M, M, F, F, F, F, F, F, M, M, M, M, M, M, M, M, M, …
## $ Age    <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 3…
## $ Height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188,…
## $ Weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 7…
## $ Team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlan…
## $ NOC    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "…
## $ Games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer…
## $ Year   <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 199…
## $ Season <fct> Summer, Summer, Summer, Summer, Winter, Winter, Winter, W…
## $ City   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "…
## $ Sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Sk…
## $ Event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightwei…
## $ Medal  <fct> NA, NA, NA, Gold, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

head(data)

## # A tibble: 6 x 15
##   ID    Name  Sex     Age Height Weight Team  NOC   Games  Year Season
##   <chr> <chr> <fct> <int>  <dbl>  <dbl> <chr> <chr> <chr> <int> <fct> 
## 1 1     A Di… M        24    180     80 China CHN   1992…  1992 Summer
## 2 2     A La… M        23    170     60 China CHN   2012…  2012 Summer
## 3 3     Gunn… M        24     NA     NA Denm… DEN   1920…  1920 Summer
## 4 4     Edga… M        34     NA     NA Denm… DEN   1900…  1900 Summer
## 5 5     Chri… F        21    185     82 Neth… NED   1988…  1988 Winter
## 6 5     Chri… F        21    185     82 Neth… NED   1988…  1988 Winter
## # … with 4 more variables: City <chr>, Sport <chr>, Event <chr>,
## #   Medal <fct>

2. NOC Regions data

# Load data file matching NOCs with map regions (countries)
noc <- read_csv("Data/noc_regions.csv",
                col_types = cols(
                  NOC = col_character(),
                  region = col_character()
                ))
glimpse(noc)

## Observations: 230
## Variables: 3
## $ NOC    <chr> "AFG", "AHO", "ALB", "ALG", "AND", "ANG", "ANT", "ANZ", "…
## $ region <chr> "Afghanistan", "Curacao", "Albania", "Algeria", "Andorra"…
## $ notes  <chr> NA, "Netherlands Antilles", NA, NA, NA, NA, "Antigua and …

head(noc)

## # A tibble: 6 x 3
##   NOC   region      notes               
##   <chr> <chr>       <chr>               
## 1 AFG   Afghanistan <NA>                
## 2 AHO   Curacao     Netherlands Antilles
## 3 ALB   Albania     <NA>                
## 4 ALG   Algeria     <NA>                
## 5 AND   Andorra     <NA>                
## 6 ANG   Angola      <NA>
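
The two tables are linked through the NOC column, so it can be worth checking which codes have no matching region before joining. This is only a sketch on toy data: `events_toy`, `noc_toy`, and the code "XXX" are made up for illustration and are not taken from the real files.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical rows)
events_toy <- tibble(NOC   = c("CHN", "DEN", "XXX"),
                     Medal = c(NA, "Gold", "Silver"))
noc_toy    <- tibble(NOC    = c("CHN", "DEN"),
                     region = c("China", "Denmark"))

# NOC codes present in the events table but missing from the lookup
unmatched <- events_toy %>%
  distinct(NOC) %>%
  anti_join(noc_toy, by = "NOC")

# Attach the region; unmatched codes end up with region = NA
joined <- left_join(events_toy, noc_toy, by = "NOC")
```

Any code listed in `unmatched` would appear with a missing region after the join, which is one possible reason for entries disappearing from a region-based menu.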

0.4 Encoding/Interaction design

The Olympic Games are played globally, so the data lend themselves to a world map of medal counts. Another interesting analysis is the ranking of athletes and countries, which is represented with bar charts.
A user-friendly Shiny dashboard is a convenient way to show visual representations interactively, and framing it around specific questions makes it a reliable tool. Redundant visual representations may reduce its usefulness, so the aim is to provide concise visualizations that still support a deep understanding.

1. Has the number of athletes, nations, and events changed over time?

# count number of athletes, nations, & events, excluding the Art Competitions
counts <- data %>%
  filter(Sport != "Art Competitions") %>%
  group_by(Year, Season) %>%
  summarize(
    Athletes = length(unique(ID)),
    Nations = length(unique(NOC)),
    Events = length(unique(Event))
  ) %>%
  ungroup()
# label the three periods separated by the world wars so lines break at the gaps
# (case_when is vectorized; a plain if() only accepts a single value)
counts <- counts %>%
  mutate(gap = case_when(Year < 1920 ~ 1,
                         Year <= 1936 ~ 2,
                         TRUE ~ 3))
p1 <- ggplot(counts, aes(x=Year, y=Athletes, group=interaction(Season,gap), color=Season)) +
  geom_point(size=2) +
  geom_line() +
  scale_color_manual(values=c("darkorange","darkblue")) +
  xlab("") +
  annotate("text",x=c(1916,1942),y=c(10000,10000),
           label=c("WWI","WWII"), size=4, color="red") +
  # annotate() draws each war marker once instead of once per data row
  annotate("segment", x=c(1914,1939), xend=c(1918,1945), y=8000, yend=8000,
           color="red", size=2)
p2 <- ggplot(counts, aes(x=Year, y=Nations, group=interaction(Season,gap), color=Season)) +
  geom_point(size=2) +
  geom_line() +
  scale_color_manual(values=c("darkorange","darkblue")) +
  xlab("") 
p3 <- ggplot(counts, aes(x=Year, y=Events, group=interaction(Season,gap), color=Season)) +
  geom_point(size=2) +
  geom_line() +
  scale_color_manual(values=c("darkorange","darkblue"))
grid.arrange(p1, p2, p3, ncol=1)

2. Which countries won the most medals (TOP 10)?

# count number of medals awarded to each Team
# (one row per NOC, medal, event and Games, so a team event counts once)
medal_counts <- data %>% filter(!is.na(Medal))%>% 
  group_by(NOC, Medal, Event, Games) %>%
  summarize(isMedal=1)

medal_counts <-  medal_counts %>% 
  group_by(NOC, Medal) %>%
  summarize(Count= sum(isMedal))

medal_counts <- left_join(medal_counts, noc, by= "NOC" )
medal_counts <- medal_counts %>% 
  mutate(Team = region) %>%
  select(Medal, Team, Count)

# order Team by total medal count
levs <- medal_counts %>%
  group_by(Team) %>%
  summarize(Total=sum(Count)) %>%
  arrange(desc(Total)) %>%
  select(Team) %>%
  slice(10:1)

medal_counts$Team <- factor(medal_counts$Team, levels=levs$Team)
# teams outside the top 10 became NA in the factor above; drop them
medal_counts <- medal_counts %>% filter(!is.na(Team))
# plot
ggplot(medal_counts, aes(x=Team, y=Count, fill=Medal)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values=c("gold1","gray70","gold4")) +
  ggtitle("Historical medal counts from Competitions") +
  theme(plot.title = element_text(hjust = 0.5))

# count number of medals awarded to each Team per year (for the animation)
medal_counts <- data %>% filter(!is.na(Medal))%>% 
  group_by(NOC, Medal, Event, Games, Year) %>%
  summarize(isMedal=1)

medal_counts <-  medal_counts %>% 
  group_by(NOC, Medal, Year) %>%
  summarize(Count= sum(isMedal))

medal_counts <- left_join(medal_counts, noc, by= "NOC" )
medal_counts <- medal_counts %>% 
  mutate(Team = region) %>%
  select(Medal, Team, Count, Year)

# order Team by total medal count
levs <- medal_counts %>%
  group_by(Team) %>%
  summarize(Total=sum(Count)) %>%
  arrange(desc(Total)) %>%
  select(Team) %>%
  slice(15:1)

medal_counts$Team <- factor(medal_counts$Team, levels=levs$Team)
# teams outside the top 15 became NA in the factor above; drop them
medal_counts <- medal_counts %>% filter(!is.na(Team))
# plot
p<- ggplot(medal_counts, aes(x=Team, y=Count, fill=Medal)) +
  labs(title = 'Year: {frame_time}')+
  transition_time(Year)+
  geom_col() +
  coord_flip() +
  scale_fill_manual(values=c("gold1","gray70","gold4")) +
  #ggtitle("Historical medal counts from Competitions") +
  theme(plot.title = element_text(hjust = 0.5))
animate(p,fps=3)

3. Which countries won the most medals - Map view

medal_counts <- data %>% filter(!is.na(Medal))%>% 
  group_by(NOC, Medal, Event, Games) %>%
  summarize(isMedal=1)

medal_counts <-  medal_counts %>% 
  group_by(NOC, Medal) %>%
  summarize(Count= sum(isMedal))

medal_counts <- left_join(medal_counts, noc, by= "NOC" ) %>% select(region, NOC, Medal, Count)

medal_counts <- medal_counts %>%
  group_by(region) %>%
  summarize(Total=sum(Count))

# keep one row per region; joining noc again would duplicate regions
# that have several NOC codes
data_regions <- medal_counts %>% 
  filter(!is.na(region))

world <- map_data("world")
world <- left_join(world, data_regions, by="region")
# Plot: 
p <- ggplot(world, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = Total, label= region)) +
  labs(title = "regions",
       x = NULL, y=NULL) +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.background = element_rect(fill = "navy"),
        plot.title = element_text(hjust = 0.5)) +
  guides(fill=guide_colourbar(title="medals")) +
  scale_fill_gradient(low="lightgreen",high="darkgreen")
ggplotly(p)

4. Number of female and male athletes over time

# Exclude art competitions from the data
data <- data %>% filter(Sport != "Art Competitions")

# Recode year of Winter Games after 1992 to match the next Summer Games
# Thus, "Year" now applies to the Olympiad in which each Olympics occurred 
original <- c(1994,1998,2002,2006,2010,2014)
new <- c(1996,2000,2004,2008,2012,2016)
# look up each year in `original`; years that are not listed stay unchanged
idx <- match(data$Year, original)
data$Year <- as.integer(ifelse(is.na(idx), data$Year, new[idx]))

# Table counting number of athletes by Year and Sex
counts_sex <- data %>% group_by(Year, Sex) %>%
  summarize(Athletes = length(unique(ID)))
counts_sex$Year <- as.integer(counts_sex$Year)

# Plot number of male/female athletes vs time
ggplot(counts_sex, aes(x=Year, y=Athletes, group=Sex, color=Sex)) +
  geom_point(size=2) +
  geom_line()  +
  transition_reveal(Year)+
  scale_color_manual(values=c("darkblue","red")) +
  labs(title = "Number of male and female Olympians over time") +
  theme(plot.title = element_text(hjust = 0.5))
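
The athlete counts by sex can also be expressed as a share of female athletes per year, which makes the narrowing gap easier to read than two separate lines. This is a sketch under assumptions: `counts_toy` mimics the shape of `counts_sex`, but its numbers are invented for illustration only.

```r
library(dplyr)

# Toy counts in the same shape as counts_sex (hypothetical numbers)
counts_toy <- tibble(
  Year     = c(1900L, 1900L, 2016L, 2016L),
  Sex      = factor(c("M", "F", "M", "F"), levels = c("M", "F")),
  Athletes = c(90L, 10L, 55L, 45L)
)

# Fraction of athletes who are female in each year
female_share <- counts_toy %>%
  group_by(Year) %>%
  summarize(Share = sum(Athletes[Sex == "F"]) / sum(Athletes))
```

Replacing `counts_toy` with the real `counts_sex` table should give one share per Olympiad that could be plotted in the same way as the counts.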

Primary and Secondary Question Findings

The top countries in the Olympic Games were found to be the USA, the Soviet Union, Germany, Great Britain, France, Italy, Sweden, Australia, Canada, and Hungary. The USA is also the dominant country in terms of gold, silver, and bronze medals.
Although male athletes have far outnumbered female athletes, female participation rose steeply after 1980, and the difference between the two ultimately fell to only around 2000 athletes.
Two USA athletes, Matt Biondi and Natalie Anne Coughlin, won the most gold and bronze medals, respectively.
Only a few countries (Canada, France, Norway, USA, Austria, Australia) showed success in Olympic snowboarding. Surprisingly, although Australia has a tropical-influenced climate, it won gold and silver medals.
Russia has dominated rhythmic gymnastics, with a medal count almost equal to the total of all other countries combined.
Only Canadian, British, and American Olympians won medals in lacrosse.
The USA, Serbia, Russia, Brazil, and Australia are the most successful countries in basketball.
Badminton medals went to only two regions, Asia and Europe; China is the most successful, followed by Indonesia, Malaysia, Denmark, the UK, and South Korea.
Athletics is a very popular sport, in which almost all countries have won at least one medal over the years.

0.5 Algorithmic design

Validation is about whether one has built the right product, and verification is about whether one has built the product right. The application's algorithms should carry out the visual encoding and interaction design. System performance is a significant component of accessibility and usability, so it was considered while writing the code and designing the system. Tidy, well-organized code affects both performance and reproducibility. Variables that could slow down the application are created at the top of the application, as a pre-processing step. Reproducibility (please see the GitHub URL in the Appendix) and readiness for production were also designed with the user in mind.
shinyloadtest can only be run if we have individual servers. Please see the limitations of shinyloadtest:

https://rstudio.github.io/shinyloadtest/articles/limitations-of-shinyloadtest.html
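
The pre-processing approach described in this section can be sketched as follows. This is not the project's actual app code: `medal_summary`, its values, and the input/output ids `sport` and `top` are hypothetical, and the point is only that the table is built once at start-up while the reactive part just filters it.

```r
library(shiny)

# Computed once when the app starts, not on every user interaction.
# Toy table standing in for the real medal summary (hypothetical values).
medal_summary <- data.frame(
  region = c("USA", "USA", "Canada"),
  Sport  = c("Basketball", "Lacrosse", "Lacrosse"),
  Total  = c(5, 1, 3)
)

ui <- fluidPage(
  selectInput("sport", "Sport", choices = sort(unique(medal_summary$Sport))),
  tableOutput("top")
)

server <- function(input, output, session) {
  # The reactive part only filters and sorts the precomputed table
  output$top <- renderTable({
    rows <- medal_summary[medal_summary$Sport == input$sport, ]
    rows[order(-rows$Total), ]
  })
}

# Build the app object; start it with runApp(app) in an interactive session
app <- shinyApp(ui, server)
```

Keeping expensive joins and summaries outside `server` means each new session reuses the same data instead of recomputing it.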

0.6 User evaluation

Evaluating the system through direct human interaction is an extremely complex task. Users may be biased and influenced by their experience, prior knowledge, and perspective. Cognitive ability also differs from person to person, which can lead to disagreement in judgment. Individuals notice different things: one may focus on the cosmetics, another on the technical details.
The analytical and empirical techniques of Human Computer Interaction (HCI) evaluate how users interact with the system. Such an evaluation should: assess whether the system fulfills all of the functions requested by the user, as defined in the user requirements specification phase; analyze the system’s effect on the final users; evaluate aspects linked to human factors, such as the usability of the graphical interface, its simplicity, and the level of acceptance by the users; and identify every possible problem that could arise for the final users, such as an unpredicted result or anything that could mislead them.
Visual representation provides functional, rapid, simple, and detailed information in several ways, such as the popularity of sports in different countries, the sports that became trendy over time, and the effects of historical events such as the world wars on the Olympics.
The system will be evaluated through a survey completed by users. This evaluation gives the system developer an understanding of the data essentials and of user needs and expectations. It is a crucial part of a system that conveys insights through visual representations, and it is also essential for future work to improve the quality of the application.
A survey was created and sent to individuals to gather feedback on how well the application performs and how useful it is. The survey questions are listed below.
1 How easy was the navigation through tabs? (5 - very easy, 1 - very difficult)
2 Did the app take a long time to load?
3 Were the instructions for navigating the app clear? (5 - very clear)
4 Did the app provide information that helped increase your knowledge? (5 - very helpful)
5 Any suggestions?
Feedback about the survey: So far, 9 individuals have completed the survey. All of them (100%) agreed that the navigation bars are easy to understand, that the app instructions are clear, and that the app increased their knowledge. While 22.2% of respondents thought the app took time to load, 77.8% had no performance issue. There were also some suggestions, such as adjusting the font size of the navigation bar and using a heat map instead of the regular map. One respondent also suggested adding a user-friendly message when there is no data for the country selected from the navigation bar.
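
The reported percentages are consistent with 2 of the 9 respondents noticing slow loading. The respondent count of 2 is inferred from the percentages rather than stated in the survey results, so the short check below should be read as an assumption made explicit.

```r
# 9 respondents; 2 reporting slow loading is inferred from the 22.2% figure
n_respondents <- 9
n_slow <- 2

pct_slow <- round(100 * n_slow / n_respondents, 1)
pct_ok   <- round(100 * (n_respondents - n_slow) / n_respondents, 1)
```

With so few respondents, a single answer moves each figure by about 11 percentage points, so the percentages are best read alongside the raw counts.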

0.7 Future work

Find the correlation between winning medals and nations’ GDP and other factors.
Develop a predictive model to predict win probability.
Survey feedback will be considered in future work to improve the dashboard.
For some users, recognizing faces is easier than remembering names, so photos of the athletes alongside their names would be an asset to the app.

0.8 Appendix

The athlete_events dataset contains host cities but does not include the associated country. A Host_city_xls dataset was taken from Wikipedia, which solved the problem of the few city names that are present in multiple countries (such as London).

Limitations

Athlete_event and noc_regions were merged into one dataset and ordered by NOC (region). Although 90% of the regions are listed in the drop-down menu under select country in "Top 10 Athletes", a few are not shown.
Rayshader is an R package for producing 2D and 3D hillshaded maps of elevation matrices using a combination of raytracing, spherical texture mapping, overlays, and ambient occlusion. It can also export 3D models to a 3D print-ready format and includes a post-processing depth-of-field effect for 3D visualizations. We tried to integrate it into the Shiny dashboard, but it required more time than was available, so it is proposed for future work.
Scrollytell is an R package used to generate scrollytelling presentations in R Shiny. Because of limited time, it is also proposed for future work.

Links