Milestone 3: Beyond Descriptive Stats - 120 years of Olympics data

Load Packages
Overview
- Definition of terms
- Load files
Data Visualization
Questions
Relationships / Correlation
Summary
References

Load Packages

The following R packages are used for this analysis:

readr
RSQLite
dplyr
wordcloud
RColorBrewer
ggplot2

Overview

Milestone 3 follows Milestone 2. In this project, text mining is used to analyse the Olympics data. Wordcloud and histogram of some of the columns will be observe for any correlation.

Definition of terms

Text mining: is the process of extracting useful information from textual data.
Text analytics: is also part of text mining and is the use of statistical and machine learning techiques to extract on textual document that has undergone some levels of text mining.

Load files

# A tibble: 6 x 15
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>

Data Visualization

Sort using R

# Order column team

team_ordering <- olympics %>% select(Team) %>% count(Team, sort = TRUE) %>% group_by(Team)
team_ordering

# A tibble: 1,184 x 2
# Groups:   Team [1,184]
   Team              n
   <chr>         <int>
 1 United States 17847
 2 France        11988
 3 Great Britain 11404
 4 Italy         10260
 5 Germany        9326
 6 Canada         9279
 7 Japan          8289
 8 Sweden         8052
 9 Australia      7513
10 Hungary        6547
# ... with 1,174 more rows

Sort using SQL


SELECT Team
      ,COUNT(Team) AS Country_appearance
FROM olymp
GROUP BY Team
ORDER BY Country_appearance DESC

Displaying records 1 - 10
Team	Country_appearance
United States	17847
France	11988
Great Britain	11404
Italy	10260
Germany	9326
Canada	9279
Japan	8289
Sweden	8052
Australia	7513
Hungary	6547

Discussion

The same output is seen with R and SQL on sorting the column Team. The column of interest represents the participation of each country in a given sporting event.

Graphics

team_ordering <- team_ordering[1:30,]

tplot <- ggplot(team_ordering, aes(Team, n, fill = Team)) + geom_col()
tplot

team_ordering2 <- team_ordering[1:10,]

tplot2 <- ggplot(team_ordering2, aes(Team, n, fill = Team)) + geom_col() +
          geom_text(aes(label = n, vjust = -0.5)) +
          theme_minimal() +
          theme(legend.position = "none", axis.title.y = element_text(angle = 0)) +
          labs(title = "Top 10 countries")
tplot2

Wordcloud

team_ordering <- olympics %>% 
                 select(Team) %>% 
                 count(Team, sort = TRUE) %>% 
                 group_by(Team) %>%
                 with(wordcloud(Team, n, max.words = 100, colors = brewer.pal(20, "Set1")))

Questions

How do Team participation in the different sporting events contribute to the number of Medal. Does participation in more sporting events enhance the amount of medals won by a country? Is there any collinearity or correlation between participation of team n and Medal?

# A tibble: 4 x 2
# Groups:   Medal [4]
  Medal       n
  <chr>   <int>
1 <NA>   231333
2 Gold    13372
3 Bronze  13295
4 Silver  13116

Relationships / Correlation

Looking at the two variables of interest viz. Team and Medal and from the Questions sub-section, more work needs to be done to be able to ascertain if there is any advantage when a country has more participation in the sporting events.

Which country won the most medals in the Olympics? This is another question I will be looking at in Milestone 4

Summary

The graphics for the Team and Medal variables are clear illustrations and depict some kind of metrics showing the different distribution of these variable. The wordcloud also showed that the United States of America has more sporting partticipation. According to All-time medal count, USA is the all time champion of Olympics medals. Milestone 4 will try to validate this claims.