Load Packages

The following R packages are used for this analysis:

  1. readr
  2. RSQLite
  3. dplyr
  4. wordcloud
  5. RColorBrewer
  6. ggplot2

Overview

Milestone 3 follows Milestone 2. In this project, text mining is used to analyse the Olympics data. Wordcloud and histogram of some of the columns will be observe for any correlation.

Definition of terms

  • Text mining: is the process of extracting useful information from textual data.
  • Text analytics: is also part of text mining and is the use of statistical and machine learning techiques to extract on textual document that has undergone some levels of text mining.


Load files


# A tibble: 6 x 15
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>


Data Visualization

Sort using R


# Order column team

team_ordering <- olympics %>% select(Team) %>% count(Team, sort = TRUE) %>% group_by(Team)
team_ordering
# A tibble: 1,184 x 2
# Groups:   Team [1,184]
   Team              n
   <chr>         <int>
 1 United States 17847
 2 France        11988
 3 Great Britain 11404
 4 Italy         10260
 5 Germany        9326
 6 Canada         9279
 7 Japan          8289
 8 Sweden         8052
 9 Australia      7513
10 Hungary        6547
# ... with 1,174 more rows


Sort using SQL



SELECT Team
      ,COUNT(Team) AS Country_appearance
FROM olymp
GROUP BY Team
ORDER BY Country_appearance DESC
Displaying records 1 - 10
Team Country_appearance
United States 17847
France 11988
Great Britain 11404
Italy 10260
Germany 9326
Canada 9279
Japan 8289
Sweden 8052
Australia 7513
Hungary 6547


Discussion

The same output is seen with R and SQL on sorting the column Team. The column of interest represents the participation of each country in a given sporting event.


Graphics


team_ordering <- team_ordering[1:30,]

tplot <- ggplot(team_ordering, aes(Team, n, fill = Team)) + geom_col()
tplot

team_ordering2 <- team_ordering[1:10,]

tplot2 <- ggplot(team_ordering2, aes(Team, n, fill = Team)) + geom_col() +
          geom_text(aes(label = n, vjust = -0.5)) +
          theme_minimal() +
          theme(legend.position = "none", axis.title.y = element_text(angle = 0)) +
          labs(title = "Top 10 countries")
tplot2


Wordcloud


team_ordering <- olympics %>% 
                 select(Team) %>% 
                 count(Team, sort = TRUE) %>% 
                 group_by(Team) %>%
                 with(wordcloud(Team, n, max.words = 100, colors = brewer.pal(20, "Set1")))


Questions

How do Team participation in the different sporting events contribute to the number of Medal. Does participation in more sporting events enhance the amount of medals won by a country? Is there any collinearity or correlation between participation of team n and Medal?


# A tibble: 4 x 2
# Groups:   Medal [4]
  Medal       n
  <chr>   <int>
1 <NA>   231333
2 Gold    13372
3 Bronze  13295
4 Silver  13116



Relationships / Correlation

Looking at the two variables of interest viz. Team and Medal and from the Questions sub-section, more work needs to be done to be able to ascertain if there is any advantage when a country has more participation in the sporting events.


Summary

The graphics for the Team and Medal variables are clear illustrations and depict some kind of metrics showing the different distribution of these variable. The wordcloud also showed that the United States of America has more sporting partticipation. According to All-time medal count, USA is the all time champion of Olympics medals. Milestone 4 will try to validate this claims.

References

  1. Text Mining with R
  2. A Short Guide to Historical Newspaper Data, Using R
  3. Text Mining vs Text Analytics
  4. Basic Text Analysis in R
  5. R Core Team (2019)
  6. ggplot2: Elegant Graphics for Data Analysis
  7. RSQLite: 'SQLite' Interface for R
  8. readr: Read Rectangular Text Data
  9. dplyr: A Grammar of Data Manipulation 10.wordcloud: Word Clouds 11.RColorBrewer: ColorBrewer Palettes