The following R packages are used for this analysis:
readr
RSQLite
dplyr
wordcloud
RColorBrewer
ggplot2
Milestone 3 follows Milestone 2. In this project, text mining is used to analyse the Olympics data. Wordcloud and histogram of some of the columns will be observe for any correlation.
# A tibble: 6 x 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Di~ M 24 180 80 China CHN 1992~ 1992 Summer Barc~
2 2 A La~ M 23 170 60 China CHN 2012~ 2012 Summer Lond~
3 3 Gunn~ M 24 NA NA Denm~ DEN 1920~ 1920 Summer Antw~
4 4 Edga~ M 34 NA NA Denm~ DEN 1900~ 1900 Summer Paris
5 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
6 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>
# Order column team
team_ordering <- olympics %>% select(Team) %>% count(Team, sort = TRUE) %>% group_by(Team)
team_ordering# A tibble: 1,184 x 2
# Groups: Team [1,184]
Team n
<chr> <int>
1 United States 17847
2 France 11988
3 Great Britain 11404
4 Italy 10260
5 Germany 9326
6 Canada 9279
7 Japan 8289
8 Sweden 8052
9 Australia 7513
10 Hungary 6547
# ... with 1,174 more rows
SELECT Team
,COUNT(Team) AS Country_appearance
FROM olymp
GROUP BY Team
ORDER BY Country_appearance DESC| Team | Country_appearance |
|---|---|
| United States | 17847 |
| France | 11988 |
| Great Britain | 11404 |
| Italy | 10260 |
| Germany | 9326 |
| Canada | 9279 |
| Japan | 8289 |
| Sweden | 8052 |
| Australia | 7513 |
| Hungary | 6547 |
The same output is seen with R and SQL on sorting the column Team. The column of interest represents the participation of each country in a given sporting event.
team_ordering <- team_ordering[1:30,]
tplot <- ggplot(team_ordering, aes(Team, n, fill = Team)) + geom_col()
tplotteam_ordering2 <- team_ordering[1:10,]
tplot2 <- ggplot(team_ordering2, aes(Team, n, fill = Team)) + geom_col() +
geom_text(aes(label = n, vjust = -0.5)) +
theme_minimal() +
theme(legend.position = "none", axis.title.y = element_text(angle = 0)) +
labs(title = "Top 10 countries")
tplot2team_ordering <- olympics %>%
select(Team) %>%
count(Team, sort = TRUE) %>%
group_by(Team) %>%
with(wordcloud(Team, n, max.words = 100, colors = brewer.pal(20, "Set1")))How do Team participation in the different sporting events contribute to the number of Medal. Does participation in more sporting events enhance the amount of medals won by a country? Is there any collinearity or correlation between participation of team n and Medal?
# A tibble: 4 x 2
# Groups: Medal [4]
Medal n
<chr> <int>
1 <NA> 231333
2 Gold 13372
3 Bronze 13295
4 Silver 13116
Looking at the two variables of interest viz. Team and Medal and from the Questions sub-section, more work needs to be done to be able to ascertain if there is any advantage when a country has more participation in the sporting events.
The graphics for the Team and Medal variables are clear illustrations and depict some kind of metrics showing the different distribution of these variable. The wordcloud also showed that the United States of America has more sporting partticipation. According to All-time medal count, USA is the all time champion of Olympics medals. Milestone 4 will try to validate this claims.