Description
- Anthems projects uses a national anthems of different countries around the world and groups them together based on theme
- The goal of this project is to use text mining and text analytics concepts to extract theme based clusters
- The final result displays the clusters on an interactive map that groups countries with similar national anthem themes
Data source
- The data source has 190 entries
- Data source consists of country names, 2 letter and 3 letter country codes, continent and the country’s national anthem
anthems <- read.csv('C:\\Users\\17327\\anthems.csv')
knitr::kable(head(anthems,3), "simple")
| Albania |
AL |
ALB |
Europe |
Around our flag we stand united, With one wish and one goal, A sacred oath we bestow upon it Proclaiming loyalty for our salvation. From war abstains only he, Who a traitor is born, He who is a true man is not frightened, But dies a warrior to the cause. With weapons in our hands a-brandished, We will defend our fatherland, Our sacred rights we’ll not relinquish, The foe has no place in our land. For God himself proclaimed The nations of the earth shall wane, And yet will live, will thrive Albania. For you, for you we fight. O Flag, flag, you sacred symbol Upon you we now swear For Albania, our dear fatherland For honour and your glory. Brave man is named and honoured The one who sacrificed himself for the fatherland Forever he will be remembered On earth and under as a saint! |
| Armenia |
AM |
ARM |
Europe |
Our Fatherland, free, independent, That has for centuries lived, Is now summoning its sons To the free, independent Armenia. Here is a flag for you, my brother, That I have sewn by hand Over the sleepless nights, And bathed in my tears. Look at it, tricolored, A valuable symbol for us. Let it shine against the enemy. Let you, Armenia, be glorious forever. Death is the same everywhere, A man dies but once, Blessed is the one that dies For the freedom of his nation. |
| Austria |
AT |
AUT |
Europe |
Land of mountains, land by the river, Land of fields, land of cathedrals, Land of hammers, with a promising future! Home to great daughters and sons, People highly gifted for beautiful arts, Much-praised Austria! Strongly feuded for, fiercely hard-fought for, Thou liest in the middle of the continent Like a strong heart, Since the early days of the ancestors thou hast Borne the burden of a high mission, Much-tried Austria. Bravely towards the new ages See us striding, free, and faithful, Assiduous and full of hope, Unified, let us in jolly choirs Pledge allegiance to thee, Fatherland Much-beloved Austria. |
Top 3 words in each Anthem
- The anthem is cleaned and words that do not contribute to the semantic structure of the anthem are removed
- At this stage the top three commonly occurring words in the anthem are returned
library(qdapRegex)
library(textclean)
library(tidyverse)
library(factoextra)
library(xfun)
library(cluster)
library(stringr)
library(tm)
library(dplyr)
library(tidytext)
stopwords <- read.csv("C:\\Users\\17327\\stopwords.csv")
clean_data <- function(data) {
data <- tolower(trimws(data))
data <- removeWords(data, c(stopwords("english")))
stopwords <- head(sort(unlist(stopwords), decreasing=TRUE), 986)
stopwords <- removePunctuation(stopwords)
data <- removeWords(data, stopwords)
data <- removePunctuation(data)
data <- stemDocument(data, language = "english")
corpus <- Corpus(VectorSource(data))
dtm <- as.matrix(DocumentTermMatrix(corpus))
freq <- colSums(dtm)
freq <- as.data.frame(sort(freq, decreasing = TRUE))
data2 <- freq
data2 <- tibble::rownames_to_column(data2,"word")
colnames(data2) <- c("word","count")
head(data2$word, 3)
}
clean_data(anthems[1,5])
## [1] "fatherland" "flag" "sacr"
Clean, process and cluster the data
- The anthems are cleaned, filtered and subject to the Td-Idf algorithm. This statistical algorithm assigns scores to words based on its frequency both within each document and through the entire set of documents
Tf(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
Idf = (Total number of documents) / (Number of documents with word t in it)
Tf-Idf = Tf x Idf
- The result from te Td-Idf algorithm is then converted to a matrix before applying the kmeans algorithm
- In order to find an appropriate k value, we use the elbow method by plotting a graph
- Using the k value got from the elbow method, the Td-Idf matrix is clustered into k unique clusters containing words with similar theme
stopwords <- read.csv("C:\\Users\\17327\\stopwords.csv")
cluster_data <- function() {
anthem <- my_func()
corpus <- VectorSource(anthem)
corpus <- Corpus(corpus)
anthem <- tm::tm_map(corpus, tolower)
anthem <- tm::tm_map(anthem, removeWords, stopwords("english"))
stopwords <- head(sort(unlist(stopwords), decreasing=TRUE), 986)
stopwords <- removePunctuation(stopwords)
anthem <- tm::tm_map(anthem, removeWords, stopwords)
anthem <- tm::tm_map(anthem, removePunctuation)
anthem <- tm::tm_map(anthem, stemDocument)
anthem <- tm::tm_map(anthem, removeWords, c("listanthem"))
anthem <- tm::tm_map(anthem, removeWords, "http[A-Za-z]+")
anthem <- tm::tm_map(anthem, removeWords, "www.[A-Za-z]+")
anthem <- tm::tm_map(anthem, removeWords, c("afa201rpafad", "aa200o", "a200oeend", "200end",
"200", "thi", "arab","oer", "wait"))
anthem <- tm::tm_map(anthem, rm_non_ascii)
anthem <- tm::tm_map(anthem, rm_nchar_words, 1)
anthem <- tm::tm_map(anthem, rm_nchar_words, 2)
anthem <- tm::tm_map(anthem, removeWords, stopwords)
dtm <- DocumentTermMatrix(anthem)
dtm <- removeSparseTerms(dtm, sparse = 0.98)
dtm.tfidf <- weightTfIdf(dtm)
dtm.tfidf
tfidf.matrix <- t(as.matrix(dtm.tfidf))
print(fviz_nbclust(tfidf.matrix, kmeans, method = "wss"))
clustering.kmeans <- kmeans(tfidf.matrix, centers = 5)
}
cluster <- cluster_data()

knitr::kable(head(cluster$cluster, 20), "simple")
| bestow |
1 |
| born |
1 |
| brave |
2 |
| dear |
5 |
| defend |
1 |
| die |
5 |
| earth |
1 |
| fatherland |
5 |
| fight |
1 |
| flag |
1 |
| foe |
1 |
| forev |
1 |
| glori |
5 |
| goal |
1 |
| god |
5 |
| hand |
5 |
| honour |
1 |
| land |
5 |
| live |
5 |
| loyalti |
1 |
Label the data based on the clustered info
- Each cluster generated in the previous step are renamed
- The top 3 words in each anthem are cross-referenced with the clusters to find the cluster with the highest count of matching words
- The cluster that is generated is arranged according to their index numbers
final_label <- function(doc, cluster_kmeans) {
result <- as.data.frame(cluster_kmeans$cluster)
result <- tibble::rownames_to_column(result, "word")
colnames(result) <- c("word","cluster")
I <- result %>%
filter(cluster == 1)
I <- I$word
II <- result %>%
filter(cluster == 2)
II <- II$word
III <- result %>%
filter(cluster == 3)
III <- III$word
IV <- result %>%
filter(cluster == 4)
IV <- IV$word
V <- result %>%
filter(cluster == 5)
V <- V$word
count1 <- 0
count2 <- 0
count3 <- 0
count4 <- 0
count5 <- 0
clean_doc <- clean_data(doc)
for (word in clean_doc) {
ifelse((is.element(word, I)),count1 <- count1+1,
ifelse((is.element(word, II)), count2 <- count2+1,
ifelse((is.element(word, III)), count3 <- count3+1,
ifelse((is.element(word, IV)), count4 <- count4+1,
ifelse((is.element(word, V)), count5 <- count5+1,FALSE)
)
)
)
)
count <- max(count1, count2, count3, count4, count5)
}
labels = c()
ifelse((count1 == count), labels <- return("1"),
ifelse((count2 == count), labels <- return("2"),
ifelse((count3 == count), labels <- return("3"),
ifelse((count4 == count), labels <- return("4"),
ifelse((count5 == count), labels <- return("5"),)
)
)
)
)
}
Assign the labels to the different countries
- The main function consists of function calls for all the above functions
- The final result syncs the labels generated in the previous step with the respectively countries using index numbers
- The final output consists of a table with the country’s name, continent and the label assigned
main <- function(corpora) {
step3 <- cluster_data()
labels <- c()
for (k in 1:190) {
country <- anthems[k,5]
labels <- append(labels, final_label(country, step3))
}
corpus <- as.data.frame(corpora)
corpus["labels"] <- labels
corpus[c(1,4,6)]
}
knitr::kable(head(main(anthems),20), "simple")

| Albania |
Europe |
1 |
| Armenia |
Europe |
5 |
| Austria |
Europe |
5 |
| Azerbaijan |
Europe |
5 |
| Belarus |
Europe |
5 |
| Belgium |
Europe |
5 |
| Bosnia and Herzegovina |
Europe |
5 |
| Bulgaria |
Europe |
5 |
| Croatia |
Europe |
5 |
| Cyprus |
Europe |
4 |
| Czechia |
Europe |
5 |
| Denmark |
Europe |
1 |
| Estonia |
Europe |
5 |
| Finland |
Europe |
5 |
| France |
Europe |
1 |
| Georgia |
Europe |
1 |
| Germany |
Europe |
5 |
| Greece |
Europe |
4 |
| Hungary |
Europe |
5 |
| Iceland |
Europe |
1 |
Interactive map on Tableau
- The data got from the analysis is exported to Tableau to create an interactive map
- This map is color-coded according to their label
- Countries with the same color have a similar themed anthem
- Below is a snapshot of the map. Follow the link for the interactive Tableau representation
Tableau map
Source
- The source project by Lucas de Sá uses the same algorithm in Python to cluster the national anthems of 190 countries based on their theme.
Source Project