Clustering documents, films, newspapers, or any other text data is becoming more and more popular. The main idea is to organize the documents into groups in which the observations are similar. In the paper “Clustering Search Keywords Using K-Means Clustering” by Randy Zwitch (https://randyzwitch.com/rsitecatalyst-k-means-clustering/), the k-means algorithm is used to obtain groups of topics based on the keywords that users type into search. Another example is an article by Hardeep Singh, “Clustering of text documents by implementation of K-means algorithms”, in which k-means and spherical k-means were compared on the 20 Newsgroups dataset; it turns out that spherical k-means works better for high-dimensional text data. In machine learning and natural language processing there are also topic models, statistical models used for discovering the topics that occur in a collection of documents. In this approach each document concerns multiple topics in different proportions.
When working with text data, a few preprocessing steps should also be carried out; the article “Efficient clustering of very large document collections” describes the operations performed at this stage.
There are several algorithms for clustering data. For document clustering the most popular is probably the k-means algorithm, which is often compared with spherical k-means. The difference between the two approaches is that in k-means the Euclidean distance is usually minimized:
\[d(p,q)= \sqrt{\sum_{i=1}^n (q_{i}-p_{i})^2}\]
whereas in spherical k-means the cosine dissimilarity is minimized:
\[ J = \sum_{i}\left(1-\cos(x_{i},p_{i})\right)\]
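As a small illustration, both dissimilarities can be computed in R on two toy term-count vectors (the vectors below are made up for this example and are not part of the dataset):

# Toy term-count vectors for two short documents
p <- c(war = 2, soldier = 1, love = 0, boxer = 0)
q <- c(war = 1, soldier = 0, love = 2, boxer = 1)
# Euclidean distance, as minimized by standard k-means
sqrt(sum((q - p)^2))
# Cosine dissimilarity, as minimized by spherical k-means
1 - sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))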
Since spherical k-means is intended for large collections of text documents and the dataset used here is small, plain k-means will be used instead. The other algorithms that will be applied are PAM (Partitioning Around Medoids), hierarchical clustering, and LDA (Latent Dirichlet Allocation).
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for text data. For each observation the model calculates the probability of belonging to each topic. The main idea is to define a finite number of topics to which the observations will then be assigned.
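In the standard formulation (the notation below is the usual one for LDA, not taken from the original text), each document first draws a topic distribution, and every word in it then draws a topic and a word from that topic:

\[ \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}) \]

where θ_d is the topic mixture of document d, z_{d,n} is the topic of its n-th word, and β_k is the word distribution of topic k.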
In this case, different types of films will be clustered based on their short descriptions. The dataset consists of 48 observations, each of which is a short description of one film. There are 4 different types of films (12 observations per category) in the collection: westerns, comedies, sport films, and war films.
The data were collected from the IMDB site, which contains information about films and TV series.
What will be done?
Several steps will be taken to get the best possible clustering: loading and preprocessing the data, building the document-term and TF-IDF datasets, choosing the number of clusters, and running the clustering algorithms.
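The code in the following sections assumes that the required packages are already attached. A minimal setup sketch, inferred from the functions used (the exact package list is an assumption, since the setup chunk is not shown):

library(openxlsx)    # read.xlsx
library(tm)          # removePunctuation, removeWords, stopwords, stemDocument
library(corpus)      # as_corpus_frame, term_matrix
library(dplyr)       # filter, select, arrange, count, group_by, summarise
library(tibble)      # rownames_to_column
library(ggplot2)     # ggtitle
library(factoextra)  # fviz_nbclust, eclust, fviz_cluster, fviz_silhouette
library(cluster)     # pam, clara
library(gridExtra)   # grid.arrange, arrangeGrob
library(topicmodels) # LDA, topics, terms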
data <- read.xlsx("filmy_48.xlsx")
head(data,2)
## Category
## 1 [western]
## 2 [western]
## Description
## 1 A bounty hunting scam joins two men in an uneasy alliance against a third in a race to find a fortune in gold buried in a remote cemetery.
## 2 With the help of a German bounty hunter, a freed slave sets out to rescue his wife from a brutal Mississippi plantation owner.
The dataset consists of two columns: the film category and its short description.
# Normalize the descriptions: lower case, remove punctuation and English stop words, then stem
data$Description <- tolower(data$Description)
data$Description <- removePunctuation(data$Description)
data$Description <- removeWords(data$Description, words=stopwords())
data$Description <- stemDocument(data$Description)
corpus = as_corpus_frame(data$Description)
# Tokenization
dtm <- term_matrix(corpus)
dtm = as.data.frame(as.matrix(dtm))
head(corpus)
## text
## 1 bounti hunt scam join two men uneasi allianc third race find fortun gold…
## 2 help german bounti hunter freed slave set rescu wife brutal mississippi …
## 3 two bounti hunter intent team track western outlaw
## 4 retir old west gunsling william munni reluct take one last job help old …
## 5 wyom earli 1900s butch cassidi sundanc kid leader band outlaw train robb…
## 6 senat becam famous kill notori outlaw return funer old friend tell truth…
After preprocessing, the descriptions look normalized and ready for tokenization.
After tokenization the dataset consists of 48 rows (films) and 624 variables. The next step is to eliminate low-frequency words. To choose the threshold, let us look at some statistics of the data.
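The stated dimensions can be checked directly on the document-term matrix built above (a quick sketch):

# 48 documents (rows) and 624 distinct stemmed terms (columns)
dim(dtm)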
# Transpose so that words become rows, then count total occurrences of each word
dtm1=t(dtm)
dtm1=as.data.frame(dtm1)
dtm1$counter = rowSums(dtm1)
dtm1 = rownames_to_column(dtm1,var='rowname')
# Distribution of word frequencies across the whole collection
word_counts =as.data.frame(dtm1$counter)
colnames(word_counts) <- c("count")
word_counts %>% count(count)
## # A tibble: 10 x 2
## count n
## <dbl> <int>
## 1 1 420
## 2 2 58
## 3 3 20
## 4 4 13
## 5 5 5
## 6 6 2
## 7 7 2
## 8 8 1
## 9 9 1
## 10 14 1
It turns out that there are 420 words that occur only once across all the short descriptions, 58 words that occur twice, and so on. Words that occur once will be removed from the dataset. In the further analysis I would also like to compare the results obtained when the words that occur twice are kept with the results obtained when they are removed as well.
As for the high-frequency words, most of them were stop words, which were removed in the previous step. The remaining words with the highest frequency are:
dtm1_hf = select(dtm1,rowname,counter)
dtm1_hf %>% arrange(desc(counter)) %>% head(10)
## rowname counter
## 1 war 14
## 2 help 9
## 3 stori 8
## 4 find 7
## 5 world 7
## 6 love 6
## 7 outlaw 6
## 8 boxer 5
## 9 ii 5
## 10 indian 5
The most frequent word is “war”. In this case I think we should not remove it, because it is an indicator of the war movies; instead, let us look at the most popular words in each of the four groups.
## word counter
## 1 outlaw 6
## 2 bounti 4
## 3 find 3
## 4 help 3
## 5 hunter 3
## word counter
## 1 stori 7
## 2 boxer 5
## 3 champion 4
## 4 indian 4
## 5 team 4
## word counter
## 1 love 6
## 2 help 4
## 3 find 3
## 4 will 3
## 5 woman 3
## word counter
## 1 war 12
## 2 world 6
## 3 ii 5
## 4 plan 3
## 5 soldier 3
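The four tables above list the most frequent words within each genre; the code that produced them is not shown. A minimal sketch of how such tables could be built from the objects defined above (the tidyr dependency and the variable names are my own, not from the original):

library(tidyr)
# Sum word counts within each film category
genre_counts <- dtm %>%
  mutate(type = data$Category) %>%
  group_by(type) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(-type, names_to = "word", values_to = "counter")
# Five most frequent words for one genre, e.g. westerns
genre_counts %>%
  filter(type == "[western]") %>%
  arrange(desc(counter)) %>%
  head(5)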
Let us take a look at what the dataset looks like, and also at the data processed with TF-IDF. At this point we create two datasets with raw word counts and two datasets after the TF-IDF transformation:
Structure of the data frame: 48 rows and 103 columns.
# Keep words that occur more than once and transpose back to documents-in-rows form
dtm_1words = dplyr::filter(dtm1, dtm1$counter > 1) %>% select(-counter)
col_names=dtm_1words$rowname
dtm_1words= dtm_1words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_1words) <- col_names
head(dtm_1words, 10)[1:10]
## american around assassin becam becom behind bloodi bounti boxer boy
## V1 0 0 0 0 0 0 0 1 0 0
## V2 0 0 0 0 0 0 0 1 0 0
## V3 0 0 0 0 0 0 0 1 0 0
## V4 0 0 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0 0 0
## V6 0 0 0 1 0 0 0 0 0 0
## V7 0 0 0 0 0 0 0 0 0 0
## V8 1 1 0 0 0 0 0 0 0 0
## V9 1 0 0 0 0 0 0 0 0 0
## V10 0 1 0 0 0 0 0 0 0 0
Structure of the data frame: 48 rows and 45 columns.
dtm_2words = dplyr::filter(dtm1, dtm1$counter > 2) %>% select(-counter)
col_names2=dtm_2words$rowname
dtm_2words= dtm_2words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_2words) <- col_names2
head(dtm_2words, 10)[1:10]
## american around becom bounti boxer boy brother camp champion escap
## V1 0 0 0 1 0 0 0 0 0 0
## V2 0 0 0 1 0 0 0 0 0 0
## V3 0 0 0 1 0 0 0 0 0 0
## V4 0 0 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0 0 1
## V6 0 0 0 0 0 0 0 0 0 0
## V7 0 0 0 0 0 0 0 0 0 0
## V8 1 1 0 0 0 0 0 0 0 0
## V9 1 0 0 0 0 0 1 0 0 0
## V10 0 1 0 0 0 0 0 0 0 0
TF-IDF (term frequency – inverse document frequency) is a method of converting a document-term matrix, whose values are raw counts of each word, into values that represent how important a word is in a given document relative to the whole collection. TF-IDF consists of two components: the term frequency (TF) and the inverse document frequency (IDF).
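The standard definitions of the two components, consistent with the calculation below (the notation is introduced here only for clarity), are

\[ tf(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad idf(t) = \log\frac{N}{|\{d : t \in d\}|}, \qquad tfidf(t,d) = tf(t,d)\cdot idf(t) \]

where the numerator of TF is the count of term t in document d, the denominator is the total number of terms in that document, and N is the number of documents, here 48.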
Example:
# Calculating TF: word counts divided by the number of words in each document
number = rowSums(dtm_1words)
freq=((dtm_1words) / (number))
# Calculating the denominator of IDF: in how many documents each word occurs
df=dtm_1words
df[df > 1] <- 1
dt= rowSums(t(df))
# Counter - number of documents
counter = 48
# IDF
idf = log(counter/dt)
# TF - IDF
tf_idf = t(freq) * idf
tf_idf = t(tf_idf)
tf_idf_1 = as.data.frame((tf_idf))
tf_idf_1[is.na(tf_idf_1)] <- 0
# Calculating TF: word counts divided by the number of words in each document
number = rowSums(dtm_2words)
freq=((dtm_2words) / (number))
# Calculating the denominator of IDF: in how many documents each word occurs
df=dtm_2words
df[df > 1] <- 1
dt= rowSums(t(df))
# Counter - number of documents
counter = 48
# IDF
idf = log(counter/dt)
# TF - IDF
tf_idf = t(freq) * idf
tf_idf = t(tf_idf)
tf_idf_2 = as.data.frame((tf_idf))
tf_idf_2[is.na(tf_idf_2)] <- 0
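Since the same transformation is applied to both datasets, it could also be wrapped in a small helper function; a sketch under the same assumptions (the function name is illustrative, not the code used above):

compute_tf_idf <- function(dtm) {
  m <- as.matrix(dtm)
  # Term frequency: counts divided by the total number of terms in each document
  tf <- m / rowSums(m)
  # Document frequency: number of documents containing each term at least once
  df <- colSums(m > 0)
  # Inverse document frequency over all documents
  idf <- log(nrow(m) / df)
  # TF-IDF: scale each term (column) by its IDF
  tf_idf <- sweep(tf, 2, idf, `*`)
  # Guard against empty documents producing NaN
  tf_idf[is.na(tf_idf)] <- 0
  as.data.frame(tf_idf)
}

tf_idf_1 <- compute_tf_idf(dtm_1words)
tf_idf_2 <- compute_tf_idf(dtm_2words)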
To choose the optimal number of clusters, the silhouette statistic will be used.
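For a single observation, with a(i) denoting the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster, the silhouette width is

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]

and the number of clusters that maximizes the average silhouette width over all observations is preferred.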
a <- fviz_nbclust(dtm_1words,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(dtm_1words,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(dtm_1words,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
For this dataset, PAM and CLARA should be run with 4 clusters. According to the silhouette analysis, k-means should be run with 3 clusters, but since we know there are 4 groups of films, k-means with 4 clusters will also be computed.
a <- fviz_nbclust(dtm_2words,kmeans,method = "s")+ggtitle("kmeans")
b <- fviz_nbclust(dtm_2words,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(dtm_2words,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
For this dataset, the analysis suggests running all the algorithms with 2 clusters, but this makes little sense since we have 4 pre-defined groups. Therefore the same clustering as for the “1+ words” dataset will be performed.
a <- fviz_nbclust(tf_idf_1,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(tf_idf_1,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_1,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
In this case, 2 clusters are optimal for each method. For PAM and CLARA, 3 clusters are also quite good, because the silhouette is almost as high as for 2 clusters; for these two methods 3 clusters will be chosen, since we have 4 pre-defined groups. K-means will be run with 2 clusters, and 4 clusters will also be checked.
For this dataset the same clustering as for the “TF-IDF for 1+ words” dataset will be performed.
a <- fviz_nbclust(tf_idf_2,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(tf_idf_2,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_2,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
The results for this dataset are very close to those for the “TF-IDF for 1+ words” dataset, so the same numbers of clusters will be checked.
After the silhouette statistic, the WSS (within-cluster sum of squares) statistic will be used to check whether it points to the same or a slightly different number of clusters.
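WSS measures the total squared distance of the observations from their cluster centers,

\[ WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \]

and the “elbow” of the WSS curve indicates a reasonable number of clusters.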
a <- fviz_nbclust(dtm_1words,kmeans,method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(dtm_1words,pam,method = "wss") + ggtitle("pam")
c <- fviz_nbclust(dtm_1words,clara,method = "wss") +ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(dtm_2words,kmeans,method = "wss")+ggtitle("k-means")
b <- fviz_nbclust(dtm_2words,pam,method = "wss")+ggtitle("pam")
c <- fviz_nbclust(dtm_2words,clara,method = "wss")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(tf_idf_1,kmeans,method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(tf_idf_1,pam,method = "wss") + ggtitle("pam")
c <- fviz_nbclust(tf_idf_1,clara,method = "wss") +ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(tf_idf_2,kmeans,method = "wss")+ggtitle("k-means")
b <- fviz_nbclust(tf_idf_2,pam,method = "wss")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_2,clara,method = "wss")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
The results of the silhouette and WSS statistics are similar, so the clustering will be carried out as described in the previous step. K-means, PAM, and CLARA will be used to cluster the data.
km3 <- eclust(dtm_1words,k=3,hc_metric = 'euclidean', graph = FALSE)
km4 <- eclust(dtm_1words,k=4,hc_metric = 'euclidean', graph = FALSE)
pam <- eclust(dtm_1words,'pam',k=4,hc_metric = 'euclidean', graph = FALSE)
clara <- eclust(dtm_1words,'clara',k=4,hc_metric = 'euclidean', graph = FALSE)
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(dtm_2words,k=3,hc_metric = 'euclidean')
km4 <- eclust(dtm_2words,k=4,hc_metric = 'euclidean')
pam <- eclust(dtm_2words,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(dtm_2words,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(tf_idf_1,k=3,hc_metric = 'euclidean')
km4 <- eclust(tf_idf_1,k=4,hc_metric = 'euclidean')
pam <- eclust(tf_idf_1,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(tf_idf_1,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(tf_idf_2,k=3,hc_metric = 'euclidean')
km4 <- eclust(tf_idf_2,k=4,hc_metric = 'euclidean')
pam <- eclust(tf_idf_2,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(tf_idf_2,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
After clustering the 4 different datasets, the only reasonable results were obtained on the “1+ words” dataset with the k-means algorithm.
km3 <- eclust(dtm_1words,k=3,hc_metric = 'euclidean')
km4 <- eclust(dtm_1words,k=4,hc_metric = 'euclidean')
km3_s <- fviz_silhouette(km3) + ggtitle('K-means with 3 clusters')
km4_s <- fviz_silhouette(km4) + ggtitle('K-means with 4 clusters')
km3$silinfo$clus.avg.widths
## [1] 0.00000000 0.01832987 0.15315103
km4$silinfo$clus.avg.widths
## [1] 0.00000000 0.11626586 0.13811792 -0.07792676
grid.arrange(arrangeGrob(km3_s,km4_s, ncol=1 , top = "Clustering"))
The silhouette plot shows that k-means with k = 4 produces a cluster with a negative average silhouette width, which means that this clustering is not very good. In the next step we will look at the clusters from both runs.
# Cross-tabulate the cluster assignments with the true film categories
kmeans_c1 = as.data.frame(km3_s$data$cluster)
colnames(kmeans_c1) <- c('cluster')
kmeans_c1$type = data$Category
kmeans_c1$counter = 1
kmeans_c2 = as.data.frame(km4_s$data$cluster)
colnames(kmeans_c2) <- c('cluster')
kmeans_c2$type = data$Category
kmeans_c2$counter = 1
kmeans_c1 %>% group_by(cluster, type) %>% summarise(group=sum(counter))
## # A tibble: 6 x 3
## # Groups: cluster [?]
## cluster type group
## <fct> <chr> <dbl>
## 1 1 [western] 1
## 2 2 [western] 9
## 3 3 [comedy] 12
## 4 3 [sport] 12
## 5 3 [war] 12
## 6 3 [western] 2
kmeans_c2 %>% group_by(cluster, type) %>% summarise(group=sum(counter))
## # A tibble: 7 x 3
## # Groups: cluster [?]
## cluster type group
## <fct> <chr> <dbl>
## 1 1 [western] 1
## 2 2 [western] 6
## 3 3 [comedy] 12
## 4 3 [sport] 12
## 5 3 [war] 10
## 6 3 [western] 5
## 7 4 [war] 2
The results of both attempts show one big cluster containing the majority of the observations. These results are not satisfactory, so other techniques will be used.
Another clustering method is hierarchical clustering. This agglomerative method repeatedly merges the observations (or clusters) that are most similar to each other, and stops when there is nothing left to merge. In this project the Euclidean distance with Ward linkage will be used.
Hierarchical clustering was also performed on the TF-IDF data, but the results were rather poor, so only the results on the document-term matrices will be presented, with different thresholds on word frequency. Datasets will be built with all the words that occur more than 1, 2, 3, 4, and 5 times.
m <- as.matrix(dtm_1words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=3)
m <- as.matrix(dtm_2words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_3words = dplyr::filter(dtm1, dtm1$counter > 3) %>% select(-counter)
col_names=dtm_3words$rowname
dtm_3words= dtm_3words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_3words) <- col_names
m <- as.matrix(dtm_3words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_4words = dplyr::filter(dtm1, dtm1$counter > 4) %>% select(-counter)
col_names=dtm_4words$rowname
dtm_4words= dtm_4words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_4words) <- col_names
m <- as.matrix(dtm_4words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_5words = dplyr::filter(dtm1, dtm1$counter > 5) %>% select(-counter)
col_names=dtm_5words$rowname
dtm_5words= dtm_5words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_5words) <- col_names
m <- as.matrix(dtm_5words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
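The five blocks above repeat the same pipeline with different frequency thresholds; the repetition could also be captured in a small helper function (a sketch; the function name and arguments are my own, not from the original):

plot_hclust_threshold <- function(dtm1, data, min_count, k) {
  # Keep words occurring more than min_count times in the whole collection
  dtm_t <- dplyr::filter(dtm1, counter > min_count) %>% select(-counter)
  word_names <- dtm_t$rowname
  # Transpose back to documents-in-rows form
  dtm_t <- dtm_t %>% select(-rowname) %>% t() %>% as.data.frame()
  colnames(dtm_t) <- word_names
  # Ward hierarchical clustering on Euclidean distances, labelled with the true genres
  groups <- hclust(dist(as.matrix(dtm_t), method = "euclidean"), method = "ward.D")
  groups$labels <- data$Category
  plot(groups, cex = 0.9, hang = -1)
  rect.hclust(groups, k = k)
}

# Equivalent to the 3+ threshold dendrogram above
plot_hclust_threshold(dtm1, data, min_count = 3, k = 4)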
As the analysis shows, changing the threshold has a big influence on the results. The best results are obtained for the datasets with thresholds 3+ and 4+: the clusters have a similar number of observations, and compared to k-means the quality of the clustering is really good.
As was said at the beginning, LDA is a generative probabilistic model. We have to define the number of topics, in our case 4. The result is, for each observation, the probability of its assignment to each topic.
# Number of topics
k <- 4
ldaOut <-LDA(dtm_1words,k)
ldatopics=as.data.frame(topics(ldaOut))
colnames(ldatopics) <- c('cluster')
ldatopics$type = data$Category
ldatopics %>% group_by(cluster,type) %>% summarise(group=n())
## # A tibble: 15 x 3
## # Groups: cluster [?]
## cluster type group
## <int> <chr> <int>
## 1 1 [comedy] 1
## 2 1 [sport] 5
## 3 1 [war] 2
## 4 1 [western] 2
## 5 2 [comedy] 4
## 6 2 [sport] 2
## 7 2 [war] 1
## 8 2 [western] 1
## 9 3 [comedy] 3
## 10 3 [war] 8
## 11 3 [western] 5
## 12 4 [comedy] 4
## 13 4 [sport] 5
## 14 4 [war] 1
## 15 4 [western] 4
The dataset was divided into 4 clusters, but the quality of the assignments is rather low. The first cluster contains films of every type, and the other clusters look similarly poor.
ldaOut.terms <- as.matrix(terms(ldaOut,6))
ldaOut.terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "team" "war" "war" "boxer"
## [2,] "ii" "love" "find" "stori"
## [3,] "outlaw" "one" "world" "outlaw"
## [4,] "world" "help" "stori" "love"
## [5,] "american" "retir" "help" "help"
## [6,] "son" "indian" "young" "old"
topicProbabilities <- as.data.frame(ldaOut@gamma)
head(topicProbabilities, 3)
## V1 V2 V3 V4
## 1 0.2485314 0.2485343 0.2565321 0.2464022
## 2 0.2503228 0.2503840 0.2505972 0.2486960
## 3 0.2539081 0.2517308 0.2462514 0.2481096
Furthermore, we can analyze which words are the most important in each topic. For example, for the first topic the most important words come from different film types. This shows that the combination of words assigned to a given topic is mixed up and that the words are not really connected with each other.
To sum up, the text data were clustered using several different methods. For text data the preprocessing step is very important, and different datasets were created in order to get better clustering results. K-means, PAM, and CLARA gave poor results on every dataset, and Latent Dirichlet Allocation also did not bring much value to the overall analysis. The only algorithm with satisfactory results was hierarchical clustering, in which the observations were divided into 4 fairly equal groups; moreover, each cluster consisted mostly of observations from one film genre. This analysis shows that to obtain the best possible results, different methods have to be tried.