Clustering documents, films, newspapers, or any other text data is becoming more and more popular. The main idea is to organize the documents into groups in which the observations are similar. In the paper “Clustering Search Keywords Using K-Means Clustering” by Randy Zwitch (https://randyzwitch.com/rsitecatalyst-k-means-clustering/), the k-means algorithm is used to obtain groups of topics based on the keywords that users type into search. Another example is an article by Hardeep Singh, “Clustering of text documents by implementation of K-means algorithms”, in which k-means and spherical k-means were compared on the 20 Newsgroups dataset; it turns out that spherical k-means works better for high-dimensional text data. In machine learning and natural language processing there are also topic models, statistical models used for discovering the topics that occur in a collection of documents. In this approach each document concerns multiple topics in different proportions.
When working with text data, a few preprocessing steps should also be carried out; the article “Efficient clustering of very large document collections” describes the operations performed at this stage.
There are several algorithms for clustering data. For document clustering the most popular is probably the k-means algorithm, which is often compared with spherical k-means. The difference between the two approaches is that in k-means the Euclidean distance is usually minimized:
\[d(p,q)= \sqrt{\sum_{i=1}^n (q_{i}-p_{i})^2}\]
whereas in spherical k-means the cosine dissimilarity is minimized:
\[ J = \sum_{i}\left(1-\cos(x_{i},p_{i})\right)\]
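As a small illustration, both dissimilarities can be computed in R on two toy term-count vectors (the vectors below are made up for this example and are not part of the dataset):

# Toy term-count vectors for two short documents
p <- c(war = 2, soldier = 1, love = 0, boxer = 0)
q <- c(war = 1, soldier = 0, love = 2, boxer = 1)
# Euclidean distance, as minimized by standard k-means
sqrt(sum((q - p)^2))
# Cosine dissimilarity, as minimized by spherical k-means
1 - sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))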
Since spherical k-means is intended for large collections of text documents and the dataset used here is small, plain k-means will be used instead. The other algorithms that will be applied are PAM (Partitioning Around Medoids), hierarchical clustering, and LDA (Latent Dirichlet Allocation).
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for text data. For each observation the model calculates the probability of belonging to each topic. The main idea is to define a finite number of topics to which the observations will then be assigned.
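In the standard formulation (the notation below is the usual one for LDA, not taken from the original text), each document first draws a topic distribution, and every word in it then draws a topic and a word from that topic:

\[ \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}) \]

where θ_d is the topic mixture of document d, z_{d,n} is the topic of its n-th word, and β_k is the word distribution of topic k.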
In this case, different types of films will be clustered based on their short descriptions. The dataset consists of 48 observations, each of which is a short description of one film. There are 4 different types of films (12 observations per category) in the collection: westerns, comedies, sport films, and war films.
The data were collected from the IMDB site, which contains information about films and TV series.
What will be done?
Several steps will be taken to get the best possible clustering: loading and preprocessing the data, building the document-term and TF-IDF datasets, choosing the number of clusters, and running the clustering algorithms.
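The code in the following sections assumes that the required packages are already attached. A minimal setup sketch, inferred from the functions used (the exact package list is an assumption, since the setup chunk is not shown):

library(openxlsx)    # read.xlsx
library(tm)          # removePunctuation, removeWords, stopwords, stemDocument
library(corpus)      # as_corpus_frame, term_matrix
library(dplyr)       # filter, select, arrange, count, group_by, summarise
library(tibble)      # rownames_to_column
library(ggplot2)     # ggtitle
library(factoextra)  # fviz_nbclust, eclust, fviz_cluster, fviz_silhouette
library(cluster)     # pam, clara
library(gridExtra)   # grid.arrange, arrangeGrob
library(topicmodels) # LDA, topics, terms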
data <- read.xlsx("filmy_48.xlsx")
head(data,2)
## Category
## 1 [western]
## 2 [western]
## Description
## 1 A bounty hunting scam joins two men in an uneasy alliance against a third in a race to find a fortune in gold buried in a remote cemetery.
## 2 With the help of a German bounty hunter, a freed slave sets out to rescue his wife from a brutal Mississippi plantation owner.
The dataset consists of two columns: the film category and its short description.
# Normalize the descriptions: lower case, remove punctuation and English stop words, then stem
data$Description <- tolower(data$Description)
data$Description <- removePunctuation(data$Description)
data$Description <- removeWords(data$Description, words=stopwords())
data$Description <- stemDocument(data$Description)
corpus = as_corpus_frame(data$Description)
# Tokenization
dtm <- term_matrix(corpus)
dtm = as.data.frame(as.matrix(dtm))
head(corpus)
## text
## 1 bounti hunt scam join two men uneasi allianc third race find fortun gold…
## 2 help german bounti hunter freed slave set rescu wife brutal mississippi …
## 3 two bounti hunter intent team track western outlaw
## 4 retir old west gunsling william munni reluct take one last job help old …
## 5 wyom earli 1900s butch cassidi sundanc kid leader band outlaw train robb…
## 6 senat becam famous kill notori outlaw return funer old friend tell truth…
After preprocessing, the descriptions look normalized and ready for tokenization.
After tokenization the dataset consists of 48 rows (films) and 624 variables. The next step is to eliminate low-frequency words. To choose the threshold, let us look at some statistics of the data.
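The stated dimensions can be checked directly on the document-term matrix built above (a quick sketch):

# 48 documents (rows) and 624 distinct stemmed terms (columns)
dim(dtm)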
# Transpose so that words become rows, then count total occurrences of each word
dtm1=t(dtm)
dtm1=as.data.frame(dtm1)
dtm1$counter = rowSums(dtm1)
dtm1 = rownames_to_column(dtm1,var='rowname')
# Distribution of word frequencies across the whole collection
word_counts =as.data.frame(dtm1$counter)
colnames(word_counts) <- c("count")
word_counts %>% count(count)
## # A tibble: 10 x 2
## count n
## <dbl> <int>
## 1 1 420
## 2 2 58
## 3 3 20
## 4 4 13
## 5 5 5
## 6 6 2
## 7 7 2
## 8 8 1
## 9 9 1
## 10 14 1
It turns out that there are 420 words that occur only once across all the short descriptions, 58 words that occur twice, and so on. Words that occur once will be removed from the dataset. In the further analysis I would also like to compare the results obtained when the words that occur twice are kept with the results obtained when they are removed as well.
As for the high-frequency words, most of them were stop words, which were removed in the previous step. The remaining words with the highest frequency are:
dtm1_hf = select(dtm1,rowname,counter)
dtm1_hf %>% arrange(desc(counter)) %>% head(10)
## rowname counter
## 1 war 14
## 2 help 9
## 3 stori 8
## 4 find 7
## 5 world 7
## 6 love 6
## 7 outlaw 6
## 8 boxer 5
## 9 ii 5
## 10 indian 5
The most frequent word is “war”. In this case I think we should not remove it, because it is an indicator of the war movies; instead, let us look at the most popular words in each of the four groups.
## word counter
## 1 outlaw 6
## 2 bounti 4
## 3 find 3
## 4 help 3
## 5 hunter 3
## word counter
## 1 stori 7
## 2 boxer 5
## 3 champion 4
## 4 indian 4
## 5 team 4
## word counter
## 1 love 6
## 2 help 4
## 3 find 3
## 4 will 3
## 5 woman 3
## word counter
## 1 war 12
## 2 world 6
## 3 ii 5
## 4 plan 3
## 5 soldier 3
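The four tables above list the most frequent words within each genre; the code that produced them is not shown. A minimal sketch of how such tables could be built from the objects defined above (the tidyr dependency and the variable names are my own, not from the original):

library(tidyr)
# Sum word counts within each film category
genre_counts <- dtm %>%
  mutate(type = data$Category) %>%
  group_by(type) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(-type, names_to = "word", values_to = "counter")
# Five most frequent words for one genre, e.g. westerns
genre_counts %>%
  filter(type == "[western]") %>%
  arrange(desc(counter)) %>%
  head(5)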
Let us take a look at what the dataset looks like, and also at the data processed with TF-IDF. At this point we create two datasets with raw word counts and two datasets after the TF-IDF transformation:
Structure of the data frame: 48 rows and 103 columns.
# Keep words that occur more than once and transpose back to documents-in-rows form
dtm_1words = dplyr::filter(dtm1, dtm1$counter > 1) %>% select(-counter)
col_names=dtm_1words$rowname
dtm_1words= dtm_1words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_1words) <- col_names
head(dtm_1words, 10)[1:10]
## american around assassin becam becom behind bloodi bounti boxer boy
## V1 0 0 0 0 0 0 0 1 0 0
## V2 0 0 0 0 0 0 0 1 0 0
## V3 0 0 0 0 0 0 0 1 0 0
## V4 0 0 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0 0 0
## V6 0 0 0 1 0 0 0 0 0 0
## V7 0 0 0 0 0 0 0 0 0 0
## V8 1 1 0 0 0 0 0 0 0 0
## V9 1 0 0 0 0 0 0 0 0 0
## V10 0 1 0 0 0 0 0 0 0 0
Structure of the data frame: 48 rows and 45 columns.
dtm_2words = dplyr::filter(dtm1, dtm1$counter > 2) %>% select(-counter)
col_names2=dtm_2words$rowname
dtm_2words= dtm_2words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_2words) <- col_names2
head(dtm_2words, 10)[1:10]
## american around becom bounti boxer boy brother camp champion escap
## V1 0 0 0 1 0 0 0 0 0 0
## V2 0 0 0 1 0 0 0 0 0 0
## V3 0 0 0 1 0 0 0 0 0 0
## V4 0 0 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0 0 1
## V6 0 0 0 0 0 0 0 0 0 0
## V7 0 0 0 0 0 0 0 0 0 0
## V8 1 1 0 0 0 0 0 0 0 0
## V9 1 0 0 0 0 0 1 0 0 0
## V10 0 1 0 0 0 0 0 0 0 0
TF-IDF (term frequency – inverse document frequency) is a method of converting a document-term matrix, whose values are raw counts of each word, into values that represent how important a word is in a given document relative to the whole collection. TF-IDF consists of two components: the term frequency (TF) and the inverse document frequency (IDF).
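The standard definitions of the two components, consistent with the calculation below (the notation is introduced here only for clarity), are

\[ tf(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad idf(t) = \log\frac{N}{|\{d : t \in d\}|}, \qquad tfidf(t,d) = tf(t,d)\cdot idf(t) \]

where the numerator of TF is the count of term t in document d, the denominator is the total number of terms in that document, and N is the number of documents, here 48.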
Example:
# Calculating TF: word counts divided by the number of words in each document
number = rowSums(dtm_1words)
freq=((dtm_1words) / (number))
# Calculating the denominator of IDF: in how many documents each word occurs
df=dtm_1words
df[df > 1] <- 1
dt= rowSums(t(df))
# Counter - number of documents
counter = 48
# IDF
idf = log(counter/dt)
# TF - IDF
tf_idf = t(freq) * idf
tf_idf = t(tf_idf)
tf_idf_1 = as.data.frame((tf_idf))
tf_idf_1[is.na(tf_idf_1)] <- 0
# Calculating TF: word counts divided by the number of words in each document
number = rowSums(dtm_2words)
freq=((dtm_2words) / (number))
# Calculating the denominator of IDF: in how many documents each word occurs
df=dtm_2words
df[df > 1] <- 1
dt= rowSums(t(df))
# Counter - number of documents
counter = 48
# IDF
idf = log(counter/dt)
# TF - IDF
tf_idf = t(freq) * idf
tf_idf = t(tf_idf)
tf_idf_2 = as.data.frame((tf_idf))
tf_idf_2[is.na(tf_idf_2)] <- 0
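Since the same transformation is applied to both datasets, it could also be wrapped in a small helper function; a sketch under the same assumptions (the function name is illustrative, not the code used above):

compute_tf_idf <- function(dtm) {
  m <- as.matrix(dtm)
  # Term frequency: counts divided by the total number of terms in each document
  tf <- m / rowSums(m)
  # Document frequency: number of documents containing each term at least once
  df <- colSums(m > 0)
  # Inverse document frequency over all documents
  idf <- log(nrow(m) / df)
  # TF-IDF: scale each term (column) by its IDF
  tf_idf <- sweep(tf, 2, idf, `*`)
  # Guard against empty documents producing NaN
  tf_idf[is.na(tf_idf)] <- 0
  as.data.frame(tf_idf)
}

tf_idf_1 <- compute_tf_idf(dtm_1words)
tf_idf_2 <- compute_tf_idf(dtm_2words)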
To choose the optimal number of clusters, the silhouette statistic will be used.
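For a single observation, with a(i) denoting the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster, the silhouette width is

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]

and the number of clusters that maximizes the average silhouette width over all observations is preferred.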
a <- fviz_nbclust(dtm_1words,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(dtm_1words,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(dtm_1words,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
For this dataset, PAM and CLARA should be run with 4 clusters. According to the silhouette analysis, k-means should be run with 3 clusters, but since we know there are 4 groups of films, k-means with 4 clusters will also be computed.
a <- fviz_nbclust(dtm_2words,kmeans,method = "s")+ggtitle("kmeans")
b <- fviz_nbclust(dtm_2words,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(dtm_2words,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
For this dataset, the analysis suggests running all the algorithms with 2 clusters, but this makes little sense since we have 4 pre-defined groups. Therefore the same clustering as for the “1+ words” dataset will be performed.
a <- fviz_nbclust(tf_idf_1,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(tf_idf_1,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_1,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
In this case, 2 clusters are optimal for each method. For PAM and CLARA, 3 clusters are also quite good, because the silhouette is almost as high as for 2 clusters; for these two methods 3 clusters will be chosen, since we have 4 pre-defined groups. K-means will be run with 2 clusters, and 4 clusters will also be checked.
For this dataset the same clustering as for the “TF-IDF for 1+ words” dataset will be performed.
a <- fviz_nbclust(tf_idf_2,kmeans,method = "s") +ggtitle("kmeans")
b <- fviz_nbclust(tf_idf_2,pam,method = "s")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_2,clara,method = "s")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
The results for this dataset are very close to those for the “TF-IDF for 1+ words” dataset, so the same numbers of clusters will be checked.
After the silhouette statistic, the WSS (within-cluster sum of squares) statistic will be used to check whether it points to the same or a slightly different number of clusters.
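WSS measures the total squared distance of the observations from their cluster centers,

\[ WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \]

and the “elbow” of the WSS curve indicates a reasonable number of clusters.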
a <- fviz_nbclust(dtm_1words,kmeans,method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(dtm_1words,pam,method = "wss") + ggtitle("pam")
c <- fviz_nbclust(dtm_1words,clara,method = "wss") +ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(dtm_2words,kmeans,method = "wss")+ggtitle("k-means")
b <- fviz_nbclust(dtm_2words,pam,method = "wss")+ggtitle("pam")
c <- fviz_nbclust(dtm_2words,clara,method = "wss")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(tf_idf_1,kmeans,method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(tf_idf_1,pam,method = "wss") + ggtitle("pam")
c <- fviz_nbclust(tf_idf_1,clara,method = "wss") +ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
a <- fviz_nbclust(tf_idf_2,kmeans,method = "wss")+ggtitle("k-means")
b <- fviz_nbclust(tf_idf_2,pam,method = "wss")+ggtitle("pam")
c <- fviz_nbclust(tf_idf_2,clara,method = "wss")+ggtitle("clara")
grid.arrange(a,b,c, ncol=2, top = "Optimal number of clusters")
The results of the silhouette and WSS statistics are similar, so the clustering will be carried out as described in the previous step. K-means, PAM, and CLARA will be used to cluster the data.
km3 <- eclust(dtm_1words,k=3,hc_metric = 'euclidean', graph = FALSE)
km4 <- eclust(dtm_1words,k=4,hc_metric = 'euclidean', graph = FALSE)
pam <- eclust(dtm_1words,'pam',k=4,hc_metric = 'euclidean', graph = FALSE)
clara <- eclust(dtm_1words,'clara',k=4,hc_metric = 'euclidean', graph = FALSE)
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(dtm_2words,k=3,hc_metric = 'euclidean')
km4 <- eclust(dtm_2words,k=4,hc_metric = 'euclidean')
pam <- eclust(dtm_2words,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(dtm_2words,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(tf_idf_1,k=3,hc_metric = 'euclidean')
km4 <- eclust(tf_idf_1,k=4,hc_metric = 'euclidean')
pam <- eclust(tf_idf_1,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(tf_idf_1,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
km3 <- eclust(tf_idf_2,k=3,hc_metric = 'euclidean')
km4 <- eclust(tf_idf_2,k=4,hc_metric = 'euclidean')
pam <- eclust(tf_idf_2,'pam',k=4,hc_metric = 'euclidean')
clara <- eclust(tf_idf_2,'clara',k=4,hc_metric = 'euclidean')
c1 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
c2 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, geom = c("point")) + ggtitle('PAM with 4 clusters')
c4 <- fviz_cluster(clara, geom = c("point")) + ggtitle('CLARA with 4 clusters')
grid.arrange(arrangeGrob(c1,c2,c3,c4, ncol=2 , top = "Clustering"))
After clustering the 4 different datasets, the only reasonable results were obtained on the “1+ words” dataset with the k-means algorithm.
km3 <- eclust(dtm_1words,k=3,hc_metric = 'euclidean')
km4 <- eclust(dtm_1words,k=4,hc_metric = 'euclidean')
km3_s <- fviz_silhouette(km3) + ggtitle('K-means with 3 clusters')
km4_s <- fviz_silhouette(km4) + ggtitle('K-means with 4 clusters')
km3$silinfo$clus.avg.widths
## [1] 0.00000000 0.01832987 0.15315103
km4$silinfo$clus.avg.widths
## [1] 0.00000000 0.11626586 0.13811792 -0.07792676
grid.arrange(arrangeGrob(km3_s,km4_s, ncol=1 , top = "Clustering"))
The silhouette plot shows that k-means with k = 4 produces a cluster with a negative average silhouette width, which means that this clustering is not very good. In the next step we will look at the clusters from both runs.
# Cross-tabulate the cluster assignments with the true film categories
kmeans_c1 = as.data.frame(km3_s$data$cluster)
colnames(kmeans_c1) <- c('cluster')
kmeans_c1$type = data$Category
kmeans_c1$counter = 1
kmeans_c2 = as.data.frame(km4_s$data$cluster)
colnames(kmeans_c2) <- c('cluster')
kmeans_c2$type = data$Category
kmeans_c2$counter = 1
kmeans_c1 %>% group_by(cluster, type) %>% summarise(group=sum(counter))
## # A tibble: 6 x 3
## # Groups: cluster [?]
## cluster type group
## <fct> <chr> <dbl>
## 1 1 [western] 1
## 2 2 [western] 9
## 3 3 [comedy] 12
## 4 3 [sport] 12
## 5 3 [war] 12
## 6 3 [western] 2
kmeans_c2 %>% group_by(cluster, type) %>% summarise(group=sum(counter))
## # A tibble: 7 x 3
## # Groups: cluster [?]
## cluster type group
## <fct> <chr> <dbl>
## 1 1 [western] 1
## 2 2 [western] 6
## 3 3 [comedy] 12
## 4 3 [sport] 12
## 5 3 [war] 10
## 6 3 [western] 5
## 7 4 [war] 2
The results of both attempts show one big cluster containing the majority of the observations. These results are not satisfactory, so other techniques will be used.
Another clustering method is hierarchical clustering. This agglomerative method repeatedly merges the observations (or clusters) that are most similar to each other, and stops when there is nothing left to merge. In this project the Euclidean distance with Ward linkage will be used.
Hierarchical clustering was also performed on the TF-IDF data, but the results were rather poor, so only the results on the document-term matrices will be presented, with different thresholds on word frequency. Datasets will be built with all the words that occur more than 1, 2, 3, 4, and 5 times.
m <- as.matrix(dtm_1words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=3)
m <- as.matrix(dtm_2words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_3words = dplyr::filter(dtm1, dtm1$counter > 3) %>% select(-counter)
col_names=dtm_3words$rowname
dtm_3words= dtm_3words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_3words) <- col_names
m <- as.matrix(dtm_3words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_4words = dplyr::filter(dtm1, dtm1$counter > 4) %>% select(-counter)
col_names=dtm_4words$rowname
dtm_4words= dtm_4words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_4words) <- col_names
m <- as.matrix(dtm_4words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
dtm_5words = dplyr::filter(dtm1, dtm1$counter > 5) %>% select(-counter)
col_names=dtm_5words$rowname
dtm_5words= dtm_5words %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_5words) <- col_names
m <- as.matrix(dtm_5words)
distMatrix <- dist(m, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
groups$labels= data$Category
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=4)
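The five blocks above repeat the same pipeline with different frequency thresholds; the repetition could also be captured in a small helper function (a sketch; the function name and arguments are my own, not from the original):

plot_hclust_threshold <- function(dtm1, data, min_count, k) {
  # Keep words occurring more than min_count times in the whole collection
  dtm_t <- dplyr::filter(dtm1, counter > min_count) %>% select(-counter)
  word_names <- dtm_t$rowname
  # Transpose back to documents-in-rows form
  dtm_t <- dtm_t %>% select(-rowname) %>% t() %>% as.data.frame()
  colnames(dtm_t) <- word_names
  # Ward hierarchical clustering on Euclidean distances, labelled with the true genres
  groups <- hclust(dist(as.matrix(dtm_t), method = "euclidean"), method = "ward.D")
  groups$labels <- data$Category
  plot(groups, cex = 0.9, hang = -1)
  rect.hclust(groups, k = k)
}

# Equivalent to the 3+ threshold dendrogram above
plot_hclust_threshold(dtm1, data, min_count = 3, k = 4)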
As the analysis shows, changing the threshold has a big influence on the results. The best results are obtained for the datasets with thresholds 3+ and 4+: the clusters have a similar number of observations, and compared to k-means the quality of the clustering is really good.
As was said at the beginning, LDA is a generative probabilistic model. We have to define the number of topics, in our case 4. The result is, for each observation, the probability of its assignment to each topic.
# Number of topics
k <- 4
ldaOut <-LDA(dtm_1words,k)
ldatopics=as.data.frame(topics(ldaOut))
colnames(ldatopics) <- c('cluster')
ldatopics$type = data$Category
ldatopics %>% group_by(cluster,type) %>% summarise(group=n())
## # A tibble: 15 x 3
## # Groups: cluster [?]
## cluster type group
## <int> <chr> <int>
## 1 1 [comedy] 1
## 2 1 [sport] 5
## 3 1 [war] 2
## 4 1 [western] 2
## 5 2 [comedy] 4
## 6 2 [sport] 2
## 7 2 [war] 1
## 8 2 [western] 1
## 9 3 [comedy] 3
## 10 3 [war] 8
## 11 3 [western] 5
## 12 4 [comedy] 4
## 13 4 [sport] 5
## 14 4 [war] 1
## 15 4 [western] 4
The dataset was divided into 4 clusters, but the quality of the assignments is rather low. The first cluster contains films of every type, and the other clusters look similarly poor.
ldaOut.terms <- as.matrix(terms(ldaOut,6))
ldaOut.terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "team" "war" "war" "boxer"
## [2,] "ii" "love" "find" "stori"
## [3,] "outlaw" "one" "world" "outlaw"
## [4,] "world" "help" "stori" "love"
## [5,] "american" "retir" "help" "help"
## [6,] "son" "indian" "young" "old"
topicProbabilities <- as.data.frame(ldaOut@gamma)
head(topicProbabilities, 3)
## V1 V2 V3 V4
## 1 0.2485314 0.2485343 0.2565321 0.2464022
## 2 0.2503228 0.2503840 0.2505972 0.2486960
## 3 0.2539081 0.2517308 0.2462514 0.2481096
Furthermore, we can analyze which words are the most important in each topic. For example, for the first topic the most important words come from different film types. This shows that the combination of words assigned to a given topic is mixed up and that the words are not really connected with each other.
To sum up, the text data were clustered using several different methods. For text data the preprocessing step is very important, and different datasets were created in order to get better clustering results. K-means, PAM, and CLARA gave poor results on every dataset, and Latent Dirichlet Allocation also did not bring much value to the overall analysis. The only algorithm with satisfactory results was hierarchical clustering, in which the observations were divided into 4 fairly equal groups; moreover, each cluster consisted mostly of observations from one film genre. This analysis shows that to obtain the best possible results, different methods have to be tried.