Bonus_text_files - Khon Nguyen

First, I load the text files.

# Load library tm
library(tm)

## Warning: package 'tm' was built under R version 4.2.2

## Loading required package: NLP

#Load the text files
docs <- DirSource("H:/My Drive/Text Mining/5/Homework2/1.Bonus_text_files/text_files/")
docs <- VCorpus(docs)

Then I clean the text.

# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Replace non-word characters (for example ', " and -) with a space
docs <- tm_map(docs, toSpace, "\\W")

# Remove stand-alone single letters
docs <- tm_map(docs, toSpace, "\\b[A-Za-z]\\b")

# Preliminary cleaning, Cleaning text and Stopword removal ----
## Remove punctuation
docs <- tm_map(docs, removePunctuation)
## Remove numbers
docs <- tm_map(docs, removeNumbers)
## Lower all words
docs <- tm_map(docs, content_transformer(tolower))
## Remove all stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
## Strip white space
docs <- tm_map(docs, stripWhitespace)
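
To make the effect of the pattern-based cleaning easier to see, here is a minimal sketch in base R on a single made-up sentence. The sentence and the variable names (x, step1, step2, step3) are assumptions for illustration only and do not come from the homework files; perl = TRUE is used here so that the word-boundary matching is unambiguous.

```r
# Hypothetical sample sentence (not taken from the homework files)
x <- "It's a well-known fact: R is fun!"

# Replace every non-word character with a space
step1 <- gsub("\\W", " ", x, perl = TRUE)

# Replace stand-alone single letters with a space
step2 <- gsub("\\b[A-Za-z]\\b", " ", step1, perl = TRUE)

# Collapse repeated whitespace and trim both ends
step3 <- trimws(gsub("\\s+", " ", step2, perl = TRUE))
print(step3)
```

The leftover "s" from "It's" disappears in the single-letter step, which is the main reason to run that step after the non-word characters have been replaced.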

After that, I create a document-term matrix.

# Create document-term matrix ----
dtm <- DocumentTermMatrix(docs)
dtm

## <<DocumentTermMatrix (documents: 33, terms: 6747)>>
## Non-/sparse entries: 19923/202728
## Sparsity           : 91%
## Maximal term length: 19
## Weighting          : term frequency (tf)

m <- as.matrix(dtm)
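
The sparsity figure in the output above can be checked by hand. The short sketch below only re-uses the numbers printed for the document-term matrix; the variable names are my own.

```r
# Numbers copied from the printed summary of the document-term matrix
n_docs     <- 33
n_terms    <- 6747
non_sparse <- 19923
sparse     <- 202728

# Every cell of the matrix is either zero (sparse) or non-zero
total_cells <- n_docs * n_terms
stopifnot(total_cells == non_sparse + sparse)

# Share of zero cells, as a percentage rounded like the printed summary
sparsity_pct <- round(100 * sparse / total_cells)
print(sparsity_pct)
```

So roughly 9 out of 10 cells are zero, which is typical for a matrix with a few documents and several thousand terms.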

Next, I cluster the documents using hierarchical clustering. The task description says that there are 2 authors, so I set the number of clusters to 2.

# compute distance between document vectors
d <- dist(m)

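# run hierarchical clustering with the "ward.D" linkage
# (in hclust, "ward.D2" is the variant that implements Ward's criterion on unsquared distances)
hc <- hclust(d, method = "ward.D")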

# plot dendrogram
plot(hc, main = "Hierarchical clustering of docs",
     ylab = "", xlab = "", yaxt = "n")
rect.hclust(hc, k = 2)
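
To illustrate the same distance, clustering and cutting steps on something small enough to follow by eye, here is a minimal sketch with a made-up count matrix. The documents, terms and counts in m_toy are assumptions for illustration only; the functions (dist, hclust, cutree) are the same base R functions used in this report.

```r
# Hypothetical term counts: 4 documents, 4 terms (illustrative values only)
m_toy <- rbind(
  doc1 = c(vote = 5, state = 4, project = 0, data = 0),
  doc2 = c(vote = 4, state = 5, project = 1, data = 0),
  doc3 = c(vote = 0, state = 1, project = 5, data = 4),
  doc4 = c(vote = 0, state = 0, project = 4, data = 5)
)

# Euclidean distance between the document vectors
d_toy <- dist(m_toy)

# Hierarchical clustering with the same linkage as in the report
hc_toy <- hclust(d_toy, method = "ward.D")

# Cut the tree into 2 groups
groups_toy <- cutree(hc_toy, k = 2)
print(groups_toy)
```

In this toy case doc1 and doc2 share their vocabulary, as do doc3 and doc4, so the two pairs end up in different groups.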

After that, I use word clouds to inspect the content of each cluster. To do that, I split the document-term matrix into 2 parts according to the cluster labels and create a separate word cloud for each part.

#cut the dendrogram into 2 clusters
groups <- cutree(hc, k=2)

#append cluster labels to original data
final_data <- cbind(m, cluster = groups)
final_data <- as.data.frame(final_data)

c1 <- final_data[final_data$cluster == 1,]
c1$cluster <- NULL
c2 <- final_data[final_data$cluster == 2,]
c2$cluster <- NULL
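
As a side note, the per-cluster term totals can also be obtained in one step with the base R function rowsum(), without building a separate data frame per cluster. The sketch below assumes a small made-up matrix and made-up cluster labels (m_toy and groups_toy are hypothetical names and values), so it is an alternative illustration and not the code used for the results in this report.

```r
# Hypothetical term counts and cluster labels (illustrative values only)
m_toy <- rbind(
  doc1 = c(vote = 5, project = 0),
  doc2 = c(vote = 4, project = 1),
  doc3 = c(vote = 0, project = 5)
)
groups_toy <- c(doc1 = 1, doc2 = 1, doc3 = 2)

# Sum the rows of the matrix within each cluster:
# one row per cluster, one column per term
freq_by_cluster <- rowsum(m_toy, group = groups_toy)
print(freq_by_cluster)

# Term frequencies for cluster 1, ready to pass to a word cloud
freq_c1 <- freq_by_cluster["1", ]
```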

# Create word cloud
library(wordcloud)

## Warning: package 'wordcloud' was built under R version 4.2.2

## Loading required package: RColorBrewer

Word cloud for cluster 1

#Cluster 1
# Total frequency of each term across the documents in cluster 1
freq <- colSums(c1)
# Limit words in word cloud by specifying maximum number of words
wordcloud(names(freq), freq, max.words=50, rot.per=0.2, colors = brewer.pal(6, "Dark2"))

For cluster 1, the word cloud suggests that this author writes about politics. Many of the terms relate to American politics, for example: American, people, country, state, immigration, jobs, united, etc.

#Cluster 2
# Total frequency of each term across the documents in cluster 2
freq <- colSums(c2)
# Limit words in word cloud by specifying maximum number of words
wordcloud(names(freq), freq, max.words=50, rot.per=0.2, colors = brewer.pal(6, "Dark2"))

The second author appears to be a business person: the terms in his/her documents are mostly business terms, such as: business, management, project, problem, decisions, data, etc.
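
A reading of a word cloud can be cross-checked by listing the most frequent terms as numbers. The sketch below shows the idea on a made-up frequency vector; freq_toy and its values are assumptions for illustration only, and in the report the same two lines could be applied to the freq vector of each cluster.

```r
# Hypothetical term frequencies for one cluster (illustrative values only)
freq_toy <- c(business = 40, data = 25, project = 31, problem = 12, team = 7)

# Sort from most to least frequent and keep the top 3 terms
top_terms <- head(sort(freq_toy, decreasing = TRUE), 3)
print(top_terms)
```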

Bonus_text_files - Khon Nguyen - 444135

Khon Hoang Nguyen

11/12/2022