Text Mining Clustering Assignment

Si Tang Lin

476912

Introduction

By using Text Mining technique and unsupervised learning method to analysis text reviews and name main industry branches.

Prerequisite

## Loading required package: Matrix
## 
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
## 
##     update
## The following object is masked from 'package:stats':
## 
##     update
## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Load Dataset and Key Insight

# Load text reviews data
text <- read.csv("/Users/ninalin/Desktop/dsba/winter semester 25/text mining/hw/textreviews.csv")

View(text)
str(text)
## 'data.frame':    2000 obs. of  2 variables:
##  $ id  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ text: chr  "Warn and super soft. love it !" "Love the high waist, prewashed softness, and relaxed fit. i normally wear a 27 in pilcro but sized down to a 26"| __truncated__ "Nice jeans, but had to return. too tight in hips/thighs and big in waist." "Im 5'1\" and about 110lbs. ordered the small because i do have some curves- it was huge- more like a large and "| __truncated__ ...
summary(text)
##        id             text          
##  Min.   :   1.0   Length:2000       
##  1st Qu.: 500.8   Class :character  
##  Median :1000.5   Mode  :character  
##  Mean   :1000.5                     
##  3rd Qu.:1500.2                     
##  Max.   :2000.0

Text Data Preprocessing

# Corpus
corp <- corpus(text$text)
# Consist of 2000 documents
# Separate into tokens
tokens <- tokens(corp,
               remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        tokens_tolower() %>%
        tokens_select(pattern = stopwords("en"), selection = "remove")
# Each contain different length of tokens
# Remove Stopword
stopword <- c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart"))
tokens <- tokens_select(tokens, pattern = stopword, selection = "remove")

# Save the above procedure to Clean_text
clean_text <- sapply(tokens, function(x) paste(x, collapse = " "))

Document Term Matrix

# Document Term Matrix- Create (Document* Term) matrix
text_DTM <- CreateDtm(doc_vec = clean_text,
                      doc_names = text$id,
                      ngram_window = c(1, 2), 
                      verbose = TRUE,
                      cpus = 2)

TF-IDF

# Transform into TF-IDF
tfidf_transformer <- TfIdf$new()
text_tfidf <- tfidf_transformer$fit_transform(text_DTM)

dim(text_tfidf)
## [1]  2000 28720
# 2000 documents * 28720 terms(and weight of each term)

Number set for optimal K

set.seed(42)
svd_k <- 100
svd_res <- irlba(text_tfidf, nv = svd_k)
doc_emb <- svd_res$u %*% diag(svd_res$d)

# Elbow Method
fviz_nbclust(doc_emb, kmeans, method = "wss") + ggtitle("Elbow Method")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Used elbow method to find the optimal number of cluster. The graph indicates that the optimal number of cluster is 4.

K-means method

set.seed(42)
k_chosen <- 4
km <- kmeans(doc_emb, centers = k_chosen, nstart = 50)
table(km$cluster)
## 
##    1    2    3    4 
##    1    1  262 1736
text$cluster <- km$cluster
for(i in 1:k_chosen){
  idx <- which(text$cluster == i)
  top_terms <- sort(colSums(as.matrix(text_tfidf[idx, ])), decreasing = TRUE)
  cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
  print(head(names(top_terms), 15))
}
## 
## ===== Cluster 1 (n = 1 ) =====
## NULL
## 
## ===== Cluster 2 (n = 1 ) =====
## NULL
## 
## ===== Cluster 3 (n = 262 ) =====
##  [1] "game"         "good"         "fun"          "awesome"      "addictive"   
##  [6] "great_game"   "great"        "good_game"    "good_good"    "cool"        
## [11] "awesome_game" "fun_fun"      "fun_game"     "great_great"  "love"        
## 
## ===== Cluster 4 (n = 1736 ) =====
##  [1] "game"      "love"      "great"     "top"       "dress"     "fun"      
##  [7] "size"      "good"      "play"      "app"       "nice"      "fit"      
## [13] "time"      "love_game" "color"

Clusters collapse- because the dataset is dominated by certain topic, making it difficult for k-means to divide the clusters. Thus, cluster 1 and 2 remain empty while cluster 4 absorb most samples(n=1736). So we apply Umap to reduce the data in a non-linear way.

Uniform Manifold Approximation Projection

um <- umap(doc_emb)
km_umap <- kmeans(um$layout, centers = 3, nstart = 50)
table(km_umap$cluster)
## 
##    1    2    3 
##  389 1034  577

The cluster distribution is much balanced after UMAP algorithm. We then want to check the related features in each cluster.

#Top Term of each cluster
text$cluster_umap <- km_umap$cluster

for (i in 1:3) {
  idx <- which(text$cluster_umap == i)  
  top_terms <- sort(colSums(as.matrix(text_tfidf[idx, , drop = FALSE])), decreasing = TRUE)
  
  cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
  print(head(names(top_terms), 18)) 
}
## 
## ===== Cluster 1 (n = 389 ) =====
##  [1] "top"         "size"        "dress"       "obsessed"    "beautiful"  
##  [6] "fit"         "small"       "perfect"     "wear"        "cute"       
## [11] "big"         "color"       "pretty"      "flattering"  "fabric"     
## [16] "keeper"      "comfortable" "sweater"    
## 
## ===== Cluster 2 (n = 1034 ) =====
##  [1] "game"         "good"         "fun"          "addictive"    "awesome"     
##  [6] "great"        "great_game"   "love"         "good_game"    "cool"        
## [11] "love_game"    "amazing"      "awesome_game" "play"         "good_good"   
## [16] "app"          "time"         "fun_game"    
## 
## ===== Cluster 3 (n = 577 ) =====
##  [1] "love"        "great"       "nice"        "dress"       "love_love"  
##  [6] "great_great" "top"         "size"        "fit"         "color"      
## [11] "wear"        "soft"        "fabric"      "perfect"     "shirt"      
## [16] "lots"        "comfortable" "flattering"

Cluster 1 indicates apparel and fashion-related comments, possibly in game contexts.

Cluster 2 focuses on detailed descriptions of clothing and fashion attributes-such as color and fabric.

Cluster 3 is clearly related to games and entertainment.

Plot Visaulization

# Cluster visualization
plot_df <- data.frame(x = um$layout[,1], y = um$layout[,2], cluster = factor(km_umap$cluster))

ggplot(plot_df, aes(x = x, y = y, color = cluster)) +
  geom_point(alpha = 0.6) + theme_minimal() +
  ggtitle("K-Means Clusters with UMAP projection")

Conclusion

Upon investigation, we noticed an unevenly distributed across topic by using K-Means method. Therefore, we applied UMAP method to improve the performance. The result shows three well-distributed clusters: One related to fashion and appearance, another presents similar result but much toward clothing and fabric, while the last cluster is mainly about entertainment and games.