Text Mining Homework

Text Mining Clustering Assignment

Si Tang Lin

476912

Introduction

By using Text Mining technique and unsupervised learning method to analysis text reviews and name main industry branches.

Prerequisite

## Loading required package: Matrix

## 
## Attaching package: 'textmineR'

## The following object is masked from 'package:Matrix':
## 
##     update

## The following object is masked from 'package:stats':
## 
##     update

## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1

## Parallel computing: disabled

## See https://quanteda.io for tutorials and examples.

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Load Dataset and Key Insight

# Load text reviews data
text <- read.csv("/Users/ninalin/Desktop/dsba/winter semester 25/text mining/hw/textreviews.csv")

View(text)
str(text)

## 'data.frame':    2000 obs. of  2 variables:
##  $ id  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ text: chr  "Warn and super soft. love it !" "Love the high waist, prewashed softness, and relaxed fit. i normally wear a 27 in pilcro but sized down to a 26"| __truncated__ "Nice jeans, but had to return. too tight in hips/thighs and big in waist." "Im 5'1\" and about 110lbs. ordered the small because i do have some curves- it was huge- more like a large and "| __truncated__ ...

summary(text)

##        id             text          
##  Min.   :   1.0   Length:2000       
##  1st Qu.: 500.8   Class :character  
##  Median :1000.5   Mode  :character  
##  Mean   :1000.5                     
##  3rd Qu.:1500.2                     
##  Max.   :2000.0

Text Data Preprocessing

# Corpus
corp <- corpus(text$text)
# Consist of 2000 documents

# Separate into tokens
tokens <- tokens(corp,
               remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        tokens_tolower() %>%
        tokens_select(pattern = stopwords("en"), selection = "remove")
# Each contain different length of tokens
# Remove Stopword
stopword <- c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart"))
tokens <- tokens_select(tokens, pattern = stopword, selection = "remove")

# Save the above procedure to Clean_text
clean_text <- sapply(tokens, function(x) paste(x, collapse = " "))

Document Term Matrix

# Document Term Matrix- Create (Document* Term) matrix
text_DTM <- CreateDtm(doc_vec = clean_text,
                      doc_names = text$id,
                      ngram_window = c(1, 2), 
                      verbose = TRUE,
                      cpus = 2)

TF-IDF

# Transform into TF-IDF
tfidf_transformer <- TfIdf$new()
text_tfidf <- tfidf_transformer$fit_transform(text_DTM)

dim(text_tfidf)

## [1]  2000 28720

# 2000 documents * 28720 terms(and weight of each term)

Number set for optimal K

set.seed(42)
svd_k <- 100
svd_res <- irlba(text_tfidf, nv = svd_k)
doc_emb <- svd_res$u %*% diag(svd_res$d)

# Elbow Method
fviz_nbclust(doc_emb, kmeans, method = "wss") + ggtitle("Elbow Method")

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Used elbow method to find the optimal number of cluster. The graph indicates that the optimal number of cluster is 4.

K-means method

set.seed(42)
k_chosen <- 4
km <- kmeans(doc_emb, centers = k_chosen, nstart = 50)
table(km$cluster)

## 
##    1    2    3    4 
##    1    1  262 1736

text$cluster <- km$cluster
for(i in 1:k_chosen){
  idx <- which(text$cluster == i)
  top_terms <- sort(colSums(as.matrix(text_tfidf[idx, ])), decreasing = TRUE)
  cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
  print(head(names(top_terms), 15))
}

## 
## ===== Cluster 1 (n = 1 ) =====
## NULL
## 
## ===== Cluster 2 (n = 1 ) =====
## NULL
## 
## ===== Cluster 3 (n = 262 ) =====
##  [1] "game"         "good"         "fun"          "awesome"      "addictive"   
##  [6] "great_game"   "great"        "good_game"    "good_good"    "cool"        
## [11] "awesome_game" "fun_fun"      "fun_game"     "great_great"  "love"        
## 
## ===== Cluster 4 (n = 1736 ) =====
##  [1] "game"      "love"      "great"     "top"       "dress"     "fun"      
##  [7] "size"      "good"      "play"      "app"       "nice"      "fit"      
## [13] "time"      "love_game" "color"

Clusters collapse- because the dataset is dominated by certain topic, making it difficult for k-means to divide the clusters. Thus, cluster 1 and 2 remain empty while cluster 4 absorb most samples(n=1736). So we apply Umap to reduce the data in a non-linear way.

Uniform Manifold Approximation Projection

um <- umap(doc_emb)
km_umap <- kmeans(um$layout, centers = 3, nstart = 50)
table(km_umap$cluster)

## 
##    1    2    3 
##  389 1034  577

The cluster distribution is much balanced after UMAP algorithm. We then want to check the related features in each cluster.

#Top Term of each cluster
text$cluster_umap <- km_umap$cluster

for (i in 1:3) {
  idx <- which(text$cluster_umap == i)  
  top_terms <- sort(colSums(as.matrix(text_tfidf[idx, , drop = FALSE])), decreasing = TRUE)
  
  cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
  print(head(names(top_terms), 18)) 
}

## 
## ===== Cluster 1 (n = 389 ) =====
##  [1] "top"         "size"        "dress"       "obsessed"    "beautiful"  
##  [6] "fit"         "small"       "perfect"     "wear"        "cute"       
## [11] "big"         "color"       "pretty"      "flattering"  "fabric"     
## [16] "keeper"      "comfortable" "sweater"    
## 
## ===== Cluster 2 (n = 1034 ) =====
##  [1] "game"         "good"         "fun"          "addictive"    "awesome"     
##  [6] "great"        "great_game"   "love"         "good_game"    "cool"        
## [11] "love_game"    "amazing"      "awesome_game" "play"         "good_good"   
## [16] "app"          "time"         "fun_game"    
## 
## ===== Cluster 3 (n = 577 ) =====
##  [1] "love"        "great"       "nice"        "dress"       "love_love"  
##  [6] "great_great" "top"         "size"        "fit"         "color"      
## [11] "wear"        "soft"        "fabric"      "perfect"     "shirt"      
## [16] "lots"        "comfortable" "flattering"

Cluster 1 indicates apparel and fashion-related comments, possibly in game contexts.

Cluster 2 focuses on detailed descriptions of clothing and fashion attributes-such as color and fabric.

Cluster 3 is clearly related to games and entertainment.

Plot Visaulization

# Cluster visualization
plot_df <- data.frame(x = um$layout[,1], y = um$layout[,2], cluster = factor(km_umap$cluster))

ggplot(plot_df, aes(x = x, y = y, color = cluster)) +
  geom_point(alpha = 0.6) + theme_minimal() +
  ggtitle("K-Means Clusters with UMAP projection")

Conclusion

Upon investigation, we noticed an unevenly distributed across topic by using K-Means method. Therefore, we applied UMAP method to improve the performance. The result shows three well-distributed clusters: One related to fashion and appearance, another presents similar result but much toward clothing and fabric, while the last cluster is mainly about entertainment and games.

Text Mining Homework_Text Reviews

Si Tang Lin

2025-11-11

Text Mining Clustering Assignment

Si Tang Lin

476912

Introduction

Prerequisite

Load Dataset and Key Insight

Text Data Preprocessing

Document Term Matrix

TF-IDF

Number set for optimal K

K-means method

Uniform Manifold Approximation Projection

Plot Visaulization

Conclusion