By using Text Mining technique and unsupervised learning method to analysis text reviews and name main industry branches.
## Loading required package: Matrix
##
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
##
## update
## The following object is masked from 'package:stats':
##
## update
## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Load text reviews data
text <- read.csv("/Users/ninalin/Desktop/dsba/winter semester 25/text mining/hw/textreviews.csv")
View(text)
str(text)
## 'data.frame': 2000 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ text: chr "Warn and super soft. love it !" "Love the high waist, prewashed softness, and relaxed fit. i normally wear a 27 in pilcro but sized down to a 26"| __truncated__ "Nice jeans, but had to return. too tight in hips/thighs and big in waist." "Im 5'1\" and about 110lbs. ordered the small because i do have some curves- it was huge- more like a large and "| __truncated__ ...
summary(text)
## id text
## Min. : 1.0 Length:2000
## 1st Qu.: 500.8 Class :character
## Median :1000.5 Mode :character
## Mean :1000.5
## 3rd Qu.:1500.2
## Max. :2000.0
# Corpus
corp <- corpus(text$text)
# Consist of 2000 documents
# Separate into tokens
tokens <- tokens(corp,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_select(pattern = stopwords("en"), selection = "remove")
# Each contain different length of tokens
# Remove Stopword
stopword <- c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart"))
tokens <- tokens_select(tokens, pattern = stopword, selection = "remove")
# Save the above procedure to Clean_text
clean_text <- sapply(tokens, function(x) paste(x, collapse = " "))
# Document Term Matrix- Create (Document* Term) matrix
text_DTM <- CreateDtm(doc_vec = clean_text,
doc_names = text$id,
ngram_window = c(1, 2),
verbose = TRUE,
cpus = 2)
# Transform into TF-IDF
tfidf_transformer <- TfIdf$new()
text_tfidf <- tfidf_transformer$fit_transform(text_DTM)
dim(text_tfidf)
## [1] 2000 28720
# 2000 documents * 28720 terms(and weight of each term)
set.seed(42)
svd_k <- 100
svd_res <- irlba(text_tfidf, nv = svd_k)
doc_emb <- svd_res$u %*% diag(svd_res$d)
# Elbow Method
fviz_nbclust(doc_emb, kmeans, method = "wss") + ggtitle("Elbow Method")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Used elbow method to find the optimal number of cluster. The graph indicates that the optimal number of cluster is 4.
set.seed(42)
k_chosen <- 4
km <- kmeans(doc_emb, centers = k_chosen, nstart = 50)
table(km$cluster)
##
## 1 2 3 4
## 1 1 262 1736
text$cluster <- km$cluster
for(i in 1:k_chosen){
idx <- which(text$cluster == i)
top_terms <- sort(colSums(as.matrix(text_tfidf[idx, ])), decreasing = TRUE)
cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
print(head(names(top_terms), 15))
}
##
## ===== Cluster 1 (n = 1 ) =====
## NULL
##
## ===== Cluster 2 (n = 1 ) =====
## NULL
##
## ===== Cluster 3 (n = 262 ) =====
## [1] "game" "good" "fun" "awesome" "addictive"
## [6] "great_game" "great" "good_game" "good_good" "cool"
## [11] "awesome_game" "fun_fun" "fun_game" "great_great" "love"
##
## ===== Cluster 4 (n = 1736 ) =====
## [1] "game" "love" "great" "top" "dress" "fun"
## [7] "size" "good" "play" "app" "nice" "fit"
## [13] "time" "love_game" "color"
Clusters collapse- because the dataset is dominated by certain topic, making it difficult for k-means to divide the clusters. Thus, cluster 1 and 2 remain empty while cluster 4 absorb most samples(n=1736). So we apply Umap to reduce the data in a non-linear way.
um <- umap(doc_emb)
km_umap <- kmeans(um$layout, centers = 3, nstart = 50)
table(km_umap$cluster)
##
## 1 2 3
## 389 1034 577
The cluster distribution is much balanced after UMAP algorithm. We then want to check the related features in each cluster.
#Top Term of each cluster
text$cluster_umap <- km_umap$cluster
for (i in 1:3) {
idx <- which(text$cluster_umap == i)
top_terms <- sort(colSums(as.matrix(text_tfidf[idx, , drop = FALSE])), decreasing = TRUE)
cat("\n===== Cluster", i, "(n =", length(idx), ") =====\n")
print(head(names(top_terms), 18))
}
##
## ===== Cluster 1 (n = 389 ) =====
## [1] "top" "size" "dress" "obsessed" "beautiful"
## [6] "fit" "small" "perfect" "wear" "cute"
## [11] "big" "color" "pretty" "flattering" "fabric"
## [16] "keeper" "comfortable" "sweater"
##
## ===== Cluster 2 (n = 1034 ) =====
## [1] "game" "good" "fun" "addictive" "awesome"
## [6] "great" "great_game" "love" "good_game" "cool"
## [11] "love_game" "amazing" "awesome_game" "play" "good_good"
## [16] "app" "time" "fun_game"
##
## ===== Cluster 3 (n = 577 ) =====
## [1] "love" "great" "nice" "dress" "love_love"
## [6] "great_great" "top" "size" "fit" "color"
## [11] "wear" "soft" "fabric" "perfect" "shirt"
## [16] "lots" "comfortable" "flattering"
Cluster 1 indicates apparel and fashion-related comments, possibly in game contexts.
Cluster 2 focuses on detailed descriptions of clothing and fashion attributes-such as color and fabric.
Cluster 3 is clearly related to games and entertainment.
# Cluster visualization
plot_df <- data.frame(x = um$layout[,1], y = um$layout[,2], cluster = factor(km_umap$cluster))
ggplot(plot_df, aes(x = x, y = y, color = cluster)) +
geom_point(alpha = 0.6) + theme_minimal() +
ggtitle("K-Means Clusters with UMAP projection")
Upon investigation, we noticed an unevenly distributed across topic by using K-Means method. Therefore, we applied UMAP method to improve the performance. The result shows three well-distributed clusters: One related to fashion and appearance, another presents similar result but much toward clothing and fabric, while the last cluster is mainly about entertainment and games.