Task A – Text Mining
options(repos = c(CRAN = "https://cloud.r-project.org"))
libraries <- c("tm", "tidytext", "ggplot2", "wordcloud", "syuzhet", "dplyr", "tibble", "textstem", "textdata", "tidyr", "Matrix", "topicmodels", "stringr", "reshape2", "LDAvis", "jsonlite", "spacyr", "stm")
install.packages(libraries)
## package 'tm' successfully unpacked and MD5 sums checked
## package 'tidytext' successfully unpacked and MD5 sums checked
## package 'ggplot2' successfully unpacked and MD5 sums checked
## package 'wordcloud' successfully unpacked and MD5 sums checked
## package 'syuzhet' successfully unpacked and MD5 sums checked
## package 'dplyr' successfully unpacked and MD5 sums checked
## package 'tibble' successfully unpacked and MD5 sums checked
## package 'textstem' successfully unpacked and MD5 sums checked
## package 'textdata' successfully unpacked and MD5 sums checked
## package 'tidyr' successfully unpacked and MD5 sums checked
## package 'Matrix' successfully unpacked and MD5 sums checked
## package 'topicmodels' successfully unpacked and MD5 sums checked
## package 'stringr' successfully unpacked and MD5 sums checked
## package 'reshape2' successfully unpacked and MD5 sums checked
## package 'LDAvis' successfully unpacked and MD5 sums checked
## package 'jsonlite' successfully unpacked and MD5 sums checked
## package 'spacyr' successfully unpacked and MD5 sums checked
## package 'stm' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\akilo\AppData\Local\Temp\RtmpQNlfb7\downloaded_packages
for (lib in libraries) {
library(lib, character.only = TRUE)
}
Utilize techniques associated with text mining/cleaning
This code loads the CSV as a tibble, shows the first rows, summarises each column, and counts the missing values.
df <- as_tibble(read.csv("MS4S09 Assessment 1 Dataset.csv"))
head(df)
## # A tibble: 6 × 11
## X Clothing.ID Age Title Review.Text Rating Recommended.IND
## <int> <int> <int> <chr> <chr> <int> <int>
## 1 0 767 33 "" "Absolutel… 4 1
## 2 1 1080 34 "" "Love this… 5 1
## 3 2 1077 60 "Some major design… "I had suc… 3 0
## 4 3 1049 50 "My favorite buy!" "I love, l… 5 1
## 5 4 847 47 "Flattering shirt" "This shir… 5 1
## 6 5 1080 49 "Not for the very … "I love tr… 2 0
## # ℹ 4 more variables: Positive.Feedback.Count <int>, Division.Name <chr>,
## # Department.Name <chr>, Class.Name <chr>
summary(df)
## X Clothing.ID Age Title
## Min. : 0 Min. : 0.0 Min. :18.0 Length:23486
## 1st Qu.: 5871 1st Qu.: 861.0 1st Qu.:34.0 Class :character
## Median :11742 Median : 936.0 Median :41.0 Mode :character
## Mean :11742 Mean : 918.1 Mean :43.2
## 3rd Qu.:17614 3rd Qu.:1078.0 3rd Qu.:52.0
## Max. :23485 Max. :1205.0 Max. :99.0
## Review.Text Rating Recommended.IND Positive.Feedback.Count
## Length:23486 Min. :1.000 Min. :0.0000 Min. : 0.000
## Class :character 1st Qu.:4.000 1st Qu.:1.0000 1st Qu.: 0.000
## Mode :character Median :5.000 Median :1.0000 Median : 1.000
## Mean :4.196 Mean :0.8224 Mean : 2.536
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.: 3.000
## Max. :5.000 Max. :1.0000 Max. :122.000
## Division.Name Department.Name Class.Name
## Length:23486 Length:23486 Length:23486
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
colSums(is.na(df))
## X Clothing.ID Age
## 0 0 0
## Title Review.Text Rating
## 0 0 0
## Recommended.IND Positive.Feedback.Count Division.Name
## 0 0 0
## Department.Name Class.Name
## 0 0
It keeps only the columns needed for your analysis and then removes any rows with missing values so the data set is clean, consistent, and ready for the text‑mining task without errors or gaps.
df <- df[,c(1,2,3,5,6,9)] #selecting the data for task A
textMining <- na.omit(df) # Removes all rows containing null values
It sets a fixed random seed so the results stay the same each time, randomly selects 500 rows from the cleaned dataset, and creates a new dataset containing only that random sample.
set.seed(0) # Set random seed for repeatability
# Take sample of 500 review.Text
sample_index <- sample(nrow(textMining), 500) #Sample (Size of population, Size of sample), returns index for sample
sample_textMining <- textMining[sample_index,]
print(sample_textMining)
## # A tibble: 500 × 6
## X Clothing.ID Age Review.Text Rating Division.Name
## <int> <int> <int> <chr> <int> <chr>
## 1 17400 912 47 "I believe that this sweater is… 3 General
## 2 4774 1059 52 "The \"movement\" in these culo… 5 General Peti…
## 3 13217 1078 36 "I love this dress and will be … 5 General
## 4 10538 1080 44 "This dress is wonderful and th… 5 General Peti…
## 5 8461 863 57 "This is a nice light top for t… 4 General
## 6 4049 257 36 "Very soft, cozy, and exactly w… 5 Initmates
## 7 13498 1094 34 "I ordered the red/orange but i… 4 General
## 8 11570 862 41 "Not flattering and not the col… 2 General
## 9 12256 868 35 "The fit on this shirt is bizar… 1 General
## 10 17684 1095 46 "I ordered this dress in the bl… 4 General
## # ℹ 490 more rows
head(sample_textMining)
## # A tibble: 6 × 6
## X Clothing.ID Age Review.Text Rating Division.Name
## <int> <int> <int> <chr> <int> <chr>
## 1 17400 912 47 "I believe that this sweater is … 3 General
## 2 4774 1059 52 "The \"movement\" in these culot… 5 General Peti…
## 3 13217 1078 36 "I love this dress and will be p… 5 General
## 4 10538 1080 44 "This dress is wonderful and the… 5 General Peti…
## 5 8461 863 57 "This is a nice light top for th… 4 General
## 6 4049 257 36 "Very soft, cozy, and exactly wh… 5 Initmates
It converts the review text into tokens so the text can be analysed: the first line breaks each review into individual words, and the second line breaks each review into two‑word combinations (bigrams).
word_tokenized_data <- textMining %>%
unnest_tokens(output = word, input = "Review.Text", to_lower = TRUE)
bigram_tokenized_data <- textMining %>%
unnest_tokens(output = bigram, input = "Review.Text",token = "ngrams", n = 2, to_lower = TRUE)
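To make the two token structures concrete, the first few tokens of each can be inspected (an illustrative check, not part of the original output):
head(word_tokenized_data$word)
head(bigram_tokenized_data$bigram)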
It counts how often each word appears, then plots the ten most frequent words as a horizontal bar chart to show which terms occur most in the reviews.
word_counts <- word_tokenized_data %>%
count(word, sort = TRUE)
ggplot(word_counts[1:10, ], aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
labs(x = "Words" , y = "Frequency") +
coord_flip() +
theme_minimal()
It counts how often each bigram appears in the dataset, then plots the ten most frequent two‑word combinations as a horizontal bar chart to show which paired terms occur most in the reviews.
bigram_counts <- bigram_tokenized_data %>%
count(bigram, sort = TRUE)
ggplot(bigram_counts[1:10, ], aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "steelblue") +
labs(x = "Bigrams", y = "Frequency") +
coord_flip() +
theme_minimal()
It cleans the tokenised words by removing common stop‑words,
stripping out numbers and special characters, converting empty results
to missing values, lemmatising each word to its base form, and finally
dropping any rows that still contain NAs so the text is clean and ready
for analysis
clean_tokens <- word_tokenized_data %>%
anti_join(stop_words, by = "word") #Remove stop words
clean_tokens$word <- gsub("[^a-zA-Z ]", "", clean_tokens$word) %>% # Remove special characters and numbers
na_if("") %>% # Replaces empty strings with NA
lemmatize_words() # Lemmatizes text
clean_tokens <- na.omit(clean_tokens) # Removes null values
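As a small illustration of what the lemmatisation step does (a sketch using made-up example words), lemmatize_words() maps inflected forms back to their dictionary base form:
lemmatize_words(c("dresses", "fitted", "running")) # plural and inflected forms reduce to their base form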
It finds the ten most frequent cleaned words, filters the dataset to keep only those words, orders them for plotting, and then creates a horizontal bar chart showing how often each of the top words appears.
word_counts <- clean_tokens %>%
count(word, sort = TRUE)
top_words <- top_n(word_counts,10,n)$word
filtered_word_counts <- filter(word_counts, word %in% top_words)
filtered_word_counts$word <- factor(filtered_word_counts$word, levels = top_words[length(top_words):1])
ggplot(filtered_word_counts[1:10, ], aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
labs(x = "Words" , y = "Frequency") +
coord_flip() +
theme_minimal()
It creates a word cloud of the most frequent cleaned words,
using their frequencies to determine word size and applying a fixed seed
so the layout stays consistent each time
set.seed(1)
wordcloud(words = filtered_word_counts$word, freq = filtered_word_counts$n, min.freq = 10, random.order = FALSE, random.color = FALSE, colors = sample(colors(), size = 10))
It attaches sentiment labels to each cleaned word, calculates a sentiment score for every review by subtracting negative words from positive ones, and then merges those scores back into your sampled dataset so each review includes its overall Bing sentiment value.
# Create dataset containing only words with associated sentiment & adds sentiment column.
sentiment_data <- clean_tokens %>%
inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") # Joins lexicon to dataset using only words that are in both.
# Calculate Sentiment scores for each review
sentiment_score <- sentiment_data %>%
group_by(X) %>%
summarize(bing_sentiment = sum(sentiment == "positive") - sum(sentiment == "negative")) # Calculates sentiment score as sum of number of positive and negative sentiments
# Merge with df
textMining_with_sentiment = sample_textMining %>%
inner_join(sentiment_score, by = "X")
It keeps only the words that have an AFINN sentiment score, adds those numeric sentiment values to the tokens, sums the scores for each review to get an overall AFINN sentiment score, and then merges that score back into your dataset so each review includes its AFINN sentiment value.
# Create dataset containing only words with associated sentiment & adds sentiment column.
sentiment_data <- clean_tokens %>%
inner_join(get_sentiments("afinn"), by = "word", relationship = "many-to-many") # Joins lexicon to dataset using only words that are in both.
# Calculate Sentiment scores for each review
sentiment_score <- sentiment_data %>%
group_by(X) %>%
summarize(afinn_sentiment = sum(value))
# Merge with df
textMining_with_sentiment = textMining_with_sentiment %>%
inner_join(sentiment_score, by = "X")
It finds the review with the lowest sentiment score and prints it, then finds the review with the highest sentiment score and prints that one as well.
worst_review = textMining_with_sentiment[order(textMining_with_sentiment$bing_sentiment)[1], "Review.Text"]
for (review in worst_review){
print(review)
}
## [1] "I probably have 10 pairs of pilcro jeans and pants and most of the jeans are the skinny fit. i wanted something that wasn't a skinny fit so i thought i would try these straight fit jeans thinking they would be skinny but they also wouldn't be loose. just got them today and when i tried them on they were tighter then the skinny jeans i just took off. i guess i agree with the other reviewers that they fit like a glove. so if that's what you're looking for these will be great."
best_review = textMining_with_sentiment[order(textMining_with_sentiment$bing_sentiment, decreasing = TRUE)[1], "Review.Text"]
for (review in best_review){
print(review)
}
## [1] "Pretty and comfortable fun cardi for those of us who love a nice little cropped number. glad to report that this cobalt beauty fits true to size in an xs all over -- shoulders hit at a nice place, arms, waist, length is great above your waist or just at it. the cotton is ultra comfortable. the zipper has that luxurious feature of a double zip where you can zip from the bottom or the top once closed, creating a nice opening. i love this for a casual but chic holiday look thanks to the velvet ador"
It creates a histogram of the Bing sentiment scores, showing how the reviews are distributed from negative to positive sentiment. The histogram shows that most customer reviews have slightly positive sentiment, with scores clustering around 2–3, indicating generally favorable feedback but with a mix of both positive and negative experiences.
ggplot(textMining_with_sentiment, aes(x = bing_sentiment)) +
geom_histogram(binwidth = 1, fill = "steelblue")
It calculates the average Bing sentiment score for each clothing division and then plots those averages as a horizontal bar chart to show which divisions receive more positive or negative reviews. Intimates scored well, suggesting strong customer satisfaction, while General had the lowest average sentiment, indicating potential issues with product quality, fit, or expectations.
clothing_sentiment <- textMining_with_sentiment %>%
group_by(Division.Name) %>%
summarise(Average_Bing_Sentiment = mean(bing_sentiment))
ggplot(clothing_sentiment, aes(x = reorder(Division.Name, Average_Bing_Sentiment), y = Average_Bing_Sentiment, fill = Division.Name)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Average Sentiment Score for clothing", x = "Clothing", y = "Average Sentiment Score")
The boxplot shows a clear link between customer ratings and Bing sentiment scores. Higher ratings generally correspond to more positive review language, while lower ratings show more negative sentiment. This indicates that textual sentiment closely tracks star-based feedback.
ggplot(textMining_with_sentiment, aes(x = factor(Rating), y = bing_sentiment)) +
geom_boxplot(fill = "steelblue") +
labs(title = "Box plot of Bing sentiment score vs. Rating",
x = "Rating",
y = "Sentiment score")
The scatter plot shows a strong positive relationship between the two sentiment methods: as Bing sentiment scores increase, AFINN sentiment scores tend to increase as well, indicating that both lexicons generally agree on how positive or negative each review is.
ggplot(textMining_with_sentiment, aes(x = bing_sentiment, y = afinn_sentiment)) +
geom_point() +
labs(title = "Scatter plot of Bing vs. AFINN Sentiment score",
x = "Bing Sentiment score",
y = "AFINN Sentiment Score")
The scatter plot shows that higher product ratings are
usually linked to higher Bing sentiment scores. This means that
customers who give better ratings also use more positive language in
their written reviews. On the other hand, lower ratings are linked to
more negative sentiment.
ggplot(textMining_with_sentiment, aes(x = Rating, y = bing_sentiment)) +
geom_point() +
labs(title = "Scatter plot of Rating vs. Bing Sentiment score",
x = "Rating",
y = "Bing Sentiment Score")
Topic Modelling
This code gets your dataset ready for topic modelling by filtering, cleaning, and sampling it. It keeps reviews that are between 150 and 500 characters long, removes any rows with missing values, gives each row a new ID, and, if more than 1000 reviews remain, randomly samples 1000 of them for efficient topic modelling.
TopicModelling <- textMining %>%
filter(str_count(Review.Text) >= 150 & str_count(Review.Text) <= 500)
TopicModelling <- na.omit(TopicModelling) # Removes all rows containing null values
TopicModelling$X <- 1:nrow(TopicModelling)
if(nrow(TopicModelling) > 1000) {
set.seed(1) # for reproducibility
TopicModelling <- sample_n(TopicModelling, 1000) # keep 1000 reviews for efficient topic modelling
}
This code cleans up your text and turns it into a structured matrix for topic modelling. It removes noise from the corpus (punctuation, stopwords, and case differences), stems the words, and then converts the processed reviews into a term-document matrix that records how many times each word appears in each review.
# Convert text column to corpus
corpus <- VCorpus(VectorSource(TopicModelling$Review.Text))
# Apply cleaning
corpus <- tm_map(corpus, content_transformer(tolower)) %>%
tm_map(content_transformer(function(x) gsub("[^a-zA-Z ]", "", x))) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument)
# Convert to a term document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 15)))
tdm_matrix <- as.matrix(tdm)
This code counts how many times each word appears across all the reviews, lists the ten most common words, and shows how the word frequencies are distributed overall. Turning the term-document matrix into a frequency table reveals what customers mention most and which vocabulary patterns dominate. The histogram shows whether most terms are rare or heavily repeated, the long-tailed pattern typical of natural language. Overall, it gives a clear picture of the dataset's vocabulary structure and prepares the text for more in-depth analysis, such as topic modelling.
term_frequencies <- rowSums(tdm_matrix)
# Create a data frame for plotting
term_frequencies_df <- data.frame(term = names(term_frequencies), frequency = term_frequencies)
# Sort the data frame by frequency in descending order and select the top 10
top_terms <- term_frequencies_df %>%
arrange(desc(frequency)) %>%
head(10)
# Display the top 10 terms
print(top_terms)
## term frequency
## dress dress 10665
## fit fit 9901
## love love 9551
## size size 9515
## look look 8348
## wear wear 7249
## top top 7192
## like like 7041
## color color 6288
## just just 5081
# Create the histogram
ggplot(term_frequencies_df, aes(x = frequency)) + geom_histogram(binwidth = 10, fill = "steelblue") +
labs(title = "Histogram of Term Frequencies",
x = "Term Frequency",
y = "Number of Terms") +
theme_minimal()
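The long-tailed pattern described above is often clearer on a rank-frequency plot with log scales; a possible sketch using the same frequency table:
# Rank terms by frequency and plot rank vs. frequency on log-log axes
rank_df <- term_frequencies_df %>%
arrange(desc(frequency)) %>%
mutate(rank = row_number())
ggplot(rank_df, aes(x = rank, y = frequency)) +
geom_line(color = "steelblue") +
scale_x_log10() +
scale_y_log10() +
labs(title = "Term Frequency vs. Rank (log-log)", x = "Rank", y = "Frequency") +
theme_minimal()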
This code filters the term–document matrix by removing words that are either too common or too rare, ensuring only meaningful terms remain for topic modelling. It flags terms whose overall frequency reaches at least 15% of the document count and removes them unless they are useful keywords like “fit,” “love,” “size,” “look,” or “wear.” It also removes extremely rare terms whose total frequency is below 0.5% of the document count, which typically add noise. After filtering, it removes any document columns that contain no remaining terms. Overall, this step cleans up and improves the matrix so that topic modelling is more accurate.
# Find terms whose total frequency is at least 15% of the number of documents (a proxy for very common terms)
frequent_terms <- findFreqTerms(tdm, lowfreq = 0.15 * ncol(tdm_matrix))
# Find terms whose total frequency is at most 0.5% of the number of documents (a proxy for very rare terms)
rare_terms <- findFreqTerms(tdm, highfreq = 0.005 * ncol(tdm_matrix))
print("Frequent Terms")
## [1] "Frequent Terms"
print(frequent_terms)
## [1] "back" "beauti" "can" "color" "dress" "fabric" "fit"
## [8] "flatter" "great" "just" "like" "littl" "look" "love"
## [15] "nice" "one" "order" "perfect" "realli" "size" "small"
## [22] "soft" "top" "tri" "wear" "well" "will"
print("First 20 Infrequent Terms")
## [1] "First 20 Infrequent Terms"
print(rare_terms[1:20])
## [1] "aaaaaaamaz" "aaaaannnnnnd" "aaaah" "aaaahmaz" "aaah"
## [6] "aam" "abbey" "abbi" "abck" "abdomen"
## [11] "abdomin" "abercrombi" "abhor" "abil" "abject"
## [16] "abnorm" "abo" "abolut" "abou" "aboutthi"
# Edit list of frequent words to keep useful ones
to_keep <- c("fit", "love", "size", "look", "wear")
to_remove <- frequent_terms[!frequent_terms %in% to_keep]
filtered_tdm_matrix <- tdm_matrix[!rownames(tdm_matrix) %in% to_remove, ]
filtered_tdm_matrix <- filtered_tdm_matrix[!rownames(filtered_tdm_matrix) %in% rare_terms, ]
# Remove 0 sum columns from tdm.
# Calculate column sums
column_sums <- colSums(filtered_tdm_matrix)
# Identify columns that are all zeros
zero_columns <- which(column_sums == 0)
# Remove these columns
if (length(zero_columns) > 0) {
# Remove these columns
filtered_tdm_matrix <- filtered_tdm_matrix[, -zero_columns]
} else {
# If no columns are all zeros, just use the original matrix
print("No zero columns in TDM matrix")
}
## [1] "No zero columns in TDM matrix"
This code runs the first stage of your topic modelling by converting the filtered term–document matrix into the correct format and fitting an LDA model with three topics. It transposes the filtered matrix so documents become rows, then applies LDA with k = 3 to uncover the main themes present in the cleaned review text.
dtm <- t(filtered_tdm_matrix)
lda_model <- LDA(dtm, k = 3)
This code takes the LDA model that was fitted and gets the beta matrix, which shows the chance that each word belongs to each topic. Then it finds the 10 most important words for each topic by picking the ones with the highest beta values. This means that these words are the best clues to what each topic is about. After selecting these top terms, the code arranges them neatly and creates a bar chart for each topic, showing which words contribute most to that topic. The plot clearly shows each topic separately by flipping the axes and using facet panels. This makes it easy to understand the themes that the LDA model shows.
Topic 1: Words like look, love, size, fit, and wear are the most common on this topic. These terms suggest that Topic 1 is mainly about appearance and overall satisfaction. People who are interested in this topic seem to care about how the product looks, if they like it, and if the size is right. The main idea is probably how things look and how people feel about them.
Topic 2: The strongest words here are fit, love, size, wear, length, comfort, shirt, and quality. This topic is more about how well something fits, how comfortable it is, and how good the quality is, especially when it comes to clothes. Words like length, comfort, and shirt show that the focus is on how the item fits and feels on the body. Likely themes are Fit, comfort, and clothing quality.
Topic 3: This topic again highlights look, fit, love, size, and wear, but also includes words like waist, think, got, and much. These additional terms suggest customers are discussing specific fit issues, such as waist sizing, and giving more reflective or comparative comments. Likely themes are detailed fit issues and personal evaluation.
topics <- tidy(lda_model, matrix = "beta")
topics
## # A tibble: 2,385 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 abl 0.000127
## 2 2 abl 0.00148
## 3 3 abl 0.00120
## 4 1 absolut 0.00350
## 5 2 absolut 0.00164
## 6 3 absolut 0.000225
## 7 1 accentu 0.000232
## 8 2 accentu 0.000217
## 9 3 accentu 0.000304
## 10 1 across 0.000955
## # ℹ 2,375 more rows
top_terms <- topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
ggplot(aes(x = reorder(term, beta), y = beta)) +
geom_col(show.legend = FALSE, fill = "steelblue") +
facet_wrap(~ topic, scales = "free") +
coord_flip()
This code fits a three-topic LDA model to your document–term matrix and then prepares everything required for an interactive LDAvis display. It extracts the probability of each word within each topic (phi), the probability of each topic within each document (theta), the length of every document, the full vocabulary list, and the overall frequency of each term, and combines these components into a structured JSON object that LDAvis can interpret. Finally, it launches an interactive visualisation that lets you explore how distinct the topics are, how strongly each term contributes to each topic, and how the topics are distributed across your documents, giving a deeper and more intuitive understanding of the model’s structure.
Three different topics emerge from your text data in this visualization, which displays the findings of a topic-modeling analysis using LDAvis. The intertopic distance map on the left shows each topic as a circle, with the distance between the circles representing the vocabulary differences between the topics. The circles’ separation indicates that the model was successful in identifying significant, non-overlapping themes. The size of the circles shows how prevalent a topic is throughout the dataset. The top thirty salient terms for the chosen topic are listed in the bar chart on the right, emphasizing terms that are both common and instructive. The terms “fit,” “wear,” “waist,” “comfort,” and “shirt” imply that the focus of this discussion is on the fit, sizing, and comfort of clothing. You can highlight words that are either common or distinctively characteristic by adjusting the relevance calculation using the λ slider. In general, the visualization aids in your interpretation of the significance, uniqueness, and meaning of each topic in your corpus.
set.seed(1)
lda_model <- LDA(dtm, k = 3)
lda_vis_data <- createJSON(phi = posterior(lda_model)$terms,
theta = posterior(lda_model)$topics,
doc.length = rowSums(as.matrix(dtm)),
vocab = colnames(as.matrix(dtm)),
term.frequency = colSums(as.matrix(dtm)))
serVis(lda_vis_data)
This code creates a TF‑IDF representation of your text data by first building a corpus from the review text and then cleaning it through several preprocessing steps. Converting all text to lowercase, removing punctuation, numbers, stopwords, and extra whitespace ensures that only meaningful words remain. After cleaning, a Term‑Document Matrix is generated using TF‑IDF weighting, which highlights words that are important within individual documents but not overly common across the entire dataset. This weighting method is valuable because it reduces the influence of generic, repetitive terms and emphasizes distinctive vocabulary that better represents the underlying themes in your text. The final matrix provides a structured numerical format suitable for clustering, topic modeling, or machine‑learning tasks.
# Create a corpus
corpus <- VCorpus(VectorSource(textMining_with_sentiment$Review.Text))
# clean the corpus
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Create a term document matrix with TF-IDF weighting
tdm_tfidf <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))
# Convert to matrix for analysis
tfidf_matrix <- as.matrix(tdm_tfidf)
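As a quick sanity check (an added step, not in the original run), inspecting a small corner of the weighted matrix should show fractional TF-IDF scores rather than integer counts:
inspect(tdm_tfidf[1:5, 1:5])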
This code identifies the ten most influential terms in your corpus by summing TF‑IDF scores across all documents and visualising them in a bar chart. The summary highlights which words are most distinctive and meaningful in the dataset, helping you understand key themes that ordinary frequency counts might miss. TF‑IDF is important because it emphasises unique, information‑rich terms rather than common, repetitive words, making your insights more accurate and targeted.
# Sum TF-IDF scores across all documents
term_scores <- rowSums(tfidf_matrix)
# create a data frame for plotting
tfidf_df <- data.frame(term = names(term_scores), score = term_scores)
# Sort the data frame by frequency in descending order and select the top 10
top_tfidf_terms <- tfidf_df %>%
arrange(desc(score)) %>%
head(10)
# Display the top 10 terms
print(top_tfidf_terms)
## term score
## love love 204
## dress dress 193
## size size 181
## top top 153
## fit fit 145
## like like 142
## just just 122
## wear wear 119
## great great 105
## small small 104
# Plot the terms against tf-idf
ggplot(top_tfidf_terms, aes(x = reorder(term, score), y = score)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top TF‑IDF Terms",
x = "Term",
y = "TF‑IDF Score") +
theme_minimal()
The most notable and significant words in your dataset are
visually highlighted by this code, which creates a word cloud using the
top TF-IDF terms. The cloud highlights terms that provide the most
unique information across all reviews by mapping TF-IDF scores to word
size.
set.seed(2)
wordcloud(words = top_tfidf_terms$term, freq = top_tfidf_terms$score, min.freq = 10, random.order = FALSE, random.color = FALSE, colors = sample(colors(), size = 10))
Each of the five topics identified by the STM model is labeled with the most representative words in the chart. The expected topic proportions, or the frequency with which each topic occurs throughout the corpus, are displayed by the horizontal bars. This aids in determining the topics that consumers talk about most frequently. Fit, fabric, and appearance, for instance, are frequently discussed, suggesting that these are major issues in the reviews. This plot is useful because it shows which topics are most important in customer feedback and gives a high-level overview of the dataset’s dominant themes. STM, as opposed to LDA, associates these topics with metadata, enabling more thorough examination of the ways in which topic prevalence differs among ratings, departments, or sentiment. Because of this, the summary plot is an essential tool for comprehending the text’s general customer priorities and patterns.
processed <- textProcessor(documents = textMining_with_sentiment$Review.Text, metadata = textMining_with_sentiment)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
prep <- prepDocuments(processed$documents, processed$vocab, processed$meta)
## Removing 874 of 1851 terms (874 of 11869 tokens) due to frequency
## Your corpus now has 459 documents, 977 terms and 10995 tokens.
stm_model <- stm(documents = prep$documents, vocab = prep$vocab, data = prep$meta, K = 5, max.em.its = 75, init.type = "Spectral")
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## .....
## Recovering initialization...
## .........
## Initialization complete.
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -6.216)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -6.101, relative change = 1.854e-02)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -6.072, relative change = 4.617e-03)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -6.060, relative change = 2.079e-03)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -6.052, relative change = 1.234e-03)
## Topic 1: fit, size, top, great, perfect
## Topic 2: comfort, love, dress, work, cute
## Topic 3: short, just, like, bought, fit
## Topic 4: love, top, size, color, fit
## Topic 5: dress, look, wear, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -6.047, relative change = 8.774e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -6.043, relative change = 7.148e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -6.039, relative change = 6.333e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -6.035, relative change = 5.811e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -6.032, relative change = 5.430e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: comfort, dress, cute, work, length
## Topic 3: just, like, short, bought, pant
## Topic 4: love, top, color, fit, size
## Topic 5: dress, look, wear, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -6.029, relative change = 5.044e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -6.026, relative change = 4.588e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -6.024, relative change = 4.159e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -6.021, relative change = 3.867e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -6.019, relative change = 3.719e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: dress, comfort, cute, length, work
## Topic 3: just, like, bought, store, perfect
## Topic 4: love, top, fit, color, size
## Topic 5: dress, look, wear, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -6.017, relative change = 3.680e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -6.015, relative change = 3.692e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -6.013, relative change = 3.714e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -6.010, relative change = 3.728e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 20 (approx. per word bound = -6.008, relative change = 3.696e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: dress, comfort, cute, length, work
## Topic 3: just, like, bought, fabric, store
## Topic 4: love, fit, top, color, size
## Topic 5: look, dress, wear, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 21 (approx. per word bound = -6.006, relative change = 3.594e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 22 (approx. per word bound = -6.004, relative change = 3.405e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 23 (approx. per word bound = -6.002, relative change = 3.148e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 24 (approx. per word bound = -6.000, relative change = 2.864e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 25 (approx. per word bound = -5.999, relative change = 2.608e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: dress, cute, length, realli, work
## Topic 3: like, just, fabric, bought, store
## Topic 4: love, fit, top, size, color
## Topic 5: look, wear, dress, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 26 (approx. per word bound = -5.997, relative change = 2.403e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 27 (approx. per word bound = -5.996, relative change = 2.241e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 28 (approx. per word bound = -5.995, relative change = 2.104e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 29 (approx. per word bound = -5.993, relative change = 1.986e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 30 (approx. per word bound = -5.992, relative change = 1.905e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: dress, realli, cute, length, petit
## Topic 3: fabric, like, just, bought, store
## Topic 4: love, fit, top, size, color
## Topic 5: look, wear, order, can, nice
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 31 (approx. per word bound = -5.991, relative change = 1.886e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 32 (approx. per word bound = -5.990, relative change = 1.910e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 33 (approx. per word bound = -5.989, relative change = 1.863e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 34 (approx. per word bound = -5.988, relative change = 1.715e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 35 (approx. per word bound = -5.987, relative change = 1.609e-04)
## Topic 1: fit, size, great, top, perfect
## Topic 2: dress, realli, petit, cute, length
## Topic 3: fabric, like, just, bought, store
## Topic 4: love, fit, top, size, color
## Topic 5: look, wear, order, can, like
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 36 (approx. per word bound = -5.986, relative change = 1.565e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 37 (approx. per word bound = -5.985, relative change = 1.557e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 38 (approx. per word bound = -5.984, relative change = 1.563e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 39 (approx. per word bound = -5.983, relative change = 1.564e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 40 (approx. per word bound = -5.982, relative change = 1.550e-04)
## Topic 1: fit, size, great, top, comfort
## Topic 2: dress, realli, petit, cute, length
## Topic 3: fabric, like, just, bought, store
## Topic 4: love, fit, top, size, color
## Topic 5: look, wear, order, can, like
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 41 (approx. per word bound = -5.981, relative change = 1.525e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 42 (approx. per word bound = -5.980, relative change = 1.498e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 43 (approx. per word bound = -5.980, relative change = 1.490e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 44 (approx. per word bound = -5.979, relative change = 1.495e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 45 (approx. per word bound = -5.978, relative change = 1.515e-04)
## Topic 1: fit, size, great, comfort, top
## Topic 2: dress, realli, petit, cute, length
## Topic 3: fabric, like, bought, just, color
## Topic 4: love, fit, top, size, small
## Topic 5: look, wear, order, can, like
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 46 (approx. per word bound = -5.977, relative change = 1.538e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 47 (approx. per word bound = -5.976, relative change = 1.561e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 48 (approx. per word bound = -5.975, relative change = 1.570e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 49 (approx. per word bound = -5.974, relative change = 1.552e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 50 (approx. per word bound = -5.973, relative change = 1.505e-04)
## Topic 1: fit, size, great, comfort, top
## Topic 2: dress, petit, realli, cute, length
## Topic 3: fabric, like, bought, just, color
## Topic 4: love, fit, top, size, small
## Topic 5: look, wear, order, like, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 51 (approx. per word bound = -5.972, relative change = 1.443e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 52 (approx. per word bound = -5.971, relative change = 1.383e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 53 (approx. per word bound = -5.971, relative change = 1.332e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 54 (approx. per word bound = -5.970, relative change = 1.296e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 55 (approx. per word bound = -5.969, relative change = 1.271e-04)
## Topic 1: fit, size, great, comfort, top
## Topic 2: dress, petit, realli, cute, get
## Topic 3: fabric, like, bought, color, just
## Topic 4: love, fit, top, size, small
## Topic 5: look, wear, order, like, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 56 (approx. per word bound = -5.968, relative change = 1.257e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 57 (approx. per word bound = -5.968, relative change = 1.248e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 58 (approx. per word bound = -5.967, relative change = 1.245e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 59 (approx. per word bound = -5.966, relative change = 1.241e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 60 (approx. per word bound = -5.965, relative change = 1.222e-04)
## Topic 1: fit, size, great, comfort, perfect
## Topic 2: dress, petit, realli, cute, get
## Topic 3: fabric, bought, like, color, just
## Topic 4: love, fit, top, size, small
## Topic 5: look, wear, order, like, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 61 (approx. per word bound = -5.965, relative change = 1.189e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 62 (approx. per word bound = -5.964, relative change = 1.148e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 63 (approx. per word bound = -5.963, relative change = 1.095e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 64 (approx. per word bound = -5.963, relative change = 1.038e-04)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 65 (approx. per word bound = -5.962, relative change = 9.922e-05)
## Topic 1: fit, size, great, comfort, perfect
## Topic 2: dress, petit, realli, cute, get
## Topic 3: fabric, bought, color, like, just
## Topic 4: love, fit, top, size, small
## Topic 5: look, wear, like, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 66 (approx. per word bound = -5.962, relative change = 9.613e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 67 (approx. per word bound = -5.961, relative change = 9.432e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 68 (approx. per word bound = -5.960, relative change = 9.393e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 69 (approx. per word bound = -5.960, relative change = 9.342e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 70 (approx. per word bound = -5.959, relative change = 9.258e-05)
## Topic 1: fit, size, great, comfort, perfect
## Topic 2: dress, petit, realli, cute, get
## Topic 3: fabric, bought, color, like, just
## Topic 4: love, top, fit, size, small
## Topic 5: look, wear, like, order, can
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 71 (approx. per word bound = -5.959, relative change = 9.022e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 72 (approx. per word bound = -5.958, relative change = 8.643e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 73 (approx. per word bound = -5.958, relative change = 8.270e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 74 (approx. per word bound = -5.957, relative change = 7.967e-05)
## ..................................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
labelTopics(stm_model)
## Topic 1 Top Words:
## Highest Prob: fit, size, great, comfort, perfect, top, love
## FREX: around, pant, buy, romper, pilcro, often, armhol
## Lift: appropri, coh, fleetwood, floreat, gain, includ, ride
## Score: travel, pant, fit, around, size, pilcro, blous
## Topic 2 Top Words:
## Highest Prob: dress, petit, realli, cute, get, good, length
## FREX: regular, knee, dress, petit, good, realli, cute
## Lift: necessari, unusu, chose, closur, compel, given, mislabel
## Score: dress, unusu, petit, regular, realli, length, cute
## Topic 3 Top Words:
## Highest Prob: fabric, bought, color, like, just, store, materi
## FREX: compliment, bought, almost, swing, light, anoth, weather
## Lift: air, ask, combin, compar, custom, cuter, forward
## Score: parti, bought, compliment, almost, arriv, onlin, anoth
## Topic 4 Top Words:
## Highest Prob: love, top, fit, size, small, littl, color
## FREX: bottom, differ, cut, xxs, total, kept, extrem
## Lift: breast, contrast, due, extrem, hour, neither, ran
## Score: extrem, top, fit, differ, idea, size, cut
## Topic 5 Top Words:
## Highest Prob: wear, look, like, can, order, nice, sweater
## FREX: make, skirt, sweater, front, wear, loos, can
## Lift: silk, chic, everyday, justic, laid, middl, obvious
## Score: chic, can, sweater, wear, soft, skirt, make
plot(stm_model, type = "summary")
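Because STM links topics to document metadata, one natural extension (a sketch only, not run here) is to refit the model with a prevalence formula and estimate how topic proportions vary with the star rating; this assumes Rating was carried through into prep$meta by textProcessor():
stm_rating <- stm(documents = prep$documents, vocab = prep$vocab, data = prep$meta, K = 5, prevalence = ~ Rating, max.em.its = 75, init.type = "Spectral")
rating_effect <- estimateEffect(1:5 ~ Rating, stm_rating, metadata = prep$meta)
summary(rating_effect)
plot(rating_effect, covariate = "Rating", topics = 1:5, method = "continuous") # topic prevalence as a function of Rating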
The most significant terms associated with Topic 1 from your STM model are shown in this word cloud. Each word’s weight within the topic is reflected in its size, so larger words have a greater impact on defining the theme. Terms like size, fit, perfect, comfort, and jean predominate here, indicating that this topic focuses on the comfort, fit, and accuracy of clothing sizing.
cloud(stm_model, topic = 1, random.color = FALSE, colors = sample(colors(), size = 10))
In order to capture deeper contextual meaning beyond bag-of-words approaches, future work could expand this analysis by utilizing more sophisticated NLP techniques like transformer-based models (e.g., BERT). Sentiment could be directly linked to particular product attributes, such as fit, quality, or delivery, using aspect-based sentiment analysis. Topic modeling that is dynamic or time-aware may show how customer concerns change over time. Predictive models of customer satisfaction could be constructed by integrating additional metadata, such as past purchases or return patterns. Lastly, implementing these models in an interactive dashboard would facilitate real-time decision-making and allow ongoing monitoring of emerging themes.
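As a rough illustration of the aspect-based idea (a sketch using only the packages already loaded; the aspect keywords are assumed examples), review-level Bing scores can be grouped by whether a review mentions a given aspect term:
aspects <- c("fit", "quality", "fabric") # assumed example aspect keywords
aspect_sentiment <- clean_tokens %>%
inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
group_by(X) %>%
summarise(score = sum(sentiment == "positive") - sum(sentiment == "negative")) %>%
inner_join(clean_tokens %>% filter(word %in% aspects) %>% transmute(X, aspect = word) %>% distinct(), by = "X") %>%
group_by(aspect) %>%
summarise(mean_sentiment = mean(score), reviews = n())
aspect_sentiment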