Task A – Text Mining

options(repos = c(CRAN = "https://cloud.r-project.org"))

libraries <- c("tm", "tidytext", "ggplot2", "wordcloud", "syuzhet", "dplyr", "tibble", "textstem", "textdata", "tidyr", "Matrix", "topicmodels", "stringr", "reshape2", "LDAvis", "jsonlite", "spacyr", "stm")

install.packages(libraries)
## package 'tm' successfully unpacked and MD5 sums checked
## package 'tidytext' successfully unpacked and MD5 sums checked
## package 'ggplot2' successfully unpacked and MD5 sums checked
## package 'wordcloud' successfully unpacked and MD5 sums checked
## package 'syuzhet' successfully unpacked and MD5 sums checked
## package 'dplyr' successfully unpacked and MD5 sums checked
## package 'tibble' successfully unpacked and MD5 sums checked
## package 'textstem' successfully unpacked and MD5 sums checked
## package 'textdata' successfully unpacked and MD5 sums checked
## package 'tidyr' successfully unpacked and MD5 sums checked
## package 'Matrix' successfully unpacked and MD5 sums checked
## package 'topicmodels' successfully unpacked and MD5 sums checked
## package 'stringr' successfully unpacked and MD5 sums checked
## package 'reshape2' successfully unpacked and MD5 sums checked
## package 'LDAvis' successfully unpacked and MD5 sums checked
## package 'jsonlite' successfully unpacked and MD5 sums checked
## package 'spacyr' successfully unpacked and MD5 sums checked
## package 'stm' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\akilo\AppData\Local\Temp\RtmpQNlfb7\downloaded_packages
for (lib in libraries) {
  library(lib, character.only = TRUE)
}

Text mining and cleaning

The code below loads the CSV as a tibble, shows the first rows, summarises each column, and counts missing values.

df <- as_tibble(read.csv("MS4S09 Assessment 1 Dataset.csv")) # Read the CSV and convert to a tibble

head(df)
## # A tibble: 6 × 11
##       X Clothing.ID   Age Title               Review.Text Rating Recommended.IND
##   <int>       <int> <int> <chr>               <chr>        <int>           <int>
## 1     0         767    33 ""                  "Absolutel…      4               1
## 2     1        1080    34 ""                  "Love this…      5               1
## 3     2        1077    60 "Some major design… "I had suc…      3               0
## 4     3        1049    50 "My favorite buy!"  "I love, l…      5               1
## 5     4         847    47 "Flattering shirt"  "This shir…      5               1
## 6     5        1080    49 "Not for the very … "I love tr…      2               0
## # ℹ 4 more variables: Positive.Feedback.Count <int>, Division.Name <chr>,
## #   Department.Name <chr>, Class.Name <chr>
summary(df)
##        X          Clothing.ID          Age          Title          
##  Min.   :    0   Min.   :   0.0   Min.   :18.0   Length:23486      
##  1st Qu.: 5871   1st Qu.: 861.0   1st Qu.:34.0   Class :character  
##  Median :11742   Median : 936.0   Median :41.0   Mode  :character  
##  Mean   :11742   Mean   : 918.1   Mean   :43.2                     
##  3rd Qu.:17614   3rd Qu.:1078.0   3rd Qu.:52.0                     
##  Max.   :23485   Max.   :1205.0   Max.   :99.0                     
##  Review.Text            Rating      Recommended.IND  Positive.Feedback.Count
##  Length:23486       Min.   :1.000   Min.   :0.0000   Min.   :  0.000        
##  Class :character   1st Qu.:4.000   1st Qu.:1.0000   1st Qu.:  0.000        
##  Mode  :character   Median :5.000   Median :1.0000   Median :  1.000        
##                     Mean   :4.196   Mean   :0.8224   Mean   :  2.536        
##                     3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:  3.000        
##                     Max.   :5.000   Max.   :1.0000   Max.   :122.000        
##  Division.Name      Department.Name     Class.Name       
##  Length:23486       Length:23486       Length:23486      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
colSums(is.na(df))
##                       X             Clothing.ID                     Age 
##                       0                       0                       0 
##                   Title             Review.Text                  Rating 
##                       0                       0                       0 
##         Recommended.IND Positive.Feedback.Count           Division.Name 
##                       0                       0                       0 
##         Department.Name              Class.Name 
##                       0                       0

It keeps only the columns needed for your analysis and then removes any rows with missing values so the data set is clean, consistent, and ready for the text‑mining task without errors or gaps.

df <- df[,c(1,2,3,5,6,9)] #selecting the data for task A

textMining <- na.omit(df) # Removes all rows containing null values

It sets a fixed random seed so the results stay the same each time, randomly selects 500 rows from the cleaned dataset, and creates a new dataset containing only that random sample.

set.seed(0) # Set random seed for repeatability

# Take sample of 500 review.Text
sample_index <- sample(nrow(textMining), 500) #Sample (Size of population, Size of sample), returns index for sample
sample_textMining <- textMining[sample_index,]

print(sample_textMining)
## # A tibble: 500 × 6
##        X Clothing.ID   Age Review.Text                      Rating Division.Name
##    <int>       <int> <int> <chr>                             <int> <chr>        
##  1 17400         912    47 "I believe that this sweater is…      3 General      
##  2  4774        1059    52 "The \"movement\" in these culo…      5 General Peti…
##  3 13217        1078    36 "I love this dress and will be …      5 General      
##  4 10538        1080    44 "This dress is wonderful and th…      5 General Peti…
##  5  8461         863    57 "This is a nice light top for t…      4 General      
##  6  4049         257    36 "Very soft, cozy, and exactly w…      5 Initmates    
##  7 13498        1094    34 "I ordered the red/orange but i…      4 General      
##  8 11570         862    41 "Not flattering and not the col…      2 General      
##  9 12256         868    35 "The fit on this shirt is bizar…      1 General      
## 10 17684        1095    46 "I ordered this dress in the bl…      4 General      
## # ℹ 490 more rows
head(sample_textMining)
## # A tibble: 6 × 6
##       X Clothing.ID   Age Review.Text                       Rating Division.Name
##   <int>       <int> <int> <chr>                              <int> <chr>        
## 1 17400         912    47 "I believe that this sweater is …      3 General      
## 2  4774        1059    52 "The \"movement\" in these culot…      5 General Peti…
## 3 13217        1078    36 "I love this dress and will be p…      5 General      
## 4 10538        1080    44 "This dress is wonderful and the…      5 General Peti…
## 5  8461         863    57 "This is a nice light top for th…      4 General      
## 6  4049         257    36 "Very soft, cozy, and exactly wh…      5 Initmates

It converts the review text into tokens so the text can be analysed: the first line breaks each review into individual words, and the second line breaks each review into two‑word combinations (bigrams).

word_tokenized_data <- textMining %>%
  unnest_tokens(output = word, input = "Review.Text", to_lower = TRUE)

bigram_tokenized_data <- textMining %>%
  unnest_tokens(output = bigram, input = "Review.Text",token = "ngrams", n = 2, to_lower = TRUE)

It counts how often each word appears, then plots the ten most frequent words as a horizontal bar chart to show which terms occur most in the reviews.

word_counts <-word_tokenized_data %>%
  count(word, sort = TRUE)

ggplot(word_counts[1:10, ], aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Words" , y = "Frequency") +
  coord_flip() +
  theme_minimal()

It counts how often each bigram appears in the dataset, then plots the ten most frequent two‑word combinations as a horizontal bar chart to show which paired terms occur most in the reviews.

word_counts <- bigram_tokenized_data %>%
  count(bigram, sort = TRUE)

ggplot(word_counts[1:10, ], aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Bigrams", y = "Frequency") +
  coord_flip() +
  theme_minimal()

It cleans the tokenised words by removing common stop-words, stripping out numbers and special characters, converting empty strings to missing values, lemmatising each word to its base form, and finally dropping any rows that still contain NAs, so the text is clean and ready for analysis.

clean_tokens <- word_tokenized_data %>%
  anti_join(stop_words, by = "word") #Remove stop words

clean_tokens$word <- gsub("[^a-zA-Z ]", "", clean_tokens$word) %>% # Remove numbers and special characters
  na_if("") %>% # Replaces empty strings with NA
  lemmatize_words() # Lemmatizes text

clean_tokens <- na.omit(clean_tokens) # Removes null values

It finds the ten most frequent cleaned words, filters the dataset to keep only those words, orders them for plotting, and then creates a horizontal bar chart showing how often each of the top words appears.

word_counts <- clean_tokens %>%
  count(word, sort = TRUE)

top_words <- top_n(word_counts,10,n)$word

filtered_word_counts <- filter(word_counts, word %in% top_words)
filtered_word_counts$word <- factor(filtered_word_counts$word, levels = top_words[length(top_words):1])

ggplot(filtered_word_counts[1:10, ], aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Words" , y = "Frequency") +
  coord_flip() +
  theme_minimal()

It creates a word cloud of the most frequent cleaned words, using their frequencies to determine word size and applying a fixed seed so the layout stays consistent each time.

set.seed(1)
wordcloud(words = filtered_word_counts$word, freq = filtered_word_counts$n, min.freq = 10, random.order = FALSE, random.color = FALSE, colors = sample(colors(), size = 10))

Sentiment Analysis

It attaches sentiment labels to each cleaned word, calculates a sentiment score for every review by subtracting negative words from positive ones, and then merges those scores back into your sampled dataset so each review includes its overall Bing sentiment value.

# Create dataset containing only words with associated sentiment & adds sentiment column.
sentiment_data <- clean_tokens %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") # Joins lexicon to dataset using only words that are in both.

# Calculate Sentiment scores for each review
sentiment_score <- sentiment_data %>%
  group_by(X) %>%
  summarize(bing_sentiment = sum(sentiment == "positive") - sum(sentiment == "negative")) # Sentiment score = number of positive words minus number of negative words

# Merge with df
textMining_with_sentiment <- sample_textMining %>%
  inner_join(sentiment_score, by = "X")

It keeps only the words that have an AFINN sentiment score, adds those numeric sentiment values to the tokens, sums the scores for each review to get an overall AFINN sentiment score, and then merges that score back into your dataset so each review includes its AFINN sentiment value.

# Create dataset containing only words with associated sentiment & adds sentiment column.
sentiment_data <- clean_tokens %>%
  inner_join(get_sentiments("afinn"), by = "word", relationship = "many-to-many") # Joins lexicon to dataset using only words that are in both.

# Calculate Sentiment scores for each review
sentiment_score <- sentiment_data %>%
  group_by(X) %>%
  summarize(afinn_sentiment = sum(value)) 

# Merge with df
textMining_with_sentiment <- textMining_with_sentiment %>%
  inner_join(sentiment_score, by = "X")

It finds the review with the lowest sentiment score and prints it, then finds the review with the highest sentiment score and prints that one as well.

worst_review = textMining_with_sentiment[order(textMining_with_sentiment$bing_sentiment)[1], "Review.Text"]

for (review in worst_review){
  print(review)
}
## [1] "I probably have 10 pairs of pilcro jeans and pants and most of the jeans are the skinny fit. i wanted something that wasn't a skinny fit so i thought i would try these straight fit jeans thinking they would be skinny but they also wouldn't be loose. just got them today and when i tried them on they were tighter then the skinny jeans i just took off. i guess i agree with the other reviewers that they fit like a glove. so if that's what you're looking for these will be great."
best_review = textMining_with_sentiment[order(textMining_with_sentiment$bing_sentiment, decreasing = TRUE)[1], "Review.Text"]

for (review in best_review){
  print(review)
}
## [1] "Pretty and comfortable fun cardi for those of us who love a nice little cropped number. glad to report that this cobalt beauty fits true to size in an xs all over -- shoulders hit at a nice place, arms, waist, length is great above your waist or just at it. the cotton is ultra comfortable. the zipper has that luxurious feature of a double zip where you can zip from the bottom or the top once closed, creating a nice opening. i love this for a casual but chic holiday look thanks to the velvet ador"

It creates a histogram of the Bing sentiment scores, showing how the reviews are distributed from negative to positive sentiment. The histogram shows that most customer reviews have slightly positive sentiment, with scores clustering around 2–3, indicating generally favorable feedback but with a mix of both positive and negative experiences.

ggplot(textMining_with_sentiment, aes(x = bing_sentiment)) +
  geom_histogram(binwidth = 1, fill = "steelblue")
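
To check that reading of the histogram numerically, a small sketch such as the one below can be run; it assumes textMining_with_sentiment exists exactly as built in the chunks above.

summary(textMining_with_sentiment$bing_sentiment) # five-number summary of the Bing scores
mean(textMining_with_sentiment$bing_sentiment > 0) # proportion of reviews with net-positive wording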

It calculates the average Bing sentiment score for each clothing division and then plots those averages as a horizontal bar chart to show which divisions receive more positive or negative reviews. Intimates scored well, suggesting strong customer satisfaction, while General had the lowest average sentiment, indicating potential issues with product quality, fit, or expectations.

clothing_sentiment <- textMining_with_sentiment %>%
  group_by(Division.Name) %>%
  summarise(Average_Bing_Sentiment = mean(bing_sentiment))

ggplot(clothing_sentiment, aes(x = reorder(Division.Name, Average_Bing_Sentiment), y = Average_Bing_Sentiment, fill = Division.Name)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Average Bing Sentiment Score by Division", x = "Division", y = "Average Sentiment Score")

The boxplot shows a clear link between customer ratings and Bing sentiment scores. Higher ratings generally correspond to more positive review language, while lower ratings show more negative sentiment, indicating that textual sentiment closely tracks the star-based feedback.

ggplot(textMining_with_sentiment, aes(x = factor(Rating), y = bing_sentiment)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Box plot of Bing sentiment score vs. Rating",
       x = "Rating",
       y = "Sentiment score")

The scatter plot shows a strong positive relationship between the two sentiment methods: as Bing sentiment scores increase, AFINN sentiment scores increase as well, so the two lexicons generally agree on how positive or negative each review is.

ggplot(textMining_with_sentiment, aes(x = bing_sentiment, y = afinn_sentiment)) +
  geom_point() +
  labs(title = "Scatter plot of Bing vs. AFINN Sentiment score",
       x = "Bing Sentiment score",
       y = "AFINN Sentiment Score")

The scatter plot shows that higher product ratings are usually linked to higher Bing sentiment scores. This means that customers who give better ratings also use more positive language in their written reviews. On the other hand, lower ratings are linked to more negative sentiment.

ggplot(textMining_with_sentiment, aes(x = Rating, y = bing_sentiment)) +
  geom_point() +
  labs(title = "Scatter plot of Rating vs. Bing Sentiment score",
       x = "Rating",
       y = "Bing Sentiment Score")

## Topic Modelling

This code gets the dataset ready for topic modelling by filtering, cleaning, and, if necessary, sampling it. It keeps reviews that are between 150 and 500 characters long, removes any missing values, gives each row a new ID, and, if more than 1000 reviews remain, randomly samples 1000 of them so topic modelling runs efficiently.

TopicModelling <- textMining %>%
  filter(str_length(Review.Text) >= 150 & str_length(Review.Text) <= 500) # Keep reviews of 150-500 characters

TopicModelling <- na.omit(TopicModelling) # Removes all rows containing null values

TopicModelling$X <- 1:nrow(TopicModelling)

if (nrow(TopicModelling) > 1000) {
  set.seed(1) # for reproducibility
  TopicModelling <- sample_n(TopicModelling, 1000)
}

This code cleans the review text and turns it into a structured matrix for topic modelling. It removes noise from the corpus (case differences, punctuation, numbers, and stopwords), stems the words, and then converts the processed reviews into a term-document matrix recording how many times each word appears in each review.

# Convert text column to corpus
corpus <- VCorpus(VectorSource(TopicModelling$Review.Text))

# Apply cleaning
corpus <- tm_map(corpus, content_transformer(tolower)) %>%
  tm_map(content_transformer(function(x) gsub("[^a-zA-Z ]", "", x))) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument)

# Convert to a term document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, 15)))

tdm_matrix <- as.matrix(tdm)

This code sums how many times each word appears across all the reviews, extracts the ten most common terms, and plots the overall distribution of term frequencies. The table of top terms shows what customers mention most, while the histogram reveals the long-tailed shape typical of natural language, in which a handful of words are very frequent and most words are rare. Together they give a clear picture of the dataset's vocabulary structure before the more in-depth topic-modelling analysis.

term_frequencies <- rowSums(tdm_matrix)

# Create a data frame for plotting
term_frequencies_df <- data.frame(term = names(term_frequencies), frequency = term_frequencies)

# Sort the data frame by frequency in descending order and select the top 10
top_terms <- term_frequencies_df %>%
  arrange(desc(frequency)) %>%
  head(10)

# Display the top 10 terms
print(top_terms)
##        term frequency
## dress dress     10665
## fit     fit      9901
## love   love      9551
## size   size      9515
## look   look      8348
## wear   wear      7249
## top     top      7192
## like   like      7041
## color color      6288
## just   just      5081
# Create the histogram
ggplot(term_frequencies_df, aes(x = frequency)) + geom_histogram(binwidth = 10, fill = "steelblue") + 
labs(title = "Histogram of Term Frequencies", 
     x = "Term Frequency", 
     y = "Number of Terms") +
  theme_minimal()
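
To quantify the long tail visible in the histogram, the share of very rare terms can be checked directly; this sketch assumes the term_frequencies vector created above.

mean(term_frequencies == 1) # proportion of terms that appear exactly once
quantile(term_frequencies, probs = c(0.5, 0.9, 0.99)) # typical versus extreme term frequencies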

This code filters the term–document matrix by removing words that are either too common or too rare, so that only meaningful terms remain for topic modelling. It identifies terms whose total frequency is at least 15% of the number of documents (a rough proxy for words appearing in a large share of reviews) and removes them unless they are useful keywords such as "fit", "love", "size", "look", or "wear". It also removes extremely rare terms, those whose total frequency is below 0.5% of the number of documents, which typically add noise. After filtering, it drops any document columns that no longer contain any terms. Overall, this step cleans and sharpens the matrix so that topic modelling is more accurate.

# Find terms whose total frequency is at least 15% of the number of documents
frequent_terms <- findFreqTerms(tdm, lowfreq = 0.15 * ncol(tdm_matrix))

# Find terms whose total frequency is below 0.5% of the number of documents
rare_terms <- findFreqTerms(tdm, highfreq = 0.005 * ncol(tdm_matrix))

print("Frequent Terms")
## [1] "Frequent Terms"
print(frequent_terms)
##  [1] "back"    "beauti"  "can"     "color"   "dress"   "fabric"  "fit"    
##  [8] "flatter" "great"   "just"    "like"    "littl"   "look"    "love"   
## [15] "nice"    "one"     "order"   "perfect" "realli"  "size"    "small"  
## [22] "soft"    "top"     "tri"     "wear"    "well"    "will"
print("First 20 Infrequent Terms")
## [1] "First 20 Infrequent Terms"
print(rare_terms[1:20])
##  [1] "aaaaaaamaz"   "aaaaannnnnnd" "aaaah"        "aaaahmaz"     "aaah"        
##  [6] "aam"          "abbey"        "abbi"         "abck"         "abdomen"     
## [11] "abdomin"      "abercrombi"   "abhor"        "abil"         "abject"      
## [16] "abnorm"       "abo"          "abolut"       "abou"         "aboutthi"
# Edit list of frequent words to keep useful ones
to_keep <- c("fit", "love", "size", "look", "wear")

to_remove <- frequent_terms[!frequent_terms %in% to_keep]

filtered_tdm_matrix <- tdm_matrix[!rownames(tdm_matrix) %in% to_remove, ]
filtered_tdm_matrix <- filtered_tdm_matrix[!rownames(filtered_tdm_matrix) %in% rare_terms, ]

# Remove 0 sum columns from tdm.

# Calculate column sums
column_sums <- colSums(filtered_tdm_matrix)

# Identify columns that are all zeros
zero_columns <- which(column_sums == 0)

# Remove these columns
if (length(zero_columns) > 0) {
  # Remove these columns
  filtered_tdm_matrix <- filtered_tdm_matrix[, -zero_columns]
} else {
  # If no columns are all zeros, just use the original matrix
  print("No zero columns in TDM matrix")
}
## [1] "No zero columns in TDM matrix"

This code runs the first stage of the topic modelling by converting the filtered term–document matrix into the correct format and fitting an LDA model: it transposes the matrix so that documents become rows, then applies LDA with three topics to uncover the main themes present in the cleaned review text.

dtm <- t(filtered_tdm_matrix)
lda_model <- LDA(dtm, k = 3)

This code takes the fitted LDA model and extracts the beta matrix, which gives the probability of each word under each topic. It then selects the ten words with the highest beta values for each topic, since these are the strongest clues to what each topic is about, tidies them, and draws a bar chart for each topic showing which words contribute most. Flipping the axes and using facet panels displays each topic separately, making the themes uncovered by the LDA model easy to read.

Topic 1: Words like look, love, size, fit, and wear are the most common in this topic. These terms suggest that Topic 1 is mainly about appearance and overall satisfaction: how the product looks, whether the customer likes it, and whether the size is right.

Topic 2: The strongest words here are fit, love, size, wear, length, comfort, shirt, and quality. This topic centres on how well garments fit, how comfortable they are, and how good the quality is; words like length, comfort, and shirt point to how an item fits and feels on the body. Likely themes: fit, comfort, and clothing quality.

Topic 3: This topic again highlights look, fit, love, size, and wear, but also includes words like waist, think, got, and much. These additional terms suggest customers are discussing specific fit issues, such as waist sizing, and giving more reflective or comparative comments. Likely themes: detailed fit issues and personal evaluation.

topics <- tidy(lda_model, matrix = "beta")
topics
## # A tibble: 2,385 × 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 abl     0.000127
##  2     2 abl     0.00148 
##  3     3 abl     0.00120 
##  4     1 absolut 0.00350 
##  5     2 absolut 0.00164 
##  6     3 absolut 0.000225
##  7     1 accentu 0.000232
##  8     2 accentu 0.000217
##  9     3 accentu 0.000304
## 10     1 across  0.000955
## # ℹ 2,375 more rows
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  ggplot(aes(x = reorder(term, beta), y = beta)) +
  geom_col(show.legend = FALSE, fill = "steelblue") +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

This code fits a three-topic LDA model to the document–term matrix and then prepares the inputs required for an interactive LDAvis display: the probability of each word within each topic (phi), the probability of each topic within each document (theta), the length of every document, the full vocabulary, and the overall frequency of each term. These components are combined into a JSON object that LDAvis can interpret, and the final call launches an interactive visualisation for exploring how distinct the topics are, how strongly each term contributes to each topic, and how the topics are distributed across the documents, giving a deeper and more intuitive understanding of the model's structure.

The LDAvis output shows three distinct topics emerging from the review text. The intertopic distance map on the left displays each topic as a circle, with the distance between circles reflecting how different their vocabularies are; well-separated circles indicate that the model has identified meaningful, non-overlapping themes, and the size of each circle shows how prevalent that topic is across the dataset. The bar chart on the right lists the thirty most salient terms for the selected topic, emphasising words that are both common and informative. Terms such as "fit", "wear", "waist", "comfort", and "shirt" suggest this topic concerns the fit, sizing, and comfort of clothing. The λ slider adjusts the relevance calculation to highlight words that are either frequent overall or distinctively characteristic of the topic. Overall, the visualisation helps in interpreting the prevalence, distinctness, and meaning of each topic in the corpus.

set.seed(1)
lda_model <- LDA(dtm, k = 3)

lda_vis_data <- createJSON(phi = posterior(lda_model)$terms,
                           theta = posterior(lda_model)$topics,
                           doc.length = rowSums(as.matrix(dtm)),
                           vocab = colnames(as.matrix(dtm)),
                           term.frequency = colSums(as.matrix(dtm)))

serVis(lda_vis_data)

This code creates a TF‑IDF representation of your text data by first building a corpus from the review text and then cleaning it through several preprocessing steps. Converting all text to lowercase, removing punctuation, numbers, stopwords, and extra whitespace ensures that only meaningful words remain. After cleaning, a Term‑Document Matrix is generated using TF‑IDF weighting, which highlights words that are important within individual documents but not overly common across the entire dataset. This weighting method is valuable because it reduces the influence of generic, repetitive terms and emphasizes distinctive vocabulary that better represents the underlying themes in your text. The final matrix provides a structured numerical format suitable for clustering, topic modeling, or machine‑learning tasks.
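
Before running the full pipeline, a minimal toy example may help make the weighting concrete. The three short "reviews" and the object names toy_corpus and toy_tdm below are invented purely for illustration and are not part of the assessment data.

# Toy TF-IDF illustration on an invented mini-corpus
toy_corpus <- VCorpus(VectorSource(c("blue dress fits well",
                                     "red dress runs small",
                                     "soft sweater fits well")))
toy_tdm <- TermDocumentMatrix(toy_corpus, control = list(weighting = weightTfIdf))
as.matrix(toy_tdm) # shared words like "dress" and "fits" receive lower weights than "blue" or "sweater"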

# Create a corpus 
corpus <- VCorpus(VectorSource(textMining_with_sentiment$Review.Text))

# clean the corpus
corpus <- tm_map(corpus, content_transformer(tolower)) 
corpus <- tm_map(corpus, removePunctuation) 
corpus <- tm_map(corpus, removeNumbers) 
corpus <- tm_map(corpus, removeWords, stopwords("english")) 
corpus <- tm_map(corpus, stripWhitespace)

# Create a term-document matrix with TF-IDF weighting
tdm_tfidf <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))

# Convert to matrix for analysis 
tfidf_matrix <- as.matrix(tdm_tfidf)

This code identifies the ten most influential terms in your corpus by summing TF‑IDF scores across all documents and visualising them in a bar chart. The summary highlights which words are most distinctive and meaningful in the dataset, helping you understand key themes that ordinary frequency counts might miss. TF‑IDF is important because it emphasises unique, information‑rich terms rather than common, repetitive words, making your insights more accurate and targeted.

# Sum TF-IDF scores across all documents
term_scores <- rowSums(tfidf_matrix)

# create a data frame for plotting
tfidf_df <- data.frame(term = names(term_scores), score = term_scores)

# Sort the data frame by frequency in descending order and select the top 10
top_tfidf_terms <- tfidf_df %>%
  arrange(desc(score)) %>%
  head(10)

# Display the top 10 terms
print(top_tfidf_terms)
##        term score
## love   love   204
## dress dress   193
## size   size   181
## top     top   153
## fit     fit   145
## like   like   142
## just   just   122
## wear   wear   119
## great great   105
## small small   104
# Plot the terms against tf-idf
ggplot(top_tfidf_terms, aes(x = reorder(term, score), y = score)) + 
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top TF-IDF Terms", 
       x = "Term", 
       y = "TF-IDF Score") +
  theme_minimal()

This code creates a word cloud from the top TF-IDF terms, visually highlighting the most distinctive words in the dataset. Mapping TF-IDF scores to word size emphasises the terms that carry the most unique information across the reviews.

set.seed(2)
wordcloud(words = top_tfidf_terms$term, freq = top_tfidf_terms$score, min.freq = 10, random.order = FALSE, random.color = FALSE, colors = sample(colors(), size = 10))

Each of the five topics identified by the STM model is labeled with the most representative words in the chart. The expected topic proportions, or the frequency with which each topic occurs throughout the corpus, are displayed by the horizontal bars. This aids in determining the topics that consumers talk about most frequently. Fit, fabric, and appearance, for instance, are frequently discussed, suggesting that these are major issues in the reviews. This plot is useful because it shows which topics are most important in customer feedback and gives a high-level overview of the dataset’s dominant themes. STM, as opposed to LDA, associates these topics with metadata, enabling more thorough examination of the ways in which topic prevalence differs among ratings, departments, or sentiment. Because of this, the summary plot is an essential tool for comprehending the text’s general customer priorities and patterns.

processed <- textProcessor(documents = textMining_with_sentiment$Review.Text, metadata   = textMining_with_sentiment)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
prep <- prepDocuments(processed$documents, processed$vocab, processed$meta)
## Removing 874 of 1851 terms (874 of 11869 tokens) due to frequency 
## Your corpus now has 459 documents, 977 terms and 10995 tokens.
stm_model <- stm(documents = prep$documents, vocab = prep$vocab, data = prep$meta, K = 5, max.em.its = 75, init.type = "Spectral")
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      .....
##   Recovering initialization...
##      .........
## Initialization complete.
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -6.216) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -6.101, relative change = 1.854e-02) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -6.072, relative change = 4.617e-03) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -6.060, relative change = 2.079e-03) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -6.052, relative change = 1.234e-03) 
## Topic 1: fit, size, top, great, perfect 
##  Topic 2: comfort, love, dress, work, cute 
##  Topic 3: short, just, like, bought, fit 
##  Topic 4: love, top, size, color, fit 
##  Topic 5: dress, look, wear, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -6.047, relative change = 8.774e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -6.043, relative change = 7.148e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -6.039, relative change = 6.333e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 9 (approx. per word bound = -6.035, relative change = 5.811e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 10 (approx. per word bound = -6.032, relative change = 5.430e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: comfort, dress, cute, work, length 
##  Topic 3: just, like, short, bought, pant 
##  Topic 4: love, top, color, fit, size 
##  Topic 5: dress, look, wear, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 11 (approx. per word bound = -6.029, relative change = 5.044e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 12 (approx. per word bound = -6.026, relative change = 4.588e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 13 (approx. per word bound = -6.024, relative change = 4.159e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 14 (approx. per word bound = -6.021, relative change = 3.867e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 15 (approx. per word bound = -6.019, relative change = 3.719e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: dress, comfort, cute, length, work 
##  Topic 3: just, like, bought, store, perfect 
##  Topic 4: love, top, fit, color, size 
##  Topic 5: dress, look, wear, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 16 (approx. per word bound = -6.017, relative change = 3.680e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 17 (approx. per word bound = -6.015, relative change = 3.692e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 18 (approx. per word bound = -6.013, relative change = 3.714e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 19 (approx. per word bound = -6.010, relative change = 3.728e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 20 (approx. per word bound = -6.008, relative change = 3.696e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: dress, comfort, cute, length, work 
##  Topic 3: just, like, bought, fabric, store 
##  Topic 4: love, fit, top, color, size 
##  Topic 5: look, dress, wear, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 21 (approx. per word bound = -6.006, relative change = 3.594e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 22 (approx. per word bound = -6.004, relative change = 3.405e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 23 (approx. per word bound = -6.002, relative change = 3.148e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 24 (approx. per word bound = -6.000, relative change = 2.864e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 25 (approx. per word bound = -5.999, relative change = 2.608e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: dress, cute, length, realli, work 
##  Topic 3: like, just, fabric, bought, store 
##  Topic 4: love, fit, top, size, color 
##  Topic 5: look, wear, dress, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 26 (approx. per word bound = -5.997, relative change = 2.403e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 27 (approx. per word bound = -5.996, relative change = 2.241e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 28 (approx. per word bound = -5.995, relative change = 2.104e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 29 (approx. per word bound = -5.993, relative change = 1.986e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 30 (approx. per word bound = -5.992, relative change = 1.905e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: dress, realli, cute, length, petit 
##  Topic 3: fabric, like, just, bought, store 
##  Topic 4: love, fit, top, size, color 
##  Topic 5: look, wear, order, can, nice 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 31 (approx. per word bound = -5.991, relative change = 1.886e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 32 (approx. per word bound = -5.990, relative change = 1.910e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 33 (approx. per word bound = -5.989, relative change = 1.863e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 34 (approx. per word bound = -5.988, relative change = 1.715e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 35 (approx. per word bound = -5.987, relative change = 1.609e-04) 
## Topic 1: fit, size, great, top, perfect 
##  Topic 2: dress, realli, petit, cute, length 
##  Topic 3: fabric, like, just, bought, store 
##  Topic 4: love, fit, top, size, color 
##  Topic 5: look, wear, order, can, like 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 36 (approx. per word bound = -5.986, relative change = 1.565e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 37 (approx. per word bound = -5.985, relative change = 1.557e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 38 (approx. per word bound = -5.984, relative change = 1.563e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 39 (approx. per word bound = -5.983, relative change = 1.564e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 40 (approx. per word bound = -5.982, relative change = 1.550e-04) 
## Topic 1: fit, size, great, top, comfort 
##  Topic 2: dress, realli, petit, cute, length 
##  Topic 3: fabric, like, just, bought, store 
##  Topic 4: love, fit, top, size, color 
##  Topic 5: look, wear, order, can, like 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 41 (approx. per word bound = -5.981, relative change = 1.525e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 42 (approx. per word bound = -5.980, relative change = 1.498e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 43 (approx. per word bound = -5.980, relative change = 1.490e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 44 (approx. per word bound = -5.979, relative change = 1.495e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 45 (approx. per word bound = -5.978, relative change = 1.515e-04) 
## Topic 1: fit, size, great, comfort, top 
##  Topic 2: dress, realli, petit, cute, length 
##  Topic 3: fabric, like, bought, just, color 
##  Topic 4: love, fit, top, size, small 
##  Topic 5: look, wear, order, can, like 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 46 (approx. per word bound = -5.977, relative change = 1.538e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 47 (approx. per word bound = -5.976, relative change = 1.561e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 48 (approx. per word bound = -5.975, relative change = 1.570e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 49 (approx. per word bound = -5.974, relative change = 1.552e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 50 (approx. per word bound = -5.973, relative change = 1.505e-04) 
## Topic 1: fit, size, great, comfort, top 
##  Topic 2: dress, petit, realli, cute, length 
##  Topic 3: fabric, like, bought, just, color 
##  Topic 4: love, fit, top, size, small 
##  Topic 5: look, wear, order, like, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 51 (approx. per word bound = -5.972, relative change = 1.443e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 52 (approx. per word bound = -5.971, relative change = 1.383e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 53 (approx. per word bound = -5.971, relative change = 1.332e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 54 (approx. per word bound = -5.970, relative change = 1.296e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 55 (approx. per word bound = -5.969, relative change = 1.271e-04) 
## Topic 1: fit, size, great, comfort, top 
##  Topic 2: dress, petit, realli, cute, get 
##  Topic 3: fabric, like, bought, color, just 
##  Topic 4: love, fit, top, size, small 
##  Topic 5: look, wear, order, like, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 56 (approx. per word bound = -5.968, relative change = 1.257e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 57 (approx. per word bound = -5.968, relative change = 1.248e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 58 (approx. per word bound = -5.967, relative change = 1.245e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 59 (approx. per word bound = -5.966, relative change = 1.241e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 60 (approx. per word bound = -5.965, relative change = 1.222e-04) 
## Topic 1: fit, size, great, comfort, perfect 
##  Topic 2: dress, petit, realli, cute, get 
##  Topic 3: fabric, bought, like, color, just 
##  Topic 4: love, fit, top, size, small 
##  Topic 5: look, wear, order, like, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 61 (approx. per word bound = -5.965, relative change = 1.189e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 62 (approx. per word bound = -5.964, relative change = 1.148e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 63 (approx. per word bound = -5.963, relative change = 1.095e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 64 (approx. per word bound = -5.963, relative change = 1.038e-04) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 65 (approx. per word bound = -5.962, relative change = 9.922e-05) 
## Topic 1: fit, size, great, comfort, perfect 
##  Topic 2: dress, petit, realli, cute, get 
##  Topic 3: fabric, bought, color, like, just 
##  Topic 4: love, fit, top, size, small 
##  Topic 5: look, wear, like, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 66 (approx. per word bound = -5.962, relative change = 9.613e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 67 (approx. per word bound = -5.961, relative change = 9.432e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 68 (approx. per word bound = -5.960, relative change = 9.393e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 69 (approx. per word bound = -5.960, relative change = 9.342e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 70 (approx. per word bound = -5.959, relative change = 9.258e-05) 
## Topic 1: fit, size, great, comfort, perfect 
##  Topic 2: dress, petit, realli, cute, get 
##  Topic 3: fabric, bought, color, like, just 
##  Topic 4: love, top, fit, size, small 
##  Topic 5: look, wear, like, order, can 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 71 (approx. per word bound = -5.959, relative change = 9.022e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 72 (approx. per word bound = -5.958, relative change = 8.643e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 73 (approx. per word bound = -5.958, relative change = 8.270e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 74 (approx. per word bound = -5.957, relative change = 7.967e-05) 
## ..................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Terminated Before Convergence Reached
labelTopics(stm_model)
## Topic 1 Top Words:
##       Highest Prob: fit, size, great, comfort, perfect, top, love 
##       FREX: around, pant, buy, romper, pilcro, often, armhol 
##       Lift: appropri, coh, fleetwood, floreat, gain, includ, ride 
##       Score: travel, pant, fit, around, size, pilcro, blous 
## Topic 2 Top Words:
##       Highest Prob: dress, petit, realli, cute, get, good, length 
##       FREX: regular, knee, dress, petit, good, realli, cute 
##       Lift: necessari, unusu, chose, closur, compel, given, mislabel 
##       Score: dress, unusu, petit, regular, realli, length, cute 
## Topic 3 Top Words:
##       Highest Prob: fabric, bought, color, like, just, store, materi 
##       FREX: compliment, bought, almost, swing, light, anoth, weather 
##       Lift: air, ask, combin, compar, custom, cuter, forward 
##       Score: parti, bought, compliment, almost, arriv, onlin, anoth 
## Topic 4 Top Words:
##       Highest Prob: love, top, fit, size, small, littl, color 
##       FREX: bottom, differ, cut, xxs, total, kept, extrem 
##       Lift: breast, contrast, due, extrem, hour, neither, ran 
##       Score: extrem, top, fit, differ, idea, size, cut 
## Topic 5 Top Words:
##       Highest Prob: wear, look, like, can, order, nice, sweater 
##       FREX: make, skirt, sweater, front, wear, loos, can 
##       Lift: silk, chic, everyday, justic, laid, middl, obvious 
##       Score: chic, can, sweater, wear, soft, skirt, make
plot(stm_model, type = "summary")
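
As noted above, a key advantage of STM over LDA is that topic prevalence can be related to metadata. A minimal sketch of that step is shown below; it assumes prep and stm_model exist exactly as created above and uses the Rating column carried through in the metadata, with estimateEffect from the stm package.

# Sketch: estimate how topic prevalence varies with the star rating
rating_effect <- estimateEffect(1:5 ~ Rating, stm_model, metadata = prep$meta)
summary(rating_effect) # regression of each topic's prevalence on Rating
plot(rating_effect, covariate = "Rating", topics = 1:5, method = "continuous")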

The most significant terms associated with Topic 1 from the STM model are shown in this word cloud. Each word's weight within the topic is reflected in its size, so larger words contribute more to defining the theme. Here, terms like size, fit, perfect, comfort, and jean predominate, indicating that this topic focuses on the comfort, fit, and accuracy of clothing sizing.

cloud(stm_model, topic = 1, random.color = FALSE, colors = sample(colors(), size = 10))

In order to capture deeper contextual meaning beyond bag-of-words approaches, future work could expand this analysis by utilizing more sophisticated NLP techniques like transformer-based models (e.g., BERT). Sentiment could be directly linked to particular product attributes, such as fit, quality, or delivery, using aspect-based sentiment analysis. Topic modeling that is dynamic or time-aware may show how customer concerns change over time. Predictive models of customer satisfaction could be constructed by integrating additional metadata, such as past purchases or return patterns. Lastly, implementing these models in an interactive dashboard would facilitate real-time decision-making and allow ongoing monitoring of emerging themes.
