1. Libraries & Setup

All packages and lexicons are loaded once here and reused throughout the document.

library(tidyverse)
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(SnowballC)
library(topicmodels)
library(tm)
library(readtext)
library(tidyr)
library(textdata)
library(widyr)
library(igraph)
library(ggraph)
library(readxl)
library(officer)
library(forcats)
# Load all sentiment lexicons once
bing_lexicon  <- get_sentiments("bing")
afinn_lexicon <- get_sentiments("afinn")
nrc_lexicon   <- get_sentiments("nrc")

# Pre-filter NRC emotion subsets
nrc_joy          <- nrc_lexicon %>% filter(sentiment == "joy")
nrc_anger        <- nrc_lexicon %>% filter(sentiment == "anger")
nrc_trust        <- nrc_lexicon %>% filter(sentiment == "trust")
nrc_anticipation <- nrc_lexicon %>% filter(sentiment == "anticipation")

2. Data Collection & Preprocessing

2.1 Data Sources

Text data was collected from six channels for Nike and two competitor brands (Adidas, Under Armour):

Brand Sources
Nike App Store Reviews, News Headlines, Reddit Comments, Twitter Mentions, YouTube Comments, Product Reviews
Adidas Twitter, News, YouTube, App Store, Reddit, Product Reviews
Under Armour Twitter, News, YouTube, App Store, Reddit, Product Reviews

2.2 Load Documents

# !! CHANGE THIS to the folder where your .docx files are saved !!
folder_path <- "C:/Users/patri/OneDrive/Documents/Hult/MBAN/Business Analysis with Unstructured Data - Kurnicki/Nike/New folder"

raw_docs <- readtext(paste0(folder_path, "/*.docx"))

colnames(raw_docs)[1] <- "source"
colnames(raw_docs)[2] <- "text"
raw_docs$source <- str_remove(raw_docs$source, "\\.docx$")

nike_df <- raw_docs %>% select(source, text)

# Confirm source names before recoding
unique(nike_df$source)
##  [1] "Adidas Reviews Website"          "Adidas Twitter Mentions"        
##  [3] "adidas_headlines"                "Adidas_Youtube_Comments"        
##  [5] "appstore_adidas"                 "appstore_underarmour"           
##  [7] "Nike - App Store Reviews"        "Nike - News_Headlines_NoYears"  
##  [9] "Nike - Reddit_Comments_NoLinks"  "nike - Twitter mentions"        
## [11] "Nike - Youtube_Comments_NoLinks" "Nike Product Reviews"           
## [13] "reddit_adidas"                   "reddit_underarmour"             
## [15] "Under Armour Reviews Website"    "Under Armour Twitter Mentions"  
## [17] "Under Armour Youtube Comments"   "under_armour_headlines"

2.3 Source Labelling

nike_df$source <- recode(nike_df$source,
  "Nike - App Store Reviews"         = "Nike - App Store",
  "Nike - News_Headlines_NoYears"    = "Nike - News",
  "Nike - Reddit_Comments_NoLinks"   = "Nike - Reddit",
  "nike - Twitter mentions"          = "Nike - Twitter",
  "Nike - Youtube_Comments_NoLinks"  = "Nike - YouTube",
  "Nike Product Reviews"             = "Nike - Product Reviews",
  "Adidas Twitter Mentions"          = "Adidas - Twitter",
  "adidas_headlines"                 = "Adidas - News",
  "Adidas_Youtube_Comments"          = "Adidas - YouTube",
  "appstore_adidas"                  = "Adidas - App Store",
  "reddit_adidas"                    = "Adidas - Reddit",
  "Adidas Reviews Website"           = "Adidas - Product Reviews",
  "Under Armour Twitter Mentions"    = "Under Armour - Twitter",
  "Under Armour Youtube Comments"    = "Under Armour - YouTube",
  "under_armour_headlines"           = "Under Armour - News",
  "appstore_underarmour"             = "Under Armour - App Store",
  "reddit_underarmour"               = "Under Armour - Reddit",
  "Under Armour Reviews Website"     = "Under Armour - Product Reviews"
)

# Brand groupings used throughout the script
nike_sources <- c("Nike - App Store", "Nike - News", "Nike - Reddit",
                  "Nike - Twitter", "Nike - YouTube", "Nike - Product Reviews")

adidas_sources <- c("Adidas - Twitter", "Adidas - News", "Adidas - YouTube",
                    "Adidas - App Store", "Adidas - Reddit", "Adidas - Product Reviews")

ua_sources <- c("Under Armour - Twitter", "Under Armour - YouTube",
                "Under Armour - News", "Under Armour - App Store",
                "Under Armour - Reddit", "Under Armour - Product Reviews")

print(unique(nike_df$source))
##  [1] "Adidas - Product Reviews"       "Adidas - Twitter"              
##  [3] "Adidas - News"                  "Adidas - YouTube"              
##  [5] "Adidas - App Store"             "Under Armour - App Store"      
##  [7] "Nike - App Store"               "Nike - News"                   
##  [9] "Nike - Reddit"                  "Nike - Twitter"                
## [11] "Nike - YouTube"                 "Nike - Product Reviews"        
## [13] "Adidas - Reddit"                "Under Armour - Reddit"         
## [15] "Under Armour - Product Reviews" "Under Armour - Twitter"        
## [17] "Under Armour - YouTube"         "Under Armour - News"

2.4 Tokenization

nike_tokens <- nike_df %>%
  unnest_tokens(word, text)

nike_tokens %>% count(word, sort = TRUE) %>% head(10)
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
##   word      n text     
## * <chr> <int> <chr>    
## 1 the    4759 "\"\"..."
## 2 i      3883 "\"\"..."
## 3 and    2960 "\"\"..."
## 4 to     2960 "\"\"..."
## 5 a      2307 "\"\"..."
## 6 it     1946 "\"\"..."
## # ℹ 4 more rows

2.5 Stopword Removal

tidy_nike <- nike_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "^[0-9]+$"))

tidy_nike %>% count(word, sort = TRUE) %>% head(10)
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
##   word         n text     
## * <chr>    <int> <chr>    
## 1 app       1690 "\"\"..."
## 2 love       601 "\"\"..."
## 3 nike       554 "\"\"..."
## 4 adidas     418 "\"\"..."
## 5 running    345 "\"\"..."
## 6 workouts   343 "\"\"..."
## # ℹ 4 more rows

2.6 Stemming

Words are reduced to their root form using the Porter stemmer.

tidy_nike_stem <- tidy_nike %>%
  mutate(word = wordStem(word))

tidy_nike_stem %>% count(word, sort = TRUE) %>% head(10)
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
##   word        n text     
## * <chr>   <int> <chr>    
## 1 app      1771 "\"\"..."
## 2 run       725 "\"\"..."
## 3 love      664 "\"\"..."
## 4 workout   627 "\"\"..."
## 5 shoe      614 "\"\"..."
## 6 nike      572 "\"\"..."
## # ℹ 4 more rows

2.7 Top Words per Brand

tidy_nike %>%
  mutate(brand = case_when(
    source %in% adidas_sources ~ "Adidas",
    source %in% ua_sources     ~ "Under Armour",
    TRUE                       ~ "Nike"
  )) %>%
  count(brand, word, sort = TRUE) %>%
  group_by(brand) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, brand)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_segment(aes(xend = word, y = 0, yend = n), color = "gray70") +
  geom_point(aes(color = brand), size = 3, show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~brand, scales = "free") +
  coord_flip() +
  labs(x = NULL, y = "Word Count",
       title = "Top 10 Words per Brand (Stopwords Removed)")

2.8 N-grams

Bigrams

bigrams_separated <- nike_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

bigrams_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  head(15)
##          word1    word2   n
## 1            5 positive 267
## 2            1 negative  97
## 3        apple    watch  70
## 4      running      app  55
## 5            4 positive  52
## 6            3  neutral  51
## 7         2024       11  47
## 8          air      max  47
## 9         2025       01  41
## 10      adidas    score  40
## 11 underarmour    score  37
## 12        2024       12  32
## 13     workout      app  32
## 14         gif    giphy  30
## 15    positive     love  29

Quadrograms

nike_df %>%
  unnest_tokens(quadrogram, text, token = "ngrams", n = 4) %>%
  separate(quadrogram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
         !word3 %in% stop_words$word, !word4 %in% stop_words$word) %>%
  count(word1, word2, word3, word4, sort = TRUE) %>%
  head(15)
##              word1 word2 word3    word4  n
## 1      underarmour score     7      low 12
## 2      underarmour score     8      low  9
## 3     sneakerheads score     5      low  8
## 4               11    12     5 positive  6
## 5             2024    11    12        5  6
## 6               02    10     5 positive  5
## 7               11    13     5 positive  5
## 8               11    15     5 positive  5
## 9               11    23     5 positive  5
## 10 adidasoriginals score     5      low  5
## 11              01    06     5 positive  4
## 12              11    16     5 positive  4
## 13              12    04     5 positive  4
## 14              12    27     5 positive  4
## 15            2024    11    13        5  4

2.9 Document-Term Matrix (DTM)

tidy_dtm <- nike_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(source, word, sort = TRUE) %>%
  cast_dtm(source, word, n)

inspect(tidy_dtm[1:6, 1:10])
## <<DocumentTermMatrix (documents: 6, terms: 10)>>
## Non-/sparse entries: 45/15
## Sparsity           : 25%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##                           Terms
## Docs                       2025   5 app love nike pair positive shoes workout
##   Adidas - App Store          2  20 567   90   13    0        0    14      29
##   Adidas - Reddit             0  14   0    2    1    7        1     6       0
##   Adidas - YouTube            1   6   0   25    4   28        0    34       0
##   Nike - App Store          239 294 639  206  128    3      324     0     206
##   Nike - Reddit               2  22  15   89  217  170        0   178       4
##   Under Armour - App Store    3  28 468  114    5    0        0     3      39
##                           Terms
## Docs                       workouts
##   Adidas - App Store             20
##   Adidas - Reddit                 0
##   Adidas - YouTube                0
##   Nike - App Store              286
##   Nike - Reddit                   0
##   Under Armour - App Store       34

3. Sentiment Analysis

3.1 Lexicon Overview

Three sentiment lexicons are used in this analysis:

Lexicon Type Scale
AFINN Numeric score per word -5 (very negative) to +5 (very positive)
Bing Binary classification Positive / Negative
NRC Emotion categories Joy, Anger, Trust, Anticipation, Fear, Sadness, Surprise, Disgust
sentiments <- bind_rows(mutate(afinn_lexicon, lexicon = "afinn"),
                        mutate(nrc_lexicon,   lexicon = "nrc"),
                        mutate(bing_lexicon,  lexicon = "bing"))

# NRC emotion categories
unique(subset(sentiments, lexicon == "nrc")$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"
# AFINN score distribution
summary(subset(sentiments, lexicon == "afinn")$value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.0000 -2.0000 -2.0000 -0.5894  2.0000  5.0000

3.2 NRC Emotion Words in Product Reviews

tidy_nike %>%
  filter(source == "Nike - Product Reviews") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  head(10)
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
##   word         n text     
## * <chr>    <int> <chr>    
## 1 love         9 "\"\"..."
## 2 daughter     3 "\"\"..."
## 3 happy        3 "\"\"..."
## 4 favorite     2 "\"\"..."
## 5 true         2 "\"\"..."
## 6 baby         1 "\"\"..."
## # ℹ 4 more rows
tidy_nike %>%
  filter(source == "Nike - Product Reviews") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE) %>%
  head(10)
## readtext object consisting of 4 documents and 0 docvars.
## # A data frame: 4 × 3
##   word       n text     
## * <chr>  <int> <chr>    
## 1 fleece     7 "\"\"..."
## 2 fits       5 "\"\"..."
## 3 force      1 "\"\"..."
## 4 storm      1 "\"\"..."
tidy_nike %>%
  filter(source == "Nike - Product Reviews") %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE) %>%
  head(10)
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
##   word          n text     
## * <chr>     <int> <chr>    
## 1 cap           3 "\"\"..."
## 2 happy         3 "\"\"..."
## 3 wear          3 "\"\"..."
## 4 favorite      2 "\"\"..."
## 5 recommend     2 "\"\"..."
## 6 true          2 "\"\"..."
## # ℹ 4 more rows

3.3 Sentiment Comparison per Source

app_store       <- tidy_nike %>% filter(source == "Nike - App Store")
product_reviews <- tidy_nike %>% filter(source == "Nike - Product Reviews")
reddit          <- tidy_nike %>% filter(source == "Nike - Reddit")
twitter         <- tidy_nike %>% filter(source == "Nike - Twitter")
youtube         <- tidy_nike %>% filter(source == "Nike - YouTube")
news            <- tidy_nike %>% filter(source == "Nike - News")

sources_list <- list(
  "App Store Reviews"    = app_store,
  "Nike Product Reviews" = product_reviews,
  "Reddit Comments"      = reddit,
  "Twitter Mentions"     = twitter,
  "YouTube Comments"     = youtube,
  "News Headlines"       = news
)

for (src_name in names(sources_list)) {
  src_data <- sources_list[[src_name]]

  afinn <- src_data %>%
    inner_join(afinn_lexicon) %>%
    summarise(sentiment = sum(value)) %>%
    mutate(method = "AFINN")

  bing_and_nrc <- bind_rows(
    src_data %>% inner_join(bing_lexicon) %>% mutate(method = "Bing et al."),
    src_data %>%
      inner_join(nrc_lexicon %>% filter(sentiment %in% c("positive", "negative"))) %>%
      mutate(method = "NRC")
  ) %>%
    count(method, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)

  print(
    bind_rows(afinn, bing_and_nrc) %>%
      ggplot(aes(method, sentiment, fill = sentiment > 0)) +
      geom_col(show.legend = FALSE) +
      geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
      scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
      facet_wrap(~method, ncol = 1, scales = "free_y") +
      labs(title = paste0(src_name, ": Sentiment Across Lexicons"),
           x = NULL, y = "Sentiment Score")
  )
}

3.4 Top 10 Positive & Negative Words per Source

for (src_name in names(sources_list)) {
  print(
    sources_list[[src_name]] %>%
      inner_join(bing_lexicon) %>%
      count(word, sentiment, sort = TRUE) %>%
      group_by(sentiment) %>%
      top_n(10) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      labs(y = "Contribution to sentiment", x = NULL,
           title = paste0(src_name, ": Top 10 Positive & Negative Words")) +
      coord_flip()
  )
}

3.5 Overall Sentiment (Stemmed)

tidy_nike_stem %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(order_by = n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = tidytext::reorder_within(word, n, sentiment)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  tidytext::scale_x_reordered() +
  labs(y = "Contribution to sentiment", x = NULL,
       title = "Overall: Top 10 Positive & Negative Words (Stemmed)") +
  coord_flip()

3.6 Sentiment per Source — AFINN, Bing, NRC

# AFINN
tidy_nike %>%
  inner_join(afinn_lexicon) %>%
  group_by(source) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(source = reorder(source, sentiment)) %>%
  ggplot(aes(source, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
  coord_flip() +
  labs(x = NULL, y = "Sentiment Score", title = "AFINN: Sentiment Intensity per Source")

# Bing
tidy_nike %>%
  inner_join(bing_lexicon) %>%
  count(source, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative,
         source = reorder(source, sentiment)) %>%
  ggplot(aes(source, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
  coord_flip() +
  labs(x = NULL, y = "Net Sentiment (Positive - Negative)",
       title = "Bing: Net Sentiment Score per Source")

# NRC
tidy_nike %>%
  inner_join(nrc_lexicon) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(source, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative,
         source = reorder(source, sentiment)) %>%
  ggplot(aes(source, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
  coord_flip() +
  labs(x = NULL, y = "Net Sentiment (Positive - Negative)",
       title = "NRC: Net Sentiment Score per Source")

3.7 Emotion Heatmap per Source

tidy_nike %>%
  inner_join(nrc_lexicon) %>%
  filter(sentiment %in% c("joy", "anger", "trust", "anticipation",
                          "fear", "sadness", "surprise", "disgust")) %>%
  count(source, sentiment) %>%
  group_by(source) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = source, y = sentiment, fill = proportion)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = percent(proportion, accuracy = 1)), size = 3) +
  scale_fill_gradient(low = "#f7f7f7", high = "#1A5276",
                      labels = percent_format()) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(x = NULL, y = NULL, fill = "% Words",
       title = "Emotion Heatmap per Source")


4. Topic Modeling (LDA)

4.1 Model Details

An LDA model with k = 4 topics is fitted on the Nike-only data. The seed is set to 123 for reproducibility.

nike_lda_dtm <- tidy_nike %>%
  count(source, word, sort = TRUE) %>%
  cast_dtm(source, word, n)

nike_lda <- LDA(nike_lda_dtm, k = 4, control = list(seed = 123))

4.2 Top Terms per Topic (Beta)

nike_topics <- tidy(nike_lda, matrix = "beta")

top_terms <- nike_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# !! READ THIS OUTPUT and update topic_labels below !!
top_terms %>% print(n = 40)
## # A tibble: 40 × 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 app         0.0741 
##  2     1 running     0.0177 
##  3     1 run         0.0169 
##  4     1 love        0.0147 
##  5     1 it’s        0.0119 
##  6     1 time        0.0119 
##  7     1 track       0.00884
##  8     1 i’ve        0.00835
##  9     1 watch       0.00759
## 10     1 adidas      0.00688
## 11     2 adidas      0.0728 
## 12     2 nike        0.0183 
## 13     2 score       0.0176 
## 14     2 shoes       0.0116 
## 15     2 pair        0.00916
## 16     2 pro         0.00718
## 17     2 love        0.00702
## 18     2 video       0.00668
## 19     2 adios       0.00643
## 20     2 shoe        0.00595
## 21     3 app         0.0704 
## 22     3 positive    0.0357 
## 23     3 workouts    0.0319 
## 24     3 workout     0.0231 
## 25     3 love        0.0230 
## 26     3 nike        0.0142 
## 27     3 negative    0.0133 
## 28     3 it’s        0.0123 
## 29     3 free        0.0118 
## 30     3 i’ve        0.00897
## 31     4 nike        0.0166 
## 32     4 shoes       0.0134 
## 33     4 pair        0.0122 
## 34     4 shoe        0.0120 
## 35     4 size        0.00915
## 36     4 love        0.00789
## 37     4 wear        0.00607
## 38     4 lol         0.00597
## 39     4 comfortable 0.00581
## 40     4 armour      0.00546
# !! CHANGE THESE labels based on what you see in top_terms above !!
topic_labels <- c(
  "1" = "Topic 1 — rename me",
  "2" = "Topic 2 — rename me",
  "3" = "Topic 3 — rename me",
  "4" = "Topic 4 — rename me"
)
top_terms %>%
  mutate(topic_label = topic_labels[as.character(topic)],
         term = reorder_within(term, beta, topic_label)) %>%
  ggplot(aes(x = term, y = beta, color = topic_label)) +
  geom_segment(aes(xend = term, y = 0, yend = beta), color = "gray70") +
  geom_point(size = 3, show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~topic_label, scales = "free") +
  coord_flip() +
  labs(title = "LDA: Top 10 Terms per Topic", x = NULL, y = "Beta (Word Probability)")

4.3 Topic Distribution per Source (Gamma)

tidy(nike_lda, matrix = "gamma") %>%
  mutate(topic_label = topic_labels[as.character(topic)]) %>%
  ggplot(aes(x = topic_label, y = document, fill = gamma)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(gamma, 2)), size = 3) +
  scale_fill_gradient(low = "#f7f7f7", high = "#1A5276") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  labs(title = "Topic Distribution per Source",
       x = "Topic", y = NULL, fill = "Gamma")


5. Business-Driven Text Analytics

5.1 TF-IDF Brand Differentiation

Words most uniquely associated with each brand versus the others.

custom_stopwords <- tibble(word = c("nike", "just", "im", "ive"))

tfidf_brand <- tidy_nike %>%
  mutate(brand = case_when(
    source %in% adidas_sources ~ "Adidas",
    source %in% ua_sources     ~ "Under Armour",
    TRUE                       ~ "Nike"
  )) %>%
  anti_join(custom_stopwords, by = "word") %>%
  count(brand, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = brand, n = n)

tfidf_brand %>%
  group_by(brand) %>%
  slice_max(order_by = tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = tidytext::reorder_within(word, tf_idf, brand)) %>%
  ggplot(aes(word, tf_idf, fill = brand)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~brand, scales = "free_y") +
  tidytext::scale_x_reordered() +
  coord_flip() +
  labs(title = "TF-IDF: Top 10 Distinctive Words by Brand",
       x = NULL, y = "TF-IDF Score")

5.2 Aspect-Based Sentiment

aspect_dictionary <- tribble(
  ~word,         ~aspect,
  "size",        "Fit/Sizing",    "fit",         "Fit/Sizing",
  "fits",        "Fit/Sizing",    "half",        "Fit/Sizing",
  "oversized",   "Fit/Sizing",    "sizing",      "Fit/Sizing",
  "tight",       "Fit/Sizing",    "loose",       "Fit/Sizing",
  "comfortable", "Comfort",       "comfy",       "Comfort",
  "comfort",     "Comfort",       "soft",        "Comfort",
  "cushion",     "Comfort",
  "quality",     "Quality/Durability", "peeled",  "Quality/Durability",
  "cheap",       "Quality/Durability", "durable", "Quality/Durability",
  "fell",        "Quality/Durability", "problem", "Quality/Durability",
  "awkward",     "Quality/Durability",
  "style",       "Style/Design",  "design",      "Style/Design",
  "colorway",    "Style/Design",  "look",        "Style/Design",
  "black",       "Style/Design",  "fleece",      "Style/Design",
  "bomber",      "Style/Design",
  "app",         "App Experience", "workout",    "App Experience",
  "workouts",    "App Experience", "fitness",    "App Experience",
  "crashing",    "App Experience", "crashes",    "App Experience",
  "bugs",        "App Experience", "issue",      "App Experience",
  "issues",      "App Experience", "frustrating","App Experience",
  "revenue",     "Corporate/Business", "strategy","Corporate/Business",
  "investor",    "Corporate/Business", "stock",   "Corporate/Business",
  "ceo",         "Corporate/Business", "wholesale","Corporate/Business",
  "relations",   "Corporate/Business",
  "team",        "Sports/Cultural Visibility", "national","Sports/Cultural Visibility",
  "shirt",       "Sports/Cultural Visibility", "jersey",  "Sports/Cultural Visibility",
  "cup",         "Sports/Cultural Visibility", "brazil",  "Sports/Cultural Visibility",
  "football",    "Sports/Cultural Visibility"
)

aspect_sentiment_bing <- tidy_nike %>%
  inner_join(aspect_dictionary, by = "word") %>%
  inner_join(bing_lexicon, by = "word")
# Net sentiment by aspect — diverging bar
aspect_sentiment_bing %>%
  mutate(score = if_else(sentiment == "positive", 1, -1)) %>%
  group_by(aspect) %>%
  summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop") %>%
  mutate(aspect = reorder(aspect, avg_sentiment)) %>%
  ggplot(aes(x = aspect, y = avg_sentiment, fill = avg_sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
  coord_flip() +
  labs(title = "Aspect-Based Sentiment: Net Score by Aspect",
       x = NULL, y = "Average Sentiment Score")

# Aspect sentiment by brand — bubble plot
aspect_sentiment_bing %>%
  mutate(
    score = if_else(sentiment == "positive", 1, -1),
    brand = case_when(
      source %in% adidas_sources ~ "Adidas",
      source %in% ua_sources     ~ "Under Armour",
      TRUE                       ~ "Nike"
    )
  ) %>%
  group_by(brand, aspect) %>%
  summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop") %>%
  ggplot(aes(x = avg_sentiment, y = aspect, size = mentions, color = avg_sentiment)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient2(low = "tomato", mid = "gray80", high = "steelblue", midpoint = 0) +
  scale_size_continuous(range = c(2, 8)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
  facet_wrap(~brand, ncol = 3) +
  labs(title = "Aspect Sentiment by Brand",
       subtitle = "Dot size = number of mentions",
       x = "Average Sentiment Score", y = NULL,
       color = "Sentiment", size = "Mentions")

5.3 Keyword Co-occurrence Network

word_cors <- tidy_nike %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  ungroup() %>%
  pairwise_cor(word, source, sort = TRUE)

# Top correlated words — lollipop
word_cors %>%
  filter(item1 %in% c("nike", "shoes", "quality", "price")) %>%
  group_by(item1) %>%
  slice_max(order_by = correlation, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(x = item2, y = correlation)) +
  geom_segment(aes(xend = item2, y = 0, yend = correlation), color = "gray70") +
  geom_point(color = "steelblue", size = 3) +
  facet_wrap(~item1, scales = "free") +
  coord_flip() +
  labs(title = "Top 5 Word Correlations with Key Nike Terms",
       x = NULL, y = "Correlation")


6. Brand Perception

6.1 Overall Emotional Positioning

tidy_nike %>%
  inner_join(nrc_lexicon) %>%
  filter(sentiment %in% c("joy", "anger", "trust", "anticipation",
                          "fear", "sadness", "surprise", "disgust")) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(sentiment = reorder(sentiment, n)) %>%
  ggplot(aes(x = sentiment, y = n)) +
  geom_segment(aes(xend = sentiment, y = 0, yend = n), color = "gray70") +
  geom_point(aes(color = sentiment), size = 5, show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL, y = "Word Count",
       title = "Nike: Overall Emotional Positioning Across All Sources")

6.2 Brand Loyalty — Trust Words

# Top 15 trust words overall
tidy_nike %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_segment(aes(xend = word, y = 0, yend = n), color = "gray70") +
  geom_point(color = "steelblue", size = 4) +
  coord_flip() +
  labs(x = NULL, y = "Word Count",
       title = "Nike: Top 15 Trust Words Driving Brand Loyalty")

# Trust words — percentage share
tidy_nike %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE) %>%
  mutate(proportion = n / sum(n)) %>%
  top_n(15, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion, fill = word)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(labels = percent_format()) +
  coord_flip() +
  labs(x = NULL, y = "% of Trust Words",
       title = "Nike: Top Trust Words (% Share)")

bigrams_filtered %>%
  filter(!str_detect(word1, "^[0-9]+$"), !str_detect(word2, "^[0-9]+$"),
         !str_detect(word1, "'"),         !str_detect(word2, "'"),
         !word1 %in% c("positive", "negative", "neutral"),
         !word2 %in% c("positive", "negative", "neutral")) %>%
  filter(word1 %in% c("wear", "favorite", "recommend", "real", "true",
                      "perfect", "happy", "strength", "love", "free",
                      "comfortable", "trust") |
           word2 %in% c("wear", "favorite", "recommend", "real", "true",
                        "perfect", "happy", "strength", "love", "free",
                        "comfortable", "trust")) %>%
  count(word1, word2, sort = TRUE) %>%
  top_n(15) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) +
  geom_segment(aes(xend = bigram, y = 0, yend = n), color = "gray70") +
  geom_point(color = "steelblue", size = 3) +
  coord_flip() +
  labs(x = NULL, y = "Count",
       title = "Top 15 Bigrams Around Loyalty & Trust Words")


7. Product Strategy Analysis

7.1 Parse Product Reviews

product_raw       <- nike_df %>% filter(source == "Nike - Product Reviews")
product_text_blob <- product_raw$text[[1]]

product_lines <- tibble(raw_line = str_split(product_text_blob, "\n")[[1]]) %>%
  mutate(raw_line = str_squish(raw_line)) %>%
  filter(raw_line != "")

product_reviews_structured <- product_lines %>%
  mutate(
    audience = str_extract(raw_line, "\\b(Women|Men|Kids|Everyone)\\b"),
    product_category = str_extract(
      raw_line,
      "\\b(Pants|Hoodie|Sweatshirt|Leggings|Shoes|Jacket|Skirt|Dress|Shirt|Accessories|Accessorise)\\b"
    ),
    product_name = str_trim(str_remove(
      raw_line,
      "\\b(Pants|Hoodie|Sweatshirt|Leggings|Shoes|Jacket|Skirt|Dress|Shirt|Accessories|Accessorise)\\b.*$"
    )),
    review_text = str_trim(str_remove(raw_line, "^.*\\b(Women|Men|Kids|Everyone)\\b"))
  ) %>%
  filter(!is.na(product_name), product_name != "",
         !is.na(product_category), product_category != "",
         !is.na(audience), audience != "",
         !is.na(review_text), review_text != "") %>%
  mutate(review_id = row_number())

glimpse(product_reviews_structured)
## Rows: 3
## Columns: 6
## $ raw_line         <chr> "Women's Loose Full-Zip Cropped HoodieSweatshirtWomen…
## $ audience         <chr> "Women", "Women", "Women"
## $ product_category <chr> "Hoodie", "Shoes", "Shoes"
## $ product_name     <chr> "Women's Loose Full-Zip Cropped HoodieSweatshirtWomen…
## $ review_text      <chr> "'s Loose Full-Zip Cropped HoodieSweatshirtWomen\"I’v…
## $ review_id        <int> 1, 2, 3
table(product_reviews_structured$product_category)
## 
## Hoodie  Shoes 
##      1      2
table(product_reviews_structured$audience)
## 
## Women 
##     3

7.2 Complaint Density by Product Line

product_tokens <- product_reviews_structured %>%
  unnest_tokens(word, review_text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^[0-9]+$"), str_length(word) > 2)

negative_rate_by_product <- product_tokens %>%
  inner_join(bing_lexicon, by = "word") %>%
  filter(sentiment == "negative") %>%
  count(product_name, sort = TRUE) %>%
  left_join(product_reviews_structured %>% count(product_name, name = "review_count"),
            by = "product_name") %>%
  mutate(negative_words_per_review = n / review_count) %>%
  arrange(desc(negative_words_per_review))

negative_rate_by_product %>%
  mutate(product_name = reorder(product_name, negative_words_per_review)) %>%
  ggplot(aes(x = product_name, y = negative_words_per_review)) +
  geom_segment(aes(xend = product_name, y = 0, yend = negative_words_per_review),
               color = "gray70") +
  geom_point(color = "tomato", size = 4) +
  coord_flip() +
  labs(title = "Negative Complaint Density by Product Line",
       x = NULL, y = "Negative Words per Review")

7.3 Aspect Sentiment by Product Line

product_aspect_dictionary <- tribble(
  ~word,         ~aspect,
  "size",        "Fit/Sizing",    "fit",         "Fit/Sizing",
  "fits",        "Fit/Sizing",    "oversized",   "Fit/Sizing",
  "tight",       "Fit/Sizing",    "loose",       "Fit/Sizing",
  "long",        "Fit/Sizing",    "large",       "Fit/Sizing",
  "small",       "Fit/Sizing",
  "comfortable", "Comfort",       "comfy",       "Comfort",
  "comfort",     "Comfort",       "soft",        "Comfort",
  "roomy",       "Comfort",
  "quality",     "Quality/Durability", "peeled",  "Quality/Durability",
  "fell",        "Quality/Durability", "cheap",   "Quality/Durability",
  "shrink",      "Quality/Durability", "shrunk",  "Quality/Durability",
  "durable",     "Quality/Durability", "problem", "Quality/Durability",
  "issues",      "Quality/Durability",
  "style",       "Style/Design",  "design",      "Style/Design",
  "color",       "Style/Design",  "colorway",    "Style/Design",
  "fleece",      "Style/Design",  "bomber",      "Style/Design",
  "beautiful",   "Style/Design",  "feminine",    "Style/Design"
)

product_aspect_summary <- product_tokens %>%
  inner_join(product_aspect_dictionary, by = "word") %>%
  inner_join(bing_lexicon, by = "word") %>%
  mutate(score = if_else(sentiment == "positive", 1, -1)) %>%
  group_by(product_name, aspect) %>%
  summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop")

product_aspect_summary %>%
  ggplot(aes(x = product_name, y = aspect, fill = avg_sentiment)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(avg_sentiment, 2)), size = 3) +
  scale_fill_gradient2(low = "tomato", mid = "gray95", high = "steelblue", midpoint = 0) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Aspect Sentiment by Product Line",
       x = NULL, y = NULL, fill = "Avg Sentiment")


8. Sustainability Narrative

sustainability_keywords <- data_frame(
  word = c("sustainable", "sustainability", "eco", "recycle", "recycled",
           "recycling", "environment", "environmental", "planet", "carbon",
           "footprint", "renewable", "organic", "biodegradable", "climate",
           "emissions", "waste", "ethical", "responsible", "circular",
           "vegan", "natural", "conservation", "impact")
)

greenwashing_keywords <- data_frame(
  word = c("greenwashing", "greenwash", "misleading", "mislead", "propaganda",
           "dishonest", "pretend", "pretending", "performative",
           "hypocrisy", "hypocrite", "virtue")
)

sustainability_tokens <- nike_tokens %>%
  select(source, word) %>%
  inner_join(sustainability_keywords, by = "word") %>%
  mutate(brand = case_when(
    source %in% adidas_sources ~ "Adidas",
    source %in% ua_sources     ~ "Under Armour",
    TRUE                       ~ "Nike"
  ))

8.1 Sustainability Volume & Tone

# Volume by source
sustainability_tokens %>%
  count(source, sort = TRUE) %>%
  mutate(source = reorder(source, n)) %>%
  ggplot(aes(x = source, y = n)) +
  geom_segment(aes(xend = source, y = 0, yend = n), color = "gray70") +
  geom_point(color = "steelblue", size = 4) +
  coord_flip() +
  labs(title = "Sustainability Mentions by Source",
       subtitle = "Higher count = more sustainability language in that channel",
       x = NULL, y = "Count")

# Sentiment tone
sustainability_tokens %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(source, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative,
         source = reorder(source, net_sentiment)) %>%
  ggplot(aes(source, net_sentiment, fill = net_sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
  coord_flip() +
  labs(title = "Net Sentiment Around Sustainability Keywords (Bing)",
       subtitle = "Blue = net positive  |  Red = net negative",
       x = NULL, y = "Positive minus Negative")

# Brand comparison
sustainability_tokens %>%
  count(brand) %>%
  mutate(brand = reorder(brand, n)) %>%
  ggplot(aes(brand, n, fill = brand)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Sustainability Mentions: Nike vs. Adidas vs. Under Armour",
       x = NULL, y = "Total Keyword Occurrences")

8.2 Sustainability Bigrams

bigrams_filtered %>%
  filter(word1 %in% sustainability_keywords$word |
           word2 %in% sustainability_keywords$word) %>%
  count(word1, word2, sort = TRUE) %>%
  head(15) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Sustainability Bigrams",
       subtitle = "Context: 'green' and 'carbon' often appear in product contexts",
       x = NULL, y = "Count")

8.3 Greenwashing Language

greenwashing_tokens <- nike_tokens %>%
  select(source, word) %>%
  inner_join(greenwashing_keywords, by = "word")

greenwashing_tokens %>%
  mutate(brand = case_when(
    source %in% adidas_sources ~ "Adidas",
    source %in% ua_sources     ~ "Under Armour",
    TRUE                       ~ "Nike"
  )) %>%
  count(brand) %>%
  mutate(brand = reorder(brand, n)) %>%
  ggplot(aes(brand, n, fill = brand)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Greenwashing Language: Nike vs. Adidas vs. Under Armour",
       x = NULL, y = "Greenwashing Keyword Occurrences")


9. Pricing Sensitivity

pricing_terms <- tibble(
  word = c("overpriced", "expensive", "price", "priced", "cost", "costly",
           "value", "worth", "retail", "premium", "cheap", "affordable")
)

tidy_nike %>%
  inner_join(pricing_terms, by = "word") %>%
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(order_by = n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = tidytext::reorder_within(word, n, source)) %>%
  ggplot(aes(word, n, fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, scales = "free_y") +
  tidytext::scale_x_reordered() +
  coord_flip() +
  labs(title = "Top 10 Pricing-Related Words by Source",
       x = NULL, y = "Word Count")


10. Competitive Positioning

set.seed(42)

DATA_PATH <- "C:/Users/patri/OneDrive/Documents/Hult/MBAN/Business Analysis with Unstructured Data - Kurnicki/Competition"

read_text_file <- function(fp) {
  ext <- tolower(tools::file_ext(fp))
  df <- tryCatch({
    if (ext == "csv") {
      read.csv(fp, stringsAsFactors = FALSE, encoding = "UTF-8",
               na.strings = c("", "NA", "N/A"))
    } else if (ext %in% c("xlsx", "xls")) {
      as.data.frame(readxl::read_excel(fp, na = c("", "NA")))
    } else if (ext == "docx") {
      d <- officer::docx_summary(officer::read_docx(fp))
      txt <- unique(trimws(d$text[d$content_type %in% c("paragraph", "table cell") &
                                    !is.na(d$text) & nchar(trimws(d$text)) > 10]))
      return(txt[nchar(txt) > 10])
    } else if (ext == "txt") {
      lines <- trimws(readLines(fp, encoding = "UTF-8", warn = FALSE))
      return(lines[nchar(lines) > 0])
    } else { return(character(0)) }
  }, error = function(e) { message("[ERROR] ", basename(fp), " — ", e$message); NULL })
  if (is.null(df) || nrow(df) == 0) return(character(0))
  candidates <- c("text", "Text", "TEXT", "review", "Review", "comment", "Comment",
                  "headline", "Headline", "content", "Content", "body", "Body",
                  "description", "Description")
  col <- intersect(candidates, colnames(df))
  col <- if (length(col)) col[1] else {
    cc <- colnames(df)[sapply(df, is.character)]
    if (!length(cc)) return(character(0)) else cc[1]
  }
  trimws(na.omit(as.character(df[[col]])))
}

detect_meta <- function(fn) {
  f <- tolower(gsub("[^a-z0-9]", "_", fn))
  brand <- case_when(
    str_detect(f, "nike")                                    ~ "Nike",
    str_detect(f, "adidas")                                  ~ "Adidas",
    str_detect(f, "under_armour|underarmour|\\bua\\b|^ua_") ~ "UnderArmour",
    TRUE ~ NA_character_
  )
  source <- case_when(
    str_detect(f, "news|headline")                 ~ "News",
    str_detect(f, "youtube|yt_|_yt")               ~ "YouTube",
    str_detect(f, "review|website|web_review")     ~ "WebReview",
    str_detect(f, "reddit")                        ~ "Reddit",
    str_detect(f, "appstore|app_store|app_review") ~ "AppStore",
    str_detect(f, "twitter|tweet|x_mention")       ~ "Twitter",
    TRUE ~ NA_character_
  )
  list(brand = brand, source = source)
}

all_files <- list.files(DATA_PATH, pattern = "\\.(csv|txt|xlsx|xls|docx)$",
                        recursive = TRUE, full.names = TRUE, ignore.case = TRUE)

master_df <- map_dfr(all_files, function(fp) {
  meta <- detect_meta(basename(fp))
  if (is.na(meta$brand) || is.na(meta$source)) {
    message("[UNMATCHED] ", basename(fp)); return(tibble())
  }
  texts <- read_text_file(fp)
  if (!length(texts)) {
    message("[WARN] No text: ", basename(fp)); return(tibble())
  }
  tibble(text = texts, brand = meta$brand, source = meta$source,
         doc_id = paste0(meta$brand, "_", meta$source, "_", seq_along(texts)))
})

master_df %>% count(brand, source) %>% arrange(brand, source)
## # A tibble: 6 × 3
##   brand       source       n
##   <chr>       <chr>    <int>
## 1 Adidas      AppStore   813
## 2 Adidas      News        27
## 3 Adidas      Reddit     152
## 4 UnderArmour AppStore   758
## 5 UnderArmour News        46
## 6 UnderArmour Reddit      93
brand_colors  <- c("Nike" = "#111111", "Adidas" = "#1A7EC8", "UnderArmour" = "#E8242A")
active_brands <- master_df %>% count(brand) %>% filter(n > 0) %>% pull(brand)

custom_stop <- tibble(
  word    = c("nike", "adidas", "armour", "under", "ua", "brand", "shoe", "shoes",
              "just", "really", "like", "get", "got", "one", "also", "can", "use",
              "app", "run", "running", "workout"),
  lexicon = "custom"
)

tidy_tokens <- master_df %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>%
  filter(str_length(word) > 2) %>%
  anti_join(bind_rows(stop_words, custom_stop), by = "word") %>%
  mutate(word_stem = wordStem(word, language = "english"))

cat("Tokens:", nrow(tidy_tokens), "| Unique stems:", n_distinct(tidy_tokens$word_stem))
## Tokens: 13353 | Unique stems: 2738
# NRC Emotion Heatmap
tidy_tokens %>%
  inner_join(nrc_lexicon, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative"), brand %in% active_brands) %>%
  count(brand, sentiment) %>%
  group_by(brand) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = brand, y = sentiment, fill = pct)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = scales::percent(pct, accuracy = 1)), size = 4, fontface = "bold") +
  scale_fill_gradient(low = "#FDFEFE", high = "#1A5276", labels = percent_format()) +
  labs(title = "NRC Emotion Intensity Heatmap",
       subtitle = "Proportion of emotion words per brand",
       x = NULL, y = "Emotion", fill = "% Emotion\nWords") +
  theme_minimal(base_size = 13)

# TF-IDF Signature Language
tidy_tokens %>%
  count(brand, word) %>%
  bind_tf_idf(word, brand, n) %>%
  filter(brand %in% active_brands) %>%
  group_by(brand) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, brand)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = brand)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~brand, scales = "free_y") +
  scale_y_reordered() +
  scale_fill_manual(values = brand_colors) +
  labs(title = "TF-IDF Signature Language per Brand",
       subtitle = "Top 15 words most uniquely associated with each brand",
       x = "TF-IDF Score", y = NULL) +
  theme_minimal(base_size = 12) +
  theme(strip.text = element_text(face = "bold"))


11. Modeling Details

Preprocessing Pipeline

Step Method Tool
Tokenization Unigrams, bigrams, quadrograms unnest_tokens()
Stopword removal Tidytext English stopword list + custom brand terms anti_join(stop_words)
Stemming Porter Stemmer wordStem() from SnowballC
Numeric filtering Regex ^[0-9]+$ str_detect()

Sentiment Lexicons

Lexicon Words Notes
AFINN ~2,477 Numeric scores −5 to +5
Bing ~6,788 Binary positive/negative
NRC ~13,901 10 emotion + 2 sentiment categories

LDA Configuration

Parameter Value
Number of topics (k) 4
Algorithm VEM (Variational EM)
Seed 123
Documents 6 Nike sources

12. Limitations

Lexicon bias. All three sentiment lexicons (AFINN, Bing, NRC) were built on general English text and may misclassify domain-specific language. For example, “sick” or “fire” are positive in streetwear communities but scored negatively by Bing.

Sampling bias. Data was manually collected from public platforms within a fixed time window. Reddit and YouTube comments skew toward highly engaged users, and App Store reviews skew toward dissatisfied customers, which may inflate negative sentiment scores.

Fake review detection. No automated fake review detection was applied. Inflated or incentivised reviews in the product review dataset could distort sentiment findings.

LDA instability. LDA topic assignments are probabilistic and sensitive to the number of topics (k). With only 6 documents as input, topics reflect source-level patterns rather than granular thematic clusters. Results should be interpreted directionally.

Stemming limitations. Porter stemming aggressively conflates related words (e.g., “running” → “run”, “comfortable” → “comfort”), which reduces interpretability of individual terms in visualisations.

No temporal dimension. Data was treated as a single corpus without timestamps. Sentiment trends over time — particularly relevant for Nike given recent brand controversies — could not be measured.

Overfitting risk in classification. The aspect dictionary was manually constructed and may overfit to the language patterns observed in this specific dataset, limiting generalisability to other product lines or time periods.


End of Technical Appendix