All packages and lexicons are loaded once here and reused throughout the document.
library(tidyverse)
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(SnowballC)
library(topicmodels)
library(tm)
library(readtext)
library(tidyr)
library(textdata)
library(widyr)
library(igraph)
library(ggraph)
library(readxl)
library(officer)
library(forcats)# Load all sentiment lexicons once
bing_lexicon <- get_sentiments("bing")
afinn_lexicon <- get_sentiments("afinn")
nrc_lexicon <- get_sentiments("nrc")
# Pre-filter NRC emotion subsets
nrc_joy <- nrc_lexicon %>% filter(sentiment == "joy")
nrc_anger <- nrc_lexicon %>% filter(sentiment == "anger")
nrc_trust <- nrc_lexicon %>% filter(sentiment == "trust")
nrc_anticipation <- nrc_lexicon %>% filter(sentiment == "anticipation")Text data was collected from six channels for Nike and two competitor brands (Adidas, Under Armour):
| Brand | Sources |
|---|---|
| Nike | App Store Reviews, News Headlines, Reddit Comments, Twitter Mentions, YouTube Comments, Product Reviews |
| Adidas | Twitter, News, YouTube, App Store, Reddit, Product Reviews |
| Under Armour | Twitter, News, YouTube, App Store, Reddit, Product Reviews |
# !! CHANGE THIS to the folder where your .docx files are saved !!
folder_path <- "C:/Users/patri/OneDrive/Documents/Hult/MBAN/Business Analysis with Unstructured Data - Kurnicki/Nike/New folder"
raw_docs <- readtext(paste0(folder_path, "/*.docx"))
colnames(raw_docs)[1] <- "source"
colnames(raw_docs)[2] <- "text"
raw_docs$source <- str_remove(raw_docs$source, "\\.docx$")
nike_df <- raw_docs %>% select(source, text)
# Confirm source names before recoding
unique(nike_df$source)## [1] "Adidas Reviews Website" "Adidas Twitter Mentions"
## [3] "adidas_headlines" "Adidas_Youtube_Comments"
## [5] "appstore_adidas" "appstore_underarmour"
## [7] "Nike - App Store Reviews" "Nike - News_Headlines_NoYears"
## [9] "Nike - Reddit_Comments_NoLinks" "nike - Twitter mentions"
## [11] "Nike - Youtube_Comments_NoLinks" "Nike Product Reviews"
## [13] "reddit_adidas" "reddit_underarmour"
## [15] "Under Armour Reviews Website" "Under Armour Twitter Mentions"
## [17] "Under Armour Youtube Comments" "under_armour_headlines"
nike_df$source <- recode(nike_df$source,
"Nike - App Store Reviews" = "Nike - App Store",
"Nike - News_Headlines_NoYears" = "Nike - News",
"Nike - Reddit_Comments_NoLinks" = "Nike - Reddit",
"nike - Twitter mentions" = "Nike - Twitter",
"Nike - Youtube_Comments_NoLinks" = "Nike - YouTube",
"Nike Product Reviews" = "Nike - Product Reviews",
"Adidas Twitter Mentions" = "Adidas - Twitter",
"adidas_headlines" = "Adidas - News",
"Adidas_Youtube_Comments" = "Adidas - YouTube",
"appstore_adidas" = "Adidas - App Store",
"reddit_adidas" = "Adidas - Reddit",
"Adidas Reviews Website" = "Adidas - Product Reviews",
"Under Armour Twitter Mentions" = "Under Armour - Twitter",
"Under Armour Youtube Comments" = "Under Armour - YouTube",
"under_armour_headlines" = "Under Armour - News",
"appstore_underarmour" = "Under Armour - App Store",
"reddit_underarmour" = "Under Armour - Reddit",
"Under Armour Reviews Website" = "Under Armour - Product Reviews"
)
# Brand groupings used throughout the script
nike_sources <- c("Nike - App Store", "Nike - News", "Nike - Reddit",
"Nike - Twitter", "Nike - YouTube", "Nike - Product Reviews")
adidas_sources <- c("Adidas - Twitter", "Adidas - News", "Adidas - YouTube",
"Adidas - App Store", "Adidas - Reddit", "Adidas - Product Reviews")
ua_sources <- c("Under Armour - Twitter", "Under Armour - YouTube",
"Under Armour - News", "Under Armour - App Store",
"Under Armour - Reddit", "Under Armour - Product Reviews")
print(unique(nike_df$source))## [1] "Adidas - Product Reviews" "Adidas - Twitter"
## [3] "Adidas - News" "Adidas - YouTube"
## [5] "Adidas - App Store" "Under Armour - App Store"
## [7] "Nike - App Store" "Nike - News"
## [9] "Nike - Reddit" "Nike - Twitter"
## [11] "Nike - YouTube" "Nike - Product Reviews"
## [13] "Adidas - Reddit" "Under Armour - Reddit"
## [15] "Under Armour - Product Reviews" "Under Armour - Twitter"
## [17] "Under Armour - YouTube" "Under Armour - News"
nike_tokens <- nike_df %>%
unnest_tokens(word, text)
nike_tokens %>% count(word, sort = TRUE) %>% head(10)## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
## word n text
## * <chr> <int> <chr>
## 1 the 4759 "\"\"..."
## 2 i 3883 "\"\"..."
## 3 and 2960 "\"\"..."
## 4 to 2960 "\"\"..."
## 5 a 2307 "\"\"..."
## 6 it 1946 "\"\"..."
## # ℹ 4 more rows
tidy_nike <- nike_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "^[0-9]+$"))
tidy_nike %>% count(word, sort = TRUE) %>% head(10)## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
## word n text
## * <chr> <int> <chr>
## 1 app 1690 "\"\"..."
## 2 love 601 "\"\"..."
## 3 nike 554 "\"\"..."
## 4 adidas 418 "\"\"..."
## 5 running 345 "\"\"..."
## 6 workouts 343 "\"\"..."
## # ℹ 4 more rows
Words are reduced to their root form using the Porter stemmer.
tidy_nike_stem <- tidy_nike %>%
mutate(word = wordStem(word))
tidy_nike_stem %>% count(word, sort = TRUE) %>% head(10)## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
## word n text
## * <chr> <int> <chr>
## 1 app 1771 "\"\"..."
## 2 run 725 "\"\"..."
## 3 love 664 "\"\"..."
## 4 workout 627 "\"\"..."
## 5 shoe 614 "\"\"..."
## 6 nike 572 "\"\"..."
## # ℹ 4 more rows
tidy_nike %>%
mutate(brand = case_when(
source %in% adidas_sources ~ "Adidas",
source %in% ua_sources ~ "Under Armour",
TRUE ~ "Nike"
)) %>%
count(brand, word, sort = TRUE) %>%
group_by(brand) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, brand)) %>%
ggplot(aes(x = word, y = n)) +
geom_segment(aes(xend = word, y = 0, yend = n), color = "gray70") +
geom_point(aes(color = brand), size = 3, show.legend = FALSE) +
scale_x_reordered() +
facet_wrap(~brand, scales = "free") +
coord_flip() +
labs(x = NULL, y = "Word Count",
title = "Top 10 Words per Brand (Stopwords Removed)")bigrams_separated <- nike_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word)
bigrams_filtered %>%
count(word1, word2, sort = TRUE) %>%
head(15)## word1 word2 n
## 1 5 positive 267
## 2 1 negative 97
## 3 apple watch 70
## 4 running app 55
## 5 4 positive 52
## 6 3 neutral 51
## 7 2024 11 47
## 8 air max 47
## 9 2025 01 41
## 10 adidas score 40
## 11 underarmour score 37
## 12 2024 12 32
## 13 workout app 32
## 14 gif giphy 30
## 15 positive love 29
nike_df %>%
unnest_tokens(quadrogram, text, token = "ngrams", n = 4) %>%
separate(quadrogram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
!word3 %in% stop_words$word, !word4 %in% stop_words$word) %>%
count(word1, word2, word3, word4, sort = TRUE) %>%
head(15)## word1 word2 word3 word4 n
## 1 underarmour score 7 low 12
## 2 underarmour score 8 low 9
## 3 sneakerheads score 5 low 8
## 4 11 12 5 positive 6
## 5 2024 11 12 5 6
## 6 02 10 5 positive 5
## 7 11 13 5 positive 5
## 8 11 15 5 positive 5
## 9 11 23 5 positive 5
## 10 adidasoriginals score 5 low 5
## 11 01 06 5 positive 4
## 12 11 16 5 positive 4
## 13 12 04 5 positive 4
## 14 12 27 5 positive 4
## 15 2024 11 13 5 4
tidy_dtm <- nike_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(source, word, sort = TRUE) %>%
cast_dtm(source, word, n)
inspect(tidy_dtm[1:6, 1:10])## <<DocumentTermMatrix (documents: 6, terms: 10)>>
## Non-/sparse entries: 45/15
## Sparsity : 25%
## Maximal term length: 8
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 2025 5 app love nike pair positive shoes workout
## Adidas - App Store 2 20 567 90 13 0 0 14 29
## Adidas - Reddit 0 14 0 2 1 7 1 6 0
## Adidas - YouTube 1 6 0 25 4 28 0 34 0
## Nike - App Store 239 294 639 206 128 3 324 0 206
## Nike - Reddit 2 22 15 89 217 170 0 178 4
## Under Armour - App Store 3 28 468 114 5 0 0 3 39
## Terms
## Docs workouts
## Adidas - App Store 20
## Adidas - Reddit 0
## Adidas - YouTube 0
## Nike - App Store 286
## Nike - Reddit 0
## Under Armour - App Store 34
Three sentiment lexicons are used in this analysis:
| Lexicon | Type | Scale |
|---|---|---|
| AFINN | Numeric score per word | -5 (very negative) to +5 (very positive) |
| Bing | Binary classification | Positive / Negative |
| NRC | Emotion categories | Joy, Anger, Trust, Anticipation, Fear, Sadness, Surprise, Disgust |
sentiments <- bind_rows(mutate(afinn_lexicon, lexicon = "afinn"),
mutate(nrc_lexicon, lexicon = "nrc"),
mutate(bing_lexicon, lexicon = "bing"))
# NRC emotion categories
unique(subset(sentiments, lexicon == "nrc")$sentiment)## [1] "trust" "fear" "negative" "sadness" "anger"
## [6] "surprise" "positive" "disgust" "joy" "anticipation"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.0000 -2.0000 -2.0000 -0.5894 2.0000 5.0000
tidy_nike %>%
filter(source == "Nike - Product Reviews") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE) %>%
head(10)## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
## word n text
## * <chr> <int> <chr>
## 1 love 9 "\"\"..."
## 2 daughter 3 "\"\"..."
## 3 happy 3 "\"\"..."
## 4 favorite 2 "\"\"..."
## 5 true 2 "\"\"..."
## 6 baby 1 "\"\"..."
## # ℹ 4 more rows
tidy_nike %>%
filter(source == "Nike - Product Reviews") %>%
inner_join(nrc_anger) %>%
count(word, sort = TRUE) %>%
head(10)## readtext object consisting of 4 documents and 0 docvars.
## # A data frame: 4 × 3
## word n text
## * <chr> <int> <chr>
## 1 fleece 7 "\"\"..."
## 2 fits 5 "\"\"..."
## 3 force 1 "\"\"..."
## 4 storm 1 "\"\"..."
tidy_nike %>%
filter(source == "Nike - Product Reviews") %>%
inner_join(nrc_trust) %>%
count(word, sort = TRUE) %>%
head(10)## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 3
## word n text
## * <chr> <int> <chr>
## 1 cap 3 "\"\"..."
## 2 happy 3 "\"\"..."
## 3 wear 3 "\"\"..."
## 4 favorite 2 "\"\"..."
## 5 recommend 2 "\"\"..."
## 6 true 2 "\"\"..."
## # ℹ 4 more rows
app_store <- tidy_nike %>% filter(source == "Nike - App Store")
product_reviews <- tidy_nike %>% filter(source == "Nike - Product Reviews")
reddit <- tidy_nike %>% filter(source == "Nike - Reddit")
twitter <- tidy_nike %>% filter(source == "Nike - Twitter")
youtube <- tidy_nike %>% filter(source == "Nike - YouTube")
news <- tidy_nike %>% filter(source == "Nike - News")
sources_list <- list(
"App Store Reviews" = app_store,
"Nike Product Reviews" = product_reviews,
"Reddit Comments" = reddit,
"Twitter Mentions" = twitter,
"YouTube Comments" = youtube,
"News Headlines" = news
)
for (src_name in names(sources_list)) {
src_data <- sources_list[[src_name]]
afinn <- src_data %>%
inner_join(afinn_lexicon) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(
src_data %>% inner_join(bing_lexicon) %>% mutate(method = "Bing et al."),
src_data %>%
inner_join(nrc_lexicon %>% filter(sentiment %in% c("positive", "negative"))) %>%
mutate(method = "NRC")
) %>%
count(method, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
print(
bind_rows(afinn, bing_and_nrc) %>%
ggplot(aes(method, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
facet_wrap(~method, ncol = 1, scales = "free_y") +
labs(title = paste0(src_name, ": Sentiment Across Lexicons"),
x = NULL, y = "Sentiment Score")
)
}for (src_name in names(sources_list)) {
print(
sources_list[[src_name]] %>%
inner_join(bing_lexicon) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment", x = NULL,
title = paste0(src_name, ": Top 10 Positive & Negative Words")) +
coord_flip()
)
}tidy_nike_stem %>%
inner_join(bing_lexicon, by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
slice_max(order_by = n, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = tidytext::reorder_within(word, n, sentiment)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
tidytext::scale_x_reordered() +
labs(y = "Contribution to sentiment", x = NULL,
title = "Overall: Top 10 Positive & Negative Words (Stemmed)") +
coord_flip()# AFINN
tidy_nike %>%
inner_join(afinn_lexicon) %>%
group_by(source) %>%
summarise(sentiment = sum(value)) %>%
mutate(source = reorder(source, sentiment)) %>%
ggplot(aes(source, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
coord_flip() +
labs(x = NULL, y = "Sentiment Score", title = "AFINN: Sentiment Intensity per Source")# Bing
tidy_nike %>%
inner_join(bing_lexicon) %>%
count(source, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative,
source = reorder(source, sentiment)) %>%
ggplot(aes(source, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
coord_flip() +
labs(x = NULL, y = "Net Sentiment (Positive - Negative)",
title = "Bing: Net Sentiment Score per Source")# NRC
tidy_nike %>%
inner_join(nrc_lexicon) %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(source, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative,
source = reorder(source, sentiment)) %>%
ggplot(aes(source, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
coord_flip() +
labs(x = NULL, y = "Net Sentiment (Positive - Negative)",
title = "NRC: Net Sentiment Score per Source")tidy_nike %>%
inner_join(nrc_lexicon) %>%
filter(sentiment %in% c("joy", "anger", "trust", "anticipation",
"fear", "sadness", "surprise", "disgust")) %>%
count(source, sentiment) %>%
group_by(source) %>%
mutate(proportion = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(x = source, y = sentiment, fill = proportion)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = percent(proportion, accuracy = 1)), size = 3) +
scale_fill_gradient(low = "#f7f7f7", high = "#1A5276",
labels = percent_format()) +
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(x = NULL, y = NULL, fill = "% Words",
title = "Emotion Heatmap per Source")An LDA model with k = 4 topics is fitted on the Nike-only data. The seed is set to 123 for reproducibility.
nike_topics <- tidy(nike_lda, matrix = "beta")
top_terms <- nike_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# !! READ THIS OUTPUT and update topic_labels below !!
top_terms %>% print(n = 40)## # A tibble: 40 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 app 0.0741
## 2 1 running 0.0177
## 3 1 run 0.0169
## 4 1 love 0.0147
## 5 1 it’s 0.0119
## 6 1 time 0.0119
## 7 1 track 0.00884
## 8 1 i’ve 0.00835
## 9 1 watch 0.00759
## 10 1 adidas 0.00688
## 11 2 adidas 0.0728
## 12 2 nike 0.0183
## 13 2 score 0.0176
## 14 2 shoes 0.0116
## 15 2 pair 0.00916
## 16 2 pro 0.00718
## 17 2 love 0.00702
## 18 2 video 0.00668
## 19 2 adios 0.00643
## 20 2 shoe 0.00595
## 21 3 app 0.0704
## 22 3 positive 0.0357
## 23 3 workouts 0.0319
## 24 3 workout 0.0231
## 25 3 love 0.0230
## 26 3 nike 0.0142
## 27 3 negative 0.0133
## 28 3 it’s 0.0123
## 29 3 free 0.0118
## 30 3 i’ve 0.00897
## 31 4 nike 0.0166
## 32 4 shoes 0.0134
## 33 4 pair 0.0122
## 34 4 shoe 0.0120
## 35 4 size 0.00915
## 36 4 love 0.00789
## 37 4 wear 0.00607
## 38 4 lol 0.00597
## 39 4 comfortable 0.00581
## 40 4 armour 0.00546
# !! CHANGE THESE labels based on what you see in top_terms above !!
topic_labels <- c(
"1" = "Topic 1 — rename me",
"2" = "Topic 2 — rename me",
"3" = "Topic 3 — rename me",
"4" = "Topic 4 — rename me"
)top_terms %>%
mutate(topic_label = topic_labels[as.character(topic)],
term = reorder_within(term, beta, topic_label)) %>%
ggplot(aes(x = term, y = beta, color = topic_label)) +
geom_segment(aes(xend = term, y = 0, yend = beta), color = "gray70") +
geom_point(size = 3, show.legend = FALSE) +
scale_x_reordered() +
facet_wrap(~topic_label, scales = "free") +
coord_flip() +
labs(title = "LDA: Top 10 Terms per Topic", x = NULL, y = "Beta (Word Probability)")tidy(nike_lda, matrix = "gamma") %>%
mutate(topic_label = topic_labels[as.character(topic)]) %>%
ggplot(aes(x = topic_label, y = document, fill = gamma)) +
geom_tile(color = "white") +
geom_text(aes(label = round(gamma, 2)), size = 3) +
scale_fill_gradient(low = "#f7f7f7", high = "#1A5276") +
theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
labs(title = "Topic Distribution per Source",
x = "Topic", y = NULL, fill = "Gamma")Words most uniquely associated with each brand versus the others.
custom_stopwords <- tibble(word = c("nike", "just", "im", "ive"))
tfidf_brand <- tidy_nike %>%
mutate(brand = case_when(
source %in% adidas_sources ~ "Adidas",
source %in% ua_sources ~ "Under Armour",
TRUE ~ "Nike"
)) %>%
anti_join(custom_stopwords, by = "word") %>%
count(brand, word, sort = TRUE) %>%
bind_tf_idf(term = word, document = brand, n = n)
tfidf_brand %>%
group_by(brand) %>%
slice_max(order_by = tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = tidytext::reorder_within(word, tf_idf, brand)) %>%
ggplot(aes(word, tf_idf, fill = brand)) +
geom_col(show.legend = FALSE) +
facet_wrap(~brand, scales = "free_y") +
tidytext::scale_x_reordered() +
coord_flip() +
labs(title = "TF-IDF: Top 10 Distinctive Words by Brand",
x = NULL, y = "TF-IDF Score")aspect_dictionary <- tribble(
~word, ~aspect,
"size", "Fit/Sizing", "fit", "Fit/Sizing",
"fits", "Fit/Sizing", "half", "Fit/Sizing",
"oversized", "Fit/Sizing", "sizing", "Fit/Sizing",
"tight", "Fit/Sizing", "loose", "Fit/Sizing",
"comfortable", "Comfort", "comfy", "Comfort",
"comfort", "Comfort", "soft", "Comfort",
"cushion", "Comfort",
"quality", "Quality/Durability", "peeled", "Quality/Durability",
"cheap", "Quality/Durability", "durable", "Quality/Durability",
"fell", "Quality/Durability", "problem", "Quality/Durability",
"awkward", "Quality/Durability",
"style", "Style/Design", "design", "Style/Design",
"colorway", "Style/Design", "look", "Style/Design",
"black", "Style/Design", "fleece", "Style/Design",
"bomber", "Style/Design",
"app", "App Experience", "workout", "App Experience",
"workouts", "App Experience", "fitness", "App Experience",
"crashing", "App Experience", "crashes", "App Experience",
"bugs", "App Experience", "issue", "App Experience",
"issues", "App Experience", "frustrating","App Experience",
"revenue", "Corporate/Business", "strategy","Corporate/Business",
"investor", "Corporate/Business", "stock", "Corporate/Business",
"ceo", "Corporate/Business", "wholesale","Corporate/Business",
"relations", "Corporate/Business",
"team", "Sports/Cultural Visibility", "national","Sports/Cultural Visibility",
"shirt", "Sports/Cultural Visibility", "jersey", "Sports/Cultural Visibility",
"cup", "Sports/Cultural Visibility", "brazil", "Sports/Cultural Visibility",
"football", "Sports/Cultural Visibility"
)
aspect_sentiment_bing <- tidy_nike %>%
inner_join(aspect_dictionary, by = "word") %>%
inner_join(bing_lexicon, by = "word")# Net sentiment by aspect — diverging bar
aspect_sentiment_bing %>%
mutate(score = if_else(sentiment == "positive", 1, -1)) %>%
group_by(aspect) %>%
summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop") %>%
mutate(aspect = reorder(aspect, avg_sentiment)) %>%
ggplot(aes(x = aspect, y = avg_sentiment, fill = avg_sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
coord_flip() +
labs(title = "Aspect-Based Sentiment: Net Score by Aspect",
x = NULL, y = "Average Sentiment Score")# Aspect sentiment by brand — bubble plot
aspect_sentiment_bing %>%
mutate(
score = if_else(sentiment == "positive", 1, -1),
brand = case_when(
source %in% adidas_sources ~ "Adidas",
source %in% ua_sources ~ "Under Armour",
TRUE ~ "Nike"
)
) %>%
group_by(brand, aspect) %>%
summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop") %>%
ggplot(aes(x = avg_sentiment, y = aspect, size = mentions, color = avg_sentiment)) +
geom_point(alpha = 0.8) +
scale_color_gradient2(low = "tomato", mid = "gray80", high = "steelblue", midpoint = 0) +
scale_size_continuous(range = c(2, 8)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
facet_wrap(~brand, ncol = 3) +
labs(title = "Aspect Sentiment by Brand",
subtitle = "Dot size = number of mentions",
x = "Average Sentiment Score", y = NULL,
color = "Sentiment", size = "Mentions")word_cors <- tidy_nike %>%
group_by(word) %>%
filter(n() >= 20) %>%
ungroup() %>%
pairwise_cor(word, source, sort = TRUE)
# Top correlated words — lollipop
word_cors %>%
filter(item1 %in% c("nike", "shoes", "quality", "price")) %>%
group_by(item1) %>%
slice_max(order_by = correlation, n = 5, with_ties = FALSE) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(x = item2, y = correlation)) +
geom_segment(aes(xend = item2, y = 0, yend = correlation), color = "gray70") +
geom_point(color = "steelblue", size = 3) +
facet_wrap(~item1, scales = "free") +
coord_flip() +
labs(title = "Top 5 Word Correlations with Key Nike Terms",
x = NULL, y = "Correlation")tidy_nike %>%
inner_join(nrc_lexicon) %>%
filter(sentiment %in% c("joy", "anger", "trust", "anticipation",
"fear", "sadness", "surprise", "disgust")) %>%
count(sentiment, sort = TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(x = sentiment, y = n)) +
geom_segment(aes(xend = sentiment, y = 0, yend = n), color = "gray70") +
geom_point(aes(color = sentiment), size = 5, show.legend = FALSE) +
coord_flip() +
labs(x = NULL, y = "Word Count",
title = "Nike: Overall Emotional Positioning Across All Sources")# Top 15 trust words overall
tidy_nike %>%
inner_join(nrc_trust) %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_segment(aes(xend = word, y = 0, yend = n), color = "gray70") +
geom_point(color = "steelblue", size = 4) +
coord_flip() +
labs(x = NULL, y = "Word Count",
title = "Nike: Top 15 Trust Words Driving Brand Loyalty")# Trust words — percentage share
tidy_nike %>%
inner_join(nrc_trust) %>%
count(word, sort = TRUE) %>%
mutate(proportion = n / sum(n)) %>%
top_n(15, proportion) %>%
mutate(word = reorder(word, proportion)) %>%
ggplot(aes(word, proportion, fill = word)) +
geom_col(show.legend = FALSE) +
scale_y_continuous(labels = percent_format()) +
coord_flip() +
labs(x = NULL, y = "% of Trust Words",
title = "Nike: Top Trust Words (% Share)")bigrams_filtered %>%
filter(!str_detect(word1, "^[0-9]+$"), !str_detect(word2, "^[0-9]+$"),
!str_detect(word1, "'"), !str_detect(word2, "'"),
!word1 %in% c("positive", "negative", "neutral"),
!word2 %in% c("positive", "negative", "neutral")) %>%
filter(word1 %in% c("wear", "favorite", "recommend", "real", "true",
"perfect", "happy", "strength", "love", "free",
"comfortable", "trust") |
word2 %in% c("wear", "favorite", "recommend", "real", "true",
"perfect", "happy", "strength", "love", "free",
"comfortable", "trust")) %>%
count(word1, word2, sort = TRUE) %>%
top_n(15) %>%
unite(bigram, word1, word2, sep = " ") %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n)) +
geom_segment(aes(xend = bigram, y = 0, yend = n), color = "gray70") +
geom_point(color = "steelblue", size = 3) +
coord_flip() +
labs(x = NULL, y = "Count",
title = "Top 15 Bigrams Around Loyalty & Trust Words")product_raw <- nike_df %>% filter(source == "Nike - Product Reviews")
product_text_blob <- product_raw$text[[1]]
product_lines <- tibble(raw_line = str_split(product_text_blob, "\n")[[1]]) %>%
mutate(raw_line = str_squish(raw_line)) %>%
filter(raw_line != "")
product_reviews_structured <- product_lines %>%
mutate(
audience = str_extract(raw_line, "\\b(Women|Men|Kids|Everyone)\\b"),
product_category = str_extract(
raw_line,
"\\b(Pants|Hoodie|Sweatshirt|Leggings|Shoes|Jacket|Skirt|Dress|Shirt|Accessories|Accessorise)\\b"
),
product_name = str_trim(str_remove(
raw_line,
"\\b(Pants|Hoodie|Sweatshirt|Leggings|Shoes|Jacket|Skirt|Dress|Shirt|Accessories|Accessorise)\\b.*$"
)),
review_text = str_trim(str_remove(raw_line, "^.*\\b(Women|Men|Kids|Everyone)\\b"))
) %>%
filter(!is.na(product_name), product_name != "",
!is.na(product_category), product_category != "",
!is.na(audience), audience != "",
!is.na(review_text), review_text != "") %>%
mutate(review_id = row_number())
glimpse(product_reviews_structured)## Rows: 3
## Columns: 6
## $ raw_line <chr> "Women's Loose Full-Zip Cropped HoodieSweatshirtWomen…
## $ audience <chr> "Women", "Women", "Women"
## $ product_category <chr> "Hoodie", "Shoes", "Shoes"
## $ product_name <chr> "Women's Loose Full-Zip Cropped HoodieSweatshirtWomen…
## $ review_text <chr> "'s Loose Full-Zip Cropped HoodieSweatshirtWomen\"I’v…
## $ review_id <int> 1, 2, 3
##
## Hoodie Shoes
## 1 2
##
## Women
## 3
product_tokens <- product_reviews_structured %>%
unnest_tokens(word, review_text) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "^[0-9]+$"), str_length(word) > 2)
negative_rate_by_product <- product_tokens %>%
inner_join(bing_lexicon, by = "word") %>%
filter(sentiment == "negative") %>%
count(product_name, sort = TRUE) %>%
left_join(product_reviews_structured %>% count(product_name, name = "review_count"),
by = "product_name") %>%
mutate(negative_words_per_review = n / review_count) %>%
arrange(desc(negative_words_per_review))
negative_rate_by_product %>%
mutate(product_name = reorder(product_name, negative_words_per_review)) %>%
ggplot(aes(x = product_name, y = negative_words_per_review)) +
geom_segment(aes(xend = product_name, y = 0, yend = negative_words_per_review),
color = "gray70") +
geom_point(color = "tomato", size = 4) +
coord_flip() +
labs(title = "Negative Complaint Density by Product Line",
x = NULL, y = "Negative Words per Review")product_aspect_dictionary <- tribble(
~word, ~aspect,
"size", "Fit/Sizing", "fit", "Fit/Sizing",
"fits", "Fit/Sizing", "oversized", "Fit/Sizing",
"tight", "Fit/Sizing", "loose", "Fit/Sizing",
"long", "Fit/Sizing", "large", "Fit/Sizing",
"small", "Fit/Sizing",
"comfortable", "Comfort", "comfy", "Comfort",
"comfort", "Comfort", "soft", "Comfort",
"roomy", "Comfort",
"quality", "Quality/Durability", "peeled", "Quality/Durability",
"fell", "Quality/Durability", "cheap", "Quality/Durability",
"shrink", "Quality/Durability", "shrunk", "Quality/Durability",
"durable", "Quality/Durability", "problem", "Quality/Durability",
"issues", "Quality/Durability",
"style", "Style/Design", "design", "Style/Design",
"color", "Style/Design", "colorway", "Style/Design",
"fleece", "Style/Design", "bomber", "Style/Design",
"beautiful", "Style/Design", "feminine", "Style/Design"
)
product_aspect_summary <- product_tokens %>%
inner_join(product_aspect_dictionary, by = "word") %>%
inner_join(bing_lexicon, by = "word") %>%
mutate(score = if_else(sentiment == "positive", 1, -1)) %>%
group_by(product_name, aspect) %>%
summarise(mentions = n(), avg_sentiment = mean(score), .groups = "drop")
product_aspect_summary %>%
ggplot(aes(x = product_name, y = aspect, fill = avg_sentiment)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = round(avg_sentiment, 2)), size = 3) +
scale_fill_gradient2(low = "tomato", mid = "gray95", high = "steelblue", midpoint = 0) +
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(title = "Aspect Sentiment by Product Line",
x = NULL, y = NULL, fill = "Avg Sentiment")sustainability_keywords <- data_frame(
word = c("sustainable", "sustainability", "eco", "recycle", "recycled",
"recycling", "environment", "environmental", "planet", "carbon",
"footprint", "renewable", "organic", "biodegradable", "climate",
"emissions", "waste", "ethical", "responsible", "circular",
"vegan", "natural", "conservation", "impact")
)
greenwashing_keywords <- data_frame(
word = c("greenwashing", "greenwash", "misleading", "mislead", "propaganda",
"dishonest", "pretend", "pretending", "performative",
"hypocrisy", "hypocrite", "virtue")
)
sustainability_tokens <- nike_tokens %>%
select(source, word) %>%
inner_join(sustainability_keywords, by = "word") %>%
mutate(brand = case_when(
source %in% adidas_sources ~ "Adidas",
source %in% ua_sources ~ "Under Armour",
TRUE ~ "Nike"
))# Volume by source
sustainability_tokens %>%
count(source, sort = TRUE) %>%
mutate(source = reorder(source, n)) %>%
ggplot(aes(x = source, y = n)) +
geom_segment(aes(xend = source, y = 0, yend = n), color = "gray70") +
geom_point(color = "steelblue", size = 4) +
coord_flip() +
labs(title = "Sustainability Mentions by Source",
subtitle = "Higher count = more sustainability language in that channel",
x = NULL, y = "Count")# Sentiment tone
sustainability_tokens %>%
inner_join(bing_lexicon, by = "word") %>%
count(source, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(net_sentiment = positive - negative,
source = reorder(source, net_sentiment)) %>%
ggplot(aes(source, net_sentiment, fill = net_sentiment > 0)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato")) +
coord_flip() +
labs(title = "Net Sentiment Around Sustainability Keywords (Bing)",
subtitle = "Blue = net positive | Red = net negative",
x = NULL, y = "Positive minus Negative")# Brand comparison
sustainability_tokens %>%
count(brand) %>%
mutate(brand = reorder(brand, n)) %>%
ggplot(aes(brand, n, fill = brand)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Sustainability Mentions: Nike vs. Adidas vs. Under Armour",
x = NULL, y = "Total Keyword Occurrences")bigrams_filtered %>%
filter(word1 %in% sustainability_keywords$word |
word2 %in% sustainability_keywords$word) %>%
count(word1, word2, sort = TRUE) %>%
head(15) %>%
unite(bigram, word1, word2, sep = " ") %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Sustainability Bigrams",
subtitle = "Context: 'green' and 'carbon' often appear in product contexts",
x = NULL, y = "Count")greenwashing_tokens <- nike_tokens %>%
select(source, word) %>%
inner_join(greenwashing_keywords, by = "word")
greenwashing_tokens %>%
mutate(brand = case_when(
source %in% adidas_sources ~ "Adidas",
source %in% ua_sources ~ "Under Armour",
TRUE ~ "Nike"
)) %>%
count(brand) %>%
mutate(brand = reorder(brand, n)) %>%
ggplot(aes(brand, n, fill = brand)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Greenwashing Language: Nike vs. Adidas vs. Under Armour",
x = NULL, y = "Greenwashing Keyword Occurrences")pricing_terms <- tibble(
word = c("overpriced", "expensive", "price", "priced", "cost", "costly",
"value", "worth", "retail", "premium", "cheap", "affordable")
)
tidy_nike %>%
inner_join(pricing_terms, by = "word") %>%
count(source, word, sort = TRUE) %>%
group_by(source) %>%
slice_max(order_by = n, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = tidytext::reorder_within(word, n, source)) %>%
ggplot(aes(word, n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, scales = "free_y") +
tidytext::scale_x_reordered() +
coord_flip() +
labs(title = "Top 10 Pricing-Related Words by Source",
x = NULL, y = "Word Count")set.seed(42)
DATA_PATH <- "C:/Users/patri/OneDrive/Documents/Hult/MBAN/Business Analysis with Unstructured Data - Kurnicki/Competition"
read_text_file <- function(fp) {
ext <- tolower(tools::file_ext(fp))
df <- tryCatch({
if (ext == "csv") {
read.csv(fp, stringsAsFactors = FALSE, encoding = "UTF-8",
na.strings = c("", "NA", "N/A"))
} else if (ext %in% c("xlsx", "xls")) {
as.data.frame(readxl::read_excel(fp, na = c("", "NA")))
} else if (ext == "docx") {
d <- officer::docx_summary(officer::read_docx(fp))
txt <- unique(trimws(d$text[d$content_type %in% c("paragraph", "table cell") &
!is.na(d$text) & nchar(trimws(d$text)) > 10]))
return(txt[nchar(txt) > 10])
} else if (ext == "txt") {
lines <- trimws(readLines(fp, encoding = "UTF-8", warn = FALSE))
return(lines[nchar(lines) > 0])
} else { return(character(0)) }
}, error = function(e) { message("[ERROR] ", basename(fp), " — ", e$message); NULL })
if (is.null(df) || nrow(df) == 0) return(character(0))
candidates <- c("text", "Text", "TEXT", "review", "Review", "comment", "Comment",
"headline", "Headline", "content", "Content", "body", "Body",
"description", "Description")
col <- intersect(candidates, colnames(df))
col <- if (length(col)) col[1] else {
cc <- colnames(df)[sapply(df, is.character)]
if (!length(cc)) return(character(0)) else cc[1]
}
trimws(na.omit(as.character(df[[col]])))
}
detect_meta <- function(fn) {
f <- tolower(gsub("[^a-z0-9]", "_", fn))
brand <- case_when(
str_detect(f, "nike") ~ "Nike",
str_detect(f, "adidas") ~ "Adidas",
str_detect(f, "under_armour|underarmour|\\bua\\b|^ua_") ~ "UnderArmour",
TRUE ~ NA_character_
)
source <- case_when(
str_detect(f, "news|headline") ~ "News",
str_detect(f, "youtube|yt_|_yt") ~ "YouTube",
str_detect(f, "review|website|web_review") ~ "WebReview",
str_detect(f, "reddit") ~ "Reddit",
str_detect(f, "appstore|app_store|app_review") ~ "AppStore",
str_detect(f, "twitter|tweet|x_mention") ~ "Twitter",
TRUE ~ NA_character_
)
list(brand = brand, source = source)
}
all_files <- list.files(DATA_PATH, pattern = "\\.(csv|txt|xlsx|xls|docx)$",
recursive = TRUE, full.names = TRUE, ignore.case = TRUE)
master_df <- map_dfr(all_files, function(fp) {
meta <- detect_meta(basename(fp))
if (is.na(meta$brand) || is.na(meta$source)) {
message("[UNMATCHED] ", basename(fp)); return(tibble())
}
texts <- read_text_file(fp)
if (!length(texts)) {
message("[WARN] No text: ", basename(fp)); return(tibble())
}
tibble(text = texts, brand = meta$brand, source = meta$source,
doc_id = paste0(meta$brand, "_", meta$source, "_", seq_along(texts)))
})
master_df %>% count(brand, source) %>% arrange(brand, source)## # A tibble: 6 × 3
## brand source n
## <chr> <chr> <int>
## 1 Adidas AppStore 813
## 2 Adidas News 27
## 3 Adidas Reddit 152
## 4 UnderArmour AppStore 758
## 5 UnderArmour News 46
## 6 UnderArmour Reddit 93
brand_colors <- c("Nike" = "#111111", "Adidas" = "#1A7EC8", "UnderArmour" = "#E8242A")
active_brands <- master_df %>% count(brand) %>% filter(n > 0) %>% pull(brand)
custom_stop <- tibble(
word = c("nike", "adidas", "armour", "under", "ua", "brand", "shoe", "shoes",
"just", "really", "like", "get", "got", "one", "also", "can", "use",
"app", "run", "running", "workout"),
lexicon = "custom"
)
tidy_tokens <- master_df %>%
unnest_tokens(word, text) %>%
mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>%
filter(str_length(word) > 2) %>%
anti_join(bind_rows(stop_words, custom_stop), by = "word") %>%
mutate(word_stem = wordStem(word, language = "english"))
cat("Tokens:", nrow(tidy_tokens), "| Unique stems:", n_distinct(tidy_tokens$word_stem))## Tokens: 13353 | Unique stems: 2738
# NRC Emotion Heatmap
tidy_tokens %>%
inner_join(nrc_lexicon, by = "word") %>%
filter(!sentiment %in% c("positive", "negative"), brand %in% active_brands) %>%
count(brand, sentiment) %>%
group_by(brand) %>%
mutate(pct = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(x = brand, y = sentiment, fill = pct)) +
geom_tile(colour = "white", linewidth = 0.6) +
geom_text(aes(label = scales::percent(pct, accuracy = 1)), size = 4, fontface = "bold") +
scale_fill_gradient(low = "#FDFEFE", high = "#1A5276", labels = percent_format()) +
labs(title = "NRC Emotion Intensity Heatmap",
subtitle = "Proportion of emotion words per brand",
x = NULL, y = "Emotion", fill = "% Emotion\nWords") +
theme_minimal(base_size = 13)# TF-IDF Signature Language
tidy_tokens %>%
count(brand, word) %>%
bind_tf_idf(word, brand, n) %>%
filter(brand %in% active_brands) %>%
group_by(brand) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
mutate(word = reorder_within(word, tf_idf, brand)) %>%
ggplot(aes(x = tf_idf, y = word, fill = brand)) +
geom_col(show.legend = FALSE) +
facet_wrap(~brand, scales = "free_y") +
scale_y_reordered() +
scale_fill_manual(values = brand_colors) +
labs(title = "TF-IDF Signature Language per Brand",
subtitle = "Top 15 words most uniquely associated with each brand",
x = "TF-IDF Score", y = NULL) +
theme_minimal(base_size = 12) +
theme(strip.text = element_text(face = "bold"))| Step | Method | Tool |
|---|---|---|
| Tokenization | Unigrams, bigrams, quadrograms | unnest_tokens() |
| Stopword removal | Tidytext English stopword list + custom brand terms | anti_join(stop_words) |
| Stemming | Porter Stemmer | wordStem() from SnowballC |
| Numeric filtering | Regex ^[0-9]+$ |
str_detect() |
| Lexicon | Words | Notes |
|---|---|---|
| AFINN | ~2,477 | Numeric scores −5 to +5 |
| Bing | ~6,788 | Binary positive/negative |
| NRC | ~13,901 | 10 emotion + 2 sentiment categories |
| Parameter | Value |
|---|---|
| Number of topics (k) | 4 |
| Algorithm | VEM (Variational EM) |
| Seed | 123 |
| Documents | 6 Nike sources |
Lexicon bias. All three sentiment lexicons (AFINN, Bing, NRC) were built on general English text and may misclassify domain-specific language. For example, “sick” or “fire” are positive in streetwear communities but scored negatively by Bing.
Sampling bias. Data was manually collected from public platforms within a fixed time window. Reddit and YouTube comments skew toward highly engaged users, and App Store reviews skew toward dissatisfied customers, which may inflate negative sentiment scores.
Fake review detection. No automated fake review detection was applied. Inflated or incentivised reviews in the product review dataset could distort sentiment findings.
LDA instability. LDA topic assignments are probabilistic and sensitive to the number of topics (k). With only 6 documents as input, topics reflect source-level patterns rather than granular thematic clusters. Results should be interpreted directionally.
Stemming limitations. Porter stemming aggressively conflates related words (e.g., “running” → “run”, “comfortable” → “comfort”), which reduces interpretability of individual terms in visualisations.
No temporal dimension. Data was treated as a single corpus without timestamps. Sentiment trends over time — particularly relevant for Nike given recent brand controversies — could not be measured.
Overfitting risk in classification. The aspect dictionary was manually constructed and may overfit to the language patterns observed in this specific dataset, limiting generalisability to other product lines or time periods.
End of Technical Appendix