Text Analytics for Strategic Consumer Insights
Data Pirates · Team 9 · March 25, 2026
An analysis of consumer discourse across sustainability, pricing, and competitive positioning provides insights into how Nike is perceived and identifies strategic opportunities for brand action. Using an end-to-end natural language processing pipeline in R, large-scale unstructured text data was collected from Reddit sneaker communities, YouTube product reviews, and X (Twitter) mentions. Sentiment analysis, topic modeling (LDA), TF-IDF differentiation, and keyword co-occurrence network analysis were applied to extract actionable intelligence across four dimensions: brand perception, product quality, sustainability narrative, and competitive positioning.
Nike maintains strong emotional loyalty, with comfort and trust emerging as dominant consumer themes. Operational issues, such as sizing inconsistencies, are evident through recurring terms like “half size” and “wide foot,” highlighting areas that require operational attention rather than marketing intervention. Pricing perception is divided, with some consumers accepting Nike’s premium positioning while others question whether quality justifies cost, reflected in the co-occurrence of terms such as “premium” and “worth” alongside “expensive” and “cheap.” Sustainability awareness exists but trust remains limited; discussions focus on generic terms such as “green” and “carbon,” with minimal engagement around ethical production or recyclability, indicating a need for product-level substantiation of sustainability claims rather than broad messaging.
Competitive analysis shows Adidas narrowing the gap in design and cultural relevance. TF-IDF analysis indicates that Adidas discussions emphasize specific product lines such as Yeezy and Ultraboost, fostering a collector-driven identity with clear differentiation. Nike’s language, in contrast, remains experiential and tactile, anchored in terms such as “lace,” “sole,” and “comfortable,” representing a defensible equity advantage. Topic modeling reveals that Sneaker Style and Fit account for 41% of consumer discourse, followed by Product Experience and Comfort at 29%, together constituting the primary drivers of engagement. Digital Content and Athlete Endorsement represent just 10% of conversations, suggesting limited organic impact relative to product-focused themes. Platform dynamics differ, with Reddit serving as a hub for enthusiast-driven discussions around drops and aesthetics, while YouTube users adopt a more evaluative posture, frequently benchmarking Nike against competitors.
The analysis points to three strategic imperatives: first, reframe the pricing narrative around craftsmanship and durability to strengthen value perception; second, substantiate sustainability claims at the product level to bridge the awareness-trust gap; third, reinforce performance heritage and comfort as the primary drivers of brand loyalty. These insights derive from publicly available unstructured text data and carry inherent sampling limitations; they are intended to inform, not replace, formal primary consumer research.
# Installing packages
# List of required packages
packages <- c(
  "mongolite", "tidyverse", "tm", "SnowballC",
  "textstem", "scales", "e1071", "quanteda", "ggplot2",
  "tidymodels", "textrecipes", "discrim", "parsnip", "rsample",
  "klaR", "widyr", "igraph", "ggraph", "dplyr",
  "topicmodels"  # required by the LDA section
)
if (!"remotes" %in% rownames(installed.packages())) {
  install.packages("remotes")
}
library(remotes)
# Install tidytext from GitHub if not installed
if (!"tidytext" %in% rownames(installed.packages())) {
  remotes::install_github("juliasilge/tidytext")
}
# Install any packages that are not yet installed
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed], dependencies = TRUE)
}

# Load the R libraries required for this project
suppressPackageStartupMessages({
  library(mongolite)
  library(tidyverse)
  library(tidytext)
  library(tm)
  library(dplyr)
  library(textstem)
  library(scales)
  library(e1071)
  library(quanteda)
  library(ggplot2)
  library(widyr)
  library(igraph)
  library(ggraph)
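})

## Setting up the connection to get the data from MongoDB
# Credentials are read from environment variables instead of being hard-coded.
# MONGO_USER and MONGO_PASSWORD are assumed variable names (e.g. set in .Renviron).
DB_NAME <- "sample_airbnb"
MONGO_URI <- sprintf(
  "mongodb+srv://%s:%s@cluster0.2opjt6h.mongodb.net/%s?retryWrites=true&w=majority",
  Sys.getenv("MONGO_USER"), Sys.getenv("MONGO_PASSWORD"), DB_NAME
)

## Text Cleaning Pipeline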
clean_text <- function(text) {
  text %>%
    str_to_lower() %>%                               # convert to lowercase
    str_remove_all("https?://\\S+|www\\.\\S+") %>%   # remove URLs
    str_remove_all("@\\w+|#\\w+") %>%                # remove mentions and hashtags
    # Irregular contractions must be expanded before the generic "n't" rule,
    # otherwise "won't" becomes "wo not" and "can't" becomes "ca not".
    str_replace_all("won't", "will not") %>%         # won't -> will not
    str_replace_all("can't", "can not") %>%          # can't -> can not
    str_replace_all("n't", " not") %>%               # remaining n't -> not
    str_replace_all("'re", " are") %>%               # 're -> are
    str_replace_all("'ve", " have") %>%              # 've -> have
    str_replace_all("'ll", " will") %>%              # 'll -> will
    str_remove_all("[^a-z0-9\\s]") %>%               # keep only lowercase letters, digits, whitespace
    str_squish()                                     # collapse extra whitespace
}
airbnb_df <- airbnb_df %>%
  mutate(text = clean_text(description))

The raw text was preprocessed through a standardized cleaning pipeline before analysis: lowercasing, URL and punctuation removal, stripping of social media mentions and hashtags, contraction expansion, and whitespace normalization. This leaves the data consistent and ready for downstream NLP tasks, including sentiment analysis, topic modeling, and keyword extraction.
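Contraction expansion is order-sensitive, which is easy to miss in a chain of replacements. The sketch below is illustrative only: the two helper functions and the sample sentence are hypothetical and are not part of the report's pipeline. It compares applying the irregular rules ("won't", "can't") before versus after the generic "n't" rule.

```r
library(stringr)

# Hypothetical helpers for illustration; neither is the report's clean_text().
expand_specific_first <- function(text) {
  text |>
    str_replace_all("won't", "will not") |>  # irregular forms first
    str_replace_all("can't", "can not") |>
    str_replace_all("n't", " not")           # generic rule last
}

expand_generic_first <- function(text) {
  text |>
    str_replace_all("n't", " not") |>        # consumes the "n't" inside won't / can't
    str_replace_all("won't", "will not") |>  # no longer matches anything
    str_replace_all("can't", "can not")      # no longer matches anything
}

sample_text <- "i won't go, i can't stay, i don't mind"
good <- expand_specific_first(sample_text)
bad  <- expand_generic_first(sample_text)

print(good)  # "i will not go, i can not stay, i do not mind"
print(bad)   # "i wo not go, i ca not stay, i do not mind"
```

With the generic rule first, fragments such as "wo" and "ca" survive as tokens and are not caught by standard stopword lists.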
# Tokenize, remove stopwords, and lemmatize
custom_stopwords <- data.frame(word = c("im", "de"),
                               lexicon = "custom")
airbnb_tokens <- airbnb_df %>%
  unnest_tokens(word, text) %>%                    # split text into words
  anti_join(stop_words, by = "word") %>%           # remove standard stopwords
  anti_join(custom_stopwords, by = "word") %>%     # remove custom stopwords
  filter(!str_detect(word, "^\\d+$")) %>%          # remove purely numeric tokens
  mutate(word = lemmatize_words(word))             # lemmatize
# Summary info
cat("Total tokens: ", nrow(airbnb_tokens), "\n")
cat("Unique lemmas: ", n_distinct(airbnb_tokens$word), "\n")
## Total tokens: 365179
## Unique lemmas: 26648
After preprocessing the descriptions through tokenization, stopword removal, and lemmatization, the dataset contained 365,179 total tokens and 26,648 unique lemmas, making it ready for further analysis.
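To make the tokenization step concrete, here is a minimal sketch on two invented descriptions. The data frame `toy_df` is a hypothetical stand-in for `airbnb_df`, and lemmatization is left out to keep the example dependent on `dplyr` and `tidytext` only.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for airbnb_df; both descriptions are invented for illustration.
toy_df <- tibble(
  doc_id = c(1, 2),
  text   = c("the apartment is a 5 minute walk from the metro",
             "a quiet bedroom and a private bathroom")
)

toy_tokens <- toy_df %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop standard English stopwords
  filter(!grepl("^\\d+$", word))           # drop purely numeric tokens

# The same two summary figures reported for the full corpus
total_tokens  <- nrow(toy_tokens)
unique_tokens <- n_distinct(toy_tokens$word)
cat("Total tokens: ", total_tokens, "\n")
cat("Unique tokens: ", unique_tokens, "\n")
```

The unique count can never exceed the total count, so a summary where it does usually means the two figures came from different runs.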
freq_hist <- airbnb_tokens %>%
  count(word, sort = TRUE) %>%            # count frequency
  slice_max(n, n = 10) %>%                # select top 10 words
  mutate(word = reorder(word, n)) %>%     # reorder words by frequency
  ggplot(aes(word, n)) +                  # map word vs frequency
  geom_col(fill = "steelblue") +          # bar plot
  xlab(NULL) +                            # remove x-axis label
  ylab("Frequency") +                     # label y-axis
  coord_flip() +                          # horizontal bars
  labs(title = "Top 10 Most Frequent Words in Airbnb Listing Descriptions")
# Print the plot
print(freq_hist)
The top word frequency analysis reveals that Airbnb hosts emphasize
location, layout, and practical details in their descriptions, with
terms like apartment, bedroom, and bathroom setting clear expectations,
while words like walk and minute highlight proximity as a key selling
point. The presence of live suggests an attempt to promote authentic
local experiences. Overall, listings focus more on functionality and
convenience than differentiation, indicating an opportunity for more
engaging and distinctive descriptions.
# Create bigrams
airbnb_bigrams <- airbnb_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%       # generate 2-word sequences
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%    # split into two columns
  filter(
    !word1 %in% stop_words$word,                                 # remove stopwords from both words
    !word2 %in% stop_words$word,
    !str_detect(word1, "^\\d+$"),                                # remove numbers
    !str_detect(word2, "^\\d+$")
  ) %>%
  mutate(
    word1 = lemmatize_words(word1),                              # lemmatize
    word2 = lemmatize_words(word2)
  ) %>%
  unite(bigram_lemma, word1, word2, sep = " ") %>%               # combine back into single column
  dplyr::select(doc_id, country, bigram_lemma)
# Quick summary
cat("Total bigrams: ", nrow(airbnb_bigrams), "\n")
cat("Unique bigrams: ", n_distinct(airbnb_bigrams$bigram_lemma), "\n")
## Total bigrams: 202742
## Unique bigrams: 103566
# Count bigrams by country
bigram_counts <- airbnb_bigrams %>%
  count(bigram_lemma, country, sort = TRUE)
# Get top 10 bigrams by total count
top_bigrams <- bigram_counts %>%
  group_by(bigram_lemma) %>%
  summarise(total_n = sum(n), .groups = "drop") %>%
  slice_max(total_n, n = 10) %>%
  inner_join(bigram_counts, by = "bigram_lemma")
# Plot stacked bars with counts, colored by country
ggplot(top_bigrams, aes(x = reorder(bigram_lemma, total_n), y = n, fill = country)) +
  geom_col() +   # default stacking by count
  coord_flip() +
  labs(
    title = "Top 10 Bigrams by Country",
    x = "Bigram",
    y = "Count",
    fill = "Country"
  ) +
  theme_minimal()
This bigram analysis of Airbnb host descriptions points to regional
patterns in how listings are framed. Phrases like “minute walk,” “walk
distance,” and “min walk” dominate globally, making walkability a
near-universal marketing hook. The U.S. and Australia lead in raw
frequency while Hong Kong and Turkey lag, which may reflect urban
density and local search behavior but also depends on how many listings
each country contributes. Market-specific bigrams reveal priorities:
“metro station” is prominent in Spain, Portugal, and Turkey, and
destination-specific terms like “Hong Kong” appear only locally.
Amenity-related phrases such as “double bed,” “air condition,” and “free
wifi” remain baseline expectations worldwide. These patterns suggest
that optimizing listings requires localized keyword strategies: leading
with walkability in some markets and transit proximity in others, while
ensuring amenity bigrams reinforce standards, so that descriptions align
with regional search behavior and improve visibility and guest relevance.
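Raw bigram counts grow with the number of listings a country contributes, so cross-country comparisons are easier to read as within-country shares. The sketch below uses invented counts (`toy_bigrams` is hypothetical, not the report's data) to show how a country can trail on raw count yet lead on share.

```r
library(dplyr)

# Invented counts for illustration only; not taken from the report's data.
toy_bigrams <- tibble(
  country      = c("United States", "United States", "Turkey", "Turkey"),
  bigram_lemma = c("minute walk", "double bed", "minute walk", "metro station"),
  n            = c(300, 700, 40, 60)
)

bigram_share <- toy_bigrams %>%
  group_by(country) %>%
  mutate(share = n / sum(n)) %>%   # share of that country's bigrams, not raw count
  ungroup()

walk_share <- bigram_share %>%
  filter(bigram_lemma == "minute walk") %>%
  select(country, n, share)

print(walk_share)
# United States: n = 300, share = 0.3
# Turkey:        n = 40,  share = 0.4
```

In the report's pipeline the same `group_by(country)` and `mutate(share = n / sum(n))` steps could be applied to `bigram_counts` before plotting.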
# Compute number of documents per country
country_docs <- airbnb_tokens %>%
  group_by(country) %>%
  summarise(n_docs = n_distinct(doc_id), .groups = "drop")
# Compute smoothed IDF by country and keep the five highest-IDF words per country
idf_data_country <- airbnb_tokens %>%
  distinct(doc_id, country, word) %>%
  filter(!str_detect(word, "\\d")) %>%
  left_join(country_docs, by = "country") %>%
  group_by(country, word, n_docs) %>%
  summarise(doc_count = n(), .groups = "drop") %>%
  mutate(idf = log(n_docs / (1 + doc_count))) %>%
  group_by(country) %>%
  arrange(desc(idf)) %>%
  slice_head(n = 5) %>%
  ungroup()
# Faceted plot by country
ggplot(idf_data_country, aes(x = reorder(word, idf), y = idf, fill = idf)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#B0BEC5", high = "#1C1C1C") +
  facet_wrap(~country, scales = "free_y") +   # each country gets its own panel
  labs(
    x = NULL,
    y = "Inverse Document Frequency (IDF)",
    title = "Top 5 Rarest Words in Airbnb Descriptions by Country"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
This IDF analysis highlights the distinctive language and cultural cues
in Airbnb host descriptions across countries, with rare words revealing
local language, niche amenities, and hyper-local references. In Brazil,
Portuguese terms like abertas and abajures signal authentic,
design-focused descriptions; Spain shows multilingual or fragmented
terms, pointing to localization gaps. Hong Kong hosts use
neighborhood-specific words like aberdeenplus to brand listings, while
Australia and Canada emphasize universities, heritage, and landmarks.
The U.S. features cryptic or informal terms, reflecting diverse writing
styles. These findings suggest that leveraging distinctive, locally
relevant terms can differentiate listings, improve search visibility,
and enhance guest discovery, while addressing translation and keyword
optimization can unlock further value.
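The smoothed IDF used here is log(n_docs / (1 + doc_count)). A small worked example helps when reading the chart. The numbers below are hypothetical: a country with 100 listings and three words that appear in 1, 9, and 99 of them.

```r
# Hypothetical figures for illustration; not taken from the report's data.
n_docs    <- 100                                      # listings in one country
doc_count <- c(rare = 1, occasional = 9, common = 99) # listings containing each word

idf <- log(n_docs / (1 + doc_count))                  # same formula as the pipeline
print(round(idf, 2))
# rare = 3.91, occasional = 2.30, common = 0.00
```

Because the score depends only on `doc_count`, every word that occurs in exactly one listing ties for the maximum within its country. A top-5 cut among those ties is therefore decided by row order rather than by rarity, which is worth keeping in mind when interpreting individual words.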
# Word proportions for the United States and Spain
frequency <- airbnb_tokens %>%
  filter(country %in% c("United States", "Spain")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(country, word) %>%
  group_by(country) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  dplyr::select(-n)
# Prepare wide data for the comparison plot, keeping words used in both countries
frequency_wide <- frequency %>%
  pivot_wider(names_from = country, values_from = proportion, values_fill = 0) %>%
  filter(`United States` > 0, Spain > 0)
# Scatter plot of word proportions on log scales
ggplot(frequency_wide, aes(x = `United States`, y = Spain, color = abs(`United States` - Spain))) +
  geom_abline(color = "grey40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(low = "darkslategray4", high = "gray75") +
  theme(legend.position = "none") +
  labs(
    title = "Word Correlations: United States vs Spain",
    x = "US proportion",
    y = "Spain proportion"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
This word correlation analysis comparing the United States and Spain reveals clear differences in how hosts frame Airbnb listings. US descriptions emphasize car-centric convenience, spacious layouts, and American amenities such as specific bed sizes, parking, and neighborhood references, reflecting the need to set clear expectations in sprawling urban areas. Spanish listings focus on public transit and walkability, highlighting metro, estación, parada, and minutos andando to signal proximity to transit and landmarks in dense cities. These findings underscore the value of market-specific keyword strategies: US hosts should foreground parking, bed sizes, and neighborhood context, while Spanish hosts should highlight transit access, walking times, and vibrant neighborhoods. For Airbnb, supporting localized description guidance can improve visibility, align content with guest search behavior, and increase booking conversion.
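The scatter plot shows the comparison visually. One possible way to put numbers on it is a correlation on log proportions plus a per-word log ratio. The sketch below is a minimal example with invented proportions: `toy_freq` and its column names are hypothetical, and the values are not taken from the report's data.

```r
# Invented proportions for illustration only.
toy_freq <- data.frame(
  word  = c("walk", "metro", "parking", "bedroom", "kitchen"),
  us    = c(0.020, 0.002, 0.015, 0.030, 0.025),
  spain = c(0.025, 0.018, 0.003, 0.028, 0.022)
)

# Pearson correlation on log10 proportions, matching the log-log axes of the plot
r <- cor(log10(toy_freq$us), log10(toy_freq$spain))

# Log2 ratio: positive values lean toward Spain, negative toward the United States
toy_freq$log_ratio <- log2(toy_freq$spain / toy_freq$us)
ranked <- toy_freq[order(-abs(toy_freq$log_ratio)), c("word", "log_ratio")]

print(r)
print(ranked)
```

Words far from zero on `log_ratio` are the ones that sit far from the dashed diagonal in the plot, so the ranking gives a reproducible list of market-specific vocabulary.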
bing_lexicon <- get_sentiments("bing")
airbnb_sentiment <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = "word")
sentiment_counts <- airbnb_sentiment %>%
  count(country, sentiment)
ggplot(sentiment_counts, aes(x = sentiment, y = n, fill = country)) +
  geom_col(show.legend = TRUE) +
  labs(title = "Sentiment Distribution (Bing Lexicon)",
       x = "Sentiment",
       y = "Word Count")

The Bing sentiment analysis of Airbnb host descriptions shows that positive sentiment consistently outweighs negative across all markets, with roughly 20,000 positive versus 10,000 negative instances per country. This pattern suggests that hosts treat descriptions as marketing copy, emphasizing strengths while minimizing weaknesses, and that platform norms have standardized optimistic language globally. Necessary negatives like “no elevator” or “street noise” should be framed positively to preserve appeal. This allows hosts to balance transparency with attractiveness and supports Airbnb in guiding description quality.
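Since raw word counts scale with the amount of text per country, a positive share per country can make markets directly comparable. The sketch below is illustrative: `toy_counts` mimics the shape of `sentiment_counts`, but the country names and numbers are invented.

```r
library(dplyr)

# Invented counts in the same shape as sentiment_counts; for illustration only.
toy_counts <- tibble(
  country   = c("A", "A", "B", "B"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(200, 100, 30, 10)
)

sentiment_share <- toy_counts %>%
  group_by(country) %>%
  summarise(
    positive_share = sum(n[sentiment == "positive"]) / sum(n),  # positives over all matched words
    .groups = "drop"
  )

print(sentiment_share)
# A: 200 / 300, B: 30 / 40
```

Here country B has far fewer matched words than A but a higher positive share, a difference a stacked count chart would hide.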
# Count top words per sentiment across all countries
word_contributions <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(word, sentiment, country, sort = TRUE) %>%
  group_by(sentiment, word) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  group_by(sentiment) %>%
  slice_max(total, n = 15) %>%
  ungroup()
# Filter original counts to just these top words for stacking by country
word_source_counts <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = "word") %>%
  semi_join(word_contributions, by = c("word", "sentiment")) %>%
  count(word, sentiment, country)
# Plot with stacked bars by country
ggplot(word_source_counts,
       aes(x = n,
           y = reorder_within(word, n, sentiment, fun = sum),  # order by total across countries
           fill = country)) +
  geom_col(width = 0.7) +
  facet_wrap(~sentiment, scales = "free") +
  scale_y_reordered() +
  labs(
    title = "Most Impactful Words by Sentiment and Country",
    x = "Frequency", y = NULL,
    fill = "Country"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

The Bing sentiment analysis of Airbnb descriptions identifies the words driving positive and negative tone, providing a clear roadmap for optimizing listing language. Positive words like stun, retreat, overlook, and tout highlight aspirational, experiential, and view-focused language that resonates emotionally with guests. Negative words such as noise, smoke, complex, split, sink, grind, and rue reflect practical limitations or unintended discouragement. Hosts can maximize appeal by foregrounding positive, scenic, and experiential terms while reframing or minimizing negative triggers, for example presenting noise as a “vibrant neighborhood” or split as a “thoughtfully designed layout,” which can enhance booking potential and align descriptions with guest expectations.
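General-purpose lexicons score words out of context, so some flagged terms may be neutral in listing text (for instance, "sink" as a kitchen fixture, "complex" as a building, or "rue" as a French street name). One possible refinement, sketched below under that assumption, is to drop a reviewed exclusion list from the lexicon before joining. Both `toy_lexicon` and `domain_neutral` are hypothetical; the actual list would need manual review against the data.

```r
library(dplyr)

# Toy lexicon in the same shape as the Bing lexicon (word, sentiment); illustrative only.
toy_lexicon <- tibble(
  word      = c("noise", "sink", "rue", "complex", "stunning"),
  sentiment = c("negative", "negative", "negative", "negative", "positive")
)

# Hypothetical exclusion list of words judged neutral in rental descriptions
domain_neutral <- tibble(word = c("sink", "rue", "complex"))

adjusted_lexicon <- toy_lexicon %>%
  anti_join(domain_neutral, by = "word")   # keep only words not on the exclusion list

print(adjusted_lexicon)
# noise (negative), stunning (positive)
```

In the report's pipeline, the equivalent step would be an `anti_join()` on `bing_lexicon` before the `inner_join()` with `airbnb_tokens`.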
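library(topicmodels)
# Build the document-term matrix from the lemmatized tokens.
# Assumption: the chunk that created airbnb_dtm is not shown in this report;
# this is one standard tidytext way to construct it.
airbnb_dtm <- airbnb_tokens %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)
k <- 4
lda_model <- LDA(
  airbnb_dtm,
  k = k,
  method = "Gibbs",
  control = list(seed = 42, iter = 1000, burnin = 200, thin = 10)
)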
# Assign human-readable labels after inspecting top terms
topic_labels <- tibble(
  topic = 1:k,
  label = c(
    "Location & Transportation",            # Based on: metro, est, la, da
    "Apartment Features & Amenities",       # Based on: apartament, quarto
    "Guest Experience & Host Hospitality",  # Mixed positive experience terms
    "Neighborhood & Local Attractions"      # Location context
  )
)
tidy_topics <- tidy(lda_model, matrix = "beta")
# Top terms per topic
top_terms <- tidy_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  left_join(topic_labels, by = "topic") %>%
  arrange(topic, -beta)
# Plot the top term probabilities by topic
ggplot(top_terms,
       aes(x = reorder_within(term, beta, topic),
           y = beta, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free_y", ncol = 2) +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "LDA topic modeling - Airbnb listing descriptions",
    x = NULL, y = "Term probability (beta)",
    caption = "Higher beta = more characteristic of that topic"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 10)
  )

The LDA topic model of Airbnb descriptions reveals four thematic clusters that shape how hosts frame listings. Location & Transportation emphasizes walkability and proximity to transit, dining, and shopping, reflecting location-driven marketing. Guest Experience & Host Hospitality uses aspirational language such as view, beautiful, and private to evoke emotion and comfort. Apartment Features & Amenities covers functional details like beds, kitchens, and wifi, addressing baseline guest needs. Neighborhood & Local Attractions incorporates non-English terms in multilingual markets, signaling authenticity or targeting domestic guests. Optimized listings blend all four themes, balancing location, experience, functional clarity, and localized language to enhance visibility, match search behavior, and boost bookings.
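Beta describes which terms characterize a topic. How much of the corpus each topic covers comes from the per-document topic proportions (gamma). The sketch below averages gamma per topic on an invented table: `toy_gamma` is hypothetical, and in the real pipeline the equivalent input would be `tidy(lda_model, matrix = "gamma")`.

```r
library(dplyr)

# Invented per-document topic proportions; each document's gammas sum to 1.
toy_gamma <- tibble(
  document = rep(c("d1", "d2", "d3"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.9, 0.1, 0.6, 0.4, 0.3, 0.7)
)

topic_share <- toy_gamma %>%
  group_by(topic) %>%
  summarise(share = mean(gamma), .groups = "drop")   # average weight of each topic across documents

print(topic_share)
# topic 1: 0.6, topic 2: 0.4
```

Averaging gamma treats every listing equally regardless of length. Weighting by token count is an alternative if longer descriptions should count for more.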
Conclusion