Airbnb Brand Intelligence

Text Analytics for Strategic Consumer Insights

Data Pirates · Team 9 · March 25, 2026


1 Executive Summary

An analysis of host-written listing descriptions provides insight into how Airbnb listings are framed across markets and identifies opportunities for more effective, localized listing language. Using an end-to-end natural language processing pipeline in R, unstructured text was loaded from the listingsAndReviews collection of the sample_airbnb MongoDB database. Word-frequency and bigram analysis, inverse document frequency (IDF), a cross-market word-frequency comparison, Bing lexicon sentiment analysis, and topic modeling (LDA) were applied to the cleaned, tokenized, and lemmatized descriptions.

Hosts emphasize location, layout, and practical details, with terms such as apartment, bedroom, and bathroom setting expectations and walk and minute signalling proximity. Bigrams such as “minute walk” and “walk distance” appear across markets, while “metro station” is prominent in Spain, Portugal, and Turkey. Amenity phrases such as “double bed,” “air condition,” and “free wifi” act as baseline expectations rather than differentiators.

The comparison between the United States and Spain shows different framing: US descriptions foreground parking, bed sizes, and neighborhood context, while Spanish descriptions foreground transit access and walking times. Sentiment is predominantly positive, with roughly twice as many positive as negative lexicon matches, consistent with hosts treating descriptions as marketing copy. Topic modeling groups the descriptions into four themes: Location & Transportation, Apartment Features & Amenities, Guest Experience & Host Hospitality, and Neighborhood & Local Attractions.

The analysis points to three strategic imperatives: first, support market-specific keyword guidance so that descriptions match regional search behavior; second, encourage hosts to state necessary negatives transparently but frame them constructively; third, promote descriptions that blend location, experience, functional detail, and localized language. These insights derive from a sample dataset of listing descriptions and carry inherent sampling limitations; they are intended to inform, not replace, formal primary consumer research.


2 Installing Packages

# List of required packages (topicmodels is needed for the LDA section)
packages <- c(
  "mongolite", "tidyverse", "tm", "SnowballC",
  "textstem", "scales", "e1071", "quanteda", "ggplot2",
  "tidymodels", "textrecipes", "discrim", "parsnip", "rsample",
  "klaR", "widyr", "igraph", "ggraph", "dplyr", "topicmodels"
)

if (!"remotes" %in% rownames(installed.packages())) {
  install.packages("remotes")
}

library(remotes)

# Install tidytext from GitHub if not installed
if (!"tidytext" %in% rownames(installed.packages())) {
  remotes::install_github("juliasilge/tidytext")
}

# Install any packages that are not yet installed
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed], dependencies = TRUE)
}

3 Loading Packages

# Load the R libraries required for this project
suppressPackageStartupMessages({
library(mongolite)
library(tidyverse)
library(tidytext)
library(tm)
library(dplyr)
library(textstem)
library(scales)
library(e1071)
library(quanteda)
library(ggplot2)
library(widyr)
library(igraph)
library(ggraph)
})

4 Loading Data from MongoDB

## Setting up the connection to get the data from MongoDB

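# Read the connection string from an environment variable instead of hard-coding credentials.
# The variable name MONGO_URI is an assumption; the expected form is
# mongodb+srv://<user>:<password>@cluster0.2opjt6h.mongodb.net/sample_airbnb?retryWrites=true&w=majority
MONGO_URI <- Sys.getenv("MONGO_URI")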
DB_NAME <- "sample_airbnb"

4.1 Loading Airbnb data

# Fetch the Airbnb listings from MongoDB
con_airbnb <- mongo(
  collection = "listingsAndReviews",
  db         = DB_NAME,
  url        = MONGO_URI
)

airbnb_raw <- con_airbnb$find(
  query  = '{}',
  fields = '{"_id": 0}'
)
airbnb_df <- data.frame(
  country       = airbnb_raw$address$country,
  property_type = airbnb_raw$property_type,
  description   = airbnb_raw$description,
  price         = airbnb_raw$price
) %>%
  mutate(doc_id = row_number())

5 Text Preprocessing

5.1 Text Cleaning Pipeline

## Text Cleaning Pipeline
clean_text <- function(text) {
  text %>%
    str_to_lower() %>%                               # convert to lowercase
    str_remove_all("https?://\\S+|www\\.\\S+") %>%   # remove URLs
    str_remove_all("@\\w+|#\\w+") %>%                # remove mentions & hashtags
    str_replace_all("[\u2018\u2019]", "'") %>%       # normalize curly apostrophes
    str_replace_all("won't", "will not") %>%         # irregular contractions must run first,
    str_replace_all("can't", "can not") %>%          # otherwise the n't rule yields "wo not" / "ca not"
    str_replace_all("n't",   " not") %>%             # expand n't to not
    str_replace_all("'re",   " are") %>%             # expand 're to are
    str_replace_all("'ve",   " have") %>%            # expand 've to have
    str_replace_all("'ll",   " will") %>%            # expand 'll to will
    str_remove_all("[^a-z0-9\\s]") %>%               # drop anything that is not a lowercase letter, digit, or whitespace
    str_squish()                                     # collapse extra whitespace
}

airbnb_df <- airbnb_df %>%
  mutate(text = clean_text(description))

The raw text data was preprocessed through a standardized cleaning pipeline before analysis. This included lowercasing, URL and punctuation removal, stripping social media mentions and hashtags, contraction expansion, and whitespace normalization, ensuring the data was consistent and ready for downstream NLP tasks including sentiment analysis, topic modeling, and keyword extraction.
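
The order of the contraction rules matters, which is easiest to see on a single string. The sketch below re-implements a few of the cleaning steps in base R, so it runs without stringr, on a made-up description. The helper `clean_text_base` and the sample text are illustrative assumptions, not part of the pipeline. It expands the irregular forms "won't" and "can't" before the generic "n't" rule, because applying the generic rule first would leave "wo not" and "ca not".

```r
# Illustrative base-R version of the main cleaning steps (hypothetical helper).
clean_text_base <- function(text) {
  text <- tolower(text)
  text <- gsub("https?://\\S+|www\\.\\S+", "", text, perl = TRUE)  # drop URLs
  text <- gsub("won't", "will not", text, fixed = TRUE)            # irregular forms first
  text <- gsub("can't", "can not", text, fixed = TRUE)
  text <- gsub("n't", " not", text, fixed = TRUE)                  # then the generic rule
  text <- gsub("[^a-z0-9\\s]", "", text, perl = TRUE)              # keep letters, digits, whitespace
  trimws(gsub("\\s+", " ", text, perl = TRUE))                     # collapse whitespace
}

# Made-up listing text for illustration
sample_text <- "You won't need a car! It isn't far: see https://example.com"
clean_text_base(sample_text)
# [1] "you will not need a car it is not far see"
```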


5.2 Tokenization, Stopword Removal & Lemmatization

# Tokenize, remove stopwords, and lemmatize
custom_stopwords <- data.frame(word = c("im", "de"),
                               lexicon = "custom")

airbnb_tokens <- airbnb_df %>%
              unnest_tokens(word, text) %>%                    # split text into words
              anti_join(stop_words, by = "word") %>%           # remove standard stopwords
              anti_join(custom_stopwords, by = "word") %>%     # remove custom stopwords
              filter(!str_detect(word, "^\\d+$")) %>%          # remove numbers
              mutate(word = lemmatize_words(word))             # lemmatize


# Summary info
cat("Total tokens:  ", nrow(airbnb_tokens), "\n")
## Total tokens:   365179
cat("Unique lemmas: ", n_distinct(airbnb_tokens$word), "\n")
## Unique lemmas:  26648

After tokenization, stopword removal, and lemmatization, the listing descriptions yielded 365,179 total tokens and 26,648 unique lemmas, making the dataset ready for further analysis.
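
To make the distinction between total tokens and unique lemmas concrete, here is a toy base-R sketch. The two sentences, the four-word stopword list, and the two-entry lemma map are all made up for illustration; they stand in for `stop_words` and the dictionary used by `lemmatize_words()`.

```r
# Toy illustration: total tokens vs. unique lemmas (all data below is made up).
docs       <- c("cozy apartments near the beaches", "the apartment is near a beach")
stop_small <- c("the", "is", "a", "near")                      # tiny stand-in for stop_words
lemma_map  <- c(apartments = "apartment", beaches = "beach")   # tiny stand-in for a lemma dictionary

tokens <- unlist(strsplit(docs, "\\s+"))       # split on whitespace
tokens <- tokens[!tokens %in% stop_small]      # remove stopwords
lemmas <- unname(ifelse(tokens %in% names(lemma_map), lemma_map[tokens], tokens))

length(lemmas)           # total tokens after filtering: 5
length(unique(lemmas))   # unique lemmas: 3
```

Lemmatization leaves the token count unchanged but shrinks the vocabulary, which is why the unique-lemma figure is so much smaller than the token total.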


6 Exploratory Text Analysis

6.1 Word Frequency

freq_hist <- airbnb_tokens %>%
            count(word, sort = TRUE) %>%              # count frequency
            slice_max(n, n = 10) %>%                  # select top 10 words
            mutate(word = reorder(word, n)) %>%       # reorder words by frequency
            ggplot(aes(word, n))         +            # map word vs frequency
            geom_col(fill = "steelblue") +            # bar plot
            xlab(NULL)                   +            # remove x-axis label
            ylab("Frequency")            +            # label y-axis
            coord_flip()                 +            # horizontal bars
            labs(title = "Top 10 Most Frequent Words in Airbnb Listing Descriptions")

# Print the plot
print(freq_hist)

The word frequency analysis shows that Airbnb hosts emphasize location, layout, and practical details in their descriptions: terms like “apartment,” “bedroom,” and “bathroom” set clear expectations, while “walk” and “minute” highlight proximity as a key selling point. The presence of “live” suggests an attempt to promote authentic local experiences. Overall, listings focus more on functionality and convenience than on differentiation, indicating an opportunity for more engaging and distinctive descriptions.


6.2 Bi-Gram Analysis

# Create bigrams
airbnb_bigrams <- airbnb_df %>%
                  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%         # generate 2-word sequences
                  separate(bigram, into = c("word1", "word2"), sep = " ") %>%      # split into two columns
                  filter(!word1 %in% stop_words$word,                              # remove stopwords from both words
                       !word2 %in% stop_words$word,
                       !str_detect(word1, "^\\d+$"),                               # remove numbers
                       !str_detect(word2, "^\\d+$")
                       ) %>%
                  mutate(
                    word1 = lemmatize_words(word1),                                # lemmatize
                    word2 = lemmatize_words(word2)
                   ) %>%
          unite(bigram_lemma, word1, word2, sep = " ") %>%                         # combine back into single column
          dplyr::select(doc_id,country,bigram_lemma)

# Quick summary
cat("Total bigrams:  ", nrow(airbnb_bigrams), "\n")
## Total bigrams:   202742
cat("Unique bigrams: ", n_distinct(airbnb_bigrams$bigram_lemma), "\n")
## Unique bigrams:  103566

6.3 Bi-Gram Analysis by Country

# Count bigrams by country
bigram_counts <- airbnb_bigrams %>%
  count(bigram_lemma, country, sort = TRUE)

# Get top 10 bigrams by total count
top_bigrams <- bigram_counts %>%
  group_by(bigram_lemma) %>%
  summarise(total_n = sum(n), .groups = 'drop') %>%
  slice_max(total_n, n = 10) %>%
  inner_join(bigram_counts, by = "bigram_lemma")

# Plot stacked bars with counts, colored by country
ggplot(top_bigrams, aes(x = reorder(bigram_lemma, total_n), y = n, fill = country)) +
  geom_col() +   # default stacking by count
  coord_flip() +
  labs(
    title = "Top 10 Bigrams by Country",
    x = "Bigram",
    y = "Count",
    fill = "Country"
  ) +
  theme_minimal()

This bigram analysis of Airbnb host descriptions uncovers clear regional patterns in how listings are framed. Phrases like “minute walk,” “walk distance,” and “min walk” dominate globally, highlighting walkability as a universal marketing hook; the U.S. and Australia lead in raw counts, while Hong Kong and Turkey lag, which may reflect urban density and local search behavior as well as the number of listings per country. Market-specific bigrams reveal priorities: “metro station” drives listings in Spain, Portugal, and Turkey, and destination-specific terms like “hong kong” appear only locally. Amenity-related phrases such as “double bed,” “air condition,” and “free wifi” remain baseline expectations worldwide. These insights indicate that optimizing listings requires localized keyword strategies: leading with walkability in some markets and transit proximity in others, while using amenity bigrams to confirm baseline standards, so that descriptions align with regional search behavior and improve visibility and guest relevance.
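
The bigram code forms word pairs first and filters stopwords afterwards. The base-R sketch below, on a made-up six-word phrase with a one-word stopword list, illustrates why that order matters: removing stopwords before pairing would create pairs of words that were never adjacent in the text.

```r
# Made-up phrase and a one-word stopword list, for illustration only.
words      <- c("five", "minute", "walk", "to", "metro", "station")
stop_small <- c("to")

first  <- head(words, -1)            # every word except the last
second <- tail(words, -1)            # every word except the first
bigrams <- paste(first, second)      # adjacent pairs

# Pair first, then drop pairs containing a stopword (the order used above)
kept <- bigrams[!(first %in% stop_small | second %in% stop_small)]
kept
# [1] "five minute"   "minute walk"   "metro station"

# Filter first, then pair: produces "walk metro", which never occurred in the text
w2 <- words[!words %in% stop_small]
paste(head(w2, -1), tail(w2, -1))
# [1] "five minute"   "minute walk"   "walk metro"    "metro station"
```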


6.4 Inverse Document Frequency Analysis

# Compute number of documents per country
country_docs <- airbnb_tokens %>%
  group_by(country) %>%
  summarise(n_docs = n_distinct(doc_id), .groups = "drop")

# Compute IDF by country
idf_data_country <- airbnb_tokens %>%
  distinct(doc_id, country, word) %>%
  filter(!str_detect(word, "\\d")) %>%
  left_join(country_docs, by = "country") %>%
  group_by(country, word, n_docs) %>%
  summarise(doc_count = n(), .groups = "drop") %>%
  mutate(idf = log(n_docs / (1 + doc_count))) %>%
  group_by(country) %>%
  arrange(desc(idf)) %>%
  slice_head(n = 5) %>%   # words tied at the maximum IDF are kept in row order
  ungroup()

# Faceted plot by country
ggplot(idf_data_country, aes(x = reorder(word, idf), y = idf, fill = idf)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#B0BEC5", high = "#1C1C1C") +
  facet_wrap(~country, scales = "free_y") +  # each country gets its own panel
  labs(
    x = NULL,
    y = "Inverse Document Frequency (IDF)",
    title = "Top 5 Rarest Words in Airbnb Descriptions by Country"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

This IDF analysis highlights the distinctive language and cultural cues in Airbnb host descriptions across countries, with rare words revealing local language, niche amenities, and hyper-local references. In Brazil, Portuguese terms like abertas and abajures signal authentic, design-focused descriptions; Spain shows multilingual or fragmented terms, pointing to localization gaps. Hong Kong hosts use neighborhood-specific words like aberdeenplus to brand listings, while Australia and Canada emphasize universities, heritage, and landmarks. The U.S. features cryptic or informal terms, reflecting diverse writing styles. Because every word that occurs in only one description shares the same maximum IDF within its country, the five words shown per country are a subset of those tied terms and should be read as illustrative rather than as a strict ranking. These findings suggest that leveraging distinctive, locally relevant terms can differentiate listings, improve search visibility, and enhance guest discovery, while addressing translation and keyword optimization can unlock further value.
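
The smoothed formula used above, idf = log(n_docs / (1 + doc_count)), can be worked through by hand on a toy corpus. The base-R sketch below uses four made-up documents for one hypothetical country. It shows two properties of the formula: all words that occur in a single document tie at the maximum value, and a word that occurs in every document gets a slightly negative score because of the +1 in the denominator.

```r
# Toy corpus: four made-up documents for one hypothetical country.
docs <- list(
  c("apartment", "metro", "terrace"),
  c("apartment", "beach"),
  c("apartment", "metro"),
  c("apartment", "garden")
)

n_docs <- length(docs)
vocab  <- sort(unique(unlist(docs)))

# Number of documents that contain each word
doc_count <- vapply(
  vocab,
  function(w) sum(vapply(docs, function(d) w %in% d, logical(1))),
  numeric(1)
)

idf <- log(n_docs / (1 + doc_count))   # same smoothed formula as in the pipeline
round(sort(idf, decreasing = TRUE), 3)
```

Here "beach", "garden", and "terrace" each appear once and share log(4 / 2), "metro" scores log(4 / 3), and "apartment" scores log(4 / 5), which is below zero.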


6.5 Major Markets Word Correlation - Spain vs United States

# Word proportions for the United States and Spain, one column per country
frequency_wide <- airbnb_tokens %>%
  filter(country %in% c("United States", "Spain")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(country, word) %>%
  group_by(country) %>%
  mutate(proportion = n / sum(n)) %>%          # share of each word within its country
  ungroup() %>%
  dplyr::select(-n) %>%
  pivot_wider(names_from = country, values_from = proportion, values_fill = 0) %>%
  filter(`United States` > 0, Spain > 0)       # keep words used in both countries

# Plot the two sets of word proportions against each other
ggplot(frequency_wide, aes(x = `United States`, y = Spain, color = abs(`United States` - Spain))) +
  geom_abline(color = "grey40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(low = "darkslategray4", high = "gray75") +
  theme(legend.position = "none") +
  labs(
    title = "Word Correlations: United States vs Spain",
    x = "US proportion",
    y = "Spain proportion"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

This word correlation analysis comparing the United States and Spain reveals clear differences in how hosts frame Airbnb listings. US descriptions emphasize car-centric convenience, spacious layouts, and American amenities such as specific bed sizes, parking, and neighborhood references, reflecting the need to set clear expectations in sprawling urban areas. Spanish listings focus on public transit and walkability, highlighting metro, estación, parada, and minutos andando to signal proximity to transit and landmarks in dense cities. These findings underscore the value of market-specific keyword strategies: US hosts should foreground parking, bed sizes, and neighborhood context, while Spanish hosts should highlight transit access, walking times, and vibrant neighborhoods. For Airbnb, supporting localized description guidance can improve visibility, align content with guest search behavior, and increase booking conversion.
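
The scatterplot compares word proportions visually; a single number summarizing how closely the two vocabularies agree can be obtained with a Pearson correlation test. The sketch below uses base R's `cor.test()` on a small made-up table, so the proportions and the column names `us` and `spain` are illustrative assumptions rather than results from the data.

```r
# Made-up word proportions for two markets, for illustration only.
freq_toy <- data.frame(
  word  = c("apartment", "walk", "metro", "parking", "kitchen"),
  us    = c(0.020, 0.010, 0.002, 0.008, 0.012),
  spain = c(0.025, 0.012, 0.009, 0.001, 0.010)
)

res <- cor.test(freq_toy$us, freq_toy$spain)   # Pearson correlation by default
res$estimate   # correlation coefficient
res$p.value    # p-value for the test of zero correlation
```

With the real data, the same call would take the `United States` and `Spain` columns of `frequency_wide`. A coefficient near 1 would indicate that hosts in both markets use shared words at similar rates.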


7 Sentiment Analysis

7.1 Bing Lexicon: Positive vs Negative

bing_lexicon <- get_sentiments("bing")
airbnb_sentiment <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = c("word" = "word"))

sentiment_counts <- airbnb_sentiment %>%
  count(country,sentiment)
ggplot(sentiment_counts, aes(x = sentiment, y = n, fill = country)) +
  geom_col(show.legend = TRUE) +
  labs(title = "Sentiment Distribution (Bing Lexicon)",
       x = "Sentiment",
       y = "Word Count")

The Bing sentiment analysis of Airbnb host descriptions shows that positive sentiment consistently outweighs negative across markets, with roughly 20,000 positive versus 10,000 negative word matches in the stacked totals. This pattern suggests that hosts treat descriptions as marketing copy, emphasizing strengths while minimizing weaknesses, and that platform norms have standardized optimistic language globally. Necessary negatives like “no elevator” or “street noise” should be framed positively to preserve appeal. This allows hosts to balance transparency with attractiveness and supports Airbnb in guiding description quality.
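
Raw counts depend on how many listings each country contributes, so comparing markets is easier with the positive and negative shares within each country. The base-R sketch below shows the calculation on a made-up table with the same columns as `sentiment_counts`; the countries and counts are illustrative only.

```r
# Made-up counts with the same columns as sentiment_counts, for illustration only.
sentiment_toy <- data.frame(
  country   = c("Spain", "Spain", "Turkey", "Turkey"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(3000, 1000, 600, 400)
)

# Total lexicon matches per country, then each row's share of its country total
totals <- tapply(sentiment_toy$n, sentiment_toy$country, sum)
sentiment_toy$share <- sentiment_toy$n / as.numeric(totals[sentiment_toy$country])
sentiment_toy
```

In this toy table Spain has five times Turkey's positive count, yet the positive shares are 0.75 and 0.60, a much smaller gap than the raw counts suggest.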


7.2 Top Positive & Negative Words

# Find the 15 most frequent words per sentiment across all countries
word_contributions <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = c("word" = "word")) %>%
  count(word, sentiment, country, sort = TRUE) %>%
  group_by(sentiment, word) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  group_by(sentiment) %>%
  slice_max(total, n = 15) %>%
  ungroup()

# Keep per-country counts for just these top words, for stacking by country
word_source_counts <- airbnb_tokens %>%
  inner_join(bing_lexicon, by = c("word" = "word")) %>%
  semi_join(word_contributions, by = c("word","sentiment")) %>%
  count(word, sentiment, country)

# Plot with bars stacked by country, ordered by each word's total count
ggplot(word_source_counts,
       aes(x = n,
           y = reorder_within(word, n, sentiment, fun = sum),
           fill = country)) +
  geom_col(width = 0.7) +
  facet_wrap(~sentiment, scales = "free") +
  scale_y_reordered() +
  labs(
    title    = "Most Impactful Words by Sentiment and Country",
    x = "Frequency", y = NULL,
    fill = "Country"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

The Bing sentiment analysis of Airbnb descriptions identifies the words driving positive and negative tone, providing a starting point for optimizing listing language. Positive words like stun, retreat, overlook, and tout (shown in lemmatized form) highlight aspirational, experiential, and view-focused language that resonates emotionally with guests. Negative words such as noise, smoke, complex, split, sink, grind, and rue reflect practical limitations or unintended discouragement, although some of these are likely lexicon artifacts rather than genuine negatives: in listing text, “sink” and “complex” often name a fixture or a building, and “rue” is the French word for street. Hosts can maximize appeal by foregrounding positive, scenic, and experiential terms while reframing or minimizing genuine negative triggers, for example presenting noise as a “vibrant neighborhood” or split as a “thoughtfully designed layout,” enhancing booking potential and aligning descriptions with guest expectations.


8 Topic Modeling (LDA)

8.1 Document Term Matrix

airbnb_dtm <- airbnb_tokens %>%
  count(doc_id, word) %>%                               # term frequency per listing description
  cast_dtm(document = doc_id, term = word, value = n)   # convert to a document-term matrix

8.2 Fit the LDA Model

library(topicmodels)

k <- 4
lda_model <- LDA(
  airbnb_dtm,
  k = k,
  method = "Gibbs",
  control = list(seed = 42, iter = 1000, burnin = 200, thin = 10)
)

# Assign human-readable labels after inspecting top terms
topic_labels <- tibble(
  topic = 1:k,
  label = c(
        "Location & Transportation",               # Based on: metro, est, la, da
        "Apartment Features & Amenities",          # Based on: apartament, quarto
        "Guest Experience & Host Hospitality",     # Mixed positive experience terms
        "Neighborhood & Local Attractions"         # Location context
  )
)
tidy_topics <- tidy(lda_model, matrix = "beta")


# Top 10 terms per topic
top_terms <- tidy_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  left_join(topic_labels, by = "topic") %>%
  arrange(topic, -beta)

# Plot the top term probabilities by topic

ggplot(top_terms,
       aes(x = reorder_within(term, beta, topic),
           y = beta, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free_y", ncol = 2) +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "LDA topic modeling - Airbnb listing descriptions",
    x = NULL, y = "Term probability (beta)",
    caption  = "Higher beta = more characteristic of that topic"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 10)
  )

The LDA topic modeling of Airbnb descriptions reveals four thematic clusters that shape how hosts frame listings. Location & Transportation emphasizes walkability and proximity to transit, dining, and shopping, reflecting location-driven marketing. Guest Experience & Host Hospitality uses aspirational language such as view, beautiful, and private to evoke emotion and comfort. Apartment Features & Amenities covers functional details like beds, kitchens, and wifi, addressing baseline guest needs. Neighborhood & Local Attractions incorporates non-English terms in multilingual markets, signaling authenticity or targeting domestic guests. Optimized listings blend all four themes, balancing location, experience, functional clarity, and localized language to enhance visibility, match search behavior, and boost bookings.
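
The beta matrix describes which words characterize each topic; how much of the corpus each topic accounts for comes from the per-document topic proportions (gamma), which tidytext exposes through `tidy(lda_model, matrix = "gamma")`. The base-R sketch below shows the arithmetic on a made-up gamma matrix for three documents and four topics; the values are illustrative, not output from the fitted model.

```r
# Made-up per-document topic proportions (each row sums to 1), for illustration only.
gamma_toy <- matrix(
  c(0.70, 0.10, 0.10, 0.10,
    0.20, 0.50, 0.20, 0.10,
    0.25, 0.25, 0.25, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("topic", 1:4))
)

prevalence <- colMeans(gamma_toy)   # average share of each topic across documents
dominant   <- colnames(gamma_toy)[max.col(gamma_toy, ties.method = "first")]

round(prevalence, 3)
dominant   # most probable topic per document; ties go to the first topic
```

Averaging gamma gives a prevalence figure per topic that sums to 1 across topics, which is one way to state what share of the descriptions each theme represents.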


Conclusion