Executive summary

Consumer reviews are valuable textual data: they reveal how customers use and evaluate products, while star ratings give a compact measure of overall appraisal. In this Automated Text Analysis (ATA) class project, I analyze women’s apparel reviews with several text-mining techniques to uncover what female consumers regard positively and what they perceive negatively. By further segmenting the data by age group, I aim to identify generational differences in clothing selection and purchasing behavior. The results are intended to inform women’s fashion marketing by highlighting which aspects consumers value most and where they experience dissatisfaction.

This analysis seeks to answer two main questions:

1. Overall, which aspects of women’s apparel leave customers satisfied, and which aspects cause dissatisfaction?
2. Do priorities differ across age segments (young, middle-aged, and older consumers) and, if so, can these insights inform age-targeted marketing strategies for clothing brands?

By applying a variety of text-mining techniques, this study aims to generate actionable insights into these questions.

Data background

The dataset used for this analysis is the “Women’s E-Commerce Clothing Reviews” downloaded from Kaggle. (Link: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews)

The review dataset contains 23,486 apparel reviews. Each row corresponds to an individual customer review, with fields for Clothing ID, reviewer age, review title, review text, product rating, recommendation status, number of review feedback votes, product category, product department, and product class name.

For the current analysis, the primary focus will be on the review text and product rating features.

Data loading, cleaning and preprocessing

First, I load the review data. Because some column names contain spaces, I replace those spaces with underscores to simplify preprocessing. I also convert all values in the Review_Text column to lowercase.
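The report does not show its setup chunk, so the package list below is inferred from the functions called later (read_csv, unnest_tokens, pairwise_count, as_tbl_graph, ggraph, percent). Treat it as an assumption about the environment rather than the original configuration. The sketch attaches whichever of these packages are installed and reports the rest.

```r
# Assumed setup, inferred from the function calls in this report
pkgs <- c(
  "tidyverse",  # read_csv(), dplyr verbs, tidyr::separate()/unite(), ggplot2
  "tidytext",   # unnest_tokens(), get_sentiments(), stop_words, bind_tf_idf()
  "widyr",      # pairwise_count()
  "tidygraph",  # as_tbl_graph()
  "ggraph",     # ggraph(), geom_edge_link(), geom_node_point(), theme_graph()
  "scales"      # percent()
)

# Check which packages are available without attaching them yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install before running the report: ",
          paste(not_installed, collapse = ", "))
}

# Attach the packages that are present
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```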

raw_review <- read_csv('women_clothes_review.csv')

## New names:
## Rows: 23486 Columns: 11
## ── Column specification
## ────────────────────────────────────────────────────────
## Delimiter: "," chr (5): Title, Review Text, Division Name, Department Name,
## Class Name dbl (6): ...1, Clothing ID, Age, Rating, Recommended IND, Positive
## Feedback ...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

names(raw_review) <- gsub(" ", "_", names(raw_review))
raw_review$Review_Text <- tolower(raw_review$Review_Text)
head(raw_review, 5)

## # A tibble: 5 × 11
##    ...1 Clothing_ID   Age Title               Review_Text Rating Recommended_IND
##   <dbl>       <dbl> <dbl> <chr>               <chr>        <dbl>           <dbl>
## 1     0         767    33 <NA>                "absolutel…      4               1
## 2     1        1080    34 <NA>                "love this…      5               1
## 3     2        1077    60 Some major design … "i had suc…      3               0
## 4     3        1049    50 My favorite buy!    "i love, l…      5               1
## 5     4         847    47 Flattering shirt    "this shir…      5               1
## # ℹ 4 more variables: Positive_Feedback_Count <dbl>, Division_Name <chr>,
## #   Department_Name <chr>, Class_Name <chr>

Next, I tokenized the review text and removed stop words. Because the overarching goal of this text-mining study is sentiment analysis, I kept only the words found in the Bing sentiment lexicon and dropped every term that could not be mapped to a sentiment. To make age-specific comparisons easier, I also used case_when() to create a new feature, age_group, labeling reviewers aged 29 and under as “teenage,” those aged 30 to 59 as “adults,” and anyone 60 or older as “elderly.”

bing <- get_sentiments("bing")
tidy_review <- raw_review %>%
  # new feature by age group
  mutate(
    age_group = case_when( 
      Age < 30 ~ 'teenage',
      Age < 60 ~ 'adults',
      TRUE ~ 'elderly'
    ) 
  )%>%
  # tokenizing
  unnest_tokens(word, Review_Text) %>%
  # remove stop words
  anti_join(stop_words) %>%
  inner_join(bing)

## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`

## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 22648 of `x` matches multiple rows in `y`.
## ℹ Row 3857 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
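The many-to-many warning above most likely arises because a few words are listed in the Bing lexicon under both sentiments, so one token can match two lexicon rows. The sketch below shows the idea on a made-up lexicon in base R; the toy words and the decision to drop ambiguous entries are assumptions for illustration, not the procedure used in this report.

```r
# Toy stand-in for the Bing lexicon; "envious" is listed under both sentiments
toy_lexicon <- data.frame(
  word      = c("love", "poor", "envious", "envious"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Words that map to more than one lexicon row
dup_words <- unique(toy_lexicon$word[duplicated(toy_lexicon$word)])

# Drop the ambiguous words so each token matches at most one row
lexicon_unique <- toy_lexicon[!toy_lexicon$word %in% dup_words, ]

# Toy tokens, one row per word occurrence
toy_tokens <- data.frame(word = c("love", "envious", "poor", "love"))

# merge() performs an inner join by default
joined <- merge(toy_tokens, lexicon_unique, by = "word")
joined
```

With the ambiguous word removed, every remaining token receives exactly one sentiment label, so token counts are no longer inflated by duplicate matches.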

tidy_review

## # A tibble: 98,674 × 13
##     ...1 Clothing_ID   Age Title   Rating Recommended_IND Positive_Feedback_Co…¹
##    <dbl>       <dbl> <dbl> <chr>    <dbl>           <dbl>                  <dbl>
##  1     0         767    33 <NA>         4               1                      0
##  2     0         767    33 <NA>         4               1                      0
##  3     0         767    33 <NA>         4               1                      0
##  4     1        1080    34 <NA>         5               1                      4
##  5     1        1080    34 <NA>         5               1                      4
##  6     1        1080    34 <NA>         5               1                      4
##  7     1        1080    34 <NA>         5               1                      4
##  8     2        1077    60 Some m…      3               0                      0
##  9     2        1077    60 Some m…      3               0                      0
## 10     2        1077    60 Some m…      3               0                      0
## # ℹ 98,664 more rows
## # ℹ abbreviated name: ¹Positive_Feedback_Count
## # ℹ 6 more variables: Division_Name <chr>, Department_Name <chr>,
## #   Class_Name <chr>, age_group <chr>, word <chr>, sentiment <chr>
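The age_group rule can be checked at its boundaries. The sketch below is a base R alternative built on cut(), not the case_when() call used above; the helper name label_age is hypothetical. One difference worth knowing: in the case_when() version a missing Age falls through to the TRUE branch and is labeled "elderly", whereas cut() leaves it as NA.

```r
# Hypothetical helper: bins ages into the three groups used in this report
label_age <- function(age) {
  cut(age,
      breaks = c(-Inf, 30, 60, Inf),  # [-Inf, 30), [30, 60), [60, Inf)
      right  = FALSE,                 # intervals are closed on the left
      labels = c("teenage", "adults", "elderly"))
}

# Boundary ages plus a missing value
ages   <- c(18, 29, 30, 59, 60, NA)
groups <- label_age(ages)
data.frame(ages, groups)
```

Because cut() returns a factor whose levels are already ordered from youngest to oldest, the result could also fix the facet order in later plots without a separate factor() call.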

Text data analysis

Individual analysis and figures

Analysis and Figure 1

First, I want to identify the most frequently used positive and negative words in the overall review dataset.

tidy_review %>%
  count(word, sentiment, sort = TRUE)

## # A tibble: 1,737 × 3
##    word        sentiment     n
##    <chr>       <chr>     <int>
##  1 love        positive   8948
##  2 top         positive   7405
##  3 perfect     positive   3772
##  4 flattering  positive   3517
##  5 soft        positive   3343
##  6 comfortable positive   3057
##  7 cute        positive   3041
##  8 nice        positive   3023
##  9 beautiful   positive   2960
## 10 pretty      positive   2194
## # ℹ 1,727 more rows

I then examined the overall ratio of positive to negative words: 75.3% of the sentiment-bearing tokens were positive, while only 24.7% were negative.

sent_total <- tidy_review %>% 
  count(sentiment, name = "n_words") %>%
  mutate(prop  = n_words / sum(n_words),
         # namespaced so the call works even if scales is not attached
         label = scales::percent(prop, accuracy = 0.1))

ggplot(sent_total, 
       aes(x = "", y = prop, fill = sentiment)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            size = 4, color = "white") +
  scale_fill_manual(values = c(positive = "#4DBBD5",
                               negative = "#DC0000")) +
  labs(title = "Overall Positive vs. Negative Word Share",
       fill  = "Sentiment") +
  theme_void(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Next, I examined the positive-to-negative word ratio within each age group. The adult group showed a slightly higher share of negative words than the elderly or teenage groups, but the difference was only about one percentage point, so the three groups look essentially alike on this measure.

sent_pie <- tidy_review %>%
  count(age_group, sentiment, name = 'n_words') %>%
  group_by(age_group) %>%
  mutate(prop = n_words / sum(n_words),
         label = scales::percent(prop, accuracy = 0.1))

ggplot(sent_pie, 
       aes(x = "", y = prop, fill = sentiment)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  facet_wrap(~ age_group, ncol = 3) +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5), 
            size = 3, color = "white") +
  scale_fill_manual(values = c(positive = "#4DBBD5", 
                               negative = "#DC0000")) +
  labs(title = "Positive vs. Negative Word Share by Age Group",
       fill  = "Sentiment") +
  theme_void(base_size = 12) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
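Whether a gap of about one percentage point is more than noise could be checked with a chi-squared test of independence between age group and sentiment. The sketch below uses made-up counts, not the real data, and only illustrates the mechanics.

```r
# Hypothetical counts of sentiment words per age group (NOT the real data)
counts <- matrix(
  c(7500, 2500,
    7400, 2600,
    7550, 2450),
  nrow = 3, byrow = TRUE,
  dimnames = list(age_group = c("teenage", "adults", "elderly"),
                  sentiment = c("positive", "negative"))
)

# Share of negative words per group
neg_share <- counts[, "negative"] / rowSums(counts)

# Test of independence; degrees of freedom = (3 - 1) * (2 - 1) = 2
test <- chisq.test(counts)
test$p.value
```

On the real data the same test could be run as chisq.test(table(tidy_review$age_group, tidy_review$sentiment)). With nearly 100,000 tokens even a small gap may come out statistically significant, so the size of the difference matters more than the p-value.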

I visualized the ten most frequently used words for each sentiment category (negative and positive) in the tidy_review data. Among the top negative words, terms such as “fall,” “loose,” and “worn” appeared most often, whereas positive words like “love” and “perfect” dominated the positive list.

tidy_review %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Analysis and Figure 2

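Next, I wanted to split reviews into positive and negative groups based on their star ratings so I could compare the two sets. To do that, I needed a clear cut-off score for distinguishing positive from negative feedback. I therefore examined the rating distribution and found that most ratings were 5: the median was 5 and the mean was roughly 4.25. Note that these statistics are computed on tidy_review, which has one row per sentiment word, so each review is weighted by the number of lexicon words it contains; the histogram uses raw_review and counts each review once.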

median_rating <- median(tidy_review$Rating)
sd_rating <- sd(tidy_review$Rating)
mean_rating <- mean(tidy_review$Rating)
median_rating

## [1] 5

sd_rating

## [1] 1.080305

mean_rating

## [1] 4.249508

quantile(tidy_review$Rating)

##   0%  25%  50%  75% 100% 
##    1    4    5    5    5
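Since the statistics above come from the token-level table, reviews with more sentiment words weigh more heavily. The toy example below (made-up ratings and token counts) shows how a token-level mean can differ from a review-level mean.

```r
# Toy data: one row per review, with the number of sentiment words it contains
reviews <- data.frame(
  id       = 1:3,
  rating   = c(5, 3, 1),
  n_tokens = c(6, 2, 2)
)

# Token-level table: each review's rating is repeated once per sentiment word
tokens <- reviews[rep(seq_len(nrow(reviews)), reviews$n_tokens), ]

review_mean <- mean(reviews$rating)  # each review counted once
token_mean  <- mean(tokens$rating)   # (5*6 + 3*2 + 1*2) / 10

c(review_level = review_mean, token_level = token_mean)
```

In this toy case the review-level mean is 3 while the token-level mean is 3.8, because the five-star review contributes the most tokens. For review-level figures on the real data, the same summaries could be computed on raw_review$Rating instead.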

ggplot(raw_review, aes(x = Rating)) +
  geom_histogram(binwidth = 1,
                 fill = "steelblue", 
                 color = "white") +  
  scale_x_continuous(breaks = 1:5) +  
  labs(title = "Rating Distribution",
       x = "Rating",
       y = "Count")

For comparison, I classified only reviews with a rating of 5 as positive. To see which positive words appear most frequently within these reviews, I plotted their frequency. The resulting chart looked very similar to the earlier positive-word graph—unsurprisingly, since more than half of all reviews carry a top score of 5 and therefore fall into the positive category.

When I examined the words, I found they fell into two broad categories. The first consists of appearance-related terms—typical examples are “love,” “beautiful,” and “cute.” The second category contains performance-related words such as “perfect,” “comfortable,” and “soft.” This indicates that consumers write reviews with both the product’s aesthetics and its practical qualities in mind.

tidy_review %>%
  filter(Rating == 5) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(sentiment == 'positive') %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(x = "Contribution to sentiment",
       y = NULL)

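graph_pos <- tidy_review %>%
  filter(Rating == 5) %>%
  filter(sentiment == 'positive') %>%     # keep positive words only, mirroring graph_neg
  pairwise_count(item = word,
                 feature = Clothing_ID,   # co-occurrence is counted per product
                 sort = TRUE) %>%
  filter(n >= 250) %>%
  as_tbl_graph()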
graph_pos

## # A tbl_graph: 10 nodes and 22 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 10 × 1 (active)
##    name       
##    <chr>      
##  1 perfect    
##  2 love       
##  3 soft       
##  4 comfortable
##  5 flattering 
##  6 top        
##  7 cute       
##  8 nice       
##  9 super      
## 10 beautiful  
## #
## # Edge Data: 22 × 3
##    from    to     n
##   <int> <int> <dbl>
## 1     1     2   330
## 2     2     1   330
## 3     3     2   320
## # ℹ 19 more rows
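As I understand widyr, pairwise_count() counts, for each pair of words, the number of distinct feature values (here Clothing_ID) in which both words occur. The sketch below reproduces that idea in base R on made-up data; it is a stand-in for illustration and does not call widyr itself.

```r
# Toy token table: which sentiment words appear in reviews of which product
toy <- data.frame(
  Clothing_ID = c(1, 1, 1, 2, 2, 3),
  word        = c("love", "soft", "love", "love", "soft", "cute")
)

# Incidence matrix: 1 when a word appears at least once for a product
tab <- table(toy$Clothing_ID, toy$word)
incidence <- matrix(as.numeric(tab > 0),
                    nrow = nrow(tab),
                    dimnames = dimnames(tab))

# crossprod() counts, for every word pair, the products that contain both
cooc <- crossprod(incidence)
cooc["love", "soft"]
```

Here "love" and "soft" co-occur for two products, and repeated uses of "love" within product 1 are counted once. Using the review identifier (the ...1 column) as the feature instead of Clothing_ID would count co-occurrence within individual reviews.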

Next, I visualized how the positive words relate to one another. Because pairwise_count() is keyed on Clothing_ID, an edge links two words that both occur in the five-star reviews of at least 250 products. The result mirrored the pattern I noted earlier: the nodes comfortable, soft, and perfect clustered together, while words like beautiful and cute were connected around the central node love.

set.seed(99)                                # fix the seed so the layout is reproducible
ggraph(graph_pos, layout = "fr") +          # Fruchterman-Reingold force-directed layout
  geom_edge_link(color = "gray50",          # edge color
                 alpha = 0.5) +             # edge transparency
  geom_node_point(color = "lightblue",      # node color
                  size = 5) +               # node size
  geom_node_text(aes(label = name),         # label each node with its word
                 repel = TRUE,              # move labels off the nodes to avoid overlap
                 size = 5) +                # label font size
  theme_graph()

Next, I carried out the same analysis on the negative reviews. I classified reviews with a rating below 4 as negative, treating those with a rating of 4 as neutral rather than positive or negative. The distribution of negative words used in these negative reviews is shown below.

tidy_review %>%
  filter(Rating < 4) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(sentiment == 'negative') %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Next, I explored how the negative words are interconnected and observed that several of them clustered around the central node “disappointed.”

graph_neg <- tidy_review %>%
  filter(Rating < 4) %>%
  filter(sentiment == 'negative') %>%
  pairwise_count(item = word,
                 feature = Clothing_ID,
                 sort = T) %>%
  filter(n >= 50) %>%
  as_tbl_graph()
graph_neg

## # A tbl_graph: 13 nodes and 52 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 13 × 1 (active)
##    name        
##    <chr>       
##  1 disappointed
##  2 cheap       
##  3 bad         
##  4 sadly       
##  5 loose       
##  6 bust        
##  7 weird       
##  8 fall        
##  9 odd         
## 10 poor        
## 11 hung        
## 12 wrong       
## 13 worn        
## #
## # Edge Data: 52 × 3
##    from    to     n
##   <int> <int> <dbl>
## 1     1     2    89
## 2     2     1    89
## 3     3     1    85
## # ℹ 49 more rows

set.seed(99)                                # fix the seed so the layout is reproducible
ggraph(graph_neg, layout = "fr") +          # Fruchterman-Reingold force-directed layout
  geom_edge_link(color = "gray50",          # edge color
                 alpha = 0.5) +             # edge transparency
  geom_node_point(color = "lightcoral",     # node color
                  size = 5) +               # node size
  geom_node_text(aes(label = name),         # label each node with its word
                 repel = TRUE,              # move labels off the nodes to avoid overlap
                 size = 5) +                # label font size
  theme_graph()

Analysis and Figure 3

Next, I set out to examine age-related differences. Specifically, I conducted this analysis to see whether each age group thinks positively or negatively about different aspects of the product.

tidy_pos <- tidy_review %>%
  filter(Rating == 5) %>%
  filter(sentiment == 'positive')
tidy_pos

## # A tibble: 46,685 × 13
##     ...1 Clothing_ID   Age Title   Rating Recommended_IND Positive_Feedback_Co…¹
##    <dbl>       <dbl> <dbl> <chr>    <dbl>           <dbl>                  <dbl>
##  1     1        1080    34 <NA>         5               1                      4
##  2     1        1080    34 <NA>         5               1                      4
##  3     1        1080    34 <NA>         5               1                      4
##  4     1        1080    34 <NA>         5               1                      4
##  5     3        1049    50 My fav…      5               1                      0
##  6     3        1049    50 My fav…      5               1                      0
##  7     3        1049    50 My fav…      5               1                      0
##  8     3        1049    50 My fav…      5               1                      0
##  9     3        1049    50 My fav…      5               1                      0
## 10     4         847    47 Flatte…      5               1                      6
## # ℹ 46,675 more rows
## # ℹ abbreviated name: ¹Positive_Feedback_Count
## # ℹ 6 more variables: Division_Name <chr>, Department_Name <chr>,
## #   Class_Name <chr>, age_group <chr>, word <chr>, sentiment <chr>

tidy_neg <- tidy_review %>%
  filter(Rating < 4) %>%
  filter(sentiment == 'negative')
tidy_neg

## # A tibble: 8,138 × 13
##     ...1 Clothing_ID   Age Title   Rating Recommended_IND Positive_Feedback_Co…¹
##    <dbl>       <dbl> <dbl> <chr>    <dbl>           <dbl>                  <dbl>
##  1     2        1077    60 Some m…      3               0                      0
##  2     2        1077    60 Some m…      3               0                      0
##  3     2        1077    60 Some m…      3               0                      0
##  4     5        1080    49 Not fo…      2               0                      4
##  5     5        1080    49 Not fo…      2               0                      4
##  6    10        1077    53 Dress …      3               0                     14
##  7    10        1077    53 Dress …      3               0                     14
##  8    10        1077    53 Dress …      3               0                     14
##  9    10        1077    53 Dress …      3               0                     14
## 10    14        1077    50 Pretty…      3               1                      1
## # ℹ 8,128 more rows
## # ℹ abbreviated name: ¹Positive_Feedback_Count
## # ℹ 6 more variables: Division_Name <chr>, Department_Name <chr>,
## #   Class_Name <chr>, age_group <chr>, word <chr>, sentiment <chr>

First, I calculated TF-IDF scores for the positive reviews in each age group. By using TF-IDF, I filtered out words that appear frequently across all reviews so I could capture clearer age-specific signals. The bar charts below show the ten highest-scoring positive words for each group, and they reveal distinct nuances:

Teenage: Words such as godsend, magical, and luxuriously dominate. These exclamatory, mood-laden terms suggest that younger reviewers mainly evaluate clothing on its outward appearance—especially designs that feel flashy, new, or exciting.

Adults: Terms emphasizing practicality and accuracy—cleaner, accurately, honest—receive the highest TF-IDF scores. This indicates that adults focus more on functional aspects than on pure aesthetics when judging apparel.

Elderly: Words like authentic, superior, and fashionable lead the list, underscoring themes of quality, dignity, and elegance. Older reviewers appear to value a garment’s premium feel and the aura it conveys.

In short, reviewers under 30 prioritize overall design, particularly vivid, novel, and joyful impressions; those in their 30s to 50s emphasize functionality and practicality; and consumers aged 60 and above rate garments positively when they project quality, refinement, and formality. These age-specific differences can serve as key inputs when crafting marketing strategies that target each demographic.

# Compute TF-IDF by age group for the positive reviews
frequency_pos <- tidy_pos %>%
  count(age_group, word) %>%
  bind_tf_idf(term = word,
              document = age_group,
              n = n) %>%
  arrange(-tf_idf)
frequency_pos

## # A tibble: 1,168 × 6
##    age_group word          n       tf   idf   tf_idf
##    <chr>     <chr>     <int>    <dbl> <dbl>    <dbl>
##  1 adults    liking       15 0.000439 1.10  0.000482
##  2 teenage   godsend       2 0.000342 1.10  0.000376
##  3 adults    cheerful     11 0.000322 1.10  0.000353
##  4 adults    lover        11 0.000322 1.10  0.000353
##  5 teenage   tough         5 0.000855 0.405 0.000346
##  6 elderly   authentic     2 0.000301 1.10  0.000331
##  7 elderly   ecstatic      2 0.000301 1.10  0.000331
##  8 elderly   talented      2 0.000301 1.10  0.000331
##  9 elderly   angel         5 0.000753 0.405 0.000305
## 10 elderly   enjoying      5 0.000753 0.405 0.000305
## # ℹ 1,158 more rows
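Because the "documents" here are just three age groups, the idf term can take only three values, which explains the repeated 1.10 and 0.405 in the table above. The sketch below works this out with the natural-log definition idf = ln(number of documents / number of documents containing the word), which is the definition I understand bind_tf_idf() to use, and re-derives the first row of the table from its printed tf value.

```r
n_docs <- 3  # teenage, adults, elderly

# A word can appear in 1, 2, or all 3 age groups
docs_with_word <- 1:3
idf <- log(n_docs / docs_with_word)
round(idf, 3)

# First row of the table: "liking", tf = 0.000439, found in one group only
tf_liking     <- 0.000439
tf_idf_liking <- tf_liking * idf[1]
signif(tf_idf_liking, 3)
```

A word used by all three groups gets idf = 0 and drops out entirely, while a word used by a single group gets the maximum weight even if it appears only twice. The top of the ranking is therefore sensitive to rare words.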

top10 <- frequency_pos %>%
  group_by(age_group) %>%
  slice_max(tf_idf, n = 10, with_ties = F)

# Order the facets from youngest to oldest
top10$age_group <- factor(top10$age_group,
                          levels = c("teenage", "adults", "elderly"))

# Create a bar graph
ggplot(top10, aes(x = reorder_within(word, tf_idf, age_group),
                  y = tf_idf,
                  fill = age_group)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ age_group, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL, title = 'Top10 TF-IDF score by age group - Positive review')

I then analyzed the age-specific TF-IDF distribution of negative words in the negative reviews. Clear differences were again apparent across age groups.

Teenage: Words such as lying, damaged, misleading, and illusion received high TF-IDF scores, indicating that this group reacts sensitively when the actual product differs from their expectations of design or quality. Because younger consumers are adept at gathering product information online before purchasing, they may feel particularly let down when reality does not match what they saw on the web.

Adults: Much like in the positive-review analysis, words related to practicality carried the highest TF-IDF scores—for example, itch, bizarre, and worse. This suggests that adults evaluate apparel primarily on functional grounds and become negative when practical needs are not met.

Elderly: Price emerged as a major concern, reflected in words such as pricey and overpriced. In addition, terms like unfinished and biased show that this group also criticizes the overall workmanship and completeness of a product.

In summary, for the teenage segment, minimizing discrepancies between online images and the actual product is crucial. For adults, it is important to use high-quality materials and enhance practicality, as functionality drives their evaluations. For the elderly segment, setting a reasonable price point and improving overall finishing and build quality are key to reducing negative reviews.

frequency_neg <- tidy_neg %>%
  count(age_group, word) %>%
  bind_tf_idf(term = word,
              document = age_group,
              n = n) %>%
  arrange(-tf_idf)
frequency_neg

## # A tibble: 1,125 × 6
##    age_group word           n      tf   idf  tf_idf
##    <chr>     <chr>      <int>   <dbl> <dbl>   <dbl>
##  1 elderly   shock          3 0.00342 1.10  0.00376
##  2 teenage   lying          3 0.00278 1.10  0.00305
##  3 teenage   worried        7 0.00648 0.405 0.00263
##  4 adults    boring        14 0.00226 1.10  0.00249
##  5 adults    itch          14 0.00226 1.10  0.00249
##  6 adults    unusual       13 0.00210 1.10  0.00231
##  7 adults    negative      11 0.00178 1.10  0.00195
##  8 teenage   damaged        5 0.00463 0.405 0.00188
##  9 elderly   intense        4 0.00457 0.405 0.00185
## 10 elderly   overpriced     4 0.00457 0.405 0.00185
## # ℹ 1,115 more rows

top10 <- frequency_neg %>%
  group_by(age_group) %>%
  slice_max(tf_idf, n = 10, with_ties = F)

# Order the facets from youngest to oldest
top10$age_group <- factor(top10$age_group,
                          levels = c("teenage", "adults", "elderly"))

# Create a bar graph
ggplot(top10, aes(x = reorder_within(word, tf_idf, age_group),
                  y = tf_idf,
                  fill = age_group)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ age_group, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL, title = 'Top10 TF-IDF score by age group - Negative review')

However, I felt that analyzing single-word TF-IDF had its limitations, so I added a step that visualizes TF-IDF scores for bigrams. I believe this approach will make it easier to see which aspects each age group prioritizes more—relative to other cohorts—than the single-word analysis alone could reveal. I also ran the bigram TF-IDF analysis separately for the positive and negative review sets.

bigram_united <- raw_review %>%
  unnest_tokens(bigram, Review_Text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(word1 %in% bing$word) %>%
  filter(word2 %in% bing$word) %>%
  unite(bigram, word1, word2, sep = " ")
bigram_united

## # A tibble: 6,052 × 11
##     ...1 Clothing_ID   Age Title   Rating Recommended_IND Positive_Feedback_Co…¹
##    <dbl>       <dbl> <dbl> <chr>    <dbl>           <dbl>                  <dbl>
##  1     3        1049    50 My fav…      5               1                      0
##  2     3        1049    50 My fav…      5               1                      0
##  3     3        1049    50 My fav…      5               1                      0
##  4    10        1077    53 Dress …      3               0                     14
##  5    15        1065    47 Nice, …      4               1                      3
##  6    17         853    41 Looks …      5               1                      0
##  7    18        1120    32 Super …      5               1                      0
##  8    18        1120    32 Super …      5               1                      0
##  9    20         847    33 Cute, …      4               1                      2
## 10    31        1060    46 Cuter …      5               1                      7
## # ℹ 6,042 more rows
## # ℹ abbreviated name: ¹Positive_Feedback_Count
## # ℹ 4 more variables: Division_Name <chr>, Department_Name <chr>,
## #   Class_Name <chr>, bigram <chr>
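Requiring both words of a bigram to be in the sentiment lexicon is a strict filter, which helps explain why only 6,052 bigrams remain. The base R sketch below illustrates the mechanics on one made-up sentence with a made-up four-word lexicon; it is a simplified stand-in for unnest_tokens(token = "ngrams") and the filters above, not the same implementation.

```r
# Made-up review text and a made-up stand-in for the sentiment lexicon
text        <- "this top is super soft and super cute"
toy_lexicon <- c("top", "super", "soft", "cute")

# Split into words, then pair each word with the one that follows it
words  <- strsplit(tolower(text), "\\s+")[[1]]
first  <- head(words, -1)
second <- tail(words, -1)

# Keep a bigram only when BOTH of its words are in the lexicon
keep    <- first %in% toy_lexicon & second %in% toy_lexicon
bigrams <- paste(first, second)[keep]
bigrams
```

Of the seven bigrams in the sentence only "super soft" and "super cute" survive; pairs such as "this top" are dropped because one of their words is not in the lexicon.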

bigram_sentiment <- bigram_united %>%
  mutate(
    sentiment = case_when(
      Rating == 5 ~ 'positive',
      Rating < 4 ~ 'negative',
      TRUE ~ 'neutral'
    )
  ) %>%
  mutate(
    age_group = case_when( 
      Age < 30 ~ 'teenage',
      Age < 60 ~ 'adults',
      TRUE ~ 'elderly'
    ) 
  )
bigram_sentiment

## # A tibble: 6,052 × 13
##     ...1 Clothing_ID   Age Title   Rating Recommended_IND Positive_Feedback_Co…¹
##    <dbl>       <dbl> <dbl> <chr>    <dbl>           <dbl>                  <dbl>
##  1     3        1049    50 My fav…      5               1                      0
##  2     3        1049    50 My fav…      5               1                      0
##  3     3        1049    50 My fav…      5               1                      0
##  4    10        1077    53 Dress …      3               0                     14
##  5    15        1065    47 Nice, …      4               1                      3
##  6    17         853    41 Looks …      5               1                      0
##  7    18        1120    32 Super …      5               1                      0
##  8    18        1120    32 Super …      5               1                      0
##  9    20         847    33 Cute, …      4               1                      2
## 10    31        1060    46 Cuter …      5               1                      7
## # ℹ 6,042 more rows
## # ℹ abbreviated name: ¹Positive_Feedback_Count
## # ℹ 6 more variables: Division_Name <chr>, Department_Name <chr>,
## #   Class_Name <chr>, bigram <chr>, sentiment <chr>, age_group <chr>

When I examine the bigram TF-IDF results, the priorities of each age group become much clearer.

Among teenagers, phrases like perfect cool, perfect beautiful, and fabulous fall dominate, reflecting their enthusiasm for items that feel “perfect,” “cool,” or eye-catching. Bigrams such as cheap love and bust smallish also appear, showing that this group cares about price advantages and precise fit.

In contrast, adults favor combinations like gorgeous top, comfortable easy, and soft comfy, where adjectives that highlight comfort and practicality pair with functional items (tops, tank tops). This pattern suggests that adults value wearability and versatility over pure design flair.

For the elderly segment, expressions such as wow beautiful, classic love, beautifully soft, and soft comfy score highest, emphasizing both the softness of the fabric and an elegant, classic image.

In short, the bigram analysis reveals that teenagers prioritize trendy design and price, adults focus on comfort and everyday utility, and consumers aged 60 and above place the highest value on soft materials and an overall sense of refinement.

bigram_tfidf_pos <- bigram_sentiment %>%
  filter(sentiment == 'positive') %>%
  count(bigram, age_group) %>%
  bind_tf_idf(bigram, age_group, n) %>%
  arrange(desc(tf_idf))
bigram_tfidf_pos

## # A tibble: 1,632 × 6
##    bigram                 age_group     n      tf   idf  tf_idf
##    <chr>                  <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 fabulous fall          teenage       2 0.00422  1.10 0.00464
##  2 perfect beautiful      teenage       2 0.00422  1.10 0.00464
##  3 perfect cool           teenage       2 0.00422  1.10 0.00464
##  4 beautifully love       elderly       2 0.00376  1.10 0.00413
##  5 beautifully soft       elderly       2 0.00376  1.10 0.00413
##  6 classic love           elderly       2 0.00376  1.10 0.00413
##  7 comfortable attractive elderly       2 0.00376  1.10 0.00413
##  8 cute cute              elderly       2 0.00376  1.10 0.00413
##  9 soft soft              elderly       2 0.00376  1.10 0.00413
## 10 wow beautiful          elderly       2 0.00376  1.10 0.00413
## # ℹ 1,622 more rows

bigram_tfidf_pos %>%
  arrange(desc(tf_idf)) %>%
  group_by(age_group) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(bigram = reorder_within(bigram, tf_idf, age_group)) %>%
  ggplot(aes(tf_idf, bigram, fill = age_group)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~ age_group, ncol = 2, scales = "free") +
  labs(x = "tf-idf of bigram", y = NULL)

For the negative-review bigram analysis, it proved difficult to assign clear positive or negative sentiment to each bigram, and the sample size was smaller than in previous analyses, making the results less conclusive. I expect that collecting a larger volume of negative reviews will improve the clarity and reliability of the bigram findings.

bigram_tfidf_neg <- bigram_sentiment %>%
  filter(sentiment == 'negative') %>%
  count(bigram, age_group) %>%
  bind_tf_idf(bigram, age_group, n) %>%
  arrange(desc(tf_idf))
bigram_tfidf_neg

## # A tibble: 616 × 6
##    bigram          age_group     n      tf   idf tf_idf
##    <chr>           <chr>     <int>   <dbl> <dbl>  <dbl>
##  1 pretty top      adults       16 0.0221  1.10  0.0243
##  2 pretty poor     teenage       2 0.0175  1.10  0.0193
##  3 super soft      adults       26 0.0359  0.405 0.0146
##  4 love love       teenage       4 0.0351  0.405 0.0142
##  5 attractive love elderly       1 0.00980 1.10  0.0108
##  6 attractive top  elderly       1 0.00980 1.10  0.0108
##  7 bad luck        elderly       1 0.00980 1.10  0.0108
##  8 bad poor        elderly       1 0.00980 1.10  0.0108
##  9 beautiful crisp elderly       1 0.00980 1.10  0.0108
## 10 coarse itchy    elderly       1 0.00980 1.10  0.0108
## # ℹ 606 more rows
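Many of the rows above rest on a single occurrence (n = 1), which is one reason the negative bigram results are hard to read. One possible mitigation, shown here only as a sketch, is to require a minimum count before ranking. The rows are copied from the table above; the threshold of 3 is an arbitrary assumption.

```r
# A few rows copied from the negative-review bigram table above
neg_counts <- data.frame(
  bigram    = c("pretty top", "pretty poor", "super soft",
                "love love", "bad luck", "coarse itchy"),
  age_group = c("adults", "teenage", "adults",
                "teenage", "elderly", "elderly"),
  n         = c(16, 2, 26, 4, 1, 1)
)

min_n <- 3  # hypothetical threshold, chosen only for illustration

# Keep bigrams that occur at least min_n times within their age group
kept <- subset(neg_counts, n >= min_n)
kept
```

With this threshold no elderly bigram from the sample survives, which makes the small-sample limitation explicit instead of letting one-off phrases lead the chart.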

bigram_tfidf_neg %>%
  arrange(desc(tf_idf)) %>%
  group_by(age_group) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(bigram = reorder_within(bigram, tf_idf, age_group)) %>%
  ggplot(aes(tf_idf, bigram, fill = age_group)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~ age_group, ncol = 2, scales = "free") +
  labs(x = "tf-idf of bigram", y = NULL)

Automated Text Analysis Final Project: What do women want?

20204603 Jiho Kim

2025-06-17