Executive summary

This project explores what can be learned from the women’s e-commerce clothing reviews dataset sourced from Kaggle. The analysis focuses on three aspects: word frequency, sentiment analysis, and network analysis.

- Word frequency: the analysis examines individual words in customer reviews and identifies the top words associated with each rating category. This indicates which aspects customers consider when rating clothing items.
- Sentiment analysis: positive and negative words in the reviews are identified, giving clothing sellers suggestions about aspects that generate positive feedback and potential areas of improvement.
- Network analysis: a network graph is constructed from frequent consecutive word pairs (bigrams) in the reviews. The visualization gives an overview of the relationships between words, highlighting important connections and community groups.

Data background

The dataset used for this analysis is the Women’s Clothing E-Commerce dataset, sourced from Kaggle (https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews). It comprises customer reviews and additional feature variables associated with those reviews. The dataset consists of 23,486 rows, with each row representing a customer review. There are 10 feature variables provided:

  1. Clothing ID: An integer categorical variable that corresponds to the specific clothing item being reviewed.
  2. Age: A positive integer variable indicating the age of the reviewer.
  3. Title: A string variable denoting the title of the review.
  4. Review Text: A string variable containing the body of the review.
  5. Rating: A positive ordinal integer variable representing the product score given by the customer, ranging from 1 (worst) to 5 (best).
  6. Recommended IND: A binary variable indicating whether the customer recommends the product (1 for recommended, 0 for not recommended).
  7. Positive Feedback Count: A non-negative integer documenting the number of other customers who found the review helpful or positive.
  8. Division Name: A categorical variable indicating the high-level division of the product.
  9. Department Name: A categorical variable specifying the department name of the product.
  10. Class Name: A categorical variable indicating the class name of the product.

Data loading, cleaning and preprocessing

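Prior to constructing the corpus, the data is cleaned. The review text is converted to ASCII encoding and any numerical values in it are removed. Rows with a missing value in any column are then dropped with na.omit(), which reduces the dataset from 23,486 to 19,632 rows. Because iconv() by default returns NA for text it cannot convert, this step also removes reviews containing non-ASCII characters. Let’s take a preliminary look at the cleaned dataset: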

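# Packages used in this step: tidyverse (readr, dplyr) and tm (removeNumbers)
library(tidyverse)
library(tm)

clothing <- read_csv("data/Womens Clothing E-Commerce Reviews.csv") %>%
    # convert the review text to ASCII; non-convertible reviews become NA
    mutate(review = iconv(`Review Text`, to = "ASCII")) %>%
    rename(clothing_id = `Clothing ID`, rating = Rating) %>%
    # strip digits from the review text
    mutate(review = removeNumbers(review)) %>%
    # drop rows with a missing value in any column
    na.omit() %>%
    mutate(review_id = row_number()) %>%
    data.frame()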

str(clothing)
## 'data.frame':    19632 obs. of  13 variables:
##  $ ...1                   : num  2 3 4 5 6 7 8 9 10 12 ...
##  $ clothing_id            : num  1077 1049 847 1080 858 ...
##  $ Age                    : num  60 50 47 49 39 39 24 34 53 53 ...
##  $ Title                  : chr  "Some major design flaws" "My favorite buy!" "Flattering shirt" "Not for the very petite" ...
##  $ Review.Text            : chr  "I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small "| __truncated__ "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!" "This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leg"| __truncated__ "I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually we"| __truncated__ ...
##  $ rating                 : num  3 5 5 2 5 4 5 5 3 5 ...
##  $ Recommended.IND        : num  0 1 1 0 1 1 1 1 0 1 ...
##  $ Positive.Feedback.Count: num  0 0 6 4 1 4 0 0 14 2 ...
##  $ Division.Name          : chr  "General" "General Petite" "General" "General" ...
##  $ Department.Name        : chr  "Dresses" "Bottoms" "Tops" "Dresses" ...
##  $ Class.Name             : chr  "Dresses" "Pants" "Blouses" "Dresses" ...
##  $ review                 : chr  "I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small "| __truncated__ "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!" "This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leg"| __truncated__ "I love tracy reese dresses, but this one is not for the very petite. i am just under  feet tall and usually wea"| __truncated__ ...
##  $ review_id              : int  1 2 3 4 5 6 7 8 9 10 ...
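
As a small illustration of two of the cleaning steps above, the sketch below uses a made-up three-row data frame (the toy object and its contents are assumptions, not rows from the dataset). It uses base R only, with gsub() standing in for removeNumbers(), to show why removing a digit leaves a double space (as in "just under  feet tall" in the output above) and how na.omit() drops every row that has a missing value in any column.

```r
# Toy data: hypothetical rows, not taken from the Kaggle dataset
toy <- data.frame(
  Title  = c("Not for the very petite", NA, "Great"),
  review = c("i am just under 5 feet tall", "runs small", "love it"),
  stringsAsFactors = FALSE
)

# Remove digits; gsub() is used here as a stand-in for tm::removeNumbers()
toy$review <- gsub("[[:digit:]]+", "", toy$review)

# na.omit() drops any row with an NA in any column (row 2 has no Title)
toy_clean <- na.omit(toy)

print(toy_clean$review)
print(nrow(toy_clean))
```

In this toy example a review with complete text but no title is removed. Restricting the missing-value check to the review column would be one way to retain such rows, if that were preferred.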

To identify more relevant n-grams, the text is first tokenized into individual words. Stop words are then removed and the remaining words are recombined into a single string per review. Note that reviews with three or fewer remaining words, or with 250 or more, are excluded from the analysis.

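library(tidytext)

clothing_sub <- clothing %>% 
    select(review_id, review, rating)

# tokenize, drop stop words, then rebuild one cleaned string per review
review_remove_stopwords <- clothing_sub %>%
  unnest_tokens(output = word, input = review) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(review_id) %>%
  summarize(text_clean = paste(word, collapse = " "),
            totalwords = n()) %>%
  # keep reviews with more than 3 and fewer than 250 remaining words
  filter(totalwords > 3, totalwords < 250)

# attach the cleaned text to the other review columns
clothing_cleaned <- clothing %>% 
    select(-review) %>%
    inner_join(review_remove_stopwords, by = "review_id") %>%
    na.omit()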
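
To see how the length filter interacts with stop word removal, here is a minimal sketch on two made-up reviews. The toy_reviews and toy_stop_words objects are hypothetical; a tiny hand-written stop word list replaces the stop_words dataset so the result is easy to verify by eye.

```r
library(dplyr)
library(tidytext)

# Toy reviews and a tiny stop word list (both hypothetical)
toy_reviews <- tibble(
  review_id = c(1, 2),
  review = c("I love this dress",
             "The fabric is soft and the fit is perfect")
)
toy_stop_words <- tibble(word = c("i", "this", "the", "is", "and"))

toy_clean <- toy_reviews %>%
  unnest_tokens(output = word, input = review) %>%  # lowercases by default
  anti_join(toy_stop_words, by = "word") %>%        # drop stop words
  group_by(review_id) %>%
  summarize(text_clean = paste(word, collapse = " "),
            totalwords = n()) %>%
  filter(totalwords > 3, totalwords < 250)          # same thresholds as above

print(toy_clean)
```

The first review keeps only two words after stop word removal and is therefore excluded, while the second keeps four and is retained. The word count is taken after stop words are removed, not on the raw review.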

Text data analysis

Individual analysis and figures

Analysis of Word Frequency

For this part, word frequency is summarized in a bar plot. Bar plots present the information in an easily interpretable format and allow word frequencies to be compared visually.

We start with the cleaned dataset and tokenize the review text into individual words, since this part focuses on single words in the customer reviews. The frequency of each word is then counted within each rating category and sorted in descending order. Next, the TF-IDF (Term Frequency-Inverse Document Frequency) value is calculated for each word and rating combination, treating each rating category as one document.

The resulting data frame is grouped by rating, and for each rating the eight words with the highest TF-IDF values are selected (tied values are all kept). A filter then removes rows where the rating is 1 or 2 and the word occurs exactly twice. This is done to exclude cases where the TF-IDF values may be inflated due to a small number of occurrences.

Finally, a bar plot is created, where the x-axis represents the TF-IDF values, the y-axis represents the words (reordered based on TF-IDF), and the fill color represents the rating. The plot is faceted by rating, with two columns, and a light-themed visual style is applied.

# create one-grams and compute tf-idf, treating each rating as a document
tf_idf <- clothing_cleaned %>%
  unnest_tokens(output = word, input = text_clean) %>%
  count(rating, word, sort = TRUE) %>%
  bind_tf_idf(word, rating, n)

# plot the top eight words per rating by tf-idf
tf_idf %>%
  group_by(rating) %>%
  slice_max(tf_idf, n = 8) %>%
  ungroup() %>%
  # drop rating 1 and rating 2 words that occur exactly twice
  filter(!(rating == 1 & n == 2), !(rating == 2 & n == 2)) %>% 
  ggplot(aes(tf_idf, reorder(word, tf_idf), fill = factor(rating))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~rating, ncol = 2, scales = "free") +
  theme_light() +
  labs(title = "Frequency plot of one-grams", x = "tf-idf", y = NULL)

From the plot above we can see that for reviews rated 5, words such as ‘penny,’ ‘subtle,’ and ‘cozy’ have high TF-IDF values, meaning they are characteristic of that rating rather than simply frequent overall. This suggests that clothing items that are comfortable to wear and affordable may be more likely to receive a perfect rating of 5 out of 5. This observation is further supported by the reviews rated 4, where ‘subtle,’ ‘cozy,’ and ‘cooler’ emerge as high TF-IDF words. Additionally, the 5-rated reviews suggest that clothes that pair well with booties are especially well regarded, while the 3-rated reviews indicate that overstyled clothing may lead to a lower rating.
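
To make the TF-IDF step concrete, the sketch below applies bind_tf_idf() to a hypothetical table of word counts (the toy_counts values are invented for illustration) and checks one value by hand. It assumes, as in the analysis above, that each rating is treated as one document.

```r
library(dplyr)
library(tidytext)

# Toy word counts per rating (hypothetical numbers)
toy_counts <- tibble(
  rating = c(5, 5, 1, 1),
  word   = c("cozy", "dress", "cheap", "dress"),
  n      = c(3, 1, 2, 2)
)

# Each rating is treated as one document
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, rating, n)

# Manual check for "cozy" in rating 5:
# tf  = 3 / 4       (share of all word occurrences in that rating)
# idf = log(2 / 1)  (natural log of ratings / ratings containing the word)
manual_cozy <- (3 / 4) * log(2 / 1)

print(toy_tf_idf)
print(manual_cozy)
```

A word that appears in every rating, like "dress" in this toy table, gets an IDF of zero and therefore a TF-IDF of zero. This is why very common clothing words do not surface in a TF-IDF ranking even when they are frequent.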

Analysis of Sentiments

Sentiment analysis is increasingly important in business and society: it poses many research challenges but promises insight useful for opinion analysis and social media analysis. We therefore conduct a sentiment analysis of the women’s clothing reviews.

We start with the cleaned dataset and extract the clothing ID and the cleaned review text. The data is grouped by clothing ID and summarized by concatenating the review texts and counting the total number of reviews for each clothing item. Only clothing items with more than 10 reviews are considered.

The review column is then tokenized into individual words, and the tokenized data is joined with the sentiment lexicon “bing”. The sentiment lexicon provides information about positive and negative sentiments associated with each word. The joined data is then counted by word and sentiment, and reshaped into a matrix format using acast. This matrix represents the frequency of each word occurring in different sentiment categories.

Before plotting the word cloud, a random seed is set to ensure reproducibility. Finally, a comparison cloud is generated, where the words are colored based on sentiment, with negative sentiment displayed in lightcoral and positive sentiment displayed in steelblue. The maximum number of words displayed in the cloud is set to 100.

library(reshape2)   # acast
library(wordcloud)  # comparison.cloud

# one row per word, for clothing items with more than 10 reviews
clothing_sentiment <- clothing_cleaned %>%
    select(clothing_id, text_clean) %>%
    group_by(clothing_id) %>%
    summarize(text = paste(text_clean, collapse = " "), total_reviews = n()) %>%
    filter(total_reviews > 10) %>%
    unnest_tokens(word, text)

set.seed(123)  # makes the word cloud layout reproducible
clothing_sentiment %>%
    # keep only words that appear in the bing lexicon
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(word, sentiment, sort = TRUE) %>%
    # reshape to a word-by-sentiment count matrix
    acast(word ~ sentiment, value.var = "n", fill = 0) %>%
    comparison.cloud(colors = c("lightcoral", "steelblue"), 
                     title.bg.colors = c("lightcoral", "lightsteelblue"), 
                     title.colors = c("black", "black"), max.words = 100)

The resulting comparison cloud reveals the sentiments expressed in customer reviews. We can observe positive words such as ‘soft’ and ‘comfortable’, contrasting with negative words like ‘loose’ and ‘worn’. This insight can offer suggestions to clothing sellers: soft and comfortable items tend to be well received and positively commented on by customers. On the other hand, clothing with a relaxed design might generate polarized reviews, indicating that it may not appeal to all customers equally. Note that the lexicon labels each word without context, so a word like ‘loose’ or ‘worn’ is counted as negative even when a reviewer uses it neutrally or approvingly. Sellers should take this into consideration if they aim to achieve a higher average rating and cater to customer preferences.
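
The join-and-reshape step behind the comparison cloud can be sketched on a handful of tokens. The toy_words vector below is hypothetical. The example assumes that "sleeve" is not in the bing lexicon, and relies on the labels reported above for "soft", "comfortable" and "loose".

```r
library(dplyr)
library(tidytext)
library(reshape2)

# Toy tokens (hypothetical); "sleeve" is assumed not to be in the lexicon
toy_words <- tibble(
  word = c("soft", "soft", "comfortable", "loose", "sleeve")
)

toy_matrix <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # lexicon words only
  count(word, sentiment) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0)   # rows: words, columns: sentiments

print(toy_matrix)
```

Each row of the matrix is a word and each column a sentiment, which is the shape comparison.cloud() expects. Words missing from the lexicon are silently dropped by the inner join, so the cloud reflects only the part of the vocabulary that the lexicon covers.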

Network Graph

In this part, the network graph is constructed from bigrams, which results in a network centered on “frequent consecutive word pairs”. With this method the clustering of words is less clear, but the overall relationships between words can be seen.

We begin with the cleaned clothing reviews dataset and tokenize the review text into bigrams. Because stop words have already been removed from this text, a bigram here pairs words that are adjacent in the cleaned text. Each bigram is separated into two individual words, and any pair containing a stop word or the irrelevant token “br” is filtered out. Then we count the occurrences of each word pair and remove any rows with missing values. After data cleaning, we create a network graph object from the bigram counts, keeping only bigrams that occur at least 100 times, then calculate degree centrality and assign community groups using the Infomap algorithm. Lastly, the network graph is plotted: the edges between nodes are drawn as gray lines, and the nodes are drawn as points sized by their centrality. The color of each node represents its community group, and each node is labeled with its word.

library(tidygraph)
library(ggraph)

# tokenize the cleaned text into bigrams
bigram_review <- clothing_cleaned %>%
  unnest_tokens(input = text_clean, 
                output = bigram, token = "ngrams", n = 2)

# split each bigram and drop pairs containing "br" or a stop word
bigram_separated <- bigram_review %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% c("br", stop_words$word),
         !word2 %in% c("br", stop_words$word))

# count each word pair
pair_bigram <- bigram_separated %>%
  count(word1, word2, sort = TRUE) %>%
  na.omit()

# Creating a network graph data
graph_bigram <- pair_bigram %>%
  filter(n >= 100) %>%                        # keep frequent pairs only
  as_tbl_graph(directed = FALSE) %>%
  mutate(centrality = centrality_degree(),    # centrality
         group = as.factor(group_infomap()))  # community

# Creating a network graph
set.seed(1234)
ggraph(graph_bigram, layout = "fr") +
  geom_edge_link(color = "gray50",             # edge color
                 alpha = 0.5) +                # edge transparency
  geom_node_point(aes(size = centrality/10,    # node size
                      color = group),          # node color
                  alpha = 0.5,                 # constant, so set outside aes()
                  show.legend = FALSE) +       # legend removal
  scale_size(range = c(5, 10)) +               # range of node size
  geom_node_text(aes(label = name), repel = TRUE, size = 2) +
  theme_graph()

The network displayed above largely recovers common phrases from the reviews, such as normal/larger/regular/medium size and fit perfectly.
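
To show how the node sizes in such a graph arise, here is a minimal sketch that builds the same kind of graph object from a hypothetical table of bigram counts (the toy_pairs values are invented) and inspects the resulting node table.

```r
library(dplyr)
library(tidygraph)

# Toy bigram counts (hypothetical numbers)
toy_pairs <- tibble(
  word1 = c("normal", "larger", "regular", "fit"),
  word2 = c("size", "size", "size", "perfectly"),
  n     = c(150, 120, 110, 200)
)

toy_graph <- toy_pairs %>%
  filter(n >= 100) %>%
  as_tbl_graph(directed = FALSE) %>%          # first two columns define the edges
  mutate(centrality = centrality_degree(),    # number of neighbours per word
         group = as.factor(group_infomap()))  # community label per word

# Node table with columns name, centrality and group
toy_nodes <- as_tibble(toy_graph)
print(toy_nodes)
```

In this toy graph "size" is linked to three other words and so has the highest degree centrality, which is why hub words like it are drawn as the largest nodes. Degree centrality counts distinct neighbours only and ignores how often each bigram occurs.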