This analysis examines a dataset of women's e-commerce clothing reviews and ratings. The dataset contains more than 23,000 online reviews of women's clothing from various retailers.
Let us set up the working environment before starting the analysis.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, readr, purrr,
tidyverse, tidytext, stringr, igraph,
wordcloud2, ggraph, topicmodels,
here, rio, htmlwidgets, webshot)
# Custom plot theme reused in the figures below
theme = theme_bw() +
theme(plot.title = element_text(face = "bold", size = (15)),
plot.subtitle = element_text(size = (10)),
axis.title = element_text(size = (10))) +
theme(axis.text.x = element_text(angle = 0), legend.position = "none")
The dataset comes from kaggle.com and contains 23,486 entries. Each entry records the customer's age, the review title and text, the rating, whether the item is recommended, a "Liked" (positive feedback) count, and the division, department, and class of the item reviewed.
df.1 = read_csv("DATA.csv")
df.1 = df.1[, -1]
colnames(df.1) = c("ID", "Age", "Title", "Review", "Rating",
"Recommend", "Liked", "Division", "Dept", "Class")
dim(df.1)
## [1] 23486 10
The Title column contains the most missing values. For the further analysis, we only filter out the rows with a missing Division.
## ID Age Title Review Rating Recommend Liked Division
## 0 0 3810 845 0 0 0 14
## Dept Class
## 14 14
## ID Age Title Review Rating Recommend Liked Division
## 0 0 3809 844 0 0 0 0
## Dept Class
## 0 0
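The code that produced these counts and the filtered data frame df.2 used below is not shown; a minimal sketch, assuming the filter simply drops the rows with a missing Division:
# Count missing values per column, drop the rows with a missing Division (assumed),
# and recount to confirm the remaining NAs are only in Title and Review
colSums(is.na(df.1))
df.2 = df.1 %>% filter(!is.na(Division))
colSums(is.na(df.2))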
# Note: table() on a piped vector names its category column ".", referenced as x = . below
temp = df.2$Dept %>% table() %>% prop.table() %>% data.frame()
ggplot(data = temp, aes(x = ., y = Freq*100)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of Reviews by Department",
x = "Department Name",
y = "Percentage of Reviews (%)") +
geom_text(aes(label = round(Freq*100, 2)),
vjust = -0.30, size = 4) +
theme
temp = df.2 %>%
  mutate(Dept = factor(Dept)) %>%
  group_by(Dept) %>%
  count(Rating) %>%
  mutate(perc = n/sum(n))
ggplot(data = temp,
aes(x = Rating, y = perc*100, fill = Dept)) +
geom_bar(stat = "identity") +
facet_wrap(~Dept) +
labs(title = "Percentage of Ratings by Department",
y = "Percentage of Ratings (%)") +
geom_text(aes(label = round(perc*100, 2)),
vjust = -0.30, size = 2.5) +
theme
temp = df.2 %>%
  select(ID, Age, Dept) %>%
  mutate(Age.group = ifelse(Age < 30, "18-29",
                     ifelse(Age < 40, "30-39",
                     ifelse(Age < 50, "40-49",
                     ifelse(Age < 60, "50-59",
                     ifelse(Age < 70, "60-69", "70-99")))))) %>%
  mutate(Age.group = factor(Age.group), Dept = factor(Dept))
temp = temp %>% group_by(Age.group) %>% count(Dept)
ggplot(data = temp,
aes(x = Dept, y = n, fill = Age.group)) +
geom_bar(stat = "identity") +
facet_wrap(~Age.group, scales = "free") +
labs(title = "Department by Age",
x = "Department Name",
y = "Number of Reviews") +
geom_text(aes(label = n),
hjust = 0.7, size = 2.5) +
coord_flip() +
theme
To do a bigram analysis, we remove the entries with missing reviews. There are 845 missing values in Review, which is only 845/23,486 ≈ 3.6% of the data, so dropping them should not affect the analysis. We also combine the title with the review so that all text content is in one variable.
temp = df.2 %>% filter(!is.na(Review))
n.title = temp %>% filter(is.na(Title)) %>% select(-Title)
y.title = temp %>% filter(!is.na(Title)) %>% unite(Review, c(Title, Review), sep = " ")
df.3 = bind_rows(n.title, y.title)
dim(df.3)
## [1] 22628 9
We prepare the text for content analysis. First, we tokenize the reviews into bigrams (2-grams). Second, we remove stop words and any tokens containing digits.
cbigram.1 = df.3 %>% unnest_tokens(bigram, Review, token = "ngrams", n = 2)
cbigram.sep = cbigram.1 %>% separate(bigram, c("first", "second"), sep = " ")
cbigram.2 = cbigram.sep %>% filter(!first %in% stop_words$word,
!second %in% stop_words$word,
!str_detect(first, "\\d"),
!str_detect(second, "\\d")) %>% unite(bigram, c(first, second), sep = " ")
dim(cbigram.2)
## [1] 125110 9
We group the bigrams by rating and plot the 10 most common bigrams for each rating. The most mentioned bigrams for each rating are shown below.
top.bigram = cbigram.2 %>% mutate(Rating = factor(Rating)) %>% mutate(bigram = factor(bigram)) %>% group_by(Rating) %>% count(bigram, sort = T) %>% top_n(10, n)
ggplot(data = top.bigram,
aes(x = bigram, y = n, fill = Rating)) +
geom_bar(stat = "identity") +
facet_wrap(~Rating, ncol = 2, scales = "free") +
labs(title = "Top 10 Bigrams by Rating",
x = NULL,
y = "Frequency") +
coord_flip() +
theme
We can look at the top five bigrams for the 5-star and 1-star reviews in more detail.
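The helper bigramming() used in the next chunk is not defined in the text; a minimal sketch, assuming it simply repeats the tokenizing and filtering steps applied above:
# Assumed helper: tokenize reviews into bigrams, drop stop words and digit tokens
bigramming = function(data) {
  data %>%
    unnest_tokens(bigram, Review, token = "ngrams", n = 2) %>%
    separate(bigram, c("first", "second"), sep = " ") %>%
    filter(!first %in% stop_words$word,
           !second %in% stop_words$word,
           !str_detect(first, "\\d"),
           !str_detect(second, "\\d")) %>%
    unite(bigram, c(first, second), sep = " ")
}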
five.star = df.3 %>% filter(Rating == 5)
one.star = df.3 %>% filter(Rating == 1)
five.bi = bigramming(five.star) %>% count(bigram, sort = T)
one.bi = bigramming(one.star) %>% count(bigram, sort = T)
five.bi %>% head(5)
## # A tibble: 5 x 2
## bigram n
## <chr> <int>
## 1 love love 495
## 2 fit perfectly 386
## 3 fits perfectly 379
## 4 highly recommend 337
## 5 super cute 337
one.bi %>% head(5)
## # A tibble: 5 x 2
## bigram n
## <chr> <int>
## 1 poor quality 43
## 2 cold water 14
## 3 cute top 10
## 4 beautiful dress 9
## 5 feels cheap 9
We visualize the 5-star and 1-star reviews as networks. The networks highlight the shared words within the most common bigrams.
five.graph = five.bi %>% separate(bigram, c("first", "second"), sep = " ") %>% filter(n > 75) %>% graph_from_data_frame()
one.graph = one.bi %>% separate(bigram, c("first", "second"), sep = " ") %>% filter(n > 5) %>% graph_from_data_frame()
summary(five.bi$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.111 1.000 495.000
summary(one.bi$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.209 1.000 43.000
Here is the network of popular bigrams in the 5-star reviews. Words such as gorgeous, love, beautiful, perfect, comfy, and comfortable are the focus of these 5-star reviews.
set.seed(4444)
ggraph(five.graph, layout = "fr") +
geom_edge_link() +
geom_node_point(color = "orangered1", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
labs(title = "Network of Popular Bigrams of 5-star Reviews") +
theme_void()
Next is the network of popular bigrams in the 1-star reviews. Words such as odd, weird, horrible, poor, and bad are the focus of these 1-star reviews.
set.seed(4444)
ggraph(one.graph, layout = "fr") +
geom_edge_link() +
geom_node_point(color = "orangered1", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 0) +
labs(title = "Network of Popular Bigrams of 1-star Reviews") +
theme_void()
In the 5-star ratings, the most common bigrams are ‘love’, ‘fit(s) perfectly’, and ‘fit(s) true’, which express direct satisfaction.
w.1 = five.bi %>% filter(n > 75) %>% mutate(n = sqrt(n)) %>% wordcloud2(size = 0.5)
saveWidget(w.1, "1.html", selfcontained = F)
webshot("1.html", "1.png", vwidth = 700, vheight = 500, delay = 5)In 1 stars ratings, ‘poor quality’ and ‘cold water’ are common bigrams, which could refer to the way that clothing was washed and change. The lack of durability may lead to 1 stars rating. Also, some other bigrams are ‘weird’, ‘odd’, ‘horrible’, ‘feels cheap’, ‘bad quality’, and ‘potato sack’, which sums up why the clothes purchased are not satisfied with customers.
w.2 = one.bi %>% filter(n > 5) %>% mutate(n = sqrt(n)) %>% wordcloud2(size = 0.5)
saveWidget(w.2, "2.html", selfcontained = F)
webshot("2.html", "2.png", vwidth = 700, vheight = 500, delay = 5)let us go see the 118 reviews in the trend department. We use the topic modeling approach of latent dirchlet allocation to get a sense of the key characteristics of these reviews. We fit an LDA model using Gibbs sampling. We pick k = 5 for the 5 departments of bottoms, dresses, intimates, jackets, and tops. LDA is an unsupervised clustering machine learning algorithm.
trend.count = df.3 %>% filter(Dept == "Trend") %>% unnest_tokens(word, Review) %>% anti_join(stop_words, by = "word") %>% filter(!str_detect(word, "\\d")) %>% count(ID, word, sort = T)
trend.dtm = trend.count %>% cast_dtm(ID, word, n)
trendy = tidy(LDA(trend.dtm, k = 5, method = "Gibbs",
control = list(seed = 4444, alpha = 1)),
matrix = "beta")
top.trendy = trendy %>% group_by(topic) %>% top_n(5, beta) %>% arrange(topic, desc(beta))
temp = top.trendy %>% mutate(term = reorder(term, beta))
dim(temp)
## [1] 28 3
ggplot(data = temp,
aes(x = term, y = beta, fill = factor(topic))) +
geom_bar(stat = "identity") +
facet_wrap(~topic, ncol = 3, scales = "free") +
labs(title = "LDA Analysis (k = 5)",
x = "Beta",
y = "Term") +
coord_flip() +
theme
By performing exploratory data analysis and content analysis on reviews, companies can focus on what works and what does not. Knowing who the reviewers are can inform marketing decisions. Selecting items that are easy to wash and made of comfortable fabric can lead to higher customer satisfaction. A larger number of positive reviews then becomes a form of advertisement that can eventually lead to higher sales. The key takeaways from the above analysis are the following.