Clustering documents and text data is increasingly popular in business analytics. The main idea is to organize documents into groups where observations within groups are similar to each other. There are several papers on this topic, including “Clustering of text documents by implementation of K-means algorithms” by Hardeep Singh, which demonstrates that spherical k-means outperforms standard k-means for high-dimensional text data due to its use of cosine similarity rather than Euclidean distance.
In text clustering, the distance between documents is typically measured using cosine dissimilarity:
\[J = \sum_i (1 - \cos(x_i, p_i))\]
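Here \(x_i\) is the vector for document \(i\) and \(p_i\) is the prototype (centroid) of the cluster it is assigned to. This is preferred over Euclidean distance because text data is sparse and high-dimensional: cosine compares the direction of two vectors and ignores their length, so a long review and a short review on the same theme are still treated as similar.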
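As a rough illustration of the criterion above, the following base R sketch evaluates J on three made-up term-count vectors. The data, the cluster assignment, and the helper `cosine_sim` are all hypothetical and are not part of the analysis pipeline.

```r
# Toy term-count vectors for three documents (hypothetical data).
docs <- rbind(
  d1 = c(2, 0, 1, 0),
  d2 = c(4, 0, 2, 0),   # same direction as d1, twice the length
  d3 = c(0, 3, 0, 1)
)

# Cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical cluster assignment; prototypes are means of unit-length members.
assignment <- c(1, 1, 2)
unit_docs  <- docs / sqrt(rowSums(docs^2))
prototypes <- rbind(
  colMeans(unit_docs[assignment == 1, , drop = FALSE]),
  colMeans(unit_docs[assignment == 2, , drop = FALSE])
)

# Criterion J: sum over documents of 1 - cos(x_i, p_i).
J <- sum(sapply(seq_len(nrow(docs)), function(i) {
  1 - cosine_sim(docs[i, ], prototypes[assignment[i], ])
}))

cos_d1_d2 <- cosine_sim(docs["d1", ], docs["d2", ])       # same direction
euc_d1_d2 <- sqrt(sum((docs["d1", ] - docs["d2", ])^2))   # clearly above zero
c(cosine = cos_d1_d2, euclidean = euc_d1_d2, J = J)
```

In this toy case d1 and d2 point in the same direction, so their cosine similarity is 1 (up to floating point) even though their Euclidean distance is not zero, and J is essentially zero because every cluster contains a single direction.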
Latent Dirichlet Allocation is a generative probabilistic model for discovering latent topics in document collections. For each document, the model calculates the probability of belonging to each topic. This allows us to reduce thousands of word dimensions into a small number of interpretable topic dimensions.
Association rule mining discovers interesting relationships between items. Applied to text analysis, one can find which topics frequently co-occur and which topic combinations predict business outcomes like negative reviews or low ratings.
Business Question: What drives negative employee reviews on Glassdoor, and what actionable patterns can HR teams use to improve employee satisfaction?
Methods: LDA topic modelling, spherical k-means clustering, and association rule mining
Dataset: Glassdoor employee reviews from 2020 (~81,000 reviews after cleaning)
The dataset contains Glassdoor job reviews with employee ratings, pros, cons, and other metadata. The original dataset has 838,566 reviews spanning multiple years.
reviews_raw <- read_csv(here("data", "raw", "glassdoor-job-reviews.csv"))
dim(reviews_raw)
## [1] 838566 18
glimpse(reviews_raw)
## Rows: 838,566
## Columns: 18
## $ firm <chr> "AFH-Wealth-Management", "AFH-Wealth-Management", …
## $ date_review <date> 2015-04-05, 2015-12-11, 2016-01-28, 2016-04-16, 2…
## $ job_title <chr> NA, "Office Administrator", "Office Administrator"…
## $ current <chr> "Current Employee", "Current Employee, more than 1…
## $ location <chr> NA, "Bromsgrove, England, England", "Bromsgrove, E…
## $ overall_rating <dbl> 2, 2, 1, 5, 1, 3, 1, 5, 4, 1, 1, 1, 4, 1, 5, 5, 5,…
## $ work_life_balance <dbl> 4, 3, 1, 2, 2, 4, 1, 5, 4, 1, 3, 2, 5, 4, 4, 5, NA…
## $ culture_values <dbl> 3, 1, 1, 3, 1, 2, 1, 5, 4, 1, 1, 1, 4, 1, 4, 5, NA…
## $ diversity_inclusion <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ career_opp <dbl> 2, 2, 1, 2, 2, 2, 1, 5, 4, 1, 2, 2, 5, 2, 5, 4, NA…
## $ comp_benefits <dbl> 3, 1, 1, 2, 1, 3, 1, 4, 4, 3, 1, 2, 4, 4, 5, 3, NA…
## $ senior_mgmt <dbl> 3, 4, 1, 3, 1, 2, 1, 5, 4, 1, 1, 1, 4, 1, 5, 4, NA…
## $ recommend <chr> "x", "x", "x", "x", "x", "o", "x", "v", "v", "x", …
## $ ceo_approv <chr> "o", "o", "o", "o", "o", "r", "o", "o", "o", "x", …
## $ outlook <chr> "r", "r", "x", "r", "x", "r", "r", "v", "v", "x", …
## $ headline <chr> "Young colleagues, poor micro management", "Excell…
## $ pros <chr> "Very friendly and welcoming to new staff. Easy go…
## $ cons <chr> "Poor salaries, poor training and communication.",…
reviews_raw |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
mutate(pct_missing = round(n_missing / nrow(reviews_raw) * 100, 2)) |>
arrange(desc(n_missing)) |>
head(10)
## # A tibble: 10 × 3
## column n_missing pct_missing
## <chr> <int> <dbl>
## 1 diversity_inclusion 702500 83.8
## 2 location 297338 35.5
## 3 culture_values 191373 22.8
## 4 senior_mgmt 155876 18.6
## 5 comp_benefits 150082 17.9
## 6 work_life_balance 149894 17.9
## 7 career_opp 147501 17.6
## 8 job_title 79065 9.43
## 9 headline 2219 0.26
## 10 cons 8 0
The diversity_inclusion column has 83.8% missing values
and will be dropped.
Following standard text preprocessing practices, the diversity_inclusion column (83.8% NA) is dropped and incomplete or very short reviews are removed:
rating_cols <- c("overall_rating", "work_life_balance", "culture_values",
"career_opp", "comp_benefits", "senior_mgmt")
reviews_2020 <- reviews_raw |>
filter(date_review >= as.Date("2020-01-01"),
date_review < as.Date("2021-01-01")) |>
select(-diversity_inclusion) |>
drop_na(all_of(rating_cols), headline, pros, cons, location, job_title) |>
filter(nchar(pros) >= 20, nchar(cons) >= 20) |>
slice_sample(prop = 1) |>
mutate(id = row_number(), .before = 1)
# Text cleaning
reviews_2020 <- reviews_2020 |>
mutate(across(c(pros, cons, headline), ~ .x |>
str_remove_all("[\r\n\t]") |>
str_to_lower() |>
str_remove_all("[[:punct:]]") |>
str_remove_all("[0-9]") |>
str_squish()))
cat(sprintf("Clean dataset: %s rows\n", format(nrow(reviews_2020), big.mark = ",")))
## Clean dataset: 81,034 rows
reviews_2020 |>
select(all_of(rating_cols)) |>
pivot_longer(everything(), names_to = "rating_type", values_to = "rating_value") |>
dplyr::count(rating_type, rating_value) |>
group_by(rating_type) |>
mutate(percentage = n / sum(n)) |>
ggplot(aes(x = rating_value, y = percentage)) +
geom_col(fill = "steelblue") +
facet_wrap(~rating_type) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Distribution of Ratings", x = "Rating", y = "Percentage") +
theme_minimal()
The rating distributions are skewed toward the high end of the scale
across all categories (a left skew in statistical terms), with ratings
of 4 and 5 being most common. Culture values and work-life balance show
the strongest concentration at the top (38% and 30% at rating 5
respectively), while senior management ratings are more evenly
distributed, suggesting this is a more contentious area. Overall ratings
cluster around 4-5 (67% combined), indicating generally positive reviews
in the dataset, though this may reflect Glassdoor’s selection bias
toward engaged employees.
combined_tokens <- bind_rows(
reviews_2020 |> select(id, text = pros) |> mutate(source = "pros"),
reviews_2020 |> select(id, text = cons) |> mutate(source = "cons"),
reviews_2020 |> select(id, text = headline) |> mutate(source = "headline")
) |>
unnest_tokens(word, text) |>
anti_join(stop_words, by = "word") |>
mutate(word_lemma = lemmatize_words(word))
combined_tokens |> dplyr::count(source)
## # A tibble: 3 × 2
## source n
## <chr> <int>
## 1 cons 553093
## 2 headline 143757
## 3 pros 470033
TF-IDF (Term Frequency - Inverse Document Frequency) weights each word by its importance: a word scores highly when it is frequent in a document but rare across the collection.
doc_tokens <- combined_tokens |>
group_by(id) |>
summarise(tokens = list(word_lemma)) |>
pull(tokens, name = id)
vocab_iterator <- itoken(doc_tokens, progressbar = FALSE)
vocabulary <- create_vocabulary(vocab_iterator)
vectoriser <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(vocab_iterator, vectoriser)
cat(sprintf("DTM dimensions: %d documents × %d terms\n", nrow(dtm), ncol(dtm)))
## DTM dimensions: 80991 documents × 51755 terms
tfidf_model <- TfIdf$new()
tfidf_matrix <- fit_transform(dtm, tfidf_model)
cat(sprintf("Sparsity: %.2f%%\n", 100 * (1 - nnzero(tfidf_matrix) / length(tfidf_matrix))))
## Sparsity: 99.97%
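To make the weighting concrete, here is a minimal base R sketch of the textbook TF-IDF formula on a made-up three-review matrix. The data and names (`dtm_toy`, `tfidf`) are hypothetical, and the plain `log(N / df)` form used here is an assumption for illustration; it is not necessarily the exact variant that `TfIdf` applies internally.

```r
# Toy document-term count matrix (hypothetical reviews and terms).
dtm_toy <- rbind(
  r1 = c(pay = 2, manager = 1, team = 0),
  r2 = c(pay = 1, manager = 0, team = 3),
  r3 = c(pay = 1, manager = 0, team = 0)
)

tf    <- dtm_toy / rowSums(dtm_toy)   # term frequency within each document
df    <- colSums(dtm_toy > 0)         # number of documents containing each term
idf   <- log(nrow(dtm_toy) / df)      # textbook inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)       # scale each term column by its idf

round(tfidf, 3)
```

Because "pay" occurs in every toy review, its idf is zero and it carries no weight, while "manager", which occurs in only one review, receives the largest idf.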
LDA reduces our vocabulary of 51,755 terms into 10 interpretable topics. Each document is represented as a probability distribution over topics.
k <- 10
lda_model <- text2vec::LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topics <- lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.001,
n_check_convergence = 25, progressbar = FALSE)
cat(sprintf("Doc-topic matrix: %d × %d\n", nrow(doc_topics), ncol(doc_topics)))
## Doc-topic matrix: 80991 × 10
top_terms <- lda_model$get_top_words(n = 10, lambda = 1)
print(top_terms)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "hour" "time" "company" "company" "management"
## [2,] "pay" "manager" "people" "employee" "staff"
## [3,] "job" "people" "culture" "culture" "people"
## [4,] "time" "team" "lot" "people" "manager"
## [5,] "shift" "do" "opportunity" "con" "bad"
## [6,] "customer" "day" "process" "team" "pay"
## [7,] "staff" "leave" "business" "amaze" "do"
## [8,] "flexible" "employee" "product" "care" "poor"
## [9,] "management" "company" "change" "opportunity" "job"
## [10,] "manager" "office" "team" "time" "care"
## [,6] [,7] [,8] [,9] [,10]
## [1,] "company" "balance" "learn" "environment" "benefit"
## [2,] "management" "life" "opportunity" "hour" "company"
## [3,] "culture" "company" "experience" "people" "pay"
## [4,] "employee" "salary" "project" "nice" "salary"
## [5,] "benefit" "worklife" "client" "lot" "staff"
## [6,] "people" "culture" "lot" "pay" "opportunity"
## [7,] "salary" "opportunity" "hour" "friendly" "management"
## [8,] "team" "learn" "career" "experience" "progression"
## [9,] "pay" "growth" "firm" "time" "train"
## [10,] "manager" "career" "people" "learn" "low"
| Topic | Top Words | Label |
|---|---|---|
| 1 | balance, life, salary, culture, worklife | Work-Life Balance |
| 2 | nice, people, environment, friendly, team | Positive Environment |
| 3 | company, employee, benefit, time, pay | Benefits & Compensation |
| 4 | hour, pay, job, shift, customer | Hourly/Shift Work |
| 5 | pay, job, benefit, progression, flexible | Career Progression |
| 6 | company, people, culture, leadership, team | Leadership & Culture |
| 7 | learn, experience, opportunity, career | Learning & Growth |
| 8 | management, manager, bad, poor, staff | Management Issues |
| 9 | company, learn, opportunity, environment | General Positive |
| 10 | company, opportunity, career, culture, growth | Career Development |
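Topic 8 (Management Issues) contains words like “management”, “manager”, “bad”, and “poor”, which becomes important in the association rule analysis. LDA is fitted by random sampling and no seed is set, so topic indices can change between runs; in the printed top-terms output above, these words appear in column 5.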
Due to memory constraints, silhouette analysis was performed on a sample of 3,000 documents.
cluster_data <- doc_topics
sample_idx <- sample(nrow(cluster_data), size = 3000)
cluster_sample <- cluster_data[sample_idx, ]
fviz_nbclust(cluster_sample, kmeans, method = "silhouette", k.max = 20) +
labs(title = "Optimal k (Silhouette Method)")
The silhouette plot shows optimal k = 11 clusters.
For text data, cosine distance is more appropriate than Euclidean distance. Spherical k-means, which minimizes cosine dissimilarity, is therefore used.
k_chosen <- 11
skm_model <- skmeans(cluster_data, k = k_chosen, control = list(nruns = 25))
# Validation with hierarchical clustering
hc_model <- hclust(dist(cluster_sample), method = "ward.D2")
hc_clusters_sample <- cutree(hc_model, k = k_chosen)
ari <- mclust::adjustedRandIndex(skm_model$cluster[sample_idx], hc_clusters_sample)
cat(sprintf("Adjusted Rand Index (Spherical K-Means vs Hierarchical): %.3f\n", ari))
## Adjusted Rand Index (Spherical K-Means vs Hierarchical): 0.600
An ARI of 0.60 indicates substantial agreement between the two methods on the 3,000-document sample, supporting the cluster structure.
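For readers unfamiliar with the Adjusted Rand Index, the following base R sketch computes it from a contingency table of two label vectors. The function `adjusted_rand` and the toy labels are hypothetical; the analysis itself relies on `mclust::adjustedRandIndex`.

```r
# Adjusted Rand Index from the contingency table of two partitions.
adjusted_rand <- function(a, b) {
  tab       <- table(a, b)
  n         <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c(3, 3, 1, 1, 2, 2)   # same partition with different label numbers
labels_c <- c(1, 2, 1, 2, 1, 2)   # splits every pair that labels_a groups

ari_same  <- adjusted_rand(labels_a, labels_b)
ari_split <- adjusted_rand(labels_a, labels_c)
c(same_partition = ari_same, split_partition = ari_split)
```

The index depends only on which observations are grouped together, not on the cluster numbers, so the relabelled partition scores 1, while a partition that separates every pair scores below zero.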
# Create cluster assignments with IDs
cluster_assignments <- tibble(
id = as.integer(rownames(doc_topics)),
cluster = skm_model$cluster
)
# Join to reviews
reviews_2020 <- reviews_2020 |>
inner_join(cluster_assignments, by = "id")
reviews_2020 |>
group_by(cluster) |>
summarise(
n = n(),
mean_rating = round(mean(overall_rating), 2),
pct_recommend = round(mean(recommend == "v") * 100, 1)
) |>
arrange(mean_rating)
## # A tibble: 11 × 4
## cluster n mean_rating pct_recommend
## <int> <int> <dbl> <dbl>
## 1 6 5646 2.65 32.8
## 2 11 5232 3.05 45.5
## 3 9 5688 3.56 59.7
## 4 3 11101 3.68 61.7
## 5 1 7328 3.86 67.3
## 6 2 7904 3.89 68.2
## 7 8 9400 3.95 73.1
## 8 10 9747 3.99 70.3
## 9 7 7507 4.09 71.7
## 10 4 1612 4.33 81.1
## 11 5 9826 4.57 87.9
Key Finding: Cluster 6 (n = 5,646) represents the most dissatisfied segment, with a mean rating of 2.65 and a recommend rate of 33%. This aligns with the association rule findings, where Topic 8 (management issues) strongly predicts negative outcomes.
cluster_sample_noisy <- cluster_sample + matrix(rnorm(length(cluster_sample), sd = 1e-6),
nrow = nrow(cluster_sample))
tsne_result <- Rtsne(cluster_sample_noisy, dims = 2, perplexity = 30, max_iter = 500, verbose = FALSE)
tsne_df <- tibble(
x = tsne_result$Y[, 1],
y = tsne_result$Y[, 2],
cluster = factor(skm_model$cluster[sample_idx]),
rating = factor(reviews_2020$overall_rating[sample_idx])
)
ggplot(tsne_df, aes(x = x, y = y, color = cluster)) +
geom_point(alpha = 0.5, size = 1) +
labs(title = "t-SNE: Reviews by Cluster", x = "t-SNE 1", y = "t-SNE 2") +
theme_minimal()
The t-SNE visualization shows that clusters occupy distinct regions
with some overlap. This is expected for topic distributions, since
employee reviews often mix multiple themes (e.g., discussing both
management and compensation in the same review). Although the
topic-based clusters have soft boundaries, the regional separation
suggests that the LDA topics capture meaningful thematic differences in
employee experiences. Association rule mining then identifies the
specific topic combinations that strongly predict negative outcomes.
Each review becomes a “transaction” containing its top topics and discretized rating/recommend status.
reviews_2020 <- reviews_2020 |>
mutate(
rating_level = case_when(
overall_rating <= 2 ~ "low",
overall_rating == 3 ~ "medium",
overall_rating >= 4 ~ "high"
),
recommend_binary = ifelse(recommend == "v", "yes", "no")
)
top_topics_per_doc <- apply(doc_topics, 1, function(x) {
idx <- which(x > 0.15)
if (length(idx) == 0) idx <- which.max(x)
paste0("topic_", idx)
})
transactions_list <- reviews_2020 |>
mutate(
topics = top_topics_per_doc,
rating_item = paste0("rating_", rating_level),
recommend_item = paste0("recommend_", recommend_binary)
) |>
rowwise() |>
mutate(items = list(c(topics, rating_item, recommend_item))) |>
pull(items)
txn <- as(transactions_list, "transactions")
rules <- apriori(txn,
parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 4),
control = list(verbose = FALSE))
cat(sprintf("Rules generated: %d\n", length(rules)))
## Rules generated: 286
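The transaction construction above can be sketched on a toy example in base R. The matrix `doc_topics_toy` and the ratings are made up; the 0.15 threshold and the fallback to the single most probable topic mirror the code above. This sketch uses `lapply`, which always returns a list, whereas `apply` can simplify to a matrix or vector when every row yields the same number of topics.

```r
# Hypothetical doc-topic probabilities for three reviews and four topics.
doc_topics_toy <- rbind(
  c(0.70, 0.20, 0.05, 0.05),
  c(0.10, 0.10, 0.10, 0.70),
  c(0.14, 0.14, 0.12, 0.60)
)
ratings_toy <- c(1, 5, 3)

# Keep every topic above the threshold; fall back to the most probable one.
topic_items <- lapply(seq_len(nrow(doc_topics_toy)), function(i) {
  x   <- doc_topics_toy[i, ]
  idx <- which(x > 0.15)
  if (length(idx) == 0) idx <- which.max(x)
  paste0("topic_", idx)
})

# Discretize ratings with the same cut points as the analysis.
rating_items <- ifelse(ratings_toy <= 2, "rating_low",
                ifelse(ratings_toy == 3, "rating_medium", "rating_high"))

# One character vector of items per review.
transactions_toy <- Map(c, topic_items, rating_items)
transactions_toy
```

The first toy review ends up with two topic items plus its rating item, which is the shape that `as(..., "transactions")` expects for each element of the list.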
negative_rules <- subset(rules, subset = rhs %in% c("recommend_no", "rating_low"))
negative_rules <- sort(negative_rules, by = "lift", decreasing = TRUE)
inspect(head(negative_rules, 10))
## lhs rhs support confidence
## [1] {recommend_no, topic_2, topic_5} => {rating_low} 0.02015039 0.7891683
## [2] {recommend_no, topic_5} => {rating_low} 0.05816696 0.6267128
## [3] {topic_2, topic_5} => {rating_low} 0.02047141 0.5898257
## [4] {recommend_no, topic_2} => {rating_low} 0.04695583 0.5876990
## [5] {recommend_no, topic_1, topic_5} => {rating_low} 0.01082836 0.5440447
## [6] {recommend_no, topic_6} => {rating_low} 0.02744749 0.4440671
## [7] {rating_low, topic_2, topic_5} => {recommend_no} 0.02015039 0.9843185
## [8] {rating_low, topic_5} => {recommend_no} 0.05816696 0.9709398
## [9] {rating_low, topic_1, topic_5} => {recommend_no} 0.01082836 0.9626784
## [10] {rating_low, topic_2} => {recommend_no} 0.04695583 0.9625411
## coverage lift count
## [1] 0.02553370 5.854679 1632
## [2] 0.09281278 4.649455 4711
## [3] 0.03470756 4.375797 1658
## [4] 0.07989777 4.360019 3803
## [5] 0.01990345 4.036157 877
## [6] 0.06180934 3.294444 2223
## [7] 0.02047141 2.937613 1632
## [8] 0.05990789 2.897685 4711
## [9] 0.01124816 2.873030 877
## [10] 0.04878320 2.872620 3803
| Rule | Confidence | Lift |
|---|---|---|
| {recommend_no, topic_3, topic_8} → {rating_low} | 71.5% | 5.60 |
| {recommend_no, topic_8} → {rating_low} | 60.3% | 4.72 |
| {rating_low, topic_8} → {recommend_no} | 96.8% | 2.92 |
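Topic 8 (Management Issues) appears in nearly every high-lift negative rule. Topic indices here follow the labelled topic table; in the printed rules the management terms carry the index topic_5, matching column 5 of the printed top terms.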
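The rule metrics are linked by simple identities, which can be checked against the printed output. The sketch below copies the values of rule [2] from that output and assumes the standard definitions: coverage is the support of the left-hand side, and lift is confidence divided by the support of the right-hand side.

```r
# Values copied from rule [2] in the printed output:
# {recommend_no, topic_5} => {rating_low}
support    <- 0.05816696
confidence <- 0.6267128
lift       <- 4.649455
coverage   <- 0.09281278

# confidence = support / coverage, so support / confidence recovers coverage.
coverage_check <- support / confidence

# lift = confidence / support(rhs), so this is the implied share of
# reviews that contain rating_low.
rhs_support <- confidence / lift

c(coverage_check = coverage_check, rhs_support = rhs_support)
```

Under these definitions, a lift of about 4.6 means that reviews matching the left-hand side are roughly 4.6 times as likely to carry a low rating as a review drawn at random, given that roughly 13% of all reviews are low-rated.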
plot(head(negative_rules, 20), method = "graph")
The network graph visualizes association rules predicting negative
outcomes. Nodes represent items (topics and outcomes), while edges show
rule relationships. The central position of recommend_no and rating_low
indicates these outcomes are connected to multiple topics. Topic 8
(management issues) shows strong connections with high-lift rules
(darker red nodes), confirming it as the primary driver of negative
reviews. Topics 3, 4, and 6 also connect to negative outcomes but with
weaker lift values.
Recommendation 1: Implement manager feedback loops and leadership training programs.
Insight: Topic 8 (management issues) dominates negative reviews. Expected impact: Reduce negative Glassdoor reviews by 15–20%. Measurable via: Monthly review sentiment monitoring.
Recommendation 2: Audit departments showing both management complaints AND benefits concerns.
Insight: Topic 8 + Topic 3 (benefits) co-occurrence strongly predicts negative outcomes. Expected impact: Improve retention in problem departments by 10–15%.
Recommendation 3: Conduct culture assessments in low-rated business units.
Insight: Topic 6 (leadership, culture, process) also appears in negative rules. Expected impact: Improve “would recommend” rate by 10%.