Unsupervised Learning on Text Data

Clustering Text Data

Clustering documents and text data is increasingly popular in business analytics. The main idea is to organize documents into groups where observations within groups are similar to each other. There are several papers on this topic, including “Clustering of text documents by implementation of K-means algorithms” by Hardeep Singh, which demonstrates that spherical k-means outperforms standard k-means for high-dimensional text data due to its use of cosine similarity rather than Euclidean distance.

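In text clustering, the dissimilarity between a document vector \(x_i\) and the prototype (centroid) \(p_{c(i)}\) of its assigned cluster is typically measured with cosine dissimilarity. Spherical k-means minimizes the total over all documents:

\[J = \sum_i \left(1 - \cos(x_i, p_{c(i)})\right)\]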

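Cosine dissimilarity is preferred over Euclidean distance because document vectors are sparse and high-dimensional, and because it depends only on the direction of a vector: a long review and a short review with the same word proportions are treated as similar.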
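To make the contrast concrete, here is a minimal base-R sketch. The three vectors and the four-word vocabulary are made-up illustrations, not taken from the Glassdoor data.

```r
# Cosine dissimilarity between two term-count vectors (1 - cosine similarity).
cosine_dissim <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical term counts over the vocabulary (pay, manager, hours, culture).
doc_short <- c(1, 2, 0, 0)
doc_long  <- c(10, 20, 0, 0)   # same word proportions, ten times longer
doc_other <- c(0, 0, 3, 1)     # shares no terms with doc_short

d_cos_same  <- cosine_dissim(doc_short, doc_long)
d_cos_other <- cosine_dissim(doc_short, doc_other)
d_euc_same  <- sqrt(sum((doc_short - doc_long)^2))
d_euc_other <- sqrt(sum((doc_short - doc_other)^2))
```

Cosine dissimilarity puts the short and long versions of the same review at essentially zero distance and the unrelated review at distance 1. Euclidean distance ranks them the other way round, because it is dominated by document length.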

LDA (Latent Dirichlet Allocation)

Latent Dirichlet Allocation is a generative probabilistic model for discovering latent topics in document collections. For each document, the model estimates a probability distribution over topics, and for each topic a distribution over words. This allows us to reduce thousands of word dimensions into a small number of interpretable topic dimensions.

Association Rules

Association rule mining discovers interesting relationships between items. Applied to text analysis, one can find which topics frequently co-occur and which topic combinations predict business outcomes like negative reviews or low ratings.
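Rules are usually ranked by support, confidence, and lift. The sketch below computes all three by hand in base R for a single rule; the eight transactions and the item names are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical transactions: each review is a set of items.
txns <- list(
  c("topic_mgmt", "rating_low"),
  c("topic_mgmt", "rating_low"),
  c("topic_mgmt", "rating_high"),
  c("topic_pay",  "rating_high"),
  c("topic_pay",  "rating_high"),
  c("topic_pay",  "rating_low"),
  c("topic_env",  "rating_high"),
  c("topic_env",  "rating_high")
)

# Fraction of transactions containing every item in `items`.
support <- function(items) {
  mean(vapply(txns, function(txn) all(items %in% txn), logical(1)))
}

lhs <- "topic_mgmt"
rhs <- "rating_low"

supp_rule  <- support(c(lhs, rhs))        # P(lhs and rhs)
confidence <- supp_rule / support(lhs)    # P(rhs | lhs)
lift       <- confidence / support(rhs)   # how much lhs raises P(rhs)
```

A lift above 1 means the left-hand side makes the outcome more likely than its base rate; here the rule holds in 2 of 8 transactions, with confidence 2/3 and lift 16/9.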

Project Overview

Business Question: What drives negative employee reviews on Glassdoor, and what actionable patterns can HR teams use to improve employee satisfaction?

Methods:

  • LDA for topic discovery (dimensionality reduction)
  • Spherical K-Means and Hierarchical Clustering for document grouping
  • Apriori algorithm for association rule mining

Dataset: Glassdoor employee reviews from 2020 (~81,000 reviews after cleaning)


Dataset and Preprocessing

Data Source

The dataset contains Glassdoor job reviews with employee ratings, pros, cons, and other metadata. The original dataset has 838,566 reviews spanning multiple years.
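The report does not show its setup chunk. The package list below is an assumption, inferred from the functions called in the code that follows (for example `unnest_tokens`, `lemmatize_words`, `fviz_nbclust`, `skmeans`, `Rtsne`, `apriori`); the original setup may differ.

```r
# Assumed setup: packages inferred from the function calls in this report.
library(tidyverse)   # read_csv, dplyr verbs, tidyr, stringr, ggplot2
library(here)        # here() for project-relative paths
library(tidytext)    # unnest_tokens(), stop_words
library(textstem)    # lemmatize_words()
library(text2vec)    # itoken(), create_vocabulary(), create_dtm(), TfIdf, LDA
library(Matrix)      # nnzero() on sparse matrices
library(factoextra)  # fviz_nbclust()
library(skmeans)     # skmeans()
library(Rtsne)       # Rtsne()
library(arules)      # transactions, apriori(), inspect()
library(arulesViz)   # plot(rules, method = "graph")
# mclust is called as mclust::adjustedRandIndex(), so it only needs to be installed.
```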

reviews_raw <- read_csv(here("data", "raw", "glassdoor-job-reviews.csv"))
dim(reviews_raw)
## [1] 838566     18
glimpse(reviews_raw)
## Rows: 838,566
## Columns: 18
## $ firm                <chr> "AFH-Wealth-Management", "AFH-Wealth-Management", …
## $ date_review         <date> 2015-04-05, 2015-12-11, 2016-01-28, 2016-04-16, 2…
## $ job_title           <chr> NA, "Office Administrator", "Office Administrator"…
## $ current             <chr> "Current Employee", "Current Employee, more than 1…
## $ location            <chr> NA, "Bromsgrove, England, England", "Bromsgrove, E…
## $ overall_rating      <dbl> 2, 2, 1, 5, 1, 3, 1, 5, 4, 1, 1, 1, 4, 1, 5, 5, 5,…
## $ work_life_balance   <dbl> 4, 3, 1, 2, 2, 4, 1, 5, 4, 1, 3, 2, 5, 4, 4, 5, NA…
## $ culture_values      <dbl> 3, 1, 1, 3, 1, 2, 1, 5, 4, 1, 1, 1, 4, 1, 4, 5, NA…
## $ diversity_inclusion <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ career_opp          <dbl> 2, 2, 1, 2, 2, 2, 1, 5, 4, 1, 2, 2, 5, 2, 5, 4, NA…
## $ comp_benefits       <dbl> 3, 1, 1, 2, 1, 3, 1, 4, 4, 3, 1, 2, 4, 4, 5, 3, NA…
## $ senior_mgmt         <dbl> 3, 4, 1, 3, 1, 2, 1, 5, 4, 1, 1, 1, 4, 1, 5, 4, NA…
## $ recommend           <chr> "x", "x", "x", "x", "x", "o", "x", "v", "v", "x", …
## $ ceo_approv          <chr> "o", "o", "o", "o", "o", "r", "o", "o", "o", "x", …
## $ outlook             <chr> "r", "r", "x", "r", "x", "r", "r", "v", "v", "x", …
## $ headline            <chr> "Young colleagues, poor micro management", "Excell…
## $ pros                <chr> "Very friendly and welcoming to new staff. Easy go…
## $ cons                <chr> "Poor salaries, poor training and communication.",…

Missingness Analysis

reviews_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  mutate(pct_missing = round(n_missing / nrow(reviews_raw) * 100, 2)) |>
  arrange(desc(n_missing)) |>
  head(10)
## # A tibble: 10 × 3
##    column              n_missing pct_missing
##    <chr>                   <int>       <dbl>
##  1 diversity_inclusion    702500       83.8 
##  2 location               297338       35.5 
##  3 culture_values         191373       22.8 
##  4 senior_mgmt            155876       18.6 
##  5 comp_benefits          150082       17.9 
##  6 work_life_balance      149894       17.9 
##  7 career_opp             147501       17.6 
##  8 job_title               79065        9.43
##  9 headline                 2219        0.26
## 10 cons                        8        0

The diversity_inclusion column has 83.8% missing values and will be dropped.

Preprocessing Steps

Following standard text preprocessing practices:

  1. Subset to 2020 reviews only
  2. Drop diversity_inclusion column (83.8% NA)
  3. Remove rows with missing ratings, text, location, or job title
  4. Remove short reviews (pros or cons under 20 characters)
  5. Text cleaning: lowercase, remove punctuation, numbers, and extra whitespace
  6. Remove outliers based on text length (|z-score| > 3); this step does not appear in the code shown below
rating_cols <- c("overall_rating", "work_life_balance", "culture_values",
                 "career_opp", "comp_benefits", "senior_mgmt")

reviews_2020 <- reviews_raw |>
  filter(date_review >= as.Date("2020-01-01"),
         date_review < as.Date("2021-01-01")) |>
  select(-diversity_inclusion) |>
  drop_na(all_of(rating_cols), headline, pros, cons, location, job_title) |>
  filter(nchar(pros) >= 20, nchar(cons) >= 20) |>
  slice_sample(prop = 1) |>
  mutate(id = row_number(), .before = 1)

# Text cleaning
reviews_2020 <- reviews_2020 |>
  mutate(across(c(pros, cons, headline), ~ .x |>
      str_remove_all("[\r\n\t]") |>
      str_to_lower() |>
      str_remove_all("[[:punct:]]") |>
      str_remove_all("[0-9]") |>
      str_squish()))

cat(sprintf("Clean dataset: %s rows\n", format(nrow(reviews_2020), big.mark = ",")))
## Clean dataset: 81,034 rows
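The preprocessing list includes a text-length outlier filter that the code above does not show. A minimal sketch of such a filter follows, run on a toy data frame rather than on reviews_2020. It assumes that text length means the combined character count of pros and cons; the report does not state which definition was used.

```r
# Toy stand-in for reviews_2020; column names follow the report, values are made up.
set.seed(1)
toy <- data.frame(
  id   = 1:200,
  pros = strrep("a", c(rpois(199, 80), 5000)),  # one extremely long review
  cons = strrep("b", rpois(200, 80)),
  stringsAsFactors = FALSE
)

# Assumption: text length = characters in pros + cons.
toy$text_len <- nchar(toy$pros) + nchar(toy$cons)
toy$z_len    <- (toy$text_len - mean(toy$text_len)) / sd(toy$text_len)

# Keep reviews whose length is within 3 standard deviations of the mean.
toy_clean <- toy[abs(toy$z_len) <= 3, ]
```

In this toy data only the 5,000-character review is removed. Review lengths are typically right-skewed, so a z-score on raw length mostly trims very long reviews.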

Rating Distributions

reviews_2020 |>
  select(all_of(rating_cols)) |>
  pivot_longer(everything(), names_to = "rating_type", values_to = "rating_value") |>
  dplyr::count(rating_type, rating_value) |>
  group_by(rating_type) |>
  mutate(percentage = n / sum(n)) |>
  ggplot(aes(x = rating_value, y = percentage)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~rating_type) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Distribution of Ratings", x = "Rating", y = "Percentage") +
  theme_minimal()

The rating distributions are concentrated at the high end across all categories (a negative, or left, skew in statistical terms), with ratings of 4 and 5 being most common. Culture values and work-life balance are the most top-heavy (38% and 30% at rating 5 respectively), while senior management ratings are more evenly distributed, suggesting this is a more contentious area. Overall ratings cluster around 4-5 (67% combined), indicating generally positive reviews in the dataset, though this may reflect Glassdoor’s selection bias toward engaged employees.


Tokenization and Document-Term Matrix

Tokenization with Lemmatization

combined_tokens <- bind_rows(
  reviews_2020 |> select(id, text = pros) |> mutate(source = "pros"),
  reviews_2020 |> select(id, text = cons) |> mutate(source = "cons"),
  reviews_2020 |> select(id, text = headline) |> mutate(source = "headline")
) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  mutate(word_lemma = lemmatize_words(word))

combined_tokens |> dplyr::count(source)
## # A tibble: 3 × 2
##   source        n
##   <chr>     <int>
## 1 cons     553093
## 2 headline 143757
## 3 pros     470033

Document-Term Matrix with TF-IDF

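TF-IDF (Term Frequency - Inverse Document Frequency) weights words by their importance: a word scores highly when it is frequent in a document but rare across the collection. The LDA model in the next section is fit on the raw count matrix (dtm), not on the TF-IDF weights.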
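As a worked illustration, the sketch below applies the textbook definition (term share within the document, times log(N / document frequency)) to a tiny made-up count matrix. text2vec's TfIdf has its own smoothing and normalisation options, so its values may differ from this hand calculation.

```r
# Hypothetical document-term counts: 3 documents x 3 terms.
counts <- matrix(
  c(2, 1, 0,
    1, 0, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("company", "shift", "manager"))
)

tf    <- counts / rowSums(counts)                 # term share within each document
idf   <- log(nrow(counts) / colSums(counts > 0))  # log(N / document frequency)
tfidf <- sweep(tf, 2, idf, `*`)                   # scale each term column by its idf
```

The word "company" occurs in every document, so its idf is log(1) = 0 and it carries no weight, while "shift" and "manager" each occur in one document and receive idf log(3).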

doc_tokens <- combined_tokens |>
  group_by(id) |>
  summarise(tokens = list(word_lemma)) |>
  pull(tokens, name = id)

vocab_iterator <- itoken(doc_tokens, progressbar = FALSE)
vocabulary <- create_vocabulary(vocab_iterator)
vectoriser <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(vocab_iterator, vectoriser)

cat(sprintf("DTM dimensions: %d documents × %d terms\n", nrow(dtm), ncol(dtm)))
## DTM dimensions: 80991 documents × 51755 terms
tfidf_model <- TfIdf$new()
tfidf_matrix <- fit_transform(dtm, tfidf_model)
cat(sprintf("Sparsity: %.2f%%\n", 100 * (1 - nnzero(tfidf_matrix) / length(tfidf_matrix))))
## Sparsity: 99.97%

LDA (Latent Dirichlet Allocation)

LDA reduces our vocabulary of 51,755 terms into 10 interpretable topics. Each document is represented as a probability distribution over topics.

k <- 10
lda_model <- text2vec::LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topics <- lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.001,
                                       n_check_convergence = 25, progressbar = FALSE)

cat(sprintf("Doc-topic matrix: %d × %d\n", nrow(doc_topics), ncol(doc_topics)))
## Doc-topic matrix: 80991 × 10

Top Terms per Topic

top_terms <- lda_model$get_top_words(n = 10, lambda = 1)
print(top_terms)
##       [,1]         [,2]       [,3]          [,4]          [,5]        
##  [1,] "hour"       "time"     "company"     "company"     "management"
##  [2,] "pay"        "manager"  "people"      "employee"    "staff"     
##  [3,] "job"        "people"   "culture"     "culture"     "people"    
##  [4,] "time"       "team"     "lot"         "people"      "manager"   
##  [5,] "shift"      "do"       "opportunity" "con"         "bad"       
##  [6,] "customer"   "day"      "process"     "team"        "pay"       
##  [7,] "staff"      "leave"    "business"    "amaze"       "do"        
##  [8,] "flexible"   "employee" "product"     "care"        "poor"      
##  [9,] "management" "company"  "change"      "opportunity" "job"       
## [10,] "manager"    "office"   "team"        "time"        "care"      
##       [,6]         [,7]          [,8]          [,9]          [,10]        
##  [1,] "company"    "balance"     "learn"       "environment" "benefit"    
##  [2,] "management" "life"        "opportunity" "hour"        "company"    
##  [3,] "culture"    "company"     "experience"  "people"      "pay"        
##  [4,] "employee"   "salary"      "project"     "nice"        "salary"     
##  [5,] "benefit"    "worklife"    "client"      "lot"         "staff"      
##  [6,] "people"     "culture"     "lot"         "pay"         "opportunity"
##  [7,] "salary"     "opportunity" "hour"        "friendly"    "management" 
##  [8,] "team"       "learn"       "career"      "experience"  "progression"
##  [9,] "pay"        "growth"      "firm"        "time"        "train"      
## [10,] "manager"    "career"      "people"      "learn"       "low"

Topic Interpretation

Topic  Top Words                                       Label
1      hour, pay, job, shift, customer                 Hourly/Shift Work
2      time, manager, people, team, day                Day-to-Day Team & Time
3      company, people, culture, process, business     Business, Product & Process
4      company, employee, culture, amaze, care         General Positive
5      management, staff, manager, bad, poor           Management Issues
6      company, management, culture, employee, benefit Leadership & Culture
7      balance, life, salary, worklife, culture        Work-Life Balance
8      learn, opportunity, experience, project, career Learning & Growth
9      environment, people, nice, friendly, learn      Positive Environment
10     benefit, pay, salary, progression, train        Pay, Benefits & Progression

Topic 5 (Management Issues) contains words like “management”, “manager”, “bad”, “poor”, which will be important later.


Clustering

Optimal Number of Clusters

Due to memory constraints, silhouette analysis was performed on a sample of 3,000 documents.

cluster_data <- doc_topics
sample_idx <- sample(nrow(cluster_data), size = 3000)
cluster_sample <- cluster_data[sample_idx, ]
fviz_nbclust(cluster_sample, kmeans, method = "silhouette", k.max = 20) +
  labs(title = "Optimal k (Silhouette Method)")

The silhouette plot shows optimal k = 11 clusters.

Spherical K-Means (Cosine Distance)

For text data, cosine distance is more appropriate than Euclidean distance, so spherical k-means, which minimizes cosine dissimilarity, is used here.
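The idea can be sketched in a few lines of base R. This is an illustrative simplification and not the skmeans implementation: it uses a fixed number of iterations and takes its initial prototypes from row indices supplied by the caller (`init_idx`, a hypothetical argument). The toy matrix is made up.

```r
# Minimal spherical k-means sketch (illustrative; the report uses skmeans::skmeans).
spherical_kmeans <- function(x, init_idx, n_iter = 20) {
  unit <- function(m) m / sqrt(rowSums(m^2))   # scale each row to unit length
  x <- unit(x)
  protos <- x[init_idx, , drop = FALSE]        # initial prototypes: chosen rows
  for (iter in seq_len(n_iter)) {
    sim <- x %*% t(protos)                     # cosine similarity to each prototype
    cluster <- max.col(sim, ties.method = "first")
    for (j in seq_along(init_idx)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) {
        # New prototype: the members' sum, rescaled to unit length.
        protos[j, ] <- unit(matrix(colSums(members), nrow = 1))
      }
    }
  }
  list(cluster = cluster, prototypes = protos)
}

# Hypothetical topic-proportion rows: two documents per dominant topic.
toy <- rbind(
  c(0.8, 0.1, 0.1), c(0.7, 0.2, 0.1),
  c(0.1, 0.8, 0.1), c(0.2, 0.7, 0.1),
  c(0.1, 0.1, 0.8), c(0.1, 0.2, 0.7)
)
fit <- spherical_kmeans(toy, init_idx = c(1, 3, 5))
```

Because every row and prototype has unit length, the matrix product gives cosine similarities directly, and assigning each document to its most similar prototype is the same as minimizing cosine dissimilarity.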

k_chosen <- 11
skm_model <- skmeans(cluster_data, k = k_chosen, control = list(nruns = 25))

# Validation with hierarchical clustering
hc_model <- hclust(dist(cluster_sample), method = "ward.D2")
hc_clusters_sample <- cutree(hc_model, k = k_chosen)
ari <- mclust::adjustedRandIndex(skm_model$cluster[sample_idx], hc_clusters_sample)
cat(sprintf("Adjusted Rand Index (Spherical K-Means vs Hierarchical): %.3f\n", ari))
## Adjusted Rand Index (Spherical K-Means vs Hierarchical): 0.600

An ARI of 0.60 indicates substantial agreement between the two methods, which supports the cluster structure.
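For readers unfamiliar with the metric, the sketch below computes the Adjusted Rand Index from a contingency table in base R. The function name `adjusted_rand` and the label vectors are illustrative; the report itself uses mclust::adjustedRandIndex.

```r
# Adjusted Rand Index from the contingency table of two labelings.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(n) n * (n - 1) / 2          # number of pairs among n items
  sum_cells <- sum(pairs(tab))
  sum_rows  <- sum(pairs(rowSums(tab)))
  sum_cols  <- sum(pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / pairs(length(a))
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

labels_a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
labels_b <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)   # same partition, different label names
labels_c <- c(1, 1, 2, 2, 2, 3, 3, 3, 1)   # a partially different partition

ari_same <- adjusted_rand(labels_a, labels_b)
ari_diff <- adjusted_rand(labels_a, labels_c)
```

The index is 1 for identical partitions regardless of how the clusters are numbered, and close to 0 when agreement is no better than chance. This is why it can compare spherical k-means labels with hierarchical clustering labels directly.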

Cluster Profiles

# Create cluster assignments with IDs
cluster_assignments <- tibble(
  id = as.integer(rownames(doc_topics)),
  cluster = skm_model$cluster
)

# Join to reviews
reviews_2020 <- reviews_2020 |>
  inner_join(cluster_assignments, by = "id")

reviews_2020 |>
  group_by(cluster) |>
  summarise(
    n = n(),
    mean_rating = round(mean(overall_rating), 2),
    pct_recommend = round(mean(recommend == "v") * 100, 1)
  ) |>
  arrange(mean_rating)
## # A tibble: 11 × 4
##    cluster     n mean_rating pct_recommend
##      <int> <int>       <dbl>         <dbl>
##  1       6  5646        2.65          32.8
##  2      11  5232        3.05          45.5
##  3       9  5688        3.56          59.7
##  4       3 11101        3.68          61.7
##  5       1  7328        3.86          67.3
##  6       2  7904        3.89          68.2
##  7       8  9400        3.95          73.1
##  8      10  9747        3.99          70.3
##  9       7  7507        4.09          71.7
## 10       4  1612        4.33          81.1
## 11       5  9826        4.57          87.9

Key Finding: Cluster 6 (n=5,646) represents the most dissatisfied segment, with a mean rating of 2.65 and a recommend rate of 33%. This is consistent with the association rule findings later in the report, where the management-issues topic strongly predicts negative outcomes.

t-SNE Visualization

cluster_sample_noisy <- cluster_sample + matrix(rnorm(length(cluster_sample), sd = 1e-6),
                                                 nrow = nrow(cluster_sample))
tsne_result <- Rtsne(cluster_sample_noisy, dims = 2, perplexity = 30, max_iter = 500, verbose = FALSE)

tsne_df <- tibble(
  x = tsne_result$Y[, 1],
  y = tsne_result$Y[, 2],
  cluster = factor(skm_model$cluster[sample_idx]),
  rating = factor(reviews_2020$overall_rating[sample_idx])
)

ggplot(tsne_df, aes(x = x, y = y, color = cluster)) +
  geom_point(alpha = 0.5, size = 1) +
  labs(title = "t-SNE: Reviews by Cluster", x = "t-SNE 1", y = "t-SNE 2") +
  theme_minimal()

The t-SNE visualization shows that clusters occupy distinct regions with some overlap. This is expected for topic distributions, since employee reviews often mix multiple themes (e.g., discussing both management and compensation in the same review). The soft boundaries suggest that cluster membership alone is a coarse summary, which motivates looking at specific topic combinations with association rules in the next section. The visible separation between clusters suggests that the LDA topics capture meaningful thematic differences in employee experiences, although distances in a t-SNE plot should be read qualitatively.

Association Rules

Transaction Construction

Each review becomes a “transaction” containing its top topics and discretized rating/recommend status.

reviews_2020 <- reviews_2020 |>
  mutate(
    rating_level = case_when(
      overall_rating <= 2 ~ "low",
      overall_rating == 3 ~ "medium",
      overall_rating >= 4 ~ "high"
    ),
    recommend_binary = ifelse(recommend == "v", "yes", "no")
  )

top_topics_per_doc <- apply(doc_topics, 1, function(x) {
  idx <- which(x > 0.15)
  if (length(idx) == 0) idx <- which.max(x)
  paste0("topic_", idx)
})

transactions_list <- reviews_2020 |>
  mutate(
    topics = top_topics_per_doc,
    rating_item = paste0("rating_", rating_level),
    recommend_item = paste0("recommend_", recommend_binary)
  ) |>
  rowwise() |>
  mutate(items = list(c(topics, rating_item, recommend_item))) |>
  pull(items)

txn <- as(transactions_list, "transactions")
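To illustrate the topic-selection rule used above, the sketch below wraps the same logic in a helper (the name `pick_topics` is introduced here for illustration) and applies it to two made-up topic distributions.

```r
# Same rule as in the report: keep topics above 0.15, else fall back to the top topic.
pick_topics <- function(x, threshold = 0.15) {
  idx <- which(x > threshold)
  if (length(idx) == 0) idx <- which.max(x)
  paste0("topic_", idx)
}

# Hypothetical doc-topic rows over 10 topics (each row sums to 1).
focused <- c(0.60, 0.25, rep(0.15 / 8, 8))   # two topics clear the threshold
diffuse <- c(0.14, 0.11, 0.11, 0.11, 0.11, 0.10, 0.08, 0.08, 0.08, 0.08)

items_focused <- pick_topics(focused)
items_diffuse <- pick_topics(diffuse)
```

The fallback guarantees that every review contributes at least one topic item, even when no single topic exceeds the threshold.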

Apriori Algorithm

rules <- apriori(txn,
  parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 4),
  control = list(verbose = FALSE))

cat(sprintf("Rules generated: %d\n", length(rules)))
## Rules generated: 286

Rules Predicting Negative Outcomes

negative_rules <- subset(rules, subset = rhs %in% c("recommend_no", "rating_low"))
negative_rules <- sort(negative_rules, by = "lift", decreasing = TRUE)
inspect(head(negative_rules, 10))
##      lhs                                 rhs            support    confidence
## [1]  {recommend_no, topic_2, topic_5} => {rating_low}   0.02015039 0.7891683 
## [2]  {recommend_no, topic_5}          => {rating_low}   0.05816696 0.6267128 
## [3]  {topic_2, topic_5}               => {rating_low}   0.02047141 0.5898257 
## [4]  {recommend_no, topic_2}          => {rating_low}   0.04695583 0.5876990 
## [5]  {recommend_no, topic_1, topic_5} => {rating_low}   0.01082836 0.5440447 
## [6]  {recommend_no, topic_6}          => {rating_low}   0.02744749 0.4440671 
## [7]  {rating_low, topic_2, topic_5}   => {recommend_no} 0.02015039 0.9843185 
## [8]  {rating_low, topic_5}            => {recommend_no} 0.05816696 0.9709398 
## [9]  {rating_low, topic_1, topic_5}   => {recommend_no} 0.01082836 0.9626784 
## [10] {rating_low, topic_2}            => {recommend_no} 0.04695583 0.9625411 
##      coverage   lift     count
## [1]  0.02553370 5.854679 1632 
## [2]  0.09281278 4.649455 4711 
## [3]  0.03470756 4.375797 1658 
## [4]  0.07989777 4.360019 3803 
## [5]  0.01990345 4.036157  877 
## [6]  0.06180934 3.294444 2223 
## [7]  0.02047141 2.937613 1632 
## [8]  0.05990789 2.897685 4711 
## [9]  0.01124816 2.873030  877 
## [10] 0.04878320 2.872620 3803

Key Association Rules

Rule                                               Confidence  Lift
{recommend_no, topic_2, topic_5} → {rating_low}    78.9%       5.85
{recommend_no, topic_5} → {rating_low}             62.7%       4.65
{rating_low, topic_5} → {recommend_no}             97.1%       2.90

topic_5, whose top terms in the LDA output are management, staff, manager, bad, and poor, appears in seven of the ten highest-lift negative rules.

plot(head(negative_rules, 20), method = "graph")

The network graph visualizes the association rules that predict negative outcomes. Nodes represent items (topics and outcomes), while edges show rule relationships. The central position of recommend_no and rating_low indicates these outcomes are connected to multiple topics. The management-issues topic (topic_5 in the output above) is involved in the highest-lift rules (darker red nodes), which points to it as the topic most strongly associated with negative reviews. Topics 1, 2, and 6 also appear among the ten highest-lift rules listed above.

Conclusions and Business Recommendations

Findings

  1. Management Quality is the #1 Predictor of Negative Outcomes
    • Reviews that mention management issues (topic_5 in the rules output) and carry a “would not recommend” flag have low ratings with 63% confidence and 4.6× lift; adding topic_2 raises this to 79% confidence and 5.9× lift
    • 97% of low-rating reviews mentioning management issues also don’t recommend the company
  2. Problem Cluster Identified
    • Cluster 6 represents the most dissatisfied segment: mean rating 2.65, only 33% recommend
    • This is consistent with the prominence of the management-issues topic in the association rules
  3. Robust Cluster Structure
    • Spherical K-Means and Hierarchical clustering show substantial agreement (ARI = 0.60)
    • Euclidean and Cosine-based K-Means also agree (ARI = 0.88)

Business Recommendations

Recommendation 1: Implement manager feedback loops and leadership training programs.

Insight: Management issues (topic_5 in the rules output) dominate negative reviews. Expected impact: Reduce negative Glassdoor reviews by 15–20%. Measurable via: Monthly review sentiment monitoring.

Recommendation 2: Audit departments showing both management complaints AND day-to-day team and time concerns.

Insight: In the rules output, topic_5 (management issues) together with topic_2 (time, manager, people, team, day, leave) predicts low ratings with 59% confidence and 4.4× lift. Expected impact: Improve retention in problem departments by 10–15%.

Recommendation 3: Conduct culture assessments in low-rated business units.

Insight: Topic 6 (company, management, culture, benefit) also appears in negative rules. Expected impact: Improve “would recommend” rate by 10%.

Limitations

  • Glassdoor bias: Disgruntled employees may be over-represented
  • Correlational only: No causal claims can be made
  • 2020 data only: COVID-19 effects may influence results
  • Topic coherence: Topic interpretation is subjective

Future Work

  • Comparison of methods with other algorithms: SVD, DBSCAN
  • Time-series analysis: Do complaint topics shift over years?
  • Company-level clustering: Which firms have the worst profiles?
  • Predictive modelling: Can “would not recommend” label be predicted from review text?