You are part of the data initiative team at a reputable start-up company. The company is ambitious about implementing data-driven decision making in every division, including the Human Resources (HR) department. The department often collects feedback and reviews via group discussions held quarterly. However, some employees may not want to share everything directly with HR for various reasons, such as fear of being judged by others or other personal concerns. To overcome this problem, HR also wants to collect feedback and reviews from an external source. HR decided to collect and analyze reviews from an online job-seeker website where former and current employees can review and rate the company. To transform the data into informative insight, HR wants our help to analyze which topics or contexts appear in the reviews written on the website.
The following chunk loads the libraries you can use for text analysis and topic modeling.
# Data Wrangling
library(dplyr)
library(lubridate)
library(stringr)
library(tidyr)
# Text Analysis
library(tidytext)
library(textclean)
library(SnowballC)
library(hunspell)
# Topic Modeling
library(textmineR)
# Data Visualization
library(ggplot2)
library(ggwordcloud)
library(scales)
# Extra Function
source("extra_function.R")
options(scipen = 999)
The data is imported by reading workplace_review2.csv and saving it as df_review.
# Import data
df_review <- read.csv("data/workplace_review2.csv") %>%
glimpse()
## Rows: 3,468
## Columns: 7
## $ location <chr> "", "Mountain View, CA", "New York, NY", "Mountain View, ...
## $ date <chr> " Dec 11, 2018", " Jun 21, 2013", " May 10, 2014", " Feb ...
## $ status <chr> "Current Employee ", "Former Employee ", "Current Employe...
## $ job_title <chr> " Anonymous Employee", " Program Manager", " Software Eng...
## $ summary <chr> "Best Company to work for", "Moving at the speed of light...
## $ overall <int> 5, 5, 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "1) Work/life balan...
Data Description:
- location: office location of the reviewer
- date: date the review was written
- status: employment status of the reviewer (current or former employee)
- job_title: job title of the reviewer
- summary: short headline/summary of the review
- overall: overall rating given to the company
- review: full review text
Transform the data type of the date column, and create new columns named review_day, review_month, and review_year that respectively contain the day of the month, the month, and the year of the review.
# Convert the date column from character to Date
df_review <- df_review %>%
mutate(date = mdy(date)) %>%
# Create new columns named "review_day", "review_month", "review_year"
mutate(review_day = day(date),
review_month = month(date, label = TRUE),
review_year = year(date)
)
glimpse(df_review)
## Rows: 3,468
## Columns: 10
## $ location <chr> "", "Mountain View, CA", "New York, NY", "Mountain Vie...
## $ date <date> 2018-12-11, 2013-06-21, 2014-05-10, 2015-02-08, 2018-...
## $ status <chr> "Current Employee ", "Former Employee ", "Current Empl...
## $ job_title <chr> " Anonymous Employee", " Program Manager", " Software ...
## $ summary <chr> "Best Company to work for", "Moving at the speed of li...
## $ overall <int> 5, 5, 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "1) Work/life ba...
## $ review_day <int> 11, 21, 10, 8, 9, 6, 29, 2, 1, 30, 29, 28, 27, 9, 20, ...
## $ review_month <ord> Des, Jun, Mei, Feb, Des, Des, Nov, Des, Des, Nov, Nov,...
## $ review_year <dbl> 2018, 2013, 2014, 2015, 2018, 2018, 2018, 2018, 2018, ...
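Note that the review_month labels are printed as Indonesian abbreviations ("Des", "Mei") because month(label = TRUE) follows the system locale. If you prefer English labels, here is a minimal sketch; the exact locale name (e.g. "C" or "English") depends on the operating system:
# Show English month abbreviations alongside the locale-dependent ones,
# without overwriting the original column
df_review %>%
  mutate(review_month_en = month(date, label = TRUE, locale = "C")) %>%
  select(date, review_month, review_month_en) %>%
  head()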
Check the earliest and the latest date by applying the range() function to the date column.
range(df_review$date)
## [1] "2008-02-14" "2018-12-11"
Some reviews are too old and no longer relevant to current conditions. We will only keep reviews from 2015 onward.
df_review <- df_review %>%
filter(review_year >= 2015) %>%
glimpse()
## Rows: 2,488
## Columns: 10
## $ location <chr> "", "Mountain View, CA", "", "", "", "", "", "", "Los ...
## $ date <date> 2018-12-11, 2015-02-08, 2018-12-09, 2018-12-06, 2018-...
## $ status <chr> "Current Employee ", "Current Employee ", "Current Emp...
## $ job_title <chr> " Anonymous Employee", " Anonymous Employee", " Anonym...
## $ summary <chr> "Best Company to work for", "The best place I've worke...
## $ overall <int> 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, 4, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "I live in SF so...
## $ review_day <int> 11, 8, 9, 6, 29, 2, 1, 30, 29, 28, 27, 9, 20, 6, 21, 4...
## $ review_month <ord> Des, Feb, Des, Des, Nov, Des, Des, Nov, Nov, Nov, Nov,...
## $ review_year <dbl> 2018, 2015, 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
You may want to check how many reviews come from current employees and how many come from former employees. Use count() to count the number of reviews for each employee status in the data, then describe your findings.
reviews_count <- df_review %>%
count(status, name = "Total Review")
reviews_count
First, you may want to check the top 10 job titles that give the most reviews on the website. This may give us insight into which employees complain the most. Use count() to count the number of reviews for each job title, then sort the data with arrange() and desc() to get a descending sort (from the largest to the smallest count). Describe your findings.
reviews_by_jobtitle <- df_review %>%
count(job_title, name = "Jumlah_review") %>%
arrange(desc(Jumlah_review))
head(reviews_by_jobtitle, 10)
In this part you can start text cleansing on the review column. The method and order of the cleansing steps are up to you. As the chunk below shows, general text cleansing includes lowercasing, removing mentions and hashtags, removing URLs, expanding contractions, and removing punctuation, digits, and excess whitespace.
df_review <- df_review %>%
mutate(
review_clean = review %>%
tolower() %>% # make all characters lowercase
str_replace_all(pattern = "@.*? |@.*?[:punct:]", replacement = " ") %>%
str_remove_all(pattern = "\\*") %>%
str_remove_all(pattern = "-") %>%
replace_hash() %>% # remove hashtag
replace_url() %>%
replace_contraction() %>%
str_replace_all(pattern = "[:punctuation:]", replacement = " ") %>%
str_remove_all(pattern = "[:digit:]") %>%
str_trim() %>%
str_squish()
)
head(df_review)
In this part, check the statistics and summary of the review lengths. Are there any clean reviews that only contain one or two tokens/words?
text_length <- sapply(strsplit(df_review$review_clean, " "), length)
summary(text_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    6.00   10.00   20.12   21.00  777.00
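The minimum of 1 confirms that some cleaned reviews contain only a single token. A quick way to eyeball them before choosing a cutoff, reusing the text_length vector computed above:
# Inspect reviews whose cleaned text has only one or two tokens
df_review %>%
  filter(text_length <= 2) %>%
  select(review, review_clean) %>%
  head()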
Since most reviews are quite short, you can choose a minimum document length of 10 (or any number you want). Make sure that your final data still has at least 1,000 rows/observations so the model has enough data to learn from.
df_review <- df_review %>%
filter(text_length >= 10 )
glimpse(df_review)
## Rows: 1,316
## Columns: 11
## $ location <chr> "Mountain View, CA", "", "", "Mountain View, CA", "Sea...
## $ date <date> 2015-02-08, 2018-11-28, 2018-11-27, 2018-12-09, 2018-...
## $ status <chr> "Current Employee ", "Current Employee ", "Current Emp...
## $ job_title <chr> " Anonymous Employee", " Anonymous Employee", " Anonym...
## $ summary <chr> "The best place I've worked and also the most demandin...
## $ overall <int> 4, 3, 3, 5, 4, 4, 4, 5, 3, 3, 5, 4, 3, 3, 3, 2, 3, 3, ...
## $ review <chr> "I live in SF so the commute can take between 1.5 hour...
## $ review_day <int> 8, 28, 27, 9, 13, 19, 19, 19, 2, 16, 6, 19, 10, 10, 9,...
## $ review_month <ord> Feb, Nov, Nov, Des, Nov, Nov, Nov, Nov, Des, Nov, Nov,...
## $ review_year <dbl> 2015, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ review_clean <chr> "i live in sf so the commute can take between hours to...
In this part, you can start to tokenize the words in each review. Before tokenizing, make sure you create a document_id column containing an index or identifier for each document/review. Then use unnest_tokens() to do the tokenization, so that each row contains a single token/word.
df_token <- df_review %>%
mutate(document_id = rownames(.)) %>%
unnest_tokens(output = "word",
input = review_clean
) %>%
select(document_id, date, review_month, word)
head(df_token, 15)
Remove stop words from your data using filter(). Since the reviews are in English, you can use the stop_words data from the tidytext package.
df_token <- df_token %>%
filter( !(word %in% stop_words$word ))
head(df_token)
After removing the stop words, you also need to do word stemming, which transforms every word into its base form, such as "walking" into "walk". You can use the wordStem() function for stemming with Porter's algorithm, or the hunspell stemming method covered in the course material.
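The stem_hunspell() function used in the next chunk is sourced from extra_function.R and is not shown here; a minimal sketch of what such a helper might look like (an assumption about its behavior, not the file's actual contents):
# Hypothetical reconstruction of the sourced helper: hunspell_stem()
# returns candidate stems for a word; take the last one, or keep the
# original word when hunspell has no suggestion
stem_hunspell_sketch <- function(term) {
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) term else stems[length(stems)]
}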
df_token <- df_token %>%
mutate(word = sapply(word, stem_hunspell))
head(df_token, 50)
Save the stemmed df_token:
write.csv(df_token, "stemmed_df_token.csv")
Before building a topic model, it is good to first check the top 50 words in the tokenized data. First, count how often each word appears using count() and save the result as a token_count object. Don't forget to use arrange() and desc() to sort the data, since we only want the top 50 words by frequency.
token_count <- df_token %>%
count(word, name = "value") %>%
arrange(desc(value))
head(token_count, 20)
You can create a word cloud to see at a glance which words appear most often across all reviews. Describe your findings.
token_count %>%
head(50) %>%
ggplot(aes(label = word,
size = value, # higher value column will have bigger size
color = value) # higher value column will have darker color
) +
geom_text_wordcloud() +
scale_size_area(max_size = 15) +
scale_color_gradient(low = "lightpink", high = "firebrick4") +
theme_void()
You can continue by removing rare words, i.e. words that appear in only a few documents. For this session we will remove words that appear in 5 or fewer documents. We also need to remove words that appear in at least 80% of the documents. To do that, count the words in each document using count() with the document_id and word columns, then count() again on the word column to get the number of documents each word appears in. Save the object as frequent_token.
# Count how many documents each word appears in
number_document <- nrow(df_review)
frequent_token <- df_token %>%
count(document_id, word) %>%
count(word, name = "appearance") %>%
arrange(desc(appearance))
head(frequent_token)
# Remove words that appear in at least 80% of all reviews
top_word <- frequent_token %>%
filter(appearance >= (0.8 * number_document)) %>%
pull(word)
# Remove words that appear in 5 or fewer reviews
rare_word <- frequent_token %>%
filter(appearance <= 5) %>%
pull(word)
# Create a custom stop word list from the frequent and rare words
custom_stop_word <- c(top_word, rare_word)
head(custom_stop_word)
## access acquired advantage disagree ambitious approach
## "access" "acquire" "advantage" "agree" "ambitious" "approach"
Finally, you can filter the frequent and rare tokens out of your data using filter(), then create a sparse document-term matrix that will be used as the input for the topic modeling process.
df_token <- df_token %>%
filter( !(word %in% custom_stop_word ) )
head(df_token)
topic_dtm <- df_token %>%
count(document_id, word) %>%
cast_sparse(row = document_id,
column = word,
value = n)
topic_dtm[1:10, 1:9]
## 10 x 9 sparse Matrix of class "dgCMatrix"
## ability affect balance boss busy career comfortable commute company
## 1 1 1 2 2 1 1 1 2 1
## 10 . . . . . . . . .
## 100 . . . . . . . . .
## 1000 . . . . . . . . .
## 1001 . . . . . 3 . . .
## 1002 . . . . . . . . .
## 1003 . . . . . 1 . . .
## 1004 . . . . . . . . .
## 1005 . . . . . . . . 1
## 1006 . . . . . . . . .
Since reviews can be very broad in terms of topics, we will create an LDA model with 20 topics. The model will use 1,000 sampling iterations. Save/assign the model as lda_review.
lda_review <- FitLdaModel(dtm = topic_dtm,
k = 20, # number of topics
iterations = 1000 # sampling iterations
)
# Save lda_review
lda_review %>%
saveRDS("lda_review.Rds")names(lda_review)## [1] "phi" "theta" "gamma" "data" "alpha" "beta"
## [7] "coherence"
Finally, you can start exploring the topics by visualizing the top n words of each topic using word clouds. Analyze some topics that pique your interest and describe what each topic is about and what we, as HR, can do to handle it.
# Topic Visualization
lda_review$theta %>%
head()
## t_1 t_2 t_3 t_4 t_5 t_6
## 1 0.001010101 0.162626263 0.001010101 0.001010101 0.021212121 0.001010101
## 10 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905
## 100 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1001 0.037931034 0.003448276 0.003448276 0.037931034 0.003448276 0.003448276
## 1002 0.140000000 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667
## t_7 t_8 t_9 t_10 t_11 t_12
## 1 0.001010101 0.001010101 0.132323232 0.031313131 0.001010101 0.001010101
## 10 0.052380952 0.004761905 0.528571429 0.004761905 0.004761905 0.052380952
## 100 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.683333333 0.016666667 0.016666667 0.016666667
## 1001 0.003448276 0.003448276 0.003448276 0.003448276 0.141379310 0.486206897
## 1002 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667
## t_13 t_14 t_15 t_16 t_17 t_18
## 1 0.061616162 0.001010101 0.021212121 0.203030303 0.021212121 0.001010101
## 10 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905 0.052380952
## 100 0.183333333 0.183333333 0.350000000 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1001 0.003448276 0.003448276 0.003448276 0.003448276 0.003448276 0.244827586
## 1002 0.006666667 0.006666667 0.006666667 0.273333333 0.140000000 0.340000000
## t_19 t_20
## 1 0.132323232 0.203030303
## 10 0.242857143 0.004761905
## 100 0.016666667 0.016666667
## 1000 0.016666667 0.016666667
## 1001 0.003448276 0.003448276
## 1002 0.006666667 0.006666667
lda_review$phi %>%
as.data.frame()
rowSums(lda_review$phi)
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## t_17 t_18 t_19 t_20
## 1 1 1 1
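The rowSums() output above confirms each row of phi is a probability distribution over words. The commented-out topic_cloud() helper in the next chunk comes from extra_function.R; here is a minimal sketch of how such a per-topic word cloud could be built directly from phi (an assumed approach, not the helper's actual code):
# Sketch: word cloud of the top 15 terms of one topic, taken from phi
lda_review$phi %>%
  as.data.frame() %>%
  mutate(topic = rownames(.)) %>%
  pivot_longer(-topic, names_to = "word", values_to = "phi") %>%
  filter(topic == "t_1") %>%
  slice_max(phi, n = 15) %>%
  ggplot(aes(label = word, size = phi, color = phi)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 12) +
  theme_void()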
GetTopTerms(lda_review$phi, 10) %>%
as.data.frame()
# topic_cloud(lda_review, n = 15)
# ggsave("work_review.png")
View the 10 reviews most representative of topic 1.
df_review_baru <- df_review %>%
mutate(document_id = rownames(.))
The first topic contains the terms performance, system, process, role, target, base, and cycle. It may be about a poorly implemented performance review system.
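The get_top_news() helper used in the chunks below is also sourced from extra_function.R. Assuming it ranks documents by their theta (document-topic proportion) for the chosen topic and joins the original reviews back in, a plausible sketch:
# Hypothetical reconstruction: sort documents by their theta for one
# topic, then attach the original reviews by document_id
get_top_news_sketch <- function(lda, topic, data) {
  lda$theta %>%
    as.data.frame() %>%
    mutate(document_id = rownames(.)) %>%
    arrange(desc(.data[[paste0("t_", topic)]])) %>%
    left_join(data, by = "document_id")
}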
lda_review %>%
get_top_news(topic = 1, # find the reviews most representative of topic 1
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The second topic contains the terms team, leadership, lack, join, people, and culture. It may be about a lack of leadership in the teams.
lda_review %>%
get_top_news(topic = 2, # find the reviews most representative of topic 2
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The third topic contains the terms impact, culture, difficult, startup, feel, and huge. It may be about how difficult it feels to work at such a huge company.
lda_review %>%
get_top_news(topic = 3, # find the reviews most representative of topic 3
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The fourth topic contains the terms commute, hard, view, hour, and effort. It is likely about how hard the commute to work from San Francisco to Mountain View is.
lda_review %>%
get_top_news(topic = 4, # find the reviews most representative of topic 4
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
Check the range of dates over which the reviews were submitted.
range(df_review_baru$date)
## [1] "2015-01-02" "2018-12-10"
The earliest review is from 2 January 2015 and the latest is from 10 December 2018. Check the proportion of the first eight topics over time by aggregating the data into monthly intervals.
rev_doc_topic <- lda_review %>%
get_top_news(topic = 1, data = df_review_baru)
topic_agg <- rev_doc_topic %>%
pivot_longer(paste0("t_", 1:8), names_to = "topic", values_to = "theta") %>%
select(date, summary, topic, theta) %>%
mutate(topic = str_replace_all(topic, "t_", "Topic "),
time = floor_date(date, unit = "month")
) %>%
group_by(time, topic) %>%
summarise(theta = mean(theta))
head(topic_agg, 10)
Use a line chart to draw the pattern of each topic over time.
topic_agg %>%
ggplot(aes(time, theta, fill = topic, color = topic)) +
geom_line() +
geom_point(show.legend = F) +
theme_minimal() +
theme(legend.position = "top", panel.grid.minor.y = element_blank()) +
scale_x_date(date_breaks = "years",
labels = date_format("%b\n%y")) +
scale_y_continuous() +
facet_wrap(~topic) +
labs(x = NULL, y = expression(theta), color = NULL,
title = "Topic Proportions Over Time on monthly Interval") From the chart, we find that some topic dominated reviews in a single month.In March 2017, the topic domination is about the poorly implemented Performance review system. Topic on lack of leadership on the Teams Dominated in August 2018.