You are part of the data initiative team at a reputable start-up company. The company is ambitious about implementing data-driven decision making in every division, including the Human Resources (HR) department. The department often collects feedback and reviews via group discussions held quarterly. However, some employees may not want to share everything directly with HR for various reasons, such as fear of being judged by others or other personal concerns. To overcome this problem, HR also wants to collect feedback and reviews from an external source. HR decided to collect and analyze reviews from an online job-seeker website where former and current employees can review and rate the company. To transform the data into informative insight, HR wants our help to analyze which topics or contexts appear in the reviews written on the website.
The following chunk loads the libraries you can use for text analysis and topic modeling.
# Data Wrangling
library(dplyr)
library(lubridate)
library(stringr)
library(tidyr)
# Text Analysis
library(tidytext)
library(textclean)
library(SnowballC)
library(hunspell)
# Topic Modeling
library(textmineR)
# Data Visualization
library(ggplot2)
library(ggwordcloud)
library(scales)
# Extra Function
source("extra_function.R")
options(scipen = 999)
The data is imported by reading workplace_review2.csv and saving it as df_review.
# Import data
df_review <- read.csv("data/workplace_review2.csv") %>%
glimpse()
## Rows: 3,468
## Columns: 7
## $ location <chr> "", "Mountain View, CA", "New York, NY", "Mountain View, ...
## $ date <chr> " Dec 11, 2018", " Jun 21, 2013", " May 10, 2014", " Feb ...
## $ status <chr> "Current Employee ", "Former Employee ", "Current Employe...
## $ job_title <chr> " Anonymous Employee", " Program Manager", " Software Eng...
## $ summary <chr> "Best Company to work for", "Moving at the speed of light...
## $ overall <int> 5, 5, 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "1) Work/life balan...
Data Description:
- location: office location of the reviewer
- date: date the review was written
- status: employment status of the reviewer (current or former employee)
- job_title: job title of the reviewer
- summary: short headline/summary of the review
- overall: overall rating given to the company
- review: full review text
Transform the data type of the date column, and create new columns named review_day, review_month, and review_year that respectively contain the day of the month, the month, and the year of the review.
# Convert the date column from character to Date
df_review <- df_review %>%
mutate(date = mdy(date)) %>%
# Create new columns named "review_day", "review_month", "review_year"
mutate(review_day = day(date),
review_month = month(date, label = TRUE),
review_year = year(date)
)
glimpse(df_review)
## Rows: 3,468
## Columns: 10
## $ location <chr> "", "Mountain View, CA", "New York, NY", "Mountain Vie...
## $ date <date> 2018-12-11, 2013-06-21, 2014-05-10, 2015-02-08, 2018-...
## $ status <chr> "Current Employee ", "Former Employee ", "Current Empl...
## $ job_title <chr> " Anonymous Employee", " Program Manager", " Software ...
## $ summary <chr> "Best Company to work for", "Moving at the speed of li...
## $ overall <int> 5, 5, 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "1) Work/life ba...
## $ review_day <int> 11, 21, 10, 8, 9, 6, 29, 2, 1, 30, 29, 28, 27, 9, 20, ...
## $ review_month <ord> Des, Jun, Mei, Feb, Des, Des, Nov, Des, Des, Nov, Nov,...
## $ review_year <dbl> 2018, 2013, 2014, 2015, 2018, 2018, 2018, 2018, 2018, ...
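Note that the review_month labels are printed as Indonesian abbreviations ("Des", "Mei") because month(label = TRUE) follows the system locale. If you prefer English labels, here is a minimal sketch; the exact locale name (e.g. "C" or "English") depends on the operating system:
# Show English month abbreviations alongside the locale-dependent ones,
# without overwriting the original column
df_review %>%
  mutate(review_month_en = month(date, label = TRUE, locale = "C")) %>%
  select(date, review_month, review_month_en) %>%
  head()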
Check the earliest and the latest date by applying the range() function to the date column.
range(df_review$date)
## [1] "2008-02-14" "2018-12-11"
Some reviews are too old and no longer relevant to current conditions. We will only keep reviews from 2015 onward.
df_review <- df_review %>%
filter(review_year >= 2015) %>%
glimpse()
## Rows: 2,488
## Columns: 10
## $ location <chr> "", "Mountain View, CA", "", "", "", "", "", "", "Los ...
## $ date <date> 2018-12-11, 2015-02-08, 2018-12-09, 2018-12-06, 2018-...
## $ status <chr> "Current Employee ", "Current Employee ", "Current Emp...
## $ job_title <chr> " Anonymous Employee", " Anonymous Employee", " Anonym...
## $ summary <chr> "Best Company to work for", "The best place I've worke...
## $ overall <int> 5, 4, 4, 5, 3, 5, 5, 5, 4, 3, 3, 5, 4, 2, 3, 4, 4, 4, ...
## $ review <chr> "Bureaucracy is slowing things down", "I live in SF so...
## $ review_day <int> 11, 8, 9, 6, 29, 2, 1, 30, 29, 28, 27, 9, 20, 6, 21, 4...
## $ review_month <ord> Des, Feb, Des, Des, Nov, Des, Des, Nov, Nov, Nov, Nov,...
## $ review_year <dbl> 2018, 2015, 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
You may want to check how many reviews come from current employees and how many come from former employees. Use count() to count the number of reviews for each employee status in the data, then describe your findings.
reviews_count <- df_review %>%
count(status, name = "Total Review")
reviews_count
First, you may want to check the top 10 job titles that give the most reviews on the website. This may give us insight into which employees complain the most. Use count() to count the number of reviews for each job title, then sort the data with arrange() and desc() to get a descending sort (from the largest to the smallest count). Describe your findings.
reviews_by_jobtitle <- df_review %>%
count(job_title, name = "Jumlah_review") %>%
arrange(desc(Jumlah_review))
head(reviews_by_jobtitle, 10)
In this part you can start text cleansing on the review column. The method and order of the cleansing steps are up to you. As the chunk below shows, general text cleansing includes lowercasing, removing mentions and hashtags, removing URLs, expanding contractions, and removing punctuation, digits, and excess whitespace.
df_review <- df_review %>%
mutate(
review_clean = review %>%
tolower() %>% # make all characters lowercase
str_replace_all(pattern = "@.*? |@.*?[:punct:]", replacement = " ") %>%
str_remove_all(pattern = "\\*") %>%
str_remove_all(pattern = "-") %>%
replace_hash() %>% # remove hashtag
replace_url() %>%
replace_contraction() %>%
str_replace_all(pattern = "[:punctuation:]", replacement = " ") %>%
str_remove_all(pattern = "[:digit:]") %>%
str_trim() %>%
str_squish()
)
head(df_review)
In this part, check the statistics and summary of the review lengths. Are there any clean reviews that only contain one or two tokens/words?
text_length <- sapply(strsplit(df_review$review_clean, " "), length)
summary(text_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    6.00   10.00   20.12   21.00  777.00
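The minimum of 1 confirms that some cleaned reviews contain only a single token. A quick way to eyeball them before choosing a cutoff, reusing the text_length vector computed above:
# Inspect reviews whose cleaned text has only one or two tokens
df_review %>%
  filter(text_length <= 2) %>%
  select(review, review_clean) %>%
  head()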
Since most reviews are quite short, you can choose a minimum document length of 10 (or any number you want). Make sure that your final data still has at least 1,000 rows/observations so the model has enough data to learn from.
df_review <- df_review %>%
filter(text_length >= 10 )
glimpse(df_review)
## Rows: 1,316
## Columns: 11
## $ location <chr> "Mountain View, CA", "", "", "Mountain View, CA", "Sea...
## $ date <date> 2015-02-08, 2018-11-28, 2018-11-27, 2018-12-09, 2018-...
## $ status <chr> "Current Employee ", "Current Employee ", "Current Emp...
## $ job_title <chr> " Anonymous Employee", " Anonymous Employee", " Anonym...
## $ summary <chr> "The best place I've worked and also the most demandin...
## $ overall <int> 4, 3, 3, 5, 4, 4, 4, 5, 3, 3, 5, 4, 3, 3, 3, 2, 3, 3, ...
## $ review <chr> "I live in SF so the commute can take between 1.5 hour...
## $ review_day <int> 8, 28, 27, 9, 13, 19, 19, 19, 2, 16, 6, 19, 10, 10, 9,...
## $ review_month <ord> Feb, Nov, Nov, Des, Nov, Nov, Nov, Nov, Des, Nov, Nov,...
## $ review_year <dbl> 2015, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ review_clean <chr> "i live in sf so the commute can take between hours to...
In this part, you can start to tokenize the words in each review. Before tokenizing, make sure you create a document_id column containing an index or identifier for each document/review. Then use unnest_tokens() to do the tokenization, so that each row contains a single token/word.
df_token <- df_review %>%
mutate(document_id = rownames(.)) %>%
unnest_tokens(output = "word",
input = review_clean
) %>%
select(document_id, date, review_month, word)
head(df_token, 15)
Remove stop words from your data using filter(). Since the reviews are in English, you can use the stop_words data from the tidytext package.
df_token <- df_token %>%
filter( !(word %in% stop_words$word ))
head(df_token)
After removing the stop words, you also need to do word stemming, which transforms every word into its base form, such as "walking" into "walk". You can use the wordStem() function for stemming with Porter's algorithm, or the hunspell stemming method covered in the course material.
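The stem_hunspell() function used in the next chunk is sourced from extra_function.R and is not shown here; a minimal sketch of what such a helper might look like (an assumption about its behavior, not the file's actual contents):
# Hypothetical reconstruction of the sourced helper: hunspell_stem()
# returns candidate stems for a word; take the last one, or keep the
# original word when hunspell has no suggestion
stem_hunspell_sketch <- function(term) {
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) term else stems[length(stems)]
}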
df_token <- df_token %>%
mutate(word = sapply(word, stem_hunspell))
head(df_token, 50)
Save the stemmed df_token:
write.csv(df_token, "stemmed_df_token.csv")
Before building a topic model, it is good to first check the top 50 words in the tokenized data. First, count how often each word appears using count() and save the result as a token_count object. Don't forget to use arrange() and desc() to sort the data, since we only want the top 50 words by frequency.
token_count <- df_token %>%
count(word, name = "value") %>%
arrange(desc(value))
head(token_count, 20)
You can create a word cloud to see at a glance which words appear most often across all reviews. Describe your findings.
token_count %>%
head(50) %>%
ggplot(aes(label = word,
size = value, # higher value column will have bigger size
color = value) # higher value column will have darker color
) +
geom_text_wordcloud() +
scale_size_area(max_size = 15) +
scale_color_gradient(low = "lightpink", high = "firebrick4") +
theme_void()
You can continue by removing rare words, i.e. words that appear in only a few documents. For this session we will remove words that appear in 5 or fewer documents. We also need to remove words that appear in at least 80% of the documents. To do that, count the words in each document using count() with the document_id and word columns, then count() again on the word column to get the number of documents each word appears in. Save the object as frequent_token.
# Count how many documents each word appears in
number_document <- nrow(df_review)
frequent_token <- df_token %>%
count(document_id, word) %>%
count(word, name = "appearance") %>%
arrange(desc(appearance))
head(frequent_token)
# Remove words that appear in at least 80% of all reviews
top_word <- frequent_token %>%
filter(appearance >= (0.8 * number_document)) %>%
pull(word)
# Remove words that appear in 5 or fewer reviews
rare_word <- frequent_token %>%
filter(appearance <= 5) %>%
pull(word)
# Create a custom stop word list from the frequent and rare words
custom_stop_word <- c(top_word, rare_word)
head(custom_stop_word)
## access acquired advantage disagree ambitious approach
## "access" "acquire" "advantage" "agree" "ambitious" "approach"
Finally, you can filter the frequent and rare tokens out of your data using filter(), then create a sparse document-term matrix that will be used as the input for the topic modeling process.
df_token <- df_token %>%
filter( !(word %in% custom_stop_word ) )
head(df_token)
topic_dtm <- df_token %>%
count(document_id, word) %>%
cast_sparse(row = document_id,
column = word,
value = n)
topic_dtm[1:10, 1:9]
## 10 x 9 sparse Matrix of class "dgCMatrix"
## ability affect balance boss busy career comfortable commute company
## 1 1 1 2 2 1 1 1 2 1
## 10 . . . . . . . . .
## 100 . . . . . . . . .
## 1000 . . . . . . . . .
## 1001 . . . . . 3 . . .
## 1002 . . . . . . . . .
## 1003 . . . . . 1 . . .
## 1004 . . . . . . . . .
## 1005 . . . . . . . . 1
## 1006 . . . . . . . . .
Since reviews can be very broad in terms of topics, we will create an LDA model with 20 topics. The model will use 1,000 sampling iterations. Save/assign the model as lda_review.
lda_review <- FitLdaModel(dtm = topic_dtm,
k = 20, # number of topics
iterations = 1000 # sampling iterations
)
# Save lda_review
lda_review %>%
saveRDS("lda_review.Rds")names(lda_review)## [1] "phi" "theta" "gamma" "data" "alpha" "beta"
## [7] "coherence"
Finally, you can start exploring the topics by visualizing the top n words of each topic using word clouds. Analyze some topics that pique your interest and describe what each topic is about and what we, as HR, can do to handle it.
# Topic Visualization
lda_review$theta %>%
head()
## t_1 t_2 t_3 t_4 t_5 t_6
## 1 0.001010101 0.162626263 0.001010101 0.001010101 0.021212121 0.001010101
## 10 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905
## 100 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1001 0.037931034 0.003448276 0.003448276 0.037931034 0.003448276 0.003448276
## 1002 0.140000000 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667
## t_7 t_8 t_9 t_10 t_11 t_12
## 1 0.001010101 0.001010101 0.132323232 0.031313131 0.001010101 0.001010101
## 10 0.052380952 0.004761905 0.528571429 0.004761905 0.004761905 0.052380952
## 100 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.683333333 0.016666667 0.016666667 0.016666667
## 1001 0.003448276 0.003448276 0.003448276 0.003448276 0.141379310 0.486206897
## 1002 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667 0.006666667
## t_13 t_14 t_15 t_16 t_17 t_18
## 1 0.061616162 0.001010101 0.021212121 0.203030303 0.021212121 0.001010101
## 10 0.004761905 0.004761905 0.004761905 0.004761905 0.004761905 0.052380952
## 100 0.183333333 0.183333333 0.350000000 0.016666667 0.016666667 0.016666667
## 1000 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667 0.016666667
## 1001 0.003448276 0.003448276 0.003448276 0.003448276 0.003448276 0.244827586
## 1002 0.006666667 0.006666667 0.006666667 0.273333333 0.140000000 0.340000000
## t_19 t_20
## 1 0.132323232 0.203030303
## 10 0.242857143 0.004761905
## 100 0.016666667 0.016666667
## 1000 0.016666667 0.016666667
## 1001 0.003448276 0.003448276
## 1002 0.006666667 0.006666667
lda_review$phi %>%
as.data.frame()
rowSums(lda_review$phi)
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## t_17 t_18 t_19 t_20
## 1 1 1 1
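The rowSums() output above confirms each row of phi is a probability distribution over words. The commented-out topic_cloud() helper in the next chunk comes from extra_function.R; here is a minimal sketch of how such a per-topic word cloud could be built directly from phi (an assumed approach, not the helper's actual code):
# Sketch: word cloud of the top 15 terms of one topic, taken from phi
lda_review$phi %>%
  as.data.frame() %>%
  mutate(topic = rownames(.)) %>%
  pivot_longer(-topic, names_to = "word", values_to = "phi") %>%
  filter(topic == "t_1") %>%
  slice_max(phi, n = 15) %>%
  ggplot(aes(label = word, size = phi, color = phi)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 12) +
  theme_void()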
GetTopTerms(lda_review$phi, 10) %>%
as.data.frame()
# topic_cloud(lda_review, n = 15)
# ggsave("work_review.png")
View the 10 reviews most representative of topic 1.
df_review_baru <- df_review %>%
mutate(document_id = rownames(.))
The first topic contains the terms performance, system, process, role, target, base, and cycle. It may be about a poorly implemented performance review system.
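The get_top_news() helper used in the chunks below is also sourced from extra_function.R. Assuming it ranks documents by their theta (document-topic proportion) for the chosen topic and joins the original reviews back in, a plausible sketch:
# Hypothetical reconstruction: sort documents by their theta for one
# topic, then attach the original reviews by document_id
get_top_news_sketch <- function(lda, topic, data) {
  lda$theta %>%
    as.data.frame() %>%
    mutate(document_id = rownames(.)) %>%
    arrange(desc(.data[[paste0("t_", topic)]])) %>%
    left_join(data, by = "document_id")
}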
lda_review %>%
get_top_news(topic = 1, # find the reviews most representative of topic 1
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The second topic contains the terms team, leadership, lack, join, people, and culture. It may be about a lack of leadership in the teams.
lda_review %>%
get_top_news(topic = 2, # find the reviews most representative of topic 2
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The third topic contains the terms impact, culture, difficult, startup, feel, and huge. It may be about how difficult it feels to work at such a huge company.
lda_review %>%
get_top_news(topic = 3, # find the reviews most representative of topic 3
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
The fourth topic contains the terms commute, hard, view, hour, and effort. It is likely about how hard the commute to work from San Francisco to Mountain View is.
lda_review %>%
get_top_news(topic = 4, # find the reviews most representative of topic 4
data = df_review_baru) %>% # data cleansed before it was tokenized
distinct() %>%
head(10) %>% # take the first 10 reviews
select(date, summary, review) # keep only the date, summary, and review text
Check the range of dates over which the reviews were submitted.
range(df_review_baru$date)
## [1] "2015-01-02" "2018-12-10"
The earliest review is from 2 January 2015 and the latest is from 10 December 2018. Check the proportion of the first eight topics over time by aggregating the data into monthly intervals.
rev_doc_topic <- lda_review %>%
get_top_news(topic = 1, data = df_review_baru)
topic_agg <- rev_doc_topic %>%
pivot_longer(paste0("t_", 1:8), names_to = "topic", values_to = "theta") %>%
select(date, summary, topic, theta) %>%
mutate(topic = str_replace_all(topic, "t_", "Topic "),
time = floor_date(date, unit = "month")
) %>%
group_by(time, topic) %>%
summarise(theta = mean(theta))
head(topic_agg, 10)
Use a line chart to draw the pattern of each topic over time.
topic_agg %>%
ggplot(aes(time, theta, fill = topic, color = topic)) +
geom_line() +
geom_point(show.legend = F) +
theme_minimal() +
theme(legend.position = "top", panel.grid.minor.y = element_blank()) +
scale_x_date(date_breaks = "years",
labels = date_format("%b\n%y")) +
scale_y_continuous() +
facet_wrap(~topic) +
labs(x = NULL, y = expression(theta), color = NULL,
title = "Topic Proportions Over Time on monthly Interval") From the chart, we find that some topic dominated reviews in a single month.In March 2017, the topic domination is about the poorly implemented Performance review system. Topic on lack of leadership on the Teams Dominated in August 2018.