Text mining is a machine learning technique that processes unstructured text data to identify meaningful patterns and new insights.
Topic modeling applies text mining to discover topics or themes within a collection of documents (text data).
It can be applied in many fields, such as:
Topic modeling has been widely used and developed for English, but it can also be applied to other languages like Bahasa Indonesia. However, it is important to note that the availability and quality of resources for topic modeling in different languages can vary.
One of the challenges in performing topic modeling on Bahasa Indonesia data is the limited availability of language-specific resources, such as stopword lists, colloquial (slang) lexicons, and stemmers, all of which are used later in this project.
These resources are important for accurately interpreting and extracting meaningful topics from text data. Their absence can reduce the accuracy of the analysis and the impact of the insights drawn from it.
Performing more text mining and topic modeling on Indonesian data can help improve resource availability, for example by extending the lists of stopwords and slang words used in preprocessing so they can be reused in other Indonesian text mining projects.
Topic modeling can be applied to app reviews to identify the key themes that often come up in customer feedback.
The results can be used by the app owner to see what they need to improve or maintain, and by customers (app users) to see what other users think about a specific topic related to the app.
In this project, we will use the Livin by Mandiri app reviews dataset from Kaggle to perform topic modeling, and in the process extend the lists of Indonesian stopwords and slang words.
Livin by Mandiri is a digital financial service platform developed by Bank Mandiri, available for Android and iOS devices. Users can use the app to make payments, transfer money, and manage their finances on their mobile devices.
The dataset on Kaggle was originally collected by scraping reviews from the Google Play Store.
Text data processing similar to what is implemented in this project can also be used for topic modeling on:
library(dplyr)      # data wrangling
library(tm)         # document-term matrix construction
library(textclean)  # text cleaning (HTML, URLs, elongation, strip)
library(stringr)    # string manipulation
library(katadasaR)  # Indonesian word stemming
library(tokenizers) # tokenization
library(stopwords)  # multilingual stopword lists
mandiri_raw <- read.csv("data_input/mandiri_reviews.csv") %>% select(review)
Based on the output below, the dataset contains around 155k reviews.
glimpse(mandiri_raw)
#> Rows: 155,192
#> Columns: 1
#> $ review <chr> "Udah di coba, keren dan responsive, dengan tampilan yang makin…
From the sample below, we can see that the text data contains informal words (e.g. "Udah"), inconsistent casing, punctuation, and a mix of Indonesian and English:
head(mandiri_raw$review)
#> [1] "Udah di coba, keren dan responsive, dengan tampilan yang makin segar pastinya!"
#> [2] "Excellent"
#> [3] "Keren. Cakep benar semakin canggih. Terdepan terpercaya tumbuh bersama anda."
#> [4] "mantap"
#> [5] "Mantap"
#> [6] "mantap jiwa dan raga... ayo kita livinkan indonesia"
Looking at a few more entries, there are also reviews with emojis, elongated words (e.g. "betuuuuuullll"), and abbreviations (e.g. "dgn" for "dengan"):
mandiri_raw$review[21:30]
#> [1] "Mantap👍👍👍👍👍"
#> [2] "Mantap jiewaa mandiri pakai Livin"
#> [3] "Mantul mantap betuuuuuullll......"
#> [4] "Mantap, top top🙏💪"
#> [5] "Transaksi lebih aman, mudah & praktis dgn Livin' by Mandiri."
#> [6] "Luar biasa"
#> [7] "Kereeeennnn"
#> [8] "Alhamdulillah, New Livin' By Mandiri selalu dihati"
#> [9] "Semakin lengkap aplikasinya dan semakin berinovasi..sehingga memudahkan nasabah..terimakasih bank Mandiri 👍👍👍"
#> [10] "Bagus..livin semakin canggih dan tampilannya semakin segar"
anyNA(mandiri_raw)
#> [1] FALSE
We have no NA values.
We preprocess the data in several steps, starting by removing HTML markup and URLs with textclean:
mandiri_clean <- mandiri_raw$review %>%
  replace_html() %>% # strip HTML tags and symbols
  replace_url()      # remove URLs
mandiri_clean[21:30]
#> [1] "Mantap👍👍👍👍👍"
#> [2] "Mantap jiewaa mandiri pakai Livin"
#> [3] "Mantul mantap betuuuuuullll......"
#> [4] "Mantap, top top🙏💪"
#> [5] "Transaksi lebih aman, mudah & praktis dgn Livin' by Mandiri."
#> [6] "Luar biasa"
#> [7] "Kereeeennnn"
#> [8] "Alhamdulillah, New Livin' By Mandiri selalu dihati"
#> [9] "Semakin lengkap aplikasinya dan semakin berinovasi..sehingga memudahkan nasabah..terimakasih bank Mandiri 👍👍👍"
#> [10] "Bagus..livin semakin canggih dan tampilannya semakin segar"
mandiri_clean <- gsub("[[:punct:]]", " ", mandiri_clean) # replace punctuation with a space
mandiri_clean[21:30]
#> [1] "Mantap "
#> [2] "Mantap jiewaa mandiri pakai Livin"
#> [3] "Mantul mantap betuuuuuullll "
#> [4] "Mantap top top "
#> [5] "Transaksi lebih aman mudah praktis dgn Livin by Mandiri "
#> [6] "Luar biasa"
#> [7] "Kereeeennnn"
#> [8] "Alhamdulillah New Livin By Mandiri selalu dihati"
#> [9] "Semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank Mandiri "
#> [10] "Bagus livin semakin canggih dan tampilannya semakin segar"
The strip() function helps with lowercasing the text and removing leftover digits, extra whitespace, and (with apostrophe.remove = TRUE) apostrophes:
mandiri_clean <- strip(mandiri_clean, apostrophe.remove = TRUE)
mandiri_clean[21:30]
#> [1] "mantap"
#> [2] "mantap jiewaa mandiri pakai livin"
#> [3] "mantul mantap betuuuuuullll"
#> [4] "mantap top top"
#> [5] "transaksi lebih aman mudah praktis dgn livin by mandiri"
#> [6] "luar biasa"
#> [7] "kereeeennnn"
#> [8] "alhamdulillah new livin by mandiri selalu dihati"
#> [9] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"
#> [10] "bagus livin semakin canggih dan tampilannya semakin segar"
Next, we normalize elongated words; for example, we want to change words such as "betuuuuuullll" to "betul".
mandiri_clean <- replace_word_elongation(mandiri_clean)
mandiri_clean[21:30]
#> [1] "mantap"
#> [2] "mantap jiewaa mandiri pakai livin"
#> [3] "mantul mantap betul"
#> [4] "mantap top top"
#> [5] "transaksi lebih aman mudah praktis dgn livin by mandiri"
#> [6] "luar biasa"
#> [7] "keren"
#> [8] "alhamdulillah new livin by mandiri selalu dihati"
#> [9] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"
#> [10] "bagus livin semakin canggih dan tampilannya semakin segar"
# Number of reviews before removing duplicates
length(mandiri_clean)
#> [1] 155192
# Number of reviews after removing duplicates
mandiri_clean <- mandiri_clean %>% as.data.frame() %>% distinct() %>% rename(review = 1)
nrow(mandiri_clean)
#> [1] 100069
mandiri_clean$review[21:30]
#> [1] "mantap top top"
#> [2] "transaksi lebih aman mudah praktis dgn livin by mandiri"
#> [3] "luar biasa"
#> [4] "alhamdulillah new livin by mandiri selalu dihati"
#> [5] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"
#> [6] "bagus livin semakin canggih dan tampilannya semakin segar"
#> [7] "tampilannya lebih keren dan banyak pilihan juga selamat ulang tahun buat bank mandiri yang ke semoga selalu memberikan pelayanan yang terbaik"
#> [8] "makin keren mudah a lebih mudah dgn yg baru"
#> [9] "tampilan lebih fresh dan login lebih cepat nice mandiri"
#> [10] "suka sama fitur yg terbarunya simple dan mudah digunakan"
As the process may take some time, we will save the result as a csv file.
write.csv(mandiri_clean, file = "mandiri_clean.csv", row.names = FALSE)
# Load data from the csv file
mandiri_clean <- read.csv("mandiri_clean.csv")
mandiri_clean$review %>% head()
#> [1] "udah di coba keren dan responsive dengan tampilan yang makin segar pastinya"
#> [2] "excellent"
#> [3] "keren cakep benar semakin canggih terdepan terpercaya tumbuh bersama anda"
#> [4] "mantap"
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"
#> [6] "mandiri emang terbaik"
First, we get the list of Indonesian slang words we want to replace. We also save it to a csv file so we can reuse it later or in other projects.
## Get the list of slang words to replace
# Import Indonesian lexicon
spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv")
# Filter out slang entries that are already handled by replace_word_elongation()
elong_lex <-
  spell.lex %>%
  filter(category1 == "elongasi" | category2 == "elongasi" | category3 == "elongasi") %>%
  select(slang, formal) %>%
  filter(replace_word_elongation(slang) == formal) %>%
  distinct()
spell.lex_clean <-
spell.lex %>%
select(slang, formal) %>%
distinct() %>%
anti_join(elong_lex, by = c("slang", "formal"))
# Save as a csv file
write.csv(spell.lex_clean, file = "indonesian_slang.csv", row.names = FALSE)
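Filtering out the elongation entries avoids duplicating work: those slang forms are already normalized by the elongation step earlier, as the "kereeeennnn" example in the output above confirms.
# Confirmed by the outputs above: elongations collapse to the base word
replace_word_elongation("kereeeennnn")
#> [1] "keren"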
Now we replace the slang words in our dataset and save the cleaned data to another csv file, as the processing time is quite long.
# Load the indonesian slang list
spell.lex_clean <- read.csv("indonesian_slang.csv")
# Create a regex lookup table (pattern = slang wrapped in word boundaries, value = formal)
lookup_table <- setNames(spell.lex_clean$formal, paste0("\\b", spell.lex_clean$slang, "\\b"))
# Create a function to replace slang
replace_slang <- function(text) {
str_replace_all(text, lookup_table)
}
# Replace slang in the entire column
mandiri_clean_slang <- replace_slang(mandiri_clean$review)
# Save the cleaned data without slang
write.csv(mandiri_clean_slang, file = "mandiri_clean_slang.csv", row.names = FALSE)
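As a quick sanity check on a toy string: the mappings "udah" to "sudah" and "emang" to "memang" are assumed to be in the lexicon, and both substitutions are visible in the cleaned output below.
# Toy check; these two mappings appear in the cleaned output below
replace_slang("udah coba emang bagus")
#> [1] "sudah coba memang bagus"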
Load the dataset with replaced slang words.
mandiri_clean_slang <- read.csv("mandiri_clean_slang.csv") %>% setNames("review")
mandiri_clean_slang$review %>% head()
#> [1] "sudah di coba keren dan responsive dengan tampilan yang makin segar pastinya"
#> [2] "excellent"
#> [3] "keren cakep benar semakin canggih terdepan terpercaya tumbuh bersama anda"
#> [4] "mantap"
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"
#> [6] "mandiri memang terbaik"
In this step, we replace each word with its root form. For example, the root form of the English words "slept" or "sleeping" is "sleep", and the root form of the Indonesian words "membaca" or "pembacaan" is "baca". To stem Indonesian words, we use the katadasaR library.
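As a minimal sanity check, we can stem single words directly; "membaca" is the example above, and the "tampilan" to "tampil" mapping also shows up in the stemmed output below.
# katadasar() takes one word and returns its root form
katadasar("membaca")
#> [1] "baca"
katadasar("tampilan")
#> [1] "tampil"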
Once again, we save the result to a csv file to avoid the long processing time.
# Create a function to stem a vector of words (one review already tokenized)
stemming <- function(x) {
  words <- tokenize_words(x)   # one list element per word
  lapply(words, katadasar) %>% # stem each word with katadasaR::katadasar()
    unlist() %>%
    str_c(collapse = " ")      # reassemble the review into a single string
}
# Tokenize each review, then stem it word by word
mandiri_clean_stem <- lapply(tokenize_words(mandiri_clean_slang$review), stemming)
# Save as csv file
mandiri_clean_stem_unlist <- mandiri_clean_stem %>% unlist()
write.csv(mandiri_clean_stem_unlist, file = "mandiri_clean_stem_unlist.csv", row.names = FALSE)
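Note that stemming() receives a vector of words (a review already split by tokenize_words()), as in the lapply() pipeline above. A quick check on a toy word vector whose stems all appear in the output below:
# Pass a tokenized review, as the pipeline does
stemming(c("tampilan", "semakin", "terbaik"))
#> [1] "tampil makin baik"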
Load the csv file for the stemmed dataset.
# Load the csv file
mandiri_clean_stem_unlist <- read.csv("mandiri_clean_stem_unlist.csv")
mandiri_clean_stem_unlist$x %>% head()
#> [1] "sudah di coba keren dan responsive dengan tampil yang makin segar pasti"
#> [2] "excellent"
#> [3] "keren cakep benar makin canggih depan percaya tumbuh sama anda"
#> [4] "mantap"
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"
#> [6] "mandiri memang baik"
Our reviews include some English words mixed in with the Indonesian text, so we will remove both Indonesian and English stopwords.
We will also remove additional stopwords such as "mandiri", "livin", and "aplikasi", which refer to the app itself.
# Indonesian stopwords
idstopwords <- stopwords("id", source = "stopwords-iso")
idstopwords2 <- readLines("data_input/stopword_list_id_2.txt")
idstopwords_all <- c(idstopwords, idstopwords2) %>% unique %>% sort()
idstopwords_all %>% head
# English stopwords
enstopwords <-
c(stopwords("en", source = "snowball"),
stopwords("en", source = "marimo"),
stopwords("en", source = "nltk"),
stopwords("en", source = "stopwords-iso"),
stopwords("en", source = "smart")) %>% unique %>% sort()
# Additional stopwords
addstopwords <- c("mandiri", "livin", "aplikasi")
# List of all stopwords combined
all_stopwords <- c(idstopwords_all, enstopwords, addstopwords) %>% unique()
### Remove stopwords from dataset ###
mandiri_clean_stopwords <- mandiri_clean_stem_unlist$x %>% tokenize_words(stopwords = all_stopwords)
# Convert back to a dataframe, keeping the original row numbers
mandiri_clean_all <- data.frame(x = unlist(lapply(mandiri_clean_stopwords,
                                                  paste,
                                                  collapse = " ")),
                                stringsAsFactors = FALSE) %>%
  mutate(row_number = row_number()) %>%
  rename(review = x) %>%
  filter(review != "NA") # drop reviews that became empty ("NA") after cleaning
write.csv(mandiri_clean_all, "mandiri_clean_all.csv", row.names = F)
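Before moving on, a small sanity check that the combined stopword list behaves as expected; the words below come from the Indonesian list, the English lists, and the manual additions respectively.
c("yang", "the", "mandiri") %in% all_stopwords
#> [1] TRUE TRUE TRUE
# tokenize_words() drops listed stopwords during tokenization
tokenize_words("aplikasi yang mudah", stopwords = all_stopwords)
#> [[1]]
#> [1] "mudah"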
Now we have removed the listed stopwords. We saved the result to a csv file to avoid the long processing time.
Load the final cleaned dataset.
# Load csv file
mandiri_clean_all <- read.csv("mandiri_clean_all.csv")
mandiri_clean_all %>% head
#> review row_number
#> 1 coba keren responsive tampil segar 1
#> 2 excellent 2
#> 3 keren cakep canggih percaya tumbuh 3
#> 4 mantap 4
#> 5 mantap jiwa raga ayo livinkan indonesia 5
#> 6 super app 7
We can inspect the text cleaning result in both table and wordcloud form. This step was also used to identify further Indonesian stopwords and slang words to add to the lists manually.
Here we filter to see the words with a frequency greater than 10. Later, when building the machine learning model, we can further adjust this minimum word frequency.
# Create word frequency table
word_freq_tbl <- mandiri_clean_all$review %>%
tokenize_words() %>%
unlist %>%
table %>%
as.data.frame() %>%
rename(Word = 1) %>%
arrange(-Freq) %>%
filter(Freq > 10) # frequency can later be adjusted to tune the ML model
# View top 10 most frequent words
head(word_freq_tbl, 10)
#> Word Freq
#> 1 mudah 13236
#> 2 transaksi 12416
#> 3 update 11801
#> 4 bagus 10011
#> 5 buka 8423
#> 6 biru 8352
#> 7 bantu 8093
#> 8 pakai 7626
#> 9 masuk 7514
#> 10 susah 7170
# View the 10 least frequent remaining words
tail(word_freq_tbl, 10)
#> Word Freq
#> 2207 transaction 11
#> 2208 tuhan 11
#> 2209 tumben 11
#> 2210 ufdate 11
#> 2211 ultah 11
#> 2212 upgread 11
#> 2213 viturnya 11
#> 2214 weh 11
#> 2215 wiraswasta 11
#> 2216 wita 11
As the dataset is large, we will build the wordcloud from only a small subset of the data.
library(wordcloud)
wordcloud(words =
mandiri_clean_all$review %>%
head(5000) %>%
tokenize_words %>%
unlist,
max.words = 1000)
We can convert the cleaned data into a Document-Term Matrix, which will be the input for our model. We can also further tune our model by filtering low-frequency words out of the cleaned data.
# tm's DocumentTermMatrix() expects a corpus, so wrap the cleaned reviews first
mandiri_corpus <- VCorpus(VectorSource(mandiri_clean_all$review))
mandiri_dtm <- DocumentTermMatrix(mandiri_corpus)
mandiri_dtm_matrix <- as.matrix(mandiri_dtm)
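If we want to drop low-frequency terms at this stage instead, tm's control list accepts bounds on term frequency across documents. A minimal sketch, where the threshold of 11 is a hypothetical choice mirroring the Freq > 10 filter above (note that bounds count the number of documents a term appears in, not raw token frequency):
# Keep only terms that appear in at least 11 documents
mandiri_dtm_min <- DocumentTermMatrix(
  mandiri_corpus,
  control = list(bounds = list(global = c(11, Inf)))
)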