Project Background

Topic Modeling for Indonesian data

Text Mining

Text mining applies machine learning and statistical techniques to unstructured text data to identify meaningful patterns and new insights.

Topic Modeling

Topic Modeling uses text mining for discovering topics or themes within a collection of documents (text data).

It can be applied in many fields, such as:

  • social media (find trends)
  • market research (analyze customer feedback)
  • customer service (evaluate chatbot services)
  • topic tracking (find the main topics in text media)
  • and many more.

Topic modeling has been widely used and developed for English, but it can also be applied to other languages such as Bahasa Indonesia. However, the availability and quality of resources for topic modeling vary between languages.

Topic Modeling on Indonesian data

One of the challenges of performing topic modeling on Bahasa Indonesia data is the limited availability of language-specific resources. Examples of such resources are:

  • Pre-trained language models
  • Annotated data
  • Preprocessing resources
    • For stopword removal
    • For stemming
    • For replacing slang words

These resources are important for accurately interpreting text data and extracting meaningful topics from it. A lack of these resources can lead to inaccurate topics and less useful insights from the analysis.

Performing more text mining and topic modeling on Indonesian data can help improve resource availability, for example by extending the stopword and slang word lists used in preprocessing so that they can be reused in other Indonesian text mining projects.

Topic Modeling on app reviews

Topic modeling can be applied to app reviews to identify the key themes that often come up in customer feedback.

The results can be used by the app owner to see what needs to be improved or maintained, and by customers (or app users) to see what other customers think about a specific topic related to the app.

Implementation on Project

In this project, we use the Livin by Mandiri app review data from Kaggle to perform topic modeling and to improve the lists of Indonesian stopwords and slang words.

Livin by Mandiri is a digital financial service platform developed by Bank Mandiri, available for Android and iOS devices. Users can use the app to make payments, transfer money, and manage their finances on their mobile devices.

The Kaggle dataset was originally collected by scraping reviews from the Google Play Store.

Implementation to other businesses

Text processing similar to what is implemented in this project can also be used for topic modeling on:

  • reviews of other items (other apps, books, movies/films, hotels, e-commerce/shop products, etc.), or
  • other forms of text data, such as
    • identifying topics from comments/texts on social media (Instagram, Twitter, Facebook), or
    • identifying the main themes of books, news articles, video transcripts, scholarly articles, and many more.

Code

Import Libraries

library(dplyr)
library(tm)
library(textclean)
library(stringr)
library(katadasaR)
library(tokenizers)
library(stopwords)

Input Data

mandiri_raw <- read.csv("data_input/mandiri_reviews.csv") %>% select(review)

EDA

The output below shows that the dataset has 155,192 reviews.

glimpse(mandiri_raw)
#> Rows: 155,192
#> Columns: 1
#> $ review <chr> "Udah di coba, keren dan responsive, dengan tampilan yang makin…

From the data below, we can see that the text data contains:

  • Uppercase and lowercase letters
  • Punctuation
  • Duplicate review content that differs only in letter case (such as “mantap” and “Mantap”)
  • Occasional English words

head(mandiri_raw$review)
#> [1] "Udah di coba, keren dan responsive, dengan tampilan yang makin segar pastinya!"
#> [2] "Excellent"                                                                     
#> [3] "Keren. Cakep benar semakin canggih. Terdepan terpercaya tumbuh bersama anda."  
#> [4] "mantap"                                                                        
#> [5] "Mantap"                                                                        
#> [6] "mantap jiwa dan raga... ayo kita livinkan indonesia"
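To gauge how many reviews differ only by letter case, a quick check along the following lines could be run before cleaning. This is a minimal base R sketch on a toy vector; `reviews` is an assumed stand-in for `mandiri_raw$review`, and the counts apply to the toy data only.

```r
# Toy stand-in for mandiri_raw$review (assumption: a plain character vector)
reviews <- c("mantap", "Mantap", "Excellent", "mantap jiwa", "MANTAP")

# Exact duplicates: identical strings only
n_exact <- sum(duplicated(reviews))

# Case-insensitive duplicates: compare after lowercasing
n_case_insensitive <- sum(duplicated(tolower(reviews)))

n_exact
n_case_insensitive
```

Comparing the two counts gives a rough idea of how much case folding will increase the number of duplicates found later.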

Looking at more entries, some reviews also contain:

  • Emoji
  • Word elongation (“kereeeennnn” instead of “keren”)

mandiri_raw$review[21:30]
#>  [1] "Mantap👍👍👍👍👍"                                                                                                
#>  [2] "Mantap jiewaa mandiri pakai Livin"                                                                               
#>  [3] "Mantul mantap betuuuuuullll......"                                                                               
#>  [4] "Mantap, top top🙏💪"                                                                                             
#>  [5] "Transaksi lebih aman, mudah & praktis dgn Livin' by Mandiri."                                                    
#>  [6] "Luar biasa"                                                                                                      
#>  [7] "Kereeeennnn"                                                                                                     
#>  [8] "Alhamdulillah, New Livin' By Mandiri selalu dihati"                                                              
#>  [9] "Semakin lengkap aplikasinya dan semakin berinovasi..sehingga memudahkan nasabah..terimakasih bank Mandiri 👍👍👍"
#> [10] "Bagus..livin semakin canggih dan tampilannya semakin segar"
anyNA(mandiri_raw)
#> [1] FALSE

We have no NA values.
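The observations above (emoji, word elongation) could also be checked programmatically rather than by eye. The sketch below is a base R illustration on a toy vector, not part of the original analysis. The non-ASCII flag is only a rough proxy for emoji, since it also matches accented letters and other symbols.

```r
# Toy stand-in for a few raw reviews (assumption); "\U0001F44D" is the thumbs-up emoji
reviews <- c("Mantap\U0001F44D", "Kereeeennnn", "Luar biasa")

# TRUE where a review contains any character outside the ASCII range
has_non_ascii <- grepl("[^\\x01-\\x7F]", reviews, perl = TRUE)

# TRUE where any character is repeated three or more times in a row
has_elongation <- grepl("(.)\\1{2,}", reviews, perl = TRUE)

data.frame(reviews, has_non_ascii, has_elongation)
```

Applying `mean()` to either flag would give the share of affected reviews.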


Data Preprocessing

We preprocess the data by:

  1. Replacing HTML & URL
  2. Removing punctuation
  3. Case Folding
  4. Removing emoji & emoticon
  5. Removing numbers
  6. Removing unnecessary whitespace
  7. Replacing word elongation
  8. Removing duplicate reviews
  9. Replacing slang words
  10. Stemming
  11. Removing stopwords
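To make the first six steps concrete, here is a rough base R approximation. It is only a sketch: `clean_basic` is a hypothetical helper, its regular expressions are simplified assumptions, and it is not how the textclean functions used in this project are implemented. It also keeps only the letters a to z, which would drop any non-ASCII letters.

```r
# Base R approximation of steps 1-6; not the textclean implementation
clean_basic <- function(x) {
  x <- gsub("<[^>]+>", " ", x)                                # drop HTML tags
  x <- gsub("https?://\\S+|www\\.\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- tolower(x)                                             # case folding
  x <- gsub("[^a-z ]", " ", x)   # punctuation, emoji, numbers -> space
  x <- gsub(" +", " ", x)        # collapse repeated spaces
  trimws(x)                      # trim leading/trailing spaces
}

cleaned <- clean_basic("Bagus..livin 100% <b>canggih</b> https://example.com \U0001F44D")
cleaned
#> [1] "bagus livin canggih"
```

Running the steps in a fixed order matters: for instance, URLs need to be handled before punctuation is stripped, otherwise the URL pattern no longer matches.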

Replace HTML & URL

mandiri_clean <- mandiri_raw$review %>% 
  replace_html %>% 
  replace_url

mandiri_clean[21:30]
#>  [1] "Mantap👍👍👍👍👍"                                                                                                
#>  [2] "Mantap jiewaa mandiri pakai Livin"                                                                               
#>  [3] "Mantul mantap betuuuuuullll......"                                                                               
#>  [4] "Mantap, top top🙏💪"                                                                                             
#>  [5] "Transaksi lebih aman, mudah & praktis dgn Livin' by Mandiri."                                                    
#>  [6] "Luar biasa"                                                                                                      
#>  [7] "Kereeeennnn"                                                                                                     
#>  [8] "Alhamdulillah, New Livin' By Mandiri selalu dihati"                                                              
#>  [9] "Semakin lengkap aplikasinya dan semakin berinovasi..sehingga memudahkan nasabah..terimakasih bank Mandiri 👍👍👍"
#> [10] "Bagus..livin semakin canggih dan tampilannya semakin segar"

Remove punctuation

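Note in the output below that, in this run, the emoji are replaced along with the punctuation.

mandiri_clean <- gsub("[[:punct:]]", " ", mandiri_clean) # replace each punctuation mark with a space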
mandiri_clean[21:30]
#>  [1] "Mantap     "                                                                                                  
#>  [2] "Mantap jiewaa mandiri pakai Livin"                                                                            
#>  [3] "Mantul mantap betuuuuuullll      "                                                                            
#>  [4] "Mantap  top top  "                                                                                            
#>  [5] "Transaksi lebih aman  mudah   praktis dgn Livin  by Mandiri "                                                 
#>  [6] "Luar biasa"                                                                                                   
#>  [7] "Kereeeennnn"                                                                                                  
#>  [8] "Alhamdulillah  New Livin  By Mandiri selalu dihati"                                                           
#>  [9] "Semakin lengkap aplikasinya dan semakin berinovasi  sehingga memudahkan nasabah  terimakasih bank Mandiri    "
#> [10] "Bagus  livin semakin canggih dan tampilannya semakin segar"
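The choice of a space as the replacement matters for reviews where words are separated only by punctuation, such as “Bagus..livin” in the output above. A small base R comparison illustrates this:

```r
x <- "Bagus..livin semakin canggih"

# Replacing with a space keeps the two words separate
with_space <- gsub("[[:punct:]]", " ", x)

# Replacing with an empty string glues them together
with_empty <- gsub("[[:punct:]]", "", x)

with_space
#> [1] "Bagus  livin semakin canggih"
with_empty
#> [1] "Baguslivin semakin canggih"
```

The extra spaces left behind by the first variant are harmless, because whitespace is normalized in a later step.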

Text Stripping

The strip() function from textclean handles several steps at once:

  • Case Folding (changing all words to lowercase)
  • Removing punctuation
  • Removing emoji & emoticon
  • Removing numbers
  • Removing unnecessary whitespace

mandiri_clean <- strip(mandiri_clean, apostrophe.remove = TRUE)
mandiri_clean[21:30]
#>  [1] "mantap"                                                                                                 
#>  [2] "mantap jiewaa mandiri pakai livin"                                                                      
#>  [3] "mantul mantap betuuuuuullll"                                                                            
#>  [4] "mantap top top"                                                                                         
#>  [5] "transaksi lebih aman mudah praktis dgn livin by mandiri"                                                
#>  [6] "luar biasa"                                                                                             
#>  [7] "kereeeennnn"                                                                                            
#>  [8] "alhamdulillah new livin by mandiri selalu dihati"                                                       
#>  [9] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"
#> [10] "bagus livin semakin canggih dan tampilannya semakin segar"

Replace word elongation

For example, we want to change words such as “betuuuuuullll” to “betul”.

mandiri_clean <- replace_word_elongation(mandiri_clean)
mandiri_clean[21:30]
#>  [1] "mantap"                                                                                                 
#>  [2] "mantap jiewaa mandiri pakai livin"                                                                      
#>  [3] "mantul mantap betul"                                                                                    
#>  [4] "mantap top top"                                                                                         
#>  [5] "transaksi lebih aman mudah praktis dgn livin by mandiri"                                                
#>  [6] "luar biasa"                                                                                             
#>  [7] "keren"                                                                                                  
#>  [8] "alhamdulillah new livin by mandiri selalu dihati"                                                       
#>  [9] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"
#> [10] "bagus livin semakin canggih dan tampilannya semakin segar"
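The idea behind this step can be illustrated with a single regular expression. The sketch below is a simplified stand-in, not the logic of `replace_word_elongation()` itself; `collapse_elongation` is a hypothetical helper that collapses any run of three or more identical letters into one.

```r
# Simplified stand-in for replace_word_elongation(): collapse any run of
# three or more identical lowercase letters into a single letter
collapse_elongation <- function(x) {
  gsub("([a-z])\\1{2,}", "\\1", x, perl = TRUE)
}

collapsed <- collapse_elongation(c("betuuuuuullll", "kereeeennnn", "mantap"))
collapsed
#> [1] "betul"  "keren"  "mantap"
```

A rule this simple has limits: a word whose correct spelling contains a double letter would be over-corrected if it were elongated (for example, “maaaaf” would become “maf” rather than “maaf”).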

Remove duplicate reviews

# Number of reviews before removing duplicates
length(mandiri_clean)
#> [1] 155192
# Number of reviews after removing duplicates
mandiri_clean <- mandiri_clean %>% as.data.frame() %>% distinct() %>% rename(review = 1)
nrow(mandiri_clean)
#> [1] 100069
mandiri_clean$review[21:30]
#>  [1] "mantap top top"                                                                                                                               
#>  [2] "transaksi lebih aman mudah praktis dgn livin by mandiri"                                                                                      
#>  [3] "luar biasa"                                                                                                                                   
#>  [4] "alhamdulillah new livin by mandiri selalu dihati"                                                                                             
#>  [5] "semakin lengkap aplikasinya dan semakin berinovasi sehingga memudahkan nasabah terimakasih bank mandiri"                                      
#>  [6] "bagus livin semakin canggih dan tampilannya semakin segar"                                                                                    
#>  [7] "tampilannya lebih keren dan banyak pilihan juga selamat ulang tahun buat bank mandiri yang ke semoga selalu memberikan pelayanan yang terbaik"
#>  [8] "makin keren mudah a lebih mudah dgn yg baru"                                                                                                  
#>  [9] "tampilan lebih fresh dan login lebih cepat nice mandiri"                                                                                      
#> [10] "suka sama fitur yg terbarunya simple dan mudah digunakan"

Because the steps above can take some time to run, we save the result as a CSV file.

write.csv(mandiri_clean, file = "mandiri_clean.csv", row.names = FALSE)
# Load data from the csv file
mandiri_clean <- read.csv("mandiri_clean.csv")
mandiri_clean$review %>% head()
#> [1] "udah di coba keren dan responsive dengan tampilan yang makin segar pastinya"
#> [2] "excellent"                                                                  
#> [3] "keren cakep benar semakin canggih terdepan terpercaya tumbuh bersama anda"  
#> [4] "mantap"                                                                     
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"                           
#> [6] "mandiri emang terbaik"
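The save-then-reload pattern used here recurs throughout the rest of the preprocessing. One way to avoid rerunning a slow step by hand is a small caching helper. The sketch below is an assumption for illustration (`cached_csv` is a hypothetical function, not part of this project or of any package):

```r
# Hypothetical helper: load `path` if it exists; otherwise run `compute()`,
# save its result to `path`, and return it
cached_csv <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- compute()
  write.csv(result, file = path, row.names = FALSE)
  result
}

# Usage with a temporary file and a trivial stand-in for an expensive step
tmp <- tempfile(fileext = ".csv")
first  <- cached_csv(tmp, function() data.frame(review = c("mantap", "luar biasa")))
second <- cached_csv(tmp, function() stop("not called: the cached file is used"))
```

The second call returns the saved data without evaluating its `compute` function, which is the behavior we want for the slow slang replacement and stemming steps.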

Replace slang words

First, we build the list of Indonesian slang words we want to replace. We also save it to a CSV file so that it can be reused later or in other projects.

## Get the list of slang words to replace
# Import Indonesian lexicon
spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv") 

# Exclude slang entries that replace_word_elongation() already handles
elong_lex <- 
  spell.lex %>% 
  filter(category1 == "elongasi" | category2 == "elongasi" | category3 == "elongasi") %>% 
  select(slang, formal) %>% 
  filter(replace_word_elongation(slang) == formal) %>% 
  distinct()

spell.lex_clean <- 
  spell.lex %>% 
  select(slang, formal) %>% 
  distinct() %>% 
  anti_join(elong_lex, by = c("slang", "formal"))

# Save as a csv file
write.csv(spell.lex_clean, file = "indonesian_slang.csv", row.names = FALSE)

Now we replace the slang words in our dataset, and save the cleaned data in another csv file as the processing time is quite long.

# Load the indonesian slang list 
spell.lex_clean <- read.csv("indonesian_slang.csv")

# Create a lookup vector: names are regex patterns (slang wrapped in \\b word boundaries), values are the formal forms
lookup_table <- setNames(spell.lex_clean$formal, paste0("\\b", spell.lex_clean$slang, "\\b"))

# Create a function to replace slang
replace_slang <- function(text) {
  str_replace_all(text, lookup_table)
}

# Replace slang in the entire column
mandiri_clean_slang <- replace_slang(mandiri_clean$review)

# Save the cleaned data without slang
write.csv(mandiri_clean_slang, file = "mandiri_clean_slang.csv", row.names = FALSE)

Load the dataset with replaced slang words.

mandiri_clean_slang <- read.csv("mandiri_clean_slang.csv") %>% setNames("review")
mandiri_clean_slang$review %>% head()
#> [1] "sudah di coba keren dan responsive dengan tampilan yang makin segar pastinya"
#> [2] "excellent"                                                                   
#> [3] "keren cakep benar semakin canggih terdepan terpercaya tumbuh bersama anda"   
#> [4] "mantap"                                                                      
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"                            
#> [6] "mandiri memang terbaik"
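The word boundaries (`\\b`) in the lookup patterns are what keep a slang word from being replaced inside a longer word. The following base R sketch shows the effect on a toy dictionary. The three entries and the helper `replace_slang_base` are illustrative assumptions, not the full lexicon or the function used above. Slang entries containing regex metacharacters would need escaping, which this sketch does not handle.

```r
# Toy slang dictionary (illustrative entries, not the full lexicon)
slang  <- c("udah", "emang", "dgn")
formal <- c("sudah", "memang", "dengan")

replace_slang_base <- function(text, slang, formal) {
  for (i in seq_along(slang)) {
    # \\b anchors the match to whole words only
    text <- gsub(paste0("\\b", slang[i], "\\b"), formal[i], text, perl = TRUE)
  }
  text
}

replaced <- replace_slang_base(
  c("udah di coba", "sudah bagus", "mudah dgn livin"),
  slang, formal
)
replaced
#> [1] "sudah di coba"      "sudah bagus"        "mudah dengan livin"
```

Without the boundaries, “udah” would also match inside “sudah” and “mudah”, turning them into “ssudah” and “msudah”.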

Stemming

In this process, we want to replace words with their root form. For example, the root form of the English words “slept” or “sleeping” is “sleep”, and the root form of the Indonesian words “membaca” or “pembacaan” is “baca”.

To stem Indonesian words, we use the katadasaR package.

Once again, we save the result to a CSV file to avoid repeating the long processing time.

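# Stem one review that has already been split into words.
# x is a character vector of words; katadasar() is applied to one word at a time.
stemming <- function(x) {
  words <- tokenize_words(x)
  lapply(words, katadasar) %>%
    unlist() %>%
    str_c(collapse = " ")
}

# Tokenize each review, then stem it word by word
mandiri_clean_stem <- lapply(tokenize_words(mandiri_clean_slang$review), stemming)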

# Save as csv file
mandiri_clean_stem_unlist <- mandiri_clean_stem %>% unlist()
write.csv(mandiri_clean_stem_unlist, file = "mandiri_clean_stem_unlist.csv", row.names = FALSE)

Load the csv file for the stemmed dataset.

# Load the csv file
mandiri_clean_stem_unlist <- read.csv("mandiri_clean_stem_unlist.csv")
mandiri_clean_stem_unlist$x %>% head()
#> [1] "sudah di coba keren dan responsive dengan tampil yang makin segar pasti"
#> [2] "excellent"                                                              
#> [3] "keren cakep benar makin canggih depan percaya tumbuh sama anda"         
#> [4] "mantap"                                                                 
#> [5] "mantap jiwa dan raga ayo kita livinkan indonesia"                       
#> [6] "mandiri memang baik"
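The structure of the stemming step (split a review into words, map each word to its root, join the words back together) can be sketched without katadasaR. In the example below a tiny lookup table stands in for `katadasar()`. Both `toy_roots` and the helper functions are assumptions made for illustration, and a real stemmer applies affix rules rather than a fixed table.

```r
# Toy lookup standing in for katadasar() (assumption: illustrative entries only)
toy_roots <- c(membaca = "baca", pembacaan = "baca", tampilannya = "tampil")

toy_stem_word <- function(word) {
  if (word %in% names(toy_roots)) toy_roots[[word]] else word
}

# Split a review into words, stem each word, and join the words back together
stem_review <- function(review, stem_word = toy_stem_word) {
  words <- strsplit(review, " ", fixed = TRUE)[[1]]
  paste(vapply(words, stem_word, character(1), USE.NAMES = FALSE), collapse = " ")
}

stemmed <- stem_review("tampilannya segar")
stemmed
#> [1] "tampil segar"
```

Because `stem_review()` returns exactly one string per review, applying it with `vapply()` over a column keeps the result aligned with the original rows.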

Remove stopwords

In our reviews, there are some English words included with the Indonesian reviews. Therefore, we will remove both the Indonesian and English stopwords.

We will also remove additional stopwords such as “mandiri”, “livin”, and “aplikasi”, which refer to the app itself.

# Indonesian stopwords
idstopwords <- stopwords("id", source = "stopwords-iso")
idstopwords2 <- readLines("data_input/stopword_list_id_2.txt")
idstopwords_all <- c(idstopwords, idstopwords2) %>% unique %>% sort()
idstopwords_all %>% head

# English stopwords
enstopwords <- 
  c(stopwords("en", source = "snowball"), 
    stopwords("en", source = "marimo"), 
    stopwords("en", source = "nltk"), 
    stopwords("en", source = "stopwords-iso"), 
    stopwords("en", source = "smart")) %>% unique %>% sort() 

# Additional stopwords
addstopwords <- c("mandiri", "livin", "aplikasi")

# List of all stopwords combined
all_stopwords <- c(idstopwords_all, enstopwords, addstopwords) %>% unique()

###  Remove stopwords from dataset  ###
mandiri_clean_stopwords <- mandiri_clean_stem_unlist$x %>% tokenize_words(stopwords = all_stopwords)

# Convert to dataframe
mandiri_clean_all <- data.frame(x = unlist(lapply(mandiri_clean_stopwords, 
                                                  paste, 
                                                  collapse = " ")), 
                                stringsAsFactors = FALSE) %>%
  mutate(row_number = row_number()) %>% 
  rename(review=x) %>% 
  filter(review != "NA") 

write.csv(mandiri_clean_all, "mandiri_clean_all.csv", row.names = FALSE)

The listed stopwords are now removed, and the result is saved to a CSV file to avoid repeating the long processing time.

Load the final cleaned dataset.

# Load csv file
mandiri_clean_all <- read.csv("mandiri_clean_all.csv")
mandiri_clean_all %>% head
#>                                    review row_number
#> 1      coba keren responsive tampil segar          1
#> 2                               excellent          2
#> 3      keren cakep canggih percaya tumbuh          3
#> 4                                  mantap          4
#> 5 mantap jiwa raga ayo livinkan indonesia          5
#> 6                               super app          7
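What stopword removal does to a single review can be sketched in base R. The stopword list below is a small illustrative assumption (the real lists are much longer), and `remove_stopwords_base` is a hypothetical helper rather than the tokenizers-based code used above.

```r
# Illustrative stopword list (assumption; the real lists are much longer)
toy_stopwords <- c("sudah", "di", "dan", "dengan", "yang", "makin", "pasti",
                   "mandiri", "memang", "baik")

remove_stopwords_base <- function(review, stopwords) {
  words <- strsplit(review, " ", fixed = TRUE)[[1]]
  paste(words[!words %in% stopwords], collapse = " ")
}

kept <- remove_stopwords_base(
  "sudah di coba keren dan responsive dengan tampil yang makin segar pasti",
  toy_stopwords
)
kept
#> [1] "coba keren responsive tampil segar"

emptied <- remove_stopwords_base("mandiri memang baik", toy_stopwords)
emptied
#> [1] ""
```

A review made up entirely of stopwords ends up with no words at all. In the pipeline above such reviews appear to be dropped, which would explain the jump from 5 to 7 in `row_number`.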

View Text Cleaning Result

We can inspect the text cleaning result as a table and as a wordcloud. We also used these views to identify additional Indonesian stopwords and slang words and to add them manually to the lists.

As Table

Here we keep only words that occur more than 10 times. Later, when building the machine learning model, this minimum word frequency can be adjusted.

# Create word frequency table
word_freq_tbl <- mandiri_clean_all$review %>% 
  tokenize_words() %>% 
  unlist %>% 
  table %>% 
  as.data.frame() %>% 
  rename(Word = 1) %>% 
  arrange(-Freq) %>% 
  filter(Freq > 10) # Minimum frequency can be adjusted later to tune the model
# View top 10 most frequent words
head(word_freq_tbl, 10)
#>         Word  Freq
#> 1      mudah 13236
#> 2  transaksi 12416
#> 3     update 11801
#> 4      bagus 10011
#> 5       buka  8423
#> 6       biru  8352
#> 7      bantu  8093
#> 8      pakai  7626
#> 9      masuk  7514
#> 10     susah  7170
# View the 10 least frequent words that passed the filter
tail(word_freq_tbl, 10)
#>             Word Freq
#> 2207 transaction   11
#> 2208       tuhan   11
#> 2209      tumben   11
#> 2210      ufdate   11
#> 2211       ultah   11
#> 2212     upgread   11
#> 2213    viturnya   11
#> 2214         weh   11
#> 2215  wiraswasta   11
#> 2216        wita   11
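For readers who want to see the counting logic without the pipe chain, the same kind of table can be built in base R. This is a sketch on toy data, with `min_freq` as an assumed, adjustable threshold:

```r
reviews <- c("mudah transaksi", "transaksi mudah bagus", "mudah")

# Split every review into words and count occurrences across all reviews
word_counts <- table(unlist(strsplit(reviews, " ", fixed = TRUE)))

# Convert to a data frame sorted by descending frequency
freq_tbl <- as.data.frame(word_counts, stringsAsFactors = FALSE)
names(freq_tbl) <- c("Word", "Freq")
freq_tbl <- freq_tbl[order(-freq_tbl$Freq), ]

# Keep only words that appear at least `min_freq` times
min_freq <- 2
kept <- freq_tbl[freq_tbl$Freq >= min_freq, ]
kept$Word
#> [1] "mudah"     "transaksi"
```

The tail of such a table is a useful place to look for misspellings (such as “ufdate” and “upgread” above) that could be added to the slang list.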

As Wordcloud

Because the dataset is large, we draw the wordcloud from only the first 5,000 reviews.

library(wordcloud)

wordcloud(words = 
            mandiri_clean_all$review %>%  
            head(5000) %>% 
            tokenize_words %>% 
            unlist, 
  max.words = 1000)

Document Term Matrix

We convert the cleaned reviews to a document-term matrix (DTM), which will be the input for our model. The model can be tuned further by removing low-frequency words from the cleaned data first.

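# DocumentTermMatrix() expects a corpus, so build one from the review column
mandiri_corpus <- VCorpus(VectorSource(mandiri_clean_all$review))
mandiri_dtm <- DocumentTermMatrix(mandiri_corpus)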
mandiri_dtm_matrix <- as.matrix(mandiri_dtm)
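To show what a document-term matrix contains, the sketch below builds a tiny one by hand in base R and then drops rare terms. It is an illustration on toy data, not how tm constructs the matrix internally; `min_count` is an assumed threshold.

```r
docs <- c("mudah transaksi", "transaksi mudah bagus", "susah masuk")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Drop terms that occur fewer than `min_count` times across all documents
min_count <- 2
dtm_filtered <- dtm[, colSums(dtm) >= min_count, drop = FALSE]
dtm_filtered
```

Note that filtering terms can leave some documents with all-zero rows (the third document here), which may need to be removed before fitting a model.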

References