Introduction

In today’s society, people use computing devices constantly, especially smartphones. Through the many applications on a phone, we can easily find almost any information we want. News shared through social media applications can often be obtained faster than news found by searching manually in a web browser or by watching television, because many social media applications let users share their own opinions freely. Twitter in particular lets us obtain information faster than most other social media platforms available today. Users share their thoughts about what is happening in their lives, or share the latest news hoping that it will go viral or become a trending topic that everyone can see. Twitter also has a feature called the retweet, which reposts an existing tweet; the more retweets a tweet gets, the faster it spreads.

My objective in this project is to obtain information about accidents as quickly as possible through Twitter. With the help of the Twitter API, we can access the tweets of many different accounts. Through text processing, we can build a Machine Learning model that identifies which tweets are about an accident and which are not, based on the keywords in each tweet. However, given the limited time available for this project, I searched Kaggle.com for a suitable dataset.

Data Wrangling

In this project, we obtained the scraped Twitter texts as a dataframe from Kaggle.com; details of the source page are given at the end of this project. As a result, we only need to read the data from the local directory where the scraped tweets are stored.

Read and Observe the Dataframe

Define libraries
library(tidyr)
library(rtweet)
library(lubridate)
library(textclean)
library(textshape)
library(tm)
library(katadasaR)
library(RVerbalExpressions)
library(tokenizers)
library(stringr)
library(ggplot2)
library(ggthemes)
library(caret)
library(e1071)
library(rsample)
library(tensorflow)
library(keras)
library(yardstick)
library(dplyr)
Read twitter data
twitter <- read.csv("data/twitter.csv")
twitter %>% head(10)

Here we have a dataframe containing texts that were scraped from Twitter. We can see that the data provider has processed the raw text and created several variables describing it. The variables shown above are as follows:

  • id_str: ID of the tweet
  • created_at: When the tweet was created
  • crawled_at: When the tweet was scraped
  • screen_name: Twitter username
  • full_text: Text of the tweet
  • full_tweet: Raw text produced by the scraping process

For this project, we will process full_text, because full_tweet still contains the extra information that the data provider has already separated into the other variables of the dataframe. However, the dataframe above does not have a target variable. For this type of dataframe, we would have to create the target variable ourselves.

Read data which contains target variable
twitter_label_manual <- read.csv("data/twitter_label_manual.csv")
twitter_label_manual %>% head(10)

Fortunately, the data provider has also given us a dataframe sliced from the one we saw previously, with an additional target variable named is_accident. For this project, we will use twitter_label_manual as the data for our Machine Learning model.

Cleaning The Data

We have to make sure that our data is free of “noise”, which may be introduced when scraping from the Twitter API or when downloading the data from the internet. Moreover, when importing data into an R Markdown document, characters that should be ASCII sometimes appear as non-ASCII characters for no obvious reason. Fortunately, our data does not appear to have this problem, but we should still check whether it contains NA or duplicated values.

Check if our data contain NA value
anyNA(twitter_label_manual)
## [1] FALSE

As shown above, our data does not contain any NA values, so we can continue to the next step.

Check if the values of our data are duplicated
anyDuplicated(twitter_label_manual)
## [1] 0

Fortunately, our data does not contain any duplicated rows, which means that every observation is unique. We do not want duplicated tweets, as they would bias our Machine Learning model.
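
Note that anyDuplicated() on a whole dataframe only flags rows that are identical in every column. The following is a minimal base R sketch, using a small made-up dataframe (the values are illustrative, not taken from twitter_label_manual), of how one might additionally check for repeated tweet text and count missing values per column:

```r
# Toy data standing in for twitter_label_manual (illustrative values only).
tweets <- data.frame(
  id_str    = c("1", "2", "3"),
  full_text = c("Kecelakaan di tol", "Kecelakaan di tol", "Macet di Depok"),
  stringsAsFactors = FALSE
)

# Whole-row check: rows 1 and 2 differ in id_str, so nothing is flagged.
row_dupes <- anyDuplicated(tweets)

# Column-level check: the same text posted twice is counted as a duplicate.
text_dupes <- sum(duplicated(tweets$full_text))

# Missing values per column, more informative than a single TRUE/FALSE.
na_per_column <- colSums(is.na(tweets))
```

Whether repeated text from different accounts should be dropped is a modelling decision; the sketch only shows how to detect it.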

Change data type to datetime
twitter_label_manual <- twitter_label_manual %>% 
  mutate(created_at = ymd_hms(created_at),
         crawled_at = ymd_hms(crawled_at))

twitter_label_manual %>% head(10)

Our dataframe records when each tweet was created and when it was scraped. However, both columns are stored as character data. Because they contain date and time information, we convert them to a datetime type, as shown above.
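
To illustrate why the conversion matters, here is a small base R sketch that parses timestamp strings and then uses them for date arithmetic. The strings and their "YYYY-MM-DD HH:MM:SS" layout are assumptions for illustration; as.POSIXct() is used here as a base R stand-in for ymd_hms():

```r
# Illustrative timestamp strings; the exact format in the CSV is an assumption.
created_chr <- c("2019-09-02 13:05:10", "2020-01-01 00:15:00")

# Base R counterpart of lubridate::ymd_hms() for this particular format.
created_dt <- as.POSIXct(created_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Once parsed, differences and date components can be computed directly.
gap_days     <- as.numeric(difftime(created_dt[2], created_dt[1], units = "days"))
month_labels <- format(created_dt, "%m-%Y")
```

With character columns, neither the time difference nor the month label could be computed without parsing first.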

Exploratory Data Analysis

With the datetime and username information in the dataframe, we can try to learn a little more about our data.

Accident tweets
twitter_label_manual %>% 
  filter(is_accident == 1) %>% 
  head() %>% 
  pull(full_text)
## [1] "Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :\nhttps://t.co/gMHLep9IvZ mhmmdrhmtrmdhn\nVisit Wonderful  #MRahmatRamadhan"              
## [2] "Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE"                                                                
## [3] "23.27: @PTJASAMARGA : Kunciran KM 14 - KM 16 arah Bitung PADAT, ada penanganan kecelakaan kendaraan truk fuso di bahu jalan."                         
## [4] "Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa"
## [5] "WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang - PT Bestprofit Futures Surabaya https://t.co/6Q1NFdiXc6"                          
## [6] "20.35 WIB #Tol_Japek Karawang Timur KM 51 - KM 52 arah Cikampek PADAT, ada Evakuasi Kecelakaan Kendaraan Truk di lajur 1/kiri dan bahu jalan."
Non-accident tweets
twitter_label_manual %>% 
  filter(is_accident == 0) %>% 
  head() %>% 
  pull(full_text)
## [1] "Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN"                                                                                                                      
## [2] "C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV"                                                                                                                                                                                       
## [3] "Asuransi mana nih??\n\nhttps://t.co/AJyABmimcY\n\nPPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY"                                                                               
## [4] "Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.\n\nTapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo"                                                                                                               
## [5] "UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA\n.\nTertib bayar pajak yuk, 😁\n.\n🚫Stop Pelanggaran\n🚫Stop Kecelakaan\nKeselamatan untuk… https://t.co/THOqIu80Zw"
## [6] "Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. \n\nSudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(\n\nBuat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa.."

Above we can see that the main difference between tweets about accidents and tweets about other topics is the keywords “kecelakaan”, “tewas”, and “korban”, which appear often in the accident tweets. Non-accident tweets can also include “kecelakaan”, as in the fourth non-accident example, but by reading it we can clearly tell that the text is not reporting an accident that is happening at that moment. Reading tweets one by one is not a practical way to build an early warning system for accidents. With the help of Machine Learning, we hope to be able to differentiate accident tweets from non-accident tweets.
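
The observation that a keyword alone does not determine the class can be checked with a simple count. The sketch below is a minimal base R illustration using four tweets quoted above (shortened, with URLs removed) and their labels; it computes the share of tweets in each class that contain each keyword:

```r
# Four example tweets from the output above, shortened for illustration.
tweets <- data.frame(
  full_text = c(
    "Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf",
    "WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang",
    "Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.",
    "Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum"
  ),
  is_accident = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

keywords <- c("kecelakaan", "tewas", "korban")

# One logical column per keyword: does the tweet contain it (case-insensitive)?
hits <- sapply(keywords, function(k) grepl(k, tweets$full_text, ignore.case = TRUE))

# Share of tweets per class that contain each keyword.
keyword_share <- aggregate(as.data.frame(hits),
                           by = list(is_accident = tweets$is_accident),
                           FUN = mean)
keyword_share
```

In this tiny sample “kecelakaan” also appears in half of the non-accident tweets, which is why a model that weighs several terms together is preferable to a single-keyword filter.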

Plot the top 10 accounts that most frequently tweet about accidents
twitter_label_manual %>% 
  filter(is_accident == 1) %>% 
  group_by(screen_name) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  arrange(desc(freq)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(screen_name, freq), freq))+
  geom_col(aes(fill = freq))+
  coord_flip()+
  scale_fill_continuous(high = "#FF6961", low = "#FDFD96")+
  labs(title = "Top 10 Most Frequently Reporting About An Accident",
       y = "Frequency",
       x = "Account names")+
  theme_pander()+
  theme(legend.position = "none")

If we observe the bar chart above, we can see that the top 10 accounts that most frequently tweet about accidents are dominated by news companies. However, there are also radio broadcasting companies in second and ninth place, and one community account in third place. In my opinion, a community account should be faster at publishing a tweet about an accident, especially when, as its name implies, it discusses traffic conditions. For that reason, news company tweets should not be included in our data.
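
As a rough illustration of the two ideas in this paragraph, the base R sketch below counts accident tweets per account and then drops accounts from a hand-written exclusion list. The account names and the news_accounts vector are hypothetical placeholders, not accounts from the dataset:

```r
# Hypothetical accounts and labels, for illustration only.
tweets <- data.frame(
  screen_name = c("news_a", "news_a", "news_a", "radio_b", "radio_b", "user_c"),
  is_accident = c(1, 1, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Frequency of accident tweets per account, most frequent first.
accident_accounts <- tweets$screen_name[tweets$is_accident == 1]
top_accounts <- head(sort(table(accident_accounts), decreasing = TRUE), 10)

# Hypothetical list of news accounts to exclude from the data.
news_accounts <- c("news_a")
non_news <- tweets[!tweets$screen_name %in% news_accounts, ]
```

In practice the exclusion list would have to be compiled by inspecting the account names, since the dataframe has no column that marks an account as a news company.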

Plot account names, including non-news accounts
RNGkind(sample.kind = "Rounding")
set.seed(126)

wordcloud::wordcloud(
  twitter_label_manual$screen_name,
  min.freq = 1
)

Among the lowest-ranked accounts, the frequency drops to a single tweet per account, which suggests that these tweets are more personal and may come from personal accounts. These personal tweets about accidents happening around or to their authors are the ones we expect to carry the most up-to-date information about accidents.

Proportion of accident and non-accident tweets
twitter_label_manual %>% 
  mutate(is_accident = ifelse(is_accident == 1, "Accident", "Non-Accident")) %>% 
  group_by(is_accident) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  mutate(prop = freq / sum(freq) * 100,
         ypos = cumsum(prop) - 0.5 * prop) %>% 
  ggplot(aes(reorder(is_accident, desc(freq)), freq))+
  geom_col(aes(fill = freq))+
  scale_fill_continuous(high = "#FF6961", low = "#FDFD96")+
  labs(title = "Proportion between accident tweet and non-accident tweet",
       y = "Frequency",
       x = "Topic")+
  theme_pander()+
  theme(legend.position = "none")

According to the bar chart above, accident tweets make up a smaller share of the data than non-accident tweets. Still, out of 1002 observations, around 360 are accident tweets. For a platform where people can talk about anything, this means that people tweet about accidents across Indonesia quite frequently.
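
The class balance can also be read off as percentages rather than from the bar heights. A minimal base R sketch, using a short made-up label vector rather than the real is_accident column:

```r
# Made-up labels for illustration; 1 = accident, 0 = non-accident.
is_accident <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

labels <- ifelse(is_accident == 1, "Accident", "Non-Accident")

# Absolute counts and percentage share per class.
counts <- table(labels)
props  <- round(100 * prop.table(counts), 1)
props
```

Knowing the exact share is useful later, because an imbalanced target affects how metrics such as accuracy should be interpreted.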

Accident per Month

Accident frequency per month
twitter_label_manual %>% 
  mutate(year = year(created_at),
         yearmonth = format(created_at, "%m-%Y")) %>% 
  filter(is_accident == 1) %>% 
  group_by(year, yearmonth) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(yearmonth, year), freq))+
  geom_col(aes(fill = freq))+
  geom_text(aes(label = freq),
            vjust = 1,
            color = "white",
            fontface = "bold")+
  scale_fill_continuous(high = "#FF6961", low = "#FDFD96")+
  labs(title = "Accident Frequency per Month",
       y = "Frequency",
       x = "Date")+
  theme_pander()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

Accident trend per month
twitter_label_manual %>% 
  mutate(year = year(created_at),
         yearmonth = format(created_at, "%m-%Y")) %>% 
  filter(is_accident == 1) %>% 
  group_by(yearmonth, year) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(yearmonth, year), freq))+
  geom_line(group = 1, aes(col = freq), size = 2)+
  scale_color_continuous(low = "#FF6961", high = "#FDFD96")+
  labs(title = "Trend Accident per Month",
       y = "Frequency",
       x = "Date")+
  theme_pander()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
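
The two charts label months as "%m-%Y", which does not sort chronologically as plain text, so they rely on reorder() by year. A minimal base R sketch of an alternative, assuming a year-first label ("%Y-%m") is acceptable, in which case alphabetical order is already chronological. The timestamps are illustrative:

```r
# Illustrative timestamps for accident tweets.
created_at <- as.POSIXct(
  c("2019-09-02 10:00:00", "2019-06-05 08:30:00",
    "2020-01-01 00:10:00", "2019-09-15 21:45:00"),
  tz = "UTC"
)

# Year-first labels sort chronologically without any extra reordering.
year_month <- format(created_at, "%Y-%m")

# Tweets per month; table() orders the labels alphabetically.
monthly <- table(year_month)
monthly
```

This keeps months from different years in the right order even when the data spans a year boundary, as it does here with 2019 and 2020.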

September 2019 seems to have a significantly higher number of accident tweets than the other months of 2019 and 2020. The increase in June might be related to the long Eid al-Fitr holiday, when people from the capital city of Jakarta visit their families in their hometowns across Indonesia. The increase in January might be related to the New Year holiday, which people usually celebrate outside the city. However, the increase in September does not seem to be related to any holiday, as there is none in that month. For this reason, I decided to read some text samples from September 2019.

Random sample of accident tweet in the month of September
RNGkind(sample.kind = "Rounding")
set.seed(126)

idx_sept <- sample(nrow(twitter_label_manual %>% 
                          mutate(month = month(created_at, label = T)) %>% 
                          filter(month == "Sep")),
                   20)

twitter_label_manual %>% 
  slice(idx_sept) %>%
  pull(full_text)
##  [1] "Artis Korea Han Ji Seong Tewas dalam Kecelakaan Mobil, Kejanggalan Terekam Kamera Dashboard https://t.co/7H3OhC5q2B lewat @TribunWOW"                                                                                                                                                      
##  [2] "#Kecelakaan #JembatanKembar #IdiRayeuk #AcehTimur |\n\nMOBIL yang sedang dalam kecepatan tinggi menabrak median di jembatan kembar Titi Baro, Kecamatan Idi Rayeuk, Aceh Timur.\nhttps://t.co/sGvQ2LnVPA"                                                                                  
##  [3] "/txt/ AU - jungkook hyung sama Kai\n          ¥¥¥ NOONA ¥¥¥\nPlot : \n\n\"Halo selamat malam, kami dari rumah sakit mengabarkan bahwa mobil yang keluarga anda tumpangi mengalami kecelakaan\"\n\n\"Apa keluarga ku selamat ?\" https://t.co/scR5rH5DPF"                             
##  [4] "Aksi polisi menolong korban kecelakaan motor vs motor itu tersebar di media sosial dan mendapat pujian dari warganet.\n\nhttps://t.co/ChLh3opOd5…\n\n#LH87\n#bagimunegeri https://t.co/Q2oWGTjqts"
##  [5] "Kecelakaan fatal terjadi di jalan tanjung morawa depan spbu Menurut pengirim video, sepeda motor menabrak bus yang sedang mau belok masuk ke spbu. Pengendara sepeda motor dan penumpang terlihat luka parah. Terjadi pukul 11.19 WIB… https://t.co/HFCNToktSC https://t.co/OWuQjR2TZP"
##  [6] "Lima Orang Tewas  Pada Kecelakaan di Patokbeusi Subang dalam 5 Bulan https://t.co/mCQLGTUfSv"                                                                                                                                                                                              
##  [7] "Baru aja ngeliat Ibu penjual es krim dgn motor ketabrak mobil pick up. motornya ampe muter.\n\ntangan udah gatel pingin foto / snap, kemudian ingat dulu teman baik pernah negur untuk jangan posting kecelakaan dan menunjukan korban.\n\ndia benar, gak baik memviralkan kesialan orang."
##  [8] "#MostPopular Kecelakaan, Abang Ojol Ini Tanggung Biaya Rumah Sakit Penumpangnya https://t.co/7CLfYavvqr"                                                                                                                                                                                   
##  [9] "Dpet kabar dri grup WA, Bus parawisata MAN 1 SUKABUMI Pulang studytour dari bandung kecelakaan di waluran galumpit semoga semuanya selamat, AMIN"                                                                                                                                          
## [10] "yang udah nonton beautiful world ep 15 menurut kalian itu murni kecelakaan kan?"                                                                                                                                                                                                           
## [11] "Kalau mobil/motor listrik dan hibrid sudah banyak di jalan2 Indonesia, maka angka kecelakaan akan meningkat karena bunyi kendaraan itu sangat kecil."                                                                                                                                      
## [12] "Telah terjadi, kecelakaan DEPAN MATA GUA BANGET BANGSAT! bapak bapak bawa motor ditabrak dari belakang sama mobil pick up! Aing lagi makan mie rebus di warkop pinggur jalan langsung lemes -_-"                                                                                           
## [13] "Astagfirullah aku kecelakaan di kampung orang:\""                                                                                                                                                                                                                                          
## [14] "Mudik dengan Motor Rawan Kecelakaan https://t.co/J1u2bEu2Wd https://t.co/reatWuTm27"                                                                                                                                                                                                       
## [15] "Kira kira 70-100 penduduk perhari atau 2100 setiap bulan. Angka Kematian Akibat Kecelakaan, Indonesia Tertinggi di Dunia - News https://t.co/tdr6KTxR2c https://t.co/ELBqoJhcYY"                                                                                                           
## [16] "RS Siloam Yogyakarta Beri Edukasi Pertolongan Pertama pada Kecelakaan https://t.co/7zOy8EKZEN"                                                                                                                                                                                             
## [17] "Kecelakaan di Manggarai, Polisi: Pelaku Kabur, Mobil Ditinggal - detikNews https://t.co/YmhLrM2uPl"                                                                                                                                                                                        
## [18] "Hati-hati saat bayi mulai berjalan ya, Ma https://t.co/MY28uZwHdn"                                                                                                                                                                                                                         
## [19] "Banyak yg nikah karena kecelakaan tapi ijab qobulnya ga di rumah sakit."                                                                                                                                                                                                                   
## [20] "Resto H2B ini plg sukses dg layanan Go Foodnya.Antrean driver  mengular dan sabar menunggu.Sedih melihat anak bangsa, tak sedikit sarjana, bekerja tanpa jaminan kesehatan, jaminan kecelakaan kerja. Wish new President, new hope for better future of us"
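
In the chunk above, the indices are drawn from the number of September rows but are then applied to the full dataframe with slice(), so the rows shown are not guaranteed to be September accident tweets. The following is a minimal base R sketch of one way to sample only from the filtered rows, assuming the intent is to read accident tweets from September 2019. The dataframe is a toy example:

```r
# Toy data: 12 tweets spread over June, September and December 2019.
tweets <- data.frame(
  full_text   = paste("tweet", 1:12),
  created_at  = as.POSIXct(
    sprintf("2019-%02d-10 12:00:00", rep(c(6, 9, 12), each = 4)),
    tz = "UTC"
  ),
  is_accident = rep(c(1, 0), times = 6),
  stringsAsFactors = FALSE
)

# Filter first, so that sampled indices refer to the filtered rows.
is_sept_2019   <- format(tweets$created_at, "%Y-%m") == "2019-09"
sept_accidents <- tweets[is_sept_2019 & tweets$is_accident == 1, ]

# Draw up to 20 rows from the filtered subset only.
set.seed(126)
n_draw      <- min(20, nrow(sept_accidents))
sept_sample <- sept_accidents[sample(nrow(sept_accidents), n_draw), ]
sept_sample$full_text
```

The key point is that the sampling and the subsetting use the same dataframe, so every sampled row satisfies the filter.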

From the 20 random texts sampled for the month of September, there seems to have been a major accident on the Cipularang highway involving 21 vehicles, which would explain the increase in accident tweets in September 2019.

Text Processing

In the random text samples above, we can see that the tweets combine formal words, slang words, and even words from other languages such as English and sometimes Javanese or Sundanese. To extract keywords from these texts, we have to process them so that the words are as generally understood as possible.
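
One common way to make informal words more uniform is a lookup table that maps abbreviations to their standard forms. The sketch below is a minimal base R illustration; the slang_lookup vector and the normalize_slang() helper are hypothetical and cover only a few abbreviations seen in the sample tweets:

```r
# Hypothetical lookup of informal abbreviations and their standard forms.
slang_lookup <- c(yg = "yang", tp = "tapi", jgn = "jangan",
                  dg = "dengan", krn = "karena")

# Replace each whitespace-separated token that appears in the lookup.
normalize_slang <- function(text, lookup) {
  tokens   <- strsplit(tolower(text), "\\s+")[[1]]
  replaced <- ifelse(tokens %in% names(lookup), lookup[tokens], tokens)
  paste(unname(replaced), collapse = " ")
}

normalized <- normalize_slang("Buat kalian yg bawa kendaraan jgn lupa berdoa",
                              slang_lookup)
normalized
```

A real lookup would need to be much larger, and tokens that are attached to punctuation would have to be cleaned first for the match to succeed.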

Text Cleansing

The purpose of cleansing is to strip the text of unneeded words and characters such as punctuation, website links, numbers, and more. What we remove in this process is mostly content that is unique to a single tweet, because we want to keep the terms that recur across tweets and give us clues about which texts are about an accident and which are not.
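
As a rough idea of what such cleansing can look like, here is a minimal base R sketch built on regular expressions. The clean_tweet() helper is hypothetical, and the choice to drop hashtags and mentions entirely (rather than keep their text) is an assumption made for illustration:

```r
# Hypothetical helper: strip URLs, mentions, hashtags, HTML entities,
# digits and punctuation, then tidy up the whitespace.
clean_tweet <- function(text) {
  text <- tolower(text)
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # website links
  text <- gsub("[@#]\\w+", " ", text, perl = TRUE)       # mentions and hashtags
  text <- gsub("&amp;|&gt;|&lt;", " ", text)             # common HTML entities
  text <- gsub("[[:digit:]]+", " ", text)                # numbers
  text <- gsub("[[:punct:]]+", " ", text)                # punctuation
  text <- gsub("\\s+", " ", text, perl = TRUE)           # repeated whitespace
  trimws(text)
}

cleaned <- clean_tweet(
  "Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE"
)
cleaned
```

The order of the steps matters: links and HTML entities are removed before punctuation, otherwise their fragments would survive as stray words.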

Recommendation for text cleaning

The recommendation

As it would take days to inspect the texts one by one manually, we can use the check_text() function to get some guidance on how to clean them. The recommendations from the function are listed below:

  • *Suggestion: Consider running replace_contraction
  • *Suggestion: Consider running replace date
  • *Suggestion: Consider using replace_number
  • *Suggestion: Consider using replace_emoticons
  • *Suggestion: Consider using qdapRegex::ex_tag (to capture meta-data) and/or replace_hash
  • *Suggestion: Consider running replace_html
  • *Suggestion: Consider using replace_incomplete
  • *Suggestion: Consider using replace_kern
  • *Suggestion: Consider running hunspell::hunspell_find & hunspell::hunspell_suggest
  • *Suggestion: Consider cleaning the raw text or running add_missing_endmark
  • *Suggestion: Consider running add_comma_space
  • *Suggestion: Consider running replace_non_ascii
  • *Suggestion: Consider running textshape::split_sentence
  • *Suggestion: Consider using qdapRegex::ex_tag (to capture meta-data) and/or replace_tag
  • *Suggestion: Consider using replace_time
  • *Suggestion: Consider using replace_url
The code
check_text(twitter_label_manual$full_text)
## 
## 
## ===========
## CONTRACTION
## ===========
## 
## The following observations contain contractions:
## 
## 79, 559, 887
## 
## This issue affected the following text:
## 
## 79: I think not every epilepsy case is mental illness. I got seizure from accident. kecelakaan ditabrak kendaraan sampe terbang. Tabrak lari, uh. It's not mental illness in my case. Physical. https://t.co/0XwfigMpn8
## 559: I'll always be here for you,l
## Gws my baby jungkook 😭💜
## #WeLoveYouJungkook #kecelakaan https://t.co/CMXn9G2qoJ
## 887: Tell me what you saw.......
## 
## Mantep sekaleeeeeeeee wkwkw. Keren bgt dah mbak sooyoung huhu😭 om jang hyuk gausa diragukan lagi lah yaa, pertama ngira dia cacat dari awal, gataunya cacat krn kecelakaan dan buta pula 😭 i'm so sad.
## 
## *Suggestion: Consider running `replace_contraction`
## 
## 
## ====
## DATE
## ====
## 
## The following observations contain dates:
## 
## 308, 329, 522, 643, 723, 987
## 
## This issue affected the following text:
## 
## 308: GIAT OPS RAZIA RUTIN
## 
## Tapsel,Selasa (20/08/2019),Sat Lantas Polres Tapsel melaksanakan giat razia rutin di wilkum Polres Tapsel,untuk mengurangi angka kecelakaan di wilkum Polres Tapsel,serta memberi himbauan kepada masyarakat untuk selalu tertib dalam berkendera. https://t.co/jvKz5Mb73C
## 329: Informasi :
## Telah terjadi kecelakaan di Tol Pandaan-Malang KM 82 / B. Pada Hari Jumat tanggal 30-08-2019, pukul 08.30 WIB.
## 
## Kendaraan yang mengalami kecelakaan adalah Truck Boks Nopol L 9287 GJ.
## 
## Identitas pengemudi… https://t.co/rVbsQlOKwW
## 522: Penyerahan Santunan Kecelakaan Kerja &amp; Santunan Kematian Kepada Anggota Badan AdHoc Penyelenggara Pemilu 2019 bertempat di Kantor KPU Kabupaten Majalengka, Jumat (25/10/2019). https://t.co/pT3NdfZHRY
## 643: NEWS UPDATE  PAGI   : 
## 
## Bus Angkut Turis Jatuh ke Jurang di Tunisia, Tewaskan 22 Orang ???
## 
## Sebuah bus yang mengangkut turis Tunisia jatuh di pegunungan di utara negara itu, Minggu (1/12/2019). Kecelakaan itu… https://t.co/bs6Ha4qeZV
## 723: Mengantisipasi terjadinya  kecelakaan/ Musibah bagi warga yang merayakan libur natal dan tahun baru menjadi tugas utama polri.
## Pada hari minggu(29/12/2019)  20 personil https://t.co/Bqef5d6PzG
## 987: INFO: Terjadi laka lantas di📍Jl. A. Yani, Km 22,5, Landasan Ulin. 07:05 Wita 15/03/2020.
## 
## Dalam kecelakaan tsb melibatkan antara pengendara motor yg menabrak pengendara yg membawa gerobak bakso. Dalam kejadian tsb… https://t.co/h2K0kTFfWx
## 
## *Suggestion: Consider running `replace date`
## 
## 
## =====
## DIGIT
## =====
## 
## The following observations contain digits/numbers:
## 
## 1, 2, 4, 6, 7, 8, 9, 10, 11, 12...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 2: Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE
## ...[truncated]...
## 4: C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV
## ...[truncated]...
## 6: 23.27: @PTJASAMARGA : Kunciran KM 14 - KM 16 arah Bitung PADAT, ada penanganan kecelakaan kendaraan truk fuso di bahu jalan.
## ...[truncated]...
## 7: Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 10: Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. 
## 
## Sudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(
## 
## Buat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..
## ...[truncated]...
## 11: 👦: "Abis kecelakaan dimna lo?"
## 
## 👧: " Gue gk kecelakaan kok, aman2 aja"
## 
## 👦: " trus itu knapa muka lo ancur"
## 
## SABAR. Muka jelek emang banyak cobaan
## ...[truncated]...
## 12: WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang - PT Bestprofit Futures Surabaya https://t.co/6Q1NFdiXc6
## ...[truncated]...
## 
## *Suggestion: Consider using `replace_number`
## 
## 
## ========
## EMOTICON
## ========
## 
## The following observations contain emoticons:
## 
## 1, 2, 3, 4, 5, 7, 8, 9, 10, 12...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 2: Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE
## ...[truncated]...
## 3: Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN
## ...[truncated]...
## 4: C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV
## ...[truncated]...
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 7: Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 10: Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. 
## 
## Sudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(
## 
## Buat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..
## ...[truncated]...
## 12: WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang - PT Bestprofit Futures Surabaya https://t.co/6Q1NFdiXc6
## ...[truncated]...
## 
## *Suggestion: Consider using `replace_emoticons`
## 
## 
## ====
## HASH
## ====
## 
## The following observations contain Twitter style hash tags (e.g., #rstats):
## 
## 1, 14, 15, 21, 22, 24, 26, 37, 44, 45...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 14: 20.35 WIB #Tol_Japek Karawang Timur KM 51 - KM 52 arah Cikampek PADAT, ada Evakuasi Kecelakaan Kendaraan Truk di lajur 1/kiri dan bahu jalan.
## ...[truncated]...
## 15: #PopulerB1 2: Kecelakaan Bus di Malaysia, 4 WNI Meninggal dan 10 Terluka https://t.co/w3r06o7Hdq
## ...[truncated]...
## 21: Akibat Andra ingin selalu menjaga Emon, dia jadi dikejar-kejar terus sama Tony! Yang lebih parahnya, Andra sampai kecelakaan..
## 
## Nonton #AnakLangitEps1096dan1097, pukul 16.40 WIB.
## #SCTVSinetron
## 
## Jangan lupa saksikan via streaming di @vidiodotcom dan follow channel SCTV juga ya! https://t.co/vdk5tjd6gF
## ...[truncated]...
## 22: [18:36] #JAKARTA #KECELAKAAN Persimpangan Slipi #JasaMarga
## ...[truncated]...
## 24: Satgas Pamtas Yonif 725/Wrg Obati Warga Karena Kecelakaan Tunggal #TNIADMengabdiDanMembangunBersamaRakyat
## 
## https://t.co/Yq2ywPs8Sw
## ...[truncated]...
## 26: Ciel : waktu masa kecil aku gak tahu selalu ada aja kecelakaan yg terjadi pada diriku ,, papa aku bikin kopi yg panas banget alhasil ketumpahan kena perut aku ada aja dah pokoknya
## 
## #LRPD
## #PajamaClassA
## ...[truncated]...
## 37: KECELAKAAN BERAWAL DARI PELANGGARAN, 
## MARI BUDAYAKAN TERTIB BERLALU LINTAS.
## #MRSF2019 #satlantasindonesia #korlantaspolri #ntmcpolri  #roadsafety #generasimillennial #generasimuda #roadsafety #roadsafetyweek #StopPelanggaranStopKecelakaanKeselamatanUntukKemanusiaan https://t.co/Hu3iNGAihd
## ...[truncated]...
## 44: 06.55 WIB #Kecelakaan_Janger di Tangerang KM 18+800 arah Merak, Kendaraan Truk gandeng menabrak pembatas jalan tengah, SELESAI PENANGANAN Petugas.
## ...[truncated]...
## 45: [07:26] #JAKARTA #KECELAKAAN Jl. Raya Kembangan Selatan #TMC
## ...[truncated]...
## 
## *Suggestion: Consider using `qdapRegex::ex_tag' (to capture meta-data) and/or replace_hash
## 
## 
## ====
## HTML
## ====
## 
## The following observations contain HTML markup:
## 
## 25, 96, 195, 219, 233, 252, 273, 279, 292, 328...[truncated]...
## 
## This issue affected the following text:
## 
## 25: Sayangi nyawa anda &amp; orang lain. Jangan sia siakan nyawa melayang di jalan.
## 
## Berperilaku &amp; beretika tertib berlalu lintas adalah upaya menekan angka kecelakaan dg aturan yg berlaku.
## 
## Keselamatan no 1. https://t.co/XLOdUJPcBw
## ...[truncated]...
## 96: Menurut PM 82 Tahun 2018 pasal 33, Pita Penggaduh berfungsi untuk mengurangi kecepatan kendaraan, mengingatkan pengemudi tentang objek di depan yang harus diwaspadai, melindungi penyeberang jalan, &amp; mengingatkan lokasi rawan kecelakaan - @dishubdiy https://t.co/T3PoEozMQn
## ...[truncated]...
## 195: Aku bakal jadi bridesmaids sahabatku tanggal 7 juli. Tapi kemaren aku kecelakaan, muka penuh luka &amp; jahitan, kaki penuh luka. Mungkin tgl 7 udah sembuh tp tetap aja muka penuh bekas luka. … — datang aja si, dia nya juga g akan peduli lunya mau gmna skrg. https://t.co/kgpHxBahto
## ...[truncated]...
## 219: Saat Asik Liburan, Yeni Wahid Alami Kecelakaan
## 
## Begini nasibnya kini &gt;&gt; https://t.co/omTAQkYvO9 https://t.co/omTAQkYvO9
## ...[truncated]...
## 233: Karimunjawa Private Honeymoon
## 3D2N 
## Mulai dari IDR 6000K/COUPLE
## 
## Cek story ya guys..
## 
## Fasilitas:
## 1. Tiket kapal Express VIP CLASS
## 2. Asuransi kecelakaan 
## 3. Transportasi cek in &amp; cek out penginapan, ke alun2 &amp; tour… https://t.co/rWa4Mmupe4
## ...[truncated]...
## 252: Di situ orang2 pada nyenyak tidur 😴😴
## Tapi d'situ lah, para emak2 mau melahirkan &amp; ada yg kecelakaan juga 🙈🙈
## Ambil hikmahnya aja, biar makin pinter 🙂
## ...[truncated]...
## 273: Melihat Kecelakaan, Kendaraan Mogok. Hub Info Tol &amp; Bantuan tol_mms. #Indonesia_Ayo_Aman_Berlalu_Lintas https://t.co/ek9vzcGn7I
## ...[truncated]...
## 279: Genset, pemanas air dari listrik (water heater), colokan listrik &amp; ceklekan lampu adalah ~ sumber kecelakaan fatal yang bisa merenggut nyawa di dalam rumah kita sendiri.. mohon berhati-hati &amp; selalu pastikan semua aman sebelum dinyalakan.
## ...[truncated]...
## 292: Flashback satu stgh 1 tahun yg lalu saat sayyid Muhammad kecelakaan motor di Tarim kaki &amp; tulang paha patah serius,difoto ini beliau sedang dijenguk oleh guru mulia,keliatan muh kesakitan saat lehernya dipegang Habib Umar, https://t.co/LCveNax9rU
## ...[truncated]...
## 328: Akun2 influenze ini ga mikir psikis korban &amp; keluarganya kalik yaa?? mau biar terkesan infonya A1 sampe hrs vulgar gitu ngepostnya. Empati kita kadang kebablasan.
## Korban pemerkosaan,kecelakaan laka, pembunuhan, maen share vulgar gitu..tujuan baik klo cara ga pas, bulshit buat gw!
## ...[truncated]...
## 
## *Suggestion: Consider running `replace_html`
## 
## 
## ==========
## INCOMPLETE
## ==========
## 
## The following observations contain incomplete sentences (e.g., uses ending punctuation like '...'):
## 
## 5, 10, 20, 21, 23, 42, 73, 77, 92, 112...[truncated]...
## 
## This issue affected the following text:
## 
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 10: Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. 
## 
## Sudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(
## 
## Buat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..
## ...[truncated]...
## 20: Rembang – Satuan Lalu Lintas Polres Rembang masih menangani peristiwa kecelakaan lalu lintas di jalur Pantura Desa Punjulharjo, Rembang, yang mengakibatkan... https://t.co/LLh7U8BvpH
## ...[truncated]...
## 21: Akibat Andra ingin selalu menjaga Emon, dia jadi dikejar-kejar terus sama Tony! Yang lebih parahnya, Andra sampai kecelakaan..
## 
## Nonton #AnakLangitEps1096dan1097, pukul 16.40 WIB.
## #SCTVSinetron
## 
## Jangan lupa saksikan via streaming di @vidiodotcom dan follow channel SCTV juga ya! https://t.co/vdk5tjd6gF
## ...[truncated]...
## 23: Santunan diberikan kepada warga sebesar  Rp1 juta untuk warga yang meninggal karena sakit, sedangkan korban kecelakaan mendapat santunan Rp2,5 juta.
## 
## Hingga triwulan pertama di tahun... https://t.co/vGvnIfUZRf
## ...[truncated]...
## 42: 22/4: 12.35
## .
## Kecelakaan tunggal yang dikendarai oleh mbahe wong ndemak. Menurut korban kecelakaan terjadi lantaran remnya tidak pakem...(mungkin kampase kepanasan) lokasi di sigar… https://t.co/BdPZWxBiB0
## ...[truncated]...
## 73: Aku juga ikut Terharu saat teman ku mengalami kecelkaan ternyata begini rasanya kecelakaan ..
## ...[truncated]...
## 77: Akhirnya! Yes nomorku ditilpun orang prank kecelakaan. Hahahaha. Kuangkat dan kudiamkan.. setelah hola halo dengan backsound orang cekikikan kubilang saja kalau mau nipu gak gitu caranya...
## Eh dibalas bangs*t kau anj*ing... Wkwkwk..
## ...[truncated]...
## 92: Mas Brooo dan Mbak Sis..., jangan bawa kendaraan ya..., bila dlm keadaan mabuk. Karena dapat membahayan bagi pengendara yg lain dan diri anda sendiri.
## " Stop Pelanggaran, Stop Kecelakaan, Keselamatan Untuk Kemanusiaan "
## #OPSKESELAMATANMUARATAKUS2019 #PublikPercayaPolri https://t.co/CRFEz1dQGW
## ...[truncated]...
## 112: Kecelakaan,pembunuhan,perampokan ada di sekitarkita..
## Haruskah Kita TAKUT?hub 082312105777 Pin BB 5C5C8D3C
## https://t.co/YhrCdDS9Fw
## ...[truncated]...
## 
## *Suggestion: Consider using `replace_incomplete`
## 
## 
## ====
## KERN
## ====
## 
## The following observations contain kerning (e.g., 'The B O M B!'):
## 
## 767
## 
## This issue affected the following text:
## 
## 767: OI OI OI O I OI HENTIKAN KELAKUAN INI YA BANG SEBELUM ADA KECELAKAAN https://t.co/tm9AhYs3JO
## 
## *Suggestion: Consider using `replace_kern`
## 
## 
## ==========
## MISSPELLED
## ==========
## 
## The following observations contain potentially misspelled words:
## 
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...[truncated]...
## 
## This issue affected the following text:
## 
## 1: <<R<<ek>><<<<ama>>n>>>> CCTV <<<<<<<<Ke>>c>>e<<l<<ak>>a>>a>>n>> Motor <<di>> <<P<<IK>>>>, <<<<de>>pan>> <<T<<<<ama>>n>>>> <<Gri<<se>><<<<nd>>a>>>> :
## <<ht<<tp>>s>>://t.co/<<<<gM>>HL<<ep>>>>9<<IvZ>> <<m<<hm>>m<<dr>><<hm>>trm<<dh>>n>>
## Vi<<si>>t Wo<<nd>>er<<fu>>l  #<<M<<Ra<<hm>>at>>R<<ama>><<dh>>an>>
## ...[truncated]...
## 2: <<<<T<<e<<wa>>>>s>><<kan>>>> 346 <<Ora<<ng>>>> <<d<<alam>>>> 2 <<<<<<<<Ke>>c>>e<<l<<ak>>a>>a>>n>>, Bo<<ss>> Boe<<i<<ng>>>> <<Minta>> <<M<<aa>>f>> <<ht<<tp>>s>>://t.co/<<wLRhFy>>8<<oYE>>
## ...[truncated]...
## 3: <<A<<ng>>gota>> <<p<<ar>>lemen>> <<Tai>><<wa>>n <<j<<<<ug>>a>>>> <<be<<re<<nc>>ana>>>> <<<<men<<i<<ng>>>>kat>><<kan>>>> <<<<de>><<<<nd>>a>>>> <<<<m<<ak>>>><<si>>mum>> <<dan>> <<masa>> <<<<hu<<ku>>m>>an>> <<ba<<gi>>>> <<ora<<ng>>>> ya<<ng>> <<menyetir>> <<d<<alam>>>> <<<<ke>><<ada>>an>> <<<<mabu>>k>>. <<ht<<tp>>s>>://t.co/<<GSWq<<z<<ia>>>>KDN>>
## ...[truncated]...
## 4: C.<<Ger<<a<<kan>>>>>>.<<bi<<c<<ar>>a>>>> <<per<<<<tol>>o<<ng>>>>an>> <<pert<<ama>>>> <<p<<ada>>>> <<<<ke>>ce<<l<<ak>>a>>an>> (P3K-<<BA<<KA>>T>>) <<ht<<tp>>s>>://t.co/<<<<jl>>PyXK>>3<<EBV>>
## ...[truncated]...
## 5: <<Asuran<<si>>>> <<mana>> <<<<ni>>h>>??
## 
## <<ht<<tp>>s>>://t.co/<<AJyABmimcY>>
## 
## <<PPATK>> <<tid<<ak>>>> <<<<mem<<ber>>i>><<kan>>>> <<rin<<ci>>an>> <<<<se>><<c<<ar>>a>>>> <<pasti>> <<<<<<s<<ia>>>>p>>a>> <<dan>> <<<<d<<ar>>i>><<mana>>>> <<asal>> <<p<<ar>>tai>> <<caleg>> <<<<ter>><<se>>but>>. <<S<<aa>>t>> <<i<<ni>>>>, <<<<p<<ih>><<ak>>>><<nya>>>> <<t<<e<<lah>>>>>>... <<ht<<tp>>s>>://t.co/<<AJyABmimcY>>
## ...[truncated]...
## 6: 23.27: @<<PTJASAMARGA>> : <<<<Ku>>n<<ci>>ran>> KM 14 - KM 16 <<<<ar>>ah>> <<B<<i<<tu>>>><<ng>>>> <<PADAT>>, <<ada>> <<pe<<nan>><<ga>><<nan>>>> <<<<ke>>ce<<l<<ak>>a>>an>> <<<<ke>><<<<nd>>a>>r<<aa>>n>> <<t<<ru>>k>> <<<<fu>>so>> <<di>> <<bahu>> <<jalan>>.
## ...[truncated]...
## 7: <<Ter<<ja<<di>>>>>> <<<<ke>>ce<<l<<ak>>a>>an>> <<t<<ru>>k>> <<<<muat>>an>> <<bes<<ar>>>> <<di>> <<Tol>> <<<<Ku>>n<<ci>>ran>>, <<Ser<<po>><<ng>>>> <<<<ar>>ah>> <<B<<i<<tu>>>><<ng>>>>. <<<<<<<<Ak>>i>>bat>><<nya>>>>, <<<<ter>><<ja<<di>>>>>> <<<<<<ke>><<p<<ada>>>>>>tan>> <<di>> <<loka<<si>>>> <<<<ke>>ce<<l<<ak>>a>>an>>. <<ht<<tp>>s>>://t.co/<<eeMnh>>9<<l<<ZJ>>a>>
## ...[truncated]...
## 8: Plot twist: <<<<Ibu>><<nya>>>> <<abis>> <<<<ke>>ce<<l<<ak>>a>>an>>, <<neme<<ni>>n>> <<ke>> <<UGD>> <<dan>> <<b<<ar>>u>> <<bisa>> <<<<di>>ti<<<<ng>><<ga>>>>l>>.
## 
## <<T<<api>>>> ya <<ga>> <<ma<<sa<<lah>>>>>> <<s<<ih>>>>. <<Yg>> <<pent<<i<<ng>>>>>> <<kan>> <<eTikA>> <<<<pR>>ofE<<si>>OnaL>>. <<ht<<tp>>s>>://t.co/0e2<<zHaMkCo>>
## ...[truncated]...
## 9: UPDATE <<LAGI>> <<JADWAL>> <<SAMSAT>> <<<<KE>>LILING>> DAN <<SAMSAT>> <<DESA>>
## .
## <<Tertib>> <<bay<<ar>>>> <<p<<aja>>k>> yuk, <<ð<<Ÿ>>>>˜
## .
## <<<<ð<<Ÿ>>>>š>>«Stop <<Pel<<a<<<<ng>><<ga>>>>ran>>>>
## <<<<ð<<Ÿ>>>>š>>«Stop <<<<<<<<Ke>>c>>e<<l<<ak>>a>>a>>n>>
## <<<<Ke>><<<<<<se>>l<<ama>>>>t>>an>> <<<<<<unt>>uk>><<â>>>>€¦ <<ht<<tp>>s>>://t.co/<<THOqIu>>80<<Zw>>
## ...[truncated]...
## 10: <<Dapet>> vi<<de>>o <<<<ke>><<<<ja<<di>>>>an>>>> <<<<ke>>ce<<l<<ak>>a>>an>> <<<<tu>><<<<ng>><<ga>>>>l>> <<di>> <<M<<ar>>go<<<<nd>>a>>>>, <<De<<po>>k>> <<ta<<di>>>> <<pa<<gi>>>>.. Ya Al<<lah>>.. <<<<se>>d<<ih>>>> <<<<l<<ia>>t>><<nya>>>>.. 
## 
## <<S<<<<uda>>h>>>> <<b<<ia>>sa>> <<l<<ia>>t>> <<yg>> <<ky>> gt.. Ga <<<<se>>rem>>, <<tp>> <<<<se>>d<<ih>>>> <<<<iy>>a>>.. <<Tu<<ru>>t>> <<<<ber>><<duka>>>>.. :(
## 
## <<Buat>> <<<<k<<ali>>>>an>> <<yg>> <<ba<<wa>>>> <<<<ke>><<<<nd>>a>>r<<aa>>n>>, <<jgn>> <<<<lu>>pa>> <<<<ber>><<doa>>>> <<<<se>><<be<<lu>>m>>>> <<be<<per<<gi>>>>an>>, <<h<<ati>>>>2 <<dan>> <<pa<<<<tu>>h>>i>> <<rambu>> <<y<<aa>>>>..
## ...[truncated]...
## 
## *Suggestion: Consider running `hunspell::hunspell_find` & `hunspell::hunspell_suggest`
## 
## 
## ==========
## NO ENDMARK
## ==========
## 
## The following observations contain elements with missing ending punctuation:
## 
## 1, 2, 3, 4, 5, 7, 8, 9, 11, 12...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 2: Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE
## ...[truncated]...
## 3: Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN
## ...[truncated]...
## 4: C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV
## ...[truncated]...
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 7: Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 11: 👦: "Abis kecelakaan dimna lo?"
## 
## 👧: " Gue gk kecelakaan kok, aman2 aja"
## 
## 👦: " trus itu knapa muka lo ancur"
## 
## SABAR. Muka jelek emang banyak cobaan
## ...[truncated]...
## 12: WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang - PT Bestprofit Futures Surabaya https://t.co/6Q1NFdiXc6
## ...[truncated]...
## 
## *Suggestion: Consider cleaning the raw text or running `add_missing_endmark`
## 
## 
## ====================
## NO SPACE AFTER COMMA
## ====================
## 
## The following observations contain commas with no space afterwards:
## 
## 13, 23, 26, 49, 112, 117, 132, 159, 171, 178...[truncated]...
## 
## This issue affected the following text:
## 
## 13: Shame on you @LionAirID @BoeingCEO @Boeing @BoeingAirplanes cc @kemenhub151 Keluarga Korban Kecelakaan Lion Air JT 610 Tolak Santunan Rp 1,25 Miliar, Ini Alasannya https://t.co/cCYr5NGhts
## ...[truncated]...
## 23: Santunan diberikan kepada warga sebesar  Rp1 juta untuk warga yang meninggal karena sakit, sedangkan korban kecelakaan mendapat santunan Rp2,5 juta.
## 
## Hingga triwulan pertama di tahun... https://t.co/vGvnIfUZRf
## ...[truncated]...
## 26: Ciel : waktu masa kecil aku gak tahu selalu ada aja kecelakaan yg terjadi pada diriku ,, papa aku bikin kopi yg panas banget alhasil ketumpahan kena perut aku ada aja dah pokoknya
## 
## #LRPD
## #PajamaClassA
## ...[truncated]...
## 49: Korban penipuan mah ga sakit, korban kecelakaan juga biasa,
## tpi kalo jdi korban perasaan kmu itu rasanya sakit pake bgt :’(
## ...[truncated]...
## 112: Kecelakaan,pembunuhan,perampokan ada di sekitarkita..
## Haruskah Kita TAKUT?hub 082312105777 Pin BB 5C5C8D3C
## https://t.co/YhrCdDS9Fw
## ...[truncated]...
## 117: Korban penipuan mah ga sakit, korban kecelakaan juga biasa,
## tpi kalo jdi korban perasaan kmu itu rasanya sakit pake bgt :’(
## ...[truncated]...
## 132: Korban penipuan mah ga sakit, korban kecelakaan juga biasa,
## tpi kalo jdi korban perasaan kmu itu rasanya sakit pake bgt :’(
## ...[truncated]...
## 159: "Jatuh cinta itu adalah sebuah kecelakaan yang sangat indah,, dan sebaiknya yang kita alami berulang kali pada orang yang sama,, tapi Jika tidaklah untuk selamanya..!! itu bukanlah cinta"
## ...[truncated]...
## 171: Terjadi kecelakaan beruntun di tol cipali majalengka km 150 arah jakarta,melibatkan bis safari,expander,innova dan truk pengangkut ayam. https://t.co/nCPl4n2x3M
## ...[truncated]...
## 178: FUSO TERGULING |
## 
## “Oprit hanya ditimbun dengan tanah bercampur batu. Sekarang mulai longsor dan saat hujan kondisi licin, sehingga sangat rawan terjadi kecelakaan,” 
## 
## https://t.co/rikynyUh1I
## ...[truncated]...
## 
## *Suggestion: Consider running `add_comma_space`
## 
## 
## =========
## NON ASCII
## =========
## 
## The following observations contain non-ASCII text:
## 
## 1, 5, 8, 9, 10, 11, 20, 21, 23, 24...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 10: Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. 
## 
## Sudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(
## 
## Buat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..
## ...[truncated]...
## 11: 👦: "Abis kecelakaan dimna lo?"
## 
## 👧: " Gue gk kecelakaan kok, aman2 aja"
## 
## 👦: " trus itu knapa muka lo ancur"
## 
## SABAR. Muka jelek emang banyak cobaan
## ...[truncated]...
## 20: Rembang – Satuan Lalu Lintas Polres Rembang masih menangani peristiwa kecelakaan lalu lintas di jalur Pantura Desa Punjulharjo, Rembang, yang mengakibatkan... https://t.co/LLh7U8BvpH
## ...[truncated]...
## 21: Akibat Andra ingin selalu menjaga Emon, dia jadi dikejar-kejar terus sama Tony! Yang lebih parahnya, Andra sampai kecelakaan..
## 
## Nonton #AnakLangitEps1096dan1097, pukul 16.40 WIB.
## #SCTVSinetron
## 
## Jangan lupa saksikan via streaming di @vidiodotcom dan follow channel SCTV juga ya! https://t.co/vdk5tjd6gF
## ...[truncated]...
## 23: Santunan diberikan kepada warga sebesar  Rp1 juta untuk warga yang meninggal karena sakit, sedangkan korban kecelakaan mendapat santunan Rp2,5 juta.
## 
## Hingga triwulan pertama di tahun... https://t.co/vGvnIfUZRf
## ...[truncated]...
## 24: Satgas Pamtas Yonif 725/Wrg Obati Warga Karena Kecelakaan Tunggal #TNIADMengabdiDanMembangunBersamaRakyat
## 
## https://t.co/Yq2ywPs8Sw
## ...[truncated]...
## 
## *Suggestion: Consider running `replace_non_ascii`
## 
## 
## ==================
## NON SPLIT SENTENCE
## ==================
## 
## The following observations contain unsplit sentences (more than one sentence per element):
## 
## 3, 4, 5, 7, 8, 9, 10, 11, 18, 19...[truncated]...
## 
## This issue affected the following text:
## 
## 3: Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN
## ...[truncated]...
## 4: C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV
## ...[truncated]...
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 7: Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 10: Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. 
## 
## Sudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(
## 
## Buat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..
## ...[truncated]...
## 11: 👦: "Abis kecelakaan dimna lo?"
## 
## 👧: " Gue gk kecelakaan kok, aman2 aja"
## 
## 👦: " trus itu knapa muka lo ancur"
## 
## SABAR. Muka jelek emang banyak cobaan
## ...[truncated]...
## 18: Telah terjadi, kecelakaan DEPAN MATA GUA BANGET BANGSAT! bapak bapak bawa motor ditabrak dari belakang sama mobil pick up! Aing lagi makan mie rebus di warkop pinggur jalan langsung lemes -_-
## ...[truncated]...
## 19: Resto H2B ini plg sukses dg layanan Go Foodnya.Antrean driver  mengular dan sabar menunggu.Sedih melihat anak bangsa, tak sedikit sarjana, bekerja tanpa jaminan kesehatan, jaminan kecelakaan kerja. Wish new President, new hope for better future of us
## ...[truncated]...
## 
## *Suggestion: Consider running `textshape::split_sentence`
## 
## 
## ===
## TAG
## ===
## 
## The following observations contain Twitter style handle tags (e.g., @trinker):
## 
## 13, 21, 52, 60, 65, 68, 96, 99, 101, 106...[truncated]...
## 
## This issue affected the following text:
## 
## 13: Shame on you @LionAirID @BoeingCEO @Boeing @BoeingAirplanes cc @kemenhub151 Keluarga Korban Kecelakaan Lion Air JT 610 Tolak Santunan Rp 1,25 Miliar, Ini Alasannya https://t.co/cCYr5NGhts
## ...[truncated]...
## 21: Akibat Andra ingin selalu menjaga Emon, dia jadi dikejar-kejar terus sama Tony! Yang lebih parahnya, Andra sampai kecelakaan..
## 
## Nonton #AnakLangitEps1096dan1097, pukul 16.40 WIB.
## #SCTVSinetron
## 
## Jangan lupa saksikan via streaming di @vidiodotcom dan follow channel SCTV juga ya! https://t.co/vdk5tjd6gF
## ...[truncated]...
## 52: Pak Presiden @jokowi, meninggal dalam jumlah ratusan hanya terjadi dalam apa yang disebut bencana atau kecelakaan.
## ...[truncated]...
## 60: BREAKING NEWS: Bus yang Mengangkut Rombongan Istri Lurah se-Bandung Kecelakaan di Tol Cipularang https://t.co/KrI4Pkz9IW via @tribunjabar
## ...[truncated]...
## 65: Mohon dibantu retwit, @Patifosi_Online
## @PATIFOSI_MANIA @infosuporter @SuporterBOLAcom @PSSI
## Saudara kami ketua Patifosi Malaya, kurang diperhatikan oleh pemerintah.
## Kecelakaan saat bertugas sbg @Panwaslu 17 April lalu https://t.co/W8DpNP13bC
## ...[truncated]...
## 68: Kecelakaan beruntun mobil avanza putih n innova d tol dalam kota arah surabaya km 14. Bagian depan innova ringsek. @RadioElshinta @e100ss @NTMCLantasPolri
## ...[truncated]...
## 96: Menurut PM 82 Tahun 2018 pasal 33, Pita Penggaduh berfungsi untuk mengurangi kecepatan kendaraan, mengingatkan pengemudi tentang objek di depan yang harus diwaspadai, melindungi penyeberang jalan, &amp; mengingatkan lokasi rawan kecelakaan - @dishubdiy https://t.co/T3PoEozMQn
## ...[truncated]...
## 99: Hubungi PSC 119 YES no.telp. 119 / 0274420118 unt layanan kegawatdaruratan kecelakaan lalu lintas/medis di wilayah Kota Yogyakarta, 24 jam. Gratis @PemkotJogja @uddkotayogya #InfoMBS
## ...[truncated]...
## 101: Hy #Sobattangerang hubungi 112 jika kamu menemukan atau mengalami kegawatdarurattan seperti kecelakaan, kebakaran, kriminalitas dan bencana alam. Layanan 24 jam dan bebas pulsa.
## @pmi_tangerangct 
## @Kota_Tangerang 
## @bpbd_tng 
## @restrotangkot 
## @Dinkes_tgrkota https://t.co/2rZyTUskxP
## ...[truncated]...
## 106: Gegara whatsapp down jadi inget dulu si @samoouth kecelakaan ngabarin lewat DM Twitter, bukan lewat BBM atau SMS
## 
## Dari cerita singkat ini kita bisa meneladani kisah hidup Samot yang mampu bertahan hanya dari DM twitter
## ...[truncated]...
## 
## *Suggestion: Consider using `qdapRegex::ex_tag' (to capture meta-data) and/or `replace_tag`
## 
## 
## ====
## TIME
## ====
## 
## The following observations contain timestamps:
## 
## 22, 33, 45, 48, 64, 156, 166, 180, 209, 231...[truncated]...
## 
## This issue affected the following text:
## 
## 22: [18:36] #JAKARTA #KECELAKAAN Persimpangan Slipi #JasaMarga
## ...[truncated]...
## 33: ♻️ @SenkomCMNP: 5:09 Wib. Kendaraan Truk Tangki Pertamina yang mengalami Kecelakaan di KM 16+600. Masih Penanganan Petugas. Lajur 1 dan 2 Sudah bisa di lewati.(uda) @SonoraFM92 @RadioElshinta https://t.co/9QqgdoBzQW
## ...[truncated]...
## 45: [07:26] #JAKARTA #KECELAKAAN Jl. Raya Kembangan Selatan #TMC
## ...[truncated]...
## 48: [21:56] #JAKARTA #KECELAKAAN Meruya #MargaMandalasakti
## ...[truncated]...
## 64: [13:36] #JAKARTA #KECELAKAAN Jl. Raya Kembangan Selatan #TMC
## ...[truncated]...
## 156: [15:34] #SIDOARJO #KECELAKAAN Tol Pandaan #SS
## ...[truncated]...
## 166: [18:43] #JAKARTA #KECELAKAAN Petukangan #MargaMandalasakti
## ...[truncated]...
## 180: [07:40] #JAKARTA #KECELAKAAN Mega Kuningan #Elshinta
## ...[truncated]...
## 209: [01:35] #KEP.SERIBU #KECELAKAAN Jl. Minangkabau #Elshinta
## ...[truncated]...
## 231: [13:27] #JAKARTA #KECELAKAAN Jl. Kyai Maja #Elshinta
## ...[truncated]...
## 
## *Suggestion: Consider using `replace_time`
## 
## 
## ===
## URL
## ===
## 
## The following observations contain URLs:
## 
## 1, 2, 3, 4, 5, 7, 8, 9, 12, 13...[truncated]...
## 
## This issue affected the following text:
## 
## 1: Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :
## https://t.co/gMHLep9IvZ mhmmdrhmtrmdhn
## Visit Wonderful  #MRahmatRamadhan
## ...[truncated]...
## 2: Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE
## ...[truncated]...
## 3: Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN
## ...[truncated]...
## 4: C.Gerakan.bicara pertolongan pertama pada kecelakaan (P3K-BAKAT) https://t.co/jlPyXK3EBV
## ...[truncated]...
## 5: Asuransi mana nih??
## 
## https://t.co/AJyABmimcY
## 
## PPATK tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut. Saat ini, pihaknya telah... https://t.co/AJyABmimcY
## ...[truncated]...
## 7: Terjadi kecelakaan truk muatan besar di Tol Kunciran, Serpong arah Bitung. Akibatnya, terjadi kepadatan di lokasi kecelakaan. https://t.co/eeMnh9lZJa
## ...[truncated]...
## 8: Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.
## 
## Tapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo
## ...[truncated]...
## 9: UPDATE LAGI JADWAL SAMSAT KELILING DAN SAMSAT DESA
## .
## Tertib bayar pajak yuk, 😁
## .
## 🚫Stop Pelanggaran
## 🚫Stop Kecelakaan
## Keselamatan untuk… https://t.co/THOqIu80Zw
## ...[truncated]...
## 12: WNI Korban Tewas Kecelakaan Bus di Malaysia Bertambah Jadi 4 Orang - PT Bestprofit Futures Surabaya https://t.co/6Q1NFdiXc6
## ...[truncated]...
## 13: Shame on you @LionAirID @BoeingCEO @Boeing @BoeingAirplanes cc @kemenhub151 Keluarga Korban Kecelakaan Lion Air JT 610 Tolak Santunan Rp 1,25 Miliar, Ini Alasannya https://t.co/cCYr5NGhts
## ...[truncated]...
## 
## *Suggestion: Consider using `replace_url`

Text preview
twitter$full_text %>% 
  head()
## [1] "Pelajar SMP Tewas Kecelakaan di Jalinsum Bandarlampung https://t.co/Tgc8D6k3m0 https://t.co/3jmozMMLv0"                                                                                                                         
## [2] "Orang-orang pulang nonton film ini langsung galau, kalau saya langsung cari cerita-cerita mistis tentang paes. \nðŸ\230¬ðŸ\230¬ðŸ\230¬\n\nScene paling 'jleb' waktu Nina bilang, \"Meninggal bu.. Kecelakaan..\" https://t.co/NX3OLimKbu"
## [3] "[22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC"                                                                                                                                                                                   
## [4] "[83:1] Kecelakaan besarlah bagi orang-orang yang curang"                                                                                                                                                                        
## [5] "Anggapannya kayak mobil vs motor kecelakaan, yg punya motor luka parah lalu mati, pdhl yg salah yg pake motor krn ngebut, dsb. Yg disalahin tetep yg gede dah.\n\nPemikiran jaman jahiliah yg melekat hingga skrg."             
## [6] "#TrukTangki\n\n#Kecelakaan\n#Evakuasi\n#NaganRaya\nHingga kini, mobil itu sudah dievakuasi menggunakan alat berat, setelah mengalami kecelakaan pada Rabu (3/4/2019) di desa itu. \n\nhttps://t.co/7JdI6lu2Zh"

This text preview gives us a quick look at the kinds of text “noise” we are dealing with. Some tweets contain URLs, numbers, hashtags, and even non-ASCII characters. Because the text comes from Twitter, some tweets also contain handle tags (for example @jokowi), which we need to remove since they are not important in this case.
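To put rough numbers on the noise described above, the following minimal sketch counts how many tweets contain each kind of noise. It uses base R only; the three sample tweets and the regular expressions are illustrative assumptions, not part of the original analysis, and the patterns are much simpler than the ones textclean uses internally.

```r
# Toy tweets standing in for twitter$full_text (hypothetical examples)
sample_text <- c(
  "Pelajar SMP Tewas Kecelakaan https://t.co/abc123",
  "[22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC",
  "Shame on you @LionAirID, santunan Rp 1,25 Miliar"
)

# One simplified regular expression per kind of noise (assumed patterns)
noise_patterns <- c(
  url       = "https?://\\S+",
  hashtag   = "#\\w+",
  tag       = "@\\w+",
  number    = "[0-9]",
  non_ascii = "[^\\x01-\\x7F]"
)

# Count how many tweets contain each kind of noise
noise_counts <- sapply(
  noise_patterns,
  function(p) sum(grepl(p, sample_text, perl = TRUE))
)
noise_counts
```

Replacing sample_text with twitter$full_text should give the same kind of summary for the real data, although the counts will not match the check_text() report exactly because the patterns differ.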

Dictionary for Indonesian slang language and stopwords
spell.lex <- read.csv("colloquial-indonesian-lexicon.csv")
stopword <- readLines("http://static.hikaruyuuki.com/wp-content/uploads/stopword_list_tala.txt")

Because the text is in Bahasa Indonesia, we cannot process it the same way as English text. The text cleaning functions rely on English dictionaries by default, so to process text in Bahasa Indonesia we have to supply Bahasa Indonesia dictionaries instead: a lexicon that maps slang words to their formal form, and a list of stopwords.

Text cleansing
RNGkind(sample.kind = "Rounding")
set.seed(1)

index <- sample(nrow(twitter_label_manual), 5)

twitter_label_manual$processed_text <- twitter_label_manual$full_text %>% 
  tolower() %>% # Convert all text to lowercase
  replace_url() %>% # Remove URLs (must run before punctuation is stripped)
  replace_hash() %>% # Remove hashtags
  replace_tag() %>% # Remove @handle tags
  removeNumbers() %>% # Remove digits
  str_replace_all(rx_punctuation(), " ") %>% # Replace punctuation with white space
  str_replace_all("[^\u0001-\u007F]+|<U\\+\\w+>", " ") %>% # Replace non-ASCII characters with white space
  str_replace_all("\n", " ") %>% # Replace newline characters with white space
  replace_emoji() %>% # Replace any remaining emoji with their word descriptions
  removeWords(lexicon::hash_emojis$y) %>% # Remove the words produced by replacing emoji with their descriptions
  str_squish() # Remove extra white space

twitter_label_manual %>% 
  slice(index) %>% 
  pull(processed_text)
## [1] "terjadi kecelakaan di jl raya keloposepuluh sukodono dengan kronologi seorang kakek sedang menyiram jalan raya di depan rumahnya dan tiba tiba di tabrak oleh pengendara motor untuk arus lalu lintas tetap terpantau ramai lancar red susi de"
## [2] "pengamat kok nalarnya gitu amat atau dia sekedar ikut trend kekinian yang apa apa taripnya naik"                                                                                                                                               
## [3] "kasus kecelakaan di jalan raya alat ukur tekanan ban harus disediakan"                                                                                                                                                                         
## [4] "evakuasi mi dimulai helikopter berhasil tembus lokasi kecelakaan"                                                                                                                                                                              
## [5] "kecelakaan beruntun di jalinsum labuhanbatu selatan orang tewas"

In this step we convert the text to lowercase and remove URLs, hashtags, tags, numbers, punctuation, non-ASCII characters, newline characters, and emoji. The order of the functions matters. For example, if we remove punctuation before removing URLs, the URL function no longer works, because a URL contains punctuation characters. Once those are gone, the function cannot recognize the URL pattern, and the remaining non-punctuation characters of the URL stay in the text as additional “noise”. When cleaning text, it is a good idea to look at as many random samples as possible, since different texts may respond differently to the order of the functions. Some characters, words, or patterns may also need to be removed with a different function.
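The effect of the ordering can be seen on a single tweet. The sketch below is a base R illustration only: url_pattern, strip_punct() and squish() are simplified stand-ins (assumptions) for replace_url(), the punctuation step and str_squish() used above.

```r
tweet <- "Terjadi kecelakaan di Tol Kunciran. https://t.co/eeMnh9lZJa"

url_pattern <- "https?://\\S+"  # simplified URL pattern (assumption)
strip_punct <- function(x) gsub("[[:punct:]]", " ", x)
squish      <- function(x) trimws(gsub("\\s+", " ", x))

# Correct order: remove the URL first, then the punctuation
good <- squish(strip_punct(gsub(url_pattern, " ", tweet)))

# Wrong order: punctuation first, so the URL pattern no longer matches
bad <- squish(gsub(url_pattern, " ", strip_punct(tweet)))

good  # "Terjadi kecelakaan di Tol Kunciran"
bad   # "Terjadi kecelakaan di Tol Kunciran https t co eeMnh9lZJa"
```

In the wrong order the fragments "https", "t", "co" and the link ID survive as extra tokens, which is exactly the additional noise described above.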

Replace Indonesian slang language
# replace_slang <- twitter_label_manual$processed_text %>%
#   replace_internet_slang(slang = paste0("\\b", spell.lex$slang, "\\b"),
#                          replacement = spell.lex$formal,
#                          ignore.case = T)
# 
# saveRDS(replace_slang, "replace_slang.RDS")
Import RDS for processed slang words
processed_slang <- readRDS("replace_slang.RDS")
processed_slang[1:5]
## [1] "rekaman cctv kecelakaan motor di pik depan taman grisenda mhmmdrhmtrmdhn visit wonderful"                                        
## [2] "tewaskan orang dalam kecelakaan bos boeing meminta maaf"                                                                         
## [3] "anggota parlemen taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk"
## [4] "sih gerakan bicara pertolongan pertama pada kecelakaan pakai bakat"                                                              
## [5] "asuransi mana nih tidak memberikan rincian secara pasti siapa dan darimana asal partai caleg tersebut saat ini pihaknya telah"

To replace slang words in Bahasa Indonesia, we pass the Bahasa Indonesia slang dictionary to the replace_internet_slang() function, because by default the function uses an English dictionary. Since this process can take hours to run, it is wise to save the result to a file. In my case I prefer to save it as an .RDS file.
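The idea behind the dictionary lookup can be sketched in base R. This is not how replace_internet_slang() is implemented; replace_slang_toy() and the three lexicon entries are hypothetical and only mirror the slang and formal columns used above. The word boundary \\b is what keeps a short slang word from matching inside a longer word.

```r
# Toy lexicon with the same two columns used above (entries are illustrative)
toy_lex <- data.frame(
  slang  = c("yg", "gak", "tp"),
  formal = c("yang", "tidak", "tapi"),
  stringsAsFactors = FALSE
)

replace_slang_toy <- function(x, lex) {
  for (i in seq_len(nrow(lex))) {
    # \\b restricts the match to whole words, so "tp" inside "http" is untouched
    pattern <- paste0("\\b", lex$slang[i], "\\b")
    x <- gsub(pattern, lex$formal[i], x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

slang_result <- replace_slang_toy("korban yg selamat gak banyak tp http tetap", toy_lex)
slang_result  # "korban yang selamat tidak banyak tapi http tetap"
```

Looping over every lexicon entry for every text is also a plausible reason why the real step is slow on a large lexicon.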

Stemming, Tokenizing and Removing Stopwords

Before removing stopwords, we reduce each word to its base form through a process called stemming. Once the words are stemmed, we split each text into individual keywords through tokenizing and remove the stopwords. As with replacing slang words, stemming can take hours to run, so it is encouraged to save the result before moving on to the next step.
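One way to make the "save the result" habit automatic is a small caching helper. The sketch below is an assumption about how this could be organised, not part of the original project: cached(), slow_step() and the file name are hypothetical, while readRDS() and saveRDS() are the same base R functions used in this report.

```r
# Hypothetical helper: compute once, then reuse the saved .RDS file
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

cache_file <- file.path(tempdir(), "toy_step.RDS")
unlink(cache_file)  # start from a clean state for this demonstration

calls <- 0
slow_step <- function() {
  calls <<- calls + 1  # count how often the expensive step really runs
  toupper(c("celaka", "motor"))
}

first  <- cached(cache_file, slow_step)  # runs slow_step and writes the file
second <- cached(cache_file, slow_step)  # reads the file, slow_step is skipped
calls
```

With a helper like this, the long-running code would not need to be commented out after the first run; deleting the .RDS file forces a recomputation.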

Stemming
# stemming <- function(x){
#   paste(lapply(x,katadasar),collapse = " ")}
# 
# stemmed_text <- lapply(tokenize_words(processed_slang[]), stemming)
# 
# saveRDS(stemmed_text, "stemmed_text.RDS")
Import RDS for stemmed words
processed_stem <- readRDS("stemmed_text.RDS")
processed_stem[1:5]
## [[1]]
## [1] "rekam cctv celaka motor di pik depan taman grisenda mhmmdrhmtrmdhn visit wonderful"
## 
## [[2]]
## [1] "tewas orang dalam celaka bos boeing minta maaf"
## 
## [[3]]
## [1] "anggota parlemen taiwan juga rencana tingkat denda maksimum dan masa hukum bagi orang yang setir dalam ada mabuk"
## 
## [[4]]
## [1] "sih gerak bicara tolong pertama pada celaka pakai bakat"
## 
## [[5]]
## [1] "asuransi mana nih tidak berik rincian cara pasti siapa dan darimana asal partai caleg sebut saat ini pihak telah"

The katadasaR package provides a dictionary in Bahasa Indonesia for reducing words to their base form. Although stemming is supposed to normalize the words, not every word ends up in its correct base form; for example, "memberikan" becomes "berik" in the fifth text above.

Tokenizing and removing stopwords
processed_stopwords <- tokenize_words(processed_stem, stopwords = stopword)
processed_stopwords[1:5]
## [[1]]
##  [1] "rekam"          "cctv"           "celaka"         "motor"         
##  [5] "pik"            "taman"          "grisenda"       "mhmmdrhmtrmdhn"
##  [9] "visit"          "wonderful"     
## 
## [[2]]
## [1] "tewas"  "orang"  "celaka" "bos"    "boeing" "maaf"  
## 
## [[3]]
##  [1] "anggota"  "parlemen" "taiwan"   "rencana"  "tingkat"  "denda"   
##  [7] "maksimum" "hukum"    "orang"    "setir"    "mabuk"   
## 
## [[4]]
## [1] "sih"    "gerak"  "bicara" "tolong" "celaka" "pakai"  "bakat" 
## 
## [[5]]
## [1] "asuransi" "nih"      "berik"    "rincian"  "darimana" "partai"   "caleg"

Using the Bahasa Indonesia stopword list from http://static.hikaruyuuki.com/wp-content/uploads/stopword_list_tala.txt, we can remove the listed words while tokenizing the texts by passing the list to the stopwords argument of the tokenize_words() function.
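What tokenizing with a stopword list does can be illustrated with a minimal base R sketch. The four-word `stopword_demo` list and the helper `remove_stopwords()` are hypothetical; the project itself uses tokenize_words() with the full list.

```r
# Hypothetical mini stopword list (the real list is much longer).
stopword_demo <- c("di", "dalam", "yang", "dan")

stemmed <- c("tewas orang dalam celaka bos boeing",
             "rekam cctv celaka motor di pik")

# Split a text on whitespace and drop every token found in the stopword list.
remove_stopwords <- function(text, stopwords) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens[!tokens %in% stopwords]
}

tokens_demo <- lapply(stemmed, remove_stopwords, stopwords = stopword_demo)
print(tokens_demo)
```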

Wordcloud

After stemming, tokenizing, and removing stopwords, we have separate words in their base form that we can use as our predictors. To get a preview of the words, we plot a wordcloud, which shows which words are used most frequently compared to the others.

Wordcloud plot

Wordcloud
RNGkind(sample.kind = "Rounding")
set.seed(126)

wordcloud::wordcloud(
  processed_stopwords %>% as.character()
)

Wordcloud with minimum frequency of 5
RNGkind(sample.kind = "Rounding")
set.seed(126)

wordcloud::wordcloud(
  processed_stopwords %>% as.character(),
  min.freq = 5
)

Wordcloud with minimum frequency of 20
RNGkind(sample.kind = "Rounding")
set.seed(126)

wordcloud::wordcloud(
  processed_stopwords %>% as.character(),
  min.freq = 20
)

In the wordcloud where the minimum frequency is not limited, we cannot really see which words are used most often. However, when we raise the minimum frequency, the word “celaka” is highlighted by the wordcloud. This means that the word “celaka” is frequently used in the tweeted texts about an accident.
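The same information can be read from a plain frequency table. The sketch below uses a hypothetical three-tweet token list and a made-up threshold of 2, only to show how a minimum frequency filters the words.

```r
# Hypothetical tokenized tweets.
tokens_demo <- list(c("celaka", "motor", "tol"),
                    c("celaka", "truk"),
                    c("celaka", "tol", "korban"))

# Count every word across all tweets and sort from most to least frequent.
word_freq <- sort(table(unlist(tokens_demo)), decreasing = TRUE)

# Keep only the words that appear at least `min_freq` times.
min_freq <- 2
frequent <- names(word_freq[word_freq >= min_freq])
print(word_freq)
print(frequent)
```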

Machine Learning Model

For this type of problem, I think three methods are best suited to creating the Machine Learning model for this project. The three models are as follows:

  1. Naive Bayes model: The Naive Bayes model has the advantage of training faster than the other models that I have chosen. However, we have to keep in mind that the Naive Bayes model, as the name implies, is a naive model: it assumes that the predictors are independent of each other given the class. If the predictors are correlated with each other, we might get a biased result.

  2. Random Forest: The Random Forest model usually produces the best performing model of the methods that I have chosen. It builds a set number of decision “trees”, each on a random sample of the data and the predictors, and combines the votes of all the “trees” into the final prediction. However, this method can take hours or even days to run on large data.

  3. Neural Network: As an alternative to the Random Forest, the Neural Network can be used to create our Machine Learning model, as it is highly customizable and able to learn complex patterns in the data. However, this method can be confusing to use because there are no standard values for the parameters, so finding the best configuration is a matter of trial and error.

Naive Bayes

Create Train and Test data
RNGkind(sample.kind = "Rounding")
set.seed(126)

twitter_processed <- twitter_label_manual %>%
  mutate(is_accident = as.factor(ifelse(is_accident == 1, "yes", "no")))
twitter_processed$processed_text <- processed_stopwords

index <- sample(nrow(twitter_processed), nrow(twitter_processed)*0.75)
naive_train <- twitter_processed[index,]
naive_test <- twitter_processed[-index,]

The first step in creating any Machine Learning model is to split the dataframe into train and test data; here 75% of the rows go to the train data. As the names imply, the train data is used to train the machine learning model, while the test data is used to test its performance.

Create Train and Test label
label_train <- naive_train$is_accident
label_test <- naive_test$is_accident

For the NLP Naive Bayes model, we have to separate the train and test labels from the train and test data.

Check the proportion of the label on the train data
prop.table(table(naive_train$is_accident))
## 
##        no       yes 
## 0.6338216 0.3661784

When the class proportions are not balanced, we have to balance the data using either upSample() or downSample(), depending on how big the difference between the “yes” and “no” proportions is. If we do not balance the data, the machine learning model might not be able to properly tell the difference between “yes” and “no”, as it is trained to say “no” more often than “yes”.

Balancing the data using upSample method
naive_train_ups <- upSample(x = naive_train %>% select(-is_accident),
                            y = naive_train$is_accident,
                            yname = "is_accident")

label_train_ups <- naive_train_ups$is_accident
Check the balance of the data after balancing
prop.table(table(naive_train_ups$is_accident))
## 
##  no yes 
## 0.5 0.5
prop.table(table(label_train_ups))
## label_train_ups
##  no yes 
## 0.5 0.5

We can see that after balancing the data, the proportions of “yes” and “no” are equal. I chose upSample() because I would like to increase the likelihood that the model detects accident texts while keeping as much information as possible, even if it means duplicating observations of the minority class.
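To make the idea of upsampling concrete, here is a minimal base R sketch that resamples the smaller class with replacement until both classes have the same size. It is an assumption about the general technique, written for illustration with a hypothetical six-row dataframe; it is not the internals of upSample().

```r
set.seed(126)

# Hypothetical imbalanced data: 4 "no" and 2 "yes".
train_demo <- data.frame(
  text = paste("tweet", 1:6),
  is_accident = factor(c("no", "no", "no", "no", "yes", "yes"))
)

counts <- table(train_demo$is_accident)
target <- max(counts)                      # size of the largest class

# For each class, keep all rows and add randomly repeated rows up to `target`.
upsampled <- do.call(rbind, lapply(names(counts), function(cls) {
  rows <- which(train_demo$is_accident == cls)
  extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
  train_demo[c(rows, extra), ]
}))

print(table(upsampled$is_accident))
```

Every added row is a copy of an existing minority-class row, so no text is lost, at the cost of repeated observations.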

For English text, we usually process the data after converting the text into a corpus. Bahasa Indonesia, however, requires a different process, so the conversion is done after the text pre-processing is finished. Since we have processed our text data earlier, we can convert it into a corpus and then directly into a Document Term Matrix (DTM).

Convert the keywords into Corpus
corpus_test <- VCorpus(VectorSource(naive_test$processed_text))
corpus_train_ups <- VCorpus(VectorSource(naive_train_ups$processed_text))

dtm_test <- DocumentTermMatrix(x = corpus_test)
dtm_train_ups <- DocumentTermMatrix(x = corpus_train_ups)

inspect(dtm_train_ups)
## <<DocumentTermMatrix (documents: 952, terms: 2705)>>
## Non-/sparse entries: 10127/2565033
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  arah celaka jalan kendara korban orang padat tewas tol truk
##   142    0      1     0       0      0     0     0     0   0    0
##   165    0      1     0       0      0     0     0     0   0    0
##   177    0      1     0       0      0     0     0     0   0    0
##   211    0      1     0       0      0     1     0     0   0    0
##   258    0      1     0       0      0     1     0     0   0    0
##   267    0      1     1       0      0     0     0     0   0    0
##   469    0      1     0       0      0     0     0     0   0    0
##   589    3      1     0       0      0     0     7     0   0    0
##   68     0      1     0       0      0     0     0     0   0    0
##   74     0      1     0       0      0     0     0     0   0    0

This step separates the keywords and turns each word into a term, as shown in the sample above. We can see that each term is now treated like a variable of a dataframe. The value in each cell is the number of times the term appears in a document, where each document contains one tweeted text from a user.
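How a Document Term Matrix is put together can be sketched in base R. The three token vectors in `docs` are hypothetical, and the result is a plain integer matrix rather than the sparse object produced by DocumentTermMatrix().

```r
# Hypothetical tokenized documents (one vector per tweet).
docs <- list(c("celaka", "tol", "celaka"),
             c("truk", "tol"),
             c("korban"))

# The vocabulary is the sorted set of all distinct words.
vocab <- sort(unique(unlist(docs)))

# One row per document, one column per term, cells hold the term counts.
dtm_demo <- t(vapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm_demo) <- vocab

print(dtm_demo)
```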

Dimension of the DTM
dim(dtm_train_ups)
## [1]  952 2705

I took the dimension of “dtm_train_ups” as representative of all the DTM data, and we can see that our data has a very large number of predictors (2705 terms). In the wordclouds that we have seen, the words that we recognize as likely accident words became more pronounced as we increased the minimum frequency, because the words which appear less often are considered unimportant.

Raising the minimum frequency of the words
dtm_freq_ups <- findFreqTerms(x = dtm_train_ups,lowfreq = 5)
dtm_train_ups <- dtm_train_ups[, dtm_freq_ups]

Based on my personal observation of the model's performance, the best minimum frequency is 5. As we go higher than 5, the accuracy of the model's predictions decreases. As we go lower than 5, the accuracy on the test data stays the same while the accuracy on the train data increases, which leads to overfitting.
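The frequency filter above is applied to the train DTM only. One possible way to give the test data exactly the same columns is sketched below in base R; this is an assumption added for illustration, not a step taken in this project, and `align_terms()` and the toy count matrices are hypothetical.

```r
# Hypothetical term counts for train (3 tweets) and test (2 tweets).
train_counts <- matrix(c(3, 0, 1,  2, 1, 0,  0, 0, 1), nrow = 3,
                       dimnames = list(NULL, c("celaka", "padat", "tol")))
test_counts <- matrix(c(1, 0,  2, 1,  1, 1), nrow = 2,
                      dimnames = list(NULL, c("tol", "celaka", "baru")))

# Keep the terms whose total count in the train data reaches the threshold.
min_freq <- 3
keep <- colnames(train_counts)[colSums(train_counts) >= min_freq]

# Give a matrix exactly the columns in `terms`; unseen terms are filled with 0.
align_terms <- function(m, terms) {
  out <- matrix(0, nrow = nrow(m), ncol = length(terms),
                dimnames = list(NULL, terms))
  shared <- intersect(terms, colnames(m))
  out[, shared] <- m[, shared, drop = FALSE]
  out
}

test_aligned <- align_terms(test_counts, keep)
print(test_aligned)
```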

Bernoulli converter
bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

dtm_train_ups_bn <- apply(dtm_train_ups, MARGIN = 2, FUN = bernoulli_conv)
dtm_test_bn <- apply(dtm_test, MARGIN = 2, FUN = bernoulli_conv)

The Bernoulli converter turns the frequency of each word into a presence indicator: any count greater than 0 becomes 1, and 0 stays 0.
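To show how presence indicators feed a Naive Bayes prediction, here is a small hand calculation in base R. The four-tweet dataset is hypothetical, and the add-one (Laplace) smoothing is an assumption made so that no probability is exactly 0; it is not necessarily the setting of the model trained in this project.

```r
# Hypothetical presence/absence data: rows are tweets, columns are terms.
x <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0), ncol = 2,
            dimnames = list(NULL, c("celaka", "tol")))
y <- factor(c("yes", "yes", "no", "no"))

# Prior probability of each class.
prior <- setNames(as.numeric(table(y)) / length(y), levels(y))

# P(term present | class) with add-one smoothing; rows = terms, cols = classes.
p_term <- sapply(levels(y), function(cls) {
  (colSums(x[y == cls, , drop = FALSE]) + 1) / (sum(y == cls) + 2)
})

# New tweet: contains "celaka" but not "tol".
new_tweet <- c(celaka = 1, tol = 0)
lik <- apply(p_term, 2, function(p) {
  prod(ifelse(new_tweet[rownames(p_term)] == 1, p, 1 - p))
})

# Bayes rule: posterior is proportional to prior times likelihood.
posterior <- prior * lik / sum(prior * lik)
print(posterior)
```

The product over terms is where the "naive" independence assumption enters.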

Naive bayes model
# naive_model <- naiveBayes(x = dtm_train_ups_bn,
#                               y = label_train_ups)
# 
# saveRDS(naive_model, "naive_model.RDS")
naive_model <- readRDS("naive_model.RDS")
Prediction
pred_naive <- predict(object = naive_model,
                      newdata = dtm_test_bn,
                      type = "class")

pred_train <- predict(object = naive_model,
                      newdata = dtm_train_ups_bn,
                      type = "class")
Confusion matrix for minimum frequency of 5
confusionMatrix(data = pred_naive,
                reference = label_test,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  152  16
##        yes  14  69
##                                               
##                Accuracy : 0.8805              
##                  95% CI : (0.8338, 0.9179)    
##     No Information Rate : 0.6614              
##     P-Value [Acc > NIR] : 0.000000000000001324
##                                               
##                   Kappa : 0.7316              
##                                               
##  Mcnemar's Test P-Value : 0.8551              
##                                               
##             Sensitivity : 0.8118              
##             Specificity : 0.9157              
##          Pos Pred Value : 0.8313              
##          Neg Pred Value : 0.9048              
##              Prevalence : 0.3386              
##          Detection Rate : 0.2749              
##    Detection Prevalence : 0.3307              
##       Balanced Accuracy : 0.8637              
##                                               
##        'Positive' Class : yes                 
## 
confusionMatrix(data = pred_train,
                reference = label_train_ups,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  444  54
##        yes  32 422
##                                               
##                Accuracy : 0.9097              
##                  95% CI : (0.8896, 0.9271)    
##     No Information Rate : 0.5                 
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.8193              
##                                               
##  Mcnemar's Test P-Value : 0.02354             
##                                               
##             Sensitivity : 0.8866              
##             Specificity : 0.9328              
##          Pos Pred Value : 0.9295              
##          Neg Pred Value : 0.8916              
##              Prevalence : 0.5000              
##          Detection Rate : 0.4433              
##    Detection Prevalence : 0.4769              
##       Balanced Accuracy : 0.9097              
##                                               
##        'Positive' Class : yes                 
## 

With an overall accuracy of more than 80% on the test data (88.05%), we can conclude that the model is able to differentiate which text is about an accident and which is not. Furthermore, the accuracy on the train data (90.97%) is higher than the accuracy on the test data by less than 5%, so the model shows no sign of serious overfitting or underfitting.
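The main statistics in the first output above can be recomputed by hand from the four cells of the test confusion matrix. The sketch below uses the counts printed above; the variable names are only for illustration.

```r
# Test confusion matrix from the output above (rows = prediction, cols = reference).
cm <- matrix(c(152, 14, 16, 69), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

tp <- cm["yes", "yes"]   # accident tweets predicted as accident
tn <- cm["no", "no"]     # other tweets predicted as other
fp <- cm["yes", "no"]    # other tweets predicted as accident
fn <- cm["no", "yes"]    # accident tweets that were missed

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the output

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```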

Random Forest

RNGkind(sample.kind = "Rounding")
set.seed(126)

corpus_forest <- VCorpus(VectorSource(twitter_processed$processed_text))
dtm_forest <- DocumentTermMatrix(x = corpus_forest)
dtm_forest_freq <- findFreqTerms(x = dtm_forest,lowfreq = 5)
dtm_forest <- dtm_forest[, dtm_forest_freq]

bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}
dtm_forest_bn <- apply(dtm_forest, MARGIN = 2, FUN = bernoulli_conv)

forest_data <- data.frame(as.matrix(dtm_forest_bn))
forest_data$is_accident <- twitter_processed$is_accident

index <- sample(nrow(forest_data), nrow(forest_data)*0.75)
forest_train <- forest_data[index,]
forest_test <- forest_data[-index,]

As with any other model, we have to split the dataframe into train and test data to create the model. However, the Random Forest model accepts its data in the form of a dataframe. In this case, we build the DTM the same way as in the Naive Bayes method and convert it into a dataframe, as seen above.

Random forest modeling
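# RNGkind(sample.kind = "Rounding")
# set.seed(126)
# 
# # 3-fold cross-validation, repeated 5 times
# ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
# # Note: the saved model printed below reports bootstrap resampling (25 reps),
# # which is caret's default, so it appears to have been trained without this
# # control taking effect.
# forest_model <- train(is_accident ~ ., data = forest_train, method = "rf", trControl = ctrl)
# 
# saveRDS(forest_model, "forest_model_adv.RDS")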
Load random forest model
model_forest <- readRDS("forest_model_adv.RDS")
model_forest
## Random Forest 
## 
## 752 samples
## 123 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 752, 752, 752, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.7819397  0.4715446
##    62   0.7714177  0.5138179
##   123   0.7594235  0.4892080
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

By default, a random forest grows 500 decision trees, each on a random sample of the data, and combines the votes of all the trees into the final prediction. On top of that, the train() function evaluates several values of mtry (the number of predictors tried at each split) by resampling and keeps the one with the best accuracy; in the output above, mtry = 2 was selected. For a more in-depth evaluation, the control settings can split the data at random into a kind of temporary train and test data for a chosen number of folds, and repeat that process a chosen number of times. Such settings can produce a better performing model than the defaults. However, depending on how many observations and variables we have, the splitting and repeating increase the time needed to generate the random forest model, so it is recommended to save the model after generating it.
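The voting step of a random forest can be illustrated with a toy example in base R. The `tree_votes` matrix is hypothetical (3 tweets, 3 trees); a real forest would have hundreds of trees.

```r
# Hypothetical votes: one row per tweet, one column per tree.
tree_votes <- matrix(c("yes", "no",  "yes",
                       "no",  "no",  "yes",
                       "yes", "yes", "yes"),
                     nrow = 3, byrow = TRUE)

# The forest predicts the class that receives the most votes in each row.
forest_pred <- apply(tree_votes, 1, function(votes) {
  names(which.max(table(votes)))
})
print(forest_pred)
```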

Prediction
pred_forest <- predict(model_forest, forest_test)
pred_forest_train <- predict(model_forest, forest_train)
Confusion matrix
confusionMatrix(data = pred_forest,
                reference = forest_test$is_accident,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  163  42
##        yes   3  43
##                                           
##                Accuracy : 0.8207          
##                  95% CI : (0.7676, 0.8661)
##     No Information Rate : 0.6614          
##     P-Value [Acc > NIR] : 0.00000001463   
##                                           
##                   Kappa : 0.5493          
##                                           
##  Mcnemar's Test P-Value : 0.00000001473   
##                                           
##             Sensitivity : 0.5059          
##             Specificity : 0.9819          
##          Pos Pred Value : 0.9348          
##          Neg Pred Value : 0.7951          
##              Prevalence : 0.3386          
##          Detection Rate : 0.1713          
##    Detection Prevalence : 0.1833          
##       Balanced Accuracy : 0.7439          
##                                           
##        'Positive' Class : yes             
## 
confusionMatrix(data = pred_forest_train,
                reference = forest_train$is_accident,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  470 124
##        yes   6 151
##                                                
##                Accuracy : 0.8269               
##                  95% CI : (0.7979, 0.8533)     
##     No Information Rate : 0.6338               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5899               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.5491               
##             Specificity : 0.9874               
##          Pos Pred Value : 0.9618               
##          Neg Pred Value : 0.7912               
##              Prevalence : 0.3662               
##          Detection Rate : 0.2011               
##    Detection Prevalence : 0.2091               
##       Balanced Accuracy : 0.7682               
##                                                
##        'Positive' Class : yes                  
## 

With an overall accuracy of 82.07%, the sensitivity (recall) is only 50.59%. In other words, the model rarely raises a false alarm (3 false positives on the test data), but it misses about half of the accident texts (42 false negatives out of 85), which we would want to avoid as far as possible.
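One way to summarize this trade-off in a single number is the F1 score, the harmonic mean of precision and recall. The F1 score is not reported in this project; the sketch below only applies the standard formula to the test confusion matrices printed above for Naive Bayes and Random Forest.

```r
# F1 score from true positives, false positives, and false negatives.
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Counts taken from the test confusion matrices above.
f1_naive  <- f1_score(tp = 69, fp = 14, fn = 16)
f1_forest <- f1_score(tp = 43, fp = 3,  fn = 42)

print(round(c(naive_bayes = f1_naive, random_forest = f1_forest), 4))
```

The lower value for Random Forest reflects its many missed accident texts, even though its precision is high.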

Neural Network LSTM (Long Short-Term Memory)

Compared to the other machine learning methods, the Neural Network is the most complicated model to build of all the models used in this paper. Because the keras library is an interface to a Python library, the steps to create the model are a bit more complex.

Tokenizing words with keras tokenizer
keras_tokenizer <- text_tokenizer(1024) %>% fit_text_tokenizer(unlist(processed_stem))

Because keras expects tokenized words in a different form than the output of the tokenizers library, we have to tokenize the text using the keras tokenizer. The number passed to the text_tokenizer() function limits the vocabulary to the most frequent words, which filters out words that are rarely used in the texts.
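The idea behind a frequency-capped tokenizer can be sketched in base R: rank the words by frequency, give each an integer index, and drop the words whose index falls outside the cap. The texts, the cap of 3, and the helper `to_sequence()` are hypothetical, and the rule `index < num_words` is my understanding of how the Keras tokenizer applies the cap, stated here as an assumption.

```r
# Hypothetical stemmed texts.
texts <- c("celaka motor tol", "celaka truk", "celaka tol korban")

# Rank words by frequency; the most frequent word gets index 1.
freq <- sort(table(unlist(strsplit(texts, " "))), decreasing = TRUE)
word_index <- setNames(seq_along(freq), names(freq))

num_words <- 3   # hypothetical cap (the project uses 1024)

# Map a text to integer indices, keeping only indices below the cap.
to_sequence <- function(text) {
  idx <- word_index[strsplit(text, " ")[[1]]]
  unname(idx[idx < num_words])
}

sequences <- lapply(texts, to_sequence)
print(sequences)
```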

Create Train and Test data
twitter_keras <- twitter_label_manual
twitter_keras$processed_stem <- unlist(processed_stem)

set.seed(126)
intrain <- initial_split(twitter_keras, 0.8, "is_accident")
lstm_train <- training(intrain)
lstm_test <- testing(intrain)

inval <- initial_split(lstm_test, 0.5, "is_accident")
lstm_val <- training(inval)
lstm_test <- testing(inval)

For this model I split the dataframe into 3 groups: train, validation, and test data. The train and validation data are used to fit the model and monitor its performance during training, and the test data is used to find out the performance of the model on data which it has never seen before.
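The resulting 80/10/10 proportions can be illustrated with a simple base R sketch. Unlike the code above, this sketch is not stratified by the target, and the row count `n` is hypothetical.

```r
set.seed(126)

n <- 1000                         # hypothetical number of tweets
shuffled <- sample(n)             # random permutation of the row indices

n_train <- floor(0.8 * n)         # 80% for training
n_val <- floor((n - n_train) / 2) # half of the remainder for validation

train_idx <- shuffled[seq_len(n_train)]
val_idx   <- shuffled[n_train + seq_len(n_val)]
test_idx  <- shuffled[(n_train + n_val + 1):n]

print(c(train = length(train_idx), val = length(val_idx), test = length(test_idx)))
```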

Separating predictor and target
maxlen <- max(str_count(twitter_keras$processed_stem, "\\w+")) + 1

lstm_train_x <- texts_to_sequences(keras_tokenizer, lstm_train$processed_stem) %>% 
  pad_sequences(maxlen)

lstm_val_x <- texts_to_sequences(keras_tokenizer, lstm_val$processed_stem) %>% 
  pad_sequences(maxlen)

lstm_test_x <- texts_to_sequences(keras_tokenizer, lstm_test$processed_stem) %>% 
  pad_sequences(maxlen)

lstm_train_y <- to_categorical(lstm_train$is_accident, num_classes = 2)
lstm_val_y <- to_categorical(lstm_val$is_accident, num_classes = 2)
lstm_test_y <- to_categorical(lstm_test$is_accident, num_classes = 2)

The data preparation for the Neural Network differs from the rest of the models used in this project: we have to separate the predictors and the target from the data. The objects ending in “x” above hold the predictors, and the objects ending in “y” hold the target variable.
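What padding and one-hot encoding do to the data can be shown with two small base R helpers. `pad_left()` and `one_hot()` are hypothetical stand-ins written for illustration; padding and truncating at the front is assumed here to mirror the default behaviour of pad_sequences().

```r
# Pad with zeros at the front (or cut from the front) to reach `maxlen`.
pad_left <- function(seq, maxlen) {
  seq <- tail(seq, maxlen)
  c(rep(0L, maxlen - length(seq)), seq)
}

# Turn 0/1 labels into a matrix with one column per class.
one_hot <- function(labels, num_classes) {
  out <- matrix(0, nrow = length(labels), ncol = num_classes)
  out[cbind(seq_along(labels), labels + 1)] <- 1
  out
}

padded <- pad_left(c(5L, 3L), maxlen = 4)
encoded <- one_hot(c(0, 1, 1), num_classes = 2)
print(padded)
print(encoded)
```

The second column of the one-hot matrix corresponds to the "yes" class, which is why the prediction code later reads column 2.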

Modeling
lstm_model <- keras_model_sequential()

lstm_model %>% 
  layer_embedding(
    name = "input",
    input_dim = length(keras_tokenizer$word_counts),
    input_length = maxlen,
    output_dim = 32,
    embeddings_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 126)
  ) %>% 
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.5,
    seed = 126
  ) %>% 
  layer_lstm(
    name = "lstm",
    units = 256,
    dropout = 0.2,
    recurrent_dropout = 0.2,
    return_sequences = F,
    recurrent_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 126),
    kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 126)
  ) %>% 
  layer_dense(
    name = "output",
    units = 2,
    activation = "sigmoid",
    kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 126)
  )


summary(lstm_model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 54, 32)                  115712      
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 54, 32)                  0           
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 256)                     295936      
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       514         
## ================================================================================
## Total params: 412,162
## Trainable params: 412,162
## Non-trainable params: 0
## ________________________________________________________________________________

Embedding layer: Used as the first layer of the architecture. The aim of the embedding layer is to turn the integer-encoded words into numerical vectors, learned during training, which represent the meaning of each word.

Deep neural layer: There are several kinds of deep neural layers. One of them is LSTM, a recurrent layer that is well suited to sequence data such as text.

Output layer: The last layer, where we set the output based on the case. We set the number of units to 2 because we have only 2 labels, which are “yes” and “no”.
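The parameter counts in the model summary above can be checked by hand. In the sketch below, the vocabulary size of 3616 is inferred from the summary (115712 / 32) rather than stated in the text, and the LSTM formula is the standard one for four gates with a bias each.

```r
vocab_size <- 3616   # inferred from the summary: 115712 / 32
embed_dim  <- 32     # output_dim of the embedding layer
lstm_units <- 256
n_classes  <- 2

# Embedding: one vector of length embed_dim per word.
embedding_params <- vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params <- 4 * ((embed_dim + lstm_units + 1) * lstm_units)

# Dense output: one weight per LSTM unit per class, plus one bias per class.
dense_params <- lstm_units * n_classes + n_classes

total_params <- embedding_params + lstm_params + dense_params
print(c(embedding = embedding_params, lstm = lstm_params,
        dense = dense_params, total = total_params))
```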

Define loss, optimizer, and metrics type
lstm_model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.008),
  metrics = "accuracy",
  loss = "binary_crossentropy"
)

The optimizer above was chosen by trial and error. The loss and metrics parameters are set based on our data and our target: “binary_crossentropy” is selected because our target consists of binary values (1 or 0, yes or no).
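For reference, binary cross-entropy can be written out in a few lines of base R. This is a sketch of the standard formula, with a small clipping constant `eps` added as an assumption to avoid log(0); it is not the keras implementation.

```r
# Mean binary cross-entropy between 0/1 targets and predicted probabilities.
binary_crossentropy <- function(y_true, y_prob, eps = 1e-7) {
  p <- pmin(pmax(y_prob, eps), 1 - eps)   # keep probabilities away from 0 and 1
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

loss_good <- binary_crossentropy(c(1, 0), c(0.9, 0.1))  # confident and correct
loss_bad  <- binary_crossentropy(c(1, 0), c(0.1, 0.9))  # confident and wrong
print(c(good = loss_good, bad = loss_bad))
```

The loss grows quickly when the model is confident and wrong, which is what pushes the predicted probabilities toward the true labels during training.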

Generate performance chart
# RNGkind(sample.kind = "Rounding")
# set_random_seed(126)
# history <- lstm_model %>%
#   fit(
#     lstm_train_x,
#     lstm_train_y,
#     batch_size = 512,
#     epochs = 10,
#     verbose = 1,
#     validation_data = list(lstm_val_x, lstm_val_y)
#   )
Plot history
# plot(history)
# saveRDS(plot(history), "plot_history.RDS")
readRDS("plot_history.RDS")

Above we can see that the training curve starts to move away from the validation curve, which indicates overfitting. As a result, I stop at epoch 10, where we can see that the validation curve has stabilized.
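A simple way to read such a chart numerically is to look for the epoch where the validation loss is lowest. The loss values below are hypothetical and are not taken from the training history of this project.

```r
# Hypothetical per-epoch losses.
train_loss <- c(0.68, 0.52, 0.40, 0.31, 0.25, 0.20, 0.16)
val_loss   <- c(0.69, 0.55, 0.43, 0.39, 0.38, 0.40, 0.41)

# Epoch with the lowest validation loss.
best_epoch <- which.min(val_loss)

# Gap between validation and training loss; a growing gap suggests overfitting.
gap <- val_loss - train_loss

print(best_epoch)
print(round(gap, 2))
```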

Save and load LSTM model
# lstm_model %>% save_model_hdf5("lstm_model.h5")
lstm_model <- load_model_hdf5("lstm_model.h5")
Prediction
# predict on train
lstm_train_pred <- lstm_model %>% 
  predict(lstm_train_x)

# predict on test
lstm_test_pred <- lstm_model %>% 
  predict(lstm_test_x)
Model performance
# accuracy on train
accuracy_vec(
  truth = factor(ifelse(lstm_train$is_accident == 1, "yes", "no")),
  estimate = factor(ifelse(lstm_train_pred[,2] > 0.5, "yes", "no"))
)
## [1] 0.9363296
# accuracy on test
accuracy_vec(
  truth = factor(ifelse(lstm_test$is_accident == 1, "yes", "no")),
  estimate = factor(ifelse(lstm_test_pred[,2] > 0.5, "yes", "no"))
)
## [1] 0.8415842

With an accuracy of 93.63% on the train data and 84.16% on the test data, a difference of more than 5%, the result shows that the LSTM model is overfitting.

Conclusion

Although every model reaches an overall accuracy above 80%, the Naive Bayes model is considered superior to the other models used in this paper. The accuracy of the Random Forest model is around 6% lower than that of the Naive Bayes model. The accuracy of the Neural Network LSTM model is around 2% higher than that of the Random Forest model. However, because its accuracy on the train data exceeds its accuracy on the test data by more than 5%, the Neural Network model is considered to be overfitting to the train data. Even though it is still possible to tune the model to fix the overfitting problem, the accuracy then suffers and goes below 80%.
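The comparison can be laid out as a small table. The accuracies below are the values reported in the outputs of this paper; the 5% threshold is the rule of thumb used throughout the text.

```r
# Train and test accuracy as reported in the sections above.
results <- data.frame(
  model     = c("Naive Bayes", "Random Forest", "LSTM"),
  train_acc = c(0.9097, 0.8269, 0.9363),
  test_acc  = c(0.8805, 0.8207, 0.8416)
)

# Gap between train and test accuracy; above 0.05 is treated as overfitting.
results$gap <- results$train_acc - results$test_acc
results$overfit <- results$gap > 0.05

print(results)
```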