library(textclean)
library(tokenizers)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(devtools)
## Loading required package: usethis
library(katadasaR)
library(tm)
## Loading required package: NLP
library(stringr)
library(e1071)
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: lattice
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.7     ✔ purrr   0.3.4
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ tidyr::extract()    masks magrittr::extract()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::lift()       masks caret::lift()
## ✖ purrr::set_names()  masks magrittr::set_names()
library(tidytext)
library(rsample)
## 
## Attaching package: 'rsample'
## 
## The following object is masked from 'package:e1071':
## 
##     permutations
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## 
## The following object is masked from 'package:readr':
## 
##     spec
## 
## The following object is masked from 'package:keras':
## 
##     get_weights
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
library(SnowballC)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(ROCR)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tinytex)

1 Read Data

df <- read.csv("data/train.csv")
slang <- read.csv("data/colloquial-indonesian-lexicon.csv")
head(df)

The dataset contains 10,535 tweets and 8 columns, with information as follows:
- bully: Classification of the tweet as yes (bully) or no (not bully)
- tweet: Content of the tweet
- individual: Whether the tweet is cyberbullying targeted at a particular individual (0 = no, 1 = yes)
- group: Whether the tweet is cyberbullying targeted at a particular group of people (0 = no, 1 = yes)
- gender: Whether the tweet is cyberbullying based on gender, or curses someone using words that are degrading to a gender (0 = no, 1 = yes)
- physical: Whether the tweet is cyberbullying based on physical deficiencies/differences or disability (0 = no, 1 = yes)
- race: Whether the tweet is cyberbullying based on race or ethnicity (0 = no, 1 = yes)
- religion: Whether the tweet is cyberbullying based on a religion, religious organization, or a particular creed (0 = no, 1 = yes)

2 Dataset Cleaning

2.1 Check Missing & Duplicated Values

colSums(is.na(df))
##      bully      tweet individual      group     gender   physical       race 
##          0          0          0          0          0          0          0 
##   religion 
##          0
df_duplicated <- df[duplicated(df$tweet),]

df_clean <- df %>%
  distinct(tweet, .keep_all = TRUE)

nrow(df_duplicated)
## [1] 98

From the checks above, we know that the dataset has no missing values. However, 98 rows contain a duplicated tweet (column tweet), likely because people tend to copy and paste news on Twitter.

We use distinct() to drop the duplicated tweets and save the result as df_clean, which now has 10,437 rows.
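The relationship between the duplicate count and the deduplicated row count can be checked on a small example. The sketch below uses base R only; the toy data frame and its contents are made up for illustration, and `!duplicated()` is used as a stand-in that should behave like distinct(tweet, .keep_all = TRUE) for this purpose.

```r
# Toy data (hypothetical): rows 2 and 4 repeat the tweet text of row 1.
toy <- data.frame(
  bully = c("yes", "yes", "no", "yes"),
  tweet = c("contoh tweet", "contoh tweet", "tweet lain", "contoh tweet"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence.
n_dupes <- sum(duplicated(toy$tweet))

# Keeping only the unflagged rows retains the first copy of each tweet.
toy_unique <- toy[!duplicated(toy$tweet), ]

# Rows before, duplicates found, rows after.
c(before = nrow(toy), duplicates = n_dupes, after = nrow(toy_unique))
```

The same bookkeeping applies to the real data: 10,535 rows minus 98 duplicates leaves 10,437 rows.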

2.2 Transform columns into their appropriate classes

Next, we will convert the bully, individual, group, gender, physical, race and religion columns to factors with the as.factor() function.

df_clean <- df_clean %>% 
  mutate(bully = as.factor(bully),
         individual = as.factor(individual),
         group = as.factor(group),
         gender = as.factor(gender),
         physical = as.factor(physical),
         race = as.factor(race),
         religion = as.factor(religion))
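Since every column except tweet gets the same treatment, the conversion can also be written once over a vector of column names. This is a minimal base R sketch on a made-up data frame; `label_cols` is a name introduced here for illustration.

```r
# Toy data (hypothetical) with the same kind of columns as df_clean.
toy <- data.frame(
  bully = c("yes", "no", "yes"),
  tweet = c("a", "b", "c"),
  individual = c(1, 0, 0),
  group = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Every column except the free-text one is a label.
label_cols <- setdiff(names(toy), "tweet")

# Convert all label columns to factors in one assignment.
toy[label_cols] <- lapply(toy[label_cols], as.factor)

str(toy)
```

This avoids repeating as.factor() per column and keeps working if more label columns are added later.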

2.3 Data Checking

Category with the most abusive and bullying tweets

df_clean_bully <-  df_clean %>% filter(bully == "yes")

df_clean_bully %>%
  summary()
##  bully         tweet           individual group    gender   physical race    
##  no :   0   Length:4380        0:1562     0:2818   0:4137   0:4128   0:3951  
##  yes:4380   Class :character   1:2818     1:1562   1: 243   1: 252   1: 429  
##             Mode  :character                                                 
##  religion
##  0:3761  
##  1: 619  
## 

By subsetting the data to only the tweets classified as “Bully”, we can use summary() to get the frequency of each category. In the category columns, 1 = yes and 0 = no.

Based on the summary, the cyberbullying tweets mainly target an “individual” (2,818 tweets) or a “group” (1,562 tweets); together these two counts account for all 4,380 bully tweets. In contrast, “gender” (243 tweets) and “physical” (252 tweets) are the two categories with the fewest cyberbullying tweets.
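The per-category counts read off summary() can also be computed directly and sorted, which makes the ranking explicit. The sketch below is base R on a made-up data frame; the values are illustrative and not taken from the real dataset.

```r
# Toy subset of bully tweets (hypothetical values), one factor column per category.
toy_bully <- data.frame(
  individual = factor(c(1, 1, 0, 1)),
  group      = factor(c(0, 0, 1, 0)),
  gender     = factor(c(0, 1, 0, 0)),
  religion   = factor(c(0, 0, 1, 1))
)

# Count how many tweets are flagged "1" in each category column.
category_counts <- vapply(
  toy_bully,
  function(column) sum(column == "1"),
  numeric(1)
)

# Largest category first.
sort(category_counts, decreasing = TRUE)
```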

What text or tokens can represent each cyberbully category? First, let’s pull the tweets from our df_clean dataset, using head(30) to get a better sense of the words. After removing frequently repeated words that add no insight to our data, we will subset on the label bully = yes to see which words might represent cyberbullying.

df_clean %>% 
  head(30) %>%
  pull(tweet)
##  [1] "USER terimakasih Ustadz sudah bersuara tentang Radikal radikal ini. Entah apa yang ada dalam pikiran rejim. Mesjid radikal...kampus radikal....dosen radikal....padahal tempat tersebut pijakan peradaban. Memangnya mau menghancurkan Indonesia ?"
##  [2] "USER USER Maaf sebenarnya twiter pertama kali dbuat bukan buat orang bego'"                                                                                                                                                                        
##  [3] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok.   Iya tau kasar bgt maaf'"                                                                                                                                            
##  [4] "Hadiri Lokakarya Kebudayaan Daerah, Bupati Rupinus Ajak Masyarakat Sekadau Rawat dan Manfaatkan Objek Budaya"
##  [5] "USER USER USER yg kaya gini layak di tangkap."                                                                                                                                                                                                     
##  [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan / bagian dari itu sudah waktunya lengserkan Jokowi sebelum indonesia hancur"                                                                                                       
##  [7] "USER Wonu oppa kenapa matanya sipit banget? '^'\""                                                                                                                                                                                                 
##  [8] "Smartfren jaringan nya kok brengsek ya ...'"                                                                                                                                                                                                       
##  [9] "USER USER Pret.. kampret , Tak dukung 2019 ganti presiden.. tp.presidenmu sapa ??? Rocky gerung kapir thaa ???'"                                                                                                                                   
## [10] "USER Ahelah sombong bener punuk onta'"                                                                                                                                                                                                             
## [11] "USER jancuk kw zonk!!!!'"                                                                                                                                                                                                                          
## [12] "Bawaslu: Kolaka Tertinggi Pelanggaran ASN Dalam Pilkada; ; #AkuSERUJI : Pilkada | ;"                                                                                                                                                               
## [13] "untung2an dan jangan sampai kami ngaku Komunis, ngaku Atheis,...'"                                                                                                                                                                                 
## [14] "USER USER USER USER USER si cebong sudah kebanyakan makan tahi jokowi, dia baru sadar kl nanti jokowi sudah mengumumkan INDONESIA dirubah menjadi INDOCHINA,,,'"                                                                                   
## [15] "RT USER: Butuh cowo kontol gede yg bs angetin dimusim ujan, horny ahh hawanya bikin pngen ngentot terus ya say    ! Retweet yg nger   '"                                                                                                           
## [16] "Apalagi mereka partai koalisi pendukung pemerintah yg juga diketahui sudah menjalin kerjasama politik dengan partai komunis china"                                                                                                                 
## [17] "USER terus bisa apa kalo iya....rejim kunyuk ini siapa yg berani lawan...??'"                                                                                                                                                                      
## [18] "- lipstick di sapu tangan Bora.  Dan diriku dalam hati, \"Mampus malu luh. Makan tuh cipokan.\"'"                                                                                                                                                  
## [19] "Yg mau lengserkan Jok pasti lawan politiknya spt SBY wowo HT TS PKS.biasa bayarin demo-demo..FPI.jgn takut pak Jok.Rakyat TNI POl bela JokAhok"                                                                                                    
## [20] "USER Iya sih kek komunis banget wkwkwk'"                                                                                                                                                                                                           
## [21] "Pengusaha Media Sosial Harus Dipastikan Kenetralannya dalam Pilkada dan Pilpres"                                                                                                                                                                   
## [22] "USER Wah kesempatan neh langsung deh dimanfaatin buat pencitraaan oleh rezim..        '"                                                                                                                                                           
## [23] "Yang pasti bukan rakyat cina ! Bukan rakyat Indonesia anak keturunan PKI ! Paham!! URL"                                                                                                                                                            
## [24] "RT USER: Mash kentut bau babi'"                                                                                                                                                                                                                    
## [25] "Sukses USER ; 1. Sukses Prestasi ; 2. Sukses Administrasi ; 3. Sukses Pelaksanaan ; 4. Sukses Ekonomi ; 5. Sukses Menjadi warisan dan kebanggaan bagi generasi selanjutnya.; Cc. USER USER USER ; #AsianGames2018"                                 
## [26] "RT USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak?  #ngentot #memek #kontol #sange #pagicrot'"                                                                                                                                   
## [27] "Pak de USER akan lebih elegan klo sampean undang para penulis ke istana daripada ngundang busar buser kae..suwon"                                                                                                                                  
## [28] "USER USER Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan :)'"                                                                                                                     
## [29] "Mau di-Ahok-kan ya?"                                                                                                                                                                                                                               
## [30] "USER USER USER mabuk pil PCC"

2.4 Data Cleansing 2

Let’s remove “USER”, “RT” and punctuation, which show up a lot and do not carry any useful meaning.

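# Word boundaries (\b) keep the placeholders from matching inside longer words.
df_clean$tweet <- gsub("\\bUSER\\b", " ", df_clean$tweet)
df_clean$tweet <- gsub("\\bRT\\b", " ", df_clean$tweet)
# Collapse runs of punctuation and spaces into a single space.
df_clean$tweet <- gsub("[[:punct:] ]+", " ", df_clean$tweet)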
head(df_clean)
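When removing short placeholders such as “RT” with gsub(), it helps to check what a bare pattern does to longer upper-case words. The base R sketch below compares a bare pattern with a word-boundary pattern on a made-up string; `clean_placeholders()` is a hypothetical helper written for this illustration, not a function from any of the loaded packages.

```r
# Made-up example string containing the placeholders and an upper-case word.
raw <- "RT USER: PARTAI komunis, kata USER!!"

# A bare pattern also removes the "RT" inside "PARTAI".
naive <- gsub("RT", " ", raw)

# Hypothetical helper: remove whole-word placeholders, then punctuation.
clean_placeholders <- function(x) {
  x <- gsub("\\b(USER|RT)\\b", " ", x, perl = TRUE)
  x <- gsub("[[:punct:][:space:]]+", " ", x)
  trimws(x)
}

naive
clean_placeholders(raw)
```

With the word-boundary version, “PARTAI” survives intact while the standalone placeholders are dropped.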

Next, we remove elements that do not add any value to our data: dates, emojis, emails, emoticons, HTML, URLs and tags (@ mentions and retweets). Note that punctuation was already stripped in the previous step, so pattern-based helpers such as replace_tag() and replace_email() may find little left to match. Because the tweets come from individuals rather than institutions or organisations, they are full of slang and abbreviations, so we use replace_internet_slang() with the “Colloquial Indonesian Lexicon” from GitHub to replace them with their formal forms. Additionally, as text classification is case-sensitive, we lower-case all the tweets.

df_clean$tweet <- df_clean$tweet %>% 
  replace_tag() %>% 
  replace_date(replacement = " ") %>% 
  replace_email() %>% 
  replace_emoji() %>% 
  replace_emoticon() %>% 
  replace_url() %>%
  replace_html() %>% 
  str_to_lower() %>% 
  strip()

head(df_clean)
df_clean$tweet <- replace_internet_slang(
  df_clean$tweet,
  slang = paste0("\\b", slang$slang, "\\b"),
  replacement = slang$formal,
  ignore.case = TRUE
)
df_clean_ris <- data.frame(df_clean)
#saveRDS(df_clean_ris, "df_clean_ris.RDS")

df_clean_ris <- readRDS("df_clean_ris.RDS")

df_clean_ris %>% 
  head(30) %>%
  pull(tweet)
##  [1] "terimakasih ustadz sudah bersuara tentang radikal radikal ini entah apa yang ada dalam pikiran rejim mesjid radikal kampus radikal dosen radikal padahal tempat tersebut pijakan peradaban memangnya mau menghancurkan indonesia"
##  [2] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"                                                                                                                                                                
##  [3] "anjing tahi goblok idiot bangsat monyet babi fucc kont ngents goblok iya tau kasar banget maaf"                                                                                                                                  
##  [4] "hadiri lokakarya kebudayaan daerah bupati rupinus ajak masyarakat sekadau rawat dan manfaatkan objek budaya"                                                                                                                     
##  [5] "yang kayak begini layak di tangkap"                                                                                                                                                                                              
##  [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan bagian dari itu sudah waktunya lengserkan jokowi sebelum indonesia hancur"                                                                                       
##  [7] "wonu oppa kenapa matanya sipit banget"                                                                                                                                                                                           
##  [8] "smartfren jaringan nya kok brengsek ya"                                                                                                                                                                                          
##  [9] "pret kampret tak dukung ganti presiden tapi presidenmu sapa rocky gerung kapir thaa"                                                                                                                                             
## [10] "ahelah sombong benar punuk onta"                                                                                                                                                                                                 
## [11] "jancuk kau zonk"                                                                                                                                                                                                                 
## [12] "bawaslu kolaka tertinggi pelanggaran asn dalam pilkada akuseruji pilkada"                                                                                                                                                        
## [13] "untung an dan jangan sampai kami mengaku komunis mengaku atheis"                                                                                                                                                                 
## [14] "sih cebong sudah kebanyakan makan tahi jokowi dia baru sadar kalau nanti jokowi sudah mengumumkan indonesia dirubah menjadi indochina"                                                                                           
## [15] "butuh cowok kontol gede yang bisa angetin dimusim ujan horny ah hawanya bikin pengin ngentot terus ya sayang retweet yang nger"                                                                                                  
## [16] "apalagi mereka partai koalisi pendukung pemerintah yang juga diketahui sudah menjalin kerjasama politik dengan partai komunis china"                                                                                             
## [17] "terus bisa apa kalo iya rejim kunyuk ini siapa yang berani lawan"                                                                                                                                                                
## [18] "lipstick di sapu tangan bora dan diriku dalam hati mampus malu lu makan tuh cipokan"                                                                                                                                             
## [19] "yang mau lengserkan jok pasti lawan politiknya seperti surabaya wowo hati ts pks biasa bayarin demo demo fpi jangan takut pak jok rakyat tni pol bela jokahok"                                                                   
## [20] "iya sih kayak komunis banget wkwkwk"                                                                                                                                                                                             
## [21] "pengusaha media sosial harus dipastikan kenetralannya dalam pilkada dan pilpres"                                                                                                                                                 
## [22] "wah kesempatan nih langsung deh dimanfaatin buat pencitraaan oleh rezim"                                                                                                                                                         
## [23] "yang pasti bukan rakyat cina bukan rakyat indonesia anak keturunan pki paham url"                                                                                                                                                
## [24] "mash kentut bau babi"                                                                                                                                                                                                            
## [25] "sukses sukses prestasi sukses administrasi sukses pelaksanaan sukses ekonomi sukses menjadi warisan dan kebanggaan bagi generasi selanjutnya cc asiangames"                                                                      
## [26] "kalau aku cantik tapi memek ku coklat kamu masih doyan enggak ngentot memek kontol sange pagicrot"                                                                                                                               
## [27] "pak dek akan lebih elegan kalo sampean undang para penulis ke istana daripada ngundang busar buser kae suwon"                                                                                                                    
## [28] "rak harusnya lo tau kalau teman teman lo itu pintar enggak kayak lo bloon jadi jangan bego begoin kita enggak mempan"                                                                                                            
## [29] "mau di ahok kan ya"                                                                                                                                                                                                              
## [30] "mabuk pil pcc"

We can see that after cleansing there are no more emoticons (previously in #28), punctuation or hash symbols (previously in #26). I deliberately kept the content of each hashtag (removing only the # symbol), as hashtags are often useful for grouping tweets by topic. Additionally, the abbreviations have been expanded, e.g. “yg” became “yang” in #19 and “kw” became “kau” in #11. The replacement is not perfect, though: in #19, “SBY” became “surabaya” and “HT” became “hati”.
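To make the slang replacement step less of a black box, here is a minimal base R sketch of the idea: loop over a lexicon and replace each slang token, matched as a whole word, with its formal form. The three-row `lexicon` and the `normalise_slang()` helper are made up for illustration; this is not how replace_internet_slang() is implemented internally, and it assumes the slang entries contain no regex metacharacters.

```r
# Tiny stand-in for the colloquial lexicon (hypothetical rows).
lexicon <- data.frame(
  slang  = c("yg", "kw", "bgt"),
  formal = c("yang", "kau", "banget"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each slang token with its formal form.
normalise_slang <- function(text, lexicon) {
  for (i in seq_len(nrow(lexicon))) {
    # \\b restricts the match to whole words, so "kw" inside a longer word is untouched.
    pattern <- paste0("\\b", lexicon$slang[i], "\\b")
    text <- gsub(pattern, lexicon$formal[i], text, ignore.case = TRUE, perl = TRUE)
  }
  text
}

normalised <- normalise_slang(c("jancuk kw zonk", "iya tau kasar bgt maaf"), lexicon)
normalised
```

Because matching is purely lexical, ambiguous abbreviations are replaced regardless of context, which is one way mis-expansions can arise.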

Next, let’s assign our cleaned data to a new data set. I personally like to do this so that I don’t have to re-run all the chunks from the beginning if I mess up my data set, especially since some chunks (such as replace_internet_slang()) take a long time to run. From now on we will work with df_clean_2 and treat df_clean_ris as our master file.

df_clean_2 <- data.frame(df_clean_ris)
tracemem(df_clean_ris) == tracemem(df_clean_2)
## [1] FALSE
#saveRDS(df_clean_2, file = "df_clean_2.RDS")
df_clean_2 <- readRDS("df_clean_2.RDS")
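The save-once, reload-later pattern used here (comment out saveRDS() after the first run, then readRDS()) can be wrapped in a small helper so nothing needs to be commented in and out. This is a sketch in base R; `cached()` and the demo file name are hypothetical names introduced for illustration.

```r
# Hypothetical helper: reuse the saved result if the file exists,
# otherwise compute it, save it, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Demo in a temporary directory so no project files are touched.
cache_path <- file.path(tempdir(), "df_clean_demo.RDS")
if (file.exists(cache_path)) file.remove(cache_path)

# First call runs the (pretend) slow step and writes the file.
first <- cached(cache_path, function() data.frame(tweet = c("a", "b")))

# Second call never evaluates its function because the file already exists.
second <- cached(cache_path, function() stop("should not be recomputed"))

identical(first, second)
```

One caveat of this design: the cache is never invalidated automatically, so the file has to be deleted by hand when an upstream step changes.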

2.5 Stemming, Stopwords and Tokenizing

Now we will stem the words, remove stopwords, tokenize and create a wordcloud. Stemming transforms each word into its root form, e.g. “memakan” -> “makan”. We will use the katadasaR library to do this. Afterwards, we save the result as df_clean_3 with saveRDS() so that, if RStudio crashes, we can continue directly from df_clean_3.RDS.

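# Stem every token of one tweet and glue the stems back into a single string.
stemming <- function(x) {
  paste(lapply(x, katadasar), collapse = " ")
}

# Slow step: run once, then reload the saved RDS below.
#df_clean_2$tweet <- lapply(tokenize_words(df_clean_2$tweet), stemming)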

#df_clean_3 <- data.frame(df_clean_2)
#saveRDS(df_clean_3, file = "df_clean_3.RDS")

df_clean_3 <- readRDS("df_clean_3.RDS")
head(df_clean_3)

We can see that all of the words are now in their base form. Let’s start the tokenization process, which breaks each sentence into individual words so that they can be counted for the wordcloud later; e.g. line #5, “yang kayak begini layak di tangkap”, is broken into “yang”, “kayak”, “begini”, “layak”, “di”, “tangkap”. We will also remove stopwords as the final step of our data cleansing. Stopwords are common words that give context to a sentence but can be removed because they carry no crucial meaning in this project. They are usually function words such as conjunctions and prepositions, e.g. “dan”, “tapi”, “untuk”.

stopwords <- readLines("data/stopwords-id.txt")
## Warning in readLines("data/stopwords-id.txt"): incomplete final line found on
## 'data/stopwords-id.txt'
df_clean_3$tweet <- df_clean_3$tweet %>% 
  replace_html(symbol = FALSE) %>% 
  replace_url(replacement = "")
# Remove the leftover "url" placeholder only when it stands alone as a word.
df_clean_3$tweet <- gsub("\\burl\\b", " ", df_clean_3$tweet)

df_clean_3$tweet <- tokenize_words(df_clean_3$tweet, stopwords = stopwords)
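What tokenizing with a stopword list does to a single tweet can be reproduced in a few lines of base R. The sketch below uses the example sentence from the text; the three-word `toy_stopwords` vector is an assumption made for illustration and is far shorter than the real stopwords-id.txt list.

```r
# Example sentence from tweet #5 after cleaning and stemming.
tweet <- "yang kayak begini layak di tangkap"

# Assumed mini stopword list (the real list is much longer).
toy_stopwords <- c("yang", "begini", "di")

# Tokenize on whitespace, then drop every token found in the stopword list.
tokens <- strsplit(tweet, "\\s+")[[1]]
kept <- tokens[!tokens %in% toy_stopwords]

tokens
kept
```

The three remaining tokens match the fifth element of df_clean_final$tweet shown further down.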

We are finally done with the data cleansing. Let’s save the result one last time as df_clean_final, and as an RDS file as well.

df_clean_final <- data.frame(df_clean_3)
#saveRDS(df_clean_final, file = "df_clean_final.RDS")
df_clean_final <- readRDS("df_clean_final.RDS")
df_clean_final$tweet[1:5]
## [[1]]
##  [1] "terimakasih" "ustadz"      "suara"       "radikal"     "radikal"    
##  [6] "pikir"       "rejim"       "mesjid"      "radikal"     "kampus"     
## [11] "radikal"     "dosen"       "radikal"     "pijak"       "adab"       
## [16] "hancur"      "indonesia"  
## 
## [[2]]
## [1] "maaf"   "twiter" "kali"   "orang"  "bego"  
## 
## [[3]]
##  [1] "anjing"  "tahi"    "goblok"  "idiot"   "bangsat" "monyet"  "babi"   
##  [8] "fucc"    "kont"    "ngents"  "goblok"  "iya"     "tau"     "kasar"  
## [15] "banget"  "maaf"   
## 
## [[4]]
##  [1] "hadir"      "lokakarya"  "budaya"     "daerah"     "bupati"    
##  [6] "rupinus"    "ajak"       "masyarakat" "sekadau"    "rawat"     
## [11] "manfaat"    "objek"      "budaya"    
## 
## [[5]]
## [1] "kayak"   "layak"   "tangkap"
df_clean_final

Now, let’s get to know our dataset better using wordcloud(). The df_clean_final dataset still contains both bully and non-bully tweets (10,437 rows).

3 Wordcloud and Frequency

3.1 Dataset General

df_clean_final_corpus <- VCorpus(VectorSource(df_clean_final$tweet))

df_clean_final_dtm <- DocumentTermMatrix(df_clean_final_corpus)
inspect(df_clean_final_dtm)
## <<DocumentTermMatrix (documents: 10437, terms: 16211)>>
## Non-/sparse entries: 91995/169102212
## Sparsity           : 100%
## Maximal term length: 109
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   agama gue indonesia islam jokowi kalo nya orang presiden sih
##   1285     0   0         0     0      0    1   0     0        0   0
##   211      0   0         0     0      0    0   0     0        0   0
##   2150     0   0         0     0      0    0   0     0        0   3
##   2794     0   0         0     0      0    0   0     0        0   1
##   6426     0   0         0     0      0    0   0     1        0   2
##   7019     0   0         0     0      1    0   0     0        1   1
##   764      0   0         0     0      0    0   0     0        0   0
##   78       0   0         0     0      0    0   0     0        0   0
##   8635     0   0         0     2      0    0   1     0        0   0
##   9936     0   0         0     0      0    0   2     1        0   0
wordcloud(df_clean_final_corpus, max.words = 200,
          colors = brewer.pal(8, "Set2"), scale = c(3, 0.25))

cleanfinal_count <- as.data.frame(as.matrix(df_clean_final_dtm))
cleanfinal_long <- pivot_longer(data = cleanfinal_count, cols = everything())
final_cleanfinal <- cleanfinal_long %>% group_by(name) %>% summarise(tot = sum(value))

cleanfinal_cloud <- final_cleanfinal %>% 
  filter(tot >= 50) %>% 
  arrange(desc(tot))

head(cleanfinal_cloud,30)
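The document-term matrix above has 10,437 × 16,211 = 169,194,207 cells, of which only 91,995 are non-zero, so as.matrix() materialises a very large, almost empty table. Since the tweets are already stored as token vectors, the same kind of totals can be obtained by counting tokens directly. The sketch below is base R on made-up tokens; `term_totals()` is a hypothetical helper, and its counts may differ slightly from the DTM route because DocumentTermMatrix() drops terms shorter than three characters by default.

```r
# Toy token lists (hypothetical), shaped like df_clean_final$tweet.
tokens <- list(
  c("radikal", "kampus", "radikal"),
  c("maaf", "orang", "bego"),
  c("orang", "radikal")
)

# Hypothetical helper: total occurrences per term, filtered and sorted.
term_totals <- function(token_list, min_total = 1) {
  counts <- table(unlist(token_list))
  out <- data.frame(
    name = names(counts),
    tot = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out <- out[out$tot >= min_total, ]
  # Most frequent first; ties broken alphabetically.
  out <- out[order(-out$tot, out$name), ]
  rownames(out) <- NULL
  out
}

top_terms <- term_totals(tokens, min_total = 2)
top_terms
```

On the real data, a call along the lines of term_totals(df_clean_final$tweet, min_total = 50) would play the role of the filter(tot >= 50) step.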

Let’s see the words that are commonly used in the non-bully tweets

3.2 Non Bully

df_clean_nobully <- df_clean_final %>% 
  filter(bully == "no")

df_clean_nobully_corpus <- VCorpus(VectorSource(df_clean_nobully$tweet))
df_clean_nobully_dtm <- DocumentTermMatrix(df_clean_nobully_corpus)
inspect(df_clean_nobully_dtm)
## <<DocumentTermMatrix (documents: 6057, terms: 12616)>>
## Non-/sparse entries: 56196/76358916
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   agama asing gue indonesia islam kalo kayak orang presiden sih
##   119      0     1   0         0     0    0     0     0        0   0
##   1230     0     0   0         0     0    0     0     0        0   3
##   1423     0     0   0         0     0    0     0     0        0   0
##   1626     0     0   0         0     0    0     1     0        0   1
##   3601     0     0   0         0     0    0     1     0        0   0
##   3715     0     0   0         0     0    0     0     1        0   2
##   43       0     0   0         0     0    0     0     0        0   0
##   5017     0     0   0         0     2    0     0     0        0   0
##   5764     0     0   0         0     0    0     0     1        0   0
##   743      0     0   0         0     0    1     1     0        0   0
wordcloud(df_clean_nobully_corpus, max.words = 100, min.freq = 20000,
          colors = brewer.pal(8, "Set2"), scale = c(3.5, 0.3))

nobully_count <- as.data.frame(as.matrix(df_clean_nobully_dtm))
nobully_long <- pivot_longer(data = nobully_count, cols = everything())
final_nobully <- nobully_long %>% group_by(name) %>% summarise(tot = sum(value))

nobully_cloud <- final_nobully %>% 
  filter(tot >= 50) %>% 
  arrange(desc(tot))

head(nobully_cloud,30)

3.3 Bully

Bully tweets in general: I set max.words to 100 and min.freq to 20,000 to keep the cloud narrow and specific. We can see that the word “jokowi” (an individual) is mentioned the most, followed by the words shown in blue (“cebong”, “islam”, “orang”). Words tied to a group (“pki”), religion (“agama”) and race (“cina”) also appear.

df_clean_bully2 <- df_clean_final %>% 
  filter(bully == "yes")

df_clean_bully_corpus <- VCorpus(VectorSource(df_clean_bully2$tweet))
df_clean_bully_dtm <- DocumentTermMatrix(df_clean_bully_corpus)
inspect(df_clean_bully_dtm)
## <<DocumentTermMatrix (documents: 4380, terms: 7372)>>
## Non-/sparse entries: 35799/32253561
## Sparsity           : 100%
## Maximal term length: 109
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   cebong gantipresiden indonesia islam jokowi kalo nya orang presiden sih
##   1015      0             0         0     0      0    0   0     1        0   0
##   1351      0             0         0     0      0    0   0     1        0   0
##   1401      0             0         0     0      0    0   1     0        0   0
##   1540      0             0         0     0      0    0   0     0        0   1
##   234       2             0         0     2      0    0   0     0        0   0
##   2700      0             0         0     0      0    1   1     0        0   1
##   2956      0             0         0     0      1    0   0     0        1   1
##   331       0             0         0     0      0    0   0     0        0   0
##   859       1             0         0     0      0    0   0     0        0   0
##   966       2             0         0     1      0    0   0     0        0   0
wordcloud(df_clean_bully_corpus, max.words = 100, min.freq = 20000,
          colors = brewer.pal(8, "Set2"), scale = c(3.5, 0.25))

bully_count <- as.data.frame(as.matrix(df_clean_bully_dtm))
bully_long <- pivot_longer(data = bully_count, cols = everything())
final_bully <- bully_long %>% group_by(name) %>% summarise(tot = sum(value))

bully_cloud <- final_bully %>% 
  filter(tot >= 50) %>% 
  arrange(desc(tot))

head(bully_cloud,30)

Let’s break the bully tweets down by category for a closer look.
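The per-category sections repeat the same steps with a different filter each time, so the pattern can be expressed as one loop over category names. This is a base R sketch on made-up data; `top_term_by_category()` is a hypothetical helper and counts tokens directly rather than going through a corpus and document-term matrix.

```r
# Toy bully tweets (hypothetical): 0/1 category flags plus a list column of tokens.
toy <- data.frame(
  individual = c(1, 0, 1, 0),
  group      = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)
toy$tweet <- list(
  c("jokowi", "lengser"),
  c("cebong", "pki"),
  c("jokowi", "tolol"),
  c("pki", "cina")
)

# Hypothetical helper: for each category, count tokens in the tweets flagged 1.
top_term_by_category <- function(data, categories) {
  lapply(setNames(categories, categories), function(category) {
    counts <- table(unlist(data$tweet[data[[category]] == 1]))
    sort(counts, decreasing = TRUE)
  })
}

result <- top_term_by_category(toy, c("individual", "group"))
result
```

Looping this way also guards against copy-and-paste slips, such as inspecting the wrong object in one of the repeated sections.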

3.3.1 Individual

In the Individual category, the cyberbullying is aimed largely at government figures: “jokowi” appears most often, followed by “ahok”, “prabowo” and “anies”. Not every frequent word is a name; some are simply rude words, e.g. “tolol” and “babi”. Others do not mention anyone by name but can still be used to attack an individual, e.g. “lengser”, “ganti presiden” and “kafir”.

df_clean_bindividual <- df_clean_bully2 %>% 
  filter(individual == 1) 

df_clean_bindividual_corpus <- VCorpus(VectorSource(df_clean_bindividual$tweet))
df_clean_bindividual_dtm <- DocumentTermMatrix(df_clean_bindividual_corpus)
inspect(df_clean_bindividual_dtm)
wordcloud(df_clean_bindividual_corpus,max.words = 30, min.freq = 10000, col=brewer.pal(8, "Set2"), scale=c(4.5,0.5))

bindividual_count <- as.data.frame(as.matrix(df_clean_bindividual_dtm))
bindividual_long <- pivot_longer(data = bindividual_count, cols = everything())
final_bindividual <- bindividual_long %>% group_by(name) %>% summarise(tot = sum(value))

bindividual_cloud <- final_bindividual %>% 
  filter(tot >= 50) %>% 
  arrange(desc(tot))

head(bindividual_cloud,30)

3.3.2 Group

df_clean_bgroup <- df_clean_bully2 %>% 
  filter(group == 1) 

df_clean_bgroup_corpus <- VCorpus(VectorSource(df_clean_bgroup$tweet))
df_clean_bgroup_dtm <- DocumentTermMatrix(df_clean_bgroup_corpus)
inspect(df_clean_bgroup_dtm)
## <<DocumentTermMatrix (documents: 1562, terms: 3853)>>
## Non-/sparse entries: 13790/6004596
## Sparsity           : 100%
## Maximal term length: 30
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   agama bubar cebong cina indonesia islam komunis nya orang pki
##   1074     0     0      0    0         0     0       0   0     0   0
##   1167     0     0      0    0         0     0       0   0     1   0
##   118      0     0      0    0         0     0       0   0     0   0
##   1481     0     0      0    0         0     0       0   0     0   0
##   317      0     0      1    0         0     0       0   0     0   0
##   364      0     0      0    0         0     0       0   0     1   0
##   501      0     0      0    1         0     0       0   0     1   0
##   575      0     0      0    0         0     0       0   0     0   0
##   683      1     0      3    0         0     3       0   0     0   0
##   726      1     0      0    0         0     0       0   0     1   0
wordcloud(df_clean_bgroup_corpus,max.words = 30, min.freq = 10000, col=brewer.pal(8, "Set2"), scale=c(3,0.25))

bgroup_count <- as.data.frame(as.matrix(df_clean_bgroup_dtm))
bgroup_long <- pivot_longer(data = bgroup_count, cols = everything())
final_bgroup <- bgroup_long %>% group_by(name) %>% summarise(tot = sum(value))

bgroup_cloud <- final_bgroup %>% 
  filter(tot >= 50) %>% 
  arrange(desc(tot))

head(bgroup_cloud,30)

3.3.3 Gender

df_clean_bgender <- df_clean_bully2 %>% 
  filter(gender == 1) 

df_clean_bgender_corpus <- VCorpus(VectorSource(df_clean_bgender$tweet))
df_clean_bgender_dtm <- DocumentTermMatrix(df_clean_bgender_corpus)
inspect(df_clean_bgender_dtm)
## <<DocumentTermMatrix (documents: 243, terms: 1043)>>
## Non-/sparse entries: 2061/251388
## Sparsity           : 99%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  banci bencong dasar gue homo kafir kayak manusia orang sih
##   13      0       0     0   0    0     0     0       0     0   0
##   130     0       0     1   1    0     0     0       0     0   0
##   131     0       0     0   4    0     0     1       0     0   1
##   161     0       0     0   0    0     0     0       0     0   0
##   194     0       0     0   0    1     0     2       1     2   0
##   208     0       2     0   1    0     0     0       0     0   2
##   52      1       0     0   0    0     0     0       0     1   0
##   62      1       0     0   1    0     0     1       0     1   0
##   82      0       0     0   0    0     0     0       0     0   0
##   85      0       0     0   0    0     0     1       0     0   0
wordcloud(df_clean_bgender_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.5))

bgender_count <- as.data.frame(as.matrix(df_clean_bgender_dtm))
bgender_long <- pivot_longer(data = bgender_count, cols = everything())
final_bgender <- bgender_long %>% group_by(name) %>% summarise(tot = sum(value))

bgender_cloud <- final_bgender %>% 
  filter(tot >= 10) %>% 
  arrange(desc(tot))

head(bgender_cloud,30)

3.3.4 Physical

df_clean_bphysical <- df_clean_bully2 %>% 
  filter(physical == 1) 

df_clean_bphysical_corpus <- VCorpus(VectorSource(df_clean_bphysical$tweet))
df_clean_bphysical_dtm <- DocumentTermMatrix(df_clean_bphysical_corpus)
inspect(df_clean_bphysical_dtm)
## <<DocumentTermMatrix (documents: 252, terms: 1061)>>
## Non-/sparse entries: 1986/265386
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  bolot budek gue idiot kayak mata muka orang picek sih
##   125     0     0   0     0     0    1    0     0     1   1
##   165     0     0   0     0     0    0    0     0     0   0
##   20      0     0   0     0     0    0    0     0     0   0
##   223     0     0   1     1     0    0    0     0     0   2
##   234     0     0   2     0     0    1    0     0     0   0
##   245     0     0   0     0     0    0    0     0     0   0
##   25      0     0   0     4     0    0    0     0     5   0
##   252     0     0   0     0     0    0    0     0     0   0
##   28      0     0   0     0     0    0    1     0     0   0
##   30      0     0   0     0     0    0    2     1     0   0
wordcloud(df_clean_bphysical_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(4,0.25))

bphysical_count <- as.data.frame(as.matrix(df_clean_bphysical_dtm))
bphysical_long <- pivot_longer(data = bphysical_count, cols = everything())
final_bphysical <- bphysical_long %>% group_by(name) %>% summarise(tot = sum(value))

bphysical_cloud <- final_bphysical %>% 
  filter(tot >= 10) %>% 
  arrange(desc(tot))

head(bphysical_cloud,30)

3.3.5 Religion

df_clean_breligion <- df_clean_bully2 %>% 
  filter(religion == 1) 

df_clean_breligion_corpus <- VCorpus(VectorSource(df_clean_breligion$tweet))
df_clean_breligion_dtm <- DocumentTermMatrix(df_clean_breligion_corpus)
inspect(df_clean_breligion_dtm)
## <<DocumentTermMatrix (documents: 619, terms: 1980)>>
## Non-/sparse entries: 5919/1219701
## Sparsity           : 100%
## Maximal term length: 30
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  agama ahok allah anti budha indonesia islam kafir muslim orang
##   114     1    0     0    0     0         0     0     0      0     0
##   145     0    1     0    0     0         0     1     0      0     0
##   156     1    0     0    0     0         0     0     0      0     0
##   266     1    0     0    0     0         0     3     0      0     0
##   291     1    0     0    0     0         0     0     0      0     1
##   30      2    0     0    0     0         0     2     0      0     0
##   47      0    0     0    0     0         0     0     0      0     0
##   477     0    0     0    0     0         0     0     1      0     1
##   591     0    0     0    0     0         0     0     0      0     0
##   63      0    0     0    0     0         0     0     1      2     1
wordcloud(df_clean_breligion_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.4))

breligion_count <- as.data.frame(as.matrix(df_clean_breligion_dtm))
breligion_long <- pivot_longer(data = breligion_count, cols = everything())
final_breligion <- breligion_long %>% group_by(name) %>% summarise(tot = sum(value))

breligion_cloud <- final_breligion %>% 
  filter(tot >= 10) %>% 
  arrange(desc(tot))

head(breligion_cloud,30)

3.3.6 Race

df_clean_brace <- df_clean_bully2 %>% 
  filter(race == 1) 

df_clean_brace_corpus <- VCorpus(VectorSource(df_clean_brace$tweet))
df_clean_brace_dtm <- DocumentTermMatrix(df_clean_brace_corpus)
inspect(df_clean_brace_dtm)
## <<DocumentTermMatrix (documents: 429, terms: 1342)>>
## Non-/sparse entries: 3527/572191
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  antek china cina ganyang indonesia islam komunis orang pki usir
##   126     0     4    0       0         0     0       0     0   0    0
##   129     0     0    0       0         0     0       1     2   1    0
##   132     0     0    1       0         0     0       0     1   0    0
##   195     0     0    0       0         0     0       0     0   0    0
##   26      0     0    0       0         0     1       0     0   2    0
##   320     0     0    0       0         0     0       0     1   0    0
##   72      0     0    2       0         0     0       0     0   0    0
##   74      0     0    0       0         0     0       0     0   0    0
##   76      0     0    1       0         0     0       0     0   0    0
##   99      0     0    0       0         0     0       0     0   0    0
wordcloud(df_clean_brace_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.5))

brace_count <- as.data.frame(as.matrix(df_clean_brace_dtm))
brace_long <- pivot_longer(data = brace_count, cols = everything())
final_brace <- brace_long %>% group_by(name) %>% summarise(tot = sum(value))

brace_cloud <- final_brace %>% 
  filter(tot >= 10) %>% 
  arrange(desc(tot))

head(brace_cloud,30)

4 Training and Validation Dataset

4.1 Splitting (80:20)

We will split the data into 80% for training and the remaining 20% for validation

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# train-test splitting
index <- sample(nrow(df_clean_final_dtm), nrow(df_clean_final_dtm)*0.8)

df_train <- df_clean_final_dtm[index,]
df_validation <- df_clean_final_dtm[-index,]

label_train <- df_clean_final[index, 'bully']
label_validation <- df_clean_final[-index, 'bully']

prop.table(table(label_train))
## label_train
##        no       yes 
## 0.5795904 0.4204096
prop.table(table(label_validation))
## label_validation
##        no       yes 
## 0.5833333 0.4166667
#Check Dim
dim(df_train)
## [1]  8349 16211
10437*0.8
## [1] 8349.6
# 10437 rows remain after removing duplicates; sample() truncates 8349.6 to 8349
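
As a small sketch of the arithmetic behind the split (only the row count 10437 comes from the text; the variable names are illustrative), sample() truncates a fractional size, which is why the training set has 8349 rows rather than 8350:

```r
set.seed(100)

n_rows <- 10437                      # row count reported in the text
idx <- sample(n_rows, n_rows * 0.8)  # size 8349.6 is truncated to 8349

n_train <- length(idx)               # rows used for training
n_valid <- n_rows - n_train          # rows left for validation
c(train = n_train, validation = n_valid)
```

The validation count from this sketch equals the total of the confusion matrix cells reported later (1005 + 157 + 213 + 713).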

4.2 Reduce Noise using findFreqTerms()

We keep only the terms that appear at least 10 times in the training data

df_freq <- findFreqTerms(df_train, lowfreq = 10)
length(df_freq)
## [1] 1396
head(df_freq)
## [1] "abang" "abu"   "acara" "adab"  "adat"  "adek"
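
To illustrate the idea, here is a base R sketch on a made-up matrix. It assumes that the frequency filter thresholds on each term's total count across all documents; the terms and counts are illustrative only.

```r
# Toy document-term counts (illustrative values, not from the dataset)
toy_dtm <- matrix(
  c(6, 0, 1,
    5, 2, 0,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("orang", "acara", "adat"))
)

low_freq <- 10                       # same threshold as lowfreq = 10
totals <- colSums(toy_dtm)           # total count of each term
kept <- names(totals)[totals >= low_freq]
kept
```

Only terms whose total reaches the threshold survive, so rare terms that carry little signal are dropped before modelling.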

4.3 Subset the words from df_freq into df_train

df_train2 <- df_train[,df_freq]
inspect(df_train2)
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   agama gue indonesia islam jokowi kalo kayak orang presiden sih
##   2419     0   1         0     0      0    0     1     1        0   0
##   2429     0   2         0     0      0    0     1     1        0   1
##   2452     0   0         0     0      0    0     0     0        0   0
##   2842     0   0         0     0      0    0     0     0        0   1
##   4346     1   0         0     3      0    0     0     0        0   0
##   525      2   1         0     2      0    0     0     0        0   0
##   6751     0   0         0     0      0    0     0     0        0   0
##   9050     1   0         0     0      0    0     0     0        0   1
##   9849     0   0         0     1      0    0     0     6        0   0
##   9882     0   0         0     0      0    0     0     0        0   0

4.4 Bernoulli data train & validation

Use a Bernoulli converter to turn each word frequency into a presence/absence indicator:

If f > 0, value = 1 (appears)

If f == 0, value = 0 (does not appear)

bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

df_train_bn <- apply(X = df_train2, MARGIN = 2, FUN = bernoulli_conv)
df_validation_bn <- apply(X = df_validation, MARGIN = 2, FUN = bernoulli_conv)
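
A minimal sketch of the conversion on a made-up count matrix (illustrative values). It uses the same ifelse() rule as bernoulli_conv but keeps the result numeric rather than converting to factors:

```r
# Toy word counts for two documents (illustrative values)
toy_counts <- matrix(
  c(0, 3, 1,
    2, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("agama", "gue", "sih"))
)

# Any positive count becomes 1 (word present), zero stays 0 (word absent)
toy_bn <- ifelse(toy_counts > 0, 1, 0)
toy_bn
```

After this step a word that occurs three times in a tweet counts the same as a word that occurs once.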

5 Naive Bayes Model Fitting and Prediction

naive_bully <- naiveBayes(x = df_train_bn, 
                          y = label_train)

df_train_pred <- predict(naive_bully, df_validation_bn, type = "class")
head(df_train_pred)
## [1] no  yes yes yes yes no 
## Levels: no yes
summary(df_train_pred)
##   no  yes 
## 1162  926

5.1 Data Train and Validation

5.1.1 Model Evaluation using Confusion Matrix

Evaluate the model fitted on the training data against the validation data

confusionMatrix(data = df_train_pred, # predicted labels
                reference = label_validation, # actual labels
                positive = "yes") # positive class: yes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1005  157
##        yes  213  713
##                                          
##                Accuracy : 0.8228         
##                  95% CI : (0.8057, 0.839)
##     No Information Rate : 0.5833         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6388         
##                                          
##  Mcnemar's Test P-Value : 0.004246       
##                                          
##             Sensitivity : 0.8195         
##             Specificity : 0.8251         
##          Pos Pred Value : 0.7700         
##          Neg Pred Value : 0.8649         
##              Prevalence : 0.4167         
##          Detection Rate : 0.3415         
##    Detection Prevalence : 0.4435         
##       Balanced Accuracy : 0.8223         
##                                          
##        'Positive' Class : yes            
## 
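
The statistics in this output follow directly from the four cells of the table. A short sketch that recomputes them from the reported counts, with "yes" as the positive class (the variable names are illustrative):

```r
# Cell counts taken from the confusion matrix above
tp <- 713    # predicted yes, actually yes
fn <- 157    # predicted no,  actually yes
fp <- 213    # predicted yes, actually no
tn <- 1005   # predicted no,  actually no

metrics <- c(
  accuracy       = (tp + tn) / (tp + tn + fp + fn),
  sensitivity    = tp / (tp + fn),
  specificity    = tn / (tn + fp),
  pos_pred_value = tp / (tp + fp),
  neg_pred_value = tn / (tn + fn)
)
round(metrics, 4)
```

Sensitivity is the share of actual bully tweets that the model catches, so every False Negative lowers it.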

5.1.2 Checking mispredicted tweets

Subset the cleaned data to the 20% of rows that we used for validation

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

df_mispredicted <- readRDS("df_clean_3.RDS")
data_validation_check <- df_mispredicted[-index,]
head(data_validation_check)

Create a dataframe from the prediction results (the trained model's predictions on the validation data)

validation_pred_results <- as.data.frame(df_train_pred)
head(validation_pred_results)

Combine both into one dataframe.

Our confusion matrix:

          Reference
Prediction   no  yes
       no  1005  157
       yes  213  713

5.1.3 Results misclassified tweets

5.1.3.1 Bully tweets but classified as not bully in validation dataset

  1. hanya syaitan dan iblis yang ganggu dengan suara adzan hati nyata se bu suka nama
  2. enggak tau diri anjir biasa nih apa yang kita tanam itu yang kita oleh tiati saja mbak
  3. haha susah bicara sama orang bloon sudah bloon hati penuh dengki lagi
  4. umat islam harus selalu milik kriteria sendiri hadap pilih pemimpinnya pemimpindarimasjid jangan pilih pemimpin non muslim
  5. h kntl
  6. apa itu komunis cara sederhana paksa hendak banyak atur contoh paksa pakai e money komunis banget rezim komunis
  7. karena pki adalah cara soeharto singkir soekarno sekarang mau tiru oleh prabowo mungkin karena masih menantu
  8. entah apa tuju sih kunyuk pakai foto amp nama gue wtf
  9. eh kita positif dulu gaes dia mau tekan bahwa kaum itu sia sia bela agama yang kitab suci fiksi juga kalau fiksi tapi isi suatu norma kehid
  10. lo yang sarap
data_validation_trainyes_predno <- data_validation_check %>% 
  mutate(validation_pred_results,
    .after=bully,
    tweet = as.character(tweet)) %>%
  filter(bully == "yes" & df_train_pred == "no")
      

head(data_validation_trainyes_predno)
nrow(data_validation_trainyes_predno)
## [1] 157

157 tweets are labelled “bully” but were classified as “no” by our model, which matches the confusion matrix.

5.1.3.2 Not bully tweets but classified as bully in validation dataset

  1. patung tugu rato nago besanding lokasi di simpang tiga kampung kagungan ratu camat tulang bawang udik lampung tebakgambarim ooredoo
  2. bahasa yang paling susah buat sulli adalah bahasa cina
  3. jokowi restu wna duduk jabat direksi bumn rizal ramli ampun deh
  4. mereun t ku sih gunawan sia asri a kunyuk jadi aing nu keuna kehed
  5. memang rezim saat ini banyak oknum polri yang arogan ini karena rakyat sangat lemah dan hukum tumpul atas dan tajam ke bawah
  6. dari tadi kayak kunyuk hadehhh
  7. satu bukti valid lagi bahaya virus idiot yang minum oleh salah satu admin bisa buat otak henti kerja saat mulut bicara atau jari nek keyboard komputer
  8. jadi ahok penjara hanya karena kutip kisah fiksi
  9. bantu aku cari judul film ini dong film tentang orang pjalan kaki spanjang km enggak film tahun an pokok lakon cewenya mati dtgah gurun pasir orgnya itu kabur hindar negara komunis supaya bisa balik ke negarany
  10. hebat di rejim ini orang sudah pada kayak semua
data_validation_trainno_predyes <- data_validation_check %>% 
  mutate(validation_pred_results,
    .after=bully,
    tweet = as.character(tweet)) %>%
  filter(bully == "no" & df_train_pred == "yes")

head(data_validation_trainno_predyes)
nrow(data_validation_trainno_predyes)
## [1] 213

213 tweets are labelled “not bully” but were classified as “yes” by our model, which matches the confusion matrix.

5.2 Data Test Naive Bayes

5.2.1 Cleansing

We will cleanse the test dataset with the same steps we used for the training dataset

#df_test <- read_csv("data/test.csv")
#df_test$tweet <- df_test$tweet %>% 
#  replace_tag() %>% 
#  replace_date(replacement = " ") %>% 
#  replace_email() %>% 
#  replace_emoji(.) %>% 
#  replace_emoticon(.) %>% 
#  replace_url() %>%
#  replace_html(.) %>% 
#  str_to_lower()

#df_test$tweet <- gsub("user", " ", df_test$tweet)
#df_test$tweet <- gsub("rt", " ", df_test$tweet)
#df_test$tweet <- gsub("[[:punct:] ]+", " ", df_test$tweet)
#df_test$tweet <- gsub("url", " ", df_test$tweet)
#df_test$tweet <- gsub("[^a-z]+$", "", df_test$tweet)
#df_test$tweet <- gsub("[[:digit:]]", "", df_test$tweet)
#df_test$tweet <- strip(df_test$tweet)

#df_test$tweet <- lapply(tokenize_words(df_test$tweet), stemming)
#df_test$tweet <- as.character(df_test$tweet)

#saveRDS(df_test, file = "naivebayes_test_clean.RDS")
df_test <- readRDS("naivebayes_test_clean.RDS")

df_test_corpus <- VCorpus(VectorSource(df_test$tweet))
df_test_dtm <- DocumentTermMatrix(df_test_corpus)

5.2.2 Bernoulli Data Test

df_test_bn <- apply(X = df_test_dtm, MARGIN = 2, FUN = bernoulli_conv)

5.2.3 Prediction Data Test

Apply the model fitted on the training data to the test data

df_test_pred <- predict(naive_bully, df_test_bn, type = "class")
head(df_test_pred)
## [1] no  yes no  no  yes yes
## Levels: no yes
summary(df_test_pred)
##   no  yes 
## 1402 1232

5.2.4 Submission Data Test

submission <- df_test %>% 
  mutate(bully = df_test_pred)

write.csv(submission, "submission-inge-freq10.csv", row.names = F)

6 Random Forest Model Fitting and Prediction

6.1 Splitting 80:20

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# train-test splitting
index <- sample(nrow(df_clean_final_dtm), nrow(df_clean_final_dtm)*0.8)

df_train_rf <- df_clean_final_dtm[index,]
df_validation_rf <- df_clean_final_dtm[-index,]

label_train_rf <- df_clean_final[index, 'bully']
label_validation_rf <- df_clean_final[-index, 'bully']

prop.table(table(label_train_rf))
## label_train_rf
##        no       yes 
## 0.5795904 0.4204096
prop.table(table(label_validation_rf))
## label_validation_rf
##        no       yes 
## 0.5833333 0.4166667

6.2 Reduce noise with findFreqTerms()

df_freq2 <- findFreqTerms(df_train_rf, lowfreq = 10)
length(df_freq2)
## [1] 1396
head(df_freq2)
## [1] "abang" "abu"   "acara" "adab"  "adat"  "adek"

6.3 Subset the words from df_freq2 to df_train_rf2

df_train_rf2 <- df_train_rf[,df_freq2]
inspect(df_train_rf2)
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   agama gue indonesia islam jokowi kalo kayak orang presiden sih
##   2419     0   1         0     0      0    0     1     1        0   0
##   2429     0   2         0     0      0    0     1     1        0   1
##   2452     0   0         0     0      0    0     0     0        0   0
##   2842     0   0         0     0      0    0     0     0        0   1
##   4346     1   0         0     3      0    0     0     0        0   0
##   525      2   1         0     2      0    0     0     0        0   0
##   6751     0   0         0     0      0    0     0     0        0   0
##   9050     1   0         0     0      0    0     0     0        0   1
##   9849     0   0         0     1      0    0     0     6        0   0
##   9882     0   0         0     0      0    0     0     0        0   0

6.4 Bernoulli data train & validation

bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

df_train_rf_bn <- apply(X = df_train_rf2, MARGIN = 2, FUN = bernoulli_conv)
df_validation_rf_bn <- apply(X = df_validation_rf, MARGIN = 2, FUN = bernoulli_conv)

6.5 Model Fitting Random Forest

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

library(randomForest)

rf <- randomForest(x = df_train_rf_bn,
                   y = label_train_rf,
                   ntree = 15)

6.6 RF prediction

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
rf_pred <- predict(rf, df_validation_rf_bn
                   , type = "class")

head(rf_pred)
##   1   3   6   8   9  15 
## yes yes yes yes yes  no 
## Levels: no yes
summary(rf_pred)
##   no  yes 
## 1187  901

6.7 Confusion Matrix data train and validation

confusionMatrix(data = rf_pred, # predicted labels
                reference = label_validation_rf, # actual labels
                positive = "yes") # positive class: yes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1016  171
##        yes  202  699
##                                           
##                Accuracy : 0.8214          
##                  95% CI : (0.8042, 0.8376)
##     No Information Rate : 0.5833          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6344          
##                                           
##  Mcnemar's Test P-Value : 0.1203          
##                                           
##             Sensitivity : 0.8034          
##             Specificity : 0.8342          
##          Pos Pred Value : 0.7758          
##          Neg Pred Value : 0.8559          
##              Prevalence : 0.4167          
##          Detection Rate : 0.3348          
##    Detection Prevalence : 0.4315          
##       Balanced Accuracy : 0.8188          
##                                           
##        'Positive' Class : yes             
## 

7 Data Test RF

7.1 Call Data Test

df_test_rf <- readRDS("naivebayes_test_clean.RDS")
head(df_test_rf)

7.2 Bernoulli Data Test

df_test_bn_rf<- apply(X = df_test_dtm, MARGIN = 2, FUN = bernoulli_conv)

7.3 Prediction Data Test

df_test_pred_rf <- predict(rf, df_test_bn_rf, type = "class")

My Random Forest test-set prediction stops here: the predict() call above raises an error whose cause I have not identified, even though the test data is the same data I used for the Naive Bayes test prediction. However, because Naive Bayes has the higher Sensitivity on the validation data (0.8195 versus 0.8034), I choose Naive Bayes as the better model, since I want to minimise False Negatives.
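
One possible cause, stated here as an assumption because the error message is not shown, is that the test document-term matrix does not contain every term the forest was trained on. The sketch below uses a hypothetical helper, align_terms(), which is not part of the original analysis. It reshapes a new matrix so that its columns match the training terms, filling terms absent from the new data with zeros:

```r
# Hypothetical helper (not part of the original analysis): reshape a new
# document-term matrix so its columns match the training terms exactly.
align_terms <- function(new_mat, train_terms) {
  # Start from an all-zero matrix with one column per training term
  out <- matrix(0, nrow = nrow(new_mat), ncol = length(train_terms),
                dimnames = list(rownames(new_mat), train_terms))
  # Copy over the counts for terms that exist in both matrices
  shared <- intersect(train_terms, colnames(new_mat))
  out[, shared] <- new_mat[, shared, drop = FALSE]
  out
}

# Toy example (illustrative terms and counts)
train_terms <- c("agama", "gue", "orang")
toy_test <- matrix(
  c(1, 0, 2,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t1", "t2"), c("orang", "baru", "agama"))
)
aligned <- align_terms(toy_test, train_terms)
aligned
```

If this assumption holds, aligning as.matrix(df_test_dtm) to df_freq2 before applying bernoulli_conv might let the Random Forest prediction run; this has not been verified on the actual data.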

8 Capstone Quiz Answer

8.1 No. 1

8.1.1 Which category has the most abusive and bullying text? how did you find it?

Individual (2,818 tweets) and Group (1,562 tweets) have the most cyberbullying tweets. The counts per category can be read from summary() of the dataframe

df_clean_bully %>%
  summary()
##  bully         tweet           individual group    gender   physical race    
##  no :   0   Length:4380        0:1562     0:2818   0:4137   0:4128   0:3951  
##  yes:4380   Class :character   1:2818     1:1562   1: 243   1: 252   1: 429  
##             Mode  :character                                                 
##  religion
##  0:3761  
##  1: 619  
## 

8.1.2 What text or token can represent each cyberbully category?

Please refer to the wordcloud and frequency dataframe for each category in section 3.3. Below are the top 10 tokens per category.

Individual:

head(bindividual_cloud,10)

Group:

head(bgroup_cloud,10)

Gender:

head(bgender_cloud,10)

Physical:

head(bphysical_cloud,10)

Race:

head(brace_cloud,10)

Religion:

head(breligion_cloud,10)

8.1.3 Is there any relationship between each category of cyberbully?

Yes. Several tokens appear in more than one category, with different frequencies. For example:

Indonesia: Religion (60), Race (101), Group (174)

Ahok: Religion (60), Individual (188)

Islam: Religion (249), Race (35), Group (200)

There might be more tokens that overlap between categories, but because we only look at the top 10 of each category, they do not show up here.
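
The overlap can also be checked programmatically. The sketch below uses only the counts quoted in this answer; the list layout and variable names are illustrative:

```r
# Token counts per category, as quoted in the answer above
categories <- list(
  religion   = c(indonesia = 60, ahok = 60, islam = 249),
  race       = c(indonesia = 101, islam = 35),
  group      = c(indonesia = 174, islam = 200),
  individual = c(ahok = 188)
)

# Count in how many categories each token appears
all_tokens <- unlist(lapply(categories, names), use.names = FALSE)
token_counts <- table(all_tokens)

# Tokens shared by at least two categories
shared <- names(token_counts)[token_counts >= 2]
shared
```

Running the same check on the full frequency tables instead of the top 10 would reveal overlaps that the short lists hide.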

8.2 No.2

8.2.1 What text or token can represent if a text is a cyberbully?

Below are the tokens when we filter bully == yes

head(bully_cloud,30)
wordcloud(df_clean_bully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.25))

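Is it based on the term frequency of each word or token, or on Term Frequency - Inverse Document Frequency (TF-IDF)?

It is based on term frequency. The document-term matrices are built with the default weighting, which the inspect() output reports as “term frequency (tf)”; no TF-IDF weighting is applied.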
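
To make the difference between the two weightings concrete, here is a base R sketch on made-up counts. It uses one common TF-IDF definition, idf = log2(N / df), which is an assumption and may differ in detail from a given library's implementation:

```r
# Toy term-frequency matrix (illustrative values, not from the dataset)
toy_tf <- matrix(
  c(1, 1, 0,
    1, 0, 2,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("orang", "islam", "agama"))
)

n_docs   <- nrow(toy_tf)
doc_freq <- colSums(toy_tf > 0)      # number of documents containing each term
idf      <- log2(n_docs / doc_freq)  # rarer terms get a larger weight

# Multiply every column of counts by that term's idf
toy_tfidf <- sweep(toy_tf, 2, idf, `*`)
toy_tfidf
```

A term that occurs in every document ("orang" here) gets weight zero under TF-IDF, whereas plain term frequency ranks it among the most prominent.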

8.2.2 Did you draw wordcloud to visualize the most frequent text on both cyberbully and non-bully text?

Yes

Bully

wordcloud(df_clean_bully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.25))

Non-Bully

wordcloud(df_clean_nobully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.3))

No. 3

What package will you use for text mining?

library(textclean)
library(tokenizers)
library(wordcloud)
library(dplyr)
library(devtools)
library(katadasaR)
library(tm)
library(stringr)
library(e1071)
library(caret)
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(tidyverse)
library(tidytext)
library(rsample)
library(yardstick)
library(SnowballC)
library(partykit)
library(ROCR)
library(randomForest)

8.2.3 Did you remove particular tweet text like RT and USER?

Yes

head(df_clean_2$tweet,10)
##  [1] "terimakasih ustadz sudah bersuara tentang radikal radikal ini entah apa yang ada dalam pikiran rejim mesjid radikal kampus radikal dosen radikal padahal tempat tersebut pijakan peradaban memangnya mau menghancurkan indonesia"
##  [2] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"                                                                                                                                                                
##  [3] "anjing tahi goblok idiot bangsat monyet babi fucc kont ngents goblok iya tau kasar banget maaf"                                                                                                                                  
##  [4] "hadiri lokakarya kebudayaan daerah bupati rupinus ajak masyarakat sekadau rawat dan manfaatkan objek budaya"                                                                                                                     
##  [5] "yang kayak begini layak di tangkap"                                                                                                                                                                                              
##  [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan bagian dari itu sudah waktunya lengserkan jokowi sebelum indonesia hancur"                                                                                       
##  [7] "wonu oppa kenapa matanya sipit banget"                                                                                                                                                                                           
##  [8] "smartfren jaringan nya kok brengsek ya"                                                                                                                                                                                          
##  [9] "pret kampret tak dukung ganti presiden tapi presidenmu sapa rocky gerung kapir thaa"                                                                                                                                             
## [10] "ahelah sombong benar punuk onta"

Did you use custom stopwords for Bahasa? Yes (for reference please go to Stemming, Stopwords and Tokenizing section)

stemming <- function(x) {
  paste(lapply(x, katadasar), collapse = " ")
}

df_clean_2$tweet[1:10437] <- lapply(tokenize_words(df_clean_2$tweet[1:10437]), stemming)

df_clean_3 <- data.frame(df_clean_2)

stopwords <- readLines("data/stopwords-id.txt")

df_clean_3$tweet <- df_clean_3$tweet %>% 
  replace_html(symbol = FALSE) %>% 
  replace_url(replacement = "")
df_clean_3$tweet <- gsub("url", " ", df_clean_3$tweet)

df_clean_3$tweet <- tokenize_words(df_clean_3$tweet, stopwords = stopwords)

Should you remove punctuation or emoticon? Yes (for reference please go to data cleansing section)

df_clean$tweet <- df_clean$tweet %>% 
  replace_tag() %>% 
  replace_date(replacement = " ") %>% 
  replace_email() %>% 
  replace_emoji(.) %>% 
  replace_emoticon(.) %>% 
  replace_url() %>%
  replace_html(.) %>% 
  str_to_lower() %>% 
  strip()

Will you create a document-term matrix? Yes (Please check on Wordcloud and Frequency section for the DTM process)

df_clean_final_dtm
## <<DocumentTermMatrix (documents: 10437, terms: 16211)>>
## Non-/sparse entries: 91995/169102212
## Sparsity           : 100%
## Maximal term length: 109
## Weighting          : term frequency (tf)

8.3 No. 4

8.3.1 What model will you use to classify the text?

Naive Bayes

8.3.2 How many tokens or words will you use for training the model?

1396 words (after subsetting with findFreqTerms() and lowfreq = 10)

df_train2
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)

8.3.3 How much percent (%) of the data used for training the model?

80% training, 20% validation

8.3.4 How do you choose which one is the better model? Is it based on accuracy?

Based on Sensitivity, because we want the fewest False Negatives (tweets predicted as not bully that are actually bully)

NAIVE BAYES Train to Validation

confusionMatrix(data = df_train_pred, # predicted labels
                reference = label_validation, # actual labels
                positive = "yes") # positive class: yes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1005  157
##        yes  213  713
##                                          
##                Accuracy : 0.8228         
##                  95% CI : (0.8057, 0.839)
##     No Information Rate : 0.5833         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6388         
##                                          
##  Mcnemar's Test P-Value : 0.004246       
##                                          
##             Sensitivity : 0.8195         
##             Specificity : 0.8251         
##          Pos Pred Value : 0.7700         
##          Neg Pred Value : 0.8649         
##              Prevalence : 0.4167         
##          Detection Rate : 0.3415         
##    Detection Prevalence : 0.4435         
##       Balanced Accuracy : 0.8223         
##                                          
##        'Positive' Class : yes            
## 

RANDOM FOREST Train to Validation

confusionMatrix(data = rf_pred, # predicted labels
                reference = label_validation, # actual labels
                positive = "yes") # positive class: yes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1016  171
##        yes  202  699
##                                           
##                Accuracy : 0.8214          
##                  95% CI : (0.8042, 0.8376)
##     No Information Rate : 0.5833          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6344          
##                                           
##  Mcnemar's Test P-Value : 0.1203          
##                                           
##             Sensitivity : 0.8034          
##             Specificity : 0.8342          
##          Pos Pred Value : 0.7758          
##          Neg Pred Value : 0.8559          
##              Prevalence : 0.4167          
##          Detection Rate : 0.3348          
##    Detection Prevalence : 0.4315          
##       Balanced Accuracy : 0.8188          
##                                           
##        'Positive' Class : yes             
## 

8.3.5 Which model is the best?

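Naive Bayes, as it has the higher Sensitivity (0.8195 vs. 0.8034 for Random Forest). Random Forest has the slightly higher Specificity (0.8342 vs. 0.8251), but Sensitivity is the selection criterion here.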
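
The metrics of both models can be recomputed directly from the counts in the two confusion matrices above. The helper names `make_cm()` and `metrics()` are hypothetical; the counts are the ones printed by `confusionMatrix()`.

```r
# Build a 2x2 confusion matrix (rows = prediction, columns = reference).
# matrix() fills column by column, hence the order tn, fp, fn, tp.
make_cm <- function(tn, fn, fp, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2,
         dimnames = list(Prediction = c("no", "yes"),
                         Reference  = c("no", "yes")))
}

nb_cm <- make_cm(tn = 1005, fn = 157, fp = 213, tp = 713)  # Naive Bayes
rf_cm <- make_cm(tn = 1016, fn = 171, fp = 202, tp = 699)  # Random Forest

# Sensitivity, specificity and precision with "yes" as the positive class.
metrics <- function(cm) {
  tp <- cm["yes", "yes"]; fn <- cm["no", "yes"]
  fp <- cm["yes", "no"];  tn <- cm["no", "no"]
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

round(rbind(naive_bayes   = metrics(nb_cm),
            random_forest = metrics(rf_cm)), 4)
```

Naive Bayes misses 157 bully tweets against 171 for Random Forest, which is the difference the Sensitivity figures reflect.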

8.4 No. 5

8.4.1 There is no overfitting between the evaluation on (your own) train dataset and the validation dataset, where overfitting is judged by a 5% difference in accuracy.

There is no overfitting: accuracy is 82.28% from train to validation and 82% from train to test.
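
The check behind this statement is a simple comparison of the two accuracy figures against the 5% criterion. A minimal sketch, using the reported values:

```r
acc_validation <- 0.8228  # accuracy, train to validation
acc_test       <- 0.82    # accuracy, train to test (as reported above)

# Absolute gap between the two evaluations, compared with the 5% criterion.
gap <- abs(acc_validation - acc_test)
gap < 0.05
```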

8.4.2 Confusion Matrix validation dataset

Accuracy in (your own) validation dataset reaches > 80%. Sensitivity in (your own) validation dataset reaches > 80%. Specificity in (your own) validation dataset reaches > 75%. Precision in (your own) validation dataset reaches > 75%.

confusionMatrix(data = df_train_pred, # predicted labels
                reference = label_validation, # actual labels
                positive = "yes") # positive class: yes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1005  157
##        yes  213  713
##                                          
##                Accuracy : 0.8228         
##                  95% CI : (0.8057, 0.839)
##     No Information Rate : 0.5833         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6388         
##                                          
##  Mcnemar's Test P-Value : 0.004246       
##                                          
##             Sensitivity : 0.8195         
##             Specificity : 0.8251         
##          Pos Pred Value : 0.7700         
##          Neg Pred Value : 0.8649         
##              Prevalence : 0.4167         
##          Detection Rate : 0.3415         
##    Detection Prevalence : 0.4435         
##       Balanced Accuracy : 0.8223         
##                                          
##        'Positive' Class : yes            
## 

Accuracy in the test dataset reaches > 80%. Sensitivity in the test dataset reaches > 80%. Specificity in the test dataset reaches > 75%. Precision in the test dataset reaches > 75%.

NAIVE BAYES Train to Test

knitr::include_graphics("data/NAIVE BAYES FREQ 10.png")

8.5 No. 6

8.5.1 Which tweets were incorrectly predicted on the test dataset?

Bully tweets classified as not bully in the validation dataset

head(data_validation_trainyes_predno,10)

Not bully tweets classified as bully in the validation dataset

head(data_validation_trainno_predyes,10)
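
Tables like the two above can be obtained by filtering the validation results on the actual and predicted labels. The sketch below shows one way to do it in base R; the data frame `validation_results` and its columns `tweet`, `label`, and `pred` are hypothetical names, not the objects used in this report.

```r
lv <- c("no", "yes")

# Hypothetical validation results: `label` is the actual class,
# `pred` is the class predicted by the model.
validation_results <- data.frame(
  tweet = c("tweet a", "tweet b", "tweet c", "tweet d"),
  label = factor(c("yes", "yes", "no", "no"), levels = lv),
  pred  = factor(c("no", "yes", "yes", "no"), levels = lv),
  stringsAsFactors = FALSE
)

# Bully tweets predicted as not bully (false negatives).
false_negatives <- subset(validation_results, label == "yes" & pred == "no")

# Not bully tweets predicted as bully (false positives).
false_positives <- subset(validation_results, label == "no" & pred == "yes")

head(false_negatives, 10)
head(false_positives, 10)
```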

8.6 No. 7

8.6.1 Is there any typical pattern among the misclassified texts?

Yes. As seen in the wordclouds of bully and non-bully tokens, several tokens appear in both. For example, among the non-bully tweets predicted as bully there are tokens such as cina (no. 2), jokowi (no. 3), and ahok (no. 8) that also appear in the bully tweets wordcloud. Hence, there is a pattern of misclassification caused by tokens that occur in both bully and non-bully tweets. Additionally:

1. Bully words in Javanese or slang that are not captured in the colloquial-indonesian-lexicon might be classified as non-bully by the system, e.g. kntl, sarap.
2. The context of a sentence can also affect the classification, and our prediction might not capture it well.

8.6.2 Are there any particular words that are present in most of the misclassified texts?

Yes, words like cina, jokowi, and ahok, which are present in both bully and non-bully tweets.
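
One way to check this systematically is to intersect the token sets of the two classes and count how many misclassified tweets contain each shared token. The token lists and example tweets below are hypothetical placeholders; in the report they would come from the bully and non-bully wordcloud data and the misclassified tweets.

```r
# Hypothetical token vocabularies per class.
bully_tokens     <- c("cina", "jokowi", "ahok", "sarap", "kntl")
non_bully_tokens <- c("cina", "jokowi", "ahok", "berita", "acara")

# Tokens that occur in both classes carry little class information.
shared_tokens <- intersect(bully_tokens, non_bully_tokens)

# Hypothetical non-bully tweets that were predicted as bully.
misclassified <- c("jokowi bertemu ahok",
                   "berita cina hari ini",
                   "ahok hadir di acara")
tokens_per_tweet <- strsplit(misclassified, " ", fixed = TRUE)

# For each shared token, count the misclassified tweets that contain it.
doc_freq <- sapply(shared_tokens, function(tok) {
  sum(vapply(tokens_per_tweet, function(x) tok %in% x, logical(1)))
})

sort(doc_freq, decreasing = TRUE)
```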

8.7 Conclusion

8.7.1 Is your goal achieved?

Yes. Although this might not be the perfect model, it already gives us a head start in classifying bully and non-bully tweets with more than 80% accuracy. In the future, it would be beneficial to tweak the model to reduce the false negative rate and improve its performance.

8.7.2 Can the problem be solved using ML?

Absolutely. With a better model, this could help a social media company prevent cyberbullying. As we know, cyberbullying is a serious problem that has cost many teenagers their lives.