library(textclean)
library(katadasaR)
library(tokenizers)
library(wordcloud)
library(dplyr)
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(tidyverse)
library(tidytext)
library(rsample)
library(caret)
library(stringr)
library(yardstick)
library(SnowballC)
library(tm)
library(e1071)
library(partykit)
library(ROCR)
Using the Cyberbully dataset, we would first like to find out the main topics, or the most popular words, that appear in tweets flagged as cyberbullying.
Then we will build a prediction model to classify Indonesian cyberbullying text.
# read data
df <- read.csv("data/train.csv")
df %>% sample_n(2)
## bully
## 1 no
## 2 yes
## tweet
## 1 USER Wah.. Bangkai kapal kog bisa mengganggu ekosistim dan menjadi sampah utk biota laut, ikan... Ilmu baru ya...'
## 2 USER USER USER USER HTI EMANG PERLU DI BUBARKAN
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 1 0 0 0 0
df %>% dim()
## [1] 10535 8
This data set contains 10,535 tweets gathered by Algoritma. These are the descriptions of the columns:
bully: whether the tweet is a cyberbullying tweet (yes / no)
tweet: the raw tweet text
individual, group, gender, physical, race, religion: binary flags marking the target or type of the bullying
To build the best possible prediction model, and to map the words used in the tweets, we have to clean our data of any potential problems.
First, we are going to check whether there are any missing values in each column.
colSums(is.na(df))
## bully tweet individual group gender physical race
## 0 0 0 0 0 0 0
## religion
## 0
Next, we check whether our data contains any duplicated values. First we look for rows duplicated across all columns, and then for duplicates in the tweet column alone.
# Check for Duplicated Values in all Columns
df[duplicated(df),] %>% sample_n(3)
## bully
## 1 no
## 2 no
## 3 no
## tweet
## 1 dekalarasi pilkada 2018 aman dan anti hoax warga panggreh jabon
## 2 RT USER: USER USER USER USER USER USER USER USER USER USER USER USER '
## 3 I added a video to a USER playlist
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
# Check for Duplicated Values in Column Tweet
df[duplicated(df$tweet),] %>% sample_n(3)
## bully
## 1 no
## 2 no
## 3 no
## tweet
## 1 USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER
## 2 dekalarasi pilkada 2018 aman dan anti hoax warga panggreh jabon
## 3 USER USER Ikut share..sekolah anakku sekolah katholik yg ngajar suster,teman sekelasnya ada yg muslim,kristen,budha,hindu,konghu chu..selama ini masih adem ayem saja'
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
From the tables above we know that our data still has some duplicated values; we are going to delete them to reduce the processing load when building our models.
# Drop Duplicated Values
df_nodup <- df %>% distinct()
# Drop Duplicated Tweet
df_nodup <- df_nodup %>% distinct(tweet, .keep_all = T)
df_nodup %>% dim()
## [1] 10437 8
Next, we convert each column to its desired type.
df_clean <- df %>% mutate(bully = as.factor(bully)) %>% mutate_if(is.integer, as.logical)
tweets <- df_clean$tweet %>% as.character()
head(tweets)
## [1] "USER terimakasih Ustadz sudah bersuara tentang Radikal radikal ini. Entah apa yang ada dalam pikiran rejim. Mesjid radikal...kampus radikal....dosen radikal....padahal tempat tersebut pijakan peradaban. Memangnya mau menghancurkan Indonesia ?"
## [2] "USER USER Maaf sebenarnya twiter pertama kali dbuat bukan buat orang bego'"
## [3] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
## [4] "Hadiri Lokakarya Kebudayaan Daerah, Bupati Rupinus Ajak Masyarakat Sekadau Rawat dan Manfaatkan Objek Budaya"
## [5] "USER USER USER yg kaya gini layak di tangkap."
## [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan / bagian dari itu sudah waktunya lengserkan Jokowi sebelum indonesia hancur"
check_text(tweets)
Even after all of the steps above, we still live by the principle of garbage in, garbage out. Next, we are going to dive into cleaning our text data by going through these steps:
First, we remove repeated patterns from the sentences. The data we collected from Algoritma masked Twitter mentions with “USER” and retweets with “RT”. These two tokens are repeated patterns that add no value to a tweet and, because of their huge quantity, may distort our modeling process, so we remove every occurrence of “RT” and “USER”.
tweets[3]
## [1] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
tweets[26]
## [1] "RT USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak? #ngentot #memek #kontol #sange #pagicrot'"
# Remove USER
tweets <- gsub("USER ", " ", tweets)
tweets[3]
## [1] " Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
# Remove RT
tweets <- gsub("RT ", " ", tweets)
tweets[26]
## [1] " USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak? #ngentot #memek #kontol #sange #pagicrot'"
The next thing we are going to do is remove HTML entities and URLs from our text.
tweets[436]
## [1] " USER: kalo di catatan harian menantu sinting, lakban = laki banget, sementara itu lakban menurut definisi df http '"
tweets <- tweets %>%
replace_html() %>% # remove html with blank
replace_url() # remove url with blank
tweets[436]
## [1] " USER: kalo di catatan harian menantu sinting, lakban = laki banget, sementara itu lakban menurut definisi df '"
The next thing we are going to do is remove any emojis and emoticons from our text data, replacing each with the word describing it, which is more useful to us.
tweets[28]
## [1] " Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan :)'"
tweets <- tweets %>%
replace_emoticon(.) %>%
replace_emoji(.)
tweets[28]
## [1] " Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan smiley '"
Indonesian contains many slang and colloquial words used in day-to-day life, and Twitter is no exception. Using the colloquial-indonesian-lexicon file that I gathered from a GitHub repository, we are going to convert the majority of slang words into their formal form. To do this we use replace_internet_slang().
# Import Indonesian Lexicon
spell.lex <- read.csv("data/colloquial-indonesian-lexicon.csv")
# Replace Internet Slang (demonstrated here on the first 20 tweets; the full data set is cleaned in the second pass below)
tweets <- replace_internet_slang(tweets[1:20], slang = paste0("\\b", spell.lex$slang, "\\b"),
                                 replacement = spell.lex$formal, ignore.case = TRUE)
saveRDS(tweets, file = "tweets-slang_clean2.RDS")
The next thing we are going to do is remove numbers and punctuation, and lowercase all the text so that our computer does not treat two identical words as different.
tweets <- readRDS("tweets-slang_clean2.RDS")
tweets[2]
## [1] " Maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego'"
library(tm)
Next we strip the text: strip() from textclean lowercases it and removes leftover punctuation, digits, and excess whitespace in one pass.
# Text Stripping
tweets <- strip(tweets)
tweets[2]
## [1] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego'"
After that we are going to stem each word, removing Indonesian prefixes (“awalan”) and suffixes (“akhiran”). We will do this using the katadasaR package.
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
tweets <- lapply(tokenize_words(tweets), stemming)
tweets[2]
## [[1]]
## [1] "maaf benar twiter pertama kali buat bukan buat orang bego"
Next we separate each sentence into individual words, a process known as tokenization. This has to be done because it makes the data readable by our computer. We also remove any stopwords present in the data, using stopwords_id_satya.txt as our stopword list.
stopwords <- readLines("data/stopwords_id_satya.txt")
## Warning in readLines("data/stopwords_id_satya.txt"): incomplete final line found
## on 'data/stopwords_id_satya.txt'
tweets <- tokenize_words(tweets, stopwords = stopwords)
tweets[2]
## [[1]]
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
To make a model we need our text data as character type; let's check the current type using class().
class(tweets)
## [1] "list"
tweets <- as.character(tweets)
library(wordcloud)
wordcloud(tweets)
This is just an example of how the mechanics of the topic modelling work. Next, we will generate our wordclouds using all of our data.
Repeating the steps above on the full data set, we will model the text in more depth: we will look at each kind of bullying and surface the most common words found for that bullying type.
# Set Cleaning Parameters
punctuation <- rx_punctuation()
number <- rx_digit()
exclamation <- rx() %>%
rx_find(value = "!") %>%
rx_one_or_more()
question <- rx() %>%
rx_find(value = "?") %>%
rx_one_or_more()
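Printing these objects shows the regular expressions that RVerbalExpressions builds for us; the strings in the comments below are approximate, as the exact output depends on the package version.
# Inspect the generated regex patterns (values shown are approximate)
punctuation # e.g. "[[:punct:]]" - any punctuation character
number # e.g. "\\d" - any digit
question # e.g. "(\\?)+" - one or more literal question marks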
# Text Cleaning
tweets <- df_clean %>%
mutate(
text_clean = tweet %>%
replace_tag() %>%
replace_hash() %>%
replace_date(replacement = "") %>%
replace_email() %>%
replace_html(symbol = FALSE) %>%
replace_url(replacement = "") %>%
replace_emoticon(.) %>%
replace_emoji(.) %>%
replace_number(remove = TRUE) %>%
replace_internet_slang(
slang = paste0("\\b", spell.lex$slang, "\\b"),
replacement = spell.lex$formal, ignore.case = TRUE
) %>%
str_replace_all(pattern = question, replacement = "") %>%
str_replace_all(pattern = exclamation, replacement = "") %>%
str_remove_all(pattern = punctuation) %>%
str_remove_all(pattern = number) %>%
str_to_lower() %>%
str_squish()
)
saveRDS(tweets, file = "tweets-slang_clean.RDS")
# Cleaning Repeated Unused Words
tweets_clean <- readRDS("tweets-slang_clean.RDS")
tweets_clean <- tweets_clean %>% mutate(
text_clean = gsub("user ", " ", tweets_clean$text_clean))
tweets_clean <- tweets_clean %>% mutate(
text_clean = gsub("rt ", " ", tweets_clean$text_clean))
tweets_clean$text_clean[2]
## [1] " maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"
# Stemming Function
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
# Tokenize Text
tweets_clean$text_clean <- lapply(tokenize_words(tweets_clean$text_clean), stemming)
tweets_clean$text_clean[1]
# Remove Stopwords and Tokenize
stopwords <- readLines("data/stopwords_id_satya.txt")
tweets_clean$text_clean <- tokenize_words(tweets_clean$text_clean, stopwords = stopwords)
saveRDS(tweets_clean, file = "tweets_tokenize")
tweets_clean <- readRDS("tweets_tokenize")
tweets_clean$text_clean[2]
## [[1]]
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
From the cleaning process we have done, we can now differentiate the kinds of bullying. First, we will do topic modelling on the overall bully tweets.
# Filter Bully Tweets
tweet_bully <- tweets_clean %>% filter(bully == "yes")
# Change to Corpus Format
bully_corpus <- tweet_bully$text_clean %>% VectorSource() %>% VCorpus()
bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
bully_dtm <- DocumentTermMatrix(bully_corpus)
inspect(bully_dtm)
## <<DocumentTermMatrix (documents: 4421, terms: 8176)>>
## Non-/sparse entries: 39221/36106875
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ahok cebong indonesia islam jadi jokowi kalo orang sih url
## 1019 0 0 0 0 0 0 0 1 0 0
## 1127 0 0 0 0 0 0 0 0 0 0
## 1405 0 0 0 0 0 0 0 0 0 0
## 1544 0 0 0 0 0 0 0 0 1 0
## 2314 0 0 0 0 0 0 0 0 2 0
## 2971 0 0 0 0 0 1 0 0 0 0
## 331 0 0 0 0 0 0 0 0 0 0
## 3667 0 0 0 1 0 0 0 0 0 0
## 3754 0 0 0 3 0 1 0 0 0 0
## 752 0 0 0 0 1 0 0 0 0 0
wordcloud(bully_corpus)
From the plot above we can see the most-used words when people tweet something with the intention of bullying another entity.
individual_bully <- tweet_bully %>% filter(individual == TRUE)
individual_bully_corpus <- individual_bully$text_clean %>% VectorSource() %>% VCorpus()
individual_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(individual_bully_corpus)
The next category is individual bullying. It has a wide spread of different words, with the more popular words shown in larger text.
group_bully <- tweet_bully %>% filter(group == TRUE)
group_bully_corpus <- group_bully$text_clean %>% VectorSource() %>% VCorpus()
group_bully_corpus[[1]]$content
## [1] "smartfren" "jaring" "nya" "brengsek" "ya"
wordcloud(group_bully_corpus)
Next are cyberbullying tweets aimed at groups of people; as we can see, "indonesia" appears often among the words correlated with these bully tweets.
gender_bully <- tweet_bully %>% filter(gender == TRUE)
gender_bully_corpus <- gender_bully$text_clean %>% VectorSource() %>% VCorpus()
gender_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(gender_bully_corpus)
The next category is gender-related cyberbullying. As we can see, many of the prominent words above are gender-specific bullying terms.
physical_bully <- tweet_bully %>% filter(physical == TRUE)
physical_bully_corpus <- physical_bully$text_clean %>% VectorSource() %>% VCorpus()
physical_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(physical_bully_corpus)
These are the most-used words in physical cyberbullying tweets.
race_bully <- tweet_bully %>% filter(race == TRUE)
race_bully_corpus <- race_bully$text_clean %>% VectorSource() %>% VCorpus()
race_bully_corpus[[1]]$content
## [1] "partai" "koalisi" "dukung" "pemerintah" "diketahui"
## [6] "jalin" "kerjasama" "politik" "partai" "komunis"
## [11] "china"
wordcloud(race_bully_corpus)
These are the most-used words in race-related cyberbullying tweets.
religion_bully <- tweet_bully %>% filter(religion == TRUE)
religion_bully_corpus <- religion_bully$text_clean %>% VectorSource() %>% VCorpus()
religion_bully_corpus[[1]]$content
## [1] "kapir" "asli" "menang" "busa" "mulut"
## [6] "alam" "nyata" "jadi" "pecundangoh" "pemuja"
## [11] "wowo"
wordcloud(religion_bully_corpus)
These are the most-used words in religion-related cyberbullying tweets.
Now we will dig deeper on the machine learning side. In this section we feed the tweet data we cleaned above into machine learning algorithms that can classify whether a new tweet is bullying or not.
The first classifier we are going to use is the Naive Bayes model. The data needed here are the text from the tweet column and the labels from the bully column. We then convert the text data into a corpus and from there into a document-term matrix, which will be used for training.
A corpus is a collection of documents. In this case, one document is equivalent to one tweet observation, and one tweet can contain one or more sentences. One of the packages we can use for text mining is tm. Converting from a text vector to a corpus can be done with the VCorpus() function.
tweets_corpus <- tweets_clean$text_clean %>% VectorSource() %>% VCorpus()
tweets_corpus[[2]]$content
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
We need to transform the text data into Document-Term Matrix (DTM) through the tokenization process. Tokenization is the process of breaking a sentence into several terms (can be 1 word, word pair, etc.). In DTM, one word will be one predictor with a value in the form of the frequency of occurrence of the word in a document.
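To illustrate, here is a tiny DTM built from two made-up example sentences (not taken from our data); the terms become columns and the documents become rows:
toy_corpus <- VCorpus(VectorSource(c("kamu bego bego", "kamu pintar")))
as.matrix(DocumentTermMatrix(toy_corpus))
##     Terms
## Docs bego kamu pintar
##    1    2    1      0
##    2    0    1      1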
tweets_dtm <- DocumentTermMatrix(tweets_corpus)
inspect(tweets_dtm)
## <<DocumentTermMatrix (documents: 10535, terms: 17333)>>
## Non-/sparse entries: 101288/182501867
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue islam jadi jokowi orang presiden sih untuk url
## 10021 0 0 0 0 0 1 0 0 0 0
## 1286 0 0 0 0 0 0 0 0 0 0
## 211 0 0 0 0 0 0 0 0 0 0
## 2803 0 0 0 0 0 0 0 1 0 0
## 3370 0 0 0 0 0 0 0 0 0 0
## 5958 0 0 0 0 0 0 0 0 0 0
## 6461 0 0 0 0 0 1 0 1 0 0
## 764 0 0 0 0 0 0 0 0 0 0
## 8701 0 0 2 0 0 0 0 0 0 0
## 9128 0 2 0 1 0 0 0 0 0 0
We are going to split our data into training and validation sets with an 80:20 ratio.
RNGkind(sample.kind = "Rounding")
set.seed(305)
# Split Ratio 80:20
index <- sample(nrow(tweets_dtm), nrow(tweets_dtm)*0.8)
# Data Splitting
tweets_train <- tweets_dtm[index,]
tweets_val <- tweets_dtm[-index,]
# Class labels of our dataset
label_train <- tweets_clean[index, 'bully']
label_val <- tweets_clean[-index, 'bully']
# Class proportions in our training and validation sets
prop.table(table(label_train))
## label_train
## no yes
## 0.5759374 0.4240626
prop.table(table(label_val))
## label_val
## no yes
## 0.5980066 0.4019934
# Check Dimension
dim(tweets_train)
## [1] 8428 17333
length(label_train)
## [1] 8428
Because the number of predictors is very high, 17,333 terms, let's reduce the noise in our data by keeping only words that occur fairly often, for example at least 20 times across all tweets. Use the findFreqTerms() function.
# Minimum frequency of appearance in documents
tweets_freq <- findFreqTerms(tweets_train, lowfreq = 20)
# Number of Unique words
length(unique(tweets_freq))
## [1] 817
tweets_freq %>% tail()
## [1] "wkwk" "wkwkwk" "yahudi" "yakin" "yuk" "zaman"
To keep only the words that appear in tweets_freq, let's subset the tweets_train data with this command:
tweets_train_freqreduced <- tweets_train[,tweets_freq]
inspect(tweets_train_freqreduced)
## <<DocumentTermMatrix (documents: 8428, terms: 817)>>
## Non-/sparse entries: 50021/6835655
## Sparsity : 99%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue islam jadi jokowi orang presiden sih untuk url
## 2437 0 2 0 0 0 1 0 1 0 0
## 2461 0 0 0 0 0 0 0 0 0 0
## 2571 0 0 0 0 0 0 0 0 0 1
## 4363 1 0 3 0 0 0 0 0 0 0
## 4391 1 0 0 0 0 0 0 0 0 0
## 525 2 1 2 0 0 0 0 0 0 0
## 5384 0 0 0 4 0 0 0 0 0 0
## 6242 0 0 1 0 0 3 0 0 0 0
## 8958 0 0 3 0 1 0 1 0 2 0
## 9336 0 0 0 1 3 0 1 0 1 0
The Naive Bayes model is really good at handling binary categorical predictors, but the values in our tweets_train matrix are still frequencies. For the probability calculations, each frequency will be converted to whether the word appears (1) or not (0). One way to do this is with a Bernoulli converter.
bernoulli_conv <- function(x){
# ifelse parameters: condition, value if TRUE, value if FALSE
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
After making our Bernoulli converter function, we apply it to the training and validation data sets.
tweets_train_bn <- apply(tweets_train_freqreduced, 2, FUN = bernoulli_conv)
tweets_val_bn <- apply(tweets_val, 2, FUN = bernoulli_conv)
tweets_train_bn[100:110, 35:40]
## Terms
## Docs angkat anies aniessandi anjing anjir antek
## 9569 "0" "0" "0" "0" "0" "0"
## 7730 "0" "0" "0" "0" "0" "0"
## 7128 "0" "0" "0" "0" "0" "0"
## 7218 "0" "0" "0" "0" "0" "0"
## 8561 "0" "0" "0" "0" "0" "0"
## 1557 "0" "0" "0" "0" "0" "0"
## 7452 "0" "0" "0" "0" "0" "0"
## 1218 "0" "0" "0" "0" "0" "0"
## 5567 "0" "0" "0" "0" "0" "0"
## 7854 "0" "0" "0" "0" "0" "0"
## 2594 "0" "0" "0" "0" "0" "0"
nb_bully <- naiveBayes(tweets_train_bn, y = label_train, laplace = 1)
tweets_nb_pred <- predict(nb_bully, tweets_val_bn)
We will evaluate the predictions using a confusion matrix and its associated metrics.
confusionMatrix(data = tweets_nb_pred, reference = label_val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1023 163
## yes 237 684
##
## Accuracy : 0.8102
## 95% CI : (0.7927, 0.8267)
## No Information Rate : 0.598
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6107
##
## Mcnemar's Test P-Value : 0.0002622
##
## Sensitivity : 0.8119
## Specificity : 0.8076
## Pos Pred Value : 0.8626
## Neg Pred Value : 0.7427
## Prevalence : 0.5980
## Detection Rate : 0.4855
## Detection Prevalence : 0.5629
## Balanced Accuracy : 0.8097
##
## 'Positive' Class : no
##
From the output above we would like to extract these important figures:
Reference
Prediction   no  yes
      no   1023  163
      yes   237  684
Accuracy : 0.8102
Sensitivity : 0.8119
Specificity : 0.8076
Pos Pred Value : 0.8626
To make a better model we would like to optimize this result to better suit our judgement. My personal judgement is that we should flag more tweets as bullying, even at the cost of labelling some non-bullying tweets as bullies; to do that we have to increase the precision, the Pos Pred Value, of our confusion matrix.
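One way to shift that trade-off, sketched below rather than taken from the pipeline above, is to ask naiveBayes for the raw posterior probabilities and lower the cut-off for the "yes" class; the 0.35 threshold is an arbitrary illustrative value that would need tuning.
# Posterior probabilities instead of hard class predictions
nb_prob <- predict(nb_bully, tweets_val_bn, type = "raw")
# Flag "yes" more aggressively than the default 0.5 cut-off
tweets_nb_pred_tuned <- factor(ifelse(nb_prob[, "yes"] > 0.35, "yes", "no"),
                               levels = levels(label_val))
confusionMatrix(data = tweets_nb_pred_tuned, reference = label_val)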
Next we build a classification model using a neural network approach. As before, we need clean data; we already saved our cleaned data in RDS form.
After cleaning the text data, we prepare it so that a neural network model can be applied. First we make our label a factor to distinguish the two outcomes, then convert it to numeric 0/1 values: 0 for not bully and 1 for bully.
tweets_clean2 <- tweets_clean2 %>%
mutate(bully = factor(bully, levels = c("no", "yes")),
bully = as.numeric(bully),
bully = bully - 1) %>%
select(text_clean, bully)
As we stated before, the tokenizer separates each word in the entire document collection into token form. The num_words parameter sets the maximum number of words to use, ordered by descending frequency; words that rarely appear are removed. From a total of 23158 unique words in the text data, we reduce it to 1024 words to build the model. The lower parameter is a logical flag; if TRUE, all words are transformed to lowercase (tolower).
num_words <- 1024
tokenizer <- text_tokenizer(num_words = num_words, lower = T) %>%
fit_text_tokenizer(tweets_clean2$text_clean)
paste("number of unique words is ", length(tokenizer$word_counts))## [1] "number of unique words is 23158"
docs <- c("no", "yes")
tokendocs <- text_tokenizer(num_words = 2, lower = TRUE) %>% fit_text_tokenizer(docs)
tokendocs$word_index[1:2]
## $no
## [1] 1
##
## $yes
## [1] 2
The data will be split into three parts: train, validation, and test. The test data already lives in a separate file, test.csv, and the validation set is obtained by splitting train.csv 80:20, 80 being the training data and 20 the validation data.
The training data is what we use to fit the model, the validation data is for evaluating hyperparameter tuning (adjusting hidden layers, optimizers, learning rates, etc.), and the test data evaluates the model we make on unseen data.
set.seed(305)
intrain <- initial_split(data = tweets_clean2, prop = 0.8, strata = "bully")
data_train <- training(intrain)
data_val <- testing(intrain)
data_test <- read.csv(file = "data/test.csv")
maxlen <- max(str_count(tweets_clean2$text_clean, "\\w+")) + 1
paste("maxiumum length words in data:", maxlen)## [1] "maxiumum length words in data: 332"
data_train_x <- texts_to_sequences(tokenizer, data_train$text_clean) %>%
pad_sequences(maxlen = maxlen)
data_val_x <- texts_to_sequences(tokenizer, data_val$text_clean) %>% pad_sequences(maxlen = maxlen)
data_test_x <- texts_to_sequences(tokenizer, data_test$tweet) %>%
pad_sequences(maxlen = maxlen)
# prepare y
data_train_y <- to_categorical(data_train$bully, num_classes = 2)
data_val_y <- to_categorical(data_val$bully, num_classes = 2)
data_test_y <- to_categorical(data_test$bully, num_classes = 2)
Embedding layers can only be used as the initial / first layer of the LSTM architecture. In deep learning frameworks such as Keras, the embedding layer trains the text data into numerical vectors that represent the closeness in meaning of each word.
Embedding layer accepts several parameters. Some examples are:
input_dim, the maximum size of the vocabulary, as explained for the num_words parameter above.
input_length, the maximum length of the word sequence in the document input.
output_dim, the embedding dimension of the output layer, which will be passed to the next layer. It is generally 32, but can be larger depending on the problem we face.
The input is a 2D tensor of shape {batch_size, sequence_length}, while the output is a 3D tensor of shape {batch_size, sequence_length, output_dim}, as the small sketch below shows.
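As a quick sanity check, a throw-away model containing only the embedding layer (a sketch reusing this project's num_words and maxlen) makes the shape transformation visible:
# 2D input {batch_size, sequence_length} -> 3D output {batch_size, sequence_length, output_dim}
toy_embed <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, input_length = maxlen, output_dim = 32)
toy_embed$output_shape # (NULL, 332, 32)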
The deep network layer accepts the embedding matrix as input and compresses it into smaller dimensions that still represent the information in the data. For text data, the commonly used deep learning architectures are RNNs > LSTM / GRU.
You can check the Keras documentation for details on the sequential layers.
The output layer is the last layer in the deep learning architecture. In Keras we use the layer_dense command, where we set the units parameter, i.e. how many neurons we want to build. In this case I use 2 units, because we have 2 classes (no, yes).
Training a neural network / deep learning model often gives different results on each run. Why? Because NN and DL weights are initialized randomly (randomness initialization). Therefore we need to fix the random numbers in order to get a stable result when training is repeated (a reproducible result). This can be done with the seed parameter of the initializer_random_uniform command. For more details, read the question-and-answer article in the Keras RStudio documentation.
model_nn1 <- keras_model_sequential(name = "LDA_model") %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 256,
embeddings_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.5
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = 256,
dropout = 0.2,
recurrent_dropout = 0.2,
return_sequences = FALSE,
recurrent_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2),
kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
) %>%
# layer output
layer_dense(
name = "output",
units = 2,
activation = "softmax",
kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
)
model_nn1 %>% compile(
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy",
loss = "categorical_crossentropy"
)
history_tune_1 <- model_nn1 %>%
  fit(x = data_train_x, # predictors
      y = data_train_y, # target
      batch_size = 512, # weights are updated every 512 samples
      epochs = 10,
      validation_data = list(data_val_x, data_val_y), # evaluate on the validation data
      verbose = 1
  )
plot_nn1 <- plot(history_tune_1)
saveRDS(model_nn1, file = "model_nn1")
saveRDS(plot_nn1, file = "plot_nn1")
readRDS("plot_nn1")
This is the result on our training and validation data.
# predict on train
data_train_pred <- model_nn1 %>%
predict(data_train_x) %>%
k_argmax() %>% as.array()
# predict on val
data_val_pred <- model_nn1 %>%
predict(data_val_x) %>%
k_argmax() %>% as.array()
# predict on test
data_test_pred <- model_nn1 %>%
predict(data_test_x) %>%
k_argmax() %>% as.array()
saveRDS(data_train_pred, "data_output/data_train_pred")
saveRDS(data_val_pred, "data_output/data_val_pred")
saveRDS(data_test_pred, "data_output/data_test_pred")
data_train_pred <- readRDS("data_output/data_train_pred")
data_val_pred <- readRDS("data_output/data_val_pred")
data_test_pred <- readRDS("data_output/data_test_pred")
test_pred_yn <- data_test_pred %>% str_replace_all(c("0" = "no", "1" = "yes"))
submission <- data_test %>% mutate(bully = test_pred_yn)
# save data
write.csv(submission, "submission-davel-NN.csv", row.names = F)
These are the three accuracies of my predictions on the different tweet data sets. The test-data score was already submitted to the shinyapps leaderboard, as shown in the image below.
# Accuracy on Train Data
accuracy_vec(
truth = factor(data_train$bully,labels = c("no", "yes")),
estimate = factor(data_train_pred, labels = c("no", "yes"))
)
## [1] 0.8827578
# Accuracy on Validation Data
accuracy_vec(
truth = factor(data_val$bully,labels = c("no", "yes")),
estimate = factor(data_val_pred, labels = c("no", "yes"))
)
## [1] 0.8183112
confusionMatrix(data = as.factor(data_val_pred), reference = as.factor(data_val$bully))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1036 196
## 1 187 689
##
## Accuracy : 0.8183
## 95% CI : (0.8012, 0.8346)
## No Information Rate : 0.5802
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6265
##
## Mcnemar's Test P-Value : 0.6827
##
## Sensitivity : 0.8471
## Specificity : 0.7785
## Pos Pred Value : 0.8409
## Neg Pred Value : 0.7865
## Prevalence : 0.5802
## Detection Rate : 0.4915
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.8128
##
## 'Positive' Class : 0
##
And this is the accuracy on the test data.
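Locally, the test accuracy could be computed the same way as the other two; this is only a sketch, since it assumes data_test carries a labelled bully column, whereas the score here came from the leaderboard.
# Accuracy on Test Data (sketch; assumes test labels are available)
accuracy_vec(
  truth = factor(data_test$bully, labels = c("no", "yes")),
  estimate = factor(data_test_pred, labels = c("no", "yes"))
)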
From the machine learning processes that have been carried out, several goals have been achieved, such as:
The resulting predictions do not indicate overfitting; the accuracies of the models are not far apart.
Data validation:
Data Test:
I think this problem can be solved by machine learning; with a better understanding of the purpose of the embedding, deep network, and output layers, we could classify bully tweets with even more accuracy. And of course more processing power would help in a huge way.
The load of the neural network training took a toll on my device: the process could run up to a day unattended and disrupted everything else on the machine. Perhaps for this specific capstone we could have more time to finish the task. Thank you.