library(textclean)
library(katadasaR)
library(tokenizers)
library(wordcloud)
library(dplyr)
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(tidyverse)
library(tidytext)
library(rsample)
library(caret)
library(stringr)
library(yardstick)
library(SnowballC)
library(tm)
library(e1071)
library(partykit)
library(ROCR)
Using the Cyberbully dataset, we would first like to find out the main topics, or the most popular words, that appear in tweets flagged as cyberbullying.
Then we will build a prediction model to classify Indonesian cyberbullying text.
# read data
df <- read.csv("data/train.csv")
df %>% sample_n(2)
## bully
## 1 no
## 2 yes
## tweet
## 1 USER Wah.. Bangkai kapal kog bisa mengganggu ekosistim dan menjadi sampah utk biota laut, ikan... Ilmu baru ya...'
## 2 USER USER USER USER HTI EMANG PERLU DI BUBARKAN
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 1 0 0 0 0
df %>% dim()
## [1] 10535 8
This data set contains 10,535 tweets gathered by Algoritma. These are the descriptions of the columns:
bully: whether the tweet is a cyberbullying tweet (yes / no)
tweet: the raw tweet text
individual, group, gender, physical, race, religion: binary flags marking the target or type of the bullying
To build the best possible prediction model, and to map the words used in the tweets, we have to clean our data of any potential problems.
First, we are going to check whether there are any missing values in each column.
colSums(is.na(df))
## bully tweet individual group gender physical race
## 0 0 0 0 0 0 0
## religion
## 0
Next, we check whether our data contains any duplicated values. First we look for rows duplicated across all columns, and then for duplicates in the tweet column alone.
# Check for Duplicated Values in all Columns
df[duplicated(df),] %>% sample_n(3)
## bully
## 1 no
## 2 no
## 3 no
## tweet
## 1 dekalarasi pilkada 2018 aman dan anti hoax warga panggreh jabon
## 2 RT USER: USER USER USER USER USER USER USER USER USER USER USER USER '
## 3 I added a video to a USER playlist
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
# Check for Duplicated Values in Column Tweet
df[duplicated(df$tweet),] %>% sample_n(3)
## bully
## 1 no
## 2 no
## 3 no
## tweet
## 1 USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER USER
## 2 dekalarasi pilkada 2018 aman dan anti hoax warga panggreh jabon
## 3 USER USER Ikut share..sekolah anakku sekolah katholik yg ngajar suster,teman sekelasnya ada yg muslim,kristen,budha,hindu,konghu chu..selama ini masih adem ayem saja'
## individual group gender physical race religion
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
From the tables above we know that our data still has some duplicated values; we are going to delete them to reduce the processing load when building our models.
# Drop Duplicated Values
df_nodup <- df %>% distinct()
# Drop Duplicated Tweet
df_nodup <- df_nodup %>% distinct(tweet, .keep_all = T)
df_nodup %>% dim()
## [1] 10437 8
Next, we convert each column to its desired type.
df_clean <- df %>% mutate(bully = as.factor(bully)) %>% mutate_if(is.integer, as.logical)
tweets <- df_clean$tweet %>% as.character()
head(tweets)
## [1] "USER terimakasih Ustadz sudah bersuara tentang Radikal radikal ini. Entah apa yang ada dalam pikiran rejim. Mesjid radikal...kampus radikal....dosen radikal....padahal tempat tersebut pijakan peradaban. Memangnya mau menghancurkan Indonesia ?"
## [2] "USER USER Maaf sebenarnya twiter pertama kali dbuat bukan buat orang bego'"
## [3] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
## [4] "Hadiri Lokakarya Kebudayaan Daerah, Bupati Rupinus Ajak Masyarakat Sekadau Rawat dan Manfaatkan Objek Budaya"
## [5] "USER USER USER yg kaya gini layak di tangkap."
## [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan / bagian dari itu sudah waktunya lengserkan Jokowi sebelum indonesia hancur"
check_text(tweets)
Even after all of the steps above, we still live by the principle of garbage in, garbage out. Next, we are going to dive into cleaning our text data by going through these steps:
First, we remove repeated patterns from the sentences. The data we collected from Algoritma masked Twitter mentions with “USER” and retweets with “RT”. These two tokens are repeated patterns that add no value to a tweet and, because of their huge quantity, may distort our modeling process, so we remove every occurrence of “RT” and “USER”.
tweets[3]
## [1] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
tweets[26]
## [1] "RT USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak? #ngentot #memek #kontol #sange #pagicrot'"
# Remove USER
tweets <- gsub("USER ", " ", tweets)
tweets[3]
## [1] " Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
# Remove RT
tweets <- gsub("RT ", " ", tweets)
tweets[26]
## [1] " USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak? #ngentot #memek #kontol #sange #pagicrot'"
The next thing we are going to do is remove HTML entities and URLs from our text.
tweets[436]
## [1] " USER: kalo di catatan harian menantu sinting, lakban = laki banget, sementara itu lakban menurut definisi df http '"
tweets <- tweets %>%
replace_html() %>% # remove html with blank
replace_url() # remove url with blank
tweets[436]
## [1] " USER: kalo di catatan harian menantu sinting, lakban = laki banget, sementara itu lakban menurut definisi df '"
The next thing we are going to do is remove any emojis and emoticons from our text data, replacing each with the word describing it, which is more useful to us.
tweets[28]
## [1] " Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan :)'"
tweets <- tweets %>%
replace_emoticon(.) %>%
replace_emoji(.)
tweets[28]
## [1] " Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan smiley '"
Indonesian contains many slang and colloquial words used in day-to-day life, and Twitter is no exception. Using the colloquial-indonesian-lexicon file that I gathered from a GitHub repository, we are going to convert the majority of slang words into their formal form. To do this we use replace_internet_slang().
# Import Indonesian Lexicon
spell.lex <- read.csv("data/colloquial-indonesian-lexicon.csv")
# Replace Internet Slang (demonstrated here on the first 20 tweets; the full data set is cleaned in the second pass below)
tweets <- replace_internet_slang(tweets[1:20], slang = paste0("\\b", spell.lex$slang, "\\b"),
                                 replacement = spell.lex$formal, ignore.case = TRUE)
saveRDS(tweets, file = "tweets-slang_clean2.RDS")
The next thing we are going to do is remove numbers and punctuation, and lowercase all the text so that our computer does not treat two identical words as different.
tweets <- readRDS("tweets-slang_clean2.RDS")
tweets[2]
## [1] " Maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego'"
library(tm)
Next we strip the text: strip() from textclean lowercases it and removes leftover punctuation, digits, and excess whitespace in one pass.
# Text Stripping
tweets <- strip(tweets)
tweets[2]
## [1] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego'"
After that we are going to stem each word, removing Indonesian prefixes (“awalan”) and suffixes (“akhiran”). We will do this using the katadasaR package.
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
tweets <- lapply(tokenize_words(tweets), stemming)
tweets[2]
## [[1]]
## [1] "maaf benar twiter pertama kali buat bukan buat orang bego"
Next we separate each sentence into individual words, a process known as tokenization. This has to be done because it makes the data readable by our computer. We also remove any stopwords present in the data, using stopwords_id_satya.txt as our stopword list.
stopwords <- readLines("data/stopwords_id_satya.txt")
## Warning in readLines("data/stopwords_id_satya.txt"): incomplete final line found
## on 'data/stopwords_id_satya.txt'
tweets <- tokenize_words(tweets, stopwords = stopwords)
tweets[2]
## [[1]]
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
To make a model we need our text data as character type; let's check the current type using class().
class(tweets)
## [1] "list"
tweets <- as.character(tweets)
library(wordcloud)
wordcloud(tweets)
This is just an example of how the mechanics of the topic modelling work. Next, we will generate our wordclouds using all of our data.
Repeating the steps above on the full data set, we will model the text in more depth: we will look at each kind of bullying and surface the most common words found for that bullying type.
# Set Cleaning Parameters
punctuation <- rx_punctuation()
number <- rx_digit()
exclamation <- rx() %>%
rx_find(value = "!") %>%
rx_one_or_more()
question <- rx() %>%
rx_find(value = "?") %>%
rx_one_or_more()
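Printing these objects shows the regular expressions that RVerbalExpressions builds for us; the strings in the comments below are approximate, as the exact output depends on the package version.
# Inspect the generated regex patterns (values shown are approximate)
punctuation # e.g. "[[:punct:]]" - any punctuation character
number # e.g. "\\d" - any digit
question # e.g. "(\\?)+" - one or more literal question marks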
# Text Cleaning
tweets <- df_clean %>%
mutate(
text_clean = tweet %>%
replace_tag() %>%
replace_hash() %>%
replace_date(replacement = "") %>%
replace_email() %>%
replace_html(symbol = FALSE) %>%
replace_url(replacement = "") %>%
replace_emoticon(.) %>%
replace_emoji(.) %>%
replace_number(remove = TRUE) %>%
replace_internet_slang(
slang = paste0("\\b", spell.lex$slang, "\\b"),
replacement = spell.lex$formal, ignore.case = TRUE
) %>%
str_replace_all(pattern = question, replacement = "") %>%
str_replace_all(pattern = exclamation, replacement = "") %>%
str_remove_all(pattern = punctuation) %>%
str_remove_all(pattern = number) %>%
str_to_lower() %>%
str_squish()
)
saveRDS(tweets, file = "tweets-slang_clean.RDS")
# Cleaning Repeated Unused Words
tweets_clean <- readRDS("tweets-slang_clean.RDS")
tweets_clean <- tweets_clean %>% mutate(
text_clean = gsub("user ", " ", tweets_clean$text_clean))
tweets_clean <- tweets_clean %>% mutate(
text_clean = gsub("rt ", " ", tweets_clean$text_clean))
tweets_clean$text_clean[2]
## [1] " maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"
# Stemming Function
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
# Tokenize Text
tweets_clean$text_clean <- lapply(tokenize_words(tweets_clean$text_clean), stemming)
tweets_clean$text_clean[1]
# Remove Stopwords and Tokenize
stopwords <- readLines("data/stopwords_id_satya.txt")
tweets_clean$text_clean <- tokenize_words(tweets_clean$text_clean, stopwords = stopwords)
saveRDS(tweets_clean, file = "tweets_tokenize")
tweets_clean <- readRDS("tweets_tokenize")
tweets_clean$text_clean[2]
## [[1]]
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
From the cleaning process we have done, we can now differentiate the kinds of bullying. First, we will do topic modelling on the overall bully tweets.
# Filter Bully Tweets
tweet_bully <- tweets_clean %>% filter(bully == "yes")
# Change to Corpus Format
bully_corpus <- tweet_bully$text_clean %>% VectorSource() %>% VCorpus()
bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
bully_dtm <- DocumentTermMatrix(bully_corpus)
inspect(bully_dtm)
## <<DocumentTermMatrix (documents: 4421, terms: 8176)>>
## Non-/sparse entries: 39221/36106875
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ahok cebong indonesia islam jadi jokowi kalo orang sih url
## 1019 0 0 0 0 0 0 0 1 0 0
## 1127 0 0 0 0 0 0 0 0 0 0
## 1405 0 0 0 0 0 0 0 0 0 0
## 1544 0 0 0 0 0 0 0 0 1 0
## 2314 0 0 0 0 0 0 0 0 2 0
## 2971 0 0 0 0 0 1 0 0 0 0
## 331 0 0 0 0 0 0 0 0 0 0
## 3667 0 0 0 1 0 0 0 0 0 0
## 3754 0 0 0 3 0 1 0 0 0 0
## 752 0 0 0 0 1 0 0 0 0 0
wordcloud(bully_corpus)
From the plot above we can see the most-used words when people tweet something with the intention of bullying another entity.
individual_bully <- tweet_bully %>% filter(individual == TRUE)
individual_bully_corpus <- individual_bully$text_clean %>% VectorSource() %>% VCorpus()
individual_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(individual_bully_corpus)
The next category is individual bullying. It has a wide spread of different words, with the more popular words shown in larger text.
group_bully <- tweet_bully %>% filter(group == TRUE)
group_bully_corpus <- group_bully$text_clean %>% VectorSource() %>% VCorpus()
group_bully_corpus[[1]]$content
## [1] "smartfren" "jaring" "nya" "brengsek" "ya"
wordcloud(group_bully_corpus)
Next are cyberbullying tweets aimed at groups of people; as we can see, "indonesia" appears often among the words correlated with these bully tweets.
gender_bully <- tweet_bully %>% filter(gender == TRUE)
gender_bully_corpus <- gender_bully$text_clean %>% VectorSource() %>% VCorpus()
gender_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(gender_bully_corpus)
The next category is gender-related cyberbullying. As we can see, many of the prominent words above are gender-specific bullying terms.
physical_bully <- tweet_bully %>% filter(physical == TRUE)
physical_bully_corpus <- physical_bully$text_clean %>% VectorSource() %>% VCorpus()
physical_bully_corpus[[1]]$content
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
wordcloud(physical_bully_corpus)
These are the most-used words in physical cyberbullying tweets.
race_bully <- tweet_bully %>% filter(race == TRUE)
race_bully_corpus <- race_bully$text_clean %>% VectorSource() %>% VCorpus()
race_bully_corpus[[1]]$content
## [1] "partai" "koalisi" "dukung" "pemerintah" "diketahui"
## [6] "jalin" "kerjasama" "politik" "partai" "komunis"
## [11] "china"
wordcloud(race_bully_corpus)
These are the most-used words in race-related cyberbullying tweets.
religion_bully <- tweet_bully %>% filter(religion == TRUE)
religion_bully_corpus <- religion_bully$text_clean %>% VectorSource() %>% VCorpus()
religion_bully_corpus[[1]]$content
## [1] "kapir" "asli" "menang" "busa" "mulut"
## [6] "alam" "nyata" "jadi" "pecundangoh" "pemuja"
## [11] "wowo"
wordcloud(religion_bully_corpus)
These are the most-used words in religion-related cyberbullying tweets.
Now we will dig deeper on the machine learning side. In this section we feed the tweet data we cleaned above into machine learning algorithms that can classify whether a new tweet is bullying or not.
The first classifier we are going to use is the Naive Bayes model. The data needed here are the text from the tweet column and the labels from the bully column. We then convert the text data into a corpus and from there into a document-term matrix, which will be used for training.
A corpus is a collection of documents. In this case, one document is equivalent to one tweet observation, and one tweet can contain one or more sentences. One of the packages we can use for text mining is tm. Converting from a text vector to a corpus can be done with the VCorpus() function.
tweets_corpus <- tweets_clean$text_clean %>% VectorSource() %>% VCorpus()
tweets_corpus[[2]]$content
## [1] "maaf" "benar" "twiter" "pertama" "kali" "orang" "bego"
We need to transform the text data into Document-Term Matrix (DTM) through the tokenization process. Tokenization is the process of breaking a sentence into several terms (can be 1 word, word pair, etc.). In DTM, one word will be one predictor with a value in the form of the frequency of occurrence of the word in a document.
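To illustrate, here is a tiny DTM built from two made-up example sentences (not taken from our data); the terms become columns and the documents become rows:
toy_corpus <- VCorpus(VectorSource(c("kamu bego bego", "kamu pintar")))
as.matrix(DocumentTermMatrix(toy_corpus))
##     Terms
## Docs bego kamu pintar
##    1    2    1      0
##    2    0    1      1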
tweets_dtm <- DocumentTermMatrix(tweets_corpus)
inspect(tweets_dtm)
## <<DocumentTermMatrix (documents: 10535, terms: 17333)>>
## Non-/sparse entries: 101288/182501867
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue islam jadi jokowi orang presiden sih untuk url
## 10021 0 0 0 0 0 1 0 0 0 0
## 1286 0 0 0 0 0 0 0 0 0 0
## 211 0 0 0 0 0 0 0 0 0 0
## 2803 0 0 0 0 0 0 0 1 0 0
## 3370 0 0 0 0 0 0 0 0 0 0
## 5958 0 0 0 0 0 0 0 0 0 0
## 6461 0 0 0 0 0 1 0 1 0 0
## 764 0 0 0 0 0 0 0 0 0 0
## 8701 0 0 2 0 0 0 0 0 0 0
## 9128 0 2 0 1 0 0 0 0 0 0
We are going to split our data into training and validation sets with an 80:20 ratio.
RNGkind(sample.kind = "Rounding")
set.seed(305)
# Split Ratio 80:20
index <- sample(nrow(tweets_dtm), nrow(tweets_dtm)*0.8)
# Data Splitting
tweets_train <- tweets_dtm[index,]
tweets_val <- tweets_dtm[-index,]
# Class labels of our dataset
label_train <- tweets_clean[index, 'bully']
label_val <- tweets_clean[-index, 'bully']
# Class proportions in our training and validation sets
prop.table(table(label_train))
## label_train
## no yes
## 0.5759374 0.4240626
prop.table(table(label_val))
## label_val
## no yes
## 0.5980066 0.4019934
# Check Dimension
dim(tweets_train)
## [1] 8428 17333
length(label_train)
## [1] 8428
Because the number of predictors is very high, 17,333 terms, let's reduce the noise in our data by keeping only words that occur fairly often, for example at least 20 times across all tweets. Use the findFreqTerms() function.
# Minimum frequency of appearance in documents
tweets_freq <- findFreqTerms(tweets_train, lowfreq = 20)
# Number of Unique words
length(unique(tweets_freq))
## [1] 817
tweets_freq %>% tail()
## [1] "wkwk" "wkwkwk" "yahudi" "yakin" "yuk" "zaman"
To keep only the words that appear in tweets_freq, let's subset the tweets_train data with this command:
tweets_train_freqreduced <- tweets_train[,tweets_freq]
inspect(tweets_train_freqreduced)
## <<DocumentTermMatrix (documents: 8428, terms: 817)>>
## Non-/sparse entries: 50021/6835655
## Sparsity : 99%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue islam jadi jokowi orang presiden sih untuk url
## 2437 0 2 0 0 0 1 0 1 0 0
## 2461 0 0 0 0 0 0 0 0 0 0
## 2571 0 0 0 0 0 0 0 0 0 1
## 4363 1 0 3 0 0 0 0 0 0 0
## 4391 1 0 0 0 0 0 0 0 0 0
## 525 2 1 2 0 0 0 0 0 0 0
## 5384 0 0 0 4 0 0 0 0 0 0
## 6242 0 0 1 0 0 3 0 0 0 0
## 8958 0 0 3 0 1 0 1 0 2 0
## 9336 0 0 0 1 3 0 1 0 1 0
The Naive Bayes model is really good at handling binary categorical predictors, but the values in our tweets_train matrix are still frequencies. For the probability calculations, each frequency will be converted to whether the word appears (1) or not (0). One way to do this is with a Bernoulli converter.
bernoulli_conv <- function(x){
# ifelse parameters: condition, value if TRUE, value if FALSE
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
After making our Bernoulli converter function, we apply it to the training and validation data sets.
tweets_train_bn <- apply(tweets_train_freqreduced, 2, FUN = bernoulli_conv)
tweets_val_bn <- apply(tweets_val, 2, FUN = bernoulli_conv)
tweets_train_bn[100:110, 35:40]
## Terms
## Docs angkat anies aniessandi anjing anjir antek
## 9569 "0" "0" "0" "0" "0" "0"
## 7730 "0" "0" "0" "0" "0" "0"
## 7128 "0" "0" "0" "0" "0" "0"
## 7218 "0" "0" "0" "0" "0" "0"
## 8561 "0" "0" "0" "0" "0" "0"
## 1557 "0" "0" "0" "0" "0" "0"
## 7452 "0" "0" "0" "0" "0" "0"
## 1218 "0" "0" "0" "0" "0" "0"
## 5567 "0" "0" "0" "0" "0" "0"
## 7854 "0" "0" "0" "0" "0" "0"
## 2594 "0" "0" "0" "0" "0" "0"
nb_bully <- naiveBayes(tweets_train_bn, y = label_train, laplace = 1)
tweets_nb_pred <- predict(nb_bully, tweets_val_bn)
We will evaluate the predictions using a confusion matrix and its associated metrics.
confusionMatrix(data = tweets_nb_pred, reference = label_val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1023 163
## yes 237 684
##
## Accuracy : 0.8102
## 95% CI : (0.7927, 0.8267)
## No Information Rate : 0.598
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6107
##
## Mcnemar's Test P-Value : 0.0002622
##
## Sensitivity : 0.8119
## Specificity : 0.8076
## Pos Pred Value : 0.8626
## Neg Pred Value : 0.7427
## Prevalence : 0.5980
## Detection Rate : 0.4855
## Detection Prevalence : 0.5629
## Balanced Accuracy : 0.8097
##
## 'Positive' Class : no
##
From the output above we would like to extract these important figures:
Reference
Prediction   no  yes
      no   1023  163
      yes   237  684
Accuracy : 0.8102
Sensitivity : 0.8119
Specificity : 0.8076
Pos Pred Value : 0.8626
To make a better model we would like to optimize this result to better suit our judgement. My personal judgement is that we should flag more tweets as bullying, even at the cost of labelling some non-bullying tweets as bullies; to do that we have to increase the precision, the Pos Pred Value, of our confusion matrix.
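One way to shift that trade-off, sketched below rather than taken from the pipeline above, is to ask naiveBayes for the raw posterior probabilities and lower the cut-off for the "yes" class; the 0.35 threshold is an arbitrary illustrative value that would need tuning.
# Posterior probabilities instead of hard class predictions
nb_prob <- predict(nb_bully, tweets_val_bn, type = "raw")
# Flag "yes" more aggressively than the default 0.5 cut-off
tweets_nb_pred_tuned <- factor(ifelse(nb_prob[, "yes"] > 0.35, "yes", "no"),
                               levels = levels(label_val))
confusionMatrix(data = tweets_nb_pred_tuned, reference = label_val)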
Next we build a classification model using a neural network approach. As before, we need clean data; we already saved our cleaned data in RDS form.
After cleaning the text data, we prepare it so that a neural network model can be applied. First we make our label a factor to distinguish the two outcomes, then convert it to numeric 0/1 values: 0 for not bully and 1 for bully.
tweets_clean2 <- tweets_clean2 %>%
mutate(bully = factor(bully, levels = c("no", "yes")),
bully = as.numeric(bully),
bully = bully - 1) %>%
select(text_clean, bully)
As we stated before, the tokenizer separates each word in the entire document collection into token form. The num_words parameter sets the maximum number of words to use, ordered by descending frequency; words that rarely appear are removed. From a total of 23158 unique words in the text data, we reduce it to 1024 words to build the model. The lower parameter is a logical flag; if TRUE, all words are transformed to lowercase (tolower).
num_words <- 1024
tokenizer <- text_tokenizer(num_words = num_words, lower = T) %>%
fit_text_tokenizer(tweets_clean2$text_clean)
paste("number of unique words is ", length(tokenizer$word_counts))## [1] "number of unique words is 23158"
docs <- c("no", "yes")
tokendocs <- text_tokenizer(num_words = 2, lower = TRUE) %>% fit_text_tokenizer(docs)
tokendocs$word_index[1:2]
## $no
## [1] 1
##
## $yes
## [1] 2
The data will be split into three parts: train, validation, and test. The test data already lives in a separate file, test.csv, and the validation set is obtained by splitting train.csv 80:20, 80 being the training data and 20 the validation data.
The training data is what we use to fit the model, the validation data is for evaluating hyperparameter tuning (adjusting hidden layers, optimizers, learning rates, etc.), and the test data evaluates the model we make on unseen data.
set.seed(305)
intrain <- initial_split(data = tweets_clean2, prop = 0.8, strata = "bully")
data_train <- training(intrain)
data_val <- testing(intrain)
data_test <- read.csv(file = "data/test.csv")
maxlen <- max(str_count(tweets_clean2$text_clean, "\\w+")) + 1
paste("maxiumum length words in data:", maxlen)## [1] "maxiumum length words in data: 332"
data_train_x <- texts_to_sequences(tokenizer, data_train$text_clean) %>%
pad_sequences(maxlen = maxlen)
data_val_x <- texts_to_sequences(tokenizer, data_val$text_clean) %>% pad_sequences(maxlen = maxlen)
data_test_x <- texts_to_sequences(tokenizer, data_test$tweet) %>%
pad_sequences(maxlen = maxlen)
# prepare y
data_train_y <- to_categorical(data_train$bully, num_classes = 2)
data_val_y <- to_categorical(data_val$bully, num_classes = 2)
data_test_y <- to_categorical(data_test$bully, num_classes = 2)
Embedding layers can only be used as the initial / first layer of the LSTM architecture. In deep learning frameworks such as Keras, the embedding layer trains the text data into numerical vectors that represent the closeness in meaning of each word.
Embedding layer accepts several parameters. Some examples are:
input_dim, the maximum size of the vocabulary, as explained for the num_words parameter above.
input_length, the maximum length of the word sequence in the document input.
output_dim, the embedding dimension of the output layer, which will be passed to the next layer. It is generally 32, but can be larger depending on the problem we face.
The input is a 2D tensor of shape {batch_size, sequence_length}, while the output is a 3D tensor of shape {batch_size, sequence_length, output_dim}, as the small sketch below shows.
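As a quick sanity check, a throw-away model containing only the embedding layer (a sketch reusing this project's num_words and maxlen) makes the shape transformation visible:
# 2D input {batch_size, sequence_length} -> 3D output {batch_size, sequence_length, output_dim}
toy_embed <- keras_model_sequential() %>%
  layer_embedding(input_dim = num_words, input_length = maxlen, output_dim = 32)
toy_embed$output_shape # (NULL, 332, 32)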
The deep network layer accepts the embedding matrix as input and compresses it into smaller dimensions that still represent the information in the data. For text data, the commonly used deep learning architectures are RNNs > LSTM / GRU.
You can check the Keras documentation for details on the sequential layers.
The output layer is the last layer in the deep learning architecture. In Keras we use the layer_dense command, where we set the units parameter, i.e. how many neurons we want to build. In this case I use 2 units, because we have 2 classes (no, yes).
Training a neural network / deep learning model often gives different results on each run. Why? Because NN and DL weights are initialized randomly (randomness initialization). Therefore we need to fix the random numbers in order to get a stable result when training is repeated (a reproducible result). This can be done with the seed parameter of the initializer_random_uniform command. For more details, read the question-and-answer article in the Keras RStudio documentation.
model_nn1 <- keras_model_sequential(name = "LDA_model") %>%
# layer input
layer_embedding(
name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 256,
embeddings_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
) %>%
# layer dropout
layer_dropout(
name = "embedding_dropout",
rate = 0.5
) %>%
# layer lstm 1
layer_lstm(
name = "lstm",
units = 256,
dropout = 0.2,
recurrent_dropout = 0.2,
return_sequences = FALSE,
recurrent_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2),
kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
) %>%
# layer output
layer_dense(
name = "output",
units = 2,
activation = "softmax",
kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
)
model_nn1 %>% compile(
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy",
loss = "categorical_crossentropy"
)
history_tune_1 <- model_nn1 %>%
  fit(x = data_train_x, # predictors
      y = data_train_y, # target
      batch_size = 512, # weights are updated every 512 samples
      epochs = 10,
      validation_data = list(data_val_x, data_val_y), # evaluate on the validation data
      verbose = 1
  )
plot_nn1 <- plot(history_tune_1)
saveRDS(model_nn1, file = "model_nn1")
saveRDS(plot_nn1, file = "plot_nn1")
readRDS("plot_nn1")
This is the result on our training and validation data.
# predict on train
data_train_pred <- model_nn1 %>%
predict(data_train_x) %>%
k_argmax() %>% as.array()
# predict on val
data_val_pred <- model_nn1 %>%
predict(data_val_x) %>%
k_argmax() %>% as.array()
# predict on test
data_test_pred <- model_nn1 %>%
predict(data_test_x) %>%
k_argmax() %>% as.array()
saveRDS(data_train_pred, "data_output/data_train_pred")
saveRDS(data_val_pred, "data_output/data_val_pred")
saveRDS(data_test_pred, "data_output/data_test_pred")
data_train_pred <- readRDS("data_output/data_train_pred")
data_val_pred <- readRDS("data_output/data_val_pred")
data_test_pred <- readRDS("data_output/data_test_pred")
test_pred_yn <- data_test_pred %>% str_replace_all(c("0" = "no", "1" = "yes"))
submission <- data_test %>% mutate(bully = test_pred_yn)
# save data
write.csv(submission, "submission-davel-NN.csv", row.names = F)
These are the three accuracies of my predictions on the different tweet data sets. The test-data score was already submitted to the shinyapps leaderboard, as shown in the image below.
# Accuracy on Train Data
accuracy_vec(
truth = factor(data_train$bully,labels = c("no", "yes")),
estimate = factor(data_train_pred, labels = c("no", "yes"))
)
## [1] 0.8827578
# Accuracy on Validation Data
accuracy_vec(
truth = factor(data_val$bully,labels = c("no", "yes")),
estimate = factor(data_val_pred, labels = c("no", "yes"))
)
## [1] 0.8183112
confusionMatrix(data = as.factor(data_val_pred), reference = as.factor(data_val$bully))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1036 196
## 1 187 689
##
## Accuracy : 0.8183
## 95% CI : (0.8012, 0.8346)
## No Information Rate : 0.5802
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6265
##
## Mcnemar's Test P-Value : 0.6827
##
## Sensitivity : 0.8471
## Specificity : 0.7785
## Pos Pred Value : 0.8409
## Neg Pred Value : 0.7865
## Prevalence : 0.5802
## Detection Rate : 0.4915
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.8128
##
## 'Positive' Class : 0
##
And this is the accuracy on the test data.
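Locally, the test accuracy could be computed the same way as the other two; this is only a sketch, since it assumes data_test carries a labelled bully column, whereas the score here came from the leaderboard.
# Accuracy on Test Data (sketch; assumes test labels are available)
accuracy_vec(
  truth = factor(data_test$bully, labels = c("no", "yes")),
  estimate = factor(data_test_pred, labels = c("no", "yes"))
)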
From the machine learning processes that have been carried out, several goals have been achieved, such as:
The resulting predictions do not indicate overfitting; the accuracies of the models are not far apart.
Data validation:
Data Test:
I think this problem can be solved by machine learning; with a better understanding of the purpose of the embedding, deep network, and output layers, we could classify bully tweets with even more accuracy. And of course more processing power would help in a huge way.
The load of the neural network training took a toll on my device: the process could run up to a day unattended and disrupted everything else on the machine. Perhaps for this specific capstone we could have more time to finish the task. Thank you.