library(textclean)
library(tokenizers)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(devtools)
## Loading required package: usethis
library(katadasaR)
library(tm)
## Loading required package: NLP
library(stringr)
library(e1071)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: lattice
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(textclean)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.7 ✔ purrr 0.3.4
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ purrr::set_names() masks magrittr::set_names()
library(tidytext)
library(rsample)
##
## Attaching package: 'rsample'
##
## The following object is masked from 'package:e1071':
##
## permutations
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
##
## Attaching package: 'yardstick'
##
## The following object is masked from 'package:readr':
##
## spec
##
## The following object is masked from 'package:keras':
##
## get_weights
##
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
library(SnowballC)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(ROCR)
library(partykit)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(tinytex)
df <- read.csv("data/train.csv")
slang <- read.csv("data/colloquial-indonesian-lexicon.csv")
head(df)
The dataset contains 10,535 tweets and 8 columns with the following information:
- bully: Classification of the tweet as Yes (Bully) or No (Not Bully)
- tweet: Content of the tweet
- individual: Whether the tweet is cyberbullying targeted toward a certain individual (0 = no, 1 = yes)
- group: Whether the tweet is cyberbullying targeted toward a certain group of people (0 = no, 1 = yes)
- gender: Whether the tweet is cyberbullying based on gender, or curses someone using words that are degrading to a gender (0 = no, 1 = yes)
- physical: Whether the tweet is cyberbullying based on physical deficiencies/differences or disability (0 = no, 1 = yes)
- race: Whether the tweet is cyberbullying based on a human race or ethnicity (0 = no, 1 = yes)
- religion: Whether the tweet is cyberbullying based on a religion, religious organization, or a particular creed (0 = no, 1 = yes)
colSums(is.na(df))
## bully tweet individual group gender physical race
## 0 0 0 0 0 0 0
## religion
## 0
df_duplicated <- df[duplicated(df$tweet),]
df_clean <- df %>%
as.data.frame() %>%
distinct(tweet, .keep_all = T)
nrow(df_duplicated)
## [1] 98
From the above checks, we know that there are no missing values in the dataset. However, there are 98 duplicated tweets (by the tweet column), as people tend to copy and paste news on Twitter.
We use the distinct() function to drop the duplicated tweets and save the result as df_clean, leaving 10,437 rows.
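As a quick optional sanity check, the de-duplicated row count should be 10,535 - 98 = 10,437:
nrow(df_clean)
## [1] 10437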
Next, we will transform the bully, individual, group, gender, physical, race, and religion columns with the as.factor() function.
df_clean <- df_clean %>%
mutate(bully = as.factor(bully),
individual = as.factor(individual),
group = as.factor(group),
gender = as.factor(gender),
physical = as.factor(physical),
race = as.factor(race),
religion = as.factor(religion))
Category with the most abusive and bullying tweets
df_clean_bully <- df_clean %>% filter(bully == "yes")
df_clean_bully %>%
summary()
## bully tweet individual group gender physical race
## no : 0 Length:4380 0:1562 0:2818 0:4137 0:4128 0:3951
## yes:4380 Class :character 1:2818 1:1562 1: 243 1: 252 1: 429
## Mode :character
## religion
## 0:3761
## 1: 619
##
By subsetting the data to show only tweets classified as "Bully", we can use summary() to generate the frequency of each category. In this case, 1 = Yes (bully) and 0 = No (not bully).
Based on the summary, it is fair to say that the cyberbullying tweets mainly attack a certain "individual" or "group", with 2,818 and 1,562 tweets respectively. On the contrary, "physical" and "gender" are the two categories with the least cyberbullying.
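To read these counts as shares of all 4,380 bully tweets, a small optional check with prop.table() (expected values derived from the summary above):
prop.table(table(df_clean_bully$individual))
## expected: level 0 ~ 0.357 (1562/4380), level 1 ~ 0.643 (2818/4380)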
What text or token can represent each cyberbullying category? First, let's pull the tweets from our df_clean dataset, using head(30) to get a better look at the words. After removing some frequently repeated words that add no valuable insight, we will subset on the label bully = yes to see which words might represent cyberbullying.
df_clean %>%
head(30) %>%
pull(tweet)
## [1] "USER terimakasih Ustadz sudah bersuara tentang Radikal radikal ini. Entah apa yang ada dalam pikiran rejim. Mesjid radikal...kampus radikal....dosen radikal....padahal tempat tersebut pijakan peradaban. Memangnya mau menghancurkan Indonesia ?"
## [2] "USER USER Maaf sebenarnya twiter pertama kali dbuat bukan buat orang bego'"
## [3] "USER Anjing tai goblok idiot bangsat monyet babi fucc, kont, ngents, goblok. Iya tau kasar bgt maaf'"
## [4] "Hadiri Lokakarya Kebudayaan Daerah, Bupati Rupinus Ajak Masyarakat Sekadau Rawat dan Manfaatkan Objek� Budaya"
## [5] "USER USER USER yg kaya gini layak di tangkap."
## [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan / bagian dari itu sudah waktunya lengserkan Jokowi sebelum indonesia hancur"
## [7] "USER Wonu oppa kenapa matanya sipit banget? '^'\""
## [8] "Smartfren jaringan nya kok brengsek ya ...'"
## [9] "USER USER Pret.. kampret , Tak dukung 2019 ganti presiden.. tp.presidenmu sapa ??? Rocky gerung kapir thaa ???'"
## [10] "USER Ahelah sombong bener punuk onta'"
## [11] "USER jancuk kw zonk!!!!'"
## [12] "Bawaslu: Kolaka Tertinggi Pelanggaran ASN Dalam Pilkada; ; #AkuSERUJI : Pilkada | ;"
## [13] "untung2an dan jangan sampai kami ngaku Komunis, ngaku Atheis,...'"
## [14] "USER USER USER USER USER si cebong sudah kebanyakan makan tahi jokowi, dia baru sadar kl nanti jokowi sudah mengumumkan INDONESIA dirubah menjadi INDOCHINA,,,'"
## [15] "RT USER: Butuh cowo kontol gede yg bs angetin dimusim ujan, horny ahh hawanya bikin pngen ngentot terus ya say ! Retweet yg nger '"
## [16] "Apalagi mereka partai koalisi pendukung pemerintah yg juga diketahui sudah menjalin kerjasama politik dengan partai komunis china"
## [17] "USER terus bisa apa kalo iya....rejim kunyuk ini siapa yg berani lawan...??'"
## [18] "- lipstick di sapu tangan Bora. Dan diriku dalam hati, \"Mampus malu luh. Makan tuh cipokan.\"'"
## [19] "Yg mau lengserkan Jok pasti lawan politiknya spt SBY wowo HT TS PKS.biasa bayarin demo-demo..FPI.jgn takut pak Jok.Rakyat TNI POl bela JokAhok"
## [20] "USER Iya sih kek komunis banget wkwkwk'"
## [21] "Pengusaha Media Sosial Harus Dipastikan Kenetralannya dalam Pilkada dan Pilpres"
## [22] "USER Wah kesempatan neh langsung deh dimanfaatin buat pencitraaan oleh rezim.. '"
## [23] "Yang pasti bukan rakyat cina ! Bukan rakyat Indonesia anak keturunan PKI ! Paham!! URL"
## [24] "RT USER: Mash kentut bau babi'"
## [25] "Sukses USER ; 1. Sukses Prestasi ; 2. Sukses Administrasi ; 3. Sukses Pelaksanaan ; 4. Sukses Ekonomi ; 5. Sukses Menjadi warisan dan kebanggaan bagi generasi selanjutnya.; Cc. USER USER USER ; #AsianGames2018"
## [26] "RT USER: Kalau aku cantik tapi memek ku coklat kamu masih doyan gak? #ngentot #memek #kontol #sange #pagicrot'"
## [27] "Pak de USER akan lebih elegan klo sampean undang para penulis ke istana daripada ngundang busar buser kae..suwon"
## [28] "USER USER Rak harusnya lo tau kalau temen temen lo itu pinter, ga kayak lo bloon. Jadi jangan bego begoin kita, ga mempan :)'"
## [29] "Mau di-Ahok-kan ya?"
## [30] "USER USER USER mabuk pil PCC"
Let's remove the words "USER" and "RT", along with punctuation, which show up a lot but hold no valuable meaning.
df_clean$tweet <- gsub("USER", " ", df_clean$tweet)
df_clean$tweet <- gsub("RT", " ", df_clean$tweet)
df_clean$tweet <- gsub("[[:punct:] ]+", " ", df_clean$tweet)
head(df_clean)
Afterwards, we do some cleansing to remove elements that add no value to our data, i.e. dates, emojis, emails, emoticons, HTML, slang, URLs, and tags/RT (@ and retweet). As the tweets are not written by institutions/organisations, we use replace_internet_slang() to replace all slang/abbreviations using the "Colloquial Indonesian Lexicon" from GitHub. Additionally, as text classification is case-sensitive, we lowercase all the tweets.
df_clean$tweet <- df_clean$tweet %>%
replace_tag() %>%
replace_date(replacement = " ") %>%
replace_email() %>%
replace_emoji(.) %>%
replace_emoticon(.) %>%
replace_url() %>%
replace_html(.) %>%
str_to_lower() %>%
strip()
head(df_clean)
df_clean$tweet[1:10437] <- replace_internet_slang(df_clean$tweet[1:10437], slang = paste0("\\b", slang$slang, "\\b"), replacement = slang$formal, ignore.case = TRUE)
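To illustrate what replace_internet_slang() does here, a toy example with a hypothetical two-entry lexicon (the real call above uses the full colloquial-indonesian-lexicon.csv):
replace_internet_slang("yg penting jgn telat",
                       slang = c("\\byg\\b", "\\bjgn\\b"),
                       replacement = c("yang", "jangan"),
                       ignore.case = TRUE)
## expected: [1] "yang penting jangan telat"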
df_clean_ris <- data.frame(df_clean)
#saveRDS(df_clean_ris, "df_clean_ris.RDS")
df_clean_ris <- readRDS("df_clean_ris.RDS")
df_clean_ris %>%
head(30) %>%
pull(tweet)
## [1] "terimakasih ustadz sudah bersuara tentang radikal radikal ini entah apa yang ada dalam pikiran rejim mesjid radikal kampus radikal dosen radikal padahal tempat tersebut pijakan peradaban memangnya mau menghancurkan indonesia"
## [2] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"
## [3] "anjing tahi goblok idiot bangsat monyet babi fucc kont ngents goblok iya tau kasar banget maaf"
## [4] "hadiri lokakarya kebudayaan daerah bupati rupinus ajak masyarakat sekadau rawat dan manfaatkan objek budaya"
## [5] "yang kayak begini layak di tangkap"
## [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan bagian dari itu sudah waktunya lengserkan jokowi sebelum indonesia hancur"
## [7] "wonu oppa kenapa matanya sipit banget"
## [8] "smartfren jaringan nya kok brengsek ya"
## [9] "pret kampret tak dukung ganti presiden tapi presidenmu sapa rocky gerung kapir thaa"
## [10] "ahelah sombong benar punuk onta"
## [11] "jancuk kau zonk"
## [12] "bawaslu kolaka tertinggi pelanggaran asn dalam pilkada akuseruji pilkada"
## [13] "untung an dan jangan sampai kami mengaku komunis mengaku atheis"
## [14] "sih cebong sudah kebanyakan makan tahi jokowi dia baru sadar kalau nanti jokowi sudah mengumumkan indonesia dirubah menjadi indochina"
## [15] "butuh cowok kontol gede yang bisa angetin dimusim ujan horny ah hawanya bikin pengin ngentot terus ya sayang retweet yang nger"
## [16] "apalagi mereka partai koalisi pendukung pemerintah yang juga diketahui sudah menjalin kerjasama politik dengan partai komunis china"
## [17] "terus bisa apa kalo iya rejim kunyuk ini siapa yang berani lawan"
## [18] "lipstick di sapu tangan bora dan diriku dalam hati mampus malu lu makan tuh cipokan"
## [19] "yang mau lengserkan jok pasti lawan politiknya seperti surabaya wowo hati ts pks biasa bayarin demo demo fpi jangan takut pak jok rakyat tni pol bela jokahok"
## [20] "iya sih kayak komunis banget wkwkwk"
## [21] "pengusaha media sosial harus dipastikan kenetralannya dalam pilkada dan pilpres"
## [22] "wah kesempatan nih langsung deh dimanfaatin buat pencitraaan oleh rezim"
## [23] "yang pasti bukan rakyat cina bukan rakyat indonesia anak keturunan pki paham url"
## [24] "mash kentut bau babi"
## [25] "sukses sukses prestasi sukses administrasi sukses pelaksanaan sukses ekonomi sukses menjadi warisan dan kebanggaan bagi generasi selanjutnya cc asiangames"
## [26] "kalau aku cantik tapi memek ku coklat kamu masih doyan enggak ngentot memek kontol sange pagicrot"
## [27] "pak dek akan lebih elegan kalo sampean undang para penulis ke istana daripada ngundang busar buser kae suwon"
## [28] "rak harusnya lo tau kalau teman teman lo itu pintar enggak kayak lo bloon jadi jangan bego begoin kita enggak mempan"
## [29] "mau di ahok kan ya"
## [30] "mabuk pil pcc"
We can see that after cleansing there are no more emojis (previously #28), punctuation, or hash symbols (previously #26). I specifically did not remove the content of the hashtags themselves (only the # symbol), as hashtags are often useful for categorizing a topic that can be aggregated into a thread. Additionally, all the abbreviations are now gone, e.g. #19 from "yg" to "yang", and #11 from "kw" to "kau".
Next, let's assign our df_clean data to a new dataset. I personally like to do this so I don't have to re-run all the chunks from the beginning in case I mess up my dataset, especially since some chunks take a long time to run (replace_internet_slang()). From now on we will work with df_clean_2 and treat df_clean as our master file.
df_clean_2 <- data.frame(df_clean_ris)
tracemem(df_clean_ris) == tracemem(df_clean_2)
## [1] FALSE
#saveRDS(df_clean_2, file = "df_clean_2.RDS")
df_clean_2 <- readRDS("df_clean_2.RDS")
Now we will start with stemming, stopword removal, tokenizing, and creating a wordcloud. Stemming transforms every word into its root form, e.g. "memakan" -> "makan". We will use the katadasaR library to do this. Afterwards, we save the result into df_clean_3 with saveRDS(), so that in case RStudio crashes we can work directly from df_clean_3.RDS.
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
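A quick sanity check of the stemmer on a single word (assuming katadasaR is installed, e.g. from GitHub via devtools::install_github("nurandi/katadasaR")):
katadasar("memakan")
## expected: [1] "makan"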
#df_clean_2$tweet[1:10437] <- lapply(tokenize_words(df_clean_2$tweet[1:10437]), stemming)
#df_clean_3 <- data.frame(df_clean_2)
#saveRDS(df_clean_3, file = "df_clean_3.RDS")
df_clean_3 <- readRDS("df_clean_3.RDS")
head(df_clean_3)
We can see that all of the words are now transformed into their base form. Let's start the tokenization process. This process breaks our sentences into individual words so that they can be counted for the wordcloud later, e.g. line #5 "yang kayak begini layak di tangkap" will be broken into "yang", "kayak", "begini", "layak", "di", "tangkap". Additionally, we will remove stopwords as the final step of our data cleansing. Stopwords are common words that give a sentence its grammatical structure but carry no crucial meaning for this project; they are usually conjunctions, e.g. "dan", "tapi", "untuk".
stopwords <- readLines("data/stopwords-id.txt")
## Warning in readLines("data/stopwords-id.txt"): incomplete final line found on
## 'data/stopwords-id.txt'
df_clean_3$tweet <- df_clean_3$tweet %>%
replace_html(symbol = FALSE) %>%
replace_url(replacement = "")
df_clean_3$tweet <- gsub("url", " ", df_clean_3$tweet)
df_clean_3$tweet <- tokenize_words(df_clean_3$tweet, stopwords = stopwords)
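As a toy illustration of tokenize_words() with a stopword list, using line #5 from earlier and a hypothetical two-word stopword vector (the real call uses the full stopwords-id.txt list):
tokenize_words("yang kayak begini layak di tangkap",
               stopwords = c("yang", "di"))
## expected:
## [[1]]
## [1] "kayak" "begini" "layak" "tangkap"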
We are finally done with the data cleansing. Let's save the result one final time as df_clean_final, and save an RDS as well.
df_clean_final <- data.frame(df_clean_3)
#saveRDS(df_clean_final, file = "df_clean_final.RDS")
df_clean_final <- readRDS("df_clean_final.RDS")
df_clean_final$tweet[1:5]
## [[1]]
## [1] "terimakasih" "ustadz" "suara" "radikal" "radikal"
## [6] "pikir" "rejim" "mesjid" "radikal" "kampus"
## [11] "radikal" "dosen" "radikal" "pijak" "adab"
## [16] "hancur" "indonesia"
##
## [[2]]
## [1] "maaf" "twiter" "kali" "orang" "bego"
##
## [[3]]
## [1] "anjing" "tahi" "goblok" "idiot" "bangsat" "monyet" "babi"
## [8] "fucc" "kont" "ngents" "goblok" "iya" "tau" "kasar"
## [15] "banget" "maaf"
##
## [[4]]
## [1] "hadir" "lokakarya" "budaya" "daerah" "bupati"
## [6] "rupinus" "ajak" "masyarakat" "sekadau" "rawat"
## [11] "manfaat" "objek" "budaya"
##
## [[5]]
## [1] "kayak" "layak" "tangkap"
df_clean_final
Now, let's learn more about our dataset using wordcloud(). The df_clean_final dataset still contains both bully and non-bully tweets (10,437 rows).
df_clean_final_corpus <- VCorpus(VectorSource(df_clean_final$tweet))
df_clean_final_dtm <- DocumentTermMatrix(df_clean_final_corpus)
inspect(df_clean_final_dtm)
## <<DocumentTermMatrix (documents: 10437, terms: 16211)>>
## Non-/sparse entries: 91995/169102212
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue indonesia islam jokowi kalo nya orang presiden sih
## 1285 0 0 0 0 0 1 0 0 0 0
## 211 0 0 0 0 0 0 0 0 0 0
## 2150 0 0 0 0 0 0 0 0 0 3
## 2794 0 0 0 0 0 0 0 0 0 1
## 6426 0 0 0 0 0 0 0 1 0 2
## 7019 0 0 0 0 1 0 0 0 1 1
## 764 0 0 0 0 0 0 0 0 0 0
## 78 0 0 0 0 0 0 0 0 0 0
## 8635 0 0 0 2 0 0 1 0 0 0
## 9936 0 0 0 0 0 0 2 1 0 0
wordcloud(df_clean_final_corpus,max.words = 200, col=brewer.pal(8, "Set2"), scale=c(3,0.25))
cleanfinal_count <- as.data.frame(as.matrix(df_clean_final_dtm))
cleanfinal_long <- pivot_longer(data = cleanfinal_count, cols = everything())
final_cleanfinal <- cleanfinal_long %>% group_by(name) %>% summarise(tot = sum(value))
cleanfinal_cloud <- final_cleanfinal %>%
filter(tot >= 50) %>%
arrange(desc(tot))
head(cleanfinal_cloud,30)
Let’s see the words that are commonly used in the non-bully tweets
df_clean_nobully <- df_clean_final %>%
filter(bully == "no")
df_clean_nobully_corpus <- VCorpus(VectorSource(df_clean_nobully$tweet))
df_clean_nobully_dtm <- DocumentTermMatrix(df_clean_nobully_corpus)
inspect(df_clean_nobully_dtm)
## <<DocumentTermMatrix (documents: 6057, terms: 12616)>>
## Non-/sparse entries: 56196/76358916
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama asing gue indonesia islam kalo kayak orang presiden sih
## 119 0 1 0 0 0 0 0 0 0 0
## 1230 0 0 0 0 0 0 0 0 0 3
## 1423 0 0 0 0 0 0 0 0 0 0
## 1626 0 0 0 0 0 0 1 0 0 1
## 3601 0 0 0 0 0 0 1 0 0 0
## 3715 0 0 0 0 0 0 0 1 0 2
## 43 0 0 0 0 0 0 0 0 0 0
## 5017 0 0 0 0 2 0 0 0 0 0
## 5764 0 0 0 0 0 0 0 1 0 0
## 743 0 0 0 0 0 1 1 0 0 0
wordcloud(df_clean_nobully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.3))
nobully_count <- as.data.frame(as.matrix(df_clean_nobully_dtm))
nobully_long <- pivot_longer(data = nobully_count, cols = everything())
final_nobully <- nobully_long %>% group_by(name) %>% summarise(tot = sum(value))
nobully_cloud <- final_nobully %>%
filter(tot >= 50) %>%
arrange(desc(tot))
head(nobully_cloud,30)
Bullying in general: I set max.words to 100 and min.freq to 20,000 to make the cloud narrower and more specific. We can see that the word "jokowi" (individual) is mentioned the most, followed by the words in blue (cebong, islam, orang). There are also the group term "pki", the religion term "agama", and the race term "cina".
df_clean_bully2 <- df_clean_final %>%
filter(bully == "yes")
df_clean_bully_corpus <- VCorpus(VectorSource(df_clean_bully2$tweet))
df_clean_bully_dtm <- DocumentTermMatrix(df_clean_bully_corpus)
inspect(df_clean_bully_dtm)
## <<DocumentTermMatrix (documents: 4380, terms: 7372)>>
## Non-/sparse entries: 35799/32253561
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs cebong gantipresiden indonesia islam jokowi kalo nya orang presiden sih
## 1015 0 0 0 0 0 0 0 1 0 0
## 1351 0 0 0 0 0 0 0 1 0 0
## 1401 0 0 0 0 0 0 1 0 0 0
## 1540 0 0 0 0 0 0 0 0 0 1
## 234 2 0 0 2 0 0 0 0 0 0
## 2700 0 0 0 0 0 1 1 0 0 1
## 2956 0 0 0 0 1 0 0 0 1 1
## 331 0 0 0 0 0 0 0 0 0 0
## 859 1 0 0 0 0 0 0 0 0 0
## 966 2 0 0 1 0 0 0 0 0 0
wordcloud(df_clean_bully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.25))
bully_count <- as.data.frame(as.matrix(df_clean_bully_dtm))
bully_long <- pivot_longer(data = bully_count, cols = everything())
final_bully <- bully_long %>% group_by(name) %>% summarise(tot = sum(value))
bully_cloud <- final_bully %>%
filter(tot >= 50) %>%
arrange(desc(tot))
head(bully_cloud,30)
Let's dissect each category to get a better view.
In the Individual category, we can see cyberbullying directed at government figures, most often "jokowi", followed by "ahok", "prabowo", and "anies". We can also see that not all words are names of individuals; some are simply "rude" words, e.g. "tolol", "babi". Additionally, there are words that do not mention an individual's name but can still be used to attack one, e.g. "lengser", "ganti presiden", "tolol", "kafir".
df_clean_bindividual <- df_clean_bully2 %>%
filter(individual == 1)
df_clean_bindividual_corpus <- VCorpus(VectorSource(df_clean_bindividual$tweet))
df_clean_bindividual_dtm <- DocumentTermMatrix(df_clean_bindividual_corpus)
inspect(df_clean_bully_dtm)
## <<DocumentTermMatrix (documents: 4380, terms: 7372)>>
## Non-/sparse entries: 35799/32253561
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs cebong gantipresiden indonesia islam jokowi kalo nya orang presiden sih
## 1015 0 0 0 0 0 0 0 1 0 0
## 1351 0 0 0 0 0 0 0 1 0 0
## 1401 0 0 0 0 0 0 1 0 0 0
## 1540 0 0 0 0 0 0 0 0 0 1
## 234 2 0 0 2 0 0 0 0 0 0
## 2700 0 0 0 0 0 1 1 0 0 1
## 2956 0 0 0 0 1 0 0 0 1 1
## 331 0 0 0 0 0 0 0 0 0 0
## 859 1 0 0 0 0 0 0 0 0 0
## 966 2 0 0 1 0 0 0 0 0 0
wordcloud(df_clean_bindividual_corpus,max.words = 30, min.freq = 10000, col=brewer.pal(8, "Set2"), scale=c(4.5,0.5))
bindividual_count <- as.data.frame(as.matrix(df_clean_bindividual_dtm))
bindividual_long <- pivot_longer(data = bindividual_count, cols = everything())
final_bindividual <- bindividual_long %>% group_by(name) %>% summarise(tot = sum(value))
bindividual_cloud <- final_bindividual %>%
filter(tot >= 50) %>%
arrange(desc(tot))
head(bindividual_cloud,30)
df_clean_bgroup <- df_clean_bully2 %>%
filter(group == 1)
df_clean_bgroup_corpus <- VCorpus(VectorSource(df_clean_bgroup$tweet))
df_clean_bgroup_dtm <- DocumentTermMatrix(df_clean_bgroup_corpus)
inspect(df_clean_bgroup_dtm)
## <<DocumentTermMatrix (documents: 1562, terms: 3853)>>
## Non-/sparse entries: 13790/6004596
## Sparsity : 100%
## Maximal term length: 30
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama bubar cebong cina indonesia islam komunis nya orang pki
## 1074 0 0 0 0 0 0 0 0 0 0
## 1167 0 0 0 0 0 0 0 0 1 0
## 118 0 0 0 0 0 0 0 0 0 0
## 1481 0 0 0 0 0 0 0 0 0 0
## 317 0 0 1 0 0 0 0 0 0 0
## 364 0 0 0 0 0 0 0 0 1 0
## 501 0 0 0 1 0 0 0 0 1 0
## 575 0 0 0 0 0 0 0 0 0 0
## 683 1 0 3 0 0 3 0 0 0 0
## 726 1 0 0 0 0 0 0 0 1 0
wordcloud(df_clean_bgroup_corpus,max.words = 30, min.freq = 10000, col=brewer.pal(8, "Set2"), scale=c(3,0.25))
bgroup_count <- as.data.frame(as.matrix(df_clean_bgroup_dtm))
bgroup_long <- pivot_longer(data = bgroup_count, cols = everything())
final_bgroup <- bgroup_long %>% group_by(name) %>% summarise(tot = sum(value))
bgroup_cloud <- final_bgroup %>%
filter(tot >= 50) %>%
arrange(desc(tot))
head(bgroup_cloud,30)
df_clean_bgender <- df_clean_bully2 %>%
filter(gender == 1)
df_clean_bgender_corpus <- VCorpus(VectorSource(df_clean_bgender$tweet))
df_clean_bgender_dtm <- DocumentTermMatrix(df_clean_bgender_corpus)
inspect(df_clean_bgender_dtm)
## <<DocumentTermMatrix (documents: 243, terms: 1043)>>
## Non-/sparse entries: 2061/251388
## Sparsity : 99%
## Maximal term length: 23
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs banci bencong dasar gue homo kafir kayak manusia orang sih
## 13 0 0 0 0 0 0 0 0 0 0
## 130 0 0 1 1 0 0 0 0 0 0
## 131 0 0 0 4 0 0 1 0 0 1
## 161 0 0 0 0 0 0 0 0 0 0
## 194 0 0 0 0 1 0 2 1 2 0
## 208 0 2 0 1 0 0 0 0 0 2
## 52 1 0 0 0 0 0 0 0 1 0
## 62 1 0 0 1 0 0 1 0 1 0
## 82 0 0 0 0 0 0 0 0 0 0
## 85 0 0 0 0 0 0 1 0 0 0
wordcloud(df_clean_bgender_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.5))
bgender_count <- as.data.frame(as.matrix(df_clean_bgender_dtm))
bgender_long <- pivot_longer(data = bgender_count, cols = everything())
final_bgender <- bgender_long %>% group_by(name) %>% summarise(tot = sum(value))
bgender_cloud <- final_bgender %>%
filter(tot >= 10) %>%
arrange(desc(tot))
head(bgender_cloud,30)
df_clean_bphysical <- df_clean_bully2 %>%
filter(physical == 1)
df_clean_bphysical_corpus <- VCorpus(VectorSource(df_clean_bphysical$tweet))
df_clean_bphysical_dtm <- DocumentTermMatrix(df_clean_bphysical_corpus)
inspect(df_clean_bphysical_dtm)
## <<DocumentTermMatrix (documents: 252, terms: 1061)>>
## Non-/sparse entries: 1986/265386
## Sparsity : 99%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs bolot budek gue idiot kayak mata muka orang picek sih
## 125 0 0 0 0 0 1 0 0 1 1
## 165 0 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0 0
## 223 0 0 1 1 0 0 0 0 0 2
## 234 0 0 2 0 0 1 0 0 0 0
## 245 0 0 0 0 0 0 0 0 0 0
## 25 0 0 0 4 0 0 0 0 5 0
## 252 0 0 0 0 0 0 0 0 0 0
## 28 0 0 0 0 0 0 1 0 0 0
## 30 0 0 0 0 0 0 2 1 0 0
wordcloud(df_clean_bphysical_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(4,0.25))
bphysical_count <- as.data.frame(as.matrix(df_clean_bphysical_dtm))
bphysical_long <- pivot_longer(data = bphysical_count, cols = everything())
final_bphysical <- bphysical_long %>% group_by(name) %>% summarise(tot = sum(value))
bphysical_cloud <- final_bphysical %>%
filter(tot >= 10) %>%
arrange(desc(tot))
head(bphysical_cloud,30)
df_clean_breligion <- df_clean_bully2 %>%
filter(religion == 1)
df_clean_breligion_corpus <- VCorpus(VectorSource(df_clean_breligion$tweet))
df_clean_breligion_dtm <- DocumentTermMatrix(df_clean_breligion_corpus)
inspect(df_clean_breligion_dtm)
## <<DocumentTermMatrix (documents: 619, terms: 1980)>>
## Non-/sparse entries: 5919/1219701
## Sparsity : 100%
## Maximal term length: 30
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama ahok allah anti budha indonesia islam kafir muslim orang
## 114 1 0 0 0 0 0 0 0 0 0
## 145 0 1 0 0 0 0 1 0 0 0
## 156 1 0 0 0 0 0 0 0 0 0
## 266 1 0 0 0 0 0 3 0 0 0
## 291 1 0 0 0 0 0 0 0 0 1
## 30 2 0 0 0 0 0 2 0 0 0
## 47 0 0 0 0 0 0 0 0 0 0
## 477 0 0 0 0 0 0 0 1 0 1
## 591 0 0 0 0 0 0 0 0 0 0
## 63 0 0 0 0 0 0 0 1 2 1
wordcloud(df_clean_breligion_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.4))
breligion_count <- as.data.frame(as.matrix(df_clean_breligion_dtm))
breligion_long <- pivot_longer(data = breligion_count, cols = everything())
final_breligion <- breligion_long %>% group_by(name) %>% summarise(tot = sum(value))
breligion_cloud <- final_breligion %>%
filter(tot >= 10) %>%
arrange(desc(tot))
head(breligion_cloud,30)
df_clean_brace <- df_clean_bully2 %>%
filter(race == 1)
df_clean_brace_corpus <- VCorpus(VectorSource(df_clean_brace$tweet))
df_clean_brace_dtm <- DocumentTermMatrix(df_clean_brace_corpus)
inspect(df_clean_brace_dtm)
## <<DocumentTermMatrix (documents: 429, terms: 1342)>>
## Non-/sparse entries: 3527/572191
## Sparsity : 99%
## Maximal term length: 21
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs antek china cina ganyang indonesia islam komunis orang pki usir
## 126 0 4 0 0 0 0 0 0 0 0
## 129 0 0 0 0 0 0 1 2 1 0
## 132 0 0 1 0 0 0 0 1 0 0
## 195 0 0 0 0 0 0 0 0 0 0
## 26 0 0 0 0 0 1 0 0 2 0
## 320 0 0 0 0 0 0 0 1 0 0
## 72 0 0 2 0 0 0 0 0 0 0
## 74 0 0 0 0 0 0 0 0 0 0
## 76 0 0 1 0 0 0 0 0 0 0
## 99 0 0 0 0 0 0 0 0 0 0
wordcloud(df_clean_brace_corpus,max.words = 100, col=brewer.pal(8, "Set2"), scale=c(5,0.5))
brace_count <- as.data.frame(as.matrix(df_clean_brace_dtm))
brace_long <- pivot_longer(data = brace_count, cols = everything())
final_brace <- brace_long %>% group_by(name) %>% summarise(tot = sum(value))
brace_cloud <- final_brace %>%
filter(tot >= 10) %>%
arrange(desc(tot))
head(brace_cloud,30)
We will split the train data into 80% for training and the remaining 20% for validation.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# train-test splitting
index <- sample(nrow(df_clean_final_dtm), nrow(df_clean_final_dtm)*0.8)
df_train <- df_clean_final_dtm[index,]
df_validation <- df_clean_final_dtm[-index,]
label_train <- df_clean_final[index, 'bully']
label_validation <- df_clean_final[-index, 'bully']
prop.table(table(label_train))
## label_train
## no yes
## 0.5795904 0.4204096
prop.table(table(label_validation))
## label_validation
## no yes
## 0.5833333 0.4166667
#Check Dim
dim(df_train)
## [1] 8349 16211
10437*0.8
## [1] 8349.6
# the number of rows is 10437 after removing duplicates
We will subset the DTM to keep only the terms that appear at least 10 times.
df_freq <- findFreqTerms(df_train, lowfreq = 10)
length(df_freq)
## [1] 1396
head(df_freq)
## [1] "abang" "abu" "acara" "adab" "adat" "adek"
df_train2 <- df_train[,df_freq]
inspect(df_train2)
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue indonesia islam jokowi kalo kayak orang presiden sih
## 2419 0 1 0 0 0 0 1 1 0 0
## 2429 0 2 0 0 0 0 1 1 0 1
## 2452 0 0 0 0 0 0 0 0 0 0
## 2842 0 0 0 0 0 0 0 0 0 1
## 4346 1 0 0 3 0 0 0 0 0 0
## 525 2 1 0 2 0 0 0 0 0 0
## 6751 0 0 0 0 0 0 0 0 0 0
## 9050 1 0 0 0 0 0 0 0 0 1
## 9849 0 0 0 1 0 0 0 6 0 0
## 9882 0 0 0 0 0 0 0 0 0 0
We use a Bernoulli converter to transform word frequencies into binary presence indicators: if f > 0, the value is 1 (the word appears); if f == 0, the value is 0 (it does not appear).
bernoulli_conv <- function(x){
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
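A toy check of the converter on made-up counts:
bernoulli_conv(c(0, 3, 1, 0))
## [1] 0 1 1 0
## Levels: 0 1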
df_train_bn <- apply(X = df_train2, MARGIN = 2, FUN = bernoulli_conv)
df_validation_bn <- apply(X = df_validation, MARGIN = 2, FUN = bernoulli_conv)
naive_bully <- naiveBayes(x = df_train_bn,
y = label_train)
df_train_pred <- predict(naive_bully, df_validation_bn, type = "class")
head(df_train_pred)
## [1] no yes yes yes yes no
## Levels: no yes
summary(df_train_pred)
## no yes
## 1162 926
Model evaluation: the trained model against the validation data.
confusionMatrix(data = df_train_pred, # predicted labels
reference = label_validation, # actual labels
positive = "yes") # positive class: "yes"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1005 157
## yes 213 713
##
## Accuracy : 0.8228
## 95% CI : (0.8057, 0.839)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6388
##
## Mcnemar's Test P-Value : 0.004246
##
## Sensitivity : 0.8195
## Specificity : 0.8251
## Pos Pred Value : 0.7700
## Neg Pred Value : 0.8649
## Prevalence : 0.4167
## Detection Rate : 0.3415
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8223
##
## 'Positive' Class : yes
##
Subset the data to recover the 20% validation split that we used.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
df_mispredicted <- readRDS("df_clean_3.RDS")
data_validation_check <- df_mispredicted[-index,]
head(data_validation_check)
Create a dataframe from the prediction results (the trained model's predictions on the validation data).
validation_pred_results <- as.data.frame(df_train_pred)
head(validation_pred_results)
Combine both into one dataframe. For reference, our confusion matrix:
## Reference
## Prediction no yes
## no 1005 157
## yes 213 713
data_validation_trainyes_predno <- data_validation_check %>%
mutate(validation_pred_results,
.after=bully,
tweet = as.character(tweet)) %>%
filter(bully == "yes" & df_train_pred == "no")
head(data_validation_trainyes_predno)
nrow(data_validation_trainyes_predno)
## [1] 157
There are 157 tweets that are originally "bully" but that our system classified as "no", matching our confusion matrix.
data_validation_trainno_predyes <- data_validation_check %>%
mutate(validation_pred_results,
.after=bully,
tweet = as.character(tweet)) %>%
filter(bully == "no" & df_train_pred == "yes")
head(data_validation_trainno_predyes)
nrow(data_validation_trainno_predyes)
## [1] 213
There are 213 tweets that are originally "not bully" but that our system classified as "yes", matching our confusion matrix.
We will cleanse the test dataset. The steps are similar to how we cleansed our training dataset.
#df_test <- read_csv("data/test.csv")
#df_test$tweet <- df_test$tweet %>%
# replace_tag() %>%
# replace_date(replacement = " ") %>%
# replace_email() %>%
# replace_emoji(.) %>%
# replace_emoticon(.) %>%
# replace_url() %>%
# replace_html(.) %>%
# str_to_lower()
#df_test$tweet <- gsub("user", " ", df_test$tweet)
#df_test$tweet <- gsub("rt", " ", df_test$tweet)
#df_test$tweet <- gsub("[[:punct:] ]+", " ", df_test$tweet)
#df_test$tweet <- gsub("url", " ", df_test$tweet)
#df_test$tweet <- gsub("[^a-z]+$", "", df_test$tweet)
#df_test$tweet <- gsub("[[:digit:]]", "", df_test$tweet)
#df_test$tweet <- strip(df_test$tweet)
#df_test$tweet <- lapply(tokenize_words(df_test$tweet), stemming)
#df_test$tweet <- as.character(df_test$tweet)
#saveRDS(df_test, file = "naivebayes_test_clean.RDS")
df_test <- readRDS("naivebayes_test_clean.RDS")
df_test_corpus <- VCorpus(VectorSource(df_test$tweet))
df_test_dtm <- DocumentTermMatrix(df_test_corpus)
df_test_bn <- apply(X = df_test_dtm, MARGIN = 2, FUN = bernoulli_conv)
Now we apply the model trained on the training data to the test data.
df_test_pred <- predict(naive_bully, df_test_bn, type = "class")
head(df_test_pred)
## [1] no yes no no yes yes
## Levels: no yes
summary(df_test_pred)
## no yes
## 1402 1232
submission <- df_test %>%
mutate(bully = df_test_pred)
write.csv(submission, "submission-inge-freq10.csv", row.names = F)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# train-test splitting
index <- sample(nrow(df_clean_final_dtm), nrow(df_clean_final_dtm)*0.8)
df_train_rf <- df_clean_final_dtm[index,]
df_validation_rf <- df_clean_final_dtm[-index,]
label_train_rf <- df_clean_final[index, 'bully']
label_validation_rf <- df_clean_final[-index, 'bully']
prop.table(table(label_train_rf))
## label_train_rf
## no yes
## 0.5795904 0.4204096
prop.table(table(label_validation_rf))
## label_validation_rf
## no yes
## 0.5833333 0.4166667
df_freq2 <- findFreqTerms(df_train_rf, lowfreq = 10)
length(df_freq2)
## [1] 1396
head(df_freq2)
## [1] "abang" "abu" "acara" "adab" "adat" "adek"
df_train_rf2 <- df_train_rf[,df_freq2]
inspect(df_train_rf2)
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs agama gue indonesia islam jokowi kalo kayak orang presiden sih
## 2419 0 1 0 0 0 0 1 1 0 0
## 2429 0 2 0 0 0 0 1 1 0 1
## 2452 0 0 0 0 0 0 0 0 0 0
## 2842 0 0 0 0 0 0 0 0 0 1
## 4346 1 0 0 3 0 0 0 0 0 0
## 525 2 1 0 2 0 0 0 0 0 0
## 6751 0 0 0 0 0 0 0 0 0 0
## 9050 1 0 0 0 0 0 0 0 0 1
## 9849 0 0 0 1 0 0 0 6 0 0
## 9882 0 0 0 0 0 0 0 0 0 0
bernoulli_conv <- function(x){
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
df_train_rf_bn <- apply(X = df_train_rf2, MARGIN = 2, FUN = bernoulli_conv)
df_validation_rf_bn <- apply(X = df_validation, MARGIN = 2, FUN = bernoulli_conv)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
rf <- randomForest(x = df_train_rf_bn,
y = label_train_rf,
ntree = 15)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
rf_pred <- predict(rf, df_validation_rf_bn
, type = "class")
head(rf_pred)
## 1 3 6 8 9 15
## yes yes yes yes yes no
## Levels: no yes
summary(rf_pred)
## no yes
## 1187 901
confusionMatrix(data = rf_pred, # predicted labels
reference = label_validation, # actual labels
positive = "yes") # positive class: "yes"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1016 171
## yes 202 699
##
## Accuracy : 0.8214
## 95% CI : (0.8042, 0.8376)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6344
##
## Mcnemar's Test P-Value : 0.1203
##
## Sensitivity : 0.8034
## Specificity : 0.8342
## Pos Pred Value : 0.7758
## Neg Pred Value : 0.8559
## Prevalence : 0.4167
## Detection Rate : 0.3348
## Detection Prevalence : 0.4315
## Balanced Accuracy : 0.8188
##
## 'Positive' Class : yes
##
df_test_rf <- readRDS("naivebayes_test_clean.RDS")
head(df_test_rf)
df_test_bn_rf<- apply(X = df_test_dtm, MARGIN = 2, FUN = bernoulli_conv)
df_test_pred_rf <- predict(rf, df_test_bn_rf, type = "class")
My Random Forest test run stops here, as I am not sure why this error is happening; the test data used is the same as in the Naive Bayes test. However, as the Naive Bayes confusion matrix shows better Sensitivity, I would choose Naive Bayes as the better model, since I want to reduce False Negatives.
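A likely cause (an assumption, not verified here) is that df_test_dtm has a different set of term columns than the matrix the forest was trained on; randomForest's predict() requires the same predictor variables seen during training. A minimal sketch of one possible fix, rebuilding the test DTM restricted to the training vocabulary df_freq2:
# Sketch: align the test DTM to the training terms so predict() sees the
# same predictor columns (assumes the error is a column mismatch)
df_test_dtm_aligned <- DocumentTermMatrix(df_test_corpus,
                                          control = list(dictionary = df_freq2))
df_test_bn_rf <- apply(X = df_test_dtm_aligned, MARGIN = 2, FUN = bernoulli_conv)
df_test_pred_rf <- predict(rf, df_test_bn_rf, type = "class")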
Individual and Group are the categories with the most cyberbullying tweets; we can find the proportions by calling summary() on the dataframe.
df_clean_bully %>%
summary()
## bully tweet individual group gender physical race
## no : 0 Length:4380 0:1562 0:2818 0:4137 0:4128 0:3951
## yes:4380 Class :character 1:2818 1:1562 1: 243 1: 252 1: 429
## Mode :character
## religion
## 0:3761
## 1: 619
##
Please refer to the wordclouds / frequency dataframes from section 3.3. Below are the top 10 tokens for each category.
Individual:
head(bindividual_cloud,10)
Group:
head(bgroup_cloud,10)
Gender:
head(bgender_cloud,10)
Physical:
head(bphysical_cloud,10)
Race:
head(brace_cloud,10)
Religion:
head(breligion_cloud,10)
Yes, we can see several tokens that appear in one category and also appear in another with a lower count, for example:
- Indonesia: Religion (60), Race (101), Group (174)
- Ahok: Religion (60), Individual (188)
- Islam: Religion (249), Race (35), Group (200)
There might be more tokens that intersect between categories, but as we are only pulling the top 10, we do not see them here.
Below are the tokens when we filter bully == yes
head(bully_cloud,30)
wordcloud(df_clean_bully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.25))
### Is it based on the term frequency of each word or token? Or is it based on the Term Frequency (TF) - Inverse Document Frequency (IDF)?
Based on term frequency (TF): the document-term matrices above use tf weighting ("Weighting : term frequency (tf)" in the inspect() output), not TF-IDF.
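If TF-IDF weighting were preferred instead, tm supports it directly; a minimal sketch (not used in this pipeline):
df_tfidf_dtm <- DocumentTermMatrix(df_clean_final_corpus,
                                   control = list(weighting = weightTfIdf))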
Yes
Bully
wordcloud(df_clean_bully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.25))
Non-Bully
wordcloud(df_clean_nobully_corpus,max.words = 100, min.freq = 20000, col=brewer.pal(8, "Set2"), scale=c(3.5,0.3))
## No. 3

### What package will you use for text mining?
library(textclean)
library(tokenizers)
library(wordcloud)
library(dplyr)
library(devtools)
library(katadasaR)
library(tm)
library(stringr)
library(e1071)
library(caret)
library(keras)
library(RVerbalExpressions)
library(magrittr)
library(textclean)
library(tidyverse)
library(tidytext)
library(rsample)
library(yardstick)
library(SnowballC)
library(partykit)
library(ROCR)
library(partykit)
library(randomForest)
Yes
head(df_clean_2$tweet,10)
## [1] "terimakasih ustadz sudah bersuara tentang radikal radikal ini entah apa yang ada dalam pikiran rejim mesjid radikal kampus radikal dosen radikal padahal tempat tersebut pijakan peradaban memangnya mau menghancurkan indonesia"
## [2] "maaf sebenarnya twiter pertama kali dibuat bukan buat orang bego"
## [3] "anjing tahi goblok idiot bangsat monyet babi fucc kont ngents goblok iya tau kasar banget maaf"
## [4] "hadiri lokakarya kebudayaan daerah bupati rupinus ajak masyarakat sekadau rawat dan manfaatkan objek budaya"
## [5] "yang kayak begini layak di tangkap"
## [6] "ini namanya memancing konflik horizontal kalo polisi membiarkan bagian dari itu sudah waktunya lengserkan jokowi sebelum indonesia hancur"
## [7] "wonu oppa kenapa matanya sipit banget"
## [8] "smartfren jaringan nya kok brengsek ya"
## [9] "pret kampret tak dukung ganti presiden tapi presidenmu sapa rocky gerung kapir thaa"
## [10] "ahelah sombong benar punuk onta"
Did you use custom stopwords for Bahasa Indonesia? Yes (for reference, please see the Stemming, Stopwords and Tokenizing section).
stemming <- function(x) {
paste(lapply(x, katadasar), collapse = " ")
}
df_clean_2$tweet[1:10437] <- lapply(tokenize_words(df_clean_2$tweet[1:10437]), stemming)
df_clean_3 <- data.frame(df_clean_2)
stopwords <- readLines("data/stopwords-id.txt")
df_clean_3$tweet <- df_clean_3$tweet %>%
replace_html(symbol = FALSE) %>%
replace_url(replacement = "")
df_clean_3$tweet <- gsub("url", " ", df_clean_3$tweet)
df_clean_3$tweet <- tokenize_words(df_clean_3$tweet, stopwords = stopwords)
Should you remove punctuation or emoticons? Yes (for reference, please see the data cleansing section).
df_clean$tweet <- df_clean$tweet %>%
replace_tag() %>%
replace_date(replacement = " ") %>%
replace_email() %>%
replace_emoji(.) %>%
replace_emoticon(.) %>%
replace_url() %>%
replace_html(.) %>%
str_to_lower() %>%
strip()
Will you create a document-term matrix? Yes (please see the Wordcloud and Frequency section for the DTM process).
df_clean_final_dtm
## <<DocumentTermMatrix (documents: 10437, terms: 16211)>>
## Non-/sparse entries: 91995/169102212
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
Naive Bayes
1,396 words (after subsetting with findFreqTerms() with lowfreq = 10).
df_train2
## <<DocumentTermMatrix (documents: 8349, terms: 1396)>>
## Non-/sparse entries: 51184/11604020
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
80% training, 20% validation
Based on Sensitivity, as we would like the lowest number of False Negatives (predicted not bully but actually bully).
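As a worked check against the Naive Bayes confusion matrix below: Sensitivity = TP / (TP + FN), using TP = 713 and FN = 157.
713 / (713 + 157)
## [1] 0.8195402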
NAIVE BAYES Train to Validation
confusionMatrix(data = df_train_pred, # predicted labels
reference = label_validation, # actual labels
positive = "yes") # positive class: "yes"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1005 157
## yes 213 713
##
## Accuracy : 0.8228
## 95% CI : (0.8057, 0.839)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6388
##
## Mcnemar's Test P-Value : 0.004246
##
## Sensitivity : 0.8195
## Specificity : 0.8251
## Pos Pred Value : 0.7700
## Neg Pred Value : 0.8649
## Prevalence : 0.4167
## Detection Rate : 0.3415
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8223
##
## 'Positive' Class : yes
##
RANDOM FOREST Train to Validation
confusionMatrix(data = rf_pred, # predicted labels
reference = label_validation, # actual labels
positive = "yes") # positive class: "yes"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1016 171
## yes 202 699
##
## Accuracy : 0.8214
## 95% CI : (0.8042, 0.8376)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6344
##
## Mcnemar's Test P-Value : 0.1203
##
## Sensitivity : 0.8034
## Specificity : 0.8342
## Pos Pred Value : 0.7758
## Neg Pred Value : 0.8559
## Prevalence : 0.4167
## Detection Rate : 0.3348
## Detection Prevalence : 0.4315
## Balanced Accuracy : 0.8188
##
## 'Positive' Class : yes
##
Naive Bayes, as it has higher Sensitivity (0.8195 vs 0.8034 for Random Forest).
There is no overfitting: train-to-validation accuracy is 82.28% and train-to-test accuracy is 82%.
- Accuracy in (your own) validation dataset reaches > 80%.
- Sensitivity in (your own) validation dataset reaches > 80%.
- Specificity in (your own) validation dataset reaches > 75%.
- Precision in (your own) validation dataset reaches > 75%.
confusionMatrix(data = df_train_pred, # predicted labels
reference = label_validation, # actual labels
positive = "yes") # positive class: "yes"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1005 157
## yes 213 713
##
## Accuracy : 0.8228
## 95% CI : (0.8057, 0.839)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6388
##
## Mcnemar's Test P-Value : 0.004246
##
## Sensitivity : 0.8195
## Specificity : 0.8251
## Pos Pred Value : 0.7700
## Neg Pred Value : 0.8649
## Prevalence : 0.4167
## Detection Rate : 0.3415
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8223
##
## 'Positive' Class : yes
##
- Accuracy in test dataset reaches > 80%.
- Sensitivity in test dataset reaches > 80%.
- Specificity in test dataset reaches > 75%.
- Precision in test dataset reaches > 75%.
NAIVE BAYES Train to Test
knitr::include_graphics("data/NAIVE BAYES FREQ 10.png")
Bully tweets but classified as not bully in validation dataset
head(data_validation_trainyes_predno,10)
Not Bully tweets but classified as bully in validation dataset
head(data_validation_trainno_predyes,10)
Yes. As we have seen from the wordclouds of bully and non-bully tokens, several tokens appear in both. For example, among the non-bully tweets predicted as bully there are tokens such as cina (no. 2), jokowi (no. 3), and ahok (no. 8) that also appear in the bully wordcloud. Hence, there is a pattern of misclassification caused by the presence of the same token in both bully and non-bully tweets. Additionally:
1. Bully words in Javanese or slang that are not captured in the colloquial-indonesian-lexicon might also be classified as non-bully by the system, e.g. kntl, sarap.
2. The context of a sentence could also affect the classification, and this might not be captured well in our prediction.
Yes, words like cina, jokowi, and ahok are present in both bully and non-bully tweets.
Yes. Though this might not be a perfect model, it already gives us a head start in classifying bully and non-bully tweets with over 80% accuracy. It would be beneficial in the future to tweak the model to reduce the false negative rate for better performance.
Absolutely. With a better model, this could help a social media company prevent cyberbullying. As we know, cyberbullying is a serious problem that has cost many teenagers their lives.