The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.
Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.
The dataset is consist of Twitter tweets from Kaggle
library(inspectdf)
library(caret)
library(stringr)
library(magrittr)
library(tidyverse)
library(tidytext)
library(tm)
library(SnowballC)
library(textclean)
library(tokenizers)
library(wordcloud)
library(keras)
library(RColorBrewer)
library(tictoc)train <- read.csv(file = "data/train.csv")
test <- read.csv(file = "data/test.csv")
glimpse(train)## Rows: 31,962
## Columns: 3
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1~
## $ label <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0~
## $ tweet <chr> " @user when a father is dysfunctional and is so selfish he drag~
glimpse(test)## Rows: 17,197
## Columns: 2
## $ id <int> 31963, 31964, 31965, 31966, 31967, 31968, 31969, 31970, 31971, 3~
## $ tweet <chr> "#studiolife #aislife #requires #passion #dedication #willpower ~
The data consist of more than 31 thousands tweets and has three column which is the id, tweets, and label. we will remove id column, mutate type of label column, and change the name of column tweet for more readability.
train <- train %>%
rename(text = tweet) %>%
select(-id) %>%
mutate(label = as.factor(label))
test <- test %>%
rename(text = tweet) %>%
select(-id)colSums(is.na(train))## label text
## 0 0
# remove duplicated rows
train <- train[!duplicated(train), ]
train %>% duplicated() %>% sum() # total duplicated rows## [1] 0
train %>%
inspect_cat() %>%
show_plot()Due to the tendency of our model to predict non hate speech tweet. we will downsample the non hate speech tweet data.
train_upsam <- upSample(x = train %>% select(-label),
y = train$label,
yname = "label")
train_upsam %>%
inspect_cat() %>%
show_plot()First we perform text cleansing to standardize the data using textclean package. in this process we wil: * remove any url, emoji, or html tag * replace all digit numbers, tags, username or USER tag,and punctuation. * replace all slang word.
text_cleansing <- function(data) {
new_data <- data %>%
mutate(
text = text %>%
replace_url() %>%
replace_emoji() %>%
replace_emoticon() %>%
replace_html() %>%
str_to_lower() %>% # transform to lowercase
str_remove_all("@([0-9a-zA-Z_]+)") %>% # remove username
str_remove_all('[\\#]+') %>% # remove hashtag
str_remove_all('[\\!]+') %>%
str_remove_all('[\\&]+') %>%
str_remove_all('[\\"]+') %>%
replace_internet_slang() %>%
str_remove_all(pattern = "[[:digit:]]") %>% # remove number
str_remove_all(pattern = "[[:punct:]]") %>% # remove all punctuation except !,&,""
str_squish() # extra white space remove
) %>%
select(text, label)
return(new_data)
}
# data <- text_cleansing(train_upsam)
# test$label <- 0
# data_test <- text_cleansing(test)
# saveRDS(data,"data/data.rds")
# saveRDS(data_test,"data/data_test.rds")In this process we will:
change tweet format into corpus and do some more cleaning per corpus and do stemming which is replace the word with its stem word e.g. “reading” into the word “read”.
we also will remove some words that exists in both hate speech and non hate speech catgegories. the words is determine based on observations on word cloud on both train and test dataset. also from the result of prediction on validation dataset.
remove stopwords.
First we create 2 custom function. the first is corpus_cleansing to clean corpus data using package tm.
corpus_cleansing <- function(data) {
words.to.remove <- c("user","url","rt")
data_corpus <- data$text %>%
VectorSource() %>%
VCorpus(readerControl = list(language="en")) %>%
tm_map(removeWords, words.to.remove) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument) %>%
tm_map(stripWhitespace) %>%
sapply(as.character) %>%
as.data.frame(stringsAsFactors = FALSE)
data_clean <- bind_cols(data_corpus, data[,2] )%>%
`colnames<-`(c("text","label"))
return(data_clean)
}
# data_clean <- corpus_cleansing(data)
# data_test_clean <- corpus_cleansing(data_test)
# saveRDS(data_clean,"data/data_clean.rds")
# saveRDS(data_test_clean,"data/data_test_clean.rds")data_clean <- readRDS("data/data_clean.rds")
data_test_clean <- readRDS("data/data_test_clean.rds")The following word cloud is the most common words in train dataset both for hate speech and non hate speech. But first we have to change the words into matrix of corpus by creating a custom function.
# create custom function to
word_freq_table <- function(data_text){
text <- data_text %>% as.character()
docs <- Corpus(VectorSource(text))
dtm <- tm::TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
df <- data.frame(word = names(v),freq=v)
return(df)
}tweets <- data_clean %>% filter(label == 1) %>% select(text)
common_words <- word_freq_table(tweets)
wordcloud(words = common_words$word, freq = common_words$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))tweets <- data_clean %>% filter(label == 0) %>% select(text)
common_words <- word_freq_table(tweets)
wordcloud(words = common_words$word, freq = common_words$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))In this process we will separate each words into tokens. First, We have to combine data train and data test for tokenizer
data_all <- rbind(data_clean, data_test_clean)then we determine the total number of words that will be included in the training. for this case we will pick 1024 most frequent words and remove others.
num_words_train <- 1024
# prepare tokenizers
tokens <- text_tokenizer(num_words = num_words_train,
lower = TRUE) %>%
fit_text_tokenizer(data_all$text)length(tokens$word_counts)## [1] 43965
Total tokens/words that will be used for training is more than 43 thousands tokens.
Splitting the data into data_train dan data_test with proportion 80% training : 20% test data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# train-test splitting
index <- sample(1:nrow(data_clean), nrow(data_clean)*0.80)
data_train <- data_clean[index,]
data_val <- data_clean[-index,]maxlen <- max(str_count(data_clean$text, "\\w+")) + 1
# prepare data for predictors x
data_train_x <- texts_to_sequences(tokens, data_train$text) %>%
pad_sequences(maxlen = maxlen)
data_val_x <- texts_to_sequences(tokens, data_val$text) %>%
pad_sequences(maxlen = maxlen)
data_test_x <- texts_to_sequences(tokens, data_test_clean$text) %>%
pad_sequences(maxlen = maxlen)
# prepare data for target y
data_train_y <- to_categorical(data_train$label, num_classes = 2)
data_val_y <- to_categorical(data_val$label, num_classes = 2)Cek proporsi kelas target pada label_train dan label_test:
prop.table(table(data_train_y))## data_train_y
## 0 1
## 0.5 0.5
prop.table(table(data_val_y))## data_val_y
## 0 1
## 0.5 0.5
for model selection on this dataset. we will use Deep Learning using LSTM (Long Short-Term Memory) model.
for the measurement and evaluation we will use Accuracy metric. because we want to be able to differentiate both hate speech and non hate speech tweets equally.
Our goal is to make a model that has accuracy above 80%
RNGkind(sample.kind = "Rounding")
initializer <- initializer_random_normal(seed = 100)
model_DL <- keras_model_sequential()
# layer lstm 1 settings
unit_lstm1 <- 256
dropout_lstm1 <- 0.5
recurrent_dropout_lstm1 <- 0.5
# layer lstm 2 settings
unit_lstm2 <- 32
dropout_lstm2 <- 0.5
recurrent_dropout_lstm2 <- 0.5
model_DL %>%
layer_embedding(
name = "input",
input_dim = num_words_train,
input_length = maxlen,
output_dim = maxlen
) %>%
layer_dropout(
name = "embedding_dropout",
rate = 0.6
) %>%
# lstm1
layer_lstm(
name = "lstm1",
units = unit_lstm1,
dropout = dropout_lstm1,
recurrent_dropout = recurrent_dropout_lstm1,
return_sequences = TRUE
) %>%
# lstm2
layer_lstm(
name = "lstm2",
units = unit_lstm2,
dropout = dropout_lstm2,
recurrent_dropout = recurrent_dropout_lstm2,
return_sequences = FALSE
) %>%
# output layer
layer_dense(
name = "output",
units = 2,
activation = "sigmoid"
)
# Compile Model
model_DL %>%
compile(optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "accuracy",
loss = "binary_crossentropy")
# model summary
summary(model_DL)## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 24, 24) 24576
## ________________________________________________________________________________
## embedding_dropout (Dropout) (None, 24, 24) 0
## ________________________________________________________________________________
## lstm1 (LSTM) (None, 24, 256) 287744
## ________________________________________________________________________________
## lstm2 (LSTM) (None, 32) 36992
## ________________________________________________________________________________
## output (Dense) (None, 2) 66
## ================================================================================
## Total params: 349,378
## Trainable params: 349,378
## Non-trainable params: 0
## ________________________________________________________________________________
epochs <- 10
batch_size <- 2048
# fit the model
tic()
history <- model_DL %>%
fit(
data_train_x,
data_train_y,
batch_size = batch_size,
epochs = epochs,
verbose = 1,
validation_data = list(data_val_x, data_val_y))
# history plot
plot(history)toc()## 401.79 sec elapsed
save(model_DL, file="model_DL.Rdata")data_val_pred <- model_DL %>%
predict_classes(data_val_x) %>%
as.vector()
# predict on test
data_test_pred <- model_DL %>%
predict_classes(data_test_x) %>%
as.vector()
conf.matrix <- confusionMatrix(
factor(data_val_pred, labels = c("no", "yes")),
factor(data_val$label, labels = c("no", "yes")),
positive = "yes"
)
conf.matrix <- confusionMatrix(
factor(data_val_pred, labels = c("no", "yes")),
factor(data_val$label, labels = c("no", "yes")),
positive = "yes"
)
conf.matrix$byClass %>% round(digits = 3) * 100## Sensitivity Specificity Pos Pred Value
## 96.0 87.1 88.2
## Neg Pred Value Precision Recall
## 95.6 88.2 96.0
## F1 Prevalence Detection Rate
## 92.0 50.1 48.1
## Detection Prevalence Balanced Accuracy
## 54.5 91.6
and based on the criteria for overfit which is >5% difference between accuracy of training and accuracy of validation result. the model is not overfit as can be seen on the graph above the difference is below 5%.
Based on the three model we use to classifies the data above. we can conclude that deep learning model produce the best accuracy by >80% on the validation data.
We has achieving our goal to make a model that can classifies hate speech tweets with accuracy above 80% and thus, we can conclude that the problem to identify hate speech tweets can be solved with machine learning.