Introduction

Background

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

Dataset

The dataset is consist of Twitter tweets from Kaggle

Library and Setup

library(inspectdf)
library(caret)
library(stringr)
library(magrittr)
library(tidyverse)
library(tidytext)
library(tm)
library(SnowballC)
library(textclean)
library(tokenizers)
library(wordcloud)
library(keras)
library(RColorBrewer)
library(tictoc)

Import Data

train <- read.csv(file = "data/train.csv")
test <- read.csv(file = "data/test.csv")

glimpse(train)

## Rows: 31,962
## Columns: 3
## $ id    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1~
## $ label <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0~
## $ tweet <chr> " @user when a father is dysfunctional and is so selfish he drag~

glimpse(test)

## Rows: 17,197
## Columns: 2
## $ id    <int> 31963, 31964, 31965, 31966, 31967, 31968, 31969, 31970, 31971, 3~
## $ tweet <chr> "#studiolife #aislife #requires #passion #dedication #willpower ~

Data Wrangling

The data consist of more than 31 thousands tweets and has three column which is the id, tweets, and label. we will remove id column, mutate type of label column, and change the name of column tweet for more readability.

train <- train %>% 
  rename(text = tweet) %>% 
  select(-id) %>% 
  mutate(label = as.factor(label))

test <- test %>% 
  rename(text = tweet) %>% 
  select(-id)

Missing and Duplicates value

colSums(is.na(train))

## label  text 
##     0     0

# remove duplicated rows
train <- train[!duplicated(train), ]
train %>% duplicated() %>% sum() # total duplicated rows

## [1] 0

There is no missing values from the data
We have removed some duplicated rows

Exploratory Data Analysis (EDA)

Inspect Category

train %>% 
   inspect_cat() %>% 
   show_plot()

Due to the tendency of our model to predict non hate speech tweet. we will downsample the non hate speech tweet data.

train_upsam <- upSample(x = train %>% select(-label),
                        y = train$label,
                        yname = "label")
train_upsam %>% 
   inspect_cat() %>% 
   show_plot()

Data Preprocessing

Text Cleansing

First we perform text cleansing to standardize the data using textclean package. in this process we wil: * remove any url, emoji, or html tag * replace all digit numbers, tags, username or USER tag,and punctuation. * replace all slang word.

text_cleansing <- function(data) {
  new_data <- data %>% 
      mutate(
          text = text %>%
            replace_url()  %>% 
            replace_emoji() %>%
            replace_emoticon() %>%
            replace_html() %>% 
            str_to_lower() %>% # transform to lowercase
            str_remove_all("@([0-9a-zA-Z_]+)") %>% # remove username
            str_remove_all('[\\#]+') %>% # remove hashtag
            str_remove_all('[\\!]+') %>% 
            str_remove_all('[\\&]+') %>% 
            str_remove_all('[\\"]+') %>% 
            replace_internet_slang() %>% 
            str_remove_all(pattern = "[[:digit:]]") %>% # remove number
            str_remove_all(pattern = "[[:punct:]]") %>% # remove all punctuation except !,&,""
            str_squish() # extra white space remove
        ) %>% 
    select(text, label)
  return(new_data)
}
# data <- text_cleansing(train_upsam)
# test$label <- 0
# data_test <- text_cleansing(test)
# saveRDS(data,"data/data.rds")
# saveRDS(data_test,"data/data_test.rds")

Stemming and Cleaning Data Corpus

In this process we will:

change tweet format into corpus and do some more cleaning per corpus and do stemming which is replace the word with its stem word e.g. “reading” into the word “read”.
we also will remove some words that exists in both hate speech and non hate speech catgegories. the words is determine based on observations on word cloud on both train and test dataset. also from the result of prediction on validation dataset.
remove stopwords.

First we create 2 custom function. the first is corpus_cleansing to clean corpus data using package tm.

corpus_cleansing <- function(data) {
  words.to.remove <- c("user","url","rt")
  data_corpus <- data$text %>% 
      VectorSource() %>% 
      VCorpus(readerControl = list(language="en")) %>%
      tm_map(removeWords, words.to.remove) %>%
      tm_map(removeWords, stopwords("en")) %>%
      tm_map(stemDocument) %>% 
      tm_map(stripWhitespace) %>% 
      sapply(as.character) %>%
      as.data.frame(stringsAsFactors = FALSE)
  data_clean <- bind_cols(data_corpus, data[,2] )%>%
    `colnames<-`(c("text","label"))
  return(data_clean)
}
# data_clean <- corpus_cleansing(data)
# data_test_clean <- corpus_cleansing(data_test)
# saveRDS(data_clean,"data/data_clean.rds")
# saveRDS(data_test_clean,"data/data_test_clean.rds")

data_clean <- readRDS("data/data_clean.rds")
data_test_clean <- readRDS("data/data_test_clean.rds")

Most common words by frequency

The following word cloud is the most common words in train dataset both for hate speech and non hate speech. But first we have to change the words into matrix of corpus by creating a custom function.

# create custom function to 
word_freq_table <- function(data_text){
  text <- data_text %>% as.character()
  docs <- Corpus(VectorSource(text))
  dtm <- tm::TermDocumentMatrix(docs)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  df <- data.frame(word = names(v),freq=v)
  return(df)
}

Hate speech

tweets <-  data_clean %>% filter(label == 1) %>% select(text)
common_words <- word_freq_table(tweets) 
wordcloud(words = common_words$word, freq = common_words$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Non Hate speech

tweets <-  data_clean %>% filter(label == 0) %>% select(text)
common_words <- word_freq_table(tweets) 
wordcloud(words = common_words$word, freq = common_words$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Tokenizer

In this process we will separate each words into tokens. First, We have to combine data train and data test for tokenizer

data_all <- rbind(data_clean, data_test_clean)

then we determine the total number of words that will be included in the training. for this case we will pick 1024 most frequent words and remove others.

num_words_train <- 1024

# prepare tokenizers
tokens <- text_tokenizer(num_words = num_words_train,
                            lower = TRUE) %>% 
  fit_text_tokenizer(data_all$text)

length(tokens$word_counts)

## [1] 43965

Total tokens/words that will be used for training is more than 43 thousands tokens.

Cross-Validation

Splitting the data into data_train dan data_test with proportion 80% training : 20% test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# train-test splitting
index <- sample(1:nrow(data_clean), nrow(data_clean)*0.80)

data_train <- data_clean[index,] 
data_val <- data_clean[-index,]

maxlen <- max(str_count(data_clean$text, "\\w+")) + 1 

# prepare data for predictors x
data_train_x <- texts_to_sequences(tokens, data_train$text) %>%
  pad_sequences(maxlen = maxlen) 
data_val_x <- texts_to_sequences(tokens, data_val$text) %>%
  pad_sequences(maxlen = maxlen) 
data_test_x <- texts_to_sequences(tokens, data_test_clean$text) %>%
  pad_sequences(maxlen = maxlen)

# prepare data for target y
data_train_y <- to_categorical(data_train$label, num_classes = 2)
data_val_y <- to_categorical(data_val$label, num_classes = 2)

Cek proporsi kelas target pada label_train dan label_test:

prop.table(table(data_train_y))

## data_train_y
##   0   1 
## 0.5 0.5

prop.table(table(data_val_y))

## data_val_y
##   0   1 
## 0.5 0.5

Model Building

for model selection on this dataset. we will use Deep Learning using LSTM (Long Short-Term Memory) model.

for the measurement and evaluation we will use Accuracy metric. because we want to be able to differentiate both hate speech and non hate speech tweets equally.

Our goal is to make a model that has accuracy above 80%

Deep Learning (LSTM)

Building Architecture

RNGkind(sample.kind = "Rounding")

initializer <- initializer_random_normal(seed = 100)

model_DL <- keras_model_sequential()

# layer lstm 1 settings
unit_lstm1 <- 256
dropout_lstm1 <- 0.5
recurrent_dropout_lstm1 <- 0.5

# layer lstm 2 settings
unit_lstm2 <- 32
dropout_lstm2 <- 0.5
recurrent_dropout_lstm2 <- 0.5

model_DL %>%
  layer_embedding(
    name = "input",
    input_dim = num_words_train,
    input_length = maxlen,
    output_dim = maxlen
  ) %>%
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.6
  ) %>%
   # lstm1
  layer_lstm(
    name = "lstm1",
    units = unit_lstm1,
    dropout = dropout_lstm1,
    recurrent_dropout = recurrent_dropout_lstm1,
    return_sequences = TRUE
  ) %>%
  # lstm2
  layer_lstm(
    name = "lstm2",
    units = unit_lstm2,
    dropout = dropout_lstm2,
    recurrent_dropout = recurrent_dropout_lstm2,
    return_sequences = FALSE
  ) %>%
  # output layer
  layer_dense(
    name = "output",
    units = 2,
    activation = "sigmoid"
  )

# Compile Model
model_DL %>% 
  compile(optimizer = optimizer_adam(learning_rate = 0.01),
          metrics = "accuracy",
          loss = "binary_crossentropy")

# model summary
summary(model_DL)

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 24, 24)                  24576       
## ________________________________________________________________________________
## embedding_dropout (Dropout)         (None, 24, 24)                  0           
## ________________________________________________________________________________
## lstm1 (LSTM)                        (None, 24, 256)                 287744      
## ________________________________________________________________________________
## lstm2 (LSTM)                        (None, 32)                      36992       
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       66          
## ================================================================================
## Total params: 349,378
## Trainable params: 349,378
## Non-trainable params: 0
## ________________________________________________________________________________

Model Training

epochs <- 10
batch_size <- 2048

# fit the model
tic()
history <- model_DL %>%
  fit(
    data_train_x,
    data_train_y,
    batch_size = batch_size,
    epochs = epochs,
    verbose = 1,
    validation_data = list(data_val_x, data_val_y))

# history plot
plot(history)

toc()

## 401.79 sec elapsed

save(model_DL, file="model_DL.Rdata")

Model Evaluation - Deep Learning

data_val_pred <- model_DL %>%
  predict_classes(data_val_x) %>%
  as.vector()

# predict on test
data_test_pred <- model_DL %>%
  predict_classes(data_test_x) %>%
  as.vector()

conf.matrix <-  confusionMatrix(
  factor(data_val_pred, labels = c("no", "yes")),
  factor(data_val$label, labels = c("no", "yes")),
  positive = "yes"
)


conf.matrix <-  confusionMatrix(
  factor(data_val_pred, labels = c("no", "yes")),
  factor(data_val$label, labels = c("no", "yes")),
  positive = "yes"
)
conf.matrix$byClass %>% round(digits = 3) * 100

##          Sensitivity          Specificity       Pos Pred Value 
##                 96.0                 87.1                 88.2 
##       Neg Pred Value            Precision               Recall 
##                 95.6                 88.2                 96.0 
##                   F1           Prevalence       Detection Rate 
##                 92.0                 50.1                 48.1 
## Detection Prevalence    Balanced Accuracy 
##                 54.5                 91.6

and based on the criteria for overfit which is >5% difference between accuracy of training and accuracy of validation result. the model is not overfit as can be seen on the graph above the difference is below 5%.

Conclusion

Based on the three model we use to classifies the data above. we can conclude that deep learning model produce the best accuracy by >80% on the validation data.

We has achieving our goal to make a model that can classifies hate speech tweets with accuracy above 80% and thus, we can conclude that the problem to identify hate speech tweets can be solved with machine learning.

Sentiment Analysis

Agung Atidhira

October 1, 2021