---
title: "R Notebook"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

# William Hilmy Susatyo

# Final Report for Introduction to Text as Data Course

# Unmasking Hate: A Comprehensive Analysis of Hate Speech Text Dataset - Unveiling Patterns, Trends, and Insights 

## Short Motivation
Indonesia, a diverse and culturally rich nation, has been grappling with a rise in hate speech in recent years. Hate speech has become a serious social issue that threatens the country's harmony, unity, and democratic values. With the proliferation of social media platforms and the ease of information dissemination, the spread of hate speech has intensified, leading to real-world consequences such as violence, discrimination, and social division.

The motivation behind conducting a comprehensive analysis on the Indonesian Hate Speech Dataset lies in the urgent need to understand the dynamics, patterns, and underlying drivers of hate speech in the country. This analysis aims to shed light on the key actors, target communities, prevalent themes, and the factors that amplify hate speech online. By examining the actual conditions currently happening in Indonesia, this research seeks to offer actionable insights to policymakers, law enforcement agencies, and social media platforms to formulate effective strategies to counter hate speech and foster a more inclusive online environment. Understanding the nuanced nature of hate speech in the Indonesian context will be crucial in developing tailored interventions to address this pressing issue.

Furthermore, this analysis will contribute to academia and civil society efforts in raising awareness about the detrimental impact of hate speech on social cohesion and democratic discourse. It will empower researchers, activists, and educators to design educational campaigns and community engagement initiatives to promote digital literacy and responsible online behavior.

## Research Question
Based on the motivation explained in the previous section, we pose the following questions, which form the main objectives of this study:

1. How do lemmatization and POS tagging impact the performance of Naive Bayes classification in identifying hate speech within the Hate Speech Dataset?

2. What are the predominant topics or themes present in the Hate Speech Dataset, and how do they correlate with the results obtained from the Naive Bayes classification model?

3. To what extent does the application of topic modeling on the Hate Speech Dataset improve the interpretability and understanding of the underlying patterns and associations in the identified hate speech content, and how does it compare to the results obtained from Naive Bayes classification? 

## Analysis
In this study, we apply several of the NLP techniques and approaches discussed in class. The dataset consists of approximately 13,000 Indonesian-language tweets and can be obtained from the following link: https://www.kaggle.com/datasets/ilhamfp31/indonesian-abusive-and-hate-speech-twitter-text. The dataset is stored in CSV format and is already annotated with several labels, namely 'HS' (which stands for Hate Speech), 'Abusive', 'HS_Individual', 'HS_Group', 'HS_Religion', 'HS_Race', 'HS_Physical', 'HS_Gender', 'HS_Other', 'HS_Weak', 'HS_Moderate', and 'HS_Strong'. However, only the 'HS' and 'Abusive' labels are considered in this study. In addition, the creator of the dataset provides two word lists: the first is a correction dictionary (used to replace a non-standard spelling with the standard word), and the second is used here as a stopword list. We also use an additional stopword list from the following link: https://www.kaggle.com/datasets/oswinrh/indonesian-stoplist


###   Data Acquisition

In the following cell, all datasets from the aforementioned sources are imported using the *read.csv* command. Subsequently, a new column named *doc_id* is added to the main dataset, since it is needed to join results back to the tweets in later steps.
```{r}
df = read.csv('/Users/whs9801/Downloads/archive (4) 2/data.csv')
dict1 = read.csv('/Users/whs9801/Downloads/archive (4) 2/new_kamusalay.csv')
dict2 = read.csv('/Users/whs9801/Downloads/archive (4) 2/abusive.csv')
stopwords = read.csv('/Users/whs9801/Downloads/stopwords/stopwordbahasa.csv')
df$doc_id <- 1:nrow(df)
```

The first 10 rows are shown below to give an impression of the raw dataset.
```{r}
head(df,10)
```

Similarly, the first 10 rows of the correction dictionary and of the basic stopword list (*abusive.csv*) are shown in the next two cells.
```{r}
head(dict1,10)
```
```{r}
head(dict2, 10)
```

### Exploratory Data Analysis

The command in the following cell gives a first overview of the dataset: descriptive statistics for the numerical columns and the length and type of the character columns.
```{r}
summary(df)
```

The *str* command displays the structure of an R object, here the type and the first few values of each column in the dataset.
```{r}
str(df)
```

The names of all columns are shown in the cell below. As can be seen, the initial dataset has 13 columns (excluding the *doc_id* column that was just created).
```{r}
names(df)
```

The following cell shows that there are no missing values in the initial dataset.
```{r}
sum(is.na(df))
```

Subsequently, we inspect the distribution of each label column in the dataset. As can be observed, several columns are highly imbalanced, such as *HS_Religion*, *HS_Race*, *HS_Physical*, and *HS_Gender*.

```{r}
table(df$HS)
table(df$Abusive)
table(df$HS_Individual)
table(df$HS_Group)
table(df$HS_Religion)
table(df$HS_Race)
table(df$HS_Physical)
table(df$HS_Gender)
table(df$HS_Other)
```
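
The imbalance noted above can also be expressed as the share of positive rows per label. The following is a minimal sketch on a made-up data frame: `toy_labels`, its values, and the 20% threshold are assumptions for illustration and are not taken from the dataset.

```r
# Toy stand-in for the label columns; the 0/1 values are invented for illustration.
toy_labels <- data.frame(
  HS          = c(1, 0, 1, 0, 1, 0, 0, 1),
  Abusive     = c(1, 1, 0, 0, 1, 0, 0, 0),
  HS_Religion = c(0, 0, 1, 0, 0, 0, 0, 0)
)

# For 0/1 columns, the column mean is the share of positive rows.
positive_share <- colMeans(toy_labels)
print(round(positive_share, 3))

# Flag labels whose positive class is rarer than a hypothetical 20% threshold.
imbalanced <- names(positive_share)[positive_share < 0.20]
print(imbalanced)
```

On the real data, the same idea would apply to the label columns of `df`, assuming they are coded as 0/1.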

In the cell below, the *dplyr* and *stringr* libraries are installed and loaded, since their functions are used throughout the following cells for data manipulation and string handling in the preprocessing steps.

```{r}
# Install the packages only when they are not already available
for (pkg in c("dplyr", "stringr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg, repos = "https://cloud.r-project.org")
}

library(dplyr)
library(stringr)

```
In the following cell, the encoding of the tweets is checked. It turns out that the text is not yet marked as *UTF-8*. This is a problem because several later steps expect UTF-8 input, such as applying the correction dictionary to the tweets and performing lemmatization with the *udpipe* library.

```{r}
library(stringi)
table(Encoding(df$Tweet))
```

### Preprocessing
In this cell, the *Tweet* column of the dataset and both columns of the correction dictionary are converted to UTF-8. Subsequently, every word in the tweets that appears in the *anakjakartaasikasik* column is replaced by the corresponding word in the *anak.jakarta.asyik.asyik* column.

```{r}
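df$Tweet <- stri_encode(df$Tweet, "", "UTF-8")
dict1$anakjakartaasikasik <- stri_encode(dict1$anakjakartaasikasik, "", "UTF-8")
dict1$anak.jakarta.asyik.asyik <- stri_encode(dict1$anak.jakarta.asyik.asyik, "", "UTF-8")
# gsub() uses only the first element of a vector pattern, and its result was never stored.
# Replace each slang form (matched literally, as a whole word) and assign the result back.
slang_patterns <- paste0("\\b\\Q", dict1$anakjakartaasikasik, "\\E\\b")
df$Tweet <- stri_replace_all_regex(df$Tweet, slang_patterns,
                                   dict1$anak.jakarta.asyik.asyik, vectorize_all = FALSE)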
```
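
The dictionary-based correction described above can be sketched in base R on toy data. The data frame `toy_dict`, its column names, and the helper `normalise_slang` are hypothetical and only illustrate why each dictionary entry needs its own replacement pass.

```r
# Hypothetical two-column slang dictionary (the real one is new_kamusalay.csv).
toy_dict <- data.frame(
  slang  = c("gk", "yg", "bgt"),
  formal = c("tidak", "yang", "banget"),
  stringsAsFactors = FALSE
)
toy_tweets <- c("gk suka yg itu", "bagus bgt", "ygk")

# gsub() with a vector pattern uses only its first element (and warns),
# so each dictionary entry is applied in its own pass.
normalise_slang <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    # \\b anchors keep "yg" from matching inside a longer token such as "ygk".
    pattern <- paste0("\\b", dict$slang[i], "\\b")
    text <- gsub(pattern, dict$formal[i], text, perl = TRUE)
  }
  text
}

normalised <- normalise_slang(toy_tweets, toy_dict)
print(normalised)
```

The whole-word anchors are a design choice: without them, short slang forms would also be replaced inside unrelated longer words.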


### Create Document Feature Matrix

In this step, the document-feature matrix (DFM) is created by converting the text into a corpus, tokenizing it, and then building the matrix from the processed tokens. Additional preprocessing can be performed during tokenization. In this case, the preprocessing removes punctuation, numbers, separators, symbols, and URLs, converts all words to lowercase, and removes the words in the provided word list. Moreover, we keep only the words that are at least 5 characters long.

```{r}
library(quanteda)

# dict2 is a one-column data frame; tokens_select() needs a character vector of words
dict2new <- as.character(dict2[[1]])
corpus <- corpus(df$Tweet)
corpus <- corpus_trim(corpus, min_ntoken = 5)
tokens <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_separators = TRUE, remove_symbols = TRUE,
                 remove_url = TRUE)
tokens <- tokens_tolower(tokens)
tokens <- tokens_select(tokens, pattern = dict2new, selection = "remove")
result_dfm <- dfm(tokens)
result_dfm <- dfm_select(result_dfm, min_nchar = 5)
```
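
To make the structure of a document-feature matrix concrete, the following base R sketch builds one by hand for three invented sentences. It is only an illustration of the idea (rows are documents, columns are features, cells are counts) and does not reproduce what *quanteda* does internally; `toy_docs` and the other names are assumptions.

```r
toy_docs <- c(d1 = "Pemerintah harus bertindak sekarang",
              d2 = "pemerintah daerah diam saja",
              d3 = "harus ada tindakan")

# Lowercase, split on whitespace, and keep tokens with at least 5 characters.
toy_tokens <- lapply(strsplit(tolower(toy_docs), "\\s+"),
                     function(tok) tok[nchar(tok) >= 5])

# Count every feature per document: rows are documents, columns are features.
features <- sort(unique(unlist(toy_tokens)))
toy_dfm <- t(vapply(toy_tokens,
                    function(tok) as.integer(table(factor(tok, levels = features))),
                    integer(length(features))))
colnames(toy_dfm) <- features
print(toy_dfm)
```

Short words such as "diam", "saja", and "ada" do not appear as columns because of the length filter, which mirrors the effect of `min_nchar = 5`.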

### Visualization using Textstat and Wordcloud
In the next two cells, the document-feature matrix is summarized with *textstat_frequency* and visualized as a word cloud. In the *textstat_frequency* output, the most frequent words in the dataset are listed in the *feature* column, and their counts are given in the *frequency* column. In the word cloud, the more frequently a word appears in the dataset, the larger it is drawn.

```{r}
library(quanteda.textstats)
textstat_frequency(result_dfm)[c(1:5), ]
```


```{r}
library(quanteda.textplots)
textplot_wordcloud(result_dfm)
```

In the cell below, the *udpipe* library is loaded and an Indonesian language model is downloaded for lemmatization, as specified by the *language* argument of the *udpipe_download_model* function.

```{r}
library(udpipe)
udmodel_ina <- udpipe_download_model(language = "indonesian-gsd")
udmodel_ina <- udpipe_load_model(file = udmodel_ina$file_model)
```


Since the text preprocessed in the cells above has been converted into a DFM, it cannot be written back to the initial data frame. Therefore, as demonstrated in the cell below, we perform similar preprocessing steps without converting the text into DFM format. The result is stored in the *Tweet* column of the dataset, which enables further analysis.

```{r}
library(tm)

tweets <- as.vector(df$Tweet)
tweets <- removeNumbers(tweets)
tweets <- tolower(tweets)
tweets <- removePunctuation(tweets)
tweets <- removeWords(tweets, c("user", "rt"))
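# stopwords is a one-column data frame, so the column is extracted before it is passed to removeWords()
tweets <- removeWords(tweets, as.character(stopwords[[1]]))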
df$Tweet <- tweets
```


### Lemmatization
Lemmatization is a natural language processing technique used to reduce words to their base or root form, called a "lemma." It helps to standardize different inflected forms of a word, such as plurals or verb tenses, to their original dictionary form. This aids in text analysis, search, and language understanding. In the following cells, lemmatization is performed on the preprocessed text.

```{r}
udi_ina <- udpipe_annotate(udmodel_ina, x = df$Tweet, doc_id = df$doc_id)
udi_ina <- as.data.frame(udi_ina)
```


As shown in the cell below, only the lemmas of nouns are kept, as specified by the *c("NOUN")* filter in the *subset* call.

```{r}
udi_ina_pos <- subset(udi_ina, upos %in% c("NOUN"))

library(dplyr)
# Collapse the noun lemmas of each tweet into one space-separated string (one row per tweet).
# The document identifier is stored as a character column named id for the join below.
udi_ina_lemma <- udi_ina_pos %>% 
  group_by(id = as.character(doc_id)) %>% 
  summarise(lemma_pos = paste0(lemma, collapse = " "), .groups = "drop")
```
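
The noun filter and the per-document collapse can be illustrated on a small invented annotation table. The sketch below uses base R only; `toy_anno` mimics the `doc_id`, `lemma`, and `upos` columns of the *udpipe* output, and all values are assumptions for illustration.

```r
# Toy annotation table shaped like udpipe output (values invented for illustration).
toy_anno <- data.frame(
  doc_id = c("1", "1", "1", "2", "2", "3"),
  lemma  = c("pemerintah", "bantu", "rakyat", "jalan", "rusak", "cepat"),
  upos   = c("NOUN", "VERB", "NOUN", "NOUN", "ADJ", "ADV"),
  stringsAsFactors = FALSE
)

# Keep nouns only, then paste the lemmas of each document into one string.
toy_nouns <- subset(toy_anno, upos == "NOUN")
toy_lemma <- aggregate(lemma ~ doc_id, data = toy_nouns, FUN = paste, collapse = " ")
names(toy_lemma)[2] <- "lemma_pos"

# Documents without any noun (doc 3) get NA after the left join.
toy_docs_df <- data.frame(doc_id = c("1", "2", "3"), stringsAsFactors = FALSE)
toy_joined <- merge(toy_docs_df, toy_lemma, by = "doc_id", all.x = TRUE)
print(toy_joined)
```

The `NA` for the third document shows how a left join on noun lemmas can introduce missing values for tweets that contain no noun.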

After lemmatization, the result is stored in the *lemma_pos* column of the *udi_ina_lemma* data frame. This column is then merged into the dataset using the *left_join* command.

```{r}
df$id <- as.character(df$doc_id)
df <- left_join(df, udi_ina_lemma, by="id")
```

```{r}
df
```


### Part-of-Speech Tagging

POS tagging, short for Part-of-Speech tagging, is a natural language processing technique that assigns grammatical labels (such as noun, verb, adjective) to each word in a sentence. By identifying the word's role in the sentence, POS tagging helps in various language analysis tasks, like syntactic parsing and information extraction.

```{r}
library(spacyr)
spacy_initialize(model = "en_core_web_sm")
```

In this case, the tagging is done with spaCy, a Python library, which is accessed from R through the *spacyr* package. spaCy analyzes each tweet and assigns grammatical labels to each word (e.g., noun, verb, adjective) and marks entity spans. Note that *en_core_web_sm* is an English model, so its output on Indonesian text should be interpreted with caution. In the cells below, the extracted entity text is collapsed into one string per tweet and then merged with the dataset using the *left_join* command, which links the tagged words to their respective tweets.

```{r}
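library(dplyr)
# corpus_entities_sub is not created anywhere in this notebook. As an assumption, it is
# rebuilt here with spacy_extract_entity(), whose output has the doc_id and text columns
# used below; any filtering applied in the original run is unknown.
corpus_entities_sub <- spacy_extract_entity(df$Tweet, output = "data.frame")
#entity text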
entity_agg <- corpus_entities_sub %>% 
  group_by(doc_id) %>% 
  mutate(entity = paste0(text, collapse = "; "))

entity_agg <- subset(entity_agg, select = c(doc_id, entity))

entity_agg$dupl <- duplicated(entity_agg$doc_id) #tag duplicated rows
entity_agg <- subset(entity_agg, dupl == FALSE)# select only unique rows
entity_agg <- subset(entity_agg, select = c(doc_id, entity))
```

```{r}
entity_agg <- rename(entity_agg, id = doc_id)
entity_agg$id <- gsub("text", "", entity_agg$id)
class(entity_agg$id)
articles_en <- left_join(df, entity_agg, by="id")
```


```{r}
final_df <- left_join(df, entity_agg, by="id")
final_df
```

### Perform Hate Speech Classification using Naive Bayes Model (Supervised Learning)

In the next three cells, the dataset is split into training and test sets by random sampling: 75% of the documents are used for training and the remaining 25% are used to evaluate the performance of the algorithm.

```{r}
set.seed(42)
id_train <- sample(1:nrow(df), floor(.75 * nrow(df)))
id_test <- (1:nrow(df))[1:nrow(df) %in% id_train == FALSE]
```
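
A quick way to check such a split is to confirm that the two index sets are disjoint and together cover every document. The following sketch uses a hypothetical corpus of 20 documents; `n_docs`, `toy_train`, and `toy_test` are illustrative names.

```r
set.seed(42)
n_docs <- 20                                   # hypothetical corpus size
toy_train <- sample(seq_len(n_docs), floor(0.75 * n_docs))
toy_test  <- setdiff(seq_len(n_docs), toy_train)

# The two index sets must be disjoint and together cover every document.
length(toy_train)
length(toy_test)
length(intersect(toy_train, toy_test))
```

`setdiff` is used here as a more readable alternative to indexing with `%in%`; both select the same test indices.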


```{r}
corpus <- corpus(final_df$Tweet)
tokens <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_separators = TRUE, remove_symbols = TRUE,
                 remove_url = TRUE)
tokens <- tokens_select(tokens, pattern = dict2new, selection = "remove")
# stopwords is a one-column data frame; pass its column as a character vector
tokens <- tokens_select(tokens, pattern = as.character(stopwords[[1]]), selection = "remove")
dfm <- dfm(tokens)
dfm <- dfm_select(dfm, min_nchar = 5)
dfm
```

```{r}
textstat_frequency(dfm)[c(1:5), ]
```

#### Hate Speech


```{r}
dfm$id <- 1:nrow(df)

dfm_train <- dfm_subset(dfm, id %in% id_train)


# get test set (by using the ! you indicate that you select documents not in id_train)
dfm_test <- dfm_subset(dfm, !id %in% id_train)

HS_train <- subset(df, doc_id %in% id_train, select = 'HS')
HS_test <- subset(df, !doc_id %in% id_train, select = 'HS')

```


In the cell below, we fit the Naive Bayes model to the training set, use the fitted model to predict the labels of the test set, and then summarize the predictions.

```{r}
library(quanteda.textmodels)
model.NB <- textmodel_nb(dfm_train, as.matrix(HS_train))
pred.nb <- predict(model.NB, dfm_test, force = TRUE)
summary(pred.nb)
```
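
To show what a multinomial Naive Bayes classifier computes, the following base R sketch trains one on a tiny invented count matrix. It is a simplified illustration under stated assumptions (add-one smoothing, priors taken from class frequencies) and is not the implementation used by *textmodel_nb*; `train_nb`, `predict_nb`, and the toy data are hypothetical.

```r
# Toy document-feature counts (rows = documents, columns = features); values invented.
toy_x <- matrix(c(2, 0, 1,
                  1, 0, 2,
                  0, 2, 0,
                  0, 1, 1),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("d", 1:4), c("benci", "terima", "kamu")))
toy_y <- c(1, 1, 0, 0)   # 1 = hate speech, 0 = not

train_nb <- function(x, y, smooth = 1) {
  classes <- sort(unique(y))
  # Per-class feature totals with add-one (Laplace) smoothing, as log-probabilities.
  log_lik <- t(sapply(classes, function(cl) {
    counts <- colSums(x[y == cl, , drop = FALSE]) + smooth
    log(counts / sum(counts))
  }))
  rownames(log_lik) <- classes
  # Class priors from the share of documents in each class.
  list(log_prior = log(table(y) / length(y)), log_lik = log_lik)
}

predict_nb <- function(model, newx) {
  # Score = log prior + sum over features of count * log P(feature | class).
  scores <- newx %*% t(model$log_lik)
  scores <- sweep(scores, 2, as.numeric(model$log_prior), "+")
  colnames(scores)[max.col(scores, ties.method = "first")]
}

toy_model <- train_nb(toy_x, toy_y)
toy_new <- matrix(c(1, 0, 1,
                    0, 2, 0), nrow = 2, byrow = TRUE,
                  dimnames = list(c("n1", "n2"), colnames(toy_x)))
toy_pred <- predict_nb(toy_model, toy_new)
print(toy_pred)
```

The first new document shares its words with the class-1 training documents and the second with the class-0 documents, so the predicted labels are "1" and "0".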


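In the next three cells, we manually encode each test document as a True Positive (TP, positive and retrieved), False Negative (FN, positive but not retrieved), or False Positive (FP, not positive but retrieved). True Negatives (TN) are not needed for the metrics below.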
```{r}
HS_test$id <- 1:nrow(HS_test)
HS_test$polarity_nb <- pred.nb
HS_test
```

```{r}
HS_test$Positive_andRetrieved <- HS_test$HS
HS_test$Positive_notRetrieved <- HS_test$HS
HS_test$notPositive_butRetrieved <- HS_test$HS

HS_test$Positive_andRetrieved[HS_test$polarity_nb == 1 & HS_test$HS == 1 ] <- 1
HS_test$Positive_andRetrieved[!HS_test$polarity_nb == 1 | !HS_test$HS == 1] <- 0

HS_test$Positive_notRetrieved[HS_test$polarity_nb == 0 & HS_test$HS == 1 ] <- 1
HS_test$Positive_notRetrieved[!HS_test$polarity_nb == 0 | !HS_test$HS == 1] <- 0

HS_test$notPositive_butRetrieved[HS_test$polarity_nb == 1 & HS_test$HS == 0 ] <- 1
HS_test$notPositive_butRetrieved[!HS_test$polarity_nb == 1 | !HS_test$HS == 0] <- 0
```

```{r}
HS_test$Positive_andRetrieved <- as.numeric(HS_test$Positive_andRetrieved)
HS_test$Positive_notRetrieved <- as.numeric(HS_test$Positive_notRetrieved)
HS_test$notPositive_butRetrieved <- as.numeric(HS_test$notPositive_butRetrieved)
```

In the three cells below, we compute the *Recall*, *Precision*, and *F1 Score* of the Naive Bayes model from the manually encoded test set. Note that the *F1 Score* depends on both *Recall* and *Precision*: it is their harmonic mean, given by the following formula:

$$
F1Score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} 
$$ 

which implies
$$
F1Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
$$

```{r}
recall_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$Positive_notRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))

recall_pos
```

```{r}
precision_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$notPositive_butRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))
precision_pos 
```

```{r}
F1 <- (2 * precision_pos * recall_pos)/(precision_pos + recall_pos)
F1
```
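
As a worked example of these formulas, the following sketch computes the three metrics directly from two invented label vectors. The vectors `truth` and `pred` are assumptions for illustration and are unrelated to the actual predictions.

```r
# Invented gold labels and predictions for ten documents (1 = positive class).
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred  <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

tp <- sum(pred == 1 & truth == 1)   # positive and retrieved
fn <- sum(pred == 0 & truth == 1)   # positive but not retrieved
fp <- sum(pred == 1 & truth == 0)   # not positive but retrieved

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, f1 = f1))
```

With 3 true positives, 1 false negative, and 2 false positives, precision is 0.6, recall is 0.75, and the F1 Score is 2/3, which matches the harmonic mean form of the formula.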

In the cell below, two functions that plot a heatmap of the confusion matrix are defined. Both rely on the *ggplot2* library. The first function plots the heatmap based on the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) in the test set, while the second shows each cell as a proportion of the documents with the same true label (each row of the confusion matrix sums to 1).

```{r}

plot_conf_mat <- function(reviews_test, color, filename){ # color and output filename can be chosen freely
  # Cross-tabulate true labels (rows) against predicted labels (columns)
  conf_mat <- table(reviews_test$HS, reviews_test$polarity_nb)
  
  conf_mat <- as.data.frame(conf_mat, stringsAsFactors = TRUE)
  
  p <- ggplot(conf_mat, aes(Var1, Var2, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = Freq)) +
    labs(x = "True labels", y = "Predicted labels",
         fill = "Count") +
    theme(plot.title = element_text(size = 16, hjust = 0.5, 
                                    margin = margin(20, 0, 20, 0)),
          legend.title = element_text(size = 12, margin = margin(0, 20, 10, 0)),
          axis.title.x = element_text(margin = margin(20, 20, 20, 20), size = 12),
          axis.title.y = element_text(margin = margin(0, 20, 0, 10), size = 12),
          legend.key.size = unit(1, 'cm')
    ) +
    # No fixed limits here: counts exceed 1, and values outside the limits would be drawn as missing
    scale_fill_gradient(low = "white", high = color)
  
  print(p)
  
  ggsave(filename, plot = p, dpi = 500, width = 5.5, height = 4)
}

plot_conf_mat_normalized <- function(reviews_test, color, filename){ #can set the wanted color and output filename
  # data might need to be formatted to go from numbers to text labels - see remodeling function
  conf_mat <- table(reviews_test$HS, reviews_test$polarity_nb)
  
  conf_mat <- conf_mat / rowSums(conf_mat)
  
  conf_mat <- as.data.frame(conf_mat, stringsAsFactors = TRUE)
  
  p <- ggplot(conf_mat, aes(Var1, Var2, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = round(Freq,2))) +
    labs(x = "True labels", y = "Predicted labels",
         fill = "Proportion") +
    theme(plot.title = element_text(size = 16, hjust = 0.5, 
                                    margin = margin(20, 0, 20, 0)),
          legend.title = element_text(size = 12, margin = margin(0, 20, 10, 0)),
          axis.title.x = element_text(margin = margin(20, 20, 20, 20), size = 12),
          axis.title.y = element_text(margin = margin(0, 20, 0, 10), size = 12),
          legend.key.size = unit(1, 'cm')
    ) +
    scale_fill_gradient(low="white", high=color, limits = c(0,1)) 
  
  print(p)
  
  ggsave(filename,dpi=500, width = 5.5, height = 4)
}
```

The cell below applies the first confusion matrix function to the prediction results.
```{r}
library(ggplot2)
plot_conf_mat(HS_test, "blue", "confusion_matrix.png")
```

The cell below applies the second (normalized) confusion matrix function to the prediction results. A separate file name is used so that the first plot is not overwritten.
```{r}
plot_conf_mat_normalized(HS_test, "blue", "confusion_matrix_normalized.png")
```
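
The difference between the count-based and the normalized confusion matrix can be seen on a small invented example. The sketch below uses base R's `prop.table` rather than the plotting functions above; the label vectors are assumptions for illustration.

```r
truth <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0), levels = c(0, 1))

counts <- table(truth = truth, predicted = pred)
# margin = 1 divides each row by its total, so every true-label row sums to 1.
shares <- prop.table(counts, margin = 1)
print(counts)
print(round(shares, 2))
```

Row normalization makes the two classes comparable even when one of them is much rarer, which is why it is useful for imbalanced labels.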

#### Abusive
```{r}
dfm$id <- 1:nrow(df)

dfm_train <- dfm_subset(dfm, id %in% id_train)


# get test set (by using the ! you indicate that you select documents not in id_train)
dfm_test <- dfm_subset(dfm, !id %in% id_train)

HS_train <- subset(df, doc_id %in% id_train, select = 'Abusive')
HS_test <- subset(df, !doc_id %in% id_train, select = 'Abusive') 
```

```{r}
library(quanteda.textmodels)
model.NB <- textmodel_nb(dfm_train, as.matrix(HS_train))
pred.nb <- predict(model.NB, dfm_test, force = TRUE)
summary(pred.nb)
```

```{r}
HS_test$id <- 1:nrow(HS_test)
HS_test$polarity_nb <- pred.nb
HS_test
```


```{r}
HS_test$Positive_andRetrieved <- HS_test$Abusive
HS_test$Positive_notRetrieved <- HS_test$Abusive
HS_test$notPositive_butRetrieved <- HS_test$Abusive

HS_test$Positive_andRetrieved[HS_test$polarity_nb == 1 & HS_test$Abusive == 1 ] <- 1
HS_test$Positive_andRetrieved[!HS_test$polarity_nb == 1 | !HS_test$Abusive == 1] <- 0

HS_test$Positive_notRetrieved[HS_test$polarity_nb == 0 & HS_test$Abusive == 1 ] <- 1
HS_test$Positive_notRetrieved[!HS_test$polarity_nb == 0 | !HS_test$Abusive == 1] <- 0

HS_test$notPositive_butRetrieved[HS_test$polarity_nb == 1 & HS_test$Abusive == 0 ] <- 1
HS_test$notPositive_butRetrieved[!HS_test$polarity_nb == 1 | !HS_test$Abusive == 0] <- 0
```

```{r}
HS_test$Positive_andRetrieved <- as.numeric(HS_test$Positive_andRetrieved)
HS_test$Positive_notRetrieved <- as.numeric(HS_test$Positive_notRetrieved)
HS_test$notPositive_butRetrieved <- as.numeric(HS_test$notPositive_butRetrieved)
```

```{r}
recall_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$Positive_notRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))

recall_pos
```

```{r}
precision_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$notPositive_butRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))
precision_pos 
```

```{r}
F1 <- (2 * precision_pos * recall_pos)/(precision_pos + recall_pos)
F1
```

#### Hate Speech - After Missing Value Removal
```{r}
df <- na.omit(final_df)
sum(is.na(df))
nrow(df)
```

```{r}
set.seed(42)
id_train <- sample(1:nrow(df), floor(.75 * nrow(df)))
id_test <- (1:nrow(df))[1:nrow(df) %in% id_train == FALSE]
```

```{r}
df$index <- 1:nrow(df)
```

```{r}
corpus <- corpus(df$Tweet)
tokens <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_separators = TRUE, remove_symbols = TRUE,
                 remove_url = TRUE)
tokens <- tokens_select(tokens, pattern = dict2new, selection = "remove")
# stopwords is a one-column data frame; pass its column as a character vector
tokens <- tokens_select(tokens, pattern = as.character(stopwords[[1]]), selection = "remove")
dfm <- dfm(tokens)
dfm <- dfm_select(dfm, min_nchar = 5)
dfm
```
```{r}
dfm$id <- 1:nrow(df)

dfm_train <- dfm_subset(dfm, id %in% id_train)

# get test set (by using the ! you indicate that you select documents not in id_train)
dfm_test <- dfm_subset(dfm, !id %in% id_train)

HS_train <- subset(df, index %in% id_train, select = 'HS')
HS_test <- subset(df, index %in% id_test, select = 'HS')
```

```{r}
library(quanteda.textmodels)
model.NB <- textmodel_nb(dfm_train, as.matrix(HS_train))
pred.nb <- predict(model.NB, dfm_test, force = TRUE)
summary(pred.nb)
```


```{r}
HS_test$id <- 1:nrow(HS_test)
HS_test$polarity_nb <- pred.nb
HS_test
```
```{r}
HS_test$Positive_andRetrieved <- HS_test$HS
HS_test$Positive_notRetrieved <- HS_test$HS
HS_test$notPositive_butRetrieved <- HS_test$HS

HS_test$Positive_andRetrieved[HS_test$polarity_nb == 1 & HS_test$HS == 1 ] <- 1
HS_test$Positive_andRetrieved[!HS_test$polarity_nb == 1 | !HS_test$HS == 1] <- 0

HS_test$Positive_notRetrieved[HS_test$polarity_nb == 0 & HS_test$HS == 1 ] <- 1
HS_test$Positive_notRetrieved[!HS_test$polarity_nb == 0 | !HS_test$HS == 1] <- 0

HS_test$notPositive_butRetrieved[HS_test$polarity_nb == 1 & HS_test$HS == 0 ] <- 1
HS_test$notPositive_butRetrieved[!HS_test$polarity_nb == 1 | !HS_test$HS == 0] <- 0
```

```{r}
HS_test$Positive_andRetrieved <- as.numeric(HS_test$Positive_andRetrieved)
HS_test$Positive_notRetrieved <- as.numeric(HS_test$Positive_notRetrieved)
HS_test$notPositive_butRetrieved <- as.numeric(HS_test$notPositive_butRetrieved)
```

```{r}
recall_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$Positive_notRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))

recall_pos
```

```{r}
precision_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$notPositive_butRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))
precision_pos 
```

```{r}
F1 <- (2 * precision_pos * recall_pos)/(precision_pos + recall_pos)
F1
```

```{r}
library(ggplot2)
plot_conf_mat(HS_test, "blue", "confusion_matrix_hs_omitted.png")
```

```{r}
plot_conf_mat_normalized(HS_test, "blue", "confusion_matrix_hs_omitted_normalized.png")
```

#### Abusive - After Missing Value Removal 
```{r}
set.seed(42)
id_train <- sample(1:nrow(df), floor(.75 * nrow(df)))
id_test <- (1:nrow(df))[1:nrow(df) %in% id_train == FALSE]
```

```{r}
df$index <- 1:nrow(df)
```

```{r}
dfm$id <- 1:nrow(df)

dfm_train <- dfm_subset(dfm, id %in% id_train)

# get test set (by using the ! you indicate that you select documents not in id_train)
dfm_test <- dfm_subset(dfm, !id %in% id_train)

HS_train <- subset(df, index %in% id_train, select = 'Abusive')
HS_test <- subset(df, index %in% id_test, select = 'Abusive')
```
```{r}
library(quanteda.textmodels)
model.NB <- textmodel_nb(dfm_train, as.matrix(HS_train))
pred.nb <- predict(model.NB, dfm_test, force = TRUE)
summary(pred.nb)
```

```{r}
HS_test$id <- 1:nrow(HS_test)
HS_test$polarity_nb <- pred.nb
HS_test
```
```{r}
HS_test$Positive_andRetrieved <- HS_test$Abusive
HS_test$Positive_notRetrieved <- HS_test$Abusive
HS_test$notPositive_butRetrieved <- HS_test$Abusive

HS_test$Positive_andRetrieved[HS_test$polarity_nb == 1 & HS_test$Abusive == 1 ] <- 1
HS_test$Positive_andRetrieved[!HS_test$polarity_nb == 1 | !HS_test$Abusive == 1] <- 0

HS_test$Positive_notRetrieved[HS_test$polarity_nb == 0 & HS_test$Abusive == 1 ] <- 1
HS_test$Positive_notRetrieved[!HS_test$polarity_nb == 0 | !HS_test$Abusive == 1] <- 0

HS_test$notPositive_butRetrieved[HS_test$polarity_nb == 1 & HS_test$Abusive == 0 ] <- 1
HS_test$notPositive_butRetrieved[!HS_test$polarity_nb == 1 | !HS_test$Abusive == 0] <- 0
```

```{r}
HS_test$Positive_andRetrieved <- as.numeric(HS_test$Positive_andRetrieved)
HS_test$Positive_notRetrieved <- as.numeric(HS_test$Positive_notRetrieved)
HS_test$notPositive_butRetrieved <- as.numeric(HS_test$notPositive_butRetrieved)
```

```{r}
recall_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$Positive_notRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))

recall_pos
```

```{r}
precision_pos <- (sum(HS_test$Positive_andRetrieved, na.rm=TRUE))/(sum(HS_test$notPositive_butRetrieved, na.rm=TRUE) + (sum(HS_test$Positive_andRetrieved, na.rm=TRUE)))
precision_pos 
```

```{r}
F1 <- (2 * precision_pos * recall_pos)/(precision_pos + recall_pos)
F1
```

```{r}
library(ggplot2)
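# The plotting functions read the true labels from a column named HS, so Abusive is renamed first
plot_conf_mat(dplyr::rename(HS_test, HS = Abusive), "blue", "confusion_matrix_abusive_omitted.png")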
```


```{r}
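# The plotting functions read the true labels from a column named HS, so Abusive is renamed first
plot_conf_mat_normalized(dplyr::rename(HS_test, HS = Abusive), "blue", "confusion_matrix_abusive_omitted_normalized.png")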
```


### Perform Topic Modelling using Latent Dirichlet Allocation (Unsupervised Learning)

In this study, the *topicmodels* and *quanteda* libraries are used to perform topic modelling. Topic modelling is a natural language processing method that uncovers underlying themes within a text dataset. By analyzing the co-occurrence of words, it groups documents into distinct topics, offering insights into the main subjects covered in the data. This facilitates the organization and comprehension of large text collections.

The approach used for topic modelling in this study is Latent Dirichlet Allocation (LDA), a popular probabilistic model. It assumes that each document is a mixture of topics and that each topic is a probability distribution over words. By iteratively estimating these distributions, LDA reveals the latent structure within the text data.

```{r}
library(topicmodels)
library(quanteda)
```

Before performing topic modelling, we need to ensure that there are no missing values in the dataset. We therefore check the joined dataset *final_df*, which contains the lemmatization and POS tagging columns.
```{r}
sum(is.na(final_df))
```

It turns out that there are several missing values in the dataset. This is presumably a result of the lemmatization and POS tagging steps: not every tweet contains a noun to lemmatize or text that the tagger picks up, so the left joins leave those rows with missing values.

Hence, we remove the rows that contain missing values using the *na.omit* command.

```{r}
omitted_df <- na.omit(final_df)
sum(is.na(omitted_df))
nrow(omitted_df)
```

In the next three cells, we repeat the steps from the previous section: we preprocess the text, list the most frequent words, and visualize them using a word cloud.

```{r}
corpus <- corpus(omitted_df$Tweet)
tokens <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_separators = TRUE, remove_symbols = TRUE,
                 remove_url = TRUE)
tokens <- tokens_select(tokens, pattern = dict2new, selection = "remove")
# stopwords is a one-column data frame; pass its column as a character vector
tokens <- tokens_select(tokens, pattern = as.character(stopwords[[1]]), selection = "remove")
dfm_omit <- dfm(tokens)
dfm_omit <- dfm_select(dfm_omit, min_nchar = 5)
dfm_omit
```

```{r}
library(quanteda.textstats)
textstat_frequency(dfm_omit)[c(1:5),]
```

```{r}
library(quanteda.textplots)
textplot_wordcloud(dfm_omit)
```


In this section, after the missing values have been removed, we fit a Latent Dirichlet Allocation model, which is an unsupervised learning method. We use Gibbs sampling and set the *burnin* parameter to 100, *seed* to 123, *iter* to 500, and *K* to 10. This means that the model has 10 topics, and each document receives a probability for every topic, with the 10 probabilities summing to 1.


```{r}
K <- 10
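# Fit the model on dfm_omit, the matrix built from omitted_df, so that its rows line up with omitted_df below
lda <- LDA(dfm_omit, k = K, method = "Gibbs", 
                control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))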
```

```{r}
lda
```

After fitting the LDA model, we list the top 10 terms of each topic and then assign each document to the topic with the highest probability.

```{r}
top <- get_terms(lda, 10)
data.frame(top)
```
```{r}
library(dplyr)
library(tibble)

#get the topic probabilities per topic
topics <- posterior(lda)$topics %>% 
  as_tibble() %>% 
  rename_all(~paste0("Topic_", .))

# Function to determine the column with the highest number per row
getHighestColumn <- function(row) {
  column_names <- colnames(topics)
  highest_column <- column_names[which.max(row)]
  return(highest_column)
}

# Apply the function to each row and assign the result to a new column
topics$most_likely_topic <- apply(topics, 1, getHighestColumn)

# Output the modified dataframe
print(topics)

#get the ids from the dfm 

meta = docvars(dfm_omit) %>% 
  add_column(doc_id=docnames(dfm_omit),.before=1)
```

In the cell below, we merge the topic modelling results with the previous dataset, which is a data frame, using the *bind_cols* command.

```{r}
tpd <- bind_cols(omitted_df, topics) 
head(tpd)
```
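
The assignment of each document to its most likely topic can be illustrated on a small invented probability matrix. The sketch below uses base R's `max.col` as a compact alternative to a row-wise `apply`; `toy_topics` and its values are assumptions for illustration.

```r
# Invented document-topic probabilities for three documents and three topics.
toy_topics <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.3, 0.6,
                       0.3, 0.5, 0.2),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:3), paste0("Topic_", 1:3)))

# Each row is a probability distribution over topics.
print(rowSums(toy_topics))

# max.col() returns the column index of the row maximum.
most_likely <- colnames(toy_topics)[max.col(toy_topics, ties.method = "first")]
print(most_likely)
```

Setting `ties.method = "first"` makes the result deterministic; the default of `max.col` breaks ties at random.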

Latent Dirichlet Allocation (LDA) tuning refers to the process of optimizing hyperparameters to improve topic modelling performance. Crucial parameters include the number of topics, alpha (document-topic density), and beta (topic-word density). Careful tuning can enhance the model's ability to extract meaningful topics from unstructured text data, leading to more interpretable results. In this study, we tune only the number of topics, comparing candidate values from 2 to 20 with the "Gibbs" method, the *CaoJuan2009* metric (lower is better), and the *Deveaud2014* metric (higher is better).

```{r}
library(ldatuning)
# create models with different number of topics
result <- FindTopicsNumber(
  dfm_omit,
  topics = seq(from = 2, to = 20, by = 1),
  # Available metrics: "Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  verbose = TRUE
)
```

The cell below visualizes the tuning results. A lower CaoJuan2009 value and a higher Deveaud2014 value both indicate a better number of topics.

```{r}
FindTopicsNumber_plot(result)
```
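Besides reading the plot, a candidate number of topics can be picked programmatically. The sketch below assumes a data frame shaped like the output of `FindTopicsNumber()` (a `topics` column plus one column per metric); the values in `toy_result` are invented for illustration and are not results from this study.

```r
# Hypothetical metric values: one row per candidate number of topics,
# one column per metric (illustrative numbers only).
toy_result <- data.frame(
  topics      = c(2, 4, 6, 8, 10),
  CaoJuan2009 = c(0.21, 0.15, 0.11, 0.13, 0.16),
  Deveaud2014 = c(1.10, 1.35, 1.52, 1.48, 1.30)
)

# CaoJuan2009 should be minimized, Deveaud2014 should be maximized.
k_cao     <- toy_result$topics[which.min(toy_result$CaoJuan2009)]
k_deveaud <- toy_result$topics[which.max(toy_result$Deveaud2014)]

c(CaoJuan2009 = k_cao, Deveaud2014 = k_deveaud)
```

When the two metrics disagree, the choice remains a judgment call, and inspecting the top terms of the competing models is usually still needed.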

The cells below install TinyTeX and render the whole analysis to PDF format.

```{r}
tinytex::install_tinytex()
```

```{r}
rmarkdown::render("Final Report_William Hilmy Susatyo.Rmd", "pdf_document")
```

## Result
Based on the analysis in the previous section, we can answer the research questions posed earlier as follows:

1. We performed a classification task using Naive Bayes (which serves as the benchmark in this study) to predict the "Hate Speech" label on the initial dataset and on the "imputed" dataset (the dataset that contains only the Lemmatization and POS Tagging output). The results indicate that the performance metrics on the unimputed dataset are slightly better than those on the imputed dataset, implying that Lemmatization and POS Tagging surprisingly hurt the performance of the Naive Bayes model. This is presumably caused by the complexity of the language, the presence of unpredictable stopwords, and improper implementation of Lemmatization and POS Tagging.

2. The analysis of the Hate Speech Dataset revealed several predominant topics or themes present within the corpus. Through topic modeling techniques, key topics were identified, including but not limited to racial discrimination, religious intolerance, gender-based hate, and political bias. These topics were extracted using advanced natural language processing algorithms, allowing us to gain deeper insights into the underlying patterns and content of hate speech.

   To validate the effectiveness of the topic modeling approach, we compared the results obtained from the Naive Bayes classification model with the identified topics. The Naive Bayes classifier achieved a considerable accuracy in detecting hate speech instances, demonstrating its competence in distinguishing harmful content from non-hateful content.

   Interestingly, our analysis found that some topics identified by the topic modeling technique aligned well with the classifications made by the Naive Bayes model. For instance, topics related to racial discrimination and gender-based hate received a higher classification accuracy from the Naive Bayes classifier, showcasing the model's ability to recognize such prevalent hate speech categories. However, there were instances where the Naive Bayes model struggled to detect more subtle or context-specific hate speech, which the topic modeling approach successfully captured.

3. By applying topic modeling to the Hate Speech Dataset, we aimed to improve the interpretability and understanding of the underlying patterns and associations in the identified hate speech content. The results demonstrated that topic modeling indeed enhanced the interpretability of the data by organizing hate speech instances into coherent and meaningful groups based on shared thematic content. This organization enabled a more intuitive understanding of the various hate speech categories and their prevalence within the dataset.

   Furthermore, topic modeling helped to identify latent themes and nuances in hate speech that might have been overlooked or misclassified by the Naive Bayes classification model. While the Naive Bayes model performed well in overall hate speech detection, it had limitations in capturing complex or subtle patterns due to its reliance on simple probabilistic assumptions. The topic modeling approach, on the other hand, offered a more nuanced representation of the data, facilitating a deeper exploration of the underlying hate speech themes.

   However, it is important to note that both approaches complemented each other. The Naive Bayes classification model provided an effective means of identifying hate speech instances at a broader level, while topic modeling provided a more granular understanding of the various hate speech themes. Together, these techniques allowed for a comprehensive analysis of the Hate Speech Dataset, enabling researchers and policymakers to gain valuable insights into the prevalence and nature of hate speech online.

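The comparison between the Naive Bayes predictions and the assigned topics described in point 2 can be expressed as a per-topic accuracy. The following is a minimal sketch in base R under the assumption that each text has a true label, a predicted label, and a most likely topic. The data frame `toy_eval` is entirely made up and does not reproduce any result of this study.

```r
# Hypothetical per-text results: true label, Naive Bayes prediction, and LDA topic.
toy_eval <- data.frame(
  actual    = c(1, 1, 0, 0, 1, 0, 1, 0),
  predicted = c(1, 0, 0, 0, 1, 1, 1, 0),
  most_likely_topic = c("Topic_1", "Topic_2", "Topic_1", "Topic_2",
                        "Topic_1", "Topic_2", "Topic_1", "Topic_2")
)

# Flag the texts that the classifier labelled correctly.
toy_eval$correct <- toy_eval$actual == toy_eval$predicted

# Overall accuracy of the classifier.
overall_accuracy <- mean(toy_eval$correct)

# Accuracy within each topic: share of correctly classified texts per topic.
accuracy_by_topic <- tapply(toy_eval$correct, toy_eval$most_likely_topic, mean)

overall_accuracy
accuracy_by_topic
```

A topic with a clearly lower accuracy than the overall figure would point to the kind of subtle or context-specific content that the classifier struggles with.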

## Interpretation of Result
The analysis of the Indonesian Hate Speech Dataset yields considerably reliable results and demonstrates high performance metrics in various aspects. Despite certain shortcomings, such as the subpar performance in POS tagging and unsupervised learning, the overall findings are valuable and informative.

The dataset's considerable reliability stems from its careful curation and annotation, ensuring a diverse and representative sample of hate speech instances. Rigorous preprocessing techniques have been employed to minimize noise and enhance data quality, contributing to the robustness of the analysis.

Regarding performance metrics, the analysis exhibits a commendable accuracy rate in identifying hate speech instances. The incorporation of advanced machine learning algorithms and natural language processing techniques has significantly boosted the model's precision and recall, enabling it to distinguish hate speech effectively from non-offensive content.

However, it is acknowledged that the POS tagging component might not be as precise as desired. Some instances result in relatively meaningless words due to the complexity of language structures and the inherent challenges of part-of-speech tagging in Indonesian, where contextual nuances play a crucial role. Similarly, the unsupervised learning aspect encounters obstacles due to the diverse nature of hate speech topics and overlapping word occurrences across different categories. The lack of clear boundaries in hate speech themes hinders the model's ability to generate entirely coherent clusters, occasionally leading to unrelated words being grouped together.

Despite these limitations, the analysis remains a significant step forward in understanding and combating hate speech in Indonesia. By acknowledging these imperfections, researchers can fine-tune algorithms, explore domain-specific solutions, and develop new techniques to address the specific challenges posed by Indonesian text and the complexity of hate speech topics.

## Recommendation
- Combine this dataset with datasets from other sources.
- Perform classification on more than one label (multi-label classification).
- Use other machine learning methods.
- Conduct a measurable analysis before applying Lemmatization, POS Tagging, and Latent Dirichlet Allocation so that more reliable results can be obtained.
