This project builds a classification system that identifies text messages or e-mails as either spam or ham. In other words, the intent is to create a functioning spam filter. There are multiple ways to achieve this; in this project, a random forest model is used, following a guide found here. The training and testing data are taken from Kaggle and imported into R from my GitHub for ease of access. The data contain text messages labeled as either spam or ham.
So, let’s import the required libraries and data:
# Importing libraries
library(tidyverse)
library(tm)
library(caTools)
library(randomForest)
# Importing training/testing data that is labeled spam/ham
df <- read_csv('https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/spam-project-4.csv')
glimpse(df)
## Rows: 5,572
## Columns: 5
## $ v1 <chr> "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam",…
## $ v2 <chr> "Go until jurong point, crazy.. Available only in bugis n great w…
## $ ...3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ...4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ...5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
It looks like the last 3 columns are almost entirely missing values. Let’s confirm that and then drop them.
df %>%
  apply(MARGIN = 2, FUN = is.na) %>%
  apply(MARGIN = 2, FUN = sum) # Count of NAs per column; the last 3 columns are almost entirely empty
##   v1   v2 ...3 ...4 ...5 
##    0    0 5522 5560 5566
# Dropping those columns
df <- df %>%
  select(1:2)
glimpse(df)
## Rows: 5,572
## Columns: 2
## $ v1 <chr> "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "…
## $ v2 <chr> "Go until jurong point, crazy.. Available only in bugis n great wor…
Now, let’s change the column names to better reflect the variables they represent.
names(df) <- c('label', 'text')
names(df)
## [1] "label" "text"
# Re-encoding the text and substituting non-ASCII bytes with hex codes so the tm transformations below don't fail on them
df$text <- sapply(df$text, function(x) iconv(x, "ASCII", "UTF-8", sub = "byte"))
Creating the corpus means converting df to a class that is compatible with the processing functions available in an NLP package. In this project, the tm package is used, so the corpus created below is compatible with the functions found in tm.
df_corpus <- VCorpus(VectorSource(df$text))
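As an optional sanity check (not part of the original walkthrough), the corpus can be inspected to confirm it contains one document per message:
length(df_corpus)            # should equal the number of messages, 5,572
as.character(df_corpus[[1]]) # the raw text of the first message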
Stop words are words that matter more for syntax than for semantics. That is, they are function words that express the relationships between the content words, which carry the semantic meaning of the text. As such, they can be treated as unimportant noise when trying to classify text.
df_corpus <- tm_map(df_corpus, removeWords, stopwords("en"))
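For reference, the list being removed here is tm’s built-in English stop word list; a quick, optional look at it:
head(stopwords("en"), 10) # the first few entries, e.g. "i", "me", "my", ...
length(stopwords("en"))   # the full list contains roughly 170 words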
Punctuation should be removed because it is also syntactic in function, and we are concerned with semantics.
df_corpus <- tm_map(df_corpus, removePunctuation)
Converting all the words to lowercase is a normalizing step. Case does not distinguish words: a word is semantically the same regardless of case.
df_corpus <- tm_map(df_corpus, content_transformer(tolower))
Stemming is a natural language processing technique that converts inflected words to their root forms. It is an important step for normalizing the text.
df_corpus <- tm_map(df_corpus, stemDocument)
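As a small illustration (separate from the pipeline itself), stemDocument can also be applied directly to a character vector; inflected forms collapse onto a shared stem:
# "running" and "runs" both reduce to "run"; "filtered" and "filtering" both reduce to "filter"
stemDocument(c("running", "runs", "filtered", "filtering"))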
Although the corpus has been processed, a few more steps are needed before random forest modeling can be done. The corpus has to be converted to a data frame, because it is not a suitable structure for what follows. First, the word frequencies for each message are needed, along with its ham/spam label. Additionally, very infrequent words should be removed because they are not useful for classification; to that end, only words that appear in at least 1% of the messages are kept. These requirements are best represented by a rectangular data structure.
# A DocumentTermMatrix is an object whose rows correspond to messages, whose columns correspond to words, and whose cells hold word frequencies
df_dtm <- DocumentTermMatrix(df_corpus)
# Removing very infrequent words, keeping only those that appear in at least 1% of messages
df_dtm <- removeSparseTerms(df_dtm, 0.99)
# Converting to data frame
df_freq <- as.data.frame(as.matrix(df_dtm))
colnames(df_freq) <- make.names(colnames(df_dtm)) # Makes the column names syntactically valid (e.g., those starting with a digit)
# Adding back the spam/ham label of each message
df_freq <- df_freq %>%
  mutate(label = df$label, .before = alreadi)
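A quick check (optional; the exact numbers depend on the sparsity threshold) confirms the shape of the resulting data frame:
dim(df_freq)          # one row per message; one column per retained term, plus the label
table(df_freq$label)  # count of ham vs. spam messages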
Next, df_freq has to be split into a training group and a testing group.
df_freq$label <- as.factor(df_freq$label) # Converting the label variable to a factor, as randomForest expects for classification
set.seed(1992)
split <- sample.split(df_freq$label, 0.7) # 70/30 split, stratified on the label
train <- subset(df_freq, split == TRUE)
test <- subset(df_freq, split == FALSE)
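Since sample.split() stratifies on the label, the ham/spam proportions should be nearly identical in both subsets; a quick optional check:
prop.table(table(train$label)) # proportion of ham vs. spam in the training set
prop.table(table(test$label))  # should be very close to the training proportions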
With the training and testing datasets constructed, it is time to model the data.
model <- randomForest(label ~ ., data = train)
pred_train <- predict(model, type = 'prob')[, 2] # With no newdata, randomForest returns out-of-bag probabilities; column 2 is the probability of spam
table(train$label, pred_train > 0.5) # Confusion matrix at a 0.5 threshold
##        
##        FALSE TRUE
##   ham   3346   32
##   spam    96  427
training_acc <- (3346 + 427) / nrow(train) # Training set accuracy
pred_test <- predict(model, newdata = test, type = 'prob')[, 2] # Predicted probability of spam for each test message
table(test$label, pred_test > 0.5)
##        
##        FALSE TRUE
##   ham   1433   14
##   spam    50  174
testing_acc <- (1433 + 174) / nrow(test) # Testing set accuracy
The training set’s accuracy is about 96.72%, while the testing set’s accuracy is about 96.17%. So, the model performs relatively well at filtering out spam text messages, though most of its errors are spam messages that slip through as ham rather than ham messages wrongly flagged as spam.
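Because ham messages heavily outnumber spam, overall accuracy can flatter the model. As a follow-up sketch, class-level metrics for the spam class can be computed from the counts in the test confusion matrix shown above:
# Counts taken from the test confusion matrix: 174 spam caught, 50 spam missed, 14 ham wrongly flagged
spam_recall    <- 174 / (174 + 50) # share of actual spam that was flagged, roughly 0.78
spam_precision <- 174 / (174 + 14) # share of flagged messages that really were spam, roughly 0.93
c(recall = spam_recall, precision = spam_precision)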