library(tm)
library(caret)
library(tidyverse)
library(e1071)
The goal of this project is to classify new documents using already classified "training" documents. First, the ham/spam data was downloaded from the SpamAssassin public corpus: https://spamassassin.apache.org/old/publiccorpus/
Of the many datasets on that page, I chose "20030228_easy_ham.tar.bz2" for the ham data and "20021010_spam.tar.bz2" for the spam data.
Now I'm going to combine both datasets into a document-term matrix and label the rows as ham or spam.
Create a function to read files from the folder on my local machine
read_files <- function(folder_path) {
  # List every file in the folder and read each one line by line
  file_names <- list.files(folder_path, full.names = TRUE)
  file_contents <- lapply(file_names, readLines)
  return(file_contents)
}
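Some of the raw corpus files may lack a final newline, which makes readLines() noisy. A slightly more defensive variant (an optional tweak with a hypothetical name, not used for the results below):
read_files_quiet <- function(folder_path) {
  file_names <- list.files(folder_path, full.names = TRUE)
  # warn = FALSE suppresses the "incomplete final line" warnings
  lapply(file_names, readLines, warn = FALSE)
}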
Read the data
ham_files <- read_files("C:/Users/akpla/Documents/Project4/easy_ham")
spam_files <- read_files("C:/Users/akpla/Documents/Project4/spam")
This gives a list of 2,501 elements in ham_files and 501 elements in spam_files.
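A quick sanity check of those counts:
length(ham_files)   # 2501
length(spam_files)  # 501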
Assign a label of 1 for ham and 0 for spam
ham_labels <- rep(1, length(ham_files))
spam_labels <- rep(0, length(spam_files))
Combine the files and labels into a data frame
files <- c(ham_files, spam_files)
labels <- c(ham_labels, spam_labels)
# Concatenate the lines within each element into a single string per email
data_df <- data.frame(
  text = sapply(files, function(doc_lines) paste(doc_lines, collapse = "\n")),
  label = as.factor(labels)
)
Save the data to a CSV file and upload it to GitHub to make this assignment reproducible.
write.csv(data_df, "data.csv", row.names = FALSE)
I uploaded the data to GitHub: "https://raw.githubusercontent.com/Kossi-Akplaka/Data607-data_acquisition_and_management/main/Project4/data.csv"
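Anyone reproducing this analysis can read the file directly from the raw GitHub URL (the file is large, so the download may take a moment), re-creating the label factor that write.csv() flattened:
data_df <- read.csv(
  "https://raw.githubusercontent.com/Kossi-Akplaka/Data607-data_acquisition_and_management/main/Project4/data.csv"
)
data_df$label <- as.factor(data_df$label)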
Preprocess the data: for this part, I reuse the same code from my Assignment 10 on sentiment analysis to tidy the data ("https://rpubs.com/NeverBored746/1112638").
# Build the corpus from the collapsed text column (one string per email)
corpus <- Corpus(VectorSource(data_df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
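To confirm the cleaning worked, we can peek at the start of the first cleaned document (content() comes from the NLP package, which tm attaches):
substr(content(corpus[[1]]), 1, 200)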
Create the Document-Term Matrix (DTM) and store it in a data frame.
dtm <- DocumentTermMatrix(corpus)
df <- as.data.frame(as.matrix(dtm))
# Clean the column names, replacing any characters that are not valid in R names
colnames(df) <- make.names(colnames(df), unique = TRUE)
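make.names() substitutes invalid characters and prefixes names that start with a digit, with unique = TRUE de-duplicating any collisions. For example:
make.names(c("3d", "e-mail", "e-mail"), unique = TRUE)
## [1] "X3d"      "e.mail"   "e.mail.1"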
Add the labels to the data frame
df$label <- labels
Let's explore the top 10 most frequent words in ham emails:
# Filter the data frame for ham separately
ham_words <- df %>%
  filter(label == 1) %>%
  select(-label)
# Sum up the columns
ham_word_freq <- colSums(ham_words)
# Create a data frame of the top 10 words for ham
top_ham_words <- data.frame(
  word = names(sort(ham_word_freq, decreasing = TRUE)[1:10]),
  freq = sort(ham_word_freq, decreasing = TRUE)[1:10]
)
top_ham_words
## word freq
## received received 14077
## sep sep 9788
## esmtp esmtp 8405
## localhost localhost 7347
## oct oct 5251
## tby tby 4728
## tfor tfor 4723
## postfix postfix 4661
## aug aug 4476
## ist ist 4224
Plot the top 10 most used words for ham:
ggplot(top_ham_words, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = 'blue') +
  labs(title = "Top 10 Ham Words",
       x = "Word",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Let’s do the same thing for spam emails:
# Filter the data frame for spam separately
spam_words <- df %>%
  filter(label == 0) %>%
  select(-label)
# Sum up the columns
spam_word_freq <- colSums(spam_words)
# Create a data frame of the top 10 words for spam
top_spam_words <- data.frame(
  word = names(sort(spam_word_freq, decreasing = TRUE)[1:10]),
  freq = sort(spam_word_freq, decreasing = TRUE)[1:10]
)
# Plot the top 10 most used words for spam
ggplot(top_spam_words, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = 'green') +
  labs(title = "Top 10 Spam Words",
       x = "Word",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can observe that certain words, such as 'received' and 'localhost', are common in both the ham and spam categories; in fact, most of these top terms ('received', 'esmtp', 'localhost', 'postfix') come from the raw email headers rather than the message bodies. This suggests that the key to distinguishing spam from ham lies in identifying words that are not shared as heavily between the two categories.
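One quick way to surface such discriminating words, reusing the ham_word_freq and spam_word_freq vectors computed above (a rough sketch, using relative frequencies so the 5:1 class imbalance doesn't dominate):
# Normalize counts to per-class relative frequencies
ham_rate  <- ham_word_freq / sum(ham_word_freq)
spam_rate <- spam_word_freq / sum(spam_word_freq)
# Rank terms by how far apart their rates are between the classes
skew <- sort(abs(ham_rate - spam_rate), decreasing = TRUE)
head(names(skew), 10)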
I decided to use 80% of the data for the training set and 20% for the test set.
The issue I encountered is that the data has more than 73,000 features after it was transformed into a DTM, and I kept getting errors that the memory allocated for protecting objects in R was exhausted.
I tried different techniques, such as filtering out infrequently occurring words (sketched below) and recursive feature elimination, but I didn't succeed in making the model better.
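For reference, here is a minimal version of the frequency filter, using tm's removeSparseTerms() (the 0.99 threshold is an illustrative choice, not the exact one from my experiments):
# Drop terms that are absent from more than 99% of documents
dtm_small <- removeSparseTerms(dtm, sparse = 0.99)
dim(dtm_small)  # far fewer columns than the original ~73,000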
Instead, I used the raw text data to train my models, splitting it into train and test sets.
# Split the data into training (80%) and testing (20%)
set.seed(123) # For reproducibility
train_indices <- createDataPartition(data_df$label, p = 0.8, list = FALSE)
train_data <- data_df[train_indices, ]
test_data <- data_df[-train_indices, ]
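Because createDataPartition() samples within each level of label, both sets should preserve the roughly 5:1 ham-to-spam ratio, which we can verify:
prop.table(table(train_data$label))
prop.table(table(test_data$label))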
We can use the Naive Bayes algorithm to classify the data. Note that the label is 1 for ham and 0 for spam.
# Note: with a formula like label ~ text, the raw text column is treated
# as a single categorical predictor rather than as individual words
model <- naiveBayes(label ~ text, data = train_data)
Now, we can evaluate the model on the test set.
# Make predictions on the test data
predictions <- predict(model, newdata = test_data)
# Evaluate the model with a confusion matrix
confusion_matrix <- table(predictions, test_data$label)
confusion_matrix
##
## predictions 0 1
## 0 0 0
## 1 100 500
Model score
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.8333333
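Since caret is already loaded, its confusionMatrix() reports sensitivity and specificity alongside accuracy, which matters here given the class imbalance (a sketch; "1" is taken as the positive, i.e. ham, class):
confusionMatrix(predictions, test_data$label, positive = "1")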
Our model has an accuracy of about 83%. However, the confusion matrix shows that it predicted every message as ham (label 1); with 500 ham and 100 spam messages in the test set, that accuracy simply mirrors the class balance, so there is clear room for improvement.
We can use data science to solve real-life problems like detecting spam in emails. The next step is to try different algorithms and feature representations to improve the model; one possible direction, training on the term counts themselves, is sketched below.
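A rough, untested sketch of that direction, building on the dtm_small object from the sparsity filter above, is to train Naive Bayes on the term counts instead of the raw text string:
# Convert the reduced DTM to a data frame and reattach the labels
df_small <- as.data.frame(as.matrix(dtm_small))
colnames(df_small) <- make.names(colnames(df_small), unique = TRUE)
df_small$label <- as.factor(labels)

# Reuse the same train/test indices as before
train_small <- df_small[train_indices, ]
test_small  <- df_small[-train_indices, ]

# Fit and evaluate on word counts rather than one raw-text factor
model2 <- naiveBayes(label ~ ., data = train_small)
pred2  <- predict(model2, newdata = test_small)
table(pred2, test_small$label)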