Intro

The goal of this project is to work with a corpus of documents to identify spam emails. Being able to classify new “test” documents using already classified “training” documents is crucial. A common scenario involves using a corpus of labeled spam and ham (non-spam) emails to predict whether a new document is spam or not. For this project, we can begin with a spam/ham dataset and then predict the class of new documents, either withheld from the training dataset or taken from another source, such as your spam folder. An example corpus can be found at the SpamAssassin Public Corpus.

Similar to other projects, I’ll start by initializing RStudio and then proceed step-by-step toward the end goal.

Initialization

## [1] "All required packages are installed"

Strategy and method

Upon researching this topic, I have come across some repositories on GitHub and articles that focus on email spam detection. The main concepts presented in these resources can be summarized in the following steps:

  1. Finding and Importing Data:

    • Gather relevant email data (both spam and ham).

    • Import the data into your analysis environment.

  2. Exploratory Data Analysis (EDA):

    • Explore the dataset to understand its structure, features, and potential issues.

    • Identify any missing data or outliers.

  3. Creating a Corpus:

    • Clean up the data by removing unnecessary characters, formatting, and noise.

    • Handle missing data appropriately.

    • Prepare the data for training by creating a corpus.

  4. Applying a Machine Learning Algorithm:

    • Choose an appropriate algorithm (e.g., Naive Bayes, SVM, or Random Forest).

    • Train the model using the cleaned data.

    • Fine-tune hyperparameters as needed.

  5. Model Evaluation:

    • Assess the model’s performance using metrics such as accuracy, precision, recall, and F1-score.

    • Consider cross-validation to validate the model’s generalization ability.

These are the general steps, each with several sub-steps to be performed. As you work on your project, feel free to utilize available resources and code from other references, citing them appropriately when used.

All in all, our goal is to use a source of legitimate (ham) and illegitimate (spam) emails to train a model that can classify future emails into one of these two categories. It is a fun project, and something I have never done before.

In this project, the data collection has already been done and we have access to the corpus of ham and spam data, which makes my life easier. The first step is to load those data into R.

1- Importing data: Reading files into R

# Paths of spam and ham folders
spam_path <- "C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA 607/Data/spam"
ham_path <- "C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA 607/Data/ham"

# Function to read emails from a folder
read_emails <- function(folder_path) {
  # List all files in the folder
  email_files <- list.files(folder_path, full.names = TRUE)
  
  # Initialize list to store email content
  email_contents <- vector("list", length(email_files))
  
  # Loop through each file
  for (i in seq_along(email_files)) {
    # Read the content of the email
    email_content <- readLines(email_files[i], encoding = "UTF-8") # use UTF-8 for multibyte encoding to avoid errors later
    
    # Store the email content
    email_contents[[i]] <- paste(email_content, collapse = "\n")
  }
  
  # Return the list of email contents
  return(email_contents)
}

# Read emails from spam and ham folders
spam_email_contents <- read_emails(spam_path)
## Warning in readLines(email_files[i], encoding = "UTF-8"): incomplete final line
## found on 'C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA
## 607/Data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'
ham_email_contents <- read_emails(ham_path)

# Create dataframes for spam and ham emails
spam_emails_df <- data.frame(Content = unlist(spam_email_contents), Label = "spam", stringsAsFactors = FALSE)
ham_emails_df <- data.frame(Content = unlist(ham_email_contents), Label = "ham", stringsAsFactors = FALSE)

#Combine the spam and ham dataframes into a single dataset
emails_df <- rbind(spam_emails_df,ham_emails_df)
# Print the first few rows of each dataframe
#knitr::kable(head(spam_emails_df, 3),format = "markdown")
# Convert the data frame to a Markdown table
message("Three examples of Spam emails")
## Three examples of Spam emails
#markdown_table <- knitr::kable(data.frame(head(spam_emails_df, 3)), format = "markdown", sanitize = TRUE)
#markdown_table <- pander::pander(head(spam_emails_df, 3), style = "pipe")
# Print the Markdown table
#cat(markdown_table)
cat("\n")
message("Three examples of Ham emails")
## Three examples of Ham emails
#knitr::kable(head(ham_emails_df, 3),format = "markdown")
# Convert the data frame to a Markdown table
#markdown_table <- knitr::kable(data.frame(head(ham_emails_df, 3)), format = "markdown", sanitize = TRUE)
#markdown_table <- pander::pander(head(ham_emails_df, 3), style = "pipe")
# Print the Markdown table
#cat(markdown_table)
cat("\n")

2- Data Processing: Exploratory Data Analyses

Once we have our dataset, we’ll need to preprocess the data. This typically involves tasks like tokenization, removing stop words, stemming or lemmatization, and possibly handling issues like spelling mistakes or special characters.

# Regular expression to find control characters (a proxy for problematic multibyte content)
error_pattern <- "[[:cntrl:]]"  

# Function to identify potential error indices
find_multibyte_errors <- function(text) {
  error_indices <- grep(error_pattern, text, perl = TRUE)
  return(error_indices)
}

# Apply the function to spam Content
potential_errors <- lapply(spam_emails_df$Content, find_multibyte_errors)


# Define replacement character for control characters
replace_char <- "?"

# Replace multi-byte characters with the placeholder
spam_emails_df$Content <- str_replace_all(spam_emails_df$Content, "[[:cntrl:]]", replace_char)
ham_emails_df$Content <- str_replace_all(ham_emails_df$Content, "[[:cntrl:]]", replace_char)
emails_df$Content <- str_replace_all(emails_df$Content, "[[:cntrl:]]", replace_char)

# Now you can use nchar or strwidth on the modified columns
df_EDA <- data.frame(spam = nrow(spam_emails_df), ham = nrow(ham_emails_df))

df_EDA <- rbind(df_EDA, min_email = c(min(nchar(spam_emails_df$Content)), min(nchar(ham_emails_df$Content))))

df_EDA <- rbind(df_EDA, max_email = c(max(nchar(spam_emails_df$Content)), max(nchar(ham_emails_df$Content))))

# Set row names
rownames(df_EDA) <- c("Emails #", "Min char length", "Max char Length")

# Print the dataframe
print(df_EDA)
##                   spam   ham
## Emails #           501  2551
## Min char length    867   367
## Max char Length 232374 90428

2-1- Cleaning and tidying data

Exploring the data shows that it contains a lot of unuseful and irrelevant additional information. A spam email should mostly be recognizable from its sender address and domain, the contents of the email, how many recipients it was sent to, and its subject. Additionally, there can be other parameters such as when it was sent, whether it is a reply/forward or a new email, and what its sentiment is. These additional pieces of information could greatly improve the quality of the process.

I would need to create a set of functions that go through each email and extract the information above, specifically by reading the HTML and plain text and separating them to differentiate the body text from the other information.

In this particular project, I only attempt to use the body text of the message for the analyses, not the entire message.

#the goal of this section is to extract the body text of each message and store that in the dataframe, rather than the entire file, to be read later 
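A possible sketch of such a body extractor is shown below. It relies on the convention that email headers are separated from the body by the first blank line; the function name extract_body is hypothetical and is not used elsewhere in this project.

# Hypothetical sketch: split a raw email into headers and body at the first blank line
extract_body <- function(raw_email) {
  # Split the stored message back into individual lines
  lines <- unlist(strsplit(raw_email, "\n", fixed = TRUE))
  
  # Headers end at the first empty line (RFC 822 convention)
  blank_line <- which(lines == "")[1]
  if (is.na(blank_line)) return(raw_email)      # no separator found; keep the whole message
  if (blank_line == length(lines)) return("")   # nothing after the headers
  
  # Everything after the blank line is treated as the body
  paste(lines[(blank_line + 1):length(lines)], collapse = "\n")
}

# Possible usage on the dataframe built earlier (not applied in this project):
# emails_df$Body <- sapply(emails_df$Content, extract_body)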

3- Feature Extraction: Creating a Corpus

The goal of this section is to use the data to extract numerical features that can be used later by a machine learning algorithm. As I was learning about the topic, I found that there are different techniques for this, with Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) being the most straightforward and most widely used.

3-1- BOW

To work with BoW, one may need to follow these steps:

  1. Creating Corpus and Preprocessing:

    • Create a Corpus from the email contents using Corpus(VectorSource()).

    • Then apply various preprocessing steps to the corpus, such as converting text to lowercase, removing punctuation, numbers, whitespace, and stopwords, and stemming the words.

  2. Creating Document-Term Matrix (DTM):

    • Convert the preprocessed corpus into a Document-Term Matrix (DTM) using DocumentTermMatrix().

    • The DTM represents each document as a row and each unique word as a column, with the cell values indicating the frequency of each word in each document.

    • Use hunspell_check to remove all non-English and unrelated words from the columns.

  3. Removing Sparse Terms:

    • Remove sparse terms (words that appear in only a small fraction of documents) from the DTM using removeSparseTerms().
  4. Converting DTM to Dataframe:

    • Convert the DTM to a dataframe email_matrix_spam and email_matrix_ham.

    • It also adds the label column (spam/ham) to each dataframe.

SPAM email analyses

# Create a Corpus from the email contents
corpus_spam <- Corpus(VectorSource(spam_emails_df$Content))

# Convert to lowercase
corpus_spam <- tm_map(corpus_spam, content_transformer(tolower))

#Convert to plain text
#corpus_spam <-  tm_map(corpus_spam, PlainTextDocument)

# Remove punctuation
corpus_spam <- tm_map(corpus_spam, removePunctuation)

# Remove numbers
corpus_spam <- tm_map(corpus_spam, removeNumbers)

# Remove whitespace
corpus_spam <- tm_map(corpus_spam, stripWhitespace)

# Remove stopwords
corpus_spam <- tm_map(corpus_spam, removeWords, stopwords("english"))

# Stemming
corpus_spam <- tm_map(corpus_spam, stemDocument)

# Convert the corpus to a DocumentTermMatrix
#The values in the matrix are the number of times that word appears in each document.
dtm_spam <- DocumentTermMatrix(corpus_spam)


#Remove sparse terms
dtm_spam <- removeSparseTerms(dtm_spam, 0.95)

dtm_spam
## <<DocumentTermMatrix (documents: 501, terms: 559)>>
## Non-/sparse entries: 38208/241851
## Sparsity           : 86%
## Maximal term length: 44
## Weighting          : term frequency (tf)
# Convert the DocumentTermMatrix to a dataframe
email_matrix_spam <- as.data.frame(as.matrix(dtm_spam))

#use the make.names function to make the column names syntactically valid variable names.
colnames(email_matrix_spam) = make.names(colnames(email_matrix_spam))

#Use hunspell to check whether each column name is actually an English word
#hunspell_check(colnames(email_matrix_spam))
#Use hunspell_check with which to keep only the columns that are meaningful English words.
email_matrix_spam <- email_matrix_spam[, which(hunspell_check(colnames(email_matrix_spam)))]

# Add labels to the dataframe
email_matrix_spam$Label <- spam_emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(email_matrix_spam, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix_spam, one row per spam email, giving the frequency of each retained stemmed term — e.g., access, buy, click, free, money, offer — plus a Label column equal to "spam".)

HAM email Analyses

#Do the same for the ham
corpus_ham <- Corpus(VectorSource(ham_emails_df$Content))
corpus_ham <- tm_map(corpus_ham, content_transformer(tolower))
#corpus_ham <- tm_map(corpus_ham, PlainTextDocument)
corpus_ham <- tm_map(corpus_ham, removePunctuation)
corpus_ham <- tm_map(corpus_ham, removeNumbers)
corpus_ham <- tm_map(corpus_ham, stripWhitespace)
corpus_ham <- tm_map(corpus_ham, removeWords, stopwords("english"))
corpus_ham <- tm_map(corpus_ham, stemDocument)
dtm_ham <- DocumentTermMatrix(corpus_ham)
dtm_ham <- removeSparseTerms(dtm_ham, 0.95)
dtm_ham
## <<DocumentTermMatrix (documents: 2551, terms: 399)>>
## Non-/sparse entries: 154473/863376
## Sparsity           : 85%
## Maximal term length: 92
## Weighting          : term frequency (tf)
email_matrix_ham <- as.data.frame(as.matrix(dtm_ham))
colnames(email_matrix_ham) = make.names(colnames(email_matrix_ham))
email_matrix_ham <- email_matrix_ham[, which(hunspell_check(colnames(email_matrix_ham)))]
email_matrix_ham$Label <- ham_emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(email_matrix_ham, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix_ham, one row per ham email, giving the frequency of each retained stemmed term — e.g., actual, code, date, list, mail, work — plus a Label column equal to "ham".)

Combined Email analyses

#let's work on the combined emails for the purpose of this work 
corpus <- Corpus(VectorSource(emails_df$Content))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.95)
dtm
## <<DocumentTermMatrix (documents: 3052, terms: 406)>>
## Non-/sparse entries: 180093/1059019
## Sparsity           : 85%
## Maximal term length: 62
## Weighting          : term frequency (tf)
email_matrix <- as.data.frame(as.matrix(dtm))
colnames(email_matrix) = make.names(colnames(email_matrix))
email_matrix <- email_matrix[, which(hunspell_check(colnames(email_matrix)))]
email_matrix$Label <- emails_df$Label

# Print the first few rows of the processed dataframe

kable(head(email_matrix, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix, with the same term-frequency layout as above for the combined spam and ham corpus and a Label column giving each email's class.)

3-2- TF-IDF

The dataframes created above will be used below to create the TF-IDF representation. The data preparation steps are the same, so they are not repeated, but the additional required steps listed below are taken:

  • Calculate TF-IDF weights using weightTfIdf().

  • Convert the TF-IDF weighted matrix to a dataframe.

  • Add the labels to the dataframe.

# Calculate TF-IDF weights
tfidf_transformer_spam <- weightTfIdf(dtm_spam)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df_spam <- as.data.frame(as.matrix(tfidf_transformer_spam))

# Add labels to the dataframe
tfidf_df_spam$Label <- spam_emails_df$Label

# Print the first few rows of the processed dataframe
#kable(head(tfidf_df_spam))


#do the same for the ham 
# Calculate TF-IDF weights
tfidf_transformer_ham <- weightTfIdf(dtm_ham)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df_ham <- as.data.frame(as.matrix(tfidf_transformer_ham))

# Add labels to the dataframe
tfidf_df_ham$Label <- ham_emails_df$Label

# Print the first few rows of the processed dataframe
#kable(head(tfidf_df_ham))


#do the same for the combined emails 
# Calculate TF-IDF weights
tfidf_transformer <- weightTfIdf(dtm)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df <- as.data.frame(as.matrix(tfidf_transformer))

# Add labels to the dataframe
tfidf_df$Label <- emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(tfidf_df))
(Table omitted for readability: the first 6 rows of tfidf_df, giving the TF-IDF weight of each retained term per document plus the Label column. Many column names are header artifacts — e.g., charsetisocontenttransferencod, returnpath, localhost, esmtp, messageid — which motivates the further cleaning in section 4-2.)

4- Applying a Machine Learning Algorithm

In this section we will create a few different models and evaluate their performance using the definitions of accuracy, precision, recall, and F1 score given below. We analyze the data to determine the true positive, false positive, true negative, and false negative counts, using a prediction model built on the training set and applied to the test set. We will display the results in confusion matrices with the counts of TP, FP, TN, and FN.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

For finding the confusion matrix, I will use the confusionMatrix() function from the caret package.
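To make the formulas above concrete, a minimal sketch of computing them directly from the four counts is shown below; the counts in the usage line are arbitrary placeholders, not results from this project.

# Compute the four metrics defined above from raw confusion-matrix counts
classification_metrics <- function(TP, FP, TN, FN) {
  accuracy  <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  f1        <- 2 * (precision * recall) / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
}

# Example usage with arbitrary counts
classification_metrics(TP = 120, FP = 20, TN = 700, FN = 30)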

4-1- Exploratory Data Analyses

After cleaning the data and creating the DTMs, I use some basic EDA, such as word clouds and histograms of the most frequent words in the spam and ham emails, to give some intuition and additional information about what is going on.

It is surprising to see some unrelated words, such as localhost, month abbreviations (e.g., oct, sep), and days of the week (e.g., tue, wed), repeated so often. It shows that I need to do some additional data cleaning of the emails to strip the headers and work only on the email bodies.

I have not figured this out yet and need more time to work on it. For the moment, I have decided to move on, knowing that there is a lot of room for improvement in the DTMs created for the ham and spam data.

SPAM EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies_spam <- colSums(email_matrix_spam[, !colnames(email_matrix_spam) %in% "Label"])


# Sort the word frequencies in descending order
sorted_word_frequencies_spam <- sort(word_frequencies_spam, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies_spam, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
size 1124
width 1086
email 955
will 794
height 617
jalapeno 525
can 478
free 477
mail 477
wed 450
font 443
get 416
money 414
new 384
list 375
pleas 370
make 369
time 368
border 361
one 351
# Visualize word frequencies using a histogram
hist(word_frequencies_spam, main = "Spam Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df_spam <- data.frame(word = names(word_frequencies_spam), frequency = word_frequencies_spam)

word_frequencies_df_spam <- word_frequencies_df_spam [order(word_frequencies_df_spam$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words_spam <- head(word_frequencies_df_spam, 40)


# Create a histogram using ggplot2 for the top 40 words
top_40_words_spam |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Top 40 Word Frequencies in Spam emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies_spam), freq = word_frequencies_spam, max.words = 70, random.order = FALSE)

HAM EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies_ham <- colSums(email_matrix_ham[, !colnames(email_matrix_ham) %in% "Label"])

# Sort the word frequencies in descending order
sorted_word_frequencies_ham <- sort(word_frequencies_ham, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies_ham, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
wed 3849
jalapeno 3652
use 2389
hit 2253
mail 1707
list 1565
can 1504
will 1421
get 1389
one 1236
like 1159
just 1155
time 988
work 976
new 974
sun 965
sat 952
friend 894
email 882
date 855
# Visualize word frequencies using a histogram
hist(sorted_word_frequencies_ham, main = "Ham Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df_ham <- data.frame(word = names(word_frequencies_ham), frequency = word_frequencies_ham)

word_frequencies_df_ham <- word_frequencies_df_ham [order(word_frequencies_df_ham$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words_ham <- head(word_frequencies_df_ham, 40)

# Create a histogram using ggplot2 for the top 40 words
top_40_words_ham |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "plum", color = "black") +
  labs(title = "Top 40 Word Frequencies in Ham emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies_ham), freq = word_frequencies_ham, max.words = 70, random.order = FALSE)

Combined EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies <- colSums(email_matrix[, !colnames(email_matrix) %in% "Label"])

# Sort the word frequencies in descending order
sorted_word_frequencies <- sort(word_frequencies, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
wed 4299
jalapeno 4177
use 2652
hit 2269
will 2215
mail 2184
can 1982
list 1940
email 1837
get 1805
one 1587
just 1407
new 1358
time 1356
like 1342
size 1268
sat 1233
work 1200
sun 1186
make 1175
# Visualize word frequencies using a histogram
hist(sorted_word_frequencies, main = "Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df <- data.frame(word = names(word_frequencies), frequency = word_frequencies)

word_frequencies_df <- word_frequencies_df [order(word_frequencies_df$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words <- head(word_frequencies_df, 40)

# Create a histogram using ggplot2 for the top 40 words
top_40_words |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkgreen", color = "black") +
  labs(title = "Top 40 Word Frequencies in combined emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies), freq = word_frequencies, max.words = 70, random.order = FALSE)

4-2- Further Data Cleaning:

Looking at the bar charts and word clouds, one can also identify other words that seem unimportant for distinguishing spam from ham, such as localhost, esmtp, jmlocalhost, returnpath, fetchmailfor, imap, contenttyp, smtp, and messageid. These are artifacts of the email headers rather than body text, so I will remove them from the document-term matrix manually.

# Define the list of words to remove manually
manual_word <- c('localhost', 'esmtp', 'jmlocalhost', 'returnpath', 'fetchmailfor', 'imap', 'contenttyp', 'smtp', 'messageid')

# Subset the DTM to remove columns containing the manual words
email_matrix_clean <- email_matrix[, !(colnames(email_matrix) %in% manual_word)]
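As a quick sanity check (a minimal sketch assuming the matrices above), we can confirm that the listed columns were actually dropped:

# None of the manually listed words should remain as column names, and the
# column count should drop by at most length(manual_word) (depending on how
# many of them were present in email_matrix to begin with)
any(colnames(email_matrix_clean) %in% manual_word)   # expected: FALSE
ncol(email_matrix) - ncol(email_matrix_clean)         # number of columns removed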

4-3- Logistic Regression Setup

In this section, we create the train/test split used for logistic regression and, later, for the other machine learning (ML) models that identify spam emails.

Unfortunately, logistic regression did not work well on this data: the fit did not converge, so the model was left commented out and not pursued further.

Recursive partitioning and regression trees (CART via rpart) also run, but on closer inspection neither it nor logistic regression yields a good solution for spam recognition. The failure seems rooted in the fact that the emails have not yet been cleaned down to body text only, so header artifacts still dominate the features. A penalized alternative that often behaves better on this kind of data is sketched after the code block below.

set.seed(2014)

email_matrix_clean$Label = as.factor(email_matrix_clean$Label)

# sample.split() (caTools package) creates a stratified 70/30 split on Label
spl = sample.split(email_matrix_clean$Label, 0.7)

train = subset(email_matrix_clean, spl == TRUE)
test = subset(email_matrix_clean, spl == FALSE)

# Logistic regression (left commented out: glm() did not converge on this data)
#spamLog = glm(Label~., data=train, family="binomial")
#summary(spamLog)

# CART via rpart (also left commented out; it did not give a useful classifier)
#spamCART = rpart(Label~., data=train, method="class")
#prp(spamCART)
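As a possible workaround (a sketch of my own, not part of the original analysis), penalized logistic regression from the glmnet package tends to converge on wide, sparse document-term matrices where plain glm() fails, because the lasso/ridge penalty keeps the coefficients finite even when the classes are perfectly separable:

# A hedged sketch assuming the train/test objects created above.
# cv.glmnet() fits a lasso-penalized logistic regression and picks the
# penalty strength (lambda) by cross-validation.
library(glmnet)

x_train <- as.matrix(train[, !(names(train) %in% "Label")])
x_test  <- as.matrix(test[,  !(names(test)  %in% "Label")])

cv_fit <- cv.glmnet(x_train, train$Label, family = "binomial", alpha = 1)

pred_lasso <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "class")
mean(pred_lasso == as.character(test$Label))   # rough accuracy check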

4-4- Naive Bayes Model (NB)

We can also use Naive Bayes classifiers, which are available through the e1071 package. Although I don’t know all the details about the package, I’ve discovered it and will give it a try.

Naive Bayes classifiers are simple probabilistic classifiers based on Bayes’ theorem, with the “naive” assumption of independence between features.

In R, you can use the naiveBayes() function from the e1071 package to train a Naive Bayes classifier.

nb_model <- naiveBayes(Label ~ ., data = train)

summary(nb_model)
##           Length Class  Mode     
## apriori     2    table  numeric  
## tables    209    -none- list     
## levels      2    -none- character
## isnumeric 209    -none- logical  
## call        4    -none- call
# Make predictions on the test dataset
predictions_nb <- predict(nb_model, newdata = test)

# Create a confusion matrix
conf_matrix_nb <- confusionMatrix(predictions_nb, test$Label)

# Print the confusion matrix
print(conf_matrix_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  726   22
##       spam  39  128
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.9152, 0.9486)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7674          
##                                           
##  Mcnemar's Test P-Value : 0.0405          
##                                           
##             Sensitivity : 0.9490          
##             Specificity : 0.8533          
##          Pos Pred Value : 0.9706          
##          Neg Pred Value : 0.7665          
##              Prevalence : 0.8361          
##          Detection Rate : 0.7934          
##    Detection Prevalence : 0.8175          
##       Balanced Accuracy : 0.9012          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_nb <- conf_matrix_nb$overall['Accuracy']
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.933333333333333"

4-5- Support Vector Machines (SVM):

Now we’re using another method to detect spam emails—Support Vector Machines (SVM). SVM is a popular algorithm for classification tasks and is available through the e1071 package. Below, we use the train data to create an SVM model, make predictions on the test data, and evaluate performance with a confusion matrix. The results show a clear improvement over Naive Bayes (NB), with accuracy increasing to about 96.6%.

svm_model <- svm(Label ~ ., data = train)

summary(svm_model)
## 
## Call:
## svm(formula = Label ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  590
## 
##  ( 231 359 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  ham spam
# Make predictions on the test dataset
predictions_svm <- predict(svm_model, newdata = test)

# Create a confusion matrix
conf_matrix_svm <- confusionMatrix(predictions_svm, test$Label)

# Print the confusion matrix
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  762   28
##       spam   3  122
##                                           
##                Accuracy : 0.9661          
##                  95% CI : (0.9523, 0.9769)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8675          
##                                           
##  Mcnemar's Test P-Value : 1.629e-05       
##                                           
##             Sensitivity : 0.9961          
##             Specificity : 0.8133          
##          Pos Pred Value : 0.9646          
##          Neg Pred Value : 0.9760          
##              Prevalence : 0.8361          
##          Detection Rate : 0.8328          
##    Detection Prevalence : 0.8634          
##       Balanced Accuracy : 0.9047          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_svm <- conf_matrix_svm$overall['Accuracy']
print(paste("SVM Accuracy:", accuracy_svm))
## [1] "SVM Accuracy: 0.966120218579235"

4-6- Deep learning

Another option is deep learning, and the keras and tensorflow packages in R can help set this up. While I’ve learned how to install and configure them, I don’t fully understand the inner workings yet. I adapted the model structure and layers from online examples; it uses ReLU activations in the hidden layers and a single sigmoid output unit, which is the usual choice for binary classification.

Below is the code to set up the network in R. It is generally recommended to normalize the inputs, so as a first step I normalize the train and test data. Two standard methods are min/max scaling and Z-score normalization; I prepare both versions by defining a function for each and applying it column-wise with lapply.

# Z-score normalization function
# (note: returns NaN for a column with zero variance)
z_score_normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Min-max normalization function
# (note: returns NaN for a column whose min and max are equal)
min_max_normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply min-max normalization to your data
nor_train_MM <- as.data.frame(lapply(train[, -which(names(train) == "Label")], min_max_normalize))
nor_train_MM$Label <- train$Label


# Apply z-score normalization to your data (excluding the label column)
nor_train_ZS <- as.data.frame(lapply(train[, -which(names(train) == "Label")], z_score_normalize))
nor_train_ZS$Label <- train$Label


# Apply min-max normalization to your data
nor_test_MM <- as.data.frame(lapply(test[, -which(names(test) == "Label")], min_max_normalize))
nor_test_MM$Label <- test$Label


# Apply z-score normalization to your data
nor_test_ZS <- as.data.frame(lapply(test[, -which(names(test) == "Label")], z_score_normalize))
nor_test_ZS$Label <-test$Label


# Define the neural network architecture
model_DP <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train) - 1) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile the model
model_DP %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)

# Train the model
# Note: as.numeric() on the ham/spam factor yields 1/2 rather than the 0/1
# targets that binary_crossentropy expects, which is the likely cause of the
# negative loss values printed below. A corrected encoding is sketched at the
# end of this section.
history_MM <- model_DP %>% fit(
  x = as.matrix(nor_train_MM[, -which(names(nor_train_MM) == "Label")]),   # Features
  y = as.numeric(nor_train_MM$Label),  # Labels (coded 1 = ham, 2 = spam here)
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
## Epoch 1/10
## 54/54 - 1s - loss: 0.1195 - accuracy: 0.7548 - val_loss: 0.1227 - val_accuracy: 1.0000 - 947ms/epoch - 18ms/step
## Epoch 2/10
## 54/54 - 0s - loss: -1.4594e+00 - accuracy: 0.7946 - val_loss: 4.6600e-04 - val_accuracy: 1.0000 - 175ms/epoch - 3ms/step
## Epoch 3/10
## 54/54 - 0s - loss: -5.4961e+00 - accuracy: 0.7946 - val_loss: 1.3317e-10 - val_accuracy: 1.0000 - 189ms/epoch - 3ms/step
## Epoch 4/10
## 54/54 - 0s - loss: -1.5316e+01 - accuracy: 0.7946 - val_loss: 3.8584e-25 - val_accuracy: 1.0000 - 172ms/epoch - 3ms/step
## Epoch 5/10
## 54/54 - 0s - loss: -3.4970e+01 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 164ms/epoch - 3ms/step
## Epoch 6/10
## 54/54 - 0s - loss: -6.9685e+01 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 161ms/epoch - 3ms/step
## Epoch 7/10
## 54/54 - 0s - loss: -1.2357e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 163ms/epoch - 3ms/step
## Epoch 8/10
## 54/54 - 0s - loss: -2.0059e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 154ms/epoch - 3ms/step
## Epoch 9/10
## 54/54 - 0s - loss: -3.0311e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 151ms/epoch - 3ms/step
## Epoch 10/10
## 54/54 - 0s - loss: -4.3504e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 156ms/epoch - 3ms/step
# Evaluate the model
evaluation_DP_MM <- model_DP %>% evaluate(
  x = as.matrix(nor_test_MM[, -which(names(nor_test_MM) == "Label")]),   # Features
  y = as.numeric(nor_test_MM$Label)   # Labels
)
## 29/29 - 0s - loss: -4.6997e+02 - accuracy: 0.8361 - 65ms/epoch - 2ms/step
summary (evaluation_DP_MM)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -469.9704 -352.2688 -234.5672 -234.5672 -116.8656    0.8361
# Make predictions on the test dataset
predictions_DP_MM <- model_DP %>% predict(
  x = as.matrix(nor_test_MM[, -which(names(nor_test_MM) == "Label")])  # Features
)
## 29/29 - 0s - 158ms/epoch - 5ms/step
# Convert predicted probabilities or classes to labels
predicted_labels_DP_MM <- ifelse(predictions_DP_MM > 0.5, "spam", "ham")

predicted_df_DP_MM <- data.frame(Label = factor(predicted_labels_DP_MM))

# Create a confusion matrix
conf_matrix <- table(predicted_df_DP_MM$Label, nor_test_MM$Label)

# Print the confusion matrix
print(conf_matrix)
##       
##        ham spam
##   spam 765  150
#conf_matrix_DP_MM <- confusionMatrix(predicted_labels_DP_MM, nor_test_MM$Label)

# Print the confusion matrix
#print(conf_matrix_DP_MM)

# Extract accuracy from the confusion matrix
#accuracy_DP_MM <- conf_matrix_DP_MM$overall['Accuracy']

#print(paste("DP for MM normalized data Accuracy:", accuracy_svm))

4-7- Random Forest

Another method I used for spam evaluation is Random Forest, which is among the most widely used machine learning models for classification, especially binary classification.

The approach presented in reference [1] is used for this modeling; the code is below.

# Train the random forest model
rf_model <- randomForest(Label ~ ., data = train)

# Print summary of the random forest model
print(rf_model)
## 
## Call:
##  randomForest(formula = Label ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 0.98%
## Confusion matrix:
##       ham spam class.error
## ham  1782    4 0.002239642
## spam   17  334 0.048433048
# Make predictions on the test dataset
predictions_rf <- predict(rf_model, newdata = test)

# Create a confusion matrix
conf_matrix_rf <- confusionMatrix(predictions_rf, test$Label)

# Print the confusion matrix
print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  761    8
##       spam   4  142
##                                           
##                Accuracy : 0.9869          
##                  95% CI : (0.9772, 0.9932)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.9948          
##             Specificity : 0.9467          
##          Pos Pred Value : 0.9896          
##          Neg Pred Value : 0.9726          
##              Prevalence : 0.8361          
##          Detection Rate : 0.8317          
##    Detection Prevalence : 0.8404          
##       Balanced Accuracy : 0.9707          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_rf <- conf_matrix_rf$overall['Accuracy']
print(paste("Random Forest Accuracy:", accuracy_rf))
## [1] "Random Forest Accuracy: 0.986885245901639"

5- Model Evaluation

I evaluated the performance of several models: logistic regression, Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, and a small deep-learning network. Comparing accuracy along with the sensitivity, specificity, and kappa values reported by the confusion matrices, Random Forest clearly performed best, achieving the highest accuracy among the evaluated models.
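As a compact summary, the test-set accuracies already computed above can be collected into one table (a small sketch; the deep-learning figure is omitted because that model was not completed):

# Gather the accuracy values extracted from the confusion matrices above
model_comparison <- data.frame(
  Model    = c("Naive Bayes", "SVM (radial kernel)", "Random Forest"),
  Accuracy = c(accuracy_nb, accuracy_svm, accuracy_rf)
)

kable(model_comparison, digits = 3)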

For this spam/ham classification problem, the three methods that completed training (Naive Bayes, SVM, and Random Forest) predicted spam with test accuracies ranging from roughly 93% to 98.7%. Despite limited success in cleaning the spam and ham emails down to body text, the Random Forest model still reached 98.7% accuracy.

I also attempted Deep Learning, but unfortunately, I couldn’t complete it due to time constraints and not knowing enough about the topic.

All in all, this gives me hope that with better data cleaning, we may further improve the model’s performance, possibly reaching 99.9%.

Overall, it was an interesting and challenging problem to tackle, and it required extensive research. I utilized resources from GitHub and RPubs and contributed my own code where needed.

References:

[1] https://github.com/Prakash-Khatri/Text_spam_ham_detection

[2] RPubs - Spam and Ham Detection of Email