Intro

The goal of this project is to work with a corpus of documents to identify spam emails. Being able to classify new “test” documents using already classified “training” documents is crucial. A common scenario involves using a corpus of labeled spam and ham (non-spam) emails to predict whether a new document is spam or not. For this project, we can begin with a spam/ham dataset and then predict the class of new documents, either withheld from the training dataset or taken from another source, such as your spam folder. An example corpus can be found at the SpamAssassin Public Corpus.

Similar to other projects, I’ll start by initializing RStudio and then proceed step-by-step toward the end goal.

Initialization

## [1] "All required packages are installed"

Strategy and method

Upon researching this topic, I have come across some repositories on GitHub and articles that focus on email spam detection. The main concepts presented in these resources can be summarized in the following steps:

  1. Finding and Importing Data:

    • Gather relevant email data (both spam and ham).

    • Import the data into your analysis environment.

  2. Exploratory Data Analysis (EDA):

    • Explore the dataset to understand its structure, features, and potential issues.

    • Identify any missing data or outliers.

  3. Creating a Corpus:

    • Clean up the data by removing unnecessary characters, formatting, and noise.

    • Handle missing data appropriately.

    • Prepare the data for training by creating a corpus.

  4. Applying a Machine Learning Algorithm:

    • Choose an appropriate algorithm (e.g., Naive Bayes, SVM, or Random Forest).

    • Train the model using the cleaned data.

    • Fine-tune hyperparameters as needed.

  5. Model Evaluation:

    • Assess the model’s performance using metrics such as accuracy, precision, recall, and F1-score.

    • Consider cross-validation to validate the model’s generalization ability.

These are the general steps, each with several sub-steps to be performed. As you work on your project, feel free to utilize available resources and code from other references, citing them appropriately when used.

All in all, our goal is to use a source of legitimate (ham) and illegitimate (spam) emails to train a model that can classify future emails into one of these two categories. It is a fun project, and something I have never done before.

In this project, the data collection has already been done and we have access to the corpus of ham and spam data, which makes my life easier. The first step is to load those data into R.

1- Importing data: Reading files into R

# Paths of spam and ham folders
spam_path <- "C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA 607/Data/spam"
ham_path <- "C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA 607/Data/ham"

# Function to read emails from a folder
read_emails <- function(folder_path) {
  # List all files in the folder
  email_files <- list.files(folder_path, full.names = TRUE)
  
  # Initialize list to store email content
  email_contents <- vector("list", length(email_files))
  
  # Loop through each file
  for (i in seq_along(email_files)) {
    # Read the content of the email
    email_content <- readLines(email_files[i], encoding = "UTF-8") # use UTF-8 for multibyte encoding to avoid errors later
    
    # Store the email content
    email_contents[[i]] <- paste(email_content, collapse = "\n")
  }
  
  # Return the list of email contents
  return(email_contents)
}

# Read emails from spam and ham folders
spam_email_contents <- read_emails(spam_path)
## Warning in readLines(email_files[i], encoding = "UTF-8"): incomplete final line
## found on 'C:/Users/kohya/OneDrive/CUNY/DATA 607/DATA
## 607/Data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'
ham_email_contents <- read_emails(ham_path)

# Create dataframes for spam and ham emails
spam_emails_df <- data.frame(Content = unlist(spam_email_contents), Label = "spam", stringsAsFactors = FALSE)
ham_emails_df <- data.frame(Content = unlist(ham_email_contents), Label = "ham", stringsAsFactors = FALSE)

#Combine the spam and ham dataframes into a single dataset
emails_df <- rbind(spam_emails_df,ham_emails_df)
# Print the first few rows of each dataframe
#knitr::kable(head(spam_emails_df, 3),format = "markdown")
# Convert the data frame to a Markdown table
message("Three examples of Spam emails")
## Three examples of Spam emails
#markdown_table <- knitr::kable(data.frame(head(spam_emails_df, 3)), format = "markdown", sanitize = TRUE)
#markdown_table <- pander::pander(head(spam_emails_df, 3), style = "pipe")
# Print the Markdown table
#cat(markdown_table)
cat("\n")
message("Three examples of Ham emails")
## Three examples of Ham emails
#knitr::kable(head(ham_emails_df, 3),format = "markdown")
# Convert the data frame to a Markdown table
#markdown_table <- knitr::kable(data.frame(head(ham_emails_df, 3)), format = "markdown", sanitize = TRUE)
#markdown_table <- pander::pander(head(ham_emails_df, 3), style = "pipe")
# Print the Markdown table
#cat(markdown_table)
cat("\n")

2- Data Processing: Exploratory Data Analyses

Once we have our dataset, we’ll need to preprocess the data. This typically involves tasks like tokenization, removing stop words, stemming or lemmatization, and possibly handling issues like spelling mistakes or special characters.

# Regular expression to find control characters (a proxy for problematic multibyte content)
error_pattern <- "[[:cntrl:]]"  

# Function to identify potential error indices
find_multibyte_errors <- function(text) {
  error_indices <- grep(error_pattern, text, perl = TRUE)
  return(error_indices)
}

# Apply the function to spam Content
potential_errors <- lapply(spam_emails_df$Content, find_multibyte_errors)


# Define replacement character for control characters
replace_char <- "?"

# Replace multi-byte characters with the placeholder
spam_emails_df$Content <- str_replace_all(spam_emails_df$Content, "[[:cntrl:]]", replace_char)
ham_emails_df$Content <- str_replace_all(ham_emails_df$Content, "[[:cntrl:]]", replace_char)
emails_df$Content <- str_replace_all(emails_df$Content, "[[:cntrl:]]", replace_char)

# Now you can use nchar or strwidth on the modified columns
df_EDA <- data.frame(spam = nrow(spam_emails_df), ham = nrow(ham_emails_df))

df_EDA <- rbind(df_EDA, min_email = c(min(nchar(spam_emails_df$Content)), min(nchar(ham_emails_df$Content))))

df_EDA <- rbind(df_EDA, max_email = c(max(nchar(spam_emails_df$Content)), max(nchar(ham_emails_df$Content))))

# Set row names
rownames(df_EDA) <- c("Emails #", "Min char length", "Max char Length")

# Print the dataframe
print(df_EDA)
##                   spam   ham
## Emails #           501  2551
## Min char length    867   367
## Max char Length 232374 90428

2-1- Cleaning and tidying data

Exploring the data shows that it contains a lot of unuseful and irrelevant additional information. A spam email should mostly be recognizable from its sender address and domain, the contents of the email, how many recipients it was sent to, and its subject. Additionally, there can be other parameters such as when it was sent, whether it is a reply/forward or a new email, and what its sentiment is. These additional pieces of information could greatly improve the quality of the process.

I would need to create a set of functions that go through each email and extract the information above, specifically by reading the HTML and plain text and separating them to differentiate the body text from the other information.

In this particular project, I only attempt to use the body text of the message for the analyses, not the entire message.

#the goal of this section is to extract the body text of each message and store that in the dataframe, rather than the entire file, to be read later 
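A possible sketch of such a body extractor is shown below. It relies on the convention that email headers are separated from the body by the first blank line; the function name extract_body is hypothetical and is not used elsewhere in this project.

# Hypothetical sketch: split a raw email into headers and body at the first blank line
extract_body <- function(raw_email) {
  # Split the stored message back into individual lines
  lines <- unlist(strsplit(raw_email, "\n", fixed = TRUE))
  
  # Headers end at the first empty line (RFC 822 convention)
  blank_line <- which(lines == "")[1]
  if (is.na(blank_line)) return(raw_email)      # no separator found; keep the whole message
  if (blank_line == length(lines)) return("")   # nothing after the headers
  
  # Everything after the blank line is treated as the body
  paste(lines[(blank_line + 1):length(lines)], collapse = "\n")
}

# Possible usage on the dataframe built earlier (not applied in this project):
# emails_df$Body <- sapply(emails_df$Content, extract_body)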

3- Feature Extraction: Creating a Corpus

The goal of this section is to use the data to extract numerical features that can be used later by a machine learning algorithm. As I was learning about the topic, I found that there are different techniques for this, with Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) being the most straightforward and most widely used.

3-1- BOW

To work with BoW, one may need to follow these steps:

  1. Creating Corpus and Preprocessing:

    • Create a Corpus from the email contents using Corpus(VectorSource()).

    • Then apply various preprocessing steps to the corpus, such as converting text to lowercase, removing punctuation, numbers, whitespace, and stopwords, and stemming the words.

  2. Creating Document-Term Matrix (DTM):

    • Convert the preprocessed corpus into a Document-Term Matrix (DTM) using DocumentTermMatrix().

    • The DTM represents each document as a row and each unique word as a column, with the cell values indicating the frequency of each word in each document.

    • Use hunspell_check to remove all non-English and unrelated words from the columns.

  3. Removing Sparse Terms:

    • Remove sparse terms (words that appear in only a small fraction of documents) from the DTM using removeSparseTerms().
  4. Converting DTM to Dataframe:

    • Convert the DTM to a dataframe email_matrix_spam and email_matrix_ham.

    • It also adds the label column (spam/ham) to each dataframe.

SPAM email analyses

# Create a Corpus from the email contents
corpus_spam <- Corpus(VectorSource(spam_emails_df$Content))

# Convert to lowercase
corpus_spam <- tm_map(corpus_spam, content_transformer(tolower))

#Convert to plain text
#corpus_spam <-  tm_map(corpus_spam, PlainTextDocument)

# Remove punctuation
corpus_spam <- tm_map(corpus_spam, removePunctuation)

# Remove numbers
corpus_spam <- tm_map(corpus_spam, removeNumbers)

# Remove whitespace
corpus_spam <- tm_map(corpus_spam, stripWhitespace)

# Remove stopwords
corpus_spam <- tm_map(corpus_spam, removeWords, stopwords("english"))

# Stemming
corpus_spam <- tm_map(corpus_spam, stemDocument)

# Convert the corpus to a DocumentTermMatrix
#The values in the matrix are the number of times that word appears in each document.
dtm_spam <- DocumentTermMatrix(corpus_spam)


#Remove sparse terms
dtm_spam <- removeSparseTerms(dtm_spam, 0.95)

dtm_spam
## <<DocumentTermMatrix (documents: 501, terms: 559)>>
## Non-/sparse entries: 38208/241851
## Sparsity           : 86%
## Maximal term length: 44
## Weighting          : term frequency (tf)
# Convert the DocumentTermMatrix to a dataframe
email_matrix_spam <- as.data.frame(as.matrix(dtm_spam))

#use the make.names function to make the column names syntactically valid variable names.
colnames(email_matrix_spam) = make.names(colnames(email_matrix_spam))

#Use hunspell to check whether each column name is actually an English word
#hunspell_check(colnames(email_matrix_spam))
#Use hunspell_check with which to keep only the columns that are meaningful English words.
email_matrix_spam <- email_matrix_spam[, which(hunspell_check(colnames(email_matrix_spam)))]

# Add labels to the dataframe
email_matrix_spam$Label <- spam_emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(email_matrix_spam, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix_spam, one row per spam email, giving the frequency of each retained stemmed term — e.g., access, buy, click, free, money, offer — plus a Label column equal to "spam".)

HAM email Analyses

#Do the same for the ham
corpus_ham <- Corpus(VectorSource(ham_emails_df$Content))
corpus_ham <- tm_map(corpus_ham, content_transformer(tolower))
#corpus_ham <- tm_map(corpus_ham, PlainTextDocument)
corpus_ham <- tm_map(corpus_ham, removePunctuation)
corpus_ham <- tm_map(corpus_ham, removeNumbers)
corpus_ham <- tm_map(corpus_ham, stripWhitespace)
corpus_ham <- tm_map(corpus_ham, removeWords, stopwords("english"))
corpus_ham <- tm_map(corpus_ham, stemDocument)
dtm_ham <- DocumentTermMatrix(corpus_ham)
dtm_ham <- removeSparseTerms(dtm_ham, 0.95)
dtm_ham
## <<DocumentTermMatrix (documents: 2551, terms: 399)>>
## Non-/sparse entries: 154473/863376
## Sparsity           : 85%
## Maximal term length: 92
## Weighting          : term frequency (tf)
email_matrix_ham <- as.data.frame(as.matrix(dtm_ham))
colnames(email_matrix_ham) = make.names(colnames(email_matrix_ham))
email_matrix_ham <- email_matrix_ham[, which(hunspell_check(colnames(email_matrix_ham)))]
email_matrix_ham$Label <- ham_emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(email_matrix_ham, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix_ham, one row per ham email, giving the frequency of each retained stemmed term — e.g., actual, code, date, list, mail, work — plus a Label column equal to "ham".)

Combined Email analyses

#let's work on the combined emails for the purpose of this work 
corpus <- Corpus(VectorSource(emails_df$Content))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.95)
dtm
## <<DocumentTermMatrix (documents: 3052, terms: 406)>>
## Non-/sparse entries: 180093/1059019
## Sparsity           : 85%
## Maximal term length: 62
## Weighting          : term frequency (tf)
email_matrix <- as.data.frame(as.matrix(dtm))
colnames(email_matrix) = make.names(colnames(email_matrix))
email_matrix <- email_matrix[, which(hunspell_check(colnames(email_matrix)))]
email_matrix$Label <- emails_df$Label

# Print the first few rows of the processed dataframe

kable(head(email_matrix, 10), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "80%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
(Table omitted for readability: the first 10 rows of email_matrix, with the same term-frequency layout as above for the combined spam and ham corpus and a Label column giving each email's class.)

3-2- TF-IDF

The dataframes created above will be used below to create the TF-IDF representation. The data preparation steps are the same, so they are not repeated, but the additional required steps listed below are taken:

  • Calculate TF-IDF weights using weightTfIdf().

  • Convert the TF-IDF weighted matrix to a dataframe.

  • Add the labels to the dataframe.

# Calculate TF-IDF weights
tfidf_transformer_spam <- weightTfIdf(dtm_spam)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df_spam <- as.data.frame(as.matrix(tfidf_transformer_spam))

# Add labels to the dataframe
tfidf_df_spam$Label <- spam_emails_df$Label

# Print the first few rows of the processed dataframe
#kable(head(tfidf_df_spam))


#do the same for the ham 
# Calculate TF-IDF weights
tfidf_transformer_ham <- weightTfIdf(dtm_ham)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df_ham <- as.data.frame(as.matrix(tfidf_transformer_ham))

# Add labels to the dataframe
tfidf_df_ham$Label <- ham_emails_df$Label

# Print the first few rows of the processed dataframe
#kable(head(tfidf_df_ham))


#do the same for the combined emails 
# Calculate TF-IDF weights
tfidf_transformer <- weightTfIdf(dtm)

# Convert TF-IDF weighted matrix to a dataframe
tfidf_df <- as.data.frame(as.matrix(tfidf_transformer))

# Add labels to the dataframe
tfidf_df$Label <- emails_df$Label

# Print the first few rows of the processed dataframe
kable(head(tfidf_df))
(Table omitted for readability: the first 6 rows of tfidf_df, giving the TF-IDF weight of each retained term per document plus the Label column. Many column names are header artifacts — e.g., charsetisocontenttransferencod, returnpath, localhost, esmtp, messageid — which motivates the further cleaning in section 4-2.)

4- Applying a Machine Learning Algorithm

In this section we will create a few different models and evaluate their performance using the definitions of accuracy, precision, recall, and F1 score given below. We analyze the data to determine the true positive, false positive, true negative, and false negative counts, using a prediction model built on the training set and applied to the test set. We will display the results in confusion matrices with the counts of TP, FP, TN, and FN.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

For finding the confusion matrix, I will use the confusionMatrix() function from the caret package.
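To make the formulas above concrete, a minimal sketch of computing them directly from the four counts is shown below; the counts in the usage line are arbitrary placeholders, not results from this project.

# Compute the four metrics defined above from raw confusion-matrix counts
classification_metrics <- function(TP, FP, TN, FN) {
  accuracy  <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  f1        <- 2 * (precision * recall) / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
}

# Example usage with arbitrary counts
classification_metrics(TP = 120, FP = 20, TN = 700, FN = 30)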

4-1- Exploratory Data Analyses

After cleaning the data and creating the DTMs, I use some basic EDA, such as word clouds and histograms of the most frequent words in the spam and ham emails, to give some intuition and additional information about what is going on.

It is surprising to see some unrelated words, such as localhost, month abbreviations (e.g., oct, sep), and days of the week (e.g., tue, wed), repeated so often. It shows that I need to do some additional data cleaning of the emails to strip the headers and work only on the email bodies.

I have not figured this out yet and need more time to work on it. For the moment, I have decided to move on, knowing that there is a lot of room for improvement in the DTMs created for the ham and spam data.

SPAM EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies_spam <- colSums(email_matrix_spam[, !colnames(email_matrix_spam) %in% "Label"])


# Sort the word frequencies in descending order
sorted_word_frequencies_spam <- sort(word_frequencies_spam, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies_spam, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
size 1124
width 1086
email 955
will 794
height 617
jalapeno 525
can 478
free 477
mail 477
wed 450
font 443
get 416
money 414
new 384
list 375
pleas 370
make 369
time 368
border 361
one 351
# Visualize word frequencies using a histogram
hist(word_frequencies_spam, main = "Spam Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df_spam <- data.frame(word = names(word_frequencies_spam), frequency = word_frequencies_spam)

word_frequencies_df_spam <- word_frequencies_df_spam [order(word_frequencies_df_spam$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words_spam <- head(word_frequencies_df_spam, 40)


# Create a histogram using ggplot2 for the top 40 words
top_40_words_spam |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Top 40 Word Frequencies in Spam emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies_spam), freq = word_frequencies_spam, max.words = 70, random.order = FALSE)

HAM EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies_ham <- colSums(email_matrix_ham[, !colnames(email_matrix_ham) %in% "Label"])

# Sort the word frequencies in descending order
sorted_word_frequencies_ham <- sort(word_frequencies_ham, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies_ham, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
wed 3849
jalapeno 3652
use 2389
hit 2253
mail 1707
list 1565
can 1504
will 1421
get 1389
one 1236
like 1159
just 1155
time 988
work 976
new 974
sun 965
sat 952
friend 894
email 882
date 855
# Visualize word frequencies using a histogram
hist(sorted_word_frequencies_ham, main = "Ham Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df_ham <- data.frame(word = names(word_frequencies_ham), frequency = word_frequencies_ham)

word_frequencies_df_ham <- word_frequencies_df_ham [order(word_frequencies_df_ham$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words_ham <- head(word_frequencies_df_ham, 40)

# Create a histogram using ggplot2 for the top 40 words
top_40_words_ham |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "plum", color = "black") +
  labs(title = "Top 40 Word Frequencies in Ham emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies_ham), freq = word_frequencies_ham, max.words = 70, random.order = FALSE)

Combined EDA

# Calculate the sum of each column (word frequency) excluding the "Label" column
word_frequencies <- colSums(email_matrix[, !colnames(email_matrix) %in% "Label"])

# Sort the word frequencies in descending order
sorted_word_frequencies <- sort(word_frequencies, decreasing = TRUE)

# Print the sorted word frequencies
kable(head(sorted_word_frequencies, 20), 
      align = c("l", "c", "c"),  # Align columns left, center, center
      width = "20%",  # Width of the table
      booktabs = TRUE,  # Use booktabs style
      longtable = TRUE,  # Allow table to span multiple pages if needed
      escape = FALSE)  # Allow LaTeX commands
x
wed 4299
jalapeno 4177
use 2652
hit 2269
will 2215
mail 2184
can 1982
list 1940
email 1837
get 1805
one 1587
just 1407
new 1358
time 1356
like 1342
size 1268
sat 1233
work 1200
sun 1186
make 1175
# Visualize word frequencies using a histogram
hist(sorted_word_frequencies, main = "Word Frequency Distribution", xlab = "Word Frequency", ylab = "Frequency")

# Convert word frequencies to a dataframe
word_frequencies_df <- data.frame(word = names(word_frequencies), frequency = word_frequencies)

word_frequencies_df <- word_frequencies_df [order(word_frequencies_df$frequency,decreasing = TRUE),]

# Select the top 40 words
top_40_words <- head(word_frequencies_df, 40)

# Create a histogram using ggplot2 for the top 40 words
top_40_words |> ggplot(aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkgreen", color = "black") +
  labs(title = "Top 40 Word Frequencies in combined emails",
       x = "Word",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize word frequencies using a word cloud
wordcloud(names(word_frequencies), freq = word_frequencies, max.words = 70, random.order = FALSE)

4-2- Further Data Cleaning:

Looking at the bar charts and word clouds, one can also identify other words that seem unimportant for distinguishing spam from ham, such as localhost, esmtp, jmlocalhost, returnpath, fetchmailfor, imap, contenttyp, smtp, and messageid. These are artifacts of the email headers rather than body text, so I will remove them from the document-term matrix manually.

# Define the list of words to remove manually
manual_word <- c('localhost', 'esmtp', 'jmlocalhost', 'returnpath', 'fetchmailfor', 'imap', 'contenttyp', 'smtp', 'messageid')

# Subset the DTM to remove columns containing the manual words
email_matrix_clean <- email_matrix[, !(colnames(email_matrix) %in% manual_word)]
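As a quick sanity check (a minimal sketch assuming the matrices above), we can confirm that the listed columns were actually dropped:

# None of the manually listed words should remain as column names, and the
# column count should drop by at most length(manual_word) (depending on how
# many of them were present in email_matrix to begin with)
any(colnames(email_matrix_clean) %in% manual_word)   # expected: FALSE
ncol(email_matrix) - ncol(email_matrix_clean)         # number of columns removed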

4-3- Logistic Regression Setup

In this section, we create the train/test split used for logistic regression and, later, for the other machine learning (ML) models that identify spam emails.

Unfortunately, logistic regression did not work well on this data: the fit did not converge, so the model was left commented out and not pursued further.

Recursive partitioning and regression trees (CART via rpart) also run, but on closer inspection neither it nor logistic regression yields a good solution for spam recognition. The failure seems rooted in the fact that the emails have not yet been cleaned down to body text only, so header artifacts still dominate the features. A penalized alternative that often behaves better on this kind of data is sketched after the code block below.

set.seed(2014)

email_matrix_clean$Label = as.factor(email_matrix_clean$Label)

# sample.split() (caTools package) creates a stratified 70/30 split on Label
spl = sample.split(email_matrix_clean$Label, 0.7)

train = subset(email_matrix_clean, spl == TRUE)
test = subset(email_matrix_clean, spl == FALSE)

# Logistic regression (left commented out: glm() did not converge on this data)
#spamLog = glm(Label~., data=train, family="binomial")
#summary(spamLog)

# CART via rpart (also left commented out; it did not give a useful classifier)
#spamCART = rpart(Label~., data=train, method="class")
#prp(spamCART)
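As a possible workaround (a sketch of my own, not part of the original analysis), penalized logistic regression from the glmnet package tends to converge on wide, sparse document-term matrices where plain glm() fails, because the lasso/ridge penalty keeps the coefficients finite even when the classes are perfectly separable:

# A hedged sketch assuming the train/test objects created above.
# cv.glmnet() fits a lasso-penalized logistic regression and picks the
# penalty strength (lambda) by cross-validation.
library(glmnet)

x_train <- as.matrix(train[, !(names(train) %in% "Label")])
x_test  <- as.matrix(test[,  !(names(test)  %in% "Label")])

cv_fit <- cv.glmnet(x_train, train$Label, family = "binomial", alpha = 1)

pred_lasso <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "class")
mean(pred_lasso == as.character(test$Label))   # rough accuracy check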

4-4- Naive Bayes Model (NB)

We can also use Naive Bayes classifiers, which are available through the e1071 package. Although I don’t know all the details about the package, I’ve discovered it and will give it a try.

Naive Bayes classifiers are simple probabilistic classifiers based on Bayes’ theorem, with the “naive” assumption of independence between features.

In R, you can use the naiveBayes() function from the e1071 package to train a Naive Bayes classifier.

nb_model <- naiveBayes(Label ~ ., data = train)

summary(nb_model)
##           Length Class  Mode     
## apriori     2    table  numeric  
## tables    209    -none- list     
## levels      2    -none- character
## isnumeric 209    -none- logical  
## call        4    -none- call
# Make predictions on the test dataset
predictions_nb <- predict(nb_model, newdata = test)

# Create a confusion matrix
conf_matrix_nb <- confusionMatrix(predictions_nb, test$Label)

# Print the confusion matrix
print(conf_matrix_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  726   22
##       spam  39  128
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.9152, 0.9486)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7674          
##                                           
##  Mcnemar's Test P-Value : 0.0405          
##                                           
##             Sensitivity : 0.9490          
##             Specificity : 0.8533          
##          Pos Pred Value : 0.9706          
##          Neg Pred Value : 0.7665          
##              Prevalence : 0.8361          
##          Detection Rate : 0.7934          
##    Detection Prevalence : 0.8175          
##       Balanced Accuracy : 0.9012          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_nb <- conf_matrix_nb$overall['Accuracy']
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.933333333333333"

4-5- Support Vector Machines (SVM):

Now we’re using another method to detect spam emails—Support Vector Machines (SVM). SVM is a popular algorithm for classification tasks and is available through the e1071 package. Below, we use the train data to create an SVM model, make predictions on the test data, and evaluate performance with a confusion matrix. The results show a clear improvement over Naive Bayes (NB), with accuracy increasing to about 96.6%.

svm_model <- svm(Label ~ ., data = train)

summary(svm_model)
## 
## Call:
## svm(formula = Label ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  590
## 
##  ( 231 359 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  ham spam
# Make predictions on the test dataset
predictions_svm <- predict(svm_model, newdata = test)

# Create a confusion matrix
conf_matrix_svm <- confusionMatrix(predictions_svm, test$Label)

# Print the confusion matrix
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  762   28
##       spam   3  122
##                                           
##                Accuracy : 0.9661          
##                  95% CI : (0.9523, 0.9769)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8675          
##                                           
##  Mcnemar's Test P-Value : 1.629e-05       
##                                           
##             Sensitivity : 0.9961          
##             Specificity : 0.8133          
##          Pos Pred Value : 0.9646          
##          Neg Pred Value : 0.9760          
##              Prevalence : 0.8361          
##          Detection Rate : 0.8328          
##    Detection Prevalence : 0.8634          
##       Balanced Accuracy : 0.9047          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_svm <- conf_matrix_svm$overall['Accuracy']
print(paste("SVM Accuracy:", accuracy_svm))
## [1] "SVM Accuracy: 0.966120218579235"

4-6- Deep learning

Another option is deep learning, and the keras and tensorflow packages in R can help set this up. While I’ve learned how to install and configure them, I don’t fully understand the inner workings yet. I adapted the model structure and layers from online examples; it uses ReLU activations in the hidden layers and a single sigmoid output unit, which is the usual choice for binary classification.

Below is the code to set up the network in R. It is generally recommended to normalize the inputs, so as a first step I normalize the train and test data. Two standard methods are min/max scaling and Z-score normalization; I prepare both versions by defining a function for each and applying it column-wise with lapply.

# Z-score normalization function
# (note: returns NaN for a column with zero variance)
z_score_normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Min-max normalization function
# (note: returns NaN for a column whose min and max are equal)
min_max_normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply min-max normalization to your data
nor_train_MM <- as.data.frame(lapply(train[, -which(names(train) == "Label")], min_max_normalize))
nor_train_MM$Label <- train$Label


# Apply z-score normalization to your data (excluding the label column)
nor_train_ZS <- as.data.frame(lapply(train[, -which(names(train) == "Label")], z_score_normalize))
nor_train_ZS$Label <- train$Label


# Apply min-max normalization to your data
nor_test_MM <- as.data.frame(lapply(test[, -which(names(test) == "Label")], min_max_normalize))
nor_test_MM$Label <- test$Label


# Apply z-score normalization to your data
nor_test_ZS <- as.data.frame(lapply(test[, -which(names(test) == "Label")], z_score_normalize))
nor_test_ZS$Label <-test$Label


# Define the neural network architecture
model_DP <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train) - 1) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile the model
model_DP %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)

# Train the model
# Note: as.numeric() on the ham/spam factor yields 1/2 rather than the 0/1
# targets that binary_crossentropy expects, which is the likely cause of the
# negative loss values printed below. A corrected encoding is sketched at the
# end of this section.
history_MM <- model_DP %>% fit(
  x = as.matrix(nor_train_MM[, -which(names(nor_train_MM) == "Label")]),   # Features
  y = as.numeric(nor_train_MM$Label),  # Labels (coded 1 = ham, 2 = spam here)
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
## Epoch 1/10
## 54/54 - 1s - loss: 0.1195 - accuracy: 0.7548 - val_loss: 0.1227 - val_accuracy: 1.0000 - 947ms/epoch - 18ms/step
## Epoch 2/10
## 54/54 - 0s - loss: -1.4594e+00 - accuracy: 0.7946 - val_loss: 4.6600e-04 - val_accuracy: 1.0000 - 175ms/epoch - 3ms/step
## Epoch 3/10
## 54/54 - 0s - loss: -5.4961e+00 - accuracy: 0.7946 - val_loss: 1.3317e-10 - val_accuracy: 1.0000 - 189ms/epoch - 3ms/step
## Epoch 4/10
## 54/54 - 0s - loss: -1.5316e+01 - accuracy: 0.7946 - val_loss: 3.8584e-25 - val_accuracy: 1.0000 - 172ms/epoch - 3ms/step
## Epoch 5/10
## 54/54 - 0s - loss: -3.4970e+01 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 164ms/epoch - 3ms/step
## Epoch 6/10
## 54/54 - 0s - loss: -6.9685e+01 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 161ms/epoch - 3ms/step
## Epoch 7/10
## 54/54 - 0s - loss: -1.2357e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 163ms/epoch - 3ms/step
## Epoch 8/10
## 54/54 - 0s - loss: -2.0059e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 154ms/epoch - 3ms/step
## Epoch 9/10
## 54/54 - 0s - loss: -3.0311e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 151ms/epoch - 3ms/step
## Epoch 10/10
## 54/54 - 0s - loss: -4.3504e+02 - accuracy: 0.7946 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 - 156ms/epoch - 3ms/step
# Evaluate the model
evaluation_DP_MM <- model_DP %>% evaluate(
  x = as.matrix(nor_test_MM[, -which(names(nor_test_MM) == "Label")]),   # Features
  y = as.numeric(nor_test_MM$Label)   # Labels
)
## 29/29 - 0s - loss: -4.6997e+02 - accuracy: 0.8361 - 65ms/epoch - 2ms/step
summary (evaluation_DP_MM)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -469.9704 -352.2688 -234.5672 -234.5672 -116.8656    0.8361
# Make predictions on the test dataset
predictions_DP_MM <- model_DP %>% predict(
  x = as.matrix(nor_test_MM[, -which(names(nor_test_MM) == "Label")])  # Features
)
## 29/29 - 0s - 158ms/epoch - 5ms/step
# Convert predicted probabilities or classes to labels
predicted_labels_DP_MM <- ifelse(predictions_DP_MM > 0.5, "spam", "ham")

predicted_df_DP_MM <- data.frame(Label = factor(predicted_labels_DP_MM))

# Create a confusion matrix
conf_matrix <- table(predicted_df_DP_MM$Label, nor_test_MM$Label)

# Print the confusion matrix
print(conf_matrix)
##       
##        ham spam
##   spam 765  150
#conf_matrix_DP_MM <- confusionMatrix(predicted_labels_DP_MM, nor_test_MM$Label)

# Print the confusion matrix
#print(conf_matrix_DP_MM)

# Extract accuracy from the confusion matrix
#accuracy_DP_MM <- conf_matrix_DP_MM$overall['Accuracy']

#print(paste("DP for MM normalized data Accuracy:", accuracy_svm))

4-7- Random Forest

Another method I used for spam evaluation is Random Forest, which is among the most widely used machine learning models for classification, especially binary classification.

The approach presented in reference [1] is used for this modeling; the code is below.

# Train the random forest model
rf_model <- randomForest(Label ~ ., data = train)

# Print summary of the random forest model
print(rf_model)
## 
## Call:
##  randomForest(formula = Label ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 0.98%
## Confusion matrix:
##       ham spam class.error
## ham  1782    4 0.002239642
## spam   17  334 0.048433048
# Make predictions on the test dataset
predictions_rf <- predict(rf_model, newdata = test)

# Create a confusion matrix
conf_matrix_rf <- confusionMatrix(predictions_rf, test$Label)

# Print the confusion matrix
print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  761    8
##       spam   4  142
##                                           
##                Accuracy : 0.9869          
##                  95% CI : (0.9772, 0.9932)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.9948          
##             Specificity : 0.9467          
##          Pos Pred Value : 0.9896          
##          Neg Pred Value : 0.9726          
##              Prevalence : 0.8361          
##          Detection Rate : 0.8317          
##    Detection Prevalence : 0.8404          
##       Balanced Accuracy : 0.9707          
##                                           
##        'Positive' Class : ham             
## 
# Extract accuracy from the confusion matrix
accuracy_rf <- conf_matrix_rf$overall['Accuracy']
print(paste("Random Forest Accuracy:", accuracy_rf))
## [1] "Random Forest Accuracy: 0.986885245901639"

5- Model Evaluation

I evaluated the performance of several models: logistic regression, Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, and a small deep-learning network. Comparing accuracy along with the sensitivity, specificity, and kappa values reported by the confusion matrices, Random Forest clearly performed best, achieving the highest accuracy among the evaluated models.
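As a compact summary, the test-set accuracies already computed above can be collected into one table (a small sketch; the deep-learning figure is omitted because that model was not completed):

# Gather the accuracy values extracted from the confusion matrices above
model_comparison <- data.frame(
  Model    = c("Naive Bayes", "SVM (radial kernel)", "Random Forest"),
  Accuracy = c(accuracy_nb, accuracy_svm, accuracy_rf)
)

kable(model_comparison, digits = 3)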

For this spam/ham classification problem, the three methods that completed training (Naive Bayes, SVM, and Random Forest) predicted spam with test accuracies ranging from roughly 93% to 98.7%. Despite limited success in cleaning the spam and ham emails down to body text, the Random Forest model still reached 98.7% accuracy.

I also attempted Deep Learning, but unfortunately, I couldn’t complete it due to time constraints and not knowing enough about the topic.

All in all, this gives me hope that with better data cleaning, we may further improve the model’s performance, possibly reaching 99.9%.

Overall, it was an interesting and challenging problem to tackle, and it required extensive research. I utilized resources from GitHub and RPubs and contributed my own code where needed.

References:

[1] https://github.com/Prakash-Khatri/Text_spam_ham_detection

[2] RPubs - Spam and Ham Detection of Email