607_Project4_DylanGold

Approach

For this project we are creating a model that can distinguish spam emails from emails with legitimate content (ham). We have access to some public examples of spam and ham at https://spamassassin.apache.org/old/publiccorpus/.

My first step after unzipping the files is to look at some examples of the spam and ham. Initially I notice some things that may help identify spam. Spam often mentions https several times because it contains an excessive number of links. We can also look for keywords that may help us tell spam from ham. My plan is to make some sort of decision tree to help separate spam from ham.
We saw in a previous assignment that there are several ways we can analyze the results of a binary classifier. In this assignment I will try to reduce false positives (reduce the amount of ham that we incorrectly call spam). For this context I would rather get some spam in the main mailbox than accidentally send important emails to the spam folder.

We can also try to incorporate techniques similar to our sentiment analysis into our model. Certain words like prize, win, and free tend to be more common in spam messages.
When I initially go through the words in both spam and ham, I can count which words tend to be more common in each type of message.
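As a minimal sketch of that counting idea, the snippet below tallies words separately for spam and ham. The toy data frame and the count_words helper are hypothetical and made up for illustration; they are not part of the project data.

```r
# Toy messages; the texts and labels are made up for illustration.
toy <- data.frame(
  text = c("win a free prize now", "free money win big",
           "meeting notes for monday", "lunch on monday"),
  spam = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often each word occurs in a set of messages.
count_words <- function(texts) {
  words <- unlist(strsplit(tolower(texts), "\\s+"))
  sort(table(words), decreasing = TRUE)
}

spam_counts <- count_words(toy$text[toy$spam == 1])
ham_counts  <- count_words(toy$text[toy$spam == 0])

spam_counts["free"]   # "free" occurs twice in the toy spam
ham_counts["monday"]  # "monday" occurs twice in the toy ham
```

Words that rank high in one table and low or not at all in the other are candidates for useful features.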

Another way I have looked into is using the tidymodels library to help create a model. I am less sure how this works but maybe I can incorporate it into the project.

Import libraries.

library(tidyverse)
library(rpart) # Library for creating decision trees

Below is the importing of the data. I create a function because we do this twice. I used ChatGPT 4.0 to create code that would let me generate the data frames from the files.

load_files <- function(path, spam){ # Give it the path and whether or not it is spam(1 = spam, 0 = ham)
  files <- list.files(path, full.names = TRUE) # Get all the files in the path given
  
  data <- tibble(
    filename = basename(files), 
    text = map_chr(files, ~ readChar(.x, file.info(.x)$size, useBytes = TRUE)), # Loads text from the files with readChar
    spam = spam # Label for spam
  )
  return(data)
}
ham_df <- load_files("easy_ham", 0)
spam_df <- load_files("spam", 1)

We now have data frames with the spam and ham loaded in. I will randomly split 70% of each of the spam and ham folders into training data and keep the rest to check the model. Hopefully, by looking at keywords that are common in spam and at other patterns like links, I can create a model. I will also consider using packages like tidymodels or rpart to create a model through a decision tree or something else.
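The split performed later in this report samples rows from the combined data frame. A per-class (stratified) split, as described in the plan, could look like the following minimal sketch. The labels vector and the stratified_index helper are hypothetical and only for illustration.

```r
set.seed(1212)  # any seed works; it only makes the sketch reproducible

# Hypothetical labels standing in for the real data: 10 spam (1), 30 ham (0).
labels <- c(rep(1, 10), rep(0, 30))

# Hypothetical helper: sample a proportion of row indexes within each class.
stratified_index <- function(labels, prop = 0.7) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(labels)
table(labels[train_idx])  # 21 ham and 7 spam end up in the training set
```

Sampling within each class keeps the spam-to-ham ratio the same in the training and testing sets, which matters more when one class is much smaller than the other.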

Codebase.

Initially I decided to try a decision tree. Other methods include a random forest (multiple decision trees) or a Naive Bayes classifier. Regardless of which technique I choose, the first step is preparing the data. I intend to use the rpart library, which can use quantitative data to help make its decisions.

I want to make a function that takes a block of text and returns a hash map of each word and its count. I noticed certain things, like https, that could also be turned into their own tokens for this hash map, but initially I will just split on whitespace. In R I will use table, which automatically stores the counts as key-value pairs.
I also modified it to handle some errors with the help of ChatGPT.

make_message_table <- function(msg) { # Takes a string and returns a table (like a hash map) of word counts.
  if (is.na(msg) || msg == "") { # Added from ChatGPT. Returns an empty table for empty or NA messages.
    return(table(character(0)))
  }

  msg <- iconv(msg, from = "", to = "UTF-8", sub = "") # Also added from ChatGPT. Sometimes the email messages have invalid characters for R. We convert to UTF-8 and drop what cannot be converted.

  # I came back here to make the parsing more strict on what is kept.
  msg <- tolower(msg) # Make everything lowercase so things like Free and free are the same.

  # Use gsub to remove characters we do not want as part of a word
  msg <- gsub("[[:punct:]]", " ", msg) # Replace punctuation with a space
  msg <- gsub("[0-9]+", " ", msg) # Replace numbers with a space
  msg <- gsub("\\s+", " ", msg) # Collapse extra whitespace
  msg <- trimws(msg) # Trim the ends

  split_msg <- strsplit(msg, "\\s+") # Split on spaces or new lines (\\s+); returns a list holding one character vector

  return(table(split_msg)) # table() accepts the list and counts each word
}

# Test on a single spam message
single_spam <- spam_df[[3, 2]] # 3rd message, text value
single_spam_table <- make_message_table(single_spam)
# view(single_spam_table) shows me it worked
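The number of links was mentioned earlier as a possible signal. A minimal sketch of counting them directly on the raw text, before punctuation is stripped, is shown below. The count_links helper is hypothetical and is not used elsewhere in this report.

```r
# Hypothetical helper: count occurrences of "http" or "https" in a message,
# ignoring case. gregexpr() returns -1 when there is no match.
count_links <- function(msg) {
  hits <- gregexpr("https?", msg, ignore.case = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

count_links("Visit http://a.example and HTTPS://b.example today")  # 2
count_links("no links here")                                       # 0
```

A count like this could be added as one more numeric column next to the word counts.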

I will try to use rpart following an example linked at https://www.geeksforgeeks.org/machine-learning/testing-rules-generated-by-rpart-package/.
As I look into what it takes for rpart to make a decision tree, I realize we need more processing.
We need each word to be a column with counts as values.
First I will make the table for each row.

all_df <- rbind(spam_df, ham_df)
processed_df <- all_df %>%
  rowwise() %>% # Rowwise operation
  mutate(word_table = list(make_message_table(text))) %>% # Create new column with function from earlier
  ungroup()

Before preparing the data to be used for a model we need to split it into a testing and training set.
We can do this with the sample function.

set.seed(1212) # Set seed for sample

train_index <- sample(1:nrow(all_df), 0.7 * nrow(all_df)) # sample() gives us row indexes here. We use 1:nrow() to get the full range of indexes, then 0.7 of the row count as the sample size
# Note that train_index is a vector of row indexes, not the rows themselves, because we sampled from 1:nrow()

train_df <- processed_df[train_index, ] #get values at the training indexes
test_df  <- processed_df[-train_index, ] #get values not at the training indexes

Simply using every word as a column would be bad for fitting the decision tree. I asked ChatGPT how we could filter the data before converting it into a matrix.
We first count, for each word, how many training messages contain it. With this we keep only the words that appear in at least 10 messages. We use this to trim each message's word table, which is finally converted into a matrix that we can use for rpart.
We will just use the training data for this.

colnames(train_df) <- make.names(colnames(train_df), unique = TRUE) # Get rid of invalid characters in the column names of train_df

word_freq <- table(unlist(lapply(train_df$word_table, names))) # names() gives the words (keys) of each table once per message, so this counts how many messages contain each word.
keep_words <- names(word_freq[word_freq >= 10]) # Remove words that don't appear often. Things like specific names or random links will be removed, which is good. The cutoff of 10 is arbitrary.

train_df$word_table <- lapply(train_df$word_table, function(tab) { # Apply this function to each table we created earlier.
  tab <- tab[names(tab) %in% keep_words] # Keep only the words that are in keep_words.
  tab # Return the table
})
keep_words[1:20] # Look at 20 words that we keep in our model. These words are not specific to spam or ham. They are the first 20 in alphabetical order.
 [1] "a"          "aa"         "aaa"        "aaronsw"    "ab"        
 [6] "abc"        "ability"    "able"       "about"      "above"     
[11] "absolute"   "absolutely" "abuse"      "ac"         "accept"    
[16] "acceptable" "acceptance" "accepted"   "access"     "according" 

Our filter removes uncommon words. Because it counts the messages that contain a word rather than the word's total occurrences, a word that appears many times but only in a single message is removed as well. There are many ways we could work on the preprocessing, but for now I will see how well this version does.
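To check which kind of frequency the word_freq expression measures, a small experiment on made-up tables can help. Because names() lists each word once per table, tabulating the names counts messages rather than total occurrences. The two toy tables below are hypothetical stand-ins for the output of make_message_table.

```r
# Two toy word tables, standing in for make_message_table() output.
tab1 <- table(c("free", "free", "free", "win"))
tab2 <- table(c("free", "meeting"))
tables <- list(tab1, tab2)

# Same pattern as word_freq: each word is listed once per message.
words_per_message <- unlist(lapply(tables, names))
doc_freq <- table(words_per_message)

# For comparison: total occurrences summed across messages.
counts_per_message <- unlist(lapply(tables, as.vector))
total_freq <- tapply(counts_per_message, words_per_message, sum)

doc_freq["free"]    # 2: "free" is present in two messages
total_freq["free"]  # 4: "free" occurs four times overall
```

A cutoff on doc_freq therefore drops words that are confined to a few messages, no matter how often they repeat inside them.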

all_words <- keep_words
feature_matrix <- t(sapply(train_df$word_table, function(tab) { # sapply is the same as lapply but returns a matrix.
  vec <- setNames(rep(0, length(all_words)), all_words) # Start with a zero count for every kept word, named by the word
  vec[names(tab)] <- as.numeric(tab) # Fill in the counts for the words this message contains
  vec
}))

We have now created a matrix for the model. The model can use a data frame, so we convert the matrix into one.
There were some errors because of invalid column names. We use make.names again and remove the "...", empty, and NA column names.

df_features <- as.data.frame(feature_matrix) # convert the features to a dataframe
colnames(df_features) <- make.names(colnames(df_features), unique = TRUE)
df_features <- df_features[, colnames(df_features) != "..." &
                              colnames(df_features) != "" &
                              !is.na(colnames(df_features))]

We can now create the model. I also added a step that normalizes the rows so that longer messages don't affect the model as strongly. Note that, as written, this step changes feature_matrix after df_features was already built from it, so the tree below is trained on the raw counts.

df_features$spam <- as.factor(train_df$spam) # Turn spam into a factor
feature_matrix <- feature_matrix / rowSums(feature_matrix) # Additional processing: normalize each row by its total so longer messages are not weighted by their length. df_features is not updated by this line.
# Plug in spam as the y, the rest as x with .
model <- rpart(spam ~ ., data = df_features, method = "class")
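If row normalization is meant to reach the model, it has to happen before the data frame is built, and the same transformation would need to be applied to the test data. The following is a minimal sketch on a toy matrix, not the code that produced the results in this report. The matrix values and variable names are made up for illustration.

```r
# Toy count matrix: one row per message, one column per word.
counts <- matrix(c(2, 0, 2,
                   1, 1, 0,
                   0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("free", "win", "meeting")))

totals <- rowSums(counts)
totals[totals == 0] <- 1        # avoid dividing by zero for messages with no kept words
proportions <- counts / totals  # each row is divided by its own total

features <- as.data.frame(proportions)  # build the data frame after normalizing
rowSums(features)                       # 1, 1, 0
```

Dividing a matrix by a vector with one entry per row recycles the vector down the columns, so every element is divided by the total of its own row.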

To actually test the model we need to turn our testing data into a valid format.
The test data should have the same columns as the training data.
I used ChatGPT to help generate this.

train_cols <- colnames(df_features)[colnames(df_features) != "spam"] # Get the feature columns(ignore the label, spam)
test_feature_list <- lapply(test_df$word_table, function(tab) { # We are applying this function to the test_df now
  
  vec <- setNames(rep(0, length(train_cols)), train_cols) # We are getting the names from the train_cols, originally the df_features we trained on.
  
  if (!is.null(tab) && length(tab) > 0) { # When the row value is not null or empty
    common <- intersect(names(tab), train_cols) # intersect the names in the test data and the column names of the training data
    vec[common] <- as.numeric(tab[common]) # save the values that were intersected in vec
  }
  
  vec # return
})

test_matrix <- as.data.frame(do.call(rbind, test_feature_list))

We can now run the model.

# Predict with the model and the test matrix we just made
predictions <- predict(model, test_matrix, type = "class")

These are the results. When I initially ran it, everything was classified as spam.
I decided to go back to the parsing of the text, because I had used a very basic split when I needed to do things like remove punctuation and ignore case. After making these changes I got something much more accurate.

table(Predicted = predictions, Actual = test_df$spam)
         Actual
Predicted   0   1
        0 762   9
        1   4 141
mean(predictions == test_df$spam)
[1] 0.9858079

We can see that our model is very accurate, at about 98.6% on the test set.
Our model predicted spam when the message was ham (false positive) 4 times.
Our model predicted ham when the message was spam (false negative) 9 times.
We can also calculate the precision and recall.

precision <- 141 /(141 + 4) # TP/(TP + FP)
recall <- 141/(141 + 9) # TP/(TP + FN)

precision
[1] 0.9724138
recall
[1] 0.94

Because we care more about minimizing false positives (calling something spam when it is not), I care more about precision in this model, and luckily this model has a very high precision.
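The same metrics can be computed from the confusion matrix instead of typing the counts by hand. The sketch below rebuilds the matrix from the numbers printed above. The variable names are my own, and the false positive rate is an extra metric added here because it measures the share of ham that gets flagged.

```r
# Confusion matrix copied from the output above: rows = predicted, columns = actual.
cm <- matrix(c(762, 4, 9, 141), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

tp <- cm["1", "1"]  # spam predicted as spam
fp <- cm["1", "0"]  # ham predicted as spam
fn <- cm["0", "1"]  # spam predicted as ham
tn <- cm["0", "0"]  # ham predicted as ham

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
accuracy  <- (tp + tn) / sum(cm)
fpr       <- fp / (fp + tn)  # share of ham that was flagged as spam

round(c(precision = precision, recall = recall, accuracy = accuracy, fpr = fpr), 4)
```

With the real objects, cm could be replaced by table(Predicted = predictions, Actual = test_df$spam).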

Conclusion

We were able to create a model for detecting spam vs ham. We had large samples of text for both spam and ham. Initially, when parsing the text, I did not clean out garbage information, which made the model classify everything as spam. By making sure the preprocessed information was good for the model, I was able to create a model that was very accurate and had good precision. I made the preprocessed data better by getting rid of information that would probably just confuse the model and by making sure that things like free!! and FREE were treated as the same word. After preprocessing I was able to use rpart to create a decision tree. Some ways I could improve on this project are to look at other ways of creating a model. In principle I could pass the processed information through something like a Naive Bayes classifier. I could also adjust this model by trying to maximize precision (get no false positives), even at the cost of seeing more false negatives.
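One possible way to trade false positives for false negatives is to require a higher predicted spam probability before calling a message spam. The sketch below uses made-up probabilities and labels, and the errors_at helper is hypothetical. With rpart, class probabilities could be obtained from predict(model, newdata, type = "prob"), but that step is not run here.

```r
# Hypothetical spam probabilities and true labels, made up for illustration.
prob_spam <- c(0.95, 0.80, 0.60, 0.55, 0.30, 0.10)
actual    <- c(1,    1,    0,    1,    0,    0)

# Hypothetical helper: count both error types at a given threshold.
errors_at <- function(threshold) {
  predicted <- as.integer(prob_spam >= threshold)
  c(false_positives = sum(predicted == 1 & actual == 0),
    false_negatives = sum(predicted == 0 & actual == 1))
}

errors_at(0.5)  # 1 false positive, 0 false negatives
errors_at(0.7)  # 0 false positives, 1 false negative
```

Raising the threshold removes the false positive in this toy data at the cost of missing one spam message, which is the trade-off described above.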