Project 4: Predicting Ham vs Spam Emails

Author

Emily El Mouaquite

Approach

For this assignment, I will use the Ling-Spam Dataset, which contains spam and ham (legitimate) messages from The Linguist List. Because the dataset is already clean and available on Kaggle, I can focus on training a model to classify emails as spam or ham. I will then evaluate the model's accuracy on held-out emails that it did not see during training.

Code Base

Loading the data and necessary packages:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(textrecipes)
library(discrim)

Attaching package: 'discrim'

The following object is masked from 'package:dials':

    smoothness
library(naivebayes)
naivebayes 1.0.0 loaded
For more information please visit: 
https://majkamichal.github.io/naivebayes/
emails <- read.csv("messages.csv")
head(emails)
                                              subject message                                                     label
1             job posting - apple-iss research center content - length : 3386 apple-iss research center a us ...      0
2                                                     lang classification grimes , joseph e . and barbara f . ...     0
3  query : letter frequencies for text identification i am posting this inquiry for sergei atamas ( satamas ...       0
4                                                risk a colleague and i are researching the differing degrees ...     0
5                            request book information earlier this morning i was on the phone with a friend ...       0
6 call for abstracts : optimality in syntactic theory content - length : 4437 call for papers is the best ...         0
#convert the label column from integer to factor, since classification models require a factor outcome
emails <- emails %>%
  mutate(label = as.factor(label))

A label of 1 classifies an email as spam, while a label of 0 classifies it as ham, or not spam.
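
Before modeling, it can help to check how many emails fall into each class, since a lopsided split changes how an accuracy figure should be read. The sketch below uses a small made-up data frame (toy_emails is a hypothetical stand-in, not the Ling-Spam data); the same count() call could be applied to emails.

```r
library(dplyr)

# Hypothetical toy labels: four ham (0) and one spam (1)
toy_emails <- tibble(label = factor(c(0, 0, 0, 0, 1)))

# Count the emails in each class and compute each class's share
balance <- toy_emails %>%
  count(label) %>%
  mutate(prop = n / sum(n))

balance
```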

To train a model that predicts whether an email is spam or ham, a Naive Bayes classification algorithm will be used to learn patterns in the text of the Ling-Spam Dataset and to classify new emails based on those patterns. According to IBM, the Naive Bayes classifier “seeks to model the distribution of inputs of a given class or category.”
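
As a rough sketch of the idea (the general textbook form, not a description of the exact density estimates used by the naivebayes package), the classifier combines a prior for each class c with per-feature likelihoods that are assumed to be independent given the class. Here x_1, ..., x_n stand for the features of one email and c is 0 (ham) or 1 (spam); these symbols are introduced only for illustration.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{j=1}^{n} p(x_j \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{j=1}^{n} p(x_j \mid c)
```

The "naive" part is the product: each feature contributes to the score separately, which is what keeps the model simple to fit.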

#set seed to ensure reproducibility
set.seed(304150)
#use the tidymodels package to split the data into 80% used for training the model and 20% used for testing it
split <- initial_split(emails, prop = 0.8)
#80% training data
train <- training(split)
#20% testing data
test <- testing(split)
#combine subject and message columns
train <- train %>%
  mutate(full_text = paste(subject, message))
test <- test %>%
  mutate(full_text = paste(subject, message))
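
One optional variation, shown here as a sketch on made-up data rather than as the method used in this project, is to stratify the split by the outcome so that the training and testing sets keep roughly the same share of spam. The strata argument of initial_split() does this; toy, toy_split, toy_train and toy_test are hypothetical names.

```r
library(rsample)
library(dplyr)

set.seed(123)

# Hypothetical toy data: 80 ham (0) and 20 spam (1) emails
toy <- tibble(
  id = 1:100,
  label = factor(rep(c(0, 1), times = c(80, 20)))
)

# Stratify on the label so both pieces keep a similar spam share
toy_split <- initial_split(toy, prop = 0.8, strata = label)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of spam in each piece
mean(toy_train$label == "1")
mean(toy_test$label == "1")
```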

After splitting the data, the text has to be converted into numeric features that the model can learn from. The recipe below defines that preprocessing: it breaks each email into words, keeps the most frequent ones, and scores each remaining word with TF-IDF.

rec <- recipe(label ~ full_text, data = train) %>%
  #split the combined subject and message text into individual words (tokens)
  step_tokenize(full_text) %>%
  #keep only the 500 most frequent tokens so the number of feature columns stays manageable for Naive Bayes
  step_tokenfilter(full_text, max_tokens = 500) %>%
  #convert the remaining tokens into TF-IDF (term frequency-inverse document frequency) features
  step_tfidf(full_text)
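
To make the TF-IDF step more concrete, here is a small worked example in base R on three made-up documents. It uses one common textbook definition (term frequency as a share of the document, multiplied by log(N / document frequency)); the defaults in step_tfidf() may apply different smoothing or normalization, so the numbers are illustrative only. docs, vocab, tf, idf and tfidf are hypothetical names.

```r
# Three tiny made-up "emails", already tokenized
docs <- list(
  c("free", "money", "free"),
  c("meeting", "money"),
  c("meeting", "agenda")
)

# Vocabulary: every distinct token, sorted alphabetically
vocab <- sort(unique(unlist(docs)))

# Term frequency: share of each document taken up by each token
# (rows = documents, columns = vocabulary terms)
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab)) / length(d)))

# Document frequency: number of documents containing each token
doc_freq <- colSums(tf > 0)

# Inverse document frequency: rarer tokens get larger weights
idf <- log(length(docs) / doc_freq)

# TF-IDF: scale each column of tf by that token's idf
tfidf <- sweep(tf, 2, idf, `*`)

round(tfidf, 3)
```

In this toy corpus, "free" appears in only one document and makes up two thirds of it, so it gets the largest score in that row, while "money" appears in two documents and is weighted down.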

Next, the Naive Bayes model is specified.

model <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

The recipe and the model specification are then combined into a workflow, which is fit to the training data.

#create a workflow (from tidymodels) 
wf <- workflow() %>%
  #add the recipe created above that tells the model how to learn from the text
  add_recipe(rec) %>%
  #add the model (Naive Bayes)
  add_model(model)
#train the wf using the data
fit_model <- fit(wf, data = train)

The model can then be used to predict the classifications of the emails in the test dataset, and those predictions can be compared with the true labels to test it for accuracy.

#create predictions
predictions <- predict(fit_model, test) %>%
  bind_cols(test)
#test for accuracy
accuracy(predictions, truth = label, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.864

The accuracy test shows that the model correctly classifies about 86 percent of the emails in the test set as either spam or ham.
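
Accuracy alone does not show which kind of mistake the model makes. As a sketch of how that could be examined, the example below builds a confusion matrix and computes recall and precision for the spam class with yardstick. It runs on a small made-up set of predictions (toy_preds is hypothetical, not output from this project); the same calls could be applied to predictions. Since yardstick treats the first factor level ("0") as the event by default, event_level = "second" is used so that the metrics describe spam.

```r
library(yardstick)
library(tibble)

# Hypothetical toy predictions: 6 ham (0) and 4 spam (1) emails
toy_preds <- tibble(
  label       = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1)),
  .pred_class = factor(c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0), levels = c(0, 1))
)

# Cross-tabulate predicted classes against true classes
conf_mat(toy_preds, truth = label, estimate = .pred_class)

# Accuracy plus recall and precision, treating spam ("1") as the event
spam_metrics <- metric_set(accuracy, recall, precision)
results <- spam_metrics(
  toy_preds,
  truth = label,
  estimate = .pred_class,
  event_level = "second"
)

results
```

In this toy example the accuracy is 0.7, but recall for spam is only 0.5, meaning half of the spam emails were missed. That kind of gap is what a confusion matrix makes visible.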

As an additional sanity check, the model can be given a new, made-up ham email, written to resemble the ones in the Ling-Spam dataset and stored in its own data frame.

fake_email <- data.frame(
  subject = "Meeting",
  message = "Calling all lingusts! We have our annual meeting at 3pm on Saturday."
  )
#combine subject and message
fake_email <- fake_email %>%
  mutate(full_text = paste(subject, message))
# predict
predict(fit_model, fake_email)
# A tibble: 1 × 1
  .pred_class
  <fct>      
1 0          

The model correctly predicted that this email was not spam, meaning that it would not end up in the recipient’s spam folder.

Conclusion

This project gave me more insight into predictive modeling. I ran into two hurdles while implementing the model. First, when fitting the workflow to the training data, I got an error saying that the outcome for a classification model should be a factor, not an integer vector. To fix this, I converted the label column (either 0 or 1) to a factor before splitting the data. Second, the Naive Bayes classifier was not able to handle the large number of columns produced by TF-IDF. To fix this, I added the line step_tokenfilter(full_text, max_tokens = 500) to the recipe. However, restricting the vocabulary in this way may have affected the accuracy of the model. To extend this work, one might try different models, such as logistic regression, and compare their accuracy to that of the Naive Bayes classification algorithm.
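
As a sketch of what that extension could look like, the example below fits a logistic regression through the same workflow pattern. It uses simulated numeric data rather than the email text, and all object names (toy, toy_split, lr_spec, lr_wf, lr_fit, lr_acc) are hypothetical. In this project, the text recipe would presumably take the place of the formula, and with 500 TF-IDF columns a regularized engine might be a better fit than plain glm.

```r
library(tidymodels)

set.seed(1)

# Simulated toy data: two numeric predictors and a binary label
toy <- tibble(x1 = rnorm(200), x2 = rnorm(200)) %>%
  mutate(label = factor(if_else(x1 + x2 + rnorm(200) > 0, 1, 0)))

toy_split <- initial_split(toy, prop = 0.8)

# Logistic regression specification, in place of naive_Bayes()
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Same workflow pattern as before, with a formula standing in for the recipe
lr_wf <- workflow() %>%
  add_formula(label ~ x1 + x2) %>%
  add_model(lr_spec)

lr_fit <- fit(lr_wf, data = training(toy_split))

# Accuracy on the held-out rows, for comparison with another model
lr_acc <- predict(lr_fit, testing(toy_split)) %>%
  bind_cols(testing(toy_split)) %>%
  accuracy(truth = label, estimate = .pred_class)

lr_acc
```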

References:

Creating a naïve Bayes spam filter in R

What are Naïve Bayes classifiers?

Machine learning with tidymodels