Predicting Email Spam: A Study Guide with Mathematical and
Coding Representation
1. Introduction to Spam Detection
Definition
Spam detection is a binary classification problem where emails are
categorized as either spam (junk) or ham (legitimate). The problem is
solved using machine learning techniques, leveraging statistical
patterns in email content.
Machine learning models such as Naive Bayes, Logistic
Regression, Decision Trees, Random Forests, and Gradient
Boosting are commonly used for spam detection.
2. Mathematical Representation of Spam
Classification
Spam detection uses probabilistic and tree-based models to classify
emails based on word frequencies, character patterns, and structural
attributes.
Naive Bayes Approach
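The Naive Bayes classifier is based on Bayes’
theorem:
\[
P(\text{Spam} \mid X) = \frac{P(X \mid \text{Spam})\, P(\text{Spam})}{P(X)}
\]
where:
- \(P(\text{Spam} \mid X)\) is the posterior probability that an email
with features \(X\) is spam.
- \(P(X \mid \text{Spam})\) is the likelihood of observing features
\(X\) given that the email is spam.
- \(P(\text{Spam})\) is the prior probability of an email being spam.
- \(P(X)\) is the probability of features \(X\) appearing in any email.
Assuming the features \(x_1, \dots, x_n\) are conditionally independent
given the class:
\[
P(X \mid \text{Spam}) = \prod_{i=1}^{n} P(x_i \mid \text{Spam}) = P(x_1 \mid \text{Spam})\, P(x_2 \mid \text{Spam}) \cdots P(x_n \mid \text{Spam})
\]
This simplification reduces the likelihood to a product of per-feature
terms that can be estimated from simple counts, which keeps training and
classification fast and scalable.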
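The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production filter: the toy corpus, the function names `train_naive_bayes` and `posterior`, and the use of add-one (Laplace) smoothing are all assumptions made for the example. Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long messages.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (label, message) pairs invented for illustration.
TRAIN = [
    ("spam", "win free money now"),
    ("spam", "free prize win"),
    ("ham", "meeting at noon"),
    ("ham", "lunch money for the meeting"),
]

def train_naive_bayes(examples):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def posterior(text, class_counts, word_counts, vocab):
    """Return P(class | text) for every class, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        # log P(class): the prior
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # log P(word | class) with add-one smoothing
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        log_scores[label] = score
    # Dividing by P(X) is equivalent to normalizing the scores to sum to 1.
    max_log = max(log_scores.values())
    unnormalized = {c: math.exp(s - max_log) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

model = train_naive_bayes(TRAIN)
probs = posterior("free money", *model)
print(probs)
```

On this toy corpus the message "free money" receives a higher posterior for spam than for ham, because "free" appears only in the spam examples.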
Tree-Based Methods
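Tree-based models like Decision Trees and Random
Forests classify emails by recursively splitting on feature values. Each
split is chosen to reduce the impurity of the resulting nodes, where
\(p_i\) is the proportion of class \(i\) (spam or ham) among the emails
in a node:
1. Entropy-based Splitting (Information Gain):
\[
H(X) = - \sum_i p_i \log_2(p_i)
\]
2. Gini Impurity:
\[
G(X) = 1 - \sum_i p_i^2
\]
Random Forests increase robustness by training many trees, each on a
bootstrap sample of the emails with a random subset of features
considered at each split, and then combining their predictions by
averaging or majority vote (Hastie, Tibshirani, and Friedman, The
Elements of Statistical Learning, 2nd ed.).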
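To make the two impurity measures concrete, the sketch below evaluates them on a small hypothetical node and computes the information gain of one candidate split. The label lists and the helper names (`class_proportions`, `entropy`, `gini`, `information_gain`) are illustrative assumptions, not part of any library.

```python
import math

def class_proportions(labels):
    """Proportion p_i of each class present in a node."""
    total = len(labels)
    return [labels.count(c) / total for c in set(labels)]

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in the node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def gini(labels):
    """G = 1 - sum(p_i^2) over the classes present in the node."""
    return 1 - sum(p * p for p in class_proportions(labels))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node: 4 spam and 4 ham emails, split on "contains the word 'free'".
parent = ["spam"] * 4 + ["ham"] * 4
left = ["spam", "spam", "spam", "ham"]   # emails containing 'free'
right = ["spam", "ham", "ham", "ham"]    # emails without it

print("parent entropy:", entropy(parent))
print("parent gini:", gini(parent))
print("information gain:", information_gain(parent, left, right))
```

A tree learner would evaluate many candidate splits this way and keep the one with the largest gain (or the largest drop in Gini impurity).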
3. Feature Engineering for Spam Detection
Spam filters analyze word frequencies and metadata. Key features
include:
- Word Frequency: The occurrence of spam-triggering words (e.g.,
‘free’, ‘win’, ‘money’).
- Character Frequency: Special characters (e.g., ‘!’, ‘$’, ‘@’) used in
promotions (The Elements of Statistical Learning, 2nd ed.).
- Capitalization Patterns: Runs of consecutive capital letters (e.g.,
CAPAVE, the average run length, and CAPMAX, the length of the longest
run).
- Email Structure: The presence of multiple recipients, HTML tags, or
missing subject lines.
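The first three feature families in the list can be computed with the standard library alone. The sketch below is one possible implementation under stated assumptions: the function name `extract_features`, the feature names, the default word and character lists, and the choice to express frequencies as percentages are all illustrative.

```python
import re

def extract_features(text, words=("free", "win", "money"), chars=("!", "$", "@")):
    """Compute simple word, character, and capitalization features for one message."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = max(len(tokens), 1)   # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    features = {}
    # Word frequency: percentage of tokens equal to each trigger word
    for w in words:
        features[f"word_freq_{w}"] = 100.0 * tokens.count(w) / n_tokens
    # Character frequency: percentage of characters equal to each special character
    for c in chars:
        features[f"char_freq_{c}"] = 100.0 * text.count(c) / n_chars
    # Capitalization: lengths of uninterrupted runs of capital letters
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    features["capave"] = sum(runs) / len(runs) if runs else 0.0
    features["capmax"] = max(runs) if runs else 0
    return features

sample = "WIN free money NOW!!!"
feats = extract_features(sample)
for name, value in feats.items():
    print(name, value)
```

Structural features such as recipient counts or HTML tags would need access to the raw email headers and body, which this sketch does not model.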
4. Python Implementation of Spam Detection
Using Naive Bayes for Spam Classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load dataset (example: a spam.csv in the layout of the SMS Spam
# Collection, with the label in column v1 and the message text in v2)
df = pd.read_csv("spam.csv", encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
y = df['label']
# Split the raw text first so the vocabulary is learned from training data only
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], y, test_size=0.2, random_state=42)
# Convert text to word-count vectors (fit on the training split only)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(msg_train)
X_test = vectorizer.transform(msg_test)
# Train Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)
# Predict and evaluate
y_pred = nb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Using Random Forest for Spam Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# Split the raw messages first (reuses df and y from the previous example)
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], y, test_size=0.2, random_state=42)
# Convert text to TF-IDF vectors, fitting on the training split only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(msg_train)
X_test = vectorizer.transform(msg_test)
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict and evaluate
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
5. Evaluating Spam Classifiers
Confusion Matrix Metrics
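- Accuracy: Measures overall correctness,
(TP + TN) / (TP + TN + FP + FN). It can be misleading when ham greatly
outnumbers spam.
- Precision: Measures how many predicted spam emails
are actually spam, TP / (TP + FP).
- Recall (Sensitivity): Measures how many actual spam
emails were correctly identified, TP / (TP + FN).
Here TP, TN, FP, and FN are the counts of true positives, true
negatives, false positives, and false negatives, with spam as the
positive class.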
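As a worked example, the three metrics can be computed directly from the four confusion-matrix counts. The labels below are invented for illustration (1 = spam, 0 = ham), and the helper names `confusion_counts` and `classification_metrics` are assumptions for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) with `positive` treated as the spam class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical labels: 3 spam and 7 ham emails, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

result = classification_metrics(y_true, y_pred)
print(result)
```

In this example accuracy is 0.8 while precision and recall are both 2/3, which illustrates how accuracy alone can look better than the spam-specific metrics on imbalanced data.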
ROC Curve and AUC
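The Receiver Operating Characteristic (ROC) curve
plots sensitivity (true positive rate) against 1 - specificity (false
positive rate) as the decision threshold varies, showing the trade-off
between the two. The area under the curve (AUC) summarizes the curve in
one number: 0.5 corresponds to random guessing and 1.0 to a perfect
ranking. As a rule of thumb, a classifier with AUC > 0.90 is considered
highly effective, although the acceptable level depends on the cost of
misclassifying legitimate email (The Elements of Statistical Learning,
2nd ed.).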
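The sketch below traces the ROC points for a handful of hypothetical classifier scores and computes the AUC through its ranking interpretation: the probability that a randomly chosen spam email receives a higher score than a randomly chosen ham email. The scores, labels, and function names (`roc_points`, `roc_auc`) are assumptions made for illustration.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict spam whenever the score reaches the threshold
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(y_true, scores):
    """AUC as the fraction of (spam, ham) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one spam and one ham example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical spam scores from some classifier (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

points = roc_points(y_true, scores)
auc = roc_auc(y_true, scores)
print(points)
print("AUC:", auc)
```

Lowering the threshold moves along the curve toward higher sensitivity at the cost of more false positives, which for a spam filter means more legitimate email being flagged.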
6. Key Takeaways
- Naive Bayes is effective for spam classification
due to its probabilistic nature and efficiency.
- Tree-based models like Random Forest can improve
accuracy over a single decision tree by aggregating many trees.
- Feature engineering is critical: word frequency,
capitalization, and special characters are strong indicators of spam.
- Evaluating performance with precision, recall, and
AUC, rather than accuracy alone, gives a more reliable picture of a
spam filter.
By leveraging machine learning techniques, spam filters can
efficiently classify emails, reducing unwanted messages while preserving
legitimate communication (The Elements of Statistical Learning, 2nd ed.).