Build a spam/ham (non-spam) email dataset, then predict the class of new documents withheld from the training dataset
Source of dataset: https://spamassassin.apache.org/publiccorpus/
0) Import libraries
library(tm) #Text mining package: a framework for text mining application within R
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
library(RTextTools) #A machine learning package for automatic text classification
## Warning: package 'RTextTools' was built under R version 3.4.4
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
1) Import dataset
The spam/ham email sources:
20021010_easy_ham.tar.bz2
20021010_spam.tar.bz2
20030228_spam.tar.bz2
20050311_spam_2.tar.bz2
Extract each archive twice (the files are bzip2-compressed tar archives, so decompress and then untar) and note the directory of each resulting folder.
Build two text corpora (a corpus is a large, structured set of texts for statistical analysis):
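For reproducibility, the extraction can also be done from within R. This is a minimal sketch, assuming the four archives have already been downloaded into the working directory; untar() handles the bzip2 layer itself, and the target folder names are simply derived from the archive names.
archives <- c('20021010_easy_ham.tar.bz2', '20021010_spam.tar.bz2',
              '20030228_spam.tar.bz2', '20050311_spam_2.tar.bz2')
for (f in archives) { #extract each archive into a folder named after it
  untar(f, exdir = tools::file_path_sans_ext(f, compression = TRUE))
}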
easy_ham <- VCorpus(DirSource('C:\\Users\\Asus-pc\\Downloads\\Spam_Ham\\20021010_easy_ham\\easy_ham'))
for (i in 1:length(easy_ham)) { #label each ham email with class 0
  meta(easy_ham[[i]], 'class') <- 0
}
spam <- VCorpus(DirSource('C:\\Users\\Asus-pc\\Downloads\\Spam_Ham\\20021010_spam\\spam'))
for (i in 1:length(spam)) { #label each spam email with class 1
  meta(spam[[i]], 'class') <- 1
}
easy_ham
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2551
spam
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2398
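As a quick sanity check (not part of the original output), the labels can be read back from the document-level metadata:
meta(easy_ham[[1]], 'class') #expect 0
meta(spam[[1]], 'class') #expect 1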
2) Clean and tidy data
easy_ham <- tm_map(easy_ham,content_transformer(tolower)) #set all letters to lowercase
easy_ham <- tm_map(easy_ham,removeNumbers) #remove all numbers
easy_ham <- tm_map(easy_ham,stripWhitespace) #remove all white spaces
easy_ham <- tm_map(easy_ham,content_transformer(removePunctuation)) #remove punctuation
easy_ham <- tm_map(easy_ham,removeWords,stopwords('english')) #remove stopwords
easy_ham <- tm_map(easy_ham,content_transformer(function(x) iconv(x,from='UTF-8',sub='byte'))) #replace invalid UTF-8 characters with their byte codes
spam <- tm_map(spam,content_transformer(tolower))
spam <- tm_map(spam,removeNumbers)
spam <- tm_map(spam,stripWhitespace)
spam <- tm_map(spam,content_transformer(removePunctuation))
spam <- tm_map(spam,removeWords,stopwords('english'))
spam <- tm_map(spam,content_transformer(function(x) iconv(x,from='UTF-8',sub='byte')))
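The two cleaning pipelines above are deliberately identical, so a small helper function (a sketch, not part of the original run) would remove the duplication:
clean_corpus <- function(corpus) { #apply the same six transformations to any corpus
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  corpus <- tm_map(corpus, removeWords, stopwords('english'))
  tm_map(corpus, content_transformer(function(x) iconv(x, from = 'UTF-8', sub = 'byte')))
}
#easy_ham <- clean_corpus(easy_ham); spam <- clean_corpus(spam)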
3) Preparation for Analysis
combine both corpora into one, interleaving the spam and ham blocks so both classes appear throughout the training and test ranges
dataset <- c(spam[1:300],easy_ham[501:1000],spam[301:1500],easy_ham[1:500],spam[1501:2034],easy_ham[1001:2500],spam[2035:2398],easy_ham[2501:2551])
Transform the dataset into a document-term matrix.
(A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.)
dataset_dtm <- DocumentTermMatrix(dataset)
dataset_dtm
## <<DocumentTermMatrix (documents: 4949, terms: 118908)>>
## Non-/sparse entries: 891274/587584418
## Sparsity : 100%
## Maximal term length: 868
## Weighting : term frequency (tf)
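To make the structure concrete, here is a toy example (illustrative only, not part of the pipeline): two tiny documents yield a two-row matrix whose cells count how often each term occurs in each document.
toy <- VCorpus(VectorSource(c('free money free prize', 'budget meeting notes')))
inspect(DocumentTermMatrix(toy)) #rows = documents, columns = terms, cells = frequencies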
separate into training and test data
class <- as.vector(unlist(meta(dataset,type='local',tag='class'))) #recover the 0/1 labels from the document metadata
len <- length(dataset)
container <- create_container(dataset_dtm,labels = class,trainSize = 1:3960,testSize = 3961:4949,virgin = FALSE) #virgin = FALSE: the test documents carry known labels, so results can be scored
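Because the spam and ham blocks were interleaved above, both classes should appear on each side of the 3960/989 split; a quick check (illustrative, not part of the original output):
table(class[1:3960]) #label counts in the training range
table(class[3961:len]) #label counts in the test range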
4) Analysis
The machine learning algorithm chosen is the support vector machine (SVM).
It's a supervised learning model with associated learning algorithms that analyze data for classification and regression.
model <- train_model(container,'SVM')
result <- classify_model(container,model)
head(result)
## SVM_LABEL SVM_PROB
## 1 0 0.9999994
## 2 0 0.9999997
## 3 0 0.9999995
## 4 0 0.9834646
## 5 0 0.9778436
## 6 0 0.9695122
prop.table(table(result[,1] == class[3961:len])) #proportion of test predictions that match the true labels
##
## TRUE
## 1
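Beyond the overall accuracy, a confusion matrix (added here as an illustration, not part of the original output) shows how predictions distribute across the two classes:
table(predicted = result$SVM_LABEL, actual = class[3961:len]) #off-diagonal cells are misclassifications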
5) Conclusion
The support vector machine classified every one of the 989 held-out test documents correctly (the comparison above is TRUE for all of them), so it performs very well on this corpus.