Info
It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
Intro
I started this project by downloading the data from SpamAssassin's public corpus and extracting it into two folders, spam and ham. The goal of the assignment was to read the files into R and predict their classification. Below, I start by loading the ham data. The code was adapted from an R-bloggers article.
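# Packages used throughout this document
library("tm")
library("RTextTools")
library("stringr")
library("SnowballC")

hamdir <- "C:/Users/OmegaCel/Documents/MasterDataAnalytics/IS607DataAcquisition&Management/Week10/easy_ham/"
hamList = dir(hamdir)
HamMSG = c()
for (i in seq_along(hamList)) {
  path = paste0(hamdir, hamList[i])
  connection = file(path, open = "rt", encoding = "latin1")
  text = readLines(connection)
  close(connection)
  # Keep only the body: everything after the first blank line, which ends the headers
  msg = text[seq(which(text == "")[1] + 1, length(text), 1)]
  # Collapse the body into one string and append it as a new row
  result = c(paste(msg, collapse = " "))
  HamMSG = rbind(HamMSG, result)
}
HamMSGdf = data.frame(HamMSG, stringsAsFactors = FALSE, row.names = NULL)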
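The header/body split used in the loop can be tried on an in-memory message without touching the file system. The sketch below is only an illustration: `extract_body` is a hypothetical helper that is not part of the original code, and the example message is made up. Unlike the loop above, it returns an empty string when no blank separator line is found instead of raising an error.

```r
# Hypothetical helper (not in the original code): return the body of a raw
# e-mail given its lines, i.e. everything after the first blank line.
extract_body <- function(lines) {
  blank <- which(lines == "")[1]   # index of the header/body separator
  if (is.na(blank) || blank >= length(lines)) {
    return("")                     # no separator, or nothing after it
  }
  paste(lines[(blank + 1):length(lines)], collapse = " ")
}

# Made-up message used only for illustration
example <- c("From: someone@example.com",
             "Subject: hello",
             "",
             "First body line.",
             "Second body line.")

body_text <- extract_body(example)
print(body_text)
```

Guarding against a missing separator is a design choice: it keeps one row per file, so the number of messages still matches the number of files read.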
Spam
Here we are performing the same step as above, but with the Spam dataset.
spamdir <- "C:/Users/OmegaCel/Documents/MasterDataAnalytics/IS607DataAcquisition&Management/Week10/spam/"
spamList = dir(spamdir)
SpamMSG = c()
for (i in seq_along(spamList)) {
  path = paste0(spamdir, spamList[i])
  connection = file(path, open = "rt", encoding = "latin1")
  text = readLines(connection)
  close(connection)
  # Keep only the body (everything after the first blank line).
  # try() keeps the loop running when the split fails for a file;
  # in that case the error text is stored in place of the message body.
  msg = try(text[seq(which(text == "")[1] + 1, length(text), 1)], silent = TRUE)
  result = c(paste(msg, collapse = " "))
  SpamMSG = rbind(SpamMSG, result)
}
SpamMSGdf = data.frame(SpamMSG, stringsAsFactors = FALSE, row.names = NULL)
In this step I created a corpus for each dataset and assigned a "Class" meta tag (Spam or Ham) to every document. I then combined both corpora so the data could be analyzed together. I tried two other methods, left commented out below, before settling on one that suited me.
SpamCorpus = Corpus(VectorSource(SpamMSGdf$SpamMSG))
# Earlier attempts, kept for reference:
# meta(SpamCorpus,"Class") = "Spam"
# meta(SpamCorpus, "Class")
# meta(SpamCorpus[[1]],"Class") = "Spam"
# Tag every spam document with its class
for (i in seq_along(SpamCorpus)) {
  meta(SpamCorpus[[i]], "Class") = "Spam"
}
HamCorpus = Corpus(VectorSource(HamMSGdf$HamMSG))
# Earlier attempts, kept for reference:
# meta(HamCorpus,"Class") = "Ham"
# meta(HamCorpus, "Class")
# meta(HamCorpus[[1]],"Class") = "Ham"
# Tag every ham document with its class
for (i in seq_along(HamCorpus)) {
  meta(HamCorpus[[i]], "Class") = "Ham"
}
# Combine both corpora: spam documents first, ham documents second
HamandSpamCorp = c(SpamCorpus, HamCorpus)
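Because the combined corpus holds all spam documents followed by all ham documents, the class labels can also be built directly as a plain vector in that same order. This is a minimal base R sketch, not the method used above: `spam_msgs`, `ham_msgs`, `all_msgs`, and `all_labels` are made-up names, and the short strings stand in for `SpamMSGdf$SpamMSG` and `HamMSGdf$HamMSG`.

```r
# Stand-in message vectors; in this document they would be
# SpamMSGdf$SpamMSG and HamMSGdf$HamMSG
spam_msgs <- c("win money now", "cheap pills")
ham_msgs  <- c("meeting at noon", "see attached notes", "lunch tomorrow?")

# Combine in the same order as c(SpamCorpus, HamCorpus): spam first, then ham
all_msgs   <- c(spam_msgs, ham_msgs)
all_labels <- factor(c(rep("Spam", length(spam_msgs)),
                       rep("Ham",  length(ham_msgs))))

# One label per message, so the two lengths must agree
stopifnot(length(all_msgs) == length(all_labels))
print(table(all_labels))
```

Keeping the labels in a vector that is built alongside the messages makes it easy to confirm that there is exactly one label per document before any model is trained.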
Data Cleansing
After creating the corpus I used the tm package functions to remove unwanted words, tags, and punctuation. Most of the steps came from chapter 10 of Automated Data Collection with R. When I ran into errors while creating the TermDocumentMatrix, I turned to Stack Overflow.
CleanHamandSpamCorp = tm_map(HamandSpamCorp,removeNumbers)
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,str_replace_all,pattern = "<.*?>", replacement =" ")
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,str_replace_all,pattern ="\\=", replacement =" ")
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,str_replace_all,pattern = "[[:punct:]]", replacement =" ")
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,removeWords, words= stopwords("en"))
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,tolower)
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,stripWhitespace)
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp,stemDocument)
#http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
CleanHamandSpamCorp = tm_map(CleanHamandSpamCorp, PlainTextDocument)
#http://stackoverflow.com/questions/18504559/twitter-data-analysis-error-in-term-document-matrix
CleanHamandSpamCorp = Corpus(VectorSource(CleanHamandSpamCorp))
#CleanHamandSpamCorp [[1]][[1]]
tdmHS = TermDocumentMatrix(CleanHamandSpamCorp)
tdmHS
<<TermDocumentMatrix (terms: 46307, documents: 3051)>>
Non-/sparse entries: 291652/140991005
Sparsity : 100%
Maximal term length: 123
Weighting : term frequency (tf)
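The summary rounds the sparsity to 100%, which hides how many cells are actually filled. The arithmetic can be checked in base R with the figures printed above; the variable names are made up for illustration.

```r
# Figures taken from the printed summary above
n_terms <- 46307
n_docs  <- 3051
nonzero <- 291652
zero    <- 140991005

total_cells <- n_terms * n_docs     # every term/document combination
sparsity    <- zero / total_cells   # share of cells that are zero

print(total_cells == nonzero + zero)
print(round(100 * sparsity, 2))

# Threshold passed to removeSparseTerms() later in this document
threshold <- 1 - 25 / n_docs
print(n_docs * (1 - threshold))     # about 25 documents
```

As far as I understand tm's `removeSparseTerms`, a threshold of `1 - 25/3051` keeps roughly those terms that occur in more than 25 documents; that reading is an assumption worth verifying against the tm documentation.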
melabels = factor(unlist(meta(HamandSpamCorp, "Class")))
#aa = as.data.frame(as.matrix(tdmHS))
len = length(melabels)
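To see what the regular-expression steps in the cleaning code do to a single message, the same patterns can be applied with base R's `gsub`. This is only a sketch on a made-up string; it does not use tm, and it leaves out stop-word removal and stemming.

```r
# Made-up raw text used only for illustration
raw <- "<b>WIN</b> 1000 prizes now!!  Click=here"

clean <- gsub("[[:digit:]]+", "", raw)           # drop numbers
clean <- gsub("<.*?>", " ", clean, perl = TRUE)  # drop HTML-like tags
clean <- gsub("=", " ", clean, fixed = TRUE)     # drop "=" characters
clean <- gsub("[[:punct:]]", " ", clean)         # drop punctuation
clean <- tolower(clean)                          # lower-case everything
clean <- gsub("\\s+", " ", trimws(clean))        # trim and collapse whitespace
print(clean)
```

Running the patterns on a short string first is a quick way to confirm each one before applying it to the whole corpus.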
Container
I encountered many problems in this step. I followed the book's instructions but could not get a working solution. I suspected the error arose when I combined both datasets. The code works up until the creation of the container. The error I saw is the following: Error in svm.default(x = container@training_matrix, y = container@training_codes, : x and y don't match
TrainP = round(len * .8)
tdmHS = removeSparseTerms(tdmHS, 1-(25/length(HamandSpamCorp)))
Mycontainer = create_container(tdmHS,
                               labels = melabels,
                               trainSize = 1:1500,
                               testSize = 1501:1959,
                               virgin = FALSE)
# svm = train_model(Mycontainer, "SVM")
# tree = train_model(Mycontainer, "TREE")
# maxent = train_model(Mycontainer, "MAXENT")
#
# svmOut = classify_model(Mycontainer,svm)
# treeOut = classify_model(Mycontainer, tree)
# maxentOut = classify_model(Mycontainer,maxent)
#
# head(svmOut)
# head(treeOut)
# head(maxentOut)
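The message "x and y don't match" suggests that the number of rows in the training matrix differs from the number of training labels. Two things may be worth checking, though neither is confirmed as the cause here: `tdmHS` has terms as rows and documents as columns while `melabels` has one entry per document, and the hard-coded ranges `1:1500` and `1501:1959` use neither `TrainP` nor the 3051 documents reported above. The sketch below uses a toy base R matrix to show a row/label check and an 80/20 split computed from the document count. All names in it are made up for illustration, and it does not call RTextTools.

```r
set.seed(607)  # arbitrary seed so the shuffle is reproducible

# Toy term-document matrix: 4 terms (rows) x 10 documents (columns)
tdm_toy <- matrix(rpois(40, lambda = 1), nrow = 4, ncol = 10,
                  dimnames = list(paste0("term", 1:4), paste0("doc", 1:10)))
labels_toy <- factor(c(rep("Spam", 3), rep("Ham", 7)))  # spam first, then ham

# Put documents in rows so there is one row per label
dtm_toy <- t(tdm_toy)
stopifnot(nrow(dtm_toy) == length(labels_toy))

# Shuffle before splitting so the split does not follow the file order
n_docs    <- nrow(dtm_toy)
shuffled  <- sample(n_docs)
n_train   <- round(n_docs * 0.8)  # same 80% rule as TrainP
train_idx <- shuffled[1:n_train]
test_idx  <- shuffled[(n_train + 1):n_docs]

x_train <- dtm_toy[train_idx, , drop = FALSE]
y_train <- labels_toy[train_idx]
print(dim(x_train))
print(length(y_train))
```

With the real data, the analogous check would compare the number of documents in the matrix handed to the container with `length(melabels)` before training any model.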
References
https://rpubs.com/anilcs13m/126170
https://www.r-bloggers.com/classifying-emails-as-spam-or-ham-using-rtexttools/