This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Naive Bayes: a forecast such as a 70 percent chance of rain is known as a probability of precipitation report. While many machine learning algorithms ignore features that have weak effects, the Bayes method utilizes all the available evidence to subtly change the predictions.
If the two events are totally unrelated, they are called independent events. Event independence simply implies that knowing the outcome of one event does not provide any information about the outcome of the other.
Dependent events are the basis of predictive modelling. Just as the presence of clouds is predictive of a rainy day, the appearance of the word Viagra is predictive of a spam email.
The relationship between dependent events can be described using Bayes' theorem. The probability of event A, given that event B occurred, is known as the conditional probability P(A|B); since the probability of A is dependent (that is, conditional) on what happened with event B, Bayes' theorem tells us that our estimate of P(A|B) should be based on a measure of how often B is observed to occur in general.
If we know that event B occurred, the probability of event A is higher the more often A and B are observed to occur together each time B is observed.
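As a quick illustration of conditional probability (the counts below are hypothetical, not from the book), P(rain | clouds) is the joint probability of rain and clouds divided by the probability of clouds:
# Hypothetical counts over 100 observed days (made up for illustration):
# 40 days were cloudy, and on 25 of those it also rained.
p_cloudy            <- 40 / 100                      # P(clouds)
p_rain_and_cloudy   <- 25 / 100                      # P(rain and clouds)
p_rain_given_cloudy <- p_rain_and_cloudy / p_cloudy  # P(rain | clouds) = 0.625
p_rain_given_cloudy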
The best estimate of the overall probability of spam, 20 percent, is known as the prior probability. The probability that the word Viagra was used in previous spam messages is called the likelihood, P(Viagra | Spam). The probability that Viagra appeared in any message at all, P(Viagra), is known as the marginal likelihood.
By applying Bayes' theorem to this evidence, we can compute a posterior probability that measures how likely the message is to be spam. If the posterior probability is greater than 50 percent, the message is more likely to be spam than ham.
posterior probability P(Spam | Viagra) = [likelihood P(Viagra | Spam) × prior probability P(Spam)] / marginal likelihood P(Viagra)

| Likelihood | Viagra: Yes | Viagra: No | Total |
|------------|-------------|------------|-------|
| Spam       | 4/20        | 16/20      | 20    |
| Ham        | 1/80        | 79/80      | 80    |
| Total      | 5/100       | 95/100     | 100   |

P(Spam | Viagra) = P(Viagra | Spam) × P(Spam) / P(Viagra) = (4/20 × 20/100) / (5/100) = 0.80
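The same arithmetic can be reproduced in R as a quick check; the numbers come directly from the likelihood table above:
p_viagra_given_spam <- 4 / 20     # likelihood, P(Viagra | Spam)
p_spam              <- 20 / 100   # prior probability, P(Spam)
p_viagra            <- 5 / 100    # marginal likelihood, P(Viagra)
p_spam_given_viagra <- p_viagra_given_spam * p_spam / p_viagra
p_spam_given_viagra  # 0.8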
The naive Bayes algorithm provides a simple method for applying Bayes' theorem to classification problems.
Naïve Bayes assumes that all of the features in the dataset are equally important and independent. These assumptions are rarely true in most real-world applications.
Besides the single word Viagra, a real spam filter evaluates many features at once. With several features, it can happen that the formula assigns a message a 0 percent probability of being spam and a 100 percent probability of being ham. This prediction does not make sense. The problem arises when an event never occurs for one or more levels of a class or variable. For instance, if the term Groceries had never previously appeared in a spam message, then P(Spam | Groceries) = 0%.
Because the probabilities in the naive Bayes formula are multiplied in a chain, this 0 percent value causes the posterior probability of spam to be zero, giving the word Groceries the ability to effectively nullify and overrule all of the other evidence. Even if the email was otherwise overwhelmingly expected to be spam, the fact that Groceries never appeared in spam will always veto the other evidence and result in a spam probability of zero.
A solution to this problem involves using something called the LAPLACE ESTIMATOR. The Laplace estimator essentially adds a small number to each of the counts in the frequency table, which ensures that each feature has a nonzero probability of occurring with each class. Typically, the Laplace estimator is set to 1.
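A minimal sketch of the idea, using hypothetical counts (the counts and the +2 denominator adjustment for the yes/no levels are illustrative assumptions, not the exact smoothing used by e1071):
spam_word_counts <- c(viagra = 4, groceries = 0)   # hypothetical counts in 20 spam messages
n_spam <- 20

spam_word_counts / n_spam                 # raw likelihoods: groceries = 0 would veto all other evidence
(spam_word_counts + 1) / (n_spam + 2)     # Laplace-adjusted likelihoods: every word is nonzero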
Step 2: Exploring and preparing the data
Read the SMS data into the sms_raw data frame.
setwd("/Users/anknape/Mainfolder/R/Book/MLWR/Chapter 04")
getwd()
## [1] "/Users/anknape/Mainfolder/R/Book/MLWR/Chapter 04"
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
Examine the structure of the sms data
str(sms_raw)
## 'data.frame': 5559 obs. of 2 variables:
## $ type: chr "ham" "ham" "ham" "spam" ...
## $ text: chr "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!"| __truncated__ ...
The type element is currently a character vector. Since this is a categorical variable, it would be better to convert it into a factor, as shown in the following code (convert spam/ham to factor; only the type column is converted):
sms_raw$type <- factor(sms_raw$type)
Examine the type variable more carefully
str(sms_raw$type)
## Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
table(sms_raw$type)
##
## ham spam
## 4812 747
Build a corpus using the text mining (tm) package
#install.packages("tm")
library(tm)
## Loading required package: NLP
We create a corpus, which is a collection of text documents; in this case, a collection of SMS messages. We use VCorpus(), which creates a volatile corpus stored in memory, as opposed to a permanent corpus stored on disk (for that, use PCorpus()). Since we already loaded the SMS message text into R, we use the VectorSource() reader function to create a source object from the existing vector sms_raw$text.
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
Examine the sms corpus
print(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5559
#str(sms_corpus)
#summary(sms_corpus)
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 49
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 23
To view the actual messages, we use the as.character() function and apply it to the desired messages:
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
as.character(sms_corpus[[100]])
## [1] "Urgent Urgent! We have 800 FREE flights to Europe to give away, call B4 10th Sept & take a friend 4 FREE. Call now to claim on 09050000555. BA128NNFWFLY150ppm"
If you want to view multiple messages, apply the as.character() function to several list elements using lapply():
lapply(sms_corpus[1:2], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
##
## $`2`
## [1] "K..give back my thanks."
Clean up the corpus using tm_map(), first converting all letters to lowercase:
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
Show the difference between sms_corpus and sms_corpus_clean:
as.character(sms_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
as.character(sms_corpus_clean[[1]])
## [1] "hope you are having a good week. just checking in"
The content_transformer() function can be used to apply more sophisticated text processing and cleanup steps, such as grep-style pattern matching and replacement.
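For example (a hypothetical illustration, not from the book), a gsub() call can be wrapped in content_transformer() to replace the texting abbreviation "u" with "you"; the pattern and the demo object name are assumptions for demonstration only:
replace_u <- content_transformer(function(x) gsub("\\bu\\b", "you", x))
sms_corpus_demo <- tm_map(sms_corpus_clean, replace_u)   # result stored separately so the main pipeline is unchanged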
Remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
Remove stop words; note that you can also supply your own list.
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
Remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
Notice that content_transformer() is not used above; that is because functions like removeNumbers() are built into tm. For an overview of the built-in transformations, simply type getTransformations() (page 110):
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
stopwords()
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
To work around the default behavior of removePunctuation(), simply create a custom function that replaces rather than removes punctuation characters. Essentially, this uses the gsub() function to substitute a blank space for any punctuation characters in x. Tip: create a custom function to replace (rather than remove) punctuation:
removePunctuation("hello...world")
## [1] "helloworld"
replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }
replacePunctuation("hello...world")
## [1] "hello world"
Reducing words to their root form is called stemming. It takes words like learned, learning, and learns and reduces them to the root LEARN. This allows a machine learning algorithm to treat the related terms as a single concept rather than attempting to learn a pattern for each variant. For stemming you need the SnowballC package. For more details, see http://snowball.tartarus.org.
#install.packages("SnowballC")
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns"))
## [1] "learn" "learn" "learn" "learn"
To apply stemming to the entire corpus:
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
Eliminate unneeded whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
Examine the final clean corpus
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Hope you are having a good week. Just checking in"
##
## $`2`
## [1] "K..give back my thanks."
##
## $`3`
## [1] "Am also doing in cbe only. But have to pay."
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "hope good week just check"
##
## $`2`
## [1] "kgive back thank"
##
## $`3`
## [1] " also cbe pay"
Now we split the messages into individual components through a process called TOKENIZATION. A token is a single element of a text string; in this case, the tokens are words. The tm package provides the DocumentTermMatrix() function, which creates a matrix in which rows indicate documents (SMS messages) and columns indicate terms (words).
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
Alternative solution: the previous steps can also be done in one step, but here the order of the operations is important (see page 113). Create a document-term sparse matrix directly from the SMS corpus:
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
removePunctuation = TRUE,
stemming = TRUE
))
Alternative solution: using a custom stop words function ensures an identical result:
sms_dtm3 <- DocumentTermMatrix(sms_corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
stopwords = function(x) { removeWords(x, stopwords()) },
removePunctuation = TRUE,
stemming = TRUE
))
Compare the result
sms_dtm
## <<DocumentTermMatrix (documents: 5559, terms: 6518)>>
## Non-/sparse entries: 42113/36191449
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
sms_dtm2
## <<DocumentTermMatrix (documents: 5559, terms: 6909)>>
## Non-/sparse entries: 43192/38363939
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
sms_dtm3
## <<DocumentTermMatrix (documents: 5559, terms: 6518)>>
## Non-/sparse entries: 42113/36191449
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
With our data prepared for analysis, we now need to split the data into training and test datasets, so that once our spam classifier is built, it can be evaluated on data it has not previously seen. Note that the split occurs after the data have been cleaned and processed, so exactly the same preparation steps are applied to both the training and test sets.
Creating training and test datasets
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
#inspect(sms_dtm_test)
Also save the labels; these are the spam/ham values.
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
sms_raw[1, ]$type
## [1] ham
## Levels: ham spam
Check that the proportion of spam is similar
prop.table(table(sms_train_labels))
## sms_train_labels
## ham spam
## 0.8647158 0.1352842
prop.table(table(sms_test_labels))
## sms_test_labels
## ham spam
## 0.8683453 0.1316547
Word cloud visualization
#install.packages("wordcloud")
#install.packages("RColorBrewer")
library(RColorBrewer)
library(wordcloud)
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
Because we set random.order = FALSE, the cloud is arranged in non-random order, with higher-frequency words placed closer to the center. If we did not specify random.order, the cloud would be arranged randomly by default. The min.freq parameter specifies the minimum number of times a word must appear in the corpus before it is shown in the cloud. For more information on this package, see http://blog.fellstat.com/?cat=11.
Subset the raw SMS data into spam and ham groups.
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
The max.words parameter limits the cloud to the most common words, and scale allows us to adjust the maximum and minimum font sizes:
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))
One way to trim rarely used words is removeSparseTerms(), which here removes terms from the training DTM that appear in fewer than about 0.1 percent of messages (sparsity above 0.999):
sms_dtm_freq_train <- removeSparseTerms(sms_dtm_train, 0.999)
sms_dtm_freq_train
## <<DocumentTermMatrix (documents: 4169, terms: 1101)>>
## Non-/sparse entries: 24834/4565235
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
We now transform the sparse matrix into a data structure that can be used to train a naive Bayes classifier. The DTM currently contains over 6,500 features, and not all of them are useful. To reduce the number of features, we will eliminate any word that appears in fewer than five SMS messages, or in less than about 0.1 percent of the records. Finding frequent words requires the findFreqTerms() function in tm; it returns a character vector containing the words that appear at least the specified number of times. These become the indicator features for frequent words.
findFreqTerms(sms_dtm_train, 5)
## [1] "abiola" "abl" "abt"
## [4] "accept" "access" "account"
## [7] "across" "act" "activ"
## [10] "actual" "add" "address"
## [13] "admir" "adult" "advanc"
## [16] "aft" "afternoon" "age"
## [19] "ago" "aha" "ahead"
## [22] "aight" "aint" "air"
## [25] "aiyo" "alex" "almost"
## [28] "alon" "alreadi" "alright"
## [31] "also" "alway" "angri"
## [34] "announc" "anoth" "answer"
## [37] "anymor" "anyon" "anyth"
## [40] "anytim" "anyway" "apart"
## [43] "app" "appli" "appreci"
## [46] "arcad" "ard" "area"
## [49] "argu" "argument" "armand"
## [52] "around" "arrang" "arriv"
## [55] "asap" "ask" "askd"
## [58] "attempt" "auction" "avail"
## [61] "ave" "avoid" "await"
## [64] "awak" "award" "away"
## [67] "awesom" "babe" "babi"
## [70] "back" "bad" "bag"
## [73] "bank" "bare" "basic"
## [76] "bath" "batteri" "bcoz"
## [79] "bday" "beauti" "becom"
## [82] "bed" "bedroom" "beer"
## [85] "begin" "believ" "best"
## [88] "better" "bid" "big"
## [91] "bill" "bird" "birthday"
## [94] "bit" "black" "blank"
## [97] "bless" "blue" "bluetooth"
## [100] "bold" "bonus" "boo"
## [103] "book" "boost" "bore"
## [106] "boss" "bother" "bout"
## [109] "box" "boy" "boytoy"
## [112] "break" "breath" "bring"
## [115] "brother" "bslvyl" "btnationalr"
## [118] "buck" "bus" "busi"
## [121] "buy" "cabin" "call"
## [124] "caller" "callertun" "camcord"
## [127] "came" "camera" "campus"
## [130] "can" "cancel" "cancer"
## [133] "cant" "car" "card"
## [136] "care" "carlo" "case"
## [139] "cash" "cashbal" "catch"
## [142] "caus" "celebr" "cell"
## [145] "centr" "chanc" "chang"
## [148] "charg" "chat" "cheap"
## [151] "cheaper" "check" "cheer"
## [154] "chennai" "chikku" "childish"
## [157] "children" "choic" "choos"
## [160] "christma" "claim" "class"
## [163] "clean" "clear" "close"
## [166] "club" "code" "coffe"
## [169] "cold" "colleagu" "collect"
## [172] "colleg" "colour" "come"
## [175] "comin" "comp" "compani"
## [178] "competit" "complet" "complimentari"
## [181] "comput" "condit" "confirm"
## [184] "congrat" "congratul" "connect"
## [187] "contact" "content" "contract"
## [190] "cook" "cool" "copi"
## [193] "correct" "cos" "cost"
## [196] "costa" "costpm" "coupl"
## [199] "cours" "cover" "coz"
## [202] "crave" "crazi" "creat"
## [205] "credit" "cri" "cross"
## [208] "cuddl" "cum" "cup"
## [211] "current" "custcar" "custom"
## [214] "cut" "cute" "cuz"
## [217] "dad" "daddi" "darl"
## [220] "darlin" "darren" "dat"
## [223] "date" "day" "dead"
## [226] "deal" "dear" "decid"
## [229] "decim" "decis" "deep"
## [232] "definit" "del" "deliv"
## [235] "deliveri" "den" "depend"
## [238] "detail" "didnt" "die"
## [241] "diet" "differ" "difficult"
## [244] "digit" "din" "dinner"
## [247] "direct" "dis" "discount"
## [250] "discuss" "disturb" "dnt"
## [253] "doc" "doctor" "doesnt"
## [256] "dog" "doin" "don"
## [259] "done" "dont" "door"
## [262] "doubl" "download" "draw"
## [265] "dream" "drink" "drive"
## [268] "drop" "drug" "dude"
## [271] "due" "dun" "dunno"
## [274] "dvd" "earli" "earlier"
## [277] "earth" "easi" "eat"
## [280] "eatin" "egg" "either"
## [283] "els" "email" "embarass"
## [286] "end" "energi" "england"
## [289] "enjoy" "enough" "enter"
## [292] "entitl" "entri" "envelop"
## [295] "etc" "euro" "eve"
## [298] "even" "ever" "everi"
## [301] "everybodi" "everyon" "everyth"
## [304] "exact" "exam" "excel"
## [307] "excit" "excus" "expect"
## [310] "experi" "expir" "extra"
## [313] "eye" "face" "facebook"
## [316] "fact" "fall" "famili"
## [319] "fanci" "fantasi" "fantast"
## [322] "far" "fast" "fat"
## [325] "father" "fault" "feb"
## [328] "feel" "felt" "fetch"
## [331] "fight" "figur" "file"
## [334] "fill" "film" "final"
## [337] "find" "fine" "finger"
## [340] "finish" "first" "fix"
## [343] "flag" "flat" "flight"
## [346] "flower" "follow" "fone"
## [349] "food" "forev" "forget"
## [352] "forgot" "forward" "found"
## [355] "freak" "free" "freemsg"
## [358] "freephon" "fren" "fri"
## [361] "friday" "friend" "friendship"
## [364] "frm" "frnd" "frnds"
## [367] "full" "fullonsmscom" "fun"
## [370] "funni" "futur" "gal"
## [373] "game" "gap" "gas"
## [376] "gave" "gay" "gentl"
## [379] "get" "gettin" "gift"
## [382] "girl" "girlfrnd" "give"
## [385] "glad" "god" "goe"
## [388] "goin" "gone" "gonna"
## [391] "good" "goodmorn" "goodnight"
## [394] "got" "goto" "gotta"
## [397] "great" "grin" "guarante"
## [400] "gud" "guess" "guy"
## [403] "gym" "haf" "haha"
## [406] "hai" "hair" "half"
## [409] "hand" "handset" "hang"
## [412] "happen" "happi" "hard"
## [415] "hate" "hav" "havent"
## [418] "head" "hear" "heard"
## [421] "heart" "heavi" "hee"
## [424] "hell" "hello" "help"
## [427] "hey" "hgsuiteland" "hit"
## [430] "hiya" "hmm" "hmmm"
## [433] "hmv" "hol" "hold"
## [436] "holder" "holiday" "home"
## [439] "hook" "hop" "hope"
## [442] "horni" "hospit" "hot"
## [445] "hotel" "hour" "hous"
## [448] "how" "howev" "howz"
## [451] "hrs" "httpwwwurawinnercom" "hug"
## [454] "huh" "hungri" "hurri"
## [457] "hurt" "ice" "idea"
## [460] "identifi" "ignor" "ill"
## [463] "immedi" "import" "inc"
## [466] "includ" "india" "info"
## [469] "inform" "insid" "instead"
## [472] "interest" "invit" "ipod"
## [475] "irrit" "ish" "island"
## [478] "issu" "ive" "izzit"
## [481] "januari" "jay" "job"
## [484] "john" "join" "joke"
## [487] "joy" "jst" "jus"
## [490] "just" "juz" "kate"
## [493] "keep" "kept" "kick"
## [496] "kid" "kill" "kind"
## [499] "kinda" "king" "kiss"
## [502] "knew" "know" "knw"
## [505] "ladi" "land" "landlin"
## [508] "laptop" "lar" "last"
## [511] "late" "later" "latest"
## [514] "laugh" "lazi" "ldn"
## [517] "lead" "learn" "least"
## [520] "leav" "lect" "left"
## [523] "leh" "lei" "less"
## [526] "lesson" "let" "letter"
## [529] "liao" "librari" "lie"
## [532] "life" "lift" "light"
## [535] "like" "line" "link"
## [538] "list" "listen" "littl"
## [541] "live" "lmao" "load"
## [544] "loan" "local" "locat"
## [547] "log" "lol" "london"
## [550] "long" "longer" "look"
## [553] "lookin" "lor" "lose"
## [556] "lost" "lot" "lovabl"
## [559] "love" "lover" "loyalti"
## [562] "ltd" "luck" "lucki"
## [565] "lunch" "luv" "mad"
## [568] "made" "mah" "mail"
## [571] "make" "malaria" "man"
## [574] "mani" "march" "mark"
## [577] "marri" "match" "mate"
## [580] "matter" "maxim" "maxmin"
## [583] "may" "mayb" "meal"
## [586] "mean" "meant" "med"
## [589] "medic" "meet" "meetin"
## [592] "meh" "member" "men"
## [595] "merri" "messag" "met"
## [598] "mid" "midnight" "might"
## [601] "min" "mind" "mine"
## [604] "minut" "miracl" "miss"
## [607] "mistak" "moan" "mob"
## [610] "mobil" "mobileupd" "mode"
## [613] "mom" "moment" "mon"
## [616] "monday" "money" "month"
## [619] "morn" "mother" "motorola"
## [622] "move" "movi" "mrng"
## [625] "mrt" "mrw" "msg"
## [628] "msgs" "mths" "much"
## [631] "mum" "murder" "music"
## [634] "must" "muz" "nah"
## [637] "nake" "name" "nation"
## [640] "natur" "naughti" "near"
## [643] "need" "net" "network"
## [646] "neva" "never" "new"
## [649] "news" "next" "nice"
## [652] "nigeria" "night" "nite"
## [655] "nobodi" "noe" "nokia"
## [658] "noon" "nope" "normal"
## [661] "normpton" "noth" "notic"
## [664] "now" "num" "number"
## [667] "nyt" "obvious" "offer"
## [670] "offic" "offici" "okay"
## [673] "oki" "old" "omg"
## [676] "one" "onlin" "onto"
## [679] "oop" "open" "oper"
## [682] "opinion" "opt" "optout"
## [685] "orang" "orchard" "order"
## [688] "oredi" "oso" "other"
## [691] "otherwis" "outsid" "pack"
## [694] "page" "paid" "pain"
## [697] "paper" "parent" "park"
## [700] "part" "parti" "partner"
## [703] "pass" "passion" "password"
## [706] "past" "pay" "peopl"
## [709] "per" "person" "pete"
## [712] "phone" "photo" "pic"
## [715] "pick" "pictur" "pin"
## [718] "piss" "pix" "pizza"
## [721] "place" "plan" "play"
## [724] "player" "pleas" "pleasur"
## [727] "plenti" "pls" "plus"
## [730] "plz" "pmin" "pmsg"
## [733] "pobox" "point" "poli"
## [736] "polic" "poor" "pop"
## [739] "possess" "possibl" "post"
## [742] "pound" "power" "ppm"
## [745] "pray" "present" "press"
## [748] "pretti" "previous" "price"
## [751] "princess" "privat" "prize"
## [754] "prob" "probabl" "problem"
## [757] "project" "promis" "pub"
## [760] "put" "qualiti" "question"
## [763] "quick" "quit" "quiz"
## [766] "quot" "rain" "random"
## [769] "rang" "rate" "rather"
## [772] "rcvd" "reach" "read"
## [775] "readi" "real" "reali"
## [778] "realli" "reason" "receipt"
## [781] "receiv" "recent" "record"
## [784] "refer" "regard" "regist"
## [787] "relat" "relax" "remain"
## [790] "rememb" "remind" "remov"
## [793] "rent" "rental" "repli"
## [796] "repres" "request" "respond"
## [799] "respons" "rest" "result"
## [802] "return" "reveal" "review"
## [805] "reward" "right" "ring"
## [808] "rington" "rite" "road"
## [811] "rock" "role" "room"
## [814] "roommat" "rose" "round"
## [817] "rowwjhl" "rpli" "rreveal"
## [820] "run" "rush" "sad"
## [823] "sae" "safe" "said"
## [826] "sale" "sat" "saturday"
## [829] "savamob" "save" "saw"
## [832] "say" "sch" "school"
## [835] "scream" "sea" "search"
## [838] "sec" "second" "secret"
## [841] "see" "seem" "seen"
## [844] "select" "self" "sell"
## [847] "semest" "send" "sens"
## [850] "sent" "serious" "servic"
## [853] "set" "settl" "sex"
## [856] "sexi" "shall" "share"
## [859] "shd" "ship" "shirt"
## [862] "shop" "short" "show"
## [865] "shower" "sick" "side"
## [868] "sigh" "sight" "sign"
## [871] "silent" "simpl" "sinc"
## [874] "singl" "sipix" "sir"
## [877] "sis" "sister" "sit"
## [880] "situat" "skxh" "skype"
## [883] "slave" "sleep" "slept"
## [886] "slow" "slowli" "small"
## [889] "smile" "smoke" "sms"
## [892] "smth" "snow" "sofa"
## [895] "sol" "somebodi" "someon"
## [898] "someth" "sometim" "somewher"
## [901] "song" "soni" "sonyericsson"
## [904] "soon" "sorri" "sort"
## [907] "sound" "south" "space"
## [910] "speak" "special" "specialcal"
## [913] "spend" "spent" "spoke"
## [916] "spree" "stand" "start"
## [919] "statement" "station" "stay"
## [922] "std" "step" "still"
## [925] "stockport" "stone" "stop"
## [928] "store" "stori" "street"
## [931] "student" "studi" "stuff"
## [934] "stupid" "style" "sub"
## [937] "subscrib" "success" "suck"
## [940] "suit" "summer" "sun"
## [943] "sunday" "sunshin" "sup"
## [946] "support" "suppos" "sure"
## [949] "surf" "surpris" "sweet"
## [952] "swing" "system" "take"
## [955] "talk" "tampa" "tariff"
## [958] "tcs" "tea" "teach"
## [961] "tear" "teas" "tel"
## [964] "tell" "ten" "tenerif"
## [967] "term" "test" "text"
## [970] "thank" "thanx" "that"
## [973] "thing" "think" "thinkin"
## [976] "thk" "tho" "though"
## [979] "thought" "throw" "thru"
## [982] "tht" "thur" "tick"
## [985] "ticket" "til" "till"
## [988] "time" "tire" "titl"
## [991] "tmr" "toclaim" "today"
## [994] "togeth" "told" "tomo"
## [997] "tomorrow" "tone" "tonight"
## [1000] "tonit" "took" "top"
## [1003] "torch" "tot" "total"
## [1006] "touch" "tough" "tour"
## [1009] "toward" "town" "track"
## [1012] "train" "transact" "travel"
## [1015] "treat" "tri" "trip"
## [1018] "troubl" "true" "trust"
## [1021] "truth" "tscs" "ttyl"
## [1024] "tuesday" "turn" "twice"
## [1027] "two" "txt" "txting"
## [1030] "txts" "type" "ufind"
## [1033] "ugh" "ull" "uncl"
## [1036] "understand" "unless" "unlimit"
## [1039] "unredeem" "unsub" "unsubscrib"
## [1042] "updat" "ure" "urgent"
## [1045] "urself" "use" "user"
## [1048] "usf" "usual" "uve"
## [1051] "valentin" "valid" "valu"
## [1054] "via" "video" "vikki"
## [1057] "visit" "vodafon" "voic"
## [1060] "vomit" "voucher" "wait"
## [1063] "wake" "walk" "wan"
## [1066] "wana" "wanna" "want"
## [1069] "wap" "warm" "wast"
## [1072] "wat" "watch" "water"
## [1075] "way" "weak" "wear"
## [1078] "weather" "wed" "wednesday"
## [1081] "weed" "week" "weekend"
## [1084] "welcom" "well" "wen"
## [1087] "went" "what" "whatev"
## [1090] "whenev" "whole" "wid"
## [1093] "wif" "wife" "wil"
## [1096] "will" "win" "wine"
## [1099] "winner" "wish" "wit"
## [1102] "within" "without" "wiv"
## [1105] "wkli" "wks" "wnt"
## [1108] "woke" "won" "wonder"
## [1111] "wont" "word" "work"
## [1114] "workin" "world" "worri"
## [1117] "wors" "worth" "wot"
## [1120] "wow" "write" "wrong"
## [1123] "wwq" "wwwgetzedcouk" "xmas"
## [1126] "xxx" "yahoo" "yar"
## [1129] "yeah" "year" "yep"
## [1132] "yes" "yesterday" "yet"
## [1135] "yoga" "yup"
The result is a character vector, which we will save:
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
A quick look shows 1,136 frequent words:
str(sms_freq_words)
## chr [1:1136] "abiola" "abl" "abt" "accept" "access" ...
Create DTMs with only the frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
The naive Bayes classifier is typically trained on data with categorical features. This poses a problem, since the cells in the sparse matrix are numeric and measure the number of times a word appears in a message. We need to change this to a categorical variable that simply indicates yes or no, depending on whether the word appears at all; that is, we convert counts to a factor.
The following defines a convert_counts() function to convert counts to "Yes"/"No" strings:
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
We now need to apply convert_counts() to each column in our sparse matrix. The apply() function uses a MARGIN parameter to specify rows or columns: MARGIN = 2 applies the function to columns, while MARGIN = 1 applies it to rows. Apply convert_counts() to the columns of the train/test data:
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
#sms_test
We will use the naive Bayes implementation from the e1071 package.
#install.packages("e1071")
library(e1071)
To build our model on the sms_train matrix we use:
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
The predict() function is used to make the predictions, which we store in a vector named sms_test_pred. We simply supply the function with our classifier and the test dataset:
sms_test_pred <- predict(sms_classifier, sms_test)
To compare the predictions to the true values, we'll use the CrossTable() function from the gmodels package, which we used previously. This time we'll add some additional parameters to eliminate unnecessary cell proportions and use the dnn parameter to relabel the rows and columns, as shown in the following code:
#install.packages("gmodels")
library(gmodels)
CrossTable(sms_test_pred, sms_test_labels,
prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1201 | 30 | 1231 |
## | 0.995 | 0.164 | |
## -------------|-----------|-----------|-----------|
## spam | 6 | 153 | 159 |
## | 0.005 | 0.836 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|
##
##
By default, laplace is set to 0, which allows words that appeared in zero spam or zero ham messages to have an indisputable say in the classification process. We now set laplace = 1:
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1202 | 28 | 1230 |
## | 0.996 | 0.153 | |
## -------------|-----------|-----------|-----------|
## spam | 5 | 155 | 160 |
## | 0.004 | 0.847 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|
##
##