This project demonstrates text classification into spam and ham using the Naive Bayes machine learning model. The dataset is the SMS Spam Collection from the UCI Machine Learning Repository, and each message is labelled as either spam or ham. We will use roughly 80% of the dataset as the training set and the rest as the test set.
# Import Libraries
library(tm)
library(SnowballC)
#library(wordcloud)
library(RColorBrewer)
library(e1071) # for Naive Bayes
library(caret) # for Confusion Matrix
# Import data
sms_raw<- read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/SpamHamText.csv")
head(sms_raw)## Column1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## Column2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
sms_raw <- sms_raw[, 1:2] # First column tells whether the observation is spam or ham; second column is the actual message
colnames(sms_raw) <- c("Tag", "Msg")
# Number of rows and columns
dim(sms_raw)## [1] 5574 2
# Count and proportion of spam and ham messages
table(sms_raw$Tag)##
## ham spam
## 4827 747
prop.table(table(sms_raw$Tag))##
## ham spam
## 0.8659849 0.1340151
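These class proportions serve as the prior probabilities for the Naive Bayes model. For each message, the classifier combines the prior with per-word likelihoods learned from the training data, under the (naive) assumption that words occur independently given the class:

$$P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$$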
sms_corpus <- VCorpus(VectorSource(sms_raw$Msg))
sms_dtm <- DocumentTermMatrix(sms_corpus, control =
list(tolower = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
removePunctuation = TRUE,
stemming = TRUE))
dim(sms_dtm)## [1] 5574 7024
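As an optional sanity check (not part of the original write-up), we can compare a raw message against a small slice of the document-term matrix to confirm that lowercasing, stop-word removal, and stemming behaved as expected; the exact terms shown are illustrative only.
as.character(sms_corpus[[1]]) # original text of the first SMS
inspect(sms_dtm[1:5, 1:8]) # term counts for the first 5 documents and 8 terms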
#Training & Test set
sms_dtm_train <- sms_dtm[1:4457, ]
sms_dtm_test <- sms_dtm[4458:5572, ]
#Training & Test Label
sms_train_labels <- sms_raw[1:4457, ]$Tag
sms_test_labels <- sms_raw[4458:5572, ]$Tag
#Proportion for training & test labels
prop.table(table(sms_train_labels))## sms_train_labels
## ham spam
## 0.8649316 0.1350684
prop.table(table(sms_test_labels))## sms_test_labels
## ham spam
## 0.8699552 0.1300448
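The split above is positional (the first 4,457 rows become the training set), which assumes the messages are not ordered by class; the similar spam proportions in the two sets suggest they are not. A stratified random split would be a safer alternative. A minimal sketch using caret's createDataPartition (hypothetical object names, not used elsewhere in this analysis):
set.seed(123)
train_idx <- createDataPartition(sms_raw$Tag, p = 0.8, list = FALSE)
sms_dtm_train_alt <- sms_dtm[train_idx, ] # stratified 80% training slice
sms_dtm_test_alt <- sms_dtm[-train_idx, ] # remaining 20% test slice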
# Keep only terms that appear in at least 0.1% of all messages
threshold <- 0.1
min_freq <- round(sms_dtm$nrow * (threshold / 100), 0)
min_freq## [1] 6
# Create vector of most frequent words
freq_words <- findFreqTerms(x = sms_dtm, lowfreq = min_freq)
#Filter the DTM
sms_dtm_freq_train <- sms_dtm_train[ , freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , freq_words]
dim(sms_dtm_freq_train)## [1] 4457 1268
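If desired, we can glance at a few of the retained (stemmed) terms to confirm the frequency filter kept a sensible vocabulary; the output is illustrative only.
head(freq_words, 10) # first few terms appearing in at least min_freq messages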
# Convert numeric term counts into a categorical Yes/No indicator
convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
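For example, applied to a toy vector of term counts, the helper maps any non-zero count to "Yes":
convert_values(c(0, 2, 0, 5)) # returns "No" "Yes" "No" "Yes"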
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_values)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_values)
# Create model from the training dataset
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
#Make predictions on test set
sms_test_pred <- predict(sms_classifier, sms_test)
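If posterior class probabilities are of interest rather than hard labels, e1071's predict() also accepts type = "raw"; the sketch below is not part of the original output.
sms_test_prob <- predict(sms_classifier, sms_test, type = "raw") # matrix of P(ham) and P(spam) per message
head(sms_test_prob)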
#Create confusion matrix
confusionMatrix(data = sms_test_pred, reference = sms_test_labels,
positive = "spam", dnn = c("Prediction", "Actual"))## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 965 16
## spam 5 129
##
## Accuracy : 0.9812
## 95% CI : (0.9714, 0.9883)
## No Information Rate : 0.87
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.914
##
## Mcnemar's Test P-Value : 0.0291
##
## Sensitivity : 0.8897
## Specificity : 0.9948
## Pos Pred Value : 0.9627
## Neg Pred Value : 0.9837
## Prevalence : 0.1300
## Detection Rate : 0.1157
## Detection Prevalence : 0.1202
## Balanced Accuracy : 0.9423
##
## 'Positive' Class : spam
##
We see from the confusion matrix that the model misclassified 16 spam messages as ham and 5 ham messages as spam, giving an overall accuracy of 98.12%.
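A straightforward refinement to try next is Laplace smoothing, which keeps terms that never co-occur with one class in the training data from driving that class's probability to zero. A minimal sketch, reusing the objects above (results would need to be re-run to compare against the 98.12% baseline):
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
confusionMatrix(data = sms_test_pred2, reference = sms_test_labels,
                positive = "spam", dnn = c("Prediction", "Actual"))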