Introduction

This project demonstrates text classification of SMS messages into spam and ham (legitimate messages) using a Naive Bayes machine learning model. The dataset comes from the SMS Spam Collection in the UCI Machine Learning Repository, and each message is labeled as either spam or ham. We use roughly 80% of the dataset as the training set and the rest as the test set.

Import Libraries

library(tm)
library(SnowballC)
#library(wordcloud)
library(RColorBrewer)
library(e1071) # for Naive Bayes
library(caret) # for Confusion Matrix

Data Import and Exploration

Data Import

# Import data
sms_raw<- read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/SpamHamText.csv")
head(sms_raw)

##   Column1
## 1     ham
## 2     ham
## 3    spam
## 4     ham
## 5     ham
## 6    spam
##                                                                                                                                                       Column2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6         FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Data Cleansing

Subset and rename the columns

sms_raw <- sms_raw[, 1:2] # First column is the label (spam or ham); second column is the message text
colnames(sms_raw) <- c("Tag", "Msg")

Dataset breakdown by categories

# Number of rows and columns
dim(sms_raw)

## [1] 5574    2

# Number and proportion of spam and ham messages
table(sms_raw$Tag)

## 
##  ham spam 
## 4827  747

prop.table(table(sms_raw$Tag))

## 
##       ham      spam 
## 0.8659849 0.1340151
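
The class proportions above give a useful reference point: a classifier that always predicts ham would already be right about 86.6% of the time, so any model should be judged against that figure. A minimal sketch using the counts printed above (the variable names are illustrative and not part of the original analysis):

```r
# Class counts copied from the table() output above
class_counts <- c(ham = 4827, spam = 747)

# Accuracy of a trivial classifier that always predicts the majority class
majority_class <- names(which.max(class_counts))
baseline_accuracy <- max(class_counts) / sum(class_counts)

majority_class
round(baseline_accuracy, 4)
```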

Data processing

sms_corpus <- VCorpus(VectorSource(sms_raw$Msg))

sms_dtm <- DocumentTermMatrix(sms_corpus, control = 
                                 list(tolower = TRUE,
                                      removeNumbers = TRUE,
                                      stopwords = TRUE,
                                      removePunctuation = TRUE,
                                      stemming = TRUE))

dim(sms_dtm)

## [1] 5574 7024
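
To make the shape of a document-term matrix concrete, here is a small base R sketch on three made-up messages. It only illustrates the idea (one row per message, one column per term, cells holding counts). It is not how the tm package builds its sparse matrix, and it skips the stop-word removal and stemming applied above.

```r
# Toy messages, made up for illustration
msgs <- c("free entry win", "ok see you", "win free prize")

# Split each message into lowercase tokens
tokens <- strsplit(tolower(msgs), "\\s+")

# Vocabulary: sorted unique terms across all messages
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each message
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# Rows are documents, columns are terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(paste0("doc", seq_along(msgs)), vocab)

toy_dtm
dim(toy_dtm)
```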

Creating Training and Test Data

# Training and test sets (rows 5573-5574 are not assigned to either set)
sms_dtm_train <- sms_dtm[1:4457, ]
sms_dtm_test <- sms_dtm[4458:5572, ]

#Training & Test Label
sms_train_labels <- sms_raw[1:4457, ]$Tag
sms_test_labels <- sms_raw[4458:5572, ]$Tag

#Proportion for training & test labels
prop.table(table(sms_train_labels))

## sms_train_labels
##       ham      spam 
## 0.8649316 0.1350684

prop.table(table(sms_test_labels))

## sms_test_labels
##       ham      spam 
## 0.8699552 0.1300448
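
The positional split above relies on the rows not being ordered by class. The similar spam proportions in both sets suggest that this holds here. If the ordering were in doubt, a random split would be one alternative. The sketch below only shows how such indices could be drawn. The seed and the names `train_idx` and `test_idx` are assumptions for illustration and were not used to produce the results in this report.

```r
set.seed(123)      # arbitrary seed, only for reproducibility
n_rows <- 5574     # row count reported by dim(sms_raw)

# Draw 80% of the row indices at random for training
train_idx <- sample(n_rows, size = floor(0.8 * n_rows))

# The remaining rows form the test set
test_idx <- setdiff(seq_len(n_rows), train_idx)

length(train_idx)
length(test_idx)
```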

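# Threshold expressed as a percentage of the number of messages
threshold <- 0.1

# Minimum term frequency: 0.1% of 5574 messages is 5.574, which rounds to 6
min_freq <- round(sms_dtm$nrow * (threshold / 100), 0)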

min_freq

## [1] 6

# Create vector of most frequent words
freq_words <- findFreqTerms(x = sms_dtm, lowfreq = min_freq)


#Filter the DTM
sms_dtm_freq_train <- sms_dtm_train[ , freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , freq_words]

dim(sms_dtm_freq_train)

## [1] 4457 1268

# Convert term counts to a categorical "Yes"/"No" presence indicator
convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
                   convert_values)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
                  convert_values)
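
The conversion step replaces term counts with a presence indicator, so the classifier sees each word as a categorical feature rather than a number. A small sketch on a made-up 2 x 2 count matrix (the matrix and its names are illustrative only):

```r
# Same conversion as above: any positive count becomes "Yes"
convert_values <- function(x) ifelse(x > 0, "Yes", "No")

# Made-up counts: doc1 contains "free" twice, doc2 contains "ok" once
toy_counts <- matrix(
  c(2, 0, 0, 1),
  nrow = 2,
  dimnames = list(c("doc1", "doc2"), c("free", "ok"))
)

# Apply column by column, as in the report
toy_flags <- apply(toy_counts, MARGIN = 2, convert_values)
toy_flags
```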

Naive Bayes Model Training

#Create model from the training dataset
sms_classifier <- naiveBayes(sms_train, sms_train_labels)

#Make predictions on test set
sms_test_pred <- predict(sms_classifier, sms_test)
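
To show what the classifier computes, the following base R sketch applies Bayes' rule by hand to a made-up training set with a single "Yes"/"No" feature. It illustrates the principle only and is not the e1071 implementation. The real model multiplies such conditional probabilities over all 1268 retained terms.

```r
# Made-up training data: whether the word "free" is present, and the label
train <- data.frame(
  free  = c("Yes", "Yes", "No", "No", "No", "Yes"),
  label = c("spam", "spam", "ham", "ham", "ham", "ham")
)

# Class priors P(class)
prior <- prop.table(table(train$label))

# Conditional probabilities P(free | class): rows are classes, columns are feature values
cond <- prop.table(table(train$label, train$free), margin = 1)

# Posterior for a new message that contains "free":
# P(class | free = Yes) is proportional to P(class) * P(free = Yes | class)
score <- prior * cond[, "Yes"]
posterior <- score / sum(score)
posterior
```

With these toy counts the message is assigned to spam, because the spam posterior (2/3) exceeds the ham posterior (1/3).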

#Create confusion matrix
confusionMatrix(data = sms_test_pred, reference = sms_test_labels,
                positive = "spam", dnn = c("Prediction", "Actual"))

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  965   16
##       spam   5  129
##                                           
##                Accuracy : 0.9812          
##                  95% CI : (0.9714, 0.9883)
##     No Information Rate : 0.87            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.914           
##                                           
##  Mcnemar's Test P-Value : 0.0291          
##                                           
##             Sensitivity : 0.8897          
##             Specificity : 0.9948          
##          Pos Pred Value : 0.9627          
##          Neg Pred Value : 0.9837          
##              Prevalence : 0.1300          
##          Detection Rate : 0.1157          
##    Detection Prevalence : 0.1202          
##       Balanced Accuracy : 0.9423          
##                                           
##        'Positive' Class : spam            
##
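
The headline statistics can be reproduced from the four cell counts of the confusion matrix, with spam as the positive class. The sketch below recomputes them in base R. The variable names are illustrative and the counts are copied from the output above.

```r
# Cell counts from the confusion matrix (positive class: spam)
tp <- 129  # spam predicted as spam
fn <- 16   # spam predicted as ham
fp <- 5    # ham predicted as spam
tn <- 965  # ham predicted as ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of spam that is caught
specificity <- tn / (tn + fp)   # share of ham that is left alone
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```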

Conclusion

The confusion matrix shows that the model misclassified 5 ham messages as spam and 16 spam messages as ham. The overall accuracy on the 1,115 test messages is 98.12%.

Project -4 Text Mining

Arun Reddy

April 14, 2019