Data 607 Project 4 - Document Classifier
yina qiao
2023-04-28

Introduction

Wouldn't it be nice to automatically classify a text message as spam or ham (a legitimate message)? This project aims to do exactly that. We will use the SMS Spam Collection dataset to build a document classifier with the Naive Bayes algorithm, use it to predict the class of new messages in a held-out test set, and then evaluate the model's performance with a confusion matrix.

Library

library(tm)
## Loading required package: NLP
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: lattice
library(e1071)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer

Importing Data

The dataset can be downloaded from Kaggle (see References).

The dataset contains the text of SMS messages together with their labels, spam or ham.

sms_raw <- read.csv("https://raw.githubusercontent.com/yinaS1234/data-607/main/project%204/spamham.csv", header = FALSE, stringsAsFactors = FALSE)

sms_raw <- sms_raw[,1:2]
colnames(sms_raw) <- c("Type", "Text")
sms_raw$Type <- factor(sms_raw$Type)

# Check the structure of the dataset
str(sms_raw)
## 'data.frame':    5573 obs. of  2 variables:
##  $ Type: Factor w/ 3 levels "Category","ham",..: 1 2 2 3 2 2 3 2 2 3 ...
##  $ Text: chr  "Message" "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ ...
# Check the number of spam and ham messages
table(sms_raw$Type)
## 
## Category      ham     spam 
##        1     4825      747
prop.table(table(sms_raw$Type))
## 
##     Category          ham         spam 
## 0.0001794366 0.8657814463 0.1340391172

About 86.6% of the messages are ham while the remaining 13.4% are spam. Note the stray Category level with a single observation: because the file was read with header = FALSE, the CSV's header row ("Category", "Message") was kept as a data row. With one row out of 5,573 it is harmless, but it explains the extra factor level that shows up in the outputs below.
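If we wanted to remove that artifact, we could drop the row and its empty factor level. A minimal sketch (not applied here, so the outputs below keep all 5,573 rows):

sms_raw_fixed <- sms_raw[-1, ] # drop the header row that was read as data
sms_raw_fixed$Type <- droplevels(sms_raw_fixed$Type) # drop the now-empty "Category" level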

Data Cleaning - Creating a Corpus

Now that the tm package is installed and loaded, we can create a corpus from the text data. A corpus is a collection of text documents. We clean it in five steps:

- Use the VCorpus() function to create a volatile corpus.
- Transform all text to lower case and remove punctuation and numbers.
- Remove stopwords.
- Trim words to their root form through stemming with stemDocument().
- Remove extra white space.

# Create corpus
sms_corpus <- VCorpus(x = VectorSource(sms_raw$Text))

# Print corpus
sms_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5573
# Check the text in some messages and their type
lapply(sms_corpus[5:8], as.character)
## $`5`
## [1] "U dun say so early hor... U c already then say..."
## 
## $`6`
## [1] "Nah I don't think he goes to usf, he lives around here though"
## 
## $`7`
## [1] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
## 
## $`8`
## [1] "Even my brother is not like to speak with me. They treat me like aids patent."
sms_raw$Type[5:8]
## [1] ham  ham  spam ham 
## Levels: Category ham spam
corpus_clean <- sms_corpus

# Remove Numbers
corpus_clean <- tm_map(x = corpus_clean, FUN = removeNumbers)

# Transform all letters to lower case
corpus_clean <- tm_map(x = corpus_clean, content_transformer(tolower))


# Remove punctuation
corpus_clean <- tm_map(x = corpus_clean, FUN = removePunctuation)

# Print stopwords()
stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
# Remove stop words
corpus_clean <- tm_map(x = corpus_clean, FUN = removeWords, stopwords())

# Test the function
wordStem(words = c("cooks", "cooking", "cooked"))
## [1] "cook" "cook" "cook"
# Stem words in corpus
corpus_clean <- tm_map(x = corpus_clean, FUN = stemDocument)

# Remove extra white spaces
corpus_clean <- tm_map(x = corpus_clean, FUN = stripWhitespace)
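As a quick sanity check, we can look at the same messages after cleaning and compare them with the raw text shown earlier (a sketch; output not shown):

# Spot-check messages 5-8 after cleaning
lapply(corpus_clean[5:8], as.character)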

Create a Document Term Matrix (DTM)

A DTM is a matrix in which each row represents a document and each column a term (word); each cell holds the frequency of that term in that document.

# Create Document Term Matrix
DTM <- DocumentTermMatrix(x = corpus_clean)

DTM
## <<DocumentTermMatrix (documents: 5573, terms: 6956)>>
## Non-/sparse entries: 43715/38722073
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
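To make the layout concrete, inspect() prints a dense preview of a slice of the matrix (a sketch; the terms shown depend on the cleaned vocabulary):

# Peek at a small corner of the DTM: 5 documents x 6 terms
inspect(DTM[1:5, 10:15])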

Data Exploration with word clouds

Word clouds provide a very intuitive way to visualize the frequency of words in a corpus.

wordcloud(words = corpus_clean, 
          min.freq = 100, # minimum number of times a word must appear to be plotted
          random.order = FALSE, # place the most frequent words in the center of the cloud
          colors = c("#4575b4","#74add1","#abd9e9","#e0f3f8","#fee090","#fdae61","#f46d43","#d73027") # colour gradient for the font
          )
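To compare the two vocabularies, we could also draw one cloud per class; wordcloud() accepts raw text and performs its own basic preprocessing. A minimal sketch (max.words and scale are arbitrary choices):

# Hypothetical per-class word clouds from the raw text
spam_text <- sms_raw$Text[sms_raw$Type == "spam"]
ham_text <- sms_raw$Text[sms_raw$Type == "ham"]
wordcloud(spam_text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham_text, max.words = 40, scale = c(3, 0.5))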

Train/Test Split

Now we split the dataset into training and test sets. We will use the first 80% of the data for training and the remaining 20% for testing.

# Create Training Set
DTM_train <- DTM[1:round(nrow(DTM)*0.80, 0), ]

# Create Test Set
DTM_test <- DTM[(round(nrow(DTM)*0.80, 0)+1):nrow(DTM), ]

# Create vectors with labels for the training and test set
train_labels <- sms_raw[1:round(nrow(sms_raw)*0.80, 0), ]$Type
test_labels <- sms_raw[(round(nrow(sms_raw)*0.80, 0)+1):nrow(sms_raw), ]$Type

# Check that the proportions of ham and spam are similar in the training and test sets
prop.table(table(train_labels))
## train_labels
##     Category          ham         spam 
## 0.0002243158 0.8647375505 0.1350381337
prop.table(table(test_labels))
## test_labels
##  Category       ham      spam 
## 0.0000000 0.8699552 0.1300448
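The class proportions are similar in both sets, so the sequential split is serviceable here. If the file were ordered by class, a stratified random split would be safer; a sketch using caret's createDataPartition() (the seed is an arbitrary choice):

# Hypothetical stratified random split as an alternative
set.seed(123) # arbitrary seed for reproducibility
train_idx <- as.integer(createDataPartition(sms_raw$Type, p = 0.80, list = FALSE))
test_idx <- setdiff(seq_len(nrow(sms_raw)), train_idx)
DTM_train_random <- DTM[train_idx, ]
DTM_test_random <- DTM[test_idx, ]
train_labels_random <- sms_raw$Type[train_idx]
test_labels_random <- sms_raw$Type[test_idx]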

Model Development

Feature Selection - Most Frequent Words

Our document-term matrix has 6,956 features. This is far too many to feed into our Naive Bayes model, so we reduce the feature set by keeping only the most frequent words.

# Filter out words whose total count is below 0.1% of the number of messages
threshold <- 0.1 # in %
min_freq <- round(DTM$nrow*(threshold/100), 0) # calculate minimum frequency
min_freq
## [1] 6
# Create vector of most frequent words
frequent_words <- findFreqTerms(x = DTM, lowfreq = min_freq)

str(frequent_words)
##  chr [1:1251] "‘ll" "abiola" "abl" "abt" "accept" "access" "account" ...
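Note the stray token "‘ll": removePunctuation() strips only ASCII punctuation by default, so the Unicode curly apostrophe survived cleaning. tm can remove Unicode punctuation as well via the ucp flag (a sketch, not applied above):

# Hypothetical Unicode-aware punctuation removal
corpus_clean_ucp <- tm_map(corpus_clean, removePunctuation, ucp = TRUE)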
# Filter DTM to only have most frequent words
DTM_train_most_frequent <- DTM_train[, frequent_words]
DTM_test_most_frequent <- DTM_test[, frequent_words]

# Check dimension of DTM
dim(DTM_train_most_frequent)
## [1] 4458 1251
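An alternative way to prune rare terms is tm's removeSparseTerms(), which filters by document frequency (the share of documents containing a term) rather than total count. A sketch (sparse = 0.999 keeps terms that appear in more than 0.1% of messages):

# Hypothetical pruning by sparsity instead of total frequency
DTM_dense <- removeSparseTerms(DTM, sparse = 0.999)
dim(DTM_dense)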

Convert numeric values into categorical values

e1071's naiveBayes() treats numeric predictors as Gaussian, which is a poor fit for sparse word counts, so we convert the numeric features in our Document Term Matrix (DTM) into categorical features that record whether each word is present.

# Create a function that converts counts to "Yes" or "No" depending on whether the word is present in the document
is_present <- function(x) {
  ifelse(test = x > 0, yes = "Yes", no = "No")
}

# Test function
x <- is_present(c(1, 0, 3, 4, 0, 0))
x
## [1] "Yes" "No"  "Yes" "Yes" "No"  "No"
# Apply is_present() function to training and test DTM
DTM_train_most_frequent <- apply(X = DTM_train_most_frequent, 
                                 MARGIN = 2, # Apply function to columns
                                 FUN = is_present) # Specify function to be used

DTM_test_most_frequent <- apply(X = DTM_test_most_frequent, 
                                 MARGIN = 2, # Apply function to columns
                                 FUN = is_present) # Specify function to be used

Train Model

# Create model from the training dataset
spam_classifier <- naiveBayes(x = DTM_train_most_frequent, y = train_labels)

# Print probability tables for some words
spam_classifier$tables$call
##             call
## train_labels         No        Yes
##     Category 1.00000000 0.00000000
##     ham      0.94422827 0.05577173
##     spam     0.57308970 0.42691030
spam_classifier$tables$friend
##             friend
## train_labels         No        Yes
##     Category 1.00000000 0.00000000
##     ham      0.98184176 0.01815824
##     spam     0.98172757 0.01827243
spam_classifier$tables$free
##             free
## train_labels         No        Yes
##     Category 1.00000000 0.00000000
##     ham      0.98806744 0.01193256
##     spam     0.77574751 0.22425249

We can see from the probability tables above that the words "call" and "free" are several times more likely to appear in spam than in legitimate messages. The word "friend", on the other hand, is almost equally likely to be found in ham and spam.
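One refinement worth noting: naiveBayes() supports Laplace smoothing, which avoids zero conditional probabilities for words that never occur in one class of the training data. A sketch (laplace = 1 is a common choice; the model above uses the default of 0):

# Hypothetical smoothed variant of the classifier
spam_classifier_smooth <- naiveBayes(x = DTM_train_most_frequent, y = train_labels, laplace = 1)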

Evaluating the performance of our classifier

# Make predictions on test set
test_predictions <- predict(object = spam_classifier, newdata = DTM_test_most_frequent)


# Create confusion matrix
confusionMatrix(data = test_predictions, reference = test_labels, positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Category ham spam
##   Category        0   0    0
##   ham             0 965   16
##   spam            0   5  129
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9812          
##                  95% CI : (0.9714, 0.9883)
##     No Information Rate : 0.87            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.914           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Category Class: ham Class: spam
## Sensitivity                       NA     0.9948      0.8897
## Specificity                        1     0.8897      0.9948
## Pos Pred Value                    NA     0.9837      0.9627
## Neg Pred Value                    NA     0.9627      0.9837
## Prevalence                         0     0.8700      0.1300
## Detection Rate                     0     0.8655      0.1157
## Detection Prevalence               0     0.8798      0.1202
## Balanced Accuracy                 NA     0.9423      0.9423
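Since pROC is loaded but unused above, we could also check the ranking quality of the classifier: predict() returns class posterior probabilities with type = "raw", which is enough for an ROC curve on the spam class. A sketch (the column name "spam" follows from the factor levels):

# Hypothetical ROC analysis of the spam posteriors
pred_probs <- predict(object = spam_classifier, newdata = DTM_test_most_frequent, type = "raw")
roc_spam <- roc(response = as.numeric(test_labels == "spam"), predictor = pred_probs[, "spam"])
auc(roc_spam)
plot(roc_spam)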

Conclusion

We tested the performance of our spam classifier on the held-out test set. As the output above shows, the model reaches an overall accuracy of 98.12%, correctly flagging 129 of the 145 spam messages (sensitivity 88.97%) while misclassifying only 5 of the 970 ham messages.

The Naive Bayes model we built appears able to classify spam and ham messages effectively.

References
1. https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

2. https://www.geeksforgeeks.org/naive-bayes-classifier-in-r-programming/



