This document is a deliverable for the Coursera case study in the course Practical Machine Learning of the Data Science Program.
Our task is to develop a classification model using the SMS Spam Collection Dataset to distinguish between legitimate ("legit") and spam SMS messages. The dataset contains 5,572 English messages, with 86.59% labeled as legit and 13.41% as spam. The primary evaluation criterion is the average accuracy across both classes.
Goal: Predict whether a Short Message Service (SMS) message is legit or spam, so that spam messages can be filtered out.
In this section we analyze the SMS collection data and drill down into the factors that distinguish a spam message from a legit one.
To give you an idea, here are some example tagged messages:
| Legit SMS Message | Spam SMS Message |
|---|---|
| I got another job! The one at the hospital, doing data analysis or something, starts on Monday! Not sure when my thesis will finish. | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
As observed, words such as "free" and "text" appear frequently in messages tagged as spam. This can be associated with spam that entices the subscriber to interact with the spammer. Notice also that one characteristic of a legit SMS is that it often contains dates, such as the name of a day of the week, which do not typically appear in spam messages.
Our purpose is to create a classification model that is able to learn the patterns that determine whether a message is spam or not.
We start by loading the required libraries and the SMS collection data, then check the class distribution.
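The original chunk for this step is not shown in the output, so the following is a minimal sketch; the file name spam.csv is an assumption (the Class and Text columns are used throughout the rest of the analysis) and should be adjusted to your local copy of the dataset:

# Text mining and modeling libraries used throughout this analysis
library(tm)            # corpus handling and text cleaning
library(SnowballC)     # stemming backend used by stemDocument()
library(randomForest)  # Random Forest classifier
library(e1071)         # Naive Bayes classifier
library(caret)         # trainControl() and confusionMatrix()

# Assumed file name for the SMS Spam Collection data
spam <- read.csv("spam.csv", stringsAsFactors = FALSE)
spam$Class <- factor(spam$Class)
table(spam$Class)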
##
## legit spam
## 4825 747
This section cleans the text data: removing punctuation, whitespace, numbers, and other unnecessary text, and then standardizing everything into a unified format.
As an initial step, we create a corpus, which is a collection of texts. In our case, the corpus is the collection of gathered SMS messages.
corpus = VCorpus(VectorSource(spam$Text))
as.character(corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
# Normalize the encoding to UTF-8, replacing invalid bytes
spam_corpus = tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# Convert all text to lowercase
spam_corpus = tm_map(spam_corpus, content_transformer(tolower))
# Remove digits
spam_corpus = tm_map(spam_corpus, removeNumbers)
# Remove punctuation marks
spam_corpus = tm_map(spam_corpus, removePunctuation)
# Remove common English stopwords (e.g., "the", "and")
spam_corpus = tm_map(spam_corpus, removeWords, stopwords("english"))
# Collapse repeated whitespace
spam_corpus = tm_map(spam_corpus, stripWhitespace)
# Reduce words to their stems (e.g., "crazy" -> "crazi")
spam_corpus = tm_map(spam_corpus, stemDocument)
as.character(spam_corpus[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
The created corpus consists of 5,572 text messages. To further analyze the data, the messages are standardized to lowercase and split into individual words, so that "Hello!", "Hello", "HELLO", and "hello" are counted as one word.
The data is then tokenized into a document-term matrix, where each row represents a message and each column the occurrences of a term.
# Tokenize the corpus into a document-term matrix
spam_dtm <- DocumentTermMatrix(spam_corpus)
# Drop very sparse terms (those absent from more than 99.9% of messages)
spam_dtm <- removeSparseTerms(spam_dtm, 0.999)
inspect(spam_dtm[40:50, 10:15])
## <<DocumentTermMatrix (documents: 11, terms: 6)>>
## Non-/sparse entries: 0/66
## Sparsity : 100%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs add address admir adult advanc aft
## 40 0 0 0 0 0 0
## 41 0 0 0 0 0 0
## 42 0 0 0 0 0 0
## 43 0 0 0 0 0 0
## 44 0 0 0 0 0 0
## 45 0 0 0 0 0 0
## 46 0 0 0 0 0 0
## 47 0 0 0 0 0 0
## 48 0 0 0 0 0 0
## 49 0 0 0 0 0 0
## 50 0 0 0 0 0 0
Next, we compute word frequencies to determine the top five most frequently used words in the SMS collection.
# Create the Word Frequency
word_freq <- sort(colSums(as.matrix(spam_dtm)), decreasing = TRUE)
words <- data.frame(word=names(word_freq), freq=word_freq)
head(words,5)
## word freq
## call call 657
## now now 479
## get get 451
## can can 405
## will will 389
After cleaning and analyzing the data, we prepare training and test sets to be fed into model training.
We then append Class as our response variable to the document-term matrix.
spam_dtm$Class <- spam$Class
str(spam_dtm)
## List of 7
## $ i : int [1:34560] 1 1 1 1 1 1 1 1 1 2 ...
## $ j : int [1:34560] 71 131 173 222 417 420 771 1132 1178 518 ...
## $ v : num [1:34560] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 5572
## $ ncol : int 1209
## $ dimnames:List of 2
## ..$ Docs : chr [1:5572] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:1209] "abiola" "abl" "abt" "accept" ...
## $ Class : Factor w/ 2 levels "legit","spam": 1 1 2 1 1 2 1 1 2 2 ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
# Define the count function, which converts term counts into a Yes/No factor
count <- function(x) {
  factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
}
# Apply the count function over every term column to build the categorical dataset
dataset <- apply(spam_dtm, 2, count)
sms_data = as.data.frame(as.matrix(dataset))
# Attach the response variable
sms_data$Class <- spam$Class
From this, we split the SMS collection data into 70% training data and 30% test data. The model will learn from the training partition and will be evaluated on the test partition.
# Create Training and Test Data
set.seed(123)
train_ind = sample(1:nrow(sms_data), size = floor(0.7*(nrow(sms_data))))
train_sms = sms_data[train_ind, ]
test_sms = sms_data[-train_ind, ]
prop.table(table(train_sms$Class))
##
## legit spam
## 0.8687179 0.1312821
prop.table(table(test_sms$Class))
##
## legit spam
## 0.8594498 0.1405502
We build our models using two machine learning algorithms:
A. Random Forest
set.seed(123)
# Fit a Random Forest on all term features (column 1210 holds the Class label)
rf_classifier = randomForest(x = train_sms[-1210],
                             y = train_sms$Class,
                             ntree = 100)
rf_classifier
##
## Call:
## randomForest(x = train_sms[-1210], y = train_sms$Class, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 34
##
## OOB estimate of error rate: 2.59%
## Confusion matrix:
## legit spam class.error
## legit 3375 13 0.003837072
## spam 88 424 0.171875000
# Predicting the Test set results
rf_pred = predict(rf_classifier, newdata = test_sms[-1210])
confusionMatrix(table(rf_pred,test_sms$Class))
## Confusion Matrix and Statistics
##
##
## rf_pred legit spam
## legit 1428 46
## spam 9 189
##
## Accuracy : 0.9671
## 95% CI : (0.9574, 0.9751)
## No Information Rate : 0.8594
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8542
##
## Mcnemar's Test P-Value : 1.208e-06
##
## Sensitivity : 0.9937
## Specificity : 0.8043
## Pos Pred Value : 0.9688
## Neg Pred Value : 0.9545
## Prevalence : 0.8594
## Detection Rate : 0.8541
## Detection Prevalence : 0.8816
## Balanced Accuracy : 0.8990
##
## 'Positive' Class : legit
##
This Random Forest model is built from 100 decision trees, with 34 variables considered at each split, the default mtry for classification, floor(sqrt(p)) with p = 1,209 predictors.
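As a quick sanity check on where 34 comes from, using the predictor count from the str() output above:

# randomForest's default mtry for classification: floor(sqrt(number of predictors))
floor(sqrt(1209))   # 34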
The trained Random Forest model achieved an accuracy of 96.71% on the test set, significantly outperforming the no-information-rate baseline of 85.94%. The Kappa statistic, a measure of agreement between predicted and actual classes, was 0.8542, indicating substantial agreement. The model demonstrated a high sensitivity of 99.37% on legit messages (the positive class) but a lower specificity of 80.43% on spam, giving a balanced accuracy of 89.90%, the average per-class accuracy that serves as our primary evaluation criterion.
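The balanced accuracy can be reproduced by hand from the confusion-matrix counts above:

# Per-class accuracies from the Random Forest test confusion matrix
sens <- 1428 / (1428 + 9)   # legit recall: 0.9937
spec <- 189 / (189 + 46)    # spam recall: 0.8043
(sens + spec) / 2           # balanced accuracy: 0.8990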
B. Naive Bayes
# Using Naive Bayes with Laplace smoothing
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Note: trControl and tuneLength are caret::train() arguments; e1071::naiveBayes()
# silently ignores them, so no cross-validation actually takes place here.
# Note also that train_sms still contains the Class column, so the label itself
# is among the predictors (see the caveat after the results below).
system.time( classifier_nb <- naiveBayes(train_sms, train_sms$Class, laplace = 1,
                                         trControl = control, tuneLength = 7) )
## user system elapsed
## 0.336 0.047 0.403
naive_pred = predict(classifier_nb, type = 'class', newdata = test_sms)
confusionMatrix(naive_pred,test_sms$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction legit spam
## legit 1434 0
## spam 3 235
##
## Accuracy : 0.9982
## 95% CI : (0.9948, 0.9996)
## No Information Rate : 0.8594
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9926
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9979
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9874
## Prevalence : 0.8594
## Detection Rate : 0.8577
## Detection Prevalence : 0.8577
## Balanced Accuracy : 0.9990
##
## 'Positive' Class : legit
##
The Naive Bayes model was trained with Laplace smoothing (laplace = 1) in a fraction of a second (about 0.4 seconds elapsed); as noted in the code comments, the repeated cross-validation control object is ignored by e1071::naiveBayes, so no cross-validation actually took place. The model achieved remarkably high performance on the test set, with an accuracy of 99.82%. It correctly classified 1,434 out of 1,437 legitimate messages (a sensitivity of 99.79%) and all 235 spam messages (a specificity of 100%). The Kappa statistic was 0.9926, indicating excellent agreement beyond chance. These near-perfect results should be read with caution, however: because train_sms and test_sms still contain the Class column, the label itself sits among the predictors, which likely inflates the reported performance.
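A minimal leak-free sketch would drop the Class column (column 1210) before fitting and predicting; the metrics it reports would be expected to fall below the figures above:

# Naive Bayes without the Class column among the predictors
classifier_nb2 <- naiveBayes(train_sms[-1210], train_sms$Class, laplace = 1)
naive_pred2 <- predict(classifier_nb2, newdata = test_sms[-1210])
confusionMatrix(table(naive_pred2, test_sms$Class))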
In the subsequent sections, we explore the performance of different classifiers, namely Decision Tree, Naïve Bayes, K Nearest Neighbors (KNN), and Deep Artificial Neural Network (ANN), across the various datasets.
This examination offers insight into how each classifier operates within the context of the datasets provided. The results and interpretations of these analyses are presented in detail within this R Markdown document.
In scenarios where distinguishing attributes are linearly separable, the most effective machine learning algorithm among the provided options would likely be the Decision Tree classifier. Decision Trees are well-suited for capturing linear separability in data, as they recursively partition the feature space based on the attributes, leading to regions where classes are separated by hyperplanes.
While all the mentioned classifiers have their merits and can handle various types of data, Decision Trees excel at capturing simple linear relationships due to their ability to create decision boundaries aligned with the axes. They can effectively model the data's linear separability and produce interpretable results. Naïve Bayes can also be considered for this type of scenario, with similar behavior.
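As a minimal sketch on hypothetical toy data (not the SMS dataset), a Decision Tree fit with rpart approximates a linear class boundary with axis-aligned splits:

library(rpart)
set.seed(1)
# Hypothetical linearly separable toy data: class A when x1 + x2 > 1
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$class <- factor(ifelse(toy$x1 + toy$x2 > 1, "A", "B"))
fit <- rpart(class ~ x1 + x2, data = toy)
fit  # each printed split is an axis-aligned threshold on x1 or x2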
In scenarios where distinguishing attributes are linearly separable and slightly more scattered, the Decision Tree classifier would likely be a strong choice among the provided classifiers.
Decision Trees work well with non-linear relationships and can handle missing values. In cases where linear separability is present but the attributes are scattered, Decision Trees can effectively identify local patterns and decision boundaries. They do not assume any specific underlying data distribution and can capture intricate relationships.
While Naïve Bayes and KNN could also handle scattered data to some extent, Decision Trees' ability to adapt to varying densities and nonlinear boundaries makes them a suitable option. A Deep ANN might be overkill if linear separability is still preserved.
In scenarios with distinguishing attributes that have varying class distribution proportions (set 1 has class A with 60% of records filled with 1 and class B with 40%, while set 2 has class A with 40% and class B with 60%), Naïve Bayes would be a suitable choice among the provided classifiers.
Naïve Bayes is particularly robust to imbalanced class distributions, making it well-suited for scenarios where the class proportions differ significantly. It calculates conditional probabilities independently for each attribute given the class, which can help accommodate varying class distributions.
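A toy illustration of this mechanism, with hypothetical likelihood numbers unrelated to the datasets above, shows how class priors shift the Naïve Bayes posterior:

# Fixed per-class likelihoods P(x | class) for some observation x
lik_A <- 0.30
lik_B <- 0.25
posterior <- function(prior_A) {
  unnorm <- c(A = prior_A * lik_A, B = (1 - prior_A) * lik_B)
  unnorm / sum(unnorm)   # normalized posterior P(class | x)
}
posterior(0.6)  # set 1: class A is the majority -> A is favored
posterior(0.4)  # set 2: class B is the majority -> the balance shifts toward B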
Decision Trees, K Nearest Neighbors, and Deep ANN might perform relatively well too, but Naïve Bayes would likely handle the imbalanced class proportions more effectively due to its probabilistic nature and the way it estimates class probabilities.
In scenarios where class A and class B are organized in alternating 4x4 cohorts along the y-axis (Attribute X2) and x-axis (Attribute X1), the Decision Tree would likely be a suitable choice among the provided classifiers.
Decision Trees are well-equipped to capture spatial relationships and boundaries in the data, especially when there are structured patterns like alternating cohorts. The branching nature of Decision Trees allows them to segment the feature space based on such patterns, making them effective for capturing the organized arrangement described above.
While K Nearest Neighbors and Deep ANN could potentially handle such patterns, Decision Trees are known for their ability to create hierarchical decision boundaries that align well with structured data layouts. Naïve Bayes may not be the best choice in this case, as its assumption of attribute independence might not align with the organized pattern of the cohorts.
In scenarios where class A is slightly above and to the left of class B along the y-axis (Attribute X2) and x-axis (Attribute X1), K Nearest Neighbors (KNN) would likely be a strong choice among the provided classifiers.
KNN can effectively capture spatial relationships and identify local patterns in the data. Since the classes have a specific spatial arrangement, KNN’s ability to find nearest neighbors based on distance would enable it to accurately classify points within similar regions.
Decision Trees and Deep ANN might be able to model such relationships as well, but KNN’s focus on proximity-based classification makes it particularly well-suited for this kind of data.
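A minimal sketch with hypothetical toy data (class A shifted up and to the left of class B) using class::knn:

library(class)
set.seed(1)
# Hypothetical toy data: class A centered up-left, class B centered down-right
a <- cbind(x1 = rnorm(100, -0.5, 0.5), x2 = rnorm(100,  0.5, 0.5))
b <- cbind(x1 = rnorm(100,  0.5, 0.5), x2 = rnorm(100, -0.5, 0.5))
train  <- rbind(a, b)
labels <- factor(rep(c("A", "B"), each = 100))
# A point in the up-left region is classified by its 5 nearest neighbors
knn(train, test = c(-0.4, 0.6), cl = labels, k = 5)  # most likely "A"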
In scenarios where part of Class B is inscribed within the circle of Class A and the remaining part of Class B is clustered outside the union of the two circles, a Deep Artificial Neural Network (ANN) would likely be the most suitable choice among the provided classifiers.
A deep ANN excels at learning complex, non-linear decision boundaries. It would be able to effectively capture the spatial separation between the two classes, especially considering that a portion of Class B lies inside the circle of Class A, a pattern that no single linear boundary can express.
Decision Trees could potentially model this scenario as well, and KNN's inherent ability to consider spatial relationships would also let it predict reasonably in this scenario.